# Learning from many trajectories

Journal of Machine Learning Research 25 (2024) 1-109. Submitted 9/23; Revised 6/24; Published 7/24.

Stephen Tu (stephen.tu@usc.edu), University of Southern California, Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, CA 90089, USA
Roy Frostig (frostig@google.com), Google DeepMind, San Francisco, CA 94105, USA
Mahdi Soltanolkotabi (soltanol@usc.edu), University of Southern California, Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, CA 90089, USA

Editor: Daniel Hsu

**Abstract.** We initiate a study of supervised learning from many independent sequences (*trajectories*) of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting: trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given n-dimensional examples produced by m trajectories, each of length T, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely m ≲ n) to many (namely m ≳ n). Specifically, we establish that the worst-case error rate of this problem is Θ(n/(mT)) whenever m ≳ n. Meanwhile, when m ≲ n, we establish a (sharp) lower bound of Ω(n²/(m²T)) on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals.
As a corollary of our analysis, we also improve guarantees for the linear system identification problem.

**Keywords:** learning with dependent data, linear dynamical systems, system identification

© 2024 Stephen Tu, Roy Frostig, and Mahdi Soltanolkotabi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1145.html. (Work performed while author was employed at Google DeepMind.)

## 1. Introduction

Statistical learning theory aims to characterize the worst-case efficiency of learning from example data. Its most common setup assumes that examples are independently and identically distributed (iid) draws from an underlying data distribution, but various branches of theory, not to mention deployed applications of machine learning, consume non-independent data as well. An especially fruitful setting, and the focus of this paper, is learning from sequential data, where examples are generated by some ordered stochastic process that renders them possibly correlated. Naturally, sequential processes describe application domains spanning engineering and the sciences, such as robotics (Nguyen-Tuong and Peters, 2011), data center cooling (e.g. Lazic et al., 2018), language (e.g. Sutskever et al., 2014; Belanger and Kakade, 2015), neuroscience (e.g. Linderman et al., 2017; Glaser et al., 2020), and economic forecasting (McDonald et al., 2017). Learning over sequential data can also capture some formulations of imitation learning (Osa et al., 2018) and reinforcement learning (Chen et al., 2021; Janner et al., 2021).

In supervised learning, one learns to predict output labels from input covariates, given example pairings of the two. Formal treatments of learning from sequential data typically concern a single inter-dependent chain of covariates. Where these treatments vary is in their assumptions about the underlying process that generates the covariate chain.
For instance, some assume that the process is auto-regressive (e.g. Lai and Wei, 1983; Goldenshluger and Zeevi, 2001; González and Rojas, 2020) or ergodic (e.g. Yu, 1994; Duchi et al., 2012). Others assume that it is a linear dynamical system (e.g. Simchowitz et al., 2018; Faradonbeh et al., 2018; Sarkar and Rakhlin, 2019). In this paper, we examine what happens when we learn from many independent chains rather than from one, as one does anyway in many applications (e.g. Pomerleau, 1989; Khansari-Zadeh and Billard, 2011; Brants et al., 2007; Józefowicz et al., 2016). Figure 1 depicts the data dependence structure of our setup in comparison with its two natural counterparts.

Learning from a dataset of many short (constant length) chains ought to be similar to independent learning, even if each chain is highly intra-dependent. On the other hand, for any non-trivial chain length, intuition suggests that the error can degrade relative to the total sample size in the worst case, since a greater proportion of the data may contain correlations. Lower bounds even show that, when one sees only a single chain, this degradation is outright necessary in the worst case (Bresler et al., 2020). Do we see any such effect with many chains?

We study this question by sharply characterizing worst-case error rates of a fundamental task, linear regression, imposed over a general sequential data model. Our findings reveal a remarkable phenomenon: after seeing sufficiently many chains (m) relative to the example dimension n, no matter the chain length T, the error rate matches that of learning from the same total number mT of independent examples, drawn from their respective marginal distributions.

In our data model, each chain, called a trajectory, comprises a sequence of covariates {x_t} generated from a stochastic process. Each covariate is accompanied by a noisy linear response y_t as its label.
A training set {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T} comprises m independent chains, each of length T. From such a training set, an estimator produces a hypothesis that predicts the label of any covariate. The resulting hypothesis is evaluated according to its mean-squared prediction error over a fresh chain of length T′, possibly unequal to T: a notion of risk defined naturally over a trajectory. All of our risk upper bounds are guarantees for the ordinary least-squares estimator in particular.

A concrete, recurring example in this paper takes the covariate-generating process to be a linear dynamical system (LDS). Specifically, fixing matrices A ∈ R^{n×n}, B ∈ R^{n×d}, and W_⋆ ∈ R^{p×n}, a single trajectory {(x_t, y_t)}_{t≥1} is generated as follows. Let x_0 = 0, and for t ≥ 1 consider the process:

    x_t = A x_{t−1} + B w_t,    (linear dynamics)
    y_t = W_⋆ x_t + ξ_t,        (linear regression)

where the {w_t}_{t≥1} are iid centered isotropic Gaussian draws and {ξ_t}_{t≥1} is a sub-Gaussian martingale difference sequence (with respect to past covariates {x_k}_{k=1}^{t} and noise variables {ξ_k}_{k=1}^{t−1}). Incidentally, combining linear dynamical systems with linear regression captures the basic problem of linear system identification (as in Simchowitz et al. (2018)) as a special case. In other instantiations of learning from trajectories, the covariates {x_t} may be generated by a different process; what remains common is the superimposed regression task set up by the ground truth W_⋆ and the noise {ξ_t}. The key condition that we will introduce, which renders a covariate process amenable to regression, is that it satisfies a trajectory small-ball criterion (Definition 4.1). Section 4.1 shows that LDS-generated data conforms to the trajectory small-ball condition in particular, as do many other distributions.

Figure 1: The covariate dependence structure induced by three data models on mT many training examples: (a) independent covariates, (b) one long trajectory, (c) independent trajectories. In (a): independent examples, typical of basic statistical learning. In (b): the data model often considered in the sequential learning literature, comprising a long auto-correlated chain of examples. Learning in this setting can be infeasible in general, so oftentimes ergodicity is assumed in order to rule out strong long-range dependencies, essentially inducing an independent resetting effect across time. The effective reset frequency then factors uniformly into error bounds, in a way suggesting that one learns only one independent example's worth within each effective reset window (cf. Section 2). In (c): our multi-trajectory data model. Our accompanying assumptions allow for non-ergodic chains, and for arbitrary chain lengths T, while introducing explicit independent resets. Decoupling the m resets from the sequential data model lets us vary the training set dimensions (m, T) freely, without affecting other data assumptions, as we study their effect on error rates. We find that with enough trajectories m, the worst-case error rate behaves the same as in the independent setting depicted in (a); i.e., one learns as though every example were independently drawn from its marginal distribution. Some recent work in system identification assumes a data model related to ours (specifically linear dynamical data) and likewise avoids ergodicity; our bounds improve these guarantees T-fold where applicable, and upgrade the regimes in which they apply (cf. Section 2).
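To make the multi-trajectory LDS data model concrete, the following is a minimal simulation sketch. It is our own illustration, not code from the paper: the function name, the Gaussian choice for the noise ξ_t (one valid sub-Gaussian MDS), and the identity-matrix example system are all assumptions made here for demonstration.

```python
import numpy as np

def simulate_lds_trajectories(A, B, W_star, m, T, noise_std=0.1, seed=0):
    """Draw m independent trajectories of length T from the LDS data model:
    x_t = A x_{t-1} + B w_t (with x_0 = 0), y_t = W_star x_t + xi_t."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    p = W_star.shape[0]
    X = np.zeros((m, T, n))
    Y = np.zeros((m, T, p))
    for i in range(m):
        x = np.zeros(n)
        for t in range(T):
            x = A @ x + B @ rng.standard_normal(d)   # linear dynamics
            xi = noise_std * rng.standard_normal(p)  # Gaussian noise (sub-Gaussian MDS)
            X[i, t] = x
            Y[i, t] = W_star @ x + xi
    return X, Y

# A marginally unstable example: A = I has spectral radius exactly 1 (non-ergodic).
n, d, p = 4, 4, 2
A = np.eye(n)
B = np.eye(n)
W_star = np.ones((p, n))
X, Y = simulate_lds_trajectories(A, B, W_star, m=50, T=20)
```

Because ρ(A) = 1 here, each trajectory is a random walk whose covariate magnitude grows with t; the independent resets across the m trajectories are what keep the dataset well behaved.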
Our main results (Sections 5 and 6) sharply characterize worst-case rates of learning from trajectory data as a function of the training trajectory count m, the training trajectory length T, the evaluation length T′, the covariate and response dimensions n and p, and scale parameters of noise in the data model (such as the variance of the noise {ξ_t}). Restricting only to terms of covariate dimension n, training set size m and T, and evaluation length T′, our bounds imply the following summary statement:

Theorem 1.1 (informal; error rate with many small-ball trajectories, T′ ≤ T). If m ≳ n, T′ ≤ T, and covariate trajectories are drawn from a trajectory small-ball distribution, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ(n/(mT)).

In drawing comparisons to learning from independent examples, it makes sense to consider training and evaluation lengths T and T′ equal (cf. Section 3), rendering Theorem 1.1 applicable. The theorem thus echoes our main point above: the same rate of Θ(n/(mT)) describes regression on mT independent examples (details on this point are expanded in Section 3.3). Further structural assumptions are needed (cf. Section 3.3) in order to cover the remaining range of problem dimensions, namely few trajectories (m ≲ n) or extended evaluations (T′ > T), and to that end we return to linear dynamical systems as a focus.

Our remaining risk upper bounds, targeting learning under linear dynamics, require that the dynamics matrix A be marginally unstable (meaning that its spectral radius ρ(A) is at most one) and diagonalizable. When trajectories are longer at test time than during training (i.e., T′ > T), marginal instability is practically necessary; otherwise the risk can scale exponentially in T′ − T. The assumption otherwise still allows for unstable and therefore non-ergodic systems at ρ(A) = 1.
For simplicity, we also require that the control matrix B have full row rank. Our bounds then imply the following summary statement about regression when the number of trajectories is limited:

Theorem 1.2 (informal; error rate with few LDS trajectories). If m ≲ n, mT ≳ n, and covariate trajectories are drawn from a linear dynamical system whose dynamics A are marginally unstable and diagonalizable, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ((n/(mT)) · max{nT′/(mT), 1}).¹

If the evaluation horizon T′ is a constant, the rate in Theorem 1.2 recovers that of Theorem 1.1, up to log factors and extra assumptions. To compare the theorems further, consider equal training and evaluation horizons (T′ = T). In this setting, the rate in Theorem 1.2 is weaker than that of Theorem 1.1, by up to a factor of the covariate dimension n, so it may seem that Theorem 1.2 establishes a separation between few- and many-trajectory learning. However, the varying premises of many vs. few trajectories constrain the risk definitions to differ: under a fixed data budget N := mT, fewer trajectories m imply a longer horizon T over which the risk is evaluated. Intuitively, a longer evaluation horizon makes for a different problem, and renders the rate comparison invalid.

A more sound comparison across regimes is possible by first normalizing the notion of performance within a problem instance. To this end, we can consider the worst-case risk of learning from trajectories relative to that of learning from independent examples in the same regime. Constructing the latter baseline is somewhat subtle (cf. Section 3.2). To decorrelate the problem of learning from trajectories while maintaining its temporal structure otherwise, we can imagine drawing from its marginal distributions independently at each time step. The resulting dataset is independent, but not identically distributed. Although the rates for the sequential and decorrelated regression problems are, as already highlighted, remarkably the same under many trajectories, the few-trajectory rate in Theorem 1.2 is indeed weaker than the Θ(n/(mT)) rate that we prove for its decorrelated baseline (cf. Theorem 5.7).

Since the more general Theorem 1.1 already describes what happens under many trajectories (m ≳ n) and a strict evaluation horizon (T′ ≤ T), what remains is a somewhat niche regime: many trajectories and an extended evaluation horizon T′ > T. For completeness, our bounds supply the following summary statement:

Theorem 1.3 (informal; error rate with many LDS trajectories). If m ≳ n and covariate trajectories are drawn from a linear dynamical system whose dynamics A are marginally unstable and diagonalizable, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ((n/(mT)) · max{T′/T, 1}).

Using the tools of our analysis, we also develop upper bounds for parameter error instead of prediction risk, which inform recovery of the ground truth W_⋆ and (by reduction) of the dynamics matrix A in LDS. The latter captures the linear system identification problem. Our upper bounds improve on its worst-case guarantees T-fold where applicable, and extend the parameter ranges in which guarantees hold at all.

1. In proving the lower bound in Theorem 1.2, we construct a hard instance by decoupling the martingale difference noise {ξ_t} from the covariate-generating process (cf. Definition 7.1). Technically, this excludes the lower bound from applying directly to the linear system identification problem. Resolving this discrepancy is a question that remains open for future work.

## 2. Related work

Linear regression is a basic and well-studied problem.
The two treatments most closely related to our work are Hsu et al. (2014) and Mourtada (2022), who develop sharp finite-sample characterizations of the risk of random design linear regression (i.e., from iid examples). Discussion and references therein cover the broader problem over its long history.

A common approach to studying dependent covariates is to assume that the data-generating process is ergodic (see e.g. Yu, 1994; Meir, 2000; Mohri and Rostamizadeh, 2008; Steinwart and Christmann, 2009; Mohri and Rostamizadeh, 2010; Duchi et al., 2012; Kuznetsov and Mohri, 2017; McDonald et al., 2017; Shalizi, 2021, and references therein). The key phenomenon at play is that N correlated examples are statistically similar to N/τ_mix independent examples, where τ_mix is the process mixing-time. Relying on this idea, generalization bounds informing independent data can typically be ported to the ergodic setting, where the effective sample size is simply deflated by a factor of τ_mix. Since mixing-based bounds become vacuous as τ_mix → ∞, they do not present an effective strategy for studying dynamics that do not mix. A critical instance of this arises in linear dynamical systems: in LDS, the ergodicity condition amounts to stability of the dynamics matrix A (i.e., ρ(A) < 1), where τ_mix → ∞ as ρ(A) → 1 (e.g. Meyn and Tweedie, 1993, Thm. 17.6.2). Marginally unstable systems, in which ρ(A) = 1, are thus not captured.

A recent line of work uncovers ways to sharpen generalization bounds based on the specific structure of realizable least-squares regression problems over an ergodic trajectory. For realizable linear regression with stationary covariates, results from Bresler et al. (2020) imply that, after the trajectory length exceeds an initial burn-in time scaling as τ_mix·n, the minimax (excess) risk coincides with the classic iid rates.
Additionally, Ziemann and Tu (2022) show that the empirical risk minimizer exhibits similar behavior in realizable non-parametric regression problems, provided certain small-ball assumptions on the underlying process hold. While these results sharpen our understanding of how the mixing time τ_mix affects regression risk bounds, they ultimately rely on ergodicity. Since learning from a single trajectory is generally impossible without ergodicity, we are led to study other sequential learning configurations. The two, however, are not mutually exclusive: our results actually apply when the process mixes, and in fact show that the empirical risk minimizer is minimax optimal (after a burn-in time scaling with the mixing time). This eschews the need for algorithmic modifications to learning from mixing trajectory data (Bresler et al., 2020). We give details on this in Appendix B.7.

Non-temporal dependency structures. Covariates and responses can be inter-dependent in many ways, not only via temporal structure. A recent resurgence of work investigates learning under an Ising model structure over covariates (Bresler, 2015; Dagan et al., 2019; Ghosal and Mukherjee, 2020; Dagan et al., 2021b), as well as over responses (Daskalakis et al., 2019; Dagan et al., 2021a) (conditioned on the covariates). At a conceptual level, the extension from a single temporally dependent trajectory to multiple trajectories is analogous to the extension from single observations to Ising models with multiple independent observations. Incidentally, in this area, investigations began by studying learning under multiple independent observations, and progressed towards guarantees on learning from a single one. Relating these two data models, trajectories and Ising grids, under inter-compatible assumptions may reveal interesting connections between these results.

System identification.
A special case of our LDS-specific data model captures linear system identification with full state observation: the task of recovering the dynamical system parameters A from observations of trajectories. While classic results are asymptotic in nature (see e.g. Lai and Wei, 1982, 1983; Ljung, 1998), recent work gives finite-sample guarantees for recovery of linear systems with fully observed states (Simchowitz et al., 2018; Dean et al., 2020; Jedra and Proutiere, 2020; Faradonbeh et al., 2018; Sarkar and Rakhlin, 2019; Jedra and Proutiere, 2019; Tsiamis and Pappas, 2021), and also partially observed states (Oymak and Ozay, 2019; Simchowitz et al., 2019; Tsiamis and Pappas, 2019; Sarkar et al., 2021; Zheng and Li, 2021). The proof of our upper bounds builds on the small-ball arguments from Simchowitz et al. (2018) (that, in turn, extend Mendelson (2015); Koltchinskii and Mendelson (2015)), which do not require ergodicity. To the best of our knowledge, our results are the first to quantify the trade-offs between few long trajectories and many short trajectories.

Nearly all finite-sample guarantees for linear system identification consider a single trajectory, with a few notable exceptions. First, Dean et al. (2020) allow for m ≥ 1 trajectories with fully observed states and make no assumptions on the dynamics matrix A. However, their analysis discards all but the last state transition within a trajectory, reducing to iid learning over only m examples. Second, Zheng and Li (2021); Xin et al. (2022) study the recovery of Markov parameters from partially observed states over many trajectories. However, their error bounds do not decrease with longer training horizons T, since the number of Markov parameters one must recover scales with the trajectory length. Third, Xing et al. (2021) consider multiple trajectories where the noise enters multiplicatively instead of additively.
Their main finite-sample parameter recovery result (Theorem 2) states that the operator norm of the parameter error scales as √(T/m), with the additional restriction that T ≲ n². To achieve consistency, this result fixes the trajectory length T and takes the trajectory count m → ∞. By contrast, our analysis varies the two quantities T and m independently.

Furthermore, a line of work concurrent to ours investigates learning from multiple sources of linear dynamical systems (Chen and Poor, 2022; Modi et al., 2022). This is a latent variable model, where the underlying index of the LDS must be disambiguated from data. This model is more general than the one studied in this paper, and specializing the corresponding results to our setup yields sub-optimal bounds and unnecessary requirements. We discuss this in Section 5.2, after presenting upper bounds in detail.

Finally, to more clearly interpret the effects of multiple trajectories on learning, a core part of our work studies linear dynamical systems under an ideal setup with Gaussian controls and time-invariant dynamics. Recent work extends beyond this ideal setup in the single-trajectory setting, including learning in piecewise-affine systems (Block et al., 2023), bilinear systems (Sattar et al., 2022), switched linear systems (Massucci et al., 2022), and linear dynamical systems with non-linear control laws (Li et al., 2023). Extending our multi-trajectory analysis to these settings is an interesting open direction for future work.

Our LDS setup (Section 3.4) decouples the covariate dynamics model A from the observation model W_⋆, and our risk definition additionally allows for an arbitrary evaluation horizon T′. The risk over an arbitrary evaluation horizon is harder to control than parameter error, which corresponds to an evaluation length of one. This is because the larger signal-to-noise ratio accrued by a less stable system magnifies the prediction error over the entire evaluation horizon.
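The signal growth underlying this point can be seen numerically. The sketch below (our own illustration) uses the scalar random walk x_t = x_{t−1} + w_t, the simplest marginally unstable system, for which the state variance grows linearly in t; a prediction-risk weighting over a long evaluation horizon therefore puts increasing weight on later steps.

```python
import numpy as np

# For a marginally unstable system (rho(A) = 1; here the scalar case A = 1),
# the state covariance grows with time: Var(x_t) = t for the random walk.
# Larger states mean a larger signal-to-noise ratio late in the horizon.
rng = np.random.default_rng(0)
m, T = 2000, 50
w = rng.standard_normal((m, T))
x = np.cumsum(w, axis=1)      # x_t = x_{t-1} + w_t, with x_0 = 0
emp_var = x.var(axis=0)       # empirical Var(x_t); should grow roughly like t
```

With m = 2000 sampled trajectories, `emp_var[t-1]` concentrates near t, confirming the linear-in-time covariance growth that a risk over an extended evaluation horizon must contend with.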
Although the observation model that we consider is mentioned in Simchowitz et al. (2018), the general setup with matching upper and lower bounds is, to the best of our knowledge, a new contribution.

A complementary line of work studies the problem of online sequence prediction in a no-regret framework, where the baseline expert class comprises trajectories generated by a linear dynamical system (Hazan et al., 2017, 2018; Ghai et al., 2020). These results also allow for marginally unstable dynamics but are otherwise not directly comparable. Other efforts look beyond linear systems to identifying various non-linear classes, such as exponentially stable non-linear systems (Sattar and Oymak, 2020; Foster et al., 2020) and marginally unstable non-linear systems (Jain et al., 2021). These results again learn from a single trajectory. We believe that elements of our analysis can be ported over to offer many-trajectory bounds for these particular classes of non-linear systems.

## 3. Problem formulation

Notation. The real eigenvalues of a Hermitian matrix M ∈ C^{k×k} are λ_max(M) = λ_1(M) ≥ · · · ≥ λ_k(M) = λ_min(M). For a square matrix M ∈ C^{k×k}, M^* denotes its conjugate transpose, and ρ(M) denotes its spectral radius: ρ(M) = max{|λ| : λ is an eigenvalue of M}. The space of n × n real-valued symmetric positive semidefinite (resp. positive definite) matrices is denoted Sym^n_{⪰0} (resp. Sym^n_{≻0}). The non-negative (resp. positive) orthant in R^n is denoted R^n_{≥0} (resp. R^n_{>0}), and S^{n−1} denotes the unit sphere in R^n. Finally, the set of positive integers is denoted N_+.

### 3.1 Linear regression from sequences

Regression model. A covariate sequence is an indexed set {x_t}_{t≥1} ⊂ R^n. Any distribution P_x over covariate sequences is assumed to have bounded second moments, i.e., E[x_t x_t^T] exists and is finite for all t ≥ 1. Also for such a distribution P_x, let P_ξ[P_x] be a distribution over observation noise sequences {ξ_t}_{t≥1} ⊂ R^p.
Denoting by {F_t}_{t≥0} the filtration with F_t = σ({x_k}_{k=1}^{t+1}, {ξ_k}_{k=1}^{t}), we assume that {ξ_t}_{t≥1} is a σ_ξ-sub-Gaussian martingale difference sequence (MDS), i.e., for t ≥ 1:

    E[⟨v, ξ_t⟩ | F_{t−1}] = 0,    E[exp(λ⟨v, ξ_t⟩) | F_{t−1}] ≤ exp(λ² ‖v‖₂² σ_ξ² / 2) a.s.,    for all λ ∈ R, v ∈ R^p.

Given a ground truth model W_⋆ ∈ R^{p×n}, define the observations (a.k.a. responses or labels):

    y_t = W_⋆ x_t + ξ_t,    t ≥ 1.    (3.1)

Denote by P^{W_⋆}_{x,y}[P_x, P_ξ] the joint distribution over covariates and observations {(x_t, y_t)}_{t≥1}.

Regression task. Fix a ground truth model W_⋆ ∈ R^{p×n}, a covariate distribution P_x, an observation noise model P_ξ, a training horizon T, and a test horizon T′. Draw m independent sequences {(x_t^{(i)}, y_t^{(i)})}_{i∈[m], t≥1} from P^{W_⋆}_{x,y}[P_x, P_ξ], and call their length-T prefixes {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T} the training examples. From these examples, the regression task is to find a hypothesis f̂_{m,T} : R^n → R^p that matches ground truth predictions f_{W_⋆}(x) := W_⋆ x in expectation over unseen trajectories of length T′. Specifically, the excess risk of a hypothesis f̂ is:

    L(f̂; T′, P_x) := E_{P_x} [ (1/T′) Σ_{t=1}^{T′} ‖f̂(x_t) − f_{W_⋆}(x_t)‖₂² ].    (3.2)

We say that the evaluation horizon T′ is strict if T′ ≤ T and extended if T′ > T. When the hypothesis class is linear, meaning the hypotheses f̂ are of the form f̂(x) = Ŵx with Ŵ ∈ R^{p×n}, the risk expression (3.2) simplifies as follows. For a positive definite matrix Σ ∈ R^{n×n}, define the weighted square norm ‖M‖²_Σ := tr(M Σ M^T) for M ∈ R^{p×n}. Denoting, for t ≥ 1:

    Σ_t(P_x) := E_{P_x}[x_t x_t^T],    Γ_t(P_x) := (1/t) Σ_{k=1}^{t} Σ_k(P_x),    (3.3)

we overload notation and write:

    L(Ŵ; T′, P_x) = ‖Ŵ − W_⋆‖²_{Γ_{T′}(P_x)}.    (3.4)

The risk (3.2), being a notion of error averaged over time steps, relates to that of Ziemann et al. (2022) in the study of learning dynamics (the difference lies in whether the error norm is squared). By allowing unequal training and test horizons T ≠ T′, we cover two related scenarios at once: system identification in linear dynamical systems (when T′ = 1) and predicting past the end of a sequence (when T′ > T).
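As a sanity check on definitions (3.3) and (3.4), here is a short numerical sketch. It is our own illustration: the closed-form covariances below assume the random-walk system with A = B = I_n, for which Σ_t = t·I_n; the function and variable names are ours.

```python
import numpy as np

def trajectory_risk(W_hat, W_star, Sigmas):
    """Excess risk (3.4): ||W_hat - W_star||^2_{Gamma_{T'}}, where Gamma_{T'}
    is the average (3.3) of the per-step covariances Sigma_1, ..., Sigma_{T'}."""
    Gamma = np.mean(Sigmas, axis=0)   # (1/T') * sum_t Sigma_t
    D = W_hat - W_star
    return np.trace(D @ Gamma @ D.T)  # tr(D Gamma D^T) = ||D||^2_Gamma

# For the random walk x_t = x_{t-1} + w_t, Sigma_t = t * I_n.
n, p, T_eval = 3, 2, 10
Sigmas = [t * np.eye(n) for t in range(1, T_eval + 1)]
W_star = np.zeros((p, n))
W_hat = np.full((p, n), 0.1)
risk = trajectory_risk(W_hat, W_star, Sigmas)
# Here Gamma_{T'} = ((T' + 1) / 2) * I_n, so risk = 0.01 * p * n * (T' + 1) / 2.
```

Note how the growing covariances of a marginally unstable system inflate the weighting matrix Γ_{T′}, so the same parameter error Ŵ − W_⋆ incurs larger risk over longer evaluation horizons.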
For the latter, the risk definition (3.2) is closely related to a commonly studied notion of final-step generalization (see e.g. Kuznetsov and Mohri, 2017, Eq. 5; McDonald et al., 2017, Def. 10) that measures the performance of a hypothesis at T′ − T time steps beyond the training horizon:

    L_end(f̂; T′, P_x) := E_{P_x}[ ‖f̂(x_{T′}) − f_{W_⋆}(x_{T′})‖₂² ].

Linear hypotheses enjoy the identity L_end(Ŵ; T′, P_x) = ‖Ŵ − W_⋆‖²_{Σ_{T′}(P_x)}. In turn, when the per-step covariances Σ_t(P_x) are non-decreasing (in the positive semidefinite order):

    L_end(Ŵ; T′, P_x) ≥ L(Ŵ; T′, P_x) ≥ (1/2) · L_end(Ŵ; ⌈T′/2⌉, P_x).

In other words, provided the scale of the covariances Σ_t(P_x) does not grow substantially over time t, our risk definition L is comparable to the final-step risk L_end.

Minimax risk. To compare the hardness of learning across problem classes (i.e., families of covariate distributions P_x), we measure the minimax rate of the risk L, i.e., the behavior of the best estimator's worst-case risk over valid problem instances, as a function of the amount of training data m, T and other problem parameters such as n, p, σ_ξ, and T′. Recall that P^{W_⋆}_{x,y} denotes the distribution over labeled trajectories {(x_t, y_t)}_{t≥1}. For a collection of covariate sequence distributions 𝒫_x, the minimax risk over problem instances consistent with 𝒫_x is:

    R(m, T, T′; 𝒫_x) := inf_{Alg} sup_{P_x ∈ 𝒫_x} sup_{W_⋆, P_ξ} E_{⊗_{i=1}^{m} P^{W_⋆}_{x,y}[P_x, P_ξ]} [ L( Alg({(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T}); T′, P_x ) ],    (3.5)

where the infimum ranges over estimators Alg : (R^n × R^p)^{mT} → (R^n → R^p) that map training samples to hypotheses, the supremum over W_⋆ is over all p × n ground truth models, and the supremum over P_ξ is over all σ_ξ-sub-Gaussian MDS processes determining the observation noise.

The ordinary least-squares estimator. Much like its classical role in iid learning, the ordinary least-squares (OLS) estimator will be key to bounding the minimax risk (3.5) from above. We define the OLS estimator to be the linear hypothesis Ŵ_{m,T} ∈ R^{p×n} that satisfies:

    Ŵ_{m,T} ∈ argmin_{W ∈ R^{p×n}} Σ_{i=1}^{m} Σ_{t=1}^{T} ‖W x_t^{(i)} − y_t^{(i)}‖₂².    (3.6)

For i = 1, …, m, let X^{(i)}_{m,T} ∈ R^{T×n} be the data matrix for the i-th trajectory (i.e., the t-th row of X^{(i)}_{m,T} is x_t^{(i)} for t = 1, …, T). Define Y^{(i)}_{m,T} ∈ R^{T×p} and Ξ^{(i)}_{m,T} ∈ R^{T×p} analogously. Put X_{m,T} ∈ R^{mT×n} as the vertical concatenation of X^{(1)}_{m,T}, …, X^{(m)}_{m,T}, and similarly for Y_{m,T} ∈ R^{mT×p} and Ξ_{m,T} ∈ R^{mT×p}. Whenever X_{m,T} has full column rank, we can write Ŵ_{m,T} as:

    Ŵ_{m,T} = Y^T_{m,T} X_{m,T} (X^T_{m,T} X_{m,T})^{−1}.    (3.7)

Figure 2: Formulations of regression from sequential data, illustrated as graphical models. (a) The Seq-LS problem (Problem 3.1): covariates {x_t} are drawn from a sequence distribution, and noisy observations {y_t} are drawn conditioned on these covariates. (b) The corresponding baseline Ind-Seq-LS problem (Problem 3.2): independent covariate-observation pairs {(x_t, y_t)} are drawn, each from the marginal distribution of the corresponding t-th step in the Seq-LS problem. Specifically, these graphs depict a simplified special case of our data model, in which the observations {y_t} across time are independent conditioned on the covariates {x_t}. In our general definitions (Problem 3.1 and Problem 3.2), the observations {y_t} can be conditionally interdependent, via a martingale difference sequence on the observation noise (Section 3.1).

OLS is not the only estimation method for least-squares problems. Often some regularization penalty, such as that of the ridge or LASSO estimator, is added to the least-squares loss (3.6), based on some structural understanding of the problem instance at hand. Studying the interplay between multiple trajectories and, say, norm-based risk bounds (Zhang, 2002) or sparse recovery (Candès et al., 2006), is an exciting direction for future work.

### 3.2 Problem classes

We formalize linear regression from sequential data generally as follows:

Problem 3.1 (Seq-LS).
Assume a covariate sequence distribution P_x in the linear regression model (3.1). Fix an evaluation horizon T′. On input m labeled trajectories of length T drawn from this model, in the form of examples {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T}, output a hypothesis f̂_{m,T} that minimizes the excess risk L(f̂_{m,T}; T′, P_x).

Our topmost goal is to study the effect of learning from sequentially dependent covariates in comparison with learning in the classical iid setup. Linear regression is well understood in the latter setting. Focusing on well-specified linear regression further simplifies our presentation, allowing us to isolate the effects of what interests us most: dependent covariates. Generalizing the supervision aspect of Seq-LS (say, to unrealizable and non-parametric regression, or to classification) is left to future work. We return to discuss this in Section 9. Separately, our assumption that the learner accesses m trajectories each with common length T is also intended for simplicity. Generalizing our results to varying trajectory lengths T_1, …, T_m, and even varying covariate sequence distributions P_x^{(1)}, …, P_x^{(m)}, is conceptually straightforward, but notationally more burdensome.

To study how dependent data affects learning, we need to establish an independent data baseline. The natural comparison point for Seq-LS is to remove all correlations across time. Namely, instead of drawing covariates sequentially from the distribution P_x, consider learning separately from the marginals of P_x at each time step. The resulting decorrelated distribution generates independent examples, but typically not iid ones. We formalize linear regression from independent data generally as follows:

Problem 3.2 (Ind-Seq-LS). Fix a sequence of distributions {P_{x,t}}_{t≥1}. Consider their product over time, ⊗_{t≥1} P_{x,t}, as the covariate sequence distribution in the linear regression model (3.1). Fix an evaluation horizon T′.
On input m labeled trajectories of length T drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , t 1Px,t). This Ind-Seq-LS problem generalizes the canonical iid learning setup slightly. Existing theory can still characterize its minimax risk, provided the covariances of the distributions {Px,t} are roughly equal in scale across time t. However, this equal-scale requirement rules out the marginals of interesting applications, such as dynamical systems that are not stable or ergodic. We therefore extend, in later sections, characterizations of the regression risk to handle covariances that can scale polynomially across time instead. 3.3 Problem separations To set up a baseline for a Seq-LS problem, we will specifically instantiate Ind-Seq-LS over its marginals. Namely, for a sequence distribution Px over {xt}t 1, let µt[Px] be the marginal distribution of xt at time t 1, and consider Ind-Seq-LS with covariates drawn from the sequence {µt[Px]}t 1. Figure 2 illustrates such a Seq-LS problem and the corresponding Ind-Seq-LS instance over its marginals. This decorrelated baseline is a hypothetical benchmark: in a practical context, collecting independent marginal data, when nature only supplies its dependent form, can be expensive or infeasible. However, we can expect that having such data on hand would make learning easier, with risk rates that resemble iid learning. In what follows, we outline scenarios where a sequential learning problem and its decorrelated baseline coincide in difficulty, and others in which they diverge. We then outline the possible assumptions that would allow us to always relate the two. The iid special case. When T = T = 1, the example trajectories {x(i) 1 }m i=1 are trivially a set of iid covariates. The problems Seq-LS and Ind-Seq-LS thus coincide, and reduce to the well-specified random design linear regression problem over m iid covariates. 
It is well-known that under iid data, and mild regularity conditions, the minimax risk scales as σ2 ξpn/m, and is achieved by the OLS estimator (Hsu et al., 2014; Mourtada, 2022; Wainwright, 2019). Extending the horizon. Considering nontrivial horizons T = T > 1, both Seq-LS and its corresponding Ind-Seq-LS baseline become more involved, but for different reasons. The Ind-Seq-LS problem, as we show in Section 6, is not generally learnable with polynomially many examples. Specifically, the minimax rate scales exponentially in the dimension n provided the trajectory count m is constant. To address this, we will require that the covariances of its constituent distributions {Px,t} grow at most polynomially with Tu, Frostig, and Soltanolkotabi time t. Under this constraint, the problem s minimax risk again scales as the iid-like rate σ2 ξpn/(m T) times, at most, a factor determined exponentially by the covariance growth. The Seq-LS problem inherits the same growth limitation. Even then, it is still not generally learnable without further assumptions on the dependence structure of covariates: the minimax risk is otherwise bounded away from zero as the horizon T tends to infinity, provided the trajectory count m is constant. To realize this, consider x1 N(0, In) and xt = xt 1 for t 2, a sequence of identical covariates whose marginals are all independent Gaussians. The resulting dataset presents an underdetermined regression problem if m < n. In essence, its covariates lack sufficient excitation across time. To rein Seq-LS back in to the realm of learnability, one must: (a) make further modeling assumptions about covariates, or (b) introduce excitation via independent resets. For (a), as detailed in Section 2, the most common modeling assumption considers sequences that mix rapidly to a stationary distribution. Another avenue recently active in the literature, and sometimes overlapping with the mixing approach considers sequences generated by linear dynamical systems. 
Among these two, mixing implies risk bounds that tend to zero with T, but these hold in the worst case only after a burn-in time that scales proportionally to the mixing time (Bresler et al., 2020). This prevents a characterization of minimax risk uniformly across the full range of problem instances P_x that mix, unless one caps the mixing time at a fixed constant. Narrowing instead to LDS models in the sequel, we manage to succinctly carve out a basic problem family, with unbounded mixing time, and to characterize its minimax risk uniformly. One still pays a price for sequential dependency, as this minimax risk turns out to be larger than its Ind-Seq-LS counterpart by a factor of the dimension n. Turning in addition to (b), by introducing (sufficiently many) resets, we can expand our data model substantially: we manage to lift most of our LDS assumptions and extend to other dynamical systems. Remarkably, we even show that for any controllable LDS, including ones that are unstable and hence grow exponentially in time, having sufficiently many resets guarantees that the risk exhibits, once again, the iid-like behavior of σ_ξ² pn/(mT), up to mere constants.

3.4 Linear dynamical trajectories

Fix a dynamics matrix A ∈ R^{n×n} and a control matrix B ∈ R^{n×d}. Consider the n-dimensional trajectory {x_t}_{t≥1} defined by the linear dynamical system:

x_t = A x_{t-1} + B w_t, where w_t ~ N(0, I_d), for t ≥ 1, (3.8)

taking x_0 = 0 by convention. We assume that the noise process {w_t}_{t≥1} is independent across time, i.e., that w_t ⊥ w_{t'} whenever t ≠ t'. Overloading notation, let the matrix Σ_t(A, B) := ∑_{k=0}^{t-1} A^k B Bᵀ (A^k)ᵀ denote the covariance of x_t, and let the matrix Γ_t(A, B) := (1/t) ∑_{k=1}^{t} Σ_k(A, B) denote the average covariance. Denote by P^{A,B}_x the distribution over the trajectory {x_t}_{t≥1}, and let {x^{(i)}_t}_{t≥1} for i ≥ 1 denote independent draws from P^{A,B}_x. When B = I_n, we use the respective shorthand notation Σ_t(A), Γ_t(A), and P^A_x.
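The covariances Σ_t(A, B) and Γ_t(A, B) defined above are easy to compute numerically via the equivalent recursion Σ_t = A Σ_{t-1} Aᵀ + B Bᵀ. A minimal sketch (ours; the helper names are hypothetical):

```python
import numpy as np

def state_covariances(A, B, T):
    """Sigma_t(A, B) = sum_{k=0}^{t-1} A^k B B^T (A^k)^T for t = 1, ..., T,
    computed via the equivalent recursion Sigma_t = A Sigma_{t-1} A^T + B B^T
    (with Sigma_0 = 0, matching the convention x_0 = 0)."""
    Sigma = np.zeros_like(A, dtype=float)
    out = []
    for _ in range(T):
        Sigma = A @ Sigma @ A.T + B @ B.T
        out.append(Sigma.copy())
    return out

def average_covariance(A, B, T):
    """Gamma_T(A, B) = (1/T) * sum_{t=1}^T Sigma_t(A, B)."""
    return sum(state_covariances(A, B, T)) / T

# Example: a marginally unstable 2-dimensional system (a Jordan block at 1).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.eye(2)
Sigmas = state_covariances(A, B, 3)  # Sigma_1, Sigma_2, Sigma_3
```

For this marginally unstable example the entries of Σ_t grow polynomially in t, the regime that the polynomial-growth assumptions in later sections are designed to accommodate.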
Learning from many trajectories Modeling regression covariates as linear dynamical trajectories gives us the LDS-LS problem, a specialization of Seq-LS (Problem 3.1): Problem 3.3 (LDS-LS). Assume a dynamics matrix A Rn n, a control matrix B Rn d, and a corresponding linear dynamical covariate distribution PA,B x in the linear regression model (3.1). Fix an evaluation horizon T . On input m labeled trajectories of length T, drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , PA,B x ). Let PA,B x,t be the marginal distribution of xt under PA,B x at each t 1. The natural decorrelated baseline for LDS-LS is a corresponding specialization of Ind-Seq-LS (Problem 3.2) to LDS trajectories: Problem 3.4 (Ind-LDS-LS). Assume a dynamics matrix A Rn n, a control matrix B Rn d, and a corresponding trajectory distribution PA,B x . Consider covariates drawn independently from its marginals, i.e., assume the linear regression model (3.1) under the covariate sequence distribution t 1PA,B x,t . Fix an evaluation horizon T . On input m labeled trajectories of length T, drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , t 1PA,B x,t ). Learning dynamical systems. LDS-LS generalizes linear system identification, the problem of recovering the dynamics A from data. The reduction follows by setting W = A and ξ(i) t = Bw(i) t+1, so that y(i) t = x(i) t+1. Note that when B has full row rank, the squared parameter error in the weighted BBT norm BBT is simply the risk L( ˆA; T , PA,B x ) when T = 1. Recent related work typically assumes that B indeed has full row rank, but in later sections we touch on the more general case where this is not required, so long as the pair (A, B) is controllable. Bounds in operator norm are also easily obtainable from our proof techniques. 
However, our lower bounds will not inform the system identification problem specifically; our hardness results rely on decoupling W from A and ξ^{(i)}_t from w^{(i)}_{t+1}, whereas this reduction naturally ties them.

4. Trajectory small-ball definition and examples

We establish risk upper bounds by studying the behavior of the ordinary least-squares estimator. The key technical definition that drives the analysis is a small-ball condition on covariate sequences:

Definition 4.1 (Trajectory small-ball (Traj SB)). Fix a trajectory length T ∈ N₊, a parameter k ∈ {1, ..., T}, positive definite matrices {Ψ_j}_{j=1}^{⌊T/k⌋} ⊆ Sym^n_{>0}, and constants c_sb ≥ 1, α ∈ (0, 1]. The distribution P_x satisfies the (T, k, {Ψ_j}_{j=1}^{⌊T/k⌋}, c_sb, α)-trajectory-small-ball (Traj SB) condition if:

(a) (1/⌊T/k⌋) ∑_{j=1}^{⌊T/k⌋} Ψ_j ⪯ Γ_T(P_x),

(b) {x_t}_{t≥1} is adapted to a filtration {F_t}_{t≥1}, and

(c) for all v ∈ R^n \ {0}, j ∈ {1, ..., ⌊T/k⌋}, and ε > 0:

P{ (1/k) ∑_{t=(j-1)k+1}^{jk} ⟨v, x_t⟩² ≤ ε vᵀΨ_j v | F_{(j-1)k} } ≤ (c_sb ε)^α a.s. (4.1)

Above, F_0 is understood to be the minimal σ-algebra. Additionally, the distribution P_x satisfies the (T, k, Ψ, c_sb, α)-Traj SB condition if it satisfies (T, k, {Ψ_j}_{j=1}^{⌊T/k⌋}, c_sb, α)-Traj SB with Ψ_j = Ψ. Finally, we call the parameter k the excitation window.

We will soon show in Lemma 5.1 how the various quantities parameterizing the trajectory small-ball condition closely govern the final risk bound of the ordinary least-squares estimator. For the remainder of this section, however, we focus on developing intuition for Definition 4.1 by working through diverse examples. In the definition, we typically consider the matrices Ψ_j to be the sharpest almost-sure lower bound that we can specify (in the Loewner order) on the quantity E[(1/k) ∑_{t=(j-1)k+1}^{jk} x_t x_tᵀ | F_{(j-1)k}]. Section 4.1 lists examples of covariate sequence distributions P_x that satisfy the Traj SB condition. Definition 4.1 draws inspiration from the block martingale small-ball condition from Simchowitz et al. (2018, Definition 2.1).
There are, however, two main differences: (a) we consider the small-ball probability of the entire block 1 k Pjk t=(j 1)k+1 v, xt 2 at once, instead of the average of small-ball probabilities: t=(j 1)k+1 P n v, xt 2 ε v TΨjv | F(j 1)k o , (4.2) and (b) equation (4.1) is required to hold at all scales ε > 0, instead of at a single resolution. We need the first modification (a) to prove optimal rates under many trajectories without assuming stability or ergodicity; we expand on this point in Section 4.1. Furthermore, condition (4.1) is implied by a bound on the average of small-ball probabilities (4.2) (cf. Proposition 4.2), rendering it a more general condition. We need the second modification (b) in order to bound the expected value of the OLS risk. The need to modify (b) in order to bound the expected OLS risk is present even in the iid setting, as discussed in Mourtada (2022, Remark 4). It is this modification that prevents Definition 4.1 from fully subsuming Simchowitz et al. (2018, Definition 2.1). The following remark discusses a small change to Definition 4.1 that addresses this issue, by exchanging expected OLS risk bounds to high probability bounds: Remark 4.1. In Appendix B.7, we consider the following modification to Definition 4.1, where we instead suppose that (4.1) holds for some fixed ε (such that the inequality s right-hand side is strictly less than one), rather than for all ε; we refer to this modification as weak trajectory small-ball (Definition B.1). As described above, a consequence of the weak trajectory small-ball condition is that the main OLS risk bounds now hold with high probability (i.e., polylogarithmic in 1/δ) rather than in expectation (Lemma B.23).2 A key upshot (Proposition B.24), however, is that this change allows for an ergodic covariate 2. Such a high probability bound does not, in turn, imply a bound on the expected risk via integration over the tail. 
The reason is that the high probability bound (Lemma B.23) requires that the number of data points mT grow proportionally to log(1/δ), where δ is the failure probability. If one attempts to bound E[‖Ŵ_{m,T} − W‖²_Γ] = ∫₀^∞ P(‖Ŵ_{m,T} − W‖²_Γ ≥ t) dt using Lemma B.23 to control the tail probability on the right-hand side, then for any fixed m and T there exists a scale t₀ (corresponding to setting δ sufficiently small) such that, for any t ≥ t₀, the bound on P(‖Ŵ_{m,T} − W‖²_Γ ≥ t) from Lemma B.23 is no longer valid. The same issue occurs in, e.g., Simchowitz et al. (2018, Theorem 2.4), and for the same reason: the need for a small-ball assumption that holds at all resolutions, as underscored in Mourtada (2022, Remark 4).

sequence with φ-mixing time bounded by τ_mix to be considered (weak) trajectory small-ball (with excitation window k ≳ τ_mix), provided the stationary distribution μ satisfies a standard (weak) small-ball condition (Mendelson, 2015; Koltchinskii and Mendelson, 2015; Oliveira, 2016):

sup_{v ∈ S^{n-1}} P_μ{ ⟨v, x⟩² ≤ ε E_μ[⟨v, x⟩²] } < 1 for some ε > 0.

This in turn yields upper bounds for Seq-LS in the few trajectories (m ≲ n) regime of the following form: if mT ≥ Ω(τ_mix n), then L(Ŵ_{m,T}; T, P_x) ≤ O(σ_ξ² pn/(mT)) with high probability. This statement generalizes the risk bound for a single ergodic trajectory from Bresler et al. (2020) to the ordinary least-squares estimator (3.6). We can interpret the condition on mT as a burn-in time requirement. Meanwhile, at least in the single-trajectory (m = 1) setting, Bresler et al. (2020, Theorem 1) tells us that such a burn-in assumption (T ≳ τ_mix n) is necessary for a non-trivial risk guarantee.

4.1 Examples of trajectory small-ball distributions

We now turn to specific examples of distributions P_x which satisfy the trajectory small-ball condition. First is the example introduced in Section 3.3, where x_1 is drawn from a multivariate Gaussian and subsequently copied as x_t = x_{t-1} for all t ≥ 2:

Example 4.1 (Copies of a Gaussian draw). Let Σ ∈ Sym^n_{>0}, and let P_x denote the process x_1 ~ N(0, Σ) and x_t = x_{t-1} for t ≥ 2. Fix any T ∈ N₊. Then P_x satisfies the (T, T, Σ, e, 1/2)-Traj SB condition.

Note that this process only satisfies the trajectory small-ball condition with excitation window k = T. In other words, the conditional distribution x_{t+k} | x_t for k ≥ 1 (a Dirac distribution on x_t) contains no excitation as needed for learning. This example can actually be generalized to arbitrary Gaussian processes indexed by time:

Example 4.2 (Gaussian processes). Let P_x be a Gaussian process indexed by time, i.e., for every finite index set I ⊆ N₊, the collection of random variables (x_t)_{t∈I} is jointly Gaussian. Let T_nd := inf{t ∈ N₊ | det(E[x_t x_tᵀ]) ≠ 0}, and suppose T_nd is finite. Fix a T ∈ N₊ satisfying T ≥ T_nd. Then P_x satisfies the (T, T, Γ_T(P_x), 2e, 1/2)-Traj SB condition.

This example illustrates the insufficiency of the average small-ball probabilities condition (4.2). Suppose that x_1, ..., x_T is a jointly Gaussian process. Then Example 4.2 states (cf. Equation (4.1)) that for all v ≠ 0, ε > 0:

P{ (1/T) ∑_{t=1}^{T} ⟨v, x_t⟩² ≤ ε vᵀΓ_T(P_x)v } ≤ (2eε)^{1/2}.

If instead we use the average small-ball condition (4.2), the corresponding bound would be, for all v ≠ 0, ε > 0:

(1/T) ∑_{t=1}^{T} P{ ⟨v, x_t⟩² ≤ ε vᵀΓ_T(P_x)v } ≤ ( 2eε / min_{t∈{1,...,T}} λ_min(Γ_T^{-1}(P_x)Σ_t) )^{1/2}.

The extra 1/min_{t∈{1,...,T}} λ_min(Γ_T^{-1}(P_x)Σ_t) factor would then enter the resulting rates for least-squares regression, in turn requiring assumptions that limit the growth of Γ_T(P_x) relative to Σ_t. Fortunately, the trajectory small-ball condition (4.1) avoids these issues. Our next example involves independent, but not identically distributed, covariates:

Example 4.3 (Independent Gaussians). Let {Σ_t}_{t≥1} ⊆ Sym^n_{>0}, and let P_x = ⊗_{t≥1} N(0, Σ_t). Fix a T ∈ N₊. Then P_x satisfies the (T, 1, {Σ_t}_{t=1}^{T}, e, 1/2)-Traj SB condition.
Example 4.3 allows us to select k = 1, reflecting the independence of the covariates across time. We can also craft an example around a process that does not mix, but that still exhibits an excitation window of k = 2: Example 4.4 (Alternating halfspaces). Suppose that n 4 is even, and let u1, . . . , un be a fixed orthonormal basis of Rn. Put U0 = span(u1, . . . , un/2) and U1 = span(un/2+1, . . . , un). Let i1 Bern(1 2), it+1 = (it + 1) mod 2 for t N+, and let Px denote the process with conditional distribution xt | it uniform over the spherical measure on Uit Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(2n), e, 1 2)-Traj SB condition. To see that the covariate distribution {xt} does not mix, observe that the marginal distribution for all t is uniform on Sn 1, whereas the conditional distribution xt+k | xt for any k N+ is either uniform on U0 Sn 1 or uniform on U1 Sn 1. Although it does not mix at all, the trajectory supplies ample excitation for learning in any mere two steps. Even for a process that does mix, it may exhibit an excitation window far smaller than its mixing time. The following sets up such an example, where again sufficient excitation is provided with k = 2 steps: Example 4.5 (Normal subspaces). Suppose that n 3. Let u1, . . . , un be a fixed orthonormal basis in Rn, and let U i := span({uj}j =i) for i {1, . . . , n}. Consider the Markov chain {it}t 1 defined by i1 Unif({1, . . . , n}), and it+1 | it Unif({1, . . . , n} \ {it}). Let Px denote the process with conditional distribution xt | it uniform over the spherical measure on U it Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(4n 4), e, 1 2)-Traj SB condition. In this example, a straightforward computation (detailed in Proposition B.11) shows that the mixing time τmix(ε) of the Markov chain {it}t 1 scales as logn(1/ε).3 In most analyses which rely on mixing time arguments, one requires that the mixing time resolution ε tends 3. 
For concreteness, given a discrete-time Markov chain over a finite state-space S with transition matrix P and stationary distribution π, we define the mixing time as: τ_mix(ε) := inf{k ∈ N | sup_{μ ∈ P(S)} ‖μP^k − π‖_tv ≤ ε}. Here, P(S) denotes the set of all probability distributions over S, and ‖·‖_tv denotes the total variation norm over distributions.

to zero as either the amount of data and/or probability of success increases; as a concrete example, Duchi et al. (2012, Eq. 3.2) suggests setting ε = 1/√T, where T is the number of samples drawn from the underlying distribution. On the other hand, the trajectory small-ball condition in Example 4.5 holds with a short excitation window of length k = 2, independently of T.

Next we consider linear dynamical systems. As setup, we first define the notion of controllability for a pair of dynamics matrices (A, B):

Definition 4.2 (Controllability). Let (A, B) be a pair of matrices with A ∈ R^{n×n} and B ∈ R^{n×d}. For k ∈ {1, ..., n}, we say that (A, B) is k-step controllable if the matrix:

[B, AB, A²B, ..., A^{k-1}B] ∈ R^{n×kd}

has full row rank. The classical definition of controllability in linear systems (cf. Rugh, 1996, Chapter 25) is equivalent to n-step controllability. Definition 4.2 allows the system to be controllable in fewer than n steps. Also note that k is restricted to {1, ..., n}, since if a system is not n-step controllable, it will not be n′-step controllable for any n′ > n (by the Cayley-Hamilton theorem). A few special cases of interest are as follows. If B has rank n, then (A, B) is trivially one-step controllable for any A. On the other hand, if (A, B) is in canonical controllable form (i.e., A is the companion matrix associated with the polynomial p(z) = a_0 + a_1 z + ... + a_{n-1} z^{n-1} + z^n and B is the n-th standard basis vector), then (A, B) is n-step controllable. The latter corresponds directly to the state-space representation of autoregressive processes of order n, e.g. AR(n).
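Definition 4.2 can be checked numerically by forming the controllability matrix and testing its rank. A minimal sketch (ours; the function name is hypothetical):

```python
import numpy as np

def is_k_step_controllable(A, B, k):
    """Definition 4.2: (A, B) is k-step controllable iff the n x (k*d)
    controllability matrix [B, AB, ..., A^{k-1} B] has full row rank n."""
    n = A.shape[0]
    blocks, M = [], B.astype(float)
    for _ in range(k):
        blocks.append(M)
        M = A @ M
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

# A companion-form-like pair (shift dynamics, with the input entering only
# the last coordinate) is n-step controllable but not (n-1)-step controllable.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
B = np.array([[0.0], [0.0], [1.0]])
```

Here the input reaches one additional coordinate per step, so full excitation takes exactly n = 3 steps; by contrast, any (A, I_n) pair is one-step controllable.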
Example 4.6 (Linear dynamical systems). Let (A, B) with A ∈ R^{n×n} and B ∈ R^{n×d} be k_c-step controllable (Definition 4.2). Let P^{A,B}_x be the linear dynamical system defined in (3.8). Fix any T, k ∈ N₊ satisfying T ≥ k ≥ k_c. Then, P^{A,B}_x satisfies the (T, k, Γ_k(A, B), e, 1/2)-Traj SB condition.

We next consider LDS controlled by linear feedback policies. Let K ∈ R^{d×n} parameterize a linear feedback policy u_t = K x_t, and consider the closed-loop linear dynamical system described by the recurrence:

x_t = A_c x_{t-1}, where A_c := A + BK. (4.3)

When x_1 ~ N(0, Σ), we have the following small-ball condition with excitation window k = T.

Example 4.7 (Closed-loop linear dynamical systems). Let P^{A_c}_x denote the closed-loop linear dynamical system defined in (4.3). Suppose that the initial condition x_1 ~ N(0, Σ), where Σ is positive definite. For any T ∈ N₊, P^{A_c}_x satisfies the (T, T, Σ_T(A_c, Σ^{1/2})/T, e, 1/2)-Traj SB condition.

Example 4.7 follows directly from the Gaussian process example (Example 4.2). Proving small-ball conditions under an excitation window k < T for closed-loop linear dynamical systems is not possible under our current framework, because there is only one source of randomness within a trajectory: at its beginning. Hence the conditional distributions in (4.1) become Dirac measures. This is to be expected, as few-trajectory learning in (4.3) can be impossible without noise injection. For example, if A is rank-deficient, then the iterates x_2, ..., x_T all lie in the same lower-dimensional subspace of R^n, precluding learning.

In all of the examples so far, the time-t marginal distribution of covariates x_t has either been a multivariate Gaussian or a spherical measure. To underscore the generality of the small-ball method, we can create additional examples where this is not the case. In what follows, we consider Volterra series (Mathews and Sicuranza, 2000), which generalize the classical Taylor series to causal sequences.
Analogous to how polynomials can approximate continuous functions arbitrarily well on a compact set, Volterra series can approximate signals that depend continuously (and solely) on their history over a bounded set of inputs (cf. Rugh, 1981, Section 1.5). Example 4.8 (Degree-D Volterra series). Fix a D N+. Let {c(d,ℓ) i1,...,id}i1,...,id N for d {1, . . . , D} and ℓ {1, . . . , n} denote arbitrary rank-d arrays. Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. Consider the process Px where for t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: i1,...,id=0 c(d,ℓ) i1,...,id d =1 w(ℓ) t id 1. (4.4) Let Tnd := inf{t N+ | det(Γt(Px)) = 0}, and suppose Tnd is finite. There is a constant c D > 0, depending only on D, such that for any T Tnd, the process Px satisfies the (T, T, ΓT (Px), c D, 1/(2D))-Traj SB condition. The main idea behind Example 4.8 is that, while xt is certainly not Gaussian, the quadratic form PT t=1 v, xt 2 is a degree at most 2D polynomial in {w(ℓ) t }T 1 t=0 . It will hence exhibit anti-concentration, according to a landmark result from Carbery and Wright (2001). The same result actually provides an immediate extension of this example as well as the previous examples to noise distributions with log-concave densities, such as Laplace or uniform noise. We next present a special case of the Volterra series, where we can choose the excitation window k in the small-ball definition strictly between the endpoints 1 and T. To set up, a few more definitions are needed: Definition 4.3. Fix an integer d N+. A rank-d array of coefficients {ci1,...,id}i1,...,id N is called: (a) symmetric if ci1,...,id = cπ(i1,...,id) for any permutation π of indices i1, . . . , id N, (b) traceless if ci,...,i = 0 for all i N, and (c) non-degenerate if there exists an knd N+ such that the following set is non-empty: {(i1, . . . , id) | ci1,...,id = 0, i1, . . . , id {0, . . . , knd 1}}. The smallest knd such that {ci1,...,id} is the non-degeneracy index. 
Learning from many trajectories Example 4.9 (Degree-2 Volterra series). Consider the following process Px. Let {c(ℓ) i,j }i,j 0 for ℓ {1, . . . , n} be symmetric, traceless, non-degenerate arrays (Definition 4.3). Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. For t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: j=i c(ℓ) i,j w(ℓ) t i 1w(ℓ) t j 1. (4.5) Let knd N+ denote the smallest non-degeneracy index for all n arrays. There is a universal positive constant c such that for any T and k satisfying T k knd, Px satisfies the (T, k, Γk(Px), c, 1 4)-Traj SB condition. The assumptions pulled in from Definition 4.3 help simplify the construction of an almost sure lower bound for conditional covariances E[ 1 k Pjk t=(j 1)k+1 xtx T t | F(j 1)k], to establish that Example 4.9 satisfies the trajectory small-ball condition. We believe that generalizations to higher degree Volterra series with k strictly between 1 and T are possible by more involved calculations. Of course, many other examples are possible. To help in recognizing them, the following statement shows that condition (4.1) in the trajectory small-ball definition can be verified by separately establishing small-ball probabilities for the conditional distributions: Proposition 4.2 (Average small-ball implies trajectory small-ball). Fix T N+, k {1, . . . , T}, {Ψj} T/k j=1 Symn >0, and α, β (0, 1). Let Px be a covariate distribution, with {xt}t 1 adapted to a filtration {Ft}t 1. Suppose for all v Rn \{0} and j {1, . . . , T/k }: t=(j 1)k+1 Pxt Px n v, xt 2 α v TΨjv F(j 1)k o β a.s., (4.6) where F0 is the minimal σ-algebra. Then, for all v Rn \ {0}, j {1, . . . , T/k }, and ε (0, α) t=(j 1)k+1 v, xt 2 ε v TΨjv β 1 ε/α a.s. (4.7) An immediate corollary of Proposition 4.2 is the following: suppose that for all v Rn\{0}, j {1, . . . , T/k }, and ε > 0, t=(j 1)k+1 Pxt Px n v, xt 2 ε v TΨjv F(j 1)k o (csbε)α a.s. (4.8) Then, the (T, k, {Ψj} T/k j=1 , 21+1/αcsb, α)-Traj SB condition holds. 
Equation (4.8) can be easier to verify than (4.1), since the former allows one to reason about each conditional distribution individually, whereas the latter requires reasoning about the entire excitation window altogether. The following two sections present upper and lower bounds for learning from trajectories, involving various instances of the trajectory small-ball assumption where applicable. All main results are summarized in Table 1. Tu, Frostig, and Soltanolkotabi 5. Risk upper bounds The trajectory small-ball definition allows us to carve out conditions for learnability. A key quantity for what follows is the minimum eigenvalue of the ratio of two positive definite matrices: λ(A, B) := λmin(B 1/2AB 1/2), A, B Symn >0. (5.1) Our various upper bounds statements build on the following general lemma: Lemma 5.1 (General OLS upper bound). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-Traj SB condition (Definition 4.1). Put S := T/k and ΓT := ΓT (Px). Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and let µ({Ψj}S j=1, Γ) denote the geometric mean of the minimum eigenvalues {λ(Ψj, Γ)}S j=1, i.e., µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Then, for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1csbσ2 ξ pn m Tαλ(Γ, Γ )µ({Ψj}S j=1, Γ) log max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Lemma 5.1 is a general statement that highlights the interplay between the various trajectory small-ball parameters, and how they directly influence the resulting OLS risk. To develop some intuition for the lemma, consider Γ = ΓT . We see that the closer the quantities λ(Γ, ΓT ) (0, 1] and µ({Ψj}S j=1, Γ) (0, 1] are to one, the sharper the OLS risk (5.4). 
To satisfy this straightforwardly, we can set Ψj = Γ = ΓT , in which case λ(Γ, ΓT ) = µ({Ψj}S j=1, Γ) = 1, and the resulting OLS risk (5.4) simplifies (up to constant factors) to the iid rate σ2 ξ pn/(m T), treating α, csb as constants. However, in light of the trajectory small-ball condition (4.1), this is only generally possible at either end of the following spectrum: when Px encodes iid covariates (cf. Example 4.3), in which case we can take k = 1, and our data requirement (5.3) becomes m T n; or in the many trajectories setting (cf. Example 4.2) by setting k = T, in which our data requirement becomes m n. Learning from many trajectories When Px falls within these two ends, the excitation window k needs to be selected carefully to balance the data requirement (5.3) with the OLS risk (5.4). The optimal selection of k is heavily influenced by the growth rate of the covariance proxies {Ψj}. As a general rule, as k T, the covariance proxies {Ψj} tend to ΓT , but the rate at which this occurs is heavily dependent on Px. Furthermore, when the covariance proxies {Ψj} and the bound Γ are not equal to ΓT , the quantities λ(Γ, ΓT ), µ({Ψj}S j=1, Γ) are driven away from one, which adds extra factors in the OLS risk (5.4) over the iid rate. All the upper bounds we prove in this section are derived by carefully selecting the excitation window k based on the process Px, and quantifying the resulting overhead. The proof of Lemma 5.1 blends ideas from the analysis of random design linear regression (Hsu et al., 2014; Oliveira, 2016; Mourtada, 2022) with techniques from linear system identification with full state observation (Simchowitz et al., 2018; Sarkar and Rakhlin, 2019; Faradonbeh et al., 2018; Dean et al., 2020). Note that Lemma 5.1 makes no explicit assumptions on the ergodicity of the process Px. The role of Px is instead succinctly captured by the trajectory small-ball condition, together with the minimum eigenvalue quantities that appear in the bound. 
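The quantity λ(A, B) defined in (5.1) can be computed directly; the sketch below (ours, with a hypothetical helper name) uses a symmetric square root of B:

```python
import numpy as np

def lam(A, B):
    """lambda(A, B) = lambda_min(B^{-1/2} A B^{-1/2}) for A, B positive
    definite, as in Eq. (5.1). Equals 1 when A = B, and lies in (0, 1]
    whenever A <= B in the Loewner order."""
    evals, evecs = np.linalg.eigh(B)
    B_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # eigvalsh returns eigenvalues in ascending order; take the smallest.
    return float(np.linalg.eigvalsh(B_inv_sqrt @ A @ B_inv_sqrt)[0])

# Usage: lam(Gamma, Gamma_T) = 1 when Gamma = Gamma_T, and shrinks as
# Gamma moves away from Gamma_T in the Loewner order.
Gamma_T = np.array([[2.0, 0.5], [0.5, 1.0]])
```

This makes the intuition after Lemma 5.1 easy to probe numerically: scaling Γ down by a factor c scales λ(Γ, Γ_T) down by the same factor, loosening the bound (5.4) proportionally.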
The proof of Lemma 5.1 also yields, with some straightforward modifications, bounds on the risk that hold with high probability; we present bounds in expectation only for simplicity. Finally, if the squared norm ‖X‖²_M is defined to be λ_max(X M Xᵀ) instead of tr(X M Xᵀ), then (5.4) holds with the expression p + n replacing pn in the numerator. As long as the process P_x satisfies the trajectory small-ball condition with excitation window k = T, Lemma 5.1 (with Ψ_1 = Γ = Γ_T(P_x)) immediately yields the following result for learning from many trajectories in the Seq-LS problem:

Theorem 5.2 (Upper bound for Seq-LS, many trajectories). There are universal positive constants c_0 and c_1 such that the following holds. Suppose that P_x satisfies the trajectory small-ball condition (Definition 4.1) with parameters (T, T, Γ_T(P_x), c_sb, α). If:

m ≥ (c_0 n / α) log( max{e, c_sb} / α ),

then, for any Γ′ ∈ Sym^n_{>0}:

E[‖Ŵ_{m,T} − W‖²_{Γ′}] ≤ c_1 c_sb σ_ξ² · pn / (m T α λ(Γ_T(P_x), Γ′)) · log( max{e, c_sb} / α ). (5.5)

This result provides the upper bound for the summary statement Theorem 1.1. To interpret the bound (5.5), suppose that c_sb and α are universal constants. Then, the requirement on m simplifies to m ≳ n. Under any strict evaluation horizon T′ ≤ T, taking Γ′ = Γ_{T′}(P_x), the risk E[L(Ŵ_{m,T}; T′, P_x)] scales as σ_ξ² pn/(mT′). The lower bound for Theorem 1.1 follows from the fact that iid linear regression is a special case of Seq-LS. Meanwhile, to obtain guarantees for parameter recovery, consider taking Γ′ = I_n. Then Theorem 5.2 implies that the parameter error E[‖Ŵ_{m,T} − W‖²_F] scales as σ_ξ² pn/[m T λ_min(Γ_T(P_x))]. Note that operator norm bounds on parameters also hold, with the expression p + n replacing pn in the bound.
Problem     | Many? | Upper                       | Lower                    | Assumptions
Seq-LS      | Y     | pn/(mT)                     | ✓                        | Traj SB (a)
Seq-LS      | N     | pn/(mT)                     | e^{-T/(τ_mix n)} + pn/T  | Ergodicity of covariates (b)
LDS-LS      | Y     | pn/(mT)                     | ✓                        | k_c-step controllability (c)
LDS-LS      | N     | γ·pn²/(m²T)                 | ✓                        | Marginal stability, etc. (d)
Ind-Seq-LS  | N     | pn/(mT)                     | ✓                        | Small-ball, poly variance growth (e)
Ind-LDS-LS  | N     | γ·pn/(mT)                   | pn/(mT)                  | Diagonalizable (f)
LDS-Sys ID  | Y     | n²/(mT·λ_min(Γ_T))          | -                        | Same as LDS-LS (Y) (g)
LDS-Sys ID  | N     | γ·n²/(mT·λ_min(Γ_{mT/n}))   | n²/T²                    | Same as LDS-LS (N) (h)

Table 1: Summary of main results presented in Section 5 and Section 6. All upper/lower bounds shown suppress constant and polylogarithmic factors. The "Many?" column indicates whether the bounds apply in the many trajectories regime (Y), where m ≳ n, or the few trajectories regime (N), where m ≲ n. A checkmark (✓) in the Lower column indicates that the lower bound matches the upper bound, up to polylogarithmic factors. The LDS-Sys ID problem is classic linear system identification: recover the unknown dynamics matrix A from linear dynamical trajectories, with error measured in squared Frobenius norm (see Equation (3.8) and the discussion at the end of Section 3.4). Elaborations on assumptions: (a) Upper bound follows from Theorem 5.2, treating as O(1) all constants related to trajectory small-ball (Definition 4.1). Lower bound follows from iid linear regression being a special case. (b) Upper bound follows from combining (i) Lemma B.23, a general OLS high probability upper bound that utilizes a simple modification (Definition B.1) to our trajectory small-ball definition, with (ii) Proposition B.24, which shows that a φ-mixing (Definition B.2) covariate process (where the marginal distributions also fulfill a weak small-ball condition (B.29)) satisfies our modified trajectory small-ball condition. The upper bound holds with high probability instead of in expectation, and requires a burn-in time that satisfies T ≳ τ_mix n log(1/δ), where δ denotes the failure probability. The lower bound is from Bresler et al.
(2020, Theorems 1 and 3), and holds for the single trajectory (m = 1). (c) Upper bound follows from Theorem 5.5; see Definition 4.2 for definition of kc-step controllability. Lower bound follows again from iid linear regression being a special case. (d) Upper bound follows from Theorem 5.6, under Assumption 5.1 (marginal stability), Assumption 5.2 (diagonalizability), and Assumption 5.3 (one-step controllability). The condition number of the diagonalizing factor is denoted by γ (Definition 5.1). Lower bound follows from Theorem 6.3, and is realized by a decoupled noise sequence (Definition 7.1). (e) Upper bound follows from Theorem 5.3, treating as O(1) constants relating to small-ball (Equation (5.7)) and variance growth (Equation (5.8)). The necessity of the variance growth condition is shown in Theorem 6.2. Note that in the many trajectories regime, the Seq-LS upper bound applies. Lower bound again follows from iid linear regression. (f) Upper bound follows from Theorem 5.7; γ is the condition number of the diagonalizing factor (Definition 5.1). Lower bound again follows from iid linear regression. (g) Upper bound follows from Theorem 5.4; ΓT is the T-step average covariance matrix (Equation (3.3)). Lower bound is marked with a dash indicating that Theorem 5.6 does not directly apply to LDS-Sys ID (cf. the discussion following Definition 7.1) (h) Upper bound follows from Theorem 5.8; γ is the condition number of the diagonalizing factor (Definition 5.1), and Γm T/n is the m T/n-step average covariance matrix (Equation (3.3)). Lower bound applies to the single trajectory (m = 1) setting and follows from Simchowitz et al. (2018, Theorem 2.3). (Technically, their bound applies to the operator, instead of Frobenius norm, but the proof can be adjusted to apply.) Specializing the upper bound to one trajectory (m = 1), drawn from the lower bound s hard instance, implies a gap of n3/T 2 (upper) versus n2/T 2 (lower) as noted in Simchowitz et al. (2018, Section 2.2). 
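To make the many-trajectories rate concrete, the following minimal simulation (a hypothetical sketch: the stable dynamics matrix, dimensions, and noise level are illustrative choices of ours, not from the paper) pools examples from m independent trajectories and fits the parameter by ordinary least squares, mirroring the σ²_ξ pn/(mT) behavior summarized in the first row of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, T, sigma = 5, 2, 200, 20, 0.1
A = 0.9 * np.eye(n)                    # illustrative stable covariate dynamics
W = rng.standard_normal((p, n))        # ground-truth parameter

X_rows, Y_rows = [], []
for _ in range(m):                     # m independent trajectories
    x = np.zeros(n)
    for _ in range(T):                 # each of length T
        x = A @ x + rng.standard_normal(n)
        X_rows.append(x.copy())
        Y_rows.append(W @ x + sigma * rng.standard_normal(p))
X = np.array(X_rows)                   # (m*T, n) pooled design matrix
Y = np.array(Y_rows)                   # (m*T, p) pooled responses

# Pooled OLS: minimize ||Y - X W^T||_F^2 over all m*T examples at once
W_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
err = np.linalg.norm(W_hat - W) ** 2   # squared Frobenius parameter error
print(err)                             # roughly sigma^2 * p * n / (m * T) in scale
```

Doubling either m or T roughly halves the error in this sketch, reflecting that all m·T pooled examples contribute.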
Lemma 5.1 also yields a bound for Ind-Seq-LS, assuming polynomial growth of the time-t covariances Σt (3.3). To state the result, let φ : [1, ∞) × [0, ∞) → [1, ∞) be defined as:

    φ(a, x) := 1 if x ≤ 1, and φ(a, x) := ax otherwise.    (5.6)

Note that 1 ≤ φ(a, x) ≤ max{ax, 1}.

Theorem 5.3 (Upper bound for Ind-Seq-LS). There are universal positive constants c0 and c1 such that the following holds. Fix any sequence of distributions {P_{x,t}}_{t≥1}, and let Σt := E_{xt∼P_{x,t}}[xt xtᵀ] for t ∈ N+. Suppose there exist c_sb > 0 and α ∈ (0, 1] such that for all v ∈ Rⁿ \ {0}, ε > 0, and t ∈ N+:

    P_{xt∼P_{x,t}}{ ⟨v, xt⟩² ≤ ε vᵀΣt v } ≤ (c_sb ε)^α.    (5.7)

Furthermore, suppose there exist c_β ≥ 1 and β ≥ 0 such that for all s, t ∈ N+ satisfying s ≤ t:

    1/λ(Σs, Σt) ≤ c_β (t/s)^β.    (5.8)

If n ≥ 2 and m T ≥ c0 n (β + log max{e, c_sb c_β}), then, for Px = ⊗_{t≥1} P_{x,t}:

    E[L(Ŵ_{m,T}; T', Px)] ≤ c1 c_sb σ²_ξ c_β e^β · (pn/(m T α)) · φ(c_β(β + 1), (T'/T)^β) · (β + log max{e, c_sb c_β}).    (5.9)

Consider specializing Theorem 5.3 to the case when Σt = Σ for all t ∈ N+. Doing so yields random design linear regression from m T covariates drawn iid from P_{x,1}. The growth condition (5.8) is trivially satisfied with c_β = 1 and β = 0. The small-ball assumption (5.7) simplifies to P_{x1∼P_{x,1}}{ |⟨v, x1⟩| ≤ ε ‖v‖_Σ } ≤ (√c_sb · ε)^{2α} for all v ≠ 0 and ε > 0, which matches Mourtada (2022, Assumption 1) up to a minor redefinition of the constants c_sb, α. Treating c_sb and α as constants, the conclusion of Theorem 5.3 in this setting is that E[‖Ŵ_{m,T} − W‖²_Σ] ≲ σ²_ξ pn/(m T) as long as n ≥ 2 and m T ≳ n, which recovers Mourtada (2022, Proposition 2). On the other hand, Theorem 5.3 does not require that the covariates are drawn iid from the same distribution, allowing the time-t covariances Σt to grow polynomially. As an example, suppose that Σt = t^β I_n for some β > 0. In this case, 1/λ(Σs, Σt) = (t/s)^β, so we can take c_β = 1 in (5.8). Again treating c_sb and α as constants and taking T' ≤ T, we have E[L(Ŵ_{m,T}; T', Px)] ≲ σ²_ξ β e^β pn/(m T) as long as m T ≳ βn.
If β is also considered a constant, then we further have the risk bound E[L(Ŵ_{m,T}; T', Px)] ≲ σ²_ξ pn/(m T). This matches the minimax rate for iid linear regression.

It is natural to ask if the covariance growth condition (5.8) is needed under strict evaluation horizons T' ≤ T.⁴ In Section 6, we show that if the covariances are set to Σt = 2^t I_n and P_{x,t} = N(0, Σt) (satisfying (5.7)), then the minimax risk R(m, T, T; {⊗_{t≥1} P_{x,t}}) must scale at least as 2^{cn/m}/T whenever m ≲ n, for some positive constant c. Sub-exponential growth rates are therefore necessary for polynomial sample complexity. Determining the optimal dependence on β in (5.9) is left to future work.

Note that Theorem 5.3 is most interesting either when trajectories are few (m ≲ n) or evaluations are extended (T' > T). When m ≳ n and T' ≤ T, one can usually apply Theorem 5.2 with Γ' = Γ_{T'}(Px) instead, and avoid placing any requirements on the growth of covariances.

Considering any of the small-ball examples in Section 4.1, recall that when the excitation window k and the horizon T are equal, Theorem 5.2 provides an upper bound on the risk of OLS estimation for the corresponding Seq-LS problem. Specifically, for Example 4.1 and Example 4.6 with k = T, if T' = T and trajectories are abundant (m ≳ n), then the OLS estimator's rate σ²_ξ pn/(m T) matches its behavior in iid linear regression. Meanwhile, for the degree-D Volterra series (Example 4.8), we require that m ≳ c_D n, and the OLS risk bound scales as σ²_ξ c'_D pn/(m T), for constants c_D and c'_D that only depend on D.

In order to cover scenarios in which trajectories may be relatively scarce, namely m ≲ n, we need additional structure. More technically, when the small-ball condition is satisfied with k < T, one needs to further control the various eigenvalues that appear in Lemma 5.1 in order to bound the risk of OLS.
Specifically for Ind-Seq-LS, a covariate growth assumption suffices: Example 4.3 combined with Theorem 5.3 yields an OLS risk bound. Furthermore, both Example 4.4 and Example 4.5 can be immediately combined with Lemma 5.1, since the matrices Ψj in these examples are bounded above and below by Γ_T(Px) up to universal constant factors. But arbitrarily large risk can still be realized in the general Seq-LS problem, even when the trajectory small-ball condition is satisfied. To study the behavior of OLS across all regimes of trajectory count m, example dimensions p and n, and trajectory lengths T and T', we focus specifically on linear dynamical systems and the LDS-LS problem for our remaining upper bounds.

5.1 Upper bounds for linear dynamical systems

In this section, we focus exclusively on dynamics P^{A,B}_x described by a linear dynamical system (3.8). As discussed previously, in order to apply Lemma 5.1 in the few trajectories regime when m ≲ n (or when m ≳ n and T' > T), we must (a) show that the process P^{A,B}_x satisfies the trajectory small-ball condition, and (b) bound the various eigenvalues which appear in Lemma 5.1. Example 4.6 establishes that P^{A,B}_x satisfies the (T, k, Γ_k(A, B), e, 1/2)-Traj-SB condition as long as (A, B) is k_c-step controllable and k ≥ k_c, thus taking care of (a). To handle (b), we introduce additional assumptions on the dynamics matrices (A, B):

Assumption 5.1 (Marginal instability). The dynamics matrix A in LDS-LS is marginally unstable. That is, ρ(A) ≤ 1, where ρ(A) denotes the spectral radius of A.

4. Some regularity is needed under extended evaluations T' > T; otherwise the risk could be arbitrarily large.

Assumption 5.2 (Diagonalizability). The dynamics matrix A in LDS-LS is complex diagonalizable as A = S D S⁻¹, where S ∈ C^{n×n} is invertible and D ∈ C^{n×n} is a diagonal matrix comprising the eigenvalues of A.

Assumption 5.3 (One-step controllability).
The control matrix B in LDS-LS has full row rank, i.e., rank(B) = n. Equivalently, the pair (A, B) is one-step controllable (Definition 4.2).

Assumption 5.1 is fairly standard in the literature. Going beyond the regime ρ(A) = 1 + ε, where ε ≲ 1/T, requires additional technical assumptions on the dynamics matrix A that we choose to avoid in the interest of simplicity; the OLS estimator is in general not a consistent estimator when ρ(A) > 1 and m = 1 (cf. Phillips and Magdalinos (2013); Sarkar and Rakhlin (2019)). The condition ρ(A) ≤ 1 is often referred to as marginal stability in other work. We choose to call it marginally unstable instead, to emphasize the fact that such systems, namely at ρ(A) = 1, may not be ergodic and that the state can grow unbounded (e.g. have magnitude roughly t^n at time t).

Diagonalizability (Assumption 5.2) is less standard in the literature. We use it together with Assumption 5.1 and Assumption 5.3 to establish that λ(k, t; A, B) := λ(Γ_k(A, B), Γ_t(A, B)) ≥ c·k/t whenever k ≤ t, where c is a constant that depends only on A and B (and not on k and t). In previous work on linear system identification, the term λ(k, t; A, B) only appears under a logarithm, and so coarser analyses in the general case can still establish polynomial rates (cf. Simchowitz et al. (2018, Proposition A.1) and Sarkar and Rakhlin (2019, Proposition 7.6)).⁵ However, by allowing for evaluation lengths T' > 1, the dependence on λ(k, t; A, B) is no longer entirely confined under a logarithm (cf. Lemma 5.1). A sharp characterization is hence critical for deriving optimal rates. In Appendix A, we conjecture the correct scaling of λ(k, t; A, B) as a function of the ratio k/t and the largest Jordan block size of A, based on numerical simulation.

One-step controllability (Assumption 5.3) is also an assumption commonly made in linear system identification. It is clear that some form of controllability is needed, otherwise learning may be impossible (e.g.
consider the extreme case of B = 0). General multi-step controllability does not suffice either: Tsiamis and Pappas (2021, Theorem 2) show that under a single trajectory (m = 1), n-step controllability (where n is the state dimension) does not ensure finite risk, and even a more robust controllability definition (Tsiamis and Pappas, 2021, Definition 3) cannot ensure risk bounds better than exponential in the dimension n. Considering these barriers, we simply choose to rely on one-step controllability in the few-trajectory setting (m ≲ n). Finally, we introduce a condition number quantity that will feature commonly in our bounds:

Definition 5.1. For dynamics matrices (A, B) in LDS-LS satisfying Assumption 5.2 and Assumption 5.3, the condition number γ(A, B) is defined as:

    γ(A, B) := λmax(S⁻¹ B Bᵀ S⁻*) / λmin(S⁻¹ B Bᵀ S⁻*).

Here, the matrix S diagonalizes A, as defined in Assumption 5.2.

5. Note, however, that without diagonalizability, Simchowitz et al. (2018, Corollary A.2) can only guarantee a √(n²/T) rate for the operator norm of the parameter error in general, and this is likely not optimal.

5.1.1 Many trajectories

Our first result instantiates Theorem 5.2 in the special case of Γ' = I_n, which yields a sharp bound for parameter recovery without requiring stability of the dynamics matrix A:

Theorem 5.4 (Parameter recovery upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is k_c-step controllable. If n ≥ 2, m ≥ c0 n, and T ≥ k_c, then:

    E[‖Ŵ_{m,T} − W‖²_F] ≤ c1 σ²_ξ · pn / (m T λmin(Γ_T(A, B))).

Theorem 5.4 improves on existing linear system identification results in the following way: it replaces stability assumptions on the dynamics matrix A with a simpler assumption of relatively many trajectories (m ≳ n), and it guarantees a rate that is inversely proportional to the total number of examples m T, instead of only one example per trajectory.
In other words, our analysis does not need to discard the data within a trajectory, as is the case in Dean et al. (2020, Proposition 1.1). Additionally, although OLS is generally not a consistent estimator from one trajectory (m = 1) if the dynamics A are unstable, the results of Dean et al. (2020) imply consistency as m → ∞, i.e., that Ŵ_{m,T} converges in probability to W as m → ∞. Theorem 5.4 adds that, provided m ≳ n, OLS is consistent under unstable systems as T → ∞ as well, even if the trajectory count m remains finite. We will return to parameter recovery from relatively few trajectories (m ≲ n) by this section's end.

We now look beyond an evaluation horizon of length one, and consider the setting with many trajectories (m ≳ n). As noted previously, in order to handle an arbitrary evaluation horizon T' (in particular those that extend past the training horizon T), some constraint on the admissible dynamics matrices is needed to ensure that the minimax risk remains finite. Without assumptions, the quantity λ(Γ_T(A, B), Γ_{T'}(A, B)), whose inverse inevitably bounds the risk (3.2) from below, can be arbitrarily small whenever T' > T, resulting in arbitrarily large risk. We will use our stated assumptions from the beginning of this section. The following specializes Theorem 5.2 to LDS-LS:

Theorem 5.5 (Risk upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is k_c-step controllable. If n ≥ 2, m ≥ c0 n, T ≥ k_c, and the evaluation horizon is strict (T' ≤ T), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c1 σ²_ξ · pn/(m T).

On the other hand, suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2, m ≥ c0 n, and the evaluation horizon is extended (T' > T), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c1 γ σ²_ξ · (pn/(m T)) · (T'/T).

Setting T' = T, Theorem 5.5 states that the risk of LDS-LS in the many trajectories regime satisfies E[L(Ŵ_{m,T}; T, P^{A,B}_x)] ≲ σ²_ξ pn/(m T).
This rate matches the corresponding independent baseline Ind-LDS-LS in the many trajectories regime. To see this, first observe that the marginal distribution P^{A,B}_{x,t} at time t ∈ N+ is N(0, Σ_t(A, B)). Hence, the covariate distribution for Ind-LDS-LS corresponds to the product distribution ⊗_{t≥1} N(0, Σ_t(A, B)), which is an instance of a Gaussian process. Therefore, Example 4.2 combined with Theorem 5.2 yields that the Ind-LDS-LS problem also has a risk bound that scales as σ²_ξ pn/(m T) whenever m ≳ n. Put differently, the dependent structure of the covariate distribution P^{A,B}_x in LDS-LS does not add any statistical overhead to the learning problem (compared to the independent learning problem Ind-LDS-LS), as long as m ≳ n.

5.1.2 Few trajectories

We now cover the regime in which relatively few training trajectories are available (m ≲ n). Our first result bounds the OLS risk for the LDS-LS problem:

Theorem 5.6 (Risk upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2, m ≤ c0 n, and m T ≥ c1 n log(max{γn/m, e}), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c2 σ²_ξ · (pn log(max{γn/m, e}) / (m T)) · φ(γ, c1 n log(max{γn/m, e})/m).

To interpret Theorem 5.6, consider γ a constant and suppose that T' = T. Then Theorem 5.6 states that E[L(Ŵ_{m,T}; T, P^{A,B}_x)] ≲ σ²_ξ · (pn/(m T)) · n log²(n/m)/m. We now see that this LDS-LS risk is an extra n log²(n/m)/m factor larger than the risk of the baseline problem Ind-LDS-LS:

Theorem 5.7 (Risk upper bound for Ind-LDS-LS). There are universal positive constants c0 and c1 such that the following holds for any instance of Ind-LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1).
If n ≥ 2 and m T ≥ c0 n log(max{γ, e}), then:

    E[L(Ŵ_{m,T}; T', ⊗_{t≥1} P^{A,B}_{x,t})] ≤ c1 σ²_ξ · pnγ log(max{γ, e}) / (m T).

Treating γ as a constant and setting T' = T, Theorem 5.7 states that the Ind-LDS-LS risk E[L(Ŵ_{m,T}; T, ⊗_{t≥1} P^{A,B}_{x,t})] scales as σ²_ξ pn/(m T), matching the risk of iid linear regression up to constant factors. In Section 6, we will see that the result of Theorem 5.6 is sharp up to constants and γ, and therefore the LDS-LS problem is fundamentally more difficult than its corresponding baseline problem Ind-LDS-LS when trajectories are relatively scarce. As an aside: the dependence of both Theorem 5.6 and Theorem 5.7 on the condition number γ is likely not optimal, and we leave sharpening this dependence to future work.

We conclude with our final upper bound, using our assumptions to generalize Simchowitz et al. (2018, Theorem 2.1) to the few-trajectory setting:

Theorem 5.8 (Parameter recovery upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2 and m T ≥ c0 n log(max{γn/m, e}), then:

    E[‖Ŵ_{m,T} − W‖²_F] ≤ c1 σ²_ξ · pn log(max{γn/m, e}) / (m T λmin(Γ_{k⋆}(A, B))),  where  k⋆ := c2 m T / (n log(max{γn/m, e})).

Theorem 5.8 complements Theorem 5.4; together they cover parameter recovery across all problem regimes. Again, operator norm bounds also hold with p + n in place of pn.

5.2 Comparison to learning from trajectories of multiple unknown systems

As mentioned in Section 2, Chen and Poor (2022) and Modi et al. (2022) both study the setup where a learner observes multiple independent trajectories from K different unknown linear dynamical systems. The task is to identify the parameters of the K underlying systems. This is more general than the setting we consider, which is recovered by fixing K = 1.
However, specializing these rates to our setting yields either unnecessary requirements, suboptimal bounds, or both. To see this, first, if we specialize Chen and Poor (2022, Theorem 1) to our setup, we inherit unnecessary assumptions. Specifically, their Theorem 1 requires strict stability, one-step controllability, and m T ≥ max{n³, 1/(1 − ρ)}, where ρ is the spectral radius of A. In comparison, Theorem 5.4 only requires k_c-step controllability, T ≥ k_c, and m ≳ n. However, note that their Theorem 1, like Theorem 5.4, does have the property that the parameter error (in operator norm) scales as √(n/(m T)), reflecting that all collected datapoints contribute to reducing error. Next, we specialize Modi et al. (2022, Theorem 2). Their Theorem 2 bounds the error of an estimation procedure which outputs m different estimates {Âᵢ}ᵢ₌₁^m, one for each observed trajectory (cf. Eq. (3)). Specifically, it gives an upper bound on the quantity (1/m) Σᵢ₌₁^m ‖Âᵢ − Aᵢ‖²_F, where Aᵢ is the dynamics matrix associated with the i-th trajectory. To specialize this to our setting, we average the estimates and apply Jensen's inequality followed by Theorem 2. This yields the bound ‖Â − A‖²_F ≲ 1/T + n²/(m T), where Â := (1/m) Σᵢ₌₁^m Âᵢ is the averaged estimate. We see that, for a fixed T, as m → ∞, the rate tends to 1/T instead of zero (compared with the n²/(m T) bound from Theorem 5.4). Additionally, Theorem 2 requires both that the dynamics are one-step controllable and that the spectral radius of A is bounded by 1 + O(1/T).

6. Risk lower bounds

Our lower bounds rely on the following statement, that the expected trace inverse covariance, a classic quantity in asymptotic statistics, bounds the minimax risk from below:

Lemma 6.1 (Expected trace of inverse covariance bounds risk from below). Fix m, T, T' ∈ N+ and a set of covariate distributions Px. Suppose that for every Px ∈ Px, the data matrix X_{m,T} ∈ R^{mT×n} drawn from ⊗ᵢ₌₁^m Px has full column rank almost surely.
The minimax risk R(m, T, T'; Px) satisfies:

    R(m, T, T'; Px) ≥ σ²_ξ p · sup_{Px ∈ Px} E_{⊗ᵢ₌₁^m Px}[ tr( Γ^{1/2}_{T'}(Px) (Xᵀ_{m,T} X_{m,T})⁻¹ Γ^{1/2}_{T'}(Px) ) ].    (6.1)

Lemma 6.1 is well known, possibly considered folklore; we state and prove it for completeness. Our proof is inspired by a recent argument from Mourtada (2022). It smooths over problem instances according to a Gaussian prior, and analytically characterizes the posterior distribution of the parameter W under a simple Gaussian observation model detailed in Section 7.2. Note that the presence of the expected value outside of the trace inverse, on the right-hand side of (6.1), is critical in proving our separation results. Using Jensen's inequality, one could move the expectation under the inverse, precisely recovering the classic Cramér-Rao lower bound (CRLB) for the variance of unbiased estimators. However, for least-squares regression problems, the CRLB yields (when T' = T) the familiar σ²_ξ pn/(m T) rate of iid linear regression. This is not enough for our purposes, and so we must analyze the more complex expected-trace-inverse quantity appearing in (6.1).

Our first lower bound underscores the need to make variance growth assumptions (5.8), in Theorem 5.3, for Ind-Seq-LS in the few trajectories (m ≲ n) regime:

Theorem 6.2 (Need for growth assumptions in Ind-Seq-LS when m ≲ n). There exist universal constants c0, c1, and c2 such that the following holds. Suppose that Px = ⊗_{t≥1} N(0, 2^t I_n), n ≥ 6, m T ≥ n, and m ≤ c0 n. Then:

    R(m, T, T; {Px}) ≥ c1 σ²_ξ p · 2^{c2 n/m} / T.

Theorem 6.2 states that if the variances Σt are allowed to grow exponentially in t, then the minimax risk of Ind-Seq-LS scales exponentially in n/m when m ≲ n. Thus, some sub-exponential growth assumption is necessary in order to have the risk scale polynomially in n/m. We now turn to a lower bound for LDS-LS. We consider two particular hard instances for the LDS-LS dynamics matrices (A, B), where we set B = I_n and vary A.
The first instance corresponds to iid covariates, i.e., A = 0_{n×n}. The second instance corresponds to an isotropic Gaussian random walk, i.e., A = I_n. These two hard instances satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Together they show that our upper bounds are sharp up to logarithmic factors, treating the condition number γ(A, B) from Definition 5.1 as a constant:

Theorem 6.3 (Risk lower bound). There are universal positive constants c0, c1, and c2 such that the following holds. Recall that P^{I_n}_x (resp. P^{0_{n×n}}_x) denotes the covariate distribution for a linear dynamical system with A = I_n and B = I_n (resp. A = 0_{n×n} and B = I_n). If T ≥ c0, n ≥ c1, and m T ≥ n, then:

    R(m, T, T'; {P^{0_{n×n}}_x, P^{I_n}_x}) ≥ c2 σ²_ξ · (pn/(m T)) · max{ nT'/(m T), T'/T, 1 }.

We can interpret this lower bound by a breakdown of ϕ := max{nT'/(m T), T'/T, 1} across various regimes. When trajectories are limited (m ≲ n), ϕ ≍ max{nT'/(m T), 1}, and therefore the minimax risk is bounded below by σ²_ξ · (pn/(m T)) · max{nT'/(m T), 1}. This is the same rate prescribed by the OLS upper bound of Theorem 5.6, up to the condition number γ(A, B) and logarithmic factors in n/m. We have thus justified the summary statement Theorem 1.2. On the other hand, under many trajectories (m ≳ n), ϕ ≍ max{T'/T, 1} and the minimax risk is bounded below by σ²_ξ · (pn/(m T)) · max{T'/T, 1}. By Theorem 5.5, the OLS risk is bounded above by the same quantity times γ(A, B), justifying the summary statement Theorem 1.3.

7. Key proof ideas

In this section, we highlight some of the key ideas behind our results. Proofs of the upper bounds are in Appendix B, and proofs of the lower bounds are in Appendix C.

Additional notation. For r ∈ N+ and M ∈ R^{n×n}, let J_r ∈ R^{r×r} denote the Jordan block of size r with ones along its diagonal, let BDiag(M, r) ∈ R^{nr×nr} denote the block diagonal matrix with diagonal blocks M, and let BToep(M, r) ∈ R^{nr×nr} denote the block Toeplitz matrix with first block column (I_n, Mᵀ, …, (M^{r−1})ᵀ)ᵀ.
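As a concrete companion to this notation, the following small sketch (the helper names are our own, not the paper's) constructs J_r, BDiag(M, r), and BToep(M, r):

```python
import numpy as np

def jordan_block(r):
    # J_r: size-r Jordan block with ones on the diagonal (and superdiagonal)
    return np.eye(r) + np.diag(np.ones(r - 1), k=1)

def bdiag(M, r):
    # BDiag(M, r): block diagonal matrix with r copies of M on the diagonal
    n = M.shape[0]
    out = np.zeros((n * r, n * r))
    for i in range(r):
        out[i*n:(i+1)*n, i*n:(i+1)*n] = M
    return out

def btoep(M, r):
    # BToep(M, r): lower block Toeplitz matrix whose first block column
    # stacks I, M, M^2, ..., M^{r-1}
    n = M.shape[0]
    out = np.zeros((n * r, n * r))
    P = np.eye(n)
    for k in range(r):              # k-th block subdiagonal holds M^k
        for i in range(k, r):
            out[i*n:(i+1)*n, (i-k)*n:(i-k+1)*n] = P
        P = M @ P
    return out
```

For instance, btoep(jordan_block(2), 3) is a 6 x 6 matrix whose (i, j) block, for i ≥ j, equals J_2 raised to the power i − j.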
7.1 Upper bounds

The proof of Lemma 5.1 decomposes the risk using a standard basic inequality, which we now describe. While Lemma 5.1 is stated quite generally, for simplicity of exposition we restrict ourselves in this section to the case when the matrix parameters {Ψj}ⱼ₌₁^S in Definition 4.1 are all set to Γ. Under this simplification, we have that µ({Ψj}ⱼ₌₁^S, Γ) = 1. Equation (3.1) yields the identity Y_{m,T} = X_{m,T} Wᵀ + Ξ_{m,T}. Plugging this relationship into the formula (3.6) for Ŵ_{m,T} gives Ŵ_{m,T} − W = Ξᵀ_{m,T} X_{m,T} (Xᵀ_{m,T} X_{m,T})⁻¹. Define the whitened version of X_{m,T} as X̃_{m,T} := X_{m,T} Γ^{−1/2}. From these definitions and after some basic manipulations, for any Γ' ∈ Sym^n_{>0}:

    ‖Ŵ_{m,T} − W‖²_{Γ'} ≤ min{n, p} · ‖(X̃ᵀ_{m,T} X̃_{m,T})^{−1/2} X̃ᵀ_{m,T} Ξ_{m,T}‖²_op / ( λmin(X̃ᵀ_{m,T} X̃_{m,T}) · λ(Γ, Γ') ).    (7.1)

This decomposes the analysis into two parts: (a) upper-bounding the self-normalized martingale ‖(X̃ᵀ_{m,T} X̃_{m,T})^{−1/2} X̃ᵀ_{m,T} Ξ_{m,T}‖²_op, and (b) lower-bounding the term λmin(X̃ᵀ_{m,T} X̃_{m,T}). The analysis for the martingale term is fairly standard (cf. Abbasi-Yadkori et al., 2011, Corollary 1), so for the remainder of this section we focus on the minimum eigenvalue bound, which contains much of what is novel in our analysis.

We first demonstrate how the trajectory small-ball definition (Definition 4.1) can be used to establish pointwise convergence of the quadratic form χ(v) := Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² for v ∈ S^{n−1}, where x̃^{(i)}_t := Γ^{−1/2} x^{(i)}_t is a whitened state vector. Specifically, we show that for a fixed v ∈ S^{n−1}, the probability of the event {χ(v) ≤ ψε} is small, for ψ and ε to be specified. The key idea is that for any non-negative random variable X satisfying P{X ≤ ε} ≤ (cε)^α for all ε > 0, the moment generating function satisfies E[exp(−ηX)] ≤ (c/η)^α for all η > 0 (cf. Proposition B.4). Hence, by condition (4.1) from Definition 4.1, for any η > 0:

    E[ exp( −η Σ_{t=(j−1)k+1}^{jk} ⟨v, x̃^{(i)}_t⟩² ) | F^{(i)}_{(j−1)k} ] ≤ (2c_sb/(ηk))^α  a.s.,  i = 1, …, m,  j = 1, …, S.
By a Chernoff bound, the tower property of conditional expectation, and the independence of the trajectories {x̃^{(i)}_t}_{t≥1} and {x̃^{(i')}_t}_{t≥1} when i ≠ i', for any ζ > 0:

    P{ Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ≤ ζ } ≤ inf_{η≥0} e^{ηζ} E[ exp( −η Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ) ] ≤ inf_{η≥0} e^{ηζ} (2c_sb/(ηk))^{mSα} = exp( mSα − mSα log( mTα/(2c_sb ζ) ) ).

Now with the change of variables t := log( mTα/(2c_sb ζ) ) − 1, we obtain:

    P{ Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ≤ (mTα/(2c_sb)) e^{−(t+1)} } ≤ exp(−mSα t),  t > 0.    (7.2)

The key upshot of (7.2) is that it controls the tail probability at all scales. This control is needed in order to bound the expected value of (7.1) by integration. At this point, it remains to upgrade (7.2) from pointwise to uniform over S^{n−1}. A natural approach is to use standard covering and union bound arguments, as is done in Simchowitz et al. (2018). However, a straightforward covering argument yields unnecessary logarithmic factors in the covariate dimension n. In order to circumvent this issue, we utilize the PAC-Bayes argument from Mourtada (2022) (which is itself an extension of Oliveira (2016)) to establish uniform concentration. The details are given in Appendix B.4.

7.2 Lower bounds

7.2.1 Observation noise behind Lemma 6.1

Our definition of the minimax risk R(m, T, T'; Px) in (3.5) involves a supremum over the worst-case σ_ξ-sub-Gaussian MDS distribution that models the observation noise. The proof of Lemma 6.1 bounds this supremum from below by considering a noise model that decouples the observation noise {ξt}_{t≥1} from the randomness that drives the trajectory {xt}_{t≥1}:

Definition 7.1 (Gaussian observation noise). The Gaussian observation noise model holds when ξt ~ N(0, σ²_ξ I_p), ξt ⊥ ξ_{t'} if t ≠ t', and the process {ξt}_{t≥1} is independent from the process {xt}_{t≥1}.

Decoupling the noise processes orthogonalizes the two problems simultaneously present in Seq-LS: learning the dynamics of covariates and learning the responses from covariates. Definition 7.1 draws attention to the latter.
It will unfortunately preclude us from addressing linear system identification specifically with our lower bound, but it allows a sharp and simple characterization of the minimax risk in general. The proof of Lemma 6.1 is given in Appendix C.2.

7.2.2 An analysis of non-isotropic gramian matrices

A key technical challenge for our analysis lies in constructing a sharp lower bound on the expected trace inverse of a gramian matrix formed by random non-isotropic Gaussian vectors. Specifically, for integers q, n ∈ N+ with q ≥ n, and for a fixed positive definite matrix Σ ∈ Sym^q_{>0}, we are interested in a lower bound on the quantity E tr((WᵀΣW)⁻¹), where W ∈ R^{q×n} has iid N(0, 1) entries. The matrix WᵀΣW is equal in distribution to the gramian matrix Y ∈ R^{n×n} of the vectors g1, …, gn ∈ R^q, which are drawn iid from N(0, Σ), i.e., Y_{ij} = ⟨g_i, g_j⟩. The main tool we use to analyze E tr((WᵀΣW)⁻¹) is the convex Gaussian min-max theorem (CGMT) from Thrampoulidis et al. (2014), which allows us to bound the expected trace inverse from below by studying a two-dimensional min-max game that is more amenable to analysis. The key idea is to cast the expected trace inverse as a least-norm optimization problem, and apply the CGMT to the value of the optimization problem. We believe the following result to be of independent interest.

Lemma 7.1. Let q, n be positive integers with q ≥ n and n ≥ 2. Let W ∈ R^{q×n} have iid N(0, 1) entries, and let Σ ∈ R^{q×q} be positive definite. Let g ~ N(0, I_q) and h ~ N(0, I_{n−1}), with g and h independent. Also, let {e_i}ᵢ₌₁^q be the standard basis vectors in R^q. We have:

    E tr((WᵀΣW)⁻¹) ≥ n / ( Σᵢ₌₁^q E[ min_{β≥0} max_{τ≥0} ( −(β‖h‖₂)/τ + ‖βg − e_i‖²_{(Σ⁻¹ + β‖h‖₂ τ I_q)⁻¹} ) ] ).    (7.3)

The proof of Lemma 7.1 appears in Appendix C.4. We now discuss how to analyze the two-dimensional min-max game appearing in Lemma 7.1.
We first start by heuristically replacing it with a stylized problem, where the random quantities which appear in (7.3) are replaced by their expected scaling:

    SP(Σ, n) := Σᵢ₌₁^q min_{β≥0} max_{τ≥0} [ −(β√n)/τ + β² tr((Σ⁻¹ + β√n τ I_q)⁻¹) + ((Σ⁻¹ + β√n τ I_q)⁻¹)_{ii} ],    (7.4)

where we write ℓ_i(β, τ) for the i-th summand. While (7.4) is not a valid upper bound on the value of the min-max game appearing in (7.3), analyzing (7.4) is simpler and gives the correct intuition; we give a rigorous upper bound in Lemma C.9. We start by observing that if β = 0, then regardless of the choice of τ, ℓ_i(0, τ) = Σ_{ii}, and therefore Σᵢ₌₁^q ℓ_i(0, τ_i) = tr(Σ) for any {τ_i}ᵢ₌₁^q ⊂ R_{≥0}. On the other hand, if β > 0, then ℓ_i(β, τ) tends to −∞ as τ → 0+ and to 0 as τ → ∞. Therefore, if we can show that there exists a v ∈ (0, tr(Σ)) such that every set of points {(β_i, τ_i)}ᵢ₌₁^q ⊂ R²_{>0} satisfying:⁶

    ∂ℓ_i/∂β (β_i, τ_i) = ∂ℓ_i/∂τ (β_i, τ_i) = 0,  i = 1, …, q,    (7.5)

also satisfies v = Σᵢ₌₁^q ℓ_i(β_i, τ_i), then SP(Σ, n) = v. To uncover the critical points, we define the functions f and q_i, for i = 1, …, q, as:

    f(x) := −x√n + x² tr((Σ⁻¹ + x√n I_q)⁻¹),   q_i(x) := ((Σ⁻¹ + x√n I_q)⁻¹)_{ii}.

With these definitions, we can write ℓ_i(β, τ) = (1/τ²) f(βτ) + q_i(βτ). Calculating ∂ℓ_i/∂β (β, τ) = ∂ℓ_i/∂τ (β, τ) = 0 yields, for τ ≠ 0:

    ∂ℓ_i/∂τ (β, τ) = τ^{−2} f′(βτ) β − 2τ^{−3} f(βτ) + q_i′(βτ) β,    (7.6)

    ∂ℓ_i/∂β (β, τ) = τ^{−2} f′(βτ) τ + q_i′(βτ) τ.    (7.7)

The second condition (7.7) implies that q_i′(βτ) = −τ^{−2} f′(βτ). Plugging this condition into (7.6) implies that f(βτ) = 0, and hence ℓ_i(β, τ) = q_i(βτ) at the critical point (β, τ). We now study the positive roots of the equation f(x) = 0, or equivalently:

    x√n = x² tr((Σ⁻¹ + x√n I_q)⁻¹).

6. The conditions given in (7.5) are not in general necessary first-order optimality conditions for a nonconvex/nonconcave game (see e.g. Jin et al., 2020, Proposition 21). However, since for every β > 0 the function τ ↦ ℓ_i(β, τ) has only strictly concave stationary points (Proposition C.10), these conditions are necessary for this particular problem.
Using the variable substitution y := x√n, we have, when y > 0, the equivalent problem:

    ψ(y; Σ) := y tr((Σ⁻¹ + y I_q)⁻¹) = n.

Observe that ψ(0; Σ) = 0 and lim_{y→∞} ψ(y; Σ) = q. Furthermore, ψ(y; Σ) is continuous and monotonically increasing in y. Therefore, as long as q > n, there is exactly one ȳ ∈ (0, ∞) such that ψ(ȳ; Σ) = n, or equivalently there is exactly one x̄ ∈ (0, ∞) such that ψ(x̄√n; Σ) = n. Such a quantity x̄ supplies the curve of critical points Crit(x̄) := {(β, τ) ∈ R²_{>0} | βτ = x̄}. Note that Crit(x̄) is the set of critical points for every ℓ_i(β, τ), i = 1, …, q. Furthermore, for any (β⋆, τ⋆) ∈ Crit(x̄) and i ∈ {1, …, q}, we have that ℓ_i(β⋆, τ⋆) = q_i(β⋆τ⋆) = ((Σ⁻¹ + x̄√n I_q)⁻¹)_{ii}. Therefore:

    {(β_i, τ_i)}ᵢ₌₁^q ⊂ Crit(x̄)  ⟹  Σᵢ₌₁^q ℓ_i(β_i, τ_i) = tr((Σ⁻¹ + x̄√n I_q)⁻¹) ∈ (0, tr(Σ)),

and hence:

    SP(Σ, n) = √n/x̄,  with x̄ the solution to ψ(x̄√n; Σ) = n.    (7.8)

In light of (7.8), Lemma 7.1 then suggests that:

    E tr((WᵀΣW)⁻¹) ⪆ n / SP(Σ, n) = x̄√n,    (7.9)

where the notation ⪆ indicates the heuristic nature of replacing the expected min-max game appearing in the bound (7.3) with the approximation (7.4). If we briefly check (7.9) in the simple case when Σ = I_q, we see that:

    n = ψ(x̄√n; I_q) = x̄√n · q/(1 + x̄√n)  ⟹  x̄√n = (n/q)(1 + x̄√n)  ⟹  x̄√n = n/(q − n).

Hence, (7.9) yields that E tr((WᵀW)⁻¹) ⪆ n/(q − n) ≍ n/q, which is the correct scaling; the exact result is E tr((WᵀW)⁻¹) = n/(q − n − 1) for q ≥ n + 2.

7.2.3 Ideas behind Theorem 6.2

We let X_{m,T} denote the data matrix associated with m iid copies of {x_t}ₜ₌₁^T, with x_t ~ N(0, 2^t I_n) and x_t ⊥ x_{t'} for t ≠ t'. We also define Γ_T := (1/T) Σₜ₌₁^T 2^t I_n = (2/T)(2^T − 1) I_n, and observe that Γ_T ⪯ (2^{T+1}/T) I_n. By Lemma 6.1, it suffices to lower bound the quantity E tr(Γ^{1/2}_T (Xᵀ_{m,T} X_{m,T})⁻¹ Γ^{1/2}_T). Since each column of X_{m,T} is independent, the matrix 2^{−T/2} X_{m,T} has the same distribution as BDiag(Θ^{1/2}, m) W, where Θ ∈ R^{T×T} is diagonal with Θ_{ii} = 2^{i−T} for i ∈ {1, …, T}, and W ∈ R^{mT×n} has iid N(0, 1) entries.
In other words, we have:

$$\mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \geq \frac{1}{T}\, \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta, m) W)^{-1}\right).$$

By the arguments in Section 7.2.2, we have:

$$\mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta, m) W)^{-1}\right) \approx \frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta, m), n)},$$

where the $\approx$ notation indicates the heuristic nature of the inequality, as explained previously. From (7.8), we want to find $\bar{x}$ such that:

$$n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta, m)) = \frac{\bar{x}}{n} \cdot m \sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n}.$$

While solving this equation exactly for $\bar{x}/n$ is not tractable, we can estimate a lower bound on $\bar{x}/n$ quite easily. For any integer $T_c \in \{0, \dots, T\}$, we have the following estimate:

$$\sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n} \leq \frac{T_c + 1}{\bar{x}/n} + 2^{-T_c}.$$

Let us first assume that $\bar{x}/n \in [1, 2^{T-1}]$, so that $\lfloor \log_2(\bar{x}/n) \rfloor \in \{0, \dots, T\}$. Setting $T_c = \lfloor \log_2(\bar{x}/n) \rfloor$ then yields the lower bound $\bar{x}/n \geq 2^{n/m - 3}$. On the other hand, if $\bar{x}/n > 2^{T-1}$, then since we assume $mT \geq c_1 n$, we also have $\bar{x}/n > 2^{c_1 n/m - 1}$. Finally, if $\bar{x}/n < 1$, we have:

$$n = \frac{\bar{x}}{n} \cdot m \sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n} < m \sum_{j=0}^{T-1} 2^{-j} < 2m.$$

This yields a contradiction, since by assumption $m \leq c_2 n$ with $c_2 < 1/2$, so we must have $\bar{x}/n \geq 2^{c' n/m - 3}$ with $c' = \min\{1, c_1\}$. Now by (7.8) and (7.9):

$$\frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta, m), n)} = \frac{\bar{x}}{n} \geq 2^{c' n/m - 3} \implies \mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \gtrsim \frac{2^{c' n/m}}{T}.$$

We make this argument rigorous in Appendix C.5.

7.2.4 Ideas behind Theorem 6.3

We focus here on the hard instance when $A = I_n$ and $m \lesssim n$, since the cases when $A = 0_{n \times n}$, or $A = I_n$ and $m \gtrsim n$, are straightforward applications of Jensen's inequality and some basic manipulations (see Lemma C.7). The proof used by Theorem 6.3 when $A = I_n$ and $m \lesssim n$ is actually a special case of a general proof indexed by the largest Jordan block size of the hard instance. For a maximum Jordan block size $r$, the hard instances are $A = \mathrm{BDiag}(J_r, n/r)$, where we assume for simplicity that $r$ divides $n$; this reduces to $A = I_n$ when $r = 1$. We associate two important matrices with these hard instances. To define them, let $I_r := \{1, 1 + r, \dots, 1 + (T'-1) r\}$, and let $E_{I_r} \in \mathbb{R}^{T' \times T' r}$ denote the linear operator that extracts the coordinates in $I_r$.
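The exponential growth of the root $\bar{x}/n$ in $n/m$ can also be observed directly. The sketch below (our own illustration, with arbitrarily chosen sizes) solves $n = (\bar{x}/n)\, m \sum_{j=0}^{T-1} 1/(2^j + \bar{x}/n)$ by bisection and checks that $\log_2(\bar{x}/n)$ grows roughly linearly in $n/m$:

```python
import math

def root_y(n, m, T, lo=1e-9, hi=2.0**100, iters=400):
    # bisection for y = xbar/n solving  n = y * m * sum_{j=0}^{T-1} 1/(2^j + y);
    # the gap function g below is increasing in y
    g = lambda y: y * m * sum(1.0 / (2.0**j + y) for j in range(T)) - n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

T, m = 60, 1
roots = {n: root_y(n, m, T) for n in (4, 8, 12)}
growth = [math.log2(roots[8] / roots[4]), math.log2(roots[12] / roots[8])]
print(roots)
print(growth)  # roughly 4 bits per step of Delta(n/m) = 4: exponential growth in n/m
```

Each increase of $n/m$ by four multiplies the root by roughly $2^4$, matching the $2^{\Theta(n/m)}$ scaling derived above.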
The following matrices then play a key role in our analysis:

$$\Psi_{r,T,T'} := \mathrm{BDiag}\!\left(\Gamma_T^{-1/2}(J_r), T'\right) \mathrm{BToep}(J_r, T'), \qquad \Theta_{r,T,T'} := E_{I_r} \Psi_{r,T,T'} \Psi_{r,T,T'}^{\mathsf{T}} E_{I_r}^{\mathsf{T}}. \tag{7.10}$$

The next step is to use a simple decoupling argument (see Lemma C.11) to argue that, for $A = \mathrm{BDiag}(J_r, d)$:

$$\mathbb{E}\operatorname{tr}\!\left(\Gamma_{T'}^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_{T'}^{1/2}\right) \gtrsim \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{r,T,T'}, m) W)^{-1}\right),$$

where $W \in \mathbb{R}^{mT' \times d}$ has iid $N(0,1)$ entries. This positions us to use the arguments in Section 7.2.2 again. We first focus on the $r = 1$ case. We reduce the problem to assuming $T' = T$ by observing that, since $\Gamma_t(I_n) = \frac{t+1}{2} I_n$ for any $t \in \mathbb{N}_+$, we have $\Theta_{1,T,T'} = \frac{T'+1}{T+1} \Theta_{1,T',T'}$. Therefore:

$$\mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{1,T,T'}, m) W)^{-1}\right) = \frac{T+1}{T'+1}\, \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{1,T',T'}, m) W)^{-1}\right) \tag{7.11}$$
$$\approx \frac{T+1}{T'+1} \cdot \frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta_{1,T',T'}, m), n)},$$

where again the $\approx$ notation highlights the heuristic nature of the bound, used to build intuition. To proceed, let $L_T \in \mathbb{R}^{T \times T}$ be the lower-triangular matrix of all ones, and define $S_T := (L_T L_T^{\mathsf{T}})^{-1}$. A computation yields that $\Theta_{1,T,T}^{-1} = \frac{T+1}{2} S_T$. Note that we can write $S_T$ as a rank-one perturbation of a tridiagonal matrix. Specifically, $S_T = \mathrm{Tri}(2, -1; T) - e_T e_T^{\mathsf{T}}$, where $\mathrm{Tri}(a, b; T)$ denotes the symmetric $T \times T$ tridiagonal matrix with $a$ on the diagonal and $b$ on the lower and upper off-diagonals. By the standard formula for the eigenvalues of a tridiagonal Toeplitz matrix, we have that $\lambda_{T-k+1}(\mathrm{Tri}(2, -1; T)) = 2\left(1 - \cos\frac{k\pi}{T+1}\right) \asymp k^2/T^2$. In Appendix C.7, we apply the work of Kulkarni et al. (1999) to show that the rank-one perturbation is negligible: $\lambda_{T-k+1}(S_T) \asymp k^2/T^2$ as well. Therefore $\lambda_{T-k+1}(\Theta_{1,T,T}^{-1}) \asymp k^2/T$. With this bound, we have:

$$n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta_{1,T,T}, m)) = \frac{\bar{x}}{n} \cdot m \sum_{i=1}^{T} \frac{1}{\lambda_i(\Theta_{1,T,T}^{-1}) + \bar{x}/n} \asymp \frac{\bar{x}}{n} \cdot m \sum_{i=1}^{T} \frac{1}{i^2/T + \bar{x}/n} \asymp \frac{\bar{x}}{n} \cdot m \int_{1}^{T} \frac{\mathrm{d}x}{x^2/T + \bar{x}/n} \asymp m \sqrt{T \cdot \frac{\bar{x}}{n}}.$$

(a) Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $n$ ($m = 1$, $n = dr$, $T' = 2n$; fitted slopes: 2.05, 4.18, 6.42, 8.67, 10.81, 12.75, 14.38 for $r = 1, \dots, 7$).
(b) Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $m$ ($n = 150$, $T' = 2n$; fitted slopes: 2.02, 3.88, 5.95, 7.89 for $r = 1, \dots, 4$).

Figure 3: Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $n$ in (a) and versus $m$ in (b), both on a log-log scale. For (a), $m$ and $p$ are fixed to one, $d$ is fixed to $n/r$, and $T'$ is fixed to $2n$. For (b), $n$ is fixed to 150, $p$ is fixed to one, and $T'$ is fixed to $2n$. In the legends, the slope of each line (in log-log space), computed via linear regression, is shown. Based on these plots, we conjecture that $T'd\,(\mathrm{SP})^{-1} \asymp c_r (d/m)^{2r}$, where $c_r$ depends only on $r$.

This implies that $\bar{x}/n \asymp n^2/(m^2 T)$, and therefore by (7.8) and (7.9):

$$\mathrm{SP}(\mathrm{BDiag}(\Theta_{1,T,T}, m), n) = \frac{n}{\bar{x}/n} \asymp \frac{m^2 T}{n} \implies \mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \gtrsim \frac{n^2}{m^2 T}.$$

We make this argument rigorous in Appendix C.8.

7.2.5 Beyond diagonalizability

When $r \geq 2$, the analytic complexity of characterizing the solution to the equation $n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta_{r,T,T}, m))$ increases significantly. Nevertheless, we can still solve for $\bar{x}/n$ by numerical root finding, in order to examine the scaling patterns for small values of $r$ and $T' = T$. This computation leads us to conjecture a general bound of $R(m, T, T'; \{P_x^{\mathrm{BDiag}(J_r, n/r)}\}) \gtrsim c_r n^{2r}/(m^{2r} T')$ when $m \lesssim n$, where $c_r$ is a constant depending only on $r$ (see Figure 3). A complete and precise statement is given in Appendix A.

8. Numerical simulation

We conduct a simple numerical simulation illustrating the benefits of multiple trajectories for learning. We construct a family of LDS-LS problem instances, parameterized by a scalar $\rho \in (0, \infty)$, as follows. The covariate distribution $P_x$ is the linear dynamical system $x_{t+1} = A x_t + w_t$ with:

$$A = U \operatorname{diag}(\underbrace{\rho, \dots, \rho}_{n/2 \text{ times}}, -\rho, \dots, -\rho)\, U^{\mathsf{T}}, \qquad U \sim \mathrm{Unif}(O(n)), \qquad w_t \sim N(0, I/4).$$

Here, $O(n)$ denotes the set of $n \times n$ orthogonal matrices. By construction, $\rho$ is the spectral radius of $A$. The labels $y_t$ are set as $y_t = x_{t+1}$, so that the ground truth $W_\star \in \mathbb{R}^{n \times n}$ is equal to $A$.
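The construction just described is straightforward to implement. The sketch below (our own minimal illustration; the paper's experiment uses jax, while we use numpy here) builds $A$ with a Haar-random orthogonal $U$, checks that $\rho$ is indeed its spectral radius, and fits OLS on stacked trajectory data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, rho = 5, 50, 0.9

# A = U diag(rho, ..., rho, -rho, ..., -rho) U^T with U a Haar-random orthogonal matrix
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag([rho] * (n // 2) + [-rho] * (n - n // 2)) @ U.T

def ols_error(m):
    # stack m trajectories of x_{t+1} = A x_t + w_t with w_t ~ N(0, I/4),
    # then regress x_{t+1} on x_t (the OLS estimate of the ground truth A)
    X, Y = [], []
    for _ in range(m):
        x = rng.standard_normal(n) / 2
        for _ in range(T):
            x_next = A @ x + rng.standard_normal(n) / 2
            X.append(x)
            Y.append(x_next)
            x = x_next
    A_hat = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)[0].T
    return np.linalg.norm(A_hat - A)

spec_rad = max(abs(np.linalg.eigvals(A)))
e1, e50 = ols_error(1), ols_error(50)
print(spec_rad)   # equals rho by construction
print(e1, e50)    # estimation error shrinks as trajectories accumulate
```

Since $A = UDU^{\mathsf{T}}$ with $D$ diagonal, the eigenvalues of $A$ are exactly $\pm\rho$, and the OLS error decreases as more parallel trajectories are stacked.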
We compare the risk of the OLS estimator (3.6) on the LDS-LS problem instance with its risk on the corresponding Ind-LDS-LS baseline. Specifically, we plot the ratio between the OLS excess risks $\mathbb{E}[L(\hat{W}_{m,T}; T', P_x)]$ on the two problem instances. We fix the covariate dimension $n = 5$ and the trajectory horizon length $T = 10n$, and vary the number of trajectories $m \in \{1, \dots, 10\}$. Figure 4 shows the result of this experiment, where we also vary $\rho \in \{0.98, 0.99, 1.0, 1.01, 1.02\}$. The error bars are computed over 1000 trials. All computations are implemented using jax (Bradbury et al., 2018), and run with float64 precision on a single machine.

Figure 4: Plot of the ratio of excess risk for LDS-LS problem instances over the corresponding Ind-LDS-LS baseline instances, as a function of the number of trajectories $m$, holding both the covariate dimension $n = 5$ and the horizon length $T = 10n$ fixed. The vertical blue line marks the transition between few trajectories ($m \lesssim n$) and many trajectories ($m \gtrsim n$).

In Figure 4, we see that in the few-trajectories regime ($m \lesssim n$), appearing to the left of the vertical blue line, the instability of the covariate process plays an outsized role in determining the value of the ratio. On the other hand, in the many-trajectories regime ($m \gtrsim n$), appearing to the right of the blue line, the ratios quickly converge to a constant no greater than two (at $m = 10$). This behavior is consistent with Theorem 5.5. Finally, Theorem 6.3 suggests that the scaling behavior of the $\rho = 1$ curve with respect to $m$ is on the order of $1/m$.

9. Concluding remarks

Having sharply characterized the worst-case excess risk of Seq-LS and LDS-LS, we see more precisely the trade-offs (or, arguably, the lack thereof) presented by resetting a system, or by simply observing parallel runs from one, where possible.
After sufficient resets, one learns roughly as though the examples were independent altogether (as reflected in the Ind-Seq-LS and Ind-LDS-LS baselines). In addition to the theoretical upshot that it presents, this phenomenon seems encouraging insofar as the setup may describe reality: one does not learn to ride a bicycle by witnessing thousands of unrelated pedal strokes, nor by watching one cyclist endure the entire Tour de France, but rather by seeing and attempting many moderate rides and maneuvers.

We see a number of future directions for research, primarily in further charting out the reach of the iid-like phenomenon in learning from multiple sequences. Our work offers the trajectory small-ball criterion (Definition 4.1) as a vehicle for proving that this phenomenon occurs, or otherwise for bounding the minimax rate from above. What other notable sequential processes, outside of those covered in Section 4.1, can we capture as trajectory small-ball instances? One might look to covariate sequences generated by, say, input-to-state stable (ISS) non-linear systems, stochastic polynomial difference equations, or various Markov decision processes. On the flip side, when must we necessarily pay a price for dependent data? One answer from our work is that a necessary gap between independent and sequentially dependent learning appears when there are insufficiently many trajectories ($m \lesssim n$). As outlined in Section 7.2.5 and Appendix A, we conjecture that this gap can be made much wider, namely by considering non-diagonalizable linear dynamical systems. That said, other pertinent problems may exhibit gaps as well. Finding them would help inform where the limits of learning from sequential data lie. On the regression side, one might look to move beyond a well-specified linear regression model, extend to other loss functions, analyze regularized least-squares estimators in place of OLS, or consider a more adversarial analysis (e.g.
measuring regret rather than risk, in an online setting).

Acknowledgments

We thank Vikas Sindhwani for organizing a lecture series on learning and control, during which this work was prompted, and for encouraging us to pursue the ensuing questions. We also thank Sumeet Singh for pointing out that the claimed first-order optimality conditions in (7.6) and (7.7) require further conditions to hold for nonconvex/nonconcave games, e.g., strictly concave stationary points of the inner maximization problem. Finally, we thank the anonymous reviewers of this manuscript for their constructive feedback, which helped improve the clarity of our exposition. M. Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER award (#1846369), the DARPA Learning with Less Labels (LwLL) and FastNICs programs, and NSF-CIF awards #1813877 and #2008443.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. In Conference on Learning Theory, 2011.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

David Belanger and Sham Kakade. A linear dynamical system model for text. In International Conference on Machine Learning, 2015.

Adam Block, Max Simchowitz, and Russ Tedrake. Smoothed online learning for prediction in piecewise affine systems. In Neural Information Processing Systems, 2023.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean.
Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, 2015.

Guy Bresler, Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli, and Xian Wu. Least squares regression with Markovian data: Fundamental limits and algorithms. In Neural Information Processing Systems, 2020.

Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

Anthony Carbery and James Wright. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8:233–248, 2001.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Neural Information Processing Systems, 2021.

Yanxi Chen and H. Vincent Poor. Learning mixtures of linear dynamical systems. In International Conference on Machine Learning, 2022.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. Learning from weakly dependent data under Dobrushin's condition. In Conference on Learning Theory, 2019.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel, and Anthimos Vardis Kandiros. Statistical estimation from dependent data. In International Conference on Machine Learning, 2021a.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Anthimos Vardis Kandiros. Learning Ising models from one or multiple samples. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021b.

Sanjoy Dasgupta and Anupam Gupta.
An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Constantinos Daskalakis, Nishanth Dikkala, and Ioannis Panageas. Regression from dependent observations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20:633–679, 2020.

John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. Automatica, 96:342–353, 2018.

Dylan J. Foster, Alexander Rakhlin, and Tuhin Sarkar. Learning nonlinear dynamical systems from a single trajectory. In Learning for Dynamics & Control Conference, 2020.

Udaya Ghai, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. No-regret prediction in marginally stable systems. In Conference on Learning Theory, 2020.

Promit Ghosal and Sumit Mukherjee. Joint estimation of parameters in Ising model. The Annals of Statistics, 48(2):785–810, 2020.

Joshua Glaser, Matthew Whiteway, John P. Cunningham, Liam Paninski, and Scott Linderman. Recurrent switching dynamical systems models for multiple interacting neural populations. In Neural Information Processing Systems, 2020.

Alexander Goldenshluger and Assaf Zeevi. Nonasymptotic bounds for autoregressive time series modeling. The Annals of Statistics, 29(2):417–444, 2001.

Rodrigo A. González and Cristian R. Rojas. A finite-sample deviation bound for stable autoregressive processes. In Learning for Dynamics and Control, 2020.

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Neural Information Processing Systems, 2017.

Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang.
Spectral filtering for general linear dynamical systems. In Neural Information Processing Systems, 2018.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14:569–600, 2014.

Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Near-optimal offline and streaming algorithms for learning non-linear dynamical systems. In Neural Information Processing Systems, 2021.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Neural Information Processing Systems, 2021.

Yassir Jedra and Alexandre Proutiere. Sample complexity lower bounds for linear system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.

Yassir Jedra and Alexandre Proutiere. Finite-time identification of stable linear systems: Optimality of the least-squares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC), 2020.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, 2020.

Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

S. Mohammad Khansari-Zadeh and Aude Billard. Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5):943–957, 2011.

Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.

Devadatta Kulkarni, Darrell Schmidt, and Sze-Kai Tsui. Eigenvalues of tridiagonal pseudo-Toeplitz matrices. Linear Algebra and its Applications, 297(1–3):63–80, 1999.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes.
Machine Learning, 106:93–117, 2017.

Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.

Tze Leung Lai and Ching Zong Wei. Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters. Journal of Multivariate Analysis, 13(1):1–23, 1983.

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Neural Information Processing Systems, 2018.

Yingying Li, Tianpeng Zhang, Subhro Das, Jeff Shamma, and Na Li. Non-asymptotic system identification for linear systems with nonlinear policies. IFAC-PapersOnLine, 56(2):1672–1679, 2023.

Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Lennart Ljung. System Identification: Theory for the User. Pearson, 1998.

Jan R. Magnus. The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201–210, 1978.

Louis Massucci, Fabien Lauer, and Marion Gilson. A statistical learning perspective on switched linear system identification. Automatica, 145:110532, 2022.

V. John Mathews and Giovanni L. Sicuranza. Polynomial Signal Processing. Wiley, 2000.

Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Nonparametric risk bounds for time-series forecasting. Journal of Machine Learning Research, 18:1–40, 2017.

Ron Meir. Nonparametric time series prediction through adaptive model selection.
Machine Learning, 39:5–34, 2000.

Shahar Mendelson. Learning without concentration. Journal of the ACM, 62(3):1–25, 2015.

Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, 1993.

Aditya Modi, Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Joint learning of linear time-invariant dynamical systems. arXiv preprint arXiv:2112.10955, 2022.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Neural Information Processing Systems, 2008.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(26):789–814, 2010.

Jaouad Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157–2178, 2022.

Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive Processing, 12:319–340, 2011.

Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3):1175–1194, 2016.

Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1–2):1–179, 2018.

Samet Oymak and Necmiye Ozay. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), 2019.

Peter C.B. Phillips and Tassos Magdalinos. Inconsistent VAR regression with common explosive roots. Econometric Theory, 29:808–837, 2013.

Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Neural Information Processing Systems, 1989.

Wilson J. Rugh. Nonlinear System Theory: The Volterra/Wiener Approach. The Johns Hopkins University Press, 1981.

Wilson J. Rugh. Linear System Theory. Prentice Hall, 2nd
edition, 1996.

Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, 2019.

Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite time LTI system identification. Journal of Machine Learning Research, 22:1–61, 2021.

Yahya Sattar and Samet Oymak. Non-asymptotic and accurate learning of nonlinear dynamical systems. arXiv preprint arXiv:2002.08538, 2020.

Yahya Sattar, Samet Oymak, and Necmiye Ozay. Finite sample identification of bilinear dynamical systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), 2022.

Cosma Rohilla Shalizi. Advanced Data Analysis from an Elementary Point of View. 2021. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, 2018.

Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, 2019.

Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Neural Information Processing Systems, 2009.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.

Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. A tight version of the Gaussian min-max theorem in the presence of convexity. Technical report, Caltech, 2014.

Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.

Anastasios Tsiamis and George J. Pappas. Linear systems can be hard to learn. arXiv preprint arXiv:2104.01120, 2021.

Stephen Tu and Ross Boczar.
An elementary proof of anti-concentration for degree two non-negative Gaussian polynomials. arXiv preprint arXiv:2301.05992, 2023.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

Lei Xin, George Chiu, and Shreyas Sundaram. Learning the dynamics of autonomous linear systems from multiple trajectories. arXiv preprint arXiv:2203.12794, 2022.

Yu Xing, Benjamin Gravell, Xingkang He, Karl Henrik Johansson, and Tyler Summers. Identification of linear systems with multiplicative noise from multiple trajectory data. arXiv preprint arXiv:2106.16078, 2021.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

Yang Zheng and Na Li. Non-asymptotic identification of linear dynamical systems using multiple trajectories. IEEE Control Systems Letters, 5(5):1693–1698, 2021.

Ingvar Ziemann and Stephen Tu. Learning with little mixing. In Neural Information Processing Systems, 2022.

Ingvar Ziemann, Henrik Sandberg, and Nikolai Matni. Single trajectory nonparametric learning of nonlinear dynamics. In Conference on Learning Theory, 2022.

Appendix A. Beyond diagonalizability: a conjecture for the general case

Recall that various results in Section 5.1 required a diagonalizability assumption (Assumption 5.2) on the dynamics matrix $A$, specifically in the many trajectories regime when $T' > T$ (Theorem 5.5), and in the few trajectories regime (Section 5.1.2). In this section, we conjecture how removing the diagonalizability assumption would affect the results. For simplicity, we focus on the few trajectories regime, and further assume that $T' = T$.
Building on potential extensions of this paper's analysis, and on the numerical evidence detailed in Section 7, we conjecture the following extensions of Theorem 5.6 and Theorem 6.3:

Conjecture A.1 (Risk for LDS-LS with few trajectories, general case). There are universal positive constants $c_0, c_1, c_2, c_3$, and a universal mapping $\phi: \mathbb{N}_+ \to \mathbb{R}_{>0}$, such that the following holds for any instance of LDS-LS satisfying Assumption 5.1 (marginal stability) and Assumption 5.3 (one-step controllability). Let $A = SJS^{-1}$ denote the Jordan normal form of the dynamics matrix $A$. Define $\gamma := \lambda_{\max}(S^{-1} B B^{\mathsf{T}} S^{-\mathsf{T}}) / \lambda_{\min}(S^{-1} B B^{\mathsf{T}} S^{-\mathsf{T}})$, and let $r$ be the size of the largest Jordan block in $J$. If $n \geq c_0$, $m \leq c_1 n$, and $mT \geq c_2 n$, then:

$$\mathbb{E}[L(\hat{W}_{m,T}; T', P_x^{A,B})] \leq c_3 \sigma_\xi^2 \phi(r)\, \gamma\, \frac{p n^{2r}}{m^{2r} T'}. \tag{A.1}$$

Additionally, there exist universal positive constants $c_0', c_1', c_2', c_3'$, and $c_4'$ such that the following is true. Suppose $\mathcal{A} \subseteq \mathbb{R}^{n \times n}$ is any set containing all $n \times n$ matrices with Jordan blocks of size at most $r$. Let $T \geq c_0'$, $n \geq c_1'$, $mT \geq c_2' n$, and $m \leq c_3' n$. Then:

$$R(m, T, T'; \{P_x^A \mid A \in \mathcal{A}\}) \geq c_4' \sigma_\xi^2 \phi(r)\, \gamma\, \frac{p n^{2r}}{m^{2r} T'}. \tag{A.2}$$

Lemma 5.1 provides a viable path towards proving the upper bound (A.1) from Conjecture A.1, up to logarithmic factors, in the regime of constant Jordan block size $r$, by reducing the problem to understanding the scaling of $\lambda(k, t; A, B) = \lambda(\Gamma_k(A, B), \Gamma_t(A, B))$ when $k \leq t$. Our analysis uses diagonalizability (Assumption 5.2) of the dynamics matrices to show that $\lambda(k, t; A, B) \gtrsim \gamma^{-1}\, k/t$. Without such an assumption, analyzing $\lambda(k, t; A, B)$ is substantially more involved. A numerical simulation (Figure 5) suggests that $\lambda(k, t; A, B)$ instead scales as $(k/t)^{2r-1}$.

Figure 5: A plot of the ratio $\alpha$ versus $1/\lambda(k, t)$, with $k$ fixed to 5 and $t$ fixed to $k\alpha$, where $\lambda(k, t) := \lambda(k, t; J_r, I_r)$. The slope of each line (in log-log space), computed via linear regression, is reported: 1.00, 3.02, 5.04, 7.06, 9.09, 11.13, 13.18 for $r = 1, \dots, 7$.
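The quantity behind Figure 5 can be reproduced in a few lines. The sketch below is our own illustration; it assumes $\Gamma_t(A, B) = \frac{1}{t}\sum_{s=1}^{t}\sum_{i=0}^{s-1} A^i B B^{\mathsf{T}} (A^{\mathsf{T}})^i$ with $B = I$ (a normalization consistent with $\Gamma_t(I_n) = \frac{t+1}{2} I_n$ in Section 7.2.4) and estimates the log-log slope of $1/\lambda(k, t; J_r, I_r)$ in $\alpha = t/k$:

```python
import numpy as np

def gamma_t(A, t):
    # Gamma_t(A, I) = (1/t) * sum_{s=1}^{t} sum_{i=0}^{s-1} A^i (A^T)^i
    r = A.shape[0]
    P, C, G = np.eye(r), np.zeros((r, r)), np.zeros((r, r))
    for _ in range(t):
        C = C + P @ P.T      # running sum of A^i A^{i T}
        G = G + C
        P = A @ P
    return G / t

def lam(k, t, A):
    # lambda(k, t; A, I) = lambda_min(Gamma_t^{-1/2} Gamma_k Gamma_t^{-1/2})
    L = np.linalg.cholesky(gamma_t(A, t))
    M = np.linalg.solve(L, np.linalg.solve(L, gamma_t(A, k)).T)
    return np.linalg.eigvalsh(M).min()

k, alphas = 5, np.array([4.0, 8.0, 16.0, 32.0])
slopes = {}
for r in (1, 2):
    Jr = np.eye(r) + np.diag(np.ones(r - 1), 1)  # Jordan block with eigenvalue 1
    vals = [1.0 / lam(k, int(k * a), Jr) for a in alphas]
    slopes[r] = np.polyfit(np.log(alphas), np.log(vals), 1)[0]
print(slopes)  # close to 2r - 1: about 1 for r = 1, about 3 for r = 2
```

For $r = 1$ the ratio is exactly $(t+1)/(k+1)$ (slope 1); for $r = 2$ the fitted slope approaches 3, in line with the conjectured $(k/t)^{2r-1}$ rate.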
We conjecture that, in general, $\lambda(k, t; A, B) \asymp c_r \gamma^{-1} (k/t)^{2r-1}$ is the correct rate for dynamics matrices $A$ with Jordan blocks of size at most $r$, where $c_r$ is a constant depending only on $r$. Assuming this scaling to be correct, plugging the rate into Lemma 5.1 yields (A.1) up to logarithmic factors. Partial progress towards analyzing $\lambda(k, t; A, B)$ was made in Sarkar and Rakhlin (2019, Proposition 7.6), where it is shown that $\lambda(k, t; A, B) \gtrsim c_r \gamma^{-1} (k/t)^{r^2}$, with $1/c_r$ depending exponentially on $r$. We do not conjecture a form for the mapping $\phi(r)$; $\lambda(k, t; A, B)$ becomes numerically ill-conditioned when $r$ is large, hindering simulation with large blocks. On the other hand, the analytic arguments in Section 7.2.4, combined with the numerical evidence in Figure 3, suggest that the bound (A.2) holds (up to the condition number factor $\gamma$). The one caveat is that, even if we were to analytically characterize the eigenvalues of $\Theta_{r,T,T'}$ for all $r$, our proof strategy would most likely not be able to give a sharp characterization of the leading constant $\phi(r)$ in the lower bound. This is because our proof inherently exploits the independence between decoupled subsystems, and does not tackle the harder challenge of understanding the coupling effects within a Jordan block. We conclude this section by noting that Conjecture A.1 does not include any logarithmic factors in the upper bound rate (A.1), and includes the condition number factor $\gamma$ in the lower bound (A.2). In other words, Conjecture A.1 applied to the special case of $r = 1$ conjectures that Theorem 5.6 is loose by a factor of $\log^2(\gamma n/m)$, and that Theorem 6.3 is loose by a factor of $\gamma$.

Appendix B. Analysis for upper bounds

B.1 Preliminaries

We collect various technical results which we will use in the proofs of the upper bounds. The first result gives us a bound on the functional inverse of $T \mapsto T/\log T$.

Proposition B.1 (Simchowitz et al. (2018, Lemma A.4)).
For $\alpha \geq 1$, $T \geq 2\alpha \log(4\alpha)$ implies that $T \geq \alpha \log T$.

The next two results study various properties of functions involving $\lambda$.

Proposition B.2. For $A \in \mathrm{Sym}^n_{>0}$, the map $X \mapsto \lambda(X, A)$ is concave over symmetric matrices.

Proof. Observe that $\lambda(X, A) = \lambda_{\min}(A^{-1/2} X A^{-1/2}) = \inf\{\langle X, A^{-1/2} v v^{\mathsf{T}} A^{-1/2} \rangle \mid v \in S^{n-1}\}$ is the pointwise infimum over a set of linear functions in $X$, and is therefore concave. ∎

Proposition B.3. Fix $T \in \mathbb{N}_+$, $\{\Psi_t\}_{t=1}^T \subseteq \mathrm{Sym}^n_{>0}$, and $\Gamma \in \mathrm{Sym}^n_{>0}$. Suppose that $\frac{1}{T}\sum_{t=1}^T \Psi_t \preccurlyeq \Gamma$. Then $\left[\prod_{t=1}^T \lambda(\Psi_t, \Gamma)\right]^{1/T} \leq 1$.

Proof. We have that:

$$\left[\prod_{t=1}^T \lambda(\Psi_t, \Gamma)\right]^{1/T} \leq \frac{1}{T}\sum_{t=1}^T \lambda(\Psi_t, \Gamma) \leq \lambda\!\left(\frac{1}{T}\sum_{t=1}^T \Psi_t, \Gamma\right) \leq \lambda(\Gamma, \Gamma) = 1,$$

where the first inequality is the AM-GM inequality, the second uses Proposition B.2 and Jensen's inequality, and the third uses $\frac{1}{T}\sum_{t=1}^T \Psi_t \preccurlyeq \Gamma$. ∎

The next result relates the anti-concentration properties of a non-negative random variable to its moment generating function on $(-\infty, 0)$.

Proposition B.4 (Mourtada (2022, Lemma 7)). Let $X$ be a non-negative random variable. Suppose there exist an $\alpha \in (0, 1]$ and a positive constant $c$ such that $\mathbb{P}(X \leq t) \leq (ct)^\alpha$ for all $t > 0$. Then $\mathbb{E}[\exp(-\eta X)] \leq (c/\eta)^\alpha$ for all $\eta > 0$.

The next few results involve various properties of Gaussian and spherical distributions.

Proposition B.5 (Magnus (1978, Lemma 6.2)). For $w \sim N(0, I)$ and symmetric matrices $A, B$:

$$\mathbb{E}[w^{\mathsf{T}} A w \cdot w^{\mathsf{T}} B w] = 2\langle A, B \rangle + \operatorname{tr}(A)\operatorname{tr}(B).$$

Proposition B.6 (Dasgupta and Gupta (2003, Lemma 2.2)). Let $n \geq 2$ and $v \in \mathbb{R}^n \setminus \{0\}$ be fixed. Suppose that $\psi$ is drawn from the uniform measure over $S^{n-1}$. We have that for all $\varepsilon > 0$:

$$\mathbb{P}\left\{\langle v, \psi \rangle^2 \leq \frac{\varepsilon}{n} \|v\|_2^2\right\} \leq (e\varepsilon)^{1/2}.$$

Next, we state a classic result which gives us anti-concentration of arbitrary Gaussian (more generally, any log-concave distribution) polynomials of bounded degree.

Theorem B.7 (Carbery and Wright (2001, Theorem 8)). Fix an integer $d \in \mathbb{N}_+$. There exists a universal positive constant $c$ such that the following is true. Let $p: \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ polynomial, and let $\varepsilon > 0$. We have:

$$\mathbb{P}\{|p(x)| \leq \varepsilon\, \mathbb{E}|p(x)|\} \leq c\, d\, \varepsilon^{1/d}, \quad x \sim N(0, I_n).$$

For the case when $d = 2$ and $p$ is non-negative, the bound can be taken to be $(e\varepsilon)^{1/2}$. Theorem B.7 can be further specialized as follows.
Suppose $w \sim N(0, I)$, $x$ is fixed, and $Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{12}^{\mathsf{T}} & Q_{22} \end{pmatrix}$ is positive semidefinite. Then:

$$\mathbb{P}\left\{ \begin{pmatrix} w \\ x \end{pmatrix}^{\mathsf{T}} Q \begin{pmatrix} w \\ x \end{pmatrix} \leq \varepsilon\, \mathbb{E}\left[ \begin{pmatrix} w \\ x \end{pmatrix}^{\mathsf{T}} Q \begin{pmatrix} w \\ x \end{pmatrix} \right] \right\} \leq (e\varepsilon)^{1/2} \quad \forall \varepsilon > 0. \tag{B.1}$$

Both (B.1) and the explicit constant in Theorem B.7 for $d = 2$ and $p$ non-negative can be derived by bounding the MGF of various Gaussian quadratic forms; see e.g. Tu and Boczar (2023).

Next, we state a well-known result from Abbasi-Yadkori et al. (2011), which yields an anytime bound for the size of a self-normalized martingale difference sequence (MDS).

Lemma B.8 (Abbasi-Yadkori et al. (2011, Theorem 3)). Fix a $\delta \in (0, 1)$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$. Let $\{x_t\}_{t \geq 1} \subseteq \mathbb{R}^d$ be a stochastic process adapted to the filtration $\{\mathcal{F}_t\}_{t \geq 1}$. Let $\{\eta_t\}_{t \geq 1} \subseteq \mathbb{R}$ be a martingale difference sequence adapted to $\{\mathcal{F}_t\}_{t \geq 2}$. Suppose there exists $R > 0$ such that $\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_t] \leq \exp(\lambda^2 R^2 / 2)$ a.s. for all $\lambda \in \mathbb{R}$ and $t \geq 1$. Define $V_t := \sum_{k=1}^t x_k x_k^{\mathsf{T}}$ for $t \geq 1$. With probability at least $1 - \delta$, for all $t \geq 1$:

$$\left\| \sum_{k=1}^t \eta_k x_k \right\|_{(V_t + V)^{-1}}^2 \leq 2 R^2 \log\!\left( \frac{\det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right).$$

Lemma B.8 is generalized to vector-valued self-normalized MDS via a covering argument:

Proposition B.9 (Sarkar and Rakhlin (2019, Proposition 8.2)). Fix a $\delta \in (0, 1)$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$. Let $\{x_t\}_{t \geq 1} \subseteq \mathbb{R}^d$ be a stochastic process adapted to the filtration $\{\mathcal{F}_t\}_{t \geq 1}$. Let $\{\eta_t\}_{t \geq 1} \subseteq \mathbb{R}^p$ be a stochastic process adapted to $\{\mathcal{F}_t\}_{t \geq 2}$. Suppose that for every fixed $v \in S^{p-1}$ and every $t \geq 1$: (a) $\mathbb{E}[\langle v, \eta_t \rangle \mid \mathcal{F}_t] = 0$ a.s., and (b) $\mathbb{E}[\exp(\lambda \langle v, \eta_t \rangle) \mid \mathcal{F}_t] \leq \exp(\lambda^2 R^2 / 2)$ a.s. for every $\lambda \in \mathbb{R}$. Define $V_t := \sum_{k=1}^t x_k x_k^{\mathsf{T}}$ for $t \geq 1$. With probability at least $1 - \delta$, for all $t \geq 1$:

$$\left\| \sum_{k=1}^t \eta_k x_k^{\mathsf{T}} (V_t + V)^{-1/2} \right\|_{\mathrm{op}} \leq 2 \sqrt{2 R^2 \log\!\left( \frac{5^p \det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right)}.$$

The next result assumes $V_t \succcurlyeq V$ (so that in particular $V_t$ is invertible) in order to simplify Proposition B.9.

Proposition B.10. Under the same hypotheses as Proposition B.9, with probability at least $1 - \delta$, for all $t \geq 1$ such that $V_t \succcurlyeq V$:

$$\left\| \sum_{k=1}^t \eta_k x_k^{\mathsf{T}} V_t^{-1/2} \right\|_{\mathrm{op}} \leq 2\sqrt{2}\, \sqrt{2 R^2 \log\!\left( \frac{5^p \det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right)}.$$

Proof. Observe that when $V_t \succcurlyeq V$, we have $2 V_t \succcurlyeq V_t + V$, and hence $V_t^{-1} \preccurlyeq 2 (V_t + V)^{-1}$.
For two positive definite matrices M1 and M2 satisfying M1 ⪯ M2, and any matrix N,

‖N M1^{1/2}‖_op = √(λmax(N M1 N^T)) ≤ √(λmax(N M2 N^T)) = ‖N M2^{1/2}‖_op.

Therefore:

‖ Σ_{k=1}^t η_k x_k^T V_t^{−1/2} ‖_op ≤ √2 ‖ Σ_{k=1}^t η_k x_k^T (V_t + V)^{−1/2} ‖_op ≤ √2 ( 8R² log( 5^p det(V_t + V)^{1/2} det(V)^{−1/2} / δ ) )^{1/2},

where the last inequality holds for every t with probability at least 1 − δ by Proposition B.9.

B.2 Examples of trajectory small-ball

In this section, we prove that the examples listed in Section 4.1 satisfy the trajectory small-ball condition (Definition 4.1).

Example 4.1 (Copies of a Gaussian draw). Let Σ ∈ Sym^n_{>0}, and let P_x denote the process x₁ ∼ N(0, Σ) and x_t = x_{t−1} for t ≥ 2. Fix any T ∈ N_+. Then P_x satisfies the (T, T, Σ, e, 1/2)-Traj-SB condition.

Proof When k = T and Γ = I_n, the condition (4.1) simplifies to:

sup_{v ∈ S^{n−1}} P{ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ≤ ε } ≤ (c_sb ε)^α for all ε > 0.

Since x₁ = x₂ = ⋯ = x_T, this further simplifies to:

sup_{v ∈ S^{n−1}} P{ ⟨v, x₁⟩² ≤ ε } ≤ (c_sb ε)^α for all ε > 0.

Since ⟨v, x₁⟩ ∼ N(0, 1), Equation (B.1) yields that P_{X∼N(0,1)}{X² ≤ ε} ≤ (eε)^{1/2}, so we can take c_sb = e and α = 1/2.

Example 4.2 (Gaussian processes). Let P_x be a Gaussian process indexed by time, i.e., for every finite index set I ⊂ N_+, the collection of random variables (x_t)_{t∈I} is jointly Gaussian. Let T_nd := inf{ t ∈ N_+ | det(E[x_t x_t^T]) ≠ 0 }, and suppose T_nd is finite. Fix a T ∈ N_+ satisfying T ≥ T_nd. Then P_x satisfies the (T, T, Γ_T(P_x), 2e, 1/2)-Traj-SB condition.

Proof Since the covariates (x₁, …, x_T) are jointly Gaussian, we can write x_t = μ_t + M_t w, where μ₁, …, μ_T ∈ R^n and M₁, …, M_T ∈ R^{n×nT} are fixed, and w ∼ N(0, I_{nT}). For any v ∈ R^n,

(1/T) Σ_{t=1}^T ⟨v, x_t⟩² = (1/T) Σ_{t=1}^T ⟨v, μ_t + M_t w⟩².

This is a degree-2 non-negative polynomial in w, and therefore by Theorem B.7, for all ε > 0:

P{ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ≤ ε E[ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ] } ≤ (eε)^{1/2}.

Example 4.4 (Alternating halfspaces). Suppose that n ≥ 4 is even, and let u₁, …, u_n be a fixed orthonormal basis of R^n. Put U₀ = span(u₁, …, u_{n/2}) and U₁ = span(u_{n/2+1}, …, u_n).
Let i1 Bern(1 2), it+1 = (it + 1) mod 2 for t N+, and let Px denote the process with conditional distribution xt | it uniform over the spherical measure on Uit Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(2n), e, 1 2)-Traj SB condition. Tu, Frostig, and Soltanolkotabi Proof For i {0, 1}, let ψi be uniform on the uniform measure over Ui Sn 1, let PUi denote the orthogonal projector onto Ui, and let vi = PUiv. Fix any v Rn\{0}. We observe that for any t N+, v, xt+1 2+ v, xt+2 2 | it is equal in distribution to v, ψ0 2 + v, ψ1 2, which itself is equal in distribution to v0, ψ0 2 + v1, ψ1 2. Suppose first that v0 2 v1 2. Then, since v 2 2 = v0 2 2 + v1 2 2 2 v0 2 2: n v0, ψ0 2 + v1, ψ1 2 ε n v 2 2 o v0, ψ0 2 + v1, ψ1 2 2ε v0, ψ0 2 2ε Writing α0 = ( u1, v , . . . , un/2, v ) Rn/2, by a change of coordinates we have that α0 2 2 = v0 2 2, and that v0, ψ0 is equal in distribution to α0, ζ0 , where ζ0 is uniform on Sn/2 1. Since we assumed v0 2 v1 2, we must have that α0 = 0. Hence by Proposition B.6, P n v0, ψ0 2 + v1, ψ1 2 ε n v 2 2 o P α0, ζ0 2 2ε Note that if v1 2 > v0 2, an identical argument yields the same bound. Hence, letting Ft = σ(x1, . . . , xt), we have shown that for all t 0: ℓ=1 v, xt+ℓ 2 ε v T 1 from which the claim follows. Example 4.5 (Normal subspaces). Suppose that n 3. Let u1, . . . , un be a fixed orthonormal basis in Rn, and let U i := span({uj}j =i) for i {1, . . . , n}. Consider the Markov chain {it}t 1 defined by i1 Unif({1, . . . , n}), and it+1 | it Unif({1, . . . , n} \ {it}). Let Px denote the process with conditional distribution xt | it uniform over the spherical measure on U it Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(4n 4), e, 1 2)-Traj SB condition. Proof Fix any v Rn \ {0}, and for i {1, . . . , n}, let vi = PU iv, where PU i is the orthogonal projector onto U i Let {ψi}n i=1 be independent random variables, where each ψi is uniform on the uniform measure over U i Sn 1. Let indices j, k {1, . . . 
, n} with j = k. We first observe that since j = k, we have that U j = span(uj) U k. Therefore: v 2 2 = vj 2 2 + P U jvj 2 2 vj 2 2 + vk 2 2. Hence, assuming that vj 2 vk 2, we have: vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 vj, ψj 2 + vk, ψk 2 ε n 1 vj 2 2 Learning from many trajectories vj, ψj 2 ε n 1 vj 2 2 Writing αj = ( ui, v )i =j Rn 1, by a change of coordinates we have that αj 2 2 = vj 2 2, and that vj, ψj is equal in distribution to αj, ζj , where ζj is uniform on Sn 2. Since we assumed vj 2 vk 2, we must have that αj = 0. Hence by Proposition B.6, P vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 P αj, ζj 2 ε n 1 αj 2 2 On the other hand if vk 2 > vj 2, an identical argument yields the same bound. Now, for any i {1, . . . , n} and t N+: P v, xt+1 2 + v, xt+2 2 ε 2(n 1) v 2 2 j =i,k =j P v, xt+1 2 + v, xt+2 2 ε 2(n 1) v 2 2 it = i, it+1 = j, it+2 = k P{it+1 = j, it+2 = k | it = i} j =i,k =j P vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 P{it+1 = j, it+2 = k | it = i} j =i,k =j P{it+1 = j, it+2 = k | it = i} Note we also have that P n v, x1 2 + v, x2 2 ε 2(n 1) v 2 2 o (eε)1/2 by a nearly identical argument. Hence, letting Ft = σ(x1, . . . , xt), we have shown that for all t 0: ℓ=1 v, xt+ℓ 2 ε v T 1 4(n 1)In from which the claim follows. For the next claim, recall that the mixing time of a Markov chain over state-space S with transition matrix P and stationary distribution π is defined as: τmix(ε) = inf k N sup µ P(S) µP k π tv ε Here, P(S) denotes the set of distributions over S, and tv is the total-variation norm over distributions. Proposition B.11. Let n 2. Consider the Markov chain {it}t 1 where i1 Unif({1, . . . , n}) and it+1 | it Unif({1, . . . , n} \ {it}). We have that: τmix(ε) = inf k N (n 1) k 2ε 1 1/n Tu, Frostig, and Soltanolkotabi Proof Let 1 Rn denote the all ones vector. The transition matrix for this Markov chain is: P = 1 n 1(11T In), and its stationary distribution is uniform over {1, . . . , n}. Note that for j 1, (11T)j = nj 111T. 
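Before completing the computation, the mixing behavior of this chain can be observed numerically: iterating the transition matrix P = (11^T − I_n)/(n − 1) and measuring the worst-case total-variation distance to the uniform stationary distribution recovers a geometric rate of the form (n − 1)^{−k}(1 − 1/n). A sketch assuming NumPy; the choice n = 6 is arbitrary.

```python
import numpy as np

n = 6  # number of states; arbitrary choice
# From state i, the chain moves uniformly to one of the other n - 1 states.
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)
pi = np.full(n, 1.0 / n)  # uniform stationary distribution

Pk = np.eye(n)
for k in range(1, 8):
    Pk = Pk @ P
    # By convexity of TV distance in the initial distribution, the
    # worst-case initial distribution is a point mass.
    tv = max(0.5 * np.abs(Pk[i] - pi).sum() for i in range(n))
    predicted = (1 - 1 / n) / (n - 1) ** k
    assert abs(tv - predicted) < 1e-12
```

The exact match reflects that P has only two eigenvalues, 1 on the all-ones direction and −1/(n − 1) on its orthogonal complement.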
Since 11T and In commute, by the binomial theorem we have that: P k = 1 (n 1)k (11T)k j( 1)j nk j 1( 1)j11T + ( 1)k In (n 1)k ( 1)k 11T + ( 1)k In n11T + ( 1)k Now, let µ Rn 0 satisfy µT1 = 1. We have: µTP k 1 n1T 1 = 1 (n 1)k It is straightforward to check that supµ Rn 0,µT1=1 µ 1 n, from which the claim follows, since the TV distance between two distributions µ, ν is µ ν tv = 1 Example 4.6 (Linear dynamical systems). Let (A, B) with A Rn n and B Rn d be kcstep-controllable (Definition 4.2). Let PA,B x be the linear dynamical system defined in (3.8). Fix any T, k N+ satisfying T k kc. Then, PA,B x satisfies the (T, k, Γk(A, B), e, 1 2)- Traj SB condition. Proof Let Γk = Γk(Px) and Σk = Σk(Px) be shorthand notation. Let w = (w1, . . . , wk) Rnk denote the vertical concatenation of the process noise variables. Let Mt := At Φt Rn n(k+1) denote the matrix such that xt = Mt . With this notation, for any v Rn: t=1 v, xt 2 = x w t=1 MT t vv TMt By Equation (B.1), for any ε > 0, t=1 MT t vv TMt t=1 ΦT t vv TΦt Learning from many trajectories On the other hand: t=1 ΦT t vv TΦt v = v T 1 k v = v TΓkv. Because we assumed that k kc, then Γk is invertible. Thus, we can take csb = e and α = 1/2. Proposition B.12. Consider the scalar stochastic process {xt}t 1 defined by: j=0 ci,jwt i 1wt j 1, where {ci,j}i,j 0 are the coefficients which describe the dynamics, and {wt}t 0 are iid N(0, 1) random variables. Let {Ft}t 1 denote the filtration defined as Ft := σ(w0, . . . , wt 1), so that xt is Ft-measurable. Suppose that {ci,j}i,j 0 is symmetric and traceless. For every t 1 and k 0, almost surely we have: E[x2 t+k | Fk] E[x2 t ] + (E[xt+k | Fk])2. Proof For t N+, define the symmetric matrices Mt Rt t with (Mt)ii = 0 and (Mt)ij = c(i 1),(j 1). With this notation and with wt N(0, It), we can write xt as: xt = w T t Mt wt. Therefore, by Proposition B.5 and the assumption that tr(Mt) = 0: E[x2 t ] = E( w T t Mt wt)2 = 2 Mt 2 F + tr(Mt)2 = 2 Mt 2 F . 
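The second-moment identity just used, E[(w^T M w)²] = 2‖M‖_F² + tr(M)² (Proposition B.5 with A = B = M), can be checked by Monte Carlo. A sketch assuming NumPy; the matrix size, sample count, and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 4  # matrix size; arbitrary

# Symmetric coefficient matrix with zero diagonal (hence traceless).
M = rng.standard_normal((t, t))
M = (M + M.T) / 2
np.fill_diagonal(M, 0.0)

w = rng.standard_normal((500_000, t))
x = np.einsum('bi,ij,bj->b', w, M, w)  # x = w^T M w, per sample
emp = np.mean(x ** 2)
exact = 2 * np.sum(M ** 2) + np.trace(M) ** 2  # Proposition B.5 with A = B = M
assert abs(emp - exact) / exact < 0.1
```

With tr(M) = 0 this reduces to E[x²] = 2‖M‖_F², as in the display above.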
Now, partition Mt+k as: Mt+k = Mt Dt,k DT t,k Et,k Let vk = (wk 1, . . . , w0). Given Fk, we can write xt+k as: xt+k = wt vk T Mt Dt,k DT t,k Et,k With this notation: E[xt+k | Fk] = v T k Et,k vk, E[x2 t+k | Fk] = E wt T Mt Dt,k DT t,k Et,k Expanding the square: wt vk T Mt Dt,k DT t,k Et,k !2 = ( w T t Mt wt + 2 w T t Dt,k vk + v T k Et,k vk)2 Tu, Frostig, and Soltanolkotabi = ( w T t Mt wt)2 + 4 w T t Mt wt w T t Dt,k vk + 2 w T t Mt wt v T k Et,k vk + 4( w T t Dt,k vk)2 + 4 w T t Dt,k vk v T k Et,k vk + ( v T k Et,k vk)2. Using Proposition B.5 again: E[x2 t+k | Fk] = E wt T Mt Dt,k DT t,k Et,k = E wt( w T t Mt wt)2 + 2 tr(Mt) v T k Et,k vk + 4 Dt,k vk 2 2 + ( v T k Et,k vk)2 = 2 Mt 2 F + 4 Dt,k vk 2 2 + ( v T k Et,k vk)2 2 Mt 2 F + ( v T k Et,k vk)2. To complete the proof, we recall that E[x2 t ] = 2 Mt 2 F and E[xt+k | Fk] = v T k Et,k vk. Example 4.9 (Degree-2 Volterra series). Consider the following process Px. Let {c(ℓ) i,j }i,j 0 for ℓ {1, . . . , n} be symmetric, traceless, non-degenerate arrays (Definition 4.3). Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. For t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: j=i c(ℓ) i,j w(ℓ) t i 1w(ℓ) t j 1. (4.5) Let knd N+ denote the smallest non-degeneracy index for all n arrays. There is a universal positive constant c such that for any T and k satisfying T k knd, Px satisfies the (T, k, Γk(Px), c, 1 4)-Traj SB condition. Proof Fix a v Rn. The relation (4.5) shows that v, xt 2 is a degree four polynomial in {w(ℓ) i }t 1,n i=0,ℓ=1. Let Ft = σ({w(ℓ) i }t 1,n i=0,ℓ=1), so that xt is Ft-measurable. By Theorem B.7, there exists a univeral positive constant c > 0 such that for any s 0, t=1 v, xt+s 2 εE t=1 v, xt+s 2 Fs (cε)1/4 a.s. To conclude the proof, we need to lower bound E 1 k Pk t=1 v, xt+s 2 Fs . 
For any t 1, E v, xt+s 2 Fs ℓ=1 vℓ (xt+s)ℓ ℓ=1 v2 ℓ E[(xt+s)2 ℓ| Fs] + ℓ1 =ℓ2 vℓ1vℓ2 E[(xt+s)ℓ1 | Fs] E[(xt+s)ℓ2] | Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] + ℓ=1 v2 ℓ (E[(xt+s)ℓ| Fs])2 Learning from many trajectories ℓ1 =ℓ2 vℓ1vℓ2 E[(xt+s)ℓ1 | Fs] E[(xt+s)ℓ2] | Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] + ℓ=1 vℓ E[(xt+s)ℓ| Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] (c) = v TΣt(Px)v. Above, (a) follows since each coordinate of xt is independent by definition, (b) follows from Proposition B.12, and (c) follows since E[xt] = 0 and each coordinate is independent, so E[(xt)ℓ1(xt)ℓ2] = E[(xt)ℓ1]E[(xt)ℓ2] = 0 for ℓ1 = ℓ2. Hence, we have shown: t=1 v, xt+s 2 Fs v = v TΓk(Px)v. Note that because we assume that k knd, the covariances Σt(Px) are all invertible (and hence so is Γk(Px)). The claim now follows. Example 4.8 (Degree-D Volterra series). Fix a D N+. Let {c(d,ℓ) i1,...,id}i1,...,id N for d {1, . . . , D} and ℓ {1, . . . , n} denote arbitrary rank-d arrays. Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. Consider the process Px where for t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: i1,...,id=0 c(d,ℓ) i1,...,id d =1 w(ℓ) t id 1. (4.4) Let Tnd := inf{t N+ | det(Γt(Px)) = 0}, and suppose Tnd is finite. There is a constant c D > 0, depending only on D, such that for any T Tnd, the process Px satisfies the (T, T, ΓT (Px), c D, 1/(2D))-Traj SB condition. Proof Fix a v Rn. The definition (4.4) expresses v, xt as a degree at most D polynomial in the noise variables {w(ℓ) t }. By Theorem B.7, there exists a positive constant c D, only depending on D, such that: t=1 v, xt 2 εE t=1 v, xt 2 #) (c Dε)1/(2D). Since T Tnd, the matrix ΓT (Px) is invertible. The claim now follows. B.3 Proof of Proposition 4.2 Proposition 4.2 (Average small-ball implies trajectory small-ball). Fix T N+, k {1, . . . , T}, {Ψj} T/k j=1 Symn >0, and α, β (0, 1). Let Px be a covariate distribution, with Tu, Frostig, and Soltanolkotabi {xt}t 1 adapted to a filtration {Ft}t 1. Suppose for all v Rn \{0} and j {1, . . . 
, T/k }: t=(j 1)k+1 Pxt Px n v, xt 2 α v TΨjv F(j 1)k o β a.s., (4.6) where F0 is the minimal σ-algebra. Then, for all v Rn \ {0}, j {1, . . . , T/k }, and ε (0, α) t=(j 1)k+1 v, xt 2 ε v TΨjv β 1 ε/α a.s. (4.7) Proof The following proof builds on the argument given in Simchowitz et al. (2018, Section E.1). We note that a similar style of proof is used in Bartlett et al. (2020, Lemma 15). Define the shorthand notation Pt{ } := P{ | Ft}, and similarly Et[ ] := E[ | Ft]. Now fix a v Rn \ {0}, j {1, . . . , T/k }. Markov s inequality yields that: t=(j 1)k+1 v, xt 2 αv TΨjv 1 t=(j 1)k+1 1{ v, xt 2 > αv TΨjv}, and therefore for all ε > 0: t=(j 1)k+1 v, xt 2 ε v TΨjv t=(j 1)k+1 1{ v, xt 2 > αv TΨjv} ε/α Define Zj := 1 k Pjk t=(j 1)k+1 1{ v, xt 2 > αv TΨjv}, and observe that Zj [0, 1]. By (4.6), we have: E(j 1)k[Zj] 1 β. On the other hand: E(j 1)k[Zj] = E(j 1)k[Zj1{Zj > ε/α}] + E(j 1)k[Zj1{Zj ε/α}] P(j 1)k{Zj > ε/α} + ε/α P(j 1)k{Zj ε/α} since Zj 1 = 1 (1 ε/α)P(j 1)k{Zj ε/α}. Combining both these inequalities, and further restricting ε (0, α), we obtain, P(j 1)k{Zj ε/α} β 1 ε/α, which implies (4.7). Learning from many trajectories B.4 General ordinary least-squares estimator upper bound In this section, we supply the proof of Lemma 5.1. We first start with a result which bounds the minimum eigenvalue of the empirical covariance matrix. Lemma B.13 (Minimum eigenvalue bound via trajectory small-ball). Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-trajectory-small-ball condition (Definition 4.1). Put S := T/k , and ΓT := ΓT (Px). Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT . Define Xm,T := Xm,T Γ 1/2, and: µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: 320ecsb αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) For any t 0, with probability at least 1 2e t, the following statements simultaneously hold: tr( XT m,T Xm,T ) m Tnet λ(Γ, ΓT ), λmin( XT m,T Xm,T ) m Tαµ({Ψj}S j=1, Γ) 8ecsb exp 16kn Proof The proof uses the PAC-Bayes argument for uniform convergence from Mourtada (2022). 
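Before the formal argument, the phenomenon that Lemma B.13 quantifies, namely that λmin of the Gram matrix grows linearly in the total sample count mT, is easy to observe empirically in the simplest small-ball setting of iid isotropic Gaussian covariates. A sketch assuming NumPy; dimensions and sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5  # covariate dimension; arbitrary

def min_eig_gram(num_samples):
    # Gram matrix of iid isotropic Gaussian covariates.
    X = rng.standard_normal((num_samples, n))
    return np.linalg.eigvalsh(X.T @ X).min()

small = min_eig_gram(2_000)
large = min_eig_gram(8_000)
# Quadrupling the sample count should roughly quadruple the minimum
# eigenvalue; allow generous slack for random fluctuation.
assert 2.0 < large / small < 8.0
```

The point of the PAC-Bayes argument below is to establish such a linear lower bound uniformly over directions, for dependent covariates satisfying only the trajectory small-ball condition.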
The first step is to construct a family of random variables, indexed by both v Sn 1 and a scale parameter η > 0, such that its moment generating function is pointwise bounded by one. For notational brevity, let: λ := λ(Γ, ΓT ), µ := µ({Ψj}S j=1, Γ). Since Γ ΓT by assumption, we have λ (0, 1]. Similarly, since 1 S PS j=1 Ψj Γ, we also have µ (0, 1] by Proposition B.3. The trajectory small-ball condition (4.1) implies for any v Sn 1, j {1, . . . , S}, and ε > 0, by substituting v Γ 1/2v and lower bounding v TΓ 1/2ΨjΓ 1/2v λ(Ψj, Γ): t=(j 1)k+1 v, Γ 1/2xt 2 ελ(Ψj, Γ) Using a change of variables ε ε/λ(Ψj, Γ), t=(j 1)k+1 v, Γ 1/2xt 2 ε (csb/λ(Ψj, Γ) ε)α . Tu, Frostig, and Soltanolkotabi By Proposition B.4, for any η > 0, t=(j 1)k+1 v, Γ 1/2xt 2 + α log ηλ(Ψj, Γ) 1 a.s. (B.5) For i {1, . . . , m} and j {1, . . . , S}, define the random variables Z(i) j (v; η), Z(i)(v; η), and Z(v; η): Z(i) j (v; η) := η t=(j 1)k+1 v, Γ 1/2x(i) t 2 + α log ηλ(Ψj, Γ) Z(i)(v; η) := j=1 Z(i) j (v; η), i=1 Z(i)(v; η). We first claim that E[exp(Z(v; η))] 1 for every v Sn 1 and η > 0. Since Z(i)(v; η) is independent of Z(i )(v; η) whenever i = i , we have that: E[exp(Z(v; η))] = E i=1 Z(i)(v; η) i=1 E[exp(Z(i)(v; η))]. Furthermore, by repeated applications of the tower property and (B.5), for every i {1, . . . , m}, E[exp(Z(i)(v; η))] = E j=1 Z(i) j (v; η) j=1 Z(i) j (v; η) E[exp(Z(i) S (v; η)) | F(S 1)k] j=1 Z(i) j (v; η) Hence E[exp(Z(v; η))] 1 for every v Sn 1 and η > 0. Let us now import some notation from Mourtada (2022, Section 4). First, let π denote the spherical measure on Sn 1, and let ρv,γ denote the uniform measure over the spherical cap C(v, γ) := {w Sn 1 | v w 2 γ}. Learning from many trajectories Next, let Fv,γ(Σ) := R C(v,γ) w, Σw dρv,γ for any symmetric matrix Σ. Fix any positive t, η. For two measures µ and ν with µ absolutely continuous w.r.t. ν, let KL(µ, ν) := Eµ log dµ dν denote the KL-divergence between µ and ν. By the PAC-Bayes deviation bound (cf. 
Mourtada, 2022, Lemma 4), there exists an event Et,1 with probability at least 1 e t, such that on Et,1, we have for every v Sn 1 and γ > 0, t=1 x(i) t (x(i) t )TΓ 1/2 ! + m Sα log ηµ KL(ρv,γ, π) + t. (B.6) Next, by Mourtada (2022, Section 4.3), we can write Fv,γ in terms of a scalar function φ such that: Fv,γ(Σ) = (1 φ(γ)) v, Σv + φ(γ) 1 n tr(Σ), φ(γ) 0, n n 1γ2 . (B.7) Furthermore, for every v Sn 1 and γ > 0, the KL-divergence term can be upper bounded by (Mourtada, 2022, Section 4.4): KL(ρv,γ, π) n log 1 + 2 Therefore on Et,1, plugging (B.7) and (B.8) into (B.6), λmin XT m,T Xm,T k η(1 φ(γ)) m Sα log ηµ n log 1 + 2 φ(γ) 1 φ(γ) 1 n tr XT m,T Xm,T . Restricting γ [0, 1/2], we have from (B.7) that 0 φ(γ) n n 1γ2 2γ2 1/2. Hence, 1 φ(γ) [1/2, 1]. Furthermore, 1 + 2/γ 5/(4γ2). Therefore, λmin XT m,T Xm,T k m Sα log ηµ n tr XT m,T Xm,T . Define the non-negative random variables ψi := PT t=1 Γ 1/2x(i) t 2 2, for i = 1, . . . , m. It is straightforward to verify that tr( XT m,T Xm,T ) = Pm i=1 ψi. By Markov s inequality, for any β > 0: P tr( XT m,T Xm,T ) > β = P E [Pm i=1 ψi] β = m T tr(Γ 1ΓT ) β m Tnλmax(Γ1/2 T Γ 1Γ1/2 T ) β = m Tn Therefore, setting β = etm Tn λ , there exists an event Et,2 such that P(Ec t,2) e t and on Et,2, tr( XT m,T Xm,T ) etm Tn Tu, Frostig, and Soltanolkotabi Therefore on Et,1 Et,2, which we assume holds for the remainder of the proof, we have: λmin XT m,T Xm,T k m Sα log ηµ Next, we further restrict ηµ csb e so that log(ηµ/csb) 1. Now consider, for positive constants A, B, C, the function x 7 A log(B/x)+Cx on the domain (0, ). The derivative vanishes at x = A/C, and the function attains a minimum value of A(1 + log(BC/A)) with this choice of x. Let us set: 4, C 4m Tet Then by choosing γ2 = knλ 4ηm Tet , we have that: 1 + log 5m Tetη Note that this choice of γ satisfies γ [0, 1/2], since: knλ 4ηm Tet 1 4 = kn ηm T 1 since t 0 and λ 1 = knµ ecsbm T 1 since η ecsb/µ = n ecsb m T k since µ 1, and the last condition holds by (B.3). 
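The one-dimensional minimization fact used to choose γ above, that x ↦ A log(B/x) + Cx is minimized at x = A/C with value A(1 + log(BC/A)), can be verified directly. A stdlib-only sketch with arbitrary positive constants:

```python
import math

# f(x) = A log(B/x) + C x on (0, inf): minimizer x* = A/C, minimum
# value A (1 + log(B C / A)).
A, B, C = 2.0, 5.0, 3.0  # arbitrary positive constants

def f(x):
    return A * math.log(B / x) + C * x

x_star = A / C
f_star = A * (1 + math.log(B * C / A))

assert abs(f(x_star) - f_star) < 1e-12
# f is strictly convex (f'' = A / x**2 > 0), so a local grid check
# around x* confirms it is the global minimum.
for d in (k / 100 for k in range(-30, 31) if k != 0):
    assert f(x_star) <= f(x_star + d)
```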
With this choice of γ, we have: λmin XT m,T Xm,T m Sα log ηµ t n 1 + log 5m Tetη (m Sα n) log ηµ t n 1 + log 5csbm Tet t n 1 + log 5csbm Tet since m S 2n/α by (B.3) (1 + n)t n 1 + log 5csbm T 2nt n 1 + log 5csbm T Learning from many trajectories " log(ηµ/csb) 8nt m Sα ηµ/csb To justify inequality (a), we first note that m Sα 4n 1 + log 5csbm T holds from (B.3), since: 4n 1 + log 5csbm T 8 1 + log 5csbm T since S T/(2k) α log 5ecsb α log 5ecsb α log 5ecsb using Proposition B.1 α log 320ecsb The inequality m Sα 4n 1 + log 5csbm T then implies (a) by observing: + n 1 + log 5csbm T + n 1 + log 5csbm T since ηµ/csb e. It remains to optimize over η [ecsb/µ, ). For any G R, the function η 7 log η G η on (0, ) attains a maximum of exp( 1 G) at η = exp(1 + G). Hence, setting η = csb µ exp(1 + 8nt/(m Sα)), which satisfies η ecsb/µ, we have: λmin XT m,T Xm,T km Sαµ 4ecsb exp 8nt 8ecsb exp 16kn m Tαt since S T/(2k). The claim now follows by gathering the requirements on the quantity m T kn and simplifying as in (B.3). Corollary B.14. Assume the hypothesis of Lemma B.13 hold. Then, XT m,T Xm,T is invertible almost surely. Proof For any t 0, define the event Et as: Et := λmin XT m,T Xm,T < m Tαµ 8ecsb exp 16kn Tu, Frostig, and Soltanolkotabi The event {λmin( XT m,T Xm,T ) = 0} is the intersection T t=1 Et. By Lemma B.13, we have that P(Et) 2e t. Since the events Et Et whenever t t, by continuity of measure from above, P(λmin( XT m,T Xm,T ) = 0) = P = lim t P(Et) lim t 2e t = 0. We are now ready to restate and prove Lemma 5.1. Lemma 5.1 (General OLS upper bound). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-Traj SB condition (Definition 4.1). Put S := T/k and ΓT := ΓT (Px). 
Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and let µ({Ψj}S j=1, Γ) denote the geometric mean of the minimum eigenvalues {λ(Ψj, Γ)}S j=1, i.e., µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Then, for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1csbσ2 ξ pn m Tαλ(Γ, Γ )µ({Ψj}S j=1, Γ) log max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Proof For notational brevity, let: λ := λ(Γ, ΓT ), µ := µ({Ψj}S j=1, Γ). We choose c0 64 such that (5.3) implies (B.3). By Corollary B.14, Xm,T has full column rank almost surely, hence: ˆWm,T W = ΞT m,T Xm,T (XT m,T Xm,T ) 1. Put Xm,T := Xm,T Γ 1/2. With this decomposition, we have: ˆWm,T W 2 Γ = ΞT m,T Xm,T (XT m,T Xm,T ) 1 2 Γ = ΞT m,T Xm,T (XT m,T Xm,T ) 1Γ1/2 2 Γ 1/2Γ Γ 1/2 λmax(Γ 1/2Γ Γ 1/2) ΞT m,T Xm,T (XT m,T Xm,T ) 1Γ1/2 2 F Learning from many trajectories = λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1 XT m,T Ξm,T 2 F min{n, p}λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1 XT m,T Ξm,T 2 op min{n, p}λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) = min{n, p} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λ(Γ, Γ ) λmin( XT m,T Xm,T ) . Fix any t > 0. By Lemma B.13, there exists an event Et,1 with probability at least 1 2e t, such that on Et,1 we have: tr( XT m,T Xm,T ) mn Tet λ , λmin( XT m,T Xm,T ) m Tαµ 8ecsb exp 16kn By our choice of c0, we have m T/k 64n/α. Hence, on Et,1, λmin( XT m,T Xm,T ) ζt := m Tαµ 8ecsb exp( t/4). We now apply Proposition B.10 with V Mt := ζt In and: x1, . . . , x T , x T+1, . . . , x2T , . . . , x(m 1)T+1, . . . , xm T Γ 1/2x(1) 1 , . . . , Γ 1/2x(1) T , Γ 1/2x(2) 1 , . . . , Γ 1/2x(2) T , . . . , Γ 1/2x(m) 1 , . . . , Γ 1/2x(m) T , to conclude that there exists an event Et,2 with probability at least 1 e t such that on Et,2: 1{ XT m,T Xm,T Mt} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op p log 5 + 1 2 log det In + ζ 1 t XT m,T Xm,T + t 32σ2 ξ h p + log det In + ζ 1 t XT m,T Xm,T + t i 32σ2 ξ h p + n log(1 + ζ 1 t tr( XT m,T Xm,T )/n) + t i . 
Above, the last inequality holds since log det(X) n log(tr(X)/n) for any X Symn 0 by the AM-GM inequality. By Proposition B.1, whenever t 8 log 16, we have t et/4. Furthermore, for any t 0 we have 1 et/4. Therefore, for t 8 log 16, on Et,1 Et,2: ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) m Tα et/4σ2 ξ p + n log 1 + 8ecsb αλµ e(1+1/4)t + t m Tα et/4σ2 ξ p + n log 16ecsb + n(1 + 1/4)t + t Tu, Frostig, and Soltanolkotabi m Tα et/4σ2 ξ p + n log 16ecsb p + n log 16ecsb Define the random variable Z as: Z := ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) p + n log 16ecsb We have shown that: P(Z > et/2) 3e t t 8 log 16 P(Z > s) 3s 2 s 164. 0 P(Z > s) ds 164 + 3 Z 164 s 2 ds = 164 + 3/164. That is, for some universal positive c1, E[ ˆWm,T W 2 Γ ] c1σ2 ξ min{n, p}csb p + n log max{e,csb} m Tαλ(Γ, Γ )µ Now, if p n, (B.10) is upper bounded by: p + n log max{e,csb} m Tαλ(Γ, Γ )µ 2c1σ2 ξpcsb n log max{e,csb} m Tαλ(Γ, Γ )µ On the other hand, if p > n, (B.10) is upper bounded by: p + n log max{e,csb} m Tαλ(Γ, Γ )µ < 2c1σ2 ξncsb p log max{e,csb} m Tαλ(Γ, Γ )µ B.5 Proof of Theorem 5.3 Theorem 5.3 (Upper bound for Ind-Seq-LS). There are universal positive constants c0 and c1 such that the following holds. Fix any sequence of distributions {Px,t}t 1, and let Σt := Ext Px,t[xtx T t ] for t N+. Suppose there exists csb > 0 and α (0, 1] such that for all v Rn \ {0}, ε > 0 and t N+: Pxt Px,t n v, xt 2 ε v TΣtv o (csbε)α. (5.7) Learning from many trajectories Furthermore, suppose there exists a cβ 1 and β 0 such that for all s, t N+ satisfying s t: 1 λ(Σs, Σt) cβ(t/s)β. (5.8) n 2, m T c0n β + log max{e, csb}cβ then, for Px = t 1Px,t: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξcβeβ pn m Tα φ cβ(β + 1), (T /T)β β + log max{e, csb}cβ Proof Equation (5.7) shows that Px satisfies the (T, 1, {Σt}T t=1, csb, α)-Traj SB condition. Let Γt := 1 t Pt k=1 Σk for t N+. 
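For intuition, the comparison between Γ_s and Γ_t derived in this proof can be sanity-checked in the scalar case Σ_t = t^β, where the growth condition (5.8) holds with c_β = 1. A stdlib-only sketch; β and the horizon are arbitrary.

```python
# Scalar instance of the covariance-growth condition: Sigma_t = t**beta,
# so (5.8) holds with c_beta = 1. Check the averaged covariances satisfy
# Gamma_s / Gamma_t >= (s/t)**beta / (beta + 1) for all s <= t.
beta = 2.0  # arbitrary growth exponent

def gamma_avg(t):
    # Gamma_t = (1/t) * sum_{k=1}^{t} Sigma_k with Sigma_k = k**beta
    return sum(k ** beta for k in range(1, t + 1)) / t

for t in range(1, 60):
    for s in range(1, t + 1):
        assert gamma_avg(s) / gamma_avg(t) >= (s / t) ** beta / (beta + 1)
```

The inequality follows from Gamma_s >= s^β/(β+1) (integral comparison) and Gamma_t <= t^β.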
For any s, t N+ with s t, λ(Γs, Γt) λ(Γs, Σt) since Γt Σt k=1 λ(Σk, Σt) using Proposition B.2 and Jensen s inequality k=1 (k/t)β using (5.8) 1 cβ(β + 1)(s/t)β since x 7 xβ is increasing. Next, the growth condition (5.8) implies that: µ({Σt}T t=1, ΓT ) = t=1 λ(Σt, ΓT ) t=1 λ(Σt, ΣT ) since ΓT ΣT 1 cβ (t/T)β #1/T using (5.8) = 1 cβT β (T!)β/T 1 cβeβ since T! (T/e)T . We now apply Lemma 5.1 with Γ = ΓT . In doing so, the requirement (5.3) simplifies to: α log max{e, csb}cβeβ Tu, Frostig, and Soltanolkotabi We first assume that T T, in which case (5.4) yields: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξ pn m Tα cβeβ log max{e, csb}cβeβ On the other hand, when T > T, we have λ(ΓT , ΓT ) 1 cb(β+1)(T/T )β, and (5.4) yields: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξ pn m Tα cβeβ cβ(β + 1) T β log max{e, csb}cβeβ B.6 Proofs for linear dynamical systems B.6.1 Control of ratios of covariance matrices Proposition B.15. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Σt := Σt(A, B) for t N+ and γ := γ(A, B). For any integers T1, T2 satisfying 1 T2 T1, λmin(Σ 1/2 T1 ΣT2Σ 1/2 T1 ) 1 Proof Observe that for any t 1, k=0 Ak BB (Ak) = k=0 SDk S 1BB S (Dk) S . By Assumption 5.3, we have that BB is invertible, and hence S 1BB S is also invertible. Therefore we have the following lower and upper bound on Σt: λmin(S 1BB S ) S k=0 Dk(Dk) ! S Σt λmax(S 1BB S ) S k=0 Dk(Dk) ! Now recall that for two square matrices X, Y , the eigenvalues of XY coincide with the eigenvalues of Y X. Letting Qt := Pt 1 k=0 Dk(Dk) , we have: λmin(Σ 1/2 T1 ΣT2Σ 1/2 T1 ) λmin(S 1BB S )λmin(Σ 1/2 T1 SQT2S Σ 1/2 T1 ) = λmin(S 1BB S )λmin((SQT2S )1/2Σ 1 T1 (SQT2S )1/2) λmin(S 1BB S ) λmax(S 1BB S )λmin((SQT2S )1/2(S Q 1 T1 S )(SQT2S )1/2) = λmin(S 1BB S ) λmax(S 1BB S )λmin(QT2Q 1 T1 ). Learning from many trajectories Let λ C be an eigenvalue of A. We have k=0 |λ|2k = 1 |λ|2 if |λ| < 1, t if |λ| = 1. 
Therefore, (QT2Q 1 T1 )ii is: (QT2Q 1 T1 )ii = 1 |λi|2T1 = 1 (|λi|2T1)T2/T1 1 |λi|2T1 if |λi| < 1, T2/T1 if |λi| = 1. Note that infx (0,1) 1 xc 1 x = c for c [0, 1]. Therefore, we can lower bound: λmin(QT2Q 1 T1 ) T2 The claim now follows. Proposition B.16. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Γt := Γt(A, B) for t N+ and γ := γ(A, B). For any integers k, t N+ satisfying k t, we have: λ(Γk, Γt) 1 Proof Let Σt := Σt(A, B) for t N+. We first consider the case when k 2. Observe that Γt Σt. Furthermore, for any k 2, we have: k = k/2 Σ k/2 = k k/2 + 1 λ(Γk, Γt) = λmin(Γ 1/2 t ΓkΓ 1/2 t ) 1 2λmin(Σ 1/2 t Σ k/2 Σ 1/2 t ) (a) 1 Above, (a) follows from Proposition B.15. When k = 1, we have Γ1 = Σ1, and therefore by Proposition B.15: λ(Γ1, Γt) = λ(Σ1, Γt) λ(Σ1, Σt) = λmin(Σ 1/2 t Σ1Σ 1/2 t ) 1 The claim now follows. Fact B.17. Let (A, B) be the dynamics matrices for an LDS-LS instance. For any s, t N+ with s t: Γs(A, B) Γt(A, B). Tu, Frostig, and Soltanolkotabi Proposition B.18. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Γt := Γt(A, B) for t N+, Σt := Σt(A, B) for t N+, and γ := γ(A, B). For any T, we have: t=1 λ(Σt, ΓT ) Proof By Proposition B.16, we have that λ(Γt, ΓT ) 1 8γ t T for all t {1, . . . , T}. Therefore, since λ(Σt, ΓT ) λ(Γt, ΓT ), and since n! (n/e)n for all n N+, t=1 λ(Σt, ΓT ) 8γT 1 8eγ . B.6.2 Many trajectory results Lemma B.19. There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable. If n 2 and m c0n, then for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1σ2 ξ pn m T λ(ΓT (A, B), Γ ). (B.12) Proof Let ΓT := ΓT (A, B). By Example 4.6, LDS-LS satisfies the (T, T, ΓT , e, 1/2)-Traj SB condition. We therefore invoke Lemma 5.1 with k = T and Γ = ΓT . 
In this case, µ from (5.2) simplifies to µ = λ(ΓT , ΓT ) = 1, and the requirement (5.3) simplifies to n 2 and m c0n. Finally, the rate (5.4) simplifies to (B.12). Theorem 5.4 (Parameter recovery upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable, If n 2, m c0n, and T kc, then: E[ ˆWm,T W 2 F ] c1σ2 ξ pn m T λmin(ΓT (A, B)). Proof Follows by invoking Lemma B.19 with Γ = In. Theorem 5.5 (Risk upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable. If n 2, m c0n, T kc, and the evaluation horizon is strict (T T), then: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn Learning from many trajectories On the other hand, suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, m c0n, and the evaluation horizon is extended (T > T), then: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn Proof Let Γt := Γt(A, B) for t N+. Invoking Lemma B.19 with Γ = ΓT yields the bound: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn m T λ(ΓT , ΓT ). If T T, then λ(ΓT , ΓT ) 1 since ΓT ΓT by Fact B.17. On the other hand, if T > T, by Proposition B.16, λ(ΓT , ΓT ) 1 8γ T T . The claim now follows. B.6.3 Few trajectory results Theorem 5.6 (Risk upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, m c0n, and m T c1n log(max{γn/m, e}), then: E[L( ˆWm,T ; T , PA,B x )] c2σ2 ξ pn log(max{γn/m, e}) m T φ γ, c1n log(max{γn/m, e}) Proof Let Γt := Γt(A, B) for all t N+. By Example 4.6, for any k {1, . . . 
, T}, LDS-LS satisfies the (T, k, Γk, e, 1/2)-Traj SB condition. We will apply Lemma 5.1 with Γ = Γk. The quantity µ from (5.2) simplifies to µ = λ(Γk, Γk) = 1. By Proposition B.16, we have that λ(Γk, ΓT ) 1 8γ k T . Hence the requirement (5.3) simplifies to n 2 and kn c log γ T , γ := max{e, γ} (B.13) for some universal positive constant c. Thus, for (B.13) to hold, it suffices to require: m log γ , 2cn As long as 2cn/m 1, then by Proposition B.1, Hence, for (B.14) to hold, it suffices to require m log 8cγ n Tu, Frostig, and Soltanolkotabi Based on (B.15), we choose k as: k = T 4cn/m log(8cγ n/m) To ensure that k 1, we need to ensure that: m T 4cn log(8cγ n/m). (B.17) On the other hand, since 2cn/m 1, we have that: m log 8cγ n which ensures that k T. Thus, our choice of k from (B.16) ensures that (B.13) holds. We now ready to invoke Lemma 5.1 with Γ = ΓT , and conclude for a universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(e/λ(Γk, ΓT )) m Tλ(Γk, ΓT ) . (B.18) First, we assume that T k. By Fact B.17 we have Γk ΓT , and therefore λ(Γk, ΓT ) 1. Equation (B.18) yields: E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(e/λ(Γk, ΓT )) By Proposition B.16, λ(Γk, ΓT ) 1 T 4cn/m log(8cγ n/m) m 64cγn log(8cγ n/m). 
(B.20) Plugging (B.20) into (B.19), and using the inequalities log x x for x > 0 and φ(a, x) 1 for all a 1 yields, for another universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn m T log(e 64cγn/m log(8cγ n/m)) c σ2 ξ pn log(512e (cγ n/m)2) c σ2 ξ pn log(max{γn/m, e}) c σ2 ξ pn log(max{γn/m, e}) m T φ γ, c1 n log(max{γn/m, e}) On the other hand, if T > k, then by Proposition B.16, λ(Γk, ΓT ) 1 T 4cn/m log(8cγ n/m) m 64cγn log(8cγ n/m) T Plugging (B.20) and (B.21) into (B.18) and using again the inequality log x x for x > 0 yields, for a universal c : E[L( ˆWm,T ; T , PA,B x )] Learning from many trajectories m T log(e 64cγn/m log(8cγ n/m)) 64cγn/m log(8cγ n/m) T c σ2 ξ pn log(max{γn/m, e}) m T γ n log(max{γn/m, e}) Furthermore, when T > k, by choosing c1 sufficiently large: 8cn log(8cγ n/m) = c1 n log(max{γn/m, e}) = γc1 n log(max{γn/m, e}) T = φ γ, c1 n log(max{γn/m, e}) The claim now follows. Theorem 5.7 (Risk upper bound for Ind-LDS-LS). There are universal positive constants c0 and c1 such that the following holds for any instance of Ind-LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2 and m T c0n log(max{γ, e}), then: E[L( ˆWm,T ; T , t 1PA,B x,t )] c1σ2 ξ pnγ log(max{γ, e}) Proof Let Γt := Γt(A, B) and Σt := Σt(A, B) for t N+. From Example 4.3, we have that Ind-LDS-LS satisfies the (T, 1, {Σt}T t=1, e, 1/2)-Traj SB condition. We will apply Lemma 5.1 with Γ = ΓT , k = 1, and Γ = ΓT . By Proposition B.18, we have that: µ({Σt}T t=1, ΓT ) 1 8eγ . The requirement (5.3) simplifies to n 2 and m T n c log(max{γ, e}) for a universal constant c. By Lemma 5.1, for a universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(max{γ, e}) m T λ(ΓT , ΓT ) 8eγ. If T T, then λ(ΓT , ΓT ) 1 since ΓT ΓT by Fact B.17. On the other hand, if T > T, then by Proposition B.16, λ(ΓT , ΓT ) 1 8γ T T . The claim now follows. Theorem 5.8 (Parameter recovery upper bound for LDS-LS, few trajectories). 
There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, and m T c0n log(max{γn/m, e}), then: E[ ˆWm,T W 2 F ] c1σ2 ξ pn log(max{γn/m, e}) m T λmin(Γk (A, B)) , k := c2T n/m log(max{γn/m, e}) Proof The proof is identical to that of Theorem 5.6 until (B.18), after which we set T = 1 from which the result follows. B.7 High probability upper bounds B.7.1 Weak trajectory small ball We first present a modified definition of trajectory small-ball (cf. Definition 4.1) which we will use to establish high probability bounds. Definition B.1 (Weak trajectory small-ball (w Traj SB)). Fix a trajectory length T N+, a parameter k {1, . . . , T}, positive definite matrices {Ψj} T/k j=1 Symn >0, and constants α, β (0, 1). The distribution Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-weak-trajectory-small-ball (w Traj SB) condition if: (a) 1 T/k P T/k j=1 Ψj ΓT (Px), (b) {xt}t 1 is adapted to a filtration {Ft}t 1, and (c) for all v Rn \ {0}, j {1, . . . , T/k }: t=(j 1)k+1 v, xt 2 α v TΨjv β a.s. (B.22) The main difference between Definition B.1 and Definition 4.1 is the third condition (B.22), which only needs to hold for a fixed resolution α and failure probability β. By contrast, in Definition 4.1, the condition must hold for all resolutions (there denoted by ε), with failure probabilities that tend to zero as the resolution ε 0 (cf. (4.1)). B.7.2 Ordinary least squares bounds Lemma B.20 (Minimum eigenvalue bound via weak trajectory small-ball). Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-w Traj SB condition (Definition B.1). Put S := T/k and Γt := Γt(Px) for t N+. Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and define the constants: 1 S PS j=1 λ(Ψj, Γ)2 1 S PS j=1 λ(Ψj, Γ) 2 , µ := 1 j=1 λ(Ψj, Γ). (B.23) (Note that 1 CS S always).
Fix δ (0, 1), and suppose that: 1 β log 1280CS α(1 β)λ(Γ, ΓT ) µδ With probability at least 1 δ, the following events simultaneously hold: t=1 x(i) t (x(i) t )TΓ 1/2 ! α(1 β)m T µ t=1 x(i) t (x(i) t )TΓ 1/2 ! 2m Tn λ(Γ, ΓT ) δ. Proof The proof proceeds quite similarly to the proof of Lemma B.13. Thus, we focus mostly on the parts that differ. For notational brevity, let: β := 1 β, λ := λ(Γ, ΓT ), λj := λ(Ψj, Γ). Since Γ ΓT by assumption, we have λ (0, 1]. The first step, in preparation for applying the PAC-Bayes deviation inequality, is to construct a family of random variables with moment generating function upper bounded by one. To do this, we utilize the weak trajectory small-ball condition (B.22), which implies for any v Sn 1 and j {1, . . . , S}: t=(j 1)k+1 v, Γ 1/2xt 2 αλ(Ψj, Γ) Let xt := Γ 1/2xt be the whitened vector. Define the random indicator variables for i = 1, . . . , m and j = 1, . . . , S: B(i) j := 1 t=(j 1)k+1 v, x(i) t 2 αλ(Ψj, Γ) By Markov's inequality: t=1 v, x(i) t 2 kα j=1 λj1{B(i) j = 1}. Hence for any η > 0 and v Sn 1: t=1 v, x(i) t 2 ! j=1 λj1{B(i) j = 1} Now observe: E[exp( ηkαλj1{B(i) j = 1}) | F(j 1)k] = e ηkαλj P(B(i) j = 1 | F(j 1)k) + P(B(i) j = 0 | F(j 1)k) = (e ηkαλj 1)P(B(i) j = 1 | F(j 1)k) + 1 (e ηkαλj 1)β + 1 (a) 1 + ηkαλj + 1 2η2k2α2λ2 j (b) exp ηkαλj + 1 2η2k2α2λ2 j Above, we used the facts (a) for x > 0, we have e x 1 x + x2 2 , and (b) for x R, we have 1 + x ex. Hence by the tower property: t=1 v, x(i) t 2 ! 2η2k2α2λ2 j 2η2k2α2β S X PS j=1 λ2 j PS j=1 λj Now, let us set PS j=1 λj PS j=1 λ2 j = 1, from which we conclude: t=1 v, x(i) t 2 ! PS j=1 λj 2 PS j=1 λ2 j 1 S PS j=1 λj 2 1 S PS j=1 λ2 j By independence across the m trajectories: t=1 v, x(i) t 2 + m Sβ As desired, we have constructed a family of random variables indexed by v Sn 1, with MGF bounded by one.
Using the PAC-Bayes arguments from Lemma B.13 followed by Markov's inequality, with probability at least 1 2e t, for all v Sn 1 and γ [0, 1/2]: i,t x(i) t , v 2 1 2CS n log 5 Choosing γ2 = nλ 4ηm Tet , we have that: 1 + log 5m Tetη Note that this choice of γ satisfies γ [0, 1/2], since: nλ 4ηm Tet 1 4 = n ηm T 1 since t 0 and λ 1. The RHS above is ensured by: PS j=1 λ2 j PS j=1 λj = αCS µ = m T kn CS since α, µ 1. If we further enforce that: 4CS (n + 1) t + log 5m Tη then combining (B.25) with (B.26): i,t x(i) t , v 2 1 2CS t n n log 5m Tetη 2CS (n + 1)t n n log 5m Tη 2CS (n + 1)t (n + 1) log 5m Tη 4ηCS = αβ m T µ For (B.27), it suffices that the following conditions hold: PS j=1 λj PS j=1 λ2 j β log 5 λα µCS By Proposition B.1 it suffices that: β max log 5 λα µCS β log 128CS Note that x log(1/x) 1/e for all x > 0, and therefore: CS log 5 λα µCS + 1 2CS log 5 Hence it suffices that: β max log 5 , log 128CS The claim now follows by simplifying all the required inequalities for the quantity m T/kn. To contrast the effects of the w Traj SB assumption with those of the Traj SB assumption, let us compare Lemma B.20 to its counterpart Lemma B.13. The minimum eigenvalue bound (B.24) from Lemma B.20 differs from the corresponding Traj SB bound (B.4) in the role of the eigenvalues of the matrices {Ψj} T/k j=1 from the small-ball definition. However, due to the differing requirements on the amount of data m T, neither result is necessarily sharper than the other, as detailed in the following remark: Remark B.21. When the matrices {Ψj} T/k j=1 from the trajectory small-ball definition vary across j, Lemma B.13 and Lemma B.20 yield different dependencies on the eigenvalues {λ(Ψj, Γ)} T/k j=1 .
In particular, Lemma B.13 yields a minimum eigenvalue bound scaling as m Tµ, where µ is the geometric mean of the eigenvalues {λ(Ψj, Γ)}, whereas Lemma B.20 yields a bound scaling as m T µ, where µ is the arithmetic mean of the eigenvalues. By the AM-GM inequality, we have that µ µ, so the latter bound is stronger than the former. However, Lemma B.20 has a stronger requirement on the amount of data, requiring that m T kn CS, where CS [1, S] is defined in (B.23), whereas Lemma B.13 has the weaker requirement that m T kn. In the worst case when CS S, the m T kn CS requirement simplifies to the many trajectories assumption m n. Thus, the qualitative behavior of these two bounds is not necessarily comparable. Meanwhile, although neither bound is strictly sharper than the other, if we assume polynomial growth of the {Ψj} matrices, then the two bounds are roughly on par: Remark B.22. When the matrices {Ψj} exhibit low degree polynomial growth, both Lemma B.13 and Lemma B.20 yield similar qualitative behavior. Concretely, let us suppose that k = 1, Ψj = jp I for j [T], and Γ = 1 T PT j=1 Ψj. Then, µ = 1, whereas µ 1+p ep . Thus, if we consider p as constant, then µ µ. We now state our general OLS upper bound under the weak trajectory small-ball condition. Lemma B.23 (General OLS upper bound, high probability). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-w Traj SB condition (Definition B.1). Put S := T/k and Γt := Γt(Px) for t N+. Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and the constants: 1 S PS j=1 λ(Ψj, Γ)2 1 S PS j=1 λ(Ψj, Γ) 2 , µ := 1 j=1 λ(Ψj, Γ). Fix δ (0, 1/e). Suppose that: 1 β log CS α(1 β)λ(Γ, ΓT ) µδ Then, for any Γ Symn >0, with probability at least 1 δ: ˆWm,T W 2 Γ c1σ2 ξ pn log 1 α(1 β)λ(Γ,ΓT ) µδ λ(Γ, Γ )α(1 β)m T µ Proof Put β := 1 β and Xm,T := Xm,T Γ 1/2.
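As a quick numerical aside on Remarks B.21 and B.22, the arithmetic-versus-geometric mean gap can be checked directly. The sketch below (pure Python; the helper name `am_gm` is ours, not from the paper) treats the scalar case k = 1, Ψj = j^p, and Γ = (1/T) Σ_t Ψt, so that λ(Ψj, Γ) = j^p / ((1/T) Σ_t t^p):

```python
import math

def am_gm(T, p):
    """AM and GM of the eigenvalue ratios lambda_j = j^p / ((1/T) sum_t t^p)."""
    mean = sum(t ** p for t in range(1, T + 1)) / T
    lams = [j ** p / mean for j in range(1, T + 1)]
    am = sum(lams) / T                                  # arithmetic mean: exactly 1 here
    gm = math.exp(sum(math.log(l) for l in lams) / T)   # geometric mean (via logs)
    return am, gm

am, gm = am_gm(T=1000, p=2)
```

Consistent with the remarks, `am` equals 1 while `gm` stays bounded below by roughly (1 + p)/e^p (about 0.41 for p = 2), so for constant p the two means, and hence the two bounds, are on par up to constants.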
By the arguments in the proof of Lemma 5.1, ˆWm,T W 2 Γ min{n, p} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λ(Γ, Γ ) λmin( XT m,T Xm,T ) . Put M := (αβ m T µ/8) I := ζ I. By Proposition B.10, with probability at least 1 δ/2: 1{ XT m,T Xm,T M} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op p log 5 + 1 2 log det In + ζ 1 XT m,T Xm,T + log(2/δ) 32σ2 ξ h p + log det In + ζ 1 XT m,T Xm,T + log(2/δ) i 32σ2 ξ h p + n log(1 + ζ 1 tr( XT m,T Xm,T )/n) + log(2/δ) i . Now, by Lemma B.20, with probability at least 1 δ/2, we also have: λmin( XT m,T Xm,T ) ζ, tr( XT m,T Xm,T ) 4m Tn On both events: ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op 32σ2 ξ p + n log 1 + 32 αβ λ µδ p + n log 33 αβ λ µδ Combining the inequalities: ˆWm,T W 2 Γ 512σ2 ξ min{n, p} p + n log 33 αβ λ µδ λ(Γ, Γ )αβ m T µ pn log 33 αβ λ µδ λ(Γ, Γ )αβ m T µ By a union bound, both events hold with probability at least 1 δ, which concludes the proof. B.7.3 Mixing implies weak trajectory small-ball One advantage of Definition B.1 is that it is implied by the standard notions of φ-mixing in the literature (see e.g. Mohri and Rostamizadeh (2008); Duchi et al. (2012); Kuznetsov and Mohri (2017)). In this section, we prove this reduction. First, we state the definition of φ-mixing. Definition B.2 (φ-mixing covariate sequence). Let {xt}t 1 be a covariate sequence which is adapted to a filtration {Ft}t 1. Define the function φ(k) as: φ(k) := sup t N+ sup B Ft Pxt+k( | B) Pxt+k tv. (B.28) The process {xt}t 1 is called φ-mixing if limk φ(k) = 0. We also let φ̄(k) denote the upper envelope of φ(k), i.e., φ̄(k) := sup{φ(k ) : k k}. The following result shows that a φ-mixing covariate sequence where each marginal distribution is weakly small-ball satisfies the weak trajectory small-ball condition. Proposition B.24. Fix α (0, 1) and β (0, 1/4). Suppose that the covariate sequence {xt}t 1 is φ-mixing, and that for every t N+ and v Rn \ {0} we have: Pxt{ v, xt 2 αv TΣtv} β, Σt := E[xtx T t ].
(B.29) Let kmix := inf{k N+ | φ(k) β} and assume that T 2kmix. Put S := T/(2kmix) and suppose that {Ψj}S j=1 satisfies: 4Σt j [S], t [kmix(2j 1) + 1, 2jkmix]. Then, Px satisfies the T, 2kmix, {Ψj}T/(2kmix) j=1 , α, 4 2 + β -w Traj SB condition (cf. Definition B.1). Proof Fix j [S]. Since Ψj 1 4Σt for all t [kmix(2j 1) + 1, 2jkmix], we have: t=kmix(2j 1)+1 Σt. j=1 Ψj 1 4Skmix t=kmix(2j 1)+1 Σt 1 4Skmix t=1 Σt ΓT . Above, the last inequality holds since T/(2kmix) T/(4kmix). By definition of φ-mixing (cf. Definition B.2) and the upper envelope φ, for any j [S] and t kmix(2j 1) + 1: Pxt n v, xt 2 4α v TΨjv F(j 1)2kmix o Pxt n v, xt 2 4α v TΨjv o + β Pxt n v, xt 2 α v TΣtv o + β t=(j 1)2kmix+1 Pxt n v, xt 2 4α v TΨjv F(j 1)2kmix o 1 2kmix [kmix + 2βkmix] = 1 The claim now follows from Proposition 4.2. We conclude by noting that φ-mixing is a stronger notion of mixing than β-mixing, where (B.28) is only required to hold in expectation. We leave to future work an analysis that only relies on the weaker β-mixing. Appendix C. Analysis for lower bounds C.1 Preliminaries Here, we collect the necessary auxiliary results we will use to prove the lower bound. The first result is an instance of the well-known fact that the conditional mean is the estimator which minimizes the mean squared error. Proposition C.1. Let T N+ and {Px,t}T t=1 be a sequence of distributions over Rn with finite second moments Σt := Ext Px,t[xtx T t ]. Let PW be an arbitrary distribution on Rp n. Put ΓT := 1 T PT t=1 Σt. We have: inf ˆ W EW PW t=1 Ext Px,t ˆW(xt) Wxt 2 2 = EW PW EW PW [W ] W 2 ΓT , where the infimum ranges over measurable functions ˆW : Rn Rp. Proof Let µT := 1 T PT t=1 Px,t denote the uniform mixture distribution, so that t=1 Ext Px,t ˆW(xt) Wxt 2 2 = E x µT ˆW( x) W x 2 2.
By repeated applications of Fubini's theorem, inf ˆ W EW PW E x µT ˆW( x) W x 2 2 = inf ˆ W E x µT EW PW ˆW( x) W x 2 2 inf ˆy Rp EW PW ˆy W x 2 2 = E x µT EW PW EW PW [W ] x W x 2 2 = EW PW E x µT EW PW [W ] x W x 2 2 = EW PW EW PW [W ] W 2 ΓT . The next result is a simple fact which states that if a function is strictly increasing and concave on an interval, then any root of the function is lower bounded by the root of the linear approximation at any point in the interval. Proposition C.2. Let f : I R be a C1(I) function that is strictly increasing and concave on an interval I R. Suppose that f has a (unique) root x0 I. For any x I, we have that: Proof Because f is concave on I, we have that: 0 = f(x0) f(x) + f (x)(x0 x). Next, because f is strictly increasing on I, we have that f (x) > 0. The claim now follows by re-arranging the previous inequality. The next result states that the trace inverse of any positive definite matrix is lower bounded by the trace inverse of any principal submatrix. The claim is immediate from Cauchy's eigenvalue interlacing theorem, but we give a more direct proof. Proposition C.3. Let M Rq n have full column rank. Let I {1, . . . , n} be any index set, and let EI : Rn R|I| denote any linear map which extracts the coordinates associated to I. We have: tr((MTM) 1) tr((EIMTMET I ) 1). Proof Fix a z Rn. Since M has full column rank, we have that (MT) = M(MTM) 1. Therefore, min c Rq:MTc=z c 2 2 = (MT) z 2 2 = z T(MTM) 1z. Taking expectation with z N(0, In), tr((MTM) 1) = Ez N(0,In) min c Rq:MTc=z c 2 2 On the other hand, we have that: min c Rq:MTc=z c 2 2 min c Rq:EIMTc=EIz c 2 2. This is clear because for any c Rq satisfying MTc = z, the equality EIMTc = EIz trivially holds. This means we have the following set inclusion: {c Rq | MTc = z} {c Rq | EIMTc = EIz}.
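As an aside, Proposition C.3 is straightforward to sanity-check numerically. A minimal sketch with NumPy on a random instance (the dimensions, seed, and index set are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
q, n = 12, 5
M = rng.standard_normal((q, n))      # full column rank almost surely
G = M.T @ M                          # the Gramian M^T M

def tr_inv(A):
    """Trace of the inverse of a positive definite matrix."""
    return float(np.trace(np.linalg.inv(A)))

I = [0, 2, 4]                        # an index set of coordinates
full = tr_inv(G)                     # tr((M^T M)^{-1})
sub = tr_inv(G[np.ix_(I, I)])        # tr((E_I M^T M E_I^T)^{-1})
```

The proposition asserts `full >= sub` for every index set, which the random instance confirms.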
Therefore, minimizing any function over the first set will be lower bounded by minimizing the same function over the second set. From this inclusion, we conclude for any index set I: tr((MTM) 1) = Ez N(0,In) min c Rq:MTc=z c 2 2 min c Rq:EIMTc=EIz c 2 2 = Ez N(0,I|I|) min c Rq:EIMTc=z c 2 2 = tr((EIMTMET I ) 1). Next, we state well-known upper and lower tail bounds for chi-squared random variables. Lemma C.4 (Laurent and Massart (2000, Lemma 1)). Let g1, . . . , gD be iid N(0, 1) random variables, and let a1, . . . , aD be non-negative scalars. For any t > 0, we have: i=1 ai(g2 i 1) 2 i=1 a2 i + 2t max i=1,...,D ai i=1 ai(g2 i 1) 2 Finally, we conclude with a convex extension of Gordon's min-max theorem. Theorem C.5 (Thrampoulidis et al. (2014, Theorem II.1)). Let A Rm n, g Rm, and h Rn have iid N(0, 1) entries and be independent of each other. Suppose that S1 Rn and S2 Rm are non-empty compact convex sets, and let ψ : S1 S2 R be a continuous, convex-concave function. For every t R, we have: P min x S1 max y S2 h y TAx + ψ(x, y) i t 2P min x S1 max y S2 h x 2g Ty + y 2h Tx + ψ(x, y) i t . C.2 Proof of Lemma 6.1 We first prove the following intermediate result, which holds under the Gaussian observation noise model (Definition 7.1). Lemma C.6. Let T N+, {Px,t}T t=1 be a sequence of distributions over Rn with finite second moments Σt := Ext Px,t[xtx T t ], and σξ > 0. Let PX be a distribution on Rq n with q n such that for X PX, XTX is invertible almost surely. For W Rp n, let PW be the distribution over Rq n Rq p with (X, Y ) PW satisfying X PX and Y | X = XW T + Ξ, where Ξ Rq p has iid N(0, σ2 ξ) entries (and is independent of everything else). Put ΓT := 1 T PT t=1 Σt. We have that: inf ˆ W sup W Rp n E(X,Y ) PW t=1 Ext Px,t ˆW(X, Y, xt) Wxt 2 2 σ2 ξp EX PX tr(Γ1/2 T (XTX) 1Γ1/2 T ), where the infimum ranges over all measurable functions ˆW : Rq n Rq p Rn Rp.
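As an aside, the upper tail bound of Lemma C.4, which reads P(Σ aᵢ(gᵢ² − 1) ≥ 2√(t Σ aᵢ²) + 2t maxᵢ aᵢ) ≤ e⁻ᵗ, can be illustrated by simulation. A NumPy sketch; the weights, deviation t, seed, and trial count are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
D, t, trials = 50, 1.0, 20000
a = np.ones(D)                                    # non-negative weights a_i
g = rng.standard_normal((trials, D))
stat = ((g ** 2 - 1) * a).sum(axis=1)             # sum_i a_i (g_i^2 - 1), per trial
thresh = 2 * np.sqrt(t * (a ** 2).sum()) + 2 * t * a.max()
emp = float((stat >= thresh).mean())              # empirical upper-tail probability
```

The empirical tail frequency comes out well below the stated bound e⁻ᵗ, as expected (the bound is not tight for uniform weights).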
Proof The proof extends the Bayesian argument from Mourtada (2022, Theorem 1). Let pλ be any prior distribution over Rp n. Let µT := 1 T PT t=1 Px,t denote the uniform mixture. Bounding the minimax risk from below by the Bayes risk: inf ˆ W sup W Rp n E(X,Y ) PW E x µT ˆW(X, Y, x) W x 2 2 inf ˆ W EWλ pλE(X,Y ) PWλE x µT ˆW(X, Y, x) Wλ x 2 2 = inf ˆ W E(X,Y )EWλ|(X,Y )E x µT ˆW(X, Y, x) Wλ x 2 2 using Fubini's theorem = E(X,Y ) inf ˆ WX,Y EWλ|(X,Y )E x µT ˆWX,Y ( x) Wλ x 2 2 where ˆWX,Y maps Rn Rp = E(X,Y )EWλ|(X,Y ) E[Wλ | X, Y ] Wλ 2 ΓT using Proposition C.1. Now let Wλ pλ have iid N(0, 1/λ) entries for λ > 0. Noting that vec(Y ) = (Ip X)vec(W T λ ) + vec(Ξ), we see that the vector vec(W T λ ) vec(Y ) is jointly Gaussian conditioned on X: vec(W T λ ) vec(Y ) λIpn 1 λ(Ip XT) 1 λ(Ip XXT) + σ2 ξIqp Therefore, the distribution of vec(W T λ ) | X, Y is: vec(W T λ ) | X, Y N(µλ, Σλ), λ(Ip XXT) + σ2 ξIqp λ2 (Ip XT) 1 λ(Ip XXT) + σ2 ξIqp A generalization of the identity XT( 1 λXXT + σ2 ξIq) 1 = ( 1 λXTX + σ2 ξIn) 1XT yields: λ(Ip XXT) + σ2 ξIqp λ(Ip XTX) + σ2 ξInp E[vec(W T λ ) | X, Y ] vec(W T λ ) = µλ vec(W T λ ) = h (Ip XTX) + σ2 ξλInp i 1 (Ip XT)vec(Y ) vec(W T λ ) = h (Ip XTX) + σ2 ξλInp i 1 (Ip XTX) Inp vec(W T λ ) + h (Ip XTX) + σ2 ξλInp i 1 (Ip XT)vec(Ξ). Observing that E[Wλ | X, Y ] Wλ 2 ΓT = E[vec(W T λ ) | X, Y ] vec(W T λ ) 2 Ip ΓT , and defining MX(λ) := (Ip XTX) + σ2 ξλInp, we have the following bias-variance decomposition: EX,Ξ,Wλ E[Wλ | X, Y ] Wλ 2 ΓT = EX,Ξ,Wλ h M 1 X (λ)(Ip XTX) Inp i vec(W T λ ) 2 + σ2 ξEX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) σ2 ξEX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) .
Since λ 7 tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) is non-negative and decreasing in λ for λ > 0, by the monotone convergence theorem: lim λ 0+ EX,Ξ,Wλ E[Wλ | X, Y ] Wλ 2 ΓT σ2 ξ lim λ 0+ EX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) = σ2 ξEX tr (Ip Γ1/2 T )M 1 X (0)(Ip XTX)M 1 X (0)(Ip Γ1/2 T ) = σ2 ξEX tr((Ip Γ1/2 T (XTX) 1Γ1/2 T )) = σ2 ξp EX tr(Γ1/2 T (XTX) 1Γ1/2 T ). Since the first expression above lower bounds the minimax risk, this concludes the proof. We now restate and prove Lemma 6.1. Lemma 6.1 (Expected trace of inverse covariance bounds risk from below). Fix m, T N+ and a set of covariate distributions Px. Suppose that for every Px Px, the data matrix Xm,T Rm T n drawn from m i=1Px has full column rank almost surely. The minimax risk R(m, T, T ; Px) satisfies: R(m, T, T ; Px) σ2 ξp sup Px Px E m i=1Px h tr Γ1/2 T (Px)(XT m,T Xm,T ) 1Γ1/2 T (Px) i . (6.1) Proof Fix a Px Px, and let {Px,t}T t=1 denote its marginal distributions up to time T . Let Pg ξ denote the σξ-MDS corresponding to the Gaussian observation noise model (Definition 7.1). Note that for any hypothesis f : Rn Rp, we have from (3.2): L( ˆf; T , Px) = EPx t=1 ˆf(xt) W xt 2 2 t=1 Ext Px,t ˆf(xt) W xt 2 2. By the definition of R(m, T, T ; Px) from (3.5) and Lemma C.6: R(m, T, T ; Px) inf Alg sup W E m i=1PW x,y [Px,Pg ξ] h L Alg({(x(i) t , y(i) t )}m,T i=1,t=1); T , Px i σ2 ξp E m i=1Px h tr Γ1/2 T (Px)(XT m,T Xm,T ) 1Γ1/2 T (Px) i . Since the bound above holds for any Px Px, we can take the supremum over Px Px, from which the claim follows. C.3 A general risk lower bound We now state a lower bound which applies with an arbitrary number of trajectories. Lemma C.7. Suppose that Px is any set containing P0n n x and PIn x . Let m T n. Then: R(m, T, T ; Px) σ2 ξ 2 pn Proof Define ζ(A) := E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i . By Lemma 6.1: R(m, T, T ; Px) σ2 ξp max{ζ(0n n), ζ(In)}.
(C.1) Next, for any M Symn 0, the function X 7 tr(M1/2X 1M1/2) is convex on the domain Symn >0. To see this, we define f(X; v) := v TX 1v for X Symn >0. We can write f(X; v) as f(X; v) = sup z TXz + 2v Tz | z Rn ; therefore X 7 f(X; v) is convex on Symn >0, since it is the pointwise supremum of functions affine in X. Now we see that X 7 tr(M1/2X 1M1/2) is convex, since tr(M1/2X 1M1/2) = Pn i=1 f(X; M1/2ei), which is the sum of convex functions. Therefore by Jensen's inequality, whenever XT m,T Xm,T is invertible almost surely, ζ(A) = E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i tr Γ1/2 T (A)(E m i=1PA x [XT m,T Xm,T ]) 1Γ1/2 T (A) = tr Γ1/2 T (A)(m T ΓT (A)) 1Γ1/2 T (A) = tr(ΓT (A)Γ 1 T (A)) m T . We first consider the case when A = 0n n. Under these dynamics, it is a standard fact that when m T n, then XT m,T Xm,T is invertible almost surely. Furthermore, Γt(0n n) = In for all t. Hence, ζ(0n n) n m T . Next, we consider the case when A = In. We first argue that as long as m T n, the matrix XT m,T Xm,T is invertible almost surely. We write x(i) t = Pt k=1 w(i) k , where {w(i) t }m,T i=1,t=1 are all iid N(0, In) vectors. Let p : Rm Tn R be the polynomial p({w(i) t }) = det(XT m,T Xm,T ). The zero-set of p is either all of Rm Tn, or Lebesgue measure zero. We will select {w(i) t } so that p({w(i) t }) = 0, which shows that the zeros of this polynomial are not all of Rm Tn, and hence Lebesgue measure zero. Since the Gaussian measure on Rm Tn is absolutely continuous w.r.t. the Lebesgue measure on Rm Tn, this implies that det(XT m,T Xm,T ) = 0 almost surely. To select {w(i) t }, we introduce some notation. Let ei Rn denote the i-th standard basis vector. For any positive integer k, let U(k) Rk k be the upper triangular matrix with ones for all its non-zero entries. Let S(k) = U(k)U(k)T. By construction, S(k) is invertible since U(k) is invertible. We put w(i) t = e(i 1)T+t 1{(i 1)T + t n}.
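The deterministic choice w(i)t = e(i−1)T+t 1{(i−1)T+t ≤ n} just defined can be reproduced directly. A NumPy sketch with small illustrative n, m, T of our choosing (with T < n and mT ≥ n), forming x(i)t = Σ_{k≤t} w(i)k under the A = In dynamics and checking that the resulting Gramian is invertible:

```python
import numpy as np

n, m, T = 5, 3, 2                    # m*T >= n and T < n (illustrative choice)
rows = []
for i in range(1, m + 1):
    x = np.zeros(n)
    for t in range(1, T + 1):
        idx = (i - 1) * T + t        # w_t^{(i)} = e_{(i-1)T+t} if the index <= n
        w = np.zeros(n)
        if idx <= n:
            w[idx - 1] = 1.0
        x = x + w                    # x_t^{(i)} = sum_{k<=t} w_k^{(i)}
        rows.append(x.copy())

X = np.stack(rows)                   # the (m*T) x n data matrix X_{m,T}
det = float(np.linalg.det(X.T @ X))
```

Here XᵀX is block diagonal with blocks S(2), S(2), S(1) (in the paper's notation), so its determinant is nonzero, exhibiting one point where the polynomial p does not vanish.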
We now claim that with this choice of {w(i) t }, the matrix XT m,T Xm,T is invertible. Suppose first that T n. Then we have that XT m,T Xm,T = S(n), and therefore det(XT m,T Xm,T ) = 0. On the other hand, suppose that T < n. Because m T n, we have that: XT m,T Xm,T = BDiag(S(T), . . . , S(T) | {z } n/T times , S(n T n/T )), where BDiag(M1, . . . , Mk) denotes the block diagonal matrix with diagonal blocks M1, . . . , Mk. Since S(T) and S(n T n/T ) are both invertible, so is XT m,T Xm,T and therefore det(XT m,T Xm,T ) = 0. Thus, XT m,T Xm,T is invertible almost surely. Next, we note that Σt(In) = t In and Γt(In) = 1 t Pt k=1 k In = t+1 2 In. Hence we have ΓT (In)Γ 1 T (In) = T +1 2T In, and therefore ζ(In) n 2m T T T . Combining our bounds on ζ(0n n) and ζ(In), we have the desired claim: R(m, T, T ; Px) σ2 ξp max n m T , n 2m T T C.4 Non-isotropic random gramian matrices The goal of this subsection is to prove Lemma 7.1, which gives a bound on the expected trace inverse of a non-isotropic random gramian matrix. We first prove an auxiliary lemma, which will be used as a building block in the proof. Lemma C.8. Fix any x Rq. Let g Rq and h Rn be random vectors with iid N(0, 1) entries, and let W Rq n be a random matrix with iid N(0, 1) entries. Let Σ Rq q be positive definite. We have that: E min α Rn Σ1/2Wα x 2 2 E min β 0 max τ 0 τ + βg Σ 1/2x 2 (Σ 1+β h 2τIq) 1 Proof The proof invokes the convex Gaussian min-max lemma (Theorem C.5) via a limiting argument. In what follows, let {αk}k 1 and {vk}k 1 be any two positive, increasing sequences of scalars tending to + . It is clear that for every W, lim k min α 2 αk Σ1/2Wα x 2 2 = min α Rn Σ1/2Wα x 2 2. Since α = 0 is always a feasible solution to min α 2 αk Σ1/2Wα x 2 2, we have for every k 1: 0 min α 2 αk Σ1/2Wα x 2 2 x 2 2. Therefore, by the dominated convergence theorem, E min α Rn Σ1/2Wα x 2 2 = E lim k min α 2 αk Σ1/2Wα x 2 2 = lim k E min α 2 αk Σ1/2Wα x 2 2.
We next state two variational forms which we will use: 1 2 x 2 2 = max v Rq v Tx v 2 2 2 x 2 = min τ 0 x 2 2τ 2 + 1 Using the first variational form (C.3), we have for every W and k1 1, min α 2 αk1 1 2 Σ1/2Wα x 2 2 = min α 2 αk1 max v Rq v T(Σ1/2Wα x) v 2 2 2 = min α 2 αk1 max v Rq v TWα v TΣ 1/2x v TΣ 1v = min α 2 αk1 max v 2 ΣW opαk1+ Σ1/2x 2 v TWα v TΣ 1/2x v TΣ 1v = lim k2 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v Observe that for every k2 1, 0 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v max v 2 vk2 v TΣ 1/2x v TΣ 1v v TΣ 1/2x v TΣ 1v Therefore, by (C.2) and another application of the dominated convergence theorem: E min α Rn Σ1/2Wα x 2 2 = lim k1 E min α 2 αk1 Σ1/2Wα x 2 2 = lim k1 E lim k2 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v = lim k1 lim k2 E min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v We now apply Theorem C.5 to the expectation on the RHS of (C.5): E min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v min α 2 αk1 max v 2 vk2 α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = 2E min α 2 αk1 max v 2 vk2 α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v 2E min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Above, inequality (a) is an application of Theorem C.5. Now for every k1, g, and h, define ψk1(g, h) := min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v For every k1, g, and h, we have 0 ψk1(g, h) max v Rq v TΣ 1/2x v TΣ 1v = x 2 2 2 . Furthermore, since {αk} is an increasing sequence, the sequence {ψk(g, h)}k 1 is monotonically decreasing.
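As an aside, the two variational forms above, which we read as ½‖x‖² = max_v ⟨v, x⟩ − ½‖v‖² for (C.3) and ‖x‖ = min_{τ>0} ‖x‖²/(2τ) + τ/2 for (C.4), are easy to verify numerically. A NumPy sketch with a fixed vector x of our choosing (maximizer v = x, minimizer τ = ‖x‖):

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
nx = float(np.linalg.norm(x))            # ||x|| = 5

# (C.3): 0.5*||x||^2 = max_v <v, x> - 0.5*||v||^2, attained at v = x
rng = np.random.default_rng(0)
candidates = rng.standard_normal((200, x.size))
vals_c3 = candidates @ x - 0.5 * (candidates ** 2).sum(axis=1)
at_opt = float(x @ x - 0.5 * (x @ x))    # objective value at v = x

# (C.4): ||x|| = min_{tau > 0} ||x||^2/(2*tau) + tau/2, attained at tau = ||x||
taus = np.linspace(0.1, 10.0, 400)
vals_c4 = (x @ x) / (2 * taus) + taus / 2
```

No random candidate beats v = x in (C.3), and the grid minimum of (C.4) sits at the norm of x, as the closed forms predict.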
Therefore, by the monotone convergence theorem, lim k ψk(g, h) = inf{ψk(g, h) | k N+} = min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Therefore by another application of the dominated convergence theorem, we have that: lim k1 E min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = E lim k1 min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = E min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Chaining together inequalities (C.5), (C.6), (C.7), we have: E min α Rn Σ1/2Wα x 2 2 2E min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v We now proceed to study the RHS of (C.8), which we denote by (AO) (the auxiliary optimization problem): (AO) := min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = min β 0 min θ [ 1,1] max v Rq βg Tv + β v 2 h 2θ v TΣ 1/2x v TΣ 1v (a) = min β 0 max v Rq βg Tv β v 2 h 2 v TΣ 1/2x v TΣ 1v (b) = min β 0 max v Rq max τ 0 βg Tv β h 2 v 2 2 τ 2 β h 2 2τ v TΣ 1/2x v TΣ 1v = min β 0 max τ 0 max v Rq βg Tv β h 2 v 2 2 τ 2 β h 2 2τ v TΣ 1/2x v TΣ 1v = min β 0 max τ 0 2 βg Σ 1/2x 2 (Σ 1+β h 2τIq) 1 Above, (b) holds by the variational form (C.4). The proof is now finished after justifying (a). First, let hβ(θ, v) denote the term in the bracket, so that (AO) = min β 0 min θ [ 1,1] max v Rq hβ(θ, v). Fix a β 0. By weak duality, min θ [ 1,1] max v Rq hβ(θ, v) max v Rq min θ [ 1,1] hβ(θ, v) = max v Rq hβ( 1, v). On the other hand, min θ [ 1,1] max v Rq hβ(θ, v) min θ [ 1,0] max v Rq hβ(θ, v) = max v Rq min θ [ 1,0] hβ(θ, v) = max v Rq hβ( 1, v). The first equality above is Sion's minimax theorem, since the function θ 7 hβ(θ, v) is affine for every v and the function v 7 hβ(θ, v) is concave for θ [ 1, 0]. Therefore, min θ [ 1,1] max v Rq hβ(θ, v) = max v Rq hβ( 1, v). With Lemma C.8 in hand, we can now restate and prove Lemma 7.1. Lemma 7.1. Let q, n be positive integers with q n and n 2. Let W Rq n have iid N(0, 1) entries, and let Σ Rq q be positive definite.
Let g N(0, Iq) and h N(0, In 1), with g and h independent. Also, let {ei}q i=1 be the standard basis vectors in Rq. We have: E tr((W TΣW) 1) n Pq i=1 E minβ 0 maxτ 0 h β h 2 τ + βg ei 2 (Σ 1+β h 2τIq) 1 i. (7.3) Proof We rewrite E tr((W TΣW) 1) in a way that is amenable to Lemma C.8. Let w1 Rq denote the first column of W, so that W = w1 W2 with W2 Rq (n 1). We write: W TΣW = w1 2 Σ w T 1 ΣW2 W T 2 Σw1 W T 2 ΣW2 Using the block matrix inversion formula to compute the (1, 1) entry of (W TΣW) 1: ((W TΣW) 1)11 = (w T 1 (Σ ΣW T 2 (W T 2 ΣW2) 1W T 2 Σ)w1) 1 = (w T 1 Σ1/2(I PΣ1/2W2)Σ1/2w1) 1 = (w T 1 Σ1/2P Σ1/2W2Σ1/2w1) 1. Since the columns of W are all independent and identically distributed, this calculation shows that the law of ((W TΣW) 1)ii is the same as the law of ((W TΣW) 1)11 for all i = 1, . . . , n. Therefore: E tr((W TΣW) 1) = i=1 E((W TΣW) 1)ii = n E(w T 1 Σ1/2P Σ1/2W2Σ1/2w1) 1 n E tr(Σ1/2P Σ1/2W2Σ1/2). The last inequality follows from Jensen's inequality combined with the independence of w1 and W2. By decomposing tr(Σ1/2P Σ1/2W2Σ1/2) = Pq i=1 P Σ1/2W2Σ1/2ei 2 2 and observing that P Σ1/2W2x 2 2 = min α Rn 1 Σ1/2W2α x 2 2 x Rq, we have the following identity: E tr(Σ1/2P Σ1/2W2Σ1/2) = i=1 E min αi Rn 1 Σ1/2W2αi Σ1/2ei 2 2. Invoking Lemma C.8 with x = Σ1/2ei for i = 1, . . . , q yields E tr(Σ1/2P Σ1/2W2Σ1/2) i=1 E min β 0 max τ 0 τ + βg ei 2 (Σ 1+β h 2τIq) 1 where g N(0, Iq) and h N(0, In 1). The claim now follows. We conclude this section with the following technical result which we will use in the sequel. Lemma C.9. Let q, n N+ with q n and n 6, and let Σ Symq >0. Let g N(0, Iq) and h N(0, In 1) with g and h independent. Define the random variables Zi for i {1, . . . , q} as: Zi := min β 0 max τ 0 2τ + β2 g 2 (Σ 1+β h 2τIq) 1 + (Σ 1 + β h 2τIq) 1 ii Let {λi}q i=1 denote the eigenvalues of Σ 1 listed in decreasing order.
Define n1 and the random function p(y) as: 64, p(y) := y λi + yg2 i n1 There exists an event E (over the probability of g and h) such that the following statements hold: (a) P(Ec) e n/128 + e q/16. (b) On E, there exists a unique root y (0, ) such that p(y ) = 0. (c) The following bounds hold for i {1, . . . , q}: Zi Σii, 1{E}Zi 1{E}(Σ 1 + y Iq) 1 ii . (C.11) Proof First, we observe that we can trivially upper bound the value of Zi by setting β = 0 and obtaining the bound Zi Σii. Furthermore, by the rotational invariance of g and the fact that g and h are independent, we have that Zi is equal in distribution to: Zi = min β 0 max τ 0 2τ + β2 q X g2 i λi + β h 2τ + (Σ 1 + β h 2τIq) 1 ii Define the following events: Eh := h 2 n/8 , Eg := i=1 g2 i q/2 and put E := Eh Eg. Since n 6, by a standard computation we have that E h 2 n/4. Therefore, by Gaussian concentration of Lipschitz functions (cf. Wainwright, 2019, Chapter 2), P(Ec h) e n/128. Furthermore, Lemma C.4 yields that P(Ec g) e q/16. By a union bound, P(Ec) e n/128 + e q/16. We now focus on upper bounding the quantity: 1{E}Zi 1{E} min β 0 max τ 0 16τ + β2 q X g2 i λi + β nτ/8 + (Σ 1 + β nτ/8Iq) 1 ii | {z } =:ℓi(β,τ) Let us bracket the value of the game minβ 0 maxτ 0 ℓi(β, τ). We previously noted that ℓi(0, τ) = Σii for all τ [0, ). Next, for any β > 0, limτ ℓi(β, τ) = 0. Hence, min β 0 max τ 0 ℓi(β, τ) [0, Σii]. Recalling from (C.10) that n1 = n/64 (so that n1 = n/8) and defining f, qi as: f(x) := x n1 g2 i λi + x n1 , qi(x) := (Σ 1 + x n1) 1 ii , we have that ℓi(β, τ) = 1 τ 2 f(βτ) + qi(βτ). In order to sharpen our estimate for the value of the game, we will study the positive critical points (β, τ) R2 >0 of the game minβ maxτ ℓi(β, τ), i.e., the points (β, τ) R2 >0 satisfying ℓi β (β, τ) = 0 and ℓi τ (β, τ) = 0. Note that in general for a nonconvex/nonconcave game, this is not a necessary first order optimality condition for the global min/max value (see e.g.
Jin et al., 2020, Proposition 21). However, for every fixed β > 0, the function τ 7 ℓi(β, τ) is strictly concave at each of its stationary points on R>0 by Proposition C.10. Hence, by the implicit function theorem (or alternatively Jin et al. (2020, Theorem 23)), the first order stationarity conditions ℓi β (β, τ) = 0 and ℓi τ (β, τ) = 0 are necessary for global min/max optimality. For τ = 0, this yields: 0 = τ 2f (βτ)β 2τ 3f(βτ) + q i(βτ)β, 0 = τ 2f (βτ)τ + q i(βτ)τ. Together, these conditions imply that f(βτ) = 0, and that the value of the game at such a critical point is qi(βτ). Thus, we are interested in the positive roots of f(x) = 0. To proceed, recall the definition of p(y) from (C.10): y λi + yg2 i n1 Note that y is a positive root of p iff y/ n1 is a positive root of f. Since q n by assumption, observe that on E: lim y p(y) = i=1 g2 i n1/2 q/2 n1/2 n/2 n/64 > 0. On the other hand, p(0) = n1/2 < 0. Since p(y) is continuous and strictly increasing, on E there exists a unique y (0, ) such that p(y ) = 0. Thus, 1{E}Zi 1{E}(Σ 1 + y Iq) 1 ii . Proposition C.10. Let M, A be n n positive definite matrices, and let α, β be positive numbers. Consider the function: τ + (A + βτI) 1, M . Suppose there exists a τ (0, ) satisfying f (τ) = 0. Then, f (τ) < 0. Proof A straightforward computation yields the following expressions for f (τ) and f (τ): f (τ) = ατ 2 β (A + βτI) 2, M , f (τ) = 2ατ 3 + 2β2 (A + βτI) 3, M . The assumed condition f (τ) = 0 implies that: ατ 2 = β (A + βτI) 2, M = 2ατ 3 = 2βτ 1 (A + βτI) 2, M . Next, let A = QΛQT be the eigendecomposition of A, with Λ = diag({λi}n i=1). For any integer k: (A + βτI) k, M = tr(MQ(Λ + βτI) k QT) = QMQT, (Λ + βτI) k = (QMQT)ii (λi + βτ)k . Now, since M is positive definite, (QMQT)ii > 0 for all i [n]. Furthermore, since A is positive definite, λi > 0 for all i [n].
Hence, plugging these expressions into the expression for f″(τ):

f″(τ) = −2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) + 2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 (λ_i + βτ))
 < −2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) + 2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) = 0,

since λ_i + βτ > βτ for every i ∈ [n].

C.5 Proof of Theorem 6.2

Theorem 6.2 (Need for growth assumptions in Ind-Seq-LS when m ≲ n). There exist universal constants c_0, c_1, and c_2 such that the following holds. Suppose that P_x = ⊗_{t≥1} N(0, 2^t I_n), n ≥ 6, mT ≥ n, and m ≤ c_0 n. Then:

R(m, T, T; {P_x}) ≥ c_1 σ_ξ^2 (p/T) 2^{c_2 n/m}.

Proof Let Γ_T := Γ_T(P_x). We have that Γ_T = (2/T)(2^T − 1) I_n, so that (2^T/T) I_n ⪯ Γ_T ⪯ (2^{T+1}/T) I_n. By Lemma 6.1:

R(m, T, T; {P_x}) ≥ σ_ξ^2 p E tr(Γ_T^{1/2} (X_{m,T}^T X_{m,T})^{−1} Γ_T^{1/2}) ≥ (σ_ξ^2 p/T) E tr((2^{−T/2} X_{m,T}^T X_{m,T} 2^{−T/2})^{−1}).

Since each column of X_{m,T} is independent, the matrix X_{m,T} 2^{−T/2} has the same distribution as BDiag(Θ^{1/2}, m) W, where Θ ∈ R^{T×T} is diagonal with Θ_{ii} = 2^{i−T} for i ∈ {1, . . . , T}, and W ∈ R^{mT×n} has iid N(0, 1) entries. Let λ_t = 2^{T−t} for t ∈ {1, . . . , T}. With this notation:

E tr((2^{−T/2} X_{m,T}^T X_{m,T} 2^{−T/2})^{−1}) = E tr((W^T BDiag(Θ, m) W)^{−1}).

Let {g_j}_{j=1}^m be independent isotropic Gaussian random vectors in R^T, and let h ∼ N(0, I_{n−1}) be independent from {g_j}. Define the random variables {Z_i}_{i=1}^T as:

Z_i := min_{β≥0} max_{τ≥0}  −‖h‖_2^2/(2τ^2) + (β‖h‖_2/τ) Σ_{j=1}^m Σ_{t=1}^T g_{j,t}^2/(λ_t + β‖h‖_2 τ) + 1/(λ_i + β‖h‖_2 τ).

By Lemma 7.1, E tr((W^T BDiag(Θ, m) W)^{−1}) ≥ n/(2m Σ_{i=1}^T E[Z_i]). Next, define:

n_1 := n/64,   p(y) := −1/2 + (1/n_1) Σ_{j=1}^m Σ_{t=1}^T y g_{j,t}^2/(λ_t + y).

Since n ≥ 6 and mT ≥ n, we can invoke Lemma C.9 to conclude there exists an event E_1 (over the randomness of {g_j} and h) such that: (a) on E_1, there exists a unique root y* ∈ (0, ∞) such that p(y*) = 0; (b) the following inequalities hold:

Z_i ≤ 1/λ_i,   1{E_1} Z_i ≤ 1{E_1} · 1/(λ_i + y*);   (C.13)

(c) the following estimate holds: P(E_1^c) ≤ e^{−n/128} + e^{−mT/16}.

Now, let c = 1/20, and assume that cn_1/m ≥ 4. We can check easily that cn_1/m ≤ T. Fix a δ ∈ (0, e^{−2}] to be chosen later. Define the integer T_c := ⌈cn_1/m⌉ ∈ {4, . . . , T}, and the events (over the randomness of {g_j} and h):

E_2^{g,T_c} := { Σ_{j=1}^m Σ_{t=1}^{T_c} g_{j,t}^2 ≤ 5m T_c },   E_2^{g,+} := { Σ_{j=1}^m g_{j,t}^2 ≤ 2m + 4 log(t^2 π^2/(6δ)) for all t = 1, . . . , T }.

By Lemma C.4, P((E_2^{g,T_c})^c) ≤ e^{−m T_c}.
Next, Gaussian concentration for Lipschitz functions (cf. Wainwright, 2019, Chapter 2) yields, for any η (0, 1): max t=1,...,T P j=1 g2 j,t m + p Hence by a union bound, and the fact that 6δ/π2 PT t=1 t 2 6δ/π2 P t=1 t 2 = δ, we have that P((Eg,+ 2 )c) δ. Putting E := E1 Eg,Tc 2 Eg,+ 2 , we have: P(Ec) e n/128 + e m T/16 + e m Tc + δ Tu, Frostig, and Soltanolkotabi e n/128 + e m T/16 + e cn1 + δ. (C.14) Next, noting that t/2 log2 log((t + 1)2π2/6) for all t 4: t=Tc 2 t log((t + 1)2π2/(6δ)) t=Tc 2 t+log2 log((t+1)2π2/6) + log(1/δ) t=Tc 2 t/2 + log(1/δ) t=Tc 2 t since Tc 4 2 1)(2 Tc/2 2 T/2) + 2 log(1/δ)(2 Tc 2 T ) (4 + 2 log(1/δ))2 Tc/2 4 log(1/δ)2 Tc/2 since δ (0, e 2). (C.15) λt + y g2 j,t since p(y ) = 0 t=1 g2 j,t + y m X t=Tc 2 tg2 j,t+1 t=1 g2 j,t + y T 1 X j=1 g2 j,t+1 5m Tc + y T 1 X t=Tc 2 t 2m + 4 log((t + 1)2π2/(6δ)) using E 5m Tc + 4my 2 Tc + 16y log(1/δ)2 Tc/2 using (C.15) 5m Tc + 18my log(1/δ)2 Tc/2 since δ (0, e 2). This inequality implies the following lower bound on y : y 2cn1/(2m) 18 log(1/δ) = 2cn1/(2m) 18 log(1/δ) 4m 5 i since c = 1/20 144 log(1/δ) n1 m since cn1/m 4 = n1/(8m) 5 We now bound, i=1 E[Zi] = i=1 [E[1{E}Zi] + E[1{Ec}Zi]] Learning from many trajectories E 1{E} 1 λi + y using (C.13) 1 λi + y + P(Ec) 1 λi since y y on E 1 2t + y + 2 e n/128 + e m T/16 + e cn1 + δ using (C.14) y + 2 2 Tc + 2 e n/128 + e m T/16 + e cn1 + δ 288c log(1/δ)2 cn1/(2m) + 2 2 cn1/m + 2 e n/128 + e m T/16 + e cn1 + δ . Since cn1/m 4, we can choose δ = e cn1/(2m) (0, e 2] and obtain: i=1 E[Zi] 144c2 n1 m 2 cn1/(2m) + 2 2 cn1/m + 2 e n/128 + e m T/16 + e cn1 + e cn1/(2m) . Since 1 cn1/(4m), m T n, and m 1, this inequality implies there exists universal positive constants c1, c2 such that: i=1 E[Zi] c1n R(m, T, T; {Px}) σ2 ξp T n 2m m c1n2c2n/m = σ2 ξp2c2n/m C.6 Block decoupling We now use a block decoupling argument to study lower bounds on the risk. The first step is the following result, which bounds the risk from below by a particular random gramian matrix. 
Lemma C.11. Let n = dr with both d, r positive integers. Define Ir := {1, 1 + r, . . . , 1 + (T 1)r}, and let EIr RT Tr denote the linear operator which extracts the coordinates in Ir, so that (EIrx)i = x1+(i 1)r for i = 1, . . . , T. Recall the following definitions from Equation (7.10): Ψr,T,T = BDiag(Γ 1/2 T (Jr), T)BToep(Jr, T) RTr Tr, Θr,T,T = EIrΨr,T,T ΨT r,T,T ET Ir RT T . Tu, Frostig, and Soltanolkotabi Then, for A = BDiag(Jr, d) we have: E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i E tr((W TBDiag(Θr,T,T , m)W) 1), where W Rm T d is a matrix with independent N(0, 1) entries. Proof We apply Proposition C.3 with: M = Xm,T Γ 1/2 T , I = {1, 1 + r, 1 + 2r, . . . , 1 + (d 1)r}, |I| = d. Note that the block diagonal structure of A yields the same block diagonal structure on ΓT and its inverse square root, specifically ΓT (A) = BDiag(ΓT (Jr), d) and Γ 1/2 T (A) = BDiag(Γ 1/2 T (Jr), d). Hence, it is not hard to see that the columns of MET I are not only independent, but also identically distributed. Furthermore, the distribution of each column obeys a multivariate Gaussian in Rm T . Hence, MET I is equal in distribution to Q1/2 m,T W, where W Rm T d is a matrix of iid Gaussians and Qm,T Symm T >0 is a positive definite covariance matrix to be determined. Furthermore, because MET I contains the vertical concatenation of m independent trajectories, Qm,T itself is block diagonal: Qm,T = BDiag(QT , m), QT Sym T >0. Let us now compute an expression for QT . Consider the dynamics: xr t+1 = Jrxr t + wr t , xr 0 = 0, wr t N(0, σ2 w Ir). It is not hard to see that, with wr 0:T 1 = (w0, . . . , w T 1) Rr T , Γ 1/2 T (Jr)xr 1 ... Γ 1/2 T (Jr)xr T = Ψr,T,T wr 0:T 1. From this, we see that every column of MET I is equal in distribution to EIrΨr,T,T wr 0:T 1, and therefore has distribution N(0, EIrΨr,T,T ΨT r,T,T ET Ir). Therefore: QT = EIrΨr,T,T ΨT r,T,T ET Ir = Θr,T,T . The claim now follows. 
C.7 Eigenvalue analysis of a tridiagonal matrix

For any T ∈ N_+, recall that L_T denotes the T × T lower triangular matrix with ones in the lower triangle, and Tri(a, b; T) denotes the symmetric T × T tridiagonal matrix with a on the diagonal and b on the lower and upper off-diagonals. In this section, we study the eigenvalues of (L_T L_T^T)^{−1}, which we denote by S_T:

S_T := (L_T L_T^T)^{−1} = Tri(2, −1; T) − e_T e_T^T.   (C.16)

Understanding the eigenvalues of this matrix will be necessary in the proof of Lemma C.15. The following result sharply characterizes the spectrum of S_T up to constant factors.

Lemma C.12. Suppose T ≥ 8. For all k = 1, . . . , T, we have that:

0.02 k^2/T^2 ≤ λ_{T−k+1}(S_T) ≤ π^2 k^2/T^2.

Proof We prove the upper bound in Proposition C.13, and the lower bound in Proposition C.14.

The next result gives the necessary upper bounds on the eigenvalues of S_T.

Proposition C.13. We have that:

λ_{T−k+1}(S_T) ≤ π^2 k^2/T^2,   k = 1, . . . , T.

Proof By (C.16), we immediately produce a semidefinite upper bound on S_T:

S_T = Tri(2, −1; T) − e_T e_T^T ⪯ Tri(2, −1; T).

Therefore by the Courant min-max theorem, followed by the closed-form expression for the eigenvalues of Tri(2, −1; T), we have:

λ_{T−k+1}(S_T) ≤ λ_{T−k+1}(Tri(2, −1; T)) = 2(1 − cos(kπ/(T+1))),   k = 1, . . . , T.

Next, we have the following elementary lower bounds for cos(x) on x ∈ [0, π]:

cos(x) ≥ 1 − x^2/2 if x ∈ [0, 2π/3],   cos(x) ≥ (x − π)^2/4 − 1 if x ∈ [2π/3, π].

Therefore, when k ∈ {1, . . . , ⌊2(T+1)/3⌋}, we immediately have that:

λ_{T−k+1}(S_T) ≤ k^2 π^2/(T+1)^2 ≤ π^2 k^2/T^2.

For the case when k ∈ {⌊2(T+1)/3⌋ + 1, . . . , T}, we use the second cosine lower bound:

λ_{T−k+1}(S_T) ≤ 4 − (π^2/2)(1 − k/(T+1))^2 ≤ 4 ≤ π^2 k^2/T^2,

where the final inequality holds since k ≥ 2(T+1)/3. The claim now follows by taking the maximum over the two cases.

We now move to the lower bound on λ_{T−k+1}(S_T). At this point, it would be tempting to use Weyl's inequalities, which imply that λ_i(S_T) ≥ λ_i(Tri(2, −1; T)) − 1. However, this bound is vacuous, since λ_T(Tri(2, −1; T)) ≲ 1/T^2.
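Both the identity (C.16) and the two-sided spectrum bounds of Lemma C.12 are concrete enough to verify numerically. The following check (an illustration, not part of the proof) constructs S_T directly and tests the bounds for several values of T ≥ 8:

```python
import numpy as np

# Check of (C.16) and the spectrum bounds of Lemma C.12:
# S_T = (L_T L_T^T)^{-1} equals Tri(2, -1; T) - e_T e_T^T, and its k-th
# smallest eigenvalue lies between 0.02*k^2/T^2 and pi^2*k^2/T^2 for T >= 8.
for T in (8, 16, 32, 64):
    L = np.tril(np.ones((T, T)))
    S = np.linalg.inv(L @ L.T)
    tri = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    tri[-1, -1] -= 1.0                      # subtract e_T e_T^T
    assert np.allclose(S, tri)
    eigs = np.sort(np.linalg.eigvalsh(S))   # increasing order
    for k in range(1, T + 1):
        lam = eigs[k - 1]                   # lambda_{T-k+1}(S_T), k-th smallest
        assert 0.02 * k**2 / T**2 <= lam <= np.pi**2 * k**2 / T**2
```

The identity follows since L_T^{-1} is bidiagonal (ones on the diagonal, −1 on the subdiagonal), so (L_T L_T^T)^{-1} = (L_T^{-1})^T L_T^{-1} has exactly the stated tridiagonal-minus-corner form.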
To get finer grained control, we need the eigenvalue interlacing result of Kulkarni et al. (1999). This is done in the following result:

Proposition C.14. Suppose that T ≥ 8. We have that:

λ_{T−k+1}(S_T) ≥ 0.02 k^2/T^2,   k = 1, . . . , T.

Proof The proof relies on the interlacing result from Kulkarni et al. (1999, Theorem 4.1). However, the interlacing result does not cover the minimum eigenvalue of S_T, so we first explicitly derive a lower bound for λ_min(S_T). To do this, we note that:

λ_min(S_T) = λ_min((L_T L_T^T)^{−1}) = 1/‖L_T‖_op^2.

Letting l_i ∈ R^T denote the i-th column of L_T, by the variational form of the operator norm followed by Cauchy-Schwarz:

‖L_T‖_op = max_{‖v‖_2=1} ‖L_T v‖_2 ≤ max_{‖v‖_2=1} Σ_{i=1}^T ‖l_i‖_2 |v_i| ≤ √(Σ_{i=1}^T ‖l_i‖_2^2) = √(T(T+1)/2).

Therefore:

λ_min(S_T) ≥ 2/(T(T+1)) ≥ 1/T^2.

Now we may proceed with the remaining eigenvalues. We can write S_T as the following block matrix, with e_{T−1} ∈ R^{T−1} denoting the (T−1)-th standard basis vector:

S_T = [ Tri(2, −1; T−1)   −e_{T−1} ;  −e_{T−1}^T   1 ].

This matrix is of the form studied in Kulkarni et al. (1999, Theorem 4.1); for what follows we will borrow their notation. Let U_T(x) denote the T-th degree Chebyshev polynomial of the 2nd kind. We know that the eigenvalues of S_T are given by λ = 2(1 − x), where x ranges over the roots of the polynomial p_T(x) defined as:

p_T(x) := (2x − 1) U_{T−1}(x) − U_{T−2}(x).   (C.17)

Therefore, letting ψ_1 ≤ · · · ≤ ψ_T denote the roots of (C.17) listed in increasing order, we have:

λ_i(S_T) = 2(1 − ψ_i),   i = 1, . . . , T.

Let η_1 < · · · < η_{T−2} denote the T−2 roots of U_{T−2}(x) listed in increasing order. Put η_0 := −∞ and η_{T−1} := +∞. Because the roots of U_{T−2}(x) are given by x = cos(kπ/(T−1)), k = 1, . . . , T−2, we have that:

η_i = cos((T−1−i)π/(T−1)),   i = 1, . . . , T−2.

Kulkarni et al. (1999, Theorem 4.1) states that there is exactly one root of p_T(x) in each of the intervals (η_j, η_{j+1}) for j ∈ {0, . . . , T−2} \ {i*}, with i* satisfying:

i* = 2(T−1)/3 if 2(T−1) mod 3 = 0,   i* = ⌊2(T−1)/3⌋ + 1 otherwise,

and furthermore (η_{i*}, η_{i*+1}) contains exactly two roots of p_T(x).
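The characteristic-polynomial description above can be tested numerically. The snippet below assumes the sign conventions reconstructed here, namely p_T(x) = (2x − 1)U_{T−1}(x) − U_{T−2}(x) = U_T(x) − U_{T−1}(x) (the extracted text is ambiguous on this point), and compares the values 2(1 − x) over the roots of p_T against the eigenvalues of S_T computed directly:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Numerical check, ASSUMING the reconstructed polynomial
#   p_T(x) = (2x - 1) U_{T-1}(x) - U_{T-2}(x),
# with U_k the degree-k Chebyshev polynomial of the 2nd kind.  The
# eigenvalues of S_T = (L_T L_T^T)^{-1} should be exactly 2(1 - x)
# as x ranges over the roots of p_T.
T = 12
x = P([0.0, 1.0])
U = [P([1.0]), 2.0 * x]                 # U_0, U_1
for _ in range(2, T + 1):
    U.append(2.0 * x * U[-1] - U[-2])   # recurrence U_k = 2x U_{k-1} - U_{k-2}
pT = (2.0 * x - 1.0) * U[T - 1] - U[T - 2]
lam_from_roots = np.sort(2.0 * (1.0 - pT.roots().real))

L = np.tril(np.ones((T, T)))
lam_direct = np.sort(np.linalg.eigvalsh(np.linalg.inv(L @ L.T)))
assert np.allclose(lam_from_roots, lam_direct, atol=1e-8)
```

Under this reading, p_T = U_T − U_{T−1} has roots cos((2k−1)π/(2T+1)), which yields the closed form λ = 2 − 2cos((2k−1)π/(2T+1)) for the eigenvalues of S_T.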
Therefore, for i {i + 3, . . . , T 1}: ψi ηi 1 = λi(ST ) 2(1 ηi 1) = 2 1 cos (T i)π For i {i + 3, . . . , T 1}, we have: T i T 1 T i 3 T 1 = T (2(T 1) 3 1) 3 T 1 = 1 It is elementary to check that: 2(1 cos(x)) x2 2 x [0, π/3]. Therefore for i {i + 3, . . . , T 1}, Furthermore, for i {1, . . . , i + 2}, ψi ηi +1 implies that λi(ST ) 2(1 ηi +1) = 2 1 cos (T 1 i 1)π 2(1 cos(π/21)). The last inequality holds by: cos (T 1 i 1)π cos (T 1) (2(T 1)/3 + 2) T 1 π since i 2(T 1) cos(π/21) since T 8. Summarizing, we have shown that: λT k+1(ST ) 1 T 2 if k = 1, 2 k 1 T 1 2 if k {2, . . . , T i 2}, 2(1 cos(π/21)) if k {T i 1, . . . , T}. Tu, Frostig, and Soltanolkotabi T 1 k 2T when k 2, and since 2(1 cos(π/21)) 2(1 cos(π/21)) k2 T 2 trivially, we have shown the desired conclusion: λT k+1(ST ) min 1, π2 8 , 2(1 cos(π/21)) k2 T 2 0.02 k2 T 2 , k = 1, . . . , T. C.8 A risk lower bound in the few trajectories regime Lemma C.15. There exist universal positive constants c0, c1, c2, and c3 such that the following is true. Suppose A Rn n is any set containing In. Let T c0, n c1, m T n, and m c2n. We have that: R(m, T, T ; {PA x | A A}) c3σ2 ξp n2 Proof Let {gj}m j=1 be independent N(0, IT ) random vectors, and let h N(0, In 1) be independent from {gj}. Let {λt}T t=1 denote the eigenvalues of Θ 1 1,T,T listed in decreasing order. Define the random variables {Zi}T i=1 as: Zi := min β 0 max τ 0 2τ + β2 m X g2 j,t λt + β h 2τ + (Θ 1 1,T,T + β h 2τIT ) 1 ii We now lower bound the minimax risk as follows: R(m, T, T ; {PIn x }) σ2 ξp E m i=1PIn x h tr ΓT (In)1/2(XT m,T Xm,T ) 1ΓT (In)1/2 i by Lemma 6.1 σ2 ξp E tr((W TBDiag(Θ1,T,T , m)W) 1) by Lemma C.11 = σ2 ξp T + 1 T + 1 E tr((W TBDiag(Θ1,T,T , m)W) 1) using (7.11) 2T E tr((W TBDiag(Θ1,T,T , m)W) 1) by Lemma 7.1. 
(C.19) Next, define: 64, p(y) := y λt + yg2 j,t n1 Assuming that c1 6 so that n 6 and m T n, we can invoke Lemma C.9 to conclude there exists an event E1 (over the probability of {gj} and h) such that: Learning from many trajectories (a) on E1, there exists a unique root y (0, ) such that p(y ) = 0, (b) the following inequalities holds: Zi (Θ1,T,T )ii, 1{E1}Zi 1{E1} 1 λi + y , (C.20) (c) the following estimate holds: P(Ec 1) e n/128 + e m T/16. The remainder of the proof is to estimate a lower bound on y . Towards this goal, we define an auxiliary function: p(y) := E[p1(y)] = m y λt + y n1 Let y be the unique solution to p(y) = 0. A unique root exists because p(0) < 0, limy p(y) = m T n1/2 n n/64 > 0, and p is continuous and strictly increasing. We derive a lower bound on y through a lower bound on y . For any fixed α > 0, the function x 7 x α+x is monotonically increasing and concave on R>0. Therefore, the function p(y) is monotonically increasing and concave on R>0. By Proposition C.2, the root of the linear approximation to p(y) at y is a lower bound to y : 1{E1}y 1{E1} y p( y ) Equation (C.21) is a crucial step for the proof, because it turns analyzing y , which is the root of a random function, into analyzing the pointwise evaluation of a random function on a deterministic quantity. To lower bound the RHS, we need a upper bound on p( y ) and lower bounds on both y and p ( y ). Upper and lower bounds on y . We first derive a crude upper bound by Jensen s inequality. Observe that p( y ) = 0 implies that: λt λt + y . The function x 7 x/(x + y ) is concave on R>0. Let λ := 1 T PT t=1 λt. Jensen s inequality states that T λ λ+ y PT t=1 λt λt+ y . Therefore: 1 n1 2m T λ λ + y = y λ n1 2m T 1 1 n1/(2m T). Recalling the definition of ST from (C.16), we can immediately bound T tr(Θ 1 1,T,T ) = 1 tr(ST ) 2T. Tu, Frostig, and Soltanolkotabi Therefore, since m T n, m 1 1 n1/(2m T) 2n1 Now for the lower bound on y . 
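The tangent-line device behind (C.21) — for an increasing concave function, the tangent at any point majorizes the function, so the tangent's root lower bounds the function's root — can be illustrated on a toy instance (all numbers below are illustrative stand-ins, not the quantities of the proof):

```python
# Illustration of the tangent-line step (C.21): for an increasing concave p,
# the tangent line at any y0 lies above p, so its root y0 - p(y0)/p'(y0)
# is a lower bound on the true root of p.  The toy p below mimics the shape
# y -> (m/n1) * sum_t y/(lambda_t + y) - 1/2 used in the proof.
lam = [(t * t) / 16.0 for t in range(1, 17)]

def p(y):
    return sum(y / (l + y) for l in lam) / 8.0 - 0.5

def dp(y):
    return sum(l / (l + y) ** 2 for l in lam) / 8.0

lo, hi = 0.0, 1.0
while p(hi) < 0:
    hi *= 2.0
for _ in range(100):                    # bisect for the true root
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if p(mid) < 0 else (lo, mid)
root = 0.5 * (lo + hi)
for y0 in (0.5, 1.0, 2.0, 5.0, 10.0):   # tangent roots never exceed the true root
    tangent_root = y0 - p(y0) / dp(y0)
    assert tangent_root <= root + 1e-9
```

The value of this step in the proof is exactly as stated above: it converts control of the root of a random function into control of a pointwise evaluation at a deterministic point.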
Noting that λT k+1 = λT k+1(Θ 1 1,T,T ) = T+1 2 λT k+1(ST ), Corollary C.12 implies (assuming that c0 8 so T 8) that T λT k+1 π2 k2 T , k = 1, . . . , T. (C.22) Therefore, p( y ) = 0 implies that: 1 λt + y 2m 1 0.01t2/T + y 1 0.01x2/T + y dx = 20m T n1 y tan 1 Solving for y yields: y 1 100π2 n2 1 m2T . Next, we use this lower bound on y to bootstrap our upper bound y 2n1/m into something stronger. Using the upper bounds on λt from (C.22), 1 λt + y 2m 1 π2t2/T + y 2m 1 π2x2/T + y dx tan 1 (T + 1)π T y tan 1 π T y The function tan 1(x) is increasing. Using the y 2n1/m upper bound and the assumption that m T n, (T + 1)π T y π 32 = tan 1 (T + 1)π T y On the other handing, using the bound y 1 100π2 n2 1 m2T and the assumption that m π T y 10π m 32/2 = tan 1 π T y Combining these inequalities: 32) tan 1(π 32/2) i 2 0.05 T n1 y = y 791π2 n2 1 m2T . Learning from many trajectories Therefore we have the following upper and lower bounds on y : 1 100π2 n2 1 m2T y min 791π2 n2 1 m2T , 2n1 For the remainder of the proof, in order to avoid precisely tracking constants, we let c0, c1, c2, c3 be any positive universal constants such that: T λT k+1 c1 k2 T , k = 1, . . . , T, (C.24) c2 n2 1 m2T y c3 n2 1 m2T . (C.25) Equations (C.22) and (C.23) give one valid setting of these constants. Upper bound on p( y ). To upper bound p( y ), we note that: λt + y g2 j,t n1 λt + y (g2 i,j 1) + λt + y (g2 i,j 1) since p( y ) = 0. Therefore, by Lemma C.4, 2 + 2t max t=1,...,T y e t t > 0. (C.26) We upper bound: 2 using (C.24) = m( y )2T 2c0T y + 2( y )2 + T y 2 c0 tan 1 s 2c0 + π T y n2 1 m T + π m using (C.25) = c3 128c0 + π n1 since m T n and m 1 Tu, Frostig, and Soltanolkotabi =: c4n1. (C.27) Next, we immediately have: max t=1,...,T y λt + y 1. (C.28) Thus, combining (C.26), (C.27), and (C.28), we have: P p( y ) > 2 tu c4n1 + 2tu e tu tu > 0. (C.29) Lower bound on p ( y ). Differentiating p(y) yields: λt (λt + y)2 g2 j,t. 
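The sum-versus-integral comparisons and arctan evaluations used throughout this argument rest on two elementary facts, which the following snippet spot-checks (the constants a, y, T are illustrative; a plays the role of the c/T scaling of the λ_t):

```python
import math

# Two facts used repeatedly above, with lambda(x) = a*x^2:
#  (1) the antiderivative identity
#        int_0^A dx/(a*x^2 + y) = (1/sqrt(a*y)) * atan(A * sqrt(a/y)),
#  (2) sum_{t=1}^T 1/(a*t^2 + y) <= int_0^T dx/(a*x^2 + y),
#      which holds because the integrand is decreasing in x.
a, y, T = 0.03, 0.7, 50

closed = math.atan(T * math.sqrt(a / y)) / math.sqrt(a * y)
N = 200000                               # midpoint-rule numerical integral
h = T / N
numeric = sum(h / (a * ((i + 0.5) * h) ** 2 + y) for i in range(N))
assert abs(closed - numeric) < 1e-6

riemann = sum(1.0 / (a * t * t + y) for t in range(1, T + 1))
assert riemann <= closed
```

Fact (2) is the standard right-endpoint bound: since x ↦ 1/(ax² + y) is decreasing, each summand 1/(at² + y) is at most the integral of the function over [t−1, t].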
Applying Lemma C.4 yields, p ( y ) < m λt (λt + y )2 2 λ2 t (λt + y )4 e t t > 0. (C.30) Our first goal is to lower bound m PT t=1 λt (λt+ y )2 . The function x 7 x/(x + y )2 is increasing when x [0, y ] and decreasing when x ( y , ). Let t {0, . . . , T} be such that c1t2/T y for t {1, . . . , t } and c1t2/T > y for t {t + 1, . . . , T} (t = 0 if c1/T > y ). We write: λt (λt + y )2 c0 c1t2/T (c1t2/T + y )2 using (C.24) c1t2/T (c1t2/T + y )2 + c1t2/T (c1t2/T + y )2 c1x2/T (c1x2/T + y )2 dx + Z T+1 c1x2/T (c1x2/T + y )2 dx c1x2/T (c1x2/T + y )2 dx Z t +1 t c1x2/T (c1x2/T + y )2 dx The function z 7 z (z+ y )2 is upper bounded by 1 4 y . Therefore, t c1x2/T (c1x2/T + y )2 dx 1 4 y 1 4c2 c1x2/T (c1x2/T + y )2 dx = c1T 2c3/2 1 T y tan 1 (T + 1) c1 T y T + 1 2c2 1(T + 1)2 + 2c1T y Learning from many trajectories 2c3/2 1 c3n1 tan 1 (T + 1) c1 T y 2c3/2 1 c3n1 tan 1 64 rc1 The last inequality holds because: (T + 1) c1 T y (T + 1) rc1 Above, the first inequality holds using (C.25) and the last inequality holds since m T n. Therefore, assuming that m T 2 q c3 c1 1 tan 1(64 c1x2/T (c1x2/T + y )2 dx tan 1(64 p c1/c3) 4 c1c3 Combining these inequalities, assuming that m c2 2 c1c3 tan 1(64 p c1/c3)n1, we have: λt (λt + y )2 c0 " tan 1(64 p c1/c3) 4 c1c3 c0 tan 1(64 p 8c3/2 1 c3 =: c5 m2T Next, we turn to upper bounding m PT t=1 λ2 t (λt+ y )4 . Again the function x 7 x2/(x+ y )4 is increasing when x [0, y ] and decreasing when x ( y , ), and therefore x2/(x+ y )4 1 16( y )2 for all x 0. Let t {0, . . . , T} be such that c0t2/T y for t {1, . . . , t } and c0t2/T > y for t {t + 1, . . . , T}. In the case when c0/T > y , we set t = 0. 
We have: λ2 t (λt + y )4 c2 1 c2 0 m (c0t2/T + y )4 using (C.24) = c2 1 c2 0 m t 1 X (c0t2/T + y )4 + (c0t2/T + y )4 + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 c2 1 c2 0 m Z t (c0x2/T + y )4 dx + Z T (c0x2/T + y )4 dx + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 c2 1 c2 0 m Z T (c0x2/T + y )4 dx + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 Tu, Frostig, and Soltanolkotabi c2 1 c2 0 m Z T (c0x2/T + y )4 dx + 1 8( y )2 since max x>0 x2 (x + y )4 1 16( y )2 c2 1 c2 0 m Z T (c0x2/T + y )4 dx + 1 We now bound: (c0x2/T + y )4 dx = c2 0T 2 (3c0T + y )(c0T 3 y ) 48c2 0T y (c0T + y )3 + tan 1 q 16c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c0 y (c0T + y )2 + π 32c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c3 0 y T 2 + π 32c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c3 0c2 32c5/2 0 c3/2 2 using (C.25) " 1 1024c0c2 + π 32c1/2 0 c3/2 2 n3 1 since m T n. Combining these inequalities, assuming that m n1: λ2 t (λt + y )4 c2 1 c2 0 "" 1 1024c0c2 + π 32c1/2 0 c3/2 2 " 1 1024c0c2 + π 32c1/2 0 c3/2 2 + 1 n3 1 since m n1 =: c6 m4T 2 n3 1 . (C.32) Combining (C.30), (C.31), and (C.32) yields p ( y ) < c5 m2T n1 2 tℓ c6 m2T e tℓ tℓ> 0. (C.33) Lower bounds on y . We now combine (C.29) with (C.33) to established a lower bound on y . Equations (C.21) and (C.25) imply that: p ( y ) c2n2 1 m2T p( y ) We first set tℓ= c2 5 16c6 n1, so that by (C.33), P p ( y ) < c5 e c2 5 16c6 n1. Learning from many trajectories We next set tu = βn1 for a β > 0 to be specified. By (C.29), P p( y ) > 2( p c4β + β)n1 e βn1. Let E2 denote the event: E2 := p ( y ) c5 n p( y ) 2( p c4β + β)n1 o . By a union bound, P(Ec 2) e c2 5 16c6 n1 + e βn1. Furthermore, 1{E2} c2n2 1 m2T p( y ) 1{E2} c2 4( c4β + β) Setting β = c7 := min n c2c5 16 , c2 2c2 5 162c4 o , we have that c2 4( c4β+β) c5 c2/2, and therefore from (C.21), 1{E1}y 1{E1 E2} c2n2 1 m2T p( y ) 2 n2 1 m2T . (C.34) Finishing the proof. Define E := E1 E2 and define y := c2 2 n2 1 m2T . 
By a union bound, P(E) e n/128 + e m T/16 + e c2 5 16c6 n1 + e c7n1 e n/128 + e n/16 + e c2 5 1024c6 n + e c7 64 n since m T n 4 exp min 1 16, c2 5 1024c6 , c7 n =: 4e c8n. (C.35) From (C.20), since y y on E by (C.34), 1{E}Zi 1{E} 1 λi + y 1 λi + y . (C.36) Next, by Proposition B.1, if n 2 max{1, c 1 8 } log(4 max{1, c 1 8 }), then we have n c 1 8 log n ne c8n 1. We now bound, i=1 E[Zi] = i=1 [E[1{E}Zi] + E[1{Ec}Zi]] 1 λt + y + P(Ec)(Θ1,T,T )tt using (C.36) and Zi (Θ1,T,T )ii 1 λt + y + P(Ec)T since tr(Θ1,T,T ) = T Tu, Frostig, and Soltanolkotabi 1 c0t2/T + y + 4Te c8n using (C.24) and (C.35) 1 c0x2/T + y dx + 4Te c8n T c0y + 4Te c8n = n1 + 4Te c8n 2π 2 c0c2 + 1 n1 =: c8 m T n1 since ne c8n 1 and m 1. Plugging this upper bound into (C.19): R(m, T, T ; {PIn x }) σ2 ξp T n1 m T = 1 256c8 σ2 ξ pn2 The claim now follows. C.9 Proof of Theorem 6.3 Theorem 6.3 (Risk lower bound). There are universal positive constants c0, c1, and c2 such that the following holds. Recall that PIn x (resp. P0n n x ) denotes the covariate distribution for a linear dynamical system with A = In and B = In (resp. A = 0n n and B = In). If T c0, n c1, and m T n, then: R(m, T, T ; {P0n n x , PIn x }) c2σ2 ξ pn m T max n T Proof Let Px := {P0n n x , PIn x }. We let c 0, c 1, c 2, and c 3 denote the universal positive constants in the statement of Lemma C.15. We first invoke Lemma C.7 to conclude that: R(m, T, T ; Px) σ2 ξ 2 pn T , 1 . (C.37) The proof now proceeds in three cases: Case n T /(m T) 1. In this case, we trivially have max n T T , 1 o = max n n T m T , T T , 1 o . Therefore, (C.37) yields: R(m, T, T ; Px) σ2 ξ 2 pn m T max n T Case n T /(m T) > 1 and m c 2n. In this case, we can invoke Lemma C.15 to conclude that: R(m, T, T ; Px) c 3σ2 ξ pn m T = c 3σ2 ξ pn m T max n T m T , 1 . (C.38) Learning from many trajectories Since n/m 1/c 2, we have that n T /(m T) T /(c 2T). 
Therefore:

max{nT′/(mT), 1} ≥ max{T′/(c′_2 T), nT′/(mT), 1} ≥ min{1, 1/c′_2} max{nT′/(mT), T′/T, 1}.

Hence, from (C.38),

R(m, T, T′; P_x) ≥ min{c′_3, c′_3/c′_2} σ_ξ^2 (pn/(mT)) max{nT′/(mT), T′/T, 1}.

Case nT′/(mT) > 1 and m > c′_2 n. In this case, we have T′/T > c′_2 · nT′/(mT). Therefore:

max{T′/T, 1} ≥ max{c′_2 · nT′/(mT), 1},

and hence max{T′/T, 1} ≥ min{1, c′_2} max{nT′/(mT), T′/T, 1}. Hence, from (C.37),

R(m, T, T′; P_x) ≥ min{1/2, c′_2/2} σ_ξ^2 (pn/(mT)) max{nT′/(mT), T′/T, 1}.

The claim now follows by taking c_0 = c′_0, c_1 = c′_1, and c_2 = min{1/2, c′_3, c′_3/c′_2, c′_2/2}.
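The three-way case analysis concluding this proof is pure arithmetic, so it can be checked exhaustively on a grid. In the sketch below, a and b stand for nT′/(mT) and T′/T, and c2p is an arbitrary stand-in value for the constant c′_2 from Lemma C.15:

```python
import random

# Brute-force check of the case analysis in the proof of Theorem 6.3, with
# a = n*Tp/(m*T) and b = Tp/T (Tp standing in for T').  In each case the
# available bound recovers max(a, b, 1) up to a constant depending on c2p.
random.seed(1)
c2p = 0.3   # stand-in for the constant c2' of Lemma C.15 (value arbitrary)
for _ in range(10000):
    n = random.randint(1, 50)
    m = random.randint(1, 50)
    T = random.randint(1, 50)
    Tp = random.randint(1, 200)
    a, b = n * Tp / (m * T), Tp / T
    full = max(a, b, 1.0)
    if a <= 1:
        # case 1: max(b, 1) already equals the full maximum
        assert max(b, 1.0) == full
    elif m <= c2p * n:
        # case 2: n/m >= 1/c2' implies a >= b/c2', so max(a, 1) dominates
        assert max(a, 1.0) >= min(1.0, 1.0 / c2p) * full
    else:
        # case 3: m > c2'*n implies b > c2'*a, so max(b, 1) dominates
        assert max(b, 1.0) >= min(1.0, c2p) * full
```

Each branch mirrors the corresponding case of the proof: the bound actually available in that case (from (C.37) or from Lemma C.15) dominates the full maximum after paying a constant factor depending only on c′_2.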