# longcontext_linear_system_identification__b2def602.pdf

Published as a conference paper at ICLR 2025

LONG-CONTEXT LINEAR SYSTEM IDENTIFICATION

O guz Kaan Y uksel EPFL Lausanne, Switzerland

Mathieu Even Inria ENS Paris, France

Nicolas Flammarion EPFL Lausanne, Switzerland

This paper addresses the problem of long-context linear system identification, where the state xt of a dynamical system at time t depends linearly on previous states xs over a fixed context window of length p. We establish a sample complexity bound that matches the i.i.d. parametric rate up to logarithmic factors for a broad class of systems, extending previous works that considered only first-order dependencies. Our findings reveal a learning-without-mixing phenomenon, indicating that learning long-context linear autoregressive models is not hindered by slow mixing properties potentially associated with extended context windows. Additionally, we extend these results to (i) shared low-rank representations, where rank-regularized estimators improve rates with respect to dimensionality, and (ii) misspecified context lengths in strictly stable systems, where shorter contexts offer statistical advantages.

1 INTRODUCTION

System identification, which consists of estimating the parameters of a dynamical system from observations of its trajectories, is a fundamental problem in many fields such as econometrics, robotics, aeronautics, mechanical engineering, or reinforcement learning (Ljung, 1998; Gupta et al., 1976; Moerland et al., 2023). Recent theoretical advances focused on linear system identification, where observations are of the form: xt = A xt 1 + ξt , (1)

for t 1, with initialization x0 Rd, noise ξt Rd and design matrix A Rd d. Linear system identification (Simpkins, 1999) has been thoroughly studied, with recent interest in sharp non-asymptotic rates (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019; Faradonbeh et al., 2018; Jedra & Prouti ere, 2019). The existing analyses, however, focus solely on order-1 time dependency, in which the law of xt only depends on the previous state xt 1. For order-p time dependencies, the literature on non-asymptotic rates becomes surprisingly scarce, as existing techniques do not extend to p > 1.

We study this more general setting, where the state xt depends on previous states xs for s in a context window of length p N , i.e.,

k=1 A kxt k + ξt , (2)

for t p, the initialization x0, . . . , xp 1 Rd, noise ξt Rd and design matrices A 1, . . . , A p Rd d. This classical pth-order vector autoregression model (Box et al., 2015; Brockwell & Davis, 1991; Hamilton, 2020) is termed long-context linear autoregressive model. The term linear refers to the (noisy) linear relationship between iterates and long-context refers to the context length p. Recent advances in autoregressive models and architectures such as transformers (Vaswani et al., 2017; Dosovitskiy et al., 2021; El-Nouby et al., 2024) highlight the importance of long-context and its impact on learning. Developing a theoretical understanding of long-context linear autoregressive models is a necessary first step toward tackling these more complex architectures.

Motivated by empirical evidence that high-dimensional data may share some lower-dimensional representation (Bengio et al., 2013; Hospedales et al., 2022), several works additionally studied the

Published as a conference paper at ICLR 2025

problem of learning matrices A k under the assumption that they are of low-rank (Alquier et al., 2020; Basu et al., 2019), for order-1 autoregressive models. In the long-context setting, this problem is further motivated by the fact that if there exists a lower-dimensional representation of the autoregressive process, this translates into shared kernels for the matrices A k.

Finally, a key challenge in long-context autoregressive models is misspecification: the system might have an unknown context window p as in Equation (2). p may be arbitrarily large and unknown by the statistician. She may then specify a context length p that can be much smaller, thus yielding the following two fundamental questions: can useful structure still be learned under misspecified context lengths? And what advantages, if any, arise from model misspecification?

Our contributions in long-context linear systems identification are then threefold.

(i) We derive statistical rates on the recovery of matrices A k in terms of Frobenius norm, which depends on the number of trajectories N and their length T, on the dimension d and the context length p. These rates reveal a learning-without-mixing phenomenon as they do not have a deflation in effective sample size due to the mixing time of the autoregressive process. This first contribution is an attempt to fill the gap in linear system identification for long context lengths.

(ii) We study statistical guarantees for learning the matrices A k assuming that they are all of rank at most r d. We prove that the statistical rate reduces, and that rank-regularized estimators adapt to the low-rank structure.

(iii) We study a scenario under which the model is misspecified. Fitting a linear model with context length p < p instead of p, we show that the first p matrices are still learned. More importantly, the sample complexity of learning these matrices depends only on the misspecified context length, indicating that misspecification may benefit the model statistically, not just computationally.

Finally, we confirm these statistical rates through experiments that verify the scaling laws predicted by problem parameters. Due to space constraints, these experiments are provided in Section E.

2 RELATED WORKS

In multivariate linear regression, one observes {(xi, yi)}N i=1 from the model yi = A xi + ξi, where matrix A Rd d and the sequences of noise ξi and inputs xi are i.i.d.. The number of samples N needs to scale at least as d2 for a good estimation of A with ordinary least squares estimator (Hsu et al., 2012; Wainwright, 2019) in Frobenius norm A ˆA 2 F 1. However, in many domains, data is sequential, violating the i.i.d. assumption. In such domains, classical non-i.i.d. formulations, such as vector autoregressive models or discrete-time linear dynamical systems (LDS), as seen in Equation (1), are often employed. Most works used to deal with the non-i.i.d.-ness of the data through mixing time arguments that fall short when the spectral radius of A reaches 1, leading to rates of the form ˆA A 2 F = O(d2/(n(1 ρ))) or ˆA A 2 op = O(d/(n(1 ρ))) for some spectral quantity 1 ρ related to the mixing time of the process. These rates apply to the OLS estimator (Faradonbeh et al., 2018) and online settings (Hardt et al., 2018; Even, 2023) alike.

Simchowitz et al. (2018); Sarkar & Rakhlin (2019) have developed excitation-based arguments to leverage mixing-time independent statistical bounds for the OLS estimator, while Hazan et al. (2017); Jain et al. (2021) respectively used spectral filtering and reverse experience replay in the online setting to obtain such bounds. The estimation of low-rank features has been studied by Basu et al. (2019); Alquier et al. (2020) via nuclear norm regularization. Finally, learning parameters of dynamical systems from N trajectories of length T has previously been considered by Tu et al. (2024) in a more general framework than Equation (1).

Layers of complexity can be added to the LDS described in Equation (1). Mania et al. (2022); Foster et al. (2020) considered non-linear dynamics, that write respectively as xt+1 = A ϕ(xt, ut) + ξt and xt+1 = f (xt) + ξt, where in the former A is to be estimated and ϕ is a known non-linearity, while in the latter f is to be estimated. Kostic et al. (2022) recently provided a general framework using Koopman operators, to estimate the parameters of some general Markov chain. Giraud et al. (2015) considered time-varying systems, with arbitrary context lengths, while Bacchiocchi et al. (2024) studied autoregressive bandits. Ziemann & Tu (2022) provided a framework for learning non-parametric dynamical systems with little mixing : as their rates are not hindered by slow

Published as a conference paper at ICLR 2025

mixing after a burn-in time (that may itself depend on mixing properties). We refer the reader to Tsiamis et al. (2023) for a survey on recent advances on non-asymptotic system identification of LDS such as in Equation (1). Surprisingly, there does not seem to be much known about longcontext LDS in Equation (2), the counterparts of LDS in Equation (1) with a context window p > 1.

3 PROBLEM SETTING

For a matrix M Rd1 d2 with singular values σ1, . . . , σmin {d1,d2}, we denote its squared Frobenius norm as M 2 F = P

(i,j) M 2 ij = P

ℓσ2 ℓ, operator norm as M op = maxℓ|σℓ|, and nuclear norm as M = P ℓ|σℓ|. Id and 0d denotes the identity and the null d d matrices, respectively. A = (A1, . . . , Ap) denotes a rectangular matrix of size d pd where each Ai is d d block.

3.1 DATA GENERATION PROCESS

Let d, p N be the dimension of the state space and the context length, respectively. Consider the following linear autoregressive process:

t > 0 : xt =

k=1 A kxt k + ξt , (3)

where xs = 0 for any s 0 and the noise ξt is independent of the xs, ξs for s < t. This is a particular instance of the general linear autoregressive model in Equation (2) with initial conditions x0, . . . , xp 1 set to 0 and the independent noise structure. We assume sub-Gaussian noise: Assumption 3.1. For all t, the noise ξt is centered and isotropic:

E[ξt] = 0 , E[ξtξ t ] = σ2Id ,

and each coordinate of ξt is independent and c2σ2-sub-Gaussian (Wainwright, 2019, Chapter 2) for some c 1:

i [d] (ξt)i ψ2 c2σ2 , where x ψ2 = sup k 1 k 1/2E |x|k 1/k .

Let AR(A , σ2) denote the law of the sequence defined in Equation (3) where A denotes (A 1, . . . , A p) for brevity. Given N independent sequences of length T > p: n x(n) t , n [N], t [T] o , where (x(n) t )t [T ] i.i.d. AR(A , σ2) ,

the goal of long-context linear system identification is to estimate the matrices A k, k [p].

Lastly, we assume a condition on the design matrices A k, k [p] that amounts to an operator norm bound. First, we define the following linear operators for any matrix A Rd pd: Definition 3.2. Let MA RT d T d be the block-matrix with block entries of size d d:

M (i,j) A = Ai j , for all 1 j < i j + p T , and M (i,j) A = 0d , otherwise .

Definition 3.3. Let L RT d T d be the block-matrix with block entries of size d d:

L(1,1) = Id and L(i,1) =

max {i 1,p} X

k=1 A k L(i k,1) 1 < i T ,

L(i,j) = L(i j+1,1) for all 1 i j p , and L(i,j) = 0d otherwise .

MA executes predictions from the given data with A and L generates the data from the noise. That is, letting (MA)t , (L )t : Rd RT d be the tth block-row of MA and L , respectively, we have:

x(n) 1... x(n) T

k=1 Akx(n) t k , (L )t

ξ(n) 1... ξ(n) T

= x(n) t , with xs = 0 for s 0 .

Published as a conference paper at ICLR 2025

Therefore, the operator norm of MA is a measure of the worst-case growth of the predictions. Moreover, MA is linked to the data-generating operator L :

L = IT d + MA L = L = (IT d MA ) 1 = IT d +

i=1 (MA )i .

We assume the following conditions on the design matrices: Assumption 3.4. There exists a known constant D 1 such that MA op D.

Theorem 3.4 is not restrictive as D is arbitrary and only needs to be an upper bound on MA op. However, the knowledge of D is necessary, as it is used to confine the estimator in Section 3.2.

As the operator MA is a derived object over the full trajectory, it is important to relate Theorem 3.4 to conditions on the design matrices A k. In Theorem 3.5 below, we provide two different assumptions on the design matrices that ensure the boundedness of the operator norm of MA with the same constant. Both conditions imply Theorem 3.4. Proposition 3.5. Theorem 3.4 holds if one of the following holds:

i=1 A i op D , (ii) A op D p .

There is no direct assumption on L ; yet, our results depend on well-behavedness of κ, the condition number of L , which is related to Γt that appears in Simchowitz et al. (2018); Sarkar & Rakhlin (2019). κ is related to the system stability, as explained in Section 5.

Definition 3.6. Let κ be the condition number of L , i.e., κ := L op σmin(L ).

3.2 CONSTRAINED LEAST SQUARES

A natural estimator is the Ordinary Least Square (OLS), defined as any minimizer of the square loss:

ˆ AOLS argmin A L(A) , where L(A) := 1 NT

k=1 Akx(n) t k

The OLS estimator has been considered in previous works (Simchowitz et al., 2018; Alquier et al., 2020; Faradonbeh et al., 2018; Sarkar & Rakhlin, 2019), albeit in the p = 1 case. Most of these works provide estimation rates on ˆA A op or ˆA A F , for marginally stable systems, i.e., under the assumption that ρ(A) 1 (Alquier et al., 2020; Simchowitz et al., 2018; Basu et al., 2019) and in the general case (Sarkar & Rakhlin, 2019).

Instead of directly considering the OLS estimator, we consider the empirical minimizer of the square loss under a restricted set of matrices A that have a bounded operator norm: ˆ A argmin A=(A1,...,Ap) {L(A) | MA op D} . (5)

Note that the set A(D) := {A = (A1, . . . , Ap) | MA op D} , is bounded, closed and convex. Hence, the empirical minimizer of the square loss over A(D) can be computed with projected gradient descent (Duchi et al., 2008) or the Frank-Wolfe algorithm (Jaggi, 2013) as done for ℓ1 constrained optimization. To avoid projecting onto the set A(D), following Theorem 3.5, it is possible to restrict A(D) further into

i=1 Ai 2 op D2 )

, and A(D) := A | A op D p

in order to ensure a condition directly on design matrices. Then, the empirical minimizer of the square loss over A(D) or A(D) can again be computed via projected gradient descent or the Frank-Wolfe algorithm, with simplified projection steps.

Lastly, we briefly remark that the diameter constraint in Equation (5) can be removed, i.e., A(D) replaced by A( ), under an additional assumption on NT. This is explained in detail in Section 5.

Published as a conference paper at ICLR 2025

3.3 LOW-RANK ASSUMPTION

A common assumption in multi-task and meta-learning is that high-dimensional data often shares a representation in a smaller space (Bengio et al., 2013; Tripuraneni et al., 2021; Hospedales et al., 2022; Boursier et al., 2022; Collins et al., 2022; Y uksel et al., 2024) The following low-rank assumptions are crucial, as they significantly improve the statistical complexity of the problem.

Assumption 3.7. For all k [p], rank(A k) r.

Assumption 3.8. There exists an orthonormal matrix P Rr d and matrices B 1, . . . , B p Rd r

such that A k = B k P for all k [p].

Note that Theorem 3.8 is an instance of Theorem 3.7. The factorization A k = Q C k is another subcase of Theorem 3.7, but is not considered as it leads to iterates that directly lie in the subspace spanned by Q and hence Q can be learned by treating iterates x(n) t as independent. In order to benefit from the low-rank structure, we consider the following regularized estimator:

ˆ A argmin A Ar(D) L(A) , where Ar(D) := {A A(D) | k [p], rank(Ak) r} . (6)

3.4 MISSPECIFICATION

The context length of the generative autoregressive process might be unbounded, too large for an efficient estimation, or apriori unknown. In any case, practitioners still have to set a context length p N for the estimator, which might differ from the true p. In this scenario, we need an additional boundedness assumption that relates the first p matrices of the ground truth.

Assumption 3.9. There exists a constant D such that MA MA 1:p L op D , where A 1:p = (A 1, . . . , A p , 0d, . . . , 0d) .

Instead of the estimator defined in Equation (6), we consider the following misspecified estimator:

ˆ A argmin A Ar,p (D) L(A) , where Ar,p (D) := {A Ar(D) | p < k p, Ak = 0d} . (7)

Theorem 3.9 is a strong assumption as it requires that L is well-behaved regardless of the sequence length T. Consequently, the misspecification results are more stringent than other results and apply to a smaller class of systems that still includes strictly stable systems as discussed in Section 5.

4 LONG-CONTEXT LINEAR SYSTEM IDENTIFICATION

In this section, we present statistical rates for the recovery of the design matrices in terms of Frobenius norm. Since the matrices A lie in Rd pd, the number of variables is pd2. In the i.i.d. setting, the rates of the form ˆ A A 2 F = O(pd2/(NT)) are expected. The following theorem extends this rate for long-context linear dynamical system identification:

Theorem 4.1. Let Theorems 3.1 and 3.4 hold. Then, for any 0 < δ < e 1, there exists a constant C(δ) such that the estimator ˆ A in Equation (5) verifies with probability 1 δ: ˆ A A 2

F C(δ)D2 pd2

N(T p)polylog(κ, p, d, N, T) . (8)

The constant C(δ) depends mildly on the sub-Gaussianity constant c as described in Section C.5 and a sketch of proof is provided in Section A. The rate is numerically verified in Figure 1 in Section E. Theorem 4.1 exhibits several interesting features.

First, it shows that despite the temporal dependencies in the data, learning still occurs at a pace reminiscent of the i.i.d. setting, with a logarithmic term adjustment. This implies that the number of samples required to learn the system is approximately the same as in the i.i.d. setting, except for the logarithmic factor. Therefore, even though the data is sequential and only i.i.d. at the sequence level, the number of iterates N(T p) represents the effective data size.

Published as a conference paper at ICLR 2025

Second, the rate in Equation (8) exhibits a linear dependency on the context length p instead of a quadratic dependency. This is only due to the number of parameters to be estimated, which is pd2 instead of d2 and not a deflation in T by a factor of p, which implies the context length does not affect the effective sample size. The additive factor in T p is due to the fact that first iterates do not depend on the full context length, and thus are not as informative as the later iterates. More detailed discussions of Theorem 4.1, in comparison with previous work, can be found in Section 5.

Low-rank setting. Next, we extend the results to the low-rank setting: Theorem 4.2. Let Theorems 3.1, 3.4 and 3.7 hold. Then, for any 0 < δ < e 1, there exists a constant C(δ) such that the estimator ˆ A in Equation (6) verifies with probability 1 δ: ˆ A A 2

F C(δ)D2 prd N(T p)polylog(κ, p, d, r, N, T) .

The improved statistical rate depends on rd instead of d2. Note, however, that this estimator cannot be computed in polynomial time, since the underlying optimization problem involves a non-convex constraint on the rank of all Ak. Several heuristics exist to approximate this estimator. One approach is the Burer-Monteiro factorization (Burer & Monteiro, 2003; 2004), which involves parameterizing Ak as Ak = Bk Ck with Bk Rd r and Ck Rr d. This method relaxes the constraint to a convex set but results in a non-convex function. Another approach is hard-thresholding algorithms, which use projected (stochastic) gradient descent on the non-convex constraint set (Blumensath & Davies, 2009; Foucart & Subramanian, 2019).

Perhaps the most intuitive approach is to use nuclear norm regularization, which is a convex relaxation of the rank constraint: ˆ A argmin {Lλ(A) | A A(D)} , where Lλ(A) = L(A) + λ A ,group , (9)

and A ,group = Pp k=1 Ak is the group-nuclear norm. We leave the analysis of the nuclear norm estimator for future work.

While the low-rank estimator cannot be computed easily, substituting the constraint k, rank(Ak) r with rank(A) r enables a closed-form solution for the optimization problem (Bunea et al., 2011). However, the latter constraint effectively includes the former only when r pr, which would lead to suboptimal dependencies on the context length. These constraints are equivalent only if all Ak matrices project onto the same space: i.e., Ak = QBk for some Q Rd r and Bk Rr r.

Misspecification. Lastly, we study linear long-context autoregressive prediction models under misspecified context lengths and show that partial learning still occurs for misspecified models: Theorem 4.3. Let Theorems 3.1, 3.4, 3.7 and 3.9 hold. Then, for any 0 < δ < e 1, there exists a constant C(δ) such that the estimator ˆ A in Equation (7) verifies with probability 1 δ: ˆ A A p 2

F C(δ)D2(D + 1)2 p dr N(T p)polylog(κ, p , d, r, N, T) .

For r = d, we recover Theorem 4.1 (full-rank setting) for misspecified context windows. The main improvement in that case of Theorem 4.3 over Theorem 4.1 is the dependency on p instead of p. In practice, p can be much larger than p and even on the order of T. In such a setting, learning all matrices A k becomes impossible if N is not large enough and one does not take advantage of the length T of the sequences. One can instead misspecify the student with a context length of p p such that NT p d2, so that the first p matrices are still learned.

Lastly, we briefly remark that Theorem 4.1 provides a rate for the case where p < p . The latter case can be seen under a well-specified setting by rewriting the ground truth model as A = (A 1, . . . , A p, 0d, . . . , 0d) where the last p p indices are padded with null matrices. Learning in such a case is then answered by Theorem 4.1 with a worsened rate that depends on p .

5 DISCUSSION

We now discuss the rates obtained in Section 4 and compare them with previous results obtained for linear dynamical systems. In particular, we comment the learning-without-mixing phenomenon, introduced by Simchowitz et al. (2018) for the first-order linear dynamical systems.

Published as a conference paper at ICLR 2025

Adaptation of first-order techniques (p = 1) to the long-context setting. Here, we explain why techniques developed in the p = 1 setting, in particular those of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), do not work for the p > 1 setting and why, even if adapted, they would fail to achieve the desired sharp dependency on p.

Observe that the multi-step dynamics can be cast as a 1-step dynamic using block companion matrices. Let X(n) t = (x(n), t , . . . , x(n), t+p 1) Rpd, Ξ(n) t = (0, . . . , 0, (ξ(n) t ) ) Rpd and let A Rpd pd be the companion matrix associated to A :

0d Id 0d ... ... ... ... 0d 0d Id A p A p 1 A 1

We have the relation X(n) t+1 = A X(n) t + Ξ(n) t , reducing the problem to the p = 1 case by increasing the dimension from d to pd. First, brute-force adapting previous results to this case (e.g. Basu et al., 2019; Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) is not possible since these works assume that the noise covariance of the additive noise added at each step (Ξ(n) t here) is the identity matrix, or at least is positive definite. In our case, the noise covariance is the pd pd block-diagonal matrix, with p 1 blocks equal to 0d and the last one to Id. The covariance matrix is thus non-invertible, preventing the use of previous works.

In addition, arguments based on system excitation (e.g. Basu et al., 2019; Simchowitz et al., 2018) are bound to incur an additional dependence on p, on top of the factors expected due to the dimensionality of the problem. In particular, as seen in the small-ball martingales argument by Simchowitz et al. (2018, Section 2.3), evaluating quantities like (A A )X(n) t 2 for the (k, ν, q)-block martingale small-ball assumption requires k p as p represents the minimum number of steps for noise to propagate in every direction. Consequently, these analyses lead to a suboptimal p dependency.

Moreover, adapting the techniques developed in the p = 1 setting (Sarkar & Rakhlin, 2019) which relies on explicit factorization of the OLS estimator is challenging. In the p > 1 case, the higherorder dynamics complicate the factorization, and the data matrix takes a Toeplitz form, which is more difficult to handle.

Learning-without-mixing. We explain why our rates exhibit learning-without-mixing . We begin by defining learning-with-mixing and discussing the factors that influence the mixing time τmix. We then introduce the concept of learning-without-mixing as exemplified by Simchowitz et al. (2018) and show that our bounds exhibit similar properties.

Let τmix be the mixing time of the Markov chain (X(n) t )t 0. In the i.i.d. setting (for which τmix = 1), the OLS estimator obtains the optimal rate ˆ AOLS A 2 F = O(pd2/NT), since pd2 is the dimension of the inputs. With non-i.i.d. but Markovian data, a naive strategy would be to emulate i.i.d.-ness and take only a sample every τmix steps of the trajectory to compute the OLS estimator, thus having data that are approximately i.i.d. while dividing the number of samples by τmix. This naive learning-with-mixing estimator would yield ˆ Anaive A 2 F = O(τmixpd2/NT), where the mixing time appears as a cost of non-i.i.d.-ness and O hides the logarithmic terms in problem parameters p, d, N, T.

In our case, two components contribute to the mixing time, τmix. The first component is related to the stability or the excitability of the system and scales as 1/(1 ρ), where ρ = MA op < 1. When ρ 1, this component has no impact, while ρ tends to 1, the system is less stable and the Markov chain mixes more slowly. The second component is directly related to the context length p of the process. Regardless of the factor 1/(1 ρ) above, the mixing time of our Markov chain is larger than p: since noise is added only in the last block in the recursion X(n) t+1 = A X(n) t + Ξ(n) t , starting from a given state, p iterations at least are needed to eventually forget this given state. The naive learning-with-mixing benchmark rate is thus ˆ A A 2 F max (1/ (1 ρ) , p) pd2/NT.

In contrast, a rate of convergence that exhibits learning-without-mixing is a rate of the form ˆ A A 2 F Cpd2/NT where C τmix. Such a rate means that the matrix A is learned without

Published as a conference paper at ICLR 2025

paying the cost of non-i.i.d.-ness. For instance, in the p = 1 case, the rate of Simchowitz et al. (2018) does not worsen as ρ tends to 1 in fact, ρ 1 actually improves their rates.

The bound presented in Theorem 4.1 takes the form O(D2pd2/(N(T p))). Importantly, the dependencies on the underlying Markov chain are only through D and ln κ, which do not have a direct dependency on the mixing time. The dependency on D is merely an operator norm upper bound and does not diverge as the mixing time grows to . Similarly, ln κ is logarithmic in T for systems of interest, as we discuss below.

System stability and κ. We now explain the behavior of κ defined in Theorem 3.6. First, by Theorem C.11, we have that σmin(L ) 1 D+1 and, thus, it is sufficient to upper bound

ζ(T) := sup i,j [T ] L(i,j) op sup i,j [T ]

to control κ. Equation (10) implies that if the noise at step i contributes to step j, as measured by L(i,j) , at a polynomial rate in (j i), then κ grows at most polynomially in T. For such a κ, the resulting dependency on T is of order ln T and mild. Instead, if it is exponential in (j i), then ln κ grows linearly in T and the dependency on T cancels out in the rate.

We use the quantity ζ(T) to define strictly stable, marginally stable and explosive systems: Definition 5.1. An LDS as defined in Equation (3) is called

strictly stable if ζ(T) = O(ρT ) for some ρ < 1 ,

marginally stable if ζ(T) = O(T k) for some k N ,

explosive if ζ(T) = O(ρT ) for some ρ > 1 .

Theorem 5.1 is similar to the notions of strictly stable, marginally stable and explosive systems considered in (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) for p = 1. Let ρ(A ) := λmax(A ) be the spectral radius of A and V ΛV 1 be the Jordan normal form of A . Then,

L(i,j) op = (A )j i op = V Λj 1V 1 op V op Λj i op V 1 op .

Note that V op and V 1 op are constants. For upper bounding Λj i op, consider the Jordan blocks {Λk} of Λ, associated with the eigenvalues λk of A . Then, Λj i op supk Λj i k op and

Λj i k op = (λk In + Nn)j i op =

max {j i,n 1} X

max {j i,n 1} X

m=0 ρ(A )j i j i m

where n is the block size for the Jordan block Λk. Note here that n does not scale with T.

In particular, for strictly stable systems of Simchowitz et al. (2018); Sarkar & Rakhlin (2019) with ρ < 1, ζ(T) = O(ρT ). For marginally stable systems of Sarkar & Rakhlin (2019) with ρ < 1 + γ

T with some constant γ > 0, ρ(A )j i eγ and ζ(T) = O(T k) for some fixed k that depends on the largest Jordan block of A . For explosive systems of Sarkar & Rakhlin (2019) with ρ > 1, ζ(T) = O(ρT ). Thus, Theorem 5.1 provides a general categorization of the systems based on the growth of ζ(T) in p > 1 case. Furthermore, our analysis yields sharp rates for strictly stable and marginally stable systems previously considered only in the p = 1 setting.

Search space diameter D. Our analysis is based on the assumption that the diameter D of the search space is bounded and, hence, not directly applicable to the OLS estimator in Equation (4). However, Theorem C.7 in Section C.1 extends the results of Theorems 4.1 to 4.3 to minimizers without a constraint on the diameter of the search space. This extension does not change the rates but requires the additional assumption that NT = Ω(p 2dr).1 In the case of Theorem 4.1, this

1We use the convention that r = d, p = p for Theorem 4.1.

Published as a conference paper at ICLR 2025

corresponds to a result for the OLS estimator, but necessitating a number of samples quadratic in context length. Below, we comment on why the diameter restrictions is required when NT p 2dr.

As mentioned earlier in comparison with (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), the simple OLS factorization in the p = 1 case does not generalize to the p > 1 and the data matrix has a Toeplitz structure that is more difficult to control. In order to deal with these issues, as explained in the sketch of proof in Section A, we rely on techniques from empirical process theory. These techniques are applied to quantify the probability of the event in Equation (13), which hold for any empirical risk minimizer of the square loss. This leads us to the study of the concentration of the martingales defined in Equation (15) around their predictable variation, which is a key step in our analysis. A uniform concentration is possible only if there is a uniform lower bound on the variations of the martingales, which can be achieved using a set of well-behaved matrices MA MA F / MA MA op. In order to translate these conditions on the design matrices without additional dimensional dependencies, we introduce the operator norm constraint.

Lastly, it is possible to extend our analysis to unconstrained OLS by establishing a general coarse upper bound on the operator norm M ˆ A op K. This allows us to consider uniform lower bounds to matrices A with MA op K, which lead to a rate for the OLS estimator in a similar manner.

Upper bound on D . The misspecification result in Theorem 4.3 requires the additional assumption given in Theorem 3.9. In Theorem C.6, we show that a good upper bound on D is possible when D < 1, i.e., the system is strictly stable, by using the bound L op 1/(1 D). However, misspecification results are not, a priori, applicable to marginally stable systems, which limits the practical applicability of our results. We leave the investigation of misspecification results for marginally stable systems for future work.

Practical implications. The rates obtained in Section 4 holds true if the empirical risk minimizer ˆ A is replaced by any estimate A that verifies

L( A) L(A ) . (11)

The training error for the ground truth A concentrates around σ2d with a rate of O(1/

NT), which is identical to that in the i.i.d. setting. Hence, any algorithm that optimizes the training error below the threshold σ2d achieve the rates in Section 4.

Moreover, in Section D, we show that the rates extend to approximate minimizers, i.e., any estimate A that verifies

L( A) L( ˆ A) + ϵtr , where ϵtr = O p dr

where ϵtr is the surplus training error of the estimate A over the empirical risk minimizer ˆ A.

Estimates satisfying equations (11) or (12) are computationally tractable in practice. This implies that practitioners can determine the required number of samples, or N and T, for estimating the system parameters up to a fixed precision by using the rates in Section 4.

6 CONCLUSION

In this work, we extend non-asymptotic linear system identification theory and derive upper bounds on the sample complexity of learning long-context linear autoregressive models. Our bounds improve upon the existing arguments specific to first-order systems by employing a uniform concentration argument over prediction differences. We further establish improved statistical rates when learning under a low-rank assumption. Finally, we show that even with long or unbounded generative contexts, misspecification still allows the estimation of the matrices with a reduced sample complexity and for stable systems.

While this work makes significant progress for non-asymptotic linear system identification theory, several technical questions remain open for further investigation. Can the OLS operator norm be coarsely controlled to derive rates for unconstrained OLS in the NT = Ω(pdr) regime? Is it possible to find efficient algorithms that would benefit from low-rank assumptions? Lastly, can misspecification be beneficial for marginally stable systems?

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

This project was supported by the Swiss National Science Foundation (grant number 212111). This work was partially funded by an unrestricted gift from Google.

Pierre Alquier, Karine Bertin, Paul Doukhan, and R emy Garnier. High-dimensional VAR with low-rank transition. Statistics and Computing, 30(4):1139 1153, March 2020. doi: 10.1007/ s11222-020-09929-7. URL https://doi.org/10.1007/s11222-020-09929-7.

Francesco Bacchiocchi, Gianmarco Genalti, Davide Maran, Marco Mussi, Marcello Restelli, Nicola Gatti, and Alberto Maria Metelli. Autoregressive bandits. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li (eds.), International Conference on Artificial Intelligence and Statistics, 2-4 May 2024, Palau de Congressos, Valencia, Spain, volume 238 of Proceedings of Machine Learning Research, pp. 937 945. PMLR, 2024. URL https://proceedings.mlr.press/v238/ bacchiocchi24a.html.

Sumanta Basu, Xianqi Li, and George Michailidis. Low rank and structured modeling of highdimensional vector autoregressions. IEEE Trans. Signal Process., 67(5):1207 1222, 2019. doi: 10.1109/TSP.2018.2887401. URL https://doi.org/10.1109/TSP.2018.2887401.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798 1828, aug 2013. ISSN 01628828. doi: 10.1109/TPAMI.2013.50. URL https://doi.org/10.1109/TPAMI.2013. 50.

Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265 274, 2009. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha.2009.04.002. URL https://www.sciencedirect.com/ science/article/pii/S1063520309000384.

Etienne Boursier, Mikhail Konobeev, and Nicolas Flammarion. Trace norm regularization for multitask learning with scarce data. In Po-Ling Loh and Maxim Raginsky (eds.), Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pp. 1303 1327. PMLR, 2022. URL https://proceedings.mlr.press/ v178/boursier22a.html.

George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.

Peter J Brockwell and Richard A Davis. Time Series: Theory and Methods. Springer Science & Business Media, 1991.

Florentina Bunea, Yiyuan She, and Marten H. Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, 39(2):1282 1309, 2011. ISSN 00905364, 21688966. URL http://www.jstor.org/stable/29783674.

Samuel Burer and Renato D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329 357, February 2003. ISSN 1436-4646. doi: 10.1007/s10107-002-0352-8. URL http://dx.doi.org/10. 1007/s10107-002-0352-8.

Samuel Burer and Renato D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427 444, December 2004. ISSN 1436-4646. doi: 10.1007/s10107-004-0564-1. URL http://dx.doi.org/10.1007/ s10107-004-0564-1.

Emmanuel J. Cand es and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57 (4):2342 2359, 2011. doi: 10.1109/TIT.2011.2111771.

Published as a conference paper at ICLR 2025

Liam Collins, Aryan Mokhtari, Sewoong Oh, and Sanjay Shakkottai. MAML and ANIL provably learn representations. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 4238 4310. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/ collins22a.html.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum? id=Yicb Fd NTTy.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML 08, pp. 272 279, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605582054. doi: 10.1145/1390156.1390191. URL https: //doi.org/10.1145/1390156.1390191.

Kacha Dzhaparidze and JH Van Zanten. On bernstein-type inequalities for martingales. Stochastic processes and their applications, 93(1):109 117, 2001.

Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Vaishaal Shankar, Alexander T. Toshev, Joshua M. Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Open Review.net, 2024. URL https:// openreview.net/forum?id=c92KDf EZTg.

Mathieu Even. Stochastic gradient descent under markovian sampling schemes. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 9412 9439. PMLR, 2023. URL https://proceedings.mlr.press/v202/even23a.html.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. Autom., 96:342 353, 2018. doi: 10.1016/J.AUTOMATICA. 2018.07.008. URL https://doi.org/10.1016/j.automatica.2018.07.008.

Dylan Foster, Tuhin Sarkar, and Alexander Rakhlin. Learning nonlinear dynamical systems from a single trajectory. In Alexandre M. Bayen, Ali Jadbabaie, George Pappas, Pablo A. Parrilo, Benjamin Recht, Claire Tomlin, and Melanie Zeilinger (eds.), Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 851 861. PMLR, 10 11 Jun 2020. URL https://proceedings.mlr.press/v120/ foster20a.html.

Simon Foucart and Srinivas Subramanian. Iterative hard thresholding for low-rank recovery from rank-one projections. Linear Algebra and its Applications, 572:117 134, 2019. ISSN 0024-3795. doi: https://doi.org/10.1016/j.laa.2019.03.007. URL https://www.sciencedirect. com/science/article/pii/S0024379519301028.

David A Freedman. On tail probabilities for martingales. the Annals of Probability, pp. 100 118, 1975.

Christophe Giraud, Franc ois Roueff, and Andres Sanchez-Perez. Aggregation of predictors for nonstationary sub-linear processes and online adaptive forecasting of time varying autoregressive processes. The Annals of Statistics, 43(6):2412 2450, 2015. ISSN 00905364. URL http: //www.jstor.org/stable/43818856.

N. K. Gupta, R. K. Mehra, and W. E. Hall. Application of optimal input synthesis to aircraft parameter identification. Journal of Dynamic Systems, Measurement, and Control, 98(2):139 145, June 1976. ISSN 1528-9028. doi: 10.1115/1.3427000. URL http://dx.doi.org/10.1115/ 1.3427000.

Published as a conference paper at ICLR 2025

James D Hamilton. Time Series Analysis. Princeton University Press, 2020.

D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079 1083, 1971. ISSN 00034851. URL http://www.jstor.org/stable/2240253.

Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. J. Mach. Learn. Res., 19:29:1 29:44, 2018. URL https://jmlr.org/papers/ v19/16-465.html.

Charles R. Harris, K. Jarrod Millman, St efan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fern andez del R ıo, Mark Wiebe, Pearu Peterson, Pierre G erard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with Num Py. Nature, 585(7825):357 362, September 2020. doi: 10.1038/ s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 6702 6712, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ 165a59f7cf3b5c4396ba65953d679f17-Abstract.html.

T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis; Machine Intelligence, 44(09):5149 5169, sep 2022. ISSN 1939-3539. doi: 10.1109/TPAMI.2021.3079209.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pp. 9.1 9.24, Edinburgh, Scotland, 25 27 Jun 2012. PMLR. URL https://proceedings. mlr.press/v23/hsu12.html.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 427 435, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/ jaggi13.html.

Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Streaming linear system identification with reverse experience replay. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 30140 30152, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ fd2c5e4680d9a01dba3aada5ece22270-Abstract.html.

Yassir Jedra and Alexandre Prouti ere. Sample complexity lower bounds for linear system identification. In 58th IEEE Conference on Decision and Control, CDC 2019, Nice, France, December 11-13, 2019, pp. 2676 2681. IEEE, 2019. doi: 10.1109/CDC40024.2019.9029303. URL https://doi.org/10.1109/CDC40024.2019.9029303.

Vladimir Kostic, Pietro Novelli, Andreas Maurer, Carlo Ciliberto, Lorenzo Rosasco, and Massimiliano Pontil. Learning dynamical systems via koopman operator regression in reproducing kernel hilbert spaces. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/ hash/196c4e02b7464c554f0f5646af5d502e-Abstract-Conference.html.

Published as a conference paper at ICLR 2025

Lennart Ljung. System Identification, pp. 163 173. Birkh auser Boston, Boston, MA, 1998. ISBN 978-1-4612-1768-8. doi: 10.1007/978-1-4612-1768-8 11. URL https://doi.org/10. 1007/978-1-4612-1768-8_11.

Horia Mania, Michael I. Jordan, and Benjamin Recht. Active learning for nonlinear system identification with guarantees. J. Mach. Learn. Res., 23:32:1 32:30, 2022. URL https: //jmlr.org/papers/v23/20-807.html.

Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey. Found. Trends Mach. Learn., 16(1):1 118, 2023. doi: 10.1561/2200000086. URL https://doi.org/10.1561/2200000086.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/ 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. pdf.

Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18(none):1 9, 2013. doi: 10.1214/ECP.v18-2865. URL https://doi.org/10.1214/ECP.v18-2865.

Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5610 5618. PMLR, 09 15 Jun 2019. URL https://proceedings. mlr.press/v97/sarkar19a.html.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In S ebastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 439 473. PMLR, 06 09 Jul 2018. URL https://proceedings.mlr.press/v75/simchowitz18a.html.

Alex Simpkins. System identification: Theory for the user, 2nd edition (ljung, l.; 1999). IEEE Robotics and Automation Magazine, 19(2):95 96, 1999. doi: 10.1109/MRA.2012.2192817.

Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10434 10443. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/ v139/tripuraneni21a.html.

Anastasios Tsiamis, Ingvar Ziemann, Nikolai Matni, and George J Pappas. Statistical learning theory for control: A finite-sample perspective. IEEE Control Systems Magazine, 43(6):67 97, 2023.

Stephen Tu, Roy Frostig, and Mahdi Soltanolkotabi. Learning from many trajectories. J. Mach. Learn. Res., 25:216:1 216:109, 2024. URL https://jmlr.org/papers/v25/ 23-1145.html.

Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. Create Space, Scotts Valley, CA, 2009. ISBN 1441412697.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998 6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Published as a conference paper at ICLR 2025

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. doi: 10.1017/ 9781108627771.

F. T. Wright. A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric. The Annals of Probability, 1(6):1068 1070, 1973. doi: 10.1214/aop/1176996815. URL https://doi.org/10.1214/aop/ 1176996815.

O guz Kaan Y uksel, Etienne Boursier, and Nicolas Flammarion. First-order ANIL provably learns representations despite overparametrisation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=if2v Rb S8Ew.

Ingvar M. Ziemann and Stephen Tu. Learning with little mixing. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur IPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/ 1dc9fbdb6b4d9955ad377cb983232c9f-Abstract-Conference.html.

Published as a conference paper at ICLR 2025

ORGANIZATION OF THE APPENDIX

The appendix is organized as follows,

In Section A, we provide a sketch of proof for Theorem 4.1. In Section B, we provide preliminary tools needed for our analyses: Hanson-Wright and Freedman inequalities, supremum of sub-Gaussian processes, covering numbers and proof of Theorem 3.5. In Section C, we prove Theorems 4.1 to 4.3 jointly under Theorem C.4. In Section D, we show that the rates extend to approximate empirical minimizers. In Section E, we provide numerical experiments to verify our theoretical findings.

A SKETCH OF PROOF

We provide a sketch of proof for Theorem 4.1. The proofs of Theorems 4.2 and 4.3 are similar and can be found in Section C. In the following, A is a shorthand for MA MA and E RT d N is the matrix that collects the noise concatenated over time, as explained in Theorem C.1.

The empirical risk minimizer ˆ A satisfies the following optimality condition:

L( ˆ A) L(A ) , or written differently, ˆ AL E 2 2 Tr E L ˆ AE , (13)

due to the well-specified setting, i.e., A A(D). The condition in Equation (13) is of interest as

A A(D) , E h ˆ AL E 2i = σ2N ˆ AL 2 F > 0 = E Tr E L ˆ AE .

This inequality hints that if for a set of matrices A (D) A(D), there is a uniform result

E := n A A (D) : ˆ AL E 2 2 Tr E L ˆ AE o , (14)

with high probability as seen from their means, then the empirical risk minimizer ˆ A belongs to the set A(D) \ A (D) with the same high probability by a simple Bayesian argument. Hence, the proof of Theorem 4.1 is reduced to proving Equation (14) for a suitable set of matrices.

Fix a A A (D) and study the martingale series defined through the differences sequences

d(n) t,i = (A A ) X(n) t

where the series is first ordered in i, then in t, and finally in n. The sum of the differences is then

n,t,i d(n) t,i = X

D (A A ) X(n) t , ξ(n) t E = Tr E L AE ,

and the quadratic predictable variation of the series is

n,t,i E ξ(n) t

d(n) t,i 2 = X

(A A) X(n) t 2 = σ2 AL E 2 .

The condition that is asked in Equation (14) is then that the sum of the differences YA is large compared to the quadratic predictable variation WA, i.e., E = A A (D) : WA 2σ2YA .

In order to prove probabilistic statements on E, we use Freedman s inequality (Freedman, 1975; Dzhaparidze & Van Zanten, 2001) which gives control on YA and W R A for a particular A:

P YA r Y , W R A r W exp r2 Y /2 r W + Rr Y

where W R A = WA + P

n,t,i 1d(n) t,i >R

d(n) t,i 2 and r Y , r W , R > 0 are arbitrary constants. As the

noise is sub-Gaussian, it is possible to upper bound W R A with WA: X

d(n) t,i 2 sup n,t,i

n,t (A A ) X(n) t 2

w.h.p = A : W R A C1 ln (2d NT) WA .

Published as a conference paper at ICLR 2025

Further, assume that there are uniform upper and lower bounds on YA and WA, respectively:

0 < αL < αU such that A A (D) : YA αU and αL WA W R A . (17)

Let γ = C1 ln 2d NT and set k to be the smallest integer that verifies ek αL 2γσ2αU. Then,

P WA 2σ2YA P W R A 2γσ2YA k k=1P W R A ekαL, 2γσ2YA ek 1αL .

Each of the terms in the union can be controlled by the Freedman s inequality in Equation (16) with the choices of r Y = αLek 1/2γσ2, r W = αLek and R = 2eγσ2:

P WA 2γσ2YA

k=1 exp ek 2αL

αL C2σ4 (ln 2d NT)2 + ln ln C3σ2 ln (2d NT) αU

As can be seen from Equation (18), the probability of the event WA 2σ2YA is largely controlled with the lower bound αL as the ratio αU/αL only matters logarithmically. This is crucial as the αU and αL differ with the condition number κ, which can scale with T.

Therefore, it possible to control the event E with a union bound over an ϵ-net of A (D). In particular, αL needs to be uniformly bounded below as follows:

αL ln |Nϵ(A (D))| ln(2d NT)2 .

This is achieved by Hanson-Wright inequality (Hanson & Wright, 1971) which allows us to derive the needed uniform lower and upper bounds in Equation (17)

YA c1σ2 L op p

pdr N A F , W R A WA c2σ4σmin(L )2N A 2 F ,

with high probability as long as A (D) is composed of matrices that satisfy

A 2 F A 2op ln |Nϵ(A (D))| , where ϵ polysqrt(p, d, N, T, κ)

1 + c2 ln 1

Here, we pick up the dependency on κ as ϵ scales with κ. This is needed to bound the worst-case errors while transitioning from point-wise bounds on the ϵ-net Nϵ(A (D)) to the whole set A (D).

Finally, since there is a uniform bound on A op 2D implied by Theorem 3.4, setting

A (D) = A | A 2 F C(δ)D2 pdr

N polylog(κ, p, d, N, T) ,

for some constant C(δ) is sufficient to deduce ˆ A A(D)\A (D) with probability 1 δ. The proof of Theorem 4.1 is then complete as A 2 F = MA MA 2 F (T p) A A 2 F .

B PRELIMINARY TOOLS

B.1 HANSON-WRIGHT INEQUALITY

We use Hanson-Wright inequality (Hanson & Wright, 1971; Wright, 1973; Rudelson & Vershynin, 2013) to show concentration of certain second-order terms. Theorem B.1. (Hanson-Wright) Let Z = (Z1, . . . , Zn) Rn be a random vector with independent components Zi which satisfy E[Zi] = 0 and Zi ψ2 K. Let P be an n n matrix. Then, for every r 0,

P Z PZ E Z PZ > r 2 exp CHW min r2

K4 P 2 F , r K2 P op

The bound can be turned into a one-sided bound by dropping the constant 2. Remark B.2. For data regime considered in this paper, K = cσ in Theorem B.1.

Published as a conference paper at ICLR 2025

B.2 FREEDMAN S INEQUALITY

We use an extension of Freedman s inequality (Freedman, 1975) to non-bounded differences by Dzhaparidze & Van Zanten (2001) to show concentration of certain second-order terms. For the sake of completeness, we provide the original Freedman s inequality. We also remark that it is possible to use the original Freedman s inequality in our proofs to deal with any bounded noise, leading to improved logarithmic factors, as explained in Theorem C.27. Theorem B.3. (Freedman s inequality) Let Y0, . . . , Yn be a real-valued martingale series that is adapted to the filtration F0, . . . , Fn where Y0 = 0. Let d1, . . . , dn be the difference sequence induced, i.e., di = Yi Yi 1 for i = 1, . . . , n. Assume that di is upper bounded by some R, i.e., |di| R for all i. Let Wi be the quadratic variation of the martingale series, i.e.,

j=1 E[d2 j | Fj 1] for i = 1, . . . , n.

Then, for any r, W > 0,

P ( k 0 : Yk r and Wk W) exp r2/2 W + Rr

Theorem B.4. (Freedman s inequality for non-bounded differences) Let Y0, . . . , Yn be a realvalued martingale series that is adapted to the filtration F0, . . . , Fn where Y0 = 0. Let d1, . . . , dn be the difference sequence induced, i.e.,

di = Yi Yi 1 for i = 1, . . . , n.

Let W R i be the quadratic variation of the martingale series plus an error term for large differences,

j=1 E[d2 j | Fj 1] + d2 i 1{|di|>R} for i = 1, . . . , n.

We set Wi = W 0 i for ease of notation. Then, for any r, R, W > 0,

P k 0 : Yk r and W R k W exp r2/2 W + Rr

We extend these theorems in Theorem B.5 to compare the quadratic variation with the martingale series itself. This is useful in our proofs to show certain events necessarily implied by empirical risk minimization do not occur with high probability. Lemma B.5. Let γ, R > 0, αU αL > 0 be scalars and let E denote the following event

E = W R n αL {Yn αU} ,

where Yn and W R n verifies the assumptions of Theorem B.4. Then, we have the following concentration inequality

P W R n γYn E exp αL 2eγ (eγ + R) + ln ln γαU

Proof. Let G = αL, eαL, . . . , ekαL where k is the smallest positive integer such that

Then, by a union bound,

P W R n γYn E P k i=1 W R n eiαL, γYn ei 1αL E

P k i=1 W R n eiαL, γYn ei 1αL

i=1 P W R n eiαL, γYn ei 1αL .

Published as a conference paper at ICLR 2025

By applying Theorem B.4 with r = ei 1αL/γ and W = eiαL, we obtain

P W R n eiαL, Yn ei 1αL/γ exp ei 2αL 2γ (eγ + R)

for each i = 1, . . . , k. The union bound gives

i=1 P W R n eiαL, Yn ei 1αL

i=1 exp ei 2αL 2γ (eγ + R)

exp αL 2eγ (eγ + R) + ln k .

The result follows by noting that

B.3 SUPREMUM OF THE NOISE

We need the following lemma to control the supremum of the noise in our proofs.

Lemma B.6. Let X1, . . . , Xn be i.i.d. mean zero and σ2-sub-Gaussian random variables (in the sense provided in Theorem 3.1). Then, there exists a universal constant c such that for any t > 0,

P sup i=1,...,n |Xi| > c σ

2 ln 2n + t exp t2

Proof. By the sub-Gaussian property, we have a universal constant c such that

P (Xi r) exp r2

Then, by the union bound,

P sup i=1,...,n Xi r i P (Xi r) n exp r2

Similarly, we have

P sup i=1,...,n Xi r i P ( Xi r) n exp r2

The result follows by union bound with r = c σ

2 ln 2n + t.

Corollary B.7. For any 0 < δ < e 1, there exists a universal constant c such that

sup t,n ξ(n) t c σ

with probability 1 δ.

Proof. Each component ξ(n) t

i are i.i.d. and sub-Gaussian with parameter σ. By Theorem B.6,

P sup t,n ξ(n) t > c σ

2 ln 2d TN + t exp t2

for any t > 0. Select t = c σ q

δ to obtain the desired confidence level of δ.

Published as a conference paper at ICLR 2025

B.4 COVERING NUMBERS

We use the following lemma by (Cand es & Plan, 2011) to control the covering numbers of the set of matrices:

Lemma B.8. (Covering number for low-rank matrices) Let Mn1 n2 r denote the set of low-rank matrices of rank r and Frobenius norm bounded by 1:

Mn1 n2 r = M Rn1 n2 : M F = 1, rank(M) r .

Then, there exists an ϵ-net S of M for Frobenius norm such that

(n1+n2+1)r .

Corollary B.9. Let Mr,p (D) denote the following set of matrices with bounded operator norm:

Mr,p (D) = A Rd pd | A op D, rank(Ai) r, Ap +1 = = Ap = 0 .

Then, there exists an ϵ-net S of Mr,p (D) for operator norm such that

|S| exp 9p dr ln Dp

Proof. Since for any matrix M, M op M F , it is sufficient to give a covering number for the Frobenius norm. Let Md d r (D) be the set of low-rank matrices of rank r and Frobenius norm bounded by D:

Md d r (D) = M Rd d | M F D, rank(Mi) r .

By Theorem B.8, there exists an ϵ

p -net S of Md d r (D) for Frobenius norm such that

(2d+1)r exp 9dr ln Dp

as any ϵ Dp -net of Md d r (1) gives an ϵ

p -net of Md d r (D).

Observe that the set Mr,p (D) is a subset of

U = Md d r (D) p {0d d}p p .

Then, (S)p gives an ϵ-net of U and hence of Mr,p (D).

B.5 PROOF OF THEOREM 3.5

Proof. Using Theorem C.12, we have MA op p A op, directly leading to (ii).

For (i), we have

i=1 MA ,(i) op , where A ,(i) =

0d, 0d, . . . , 0d | {z } i 1 times , A ,i, 0d, . . . , 0d

Then, it is easy to see that MA ,(i) op A i op.

Published as a conference paper at ICLR 2025

C PROOF OF THEOREMS 4.1 TO 4.3

Before proving the main theorems, we recall certain definitions from the main body of the paper: Definition C.1. For any A Rd pd, let A = A,p be defined as follows

A,i = (MAi MA i ) ,

where Ai = (A1, . . . , Ai, 0d, . . . , 0d) , A i = (A 1, . . . , A i , 0d, . . . , 0d) .

Let ξ(i) RT d be the whole noise concatenated in time, i.e.,

ξ(i) = ξ(i) 1 , . . . , ξ(i) T ,

and let E RT d N be the matrix that collects the noise for all sequences, i.e.,

E = ξ(1), . . . , ξ(n) .

Proposition C.2. With the definitions of Theorem C.1, we have the following properties: X

n,t (Ai A i )X(n) t , ξ(n) t = Tr(E A,i L E) ,

(Ai A i )X(n) t 2 = (MAi MA i )L E 2 F = A,i L E 2 F .

Definition C.3. Let Ar,p(D) and Sr,p(C, D) be the search and solution set for constants C, D 1:

Ar,p (D) = n A Rd pd A op D, rank(Ai) r, Ap +1 = = Ap = 0 o ,

Sr,p (C, D) = A A(D) A A p 2 F CD2η2τ p dr N(T p )

where η is a constant that captures an additional factor for the misspecified setting,

1 if p = p ,

max 1, 1 + MA MA p L op

and τ is the following logarithmic term:

τ = 1 + ln p 2d NT 2

3 (1 + ln κ) .

Let Gr,p (C, D) be defined as follows,

Gr,p (C, D) = A Ar,p(D) A 2 F A 2op Cη2τ p dr

We set Ar,p = Ar,p ( ) and G(C)r,p = G(C, ). Lastly, we drop the subscript r, p when it is clear from the context.

C.1 MAIN RESULTS

In this subsection, we state Theorem C.4 that generalizes the statements in Theorems 4.1 to 4.3. We give a proof that reduces Theorem C.4 to a uniform concentration result in Theorem C.5. The proof of Theorem C.5 is deferred to Section C.5. Next, in Theorem C.6, we discuss the factor η that appears in our results and the conditions under which our misspecification bounds are tight. Lastly, Theorem C.7 removes the constraint on the diameter of the search set which allows us to treat OLS as a special case of the main theorem. Theorem C.4. Let Theorems 3.1 and 3.4 hold. Furthermore, let Theorem 3.7 for r < d and Theorem 3.9 for p < p hold. Consider the following constrained empirical risk minimizer: ˆ A = argmin A A(D) L(A) .

Then, for any 0 < δ < e 1, there exists a constant C(δ) such that

P ˆ A S(C(δ), D) 1 δ .

Published as a conference paper at ICLR 2025

Proof. Let EA be the following event

EA = { AL E 2 F 2η Tr(E AL E)} . By Theorem C.13, G(C, D) S(C, D) and thus, for any random choice of A,

P ({A S(C, D)}) P ({A G(C, D)}) = 1 P ({A A(D) \ G(C, D)}) .

For the choice of ˆ A, P E ˆ A = 1 by Theorem C.15 and

P n ˆ A S(C, D) o 1 P n ˆ A A(D) \ G(C, D) o | E ˆ A .

By Bayes rule, we have

P n ˆ A S(C, D) o 1 P E ˆ A | n ˆ A A(D) \ G(C, D) o P n ˆ A A(D) \ G(C, D) o

1 P E ˆ A | n ˆ A A(D) \ G(C, D) o .

Then, the proof is complete by applying Theorem C.5 to the right-hand side.

Theorem C.5. Let all the assumptions of Theorem C.4 hold. Then, for any 0 < δ < e 1, there exists a constant C(δ) such that

P A A(D) \ G(C(δ), D) : AL E 2 F 2η Tr(E AL E) δ . (19)

Remark C.6. For strictly stable systems with MA op < 1 and MA p op < 1, the factor η is controlled by Theorem C.10. However, for marginally stable systems or explosive systems, there is no a prior good upper bound on η, implying that the misspecification results only applies to strictly stable systems without any further assumptions. Corollary C.7. Let all the assumptions of Theorem C.4 hold and suppose that NT verifies the following condition: N(T p ) C(δ)η2τp 2dr , (20) where 0 < δ < e 1 is a constant and C(δ) is given by Theorem C.4 Then, the OLS estimator ˆ AOLS = argmin A A( ) L(A) ,

satisfies the same concentration result as in Theorem C.4:

P ˆ AOLS S(C(δ), D) 1 δ .

Proof. Assume that D is sufficiently large such that A(D)o S(C(δ), D) , i.e., the interior of A(D) contains S(C(δ), D). We have the following relation:

A( ) = A = αA | A A(D) \ S(C, D), α 1 R . Then, by Theorem C.5, we have

P A A( ) \ S(C(δ), D) : AL E 2 F 2η Tr(E AL E)

= P α 1 R, A A(D) \ S(C(δ), D) : αAL E 2 F 2η Tr(E αAL E)

= P A A(D) \ S(C(δ), D) : AL E 2 F 2η Tr(E AL E) δ , as α 1, we have the following:

AL E 2 F 2η Tr(E AL E) = αAL E 2 F 2η Tr(E αAL E) . Thus, the result is complete by applying the same argument as in Theorem C.4 where A(D) is replaced by A( ).

We only need to provide a D such that A(D)o S(C(δ), D). For any A S(C(δ), D), we have

A 2 op p A 2 op p A 2 F C(δ)D2η2τ p 2dr N(T p ) ,

from Theorem C.12. Therefore, we need to find a D such that

D2 C(δ)D2η2τ p 2dr N(T p ) ,

which equivalent to the condition in Equation (20).

Published as a conference paper at ICLR 2025

C.2 TECHNICAL LEMMAS

In this subsection, we present simple technical results on L , MA and A that are used in the proof of Theorem C.4. Lemma C.8. L and MA satisfy the following relations:

L = MA L + I, MA = (L I)L 1 , L = (I MA ) 1 .

Proof. The first relation follows from a direct computation. For the second, note that L is invertible since it is a lower triangular matrix with non-zero diagonals. Lastly,

L = I + MA L = I + MA + M 2 A L =

= I + MA + M 2 A + + M T 1 A

= (I MA ) 1 ,

where we have used the fact that M T A = 0T d.

Lemma C.9. Assume that MA op < 1. Then, the operator norm and minimum singular value of L are bounded as follows,

1 1 + MA op L op 1 1 MA op , 1 2 σmin (L ) .

Proof. By Weyl s inequality for singular values on the identity L = MA L +I from Theorem C.8,

L op I op + MA L op 1 + MA op L op , L op I op MA L op 1 MA op L op .

This implies the desired inequalities for L op. For the lower bound on minimal singular value, use Theorem C.8,

σmin (L ) = σmin (I MA ) 1 = 1 I MA op 1 1 + MA op 1

Corollary C.10. Assume that MA op < 1 and MA p op < 1. Then, we have

η 2 1 MA op .

Proof. Applying Theorem C.9,

η MA MA p op L op MA op + MA p op 1 MA op 2 1 MA op .

Lemma C.11. Assume that MA op D. Then, the operator norm and minimum singular value of L are bounded as follows,

D 1 , 1 D + 1 σmin (L ) .

Proof. By Weyl s inequality for singular values on the identity L = I + MA + + M T 1 A from Theorem C.8,

L op I op +

t=1 M t A op

t=0 Dt DT 1

Published as a conference paper at ICLR 2025

For the lower bound on minimal singular value, use Theorem C.8,

σmin (L ) = σmin (I MA ) 1 = 1 I MA op 1 1 + MA op 1 D + 1 .

Lemma C.12. For any A Rd pd,

A op MA op p

p A op , 1 T A 2 F A A p 2 F 1 T p A 2 F .

Proof. Let u = (u1, . . . , u T ) RT d be an arbitrary vector with u 2 2 = 1. Then, setting u a = 0 for any a 0,

i=1 (MAu)i 2 2 =

k=1 Akui k 2 2

i=1 A:p ui p :i 1 2 2

i=1 A:p 2 op ui p :i 1 2 2

i=1 p ui 2 2 = p A 2 op.

The left-hand side of the first inequality follows by picking up +1:T = 0 and u1:p as the maximal singular vector of A:p with unit length. The second inequality follows by a simple computation.

Corollary C.13. For any A A(D),

A A p 2 F D2

T p A 2 F A 2op .

Proof. By definition of A(D),

T p A 2 F A 2op 1 T p A 2 F ,

and the result follows by Theorem C.12.

Proposition C.14. The empirical risk minimizer ˆ A, i.e., ˆ A argmin A A(D) L(A) , (21)

implies L( ˆ A) L(A p ), which can be rewritten as follows: ˆ AL E 2 F 2 Tr E L ˆ A

I MA p L E .

Proof. By Theorem C.8,

NTL(A) = (MA I) L E 2 2 = [(MA MA ) L I] E 2 F

= h MA MA p L + MA p MA L I i E 2

= MA MA p L E 2

F + h MA p MA L I i E 2

+ 2 Tr E L MA MA p h MA p MA L I i E

= MA MA p L E 2

F + MA p I L E 2

+ 2 Tr E L MA MA p MA p I L E .

Published as a conference paper at ICLR 2025

Then, we have

NT L(A) L(A p ) = MA MA p L E 2

+ 2 Tr E L MA MA p MA p I L E , (22)

which implies the desired result for any A that satisfies L(A) L(A p ).

Corollary C.15. Observe that for p = p, Theorem C.14 reads ˆ AL E 2 F 2 Tr E ˆ AL E .

For p < p, one can write the following relaxed condition for any ˆ A: ˆ AL E 2 F 2 I MA p L op Tr E ˆ AL E

= 2 IT d + MA MA p L op Tr E ˆ AL E

2 1 + MA MA p L op

Tr E ˆ AL E

= 2η Tr E ˆ AL E .

C.3 LOWER AND UPPER ISOMETRIES

In Theorem C.16, we present uniform bounds on AL E 2 F and Tr(E AL E) in terms of A F . In order to establish these bounds, we first start with point-wise bounds in Theorems C.19 and C.20 that rely on Hanson-Wright inequality for bounding the deviation of quadratic forms. Then, we use Theorems C.21 to C.23 with a discretization argument to establish uniform isometries. Finally, with Theorem C.17, we have a uniform control over the range of both AL E 2 F and Tr(E AL E). Theorem C.16. Let δ > 0 be small and fixed. Then, there exists a constant 1 C(δ) = O(ln(1/δ)) such that the following holds uniformly for all A A(D) \ G(C, D) and C C(δ):

AL E 2 F σ2

8 σmin(L )2N A 2 F ,

Tr(E AL E) σ2 L op p

Cτp dr N A F ,

with probability at least 1 δ.

Proof. Let ν1, ν2 (0, 1) be arbitrary. By Theorems C.19 and C.20, with probability at least 1 δ1 δ2, the following holds:

AL E 2 F σ2 1 c2ν1 σmin(L )2N A 2 F ,

Tr(E AL E) σ2c2ν2 L op p

Cη2τp dr N A F , (23)

for any arbitrary ν1, ν2 (0, 1) and A A(D) \ G(C, D) where

δ1 = exp CHW Cν2 1η2τp dr , δ2 = exp CHW Cν2 2η2τp dr .

Let B(C, D) be the normalized A(D) \ G(C, D),

B(C, D) = A A F | A A(D) \ G(C, D) .

Then, since the conditions are homogeneous, Equation (23) holds for any A B(C, D) with probability 1 δ1 δ2.

Let Nϵ(D) be ϵ-net over the set B(C, D). Hence, with probability at least

1 δ0 = 1 |Nϵ(D)|(δ1 + δ2),

Published as a conference paper at ICLR 2025

the condition Equation (23) holds A Nϵ(D). Moreover, by Theorems C.21 to C.23, we have

2σ2 1 c2ν1 σmin(L )2N A 2 F σ2(1 + c2ν3)ϵ2 L 2 opp d NT

Tr(E AL E) σ2c2ν2 L op p

Cη2τp dr N A F + σ2(1 + c2ν3)ϵ L op p

A B(C, D) with probability at least 1 δ0 δ3 where

δ3 = exp CHW min ν3, ν2 3 d NT .

Recall that A 2 F (T p) A 2 F = T p , for any A B(C, D) due to the normalization.

Setting ν1 = 1 4c2 , ν2 = 1 2c2 and ϵ such that

ϵ = 1 2 (1 + c2ν3)

( 1 σcond(L )

and we have the following:

1 2σ2 1 c2ν1 σmin(L )2N A 2 F σ2(1 + c2ν3)ϵ2 L 2 opp d NT σ2

8 σmin(L )2N A 2 F ,

σ2c2ν2 L op p

Cη2τp dr N A F + σ2(1 + c2ν3)ϵ L op p

Cη2τp dr N A F ,

A B(C, D) with probability 1 δ0 δ3.

By homogeneity, this implies that A A(D) \ G(C, D), with probability at least 1 δ0 δ3,

AL E 2 F σ2

8 σmin(L )2N A 2 F ,

Tr(E AL E) σ2 L op p

Cη2τp dr N A F .

Lastly, we need ensure that δ0 + δ3 δ. First, δ3 δ/2 can be achieved with the choice of

ν3(δ/2) = max

1, ln 1 δ/2 CHW d NT

= O (ln(1/δ)) .

Moreover, the ϵ-net size is bounded as follows:

|Nε(D)| exp 9p dr ln p

by Theorem B.9. Then,

δ0 = |Nϵ(D)| (δ1 + δ2)

exp 5 16c4 CHW Cη2τp dr + 9p dr ln p

Thus, δ0 δ/2 can be achieved with the choice of

C C(δ) = 16c4

Note that η, τ 1 and the latter term is O(ln(1/δ)). For the first term, crudely upper bound ln p

ϵ by using x ln x + 1 for any x 0:

T p + ln σcond(L ) + ln 2(1 + c2ν3(δ/2)) τ + O(ln(1/δ)) .

Therefore, the definition in Equation (24) verifies C(δ) = O(ln(1/δ)).

Published as a conference paper at ICLR 2025

Corollary C.17. For any small δ > 0, there exists a constant 1 C(δ) = O(ln(1/δ)) such that the following holds uniformly for all A A(D) \ G(C(δ), D) and C C(δ):

inf A A(D)\G(C,D) AL E 2 F σ2

8 σmin(L )2C(δ)D2η2τp dr ,

sup A A(D)\G(C,D) Tr(E AL E) σ2 L op p

C(δ)D2η2τp d2r NT .

Proof. Plug in the results from Theorem C.16 and use the definition of A(D)\G(C(δ), D) to lower and upper bound A F .

Definition C.18. For applying Theorem B.1 in our setup, consider the following objects:

E = (ξ(1) , . . . , ξ(N)) RNT d , A = diag( A) RNT d RNT d ,

where diag(P) puts P in the diagonal blocks of a larger diagonal matrix. Lemma C.19. For any A A \ G(C) and ν (0, 1), with probability at least

1 exp CHW Cν2η2τp dr ,

we have the following

AL E 2 F σ2 1 c2ν σmin(L )2N A 2 F .

Proof. First, observe that AL E 2 F σmin(L )2 AE 2 F .

Applying Theorem B.1 with P = A A and r = c2σ2ν A 2 F ,

P E A A E E[ E A A E] c2σ2ν A 2 F

ν2 A 4 F A A 2 F , ν A 2 F A A op

Observe that E A A E = Tr(E A AE) = AE 2 F and

E h E A A E i = E h Tr E E A A i = σ2 A 2 F = σ2N A 2 F .

Furthermore, A 4 F = N 2 A 4 F , A A 2 F = N A A 2 F and A A op = A A op = A 2 op. Plugging these into the bound,

P AE 2 F σ2N A 2 F c2σ2νN A 2 F

exp CHW min ν2N A 4 F A A 2 F , νN A 2 F A 2op

Then, using A A 2 F A 2 F A 2 op and ν < 1,

P AE 2 F σ2(1 c2ν)N A 2 F 1 exp CHW ν2N A 2 F A 2op

The result follows from the definition of set A \ G(C).

Lemma C.20. For any A A \ G(C) and ν (0, 1), with probability at least

1 exp CHW Cν2η2τp dr ,

we have the following

Tr(E AL E) c2σ2ν L op p

Cη2τp dr N A F .

Published as a conference paper at ICLR 2025

Proof. First, by the properties of trace and Frobenius norm, we have

Tr(E AL E) L op Tr(E AE) .

Applying Theorem B.1 with P = A and r = c2σ2ν p

Cη2τp dr A F ,

P E A E E[ E A E] c2σ2ν p

Cη2τp dr A F

ν2Cη2τp dr, ν p

Cη2τp dr A F

Noting E[ E A E] = 0 and rewriting,

P Tr E AE c2σ2ν p

Cη2τp dr N A F

1 exp CHW min ν2Cη2τp dr, ν p

Cη2τp dr N A F

The result follows from the definition of set A \ G(C).

Lemma C.21. For any ν > 0, with probability at least

1 exp CHW min ν, ν2 d NT ,

we have the following E 2 F σ2d NT c2σ2νd NT.

Proof. Applying Theorem B.1 with P = Id T N, r = c2σ2νd NT,

P E E E[ E E] c2σ2νd NT exp ( CHW νd NT) .

The result follows by the fact E[ E E] = σ2d NT.

Lemma C.22. For any A1, A2 A,

A2L E 2 F 1

2 A1L E 2 F p A1 A2 2 op L 2 op E 2 F .

Proof. By the properties of Frobenius norm and Theorem C.12,

A1L E 2 F = ( A1 A2 + A2)L E 2 F 2 A2L E 2 F + 2 ( A1 A2)L E 2 F 2 A2L E 2 F + 2 A1 A2 2 op L 2 op E 2 F 2 A2L E 2 F + 2p A1 A2 2 op L 2 op E 2 F .

The results readily follows by reordering terms.

Lemma C.23. For any A1, A2 A,

Tr(E A2L E) Tr(E A1L E) + p

p A1 A2 op L op E 2 F ,

Proof. By the properties of trace and Theorem C.12,

Tr(E A2L E) = Tr E ( A2 A1 + A1)L E

= Tr E A1L E + Tr E ( A2 A1)L E

Tr E A1L E + A2 A1 op L op E 2 F Tr E A1L E + p

p A1 A2 op L op E 2 F .

Published as a conference paper at ICLR 2025

C.4 CONCENTRATION INEQUALITIES

In Theorem C.24, we show that the quantities of interest that show up in Equation (19) are related to a martingale series and its predictable quadratic variation. This allow us to use Theorem B.5 in Theorem C.26 to quantify the probability of the event in Equation (19) for finite sets of A.

Remark C.24. Fix A A(D). Consider the martingale differences sequences

d(n) t,i = A A p X(n) t

where the series is first ordered in i, then in t, and finally in n. Let Y be the sum of the martingale differences, i.e.,

i,t,n d(n) i,t .

Let W R A be the quadratic variation of the series plus an error term as in Theorem B.4, i.e.,

i,t,n E ξ(n) t

d(n) i,t 2 + X

n,t,i 1d(n) t,i >R

d(n) t,i 2 ,

Then, we have the following computations:

n,t (A A p )X(n) t , ξ(n) t = Tr E AL E ,

WA := W 0 A = σ2 X

A A p X(n) t 2 = σ2 AL E 2 F .

Proposition C.25. Let 0 < δ < e 1 and R > 0 be constants. Then, with probability at least 1 δ, we have

A A(D), C ln 2d NT + ln 1

WA W R A , (25)

where C = 1 + 4c 2 and c is the universal constant in Theorem B.6.

Proof. By Theorem B.7, there exists a constant c such that

sup t,n ξ(n) t c σ

Therefore, for any A A(D), we have

W R A = WA + X

n,t,i 1d(n) t,i >R

WA + 4c 2σ2 ln 2d TN + ln 1

n,t,i 1d(n) t,i >R

A A p X(n) t 2

WA + 4c 2σ2 ln 2d TN + ln 1

A A p X(n) t 2

WA + 4c 2 ln 2d TN + ln 1

Then, by rearranging terms, we have 1 + 4c 2 ln 2d TN + ln 1

Published as a conference paper at ICLR 2025

Theorem C.26. Let S A(D) be a finite set and let E(S) be the following event

E(S) = inf A S WA αL

sup A S YA αU

where αL, αU > 0 are two constants. Then, for any γ, R > 0 and 0 < δ < e 1,

P ({ A S : WA γYA} E(S))

|S| exp αL 2γ (γ + R) + ln ln γ αU

where γ = C eγ ln 2d TN + ln 1

and C is the universal constant in Theorem C.25.

Proof. For any A, let EA be the following event:

EA = W R A αL {YA αU} .

Then, by union bound, we have

P A S : W R A γYA E(S) X

A S P W R A γYA E(S)

A S P W R A γYA EA .

For any A S,

P W R A γYA EA exp αL 2eγ (eγ + R) + ln ln γαU

by Theorem B.5 which implies that

P A S : W R A γYA E(S) |S| exp αL 2eγ (eγ + R) + ln ln γαU

Finally, let Eδ be the event in Equation (25). Then,

P ({ A S : WA γYA} E(S))

P ({ A S : WA γYA} E(S) Eδ) + P EC δ

P A S : W R A C γ ln 2d NT + ln 1

P A S : W R A C γ ln 2d NT + ln 1

C.5 PROOF OF THEOREM C.5

Recall that Theorem C.24 shows that

YA = Tr E AL E , WA = σ2 AL E 2 F .

Therefore, we need to show that for any 0 < δ < e 1, there exists a constant C(δ) such that

P A A(D) \ G(C(δ), D) : WA 2σ2ηYA δ .

By Theorem C.17, there exists a constant C1(δ/4) = O(ln(1/δ)) such that for all C C1(δ/4)

inf A A(D)\G(C,D) WA σ4

8 σmin(L )2CD2η2τp dr ,

sup A A(D)\G(C,D) YA σ2 L op p

CD2η2τp d2r NT , (26)

Published as a conference paper at ICLR 2025

with probability 1 δ/4. In addition, by Theorem C.21, with probability at least 1 δ/4, we have

E 2 F 1 + c2ν(δ/4) d NT , where ν(δ/4) = max

In the following, we work conditionally on these events.

Let S be an ϵ-net over A(D)\G(C, D). By Theorems C.22 and C.23, for any A A(D)\G(C, D), there exists A S such that

2WA ϵ1(δ/4)ϵ2 , YA YA + ϵ2(δ/4)ϵ ,

where ϵ1(δ/4), ϵ2(δ/4) are given as follows:

ϵ1(δ/4) = σ4(1 + c2ν3(δ/4))p d NT L 2 op ,

ϵ2(δ/4) = σ2(1 + c2ν3(δ/4)) p

p d NT L op .

We set ϵ as follows:

ϵ = min r αL ϵ1(δ/4), αL 4σ2ηϵ2(δ/4), D .

In particular, ϵ is small such that for any A A(D) \ G(C, D),

A S : WA 2WA + 2αL , YA YA αL

4σ2η . (28)

Assume that A A(D) \ G(C, D) verifies

WA 2σ2ηYA .

Then, by Equation (28), there exists A S such that

WA 4WA 8σ2ηYA 16σ2ηYA 4WA 16σ2ηYA .

Therefore, we have

P A A(D) \ G(C, D) : WA 2σ2ηYA E P A S : WA 16σ2ηYA E ,

where E is the event that both Equation (26) and Equation (27) hold.

By Theorem C.26, there exists a universal constant C such that

P A S : WA 16σ2ηYA E

4γ 2 + ln ln γ αU

with the following choices:

γ = 16σ2η , R = γ = C eγ ln 2d TN + ln 1

8 σmin(L )2CD2η2τp dr ,

αU = σ2 L op p

CD2η2τp d2r NT .

By a union bound, Equation (29) implies that

P A A(D) \ G(C, D) : WA 2σ2ηYA

4γ 2 + ln ln γ αU

By Theorem C.12, we have

A op = MA A op A A op .

Published as a conference paper at ICLR 2025

Therefore, an ϵ-net covering of Mr,p (D) defined in Theorem B.9 is also an ϵ-net over A(D) after a shift by A and we have

ln |S| 9p dr ln Dp

Finally, we have to show that the right-hand side of Equation (30) is upper bounded by δ. That is, we need to prove

αL 4γ 2 ln ln γ αU

+ 1 + 9p dr ln Dp

for a suitable choice of C. Note that Equation (31) is homogeneous in σ. In addition, the left-hand side does not depend on η, whereas the right-hand side is decreasing in η. Therefore, for simplicity, we can set σ = 1 and η = 1.

By Theorem C.11 we have

σmin(L ) 1 D + 1 1 2D ,

32 = Ω C (1 + ln p d NT)3 (1 + ln κ) p dr .

For a choice that verifies

we have the following lower bound for αL:

αL 4γ 2 = Ω (1 + ln p d NT) (1 + ln κ) 1 + ln 1

p dr . (33)

The condition Equation (31) is satisfied if the following three conditions are individually satisfied:

(i) αL 4γ 2 ln ln γ αU

(ii) αL 4γ 2 9p dr ln Dp

(iii) αL 4γ 2 ln 4

as the left-hand side of Equation (31) is increasing in C, while the right-hand side is decreasing in C, we can pick the maximum C that satisfies all three conditions.

Condition (i). By Theorem C.11, we have

2D L op (D + 1) L op (D + 1)σmin(L ) 1 ,

and αU αL 2D L opαU

As x ln x + 1 for any x 0, we have

+ 1 ln γ αU

αL = O ln 2d NT + ln κ + ln 1

The lower bound in Equation (33) is sufficient to satisfy this condition.

Published as a conference paper at ICLR 2025

Condition (ii). We have that

ϵ ln Dp max

αL , 4ϵ2(δ/4)

ln Dp max Dϵ1(δ/4)

αL , 4ϵ2(δ/4)

= O ln p NT + ln κ + ln 1

The lower bound in Equation (33) is sufficient to satisfy this condition.

Condition (iii). This condition is readily verified by Equation (33).

Remark C.27. The choice of C in Equation (32) and τ, which appears in the bounds, have thirdorder logarithmic dependencies on δ and the problem parameters, respectively. This arises because the error is bounded above by second-order terms in δ and problem parameters. An additional logarithmic factor appears due to the discretization of the problem. Thus, assuming bounded noise, i.e., sup ξ(n) t B for some B > 0, one can replace the third-order logarithmic dependencies with first-order counterparts, along with dependency on B.

D APPROXIMATE EMPIRICAL RISK MINIMIZATION

In this section, we extend the results of Section C to approximate empirical risk minimizers A as in Equation (12). We begin with trivial extensions of Theorem C.14 and Theorem C.15 to approximate minimizers.

Proposition D.1. Let A be an estimate that provides an ϵtr-approximation to the loss of L(A p ):

L( A) L(A p ) + ϵtr .

In particular, one can choose A as an ϵtr-approximate empirical risk minimizer, i.e.,

A n A A(D) | L(A) L( ˆ A) + ϵtr o ,

where ˆ A is the empirical risk minimizer defined in Equation (21). Then, A satisfies AL E 2 F 2 Tr E L A

I MA p L E + NTϵtr .

Proof. The proof follows from Equation (22).

Corollary D.2. Observe that for p = p, Theorem D.1 reads AL E 2 F 2 Tr E AL E + NTϵtr .

For p < p, one can write the following relaxed condition for any A: AL E 2 F 2η Tr E AL E + NTϵtr .

Proof. The proof follows from Theorem C.15.

We divide the analysis into two cases: 2η Tr E AL E > NTϵtr and 2η Tr E AL E NTϵtr. In the first case, we can directly apply the results of Section C to approximate minimizers after noting that Theorem C.15 implies that AL E 2 F 2 (2η) Tr E AL E .

This is equivalent to doubling the constant η in the main theorem and does not change our results.

Published as a conference paper at ICLR 2025

In the second case, we have the following: AL E 2 F 2NTϵtr .

Recall the lower isometry proven in Equation (26):

inf A A(D)\G(C,D) AL E 2 F σ2

8 σmin(L )2CD2η2τp dr .

If ϵtr is such that

8 σmin(L )2CD2η2τp dr ,

then the approximate minimizer A is guaranteed to be in the set G(C, D) with high probability. By Theorem C.13, G(C, D) S(C, D), and thus, we have the desired result A S(C, D).

E EXPERIMENTS

All experiments in this section are implemented with Python 3 (Van Rossum & Drake, 2009) under PSF license and Py Torch (Paszke et al., 2019) under BSD-3-Clause license. In addition, we use Num Py (Harris et al., 2020) under BSD license.

For all the experiments, A is generated as follows. First, p orthogonal matrices of shape d d are sampled. These are then scaled down by α p where α is arbitrarily set to 0.5. In cases where A needs to be initialized, we use the same recipe for the student model with p instead of p and set α = 1. For experiments with low-rank ground truth, we set arbitrary d r singular values to 0 following a SVD decomposition. Each experiment in this section has been run over 3 independent seeds and the average is plotted. As the variance is small and the plots usually overlap, we opt to not plot it for visual clarity.

Theorems 4.1 to 4.3 provide rates on estimation error for empirical minimizers. In the following, we study these rates empirically for various values of p , p, d, N, T and r where r = d for fullrank experiments or r = 5 for low-rank experiments and p = p except it is stated otherwise. We use two quantities, β = NT, the number of total tokens, γ = pdr, the number of parameters to estimate, to summarize information in the plots. For Theorems 4.1 and 4.3, ˆA is computed with the OLS estimator and for Theorem 4.2, ˆA is learned with gradient descent with learning rate α on the group-norm regularized loss in Equation (9). The parameter λ and learning rate α are tuned by a grid search.

Figure 1 plots the estimation error for d {5, 10, 15}, p {5, 10, 15}, N {1, 5, 10} and T {1, 5, 10, 25, 50} pdr/N. The upper bound in Theorem 4.1 scales with the ratio β/γ up to logarithmic terms as empirically verified by Figure 1. In Figure 2, we verify that there is no individual trend to p and d, which implies that the error depends only on γ. Furthermore, we show the trend in N can be accounted for by incorporating the logarithmic term into β to obtain β = β/ ln(1 +

Figure 3 plots the estimation error for different degrees of misspecification where the context length is fixed to p = 15. The curves for various p {5, 10, 15} overlap, which verify the rate γ/β = p d2/NT predicted by Theorem 4.3 holds.

Figure 4 repeats the same plots for low-rank experiments where d = 15, r = 5 are fixed and p, N and T are varied as before. Good estimation of A is not straightforward as λ has to be appropriately tuned. Yet, we see that the group-nuclear norm regularized estimators found with gradient descent after tuning on regularization problem λ 10 1, 10 2, 10 3, 10 4, 10 5, 10 6, 10 7

and learning rate α 10 1, 10 2, 10 3 obtain improved estimation errors than non-regularized OLS estimator. Particularly, the sample efficiency benefits of the group-nuclear norm regularization are amplified in the low-data regime. We leave the analysis of group-nuclear norm regularization as a future work.

Published as a conference paper at ICLR 2025

0 10 20 30 40 50 β/γ

N = 1 N = 5 N = 10

Figure 1: Scaling of estimation error with respect to the ratio β/γ = NT/pd2 with the OLS estimator. The black dashed line marks the reference value γ/β.

0 10 20 30 40 50 β/γ

p = 5 p = 10 p = 15

0 10 20 30 40 50 β/γ

d = 5 d = 10 d = 15

0 20 40 60 β/γ

N = 1 N = 5 N = 10

Figure 2: Scaling of estimation error for different values of p, d and N with the OLS estimator. Recall that β = NT, γ = pd2 and β = β/ ln(1 +

N). The black dashed lines mark the reference values corresponding to γ/β and γ/ β.

0 10 20 30 40 50 β/γ

p = 5 p = 10 p = 15

Figure 3: Scaling of estimation error with respect to the ratio β/γ = NT/p d2 for different p = 5, 10, 15 with the OLS estimator. The black dashed line marks the reference value γ/β.

Published as a conference paper at ICLR 2025

0 5 10 15 20 25 β/γ

ˆASV D A 2 F

p = 5 p = 5 , λ = 0

p = 10 p = 10 , λ = 0

p = 15 p = 15 , λ = 0

Figure 4: Scaling of estimation error with respect to β/γ = NT pdr for different context windows p = 5, 10, 15 with the OLS estimator (λ = 0) and group-nuclear norm regularized estimators.