# learning_with_little_mixing__b704e9e9.pdf Learning with little mixing Ingvar Ziemann KTH Royal Institute of Technology ziemann@kth.se Stephen Tu Robotics at Google stephentu@google.com We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the leastsquares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the L2 and L2+ε norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional ℓ2(N) ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time. 1 Introduction Consider regression in the context of the time-series model: Yt = f (Xt) + Wt, t = 0, 1, 2, . . . . (1) Such models are ubiquitous in applications of machine learning, signal processing, econometrics, and control theory. In our setup, the learner is given access to T N+ pairs {(Xt, Yt)}T 1 t=0 drawn from the model (1), and is asked to output a hypothesis bf from a hypothesis class F which best approximates the (realizable) regression function f F in terms of square loss. In this work, we study the least-squares estimator (LSE). This procedure minimizes the empirical risk associated to the square loss over the class F. When each pair of observations (Xt, Yt) is drawn iid from some fixed distribution, this procedure is minimax optimal over a broad set of hypothesis classes [1 4]. However, much less is known about the optimal rate of convergence for the general time-series model (1), as correlations across time in the covariates {Xt} complicate the analysis. With this in mind, we seek to extend our understanding of the minimax optimality of the LSE for the time-series model (1). We show that for a broad class of function spaces and covariate processes, the effects of data dependency across time enter the LSE excess risk only as a higher order term, whereas the leading term in the excess risk remains order-wise identical to that in the iid setting. Hence, after a sufficiently long, but finite burn-in time, the LSE s excess risk scales as if all T samples are independent. This behavior applies to processes that exhibit correlations which decay slower than geometrically. We refer to this double phenomenon, where the mixing-time only enters as a burn-in time, and where the mixing requirement is mild, as learning with little mixing. Our result stands in contrast to a long line of work on learning from dependent data (see e.g., [5 14] and the references within), where the blocking technique [5] is used to create independence amongst 36th Conference on Neural Information Processing Systems (Neur IPS 2022). the dependent covariates, so that tools to analyze independent learning can be applied. While these aforementioned works differ in their specific setups, the main commonality is that the resulting dependent data rates mimic the corresponding independent rates, but with the caveat that the sample size is replaced by an effective sample size that is decreased in some way by the mixing-time, even after any necessary burn-in time. Interestingly, the results of Ziemann et al. [15] studying the LSE on the model (1) also suffer from such sample degradation, but do not rely on the blocking technique. The model (1) captures learning dynamical systems by setting Yt = Xt+1, so that the regression function f describes the dynamics of the state variable Xt. Recent progress in system identification shows that the lack of ergodicity does not necessarily degrade learning rates. Indeed, when the states evolve as a linear dynamical system (i.e., the function f is linear), learning rates are not deflated by any mixing times, and match existing rates for iid linear regression [16 19]. Kowshik et al. [20], Gao and Raskutti [21] extend results of this flavor to parameter recovery of dynamics driven by a generalized linear model. The extent to which this phenomenon less ergodicity not impeding learning generalizes beyond linear and generalized linear models is a key motivation for our work. Contributions We consider the realizable setting, where f is assumed to be contained in a known function space F. Our results rest on two assumptions regarding both the covariate process {Xt} and the function space F. The first assumption posits that the process {Xt} exhibits some mild form of ergodicity (that is significantly weaker than the typical geometric ergodicity assumption). The second assumption is a hypercontractivity condition that holds uniformly in F along the trajectory {Xt}, extending contractivity assumptions for iid learning [3] to dependent processes. Informally, our main result (Theorem 4.1, presented in Section 4), shows that under these two assumptions, letting comp(F) denote some (inverse) measure of complexity of F, the LSE bf satisfies: E bf f 2 L2 dimensional factors σ2 W T comp(F) + higher order o(1/T comp(F)) terms. (2) The first term in (2) matches existing LSE risk bounds for iid learning order-wise, and most importantly, does not include any dependence on the mixing-time of the process. Indeed, all mixing-time dependencies enter only in the higher order term. Since this term scales as o(1/T comp(F)), it becomes negligible after a finite burn-in time. This captures the crux of our results: on a broad class of problems, given enough data, the LSE applied to time-series model (1) behaves as if all samples are independent. Section 5 provides several examples for which the trajectory hypercontractivity assumption holds. When the covariate process {Xt} is generated by a finite-state irreducible and aperiodic Markov chain, then any function class F satisfies the requisite condition. More broadly, the condition is satisfied for any bounded function classes for which the L2 and L2+ε norms (along trajectories) are equivalent. Next, we show that many infinite dimensional function spaces based on ℓ2(N) ellipsoids satisfy our hypercontractivity condition, demonstrating that our results are not inherently limited to finite-dimensional hypothesis classes. To demonstrate the broad applicability of our framework, Section 6 instantiates our main result on two system identification problems that have received recent attention in the literature: linear dynamical systems (LDS), and systems with generalized linear model (GLM) transitions. For stable LDS, after a polynomial burn-in time, we recover an excess risk bound that matches the iid rate. A more general form of this result was recently established by Tu et al. [19]. For stable GLMs, also after a polynomial burn-in time, we obtain the first excess risk bound for this problem which matches the iid rate, up to logarithmic factors in various problem constants including the mixing-time. In both of these settings, our excess risk bounds also yield nearly optimal rates for parameter recovery, matching known results for LDS [16] and GLMs [20] in the stable case. In Appendix A, we show experimentally, using the stable GLM model, that the trends predicted by our theory are indeed realized in practice. 2 Related work While regression is a fundamental problem studied across many disciplines, our work draws its main inspiration from Mendelson [3] and Simchowitz et al. [16]. Mendelson [3] shows that for nonparametric iid regression, only minimal assumptions are required for one-sided isometry, and thus learning. Simchowitz et al. [16] build on this intuition and provide mixing-free rates for linear regression over trajectories generated by a linear dynamical system. We continue this trend, by leveraging one-sided isometry to show that mixing only enters as a higher order term in the rates of the nonparametric LSE. More broadly, the technical developments we follow synthesize techniques from two lines of work: nonparametric regression with iid data, and learning from dependent data. Nonparametric regression with iid data. Beyond the seminal work of Mendelson [3], the works [2, 22, 23] all study iid regression with square loss under various moment equivalence conditions. In addition to moment equivalence, we build on the notion of offset Rademacher complexity defined by [24] in the context of iid regression. Indeed, we show that a martingale analogue of the offset complexity (described in [15]) characterizes the LSE rate in (1). Learning from dependent data. As discussed previously, many existing results for learning from dependent data reduce the problem to independent learning via the blocking technique [5], at the expense of sample complexity deflation by the mixing-time. Nagaraj et al. [25] prove a lower bound for linear regression stating that in a worst case agnostic model, this deflation is unavoidable. Moreover, if the linear regression problem is realizable, Nagaraj et al. [25] provide upper and lower bounds showing that the mixing-time only affects the burn-in time, but not the final risk. We note that their upper bound is an algorithmic result that holds only for a specific modification of SGD. Our work can be interpreted as an upper bound in the more general nonparametric setting, where we put forth sufficient conditions to recover the iid rate after a burn-in time. Our result is algorithm agnostic and directly applies to the empirical risk minimizer. Ziemann et al. [15] also study the model (1), and provide an information-theoretic analysis of the nonparametric LSE. However, their approach fundamentally reduces to showing two-sided concentration something our work evades and therefore their bounds incur worst case dependency on the mixing-time. Roy et al. [13] extend the results from Mendelson [3] to the dependent data setting. While following Mendelson s argument allows their results to handle non-realizability and heavy-tailed noise, their proof ultimately still relies on two-sided concentration for both the version space and the noise interaction . Hence, their rates end up degrading for slower mixing processes. We note that this is actually expected in the non-realizable setting in light of the lower bounds in Nagaraj et al. [25]. The measure of dependencies we use for the process {Xt} is due to Samson [26]. Recently, Dagan et al. [27] use a similar measure to study learning when the covariates have no obvious sequential ordering (e.g., a graph structure or Ising model). However, our results are not directly comparable, other than noting that their risk bounds degrade as the measure of correlation increases. Results in linear system identification show that lack of ergodicity does not degrade parameter recovery rates [16 19, 28, 29]. Beyond linear system identification, Kowshik et al. [20], Gao and Raskutti [21], Sattar and Oymak [30], Foster et al. [31] prove parameter recovery bounds for dynamical systems driven by a generalized linear model (GLM) transition. Most relevant are Kowshik et al. [20] and Gao and Raskutti [21], who again show that the lack of ergodicity does not hamper rates. Indeed, Gao and Raskutti [21] even manage to do so in a semiparametric setting with an unknown link function. As mentioned previously, our main result instantiated to these problems in the stable case matches existing excess risk and parameter recovery bounds for linear system identification, and actually provides the sharpest known excess risk bound for the GLM setting (when the link function is known). A more detailed comparison to existing LDS results is given in Appendix I.4, and to existing GLM results in Appendix J.1. 3 Problem formulation The time-series (1) evolves on two subsets of Euclidean space, X Rd X and Y Rd Y, with Xt X and Yt, Wt Y. Expectation (resp. probability) with respect to all the randomness of the underlying probability space is denoted by E (resp. P). The Euclidean norm on Rd is denoted 2, and the unit sphere in Rd is denoted Sd 1. For a matrix M Rd1 d2, M op denotes the largest singular value, σmin(M) the smallest non-zero singular value, and cond(M) = M op/σmin(M) the condition number. When the matrix M is symmetric, λmin(M) will be used to denote its minimum eigenvalue. We assume there exists a filtration {Ft} such that (a) {Wt} is a square integrable martingale difference sequence (MDS) with respect to this filtration, and (b) {Xt} is adapted to {Ft 1}. Further tail conditions on this MDS will be imposed as necessary later on. Let F be a hypothesis space of functions mapping Rd X to Rd Y. We assume that the true regression function is an element of F (i.e., f F), and that F is known to the learner. Given two compatible function spaces F1, F2, let F1 F2 {f1 f2 | f1 F1, f2 F2}. A key quantity in our analysis is the shifted function class F F {f }. Our results will be stated under the assumption that F is star-shaped,1 although we will see that this is not too restrictive. For any function f : X Rd Y, we define f supx X f 2. A function f is B-bounded if f B. Similarly, a hypothesis class is B-bounded if each of its elements is B-bounded. For a bounded class F and resolution ε > 0, the quantity N (F, ε) denotes the size of the minimal ε-cover of F (contained in F) in the -norm. We fix a T N+, indicating the number of labeled observations {(Xt, Yt)}T 1 t=0 from the time-series (1) that are available to the learner. The joint distribution of X0:T 1 (X0, . . . , XT 1) is denoted PX. For p 1, we endow F F with Lp(PX) norms: f g p Lp 1 T PT 1 t=0 E f(Xt) g(Xt) p 2, where expectation is taken with respect to PX. We will mostly be interested in L2(PX), hereafter often just referred to as L2. This is the L2 space associated to the law of the uniform mixture over X0:T 1 and thus, for iid data, coincides with the standard L2 space often considered in iid regression. For a radius r > 0, we let B(r) denote the closed ball of F with radius r in L2, and we let B(r) denote its boundary: B(r) f F f 2 L2 r2 and B(r) f F f 2 L2 = r2 . The learning task is to produce an estimate bf of f , which renders the excess risk bf f 2 L2 as small as possible. We emphasize that bf f 2 L2 = 1 T PT 1 t=0 E X0:T 1 bf( Xt) f ( Xt) 2 2 where X0:T 1 is a fresh, statistically independent, sample with the same law PX as X0:T 1. Namely, bf f 2 L2 is a random quantity, still depending on the internal randomness of the learner and that of the sample X0:T 1 used to generate bf. We study the performance of the least-squares estimator (LSE) defined as bf argminf F n 1 T PT 1 t=0 Yt f(Xt) 2 2 o , and measure the excess risk E bf f 2 L2. This section presents our main result. We first detail the definitions behind our main assumptions in Section 4.1. The main result and two corollaries are then presented in Section 4.2. 4.1 Hypercontractivity and the dependency matrix Hypercontractivity. We first state our main trajectory hypercontractivity condition, which we will use to establish lower isometry. The following definition is heavily inspired by recent work on learning without concentration [3, 23]. Definition 4.1 (Trajectory (C, α)-hypercontractivity). Fix constants C > 0 and α [1, 2]. We say that the tuple (F, PX) satisfies the trajectory (C, α)-hypercontractivity condition if t=0 f(Xt) 4 2 t=0 f(Xt) 2 2 for all f F. (3) Here, the expectation is with respect to PX, the joint law of X0:T 1. Condition (3) interpolates between boundedness and small-ball behavior. Indeed, if the class F is B-bounded, then it satisfies trajectory (B2, 1)-hypercontractivity trivially. On the other hand, for α = 2, (3) asks that f L4 C1/4 f L2 for trajectory-wise Lp-norms; by the Paley-Zygmund inequality, this implies that a small-ball condition holds. Moreover, if for some ε (0, 2), the trajectory-wise L2 and L2+ε norms are equivalent on F, then Proposition 5.2 (Section 5) shows that the condition holds for a nontrivial α = 1 + ε/2 (1, 2). More examples are given in Section 5. Our main results assume that (F , PX) (or a particular subset of F ) satisfies the trajectory (C, α)- hypercontractivity condition with α > 1, which we refer to as the hypercontractive regime. The condition α > 1 is required in our analysis for the lower order excess risk term to not depend on 1A function class F is star-shaped if for any α [0, 1], f F implies αf F. the mixing-time. Our results instantiated for the α = 1 case directly correspond to existing work by Ziemann et al. [15], and exhibit a lower order term that depends on the mixing-time. Ergodicity via the dependency matrix. We now state the main definition we use to measure the stochastic dependency of a process. Recall that for two measures µ, ν on the same measurable space with σ-algebra A, the total-variation norm is defined as µ ν TV sup A A |µ(A) ν(A)|. Definition 4.2 (Dependency matrix, Samson [26, Section 2]). The dependency matrix of a process {Zt}T 1 t=0 with distribution PZ is the (upper-triangular) matrix Γdep(PZ) = {Γij}T 1 i,j=0 RT T defined as follows. Let Z0:i denote the σ-algebra generated by {Zt}i t=0. For indices i < j, let 2 sup A Z0:i PZj:T 1( | A) PZj:T 1 TV. (4) For the remaining indices i j, let Γii = 1 and Γij = 0 when i > j (below the diagonal). Given the dependency matrix from Definition 4.2, we measure the dependency of the process PX by the quantity Γdep(PX) op. Notice that this quantity always satisfies 1 Γdep(PX) op T. The lower bound indicates that the process PX is independent across time. The upper bound indicates that the process is fully dependent, e.g., Xt+1 = Xt for all t N. Our results apply to cases where Γdep(PX) 2 op grows sub-linearly in T the exact requirement depends on the specific function class F. If the process {Xt} is geometrically ϕ-mixing, then Γdep(PX) 2 op is upper bounded by a constant that depends on the mixing-time of the process, and is independent of T [26, Section 2]. Other examples, such as processes satisfying Doeblin s condition [32], are given in Samson [26, Section 2]. When {Xt} is a stationary time-homogenous Markov chain with invariant distribution π, the coefficients Γij simplify to Γ2 ij = 2 sup A X PXj i( | A) π TV for indices j > i, where X is the σ-algebra generated by X π (cf. Proposition F.1). Hence, the requirement Γdep(PX) 2 op T β for β (0, 1) then corresponds to sup A X PXt( | A) π TV 1/t1 β for t N+. Jarner and Roberts [33] give various examples and conditions to check polynomial convergence rates for Markov chains. We also provide further means to verify Γdep(PX) op = O(1) in Appendix F and Appendix G. 4.2 Learning with little mixing A key quantity appearing in our bounds is a martingale variant of the notion of Gaussian complexity. Definition 4.3 (Martingale offset complexity, cf. Liang et al. [24], Ziemann et al. [15]). For the regression problem (1), the martingale offset complexity of a function space F is given by: MT (F) sup f F t=0 4 Wt, f(Xt) f(Xt) 2 2 Recall that F = F {f } is the centered function class and B(r) = {f F | f L2 = r} is the boundary of the L2 ball B(r). The following theorem is the main result of this paper. Theorem 4.1. Fix B > 0, C : (0, B] R+, α [1, 2], and r (0, B]. Suppose that F is star-shaped and B-bounded. Let Fr F be a r/ 8-net of B(r) in the supremum norm , and suppose that (Fr, PX) satisfies the trajectory (C(r), α)-hypercontractivity condition (cf. Definition 4.1). Then: E bf f 2 L2 8EMT (F ) + r2 + B2|Fr| exp Tr4 2α 8C(r) Γdep(PX) 2op The assumption that F is star-shaped in Theorem 4.1 is not particularly restrictive. Indeed, Theorem 4.1 still holds if we replace F by its star-hull star(F ) {γf | γ [0, 1], f F }, and B(r) with the boundary of the r-sphere of star(F ). In this case, we note that (a) the metric entropy of star(F ) is well controlled by the metric entropy of F ,2 and (b) the trajectory hypercontractivity conditions over a class F and its star-hull star(F ) are equivalent. Hence, at least whenever we are 2Specifically, log N (star(F ), ε) log(2B/ε) + log N (F , ε/2) [34, Lemma 4.5]. able to verify hypercontractivity over the entire class F , little generality is lost. While most of our examples are star-shaped, we will need the observations above when we work with generalized linear model dynamics in Section 6.2. To understand Theorem 4.1, we will proceed in a series of steps. We first need to understand the martingale complexity term EMT (F ). Since F is B-bounded, if one further imposes the tail conditions that the noise process {Wt} is a σ2 W -sub-Gaussian MDS,3 a chaining argument detailed in Ziemann et al. [15, Lemma 4] shows that: EMT (F ) inf γ>0,δ [0,γ] ( σ2 W log N (F , γ) log N (F , s)ds + γ2 ) (7) In particular, this bound only depends on F and is independent of Γdep(PX) 2 op. Furthermore, (7) coincides with the corresponding risk bound for the LSE with iid covariates [24]. Given that EMT (F ) corresponds to the rate of learning from T iid covariates, the form of (6) suggests that we choose r2 EMT (F ), so that the dominant term in (6) is equal to EMT (F ) in scale. Given that r has been set, the only remaining degree of freedom in (6) is to set T large enough (the burn-in time) so that the third term is dominated by r2. Thus, it is this third term in (6) that captures the interplay between the function class F and the dependency measure Γdep(PX) op. We will now consider specific examples to illustrate how the burn-in time can be set. Our first example supposes that (a) F satisfies the trajectory (C, 2)-hypercontractivity condition, and that (b) F is nonparametric, but not too large: p > 0, q (0, 2) s.t. log N (F , ε) p 1 q for all ε (0, 1). (8) Covering numbers of the form (8) are typical for sufficiently smooth function classes, e.g. the space of k-times continuously differentiable functions mapping X Y for any k d X/2 [35]. If condition (8) holds and the noise process {Wt} is a sub-Gaussian MDS, inequality (7) yields EMT (F ) T 2 2+q , and hence we want to set r2 = o(T 2 2+q ). Carrying out this program yields the following corollary. Corollary 4.1. Fix B 1, C > 0, p > 0, q (0, 2), and γ (0, q 2+q). Suppose that F is star-shaped, B-bounded, satisfies (8), and (F , PX) satisfies the trajectory (C, 2)-hypercontractivity condition. Suppose that T satisfies: ( 8(32p + 1)C Γdep(PX) 2 op 1 2( 2 2+q +γ) , 4 log B 8 1 q 2( 2 2+q +γ) ) Then, we have that: E bf f 2 L2 8EMT (F ) + 2T ( 2 2+q +γ). (10) The rate (10) of Corollary 4.1 highlights the fact that the first order term of the excess risk is bounded by the martingale offset complexity EMT (F ). This behavior arises since the dependency matrix Γdep(PX) only appears as the burn-in requirement (9). Here, the value of q constrains how fast Γdep(PX) 2 op is allowed to grow. In particular, condition (9) requires that Γdep(PX) 2 op = o(T 1 q 2+q ), otherwise the burn-in condition cannot be satisfied for any γ (0, q 2+q). In our next example, we consider both a variable hypercontractivity parameter C(r) that varies with the covering radius r, and also allow α (1, 2] to vary. Since our focus is on the interaction of the parameters in the hypercontractivity definition, we will consider smaller function classes with logarithmic metric entropy. This includes parametric classes but also bounded subsets of certain reproducing kernel Hilbert spaces. For such function spaces, one expects EMT (F ) O(T 1), and hence we set r2 = o(T 1). 3That is, for any u Sd Y 1, λ R, and t N, we have E[exp(λ Wt, u ) | Ft 1] exp(λ2σ2 W /2). Corollary 4.2. Fix B 1, C : (0, 1] R+, α (1, 2], b1 [0, 1), b2 [0, 2), γ (0, 1), and p, q 1. Suppose that F is star-shaped and B-bounded, and that for every r (0, 1), there exists a r-net Fr of B(r) in the -norm such that (a) log |Fr| p logq 1 r and (b) (Fr, PX) satisfies the trajectory (C(r), α)-hypercontractivity condition. Next, suppose the growth conditions hold: Γdep(PX) 2 op T b1, C(r) (1/r)b2 r (0, 1). As long as the constants α, b1, b2, and γ satisfy ψ := 1 b1 (1+γ)(4 2α+b2) 2 > 0, then for any p, log B, q ψ , we have: E bf f 2 L2 8EMT (F ) + 2 1 Here polyq/ψ denotes a polynomial of degree O(q/ψ) in its arguments the exact expression is given in the proof. Proposition 5.4 in Section 5 gives an example of an ℓ2(N) ellipsoid which satisfies the assumptions in Corollary 4.2. Corollary 4.2 illustrates the interplay between the function class F , the data dependence of the covariate process {Xt}, and the hypercontractivity constant α. Let us consider a few cases. First, let us suppose that the process {Xt} is geometrically ergodic and that C(r) is a constant, so that we can set b1 and b2 arbitrarily close to zero (at the expense of a longer burn-in time). Then, the ψ > 0 condition simplifies to α > 2 1 1+γ . This illustrates that in the hypercontractivity regime (α > 1), there exists a valid setting of (b1, b2, γ) that satisfies ψ > 0. Next, let us consider the case where C(r) is again a constant, but {Xt} is not geometrically ergodic. Setting b2 and γ arbitrarily close to zero, we have ψ > 0 simplifies to b1 < α 1. Compared to Corollary 4.1, we see that in the case when α = 2, the parametric nature of F allows the dependency requirement to be less strict: o(T) in the parametric case versus o(T 1 q 2+q ) in the nonparametric case. We conclude with noting that when α = 1, it is not possible to remove the dependence on Γdep(PX) 2 op in the lowest order term. In this situation, our results recover existing risk bounds from Ziemann et al. [15] see Appendix H for a discussion. 5 Examples of trajectory hypercontractivity In this section, we detail a few examples of trajectory hypercontractivity. Let us begin by considering the simplest possible example: a finite hypothesis class. Let |F| < . Define for any fixed f F the constant cf E h 1 T PT 1 t=0 f(Xt) 4 2 i / E h 1 T PT 1 t=0 f(Xt) 2 2 i 2 , where the ratio 0/0 is taken to be 1. Then the class F is trajectory (maxf F cf, 2)-hypercontractive. Similarly, processes evolving on a finite state space can also be verified to be hypercontractive. Proposition 5.1. Fix a µ > 0. Let {µt}T 1 t=0 denote the marginal distributions of PX. Suppose that the µt s all share a common support of a finite set of atoms {ψ1, . . . , ψK} Rd X, and that min0 t T 1 min1 k K µt(ψk) µ. For any class of functions F mapping {ψ1, . . . , ψK} Rd Y, we have that F satisfies the trajectory (1/µ, 2)-hypercontractivity condition. We remark that when PX is an aperiodic and irreducible Markov chain over a finite state space, the condition µ > 0 is always valid even as T [36]. In this case, our findings are related to Wolfer and Kontorovich [37, Theorem 3.1], who show that in the high accuracy regime (i.e., after a burn-in time), the minimax rate of estimating the transition probabilities of such a chain is not affected by the mixing time (in their case the pseudo-spectral gap). The examples considered thus far rely on the fact that under a certain degree of finiteness, the fourth and second moment can be made uniformly equivalent. The next proposition relaxes this assumption. Namely, if for some ε (0, 2] the L2 and L2+ε norms are equivalent on a bounded class F, this class then satisfies a nontrivial hypercontractivity constant, α > 1 (cf. Mendelson [23]). Proposition 5.2. Fix ε (0, 2] and c > 0. Suppose that F is B-bounded and that f L2+ε c f L2 for all f F. Then F is trajectory (B2 εc2+ε, 1 + ε/2)-hypercontractive. Next, we show that for processes {Xt} which converge fast enough to a stationary distribution, it suffices to verify the hypercontractivity condition only over the stationary distribution. This mimics existing results in iid learning, where hypercontractivity is assumed over the covariate distribution [3, 24]. We first recall the definition of the χ2 divergence between two measures. Let µ and ν be two measures over the same probability space, and suppose that µ is absolutely continuous w.r.t. ν. The χ2(µ, ν) divergence is defined as χ2(µ, ν) Eν dµ dν 1 2 . Proposition 5.3. Fix positive r, Cχ2, CTV, and C8 2. Suppose that the process {Xt} has a stationary distribution π. Let {µt} denote the marginal distributions of {Xt}, and suppose that the marginals {µt} are absolutely continuous w.r.t. π. Assume the process is ergodic in the sense that: sup t N χ2(µt, π) Cχ2, 1 T t=0 µt π TV CTVr2. (11) Suppose also that for all f F : Eπ f(X) 8 2 C8 2(Eπ f(X) 2 2)4. Then the set B(r) satisfies (C, 2)-trajectory hypercontractivity with C = (1 + p Cχ2) C8 2(1 + CTVB2)2. We further discuss the ergodicity condition (11) in Appendix E.3.1. Ellipsoids in ℓ2(N). Given that equivalence of norms is typically a finite-dimensional phenomenon, one may wonder whether examples of hypercontractivity exist in an infinite-dimensional setting. Here we show that such examples are actually rather abundant. The key is that hypercontractivity need only be satisfied on an ε-cover of F . As discussed above, every finite hypothesis class (and thus every finite cover) is automatically (C, 2)-hypercontractive for some C > 0. The issue is to ensure that this constant does not grow too fast as one refines the cover. The next result shows that the growth can be controlled for ℓ2(N) ellipsoids of orthogonal expansions. By Mercer s theorem, these ellipsoids correspond to unit balls in reproducing kernel Hilbert spaces [4, Corollary 12.26]. Proposition 5.4. Fix positive constants β, B, K, and q. Fix a base measure λ on X and suppose that {ϕn}n N+ is an orthonormal system in L2(λ) satisfying ϕn Bnq, n N. Suppose µj e 2βj and define the ellipsoid: P n f = P j=1 θjϕj P j=1 θ2 j µj 1 o . Fix ε > 0, and let mε denote the smallest positive integer solution to m 2 β log 8B βε subject to m log m q β . Let P P be an arbitrary subset. There exists an ε-cover Pε of P in the -norm satisfying log |Pε| mε log 1 + 8Bmq ε ε . Further, let {µt}T 1 t=0 be the marginal distributions of PX and suppose that max0 t T 1 max n dµt o K. Then, as long as ε inff P f L2(PX), (Pε, PX) is trajectory (Cε, 2)-hypercontractive with Cε = (1 + 7K3B4m4q+2 ε ). Proposition 5.4 states that when F P, then ( B(r), PX) is (C(r), 2)-hypercontractive where C(r) = Cr only grows poly-logarithmically in 1/r and thus verifies the assumptions of Corollary 4.2. 6 System identification in parametric classes To demonstrate the sharpness of our main result, we instantiate Theorem 4.1 on two parametric system identification problems which have received recent attention in the literature: linear dynamical systems (LDS) and generalized linear model (GLM) dynamics. 6.1 Linear dynamical systems Consider the setting where the process {Xt}t 0 is described by a linear dynamical system: Xt+1 = A Xt + HVt, X0 = HV0, Vt Rd V, Vt N(0, I), Vt Vt t = t . (12) In this setting, the system identification problem is to recover the dynamics matrix A from {Xt}T 1 t=0 evolving according to (12). We derive rates for recovering A by first deriving an excess risk bound on the least-squares estimator via Theorem 4.1, and then converting the risk bound to a parameter error bound. Since Theorem 4.1 relies on the process being ergodic, we consider the case when A is stable. We start by stating a few standard definitions. Definition 6.1. Fix a k {1, . . . , d X}. The pair (A, H) is k-step controllable if rank H AH A2H . . . Ak 1H = d X. For t N, let the t-step controllability gramian be defined as Γt Pt k=0 Ak HHT(Ak)T. Since the noise in (12) serves as the control in this setting, the controllability gramian also coincides with the covariance at time t, i.e., E[Xt XT t ] = Γt. Definition 6.2. Fix a τ 1 and ρ (0, 1). A matrix A is called (τ, ρ)-stable if for all k N we have Ak op τρk. With these definitions in place, we now state our result for linear dynamical system. Theorem 6.1. Suppose that the matrix A in (12) is (τ, ρ)-stable (cf. Definition 6.2), and that the pair (A , H) is κ-step controllable (cf. Definition 6.1). Suppose also that A F B for some B 1. Consider the linear hypothesis class and true regression function: F {f(x) = Ax | A Rd X d X, A F B}, f (x) = A x. (13) Suppose that model (1) follows the process described in (12) with Yt = Xt+1. There exists T0 such that the LSE with hypothesis class F achieves for all T T0: E bf f 2 L2 8EMT (F ) + 4 H 2 opd2 X T . (14) Furthermore, T0 satisfies for a universal positive constant c0: T0 = c0 τ 4 H 4 opd2 X (1 ρ)2λmin(Γκ 1)2 κ2 1 (1 ρ)2 polylog B, d X, τ, H op, 1 λmin(Γκ 1), 1 1 ρ, . Appendix I.4 contains a more detailed discussion about the results in Theorem 6.1. There, we argue that the term EMT (F ) in (14) is proportional to H 2 opd2 X/T implying that the final rate is proportional to the minimax rate, i.e., E bf f 2 L2 H 2 opd2 X/T. 6.2 Generalized linear models We next consider the following non-linear dynamical system: Xt+1 = σ(A Xt) + HVt, X0 = HV0, Vt Rd X, Vt N(0, I), Vt Vt t = t . (16) Here, A Rd X d X is the dynamics matrix and σ : Rd X Rd X is a coordinate wise link function. The notation σ will also be overloaded to refer to the individual coordinate function mapping R R. We study the system identification problem where the link function σ is assumed to be known, but the dynamics matrix A is unknown and to be recovered from {Xt}T 1 t=0 . We will apply Theorem 4.1 to derive a nearly optimal excess risk bound for the LSE on this problem in the stable case. We start by stating a few assumptions that are again standard in the literature [20, 31]. Assumption 6.1. Suppose that A , H, and σ from the GLM process (16) satisfy: 1. (One-step controllability). The matrix H Rd X d X is full rank. 2. (Link function regularity). The link function σ : R R is 1-Lipschitz, satisfies ϕ(0) = 0, and there exists a ζ (0, 1] such that |σ(x) σ(y)| ζ|x y| for all x, y R. 3. (Lyapunov stability). There exists a positive definite diagonal matrix P Rd X d X satisfying P I and a ρ (0, 1) such that AT P A ρP . With our assumptions in place, we are ready to instantiate our main result on the process (16). Theorem 6.2. Suppose the model (1) follows the process described in (16) with Yt = Xt+1. Assume that the process (16) satisfies Assumption 6.1. Fix a B 1, and suppose that A F B. Consider the hypothesis class and true regression function: F {f(x) = σ(Ax) | A Rd X d X, A F B}, f (x) = σ(A x). (17) There exists a T0 and a universal positive constant c0 such that the LSE with hypothesis class F achieves for all T T0: E bf f 2 L2 c0 H 2 opd2 X T log max T, B, d X, P op, H op, 1 1 ρ Furthermore, for a universal constant c1 > 0, we may choose T0 that satisfies: T0 = c1 max P 2 opcond(H)4d4 X ζ4(1 ρ)6 , 1 polylog B, d X, P op, cond(H), 1 Further discussion regarding Assumption 6.1 and Theorem 6.2, including a more detailed comparison with existing results, can be found in Appendix J.3. 7 Conclusion We developed a framework for showing when the mixing-time of the covariates plays a relatively small role in the rate of convergence of the least-squares estimator. In many situations, after a finite burn-in time, this learning procedure exhibits an excess risk that scales as if all the samples were independent (Theorem 4.1). As a byproduct of our framework, by instantiating our results to system identification for dynamics with generalized linear model transitions (Section 6.2), we derived the sharpest known excess risk rate for this problem; our rates are nearly minimax optimal after only a polynomial burn-in time. To arrive at Theorem 4.1, we leveraged insights from Mendelson [3] via a one-sided concentration inequality (Theorem B.2). As mentioned in Section 4.1, hypercontractivity is closely related to the small-ball condition [3]. Such conditions can be understood as quantitative identifiability conditions by providing control of the version space (cf. Mendelson [3]). Given that identifiability conditions also play a key role in linear system identification a setting in which a similar phenomenon as studied here had already been reported this suggests an interesting direction for future work: are such conditions actually necessary for learning with little mixing? Acknowledgements We thank Dheeraj Nagaraj for helpful discussions regarding the results in Nagaraj et al. [25], and Abhishek Roy for clarifying the results in Roy et al. [13]. We also thank Mahdi Soltanolkotabi for an informative discussion about the computational aspects of empirical risk minimization for generalized linear models under the square loss. [1] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. [2] Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. ar Xiv preprint ar Xiv:1305.4825, 2013. [3] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25 39. PMLR, 2014. [4] Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. [5] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94 116, 1994. [6] Dharmendra S. Modha and Elias Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42(6):2133 2145, 1996. [7] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. Advances in Neural Information Processing Systems, 21, 2008. [8] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. Advances in Neural Information Processing Systems, 22, 2009. [9] Bin Zou, Luoqing Li, and Zongben Xu. The generalization performance of erm algorithm with strongly mixing observations. Machine Learning, 75:275 295, 2009. [10] Alekh Agarwal and John C. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573 587, 2012. [11] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549 1578, 2012. [12] Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93 117, 2017. [13] Abhishek Roy, Krishnakumar Balasubramanian, and Murat A. Erdogdu. On empirical risk minimization with dependent and heavy-tailed data. In Advances in Neural Information Processing Systems, volume 34, 2021. [14] Alessio Sancetta. Estimation in reproducing kernel hilbert spaces with dependent data. IEEE Transactions on Information Theory, 67(3):1782 1795, 2021. [15] Ingvar Ziemann, Henrik Sandberg, and Nikolai Matni. Single trajectory nonparametric learning of nonlinear dynamics. ar Xiv preprint ar Xiv:2202.08311, 2022. [16] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference On Learning Theory, pages 439 473. PMLR, 2018. [17] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. Automatica, 96:342 353, 2018. [18] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pages 5610 5618. PMLR, 2019. [19] Stephen Tu, Roy Frostig, and Mahdi Soltanolkotabi. Learning from many trajectories. ar Xiv preprint ar Xiv:2203.17193, 2022. [20] Suhas Kowshik, Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. Near-optimal offline and streaming algorithms for learning non-linear dynamical systems. Advances in Neural Information Processing Systems, 34, 2021. [21] Yue Gao and Garvesh Raskutti. Improved prediction and network estimation using the monotone single index multi-variate autoregressive model. ar Xiv preprint ar Xiv:2106.14630, 2021. [22] Jean-Yves Audibert and Olivier Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766 2794, 2011. [23] Shahar Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3):641 674, 2017. [24] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset rademacher complexity. In Conference on Learning Theory, pages 1260 1285. PMLR, 2015. [25] Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. Advances in Neural Information Processing Systems, 33, 2020. [26] Paul-Marie Samson. Concentration of measure inequalities for markov chains and ϕ-mixing processes. The Annals of Probability, 28(1):416 461, 2000. [27] Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. Learning from weakly dependent data under dobrushin s condition. In Conference on Learning Theory, pages 914 928. PMLR, 2019. [28] Anders Rantzer. Concentration bounds for single parameter adaptive control. In 2018 Annual American Control Conference (ACC), pages 1862 1866, 2018. [29] Yassir Jedra and Alexandre Proutiere. Finite-time identification of stable linear systems optimality of the least-squares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 996 1001. IEEE, 2020. [30] Yahya Sattar and Samet Oymak. Non-asymptotic and accurate learning of nonlinear dynamical systems. ar Xiv preprint ar Xiv:2002.08538, 2020. [31] Dylan Foster, Tuhin Sarkar, and Alexander Rakhlin. Learning nonlinear dynamical systems from a single trajectory. In Learning for Dynamics and Control, pages 851 861. PMLR, 2020. [32] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, 1993. [33] Søren F. Jarner and Gareth O. Roberts. Polynomial convergence rates of markov chains. The Annals of Applied Probability, 12(1):224 247, 2002. [34] Shahar Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977 1991, 2002. [35] Vladimir M. Tikhomirov. ϵ-Entropy and ϵ-Capacity of Sets In Functional Spaces, pages 86 170. Springer Netherlands, 1993. [36] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society, 2008. [37] Geoffrey Wolfer and Aryeh Kontorovich. Statistical estimation of ergodic markov chain kernel over discrete state space. Bernoulli, 27(1):532 553, 2021. [38] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. [39] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. [40] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and Geometry of Markov Diffusion Operators. Springer, 2014. [41] Jan R. Magnus. The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201 210, 1978. [42] Minwoo Chae and Stephen G. Walker. Wasserstein upper bounds of the total variation for smooth densities. Statistics & Probability Letters, 163:108771, 2020. [43] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In Proceedings of the 38th International Conference on Machine Learning, pages 2826 2836. PMLR, 2021. [44] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. ar Xiv preprint ar Xiv:1102.2670, 2011. [45] Anastasios Tsiamis and George J. Pappas. Linear systems can be hard to learn. ar Xiv preprint ar Xiv:2104.01120, 2021. [46] Randal Douc, Eric Moulines, and Jeffrey S. Rosenthal. Quantitative bounds on convergence of time-inhomogeneous markov chains. The Annals of Applied Probability, 14(4):1643 1665, 2004. [47] Martin Hairer and Jonathan C. Mattingly. Yet another look at harris ergodic theorem for markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, 2011.