# Risk Bounds for Reservoir Computing

Journal of Machine Learning Research 21 (2020) 1-61. Submitted 10/19; Published 11/20.

Lukas Gonon (Gonon@math.lmu.de), Mathematics Institute, Ludwig-Maximilians-Universität München, Germany

Lyudmila Grigoryeva (Lyudmila.Grigoryeva@uni-konstanz.de), Department of Mathematics and Statistics, Graduate School of Decision Sciences, Universität Konstanz, Germany

Juan-Pablo Ortega (Juan-Pablo.Ortega@unisg.ch), Faculty of Mathematics and Statistics, Universität Sankt Gallen, Switzerland, and Centre National de la Recherche Scientifique (CNRS), France

Editor: Moritz Hardt

© 2020 Lukas Gonon, Lyudmila Grigoryeva, and Juan-Pablo Ortega. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-902.html.

Abstract

We analyze the practices of reservoir computing in the framework of statistical learning theory. In particular, we derive finite sample upper bounds for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure. Non-asymptotic bounds are explicitly written down in terms of the multivariate Rademacher complexities of the reservoir systems and the weak dependence structure of the signals that are being handled. This makes it possible, in particular, to determine the minimal number of observations needed in order to guarantee a prescribed estimation accuracy with high probability for a given reservoir family. At the same time, the asymptotic behavior of the devised bounds guarantees the consistency of the empirical risk minimization procedure for various hypothesis classes of reservoir functionals.

Keywords: Reservoir computing, RC, echo state networks, ESN, state affine systems, SAS, random reservoirs, Rademacher complexity, weak dependence, empirical risk minimization, PAC bounds, risk bounds.

1. Introduction

Reservoir computing (RC) is a well-established paradigm in the supervised learning of dynamic processes, which exploits the ability of specific families of semi-randomly generated state-space systems to solve computational and signal treatment tasks, both in deterministic and in stochastic setups. In recent years, both researchers and practitioners have been paying increasing attention to reservoir systems and their applications in learning. The main reasons behind this growing interest are threefold. First, training strategies in reservoir computing are easy to implement, as they simply consist in estimating the weights of memoryless static readouts, while the internal weights of the reservoir network are randomly created; this feature is closely linked to ideas originating in biology and the neurosciences in relation with the design of brain-inspired computing algorithms. Second, there is an important interplay between reservoir systems and the theory of dynamical systems, recurrent neural networks, and nonlinear discrete-time state-space systems, which makes the collection of tools available for their analysis very rich and explains in part why RC appears in the literature assimilated to other denominations, such as Liquid State Machines (see Maass and Sontag, 2000; Maass et al., 2002; Natschläger et al., 2002; Maass et al., 2004, 2007) or Echo State Networks (see Jaeger and Haas, 2004; Jaeger, 2010).
Finally, several families of reservoir systems have shown excellent performance in various classification and forecasting exercises including both standard machine learning benchmarks (see Lukoˇseviˇcius and Jaeger, 2009, and references therein) and sophisticated applications that range from learning the attractors of chaotic nonlinear infinite dimensional dynamical systems (see Jaeger and Haas, 2004; Pathak et al., 2017, 2018; Lu et al., 2018) to the detection of Steady-State Visual Evoked Potentials (SSVEPs) in electroencephalographic signals as in Ib a nez-Soria et al. (2019). It is also important to point out that RC implementations with dedicated hardware have been proved to exhibit information processing speeds that significantly outperform standard Turing-type computers (see, for instance, Appeltant et al., 2011; Rodan and Tino, 2011; Vandoorne et al., 2011; Larger et al., 2012; Paquot et al., 2012; Brunner et al., 2013; Vandoorne et al., 2014; Vinckier et al., 2015; Laporte et al., 2018). For a number of years, the reservoir computing community has worked hard on characterizing the key properties that explain the performance of reservoirs in classification, forecasting, and memory reconstruction tasks and also on formulating necessary conditions for a given state-space system to serve as a properly functioning reservoir system. Salient features of reservoir systems that have been shown to be important are the fading memory property (FMP), which appears in the context of systems theory (Volterra, 1959; Wiener, 1958; Boyd and Chua, 1985), computational neurosciences (Maass et al., 2004), physics (Coleman and Mizel, 1968), or mechanics (see Fabrizio et al., 2010, and references therein), the echo state property (ESP) (Jaeger, 2010; Yildiz et al., 2012; Manjunath and Jaeger, 2013), and the pairwise separation property (SP) (see, for instance, Legenstein and Maass, 2007; Lukoˇseviˇcius and Jaeger, 2009; Maass, 2011, and references therein). Much effort has been made to provide rigorous definitions for these concepts and to characterize their relations under various hypotheses (see Jaeger, 2010; Grigoryeva and Ortega, 2019, and references therein). In particular, the crucial importance of these properties manifests itself in a series of universal approximation results which have been obtained for RC systems (see for example Maass et al., 2007; Grigoryeva and Ortega, 2018a,b; Gonon and Ortega, 2020b,a). This feature is a dynamic analog to well-established universal approximation properties for static machine learning paradigms, like neural networks (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989), for which the so-called approximation error (see Cucker and Smale, 2002; Smale and Zhou, 2003; Cucker and Zhou, 2007) can be made arbitrary small. From the point of view of learning theory, the most important feature for any paradigm is its ability to generalize. Here this means that the performance of a given RC architecture on a training sample should be comparable to its behavior on previously unseen realizations Risk Bounds for Reservoir Computing of the same data generation process. In the RC literature, this problem has been traditionally tackled using the notion of memory capacity, that has been the subject of much research (Jaeger, 2002; White et al., 2004; Ganguli et al., 2008; Hermans and Schrauwen, 2010; Dambre et al., 2012; Grigoryeva et al., 2015; Couillet et al., 2016; Grigoryeva et al., 2016). 
Unfortunately, it has been recently shown that optimizing memory capacity does not necessarily lead to higher prediction performance (see Marzen, 2017). Moreover, the recently proved universal approximation properties of RC that we just brought up do not guarantee that a given universal reservoir system will exhibit small generalization errors. In other words, they guarantee the availability of RC architectures that exhibit arbitrarily small training errors but give no control on their generalization power. Following the standard learning theoretical approach to measure the generalization power would lead to consider the difference between the training error (empirical risk) and the testing error (statistical risk or generalization error) and aim at controlling it uniformly over a given class of reservoir systems by using a measure of the class complexity. A number of complexity measures for function classes have been proposed over the years. We can name the Vapnik-Chervonenkis (VC) dimension (Vapnik, 1998), Rademacher and Gaussian complexities (Bartlett and Mendelson, 2003), uniform stability (Mukherjee et al., 2006; Bousquet and Elisseeff, 2002; Poggio et al., 2004), and their modifications. In particular, a vast literature is available also on complexities and probably approximately correct (PAC) bounds for multilayer neural networks or recurrent neural networks (see, for instance, Haussler, 1992; Koiran and Sontag, 1998; Sontag, 1998; Anthony and Bartlett, 1999; Bartlett et al., 2017; Zhang et al., 2018, and references therein). However, it is important to emphasize that using this traditional learning theoretical approach to formulate generalization error bounds in the case of reservoir systems is challenging and requires non-trivial extensions of this circle of ideas. Indeed, since a key motivation for our analysis are time series applications, the standard i.i.d. assumption on inputs and outputs cannot be invoked anymore, which makes a number of conventional tools unsuitable. Here the signals to be treated are stochastic processes with a particular dependence structure, which introduces mathematical difficulties that only a few works have analyzed in a learning theory context. In most of the available contributions on learning in a non-i.i.d. setting, stationarity and specific mixing properties of the input are key assumptions (see, for instance, Mc Donald et al., 2017; Kuznetsov and Mohri, 2017, 2018, and references therein). The time series applications that we are interested motivate us, however, to part with the latter. A common argument in this direction (Kuznetsov and Mohri, 2018) is that many standard time-series processes happen to be non-mixing; for example, one can easily construct AR(1) and ARFIMA processes which are not mixing (see Andrews, 1983; Baillie, 1996, respectively). On the other hand, it has been pointed out (see Adams and Nobel, 2010) that the convergence of empirical quantities to population-based ones can be arbitrarily slow for general stationary data and one cannot hope to obtain distributionfree probability bounds as they exist for the i.i.d. case. Motivated by these observations, in this article we restrict to a particular type of dependent processes, namely, we focus on dependence structures created by causal Bernoulli shifts (see, for instance, Dedecker et al., 2007; Alquier and Wintenberger, 2012) and hence the error bounds that we obtain are valid for any input process with such a dependence structure. Apart from trivially incorporating the i.i.d. 
case, the Bernoulli shifts category includes the VARMA time-series class of Gonon, Grigoryeva, and Ortega models, financial econometric models such as various GARCH specifications (Engle, 1982; Bollerslev, 1986; Engle, 2009), and the ARFIMA (Beran, 1994) processes that allow the modeling of long memory behavior exhibited by many financial time series (for example realized variances). As we show later on, even though the bounds that we obtain depend on the weak dependence assumptions, they hold true without making precise the distributions of the input and the outputs, which are generally unknown. We hence place ourselves in a semi-agnostic setup. Regarding complexity measures, bounds for them are also customarily formulated in an i.i.d. setting. Recently, some authors addressed the question of constructing versions of Rademacher complexities for dependent inputs. For example, if one defines the risk in terms of conditional expectations, then the so-called sequential Rademacher complexities can be used to derive bounds (see, for instance, Rakhlin et al. (2010, Proposition 15) and Rakhlin, Sridharan, and Tewari (2014)). In this paper we pursue a more traditional approach in terms of the definition of the expected risk and hence the associated Rademacher complexity. The main contribution of this paper is the formulation of the first explicit generalization bounds for reservoir systems such as recurrent neural networks with input data exhibiting a sequential dependence structure for the classical notion of risk defined as an expected loss. The uniform high-probability bounds which we state in this paper depend exclusively on the weak dependence behavior of the input and target processes and a quantitative measure of the capacity (Rademacher complexity) of the set of functionals generated by the considered reservoir systems. The finite sample guarantees provided by our generalization bounds explicitly answer practical questions concerning the bounds for the parameters within a particular reservoir family, the rates of uniform convergence, and hence the length of the training sample required to achieve a desired learning generalization quality within a given RC class. Finally, when one wishes to apply empirical risk minimization (ERM) in order to pick the reservoir functional within the hypothesis class, the asymptotic behavior of the devised bounds guarantees the consistency of ERM for reservoir systems. The paper is organized as follows: Section 2 describes the notation used in the paper. We introduce reservoir systems, the associated filters and functionals, as well as a detailed description of various families of popular reservoir systems in the literature. Section 3 sets up the statistical learning problem for reservoir computing systems. It starts by introducing a general framework for the learning procedure, necessary risk definitions, and criteria of risk-consistency for the particular case of empirical risk minimization (ERM). The second subsection constitutes the main part of Section 3 and we present in it the setting in which reservoir systems are analyzed in the rest of the paper. First, three alternative core assumptions regarding the weak dependence structure of the input and target processes are analyzed and illustrated with examples. Second, the hypothesis classes of reservoir maps and functionals are constructed under a set of mild assumptions. Finally, the strategy for the derivation of risk bounds for a given choice of loss function is discussed. 
Section 4 contains the main results of the paper. Proofs and auxiliary results are postponed to the appendices. Section 4 is structured as follows. In the first subsection the expected value of the worst-case difference between the generalization and training errors over the class of reservoir functionals is shown to be bounded by its Rademacher complexity and terms related to the weak dependence structure of the input and target processes. The obtained rates differ depending on the various assumptions invoked. The second subsection provides explicit expressions for upper bounds of the Rademacher complexities associated to the families of reservoir systems presented in Section 2. The third subsection concludes with the formulation of high-probability finite-sample generalization bounds for reservoir systems. We emphasize that previously such bounds were not available in the literature. The asymptotic behavior of these bounds shows in passing the weak risk-consistency of the ERM procedure for reservoir systems. The last subsection contains a result that provides high-probability bounds for families of reservoir systems whose reservoir maps have been generated randomly. This result is a theoretical justification of the well-known good empirical properties of this standard modus operandi in reservoir computing.

2. Preliminaries

We start by specifying our notation and introducing reservoir computing systems for which in the following sections we will set up a statistical learning strategy. In the last subsection we provide a list of particular families of reservoir systems which are popular in the RC literature and in applications.

2.1 Notation

We use the symbol $\mathbb{N}$ (respectively, $\mathbb{N}_+$) to denote the set of natural numbers with the zero element included (respectively, excluded). $\mathbb{Z}$ denotes the set of all integers, and $\mathbb{Z}_-$ (respectively, $\mathbb{Z}_+$) stands for the set of the negative (respectively, positive) integers with the zero element included. Let $d, n, m \in \mathbb{N}_+$. Given an element $x \in \mathbb{R}^n$, we denote by $\mathbb{R}[x]$ the real-valued multivariate polynomials in $x$ with real coefficients. Given a vector $v \in \mathbb{R}^n$, the symbol $\|v\|_2$ stands for its Euclidean norm. We denote by $\mathbb{M}_{m,n}$ the space of real $m \times n$ matrices. When $n = m$, we use the symbol $\mathbb{M}_n$ to refer to the space of square matrices of order $n$. For any $A \in \mathbb{M}_{m,n}$, $|||A|||_2$ denotes its matrix norm induced by the Euclidean norms in $\mathbb{R}^m$ and $\mathbb{R}^n$, which satisfies $|||A|||_2 = \sigma_{\max}(A)$ with $\sigma_{\max}(A)$ the largest singular value of $A$. $|||A|||_2$ is sometimes referred to as the spectral norm of $A$ (see Horn and Johnson, 2013).

When working in a deterministic setup, the inputs and outputs will be modeled using semi-infinite sequences $z \in (\mathbb{R}^d)^{\mathbb{Z}_-}$ and $y \in (\mathbb{R}^m)^{\mathbb{Z}_-}$, respectively. We shall restrict very frequently to input sequences that exhibit additional convergence properties that are imposed with the help of weighting sequences. A weighting sequence $w$ is a strictly decreasing sequence with zero limit $w : \mathbb{N} \to (0, 1]$ such that $w_0 = 1$. We define the weighted 1-norm or $(1, w)$-norm $\|\cdot\|_{1,w}$ in the space of semi-infinite sequences $(\mathbb{R}^d)^{\mathbb{Z}_-}$ as $\|z\|_{1,w} := \sum_{t \in \mathbb{Z}_-} \|z_t\|_2 \, w_{-t}$, for any $z \in (\mathbb{R}^d)^{\mathbb{Z}_-}$. We then set $\ell^{1,w}_-(\mathbb{R}^d) := \{ z \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z\|_{1,w} < \infty \}$. This weighted sequence space can be characterized as a Bochner space (Hytönen et al., 2016) by noticing that $\big(\ell^{1,w}_-(\mathbb{R}^d), \|\cdot\|_{1,w}\big) = \big(L^1(\mathbb{Z}_-, \mathcal{P}(\mathbb{Z}_-), \mu_w; \mathbb{R}^d), \|\cdot\|_{L^1(\mathbb{Z}_-;\mathbb{R}^d)}\big)$, where $\mathcal{P}(\mathbb{Z}_-)$ stands for the power set of $\mathbb{Z}_-$ and $\mu_w$ is the measure defined on $(\mathbb{Z}_-, \mathcal{P}(\mathbb{Z}_-))$ generated by the assignments $\mu_w(\{t\}) := w_{-t}$, for any $t \in \mathbb{Z}_-$.
This equality guarantees that the pair ℓ1,w (Rd), 1,w forms a separable Banach space. Let now τ Z and define the time delay operator T τ : (Rd)Z (Rd)Z by T τ(z)t := zt+τ, for any t Z . We call T τ(z) (Rd)Z the τ-shifted version of the semi-infinite sequence z (Rd)Z . It can be proved (Grigoryeva and Ortega, 2019) that T τ restricts to a continuous linear operator in ℓ1,w (Rd), 1,w and that the operator norm of the resulting maps T τ : ℓ1,w (Rd) ℓ1,w (Rd) satisfies |||T1|||1,w = Lw and |||T τ|||1,w L τ w , for all τ Z , (1) provided that the condition Lw < holds, where 1 Lw is the inverse decay ratio of w defined as Lw := sup t N We define for future reference the decay ratio Dw of w as Dw := sup t N The other weighted norm of much use in the context of reservoir computing is the ( , w)- norm, defined by z ,w := sup t Z { zt 2w t}, for any z (Rd)Z . We then set ℓ ,w (Rd) := n z (Rd)Z | z ,w < o . It can also be showed that the pair (ℓ ,w (Rd), ,w) is a Banach space (Grigoryeva and Ortega, 2018b). Additionally, the time delay operators also restrict to ℓ ,w (Rd) and the corresponding operator norms satisfy (1). 2.2 Filters and reservoir computing systems The objects at the core of this paper are input/output maps of the form U : Rd Z (Rm)Z. We will restrict to the case in which the maps U are causal and time-invariant (see Grigoryeva and Ortega, 2019, for definitions and the proofs of the facts that we now state) and hence it suffices to work with the restrictions U : Rd Z (Rm)Z . Moreover, causal and time-invariant filters U uniquely determine functionals of the type HU : Rd Z Rm HU(z) := U(z)0, for any z Rd Z . Risk Bounds for Reservoir Computing In this setup, we shall say that U : Rd Z (Rm)Z is a filter and that HU : Rd Z Rm is its corresponding functional. Conversely, given a functional H : Rd Z Rm, there is a unique causal and time-invariant filter UH : Rd Z (Rm)Z determined by it as UH(z)t := H(T t(z)), for any z Rd Z , t Z . (2) Suppose that given a weighting sequence w, the filter U restricts to a map between weighted ( , w)-spaces, that is, U : ℓ ,w (Rd) ℓ ,w (Rm) and that, additionally, U is continuous with respect to the norm topology in those spaces. In that case we say that U has the fading memory property (FMP) with respect to w. We shall provide an answer to the supervised learning of filters by estimating approximants built as reservoir filters. Reservoir filters are obtained out of a reservoir system, that is, a state-space system made out of two recurrent equations of the form: xt = F(xt 1, zt), yt = h(xt), (3) for all t Z and where F : DN Dd DN and h : DN Rm are maps, Dd Rd, DN RN. The sequences z (Dd)Z and y (Rm)Z stand for the input and the output (target) of the system, respectively, and x (DN)Z are the associated reservoir states. A reservoir system determines a filter when the first equation in (3) satisfies the socalled echo state property (ESP), that is, when for any z (Dd)Z there exists a unique x (DN)Z such that (3) holds. In that case, we talk about the reservoir filter UF h : (Dd)Z (Rm)Z associated to the reservoir system (3) that is defined by: UF h := h UF , where UF (z) := x, with z (Dd)Z and x (DN)Z linked by the first equation in (3) via the ESP. It is easy to show that reservoir filters are automatically causal and time-invariant (Grigoryeva and Ortega, 2018b, Proposition 2.1) and hence determine a reservoir functional HF h : (Dd)Z Rm. 
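To make these objects concrete, the following sketch (Python/NumPy; all names are illustrative and not taken from the paper) evaluates a reservoir functional by iterating the state equation in (3) over a finite input window. Under the contraction condition made precise in Proposition 1 below, the influence of the arbitrary initial state decays geometrically, which is the practical manifestation of the echo state property.

```python
import numpy as np

def reservoir_functional(F, h, z_window, x_init):
    """Approximate H(z) = h(x_0) by iterating x_t = F(x_{t-1}, z_t)
    over a finite window z_{-T+1}, ..., z_0 starting from x_init.

    F : callable (x, z) -> new state, assumed to be a contraction in x
    h : callable x -> output (the readout map)
    z_window : array of shape (T, d), oldest input first, last row is z_0
    """
    x = x_init
    for z_t in z_window:            # run the recursion forward in time
        x = F(x, z_t)
    return h(x)

# Toy example: a contractive affine state map and a simple readout.
rng = np.random.default_rng(0)
N, d, T = 5, 2, 200
A = 0.5 * np.eye(N)                 # |||A|||_2 = 0.5 < 1, so F is a contraction in x
C = rng.standard_normal((N, d))
F = lambda x, z: A @ x + C @ z
h = lambda x: np.tanh(x).sum()

z = rng.standard_normal((T, d))
# Two different initial states give (numerically) the same functional value,
# illustrating the echo state property for contractive reservoir maps.
print(reservoir_functional(F, h, z, np.zeros(N)))
print(reservoir_functional(F, h, z, 10.0 * np.ones(N)))
```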
As the following proposition shows, a sufficient condition guaranteeing that the echo state property holds is that $D_N$ is a closed ball and that the map $F$ is continuous and a contraction in the first argument.

Proposition 1 Let $S > 0$, $B_S = \{x \in \mathbb{R}^N : \|x\|_2 \le S\}$ and suppose that $F : B_S \times D_d \to B_S$ is continuous. Assume that $F$ is a contraction in the first argument, that is, there exists $0 < r < 1$ such that for all $x_1, x_2 \in B_S$, $z \in D_d$ it holds that
$$\|F(x_1, z) - F(x_2, z)\|_2 \le r \|x_1 - x_2\|_2. \quad (4)$$
Then the system (3) has the echo state property and hence its first equation determines a unique causal and time-invariant filter $U^F : (D_d)^{\mathbb{Z}_-} \to (B_S)^{\mathbb{Z}_-}$ as well as a functional $H^F : (D_d)^{\mathbb{Z}_-} \to B_S$ that are continuous (where both $(D_d)^{\mathbb{Z}_-}$ and $(B_S)^{\mathbb{Z}_-}$ are equipped with the product topologies).

Remark 2 If we have a continuous function $F : \mathbb{R}^N \times D_d \to \mathbb{R}^N$ which satisfies (4) for all $x_1, x_2 \in \mathbb{R}^N$ and $\|F(0, z)\|_2 \le c$ for all $z \in D_d$ and a certain $c > 0$, then choosing any $S \ge c/(1 - r)$ it is easy to see that for all $u \in B_S$
$$\|F(u, z)\|_2 \le \|F(u, z) - F(0, z)\|_2 + \|F(0, z)\|_2 \le r \|u\|_2 + c \le r S + c \le S.$$
Thus, $F(B_S \times D_d) \subset B_S$ and so the restriction of $F$ to $B_S \times D_d$ satisfies the assumptions of Proposition 1.

2.3 Families of reservoir systems

The following paragraphs introduce various families of reservoir systems that appear in applications. We shall later on explicitly construct for these specific families the risk bounds contained in the main results of the paper.

Reservoir systems with linear reservoir maps (LRC). In this case one associates to each input signal $z \in (D_d)^{\mathbb{Z}_-}$ an output $y \in (\mathbb{R}^m)^{\mathbb{Z}_-}$ via the two recurrent equations
$$x_t = A x_{t-1} + C z_t + \zeta, \quad (5)$$
$$y_t = h(x_t), \quad (6)$$
with $t \in \mathbb{Z}_-$ and $A \in \mathbb{M}_N$, $C \in \mathbb{M}_{N,d}$, $\zeta \in \mathbb{R}^N$. Systems with linear reservoir maps of the type (5) have been vastly studied in the literature in numerous contexts and under different denominations. In the RC setting, systems of the type (5)-(6) with polynomial readout maps $h : D_N \to \mathbb{R}^m$ have been proved in Grigoryeva and Ortega (2018a) to be universal approximators in the category of fading memory filters, either when presented with uniformly bounded inputs in the deterministic setup (see Corollary 3.4) or with almost surely uniformly bounded stochastic inputs (see Corollary 4.8). These boundedness hypotheses have been dropped in Gonon and Ortega (2020b) by considering density with respect to $L^p$ norms, $1 \le p < \infty$, defined using the law of the input data generating process. Sufficient conditions which ensure the echo state property and the fading memory property for these systems have been established (see Section 3.1 in Grigoryeva and Ortega, 2019). More specifically, consider the reservoir map $F^{A,C,\zeta} : D_N \times D_d \to D_N$ of the system (5)-(6) given by $F^{A,C,\zeta}(x, z) = A x + C z + \zeta$. It is easy to see that $F^{A,C,\zeta}$ is a contraction in the first entry whenever the matrix $A$ satisfies $|||A|||_2 < 1$. For these systems we consider only the case of uniformly bounded input signals.

Case with uniformly bounded inputs. Suppose now that $|||A|||_2 < 1$. If the inputs are uniformly bounded, that is, if $D_d = B_M$ for some $M > 0$ and so $z \in K_M$ with $K_M := \{ z \in (\mathbb{R}^d)^{\mathbb{Z}_-} \mid \|z_t\|_2 \le M \text{ for all } t \in \mathbb{Z}_- \}$, then the reservoir system (5)-(6) has the echo state property and defines a unique causal and time-invariant reservoir filter $U^{A,C,\zeta} : K_M \to (D_N)^{\mathbb{Z}_-}$ given by $U^{A,C,\zeta}(z)_t := \sum_{j=0}^{\infty} A^j (C z_{t-j} + \zeta)$, $t \in \mathbb{Z}_-$. Here $D_N = B_{M_F}$ with $M_F = (|||C|||_2 M + \|\zeta\|_2)/(1 - |||A|||_2)$ (see Remark 2 and part (ii) in the first example in Section 4.1 of Grigoryeva and Ortega, 2019). In particular, the corresponding functional $H^{A,C,\zeta} : K_M \to D_N$ satisfies $\|H^{A,C,\zeta}(z)\|_2 \le M_F$ for all $z \in K_M$.
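As an illustration of the linear case (a sketch under the stated assumptions, with arbitrarily chosen matrices), the truncated series below approximates the filter $U^{A,C,\zeta}(z)_t = \sum_{j \ge 0} A^j (C z_{t-j} + \zeta)$ for uniformly bounded inputs and checks the state bound $M_F$ numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, M = 4, 3, 1.0
A = rng.standard_normal((N, N))
A *= 0.8 / np.linalg.norm(A, 2)          # rescale so that |||A|||_2 = 0.8 < 1
C = rng.standard_normal((N, d))
zeta = rng.standard_normal(N)

def lrc_state(z_hist, J=200):
    """Truncated version of x_t = sum_{j>=0} A^j (C z_{t-j} + zeta).
    z_hist has shape (T, d) with the most recent input z_t in the last row."""
    x, Aj = np.zeros(N), np.eye(N)
    for j in range(min(J, len(z_hist))):
        x += Aj @ (C @ z_hist[-1 - j] + zeta)
        Aj = Aj @ A
    return x

# Inputs uniformly bounded by M, i.e. an element of K_M.
z = rng.uniform(-1, 1, size=(500, d))
z *= M / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), M)

MF = (np.linalg.norm(C, 2) * M + np.linalg.norm(zeta)) / (1 - np.linalg.norm(A, 2))
x0 = lrc_state(z)
print(np.linalg.norm(x0), "<=", MF)      # the state bound M_F holds
```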
Additionally, it can be shown that the reservoir system (5)-(6) has the fading memory property with respect to any weighting sequence. In what follows we consider a particular subfamily of systems (5)-(6), namely reservoir systems with linear reservoir and linear readout maps, in which case h : DN Rm is given by applying W Mm,N. Echo State Networks (ESN) Echo State Networks (Matthews, 1992; Jaeger and Haas, 2004) are a family of reservoir systems that exhibit excellent performance in many practical applications and have been recently proved to have universal approximation properties. More specifically, Grigoryeva and Ortega (2018b) proved ESNs to be universal in the category of fading memory filters with semi-infinite uniformly bounded inputs in a deterministic setup, and Gonon and Ortega (2020b) obtained universality results for ESNs in the stochastic situation with respect to Lp-type criteria for stochastic discrete-time semi-infinite inputs. An echo state network of dimension N N+ with reservoir matrix A MN, input mask C MN,d, input shift ζ RN, and readout matrix W Mm,N is the system xt = σ(Axt 1 + Czt + ζ), (7) yt = Wxt, (8) which for each t Z transforms the input zt Dd Rd into the reservoir state xt DN RN and, consequently, into the corresponding output yt Rm. The reservoir map F σ,A,C,ζ : DN Dd DN of the system (7)-(8) is given by F σ,A,C,ζ(x, z) = σ(Ax + Cz + ζ), where σ: RN RN is defined by the componentwise application of a given activation function σ: R R. Throughout, we assume that σ is Lipschitz-continuous with Lipschitzconstant Lσ. It is straightforward to verify that F σ,A,C,ζ is a contraction in the first entry whenever Lσ|||A|||2 < 1. The sufficient conditions which ensure the echo state and the fading memory properties of (7)-(8) have been also carefully studied in the literature (see Buehner and Young, 2006; Yildiz et al., 2012; Grigoryeva and Ortega, 2018b, for details) and depend both on the type of the activation function σ: R R and on the type of the input presented to the network. We consider the following two cases: Case with arbitrary input signals and bounded activation function. In this situation, Dd is arbitrary and so generic input signals z (Dd)Z are considered, but we assume that the range of the activation function σ is bounded and contained in [σmin, σmax] with σmin < σmax R. Then, by Proposition 1, the condition Lσ|||A|||2 < 1 suffices to ensure that the system (7)-(8) has the echo state property and hence defines a unique causal and timeinvariant filter Uσ,A,C,ζ : (Dd)Z (DN)Z as well as a functional Hσ,A,C,ζ : (Dd)Z DN that are additionally continuous with respect to the product topologies on the spaces (Dd)Z and (DN)Z . Here DN = BMF with MF = N max(|σmin|, |σmax|) and in particular, we obviously have that Hσ,A,C,ζ(z) 2 MF . Gonon, Grigoryeva, and Ortega Case with uniformly bounded inputs. Suppose that Lσ|||A|||2 < 1 and Dd = BM for some M > 0 and so the inputs are uniformly bounded, that is, z KM with KM := z (Rd)Z | zt 2 M for all t Z . In this case the reservoir system (7)-(8) has the echo state property and defines a unique causal and time-invariant reservoir filter Uσ,A,C,ζ : KM (DN)Z as well as a functional Hσ,A,C,ζ : KM DN, where DN = BM1 F with M1 F := [Lσ(|||C|||2M + ζ 2) + Nσ(0)]/(1 Lσ|||A|||2) (see Proposition 1 and Remark 2 or part (ii) in the first example in Section 4.1 of Grigoryeva and Ortega, 2019). Additionally, the fading memory property holds with respect to any weighting sequence. 
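A minimal ESN implementation along the lines of (7)-(8) might look as follows (illustrative code, not the authors'); it uses the tanh activation, for which $L_\sigma = 1$, and rescales a randomly generated reservoir matrix so that $L_\sigma |||A|||_2 < 1$ holds.

```python
import numpy as np

class ESN:
    """Minimal echo state network x_t = tanh(A x_{t-1} + C z_t + zeta), y_t = W x_t."""

    def __init__(self, N, d, m, spectral_norm=0.9, seed=0):
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((N, N))
        # tanh is 1-Lipschitz, so |||A|||_2 < 1 makes the reservoir map a contraction in x
        self.A = A * (spectral_norm / np.linalg.norm(A, 2))
        self.C = rng.standard_normal((N, d)) / np.sqrt(d)
        self.zeta = rng.standard_normal(N)
        self.W = np.zeros((m, N))            # readout matrix, to be trained

    def states(self, z, x0=None):
        """Run the reservoir over inputs z of shape (T, d); returns states of shape (T, N)."""
        x = np.zeros(self.A.shape[0]) if x0 is None else x0
        X = []
        for z_t in z:
            x = np.tanh(self.A @ x + self.C @ z_t + self.zeta)
            X.append(x)
        return np.asarray(X)

    def fit_readout(self, z, y, washout=100, ridge=1e-6):
        """Ridge-regularized least-squares fit of W on one trajectory; y has shape (T, m)."""
        X = self.states(z)[washout:]
        Y = y[washout:]
        self.W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y).T
        return self

    def predict(self, z):
        return self.states(z) @ self.W.T
```

Only the readout matrix W is fitted here (by regularized least squares on a single trajectory), while A, C, and the input shift remain randomly generated; this is precisely the training strategy for reservoir systems described in the introduction.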
Note that these results hold true even though the range of the activation function is not assumed to be bounded and Hσ,A,C,ζ(z) 2 MF with MF = M1 F when the activation function σ has an unbounded range and with MF = min(M1 F , N max (|σmin|, |σmax|)), otherwise. State-Affine Systems (SAS) The so-called homogeneous state-affine systems have been first introduced in the systems theory literature and were shown to exhibit universality properties in the discrete-time setting for compact times (see Fliess and Normand-Cyrot, 1980; Sontag, 1979a,b). A nonhomogeneous version of these systems was introduced in Grigoryeva and Ortega (2018a), where they were proved to be universal approximants in the category of fading memory filters for the non-compact discrete-time deterministic setup. Trigonometric state-affine systems were later on studied in a stochastic setup in Gonon and Ortega (2020b), where their universality for stochastic discrete-time semi-infinite inputs with respect to Lp-criteria was established. State-affine systems serve as an excellent example of reservoir systems with easy-to-train linear readouts and even though little is known about their empirical performance in learning tasks, we find it important to provide explicit risk bounds for this family. In the rest of the paper we reserve the name State-Affine Systems (SAS) for the non-homogeneous version if not stated otherwise and leave the trigonometric family for future work. The following notation for multivariate polynomials will be used: for any multi-index α Nd and any z Rd, we write zα := zα1 1 zαd d . Furthermore, the space MN,M[z], N, M N+, of polynomials in the variable z Rd with matrix coefficients in MN,M is the set of elements p of the form α Vp zαAα, z Rd, where Vp Nd is a finite subset and the elements Aα MN,M are matrix coefficients. The degree deg(p) of the polynomial p is defined as deg(p) = max α Vp { α 1} , where α 1 := α1 + + αd. We also define the following norm on MN,M[z]: |||p||| = max α Vp |||Aα|||2. (9) The non-homogeneous state-affine system (SAS) of dimension N N+ associated to two given polynomials p MN,N[z] and q MN,1[z] with matrix and vector coefficients, respectively, is the reservoir system determined by the following state-space transformation Risk Bounds for Reservoir Computing of each input signal z (Dd)Z into the output signal y (Rm)Z , xt = p(zt)xt 1 + q(zt), (10) yt = Wxt, (11) for t Z , with W Mm,N the readout map. The reservoir map F p,q : DN Dd DN of the system (10)-(11) is given by F p,q(x, z) = p(z)x + q(z). (12) Additionally, we define Mp := sup z Dd |||p(z)|||2, Mq := sup z Dd |||q(z)|||2. First, we notice that for regular SAS defined by nontrivial polynomials, the set Dd needs to be bounded in order for Mp and Mq to be finite. It is easy to see that F in (12) is a contraction in the first entry with constant Mp whenever Mp < 1, which is a condition that we will assume holds true together with Mq < in the next paragraph. Case with uniformly bounded input signals. Let Dd = BM for some M > 0 so that we consider inputs z KM with KM := z (Rd)Z | zt 2 M for all t Z . In that case the system (10)-(11) has the echo state property and determines (see Proposition 1 and Remark 2 or part (ii) in the third example in Section 4.1 of Grigoryeva and Ortega, 2019) a unique reservoir filter Up,q : KM (DN)Z as well as a functional Hp,q : KM DN, where DN = BMF with MF = Mq/(1 Mp). In addition, the fading memory property holds with respect to any weighting sequence. 
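The SAS reservoir map (12) can be sketched directly by evaluating the matrix polynomials p and q at each input; in the toy example below (our own illustrative choice of coefficients, not from the paper) the constant $M_p$ is estimated on the bounded input domain in order to check the contraction condition $M_p < 1$.

```python
import numpy as np

def matrix_poly(coeffs):
    """Return z -> sum_alpha z^alpha A_alpha for a dict {multi-index tuple: matrix}."""
    def p(z):
        return sum(np.prod(np.asarray(z, float) ** np.asarray(a)) * A for a, A in coeffs.items())
    return p

rng = np.random.default_rng(2)
N, d, M = 3, 2, 1.0
# p(z) = A_(0,0) + z_1 A_(1,0) + z_2 A_(0,1),  q(z) = b_(0,0) + z_1 z_2 b_(1,1)
p = matrix_poly({(0, 0): 0.2 * np.eye(N),
                 (1, 0): 0.1 * rng.standard_normal((N, N)),
                 (0, 1): 0.1 * rng.standard_normal((N, N))})
q = matrix_poly({(0, 0): rng.standard_normal((N, 1)),
                 (1, 1): rng.standard_normal((N, 1))})

def F(x, z):                     # SAS reservoir map F(x, z) = p(z) x + q(z)
    return p(z) @ x + q(z)[:, 0]

# Estimate M_p = sup_{||z|| <= M} |||p(z)|||_2 on a sample of the bounded input domain.
zs = rng.uniform(-1, 1, size=(2000, d))
zs *= M / np.maximum(np.linalg.norm(zs, axis=1, keepdims=True), M)
Mp = max(np.linalg.norm(p(z), 2) for z in zs)
print("estimated M_p =", Mp, "(contraction in x if M_p < 1)")
```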
Moreover, in this case the filter can be explicitly written as (Up,q(z))t = P j=0(Qj 1 k=0 p(zt k))q(zt j), t Z , and Hp,q(z) 2 MF , for all z KM. 3. The learning problem for reservoir computing systems In this paper we work in the setting of supervised learning in a probabilistic framework and our goal is to provide performance estimates for reservoir systems from the statistical learning theory perspective. With that in mind, we start this section by stating the general learning problem for systems with stochastic input and target signals. We then introduce three alternative assumptions on the weak dependence of input and output processes which will be assumed later on in the paper and provide examples of important time series models that satisfy the conditions under consideration. We define the statistical risk and its empirical analogs for reservoir functionals and motivate the need for generalization error bounds. More specifically, on the one hand the in-class generalization error (risk) can be used to bound the estimation error of a class. On the other hand, whenever the learner follows the empirical risk minimization (ERM) strategy to select the reservoir computing system within the RC hypothesis class based on minimization of the empirical (training) error, generalization error bounds can be used to prove the weak universal risk-consistency of ERM for reservoir systems. If the inputs are i.i.d. (which is a particular case of our setup), this definition is essentially equivalent to saying that the hypothesis class of reservoir functionals is a (weak) uniform Glivenko-Cantelli class (see for example Mukherjee et al., 2006). Gonon, Grigoryeva, and Ortega 3.1 General setup of the learning procedure Input and target stochastic processes. We fix a probability space (Ω, A, P) on which all random variables are defined. The triple consists of the sample space Ω, which is the set of possible outcomes, the σ-algebra A (a set of subsets of Ω(events)), and a probability measure P : A [0, 1]. The input and target signals are modeled by discrete-time stochastic processes Z = (Zt)t Z and Y = (Yt)t Z taking values in Dd Rd and Rm, respectively. Moreover, we write Z(ω) = (Zt(ω))t Z and Y(ω) = (Yt(ω))t Z for each outcome ω Ωto denote the realizations or sample paths of Z and Y, respectively. Since Z can be seen as a random sequence in Dd Rd, we write interchangeably Z : Z Ω Dd and Z : Ω (Dd)Z . The latter is necessarily measurable with respect to the Borel σ-algebra induced by the product topology in (Dd)Z . The same applies to the analogous assignments involving Y. Hypothesis class H, loss functions, statistical, and empirical risk. Let F be the class of all measurable functionals H : (Dd)Z Rm, Dd Rd, that is F := {H : (Dd)Z Rm | H is measurable}. Consider a smaller hypothesis class H of admissible functionals H F. For a fixed loss, that is, a measurable function1 L: Rm Rm R and for any functional H F we define the statistical risk (sometimes just referred to as risk) or generalization error associated with H as R(H) := E[L(H(Z), Y0)], (13) where by definition the expectation is taken with respect to the joint law of (Z, Y). The ultimate goal of the learning procedure consists in determining the Bayes functional H F F that exhibits the minimal statistical risk (Bayes risk) in the class of all measurable functionals, which we denote as R F := R(H F) = inf H F R(H). 
(14) Even though this task is generally infeasible, one may hope to solve it for the best-in-class functional H H H with the minimal associated in-class statistical risk (Bayes in-class risk), which is assumed achievable, and which we denote as R H := R(H H) = inf H H R(H). The standard learning program is then based on the following error decomposition. For any H H we can write that R(H) R F = (R(H) R H) + (R H R F), where the first term is called the estimation error and the second one is the approximation error. In this paper we focus on upper bounds of the estimation component, while the same problem for the approximation error will be treated in the forthcoming work, namely 1. It is customary in the literature to consider nonnegative loss functions. This automatically guarantees that the expectation in (13) is well-defined, although it is not necessarily finite. In this paper, for the sake of mathematical convenience, we allow for general real-valued loss functions but carefully address technical questions where relevant. Risk Bounds for Reservoir Computing Gonon, Grigoryeva, and Ortega (2020). We emphasize that since the universal approximation properties of reservoir systems have been established in numerous situations (see the introduction and Section 2.3) the approximation error can be made arbitrarily small by choosing an appropriate hypothesis class H. The distribution of (Z, Y) is generally unknown, and hence computing the risks (13) or (14) is in practice infeasible. This implies, in particular, that the estimation error cannot be explicitly evaluated. Therefore, the usual procedure is in this case to use an empirical counterpart for (13) which can be computed using a training dataset. Suppose that a training sample for both the input and the target discrete-time stochastic processes is available up to some n N+ steps into the past, namely (Z i, Y i)i {0,...,n 1}. For each time step i {0, . . . , n 1} we define the truncated training sample for the input stochastic process Z as Z n+1 i := (. . . , 0, 0, Z n+1, . . . , Z i 1, Z i). (15) In this time series context the training error or the empirical risk analog b Rn(H) of (13) is given by b Rn(H) = 1 i=0 L(H(Z n+1 i ), Y i) = 1 i=0 L(UH(Z n+1 0 ) i, Y i), (16) where UH denotes the filter associated to the functional H as introduced in (2). In what follows we will also make use of what we call its idealized empirical risk version defined as b R n (H) = 1 i=0 L(H(Z i ), Y i) = 1 i=0 L(UH(Z 0 ) i, Y i), (17) which makes use of a larger training sample containing all the past values of the input process Z. Remark 3 The results of this paper are also valid if one replaces the zero elements in the truncated training sample (15) by an arbitrary sequence (deterministic, random, or dependent on the training sample). More specifically, consider an arbitrary function I : (Dd)Z (Dd)Z that we use to extend the input training sample, for each i {0, . . . , n 1}, as e Z n+1 i = (. . . , (I(Z n+1 0 )) 1, (I(Z n+1 0 ))0, Z n+1, . . . , Z i 1, Z i), (18) and use this sample to define the empirical risk as in (16). Later on in Proposition 5, we show that the difference between the empirical risk (16) and its idealized counterpart (17) can be made arbitrarily small under various assumptions that we shall consistently invoke. The proof of that result remains valid for the more general definition of empirical risk using (18). 
Moreover, that result is used to justify why, in the rest of the paper, it will be sufficient to work almost exclusively with the idealized empirical risk (17). Risk bounds and risk-consistency. As we have already discussed, the learner is interested in obtaining upper bounds of the estimation error (R(H) R H). In many cases, these bounds can be constructed by bounding the statistical risk (generalization error) or n := sup H H {R(H) b Rn(H)}. Gonon, Grigoryeva, and Ortega An upper bound of n (or its version with the absolute value of the difference) allows to quantify the worst in-class error between the statistical risk and its empirical analog. We emphasize that bounding this worst in-class error gives guarantees of performance for any learning algorithm which builds upon the idea of using the empirical risk to pick a concrete b Hn out of the hypothesis class H based on available training data. A standard example of such learning rules is the so-called empirical risk minimization (ERM) principle for which generalization error bounds or bounds for n can be used in a straightforward manner to bound the estimation error and, moreover, to establish some important consistency properties. More specifically, in the ERM procedure the learner chooses the desired functional b Hn out of the hypothesis class H of the admissible ones using (16) (the empirical version of (13)), that is, b Hn = arg min H H b Rn(H), (19) which is well defined provided that such a minimizer exists and is unique; otherwise one may define b Hn to be an ϵ-minimizer of the empirical risk (see Alon et al., 1997, for details). We say that the ERM is strongly consistent within the hypothesis class H if the generalization error R( b Hn) (or statistical risk) and the training error b Rn( b Hn) (or empirical risk) as defined in (13) and in (16), respectively, for a sequence of functionals ( b Hn)n N picked by ERM from H using random samples of increasing length, both converge almost surely to the Bayes in-class risk R H in (14), that is lim n R( b Hn) = R H a.s. (20) and lim n b Rn( b Hn) = R H a.s. (21) When no assumptions on the distribution of (Z, Y) are used to prove (20) and (21), then this means that (20) and (21) hold for all distributions and one talks about universal strong risk-consistency of the ERM principle over the class H. This is essentially the case in our setting, since we are working in a semi-agnostic setup and only invoke assumptions on the temporal dependence (but not on the marginal distributions of the input and target stochastic processes (Z, Y)). A standard approach to proving the strong risk-consistency of the ERM procedure for the hypothesis class of functionals H consists in finding a sequence (ηn)n N converging to zero for which the inequality n := sup H H | b Rn(H) R(H)| ηn, holds P-a.s. To see that this implies (20) and (21) one notes the following inequalities: R( b Hn) R H = R( b Hn) b Rn( b Hn) + b Rn( b Hn) b Rn(H H) + b Rn(H H) R H 2ηn + b Rn( b Hn) b Rn(H H) 2ηn, (22) where the last inequality follows from the fact that, by definition (19), b Hn is a minimizer of the empirical risk b Rn. Risk Bounds for Reservoir Computing In the context of reservoir systems, we shall be working with a weak version of consistency which imposes all the convergence conditions to hold only in probability. In what follows we devise bounds for n that allow us to establish the risk-consistency of the ERM procedure for reservoir systems. 
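For a finite (or finitely discretized) hypothesis class, the ERM rule (19) amounts to evaluating the empirical risk of every candidate on the observed trajectory and returning the minimizer. The sketch below uses a deliberately simple toy class (exponential moving averages of the input history rather than reservoir functionals) purely to illustrate the mechanics; all choices are ours and not from the paper.

```python
import numpy as np

def empirical_risk(H, Z, Y, loss):
    """Truncated empirical risk (16): Z, Y hold (Z_{-n+1}, ..., Z_0) and (Y_{-n+1}, ..., Y_0).
    H maps a finite input history (oldest entry first) to a prediction."""
    n = len(Z)
    return np.mean([loss(H(Z[: i + 1]), Y[i]) for i in range(n)])

def erm(hypotheses, Z, Y, loss):
    """Empirical risk minimization (19) over a finite hypothesis class."""
    risks = [empirical_risk(H, Z, Y, loss) for H in hypotheses]
    return hypotheses[int(np.argmin(risks))], min(risks)

# Toy usage: hypotheses are exponential moving averages of the input history
# with different decay rates; the target is such an average plus noise.
rng = np.random.default_rng(3)
n = 500
Z = rng.standard_normal(n)
weights = lambda lam, k: lam ** np.arange(k)[::-1] * (1 - lam)
Y = np.array([weights(0.8, i + 1) @ Z[: i + 1] for i in range(n)]) + 0.05 * rng.standard_normal(n)

hypotheses = [(lambda hist, lam=lam: weights(lam, len(hist)) @ hist) for lam in (0.5, 0.8, 0.95)]
H_hat, risk = erm(hypotheses, Z, Y, loss=lambda y_pred, y: abs(y_pred - y))
print("selected decay:", (0.5, 0.8, 0.95)[hypotheses.index(H_hat)], "empirical risk:", risk)
```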
Additionally, we formulate high-probability bounds for n which provide us with convergence rates for the ERM-based estimation of RC systems that, to our knowledge, are not yet available in the literature. It is well known that in some cases (for small classes with zero Bayes risk, see for example Bartlett et al., 2006) the argument that we just discussed results in unreasonably slow rates. We defer the discussion of possible refinements of the rates obtained in this paper to future projects. 3.2 Learning procedure for reservoir systems The following paragraphs describe the implementation of the empirical risk minimization procedure in the setting of reservoir computing. We spell out the assumptions needed to derive the results in the next sections, construct the hypothesis classes, and set up the ERM learning strategy for the different families of reservoir systems discussed in Section 2.3. Input and target stochastic processes. For both the input Z and the target Y processes we assume a causal Bernoulli shift structure (see for instance Dedecker et al., 2007; Alquier and Wintenberger, 2012). More precisely, for I = y, z and q I N+ suppose that the so-called causal functional GI : (Rq I)Z Do I (with oz = d and Doy = Rm) is measurable and that ξ = ((ξy t , ξz t ))t Z are independent and identically distributed Rqy Rqz-valued random variables. We assume then that the input Z and target processes Y are Bernoulli shifts, that is, they are the (strictly) stationary processes determined by Zt = Gz(. . . , ξz t 1, ξz t ), t Z , Yt = Gy(. . . , ξy t 1, ξy t ), t Z , (23) with E[ Z0 2] < , E[ Y0 2] < . Many processes derived from stationary innovation sequences have causal Bernoulli shift structure including some that are of non-mixing type (see for instance the introduction in Dedecker et al., 2007). In order to obtain risk bounds for reservoir functionals as learning models, we need to additionally impose assumptions on the weak dependency of the processes (23). More specifically, each of the three main results provided in the next section is formulated under a different weak dependence assumption which we now spell out in detail. We start with the strongest assumption of the three but which will allow us to obtain the strongest conclusions in terms of risk bounds for reservoir systems. Assumption 1 For I = y, z the functional GI is LI-Lipschitz continuous when restricted to (ℓ1,w I (Rq I), 1,w I) for some strictly decreasing weighting sequence w I : N (0, 1] with finite mean, that is, P j N jw I j < . More specifically, there exists LI > 0 such that for all u I = (u I t )t Z ℓ1,w I (Rq I) and v I = (v I t )t Z ℓ1,w I (Rq I) it holds that GI(u I) GI(v I) 2 LI u I v I 1,w I . (24) Additionally, let the innovations in (23) satisfy E[ ξI 0 2] < for I = y, z. Gonon, Grigoryeva, and Ortega The following example shows that one can easily construct causal Bernoulli shifts using reservoir functionals. Example 1 (Causal Bernoulli shifts out of reservoir functionals) Let I {y, z} and consider a reservoir system of the type (3) (see also the examples in Section 2.3) determined by the Lipschitz-continuous reservoir map F : DN Dd DN with Dd Rd, DN RN. Assume, additionally, that F is a r-contraction on the first entry and denote by L > 0 the Lipschitz constant of F with respect to the second entry. Let w : N (0, 1] be a strictly decreasing weighting sequence with finite mean, that is P j N jwj < , and a finite associated inverse decay ratio Lw (see Section 2.1). 
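As a toy illustration of Assumption 1 (our own example, not one from the paper), the linear causal Bernoulli shift $Z_t = \sum_{j \ge 0} \rho^j \xi_{t-j}$ with i.i.d. innovations is 1-Lipschitz with respect to the $(1, w)$-norm built from the geometric weighting sequence $w_j = \rho^j$, which is strictly decreasing, equals 1 at $j = 0$, and has finite mean. The sketch below checks the Lipschitz bound (24) numerically on a truncated past.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, T = 0.7, 400                      # truncate the infinite past at T lags
w = rho ** np.arange(T)                # geometric weighting sequence w_j = rho^j

def G(u):
    """Causal functional of the innovations: G(..., u_{-1}, u_0) = sum_j rho^j u_{-j}.
    u is given as an array (u_{-T+1}, ..., u_0), most recent entry last."""
    return np.sum(w * u[::-1])

def weighted_1norm(u):
    """(1, w)-norm of a truncated semi-infinite sequence, most recent entry last."""
    return np.sum(w * np.abs(u[::-1]))

# Numerical check of the Lipschitz bound (24) with L_z = 1.
for _ in range(5):
    u, v = rng.standard_normal(T), rng.standard_normal(T)
    assert abs(G(u) - G(v)) <= weighted_1norm(u - v) + 1e-12
print("Assumption 1 holds with L_z = 1 for this toy Bernoulli shift")
```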
Let now Vd (Dd)Z ℓw,1 (Rd) be a time-invariant set and consider inputs z Vd. Suppose that the reservoir system (3) has a solution (x0, z0) (DN)Z Vd, that is, x0 t = F(x0 t 1, z0 t ), for all t Z . Then, by Theorem 4.1 and Remark 4.4 in Grigoryeva and Ortega (2019), if then the reservoir system associated to F with inputs in Vd has the echo state property and hence determines a unique continuous, causal, and time-invariant reservoir filter UF : (Vd, 1,w) ((DN)Z , 1,w) which is Lipschitz-continuous with constant LUF := L 1 r Lw . It hence also has the fading memory property with respect to w. The Lipschitz continuity of the filter UF implies that the associated functional HUF is also Lipschitz-continuous with the same Lipschitz constant ((see Proposition 3.7 in Grigoryeva and Ortega, 2019). Taking GI := HUF , it is easy to see that (24) indeed holds with LI = LUF . The next assumption is weaker and it is satisfied by many discrete-time stochastic processes. The results that we obtain in the following sections invoking this type of weak dependence will be also less strong than under Assumption 1. Assumption 2 For I = y, z denote by (eξI t )t Z an independent copy of (ξI t )t Z and define θI(τ) := E[ GI(. . . , ξI 1, ξI 0) GI(. . . , eξI τ 1, eξI τ, ξI τ+1, . . . , ξI 0) 2], τ N+. (25) Assume that for I = y, z there exist λI (0, 1) and CI > 0 such that it holds that θI(τ) CIλτ I, for all τ N+. (26) Remark 4 Note that whenever the weighting sequence w I in Assumption 1 can be chosen to be a geometric one, that is, w I j = λj I, j N with λI (0, 1), then Assumption 2 is also automatically satisfied. The argument proving this appears for instance in the proof of part (i) of Corollary 8. The following example illustrates that for many widely used time series models this assumption does hold. In particular, we show that vector autoregressive VARMA processes with time-varying coefficients under mild conditions and, in particular, GARCH processes satisfy Assumption 2. Risk Bounds for Reservoir Computing Example 2 (VARMA process with time-varying coefficients) Suppose Z = (Zt)t Z is a vector autoregressive process of first order with time-varying coefficients, which we write as Zt = At Zt 1 + ηt, t Z , (27) where (ηt)t Z IID with ηt Rd and E[ η0 2] < , and where (At)t Z IID with At Md and E[|||A0|||2] < 1. Under these hypotheses (see for instance Brandt, 1986; Bougerol and Picard, 1992, Theorem 1.1) there exists a unique stationary process satisfying (23) and (27) and E[ Z0 2] < . Iterating (27) yields Z0 = η0 + A0η 1 + + A0 A τ+1Z τ, and so by definition, using the independence of (At)t Z and stationarity, one gets θz(τ) 2E[|||A0 A τ+1|||2]E[ Z τ 2] 2E[|||A0|||2]τE[ Z0 2]. We now define Cz := 2E[ Z0 2], λz := E[|||A0|||2] and immediately obtain that (26) indeed holds, that is, for all τ N+ θz(τ) Czλτ z, as required. We now consider a concrete example of an autoregressive process of the type (27) which is extensively used to describe and eventually to forecast the volatility of financial time series, namely the generalized autoregressive conditional heterostedastic (GARCH) family (Engle, 1982; Bollerslev, 1986; Francq and Zakoian, 2010). Example 3 (GARCH process) Consider a GARCH(1,1) model given by the following equations: rt = σtεt, εt IID(0, 1), t Z (28) σ2 t = ω + αr2 t 1 + βσ2 t 1, t Z (29) with parameters that satisfy α, β, ω 0, α + β < 1, which guarantees the second order stationarity of the process (rt)t Z and the positivity of the conditional variances (σ2 t )t Z . 
We now check that the GARCH(1,1) process in (28)-(29) falls in the framework (27) introduced in the previous example. Let $d = 2$ and define
$$Z_t := \begin{pmatrix} r_t^2 \\ \sigma_t^2 \end{pmatrix}, \qquad \eta_t := \begin{pmatrix} \omega \varepsilon_t^2 \\ \omega \end{pmatrix}, \qquad A_t := \begin{pmatrix} \alpha \varepsilon_t^2 & \beta \varepsilon_t^2 \\ \alpha & \beta \end{pmatrix}.$$
It is easy to verify that with this choice of matrix $A_t$, $t \in \mathbb{Z}$, one has $\mathbb{E}[|||A_0|||_2] = \mathbb{E}[\alpha \varepsilon_0^2 + \beta] = \alpha + \beta < 1$, by the stationarity condition. Additionally, $\mathbb{E}[\|\eta_0\|_2] = \omega \, \mathbb{E}\big[\sqrt{\varepsilon_0^4 + 1}\big] \le \omega \, \mathbb{E}[\varepsilon_0^2 + 1] = 2\omega < \infty$. Hence, the GARCH(1,1) model in (28)-(29) can be represented as (27) and automatically satisfies Assumption 2.

Assumption 3 Assume that for $I = y, z$ there exist $\alpha_I \in (0, \infty)$ and $C_I > 0$ such that
$$\theta_I(\tau) \le C_I \tau^{-\alpha_I}, \text{ for all } \tau \in \mathbb{N}_+, \quad (30)$$
with $\theta_I$ as in (25).

Example 4 (ARFIMA process) Let $d \in (0, \tfrac{1}{2})$ and suppose that $Z = (Z_t)_{t \in \mathbb{Z}}$ is an autoregressive fractionally integrated moving average ARFIMA$(0, d, 0)$ process (see, for instance, Hosking, 1981; Beran, 1994, for details). The process $Z$ admits an infinite moving average (MA($\infty$)) representation $Z_t = \sum_{k=0}^{\infty} \varphi_k \varepsilon_{t-k}$, $t \in \mathbb{Z}$, with innovations $(\varepsilon_t)_{t \in \mathbb{Z}} \sim \mathrm{IID}(0, 1)$ and where the coefficients are given by $\varphi_k = \frac{\Gamma(k+d)}{\Gamma(k+1)\Gamma(d)}$, so that $\Gamma(d) k^{1-d} \varphi_k \to 1$ as $k \to \infty$. Using this asymptotic behavior and the independence of the innovations one obtains
$$\theta_z(\tau) \le \Big( 2 \sum_{k=\tau}^{\infty} \varphi_k^2 \Big)^{1/2} \le \sqrt{2} \, \sup_{l \in \mathbb{N}_+} \{ l^{1-d} \varphi_l \} \Big( \sum_{k=\tau}^{\infty} k^{2d-2} \Big)^{1/2}.$$
Comparing the sum to the integral $\int_{\tau}^{\infty} x^{2d-2} \, dx = \frac{1}{1-2d} \tau^{2d-1}$, it is easy to see that (30) is satisfied with $\alpha_z = \tfrac{1}{2} - d$.

Hypothesis classes of reservoir maps $\mathcal{F}_{RC}$ and reservoir functionals $\mathcal{H}_{RC}$. The next step in order to set up the learning program in the context of reservoir systems is to construct the associated hypothesis classes. These classes need to be chosen beforehand and consist of candidate functionals associated to causal time-invariant reservoir filters of the type discussed in Sections 2.2 and 2.3. For fixed $N, d \in \mathbb{N}_+$, consider a class $\mathcal{F}_{RC}$ of reservoir maps $F : D_N \times D_d \to D_N$, $0 \in D_N \subset \mathbb{R}^N$, $D_d \subset \mathbb{R}^d$, that we assume is (a subset of and) separable in the space of bounded continuous functions when equipped with the supremum norm and, additionally, satisfies the following assumptions:

Assumption 4 There exist $r \in (0, 1)$ and $L_R > 0$ such that for each $F \in \mathcal{F}_{RC}$: (i) for any $z \in D_d$, $F(\cdot, z)$ is an $r$-contraction, (ii) for any $x \in D_N$, $F(x, \cdot)$ is $L_R$-Lipschitz.

Assumption 5 For any $F \in \mathcal{F}_{RC}$ the (first equation in the) system (3) has the echo state property. If $H^F$ is the functional associated to it, we assume that $H^F$ is measurable with respect to the Borel σ-algebra associated to the product topology on its domain.

Notice that if the state space $D_N$ is a closed ball, then Assumption 4 implies Assumption 5 by Proposition 1. This implication holds for any reservoir system with bounded reservoir maps, an example of which are the elements of the echo state networks family with bounded activation functions σ in Section 2.3.

Assumption 6 There exists $M_F > 0$ such that $\|H^F(z)\|_2 \le M_F$ for all $z \in (D_d)^{\mathbb{Z}_-}$ and for each $F \in \mathcal{F}_{RC}$.

This assumption automatically holds for many families of reservoir systems. We carefully addressed this question in Section 2.3, where we discussed various families and input types for which the reservoir functionals are indeed bounded. For example, Assumption 6 is satisfied by construction in the case of bounded inputs for all the families in Section 2.3. In the presence of generic unbounded inputs, Assumption 6 obviously holds for echo state networks (ESN) with bounded activation function. In addition, the condition in Assumption 6 appears in many applications. For instance, in the recent paper by Verzelli et al.
(2019) it is shown that using a so-called self-normalizing activation function allows one to achieve high performances in standard benchmark tasks. It is not difficult to see that self-normalizing functions yield HF (z) 2 1. Our assumptions also guarantee that various suprema over the classes HRC and FRC that will appear in the sequel are measurable random variables. There are very general conditions that guarantee such a fact holds (see Dudley, 2014, Corollary 5.25) but here we simply assume that HF for all F FRC is bounded (see Assumption 6) and that FRC is separable in the space of bounded continuous functions when equipped with the supremum norm. This condition together with the continuity assumptions imposed below on the loss function allows us to conclude the measurability of the suprema over HRC and FRC (see Lemma 17 in Appendix 5.1 for the details). Once we have spelled out Assumptions 4-6 that define the class FRC, we proceed to construct the corresponding hypothesis class of reservoir functionals HRC. Since in most of the cases considered in the literature the readouts h in (3) are either polynomial (as in the case of reservoir systems with linear reservoir maps and polynomial readouts) or linear (as in the case of reservoir systems with linear reservoir maps and linear readouts, ESNs, and SAS in Section 2.3), we shall treat the case of generic Lipschitz readouts and the linear case separately: (i) Reservoir functionals hypothesis class HRC with Lipschitz readouts. We consider a set FO of readout maps h: DN Rm that are Lipschitz-continuous with Lipschitz constant Lh > 0. We assume that for all the members of the class it holds that Lh Lh and h(0) 2 Lh,0, for some fixed Lh, Lh,0 > 0, and that the class contains the zero function and is separable in the space of bounded continuous functions when equipped with the supremum norm. In this situation, we define the hypothesis class HRC of reservoir functionals as HRC := {H : (Dd)Z Rm | H(z) = h(HF (z)), h FO, F FRC}. (31) (ii) Reservoir functionals hypothesis class HRC with linear readouts. Most of the examples of reservoir systems which we discussed in Section 2.3 are constructed using linear readout maps, which are known to be easier to train and popular in many practical applications. We hence treat this case separately. Let now the readouts h be given by maps of the type h(x) = Wx + a, x DN, with W Mm,N and a Rm. We assume that for all the members of the class it holds that |||W|||2 Lh and h(0) 2 = a 2 Lh,0, for some fixed Lh, Lh,0 > 0. In this case, such a class of readouts is automatically separable in the space of bounded continuous functions when equipped with the supremum norm. In this situation we hence define the hypothesis Gonon, Grigoryeva, and Ortega class HRC of reservoir functionals as HRC := {H : (Dd)Z Rm | H(z) = WHF (z) + a, W Mm,N, a Rm, |||W|||2 Lh, a 2 Lh,0, F FRC}. (32) Loss function. The choice of loss function is often key to the success in quantifying risk bounds for learning models. In this paper we work with distance-based loss functions of the form i=1 fi(xi yi), (33) for x, y Rm, where for each i {1, . . . , m}, the so-called representing functions fi : R R are all Lipschitz-continuous with the same Lipschitz constant LL/ m and satisfy fi(0) = 0. 
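A concrete loss of the form (33) can be sketched as follows (the Huber-type representing functions are our own illustrative choice); each $f_i$ is $L_L/\sqrt{m}$-Lipschitz and vanishes at the origin, so the resulting loss inherits the Lipschitz property stated next.

```python
import numpy as np

def make_loss(LL=1.0, m=1, delta=1.0):
    """Distance-based loss L(x, y) = sum_i f_i(x_i - y_i) with Huber-type
    representing functions: each f_i is (LL / sqrt(m))-Lipschitz and f_i(0) = 0."""
    c = LL / np.sqrt(m)
    def f(u):                                   # 1-Lipschitz Huber function with f(0) = 0
        a = np.abs(u)
        return np.where(a <= delta, a ** 2 / (2 * delta), a - delta / 2)
    def L(x, y):
        return float(np.sum(c * f(np.atleast_1d(x) - np.atleast_1d(y))))
    return L

loss = make_loss(LL=2.0, m=3)
print(loss(np.array([1.0, 0.5, -0.2]), np.zeros(3)))
```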
The assumption of Lipschitz-continuity on the loss L in the case in which its codomain is restricted to R+ guarantees that it is also a Nemitski loss of order p = 1 (see Christmann and Steinwart, 2008, for detailed discussion of Nemitski losses and their associated risks). Notice that our assumptions imply in particular that E[|L(0, Y0)|] < (34) and |L(x, y) L(x, y)| LL( x x 2 + y y 2), x, x, y, y Rm. (35) Additionally, we notice that since we restrict to reservoir systems satisfying the echo state property and the hypothesis class HRC contains their associated reservoir functionals, for H = h HF the idealized empirical risk (17) can be written as b R n (H) = 1 i=0 L(h(HF (Z i )), Y i) = 1 i=0 L(UF h (Z 0 ) i, Y i) = 1 i=0 L(h(X i), Y i), where X is the solution of the reservoir system Xt = F(Xt 1, Zt), t Z . Risk consistency and risk bounds of reservoir systems. As discussed in Section 3.1 we are interested in generalization error bounds or, in particular, in deriving uniform bounds for n = sup H HRC{R(H) b Rn(H)}. In order to proceed, we first decompose n and write n = sup H HRC{R(H) b Rn(H)} sup H HRC{R(H) b Rn(H) b R n (H) + b R n (H)} n b R n (H) b Rn(H) o + sup H HRC n R(H) b R n (H) o b Rn(H) b R n (H) + sup H HRC R(H) b R n (H) . This means that one can find upper bounds for both n and n = sup H HRC | b Rn(H) R(H)| by controlling the two summands in the right hand side of the last inequality. Risk Bounds for Reservoir Computing Coming back to the example of the ERM procedure that we discussed in Section 3.1, the previous expression can also be used to deduce the weak (essentially universal) riskconsistency of the ERM for the class HRC. More specifically, in line with classical results due to Vapnik (1991), from the inequalities (22) it follows that in order to establish the weak (essentially universal) risk-consistency of ERM for reservoir functionals one simply needs to show that for any ϵ, δ > 0 there exists n0 N+ such that for all n n0 it holds that R(H) b Rn(H) > ϵ Whenever the inputs are i.i.d. (which is a particular case of our setup), this definition is essentially equivalent to saying that HRC is a (weak) uniform Glivenko-Cantelli class (see for example Mukherjee et al., 2006). From expression (36) it also follows then that in order to establish the (weak) risk-consistency, it suffices to show the two-sided uniform convergence over the class HRC of reservoir functionals, first, of the truncated versions of empirical risk to their idealized counterparts and, second, of these idealized versions of the empirical risk to the generalization error (or statistical risk). More explicitly, we shall separately show that for any ϵ1, δ1 > 0 and ϵ2, δ2 > 0 there exist n1 N+ and n2 N+ such that for all n n1 and n n2, respectively, it holds that b Rn(H) b R n (H) > ϵ1 R(H) b R n (H) > ϵ2 One needs to start by showing that the suprema of both these differences over the class HRC are indeed random variables. This fact has been proved in Lemma 17 in the Appendix 5.1. Next, we need to show that the difference between the idealized and the truncated empirical risks can be made as small as one wants by choosing an appropriate length n N+ of the training sample. This fact is contained in the following result. Proposition 5 Consider the hypothesis class HRC of reservoir functionals defined in (31). Define C0 := 2r LLLh MF Then, for any n N+ b Rn(H) b R n (H) C0(1 rn) holds P-a.s. This proposition implies that (37) indeed holds. 
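The convergence quantified in Proposition 5 can also be observed numerically. The sketch below (illustrative ESN-type functional, AR(1) input, and one-step-ahead target, all our own choices rather than the paper's) compares the truncated empirical risk (16), computed from a zero-padded sample as in (15), with the idealized empirical risk (17) computed from the full simulated past.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 10, 1
A = rng.standard_normal((N, N)); A *= 0.8 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, d)); zeta = rng.standard_normal(N)
W = rng.standard_normal((1, N)) / N

def functional(z_hist):
    """H(z) = W x_0 with x_t = tanh(A x_{t-1} + C z_t + zeta), started from the zero state."""
    x = np.zeros(N)
    for z_t in z_hist:
        x = np.tanh(A @ x + C @ np.atleast_1d(z_t) + zeta)
    return float(W @ x)

# One long trajectory; the last n points form the training sample, the earlier
# part plays the role of the "infinite" past used by the idealized risk.
T, n = 5000, 200
Z = np.empty(T); Z[0] = 0.0
for t in range(1, T):                      # AR(1) input, a simple causal Bernoulli shift
    Z[t] = 0.5 * Z[t - 1] + rng.standard_normal()
Y = np.roll(Z, -1); Y[-1] = 0.0            # toy target: one-step-ahead value

loss = lambda y_hat, y: abs(y_hat - y)
idx = np.arange(T - n, T)                  # indices playing the role of times -n+1, ..., 0
risk_trunc = np.mean([loss(functional(np.concatenate([np.zeros(300), Z[T - n: i + 1]])), Y[i])
                      for i in idx])
risk_ideal = np.mean([loss(functional(Z[: i + 1]), Y[i]) for i in idx])
print("truncated vs idealized empirical risk:", risk_trunc, risk_ideal)
```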
In order to complete the uniform convergence argument, we also need to show that (38) holds. Even though, in order to prove the (essentially universal) risk-consistency of the ERM procedure for reservoir systems, it is sufficient to show that the upper bounds of $\Lambda_n$ (and $\bar{\Lambda}_n$) can be made as small as one wants as $n \to \infty$, for hyper-parameter selection in practical applications the availability of non-asymptotic bounds is also of much importance. We see in the following section that, depending on the particular weak dependence assumption imposed (Assumptions 1-3), we will be able to use more or less strong concentration inequalities that yield finite-sample bounds with different rates of convergence.

4. Main Results

In this section we provide high-probability risk bounds for reservoir computing systems. The main ingredients of the probability bounds are the expected values of $\Gamma_n := \sup_{H \in \mathcal{H}_{RC}}\{R(H) - \widehat{R}^*_n(H)\}$ and of $\bar{\Gamma}_n := \sup_{H \in \mathcal{H}_{RC}} |R(H) - \widehat{R}^*_n(H)|$, which are the largest difference between the generalization error and the idealized empirical risk over the class $\mathcal{H}_{RC}$ and the largest absolute value of this difference, respectively. Since the random variables in the training sample of the input and the output discrete-time processes are not independent and identically distributed, bounding the expected values of $\Gamma_n$ and $\bar{\Gamma}_n$ is a challenging task. In the first subsection we show that this problem may be circumvented using the following idea: one may compute the empirical risk by partitioning the training sample into blocks of appropriate length and then exploiting the weak dependence of the input and output stochastic processes spelled out in Assumptions 1-3. We first make use of this block-partitioning idea in order to derive bounds for the expected values of the random variables $\Gamma_n$ and $\bar{\Gamma}_n$ in the setting of each of those three assumptions. These bounds are expressed in terms of the so-called Rademacher complexities of the reservoir hypothesis classes and the weak dependence coefficients of the input and the target stochastic processes. We provide details concerning the complexity bounds for particular families in the second subsection. In the third subsection we then use the fact that the random fluctuations of $\Gamma_n$ and $\bar{\Gamma}_n$ around their expected values can be controlled using concentration inequalities, which, as we show further on, can be done either with the help of the Markov inequality under the weaker Assumptions 2-3, or using stronger exponential concentration inequalities (Propositions 19 and 20) under the stronger Assumption 1. This approach yields explicit expressions for non-asymptotic high-probability bounds for $\bar{\Lambda}_n$ and hence for $\Lambda_n$, which we spell out in the third subsection. Finally, showing that these upper bounds can be made as small as one wants as $n \to \infty$ proves the desired (weak and essentially universal) risk-consistency of the ERM-selected reservoir systems used as learning models. All the proofs of the main results given in this section are provided in the appendices.

4.1 Bounding the expected value

The main ingredient that needs to be introduced in order to bound the expected values of both $\Gamma_n$ and $\bar{\Gamma}_n$ is a complexity measure for the hypothesis classes of reservoir functionals $\mathcal{H}_{RC}$. Many complexity measures have been discussed in the literature in recent years (see for example Vapnik and Chervonenkis, 1968; Ledoux and Talagrand, 1991; Bartlett and Mendelson, 2003; Ben-David and Shalev-Shwartz, 2014; Rakhlin et al., 2010).
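Before introducing the particular complexity measure used in this paper, the block-partitioning step announced above can be made concrete with a short sketch: a sample of length $n$ is split into $k = \lfloor n/\tau \rfloor$ consecutive blocks of length $\tau$ plus a remainder of $n - k\tau$ points that is handled separately in the proofs, and $\tau$-spaced representative indices are nearly independent when the data are weakly dependent and $\tau$ is large. The helper name and the choice of the first index of each block as representative are illustrative.

```python
def block_partition(n, tau):
    """Split the time indices 0, ..., n-1 into k = n // tau consecutive blocks of
    length tau plus a remainder of n - k*tau indices. One representative index per
    block (here its first index) is also returned; for weakly dependent data such
    tau-spaced representatives behave almost like an i.i.d. sample of size k."""
    k = n // tau
    blocks = [list(range(j * tau, (j + 1) * tau)) for j in range(k)]
    remainder = list(range(k * tau, n))
    representatives = [block[0] for block in blocks]
    return blocks, remainder, representatives

blocks, remainder, reps = block_partition(n=20, tau=6)
print(len(blocks), remainder, reps)   # 3 [18, 19] [0, 6, 12]
```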
In this paper we use the so-called (multivariate) Rademacher-type complexity associated to a given HRC, which we denote as Rk(HRC). More explicitly, let k N+ and consider ε0, . . . , εk 1 Risk Bounds for Reservoir Computing independent and identically distributed Rademacher random variables and let e Z(j), j = 0, . . . , k 1, denote independent copies of Z (ghost processes), which are also independent of ε0, . . . , εk 1. The Rademacher-type complexity Rk(HRC) over k ghost processes is defined as Rk(HRC) = 1 j=0 εj H(e Z(j)) Note that Rk(HRC) is not an empirical Rademacher complexity and that the expectation is taken with respect to all randomness. In this paper we do not use the standard approach consisting in bounding the theoretical Rademacher complexity using its empirical analogue (conditional on (e Z(j))j {0,...,k 1}), since in the context of reservoir systems the ghost processes e Z(j) have no empirical interpretation due to the fact that it is usually only a single trajectory and not i.i.d. samples of input data which are available to the learner. The following results provide upper bounds for the expected and the expected absolute value of the largest deviation of the statistical risk from its idealized empirical analogue within the hypothesis class of reservoir functionals HRC. The two upper bounds (41) and (42) in the next proposition share the same first three terms, up to a factor 2 due to the absolute value. The first term is related to the weak dependence coefficients of the input and target signals, θz and θy, respectively. The second term involves the Rademacher-type complexity (40) of the hypothesis class of reservoir functionals HRC. Finally, the third term is always of order τ n, where τ is the block length, which needs to be carefully chosen depending on the rates of decay of θz and θy, as we show later in Corollary 8. The upper bound for the expected absolute value (42) contains an additional term of order τ n. Proposition 6 Let HRC be the hypothesis class of reservoir functionals associated to the reservoir maps in the class FRC as given in (31) or (32). Let both the input process Z and the target process Y have a causal Bernoulli shift structure as in (23) and take values in (Dd)Z and (Rm)Z , respectively. Then, there exist B > 0, M > 0, and {aτ}τ N+ with aτ (0, ), such that for any τ, n N+ with τ < n it holds that n R(H) b R n (H) o# n Rk(HRC) + 2M(n kτ) where k = n/τ and R(H) b R n (H) n aτ + 2Bkτ n Rk(HRC) + 2M(n kτ) k n LLE Y0 2 2 1/2 . (42) In these expressions, the Rademacher complexity Rk(HRC) of the hypothesis class HRC of reservoir functionals is defined as in (40), the constants and the sequence {aτ}τ N+ can be explicitly expressed as B = 2 m LL, (43) Gonon, Grigoryeva, and Ortega M = LLLh MF + E[|L(0, Y0)|] + Lh,0LL, (44) aτ = LL(2rτMFLh + θy(τ) + LRLh l=0 rlθz(τ l)), (45) where for I = y, z the weak dependence coefficients θI for τ N+ are defined as in (25), namely, θI(τ) = E[ GI(. . . , ξI 1, ξI 0) GI(. . . , ξ I τ 1, ξ I τ, ξI τ+1, . . . , ξI 0) 2], (46) with (ξ I t )t Z an independent copy of (ξI t )t Z . We now explore the conditions required for the upper bounds in (41) and (42) to be finite and exhibit a certain decay as a function of τ and n. Notice that (up to a factor 2) the right-hand sides of the two inequalities are equal up to the last summand in (42) and hence one needs to impose that E Y0 2 2 < . 
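Although the bounds above are stated in terms of the theoretical Rademacher-type complexity, one can get a feel for its size by plain Monte Carlo whenever the ghost processes can be simulated. The sketch below replaces the supremum over $\mathcal{H}_{RC}$ by a maximum over a finite, randomly drawn family of linear readouts on a fixed reservoir (which can only underestimate the supremum) and reads the quantity inside the expectation in (40) as the Euclidean norm of $\frac{1}{k}\sum_{j} \varepsilon_j H(\widetilde{\mathbf{Z}}^{(j)})$; this reading, as well as every distribution, size, and the finite readout family, is an assumption made for illustration only and should be checked against the exact definition.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m, k, T = 20, 2, 1, 40, 100
A = rng.normal(size=(N, N)); A *= 0.7 / np.linalg.norm(A, 2)
C = rng.normal(size=(N, d)); zeta = np.zeros(N)

def reservoir_state(z_hist):
    """Run the echo state recursion over a finite ghost trajectory from the zero state."""
    x = np.zeros(N)
    for z in z_hist:
        x = np.tanh(A @ x + C @ z + zeta)
    return x

# A finite family of linear readouts standing in for the supremum over the class (32).
readouts = [rng.normal(size=(m, N)) * 0.1 for _ in range(200)]

def rademacher_complexity_mc(n_mc=50):
    """Monte Carlo estimate of a Rademacher-type complexity: for each draw, simulate k
    independent ghost input trajectories, draw i.i.d. signs eps_j, and take the maximum
    over the readout family of || (1/k) * sum_j eps_j * W @ H_F(ghost_j) ||_2."""
    vals = []
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=k)
        states = np.stack([reservoir_state(rng.uniform(-1, 1, size=(T, d))) for _ in range(k)])
        avg = (eps[:, None] * states).mean(axis=0)      # (1/k) sum_j eps_j H_F(ghost_j)
        vals.append(max(np.linalg.norm(Wr @ avg) for Wr in readouts))
    return float(np.mean(vals))

print(rademacher_complexity_mc())
```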
Additionally, in order to better understand the behavior of both bounds as a function of τ and n one needs to study two more ingredients, namely the sequence {aτ}τ N+ and the Rademacher complexity Rk(HRC). The behavior of the sequence {aτ}τ N+ is exclusively determined by the properties of the input and target processes, while the Rademacher complexity of HRC is fully characterized by the type of reservoir and readout maps of the given family of reservoir systems. In the following remark we argue that under either the stronger Assumption 1 or the weaker Assumptions 2-3 the sequence {aτ}τ N+ converges to zero. Remark 7 A condition guaranteeing that the sequence {aτ}τ N+ in (45) converges to zero is, for instance, that X τ=1 θI(τ) < , I = y, z. (47) To verify this, observe that this condition also implies that l=0 rlθz(τ l) = τ=l+1 θz(τ l) = 1 1 r τ=1 θz(τ) < , where we used that r (0, 1). This proves that P τ=1 aτ < , which necessarily implies that limτ aτ = 0 as required. It is easy to verify that under Assumption 1 one has that τ=1 2LIE[ ξI 0 2] j=τ w I j = 2LIE[ ξI 0 2] = 2LIE[ ξI 0 2] j=1 jw I j < , which immediately implies that condition (47) is satisfied. Additionally, notice that under Assumption 2, condition (47) is also automatically satisfied. However, under Assumption 3 condition (47) may not be satisfied, but a straightforward argument (see the proof of part (iii) of Corollary 8) shows that limτ aτ = 0. Risk Bounds for Reservoir Computing We just argued that under any of the three assumptions 1-3, the convergence to zero of the sequence {aτ}τ N+ is guaranteed, which implies that if we establish the finiteness and a certain decay of the Rademacher complexity term we shall have proved that the upper bounds given in (41) and (42) are finite and tend to 0 as n . The rate of convergence of these bounds is however affected by the particular dependence assumption adopted. We address this important issue in the following corollary where we assume that the Rademacher complexity is finite and exhibits a certain decay and we prove decay rates for the bounds in (41) and (42) that are valid under the different assumptions 1-3. The boundedness of the Rademacher complexities is studied in detail later on in Section 4.2 for the different hypothesis classes of reservoir systems that we introduced in Section 2.3 Corollary 8 Assume that there exists CRC > 0 such that for all k N+ the Rademachertype complexity Rk(HRC) of the class HRC of reservoir functionals satisfies Rk(HRC) CRC Consider the following three cases that correspond to Assumptions 1, 2, and 3, respectively: (i) Suppose that Assumption 1 holds and that, additionally, for I = y, z the weighting se- quences w I : N (0, 1] are such that the associated decay ratios Dw I := supi N w I i+1 w I i 1. Let λmax := max(r, Dwy, Dwz). Then, there exist C1, C2, C3, C3,abs > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) it holds that n R(H) b R n (H) o# n + C2log(n) log(n) n (49) R(H) b R n (H) n + C2log(n) n + C3,abs p log(n) n . (50) The constants can be explicitly chosen as C1 = 2MFLLLh + LLCy λmax , C2 = 2M log(λ 1 max) + LLLRLh Cz λmax log(λ 1 max), (51) C3 = 2 m LLCRC q log(λ 1 max) , C3,abs = 2C3 + 4LLE Y0 2 2 1/2 q log(λ 1 max) , (52) where M is as in (44) and CI = 2LIE[ ξI 0 2] 1 Dw I for I = y, z. (ii) Suppose that Assumption 2 holds and let λmax := max(r, λy, λz) with λy, λz as in (26). Then there exist C1, C2, C3, C3,abs > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) the bounds in (49) and (50) hold. 
The constants can be explicitly chosen as in (51)-(52) with CI as in (26). Gonon, Grigoryeva, and Ortega (iii) Suppose that Assumption 3 holds and denote α := min(αy, αz). Then there exist C1, C2, C1,abs > 0 such that for all n N+ it holds that n R(H) b R n (H) o# C1n 1 2+α 1 + C2n 2 2+α 1 R(H) b R n (H) C1,absn 1 2+α 1 + C2n 2 2+α 1 . The constants can be explicitly chosen as C1 = LL(2MFLhr γα + LRLh Cz Cα + Cy) + BCRC, C2 = 2M, (53) C1,abs = C1 + 4LLE Y0 2 2 1/2 + BCRC, (54) with M, B as in (43)-(44) and γα = max τ N+ Cα = max(2αz, r γα)(1 r) 1, (56) and CI, αI, for I = y, z as in (30). 4.2 Rademacher complexity of reservoir systems In this section we show that for the most important hypothesis classes of reservoir systems, the Rademacher complexity tends to 0 as k at the rates required in (48) of Corollary 8. More specifically, in the next propositions we will provide upper bounds for the Rademacher complexities of the most popular reservoir families that we spelled out in Section 2.3. For these propositions we assume that the corresponding parameter set Θ is separable in the respective (Euclidean) space. This is required in order to ensure that the supremum of random variables appearing in the definition of the Rademacher complexity (40) is again a random variable. For instance, if Θ is an open set, then it is separable. Reservoir systems with linear reservoir map (LRC) and linear readout We now provide a bound for the Rademacher complexity of classes of reservoir functionals associated to linear reservoir maps and readouts. We recall that in this case we always work with uniformly bounded inputs Z (see Section 2.3) by some constant M > 0, that is, Dd = BM and so the random variable Z takes values in the set KM := n z (Rd)Z | zt 2 M for all t Z o . (57) Proposition 9 Let N, d N+ and let Θ MN MN,d RN. Define the classes of linear reservoir maps as FRC := {F A,C,ζ | (A, C, ζ) Θ} Risk Bounds for Reservoir Computing and let HRC be a class of reservoir functionals of the type defined in (32), associated to reservoir systems with linear reservoir maps and readouts. Additionally, define λA max := sup (A,C,ζ) Θ |||A|||2, (58) λC max := sup (A,C,ζ) Θ |||C|||2, (59) λζ max := sup (A,C,ζ) Θ ζ 2. (60) If for the class FRC it holds that 0 < λA max < 1, λC max < , λζ max < , (61) then Assumptions 4-6 are satisfied and the Rademacher complexity of the associated class of reservoir functionals satisfies Rk(HRC) CLRC for any k N+, where CLRC = Lh 1 λAmax λC max E h Z0 2 2 i1/2 + λζ max + Lh,0, (63) and with Z the input process. Remark 10 Due to the uniform boundedness of the inputs, the constant E h Z0 2 2 i1/2 in (63) is bounded by the value M that defines the set KM in which the inputs take values. Nevertheless, E h Z0 2 2 i1/2 can obviously be much smaller than M. Echo State Networks (ESN) The following proposition provides an estimate for the Rademacher complexity of hypothesis classes constructed using echo state networks. We recall that in this case we either work with arbitrary inputs and a bounded activation function σ or with a possibly unbounded activation function σ and uniformly bounded inputs Z (see Section 2.3) by some constant M > 0, that is, Dd = BM. Proposition 11 Let N, d N+, let Θ MN MN,d RN be a subset, and let FRC be a family of echo state reservoir systems defined as FRC := {F σ,A,C,ζ | (A, C, ζ) Θ} and let HRC be a class of reservoir functionals of the type defined in (32), associated to reservoir systems with linear reservoir maps and readouts. 
Suppose that the class FRC is such that for any F σ,A,C,ζ FRC one necessarily has that F σ,A,C,ζ( , ) FRC.2 Define λA max :=Lσ l=1 sup (A,C,ζ) Θ Al, , 2. This is satisfied for example if σ is odd and (A, C, ζ) Θ (A, C, ζ) Θ. Gonon, Grigoryeva, and Ortega λC max :=Lσ l=1 sup (A,C,ζ) Θ Cl, 2, λζ max :=Lσ l=1 sup (A,C,ζ) Θ |ζl|. If for class FRC it holds that 0 < λA max < 1, λC max < , λζ max < , then Assumptions 4-6 are satisfied and the Rademacher complexity of the associated class of reservoir functionals satisfies Rk(HRC) CESN k , for any k N+, (64) CESN = Lh 1 λAmax λC max E h Z0 2 2 i1/2 + λζ max + Lh,0, (65) and with Z the input process. Remark 12 Notice that by (62)-(63) and by (64)-(65) the Rademacher complexities of the hypothesis classes formed by reservoir systems with linear reservoir maps and linear readouts or by echo state networks are finite whenever the second moment of the input process is finite, which is not directly implied by any of the assumptions 1-3 and hence needs to be separately assumed. State Affine Systems (SAS) In the following proposition we provide an estimate for the Rademacher complexity of hypothesis classes constructed using state affine systems. In this case we also work with uniformly bounded inputs (see Section 2.3) in a set of the type KM as in (57) with M = 1. Proposition 13 Let Θ MN,N[z] MN,1[z], and define the class of SAS reservoir maps as FRC := {F p,q | (p, q) Θ}. Assume that there is a finite set Imax Nd such that for any P(z) = P α Nd Aαzα with P = p or P = q, (p, q) Θ one has Aα = 0 for α / Imax and define |Imax| := card(Imax), λSAS := sup (p,q) Θ |||p|||, c SAS := sup (p,q) Θ |||q|||, (66) where the norm ||| ||| was introduced in (9). Let HRC be the hypothesis class of reservoir systems with linear readouts associated to FRC as in (32). Then, if for the class FRC it Risk Bounds for Reservoir Computing holds that λSAS < 1/|Imax| and c SAS < , then Assumptions 4-6 are satisfied and for any k N+ it holds that Rk(HRC) CSAS CSAS = Lh c SAS|Imax| 1 |Imax|λSAS + Lh,0. (67) 4.3 High-probability risk bounds for reservoir systems We now use the previous results and the three assumptions 1-3 in conjunction with different concentration inequalities to produce three families of high-probability bounds for n := sup H H | b Rn(H) R(H)| of different strength for reservoir systems, which prove in passing the (weak) universal risk-consistency of ERM for reservoir functionals. High-probability finite-sample generalization RC bounds of this type were not available in the literature previously. Theorem 14 Let HRC be a hypothesis class of reservoir functionals of the type specified in (32) associated to a class FRC of reservoir maps that satisfies Assumptions 4-6 and assume that the Rademacher complexity of HRC satisfies (48). Suppose that both the input Z and the target Y processes have a causal Bernoulli shift structure as in (23) and that they take values in (Rd)Z and (Rm)Z , d, m N+, respectively. (i) Suppose that Assumption 1 is satisfied and that, additionally, for I = y, z the strictly decreasing weighting sequences w I : N (0, 1] are such that the associated decay ratios Dw I := supi N w I i+1 w I i < 1. Let λmax := max(r, Dwy, Dwz). (a) Assume that the innovations are bounded, that is, there exists M > 0 such that ξt 2 M for all t Z . 
Then there exist constants C0, C1, C2, C3, Cbd > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) and for all δ (0, 1), the following bound holds sup H HRC | b Rn(H) R(H)| (1 rn)C0 + C1 n + C2log(n) log(n) n + Cbd q where the constant C0 is explicitly given in (39), C1, C2 are given in (51), C3 in (52), and Cbd in (93). (b) Assume that for Φ(x) = xp, p > 1 or Φ(x) = exp(x) 1 the innovations possess Φ2-moments, that is, for any u > 0, E[Φ(u ξ0 2)2] < . Then there exist constants C0, C1, C2, C3 > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) and for all δ (0, 1) it holds that sup H HRC | b Rn(H) R(H)| (1 rn)C0 + C1 n + C2log(n) log(n) n + BΦ(n, δ) Gonon, Grigoryeva, and Ortega where BΦ(n, δ) is given in (112). The constants are explicitly given: C0 in (39), C1, C2 are given in (51), and C3 in (52). (ii) Suppose that Assumption 2 is satisfied and let λmax := max(r, λy, λz) with λy, λz as in (26). Then there exist constants C0, C1, C2, C3,abs > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) and for all δ (0, 1) it holds that sup H HRC | b Rn(H) R(H)| (1 rn)C0 n + C2log(n) n + C3,abs p (70) The constants are explicitly given: C0 in (39), C1, C2 are given in (51), and C3,abs in (52) with CI as in (26). (iii) Suppose that Assumption 3 is satisfied. Denote α = min(αy, αz) with αy, αz as in (30). Then there exist constants C0, C1,abs, C2 > 0 such that for all n N+, δ (0, 1), sup H HRC | b Rn(H) R(H)| (1 rn)C0 C1,absn 1 2+α 1 + C2n 2 2+α 1 ! (71) The constants are explicitly given: C0 in (39), C1,abs is given in (54) and C2 in (53) together with (55)-(56). In order to obtain explicit high-probability risk bounds for particular families of reservoir systems, one can use the bounds that we obtained for the Rademacher complexities of various families in Section 4.2. For example, let FRC be a family of state affine systems that satisfies the assumptions of Proposition 13; in that case one takes the value CRC appearing in various constants above (for example in C3 given in (52)) as CRC = CSAS with CSAS given in (67). The same applies to other families: for echo state networks one takes CRC = CESN with CESN given in (65). For the family of reservoir systems with linear reservoir map one takes CRC = CLRC, with CLRC given in (63). Remark 15 The result in part (ii) requires Assumption 2 which, as we saw in Remark 4, is implied by Assumption 1 with geometric weighting sequences, that is, wz s = λs z and wy s = λs y, for some λz, λy (0, 1). Therefore, both (68) and (70) provide bounds in this case. We also emphasize that the result in part (iii) allows the treatment of long-memory processes as inputs (see, for instance Example 4). 4.4 High-probability risk bounds for randomly generated reservoir systems We now show that the results in Theorem 14 can be reformulated for echo state networks whose parameters A, C, and ζ have been randomly generated. This statement is a theoretical justification of the good empirical properties of this standard modus operandi in reservoir computing. Even though results of this type could be formulated for all the reservoir families introduced in Section 2.3 and all the different settings considered in Theorem 14, we restrict our study in the next proposition to echo state networks and part (ii). Risk Bounds for Reservoir Computing Proposition 16 (Random reservoirs) Let A, C, ζ be independent random variables with values in MN, MN,d, and in RN, respectively. 
Consider now echo state networks that have those random values as parameters and whose activation function σ is odd, that is, consider the random class of reservoir maps defined as FRC := {F σ,ρAA,ρCC,ρζζ | (ρA, ρC, ρζ) ( a λA ) [ c, c] [ s, s]} for some a (0, 1), c, s > 0 and with Suppose also that the input process Z and the target process Y are independent of the parameter random variables A, C, ζ and that Assumption 2 is satisfied. Let λmax := max(r, λy, λz) with λy, λz as in (26). Then there exist constants C0, C1, C2, C3,abs > 0 such that for all n N+ satisfying log(n) < n log(λ 1 max) and for all δ (0, 1) it holds that sup H HRC | b Rn(H) R(H)| (1 rn)C0 n + C2log(n) n + C3,abs p (72) where HRC is the hypothesis class of reservoir functionals associated to FRC and with linear readouts as in (32). The constants are explicitly given. More specifically, C0 is given in (39), C1, C2 are given in (51), C3,abs in (52) with CI as in (26). Additionally, the constant CRC appearing in (52) is given by CRC = Lh 1 a(E[λC]E[ Z0 2 2]1/2 + E[λζ]) + Lh,0, (73) , λζ = s Lσ 5. Appendices These appendices contain preliminary results and the proofs of all the main results of the paper. 5.1 Preliminary results The following Lemma shows that the supremum appearing for instance in (68) is indeed a random variable. More precisely, sup H HRC | b Rn(H) R(H)| is a measurable mapping from (Ω, A) to R equipped with its Borel sigma-algebra. An analogous argument can be used in the case of the other suprema, for instance the supremum in (40), considered in the paper. Lemma 17 Let HRC be the reservoir hypothesis class introduced in (31) or in (32) and let R and b Rn be the statistical and empirical risk introduced in (13) and (16), respectively. Then, sup H HRC | b Rn(H) R(H)| is a random variable. Gonon, Grigoryeva, and Ortega Proof For any H HRC set (H) := | b Rn(H) R(H)|. Then, for any H, H HRC (H) ( H) | (H) ( H)| | b Rn(H) R(H) b Rn( H) + R( H)| | b Rn(H) b Rn( H)| + |R(H) R( H)| i=0 H(Z n+1 i ) H(Z n+1 i ) 2 + LLE[ H(Z) H(Z) 2] 2LL sup z (Dd)Z H(z) H(z) 2, (74) where we used the (reverse) triangle inequality and the Lipschitz property (35) of the loss function. Further, by using the definition (31) of HRC, Assumption 6 on the boundedness of the functionals associated to reservoir maps, and again the triangle inequality, one obtains sup H HRC (H) = sup H HRC | b Rn(H) R(H)| (0) + 4LL[Lh MF + Lh,0] and so (34) yields that sup H HRC | b Rn(H) R(H)| is finite, P-a.s. It remains to prove the measurability. The separability assumption imposed on FRC guarantees the existence of a countable subset {Fj}j N+ FRC which is dense with respect to the supremum norm. Let also {hk}k N+ FO be a countable dense subset of readouts that by hypothesis exists. This can be used to construct a countable dense subset of HRC. Indeed, for any H HRC, H = h(HF ), one may choose indices (jl, kl)l N+ such that Fjl F 0 and hkl h 0 as l . Consequently, using the triangle inequality and an argument as in part (iii) of Theorem 3.1 in Grigoryeva and Ortega (2018b), one obtains for any z (Dd)Z that H(z) hkl(HFjl(z)) 2 h hkl + Lh HF (z) HFjl(z) 2 h hkl + 1 1 r Fjl F . Combining this with (74) and setting Hl = hkl(HFjl) one obtains that lim l | (H) (Hl)| lim l 2LL h hkl + 1 1 r Fjl F In particular this shows that for any H HRC, (H) supj,k N+ (hk(HFj)). Taking the supremum over H HRC thus shows that sup H HRC | b Rn(H) R(H)| = sup j,k N+ (hk(HFj)) is measurable, as it is the supremum of a countable collection of random variables. 
Risk Bounds for Reservoir Computing 5.2 Proof of Proposition 1 Consider the map F : (BS)Z (Dd)Z (BS)Z (x, z) 7 (F(x, z))t := F(xt 1, zt), t Z , and endow (Dd)Z and (BS)Z with the relative topologies induced by the product topologies in (Rd)Z and (RN)Z , respectively. Notice that F can be written as t Z Ft with Ft := F pt T1 id(Dd)Z : (BS)Z (Dd)Z BS, where pt yields the canonical projection of any sequence onto its t-th component. It is easy to see that the maps pt and T1 are continuous with respect to the product topologies and hence F is a Cartesian product of continuous functions, which is always continuous in the product topology. We now recall that, by the compactness of BS, we have that (BS)Z ℓ ,w (RN) and that by Corollary 2.7 in Grigoryeva and Ortega (2018b), the product topology on (BS)Z coincides with the norm topology induced by ℓ ,w (RN), for any weighting sequence w, that we choose in the sequel satisfying the inequality r Lw < 1. We now show that F is a contraction in the first entry with constant r Lw < 1. Indeed, for any x1, x2 (BS)Z and any z (Dd)Z , we have F(x1, z) F(x2, z) ,w = sup t Z F(x1 t 1, zt) F(x2 t 1, zt) 2 w t x1 t 1 x2 t 1 2 rw t , where we used that F is a contraction in the first entry. Now, x1 t 1 x2 t 1 2 rw t = r sup t Z x1 t 1 x2 t 1 2 w (t 1) w t w (t 1) r Lw x1 x2 ,w , which shows that F is a family of contractions with constant r Lw < 1 that is continuously parametrized by the elements in (Dd)Z . In view of these facts and given that the product topology in (Dd)Z (Rd)Z is metrizable (see Munkres, 2014, Theorem 20.5) and that (BS)Z (RN)Z is compact by Tychonoff s Theorem (see Munkres, 2014, Theorem 37.3) in the product topology and hence complete, Theorem 6.4.1 in Sternberg (2010) implies the existence of a unique fixed point of F( , z) for each z (Dd)Z , which establishes the ESP. Moreover, that result also shows the continuity of the associated filter UF : (Dd)Z ((BS)Z , ,w). 5.3 Proof of Proposition 5 In order to proceed with the proof of this proposition, we first need the following lemma. Lemma 18 For any F FRC and any z, z (Dd)Z the following holds for all i N+: HF (z) HF (z) 2 2ri MF + LR j=0 rj z j z j 2. (75) Gonon, Grigoryeva, and Ortega Proof of Lemma 18. Let z, z (Dd)Z and denote by x the solution to (3) and by x the solution to (3) with z replaced by z. Then the triangle inequality and Assumption 4 on F FRC yield HF (z) HF (z) 2 = x0 x0 2 F(x 1, z0) F(x 1, z0) 2 + F(x 1, z0) F(x 1, z0) 2 LR z0 z0 2 + r x 1 x 1 2. By iterating this estimate one obtains HF (z) HF (z) 2 ri x i x i 2 + LR j=0 rj z j z j 2, from which the claim follows by Assumption 6. We now proceed to prove Proposition 5. Let e Z := Z n+1 0 and write for any H HRC | b Rn(H) b R n (H)| = i=0 L(H(Z n+1 i ), Y i) L(H(Z i ), Y i) i=0 LL h(HF (e Z i )) h(HF (Z i )) 2 i=0 LLLh(2rn i MF + LR j=0 rj e Z j i Z j i 2) n 2r LLLh MF In these derivations, the first inequality follows from the triangle inequality and the Lipschitz continuity of the loss function (35), the second one is a consequence of the Lipschitz continuity of the readout map and of (75) in Lemma 18, which finally yield the claim with the choice of constant C0 in (39). 5.4 Proof of Proposition 6 In order to simplify the notation, for any i N we define an (Rd)Z Rm-valued random variable V i as V i := (Z i , Y i) and denote its associated loss by LH(V i) := L(H(Z i ), Y i). 
We start by using the assumptions on the Lipschitz-continuity of both the loss function (35) and of the reservoir readout map and hence for any i N and H HRC write Risk Bounds for Reservoir Computing |LH(V i)| |LH(V i) L(0, Y i)| + |L(0, Y i)| LL H(Z i ) 2 + |L(0, Y i)| LL h(HF (Z i )) h(0) 2 + LL h(0) 2 + |L(0, Y i)| LLLh MF + |L(0, Y i)| + Lh,0LL. We continue by decomposing n = kτ +(n kτ) with k = n τ . For the last (n kτ) elements one estimates i=kτ {E [LH(V i)] LH(V i)} 2(n kτ)(LLLh MF + E [|L(0, Y0)|] + Lh,0LL) with M as in (44). Subsequently using the definitions of the generalization error (13) and the idealized empirical risk (17) one obtains n R(H) b R n (H) o# sup H HRC E[L(H(Z), Y0)] i=0 L(H(Z i ), Y i) sup H HRC 1 n i=0 {E[LH(V i)] LH(V i)} i=0 {E[LH(V i)] LH(V i)} E LH(V (τj+i)) LH(V (τj+i)) j=0 {E[LH(V τj)] LH(V τj)} In order to obtain a bound for the first summand in the last expression, we introduce ghost samples and use tools that hinge on the independence between them. Let ξ (j) = (ξ y,(j) t , ξ z,(j) t )t Z , j = 0, . . . , k 1 denote independent copies of ξ. Next, for I = y, z define ξI,(j) by setting ξI,(j) i = ξI i for i = τ(j + 1) + 1, . . . , 0 and ξI,(j) i = ξ I,(j) i for i τ(j + 1). Additionally, let Z (j) t := Gz(. . . , ξz,(j) t 1 , ξz,(j) t ) and Y (j) t := Gy(. . . , ξy,(j) t 1 , ξy,(j) t ) for t Z and define the (Rd)Z Rm-valued random variables U(j) := (Z ,(j) τj , Y (j) τj), j = 0, . . . , k 1. Gonon, Grigoryeva, and Ortega Then one has Z (j) t τj = Gz(. . . , ξz,(j) t τj 1, ξz,(j) t τj) ( Gz(. . . , ξ z,(j) τj τ, ξz τj τ+1, . . . , ξz t τj), t = τ + 1, . . . , 0, Gz(. . . , ξ z,(j) t τj 1, ξ z,(j) t τj), t τ and so, for any j = 0, . . . , k 1, the random variable U(j) is measurable with respect to the σ-algebra generated by (ξ τj+t)t= τ+1,...,0 and ξ (j). The assumption of independence between the ghost samples implies that U(0), . . . , U(k 1) are also independent and identically distributed with the same distribution as V0 = (Z, Y0) introduced above. Hence one can rewrite the first summand of the right hand side of the last inequality in (76) as j=0 {E[LH(V τj)] LH(V τj)} n E[LH(V τj)] LH(U(j)) o n LH(U(j)) LH(V τj) o n E[LH(U(j))] LH(U(j)) o n LH(U(j)) LH(V τj) o We now analyze these two terms separately. For the second term, we first note that for any H HRC it holds that LH(V τj) LH(U(j)) = L(H(Z τj), Y τj) L(H(Z ,(j) τj ), Y (j) τj) LL H(Z τj) H(Z ,(j) τj ) 2 + Y τj Y (j) τj 2 . (78) Next, we use the Lipschitz-continuity of the readout maps (see (31)) together with the estimate (75) in Lemma 18 and compute sup H HRC H(Z τj) H(Z ,(j) τj ) 2 2rτMFLh + LRLh l=0 rl Z l τj Z (j) l τj 2. (79) Combining (79) with (78) we now estimate the second term in (77) by n LH(U(j)) LH(V τj) o 2rτMFLh + E[ Y τj Y (j) τj 2] + LRLh l=0 rl E[ Z l τj Z (j) l τj 2] Risk Bounds for Reservoir Computing with aτ as in (45). In order to estimate the first term in (77), one relies on techniques which are common in the case of independent and identically distributed random variables. Here we start by introducing real Rademacher random variables ε0, . . . , εk 1 (see for example Hyt onen et al., 2016, Definition 3.2.9), which are independent of all the other random variables considered so far. In what follows we need to use the structure for the loss function introduced in (33) as well as the other hypotheses on it that we spelled out in Section 3.2. 
The fact that the loss functions are a sum of representing functions fi : R R, implies that their evaluation on the hypothesis class HRC can be expressed through the evaluation of each representing function on the sets HRC i , i {1, . . . , m}, defined by HRC i := { e H : (Dd)Z Rm R | e H(x, y) := (H(x))i yi, H HRC}. Using independence and a symmetrization trick by Gin e and Zinn (1984) (see for example Ledoux and Talagrand (1991, Lemma 6.3) or the proof of Bartlett and Mendelson (2003, Theorem 8)) one writes n E[LH(U(j))] LH(U(j)) o j=0 εj LH(U(j)) sup e H HRC i j=0 εj(fi e H)(U(j)) Applying the contraction principle for Rademacher random variables (see Ben-David and Shalev-Shwartz (2014, Lemma 26.9) and also Ledoux and Talagrand (1991, Theorem 4.12)) to the last expression one obtains n E[LH(U(j))] LH(U(j)) o sup e H HRC i j=0 εj e H(U(j)) j=0 εj H(Z ,(j) τj ) Y (j) τj j=0 εj H(Z ,(j) τj ) j=0 εj(Y (j) τj)i j=0 εj H(Z ,(j) τj ) = 2 m LLRk(HRC), (82) with the Rademacher complexity defined as in (40). Note that the last term in the fourth line is equal to zero due to the independence and the fact that the expectation of Rademacher random variables is zero. Gonon, Grigoryeva, and Ortega We now come back to the estimate of the expected maximum difference between the in-class statistical risk and the idealized empirical risk and rewrite the expression (76) using (77), (80), and (82) n R(H) b R n (H) o# n 2k m LLRk(HRC) + kaτ + 2M(n kτ) which then yields (41) as required. It remains to prove (42). To do so, notice that the triangle inequality and the same arguments used in (76), (77), (80) and (81) may be applied in the presence of absolute values to obtain b R n (H) R(H) sup e H HRC i j=0 εj(fi e H)(U(j)) Applying again the contraction principle for Rademacher random variables (Bartlett and Mendelson, 2003, Theorem 12.4), one estimates, for any i = 1, . . . , m, sup e H HRC i j=0 εj(fi e H)(U(j)) sup e H HRC i j=0 εj e H(U(j)) j=0 εj(H(Z ,(j) τj ))i j=0 εj(Y (j) τj)i j=0 εj H(Z ,(j) τj ) j=0 εj(Y (j) τj)i mk Rk(HRC) + j=0 εj(Y (j) τj)i with the Rademacher complexity defined as in (40). Finally, using the independence of the Rademacher sequence and the ghost samples (Y (j))j=0,...,k 1 as well as the stationarity properties of the latter, one has j=0 εj(Y (j) τj)i j=0 E h (Y (j) τj)2 i i k m E Y0 2 2 1/2 . The combination of this inequality with (83) and (84) yields (42), as required. Risk Bounds for Reservoir Computing 5.5 Proof of Corollary 8 Proof of part (i) We start by noticing that under Assumption 1, the weak dependence coefficients θI defined in (46) for I = y, z and τ N+ can be estimated as θI(τ) = E[ GI(. . . , ξI 1, ξI 0) GI(. . . , eξI τ 1, eξI τ, ξI τ+1, . . . , ξI 0) 2] j=τ w I j ξI j eξI j 2 2LIE[ ξI 0 2] j=τ w I j 2LIE[ ξI 0 2] j=τ (Dw I)j = 2LIE[ ξI 0 2] (Dw I)τ where we used that by hypothesis Dw I < 1. Consequently, if we set CI := 2LIE[ ξI 0 2] 1 Dw I , condition (26) does hold for all τ N+. We now define c0 := 2LLLh MF, c1 := LLLRCz Lh, and c2 := LLCy and with the notation λmax := max(r, Dwy, Dwz) (0, 1) write (45) for any τ N+ as aτ c0rτ + c1 l=0 rl(Dwz)τ l + c2(Dwy)τ λτ max(c0 + τc1 + c2). (85) Next, let τ N+ with τ < n and set k = n/τ . Inserting assumption (48) in (41) and then using that n/τ 1 k n/τ, one obtains n R(H) b R n (H) o# n aτ + BCRC kτ n + 2M(n kτ) aτ + BCRC τ n + 2Mτ Our goal now is to choose the length of the block τ depending on λmax. 
Recall that by hypothesis log(n) < n log(λ 1 max), which means that in order to be able to apply the blocking technique for a given value λmax (0, 1) the number of observations n N+ should be sufficiently large. In this situation one can choose τ = log(n)/ log(λ 1 max) , which is then guaranteed to satisfy τ < n. Notice that then λτ+1 max 1/n and consequently (49) follows from (85) and (86) with the appropriate choice of constants as given in (51)-(52). Finally, the last term in (50) follows by noticing that k LLE Y0 2 2 1/2 n 4 τLLE Y0 2 2 1/2 n , (87) and hence one gets (50) as required. Proof of part (ii) Recall that by Assumption 2 for I = y, z there exist λI (0, 1) and CI > 0 such that, for all τ N+, it holds that θI(τ) CIλτ I. Gonon, Grigoryeva, and Ortega Mimicking the proof of part (i) with λmax := max (r, λy, λz) yields the claim. Proof of part (iii) By our choice of γα it holds that τ/4 + γα log(τ)αz/ log(r 1) for all τ N+. One has rl/2(τ l) αz 2αzτ αz for l τ/2 and rl/2(τ l) αz rτ/4 r γατ αz, for τ/2 l τ 1. Setting Cα = max(2αz, r γα)(1 r) 1 one has for all τ N+ that l=0 rl(τ l) αz l=0 rl/2 max(2αz, r γα)τ αz = Cατ αz. Defining c0, c1, and c2 as in the proof of part (i), applying (30) and inserting the above estimate thus allows us to bound (45) for any τ N+ by aτ c0rτ + c1 l=0 rl(τ l) αz + c2τ αy τ α r γαc0 + Cαc1 + c2 . Furthermore, (86) remains valid and so choosing τ = nβ yields n R(H) b R n (H) o# (r γαc0 + Cαc1 + c2) n1/2 β/2 + 2M Taking β = 1 2) 1 yields 1/2 β/2 = α 2) 1 and hence the desired result. The last term in (42) can be bounded by proceeding analogously to part (i) and noticing that (87) remains valid, which concludes the proof. 5.6 Proof of Proposition 9 (Reservoir systems with linear reservoir and readout maps) The condition (61) together with Proposition 1 ensure that the ESP property of the reservoir systems in the hypothesis class is guaranteed and that for any z KM we have that HA,C,ζ(z) = P i=0 Ai(Cz i + ζ). Using the definition of Rademacher complexity, one estimates j=0 εj H(e Z(j)) sup (A,C,ζ) Θ W:|||W|||2 Lh a: a 2 Lh,0 j=0 εj(WHA,C,ζ(e Z(j)) + a) sup (A,C,ζ) Θ W:|||W|||2 Lh j=0 εj WHA,C,ζ(e Z(j)) sup a: a 2 Lh,0 sup W:|||W|||2 Lh |||W|||2E sup (A,C,ζ) Θ j=0 εj HA,C,ζ(e Z(j)) sup a: a 2 Lh,0 Risk Bounds for Reservoir Computing l=0 sup (A,C,ζ) Θ sup (A,C,ζ) Θ j=0 εj(Ce Z(j) l + ζ) sup a: a 2 Lh,0 l=0 sup (A,C,ζ) Θ sup (A,C,ζ) Θ |||C|||2E j=0 εj e Z(j) l + sup (A,C,ζ) Θ ζ 2E sup a: a 2 Lh,0 l=0 (λA max)l j=0 εj e Z(j) 0 sup a: a 2 Lh,0 where we used stationarity, the fact that |||W|||2 Lh for all readout maps from the class H HRC, and constants as in (58)-(60). For the first summand in this expression we have j=0 εj e Z(j) 0 j=0 εj e Z(j) 0 j=0 E e Z(j) 0 2 E[ε2 j] = k E h Z0 2 2 i , where in the first step we used the Jensen inequality, the next equality is obtained using the independence of the ghost samples and the Rademacher sequence and also the fact that E[εj εj] = 0 when j = j for Rademacher variables. The second summand in (88) is bounded using the inequality by Khintchine (1923) We bound the third term as the first one and obtain sup a: a 2 Lh,0 k L2 h,0, (89) where we took into account that a 2 Lh,0 for all readout maps from the class H HRC. Finally, (88) can be rewritten as j=0 εj H(e Z(j)) l=0 (λA max)l λC max E h Z0 2 2 i1/2 + λζ max k Lh 1 λAmax λC max E h Z0 2 2 i1/2 + λζ max where we used that λA max (0, 1). Finally, the choice of constants which satisfy conditions (61) yields (62), as required. 
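The proof above starts from the closed-form expression $H^{A,C,\zeta}(\mathbf{z}) = \sum_{i \ge 0} A^i (C z_{-i} + \zeta)$, valid when $|||A|||_2 < 1$. A minimal numerical sketch of this identity, with arbitrary illustrative sizes and distributions, compares the iterated state recursion $x_t = A x_{t-1} + C z_t + \zeta$, started from the zero state, with the correspondingly truncated series; the two agree up to floating-point error, and the neglected tail of the infinite series is of order $|||A|||_2^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, T = 10, 2, 400
A = rng.normal(size=(N, N)); A *= 0.6 / np.linalg.norm(A, 2)   # |||A|||_2 < 1
C = rng.normal(size=(N, d)); zeta = rng.normal(size=N)
Z = rng.uniform(-1.0, 1.0, size=(T, d))                         # Z[-1] plays the role of z_0

# State recursion x_t = A x_{t-1} + C z_t + zeta started from the zero state.
x = np.zeros(N)
for z in Z:
    x = A @ x + C @ z + zeta

# Truncated series  sum_{i=0}^{T-1} A^i (C z_{-i} + zeta).
series = np.zeros(N)
Ai = np.eye(N)
for i in range(T):
    series += Ai @ (C @ Z[T - 1 - i] + zeta)
    Ai = Ai @ A

# Essentially zero: both computations represent the same truncated series.
print(np.linalg.norm(x - series))
```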
Gonon, Grigoryeva, and Ortega 5.7 Proof of Proposition 11 (Echo State Networks) Firstly, note that for any x DN, it holds that x 2 x 1 = PN i=1 |xi| and hence one can write j=0 εj H(e Z(j)) sup F FRC W:|||W|||2 Lh a: a 2 Lh,0 j=0 εj(WHF (e Z(j)) + a) sup F FRC W:|||W|||2 Lh j=0 εj WHF (e Z(j)) sup a: a 2 Lh,0 sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ l (e Z(j)) where we used the same arguments as in (89). Using the assumed symmetry of the family FRC in the first step and the contraction principle in the second step one may estimate sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ l (e Z(j)) sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ l (e Z(j)) sup (A,C,ζ) Θ j=0 εj(Al, Hσ,A,C,ζ(e Z ,(j) 1 ) + Cl, e Z(j) 0 + ζl) l=1 sup (A,C,ζ) Θ Al, E sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ(e Z ,(j) 1 ) l=1 sup (A,C,ζ) Θ Cl, 2E j=0 εj e Z(j) 0 sup (A,C,ζ) Θ Since λA max (0, 1) by assumption and using that our hypotheses ensure that Assumption 6 is satisfied, one may iterate the above inequality to obtain sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ l (e Z(j)) l=0 (λA max)l j=0 εj e Zj 0 Risk Bounds for Reservoir Computing For the first summand in this expression we obtain j=0 εj e Z(j) 0 j=0 εj e Z(j) 0 j=0 E e Z(j) 0 2 E[ε2 j] = k E h Z0 2 2 i , where in the first step we used Jensen s inequality, the next equality is obtained using the independence of the ghost samples and the Rademacher sequence and also the fact that E[εj εj] = 0 for j = j for Rademacher variables. The last step trivially follows again from the definition of ghost samples and the definition of Rademacher variables. The second summand in (90) is bounded using Khintchine s inequality (Khintchine, 1923) and hence in (90) we obtain sup (A,C,ζ) Θ j=0 εj Hσ,A,C,ζ l (e Z(j)) l=0 (λA max)l λC max E h Z0 2 2 i1/2 + λζ max λC max E h Z0 2 2 i1/2 + λζ max which finally gives j=0 εj H(e Z(j)) k Lh 1 λAmax λC max E h Z0 2 2 i1/2 + λζ max which with the choice of constant in (65) gives (64) as required. 5.8 Proof of Proposition 13 (State-Affine Systems) Let (p, q) Θ. Then, the conditions on the quantities λSAS and c SAS introduced in (66) imply that Mp = max z B (0,M) {|||p(z)|||2} < 1 (91) and so F p,q is a contraction on the first entry. Thus, Proposition 1 implies that the system (3) has the echo state property. Moreover, by Grigoryeva and Ortega (2018a, Proposition 14), for any z KM, we have that i=0 p(z0)p(z 1) p(z i+1)q(z i). Next, write p(z) = P α Imax zαBα and q(z) = P α Imax zαCα. Again, by the conditions on the quantities λSAS and c SAS introduced in (66) one has that the image of Hp,q is bounded Gonon, Grigoryeva, and Ortega and so, combining this with (91) one obtains j=0 εj Hp,q(e Z(j)) 2 j=0 εjp(e Z(j) 0 ) p(e Z(j) i+1)q(e Z(j) i) j=0 εj(e Z(j) 0 )αp(e Z(j) 1) p(e Z(j) i+1)q(e Z(j) i) αi Imax |||Bα1|||2 |||Bαi|||2 j=0 εj(e Z(j) 0 )α1 (e Z(j) i+1)αiq(e Z(j) i) αi Imax |||Bα1|||2 |||Bαi|||2 X α Imax Cα 2 j=0 εj(e Z(j) 0 )α1 (e Z(j) i+1)αi(e Z(j) i)α Therefore, by this expression and using the same type of arguments for linear readout as in the example of echo state networks one obtains j=0 εj H(e Zj) sup F FRC W:|||W|||2 Lh a: a 2 Lh,0 j=0 εj(WHF (e Z(j)) + a) αi Imax (λSAS)ic SAS X j=0 εj(e Z(j) 0 )α1 (e Z(j) i+1)αi(e Z(j) i)α αi Imax (λSAS)ic SAS X j=0 εj(e Z(j) 0 )α1 (e Z(j) i+1)αi(e Z(j) i)α αi Imax (λSAS)ic SAS X α Imax E (Z0)α1 (Z i+1)αi(Z i)α 2 1/2 + Lh c SAS|Imax| 1 |Imax|λSAS + Lh,0 5.9 Proof of Theorem 14 We start by working out two concentration inequalities that are needed in the proof. 
These are contained in the two propositions 19 and 20 and are used in part (i) of the theorem in relation with the use of Assumption 1. Risk Bounds for Reservoir Computing 5.9.1 (Exponential) concentration inequalities Proposition 19 Define Γn := sup H HRC{R(H) b R n (H)}. Suppose that Assumptions 4-6 hold, that Assumption 1 is satisfied, and that the Bernoulli shifts innovations are bounded, that is, there exists M > 0 such that for I = y, z and for all t Z , ξI t 2 M. Then there exists Cbd > 0 such that for any η > 0, n N+ it holds that P (|Γn E[Γn]| η) 2 exp 2nη2 The constant Cbd is explicitly given by 1 r MFr + LRMLz wz 1 + MLy wy 1 Proof. The main idea of the proof is to exploit the Bernoulli shift structure and apply Mc Diarmid s inequality (Boucheron et al., 2013, see for example), which out of the bound of differences of functions constructed in a particular manner yields the bound in (92). In order to ease the notation, we first define Y := (BM BM) (Rqy Rqz). Consider now a function φ: Yn 1 YZ R, which is defined for ui = (uy i, uz i) Y, i = 0, . . . , n 2 and un 1 = (uy n+1+t, uz n+1+t)t Z YZ by φ(u0, . . . , un 2, un 1) = sup H HRC i=0 L(H(Gz(uz, i )), Gy(uy, i )) Fix k {0, . . . , n 1} and let eu be an identical copy of the sequence u so that only the k-th entry in eu is different from u, that is eu i = u i for all i = k. We now estimate the difference of the function φ: Yn 1 YZ R evaluated at u and eu as follows: φ(u0, . . . , un 1) φ(eu0, . . . , eun 1) sup H HRC inf e H HRC 1 n n L( e H(Gz(euz, i )), Gy(euy, i )) L(H(Gz(uz, i )), Gy(uy, i )) o R( e H) + R(H) sup H HRC 1 n n L(H(Gz(euz, i )), Gy(euy, i )) L(H(Gz(uz, i )), Gy(uy, i )) o = sup H HRC 1 n n L(H(Gz(euz, i )), Gy(euy, i )) L(H(Gz(uz, i )), Gy(uy, i )) o sup H HRC 1 n n LL( H(Gz(euz, i )) H(Gz(uz, i )) 2 + Gy(euy, i ) Gy(uy, i ) 2) o , where in the last inequality we used that by assumption the loss function L is LL-Lipschitz. For the first summand under the supremum in (94) we use the bound (75) and hence write Gonon, Grigoryeva, and Ortega i=0 H(Gz(euz, i )) H(Gz(uz, i )) 2 Lh(2rk+1 i MF + LR j=0 rj Gz(euz, j i ) Gz(uz, j i ) 2) j=i rj i Gz(euz, j ) Gz(uz, j ) 2) Notice that the second summand under the supremum in (94) and the second summand in (95) can be bounded using that (24) holds by Assumption 1 and that by hypothesis both eu and u satisfy u I t 2 M and eu I t 2 M for any t Z . More specifically, for I = y, z and any i {0, . . . , k} one obtains GI(eu I, i ) GI(u I, i ) 2 LI l=0 w I l u I l i eu I l i 2 = LIw I k i u I k eu I k 2 2LIw I k i M, where we used that eu l i = u l i, for all l + i = k, l N. Combining this expression with (95) we estimate (94) as φ(u0, . . . , un 1) φ(eu0, . . . , eun 1) sup H HRC 1 n n LL( H(Gz(euz, i )) H(Gz(uz, i )) 2 + Gy(euy, i ) Gy(uy, i ) 2) o Lh MFr1 rk+1 1 r + Lh LRMLz j=i rj iwz k j + MLy Lh MFr1 rk+1 1 r + Lh LRMLz j=0 rjwz k i j + MLy Lh MF r 1 r + Lh LRMLz 1 r MFr + LRMLz wz 1 + MLy wy 1 with the constant Cbd as in (93). We now use this bound of differences in Mc Diarmid s inequality and simply notice that Γn = φ(ξ0, . . . , ξ n+2, ξ n+1) in the statement, which immediately yields (92), as required. Risk Bounds for Reservoir Computing Proposition 20 Define Γn := sup H HRC{R(H) b R n (H)}. Suppose that Assumptions 46 hold and that Assumption 1 is satisfied. Let Φ: [0, ) [0, ) be a convex and increasing function that satisfies Φ(0) = 0. 
Furthermore, assume that max I {y,z} E Φ(2Cmom I ξI 0 2) < where Cmom z = LL 1 r Lz wz 1, (96) Cmom y = LLLy wy 1, (97) and denote ϕ(M) := X I {y,z} E h Φ(2Cmom I ξI 0 2)1{ ξI 0 2>M} i . (98) Then, there exists a constant C0 > 0 such that for any η > 0, n N+, M > 0 satisfying X I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}] < η P (|Γn E[Γn]| η) 2 exp 9(C0 + 2M(Cmom z + Cmom y ))2 2 ϕ(M) Φ(η/3). The constant C0 is explicitly given by (39). Proof. Let M > 0 and for any t Z , I = y, z denote by ξI,M t the Bernoulli shift innovations whose Euclidean norm is bounded above by M, that is, ξI,M t := ξI t 1{ ξI t 2 M}. In order to simplify the notation, we define ZM t := Gz(. . . , ξz,M t 1 , ξz,M t ) = Gz((ξz,M) t ), t Z , (100) YM t := Gy(. . . , ξy,M t 1 , ξy,M t ) = Gy((ξy,M) t ), t Z , (101) b R ,M n (H) := 1 i=0 L(H((ZM) i ), YM i), (102) RM(H) := E[L(H(ZM), YM 0 )] (103) and, additionally, denote ΓM n := sup H HRC{RM(H) b R ,M n (H)}. Firstly, the triangle inequality yields |Γn E[Γn]| = |Γn ΓM n (E[Γn] E[ΓM n ]) + ΓM n E[ΓM n ]| |Γn ΓM n | + |E[Γn ΓM n ]| + |ΓM n E[ΓM n ]| |Γn ΓM n | + E[|Γn ΓM n |] + |ΓM n E[ΓM n ]|. (104) For the first summand in expression (104) we write n R(H) b R n (H) o sup H HRC n RM(H) b R ,M n (H) o Gonon, Grigoryeva, and Ortega sup H HRC inf e H HRC n R(H) b R n (H) o n RM( e H) b R ,M n ( e H) o n R(H) b R n (H) RM(H) + b R ,M n (H) o RM(H) R(H) + sup H HRC b R ,M n (H) b R n (H) . (105) Using this result, we can immediately get the following bound for the second summand in expression (104) E h Γn ΓM n i sup H HRC RM(H) R(H) + E b R ,M n (H) b R n (H) The first two terms in the right hand side of (104) are thus controlled by (105) and (106). The third term in (104) is of the type required in Proposition 19, that is, the Bernoulli shifts are defined using bounded innovations and hence the term will be controlled in what follows using the result in Proposition 19. The next step in our proof is to derive bounds for the terms in the right hand sides of (105) and (106). First, we consider the estimate for the term they share, for which we write sup H HRC|RM(H) R(H)| = sup H HRC |E[L(H(ZM), YM 0 ) L(H(Z), Y0)]| sup H HRC E h LL( H(ZM) H(Z) 2 + YM 0 Y0 2) i j=0 rj E[ ZM j Z j 2] + E[ YM 0 Y0 2]) 1 r E[ Gz(ξz,M) Gz(ξz) 2] + E[ Gy(ξy) Gy(ξy,M) 2] , (107) where the first and the second (in)equality are obtained using the definition (103) and the assumption that the loss function L is LL-Lipschitz, the third step follows from the estimate (75) from Lemma 18 and the last step again uses (100)-(101) and the i.i.d. assumption on ξ. In order to proceed, notice that since Assumption 1 holds by hypothesis, by (24) one has for any j N, I = y, z the following estimate GI((ξI) j ) GI((ξI,M) j ) 2 LI l=0 w I l ξI l j ξI,M l j 2 l=0 w I l ξI l j 21{ ξI l j 2>M}. (108) Combining (107) and (108) and using the i.i.d. assumption on ξ one obtains Risk Bounds for Reservoir Computing sup H HRC|RM(H) R(H)| l=0 wz l ξz l 21{ ξz l 2>M}] + Ly E[ l=0 wy l ξy l 21{ ξy l 2>M}] 1 r Lz wz 1E[ ξz 0 21{ ξz 0 2>M}] + Ly wy 1E[ ξy 0 21{ ξy 0 2>M}] Cmom z E[ ξz 0 21{ ξz 0 2>M}] + Cmom y E[ ξy 0 21{ ξy 0 2>M}] I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}] (109) with Cmom I for I = y, z defined as in (96) and (97). Next, we analyze the second term in (105). 
Using the function Φ : [0, ) [0, ) we obtain for η > 0 b R ,M n (H) b R n (H) η b R ,M n (H) b R n (H) j=0 rj Gz((ξz) j i) Gz((ξz,M) j i) 2 + Gy((ξy) i ) Gy((ξy,M) i ) 2 i j=0 rj E h Φ LLLh LR 1 r Gz((ξz) j i) Gz((ξz,M) j i) 2 + LL Gy((ξy) i ) Gy((ξy,M) i ) 2 i 2E h Φ 2LLLh LR 1 r Gz(ξz) Gz(ξz,M) 2 i + 1 2E h Φ 2LL Gy(ξy) Gy(ξy,M) 2 i 2E h Φ 2LLLh LR l=0 wz l ξz l ξz,M l 2 i + 1 2E h Φ 2LLLy l=0 wy l ξy l ξy,M l 2 i I {y,z} E h Φ 2Cmom I w I 1 l=0 w I l ξI l ξI,M l 2 i l=0 w I l E h Φ 2Cmom I ξI l 21{ ξI l 2>M} i 2ϕ(M), (110) where Cmom z , Cmom y , and ϕ(M) are given in (96), (97), and (98), respectively. In these derivations we used Markov s inequality for increasing and non-negative functions in the first Gonon, Grigoryeva, and Ortega step. The second inequality uses the definition in (102), the estimate (75) from Lemma 18, and the fact that Φ is by hypothesis an increasing function. We subsequently used Jensen s inequality for discrete probability measures, the stationarity assumption and the convexity of the function Φ. Finally, (108), the appropriate choice of constants and once more Jensen s inequality for discrete probability measures and the stationarity assumption as well as Φ(0) = 0 yields the result. We now notice that this result provides automatically a bound for the second term in (106). In order to see that, one needs to take as the function Φ the identity and then by the second and the last two lines in (110) it holds that b R ,M n (H) b R n (H) I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}]. (111) We now consider again expression (104) and taking into account (105), (106), together with the bounds for their ingredients given in (109), (110), and (111) we derive, for any η > 0 satisfying (99), P (|Γn E[Γn]| η) 9 + sup H HRC b R ,M n (H) b R n (H) + η 9 + |ΓM n E[ΓM n ]| η P |ΓM n E[ΓM n ]| η b R ,M n (H) b R n (H) η 2 ϕ(M) Φ(η/3), with Cbd as in (93). In the first inequality we used the hypothesis in (99) and the bounds (109), (111). In the second inequality we used (110) and (92) in Proposition 19. Finally, noticing that Cbd = 2M(Cmom z + Cmom y ) + C0 with C0 as in (39) and setting Cmom z and Cmom y as in (96)-(97) immediately yields the claim. Corollary 21 Suppose that Assumptions 4-6 hold and that Assumption 1 is satisfied. Let Φ: [0, ) [0, ) be a convex and strictly increasing function that satisfies Φ(0) = 0. Furthermore, assume that for all u > 0, E[Φ(u ξI 0 )2] < for I = y, z. Then, for any δ (0, 1), n N+, P (|Γn E[Γn]| BΦ(n, δ)) δ BΦ(n, δ) = 9 max (C0 + 2Φ 1(n)(Cmom z + Cmom y )) q 2n , Φ 1 2CΦ I {y,z} E Φ(2Cmom I ξI 0 2)2 1/2 E[Φ( ξI 0 2)]1/2, (113) and C0, Cmom z , Cmom y are given by (39), (96), and (97). Risk Bounds for Reservoir Computing Proof. We start with the function ϕ(M) given in (98) and obtain that for any M > 0 it holds that I {y,z} E h Φ(2Cmom I ξI 0 2)1{ ξI 0 2>M} i I {y,z} E Φ(2Cmom I ξI 0 2)2 1/2 E h (1{ ξI 0 2>M})2i1/2 I {y,z} E Φ(2Cmom I ξI 0 2)2 1/2 P( ξI 0 2 > M)1/2 CΦ Φ(M)1/2 , (114) where the first inequality is a consequence of H older s inequality, and the last step is obtained by applying Markov s inequality for increasing nonnegative functions and by using the definition in (113). Furthermore, by convexity, Jensen s inequality and since Φ(0) = 0, one has that I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}]) X 1 2Φ 2Cmom I E[ ξI 0 21{ ξI 0 2>M}] I {y,z} E h Φ 2Cmom I ξI 0 21{ ξI 0 2>M} i 2ϕ(M). 
(115) Choosing M = Φ 1(n) in (114) and setting η = BΦ(n, δ) defined in (112) one easily verifies that 1 2 ϕ(M) Φ(η/9) 1 2 CΦ nΦ(η/9) δ In particular, this also implies that ϕ(M) < Φ(η/9) and so (115) yields I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}]) < Φ(η/9) and hence X I {y,z} Cmom I E[ ξI 0 21{ ξI 0 2>M}] < η Thus (99) is satisfied and we may apply Proposition 20 and use Φ(η/9) Φ(η/3) and (116), which yields P (|Γn E[Γn]| η) 2 exp 2nη2 9(C0 + 2Φ 1(n)(Cmom z + Cmom y ))2 Gonon, Grigoryeva, and Ortega 5.9.2 Proof of Theorem 14 Proof of part (i). In this situation, the hypotheses of part (i) of Corollary 8 are satisfied and so the following bound holds: n R(H) b R n (H) o# n + C2log(n) log(n) n . (117) Let us denote Γn := sup H HRC{R(H) b R n (H)}. We may then apply the triangle inequality, insert the estimate on the difference between the empirical risk and its idealized counterpart obtained in Proposition 5 as well as the estimate on the expected value (117) to obtain that P-a.s., sup H HRC{R(H) b Rn(H)} = sup H HRC{R(H) b Rn(H) + b R n (H) b R n (H)} E[Γn] + E[Γn] sup H HRC{ b R n (H) b Rn(H)} + Γn E[Γn] + E[Γn] sup H HRC | b R n (H) b Rn(H)| + |Γn E[Γn]| + E[Γn] n + |Γn E[Γn]| + C1 n + C2log(n) (118) Part (a): Denote by η the upper bound that we need to prove holds with high probability, that is, η := (1 rn)C0 + C1 n + C2log(n) log(n) n + Cbd q Combining the estimate (118) with the exponential concentration inequality Proposition 19 then yields sup H HRC{R(H) b Rn(H)} > η |Γn E[Γn]| > Cbd q By applying the result that we just proved to the loss function L one obtains that sup H HRC{ b Rn(H) R(H)} > η Using that |x| = max(x, x) one can thus combine the two estimates to deduce R(H) b Rn(H) > η sup H HRC{R(H) b Rn(H)} > η sup H HRC{ b Rn(H) R(H)} > η Risk Bounds for Reservoir Computing Part (b): Proceeding analogously as in part (a), denote by η the high-probability upper bound which needs to be established, that is, η = (1 rn)C0 + C1 n + C2log(n) log(n) n + BΦ(n, δ). Combining (118) with Corollary 21 then yields sup H HRC{R(H) b Rn(H)} > η P (|Γn E[Γn]| > BΦ(n, δ)) δ The claim then follows precisely as in the proof of part (a). Proof of part (ii). Firstly, one may use Proposition 5 to obtain P-a.s., sup H HRC{R(H) b Rn(H)} sup H HRC{ b R n (H) b Rn(H)} + |Γn| (1 rn)C0 (120) Setting n + C2log(n) n + C3,abs p and applying Markov s inequality, (120) and part (ii) of Corollary 8 then yields sup H HRC{R(H) b Rn(H)} > η n + C2log(n) n + C3,abs p n + C2log(n) n + C3,abs p By applying what we just proved to the loss function L the claim then follows precisely as in the proof of part (i). Proof of part (iii). The proof is the same as the proof of part (ii), except that instead of choosing η as in (121) one takes η = (1 rn)C0 C1,absn 1 2+α 1 + C2n 2 2+α 1 and instead of using part (ii) of Corollary 8 one applies its part (iii) to estimate E[|Γn|]. Gonon, Grigoryeva, and Ortega 5.10 Proof of Proposition 16 Denote by Θ := (ρAA, ρCC, ρζζ) | (ρA, ρC, ρζ) ( a λA , a λA ) [ c, c] [ s, s] the random set of admissible parameters for the echo state network. Since l=1 sup (A,C,ζ) Θ Al, = a (0, 1), for any realization of A, C, ζ (that is, conditional on A, C, ζ) the assumptions of Proposition 11 are satisfied. 
Thus one may argue as in the proof of part (ii) of Theorem 14 to obtain that for any η > 0, sup H HRC{R(H) b Rn(H)} > (1 rn)C0 E[|Γn| | A, C, ζ] and then apply part (ii) of Corollary 8 to obtain sup H HRC{R(H) b Rn(H)} > (1 rn)C0 n + C2log(n) n + C3,abs p where the constants can be explicitly chosen using (51)-(52). In particular, C1 and C2 are given by (51) with CI as in (26), and C3,abs can be written using (52) as C3,abs = 2C3 + 4LLE Y0 2 2 1/2 q log(λ 1 max) , C3 = 2 m LLCRC q log(λ 1 max) CRC = Lh 1 a λCE h Z0 2 2 i1/2 + λζ + Lh,0. Taking expectations, one sees that E[C3,abs] = C3,abs, where C3,abs is as in (52) with CRC given by (73). Thus we obtain sup H HRC{R(H) b Rn(H)} > (1 rn)C0 n + C2log(n) n + C3,abs p and the claim follows by arguing as in the proof of part (ii) in Theorem 14 . Risk Bounds for Reservoir Computing Acknowledgments Lukas Gonon and Juan-Pablo Ortega acknowledge partial financial support coming from the Research Commission of the Universit at Sankt Gallen, the Swiss National Science Foundation (grants number 175801/1 and 179114), and the French ANR BIPHOPROC project (ANR-14-OHRI-0018-02). Lyudmila Grigoryeva acknowledges partial financial support of the Graduate School of Decision Sciences of the Universit at Konstanz. T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Annals of Probability, 38(4):1345 1367, 2010. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615 631, 1997. P. Alquier and O. Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 18(3):883 913, 2012. D. Andrews. First order autoregressive processes and strong mixing. Cowles Founda-tion Discussion Papers 664, 1983. M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. 1999. L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer. Information processing using a single dynamical node as complex system. Nature Communications, 2:468, jan 2011. R. T. Baillie. Long memory processes and fractional integration in econometrics. Journal of Econometrics, 73(1):5 59, 1996. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(3):463 482, 2003. P. L. Bartlett, M. I. Jordan, and J. D. Mcauliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 2017-Decem:6241 6250, 2017. S. Ben-David and S. Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. 2014. J. Beran. Statistics for Long-Memory Processes. CRC Press, 1994. T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307 327, 1986. Gonon, Grigoryeva, and Ortega S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. P. Bougerol and N. Picard. Strict Stationarity of Generalized Autoregressive Processes. The Annals of Probability, 1992. O. Bousquet and A. Elisseeff. Stability and generalisaiton. Journal of Machine Learning Reasearch, 2:499 526, 2002. S. Boyd and L. Chua. 