Journal of Machine Learning Research 20 (2019) 1-62. Submitted 2/19; Revised 8/19; Published 11/19.

Differentiable reservoir computing

Lyudmila Grigoryeva (Lyudmila.Grigoryeva@uni-konstanz.de)
Department of Mathematics and Statistics, Graduate School of Decision Sciences, Universität Konstanz, Germany

Juan-Pablo Ortega (Juan-Pablo.Ortega@unisg.ch)
Faculty of Mathematics and Statistics, Universität Sankt Gallen, Switzerland
Centre National de la Recherche Scientifique (CNRS), France

Editor: Sayan Mukherjee

Abstract

Numerous results in learning and approximation theory have evidenced the importance of differentiability at the time of countering the curse of dimensionality. In the context of reservoir computing, much effort has been devoted in the last two decades to characterizing the situations in which systems of this type exhibit the so-called echo state (ESP) and fading memory (FMP) properties. These important features amount, in mathematical terms, to the existence and continuity of global reservoir system solutions. That research is complemented in this paper with the characterization of the differentiability of reservoir filters for very general classes of discrete-time deterministic inputs. This constitutes a novel strong contribution to the long line of research on the ESP and the FMP and, in particular, links to existing research on the input-dependence of the ESP. Differentiability has been shown in the literature to be a key feature in the learning of attractors of chaotic dynamical systems. A Volterra-type series representation for reservoir filters with semi-infinite discrete-time inputs is constructed in the analytic case using Taylor's theorem, and corresponding approximation bounds are provided. Finally, it is shown as a corollary of these results that any fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.
Keywords: reservoir computing, fading memory property, finite memory, echo state property, differentiable reservoir filter, Volterra series representation, state-space systems, system identification, machine learning.

©2019 Lyudmila Grigoryeva and Juan-Pablo Ortega. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/19-150.html.

1. Introduction

Context and preliminary discussion. Reservoir computing (RC) is a neural approach to the learning of dynamic processes which advocates the use of paradigms in which the supervised estimation of all available interconnection weights is not necessary and only the training of a static memoryless readout suffices to obtain good performances. This computational strategy has been simultaneously inspired by ideas coming from three different fields, namely, recurrent neural networks, dynamical systems, and biologically inspired neural microcircuits. The common thread to these analyses is the use of rich dynamics to process information and to create memory traces. This explains why RC can be found in the literature under other denominations like Liquid State Machines Maass and Sontag (2000); Maass et al. (2002); Natschläger et al. (2002); Maass et al. (2004, 2007) and is represented by various learning paradigms, with the Echo State Networks introduced in Jaeger (2010); Jaeger and Haas (2004) being a particularly important example. RC has shown superior performance in many forecasting and classification engineering tasks (see Lukoševičius and Jaeger (2009) and references therein) and has shown unprecedented abilities in the learning of the attractors of complex nonlinear infinite-dimensional dynamical systems Jaeger and Haas (2004); Pathak et al. (2017, 2018); Lu et al. (2018). Additionally, RC implementations with dedicated hardware have been designed and built (see, for instance, Appeltant et al.
(2011); Rodan and Tiňo (2011); Vandoorne et al. (2011); Larger et al. (2012); Paquot et al. (2012); Brunner et al. (2013); Vandoorne et al. (2014); Vinckier et al. (2015); Laporte et al. (2018); Tanaka et al. (2019)) that exhibit information processing speeds that largely outperform standard Turing-type computers. Ever since the inception of this methodology, much effort has been devoted to identifying the features that make an RC system capable of retaining relevant memory traces of the inputs and of being computationally powerful. The first question has given rise to various notions of and computational schemes for the memory capacity of RC systems Jaeger (2002); White et al. (2004); Ganguli et al. (2008); Hermans and Schrauwen (2010); Dambre et al. (2012); Grigoryeva et al. (2015); Couillet et al. (2016); Grigoryeva et al. (2016); Tiňo (2018). Another strand of interesting literature that we will not explore in this work has to do with the Turing computability capabilities of the systems of the type that we just introduced; recent relevant works in this direction are Kilian and Siegelmann (1996); Siegelmann et al. (1997); Cabessa and Villa (2015, 2016), and references therein. Regarding computational power, there are three properties that pervade the literature and that are usually declared as necessary to obtain an adequate functioning of an RC system (see, for instance, Legenstein and Maass (2007); Lukoševičius and Jaeger (2009); Maass (2011) and references therein), namely, the fading memory property (FMP), the echo state property (ESP), and the pairwise separation property (SP). The FMP is a notion observed in many modeling situations in which the influence of the input gradually fades out in time. This property is repeatedly invoked in systems theory Volterra (1930); Wiener (1958), computational neurosciences Maass et al. (2004), physics Coleman and Mizel (1968), or mechanics (see Fabrizio et al. (2010) and references therein). The ESP Jaeger (2010); Yildiz et al.
(2012); Manjunath and Jaeger (2013) is an existence and uniqueness property for the solutions of a state-space system that guarantees that the past history of the input fully determines the state of the system at any given point in time. Finally, the SP is satisfied by an input/output system if for any two input time series that differ in the past, the network assumes different states at subsequent time points. Even though these three properties are an essential part of the RC jargon, it is not always clear in the literature why they are important. A partial answer to this question has been given in the development of universality theorems for RC machine learning paradigms. Indeed, it has been shown in Maass and Sontag (2000); Maass et al. (2002, 2004, 2007); Grigoryeva and Ortega (2018a,b) that various families of RC systems that have these three properties are uniform universal approximants in a dynamical context in the presence of uniformly bounded (respectively, almost surely uniformly bounded) deterministic (respectively, stochastic) inputs. Moreover, these properties are exactly what is needed to prove universality statements using the Stone-Weierstrass theorem. Nevertheless, it has also been shown in Gonon and Ortega (2018) that when the uniform approximation criterion is replaced by an L^p norm defined with the measure induced by the input stochastic process, then the FMP does not play any role anymore. Additionally, when these properties are invoked, it is not always clear what the actual definition being used is, and they are sometimes even used interchangeably. The reason for this confusion is that, in the presence of various compactness and contractivity hypotheses, the ESP and the FMP are automatically simultaneously satisfied. Moreover, the same entanglement occurs when it comes to the actual dynamical implications that these properties entail, like the input and state forgetting properties (see later on in the text for detailed definitions).
From a learning theoretical perspective, the connections that we just brought up between these dynamical properties (FMP, ESP, and SP) and universality can be rephrased by saying that families that exhibit them are capable of making the approximation error in a learning task as small as desired. Also in the approximation error context, classical results in static setups (see, for instance, Girosi and Anzellotti (1992); Girosi (1995)) show that the differentiability of the objects that need to be approximated is as beneficial for convergence rates as the dimensionality of the input is detrimental. This feature is sometimes referred to as the blessing of smoothness, as opposed to the curse of dimensionality. Differentiability is hence a crucial element in the understanding of the learning theoretical properties of most machine learning paradigms and, as far as we know, it has never been tackled in the reservoir computing context; that question is at the core of this paper.

Important existing results. In order to make these remarks explicit, we recall here some results that will help us later on to introduce the contributions in this paper. Consider the discrete-time nonlinear state-space transformation

x_t = F(x_{t-1}, z_t),   (1.1)
y_t = h(x_t).            (1.2)

In the context of supervised machine learning we will refer to these transformations as reservoir systems and we will think of them as special types of recurrent neural networks. In that setup, the map F : ℝ^N × ℝ^n → ℝ^N, n, N ∈ ℕ⁺, is called the reservoir; it is usually randomly generated, and h : ℝ^N → ℝ^d is the readout, which is estimated via a supervised learning procedure. The input of this system is given by the elements of the infinite sequence z = (..., z_{-1}, z_0, z_1, ...) ∈ (ℝ^n)^ℤ and the output by the components of y ∈ (ℝ^d)^ℤ.
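The recursion (1.1)-(1.2) can be sketched numerically. The concrete choice below, a tanh network with a randomly generated reservoir matrix and a linear readout, is only one common instance (an Echo State Network-style map) and is not prescribed by the text; all matrices, dimensions, and scalings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 50, 1, 1                 # state, input, and output dimensions (illustrative)

# Reservoir map F(x, z) = tanh(A x + C z): a common randomly generated choice;
# the theory applies to any sufficiently regular F.
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)    # operator norm < 1 makes F a contraction in x
C = rng.normal(size=(N, n))
W = rng.normal(size=(d, N))        # linear readout h(x) = W x

def reservoir_system(z, x0=None):
    """Iterate x_t = F(x_{t-1}, z_t), y_t = h(x_t) over a finite input stretch."""
    x = np.zeros(N) if x0 is None else x0
    ys = []
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t)
        ys.append(W @ x)
    return np.array(ys)

y = reservoir_system(rng.normal(size=(100, n)))   # one output per input time step
```

Only the readout W would be estimated in the RC paradigm; A and C stay fixed after the random draw.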
Given that the state space may need to be high-dimensional in order to exhibit adequate approximation properties, it is desirable that the readout is as simple as possible (linear or polynomial, for instance). In this direction, various families of reservoir systems with linear readouts like Echo State Networks Jaeger and Haas (2004) or State Affine Systems (see later on in the text) have been shown to have universal approximation properties Grigoryeva and Ortega (2018a,b); Gonon and Ortega (2018). Training for these systems reduces to the solution of a (possibly regularized) linear regression problem when the mean square error is used as loss function. We say that the reservoir system (1.1)-(1.2) satisfies the echo state property (ESP) when for any z ∈ (ℝ^n)^ℤ there exists a unique y ∈ (ℝ^d)^ℤ that satisfies (1.1). When this existence and uniqueness feature is available one can associate well-defined filters U^F : (ℝ^n)^ℤ → (ℝ^N)^ℤ and U^F_h : (ℝ^n)^ℤ → (ℝ^d)^ℤ to the reservoir map F and the reservoir system (1.1)-(1.2), respectively. Very general situations have been characterized in which the ESP holds. For example, suppose that we restrict ourselves to inputs that are uniformly bounded by a constant M > 0, that is, consider the space K_M of semi-infinite sequences given by

K_M := { z ∈ (ℝ^n)^{ℤ₋} | ‖z_t‖ ≤ M for all t ∈ ℤ₋ },  M > 0,   (1.3)

and assume that the reservoir map F is continuous and a contraction in the first entry that maps F : B̄(0, L) × B̄(0, M) → B̄(0, L), with L > 0 (the symbol B̄(v, r) denotes the closure of the open ball B(v, r) with respect to a given norm ‖·‖, center v, and radius r > 0). In that case, it can be shown (see, for instance, (Grigoryeva and Ortega, 2018b, Theorem 3.1)) that for any z ∈ K_M there exists a unique x ∈ K_L := { x ∈ (ℝ^N)^{ℤ₋} | ‖x_t‖ ≤ L for all t ∈ ℤ₋ } that satisfies (1.1), that is, the ESP holds. This fact allows us to associate a unique filter U^F : K_M → K_L to the reservoir map F and U^F_h : K_M → (ℝ^d)^{ℤ₋} to the reservoir system (1.1)-(1.2), respectively, with U^F_h := h ∘ U^F.
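A simple numerical illustration, not taken from the paper, of why the contraction hypothesis yields the ESP: when the reservoir map contracts in its first argument, two trajectories driven by the same input but started at different states converge, so the state is asymptotically determined by the input history alone. The tanh network and all constants below are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 20, 1
A = rng.normal(size=(N, N))
A *= 0.5 / np.linalg.norm(A, 2)    # Lipschitz constant of x -> tanh(Ax + Cz) is <= 0.5
C = rng.normal(size=(N, n))

def final_state(x0, z):
    """State reached after feeding the whole input stretch z, starting from x0."""
    x = x0
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t)
    return x

z = rng.normal(size=(200, n))               # one fixed input history
x_a = final_state(rng.normal(size=N), z)    # two arbitrary initializations,
x_b = final_state(rng.normal(size=N), z)    # same inputs
gap = np.linalg.norm(x_a - x_b)             # shrinks by a factor >= 2 per step
```

After 200 steps the initialization gap has contracted below machine precision, which is the finite-horizon shadow of the existence and uniqueness statement above.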
Moreover, in this situation (see again (Grigoryeva and Ortega, 2018b, Theorem 3.1)) the continuity of F and h implies that both U^F and U^F_h are continuous when we consider either the uniform or the product topologies in the domain and target spaces. The continuity with respect to the product topology is called in this setup the fading memory property (FMP) and, as we shall see below, can be characterized using weighted norms in the spaces of input and output sequences, which shows that recent inputs are more represented in the outputs of FMP filters than older ones. Equivalently, the outputs produced by an FMP filter for inputs that are close in the recent past are close, even when those inputs may be very different in the distant past. The restriction to uniformly bounded inputs of the type (1.3) when using contracting reservoir maps not only makes the ESP and the FMP hold simultaneously but also simplifies enormously the characterization of the FMP. Indeed, it has been shown in Sandberg (2003); Grigoryeva and Ortega (2018b) that in that case the fading memory property is not a metric but an exclusively topological property that does not depend on the weighted norm used to define it. Therefore, the FMP does not contain in that situation any information about the rate at which the dependence of the system output on past inputs declines. This is not the case anymore when we consider unbounded input sets since, as we show later on in Theorem 7, reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map. There are important connections between the notions and the results that we just reviewed and fundamental concepts in the theories of non-autonomous and of random dynamical systems.
Even though we shall not pursue that line of thought, the reader is encouraged to consult Arnold (1998); Kloeden (2003); Kloeden and Rasmussen (2010); Manjunath and Jaeger (2014); Newman (2018) and references therein for in-depth presentations.

Main contributions of the paper. The core contributions of this paper are, first, the analysis of the ESP and the FMP in the absence of boundedness hypotheses and, second, the extension of the FMP-related continuity statements in the literature to the study of the differentiability properties of reservoir computers. In particular, we aim at characterizing the situations in which one can obtain the differentiability of reservoir filters out of the differentiability properties of the maps that define the corresponding reservoir system. Regarding the first objective, there are several reasons to study reservoir computing systems with unbounded inputs. First, even though we only deal in this paper with the deterministic setup, any random component in the data generating process of the inputs, like a Gaussian perturbation, would imply unboundedness. Second, when dealing with reservoir systems associated to physical systems, it is certainly reasonable to assume boundedness of the input due to the saturation effects that most of those systems present. Nevertheless, the value of the bounding constant is in general unknown beforehand, which makes uniform boundedness hypotheses unrealistic. Finally, in the study of the differentiability properties of reservoir computers, differentiability of Fréchet type is only defined on open subsets of normed spaces. We shall see that any open set in the Banach space of inputs with a weighted norm contains unbounded sequences, which forces us to deal with that situation.
As to the analysis of the differentiability properties of reservoir systems, this is an important question for several reasons.

It has been shown (see, for instance, Girosi and Anzellotti (1992); Girosi (1995)) that differentiability is a key element in decreasing the complexity that is needed to approximate a function with a prescribed accuracy level. The influence of this feature is comparable to that of the dimensionality of the input. Even though the development of bounds for the approximation error in the RC context is the subject of a forthcoming paper, it is reasonable to presume that differentiability is a crucial element in the understanding of the learning theoretical properties of this type of machine learning paradigms.

RC applications to the learning of the attractors of chaotic deterministic dynamical systems have been shown (see Lu et al. (2018)) to be closely related to the notion of Generalized Synchronization Kocarev and Parlitz (1995, 1996), for which differentiability is a relevant feature Hunt et al. (1997). Indeed, in the absence of differentiability, the synchronization mapping may be wild enough (in the terminology of Hunt et al. (1997)) to create a gap between the information dimensions of the attractors of the input system and of the system used to learn it. Additionally, one of the standard techniques to assess the quality of the result of this learning task is the comparison of the Lyapunov spectra of the problem system and the learnt proxy. These spectra are only available in the presence of differentiability. Also in the context of the learning of dynamical systems, differential topological arguments have been used in Hart et al. (2019) to establish Takens-type embedding results for Echo State Networks. This is an important result that justifies the forecasting abilities of RC that are empirically observed in this framework.
When filters are analytic, they admit a Taylor series expansion which coincides with the so-called discrete-time Volterra series representation Volterra (1930); Schetzen (1980); Rugh (1981); Priestley (1988) and, moreover, different Taylor remainders can be used to provide bounds on the approximation errors that are committed when those series are truncated. This path has been explicitly explored in Sandberg (1998a, 1999) for filters that are analytic with respect to the supremum norm and have inputs with a finite past. We extend this work and characterize the inputs for which an analytic fading memory reservoir filter with respect to a weighted norm admits a Volterra series representation with semi-infinite inputs. Additionally, we can use the causality and time-invariance hypotheses to show that the corresponding Volterra series representations have time-independent coefficients (a feature that is not available in the case studied in Sandberg (1999)) that automatically satisfy the convergence conditions spelled out in Sandberg (1998b,c). The availability of this series representation has important learning theoretical consequences since, as we shall show in a forthcoming publication, it implies that any analytic filter can be represented as a reservoir filter with a linear readout and a reservoir map that has been randomly generated using a well-specified distribution. This result is a corollary of the Volterra series representation presented later on in the paper combined with an adequately chosen version of the Johnson-Lindenstrauss Lemma. In a continuous-time setup the construction involves the use of the so-called signature process designed in Rough Path Theory.
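To fix ideas, a finite Volterra series with finite memory m and degree 2 acting on a scalar input has the form y_t = h_0 + Σ_i h_1(i) z_{t-i} + Σ_{i,j} h_2(i,j) z_{t-i} z_{t-j}, with indices running over 0 ≤ i, j < m. The sketch below evaluates such a truncation with arbitrary, randomly chosen kernels chosen purely for illustration; it is not a filter taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5                                  # finite memory window
h0 = 0.1                               # illustrative kernels of orders 0, 1, and 2
h1 = rng.normal(size=m)
h2 = rng.normal(size=(m, m)) / 10

def volterra2(z):
    """Evaluate the degree-2, memory-m Volterra series on a scalar sequence z."""
    y = []
    for t in range(m - 1, len(z)):
        win = z[t - m + 1 : t + 1][::-1]        # (z_t, z_{t-1}, ..., z_{t-m+1})
        y.append(h0 + h1 @ win + win @ h2 @ win)
    return np.array(y)

z = rng.normal(size=60)
y = volterra2(z)
# Time-invariant coefficients make the filter commute with time shifts:
shift_check = np.allclose(volterra2(z[1:]), volterra2(z)[1:])
```

Because the kernels h_0, h_1, h_2 do not depend on t, shifting the input simply shifts the output, which is the time-invariance property discussed above.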
These statements can be combined with the results in Grigoryeva and Ortega (2018a) to provide an alternative proof of the following Volterra series universality theorem, stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4): any time-invariant and causal fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.

The local nature of the differential allows the formulation of conditions that ensure both the local and global existence of differentiable and, in passing, fading memory solutions. These conditions are a novel strong contribution to the long line of research on the ESP and the FMP and, in particular, link to existing research Manjunath and Jaeger (2013) on the input-dependence of the echo state property. The metric nature of the differential allows us to measure the speed at which fading memory filters forget inputs. As we see later on in Theorem 26, we are able to characterize this important piece of information via the differentiability property.

Organization of the paper. The paper is organized as follows. The introductory Section 2 presents the causal and time-invariant filters and functionals that are at the center of this paper. In Section 2.1 we introduce the Banach sequence spaces in which the semi-infinite inputs and outputs of the reservoir systems that we study are defined, and we state various elementary facts about weighted and supremum norms. In Section 2.2 the notions of fading memory, continuity, and differentiability of maps between sequence spaces are carefully introduced. Section 2.3 focuses on causal and time-invariant filters defined on the sequence spaces introduced in Section 2.1. Those results are put to work in Section 2.4 to easily show well-known results that link the continuity of a filter whose input and output spaces are endowed with weighted norms with its asymptotic independence of the remote past input.
Starting from Section 3 the paper focuses on reservoir filters. The main result in this section is Theorem 7, which provides a sufficient (but not necessary) condition for the ESP and FMP to hold in the presence of inputs that are not necessarily bounded. This is a significant generalization with respect to the standard compactness conditions imposed in Jaeger (2010) or the uniform boundedness of the inputs that was required in similar results in, for instance, Grigoryeva and Ortega (2018b). An important observation in Theorem 7 is that for general inputs, the FMP depends on the weighting sequence that is used to define it; the theorem establishes that, roughly speaking, reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map. This newly introduced FMP condition is spelled out for several widely used families of reservoir systems. The above mentioned results involving uniform boundedness hypotheses can be obtained as a corollary (see Corollary 10) of the results in this section. Another statement that we prove (see Theorem 12) is that when the target of the reservoir map is a compact set, then the echo state property is guaranteed for any input, even though the FMP may obviously not hold in that case. Section 4 is the core of the paper and studies the differentiability properties of reservoir filters determined by differentiable reservoir maps. The main results are contained in Theorems 14 and 19. The first theorem provides an explicit and easy-to-verify sufficient condition for the ESP and the FMP to hold around a given input for which we know that the reservoir system associated to a differentiable reservoir map has a solution.
Theorem 19 is a global extension of the previous result that, unlike Theorems 7 and 14, fully characterizes the ESP and the differentiability (and hence the FMP) of the reservoir filter associated to a differentiable reservoir map. In Section 4.2 we show that the global conditions in Theorem 19 are much stronger than the local ones in Theorem 14 by introducing an example that shows how the ESP and the FMP are structural features of a reservoir system when considered globally but are mostly input-dependent when considered only locally. This important observation has already been made in Manjunath and Jaeger (2013) where, using tools coming from the theory of non-autonomous dynamical systems, sufficient conditions have been formulated (see, for instance, (Manjunath and Jaeger, 2013, Theorem 2)) that ensure the ESP in connection with a given specific input. The differentiability conditions that we impose on our reservoir systems allow us to draw similar conclusions and, additionally, to automatically establish the FMP of the resulting locally defined reservoir filters. In Section 4.3 we show how for globally differentiable reservoir filters we can formulate a non-uniform version of the well-known input forgetting property for FMP filters that we recalled in Section 2.3, for inputs that are not necessarily bounded. Moreover, a novel uniform differential version of that result is provided in Theorem 26. Section 5 contains two main results. First, Theorem 29 shows the availability of discrete-time Volterra series representations for analytic, causal, time-invariant, and FMP filters. This result extends a similar statement formulated in Sandberg (1998a, 1999) to inputs with a semi-infinite past that are not necessarily bounded.
Second, in Theorem 31, we combine the previous result with a universality statement in Grigoryeva and Ortega (2018a) to provide an alternative proof of the Volterra series universality theorem stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4).

The proofs of most results are provided in the appendices at the end of the paper.

2. Causal and time-invariant input/output systems

2.1. The input and output spaces

This paper studies input/output systems that are causal, that is, the output depends only on the past history of the input, and that, in general, have infinite memory. This makes us consider the spaces of left-infinite sequences with values in ℝ^n, that is, (ℝ^n)^{ℤ₋} = { z = (..., z_{-2}, z_{-1}, z_0) | z_i ∈ ℝ^n, i ∈ ℤ₋ }. Analogously, (D_n)^{ℤ₋} stands for the space of semi-infinite sequences with elements in the subset D_n ⊂ ℝ^n. The space ℝ^n will be considered as a normed space with a norm denoted by ‖·‖ which is not necessarily the Euclidean one (even though they are all equivalent), unless it is explicitly mentioned. We endow these infinite product spaces with the Banach space structures associated to one of the following two norms. First, the supremum norm ‖z‖_∞ := sup_{t∈ℤ₋} {‖z_t‖}. The symbol ℓ^∞₋(ℝ^n) is used to denote the Banach space formed by the elements that have a finite supremum norm. Second, given a strictly decreasing sequence w : ℕ → (0, 1] with zero limit and w_0 = 1 (a weighting sequence), we define the weighted norm ‖·‖_w on (ℝ^n)^{ℤ₋} associated to w by ‖z‖_w := sup_{t∈ℤ₋} {‖z_t‖ w_{-t}}. It can be shown (see Grigoryeva and Ortega (2018b)) that the set ℓ^w₋(ℝ^n) formed by the elements that have a finite w-weighted norm is a Banach space. Moreover, it is easy to show that ‖z‖_w ≤ ‖z‖_∞ for all z ∈ (ℝ^n)^{ℤ₋}. This implies that ℓ^∞₋(ℝ^n) ⊂ ℓ^w₋(ℝ^n) and that the inclusion map (ℓ^∞₋(ℝ^n), ‖·‖_∞) ↪ (ℓ^w₋(ℝ^n), ‖·‖_w) is continuous.
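For intuition, the two norms can be computed on a finite stretch of a left semi-infinite sequence. The geometric weighting sequence w_k = λ^k used below is just one convenient choice, and keeping only finitely many entries is a truncation assumed for the sketch.

```python
import numpy as np

def sup_norm(z):
    """||z||_inf = sup_t ||z_t|| for z stored as (z_{-T+1}, ..., z_{-1}, z_0)."""
    return max(np.linalg.norm(z_t) for z_t in z)

def weighted_norm(z, lam=0.9):
    """||z||_w = sup_{t<=0} ||z_t|| w_{-t} for the weighting sequence w_k = lam**k."""
    T = len(z)
    w = lam ** np.arange(T)                  # w_0 = 1 > w_1 > ... -> 0
    # z[-1] is z_0, so the entry z_{-k} = z[-1 - k] is paired with the weight w_k
    return max(np.linalg.norm(z[-1 - k]) * w[k] for k in range(T))

rng = np.random.default_rng(3)
z = rng.normal(size=(50, 2))                 # 50 most recent values of an R^2-valued input
```

Since every weight satisfies w_k ≤ 1, each weighted term is dominated by the corresponding unweighted one, which is exactly the inequality ‖z‖_w ≤ ‖z‖_∞ behind the continuous inclusion above.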
The Banach spaces (ℓ^∞₋(ℝ^n), ‖·‖_∞) and (ℓ^w₋(ℝ^n), ‖·‖_w) are particular cases of the weighted Banach sequence spaces (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}) with norms

‖z‖_{p,w} := ( Σ_{t∈ℤ₋} ‖z_t‖^p w_{-t} )^{1/p}, with 1 ≤ p < +∞, z ∈ (ℝ^n)^{ℤ₋}, and w a weighting sequence.   (2.1)

When p = +∞ we set ‖·‖_{p,w} := ‖·‖_w. We then define

ℓ^{p,w}₋(ℝ^n) := { z ∈ (ℝ^n)^{ℤ₋} | ‖z‖_{p,w} < +∞ }.   (2.2)

These spaces are defined in the literature (see, for instance, Rekić-Vuković et al. (2015); Gunawan et al. (2015)) without the requirement that w is a weighting sequence in the sense of the definition above. Indeed, the standard Banach spaces (ℓ^p₋(ℝ^n), ‖·‖_p), with 1 ≤ p ≤ +∞, are particular cases of (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}) obtained by taking as w the constant sequence w^ι given by w^ι_t := 1, for all t ∈ ℕ. This observation is used in the paper to obtain many results for the spaces ℓ^∞₋(ℝ^n) as particular cases of those proved for ℓ^w₋(ℝ^n). We emphasize that w^ι is not a weighting sequence and that the spaces (ℓ^w₋(ℝ^n), ‖·‖_w) considered in this paper are all based on sequences w of weighting type. It can be proved (see (Rekić-Vuković et al., 2015, Theorems 3.3 and 4.1 and Corollary 4.1)) that, in that case:

ℓ^{p,w}₋(ℝ^n) ⊂ ℓ^w₋(ℝ^n), for any 1 ≤ p < +∞,   (2.3)

and that

ℓ^p₋(ℝ^n) ⊂ ℓ^{p,w}₋(ℝ^n), for any 1 ≤ p ≤ +∞.   (2.4)

All the results in this paper are formulated for the weighted spaces (ℓ^w₋(ℝ^n), ‖·‖_w) even though many of the statements that we provide are also valid for (ℓ^∞₋(ℝ^n), ‖·‖_∞) and (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}). That will be explicitly pointed out in the statements or in remarks when it is the case. Appendix 6.1 contains a collection of results regarding the topologies induced by weighted and supremum norms.

2.2. FMP, continuity, and differentiability of maps on infinite sequence spaces

Much of this paper is related to the continuity and the differentiability of maps of the type f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N), with w₁, w₂ weighting sequences and W and V subsets of ℓ^{w₁}₋(ℝ^n) and ℓ^{w₂}₋(ℝ^N), respectively, which in the case of differentiable maps are necessarily open.
Maps that are continuous with respect to topologies generated by weighted norms will be generically referred to as fading memory maps (or we will say that they have the fading memory property (FMP)), while when the topology considered is generated by the supremum norm, we just say that the map is continuous. Most of the definitions that we provide in what follows for the weighted norm case can be adapted to the supremum norm case by replacing the weighting sequences by the constant sequence w^ι given by w^ι_t := 1, for all t ∈ ℕ. Suppose now that W and V are open subsets. The map f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N) is (Fréchet) differentiable at u_0 ∈ W when there exists a bounded linear map Df(u_0) : ℓ^{w₁}₋(ℝ^n) → ℓ^{w₂}₋(ℝ^N) that satisfies

lim_{u→u_0} ‖f(u) − f(u_0) − Df(u_0)·(u − u_0)‖_{w₂} / ‖u − u_0‖_{w₁} = 0.   (2.5)

We say that f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N) is of class C¹(W) when it is differentiable at any point in W and the induced map Df : W → L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is continuous, where the space of linear maps L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is endowed with the operator norm |||·|||_{w₁,w₂} defined by

|||A|||_{w₁,w₂} := sup_{u≠0} { ‖A·u‖_{w₂} / ‖u‖_{w₁} },  A ∈ L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)).   (2.6)

When in the domain and the range we use the same weighting sequence w, we will write |||A|||_w instead of |||A|||_{w₁,w₂}. The higher order derivatives D^r f(u_0) : ℓ^{w₁}₋(ℝ^n) × ⋯ × ℓ^{w₁}₋(ℝ^n) (r times) → ℓ^{w₂}₋(ℝ^N), r ∈ ℕ⁺, are inductively defined, and the map f is said to be of class C^r(W) when it is r-times differentiable at any point in W and the induced map D^r f : W → L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) into the normed space of r-multilinear maps is continuous. We recall that the operator norm |||·|||_{w₁,w₂} in L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is given by

|||A|||_{w₁,w₂} := sup { ‖A(u_1, ..., u_r)‖_{w₂} / (‖u_1‖_{w₁} ⋯ ‖u_r‖_{w₁}) | u_1, ..., u_r ∈ ℓ^{w₁}₋(ℝ^n), u_1, ..., u_r ≠ 0 },  A ∈ L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)).   (2.7)

We recall that differentiable functions are automatically continuous, and we denote the class of continuous functions by C⁰(W). When f is of class C^r(W) for any r ∈ ℕ⁺, we say that f is smooth in W and we denote this class by C^∞(W).
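The limit (2.5) can be probed numerically on a truncated weighted sequence space. The map below, f(u)_t = tanh(u_t) applied entrywise, and its candidate derivative Df(u_0)·v = (1 − tanh²(u_0))·v are our own illustrative choices, not objects from the paper; the point is that the quotient in (2.5) shrinks with the size of the perturbation.

```python
import numpy as np

T, lam = 40, 0.9
# Entries stored as (z_{-T+1}, ..., z_0); the entry z_{-k} gets the weight lam**k.
w = lam ** np.arange(T)[::-1]

def wnorm(z):
    """Truncated weighted sup norm ||z||_w."""
    return np.max(np.abs(z) * w)

f = np.tanh                                      # f(u)_t = tanh(u_t), entrywise
df = lambda u0, v: (1 - np.tanh(u0) ** 2) * v    # candidate Frechet derivative at u0

rng = np.random.default_rng(4)
u0 = rng.normal(size=T)
v = rng.normal(size=T)

def quotient(h):
    """The expression inside the limit (2.5) for the perturbation u = u0 + h v."""
    return wnorm(f(u0 + h * v) - f(u0) - df(u0, h * v)) / wnorm(h * v)

q_coarse, q_fine = quotient(1e-1), quotient(1e-3)   # decreases as h -> 0
```

The quotient scales roughly linearly in h here because the remainder of a twice-differentiable scalar map is quadratic in the perturbation.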
When f is smooth in W we can construct for it a Taylor power series expansion. We say that f is analytic in W when the convergence domain of that power series includes W. The analytic class is denoted by C^ω(W). It can be shown (see Lemma 32 in the appendices) that for any weighting sequence w, any open set in (ℓ^w₋(ℝ^n), ‖·‖_w) contains unbounded sequences. For instance, let B_w(0, ϵ) be the ball of radius ϵ > 0 around the zero sequence and let v ∈ ℝ^n be a vector such that ‖v‖ = 1. The divergent sequence z defined by z_t := ϵ v/(2 w_{-t}) is such that ‖z‖_w = ϵ/2 and hence z ∈ B_w(0, ϵ) ⊂ ℓ^w₋(ℝ^n).

2.3. Causal and time-invariant filters and functionals

Let D_n ⊂ ℝ^n and D_N ⊂ ℝ^N. We refer to maps of the type U : (D_n)^ℤ → (D_N)^ℤ as filters or operators, and to those like H : (D_n)^ℤ → ℝ^N (or H : (D_n)^{ℤ₋} → ℝ^N) as ℝ^N-valued functionals. These definitions can be easily extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets V_n and V_N of (ℝ^n)^ℤ and (ℝ^N)^ℤ like, for instance, ℓ^∞(ℝ^n) and ℓ^∞(ℝ^N), or ℓ^w(ℝ^n) and ℓ^w(ℝ^N), for some weighting sequence w. A filter U : (D_n)^ℤ → (D_N)^ℤ is called causal when for any two elements z, w ∈ (D_n)^ℤ that satisfy z_τ = w_τ for all τ ≤ t, for a given t ∈ ℤ, we have that U(z)_t = U(w)_t. Let T^ℤ_τ : (ℝ^n)^ℤ → (ℝ^n)^ℤ be the time delay operator defined by T^ℤ_τ(z)_t := z_{t−τ}, τ ∈ ℤ. A subset V_n ⊂ (ℝ^n)^ℤ is called time-invariant when T^ℤ_τ(V_n) = V_n, for all τ ∈ ℤ. The filter U is called time-invariant when it is defined on a time-invariant set and commutes with the time delay operator, that is, T^ℤ_τ ∘ U = U ∘ T^ℤ_τ, for any τ ∈ ℤ (in this expression, the two operators T^ℤ_τ have to be understood as defined in the appropriate sequence spaces). We recall that there is a bijection between causal time-invariant filters and functionals on (D_n)^{ℤ₋}.
Indeed, given a causal and time-invariant filter $U : (D_n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$, we can associate to it a functional $H_U : (D_n)^{\mathbb{Z}_-} \to \mathbb{R}^N$ via the assignment $H_U(z) := U(z^e)_0$, where $z^e \in (D_n)^{\mathbb{Z}}$ is an arbitrary extension of $z \in (D_n)^{\mathbb{Z}_-}$ to $(D_n)^{\mathbb{Z}}$. Conversely, for any functional $H : (D_n)^{\mathbb{Z}_-} \to \mathbb{R}^N$, we can define a time-invariant causal filter $U_H : (D_n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$ by $U_H(z)_t := H\left((P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t})(z)\right)$, where $T^{\mathbb{Z}}_{-t}$ is the $(-t)$-time delay operator and $P_{\mathbb{Z}_-} : (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is the natural projection. Moreover, when considering causal and time-invariant filters $U : (D_n)^{\mathbb{Z}} \to (D_N)^{\mathbb{Z}}$ it suffices to work just with the restriction $U : (D_n)^{\mathbb{Z}_-} \to (D_N)^{\mathbb{Z}_-}$, that we denote with the same symbol, since the latter uniquely determines the former. Indeed, by definition, for any $z \in (D_n)^{\mathbb{Z}}$ and $t \in \mathbb{N}^+$:
$$U(z)_t = T^{\mathbb{Z}}_{-t}\left(U(z)\right)_0 = U\left(T^{\mathbb{Z}}_{-t}(z)\right)_0, \qquad (2.8)$$
where the second equality holds by the time-invariance of $U$ and the value on the right-hand side depends only on $P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(z) \in (D_n)^{\mathbb{Z}_-}$, by causality. In view of this observation, we restrict our study to filters with domain and target in the spaces of left semi-infinite sequences. In particular, we say that a causal and time-invariant filter $U$ has the fading memory property, or that it is continuous, when the corresponding restricted filter defined on left semi-infinite inputs has those properties, as we defined them in Section 2.2.

Additionally, from now on we consider most of the time delay operators with domain and target in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, and we simply denote them as $T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$. The definition of these restricted time delay operators $T_\tau$ requires considering two cases:

$T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ with $\tau$ negative: as before, $T_\tau(z)_t := z_{t+\tau}$, for any $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$. This implies that, in this case,
$$T_\tau(z) = P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-\tau}(z^e), \quad z \in (\mathbb{R}^n)^{\mathbb{Z}_-}, \ \tau < 0,$$
where $z^e \in (\mathbb{R}^n)^{\mathbb{Z}}$ is an arbitrary extension of $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ to $(\mathbb{R}^n)^{\mathbb{Z}}$. The map $T_\tau$, $\tau \in \mathbb{Z}_-$, is surjective, that is, $T_\tau\left((\mathbb{R}^n)^{\mathbb{Z}_-}\right) = (\mathbb{R}^n)^{\mathbb{Z}_-}$, but it is not injective. The same applies to the restriction of $T_\tau$ to any time-invariant set $V_n \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ which satisfies $T_\tau(V_n) = V_n$.
$T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ with $\tau$ positive: there is in principle not a unique way to define the restricted operators $T_\tau$, since that involves the choice of vectors $v_\tau \in (\mathbb{R}^n)^\tau$ such that $T_\tau(z) := (z, v_\tau)$, for any $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. The choice $v_\tau = 0$ for all $\tau > 0$ is canonical, since it is the only one that makes the resulting maps linear and, additionally, satisfy $T_\tau = \underbrace{T_1 \circ \cdots \circ T_1}_{\tau \text{ times}}$. We hence adopt the definition
$$T_\tau(z) := (z, \underbrace{0, \ldots, 0}_{\tau \text{ times}}), \quad z \in (\mathbb{R}^n)^{\mathbb{Z}_-}, \ \tau > 0,$$
for the rest of the paper. In this case $T_\tau$ is injective but not surjective.

The following lemma gathers some differentiability properties of projections and of time delay operators restricted to normed sequence spaces that will be used later on. A key element in this result is what we call, for each weighting sequence $w$, its decay ratio $D_w$ and inverse decay ratio $L_w$, defined as:
$$D_w := \sup_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} \quad \text{and} \quad L_w := \sup_{t \in \mathbb{N}} \left\{ \frac{w_t}{w_{t+1}} \right\}.$$
As $w$ is by definition strictly decreasing, we necessarily have $0 < w_{t+1}/w_t < 1$, for all $t \in \mathbb{N}$, and $1 < w_0/w_1 \le \sup_{t \in \mathbb{N}} \{w_t/w_{t+1}\} = L_w$. Consequently:
$$0 < D_w \le 1 \quad \text{and} \quad 1 < L_w \le +\infty.$$
The decay ratios provide a geometric bound for the convergence speed of $w$ and the divergence rate of $w^{-1}$. Indeed, it is easy to see that
$$w_t \le D_w^t \quad \text{and} \quad 1/w_t \le L_w^t, \quad \text{for any } t \in \mathbb{N}. \qquad (2.10)$$
Additionally, the fact that for all $t \in \mathbb{N}$ we have $1 < w_t/w_{t+1}$ and $0 < w_{t+1}/w_t < 1$ implies that
$$L_w = \sup_{t \in \mathbb{N}} \left\{ \frac{w_t}{w_{t+1}} \right\} = 1 \Big/ \inf_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} \ge 1 \Big/ \sup_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} = \frac{1}{D_w},$$
which implies that
$$L_w D_w \ge 1. \qquad (2.11)$$
More generally, in relation with the power weighting sequences that we discussed in Lemma 35, we have that:
$$0 < D_{w^n} \le D_w \le D_{w^{1/m}} \le 1 \quad \text{and} \quad 1 < L_{w^{1/m}} \le L_w \le L_{w^n} \le +\infty, \quad \text{for any } m, n \in \mathbb{N}^+. \qquad (2.12)$$

Lemma 1 Let $w$ be a weighting sequence and $n \in \mathbb{N}^+$. Then:

(i) The projections $p_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to (\mathbb{R}^n, \|\cdot\|)$, $t \in \mathbb{Z}_-$, given by $p_t(z) := z_t$, $z \in \ell^w_-(\mathbb{R}^n)$, are linear, smooth, and hence continuous. Moreover, $|||p_t|||_w = 1/w_{-t}$.

(ii) Consider the restriction of the time delay operator $T_t$ to $\ell^w_-(\mathbb{R}^n)$ for any $t \in \mathbb{Z}$. We consider two cases. First, if $t < 0$ and the inverse decay ratio $L_w$ of $w$ is finite, then $T_t$ maps into $\ell^w_-(\mathbb{R}^n)$, that is, $\ell^w_-(\mathbb{R}^n)$ is $T_t$-invariant, and $T_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is surjective, open, and a submersion, that is, $\ker T_t$ is a split subspace of $\ell^w_-(\mathbb{R}^n)$. If $t > 0$, then $\ell^w_-(\mathbb{R}^n)$ is always $T_t$-invariant. $T_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is in that case an immersion, that is, it is injective and its image $\operatorname{Im} T_t$ is split. Moreover, for any $t > 0$, $T_{-t} \circ T_t = \mathbb{I}_{\ell^w_-(\mathbb{R}^n)}$, and in both cases the maps $T_t$ are linear, smooth, and hence continuous. Additionally,
$$|||T_{-1}|||_w = L_w, \quad |||T_1|||_w = D_w, \quad |||T_t|||_w \le L_w^{-t}, \quad \text{and} \quad |||T_{-t}|||_w \le D_w^{-t}, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (2.13)$$

(iii) For any $t_1, t_2 \in \mathbb{Z}_-$ we have
$$p_{t_1 + t_2} = p_{t_1} \circ T_{t_2} = p_{t_2} \circ T_{t_1}. \qquad (2.14)$$

These statements also hold true when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$. In that case one has to take as sequence $w$ the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$, and $L_w$ and $D_w$ are replaced by the constant $1$.

Remark 2 The decay ratios are easy to compute for many families of weighting sequences. Two cases that we frequently encounter are:

(i) Geometric sequence: $w_t := \lambda^t$, $t \in \mathbb{N}$, with $0 < \lambda < 1$. In this case:
$$L_w = \sup_{t \in \mathbb{N}} \left\{ \frac{\lambda^t}{\lambda^{t+1}} \right\} = \frac{1}{\lambda} > 1 \quad \text{and} \quad D_w = \sup_{t \in \mathbb{N}} \left\{ \frac{\lambda^{t+1}}{\lambda^t} \right\} = \lambda.$$

(ii) Harmonic sequence: $w_t := 1/(1 + t d)$, $t \in \mathbb{N}$, with $d > 0$. In this case $D_w = 1$ and $L_w = 1 + d$.

We emphasize that the finiteness of the inverse decay ratio is not guaranteed for all weighting sequences. An example that illustrates this fact is the sequence $w_t := \exp(-t^2)$. It is easy to verify that in that case $L_w = +\infty$ and $D_w = 1/e$.

Remark 3 The inequalities (2.13) can be combined with Gelfand's formula (Lax, 2002, page 195) to provide bounds for the spectral radii $\rho(T_t)$ and $\rho(T_{-t})$ for all $t \in \mathbb{Z}_-$. Indeed,
$$\rho(T_t) = \lim_{n \to \infty} |||T_t^n|||_w^{1/n} \le \lim_{n \to \infty} \left(L_w^{-tn}\right)^{1/n} = L_w^{-t}, \quad \text{with } t \in \mathbb{Z}_-.$$
Analogously, one shows that $\rho(T_{-t}) \le D_w^{-t}$.

Remark 4 Lemma 1 remains valid when instead of the spaces $\ell^w_-(\mathbb{R}^n)$ we use the spaces $\ell^{p,w}_-(\mathbb{R}^n)$ that we introduced in Section 2.1, for any $1 \le p < +\infty$.
In that case, and for any $t \in \mathbb{Z}_-$,
$$|||p_t|||_{p,w} = \frac{1}{w_{-t}^{1/p}}, \qquad (2.15)$$
$$|||T_{-1}|||_{p,w} = L_w^{1/p}, \quad |||T_1|||_{p,w} = D_w^{1/p}, \quad |||T_t|||_{p,w} \le L_w^{-t/p}, \quad \text{and} \quad |||T_{-t}|||_{p,w} \le D_w^{-t/p}, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (2.16)$$

Remark 5 Some of the properties of time delay operators that we just studied have interesting interpretations in a Hilbert space context. See Lindquist and Picci (2015) for a detailed study.

2.4. The fading memory property and remote past input independence

The properties of time delay operators that we enunciated in Lemma 1 allow us to show how the fading memory property, defined as the continuity of a filter linking input and output spaces endowed with weighted norms (see Section 2.1), can be interpreted as its asymptotic independence on the remote past input (Wiener, 1958, page 89). Analogously, we can see that the FMP amounts to the attribute that, in the words of Volterra (Volterra, 1930, page 188), "the influence of the input a long time before the given moment fades out". This property has also been characterized as a unique steady-state property in Boyd and Chua (1985) and referred to as the input forgetting property in Jaeger (2010). All these characterizations were proved under various compactness and/or uniform boundedness hypotheses on the inputs. The next result establishes that property as a straightforward corollary of Lemma 1; later on, in Section 4.3, it will be generalized to situations where the inputs are eventually unbounded.

In the following statement we will be using the following notation: given the sequences $u \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $v \in (\mathbb{R}^n)^t$, $t \in \mathbb{N}$, the symbol $uv \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ denotes the concatenation of $u$ and $v$.

Theorem 6 (FMP and the uniform input forgetting property) Let $M, L > 0$, $n, N \in \mathbb{N}^+$, and let $K_M \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$, $K_L \subset (\mathbb{R}^N)^{\mathbb{Z}_-}$ (respectively, $K_M^+ \subset (\mathbb{R}^n)^{\mathbb{N}^+}$, $K_L^+ \subset (\mathbb{R}^N)^{\mathbb{N}^+}$) be the sets of uniformly bounded left (respectively, right) semi-infinite sequences defined in (1.3). Let $U : K_M \to K_L$ be a causal and time-invariant fading memory filter.
Then, for any $u, v \in K_M$ and $z \in K_M^+$ we have that
$$\lim_{t \to +\infty} \left\| U(uz)_t - U(vz)_t \right\| = 0, \qquad (2.17)$$
where in this expression the filter $U$ is defined by time-invariance on positive times using (2.8). The convergence in (2.17) is uniform on $u$, $v$, and $z$, in the sense that there exists a monotonically decreasing sequence $w^U$ with zero limit such that for all $u, v \in K_M$, $z \in K_M^+$, and $t \in \mathbb{N}$,
$$\left\| U(uz)_t - U(vz)_t \right\| \le w^U_t. \qquad (2.18)$$
Filters that satisfy condition (2.17) for any $u, v \in K_M$ and $z \in K_M^+$ are said to have the input forgetting property, and we refer to (2.18) as the uniform input forgetting property.

Proof. We start by recalling that in the presence of uniformly bounded inputs, the FMP can be characterized as the continuity of the map $U : K_M \to K_L$ with the sets $K_M$ and $K_L$ endowed with the relative topology induced either by the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ and $(\mathbb{R}^N)^{\mathbb{Z}_-}$, respectively, or by the weighted norms in the spaces $\ell^w_-(\mathbb{R}^n)$ and $\ell^w_-(\mathbb{R}^N)$, with $w$ any weighting sequence (see (Grigoryeva and Ortega, 2018b, Corollary 2.7 and Proposition 2.11)). Moreover, the sets $K_M$ and $K_L$ are compact in this topology (Grigoryeva and Ortega, 2018b, Corollary 2.8) and hence the FMP filter $U : K_M \to K_L$ is not only continuous but also uniformly continuous. Consequently, once we have fixed a weighting sequence $w$, an increasing modulus of continuity $\omega_U : \mathbb{R}^+ \to \mathbb{R}^+$ can be associated to the map $U : (K_M, \|\cdot\|_w) \to (K_L, \|\cdot\|_w)$. We emphasize that $\omega_U$ depends on $w$ since it is a metric and not a purely topological notion.

Now, using (2.8) and an arbitrary weighting sequence $w$ that we choose with $D_w < 1$, we can write for any $t \in \mathbb{N}$
$$\left\| U(uz)_t - U(vz)_t \right\| = \left\| U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right)_0 - U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right)_0 \right\| = \left\| p_0\left( U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right) \right) - p_0\left( U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right) \right) \right\| \le \left\| U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right) - U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right) \right\|_w, \qquad (2.19)$$
where we used that $|||p_0|||_w = 1$ by the first part of Lemma 1. We now notice that
$$P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz) = T_t(u) + (\ldots, 0, z_1, \ldots, z_t) \quad \text{and} \quad P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz) = T_t(v) + (\ldots, 0, z_1, \ldots, z_t),$$
which substituted in (2.19) and using the second part of Lemma 1 yields
$$\left\| U(uz)_t - U(vz)_t \right\| \le \omega_U\left( \left\| T_t(u - v) \right\|_w \right) \le \omega_U\left( |||T_t|||_w \, \|u - v\|_w \right) \le \omega_U\left( D_w^t \, \|u - v\|_w \right) \le \omega_U\left( 2 M D_w^t \right). \qquad (2.20)$$
Now, as $w$ has been chosen so that $D_w < 1$ and $\lim_{s \to 0} \omega_U(s) = 0$, we set $w^U_t := \omega_U\left(2 M D_w^t\right)$, and we have that
$$\lim_{t \to +\infty} w^U_t = \lim_{t \to +\infty} \omega_U\left(2 M D_w^t\right) = 0, \qquad (2.21)$$
which, using the inequality (2.20), proves the claim.

3. The fading memory property in reservoir filters with unbounded inputs

Starting in this section we focus on filters defined by reservoir systems of the type introduced in (1.1)–(1.2), but this time we consider reservoir maps $F : D_N \times D_n \to D_N$ where the input variable takes values on a set $D_n \subset \mathbb{R}^n$ that is not necessarily bounded. All along this section, the reservoir map $F$ will be assumed to be continuous and a contraction on the first entry with constant $0 < c < 1$, that is,
$$\left\| F(x_1, z) - F(x_2, z) \right\| \le c \left\| x_1 - x_2 \right\|, \quad \text{for all } x_1, x_2 \in D_N \text{ and } z \in D_n.$$
When the inputs are assumed to be uniformly bounded by a constant $M > 0$ and $F$ maps into a ball $\overline{B}(0, L) \subset \mathbb{R}^N$, $L > 0$, it has been proved (see (Grigoryeva and Ortega, 2018b, Proposition 2.1 and Theorem 3.1)) that we can associate to this system unique filters $U^F : K_M \to K_L$ and $U^F_h : K_M \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ (the sets $K_M$ and $K_L$ are introduced in (1.3)) that are causal, time-invariant, continuous and, moreover, satisfy the fading memory property with respect to any weighting sequence $w$. We recall that $U^F$ is the filter associated to the solutions of the reservoir equation (1.1) and assigns to any input sequence $z \in K_M$ the output $U^F(z)$ that satisfies
$$U^F(z)_t = F\left(U^F(z)_{t-1}, z_t\right), \quad \text{for any } t \in \mathbb{Z}_-. \qquad (3.1)$$
Recall also that $U^F_h : K_M \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ is the filter associated to the full system (1.1)–(1.2) and is given by $U^F_h := h \circ U^F$. We denote by $H^F : K_M \to \overline{B}(0, L)$ and $H^F_h : K_M \to \mathbb{R}^d$ the corresponding reservoir functionals.
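The input forgetting property of Theorem 6 can be watched at work numerically. The sketch below is a toy contractive reservoir of our own design (the `tanh` map, the sizes, and the constant $c = 0.6$ are all assumptions, not constructions from the paper): two states summarizing different remote pasts are driven by the same recent input, and the gap between the resulting trajectories shrinks at least geometrically at rate $c$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contractive reservoir map F(x, z) = tanh(A x + C z): since |tanh'| <= 1
# and ||A||_2 = 0.6, F is a contraction on the first entry with c = 0.6.
N, n = 10, 2
A = rng.standard_normal((N, N))
A *= 0.6 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, n))

def run(x, zs):
    """Iterate x_t = F(x_{t-1}, z_t) and record the states."""
    out = []
    for z in zs:
        x = np.tanh(A @ x + C @ z)
        out.append(x)
    return out

z = rng.uniform(-1, 1, size=(60, n))      # shared recent input
xu = run(rng.standard_normal(N), z)       # state encoding one remote past
xv = run(rng.standard_normal(N), z)       # state from a different remote past

gaps = [np.linalg.norm(a - b) for a, b in zip(xu, xv)]
print(gaps[0], gaps[-1])                  # the gap shrinks at least like 0.6^t
```

The contraction property guarantees `gaps[t+1] <= 0.6 * gaps[t]` at every step, which is exactly the uniform forgetting bound (2.18) with $w^U_t$ proportional to $0.6^t$.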
The reservoir functionals are related to the corresponding reservoir filters via the identities:
$$H^F(z) = U^F(z)_0 = F\left(U^F(z)_{-1}, z_0\right) \quad \text{and} \quad H^F_h(z) = h\left(U^F(z)_0\right), \qquad (3.2)$$
for all $z \in K_M$.

The next theorem is the most important result in this section and shows that the results that we just recalled about the ESP and the FMP for reservoir filters with uniformly bounded inputs remain valid in the presence of unbounded inputs. However, in that case, the fading memory property depends on the weighting sequence that is used to define it. The sufficient condition for the FMP spelled out in the next theorem asserts, roughly speaking, that reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map.

Theorem 7 (ESP and FMP with continuous reservoir maps) Let $F : D_N \times D_n \to D_N$ be a continuous reservoir map, where $D_n \subset \mathbb{R}^n$, $D_N \subset \mathbb{R}^N$, $n, N \in \mathbb{N}^+$. Assume, additionally, that it is a contraction on the first entry with constant $0 < c < 1$. Let $w$ be a weighting sequence with finite inverse decay ratio $L_w$ and let $V_n \subset (D_n)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^n)$ be a time-invariant set. We consider two situations regarding the target $D_N$ of the reservoir map:

(i) $D_N$ is a compact subset of $\mathbb{R}^N$.

(ii) $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a complete subset of the Banach space $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$, $F$ is Lipschitz continuous, and the reservoir system (1.1) associated to $F$ has a solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$.

In both cases, if
$$c \, L_w < 1, \qquad (3.3)$$
then the reservoir system associated to $F$ with inputs in $V_n$ has the echo state property and hence determines a unique continuous, causal, and time-invariant reservoir filter $U^F : (V_n, \|\cdot\|_w) \to \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ that has the fading memory property with respect to $w$.
Moreover, if $F$ is Lipschitz on the second component (which is always the case under the hypotheses in (ii)) with constant $L_z$, that is, $\|F(x, z_1) - F(x, z_2)\| \le L_z \|z_1 - z_2\|$, for any $x \in D_N$, $z_1, z_2 \in D_n$, then $U^F$ is also Lipschitz with constant
$$L_{U^F} := \frac{L_z}{1 - c \, L_w}. \qquad (3.4)$$
This statement also holds true under the hypotheses in part (ii) when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$. In that case $L_w$ is replaced by the constant $1$ and hence condition (3.3) is automatically satisfied. The resulting reservoir filter $U^F : (V_n, \|\cdot\|_\infty) \to \left((D_N)^{\mathbb{Z}_-} \cap \ell^\infty_-(\mathbb{R}^N), \|\cdot\|_\infty\right)$ is continuous.

Remark 8 A very common situation that provides the solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$ for the reservoir system needed in part (ii) is the existence of a fixed point $(x^0, z^0) \in D_N \times D_n$ of $F$ that satisfies $F(x^0, z^0) = x^0$. In that case the required solution is given by the constant sequences $x^0_t = x^0$, $z^0_t = z^0$, for all $t \in \mathbb{Z}_-$.

Remark 9 If the target $D_N$ of the reservoir map is a closed subset of $\mathbb{R}^N$, that is, $D_N = \overline{D_N}$, then by part (iii) of Corollary 33 the set $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a closed subset of $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ and it is hence necessarily complete, as required in case (ii) of the theorem. Moreover, if $D_N$ is closed and $V_n$ contains a constant sequence $z_0$, then the condition on the existence of a solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$ is automatically satisfied. Indeed, let $\overline{z} \in \mathbb{R}^n$ be such that $(z_0)_t := \overline{z}$ for all $t \in \mathbb{Z}_-$ and let $x \in D_N$ be arbitrary. Consider the sequence $\{x, F(x, \overline{z}), F(F(x, \overline{z}), \overline{z}), F(F(F(x, \overline{z}), \overline{z}), \overline{z}), \ldots\}$. The Banach Contraction-Mapping Principle (see (Shapiro, 2016, Theorem 3.2)) guarantees that this sequence converges to the unique fixed point, call it $\overline{x} \in D_N$, of the map $F(\cdot, \overline{z})$. The pair $(x^0, z_0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$, with $(x^0)_t := \overline{x}$ for all $t \in \mathbb{Z}_-$, is the solution needed in case (ii) of the theorem.

As a corollary of Theorem 7, it can be shown that reservoir systems that have by construction uniformly bounded inputs and outputs always have the ESP and the FMP and that, moreover, the FMP holds for any weighting sequence $w$.
This result was already shown in (Grigoryeva and Ortega, 2018b, Theorem 3.1).

Corollary 10 Let $M, L > 0$, let $K_M \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $K_L \subset (\mathbb{R}^N)^{\mathbb{Z}_-}$ be subsets of uniformly bounded sequences defined as in (1.3), and let $F : \overline{B}(0, L) \times \overline{B}(0, M) \to \overline{B}(0, L)$ be a continuous reservoir map. Assume, additionally, that $F$ is a contraction on the first entry with constant $0 < c < 1$. Then, the reservoir system associated to $F$ has the echo state property. Moreover, this system has a unique associated causal and time-invariant filter $U^F : K_M \to K_L$ that has the fading memory property with respect to any weighting sequence $w$.

Proof. Given that $\overline{B}(0, L)$ is a compact subset of $\mathbb{R}^N$, the hypothesis in part (i) of Theorem 7 and condition (3.3) guarantee that there exists a reservoir filter $U^F : K_M \to K_L$ associated to $F$ that has the fading memory property with respect to any weighting sequence that satisfies (3.3). Such a sequence always exists, as it suffices to take any geometric sequence $w_t := \lambda^t$, $t \in \mathbb{N}$, with $c < \lambda < 1$. However, as has been shown in (Grigoryeva and Ortega, 2018b, Corollary 2.7), all the weighted norms induce in the sets $K_M$ and $K_L$ the same topology, namely, the product topology, and hence if $U^F$ is continuous with respect to the topology induced by the weighted norm $\|\cdot\|_w$ then so is it with respect to the norm associated to any other weighting sequence.

Remark 11 This corollary shows that, in general, condition (3.3) is sufficient but not necessary. Indeed, if the hypotheses in the corollary are satisfied, the resulting filter $U^F$ has the fading memory property with respect to any geometric sequence $w_t := \lambda^t$, $t \in \mathbb{N}$, with $0 < \lambda < 1$, for which (see Remark 2) $L_w = 1/\lambda$. In particular, this holds true when $\lambda$ is chosen so that $0 < \lambda < c$, and hence when (3.3) is not satisfied, since in that case $c \, L_w > 1$. Additional concrete examples that show that condition (3.3) is sufficient but not necessary are provided in Section 3.1.
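The interplay in Remark 11 between the contraction constant and the choice of a geometric weighting sequence, together with the linear reservoir maps discussed in Section 3.1 below, can be illustrated numerically. The following sketch is a toy instance of our own (the matrices, sizes, and truncation are assumptions): it checks that the series $U(z)_t = \sum_{j \ge 0} A^j C z_{t-j}$ solves the linear reservoir recursion, and that a geometric $w_t = \lambda^t$ with $c < \lambda < 1$ satisfies the FMP condition $c \, L_w < 1$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear reservoir map F(x, z) = A x + C z with c = ||A||_2 = 0.7 < 1.
N, n, T = 6, 2, 300
A = rng.standard_normal((N, N))
A *= 0.7 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, n))
z = rng.uniform(-1, 1, size=(T, n))

# With a zero remote past, the solution of x_t = A x_{t-1} + C z_t is the
# truncated series U(z)_t = sum_j A^j C z_{t-j}; compare with the recursion.
series = sum(np.linalg.matrix_power(A, j) @ C @ z[T - 1 - j] for j in range(T))
x = np.zeros(N)
for zt in z:
    x = A @ x + C @ zt
print(np.linalg.norm(series - x))     # ~0: the series solves the recursion

# A geometric weighting sequence w_t = lam^t with c < lam < 1 has L_w = 1/lam,
# so the sufficient FMP condition c * L_w < 1 of Theorem 7 holds.
c, lam = 0.7, 0.9
print(c * (1.0 / lam) < 1)            # True
```

Choosing instead $\lambda < c$ gives $c \, L_w > 1$, yet Corollary 10 still guarantees the FMP for bounded inputs, which is the sense in which (3.3) is sufficient but not necessary.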
We emphasize that the FMP condition (3.3) is sufficient but not necessary even in the absence of boundedness conditions like those in Corollary 10.

Another important statement that can be proved when the target of the reservoir map is a compact subset of $\mathbb{R}^N$ is that the echo state property is, in that situation, guaranteed for any input¹ in $(D_n)^{\mathbb{Z}_-}$, even though the FMP may obviously not hold in that case.

Theorem 12 (ESP for reservoir maps with compact target) Let $F : D_N \times D_n \to D_N$ be a continuous reservoir map, $D_n \subset \mathbb{R}^n$, $D_N \subset \mathbb{R}^N$, $n, N \in \mathbb{N}^+$, such that $D_N$ is a compact subset of $\mathbb{R}^N$ and $F$ is a contraction on the first entry with constant $0 < c < 1$. Then, the reservoir system associated to $F$ has the echo state property for any input in $(D_n)^{\mathbb{Z}_-}$. Let $U^F : (D_n)^{\mathbb{Z}_-} \to (D_N)^{\mathbb{Z}_-}$ be the associated reservoir filter. For any weighting sequence $w$ such that $c \, L_w < 1$, the map $U^F : (D_n)^{\mathbb{Z}_-} \to \left((D_N)^{\mathbb{Z}_-}, \|\cdot\|_w\right)$ is continuous when in $(D_n)^{\mathbb{Z}_-}$ we consider the relative topology induced by the product topology in $(\mathbb{R}^n)^{\mathbb{Z}_-}$. Moreover, if $(D_n)^{\mathbb{Z}_-} \subset \ell^w_-(\mathbb{R}^n)$ then $U^F$ has the fading memory property.

The following result shows how the FMP of the filter associated to a reservoir map established in Theorem 7 propagates to the FMP of the filter of the full reservoir system if the readout map is continuous.

Corollary 13 In the conditions of Theorem 7, let $h : D_N \to \mathbb{R}^d$ be a continuous readout map. Consider the following two cases, that correspond to the two sets of hypotheses studied in Theorem 7:

(i) If $D_N$ is a compact subset of $\mathbb{R}^N$, then there is a constant $R > 0$ such that the filter $U^F_h$ defined by $U^F_h(z)_t := h\left(U^F(z)_t\right)$, $t \in \mathbb{Z}_-$, $z \in V_n$, maps $U^F_h : (V_n, \|\cdot\|_w) \to (K_R, \|\cdot\|_w)$ and has the fading memory property.

(ii) If $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a complete subset of $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ and $h$ is Lipschitz continuous on $D_N$ and such that $U^F_h(z^0) \in \ell^w_-(\mathbb{R}^d)$, then the reservoir filter $U^F_h : (V_n, \|\cdot\|_w) \to \left(\ell^w_-(\mathbb{R}^d), \|\cdot\|_w\right)$ has the fading memory property.

This statement also holds true under the hypotheses in part (ii) when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$.
The resulting reservoir filter $U^F_h : (V_n, \|\cdot\|_\infty) \to \left(\ell^\infty_-(\mathbb{R}^d), \|\cdot\|_\infty\right)$ is continuous.

3.1. Examples

In the following paragraphs we show how the sufficient condition (3.3) explicitly looks for reservoir systems that are widely used and that have been shown to have universality properties in the fading memory category, both with deterministic and stochastic inputs Grigoryeva and Ortega (2018a,b); Gonon and Ortega (2018).

Linear reservoir maps. Consider the reservoir map $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ given by
$$F(x, z) = A x + c z, \quad \text{with } A \in \mathbb{M}_N, \ c \in \mathbb{M}_{N,n}. \qquad (3.5)$$
It is easy to see that $F$ is a contraction on the first entry whenever the matrix $A$ satisfies $|||A||| < 1$. In that case, using the notation in Theorem 7, $c = |||A|||$. Indeed, for any $x_1, x_2 \in \mathbb{R}^N$, $z \in \mathbb{R}^n$:
$$\left\| F(x_1, z) - F(x_2, z) \right\| = \left\| A (x_1 - x_2) \right\| \le |||A||| \left\| x_1 - x_2 \right\|.$$
We now assume that $|||A||| < 1$. The following two statements are proved in Appendix 6.8:

(i) The reservoir system associated to (3.5) has the echo state property and defines a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ that has the fading memory property with respect to any weighting sequence $w$ that satisfies the condition
$$\sum_{j=0}^{\infty} \frac{|||A|||^j}{w_j} < +\infty. \qquad (3.6)$$
The FMP condition (3.3) reads in this case as
$$|||A||| \, L_w < 1, \qquad (3.7)$$
and implies (3.6) but not vice versa.

(ii) If the inputs presented to the reservoir system associated to (3.5) are uniformly bounded, then it has the fading memory property with respect to any weighting sequence. This result was already known, as it can be easily obtained by combining (Grigoryeva and Ortega, 2018a, Corollary 11) with (Grigoryeva and Ortega, 2018b, Corollary 2.7). We obtain it here directly out of Corollary 10 by noting that for any $M > 0$,
$$F\left(\overline{B}(0, L) \times \overline{B}(0, M)\right) \subset \overline{B}(0, L), \quad \text{with } L := \frac{|||c||| \, M}{1 - |||A|||}. \qquad (3.8)$$

Echo state networks (ESN). Let $\sigma : \mathbb{R} \to [-1, 1]$ be a squashing function, that is, $\sigma$ is non-decreasing, $\lim_{x \to -\infty} \sigma(x) = -1$, and $\lim_{x \to +\infty} \sigma(x) = 1$. Moreover, assume that $L_\sigma := \sup_{x \in \mathbb{R}} \{|\sigma'(x)|\} < +\infty$.

¹ We thank Lukas Gonon for pointing this out.
Let $\boldsymbol{\sigma} : \mathbb{R}^N \to [-1, 1]^N$ be the map obtained by componentwise application of the squashing function $\sigma$. An echo state network is a reservoir system with linear readout and reservoir map given by
$$F(x, z) = \boldsymbol{\sigma}(A x + c z + \zeta), \quad \text{with } A \in \mathbb{M}_N, \ c \in \mathbb{M}_{N,n}, \ \zeta \in \mathbb{R}^N. \qquad (3.9)$$
We notice first that if $|||A||| \, L_\sigma < 1$ then $F$ is a contraction on the first component with constant $|||A||| \, L_\sigma$ (see the second part in (Grigoryeva and Ortega, 2018b, Corollary 3.2)). By construction, $F$ maps into the compact space $[-1, 1]^N \subset \mathbb{R}^N$ and hence satisfies the hypotheses in the first part of Theorem 7. Consequently, for any weighting sequence $w$ that satisfies
$$|||A||| \, L_\sigma \, L_w < 1 \qquad (3.10)$$
there exists a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ associated to $F$ that has the fading memory property with respect to $w$. By Corollary 10, this statement holds true for any $w$ when one considers uniformly bounded inputs.

Non-homogeneous state-affine systems (SAS). These systems are determined by reservoir maps $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ of the form
$$F(x, z) := p(z) x + q(z), \qquad (3.11)$$
where $p$ and $q$ are polynomials with matrix and vector coefficients, respectively, that depending on their nature determine the following two families of SAS systems:

(i) Regular SAS. $p$ and $q$ are polynomials of degrees $r$ and $s$ of the form:
$$p(z) = \sum_{\substack{i_1, \ldots, i_n \in \{0, \ldots, r\} \\ i_1 + \cdots + i_n \le r}} z_1^{i_1} \cdots z_n^{i_n} A_{i_1, \ldots, i_n}, \quad A_{i_1, \ldots, i_n} \in \mathbb{M}_N, \ z \in D_n \subset \mathbb{R}^n,$$
$$q(z) = \sum_{\substack{i_1, \ldots, i_n \in \{0, \ldots, s\} \\ i_1 + \cdots + i_n \le s}} z_1^{i_1} \cdots z_n^{i_n} B_{i_1, \ldots, i_n}, \quad B_{i_1, \ldots, i_n} \in \mathbb{M}_{N,1}, \ z \in D_n \subset \mathbb{R}^n.$$

(ii) Trigonometric SAS. We use trigonometric polynomials instead:
$$p(z) = \sum_{k} A^p_k \cos(u^p_k \cdot z) + B^p_k \sin(v^p_k \cdot z), \quad A^p_k, B^p_k \in \mathbb{M}_N, \ u^p_k, v^p_k \in \mathbb{R}^n, \ z \in D_n \subset \mathbb{R}^n,$$
$$q(z) = \sum_{k} A^q_k \cos(u^q_k \cdot z) + B^q_k \sin(v^q_k \cdot z), \quad A^q_k, B^q_k \in \mathbb{M}_{N,1}, \ u^q_k, v^q_k \in \mathbb{R}^n, \ z \in D_n \subset \mathbb{R}^n.$$

In both cases, define
$$M_p := \sup_{z \in D_n} \{|||p(z)|||\} \quad \text{and} \quad M_q := \sup_{z \in D_n} \{\|q(z)\|\}.$$
Note that for regular SAS defined by nontrivial polynomials, the set $D_n$ needs to be bounded in order for $M_p$ and $M_q$ to be finite.
Additionally, it is easy to see that $F$ is a contraction on the first entry with constant $M_p$ whenever $M_p < 1$, which is a condition that we will assume holds true in the rest of this example. Additionally, we assume that $M_q < +\infty$. Regular SAS are a generalization of the linear case that we considered in the first part of this section, and hence two statements can be proved (see Appendix 6.8) that are analogous to the ones in that part, namely:

(i) The reservoir system associated to (3.11) has the echo state property and defines a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \cap (D_n)^{\mathbb{Z}_-} \to \ell^w_-(\mathbb{R}^N)$ that has the fading memory property with respect to any weighting sequence $w$ that satisfies the condition
$$\sum_{j=0}^{\infty} \frac{M_p^j}{w_j} < +\infty. \qquad (3.12)$$
The FMP condition (3.3), that in this case reads as $M_p \, L_w < 1$, implies (3.12) but not vice versa.

(ii) If the inputs presented to the reservoir system associated to (3.11) are uniformly bounded, then it has the fading memory property with respect to any weighting sequence. We obtain this result out of Corollary 10 by noting that for any $M > 0$,
$$F\left(\overline{B}(0, L) \times \overline{B}(0, M)\right) \subset \overline{B}(0, L), \quad \text{with } L := \frac{M_q}{1 - M_p}.$$
We emphasize that, in the case of regular SAS, this is the only situation in which one can have $M_p < 1$ and $M_q < +\infty$.

4. Differentiability in reservoir filters with unbounded inputs

We now extend the results in the previous section from continuity to differentiability. More specifically, we characterize the situations in which one can prove the existence and obtain the differentiability of reservoir filters out of the differentiability properties of the maps that define the reservoir system. This approach gives us in passing new techniques to establish the echo state and the fading memory properties of reservoir systems. In particular, differentiability being a local property, we show how systems that do not globally have any of these properties may still have them in a neighborhood of certain types of inputs.
A phenomenon of this type has also been explored in Manjunath and Jaeger (2013). It is worth emphasizing that the study of the differentiability properties of fading memory reservoir filters naturally calls for the handling of unbounded inputs, since the definition of the Fréchet derivative requires filters to be defined on open subsets of the Banach space $\ell^w_-(\mathbb{R}^n)$, which always contain unbounded sequences (see the first part of Lemma 32 in the appendices).

4.1. Differentiable reservoir filters associated to differentiable reservoir maps

The first result in this section shows that, under certain conditions, the echo state and the fading memory properties associated to differentiable reservoir systems locally persist, that is, if a reservoir system has a unique filter associated to a specific input and it is continuous and differentiable at it, then the same properties hold for neighboring inputs.

Theorem 14 (Local persistence of the ESP and FMP properties) Let $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$. Suppose that $F$ is of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and that the corresponding reservoir system (1.1) has a solution $(x^0, z^0) \in \ell^w_-(\mathbb{R}^N) \times \ell^w_-(\mathbb{R}^n)$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$. Suppose, additionally, that
$$L_F := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||DF(x, z)|||\} < +\infty. \qquad (4.1)$$
Define $L_{F_x}(x^0, z^0) := \sup_{t \in \mathbb{Z}_-} \left\{ \left|\left|\left| D_x F(x^0_{t-1}, z^0_t) \right|\right|\right| \right\}$ and suppose that
$$L_{F_x}(x^0, z^0) \, L_w < 1. \qquad (4.2)$$
Then there exist open time-invariant neighborhoods $V_{x^0}$ and $V_{z^0}$ of $x^0$ and $z^0$ in $\ell^w_-(\mathbb{R}^N)$ and $\ell^w_-(\mathbb{R}^n)$, respectively, such that the reservoir system associated to $F$ with inputs in $V_{z^0}$ has the echo state property and hence determines a unique causal and time-invariant reservoir filter $U^F : (V_{z^0}, \|\cdot\|_w) \to (V_{x^0}, \|\cdot\|_w)$. Moreover, $U^F$ is differentiable at all the points of the form $T_t(z^0)$, $t \in \mathbb{Z}_-$, it is locally Lipschitz continuous on $V_{z^0}$, and it hence has the fading memory property.

Remark 15 We refer to (4.2) as the persistence condition.
We emphasize that this inequality puts into relation the solution $(x^0, z^0)$, whose persistence we are studying, with the weighting sequence $w$. In particular, that relation tells us that solutions are more likely to persist with respect to weighting sequences that decay more slowly (that is, for which $L_w$ is smaller).

Remark 16 There is a situation where the persistence condition is particularly easy to verify, namely, when the solution of the reservoir system is constructed as a constant sequence coming from a fixed point of the reservoir map, that is, $(x^0, z^0) \in \mathbb{R}^N \times \mathbb{R}^n$ such that $F(x^0, z^0) = x^0$. In that case $L_{F_x}(x^0, z^0) = |||D_x F(x^0, z^0)|||$.

Remark 17 The persistence condition (4.2) can be interpreted as a stability condition for the reservoir system determined by $F$ at the solution $(x^0, z^0)$ with respect to perturbations in $\ell^w_-(\mathbb{R}^n)$. The persistence of solutions under stability conditions of that type has been thoroughly studied for many types of dynamical systems (see, for instance, Montaldi (1997b,a); Ortega and Ratiu (1997); Chossat et al. (2003)).

Remark 18 The derivative $DU^F(z^0)$ at $z^0$ of the locally defined reservoir filter $U^F$ is determined by the differentiation of the relation (3.1). Indeed, for any $u \in \ell^w_-(\mathbb{R}^n)$ and $t \in \mathbb{Z}_-$, the directional derivative $DU^F(z^0) \cdot u$ is determined by the recursions
$$\left(DU^F(z^0) \cdot u\right)_t = DF\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left( \left(DU^F(z^0) \cdot u\right)_{t-1}, u_t \right) \qquad (4.3)$$
$$= D_x F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left(DU^F(z^0) \cdot u\right)_{t-1} + D_z F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot u_t. \qquad (4.4)$$
This relation implies, in particular, that $DU^F(z^0) : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ is a bounded linear operator and that
$$\left|\left|\left| DU^F(z^0) \right|\right|\right|_w \le \frac{L_{F_z}(x^0, z^0)}{1 - L_{F_x}(x^0, z^0) L_w}, \qquad (4.5)$$
where $L_{F_z}(x^0, z^0) := \sup_{t \in \mathbb{Z}_-} \left\{ \left|\left|\left| D_z F(x^0_{t-1}, z^0_t) \right|\right|\right| \right\}$. Indeed, notice first that for any $t \in \mathbb{Z}_-$,
$$\left|\left|\left| D_x F(x^0_{t-1}, z^0_t) \right|\right|\right| \le \left|\left|\left| DF(x^0_{t-1}, z^0_t) \right|\right|\right| \quad \text{and} \quad \left|\left|\left| D_z F(x^0_{t-1}, z^0_t) \right|\right|\right| \le \left|\left|\left| DF(x^0_{t-1}, z^0_t) \right|\right|\right|, \qquad (4.6)$$
which, using hypothesis (4.1), implies that
$$L_{F_x}(x^0, z^0) \le L_F < +\infty \quad \text{and} \quad L_{F_z}(x^0, z^0) \le L_F < +\infty. \qquad (4.7)$$
Now, for any $u \in \ell^w_-(\mathbb{R}^n)$, the relation (4.3) and the inequalities (4.7) imply that
$$\begin{aligned}
\left\| DU^F(z^0) \cdot u \right\|_w &= \sup_{t \in \mathbb{Z}_-} \left\{ \left\| DF\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left( \left(DU^F(z^0) \cdot u\right)_{t-1}, u_t \right) \right\| w_{-t} \right\} \\
&\le \sup_{t \in \mathbb{Z}_-} \left\{ \left( \left\| D_x F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| + \left\| D_z F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot u_t \right\| \right) w_{-t} \right\} \\
&\le L_{F_x}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \left\| \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| w_{-t} \right\} + L_{F_z}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \| u_t \| w_{-t} \right\} \\
&\le L_{F_x}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \left\| \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| w_{-(t-1)} \frac{w_{-t}}{w_{-(t-1)}} \right\} + L_{F_z}(x^0, z^0) \| u \|_w \\
&\le L_{F_x}(x^0, z^0) L_w \left\| DU^F(z^0) \cdot u \right\|_w + L_{F_z}(x^0, z^0) \| u \|_w,
\end{aligned}$$
which implies (4.5).

The previous theorem proves that, when the persistence condition (4.2) is satisfied at a preexisting solution of a reservoir system, this system has a unique fading memory (and differentiable) filter associated to neighboring inputs. In the next results we show that a global version of that condition ensures, first, that globally defined reservoir filters exist and, second, that those filters are differentiable and hence have the fading memory property.

Theorem 19 (Characterization of global reservoir filter differentiability) Let $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$.

(i) Suppose that $F$ satisfies (4.1) and define
$$L_{F_x} := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||D_x F(x, z)|||\} \quad \text{and} \quad L_{F_z} := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||D_z F(x, z)|||\}.$$
If the reservoir system (1.1) associated to $F$ has a solution $(x^0, z^0) \in \ell^w_-(\mathbb{R}^N) \times \ell^w_-(\mathbb{R}^n)$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$, and
$$L_{F_x} L_w < 1, \qquad (4.8)$$
then it has the echo state property and hence determines a unique causal and time-invariant reservoir filter $U^F : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$. Moreover, $U^F$ is differentiable and Lipschitz continuous on $\ell^w_-(\mathbb{R}^n)$ with Lipschitz constant $L_{U^F}$ given by
$$L_{U^F} := \frac{L_{F_z}}{1 - L_{F_x} L_w} \quad \text{and} \quad \left|\left|\left| DU^F(z) \right|\right|\right|_w \le \frac{L_{F_z}}{1 - L_{F_x} L_w}, \quad \text{for any } z \in \ell^w_-(\mathbb{R}^n). \qquad (4.9)$$
The filter $U^F$ hence has the fading memory property.
Grigoryeva and Ortega (ii) Conversely, let Vn ℓw (Rn) be an open and time-invariant subset of ℓw (Rn) and assume that the reservoir system (1.1) associated to F has a unique causal and time-invariant reservoir filter U F : Vn ℓw (RN) that is differentiable at z0 ℓw (Rn). Then, t Z Dx F U F (z0)t 1, z0 t ) T1 < 1, (4.10) where ρ stands for the spectral radius. This in turn implies that Dx F U F (z0) 1, z0 0 Dx F U F (z0) k, z0 k+1 1 = 0. (4.11) Examples 20 We briefly examine the form that the hypotheses of Theorem 19 take for the three families of reservoir systems that we analyzed in Section 3.1: (i) Linear reservoir maps. In this case, for any x RN and z Rn, DF(x, z) = (A | c) , Dx F(x, z) = A, and Dz F(x, z) = c. Consequently LF = |||(A | c)|||, LFx = |||A|||, LFz = |||c|||. The condition (4.1) is always satisfied and in this case the sufficient differentiability condition (4.8) amounts to |||A|||Lw < 1 that, as we saw in (3.7), is the same as the sufficient condition for the FMP to hold. (ii) Echo state networks (ESN). Consider an ESN constructed using a squashing function σ that satisfies that Lσ := supx R{|σ (x)|} < + . In this case, for any x RN and z Rn, DF(x, z) = Dσ(Ax + cz + ζ) (A | c) , Dx F(x, z) = Dσ(Ax + cz + ζ) A, Dz F(x, z) = Dσ(Ax + cz + ζ) c. Notice that |||Dσ(x)||| < Lσ < + , for any x RN, and hence |||DF(x, z)||| Lσ|||(A | c)||| < + , |||Dx F(x, z)||| Lσ|||A||| < + , |||Dz F(x, z)||| Lσ|||c||| < + , for any x RN and z Rn. This implies, in particular, that in this case LF < + , LFx < + , LFz < + , and the sufficient differentiability condition (4.8) is implied by the inequality |||A|||LσLw < 1. (4.12) (iii) Non-homogeneous state-affine systems (SAS). A straightforward computations shows that for any x RN and z Rn, DF(x, z) = (p(z), Dp(z)( )x + Dq(z)( )) , Dx F(x, z) = p(z), Dz F(x, z) = Dp(z)( )x + Dq(z)( ). 
(4.13)

As we already pointed out, for regular SAS defined by nontrivial polynomials the norm \(|||p(z)|||\) is not bounded on \(\mathbb{R}^n\) and hence \(L_{Fx} = \sup_{(x,z) \in \mathbb{R}^N \times \mathbb{R}^n}\{|||D_xF(x,z)|||\} = \sup_{z \in \mathbb{R}^n}\{|||p(z)|||\} = M_p\) is not finite; the same applies to \(L_F\), which implies that in this case neither (4.1) nor (4.8) can be satisfied. This is not the case for trigonometric SAS, for which the norms of the derivatives in (4.13) are bounded on their domains which, in particular, implies that \(L_F < +\infty\), \(L_{Fx} < +\infty\), and \(L_{Fz} < +\infty\). Moreover, the sufficient differentiability condition (4.8) in this case reads \(L_{Fx} L_w = \sup_{z \in \mathbb{R}^n}\{|||p(z)|||\}\, L_w < 1\).

Remark 21 We recall here an example that we introduced in Section 3.1 to show that, as was already the case with the FMP condition (3.3) in Theorem 7, the differentiability condition (4.8) is sufficient but not necessary. Indeed, consider a linear system with matrix \(A\) given by
\[
A = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}, \quad\text{with } a > 0.
\]
Given that \(|||A||| = a\), the reservoir map determined by \(A\) is not necessarily a contraction on the first entry. Nevertheless, the nilpotency of \(A\) implies that the reservoir system associated to (3.5) always has a solution for any input \(z \in (\mathbb{R}^2)^{\mathbb{Z}_-}\) and hence has the ESP and induces a filter \(U : (\mathbb{R}^2)^{\mathbb{Z}_-} \to (\mathbb{R}^2)^{\mathbb{Z}_-}\) given by \(U(z)_t := z_t + Az_{t-1}\), \(t \in \mathbb{Z}_-\), or, equivalently, \(U = I_{(\mathbb{R}^2)^{\mathbb{Z}_-}} + \big(\prod_{t \in \mathbb{Z}_-} A\big) \circ T_1\). Consider now any weighting sequence \(w\) with finite inverse decay ratio \(L_w\). Then the restriction of \(U\) to \(\ell^\infty_w(\mathbb{R}^2)\) always maps into \(\ell^\infty_w(\mathbb{R}^2)\), has the FMP, and is differentiable. Indeed, it is easy to show using the linearity of the filter that \(U = DU(z)\) for any \(z \in \ell^\infty_w(\mathbb{R}^2)\) and that
\[
|||U|||_w = |||DU(z)|||_w \leq 1 + a L_w. \tag{4.14}
\]
Note that in this case \(L_{Fx} = |||A||| = a\) and, as (4.14) shows the differentiability of \(U\) with respect to any weighting sequence with finite \(L_w\), we can conclude that the condition (4.8) is not necessary for filter differentiability.
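The nilpotent example in Remark 21 is easy to verify numerically: even though \(|||A||| = a\) can be made arbitrarily large, the state recursion \(x_t = Ax_{t-1} + z_t\) forgets its initial condition after two iterations and realizes the filter \(U(z)_t = z_t + Az_{t-1}\). A minimal sketch (the value \(a = 3\), the input length, and the random input are our own illustrative choices, not from the paper):

```python
import numpy as np

a = 3.0                        # |||A||| = a > 1, so A is not a contraction
A = np.array([[0.0, a],
              [0.0, 0.0]])     # nilpotent: A @ A == 0
rng = np.random.default_rng(0)
z = rng.normal(size=(50, 2))   # arbitrary input sequence

def run(x0):
    """Iterate x_t = A x_{t-1} + z_t from the initial condition x0."""
    xs, x = [], x0
    for zt in z:
        x = A @ x + zt
        xs.append(x)
    return np.array(xs)

traj1 = run(np.array([100.0, -50.0]))
traj2 = run(np.array([-7.0, 3.0]))

# A @ A == 0 wipes out the initial condition after two iterations, so the
# system has the ESP and realizes the filter U(z)_t = z_t + A z_{t-1}.
diff = np.abs(traj1[1:] - traj2[1:]).max()
expected = z[1:] + z[:-1] @ A.T
```

Since \(A^2 = 0\), any two state trajectories coincide from the second step onwards, which is precisely the echo state property without any contraction assumption.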
The following corollary combines the previous theorem with a condition on the readout map that guarantees that the filter associated to the resulting reservoir system is differentiable.

Corollary 22 Consider a reservoir system determined by a reservoir map \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) of class \(C^1(\mathbb{R}^N \times \mathbb{R}^n)\) and by a readout map \(h : \mathbb{R}^N \to \mathbb{R}^d\) that is also of class \(C^1(\mathbb{R}^N)\). Assume, additionally, that \(F\) satisfies the hypotheses in part (i) of Theorem 19 and that \(h\) is such that
\[
c_h := \sup_{x \in \mathbb{R}^N}\{|||Dh(x)|||\} < +\infty, \tag{4.15}
\]
and that the sequence \(y^0 := \big(h(x^0_t)\big)_{t \in \mathbb{Z}_-} = \big(h(U_F(z^0)_t)\big)_{t \in \mathbb{Z}_-} \in \ell^\infty_w(\mathbb{R}^d)\). Then, the reservoir filter \(U^F_h : (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^d), \|\cdot\|_w)\) is differentiable at each point in its domain and hence has the fading memory property.

Proof. Define first the map
\[
H := \prod_{t \in \mathbb{Z}_-} h \circ p_t : \ell^\infty_w(\mathbb{R}^N) \to (\mathbb{R}^d)^{\mathbb{Z}_-}. \tag{4.16}
\]
Given that \(U^F_h = H \circ U_F\) and that by Theorem 19 the filter \(U_F\) is differentiable, it suffices to prove that \(H\) is differentiable. This is a consequence of part (iii) in Lemma 36 and the hypothesis (4.15). Indeed, let \(H_t := h \circ p_t\), \(t \in \mathbb{Z}_-\), and notice that by the first part of Lemma 1,
\[
\sup_{x \in \ell^\infty_w(\mathbb{R}^N)}\{|||DH_t(x)|||\} \leq \sup_{x_t \in \mathbb{R}^N}\{|||Dh(x_t)|||\}\, |||p_t|||_w \leq \frac{c_h}{w_{-t}}.
\]
Now, as \(\|(c_h/w_{-t})_{t \in \mathbb{Z}_-}\|_w = c_h < +\infty\) and by hypothesis \(H(x^0) \in \ell^\infty_w(\mathbb{R}^d)\), it follows from Lemma 36 that \(H\) maps into \(\ell^\infty_w(\mathbb{R}^d)\) and that it is differentiable, as required. ∎

On some occasions it is important to determine whether a given filter is invertible. The differentiability of reservoir filters associated to reservoir systems with differentiable reservoir and readout maps, established in the previous result, allows us to use the inverse function theorem to formulate a sufficient invertibility condition. As we see in the next statement, this criterion can be written down entirely in terms of the derivatives of the reservoir and the readout maps.
Corollary 23 Consider a reservoir system determined by a reservoir map \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) and a readout map \(h : \mathbb{R}^N \to \mathbb{R}^d\) that are of class \(C^1(\mathbb{R}^N \times \mathbb{R}^n)\) and \(C^1(\mathbb{R}^N)\), respectively, and that additionally satisfy the conditions spelled out in the statement of Corollary 22. Let \(z \in \ell^\infty_w(\mathbb{R}^n)\), \(x := U_F(z) \in \ell^\infty_w(\mathbb{R}^N)\), and \(y := U^F_h(z) \in \ell^\infty_w(\mathbb{R}^d)\), and suppose that the map
\[
DH(x) \circ \Big(I_{\ell^\infty_w(\mathbb{R}^N)} - \Big(\prod_{t \in \mathbb{Z}_-} D_xF(x_{t-1}, z_t)\Big) \circ T_1\Big)^{-1} \circ \prod_{t \in \mathbb{Z}_-} D_zF(x_{t-1}, z_t) : \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_w(\mathbb{R}^d) \tag{4.17}
\]
is a linear homeomorphism (continuous linear bijection with continuous inverse), with \(H\) as defined in (4.16). Then there exist open neighborhoods \(V_z \subset \ell^\infty_w(\mathbb{R}^n)\) and \(V_y \subset \ell^\infty_w(\mathbb{R}^d)\) of \(z\) and \(y\), respectively, such that the restriction of the filter \(U^F_h|_{V_z} : V_z \to V_y\) has an inverse \(\big(U^F_h|_{V_z}\big)^{-1}\). When the condition (4.17) is satisfied for all the solutions \((z, U_F(z))\) of the reservoir system determined by \(F\), the reservoir filter \(U^F_h\) admits a global inverse \(\big(U^F_h\big)^{-1} : U^F_h\big(\ell^\infty_w(\mathbb{R}^n)\big) \to \ell^\infty_w(\mathbb{R}^n)\).

Proof. It is a straightforward consequence of the inverse function theorem as formulated in (Schechter, 1997, page 670) (see also Ver Eecke (1974)) applied to the Fréchet derivative of \(U^F_h = H \circ U_F\) at the point \(z \in \ell^\infty_w(\mathbb{R}^n)\). It is easy to see using the chain rule and (6.69) (which is in turn a consequence of (4.4)) that this derivative coincides with the operator in (4.17) whose invertibility we require. ∎

4.2. The local versus the global echo state property

Theorem 14 emphasizes the local nature of both the echo state and the fading memory properties by providing a sufficient condition that ensures the existence of a locally defined causal and time-invariant filter around a given solution that is shown to have the FMP. In contrast with this local approach, Theorem 19 characterizes the existence of a globally defined differentiable filter associated to a given reservoir system, which hence satisfies the FMP and the ESP for any input. Even though the conditions in Theorems 14 and 19 are very alike, the latter is much stronger than the former.
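The global sufficient condition and the Lipschitz bound that comes with it are straightforward to check numerically. The sketch below builds a tanh ESN (so \(L_\sigma = 1\)), uses a geometric weighting sequence \(w_t = \lambda^t\), whose inverse decay ratio is \(L_w = 1/\lambda\), checks the sufficient condition (4.8) through the bound \(L_{Fx} \leq L_\sigma |||A|||\), and verifies empirically that the Lipschitz bound (4.9) holds for a pair of nearby inputs; all the concrete sizes and constants are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, T = 40, 3, 200
lam = 0.9                          # geometric weighting sequence w_t = lam**t
L_w = 1.0 / lam                    # inverse decay ratio sup_t { w_t / w_{t+1} }

A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)    # rescale so that |||A||| = 0.8
C = rng.normal(size=(N, n))
L_Fx = np.linalg.norm(A, 2)        # for tanh, L_sigma = 1, so L_Fx <= |||A|||
L_Fz = np.linalg.norm(C, 2)

cond = L_Fx * L_w                  # sufficient condition (4.8): must be < 1
L_UF = L_Fz / (1.0 - cond)         # Lipschitz constant bound (4.9)

def esn(z):
    """Run x_t = tanh(A x_{t-1} + C z_t) from the zero state."""
    x, xs = np.zeros(N), []
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
        xs.append(x)
    return np.array(xs)

def wnorm(seq):
    """Weighted sup norm over a finite window; the most recent entry has weight 1."""
    w = lam ** np.arange(len(seq))[::-1]
    return (np.linalg.norm(seq, axis=1) * w).max()

z1 = rng.normal(size=(T, n))
z2 = z1 + 0.01 * rng.normal(size=(T, n))
ratio = wnorm(esn(z1) - esn(z2)) / wnorm(z1 - z2)   # empirical Lipschitz quotient
```

Because both trajectories start from the same state, the recursion argument in the proof of Theorem 14 applies verbatim to the finite window, so the quotient `ratio` is guaranteed to stay below `L_UF`.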
In the following paragraphs we illustrate, with a family of ESNs of the type introduced in Section 3.1, how it is possible to violate the global condition of Theorem 19 and nevertheless find solutions of such reservoir systems around which one can locally define FMP reservoir filters. This example illustrates how the ESP and the FMP are structural features of a reservoir system when considered globally but are mostly input-dependent when considered only locally. This important observation was already made in Manjunath and Jaeger (2013) where, using tools coming from the theory of non-autonomous dynamical systems, sufficient conditions have been formulated (see, for instance, (Manjunath and Jaeger, 2013, Theorem 2)) that ensure the ESP in connection with a given specific input. The differentiability conditions that we impose on our reservoir systems allow us to draw similar conclusions and, additionally, to automatically conclude the FMP of the resulting locally defined reservoir filters.

Consider the one-dimensional echo state map \(F : \mathbb{R} \times \mathbb{R} \to \mathbb{R}\), where
\[
F(x,z) := \sigma(ax + z), \quad\text{with } a \in \mathbb{R} \text{ and } \sigma(x) := \frac{x}{\sqrt{1 + x^2}}. \tag{4.18}
\]
The sigmoid function \(\sigma\) in this expression has been chosen so that we can provide algebraic expressions in the following developments. Similar conclusions could nevertheless be drawn using other popular squashing functions. The function \(\sigma\) maps the real line into the interval \([-1, 1]\) and it is easy to see, using the notation introduced in Examples 20, that \(L_\sigma := \sup_{x \in \mathbb{R}}\{|\sigma'(x)|\} = 1\). Moreover, the one-dimensional character of the system implies that, in this case,
\[
L_{Fx} = |a|. \tag{4.19}
\]
Consequently, by Lemma 43, the reservoir map \(F\) is a contraction on the first entry if and only if \(|a| < 1\), in which case, by Theorem 7, the associated ESN has the ESP and the FMP with respect to any input in \(\ell^\infty_w(\mathbb{R})\), where \(w\) is a weighting sequence that satisfies
\[
|a| L_w < 1. \tag{4.20}
\]
The FMP holds with respect to any sequence \(w\) if we consider uniformly bounded inputs, by Corollary 10. Moreover, a well-known result for ESNs due to H. Jaeger (see (Jaeger, 2010, Proposition 3)) shows that the ESP cannot be satisfied whenever
\[
|a| > 1. \tag{4.21}
\]
Additionally, the global sufficient differentiability condition (4.8) in Theorem 19 shows that the condition (4.20) also ensures that the ESP filter is differentiable.

We now prove, using Theorem 14, the existence of locally defined FMP filters associated to this ESN in a neighborhood of certain inputs, even when condition (4.21) is satisfied which, as we already mentioned, prevents the global existence of such objects. Notice first that the solutions of the equation \(\sigma(ax) = x\), \(x \in \mathbb{R}\), are characterized by the relation
\[
a^2x^2 = x^2(a^2x^2 + 1) \tag{4.22}
\]
that has as solutions
\[
x^0 = 0, \qquad x_a^{\pm} = \pm\frac{\sqrt{a^2 - 1}}{a},
\]
where the solutions in the second expression obviously exist and are different from the first one only when \(|a| > 1\), a condition that we assume holds true in the rest of the section. The condition (4.22) implies that the constant sequences \((x^0, z^0)\) and \((x_a^{\pm}, z^0)\) defined by \((x^0, z^0)_t := (x^0, 0)\) and \((x_a^{\pm}, z^0)_t := (x_a^{\pm}, 0)\), for any \(t \in \mathbb{Z}_-\), are solutions of the reservoir system determined by \(F\). Moreover, in the notation of Theorem 14, it is easy to see that
\[
L_{Fx}(x^0, z^0) = |a| > 1 \quad\text{and}\quad L_{Fx}(x_a^{\pm}, z^0) = \frac{1}{a^2} < 1.
\]
The persistence condition (4.2) in that result implies that for any weighting sequence that satisfies \(L_w/a^2 < 1\) there exist open time-invariant neighborhoods \(V_{x_a^{\pm}}\) and \(V_{z^0}\) of \(x_a^{\pm}\) and \(z^0\) in \(\ell^\infty_w(\mathbb{R}^N)\) and \(\ell^\infty_w(\mathbb{R}^n)\), respectively, such that the reservoir system associated to \(F\) with inputs in \(V_{z^0}\) has the echo state property and hence determines a unique causal, time-invariant, and FMP reservoir filter \(U_F : (V_{z^0}, \|\cdot\|_w) \to (V_{x_a^{\pm}}, \|\cdot\|_w)\).
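These computations are easily checked numerically; the sketch below (with the illustrative choice \(a = 2\), which is our own) verifies that \(x_a^{\pm}\) are indeed fixed points of \(x \mapsto \sigma(ax)\) and that the local contraction constant at them equals \(1/a^2 < 1\), while at the origin it equals \(|a| > 1\):

```python
import numpy as np

def sigma(x):
    return x / np.sqrt(1.0 + x**2)

def dsigma(x):
    return (1.0 + x**2) ** (-1.5)

a = 2.0                                   # |a| > 1: the global ESP fails (Jaeger)
x_plus = np.sqrt(a**2 - 1.0) / a          # nonzero fixed point of sigma(a x) = x
x_zero = 0.0

# check that x_plus solves x = sigma(a x + 0)
fp_err = abs(sigma(a * x_plus) - x_plus)

# local contraction constants L_Fx(x, 0) = |a sigma'(a x)|
L_at_zero = abs(a * dsigma(a * x_zero))   # = |a| > 1: no persistence at the origin
L_at_plus = abs(a * dsigma(a * x_plus))   # = 1/a**2 < 1: Theorem 14 applies
```

The persistence condition therefore fails at the trivial constant solution but holds at the nonzero ones, which is exactly the input-dependence of the local ESP discussed above.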
4.3. Remote past input independence and the state forgetting property for unbounded inputs

In Section 2.3 we saw how fading memory filters presented with uniformly bounded inputs exhibit what we called the uniform input forgetting property. An analysis of the proof of the main result in that section, namely Theorem 6, shows that the compactness of the space of inputs guaranteed the existence of a modulus of continuity for the filter, which ensured the validity of the input forgetting property and, moreover, made it uniform. In the context of reservoir systems, we saw in Theorems 7 and 19 that there are very weak hypotheses that, even when the inputs are not uniformly bounded, guarantee that the associated reservoir filters are Lipschitz and hence have a modulus of continuity. This allows us to prove an input forgetting property in that more general context.

Theorem 24 (Input forgetting property for FMP reservoir filters) Let \(F : D_N \times D_n \to D_N\) be a reservoir map, where \(D_n \subset \mathbb{R}^n\), \(D_N \subset \mathbb{R}^N\), \(n, N \in \mathbb{N}_+\). Assume that the hypotheses of Theorem 7 part (ii) (plus \(F\) Lipschitz on the second component) or of Theorem 19 part (i) are satisfied with respect to a weighting sequence \(w\) such that \(D_w < 1\). Let \(U_F : (V_n, \|\cdot\|_w) \to ((D_N)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)\) be the associated causal and time-invariant reservoir filter (\(V_n \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) under the hypotheses of Theorem 7; \(V_n = \ell^\infty_w(\mathbb{R}^n)\) and \(D_N = \mathbb{R}^N\) under the hypotheses of Theorem 19). Then, for any \(u, v \in \ell^\infty_w(\mathbb{R}^n)\) and \(z \in (D_n)^{\mathbb{N}}\) we have that
\[
\lim_{t \to +\infty} \|U_F(uz)_t - U_F(vz)_t\| = 0. \tag{4.23}
\]
If \(D_N\) is compact then the convergence in (4.23) is uniform in \(u\), \(v\), and \(z\) in the sense that there exists a monotonously decreasing sequence \(w^{U_F}\) with zero limit such that for all \(u\), \(v\), \(z\), and \(t \in \mathbb{N}\),
\[
\|U_F(uz)_t - U_F(vz)_t\| \leq w^{U_F}_t. \tag{4.24}
\]
Proof. It mimics the proof of Theorem 6 using as modulus of continuity the map \(\omega_{U_F}(t) := L_{U_F}\, t\), \(t \geq 0\), where \(L_{U_F}\) is the Lipschitz constant whose existence is ensured by the hypotheses of Theorem 7 or 19 and is given by (3.4) or by (4.9), respectively. ∎
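Theorem 24 can be illustrated with a simple simulation: two trajectories driven by different remote pasts but the same recent input converge to each other. The sketch below uses a tanh ESN whose connectivity matrix is scaled to be a contraction (\(|||A||| = 0.7\)); all sizes and the random inputs are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 20, 2
A = rng.normal(size=(N, N))
A *= 0.7 / np.linalg.norm(A, 2)     # |||A||| = 0.7 < 1: contraction on the states
C = rng.normal(size=(N, n))

def run(z):
    """Iterate x_t = tanh(A x_{t-1} + C z_t) from the zero state."""
    x, xs = np.zeros(N), []
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
        xs.append(x)
    return np.array(xs)

past_u = rng.normal(size=(30, n))          # remote past u
past_v = rng.normal(size=(30, n)) + 5.0    # a very different remote past v
common = rng.normal(size=(60, n))          # shared recent input z

out_u = run(np.vstack([past_u, common]))
out_v = run(np.vstack([past_v, common]))
gap = np.linalg.norm(out_u - out_v, axis=1)
initial_gap, final_gap = gap[30], gap[-1]  # gap at the start vs the end of the common segment
```

Once the inputs coincide, the gap contracts by at least the factor 0.7 per step (tanh is 1-Lipschitz), so it decays geometrically to zero regardless of the pasts.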
Remark 25 In Remark 41 in the appendices it is shown how Theorem 7 can be extended to continuous reservoir systems with inputs and outputs in \(\ell^{p,w}(\mathbb{R}^n)\) and \(\ell^{p,w}(\mathbb{R}^N)\), respectively. In particular, it is shown that the resulting filters are Lipschitz and hence have a non-trivial modulus of continuity. This implies that a result analogous to Theorem 24 can be proved for such systems, which could hence also be referred to as fading memory from a dynamical point of view.

When filters are differentiable, there is one more way to measure how they forget inputs, simply by looking at their partial derivatives with respect to past input components. The result is a differential input forgetting property that, unlike Theorem 24, can be formulated in a uniform way even when the inputs are not uniformly bounded.

Theorem 26 (Differential uniform input forgetting property) Assume that the hypotheses of Theorem 19 (i) are satisfied. Let \(D_{z^i_t}H_F(z) \in \mathbb{R}^N\) be the partial derivative of the reservoir functional \(H_F : \ell^\infty_w(\mathbb{R}^n) \to \mathbb{R}^N\) with respect to the \(i\)-th component of the \(t\)-th entry of \(z \in \ell^\infty_w(\mathbb{R}^n)\). Then, there exists a monotonously decreasing sequence \(w^F\) with zero limit such that, for any \(t \in \mathbb{Z}_-\),
\[
\|D_{z^i_t}H_F(z)\| \leq w^F_{-t}, \quad\text{for any } z \in \ell^\infty_w(\mathbb{R}^n) \text{ and } i \in \{1, \ldots, n\}. \tag{4.25}
\]
Proof. Let \(e_{i,t} := (\ldots, 0, e_i, 0, \ldots, 0) \in \ell^\infty_w(\mathbb{R}^n)\), where the vector \(e_i\) is the \(i\)-th canonical unit vector in \(\mathbb{R}^n\) placed in the \(t\)-th position. Then, since \(\|e_{i,t}\|_w = w_{-t}\), we have by (4.9) and for any \(z \in \ell^\infty_w(\mathbb{R}^n)\) that
\[
\|D_{z^i_t}H_F(z)\| = \|DH_F(z)(e_{i,t})\| \leq |||DH_F(z)|||_w\, \|e_{i,t}\|_w = |||p_0 \circ DU_F(z)|||_w\, w_{-t} \leq \frac{L_{Fz}}{1 - L_{Fx}L_w}\, w_{-t},
\]
which proves (4.25) by setting \(w^F_t := \frac{L_{Fz}}{1 - L_{Fx}L_w}\, w_t\), \(t \in \mathbb{N}\). ∎

Apart from the filters that reservoir maps define when they have the echo state property, we can also use these objects to define controlled forward-looking dynamical systems and flows.
Indeed, given a reservoir map \(F : D_N \times D_n \to D_N\), we denote by \(F_F : (D_n)^{\mathbb{N}_+} \times D_N \to (D_N)^{\mathbb{N}_+}\) the reservoir flow associated to \(F\) that is uniquely determined by the recurrence relations:
\[
\begin{aligned}
F_F(z, x_0)_1 &= F(x_0, z_1), \quad\text{with } z \in (D_n)^{\mathbb{N}_+},\ x_0 \in D_N,\\
F_F(z, x_0)_t &= F(F_F(z, x_0)_{t-1}, z_t), \quad t > 1.
\end{aligned} \tag{4.26}
\]
The value \(x_0 \in D_N\) is called the initial condition of the path \(F_F(z, x_0) \in (D_N)^{\mathbb{N}_+}\) associated to the input or control sequence \(z \in (D_n)^{\mathbb{N}_+}\). As we saw in Theorems 7 and 19, the contraction property on the first component of a reservoir map is closely related to the ESP and the FMP of the resulting reservoir filter and, in passing (see Theorem 24), to the input forgetting property. The next result shows that something similar happens with reservoir flows associated to contracting reservoir maps, as they forget the influence of the initial conditions that are used to create the paths. This feature is referred to as the state forgetting property in Jaeger (2010).

Theorem 27 (State forgetting property for contracting reservoir flows) Let \(F : D_N \times D_n \to D_N\) be a reservoir map, where \(D_n \subset \mathbb{R}^n\), \(D_N \subset \mathbb{R}^N\), \(n, N \in \mathbb{N}_+\), and suppose that \(F\) is a contraction on the first component. Given an input sequence \(z \in (D_n)^{\mathbb{N}_+}\), the reservoir flow \(F_F : (D_n)^{\mathbb{N}_+} \times D_N \to (D_N)^{\mathbb{N}_+}\) associated to \(F\) satisfies:
\[
\lim_{t \to +\infty} \|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| = 0, \quad\text{for any } x_0, \overline{x}_0 \in D_N. \tag{4.27}
\]
If \(D_N\) is compact then the convergence in (4.27) is uniform in \(z\), \(x_0\), and \(\overline{x}_0\) in the sense that there exists a monotonously decreasing sequence \(w^F\) with zero limit such that for all \(x_0, \overline{x}_0 \in D_N\), \(z \in (D_n)^{\mathbb{N}_+}\), and \(t \in \mathbb{N}\),
\[
\|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| \leq w^F_t. \tag{4.28}
\]
Reservoir flows that satisfy condition (4.27) are said to have the state forgetting property and we refer to (4.28) as the uniform state forgetting property.

Proof. Let \(c < 1\) be the contraction constant of \(F\). Using the recursions (4.26) that define the reservoir flow we can write, for any \(t > 1\):
\[
\begin{aligned}
\|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| &= \|F(F_F(z, x_0)_{t-1}, z_t) - F(F_F(z, \overline{x}_0)_{t-1}, z_t)\|\\
&\leq c\, \|F_F(z, x_0)_{t-1} - F_F(z, \overline{x}_0)_{t-1}\| \leq c^{t-1}\|F_F(z, x_0)_1 - F_F(z, \overline{x}_0)_1\|\\
&= c^{t-1}\|F(x_0, z_1) - F(\overline{x}_0, z_1)\|.
\end{aligned}
\]
Taking the limit \(t \to +\infty\) on both sides of this inequality yields (4.27). Now, if \(D_N\) is compact then there exists a constant \(D > 0\) such that \(\|F(x_0, z_1) - F(\overline{x}_0, z_1)\| < D\) for all \(x_0, \overline{x}_0 \in D_N\) and \(z_1 \in D_n\), and hence (4.28) holds if we set \(w^F_t := c^{t-1}D\), \(t \in \mathbb{N}\). ∎

4.4. Analytic reservoir filters associated to analytic reservoir maps

The results in Section 4.1 characterized the conditions under which reservoir maps of class \(C^1\) yield differentiable reservoir filters with respect to inputs and outputs in weighted sequence spaces. This setup is convenient because it is able to accommodate unbounded signals and allows for an elegant encoding of the fading memory property. However, due to the infinite dimensional character of our setup, one cannot immediately obtain higher order differentiable reservoir filters out of higher order differentiable reservoir maps (see Remark 40) because one needs, roughly speaking, to modify the weighted norm in the target of the map that defines the filter (see Proposition 39). This makes it impossible to apply, in a higher order differentiability context, the Implicit Function Theorem, which is the main tool used in the results of the previous section. That is why in the following paragraphs we deal with analytic reservoir maps (as real valued functions) and study the analyticity of the associated reservoir filters with respect to the supremum norm, as opposed to the weighted norms that we considered in the previous section. Using the supremum norm implies that filter differentiability in that context, when one manages to establish it, ensures filter continuity but not the fading memory property. In exchange, analyticity allows us to construct Taylor series expansions that, as we see later on, are discrete-time Volterra series representations.
The next result is the analytic analog of the Local Persistence Theorem 14, formulated using the supremum norm, and proves that analytic reservoir maps have locally defined analytic reservoir filters associated around constant solutions.

Theorem 28 (Local persistence of the ESP, continuity, and analyticity) Let \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) be a reservoir map. Suppose that \(F\) is analytic and that the corresponding reservoir system (1.1) has a constant solution \((x^0, z^0) \in \mathbb{R}^N \times \mathbb{R}^n\), that is, \(x^0 = F(x^0, z^0)\). Suppose, additionally, that for all \(r \geq 1\),
\[
L_{F,r} := \sup_{(x,z) \in \mathbb{R}^N \times \mathbb{R}^n}\{|||D^rF(x,z)|||\} < +\infty, \tag{4.29}
\]
and that
\[
L_{Fx}(x^0, z^0) := |||D_xF(x^0, z^0)||| < 1. \tag{4.30}
\]
Then, there exist open time-invariant neighborhoods \(V_{x^0}\) and \(V_{z^0}\) of \(x^0\) and \(z^0\) in \(\ell^\infty(\mathbb{R}^N)\) and \(\ell^\infty(\mathbb{R}^n)\), respectively, such that the reservoir system associated to \(F\) with inputs in \(V_{z^0}\) has the echo state property and hence determines a unique causal, time-invariant, and analytic (and hence continuous) reservoir filter \(U_F : (V_{z^0}, \|\cdot\|_\infty) \to (V_{x^0}, \|\cdot\|_\infty)\).

5. The Volterra series representation of analytic filters and a universality theorem

In this section we study the Taylor series expansions of analytic causal and time-invariant filters which, as we prove in the next result, coincide with the so-called discrete-time Volterra series representations. A very similar result was formulated in Sandberg (1998a, 1999) for analytic filters with respect to the supremum norm and with inputs with a finite past. The next result extends that statement and characterizes the inputs for which an analytic time-invariant fading memory filter with respect to a weighted norm admits a Volterra series representation with semi-infinite past inputs. This generalized result allows this series representation for inputs that are not necessarily bounded. Additionally, we use the causality and time-invariance hypotheses to show that the corresponding Volterra series representations have time-independent coefficients.
Theorem 29 Let \(w\) be a weighting sequence and let \(U : \overline{B}_w(z^0, M) \subset \ell^\infty_w(\mathbb{R}) \to \overline{B}_w(U(z^0), L) \subset \ell^\infty_w(\mathbb{R}^N)\) be a causal and time-invariant analytic filter, for some time-invariant \(z^0 \in \ell^1_w(\mathbb{R})\) (that is, \(T_{-t}(z^0) = z^0\), for all \(t \in \mathbb{Z}_-\)) and \(M, L > 0\). Then, for any element in the domain that satisfies \(z \in \overline{B}_w(z^0, M) \cap \ell^1_w(\mathbb{R})\), that is,
\[
\sum_{t \in \mathbb{Z}_-} |z_t| w_{-t} < +\infty, \tag{5.1}
\]
there exists a unique expansion
\[
U(z)_t = U(z^0)_t + \sum_{j=1}^{\infty} \sum_{m_1, \ldots, m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t}), \quad t \in \mathbb{Z}_-, \tag{5.2}
\]
where the maps \(g_j : \mathbb{Z}_-^j \to \mathbb{R}^N\), \(j \geq 1\), are uniquely determined by the derivatives of the functional \(H_U : \overline{B}_w(z^0, M) \subset \ell^\infty_w(\mathbb{R}) \to \mathbb{R}^N\) associated to \(U\) (which by Proposition 39 in the Appendix is analytic) via the relation
\[
g_j(m_1, \ldots, m_j) := \frac{1}{j!} D^jH_U(z^0)(e_{m_1}, \ldots, e_{m_j}), \quad\text{with } (e_n)_t := \begin{cases} 1 & \text{if } t = n,\\ 0 & \text{otherwise.}\end{cases} \tag{5.3}
\]
Moreover, for any \(p \in \mathbb{N}_+\), we have that
\[
\Big\|U(z)_t - U(z^0)_t - \sum_{j=1}^{p} \sum_{m_1, \ldots, m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t})\Big\| \leq \frac{L}{1 - \|z - z^0\|_w/M}\left(\frac{\|z - z^0\|_w}{M}\right)^{p+1}. \tag{5.4}
\]
These statements also hold true when \(\ell^\infty_w(\mathbb{R})\) and \(\ell^\infty_w(\mathbb{R}^N)\) are replaced by \(\ell^\infty(\mathbb{R})\) and \(\ell^\infty(\mathbb{R}^N)\), respectively. In that case, the relation (5.2) holds whenever \(z \in \overline{B}_\infty(z^0, M) \cap \ell^1(\mathbb{R})\) and the inequality (5.4) is obtained by taking as the sequence \(w\) the constant sequence \(w^\iota\) given by \(w^\iota_t := 1\), for all \(t \in \mathbb{N}\).

Remark 30 The error estimate (5.4) can be reformulated in terms of the weighted norm of the remainder sequence \(R_p(z)\) with entries \(R_p(z)_t := U(z)_t - U(z^0)_t - \sum_{j=1}^{p} \sum_{m_1, \ldots, m_j=-\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t})\) as
\[
\|R_p(z)\|_w \leq \frac{L}{1 - \|z - z^0\|_w/M}\left(\frac{\|z - z^0\|_w}{M}\right)^{p+1}. \tag{5.5}
\]

5.1. Finite discrete-time Volterra series are universal in the fading memory category

In this section we combine the Volterra series representation Theorem 29 with previous universality results in Grigoryeva and Ortega (2018a) to show that any fading memory filter with uniformly bounded inputs can be arbitrarily well approximated by a Volterra series with finitely many terms of the type in (5.2). This result provides an alternative proof of a Volterra series universality theorem that was stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4).
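A minimal worked instance of the kernel formula (5.3): for a two-dimensional linear reservoir with nilpotent connectivity and the polynomial readout \(h(x) = x_1x_2 + x_1\) (the matrices, the readout, and the input below are all our own illustrative choices), the degree-2 finite Volterra expansion built from the derivatives of the functional reproduces the filter output exactly:

```python
import numpy as np
from itertools import product

# Linear reservoir with nilpotent connectivity (index p = 2) and polynomial readout.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])          # A @ A == 0
c = np.array([0.0, 1.0])

def h(x):
    return x[0] * x[1] + x[0]       # polynomial readout, deg(h) = 2

def H(z):
    """Reservoir functional H(z) = sum_j A^j c z_{-j}; finite by nilpotency."""
    return c * z[-1] + (A @ c) * z[-2]   # z[-1] = z_0, z[-2] = z_{-1}

grad_h = np.array([1.0, 0.0])            # Dh(0)
hess_h = np.array([[0.0, 1.0],
                   [1.0, 0.0]])          # D^2 h(0)

def e(m):
    v = np.zeros(2)
    v[1 + m] = 1.0                       # unit input sequence e_m, m in {-1, 0}
    return v

# Volterra kernels g_j(m_1,...,m_j) = (1/j!) D^j h(0)(H(e_{m_1}), ..., H(e_{m_j}))
g1 = {m: grad_h @ H(e(m)) for m in (-1, 0)}
g2 = {(m1, m2): 0.5 * H(e(m1)) @ hess_h @ H(e(m2))
      for m1, m2 in product((-1, 0), repeat=2)}

z = np.array([0.3, -1.2])                # the input window (z_{-1}, z_0)
volterra = (sum(g1[m] * z[1 + m] for m in g1)
            + sum(g2[m1, m2] * z[1 + m1] * z[1 + m2] for (m1, m2) in g2))
direct = h(H(z))                         # direct evaluation of the filter functional
```

Because the connectivity is nilpotent and the readout has finite degree, only finitely many kernels are nonzero, so the expansion is exact for every input rather than an infinite series that needs truncating.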
In particular, this result shows that any time-invariant and causal fading memory filter can be uniformly approximated by a finite memory filter.

Theorem 31 (Universality of finite discrete-time Volterra series) Let \(M, L > 0\) and let \(K_M \subset (\mathbb{R})^{\mathbb{Z}_-}\), \(K_L \subset (\mathbb{R}^d)^{\mathbb{Z}_-}\) be as in (1.3). Let \(U : K_M \to K_L\) be a causal and time-invariant fading memory filter. Then, for any \(\epsilon > 0\) there exist \(x^0 \in K_L\) and \(J \in \mathbb{N}_+\) such that for any \(j \in \{1, \ldots, J\}\) there exist \(j\) numbers \(M^j_1, \ldots, M^j_j \in \mathbb{N}_+\) and maps \(g_j : \mathbb{Z}_-^j \to \mathbb{R}^d\) such that the filter determined by the finite Volterra series given by
\[
V(z)_t = x^0_t + \sum_{j=1}^{J} \sum_{m_1 = -M^j_1}^{0} \cdots \sum_{m_j = -M^j_j}^{0} g_j(m_1, \ldots, m_j)\, z_{m_1+t} \cdots z_{m_j+t} \tag{5.6}
\]
is such that \(|||U - V|||_\infty = \sup_{z \in K_M}\{\|U(z) - V(z)\|_\infty\} < \epsilon\).

Proof. Corollary 11 in Grigoryeva and Ortega (2018a) guarantees that for any \(\epsilon > 0\) there exists a linear reservoir system with polynomial readout \(h \in \mathbb{R}[x]\) and nilpotent connectivity matrix \(A \in \mathbb{M}_N\), determined by the expressions
\[
x_t = Ax_{t-1} + cz_t, \quad A \in \mathbb{M}_N,\ c \in \mathbb{M}_{N,n}, \tag{5.7}
\]
\[
y_t = h(x_t), \quad h \in \mathbb{R}[x], \tag{5.8}
\]
such that its associated reservoir filter \(U^{A,c}_h : K_M \to K_L\) satisfies
\[
|||U - U^{A,c}_h|||_\infty < \epsilon. \tag{5.9}
\]
Let \(J := \deg(h) + 1\) and assume that \(A\) is nilpotent of index \(p\). In order to prove the theorem it suffices to show that the Volterra series expansion in (5.2) corresponding to \(U^{A,c}_h\) has an expression of the type (5.6). If that is the case, the statement in (5.9) proves the theorem. Indeed, recall (see, for instance, (Grigoryeva and Ortega, 2018a, Corollary 11)) that the functional \(H^{A,c}_h\) associated to the filter \(U^{A,c}_h\) is given by \(H^{A,c}_h(z) = h\big(\sum_{j \geq 0} A^j c z_{-j}\big)\), which is the composition of the polynomial \(h\) with the functional \(H^{A,c}\) associated to the reservoir equation (5.7), given by the linear operator
\[
H^{A,c}(z) := \sum_{j \geq 0} A^j c z_{-j}. \tag{5.10}
\]
It is easy to see that \(H^{A,c} : (\ell^\infty(\mathbb{R}), \|\cdot\|_\infty) \to \mathbb{R}^N\) has a finite operator norm \(|||H^{A,c}|||\) and that \(|||H^{A,c}||| \leq |||c|||/(1 - |||A|||)\), with \(|||c|||\) and \(|||A|||\) the top singular values of \(c\) and \(A\), respectively. Moreover, it is easy to see that for any \(j \in \mathbb{N}_+\), \(z \in K_M\), and \(v_1, \ldots, v_j \in \ell^\infty(\mathbb{R})\), we have \(D^jH^{A,c}_h(z)(v_1, \ldots\)
\(, v_j) = D^jh(H^{A,c}(z))\big(H^{A,c}(v_1), \ldots, H^{A,c}(v_j)\big)\), which shows that \(H^{A,c}_h : (\ell^\infty(\mathbb{R}), \|\cdot\|_\infty) \to \mathbb{R}^d\) is everywhere analytic. Using this expression and (5.3) we define
\[
g_j(m_1, \ldots, m_j) := \frac{1}{j!} D^jh(0)\big(H^{A,c}(e_{m_1}), \ldots, H^{A,c}(e_{m_j})\big). \tag{5.11}
\]
As \(h\) has finite degree, \(D^jh(0) = 0\) for any \(j > \deg(h)\). Moreover, since the sum in (5.10) is finite by the nilpotency of \(A\), it is clear that \(g_j(m_1, \ldots, m_j)\) in (5.11) can be nonzero only when \(1 \leq j \leq \deg(h)\) and \(-(p-1) \leq m_1, \ldots, m_j \leq 0\). If we define \(M^j_1 = \cdots = M^j_j := p - 1\) then the Taylor series expansion of \(U^{A,c}_h(z)\) coincides with (5.6). We emphasize that in this case this expansion is valid for any \(z \in K_M\) by the finiteness of the number of terms in the sum and that the condition (5.1) is hence not necessary. ∎

6. Appendices

6.1. The topologies induced by weighted and supremum norms

An important feature of the topology generated by weighted norms is that it coincides with the product topology on subsets made of uniformly bounded sequences like the space \(K_M\) in (1.3). This fact holds true for any weighting sequence \(w\) and has important consequences (see Grigoryeva and Ortega (2018b) for the details). First, the fading memory property is independent of the weighting sequence used to define it. Second, the subsets \(K_M \subset \ell^\infty_w(\mathbb{R}^n)\) are compact in the topology induced by the weighted norms \(\|\cdot\|_w\). We emphasize that these statements are valid exclusively in the context of uniformly bounded subsets which, as we see in the next result, are never open in the weighted topology. We adopt in the sequel the following notation for product sets and functions: for any family \(\{A_t\}_{t \in \mathbb{Z}_-}\) of subsets \(A_t \subset \mathbb{R}^n\), the symbol
\[
\prod_{t \in \mathbb{Z}_-} A_t := \big\{z \in (\mathbb{R}^n)^{\mathbb{Z}_-} \mid z_t \in A_t, \text{ for all } t \in \mathbb{Z}_-\big\} \tag{6.1}
\]
denotes the Cartesian product of the sets in the family. When all the elements in the family are identical to a given subset \(A\), we use the symbols \(\prod_{t \in \mathbb{Z}_-} A\) and \((A)^{\mathbb{Z}_-}\) interchangeably.
A similar notation is adopted for the Cartesian product of maps: let \(V\) be a set and let \(f_t : V \to A_t\) be a map, \(t \in \mathbb{Z}_-\). The symbol \(\prod_{t \in \mathbb{Z}_-} f_t\) denotes the map
\[
\prod_{t \in \mathbb{Z}_-} f_t : V \to \prod_{t \in \mathbb{Z}_-} A_t, \qquad v \mapsto (\ldots, f_{-2}(v), f_{-1}(v), f_0(v)). \tag{6.2}
\]
Lemma 32 Let \(w\) be a weighting sequence and \(n \in \mathbb{N}_+\). Then:

(i) For any \(z \in \ell^\infty_w(\mathbb{R}^n)\) and \(r > 0\),
\[
B_w(z, r) = \bigcup_{0 < \delta < r}\ \prod_{t \in \mathbb{Z}_-} B\big(z_t, \delta/w_{-t}\big). \tag{6.3}
\]
In particular, this implies that
\[
B_w(z, r) \subset \prod_{t \in \mathbb{Z}_-} B\big(z_t, r/w_{-t}\big) \subset \overline{B}_w(z, r). \tag{6.4}
\]
The identity (6.3) implies that any open ball \(B_w(z, r)\) in \(\ell^\infty_w(\mathbb{R}^n)\) contains unbounded sequences.

(ii) Let \(\{A_t\}_{t \in \mathbb{Z}_-}\) be a family of subsets \(A_t \subset \mathbb{R}^n\) such that there exists a sequence \(\{c_t\}_{t \in \mathbb{Z}_-}\) that satisfies
\[
\sup_{z_t \in A_t}\{\|z_t\| w_{-t}\} < c_t, \text{ for each } t \in \mathbb{Z}_-, \quad\text{and}\quad \sup_{t \in \mathbb{Z}_-}\{c_t\} < +\infty; \tag{6.5}
\]
then the product set \(\prod_{t \in \mathbb{Z}_-} A_t \subset \ell^\infty_w(\mathbb{R}^n)\).

(iii) For every family \(\{A_t\}_{t \in \mathbb{Z}_-}\) of subsets \(A_t \subset \mathbb{R}^n\) such that the product set satisfies \(\prod_{t \in \mathbb{Z}_-} A_t \subset \ell^\infty_w(\mathbb{R}^n)\), we have
\[
\overline{\prod_{t \in \mathbb{Z}_-} A_t} = \prod_{t \in \mathbb{Z}_-} \overline{A_t}. \tag{6.6}
\]
These statements, except for the last sentence in part (i), are also valid for the space \(\ell^\infty(\mathbb{R}^n)\) and are obtained by taking as sequence \(w\) the constant sequence \(w^\iota\) given by \(w^\iota_t := 1\), for all \(t \in \mathbb{N}\).

Proof. (i) We prove (6.3) by double inclusion. First, let \(x \in B_w(z, r)\). By definition, \(\|x - z\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} < r\) and hence for any \(\delta_x > 0\) such that \(\|x - z\|_w < \delta_x < r\) we have that \(\|x_t - z_t\| < \delta_x/w_{-t}\), for all \(t \in \mathbb{Z}_-\). This implies that \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, \delta_x/w_{-t})\). Conversely, given an element \(x \in \ell^\infty_w(\mathbb{R}^n)\) in the right hand side of (6.3), there exists \(\delta_x < r\) such that \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, \delta_x/w_{-t})\). This implies that \(\|x_t - z_t\| w_{-t} < \delta_x\), for all \(t \in \mathbb{Z}_-\), and hence \(\sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} = \|x - z\|_w \leq \delta_x < r\), which proves the inclusion. As to (6.4), the first inclusion is a straightforward consequence of (6.3). Let now \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, r/w_{-t})\). By definition this implies that \(\|x_t - z_t\| w_{-t} < r\), for all \(t \in \mathbb{Z}_-\), and consequently \(\sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} \leq r\) or, equivalently, \(\|x - z\|_w \leq r\). This implies that \(x \in \overline{B}_w(z, r)\) and proves the second inclusion.

(ii) Let \(x \in \prod_{t \in \mathbb{Z}_-} A_t\). Then,
\[
\|x\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t\| w_{-t}\} \leq \sup_{t \in \mathbb{Z}_-}\{c_t\} < +\infty,
\]
as required.

(iii) We first prove that \(\prod_{t \in \mathbb{Z}_-} \overline{A_t} \subset \overline{\prod_{t \in \mathbb{Z}_-} A_t}\). If \(z \in \prod_{t \in \mathbb{Z}_-} \overline{A_t}\), then for any \(\epsilon > 0\) and each \(t \in \mathbb{Z}_-\) there exists an element \(x_t \in A_t \cap B(z_t, \epsilon/(2w_{-t}))\).
Let \(x := (x_t)_{t \in \mathbb{Z}_-}\). By construction,
\[
\|x - z\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} \leq \frac{\epsilon}{2} < \epsilon,
\]
which implies that \(x \in B_w(z, \epsilon) \cap \prod_{t \in \mathbb{Z}_-} A_t\) and, as \(\epsilon > 0\) is arbitrary, it guarantees that \(z \in \overline{\prod_{t \in \mathbb{Z}_-} A_t}\). In order to show the reverse inclusion, first note that, as is proved later on in Lemma 1, the projections \(p_t : \ell^\infty_w(\mathbb{R}^n) \to \mathbb{R}^n\), \(t \in \mathbb{Z}_-\), defined by \(p_t(z) := z_t\), are continuous. Let \(z \in \overline{\prod_{t \in \mathbb{Z}_-} A_t}\) be arbitrary, let \(t \in \mathbb{Z}_-\) be arbitrary but fixed, and let \(V_t\) be an open set in \(\mathbb{R}^n\) that contains \(z_t\). The continuity of \(p_t\) implies that \(p_t^{-1}(V_t)\) is an open set in \(\ell^\infty_w(\mathbb{R}^n)\) that contains \(z\) and therefore there exists \(x \in \prod_{t \in \mathbb{Z}_-} A_t \cap p_t^{-1}(V_t)\). We consequently have that \(x_t \in A_t \cap V_t\), which guarantees that \(z_t \in \overline{A_t}\), as required. ∎

Corollary 33 Let \(D_n\) be a subset of \(\mathbb{R}^n\) and let \(w\) be a weighting sequence. Then:

(i) If \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is an open subset of \(\ell^\infty_w(\mathbb{R}^n)\) then, necessarily, \(D_n = \mathbb{R}^n\).

(ii) If \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is a closed subset of \(\ell^\infty_w(\mathbb{R}^n)\) then \(D_n\) is necessarily closed in \(\mathbb{R}^n\), that is, \(D_n = \overline{D_n}\).

(iii) The following inclusion always holds:
\[
\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \big(\overline{D_n}\big)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n). \tag{6.7}
\]
In particular, if \(D_n\) is closed in \(\mathbb{R}^n\) then so is \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) in \(\ell^\infty_w(\mathbb{R}^n)\).

The statements in parts (ii) and (iii) are also valid when the space \(\ell^\infty_w(\mathbb{R}^n)\) is replaced by \(\ell^\infty(\mathbb{R}^n)\).

Proof. (i) We proceed by contradiction. Suppose that \(D_n \neq \mathbb{R}^n\). Let \(x^0 \in \mathbb{R}^n \setminus D_n\) and let \(z^0 \in D_n\). Define the constant sequences \(x := (x^0)_{t \in \mathbb{Z}_-} \in \ell^\infty_w(\mathbb{R}^n) \setminus \big((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\big)\) and \(z := (z^0)_{t \in \mathbb{Z}_-} \in (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). Since by hypothesis \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is an open subset of \(\ell^\infty_w(\mathbb{R}^n)\), there exists \(\epsilon > 0\) such that \(B_w(z, 2\epsilon) \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). By the relation (6.4) in Lemma 32 we also have
\[
B_w(z, \epsilon) \subset \prod_{t \in \mathbb{Z}_-} B\big(z^0, \epsilon/w_{-t}\big) \subset \overline{B}_w(z, \epsilon) \subset B_w(z, 2\epsilon) \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n),
\]
which implies
\[
B\big(z^0, \epsilon/w_{-t}\big) \subset D_n, \quad\text{for all } t \in \mathbb{Z}_-. \tag{6.8}
\]
Let \(r_0 := \|x^0 - z^0\|\) and let \(t_0 \in \mathbb{Z}_-\) be such that for all \(t < t_0\) we have \(\epsilon/w_{-t} > r_0\). By (6.8) we have that \(x^0 \in B(z^0, \epsilon/w_{-t}) \subset D_n\), which contradicts the assumption on the choice of \(x^0\).

(ii) By Lemma 32 (iii) we have that
\[
\overline{(D_n)^{\mathbb{Z}_-}} = \big(\overline{D_n}\big)^{\mathbb{Z}_-}. \tag{6.9}
\]
Since by hypothesis \((D_n)^{\mathbb{Z}_-}\) is closed, it holds true that
\[
\overline{(D_n)^{\mathbb{Z}_-}} = (D_n)^{\mathbb{Z}_-}. \tag{6.10}
\]
Consequently, by (6.9) and (6.10) we have that \(\big(\overline{D_n}\big)^{\mathbb{Z}_-} = (D_n)^{\mathbb{Z}_-}\), which implies that \(\overline{D_n} = D_n\), as required.

(iii) Let \(x \in \overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \ell^\infty_w(\mathbb{R}^n)\) and consider a sequence \(\{x^m\}_{m \in \mathbb{N}_+} \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) with \(\lim_{m \to \infty} x^m = x\), that is, for each \(\epsilon > 0\) there exists \(N(\epsilon) \in \mathbb{N}_+\) such that for all \(m > N(\epsilon)\) it holds that \(\|x^m - x\|_w < \epsilon\). Hence, for all \(s \in \mathbb{Z}_-\) one has that
\[
w_{-s}\|x^m_s - x_s\| \leq \sup_{t \in \mathbb{Z}_-}\{\|x^m_t - x_t\| w_{-t}\} = \|x^m - x\|_w \leq \epsilon,
\]
which immediately implies that \(\|x^m_s - x_s\| \leq \epsilon/w_{-s}\); hence one gets that \(x_s \in \overline{D_n}\) and therefore (6.7) holds, as required. The last claim in part (iii) follows from (6.7). Indeed, if \(D_n = \overline{D_n}\) then by (6.7) we have that \(\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \big(\overline{D_n}\big)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n) = (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). Since the reverse inclusion obviously always holds, we finally have that \(\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} = (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). ∎

We also recall (see (Grigoryeva and Ortega, 2018b, Proposition 2.9)) that the norm topology in \(\ell^\infty_w(\mathbb{R}^n)\) is strictly finer than the subspace topology induced by the product topology of \((\mathbb{R}^n)^{\mathbb{Z}_-}\) on \(\ell^\infty_w(\mathbb{R}^n) \subset (\mathbb{R}^n)^{\mathbb{Z}_-}\). We complement this fact by comparing the norm topology on \((\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)\) with the relative topology induced by \((\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)\) on it.

Corollary 34 The relative topology \(\tau_{w,\infty}\) induced by the norm topology \(\tau_w\) of \((\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)\) on \(\ell^\infty(\mathbb{R}^n)\) is strictly coarser than the norm topology \(\tau_\infty\) on \((\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)\), that is, \(\tau_{w,\infty} \subsetneq \tau_\infty\).

Proof. Since, as we already saw, \(\|z\|_w \leq \|z\|_\infty\) for all \(z \in (\mathbb{R}^n)^{\mathbb{Z}_-}\), we have that \(\ell^\infty(\mathbb{R}^n) \subset \ell^\infty_w(\mathbb{R}^n)\) (see (2.4)) and the inclusion \(\iota : \ell^\infty(\mathbb{R}^n) \hookrightarrow \ell^\infty_w(\mathbb{R}^n)\) is continuous. Consequently, for any open \(W \in \tau_w\) the set \(\iota^{-1}(W) = W \cap \ell^\infty(\mathbb{R}^n) \in \tau_{w,\infty}\) is also open in \(\tau_\infty\). This immediately implies that \(\tau_{w,\infty} \subset \tau_\infty\). In order to establish that this inclusion is strict, one needs to notice that, given an arbitrary open ball \(B_\infty(z, r)\), \(r > 0\), around \(z \in \ell^\infty(\mathbb{R}^n)\), all the open balls \(B_w(z, \epsilon)\), \(\epsilon > 0\), contain elements that are not included in \(B_\infty(z, r)\), by Lemma 32 (i). ∎

Lemma 35 Let \(w\) be a weighting sequence and \(n \in \mathbb{N}_+\). We denote by \(w^a\), \(a \in \mathbb{R}\), the sequence with terms \(w^a_t\), \(t \in \mathbb{N}\).
Then, the following inclusions are continuous:
\[
(\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty) \hookrightarrow \cdots \hookrightarrow \big(\ell^\infty_{w^{1/(k+1)}}(\mathbb{R}^n), \|\cdot\|_{w^{1/(k+1)}}\big) \hookrightarrow \big(\ell^\infty_{w^{1/k}}(\mathbb{R}^n), \|\cdot\|_{w^{1/k}}\big) \hookrightarrow \cdots \hookrightarrow (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w), \tag{6.11}
\]
\[
(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \hookrightarrow \cdots \hookrightarrow \big(\ell^\infty_{w^k}(\mathbb{R}^n), \|\cdot\|_{w^k}\big) \hookrightarrow \big(\ell^\infty_{w^{k+1}}(\mathbb{R}^n), \|\cdot\|_{w^{k+1}}\big) \hookrightarrow \cdots \hookrightarrow (\mathbb{R}^n)^{\mathbb{Z}_-}, \tag{6.12}
\]
where \(k \in \mathbb{N}_+\) and on \((\mathbb{R}^n)^{\mathbb{Z}_-}\) we consider the trivial topology. Define
\[
S^w := \bigcap_{k \in \mathbb{N}_+} \ell^\infty_{w^{1/k}}(\mathbb{R}^n) \quad\text{and}\quad S_w := \bigcup_{k \in \mathbb{N}_+} \ell^\infty_{w^k}(\mathbb{R}^n). \tag{6.13}
\]
Then, in general,
\[
\ell^\infty(\mathbb{R}^n) \subsetneq S^w \quad\text{and}\quad S_w \subsetneq (\mathbb{R}^n)^{\mathbb{Z}_-}. \tag{6.14}
\]
Proof. The continuity of the inclusions (6.11) and (6.12) is a consequence of the facts that
\[
\|z\|_{w^{1/k}} \leq \|z\|_{w^{1/(k+1)}}, \text{ for all } k \in \mathbb{N}_+ \text{ and } z \in \ell^\infty_{w^{1/(k+1)}}(\mathbb{R}^n), \tag{6.15}
\]
\[
\|z\|_{w^{k+1}} \leq \|z\|_{w^k}, \text{ for all } k \in \mathbb{N}_+ \text{ and } z \in \ell^\infty_{w^k}(\mathbb{R}^n). \tag{6.16}
\]
Regarding (6.14), the first inclusion follows from the fact that \(\ell^\infty(\mathbb{R}^n) \subset \ell^\infty_w(\mathbb{R}^n)\) for any weighting sequence. In order to show that this inclusion is in general not an equality it suffices to consider the following example: let \(z \in (\mathbb{R})^{\mathbb{Z}_-}\) be given by \(z_t := t\), \(t \in \mathbb{Z}_-\), and let \(w\) be the weighting sequence defined by \(w_t := \lambda^t\), with \(t \in \mathbb{N}\) and \(0 < \lambda < 1\). A simple application of L'Hôpital's rule shows that, for any \(k \in \mathbb{N}_+\),
\[
\lim_{t \to -\infty} |z_t| w^{1/k}_{-t} = 0,
\]
which proves, in particular, that \(\|z\|_{w^{1/k}} < +\infty\) and hence that \(z \in \ell^\infty_{w^{1/k}}(\mathbb{R})\), for any \(k \in \mathbb{N}_+\). This implies that \(z \in S^w\). However, \(z\) is an unbounded sequence and hence it does not belong to \(\ell^\infty(\mathbb{R})\). In order to show that the second inclusion in (6.14) is also strict, take \(z \in (\mathbb{R})^{\mathbb{Z}_-}\) given by \(z_t := \lambda^{-t}\) with \(\lambda > 1\) and \(t \in \mathbb{Z}_-\), and let \(w\) be the weighting sequence defined by \(w_0 := 1\) and \(w_t := 1/t\), for any \(t \in \mathbb{N}_+\). L'Hôpital's rule shows that, for any \(k \in \mathbb{N}_+\),
\[
\lim_{t \to -\infty} |z_t w^k_{-t}| = +\infty,
\]
and consequently \(z\) does not belong to any of the spaces \(\ell^\infty_{w^k}(\mathbb{R})\) and hence \(z \notin S_w\). ∎

6.2. Products of continuous and differentiable functions using weighted norms

The following lemma spells out conditions under which infinite Cartesian products of continuous and differentiable functions are continuous and differentiable when we use weighted and supremum norms.

Lemma 36 Let \(W \subset V\) with \((V, \|\cdot\|)\) a normed space and let \(D_N \subset \mathbb{R}^N\) be a subset of \(\mathbb{R}^N\). Let \(H_t : W \to D_N\), \(t \in \mathbb{Z}_-\), be a family of maps.
Consider the corresponding product map H : W (DN)Z , defined as in (6.2): t Z Ht := (. . . , H 2, H 1, H0) , or equivalently, (H(z))t := Ht(z), z W, t Z . (6.17) (i) Endow W V with the subspace topology. If DN is a compact subset of RN then (DN)Z ℓw (RN) for any weighting sequence w. If each of the functions Ht is continuous then H : W (DN)Z ℓw (RN) is also continuous. (ii) Let w be a weighting sequence and suppose that W contains a point z0 such that H(z0) ℓw (RN). If each of the functions Ht is Lipschitz continuous with Lipschitz constant c0 t and the sequence c0 := (c0 t)t Z formed by these Lipschitz constants satisfies that c0 ℓw (R), then H : W (DN)Z ℓw (RN) is Lipschitz continuous with Lipschitz constant c0 H c0 w. (iii) Suppose that W is an open convex subset of (V, ) and that it contains a point z0 such that H(z0) ℓw (RN). Suppose also that the maps Ht are of class Cr(W), r 1, and let cr t be finite constants such that supz W {|||Dr Ht(z)|||} cr t < + . If cr := (cr t)t Z ℓw (R) then H is differentiable of order r when considered as a map H : W (V, ) ℓw (RN), w and |||Dr H(z)||| cr w, for any z W. (6.18) Additionally, if cj ℓw (R) for all j {1, . . . , r}, then H is of class Cr 1(W) and the map Dr 1H : (W, ) Lr 1(V, ℓw (RN)), ||| ||| is Lipschitz continuous with Lipschitz constant cr H cr w. (iv) Suppose that W is an open convex subset of (V, ) and that it contains a point z0 such that H(z0) ℓw (RN). If the maps Ht are smooth and cr w < + , for each r N+, then so is H : W (V, ) (ℓw (RN), w). Suppose, additionally, that the maps Ht are analytic and that ρt > 0 is the radius of convergence of the series expansion of Ht. If ρ := inft Z {ρt} > 0 then H is analytic when considered as a map H : W (V, ) (ℓw (RN), w) and the radius of convergence ρH of its series expansion satisfies that ρH ρ > 0. Parts (ii), (iii), and (iv) also hold true when the Banach space (ℓw (RN), w) is replaced by (ℓ (RN), ). Part (i) is in general false in that situation. Proof. 
(i) The compactness of DN guarantees (Munkres, 2014, Theorem 27.3) that there exists L > 0 such that DN B (0, L) and hence (DN)Z ℓw (RN) necessarily. It can also be shown (see Grigoryeva and Ortega (Grigoryeva and Ortega, 2018b, Corollary 2.7)) that when DN is compact, the relative topology (DN)Z induced by the weighted norm w in ℓw (RN) coincides with the product topology. This implies (see (Munkres, 2014, Theorem 19.6)) that if the functions Ht are continuous then so is H. (ii) Let z1, z2 W. Then, H(z1) H(z2) w = sup t Z Ht(z1) Ht(z2) w t sup t Z c0 t z1 z2 w t c0 w z1 z2 , (6.19) which proves simultaneously that H is Lipschitz continuous and that it maps into ℓw (RN). Regarding the last point, recall that by hypothesis there exists a point z0 such that H(z0) ℓw (RN) and hence by (6.19) we have, for any z W, H(z) w c0 w z z0 + H(z0) w < + . (6.20) (iii) First, it is easy to prove recursively that for any z W, the map Dr H(z) := Q t Z Dr Ht(z) satisfies the condition (2.5). In order to prove the first statement of the lemma, it suffices to show that the multilinear map Dr H(z) : (V, ) (V, ) | {z } r times (ℓw (Rn), w), is bounded for any z W. Let v1, . . . , vr V r. Using the r-order differentiability of Ht we can write Dr H(z) v1, . . . , vr w = t Z Dr Ht(z) v1, . . . , vr w Dr Ht(z) v1, . . . , vr w t |||Dr Ht(z)||| v1 vr w t v1 vr sup t Z {cr tw t} cr w v1 vr , (6.21) which proves the boundedness of Dr H(z) and the inequality in (6.18). We now assume that cj ℓw (R) for all j {1, . . . , r} and show that H maps into ℓw (RN) and that it is of class Cr 1(W). Notice, first of all, that for any t Z and any z1, z2 Vn, we have by the convexity of W, the mean value theorem Abraham et al. (1988), and the hypothesis Ht Cr(W), that for all j {1, . . . , r}, Dj 1Ht(z1) Dj 1Ht(z2) w sup z W Dj Ht(z) z1 z2 = cj t z1 z2 . 
(6.22) Taking j = 1 in the previous inequality, we see that the functions Ht are Lipschitz continuous with constants c1 t that form a sequence that by hypothesis belongs to ℓw (R). This guarantees by part (ii) that H maps into ℓw (RN) necessarily. Now, using the inequality (6.22), we have that for any z1, z2 W, Dr 1H(z1) Dr 1H(z2) w v1,...,vr 1 V v1,...,vr 1 =0 ( Dr 1H(z1) Dr 1H(z2) v1, . . . , vr 1 w v1 vr 1 v1,...,vr 1 V v1,...,vr 1 =0 ( supt Z Dr 1Ht(z1) Dr 1Ht(z2) v1, . . . , vr 1 w t v1,...,vr 1 V v1,...,vr 1 =0 ( supt Z cr tw t z1 z2 v1 vr 1 = cr w z1 z2 , (6.23) Differentiable reservoir computing which shows that the map Dr 1H : (W, w) Lr 1(ℓw (Rn), ℓw (RN)), ||| |||w is Lipschitz continuous with Lipschitz constant cr H cr w. (iv) The previous part of the lemma together with the hypothesis c := supr N+ { cr w} < + guarantees that the differentiability of any order in the functions Ht gets translated into the differentiability of any order of the map H : W (V, ) (ℓw (RN), w). Moreover, let u ℓw (Rn) and let ur := (u, . . . , u) ℓw (Rn) r, r N+. The Taylor series expansion of H around 0 is 1 r!Dr H(0) ur = Y 1 r!Dr Ht(0) ur ! The expansion in the left hand side of this equality is convergent if and only if each of the series in the product in the right hand side is convergent. This is the case when u w < ρt, for all t Z , which guarantees the convergence of the Taylor series expansion in (6.24) for all the elements u ℓw (Rn) that satisfy u w < inft Z {ρt} = ρ. Since by hypothesis ρ > 0, we have proved that H is analytic with radius of convergence ρH ρ. The proof of the statements in (ii), (iii), and (iv) for the space ℓ (RN) is obtained by mimicking the proofs that we just provided, replacing the weighting sequence w by the constant sequence wι that is equal to 1 for each t Z . In order to show that part (i) is in general false in that situation take W = ( 1, 1), DN = [ 1, 1], and define Ht(z) := tanh ( tz), with t Z and z ( 1, 1). 
Given that H 1 t 1 t tanh 1( 1 t tanh 1( 1 2) it is clear that This equality shows that the preimage by the product map H of an open set is not open and hence H is not continuous. 6.3. Proof of Lemma 1 (i) The linearity of pt is obvious. Let u ℓw (Rn) arbitrary. Since pt(u) = ut = ut w t/w t supj Z { uj w j} /w t = u w /w t, we can conclude that |||pt|||w 1/w t. Let now v Rn such that v = 1 and define the element z ℓw (Rn) by zt := v/w t, for all t Z . It is clear that z w = 1 and that pt(z) / z w = 1/w t, which shows that |||pt|||w = 1/w t, as required. (ii) We first prove the statements in this part in the case t < 0. Suppose that the inverse decay ratio Lw is finite and let u ℓw (Rn) arbitrary. Then T1(u) w = sup t Z { ut 1 w t} = sup t Z ut 1 w (t 1) w t w (t 1) ut 1 w (t 1) sup t Z w t w (t 1) u w Lw. (6.25) This inequality shows that T1 maps ℓw (Rn) into ℓw (Rn) and that |||T1|||w Lw. Given that for any t Z we can write T t = T1 T1 | {z } t times the previous conclusion also proves that T t maps ℓw (Rn) into ℓw (Rn) and that |||T t|||w = |||T1 T1|||w |||T1|||w |||T1|||w L t w . Grigoryeva and Ortega It remains to be shown that |||T1|||w = Lw. In order to do so, take an element v Rn such that v = 1 and define the element u ℓw (Rn) by ut := v/w t, for all t Z . Notice that by construction u w = 1 and, moreover, u w = sup t Z { ut 1 w t} = sup t Z v w (t 1) w t which proves the required identity. We now show that T t : (ℓw (Rn), w) (ℓw (Rn), w) is surjective. Indeed, it is clear that for any u ℓw (Rn), the element eu := (u, 0, . . . , 0 | {z } t times ) is such that T t (eu) = u. We hence just need to show that eu ℓw (Rn). This is the case because eu w = sup s Z { f us w s} = sup s Z { us+t w s} = sup s Z us+t w (s+t) w s w (s+t) us+t w (s+t) w s w (s+1) w (s+1) w (s+2) w (s+t 1) u w L t w < + , (6.26) because u ℓw (Rn) and by hypothesis Lw < + . 
Now, since we already showed that T t : (ℓw (Rn), w) (ℓw (Rn), w) is continuous, then the Banach-Schauder Open Mapping Theorem (Abraham et al., 1988, Theorem 2.2.15) implies that T t is necessarily an open map. It remains to be shown that T t : (ℓw (Rn), w) (ℓw (Rn), w) is a submersion (see (Abraham et al., 1988, Section 3.5) for context and definitions). First, it is obvious that ker T t = (. . . , 0, 0, v) | v (Rn) t . Since T t is linear and bounded, in order to show that it is a submersion it suffices to show that ker T t is split, that is, it has a closed complement in (ℓw (Rn), w). We now prove that such a complement is given by the subspace (u, 0, . . . , 0 | {z } t times ) | u ℓw (Rn) The inequality (6.26) implies that C t ℓw (Rn). (6.28) Additionally, C t is clearly closed in ℓw (Rn). We conclude by showing by double inclusion that ℓw (Rn) = ker T t C t. (6.29) Let first u ℓw (Rn) and define u1 := (. . . , ut 2, ut 1, ut, 0, . . . , 0 | {z } t times ) and u2 := (. . . , 0, 0, ut+1, ut+2, . . . , u0). It is clear that u = u1 + u2. Additionally, the sequence u2 is obviously in ker T t and using an argument similar to the one in (6.26) it is easy to show that u1 C t, which proves the inclusion ℓw (Rn) ker T t C t. Differentiable reservoir computing Conversely, let u1 C t and u2 ker T t. By (6.28) we have that u1 w < + and it is also clear that u2 w < + . Therefore u1 + u2 w u1 w + u2 w < + and hence u1 + u2 ℓw (Rn), which shows that T t is a submersion. Finally, the statements in the case t > 0 are proved in a similar fashion. In particular, it is easy to see that T t Tt = PCt, for any t > 0, where PCt is the projection onto the subspace Ct defined in (6.27) according to the splitting (6.29). Moreover, it is easy to see that T t is injective and that its image Im T t is split because Im T t = Ct and by (6.29) ℓw (Rn) = ker Tt Im T t, t > 0, which proves that T t is an immersion. (iii) Straightforward consequence of the definitions. 
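The operator norms computed above, |||pt|||w = 1/w−t and |||T1|||w = Lw, can be sanity-checked numerically on truncated sequences. The following sketch (geometric weighting, illustrative truncation length; all variable names hypothetical) evaluates the extremal element u−t := 1/wt used in the proof:

```python
import numpy as np

# Truncated left-infinite sequences: u[t] stands for the entry u_{-t}, t = 0,...,T-1.
lam, T = 0.8, 40                           # geometric weighting, so L_w = 1/lam
w = lam ** np.arange(T)

def wnorm(u):
    """Weighted sup norm ||u||_w = sup_t |u_{-t}| w_t on the truncation."""
    return float(np.max(np.abs(u) * w[:len(u)]))

# Extremal element u_{-t} := 1/w_t from the proof: ||u||_w = 1, and the shifted
# sequence (T_1 u)_{-t} = u_{-(t+1)} attains the operator norm |||T_1|||_w = L_w.
u = 1.0 / w
T1u = u[1:]
print(wnorm(u), wnorm(T1u))                # approx. 1.0 and approx. 1/lam

# Likewise |p_t(u)| = 1/w_t = ||u||_w / w_t, matching |||p_t|||_w = 1/w_t.
t = 7
assert abs(u[t] - wnorm(u) / w[t]) < 1e-9
```

The shift by one step multiplies the attained weighted norm by exactly the supremum of the consecutive weight ratios, which is the content of |||T1|||w = Lw.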
The proofs for the space (ℓ (Rn), ) can be obtained by replacing in the previous arguments the weighting sequence w by the constant sequence wι. 6.4. Equivalence of FMP and differentiability in filters and functionals The facts established in Lemma 1 can be used to show the equivalence between the continuity and the differentiability of causal and time-invariant filters and that of their associated functionals. The following result focuses on continuity and the fading memory property and generalizes to the context of eventually unbounded inputs the equivalence between fading memory filters and functionals established in (Grigoryeva and Ortega, 2018b, Propositions 2.11 and 2.12) for uniformly bounded inputs. In the results that follow we work in a setup slightly more general than the one that is customary in the literature as we will allow for the weighting sequences considered in the domain and the target of the filters to be different. This degree of generality is needed later on in the text. Proposition 37 Let Vn (Rn)Z and VN RN Z be time-invariant subsets and let DN RN. Let w1, w2 be weighting sequences with inverse decay ratios Lw1 and Lw2, respectively. (i) Let U : Vn ℓw1 (Rn) VN ℓw2 (RN) be a causal and time-invariant filter. If U has the fading memory property then so does its associated functional HU : Vn p0(VN). The same conclusion holds for continuous filters U : Vn ℓ (Rn) VN ℓ (RN). (ii) Let H : Vn ℓw1 (Rn) DN be a fading memory functional. If Lw1 is finite and DN is compact then the associated causal and time-invariant filter UH : Vn ℓw1 (Rn) (DN)Z ℓw2 (RN) has also the fading memory property. (iii) Let H : Vn ℓw1 (Rn) DN be a fading memory functional and suppose that Vn contains a point z0 such that UH(z0) ℓw2 (R), where UH is the causal and time-invariant filter associated to H. 
If H is Lipschitz, c H is a Lipschitz constant, and the weighting sequences satisfy one of the following two conditions either Rw1,w2 := sup s,t N w1 t w2 s w1 t+s < + or the sequence Lw1 := L t w1 t Z ℓw2 (R), (6.30) then UH : Vn ℓw1 (Rn) (DN)Z ℓw2 (RN) has also the fading memory property, it is Lipschitz, and Rw1,w2c H or Lw1 w2c H, respectively, is a Lipschitz constant of UH. The same conclusion holds for continuous functionals H : Vn ℓ (Rn) DN where the condition (6.30) is not needed. Grigoryeva and Ortega Proof. (i) As HU is given by HU = p0 U, the FMP (respectively, continuity) of U and the first part of Lemma 1 prove the statement. (ii) Notice first that as UH = Y t Z H T t (6.31) then, as Lw1 is finite, UH is by the second part of Lemma 1 the Cartesian product of continuous functions Ht := H T t : Vn ℓw1 (Rn) DN. Since DN is by hypothesis compact, the result follows from the first part of Lemma 36. (iii) Let z1, z2 Vn arbitrary. Then by (6.31) and the Lipschitz hypothesis on H, we have that UH(z1) UH(z2) w2 = sup t Z H(T t(z1)) H(T t(z2)) w2 t c H sup t Z T t(z1) T t(z2) w1 w2 t . (6.32) If the first condition in (6.30) is satisfied, this expression is bounded above by c H sup t,s Z T t(z1)s T t(z2)s w2 tw1 s = c H sup t,s Z ( z1 t+s z2 t+s w1 (t+s) w2 tw1 s w1 (t+s) Rw1,w2c H z1 z2 w1 which proves that in that case UH has the fading memory property, it is Lipschitz, and Rw1,w2c H is a Lipschitz constant. If the second condition in (6.30) is satisfied then the inverse decay ratio Lw1 is necessarily finite and hence (6.32) can be bounded using the second part of Lemma 1 as c H sup t Z T t(z1) T t(z2) w1 w2 t c H z1 z2 w1 sup t Z L t w1w2 t = Lw1 w2c H z1 z2 w1 , which proves that in that case Lw1 w2c H is a Lipschitz constant of UH. The Lipschitz continuity of UH together with the hypothesis on the existence of a point z0 such that UH(z0) ℓw2 (R) guarantee that UH maps into ℓw2 (RN) using a strategy similar to the one followed in (6.20). 
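With a single weighting sequence w in both the domain and the target, the first quantity in (6.30) reduces to Rw = sup_{s,t∈N} {wt ws / wt+s}. The sketch below (truncated suprema, so only an approximation of the true supremum; the sequence choices follow the examples of Remark 2) evaluates it for a geometric, a harmonic, and a super-exponentially decaying weighting sequence:

```python
import math

def R_w(w, T=10):
    """Truncated version of R_w = sup_{s,t in N} w_t * w_s / w_{t+s}."""
    return max(w(t) * w(s) / w(t + s) for t in range(T) for s in range(T))

geometric = lambda t: 0.5 ** t             # ratio is identically 1: R_w = 1
harmonic  = lambda t: 1.0 / (1.0 + t)      # (1+t+s) <= (1+t)(1+s): R_w = 1
gaussian  = lambda t: math.exp(-t * t)     # ratio is exp(2 s t): unbounded

print(R_w(geometric), R_w(harmonic), R_w(gaussian))
```

For the first two families the truncated supremum already equals the exact value 1, so condition (6.30) holds; for wt = exp(−t²) the truncated supremum grows without bound as the truncation is enlarged, matching the failure of (6.30) for that sequence.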
The proof for the spaces ℓ∞(Rn) and ℓ∞(RN) is obtained by taking as weighting sequences the constant sequence wι given by wιt := 1, for all t ∈ N, which automatically satisfies both conditions in (6.30). Remark 38 When in part (iii) we consider the same weighting sequence w for the domain and the target, it is easy to see that Rw := sup_{s,t∈N} {wt ws / wt+s} satisfies Rw ≤ ∥Lw∥w and therefore the second condition in (6.30) implies the first one. Indeed, Rw = sup_{s,t∈N} {(wt/wt+1)(wt+1/wt+2) · · · (wt+s−1/wt+s) ws} ≤ sup_{s∈N} {Ls_w ws} = ∥Lw∥w, as required. In this setup, the condition (6.30) is satisfied by many families of commonly used weighting sequences. In the two examples considered in Remark 2 we have that Rw = ∥Lw∥w = 1 for the geometric sequence; for the harmonic sequence ∥Lw∥w = +∞ but Rw = 1 and hence (6.30) is still satisfied. We emphasize that condition (6.30) is not automatically satisfied by all weighting sequences. For example, as we saw in Remark 2, the sequence wt := exp(−t2) is such that Lw = +∞ and, additionally, it is easy to see that Rw = sup_{s,t∈N} {exp(2st)} = +∞. Proposition 39 Let w1 and w2 be two weighting sequences with inverse decay ratios Lw1 and Lw2, respectively. Let Vn ⊂ ℓw1(Rn) and VN ⊂ ℓw2(RN) be time-invariant open subsets, and let DN be an open subset of RN. (i) Let U : Vn ⊂ ℓw1(Rn) → VN ⊂ ℓw2(RN) be a causal and time-invariant filter. If U is of class Cr(Vn) (respectively, smooth or analytic) when considered as a map U : (Vn ⊂ ℓw1(Rn), ∥·∥w1) → (VN ⊂ ℓw2(RN), ∥·∥w2), then so is the associated functional HU : (Vn ⊂ ℓw1(Rn), ∥·∥w1) → p0(VN) ⊂ RN. Moreover, |||Dr HU(z)|||w1 ≤ |||Dr U(z)|||w1,w2, for any z ∈ Vn. (6.33) The same conclusion holds when the weighted sequence spaces are replaced by (ℓ∞(Rn), ∥·∥∞) and (ℓ∞(RN), ∥·∥∞). (ii) Let H : Vn ⊂ ℓw1(Rn) → DN be a functional and suppose that Vn is convex and contains a point z0 such that UH(z0) ∈ ℓw2(RN), where UH is the causal and time-invariant filter associated to H. If the functional H is of class Cr(Vn) and for any j ∈ {1, . . .
, r} we have that cj := supz Vn Dj H(z) w1 < + and the weighting sequences satisfy that Lw1,j := (L jt w1 )t Z ℓw2 (R), (6.34) then the associated causal and time-invariant filter UH is differentiable of order r when considered as a map UH : Vn ℓw1 (Rn), w1 (DN)Z ℓw2 (RN), w2 . Moreover, for any z Vn, |||Dr UH(z)|||w1,w2 cr Lw1,r w2. (6.35) Additionally, UH is of class Cr 1(Vn) and the map Dr 1UH : (Vn, w1) Lr 1 ℓw1 (Rn), ℓw2 (RN) , ||| |||w1,w2 is Lipschitz continuous with Lipschitz constant cr Lw1,r w2. The same conclusion holds when the weighted sequence spaces are replaced by ℓ (Rn), and ℓ (RN), . In that case the inequality (6.35) holds with Lw1,r w2 = 1. (iii) Let H : Vn ℓw1 (Rn) DN be a functional and suppose that Vn is convex and contains a point z0 such that UH(z0) ℓw2 (R), where UH is the causal and time-invariant filter associated to H. If the functional H is smooth and cr < + for all r N+, then so is the associated causal and time-invariant filter UH : Vn ℓw1 (Rn), w1 (DN)Z ℓw2 (RN), w2 . The same conclusion holds when the weighted spaces are replaced by ℓ (Rn), and ℓ (RN), . In that case, if H is analytic then so is UH and the radius of convergence of the series expansion of UH is bigger or equal than that of H. Proof. (i) Recall first that HU can be written as HU = p0 U. The chain rule and the linearity of the projection p0 imply that Dr HU(z) = p0 Dr U(z) for any z Vn. The first part of Lemma 1 guarantees then that HU is of class Cr(Vn) and that |||Dr HU(z)|||w1 = |||p0 Dr U(z)|||w1 |||p0|||w2 |||Dr U(z)|||w1,w2 = |||Dr U(z)|||w1,w2, for any z Vn, as required. The proof for the spaces ℓ (Rn) and ℓ (RN) is obtained by taking as sequence w the constant sequence wι given by wι t := 1, for all t N. Grigoryeva and Ortega (ii) First of all, notice that the hypothesis on c1 and the convexity of Vn imply via the mean value theorem Abraham et al. (1988) that H is Lipschitz. 
Moreover, the hypothesis on Lw,1 in the statement implies that condition (6.30) is satisfied and hence the third part in Proposition 37 guarantees that UH maps into ℓw2 (RN). Now, the expression (6.31) implies that for any z Vn, Dr UH(z) = Y t Z Dr H(T t(z)) (T t, . . . , T t) | {z } r times , r 1. (6.36) In order to prove (6.33) consider u1, . . . , ur ℓw1 (Rn) arbitrary and notice that by the second part of Lemma 1 we have Dr UH(z) u1, . . . , ur w2 = sup t Z Dr H(T t(z)) T t(u1), . . . , T t(ur) w2 t T t(u1) w1 T t(ur) w1 w2 t u1 w1 ur w1 L rt w1 w2 t cr Lw1,r w2 u1 w1 ur w1 , as required. We now show that UH is of class Cr 1(Vn). Let z1, z2 Vn arbitrary. Then, using a strategy similar to that one in the last inequality in the previous expression, we have Dr 1UH(z1) Dr 1UH(z2) w1,w2 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( Dr 1UH(z1) Dr 1UH(z2) u1, . . . , ur 1 w2 u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z Dr 1H(T t(z1)) Dr 1H(T t(z2)) T t(u1), . . . , T t(ur 1) w2 t u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z cr T t(z1) T t(z2) w1 T t(u1) w1 T t(ur 1) w1 w2 t u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z cr z1 z2 w1 u1 w1 ur 1 w1L rt w1 w2 t u1 w1 ur 1 w1 cr Lw1,r w2 z1 z2 w1 , which shows that the map Dr 1UH : (Vn, w1) Lr 1(ℓw1 (Rn), ℓw2 (RN)), ||| |||w1,w2 is Lipschitz continuous with Lipschitz constant cr Lw1,r w2. (iii) First, the condition cr < + for all r N+ implies by part (ii) that UH is smooth if H is. Suppose now that we work with the supremum norm. The expression (6.36) shows that the point z Vn belongs to the domain of convergence of the series expansion of UH if and only if all the points T t(z) belong to the domain of convergence of the series expansion of H. Finally, suppose that z Vn belongs to the domain of convergence of the series expansion of H. 
Since |||T t||| 1 for all t Z by Lemma 1, we have that T t(z) z , which guarantees that all the points T t(z) belong to the domain of convergence of the series expansion of H and hence, by the argument above, z Vn belongs to the domain of convergence of the series expansion of UH, which proves the statement. Differentiable reservoir computing Remark 40 An important consequence of part (ii) in this proposition and, in particular, of the condition (6.34) is that, in general, one cannot obtain (higher order) differentiable filters out of differentiable functionals using the same weighted norm in the domain and the target of the filter. The weighted norm in the target needs to be chosen so that it satisfies the nonautomatic condition (6.34) that, additionally, depends on the differentiability degree that we want to preserve. Weighted norms that satisfy that property are relatively easy to find in most cases. For example, if we take as w1 the geometric sequence in Remark 2, then Lw1,j = λ jt t N and hence condition (6.34) is satisfied if we take as w2 any sequence of the type w1 r (using the notation in Lemma 35) with r j. 6.5. Proof of Theorem 7 Consider the map F : (DN)Z ℓw (RN) Vn (DN)Z (x, z) 7 (F(x, z))t := F(xt 1, zt). (6.37) We now show that first, under the two sets of hypotheses in the statement, F actually maps into (DN)Z ℓw (RN) and second, that F is continuous. Suppose first that we are in the hypotheses in (i). Since DN is compact then (DN)Z ℓw (RN) and hence F obviously maps into (DN)Z ℓw (RN). Regarding the continuity, notice that F can be written as t Z Ft with Ft := F pt (T1 id Vn) : (DN)Z ℓw (RN) Vn DN. (6.38) The continuity of F, the fact that Lw is by hypothesis finite, and Lemma 1 imply that all the functions Ft : (DN)Z ℓw (RN) Vn ℓw (RN) ℓw (Rn) DN RN are continuous and moreover, they map into a compact subset of RN. 
An argument mimicking the proof of the first part of Lemma 36 allows us to conclude that F : (DN)Z ℓw (RN) Vn (DN)Z ℓw (RN) is a continuous map. Suppose now that we are in the hypotheses in part (ii). We now show that since F is Lipschitz then so are all the functions Ft := F pt (T1 id Vn), t Z , by Lemma 1, where we consider the direct sum of weighted spaces ℓw (RN) ℓw (Rn) as a Banach space with the sum norm w w defined by (x, z) w w := x w + z w, for any (x, z) ℓw (RN) ℓw (Rn). Indeed, let c F be the Lipschitz constant of F and let (x1, z1), (x2, z2) (DN)Z ℓw (RN) Vn, then: F pt (T1 id Vn) (x1, z1) F pt (T1 id Vn) (x2, z2) c F pt (T1 id Vn) (x1 x2, z1 z2) (T1 id Vn) (x1 x2, z1 z2) w c F w t (Lw x1 x2 w + z1 z2 w) w t Lw( x1 x2 w + z1 z2 w) = c F w t Lw (x1, z1) (x2, z2) w w . (6.39) This chain of inequalities show that Ft is a Lipschitz continuous function and that c F Lw/w t is a Lipschitz constant. Given that the sequence c F := (c F Lw/w t)t Z is such that c F w = c F Lw < + , the part (ii) of Lemma 36 in the Appendices guarantees that F is Lipschitz continuous and that c F Lw is a Lipschitz constant, that is, F(x1, z1) F(x2, z2) w c F Lw (x1, z1) (x2, z2) w w . (6.40) Moreover, let u0 := (x0, z0) (DN)Z ℓw (RN) Vn. The fact that u0 is a solution of the reservoir system implies that F(u0) = x0 (DN)Z ℓw (RN). An argument mimicking (6.20) in the proof of part (ii) in Lemma 36 proves that in those conditions F maps into (DN)Z ℓw (RN). Grigoryeva and Ortega We now show that in the presence of hypothesis (3.3) F is a contraction on the first entry with constant c Lw < 1. Indeed, for any x1, x2 (DN)Z ℓw (RN) and any z Vn, we have F(x1, z) F(x2, z) w = sup t Z F(x1 t 1, zt) F(x2 t 1, zt) w t sup t Z x1 t 1 x2 t 1 cw t , (6.41) where we used that F is a contraction on the first entry. Now, x1 t 1 x2 t 1 cw t = c sup t Z x1 t 1 x2 t 1 w (t 1) w t w (t 1) c Lw x1 x2 w . 
(6.42) This shows that F is a family of contractions with constant c Lw < 1 that is continuously parametrized by the elements in Vn. Since by hypothesis, the domain (DN)Z ℓw (RN) is complete, Theorem 6.4.1 in Sternberg (2010) implies the existence of a continuous map U F : (Vn, w) (DN)Z ℓw (RN), w that is uniquely determined by the identity F U F (z), z = U F (z), for all z Vn. (6.43) The causality and the time-invariance of U F are a consequence of the time invariance of Vn and of Proposition 2.1 in Grigoryeva and Ortega (2018a). We now assume that F is Lipschitz on the second component and prove (3.4). The relation (6.43) that defines U F is equivalent to U F (z)t = F(U F (z)t 1, zt), for all z Vn, t Z . Consequently, for any z1, z2 Vn, we have, U F (z1) U F (z2) w = sup t Z U F (z1)t U F (z2)t w t F(U F (z1)t 1, z1 t) F(U F (z2)t 1, z2 t) w t F(U F (z1)t 1, z1 t) F(U F (z1)t 1, z2 t) + F(U F (z1)t 1, z2 t) F(U F (z2)t 1, z2 t) w t Lz z1 t z2 t w t + c U F (z1)t 1 U F (z2)t 1 w t . If we repeat this procedure i times, it is easy to see that U F (z1) U F (z2) w j=0 cj z1 t j z2 t j w t + ci+1 sup t Z U F (z1)t (i+1) U F (z2)t (i+1) w t . (6.44) Differentiable reservoir computing We now study separately the two summands in the right hand side of the previous inequality. First, by Lemma 1, j=0 cj z1 t j z2 t j w t = Lz sup t Z j=0 cj Tj(z1) = Lz sup t Z j=0 cj Tj(z1 z2)t w t j=0 cj sup t Z Tj(z1 z2)t w t j=0 cj Tj(z1 z2) w Lz z1 z2 w j=0 cj|||Tj|||w j=0 (c Lw)j = Lz z1 z2 w 1 (c Lw)i+1 1 c Lw , (6.45) while the second summand can be bounded as follows ci+1 sup t Z U F (z1)t (i+1) U F (z2)t (i+1) w t = ci+1 sup t Z Ti+1 U F (z1) t Ti+1 U F (z2) = ci+1 Ti+1(U F (z1) (U F (z2)) w ci+1|||Ti+1|||w U F (z1) U F (z2) w (c Lw)i+1 U F (z1) U F (z2) w . 
(6.46) If we now chain the inequalities (6.45) and (6.46) with (6.44) we can conclude that (1 (c Lw)i+1) U F (z1) U F (z2) w Lz z1 z2 w 1 (c Lw)i+1 1 c Lw , (6.47) which after simplification using the condition (3.3) results in (3.4). Remark 41 A slight modification of the proof of Theorem 7 (ii) can be used to extend this statement to reservoir systems with inputs and outputs in ℓp,w (Rn) and ℓp,w (RN), respectively. Indeed, assume that we are under the hypotheses of Theorem 7 (ii) with those spaces instead of ℓw (Rn) and ℓw (RN). Suppose, additionally, that c L1/p w < 1. (6.48) Then, there exists a unique causal and time-invariant continuous reservoir filter U F : (Vn, p,w) ((DN)Z ℓp,w (RN), p,w). Additionally, U F is also Lipschitz with constant LU F := Lz 1 c L1/p w . The proof of this fact is carried out by showing that the map F in (6.38) is Lipschitz continuous when ℓp,w (Rn) and ℓp,w (RN) spaces are considered in its domain and target, respectively, with Lipschitz constant c F L1/p w and hence (6.40) holds in that situation. Indeed, for any (x1, z1), (x2, z2) (DN)Z Grigoryeva and Ortega ℓp,w (RN) Vn we can show using the statements in Remark 4 that F(x1, z1) F(x2, z2) p p,w = X Ft(x1 t 1, z1 t) Ft(x2 t 1, z2 1) p w t F(x1 t 1, z1 t) Ft(x2 t 1, z2 t) p w t cp F X x1 t 1 x2 t 1 p w t + cp F X z1 t z2 t p w t cp F T1(x1 x2) p p,w + cp F z1 z2 p p,w cp F (L1/p w )p x1 x2 p p,w + cp F z1 z2 p p,w cp F (L1/p w )p (x1, z1) (x2, z2) p p,w w , where in the last inequality we used that L1/p w > 1. We now show that F is a contraction on the first entry whenever condition (6.48) is satisfied. Indeed, F(x1, z) F(x2, z) p p,w = X F(x1 t 1, zt) Ft(x2 t 1, zt) p w t x1 t 1 x2 t 1 p w t = cp T1(x1 x2) p p,w cp(L1/p w )p x1 x2 p p,w . The rest of the proof can be obtained by mimicking the developments after (6.42). 6.6. 
Proof of Theorem 12 Consider the map F : (DN)Z (Dn)Z (DN)Z defined in (6.37) and endow (Dn)Z and (DN)Z with the relative topologies induced by the product topologies in (Rn)Z and (RN)Z , respectively. It is easy to see that the maps pt and T1 are continuous with respect to those product topologies and hence F can be written using (6.38) as a Cartesian product of continuous functions, which is always continuous in the product topology. Consider now any weighting sequence w such that c Lw < 1. Using an argument similar to the proof of Lemma 36 (i), we can conclude that (DN)Z ℓw (RN) and that the product topology on (DN)Z coincides with the norm topology induced by w. Now, following the expressions (6.41) and (6.42) it can be shown that F is a contraction on the first entry and with respect to w. In view of these facts and given that the product topology in (Dn)Z (Rn)Z is metrizable (see (Munkres, 2014, Theorem 20.5)) and that (DN)Z (RN)Z is compact by Tychonoff s Theorem (see (Munkres, 2014, Theorem 37.3)) in the product topology and hence complete, Theorem 6.4.1 in Sternberg (2010) implies the existence of a unique fixed point of F for each z (Dn)Z , which establishes the ESP. Moreover, that result also shows the continuity of the associated filter U F : (Dn)Z ((DN)Z , w). Finally, if (Dn)Z ℓw (Rn), we know from (Grigoryeva and Ortega, 2018b, Proposition 2.9) that the inclusion ℓw (Rn) , (Rn)Z is continuous and hence so is U F when in (Dn)Z we consider the topology generated by the norm w, which establishes the FMP in that situation. 6.7. Proof of Corollary 13 Under the hypothesis in part (i), the continuity of h implies that h (DN) is compact and hence there exists a constant R > 0 such that h (DN) B (0, R). The first part of Lemma 36 guarantees that the map H := Q t Z h : ((DN)Z , w) (KR, w) is continuous and as U F h = H U F and we proved that under the hypotheses (i) in the theorem that U F : (Vn, w) (KL, w) is continuous, the claim follows. 
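The contraction argument behind the proof of Theorem 12 can be visualized numerically. For a reservoir map of the form F(x, z) = tanh(Ax + Cz) with |||A||| < 1 (an illustrative choice, not the paper's construction; since tanh is 1-Lipschitz, F is a contraction on the state entry), any two state trajectories driven by the same input converge to each other, which is the uniqueness-of-solutions phenomenon delivered by the fixed-point theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, T = 20, 3, 200                       # illustrative sizes, not from the paper
A = rng.normal(size=(N, N))
A *= 0.5 / np.linalg.norm(A, 2)            # rescale so that |||A||| = 0.5 < 1
C = rng.normal(size=(N, n))
z = rng.uniform(-1, 1, size=(T, n))        # one fixed input sequence

def run(x0):
    """Iterate x_t = tanh(A x_{t-1} + C z_t); tanh is 1-Lipschitz, so the
    update is a contraction on the state with constant |||A||| = 0.5."""
    x = x0.copy()
    for t in range(T):
        x = np.tanh(A @ x + C @ z[t])
    return x

x_a = run(np.zeros(N))
x_b = run(10.0 * np.ones(N))
print(np.linalg.norm(x_a - x_b))           # initial conditions are forgotten
```

The gap between the two trajectories shrinks at least geometrically with factor |||A|||, so after a moderate number of steps the state depends only on the input history: this is the echo state property in action.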
We now prove the statement under the hypotheses in part (ii). First, we show that if h is Lipschitz continuous in DN with constant ch then so is the map H in (DN)Z ∩ ℓw(RN). Indeed, let x1, x2 ∈ (DN)Z ∩ ℓw(RN); then ∥H(x1) − H(x2)∥w = sup_{t∈Z−} {∥h(x1t) − h(x2t)∥ w−t} ≤ ch ∥x1 − x2∥w. The hypothesis UhF(z0) ∈ ℓw(Rd) amounts to the fact that the point UF(z0) ∈ (DN)Z ∩ ℓw(RN) is such that H(UF(z0)) ∈ ℓw(Rd). An argument mimicking (6.20) in the proof of part (ii) in Lemma 36 proves that in those conditions H maps into ℓw(Rd). 6.8. Proof of the statements in Section 3.1 Proof of statement (i) for linear reservoir maps. One can show by mimicking the proof of (Grigoryeva and Ortega, 2018a, Corollary 11) that whenever condition (3.6) is satisfied for a given weighting sequence w, the reservoir system determined by (3.5) has a unique associated reservoir filter UF : ℓw(Rn) → ℓw(RN) that is determined by the linear functional HF : ℓw(Rn) → RN given by HF(z) := Σ_{j=0}^∞ A^j c z_{−j}. This linear functional is bounded because for any z ∈ ℓw(Rn), the hypothesis (3.6) implies that ∥HF(z)∥ ≤ Σ_{j=0}^∞ |||A|||^j |||c||| ∥z_{−j}∥ = |||c||| Σ_{j=0}^∞ (|||A|||^j/wj) ∥z_{−j}∥ wj ≤ |||c||| ∥z∥w Σ_{j=0}^∞ |||A|||^j/wj < +∞. We now show that for any weighting sequence w that satisfies |||A|||Lw < 1, the condition (3.6) always holds. Indeed, using (2.10) we obtain Σ_{j=0}^∞ |||A|||^j/wj ≤ Σ_{j=0}^∞ |||A|||^j L^j_w = 1/(1 − |||A|||Lw) < +∞, as required. We finally show that there exist sequences w that satisfy (3.6) but not |||A|||Lw < 1, which is one more example of the fact, already indicated in Remark 11, that the FMP condition (3.3) is sufficient but not necessary. Let w be a harmonic weighting sequence as in Remark 2 given by wj := 1/(1 + jd), j ∈ N, with d > 0. In this case Lw = 1 + d, so we can choose a value of d such that |||A|||(1 + d) > 1. However, at the same time, the condition (3.6) holds in this case because Σ_{j=0}^∞ |||A|||^j(1 + jd) = Σ_{j=0}^∞ [|||A|||^j + d(j + 1)|||A|||^j − d|||A|||^j] = Σ_{j=0}^∞ [(1 − d)|||A|||^j + d(j + 1)|||A|||^j] = (1 − d)/(1 − |||A|||) + d/(1 − |||A|||)^2 = (1 + |||A|||(d − 1))/(1 − |||A|||)^2 < +∞.
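The series representation of the linear functional can be cross-checked against the recursion x_t = A x_{t−1} + c z_t itself. The sketch below (random A rescaled so that |||A||| < 1, scalar input channel, hypothetical truncation length) runs the recursion from a zero remote past and compares the resulting state with the truncated series Σ_j A^j c z_{−j}:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 60                               # hypothetical sizes and truncation
A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)            # |||A||| = 0.8 < 1, so the series is summable
c = rng.normal(size=(N, 1))
z = rng.uniform(-1, 1, size=T)             # z[j] plays the role of the input z_{-j}

# Truncated functional H_F(z) = sum_{j=0}^{T-1} A^j c z_{-j}
H = sum(np.linalg.matrix_power(A, j) @ c * z[j] for j in range(T))

# The same value obtained by running x_t = A x_{t-1} + c z_t from a zero remote past
x = np.zeros((N, 1))
for j in reversed(range(T)):
    x = A @ x + c * z[j]
print(np.linalg.norm(x - H))               # agreement up to round-off
```

Both computations produce the same truncated sum, and since |||A||| < 1 the neglected tail is bounded by a geometric series, mirroring the boundedness argument in the proof.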
Another example in this direction can be obtained by using nilpotent matrices. If A is nilpotent then (3.6) is always satisfied for any weighting sequence w. At the same time, there are nilpotent matrices with arbitrarily large norm |||A|||, which shows once more that (3.6), and hence the FMP, can hold without (3.7) being necessarily true. We notice too that reservoir systems determined by nilpotent matrices always satisfy the echo state property even though they are not necessarily contractions. Proof of statement (ii) for linear reservoir maps. We first prove the statement (3.8). For any x ∈ B∞(0, L) and z ∈ B∞(0, M), ∥F(x, z)∥ = ∥Ax + cz∥ ≤ |||A|||L + |||c|||M = L, as required. This implies that the reservoir map F in (3.5) restricts to a map FL,M : B∞(0, L) × B∞(0, M) → B∞(0, L) that is a contraction on the first entry with constant |||A||| < 1 and hence satisfies the hypotheses of Corollary 10. This guarantees the existence of a unique associated causal and time-invariant filter UF : KM → KL that has the fading memory property with respect to any weighting sequence w. Proof of the statements for state-affine systems. We prove only statement (i) since statement (ii) can be easily obtained by mimicking the similar statement for the linear case. Indeed, a straightforward generalization of (Grigoryeva and Ortega, 2018a, Proposition 14) shows that whenever Mp < 1 and Mq < +∞, the reservoir system determined by (3.11) has a unique associated reservoir filter UF : ℓw(Rn) ∩ (Dn)Z → ℓw(RN) that is determined by the functional HF : ℓw(Rn) ∩ (Dn)Z → RN given by HF(z) := Σ_{j=0}^∞ p(z0)p(z−1) · · · p(z−j+1) q(z−j). Mimicking the proof of (Grigoryeva and Ortega, 2018a, Proposition 16) it can be shown that there exists a constant Cp,q > 0 that depends exclusively on p and q such that for any z, s ∈ ℓw(Rn), ∥HF(z) − HF(s)∥ ≤ Cp,q Σ_{j=0}^∞ M^j_p ∥z−j − s−j∥ = Cp,q Σ_{j=0}^∞ (M^j_p/wj) ∥z−j − s−j∥ wj ≤ Cp,q (Σ_{j=0}^∞ M^j_p/wj) ∥z − s∥w, which shows that HF : ℓw(Rn) ∩ (Dn)Z → RN is Lipschitz continuous whenever the condition (3.12) holds.
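The nilpotent case is easy to illustrate: if A is a scaled shift matrix then A^N = 0, so the associated filter has exact finite memory N and satisfies the echo state property even though |||A||| is large. A minimal sketch (shift matrix scaled by 10, arbitrary input; all choices illustrative, not from the paper):

```python
import numpy as np

N = 5
A = 10.0 * np.diag(np.ones(N - 1), k=-1)   # nilpotent shift matrix: A^N = 0, |||A||| = 10
c = np.ones((N, 1))
rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=100)

# Run the recursion x_t = A x_{t-1} + c z_t over the whole input...
x = np.zeros((N, 1))
for zt in z:
    x = A @ x + c * zt

# ...and compare with the series truncated at N terms: since A^j = 0 for j >= N,
# the filter has exact finite memory despite |||A||| = 10 > 1.
H = sum(np.linalg.matrix_power(A, j) @ c * z[-1 - j] for j in range(N))
print(np.linalg.norm(x - H))               # only the last N inputs matter
```

The state after one hundred inputs coincides with the N-term series over the most recent inputs, so the initial condition is forgotten exactly: the ESP holds even though the map is far from being a contraction.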
The last claim regarding the relation between (3.12) and the FMP condition (3.3) is proved by mimicking the similar statement in the linear case.

6.9. Proof of Theorem 14

We start with a preliminary result whose proof mimics that of Lemma 36 and is also a consequence of Lemma 1. As we already did in the proof of Theorem 7, in the statement we consider the direct sum of weighted spaces $\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)$ as a Banach space with the sum norm $\|\cdot\|_{w \oplus w}$ defined by $\|(\mathbf{u}, \mathbf{v})\|_{w \oplus w} := \|\mathbf{u}\|_w + \|\mathbf{v}\|_w$, for any $(\mathbf{u}, \mathbf{v}) \in \ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)$. Additionally, in all that follows $V_n$ stands for an open convex subset of the Banach space $(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)$.

Lemma 42 In the hypotheses of the theorem, consider the map
\[
\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \longrightarrow (\mathbb{R}^N)^{\mathbb{Z}_-}, \qquad \mathcal{F}(\mathbf{x}, \mathbf{z})_t := F(x_{t-1}, z_t), \tag{6.49}
\]
where $V_n$ is an open convex subset of $\ell^\infty_w(\mathbb{R}^n)$. Then,

(i) $\mathcal{F}$ is Lipschitz continuous with constant $L_F L_w$ and maps into $\ell^\infty_w(\mathbb{R}^N)$.

(ii) If $F$ is of class $C^r(\mathbb{R}^N \times \mathbb{R}^n)$, $r \ge 1$, suppose that
\[
L_{F,r} := \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D^r F(\mathbf{x}, \mathbf{z})||| \} < +\infty, \tag{6.50}
\]
and let $w'$ be any weighting sequence such that
\[
c_{w', w^r} := \sup_{t \in \mathbb{Z}_-} \Big\{ \frac{w'_{-t}}{w^r_{-t}} \Big\} < +\infty. \tag{6.51}
\]
Then the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \subset \ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_{w'}(\mathbb{R}^N)$ is differentiable of order $r$ and of class $C^{r-1}(\ell^\infty_w(\mathbb{R}^N) \times V_n)$. Moreover,
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w'} \le L_{F,r} L_w^r c_{w', w^r}, \quad \text{for all } (\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n, \tag{6.52}
\]
and the map $D^{r-1} \mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to L^{r-1}\big(\ell^\infty_w(\mathbb{R}^n) \oplus \ell^\infty_w(\mathbb{R}^N), \ell^\infty_{w'}(\mathbb{R}^N)\big)$ is Lipschitz continuous with Lipschitz constant $L_{F,r} L_w^r c_{w', w^r}$.

(iii) The linear map $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \colon (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$ is a contraction with constant $L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w < 1$.

These results also hold when the spaces $(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)$ and $(\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$ are replaced by $(\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)$ and $(\ell^\infty(\mathbb{R}^N), \|\cdot\|_\infty)$, respectively. In that case, the statement is obtained by taking as the sequences $w$ and $w'$ the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$. The inequality (6.52) holds true with $L_w = c_{w', w^r} = 1$.

Proof of the lemma.
(i) Notice first that, as we pointed out in (6.38), and using the notation in Lemma 36, $\mathcal{F} = \prod_{t \in \mathbb{Z}_-} \mathcal{F}_t$, where
\[
\mathcal{F}_t := F \circ p_t \circ (T_1 \times \mathrm{id}_{V_n}) \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \longrightarrow \mathbb{R}^N. \tag{6.53}
\]
Also, the hypothesis (4.1), the mean value theorem, and the convexity of the set $p_t \circ (T_1 \times \mathrm{id}_{V_n})\big(\ell^\infty_w(\mathbb{R}^N) \times V_n\big)$ imply that $F$ is a Lipschitz function with constant $L_F$. A development identical to (6.39) guarantees that the maps $\mathcal{F}_t$ are Lipschitz and that $L_F L_w / w_{-t}$ is a Lipschitz constant for $\mathcal{F}_t$, $t \in \mathbb{Z}_-$. Given that the sequence $c^{\mathcal{F}} := (L_F L_w / w_{-t})_{t \in \mathbb{Z}_-}$ is such that $\|c^{\mathcal{F}}\|_w = L_F L_w < +\infty$ and $\mathcal{F} = \prod_{t \in \mathbb{Z}_-} \mathcal{F}_t$, part (ii) of Lemma 36 guarantees that $\mathcal{F}$ is Lipschitz continuous and that $L_F L_w$ is a Lipschitz constant for it. Since by hypothesis the reservoir system has a solution $(\mathbf{x}^0, \mathbf{z}^0) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$, we have that $\mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) = \mathbf{x}^0 \in \ell^\infty_w(\mathbb{R}^N)$. This implies that $\mathcal{F}$ maps into $\ell^\infty_w(\mathbb{R}^N)$, since the Lipschitz condition that we just proved shows that for any $(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$
\[
\|\mathcal{F}(\mathbf{x}, \mathbf{z})\|_w \le L_F L_w \|(\mathbf{x}, \mathbf{z}) - (\mathbf{x}^0, \mathbf{z}^0)\|_{w \oplus w} + \|\mathcal{F}(\mathbf{x}^0, \mathbf{z}^0)\|_w,
\]
which shows that $\|\mathcal{F}(\mathbf{x}, \mathbf{z})\|_w < +\infty$ and hence that $\mathcal{F}(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N)$.

(ii) The expression (6.53), the chain rule, the finiteness of $L_w$, and the linearity of $p_t$ and $T_1$ imply that for any $(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$:
\[
D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z}) = D^r F(x_{t-1}, z_t) \circ \big(p_t \circ (T_1 \times \mathrm{id}_{V_n}), \ldots, p_t \circ (T_1 \times \mathrm{id}_{V_n})\big) \colon \big(\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)\big)^r \longrightarrow \mathbb{R}^N. \tag{6.54}
\]
We now prove (6.52). Notice first that for $\mathbf{u} = (u^1, \ldots, u^r) = \big((u^1_x, u^1_z), \ldots, (u^r_x, u^r_z)\big) \in \big(\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)\big)^r$ we can write, using (4.1) and Lemma 1:
\begin{align*}
\big\|D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z}) \cdot \mathbf{u}\big\| &= \big\|D^r F(x_{t-1}, z_t)\big(p_t \circ (T_1 \times \mathrm{id}_{V_n})(u^1), \ldots, p_t \circ (T_1 \times \mathrm{id}_{V_n})(u^r)\big)\big\| \\
&\le |||D^r F(x_{t-1}, z_t)||| \, \frac{1}{w^r_{-t}} \big\|(T_1(u^1_x), u^1_z)\big\|_{w \oplus w} \cdots \big\|(T_1(u^r_x), u^r_z)\big\|_{w \oplus w} \\
&\le \frac{L_{F,r}}{w^r_{-t}} \big(\|T_1(u^1_x)\|_w + \|u^1_z\|_w\big) \cdots \big(\|T_1(u^r_x)\|_w + \|u^r_z\|_w\big) \\
&\le \frac{L_{F,r}}{w^r_{-t}} \big(L_w \|u^1_x\|_w + \|u^1_z\|_w\big) \cdots \big(L_w \|u^r_x\|_w + \|u^r_z\|_w\big) \le \frac{L_{F,r} L_w^r}{w^r_{-t}} \|u^1\|_{w \oplus w} \cdots \|u^r\|_{w \oplus w},
\end{align*}
which shows that
\[
|||D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z})|||_w \le \frac{L_{F,r} L_w^r}{w^r_{-t}}. \tag{6.55}
\]
Since, as we saw in part (i), $\mathcal{F}$ maps into $\ell^\infty_w(\mathbb{R}^N)$ and, by Lemma 35, $\ell^\infty_w(\mathbb{R}^N) \subset \ell^\infty_{w^r}(\mathbb{R}^N)$, the map $\mathcal{F}$ also maps into $\ell^\infty_{w^r}(\mathbb{R}^N)$.
Additionally, since the sequence $c^r := (L_{F,r} L_w^r / w^r_{-t})_{t \in \mathbb{Z}_-}$ is such that $\|c^r\|_{w^r} = L_{F,r} L_w^r < +\infty$, part (iii) of Lemma 36 guarantees that the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to \ell^\infty_{w^r}(\mathbb{R}^N)$ is differentiable of order $r$ and that
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w^r} \le L_{F,r} L_w^r < +\infty. \tag{6.56}
\]
This argument can be reproduced with the power sequence $w^r$ replaced by any other sequence $w'$ that satisfies (6.51), in which case it is easy to see that $\ell^\infty_{w^r}(\mathbb{R}^N) \subset \ell^\infty_{w'}(\mathbb{R}^N)$, and we can conclude the differentiability of the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to \ell^\infty_{w'}(\mathbb{R}^N)$, for which the relation (6.56) is replaced by
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w'} \le L_{F,r} L_w^r c_{w', w^r} < +\infty. \tag{6.57}
\]
The rest of the statement is a consequence of part (iii) of Lemma 36 applied in this setup.

(iii) A computation similar to the one that was used to establish (6.54) leads to the following expression for the partial derivatives $D_x \mathcal{F}$ of $\mathcal{F}$:
\[
D^r_x \mathcal{F}(\mathbf{x}, \mathbf{z}) = \prod_{t \in \mathbb{Z}_-} D^r_x \mathcal{F}_t(\mathbf{x}, \mathbf{z}) = \prod_{t \in \mathbb{Z}_-} D^r_x F(x_{t-1}, z_t) \circ (p_t \circ T_1, \ldots, p_t \circ T_1). \tag{6.58}
\]
Using this expression for $r = 1$ and Lemma 1 we can write, for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$,
\[
\|D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u}\|_w = \sup_{t \in \mathbb{Z}_-} \big\{ \|D_x F(x^0_{t-1}, z^0_t) \circ (p_t \circ T_1)(\mathbf{u})\| \, w_{-t} \big\} \le L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) \sup_{t \in \mathbb{Z}_-} \big\{ \|T_1(\mathbf{u})_t\| \, w_{-t} \big\} \le L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w \|\mathbf{u}\|_w,
\]
as required.

We now proceed with the proof of the theorem, in which we obtain the persistence result as a consequence of the Implicit Function Theorem and of the Lemma 42 that we just proved. Using the same notation as in that result, we define the map
\[
G \colon \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n) \longrightarrow \ell^\infty_w(\mathbb{R}^N), \qquad (\mathbf{x}, \mathbf{z}) \longmapsto \mathcal{F}(\mathbf{x}, \mathbf{z}) - \mathbf{x},
\]
or equivalently $G = \mathcal{F} - \pi_N$, where $\pi_N \colon \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_w(\mathbb{R}^N)$ is just the projection onto the first factor. Notice that by construction and the hypothesis on the point $(\mathbf{x}^0, \mathbf{z}^0)$ we have that
\[
G(\mathbf{x}^0, \mathbf{z}^0) = \mathbf{0}. \tag{6.59}
\]
Since the projection $\pi_N$ is linear and, by Lemma 42, $\mathcal{F}$ is Lipschitz continuous and differentiable of order 1, then so is $G = \mathcal{F} - \pi_N$. This implies in particular that the partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is a bounded operator, which we now set out to prove is an isomorphism.
We proceed in two stages that show how the hypotheses in the statement of the theorem imply that this linear map is both injective and surjective.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is injective. Notice first that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} - \mathbf{u}$, for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$. Consequently, the points $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ such that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = \mathbf{0}$ coincide with the fixed points of the map $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$. Since by part (iii) of Lemma 42 $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0)$ is a contracting linear map in $\ell^\infty_w(\mathbb{R}^N)$, it has zero as its unique fixed point, and the claim follows.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is surjective. We prove that for any $\mathbf{v} \in \ell^\infty_w(\mathbb{R}^N)$ there exists $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ such that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = \mathbf{v}$. By the definition of $\mathcal{F}$ in (6.49) and the expression of its partial derivative in (6.58), this equation is equivalent to the recursions
\[
v_t = D_x F(x^0_{t-1}, z^0_t) \cdot u_{t-1} - u_t, \quad \text{for all } t \in \mathbb{Z}_-. \tag{6.60}
\]
This equation has a unique solution given by the series
\[
u_t = -v_t + \sum_{j=1}^{\infty} D_x F(x^0_{t-1}, z^0_t) \circ D_x F(x^0_{t-2}, z^0_{t-1}) \circ \cdots \circ D_x F(x^0_{t-j}, z^0_{t-j+1})(-v_{t-j}), \quad t \in \mathbb{Z}_-. \tag{6.61}
\]
Indeed, it is straightforward to show that (6.61) satisfies (6.60). It remains then to be shown that the sequence $\mathbf{u}$ determined by (6.61) belongs to $\ell^\infty_w(\mathbb{R}^N)$. In order to do so we first show that the series in (6.61) is convergent by proving that for any $t \in \mathbb{Z}_-$, the sequence $\{S_n\}_{n \in \mathbb{N}^+}$ defined by
\[
S_n := \Big( \sum_{j=1}^{n} D_x F(x^0_{t-1}, z^0_t) \circ D_x F(x^0_{t-2}, z^0_{t-1}) \circ \cdots \circ D_x F(x^0_{t-j}, z^0_{t-j+1})(-v_{t-j}) \Big) w_{-t} \tag{6.62}
\]
is a Cauchy sequence. This is so because for any $m, n \in \mathbb{N}^+$, $m \ge n$,
\begin{align*}
\|S_m - S_n\| &\le \sum_{j=n+1}^{m} |||D_x F(x^0_{t-1}, z^0_t)||| \cdots |||D_x F(x^0_{t-j}, z^0_{t-j+1})||| \, \|v_{t-j}\| \, w_{-(t-j)} \frac{w_{-t}}{w_{-(t-j)}} \\
&\le \sum_{j=n+1}^{m} L_{F_x}(\mathbf{x}^0, \mathbf{z}^0)^j L_w^j \|\mathbf{v}\|_w = \frac{\big(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w\big)^{n+1} - \big(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w\big)^{m+1}}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w, \tag{6.63}
\end{align*}
which can be made as small as we want because, due to the hypothesis $L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w < 1$, the sequence $\{(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w)^j\}_{j \in \mathbb{N}^+}$ is convergent and hence Cauchy.
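The series solution of the recursion (6.60) can be verified numerically in a truncated setting. In the sketch below (illustrative, not code from the paper) random contracting matrices $A_t$ stand in for the derivatives $D_x F(x^0_{t-1}, z^0_t)$, and the truncated series is checked against the recursion $v_t = A_t u_{t-1} - u_t$.

```python
import numpy as np

# The series (6.61): u_t = -v_t + sum_{j>=1} A_t A_{t-1} ... A_{t-j+1} (-v_{t-j}),
# with A_t a stand-in for D_xF(x^0_{t-1}, z^0_t). Truncate time to t = -T..0 and
# check that the truncated series satisfies v_t = A_t u_{t-1} - u_t.
rng = np.random.default_rng(1)
T, N = 60, 3
A = [0.25 / np.sqrt(N) * rng.standard_normal((N, N)) for _ in range(T + 1)]  # contracting
v = [rng.standard_normal(N) for _ in range(T + 1)]      # list index k stands for t = k - T

def u_at(k, depth=40):
    acc, prod = -v[k].copy(), np.eye(N)
    for j in range(1, min(depth, k) + 1):
        prod = prod @ A[k - j + 1]                      # composition A_t ∘ ... ∘ A_{t-j+1}
        acc += prod @ (-v[k - j])
    return acc

t = T                                                   # check the recursion at the last index
print(np.allclose(v[t], A[t] @ u_at(t - 1) - u_at(t)))  # True, up to the tiny truncation error
```

The truncation error decays geometrically with the depth, exactly as the Cauchy estimate (6.63) predicts.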
This implies that $\{S_n\}_{n \in \mathbb{N}^+}$ is convergent and hence so is the series that defines $u_t$ in (6.61). It remains to be shown that the sequence $\mathbf{u} := (u_t)_{t \in \mathbb{Z}_-}$ defined by (6.61) is an element of $\ell^\infty_w(\mathbb{R}^N)$. Following the same strategy that we used to construct the inequalities (6.63), it is easy to see that
\[
\|u_t\| \, w_{-t} \le \frac{1}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w, \quad \text{for all } t \in \mathbb{Z}_-.
\]
Consequently,
\[
\|\mathbf{u}\|_w = \sup_{t \in \mathbb{Z}_-} \{ \|u_t\| \, w_{-t} \} \le \frac{1}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w < +\infty,
\]
as required.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is a linear homeomorphism. This fact is a consequence of the Banach Isomorphism Theorem (see for instance Abraham et al. (1988)), which states that any continuous linear isomorphism of Banach spaces necessarily has a continuous inverse.

Using all the facts that we just proved, we can invoke the Implicit Function Theorem as formulated in (Schechter, 1997, page 671) (see also Ver Eecke (1974)) to show the existence of two open neighborhoods $\widetilde{V}_{\mathbf{x}^0}$ and $\widetilde{V}_{\mathbf{z}^0}$ of $\mathbf{x}^0$ and $\mathbf{z}^0$ in $\ell^\infty_w(\mathbb{R}^N)$ and $\ell^\infty_w(\mathbb{R}^n)$, respectively, and a unique Lipschitz continuous map $\widetilde{U}^F \colon (\widetilde{V}_{\mathbf{z}^0}, \|\cdot\|_w) \to (\widetilde{V}_{\mathbf{x}^0}, \|\cdot\|_w)$ that is differentiable at $\mathbf{z}^0$ and satisfies
\[
G(\widetilde{U}^F(\mathbf{z}), \mathbf{z}) = \mathbf{0}, \quad \text{for all } \mathbf{z} \in \widetilde{V}_{\mathbf{z}^0},
\]
which is equivalent to $\mathcal{F}(\widetilde{U}^F(\mathbf{z}), \mathbf{z}) = \widetilde{U}^F(\mathbf{z})$. In view of the identities (3.1) this means, in other words, that $\widetilde{U}^F$ is the unique reservoir filter with inputs in $\widetilde{V}_{\mathbf{z}^0}$ associated to the reservoir system determined by $F$. This filter is clearly causal, and its Lipschitz continuity implies that it has the fading memory property.

We conclude the proof by showing that the filter $\widetilde{U}^F$ can be extended to a time-invariant filter $U^F$ defined on the time-invariant saturations $V_{\mathbf{x}^0}$ and $V_{\mathbf{z}^0}$ of the sets $\widetilde{V}_{\mathbf{x}^0}$ and $\widetilde{V}_{\mathbf{z}^0}$, respectively, that has the properties listed in the statement. Indeed, define
\[
V_{\mathbf{x}^0} := \bigcup_{t \in \mathbb{Z}_-} T_{-t}\big(\widetilde{V}_{\mathbf{x}^0}\big) \quad \text{and} \quad V_{\mathbf{z}^0} := \bigcup_{t \in \mathbb{Z}_-} T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big).
\]
The sets $V_{\mathbf{x}^0}$ and $V_{\mathbf{z}^0}$ are by construction time-invariant, and open by the openness of the maps $T_{-t}$ that we established in part (ii) of Lemma 1.
Define now the map $U^F \colon V_{\mathbf{z}^0} \to V_{\mathbf{x}^0}$ as
\[
U^F(T_{-t}(\mathbf{z})) := T_{-t}\big(\widetilde{U}^F(\mathbf{z})\big), \quad \text{for some } t \in \mathbb{Z}_- \text{ and } \mathbf{z} \in \widetilde{V}_{\mathbf{z}^0}. \tag{6.64}
\]
We first show that $U^F$ is well-defined and time-invariant. Let $t_1, t_2 \in \mathbb{Z}_-$ and $\mathbf{z}_1, \mathbf{z}_2 \in \widetilde{V}_{\mathbf{z}^0}$ be such that $T_{-t_1}(\mathbf{z}_1) = T_{-t_2}(\mathbf{z}_2)$. Let us now show that
\[
U^F(T_{-t_1}(\mathbf{z}_1)) = U^F(T_{-t_2}(\mathbf{z}_2)). \tag{6.65}
\]
Indeed, for any $t \in \mathbb{Z}_-$, the definition (6.64) and the causality of $\widetilde{U}^F$ imply that
\[
U^F(T_{-t_1}(\mathbf{z}_1))_t = T_{-t_1}\big(\widetilde{U}^F(\mathbf{z}_1)\big)_t = \widetilde{U}^F(\mathbf{z}_1)_{t+t_1} = \widetilde{U}^F(\mathbf{z}_2)_{t+t_2} = T_{-t_2}\big(\widetilde{U}^F(\mathbf{z}_2)\big)_t = U^F(T_{-t_2}(\mathbf{z}_2))_t,
\]
which proves (6.65). The time-invariance of $U^F$, as defined in (6.64), is straightforward.

We conclude by showing that $U^F$ is differentiable at all the points of the form $T_{-t}(\mathbf{z}^0)$, $t \in \mathbb{Z}_-$, and that it is locally Lipschitz continuous on $V_{\mathbf{z}^0}$. Since differentiability is a local property, it suffices to prove this property for the restriction of $U^F$ to open sets. Before we do that, we note that since by part (ii) of Lemma 1 the map $T_{-t} \colon \widetilde{V}_{\mathbf{z}^0} \to T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ is a submersion, the Local Onto Theorem (see (Abraham et al., 1988, Theorem 3.5.2)) guarantees that for $\mathbf{z}' := T_{-t}(\mathbf{z}^0) \in T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ there exist an open neighborhood $V_{\mathbf{z}'} \subset T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ and a smooth section $\sigma_{\mathbf{z}'} \colon V_{\mathbf{z}'} \to \widetilde{V}_{\mathbf{z}^0}$ of $T_{-t}$ that satisfies $\sigma_{\mathbf{z}'}(\mathbf{z}') = \mathbf{z}^0$ and
\[
T_{-t} \circ \sigma_{\mathbf{z}'} = \mathrm{id}_{V_{\mathbf{z}'}}. \tag{6.66}
\]
The section $\sigma_{\mathbf{z}'}$ allows us to write down the restriction $U^F|_{V_{\mathbf{z}'}}$ of $U^F$ to the open subset $V_{\mathbf{z}'}$ as
\[
U^F|_{V_{\mathbf{z}'}}(\mathbf{z}) = T_{-t}\big(\widetilde{U}^F(\sigma_{\mathbf{z}'}(\mathbf{z}))\big), \quad \text{for all } \mathbf{z} \in V_{\mathbf{z}'}. \tag{6.67}
\]
This is so because by (6.66) we have that $\mathbf{z} = T_{-t}(\sigma_{\mathbf{z}'}(\mathbf{z}))$, with $\sigma_{\mathbf{z}'}(\mathbf{z}) \in \widetilde{V}_{\mathbf{z}^0}$, as well as by (6.64). Consequently, since by (6.67) the restriction $U^F|_{V_{\mathbf{z}'}}$ is a composition of Lipschitz continuous functions, it is itself Lipschitz continuous. The differentiability of $U^F$ at the point $\mathbf{z}' = T_{-t}(\mathbf{z}^0)$ can also be concluded using (6.67) by invoking the differentiability of $T_{-t}$ and $\sigma_{\mathbf{z}'}$ on their domains and the differentiability of $\widetilde{U}^F$ at $\sigma_{\mathbf{z}'}(\mathbf{z}') = \mathbf{z}^0$.

6.10. Proof of Theorem 19

(i) We start with a lemma that shows how condition (4.8) guarantees the existence of a globally defined filter $U^F \colon (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$.
Lemma 43 Let $F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$. The reservoir map $F$ is a contraction on the first entry if and only if
\[
L_{F_x} < 1. \tag{6.68}
\]
Moreover, whenever conditions (4.1) and (4.8) are satisfied and $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is a solution of the reservoir system determined by $F$, then there exists a unique causal, time-invariant, and fading memory filter $U^F \colon (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$.

Proof of the lemma. We first show that $F$ is a contraction on the first entry if and only if $L_{F_x} < 1$. Suppose first that $F$ is a contraction with contraction rate $0 < c < 1$. Then for any $(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n$ and any $\mathbf{u} \in \mathbb{R}^N$, the partial derivative $D_x F(\mathbf{x}, \mathbf{z}) \colon \mathbb{R}^N \to \mathbb{R}^N$ satisfies
\[
\|D_x F(\mathbf{x}, \mathbf{z}) \cdot \mathbf{u}\| = \lim_{t \to 0} \frac{\|F(\mathbf{x} + t\mathbf{u}, \mathbf{z}) - F(\mathbf{x}, \mathbf{z})\|}{|t|} \le \lim_{t \to 0} \frac{c \, |t| \, \|\mathbf{u}\|}{|t|} = c \, \|\mathbf{u}\|,
\]
which implies that $|||D_x F(\mathbf{x}, \mathbf{z})||| \le c$ and hence
\[
L_{F_x} := \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D_x F(\mathbf{x}, \mathbf{z})||| \} \le c < 1.
\]
Conversely, suppose that $L_{F_x} < 1$. Since $F$ is of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$, the mean value theorem guarantees that for any $(\mathbf{x}_1, \mathbf{z}), (\mathbf{x}_2, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n$:
\[
\|F(\mathbf{x}_1, \mathbf{z}) - F(\mathbf{x}_2, \mathbf{z})\| \le \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D_x F(\mathbf{x}, \mathbf{z})||| \} \, \|\mathbf{x}_1 - \mathbf{x}_2\| = L_{F_x} \|\mathbf{x}_1 - \mathbf{x}_2\|,
\]
and $F$ is hence a contraction on the first entry.

Suppose now that conditions (4.1) and (4.8) are satisfied and that $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is a solution of the reservoir system determined by $F$. Notice first that since $L_w \ge 1$, condition (4.8) necessarily implies that $L_{F_x} < 1$ and hence, as we just proved, $F$ is a contraction on the first entry with constant $L_{F_x}$. Additionally, as (4.1) is satisfied, the mean value theorem implies that $F$ is Lipschitz continuous with constant $L_F$. All these facts allow us to invoke part (ii) of Theorem 7 to conclude the existence of the filter $U^F$ in the statement since, in this situation, condition (3.3) coincides with (4.8).
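The criterion (6.68) is easy to check empirically for the standard echo state network reservoir map. In the sketch below (illustrative, not code from the paper; the matrices and sizes are arbitrary choices) we use $F(\mathbf{x}, \mathbf{z}) = \tanh(A\mathbf{x} + C\mathbf{z})$, for which $D_x F(\mathbf{x}, \mathbf{z}) = \mathrm{diag}(1 - \tanh^2(A\mathbf{x} + C\mathbf{z}))\, A$ and hence $L_{F_x} \le |||A|||$, so rescaling $A$ below unit norm makes $F$ a contraction on the first entry.

```python
import numpy as np

# ESN-style reservoir map F(x, z) = tanh(A x + C z): since tanh is 1-Lipschitz
# componentwise, ||F(x1, z) - F(x2, z)|| <= ||A (x1 - x2)|| <= |||A||| ||x1 - x2||.
rng = np.random.default_rng(2)
N, n = 10, 2
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)              # rescale so that |||A||| = 0.9 < 1
C = rng.standard_normal((N, n))

def F(x, z):
    return np.tanh(A @ x + C @ z)

# Empirical contraction check on the first entry over random samples:
ratios = []
for _ in range(200):
    x1, x2, z = rng.standard_normal(N), rng.standard_normal(N), rng.standard_normal(n)
    ratios.append(np.linalg.norm(F(x1, z) - F(x2, z)) / np.linalg.norm(x1 - x2))
print(max(ratios) <= 0.9 + 1e-9)             # True: contraction constant bounded by |||A|||
```

This is the usual "spectral rescaling" recipe for enforcing the echo state property in ESN practice, here justified through the derivative bound $L_{F_x} \le |||A|||$ of the lemma.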
The proof of the first part of the theorem can now be obtained by applying Theorem 14 to each point of the form $(U^F(\mathbf{z}), \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n)$, for which, according to its statement, there exist open neighborhoods $V_{U^F(\mathbf{z})}$ and $V_{\mathbf{z}}$ of $U^F(\mathbf{z})$ and $\mathbf{z}$ in $\ell^\infty_w(\mathbb{R}^N)$ and $\ell^\infty_w(\mathbb{R}^n)$, as well as a unique locally defined causal reservoir filter $\widetilde{U}^F \colon V_{\mathbf{z}} \to V_{U^F(\mathbf{z})}$ associated to $F$. The uniqueness feature implies that $\widetilde{U}^F = U^F|_{V_{\mathbf{z}}}$. Moreover, since $\widetilde{U}^F$ is differentiable at $\mathbf{z}$ and we can repeat this construction for any point $\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)$, we can conclude that $U^F$ is differentiable at any point in $\ell^\infty_w(\mathbb{R}^n)$. Finally, the Lipschitz continuity on $\ell^\infty_w(\mathbb{R}^n)$ of $U^F$ is a consequence of the mean value theorem, the inequality (4.5), and the fact that
\[
\sup_{\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)} |||DU^F(\mathbf{z})|||_w \le \sup_{\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)} \frac{L_{F_z}(U^F(\mathbf{z}), \mathbf{z})}{1 - L_{F_x}(U^F(\mathbf{z}), \mathbf{z}) L_w} \le \frac{L_{F_z}}{1 - L_{F_x} L_w},
\]
which proves (4.9).

(ii) First of all, the existence of the filter $U^F \colon V_n \to \ell^\infty_w(\mathbb{R}^N)$ and its differentiability at $\mathbf{z}^0 \in V_n$ imply that for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^n)$ and $t \in \mathbb{Z}_-$ it satisfies (3.1) as well as (4.4), that is,
\[
\big(DU^F(\mathbf{z}^0) \cdot \mathbf{u}\big)_t = D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \cdot \big(DU^F(\mathbf{z}^0) \cdot \mathbf{u}\big)_{t-1} + D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \cdot u_t.
\]
This identity can be rewritten in terms of operators on sequences as
\[
DU^F(\mathbf{z}^0) = \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \circ DU^F(\mathbf{z}^0) + \prod_{t \in \mathbb{Z}_-} D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big),
\]
or equivalently as
\[
\Big( \mathrm{id} - \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big) \circ DU^F(\mathbf{z}^0) = \prod_{t \in \mathbb{Z}_-} D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big). \tag{6.69}
\]
This identity determines $DU^F(\mathbf{z}^0)$, which by hypothesis exists, if and only if the operator on the left hand side is invertible, which is in turn equivalent to condition (4.10). We finally show that (4.10) implies (4.11). We first notice that by Gelfand's formula (Lax, 2002, page 195) condition (4.10) is equivalent to
\[
\lim_{k \to \infty} \Big|\Big|\Big| \Big( \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big)^k \Big|\Big|\Big|_w^{1/k} < 1.
\]
This in turn implies that for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ we have that
\[
\lim_{k \to \infty} \Big( \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big)^k (\mathbf{u}) = \mathbf{0},
\]
or, equivalently, that
\[
\lim_{k \to \infty} \prod_{t \in \mathbb{Z}_-} \Big( D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{t-k}, z^0_{t-k+1}\big)(u_{t-k}) \Big) = \mathbf{0}. \tag{6.70}
\]
If we now take vectors $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ in (6.70) of the form $u_t := e_u / w_{-t}$, $t \in \mathbb{Z}_-$, with $e_u \in \mathbb{R}^N$ such that $\|e_u\| = 1$, and we take the supremum in (6.70) with respect to all those vectors $e_u$, we obtain that, for every $t \in \mathbb{Z}_-$,
\[
\lim_{k \to \infty} \Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{t-k}, z^0_{t-k+1}\big) \Big|\Big|\Big| \, \frac{w_{-t}}{w_{-(t-k)}} = 0. \tag{6.71}
\]
Since for $t = 0$ the weight ratio satisfies $w_0 / w_k \ge 1$ (the weighting sequence is non-increasing), we have
\[
\Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{-1}, z^0_0\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{-k}, z^0_{-k+1}\big) \Big|\Big|\Big| \le \Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{-1}, z^0_0\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{-k}, z^0_{-k+1}\big) \Big|\Big|\Big| \, \frac{w_0}{w_k},
\]
and the condition (4.11) follows by taking $t = 0$ in (6.71).

6.11. Proof of Theorem 28

It follows the same scheme as that of Theorem 14. In the following paragraphs we just point out the additional facts that need to be taken into account in order to adapt that proof to this setup. The first complementary fact has to do with the second part of Lemma 42 which, using the hypothesis (4.29), allows us to conclude that the map $\mathcal{F} \colon \ell^\infty(\mathbb{R}^N) \times \ell^\infty(\mathbb{R}^n) \to \ell^\infty(\mathbb{R}^N)$ defined in (6.38) is smooth. Additionally, it can be easily seen that it is also analytic and that the radii of convergence $\rho_F$ and $\rho_{\mathcal{F}}$ of the Taylor series expansions of $F$ and $\mathcal{F}$ around $(\mathbf{x}^0, \mathbf{z}^0)$ and the associated constant sequence (that we denote with the same symbol) satisfy
\[
\rho_{\mathcal{F}} \ge \rho_F. \tag{6.72}
\]
Indeed, (6.54) implies that the Taylor series expansion of $\mathcal{F}$ around the constant sequence $(\mathbf{x}^0, \mathbf{z}^0)$ can be written, for any $\mathbf{u}^r := (\mathbf{u}, \ldots, \mathbf{u}) = ((\mathbf{u}_x, \mathbf{u}_z), \ldots, (\mathbf{u}_x, \mathbf{u}_z)) \in (\ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n))^r$, as
\[
\sum_{r=0}^{\infty} \frac{1}{r!} D^r \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r = \prod_{t \in \mathbb{Z}_-} \Big( \mathcal{F}_t(\mathbf{x}^0, \mathbf{z}^0) + \sum_{r=1}^{\infty} \frac{1}{r!} D^r \mathcal{F}_t(\mathbf{x}^0, \mathbf{z}^0) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r \Big)
= \prod_{t \in \mathbb{Z}_-} \Big( F(x^0_{t-1}, z^0_t) + \sum_{r=1}^{\infty} \frac{1}{r!} D^r F(x^0_{t-1}, z^0_t) \circ \big(p_t \circ (T_1 \times \mathrm{id}), \ldots, p_t \circ (T_1 \times \mathrm{id})\big) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r \Big). \tag{6.73}
\]
Suppose now that $\mathbf{u} = (\mathbf{u}_x, \mathbf{u}_z) \in \ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n)$ is chosen such that
\[
\|\mathbf{u}\| = \|\mathbf{u}_x\|_\infty + \|\mathbf{u}_z\|_\infty < \rho_F. \tag{6.74}
\]
Lemma 1 implies that for any $t \in \mathbb{Z}_-$ we have in that case that $\|p_t \circ (T_1 \times \mathrm{id})(\mathbf{u})\| \le \|\mathbf{u}\| < \rho_F$, and hence we can conclude that all the series labeled by $t \in \mathbb{Z}_-$ in each of the factors that make up the last term of (6.73) converge for all the elements $\mathbf{u} \in \ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n)$ that satisfy (6.74).
This implies that such elements are inside the radius of convergence of the Taylor series expansion of $\mathcal{F}$ around the constant sequence $(\mathbf{x}^0, \mathbf{z}^0)$, and hence (6.72) holds, which, as $\rho_F$ is nontrivial by hypothesis, proves that $\mathcal{F}$ is analytic. The rest of the proof can be obtained by mimicking that of Theorem 14 where, as is customary, we replace the weighting sequence $w$ by the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$, and $L_w$ is replaced by the constant 1. A technical modification is needed at the time of invoking the Implicit Function Theorem. In Theorem 14 we used a version that requires only first order differentiability as hypothesis and produces Lipschitz continuous implicitly defined functions. In this case we can prove that the function $G$ is analytic, and hence it can be shown that the implicitly defined local filter $\widetilde{U}^F \colon (\widetilde{V}_{\mathbf{z}^0}, \|\cdot\|_\infty) \to (\widetilde{V}_{\mathbf{x}^0}, \|\cdot\|_\infty)$ is analytic by invoking, for instance, (Valent, 1988, page 175), and references therein.

6.12. Proof of Theorem 29

Since by hypothesis $U$ is analytic in $B_{\|\cdot\|_w}(\mathbf{z}^0, M)$, then
\[
U(\mathbf{z}) = U(\mathbf{z}^0) + \sum_{j=1}^{\infty} \frac{1}{j!} D^j U(\mathbf{z}^0)(\underbrace{\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0}_{j \text{ times}}), \quad \text{for any } \mathbf{z} \in B_{\|\cdot\|_w}(\mathbf{z}^0, M). \tag{6.75}
\]
We now show that for the elements that satisfy (5.1) the series expansion (6.75) amounts to the discrete-time Volterra series expansion (5.2). Let $m \in \mathbb{Z}_-$ and let $\delta_m \in \ell^\infty_w(\mathbb{R})$ be the sequence defined by
\[
(\delta_m)_t := \begin{cases} \dfrac{1}{w_{-m}} & \text{if } t = m, \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{6.76}
\]
Note that $\|\delta_m\|_w = 1$ for all $m \in \mathbb{Z}_-$. Moreover, for any $\mathbf{z} \in \ell^\infty_w(\mathbb{R})$ we can write $\mathbf{z} - \mathbf{z}^0 = \sum_{t \in \mathbb{Z}_-} \widetilde{z}_t \delta_t$, with $\widetilde{z}_t = (z_t - z^0_t) w_{-t}$, and hence by the multilinearity of the derivatives $D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)$ and the causality of the filter $U$ we have that
\[
D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t = \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} \widetilde{z}_{m_1} \cdots \widetilde{z}_{m_j} \, D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t, \quad \text{for all } t \in \mathbb{Z}_-. \tag{6.77}
\]
We first show that for the elements that satisfy (5.1) the sum on the right hand side of (6.77) is finite. Indeed, for any $t \in \mathbb{Z}_-$:
\begin{align*}
\sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}| \, \big| D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t \big|
&\le \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}| \, \frac{1}{w_{-t}} \big\| D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j}) \big\|_w \\
&\le \frac{|||D^j U(\mathbf{z}^0)|||_w}{w_{-t}} \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}|
= \frac{|||D^j U(\mathbf{z}^0)|||_w}{w_{-t}} \Big( \sum_{m = -\infty}^{t} |z_m - z^0_m| \, w_{-m} \Big)^j < +\infty, \tag{6.78}
\end{align*}
where the last equality is a consequence of, for example, (Apostol, 1974, Theorem 8.44), and the last inequality follows from two facts. First, as $U$ is analytic it is in particular smooth, and hence $|||D^j U(\mathbf{z}^0)|||_w < +\infty$ for all $j \in \mathbb{N}$. Second, since by hypothesis $\mathbf{z}, \mathbf{z}^0 \in \ell^{1,w}(\mathbb{R}^n)$, then $\mathbf{z} - \mathbf{z}^0 \in \ell^{1,w}(\mathbb{R}^n)$ and hence $\sum_{m = -\infty}^{t} |z_m - z^0_m| \, w_{-m} < +\infty$.

We now show that (6.75) can be rewritten as (5.2). Notice first that for any $t, m \in \mathbb{Z}_-$ such that $m \le t$, the sequences (6.76) satisfy
\[
T_{-t}(\delta_m) = \frac{w_{-(m-t)}}{w_{-m}} \, \delta_{m-t}. \tag{6.79}
\]
Second, the time-invariance of $U$ and of the sequence $\mathbf{z}^0$ imply that for any $j \in \mathbb{N}^+$, $t \in \mathbb{Z}_-$, and $\mathbf{z}^1, \ldots, \mathbf{z}^j \in \ell^\infty_w(\mathbb{R})$, we have
\[
T_{-t}\big( D^j U(\mathbf{z}^0)(\mathbf{z}^1, \ldots, \mathbf{z}^j) \big) = D^j U\big(T_{-t}(\mathbf{z}^0)\big)\big(T_{-t}(\mathbf{z}^1), \ldots, T_{-t}(\mathbf{z}^j)\big) = D^j U(\mathbf{z}^0)\big(T_{-t}(\mathbf{z}^1), \ldots, T_{-t}(\mathbf{z}^j)\big).
\]
These two relations imply that for any $t \in \mathbb{Z}_-$
\[
D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t = T_{-t}\big( D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j}) \big)_0 = D^j U(\mathbf{z}^0)\big(T_{-t}(\delta_{m_1}), \ldots, T_{-t}(\delta_{m_j})\big)_0
= D^j U(\mathbf{z}^0)(\delta_{m_1 - t}, \ldots, \delta_{m_j - t})_0 \, \frac{w_{-(m_1 - t)}}{w_{-m_1}} \cdots \frac{w_{-(m_j - t)}}{w_{-m_j}}.
\]
If we substitute this relation in the summands of (6.77), we obtain that
\[
\widetilde{z}_{m_1} \cdots \widetilde{z}_{m_j} \, D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t = (z_{m_1} - z^0_{m_1}) \cdots (z_{m_j} - z^0_{m_j}) \, w_{-(m_1 - t)} \cdots w_{-(m_j - t)} \, D^j U(\mathbf{z}^0)(\delta_{m_1 - t}, \ldots, \delta_{m_j - t})_0. \tag{6.80}
\]
Define now
\[
g_j(n_1, \ldots, n_j) := w_{-n_1} \cdots w_{-n_j} \, \frac{1}{j!} D^j U(\mathbf{z}^0)(\delta_{n_1}, \ldots, \delta_{n_j})_0 = w_{-n_1} \cdots w_{-n_j} \, \frac{1}{j!} D^j H(\mathbf{z}^0)(\delta_{n_1}, \ldots, \delta_{n_j}) = \frac{1}{j!} D^j H(\mathbf{z}^0)(e_{n_1}, \ldots, e_{n_j}), \tag{6.81}
\]
where $e_m \in \ell^\infty_w(\mathbb{R})$ is the sequence defined in (5.3). If we make the change of variables $n_i := m_i - t$ in (6.80), use (6.81), and insert the resulting expression in (6.77) and subsequently in (6.75), we obtain (5.2). The uniqueness of this series expansion follows from the same argument as in (Sandberg, 1999, Theorem 1). We now prove the error estimates (5.4) with the same strategy as in Sandberg (1999).
Using the Cauchy bounds for analytic functions (see, for instance, the last expression in (Hille and Phillips, 1957, page 112)) and the analyticity hypothesis on $U \colon B_{\|\cdot\|_w}(\mathbf{z}^0, M) \subset \ell^\infty_w(\mathbb{R}) \to B_{\|\cdot\|_w}(U(\mathbf{z}^0), L) \subset \ell^\infty_w(\mathbb{R}^N)$, we have that for any $j \in \mathbb{N}^+$ and $t \in \mathbb{Z}_-$
\[
\big| D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z})_t \big| = \big| p_t\big( D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z}) \big) \big| \le |||p_t|||_w \, \big\| D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z}) \big\|_w \le \frac{j! \, L}{w_{-t} \, M^j} \|\mathbf{z}\|_w^j, \tag{6.82}
\]
where we also used the first part of Lemma 1. Now, as we saw in the previous paragraphs,
\[
\Big| U(\mathbf{z})_t - U(\mathbf{z}^0)_t - \sum_{j=1}^{p} \sum_{m_1 = -\infty}^{0} \cdots \sum_{m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1 + t} - z^0_{m_1 + t}) \cdots (z_{m_j + t} - z^0_{m_j + t}) \Big|
= \Big| \sum_{j=p+1}^{\infty} \frac{1}{j!} D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t \Big| \le \sum_{j=p+1}^{\infty} \frac{1}{j!} \big| D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t \big|, \tag{6.83}
\]
where the inequalities in the last line follow from (6.82), which bounds each summand and hence yields the geometric-type tail estimate (5.4).

6.13. Time invariance of the solutions of a reservoir system

The filters studied in this paper are those determined by reservoir systems of the type introduced in (1.1)-(1.2). As we already pointed out, in that case we can associate unique reservoir filters $U^F$ and $U^F_h$ to the reservoir map $F$ and the reservoir system, respectively, whenever (1.1) satisfies the echo state property. In that case, it has been shown in (Grigoryeva and Ortega, 2018b, Proposition 2.1) that both $U^F$ and $U^F_h$ are necessarily causal and time-invariant. We complement this fact with a similar elementary statement that does not require the echo state property or the existence of reservoir filters.

Lemma 44 Let $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ be a solution of the reservoir system determined by the map $F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$. Then, for any $\tau \in \mathbb{Z}_-$, the pair $(T_\tau(\mathbf{x}^0), T_\tau(\mathbf{z}^0)) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is also a solution.

Proof. By hypothesis, for any $t \in \mathbb{Z}_-$ we have that $F(x^0_{t-1}, z^0_t) = x^0_t$. Consequently,
\[
F\big(T_\tau(\mathbf{x}^0)_{t-1}, T_\tau(\mathbf{z}^0)_t\big) = F(x^0_{t-\tau-1}, z^0_{t-\tau}) = x^0_{t-\tau} = T_\tau(\mathbf{x}^0)_t,
\]
as required.

Acknowledgments

We thank Lukas Gonon and Herbert Jaeger for fruitful discussions, as well as the input of the two anonymous referees, which helped in improving the paper. The authors acknowledge partial financial support of the French ANR "BIPHOPROC" project (ANR-14-OHRI-0002-02).
LG acknowledges partial financial support of the Graduate School of Decision Sciences of the Universität Konstanz. JPO acknowledges partial financial support coming from the Research Commission of the Universität Sankt Gallen and the Swiss National Science Foundation (grant number 200021 175801/1).

Glossary of Symbols

$\ell^\infty(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite supremum norm
$\ell^{p,w}(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite $(p, w)$-norm
$\ell^p(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite $p$-norm
$\ell^\infty_w(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite weighted supremum norm
$\mathbb{M}_N$: Space of real square matrices of size $N$
$\mathcal{F}^F$: Reservoir flow associated to the reservoir map $F$
$\prod_{t \in \mathbb{Z}_-} A_t$: Cartesian product of the sets $A_t$
$\prod_{t \in \mathbb{Z}_-} f_t$: Cartesian product of the functions $f_t$
$\rho(A)$: Spectral radius of the matrix $A$
$\sigma$: Activation function (in ESNs, for example)
$c$: Contraction constant on the first entry of the reservoir map
$d$: Dimension of the elements of the output signal
$D^r f(\mathbf{z})$: $r$-order Fréchet differential of the map $f$ at the point $\mathbf{z}$
$D_w$: Decay ratio of the weighting sequence $w$
$D_x f(\mathbf{x}, \mathbf{z})$: Partial derivative of the map $f$ with respect to the first entry at the point $(\mathbf{x}, \mathbf{z})$
$Df(\mathbf{z})$: Fréchet differential of the map $f$ at the point $\mathbf{z}$
$F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$: Reservoir map
$H_U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^d$: Functional associated to the causal and time-invariant filter $U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$
$h \colon \mathbb{R}^N \to \mathbb{R}^d$: Generic readout map
$K_M$: Space of semi-infinite sequences that are uniformly bounded by $M$
$L_\sigma$: Lipschitz constant of the activation function $\sigma$
$L_F$: Lipschitz constant of the reservoir map $F$
$L_w$: Inverse decay ratio of the weighting sequence $w$
$L_{F_x}$: Lipschitz constant on the first entry of the reservoir map $F$
$L_{U^F}$: Lipschitz constant of the reservoir filter $U^F$
$N$: Number of virtual neurons; dimension of the reservoir state vectors
$n$: Dimension of the elements of the input signal
$p_t \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^n$: Projection onto the $t$th entry
$T_\tau \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$: Time delay operator defined on semi-infinite sequences
$T^{\mathbb{Z}}_\tau \colon (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}}$: Time delay operator defined on two-sided infinite sequences
$U^F_h \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Reservoir filter determined by the reservoir map $F$ and the readout $h$
$U^F \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^N)^{\mathbb{Z}_-}$: Filter determined by the reservoir map $F$
$U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Filter with inputs in $\mathbb{R}^n$ and outputs in $\mathbb{R}^d$
$U^{A,c}_h \colon K_M \to \mathbb{R}^{\mathbb{Z}_-}$: Linear reservoir filter determined by $A$, $\mathbf{c}$, and the polynomial $h$
$U_H \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Causal and time-invariant filter associated to the functional $H \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^d$
$w \colon \mathbb{N} \to (0, 1]$: Weighting sequence
$\mathbf{x}$: (Semi-)infinite sequence containing the reservoir states; its elements are denoted by $x_t \in \mathbb{R}^N$
$\mathbf{y}$: (Semi-)infinite output signal; its elements are denoted by $y_t \in \mathbb{R}^d$
$\mathbf{z}$: (Semi-)infinite input signal; its elements are denoted by $z_t \in \mathbb{R}^n$

References

R. Abraham, J. E. Marsden, and T. S. Ratiu. Manifolds, Tensor Analysis, and Applications, volume 75 of Applied Mathematical Sciences. Springer-Verlag, 1988.
T. Apostol. Mathematical Analysis. Addison-Wesley, second edition, 1974.
L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer. Information processing using a single dynamical node as complex system. Nature Communications, 2:468, 2011.
L. Arnold. Random Dynamical Systems. Springer, 1998.
S. Boyd and L. Chua. Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, 32(11):1150-1161, 1985.
D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer. Parallel photonic information processing at gigabyte per second data rates using transient states. Nature Communications, 4(1364), 2013.
J. Cabessa and A. E. Villa. Computational capabilities of recurrent neural networks based on their attractor dynamics. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2015.
J. Cabessa and A. E. Villa. Expressive power of first-order recurrent neural networks determined by their attractor dynamics. Journal of Computer and System Sciences, 82(8):1232-1250, 2016.
P. Chossat, D. Lewis, J.-P. Ortega, and T. S. Ratiu. Bifurcation of relative equilibria in mechanical systems with symmetry. Advances in Applied Mathematics, 31:10-45, 2003.
B. D. Coleman and V. J. Mizel. On the general theory of fading memory. Archive for Rational Mechanics and Analysis, 29(1):18-31, 1968.
R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. The asymptotic performance of linear echo state neural networks. Journal of Machine Learning Research, 17(178):1-35, 2016.
J. Dambre, D. Verstraeten, B. Schrauwen, and S. Massar. Information processing capacity of dynamical systems. Scientific Reports, 2(514), 2012.
M. Fabrizio, C. Giorgi, and V. Pata. A new approach to equations with memory, volume 198. 2010.
S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 105(48):18970-18975, 2008.
F. Girosi. Approximation error bounds that use VC-bounds. In F. Fogelman-Soulié and P. Gallinari, editors, Proc. International Conference on Artificial Neural Networks, volume 1, pages 295-302, 1995.
F. Girosi and G. Anzellotti. Convergence rates of approximation by translates. Technical report, Defense Technical Information Center, 1992.
L. Gonon and J.-P. Ortega. Reservoir computing universality with stochastic inputs. IEEE Transactions on Neural Networks and Learning Systems, 2018.
L. Grigoryeva and J.-P. Ortega. Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems. Journal of Machine Learning Research, 19(24):1-40, 2018a.
L. Grigoryeva and J.-P. Ortega. Echo state networks are universal. Neural Networks, 108:495-508, 2018b.
L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. Optimal nonlinear information processing capacity in delay-based reservoir computers. Scientific Reports, 5(12858):1-11, 2015.
L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals. Neural Computation, 28:1411-1451, 2016.
H. Gunawan, S. Konca, and M. Idris. p-summable sequence spaces with inner products. Bitlis Eren University Journal of Science and Technology, 5(1):1-9, 2015.
A. G. Hart, J. L. Hook, and J. H. P. Dawes. Embedding and approximation theorems for echo state networks. Preprint, 2019.
M. Hermans and B. Schrauwen. Memory in linear recurrent neural networks in continuous time. Neural Networks, 23(3):341-355, 2010.
E. Hille and R. S. Phillips. Functional Analysis and Semi-Groups. American Mathematical Society, 1957.
B. R. Hunt, E. Ott, and J. A. Yorke. Differentiable generalized synchronization of chaos. Physical Review E, 55(4):4029-4034, 1997.
H. Jaeger. Short term memory in echo state networks. Technical Report 152, Fraunhofer Institute for Autonomous Intelligent Systems, 2002.
H. Jaeger. The echo state approach to analysing and training recurrent neural networks with an erratum note. Technical report, German National Research Center for Information Technology, 2010.
H. Jaeger and H. Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78-80, 2004.
J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks. Information and Computation, 128(1):48-56, 1996.
P. E. Kloeden. Synchronization of nonautonomous dynamical systems. Electronic Journal of Differential Equations, 2003(39):1-10, 2003.
P. E. Kloeden and M. Rasmussen. Nonautonomous Dynamical Systems. American Mathematical Society, 2010.
L. Kocarev and U. Parlitz. General approach for chaotic synchronization with applications to communication. Physical Review Letters, 74(25):5028-5031, 1995.
L. Kocarev and U. Parlitz. Generalized synchronization, predictability, and equivalence of unidirectionally coupled dynamical systems. Physical Review Letters, 76(11):1816-1819, 1996.
F. Laporte, A. Katumba, J. Dambre, and P. Bienstman. Numerical demonstration of neuromorphic computing with photonic crystal cavities. Optics Express, 26(7):7955, 2018.
L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer. Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing. Optics Express, 20(3):3241, 2012.
P. Lax. Functional Analysis. Wiley-Interscience, 2002.
R. Legenstein and W. Maass. What makes a dynamical system computationally powerful? In S. Haykin, editor, New Directions in Statistical Signal Processing: From Systems to Brain. MIT Press, Cambridge, MA, 2007.
A. Lindquist and G. Picci. Linear Stochastic Systems. Springer-Verlag, 2015.
Z. Lu, B. R. Hunt, and E. Ott. Attractor reconstruction by machine learning. Chaos, 28(6), 2018.
M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.
W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation, 14:2531-2560, 2002.
W. Maass. Liquid state machines: motivation, theory, and applications. In S. B. Cooper and A. Sorbi, editors, Computability in Context: Computation and Logic in the Real World, chapter 8, pages 275-296. 2011.
W. Maass and E. D. Sontag. Neural systems as nonlinear filters. Neural Computation, 12(8):1743-1772, 2000.
W. Maass, T. Natschläger, and H. Markram. Fading memory and kernel properties of generic cortical microcircuit models. Journal of Physiology-Paris, 98(4-6):315-330, 2004.
W. Maass, P. Joshi, and E. D. Sontag. Computational aspects of feedback in neural circuits. PLoS Computational Biology, 3(1):e165, 2007.
G. Manjunath and H. Jaeger. Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks. Neural Computation, 25(3):671-696, 2013.
G. Manjunath and H. Jaeger. The dynamics of random difference equations is remodeled by closed relations. SIAM Journal on Mathematical Analysis, 46(1):459-483, 2014.
J. A. Montaldi. Persistence and stability of relative equilibria. Nonlinearity, 10:449-466, 1997a.
J. A. Montaldi. Persistance d'orbites périodiques relatives dans les systèmes hamiltoniens symétriques. C. R. Acad. Sci. Paris Sér. I Math., 324:553-558, 1997b.
J. Munkres. Topology. Pearson, second edition, 2014.
T. Natschläger, W. Maass, and H. Markram. The "Liquid Computer": a novel strategy for real-time computing on time series. Special Issue on Foundations of Information Processing of TELEMATIK, 8(1):39-43, 2002.
J. Newman. Necessary and sufficient conditions for stable synchronization in random dynamical systems. Ergodic Theory and Dynamical Systems, 38(5):1857-1875, 2018.
J.-P. Ortega and T. S. Ratiu. Persistence and smoothness of critical relative elements in Hamiltonian systems with symmetry. Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 325(10):1107-1111, 1997.
Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, and S. Massar. Optoelectronic reservoir computing. Scientific Reports, 2:287, 2012.
J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos, 27(12), 2017.
J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: a reservoir computing approach. Physical Review Letters, 120(2):24102, 2018.
M. B. Priestley. Non-linear and Non-stationary Time Series Analysis. Academic Press, 1988.
A. Rekic-Vukovic, N. Okicic, and E. Dunjakovic. On weighted Banach sequence spaces. Advances in Mathematics: Scientific Journal, 4(2):127-138, 2015.
A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131-144, 2011.
W. J. Rugh. Nonlinear System Theory. The Volterra/Wiener Approach. The Johns Hopkins University Press, 1981.
I. W. Sandberg. Time-delay neural networks, Volterra series, and rates of approximation. Circuits, Systems, and Signal Processing, 17(5):653-655, 1998a.
I. W. Sandberg. A note on representation theorems for linear discrete-space systems. Circuits, Systems, and Signal Processing, 17(6):703-708, 1998b.
I. W. Sandberg. A representation theorem for linear discrete-space systems. Mathematical Problems in Engineering, 4:369-375, 1998c.
I. W. Sandberg. Bounds for discrete-time Volterra series representations. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 46(1):135-139, 1999.
I. W. Sandberg. Notes on fading-memory conditions. Circuits, Systems, and Signal Processing, 22(1):43-55, 2003.
E. Schechter. Handbook of Analysis and its Foundations, volume 91. Academic Press, 1997.
M. Schetzen. The Volterra and Wiener Theories of Nonlinear Systems. Wiley, 1980.
J. H. Shapiro. A Fixed-Point Farrago. Springer International Publishing Switzerland, 2016.
H. Siegelmann, B. Horne, and C. Giles. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 27(2):208-215, 1997.
S. Sternberg. Dynamical Systems. Dover, 2010.
G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N.
Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose. Recent advances in physical reservoir computing: A review. Neural Networks, 115: 100 123, 2019. P. Tio. Asymptotic Fisher memory of randomized linear symmetric Echo State Networks. Neurocomputing, 298:4 8, 2018. T. Valent. Boundary Value Problems of Finite Elasticity. Springer Verlag, 1988. K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bienstman. Parallel reservoir computing using optical amplifiers. IEEE Transactions on Neural Networks, 22(9):1469 1481, sep 2011. K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman. Experimental demonstration of reservoir computing on a silicon photonics chip. Nature Communications, 5:78 80, mar 2014. P. Ver Eecke. Sur le calcul diff erentiel dans les espaces vectoriels topologiques. Cahiers de topologie et g eom etrie diff erentielle cat egoriques, 15(3):293 339, 1974. Q. Vinckier, F. Duport, A. Smerieri, K. Vandoorne, P. Bienstman, M. Haelterman, and S. Massar. High-performance photonic reservoir computer based on a coherently driven passive cavity. Optica, 2(5):438 446, 2015. V. Volterra. Theory of Functionals and of Integral and Integro-Differential Equations. Blackie & Son Limited, Glasgow, 1930. O. White, D. Lee, and H. Sompolinsky. Short-Term Memory in Orthogonal Neural Networks. Physical Review Letters, 92(14):148102, apr 2004. N. Wiener. Nonlinear Problems in Random Theory. The Technology Press of MIT, 1958. I. B. Yildiz, H. Jaeger, and S. J. Kiebel. Re-visiting the echo state property. Neural Networks, 35:1 9, nov 2012.