Journal of Machine Learning Research 20 (2019) 1-62. Submitted 2/19; Revised 8/19; Published 11/19.

Differentiable reservoir computing

Lyudmila Grigoryeva (Lyudmila.Grigoryeva@uni-konstanz.de)
Department of Mathematics and Statistics, Graduate School of Decision Sciences, Universität Konstanz, Germany

Juan-Pablo Ortega (Juan-Pablo.Ortega@unisg.ch)
Faculty of Mathematics and Statistics, Universität Sankt Gallen, Switzerland
Centre National de la Recherche Scientifique (CNRS), France

Editor: Sayan Mukherjee

Abstract

Numerous results in learning and approximation theory have evidenced the importance of differentiability at the time of countering the curse of dimensionality. In the context of reservoir computing, much effort has been devoted in the last two decades to characterizing the situations in which systems of this type exhibit the so-called echo state (ESP) and fading memory (FMP) properties. These important features amount, in mathematical terms, to the existence and continuity of global reservoir system solutions. That research is complemented in this paper with the characterization of the differentiability of reservoir filters for very general classes of discrete-time deterministic inputs. This constitutes a novel strong contribution to the long line of research on the ESP and the FMP and, in particular, links to existing research on the input-dependence of the ESP. Differentiability has been shown in the literature to be a key feature in the learning of attractors of chaotic dynamical systems. A Volterra-type series representation for reservoir filters with semi-infinite discrete-time inputs is constructed in the analytic case using Taylor's theorem, and corresponding approximation bounds are provided. Finally, it is shown as a corollary of these results that any fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.
Keywords: reservoir computing, fading memory property, finite memory, echo state property, differentiable reservoir filter, Volterra series representation, state-space systems, system identification, machine learning.

©2019 Lyudmila Grigoryeva and Juan-Pablo Ortega. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/19-150.html.

1. Introduction

Context and preliminary discussion. Reservoir computing (RC) is a neural approach to the learning of dynamic processes which advocates the use of paradigms in which the supervised estimation of all available interconnection weights is not necessary and only the training of a static memoryless readout suffices to obtain good performances. This computational strategy has been simultaneously inspired by ideas coming from three different fields, namely, recurrent neural networks, dynamical systems, and biologically inspired neural microcircuits. The common thread to these analyses is the use of rich dynamics to process information and to create memory traces. This explains why RC can be found in the literature under other denominations like Liquid State Machines Maass and Sontag (2000); Maass et al. (2002); Natschläger et al. (2002); Maass et al. (2004, 2007) and is represented by various learning paradigms, with the Echo State Networks introduced in Jaeger (2010); Jaeger and Haas (2004) being a particularly important example. RC has shown superior performance in many forecasting and classification engineering tasks (see Lukoševičius and Jaeger (2009) and references therein) and has shown unprecedented abilities in the learning of the attractors of complex nonlinear infinite-dimensional dynamical systems Jaeger and Haas (2004); Pathak et al. (2017, 2018); Lu et al. (2018). Additionally, RC implementations with dedicated hardware have been designed and built (see, for instance, Appeltant et al.
(2011); Rodan and Tiňo (2011); Vandoorne et al. (2011); Larger et al. (2012); Paquot et al. (2012); Brunner et al. (2013); Vandoorne et al. (2014); Vinckier et al. (2015); Laporte et al. (2018); Tanaka et al. (2019)) that exhibit information processing speeds that largely outperform standard Turing-type computers. Ever since the inception of this methodology, much effort has been devoted to identifying the features that make an RC system capable of retaining relevant memory traces of the inputs and of being computationally powerful. The first question has given rise to various notions of and computational schemes for the memory capacity of RC systems Jaeger (2002); White et al. (2004); Ganguli et al. (2008); Hermans and Schrauwen (2010); Dambre et al. (2012); Grigoryeva et al. (2015); Couillet et al. (2016); Grigoryeva et al. (2016); Tiňo (2018). Another strand of interesting literature that we will not explore in this work has to do with the Turing computability capabilities of the systems of the type that we just introduced; recent relevant works in this direction are Kilian and Siegelmann (1996); Siegelmann et al. (1997); Cabessa and Villa (2015, 2016), and references therein. Regarding computational power, there are three properties that pervade the literature and that are usually declared as necessary to obtain an adequate functioning of an RC system (see, for instance, Legenstein and Maass (2007); Lukoševičius and Jaeger (2009); Maass (2011) and references therein), namely, the fading memory property (FMP), the echo state property (ESP), and the pairwise separation property (SP). The FMP is a notion observed in many modeling situations in which the influence of the input gradually fades out in time. This property is repeatedly invoked in systems theory Volterra (1930); Wiener (1958), computational neurosciences Maass et al. (2004), physics Coleman and Mizel (1968), or mechanics (see Fabrizio et al. (2010) and references therein). The ESP Jaeger (2010); Yildiz et al.
(2012); Manjunath and Jaeger (2013) is an existence and uniqueness property for the solutions of a state-space system that guarantees that the past history of the input fully determines the state of the system at any given point in time. Finally, the SP is satisfied by an input/output system if for any two input time series that differ in the past, the network assumes different states at subsequent time points. Even though these three properties are an essential part of the RC jargon, it is not always clear in the literature why they are important. A partial answer to this question has been given in the development of universality theorems for RC machine learning paradigms. Indeed, it has been shown in Maass and Sontag (2000); Maass et al. (2002, 2004, 2007); Grigoryeva and Ortega (2018a,b) that various families of RC systems that have these three properties are uniform universal approximants in a dynamical context in the presence of uniformly bounded (respectively, almost surely uniformly bounded) deterministic (respectively, stochastic) inputs. Moreover, these properties are exactly what is needed to prove universality statements using the Stone-Weierstrass theorem. Nevertheless, it has also been shown in Gonon and Ortega (2018) that when the uniform approximation criterion is replaced by an L^p norm defined with the measure induced by the input stochastic process, then the FMP does not play any role anymore. Additionally, when these properties are invoked, it is not always clear what the actual definition being used is, and they are sometimes even used interchangeably. The reason for this confusion is that, in the presence of various compactness and contractivity hypotheses, the ESP and the FMP are automatically simultaneously satisfied. Moreover, the same entanglement occurs when it comes to the actual dynamical implications that these properties entail, like the input and state forgetting properties (see later on in the text for detailed definitions).
From a learning theoretical perspective, the connections that we just brought up between these dynamical properties (FMP, ESP, and SP) and universality can be rephrased by saying that families that exhibit them are capable of making the approximation error in a learning task as small as desired. Also in the approximation error context, classical results in static setups (see, for instance, Girosi and Anzellotti (1992); Girosi (1995)) show that the differentiability of the objects that need to be approximated is as beneficial for convergence rates as the dimensionality of the input is detrimental. This feature is sometimes referred to as the blessing of smoothness, as opposed to the curse of dimensionality. Differentiability is hence a crucial element in the understanding of the learning theoretical properties of most machine learning paradigms and, as far as we know, it has never been tackled in the reservoir computing context; that question is at the core of this paper.

Important existing results. In order to make these remarks explicit, we recall here some results that will help us later on to introduce the contributions in this paper. Consider the discrete-time nonlinear state-space transformation

x_t = F(x_{t-1}, z_t),   (1.1)
y_t = h(x_t).            (1.2)

In the context of supervised machine learning we will refer to these transformations as reservoir systems and we will think of them as special types of recurrent neural networks. In that setup, the map F : ℝ^N × ℝ^n → ℝ^N, n, N ∈ ℕ⁺, is called the reservoir; it is usually randomly generated, and h : ℝ^N → ℝ^d is the readout, which is estimated via a supervised learning procedure. The input of this system is given by the elements of the infinite sequence z = (..., z_{-1}, z_0, z_1, ...) ∈ (ℝ^n)^ℤ and the output by the components of y ∈ (ℝ^d)^ℤ.
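The recursion (1.1)-(1.2) can be sketched numerically. The concrete choice below, a tanh network with a randomly generated reservoir matrix and a linear readout, is only one common instance (an Echo State Network-style map) and is not prescribed by the text; all matrices, dimensions, and scalings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 50, 1, 1                 # state, input, and output dimensions (illustrative)

# Reservoir map F(x, z) = tanh(A x + C z): a common randomly generated choice;
# the theory applies to any sufficiently regular F.
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)    # operator norm < 1 makes F a contraction in x
C = rng.normal(size=(N, n))
W = rng.normal(size=(d, N))        # linear readout h(x) = W x

def reservoir_system(z, x0=None):
    """Iterate x_t = F(x_{t-1}, z_t), y_t = h(x_t) over a finite input stretch."""
    x = np.zeros(N) if x0 is None else x0
    ys = []
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t)
        ys.append(W @ x)
    return np.array(ys)

y = reservoir_system(rng.normal(size=(100, n)))   # one output per input time step
```

Only the readout W would be estimated in the RC paradigm; A and C stay fixed after the random draw.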
Given that the state space may need to be high-dimensional in order to exhibit adequate approximation properties, it is desirable that the readout is as simple as possible (linear or polynomial, for instance). In this direction, various families of reservoir systems with linear readouts like Echo State Networks Jaeger and Haas (2004) or State Affine Systems (see later on in the text) have been shown to have universal approximation properties Grigoryeva and Ortega (2018a,b); Gonon and Ortega (2018). Training for these systems reduces to the solution of a (possibly regularized) linear regression problem when the mean square error is used as loss function. We say that the reservoir system (1.1)-(1.2) satisfies the echo state property (ESP) when for any z ∈ (ℝ^n)^ℤ there exists a unique y ∈ (ℝ^d)^ℤ that satisfies (1.1). When this existence and uniqueness feature is available one can associate well-defined filters U^F : (ℝ^n)^ℤ → (ℝ^N)^ℤ and U^F_h : (ℝ^n)^ℤ → (ℝ^d)^ℤ to the reservoir map F and the reservoir system (1.1)-(1.2), respectively. Very general situations have been characterized in which the ESP holds. For example, suppose that we restrict ourselves to inputs that are uniformly bounded by a constant M > 0, that is, consider the space K_M of semi-infinite sequences given by

K_M := { z ∈ (ℝ^n)^{ℤ₋} | ‖z_t‖ ≤ M for all t ∈ ℤ₋ },  M > 0,   (1.3)

and assume that the reservoir map F is continuous and a contraction in the first entry that maps F : B̄(0, L) × B̄(0, M) → B̄(0, L), with L > 0 (the symbol B̄(v, r) denotes the closure of the open ball B(v, r) with respect to a given norm ‖·‖, center v, and radius r > 0). In that case, it can be shown (see, for instance, (Grigoryeva and Ortega, 2018b, Theorem 3.1)) that for any z ∈ K_M there exists a unique x ∈ K_L := { x ∈ (ℝ^N)^{ℤ₋} | ‖x_t‖ ≤ L for all t ∈ ℤ₋ } that satisfies (1.1), that is, the ESP holds. This fact allows us to associate a unique filter U^F : K_M → K_L to the reservoir map F and U^F_h : K_M → (ℝ^d)^{ℤ₋} to the reservoir system (1.1)-(1.2), respectively, with U^F_h := h ∘ U^F.
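A simple numerical illustration, not taken from the paper, of why the contraction hypothesis yields the ESP: when the reservoir map contracts in its first argument, two trajectories driven by the same input but started at different states converge, so the state is asymptotically determined by the input history alone. The tanh network and all constants below are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 20, 1
A = rng.normal(size=(N, N))
A *= 0.5 / np.linalg.norm(A, 2)    # Lipschitz constant of x -> tanh(Ax + Cz) is <= 0.5
C = rng.normal(size=(N, n))

def final_state(x0, z):
    """State reached after feeding the whole input stretch z, starting from x0."""
    x = x0
    for z_t in z:
        x = np.tanh(A @ x + C @ z_t)
    return x

z = rng.normal(size=(200, n))               # one fixed input history
x_a = final_state(rng.normal(size=N), z)    # two arbitrary initializations,
x_b = final_state(rng.normal(size=N), z)    # same inputs
gap = np.linalg.norm(x_a - x_b)             # shrinks by a factor >= 2 per step
```

After 200 steps the initialization gap has contracted below machine precision, which is the finite-horizon shadow of the existence and uniqueness statement above.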
Moreover, in this situation (see again (Grigoryeva and Ortega, 2018b, Theorem 3.1)) the continuity of F and h implies that both U^F and U^F_h are continuous when we consider either the uniform or the product topologies in the domain and target spaces. The continuity with respect to the product topology is called in this setup the fading memory property (FMP) and, as we shall see below, can be characterized using weighted norms in the spaces of input and output sequences, which shows that recent inputs are more represented in the outputs of FMP filters than older ones. Equivalently, the outputs produced by an FMP filter for inputs that are close in the recent past are close, even when those inputs may be very different in the distant past. The restriction to uniformly bounded inputs of the type (1.3) when using contracting reservoir maps not only makes the ESP and the FMP hold simultaneously but also simplifies enormously the characterization of the FMP. Indeed, it has been shown in Sandberg (2003); Grigoryeva and Ortega (2018b) that in that case the fading memory property is not a metric but an exclusively topological property that does not depend on the weighted norm used to define it. Therefore, the FMP does not contain in that situation any information about the rate at which the dependence of the system output on past inputs declines. This is not the case anymore when we consider unbounded input sets since, as we show later on in Theorem 7, reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map. There are important connections between the notions and the results that we just reviewed and fundamental concepts in the theories of non-autonomous and of random dynamical systems.
Even though we shall not pursue that line of thought, the reader is encouraged to consult Arnold (1998); Kloeden (2003); Kloeden and Rasmussen (2010); Manjunath and Jaeger (2014); Newman (2018) and references therein for in-depth presentations.

Main contributions of the paper. The core contributions of this paper are, first, the analysis of the ESP and the FMP in the absence of boundedness hypotheses and, second, the extension of the FMP-related continuity statements in the literature to the study of the differentiability properties of reservoir computers. In particular, we aim at characterizing the situations in which one can obtain the differentiability of reservoir filters out of the differentiability properties of the maps that define the corresponding reservoir system. Regarding the first objective, there are several reasons to study reservoir computing systems with unbounded inputs. First, even though we only deal in this paper with the deterministic setup, any random component in the data generating process of the inputs, like a Gaussian perturbation, would imply unboundedness. Second, when dealing with reservoir systems associated to physical systems, it is certainly reasonable to assume boundedness of the input due to the saturation effects that most of those systems present. Nevertheless, the value of the bounding constant is in general unknown beforehand, which makes uniform boundedness hypotheses unrealistic. Finally, in the study of the differentiability properties of reservoir computers, differentiability of Fréchet type is only defined on open subsets of normed spaces. We shall see that any open set in the Banach space of inputs with a weighted norm contains unbounded sequences, which forces us to deal with that situation.
As to the analysis of the differentiability properties of reservoir systems, this is an important question for several reasons.

It has been shown (see, for instance, Girosi and Anzellotti (1992); Girosi (1995)) that differentiability is a key element in decreasing the complexity that is needed to approximate a function with a prescribed accuracy level. The influence of this feature is comparable to that of the dimensionality of the input. Even though the development of bounds for the approximation error in the RC context is the subject of a forthcoming paper, it is reasonable to presume that differentiability is a crucial element in the understanding of the learning theoretical properties of this type of machine learning paradigms.

RC applications to the learning of the attractors of chaotic deterministic dynamical systems have been shown (see Lu et al. (2018)) to be closely related to the notion of Generalized Synchronization Kocarev and Parlitz (1995, 1996), for which differentiability is a relevant feature Hunt et al. (1997). Indeed, in the absence of differentiability, the synchronization mapping may be wild enough (in the terminology of Hunt et al. (1997)) to create a gap between the information dimensions of the attractors of the input system and of the system used to learn it. Additionally, one of the standard techniques to assess the quality of the result of this learning task is the comparison of the Lyapunov spectra of the problem system and the learnt proxy. These spectra are only available in the presence of differentiability. Also in the context of the learning of dynamical systems, differential topological arguments have been used in Hart et al. (2019) to establish Takens-type embedding results for Echo State Networks. This is an important result that justifies the forecasting abilities of RC that are empirically observed in this framework.
When filters are analytic, they admit a Taylor series expansion which coincides with the so-called discrete-time Volterra series representation Volterra (1930); Schetzen (1980); Rugh (1981); Priestley (1988) and, moreover, different Taylor remainders can be used to provide bounds on the approximation errors that are committed when those series are truncated. This path has been explicitly explored in Sandberg (1998a, 1999) for filters that are analytic with respect to the supremum norm and have inputs with a finite past. We extend this work and characterize the inputs for which an analytic fading memory reservoir filter with respect to a weighted norm admits a Volterra series representation with semi-infinite inputs. Additionally, we can use the causality and time-invariance hypotheses to show that the corresponding Volterra series representations have time-independent coefficients (a feature that is not available in the case studied in Sandberg (1999)) that automatically satisfy the convergence conditions spelled out in Sandberg (1998b,c). The availability of this series representation has important learning theoretical consequences since, as we shall show in a forthcoming publication, it implies that any analytic filter can be represented as a reservoir filter with a linear readout and a reservoir map that has been randomly generated using a well-specified distribution. This result is a corollary of the Volterra series representation presented later on in the paper combined with an adequately chosen version of the Johnson-Lindenstrauss Lemma. In a continuous-time setup the construction involves the use of the so-called signature process designed in Rough Path Theory.
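To fix ideas, a finite Volterra series with finite memory m and degree 2 acting on a scalar input has the form y_t = h_0 + Σ_i h_1(i) z_{t-i} + Σ_{i,j} h_2(i,j) z_{t-i} z_{t-j}, with indices running over 0 ≤ i, j < m. The sketch below evaluates such a truncation with arbitrary, randomly chosen kernels chosen purely for illustration; it is not a filter taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5                                  # finite memory window
h0 = 0.1                               # illustrative kernels of orders 0, 1, and 2
h1 = rng.normal(size=m)
h2 = rng.normal(size=(m, m)) / 10

def volterra2(z):
    """Evaluate the degree-2, memory-m Volterra series on a scalar sequence z."""
    y = []
    for t in range(m - 1, len(z)):
        win = z[t - m + 1 : t + 1][::-1]        # (z_t, z_{t-1}, ..., z_{t-m+1})
        y.append(h0 + h1 @ win + win @ h2 @ win)
    return np.array(y)

z = rng.normal(size=60)
y = volterra2(z)
# Time-invariant coefficients make the filter commute with time shifts:
shift_check = np.allclose(volterra2(z[1:]), volterra2(z)[1:])
```

Because the kernels h_0, h_1, h_2 do not depend on t, shifting the input simply shifts the output, which is the time-invariance property discussed above.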
These statements can be combined with the results in Grigoryeva and Ortega (2018a) to provide an alternative proof of the following Volterra series universality theorem, stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4): any time-invariant and causal fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.

The local nature of the differential allows the formulation of conditions that ensure both the local and global existence of differentiable and, in passing, fading memory solutions. These conditions are a novel strong contribution to the long line of research on the ESP and the FMP and, in particular, link to existing research Manjunath and Jaeger (2013) on the input-dependence of the echo state property. The metric nature of the differential allows us to measure the speed at which fading memory filters forget inputs. As we see later on in Theorem 26, we are able to characterize this important piece of information via the differentiability property.

Organization of the paper. The paper is organized as follows. The introductory Section 2 presents the causal and time-invariant filters and functionals that are at the center of this paper. In Section 2.1 we introduce the Banach sequence spaces in which the semi-infinite inputs and outputs of the reservoir systems that we study are defined, and we state various elementary facts about weighted and supremum norms. In Section 2.2 the notions of fading memory, continuity, and differentiability of maps between sequence spaces are carefully introduced. Section 2.3 focuses on causal and time-invariant filters defined on the sequence spaces introduced in Section 2.1. Those results are put to work in Section 2.4 to easily show well-known results that link the continuity of a filter whose input and output spaces are endowed with weighted norms with its asymptotic independence of the remote past input.
Starting from Section 3 the paper focuses on reservoir filters. The main result in this section is Theorem 7, which provides a sufficient (but not necessary) condition for the ESP and FMP to hold in the presence of inputs that are not necessarily bounded. This is a significant generalization with respect to the standard compactness conditions imposed in Jaeger (2010) or the uniform boundedness of the inputs that was required in similar results in, for instance, Grigoryeva and Ortega (2018b). An important observation in Theorem 7 is that for general inputs, the FMP depends on the weighting sequence that is used to define it; the theorem establishes that, roughly speaking, reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map. This newly introduced FMP condition is spelled out for several widely used families of reservoir systems. The above mentioned results involving uniform boundedness hypotheses can be obtained as a corollary (see Corollary 10) of the results in this section. Another statement that we prove (see Theorem 12) is that when the target of the reservoir map is a compact set, then the echo state property is guaranteed for any input, even though the FMP may obviously not hold in that case. Section 4 is the core of the paper and studies the differentiability properties of reservoir filters determined by differentiable reservoir maps. The main results are contained in Theorems 14 and 19. The first theorem provides an explicit and easy-to-verify sufficient condition for the ESP and the FMP to hold around a given input for which we know that the reservoir system associated to a differentiable reservoir map has a solution.
Theorem 19 is a global extension of the previous result that, unlike Theorems 7 and 14, fully characterizes the ESP and the differentiability (and hence the FMP) of the reservoir filter associated to a differentiable reservoir map. In Section 4.2 we show that the global conditions in Theorem 19 are much stronger than the local ones in Theorem 14 by introducing an example that shows how the ESP and the FMP are structural features of a reservoir system when considered globally but are mostly input-dependent when considered only locally. This important observation has already been made in Manjunath and Jaeger (2013) where, using tools coming from the theory of non-autonomous dynamical systems, sufficient conditions have been formulated (see, for instance, (Manjunath and Jaeger, 2013, Theorem 2)) that ensure the ESP in connection with a given specific input. The differentiability conditions that we impose on our reservoir systems allow us to draw similar conclusions and, additionally, to automatically establish the FMP of the resulting locally defined reservoir filters. In Section 4.3 we show how for globally differentiable reservoir filters we can formulate a non-uniform version of the well-known input forgetting property for FMP filters that we recalled in Section 2.3, for inputs that are not necessarily bounded. Moreover, a novel uniform differential version of that result is provided in Theorem 26. Section 5 contains two main results. First, Theorem 29 shows the availability of discrete-time Volterra series representations for analytic, causal, time-invariant, and FMP filters. This result extends a similar statement formulated in Sandberg (1998a, 1999) to inputs with a semi-infinite past that are not necessarily bounded.
Second, in Theorem 31, we combine the previous result with a universality statement in Grigoryeva and Ortega (2018a) to provide an alternative proof of the Volterra series universality theorem stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4).

The proofs of most results are provided in the appendices at the end of the paper.

2. Causal and time-invariant input/output systems

2.1. The input and output spaces

This paper studies input/output systems that are causal, that is, the output depends only on the past history of the input, and that, in general, have infinite memory. This makes us consider the spaces of left-infinite sequences with values in ℝ^n, that is, (ℝ^n)^{ℤ₋} = { z = (..., z_{-2}, z_{-1}, z_0) | z_i ∈ ℝ^n, i ∈ ℤ₋ }. Analogously, (D_n)^{ℤ₋} stands for the space of semi-infinite sequences with elements in the subset D_n ⊂ ℝ^n. The space ℝ^n will be considered as a normed space with a norm denoted by ‖·‖ which is not necessarily the Euclidean one (even though they are all equivalent), unless it is explicitly mentioned. We endow these infinite product spaces with the Banach space structures associated to one of the following two norms. First, the supremum norm ‖z‖_∞ := sup_{t∈ℤ₋} {‖z_t‖}. The symbol ℓ^∞₋(ℝ^n) is used to denote the Banach space formed by the elements that have a finite supremum norm. Second, given a strictly decreasing sequence w : ℕ → (0, 1] with zero limit and w_0 = 1 (a weighting sequence), we define the weighted norm ‖·‖_w on (ℝ^n)^{ℤ₋} associated to w by ‖z‖_w := sup_{t∈ℤ₋} {‖z_t‖ w_{-t}}. It can be shown (see Grigoryeva and Ortega (2018b)) that the set ℓ^w₋(ℝ^n) formed by the elements that have a finite w-weighted norm is a Banach space. Moreover, it is easy to show that ‖z‖_w ≤ ‖z‖_∞ for all z ∈ (ℝ^n)^{ℤ₋}. This implies that ℓ^∞₋(ℝ^n) ⊂ ℓ^w₋(ℝ^n) and that the inclusion map (ℓ^∞₋(ℝ^n), ‖·‖_∞) ↪ (ℓ^w₋(ℝ^n), ‖·‖_w) is continuous.
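For intuition, the two norms can be computed on a finite stretch of a left semi-infinite sequence. The geometric weighting sequence w_k = λ^k used below is just one convenient choice, and keeping only finitely many entries is a truncation assumed for the sketch.

```python
import numpy as np

def sup_norm(z):
    """||z||_inf = sup_t ||z_t|| for z stored as (z_{-T+1}, ..., z_{-1}, z_0)."""
    return max(np.linalg.norm(z_t) for z_t in z)

def weighted_norm(z, lam=0.9):
    """||z||_w = sup_{t<=0} ||z_t|| w_{-t} for the weighting sequence w_k = lam**k."""
    T = len(z)
    w = lam ** np.arange(T)                  # w_0 = 1 > w_1 > ... -> 0
    # z[-1] is z_0, so the entry z_{-k} = z[-1 - k] is paired with the weight w_k
    return max(np.linalg.norm(z[-1 - k]) * w[k] for k in range(T))

rng = np.random.default_rng(3)
z = rng.normal(size=(50, 2))                 # 50 most recent values of an R^2-valued input
```

Since every weight satisfies w_k ≤ 1, each weighted term is dominated by the corresponding unweighted one, which is exactly the inequality ‖z‖_w ≤ ‖z‖_∞ behind the continuous inclusion above.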
The Banach spaces (ℓ^∞₋(ℝ^n), ‖·‖_∞) and (ℓ^w₋(ℝ^n), ‖·‖_w) are particular cases of the weighted Banach sequence spaces (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}) with norms

‖z‖_{p,w} := ( Σ_{t∈ℤ₋} ‖z_t‖^p w_{-t} )^{1/p}, with 1 ≤ p < +∞, z ∈ (ℝ^n)^{ℤ₋}, and w a weighting sequence.   (2.1)

When p = +∞ we set ‖·‖_{p,w} := ‖·‖_w. We then define

ℓ^{p,w}₋(ℝ^n) := { z ∈ (ℝ^n)^{ℤ₋} | ‖z‖_{p,w} < +∞ }.   (2.2)

These spaces are defined in the literature (see, for instance, Rekić-Vuković et al. (2015); Gunawan et al. (2015)) without the requirement that w is a weighting sequence in the sense of the definition above. Indeed, the standard Banach spaces (ℓ^p₋(ℝ^n), ‖·‖_p), with 1 ≤ p ≤ +∞, are particular cases of (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}) obtained by taking as w the constant sequence w^ι given by w^ι_t := 1, for all t ∈ ℕ. This observation is used in the paper to obtain many results for the spaces ℓ^∞₋(ℝ^n) as particular cases of those proved for ℓ^w₋(ℝ^n). We emphasize that w^ι is not a weighting sequence and that the spaces (ℓ^w₋(ℝ^n), ‖·‖_w) considered in this paper are all based on sequences w of weighting type. It can be proved (see (Rekić-Vuković et al., 2015, Theorems 3.3 and 4.1 and Corollary 4.1)) that, in that case:

ℓ^{p,w}₋(ℝ^n) ⊂ ℓ^w₋(ℝ^n), for any 1 ≤ p < +∞,   (2.3)

and that

ℓ^p₋(ℝ^n) ⊂ ℓ^{p,w}₋(ℝ^n), for any 1 ≤ p ≤ +∞.   (2.4)

All the results in this paper are formulated for the weighted spaces (ℓ^w₋(ℝ^n), ‖·‖_w) even though many of the statements that we provide are also valid for (ℓ^∞₋(ℝ^n), ‖·‖_∞) and (ℓ^{p,w}₋(ℝ^n), ‖·‖_{p,w}). That will be explicitly pointed out in the statements or in remarks when it is the case. Appendix 6.1 contains a collection of results regarding the topologies induced by weighted and supremum norms.

2.2. FMP, continuity, and differentiability of maps on infinite sequence spaces

Much of this paper is related to the continuity and the differentiability of maps of the type f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N), with w₁, w₂ weighting sequences and W and V subsets of ℓ^{w₁}₋(ℝ^n) and ℓ^{w₂}₋(ℝ^N), respectively, which in the case of differentiable maps are necessarily open.
Maps that are continuous with respect to topologies generated by weighted norms will be generically referred to as fading memory maps (or we will say that they have the fading memory property (FMP)), while when the topology considered is generated by the supremum norm, we just say that the map is continuous. Most of the definitions that we provide in what follows for the weighted norm case can be adapted to the supremum norm case by replacing the weighting sequences by the constant sequence w^ι given by w^ι_t := 1, for all t ∈ ℕ. Suppose now that W and V are open subsets. The map f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N) is (Fréchet) differentiable at u_0 ∈ W when there exists a bounded linear map Df(u_0) : ℓ^{w₁}₋(ℝ^n) → ℓ^{w₂}₋(ℝ^N) that satisfies

lim_{u→u_0} ‖f(u) − f(u_0) − Df(u_0)·(u − u_0)‖_{w₂} / ‖u − u_0‖_{w₁} = 0.   (2.5)

We say that f : W ⊂ ℓ^{w₁}₋(ℝ^n) → V ⊂ ℓ^{w₂}₋(ℝ^N) is of class C¹(W) when it is differentiable at any point in W and the induced map Df : W → L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is continuous, where the space of linear maps L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is endowed with the operator norm |||·|||_{w₁,w₂} defined by

|||A|||_{w₁,w₂} := sup_{u≠0} { ‖A·u‖_{w₂} / ‖u‖_{w₁} },  A ∈ L(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)).   (2.6)

When in the domain and the range we use the same weighting sequence w, we will write |||A|||_w instead of |||A|||_{w₁,w₂}. The higher order derivatives D^r f(u_0) : ℓ^{w₁}₋(ℝ^n) × ⋯ × ℓ^{w₁}₋(ℝ^n) (r times) → ℓ^{w₂}₋(ℝ^N), r ∈ ℕ⁺, are inductively defined, and the map f is said to be of class C^r(W) when it is r-times differentiable at any point in W and the induced map D^r f : W → L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) into the normed space of r-multilinear maps is continuous. We recall that the operator norm |||·|||_{w₁,w₂} in L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)) is given by

|||A|||_{w₁,w₂} := sup { ‖A(u_1, ..., u_r)‖_{w₂} / (‖u_1‖_{w₁} ⋯ ‖u_r‖_{w₁}) | u_1, ..., u_r ∈ ℓ^{w₁}₋(ℝ^n), u_1, ..., u_r ≠ 0 },  A ∈ L^r(ℓ^{w₁}₋(ℝ^n), ℓ^{w₂}₋(ℝ^N)).   (2.7)

We recall that differentiable functions are automatically continuous, and we denote the class of continuous functions by C⁰(W). When f is of class C^r(W) for any r ∈ ℕ⁺, we say that f is smooth in W and we denote this class by C^∞(W).
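The limit (2.5) can be probed numerically on a truncated weighted sequence space. The map below, f(u)_t = tanh(u_t) applied entrywise, and its candidate derivative Df(u_0)·v = (1 − tanh²(u_0))·v are our own illustrative choices, not objects from the paper; the point is that the quotient in (2.5) shrinks with the size of the perturbation.

```python
import numpy as np

T, lam = 40, 0.9
# Entries stored as (z_{-T+1}, ..., z_0); the entry z_{-k} gets the weight lam**k.
w = lam ** np.arange(T)[::-1]

def wnorm(z):
    """Truncated weighted sup norm ||z||_w."""
    return np.max(np.abs(z) * w)

f = np.tanh                                      # f(u)_t = tanh(u_t), entrywise
df = lambda u0, v: (1 - np.tanh(u0) ** 2) * v    # candidate Frechet derivative at u0

rng = np.random.default_rng(4)
u0 = rng.normal(size=T)
v = rng.normal(size=T)

def quotient(h):
    """The expression inside the limit (2.5) for the perturbation u = u0 + h v."""
    return wnorm(f(u0 + h * v) - f(u0) - df(u0, h * v)) / wnorm(h * v)

q_coarse, q_fine = quotient(1e-1), quotient(1e-3)   # decreases as h -> 0
```

The quotient scales roughly linearly in h here because the remainder of a twice-differentiable scalar map is quadratic in the perturbation.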
When f is smooth in W we can construct for it a Taylor power series expansion. We say that f is analytic in W when the convergence domain of that power series includes W. The analytic class is denoted by C^ω(W). It can be shown (see Lemma 32 in the appendices) that for any weighting sequence w, any open set in (ℓ^w₋(ℝ^n), ‖·‖_w) contains unbounded sequences. For instance, let B_w(0, ϵ) be the ball of radius ϵ > 0 around the zero sequence and let v ∈ ℝ^n be a vector such that ‖v‖ = 1. The divergent sequence z defined by z_t := ϵ v/(2 w_{-t}) is such that ‖z‖_w = ϵ/2 and hence z ∈ B_w(0, ϵ) ⊂ ℓ^w₋(ℝ^n).

2.3. Causal and time-invariant filters and functionals

Let D_n ⊂ ℝ^n and D_N ⊂ ℝ^N. We refer to maps of the type U : (D_n)^ℤ → (D_N)^ℤ as filters or operators, and to those like H : (D_n)^ℤ → ℝ^N (or H : (D_n)^{ℤ₋} → ℝ^N) as ℝ^N-valued functionals. These definitions can be easily extended to accommodate situations where the domains and the targets of the filters are not necessarily product spaces but just arbitrary subsets V_n and V_N of (ℝ^n)^ℤ and (ℝ^N)^ℤ like, for instance, ℓ^∞(ℝ^n) and ℓ^∞(ℝ^N), or ℓ^w(ℝ^n) and ℓ^w(ℝ^N), for some weighting sequence w. A filter U : (D_n)^ℤ → (D_N)^ℤ is called causal when for any two elements z, w ∈ (D_n)^ℤ that satisfy z_τ = w_τ for all τ ≤ t, for a given t ∈ ℤ, we have that U(z)_t = U(w)_t. Let T^ℤ_τ : (ℝ^n)^ℤ → (ℝ^n)^ℤ be the time delay operator defined by T^ℤ_τ(z)_t := z_{t−τ}, τ ∈ ℤ. A subset V_n ⊂ (ℝ^n)^ℤ is called time-invariant when T^ℤ_τ(V_n) = V_n, for all τ ∈ ℤ. The filter U is called time-invariant when it is defined on a time-invariant set and commutes with the time delay operator, that is, T^ℤ_τ ∘ U = U ∘ T^ℤ_τ, for any τ ∈ ℤ (in this expression, the two operators T^ℤ_τ have to be understood as defined in the appropriate sequence spaces). We recall that there is a bijection between causal time-invariant filters and functionals on (D_n)^{ℤ₋}.
Indeed, given a causal and time-invariant filter $U : (D_n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$, we can associate to it a functional $H_U : (D_n)^{\mathbb{Z}_-} \to \mathbb{R}^N$ via the assignment $H_U(z) := U(z^e)_0$, where $z^e \in (D_n)^{\mathbb{Z}}$ is an arbitrary extension of $z \in (D_n)^{\mathbb{Z}_-}$ to $(D_n)^{\mathbb{Z}}$. Conversely, for any functional $H : (D_n)^{\mathbb{Z}_-} \to \mathbb{R}^N$, we can define a time-invariant causal filter $U_H : (D_n)^{\mathbb{Z}} \to (\mathbb{R}^N)^{\mathbb{Z}}$ by $U_H(z)_t := H\left((P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t})(z)\right)$, where $T^{\mathbb{Z}}_{-t}$ is the $(-t)$-time delay operator and $P_{\mathbb{Z}_-} : (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ is the natural projection. Moreover, when considering causal and time-invariant filters $U : (D_n)^{\mathbb{Z}} \to (D_N)^{\mathbb{Z}}$ it suffices to work just with the restriction $U : (D_n)^{\mathbb{Z}_-} \to (D_N)^{\mathbb{Z}_-}$, that we denote with the same symbol, since the latter uniquely determines the former. Indeed, by definition, for any $z \in (D_n)^{\mathbb{Z}}$ and $t \in \mathbb{N}^+$:
$$U(z)_t = T^{\mathbb{Z}}_{-t}\left(U(z)\right)_0 = U\left(T^{\mathbb{Z}}_{-t}(z)\right)_0, \qquad (2.8)$$
where the second equality holds by the time-invariance of $U$ and the value on the right-hand side depends only on $P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(z) \in (D_n)^{\mathbb{Z}_-}$, by causality. In view of this observation, we restrict our study to filters with domain and target in the spaces of left semi-infinite sequences. In particular, we say that a causal and time-invariant filter $U$ has the fading memory property, or that it is continuous, when the corresponding restricted filter defined on left semi-infinite inputs has those properties, as we defined them in Section 2.2.

Additionally, from now on we consider most of the time delay operators with domain and target in $(\mathbb{R}^n)^{\mathbb{Z}_-}$, and we simply denote them as $T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$. The definition of these restricted time delay operators $T_\tau$ requires considering two cases:

$T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ with $\tau$ negative: as before, $T_\tau(z)_t := z_{t+\tau}$, for any $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$. This implies that, in this case,
$$T_\tau(z) = P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-\tau}(z^e), \quad z \in (\mathbb{R}^n)^{\mathbb{Z}_-}, \ \tau < 0,$$
where $z^e \in (\mathbb{R}^n)^{\mathbb{Z}}$ is an arbitrary extension of $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ to $(\mathbb{R}^n)^{\mathbb{Z}}$. The map $T_\tau$, $\tau \in \mathbb{Z}_-$, is surjective, that is, $T_\tau\left((\mathbb{R}^n)^{\mathbb{Z}_-}\right) = (\mathbb{R}^n)^{\mathbb{Z}_-}$, but it is not injective. The same applies to the restriction of $T_\tau$ to any time-invariant set $V_n \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ which satisfies $T_\tau(V_n) = V_n$.
$T_\tau : (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$ with $\tau$ positive: there is in principle not a unique way to define the restricted operators $T_\tau$, since that involves the choice of vectors $v_\tau \in (\mathbb{R}^n)^\tau$ such that $T_\tau(z) := (z, v_\tau)$, for any $z \in (\mathbb{R}^n)^{\mathbb{Z}_-}$. The choice $v_\tau = 0$ for all $\tau > 0$ is canonical, since it is the only one that makes the resulting maps linear and, additionally, satisfy $T_\tau = \underbrace{T_1 \circ \cdots \circ T_1}_{\tau \text{ times}}$. We hence adopt the definition
$$T_\tau(z) := (z, \underbrace{0, \ldots, 0}_{\tau \text{ times}}), \quad z \in (\mathbb{R}^n)^{\mathbb{Z}_-}, \ \tau > 0,$$
for the rest of the paper. In this case $T_\tau$ is injective but not surjective.

The following lemma gathers some differentiability properties of projections and of time delay operators restricted to normed sequence spaces that will be used later on. A key element in this result is what we call, for each weighting sequence $w$, its decay ratio $D_w$ and inverse decay ratio $L_w$, defined as:
$$D_w := \sup_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} \quad \text{and} \quad L_w := \sup_{t \in \mathbb{N}} \left\{ \frac{w_t}{w_{t+1}} \right\}.$$
As $w$ is by definition strictly decreasing, we necessarily have $0 < w_{t+1}/w_t < 1$, for all $t \in \mathbb{N}$, and $1 < w_0/w_1 \le \sup_{t \in \mathbb{N}} \{w_t/w_{t+1}\} = L_w$. Consequently:
$$0 < D_w \le 1 \quad \text{and} \quad 1 < L_w \le +\infty.$$
The decay ratios provide a geometric bound for the convergence speed of $w$ and the divergence rate of $w^{-1}$. Indeed, it is easy to see that
$$w_t \le D_w^t \quad \text{and} \quad 1/w_t \le L_w^t, \quad \text{for any } t \in \mathbb{N}. \qquad (2.10)$$
Additionally, the fact that for all $t \in \mathbb{N}$ we have $1 < w_t/w_{t+1}$ and $0 < w_{t+1}/w_t < 1$ implies that
$$L_w = \sup_{t \in \mathbb{N}} \left\{ \frac{w_t}{w_{t+1}} \right\} = 1 \Big/ \inf_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} \ge 1 \Big/ \sup_{t \in \mathbb{N}} \left\{ \frac{w_{t+1}}{w_t} \right\} = \frac{1}{D_w},$$
which implies that
$$L_w D_w \ge 1. \qquad (2.11)$$
More generally, in relation with the power weighting sequences that we discussed in Lemma 35, we have that:
$$0 < D_{w^n} \le D_w \le D_{w^{1/m}} \le 1 \quad \text{and} \quad 1 < L_{w^{1/m}} \le L_w \le L_{w^n} \le +\infty, \quad \text{for any } m, n \in \mathbb{N}^+. \qquad (2.12)$$

Lemma 1 Let $w$ be a weighting sequence and $n \in \mathbb{N}^+$. Then:

(i) The projections $p_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to (\mathbb{R}^n, \|\cdot\|)$, $t \in \mathbb{Z}_-$, given by $p_t(z) := z_t$, $z \in \ell^w_-(\mathbb{R}^n)$, are linear, smooth, and hence continuous. Moreover, $|||p_t|||_w = 1/w_{-t}$.

(ii) Consider the restriction of the time delay operator $T_t$ to $\ell^w_-(\mathbb{R}^n)$ for any $t \in \mathbb{Z}$. We consider two cases. First, if $t < 0$ and the inverse decay ratio $L_w$ of $w$ is finite, then $T_t$ maps into $\ell^w_-(\mathbb{R}^n)$, that is, $\ell^w_-(\mathbb{R}^n)$ is $T_t$-invariant, and $T_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is surjective, open, and a submersion, that is, $\ker T_t$ is a split subspace of $\ell^w_-(\mathbb{R}^n)$. If $t > 0$, then $\ell^w_-(\mathbb{R}^n)$ is always $T_t$-invariant. $T_t : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is in that case an immersion, that is, it is injective and its image $\operatorname{Im} T_t$ is split. Moreover, for any $t > 0$, $T_{-t} \circ T_t = \mathbb{I}_{\ell^w_-(\mathbb{R}^n)}$, and in both cases the maps $T_t$ are linear, smooth, and hence continuous. Additionally,
$$|||T_{-1}|||_w = L_w, \quad |||T_1|||_w = D_w, \quad |||T_t|||_w \le L_w^{-t}, \quad \text{and} \quad |||T_{-t}|||_w \le D_w^{-t}, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (2.13)$$

(iii) For any $t_1, t_2 \in \mathbb{Z}_-$ we have
$$p_{t_1 + t_2} = p_{t_1} \circ T_{t_2} = p_{t_2} \circ T_{t_1}. \qquad (2.14)$$

These statements also hold true when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$. In that case one has to take as sequence $w$ the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$, and $L_w$ and $D_w$ are replaced by the constant $1$.

Remark 2 The decay ratios are easy to compute for many families of weighting sequences. Two cases that we frequently encounter are:

(i) Geometric sequence: $w_t := \lambda^t$, $t \in \mathbb{N}$, with $0 < \lambda < 1$. In this case:
$$L_w = \sup_{t \in \mathbb{N}} \left\{ \frac{\lambda^t}{\lambda^{t+1}} \right\} = \frac{1}{\lambda} > 1 \quad \text{and} \quad D_w = \sup_{t \in \mathbb{N}} \left\{ \frac{\lambda^{t+1}}{\lambda^t} \right\} = \lambda.$$

(ii) Harmonic sequence: $w_t := 1/(1 + t d)$, $t \in \mathbb{N}$, with $d > 0$. In this case $D_w = 1$ and $L_w = 1 + d$.

We emphasize that the finiteness of the inverse decay ratio is not guaranteed for all weighting sequences. An example that illustrates this fact is the sequence $w_t := \exp(-t^2)$. It is easy to verify that in that case $L_w = +\infty$ and $D_w = 1/e$.

Remark 3 The inequalities (2.13) can be combined with Gelfand's formula (Lax, 2002, page 195) to provide bounds for the spectral radii $\rho(T_t)$ and $\rho(T_{-t})$ for all $t \in \mathbb{Z}_-$. Indeed,
$$\rho(T_t) = \lim_{n \to \infty} |||T_t^n|||_w^{1/n} \le \lim_{n \to \infty} \left(L_w^{-tn}\right)^{1/n} = L_w^{-t}, \quad \text{with } t \in \mathbb{Z}_-.$$
Analogously, one shows that $\rho(T_{-t}) \le D_w^{-t}$.

Remark 4 Lemma 1 remains valid when instead of the spaces $\ell^w_-(\mathbb{R}^n)$ we use the spaces $\ell^{p,w}_-(\mathbb{R}^n)$ that we introduced in Section 2.1, for any $1 \le p < +\infty$.
In that case, and for any $t \in \mathbb{Z}_-$,
$$|||p_t|||_{p,w} = \frac{1}{w_{-t}^{1/p}}, \qquad (2.15)$$
$$|||T_{-1}|||_{p,w} = L_w^{1/p}, \quad |||T_1|||_{p,w} = D_w^{1/p}, \quad |||T_t|||_{p,w} \le L_w^{-t/p}, \quad \text{and} \quad |||T_{-t}|||_{p,w} \le D_w^{-t/p}, \quad \text{for all } t \in \mathbb{Z}_-. \qquad (2.16)$$

Remark 5 Some of the properties of time delay operators that we just studied have interesting interpretations in a Hilbert space context. See Lindquist and Picci (2015) for a detailed study.

2.4. The fading memory property and remote past input independence

The properties of time delay operators that we enunciated in Lemma 1 allow us to show how the fading memory property, defined as the continuity of a filter linking input and output spaces endowed with weighted norms (see Section 2.1), can be interpreted as its asymptotic independence on the remote past input (Wiener, 1958, page 89). Analogously, we can see that the FMP amounts to the attribute that, in the words of Volterra (Volterra, 1930, page 188), "the influence of the input a long time before the given moment fades out". This property has also been characterized as a unique steady-state property in Boyd and Chua (1985) and referred to as the input forgetting property in Jaeger (2010). All these characterizations were proved under various compactness and/or uniform boundedness hypotheses on the inputs. The next result establishes that property as a straightforward corollary of Lemma 1; later on, in Section 4.3, it will be generalized to situations where the inputs are eventually unbounded.

In the following statement we will be using the following notation: given the sequences $u \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $v \in (\mathbb{R}^n)^t$, $t \in \mathbb{N}$, the symbol $uv \in (\mathbb{R}^n)^{\mathbb{Z}_-}$ denotes the concatenation of $u$ and $v$.

Theorem 6 (FMP and the uniform input forgetting property) Let $M, L > 0$, $n, N \in \mathbb{N}^+$, and let $K_M \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$, $K_L \subset (\mathbb{R}^N)^{\mathbb{Z}_-}$ (respectively, $K_M^+ \subset (\mathbb{R}^n)^{\mathbb{N}^+}$, $K_L^+ \subset (\mathbb{R}^N)^{\mathbb{N}^+}$) be the sets of uniformly bounded left (respectively, right) semi-infinite sequences defined in (1.3). Let $U : K_M \to K_L$ be a causal and time-invariant fading memory filter.
Then, for any $u, v \in K_M$ and $z \in K_M^+$ we have that
$$\lim_{t \to +\infty} \left\| U(uz)_t - U(vz)_t \right\| = 0, \qquad (2.17)$$
where in this expression the filter $U$ is defined by time-invariance on positive times using (2.8). The convergence in (2.17) is uniform on $u$, $v$, and $z$, in the sense that there exists a monotonically decreasing sequence $w^U$ with zero limit such that for all $u, v \in K_M$, $z \in K_M^+$, and $t \in \mathbb{N}$,
$$\left\| U(uz)_t - U(vz)_t \right\| \le w^U_t. \qquad (2.18)$$
Filters that satisfy condition (2.17) for any $u, v \in K_M$ and $z \in K_M^+$ are said to have the input forgetting property, and we refer to (2.18) as the uniform input forgetting property.

Proof. We start by recalling that in the presence of uniformly bounded inputs, the FMP can be characterized as the continuity of the map $U : K_M \to K_L$ with the sets $K_M$ and $K_L$ endowed with the relative topology induced either by the product topology on $(\mathbb{R}^n)^{\mathbb{Z}_-}$ and $(\mathbb{R}^N)^{\mathbb{Z}_-}$, respectively, or by the weighted norms in the spaces $\ell^w_-(\mathbb{R}^n)$ and $\ell^w_-(\mathbb{R}^N)$, with $w$ any weighting sequence (see (Grigoryeva and Ortega, 2018b, Corollary 2.7 and Proposition 2.11)). Moreover, the sets $K_M$ and $K_L$ are compact in this topology (Grigoryeva and Ortega, 2018b, Corollary 2.8) and hence the FMP filter $U : K_M \to K_L$ is not only continuous but also uniformly continuous. Consequently, once we have fixed a weighting sequence $w$, an increasing modulus of continuity $\omega_U : \mathbb{R}^+ \to \mathbb{R}^+$ can be associated to the map $U : (K_M, \|\cdot\|_w) \to (K_L, \|\cdot\|_w)$. We emphasize that $\omega_U$ depends on $w$ since it is a metric and not a purely topological notion.

Now, using (2.8) and an arbitrary weighting sequence $w$ that we choose with $D_w < 1$, we can write for any $t \in \mathbb{N}$
$$\left\| U(uz)_t - U(vz)_t \right\| = \left\| U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right)_0 - U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right)_0 \right\| = \left\| p_0\left( U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right) \right) - p_0\left( U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right) \right) \right\| \le \left\| U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz)\right) - U\left(P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz)\right) \right\|_w, \qquad (2.19)$$
where we used that $|||p_0|||_w = 1$ by the first part of Lemma 1. We now notice that
$$P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(uz) = T_t(u) + (\ldots, 0, z_1, \ldots, z_t) \quad \text{and} \quad P_{\mathbb{Z}_-} \circ T^{\mathbb{Z}}_{-t}(vz) = T_t(v) + (\ldots, 0, z_1, \ldots, z_t),$$
which substituted in (2.19) and using the second part of Lemma 1 yields
$$\left\| U(uz)_t - U(vz)_t \right\| \le \omega_U\left( \left\| T_t(u - v) \right\|_w \right) \le \omega_U\left( |||T_t|||_w \, \|u - v\|_w \right) \le \omega_U\left( D_w^t \, \|u - v\|_w \right) \le \omega_U\left( 2 M D_w^t \right). \qquad (2.20)$$
Now, as $w$ has been chosen so that $D_w < 1$ and $\lim_{s \to 0} \omega_U(s) = 0$, we set $w^U_t := \omega_U\left(2 M D_w^t\right)$, and we have that
$$\lim_{t \to +\infty} w^U_t = \lim_{t \to +\infty} \omega_U\left(2 M D_w^t\right) = 0, \qquad (2.21)$$
which, using the inequality (2.20), proves the claim.

3. The fading memory property in reservoir filters with unbounded inputs

Starting in this section we focus on filters defined by reservoir systems of the type introduced in (1.1)–(1.2), but this time we consider reservoir maps $F : D_N \times D_n \to D_N$ where the input variable takes values on a set $D_n \subset \mathbb{R}^n$ that is not necessarily bounded. All along this section, the reservoir map $F$ will be assumed to be continuous and a contraction on the first entry with constant $0 < c < 1$, that is,
$$\left\| F(x_1, z) - F(x_2, z) \right\| \le c \left\| x_1 - x_2 \right\|, \quad \text{for all } x_1, x_2 \in D_N \text{ and } z \in D_n.$$
When the inputs are assumed to be uniformly bounded by a constant $M > 0$ and $F$ maps into a ball $\overline{B}(0, L) \subset \mathbb{R}^N$, $L > 0$, it has been proved (see (Grigoryeva and Ortega, 2018b, Proposition 2.1 and Theorem 3.1)) that we can associate to this system unique filters $U^F : K_M \to K_L$ and $U^F_h : K_M \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ (the sets $K_M$ and $K_L$ are introduced in (1.3)) that are causal, time-invariant, continuous and, moreover, satisfy the fading memory property with respect to any weighting sequence $w$. We recall that $U^F$ is the filter associated to the solutions of the reservoir equation (1.1) and assigns to any input sequence $z \in K_M$ the output $U^F(z)$ that satisfies
$$U^F(z)_t = F\left(U^F(z)_{t-1}, z_t\right), \quad \text{for any } t \in \mathbb{Z}_-. \qquad (3.1)$$
Recall also that $U^F_h : K_M \to (\mathbb{R}^d)^{\mathbb{Z}_-}$ is the filter associated to the full system (1.1)–(1.2) and is given by $U^F_h := h \circ U^F$. We denote by $H^F : K_M \to \overline{B}(0, L)$ and $H^F_h : K_M \to \mathbb{R}^d$ the corresponding reservoir functionals.
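The input forgetting property of Theorem 6 can be watched at work numerically. The sketch below is a toy contractive reservoir of our own design (the `tanh` map, the sizes, and the constant $c = 0.6$ are all assumptions, not constructions from the paper): two states summarizing different remote pasts are driven by the same recent input, and the gap between the resulting trajectories shrinks at least geometrically at rate $c$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contractive reservoir map F(x, z) = tanh(A x + C z): since |tanh'| <= 1
# and ||A||_2 = 0.6, F is a contraction on the first entry with c = 0.6.
N, n = 10, 2
A = rng.standard_normal((N, N))
A *= 0.6 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, n))

def run(x, zs):
    """Iterate x_t = F(x_{t-1}, z_t) and record the states."""
    out = []
    for z in zs:
        x = np.tanh(A @ x + C @ z)
        out.append(x)
    return out

z = rng.uniform(-1, 1, size=(60, n))      # shared recent input
xu = run(rng.standard_normal(N), z)       # state encoding one remote past
xv = run(rng.standard_normal(N), z)       # state from a different remote past

gaps = [np.linalg.norm(a - b) for a, b in zip(xu, xv)]
print(gaps[0], gaps[-1])                  # the gap shrinks at least like 0.6^t
```

The contraction property guarantees `gaps[t+1] <= 0.6 * gaps[t]` at every step, which is exactly the uniform forgetting bound (2.18) with $w^U_t$ proportional to $0.6^t$.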
The reservoir functionals are related to the corresponding reservoir filters via the identities:
$$H^F(z) = U^F(z)_0 = F\left(U^F(z)_{-1}, z_0\right) \quad \text{and} \quad H^F_h(z) = h\left(U^F(z)_0\right), \qquad (3.2)$$
for all $z \in K_M$.

The next theorem is the most important result in this section and shows that the results that we just recalled about the ESP and the FMP for reservoir filters with uniformly bounded inputs remain valid in the presence of unbounded inputs. However, in that case, the fading memory property depends on the weighting sequence that is used to define it. The sufficient condition for the FMP spelled out in the next theorem asserts, roughly speaking, that reservoir systems have the FMP only with respect to weighting sequences that converge to zero sufficiently fast, at a rate that is related to the contracting properties of the reservoir map.

Theorem 7 (ESP and FMP with continuous reservoir maps) Let $F : D_N \times D_n \to D_N$ be a continuous reservoir map, where $D_n \subset \mathbb{R}^n$, $D_N \subset \mathbb{R}^N$, $n, N \in \mathbb{N}^+$. Assume, additionally, that it is a contraction on the first entry with constant $0 < c < 1$. Let $w$ be a weighting sequence with finite inverse decay ratio $L_w$ and let $V_n \subset (D_n)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^n)$ be a time-invariant set. We consider two situations regarding the target $D_N$ of the reservoir map:

(i) $D_N$ is a compact subset of $\mathbb{R}^N$.

(ii) $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a complete subset of the Banach space $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$, $F$ is Lipschitz continuous, and the reservoir system (1.1) associated to $F$ has a solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$.

In both cases, if
$$c \, L_w < 1, \qquad (3.3)$$
then the reservoir system associated to $F$ with inputs in $V_n$ has the echo state property and hence determines a unique continuous, causal, and time-invariant reservoir filter $U^F : (V_n, \|\cdot\|_w) \to \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ that has the fading memory property with respect to $w$.
Moreover, if $F$ is Lipschitz on the second component (which is always the case under the hypotheses in (ii)) with constant $L_z$, that is, $\|F(x, z_1) - F(x, z_2)\| \le L_z \|z_1 - z_2\|$, for any $x \in D_N$, $z_1, z_2 \in D_n$, then $U^F$ is also Lipschitz with constant
$$L_{U^F} := \frac{L_z}{1 - c \, L_w}. \qquad (3.4)$$
This statement also holds true under the hypotheses in part (ii) when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$. In that case $L_w$ is replaced by the constant $1$ and hence condition (3.3) is automatically satisfied. The resulting reservoir filter $U^F : (V_n, \|\cdot\|_\infty) \to \left((D_N)^{\mathbb{Z}_-} \cap \ell^\infty_-(\mathbb{R}^N), \|\cdot\|_\infty\right)$ is continuous.

Remark 8 A very common situation that provides the solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$ for the reservoir system needed in part (ii) is the existence of a fixed point $(x^0, z^0) \in D_N \times D_n$ of $F$ that satisfies $F(x^0, z^0) = x^0$. In that case the required solution is given by the constant sequences $x^0_t = x^0$, $z^0_t = z^0$, for all $t \in \mathbb{Z}_-$.

Remark 9 If the target $D_N$ of the reservoir map is a closed subset of $\mathbb{R}^N$, that is, $D_N = \overline{D_N}$, then by part (iii) of Corollary 33 the set $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a closed subset of $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ and it is hence necessarily complete, as required in case (ii) of the theorem. Moreover, if $D_N$ is closed and $V_n$ contains a constant sequence $z_0$, then the condition on the existence of a solution $(x^0, z^0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$ is automatically satisfied. Indeed, let $\overline{z} \in \mathbb{R}^n$ be such that $(z_0)_t := \overline{z}$ for all $t \in \mathbb{Z}_-$ and let $x \in D_N$ be arbitrary. Consider the sequence $\{x, F(x, \overline{z}), F(F(x, \overline{z}), \overline{z}), F(F(F(x, \overline{z}), \overline{z}), \overline{z}), \ldots\}$. The Banach Contraction-Mapping Principle (see (Shapiro, 2016, Theorem 3.2)) guarantees that this sequence converges to the unique fixed point, call it $\overline{x} \in D_N$, of the map $F(\cdot, \overline{z})$. The pair $(x^0, z_0) \in \left((D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)\right) \times V_n$, with $(x^0)_t := \overline{x}$ for all $t \in \mathbb{Z}_-$, is the solution needed in case (ii) of the theorem.

As a corollary of Theorem 7, it can be shown that reservoir systems that have by construction uniformly bounded inputs and outputs always have the ESP and the FMP and that, moreover, the FMP holds for any weighting sequence $w$.
This result was already shown in (Grigoryeva and Ortega, 2018b, Theorem 3.1).

Corollary 10 Let $M, L > 0$, let $K_M \subset (\mathbb{R}^n)^{\mathbb{Z}_-}$ and $K_L \subset (\mathbb{R}^N)^{\mathbb{Z}_-}$ be subsets of uniformly bounded sequences defined as in (1.3), and let $F : \overline{B}(0, L) \times \overline{B}(0, M) \to \overline{B}(0, L)$ be a continuous reservoir map. Assume, additionally, that $F$ is a contraction on the first entry with constant $0 < c < 1$. Then, the reservoir system associated to $F$ has the echo state property. Moreover, this system has a unique associated causal and time-invariant filter $U^F : K_M \to K_L$ that has the fading memory property with respect to any weighting sequence $w$.

Proof. Given that $\overline{B}(0, L)$ is a compact subset of $\mathbb{R}^N$, the hypothesis in part (i) of Theorem 7 and condition (3.3) guarantee that there exists a reservoir filter $U^F : K_M \to K_L$ associated to $F$ that has the fading memory property with respect to any weighting sequence that satisfies (3.3). Such a sequence always exists, as it suffices to take any geometric sequence $w_t := \lambda^t$, $t \in \mathbb{N}$, with $c < \lambda < 1$. However, as has been shown in (Grigoryeva and Ortega, 2018b, Corollary 2.7), all the weighted norms induce in the sets $K_M$ and $K_L$ the same topology, namely, the product topology, and hence if $U^F$ is continuous with respect to the topology induced by the weighted norm $\|\cdot\|_w$ then so is it with respect to the norm associated to any other weighting sequence.

Remark 11 This corollary shows that, in general, condition (3.3) is sufficient but not necessary. Indeed, if the hypotheses in the corollary are satisfied, the resulting filter $U^F$ has the fading memory property with respect to any geometric sequence $w_t := \lambda^t$, $t \in \mathbb{N}$, with $0 < \lambda < 1$, for which (see Remark 2) $L_w = 1/\lambda$. In particular, this holds true when $\lambda$ is chosen so that $0 < \lambda < c$, and hence when (3.3) is not satisfied, since in that case $c \, L_w > 1$. Additional concrete examples that show that condition (3.3) is sufficient but not necessary are provided in Section 3.1.
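The interplay in Remark 11 between the contraction constant and the choice of a geometric weighting sequence, together with the linear reservoir maps discussed in Section 3.1 below, can be illustrated numerically. The following sketch is a toy instance of our own (the matrices, sizes, and truncation are assumptions): it checks that the series $U(z)_t = \sum_{j \ge 0} A^j C z_{t-j}$ solves the linear reservoir recursion, and that a geometric $w_t = \lambda^t$ with $c < \lambda < 1$ satisfies the FMP condition $c \, L_w < 1$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear reservoir map F(x, z) = A x + C z with c = ||A||_2 = 0.7 < 1.
N, n, T = 6, 2, 300
A = rng.standard_normal((N, N))
A *= 0.7 / np.linalg.norm(A, 2)
C = rng.standard_normal((N, n))
z = rng.uniform(-1, 1, size=(T, n))

# With a zero remote past, the solution of x_t = A x_{t-1} + C z_t is the
# truncated series U(z)_t = sum_j A^j C z_{t-j}; compare with the recursion.
series = sum(np.linalg.matrix_power(A, j) @ C @ z[T - 1 - j] for j in range(T))
x = np.zeros(N)
for zt in z:
    x = A @ x + C @ zt
print(np.linalg.norm(series - x))     # ~0: the series solves the recursion

# A geometric weighting sequence w_t = lam^t with c < lam < 1 has L_w = 1/lam,
# so the sufficient FMP condition c * L_w < 1 of Theorem 7 holds.
c, lam = 0.7, 0.9
print(c * (1.0 / lam) < 1)            # True
```

Choosing instead $\lambda < c$ gives $c \, L_w > 1$, yet Corollary 10 still guarantees the FMP for bounded inputs, which is the sense in which (3.3) is sufficient but not necessary.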
We emphasize that the FMP condition (3.3) is sufficient but not necessary even in the absence of boundedness conditions like those in Corollary 10.

Another important statement that can be proved when the target of the reservoir map is a compact subset of $\mathbb{R}^N$ is that the echo state property is, in that situation, guaranteed for any input¹ in $(D_n)^{\mathbb{Z}_-}$, even though the FMP may obviously not hold in that case.

Theorem 12 (ESP for reservoir maps with compact target) Let $F : D_N \times D_n \to D_N$ be a continuous reservoir map, $D_n \subset \mathbb{R}^n$, $D_N \subset \mathbb{R}^N$, $n, N \in \mathbb{N}^+$, such that $D_N$ is a compact subset of $\mathbb{R}^N$ and $F$ is a contraction on the first entry with constant $0 < c < 1$. Then, the reservoir system associated to $F$ has the echo state property for any input in $(D_n)^{\mathbb{Z}_-}$. Let $U^F : (D_n)^{\mathbb{Z}_-} \to (D_N)^{\mathbb{Z}_-}$ be the associated reservoir filter. For any weighting sequence $w$ such that $c \, L_w < 1$, the map $U^F : (D_n)^{\mathbb{Z}_-} \to \left((D_N)^{\mathbb{Z}_-}, \|\cdot\|_w\right)$ is continuous when in $(D_n)^{\mathbb{Z}_-}$ we consider the relative topology induced by the product topology in $(\mathbb{R}^n)^{\mathbb{Z}_-}$. Moreover, if $(D_n)^{\mathbb{Z}_-} \subset \ell^w_-(\mathbb{R}^n)$ then $U^F$ has the fading memory property.

The following result shows how the FMP of the filter associated to a reservoir map established in Theorem 7 propagates to the FMP of the filter of the full reservoir system if the readout map is continuous.

Corollary 13 In the conditions of Theorem 7, let $h : D_N \to \mathbb{R}^d$ be a continuous readout map. Consider the following two cases, that correspond to the two sets of hypotheses studied in Theorem 7:

(i) If $D_N$ is a compact subset of $\mathbb{R}^N$, then there is a constant $R > 0$ such that the filter $U^F_h$ defined by $U^F_h(z)_t := h\left(U^F(z)_t\right)$, $t \in \mathbb{Z}_-$, $z \in V_n$, maps $U^F_h : (V_n, \|\cdot\|_w) \to (K_R, \|\cdot\|_w)$ and has the fading memory property.

(ii) If $(D_N)^{\mathbb{Z}_-} \cap \ell^w_-(\mathbb{R}^N)$ is a complete subset of $\left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$ and $h$ is Lipschitz continuous on $D_N$ and such that $U^F_h(z^0) \in \ell^w_-(\mathbb{R}^d)$, then the reservoir filter $U^F_h : (V_n, \|\cdot\|_w) \to \left(\ell^w_-(\mathbb{R}^d), \|\cdot\|_w\right)$ has the fading memory property.

This statement also holds true under the hypotheses in part (ii) when $\left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right)$ is replaced by $\left(\ell^\infty_-(\mathbb{R}^n), \|\cdot\|_\infty\right)$.
The resulting reservoir filter $U^F_h : (V_n, \|\cdot\|_\infty) \to \left(\ell^\infty_-(\mathbb{R}^d), \|\cdot\|_\infty\right)$ is continuous.

3.1. Examples

In the following paragraphs we show how the sufficient condition (3.3) explicitly looks for reservoir systems that are widely used and that have been shown to have universality properties in the fading memory category, both with deterministic and stochastic inputs Grigoryeva and Ortega (2018a,b); Gonon and Ortega (2018).

Linear reservoir maps. Consider the reservoir map $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ given by
$$F(x, z) = A x + c z, \quad \text{with } A \in \mathbb{M}_N, \ c \in \mathbb{M}_{N,n}. \qquad (3.5)$$
It is easy to see that $F$ is a contraction on the first entry whenever the matrix $A$ satisfies $|||A||| < 1$. In that case, using the notation in Theorem 7, $c = |||A|||$. Indeed, for any $x_1, x_2 \in \mathbb{R}^N$, $z \in \mathbb{R}^n$:
$$\left\| F(x_1, z) - F(x_2, z) \right\| = \left\| A (x_1 - x_2) \right\| \le |||A||| \left\| x_1 - x_2 \right\|.$$
We now assume that $|||A||| < 1$. The following two statements are proved in Appendix 6.8:

(i) The reservoir system associated to (3.5) has the echo state property and defines a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ that has the fading memory property with respect to any weighting sequence $w$ that satisfies the condition
$$\sum_{j=0}^{\infty} \frac{|||A|||^j}{w_j} < +\infty. \qquad (3.6)$$
The FMP condition (3.3) reads in this case as
$$|||A||| \, L_w < 1, \qquad (3.7)$$
and implies (3.6) but not vice versa.

(ii) If the inputs presented to the reservoir system associated to (3.5) are uniformly bounded, then it has the fading memory property with respect to any weighting sequence. This result was already known, as it can be easily obtained by combining (Grigoryeva and Ortega, 2018a, Corollary 11) with (Grigoryeva and Ortega, 2018b, Corollary 2.7). We obtain it here directly out of Corollary 10 by noting that for any $M > 0$,
$$F\left(\overline{B}(0, L) \times \overline{B}(0, M)\right) \subset \overline{B}(0, L), \quad \text{with } L := \frac{|||c||| \, M}{1 - |||A|||}. \qquad (3.8)$$

Echo state networks (ESN). Let $\sigma : \mathbb{R} \to [-1, 1]$ be a squashing function, that is, $\sigma$ is non-decreasing, $\lim_{x \to -\infty} \sigma(x) = -1$, and $\lim_{x \to +\infty} \sigma(x) = 1$. Moreover, assume that $L_\sigma := \sup_{x \in \mathbb{R}} \{|\sigma'(x)|\} < +\infty$.

¹ We thank Lukas Gonon for pointing this out.
Let $\boldsymbol{\sigma} : \mathbb{R}^N \to [-1, 1]^N$ be the map obtained by componentwise application of the squashing function $\sigma$. An echo state network is a reservoir system with linear readout and reservoir map given by
$$F(x, z) = \boldsymbol{\sigma}(A x + c z + \zeta), \quad \text{with } A \in \mathbb{M}_N, \ c \in \mathbb{M}_{N,n}, \ \zeta \in \mathbb{R}^N. \qquad (3.9)$$
We notice first that if $|||A||| \, L_\sigma < 1$ then $F$ is a contraction on the first component with constant $|||A||| \, L_\sigma$ (see the second part in (Grigoryeva and Ortega, 2018b, Corollary 3.2)). By construction, $F$ maps into the compact space $[-1, 1]^N \subset \mathbb{R}^N$ and hence satisfies the hypotheses in the first part of Theorem 7. Consequently, for any weighting sequence $w$ that satisfies
$$|||A||| \, L_\sigma \, L_w < 1 \qquad (3.10)$$
there exists a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ associated to $F$ that has the fading memory property with respect to $w$. By Corollary 10, this statement holds true for any $w$ when one considers uniformly bounded inputs.

Non-homogeneous state-affine systems (SAS). These systems are determined by reservoir maps $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ of the form
$$F(x, z) := p(z) x + q(z), \qquad (3.11)$$
where $p$ and $q$ are polynomials with matrix and vector coefficients, respectively, that depending on their nature determine the following two families of SAS systems:

(i) Regular SAS. $p$ and $q$ are polynomials of degrees $r$ and $s$ of the form:
$$p(z) = \sum_{\substack{i_1, \ldots, i_n \in \{0, \ldots, r\} \\ i_1 + \cdots + i_n \le r}} z_1^{i_1} \cdots z_n^{i_n} A_{i_1, \ldots, i_n}, \quad A_{i_1, \ldots, i_n} \in \mathbb{M}_N, \ z \in D_n \subset \mathbb{R}^n,$$
$$q(z) = \sum_{\substack{i_1, \ldots, i_n \in \{0, \ldots, s\} \\ i_1 + \cdots + i_n \le s}} z_1^{i_1} \cdots z_n^{i_n} B_{i_1, \ldots, i_n}, \quad B_{i_1, \ldots, i_n} \in \mathbb{M}_{N,1}, \ z \in D_n \subset \mathbb{R}^n.$$

(ii) Trigonometric SAS. We use trigonometric polynomials instead:
$$p(z) = \sum_{k} A^p_k \cos(u^p_k \cdot z) + B^p_k \sin(v^p_k \cdot z), \quad A^p_k, B^p_k \in \mathbb{M}_N, \ u^p_k, v^p_k \in \mathbb{R}^n, \ z \in D_n \subset \mathbb{R}^n,$$
$$q(z) = \sum_{k} A^q_k \cos(u^q_k \cdot z) + B^q_k \sin(v^q_k \cdot z), \quad A^q_k, B^q_k \in \mathbb{M}_{N,1}, \ u^q_k, v^q_k \in \mathbb{R}^n, \ z \in D_n \subset \mathbb{R}^n.$$

In both cases, define
$$M_p := \sup_{z \in D_n} \{|||p(z)|||\} \quad \text{and} \quad M_q := \sup_{z \in D_n} \{\|q(z)\|\}.$$
Note that for regular SAS defined by nontrivial polynomials, the set $D_n$ needs to be bounded in order for $M_p$ and $M_q$ to be finite.
Additionally, it is easy to see that $F$ is a contraction on the first entry with constant $M_p$ whenever $M_p < 1$, which is a condition that we will assume holds true in the rest of this example. Additionally, we assume that $M_q < +\infty$. Regular SAS are a generalization of the linear case that we considered in the first part of this section, and hence two statements can be proved (see Appendix 6.8) that are analogous to the ones in that part, namely:

(i) The reservoir system associated to (3.11) has the echo state property and defines a unique reservoir filter $U^F : \ell^w_-(\mathbb{R}^n) \cap (D_n)^{\mathbb{Z}_-} \to \ell^w_-(\mathbb{R}^N)$ that has the fading memory property with respect to any weighting sequence $w$ that satisfies the condition
$$\sum_{j=0}^{\infty} \frac{M_p^j}{w_j} < +\infty. \qquad (3.12)$$
The FMP condition (3.3), that in this case reads as $M_p \, L_w < 1$, implies (3.12) but not vice versa.

(ii) If the inputs presented to the reservoir system associated to (3.11) are uniformly bounded, then it has the fading memory property with respect to any weighting sequence. We obtain this result out of Corollary 10 by noting that for any $M > 0$,
$$F\left(\overline{B}(0, L) \times \overline{B}(0, M)\right) \subset \overline{B}(0, L), \quad \text{with } L := \frac{M_q}{1 - M_p}.$$
We emphasize that, in the case of regular SAS, this is the only situation in which one can have $M_p < 1$ and $M_q < +\infty$.

4. Differentiability in reservoir filters with unbounded inputs

We now extend the results in the previous section from continuity to differentiability. More specifically, we characterize the situations in which one can prove the existence and obtain the differentiability of reservoir filters out of the differentiability properties of the maps that define the reservoir system. This approach gives us in passing new techniques to establish the echo state and the fading memory properties of reservoir systems. In particular, differentiability being a local property, we show how systems that do not globally have any of these properties may still have them in a neighborhood of certain types of inputs.
A phenomenon of this type has also been explored in Manjunath and Jaeger (2013). It is worth emphasizing that the study of the differentiability properties of fading memory reservoir filters naturally calls for the handling of unbounded inputs, since the definition of the Fréchet derivative requires filters to be defined on open subsets of the Banach space $\ell^w_-(\mathbb{R}^n)$, which always contain unbounded sequences (see the first part of Lemma 32 in the appendices).

4.1. Differentiable reservoir filters associated to differentiable reservoir maps

The first result in this section shows that, under certain conditions, the echo state and the fading memory properties associated to differentiable reservoir systems locally persist, that is, if a reservoir system has a unique filter associated to a specific input and it is continuous and differentiable at it, then the same properties hold for neighboring inputs.

Theorem 14 (Local persistence of the ESP and FMP properties) Let $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$. Suppose that $F$ is of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and that the corresponding reservoir system (1.1) has a solution $(x^0, z^0) \in \ell^w_-(\mathbb{R}^N) \times \ell^w_-(\mathbb{R}^n)$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$. Suppose, additionally, that
$$L_F := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||DF(x, z)|||\} < +\infty. \qquad (4.1)$$
Define $L_{F_x}(x^0, z^0) := \sup_{t \in \mathbb{Z}_-} \left\{ \left|\left|\left| D_x F(x^0_{t-1}, z^0_t) \right|\right|\right| \right\}$ and suppose that
$$L_{F_x}(x^0, z^0) \, L_w < 1. \qquad (4.2)$$
Then there exist open time-invariant neighborhoods $V_{x^0}$ and $V_{z^0}$ of $x^0$ and $z^0$ in $\ell^w_-(\mathbb{R}^N)$ and $\ell^w_-(\mathbb{R}^n)$, respectively, such that the reservoir system associated to $F$ with inputs in $V_{z^0}$ has the echo state property and hence determines a unique causal and time-invariant reservoir filter $U^F : (V_{z^0}, \|\cdot\|_w) \to (V_{x^0}, \|\cdot\|_w)$. Moreover, $U^F$ is differentiable at all the points of the form $T_t(z^0)$, $t \in \mathbb{Z}_-$, it is locally Lipschitz continuous on $V_{z^0}$, and it hence has the fading memory property.

Remark 15 We refer to (4.2) as the persistence condition.
We emphasize that this inequality puts into relation the solution $(x^0, z^0)$, whose persistence we are studying, with the weighting sequence $w$. In particular, that relation tells us that solutions are more likely to persist with respect to weighting sequences that decay more slowly (that is, for which $L_w$ is smaller).

Remark 16 There is a situation where the persistence condition is particularly easy to verify, namely, when the solution of the reservoir system is constructed as a constant sequence coming from a fixed point of the reservoir map, that is, $(x^0, z^0) \in \mathbb{R}^N \times \mathbb{R}^n$ such that $F(x^0, z^0) = x^0$. In that case $L_{F_x}(x^0, z^0) = |||D_x F(x^0, z^0)|||$.

Remark 17 The persistence condition (4.2) can be interpreted as a stability condition for the reservoir system determined by $F$ at the solution $(x^0, z^0)$ with respect to perturbations in $\ell^w_-(\mathbb{R}^n)$. The persistence of solutions under stability conditions of that type has been thoroughly studied for many types of dynamical systems (see, for instance, Montaldi (1997b,a); Ortega and Ratiu (1997); Chossat et al. (2003)).

Remark 18 The derivative $DU^F(z^0)$ at $z^0$ of the locally defined reservoir filter $U^F$ is determined by the differentiation of the relation (3.1). Indeed, for any $u \in \ell^w_-(\mathbb{R}^n)$ and $t \in \mathbb{Z}_-$, the directional derivative $DU^F(z^0) \cdot u$ is determined by the recursions
$$\left(DU^F(z^0) \cdot u\right)_t = DF\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left( \left(DU^F(z^0) \cdot u\right)_{t-1}, u_t \right) \qquad (4.3)$$
$$= D_x F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left(DU^F(z^0) \cdot u\right)_{t-1} + D_z F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot u_t. \qquad (4.4)$$
This relation implies, in particular, that $DU^F(z^0) : \ell^w_-(\mathbb{R}^n) \to \ell^w_-(\mathbb{R}^N)$ is a bounded linear operator and that
$$\left|\left|\left| DU^F(z^0) \right|\right|\right|_w \le \frac{L_{F_z}(x^0, z^0)}{1 - L_{F_x}(x^0, z^0) L_w}, \qquad (4.5)$$
where $L_{F_z}(x^0, z^0) := \sup_{t \in \mathbb{Z}_-} \left\{ \left|\left|\left| D_z F(x^0_{t-1}, z^0_t) \right|\right|\right| \right\}$. Indeed, notice first that for any $t \in \mathbb{Z}_-$,
$$\left|\left|\left| D_x F(x^0_{t-1}, z^0_t) \right|\right|\right| \le \left|\left|\left| DF(x^0_{t-1}, z^0_t) \right|\right|\right| \quad \text{and} \quad \left|\left|\left| D_z F(x^0_{t-1}, z^0_t) \right|\right|\right| \le \left|\left|\left| DF(x^0_{t-1}, z^0_t) \right|\right|\right|, \qquad (4.6)$$
which, using hypothesis (4.1), implies that
$$L_{F_x}(x^0, z^0) \le L_F < +\infty \quad \text{and} \quad L_{F_z}(x^0, z^0) \le L_F < +\infty. \qquad (4.7)$$
Now, for any $u \in \ell^w_-(\mathbb{R}^n)$, the relation (4.3) and the inequalities (4.7) imply that
$$\begin{aligned}
\left\| DU^F(z^0) \cdot u \right\|_w &= \sup_{t \in \mathbb{Z}_-} \left\{ \left\| DF\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left( \left(DU^F(z^0) \cdot u\right)_{t-1}, u_t \right) \right\| w_{-t} \right\} \\
&\le \sup_{t \in \mathbb{Z}_-} \left\{ \left( \left\| D_x F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| + \left\| D_z F\left(U^F(z^0)_{t-1}, z^0_t\right) \cdot u_t \right\| \right) w_{-t} \right\} \\
&\le L_{F_x}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \left\| \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| w_{-t} \right\} + L_{F_z}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \| u_t \| w_{-t} \right\} \\
&\le L_{F_x}(x^0, z^0) \sup_{t \in \mathbb{Z}_-} \left\{ \left\| \left(DU^F(z^0) \cdot u\right)_{t-1} \right\| w_{-(t-1)} \frac{w_{-t}}{w_{-(t-1)}} \right\} + L_{F_z}(x^0, z^0) \| u \|_w \\
&\le L_{F_x}(x^0, z^0) L_w \left\| DU^F(z^0) \cdot u \right\|_w + L_{F_z}(x^0, z^0) \| u \|_w,
\end{aligned}$$
which implies (4.5).

The previous theorem proves that, when the persistence condition (4.2) is satisfied at a preexisting solution of a reservoir system, this system has a unique fading memory (and differentiable) filter associated to neighboring inputs. In the next results we show that a global version of that condition ensures, first, that globally defined reservoir filters exist and, second, that those filters are differentiable and hence have the fading memory property.

Theorem 19 (Characterization of global reservoir filter differentiability) Let $F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$.

(i) Suppose that $F$ satisfies (4.1) and define
$$L_{F_x} := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||D_x F(x, z)|||\} \quad \text{and} \quad L_{F_z} := \sup_{(x, z) \in \mathbb{R}^N \times \mathbb{R}^n} \{|||D_z F(x, z)|||\}.$$
If the reservoir system (1.1) associated to $F$ has a solution $(x^0, z^0) \in \ell^w_-(\mathbb{R}^N) \times \ell^w_-(\mathbb{R}^n)$, that is, $x^0_t = F(x^0_{t-1}, z^0_t)$, for all $t \in \mathbb{Z}_-$, and
$$L_{F_x} L_w < 1, \qquad (4.8)$$
then it has the echo state property and hence determines a unique causal and time-invariant reservoir filter $U^F : \left(\ell^w_-(\mathbb{R}^n), \|\cdot\|_w\right) \to \left(\ell^w_-(\mathbb{R}^N), \|\cdot\|_w\right)$. Moreover, $U^F$ is differentiable and Lipschitz continuous on $\ell^w_-(\mathbb{R}^n)$ with Lipschitz constant $L_{U^F}$ given by
$$L_{U^F} := \frac{L_{F_z}}{1 - L_{F_x} L_w} \quad \text{and} \quad \left|\left|\left| DU^F(z) \right|\right|\right|_w \le \frac{L_{F_z}}{1 - L_{F_x} L_w}, \quad \text{for any } z \in \ell^w_-(\mathbb{R}^n). \qquad (4.9)$$
The filter $U^F$ hence has the fading memory property.
Grigoryeva and Ortega (ii) Conversely, let Vn ℓw (Rn) be an open and time-invariant subset of ℓw (Rn) and assume that the reservoir system (1.1) associated to F has a unique causal and time-invariant reservoir filter U F : Vn ℓw (RN) that is differentiable at z0 ℓw (Rn). Then, t Z Dx F U F (z0)t 1, z0 t ) T1 < 1, (4.10) where ρ stands for the spectral radius. This in turn implies that Dx F U F (z0) 1, z0 0 Dx F U F (z0) k, z0 k+1 1 = 0. (4.11) Examples 20 We briefly examine the form that the hypotheses of Theorem 19 take for the three families of reservoir systems that we analyzed in Section 3.1: (i) Linear reservoir maps. In this case, for any x RN and z Rn, DF(x, z) = (A | c) , Dx F(x, z) = A, and Dz F(x, z) = c. Consequently LF = |||(A | c)|||, LFx = |||A|||, LFz = |||c|||. The condition (4.1) is always satisfied and in this case the sufficient differentiability condition (4.8) amounts to |||A|||Lw < 1 that, as we saw in (3.7), is the same as the sufficient condition for the FMP to hold. (ii) Echo state networks (ESN). Consider an ESN constructed using a squashing function σ that satisfies that Lσ := supx R{|σ (x)|} < + . In this case, for any x RN and z Rn, DF(x, z) = Dσ(Ax + cz + ζ) (A | c) , Dx F(x, z) = Dσ(Ax + cz + ζ) A, Dz F(x, z) = Dσ(Ax + cz + ζ) c. Notice that |||Dσ(x)||| < Lσ < + , for any x RN, and hence |||DF(x, z)||| Lσ|||(A | c)||| < + , |||Dx F(x, z)||| Lσ|||A||| < + , |||Dz F(x, z)||| Lσ|||c||| < + , for any x RN and z Rn. This implies, in particular, that in this case LF < + , LFx < + , LFz < + , and the sufficient differentiability condition (4.8) is implied by the inequality |||A|||LσLw < 1. (4.12) (iii) Non-homogeneous state-affine systems (SAS). A straightforward computations shows that for any x RN and z Rn, DF(x, z) = (p(z), Dp(z)( )x + Dq(z)( )) , Dx F(x, z) = p(z), Dz F(x, z) = Dp(z)( )x + Dq(z)( ). 
(4.13)

As we already pointed out, for regular SAS defined by nontrivial polynomials the norm \(|||p(z)|||\) is not bounded on \(\mathbb{R}^n\) and hence \(L_{Fx} = \sup_{(x,z) \in \mathbb{R}^N \times \mathbb{R}^n}\{|||D_xF(x,z)|||\} = \sup_{z \in \mathbb{R}^n}\{|||p(z)|||\} = M_p\) is not finite; the same applies to \(L_F\), which implies that in this case neither (4.1) nor (4.8) can be satisfied. This is not the case for trigonometric SAS, for which the norms of the derivatives in (4.13) are bounded on their domains which, in particular, implies that \(L_F < +\infty\), \(L_{Fx} < +\infty\), and \(L_{Fz} < +\infty\). Moreover, the sufficient differentiability condition (4.8) in this case reads \(L_{Fx} L_w = \sup_{z \in \mathbb{R}^n}\{|||p(z)|||\}\, L_w < 1\).

Remark 21 We recall here an example that we introduced in Section 3.1 to show that, as was already the case with the FMP condition (3.3) in Theorem 7, the differentiability condition (4.8) is sufficient but not necessary. Indeed, consider a linear system with matrix \(A\) given by
\[
A = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}, \quad\text{with } a > 0.
\]
Given that \(|||A||| = a\), the reservoir map determined by \(A\) is not necessarily a contraction on the first entry. Nevertheless, the nilpotency of \(A\) implies that the reservoir system associated to (3.5) always has a solution for any input \(z \in (\mathbb{R}^2)^{\mathbb{Z}_-}\) and hence has the ESP and induces a filter \(U : (\mathbb{R}^2)^{\mathbb{Z}_-} \to (\mathbb{R}^2)^{\mathbb{Z}_-}\) given by \(U(z)_t := z_t + Az_{t-1}\), \(t \in \mathbb{Z}_-\), or, equivalently, \(U = I_{(\mathbb{R}^2)^{\mathbb{Z}_-}} + \big(\prod_{t \in \mathbb{Z}_-} A\big) \circ T_1\). Consider now any weighting sequence \(w\) with finite inverse decay ratio \(L_w\). Then the restriction of \(U\) to \(\ell^\infty_w(\mathbb{R}^2)\) always maps into \(\ell^\infty_w(\mathbb{R}^2)\), has the FMP, and is differentiable. Indeed, it is easy to show using the linearity of the filter that \(U = DU(z)\) for any \(z \in \ell^\infty_w(\mathbb{R}^2)\) and that
\[
|||U|||_w = |||DU(z)|||_w \leq 1 + a L_w. \tag{4.14}
\]
Note that in this case \(L_{Fx} = |||A||| = a\) and, as (4.14) shows the differentiability of \(U\) with respect to any weighting sequence with finite \(L_w\), we can conclude that the condition (4.8) is not necessary for filter differentiability.
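The nilpotent example in Remark 21 is easy to verify numerically: even though \(|||A||| = a\) can be made arbitrarily large, the state recursion \(x_t = Ax_{t-1} + z_t\) forgets its initial condition after two iterations and realizes the filter \(U(z)_t = z_t + Az_{t-1}\). A minimal sketch (the value \(a = 3\), the input length, and the random input are our own illustrative choices, not from the paper):

```python
import numpy as np

a = 3.0                        # |||A||| = a > 1, so A is not a contraction
A = np.array([[0.0, a],
              [0.0, 0.0]])     # nilpotent: A @ A == 0
rng = np.random.default_rng(0)
z = rng.normal(size=(50, 2))   # arbitrary input sequence

def run(x0):
    """Iterate x_t = A x_{t-1} + z_t from the initial condition x0."""
    xs, x = [], x0
    for zt in z:
        x = A @ x + zt
        xs.append(x)
    return np.array(xs)

traj1 = run(np.array([100.0, -50.0]))
traj2 = run(np.array([-7.0, 3.0]))

# A @ A == 0 wipes out the initial condition after two iterations, so the
# system has the ESP and realizes the filter U(z)_t = z_t + A z_{t-1}.
diff = np.abs(traj1[1:] - traj2[1:]).max()
expected = z[1:] + z[:-1] @ A.T
```

Since \(A^2 = 0\), any two state trajectories coincide from the second step onwards, which is precisely the echo state property without any contraction assumption.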
The following corollary combines the previous theorem with a condition on the readout map that guarantees that the filter associated to the resulting reservoir system is differentiable.

Corollary 22 Consider a reservoir system determined by a reservoir map \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) of class \(C^1(\mathbb{R}^N \times \mathbb{R}^n)\) and by a readout map \(h : \mathbb{R}^N \to \mathbb{R}^d\) that is also of class \(C^1(\mathbb{R}^N)\). Assume, additionally, that \(F\) satisfies the hypotheses in part (i) of Theorem 19 and that \(h\) is such that
\[
c_h := \sup_{x \in \mathbb{R}^N}\{|||Dh(x)|||\} < +\infty, \tag{4.15}
\]
and that the sequence \(y^0 := \big(h(x^0_t)\big)_{t \in \mathbb{Z}_-} = \big(h(U_F(z^0)_t)\big)_{t \in \mathbb{Z}_-} \in \ell^\infty_w(\mathbb{R}^d)\). Then, the reservoir filter \(U^F_h : (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^d), \|\cdot\|_w)\) is differentiable at each point in its domain and hence has the fading memory property.

Proof. Define first the map
\[
H := \prod_{t \in \mathbb{Z}_-} h \circ p_t : \ell^\infty_w(\mathbb{R}^N) \to (\mathbb{R}^d)^{\mathbb{Z}_-}. \tag{4.16}
\]
Given that \(U^F_h = H \circ U_F\) and that by Theorem 19 the filter \(U_F\) is differentiable, it suffices to prove that \(H\) is differentiable. This is a consequence of part (iii) in Lemma 36 and the hypothesis (4.15). Indeed, let \(H_t := h \circ p_t\), \(t \in \mathbb{Z}_-\), and notice that by the first part of Lemma 1,
\[
\sup_{x \in \ell^\infty_w(\mathbb{R}^N)}\{|||DH_t(x)|||\} \leq \sup_{x_t \in \mathbb{R}^N}\{|||Dh(x_t)|||\}\, |||p_t|||_w \leq \frac{c_h}{w_{-t}}.
\]
Now, as \(\|(c_h/w_{-t})_{t \in \mathbb{Z}_-}\|_w = c_h < +\infty\) and by hypothesis \(H(x^0) \in \ell^\infty_w(\mathbb{R}^d)\), it follows from Lemma 36 that \(H\) maps into \(\ell^\infty_w(\mathbb{R}^d)\) and that it is differentiable, as required. ∎

On some occasions it is important to determine whether a given filter is invertible. The differentiability of reservoir filters associated to reservoir systems with differentiable reservoir and readout maps, established in the previous result, allows us to use the inverse function theorem to formulate a sufficient invertibility condition. As we see in the next statement, this criterion can be written down entirely in terms of the derivatives of the reservoir and the readout maps.
Corollary 23 Consider a reservoir system determined by a reservoir map \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) and a readout map \(h : \mathbb{R}^N \to \mathbb{R}^d\) that are of class \(C^1(\mathbb{R}^N \times \mathbb{R}^n)\) and \(C^1(\mathbb{R}^N)\), respectively, and that additionally satisfy the conditions spelled out in the statement of Corollary 22. Let \(z \in \ell^\infty_w(\mathbb{R}^n)\), \(x := U_F(z) \in \ell^\infty_w(\mathbb{R}^N)\), and \(y := U^F_h(z) \in \ell^\infty_w(\mathbb{R}^d)\), and suppose that the map
\[
DH(x) \circ \Big(I_{\ell^\infty_w(\mathbb{R}^N)} - \Big(\prod_{t \in \mathbb{Z}_-} D_xF(x_{t-1}, z_t)\Big) \circ T_1\Big)^{-1} \circ \prod_{t \in \mathbb{Z}_-} D_zF(x_{t-1}, z_t) : \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_w(\mathbb{R}^d) \tag{4.17}
\]
is a linear homeomorphism (continuous linear bijection with continuous inverse), with \(H\) as defined in (4.16). Then there exist open neighborhoods \(V_z \subset \ell^\infty_w(\mathbb{R}^n)\) and \(V_y \subset \ell^\infty_w(\mathbb{R}^d)\) of \(z\) and \(y\), respectively, such that the restriction of the filter \(U^F_h|_{V_z} : V_z \to V_y\) has an inverse \(\big(U^F_h|_{V_z}\big)^{-1}\). When the condition (4.17) is satisfied for all the solutions \((z, U_F(z))\) of the reservoir system determined by \(F\), the reservoir filter \(U^F_h\) admits a global inverse \(\big(U^F_h\big)^{-1} : U^F_h\big(\ell^\infty_w(\mathbb{R}^n)\big) \to \ell^\infty_w(\mathbb{R}^n)\).

Proof. It is a straightforward consequence of the inverse function theorem as formulated in (Schechter, 1997, page 670) (see also Ver Eecke (1974)) applied to the Fréchet derivative of \(U^F_h = H \circ U_F\) at the point \(z \in \ell^\infty_w(\mathbb{R}^n)\). It is easy to see using the chain rule and (6.69) (which is in turn a consequence of (4.4)) that this derivative coincides with the operator in (4.17) whose invertibility we require. ∎

4.2. The local versus the global echo state property

Theorem 14 emphasizes the local nature of both the echo state and the fading memory properties by providing a sufficient condition that ensures the existence of a locally defined causal and time-invariant filter around a given solution that is shown to have the FMP. In contrast with this local approach, Theorem 19 characterizes the existence of a globally defined differentiable filter associated to a given reservoir system, which hence satisfies the FMP and the ESP for any input. Even though the conditions in Theorems 14 and 19 are very alike, the latter is much stronger than the former.
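The global sufficient condition and the Lipschitz bound that comes with it are straightforward to check numerically. The sketch below builds a tanh ESN (so \(L_\sigma = 1\)), uses a geometric weighting sequence \(w_t = \lambda^t\), whose inverse decay ratio is \(L_w = 1/\lambda\), checks the sufficient condition (4.8) through the bound \(L_{Fx} \leq L_\sigma |||A|||\), and verifies empirically that the Lipschitz bound (4.9) holds for a pair of nearby inputs; all the concrete sizes and constants are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, T = 40, 3, 200
lam = 0.9                          # geometric weighting sequence w_t = lam**t
L_w = 1.0 / lam                    # inverse decay ratio sup_t { w_t / w_{t+1} }

A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)    # rescale so that |||A||| = 0.8
C = rng.normal(size=(N, n))
L_Fx = np.linalg.norm(A, 2)        # for tanh, L_sigma = 1, so L_Fx <= |||A|||
L_Fz = np.linalg.norm(C, 2)

cond = L_Fx * L_w                  # sufficient condition (4.8): must be < 1
L_UF = L_Fz / (1.0 - cond)         # Lipschitz constant bound (4.9)

def esn(z):
    """Run x_t = tanh(A x_{t-1} + C z_t) from the zero state."""
    x, xs = np.zeros(N), []
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
        xs.append(x)
    return np.array(xs)

def wnorm(seq):
    """Weighted sup norm over a finite window; the most recent entry has weight 1."""
    w = lam ** np.arange(len(seq))[::-1]
    return (np.linalg.norm(seq, axis=1) * w).max()

z1 = rng.normal(size=(T, n))
z2 = z1 + 0.01 * rng.normal(size=(T, n))
ratio = wnorm(esn(z1) - esn(z2)) / wnorm(z1 - z2)   # empirical Lipschitz quotient
```

Because both trajectories start from the same state, the recursion argument in the proof of Theorem 14 applies verbatim to the finite window, so the quotient `ratio` is guaranteed to stay below `L_UF`.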
In the following paragraphs we illustrate, with a family of ESNs of the type introduced in Section 3.1, how it is possible to violate the global condition of Theorem 19 and nevertheless find solutions of such reservoir systems around which one can locally define FMP reservoir filters. This example illustrates how the ESP and the FMP are structural features of a reservoir system when considered globally but are mostly input-dependent when considered only locally. This important observation was already made in Manjunath and Jaeger (2013) where, using tools coming from the theory of non-autonomous dynamical systems, sufficient conditions have been formulated (see, for instance, (Manjunath and Jaeger, 2013, Theorem 2)) that ensure the ESP in connection with a given specific input. The differentiability conditions that we impose on our reservoir systems allow us to draw similar conclusions and, additionally, to automatically conclude the FMP of the resulting locally defined reservoir filters.

Consider the one-dimensional echo state map \(F : \mathbb{R} \times \mathbb{R} \to \mathbb{R}\), where
\[
F(x,z) := \sigma(ax + z), \quad\text{with } a \in \mathbb{R} \text{ and } \sigma(x) := \frac{x}{\sqrt{1 + x^2}}. \tag{4.18}
\]
The sigmoid function \(\sigma\) in this expression has been chosen so that we can provide algebraic expressions in the following developments. Similar conclusions could nevertheless be drawn using other popular squashing functions. The function \(\sigma\) maps the real line into the interval \([-1, 1]\) and it is easy to see, using the notation introduced in Examples 20, that \(L_\sigma := \sup_{x \in \mathbb{R}}\{|\sigma'(x)|\} = 1\). Moreover, the one-dimensional character of the system implies that, in this case,
\[
L_{Fx} = |a|. \tag{4.19}
\]
Consequently, by Lemma 43, the reservoir map \(F\) is a contraction on the first entry if and only if \(|a| < 1\), in which case, by Theorem 7, the associated ESN has the ESP and the FMP with respect to any input in \(\ell^\infty_w(\mathbb{R})\), where \(w\) is a weighting sequence that satisfies
\[
|a| L_w < 1. \tag{4.20}
\]
The FMP holds with respect to any sequence \(w\) if we consider uniformly bounded inputs, by Corollary 10. Moreover, a well-known result for ESNs due to H. Jaeger (see (Jaeger, 2010, Proposition 3)) shows that the ESP cannot be satisfied whenever
\[
|a| > 1. \tag{4.21}
\]
Additionally, the global sufficient differentiability condition (4.8) in Theorem 19 shows that the condition (4.20) also ensures that the ESP filter is differentiable.

We now prove, using Theorem 14, the existence of locally defined FMP filters associated to this ESN in a neighborhood of certain inputs, even when condition (4.21) is satisfied which, as we already mentioned, prevents the global existence of such objects. Notice first that the solutions of the equation \(\sigma(ax) = x\), \(x \in \mathbb{R}\), are characterized by the relation
\[
a^2x^2 = x^2(a^2x^2 + 1) \tag{4.22}
\]
that has as solutions
\[
x^0 = 0, \qquad x_a^{\pm} = \pm\frac{\sqrt{a^2 - 1}}{a},
\]
where the solutions in the second expression obviously exist and are different from the first one only when \(|a| > 1\), a condition that we assume holds true in the rest of the section. The condition (4.22) implies that the constant sequences \((x^0, z^0)\) and \((x_a^{\pm}, z^0)\) defined by \((x^0, z^0)_t := (x^0, 0)\) and \((x_a^{\pm}, z^0)_t := (x_a^{\pm}, 0)\), for any \(t \in \mathbb{Z}_-\), are solutions of the reservoir system determined by \(F\). Moreover, in the notation of Theorem 14, it is easy to see that
\[
L_{Fx}(x^0, z^0) = |a| > 1 \quad\text{and}\quad L_{Fx}(x_a^{\pm}, z^0) = \frac{1}{a^2} < 1.
\]
The persistence condition (4.2) in that result implies that for any weighting sequence that satisfies \(L_w/a^2 < 1\) there exist open time-invariant neighborhoods \(V_{x_a^{\pm}}\) and \(V_{z^0}\) of \(x_a^{\pm}\) and \(z^0\) in \(\ell^\infty_w(\mathbb{R}^N)\) and \(\ell^\infty_w(\mathbb{R}^n)\), respectively, such that the reservoir system associated to \(F\) with inputs in \(V_{z^0}\) has the echo state property and hence determines a unique causal, time-invariant, and FMP reservoir filter \(U_F : (V_{z^0}, \|\cdot\|_w) \to (V_{x_a^{\pm}}, \|\cdot\|_w)\).
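These computations are easily checked numerically; the sketch below (with the illustrative choice \(a = 2\), which is our own) verifies that \(x_a^{\pm}\) are indeed fixed points of \(x \mapsto \sigma(ax)\) and that the local contraction constant at them equals \(1/a^2 < 1\), while at the origin it equals \(|a| > 1\):

```python
import numpy as np

def sigma(x):
    return x / np.sqrt(1.0 + x**2)

def dsigma(x):
    return (1.0 + x**2) ** (-1.5)

a = 2.0                                   # |a| > 1: the global ESP fails (Jaeger)
x_plus = np.sqrt(a**2 - 1.0) / a          # nonzero fixed point of sigma(a x) = x
x_zero = 0.0

# check that x_plus solves x = sigma(a x + 0)
fp_err = abs(sigma(a * x_plus) - x_plus)

# local contraction constants L_Fx(x, 0) = |a sigma'(a x)|
L_at_zero = abs(a * dsigma(a * x_zero))   # = |a| > 1: no persistence at the origin
L_at_plus = abs(a * dsigma(a * x_plus))   # = 1/a**2 < 1: Theorem 14 applies
```

The persistence condition therefore fails at the trivial constant solution but holds at the nonzero ones, which is exactly the input-dependence of the local ESP discussed above.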
4.3. Remote past input independence and the state forgetting property for unbounded inputs

In Section 2.3 we saw how fading memory filters presented with uniformly bounded inputs exhibit what we called the uniform input forgetting property. An analysis of the proof of the main result in that section, namely Theorem 6, shows that the compactness of the space of inputs guaranteed the existence of a modulus of continuity for the filter, which ensured the validity of the input forgetting property and, moreover, made it uniform. In the context of reservoir systems, we saw in Theorems 7 and 19 that there are very weak hypotheses that, even when the inputs are not uniformly bounded, guarantee that the associated reservoir filters are Lipschitz and hence have a modulus of continuity. This allows us to prove an input forgetting property in that more general context.

Theorem 24 (Input forgetting property for FMP reservoir filters) Let \(F : D_N \times D_n \to D_N\) be a reservoir map, where \(D_n \subset \mathbb{R}^n\), \(D_N \subset \mathbb{R}^N\), \(n, N \in \mathbb{N}_+\). Assume that the hypotheses of Theorem 7 part (ii) (plus \(F\) Lipschitz on the second component) or of Theorem 19 part (i) are satisfied with respect to a weighting sequence \(w\) such that \(D_w < 1\). Let \(U_F : (V_n, \|\cdot\|_w) \to ((D_N)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)\) be the associated causal and time-invariant reservoir filter (\(V_n \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) under the hypotheses of Theorem 7; \(V_n = \ell^\infty_w(\mathbb{R}^n)\) and \(D_N = \mathbb{R}^N\) under the hypotheses of Theorem 19). Then, for any \(u, v \in \ell^\infty_w(\mathbb{R}^n)\) and \(z \in (D_n)^{\mathbb{N}}\) we have that
\[
\lim_{t \to +\infty} \|U_F(uz)_t - U_F(vz)_t\| = 0. \tag{4.23}
\]
If \(D_N\) is compact then the convergence in (4.23) is uniform in \(u\), \(v\), and \(z\) in the sense that there exists a monotonously decreasing sequence \(w^{U_F}\) with zero limit such that for all \(u\), \(v\), \(z\), and \(t \in \mathbb{N}\),
\[
\|U_F(uz)_t - U_F(vz)_t\| \leq w^{U_F}_t. \tag{4.24}
\]
Proof. It mimics the proof of Theorem 6 using as modulus of continuity the map \(\omega_{U_F}(t) := L_{U_F}\, t\), \(t \geq 0\), where \(L_{U_F}\) is the Lipschitz constant whose existence is ensured by the hypotheses of Theorem 7 or 19 and is given by (3.4) or by (4.9), respectively. ∎
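Theorem 24 can be illustrated with a simple simulation: two trajectories driven by different remote pasts but the same recent input converge to each other. The sketch below uses a tanh ESN whose connectivity matrix is scaled to be a contraction (\(|||A||| = 0.7\)); all sizes and the random inputs are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 20, 2
A = rng.normal(size=(N, N))
A *= 0.7 / np.linalg.norm(A, 2)     # |||A||| = 0.7 < 1: contraction on the states
C = rng.normal(size=(N, n))

def run(z):
    """Iterate x_t = tanh(A x_{t-1} + C z_t) from the zero state."""
    x, xs = np.zeros(N), []
    for zt in z:
        x = np.tanh(A @ x + C @ zt)
        xs.append(x)
    return np.array(xs)

past_u = rng.normal(size=(30, n))          # remote past u
past_v = rng.normal(size=(30, n)) + 5.0    # a very different remote past v
common = rng.normal(size=(60, n))          # shared recent input z

out_u = run(np.vstack([past_u, common]))
out_v = run(np.vstack([past_v, common]))
gap = np.linalg.norm(out_u - out_v, axis=1)
initial_gap, final_gap = gap[30], gap[-1]  # gap at the start vs the end of the common segment
```

Once the inputs coincide, the gap contracts by at least the factor 0.7 per step (tanh is 1-Lipschitz), so it decays geometrically to zero regardless of the pasts.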
Remark 25 In Remark 41 in the appendices it is shown how Theorem 7 can be extended to continuous reservoir systems with inputs and outputs in \(\ell^{p,w}(\mathbb{R}^n)\) and \(\ell^{p,w}(\mathbb{R}^N)\), respectively. In particular, it is shown that the resulting filters are Lipschitz and hence have a non-trivial modulus of continuity. This implies that a result analogous to Theorem 24 can be proved for such systems, which could hence also be referred to as fading memory from a dynamical point of view.

When filters are differentiable, there is one more way to measure how they forget inputs, simply by looking at their partial derivatives with respect to past input components. The result is a differential input forgetting property that, unlike Theorem 24, can be formulated in a uniform way even when the inputs are not uniformly bounded.

Theorem 26 (Differential uniform input forgetting property) Assume that the hypotheses of Theorem 19 (i) are satisfied. Let \(D_{z^i_t}H_F(z) \in \mathbb{R}^N\) be the partial derivative of the reservoir functional \(H_F : \ell^\infty_w(\mathbb{R}^n) \to \mathbb{R}^N\) with respect to the \(i\)-th component of the \(t\)-th entry of \(z \in \ell^\infty_w(\mathbb{R}^n)\). Then, there exists a monotonously decreasing sequence \(w^F\) with zero limit such that, for any \(t \in \mathbb{Z}_-\),
\[
\|D_{z^i_t}H_F(z)\| \leq w^F_{-t}, \quad\text{for any } z \in \ell^\infty_w(\mathbb{R}^n) \text{ and } i \in \{1, \ldots, n\}. \tag{4.25}
\]
Proof. Let \(e_{i,t} := (\ldots, 0, e_i, 0, \ldots, 0) \in \ell^\infty_w(\mathbb{R}^n)\), where the vector \(e_i\) is the \(i\)-th canonical unit vector in \(\mathbb{R}^n\) placed in the \(t\)-th position. Then, since \(\|e_{i,t}\|_w = w_{-t}\), we have by (4.9) and for any \(z \in \ell^\infty_w(\mathbb{R}^n)\) that
\[
\|D_{z^i_t}H_F(z)\| = \|DH_F(z)(e_{i,t})\| \leq |||DH_F(z)|||_w\, \|e_{i,t}\|_w = |||p_0 \circ DU_F(z)|||_w\, w_{-t} \leq \frac{L_{Fz}}{1 - L_{Fx}L_w}\, w_{-t},
\]
which proves (4.25) by setting \(w^F_t := \frac{L_{Fz}}{1 - L_{Fx}L_w}\, w_t\), \(t \in \mathbb{N}\). ∎

Apart from the filters that reservoir maps define when they have the echo state property, we can also use these objects to define controlled forward-looking dynamical systems and flows.
Indeed, given a reservoir map \(F : D_N \times D_n \to D_N\), we denote by \(F_F : (D_n)^{\mathbb{N}_+} \times D_N \to (D_N)^{\mathbb{N}_+}\) the reservoir flow associated to \(F\) that is uniquely determined by the recurrence relations:
\[
\begin{aligned}
F_F(z, x_0)_1 &= F(x_0, z_1), \quad\text{with } z \in (D_n)^{\mathbb{N}_+},\ x_0 \in D_N,\\
F_F(z, x_0)_t &= F(F_F(z, x_0)_{t-1}, z_t), \quad t > 1.
\end{aligned} \tag{4.26}
\]
The value \(x_0 \in D_N\) is called the initial condition of the path \(F_F(z, x_0) \in (D_N)^{\mathbb{N}_+}\) associated to the input or control sequence \(z \in (D_n)^{\mathbb{N}_+}\). As we saw in Theorems 7 and 19, the contraction property on the first component of a reservoir map is closely related to the ESP and the FMP of the resulting reservoir filter and, in passing (see Theorem 24), to the input forgetting property. The next result shows that something similar happens with reservoir flows associated to contracting reservoir maps, as they forget the influence of the initial conditions that are used to create the paths. This feature is referred to as the state forgetting property in Jaeger (2010).

Theorem 27 (State forgetting property for contracting reservoir flows) Let \(F : D_N \times D_n \to D_N\) be a reservoir map, where \(D_n \subset \mathbb{R}^n\), \(D_N \subset \mathbb{R}^N\), \(n, N \in \mathbb{N}_+\), and suppose that \(F\) is a contraction on the first component. Given an input sequence \(z \in (D_n)^{\mathbb{N}_+}\), the reservoir flow \(F_F : (D_n)^{\mathbb{N}_+} \times D_N \to (D_N)^{\mathbb{N}_+}\) associated to \(F\) satisfies:
\[
\lim_{t \to +\infty} \|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| = 0, \quad\text{for any } x_0, \overline{x}_0 \in D_N. \tag{4.27}
\]
If \(D_N\) is compact then the convergence in (4.27) is uniform in \(z\), \(x_0\), and \(\overline{x}_0\) in the sense that there exists a monotonously decreasing sequence \(w^F\) with zero limit such that for all \(x_0, \overline{x}_0 \in D_N\), \(z \in (D_n)^{\mathbb{N}_+}\), and \(t \in \mathbb{N}\),
\[
\|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| \leq w^F_t. \tag{4.28}
\]
Reservoir flows that satisfy condition (4.27) are said to have the state forgetting property and we refer to (4.28) as the uniform state forgetting property.

Proof. Let \(c < 1\) be the contraction constant of \(F\). Using the recursions (4.26) that define the reservoir flow we can write, for any \(t > 1\):
\[
\begin{aligned}
\|F_F(z, x_0)_t - F_F(z, \overline{x}_0)_t\| &= \|F(F_F(z, x_0)_{t-1}, z_t) - F(F_F(z, \overline{x}_0)_{t-1}, z_t)\|\\
&\leq c\, \|F_F(z, x_0)_{t-1} - F_F(z, \overline{x}_0)_{t-1}\| \leq c^{t-1}\|F_F(z, x_0)_1 - F_F(z, \overline{x}_0)_1\|\\
&= c^{t-1}\|F(x_0, z_1) - F(\overline{x}_0, z_1)\|.
\end{aligned}
\]
Taking the limit \(t \to +\infty\) on both sides of this inequality yields (4.27). Now, if \(D_N\) is compact then there exists a constant \(D > 0\) such that \(\|F(x_0, z_1) - F(\overline{x}_0, z_1)\| < D\) for all \(x_0, \overline{x}_0 \in D_N\) and \(z_1 \in D_n\), and hence (4.28) holds if we set \(w^F_t := c^{t-1}D\), \(t \in \mathbb{N}\). ∎

4.4. Analytic reservoir filters associated to analytic reservoir maps

The results in Section 4.1 characterized the conditions under which reservoir maps of class \(C^1\) yield differentiable reservoir filters with respect to inputs and outputs in weighted sequence spaces. This setup is convenient because it is able to accommodate unbounded signals and allows for an elegant encoding of the fading memory property. However, due to the infinite dimensional character of our setup, one cannot immediately obtain higher order differentiable reservoir filters out of higher order differentiable reservoir maps (see Remark 40) because one needs, roughly speaking, to modify the weighted norm in the target of the map that defines the filter (see Proposition 39). This makes it impossible to apply, in a higher order differentiability context, the Implicit Function Theorem, which is the main tool used in the results of the previous section. That is why in the following paragraphs we deal with analytic reservoir maps (as real valued functions) and study the analyticity of the associated reservoir filters with respect to the supremum norm, as opposed to the weighted norms that we considered in the previous section. Using the supremum norm implies that filter differentiability in that context, when one manages to establish it, ensures filter continuity but not the fading memory property. In exchange, analyticity allows us to construct Taylor series expansions that, as we see later on, are discrete-time Volterra series representations.
The next result is the analytic analog of the Local Persistence Theorem 14, formulated using the supremum norm, and proves that analytic reservoir maps have locally defined analytic reservoir filters associated around constant solutions.

Theorem 28 (Local persistence of the ESP, continuity, and analyticity) Let \(F : \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N\) be a reservoir map. Suppose that \(F\) is analytic and that the corresponding reservoir system (1.1) has a constant solution \((x^0, z^0) \in \mathbb{R}^N \times \mathbb{R}^n\), that is, \(x^0 = F(x^0, z^0)\). Suppose, additionally, that for all \(r \geq 1\),
\[
L_{F,r} := \sup_{(x,z) \in \mathbb{R}^N \times \mathbb{R}^n}\{|||D^rF(x,z)|||\} < +\infty, \tag{4.29}
\]
and that
\[
L_{Fx}(x^0, z^0) := |||D_xF(x^0, z^0)||| < 1. \tag{4.30}
\]
Then, there exist open time-invariant neighborhoods \(V_{x^0}\) and \(V_{z^0}\) of \(x^0\) and \(z^0\) in \(\ell^\infty(\mathbb{R}^N)\) and \(\ell^\infty(\mathbb{R}^n)\), respectively, such that the reservoir system associated to \(F\) with inputs in \(V_{z^0}\) has the echo state property and hence determines a unique causal, time-invariant, and analytic (and hence continuous) reservoir filter \(U_F : (V_{z^0}, \|\cdot\|_\infty) \to (V_{x^0}, \|\cdot\|_\infty)\).

5. The Volterra series representation of analytic filters and a universality theorem

In this section we study the Taylor series expansions of analytic causal and time-invariant filters which, as we prove in the next result, coincide with the so-called discrete-time Volterra series representations. A very similar result was formulated in Sandberg (1998a, 1999) for analytic filters with respect to the supremum norm and with inputs with a finite past. The next result extends that statement and characterizes the inputs for which an analytic time-invariant fading memory filter with respect to a weighted norm admits a Volterra series representation with semi-infinite past inputs. This generalized result allows this series representation for inputs that are not necessarily bounded. Additionally, we use the causality and time-invariance hypotheses to show that the corresponding Volterra series representations have time-independent coefficients.
Theorem 29 Let \(w\) be a weighting sequence and let \(U : \overline{B}_w(z^0, M) \subset \ell^\infty_w(\mathbb{R}) \to \overline{B}_w(U(z^0), L) \subset \ell^\infty_w(\mathbb{R}^N)\) be a causal and time-invariant analytic filter, for some time-invariant \(z^0 \in \ell^1_w(\mathbb{R})\) (that is, \(T_{-t}(z^0) = z^0\), for all \(t \in \mathbb{Z}_-\)) and \(M, L > 0\). Then, for any element in the domain that satisfies \(z \in \overline{B}_w(z^0, M) \cap \ell^1_w(\mathbb{R})\), that is,
\[
\sum_{t \in \mathbb{Z}_-} |z_t| w_{-t} < +\infty, \tag{5.1}
\]
there exists a unique expansion
\[
U(z)_t = U(z^0)_t + \sum_{j=1}^{\infty} \sum_{m_1, \ldots, m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t}), \quad t \in \mathbb{Z}_-, \tag{5.2}
\]
where the maps \(g_j : \mathbb{Z}_-^j \to \mathbb{R}^N\), \(j \geq 1\), are uniquely determined by the derivatives of the functional \(H_U : \overline{B}_w(z^0, M) \subset \ell^\infty_w(\mathbb{R}) \to \mathbb{R}^N\) associated to \(U\) (which by Proposition 39 in the Appendix is analytic) via the relation
\[
g_j(m_1, \ldots, m_j) := \frac{1}{j!} D^jH_U(z^0)(e_{m_1}, \ldots, e_{m_j}), \quad\text{with } (e_n)_t := \begin{cases} 1 & \text{if } t = n,\\ 0 & \text{otherwise.}\end{cases} \tag{5.3}
\]
Moreover, for any \(p \in \mathbb{N}_+\), we have that
\[
\Big\|U(z)_t - U(z^0)_t - \sum_{j=1}^{p} \sum_{m_1, \ldots, m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t})\Big\| \leq \frac{L}{1 - \|z - z^0\|_w/M}\left(\frac{\|z - z^0\|_w}{M}\right)^{p+1}. \tag{5.4}
\]
These statements also hold true when \(\ell^\infty_w(\mathbb{R})\) and \(\ell^\infty_w(\mathbb{R}^N)\) are replaced by \(\ell^\infty(\mathbb{R})\) and \(\ell^\infty(\mathbb{R}^N)\), respectively. In that case, the relation (5.2) holds whenever \(z \in \overline{B}_\infty(z^0, M) \cap \ell^1(\mathbb{R})\) and the inequality (5.4) is obtained by taking as the sequence \(w\) the constant sequence \(w^\iota\) given by \(w^\iota_t := 1\), for all \(t \in \mathbb{N}\).

Remark 30 The error estimate (5.4) can be reformulated in terms of the weighted norm of the remainder sequence \(R_p(z)\) with entries \(R_p(z)_t := U(z)_t - U(z^0)_t - \sum_{j=1}^{p} \sum_{m_1, \ldots, m_j=-\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1+t} - z^0_{m_1+t}) \cdots (z_{m_j+t} - z^0_{m_j+t})\) as
\[
\|R_p(z)\|_w \leq \frac{L}{1 - \|z - z^0\|_w/M}\left(\frac{\|z - z^0\|_w}{M}\right)^{p+1}. \tag{5.5}
\]

5.1. Finite discrete-time Volterra series are universal in the fading memory category

In this section we combine the Volterra series representation Theorem 29 with previous universality results in Grigoryeva and Ortega (2018a) to show that any fading memory filter with uniformly bounded inputs can be arbitrarily well approximated by a Volterra series with finitely many terms of the type in (5.2). This result provides an alternative proof of a Volterra series universality theorem that was stated for the first time in (Boyd and Chua, 1985, Theorems 3 and 4).
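A minimal worked instance of the kernel formula (5.3): for a two-dimensional linear reservoir with nilpotent connectivity and the polynomial readout \(h(x) = x_1x_2 + x_1\) (the matrices, the readout, and the input below are all our own illustrative choices), the degree-2 finite Volterra expansion built from the derivatives of the functional reproduces the filter output exactly:

```python
import numpy as np
from itertools import product

# Linear reservoir with nilpotent connectivity (index p = 2) and polynomial readout.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])          # A @ A == 0
c = np.array([0.0, 1.0])

def h(x):
    return x[0] * x[1] + x[0]       # polynomial readout, deg(h) = 2

def H(z):
    """Reservoir functional H(z) = sum_j A^j c z_{-j}; finite by nilpotency."""
    return c * z[-1] + (A @ c) * z[-2]   # z[-1] = z_0, z[-2] = z_{-1}

grad_h = np.array([1.0, 0.0])            # Dh(0)
hess_h = np.array([[0.0, 1.0],
                   [1.0, 0.0]])          # D^2 h(0)

def e(m):
    v = np.zeros(2)
    v[1 + m] = 1.0                       # unit input sequence e_m, m in {-1, 0}
    return v

# Volterra kernels g_j(m_1,...,m_j) = (1/j!) D^j h(0)(H(e_{m_1}), ..., H(e_{m_j}))
g1 = {m: grad_h @ H(e(m)) for m in (-1, 0)}
g2 = {(m1, m2): 0.5 * H(e(m1)) @ hess_h @ H(e(m2))
      for m1, m2 in product((-1, 0), repeat=2)}

z = np.array([0.3, -1.2])                # the input window (z_{-1}, z_0)
volterra = (sum(g1[m] * z[1 + m] for m in g1)
            + sum(g2[m1, m2] * z[1 + m1] * z[1 + m2] for (m1, m2) in g2))
direct = h(H(z))                         # direct evaluation of the filter functional
```

Because the connectivity is nilpotent and the readout has finite degree, only finitely many kernels are nonzero, so the expansion is exact for every input rather than an infinite series that needs truncating.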
In particular, this result shows that any time-invariant and causal fading memory filter can be uniformly approximated by a finite memory filter.

Theorem 31 (Universality of finite discrete-time Volterra series) Let \(M, L > 0\) and let \(K_M \subset (\mathbb{R})^{\mathbb{Z}_-}\), \(K_L \subset (\mathbb{R}^d)^{\mathbb{Z}_-}\) be as in (1.3). Let \(U : K_M \to K_L\) be a causal and time-invariant fading memory filter. Then, for any \(\epsilon > 0\) there exist \(x^0 \in K_L\) and \(J \in \mathbb{N}_+\) such that for any \(j \in \{1, \ldots, J\}\) there exist \(j\) numbers \(M^j_1, \ldots, M^j_j \in \mathbb{N}_+\) and maps \(g_j : \mathbb{Z}_-^j \to \mathbb{R}^d\) such that the filter determined by the finite Volterra series given by
\[
V(z)_t = x^0_t + \sum_{j=1}^{J} \sum_{m_1 = -M^j_1}^{0} \cdots \sum_{m_j = -M^j_j}^{0} g_j(m_1, \ldots, m_j)\, z_{m_1+t} \cdots z_{m_j+t} \tag{5.6}
\]
is such that \(|||U - V|||_\infty = \sup_{z \in K_M}\{\|U(z) - V(z)\|_\infty\} < \epsilon\).

Proof. Corollary 11 in Grigoryeva and Ortega (2018a) guarantees that for any \(\epsilon > 0\) there exists a linear reservoir system with polynomial readout \(h \in \mathbb{R}[x]\) and nilpotent connectivity matrix \(A \in \mathbb{M}_N\), determined by the expressions
\[
x_t = Ax_{t-1} + cz_t, \quad A \in \mathbb{M}_N,\ c \in \mathbb{M}_{N,n}, \tag{5.7}
\]
\[
y_t = h(x_t), \quad h \in \mathbb{R}[x], \tag{5.8}
\]
such that its associated reservoir filter \(U^{A,c}_h : K_M \to K_L\) satisfies
\[
|||U - U^{A,c}_h|||_\infty < \epsilon. \tag{5.9}
\]
Let \(J := \deg(h) + 1\) and assume that \(A\) is nilpotent of index \(p\). In order to prove the theorem it suffices to show that the Volterra series expansion in (5.2) corresponding to \(U^{A,c}_h\) has an expression of the type (5.6). If that is the case, the statement in (5.9) proves the theorem. Indeed, recall (see, for instance, (Grigoryeva and Ortega, 2018a, Corollary 11)) that the functional \(H^{A,c}_h\) associated to the filter \(U^{A,c}_h\) is given by \(H^{A,c}_h(z) = h\big(\sum_{j \geq 0} A^j c z_{-j}\big)\), which is the composition of the polynomial \(h\) with the functional \(H^{A,c}\) associated to the reservoir equation (5.7), given by the linear operator
\[
H^{A,c}(z) := \sum_{j \geq 0} A^j c z_{-j}. \tag{5.10}
\]
It is easy to see that \(H^{A,c} : (\ell^\infty(\mathbb{R}), \|\cdot\|_\infty) \to \mathbb{R}^N\) has a finite operator norm \(|||H^{A,c}|||\) and that \(|||H^{A,c}||| \leq |||c|||/(1 - |||A|||)\), with \(|||c|||\) and \(|||A|||\) the top singular values of \(c\) and \(A\), respectively. Moreover, it is easy to see that for any \(j \in \mathbb{N}_+\), \(z \in K_M\), and \(v_1, \ldots, v_j \in \ell^\infty(\mathbb{R})\), we have \(D^jH^{A,c}_h(z)(v_1, \ldots\)
\(, v_j) = D^jh(H^{A,c}(z))\big(H^{A,c}(v_1), \ldots, H^{A,c}(v_j)\big)\), which shows that \(H^{A,c}_h : (\ell^\infty(\mathbb{R}), \|\cdot\|_\infty) \to \mathbb{R}^d\) is everywhere analytic. Using this expression and (5.3) we define
\[
g_j(m_1, \ldots, m_j) := \frac{1}{j!} D^jh(0)\big(H^{A,c}(e_{m_1}), \ldots, H^{A,c}(e_{m_j})\big). \tag{5.11}
\]
As \(h\) has finite degree, \(D^jh(0) = 0\) for any \(j > \deg(h)\). Moreover, since the sum in (5.10) is finite by the nilpotency of \(A\), it is clear that \(g_j(m_1, \ldots, m_j)\) in (5.11) can be nonzero only when \(1 \leq j \leq \deg(h)\) and \(-(p-1) \leq m_1, \ldots, m_j \leq 0\). If we define \(M^j_1 = \cdots = M^j_j := p - 1\) then the Taylor series expansion of \(U^{A,c}_h(z)\) coincides with (5.6). We emphasize that in this case this expansion is valid for any \(z \in K_M\) by the finiteness of the number of terms in the sum and that the condition (5.1) is hence not necessary. ∎

6. Appendices

6.1. The topologies induced by weighted and supremum norms

An important feature of the topology generated by weighted norms is that it coincides with the product topology on subsets made of uniformly bounded sequences like the space \(K_M\) in (1.3). This fact holds true for any weighting sequence \(w\) and has important consequences (see Grigoryeva and Ortega (2018b) for the details). First, the fading memory property is independent of the weighting sequence used to define it. Second, the subsets \(K_M \subset \ell^\infty_w(\mathbb{R}^n)\) are compact in the topology induced by the weighted norms \(\|\cdot\|_w\). We emphasize that these statements are valid exclusively in the context of uniformly bounded subsets which, as we see in the next result, are never open in the weighted topology. We adopt in the sequel the following notation for product sets and functions: for any family \(\{A_t\}_{t \in \mathbb{Z}_-}\) of subsets \(A_t \subset \mathbb{R}^n\), the symbol
\[
\prod_{t \in \mathbb{Z}_-} A_t := \big\{z \in (\mathbb{R}^n)^{\mathbb{Z}_-} \mid z_t \in A_t, \text{ for all } t \in \mathbb{Z}_-\big\} \tag{6.1}
\]
denotes the Cartesian product of the sets in the family. When all the elements in the family are identical to a given subset \(A\), we use the symbols \(\prod_{t \in \mathbb{Z}_-} A\) and \((A)^{\mathbb{Z}_-}\) interchangeably.
A similar notation is adopted for the Cartesian product of maps: let \(V\) be a set and let \(f_t : V \to A_t\) be a map, \(t \in \mathbb{Z}_-\). The symbol \(\prod_{t \in \mathbb{Z}_-} f_t\) denotes the map
\[
\prod_{t \in \mathbb{Z}_-} f_t : V \to \prod_{t \in \mathbb{Z}_-} A_t, \qquad v \mapsto (\ldots, f_{-2}(v), f_{-1}(v), f_0(v)). \tag{6.2}
\]
Lemma 32 Let \(w\) be a weighting sequence and \(n \in \mathbb{N}_+\). Then:

(i) For any \(z \in \ell^\infty_w(\mathbb{R}^n)\) and \(r > 0\),
\[
B_w(z, r) = \bigcup_{0 < \delta < r}\ \prod_{t \in \mathbb{Z}_-} B\big(z_t, \delta/w_{-t}\big). \tag{6.3}
\]
In particular, this implies that
\[
B_w(z, r) \subset \prod_{t \in \mathbb{Z}_-} B\big(z_t, r/w_{-t}\big) \subset \overline{B}_w(z, r). \tag{6.4}
\]
The identity (6.3) implies that any open ball \(B_w(z, r)\) in \(\ell^\infty_w(\mathbb{R}^n)\) contains unbounded sequences.

(ii) Let \(\{A_t\}_{t \in \mathbb{Z}_-}\) be a family of subsets \(A_t \subset \mathbb{R}^n\) such that there exists a sequence \(\{c_t\}_{t \in \mathbb{Z}_-}\) that satisfies
\[
\sup_{z_t \in A_t}\{\|z_t\| w_{-t}\} < c_t, \text{ for each } t \in \mathbb{Z}_-, \quad\text{and}\quad \sup_{t \in \mathbb{Z}_-}\{c_t\} < +\infty; \tag{6.5}
\]
then the product set \(\prod_{t \in \mathbb{Z}_-} A_t \subset \ell^\infty_w(\mathbb{R}^n)\).

(iii) For every family \(\{A_t\}_{t \in \mathbb{Z}_-}\) of subsets \(A_t \subset \mathbb{R}^n\) such that the product set satisfies \(\prod_{t \in \mathbb{Z}_-} A_t \subset \ell^\infty_w(\mathbb{R}^n)\), we have
\[
\overline{\prod_{t \in \mathbb{Z}_-} A_t} = \prod_{t \in \mathbb{Z}_-} \overline{A_t}. \tag{6.6}
\]
These statements, except for the last sentence in part (i), are also valid for the space \(\ell^\infty(\mathbb{R}^n)\) and are obtained by taking as sequence \(w\) the constant sequence \(w^\iota\) given by \(w^\iota_t := 1\), for all \(t \in \mathbb{N}\).

Proof. (i) We prove (6.3) by double inclusion. First, let \(x \in B_w(z, r)\). By definition, \(\|x - z\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} < r\) and hence for any \(\delta_x > 0\) such that \(\|x - z\|_w < \delta_x < r\) we have that \(\|x_t - z_t\| < \delta_x/w_{-t}\), for all \(t \in \mathbb{Z}_-\). This implies that \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, \delta_x/w_{-t})\). Conversely, given an element \(x \in \ell^\infty_w(\mathbb{R}^n)\) in the right hand side of (6.3), there exists \(\delta_x < r\) such that \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, \delta_x/w_{-t})\). This implies that \(\|x_t - z_t\| w_{-t} < \delta_x\), for all \(t \in \mathbb{Z}_-\), and hence \(\sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} = \|x - z\|_w \leq \delta_x < r\), which proves the inclusion. As to (6.4), the first inclusion is a straightforward consequence of (6.3). Let now \(x \in \prod_{t \in \mathbb{Z}_-} B(z_t, r/w_{-t})\). By definition this implies that \(\|x_t - z_t\| w_{-t} < r\), for all \(t \in \mathbb{Z}_-\), and consequently \(\sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} \leq r\) or, equivalently, \(\|x - z\|_w \leq r\). This implies that \(x \in \overline{B}_w(z, r)\) and proves the second inclusion.

(ii) Let \(x \in \prod_{t \in \mathbb{Z}_-} A_t\). Then,
\[
\|x\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t\| w_{-t}\} \leq \sup_{t \in \mathbb{Z}_-}\{c_t\} < +\infty,
\]
as required.

(iii) We first prove that \(\prod_{t \in \mathbb{Z}_-} \overline{A_t} \subset \overline{\prod_{t \in \mathbb{Z}_-} A_t}\). If \(z \in \prod_{t \in \mathbb{Z}_-} \overline{A_t}\), then for any \(\epsilon > 0\) and each \(t \in \mathbb{Z}_-\) there exists an element \(x_t \in A_t \cap B(z_t, \epsilon/(2w_{-t}))\).
Let \(x := (x_t)_{t \in \mathbb{Z}_-}\). By construction,
\[
\|x - z\|_w = \sup_{t \in \mathbb{Z}_-}\{\|x_t - z_t\| w_{-t}\} \leq \frac{\epsilon}{2} < \epsilon,
\]
which implies that \(x \in B_w(z, \epsilon) \cap \prod_{t \in \mathbb{Z}_-} A_t\) and, as \(\epsilon > 0\) is arbitrary, it guarantees that \(z \in \overline{\prod_{t \in \mathbb{Z}_-} A_t}\). In order to show the reverse inclusion, first note that, as is proved later on in Lemma 1, the projections \(p_t : \ell^\infty_w(\mathbb{R}^n) \to \mathbb{R}^n\), \(t \in \mathbb{Z}_-\), defined by \(p_t(z) := z_t\), are continuous. Let \(z \in \overline{\prod_{t \in \mathbb{Z}_-} A_t}\) be arbitrary, let \(t \in \mathbb{Z}_-\) be arbitrary but fixed, and let \(V_t\) be an open set in \(\mathbb{R}^n\) that contains \(z_t\). The continuity of \(p_t\) implies that \(p_t^{-1}(V_t)\) is an open set in \(\ell^\infty_w(\mathbb{R}^n)\) that contains \(z\) and therefore there exists \(x \in \prod_{t \in \mathbb{Z}_-} A_t \cap p_t^{-1}(V_t)\). We consequently have that \(x_t \in A_t \cap V_t\), which guarantees that \(z_t \in \overline{A_t}\), as required. ∎

Corollary 33 Let \(D_n\) be a subset of \(\mathbb{R}^n\) and let \(w\) be a weighting sequence. Then:

(i) If \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is an open subset of \(\ell^\infty_w(\mathbb{R}^n)\) then, necessarily, \(D_n = \mathbb{R}^n\).

(ii) If \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is a closed subset of \(\ell^\infty_w(\mathbb{R}^n)\) then \(D_n\) is necessarily closed in \(\mathbb{R}^n\), that is, \(D_n = \overline{D_n}\).

(iii) The following inclusion always holds:
\[
\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \big(\overline{D_n}\big)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n). \tag{6.7}
\]
In particular, if \(D_n\) is closed in \(\mathbb{R}^n\) then so is \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) in \(\ell^\infty_w(\mathbb{R}^n)\).

The statements in parts (ii) and (iii) are also valid when the space \(\ell^\infty_w(\mathbb{R}^n)\) is replaced by \(\ell^\infty(\mathbb{R}^n)\).

Proof. (i) We proceed by contradiction. Suppose that \(D_n \neq \mathbb{R}^n\). Let \(x^0 \in \mathbb{R}^n \setminus D_n\) and let \(z^0 \in D_n\). Define the constant sequences \(x := (x^0)_{t \in \mathbb{Z}_-} \in \ell^\infty_w(\mathbb{R}^n) \setminus \big((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\big)\) and \(z := (z^0)_{t \in \mathbb{Z}_-} \in (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). Since by hypothesis \((D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) is an open subset of \(\ell^\infty_w(\mathbb{R}^n)\), there exists \(\epsilon > 0\) such that \(B_w(z, 2\epsilon) \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). By the relation (6.4) in Lemma 32 we also have
\[
B_w(z, \epsilon) \subset \prod_{t \in \mathbb{Z}_-} B\big(z^0, \epsilon/w_{-t}\big) \subset \overline{B}_w(z, \epsilon) \subset B_w(z, 2\epsilon) \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n),
\]
which implies
\[
B\big(z^0, \epsilon/w_{-t}\big) \subset D_n, \quad\text{for all } t \in \mathbb{Z}_-. \tag{6.8}
\]
Let \(r_0 := \|x^0 - z^0\|\) and let \(t_0 \in \mathbb{Z}_-\) be such that for all \(t < t_0\) we have \(\epsilon/w_{-t} > r_0\). By (6.8) we have that \(x^0 \in B(z^0, \epsilon/w_{-t}) \subset D_n\), which contradicts the assumption on the choice of \(x^0\).

(ii) By Lemma 32 (iii) we have that
\[
\overline{(D_n)^{\mathbb{Z}_-}} = \big(\overline{D_n}\big)^{\mathbb{Z}_-}. \tag{6.9}
\]
Since by hypothesis \((D_n)^{\mathbb{Z}_-}\) is closed, it holds true that
\[
\overline{(D_n)^{\mathbb{Z}_-}} = (D_n)^{\mathbb{Z}_-}. \tag{6.10}
\]
Consequently, by (6.9) and (6.10) we have that \(\big(\overline{D_n}\big)^{\mathbb{Z}_-} = (D_n)^{\mathbb{Z}_-}\), which implies that \(\overline{D_n} = D_n\), as required.

(iii) Let \(x \in \overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \ell^\infty_w(\mathbb{R}^n)\) and consider a sequence \(\{x^m\}_{m \in \mathbb{N}_+} \subset (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\) with \(\lim_{m \to \infty} x^m = x\), that is, for each \(\epsilon > 0\) there exists \(N(\epsilon) \in \mathbb{N}_+\) such that for all \(m > N(\epsilon)\) it holds that \(\|x^m - x\|_w < \epsilon\). Hence, for all \(s \in \mathbb{Z}_-\) one has that
\[
w_{-s}\|x^m_s - x_s\| \leq \sup_{t \in \mathbb{Z}_-}\{\|x^m_t - x_t\| w_{-t}\} = \|x^m - x\|_w \leq \epsilon,
\]
which immediately implies that \(\|x^m_s - x_s\| \leq \epsilon/w_{-s}\); hence one gets that \(x_s \in \overline{D_n}\) and therefore (6.7) holds, as required. The last claim in part (iii) follows from (6.7). Indeed, if \(D_n = \overline{D_n}\) then by (6.7) we have that \(\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} \subset \big(\overline{D_n}\big)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n) = (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). Since the reverse inclusion obviously always holds, we finally have that \(\overline{(D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)} = (D_n)^{\mathbb{Z}_-} \cap \ell^\infty_w(\mathbb{R}^n)\). ∎

We also recall (see (Grigoryeva and Ortega, 2018b, Proposition 2.9)) that the norm topology in \(\ell^\infty_w(\mathbb{R}^n)\) is strictly finer than the subspace topology induced by the product topology of \((\mathbb{R}^n)^{\mathbb{Z}_-}\) on \(\ell^\infty_w(\mathbb{R}^n) \subset (\mathbb{R}^n)^{\mathbb{Z}_-}\). We complement this fact by comparing the norm topology on \((\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)\) with the relative topology induced by \((\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)\) on it.

Corollary 34 The relative topology \(\tau_{w,\infty}\) induced by the norm topology \(\tau_w\) of \((\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)\) on \(\ell^\infty(\mathbb{R}^n)\) is strictly coarser than the norm topology \(\tau_\infty\) on \((\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)\), that is, \(\tau_{w,\infty} \subsetneq \tau_\infty\).

Proof. Since, as we already saw, \(\|z\|_w \leq \|z\|_\infty\) for all \(z \in (\mathbb{R}^n)^{\mathbb{Z}_-}\), we have that \(\ell^\infty(\mathbb{R}^n) \subset \ell^\infty_w(\mathbb{R}^n)\) (see (2.4)) and the inclusion \(\iota : \ell^\infty(\mathbb{R}^n) \hookrightarrow \ell^\infty_w(\mathbb{R}^n)\) is continuous. Consequently, for any open \(W \in \tau_w\) the set \(\iota^{-1}(W) = W \cap \ell^\infty(\mathbb{R}^n) \in \tau_{w,\infty}\) is also open in \(\tau_\infty\). This immediately implies that \(\tau_{w,\infty} \subset \tau_\infty\). In order to establish that this inclusion is strict, one needs to notice that, given an arbitrary open ball \(B_\infty(z, r)\), \(r > 0\), around \(z \in \ell^\infty(\mathbb{R}^n)\), all the open balls \(B_w(z, \epsilon)\), \(\epsilon > 0\), contain elements that are not included in \(B_\infty(z, r)\), by Lemma 32 (i). ∎

Lemma 35 Let \(w\) be a weighting sequence and \(n \in \mathbb{N}_+\). We denote by \(w^a\), \(a \in \mathbb{R}\), the sequence with terms \(w^a_t\), \(t \in \mathbb{N}\).
Then, the following inclusions are continuous:
\[
(\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty) \hookrightarrow \cdots \hookrightarrow \big(\ell^\infty_{w^{1/(k+1)}}(\mathbb{R}^n), \|\cdot\|_{w^{1/(k+1)}}\big) \hookrightarrow \big(\ell^\infty_{w^{1/k}}(\mathbb{R}^n), \|\cdot\|_{w^{1/k}}\big) \hookrightarrow \cdots \hookrightarrow (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w), \tag{6.11}
\]
\[
(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \hookrightarrow \cdots \hookrightarrow \big(\ell^\infty_{w^k}(\mathbb{R}^n), \|\cdot\|_{w^k}\big) \hookrightarrow \big(\ell^\infty_{w^{k+1}}(\mathbb{R}^n), \|\cdot\|_{w^{k+1}}\big) \hookrightarrow \cdots \hookrightarrow (\mathbb{R}^n)^{\mathbb{Z}_-}, \tag{6.12}
\]
where \(k \in \mathbb{N}_+\) and on \((\mathbb{R}^n)^{\mathbb{Z}_-}\) we consider the trivial topology. Define
\[
S^w := \bigcap_{k \in \mathbb{N}_+} \ell^\infty_{w^{1/k}}(\mathbb{R}^n) \quad\text{and}\quad S_w := \bigcup_{k \in \mathbb{N}_+} \ell^\infty_{w^k}(\mathbb{R}^n). \tag{6.13}
\]
Then, in general,
\[
\ell^\infty(\mathbb{R}^n) \subsetneq S^w \quad\text{and}\quad S_w \subsetneq (\mathbb{R}^n)^{\mathbb{Z}_-}. \tag{6.14}
\]
Proof. The continuity of the inclusions (6.11) and (6.12) is a consequence of the facts that
\[
\|z\|_{w^{1/k}} \leq \|z\|_{w^{1/(k+1)}}, \text{ for all } k \in \mathbb{N}_+ \text{ and } z \in \ell^\infty_{w^{1/(k+1)}}(\mathbb{R}^n), \tag{6.15}
\]
\[
\|z\|_{w^{k+1}} \leq \|z\|_{w^k}, \text{ for all } k \in \mathbb{N}_+ \text{ and } z \in \ell^\infty_{w^k}(\mathbb{R}^n). \tag{6.16}
\]
Regarding (6.14), the first inclusion follows from the fact that \(\ell^\infty(\mathbb{R}^n) \subset \ell^\infty_w(\mathbb{R}^n)\) for any weighting sequence. In order to show that this inclusion is in general not an equality it suffices to consider the following example: let \(z \in (\mathbb{R})^{\mathbb{Z}_-}\) be given by \(z_t := t\), \(t \in \mathbb{Z}_-\), and let \(w\) be the weighting sequence defined by \(w_t := \lambda^t\), with \(t \in \mathbb{N}\) and \(0 < \lambda < 1\). A simple application of L'Hôpital's rule shows that, for any \(k \in \mathbb{N}_+\),
\[
\lim_{t \to -\infty} |z_t| w^{1/k}_{-t} = 0,
\]
which proves, in particular, that \(\|z\|_{w^{1/k}} < +\infty\) and hence that \(z \in \ell^\infty_{w^{1/k}}(\mathbb{R})\), for any \(k \in \mathbb{N}_+\). This implies that \(z \in S^w\). However, \(z\) is an unbounded sequence and hence it does not belong to \(\ell^\infty(\mathbb{R})\). In order to show that the second inclusion in (6.14) is also strict, take \(z \in (\mathbb{R})^{\mathbb{Z}_-}\) given by \(z_t := \lambda^{-t}\) with \(\lambda > 1\) and \(t \in \mathbb{Z}_-\), and let \(w\) be the weighting sequence defined by \(w_0 := 1\) and \(w_t := 1/t\), for any \(t \in \mathbb{N}_+\). L'Hôpital's rule shows that, for any \(k \in \mathbb{N}_+\),
\[
\lim_{t \to -\infty} |z_t w^k_{-t}| = +\infty,
\]
and consequently \(z\) does not belong to any of the spaces \(\ell^\infty_{w^k}(\mathbb{R})\) and hence \(z \notin S_w\). ∎

6.2. Products of continuous and differentiable functions using weighted norms

The following lemma spells out conditions under which infinite Cartesian products of continuous and differentiable functions are continuous and differentiable when we use weighted and supremum norms.

Lemma 36 Let \(W \subset V\) with \((V, \|\cdot\|)\) a normed space and let \(D_N \subset \mathbb{R}^N\) be a subset of \(\mathbb{R}^N\). Let \(H_t : W \to D_N\), \(t \in \mathbb{Z}_-\), be a family of maps.
Consider the corresponding product map H : W (DN)Z , defined as in (6.2): t Z Ht := (. . . , H 2, H 1, H0) , or equivalently, (H(z))t := Ht(z), z W, t Z . (6.17) (i) Endow W V with the subspace topology. If DN is a compact subset of RN then (DN)Z ℓw (RN) for any weighting sequence w. If each of the functions Ht is continuous then H : W (DN)Z ℓw (RN) is also continuous. (ii) Let w be a weighting sequence and suppose that W contains a point z0 such that H(z0) ℓw (RN). If each of the functions Ht is Lipschitz continuous with Lipschitz constant c0 t and the sequence c0 := (c0 t)t Z formed by these Lipschitz constants satisfies that c0 ℓw (R), then H : W (DN)Z ℓw (RN) is Lipschitz continuous with Lipschitz constant c0 H c0 w. (iii) Suppose that W is an open convex subset of (V, ) and that it contains a point z0 such that H(z0) ℓw (RN). Suppose also that the maps Ht are of class Cr(W), r 1, and let cr t be finite constants such that supz W {|||Dr Ht(z)|||} cr t < + . If cr := (cr t)t Z ℓw (R) then H is differentiable of order r when considered as a map H : W (V, ) ℓw (RN), w and |||Dr H(z)||| cr w, for any z W. (6.18) Additionally, if cj ℓw (R) for all j {1, . . . , r}, then H is of class Cr 1(W) and the map Dr 1H : (W, ) Lr 1(V, ℓw (RN)), ||| ||| is Lipschitz continuous with Lipschitz constant cr H cr w. (iv) Suppose that W is an open convex subset of (V, ) and that it contains a point z0 such that H(z0) ℓw (RN). If the maps Ht are smooth and cr w < + , for each r N+, then so is H : W (V, ) (ℓw (RN), w). Suppose, additionally, that the maps Ht are analytic and that ρt > 0 is the radius of convergence of the series expansion of Ht. If ρ := inft Z {ρt} > 0 then H is analytic when considered as a map H : W (V, ) (ℓw (RN), w) and the radius of convergence ρH of its series expansion satisfies that ρH ρ > 0. Parts (ii), (iii), and (iv) also hold true when the Banach space (ℓw (RN), w) is replaced by (ℓ (RN), ). Part (i) is in general false in that situation. Proof. 
(i) The compactness of DN guarantees (Munkres, 2014, Theorem 27.3) that there exists L > 0 such that DN B (0, L) and hence (DN)Z ℓw (RN) necessarily. It can also be shown (see Grigoryeva and Ortega (Grigoryeva and Ortega, 2018b, Corollary 2.7)) that when DN is compact, the relative topology (DN)Z induced by the weighted norm w in ℓw (RN) coincides with the product topology. This implies (see (Munkres, 2014, Theorem 19.6)) that if the functions Ht are continuous then so is H. (ii) Let z1, z2 W. Then, H(z1) H(z2) w = sup t Z Ht(z1) Ht(z2) w t sup t Z c0 t z1 z2 w t c0 w z1 z2 , (6.19) which proves simultaneously that H is Lipschitz continuous and that it maps into ℓw (RN). Regarding the last point, recall that by hypothesis there exists a point z0 such that H(z0) ℓw (RN) and hence by (6.19) we have, for any z W, H(z) w c0 w z z0 + H(z0) w < + . (6.20) (iii) First, it is easy to prove recursively that for any z W, the map Dr H(z) := Q t Z Dr Ht(z) satisfies the condition (2.5). In order to prove the first statement of the lemma, it suffices to show that the multilinear map Dr H(z) : (V, ) (V, ) | {z } r times (ℓw (Rn), w), is bounded for any z W. Let v1, . . . , vr V r. Using the r-order differentiability of Ht we can write Dr H(z) v1, . . . , vr w = t Z Dr Ht(z) v1, . . . , vr w Dr Ht(z) v1, . . . , vr w t |||Dr Ht(z)||| v1 vr w t v1 vr sup t Z {cr tw t} cr w v1 vr , (6.21) which proves the boundedness of Dr H(z) and the inequality in (6.18). We now assume that cj ℓw (R) for all j {1, . . . , r} and show that H maps into ℓw (RN) and that it is of class Cr 1(W). Notice, first of all, that for any t Z and any z1, z2 Vn, we have by the convexity of W, the mean value theorem Abraham et al. (1988), and the hypothesis Ht Cr(W), that for all j {1, . . . , r}, Dj 1Ht(z1) Dj 1Ht(z2) w sup z W Dj Ht(z) z1 z2 = cj t z1 z2 . 
(6.22) Taking j = 1 in the previous inequality, we see that the functions Ht are Lipschitz continuous with constants c1 t that form a sequence that by hypothesis belongs to ℓw (R). This guarantees by part (ii) that H maps into ℓw (RN) necessarily. Now, using the inequality (6.22), we have that for any z1, z2 W, Dr 1H(z1) Dr 1H(z2) w v1,...,vr 1 V v1,...,vr 1 =0 ( Dr 1H(z1) Dr 1H(z2) v1, . . . , vr 1 w v1 vr 1 v1,...,vr 1 V v1,...,vr 1 =0 ( supt Z Dr 1Ht(z1) Dr 1Ht(z2) v1, . . . , vr 1 w t v1,...,vr 1 V v1,...,vr 1 =0 ( supt Z cr tw t z1 z2 v1 vr 1 = cr w z1 z2 , (6.23) Differentiable reservoir computing which shows that the map Dr 1H : (W, w) Lr 1(ℓw (Rn), ℓw (RN)), ||| |||w is Lipschitz continuous with Lipschitz constant cr H cr w. (iv) The previous part of the lemma together with the hypothesis c := supr N+ { cr w} < + guarantees that the differentiability of any order in the functions Ht gets translated into the differentiability of any order of the map H : W (V, ) (ℓw (RN), w). Moreover, let u ℓw (Rn) and let ur := (u, . . . , u) ℓw (Rn) r, r N+. The Taylor series expansion of H around 0 is 1 r!Dr H(0) ur = Y 1 r!Dr Ht(0) ur ! The expansion in the left hand side of this equality is convergent if and only if each of the series in the product in the right hand side is convergent. This is the case when u w < ρt, for all t Z , which guarantees the convergence of the Taylor series expansion in (6.24) for all the elements u ℓw (Rn) that satisfy u w < inft Z {ρt} = ρ. Since by hypothesis ρ > 0, we have proved that H is analytic with radius of convergence ρH ρ. The proof of the statements in (ii), (iii), and (iv) for the space ℓ (RN) is obtained by mimicking the proofs that we just provided, replacing the weighting sequence w by the constant sequence wι that is equal to 1 for each t Z . In order to show that part (i) is in general false in that situation take W = ( 1, 1), DN = [ 1, 1], and define Ht(z) := tanh ( tz), with t Z and z ( 1, 1). 
Given that H 1 t 1 t tanh 1( 1 t tanh 1( 1 2) it is clear that This equality shows that the preimage by the product map H of an open set is not open and hence H is not continuous. 6.3. Proof of Lemma 1 (i) The linearity of pt is obvious. Let u ℓw (Rn) arbitrary. Since pt(u) = ut = ut w t/w t supj Z { uj w j} /w t = u w /w t, we can conclude that |||pt|||w 1/w t. Let now v Rn such that v = 1 and define the element z ℓw (Rn) by zt := v/w t, for all t Z . It is clear that z w = 1 and that pt(z) / z w = 1/w t, which shows that |||pt|||w = 1/w t, as required. (ii) We first prove the statements in this part in the case t < 0. Suppose that the inverse decay ratio Lw is finite and let u ℓw (Rn) arbitrary. Then T1(u) w = sup t Z { ut 1 w t} = sup t Z ut 1 w (t 1) w t w (t 1) ut 1 w (t 1) sup t Z w t w (t 1) u w Lw. (6.25) This inequality shows that T1 maps ℓw (Rn) into ℓw (Rn) and that |||T1|||w Lw. Given that for any t Z we can write T t = T1 T1 | {z } t times the previous conclusion also proves that T t maps ℓw (Rn) into ℓw (Rn) and that |||T t|||w = |||T1 T1|||w |||T1|||w |||T1|||w L t w . Grigoryeva and Ortega It remains to be shown that |||T1|||w = Lw. In order to do so, take an element v Rn such that v = 1 and define the element u ℓw (Rn) by ut := v/w t, for all t Z . Notice that by construction u w = 1 and, moreover, u w = sup t Z { ut 1 w t} = sup t Z v w (t 1) w t which proves the required identity. We now show that T t : (ℓw (Rn), w) (ℓw (Rn), w) is surjective. Indeed, it is clear that for any u ℓw (Rn), the element eu := (u, 0, . . . , 0 | {z } t times ) is such that T t (eu) = u. We hence just need to show that eu ℓw (Rn). This is the case because eu w = sup s Z { f us w s} = sup s Z { us+t w s} = sup s Z us+t w (s+t) w s w (s+t) us+t w (s+t) w s w (s+1) w (s+1) w (s+2) w (s+t 1) u w L t w < + , (6.26) because u ℓw (Rn) and by hypothesis Lw < + . 
Now, since we already showed that T t : (ℓw (Rn), w) (ℓw (Rn), w) is continuous, then the Banach-Schauder Open Mapping Theorem (Abraham et al., 1988, Theorem 2.2.15) implies that T t is necessarily an open map. It remains to be shown that T t : (ℓw (Rn), w) (ℓw (Rn), w) is a submersion (see (Abraham et al., 1988, Section 3.5) for context and definitions). First, it is obvious that ker T t = (. . . , 0, 0, v) | v (Rn) t . Since T t is linear and bounded, in order to show that it is a submersion it suffices to show that ker T t is split, that is, it has a closed complement in (ℓw (Rn), w). We now prove that such a complement is given by the subspace (u, 0, . . . , 0 | {z } t times ) | u ℓw (Rn) The inequality (6.26) implies that C t ℓw (Rn). (6.28) Additionally, C t is clearly closed in ℓw (Rn). We conclude by showing by double inclusion that ℓw (Rn) = ker T t C t. (6.29) Let first u ℓw (Rn) and define u1 := (. . . , ut 2, ut 1, ut, 0, . . . , 0 | {z } t times ) and u2 := (. . . , 0, 0, ut+1, ut+2, . . . , u0). It is clear that u = u1 + u2. Additionally, the sequence u2 is obviously in ker T t and using an argument similar to the one in (6.26) it is easy to show that u1 C t, which proves the inclusion ℓw (Rn) ker T t C t. Differentiable reservoir computing Conversely, let u1 C t and u2 ker T t. By (6.28) we have that u1 w < + and it is also clear that u2 w < + . Therefore u1 + u2 w u1 w + u2 w < + and hence u1 + u2 ℓw (Rn), which shows that T t is a submersion. Finally, the statements in the case t > 0 are proved in a similar fashion. In particular, it is easy to see that T t Tt = PCt, for any t > 0, where PCt is the projection onto the subspace Ct defined in (6.27) according to the splitting (6.29). Moreover, it is easy to see that T t is injective and that its image Im T t is split because Im T t = Ct and by (6.29) ℓw (Rn) = ker Tt Im T t, t > 0, which proves that T t is an immersion. (iii) Straightforward consequence of the definitions. 
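The operator norms computed above, |||pt|||w = 1/w−t and |||T1|||w = Lw, can be sanity-checked numerically on truncated sequences. The following sketch (geometric weighting, illustrative truncation length; all variable names hypothetical) evaluates the extremal element u−t := 1/wt used in the proof:

```python
import numpy as np

# Truncated left-infinite sequences: u[t] stands for the entry u_{-t}, t = 0,...,T-1.
lam, T = 0.8, 40                           # geometric weighting, so L_w = 1/lam
w = lam ** np.arange(T)

def wnorm(u):
    """Weighted sup norm ||u||_w = sup_t |u_{-t}| w_t on the truncation."""
    return float(np.max(np.abs(u) * w[:len(u)]))

# Extremal element u_{-t} := 1/w_t from the proof: ||u||_w = 1, and the shifted
# sequence (T_1 u)_{-t} = u_{-(t+1)} attains the operator norm |||T_1|||_w = L_w.
u = 1.0 / w
T1u = u[1:]
print(wnorm(u), wnorm(T1u))                # approx. 1.0 and approx. 1/lam

# Likewise |p_t(u)| = 1/w_t = ||u||_w / w_t, matching |||p_t|||_w = 1/w_t.
t = 7
assert abs(u[t] - wnorm(u) / w[t]) < 1e-9
```

The shift by one step multiplies the attained weighted norm by exactly the supremum of the consecutive weight ratios, which is the content of |||T1|||w = Lw.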
The proofs for the space (ℓ (Rn), ) can be obtained by replacing in the previous arguments the weighting sequence w by the constant sequence wι. 6.4. Equivalence of FMP and differentiability in filters and functionals The facts established in Lemma 1 can be used to show the equivalence between the continuity and the differentiability of causal and time-invariant filters and that of their associated functionals. The following result focuses on continuity and the fading memory property and generalizes to the context of eventually unbounded inputs the equivalence between fading memory filters and functionals established in (Grigoryeva and Ortega, 2018b, Propositions 2.11 and 2.12) for uniformly bounded inputs. In the results that follow we work in a setup slightly more general than the one that is customary in the literature as we will allow for the weighting sequences considered in the domain and the target of the filters to be different. This degree of generality is needed later on in the text. Proposition 37 Let Vn (Rn)Z and VN RN Z be time-invariant subsets and let DN RN. Let w1, w2 be weighting sequences with inverse decay ratios Lw1 and Lw2, respectively. (i) Let U : Vn ℓw1 (Rn) VN ℓw2 (RN) be a causal and time-invariant filter. If U has the fading memory property then so does its associated functional HU : Vn p0(VN). The same conclusion holds for continuous filters U : Vn ℓ (Rn) VN ℓ (RN). (ii) Let H : Vn ℓw1 (Rn) DN be a fading memory functional. If Lw1 is finite and DN is compact then the associated causal and time-invariant filter UH : Vn ℓw1 (Rn) (DN)Z ℓw2 (RN) has also the fading memory property. (iii) Let H : Vn ℓw1 (Rn) DN be a fading memory functional and suppose that Vn contains a point z0 such that UH(z0) ℓw2 (R), where UH is the causal and time-invariant filter associated to H. 
If H is Lipschitz, c H is a Lipschitz constant, and the weighting sequences satisfy one of the following two conditions either Rw1,w2 := sup s,t N w1 t w2 s w1 t+s < + or the sequence Lw1 := L t w1 t Z ℓw2 (R), (6.30) then UH : Vn ℓw1 (Rn) (DN)Z ℓw2 (RN) has also the fading memory property, it is Lipschitz, and Rw1,w2c H or Lw1 w2c H, respectively, is a Lipschitz constant of UH. The same conclusion holds for continuous functionals H : Vn ℓ (Rn) DN where the condition (6.30) is not needed. Grigoryeva and Ortega Proof. (i) As HU is given by HU = p0 U, the FMP (respectively, continuity) of U and the first part of Lemma 1 prove the statement. (ii) Notice first that as UH = Y t Z H T t (6.31) then, as Lw1 is finite, UH is by the second part of Lemma 1 the Cartesian product of continuous functions Ht := H T t : Vn ℓw1 (Rn) DN. Since DN is by hypothesis compact, the result follows from the first part of Lemma 36. (iii) Let z1, z2 Vn arbitrary. Then by (6.31) and the Lipschitz hypothesis on H, we have that UH(z1) UH(z2) w2 = sup t Z H(T t(z1)) H(T t(z2)) w2 t c H sup t Z T t(z1) T t(z2) w1 w2 t . (6.32) If the first condition in (6.30) is satisfied, this expression is bounded above by c H sup t,s Z T t(z1)s T t(z2)s w2 tw1 s = c H sup t,s Z ( z1 t+s z2 t+s w1 (t+s) w2 tw1 s w1 (t+s) Rw1,w2c H z1 z2 w1 which proves that in that case UH has the fading memory property, it is Lipschitz, and Rw1,w2c H is a Lipschitz constant. If the second condition in (6.30) is satisfied then the inverse decay ratio Lw1 is necessarily finite and hence (6.32) can be bounded using the second part of Lemma 1 as c H sup t Z T t(z1) T t(z2) w1 w2 t c H z1 z2 w1 sup t Z L t w1w2 t = Lw1 w2c H z1 z2 w1 , which proves that in that case Lw1 w2c H is a Lipschitz constant of UH. The Lipschitz continuity of UH together with the hypothesis on the existence of a point z0 such that UH(z0) ℓw2 (R) guarantee that UH maps into ℓw2 (RN) using a strategy similar to the one followed in (6.20). 
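With a single weighting sequence w in both the domain and the target, the first quantity in (6.30) reduces to Rw = sup_{s,t∈N} {wt ws / wt+s}. The sketch below (truncated suprema, so only an approximation of the true supremum; the sequence choices follow the examples of Remark 2) evaluates it for a geometric, a harmonic, and a super-exponentially decaying weighting sequence:

```python
import math

def R_w(w, T=10):
    """Truncated version of R_w = sup_{s,t in N} w_t * w_s / w_{t+s}."""
    return max(w(t) * w(s) / w(t + s) for t in range(T) for s in range(T))

geometric = lambda t: 0.5 ** t             # ratio is identically 1: R_w = 1
harmonic  = lambda t: 1.0 / (1.0 + t)      # (1+t+s) <= (1+t)(1+s): R_w = 1
gaussian  = lambda t: math.exp(-t * t)     # ratio is exp(2 s t): unbounded

print(R_w(geometric), R_w(harmonic), R_w(gaussian))
```

For the first two families the truncated supremum already equals the exact value 1, so condition (6.30) holds; for wt = exp(−t²) the truncated supremum grows without bound as the truncation is enlarged, matching the failure of (6.30) for that sequence.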
The proof for the spaces ℓ∞(Rn) and ℓ∞(RN) is obtained by taking as weighting sequences the constant sequence wι given by wιt := 1, for all t ∈ N, which automatically satisfies both conditions in (6.30). Remark 38 When in part (iii) we consider the same weighting sequence w for the domain and the target, it is easy to see that Rw := sup_{s,t∈N} {wt ws / wt+s} satisfies Rw ≤ ∥Lw∥w and therefore the second condition in (6.30) implies the first one. Indeed, Rw = sup_{s,t∈N} {(wt/wt+1)(wt+1/wt+2) · · · (wt+s−1/wt+s) ws} ≤ sup_{s∈N} {Ls_w ws} = ∥Lw∥w, as required. In this setup, the condition (6.30) is satisfied by many families of commonly used weighting sequences. In the two examples considered in Remark 2 we have that Rw = ∥Lw∥w = 1 for the geometric sequence; for the harmonic sequence ∥Lw∥w = +∞ but Rw = 1 and hence (6.30) is still satisfied. We emphasize that condition (6.30) is not automatically satisfied by all weighting sequences. For example, as we saw in Remark 2, the sequence wt := exp(−t2) is such that Lw = +∞ and, additionally, it is easy to see that Rw = sup_{s,t∈N} {exp(2st)} = +∞. Proposition 39 Let w1 and w2 be two weighting sequences with inverse decay ratios Lw1 and Lw2, respectively. Let Vn ⊂ ℓw1(Rn) and VN ⊂ ℓw2(RN) be time-invariant open subsets, and let DN be an open subset of RN. (i) Let U : Vn ⊂ ℓw1(Rn) → VN ⊂ ℓw2(RN) be a causal and time-invariant filter. If U is of class Cr(Vn) (respectively, smooth or analytic) when considered as a map U : (Vn ⊂ ℓw1(Rn), ∥·∥w1) → (VN ⊂ ℓw2(RN), ∥·∥w2), then so is the associated functional HU : (Vn ⊂ ℓw1(Rn), ∥·∥w1) → p0(VN) ⊂ RN. Moreover, |||Dr HU(z)|||w1 ≤ |||Dr U(z)|||w1,w2, for any z ∈ Vn. (6.33) The same conclusion holds when the weighted sequence spaces are replaced by (ℓ∞(Rn), ∥·∥∞) and (ℓ∞(RN), ∥·∥∞). (ii) Let H : Vn ⊂ ℓw1(Rn) → DN be a functional and suppose that Vn is convex and contains a point z0 such that UH(z0) ∈ ℓw2(RN), where UH is the causal and time-invariant filter associated to H. If the functional H is of class Cr(Vn) and for any j ∈ {1, . . .
, r} we have that cj := supz Vn Dj H(z) w1 < + and the weighting sequences satisfy that Lw1,j := (L jt w1 )t Z ℓw2 (R), (6.34) then the associated causal and time-invariant filter UH is differentiable of order r when considered as a map UH : Vn ℓw1 (Rn), w1 (DN)Z ℓw2 (RN), w2 . Moreover, for any z Vn, |||Dr UH(z)|||w1,w2 cr Lw1,r w2. (6.35) Additionally, UH is of class Cr 1(Vn) and the map Dr 1UH : (Vn, w1) Lr 1 ℓw1 (Rn), ℓw2 (RN) , ||| |||w1,w2 is Lipschitz continuous with Lipschitz constant cr Lw1,r w2. The same conclusion holds when the weighted sequence spaces are replaced by ℓ (Rn), and ℓ (RN), . In that case the inequality (6.35) holds with Lw1,r w2 = 1. (iii) Let H : Vn ℓw1 (Rn) DN be a functional and suppose that Vn is convex and contains a point z0 such that UH(z0) ℓw2 (R), where UH is the causal and time-invariant filter associated to H. If the functional H is smooth and cr < + for all r N+, then so is the associated causal and time-invariant filter UH : Vn ℓw1 (Rn), w1 (DN)Z ℓw2 (RN), w2 . The same conclusion holds when the weighted spaces are replaced by ℓ (Rn), and ℓ (RN), . In that case, if H is analytic then so is UH and the radius of convergence of the series expansion of UH is bigger or equal than that of H. Proof. (i) Recall first that HU can be written as HU = p0 U. The chain rule and the linearity of the projection p0 imply that Dr HU(z) = p0 Dr U(z) for any z Vn. The first part of Lemma 1 guarantees then that HU is of class Cr(Vn) and that |||Dr HU(z)|||w1 = |||p0 Dr U(z)|||w1 |||p0|||w2 |||Dr U(z)|||w1,w2 = |||Dr U(z)|||w1,w2, for any z Vn, as required. The proof for the spaces ℓ (Rn) and ℓ (RN) is obtained by taking as sequence w the constant sequence wι given by wι t := 1, for all t N. Grigoryeva and Ortega (ii) First of all, notice that the hypothesis on c1 and the convexity of Vn imply via the mean value theorem Abraham et al. (1988) that H is Lipschitz. 
Moreover, the hypothesis on Lw,1 in the statement implies that condition (6.30) is satisfied and hence the third part in Proposition 37 guarantees that UH maps into ℓw2 (RN). Now, the expression (6.31) implies that for any z Vn, Dr UH(z) = Y t Z Dr H(T t(z)) (T t, . . . , T t) | {z } r times , r 1. (6.36) In order to prove (6.33) consider u1, . . . , ur ℓw1 (Rn) arbitrary and notice that by the second part of Lemma 1 we have Dr UH(z) u1, . . . , ur w2 = sup t Z Dr H(T t(z)) T t(u1), . . . , T t(ur) w2 t T t(u1) w1 T t(ur) w1 w2 t u1 w1 ur w1 L rt w1 w2 t cr Lw1,r w2 u1 w1 ur w1 , as required. We now show that UH is of class Cr 1(Vn). Let z1, z2 Vn arbitrary. Then, using a strategy similar to that one in the last inequality in the previous expression, we have Dr 1UH(z1) Dr 1UH(z2) w1,w2 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( Dr 1UH(z1) Dr 1UH(z2) u1, . . . , ur 1 w2 u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z Dr 1H(T t(z1)) Dr 1H(T t(z2)) T t(u1), . . . , T t(ur 1) w2 t u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z cr T t(z1) T t(z2) w1 T t(u1) w1 T t(ur 1) w1 w2 t u1 w1 ur 1 w1 u1,...,ur 1 ℓw1 (Rn) u1,...,ur 1 =0 ( supt Z cr z1 z2 w1 u1 w1 ur 1 w1L rt w1 w2 t u1 w1 ur 1 w1 cr Lw1,r w2 z1 z2 w1 , which shows that the map Dr 1UH : (Vn, w1) Lr 1(ℓw1 (Rn), ℓw2 (RN)), ||| |||w1,w2 is Lipschitz continuous with Lipschitz constant cr Lw1,r w2. (iii) First, the condition cr < + for all r N+ implies by part (ii) that UH is smooth if H is. Suppose now that we work with the supremum norm. The expression (6.36) shows that the point z Vn belongs to the domain of convergence of the series expansion of UH if and only if all the points T t(z) belong to the domain of convergence of the series expansion of H. Finally, suppose that z Vn belongs to the domain of convergence of the series expansion of H. 
Since |||T t||| 1 for all t Z by Lemma 1, we have that T t(z) z , which guarantees that all the points T t(z) belong to the domain of convergence of the series expansion of H and hence, by the argument above, z Vn belongs to the domain of convergence of the series expansion of UH, which proves the statement. Differentiable reservoir computing Remark 40 An important consequence of part (ii) in this proposition and, in particular, of the condition (6.34) is that, in general, one cannot obtain (higher order) differentiable filters out of differentiable functionals using the same weighted norm in the domain and the target of the filter. The weighted norm in the target needs to be chosen so that it satisfies the nonautomatic condition (6.34) that, additionally, depends on the differentiability degree that we want to preserve. Weighted norms that satisfy that property are relatively easy to find in most cases. For example, if we take as w1 the geometric sequence in Remark 2, then Lw1,j = λ jt t N and hence condition (6.34) is satisfied if we take as w2 any sequence of the type w1 r (using the notation in Lemma 35) with r j. 6.5. Proof of Theorem 7 Consider the map F : (DN)Z ℓw (RN) Vn (DN)Z (x, z) 7 (F(x, z))t := F(xt 1, zt). (6.37) We now show that first, under the two sets of hypotheses in the statement, F actually maps into (DN)Z ℓw (RN) and second, that F is continuous. Suppose first that we are in the hypotheses in (i). Since DN is compact then (DN)Z ℓw (RN) and hence F obviously maps into (DN)Z ℓw (RN). Regarding the continuity, notice that F can be written as t Z Ft with Ft := F pt (T1 id Vn) : (DN)Z ℓw (RN) Vn DN. (6.38) The continuity of F, the fact that Lw is by hypothesis finite, and Lemma 1 imply that all the functions Ft : (DN)Z ℓw (RN) Vn ℓw (RN) ℓw (Rn) DN RN are continuous and moreover, they map into a compact subset of RN. 
An argument mimicking the proof of the first part of Lemma 36 allows us to conclude that F : (DN)Z ℓw (RN) Vn (DN)Z ℓw (RN) is a continuous map. Suppose now that we are in the hypotheses in part (ii). We now show that since F is Lipschitz then so are all the functions Ft := F pt (T1 id Vn), t Z , by Lemma 1, where we consider the direct sum of weighted spaces ℓw (RN) ℓw (Rn) as a Banach space with the sum norm w w defined by (x, z) w w := x w + z w, for any (x, z) ℓw (RN) ℓw (Rn). Indeed, let c F be the Lipschitz constant of F and let (x1, z1), (x2, z2) (DN)Z ℓw (RN) Vn, then: F pt (T1 id Vn) (x1, z1) F pt (T1 id Vn) (x2, z2) c F pt (T1 id Vn) (x1 x2, z1 z2) (T1 id Vn) (x1 x2, z1 z2) w c F w t (Lw x1 x2 w + z1 z2 w) w t Lw( x1 x2 w + z1 z2 w) = c F w t Lw (x1, z1) (x2, z2) w w . (6.39) This chain of inequalities show that Ft is a Lipschitz continuous function and that c F Lw/w t is a Lipschitz constant. Given that the sequence c F := (c F Lw/w t)t Z is such that c F w = c F Lw < + , the part (ii) of Lemma 36 in the Appendices guarantees that F is Lipschitz continuous and that c F Lw is a Lipschitz constant, that is, F(x1, z1) F(x2, z2) w c F Lw (x1, z1) (x2, z2) w w . (6.40) Moreover, let u0 := (x0, z0) (DN)Z ℓw (RN) Vn. The fact that u0 is a solution of the reservoir system implies that F(u0) = x0 (DN)Z ℓw (RN). An argument mimicking (6.20) in the proof of part (ii) in Lemma 36 proves that in those conditions F maps into (DN)Z ℓw (RN). Grigoryeva and Ortega We now show that in the presence of hypothesis (3.3) F is a contraction on the first entry with constant c Lw < 1. Indeed, for any x1, x2 (DN)Z ℓw (RN) and any z Vn, we have F(x1, z) F(x2, z) w = sup t Z F(x1 t 1, zt) F(x2 t 1, zt) w t sup t Z x1 t 1 x2 t 1 cw t , (6.41) where we used that F is a contraction on the first entry. Now, x1 t 1 x2 t 1 cw t = c sup t Z x1 t 1 x2 t 1 w (t 1) w t w (t 1) c Lw x1 x2 w . 
(6.42) This shows that F is a family of contractions with constant c Lw < 1 that is continuously parametrized by the elements in Vn. Since by hypothesis, the domain (DN)Z ℓw (RN) is complete, Theorem 6.4.1 in Sternberg (2010) implies the existence of a continuous map U F : (Vn, w) (DN)Z ℓw (RN), w that is uniquely determined by the identity F U F (z), z = U F (z), for all z Vn. (6.43) The causality and the time-invariance of U F are a consequence of the time invariance of Vn and of Proposition 2.1 in Grigoryeva and Ortega (2018a). We now assume that F is Lipschitz on the second component and prove (3.4). The relation (6.43) that defines U F is equivalent to U F (z)t = F(U F (z)t 1, zt), for all z Vn, t Z . Consequently, for any z1, z2 Vn, we have, U F (z1) U F (z2) w = sup t Z U F (z1)t U F (z2)t w t F(U F (z1)t 1, z1 t) F(U F (z2)t 1, z2 t) w t F(U F (z1)t 1, z1 t) F(U F (z1)t 1, z2 t) + F(U F (z1)t 1, z2 t) F(U F (z2)t 1, z2 t) w t Lz z1 t z2 t w t + c U F (z1)t 1 U F (z2)t 1 w t . If we repeat this procedure i times, it is easy to see that U F (z1) U F (z2) w j=0 cj z1 t j z2 t j w t + ci+1 sup t Z U F (z1)t (i+1) U F (z2)t (i+1) w t . (6.44) Differentiable reservoir computing We now study separately the two summands in the right hand side of the previous inequality. First, by Lemma 1, j=0 cj z1 t j z2 t j w t = Lz sup t Z j=0 cj Tj(z1) = Lz sup t Z j=0 cj Tj(z1 z2)t w t j=0 cj sup t Z Tj(z1 z2)t w t j=0 cj Tj(z1 z2) w Lz z1 z2 w j=0 cj|||Tj|||w j=0 (c Lw)j = Lz z1 z2 w 1 (c Lw)i+1 1 c Lw , (6.45) while the second summand can be bounded as follows ci+1 sup t Z U F (z1)t (i+1) U F (z2)t (i+1) w t = ci+1 sup t Z Ti+1 U F (z1) t Ti+1 U F (z2) = ci+1 Ti+1(U F (z1) (U F (z2)) w ci+1|||Ti+1|||w U F (z1) U F (z2) w (c Lw)i+1 U F (z1) U F (z2) w . 
(6.46) If we now chain the inequalities (6.45) and (6.46) with (6.44) we can conclude that (1 (c Lw)i+1) U F (z1) U F (z2) w Lz z1 z2 w 1 (c Lw)i+1 1 c Lw , (6.47) which after simplification using the condition (3.3) results in (3.4). Remark 41 A slight modification of the proof of Theorem 7 (ii) can be used to extend this statement to reservoir systems with inputs and outputs in ℓp,w (Rn) and ℓp,w (RN), respectively. Indeed, assume that we are under the hypotheses of Theorem 7 (ii) with those spaces instead of ℓw (Rn) and ℓw (RN). Suppose, additionally, that c L1/p w < 1. (6.48) Then, there exists a unique causal and time-invariant continuous reservoir filter U F : (Vn, p,w) ((DN)Z ℓp,w (RN), p,w). Additionally, U F is also Lipschitz with constant LU F := Lz 1 c L1/p w . The proof of this fact is carried out by showing that the map F in (6.38) is Lipschitz continuous when ℓp,w (Rn) and ℓp,w (RN) spaces are considered in its domain and target, respectively, with Lipschitz constant c F L1/p w and hence (6.40) holds in that situation. Indeed, for any (x1, z1), (x2, z2) (DN)Z Grigoryeva and Ortega ℓp,w (RN) Vn we can show using the statements in Remark 4 that F(x1, z1) F(x2, z2) p p,w = X Ft(x1 t 1, z1 t) Ft(x2 t 1, z2 1) p w t F(x1 t 1, z1 t) Ft(x2 t 1, z2 t) p w t cp F X x1 t 1 x2 t 1 p w t + cp F X z1 t z2 t p w t cp F T1(x1 x2) p p,w + cp F z1 z2 p p,w cp F (L1/p w )p x1 x2 p p,w + cp F z1 z2 p p,w cp F (L1/p w )p (x1, z1) (x2, z2) p p,w w , where in the last inequality we used that L1/p w > 1. We now show that F is a contraction on the first entry whenever condition (6.48) is satisfied. Indeed, F(x1, z) F(x2, z) p p,w = X F(x1 t 1, zt) Ft(x2 t 1, zt) p w t x1 t 1 x2 t 1 p w t = cp T1(x1 x2) p p,w cp(L1/p w )p x1 x2 p p,w . The rest of the proof can be obtained by mimicking the developments after (6.42). 6.6. 
Proof of Theorem 12 Consider the map F : (DN)Z (Dn)Z (DN)Z defined in (6.37) and endow (Dn)Z and (DN)Z with the relative topologies induced by the product topologies in (Rn)Z and (RN)Z , respectively. It is easy to see that the maps pt and T1 are continuous with respect to those product topologies and hence F can be written using (6.38) as a Cartesian product of continuous functions, which is always continuous in the product topology. Consider now any weighting sequence w such that c Lw < 1. Using an argument similar to the proof of Lemma 36 (i), we can conclude that (DN)Z ℓw (RN) and that the product topology on (DN)Z coincides with the norm topology induced by w. Now, following the expressions (6.41) and (6.42) it can be shown that F is a contraction on the first entry and with respect to w. In view of these facts and given that the product topology in (Dn)Z (Rn)Z is metrizable (see (Munkres, 2014, Theorem 20.5)) and that (DN)Z (RN)Z is compact by Tychonoff s Theorem (see (Munkres, 2014, Theorem 37.3)) in the product topology and hence complete, Theorem 6.4.1 in Sternberg (2010) implies the existence of a unique fixed point of F for each z (Dn)Z , which establishes the ESP. Moreover, that result also shows the continuity of the associated filter U F : (Dn)Z ((DN)Z , w). Finally, if (Dn)Z ℓw (Rn), we know from (Grigoryeva and Ortega, 2018b, Proposition 2.9) that the inclusion ℓw (Rn) , (Rn)Z is continuous and hence so is U F when in (Dn)Z we consider the topology generated by the norm w, which establishes the FMP in that situation. 6.7. Proof of Corollary 13 Under the hypothesis in part (i), the continuity of h implies that h (DN) is compact and hence there exists a constant R > 0 such that h (DN) B (0, R). The first part of Lemma 36 guarantees that the map H := Q t Z h : ((DN)Z , w) (KR, w) is continuous and as U F h = H U F and we proved that under the hypotheses (i) in the theorem that U F : (Vn, w) (KL, w) is continuous, the claim follows. 
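The contraction argument behind the proof of Theorem 12 can be visualized numerically. For a reservoir map of the form F(x, z) = tanh(Ax + Cz) with |||A||| < 1 (an illustrative choice, not the paper's construction; since tanh is 1-Lipschitz, F is a contraction on the state entry), any two state trajectories driven by the same input converge to each other, which is the uniqueness-of-solutions phenomenon delivered by the fixed-point theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, T = 20, 3, 200                       # illustrative sizes, not from the paper
A = rng.normal(size=(N, N))
A *= 0.5 / np.linalg.norm(A, 2)            # rescale so that |||A||| = 0.5 < 1
C = rng.normal(size=(N, n))
z = rng.uniform(-1, 1, size=(T, n))        # one fixed input sequence

def run(x0):
    """Iterate x_t = tanh(A x_{t-1} + C z_t); tanh is 1-Lipschitz, so the
    update is a contraction on the state with constant |||A||| = 0.5."""
    x = x0.copy()
    for t in range(T):
        x = np.tanh(A @ x + C @ z[t])
    return x

x_a = run(np.zeros(N))
x_b = run(10.0 * np.ones(N))
print(np.linalg.norm(x_a - x_b))           # initial conditions are forgotten
```

The gap between the two trajectories shrinks at least geometrically with factor |||A|||, so after a moderate number of steps the state depends only on the input history: this is the echo state property in action.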
We now prove the statement under the hypotheses in part (ii). First, we show that if h is Lipschitz continuous in DN with constant ch then so is the map H in (DN)Z ∩ ℓw(RN). Indeed, let x1, x2 ∈ (DN)Z ∩ ℓw(RN); then ∥H(x1) − H(x2)∥w = sup_{t∈Z−} {∥h(x1t) − h(x2t)∥ w−t} ≤ ch ∥x1 − x2∥w. The hypothesis UhF(z0) ∈ ℓw(Rd) amounts to the fact that the point UF(z0) ∈ (DN)Z ∩ ℓw(RN) is such that H(UF(z0)) ∈ ℓw(Rd). An argument mimicking (6.20) in the proof of part (ii) in Lemma 36 proves that in those conditions H maps into ℓw(Rd). 6.8. Proof of the statements in Section 3.1 Proof of statement (i) for linear reservoir maps. One can show by mimicking the proof of (Grigoryeva and Ortega, 2018a, Corollary 11) that whenever condition (3.6) is satisfied for a given weighting sequence w, the reservoir system determined by (3.5) has a unique associated reservoir filter UF : ℓw(Rn) → ℓw(RN) that is determined by the linear functional HF : ℓw(Rn) → RN given by HF(z) := Σ_{j=0}^∞ A^j c z_{−j}. This linear functional is bounded because for any z ∈ ℓw(Rn), the hypothesis (3.6) implies that ∥HF(z)∥ ≤ Σ_{j=0}^∞ |||A|||^j |||c||| ∥z_{−j}∥ = |||c||| Σ_{j=0}^∞ (|||A|||^j/wj) ∥z_{−j}∥ wj ≤ |||c||| ∥z∥w Σ_{j=0}^∞ |||A|||^j/wj < +∞. We now show that for any weighting sequence w that satisfies |||A|||Lw < 1, the condition (3.6) always holds. Indeed, using (2.10) we obtain Σ_{j=0}^∞ |||A|||^j/wj ≤ Σ_{j=0}^∞ |||A|||^j L^j_w = 1/(1 − |||A|||Lw) < +∞, as required. We finally show that there exist sequences w that satisfy (3.6) but not |||A|||Lw < 1, which is one more example of the fact, already indicated in Remark 11, that the FMP condition (3.3) is sufficient but not necessary. Let w be a harmonic weighting sequence as in Remark 2 given by wj := 1/(1 + jd), j ∈ N, with d > 0. In this case Lw = 1 + d, so we can choose a value of d such that |||A|||(1 + d) > 1. However, at the same time, the condition (3.6) holds in this case because Σ_{j=0}^∞ |||A|||^j(1 + jd) = Σ_{j=0}^∞ [|||A|||^j + d(j + 1)|||A|||^j − d|||A|||^j] = Σ_{j=0}^∞ [(1 − d)|||A|||^j + d(j + 1)|||A|||^j] = (1 − d)/(1 − |||A|||) + d/(1 − |||A|||)^2 = (1 + |||A|||(d − 1))/(1 − |||A|||)^2 < +∞.
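The series representation of the linear functional can be cross-checked against the recursion x_t = A x_{t−1} + c z_t itself. The sketch below (random A rescaled so that |||A||| < 1, scalar input channel, hypothetical truncation length) runs the recursion from a zero remote past and compares the resulting state with the truncated series Σ_j A^j c z_{−j}:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 60                               # hypothetical sizes and truncation
A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)            # |||A||| = 0.8 < 1, so the series is summable
c = rng.normal(size=(N, 1))
z = rng.uniform(-1, 1, size=T)             # z[j] plays the role of the input z_{-j}

# Truncated functional H_F(z) = sum_{j=0}^{T-1} A^j c z_{-j}
H = sum(np.linalg.matrix_power(A, j) @ c * z[j] for j in range(T))

# The same value obtained by running x_t = A x_{t-1} + c z_t from a zero remote past
x = np.zeros((N, 1))
for j in reversed(range(T)):
    x = A @ x + c * z[j]
print(np.linalg.norm(x - H))               # agreement up to round-off
```

Both computations produce the same truncated sum, and since |||A||| < 1 the neglected tail is bounded by a geometric series, mirroring the boundedness argument in the proof.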
Another example in this direction can be obtained by using nilpotent matrices. If A is nilpotent then (3.6) is always satisfied for any weighting sequence w. At the same time, there are nilpotent matrices with arbitrarily large norm |||A|||, which shows once more that (3.6), and hence the FMP, can hold without (3.7) being necessarily true. We notice too that reservoir systems determined by nilpotent matrices always satisfy the echo state property even though they are not necessarily contractions. Proof of statement (ii) for linear reservoir maps. We first prove the statement (3.8). For any x ∈ B∞(0, L) and z ∈ B∞(0, M), ∥F(x, z)∥ = ∥Ax + cz∥ ≤ |||A|||L + |||c|||M = L, as required. This implies that the reservoir map F in (3.5) restricts to a map FL,M : B∞(0, L) × B∞(0, M) → B∞(0, L) that is a contraction on the first entry with constant |||A||| < 1 and hence satisfies the hypotheses of Corollary 10. This guarantees the existence of a unique associated causal and time-invariant filter UF : KM → KL that has the fading memory property with respect to any weighting sequence w. Proof of the statements for state-affine systems. We prove only statement (i) since statement (ii) can be easily obtained by mimicking the similar statement for the linear case. Indeed, a straightforward generalization of (Grigoryeva and Ortega, 2018a, Proposition 14) shows that whenever Mp < 1 and Mq < +∞, the reservoir system determined by (3.11) has a unique associated reservoir filter UF : ℓw(Rn) ∩ (Dn)Z → ℓw(RN) that is determined by the functional HF : ℓw(Rn) ∩ (Dn)Z → RN given by HF(z) := Σ_{j=0}^∞ p(z0)p(z−1) · · · p(z−j+1) q(z−j). Mimicking the proof of (Grigoryeva and Ortega, 2018a, Proposition 16) it can be shown that there exists a constant Cp,q > 0 that depends exclusively on p and q such that for any z, s ∈ ℓw(Rn), ∥HF(z) − HF(s)∥ ≤ Cp,q Σ_{j=0}^∞ M^j_p ∥z−j − s−j∥ = Cp,q Σ_{j=0}^∞ (M^j_p/wj) ∥z−j − s−j∥ wj ≤ Cp,q (Σ_{j=0}^∞ M^j_p/wj) ∥z − s∥w, which shows that HF : ℓw(Rn) ∩ (Dn)Z → RN is Lipschitz continuous whenever the condition (3.12) holds.
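The nilpotent case is easy to illustrate: if A is a scaled shift matrix then A^N = 0, so the associated filter has exact finite memory N and satisfies the echo state property even though |||A||| is large. A minimal sketch (shift matrix scaled by 10, arbitrary input; all choices illustrative, not from the paper):

```python
import numpy as np

N = 5
A = 10.0 * np.diag(np.ones(N - 1), k=-1)   # nilpotent shift matrix: A^N = 0, |||A||| = 10
c = np.ones((N, 1))
rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, size=100)

# Run the recursion x_t = A x_{t-1} + c z_t over the whole input...
x = np.zeros((N, 1))
for zt in z:
    x = A @ x + c * zt

# ...and compare with the series truncated at N terms: since A^j = 0 for j >= N,
# the filter has exact finite memory despite |||A||| = 10 > 1.
H = sum(np.linalg.matrix_power(A, j) @ c * z[-1 - j] for j in range(N))
print(np.linalg.norm(x - H))               # only the last N inputs matter
```

The state after one hundred inputs coincides with the N-term series over the most recent inputs, so the initial condition is forgotten exactly: the ESP holds even though the map is far from being a contraction.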
The last claim regarding the relation between (3.12) and the FMP condition (3.3) is proved by mimicking the similar statement in the linear case.

6.9. Proof of Theorem 14

We start with a preliminary result whose proof mimics that of Lemma 36 and is also a consequence of Lemma 1. As we already did in the proof of Theorem 7, in the statement we consider the direct sum of weighted spaces $\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)$ as a Banach space with the sum norm $\|\cdot\|_{w \oplus w}$ defined by $\|(\mathbf{u}, \mathbf{v})\|_{w \oplus w} := \|\mathbf{u}\|_w + \|\mathbf{v}\|_w$, for any $(\mathbf{u}, \mathbf{v}) \in \ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)$. Additionally, in all that follows $V_n$ stands for an open convex subset of the Banach space $(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)$.

Lemma 42 In the hypotheses of the theorem, consider the map
\[
\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \longrightarrow (\mathbb{R}^N)^{\mathbb{Z}_-}, \qquad \mathcal{F}(\mathbf{x}, \mathbf{z})_t := F(x_{t-1}, z_t), \tag{6.49}
\]
where $V_n$ is an open convex subset of $\ell^\infty_w(\mathbb{R}^n)$. Then,

(i) $\mathcal{F}$ is Lipschitz continuous with constant $L_F L_w$ and maps into $\ell^\infty_w(\mathbb{R}^N)$.

(ii) If $F$ is of class $C^r(\mathbb{R}^N \times \mathbb{R}^n)$, $r \ge 1$, suppose that
\[
L_{F,r} := \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D^r F(\mathbf{x}, \mathbf{z})||| \} < +\infty, \tag{6.50}
\]
and let $w'$ be any weighting sequence such that
\[
c_{w', w^r} := \sup_{t \in \mathbb{Z}_-} \Big\{ \frac{w'_{-t}}{w^r_{-t}} \Big\} < +\infty. \tag{6.51}
\]
Then the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \subset \ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_{w'}(\mathbb{R}^N)$ is differentiable of order $r$ and of class $C^{r-1}(\ell^\infty_w(\mathbb{R}^N) \times V_n)$. Moreover,
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w'} \le L_{F,r} L_w^r c_{w', w^r}, \quad \text{for all } (\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n, \tag{6.52}
\]
and the map $D^{r-1} \mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to L^{r-1}\big(\ell^\infty_w(\mathbb{R}^n) \oplus \ell^\infty_w(\mathbb{R}^N), \ell^\infty_{w'}(\mathbb{R}^N)\big)$ is Lipschitz continuous with Lipschitz constant $L_{F,r} L_w^r c_{w', w^r}$.

(iii) The linear map $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \colon (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$ is a contraction with constant $L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w < 1$.

These results also hold when the spaces $(\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w)$ and $(\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$ are replaced by $(\ell^\infty(\mathbb{R}^n), \|\cdot\|_\infty)$ and $(\ell^\infty(\mathbb{R}^N), \|\cdot\|_\infty)$, respectively. In that case, the statement is obtained by taking as the sequences $w$ and $w'$ the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$. The inequality (6.52) holds true with $L_w = c_{w', w^r} = 1$.

Proof of the lemma.
(i) Notice first that, as we pointed out in (6.38), and using the notation in Lemma 36, $\mathcal{F} = \prod_{t \in \mathbb{Z}_-} \mathcal{F}_t$, where
\[
\mathcal{F}_t := F \circ p_t \circ (T_1 \times \mathrm{id}_{V_n}) \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \longrightarrow \mathbb{R}^N. \tag{6.53}
\]
Also, the hypothesis (4.1), the mean value theorem, and the convexity of the set $p_t \circ (T_1 \times \mathrm{id}_{V_n})\big(\ell^\infty_w(\mathbb{R}^N) \times V_n\big)$ imply that $F$ is a Lipschitz function with constant $L_F$. A development identical to (6.39) guarantees that the maps $\mathcal{F}_t$ are Lipschitz and that $L_F L_w / w_{-t}$ is a Lipschitz constant for $\mathcal{F}_t$, $t \in \mathbb{Z}_-$. Given that the sequence $c^{\mathcal{F}} := (L_F L_w / w_{-t})_{t \in \mathbb{Z}_-}$ is such that $\|c^{\mathcal{F}}\|_w = L_F L_w < +\infty$ and $\mathcal{F} = \prod_{t \in \mathbb{Z}_-} \mathcal{F}_t$, part (ii) of Lemma 36 guarantees that $\mathcal{F}$ is Lipschitz continuous and that $L_F L_w$ is a Lipschitz constant for it. Since by hypothesis the reservoir system has a solution $(\mathbf{x}^0, \mathbf{z}^0) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$, we have that $\mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) = \mathbf{x}^0 \in \ell^\infty_w(\mathbb{R}^N)$. This implies that $\mathcal{F}$ maps into $\ell^\infty_w(\mathbb{R}^N)$, since the Lipschitz condition that we just proved shows that for any $(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$
\[
\|\mathcal{F}(\mathbf{x}, \mathbf{z})\|_w \le L_F L_w \|(\mathbf{x}, \mathbf{z}) - (\mathbf{x}^0, \mathbf{z}^0)\|_{w \oplus w} + \|\mathcal{F}(\mathbf{x}^0, \mathbf{z}^0)\|_w,
\]
which shows that $\|\mathcal{F}(\mathbf{x}, \mathbf{z})\|_w < +\infty$ and hence that $\mathcal{F}(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N)$.

(ii) The expression (6.53), the chain rule, the finiteness of $L_w$, and the linearity of $p_t$ and $T_1$ imply that for any $(\mathbf{x}, \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times V_n$:
\[
D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z}) = D^r F(x_{t-1}, z_t) \circ \big(p_t \circ (T_1 \times \mathrm{id}_{V_n}), \ldots, p_t \circ (T_1 \times \mathrm{id}_{V_n})\big) \colon \big(\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)\big)^r \longrightarrow \mathbb{R}^N. \tag{6.54}
\]
We now prove (6.52). Notice first that for $\mathbf{u} = (u^1, \ldots, u^r) = \big((u^1_x, u^1_z), \ldots, (u^r_x, u^r_z)\big) \in \big(\ell^\infty_w(\mathbb{R}^N) \oplus \ell^\infty_w(\mathbb{R}^n)\big)^r$ we can write, using (4.1) and Lemma 1:
\begin{align*}
\big\|D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z}) \cdot \mathbf{u}\big\| &= \big\|D^r F(x_{t-1}, z_t)\big(p_t \circ (T_1 \times \mathrm{id}_{V_n})(u^1), \ldots, p_t \circ (T_1 \times \mathrm{id}_{V_n})(u^r)\big)\big\| \\
&\le |||D^r F(x_{t-1}, z_t)||| \, \frac{1}{w^r_{-t}} \big\|(T_1(u^1_x), u^1_z)\big\|_{w \oplus w} \cdots \big\|(T_1(u^r_x), u^r_z)\big\|_{w \oplus w} \\
&\le \frac{L_{F,r}}{w^r_{-t}} \big(\|T_1(u^1_x)\|_w + \|u^1_z\|_w\big) \cdots \big(\|T_1(u^r_x)\|_w + \|u^r_z\|_w\big) \\
&\le \frac{L_{F,r}}{w^r_{-t}} \big(L_w \|u^1_x\|_w + \|u^1_z\|_w\big) \cdots \big(L_w \|u^r_x\|_w + \|u^r_z\|_w\big) \le \frac{L_{F,r} L_w^r}{w^r_{-t}} \|u^1\|_{w \oplus w} \cdots \|u^r\|_{w \oplus w},
\end{align*}
which shows that
\[
|||D^r \mathcal{F}_t(\mathbf{x}, \mathbf{z})|||_w \le \frac{L_{F,r} L_w^r}{w^r_{-t}}. \tag{6.55}
\]
Since, as we saw in part (i), $\mathcal{F}$ maps into $\ell^\infty_w(\mathbb{R}^N)$ and, by Lemma 35, $\ell^\infty_w(\mathbb{R}^N) \subset \ell^\infty_{w^r}(\mathbb{R}^N)$, the map $\mathcal{F}$ also maps into $\ell^\infty_{w^r}(\mathbb{R}^N)$.
Additionally, since the sequence $c^r := (L_{F,r} L_w^r / w^r_{-t})_{t \in \mathbb{Z}_-}$ is such that $\|c^r\|_{w^r} = L_{F,r} L_w^r < +\infty$, part (iii) of Lemma 36 guarantees that the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to \ell^\infty_{w^r}(\mathbb{R}^N)$ is differentiable of order $r$ and that
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w^r} \le L_{F,r} L_w^r < +\infty. \tag{6.56}
\]
This argument can be reproduced with the power sequence $w^r$ replaced by any other sequence $w'$ that satisfies (6.51), in which case it is easy to see that $\ell^\infty_{w^r}(\mathbb{R}^N) \subset \ell^\infty_{w'}(\mathbb{R}^N)$, and we can conclude the differentiability of the map $\mathcal{F} \colon \ell^\infty_w(\mathbb{R}^N) \times V_n \to \ell^\infty_{w'}(\mathbb{R}^N)$, for which the relation (6.56) is replaced by
\[
|||D^r \mathcal{F}(\mathbf{x}, \mathbf{z})|||_{w, w'} \le L_{F,r} L_w^r c_{w', w^r} < +\infty. \tag{6.57}
\]
The rest of the statement is a consequence of part (iii) of Lemma 36 applied in this setup.

(iii) A computation similar to the one that was used to establish (6.54) leads to the following expression for the partial derivatives $D_x \mathcal{F}$ of $\mathcal{F}$:
\[
D^r_x \mathcal{F}(\mathbf{x}, \mathbf{z}) = \prod_{t \in \mathbb{Z}_-} D^r_x \mathcal{F}_t(\mathbf{x}, \mathbf{z}) = \prod_{t \in \mathbb{Z}_-} D^r_x F(x_{t-1}, z_t) \circ (p_t \circ T_1, \ldots, p_t \circ T_1). \tag{6.58}
\]
Using this expression for $r = 1$ and Lemma 1 we can write, for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$,
\[
\|D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u}\|_w = \sup_{t \in \mathbb{Z}_-} \big\{ \|D_x F(x^0_{t-1}, z^0_t) \circ (p_t \circ T_1)(\mathbf{u})\| \, w_{-t} \big\} \le L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) \sup_{t \in \mathbb{Z}_-} \big\{ \|T_1(\mathbf{u})_t\| \, w_{-t} \big\} \le L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w \|\mathbf{u}\|_w,
\]
as required.

We now proceed with the proof of the theorem, in which we obtain the persistence result as a consequence of the Implicit Function Theorem and of the Lemma 42 that we just proved. Using the same notation as in that result, we define the map
\[
G \colon \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n) \longrightarrow \ell^\infty_w(\mathbb{R}^N), \qquad (\mathbf{x}, \mathbf{z}) \longmapsto \mathcal{F}(\mathbf{x}, \mathbf{z}) - \mathbf{x},
\]
or equivalently $G = \mathcal{F} - \pi_N$, where $\pi_N \colon \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n) \to \ell^\infty_w(\mathbb{R}^N)$ is just the projection onto the first factor. Notice that by construction and the hypothesis on the point $(\mathbf{x}^0, \mathbf{z}^0)$ we have that
\[
G(\mathbf{x}^0, \mathbf{z}^0) = \mathbf{0}. \tag{6.59}
\]
Since the projection $\pi_N$ is linear and, by Lemma 42, $\mathcal{F}$ is Lipschitz continuous and differentiable of order 1, then so is $G = \mathcal{F} - \pi_N$. This implies in particular that the partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is a bounded operator, which we now set out to prove is an isomorphism.
We proceed in two stages that show how the hypotheses in the statement of the theorem imply that this linear map is both injective and surjective.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is injective. Notice first that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} - \mathbf{u}$, for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$. Consequently, the points $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ such that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = \mathbf{0}$ coincide with the fixed points of the map $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$. Since by part (iii) of Lemma 42 $D_x \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0)$ is a contracting linear map in $\ell^\infty_w(\mathbb{R}^N)$, it has zero as its unique fixed point, and the claim follows.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is surjective. We prove that for any $\mathbf{v} \in \ell^\infty_w(\mathbb{R}^N)$ there exists $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ such that $D_x G(\mathbf{x}^0, \mathbf{z}^0) \cdot \mathbf{u} = \mathbf{v}$. By the definition of $\mathcal{F}$ in (6.49) and the expression of its partial derivative in (6.58), this equation is equivalent to the recursions
\[
v_t = D_x F(x^0_{t-1}, z^0_t) \cdot u_{t-1} - u_t, \quad \text{for all } t \in \mathbb{Z}_-. \tag{6.60}
\]
This equation has a unique solution given by the series
\[
u_t = -v_t + \sum_{j=1}^{\infty} D_x F(x^0_{t-1}, z^0_t) \circ D_x F(x^0_{t-2}, z^0_{t-1}) \circ \cdots \circ D_x F(x^0_{t-j}, z^0_{t-j+1})(-v_{t-j}), \quad t \in \mathbb{Z}_-. \tag{6.61}
\]
Indeed, it is straightforward to show that (6.61) satisfies (6.60). It remains then to be shown that the sequence $\mathbf{u}$ determined by (6.61) belongs to $\ell^\infty_w(\mathbb{R}^N)$. In order to do so we first show that the series in (6.61) is convergent by proving that for any $t \in \mathbb{Z}_-$, the sequence $\{S_n\}_{n \in \mathbb{N}^+}$ defined by
\[
S_n := \Big( \sum_{j=1}^{n} D_x F(x^0_{t-1}, z^0_t) \circ D_x F(x^0_{t-2}, z^0_{t-1}) \circ \cdots \circ D_x F(x^0_{t-j}, z^0_{t-j+1})(-v_{t-j}) \Big) w_{-t} \tag{6.62}
\]
is a Cauchy sequence. This is so because for any $m, n \in \mathbb{N}^+$, $m \ge n$,
\begin{align*}
\|S_m - S_n\| &\le \sum_{j=n+1}^{m} |||D_x F(x^0_{t-1}, z^0_t)||| \cdots |||D_x F(x^0_{t-j}, z^0_{t-j+1})||| \, \|v_{t-j}\| \, w_{-(t-j)} \frac{w_{-t}}{w_{-(t-j)}} \\
&\le \sum_{j=n+1}^{m} L_{F_x}(\mathbf{x}^0, \mathbf{z}^0)^j L_w^j \|\mathbf{v}\|_w = \frac{\big(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w\big)^{n+1} - \big(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w\big)^{m+1}}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w, \tag{6.63}
\end{align*}
which can be made as small as we want because, due to the hypothesis $L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w < 1$, the sequence $\{(L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w)^j\}_{j \in \mathbb{N}^+}$ is convergent and hence Cauchy.
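The series solution of the recursion (6.60) can be verified numerically in a truncated setting. In the sketch below (illustrative, not code from the paper) random contracting matrices $A_t$ stand in for the derivatives $D_x F(x^0_{t-1}, z^0_t)$, and the truncated series is checked against the recursion $v_t = A_t u_{t-1} - u_t$.

```python
import numpy as np

# The series (6.61): u_t = -v_t + sum_{j>=1} A_t A_{t-1} ... A_{t-j+1} (-v_{t-j}),
# with A_t a stand-in for D_xF(x^0_{t-1}, z^0_t). Truncate time to t = -T..0 and
# check that the truncated series satisfies v_t = A_t u_{t-1} - u_t.
rng = np.random.default_rng(1)
T, N = 60, 3
A = [0.25 / np.sqrt(N) * rng.standard_normal((N, N)) for _ in range(T + 1)]  # contracting
v = [rng.standard_normal(N) for _ in range(T + 1)]      # list index k stands for t = k - T

def u_at(k, depth=40):
    acc, prod = -v[k].copy(), np.eye(N)
    for j in range(1, min(depth, k) + 1):
        prod = prod @ A[k - j + 1]                      # composition A_t ∘ ... ∘ A_{t-j+1}
        acc += prod @ (-v[k - j])
    return acc

t = T                                                   # check the recursion at the last index
print(np.allclose(v[t], A[t] @ u_at(t - 1) - u_at(t)))  # True, up to the tiny truncation error
```

The truncation error decays geometrically with the depth, exactly as the Cauchy estimate (6.63) predicts.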
This implies that $\{S_n\}_{n \in \mathbb{N}^+}$ is convergent and hence so is the series that defines $u_t$ in (6.61). It remains to be shown that the sequence $\mathbf{u} := (u_t)_{t \in \mathbb{Z}_-}$ defined by (6.61) is an element of $\ell^\infty_w(\mathbb{R}^N)$. Following the same strategy that we used to construct the inequalities (6.63), it is easy to see that
\[
\|u_t\| \, w_{-t} \le \frac{1}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w, \quad \text{for all } t \in \mathbb{Z}_-.
\]
Consequently,
\[
\|\mathbf{u}\|_w = \sup_{t \in \mathbb{Z}_-} \{ \|u_t\| \, w_{-t} \} \le \frac{1}{1 - L_{F_x}(\mathbf{x}^0, \mathbf{z}^0) L_w} \|\mathbf{v}\|_w < +\infty,
\]
as required.

The partial derivative $D_x G(\mathbf{x}^0, \mathbf{z}^0) \colon \ell^\infty_w(\mathbb{R}^N) \to \ell^\infty_w(\mathbb{R}^N)$ is a linear homeomorphism. This fact is a consequence of the Banach Isomorphism Theorem (see for instance Abraham et al. (1988)), which states that any continuous linear isomorphism of Banach spaces necessarily has a continuous inverse.

Using all the facts that we just proved, we can invoke the Implicit Function Theorem as formulated in (Schechter, 1997, page 671) (see also Ver Eecke (1974)) to show the existence of two open neighborhoods $\widetilde{V}_{\mathbf{x}^0}$ and $\widetilde{V}_{\mathbf{z}^0}$ of $\mathbf{x}^0$ and $\mathbf{z}^0$ in $\ell^\infty_w(\mathbb{R}^N)$ and $\ell^\infty_w(\mathbb{R}^n)$, respectively, and a unique Lipschitz continuous map $\widetilde{U}^F \colon (\widetilde{V}_{\mathbf{z}^0}, \|\cdot\|_w) \to (\widetilde{V}_{\mathbf{x}^0}, \|\cdot\|_w)$ that is differentiable at $\mathbf{z}^0$ and satisfies
\[
G(\widetilde{U}^F(\mathbf{z}), \mathbf{z}) = \mathbf{0}, \quad \text{for all } \mathbf{z} \in \widetilde{V}_{\mathbf{z}^0},
\]
which is equivalent to $\mathcal{F}(\widetilde{U}^F(\mathbf{z}), \mathbf{z}) = \widetilde{U}^F(\mathbf{z})$. In view of the identities (3.1) this means, in other words, that $\widetilde{U}^F$ is the unique reservoir filter with inputs in $\widetilde{V}_{\mathbf{z}^0}$ associated to the reservoir system determined by $F$. This filter is clearly causal, and its Lipschitz continuity implies that it has the fading memory property.

We conclude the proof by showing that the filter $\widetilde{U}^F$ can be extended to a time-invariant filter $U^F$ defined on the time-invariant saturations $V_{\mathbf{x}^0}$ and $V_{\mathbf{z}^0}$ of the sets $\widetilde{V}_{\mathbf{x}^0}$ and $\widetilde{V}_{\mathbf{z}^0}$, respectively, that has the properties listed in the statement. Indeed, define
\[
V_{\mathbf{x}^0} := \bigcup_{t \in \mathbb{Z}_-} T_{-t}\big(\widetilde{V}_{\mathbf{x}^0}\big) \quad \text{and} \quad V_{\mathbf{z}^0} := \bigcup_{t \in \mathbb{Z}_-} T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big).
\]
The sets $V_{\mathbf{x}^0}$ and $V_{\mathbf{z}^0}$ are by construction time-invariant, and open by the openness of the maps $T_{-t}$ that we established in part (ii) of Lemma 1.
Define now the map $U^F \colon V_{\mathbf{z}^0} \to V_{\mathbf{x}^0}$ as
\[
U^F(T_{-t}(\mathbf{z})) := T_{-t}\big(\widetilde{U}^F(\mathbf{z})\big), \quad \text{for some } t \in \mathbb{Z}_- \text{ and } \mathbf{z} \in \widetilde{V}_{\mathbf{z}^0}. \tag{6.64}
\]
We first show that $U^F$ is well-defined and time-invariant. Let $t_1, t_2 \in \mathbb{Z}_-$ and $\mathbf{z}_1, \mathbf{z}_2 \in \widetilde{V}_{\mathbf{z}^0}$ be such that $T_{-t_1}(\mathbf{z}_1) = T_{-t_2}(\mathbf{z}_2)$. Let us now show that
\[
U^F(T_{-t_1}(\mathbf{z}_1)) = U^F(T_{-t_2}(\mathbf{z}_2)). \tag{6.65}
\]
Indeed, for any $t \in \mathbb{Z}_-$, the definition (6.64) and the causality of $\widetilde{U}^F$ imply that
\[
U^F(T_{-t_1}(\mathbf{z}_1))_t = T_{-t_1}\big(\widetilde{U}^F(\mathbf{z}_1)\big)_t = \widetilde{U}^F(\mathbf{z}_1)_{t+t_1} = \widetilde{U}^F(\mathbf{z}_2)_{t+t_2} = T_{-t_2}\big(\widetilde{U}^F(\mathbf{z}_2)\big)_t = U^F(T_{-t_2}(\mathbf{z}_2))_t,
\]
which proves (6.65). The time-invariance of $U^F$, as defined in (6.64), is straightforward.

We conclude by showing that $U^F$ is differentiable at all the points of the form $T_{-t}(\mathbf{z}^0)$, $t \in \mathbb{Z}_-$, and that it is locally Lipschitz continuous on $V_{\mathbf{z}^0}$. Since differentiability is a local property, it suffices to prove this property for the restriction of $U^F$ to open sets. Before we do that, we note that since by part (ii) of Lemma 1 the map $T_{-t} \colon \widetilde{V}_{\mathbf{z}^0} \to T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ is a submersion, the Local Onto Theorem (see (Abraham et al., 1988, Theorem 3.5.2)) guarantees that for $\mathbf{z}' := T_{-t}(\mathbf{z}^0) \in T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ there exist an open neighborhood $V_{\mathbf{z}'} \subset T_{-t}\big(\widetilde{V}_{\mathbf{z}^0}\big)$ and a smooth section $\sigma_{\mathbf{z}'} \colon V_{\mathbf{z}'} \to \widetilde{V}_{\mathbf{z}^0}$ of $T_{-t}$ that satisfies $\sigma_{\mathbf{z}'}(\mathbf{z}') = \mathbf{z}^0$ and
\[
T_{-t} \circ \sigma_{\mathbf{z}'} = \mathrm{id}_{V_{\mathbf{z}'}}. \tag{6.66}
\]
The section $\sigma_{\mathbf{z}'}$ allows us to write down the restriction $U^F|_{V_{\mathbf{z}'}}$ of $U^F$ to the open subset $V_{\mathbf{z}'}$ as
\[
U^F|_{V_{\mathbf{z}'}}(\mathbf{z}) = T_{-t}\big(\widetilde{U}^F(\sigma_{\mathbf{z}'}(\mathbf{z}))\big), \quad \text{for all } \mathbf{z} \in V_{\mathbf{z}'}. \tag{6.67}
\]
This is so because by (6.66) we have that $\mathbf{z} = T_{-t}(\sigma_{\mathbf{z}'}(\mathbf{z}))$, with $\sigma_{\mathbf{z}'}(\mathbf{z}) \in \widetilde{V}_{\mathbf{z}^0}$, as well as by (6.64). Consequently, since by (6.67) the restriction $U^F|_{V_{\mathbf{z}'}}$ is a composition of Lipschitz continuous functions, it is itself Lipschitz continuous. The differentiability of $U^F$ at the point $\mathbf{z}' = T_{-t}(\mathbf{z}^0)$ can also be concluded using (6.67) by invoking the differentiability of $T_{-t}$ and $\sigma_{\mathbf{z}'}$ on their domains and the differentiability of $\widetilde{U}^F$ at $\sigma_{\mathbf{z}'}(\mathbf{z}') = \mathbf{z}^0$.

6.10. Proof of Theorem 19

(i) We start with a lemma that shows how condition (4.8) guarantees the existence of a globally defined filter $U^F \colon (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$.
Lemma 43 Let $F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$ be a reservoir map of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$ and let $w$ be a weighting sequence with finite inverse decay ratio $L_w$. The reservoir map $F$ is a contraction on the first entry if and only if
\[
L_{F_x} < 1. \tag{6.68}
\]
Moreover, whenever conditions (4.1) and (4.8) are satisfied and $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is a solution of the reservoir system determined by $F$, then there exists a unique causal, time-invariant, and fading memory filter $U^F \colon (\ell^\infty_w(\mathbb{R}^n), \|\cdot\|_w) \to (\ell^\infty_w(\mathbb{R}^N), \|\cdot\|_w)$.

Proof of the lemma. We first show that $F$ is a contraction on the first entry if and only if $L_{F_x} < 1$. Suppose first that $F$ is a contraction with contraction rate $0 < c < 1$. Then for any $(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n$ and any $\mathbf{u} \in \mathbb{R}^N$, the partial derivative $D_x F(\mathbf{x}, \mathbf{z}) \colon \mathbb{R}^N \to \mathbb{R}^N$ satisfies
\[
\|D_x F(\mathbf{x}, \mathbf{z}) \cdot \mathbf{u}\| = \lim_{t \to 0} \frac{\|F(\mathbf{x} + t\mathbf{u}, \mathbf{z}) - F(\mathbf{x}, \mathbf{z})\|}{|t|} \le \lim_{t \to 0} \frac{c \, |t| \, \|\mathbf{u}\|}{|t|} = c \, \|\mathbf{u}\|,
\]
which implies that $|||D_x F(\mathbf{x}, \mathbf{z})||| \le c$ and hence
\[
L_{F_x} := \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D_x F(\mathbf{x}, \mathbf{z})||| \} \le c < 1.
\]
Conversely, suppose that $L_{F_x} < 1$. Since $F$ is of class $C^1(\mathbb{R}^N \times \mathbb{R}^n)$, the mean value theorem guarantees that for any $(\mathbf{x}_1, \mathbf{z}), (\mathbf{x}_2, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n$:
\[
\|F(\mathbf{x}_1, \mathbf{z}) - F(\mathbf{x}_2, \mathbf{z})\| \le \sup_{(\mathbf{x}, \mathbf{z}) \in \mathbb{R}^N \times \mathbb{R}^n} \{ |||D_x F(\mathbf{x}, \mathbf{z})||| \} \, \|\mathbf{x}_1 - \mathbf{x}_2\| = L_{F_x} \|\mathbf{x}_1 - \mathbf{x}_2\|,
\]
and $F$ is hence a contraction on the first entry.

Suppose now that conditions (4.1) and (4.8) are satisfied and that $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is a solution of the reservoir system determined by $F$. Notice first that since $L_w \ge 1$, condition (4.8) necessarily implies that $L_{F_x} < 1$ and hence, as we just proved, $F$ is a contraction on the first entry with constant $L_{F_x}$. Additionally, as (4.1) is satisfied, the mean value theorem implies that $F$ is Lipschitz continuous with constant $L_F$. All these facts allow us to invoke part (ii) of Theorem 7 to conclude the existence of the filter $U^F$ in the statement since, in this situation, condition (3.3) coincides with (4.8).
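The criterion (6.68) is easy to check empirically for the standard echo state network reservoir map. In the sketch below (illustrative, not code from the paper; the matrices and sizes are arbitrary choices) we use $F(\mathbf{x}, \mathbf{z}) = \tanh(A\mathbf{x} + C\mathbf{z})$, for which $D_x F(\mathbf{x}, \mathbf{z}) = \mathrm{diag}(1 - \tanh^2(A\mathbf{x} + C\mathbf{z}))\, A$ and hence $L_{F_x} \le |||A|||$, so rescaling $A$ below unit norm makes $F$ a contraction on the first entry.

```python
import numpy as np

# ESN-style reservoir map F(x, z) = tanh(A x + C z): since tanh is 1-Lipschitz
# componentwise, ||F(x1, z) - F(x2, z)|| <= ||A (x1 - x2)|| <= |||A||| ||x1 - x2||.
rng = np.random.default_rng(2)
N, n = 10, 2
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)              # rescale so that |||A||| = 0.9 < 1
C = rng.standard_normal((N, n))

def F(x, z):
    return np.tanh(A @ x + C @ z)

# Empirical contraction check on the first entry over random samples:
ratios = []
for _ in range(200):
    x1, x2, z = rng.standard_normal(N), rng.standard_normal(N), rng.standard_normal(n)
    ratios.append(np.linalg.norm(F(x1, z) - F(x2, z)) / np.linalg.norm(x1 - x2))
print(max(ratios) <= 0.9 + 1e-9)             # True: contraction constant bounded by |||A|||
```

This is the usual "spectral rescaling" recipe for enforcing the echo state property in ESN practice, here justified through the derivative bound $L_{F_x} \le |||A|||$ of the lemma.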
The proof of the first part of the theorem can now be obtained by applying Theorem 14 to each point of the form $(U^F(\mathbf{z}), \mathbf{z}) \in \ell^\infty_w(\mathbb{R}^N) \times \ell^\infty_w(\mathbb{R}^n)$, for which, according to its statement, there exist open neighborhoods $V_{U^F(\mathbf{z})}$ and $V_{\mathbf{z}}$ of $U^F(\mathbf{z})$ and $\mathbf{z}$ in $\ell^\infty_w(\mathbb{R}^N)$ and $\ell^\infty_w(\mathbb{R}^n)$, as well as a unique locally defined causal reservoir filter $\widetilde{U}^F \colon V_{\mathbf{z}} \to V_{U^F(\mathbf{z})}$ associated to $F$. The uniqueness feature implies that $\widetilde{U}^F = U^F|_{V_{\mathbf{z}}}$. Moreover, since $\widetilde{U}^F$ is differentiable at $\mathbf{z}$ and we can repeat this construction for any point $\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)$, we can conclude that $U^F$ is differentiable at any point in $\ell^\infty_w(\mathbb{R}^n)$. Finally, the Lipschitz continuity on $\ell^\infty_w(\mathbb{R}^n)$ of $U^F$ is a consequence of the mean value theorem, the inequality (4.5), and the fact that
\[
\sup_{\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)} |||DU^F(\mathbf{z})|||_w \le \sup_{\mathbf{z} \in \ell^\infty_w(\mathbb{R}^n)} \frac{L_{F_z}(U^F(\mathbf{z}), \mathbf{z})}{1 - L_{F_x}(U^F(\mathbf{z}), \mathbf{z}) L_w} \le \frac{L_{F_z}}{1 - L_{F_x} L_w},
\]
which proves (4.9).

(ii) First of all, the existence of the filter $U^F \colon V_n \to \ell^\infty_w(\mathbb{R}^N)$ and its differentiability at $\mathbf{z}^0 \in V_n$ imply that for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^n)$ and $t \in \mathbb{Z}_-$ it satisfies (3.1) as well as (4.4), that is,
\[
\big(DU^F(\mathbf{z}^0) \cdot \mathbf{u}\big)_t = D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \cdot \big(DU^F(\mathbf{z}^0) \cdot \mathbf{u}\big)_{t-1} + D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \cdot u_t.
\]
This identity can be rewritten in terms of operators on sequences as
\[
DU^F(\mathbf{z}^0) = \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \circ DU^F(\mathbf{z}^0) + \prod_{t \in \mathbb{Z}_-} D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big),
\]
or equivalently as
\[
\Big( \mathrm{id} - \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big) \circ DU^F(\mathbf{z}^0) = \prod_{t \in \mathbb{Z}_-} D_z F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big). \tag{6.69}
\]
This identity determines $DU^F(\mathbf{z}^0)$, which by hypothesis exists, if and only if the operator on the left hand side is invertible, which is in turn equivalent to condition (4.10). We finally show that (4.10) implies (4.11). We first notice that by Gelfand's formula (Lax, 2002, page 195) condition (4.10) is equivalent to
\[
\lim_{k \to \infty} \Big|\Big|\Big| \Big( \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big)^k \Big|\Big|\Big|_w^{1/k} < 1.
\]
This in turn implies that for any $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ we have that
\[
\lim_{k \to \infty} \Big( \Big( \prod_{t \in \mathbb{Z}_-} D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \Big) \circ T_1 \Big)^k (\mathbf{u}) = \mathbf{0},
\]
or, equivalently, that
\[
\lim_{k \to \infty} \prod_{t \in \mathbb{Z}_-} \Big( D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{t-k}, z^0_{t-k+1}\big)(u_{t-k}) \Big) = \mathbf{0}. \tag{6.70}
\]
If we now take vectors $\mathbf{u} \in \ell^\infty_w(\mathbb{R}^N)$ in (6.70) of the form $u_t := e_u / w_{-t}$, $t \in \mathbb{Z}_-$, with $e_u \in \mathbb{R}^N$ such that $\|e_u\| = 1$, and we take the supremum in (6.70) with respect to all those vectors $e_u$, we obtain that, for every $t \in \mathbb{Z}_-$,
\[
\lim_{k \to \infty} \Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{t-1}, z^0_t\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{t-k}, z^0_{t-k+1}\big) \Big|\Big|\Big| \, \frac{w_{-t}}{w_{-(t-k)}} = 0. \tag{6.71}
\]
Since for $t = 0$ the weight ratio satisfies $w_0 / w_k \ge 1$ (the weighting sequence is non-increasing), we have
\[
\Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{-1}, z^0_0\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{-k}, z^0_{-k+1}\big) \Big|\Big|\Big| \le \Big|\Big|\Big| D_x F\big(U^F(\mathbf{z}^0)_{-1}, z^0_0\big) \circ \cdots \circ D_x F\big(U^F(\mathbf{z}^0)_{-k}, z^0_{-k+1}\big) \Big|\Big|\Big| \, \frac{w_0}{w_k},
\]
and the condition (4.11) follows by taking $t = 0$ in (6.71).

6.11. Proof of Theorem 28

It follows the same scheme as that of Theorem 14. In the following paragraphs we just point out the additional facts that need to be taken into account in order to adapt that proof to this setup. The first complementary fact has to do with the second part of Lemma 42 which, using the hypothesis (4.29), allows us to conclude that the map $\mathcal{F} \colon \ell^\infty(\mathbb{R}^N) \times \ell^\infty(\mathbb{R}^n) \to \ell^\infty(\mathbb{R}^N)$ defined in (6.38) is smooth. Additionally, it can be easily seen that it is also analytic and that the radii of convergence $\rho_F$ and $\rho_{\mathcal{F}}$ of the Taylor series expansions of $F$ and $\mathcal{F}$ around $(\mathbf{x}^0, \mathbf{z}^0)$ and the associated constant sequence (that we denote with the same symbol) satisfy
\[
\rho_{\mathcal{F}} \ge \rho_F. \tag{6.72}
\]
Indeed, (6.54) implies that the Taylor series expansion of $\mathcal{F}$ around the constant sequence $(\mathbf{x}^0, \mathbf{z}^0)$ can be written, for any $\mathbf{u}^r := (\mathbf{u}, \ldots, \mathbf{u}) = ((\mathbf{u}_x, \mathbf{u}_z), \ldots, (\mathbf{u}_x, \mathbf{u}_z)) \in (\ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n))^r$, as
\[
\sum_{r=0}^{\infty} \frac{1}{r!} D^r \mathcal{F}(\mathbf{x}^0, \mathbf{z}^0) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r = \prod_{t \in \mathbb{Z}_-} \Big( \mathcal{F}_t(\mathbf{x}^0, \mathbf{z}^0) + \sum_{r=1}^{\infty} \frac{1}{r!} D^r \mathcal{F}_t(\mathbf{x}^0, \mathbf{z}^0) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r \Big)
= \prod_{t \in \mathbb{Z}_-} \Big( F(x^0_{t-1}, z^0_t) + \sum_{r=1}^{\infty} \frac{1}{r!} D^r F(x^0_{t-1}, z^0_t) \circ \big(p_t \circ (T_1 \times \mathrm{id}), \ldots, p_t \circ (T_1 \times \mathrm{id})\big) \cdot \big(\mathbf{u} - (\mathbf{x}^0, \mathbf{z}^0)\big)^r \Big). \tag{6.73}
\]
Suppose now that $\mathbf{u} = (\mathbf{u}_x, \mathbf{u}_z) \in \ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n)$ is chosen such that
\[
\|\mathbf{u}\| = \|\mathbf{u}_x\|_\infty + \|\mathbf{u}_z\|_\infty < \rho_F. \tag{6.74}
\]
Lemma 1 implies that for any $t \in \mathbb{Z}_-$ we have in that case that $\|p_t \circ (T_1 \times \mathrm{id})(\mathbf{u})\| \le \|\mathbf{u}\| < \rho_F$, and hence we can conclude that all the series labeled by $t \in \mathbb{Z}_-$ in each of the factors that make up the last term of (6.73) converge for all the elements $\mathbf{u} \in \ell^\infty(\mathbb{R}^N) \oplus \ell^\infty(\mathbb{R}^n)$ that satisfy (6.74).
This implies that such elements are inside the radius of convergence of the Taylor series expansion of $\mathcal{F}$ around the constant sequence $(\mathbf{x}^0, \mathbf{z}^0)$, and hence (6.72) holds, which, as $\rho_F$ is nontrivial by hypothesis, proves that $\mathcal{F}$ is analytic. The rest of the proof can be obtained by mimicking that of Theorem 14 where, as is customary, we replace the weighting sequence $w$ by the constant sequence $w^\iota$ given by $w^\iota_t := 1$, for all $t \in \mathbb{N}$, and $L_w$ is replaced by the constant 1. A technical modification is needed at the time of invoking the Implicit Function Theorem. In Theorem 14 we used a version that requires only first order differentiability as hypothesis and produces Lipschitz continuous implicitly defined functions. In this case we can prove that the function $G$ is analytic, and hence it can be shown that the implicitly defined local filter $\widetilde{U}^F \colon (\widetilde{V}_{\mathbf{z}^0}, \|\cdot\|_\infty) \to (\widetilde{V}_{\mathbf{x}^0}, \|\cdot\|_\infty)$ is analytic by invoking, for instance, (Valent, 1988, page 175), and references therein.

6.12. Proof of Theorem 29

Since by hypothesis $U$ is analytic in $B_{\|\cdot\|_w}(\mathbf{z}^0, M)$, then
\[
U(\mathbf{z}) = U(\mathbf{z}^0) + \sum_{j=1}^{\infty} \frac{1}{j!} D^j U(\mathbf{z}^0)(\underbrace{\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0}_{j \text{ times}}), \quad \text{for any } \mathbf{z} \in B_{\|\cdot\|_w}(\mathbf{z}^0, M). \tag{6.75}
\]
We now show that for the elements that satisfy (5.1) the series expansion (6.75) amounts to the discrete-time Volterra series expansion (5.2). Let $m \in \mathbb{Z}_-$ and let $\delta_m \in \ell^\infty_w(\mathbb{R})$ be the sequence defined by
\[
(\delta_m)_t := \begin{cases} \dfrac{1}{w_{-m}} & \text{if } t = m, \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{6.76}
\]
Note that $\|\delta_m\|_w = 1$ for all $m \in \mathbb{Z}_-$. Moreover, for any $\mathbf{z} \in \ell^\infty_w(\mathbb{R})$ we can write $\mathbf{z} - \mathbf{z}^0 = \sum_{t \in \mathbb{Z}_-} \widetilde{z}_t \delta_t$, with $\widetilde{z}_t = (z_t - z^0_t) w_{-t}$, and hence by the multilinearity of the derivatives $D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)$ and the causality of the filter $U$ we have that
\[
D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t = \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} \widetilde{z}_{m_1} \cdots \widetilde{z}_{m_j} \, D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t, \quad \text{for all } t \in \mathbb{Z}_-. \tag{6.77}
\]
We first show that for the elements that satisfy (5.1) the sum on the right hand side of (6.77) is finite. Indeed, for any $t \in \mathbb{Z}_-$:
\begin{align*}
\sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}| \, \big| D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t \big|
&\le \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}| \, \frac{1}{w_{-t}} \big\| D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j}) \big\|_w \\
&\le \frac{|||D^j U(\mathbf{z}^0)|||_w}{w_{-t}} \sum_{m_1 = -\infty}^{t} \cdots \sum_{m_j = -\infty}^{t} |\widetilde{z}_{m_1}| \cdots |\widetilde{z}_{m_j}|
= \frac{|||D^j U(\mathbf{z}^0)|||_w}{w_{-t}} \Big( \sum_{m = -\infty}^{t} |z_m - z^0_m| \, w_{-m} \Big)^j < +\infty, \tag{6.78}
\end{align*}
where the last equality is a consequence of, for example, (Apostol, 1974, Theorem 8.44), and the last inequality follows from two facts. First, as $U$ is analytic it is in particular smooth, and hence $|||D^j U(\mathbf{z}^0)|||_w < +\infty$ for all $j \in \mathbb{N}$. Second, since by hypothesis $\mathbf{z}, \mathbf{z}^0 \in \ell^{1,w}(\mathbb{R}^n)$, then $\mathbf{z} - \mathbf{z}^0 \in \ell^{1,w}(\mathbb{R}^n)$ and hence $\sum_{m = -\infty}^{t} |z_m - z^0_m| \, w_{-m} < +\infty$.

We now show that (6.75) can be rewritten as (5.2). Notice first that for any $t, m \in \mathbb{Z}_-$ such that $m \le t$, the sequences (6.76) satisfy
\[
T_{-t}(\delta_m) = \frac{w_{-(m-t)}}{w_{-m}} \, \delta_{m-t}. \tag{6.79}
\]
Second, the time-invariance of $U$ and of the sequence $\mathbf{z}^0$ imply that for any $j \in \mathbb{N}^+$, $t \in \mathbb{Z}_-$, and $\mathbf{z}^1, \ldots, \mathbf{z}^j \in \ell^\infty_w(\mathbb{R})$, we have
\[
T_{-t}\big( D^j U(\mathbf{z}^0)(\mathbf{z}^1, \ldots, \mathbf{z}^j) \big) = D^j U\big(T_{-t}(\mathbf{z}^0)\big)\big(T_{-t}(\mathbf{z}^1), \ldots, T_{-t}(\mathbf{z}^j)\big) = D^j U(\mathbf{z}^0)\big(T_{-t}(\mathbf{z}^1), \ldots, T_{-t}(\mathbf{z}^j)\big).
\]
These two relations imply that for any $t \in \mathbb{Z}_-$
\[
D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t = T_{-t}\big( D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j}) \big)_0 = D^j U(\mathbf{z}^0)\big(T_{-t}(\delta_{m_1}), \ldots, T_{-t}(\delta_{m_j})\big)_0
= D^j U(\mathbf{z}^0)(\delta_{m_1 - t}, \ldots, \delta_{m_j - t})_0 \, \frac{w_{-(m_1 - t)}}{w_{-m_1}} \cdots \frac{w_{-(m_j - t)}}{w_{-m_j}}.
\]
If we substitute this relation in the summands of (6.77), we obtain that
\[
\widetilde{z}_{m_1} \cdots \widetilde{z}_{m_j} \, D^j U(\mathbf{z}^0)(\delta_{m_1}, \ldots, \delta_{m_j})_t = (z_{m_1} - z^0_{m_1}) \cdots (z_{m_j} - z^0_{m_j}) \, w_{-(m_1 - t)} \cdots w_{-(m_j - t)} \, D^j U(\mathbf{z}^0)(\delta_{m_1 - t}, \ldots, \delta_{m_j - t})_0. \tag{6.80}
\]
Define now
\[
g_j(n_1, \ldots, n_j) := w_{-n_1} \cdots w_{-n_j} \, \frac{1}{j!} D^j U(\mathbf{z}^0)(\delta_{n_1}, \ldots, \delta_{n_j})_0 = w_{-n_1} \cdots w_{-n_j} \, \frac{1}{j!} D^j H(\mathbf{z}^0)(\delta_{n_1}, \ldots, \delta_{n_j}) = \frac{1}{j!} D^j H(\mathbf{z}^0)(e_{n_1}, \ldots, e_{n_j}), \tag{6.81}
\]
where $e_m \in \ell^\infty_w(\mathbb{R})$ is the sequence defined in (5.3). If we make the change of variables $n_i := m_i - t$ in (6.80), use (6.81), and insert the resulting expression in (6.77) and subsequently in (6.75), we obtain (5.2). The uniqueness of this series expansion follows from the same argument as in (Sandberg, 1999, Theorem 1). We now prove the error estimates (5.4) with the same strategy as in Sandberg (1999).
Using the Cauchy bounds for analytic functions (see, for instance, the last expression in (Hille and Phillips, 1957, page 112)) and the analyticity hypothesis on $U \colon B_{\|\cdot\|_w}(\mathbf{z}^0, M) \subset \ell^\infty_w(\mathbb{R}) \to B_{\|\cdot\|_w}(U(\mathbf{z}^0), L) \subset \ell^\infty_w(\mathbb{R}^N)$, we have that for any $j \in \mathbb{N}^+$ and $t \in \mathbb{Z}_-$
\[
\big| D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z})_t \big| = \big| p_t\big( D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z}) \big) \big| \le |||p_t|||_w \, \big\| D^j U(\mathbf{z}^0)(\mathbf{z}, \ldots, \mathbf{z}) \big\|_w \le \frac{j! \, L}{w_{-t} \, M^j} \|\mathbf{z}\|_w^j, \tag{6.82}
\]
where we also used the first part of Lemma 1. Now, as we saw in the previous paragraphs,
\[
\Big| U(\mathbf{z})_t - U(\mathbf{z}^0)_t - \sum_{j=1}^{p} \sum_{m_1 = -\infty}^{0} \cdots \sum_{m_j = -\infty}^{0} g_j(m_1, \ldots, m_j)(z_{m_1 + t} - z^0_{m_1 + t}) \cdots (z_{m_j + t} - z^0_{m_j + t}) \Big|
= \Big| \sum_{j=p+1}^{\infty} \frac{1}{j!} D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t \Big| \le \sum_{j=p+1}^{\infty} \frac{1}{j!} \big| D^j U(\mathbf{z}^0)(\mathbf{z} - \mathbf{z}^0, \ldots, \mathbf{z} - \mathbf{z}^0)_t \big|, \tag{6.83}
\]
where the inequalities in the last line follow from (6.82), which bounds each summand and hence yields the geometric-type tail estimate (5.4).

6.13. Time invariance of the solutions of a reservoir system

The filters studied in this paper are those determined by reservoir systems of the type introduced in (1.1)-(1.2). As we already pointed out, in that case we can associate unique reservoir filters $U^F$ and $U^F_h$ to the reservoir map $F$ and the reservoir system, respectively, whenever (1.1) satisfies the echo state property. In that case, it has been shown in (Grigoryeva and Ortega, 2018b, Proposition 2.1) that both $U^F$ and $U^F_h$ are necessarily causal and time-invariant. We complement this fact with a similar elementary statement that does not require the echo state property or the existence of reservoir filters.

Lemma 44 Let $(\mathbf{x}^0, \mathbf{z}^0) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ be a solution of the reservoir system determined by the map $F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$. Then, for any $\tau \in \mathbb{Z}_-$, the pair $(T_\tau(\mathbf{x}^0), T_\tau(\mathbf{z}^0)) \in (\mathbb{R}^N)^{\mathbb{Z}_-} \times (\mathbb{R}^n)^{\mathbb{Z}_-}$ is also a solution.

Proof. By hypothesis, for any $t \in \mathbb{Z}_-$ we have that $F(x^0_{t-1}, z^0_t) = x^0_t$. Consequently,
\[
F\big(T_\tau(\mathbf{x}^0)_{t-1}, T_\tau(\mathbf{z}^0)_t\big) = F(x^0_{t-\tau-1}, z^0_{t-\tau}) = x^0_{t-\tau} = T_\tau(\mathbf{x}^0)_t,
\]
as required.

Acknowledgments

We thank Lukas Gonon and Herbert Jaeger for fruitful discussions, as well as the input of the two anonymous referees, which helped in improving the paper. The authors acknowledge partial financial support of the French ANR "BIPHOPROC" project (ANR-14-OHRI-0002-02).
LG acknowledges partial financial support of the Graduate School of Decision Sciences of the Universität Konstanz. JPO acknowledges partial financial support coming from the Research Commission of the Universität Sankt Gallen and the Swiss National Science Foundation (grant number 200021 175801/1).

Glossary of Symbols

$\ell^\infty(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite supremum norm
$\ell^{p,w}(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite $(p, w)$-norm
$\ell^p(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite $p$-norm
$\ell^\infty_w(\mathbb{R}^n)$: Banach space formed by the semi-infinite sequences that have a finite weighted supremum norm
$\mathbb{M}_N$: Space of real square matrices of size $N$
$\mathcal{F}^F$: Reservoir flow associated to the reservoir map $F$
$\prod_{t \in \mathbb{Z}_-} A_t$: Cartesian product of the sets $A_t$
$\prod_{t \in \mathbb{Z}_-} f_t$: Cartesian product of the functions $f_t$
$\rho(A)$: Spectral radius of the matrix $A$
$\sigma$: Activation function (in ESNs, for example)
$c$: Contraction constant on the first entry of the reservoir map
$d$: Dimension of the elements of the output signal
$D^r f(\mathbf{z})$: $r$-order Fréchet differential of the map $f$ at the point $\mathbf{z}$
$D_w$: Decay ratio of the weighting sequence $w$
$D_x f(\mathbf{x}, \mathbf{z})$: Partial derivative of the map $f$ with respect to the first entry at the point $(\mathbf{x}, \mathbf{z})$
$Df(\mathbf{z})$: Fréchet differential of the map $f$ at the point $\mathbf{z}$
$F \colon \mathbb{R}^N \times \mathbb{R}^n \to \mathbb{R}^N$: Reservoir map
$H_U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^d$: Functional associated to the causal and time-invariant filter $U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$
$h \colon \mathbb{R}^N \to \mathbb{R}^d$: Generic readout map
$K_M$: Space of semi-infinite sequences that are uniformly bounded by $M$
$L_\sigma$: Lipschitz constant of the activation function $\sigma$
$L_F$: Lipschitz constant of the reservoir map $F$
$L_w$: Inverse decay ratio of the weighting sequence $w$
$L_{F_x}$: Lipschitz constant on the first entry of the reservoir map $F$
$L_{U^F}$: Lipschitz constant of the reservoir filter $U^F$
$N$: Number of virtual neurons; dimension of the reservoir state vectors
$n$: Dimension of the elements of the input signal
$p_t \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^n$: Projection onto the $t$th entry
$T_\tau \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^n)^{\mathbb{Z}_-}$: Time delay operator defined on semi-infinite sequences
$T^{\mathbb{Z}}_\tau \colon (\mathbb{R}^n)^{\mathbb{Z}} \to (\mathbb{R}^n)^{\mathbb{Z}}$: Time delay operator defined on two-sided infinite sequences
$U^F_h \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Reservoir filter determined by the reservoir map $F$ and the readout $h$
$U^F \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^N)^{\mathbb{Z}_-}$: Filter determined by the reservoir map $F$
$U \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Filter with inputs in $\mathbb{R}^n$ and outputs in $\mathbb{R}^d$
$U^{A,c}_h \colon K_M \to \mathbb{R}^{\mathbb{Z}_-}$: Linear reservoir filter determined by $A$, $\mathbf{c}$, and the polynomial $h$
$U_H \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to (\mathbb{R}^d)^{\mathbb{Z}_-}$: Causal and time-invariant filter associated to the functional $H \colon (\mathbb{R}^n)^{\mathbb{Z}_-} \to \mathbb{R}^d$
$w \colon \mathbb{N} \to (0, 1]$: Weighting sequence
$\mathbf{x}$: (Semi-)infinite sequence containing the reservoir states; its elements are denoted by $x_t \in \mathbb{R}^N$
$\mathbf{y}$: (Semi-)infinite output signal; its elements are denoted by $y_t \in \mathbb{R}^d$
$\mathbf{z}$: (Semi-)infinite input signal; its elements are denoted by $z_t \in \mathbb{R}^n$

References

R. Abraham, J. E. Marsden, and T. S. Ratiu. Manifolds, Tensor Analysis, and Applications, volume 75 of Applied Mathematical Sciences. Springer-Verlag, 1988.
T. Apostol. Mathematical Analysis. Addison-Wesley, second edition, 1974.
L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer. Information processing using a single dynamical node as complex system. Nature Communications, 2:468, 2011.
L. Arnold. Random Dynamical Systems. Springer, 1998.
S. Boyd and L. Chua. Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, 32(11):1150-1161, 1985.
D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer. Parallel photonic information processing at gigabyte per second data rates using transient states. Nature Communications, 4(1364), 2013.
J. Cabessa and A. E. Villa. Computational capabilities of recurrent neural networks based on their attractor dynamics. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2015.
J. Cabessa and A. E. Villa. Expressive power of first-order recurrent neural networks determined by their attractor dynamics. Journal of Computer and System Sciences, 82(8):1232-1250, 2016.
P. Chossat, D. Lewis, J.-P. Ortega, and T. S. Ratiu. Bifurcation of relative equilibria in mechanical systems with symmetry. Advances in Applied Mathematics, 31:10-45, 2003.
B. D. Coleman and V. J. Mizel. On the general theory of fading memory. Archive for Rational Mechanics and Analysis, 29(1):18-31, 1968.
R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. The asymptotic performance of linear echo state neural networks. Journal of Machine Learning Research, 17(178):1-35, 2016.
J. Dambre, D. Verstraeten, B. Schrauwen, and S. Massar. Information processing capacity of dynamical systems. Scientific Reports, 2(514), 2012.
M. Fabrizio, C. Giorgi, and V. Pata. A new approach to equations with memory, volume 198. 2010.
S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 105(48):18970-18975, 2008.
F. Girosi. Approximation error bounds that use VC-bounds. In F. Fogelman-Soulié and P. Gallinari, editors, Proc. International Conference on Artificial Neural Networks, volume 1, pages 295-302, 1995.
F. Girosi and G. Anzellotti. Convergence rates of approximation by translates. Technical report, Defense Technical Information Center, 1992.
L. Gonon and J.-P. Ortega. Reservoir computing universality with stochastic inputs. IEEE Transactions on Neural Networks and Learning Systems, 2018.
L. Grigoryeva and J.-P. Ortega. Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems. Journal of Machine Learning Research, 19(24):1-40, 2018a.
L. Grigoryeva and J.-P. Ortega. Echo state networks are universal. Neural Networks, 108:495-508, 2018b.
L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. Optimal nonlinear information processing capacity in delay-based reservoir computers. Scientific Reports, 5(12858):1-11, 2015.
L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals. Neural Computation, 28:1411-1451, 2016.
H. Gunawan, S. Konca, and M. Idris. p-summable sequence spaces with inner products. Bitlis Eren University Journal of Science and Technology, 5(1):1-9, 2015.
A. G. Hart, J. L. Hook, and J. H. P. Dawes. Embedding and approximation theorems for echo state networks. Preprint, 2019.
M. Hermans and B. Schrauwen. Memory in linear recurrent neural networks in continuous time. Neural Networks, 23(3):341-355, 2010.
E. Hille and R. S. Phillips. Functional Analysis and Semi-Groups. American Mathematical Society, 1957.
B. R. Hunt, E. Ott, and J. A. Yorke. Differentiable generalized synchronization of chaos. Physical Review E, 55(4):4029-4034, 1997.
H. Jaeger. Short term memory in echo state networks. Technical Report 152, Fraunhofer Institute for Autonomous Intelligent Systems, 2002.
H. Jaeger. The echo state approach to analysing and training recurrent neural networks with an erratum note. Technical report, German National Research Center for Information Technology, 2010.
H. Jaeger and H. Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78-80, 2004.
J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks. Information and Computation, 128(1):48-56, 1996.
P. E. Kloeden. Synchronization of nonautonomous dynamical systems. Electronic Journal of Differential Equations, 2003(39):1-10, 2003.
P. E. Kloeden and M. Rasmussen. Nonautonomous Dynamical Systems. American Mathematical Society, 2010.
L. Kocarev and U. Parlitz. General approach for chaotic synchronization with applications to communication. Physical Review Letters, 74(25):5028-5031, 1995.
L. Kocarev and U. Parlitz. Generalized synchronization, predictability, and equivalence of unidirectionally coupled dynamical systems. Physical Review Letters, 76(11):1816-1819, 1996.
F. Laporte, A. Katumba, J. Dambre, and P. Bienstman. Numerical demonstration of neuromorphic computing with photonic crystal cavities. Optics Express, 26(7):7955, 2018.
L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer. Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing. Optics Express, 20(3):3241, 2012.
P. Lax. Functional Analysis. Wiley-Interscience, 2002.
R. Legenstein and W. Maass. What makes a dynamical system computationally powerful? In S. Haykin, editor, New Directions in Statistical Signal Processing: From Systems to Brain. MIT Press, Cambridge, MA, 2007.
A. Lindquist and G. Picci. Linear Stochastic Systems. Springer-Verlag, 2015.
Z. Lu, B. R. Hunt, and E. Ott. Attractor reconstruction by machine learning. Chaos, 28(6), 2018.
M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.
W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation, 14:2531-2560, 2002.
W. Maass. Liquid state machines: motivation, theory, and applications. In S. B. Cooper and A. Sorbi, editors, Computability in Context: Computation and Logic in the Real World, chapter 8, pages 275-296. 2011.
W. Maass and E. D. Sontag. Neural systems as nonlinear filters. Neural Computation, 12(8):1743-1772, 2000.
W. Maass, T. Natschläger, and H. Markram. Fading memory and kernel properties of generic cortical microcircuit models. Journal of Physiology-Paris, 98(4-6):315-330, 2004.
W. Maass, P. Joshi, and E. D. Sontag. Computational aspects of feedback in neural circuits. PLoS Computational Biology, 3(1):e165, 2007.
G. Manjunath and H. Jaeger. Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks. Neural Computation, 25(3):671-696, 2013.
G. Manjunath and H. Jaeger. The dynamics of random difference equations is remodeled by closed relations. SIAM Journal on Mathematical Analysis, 46(1):459-483, 2014.
J. A. Montaldi. Persistence and stability of relative equilibria. Nonlinearity, 10:449-466, 1997a.
J. A. Montaldi. Persistance d'orbites périodiques relatives dans les systèmes hamiltoniens symétriques. C. R. Acad. Sci. Paris Sér. I Math., 324:553-558, 1997b.
J. Munkres. Topology. Pearson, second edition, 2014.
T. Natschläger, W. Maass, and H. Markram. The "Liquid Computer": a novel strategy for real-time computing on time series. Special Issue on Foundations of Information Processing of TELEMATIK, 8(1):39-43, 2002.
J. Newman. Necessary and sufficient conditions for stable synchronization in random dynamical systems. Ergodic Theory and Dynamical Systems, 38(5):1857-1875, 2018.
J.-P. Ortega and T. S. Ratiu. Persistence and smoothness of critical relative elements in Hamiltonian systems with symmetry. Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 325(10):1107-1111, 1997.
Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, and S. Massar. Optoelectronic reservoir computing. Scientific Reports, 2:287, 2012.
J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott. Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos, 27(12), 2017.
J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: a reservoir computing approach. Physical Review Letters, 120(2):24102, 2018.
M. B. Priestley. Non-linear and Non-stationary Time Series Analysis. Academic Press, 1988.
A. Rekic-Vukovic, N. Okicic, and E. Dunjakovic. On weighted Banach sequence spaces. Advances in Mathematics: Scientific Journal, 4(2):127-138, 2015.
A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131-144, 2011.
W. J. Rugh. Nonlinear System Theory. The Volterra/Wiener Approach. The Johns Hopkins University Press, 1981.
I. W. Sandberg. Time-delay neural networks, Volterra series, and rates of approximation. Circuits, Systems, and Signal Processing, 17(5):653-655, 1998a.
I. W. Sandberg. A note on representation theorems for linear discrete-space systems. Circuits, Systems, and Signal Processing, 17(6):703-708, 1998b.
I. W. Sandberg. A representation theorem for linear discrete-space systems. Mathematical Problems in Engineering, 4:369-375, 1998c.
I. W. Sandberg. Bounds for discrete-time Volterra series representations. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 46(1):135-139, 1999.
I. W. Sandberg. Notes on fading-memory conditions. Circuits, Systems, and Signal Processing, 22(1):43-55, 2003.
E. Schechter. Handbook of Analysis and its Foundations, volume 91. Academic Press, 1997.
M. Schetzen. The Volterra and Wiener Theories of Nonlinear Systems. Wiley, 1980.
J. H. Shapiro. A Fixed-Point Farrago. Springer International Publishing Switzerland, 2016.
H. Siegelmann, B. Horne, and C. Giles. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 27(2):208-215, 1997.
S. Sternberg. Dynamical Systems. Dover, 2010.
G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N.
Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose. Recent advances in physical reservoir computing: A review. Neural Networks, 115: 100 123, 2019. P. Tio. Asymptotic Fisher memory of randomized linear symmetric Echo State Networks. Neurocomputing, 298:4 8, 2018. T. Valent. Boundary Value Problems of Finite Elasticity. Springer Verlag, 1988. K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bienstman. Parallel reservoir computing using optical amplifiers. IEEE Transactions on Neural Networks, 22(9):1469 1481, sep 2011. K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman. Experimental demonstration of reservoir computing on a silicon photonics chip. Nature Communications, 5:78 80, mar 2014. P. Ver Eecke. Sur le calcul diff erentiel dans les espaces vectoriels topologiques. Cahiers de topologie et g eom etrie diff erentielle cat egoriques, 15(3):293 339, 1974. Q. Vinckier, F. Duport, A. Smerieri, K. Vandoorne, P. Bienstman, M. Haelterman, and S. Massar. High-performance photonic reservoir computer based on a coherently driven passive cavity. Optica, 2(5):438 446, 2015. V. Volterra. Theory of Functionals and of Integral and Integro-Differential Equations. Blackie & Son Limited, Glasgow, 1930. O. White, D. Lee, and H. Sompolinsky. Short-Term Memory in Orthogonal Neural Networks. Physical Review Letters, 92(14):148102, apr 2004. N. Wiener. Nonlinear Problems in Random Theory. The Technology Press of MIT, 1958. I. B. Yildiz, H. Jaeger, and S. J. Kiebel. Re-visiting the echo state property. Neural Networks, 35:1 9, nov 2012.