# dynamical_systems_as_temporal_feature_spaces__d92e2309.pdf

Journal of Machine Learning Research 21 (2020) 1-42 Submitted 7/19; Revised 12/19; Published 2/20

Dynamical Systems as Temporal Feature Spaces

Peter Tino P.Tino@cs.bham.ac.uk School of Computer Science University of Birmingham Birmingham, B15 2TT, UK

Editor: Sayan Mukherjee

Parametrised state space models in the form of recurrent networks are often used in machine learning to learn from data streams exhibiting temporal dependencies. To break the black box nature of such models it is important to understand the dynamical features of the inputdriving time series that are formed in the state space. We propose a framework for rigorous analysis of such state representations in vanishing memory state space models such as echo state networks (ESN). In particular, we consider the state space a temporal feature space and the readout mapping from the state space a kernel machine operating in that feature space. We show that: (1) The usual ESN strategy of randomly generating input-to-state, as well as state coupling leads to shallow memory time series representations, corresponding to cross-correlation operator with fast exponentially decaying coeﬃcients; (2) Imposing symmetry on dynamic coupling yields a constrained dynamic kernel matching the input time series with straightforward exponentially decaying motifs or exponentially decaying motifs of the highest frequency; (3) Simple ring (cycle) high-dimensional reservoir topology speciﬁed only through two free parameters can implement deep memory dynamic kernels with a rich variety of matching motifs. We quantify richness of feature representations imposed by dynamic kernels and demonstrate that for dynamic kernel associated with cycle reservoir topology, the kernel richness undergoes a phase transition close to the edge of stability.

Keywords: Recurrent Neural Network, Echo State Network, Dynamical Systems, Time Series, Kernel Machines

1. Introduction

When dealing with time series data, techniques of machine learning and signal processing must account in some way for temporal dependencies in the data stream. One popular option is to impose a parametric state-space model structure in which the state vector is supposed to dynamically code for the input time series processed so far and the output is determined through a static readout from the state. Recurrent neural networks (e.g. (Downey et al., 2017)), Kalman ﬁlters (Kalman, 1960) or hidden Markov models (Baum and Petrie, 1966) represent just a few examples of this approach. In some cases the state space and transition structure is (at least partially) imposed based on the relevant prior knowledge (Yoon, 2009), but usually it is learnt from the data along with the readout map. In the case of uncountable state space and non-linear state dynamics, the use of gradient methods in learning the state transition dynamics is hampered by the well known

c 2020 Peter Tino.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-589.html.

information latching problem (Bengio et al., 1993). As temporal sequences increase in length, the inﬂuence of early components of the sequence have less impact on the network output. This causes the partial gradients, used to update the weights, to (exponentially) shrink to zero as the sequence length increases. Several approaches have been suggested to overcome this challenge, e.g. (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Jing et al., 2019). One possibility to avoid having to train the state transition part in a state space model is to simply initialise it randomly to a sensible fading memory dynamic ﬁlter and only train the static readout part of the model. Models following this philosophy (Jaeger, 2001; Maass et al., 2002; Tino and Dorﬀner, 2001) have been termed reservoir computation (RC) models (Lukosevicius and Jaeger, 2009). Perhaps the simplest form of a RC model is the Echo State Network (ESN) (Jaeger, 2001, 2002a,b; Jaeger and Hass, 2004). Brieﬂy, ESN is a recurrent neural network with a non-trainable state transition part (reservoir) and a simple trainable linear readout. Connection weights in the ESN reservoir, as well as the input weights are randomly generated. The reservoir weights are scaled so as to ensure the Echo State Property (ESP): the reservoir state is an echo of the entire input history and does not depend on the initial state. Scaling reservoir weights so that the largest singular value is smaller than 1 makes the reservoir dynamics contractive and guarantees the ESP. In practice, sometimes it is the spectral radius that guides the scaling. In this case, however, spectral radius < 1 does not guarantee the ESP. ESNs has been successfully applied in a variety of tasks (Jaeger and Hass, 2004; Skowronski and Harris, 2006; Bush and Anderson, July 2005; Tong et al., 2007). Many extensions of the classical ESN have been suggested in the literature, e.g. deep ESN (Gallicchio et al., 2017), intrinsic plasticity (Schrauwen et al., 2008; Steil, 2007), decoupled reservoirs (Xue et al., 2007), leaky-integrator reservoir units (Jaeger et al., 2007), ﬁlter neurons with delayand-sum readout (Holzmann and Hauser, 2009) etc. However, there are properties of the reservoir that are poorly understood (Xue et al., 2007) and speciﬁcation of the reservoir and input connections require numerous trails and even luck (Xue et al., 2007). Furthermore, imposing a constraint on spectral radius or largest singular value of the reservoir matrix is a weak tool to properly set the reservoir parameters (Ozturk et al., 2007). Finally, random connectivity and weight structure of the reservoir is unlikely to be optimal and such a setting prevents us from providing a clear and systematic insight into the reservoir dynamics organisation (Ozturk et al., 2007; Rodan and Tino, 2010). Rodan and Tino (2010) demonstrated that even an extremely simple setting of a high-dimensional state space structure governed by only two free parameters set deterministically can yield modelling capabilities on par with other ESN architectures. However, a deeper understanding of why this is so has been missing. In order to theoretically understand the workings of parametrised state space models as machine learning tools to process and learn from temporal data, there has been a lively research activity to formulate and assess diﬀerent aspects of computational power and information processing capacity in such systems (e.g. (Dambre et al., 2012; Obst and Boedecker, 2014; Hammer, 2001; Hammer and Tino, 2003; Siegelmann and Sontag, 1994; Tino and Hammer, 2004)). For example, tools of information theory have been used to assess information storage or transfer within systems of this kind (Lizier et al., 2007, 2012; Obst et al., 2010; Bossomaier et al., 2016). To speciﬁcally characterise capability of input-driven dy-

Dynamical Systems as Temporal Feature Spaces

namical systems to keep in their state-space information about past inputs, several memory quantiﬁers were proposed, for example short term memory capacity (Jaeger, 2002a) and Fisher memory curve (Ganguli et al., 2008; Tino, 2018). Even though those two measures have been developed from completely diﬀerent perspectives, deep connections exist between them (Tino and Rodan, 2013). The concept of memory capacity, originally developed for univariate input streams, was generalised to multivariate inputs in (Grigoryeva et al., 2016). Couillet et al. Couillet et al. (2016) rigorously studied mean-square error of linear dynamical systems used as dynamical ﬁlters in regression tasks and suggested memory quantities that generalise the short term memory capacity and Fisher memory curve measures. Finally, Ganguli and Sompolinsky (2010) showed an interesting connection between memory in dynamical systems and their capacity to perform dynamical compressed sensing of past inputs. In this contribution, we suggest a novel framework for characterising richness of dynamic representations of input time series in the form of states of a dynamical system, which is the core part of any state space model used as a learning machine. Our framework is based on the observation that the idea of ﬁxed dynamic reservoir with simple static linear mapping build on top of it strikingly resembles the philosophy of kernel machines (Legenstein and Maass, 2007). There, the inputs are transformed using a ﬁxed mapping (usually only implicitly deﬁned) into a feature space that is rich enough so that in that space it is suﬃcient to train linear models only. The key tool for building linear models in the feature space is the inner product. One can grasp workings of a kernel machine by understanding of how the data is mapped to the feature space and what data similarity in the original space means when expressed as the inner product in the feature space. We will view the reservoir state space as a temporal feature space in which the linear readout is operating. In this view, the input time series seen by the reservoir model results in a state that codes all history of the presented input items so far and thus forms a feature representation of the time series. Diﬀerent forms of coupling in the reservoir dynamical system will result in diﬀerent temporal feature spaces with diﬀerent feature representations of input time series, implying diﬀerent notions of similarity between time series, expressed as inner products of their feature space representations. We will ask if and how the feature spaces diﬀer in cases of traditional randomly generated reservoir models, as well as more constrained reservoir constructions studied in the literature. Since RC models are input-driven non-autonomous dynamical systems, theoretical studies linking their information processing capabilities to the reservoir coupling structures have been performed mostly in the context of linear dynamics, e.g. (Ganguli et al., 2008; Couillet et al., 2016; Couillet et al., 2016; Tino, 2018). While such studies are of interest by themselves, in the context of the present work, studying linear dynamics can shed light on a wide class of RC models whose approximation capabilities equal those of non-linear systems. In particular, Grigoryeva, Gonon and Ortega recently proved a series of important results concerning universality of RC models (Grigoryeva and Ortega, 2018b,a; Gonon and Ortega, 2019). The universality can be obtained even if the state transition dynamics is linear, provided the readout map is polynomial (or a neural network)1. However, univer-

1. Universal approximation capability was ﬁrst established in the L sense for deterministic, as well as almost surely uniformly bounded stochastic inputs (Grigoryeva and Ortega, 2018b). This was later

sality is a property of a whole family of RC models. For appropriate classes of ﬁlters2 and input sources, it guarantees that for any ﬁlter and approximation precision, there exists a RC model approximating the ﬁlter to that precision. This is an existential statement that does relate individual ﬁlters to their approximating RC models. Our new framework will enable us to reason about what kind of RC model setup is necessary if ﬁlters with deeper memory were to be approximated. In particular, we will ﬁrst investigate properties of linear dynamical readout kernels obtained on top of linear dynamical systems. Crucially, memory properties of such kernels can not be enhanced by moving from linear to polynomial static readout kernels. Loosely speaking, if feature representation x of a time series u captures properties of u only up to some look-back time t τ0 from the last observation time t, then no nonlinear transformation γ of x can prolong memory τ0 in the feature representation γ(x) of u. Hence, we will be able to make statements regarding appropriate settings of the linear dynamics that are necessary for universal approximation of deeper memory ﬁlters. The paper has the following organisation: In section 2 we set the scene and outline the main intuitions driving this work. Section 3 formally introduces the notion of temporal kernel and provides some useful properties of the kernel to be used further in our study. In section 4 we will setup basic tools for characterising dynamic kernels - motifs and their corresponding motif weights. Starting from section 5, we will analyse dynamic kernels corresponding to diﬀerent settings of the dynamical system. In particular, dynamical kernels associated with fully random, symmetric and highly constrained coupling of the dynamical system are analysed in sections 5, 6 and 7, respectively. We provide examples illustrating the developed theory and compare the motif richness of diﬀerent parameter settings of the dynamical system in section 8. The paper ﬁnishes with discussion and conclusions in section 9.

2. Preliminary concepts and intuitions

We consider fading memory state space models with linear input driven dynamics in an Ndimensional state space and univariate inputs and outputs. Note that in the ESN metaphor, the state dimensions correspond to reservoir units coupled to the input u(t) via an Ndimensional weight vector w RN. Denoting the state vector at time t by x(t) RN, the dynamical system evolves as

x(t) = W x(t 1) + w u(t), (1)

where W RN N is an N N weight matrix providing the dynamical coupling. In state space models, the output y(t) is often determined solely based on the current state x(t) through a readout function h:

y(t) = h(x(t)). (2)

The readout map h is typically trained (oﬄine or online) by minimising the (normalised) mean square error between the targets and reservoir readouts y(t).

extended in (Gonon and Ortega, 2019) to Lp, 1 p < and not necessarily almost surely uniformly bounded stochastic inputs. 2. transforming semi-inﬁnite input sequences into outputs

Dynamical Systems as Temporal Feature Spaces

Denote the set of natural numbers (including zero) by N0. In this contribution, we study how the dynamical system (1) extracts potentially useful information about the left inﬁnite input time series ..., u(t 2), u(t 1), u(t), u( j) R, j N0, in its state x(t) RN, since it is only the state x(t) that will be used to produce the predictive output y(t) upon seeing the input time series up to time t. In particular, we will consider readout maps constructed in the framework of kernel machines. For example, in the case of linear Support Vector Machine (SVM) regression, the readout from the state space at time t has the form

y(t) = h(x(t)) = X

i βi x(ti), x(t) + b, (3)

where βi R and b R are weight coeﬃcients and bias term, respectively and x(ti) are support state vectors observed at important support time instances ti. In the spirit of state space modelling discussed above, we consider the state x(t ) RN reached after observing the time series ..., u(t 2), u(t 1), u(t ), the feature state space representation of that time series. Hence (3) can also be written as

i βi K([...u(ti 1), u(ti)], [...u(t 1), u(t)]) + b, (4)

where K( , ) is a time series kernel associated with the dynamical system (1),

K([...u(ti 1), u(ti)], [...u(tj 1), u(tj)]) = x(ti), x(tj) . (5)

In this context, the support time instances ti can be viewed as end times of the support time series ..., u(ti 2), u(ti 1), u(ti) observed in the past and deemed important for producing the outputs by the training algorithm trained on the history of the time series before the time step t. The suggested viewpoint is illustrated in ﬁgure 1. There are three support time series (..., u(t1 2), u(t1 1), u(t1)), (..., u(t2 2), u(t2 1), u(t2)) and (..., u(t3 2), u(t3 1), u(t3)) represented through the states x(t1), x(t2) and x(t3), respectively. To evaluate the output at time t, the current feature space representation x(t) of ..., u(t 2), u(t 1), u(t) is compared with feature space representations x(ti), i = 1, 2, 3, of the support time series through dot products. We will next formalise these intuitions and then investigate the properties of state space feature representations of time series by dynamical systems. In particular, we will be interested in how diﬀerent forms of dynamic coupling W inﬂuence richness of such feature representations and how they map to properties of the corresponding temporal kernel.

3. Temporal kernel deﬁned by dynamical system

Without loss of generality, we will study feature state space representations under the dynamical system (1) of left-inﬁnite time series ..., u( 2), u( 1), u(0), u( j) R, j N0. We will assume that the largest singular value ν of the dynamic coupling W is strictly less than 1, making the dynamics (1) contractive. This means that the echo state property is fulﬁlled and for suﬃciently long past horizons τ 1, the inﬂuence of initial state x( τ) on the feature representation of

u( τ + 1), u( τ + 2), ..., u( 1), u(0)

time t 3 t2 t1

x( ) x( ) x( ) t t t t x( )

Figure 1: Illustration of the workings of kernel machine producing an output at time t after observing ..., u(t 1), u(t). The time series ..., u(t 1), u(t) is compared with the three support time series (..., u(ti 1), u(ti)), i = 1, 2, 3, by evaluating dot products between their feature space representations x(t) and x(ti).

is negligible. Note that ν < 1 is a suﬃcient condition for the echo state property, but the property may actually be achieved under milder conditions, especially when particular input streams are considered (for formal treatment and further details see e.g. (Yildiz et al., 2012; Manjunath and Jaeger, 2013)). In this contribution we use ν < 1, since it allows us (1) to consider arbitrary input streams over a bounded domain (the ESP is always guaranteed) and (2) to explicitly bound, in terms of properties of W, the norm of dynamical states, as well as the extent to which the initial state inﬂuences the temporal kernel. More formally, given a past horizon τ 1, we will represent the time series u( τ + 1), u( τ + 2), ..., u( 1), u(0) as a vector u(τ) = (u1, u2, ..., uτ) Rτ, where ui = u( i + 1), i = 1, 2, ...τ. In other words u(τ) = (u(0), u( 1), ..., u( τ + 1)) . Consider a state x( τ) RN at time τ. After seeing the input series u( τ+1), u( τ+ 2), ..., u( 1), u(0), the new state of the dynamics (1) will be3

x(0) = Wτx( τ) +

j=1 u(j τ)Wτ jw.

As discussed in the previous section, the state x(0) reached from the initial condition x( τ) after seeing u(τ) codes for information content in u(τ) and will be considered the feature space representation of u(τ) through the dynamical system (1):

φ(u(τ); x( τ)) = x(0)

= Wτx( τ) +

i=1 u(1 i)Wi 1w

= Wτx( τ) +

i=1 ui Wi 1w. (6)

Given two time series at past horizon τ represented through u(τ) = (u1, u2, ..., uτ) and v(τ) = (v1, v2, ..., vτ) , the temporal kernel deﬁned by dynamical system (1) evaluated on

3. W0 = IN N, the N N identity matrix.

Dynamical Systems as Temporal Feature Spaces

u(τ) and v(τ) reads:

K(u(τ), v(τ); x( τ)) = φ(u(τ); x( τ)), φ(v(τ); x( τ)) . (7)

We will now show that, as expected given the contractive nature of (1), for suﬃciently long past time horizons τ 1 on input streams over bounded domain4, the kernel evaluation is insensitive to the initial condition x( τ). This will allow us to simplify the presentation by setting x( τ) to the origin in the rest of the paper.

Theorem 1 Consider the dynamical system (1) driven by time series over a bounded domain [ U, U], 0 < U < , with a past time horizon τ > 1. Assume that the largest singular value ν of the dynamic coupling W is strictly smaller than 1 and that the norm of the input coupling w satisﬁes w B. Assume further that the norm of the initial condition is upper bounded by x( τ) A(τ) = c ζ τ, where ν < ζ < 1 and c > 0 is a large enough positive constant satisfying

Then, for any u(τ), v(τ) [ U, U]τ, it holds

K(u(τ), v(τ); x( τ)) = K(u(τ), v(τ); 0) + ϵ,

1 ν B U ϵ ητ c2 ητ + 2c 1 ν B U , (9)

with η = ν/ζ < 1.

Proof Note that

φ(u(τ); x( τ)) = Wτx( τ) + φ(u(τ); 0)

and therefore, denoting

φ(u(τ); 0) =

i=1 ui Wi 1w (10)

by φ0(u(τ)), we have

K(u(τ), v(τ); x( τ)) = Wτx( τ) + φ0(u(τ)), Wτx( τ) + φ0(v(τ))

= Wτx( τ) 2 2 + Wτx( τ), φ0(u(τ)) + φ0(v(τ))

+ K(u(τ), v(τ); 0),

4. It is common in the ESN literature to consider input streams over a bounded domain (e.g. Jaeger (2001)). In the recent work on universality of ESNs Grigoryeva and Ortega (2018b) consider almost surely uniformly bounded stochastic inputs. This is further relaxed in (Gonon and Ortega, 2019).

K(u(τ), v(τ); 0) = φ0(u(τ)), φ0(v(τ))

is the dynamic kernel evaluated using initial condition x( τ) set to the origin 0. We have,

Wτx( τ) 2 2 ν2τ (A(τ))2. (11)

Wτx( τ), φ0(u(τ)) Wτx( τ) φ0(u(τ))

and (see (10))

i=1 νi 1 U w

B U 1 1 ν , (12)

Wτx( τ), φ0(u(τ)) ντ

1 ν A(τ) B U.

We thus have

K(u(τ), v(τ); x( τ)) = K(u(τ), v(τ); 0) + ϵ,

ϵ ντ ντ(A(τ))2 + 2 1 ν A(τ) B U .

To evaluate the lower bound on ϵ, note that

Wτx( τ), φ0(u(τ)) Wτx( τ) φ0(u(τ))

1 ν A(τ) B U.

Since, trivially, Wτx( τ) 2 2 0, we have

We have thus obtained,

ντ 2 1 ν A(τ) B U ϵ ντ ντ(A(τ))2 + 2 1 ν A(τ) B U ,

which is equivalent to (9).

Dynamical Systems as Temporal Feature Spaces

In order to reconcile this setting with the dynamics (1), consider a past horizon τ + τ0 for some additional look-back time τ0 1. We require,

x( τ) A(τ) = c ζ τ and x( τ τ0) A(τ + τ0) = c ζ τ τ0.

But from the dynamics (1), we also have (see eqs.(6), (11) and (12)),

x( τ) Wτ0 x( τ τ0) + B U

ντ0 A(τ + τ0) + B U

We would like the norm of the state x( τ) reached from the initial state x( τ τ0) (bounded in norm by A(τ + τ0)) to be within the required bound A(τ). In other words, we would like

A(τ + τ0) ντ0 + B U

1 ν < c ζ τ (14)

to hold. Using A(τ + τ0) = c ζ τ τ0, we conclude that the inequality (14) holds when

1 ν ζ τ0 . (15)

Since 0 < η = ν/ζ < 1, for τ, τ0 1,

(1 ν) 1 1 ητ0 < B U (1 ν) (1 η)

we have that the inequality (15) is deﬁnitely satisﬁed when

c > B U (1 ν) (1 ν

Theorem 1 formally states that because the dynamical system (1) is contractive, the inﬂuence of the initial condition x( τ) on the kernel value K(u(τ), v(τ); x( τ)) decays exponentially with the past time horizon τ. For suﬃciently long past time horizons τ 1 we can thus set x( τ) = 0. Hence, in the rest of this study we will assume τ N and (unless necessary) we will drop speciﬁc reference to τ by writing u instead of u(τ) Rτ. In fact, it will be easier to think of time horizons in units of N, so that τ = ℓ N, for some suﬃciently large integer ℓ> 1. Furthermore, we will refer to φ(u(τ); 0) and K(u(τ), v(τ); 0) simply as φ(u) and K(u, v), respectively.

4. Temporal kernel and its motifs

In the previous section we established that the temporal kernel associated with dynamical system (1) and acting on time series with past time horizon τ 1 is deﬁned as

K(u, v) = φ(u), φ(v) . (16)

In order to analyse the action of K(u, v) on time series u, v, we need to ﬁnd its expression directly in terms of u and v. The next theorem shows that there exists a matrix Q of rank at most N that acts as a metric tensor on a subspace of Rτ (of dimensionality at most N), so that K(u, v) can be expressed as a quadratic form u Q v. This will allow us to study properties of K(u, v) by analysing the associated metric tensor Q.

Theorem 2 Consider the dynamical system (1) of state dimensionality N and a dynamic coupling W with largest singular value 0 < ν < 1. Let K(u, v) (16) be the temporal kernel associated with system (1). Then for any u, v Rτ,

K(u, v) = u Q v = u, v Q ,

where Q is a symmetric, positive semi-deﬁnite τ τ matrix of rank Nm = rank(Q) N and elements

Qi,j = w W i 1 Wj 1 w, i, j = 1, 2, ..., τ. (17)

The upper bound on absolute values of Qi,j decays exponentially with increasing time indices i, j = 1, 2, ..., τ, as

|Qi,j| νi+j 2 w 2 2. (18)

Proof First, we write

K(u, v) = φ(u), φ(v)

i=1 ui Wi 1w,

j=1 vj Wj 1w

i,j=1 ui vj Wi 1w, Wj 1w

i,j=1 ui vj Qi,j

Second, φ(u) can be written as φ(u) = Φu, where Φ is an N τ matrix whose i-th column is equal to Wi 1w. Hence, K(u, v) = u Φ Φ v and Q = Φ Φ is symmetric positive semi-deﬁnite with rank at most N τ.

Dynamical Systems as Temporal Feature Spaces

Finally, since Wiw Wi w νi w , we have

|Qi,j| = | Wi 1w, Wj 1w |

Wi 1w 2 Wj 1w 2 νi+j 2 w 2 2.

Note that K( , ) is a semi-inner product on Rτ. In other words, time series u ker (Q) from the kernel of the linear operator Q have zero length. It acts as an inner product in the quotient of Rτ by ker (Q), Rτ/ ker (Q) (image of Q). Since this distinction is not crucial for our argumentation, in order not to unnecessarily complicate the presentation, (slightly abusing mathematical terminology) we will refer to K( , ) as temporal kernel and to Q as the associated metric tensor. Theorem 2 tells us that K( , ) is a fading memory temporal kernel and we can unveil its inner workings through eigen-analysis of Q:

Q = MΛM , (19)

where the columns of M are the eigenvectors m1, m2, ..., mτ Rτ of Q with the corresponding real non-negative eigenvalues λ1 λ2 ... λτ arranged on the diagonal of the diagonal matrix Λ. Based on theorem 2, there are Nm N τ eigenvectors mi with positive eigenvalue λi > 0. Given two time series u, v Rτ of past time horizon τ, the temporal kernel value is

K(u, v) = u Q v

= Λ 1 2 M u Λ 1 2 M v. (20)

This has the following interpretation. In order to determine the kernel based similarity K(u, v) of two time time series u, v Rτ, both time series are ﬁrst represented through a series of matching scores with respect to a potentially small number of ﬁlters mi (Nm N τ) , weighted by λ1/2 i :

eu = λ1/2 1 m1, u , λ1/2 2 m2, u , ..., λ1/2 Nm m Nm, u RNm (21)

ev = λ1/2 1 m1, v , λ1/2 2 m2, v , ..., λ1/2 Nm m Nm, v RNm.

Similarity between u Rτ and v Rτ is then evaluated as the degree to which both u and v match in the same way the highly weighted ﬁlters mi. Hence, instead of direct matching of u and v, as would be the case for u, v , we consider u, v similar if eu, ev is high, in other words, if both u and v match well a number of signiﬁcant ﬁlters mi of high weight λ1/2 i .

The matching scores mi, u can be viewed as projections of the input time series u unto prototypical time series motifs mi that characterise what features of the input time series are used by the kernel to assess their similarity. Loosely speaking, a temporal kernel employing a rich set of slowly decaying high-weight ( signiﬁcant ) motifs with deep memory will be able to perform more nuanced time series similarity evaluation than a kernel with a small set of highly constrained and fast decaying short memory motifs. In what follows we refer to m1, m2, ..., m Nm Rτ as motifs of the temporal kernel K( , ) with the associated motif weights given by λ1/2 1 λ1/2 2 ... λ1/2 Nm > 0. In the light of the comments above, K( , ) acts as semi-inner product on Rτ and as inner product on the span of the motifs, span{m1, m2, ..., m Nm}. In the case of SVM regression, the readout output for a time series v Rτ, based on the state space representation of v through (1) would have the form (see eq.(4)) X

i βi K(ui, v) + b,

where ui Rτ are the support vectors ( support time series ). This can be rewritten as a linear model a v + b with weight vector a Rτ:

j=1 λj mj, ui m j . (22)

Free parameters of the output-producing function are the coeﬃcients βi corresponding to the support time series ui. In contrast, motifs mj and motif weights λ1/2 j are ﬁxed by the dynamical system (1). Hence, whatever setting of the free parameters βi one can come up with, the inherent memory and time series structures one can access in past data in order to produce the output for a newly observed time series are determined by the richness and memory depth characteristics of the motif set {mj}Nm j=1. In what follows we will take this viewpoint when analysing temporal kernels corresponding to the dynamical system (1) for diﬀerent types of state space coupling W RN N.

5. Random dynamic coupling W with zero-mean i.i.d. entries

It has been common practice in the reservoir computation community to generate dynamic coupling W of ESNs randomly (Lukosevicius and Jaeger, 2009), typically with elements of W generated independently from a zero-mean symmetric distribution and then normalised so that W has a desirable scaling property (e.g. certain spectral radius or largest singular value). In this section we investigate temporal kernels associated with such an ESN setting. We will see that the nature of motifs is remarkably stable (small set of shallow memory motifs), even though the couplings W are generated from a wide variety of distributions. Consider a random matrix f W with elements f Wi,j, i, j = 1, 2, ..., N, generated i.i.d. from a zero-mean distribution with variance σ2 0 > 0 and ﬁnite fourth moment. Such a realisation f W RN N will be rescaled to the desired largest singular value ν (0, 1):

σmax(f W) f W,

Dynamical Systems as Temporal Feature Spaces

where σmax(f W) is the maximum singular value of f W. For large N, the largest eigenvalue of N 1f W f W converges to 4σ2 0 almost surely (Rudelson and Vershynin, 2010; Tino, 2018). Hence, the largest singular value of N 1/2 f W approaches 2σ0. It follows that for large N, σmax(f W) approaches 2

Nσ0. Rescaling

can be equivalently thought of as generating Wi,j i.i.d. from a zero-mean distribution with standard deviation

We would like to reason, under the assumption of high state space dimensionality N of the dynamical system (1), about the properties of the metric tensor Q with elements Qi,j given by eq. (17). To ease the mathematical notation, we denote the matrix (W )i Wj by A(i,j). Hence,

Qi,j = w A(i 1,j 1) w. (24)

5.1. Diagonal elements of Q

The ﬁrst diagonal element of Q, Q1,1, is trivially equal to w 2 2, so let us ﬁrst concentrate on A(1,1) corresponding to Q2,2.

A(1,1) j,j = N

i=1 W 2 i,j

4 . (see eq. (23)) (25)

The oﬀ-diagonal terms of A(1,1) get asymptotically small as

A(1,1) i,j = N

k=1 Wk,i Wk,j

since for i = j, Wk,i and Wk,j are uncorrelated and generated from zero-mean distribution with standard deviation σ = O(1/

N) (see (23)). For large N we can thus approximate A(1,1) as

4 IN N, (26)

where IN N is the identity matrix of rank N.

To approximate A(2,2), we write

A(2,2) = (W )2 W2

= W A(1,1) W

2 IN N. (27)

Proceeding inductively, we obtain

A(k,k) = W A(k 1,k 1) W

k IN N, k = 2, 3, ..., τ 1. (28)

We can thus approximate Qj,j as

Qj,j = w A(j 1,j 1) w

2(j 1) w 2 2. (29)

Hence, the diagonal elements of Q, necessarily non-negative since A(j,j) are positive semideﬁnite, decay much faster (exponentially so, by the factor of 4 (j 1)) than the upper bound (18) of theorem 2,

Qj,j 4 (j 1) ν2(j 1) w 2 2. (30)

In particular, if all elements of the input coupling w have the same absolute value w (with possibly diﬀerent signs), we have

2(j 1) . (31)

5.2. Oﬀ-diagonal elements of Q

We now investigate terms Qi,j for i = j. Since Q is symmetric, without loss of generality we can assume j > i. Then,

A(i 1,j 1) = (W )i 1 Wi 1 Wj i

= A(i 1,i 1) Wj i

2(i 1) Wj i (32)

Dynamical Systems as Temporal Feature Spaces

The elements of A(i 1,j 1) decay exponentially with increasing i (deeper past in the time series). We will now approximate Wj i. Concentrate ﬁrst on the suband super-diagonal elements of Q. We have

and so besides the main diagonal elements Qj,j (ν/2)2(j 1) w 2 2 we have suband superdiagonal elements

Qj+1,j = Qj,j+1 ν

2(j 1) w Ww,

which, depending on W and w, can be substantially smaller than Qj,j. For example, if both W and w are generated element-wise independently from zero mean distributions, then for large N, Ww 0. This is because each row of W contains i.i.d. realisations of a random variable uncorrelated with random variable whose realisations are stored as elements of w. Then w Ww is negligible. For elements Qi,j further away from the diagonal, we ﬁrst analyse properties of the matrix B = W2.

k=1 Wi,k Wk,i

= W 2 i,i + X

k =i Wi,k Wk,i

because of uncorrelated Wi,k and Wk,i for k = i. Similarly, for i = j,

k=1 Wi,k Wk,j 0.

Qj+2,j = Qj,j+2 ν

2(j 1) w B w

i=1 W 2 i,i w2 i . (33)

Note that in order to scale a large matrix f W generated i.i.d. from a zero-mean distribution to spectral radius less than one, the individual elements Wi,j of the scaled matrix W need to be necessarily small, increasingly so for increasing dimensionality N. In particular, based on (23), the mean of W 2 i,i is approximately ν2/(4N). Comparing (see eq. (29))

with eq. (33), we see that there will be an increasing gap (with increasing state space dimensionality N) between the diagonal elements Qj,j of Q and the corresponding elements Qj+2,j = Qj,j+2 two places oﬀthe diagonal. Continuing the preceding argumentation inductively, we can conclude that compared to the diagonal terms Qj,j of Q, for the approximation purposes, the oﬀ-diagonal terms can be neglected and the metric tensor can be approximated by a diagonal matrix

Q b Q = w 2 2 diag 1, ν

2(τ 1) . (34)

5.3. Temporal kernel motifs generated by random W

The eigen-decomposition of b Q is straightforward: The eigenvectors form the standard basis {ei}, each vector ei containing zeros, except for the i-th element, which is equal to 1. The corresponding eigenvalues are equal to the diagonal elements of b Q,

bλi = w 2 ν

2(i 1) . (35)

This means that the motif mi = ei extracts the i-th element from the history of the time series and weights it with the weight w (ν/2)i 1. Perhaps surprisingly, the temporal kernel deﬁned by the dynamical system (1) with random coupling W generated i.i.d. from a zero mean distribution has a rigid Markovian ﬂavor with shallow memory. In particular, the kernel

i=1 λi mi, u mi, v

i=1 bλi ei, u ei, v

2(i 1) ui vi,

compares the corresponding recent entries of the time series and weights down comparisons of past elements with rapidly decaying weights. To illustrate this approximation, as well as the rapidly decaying memory of such temporal kernels, we considered 100-dimensional state space (N = 100) and generated 100 realisations of f W with elements Wi,j randomly distributed according to the standard normal distribution N(0, 1). Each f W was renormalised to W of largest singular value ν = 0.995 and an input coupling vector w was generated as a random vector with elements generated i.i.d. according to N(0, 1) and then renormalised to unit vector (length 1). We then imposed a past horizon τ = 200 and calculated the metric tensor Q, as well as its approximation b Q (eq. (34)). In ﬁgure 2 we show the true motifs mi (eigenvectors of Q for the ﬁrst four dominant motifs (motifs with the largest 4 motif weights) as the mean and standard deviations across the 100 realisations. For clarity, we only show the ﬁrst 10 dimensions. It is clear that the motifs approximately correspond to the ﬁrst four standard basis vectors ei, i = 1, 2, 3, 4,

Dynamical Systems as Temporal Feature Spaces

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

Figure 2: The ﬁrst 10 elements of the four most dominant kernel motifs corresponding to W R100 100 generated element-wise i.i.d. from N(0, 1) and renormalised to largest singular value ν = 0.995. The input coupling w was generated elementwise i.i.d. from N(0, 1) and renormalised to unit length. Shown are the means and standard deviations across 100 joint realisations of W and w.

as predicted by our theory. Figure 3 presents the corresponding eigenvalues - solid bars correspond to the means of the actual eigenvalues λi across the 100 realisations (also shown are standard deviations). The theoretically predicted values (eq. (35)) are shown as the red line. Again, there is a strong agreement between the theoretical approximations bλi and the real eigenvalues λi. We illustrate generality of this result in appendix A, where motifs and their weights were obtained under the same conditions, but with the input coupling vector w generated as a vector of all 1s with randomly ﬂipped signs (with equal probability 0.5 in each dimension). We also tried setting where both f W and w consist of all 1s with signs ﬂipped independently element-wise with probability 0.5. In both cases the Markovian motifs and their weights are almost indistinguishable from those shown in ﬁgures 2 and 3.

It is notable that even though the state space dimensionality is quite high (N = 100), the rapidly decaying motif weights basically prevent the kernel to be able to dig deeper into the history of the time series u, v when creating a quantitative evaluation of their similarity, K(u, v). If the time series are zero-mean, the kernel is estimating a weighted covariance of u and v with weights ν

2 2(i 1) exponentially decreasing at the rate much faster than the upper bound (ν)2(i 1), given the contractive dynamics of (1) with spectral radius ν.

We conclude this section by noting that for large random W, the spectral radius ρ ν/2. Hence, the resulting temporal kernel can be readily interpreted from the standpoint of spectral radius: The Markovian motifs ei have weights w ρi 1, leading to temporal

eigenvalues of Q

1 2 3 4 5 6 7 8 9 10 0

Figure 3: Eigenvalues (squared motif weights) of the metric tensor Q for random setting of the dynamical system (1) as described in ﬁg. 2. Solid bars correspond to the means of the actual eigenvalues λi across the 100 realisations of W and w (also shown are standard deviations). The theoretically predicted values (eq. (35)) are shown as the red line.

K(u, v) w 2 N X

i=1 ρ2(i 1)ui vi.

Zhang et al. (2012) studied echo state networks with i.i.d. random weights in W. They showed that the dynamic mapping (1) can be contractive with high probability even when only the spectral radius ρ (as opposed to maximum singular value ν) is less than one.

6. Symmetric dynamic coupling W

In this section we investigate how the nature of the temporal kernel changes if we impose symmetry on the dynamic coupling W of system (1): Wi,j = Wj,i, i, j = 1, 2, ..., N. In this case, the largest singular value ν of W is equal to its spectral radius. Memory capacity of such systems was rigorously analysed in (Tino and Rodan, 2013; Tino, 2018). In terms of memory capacity, the role of self-couplings in large systems was shown to be negligible. In (Couillet et al., 2016) systems with symmetric coupling were shown to lead to inferior performance on memory tasks, when compared with unconstrained dynamic coupling. Similar observation was made in the context of forecasting realised variances of stock market indices (Ficura, 2017). Recall that given Nk kernels K(a)( , ) operating on a space X and positive real numbers αa > 0, a = 1, 2, ..., Nk, the linear combination K( , ) = PNk a=1 αa K(a)( , ) is a valid kernel on X. We will show that in case of symmetric W, the corresponding temporal kernel can be understood as a linear combination of simple kernels, each with a unique exponentially decaying motif.

Dynamical Systems as Temporal Feature Spaces

Theorem 3 Consider the dynamical system (1) of state dimensionality N with symmetric coupling W of rank Nk N. Let s1, s2, ..., s Nk, be the eigenvectors of W corresponding to non-zero eigenvalues σ1 σ2 ... σNk. Denote by ewa = s a w the projection of the input coupling w onto the eigenvector sa. Then the temporal kernel K( , ) associated with system (1) is a linear combination of Nk kernels K(a)( , ),

a=1 ew2 a K(a)( , ), (36)

each kernel K(a) with a single motif

m(a) = (1, σa, σ2 a, ..., στ 1 a ) Rτ. (37)

Proof Since W is symmetric, it can be decomposed as

W = S Σ S ,

where S = [s1, s2, ..., s Nk] is an N Nk matrix storing the eigenvectors of W as columns, with the associated eigenvalues organised along the diagonal of Σ = diag(σ1, σ2, ..., σNk). The powers of W can be then expressed simply through powers of Σ: Wi = S Σi S . We thus have

Qi,j = w (W )i 1 Wj 1 w

= w Wi+j 2 w (by symmetry of W)

= w S Σi+j 2 S w

= ew Σi+j 2 ew, (38)

where ew = S w is the projection of input coupling w onto the orthonormal eigen-basis of W. Writing (38) as a quadratic form, we obtain

a,l=1 ewa ewl Σi+j 2 a,l

a=1 ew2 a σi+j 2 a , (39)

because Σ is a diagonal matrix. Let us deﬁne Nk matrices Q(a) Rτ τ, a = 1, 2, ..., Nk, as

Q(a) i,j = σi+j 2 a .

a=1 ew2 a Q(a). (40)

Note that Q(a) are rank-1 positive semi-deﬁnite matrices Q(a) = m(a) (m(a)) . Since

Q(a) m(a) = m(a) (m(a)) m(a) = m(a) 2 2 m(a),

we have that m(a) is the only eigenvector of Q(a) with a non-zero eigenvalue, i.e. m(a) is the only motif of the kernel

K(a)(u, v) = u Q(a) v

with non-zero motif weight. From (40) it follows that K(u, v) = PNk a=1 ew2 a K(a)(u, v).

Theorem 3 states that the temporal kernel of a system (1) with symmetric coupling is a linear combination of several kernels, each of which corresponds to a single non-zero eigenvalue σa of W. Each such kernel has a unique motif m(a) Rτ associated with it. The motifs m(a) can only be of two kinds: Either an exponentially decaying proﬁle (1, σa, σ2 a, σ3 a, σ4 a, ...), if σa is positive, or an exponentially decaying proﬁle with high oscillation frequency (1, |σa|, σ2 a, |σ3 a|, σ4 a, ...), if σa is negative. This is obviously quite limiting, precluding the component kernels K(a)( , ) to capture more diverse range of possible dynamic behaviours. A word of caution is in order. The individual motifs m(a) are indeed motifs of the component kernels K(a)( , ), but they are not motifs of the kernel K( , ). Even though one can write

Q = V ΣW V ,

where the matrix V = [m(1), m(2), ..., m(Nk)] stores component motifs m(a) as columns and ΣW = diag( ew2 1, ew2 a, ..., ew2 Nk), the component motifs m(a) are not orthogonal. Hence, in general there is no non-zero number κ, such that

Q m(a) = V ΣW V m(a) = κ m(a).

Unlike in the previous section, because of the imposed symmetry on W, it is much more diﬃcult to approximate the structure of Q. We can recover the upper bound (18) of theorem 2 on absolute values of Qi,j. From Theorem 2 and eq. (39) we have

a=1 ew2 a σi+j 2 a

a=1 ew2 a |σi+j 2 a |

a=1 ew2 a (41)

νi+j 2 w 2 2. (42)

Here (42) follows from (41) since the norm of the input coupling w is invariant with respect to orthonormal change of basis. The inequality in (42) becomes equality if W is full rank.

Dynamical Systems as Temporal Feature Spaces

7. W as a scaled permutation matrix

We will now consider a strongly constrained dynamical coupling W in the form of cyclic N N permutation matrix P, scaled by ν, so that the largest singular value, as well as the spectral radius of W = ν P is equal to ν. This follows from a theorem by Frobenius that states that for a non-negative matrix W, its spectral radius is lower and upper bounded by the minimum and maximum row sum, respectively (e.g. (Minc, 1988)). Since in our case all rows of W sum to ν, the spectral radius must be5 ν. Without loss of generality6 we will consider cyclic permutation {1 2, 2 3, ..., N 1 N, N 1}, represented by Pi+1,i = 1, i = 1, 2, ..., N 1 and P1,N = 1, all the other elements of P are zero. Dynamic couplings in the form of scaled cyclic permutation matrix correspond to the setting of simple cycle reservoir (Rodan and Tino, 2011), where the reservoir units are connected in a uni-directional ring structure, with the same weight value on all connections in the ring. Analogously, setting of the input coupling w can be very simple, controlled again by a single amplitude value w > 0 for all input weights. Intuitively, all the input weights should not have the same value w, as this would greatly symmetrise the ESN architecture. To break the symmetry, Rodan and Tino (2011) suggest to apply an a-periodic sign pattern to the input weights (e.g. according to binary expansion of an irrational number). While such a reservoir structure has the advantage of being extremely simple and completely deterministic, the predictive performance of the associated ESNs in a variety of tasks on time series of diﬀerent origins and memory structure was shown to be on par (and sometimes even better) with the usual random reservoir constructions (Rodan and Tino, 2011). Similar observations were made in (Strauss et al., 2012). This is of great practical importance, since many optics-based physical constructions of reservoir models follow the ring topology structure, which can be implemented using a single unit with multiple delays (R ohm and L udge, 2018; Tanaka et al., 2019; Appeltant et al., 2011). Yet, it has been unclear, why such a simple setting can be competitive in real word tasks, or why indeed the breaking of symmetry through a-periodic sign pattern in the input weights is so crucial. In this section, we will study the nature of motifs associated with ring reservoir topologies and the consequences of adopting periodic, rather than a-periodic input weight sign patterns. Given a time horizon τ = ℓN, for some positive integer ℓ> 1, we will now show that the temporal kernel motifs corresponding to the dynamical system (1) with scaled permutation coupling W = ν P have an intricate block structure.

Theorem 4 Consider the dynamical system (1) of state space dimensionality N, with coupling W = ν P, where ν (0, 1) and P is the N N cyclic permutation matrix. Let emi RN, i = 1, 2, ..., N, be motifs of the temporal kernel associated with (1) under past time horizon equal to N. Denote the corresponding motif weights by eωi. Then, given a diﬀerent past time horizon τ = ℓ N, for some positive integer ℓ> 1, the temporal kernel motifs mi Rτ associated with (1) have the following block form:

mi = em i , νN em i , ν2N em i , ..., ν(ℓ 1)N em i , i = 1, 2, ...N.

5. Alternatively, this can be shown by arguing that W is a normal matrix. 6. We can always renumber the state space dimensions.

The corresponding motif weights are equal to

Proof Note that because P is a permutation matrix, for any non-negative integer i N0, we have

Pi = PN (i\N) Pi mod N,

where mod and \ denote the modulo and integer division operations. Since PN (i\N) = IN N, we have Pi = Pi mod N. Consequently,

Wi = νi Pi mod N.

Furthermore, since P is orthogonal, P 1 = P . We can now write the elements of Q as (see eq. (17)),

Qi,j = w (W )i 1 Wj 1 w

= νi+j 2 w (P )i 1 Pj 1 w

= νi+j 2 w Pj i w.

= νi+j 2 w P(j i) mod N w. (43)

For k { N + 1, N + 2, ..., 1, 0, 1, ...N 1}, if k is positive, Pkw is the vector with elements of w rotated k places to the right. In case k is negative, the rotation is to the left. From (43) it is clear the Q Rτ τ has the following block structure:

Q(1,1) Q(1,2) Q(1,ℓ)

Q(2,1) Q(2,2) Q(2,ℓ)

Q(ℓ,1) Q(ℓ,2) Q(ℓ,ℓ)

where each matrix Q(a,b) RN N, a, b = 1, 2, .., , ℓ, has elements

Q(a,b) i,j = ν(a+b 2)N νi+j 2 w Pj i w, i, j = 1, 2, ..., N.

Deﬁne an N N matrix R with elements

Ri,j = νi+j 2 w Pj i w, i, j = 1, 2, ..., N, (44)

yielding Q(a,b) = ν(a+b 2)N R. Note that R is the metric tensor of the temporal kernel associated with (1) under the past time horizon N. Let emi RN be the i-th eigenvector of R with eigenvalue eλi. Then,

Q(a,b) emi = ν(a+b 2)N eλi emi,

Dynamical Systems as Temporal Feature Spaces

h Q(a,1), Q(a,2), , Q(a,ℓ)i

f mi emi emi

j=1 ν(j 1)N

ν(a 1)N f mi. (45)

It follows that for each a {1, 2, ..., ℓ},

h Q(a,1), Q(a,2), , Q(a,ℓ)i

f mi νN emi ν(ℓ 1)N emi

j=1 ν2(j 1)N

ν(a 1)N f mi. (46)

We can thus conclude that the vector

mi = em i , νN em i , ν2N em i , ..., ν(ℓ 1)N em i

is an eigenvector of Q with eigenvalue

= eλi 1 ν2Nℓ

= eλi 1 ν2τ

7.1. Periodic input coupling w

It has been empirically shown in (Rodan and Tino, 2011) that when the dynamic coupling W is formed by a scaled permutation matrix, a very simple setting of input coupling w is suﬃcient: all elements of w can have the same absolute value, but the sign pattern should be aperiodic. Intuitively, it is clear that for such W a periodic input coupling w will induce symmetry in the dynamic processing of (1) and such a symmetry should be broken. However, in this section we would like to ask exactly what representational capabilities are lost by imposing a periodicity in w. We will start by considering a general case of periodic w RN formed by k > 1 copies of a periodic block s Rp, w = (s , s , ..., s ) . Obviously, N = k p. Denote by P Rp p the top left p p block of the right shift permutation matrix P RN N. In other words, P is the right shift permutation matrix operating on vectors from Rp. Furthermore, we introduce matrix T Rp p with elements

Ti,j = νi+j 2 D s, P |j i|s E , i, j = 1, 2, ..., p. (47)

Theorem 5 Consider the dynamical system (1) of state space dimensionality N, with coupling W = ν P, where ν (0, 1) and P is the N N cyclic permutation matrix. Let the input coupling w RN consist of k > 1 copies of a periodic block s Rp. Denote by mi Rp, i = 1, 2, ..., p, eigenvectors of the matrix T (47) with the corresponding eigenvalues λi. Then, given a past time horizon τ = ℓ N, for some positive integer ℓ> 1, there are at most p temporal kernel motifs mi Rτ associated with (1) of non-zero motif weight. Furthermore, the kernel motifs have the following block form,

mi = (m i , νp m i , ν2p m i , ..., ντ p m i ) , i = 1, 2, ...p,

with the corresponding motif weights

Proof By Theorem 4, to determine motifs of the temporal kernel associated with (1), it is suﬃcient to perform eigen-analysis of the block matrix Q(1,1) = R (eq. (44)). For a = 0, 1, 2, ..., N 1,

w, Paw = k s, P as = k D s, P a mod p s E

and since R is symmetric, from (eq. (44)) we have

Q(1,1) i,j = k νi+j 2 D s, P |j i| mod p s E , i, j = 1, 2, ..., N. (48)

Therefore, Q(1,1) can be decomposed into blocks of p p matrices

C(1,1) C(1,2) C(1,k)

C(2,1) C(2,2) C(2,k)

Ck,1) C(k,2) C(k,k)

C(c,d) = ν(c+d 2)p C(1,1), c, d = 1, 2, ..., k

C(1,1) i,j = νi+j 2 D s, P |j i| s E , i, j = 1, 2, ..., p.

Now, let mi Rp be the i-th eigenvector of C(1,1) = T with eigenvalue λi. Then,

C(c,d) mi = ν(c+d 2)p C(1,1) mi = ν(c+d 2)p λi mi.

h C(c,1), C(c,2), , C(c,k)i

mi νp mi ν(k 1)p mi

j=1 ν2(j 1)p

ν(c 1)p mi (49)

Dynamical Systems as Temporal Feature Spaces

for c = 1, 2, ..., k. Hence,

emi = (m i , νp m i , ν2p m i , ..., ν(k 1)p m i )

is an eigenvector of Q(1,1) with eigenvalue

= λi 1 ν2pk

By Theorem 4, the corresponding eigenvector mi of Q reads:

mi = ( em i , νN em i , ..., ν(ℓ 1)N em i )

= (m i , νp m i , ..., ν(k 1)p m i , νN m i , νN+p m i , ..., ν(ℓ 1)N+(k 1)p m i )

= (m i , νp m i , ν2p ..., ντ p m i ) .

The last equality holds since from τ = ℓN and N = kp, we have (ℓ 1)N +(k 1)p = τ p. We can calculate the corresponding eigenvalue as

λi = eλi 1 ν2τ

1 ν2p 1 ν2τ

1 ν2N = λi 1 ν2τ

Theorem 5 formally speciﬁes consequences for the dynamical kernel of having a periodic input coupling w of period p in the system (1). First, the number of potentially useful kernel motifs of non-zero weight is reduced from N (the state space dimensionality) to p. Second, the motif structure is even more restricted than in the case of general w. If the past horizon is τ = ℓN, then in general, by theorem 4, each motif mi Rτ consists of a series of ℓcopies of the same core motif emi RN, down-weighted by exponential decay. In the case of periodic w, motifs mi Rτ are formed by a series of ℓk copies of the same small block mi Rp, down-weighted by exponential decay. We will now investigate special settings of the periodic input coupling w RN. Consider ﬁrst the binary setting, i.e. the core periodic block is s = (1, 0, 0, ..., 0) {0, 1}p. Assume w contains k such blocks (N = k p). Then, since for a = 0, 1, 2, ..., p 1,

s, P a s = 1, if a = 0 0, otherwise,

the matrix T Rp p (eq. (47)) will have a diagonal form, T = diag(1, ν2, ..., ν2(p 1)). The eigenvectors mi Rp of T, i = 1, 2, ..., p, correspond to the standard basis ei of Rp, i.e.

all elements of ei are zeros, except for the i-th element, which is 1. The corresponding eigenvalues are λi = ν2(i 1). By theorem 5, each motif

mi = (e i , νp e i , ν2p e i , ..., ντ p e i ) , (50)

with motif weight

ωi = νi 1 1 ν2τ

is a periodic exponentially decaying motif that picks up elements of time series driving (1) with periodicity p and initial lag i. Given a time series u Rτ,

j=1 ν(j 1)p ui+(j 1)p.

In the representation of (eq. (21)) we then have

2 ( m1, u , ν m2, u , ..., νp 1 mp, u ) Rp.

Given another time series v Rτ, the temporal kernel gives

K(u, v) = eu, ev

i=1 ν2(i 1) mi, u mi, v . (52)

In the case of all-ones w with a periodic sign pattern, the core periodic block is s = (+1, 1, 1, ..., 1) { 1, +1}p. For a = 0, 1, 2, ..., p 1, we have

s, P a s = p, if a = 0 p 4, otherwise,

From (eq. (47)), the matrix T Rp p with elements

Ti,j = ν2(i 1) p, if i = j νi+j 2 (p 4), otherwise,

can yield a richer set of eigenvectors mi than the standard basis ei in Rp. An exception is the case of period-4 sign pattern, p = 4. In that case, T is a diagonal matrix T = p diag(1, ν2, ..., ν2(p 1)), exactly the scaled version of T analysed above, when w was the binary vector composed of a series of k blocks of e1 Rp. Hence the four motifs mi will have the form suggested by eq. (50) and the motif weights (51) will be scaled by p = 2. We have thus established:

Dynamical Systems as Temporal Feature Spaces

Corollary 6 Under the assumptions of Theorem 5, assume that the input coupling w {0, 1}N consists of k > 1 copies of the binary standard basis block s = e1 {0, 1}p. Then there are p non-zero wight motifs of the dynamic kernel associated with (1),

mi = (e i , νp e i , ν2p e i , ..., ντ p e i ) ,

with motif weights

ωi = νi 1 1 ν2τ

Each mi is thus a periodic exponentially decaying motif that picks up elements of input time series with periodicity p and initial lag i. Furthermore, if the bipolar input coupling w { 1, +1}N consists of k > 1 copies of the block s = 2e1 1 { 1, +1}4 of period p = 4, then there are four non-zero wight motifs mi (50) with motif weights 2ωi.

8. Illustrative examples

In this section we will illustrate the results obtained so far showing the inﬂuence of the dynamic and input coupling, W and w, respectively, on the strength and richness of motifs of the temporal kernel associated with the dynamical system (1). In all illustrations we will use state space dimensionality N = 100 and re-normalise the dynamic coupling W R100 100 to largest singular value ν = 0.995. The input coupling w is renormalized to unit length. The past horizon will be set to τ = 200. We will show motifs with motif weights up to 10 2 of the highest motif weight. Figure 4 (left) shows motifs of the temporal kernel given by random coupling W, where all elements Wi,j were generated i.i.d. from normal distribution N(0, 1). The motifs are shown in a column-wise fashion, i.e. the x-axis indexes the individual motifs, while the motif values are shown along the y-axis. The associated motif weights are presented in the right plot. As explained in section 5, each of the Markovian motifs picks an element from the recent time series history, yielding a shallow memory involved in the kernel evaluation, with rapidly decaying motif weights. Almost identical results were obtained for Wi,j and wi generated i.i.d. from other distributions (e.g. uniform over [ 1, +1], Bernoulli over { 1, +1} or {0, 1}), as well as for many other settings of w, including the all-ones vector w = 1. Introduction of a structure into random W by imposing symmetry (Wigner W) leads to a slightly richer motif set, albeit still with shallow memory (see ﬁgure 5). Note the high frequency nature of some motifs, as discussed in section 6. Again, the number and nature of the motifs stayed unchanged across a variety of generative mechanisms for W and w described above. The situation changes dramatically when W is set to the scaled permutation matrix of section 7. Figure 6 shows an example of motif and motif weight structure for w generated randomly i.i.d. from N(0, 1). To demonstrate that what really matters, as argued in section 7, is the aperiodicity of w, we show in ﬁgures 7 and 8 motifs and motif weights when w is simply a vector of ones with signs prescribed by the ﬁrst N = 100 digits of binary

motif weights

1 2 3 4 5 6 7 0

Figure 4: Temporal kernel motifs and the corresponding motif weights for randomly generated W and w.

motif weights

1 2 3 4 5 6 7 8 9 10 0

Figure 5: Temporal kernel motifs and the corresponding motif weights for random symmetric Wigner W and random w.

expansion of π and e, respectively. This was suggested in (Rodan and Tino, 2011) as a simple controlled way of generating aperiodic input couplings. Such settings admit a full set of N = 100 highly variable motifs. The scaled block structure of motifs proved in section 7 is clearly visible. In striking contrast, as suggested in section 7, we present in ﬁgure 9 motifs and motif weights for the case of periodic w with period p = 10. As predicted by the theory, the shrunk motif set contains p = 10 simple periodic motifs given by repeated blocks of permuted standard basis ei (with possibly ﬂipped sign). As an example, in ﬁgure 10 we show in greater detail six temporal kernel motifs from ﬁgure 7, all with high motif weights. Compared with the setting of random or symmetric W, besides the sheer number of motifs with higher weight, there is a much richer motif

Dynamical Systems as Temporal Feature Spaces

20 40 60 80 100

motif weights

Figure 6: Temporal kernel motifs and the corresponding motif weights for scaled permutation matrix W and random w.

20 40 60 80 100

motif weights

Figure 7: Temporal kernel motifs and the corresponding motif weights for scaled permutation matrix W and aperiodic all-ones vector w with signs following binary expansion of π.

variety and longer memory. Note that in accordance with theorem 4, since the state space dimensionality and past horizon are set to N = 100 and τ = 200, respectively, the second half of each motif mi is the scaled version of the ﬁrst half, mi,101:200 = ν100 emi, emi = mi,1:100. In order to quantify motif richness , we perform Fast Fourier Transform (FFT) of motifs mi with motif weights ωi up to 10 2 of the highest motif weight. We collect the Fourier coeﬃcients zi,k C of each motif mi along with the corresponding motif weight qi,k = ωi in a set Fi = {(zi,k, qi,k)}k. The total coeﬃcient set is then F = S i Fi. Figure 11 presents distribution of motif Fourier coeﬃcients from F for diﬀerent settings of spectral radius ν {0.96, 0.99, 0.996, 1.0}.

20 40 60 80 100

motif weights

Figure 8: Temporal kernel motifs and the corresponding motif weights for scaled permutation matrix W and aperiodic all-ones vector w with signs following binary expansion of e.

motif weights

1 2 3 4 5 6 7 8 9 10 0

Figure 9: Temporal kernel motifs and the corresponding motif weights for scaled permutation matrix W and periodic binary vector w with period p = 10.

We designed two measures to characterise distribution of Fourier coeﬃcients from F. The ﬁrst is simply the area occupied by the coeﬃcients zi,k. In particular, the coeﬃcient space [ 7, 7]2 in the complex plane was covered by regular grid of cells with side length 0.05. The relative area covered by F is then the ratio of the number of cells visited by the coeﬃcients zi,k to the total number of cells. Figure 12 shows the relative area occupied by motif Fourier coeﬃcients, as a function of the scaling largest singular value ν. It is

Dynamical Systems as Temporal Feature Spaces

50 100 150 200

50 100 150 200

Figure 10: Selection of motifs (τ = 200) of temporal kernel associated with dynamical system (1) of state space dimensionality N = 100. Dynamic coupling W was the scaled permutation matrix with spectral radius ν = 0.995 and the input coupling w was formed by vector of all 1s with signs distributed according to the ﬁrst N = 100 digits of the binary expansion of π.

remarkable how stable the behaviour of the relative area is for the scaled permutation matrix W (black lines), as long as the input coupling w is aperiodic: we tried vector of all 1s with signs distributed randomly (stars), according to the ﬁrst N digits of the binary expansion of π (circles) and e (crosses), i.i.d. elements wi of w from N(0, 1) (squares) and uniform distribution over [ 1, +1] (diamonds). In all cases, the motif richness (measured by relative area covered by Fourier coeﬃcients) monotonically increases with ν up to ν = 0.99 (dashed red vertical line), where there is a phase transition marking the onset of a rapid decline in motif richness. No such behaviour can be observed for random W (Wi,j generated i.i.d. from N(0, 1) (dashed green lines), where the motif richness is consistently low. Our second measure of motif richness takes into account motif weights, instead of simply noting whether a particular small cell in the complex plane of Fourier coeﬃcients was visited or not. To that end the motif weights were ﬁrst normalised to the total sum 1. In each cell c we calculate the mean qc of the weights qi,k of coeﬃcients zi,k that landed in that cell. The relative weighted area covered by F is the ratio of the accumulated mean weight in cells visited by the coeﬃcients zi,k, P c qc, to the total number of cells. Figure 13 shows that the relative weighted area exhibits the same universal behaviour as a function of spectral radius ν of W as that followed by the relative area (ﬁgure 12).

9. Discussion and Conclusion

Parametrised state space models have been used extensively in the machine learning community, e.g. in the form of recurrent neural networks. Since learning of the dynamic part

-8 -6 -4 -2 0 2 4 6 8 -8

-8 -6 -4 -2 0 2 4 6 8 -8

-8 -6 -4 -2 0 2 4 6 8 -8

-8 -6 -4 -2 0 2 4 6 8 -8

Figure 11: Fourier coeﬃcient distribution of motifs of the temporal kernel associated with dynamical system (1) of state space dimensionality N = 100. Dynamic coupling W was the scaled permutation matrix with spectral radius ν = 0.96 (a), 0.99 (b), 0.996 (c) and 1.0 (d). The input coupling w was formed by vector of all 1s with signs distributed according to the ﬁrst N = 100 digits of the binary expansion of π. Motifs used have motif weights from the maximal motif weight ωmax to 10 2ωmax.

is known to be diﬃcult, the key idea of reservoir computation is to restrict learning only to the static readout part from the state space, while keeping the underlying dynamical system ﬁxed. Furthermore, the readout is usually a simple linear mapping. This is very similar in spirit to the idea of kernel machines: Transform the inputs using a ﬁxed mapping (usually only implicitly deﬁned) into another rich feature space and only train a linear model in that space, with the dot product as the canonical tool. The key to understanding the workings of kernel machines is to understand the feature space: How are the data mapped from the original space into the richer feature space? What similarity notions does the inner product in the feature space express in terms of the original data space? This paper is the ﬁrst study suggesting to formalise and rigorously analyse the connection between ﬁxed dynamics in reservoir computation models and ﬁxed kernel-based transformations to feature spaces in kernel machines. So far, theoretical tools at our disposal that would allow us to make statements regarding appropriateness of diﬀerent settings of dynamic coupling in reservoir computation models have been rather limited. The new framework introduced in this paper allows us to investigate richness of internal representations of input time series in terms of dynamic states and how they operate to produce desired outputs in terms of matching with a set of temporal motifs deﬁned by the structure of the dynamic coupling. Our investigations lead to several rather surprising results:

Dynamical Systems as Temporal Feature Spaces

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 largest singular value

relative area

F. coefs of motifs - relative area

1s sign pi 1s sign e 1s sign rnd rndn U[-1,1]

Figure 12: Relative area covered by Fourier coeﬃcients of the temporal kernel motifs as a function of largest singular value of W. Dimensionality of the dynamical system (1) was set to N = 100 with dynamic coupling W formed by the scaled permutation matrix (black lines) or random matrix with individual elements generated i.i.d. from N(0, 1) (dashed green lines). The input coupling w was formed by vector of all 1s with signs distributed randomly (stars), according to the ﬁrst N = 100 digits of the binary expansion of π (circles) and e (crosses). We also show the case where the elements of w were generated i.i.d. from N(0, 1) (squares) and uniform distribution over [ 1, +1] (diamonds). The vectors w were renormalised to unit length. In case of deterministic w but stochastic generation of W, the experiment was repeated 30 times. When both W and w were generated randomly, the experiment was repeated 60 times. In those cases, we show the means and standard deviations of the relative area covered by the Fourier coeﬃcients. In each experimental run, the motifs used have their weights ranging from the maximal motif weight ωmax to 10 2ωmax.

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 largest singular value

relative area

F. coefs - relative area (mean m. weight)

1s sign pi 1s sign e 1s sign rnd rndn U[-1,1]

Figure 13: Relative area measured using local mean motif weight of Fourier coeﬃcients of temporal kernel motifs. All other settings are equal to those of ﬁgure 12.

1. The usual strategy of random generation (i.i.d.) of input and dynamic coupling weights in the reservoir of ESN leads to shallow memory time series representations, corresponding to cross-correlation operator with fast exponentially decaying coeﬃcients. This ﬁnding is quite robust with respect to distributions with which the ESN parameters are generated.

2. Imposing symmetry on coupling weights of the dynamical system yields a constrained dynamic kernel that is a linear combination of trivial kernels matching the input time series with either a straightforward exponentially decaying motif or an exponentially decaying motif of the highest frequency.

3. The simple cycle reservoir topology has been empirically demonstrated to have the potential to equal the performance of more complex reservoir settings (Rodan and Tino, 2010). The dynamical system can have high state space dimensionality, but it is speciﬁed only through two free parameters, namely a constant coupling weight r > 0 between consecutive reservoir units in the cycle topology7 and a constant weight v > 0 of the input-to-state coupling. The crucial constraint is that the input coupling vector, while all its elements have the same absolute value v, has a-periodically distributed signs. In this paper we have provided rigorous arguments for the need of aperiodic sign distribution in the input coupling, showing that compared with periodic sign patterns, the feature representations of time series in case of a-periodically distributed signs are much richer. In addition, even though such settings of the dynamical system are extremely simple (two free parameters) and completely deterministic8, the number

7. Note that this also automatically makes the spectral radius ν of W equal to r, so no separate tuning of ν is needed. 8. The sign distribution can follow binary expansion of an irrational number, such as π or e.

Dynamical Systems as Temporal Feature Spaces

and variety of dynamic motifs of the associated dynamic kernel are far superior to the standard conﬁgurations of ESN that rely on stochastic generation of coupling weights.

4. By quantifying motif richness of the dynamic kernel associated with cycle reservoir topology, we showed that there is a phase transition in motif richness at spectral radius values close to, but strictly less than 1. This conﬁrms previous ﬁndings in the ESN literature on the importance of tuning the dynamical system at the edge of stability (Bertschinger and Natschlger, 2004).

The arguments in this paper were developed under the assumption of linear dynamical system with linear readout map. However, it has been proved that even linear dynamical systems can be universal, provided they are equipped with polynomial readout maps (Grigoryeva and Ortega, 2018b,a; Gonon and Ortega, 2019). In our setting, this corresponds to considering instead of the linear kernel (eq. (16)) a polynomial kernel (of some degree d),

K(u, v) = ( φ(u), φ(v) + a)d. (53)

Clearly, memory characteristics of such a kernel will not change with oﬀset a R or increasing polynomial degree d. By eqs. (20-21), the polynomial kernel can be written as

K(u, v) = ( eu, ev + a)d, (54)

where the elements eui, evi, i = 1, 2, ..., Nm, of eu, ev RNm are projections of u, v Rτ

onto motifs mi Rτ, scaled by the motif weight. Non-linear manipulation of eui, evi can increase the capacity of the readout mapping but only at the level of memory and feature set deﬁned by the motifs. Randomly generated or symmetric reservoir couplings will still lead to constrained shallow memory kernels. We have shown that simple cycle reservoirs tuned at the edge of stability, with aperiodic sign patterns in input coupling are among the ESN architectures capable of approximating deep memory processes when linear dynamical system and polynomial readout are used. Of course, when non-linearity is allowed in the dynamical system (for example, by employing a logistic sigmoid transfer function), even randomly generated reservoirs may be able to capture deeper memory. Our study contributes to the debate about what characteristics of the dynamical system are desirable to make it a universal temporal ﬁlter capable of producing rich representations of input time series in its state space. Such representations can then be further utilised by readouts, purpose-build for a variety tasks. Ozturk et al. (2007) hypothesised that the distribution of reservoir activations should have high entropy and suggested that it was desirable for the reservoir weight matrix to have eigenvalues uniformly distributed inside the unit circle. In this way the system dynamics would include uniform coverage of time constants (related to the uniform distribution of the poles) (Ozturk et al., 2007). Our work suggests a counterargument when linear reservoirs and non-linear readouts are used: Uniform distribution of eigenvalues inside the unit circle can be achieved by random generation of the reservoir matrix. However, this leads to a highly constrained set of shallow memory motifs of the associated dynamic kernel that describes how features of time series seen in the past contribute to the production of the model output. On the other hand, a very simple setting of high dimensional dynamical system governed by just two free parameters can achieve a much richer and deeper memory motifs of the dynamic kernel. Note that in

this case, the eigenvalues of the reservoir coupling matrix are distributed uniformly along a circle with radius equal to the spectral radius of the reservoir matrix.

Acknowledgement: This work was supported by the European Commission Horizon 2020 Innovative Training Network SUNDIAL (SUrvey Network for Deep Imaging Analysis and Learning), Project ID: 721463.

Appendix A. Markovian motifs resulting from random dynamical coupling W

In this appendix we demonstrate that the approximations in section 5 of motifs and their weights in the case of random dynamic coupling W hold for diverse distributions used to generate elements of W. In particular, in ﬁgure 14 we show kernel motifs obtained from W R100 100 generated element-wise i.i.d. from N(0, 1) and renormalised to largest singular value ν = 0.995 (the setting used in section 5) and input coupling vector w generated as a vector of all 1s with randomly ﬂipped signs (in each dimension with equal probability 0.5), renormalised to unit vector. The associated squared motif weights are presented in ﬁgure 15. We also show in ﬁgures 16 and 17 the kernel motifs and eigenvalues of Q when both f W and w consist of all 1s with signs ﬂipped independently element-wise with probability 0.5 (dynamical coupling renormalised to largest singular value ν = 0.995 and w to unit vector). In both cases, the Markovian motifs and their weights are almost indistinguishable from those shown in section 5 (ﬁgures 2 and 3).

L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer. Information processing using a single dynamical node as complex system. Nature Comm., 2, 2011.

L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of ﬁnite state markov chains. Ann. Math. Statist., 37(6):1554 1563, 12 1966.

Y. Bengio, P. Frasconi, and P Simard. The problem of learning long-term dependencies in recurrent networks. In Proceedings of the 1993 IEEE International Conference on Neural Networks, volume 3, pages 1183 1188, 1993.

Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is diﬃcult. IEEE Transactions on Neural Networks, 5(2):157 166, 1994.

Nils Bertschinger and Thomas Natschlger. Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7):1413 1436, 2004.

Terry Bossomaier, Lionel Barnett, Michael Harr, and Joseph T. Lizier. An Introduction to Transfer Entropy: Information Flow in Complex Systems. Springer Publishing Company, Incorporated, 1st edition, 2016.

Dynamical Systems as Temporal Feature Spaces

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

Figure 14: The ﬁrst 10 elements of the four most dominant kernel motifs corresponding to W R100 100 generated element-wise i.i.d. from N(0, 1) and renormalised to largest singular value ν = 0.995. The input coupling w was generated as a vector of all 1s with randomly ﬂipped signs (in each dimension with equal probability 0.5). Shown are the means and standard deviations across 100 joint realisations of W and w.

eigenvalues of Q

1 2 3 4 5 6 7 8 9 10 0

Figure 15: Eigenvalues (squared motif weights) of the metric tensor Q for random setting of the dynamical system (1) as described in ﬁg. 14. Solid bars correspond to the means of the actual eigenvalues λi across the 100 realisations of W and w (also shown are standard deviations). The theoretically predicted values (eq. (35)) are shown as the red line.

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

1 2 3 4 5 6 7 8 9 10 -0.2

Figure 16: The ﬁrst 10 elements of the four most dominant kernel motifs corresponding to W R100 100 and w, both consisting of all 1s with signs ﬂipped independently element-wise with probability 0.5. W was renormalised to largest singular value ν = 0.995. Shown are the means and standard deviations across 100 joint realisations of W and w.

eigenvalues of Q

1 2 3 4 5 6 7 8 9 10 0

Figure 17: Eigenvalues (squared motif weights) of the metric tensor Q for random setting of the dynamical system (1) as described in ﬁg. 16. Solid bars correspond to the means of the actual eigenvalues λi across the 100 realisations of W and w (also shown are standard deviations). The theoretically predicted values (eq. (35)) are shown as the red line.

Dynamical Systems as Temporal Feature Spaces

K. Bush and C. Anderson. Modeling reward functions for incomplete state representations via echo state networks. In Proceedings of the International Joint Conference on Neural Networks, Montreal, Quebec, July 2005.

R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. Training performance of echo state neural networks. In 2016 IEEE Statistical Signal Processing Workshop (SSP), pages 1 4, 2016.

Romain Couillet, Gilles Wainrib, Harry Sevi, and Haﬁz Tiomoko Ali. The asymptotic performance of linear echo state neural networks. Journal of Machine Learning Research, 17(178):1 35, 2016.

J Dambre, David Verstraeten, Benjamin Schrauwen, and Serge Massar. Information processing capacity of dynamical systems. Scientiﬁc reports, 2:514, 07 2012.

Carlton Downey, Ahmed Hefny, Boyue Li, Byron Boots, and GeoﬀGordon. Predictive state recurrent neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pages 6055 6066, USA, 2017. Curran Associates Inc.

M. Ficura. Forecasting stock market realized variance with echo state neural networks. European Financial and Accounting Journal, 3, 2017. doi: 10.18267/j.efaj.193.

C. Gallicchio, A. Micheli, and L. Pedrelli. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268:87 99, 2017.

S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences, 105:18970 18975, 2008.

Surya Ganguli and Haim Sompolinsky. Short-term memory in neuronal networks through dynamical compressed sensing. In Advances in neural information processing systems, pages 667 675, 2010.

Lukas Gonon and Juan-Pablo Ortega. Reservoir computing universality with stochastic inputs. IEEE Transactions on Neural Networks and Learning Systems, To appear, 02 2019.

L. Grigoryeva and J.-P. Ortega. Echo state networks are universal. Neural Networks, 108: 495 508, 2018a.

L. Grigoryeva and J.-P. Ortega. Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-aﬃne systems. J. Mach. Learn. Res., 19(1):892 931, January 2018b.

L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals. Neural Computation, 28(7):1411 1451, 2016.

B. Hammer. Generalization ability of folding networks. IEEE Transactions on Knowledge and Data Engineering, 13(2):196 206, 2001.

B. Hammer and P. Tino. Recurrent neural networks with small weights implement deﬁnite memory machines. Neural Computation, 15(8):1897 1929, 2003.

J. Hochreiter and J. Schmidhuber. Long short term memory. Neural Computation, 9(8): 1735 1780, 1997.

G. Holzmann and H. Hauser. Echo state networks with ﬁlter neurons and a delay and sum readout. Neural Networks, 32(2):244 256, 2009.

H. Jaeger. The echo state approach to analysing and training recurrent neural networks. Technical report gmd report 148, German National Research Center for Information Technology, 2001.

H. Jaeger. Short term memory in echo state networks. Technical report gmd report 152, German National Research Center for Information Technology, 2002a.

H. Jaeger. A tutorial on training recurrent neural networks, covering bppt, rtrl, ekf and the echo state network approach. Technical report gmd report 159, German National Research Center for Information Technology, 2002b.

H. Jaeger and H. Hass. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication. Science, 304:78 80, 2004.

H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert. Optimisation and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335 352, 2007.

Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljacic, and Yoshua Bengio. Gated orthogonal recurrent units: On learning to forget. Neural Computation, 31(4):765 783, 2019.

R. E. Kalman. Approach to linear ﬁltering and prediction problems. Journal of Basic Engineering, 82(1):35, 1960.

R. Legenstein and W. Maass. What makes a dynamical system computationally powerful? In S. Haykin, J. C. Principe, T. Sejnowski, and J. Mc Whirter, editors, New Directions in Statistical Signal Processing: From Systems to Brains, pages 127 154. MIT Press, 2007.

Joseph T. Lizier, Mikhail Prokopenko, and Albert Y. Zomaya. Detecting non-trivial computation in complex dynamics. In Proceedings of the 9th European Conference on Advances in Artiﬁcial Life, ECAL 07, pages 895 904, Berlin, Heidelberg, 2007. Springer-Verlag.

Joseph T. Lizier, Mikhail Prokopenko, and Albert Y. Zomaya. Local measures of information storage in complex distributed computation. Inf. Sci., 208:39 54, November 2012.

M. Lukosevicius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127 149, 2009.

W. Maass, T. Natschlager, and H. Markram. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation, 14(11):2531 2560, 2002.

Dynamical Systems as Temporal Feature Spaces

G. Manjunath and H. Jaeger. Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks. Neural Computation, 25(3):671 696, 2013. ISSN 0899-7667.

Henryk Minc. Nonnegative Matrices. John Wiley and Sons Ltd-Interscience Series in Discrete Mathematics and Optimization, 1988.

O. Obst and J. Boedecker. Guided self-organization of input-driven recurrent neural networks. In In Guided Self-Organization: Inception. Emergence, Complexity and Computation, pages 319 340. Springer, Berlin, Heidelberg, 2014.

Oliver Obst, Joschka Boedecker, and Minoru Asada. Improving recurrent neural network performance using transfer entropy. In Proceedings of the 17th International Conference on Neural Information Processing: Models and Applications - Volume Part II, ICONIP 10, pages 193 200, Berlin, Heidelberg, 2010. Springer-Verlag.

M. C. Ozturk, D. Xu, and J. Principe. Analysis and design of echo state network. Neural Computation, 19(1):111 138, 2007.

A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131 144, 2011.

Ali Rodan and Peter Tino. Minimum complexity echo state network. ieee trans neural netw. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 22:131 44, 11 2010. doi: 10.1109/TNN.2010.2089641.

Andr e R ohm and Kathy L udge. Multiplexed networks: reservoir computing with virtual and real nodes. Journal of Physics Communications, 2(8):085007, aug 2018.

M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In In Proceedings of the International Congress of Mathematicians. Volume III, Hindustan Book Agency, New Delhi, pages 1576 1602, 2010.

B. Schrauwen, M. Wardermann, D. Verstraeten, J.J. Steil, and D Stroobandt. Improving reservoirs using intrinsic plasticity. Neurocomputing, 71(7-9):1159 1171, 2008.

Hava T. Siegelmann and Eduardo D. Sontag. Analog computation via neural networks. Theoretical Computer Science, 131(2):331 360, 1994.

M.D. Skowronski and J.G. Harris. Minimum mean squared error time series classiﬁcation using an echo state network prediction model. In IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, pp. 3153-3156, 2006.

J. Steil. Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning. Neural Networks, 20:353 364, 2007.

Tobias Strauss, Welf Wustlich, and Roger Labahn. Design strategies for weight matrices of echo state networks. Neural Computation, 24(12):3246 3276, 2012. ISSN 0899-7667.

Gouhei Tanaka, Toshiyuki Yamane, Jean Benoit Hroux, Ryosho Nakane, Naoki Kanazawa, Seiji Takeda, Hidetoshi Numata, Daiju Nakano, and Akira Hirose. Recent advances in physical reservoir computing: A review. Neural Networks, 115:100 123, 2019.

P. Tino. Asymptotic ﬁsher memory of randomized linear symmetric echo state networks. Neurocomputing, 298:4 8, 2018.

P. Tino and G. Dorﬀner. Predicting the future of discrete sequences from fractal representations of the past. Machine Learning, 45(2):187 218, 2001.

P. Tino and B. Hammer. Architectural bias in recurrent neural networks: Fractal analysis. Neural Computation, 15(8):1931 1957, 2004.

P. Tino and A. Rodan. Short term memory in input-driven linear dynamical systems. Neurocomputing, 112:58 63, 2013.

M. H. Tong, A.D. Bicket, E.M. Christiansen, and G.W. Cottrell. Learning grammatical structure with echo state network. Neural Networks, 20:424 432, 2007.

Y. Xue, L. Yang, and S. Haykin. Decoupled echo state networks with lateral inhibition. Neural Networks, 20:365 376, 2007.

Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural Networks, 35:1 9, 2012. ISSN 0893-6080.

Byung-Jun Yoon. Hidden markov models and their applications in biological sequence analysis. Current genomics, 10:402 15, 09 2009.

B. Zhang, D. J. Miller, and Y. Wang. Nonlinear system modeling with random matrices: Echo state networks revisited. IEEE Transactions on Neural Networks and Learning Systems, 23(1):175 182, Jan 2012. ISSN 2162-2388. doi: 10.1109/TNNLS.2011.2178562.