Bifurcations and loss jumps in RNN training

Lukas Eisenmann1,2,*, Zahra Monfared1,*, Niclas Göring1,2, and Daniel Durstewitz1,2,3
{lukas.eisenmann,zahra.monfared,daniel.durstewitz}@zi-mannheim.de
1Department of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
2Faculty of Physics and Astronomy, Heidelberg University, Heidelberg, Germany
3Interdisciplinary Center for Scientific Computing, Heidelberg University
*These authors contributed equally

Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both how trained RNNs solve complex tasks and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN thus allows us to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

1 Introduction

Recurrent neural networks (RNNs) are common and powerful tools for learning sequential tasks or modeling and forecasting time series data [31, 65, 2, 43, 9, 26]. Typically, RNNs are "black boxes" whose inner workings are hard to dissect. Techniques from dynamical systems theory (DST) can significantly aid in this effort, as RNNs are formally discrete-time dynamical systems (DS) [37]. A better understanding of how RNNs solve their tasks is important for detecting failure modes and designing better architectures and training algorithms. In scientific machine learning, on the other hand, RNNs are often employed for reconstructing unknown DS from sets of time series observations [10, 30, 64, 63, 23, 55], e.g. in climate or disease modeling. In this case, the RNN is supposed to provide an approximation to the flow of the observed system that reproduces all its dynamical properties, e.g., cyclic behavior in climate or epidemiological systems. In such scientific or medical
settings, a detailed understanding of the RNN's dynamics and its sensitivity to parameter variations is in fact often crucial. Beyond understanding the dynamical properties of a once trained RNN, it may also be of interest to know how its dynamical repertoire changes with changes in its parameters. The parameter space of any DS is partitioned into regions (or sets) with topologically different dynamical behaviors by bifurcation curves (or, more generally, manifolds). Such bifurcations, qualitative changes in the system dynamics due to parameter variations, may not only be crucial in scientific applications where we use RNNs, for instance, to predict tipping points in climate systems or medical scenarios like sepsis detection [45]. They also pose severe challenges for the training process itself, as qualitative changes in RNN dynamics may go hand in hand with sudden jumps in the loss landscape [16, 44]. Although methods from DST have significantly advanced the field in recent years, especially with regard to algorithms for reconstructing nonlinear DS from data [55, 46, 67], progress is still hampered by the lack of efficient tools for analyzing the dynamics of higher-dimensional RNNs and their bifurcations. In particular, methods are needed for exactly locating geometrical objects like fixed points or cycles in an RNN's state space, but current numerical techniques do not scale well to higher-dimensional scenarios and provide only approximate solutions [29, 21, 61].

The contributions of this work are threefold: After first providing an introduction to bifurcations of piecewise-linear (ReLU-based) RNNs (PLRNNs), which have been extensively used for reconstructing DS from empirical data [10, 30], we mathematically prove that certain bifurcations during the training process will indeed cause loss gradients to diverge to infinity, resulting in abrupt jumps in the loss, while others will cause them to vanish. As RNNs are likely to undergo several bifurcations during training on their way from some initial parameter configuration toward a dynamics that successfully implements a given task, this poses severe challenges for RNN training and may be one of the reasons for exploding and vanishing gradients [16, 44, 8, 25]. We then develop a novel, efficient heuristic algorithm for exactly locating all fixed points and k-cycles in PLRNNs, which can be used to delineate bifurcation manifolds in higher-dimensional systems. Our algorithm finds these dynamical objects in orders of magnitude less time than an exhaustive search would take. Using this algorithm, we demonstrate empirically that steep cliffs in loss landscapes and bifurcation curves indeed tightly overlap, and that bifurcations in the system dynamics are accompanied by sudden loss jumps. Finally, we prove and demonstrate that the recently introduced technique of generalized teacher forcing (GTF) [24] completely eliminates certain types of bifurcation in training, providing an explanation for its efficiency.

2 Related Work

DS analysis of trained RNNs. In many areas of science one is interested in identifying the nonlinear DS that explains a set of observed time series [12, 23]. A variety of purely data-driven machine learning approaches have been developed for this purpose [11, 51, 15], but mostly RNNs are used for the goal of reconstructing DS from measured time series [10, 55, 30, 64, 63, 46, 45].
In scientific settings in particular, but also often in engineering applications, we seek a detailed understanding of the system dynamics captured or the dynamical repertoire produced by a trained RNN [62, 33, 61, 14, 13]. To analyze an RNN's dynamics, typically its fixed points are determined numerically, for instance by optimizing some cost function [21, 34, 61], by numerically traversing directional fibers [29], or by co-training a switching linear dynamical system [56]. These techniques, however, often scale poorly with dimensionality, provide only approximate solutions, and are not designed for detecting other dynamical objects like k-cycles. Detecting such objects is, however, crucial for understanding the complete dynamical repertoire of a trained RNN. PLRNNs are of particular interest in this context [10, 55, 30, 38], because their piecewise linear nature makes some of their properties analytically accessible [55, 10], which is a tremendous advantage from the perspective of DST. In this work, we develop an efficient heuristic algorithm for locating a PLRNN's fixed points exactly, as well as its k-cycles.

Bifurcations and loss jumps in RNN training. The idea that bifurcations in RNN dynamics could impede the training process is not new [16, 44]. Doya [16], to our knowledge, was the first to point out that even in simple single-unit RNNs with sigmoid activation function, (saddle-node) bifurcations may occur as an RNN parameter is adapted during training. This may not only cause an abrupt jump in training loss, but could lead to situations where it is impossible, even in principle, to reach the training objective (the desired target output), as across the bifurcation point there is a discrete change in network behavior [16, 68, 4]. Pascanu et al. [44] discussed similar associations between steep cliffs in RNN loss functions and bifurcations. Although of profound relevance for training success, this topic has received surprisingly little attention over the years. Haschke and Steil [22] extended previous work by a more formal treatment of bifurcation boundaries in RNNs, and Marichal et al. [35] examined fold bifurcations in RNNs. The effects of bifurcations and their relation to exploding gradients in gated recurrent units (GRUs) were investigated in Kanai et al. [27]. Ribeiro et al. [50] looked at the connection between dynamics and smoothness of the cost function, but failed to find a link between bifurcations and jumps in performance. In contrast, Rehmer and Kroll [49] observed large gradients at bifurcation boundaries and concluded that bifurcations can indeed cause problems in gradient-based optimization. To the best of our knowledge, we are, however, the first to formally prove a direct link between bifurcations and the behavior of loss gradients, and to derive a systematic and efficient algorithmic procedure for identifying bifurcation manifolds for a class of ReLU-based RNNs.

3 Theoretical analysis

In this paper we will focus on PLRNNs as one representative of the wider class of ReLU-based RNNs, but similar derivations and algorithmic procedures could be devised for any type of ReLU-based RNN (in fact, many other types of ReLU-based RNNs could be brought into the same functional form as PLRNNs; e.g. [10]). We will first, in sect. 3.1, provide some theoretical background on bifurcations in PLRNNs, and illustrate how existence and stability regions of fixed points and cycles can be analytically computed for low-dimensional (2d) PLRNNs. In sect.
3.2 we will then state two theorems regarding the association of bifurcations and loss gradients in training. It turns out that for certain types of bifurcations, exploding or vanishing gradients are inevitable in gradient-based training procedures like Back-Propagation Through Time (BPTT).

3.1 Bifurcation curves in PLRNN parameter space

The PLRNN, originally introduced as a kind of discrete-time neural population model [18], has the general form

$$z_t = F_\theta(z_{t-1}, s_t) = A z_{t-1} + W \phi(z_{t-1}) + C s_t + h, \qquad (1)$$

where $z_t \in \mathbb{R}^M$ is the latent state vector and $\theta$ are the system parameters, consisting of the diagonal matrix $A \in \mathbb{R}^{M \times M}$ (auto-regression weights), the off-diagonal matrix $W \in \mathbb{R}^{M \times M}$ (coupling weights), the element-wise rectified linear unit (ReLU) function $\phi(z_{t-1}) = \max(z_{t-1}, 0)$, a constant bias term $h \in \mathbb{R}^M$, and external inputs $s_t \in \mathbb{R}^K$ weighted by $C \in \mathbb{R}^{M \times K}$. The original formulation of the PLRNN is stochastic [18, 30], with a Gaussian noise term added to eq. (1), but here we will consider the deterministic variant. Formally, like other ReLU-based RNNs, PLRNNs constitute piecewise linear (PWL) maps, a subclass of piecewise smooth (PWS) discrete-time DS. Define $D_{\Omega(t)} := \mathrm{diag}(d_{\Omega(t)})$ as a diagonal matrix with indicator vector $d_{\Omega(t)} := (d_1, d_2, \dots, d_M)$ such that $d_m(z_{m,t}) =: d_m = 1$ whenever $z_{m,t} > 0$, and zero otherwise. Then (1) can be rewritten as

$$z_t = F_\theta(z_{t-1}) = (A + W D_{\Omega(t-1)})\, z_{t-1} + h =: W_{\Omega(t-1)}\, z_{t-1} + h, \qquad (2)$$

where we have ignored external inputs $s_t$ for simplicity. There are in general $2^M$ different configurations for the matrix $D_{\Omega(t-1)}$ and hence for the matrix $W_{\Omega(t-1)}$, dividing the phase space into $2^M$ sub-regions separated by switching manifolds (see Appx. A.1.1 for more details). Recall that fixed points of a map $z_t = F_\theta(z_{t-1})$ are defined as the set of points for which $z^* = F_\theta(z^*)$, and that the type (node, saddle, spiral) and stability of a fixed point can be read off from the eigenvalues of the Jacobian $J_t := \frac{\partial F_\theta(z_{t-1})}{\partial z_{t-1}} = \frac{\partial z_t}{\partial z_{t-1}}$ evaluated at $z^*$ [3, 47]. Similarly, a k-cycle of the map $F_\theta$ is a periodic orbit $\{z_1^*, z_2^*, \dots, z_k^*\}$ such that each of the periodic points $z_i^*$, $i = 1 \dots k$, is distinct and is a solution of the equation $z_i^* = F_\theta^k(z_i^*)$, i.e. of the k-times-iterated map $F_\theta$. Type and stability of a k-cycle are then determined via the Jacobian $\prod_{r=1}^{k} J_{t+k-r} = \prod_{r=1}^{k} \frac{\partial z_{t+k-r}}{\partial z_{t+k-r-1}} = \frac{\partial z_{t+k-1}}{\partial z_{t-1}}$. Solving these equations and computing the corresponding Jacobians thus allows us to determine all existence and stability regions of fixed points and cycles, where the latter are a subset of the former, bounded by bifurcation curves (see Appx. A.1 for more formal details).

To provide a specific example, assume M = 2 and fix, for the purpose of this exposition, parameters $w_{12} = w_{22} = 0$, such that we have

$$W_{\Omega_1} = W_{\Omega_3} = \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix}, \qquad W_{\Omega_2} = W_{\Omega_4} = \begin{pmatrix} a_{11} + w_{11} & 0 \\ w_{21} & a_{22} \end{pmatrix}, \qquad (3)$$

i.e. only one border which divides the phase space into two distinct sub-regions (see Appx. A.1.1). For this setup, Fig. 1A provides examples of analytically determined stability regions for two low-order cycles in the $(a_{11}, a_{11}+w_{11})$-parameter plane (see Appx. A.1). Note that there are regions in parameter space where two or more stability regions overlap: in these regions we have multi-stability, the co-existence of different attractor states in the PLRNN's state space. As noted above, bifurcation curves delimit the different stability regions in parameter space and are hence associated with abrupt changes in the topological structure of a system's state space.
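To make the piecewise-linear form of eq. (2) and the cycle Jacobian concrete, the following is a minimal NumPy sketch, not the authors' reference implementation; apart from $a_{22} = 0.2$ and $w_{21} = 0.5$, which follow Fig. 1, the parameter values are arbitrary illustrative choices.

```python
import numpy as np

def plrnn_step(z, A, W, h):
    """One PLRNN step, eq. (2) (external inputs ignored):
    z_t = (A + W D) z_{t-1} + h, with D = diag(1[z_{t-1} > 0])."""
    D = np.diag((z > 0).astype(float))
    return (A + W @ D) @ z + h

def cycle_jacobian(cycle_points, A, W):
    """Jacobian of the k-times iterated map along a (candidate) k-cycle:
    the product of the region matrices W_Omega = A + W D visited by the orbit."""
    J = np.eye(len(A))
    for z in cycle_points:
        D = np.diag((z > 0).astype(float))
        J = (A + W @ D) @ J
    return J

# 2d example of eq. (3): w12 = w22 = 0, so only sign(z1) selects the sub-region;
# a22 = 0.2 and w21 = 0.5 follow Fig. 1, the remaining values are arbitrary choices.
a11, w11, w21, a22 = 0.3, -2.3, 0.5, 0.2   # i.e. a11 + w11 = -2.0
A = np.diag([a11, a22])
W = np.array([[w11, 0.0], [w21, 0.0]])     # w11 kept as a free parameter, as in eq. (3)
h = np.array([1.0, 0.0])

z = np.zeros(2)
orbit = []
for _ in range(500):                        # iterate toward an attractor
    z = plrnn_step(z, A, W, h)
    orbit.append(z)
eigvals = np.linalg.eigvals(cycle_jacobian(orbit[-2:], A, W))
print(np.abs(eigvals))                      # here: all |lambda| < 1 -> the 2-cycle is stable
```

For the values chosen here the orbit settles onto a stable 2-cycle, whose Jacobian is simply the product of the two region matrices it visits.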
In general, there are many different types of bifurcations through which dynamical objects can come into existence, disappear, or change stability (see, e.g., [3, 40, 47]), the most common ones being saddle-node, transcritical, pitchfork, homoclinic, and Hopf bifurcations. In comparison with smooth systems, the bifurcation theory of PWS (or PWL) maps includes additional dynamical phenomena related to the existence of borders in the phase space [6]. Border-collision bifurcations (BCBs) arise when for a PWS map a specific point of an invariant set collides with a border and this collision leads to a qualitative change of dynamics [5, 6, 42]. More specifically, a BCB occurs if for a PWS map $z_t = F_\theta(z_{t-1})$ a fixed point or k-cycle either crosses the switching manifold $P_i := \{z \in \mathbb{R}^n : e_i^\top z = 0\}$ transversely at $\theta = \theta^*$ and its qualitative behavior changes in the event, or if it collides on the border with another fixed point or k-cycle and both objects disappear [7, 41]. Degenerate transcritical bifurcations (DTBs) occur when a fixed point or a periodic point of a cycle tends to infinity and one of its eigenvalues tends to 1 under variation of a parameter. Specifically, let $\Gamma_k$, $k \ge 1$, be a fixed point or a k-cycle with the periodic points $\{z_1^*, z_2^*, \dots, z_k^*\}$, and assume $\lambda_i$ denotes an eigenvalue of the Jacobian matrix at the periodic point $z_i^*$, $i \in \{1, 2, \dots, k\}$. Then $\Gamma_k$ undergoes a DTB at $\theta = \theta^*$ if $\lambda_i(\theta^*) \to +1$ and $z_i^* \to \infty$. $\Gamma_k$ undergoes a degenerate flip bifurcation (DFB) iff $\lambda_i(\theta^*) = -1$ and the map $F^k$ has locally, in some neighborhood of $z_i^*$, infinitely many 2-cycles at $\theta = \theta^*$. A center bifurcation (CB) occurs if $\Gamma_k$ has a pair of complex conjugate eigenvalues $\lambda_{1,2}$ and locally becomes a center at the bifurcation value $\theta = \theta^*$, i.e. if its eigenvalues are complex and lie on the unit circle ($|\lambda_{1,2}(\theta^*)| = 1$). Another important class of bifurcations are multiple attractor bifurcations (MABs), discussed in more detail in Appx. A.1.5 (see Fig. S3). In addition to existence and stability regions of fixed points and cycles, in Appx. A.1.2-A.1.4 we also illustrate how to analytically determine the types of bifurcation curves bounding the existence and stability regions of k-cycles (DTB, DFB, CB and BCB curves; see also Fig. S6 for a 1d example).

3.2 Bifurcations and loss jumps in training

Here we will prove that major types of bifurcations discussed above are always associated with exploding or vanishing gradients in PLRNNs during training, and hence often with abrupt jumps in the loss. For this we may assume any generic loss function $\mathcal{L}(\theta)$, like a negative log-likelihood or a mean-squared-error (MSE) loss, and a gradient-based training technique like BPTT [52] or Real-Time Recurrent Learning (RTRL) that involves a recursion (via the chain rule) through loss terms across time. The first theorem establishes that a DTB inevitably causes exploding gradients.

Theorem 1. Consider a PLRNN of the form (2) with parameters $\theta = \{A, W, h\}$. Assume that it has a stable fixed point or k-cycle $\Gamma_k$ ($k \ge 1$) with $B_{\Gamma_k}$ as its basin of attraction. If $\Gamma_k$ undergoes a degenerate transcritical bifurcation (DTB) for some parameter value $\theta = \theta_0 \in \theta$, then the norm of the PLRNN loss gradient, $\left\| \frac{\partial \mathcal{L}_t}{\partial \theta} \right\|$, tends to infinity at $\theta = \theta_0$ for every $z_1 \in B_{\Gamma_k}$, i.e. $\lim_{\theta \to \theta_0} \left\| \frac{\partial \mathcal{L}_t}{\partial \theta} \right\| = \infty$.

Proof. See Appx. A.2.1.

However, bifurcations may also cause gradients to suddenly vanish, as is the case for a BCB as established by our second theorem:

Theorem 2. Consider a PLRNN of the form (2) with parameters $\theta = \{A, W, h\}$.
Assume that it has a stable fixed point or k-cycle $\Gamma_k$ ($k \ge 1$) with $B_{\Gamma_k}$ as its basin of attraction. If $\Gamma_k$ undergoes a border collision bifurcation (BCB) for some parameter value $\theta = \theta_0 \in \theta$, then the gradient of the loss function, $\frac{\partial \mathcal{L}_t}{\partial \theta}$, vanishes at $\theta = \theta_0$ for every $z_1 \in B_{\Gamma_k}$, i.e. $\lim_{\theta \to \theta_0} \frac{\partial \mathcal{L}_t}{\partial \theta} = 0$.

Proof. See Appx. A.2.2.

Figure 1: A) Analytically calculated stability regions for a 2-cycle ($S_{RL}$, red), a 3-cycle ($S_{RL^2}$, blue), and their intersection (yellow) in the $(a_{11}, a_{11}+w_{11})$-parameter plane for the system eq. (3) with $a_{22} = 0.2$, $w_{21} = 0.5$. B) Same as determined by SCYFI, with bifurcation curves bordering the stability regions labeled by the type of bifurcation (DTB = Degenerate Transcritical Bifurcation, BCB = Border Collision Bifurcation, DFB = Degenerate Flip Bifurcation). C) Bifurcation graph (showing the stable cyclic points in the $z_1$ coordinate) along the cross-section in A indicated by the gray line, illustrating the different types of bifurcation encountered when moving in and out of the various stability regions in A. D) State space at the point denoted D in A (for $a_{11} = 0.253$, $a_{11} + w_{11} = -2.83$), where the 2-cycle (red) and 3-cycle (blue) co-exist for the same parameter settings, with their corresponding basins of attraction indicated by lighter colors.

Corollary 1. Assume that the PLRNN (2) has a stable fixed point $\Gamma_1$ with $B_{\Gamma_1}$ as its basin of attraction. If $\Gamma_1$ undergoes a degenerate flip bifurcation (DFB) for some parameter value $\theta = \theta_0 \in \theta$, then this will always coincide with a BCB of a 2-cycle, and as a result $\lim_{\theta \to \theta_0} \frac{\partial \mathcal{L}_t}{\partial \theta} = 0$ for every $z_1 \in B_{\Gamma_1}$.

Proof. See Appx. A.2.3.

Hence, certain bifurcations will inevitably cause gradients to suddenly explode or vanish, and often induce abrupt jumps in the loss (see Appx. A.3.2 for when this will happen for a BCB). We emphasize that these results are general and hold for systems of any dimension, as well as in the presence of inputs. Since inputs do not affect the Jacobians in eqns. (70), (71) and (76), they do not change the theorems (even if they would affect the Jacobians, Theorem 2 would be unaltered, and Theorem 1 could be amended in a straightforward way). Furthermore, since we are addressing bifurcations that occur during model training, from this angle inputs may simply be treated as either additional parameters (if piecewise constant) or states of the system (without changing any of the mathematical derivations). In fact, mathematically, any non-autonomous dynamical system (RNN with inputs) can always and strictly be reformulated as an autonomous system (RNN without inputs), see [3, 47, 66].

4 Heuristic algorithm for finding PLRNN bifurcation manifolds

4.1 Searcher for fixed points and cycles (SCYFI): motivation and validation

In sect. 3.1 and Appx. A.1.2-A.1.4 we derived existence and stability regions for fixed points and low-order ($k \le 3$) cycles in 2d PLRNNs with specific parameter constraints analytically. For higher-order cycles and higher-dimensional PLRNNs (or any other ReLU-type RNN) this is no longer feasible due to the combinatorial explosion in the number of subregions that need to be considered as M and k increase. Here we therefore introduce an efficient search algorithm for finding all k-cycles of a given PLRNN, which we call Searcher for Cycles and Fixed points: SCYFI (Algorithm 1). Once all k-cycles ($k \ge 1$) have been detected on some parameter grid, the stability/existence regions of these objects and thereby the bifurcation manifolds can be determined. k-cycles were defined in sect.
3.1, and for the PLRNN, eq. (2), are given by the set of k-periodic points $\{z_1^*, \dots, z_l^*, \dots, z_k^*\}$, where

$$z_k^* = \Big( \mathbb{1} - \prod_{r=0}^{k-1} W_{\Omega(k-r)} \Big)^{-1} \Big( \sum_{j=1}^{k-1} \prod_{r=0}^{j-1} W_{\Omega(k-r)} + \mathbb{1} \Big)\, h, \qquad (4)$$

if $\mathbb{1} - \prod_{r=0}^{k-1} W_{\Omega(k-r)}$ is invertible (if not, we are dealing with a bifurcation or a continuous set of fixed points). The other periodic points are $z_l^* = F^l(z_k^*)$, $l = 1, \dots, k-1$, with corresponding matrices $W_{\Omega(l)} = A + W D_l$. Now, if the diagonal entries in $D_l$ are consistent with the signs of the corresponding states $z_{ml}^*$, i.e. if $d_{mm}^{(l)} = 1$ if $z_{ml}^* > 0$ and $d_{mm}^{(l)} = 0$ otherwise for all l, then $\{z_1^*, \dots, z_k^*\}$ is a true cycle of eq. (2); otherwise we call it virtual. To find a k-cycle, since an M-dimensional PLRNN harbors $2^M$ different linear sub-regions, there are approximately $2^{Mk}$ different combinations of configurations of the matrices $D_l$, $l = 1 \dots k$, to consider (strictly, mathematical constraints rule out some of these possibilities, e.g. not all periodic points can lie within the same sub-region/orthant). Clearly, for higher-dimensional PLRNNs and higher cycle orders, exhaustively searching this space becomes unfeasible. Instead, we found that the following heuristic works surprisingly well: First, for some order k and a random initialization of the matrices $D_l$, $l = 1 \dots k$, generate a cycle candidate by solving eq. (4). If each of the points $z_l^*$, $l = 1 \dots k$, is consistent with the diagonal entries in the corresponding matrix $D_l$, $l = 1 \dots k$, and none of them is already in the current library of cyclic points, then a true k-cycle has been identified; otherwise the cycle is virtual (or a super-set of lower-order cycles). We discovered that the search becomes extremely efficient, without the need to exhaustively consider all configurations, if a new search loop is re-initialized at the last visited virtual cyclic point (i.e., all inconsistent entries $d_{mm}^{(l)}$ in the matrices $D_l$, $l = 1 \dots k$, are flipped, $d_{mm}^{(l)} \leftarrow 1 - d_{mm}^{(l)}$, to bring them into agreement with the signs of the solution points, $z_l^* = [z_{ml}^*]$, of eq. (4), thus yielding the next initial configuration). It is straightforward to see that this procedure almost surely converges if $N_{out}$ is chosen large enough, see Appx. A.2.4. The whole procedure is formalized in Algorithm 1, and the code is available at https://github.com/DurstewitzLab/SCYFI.

Algorithm 1 SCYFI (run iteratively with $k = 1 \dots K_{max}$, with $K_{max}$ the max. order of cycles tested)
Input: PLRNN parameters A, W, h; $L = \{L_n\}_{n=1}^{k-1}$: collection of all sets $L_n = \{\{z_l^{(m)}\}_{l=1}^{n}\}_{m=1}^{M_n}$ of all lower-order n-cycles discovered so far, where $M_n$ is the number of found cycles $\{z_l^{(m)}\}_{l=1}^{n}$ of order n, with corresponding ReLU derivative matrices $\{D_l^{(m)}\}_{l=1}^{n}$.
Parameters: $N_{out}$: max. number of random initialisations; $N_{in}$: max. number of iterations
Output: $L \cup L_k$; $L_k$: set of all discovered k-cycles
1: $L_k = \{\}$
2: $i \leftarrow 0$
3: while $i < N_{out}$ ... do
4:   Select k subregions $D_{init} = \{D_1, D_2, \dots, D_k\}$ at random with replacement
5:   $c \leftarrow 0$
6:   while $c < N_{in}$ ... do
7:     Solve eq. (4) for a cycle candidate $\{z_l^*\}_{l=1}^{k}$ with $W_{\Omega(k-r)}$ as defined in eq. (2) based on $D_{init}$
8:     Determine $\{D_l^*\}_{l=1}^{k}$ based on the signs of the corresponding components of $\{z_l^*\}_{l=1}^{k}$
9:     if $D_{init} = \{D_l^*\}_{l=1}^{k}$ (self-consistency) and $\forall\, 1 \le s \le k$, $\forall\, \{z_l^{(m)}\}_{l=1}^{k-s+1} \in L_{k-s+1}$: $\{z_l^{(m)}\}_{l=1}^{k-s+1} \not\subset \{z_l^*\}_{l=1}^{k}$ then
10:      $L_k \leftarrow L_k \cup \{\{z_l^*\}_{l=1}^{k}\}$
11:      $i \leftarrow 0$, $c \leftarrow 0$
12:    else
13:      $D_{init} \leftarrow \{D_l^*\}_{l=1}^{k}$
14:    end if
15:    $c \leftarrow c + 1$
16:  end while
17:  $i \leftarrow i + 1$
18: end while

To validate the algorithm, we can compare analytical solutions as derived in sect. 3.1 to the output of the algorithm.
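The core of this heuristic is compact enough to sketch directly. Below is a minimal, illustrative NumPy re-implementation of the inner search loop, not the reference Julia code linked above; helper names and the simplified duplicate check are ours, and the full Algorithm 1 additionally excludes candidates that merely contain lower-order cycles.

```python
import numpy as np

def region(z):
    """Indicator matrix D_Omega for the linear sub-region containing z."""
    return np.diag((z > 0).astype(float))

def scyfi_search(A, W, h, k, n_out=200, n_in=50, rng=None):
    """Minimal sketch of the SCYFI search loop (Algorithm 1) for one cycle order k.
    Returns a list of true k-cycles, each as a list of its k periodic points."""
    rng = np.random.default_rng() if rng is None else rng
    M, found = len(h), []
    for _ in range(n_out):                               # random (re-)initializations
        Ds = [np.diag(rng.integers(0, 2, M).astype(float)) for _ in range(k)]
        for _ in range(n_in):                            # deterministic refinement loop
            # compose the k affine pieces z -> (A + W D_l) z + h into z -> P z + b
            P, b = np.eye(M), np.zeros(M)
            for D in Ds:
                P, b = (A + W @ D) @ P, (A + W @ D) @ b + h
            try:
                z0 = np.linalg.solve(np.eye(M) - P, b)   # cycle candidate, cf. eq. (4)
            except np.linalg.LinAlgError:
                break                                     # singular: bifurcation point
            orbit = [z0]
            for D in Ds[:-1]:                             # remaining candidate cycle points
                orbit.append((A + W @ D) @ orbit[-1] + h)
            Ds_new = [region(p) for p in orbit]           # regions actually visited
            if all(np.allclose(D, Dn) for D, Dn in zip(Ds, Ds_new)):
                distinct = len({tuple(np.round(p, 6)) for p in orbit}) == k
                known = any(np.allclose(orbit[0], q) for c in found for q in c)
                if distinct and not known:
                    found.append(orbit)                   # self-consistent -> true k-cycle
                break                                      # consistent: nothing left to flip
            Ds = Ds_new                                    # flip inconsistent entries, retry
    return found
```

Applied to the 2d example above with k = 2, this recovers the stable 2-cycle found there by forward iteration.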
To delineate all existence and stability regions, the algorithm searches for all k-cycles up to some maximum order K along a fine grid across the $(a_{11}, a_{11}+w_{11})$-parameter plane. A bifurcation happens whenever between two grid points a cycle appears, disappears, or changes stability (as determined from the eigenvalue spectrum of the respective kth-order Jacobian). The results of this procedure are shown in Fig. 1B, illustrating that the analytical solutions for existence and stability regions precisely overlap with those identified by Algorithm 1 (see also Fig. S7).

4.2 Numerical and theoretical results on SCYFI's scaling behavior

Because of the combinatorial nature of the problem, it is generally not feasible to obtain ground-truth settings in higher dimensions for SCYFI to compare to. To nevertheless assess its scaling behavior, we therefore studied two specific scenarios. For an exhaustive search, the expected and median numbers of linear subregions n until an object of interest (fixed point or cycle) is found, i.e. the number of $\{D_{1:k}\}$ constellations that need to be inspected until the first hit, are given by

$$\mathbb{E}[n] = \frac{N+1}{m+1} = \frac{2^{Mk}+1}{m+1}, \qquad \tilde{n} = \min\Big\{ n : P(\text{first hit within } n \text{ draws}) \ge \tfrac{1}{2} \Big\},$$

with m being the number of existing k-cycles and N the total number of combinations, as shown in Ahlgren [1] (assuming no prior knowledge about the mathematical limitations when drawing regions). The median $\tilde{n}$ as well as the actual median number of to-be-searched combinations required by SCYFI to find at least one k-cycle is given for low-dimensional systems in Fig. 2A as a function of cycle order k, and the latter can be seen to be surprisingly linear, as confirmed by linear regression fits to the data (see Fig. 2 legend for details). To assess scaling as a function of dimensionality M, we explicitly constructed systems with one known fixed point (see Appx. A.3.1 for details) and determined the number n of subregions required by SCYFI to detect this embedded fixed point (Fig. 2B). In general, the scaling depended on the system's eigenspectrum, but for reasonable scenarios was polynomial or even sublinear (Fig. 2B, see also Fig. S8). In either case, the number of required SCYFI iterations scaled much more favorably than would be expected from an exhaustive search.

Figure 2: A) Number of linear subregions n searched until at least one cycle of order k was found by SCYFI (blue) vs. the median number $\tilde{n}$ an exhaustive search would take by randomly drawing combinations without replacement (black) as a function of cycle order (M = 2 fixed). Each data point represents the median of 50 different initializations across 5 different PLRNN models. Error bars = median absolute deviation. Linear regression fit using weighted least-squares (R² ≈ 0.998, p < 10⁻³⁰). B) Number of linear subregions n searched until a specific fixed point was found as a function of dimensionality M for different eigenvalue spectra (see Appx. A.3.1 for details).

How could this surprisingly good scaling behavior be explained? As shown numerically in Fig. S9, when we initiate SCYFI in different randomly selected linear subregions, it converges to the subregions including the dynamical objects of interest exponentially fast, offsetting the combinatorial explosion. A more specific and stronger theoretical result about SCYFI's convergence speed can be obtained under certain conditions on the parameters (which agrees nicely with the numerical results in Fig. 2).
It rests on the observation that SCYFI is designed to move only among subregions containing virtual or actual fixed points or cycles, based on the fact that it is always reinitialized with the next virtual fixed (cyclic) point in case the consistency check fails. The result can be stated as follows:

Theorem 3. Consider a PLRNN of the form (2) with parameters $\theta = \{A, W, h\}$. Under certain conditions on $\theta$ (for which $\|A + W\| < 1$), SCYFI will converge in at most linear time.

Proof. See Appx. A.2.5.

5 Loss landscapes and bifurcation curves

5.1 Bifurcations and loss jumps in training

Figure 3: A) Logarithm of the gradient norm of the loss in PLRNN parameter space, with ground-truth parameters centered at (0, 0). Superimposed in yellow are bifurcation curves computed by SCYFI, and in green two examples of training trajectories from different parameter initial conditions (indicated by the stars). Red dots indicate the bifurcation crossing time points shown in C & D. B) Relief plot of the loss landscape from A to highlight the differences in loss altitude associated with the bifurcations. C) Loss during the training run represented by the green trajectory labeled C in A and B. Red dot indicates the time point of bifurcation crossing corresponding to the red dot in A and B. D) Same for trajectory labeled D in A and B.

Fig. 3 provides a 2d toy example illustrating the tight association between the loss landscape and bifurcation curves, as determined through SCYFI, for a PLRNN trained by BPTT on reproducing a specific 16-cycle. Fig. 3A depicts a contour plot of the gradient norms with overlaid bifurcation curves in yellow, while Fig. 3B shows the MSE loss landscape as a relief for better appreciation of the sharp changes in loss height associated with the bifurcation curves. Shown in green are two trajectories from two different parameter initializations traced out during PLRNN training in parameter space, where training was confined to only those two parameters given in the graphs (i.e., all other PLRNN parameters were kept fixed during training for the purpose of this illustration). As confirmed in Fig. 3C & D, as soon as the training trajectory crosses the bifurcation curves in parameter space, a huge jump in the loss associated with a sudden increase in the gradient norm occurs. This illustrates empirically and graphically the theoretical results derived in sect. 3.

Next we illustrate the application of SCYFI on a real-world example, learning the behavior of a rodent spiking cortical neuron observed through time series measurements of its membrane potential (note that spiking is a highly nonlinear behavior involving fast within-spike and much slower between-spike time scales). For this, we constructed a 6-dimensional delay embedding of the membrane voltage [53, 28], and trained a PLRNN with one hidden layer (cf. eq. 6) using BPTT with sparse teacher forcing (STF) [37] to approximate the dynamics of the spiking neuron (see Appx. A.3.2 for a similar analysis on a biophysical neuron model). With M = 6 latent states and H = 20 hidden dimensions, the trained PLRNN comprises $2^{20}$ different linear subregions and $|\theta| = 272$ parameters, much higher-dimensional than the toy example considered above. Fig. 4A gives the MAE loss as a function of training epoch (i.e., single SGD updates), while Figs. 4B & C illustrate the well-trained behavior in time (Fig. 4B) and in a 2-dimensional projection of the model's state space obtained by PCA (Fig. 4C). The loss curve exhibits several steep jumps.
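Such jump epochs can be flagged automatically before inspecting the corresponding parameter checkpoints; a trivial, illustrative way (not part of the paper's method; the threshold factor is an arbitrary choice) is to threshold the per-epoch loss differences:

```python
import numpy as np

def loss_jump_epochs(losses, factor=5.0):
    """Flag epochs at which the loss jumps abruptly relative to the typical
    epoch-to-epoch fluctuation; these are the candidate bifurcation events
    that one would then inspect with SCYFI on the saved parameter checkpoints."""
    d = np.diff(np.asarray(losses, dtype=float))
    scale = np.median(np.abs(d)) + 1e-12        # robust estimate of normal fluctuations
    idx = np.where(d > factor * scale)[0] + 1   # epoch index right after each jump
    return idx, d[idx - 1]                      # epochs and jump sizes

# usage with a recorded training history:
# jump_epochs, jump_sizes = loss_jump_epochs(losses)
```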
Zooming into one of these regions (Fig. 4A; indicated by the red box) and examining the transitions in parameter space using SCYFI, we find they are indeed produced by bifurcations, with an example given in Fig. 4D. Note that we are now dealing with high-dimensional state and parameter spaces, such that visualization of results becomes tricky. For the bifurcation diagram in Fig. 4D we therefore projected all extracted k-cycles ($k \ge 1$) onto a line given by the PCA-derived maximum eigenvalue component, and plotted this as a function of training epoch (a schematic recipe for assembling such a diagram from parameter checkpoints is sketched in code further below). Of course, very many of the PLRNN parameters may change from one epoch to the next. Since SCYFI extracts all k-cycles and their eigenvalue spectrum, we can also determine the type of bifurcation that caused the jump. While before the loss jump the PLRNN already produced time series quite similar to those of the physiologically recorded cell (Fig. 4E), a DTB (cf. Theorem 1) produced catastrophic forgetting of the learned behavior, with the PLRNN's states suddenly diverging to minus infinity (Fig. 4F; Fig. S10 also provides an example of a BCB during PLRNN training, and Fig. S11 an example of a DFB). This illustrates how SCYFI can be used to analyze the training process with respect to bifurcation events also for high-dimensional real-world examples, as well as the behavior of the trained model (Fig. 4C).

Figure 4: A) Loss across training epochs for a PLRNN with one hidden layer trained on electrophysiological recordings from a cortical neuron. Red box zooms in on one of the training phases with a huge loss jump, caused by a DTB. Letters refer to selected training epochs in other subpanels. B) Time series of true (gray) and PLRNN-simulated (black) membrane potential in the well-trained regime (see A). C) All fixed points and cycles discovered by SCYFI for the well-trained model in state space, projected onto the first two principal components using PCA. Filled circles represent stable and open circles unstable objects. The stable 39-cycle corresponds to the spiking behavior. D) Bifurcation diagram of the PLRNN as a function of training epoch around the loss peak in A. Locations of stable (filled circles) and unstable (open circles) objects projected onto the first principal component. E) Model behavior as in B shortly before the DTB and associated loss jump (from the epoch indicated in A, D). F) Model behavior as in B right around the DTB (diverging to −∞).

5.2 Implications for designing training algorithms

What are potential take-homes of the results in sects. 3.2 & 5.1 for designing RNN training algorithms? One possibility is to design smart initialization or training procedures that aim to place or push an RNN into the right topological regime by taking big leaps in parameter space whenever the current regime is not fit for the data, rather than dwelling within a wrong regime for too long. These ideas are discussed in a bit more depth in Appx. A.3.3, with a proof of concept in Fig. S12.

Figure 5: Example loss curves during training a PLRNN (M = 10) on a 2d cycle using gradient descent, once without GTF (α = 0, blue curve) but with gradient clipping, and once with GTF (α = 0.1). Note that without GTF there are several sharp loss jumps associated with bifurcations in the PLRNN parameters, while activating GTF leads to a smooth loss curve avoiding bifurcations. Note: For direct comparability both loss curves were cut off at 4 and then scaled to [0, 1]. The absolute loss is much lower for GTF.
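The following sketch outlines how a bifurcation diagram as in Fig. 4D can be assembled from parameter checkpoints saved during training, re-using the scyfi_search and cycle_jacobian helpers from the earlier sketches; it is illustrative only and not the analysis code used for the figure.

```python
import numpy as np

def bifurcation_diagram(checkpoints, states_for_pca, k_max=3):
    """For each saved parameter set (A, W, h), collect all periodic points of cycles
    up to order k_max and project them onto the first principal component of the
    model's states; plotting column 1 vs. column 0 gives a diagram like Fig. 4D."""
    X = states_for_pca - states_for_pca.mean(axis=0)
    pc1 = np.linalg.svd(X, full_matrices=False)[2][0]     # first principal direction
    rows = []                                              # (epoch, projection, max |lambda|)
    for epoch, (A, W, h) in enumerate(checkpoints):
        for k in range(1, k_max + 1):
            for cycle in scyfi_search(A, W, h, k):         # from the sketch above
                J = cycle_jacobian(cycle, A, W)            # cycle stability
                lam_max = np.abs(np.linalg.eigvals(J)).max()
                for z in cycle:
                    rows.append((epoch, float(pc1 @ z), lam_max))
    return np.array(rows)                                  # lam_max < 1 marks stable objects
```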
More importantly, however, we discovered that the recently proposed technique of generalized teacher forcing (GTF) [24] tends to circumvent bifurcations in RNN training altogether, leading to much faster convergence as illustrated in Fig. 5. The way this works is that GTF, by trading off forward-iterated RNN latent states against data-inferred states according to a specific annealing schedule during training (see Appx. A.2.6), tends to pull the RNN directly into the right dynamical regime. In fact, for DTBs we can strictly prove that these will never occur in PLRNN training with the right adjustment of the GTF parameter:

Theorem 4. Consider a PLRNN of the form (2) with parameters $\theta = \{A, W, h\}$. Assume that it has a stable fixed point or k-cycle $\Gamma_k$ ($k \ge 1$) that undergoes a degenerate transcritical bifurcation (DTB) for some parameter value $\theta = \theta_0 \in \theta$. (i) If $\|A + W\| \le 1$, then for any GTF parameter $0 < \alpha < 1$, GTF controls the system, avoiding a DTB and, hence, gradient divergence at $\theta_0$. (ii) If $\|A + W\| = r > 1$, then for any $1 - \frac{1}{r} < \alpha < 1$, GTF prevents a DTB and, hence, gradient divergence at $\theta_0$.

Proof. See Appx. A.2.6.

As this example illustrates, we may be able to amend training procedures such as to avoid specific types of bifurcations (the GTF mechanism is sketched in code below).

6 Discussion

DS theory [3, 47, 58] is increasingly appreciated in the ML/AI community as a powerful mathematical framework for understanding both the training process of ML models [49, 54, 44, 16] as well as the behavior of trained models [62, 33, 61]. While the latter is generally useful for understanding how a trained RNN performs a given ML task, with prospects of improving found solutions, it is in fact imperative in areas like science or medicine, where excavating the dynamical behavior and repertoire of trained models yields direct insight into the underlying physical, biological, or medical processes the model is supposed to capture. However, application of DS theory is often not straightforward, especially when dealing with higher-dimensional systems, and commonly requires numerical routines that may find only some of the dynamical objects of interest, and also only approximate solutions. One central contribution of the present work therefore was the design of a novel algorithm, SCYFI, that can exactly locate fixed points and cycles of a wide class of ReLU-based RNNs. This provides an efficient instrument for the DS analysis of trained models, supporting their interpretability and explainability. A surprising observation was that SCYFI often finds cycles in only linear time, despite the combinatorial nature of the problem, a feature shared with the famous Simplex algorithm for solving linear programming tasks [36, 32, 57]. While we discovered numerically that SCYFI converges surprisingly fast for empirically relevant scenarios, deriving strict theoretical guarantees is hard, and so far we could establish stronger theoretical results on its convergence properties only under specific assumptions on the RNN parameters. Further theoretical work is therefore necessary to precisely understand why the algorithm works so effectively. In this work we applied SCYFI to illuminate the training process itself. Since RNNs are themselves DS, they are subject to different forms of bifurcations during training as their parameters are varied under the action of a training algorithm (similar considerations may apply to very deep NNs).
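As a concrete illustration of the GTF mechanism referenced in Theorem 4: the sketch below assumes the convex-combination form of GTF from [24], in which each forward-iterated latent state is replaced by $\tilde{z}_t = \alpha d_t + (1-\alpha) z_t$ with data-inferred states $d_t$; the annealing of $\alpha$ and all estimation details are omitted, so this is a schematic rather than the reference implementation.

```python
import numpy as np

def gtf_rollout(A, W, h, inferred_states, alpha):
    """Schematic GTF forward pass: after each model step the latent state is pulled
    toward the data-inferred state d_t via the convex combination
    z_tilde = alpha * d_t + (1 - alpha) * z_t  (alpha = 0 recovers free-running BPTT).
    The Jacobian between successive forced states is thereby scaled by (1 - alpha),
    which is the intuition behind Theorem 4."""
    z = inferred_states[0].copy()
    preds = []
    for d in inferred_states[1:]:
        D = np.diag((z > 0).astype(float))
        z = (A + W @ D) @ z + h                 # PLRNN step, eq. (2)
        preds.append(z)
        z = alpha * d + (1.0 - alpha) * z       # generalized teacher forcing
    return np.stack(preds)

# schematic use inside a training loop (gradients would come from an autodiff framework):
# preds = gtf_rollout(A, W, h, inferred_states, alpha=0.1)
# loss = np.mean((preds - targets) ** 2)
```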
It has been recognized for some time that bifurcations in RNN training may give rise to sudden jumps in the loss [44, 16], but the phenomenon has rarely been treated more systematically and mathematically. Another major contribution of this work thus was to formally prove a strict connection between three types of bifurcations and abrupt changes in the gradient norms, and to use SCYFI to further reveal such events during PLRNN training on various example systems. There are numerous other types of bifurcations (e.g., center bifurcations, Hopf bifurcations etc.) that are likely to impact gradients during training, only for a subset of which we could provide formal proofs here. As we have demonstrated, understanding the topological and bifurcation landscape of RNNs could help improve training algorithms and provide insights into their working. Hence, a more general understanding of how various types of bifurcation affect the training process in a diverse range of RNN architectures is a promising future avenue not only for our theoretical understandings of RNNs, but also for guiding future algorithm design. Acknowledgements This work was supported by the German Research Foundation (DFG) through individual grants Du 354/10-1 & Du 354/15-1 to DD, within research cluster FOR-5159 ( Resolving prefrontal flexibility ; Du 354/14-1), and through the Excellence Strategy EXC 2181/1 390900948 (STRUCTURES). We also thank Mahasheta Patra for lending us code for graphing analytically derived bifurcation diagrams and state spaces. [1] J. Ahlgren. The probability distribution for draws until first success without replacement. ar Xiv preprint ar Xiv:1404.1161, 2014. [2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [3] Kathleen T. Alligood, Tim D. Sauer, and James A. Yorke. Chaos: An Introduction to Dynamical Systems. Springer, New York, NY, 1996. [4] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948 1024. PMLR, 2022. [5] V. Avrutin, M. Schanz, and S. Banerjee. Occurrence of multiple attractor bifurcations in the two-dimensional piecewise linear normal form map. Nonlinear Dynamics, 67:293 307, 2012. [6] V. Avrutin, L. Gardini, I. Sushko, and F. Tramontana. Continuous and Discontinuous Piecewise Smooth One-Dimensional Maps: Invariant Sets and Bifurcation structures, volume 1. World Scientific, Singapore, 2019. [7] S. Banerjee, MS. Karthik, G. Yuan, and J. A. Yorke. Bifurcations in one-dimensional piecewise smooth maps-theory and applications in switching circuits. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 47(3):389 394, 2000. [8] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157 166, 1994. doi: 10.1109/72.279181. [9] S.h Bouktif, A. Fiaz, A. Ouni, and M. A. Serhani. Multi-sequence lstm-rnn deep learning and metaheuristics for electric load forecasting. Energies, 13(2), 2020. ISSN 1996-1073. doi: 10.3390/en13020391. [10] M. Brenner, F. Hess, J. M. Mikhaeil, L. F. Bereska, Z. Monfared, P. Kuo, and D. Durstewitz. Tractable Dendritic RNNs for Reconstructing Nonlinear Dynamical Systems. 
In Proceedings of the 39th International Conference on Machine Learning, pages 2292 2320. PMLR, June 2022. ISSN: 2640-3498. [11] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932 3937, 2016. [12] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445 22451, 2019. doi: 10.1073/pnas.1906995116. [13] V. Chatziafratis, S. G. Nagarajan, and I. Panageas. Better depth-width trade-offs for neural networks through the lens of dynamical systems. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1469 1478. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/v119/chatziafratis20a.html. [14] V. Chatziafratis, I. Panageas, and S. Sanford, C.and Stavroulakis. On scrambling phenomena for randomly initialized recurrent networks. Advances in Neural Information Processing Systems, 35:18501 18513, 2022. [15] B. M. de Silva, K. Champion, M. Quade, J. Loiseau, J. N. Kutz, and . L. Brunton. Pysindy: A python package for the sparse identification of nonlinear dynamics from data, 2020. URL https://arxiv.org/abs/2004.08424. [16] K. Doya. Bifurcations in the learning of recurrent neural networks. Proceedings 1992 IEEE International Symposium on Circuits and Systems, 6:2777 2780, 1992. [17] D. Durstewitz. Implications of synaptic biophysics for recurrent network dynamics and active memory. Neural Networks, 22(8):1189 1200, 2009. ISSN 0893-6080. doi: https://doi.org/ 10.1016/j.neunet.2009.07.016. URL https://www.sciencedirect.com/science/ article/pii/S0893608009001622. Cortical Microcircuits. [18] D. Durstewitz. A state space approach for piecewiselinear recurrent neural networks for reconstructing nonlinear dynamics from neural measurements. PLo S Computational Biology, 13(6), 2017. [19] A. Ganguli and S. Banerjee. Dangerous bifurcation at border collision: When does it occur? Physical Review E, 71, 2005. [20] L. Gardini, D. Fournier-Prunaret, and P. Chargé. Border collision bifurcations in a twodimensional piecewise smooth map from a simple switching circuit. Chaos: An Interdisciplinary Journal of Nonlinear Science, 21, 2011. [21] M. Golub and D. Sussillo. Fixed Point Finder: A Tensorflow toolbox for identifying and characterizing fixed points in recurrent neural networks. Journal of Open Source Software, 3 (31):1003, November 2018. ISSN 2475-9066. [22] R. Haschke and J. J. Steil. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64:25 38, March 2005. ISSN 0925-2312. doi: 10.1016/j.neucom.2004.11. 030. [23] D.l Hernandez, A. K. Moretti, Z.g Wei, S. Saxena, J. Cunningham, and L. Paninski. Nonlinear Evolution via Spatially-Dependent Linear Dynamics for Electrophysiology and Calcium Data, June 2020. ar Xiv:1811.02459 [cs, q-bio, stat]. [24] F. Hess, Z. Monfared, M. Brenner, and D. Durstewitz. Generalized teacher forcing for learning chaotic dynamics. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13017 13049. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/v202/hess23a.html. [25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8): 1735 1780, 1997. [26] T. 
Hossen, A. S. Nair, R. A. Chinnathambi, and P. Ranganathan. Residential load forecasting using deep neural networks (dnn). In 2018 North American Power Symposium (NAPS), pages 1 5, 2018. doi: 10.1109/NAPS.2018.8600549. [27] S. Kanai, Y. Fujiwara, and S. Iwamura. Preventing Gradient Explosions in Gated Recurrent Units. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [28] H. Kantz and T. Schreiber. Nonlinear time series analysis, volume 7. Cambridge university press, 2004. [29] G. E. Katz and J. A. Reggia. Using Directional Fibers to Locate Fixed Points of Recurrent Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3636 3646, August 2018. ISSN 2162-2388. doi: 10.1109/TNNLS.2017.2733544. Conference Name: IEEE Transactions on Neural Networks and Learning Systems. [30] G. Koppe, H . Toutounji, P. Kirsch, S. Lis, and D. Durstewitz. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fmri. PLOS Computational Biology, 15(8):1 35, 2019. [31] T. Lang and M. Rettenmeier. Understanding consumer behavior with recurrent neural networks. 2017. [32] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer International Publishing, 2016. doi: doi:10.1007/978-3-319-18842-3. [33] N. Maheswaranathan and D. Sussillo. How recurrent networks implement contextual processing in sentiment analysis. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6608 6619. PMLR, 13 18 Jul 2020. [34] N. Maheswaranathan, AH. Williams, MD. Golub, S. Ganguli, and D. Sussillo. Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. Advances in neural information processing systemsv, 32:15696 15705, 2019. [35] R Marichal, JD Piñeiro, and E González. Study of fold bifurcation in a discrete recurrent neural network. In Proceedings of the World Congress on Engineering and Computer Science, volume 2, 2009. [36] N. Megiddo. On the complexity of linear programming, page 225 268. Econometric Society Monographs. Cambridge University Press, 1987. doi: 10.1017/CCOL0521340446.006. [37] J. M. Mikhaeil, Z. Monfared, and D. Durstewitz. On the difficulty of learning chaotic dynamics with RNNs. In Advances in Neural Information Processing Systems, 2022. [38] Z. Monfared and D. Durstewitz. Existence of n-cycles and border-collision bifurcations in piecewise-linear continuous maps with applications to recurrent neural networks. Nonlinear Dynamics, 101(2):1037 1052, 2020. [39] Z. Monfared and D. Durstewitz. Transformation of Re LU-based recurrent neural networks from discrete-time to continuous-time. In International Conference on Machine Learning, pages 6999 7009. PMLR, 2020. ISSN: 2640-3498. [40] Z. Monfared, Z. Dadi, N. Miladi Lari, and Z. Afsharnezhad. Existence and nonexistence of periodic solution and hopf bifurcation of a tourism-based social-ecological system. Optik, 127 (22):10908 10918, 2016. [41] H. E. Nusse and J. A. Yorke. Border-collision bifurcations including period two to period three for piecewise smooth systems. Physica D: Nonlinear Phenomena, 57(1-2):39 57, 1992. [42] H.E. Nusse and J.A. Yorke. Border-collision bifurcations for piecewise smooth one dimensional maps. Int. J. Bifurc. Chaos, 5(1):189 207, 1995. [43] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi. Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture. 
In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1672 1678, 2018. doi: 10.1109/IVS.2018.8500658. [44] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310 1318, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. [45] D. Patel and E. Ott. Using machine learning to anticipate tipping points and extrapolate to post-tipping dynamics of non-stationary dynamical systems, 2022. URL https://arxiv. org/abs/2207.00521. [46] J. Pathak, B. Hunt, M. Girvan, Z.n Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett., 120:024102, Jan 2018. doi: 10.1103/Phys Rev Lett.120.024102. [47] L. Perko. Differential Equations and Dynamical Systems. Springer New York, NY, 2001. [48] P. Razvan. On Recurrent and Deep Neural Networks. Ph D thesis, 2014. Université de Montréal. [49] A. Rehmer and A. Kroll. The effect of the forget gate on bifurcation boundaries and dynamics in Recurrent Neural Networks and its implications for gradient-based optimization. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 01 08, July 2022. doi: 10.1109/IJCNN55064.2022.9892458. ISSN: 2161-4407. [50] A. H. Ribeiro, K. Tiels, Lu. A. Aguirre, and T. Schön. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 2370 2380. PMLR, June 2020. URL https://proceedings.mlr.press/v108/ribeiro20a.html. ISSN: 2640-3498. [51] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017. doi: 10.1126/sciadv.1602614. [52] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533 536, 1986. [53] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. Journal of statistical Physics, 65:579 616, 1991. [54] A. M. Saxe, J. L. Mc Clelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013. URL https://arxiv.org/abs/1312. 6120. [55] D. Schmidt, G. Koppe, Z. Monfared, M. Beutelspacher, and D. Durstewitz. Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies. In proceedings of the Ninth International Conference on Learning Representations, ICLR, 2021. [56] J. Smith, S. Linderman, and D. Sussillo. Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems. In Advances in Neural Information Processing Systems, volume 34, pages 16700 16713. Curran Associates, Inc., 2021. [57] D. A. Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time, 2001. URL https://arxiv.org/abs/cs/0111050. [58] S. H. Strogatz. Nonlinear Dynamics and Chaos: Applications to Physics, Biology, Chemistry, and Engineering: With Applications to Physics, Biology, Chemistry and Engineering. CRC Press, 2015. [59] I. Sushko and L. Gardini. Center bifurcation for a two-dimensional piecewise linear map, in business cycles dynamics. Models and Tools, eds. Puu, T. & Sushko, I. (Springer-Verlag, NY), pages 49 78, 2006. [60] I. Sushko and L. Gardini. Center bifurcation for two-dimensional border-collision normal form. 
International Journal of Bifurcation and Chaos, 18:1029–1050, 2008. [61] D. Sussillo and O. Barak. Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks. Neural Computation, 25(3):626–649, March 2013. ISSN 0899-7667, 1530-888X. doi: 10.1162/NECO_a_00409. [62] E. Turner, K. V. Dabholkar, and O. Barak. Charting and navigating the space of solutions for recurrent neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 25320–25333. Curran Associates, Inc., 2021. [63] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos. Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844, 2018. doi: 10.1098/rspa.2017.0844. [64] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 126:191–217, 2020. [65] C. Wang, D. Han, Q. Liu, and S. Luo. A deep learning approach for credit scoring of peer-to-peer lending using attention mechanism LSTM. IEEE Access, 7:2161–2168, 2019. doi: 10.1109/ACCESS.2018.2887138. [66] H. Zhang, D. Liu, and Z. Wang. Controlling chaos: suppression, synchronization and chaotification. Springer Science & Business Media, 2009. [67] Xun Zheng, Manzil Zaheer, Amr Ahmed, Yuan Wang, Eric P. Xing, and Alexander J. Smola. State space LSTM models with particle MCMC inference. CoRR, abs/1711.11179, 2017. URL http://arxiv.org/abs/1711.11179. [68] X. Zhu, Z. Wang, X. Wang, M. Zhou, and R. Ge. Understanding edge-of-stability training dynamics with a minimalist example. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=p7EagBsMAEO.

A.1 Analysis of bifurcation curves

A.1.1 PLRNNs

The standard PLRNN [18], given in eq. (1) in sect. 3.1, was defined by $z_t = F_\theta(z_{t-1}, s_t) = A z_{t-1} + W \phi(z_{t-1}) + C s_t + h$, where $\phi(z_{t-1}) = \max(z_{t-1}, 0)$. There are various extensions of this basic architecture like the dendPLRNN [10] or the shallow PLRNN (shPLRNN) [24], as used in sect. 5.1 for training on single-cell membrane potentials. The latter is essentially a 1-hidden-layer version of the form

$$z_t = F_\theta(z_{t-1}, s_t) = A z_{t-1} + W_1 \phi(W_2 z_{t-1} + h_2) + C s_t + h_1, \qquad (6)$$

with connectivity matrices $W_1 \in \mathbb{R}^{M \times L}$ and $W_2 \in \mathbb{R}^{L \times M}$, $L \ge M$, bias terms $h_1 \in \mathbb{R}^M$, $h_2 \in \mathbb{R}^L$, and all other parameters and variables as in eq. (1). While this formulation is beneficial for training, the shPLRNN can essentially be rewritten in standard PLRNN form (see [24]). Assume that $D_{\Omega(t)} := \mathrm{diag}(d_{\Omega(t)})$ is a diagonal matrix with an indicator vector $d_{\Omega(t)} := (d_1, d_2, \dots, d_M)$ such that $d_m(z_{m,t}) =: d_m = 1$ whenever $z_{m,t} > 0$, and zero otherwise. Then eq. (1) can be rewritten as $z_t = (A + W D_{\Omega(t-1)})\, z_{t-1} + C s_t + h =: W_{\Omega(t-1)}\, z_{t-1} + C s_t + h$. Let us ignore the inputs for simplicity. There are $2^M$ different configurations for the matrix $D_{\Omega(t-1)}$ and so $2^M$ different forms of the matrix $W_{\Omega(t-1)}$ in the system

$$z_t = F_\theta(z_{t-1}) = W_{\Omega(t-1)}\, z_{t-1} + h. \qquad (7)$$

Thus, the phase space of the system is divided into $2^M$ sub-regions corresponding to the indexed matrices

$$W_{\Omega_k} := A + W D_{\Omega_k}, \quad k = 1, 2, \dots, 2^M, \qquad (8)$$

see [38, 39] for more details.
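For small M, the $2^M$ region matrices of eq. (8) can simply be enumerated; the following sketch lists them together with their eigenvalues, which is the basic ingredient for the existence/stability analysis in the subsections below (illustrative code, not from the paper; the example values follow Fig. 1, with $a_{11} + w_{11} = -2.83$).

```python
import itertools
import numpy as np

def region_matrices(A, W):
    """Enumerate all 2^M region matrices W_Omega_k = A + W D_Omega_k of eq. (8)."""
    M = A.shape[0]
    out = []
    for bits in itertools.product((0.0, 1.0), repeat=M):  # all ReLU on/off configurations
        D = np.diag(bits)
        out.append((bits, A + W @ D))
    return out

# 2d example with w12 = w22 = 0 (cf. eq. (3)); values follow Fig. 1
# (a11 = 0.253, a11 + w11 = -2.83, a22 = 0.2, w21 = 0.5). The 4 matrices then
# collapse onto two distinct ones, W_Omega_1 = W_Omega_3 and W_Omega_2 = W_Omega_4.
A = np.diag([0.253, 0.2])
W = np.array([[-3.083, 0.0], [0.5, 0.0]])
for bits, WOm in region_matrices(A, W):
    print(bits, np.round(np.linalg.eigvals(WOm), 3))
```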
For M = 2, assuming $w_{12} = w_{22} = 0$ in (8), we have

$$W_{\Omega_1} = W_{\Omega_3} = \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix}, \qquad W_{\Omega_2} = W_{\Omega_4} = \begin{pmatrix} a_{11} + w_{11} & 0 \\ w_{21} & a_{22} \end{pmatrix}.$$

Hence, for this parameter constellation the map simplifies, as there exists only one border which divides the phase space into two distinct sub-regions, such that (7) can be rewritten as a map of the form

$$\begin{pmatrix} z_{1,t} \\ z_{2,t} \end{pmatrix} = T(z_{1,t-1}, z_{2,t-1}) = \begin{cases} T_L(z_{1,t-1}, z_{2,t-1}) = \begin{pmatrix} a_l\, z_{1,t-1} + c\, z_{2,t-1} + h_1 \\ b_l\, z_{1,t-1} + d\, z_{2,t-1} + h_2 \end{pmatrix}, & z_{1,t-1} \le 0, \\[4pt] T_R(z_{1,t-1}, z_{2,t-1}) = \begin{pmatrix} a_r\, z_{1,t-1} + c\, z_{2,t-1} + h_1 \\ b_r\, z_{1,t-1} + d\, z_{2,t-1} + h_2 \end{pmatrix}, & z_{1,t-1} > 0, \end{cases} \qquad (11)$$

with $a_l = a_{11}$, $a_r = a_{11} + w_{11}$, $b_r = w_{21}$, $d = a_{22}$, $b_l = c = 0$. The map (11) is a PWL dynamical system whose phase space is split into left and right half-planes (sub-regions) by the borderline $\Sigma$ (the $z_2$-axis). Note that bifurcation curves of the 2d PLRNN (7) in the $(a_{11}, a_{11}+w_{11})$-parameter space can be determined analogously to those of the PWL map (11) in the $(a_l, a_r)$-parameter space. Another way to simplify the PLRNN to a 2d (M = 2) PWL map with just a single border is to remove one of the ReLU nonlinearities and define $\phi(z_{t-1}) = (\phi_1(z_{1,t-1}), \beta z_{2,t-1})^\top$, where $\beta \in \mathbb{R}$ and $\phi_1$ is some variant of the ReLU function such as the leaky or parametric ReLU given by

$$\phi_1(z) = \begin{cases} z, & z > 0 \\ \alpha z, & z \le 0 \end{cases} \quad (\alpha \in \mathbb{R}). \qquad (12)$$

Then $D_{\Omega(t)} := \mathrm{diag}(d_1, \beta)$ such that

$$d_1(z_{1,t}) =: d_1 = \begin{cases} 1, & z_{1,t} > 0 \\ \alpha, & z_{1,t} \le 0 \end{cases}, \qquad (13)$$

and

$$W_{\Omega_1} = W_{\Omega_3} = \begin{pmatrix} a_{11} + \alpha w_{11} & \beta w_{12} \\ \alpha w_{21} & a_{22} + \beta w_{22} \end{pmatrix}, \qquad W_{\Omega_2} = W_{\Omega_4} = \begin{pmatrix} a_{11} + w_{11} & \beta w_{12} \\ w_{21} & a_{22} + \beta w_{22} \end{pmatrix}. \qquad (14)$$

This gives another example of a representative of 2d PWL maps with only one border as defined in eq. (11). We are pointing this out because eq. (11) is a generic system considered more widely in the discrete dynamical systems literature [5, 6], and also was the basis for the analyses below and in Fig. 1.

A.1.2 Fixed points of the map (11) and their bifurcations

For $a_l, a_r, b_l, b_r, c, d, h_1, h_2 \in \mathbb{R}$, the map (11) has the following two fixed points

$$O_{L/R} = \big(z_1^{L/R}, z_2^{L/R}\big)^\top = \left( \frac{(1-d)\, h_1 + c\, h_2}{(1-d)(1-a_{l/r}) - b_{l/r}\, c},\; \frac{b_{l/r}\, h_1 + (1-a_{l/r})\, h_2}{(1-d)(1-a_{l/r}) - b_{l/r}\, c} \right)^\top. \qquad (15)$$

The fixed points $O_L$ and $O_R$ exist iff $z_1^L < 0$ and $z_1^R > 0$, respectively; otherwise they are virtual. Hence, the existence regions of admissible fixed points are

$$E_{O_L} = \left\{ (h_1, h_2, a_l, b_l, c, d) \,\Big|\, \frac{(1-d)\, h_1 + c\, h_2}{(1-d)(1-a_l) - b_l\, c} < 0 \right\}, \qquad E_{O_R} = \left\{ (h_1, h_2, a_r, b_r, c, d) \,\Big|\, \frac{(1-d)\, h_1 + c\, h_2}{(1-d)(1-a_r) - b_r\, c} > 0 \right\}. \qquad (16)$$

Let $D_{L/R}$ be the determinant and $T_{L/R}$ the trace of $A_{L/R}$, and

$$P_{L/R}(\lambda) = \lambda^2 - (a_{l/r} + d)\lambda + a_{l/r}\, d - b_{l/r}\, c = \lambda^2 - T_{L/R}\, \lambda + D_{L/R} \qquad (17)$$

its characteristic polynomial. The corresponding eigenvalues are given by

$$\lambda_{1,2}(O_{L/R}) = \frac{a_{l/r} + d \pm \sqrt{(a_{l/r} - d)^2 + 4\, b_{l/r}\, c}}{2} = \frac{T_{L/R} \pm \sqrt{T_{L/R}^2 - 4 D_{L/R}}}{2}, \qquad (18)$$

which are always real for $b_{l/r}\, c \ge 0$, while for $b_{l/r}\, c < 0$ they are real provided that $|a_{l/r} - d| > 2\sqrt{-b_{l/r}\, c}$. For complex conjugate eigenvalues of $A_{L/R}$, obviously $|\lambda|^2 = D_{L/R}$. Thus, computing the real eigenvalues, the stability condition for the fixed points is determined as

$$-(1 + D_{L/R}) < T_{L/R} < 1 + D_{L/R}. \qquad (19)$$

Accordingly, the stability region of the fixed points $O_L$ and $O_R$ can be obtained from $P_{L/R}(1) = 1 - (a_{l/r} + d) + a_{l/r}\, d - b_{l/r}\, c > 0$ and $D_{L/R} < 1$ as

$$S_{L/R} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \,\Big|\, a_{l/r}\, d - b_{l/r}\, c < 1,\; 1 - (a_{l/r} + d) + a_{l/r}\, d - b_{l/r}\, c > 0 \right\}. \qquad (20)$$

Note that when $D_{L/R} < 0$, all the eigenvalues are real and so there cannot be any spiralling orbit.

Remark 1. Consider the PLRNN (2) with M = 2. For the parameter setting (3), i.e.
$$W_{\Omega_1} = W_{\Omega_3} = \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix} =: A_L, \qquad W_{\Omega_2} = W_{\Omega_4} = \begin{pmatrix} a_{11} + w_{11} & 0 \\ w_{21} & a_{22} \end{pmatrix} =: A_R,$$
the two fixed points $O_{L/R} = \big(z_1^{L/R}, z_2^{L/R}\big)^T$ are given by
$$O_L = \left( \frac{h_1}{1 - a_{11}},\; \frac{h_2}{1 - a_{22}} \right)^T, \qquad O_R = \left( \frac{h_1}{1 - a_{11} - w_{11}},\; \frac{w_{21}\,h_1 + (1 - a_{11} - w_{11})\,h_2}{(1 - a_{22})(1 - a_{11} - w_{11})} \right)^T.$$
Hence, the existence regions of admissible fixed points are
$$E_{O_L} = \left\{ (h_1, a_{11}, a_{22}) \;\middle|\; \frac{h_1}{1 - a_{11}} < 0 \right\}, \qquad E_{O_R} = \left\{ (h_1, a_{11}, a_{22}, w_{11}) \;\middle|\; \frac{h_1}{1 - a_{11} - w_{11}} > 0 \right\},$$
and their stability regions can be obtained as
$$S_L = \left\{ (h_1, a_{11}, a_{22}) \in E_{O_L} \;\middle|\; a_{11}\,a_{22} < 1,\; 1 + (a_{11} + a_{22}) + a_{11}\,a_{22} > 0 \right\}, \quad (23)$$
$$S_R = \left\{ (h_1, a_{11}, a_{22}, w_{11}) \in E_{O_R} \;\middle|\; (a_{11} + w_{11})\,a_{22} < 1,\; 1 + (a_{11} + w_{11} + a_{22}) + (a_{11} + w_{11})\,a_{22} > 0 \right\}.$$

Remark 2. If $b_{l/r}\,c = 0$, then $\lambda_{1,2}(O_{L/R})$ are real and the stability regions $S_{L/R}$ become
$$S_{L/R} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; b_{l/r}\,c = 0,\; -1 < a_{l/r} < 1,\; -1 < d < 1 \right\}. \quad (24)$$
The fixed points are regular saddles for all parameters that belong to
$$\left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; a_{l/r} + d > 1,\; a_{l/r}\,d - a_{l/r} - d + 1 < b_{l/r}\,c < a_{l/r}\,d \right\},$$
and in this case $\lambda_1(O_{L/R}) > 1$, $0 < \lambda_2(O_{L/R}) < 1$. Furthermore, they are flip saddles (i.e., with one negative eigenvalue) if parameters are in
$$\left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; a_{l/r} + d > 1,\; a_{l/r}\,d < b_{l/r}\,c < a_{l/r}\,d + a_{l/r} + d + 1 \right\}$$
$$\cup\; \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; 0 < a_{l/r} + d \leq 1,\; a_{l/r}\,d - a_{l/r} - d + 1 < b_{l/r}\,c < a_{l/r}\,d + a_{l/r} + d + 1 \right\},$$
for which $\lambda_1(O_{L/R}) > 1$, $-1 < \lambda_2(O_{L/R}) < 0$, as well as in
$$\left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; a_{l/r} + d \leq -1,\; a_{l/r}\,d < b_{l/r}\,c < a_{l/r}\,d - a_{l/r} - d + 1 \right\}$$
$$\cup\; \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; -1 < a_{l/r} + d < 0,\; a_{l/r}\,d + a_{l/r} + d + 1 < b_{l/r}\,c < a_{l/r}\,d - a_{l/r} - d + 1 \right\}, \quad (27)$$
such that $0 < \lambda_1(O_{L/R}) < 1$, $\lambda_2(O_{L/R}) < -1$.

When $b_{l/r}\,c < 0$ and $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$, the eigenvalues are complex conjugates and both $O_L$ and $O_R$ are spirally attracting (attracting foci) if $a_{l/r}\,d - b_{l/r}\,c < 1$. In this case, if $a_{l/r} + d > 0$ the spiral is clockwise, while for $a_{l/r} + d < 0$ the spiralling motion is counterclockwise. Moreover, for $a_{l/r}\,d - b_{l/r}\,c > 1$ they are repelling foci. Finally, for $a_{l/r}\,d - b_{l/r}\,c = 1$, the fixed points are locally centers and they undergo a CB at the following boundaries:
$$C_L = \left\{ (a_l, b_l, c, d) \;\middle|\; b_l\,c < 0,\; |a_l - d| < 2\sqrt{-b_l\,c},\; a_l\,d - b_l\,c = 1 \right\},$$
$$C_R = \left\{ (a_r, b_r, c, d) \;\middle|\; b_r\,c < 0,\; |a_r - d| < 2\sqrt{-b_r\,c},\; a_r\,d - b_r\,c = 1 \right\}. \quad (28)$$
At these boundaries, the fixed points lose their stability with a pair of complex conjugate eigenvalues crossing the unit circle. For parameters belonging to $C_{L/R}$, the Jacobian $J_{L/R}$ is a rotation matrix whose determinant is equal to 1. In this case, $J_{L/R}$ can be characterized by a rotation number which is either rational ($\tfrac{p}{q}$) or irrational ($\rho$). Therefore, in some neighborhood of $O_{L/R}$, there is a region filled with invariant ellipses on which the motion is periodic with period $q$ (if the rotation number is a rational number $\tfrac{p}{q}$) or quasiperiodic (if the rotation number is an irrational number $\rho$); for more information see [60, 59].

For $(1-d)\,h_1 + c\,h_2 \neq 0$, at the boundary
$$\tau_L = \left\{ (h_1, h_2, a_l, b_l, c, d) \;\middle|\; 1 - a_l - d + a_l\,d - b_l\,c = 0 \right\}, \quad (29)$$
the fixed point $O_L$ undergoes a DTB since, as the parameters tend to $\tau_L$, $O_L \to \infty$ and $\lambda(O_L) \to 1$. Similarly, for $(1-d)\,h_1 + c\,h_2 \neq 0$, a DTB occurs for the fixed point $O_R$ at the boundary
$$\tau_R = \left\{ (h_1, h_2, a_r, b_r, c, d) \;\middle|\; 1 - a_r - d + a_r\,d - b_r\,c = 0 \right\}. \quad (30)$$
A DTB of a fixed point results in its disappearance, as in this case the fixed point becomes virtual, which may lead to changes in the global dynamics [6]. Furthermore, the BCB curves are given by
$$\xi_L = \left\{ (h_1, h_2, a_l, b_l, c, d) \;\middle|\; (1-d)(1-a_l) - b_l\,c \neq 0,\; (1-d)\,h_1 + c\,h_2 = 0 \right\}, \quad (31)$$
$$\xi_R = \left\{ (h_1, h_2, a_r, b_r, c, d) \;\middle|\; (1-d)(1-a_r) - b_r\,c \neq 0,\; (1-d)\,h_1 + c\,h_2 = 0 \right\}. \quad (32)$$
In addition, the DFB curves for the fixed points $O_L$ and $O_R$ are
$$F_L = \left\{ (h_1, h_2, a_l, b_l, c, d) \;\middle|\; 1 + a_l + d + a_l\,d - b_l\,c = 0 \right\}, \qquad F_R = \left\{ (h_1, h_2, a_r, b_r, c, d) \;\middle|\; 1 + a_r + d + a_r\,d - b_r\,c = 0 \right\}. \quad (33)$$

Remark 3. The existence regions $E_{O_L}$ and $E_{O_R}$ are bounded by the BCB curves $\xi_L$ and $\xi_R$.

Remark 4. The stability regions $S_{L/R}$ of fixed points (eq. (20)) are bounded by the DTB curves $\tau_{L/R}$ (eqns. (29) and (30)), the DFB curves $F_{L/R}$ (eq. (33)), and the CB curves $a_{l/r}\,d - b_{l/r}\,c = 1$. For instance, for $d = 1$, $S_{L/R}$ are illustrated in Fig. S1(a). In this case, the stability regions only exist for $b_{l/r}\,c < 0$ and $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$. Moreover, as shown in Fig. S1(b), for $d = 0.01$, these stability regions can exist both for $b_{l/r}\,c < 0$, $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$ (in blue), and for $b_{l/r}\,c < 0$, $|a_{l/r} - d| > 2\sqrt{-b_{l/r}\,c}$ (in green), but not for $b_{l/r}\,c \geq 0$. Furthermore, if $c = 1$, there are stability regions $S_{L/R}$ for the two cases $b_{l/r}\,c < 0$, $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$ (in blue), and $b_{l/r}\,c > 0$ (in purple); see Fig. S2(a). Finally, when $c = 0$, as explained in Remark 2, the stability regions have the form (24), i.e.
$$S_{L/R} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{L/R}} \;\middle|\; c = 0,\; b_{l/r} \in \mathbb{R},\; -1 < a_{l/r}, d < 1 \right\}, \quad (34)$$
which are displayed in Fig. S2(b).

Figure S1: Stability regions $S_{L/R}$. Left: for $d = 1$; right: for $d = 0.01$. The case $b_{l/r}\,c < 0$, $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$ is plotted in blue, and the case $b_{l/r}\,c < 0$, $|a_{l/r} - d| > 2\sqrt{-b_{l/r}\,c}$ is drawn in green.

Figure S2: Stability regions $S_{L/R}$. Left: for $c = 1$; the case $b_{l/r}\,c < 0$, $|a_{l/r} - d| < 2\sqrt{-b_{l/r}\,c}$ is plotted in yellow, and the case $b_{l/r}\,c > 0$ is drawn in purple. Right: for $c = 0$.

Since the system (11) is a linear map in each sub-region $L$ and $R$, there cannot be any $n$-cycle, $n \geq 2$, with all periodic points on only one linear side. Hence, all period-$n$ orbits have both letters $L$ and $R$ in their symbolic sequence.

A.1.3 2-cycles of the map (11) and their bifurcations

The 2-cycle $O_{RL}$ of the map (11) is determined by solving the equation $T_L \circ T_R(z_1, z_2) = (z_1, z_2)^T$, where
$$T_L \circ T_R(z_1, z_2) = \begin{pmatrix} a_l a_r + b_r c & a_l c + c d \\ a_r b_l + b_r d & d^2 + b_l c \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} + \begin{pmatrix} (a_l + 1)\,h_1 + c\,h_2 \\ b_l\,h_1 + (d + 1)\,h_2 \end{pmatrix}.$$
In this case, if $I - J_L J_R$ is invertible, then the solution $(z_1, z_2)^T = \big(z_1^{(1)}, z_2^{(1)}\big)^T$ is given by
$$\big(z_1^{(1)}, z_2^{(1)}\big)^T = \left( \frac{\big[(1-d)h_1 + c\,h_2\big]\big(a_l + d + a_l d - b_l c + 1\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1},\; \frac{h_2\big(1 + d - a_l a_r - b_r c - a_l a_r d + a_r b_l c\big) + h_1\big(b_l + a_r b_l + b_r d + a_l b_r d - b_l b_r c\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1} \right)^T.$$
Also, $T_R\big(z_1^{(1)}, z_2^{(1)}\big) = \big(z_1^{(2)}, z_2^{(2)}\big)^T$ yields
$$\big(z_1^{(2)}, z_2^{(2)}\big)^T = \left( \frac{\big[(1-d)h_1 + c\,h_2\big]\big(a_r + d + a_r d - b_r c + 1\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1},\; \frac{h_2\big(1 + d - a_l a_r - b_l c - a_l a_r d + a_l b_r c\big) + h_1\big(b_r + a_l b_r + b_l d + a_r b_l d - b_l b_r c\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1} \right)^T.$$
Hence, the existence region of the 2-cycle $O_{RL}$ is
$$E_{O_{RL}} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \;\middle|\; \frac{\big[(1-d)h_1 + c\,h_2\big]\big(a_l + d + a_l d - b_l c + 1\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1} > 0,\; \frac{\big[(1-d)h_1 + c\,h_2\big]\big(a_r + d + a_r d - b_r c + 1\big)}{(a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1} < 0 \right\}. \quad (38)$$
The characteristic polynomial of $J_{RL} = J_L J_R = \begin{pmatrix} a_l a_r + b_r c & a_l c + c d \\ a_r b_l + b_r d & d^2 + b_l c \end{pmatrix}$ is given by
$$P_{O_{RL}}(\lambda) = \lambda^2 - \big(d^2 + a_l a_r + b_l c + b_r c\big)\lambda + (a_r d - b_r c)(a_l d - b_l c), \quad (39)$$
with
$$D_{RL} = (a_r d - b_r c)(a_l d - b_l c),$$
$$P_{O_{RL}}(1) = (a_r d - b_r c)(a_l d - b_l c) - c(b_l + b_r) - d^2 - a_l a_r + 1,$$
$$P_{O_{RL}}(-1) = (a_r d - b_r c)(a_l d - b_l c) + c(b_l + b_r) + d^2 + a_l a_r + 1,$$
$$\lambda_{1,2}(O_{RL}) = \frac{a_l a_r + c(b_l + b_r) + d^2 \pm \sqrt{(a_l a_r + b_l c)^2 + (b_r c + d^2)^2 + 2(a_l a_r - b_l c)(b_r c - d^2) + 4\,c\,d\,(a_l b_r + a_r b_l)}}{2}.$$
Thus, the stability region of $O_{RL}$ is
$$S_{RL} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{RL}} \;\middle|\; -1 < (a_r d - b_r c)(a_l d - b_l c) < 1,\; -\big[(a_r d - b_r c)(a_l d - b_l c) + 1\big] < c(b_l + b_r) + d^2 + a_l a_r < (a_r d - b_r c)(a_l d - b_l c) + 1 \right\}.$$
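The cycle formulas above are instances of a general recipe: for a prescribed symbolic itinerary one composes the corresponding affine branch maps, solves the resulting linear fixed-point equation, and then checks that each candidate point actually lies in the half-plane its symbol prescribes (otherwise the cycle is virtual). The following NumPy sketch (our own illustration, not the paper's code; the example parameter values are hypothetical) does this for the 2-cycle $O_{RL}$ of map (11) and checks its stability via the eigenvalues of $J_L J_R$.

```python
import numpy as np

def branch(a, b, c, d, h1, h2):
    """Affine branch (J, k) of map (11): z -> J z + k."""
    return np.array([[a, c], [b, d]]), np.array([h1, h2])

def two_cycle(al, ar, bl, br, c, d, h1, h2):
    """Candidate 2-cycle O_RL of map (11): one point in R (z_1 > 0), one in L (z_1 <= 0)."""
    JL, kL = branch(al, bl, c, d, h1, h2)
    JR, kR = branch(ar, br, c, d, h1, h2)
    J2 = JL @ JR
    # fixed point of T_L o T_R:  z = JL (JR z + kR) + kL
    z1 = np.linalg.solve(np.eye(2) - J2, JL @ kR + kL)
    z2 = JR @ z1 + kR
    admissible = (z1[0] > 0) and (z2[0] <= 0)                # symbols R and L respected?
    stable = np.all(np.abs(np.linalg.eigvals(J2)) < 1)
    return z1, z2, admissible, stable

# hypothetical parameter set, chosen only for illustration
print(two_cycle(al=-0.5, ar=1.5, bl=0.4, br=-0.5, c=0.8, d=0.2, h1=1.0, h2=0.0))
```

The admissibility check is exactly what distinguishes the existence region (38) from the set of parameters where the algebraic solution is merely virtual.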
In addition, for $\big[(1-d)h_1 + c\,h_2\big]\big(a_l + d + a_l d - b_l c + 1\big) \neq 0$, the set
$$\tau_{RL} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_l a_r + b_l c + b_r c + d^2 - a_l a_r d^2 - b_l b_r c^2 + a_l b_r c d + a_r b_l c d - 1 = 0 \right\} \quad (42)$$
is the DTB curve for the 2-cycle $O_{RL}$, as in this case, for parameter values belonging to $\tau_{RL}$, the points of the 2-cycle $O_{RL}$ tend to $\infty$ and the corresponding eigenvalue tends to one. Moreover, for $(1-d)h_1 + c\,h_2 \neq 0$, the BCB curves of $O_{RL}$ can be computed as
$$\xi^1_{RL} = \left\{ (h_1, h_2, a_l, b_l, c, d) \;\middle|\; a_l + d + a_l d - b_l c + 1 = 0 \right\}, \qquad \xi^2_{RL} = \left\{ (h_1, h_2, a_r, b_r, c, d) \;\middle|\; a_r + d + a_r d - b_r c + 1 = 0 \right\}. \quad (43)$$
Note that here the condition $(1-d)h_1 + c\,h_2 \neq 0$ guarantees a regular BCB in the sense that only one periodic point of $O_{RL}$ collides with the switching boundary; for more details see [6]. Besides,
$$F_{RL} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_l a_r + b_l c + b_r c + d^2 + a_l a_r d^2 + b_l b_r c^2 - a_l b_r c d - a_r b_l c d + 1 = 0 \right\} \quad (44)$$
is the DFB curve of the 2-cycle $O_{RL}$.

Remark 5. One can see that for $(1-d)h_1 + c\,h_2 \neq 0$ the DFB curves of the fixed points $O_L$ and $O_R$ ($F_L$ and $F_R$) and the BCB boundaries of the 2-cycle $O_{RL}$ ($\xi^1_{RL}$ and $\xi^2_{RL}$) are the same. In this case, the DFB of the fixed points can lead to the (attracting) 2-cycle $O_{RL}$.

A.1.4 3-cycles of the map (11) and their bifurcations

Here, we investigate the existence, stability and bifurcation structure of maximal or basic 3-cycles. Note that for the continuous map (11), basic $n$-cycles $O_{RL^{n-1}}$ ($n \geq 3$) exist in pairs with their complementary cycles ($O_{RL^{n-2}R}$), and they appear via BCBs such that one of them may be attracting and the other repelling [6, 20, 38]. In this case, a BCB of basic cycles constitutes a non-smooth fold bifurcation which involves a stable basic orbit and an unstable non-basic orbit [6, 5, 19]. Furthermore, the complementary orbits can have nonempty stability regions such that, similar to the basic orbits, they are bounded by curves of BCBs, DTBs and DFBs [6, 5].

Basic 3-cycles $O_{RL^2}$ and their complementary cycles $O_{R^2L}$. The basic 3-cycle $O_{RL^2}$ can be obtained from the equation $T_L \circ T_L \circ T_R(z_1, z_2) = (z_1, z_2)^T$, where
$$T_L \circ T_L \circ T_R(z_1, z_2) = \begin{pmatrix} a_r(a_l^2 + b_l c) + b_r(a_l c + c d) & c(a_l^2 + b_l c) + d(a_l c + c d) \\ a_r(a_l b_l + b_l d) + b_r(d^2 + b_l c) & c(a_l b_l + b_l d) + d(d^2 + b_l c) \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} + \begin{pmatrix} h_1\big(b_l c + a_l(a_l + 1) + 1\big) + h_2\big(a_l c + c(d + 1)\big) \\ h_1\big(b_l d + b_l(a_l + 1)\big) + h_2\big(b_l c + d(d + 1) + 1\big) \end{pmatrix}.$$
If $I - J_L^2 J_R$ is invertible, then the solution $(z_1, z_2)^T = \big(z_1^{(1)}, z_2^{(1)}\big)^T$ is
$$\big(z_1^{(1)}, z_2^{(1)}\big)^T = \left( \frac{\big[(1-d)h_1 + c\,h_2\big]\,G_1}{G},\; \frac{G_2}{G} \right)^T,$$
with
$$G_1 = a_l^2 d^2 + a_l^2 d + a_l^2 - 2 a_l b_l c d - a_l b_l c + a_l d^2 + a_l d + a_l + b_l^2 c^2 - b_l c d + b_l c + d^2 + d + 1,$$
$$G = 1 - a_l^2 a_r - d^3 - c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) + (a_r d - b_r c)(a_l d - b_l c)^2,$$
$$G_2 = h_2\big(1 + d + d^2 + b_l c - a_l^2 a_r - a_l^2 a_r d - a_l^2 a_r d^2 - a_l b_r c - a_r b_l c - b_r c d + a_l a_r b_l c - a_l b_r c d + b_l b_r c^2 - a_r b_l^2 c^2 + 2 a_l a_r b_l c d\big)$$
$$\qquad\;\; + h_1\big(b_l + a_l b_l + b_l d + b_r d^2 + a_l b_r d^2 + a_l^2 b_r d^2 - a_r b_l^2 c + b_l^2 b_r c^2 + a_l a_r b_l + a_r b_l d + b_l b_r c + a_l a_r b_l d - b_l b_r c d - 2 a_l b_l b_r c d\big). \quad (47)$$
Then $T_R\big(z_1^{(1)}, z_2^{(1)}\big) = \big(z_1^{(2)}, z_2^{(2)}\big)^T$ and $T_L\big(z_1^{(2)}, z_2^{(2)}\big) = \big(z_1^{(3)}, z_2^{(3)}\big)^T$ yield
$$\big(z_1^{(2)}, z_2^{(2)}\big)^T = \left( \frac{\big[(1-d)h_1 + c\,h_2\big]\,K_1}{G},\; \frac{K_2}{G} \right)^T, \qquad \big(z_1^{(3)}, z_2^{(3)}\big)^T = \left( \frac{\big[(1-d)h_1 + c\,h_2\big]\,H_1}{G},\; \frac{H_2}{G} \right)^T,$$
where
$$K_1 = 1 + a_r + d + a_l a_r + b_l c + a_r d + a_r d^2 + d^2 + a_l a_r d - a_l b_r c - b_r c d + a_l a_r d^2 + b_l b_r c^2 - a_l b_r c d - a_r b_l c d,$$
$$H_1 = 1 + a_l + d + a_l a_r + a_l d + b_r c + a_l d^2 + d^2 + a_l a_r d - a_r b_l c - b_l c d + a_l a_r d^2 + b_l b_r c^2 - a_l b_r c d - a_r b_l c d,$$
and $K_2$ and $H_2$ are the corresponding (lengthy) polynomials for the second coordinates, which follow directly from applying the branch maps to the preceding cyclic points:
$$K_2 = b_r\big[(1-d)h_1 + c\,h_2\big]G_1 + d\,G_2 + G\,h_2, \qquad H_2 = b_l\big[(1-d)h_1 + c\,h_2\big]K_1 + d\,K_2 + G\,h_2. \quad (49)$$
Therefore, the existence region of the 3-cycle $O_{RL^2}$ is given by
$$E_{O_{RL^2}} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \;\middle|\; \frac{\big[(1-d)h_1 + c\,h_2\big]G_1}{G} > 0,\; \frac{\big[(1-d)h_1 + c\,h_2\big]K_1}{G} < 0,\; \frac{\big[(1-d)h_1 + c\,h_2\big]H_1}{G} < 0 \right\}, \quad (50)$$
where $G$, $G_1$, $K_1$ and $H_1$ are defined in (47) and (49). On the other hand, the characteristic polynomial of
$$J_{RL^2} = J_L^2 J_R = \begin{pmatrix} a_r(a_l^2 + b_l c) + b_r(a_l c + c d) & c(a_l^2 + b_l c) + d(a_l c + c d) \\ a_r(a_l b_l + b_l d) + b_r(d^2 + b_l c) & c(a_l b_l + b_l d) + d(d^2 + b_l c) \end{pmatrix}$$
is
$$P_{O_{RL^2}}(\lambda) = \lambda^2 - \Big(a_l^2 a_r + d^3 + c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big)\Big)\lambda + (a_r d - b_r c)(a_l d - b_l c)^2. \quad (51)$$
According to
$$D_{RL^2} = (a_l d - b_l c)^2 (a_r d - b_r c),$$
$$P_{O_{RL^2}}(1) = 1 - a_l^2 a_r - d^3 - c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) + (a_r d - b_r c)(a_l d - b_l c)^2,$$
$$P_{O_{RL^2}}(-1) = 1 + a_l^2 a_r + d^3 + c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) + (a_r d - b_r c)(a_l d - b_l c)^2, \quad (52)$$
the stability region of the 3-cycle $O_{RL^2}$ is given by
$$S_{RL^2} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \in E_{O_{RL^2}} \;\middle|\; -1 < (a_l d - b_l c)^2 (a_r d - b_r c) < 1,\; -\big[(a_l d - b_l c)^2 (a_r d - b_r c) + 1\big] < a_l^2 a_r + d^3 + c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) < (a_l d - b_l c)^2 (a_r d - b_r c) + 1 \right\}. \quad (53)$$
Furthermore, for $(1-d)h_1 + c\,h_2 \neq 0$ and $G_1, G_2, K_1, K_2, H_1, H_2 \neq 0$, the DTB curve for the 3-cycle $O_{RL^2}$ is
$$\tau_{RL^2} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_l^2 a_r + d^3 + c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) - (a_r d - b_r c)(a_l d - b_l c)^2 - 1 = 0 \right\}, \quad (54)$$
i.e. the zero set of $P_{O_{RL^2}}(1)$. For $(1-d)h_1 + c\,h_2 \neq 0$,
$$\xi^1_{RL^2} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \;\middle|\; K_1 = 0 \right\}, \qquad \xi^2_{RL^2} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \;\middle|\; H_1 = 0 \right\} \quad (55)$$
are (regular) BCB curves of $O_{RL^2}$. Furthermore, the set
$$F_{RL^2} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_l^2 a_r + d^3 + c\big(a_l b_l + a_l b_r + a_r b_l + d(2 b_l + b_r)\big) + (a_r d - b_r c)(a_l d - b_l c)^2 + 1 = 0 \right\},$$
i.e. the zero set of $P_{O_{RL^2}}(-1)$, is the DFB curve of $O_{RL^2}$.

As noted, the basic 3-cycle $O_{RL^2}$ exists in a pair with its complementary cycle $O_{R^2L}$. Moreover, the existence region of $O_{R^2L}$ can easily be found by interchanging the letters $L$ and $R$ in all notations of the equations (45)–(49) and requiring
$$z_1^{(1)} < 0, \qquad z_1^{(2)} > 0, \qquad z_1^{(3)} > 0. \quad (57)$$
Further, the stability region of the 3-cycle $O_{R^2L}$ for the parameter values satisfying (57) is given by
$$S_{R^2L} = \left\{ (h_1, h_2, a_{l/r}, b_{l/r}, c, d) \;\middle|\; -1 < (a_r d - b_r c)^2 (a_l d - b_l c) < 1,\; -\big[(a_r d - b_r c)^2 (a_l d - b_l c) + 1\big] < a_r^2 a_l + d^3 + c\big(a_r b_r + a_r b_l + a_l b_r + d(2 b_r + b_l)\big) < (a_r d - b_r c)^2 (a_l d - b_l c) + 1 \right\}. \quad (58)$$
Notice that whenever the stable 3-cycle $O_{RL^2}$ exists, its complementary orbit $O_{R^2L}$ also exists, but it is unstable.
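The existence and stability regions above are just systems of inequalities in the parameters, so they can be evaluated directly on a parameter grid, which is how region plots like Fig. 1 and Fig. S4 can be produced. A minimal numerical sketch for the basic 3-cycle $O_{RL^2}$ (our own illustration, not the paper's code; the grid ranges and fixed parameter values are placeholders):

```python
import numpy as np

def rl2_regions(al, ar, bl, br, c, d, h1, h2):
    """Existence / stability test for the basic 3-cycle O_{RL^2} of map (11):
    admissibility of the three cyclic points, plus the Jury-type stability
    conditions equivalent to eq. (53) applied to J_L^2 J_R."""
    JL = np.array([[al, c], [bl, d]])
    JR = np.array([[ar, c], [br, d]])
    h = np.array([h1, h2])
    J3 = JL @ JL @ JR
    # cyclic points: z1 carries the R-symbol, z2 and z3 the L-symbols
    z1 = np.linalg.solve(np.eye(2) - J3, (JL @ JL + JL + np.eye(2)) @ h)
    z2 = JR @ z1 + h
    z3 = JL @ z2 + h
    exists = (z1[0] > 0) and (z2[0] <= 0) and (z3[0] <= 0)
    tr, det = np.trace(J3), np.linalg.det(J3)
    stable = (abs(det) < 1) and (abs(tr) < 1 + det)
    return exists, exists and stable

# sweep a small (a_l, a_r) grid at fixed remaining parameters (illustrative values)
als = np.linspace(-1.0, 1.0, 201)
ars = np.linspace(0.0, 3.0, 201)
S = np.array([[rl2_regions(al, ar, bl=0.4, br=-0.5, c=0.8, d=0.2, h1=1.0, h2=0.0)[1]
               for al in als] for ar in ars])
print("fraction of grid with a stable, admissible O_{RL^2}:", S.mean())
```

Plotting such boolean masks for several cycle orders on the same grid directly reveals the overlap regions used in the next subsection to locate multiple attractor bifurcations.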
Furthermore, for $(1-d)h_1 + c\,h_2 \neq 0$, both 3-cycles $O_{RL^2}$ and $O_{R^2L}$ appear at the same BCB curves (55). On the other hand, the DTB and DFB curves of the 3-cycle $O_{R^2L}$ are given by
$$\tau_{R^2L} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_r^2 a_l + d^3 + c\big(a_r b_r + a_r b_l + a_l b_r + d(2 b_r + b_l)\big) - (a_l d - b_l c)(a_r d - b_r c)^2 - 1 = 0 \right\},$$
$$F_{R^2L} = \left\{ (h_1, h_2, a_l, a_r, b_l, b_r, c, d) \;\middle|\; a_r^2 a_l + d^3 + c\big(a_r b_r + a_r b_l + a_l b_r + d(2 b_r + b_l)\big) + (a_l d - b_l c)(a_r d - b_r c)^2 + 1 = 0 \right\},$$
respectively.

A.1.5 Multiple attractor bifurcations (MABs) of the map (11)

To detect multiple attractor bifurcations for the map (11), a straightforward way is to determine the overlapping stability regions of different periodic orbits. For instance, as shown in Fig. 1A, for $c = 0.8$, $d = 0.2$, $b_l = 0.4$, $b_r = 0.5$, the two stability regions $S_{RL}$ and $S_{RL^2}$ overlap in the $(a_l, a_r)$-parameter plane (or, equivalently, in the $(a_{11}, a_{11} + w_{11})$-parameter space of the 2d PLRNN (7)). Their overlapping region, displayed in yellow, reveals the structure of the $(a_l, a_r)$-parameter plane. This helps us to find various MABs. Assuming $h_2 = 0$ and varying $h_1$ from a negative value to a positive one, an MAB of the form
$$O^s_L \;\xrightarrow{\;h_1\;}\; O^s_{RL} + O^s_{RL^2} \quad (61)$$
occurs in the overlapping region. An example of this kind of bifurcation is illustrated in Fig. S3A. Moreover, Fig. S4 indicates that, for $c = 0.9$, $d = 0.3$, $b_l = 0.6$, $b_r = 1.54$, there are nonempty overlapping regions $S_{RL} \cap S_{RL^2}$ and $S_{RL^2} \cap S_R$. This leads to the occurrence of two different MABs, given by (61) and
$$O^s_L \;\xrightarrow{\;h_1\;}\; O^s_R + O^s_{RL^2}, \quad (62)$$
for $h_2 = 0$ and $h_1$ changing from negative to positive values. Both of these bifurcations are shown in Fig. S3A and Fig. S3B, associated with the points $P_1$ and $P_2$ in Fig. S4. Note that in Fig. S4, all the points $P_2$, $P_3$ and $P_4$ belong to the overlapping region $S_{RL^2} \cap S_R$ (in sky blue). These points correspond to the parameter values $c = 0.9$, $d = 0.3$, $b_l = 0.6$, $b_r = 1.54$, $a_r = 1.8$, $h_2 = 0$, and they only differ in the parameter $a_l$. In this case, one can see that, fixing all parameters and changing only $a_l$ from $P_2$ to $P_4$, the basins of attraction change. The corresponding basins of attraction for these three points are shown in Fig. S3B (right) and Fig. S5 for $h_1 = 0.5$ (after the bifurcation).

Figure S3: MAB at $c = 0.9$, $d = 0.3$, $b_l = 0.6$, $b_r = 1.54$, $h_2 = 0$. A) Left: Bifurcation diagram for $a_l = 0.44$ and $a_r = 1.8$, corresponding to the point $P_1$ in Fig. S4. Right: Multistability of the fixed point $O^s_R$ and the 3-cycle $O^s_{RL^2}$ after the bifurcation and their basins of attraction at $h_1 = 0.5$. B) Left: Bifurcation diagram for $a_l = 0.35$ and $a_r = 2.2$, corresponding to the point $P_2$ in Fig. S4. Right: Multistability of the 2-cycle $O^s_{RL}$ and the 3-cycle $O^s_{RL^2}$ after the bifurcation and their basins of attraction at $h_1 = 0.7$.

Figure S4: Analytically calculated stability regions for a different parameter setting than used in Fig. 1. Left: Analytically calculated stability regions $S_R$, $S_{RL}$ and $S_{RL^2}$, shown in green, red and blue, respectively, in the $(a_{11}, a_{11} + w_{11})$-parameter plane for $a_{22} = 0.3$, $w_{21} = 1.54$. The overlapping regions $S_{RL} \cap S_{RL^2}$ and $S_{RL^2} \cap S_R$, representing multi-stable regimes, are given in yellow and sky blue. Right: Bifurcation curves for the same parameter settings as determined by SCYFI. Note that SCYFI identifies additional structure (regions demarcated by gray curves) not included in our analytical derivations.
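Multistability like that in Figs. S3 and S5 can also be visualized directly by iterating map (11) from a grid of initial conditions and labelling each orbit by the attractor it settles onto. A rough basin-scan sketch (our own illustration; parameter values loosely follow the Fig. S3A caption and are not guaranteed to reproduce it):

```python
import numpy as np

def step(z, al, ar, bl, br, c, d, h1, h2):
    """One iteration of the PWL map (11)."""
    a, b = (ar, br) if z[0] > 0 else (al, bl)
    return np.array([a * z[0] + c * z[1] + h1,
                     b * z[0] + d * z[1] + h2])

def attractor_label(z0, pars, n_transient=500, n_record=24, tol=1e-6):
    """Iterate past a transient, then label the orbit by its (approximate) period."""
    z = np.array(z0, float)
    for _ in range(n_transient):
        z = step(z, **pars)
        if not np.all(np.isfinite(z)) or np.linalg.norm(z) > 1e8:
            return -1                          # diverged
    orbit = []
    for _ in range(n_record):
        z = step(z, **pars)
        orbit.append(z.copy())
    for p in range(1, n_record // 2 + 1):      # smallest period that repeats
        if np.linalg.norm(orbit[-1] - orbit[-1 - p]) < tol:
            return p
    return 0                                   # no short period found

pars = dict(al=0.44, ar=1.8, bl=0.6, br=1.54, c=0.9, d=0.3, h1=0.5, h2=0.0)
grid = np.linspace(-3, 3, 61)
basins = np.array([[attractor_label((x, y), pars) for x in grid] for y in grid])
print(np.unique(basins, return_counts=True))   # coexisting period labels indicate an MAB regime
```

Coloring the resulting label array over the $(z_1, z_2)$ grid yields basin-of-attraction pictures of the kind shown in Figs. S3 and S5.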
Figure S5: Multistability of the fixed point $O^s_R$ and the 3-cycle $O^s_{RL^2}$ at $c = 0.9$, $d = 0.3$, $b_l = 0.6$, $b_r = 1.54$, $a_r = 1.8$, $h_2 = 0$ after the bifurcation, and their basins of attraction at $h_1 = 0.5$. Left: $a_l = 0.06$ (point $P_3$ in Fig. S4); right: $a_l = 0.26$ (point $P_4$ in Fig. S4).

A.2 Proofs of theorems

A.2.1 Proof of theorem 1

Proof. Let $L(\theta)$ be some loss function employed for PLRNN training that decomposes in time as $L = \sum_{t=1}^T L_t$. Then
$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^T \frac{\partial L_t}{\partial \theta}, \qquad \frac{\partial L_t}{\partial \theta} = \frac{\partial L_t}{\partial z_t}\,\frac{\partial z_t}{\partial \theta}. \quad (63)$$
Denoting the Jacobian of system (2) at time $t$ by
$$J_t := \frac{\partial F_\theta(z_{t-1})}{\partial z_{t-1}} = \frac{\partial z_t}{\partial z_{t-1}}, \quad (64)$$
we have
$$\frac{\partial z_t}{\partial \theta} = \frac{\partial z_t}{\partial z_{t-1}}\,\frac{\partial z_{t-1}}{\partial \theta} + \frac{\partial^+ z_t}{\partial \theta} = J_t\,\frac{\partial z_{t-1}}{\partial \theta} + \frac{\partial^+ z_t}{\partial \theta}, \quad (65)$$
where $\partial^+$ denotes the immediate partial derivative (see [48] for more details). Assume that $\Gamma_k$ is a $k$-cycle ($k \geq 1$) of (2). Thus, $\Gamma_k$ is a set of temporally successive periodic points
$$P_k := \{z_{t^*_k}, z_{t^*_k - 1}, \dots, z_{t^*_k - (k-1)}\} = \{z_{t^*_k}, F_\theta(z_{t^*_k}), \dots, F^{k-1}_\theta(z_{t^*_k})\}, \quad (66)$$
such that all of them are fixed points of
$$z_{t+k} = F^k_\theta(z_t) = F_\theta(F_\theta(\cdots F_\theta(z_t)\cdots)), \quad (67)$$
and $k$ is the smallest such positive integer (for $k = 1$, $\Gamma_1$ is a fixed point of $F_\theta$). Similar to (65), the tangent vector $\frac{\partial z_{t+k}}{\partial \theta}$ can be computed as
$$\frac{\partial z_{t+k}}{\partial \theta} = \Big(\prod_{r=0}^{k-1} J_{t+k-r}\Big)\frac{\partial z_t}{\partial \theta} + \frac{\partial^+ z_{t+k}}{\partial \theta}. \quad (68)$$
Thus, for $z_{t^*_k} = F^k_\theta(z_{t^*_k})$ we have
$$\frac{\partial z_{t^*_k}}{\partial \theta} = \Big(\prod_{r=0}^{k-1} J_{t^*_k-r}\Big)\frac{\partial z_{t^*_k}}{\partial \theta} + \frac{\partial^+ z_{t^*_k}}{\partial \theta}. \quad (69)$$
Accordingly,
$$\frac{\partial z_{t^*_k}}{\partial \theta} = \frac{\operatorname{adj}\Big(I - \prod_{r=0}^{k-1} J_{t^*_k-r}\Big)}{P_{\prod_{r=0}^{k-1} J_{t^*_k-r}}(1)}\;\frac{\partial^+ z_{t^*_k}}{\partial \theta}, \quad (70)$$
where $P_{\prod_{r=0}^{k-1} J_{t^*_k-r}}(1) = \det\Big(I - \prod_{r=0}^{k-1} J_{t^*_k-r}\Big)$. Moreover, from (63) and (70) we have
$$\frac{\partial L_t}{\partial \theta} = \frac{1}{P_{\prod_{r=0}^{k-1} J_{t^*_k-r}}(1)}\;\frac{\partial L_t}{\partial z_{t^*_k}}\;\operatorname{adj}\Big(I - \prod_{r=0}^{k-1} J_{t^*_k-r}\Big)\,\frac{\partial^+ z_{t^*_k}}{\partial \theta}. \quad (71)$$
Now, suppose that $\Gamma_k$ undergoes a DTB, such that the fixed or cyclic points given by (66) tend to infinity and one of their eigenvalues tends to 1 for some parameter value $\theta = \theta_0$. This implies that $P_{\prod_{r=0}^{k-1} J_{t^*_k-r}}(1)$ becomes zero at $\theta = \theta_0$, and so, due to (70), $\frac{\partial z_{t^*_k}}{\partial \theta}$ goes to infinity. Therefore, the norm of the loss gradient, $\big\|\frac{\partial L_t}{\partial \theta}\big\|$, tends to infinity at $\theta = \theta_0$, which results in an abrupt jump in the loss function.

Let $\{z_{t_1}, z_{t_2}, z_{t_3}, \dots\}$ be an orbit which converges to $\Gamma_k$, i.e.
$$\lim_{n \to \infty} d(z_{t_n}, \Gamma_k) = 0. \quad (72)$$
Then there exists a neighborhood $U$ of $\Gamma_k$ and $k$ sub-sequences $\{z_{t_{km}}\}_{m=1}^{\infty}, \{z_{t_{km+1}}\}_{m=1}^{\infty}, \dots, \{z_{t_{km+(k-1)}}\}_{m=1}^{\infty}$ of the sequence $\{z_{t_n}\}_{n=1}^{\infty}$ such that all these sub-sequences belong to $U$ and
a) $z_{t_{km+s}} = F^k(z_{t_{k(m-1)+s}})$, $s = 0, 1, 2, \dots, k-1$,
b) $\lim_{m \to \infty} z_{t_{km+s}} = z_{t^*_k - s}$, $s = 0, 1, 2, \dots, k-1$,
c) for every $z_{t_n} \in U$ there is some $s \in \{0, 1, 2, \dots, k-1\}$ such that $z_{t_n} \in \{z_{t_{km+s}}\}_{m=1}^{\infty}$.
This implies that for every $z_{t_n} \in U$ with $z_{t_n} \in \{z_{t_{km+s}}\}_{m=1}^{\infty}$, there exists some $\bar n \in \mathbb{N}$ such that $z_{t_n} = z_{t_{k\bar n + s}}$ and $\lim_{\bar n \to \infty} z_{t_{k\bar n + s}} = z_{t^*_k - s}$. Consequently, there exists some $N \in \mathbb{N}$ such that for every $\bar n \geq N$ both $z_{t_{k\bar n + s}}$ and $z_{t^*_k - s}$ belong to the same sub-region, and so the matrices $W_{\Omega(t_{k\bar n + s})}$ and $W_{\Omega(t^*_k - s)}$ ($s \in \{0, 1, 2, \dots, k-1\}$) are identical. Without loss of generality, let $s = 0$. Since $z_{t_{k(\bar n + 1)}} = F^k(z_{t_{k\bar n}})$,
$$\frac{\partial z_{t_{k(\bar n+1)}}}{\partial \theta} = \Big(\prod_{r=0}^{k-1} J_{t^*_k-r}\Big)\frac{\partial z_{t_{k\bar n}}}{\partial \theta} + \frac{\partial^+ z_{t_{k(\bar n+1)}}}{\partial \theta}. \quad (73)$$
On the other hand,
$$\lim_{\bar n \to \infty} \frac{\partial z_{t_{k(\bar n+1)}}}{\partial \theta} = \lim_{\bar n \to \infty} \frac{\partial z_{t_{k\bar n}}}{\partial \theta}, \quad (74)$$
which results in
$$\lim_{\bar n \to \infty} \frac{\partial z_{t_{k(\bar n+1)}}}{\partial \theta} = \Big(I - \prod_{r=0}^{k-1} J_{t^*_k-r}\Big)^{-1} \lim_{\bar n \to \infty} \frac{\partial^+ z_{t_{k(\bar n+1)}}}{\partial \theta} = \frac{\operatorname{adj}\Big(I - \prod_{r=0}^{k-1} J_{t^*_k-r}\Big)}{P_{\prod_{r=0}^{k-1} J_{t^*_k-r}}(1)}\,\lim_{\bar n \to \infty} \frac{\partial^+ z_{t_{k(\bar n+1)}}}{\partial \theta}. \quad (75)$$
This means that, as $\bar n \to \infty$, for any orbit converging to $\Gamma_k$ the norm of the loss gradient tends to infinity at $\theta = \theta_0$, which completes the proof.

A.2.2 Proof of theorem 2

Proof. Let $\Gamma_k$ be a $k$-cycle ($k \geq 1$) of (2) defined by the periodic points (66). Suppose further that $\Gamma_k$ undergoes a BCB for some parameter value $\theta = \theta_0$. Hence, one of its periodic points, e.g. $z_{t^*_k}$, collides with one border. Therefore, $z^m_{t^*_k} = 0$ for some $1 \leq m \leq M$ by the definition of discontinuity boundaries in [38, 39]. Similar to the proof of Theorem 1, for $\theta = A, W$ we have
$$\frac{\partial z_{t^*_k-1}}{\partial \theta} = \frac{\operatorname{adj}\Big(I - \prod_{r=0}^{k-1} J_{t^*_k-1-r}\Big)}{P_{\prod_{r=0}^{k-1} J_{t^*_k-1-r}}(1)}\;\frac{\partial^+ z_{t^*_k-1}}{\partial \theta}, \qquad \frac{\partial^+ z_{t^*_k-1}}{\partial w_{nm}} = \mathbf{1}_{(n,m)}\, D_{\Omega(t^*_k)}\, z_{t^*_k}, \qquad \frac{\partial^+ z_{t^*_k-1}}{\partial a_{mm}} = \mathbf{1}_{(m,m)}\, z_{t^*_k}, \quad (76)$$
where $\mathbf{1}_{(n,m)}$ is an $M \times M$ indicator matrix with a 1 for the $(n,m)$-th entry and 0 everywhere else.
Since $z^m_{t^*_k} = 0$ at $\theta = \theta_0$, due to (76) $\frac{\partial^+ z_{t^*_k-1}}{\partial \theta}$ becomes the zero vector at $\theta = \theta_0$. Consequently, $\frac{\partial z_{t^*_k-1}}{\partial \theta}$, and so $\frac{\partial L_t}{\partial \theta}$, vanishes at $\theta = \theta_0$. Now it can be shown that at $\theta = \theta_0$ the loss gradient goes to zero for every $z_1 \in B_{\Gamma_k}$ (the proof is similar to the last part of the proof of Theorem 1).

A.2.3 Proof of corollary 1

Proof. For $M = 2$, let $h_1 = 0$. Then the DFB curves of the fixed point $\Gamma_1$ coincide with the BCB curves of the 2-cycle $O_{RL}$ of the form
$$F_1 = \xi^1_{RL} = \left\{ (h_1, h_2, a_{11}, a_{22}) \;\middle|\; 1 + a_{11} + a_{22} + a_{11}\,a_{22} = 0 \right\}, \quad (77)$$
$$F_2 = \xi^2_{RL} = \left\{ (h_1, h_2, a_{11}, w_{11}, w_{21}, a_{22}) \;\middle|\; 1 + a_{11} + w_{11} + a_{22} + (a_{11} + w_{11})\,a_{22} = 0 \right\}. \quad (78)$$
For $M > 2$, assume that $\Gamma_1 = \{z^*_1\}$ is a fixed point of the system, i.e.
$$z^*_1 = \big(I - W_{\Omega(t^*-1)}\big)^{-1} h = \frac{\operatorname{adj}\big(I - W_{\Omega(t^*-1)}\big)}{P_{W_{\Omega(t^*-1)}}(1)}\, h, \quad (79)$$
where $P_{W_{\Omega(t^*-1)}}(1) = \det\big(I - W_{\Omega(t^*-1)}\big)$ is the characteristic polynomial of $W_{\Omega(t^*-1)}$ evaluated at 1. Let us denote the first row of the adjoint matrix of $I - W_{\Omega(t^*-1)}$ by $\operatorname{adj}\big(I - W_{\Omega(t^*-1)}\big)_1$. If $\operatorname{adj}\big(I - W_{\Omega(t^*-1)}\big)_1\, h = 0$, then we can analogously demonstrate that the DFB curves of the fixed point align with the BCB curves of the 2-cycles. This implies that, in accordance with Theorem 2, DFBs of fixed points will also lead to vanishing gradients in the loss function.

A.2.4 Convergence of SCYFI

To ensure that SCYFI almost surely converges, we can simply choose the number of random initializations (i.e., $N_{out}$ in algorithm 1) large enough such that every linear subregion will have been sampled with probability almost 1. More precisely, drawing uniformly from the $2^M$ different $D_\Omega$-matrices (linear subregions) for initialization, the probability that a particular subregion has not been drawn after $r$ repetitions is $p = \big(1 - \frac{1}{2^M}\big)^r$. Hence, in order to ensure that all $2^M$ subregions have been visited with probability $1 - \epsilon$, we need
$$r \geq \frac{\ln(\epsilon)}{\ln\!\big(1 - \frac{1}{2^M}\big)}$$
iterations. Choosing $N_{out} = r$, we can thus ensure that SCYFI was initialized in each subregion with probability almost 1, and thus, in the limit, will have probed all subregions for dynamical objects. This argument extends to $k$-cycles by replacing $2^M$ by $2^{kM}$ above (strictly, a more precise bound for $k \geq 2$ is given by $2^{M(k-1)}(2^M - 1) = 2^{Mk} - 2^{M(k-1)}$, due to the fact that the PLRNN (2) is a linear map in each subregion and, hence, cannot have any $k$-cycles with all periodic points in only one subregion).

A.2.5 Proof of theorem 3

Proof. We examine the convergence and scaling behavior of SCYFI for fixed points. A similar argument applies to $k$-cycles with $k > 1$. Let $z^*_1$ be a fixed point of the system, i.e.
$$z^*_1 = \big(I - W_{\Omega(t^*-1)}\big)^{-1} h. \quad (80)$$
$z^*_1$ is a true fixed point iff
$$\big(d_m(t^*-1) - a\big)\, z^*_{m,t^*-1} > 0, \qquad \forall m \in \{1, 2, \dots, M\}, \quad (81)$$
where $D_{\Omega(t^*-1)} = \mathrm{diag}\big(d_1(t^*-1), d_2(t^*-1), \dots, d_M(t^*-1)\big)$ and $0 < a < 1$ is a positive real constant. For examining SCYFI's efficiency, here we focus on two scenarios that impose specific constraints on the parameters $\theta$; other cases remain to be investigated.

Case (I): Let $R$ be a randomly generated matrix with uniformly distributed entries in the interval $[0, 1)$, and $h \neq 0$ be a random vector with all its components being non-negative. For an arbitrary $\epsilon > 0$, we set
$$A = \frac{1}{2 + \|R\| + \epsilon}\,\mathrm{diag}(R), \qquad W = \frac{1}{2 + \|R\| + \epsilon}\,\big(R - \mathrm{diag}(R)\big). \quad (82)$$
Then
$$\|A\| = \frac{\|\mathrm{diag}(R)\|}{2 + \|R\| + \epsilon} < \frac{1}{2 + \|R\| + \epsilon}, \qquad \|W\| = \frac{\|R - \mathrm{diag}(R)\|}{2 + \|R\| + \epsilon} \leq \frac{\|R\| + \|\mathrm{diag}(R)\|}{2 + \|R\| + \epsilon} < \frac{1 + \|R\|}{2 + \|R\| + \epsilon}, \quad (83)$$
and so $\|A\| + \|W\| < 1$. Therefore,
$$\forall t: \quad \|W_{\Omega(t)}\| = \|A + W D_{\Omega(t)}\| \leq \|A\| + \|W\|\,\|D_{\Omega(t)}\| \leq \|A\| + \|W\| < 1, \quad (84)$$
$$\forall t: \quad \rho(W_{\Omega(t)}) \leq \|W_{\Omega(t)}\| < 1. \quad (85)$$
In this case, for any $n \in \mathbb{N}$, we also have
$$\Big\|\prod_{i=1}^{n} W_{\Omega(t_i)}\Big\| \leq \big(\|A\| + \|W\|\big)^n < 1. \quad (86)$$
This ensures the stability of all fixed points and $k$-cycles of the system. According to (85), we have
$$\big(I - W_{\Omega(t-1)}\big)^{-1} = \sum_{n=0}^{\infty} W^n_{\Omega(t-1)} = I + W_{\Omega(t-1)} + W^2_{\Omega(t-1)} + \cdots. \quad (87)$$
Hence, all the elements of $\big(I - W_{\Omega(t-1)}\big)^{-1}$ are positive, and so $z^*_{m,t-1} > 0$ for every $t-1$. This implies that all true and virtual fixed points exist within a single sub-region. Additionally, only one fixed point is true, while all the other fixed points are virtual.

Case (II): Let $h = (h_1, h_2, \dots, h_M)^T$ be a random vector with all $h_m$ uniformly distributed in $(0, 1]$, and
$$\beta_{\min} = \min\{h_m : h_m \in h,\; 1 \leq m \leq M\} > 0, \qquad \beta_{\max} = \max\{h_m : h_m \in h,\; 1 \leq m \leq M\} > 0, \qquad 0 < \beta_{\min} \leq \beta_{\max} \leq 1. \quad (88)$$
Assume further that $R_1$ is a randomly generated matrix with uniformly distributed entries in the interval $(-1, 0]$, and for $M \geq 2$
$$W = \frac{\beta_{\min}}{M + \|R_1\| + \epsilon}\,\big(R_1 - \mathrm{diag}(R_1)\big). \quad (89)$$
Let
$$\alpha_{\max} = \max\{|w_{ij}| : w_{ij} \in W\}, \qquad 0 \leq \alpha_{\max} < \frac{\beta_{\min}}{M + \|R_1\| + \epsilon}, \quad (90)$$
and let $S \subseteq \{1, 2, \dots, M\} = I$ be such that $K = 2^{M - \mathrm{card}(S)} \leq 2^M$. Suppose that $R_2 = \mathrm{diag}(r_1, \dots, r_M)$ is a randomly chosen diagonal matrix with $r_m$ uniformly distributed in $(-1, 1)$ for $m \in I \setminus S$, and the other elements ($m \in S$) uniformly distributed in $(r^* - 1, 0)$, where $r^* = \frac{(M-1)\,\alpha_{\max}\,\beta_{\max}}{\beta_{\min}}$. Since
$$0 \leq \frac{(M-1)\,\alpha_{\max}\,\beta_{\max}}{\beta_{\min}} < \frac{(M-1)\,\beta_{\max}}{M + \|R_1\| + \epsilon} \leq \frac{M-1}{M + \|R_1\| + \epsilon} < \frac{M-1}{M} < 1, \quad (91)$$
we have $-1 < r^* - 1 < 0$. We further set
$$A = \frac{1}{2 + \|R_1\| + \epsilon}\,R_2, \quad (92)$$
so that
$$\|A\| = \frac{\|R_2\|}{2 + \|R_1\| + \epsilon} < \frac{1}{2 + \|R_1\| + \epsilon}, \qquad \|W\| = \frac{\beta_{\min}\,\|R_1 - \mathrm{diag}(R_1)\|}{M + \|R_1\| + \epsilon} \leq \frac{\|R_1\| + \|\mathrm{diag}(R_1)\|}{M + \|R_1\| + \epsilon} < \frac{1 + \|R_1\|}{2 + \|R_1\| + \epsilon}, \quad (93)$$
which implies $\|A\| + \|W\| < 1$. We set $\epsilon > 0$ large enough to satisfy the condition
$$\big(I - W_{\Omega(t-1)}\big)^{-1} = \sum_{n=0}^{\infty} W^n_{\Omega(t-1)} \approx I + W_{\Omega(t-1)} \qquad \forall\, t-1. \quad (94)$$
On the other hand, for any $t$ we have
$$W_{\Omega(t)} = A + W D_{\Omega(t)} = \begin{pmatrix} a_{11} & w_{12}\,d_2(t) & w_{13}\,d_3(t) & \cdots & w_{1M}\,d_M(t) \\ w_{21}\,d_1(t) & a_{22} & w_{23}\,d_3(t) & \cdots & w_{2M}\,d_M(t) \\ w_{31}\,d_1(t) & w_{32}\,d_2(t) & a_{33} & \cdots & w_{3M}\,d_M(t) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{M1}\,d_1(t) & w_{M2}\,d_2(t) & w_{M3}\,d_3(t) & \cdots & a_{MM} \end{pmatrix}, \quad (95)$$
and thus, using (94),
$$z^*_{m,t-1} \approx (1 + a_{mm})\, h_m + \sum_{j \neq m} w_{mj}\, d_j(t-1)\, h_j = (1 + a_{mm})\, h_m - \sum_{j \neq m} |w_{mj}|\, d_j(t-1)\, h_j. \quad (96)$$
Since for every $t-1$
$$\sum_{j \neq m} |w_{mj}|\, d_j(t-1)\, h_j \leq \sum_{j \neq m} |w_{mj}|\, h_j, \quad (97)$$
we obtain
$$z^*_{m,t-1} \gtrsim (1 + a_{mm})\, h_m - \sum_{j \neq m} |w_{mj}|\, h_j, \qquad \forall m \in I. \quad (98)$$
Moreover, $a_{ss} \in (r^* - 1, 0)$ for every $s \in S$, and thus
$$a_{ss} + 1 > r^* = \frac{(M-1)\,\alpha_{\max}\,\beta_{\max}}{\beta_{\min}} \geq \frac{\sum_{j \neq s} \alpha_{\max}\,\beta_{\max}}{\beta_{\min}} \geq \frac{\sum_{j \neq s} |w_{sj}|\, h_j}{h_s}. \quad (99)$$
Therefore, due to (98) and (99), $z^*_{s,t-1} > 0$ for every $t-1$ and $s \in S$. This means that all true and virtual fixed points only exist within a relatively small number of sub-regions, $K = 2^{M - \mathrm{card}(S)} \leq 2^M$.

Given our specific initialization of $\theta$, in both cases (I) and (II) there is a set of $K$ different sub-regions, each associated with a unique $D_{\Omega(t)}$ matrix. We refer to the entire set of these matrices as
$$D_K = \{D_1, \dots, D_K\}. \quad (100)$$
SCYFI, by its definition, only moves within the sub-regions that have virtual and true fixed points, continuing until it discovers a true fixed point (or gets stuck in a virtual cycle). Thus, it can iterate between $J \leq K$ sub-regions
$$D_J = \{D_1, \dots, D_J\} \subseteq D_K, \quad (101)$$
or within the set of virtual fixed points
$$Z_L = \{z_1, \dots, z_L\}. \quad (102)$$
In case (I), there is only one true fixed point. Since all virtual fixed points are located within the same single sub-region, SCYFI's initialization will naturally position it within the correct linear region, requiring no more than 1 iteration. Hence, it needs at most 2 iterations to find the true fixed point. Consequently, SCYFI's scaling is constant.

For case (II), if we suppose that SCYFI follows the virtual/true fixed point structure of the underlying system in these $K$ sub-regions, then for the probability of discovering the fixed point to be close to 1, specifically $1 - \epsilon$, we need
$$N \geq \frac{\ln(\epsilon)}{\ln\!\big(1 - \frac{1}{2^{M - \mathrm{card}(S)}}\big)} = \frac{\ln(\epsilon)}{\ln\!\big(1 - \frac{1}{K}\big)} \quad (103)$$
iterations. Since $1 \leq \mathrm{card}(S) \leq M - 1$, we have $K \geq 2$ and $\ln\!\big(1 - \frac{1}{K}\big) \leq -\frac{1}{K}$. For $\epsilon' \leq \epsilon$, let
$$N^* = \frac{\ln(\epsilon')}{\ln\!\big(1 - \frac{1}{K}\big)} \geq \frac{\ln(\epsilon)}{\ln\!\big(1 - \frac{1}{K}\big)} = N, \qquad N^* = \frac{\ln(\epsilon')}{\ln\!\big(1 - \frac{1}{K}\big)} \leq \ln\!\Big(\frac{1}{\epsilon'}\Big)\, K := c^* K, \quad (104)$$
which implies that the number of iterations is bounded from above. If, for every $M$, we choose $K$ small enough, the upper bound will stay within linear growth.
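For intuition about these counts, a small helper (our own illustration, not part of SCYFI) evaluates both bounds: the number of outer-loop initializations required by the coverage argument of sec. A.2.4, and the linear-in-$K$ iteration bound $c^* K$ from case (II) above.

```python
import math

def n_outer_inits(M: int, eps: float = 1e-3, k: int = 1) -> int:
    """N_out >= ln(eps) / ln(1 - 2^{-kM}) from the coverage argument in sec. A.2.4."""
    return math.ceil(math.log(eps) / math.log(1.0 - 2.0 ** (-k * M)))

def case2_iteration_bound(M: int, card_S: int, eps: float = 1e-3) -> float:
    """Upper bound c* K on the iterations in case (II) of Theorem 3, with K = 2^(M - |S|)."""
    K = 2 ** (M - card_S)
    return math.log(1.0 / eps) * K

for M in (5, 10, 15):
    print(M, n_outer_inits(M), case2_iteration_bound(M, card_S=M - 2))
```

The first quantity grows roughly like $2^{kM}\ln(1/\epsilon)$, whereas the second stays small whenever most latent dimensions belong to the constrained index set $S$.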
A.2.6 Proof of theorem 4

In GTF [24], during training, RNN latent states are replaced by a weighted sum of forward-propagated states $z_t = F_\theta(\tilde z_{t-1})$ and data-inferred states $\hat z_t = G^{-1}_\phi(x_t)$ (obtained by inversion of the decoder model $G_\phi$):
$$\tilde z_t := (1 - \alpha)\, z_t + \alpha\, \hat z_t, \quad (105)$$
where $0 \leq \alpha \leq 1$ is the GTF parameter (usually adaptively regulated in training, see [24]). This leads to the following factorization of Jacobians in PLRNN (2) training:
$$J^{GTF}_t = \frac{\partial \tilde z_t}{\partial \tilde z_{t-1}} = \frac{\partial \tilde z_t}{\partial z_t}\,\frac{\partial z_t}{\partial \tilde z_{t-1}} = (1 - \alpha)\,\frac{\partial F_\theta(\tilde z_{t-1})}{\partial \tilde z_{t-1}} = (1 - \alpha)\, J_t = (1 - \alpha)\, W_{\Omega(t)}. \quad (106)$$

Proof. (i) Since $\|A\| + \|W\| \leq 1$, we have
$$\forall t: \quad \|W_{\Omega(t)}\| = \|A + W D_{\Omega(t)}\| \leq \|A\| + \|W\|\,\|D_{\Omega(t)}\| \leq \|A\| + \|W\| \leq 1, \quad (107)$$
$$\forall t: \quad \rho(W_{\Omega(t)}) \leq \|W_{\Omega(t)}\| \leq 1, \quad (108)$$
where $\rho$ denotes the spectral radius of a matrix. In this case, for any $n \in \mathbb{N}$, we also have
$$\rho\Big(\prod_{i=1}^{n} W_{\Omega(t_i)}\Big) \leq \Big\|\prod_{i=1}^{n} W_{\Omega(t_i)}\Big\| \leq \big(\|A\| + \|W\|\big)^n \leq 1. \quad (109)$$
Now, for any $0 < \alpha < 1$, the product of Jacobians under GTF is
$$\prod_{i=1}^{n} J^{GTF}_{t_i} = (1 - \alpha)^n \prod_{i=1}^{n} J_{t_i} = (1 - \alpha)^n \prod_{i=1}^{n} W_{\Omega(t_i)}, \quad (110)$$
and so
$$\rho\Big(\prod_{i=1}^{n} J^{GTF}_{t_i}\Big) = (1 - \alpha)^n\, \rho\Big(\prod_{i=1}^{n} W_{\Omega(t_i)}\Big) \leq (1 - \alpha)^n < 1. \quad (111)$$
Hence $\rho\big(\prod_{i=1}^{n} J^{GTF}_{t_i}\big) < 1$, which implies that for any $n \in \mathbb{N}$ and $0 < \alpha < 1$, the product $\prod_{i=1}^{n} J^{GTF}_{t_i}$ has no eigenvalue equal to 1, and so no DTB can occur (see the definition of DTB in sect. 3).

(ii) Let $\|A\| + \|W\| = r > 1$. Then for any $n \in \mathbb{N}$ we have
$$\rho\Big(\prod_{i=1}^{n} J_{t_i}\Big) \leq \Big\|\prod_{i=1}^{n} J_{t_i}\Big\| \leq r^n, \quad (112)$$
$$\rho\Big(\prod_{i=1}^{n} J^{GTF}_{t_i}\Big) = (1 - \alpha)^n\, \rho\Big(\prod_{i=1}^{n} W_{\Omega(t_i)}\Big) \leq \big[(1 - \alpha)\, r\big]^n. \quad (113)$$
Since $0 < 1 - \frac{1}{r} < 1$, inserting $1 - \frac{1}{r} < \alpha^* \leq \alpha < 1$ into the r.h.s. of (113) again gives $\rho\big(\prod_{i=1}^{n} J^{GTF}_{t_i}\big) < 1$ for any $n \in \mathbb{N}$, implying that no DTB can occur.

A.3 Additional results

Figure S6: A) Analytically calculated stability regions for a 2-cycle, $S_{RL}$ (red), and a fixed point, $S_R$ (green), for the 1d skew tent map, as defined in the figure, in the parameter plane given by $(a_r, a_l)$. B) Bifurcation diagram along the cross section indicated by the gray line in A, showing a BCB and DFB occurring simultaneously at $a_r = -1$ and a DTB occurring at $a_r = 1$.

Figure S7: Results from SCYFI (blue) versus analytical results (orange) for the 2-cycle in Fig. 1 ($a_{11} + w_{11} = 2$). Eigenvalues (left) and location in state space (right) for one of the cyclic points. This confirms that fixed point locations and eigenvalues computed in closed form and via SCYFI exactly agree, as they should.

A.3.1 Scaling analysis

Although the results presented in Fig. 2 suggest that SCYFI's scaling behavior is much better than theoretically expected, the fact that it is hard to obtain ground-truth comparisons for high-dimensional systems (because of the combinatorial explosion) generally makes an extensive empirical analysis difficult. For Fig. 2 we therefore focused on scenarios for which we can also provide analytical curves for an exhaustive search strategy (eq. (5)), and where we then either examined scaling with cycle order $k$ for rather low-dimensional systems, or where we explicitly embedded fixed points to search for, which allowed us to move to very high dimensionality $M$. In general, we observed that the scaling behavior also depended on the PLRNN's matrix norms and the eigenspectrum of the embedded fixed points, so we constructed different scenarios in which we varied these factors as well. To construct a fairly well-behaved case with low matrix norms, we randomly generated matrices $R$ with uniformly distributed entries in the interval $[-1, 1]$ and then normalized by the maximal eigenvalue: We set PLRNN parameters $A = \frac{1}{\lambda_{\max}}\,\mathrm{diag}(R)$ and $W = \frac{1}{\lambda_{\max}}\,\big(R - \mathrm{diag}(R)\big)$, and chose $h$ uniformly in the interval $[-50, 50]$.
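A minimal sketch of this construction (our own; we read "maximal eigenvalue" as the spectral radius of $R$, which is one plausible interpretation):

```python
import numpy as np

def random_low_norm_plrnn(M: int, seed: int = 0):
    """Random PLRNN (A diagonal, W off-diagonal) with modest matrix norms,
    obtained by rescaling with the spectral radius of R; rough reading of the
    low-norm scenario used in the scaling analysis."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(-1.0, 1.0, size=(M, M))
    lam_max = np.max(np.abs(np.linalg.eigvals(R)))    # spectral radius of R
    A = np.diag(np.diag(R)) / lam_max
    W = (R - np.diag(np.diag(R))) / lam_max
    h = rng.uniform(-50.0, 50.0, size=M)
    return A, W, h

A, W, h = random_low_norm_plrnn(M=20)
print(np.linalg.norm(A, 2) + np.linalg.norm(W, 2))    # typically of order 1
```

Systems drawn this way tend to satisfy the low-matrix-norm regime in which SCYFI converged fastest in our experiments.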
For each of 10 different such systems, we fixed the number of outer and inner loops ($N_{out}$, $N_{in}$ in Algorithm 1) such that a fixed point would be detected in at least 50/75 independent runs of the algorithm, and then determined the total number $n$ of linear regions (i.e., across all $N_{out}$ different initializations) the algorithm needed to cycle through to detect a stable fixed point. We also ensured that across all different runs this stable fixed point would be the same, in accordance with our assumptions. The resulting scaling behavior was well fitted by a doubly-logarithmic curve of the form $c_1 \ln(\ln(M)) + c_2$ ($R^2 \approx 0.913$, $p < 10^{-4}$). This low-matrix-norm scenario with a stable fixed point may be seen as a kind of lower bound on the scaling.

To embed a specific fixed point $z^*$, we again start with a matrix $R$ as described above and take $A = \mathrm{diag}(R)$ and $W = R - \mathrm{diag}(R)$. We then minimize
$$\min_{A, W, h}\; \big\| z^* - \big((A + W D_{\Omega(t^*)})\, z^* + h\big) \big\\|, \quad (114)$$
subject to $A$ staying diagonal and $W$ off-diagonal (we observed that adding a small Gaussian noise term to the right appearance of $z^*$ in eq. (114), which decayed proportionally to the learning rate, improved numerical stability of the optimization). PLRNNs constructed this way generally have several fixed points, but to compute $n$ we only search for the inserted fixed point $z^*$ (making eq. (5) directly applicable). This way we produced 5–10 systems, initializing $R$ with values in $[-0.2, 0.2]$ (orange curve in Fig. 2B) or $[-1.0, 1.0]$ (blue curve in Fig. 2B), thus effectively restricting the eigenspectrum of the fixed point as well as the matrix norms of the PLRNN to a certain range. However, since matrix norms may change during the optimization, eq. (114), our procedure is not strictly guaranteed to produce eigenspectra and matrix norms within a desired range, which is crucial especially for the first scenario, where we wanted to keep norms within a typical range (see below). So here, to ensure consistency among the drawn systems and with our assumptions, the mean absolute eigenvalue of the embedded fixed points was kept close to $0.31 \pm 0.05$ and the mean maximum absolute eigenvalue close to $1.25 \pm 0.13$. For $> 75\%$ of the resulting systems, spectral matrix norms were within the range $[1.0, 3.0]$. While this produced matrix spectra typical for trained PLRNNs ($> 95\%$ out of 361 PLRNNs trained on various benchmarks and data sets had spectral matrix norms within $[1.0, 3.0]$), the second initialization range resulted in unnaturally large matrix norms and hence may be seen as providing a kind of upper bound on SCYFI's scaling behavior. Fig. S8 shows the best-case (left; purple curve in Fig. 2B) and typical (right; orange curve in Fig. 2B) scaling scenarios on a linear scale, to better expose the scaling behavior and function fits.

Figure S8: Zoom-ins on linear axes of the scenarios with doubly-logarithmic (left; $R^2 \approx 0.913$, $p < 10^{-4}$) and quadratic (right; $R^2 \approx 0.925$, $p < 10^{-5}$) scaling behaviors from Fig. 2B.

Figure S9: A) Initializing SCYFI in a wide array of different subregions (different colors), it quickly converges within just a few iterations to the same set of linear subregions which contain the dynamical objects of interest (fixed points in this case). B) The number of different subregions explored by SCYFI when started from different initializations shrinks exponentially fast with the number of iterations. Shown are means ($\pm$ stdv) from 10 different systems with $M = 10$.
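The embedding objective (114) can be minimized with a few lines of projected gradient descent, keeping $A$ diagonal and $W$ off-diagonal by masking the updates. The sketch below is our own simplified version (plain squared error, fixed learning rate, no noise annealing) and is meant only to convey the idea, not to reproduce the exact procedure used for Fig. 2.

```python
import numpy as np

def embed_fixed_point(z_star, n_steps=5000, lr=1e-2, seed=0):
    """Gradient-descent sketch of eq. (114): adjust (A, W, h) so that z* becomes
    a fixed point of the PLRNN, with A kept diagonal and W off-diagonal."""
    rng = np.random.default_rng(seed)
    M = len(z_star)
    R = rng.uniform(-0.2, 0.2, size=(M, M))
    A = np.diag(np.diag(R))
    W = R - np.diag(np.diag(R))
    h = rng.normal(size=M)
    D = np.diag((z_star > 0).astype(float))        # region of z*, held fixed

    diag_mask = np.eye(M)
    off_mask = 1.0 - diag_mask
    for _ in range(n_steps):
        resid = z_star - ((A + W @ D) @ z_star + h)    # residual of eq. (114)
        # gradients of 0.5 * ||resid||^2, projected onto the structure constraints
        gA = -np.outer(resid, z_star) * diag_mask
        gW = -np.outer(resid, D @ z_star) * off_mask
        gh = -resid
        A -= lr * gA
        W -= lr * gW
        h -= lr * gh
    return A, W, h, np.linalg.norm(resid)

A, W, h, err = embed_fixed_point(np.array([1.0, -0.5, 2.0, -1.2]))
print("residual after optimization:", err)
```

Because the objective is a linear least-squares problem in $(A, W, h)$, the masked gradient steps converge reliably; the only nontrivial choice is holding the region indicator $D_{\Omega(t^*)}$ of the target point fixed during optimization.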
A.3.2 Loss jumps & bifurcations in PLRNN training on biophysical model simulations

Here we provide an additional illustration of how SCYFI can be used to dissect bifurcations in model training. For this, we produced time series of the membrane voltage and a gating variable from a biophysical neuron model [17], on which we trained a dendPLRNN [10] using BPTT [52] with sparse teacher forcing (STF) [37]. The dendPLRNN used ($M = 9$ latent states, $B = 2$ bases) has $2^{18}$ different linear subregions and $|\theta| = 124$ parameters. Fig. S10A gives the MSE loss as a function of training epoch (i.e., single SGD updates). The loss curve exhibits several steep jumps. Zooming into these points and examining the transitions in parameter space using SCYFI, we find they are indeed produced by bifurcations, with an example given in Fig. S10B. As we had done for Fig. 4 in the main text, since the state and parameter spaces are very high-dimensional, for the bifurcation diagram in Fig. S10B all extracted $k$-cycles ($k \geq 1$), including fixed points, were projected onto a line given by the PCA-derived maximum eigenvalue component and plotted as a function of training epoch. For the example in Fig. S10B, we found that a BCB (Theorem 2) underlies the transition in the qualitative dynamics of the PLRNN as training progresses. Figs. S10C and S10D illustrate the dendPLRNN dynamics just before (C) and right after (D) the bifurcation point highlighted in Fig. S10B, together with time series from the true system. More generally, whether a bifurcation associated with vanishing gradients produces a loss jump depends on the system's dynamics before and after the bifurcation point. In the case of BCBs, one possible scenario involves a change in stability, as illustrated in Fig. S10: during a BCB, a stable fixed point (or cycle) can lose stability as it passes through the bifurcation point. The maximum Lyapunov exponent of an unstable fixed point (or cycle) is positive, resulting in exploding gradients right after the bifurcation point [37], and consequently in a very steep slope of the loss function near the bifurcation point, as in Fig. S10.

Figure S10: A) Loss across training epochs for a dendPLRNN ($M = 9$ states, $B = 2$ bases) trained on a biophysical neuron model in a limit cycle (spiking) regime. Red dots indicate training epochs just before and after a loss jump for which time graphs are given in C and D. B) Bifurcation diagram of the dendPLRNN as a function of training epoch, with all state space locations of stable (filled circles) and unstable (open circles) objects projected onto the first principal component. The loss jump in A is accompanied by a bifurcation from fixed point to cyclic behavior. C) Time series of the voltage variable ($x_1$) of the biophysical model (gray) and that predicted by the dendPLRNN (black) before the bifurcation event indicated in B. D) Same, directly after the bifurcation event.

Figure S11: Loss jump induced by a degenerate flip bifurcation (DFB). A) Loss during a training run of a PLRNN ($M = 5$) on a 2-cycle. The gray line indicates a loss jump corresponding to a DFB and a simultaneously occurring border collision bifurcation (BCB). B) Bifurcation diagram of the PLRNN, with the DFB and BCB leading to the destruction of the fixed point and the emergence of a 2-cycle, as indicated by the gray line.

A.3.3 Dealing with bifurcations in RNN training

Here are some additional thoughts on how RNN training algorithms could be modified to deal with bifurcations.
If the algorithm finds itself during training in a parameter regime which does not exhibit the right topological structure, it does not make sense to dwell further within that regime, or possibly anywhere within the vicinity of the current parameter estimate. Unlike standard SGD, the algorithm should therefore perhaps take large leaps in parameter space as soon as it gets stuck in a non-suitable dynamical regime. One possibility to implement this is through a look-ahead mechanism that probes for topological properties of regions not visited so far. While fully fleshing out this idea is beyond this paper, a proof of concept that this may speed up convergence is provided in Fig. S12. Along similar lines, if we knew the model's full bifurcation structure in parameter space ahead of time, we could simply pick a parameter set that corresponds to the right dynamics, describing the patterns in the data best. While of course it will in general not be feasible to chart the whole bifurcation structure before training (this is in a sense the whole point of a training algorithm), it may be possible to design smart initialization procedures based on this insight, e.g. probing topological regimes at randomly selected points in parameter space before starting training and initializing with parameters that produce a desired type of dynamics (e.g., cyclic behavior) to begin with.

Figure S12: A) Example loss curves for RNNs trained on electrophysiological recordings by BPTT without (blue) vs. with (black) look-ahead (the look-ahead function checks whether there would be a bifurcation away from a stable fixed point when taking 10$\times$ the current gradient step). Dashed yellow line indicates the epoch at which the look-ahead step was executed. B) Average across 6 loss curves of RNNs trained without (blue) vs. with (black) look-ahead. Error bands = SEM.
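A conceptual sketch of such a look-ahead check (our own illustration; the paper's actual look-ahead function and the SCYFI implementation are not reproduced, and the destabilization criterion used here is a hypothetical stand-in): before committing to an update, probe a scaled version of the gradient step and test whether a currently stable, admissible fixed point of the PLRNN would lose stability or leave its linear subregion at the probed parameters.

```python
import numpy as np

def fixed_point_in_region(A, W, h, d):
    """Candidate fixed point of the PLRNN inside the subregion with indicator d."""
    W_omega = A + W * d[None, :]
    z = np.linalg.solve(np.eye(len(h)) - W_omega, h)
    admissible = np.all((z > 0) == (d > 0.5))       # z must lie in the assumed region
    stable = np.max(np.abs(np.linalg.eigvals(W_omega))) < 1
    return z, admissible, stable

def lookahead_flags_bifurcation(params, grads, d, lr, scale=10.0):
    """True if scaling the gradient step by `scale` would destroy or destabilize
    the stable fixed point currently present in subregion d (hypothetical criterion)."""
    A, W, h = params
    gA, gW, gh = grads
    _, adm_now, stab_now = fixed_point_in_region(A, W, h, d)
    A2, W2, h2 = A - scale * lr * gA, W - scale * lr * gW, h - scale * lr * gh
    _, adm_probe, stab_probe = fixed_point_in_region(A2, W2, h2, d)
    return (adm_now and stab_now) and not (adm_probe and stab_probe)

# toy usage with random parameters and gradients (illustrative only)
rng = np.random.default_rng(1)
M = 3
A = np.diag(rng.uniform(0, 0.8, M)); W = rng.normal(0, 0.1, (M, M)); np.fill_diagonal(W, 0)
h = rng.normal(size=M); d = (rng.normal(size=M) > 0).astype(float)
grads = tuple(rng.normal(0, 0.1, x.shape) for x in (A, W, h))
print(lookahead_flags_bifurcation((A, W, h), grads, d, lr=1e-2))
```

A training loop could use such a flag to trigger a larger jump in parameter space, or to temporarily shrink the step size, whenever the probed update would cross a bifurcation boundary.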