Out-of-Domain Generalization in Dynamical Systems Reconstruction

Niclas Göring 1 2, Florian Hess 1 2, Manuel Brenner 1 2, Zahra Monfared 1, Daniel Durstewitz 1 2 3

Abstract
In science we are interested in finding the governing equations, the dynamical rules, underlying empirical phenomena. While traditionally scientific models are derived through cycles of human insight and experimentation, recently deep learning (DL) techniques have been advanced to reconstruct dynamical systems (DS) directly from time series data. State-of-the-art dynamical systems reconstruction (DSR) methods show promise in capturing invariant and long-term properties of observed DS, but their ability to generalize to unobserved domains remains an open challenge. Yet, this is a crucial property we would expect from any viable scientific theory. In this work, we provide a formal framework that addresses generalization in DSR. We explain why and how out-of-domain (OOD) generalization (OODG) in DSR profoundly differs from OODG considered elsewhere in machine learning. We introduce mathematical notions based on topological concepts and ergodic theory to formalize the idea of learnability of a DSR model. We formally prove that black-box DL techniques, without adequate structural priors, generally will not be able to learn a generalizing DSR model. We also show this empirically, considering major classes of DSR algorithms proposed so far, and illustrate where and why they fail to generalize across the whole state space. Our study provides the first comprehensive mathematical treatment of OODG in DSR, and gives a deeper conceptual understanding of where the fundamental problems in OODG lie and how they could possibly be addressed in practice.

1 Department of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany; 2 Faculty of Physics and Astronomy, Heidelberg University, Heidelberg, Germany; 3 Interdisciplinary Center for Scientific Computing, Heidelberg University, Heidelberg, Germany. Correspondence to: Niclas Göring, Daniel Durstewitz.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction
The majority of complex systems we encounter in physics, biology, the social sciences, and beyond can mathematically be described as systems of differential equations, whose behavior is the subject of dynamical systems theory (DST). Deriving accurate mathematical models of natural (or engineered) systems from observations, for mechanistic insight, scientific understanding, and prediction, is at the core of any scientific discipline. Recent years have seen a plethora of advances in the field of DS reconstruction (DSR), mostly based on deep learning (DL) approaches, for inferring DS models directly from time series data and thus partly automatizing the scientific model-building process (Brunton et al., 2016; Raissi et al., 2018; Vlachas et al., 2018; Platt et al., 2021; Brenner et al., 2022; Vlachas et al., 2022; Hess et al., 2023). Like any good scientific theory, a proper DS model inferred from data should be able to generalize to novel domains (dynamical regimes) not observed during training. Here we develop a principled mathematical framework for out-of-domain (OOD) generalization (OODG) in DSR.
We mathematically and numerically demonstrate that current data-driven SOTA methods for DSR hit fundamental limits regarding OODG, and provide some directions for how these could potentially be addressed.

Current state of DSR
Current DSR models attempt to either approximate the underlying system's vector field (Brunton et al., 2016), or to directly learn the flow (solution) operator of the data-generating DS (Lu et al., 2019; Vlachas et al., 2020; Li et al., 2020; Brenner et al., 2022; Hess et al., 2023; Chen & Wu, 2023). More specifically, DSR methods have been developed based on symbolic regression (Brunton et al., 2016; d'Ascoli et al., 2023), on various forms of recurrent neural networks (RNNs) equipped with special training algorithms (Vlachas et al., 2018; Brenner et al., 2022; Hess et al., 2023), on ordinary or partial differential equation (ODE/PDE)-based DL models such as Neural ODEs (NODE; Chen et al. (2018); Ko et al. (2023)), on operator theory (Lu et al., 2019; Li et al., 2020; Chen & Wu, 2023), or on reservoir computing (RC; Pathak et al. (2017); Verzelli et al. (2021); Platt et al. (2022; 2023)). State-of-the-art (SOTA) methods (Brenner et al., 2022; Platt et al., 2023; Hess et al., 2023; Jiang et al., 2023) can generalize beyond the observed time horizon and capture an underlying system's dynamical invariants and long-term properties, like the geometry of an attractor trajectories are converging to (Fig. A1), while providing accurate short-term forecasts on in-distribution test data.

[Figure 1 schematic: ground-truth systems ẋ = f(x) (multistable, with two attractors: an equilibrium point and a limit cycle) and ẏ = g(y) (monostable, with one limit-cycle attractor); training data consist of trajectories observed from only one basin; the DSR algorithm is asked under what conditions reconstruction succeeds on the unobserved basin (out-of-domain) vs. within the observed basin (in-distribution).] Figure 1. In-distribution generalization within one basin (right; van der Pol oscillator) vs. OODG across basins (left; neuron model with a limit cycle corresponding to spiking activity and an equilibrium point corresponding to the resting potential).

However, the current field of DSR has overwhelmingly focused on synthetic benchmark systems, e.g. given by low-order polynomial ODE and PDE systems, mostly in regimes where either only one (globally) attracting object exists in state space, as in the chaotic Lorenz system for common parameter settings (Lorenz, 1963), or where at least reconstruction within just one dynamical regime was sought. Only a small number of studies considered experimental data, and even fewer consider systems which may harbor multiple attractor objects simultaneously, so-called multistability (Fig. 1, left; for a detailed overview of current benchmarks in use, see Appx. A).

An unresolved challenge in DSR: generalization to unobserved dynamical regimes
While current SOTA methods for DSR may generalize to nearby initial conditions close to the domain covered by the training data, which ultimately converge into the same limit set, at present their ability to generalize to unobserved regions of state space is either not given or remains unexplored (Fig. 1).
Generalization across the whole state space of the DS, or scientifically relevant portions of it, is, however, a feature any sound scientific theory should possess. The OODG problem is exacerbated in the presence of multistability, i.e., if multiple dynamical objects coexist in the same DS. In fact, even the simplest examples of low-dimensional nonlinear DS, like a damped-driven pendulum, often have multistable regimes. The problem is also of high practical relevance, as most complex DS encountered in nature and society are likely extensively multistable, with examples ranging from neuroscience (Schiff et al., 1994; Durstewitz et al., 2000; Izhikevich, 2007; Khona & Fiete, 2022), optics (Brambilla et al., 1991), chemistry (Ngonghala et al., 2011), biology (Dubinkina et al., 2019), and ecology (Mumby et al., 2007), to financial markets (Cavalli & Naimzada, 2016) and climate science (Yoden, 1997). For such DS, crossing the boundaries between different dynamical regimes, as induced by noise or external inputs, may lead to qualitatively completely different behavior (Fig. 1, left; Fig. A2).

Current approaches toward OODG in DSR
Several recent studies at least partially or implicitly address the question of OODG and multistability in DSR. For instance, in Ghadami & Epureanu (2018); Patel & Ott (2022); Bury et al. (2023), the authors attempt to anticipate tipping points in non-autonomous DS and to predict the post-tipping-point dynamics. A related topic is the forecasting of extreme events (Farazmand & Sapsis, 2018; Guth & Sapsis, 2019; Qi & Majda, 2020), which are events that are not, or only very sparsely, represented in the training data, thus constituting a form of generalization. Others consider learning DS across multiple environments defined by different parameter settings (Yin et al., 2022a; Bereska & Gavves, 2022). However, essentially all this work implicitly or explicitly assumed that some observations from the domain on which generalization is sought are available (i.e., reflecting more a form of transfer learning than true OODG; Kirchmeyer et al. (2022); Yin et al. (2022a)). Another strategy to enable better generalization is to include physical domain knowledge in the model formulation or loss function, as in physics-informed neural networks (PINNs) (Raissi et al., 2019; Mehta et al., 2021; Subramanian & Mahadevan, 2023; Mouli et al., 2023), or to directly supply ground-truth parameters of the DS studied (Fotiadis et al., 2023). This, again, assumes we already have prior knowledge about the domain to which we would like to generalize. Promoting data-driven DSR models to viable scientific theories of complex systems requires a thorough understanding of whether, how, and when reconstructions generalize to the entire state space in the common empirical scenario where the measurements sample only a limited portion of that space (Fig. 1, left). This leads us to our central research question: what are the precise mathematical conditions that allow for successful reconstruction of a DS on the whole state space?

2. Dynamical Systems Background

2.1. Dynamical Systems
A DS is generally comprised of a state space M ⊆ R^n, a set of times T ⊆ R, and an evolution law. Here we focus on continuous-time systems described by ODEs

\dot{x} = f(x), \quad x \in M \subseteq \mathbb{R}^n,   (1)

where f ∈ X^1(M) is a vector field (VF) from the set of functions with continuous first derivative on the (compact, metric, measurable) state space M.
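To make the setting of Eq. (1) concrete, the following minimal Python sketch (illustrative only, not code from the paper) integrates a simple bistable VF, an unforced, damped Duffing oscillator, and shows how initial conditions from different basins converge to different attractors; the damping value and initial conditions are arbitrary illustrative choices, and the paper's own Duffing parameterization (its Eq. (29)) may differ.

```python
# Minimal sketch (not the paper's code): simulate a bistable vector field of the
# form of Eq. (1) and observe that different initial conditions can end up on
# different attractors. We use an unforced, damped Duffing oscillator,
#   x' = v,  v' = x - x^3 - delta*v,
# which has two stable equilibria at (+1, 0) and (-1, 0); delta is an
# illustrative choice, not the parameterization used in the paper.
import numpy as np
from scipy.integrate import solve_ivp

DELTA = 0.3  # damping (illustrative)

def duffing(t, state):
    x, v = state
    return [v, x - x**3 - DELTA * v]

def final_state(x0, v0, T=200.0):
    sol = solve_ivp(duffing, (0.0, T), [x0, v0], rtol=1e-9, atol=1e-12)
    return sol.y[:, -1]

for ic in [(0.5, 0.0), (-0.5, 0.0), (0.1, 1.0)]:
    xT, vT = final_state(*ic)
    print(f"x0={ic} -> converged near ({xT:+.3f}, {vT:+.3f})")
# Initial conditions lying in different basins converge to different equilibria,
# even though the governing vector field f is one and the same.
```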
The VF gives rise to the evolution operator Φ : T × M → M that maps some initial condition x_0 to the state x_t at time t (Kuznetsov, 1998):

x_t = \Phi(t, x_0).   (2)

2.2. Dynamical Systems Reconstruction (DSR)
In data-driven DSR, the aim is to infer from time series observations a generative model of the true underlying system, approximating either its vector field f ∈ X^1(M) or its evolution operator Φ(t, x) (hence its governing equations), given an inference algorithm from a hypothesis class H. This goes beyond mere time series forecasting, in that we require the model to also capture dynamical invariants, that is, long-term statistics and topological properties of the underlying system. Thus, after training, a DSR model should ideally be topologically conjugate to the true system and capable of producing trajectories with the same temporal and geometrical structure as those of the true system (Fig. A1; Platt et al. (2022; 2023); Hess et al. (2023)). Note that this subsumes various more specific goals one may have in time series modeling and DS analysis. For instance, a proper DSR model should also provide excellent time series predictions, while, vice versa, a model optimized for time series prediction would not necessarily reproduce invariant statistics and the geometry and topology of a system's attractors.

2.3. Measure-Theoretic Aspects
Measure-theoretical approaches investigate the long-term statistical properties of DS, the subject of ergodic theory (Eckmann & Ruelle, 1985). Specifically, there is a stable statistical property called the (average) occupation measure, defined as

\mu_{x_0,T}(B) = \frac{1}{T} \int_0^T \mathbb{1}_B(x(s)) \, ds,   (3)

where x(t) is a trajectory with t ∈ [0, T], starting from x_0, B ⊆ R^n is some Borel measurable set, and 1_B denotes an indicator function that maps elements of the set B to one and all other elements to zero. Intuitively, µ_{x_0,T}(B) measures the fraction of time the trajectory x(t) spends in the set B.

2.4. Topological Aspects and Attractors
Another way to capture the long-term behavior of a DS is by studying special sets, so-called invariant sets, which describe the anatomy of a DS. A set U ⊆ M is called invariant under the flow if Φ(t, U) ⊆ U for all t. We can associate any point in state space with a special invariant set capturing its long-term behavior, the so-called ω-limit set

\omega(x, \Phi) = \bigcap_{s > 0} \overline{\{\Phi(t, x) \mid t > s\}},   (4)

where the overline denotes closure. If there is a set of points in state space with non-zero measure that have the same ω-limit set, it is called an attractor. More formally:

Definition 2.1. An attractor (Milnor, 1985; Perko, 1991) is a closed invariant set A ⊆ M such that there exists an open and forward-invariant set B(A) = {x ∈ M | ω(x, Φ) ⊆ A}, called the basin of attraction, with ω(B(A)) = A. We further require that A is minimal (i.e., there is no proper subset with that same property).

Stable equilibrium points, limit cycles, and chaotic ("strange") sets are examples of attractors with increasingly complicated topology (see Fig. A3).
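As a small numerical companion to Eq. (3) (again only an illustrative sketch, not the paper's implementation), the average occupation measure of a sampled trajectory can be approximated by the fraction of trajectory samples falling into a given Borel set, here an axis-aligned box; the box boundaries and trajectory settings are arbitrary choices.

```python
# Minimal sketch (illustrative, not from the paper): estimate the average
# occupation measure of Eq. (3) for a sampled trajectory by replacing the time
# integral with a sum over equally spaced samples, mu(B) ~ (1/N) * #{x_k in B}.
import numpy as np
from scipy.integrate import solve_ivp

def duffing(t, state, delta=0.3):
    x, v = state
    return [v, x - x**3 - delta * v]

def occupation_measure(traj, box):
    """Fraction of trajectory samples inside an axis-aligned box.
    traj: (N, n) array of states; box: list of (low, high) per dimension."""
    inside = np.ones(len(traj), dtype=bool)
    for dim, (lo, hi) in enumerate(box):
        inside &= (traj[:, dim] >= lo) & (traj[:, dim] <= hi)
    return inside.mean()

t_eval = np.linspace(0.0, 200.0, 20_000)
sol = solve_ivp(duffing, (0.0, 200.0), [0.5, 0.0], t_eval=t_eval, rtol=1e-9)
traj = sol.y.T  # shape (N, 2)

# Time spent near the right-hand equilibrium (+1, 0) vs. the left-hand one (-1, 0):
print("mu(box around (+1,0)):", occupation_measure(traj, [(0.5, 1.5), (-0.5, 0.5)]))
print("mu(box around (-1,0)):", occupation_measure(traj, [(-1.5, -0.5), (-0.5, 0.5)]))
# A trajectory started in the right basin concentrates its occupation measure
# around the attractor it converges to; the left box receives (almost) no mass.
```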
3. A General Framework for OODG in DSR

3.1. Multistability Induces Distribution Shifts
In statistical learning theory, out-of-sample generalization, and more importantly here OODG, is already quite well studied (for a detailed treatment, see Appx. C.1, Ben-David et al. (2010)). Generally, one assumes to have observed a set of training domains E on which the data is distributed according to some domain-specific distribution p_e, e ∈ E. The goal is to learn a function f̂ ∈ F in hypothesis class F that yields minimum risk

R_{e_{test}}(\hat{f}) := \mathbb{E}_{(x,y) \sim p_{e_{test}}}\big[\ell(\hat{f}(x), y)\big]

on an unseen test domain e_test ∉ E, where the data is distributed according to p_{e_test}. Since p_{e_test} is unknown, one estimates the respective prediction error by taking the average loss across all empirically accessible domains as empirical risk:

R^E_{emp}(\hat{f}) = \frac{1}{|E|} \sum_{e \in E} \frac{1}{N_e} \sum_{i=1}^{N_e} \ell\big(\hat{f}(x^e_i), y^e_i\big).   (5)

Yin et al. (2022b) extend this definition to time series data:

R^E_{emp}(\Phi_R) = \frac{1}{|E|} \sum_{e \in E} \frac{1}{T_e} \sum_{t=1}^{T_e} \ell\big(\Phi_R(t, x^e_0), x^e_t\big),   (6)

where |E| is the number of domains and T_e the time series length.^1 Yin et al. (2022b) mainly associate these domains with different parameter settings of the ground-truth system.

^1 For chaotic systems, both stationary and non-stationary probability distributions are possible (Parthasarathy & Rajasekar, 1998).

For multistable DS, however, the different domains obtain a very natural interpretation. A DS is called multistable if it has at least two attractors coexisting in its state space (Fig. 1, left). In this study, we generally assume that multistable systems allow for a decomposition of their state space into n disjoint basins of attraction (Milnor, 1985):

M = \Big( \bigcup_{e=1}^{n} B(A_e) \Big) \cup \tilde{M} \quad \text{such that} \quad \mu(\tilde{M}) = 0,   (7)

where µ is the Lebesgue measure. It is thus natural to define the domains in Eq. (6) as the different basins of attraction, each of which belongs to a different attractor. Different attractors generally give rise to different long-term dynamics with different topology (Fig. 1, left; Fig. A2), governed by different physical measures (see Appx. B.1). Hence, in each basin the trajectories follow a different statistical law (regarding their long-term evolution), implying that the challenge of reconstructing multistable DS is ultimately the same as that of understanding OODG in DSR. In monostable DS, in contrast, each trajectory in the state space is governed by the same long-term statistics (Fig. 1, right). Hence, one (sufficiently long) trajectory is already enough to specify the dynamics on the attractor. Accordingly, generalization for monostable systems essentially comes down to classical in-distribution (out-of-sample) generalization. Note that this is also true for chaotic attractors. Indeed, as illustrated in Fig. A1 (and amply demonstrated in the literature, e.g. Mikhaeil et al. (2022); Brenner et al. (2022); Hess et al. (2023)), current SOTA methods fare very well on even short trajectories from complex chaotic systems, as long as these are monostable (or sufficient information from all basins of attraction is available, see Fig. A7). Yet, so far the DSR field has overwhelmingly focused on just monostable hyperbolic attractor systems as benchmarks (Appx. A), making OODG for DSR an essentially unstudied problem.

3.2. OODG Error
The most commonly employed loss function ℓ is the mean-squared error (MSE), derived from the maximum likelihood principle assuming i.i.d. Gaussian model residuals. While this is still the default when it comes to training RNNs, NODEs or RCs, it cannot be used to assess the reconstruction, for three reasons: 1) The MSE breaks down as a suitable test loss in Eq. (6) because of exponential trajectory divergence in chaotic DS (Wood, 2010; Koppe et al., 2019). Even for a perfectly reconstructed DS, numerical uncertainties in the initial conditions or small amounts of noise quickly lead to large prediction errors.
This is accounted for in training methods like generalized teacher forcing (Hess et al., 2023); 2) The MSE does not capture any long-term, invariant, or topological properties of the DS and its reconstruction. As discussed in Sect. 2.3 & 2.4, these are the central mathematical tools to study DS; 3) The MSE is not guaranteed to be sensitive to multistability, yet this property of a measure is much needed in light of Sect. 3.1. We thus propose a novel way of assessing generalization across state space by defining a statistical and a topological error that are provably sensitive to multistability (Theorem 3.3).

Statistical error
As the MSE is not a useful quantity for comparing (chaotic) trajectories, we define the statistical error through the sliced Wasserstein-1 distance (SW1; Bonneel et al. (2015)) between the occupation measures µ^Φ_{x,T} of the ground-truth DS Φ and µ^{Φ_R}_{x,T} of the DSR model Φ_R:

SW_1\big(\mu^{\Phi}_{x,T}, \mu^{\Phi_R}_{x,T}\big) = \mathbb{E}_{\xi \sim U(S^{n-1})}\Big[ W_1\big(g_{\xi\#}\, \mu^{\Phi}_{x,T}, \; g_{\xi\#}\, \mu^{\Phi_R}_{x,T}\big) \Big],   (8)

where S^{n-1} := {ξ ∈ R^n | ||ξ||_2^2 = 1} is the unit hyper-sphere, g_{ξ#} µ denotes the pushforward of µ, g_ξ(x) = ξ^T x is the one-dimensional slice projection, and W_1 the Wasserstein-1 distance (Villani et al., 2009). Since the expectation in Eq. (8) is intractable, it is commonly approximated by Monte Carlo sampling (see Appx. C.2 for computational details).

Definition 3.1. The statistical error E_stat is defined as

E^U_{stat}(\Phi_R) := \int_{U \subseteq M} SW_1\big(\mu^{\Phi}_{x,T}, \mu^{\Phi_R}_{x,T}\big) \, dx,   (9)

which integrates across initial conditions from a subset U of state space.

Topological error
An important concept to assess whether two DS agree in their topology is topological equivalence. Two DS are called topologically equivalent if there exists a homeomorphism between the two systems' orbits preserving the direction of flow (see Appx. B.3 for more details). However, this homeomorphism is usually not known a priori and is hard to access numerically. Hence, we replace the condition of topological equivalence with three weaker conditions based on the Lyapunov spectrum, which contains topological and stability information about limit sets of a DS. Let us denote the ordered, largest n Lyapunov exponents of the limit sets ω(x, Φ) and ω(x, Φ_R) by λ_1 ≤ λ_2 ≤ … ≤ λ_n and λ^R_1 ≤ λ^R_2 ≤ … ≤ λ^R_n, respectively. First, we require that all Lyapunov exponents agree in their signs, sgn(λ_i) = sgn(λ^R_i) ∀i. Second, we demand that the maximum Lyapunov exponent is close in relative error, |λ_n − λ^R_n| / |λ_n| < ε_{λ_n}, where ε_{λ_n} is a tolerance. Lastly, the limit sets need to be close in state space, d_H(ω(x, Φ_R), ω(x, Φ)) < ε_{d_H}, as assessed through the Hausdorff distance (see Appx. C.3 for more details). We then define an indicator function on M, 1_{Φ_R}(x), which is equal to 1 for a given point x ∈ U ⊆ M iff the associated limit sets fulfill all three conditions above.

Definition 3.2. We define the topological generalization error on U ⊆ M as

E^U_{top}(\Phi_R) = 1 - \frac{1}{\mathrm{vol}(U)} \int_{U \subseteq M} \mathbb{1}_{\Phi_R}(x) \, dx.   (10)

In the following, we will use E_gen as a placeholder for both E_top and E_stat, and statements involving E_gen must hold for both errors.^2

^2 Note that E_top and E_stat are solely theoretical constructs we introduce to formalize the OODG-in-DSR problem, not loss functions to be used in training.
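For concreteness, the Monte Carlo approximation of Eq. (8) mentioned above can be sketched as follows (an illustrative stand-in, not the computational setup of Appx. C.2): random unit directions are drawn, both empirical occupation measures are projected onto each direction, and the resulting one-dimensional Wasserstein-1 distances are averaged.

```python
# Minimal sketch (not the implementation of Appx. C.2): Monte Carlo estimate of
# the sliced Wasserstein-1 distance, Eq. (8), between two empirical occupation
# measures given as arrays of trajectory samples of shape (N, n).
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_w1(samples_p, samples_q, n_slices=500, rng=None):
    rng = np.random.default_rng(rng)
    n = samples_p.shape[1]
    total = 0.0
    for _ in range(n_slices):
        xi = rng.normal(size=n)
        xi /= np.linalg.norm(xi)             # uniform random direction on S^{n-1}
        proj_p = samples_p @ xi              # one-dimensional slice g_xi(x) = xi^T x
        proj_q = samples_q @ xi
        total += wasserstein_distance(proj_p, proj_q)  # W_1 between 1-D projections
    return total / n_slices

# Example: samples concentrated around two different attractor regions give a
# large distance, samples from the same region a small one.
rng = np.random.default_rng(0)
around_plus = rng.normal(loc=[+1.0, 0.0], scale=0.05, size=(2000, 2))
around_minus = rng.normal(loc=[-1.0, 0.0], scale=0.05, size=(2000, 2))
print("same region :", sliced_w1(around_plus, around_plus.copy(), rng=1))
print("diff regions:", sliced_w1(around_plus, around_minus, rng=1))
```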
These errors are highly sensitive to a failure to reconstruct multistable systems:

Theorem 3.3. Assume Φ is multistable with decomposition as in Eq. (7) and connected basins, and there exists one attractor A_k, k ≤ n, not reconstructed by Φ_R. Then, the generalization error of Φ_R is proportional to the volume of the basin of this non-reconstructed attractor:

E^{M_{test}}_{gen}(\Phi_R) \propto \mathrm{vol}\big(B(A_k)\big).   (11)

This statement naturally generalizes to the case of multiple non-reconstructed attractors (with different proportionality constants).

Proof. See Appx. E.1.

3.3. OOD Learnability in DSR
Learnability is a fundamental concept in statistical learning theory (Vapnik, 2000; Shalev-Shwartz et al., 2010), with many different definitions advanced (Valle-Pérez & Louis, 2020). In its simplest form, a hypothesis class is called learnable if, for any distribution of training data, the error between the learned and ground-truth function decreases monotonically with sample size and converges to zero in the limit of infinitely many data points. This concept has been extended to OODG settings in Fang et al. (2023). To apply these definitions to DSR, assume the state space segregates into n basins (domains), Eq. (7), |E| < n of which form the training domains M_train = ⋃_{e ∈ E} B(A_e) and all others the test domains M_test. For simplicity, we assume we have access to the data-generating process Φ on M_train, such that the training data can be expressed as D = ⋃_{x_0 ∈ M_train} Φ([0, T], x_0), where [0, T] is the time interval in which the trajectories are observed. In line with statistical DL theory, we further assume that H includes hypotheses consistent with both the training and test data (Valle-Pérez et al., 2019; Belkin, 2021). In other words, there exist models within H that, in theory, achieve zero reconstruction error on both the training and test domains (but in practice this will depend on uncertainties introduced by the DSR algorithm). Denote by Θ_0 = {θ ∈ Θ | E^{M_train}_gen(Φ_θ) = 0} the set of parameters associated with models having (near) zero reconstruction error on the training domain, and by H_0 the corresponding set of DS. Then, learnability in DSR boils down to:

Definition 3.4. The OODG problem (H, D) defined by hypothesis class H and dataset D is strictly learnable if

\forall \Phi_R \in H_0 : \; E^{M_{test}}_{gen}(\Phi_R) = 0.   (12)

Hence, the OODG problem is strictly learnable if zero reconstruction error on the training domain leads to zero reconstruction error on the test domain. For highly expressive hypothesis classes there can be multiple, if not infinitely many, models in H_0 with different generalization errors on M_test. If we assume we are dealing with a parameterized function class H_θ = {Φ_θ | θ ∈ Θ ⊆ R^P}, as is practically the case in all DL & DSR settings, the quantity of interest becomes the distribution of generalization errors of models in H_0:

Definition 3.5. We define the learnability distribution of the OODG problem (H_θ, D) as

p(\varepsilon_{gen} \mid D) = \frac{1}{\mathrm{vol}(\Theta_0)} \int_{\Theta_0} \mathbb{1}\big[E^{M_{test}}_{gen}(\Phi_\theta) = \varepsilon_{gen}\big] \, d\theta,   (13)

the probability of a model with zero reconstruction error on the training domain having a generalization error of ε_gen on M_test, where 1[·] returns 1 if the condition in square brackets holds and 0 otherwise. The more mass the distribution has at zero, the better the problem is learnable. In the limit of p(ε_gen | D) being fully concentrated at ε_gen = 0, the OODG problem becomes strictly learnable. Table A17 summarizes the most important differences between OODG from the perspective of standard statistical learning theory vs. that of DST as advanced here. The learnability of an OODG problem depends both on the hypothesis class H and on the prior over H induced by the training algorithm.
Therefore, we will examine the following two scenarios:

Strong prior (Sect. 4.1): Library-based algorithms such as SINDy (Brunton et al., 2016) introduce a strong inductive bias by explicitly providing a function class for the underlying VF.

No prior (Sect. 4.2): Approaches based on universal approximators of DS, like RNNs (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Hanson & Raginsky, 2020), do not incorporate any explicit prior (but may still introduce implicit priors through the choice of training algorithm and parameter initialization).

4.1. Strong Prior: Methods Based on Predefined Function Libraries
Following the classical statistical approach of basis expansions (Hastie et al., 2009; Durstewitz, 2017), some popular DSR methods rest on a predefined library of basis functions in the observables (Brunton et al., 2016; Reinbold et al., 2020), most prominently SINDy and its further developments (Brunton et al., 2016; Loiseau & Brunton, 2018; Kaiser et al., 2018; Cortiella et al., 2021; Messenger & Bortz, 2021; Kaheman et al., 2022). These models are usually linear in the parameters, thus easing statistical inference. Since the library of functions needs to be specified a priori, these methods induce a strong inductive bias. A strong sparsity prior on the parameters, and correspondingly sparse regression methods like LASSO or sequential thresholding (Brunton et al., 2016), ensure that only a small subset of functions from the library is selected for modeling the vector field (for details on SINDy, see Appx. D.2). More formally, this defines the class of finite-dimensional, linearly parameterized functions with m differentiable basis functions ψ_i : R^n → R, i = 1, …, m,

B_L = \Big\{ f_j(x; \theta) = \sum_{i=1}^{m} \theta_{i,j} \, \psi_i(x) \;\; \forall j, \quad \theta \in \mathbb{R}^{m \times n} \Big\},   (14)

where these basis functions may be arbitrarily chosen. In this hypothesis class, one trajectory is sufficient to fully specify the DS, unless a certain uniqueness condition is violated:

Theorem 4.1. Let f ∈ B_L be a multistable VF, and assume SINDy (or a related method) is used to learn Φ_R, including the right terms from B_L in its library. If there exists a trajectory Γ_{x_0} ∈ D not solving an algebraic equation in the parameters of Eq. (14), then the OODG problem given by (B_L, D) is strictly learnable.

Proof. See Appx. E.2.

This implies that a single trajectory from one basin is enough for the DSR model to capture the dynamics on all other basins, as long as D contains a trajectory not solving an algebraic equation in the parameters of Eq. (14)^3 and the correct function library is provided.^4 These observations are illustrated in Fig. 2a. Appx. D.4 provides an efficient formal procedure for checking whether the conditions on a given trajectory are met, and hence whether a generalizing solution can be found. In situations where the trajectory does solve an algebraic equation, we can further restrict the library to find a unique solution (Corollary E.7). We remark that these conditions are usually established by LASSO.

^3 There are in fact particular systems where each trajectory may solve an algebraic equation, e.g. vector fields with a rational first integral like algebraic Hamiltonians (see Appx. D.3 for more details).
^4 If this is not the case, sometimes SINDy may still be able to find a good approximation, depending on the degree of mismatch and the expressiveness of the library.

[Figure 2 panels: ground-truth VF, VFs reconstructed from an algebraic vs. a non-algebraic trajectory, and reconstructions with libraries [1, x, y, x^2, y^2, xy, x^3, x^2 y, x y^2, y^3] vs. [1, x, y, x^2, y^2, xy].] Figure 2. a) Example reconstructions using SINDy (details in Appx. D.2). The underlying VF has two cycle solutions. One solves an algebraic equation (red), while the other does not (black). The VF is only correctly identified from a trajectory containing the inner cycle (center), but not for the outer cycle (right). b) SINDy needs the proper function library to correctly infer a system across the whole state space (center). If the 3rd-order term present in the Duffing equations is lacking (right), the inferred VF may only be locally correct (or not at all for more complex systems).
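To illustrate the library-based hypothesis class of Eq. (14), the following sketch re-implements the basic SINDy idea, sequentially thresholded least squares over a polynomial library, in plain NumPy on a single Duffing trajectory; it is a bare-bones stand-in for illustration (threshold, library, and derivative estimation are ad hoc choices), not PySINDy and not the setup of Appx. D.2.

```python
# Bare-bones sketch of library-based DSR in the spirit of Eq. (14) and SINDy:
# regress finite-difference derivative estimates onto a fixed polynomial library
# with sequentially thresholded least squares (STLSQ). Illustrative only -- not
# PySINDy and not the paper's Appx. D.2 setup; threshold and library are ad hoc.
import numpy as np
from scipy.integrate import solve_ivp

def duffing(t, s, delta=0.3):
    x, v = s
    return [v, x - x**3 - delta * v]

# One trajectory from a single basin as training data.
t = np.linspace(0, 50, 5001)
sol = solve_ivp(duffing, (t[0], t[-1]), [0.5, 0.8], t_eval=t, rtol=1e-10)
X = sol.y.T                                   # states, shape (N, 2)
dX = np.gradient(X, t, axis=0)                # derivative estimates

# Polynomial library up to 3rd order in (x, v).
def library(X):
    x, v = X[:, 0], X[:, 1]
    terms = [np.ones_like(x), x, v, x*v, x**2, v**2, x**3, x**2*v, x*v**2, v**3]
    return np.stack(terms, axis=1)
names = ["1", "x", "v", "x*v", "x^2", "v^2", "x^3", "x^2*v", "x*v^2", "v^3"]

def stlsq(Theta, dX, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: small coefficients are zeroed
    and the remaining ones re-fit, enforcing a sparse solution."""
    Xi, _, _, _ = np.linalg.lstsq(Theta, dX, rcond=None)
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dX.shape[1]):
            big = ~small[:, j]
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dX[:, j], rcond=None)[0]
    return Xi

Xi = stlsq(library(X), dX)
for j, lhs in enumerate(["dx/dt", "dv/dt"]):
    terms = [f"{c:+.3f} {n}" for c, n in zip(Xi[:, j], names) if c != 0.0]
    print(lhs, "=", " ".join(terms))
# With the right terms in the library, the recovered coefficients should be close
# to the ground truth (dx/dt = v, dv/dt = x - x^3 - 0.3 v) across the whole state
# space, illustrating Theorem 4.1; with x^3 removed from the library, the fit can
# only be locally valid (cf. Fig. 2b).
```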
As laid out in Appx. A, this has important implications, since many benchmark systems and scientific models in physics (Ramberg & Osgood, 1943), chemistry (Fernández-Ramos et al., 2006), ecology (Goel et al., 1971), or epidemiology (Kermack & McKendrick, 1927) are expressed in terms of polynomials. However, for many complex real-world systems, like the climate or the brain, which we observe through time series measurements, this assumption is likely to be violated, with polynomials at best a convenient simplification. As Figs. 2b and A10 make clear, library approaches like SINDy will generally fail if the library does not contain the right terms describing the GT model.^5 Hence, in scientific ML we often turn to more flexible and expressive models.

^5 In fact, SINDy fails on many empirical datasets from complex systems (Brenner et al., 2022; Hess et al., 2023).

4.2. No Priors: Universal Approximators
Next, we examine the most common data-driven approach of choosing a black-box model, like an RNN, to approximate the flow of the underlying system, i.e. without assuming any prior knowledge about the to-be-modeled system. If we assume that these models operate in the universal approximation limit, we can show that there is an infinity of models in the hypothesis class having zero reconstruction error on the training domain but a very high error on the whole state space or for OOD data from M_test. This is in stark contrast to SINDy, where, given that the assumptions of Theorem 4.1 are met, every model with zero reconstruction error on the training domain also has zero generalization error.

Theorem 4.2. Let Φ be a multistable flow that is not topologically transitive (cf. Appx. B.4) on M_test, generated by a VF f ∈ X^1. Then, the OODG problem (X^1, D) is not strictly learnable. In fact, there exists an infinite family of VFs f′ ∈ X^1 and an ε > 0 such that the corresponding flows Φ′ fulfill

E^{M_{train}}_{gen}(\Phi') = 0 \quad \text{and} \quad E^{M_{test}}_{gen}(\Phi') \geq \varepsilon.   (15)

Proof. See Appx. E.3.

Note that this result is independent of the loss function used in training. Fig. 3, where data were sampled from just one basin of attraction of the multistable Duffing system (Eq. (29)), illustrates this idea for three of the most commonly used DSR models (in stark contrast to DSR performance on monostable systems, cf. Fig. A1). We emphasize that this is not a sampling issue: regardless of how much data are drawn from one basin, generalization fails, while increasing sample size quickly helps to identify the whole state space if data from both basins are available (Fig. A9). SINDy, on the other hand, provided the correct function library, generalizes (Fig. 2b).
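The following toy, one-dimensional sketch conveys the intuition behind Theorem 4.2 (it is not the construction used in the Appx. E.3 proof): two C^1 vector fields that coincide exactly on the training basin, and are therefore indistinguishable given any amount of training data from it, yet have different attractors on the test basin.

```python
# Toy illustration of the idea behind Theorem 4.2 (not the proof's construction):
# two C^1 vector fields that are identical on the training basin x > 0, and hence
# indistinguishable from any amount of training data drawn there, but disagree
# on the test basin x < 0.
import numpy as np
from scipy.integrate import solve_ivp

def f_true(x):
    # bistable ground truth: attractors at x = +1 and x = -1, basin boundary at 0
    return x - x**3

def f_alt(x):
    # add a perturbation that vanishes identically for x >= 0 (min(x,0)^3 is C^1),
    # which moves the attractor of the left basin from -1 to -sqrt(2)
    return x - x**3 + 0.5 * np.minimum(x, 0.0) ** 3

xs_train = np.linspace(0.01, 2.0, 500)          # training basin only
print("max |f_true - f_alt| on training basin:",
      np.max(np.abs(f_true(xs_train) - f_alt(xs_train))))   # exactly 0.0

def end_state(f, x0, T=30.0):
    sol = solve_ivp(lambda t, x: f(x), (0.0, T), [x0])
    return sol.y[0, -1]

x0_test = -0.5                                   # initial condition in the test basin
print("true system :", end_state(f_true, x0_test))   # converges to -1
print("alt. system :", end_state(f_alt,  x0_test))   # converges to -sqrt(2) ~ -1.414
```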
It is important to note that while in multistable settings like the one above OODG will generally fail if trajectories are drawn from just one basin, the very same architectures can be trained to approximately zero training error on the full state space M (see Fig. A5). This implies there are indeed regions in the loss landscape that would generalize, raising the question of why these are hardly ever discovered by the optimization algorithm.

4.3. Why OODG Fails
We will shed light on this failure, focusing on RNNs trained with SGD. Given data D, the probability that an RNN after training has a generalization error ε_gen is formally given by

p_{SGD}(\varepsilon_{gen} \mid D) = \int_{\Theta} \int_{\Theta} \mathbb{1}\big[E^{M}_{gen}(\Phi_{\theta_f}) = \varepsilon_{gen}\big] \, p_{opt}(\theta_f \mid \theta_i, D) \, p_{ini}(\theta_i) \, d\theta_i \, d\theta_f,   (16)

where p_ini characterizes the initialization scheme and p_opt formalizes the training process, quantifying the probability of obtaining a final set of parameters θ_f given an initial set θ_i. Under certain assumptions (cf. Appx. D.5 for details), Eq. (16) exactly aligns with the learnability distribution (Eq. (13)). We now illustrate how the implicit biases in p_ini or p_opt will impede OODG.

Simplicity bias in p_ini
In recent studies of standard NNs (Valle-Pérez et al., 2019; Mingard et al., 2023) and transformers (Bhattamishra et al., 2023) it was shown that the parameter-function map (Appx. D.6) is biased towards simple functions, which in turn may explain the good generalization capability of these models on i.i.d. data (of course, time series data are not i.i.d. to begin with). Here we show that RNNs also exhibit a bias towards simplicity, which, in this case, unfortunately, manifests as a bias towards monostable DS. To this end, we initialized a shPLRNN Φ_θ (Hess et al. (2023), Appx. D.2) using the Glorot uniform and Glorot normal (Glorot & Bengio, 2010) schemes, where we systematically varied the gain scaling the variance. We then uniformly drew N_I initial conditions and evolved them with this randomly initialized DSR model Φ_θ until the resulting trajectories had converged to a limit set. The distribution of these limit-set points across state space was then quantified through the Shannon entropy, which gives a measure of the complexity of the attractor structure at initialization (Fig. 4a). In Fig. 4b, the mean entropy is plotted as a function of gain (variance), revealing a clear trend (see also Fig. A12 for a higher-dimensional RNN example). Increasing the parameter variance hence leads to more complex dynamics at initialization. However, we empirically observe that models initialized with high gains become almost impossible to train by SGD, as commonly observed for vanilla feed-forward NNs (Glorot & Bengio, 2010). This conflict effectively biases all trainable (shPL)RNNs toward monostability, and merely increasing the gain by itself is therefore not a viable option for enhancing OODG.

Figure 3. Learnability of three SOTA DSR algorithms evaluated on the Duffing system in a multistable regime. a) Reconstructions of DSR models trained on four ground-truth trajectories (blue) from one basin. Red trajectories are freely generated using initial conditions of the training data and the respective DSR model. Grey trajectories comprise example ground-truth test trajectories and generated ones from both the training basin and the OOD basin. While training-data trajectories align with the ground truth, all models fail to properly generalize to the unobserved attractor/basin. b) Empirical cumulative distribution function (eCDF) of both E_stat and E_top based on N = 50 independent trainings of each DSR model, evaluated over a grid of initial conditions covering both basins (see Fig. A4).

Figure 4. a) Distribution of Shannon entropies (in nats) for the limit sets of shPLRNNs (M = 2, H = 100) initialized with different gains (parameter variances) using the Glorot uniform scheme. For a low gain (σ = 0.3), as predominantly used in DSR, the attractors of all models at initialization had H = 0, which means that these consisted only of a single equilibrium point. For higher gains, further peaks at H > 0 started to appear, implying that more and/or higher-order objects (like cycles) exist upon initialization. b) Mean Shannon entropy for the same data plotted against gain, using the Glorot uniform and Glorot normal initialization schemes.
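The initialization experiment just described can be sketched in simplified form as follows; note this uses a plain discrete-time tanh RNN and a gridded entropy proxy rather than the shPLRNN and analysis pipeline of the paper, and all sizes, gains, and iteration counts are illustrative.

```python
# Simplified sketch of the initialization experiment above. We use a plain
# discrete-time tanh RNN (not the shPLRNN of Hess et al. (2023)) and a coarse
# proxy for attractor complexity: the Shannon entropy of the distribution of
# final states over a grid of cells. Gains, sizes and iteration counts are
# illustrative choices, not those of the paper.
import numpy as np

def limit_set_entropy(gain, hidden=50, n_init=200, burn_in=2000, collect=200,
                      bins=20, seed=0):
    rng = np.random.default_rng(seed)
    # Glorot-style uniform initialization, scaled by `gain`
    bound = gain * np.sqrt(6.0 / (hidden + hidden))
    W = rng.uniform(-bound, bound, size=(hidden, hidden))
    states = rng.uniform(-1, 1, size=(n_init, hidden))
    for _ in range(burn_in):                    # discard transients
        states = np.tanh(states @ W.T)
    visited = []
    for _ in range(collect):                    # sample the limit sets
        states = np.tanh(states @ W.T)
        visited.append(states[:, :2].copy())    # project onto two units
    pts = np.concatenate(visited, axis=0)
    # Shannon entropy of the occupancy histogram of limit-set points
    hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins,
                                range=[[-1, 1], [-1, 1]])
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

for gain in [0.3, 1.0, 2.0, 4.0]:
    print(f"gain {gain:>4}: entropy = {limit_set_entropy(gain):.3f} nats")
# Small gains typically collapse all initial conditions onto a single fixed
# point (entropy ~ 0); larger gains tend to produce richer limit sets.
```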
Generalizing solutions are saddle points
While the implicit bias introduced by p_ini plays a role in OODG failure, uncertainties in the optimization, as quantified through p_opt, turned out to be even more crucial. To illustrate this, we consider the bistable Duffing oscillator (see Fig. A13 for a chaotic multistable example) and denote by θ_gen the parameters of a model generalizing across M, i.e., with close to zero training loss ℓ_M = ℓ_{B(A1)} + ℓ_{B(A2)} and reconstruction error E_gen on both basins. We then retrain a model initialized with θ_gen on trajectories from just one of the two basins, i.e. employing ℓ_{B(A1)} as the loss function. In Fig. 5a we present the distribution of statistical errors for various generalizing models Φ_{θ_gen} and retrained models Φ_{θ_re} across the two basins, B(A1) and B(A2). We observe an approximately 20-fold increase in the reconstruction error of retrained compared to initialized models on B(A2), even though the error on B(A1) remained largely the same. Hence, the process of retraining effectively leads the models to unlearn the dynamics on the second basin. While here we illustrated that this issue arises even in fairly simple systems like the bistable Duffing oscillator, Fig. A14 shows it is equally present in higher-dimensional, more complex systems like the generalized, spatially extended chaotic Lorenz-96 (Pelzer & Sterk, 2020) model of atmospheric convection.

Figure 5. a) Statistical error distribution on basins B(A1) and B(A2) for 20 generalizing models (green) and 20 × 20 models retrained (purple) using only B(A1) data. b) Illustration of loss landscapes using data from just one (left) or both (right) basin(s) of attraction, with parameters corresponding to the generalizing solution (θ_gen) and to models retrained for 125k (θ¹_re) and 250k (θ²_re) parameter updates, respectively. Note that ℓ_M does not exhibit the spurious loss valley present in ℓ_{B(A1)}.

Since θ_gen corresponds to a model that already agrees well with trajectories from both basins (low ℓ_M & E^M_stat), this raises the question of why the optimizer leaves this regime during the retraining phase in the first place. To further understand this, we studied the Hessian of the loss functions evaluated on trajectories from just one (ℓ_{B(A1)}) or both (ℓ_M) basins w.r.t. θ_gen (see Tab. A5). First, we noticed that θ_gen is not a minimum but a saddle in both loss landscapes. Further, the Hessian of ℓ_{B(A1)} has far fewer positive eigenvalues than that of ℓ_M, implying that the saddle is more stable (with fewer directions to escape) when trajectories from both basins are provided.
Hence, as soon as data from one basin are removed from the training set, the optimizer runs off in new directions with zero or slightly negative eigenvalues, thus forgetting the second equilibrium point. Fig. 5b further shows that the removal of data from the second basin leads to the emergence of spurious extrema. Current training routines may thus not be built to learn multistable systems, as they unlearn the multistable property even upon perfect initialization.

Generalizing minima are sharp
In standard DL, the width of minima correlates with generalization, with wider minima generalizing better than narrow ones (Hochreiter & Schmidhuber, 1997). While certain studies have contested this correlation (Dinh et al., 2017), large-scale studies, such as Jiang et al. (2019), validate this association. Here, we adopt a specific notion of width based on minimum volumes or radii as outlined in Huang et al. (2020), where Appx. D.7 explains how this concept also applies to saddle regions. To further examine this idea, we trained shPLRNNs as above with identical architecture and hyperparameters, once on a trajectory from just one basin and once on trajectories from both basins of attraction of the Duffing system (see Fig. A15 for the same analysis for a chaotic multistable system). We made sure that both models have approximately the same training error when evaluated only on a single trajectory from the first basin. We then examined the width (radius) r(θ) = ||θ − θ_min||_2 of the minimum θ_min corresponding to the loss evaluated only on the one trajectory common to both (the mono- and the multistable) training setups, at a height 5% above the minimum value (other heights gave similar results, see Appx. D.7). On average, the minimum valleys corresponding to generalizing models, i.e. those trained on the whole state space, have a smaller radius (Fig. 6), in contrast to the more common observation in DL that generalizing minima are usually wider. This, in addition to the fact that SGD is more likely to converge to wider minima (Chaudhari et al., 2019; Foret et al., 2020; Xie et al., 2021), may further explain why generalizing minima are avoided in DSR.

Figure 6. eCDF of minima radii for generalizing and non-generalizing models.

5. Discussion
Here we provide the first systematic mathematical treatment of OODG in DSR. We aimed to lay a theoretical foundation which could serve to guide the field toward future solutions of the OODG problem in DSR, by providing a new set of theoretically guided measures, by providing theorems which clearly state what is, and what is not, possible, and by delineating where the hard problems lie and exactly why current SOTA algorithms struggle with them. The core problem is that most naturally observed DS will harbor many co-existing dynamical regimes, characterized by different VF topologies and long-term statistics, while usually we have observed data from only one or a few of them. If we already know the correct function class, we can infer models (like SINDy) which generalize across the whole state space. But for the likely much more common empirical scenario where this is not the case, unique identification of a generalizing solution is no longer possible. In fact, if a chosen library does not even work on the training domain, this is already a strong hint that crucial terms are missing. Unfortunately, intentionally choosing a very expressive, too-large function library is not a remedy either (not to mention computational reasons), as it makes the problem underspecified.
Practically, one DS-agnostic way to potentially address OODG may be to target the implicit biases in the initialization and, more importantly, the training processes (cf. Sect. 4.3), for instance by promoting solutions that explicitly encourage multistability. Often, however, we may still need to guide the training process by a more profound physical or biological understanding of the DS in question, and to evaluate trained models by explicitly (experimentally) testing novel predictions. More generally, future work may want to put the focus on training algorithms that encourage and preserve multistability and avoid overfitting the training basin. All code used here is available at https://github.com/DurstewitzLab/OODG-in-DSR.

Acknowledgements
This work was funded by the German Research Foundation (DFG) within Germany's Excellence Strategy EXC 2181/1 390900948 (STRUCTURES) and by DFG grant Du354/15-1 to DD.

Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References
Belkin, M. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30:203-248, May 2021. ISSN 0962-4929, 1474-0508. doi: 10.1017/S0962492921000039. Publisher: Cambridge University Press.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151-175, May 2010. ISSN 1573-0565. doi: 10.1007/s10994-009-5152-4. URL https://doi.org/10.1007/s10994-009-5152-4.
Bereska, L. and Gavves, E. Continual Learning of Dynamical Systems With Competitive Federated Reservoir Computing. In Proceedings of The 1st Conference on Lifelong Learning Agents, pp. 335-350. PMLR, November 2022. URL https://proceedings.mlr.press/v199/bereska22a.html. ISSN: 2640-3498.
Bhattamishra, S., Patel, A., Kanade, V., and Blunsom, P. Simplicity bias in transformers and their ability to learn sparse boolean functions, 2023.
Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22-45, 2015.
Brambilla, M., Lugiato, L. A., Penna, V., Prati, F., Tamm, C., and Weiss, C. O. Transverse laser patterns. II. Variational principle for pattern selection, spatial multistability, and laser hydrodynamics. Phys. Rev. A, 43:5114-5120, May 1991. doi: 10.1103/PhysRevA.43.5114. URL https://link.aps.org/doi/10.1103/PhysRevA.43.5114.
Brenner, M., Hess, F., Mikhaeil, J. M., Bereska, L. F., Monfared, Z., Kuo, P.-C., and Durstewitz, D. Tractable Dendritic RNNs for Reconstructing Nonlinear Dynamical Systems. In Proceedings of the 39th International Conference on Machine Learning, pp. 2292-2320. PMLR, June 2022. URL https://proceedings.mlr.press/v162/brenner22a.html. ISSN: 2640-3498.
Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data: Sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932-3937, April 2016. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1517384113. URL http://arxiv.org/abs/1509.03580. arXiv: 1509.03580.
Bury, T. M., Dylewsky, D., Bauch, C. T., Anand, M., Glass, L., Shrier, A., and Bub, G. Predicting discrete-time bifurcations with deep learning.
Nature Communications, 14 (1):6331, October 2023. ISSN 2041-1723. doi: 10.1038/ s41467-023-42020-z. URL https://www.nature. com/articles/s41467-023-42020-z. Number: 1 Publisher: Nature Publishing Group. Cavalli, F. and Naimzada, A. Complex dynamics and multistability with increasing rationality in market games. Chaos, Solitons & Fractals, 93:151 161, 2016. ISSN 0960-0779. doi: https://doi.org/10.1016/j.chaos.2016.10. 014. URL https://www.sciencedirect.com/ science/article/pii/S0960077916303095. Chaudhari, P., Choromanska, A., Soatto, S., Le Cun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: biasing gradient descent into wide valleys*. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, December 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ ab39d9. URL https://dx.doi.org/10.1088/ 1742-5468/ab39d9. Publisher: IOP Publishing and SISSA. Chen, J. and Wu, K. Deep-osg: Deep learning of operators in semigroup. Journal of Computational Physics, 493: 112498, 2023. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Climenhaga, V., Luzzatto, S., and Pesin, Y. The Geometric Approach for Constructing Sinai Ruelle Bowen Measures. Journal of Statistical Physics, 166(3):467 493, February 2017. ISSN 1572-9613. doi: 10.1007/ s10955-016-1608-7. URL https://doi.org/10. 1007/s10955-016-1608-7. Cortiella, A., Park, K.-C., and Doostan, A. Sparse identification of nonlinear dynamical systems via reweighted l1-regularized least squares. Computer Methods in Applied Mechanics and Engineering, 376:113620, April 2021. ISSN 0045-7825. doi: 10.1016/j.cma.2020.113620. URL https://www.sciencedirect.com/ science/article/pii/S0045782520308057. Out-of-Domain Generalization in Dynamical Systems Reconstruction d Ascoli, S., Becker, S., Mathis, A., Schwaller, P., and Kilbertus, N. Odeformer: Symbolic regression of dynamical systems with transformers. ar Xiv preprint ar Xiv:2310.05573, 2023. Dawkins, J. Same occupation measure same trajectory. Math Overflow, 2024. URL https://mathoverflow.net/q/463088. URL:https://mathoverflow.net/q/463088 (version: 2024-01-29), USER: (https://mathoverflow.net/users/42851/john-dawkins). de Silva, B. M., Champion, K., Quade, M., Loiseau, J.-C., Kutz, J. N., and Brunton, S. L. Py SINDy: A Python package for the Sparse Identification of Nonlinear Dynamics from Data. ar Xiv preprint ar Xiv:2004.08424, 2020. URL http://arxiv.org/abs/2004.08424. Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp Minima Can Generalize For Deep Nets. In Proceedings of the 34th International Conference on Machine Learning, pp. 1019 1028. PMLR, July 2017. URL https://proceedings.mlr.press/v70/ dinh17b.html. ISSN: 2640-3498. Dubinkina, V., Fridman, Y., Pandey, P. P., and Maslov, S. Multistability and regime shifts in microbial communities explained by competition for essential nutrients. e Life, 8:e49720, nov 2019. ISSN 2050-084X. doi: 10.7554/ e Life.49720. URL https://doi.org/10.7554/ e Life.49720. Duffing, G. Erzwungene Schwingungen bei ver anderlicher Eigenfrequenz und ihre technische Bedeutung. Number 41-42. Vieweg, 1918. Durstewitz, D. Advanced Data Analysis in Neuroscience. Bernstein Series in Computational Neuroscience. Springer International Publishing, Cham, 2017. ISBN 978-3-319-59974-8 978-3319-59976-2. doi: 10.1007/978-3-319-59976-2. URL http://link.springer.com/10.1007/ 978-3-319-59976-2. Durstewitz, D., Seamans, J. K., and Sejnowski, T. 
J. Neurocomputational models of working memory. Nature Neuroscience, 3(11):1184 1191, November 2000. ISSN 1546-1726. doi: 10.1038/81460. URL https://www. nature.com/articles/nn1100_1184. Number: 11 Publisher: Nature Publishing Group. Eckmann, J. P. and Ruelle, D. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57 (3):617 656, July 1985. doi: 10.1103/Rev Mod Phys. 57.617. URL https://link.aps.org/doi/10. 1103/Rev Mod Phys.57.617. Publisher: American Physical Society. Fang, Z., Li, Y., Lu, J., Dong, J., Han, B., and Liu, F. Is Out-of-Distribution Detection Learnable?, February 2023. URL http://arxiv.org/abs/2210. 14707. ar Xiv:2210.14707 [cs, stat]. Farazmand, M. and Sapsis, T. P. Extreme Events: Mechanisms and Prediction, March 2018. URL http:// arxiv.org/abs/1803.06277. ar Xiv:1803.06277 [nlin, physics:physics]. Fern andez-Ramos, A., Miller, J. A., Klippenstein, S. J., and Truhlar, D. G. Modeling the Kinetics of Bimolecular Reactions. Chemical Reviews, 106(11):4518 4584, November 2006. ISSN 0009-2665. doi: 10.1021/cr050205w. URL https://doi.org/10.1021/cr050205w. Publisher: American Chemical Society. Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware Minimization for Efficiently Improving Generalization. October 2020. URL https:// openreview.net/forum?id=6Tm1mposlr M. Fotiadis, S., Valencia, M. L., Hu, S., Garasto, S., Cantwell, C. D., and Bharath, A. A. Disentangled Generative Models for Robust Prediction of System Dynamics. In Proceedings of the 40th International Conference on Machine Learning, pp. 10222 10248. PMLR, July 2023. URL https://proceedings.mlr.press/ v202/fotiadis23a.html. ISSN: 2640-3498. Funahashi, K.-i. and Nakamura, Y. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801 806, January 1993. ISSN 0893-6080. doi: 10.1016/S0893-6080(05)80125-X. URL https://www.sciencedirect.com/ science/article/pii/S089360800580125X. Geist, K., Parlitz, U., and Lauterborn, W. Comparison of different methods for computing lyapunov exponents. Progress of theoretical physics, 83(5):875 893, 1990. Ghadami, A. and Epureanu, B. I. Forecasting critical points and post-critical limit cycles in nonlinear oscillatory systems using pre-critical transient responses. International Journal of Non Linear Mechanics, 101:146 156, May 2018. ISSN 0020-7462. doi: 10.1016/j.ijnonlinmec.2018.02. 008. URL https://www.sciencedirect.com/ science/article/pii/S0020746217306716. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249 256. JMLR Workshop and Conference Proceedings, March 2010. URL https://proceedings.mlr.press/ v9/glorot10a.html. ISSN: 1938-7228. Out-of-Domain Generalization in Dynamical Systems Reconstruction Goel, N. S., MAITRA, S. C., and MONTROLL, E. W. On the Volterra and Other Nonlinear Models of Interacting Populations. Reviews of Modern Physics, 43 (2):231 276, April 1971. doi: 10.1103/Rev Mod Phys. 43.231. URL https://link.aps.org/doi/10. 1103/Rev Mod Phys.43.231. Publisher: American Physical Society. Guth, S. and Sapsis, T. P. Machine Learning Predictors of Extreme Events Occurring in Complex Dynamical Systems. Entropy, 21(10):925, October 2019. ISSN 10994300. doi: 10.3390/e21100925. URL https://www. mdpi.com/1099-4300/21/10/925. Number: 10 Publisher: Multidisciplinary Digital Publishing Institute. Hanson, J. and Raginsky, M. 
Universal simulation of stable dynamical systems by recurrent neural nets. In Bayen, A. M., Jadbabaie, A., Pappas, G., Parrilo, P. A., Recht, B., Tomlin, C., and Zeilinger, M. (eds.), Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 384 392. PMLR, 10 11 Jun 2020. URL https://proceedings.mlr.press/ v120/hanson20a.html. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, 2009. ISBN 978-0-387-848570 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7. URL http://link.springer.com/10.1007/ 978-0-387-84858-7. Hess, F., Monfared, Z., Brenner, M., and Durstewitz, D. Generalized Teacher Forcing for Learning Chaotic Dynamics. In Proceedings of the 40th International Conference on Machine Learning, pp. 13017 13049. PMLR, July 2023. URL https://proceedings. mlr.press/v202/hess23a.html. ISSN: 26403498. Hochreiter, S. and Schmidhuber, J. Flat Minima. Neural Computation, 9(1):1 42, January 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.1.1. URL https://doi. org/10.1162/neco.1997.9.1.1. Huang, W. R., Emam, Z., Goldblum, M., Fowl, L., Terry, J. K., Huang, F., and Goldstein, T. Understanding Generalization Through Visualizations. pp. 87 97. PMLR, February 2020. URL https://proceedings.mlr. press/v137/huang20a.html. ISSN: 2640-3498. Innes, M., Saba, E., Fischer, K., Gandhi, D., Rudilosso, M. C., Joy, N. M., Karmali, T., Pal, A., and Shah, V. Fashionable modelling with flux. Co RR, abs/1811.01457, 2018. URL https://arxiv.org/abs/1811. 01457. Izhikevich, E. M. Dynamical systems in neuroscience: the geometry of excitability and bursting. Computational neuroscience. MIT Press, Cambridge, Mass, 2007. ISBN 978-0-262-09043-8. OCLC: ocm65400606. Jiang, R., Lu, P. Y., Orlova, E., and Willett, R. Training neural operators to preserve invariant measures of chaotic attractors, 2023. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic Generalization Measures and Where to Find Them, December 2019. URL http://arxiv. org/abs/1912.02178. ar Xiv:1912.02178 [cs, stat]. Kaheman, K., Brunton, S. L., and Kutz, J. N. Automatic differentiation to simultaneously identify nonlinear dynamics and extract noise probability distributions from data. Machine Learning: Science and Technology, 3(1):015031, March 2022. ISSN 2632-2153. doi: 10.1088/2632-2153/ ac567a. URL https://dx.doi.org/10.1088/ 2632-2153/ac567a. Publisher: IOP Publishing. Kaiser, E., Kutz, J. N., and Brunton, S. L. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2219):20180335, November 2018. doi: 10.1098/rspa.2018.0335. URL https://royalsocietypublishing. org/doi/full/10.1098/rspa.2018.0335. Publisher: Royal Society. Kermack, W. O. and Mc Kendrick, A. G. A Contribution to the Mathematical Theory of Epidemics. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 115(772): 700 721, 1927. ISSN 0950-1207. URL https://www. jstor.org/stable/94815. Publisher: The Royal Society. Khona, M. and Fiete, I. R. Attractor and integrator networks in the brain. Nature Reviews Neuroscience, 23(12):744 766, December 2022. ISSN 1471-0048. doi: 10.1038/ s41583-022-00642-0. URL https://www.nature. com/articles/s41583-022-00642-0. Number: 12 Publisher: Nature Publishing Group. Kimura, M. and Nakano, R. 
Learning dynamical systems by recurrent neural networks from orbits. Neural Networks, 11(9):1589 1599, 1998. Kirchmeyer, M., Yin, Y., Don a, J., Baskiotis, N., Rakotomamonjy, A., and Gallinari, P. Generalizing to New Physical Systems via Context-Informed Dynamics Model, June 2022. URL http://arxiv.org/abs/2202. 01889. ar Xiv:2202.01889 [cs, stat]. Out-of-Domain Generalization in Dynamical Systems Reconstruction Ko, J.-H., Koh, H., Park, N., and Jhe, W. Homotopy-based training of Neural ODEs for accurate dynamics discovery, May 2023. URL http://arxiv.org/abs/2210. 01407. ar Xiv:2210.01407 [physics]. Kolouri, S., Zou, Y., and Rohde, G. K. Sliced Wasserstein Kernels for Probability Distributions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5258 5267, Las Vegas, NV, USA, June 2016. IEEE. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR. 2016.568. URL http://ieeexplore.ieee.org/ document/7780937/. Koppe, G., Toutounji, H., Kirsch, P., Lis, S., and Durstewitz, D. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to f MRI. PLOS Computational Biology, 15(8):e1007263, August 2019. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1007263. URL https://journals.plos.org/ ploscompbiol/article?id=10.1371/ journal.pcbi.1007263. Publisher: Public Library of Science. Kunze, H. Solving inverse problems for odes using the picard contraction mapping. 05 1999. Kuznetsov, Y. A. Elements of Applied Bifurcation Theory (2nd Ed.). Springer-Verlag, Berlin, Heidelberg, 1998. ISBN 0387983821. Lee, J. M. Introduction to Smooth Manifolds, volume 218 of Graduate Texts in Mathematics. Springer, New York, NY, 2012. ISBN 978-1-4419-9981-8 978-1-4419-9982-5. doi: 10.1007/978-1-4419-9982-5. URL https://link.springer.com/10.1007/ 978-1-4419-9982-5. Li, Z., Kovachki, N. B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A., et al. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2020. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020), April 2020. Loiseau, J.-C. and Brunton, S. L. Constrained sparse Galerkin regression. Journal of Fluid Mechanics, 838: 42 67, March 2018. ISSN 0022-1120, 1469-7645. doi: 10.1017/jfm.2017.823. Publisher: Cambridge University Press. Lorenz, E. N. Deterministic nonperiodic flow. Journal of atmospheric sciences, 20(2):130 141, 1963. Lu, L., Jin, P., and Karniadakis, G. E. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. ar Xiv preprint ar Xiv:1910.03193, 2019. L u, J., Chen, G., and Cheng, D. A new chaotic system and beyond: the generalized lorenz-like system. International Journal of Bifurcation and Chaos, 14(05):1507 1537, May 2004. ISSN 02181274. doi: 10.1142/S021812740401014X. URL https://www.worldscientific.com/doi/ abs/10.1142/S021812740401014X. Publisher: World Scientific Publishing Co. Mehta, V., Char, I., Neiswanger, W., Chung, Y., Nelson, A., Boyer, M., Kolemen, E., and Schneider, J. Neural Dynamical Systems: Balancing Structure and Flexibility in Physical Prediction. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 3735 3742, December 2021. doi: 10.1109/CDC45484.2021.9682807. ISSN: 2576-2370. Messenger, D. A. and Bortz, D. M. 
Weak SINDy: Galerkin-Based Data-Driven Model Selection. Multiscale Modeling & Simulation, 19(3):1474 1497, January 2021. ISSN 1540-3459. doi: 10.1137/ 20M1343166. URL https://epubs.siam.org/ doi/10.1137/20M1343166. Publisher: Society for Industrial and Applied Mathematics. Mikhaeil, J. M., Monfared, Z., and Durstewitz, D. On the difficulty of learning chaotic dynamics with RNNs. In 36th Conference on Neural Information Processing Systems (Neur IPS 2022), October 2022. URL https: //openreview.net/forum?id=-_AMpmy V0Ll. Milnor, J. On the concept of attractor. Communications in Mathematical Physics, 99(2):177 195, June 1985. ISSN 1432-0916. doi: 10.1007/BF01212280. URL https: //doi.org/10.1007/BF01212280. Mingard, C., Rees, H., Valle-P erez, G., and Louis, A. A. Do deep neural networks have an inbuilt occam s razor?, 2023. Mouli, S. C., Alam, M. A., and Ribeiro, B. Meta Physi Ca: OOD Robustness in Physics-informed Machine Learning, March 2023. URL http://arxiv.org/abs/ 2303.03181. ar Xiv:2303.03181 [cs, stat]. Mumby, P. J., Hastings, A., and Edwards, H. J. Thresholds and the resilience of caribbean coral reefs. Nature, 450 (7166):98 101, 2007. doi: 10.1038/nature06252. URL https://doi.org/10.1038/nature06252. Ngonghala, C. N., Feudel, U., and Showalter, K. Extreme multistability in a chemical model system. Phys. Rev. Out-of-Domain Generalization in Dynamical Systems Reconstruction E, 83:056206, May 2011. doi: 10.1103/Phys Rev E.83. 056206. URL https://link.aps.org/doi/10. 1103/Phys Rev E.83.056206. Odani, K. The Limit Cycle of the van der Pol Equation Is Not Algebraic. Journal of Differential Equations, 115(1):146 152, January 1995. ISSN 0022-0396. doi: 10.1006/jdeq.1995.1008. URL https://www.sciencedirect.com/ science/article/pii/S002203968571008X. Parthasarathy, S. and Rajasekar, S. Probability distribution characteristics of chaos in a simple population model and the Bonhoeffer van der Pol oscillator. Physical Review E, 58(5):6839 6842, November 1998. doi: 10. 1103/Phys Rev E.58.6839. URL https://link.aps. org/doi/10.1103/Phys Rev E.58.6839. Publisher: American Physical Society. Patel, D. and Ott, E. Using Machine Learning to Anticipate Tipping Points and Extrapolate to Post Tipping Dynamics of Non-Stationary Dynamical Systems, July 2022. URL http://arxiv.org/abs/ 2207.00521. ar Xiv:2207.00521 [physics]. Pathak, J., Lu, Z., Hunt, B. R., Girvan, M., and Ott, E. Using Machine Learning to Replicate Chaotic Attractors and Calculate Lyapunov Exponents from Data. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(12): 121102, December 2017. ISSN 1054-1500, 1089-7682. doi: 10.1063/1.5010300. URL http://arxiv.org/ abs/1710.07313. ar Xiv: 1710.07313. Pelzer, A. F. G. and Sterk, A. E. Finite Cascades of Pitchfork Bifurcations and Multistability in Generalized Lorenz-96 Models. Mathematical and Computational Applications, 25(4):78, December 2020. ISSN 2297-8747. doi: 10.3390/mca25040078. URL https: //www.mdpi.com/2297-8747/25/4/78. Number: 4 Publisher: Multidisciplinary Digital Publishing Institute. Perko, L. Differential Equations and Dynamical Systems. Texts in applied mathematics. World Publishing Company, 1991. ISBN 9780387974439. Platt, J. A., Wong, A., Clark, R., Penny, S. G., and Abarbanel, H. D. Robust forecasting using predictive generalized synchronization in reservoir computing. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(12), 2021. Platt, J. A., Penny, S. G., Smith, T. A., Chen, T.-C., and Abarbanel, H. D. I. 
A Systematic Exploration of Reservoir Computing for Forecasting Complex Spatiotemporal Dynamics, January 2022. URL http://arxiv.org/ abs/2201.08910. ar Xiv:2201.08910 [cs]. Platt, J. A., Penny, S. G., Smith, T. A., Chen, T.-C., and Abarbanel, H. D. Constraining chaos: Enforcing dynamical invariants in the training of reservoir computers. Chaos: An Interdisciplinary Journal of Nonlinear Science, 33 (10), 2023. P erez-Hern andez, J. A. and Benet, L. Perez Hz/Taylor Integration.jl: Taylor Integration v0.4.1, February 2019. URL https: //doi.org/10.5281/zenodo.2562353. Qi, D. and Majda, A. J. Using machine learning to predict extreme events in complex systems. Proceedings of the National Academy of Sciences, 117(1):52 59, January 2020. doi: 10.1073/pnas. 1917285117. URL https://www.pnas.org/doi/ full/10.1073/pnas.1917285117. Publisher: Proceedings of the National Academy of Sciences. Rackauckas, C. and Nie, Q. Differential Equations.jl a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software, 5(1), 2017. Rackauckas, C., Ma, Y., Martensen, J., Warner, C., Zubov, K., Supekar, R., Skinner, D., and Ramadhan, A. Universal differential equations for scientific machine learning. ar Xiv preprint ar Xiv:2001.04385, 2020. Raissi, M., Perdikaris, P., and Karniadakis, G. E. Multistep Neural Networks for Data-driven Discovery of Nonlinear Dynamical Systems. ar Xiv:1801.01236 [nlin, physics:physics, stat], January 2018. URL http://arxiv.org/abs/1801.01236. ar Xiv: 1801.01236. Raissi, M., Perdikaris, P., and Karniadakis, G. Physicsinformed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 707, February 2019. ISSN 00219991. doi: 10.1016/j.jcp.2018. 10.045. URL https://linkinghub.elsevier. com/retrieve/pii/S0021999118307125. Ramberg, W. and Osgood, W. R. Description of stress-strain curves by three parameters, July 1943. URL https: //ntrs.nasa.gov/citations/19930081614. NTRS Author Affiliations: National Bureau of Standards NTRS Report/Patent Number: NACA-TN-902 NTRS Document ID: 19930081614 NTRS Research Center: Legacy CDMS (CDMS). Reinbold, P. A. K., Gurevich, D. R., and Grigoriev, R. O. Using Noisy or Incomplete Data to Discover Models of Spatiotemporal Dynamics. Physical Review E, 101(1): 010203, January 2020. ISSN 2470-0045, 2470-0053. doi: Out-of-Domain Generalization in Dynamical Systems Reconstruction 10.1103/Phys Rev E.101.010203. URL http://arxiv. org/abs/1911.03365. ar Xiv:1911.03365 [physics]. Schiff, S. J., Jerger, K., Duong, D. H., Chang, T., Spano, M. L., and Ditto, W. L. Controlling chaos in the brain. Nature, 370(6491):615 620, August 1994. ISSN 1476-4687. doi: 10.1038/370615a0. URL https:// www.nature.com/articles/370615a0. Number: 6491 Publisher: Nature Publishing Group. Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, Stability and Uniform Convergence. Journal of Machine Learning Research, 11(90):2635 2670, 2010. ISSN 15337928. URL http://jmlr.org/papers/v11/ shalev-shwartz10a.html. Subramanian, A. and Mahadevan, S. Probabilistic physicsinformed machine learning for dynamic systems. Reliability Engineering & System Safety, 230:108899, February 2023. ISSN 0951-8320. doi: 10.1016/j.ress.2022.108899. URL https://www.sciencedirect.com/ science/article/pii/S0951832022005142. Valle-P erez, G. and Louis, A. A. Generalization bounds for deep learning, December 2020. URL http://arxiv. 
org/abs/2012.04115. ar Xiv:2012.04115 [cs, stat]. Valle-P erez, G., Camargo, C. Q., and Louis, A. A. Deep learning generalizes because the parameter-function map is biased towards simple functions, 2019. Vapnik, V. N. The Nature of Statistical Learning Theory. Springer, New York, NY, 2000. ISBN 978-1-4419-31603 978-1-4757-3264-1. doi: 10.1007/978-1-4757-3264-1. URL http://link.springer.com/10.1007/ 978-1-4757-3264-1. Verzelli, P., Alippi, C., and Livi, L. Learn to synchronize, synchronize to learn. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(8), 2021. Villani, C. et al. Optimal transport: old and new, volume 338. Springer, 2009. Vlachas, P. R., Byeon, W., Wan, Z. Y., Sapsis, T. P., and Koumoutsakos, P. Data-driven forecasting of highdimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213): 20170844, May 2018. doi: 10.1098/rspa.2017.0844. URL https://royalsocietypublishing. org/doi/10.1098/rspa.2017.0844. Publisher: Royal Society. Vlachas, P. R., Pathak, J., Hunt, B. R., Sapsis, T. P., Girvan, M., Ott, E., and Koumoutsakos, P. Backpropagation algorithms and Reservoir Computing in Recurrent Neural Networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 126:191 217, June 2020. ISSN 0893-6080. doi: 10.1016/j.neunet.2020.02. 016. URL https://www.sciencedirect.com/ science/article/pii/S0893608020300708. Vlachas, P. R., Arampatzis, G., Uhler, C., and Koumoutsakos, P. Multiscale simulations of complex systems by learning their effective dynamics. Nature Machine Intelligence, 4(4):359 366, 2022. Vogt, R., Puelma Touzel, M., Shlizerman, E., and Lajoie, G. On lyapunov exponents for RNNs: Understanding information propagation using dynamical systems tools. Frontiers in Applied Mathematics and Statistics, 8, 2022. ISSN 22974687. URL https://www.frontiersin.org/ articles/10.3389/fams.2022.818799. Wood, S. N. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102 1104, August 2010. ISSN 1476-4687. doi: 10.1038/ nature09319. URL https://www.nature.com/ articles/nature09319. Number: 7310 Publisher: Nature Publishing Group. Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. 2021. Yin, Y., Ayed, I., de B ezenac, E., Baskiotis, N., and Gallinari, P. Leads: Learning dynamical systems that generalize across environments, 2022a. Yin, Y., Ayed, I., de B ezenac, E., Baskiotis, N., and Gallinari, P. LEADS: Learning Dynamical Systems that Generalize Across Environments, February 2022b. URL http://arxiv.org/abs/2106. 04546. ar Xiv:2106.04546 [cs, stat]. Yoden, S. Classification of simple low-order models in geophysical fluid dynamics and climate dynamics. Nonlinear Analysis: Theory, Methods & Applications, 30(7):4607 4618, 1997. ISSN 0362-546X. doi: https://doi.org/10.1016/S0362-546X(97)00306-4. URL https://www.sciencedirect.com/ science/article/pii/S0362546X97003064. Proceedings of the Second World Congress of Nonlinear Analysts. Out-of-Domain Generalization in Dynamical Systems Reconstruction A. Survey of Benchmark Systems Figure A1. In-distribution generalization in DSR. We surveyed 59 papers in the field of DSR, containing a wide range of methods and applications, with respect to the benchmark systems or datasets considered (Farmer & Sidorowich (1987); Wang & Lin (1998); Voss et al. (2004); Brunton et al. 
(2016); Trischler & D'Eleuterio (2016); Sussillo et al. (2016); Linderman et al. (2016); Tran & Ward (2017); Pathak et al. (2017); Raissi et al. (2018); Mohajerin & Waslander (2018); Vlachas et al. (2018); Lu et al. (2018); Lusch et al. (2018); Karlsson & Svanström (2019); Otto & Rowley (2019); Raissi et al. (2019); Duncker et al. (2019); Ayed et al. (2019); Nguyen et al. (2019); Qin et al. (2019); Fu et al. (2019); Champion et al. (2019); Singh et al. (2019); Lee & Carlberg (2020); Shalova & Oseledets (2020); Vlachas et al. (2020); Zhao et al. (2020); Hernandez et al. (2020); Azencot et al. (2020); Strauss (2020); Gilpin (2020); Nguyen et al. (2021); Kraemer et al. (2021); Li & Duan (2021); Lu et al. (2021); Schmidt et al. (2021); Jordana et al. (2021); Kim et al. (2021); Lai et al. (2021); Gauthier et al. (2021); Goyal & Benner (2021); Liu & Jin (2021); Schlaginhaufen et al. (2021); Mehta et al. (2021); Zhang et al. (2022); Uribarri & Mindlin (2022); Gilpin (2022); Yin et al. (2022b); Rusch et al. (2022); Brenner et al. (2022); Lejarza & Baldea (2022); Chen et al. (2022); Geneva & Zabaras (2022); Mikhaeil et al. (2022); Yang et al. (2023); Linot et al. (2023); Tripura & Chakraborty (2023); Hess et al. (2023)). This survey motivated the classification in Table A1, where three types of systems dominate the literature:
- Simple, low-dimensional linear or nonlinear systems like the FitzHugh-Nagumo equations, the Lotka-Volterra system, or coupled or damped harmonic oscillators/pendulums like the van-der-Pol oscillator.
- Simple monostable 3d chaotic attractors, predominantly the Lorenz-63, Rössler or Duffing systems.
- Nonlinear PDEs as models of fluid dynamics and convection (e.g. Burgers equation, Navier-Stokes equation, Lorenz-96 or Kuramoto-Sivashinsky equations).
Experimental data or explicitly multistable systems were rarely considered (or at least not explored in their multiple stable regimes).

Table A1. Classification of benchmark systems in the field of dynamical systems reconstruction.
Category | Counts
Linear Models/Oscillators | 24
Chaotic 3D Models | 29
Fluid Dynamics/PDEs | 13
Experimental Data | 6
Multistable | 3

Figure A2. Illustration of multistability. Different basins of attraction can lead to completely different dynamical regimes with different topologies.

B. Further Details on Section 2
B.1. Ergodic Theory and Topology

Figure A3. Trajectories of systems with an equilibrium point (Duffing oscillator), a cycle (van der Pol oscillator) and a chaotic attractor (Lorenz system).

Physical measure
Definition B.1. We call $\mu$ a physical measure, related to Sinai-Ruelle-Bowen (SRB) measures (Climenhaga et al., 2017), if for some set $U$ with positive Lebesgue measure ($\mathcal{L}^n(U) > 0$) and $x_0 \in U$,
$$\lim_{T \to \infty} \mu_{x_0,T} = \mu. \quad (17)$$
In essence, this means that the measure is physically realisable. However, not every attractor (or even the DS itself) has to have a physical measure.

Hausdorff Distance
Definition B.2. Let $X, Y$ be two non-empty subsets of a metric space $(M, d)$. The Hausdorff distance is defined by
$$d_H(X, Y) = \max\left\{\sup_{x \in X} d(x, Y),\; \sup_{y \in Y} d(X, y)\right\}, \quad (18)$$
where $d(a, B) = \inf_{b \in B} d(a, b)$ for a point $a \in M$ and a set $B \subseteq M$. The choice of the Hausdorff distance in this context is motivated by its robustness and its sensitivity to outliers between the two sets, which in the context of the topological error makes it a suitable choice for assessing the closeness of the limit sets.

Topological Equivalence
Definition B.3.
Let $F_1, F_2 \in C^1(U)$ with flow maps $\phi^{F_1}_t, \phi^{F_2}_t$. The two vector fields (VFs) are topologically equivalent (Perko, 1991), denoted by $F_1 \sim F_2$, if there exists a homeomorphism $h : U \to U$ mapping orbits of the first system onto orbits of the second system, i.e.
$$\forall t \in \mathbb{R},\, \forall x \in U: \quad \phi^{F_1}_t(x) = h^{-1} \circ \phi^{F_2}_{\tau(x,t)} \circ h(x), \quad (19)$$
with $\tau : U \times \mathbb{R} \to \mathbb{R}$ and $\partial \tau(x,t)/\partial t > 0\ \forall x \in U$. This means the time direction of the orbits is preserved. Loosely speaking, two VFs are topologically equivalent if we can continuously deform one VF into the other, i.e. such that each orbit is only deformed in a continuous manner without "ripping it apart". This implies that equilibrium points are mapped onto equilibrium points, and closed orbits to closed orbits. An open orbit will not be closed through $h$, and vice versa.

Topological Transitivity
Definition B.4. Let $F \in C^1(U)$ be a vector field on a topological space $U$, with its associated flow map $\phi^F_t$. The dynamical system induced by $F$ is said to be topologically transitive if for any two non-empty open sets $A, B \subseteq U$, there exists a time $t \in \mathbb{R}$ such that the flow map at time $t$, $\phi^F_t$, maps some part of $A$ into $B$; that is, $\phi^F_t(A) \cap B \neq \emptyset$. This implies that trajectories of the vector field $F$, starting from an arbitrary region in the space $U$, will eventually enter any other region, given that these regions are open and non-empty.

C. Further Details on Section 3
C.1. OODG in Statistical Learning Theory
Consider a regression or classification setting, where $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}^k$ are the input and output spaces, and $S = \{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$ denotes a dataset sampled from a distribution $p(x, y)$ defined on the domain $X \times Y$. Further assume there is a set $E$ of training domains with cardinality $|E|$. Let the dataset of size $n_e$ from a single environment $e \in E$ be $S_e = \{(x^e_i, y^e_i) \in X^e \times Y^e\}_{i=1}^{n_e}$, where samples are drawn i.i.d. from the unknown, data-generating distribution $p_e(x, y)$. Consider the class $F$ of functions $f : X \to Y$ and a loss function $\ell : Y \times Y \to \mathbb{R}^+ \cup \{0\}$ measuring the goodness of fit. In OODG, the goal is to learn a generalizing, predictive function $\hat{f} \in F$ from the $|E|$ training domains to obtain a minimum prediction error on an unseen test domain $e_{test}$ with distribution $p_{e_{test}}(x, y)$, i.e.
$$\min_{\hat{f} \in F}\; \mathbb{E}_{(x,y) \sim p_{e_{test}}}\!\left[\ell(\hat{f}(x), y)\right], \quad (20)$$
where
$$R_{e_{test}}(\hat{f}) := \mathbb{E}_{(x,y) \sim p_{e_{test}}}\!\left[\ell(\hat{f}(x), y)\right] = \int \ell(\hat{f}(x), y)\, dp_{e_{test}}(x, y) \quad (21)$$
is the expected loss for $\hat{f}$, called the test risk. However, we cannot compute $R_{e_{test}}(\hat{f})$ because the distribution $p_{e_{test}}(x, y)$ is unknown. Hence, we estimate the expectation by the sample mean across all the training domains, called empirical risk, defined as
$$R^E_{emp}(\hat{f}) = \frac{1}{|E|} \sum_{e \in E} \frac{1}{n_e} \sum_{i=1}^{n_e} \ell(\hat{f}(x^e_i), y^e_i). \quad (22)$$
Based on this, the OODG error is defined as the gap between the test risk and the empirical risk,
$$\left| R_{e_{test}}(\hat{f}) - R^E_{emp}(\hat{f}) \right|. \quad (23)$$

C.2. Statistical Error
To compute Eq. (8), we use a common Monte-Carlo approximation of the expectation,
$$SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}) \approx \frac{1}{L} \sum_{l=1}^{L} W_1\!\left( g_{\xi^{(l)}} \sharp\, \mu^{\Phi}_{x,T},\; g_{\xi^{(l)}} \sharp\, \mu^{\Phi^R}_{x,T} \right), \quad (24)$$
where projection vectors $\xi^{(l)} \sim U(S^{n-1})$ are drawn uniformly across the unit hypersphere embedded in $\mathbb{R}^n$. The Wasserstein-1 distance is computed across trajectories (empirical distributions) of the ground-truth flow $\Phi$ and the reconstructed flow $\Phi^R$. Trajectories are drawn by evolving the respective system for $T$ time units from initial conditions $x \in \mathbb{R}^n$. Between two one-dimensional distributions, the Wasserstein-1 distance can then efficiently be computed as
$$W_1(\mu, \nu) = \int_0^1 \left| F^{-1}_{\mu}(q) - F^{-1}_{\nu}(q) \right| dq, \quad (25)$$
where $F^{-1}$ denotes the quantile function (inverse CDF).
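To make the Monte-Carlo estimator in Eqs. (24)–(25) concrete, the following is a minimal Julia sketch (not the implementation used in the paper) of the sliced Wasserstein-1 distance between two trajectories of equal length, stored as n × T matrices; all function names and the equal-length assumption are illustrative.

```julia
using LinearAlgebra, Statistics

# 1D Wasserstein-1 between two equal-size empirical samples: with uniform weights,
# W1 reduces to the mean absolute difference of the sorted samples (empirical quantiles, Eq. (25)).
w1_1d(a, b) = mean(abs.(sort(a) .- sort(b)))

# Sliced Wasserstein-1 between two trajectories X, Y of shape n × T (state dim. × time):
# project onto L random unit vectors ξ ~ U(S^{n-1}) and average the 1D distances, Eq. (24).
function sliced_w1(X, Y; L = 1000)
    n = size(X, 1)
    dists = map(1:L) do _
        ξ = normalize(randn(n))     # uniformly distributed direction on the unit hypersphere
        w1_1d(X' * ξ, Y' * ξ)       # projected (one-dimensional) trajectories
    end
    return sum(dists) / L
end
```

Sorting the projected samples plays the role of evaluating the empirical quantile functions in Eq. (25); for unequal trajectory lengths one would instead interpolate the quantile functions on a common grid, as described next.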
In practice, we approximate the integral in Eq. (25) by evaluating the quantile functions at a resolution of $\Delta q = 10^{-3}$. We use $L = 1000$ samples in Eq. (24). For the final error $E^U_{stat}$, Eq. (9), we sample $K$ initial conditions from a uniformly spaced grid $Gr(U) = \{x^{(1)}, \dots, x^{(K)}\}$ over $U \subseteq M \subseteq \mathbb{R}^n$. The integral in Eq. (9) is then approximated by
$$E^U_{stat}(\Phi^R) \approx \frac{1}{K} \sum_{x \in Gr(U)} SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}). \quad (26)$$

C.3. Topological Error
For the topological error, the Lyapunov spectra of orbits in the limit sets $\omega(x, \Phi)$ and $\omega(x, \Phi^R)$ need to be computed. To compute the Lyapunov spectrum of continuous-time systems, i.e. ground-truth systems and Neural ODEs, we use the Julia library TaylorIntegration.jl (Pérez-Hernández & Benet, 2019). For RNNs and RCs we use our own implementation of an algorithm described in (Geist et al., 1990; Vogt et al., 2022), which computes the Lyapunov spectrum by evaluating the Jacobian product along orbits of length $T$,
$$\lambda_i = \lim_{T \to \infty} \frac{1}{T} \log \sigma_i\!\left(\prod_{t=1}^{T} J_t\right), \quad (27)$$
where $\sigma_i$ is the $i$-th singular value of the product of Jacobians $J_t$ along the orbit. For numerical stability, the product of Jacobians is repeatedly re-orthogonalized using a QR decomposition. To ensure convergence to the limit set spectrum, transients are discarded from the computation of Eq. (27). For the Duffing system (Appx. D.1), we discard the first $T_{trans} = 3000$ time steps and compute the Lyapunov spectrum across an additional $T = 3000$ time steps, while re-orthogonalizing every 50 time steps. For the multistable Lorenz-like system (Eq. (30)), we use $T_{trans} = 5000$ and $T = 10{,}000$. For the tolerance of the relative error between the maximum Lyapunov exponents $\lambda_n$ and $\lambda^R_n$ of the ground-truth and reconstructed system, respectively, we choose $\varepsilon_{\lambda_n} = 0.25$. For evaluating the agreement of limit sets, $d_H(\omega(x, \Phi^R), \omega(x, \Phi)) < \varepsilon_{d_H}$, we used the same setup as for the computation of the Lyapunov spectra, but only use the last $T = 500$ and $T = 5000$ time steps, respectively. We set $\varepsilon_{d_H} = V/L$, where $V$ is the volume of $U$ and $L$ the number of initial conditions contained in the grid $Gr(U)$, which is the same as used for the computation of $E^U_{stat}(\Phi^R)$. For the Duffing system, this comes down to $\varepsilon_{d_H} = 40.0/100 = 0.4$. The integral in Eq. (10) is approximated and computed across the very same grid of initial conditions $Gr(U)$ as used for the statistical error $E^U_{stat}$:
$$E^U_{top}(\Phi^R) \approx 1 - \frac{1}{K} \sum_{x \in Gr(U)} \mathbb{1}_{\Phi^R}(x). \quad (28)$$
See Fig. A4 for a visualization of the grid of initial conditions used for the Duffing system.

D. Further Details on Section 4
D.1. Ground-truth Models
Duffing system. The unforced Duffing system (Duffing, 1918) is given by a set of coupled ODEs,
$$\dot{x} = y, \qquad \dot{y} = -a y - x\,(b + c x^2), \quad (29)$$
where the parameter setting $[a, b, c]$ places the system into a multistable regime with two coexisting equilibrium points. To generate datasets, we numerically integrate Eq. (29) for $t_{int} = 40.0$ time units with a read-out interval of $\Delta t = 0.01$ using the adaptive step size integrator Tsit5 provided within the Julia library DifferentialEquations.jl (Rackauckas & Nie, 2017). Using $K$ initial conditions, this results in an array of shape $4000 \times 2 \times K$. To facilitate training, we standardize our datasets by the overall mean and standard deviation across all trajectories.

Figure A4. Grid $Gr(U)$ used to compute $E^U_{stat}$ as well as $E^U_{top}$ for the Duffing system.

Multistable Lorenz-like system. As an example system of multistable chaotic attractors, we use the multistable Lorenz-like system introduced in Lü et al.
(2004):
$$\dot{x} = -\frac{ab}{a+b}\,x - yz + c, \qquad \dot{y} = ay + xz, \qquad \dot{z} = bz + xy, \quad (30)$$
where we chose parameters $[a, b, c] = [-10, -4, 18.1]$ such that the system exhibits two chaotic 1-scroll attractors in state space. We numerically integrate Eq. (30) for $t_{int} = 80.0$ time units with a read-out interval of $\Delta t = 0.005$ using the Tsit5 integrator. Using $K$ initial conditions, this results in an array of shape $16000 \times 3 \times K$. As for the Duffing datasets, we standardize using the overall mean and standard deviation.

D.2. DSR Models and Training Routines
SINDy. Sparse Identification of Nonlinear Dynamics (SINDy) (Brunton et al., 2016) aims to identify a sparse representation of the governing dynamical equations from data. Given a set of measurements of the state $x(t) \in \mathbb{R}^n$, where $n$ is the number of system variables and $t = \{t_1, \dots, t_m\}$ represents observation times, application of SINDy first requires approximating the flow $\frac{dx}{dt} = \dot{x}$ numerically, e.g. by finite difference methods. Following the notation in Brunton et al. (2016), the derivatives are arranged into matrix form,
$$\dot{X} = \begin{bmatrix} \dot{x}^\top(t_1) \\ \dot{x}^\top(t_2) \\ \vdots \\ \dot{x}^\top(t_m) \end{bmatrix} = \begin{bmatrix} \dot{x}_1(t_1) & \dot{x}_2(t_1) & \cdots & \dot{x}_n(t_1) \\ \dot{x}_1(t_2) & \dot{x}_2(t_2) & \cdots & \dot{x}_n(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ \dot{x}_1(t_m) & \dot{x}_2(t_m) & \cdots & \dot{x}_n(t_m) \end{bmatrix}. \quad (31)$$
SINDy optimization then tries to determine a sparse matrix of regression coefficients $\Xi$ such that
$$\dot{X} = \Theta(X)\,\Xi. \quad (32)$$
Here, $\Theta(X)$ is a library of candidate functions on the state variables that is defined beforehand, e.g.
$$\Theta(X) = \begin{bmatrix} | & | & | & | & | & | & \\ 1 & X & X^2 & X^3 & X^4 & \cos(X) & \cdots \\ | & | & | & | & | & | & \end{bmatrix}. \quad (33)$$
The regression coefficients are found by applying a sparsity-promoting optimization technique, such as the least absolute shrinkage and selection operator (LASSO regression) or the Sequentially Thresholded Least Squares (STLSQ) algorithm, to solve for $\Xi$.
The VF used for Figure 2 is defined as
$$\dot{x} = x + x(x^2 + y^2 - 1)(4x^2 - 4xy + 4y^2) + (x^2 + y^2)(-2x + 2y + x^3 + xy^2), \quad (34)$$
$$\dot{y} = y + y(x^2 + y^2 - 1)(4x^2 - 4xy + 4y^2) + (x^2 + y^2)(-2x - 2y + y^3 + x^2 y).$$
For the inner cycle, a trajectory was drawn from a randomly chosen initial condition $(x_0, y_0) = (0.6, 0.4)$. A long trajectory was then sampled with $T = 100$ and $\Delta t = 0.01$. To infer the VF with SINDy, we used the Python implementation (PySINDy, de Silva et al. (2020)) with the STLSQ optimizer and threshold 0.01, and a polynomial library up to degree 6. As the outer cycle is an unstable solution, small perturbations lead the system to diverge away from it, and so for the reconstructions in Fig. 2 (center) we made sure to initialize exactly on that cycle. For the results in Figure A8, we provided a polynomial library of second order for the multistable Lorenz-like system and a library of third order for the Duffing system, with other settings the same as used for Fig. 2.
RNNs. We trained clipped shallow piecewise-linear RNNs (shPLRNNs; Hess et al. (2023)) using backpropagation through time (BPTT) with sparse teacher forcing (STF, Mikhaeil et al. (2022); Brenner et al. (2022)) and identity teacher forcing (id-STF; Brenner et al. (2022)). The clipped shPLRNN has a simple 1-hidden-layer architecture,
$$z_t = A z_{t-1} + W_1 \left[ \phi(W_2 z_{t-1} + h_2) - \phi(W_2 z_{t-1}) \right] + h_1, \quad (35)$$
with latent state $z_t \in \mathbb{R}^M$, diagonal matrix $A \in \mathbb{R}^{M \times M}$, rectangular connectivity matrices $W_1 \in \mathbb{R}^{M \times H}$ and $W_2 \in \mathbb{R}^{H \times M}$, and thresholds $h_2 \in \mathbb{R}^H$ and $h_1 \in \mathbb{R}^M$. The nonlinear activation function $\phi$ is given by $\mathrm{ReLU}(\cdot) = \max(\cdot, 0)$.
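For illustration, here is a minimal Julia sketch of the clipped shPLRNN latent update in Eq. (35) and of free trajectory generation with an identity read-out; the struct, parameter initialization and sizes are illustrative assumptions, not the authors' Flux.jl training code.

```julia
using LinearAlgebra

relu(x) = max.(x, 0.0)

# Parameters of the clipped shPLRNN, Eq. (35).
struct ShPLRNN
    A::Diagonal{Float64, Vector{Float64}}   # M × M diagonal matrix
    W1::Matrix{Float64}                     # M × H
    W2::Matrix{Float64}                     # H × M
    h1::Vector{Float64}                     # M
    h2::Vector{Float64}                     # H
end

# One latent update: z_t = A z_{t-1} + W1 [ϕ(W2 z + h2) − ϕ(W2 z)] + h1
latent_step(m::ShPLRNN, z) =
    m.A * z .+ m.W1 * (relu(m.W2 * z .+ m.h2) .- relu(m.W2 * z)) .+ m.h1

# Identity observation model: read out the first N latent states.
observe(z, N) = z[1:N]

# Freely generate a trajectory of length T from an initial latent state z1.
function generate(m::ShPLRNN, z1, T, N)
    Z = Vector{Vector{Float64}}(undef, T); Z[1] = z1
    for t in 2:T
        Z[t] = latent_step(m, Z[t-1])
    end
    return hcat((observe(z, N) for z in Z)...)   # N × T matrix of observations
end

# Illustrative random initialization (M = 5 latent, H = 100 hidden, N = 2 observed states):
M, H, N = 5, 100, 2
m = ShPLRNN(Diagonal(0.9 .* ones(M)), 0.01 .* randn(M, H), randn(H, M), zeros(M), randn(H))
X̂ = generate(m, randn(M), 1000, N)
```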
The idea behind id-STF is to replace latent states with states inferred from the observations at optimally chosen intervals $\tau$, such as to pull model-generated trajectories back on track and to avoid strong gradient divergence for chaotic dynamics (Mikhaeil et al., 2022). Teacher forcing also has the effect of smoothing the loss landscape (Hess et al., 2023). As in Brenner et al. (2022), we take an identity mapping for the observation model, $\hat{x}_t = I z_t$, with $I \in \mathbb{R}^{N \times M}$, $I_{kk} = 1$ for the $k \leq N$ read-out neurons, and zeros for all other elements. STF is only used in training the model, but not when deploying and testing it. The loss function to be minimized is the MSE,
$$\ell_{MSE}(\hat{X}, X) = \frac{1}{N T_s} \sum_{t=1}^{T_s} \left\| \hat{x}_t - x_t \right\|_2^2, \quad (36)$$
where $\hat{X}$ are model predictions and $X$ is a training sequence of length $T_s$. For performing SGD updates, we employ the RAdam (Liu et al., 2020) optimizer paired with an exponential decay learning rate schedule. The shPLRNN and training routine are implemented using the Flux.jl DL stack (Innes et al., 2018). Detailed hyperparameter settings are collected in Table A2.

Table A2. Hyperparameter settings of shPLRNNs trained on the Duffing and Lorenz-like systems.
Hyperparameter | Duffing | Lorenz-like
M | 5 | 30
H | 100 | 500
τ | 15 | 15
T_s | 100 | 50
batch size | 32 | 32
η_start | 10⁻³ | 10⁻³
η_end | 10⁻⁶ | 10⁻⁵
# trainable parameters | 1,116 | 30,641
# SGD steps | 250,000 | 250,000

Reservoir Computing (RC). We used a formulation of the RC architecture often employed in work on DSR (Patel & Ott, 2022),
$$r_t = \alpha r_{t-1} + (1 - \alpha) \tanh\left( W r_{t-1} + W_{in} u_t + b \right), \quad (37)$$
$$\hat{x}_t = W_{out} r_t, \quad (38)$$
where $r_t \in \mathbb{R}^M$ is the reservoir state, $\alpha \in \mathbb{R}$ the leakage parameter, $W \in \mathbb{R}^{M \times M}$ the reservoir connectivity matrix, $W_{in} \in \mathbb{R}^{M \times N}$ the input-to-reservoir matrix weighing inputs $u_t \in \mathbb{R}^N$, $b \in \mathbb{R}^M$ a bias vector, and $W_{out} \in \mathbb{R}^{N \times M}$ the matrix mapping reservoir states to the observed data. In RCs, the dynamical reservoir parameters $\theta_r = \{\alpha, W, W_{in}, b\}$ are fixed after initialization. Here we initialized $W$ to be fully connected with entries sampled from a standard normal distribution, and then scaled to have a predefined spectral radius specified by a hyperparameter $\rho$. The input-to-reservoir matrix $W_{in}$ and bias $b$ are also drawn from Gaussian distributions with variances $\sigma^2$ and $\beta^2$, respectively. In RCs, only the reservoir-to-output matrix $W_{out}$ is learned. The RC is trained by first driving the reservoir using ground-truth data $X = [x_1, \dots, x_T] \in \mathbb{R}^{N \times T}$ supplied through $u_t = x_t$. This results in a trajectory of reservoir states $R = [r_1, \dots, r_T] \in \mathbb{R}^{M \times T}$. The only trainable parameters $W_{out}$ are then determined by minimizing the least-squares error $\| X - W_{out} R \|_2^2$, a straightforward linear regression problem with closed-form solution
$$W_{out} = X R^{\top} \left( R R^{\top} \right)^{-1}. \quad (39)$$
After training, the reservoir state is initialized with zeros and the RC is only provided a short sequence of ground-truth data $\{x_1, \dots, x_{T_W}\}$ to warm up the dynamics of the reservoir state $r_t$, where $T_W$ denotes the warm-up time. Afterwards, the RC runs closed-loop (autonomously) by feeding predictions $\hat{x}_t$ back to the reservoir through the input-to-reservoir connection. To keep the comparison between DSR models fair in Fig. 3, we only provide a single initial condition, i.e. $T_W = 1$. For visual clarity, however, we still drop the first few time steps of RC-generated trajectories (e.g. in Fig. 3), which the zero-initialized reservoir state needs in order to converge to the correct dynamics. Detailed hyperparameter settings are in Table A3; a minimal sketch of the reservoir update and readout training is given below.
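As an illustration of Eqs. (37)–(39), here is a minimal Julia sketch of driving the reservoir with ground-truth data and fitting the readout in closed form; the hyperparameter values and initialization shown are placeholders, not the exact settings of Table A3.

```julia
using LinearAlgebra, Random

# Drive the reservoir with inputs U (N × T) and collect states R (M × T), Eq. (37).
function drive_reservoir(W, Win, b, α, U)
    M, T = size(W, 1), size(U, 2)
    R = zeros(M, T); r = zeros(M)
    for t in 1:T
        r = α .* r .+ (1 - α) .* tanh.(W * r .+ Win * U[:, t] .+ b)
        R[:, t] = r
    end
    return R
end

# Closed-form least-squares readout, Eq. (39): Wout = X Rᵀ (R Rᵀ)⁻¹.
train_readout(X, R) = X * R' * inv(R * R')

# Illustrative fixed reservoir parameters (W rescaled to spectral radius ρ, as described above).
Random.seed!(0)
M, N, ρ, α, σ, β = 500, 2, 1.0, 0.7, 0.2, 0.5
W   = randn(M, M); W .*= ρ / maximum(abs.(eigvals(W)))
Win = σ .* randn(M, N); b = β .* randn(M)

# Given ground-truth data X (N × T), one would drive with u_t = x_t and fit the readout:
# R = drive_reservoir(W, Win, b, α, X);  Wout = train_readout(X, R)
```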
Table A3. Hyperparameter settings of RCs trained on the Duffing and Lorenz-like systems.
Hyperparameter | Duffing | Lorenz-like
M | 500 | 2000
ρ | 1.0 | 0.75
α | 0.7 | 0.4
σ | 0.2 | 0.3
β | 0.5 | 0.7
# trainable parameters | 1,000 | 6,000

N-ODE. We train N-ODEs (Chen et al., 2018) using the Julia library DiffEqFlux.jl (Rackauckas et al., 2020). We use a simple multi-layer perceptron (MLP) architecture where parameters are optimized using the adjoint method. The loss function is the MSE, Eq. (36). As for RNNs, we perform SGD updates using RAdam paired with an exponential decay learning rate schedule. Detailed hyperparameter settings are in Table A4.

Table A4. Hyperparameter settings of N-ODEs trained on the Duffing and Lorenz-like systems.
Hyperparameter | Duffing | Lorenz-like
# hidden layers | 2 | 3
hidden layer sizes | [40, 40] | [100, 100, 100]
activation | tanh | ReLU
T_s | 30 | 30
batch size | 32 | 32
ODE solver | Tsit5 | Tsit5
η_start | 10⁻³ | 10⁻³
η_end | 10⁻⁵ | 10⁻⁵
# trainable parameters | 1,842 | 20,903
# SGD steps | 100,000 | 100,000

Figure A5. Reconstructions of the Duffing system as in Fig. 3, but with models trained on ground-truth data (blue trajectories) from both basins. The models are capable of learning the multistable dynamics (red trajectories) of the system when supplied with data from both basins. This is also reflected in the drastically smaller statistical error $E_{stat}$ when applied to the same test data as used for Fig. 3: RC $2.7 \times 10^{-3}$, N-ODE $2.1 \times 10^{-3}$ and shPLRNN $1.4 \times 10^{-3}$.

Figure A6. Learnability evaluated on the multistable Lorenz-like system. Left: Example reconstructions of DSR models trained on 4 ground-truth trajectories (blue) from one basin. Red trajectories are freely generated using initial conditions of the training data and the respective DSR model. Grey trajectories are example ground-truth test trajectories and model-generated trajectories, respectively, from both the training basin and the OOD basin. Again, all models fail to properly generalize to the unobserved attractor/basin. Right: eCDF of $E_{stat}$ with a sample size of $N = 50$ independent trainings of each DSR model, evaluated over a grid of 125 initial conditions covering both basins. Note that the dynamics of the N-ODE models consistently diverged for many initial conditions from the grid. For N-ODE $E_{stat}$ values, this means that most mass is concentrated at much higher error values, which were cut off in the eCDF plot.

Figure A7. Reconstructions of the multistable Lorenz-like system using the same DSR models as in Fig. A6, but trained on ground-truth training data (blue) from both basins. As for the Duffing system, the models are capable of learning the multistable dynamics (red trajectories) of the Lorenz-like system when supplied with data from both basins. This is also reflected in lower statistical errors $E_{stat}$ when compared to the monostable training data, evaluating to 0.131 for the RC, 0.133 for the N-ODE and 0.064 for the shPLRNN.

Figure A8. Reconstruction of the multistable Lorenz-like system from a trajectory from just one basin, using PySINDy (de Silva et al., 2020). Since in this case the correct polynomial function library was provided, both basins are correctly identified (see also Fig. 2b).

D.3. Specifications on Theorem 4.1
There are also examples of VFs with a dense set of trajectories which solve an algebraic equation. Any VF with a rational first integral (e.g.
algebraic Hamiltonian) is not learnable (since then every solution solves an algebraic equation in terms of the basis functions). A simple example would be the standard harmonic oscillator in the regime where it has a center, i.e. a dense set of closed orbits (each of them solving an algebraic equation). More generally, systems with a center manifold (as in many Hamiltonian systems and biological systems like the FitzHugh-Nagumo equation) may have this property, with all trajectories lying on the center manifold solving an algebraic equation.

Figure A9. Generalization errors $E^M_{stat}$ (left) and $E^M_{top}$ (right) for shPLRNNs (M = 5, H = 100) as a function of sample size (amount of training data), drawn from one basin (blue) or both basins (orange) of the Duffing system (center). Each data point is the median ± median absolute deviation (ribbon) across 10 models. Increasing training data from one basin makes no difference for the generalization to the second basin (blue), i.e. both errors plateau at high values.

Figure A10. Example reconstructions using SINDy (details in Appx. D.2). a) The ground truth (GT) VF has two cycle solutions. One solves an algebraic equation (red), while the other does not (black). b) Providing SINDy with the black trajectory and the correct library (including both trigonometric and polynomial functions) leads to a correctly inferred VF. c) Providing as data the trajectory solving an algebraic equation leads to an incorrectly inferred VF, as stated in Theorem 4.1, despite the correct library. d) Providing SINDy with a non-algebraic trajectory but a library that lacks the trigonometric terms also fails to reproduce the GT VF.

The VF used for Figure A10 is defined as
$$\dot{x} = 2y\cos(x), \quad (40)$$
$$\dot{y} = x^2\sin(x) - 2x\cos(x) + y^2\sin(x) - \sin(x).$$

D.4. Specifications on Theorem 4.1 - Evaluating Identifiability Conditions
Assume we are given a dataset (trajectory)
$$D = \{x(t_0), x(t_1), \dots, x(t_N)\}, \quad (41)$$
for which we would like to check whether it satisfies an algebraic equation in the basis functions, i.e.,
$$\exists\, \theta \in \mathbb{R}^m \setminus \{0\} : \quad \Lambda(x(t)) = \sum_{i=0}^{m} \theta_i \psi_i(x(t)) = 0 \quad \forall t \in I^f_{x_0}. \quad (42)$$
Multiplying Eq. (42) by $\psi_j(x(t))$ yields
$$\forall j = 0, \dots, m : \quad \psi_j(x(t)) \sum_{i=0}^{m} \theta_i \psi_i(x(t)) = 0. \quad (43)$$
As this equation is always zero, it is also zero when evaluated at the data points,
$$\forall j = 0, \dots, m : \quad \sum_{k=0}^{N} \psi_j(x(t_k)) \sum_{i=0}^{m} \theta_i \psi_i(x(t_k)) = 0. \quad (44)$$
Equation (44) consists of $m + 1$ linear equations which can be written in matrix form, $\Psi \theta = 0$:
$$\begin{bmatrix} \sum_k \psi_0 \psi_0 & \sum_k \psi_0 \psi_1 & \cdots & \sum_k \psi_0 \psi_m \\ \sum_k \psi_1 \psi_0 & \sum_k \psi_1 \psi_1 & \cdots & \sum_k \psi_1 \psi_m \\ \vdots & \vdots & \ddots & \vdots \\ \sum_k \psi_m \psi_0 & \sum_k \psi_m \psi_1 & \cdots & \sum_k \psi_m \psi_m \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix} = 0, \quad (45)$$
where $\psi_i$ is shorthand for $\psi_i(x(t_k))$ and all sums run over $k = 0, \dots, N$. Hence, if $\ker(\Psi)$ is non-trivial (contains not only the zero vector), then the data points solve an algebraic equation in the basis functions. For instance, let us revisit the scenario described in the main text, where our training dataset comprises points on the circle $\Gamma_{x_0} = \{(\cos(t), \sin(t)) \mid t \in [0, 2\pi)\}$. With this information, we are able to determine the null space of $\Psi$ by solving Eq. (45) using a library that includes polynomials up to third order. The null space is three-dimensional and consists of the three algebraic curves shown in Fig. A11; a minimal numerical sketch of this nullspace check is given below.
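The nullspace check of Eqs. (41)–(45) can be carried out in a few lines; the following is a minimal Julia sketch (not the authors' code) using a cubic monomial library and the unit-circle trajectory as data. The library, tolerance handling and variable names are illustrative.

```julia
using LinearAlgebra

# Monomial library {x^i y^j : i + j ≤ 3}, i.e. ψ_0, …, ψ_m evaluated at a single point.
monomials3(x, y) = [x^i * y^j for i in 0:3 for j in 0:3 if i + j <= 3]

# Ψ_{ji} = Σ_k ψ_j(x(t_k)) ψ_i(x(t_k)), built from a trajectory stored as columns of X (2 × N), Eq. (45).
function build_Psi(X)
    Lib = hcat((monomials3(X[1, k], X[2, k]) for k in 1:size(X, 2))...)   # (m+1) × N
    return Lib * Lib'
end

# Points on the unit circle satisfy x² + y² − 1 = 0, so ker(Ψ) is non-trivial:
ts = range(0, 2π; length = 200)
Xcirc = vcat(cos.(ts)', sin.(ts)')
dim_ker = size(nullspace(build_Psi(Xcirc)), 2)   # > 0 ⇒ data solve an algebraic equation

# For a trajectory that solves no algebraic equation in this basis, dim_ker == 0,
# and a library method like SINDy is identifiable from that single trajectory.
```

For the circle data and this cubic library the computed kernel dimension is three, matching the three algebraic curves referred to above.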
(Note that, by definition, also all linear combinations of these invariant algebraic curves are in the nullspace.) In contrast, the nullspace of $\Psi$ for a trajectory that does not solve an algebraic equation in the basis functions contains only the zero vector. Thus, checking whether the null space of $\Psi$ contains only the zero vector is a quick and efficient method for verifying whether any trajectory satisfies an algebraic equation in the basis functions. If the nullspace contains only the zero vector and we have provided a proper (correct) library, SINDy (or related library methods) will generalize across the state space $M$. Conversely, if the null space is non-trivial, either a different trajectory must be selected or the hypothesis class must be limited, as outlined in Corollary E.7, to enable proper generalization.

Figure A11. Illustration of how to check whether training data solve, or do not solve, an algebraic equation in a practical setting, for the harmonic oscillator (left) and the van-der-Pol oscillator (right), assuming a polynomial basis up to third degree, $\{1, y, y^2, y^3, x, xy, xy^2, x^2, x^2y, x^3\}$. For the harmonic oscillator, the training data solve an algebraic equation, $\ker(\Psi)$ is non-trivial, and SINDy cannot be used; for the van-der-Pol oscillator, the training data solve no algebraic equation, $\ker(\Psi)$ is trivial, and SINDy can be used. The limit cycle of the van-der-Pol oscillator is non-algebraic (Odani, 1995).

D.5. Why OODG Fails
The probability of a model trained with SGD having error $\varepsilon_{gen}$ is defined by
$$p_{SGD}(\varepsilon_{gen} \mid D) = \int_{\Theta} \mathbb{1}\!\left[E^{M}_{gen}(\Phi_{\theta_f}) = \varepsilon_{gen}\right] p_{opt}(\theta_f \mid \theta_i, D)\, p_{ini}(\theta_i)\, d\theta_i\, d\theta_f. \quad (46)$$
This coincides with the learnability distribution
$$p(\varepsilon_{gen} \mid D) = \frac{1}{vol(\Theta_0)} \int_{\Theta_0} \mathbb{1}\!\left[E^{M_{test}}_{gen}(\Phi_{\theta}) = \varepsilon_{gen}\right] d\theta \quad (47)$$
under two conditions, where the failure of either one introduces an implicit bias. First, it might be that the optimizer does not converge to a model with zero reconstruction error on $M$ but a slightly larger error. The more significant implicit bias arises from the combination of $p_{ini}$ and $p_{opt}$. Assuming the optimizer always converges to a model with zero reconstruction error, we have
$$p_{SGD}(\varepsilon_{gen} \mid D) = \frac{1}{vol(\Theta_0)} \int_{\Theta_0} \mathbb{1}\!\left[E^{M_{test}}_{gen}(\Phi_{\theta}) = \varepsilon_{gen}\right] p_{bias}(\theta)\, d\theta, \quad (48)$$
i.e., $p_{SGD}$ aligns with the learnability distribution only when an implicit bias term $p_{bias}$ is accounted for. In general, it is likely that a specific combination of $p_{ini}$ and $p_{opt}$ preferentially converges to certain parameters $\theta_f$, as observed for SGD. This preference is termed the implicit bias of the learning algorithm, and skews the learnability distribution toward the domain implicitly preferred by the learning algorithm.

D.6. Simplicity Bias
We adopt the following definition for the parameter-function map from (Valle-Pérez et al., 2019):
Definition D.1. The parameter-function map $\mathcal{M}$ associates a given parameter set with the flow, expressed as
$$\mathcal{M} : \Theta \to \mathcal{H}_{\theta}, \quad \theta \mapsto \Phi_{\theta}. \quad (49)$$
Depending on the choice of model, multiple parameter vectors can map onto the same flow. For instance, for the shPLRNN (Sect. D.2), scaling $W_1$ by a positive scalar factor $c \in \mathbb{R}^+$ and adjusting the weights in $W_2$ by $1/c$ results in the same dynamical model. The initial parameter distribution $p_{ini}$ can lead to flows with different characteristics not directly apparent from the initial distribution. In the context of the simplicity bias, we are interested in the distribution over the complexity of the flows $K(\Phi_{p_{ini}})$ induced by a choice of distribution over initial parameters. To assess this complexity, we select the Shannon entropy over the limit sets of the resulting flows (Eckmann & Ruelle, 1985). We evaluate this by drawing long trajectories from a grid of initial conditions and compute the entropy over the histogram of final states, using the Rényi algorithm from ComplexityMeasures.jl. This entropy has a natural interpretation for flows with different topologies, where e.g. global equilibrium points have low entropy and chaotic attractors have high entropy (see Fig. 4b).

Figure A12. Simplicity bias for shPLRNNs for M = 10 and H = 250.

Table A5. Number of positive, zero and negative eigendirections for the Hessian of the loss function evaluated on trajectories from just one ($\ell_{B(A_1)}$) or both ($\ell_M$) basins w.r.t. $\theta_{gen}$. (For this analysis only 5 of the 20 generalizing models plotted in Fig. 5 were used.)
 | $\ell_{B(A_1)}$ | $\ell_M$
#λ₊ | 171.93 ± 0.62 | 103.08 ± 2.10
#λ₀ | 182.38 ± 0.75 | 310.54 ± 3.76
#λ₋ | 149.69 ± 0.24 | 90.38 ± 1.66
λ_max | 241.09 ± 1.25 | 364.09 ± 3.29
λ_min | −0.00126 ± 0.00004 | −0.00186 ± 0.00001

D.7. Assessing Sharpness of Minima
In order to assess the volume of minima, we employ a sampling-based approach. Following Huang et al. (2020), we randomly select a vector $\theta$ from the parameter space $\Theta$ with dimensionality $d$ within the hyper-sphere $S^{d-1}_r$ of radius $r$. Subsequently, we evaluate the loss of models with parameters along a straight line in parameter space given by
$$\theta_{min} + a(\theta - \theta_{min}). \quad (50)$$
Here, $a$ takes values in the interval $[0, 1]$. We define the threshold value $a = a_{th}$ as the one at which the loss of the corresponding model exceeds the predefined threshold, $\ell > \ell(\theta_{min})(1 + p_{th})$. We tested thresholds of $p_{th} = 1\%$ and $p_{th} = 5\%$ for our experiments. However, as the results in Fig. 6 and A15 indicate, the results obtained are not overly sensitive to this hyperparameter. The resulting threshold value $a_{th}$ yields an estimate of the minimum radius in the direction of $\theta$, given by $r \cdot a_{th}$. This radius serves as a lower bound for the minimum volume $V = \frac{\pi^{d/2}}{\Gamma(1 + d/2)}\, \mathbb{E}_{\theta}\!\left[r^d(\theta)\right]$, via
$$\log V = \frac{d}{2}\log\pi - \log\Gamma\!\left(\tfrac{d}{2} + 1\right) + \log \mathbb{E}_{\theta \sim U}\!\left[r^d(\theta)\right] \;\geq\; \frac{d}{2}\log\pi - \log\Gamma\!\left(\tfrac{d}{2} + 1\right) + \mathbb{E}_{\theta \sim U}\!\left[d \log r(\theta)\right] \;\approx\; \frac{d}{2}\log\pi - \log\Gamma\!\left(\tfrac{d}{2} + 1\right) + \frac{d}{S} \sum_{i=1}^{S} \log r(\theta_i),$$
where we used Jensen's inequality to pull the logarithm into the expectation, $\Gamma(\tfrac{d}{2} + 1)$ is Euler's gamma function, and the expectation is estimated from $S$ randomly sampled directions $\theta_i$.
Radius estimates for saddle points. It is widely acknowledged that many critical points to which stochastic gradient descent (SGD) converges in high-dimensional spaces are saddle points (Chaudhari et al., 2019). Mathematically, saddle points lack a well-defined radius since there exist directions in which the loss decreases. Despite this, estimation techniques for the radius, such as the one proposed here, are commonly applied (Huang et al., 2020). Our convergence analysis supports the viability of our method, as demonstrated in Fig. A16a, where the estimation of the radius converges after approximately 3000 randomly drawn samples.
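Below is a minimal Julia sketch of this sampling-based radius and log-volume estimate (Eq. (50) and the bound above); the loss function is assumed to be supplied by the user, loggamma is taken from the SpecialFunctions package, and all settings shown are illustrative rather than the exact experimental configuration.

```julia
using LinearAlgebra, Random
using SpecialFunctions: loggamma   # assumed dependency, only needed for the log Γ term

# Radius of the minimum around θ_min along one random direction, Eq. (50):
# walk along θ_min + a(θ_rand − θ_min), a ∈ [0, 1], until the loss exceeds ℓ(θ_min)(1 + p_th).
function radius_along(loss, θmin, θrand; p_th = 0.05, steps = 200)
    ℓ0 = loss(θmin)
    for a in range(0, 1; length = steps)
        if loss(θmin .+ a .* (θrand .- θmin)) > ℓ0 * (1 + p_th)
            return a * norm(θrand .- θmin)
        end
    end
    return norm(θrand .- θmin)     # threshold never exceeded within the sampled sphere
end

# Lower bound on the log-volume of the minimum from S sampled directions (Jensen's inequality).
function log_volume_bound(loss, θmin; r = 1.0, S = 3000, p_th = 0.05)
    d = length(θmin)
    logr = [log(radius_along(loss, θmin, θmin .+ r .* normalize(randn(d)); p_th = p_th))
            for _ in 1:S]
    return d / 2 * log(π) - loggamma(d / 2 + 1) + d * sum(logr) / S
end
```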
A possible explanation for this convergence of the radius estimate could be the prevalence of positive directions. As illustrated in Table A5, the Hessian has more positive eigendirections than zero or negative ones. This could lead to a much larger volume of parameters around the minima having an ascending loss. Further support for this hypothesis is provided in Fig. A16b, where random sampling of parameter vectors around minima, as described earlier, results more frequently in ascending than flat curves.

Figure A13. Similar to Fig. 5a, this figure illustrates the statistical error of generalizing and retrained models for the multistable Lorenz-like system described by Eq. (30). The upper two density plots depict the statistical error of generalizing models on $B(A_1)$ and $B(A_2)$. Meanwhile, the lower row illustrates the same for retrained models. A surge in error is observed for $B(A_2)$ in the retrained models, suggesting an unlearning of the second attractor. This finding confirms that the results presented in the main text can be reproduced for a ground-truth system with completely different dynamics.

Figure A14. Statistical error distribution on basins $B(A_1)$ and $B(A_2)$ for 10 generalizing models (green) and 10 5 models retrained (purple) using only $B(A_1)$ data, similar to Fig. 5a in the main text, but based on a 6d Lorenz-96 system (Pelzer & Sterk, 2020) of the form $\dot{x}_j = x_{j-1}(x_{j+1} - x_{j-2}) - x_j + F$, $j = 1, \dots, 6$, where periodic boundary conditions are applied. Using $F = 0.654502$, the system has two coexisting chaotic attractors (see Sect. 4.3 in (Pelzer & Sterk, 2020) for more details on the system). After retraining on data from $B(A_1)$, the models effectively unlearn the dynamics on $B(A_2)$.

Figure A15. a) Similar results as in Fig. 6 for the Duffing system, but for a lower threshold $p_{th} = 1\%$, as described in Appx. D.7. b) Similar graphs as in Fig. 6 for the multistable Lorenz system (Eq. (30)) for $p_{th} = 1\%$. c) Same as b) but for $p_{th} = 5\%$.

Figure A16. a) Average radius as a function of the number of samples drawn from the loss landscape, indicating convergence to a constant radius at around 3000 samples. b) Loss as a function of the radius for a shPLRNN (N = 2, M = 100) trained on the Duffing system (Eq. (29)) and 5000 sampled points. Curves are smoothed with a kernel density estimate.

E. Proofs
E.1. Proof of Theorem 3.3
We first clarify the mathematical interpretation of "not reconstructed". We consider two different scenarios, assuming the ground-truth model $\Phi$ to be given (fixed):
(i) All trajectories starting in $B(A_k)$ converge to another reconstructed attractor $A_j$, $j \neq k$. This is often the case for trained models, as illustrated in Fig. 3 and Fig. A6, and corresponds to a missing basin. In this case $E^{M_{test}}_{gen}(\Phi^R) \propto vol(B(A_k))$.
(ii) There is an open set $B(A_{n+1}) \subseteq M$ corresponding to a new basin of a new attractor $A_{n+1}$ of $\Phi^R$, where we further assume $A_i \cap B(A_{n+1}) = \emptyset\ \forall i \leq n$. This is the case when some additional dynamics is learned which is not present in the ground-truth system. In this case $E^{M_{test}}_{gen}(\Phi^R) \propto vol(B(A_{n+1}))$.
Proof of (i). We assume the decomposition as in Eq. (7) for the ground-truth system $\Phi$:
$$M = \bigcup_{e=1}^{N} B_e \,\cup\, \tilde{M} \quad \text{such that} \quad \mu(\tilde{M}) = 0. \quad (51)$$
We will first prove the theorem for the statistical error.
First, we want to show that the error between the occupation measure of the ground-truth system and the reconstructed system on $B(A_k)$ is non-zero:
$$SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}) \neq 0 \quad \forall x \in B(A_k). \quad (52)$$
To do so, we take some $x \in B(A_k)$. By assumption $\exists j \neq k : \omega(x, \Phi^R) \subseteq A_j$. We take some open set $U$ with $A_j \subset U \subseteq B(A_j)$, which has to exist as $B(A_j)$ is open. But for $\Phi$ it still holds that $\omega(x, \Phi) \subseteq A_k$. It follows that
$$\mu^{\Phi^R}_{x,T}(U) \neq \mu^{\Phi}_{x,T}(U), \quad (53)$$
as, by definition of an attractor, there is some time $T'$ such that $\Phi^R(T', x)$ enters $U$, while $\Phi(t, x)$ never does for any $t$. Also note that $\mu^{\Phi^R}_{x,T}(U) \neq 0$ as $\Phi^R$ enters $U$, making the occupation measure of $U$ non-zero. Consequently, we found a Borel set on which the two occupation measures disagree, hence $\mu^{\Phi^R}_{x,T} \neq \mu^{\Phi}_{x,T}$. This construction can be repeated for any $x \in B(A_k)$. Since $SW_1$ is a metric on the space of measures (Kolouri et al., 2016), this implies
$$\forall x \in B(A_k): \quad SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}) \neq 0. \quad (54)$$
Since $B(A_k)$ is connected, also $\overline{B(A_k)}$ is connected. As $\overline{B(A_k)}$ is closed and $M$ compact, $\overline{B(A_k)}$ is also compact. By the mean-value theorem for integrals, there exists some $x^* \in \overline{B(A_k)}$ such that
$$E^{B(A_k)}_{stat}(\Phi^R) = SW_1(\mu^{\Phi}_{x^*,T}, \mu^{\Phi^R}_{x^*,T}) \cdot vol(B(A_k)). \quad (55)$$
$SW_1(\mu^{\Phi}_{x^*,T}, \mu^{\Phi^R}_{x^*,T})$ is a (non-zero) constant, consequently $E^{B(A_k)}_{stat} \propto vol(B(A_k))$, which proves the theorem.
We go on to prove the statement for the topological error. Denote by $D^n_r(y)$ a ball of radius $r$ in $\mathbb{R}^n$ centered on $y$. Again, we take some $x \in B(A_k)$, and by assumption $\exists j \neq k : \omega(x, \Phi^R) \subseteq A_j$, while for $\Phi$ it holds that $\omega(x, \Phi) \subseteq A_k$. As $M$ is equipped with the Euclidean distance, the Hausdorff distance between two sets measures the farthest possible Euclidean distance between a point in one set and the closest point in the other set. Denote by $S$ the set of points in $A_j$ that are $\varepsilon_{d_H}$-close to $A_k$,
$$S = \{\, y \in A_j \mid d_H(\{y\}, A_k) \leq \varepsilon_{d_H} \,\}. \quad (56)$$
For reasonable choices of our hyperparameter $\varepsilon_{d_H}$ this set will be empty. Only if both attractors $A_j$, $A_k$ lie very close to the boundary separating their basins of attraction could it be non-empty. In this case we have
$$d_H(\omega(x, \Phi^R), \omega(x, \Phi)) \leq \varepsilon_{d_H} \quad \forall x \in \bigcup_{y \in S} D^n_{\varepsilon_{d_H}}(y), \quad (57)$$
$$d_H(\omega(x, \Phi^R), \omega(x, \Phi)) > \varepsilon_{d_H} \quad \forall x \in B(A_k) \setminus \bigcup_{y \in S} D^n_{\varepsilon_{d_H}}(y). \quad (58)$$
Hence, the indicator function $\mathbb{1}_{\Phi^R}(x)$ in Eq. (10) is 0 on the set $B(A_k) \setminus \bigcup_{y \in S} D^n_{\varepsilon_{d_H}}(y)$, as the condition for the closeness of limit sets in the Hausdorff metric is violated. Consequently,
$$E^{M}_{top}(\Phi^R) = 1 - \frac{1}{vol(M)} \int_M \mathbb{1}_{\Phi^R}(x)\, dx \;\geq\; \frac{vol(B(A_k)) - vol\!\left(\bigcup_{y \in S} D^n_{\varepsilon_{d_H}}(y)\right)}{vol(M)} \;\approx\; \frac{vol(B(A_k))}{vol(M)}. \quad (59)$$
This result proves the theorem when choosing $M_{test} = M$, given a choice of $\varepsilon_{d_H}$ that ensures that $vol(B(A_k)) \gg vol\!\left(\bigcup_{y \in S} D^n_{\varepsilon_{d_H}}(y)\right)$. This should generally be fulfilled by a reasonably small choice of $\varepsilon_{d_H}$.
Proof of (ii). The proof is analogous to the one of (i) if we replace $A_k$ with $A_{n+1}$, but the main steps will be stated again for completeness. For the statistical error we want to prove a result similar to Eq. (52):
$$SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}) \neq 0 \quad \forall x \in B(A_{n+1}). \quad (60)$$
Assume some $x \in B(A_{n+1})$. We know that $\nexists\, j \leq n : \omega(x, \Phi^R) \subseteq A_j$, since by assumption $A_i \cap B(A_{n+1}) = \emptyset\ \forall i \leq n$. That is, while $\omega(x, \Phi^R) \subseteq A_{n+1}$, for $\Phi$ we have $\exists j \leq n : \omega(x, \Phi) \subseteq A_j$. In accordance with the proof of (i), we can construct a set $U \supset A_{n+1}$ and conclude with a similar line of argument as above that
$$SW_1(\mu^{\Phi}_{x,T}, \mu^{\Phi^R}_{x,T}) \neq 0 \quad \forall x \in B(A_{n+1}). \quad (61)$$
Using the mean-value theorem for integrals we can conclude
$$E^{B(A_{n+1})}_{stat}(\Phi^R) \propto vol(B(A_{n+1})), \quad (62)$$
proving the statement if we set $M_{test} = B(A_{n+1})$. For the topological error we take an $x \in B(A_{n+1})$. We know that $\omega(x, \Phi^R) \subseteq A_{n+1}$ and $\exists j \leq n : \omega(x, \Phi) \subseteq A_j$.
In the construction of the set S, we need to be cautious as B(An+1) could have many neighbouring basins. Accordingly, we have to construct a seperate set Si for each attractor. We denote by Si the set of points in Ai, εd H-close to An+1, Si = {y Ai|d H({y}, An+1) εd H}, i n (63) Again, each Si will be empty for reasonable choices of εd H and only non-empty for systems where Ai lies very close to An+1. Now, by an argument analogous to (i), we can conclude that EM top(ΦR) = vol(B(An+1)) vol( n i=1 y Si Dn εd H (y)) vol(M) vol(B(An+1)) vol(M) (64) proving the result when choosing Mtest = M, given a reasonable choice of εd H (as discussed in (i)). E.2. Proof of Theorem 4.1 For the proof of Theorem 4.1, we need the following definitions: Definition E.1. Given a trajectory Γx0, we define the graph of this trajectory by Ωx0 = If x0 Γx0. (65) Definition E.2. We define the following map, mapping the hypothesis class H, a set of initial conditions and a set of times to the graph of the solution: σ : H Rn R R M, (f, x0, If x0) 7 Ωx0 (66) The proof of theorem 4.1 is based on the following lemmata: (i) Using Lemma E.3, we will first show that the set of solutions to the parameter estimation in SINDy (or any library-based algorithm) which is based on a minimization problem, can be rewritten as the pre-image π1(σ 1(Ωx0)). π1 denotes the projection on the first argument of the pre-image. Out-of-Domain Generalization in Dynamical Systems Reconstruction (ii) Lemma E.4 then establishes how the pre-image and the associated minimization problem can be rewritten as a linear (matrix) equation. (iii) Lemma E.5 shows under which conditions this equation will have a unique solution. (iv) Finally, Lemma E.6 is used to characterize the set H0 for H = BL, which will be used to prove the theorem. Lemma E.3. Let Γx0 = {x(t)|t If x0} be the solution (with graph Ωx0) of some ground-truth VF f GT BL, with parameters θGT , f GT = f(x; θGT ). For linearly parameterized function spaces (see Eq. (14)) like BL we can write the (first projection of the) pre-image of a single trajectory as π1(σ 1(Ωx0)) = f BL| j = Z t1 xj(t) xj(0) Z t 0 fj(x(s); θ)ds 2 dt = 0, j {1, ..., n} = min θ {f(x; θ)| Z t1 t0 x(t) f(x; θ) 2 L2dt} (68) Note that the second equation exactly corresponds to the minimization problem stated by SINDy (without regularization). Proof. By the definition of Eq. (66), it holds that π1(σ 1(Ωx0)) = f(x; θ) BL | x(t) = f(x(t); θ), t If x0 & x(0) = x0 . (69) The following equivalence holds for all j {1, ..., n}: xj(t) = fj(x(t); θ), xj(0) = x0,j (70) xj(t) xj(0) = Z t 0 fj(x(s); θ)ds (71) xj(t) xj(0) Z t 0 fj(x(s); θ)ds 2 dt = 0 (72) where we used in the last step that for a continuous function f with f(x) 0, x I R we have R f(x)dx = 0 f(x) = 0. This establishes Eq. (67). For Eq. (68), let us assume f π1(σ 1(Ωx0)). By the definition of σ, we note that f satisfies for any j n xj(t) = fj(x(t); θ), xj(0) = x0,j (73) xj(t) fj(x(t); θ) = 0, xj(0) = x0,j. (74) By the same argument we used to get from Eq. (71) to Eq. (72), it holds that f fulfills Z t1 t0 x(t) f(x; θ) 2 L2dt = 0. (75) As the L2-norm (its square) 2 L2 is a positive function, the minimum value it can attain is 0. Consequently, f minθ{f(x; θ)| R t1 t0 x(t) f(x; θ) 2 L2dt}. For the other direction of the set-inclusion, assume f minθ{f(x; θ)| R t1 t0 x(t) f(x; θ) 2 L2dt}. In general, f could be different from f GT as there could exist many different f solving the minimization problem. 
However, for f GT we know that the following holds: Z t1 t0 x(t) f GT (x; θGT ) 2 L2dt = 0. (76) It follows that f minθ{f(x; θ)| R t1 t0 x(t) f(x; θ) 2 2dt} we must have Z t1 t0 x(t) f(x; θ) 2 L2dt = 0, (77) as the minimum value for R t1 t0 x(t) f(x; θ) 2 L2dt is zero when choosing θ = θGT . This establishes that any f minθ{f(x; θ)| R t1 t0 x(t) f(x; θ) 2 L2dt} has x(t) as a solution. By definition of σ, we conclude that f π1(σ 1(Ωx0)). Out-of-Domain Generalization in Dynamical Systems Reconstruction Based on Lemma E.3, studying the solution of the (unregularized) SINDy minimization problem, Eq. (68), comes down to studying the pre-image π1(σ 1(Ωx0)). Further, note that there is a bijection between parameters θ Rm n and functions f(x; θ) for a linearly parameterized function space such as BL. In the following, we will identify θ with f(x; θ), allowing us to write proofs for f in terms of proofs for θ. In Kunze (1999), the solutions to Eq. (68) were studied for a one-dimensional setting. For higher dimensional settings, this takes the following form: Lemma E.4. f(x; θ) is an element of π1(σ 1(Ωx0)) if and only if θ solves the following matrix equation Ψ0 Ψ0 Ψ0 Ψ1 Ψ0 ΨN Ψ1 Ψ0 Ψ1 Ψ1 Ψ1 ΨN ... ... ... ... ΨN Ψ0 ΨN Ψ1 ΨN ΨN θ1,j θ2,j ... θN,j xjΨ0 xjΨ1 ... xjΨN j = 1, . . . , n (78) with xj = xj(t) x(0) and 0 ψk(x(s))ds (79) Ψk Ψi = Z t1 t0 Ψk(t) Ψi(t)dt = Z t1 0 ψk(x(s))ds Z t 0 ψi(x(s))dsdt. (80) Proof. Consider the stationarity conditions j (x0)j = 0 & j θk,j = 0 k = 1, . . . , m j = 1, ..., n. The second condition leads to j θk,j = Z t1 2 xj(t) xj(0) Z t 0 fj(x(s); θ)ds ( θk,j 0 fj(x(s); θ)ds) dt (81) (xj(t) xj(0)) Z t 0 ψk(x(s))ds 0 ψi(x(s))ds 0 ψk(x(s))ds dt = 0. (82) Using the linearity of the integral this can be rewritten as 0 ψi(x(s))ds Z t 0 ψk(x(s))ds dt = Z t1 (xj(t) xj(0)) Z t 0 ψk(x(s))ds dt (83) leading to Eq. (78). Lemma E.5. Let Ωx0 be the graph of a trajectory from a VF f BL with x(t) = f(x(t)). It holds π1(σ 1(Ωx0)) is unique θ Rm : Λ(x(t)) = i=1 θiψi(x(t)) = 0 t If x0. (84) That is, the given trajectory does not solve an algebraic equation in the basis functions. By unique we mean that π1(σ 1(Ωx0)) contains only one element. Proof. We directly prove the equivalence stated in Eq. (84). Assuming π1(σ 1(Ωx0)) is not unique, by lemma E.4 this is equivalent to matrix Ψ being singular. This implies ker( Ψ) = {0}, which in turn is equivalent to θ Rm\{0} : Ψ θ = 0 Out-of-Domain Generalization in Dynamical Systems Reconstruction Hence, each row of the vector Ψ θ needs to be zero, implying that 6 0 ψk(x(s))ds Z t i=1 θiψi(x(s))ds dt = 0 (85) We need to show that the integral in Eq. (85) can only become zero iff Pm i=1 θiψi(x(t)) = 0 t [t0, t1]. By linearity of the integral over t, we can choose an arbitrary vector η Rm and sum its components such that Eq. (85) is equal to the following: 0 ψk(x(s))ds Z t i=1 θiψi(x(s))ds dt = 0 (86) We are free to choose η = θ, leading to i=1 θiψi(x(s))ds dt = 0 (87) By the same argument as above, as the integral of a continuous, non-negative function can only be zero if the function itself is zero, we have Z t i=1 θiψi(x(s))ds = 0 s [t0, t1]. (88) Since this has to be true t [t0, t1], we finally conclude i=1 θiψi(x(t)) = 0 t [t0, t1] (89) Lemma E.6. Consider two trajectories, Γy0 = {y(t)|t T, y(0) = y0} and Γx0 = {x(t)|t T, x(0) = x0}, with initial conditions x0 and y0 such that x0 = y0. We further assume these trajectories arise from two evolution operators associated to the VFs f, g, Φf, Φg : R M M with x(t) = Φf(t, x0) and y(t) = Φg(t, y0). 
If, for all times T R and Borel sets B M, the occupation measures associated with these trajectories satisfy µx0,T (B) = µy0,T (B) B, T R, (90) then the underlying trajectories are identical: Γx0 = Γy0. (91) Proof. The proof of this Lemma E.6 was adapted from Dawkins (2024). As M is compact and Hausdorff and the Lebesgue measure is regular, Eq. (90) is equal to a different condition. Let φ be a bounded continuous function φ : M R, where we denote the set of all bounded continuous functions by C0(M). Then, µx0,T = µy0,T implies Z T 0 φ(x(s)) ds = Z T 0 φ(y(s)) ds, T 0, φ C0(M) (92) where x(t) and y(t) are the trajectories corresponding to Φf and Φg. Differentiating with respect to T yields: φ(x(T)) = φ(y(T)), T 0. (93) Bounded continuous functions separate points, which means that x1 = x2 M and φ C0(M) : φ(x1) = φ(x2). It follows that x(T) = y(T), T 0. (94) 6Note that in this proof we deviate from our standard notation in that θ does not refer to the full matrix θ Rm n as in Eq. (14), but just to a vector in Rm. Out-of-Domain Generalization in Dynamical Systems Reconstruction Proof of Theorem 4.1 Based on these four lemmas, we can now prove Theorem 4.1. Proof. For the sake of simplicity, we focus on a bistable system where Mtrain = B(A1) and Mtest = B(A2), but the generalization to n basins immediately follows. Let us denote the ground truth VF by f GT BL and write f GT (x; θGT ) to make the parameter dependence explicit. Now, consider an arbitrary element g in BL,0 = {g BL|EMtrain gen (Φg) = 0}, where Φg denotes the evolution operator associated with g. g BL,0 implies EB(A1) stat (Φg) = 0. It holds that EB(A1) stat (Φg) = 0 µΦf x0,T = µΦg x0,T , (95) based on SW1 being a metric on the space of measures (Kolouri et al., 2016). This equality holds for all T If x0 and for all x0 B(A1). According to Lemma E.6, this implies that all trajectories in B(A1) of f GT and g coincide. Further, applying Lemma E.5 and considering the assumption that at least one trajectory does not satisfy an algebraic equation, which we will denote by Γ x 0 (with graph Ω x 0), we can uniquely determine the parameters θ of g(x; θ) BL via π1(σ 1(Ω x 0)). Since g and f GT share the same non-algebraic trajectory, f and g must be identical as π1(σ 1(Ωx0)) only contains one element (Lemma E.5). Consequently, BL,0 = {f GT }. (96) Thus, BL,0 solely consists of f GT . Since f GT has zero generalization error on M, it follows that any model in BL,0 exhibits zero generalization error on M. Therefore, (BL, D) is strictly learnable. Corollary E.7. When the observed trajectory Γx0 D solves an algebraic equation, we can nevertheless restrict BL to B L BL in a way that (B L, D) is strictly learnable. Proof. If Γx0 solves an algebraic equation, this means that ker( Ψ) = {0} (see Lemma E.5). Hence, Eq. (78) Ψ θj = Xj (97) does not have a unique solution for θj = 0. However, for linear function spaces, we can use the fundamental theorem of homomorphisms to make Ψ\ ker( Ψ) im Ψ an isomorphism. Then, the system Ψ θ j = X j (98) has a unique solution, where Ψ corresponds to a matrix where some basis functions ψi,j are removed. In the corresponding coefficient vector θ j Rm the coefficients for the basis ψi,j are removed as well, hence m < m. By this process of removing basis functions ψi,j from Ψ, thus restricting BL to B L, we can make the solution unique again. E.3. Proof of Theorem 4.2 Proof. Without loss of generality, we assume we have a bistable system, with f the ground truth VF, Mtrain = B(A1) and Mtest = B(A2). 
E.3. Proof of Theorem 4.2

Proof. Without loss of generality, we assume a bistable system with ground-truth VF $f$, $M_{train} = B(A_1)$ and $M_{test} = B(A_2)$. We aim to construct an infinite family of functions $G = \{g_\alpha\,|\,\alpha \in I\}$ with zero reconstruction error on $B(A_1)$ but non-zero error on $B(A_2)$. We denote by $\Phi_{g_\alpha}$ the evolution operator associated with $g_\alpha$. Let $V = B(A_2)\setminus A_2$. Due to the assumption that $f$ is not topologically transitive on $B(A_2)$, we can choose some non-empty, open subset $V' \subseteq V$. As $M$ is locally Hausdorff, we can define a compact subset $K \subset V'$ such that $K$ itself contains an open set. As $V'$ is open, we can define a bump function $\Lambda \in C^\infty(M)$ (Lee, 2012) that is zero outside of $V'$ and equals 1 on $K$. The $i$-th component of a VF in $G$ is then defined as
$$g_{\alpha,i}(x) = f_i(x) + \Lambda(x)\left(-f_i(x) + s_{\alpha,i}(x)\right), \quad i = 1,\dots,n, \quad (99)$$
where $x_\alpha \in K$. As $K$ contains an open set, there are infinitely many different $x_\alpha$. By construction, $E^{B(A_1)}_{gen}(\Phi_{g_\alpha}) = 0$, since $\Lambda$ is zero outside of $V'$ and thus zero on $B(A_1)$; on $B(A_1)$, $g_\alpha$ has the form $g_{\alpha,i}(x) = f_i(x)$ ($f$ has zero generalization error as it is the ground-truth VF). In contrast, on $K$ the VFs in $G$ reduce to
$$g_\alpha(x) = s_\alpha(x), \quad (100)$$
where we assume $s_\alpha$ to be a differentiable VF with an attracting equilibrium point at $x_\alpha$. This construction is feasible on any compact set $K$ in any dimension. Given that $\Lambda$ is smooth and $s_\alpha$ is differentiable, $g_\alpha$ remains a member of $\mathcal{X}^1$.

Thus, $G$ forms an infinite family of VFs, each possessing an attracting equilibrium point at a distinct position in $K$. We will denote by $B(A_{x_\alpha}) \subseteq V'$ the basin of attraction of the equilibrium point $x_\alpha$. All functions $g_\alpha$ have an attractor on $B(A_2)$ different from the one of the ground-truth VF $f$. Additionally, $B(A_{x_\alpha}) \cap A_2 = \emptyset$ by construction of $V$ for any $\alpha$. Thus, we have constructed the same setting encountered in the proof of Theorem 3.3, case (ii). Using that proof and setting $B(A_{n+1})$ (in the notation of Theorem 3.3, case (ii)) to $B(A_{x_\alpha})$, we can conclude that
$$E^{B(A_{x_\alpha})}_{gen}(\Phi_{g_\alpha}) \geq \varepsilon \quad (101)$$
with $\varepsilon > 0$. Since $V' \subseteq B(A_2)$, it follows that $E^{B(A_2)}_{gen}(\Phi_{g_\alpha}) \neq 0$ for all $\alpha \in I$. Consequently, $(\mathcal{X}^1, D)$ is not strictly learnable, since we can construct infinitely many functions in $\mathcal{X}^1$ with non-zero generalization error. (For this conclusion, a single VF as constructed above would in fact have sufficed.)
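The bump-function construction of Eq. (99) can be illustrated numerically. Below is a hypothetical 2D sketch in which the ground-truth field $f$, the bump radii and the location $x_\alpha$ are arbitrary illustrative choices, not taken from the paper: outside the support of $\Lambda$ the modified field $g_\alpha$ coincides with $f$, while on $K$ it reduces to $s_\alpha$ and thus has an attracting equilibrium at $x_\alpha$.

```python
import numpy as np

def bump(x, center, r_inner=0.5, r_outer=1.0):
    """Smooth bump Lambda: 1 on the ball of radius r_inner around `center`
    (standing in for K), 0 outside radius r_outer (outside V'), smooth in between."""
    d = np.linalg.norm(x - center)
    if d <= r_inner:
        return 1.0
    if d >= r_outer:
        return 0.0
    s = (d - r_inner) / (r_outer - r_inner)
    h = lambda u: np.exp(-1.0 / u) if u > 0 else 0.0   # standard smooth transition
    return h(1.0 - s) / (h(1.0 - s) + h(s))

def f(x):
    """Illustrative ground-truth vector field (stand-in for f_GT)."""
    r2 = x[0] ** 2 + x[1] ** 2
    return np.array([-x[1] + x[0] * (1 - r2), x[0] + x[1] * (1 - r2)])

def s_alpha(x, x_alpha):
    """Differentiable field with an attracting equilibrium point at x_alpha."""
    return -(x - x_alpha)

def g_alpha(x, x_alpha, bump_center):
    """Eq. (99): g_alpha = f + Lambda * (-f + s_alpha)."""
    lam = bump(x, bump_center)
    return f(x) + lam * (-f(x) + s_alpha(x, x_alpha))

x_alpha = np.array([10.1, 0.0])
center = np.array([10.0, 0.0])
# Outside the bump region the modified field coincides with f ...
print(np.allclose(g_alpha(np.array([5.0, 0.0]), x_alpha, center), f(np.array([5.0, 0.0]))))
# ... while on K it reduces to s_alpha, so x_alpha is an attracting equilibrium.
print(g_alpha(x_alpha, x_alpha, center))   # -> [0., 0.]
```

Varying $x_\alpha$ within the inner ball yields the infinite family $G$ of the proof: every member agrees with $f$ on the training basin yet carries a spurious attractor in the unobserved basin.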
Figure A17. Comparison of OODG in DSR and standard ML. In standard ML, the data-generating process is the true distribution $p(x,y)$ on an input and output space $\mathcal{X} \times \mathcal{Y}$; the training data $\{(x_i, y_i)\}_{i=1}^N$ are points drawn from a training distribution $p_{train}(x,y)$; in OODG the test distribution differs from the training distribution, e.g. as covariate shift ($p_{test}(y|x) = p_{train}(y|x)$, $p_{test}(x) \neq p_{train}(x)$) or label shift ($p_{test}(x|y) = p_{train}(x|y)$, $p_{test}(y) \neq p_{train}(y)$), besides other forms such as concept drift or even more general versions of a distribution shift; and, taking $f$ as a trained neural network, one generally tries to bound the test vs. the training error by the distance between the test and training distributions given some formal distance measure $d$, $|R_{test}(f) - R_{train}(f)| < d(p_{test}, p_{train})$. In DSR, the data-generating process is the flow $\Phi: T \times M \to M$ on some manifold; note that no stochasticity is assumed, as the OODG problem in DSR persists without noise in the data. The training data take the form of trajectories $\bigcup_{x_0 \in X_0} \Phi([0,T], X_0)$, $X_0 \subseteq M_{train}$, whose initial conditions come from the training domain. The equivalent object to the test distribution is the flow on a subset $M_{test} \subseteq M$ of the state space which was not seen in training; a distribution shift occurs when the dynamics of $\Phi$ on $M_{test}$ differ from the dynamics on $M_{train}$, which is particularly the case if $\Phi$ is multistable, since then $M_{train}$ and $M_{test}$ contain different basins and the data (trajectories) in $M_{test}$ can follow a different dynamical law than on $M_{train}$ (see Sect. 3.1). Finally, the role of the generalization gap is taken on by the learnability distribution $p(\varepsilon_{gen}|D) = \frac{1}{\mathrm{vol}(\Theta_0)} \int_{\Theta_0} \mathbb{I}\big[E^{M_{test}}_{gen}(\Phi_\theta) = \varepsilon_{gen}\big]\,d\theta$ (see Sect. 3.3): bounds in terms of the distance of two distributions are meaningless as we do not (necessarily) have any noise in our data, so such bounds are instead based on topological and ergodic properties of the flows.
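In simple settings, the learnability distribution appearing in Fig. A17 can be probed by Monte-Carlo sampling over the zero-training-error set $\Theta_0$. The sketch below is purely illustrative; the toy model family, the training and test domains, and the acceptance threshold are assumptions, not the paper's experimental setup. It rejection-samples parameters that fit the training domain and histograms their OOD error, showing that near-zero in-domain error does not pin down the error on an unobserved domain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setting: model family dx/dt = theta_1*x + theta_2*x^3,
# ground truth dx/dt = x - x^3, training domain [-0.5, 0.5], test domain [1.5, 3].
x_train = np.linspace(-0.5, 0.5, 200)
x_test = np.linspace(1.5, 3.0, 200)
f_true_train = x_train - x_train ** 3
f_true_test = x_test - x_test ** 3

def mse(theta, x, f_true):
    """Squared vector-field error of the model on a given domain."""
    return np.mean((theta[0] * x + theta[1] * x ** 3 - f_true) ** 2)

# Monte-Carlo approximation of the learnability distribution p(eps_gen | D):
# sample parameters from a box, keep those with (near-)zero training error (Theta_0),
# and record their generalization error on the unobserved test domain.
ood_errors = []
while len(ood_errors) < 2000:
    theta = rng.uniform(-3.0, 3.0, size=2)
    if mse(theta, x_train, f_true_train) < 1e-2:          # theta effectively in Theta_0
        ood_errors.append(mse(theta, x_test, f_true_test))

hist, edges = np.histogram(ood_errors, bins=30)            # empirical p(eps_gen | D)
print("fraction of near-perfect training fits that also generalize OOD:",
      np.mean(np.array(ood_errors) < 1e-2))
```

Most parameter vectors that fit the small training interval extrapolate poorly to the test interval, so the empirical learnability distribution places most of its mass at large OOD errors, in line with the qualitative picture developed in the main text.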