# Neural Oscillators are Universal

Samuel Lanthaler (California Institute of Technology, slanth@caltech.edu), T. Konstantin Rusch (ETH Zurich), Siddhartha Mishra (ETH Zurich)

**Abstract.** Coupled oscillators are being increasingly used as the basis of machine learning (ML) architectures, for instance in sequence modeling, graph representation learning and in physical neural networks that are used in analog ML devices. We introduce an abstract class of neural oscillators that encompasses these architectures and prove that neural oscillators are universal, i.e., they can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. This universality result provides theoretical justification for the use of oscillator-based ML systems. The proof builds on a fundamental result of independent interest, which shows that a combination of forced harmonic oscillators with a nonlinear read-out suffices to approximate the underlying operators.

## 1 Introduction

Oscillators are ubiquitous in the sciences and engineering [12, 30]. Prototypical examples include pendulums in mechanics, feedback and relaxation oscillators in electronics, business cycles in economics, and heartbeat and circadian rhythms in biology. Particularly relevant to our context is the fact that the neurons in our brain can be thought of as oscillators, on account of the periodic spiking and firing of the action potential [29, 11]. Consequently, functional brain circuits such as cortical columns are being increasingly analyzed in terms of networks of coupled oscillators [29].

Given this wide prevalence of (networks of) oscillators in nature and man-made devices, it is not surprising that oscillators have inspired various machine learning architectures in recent years. Prominent examples include the CoRNN [27] and UnICORNN [28] recurrent neural networks for sequence modeling. CoRNN is based on a network of coupled, forced and damped oscillators, whereas UnICORNN is a multi-layer sequence model that stacks networks of independent undamped oscillators as hidden layers within an RNN. Both these architectures were rigorously shown to mitigate the exploding and vanishing gradient problem [20] that plagues RNNs. Hence, both CoRNN and UnICORNN performed very well on sequence learning tasks with long-term dependencies. Another example of the use of oscillators in machine learning is provided by GraphCON [26], a framework for designing graph neural networks (GNNs) [3] that is based on coupled oscillators. GraphCON was also shown to ameliorate the oversmoothing problem [25] and to allow for the deployment of multi-layer deep GNNs. Other examples include Second Order Neural ODEs (SONODEs) [19], which can be interpreted as oscillatory neural ODEs, locally coupled oscillatory recurrent networks (LocoRNN) [17], and the Oscillatory Fourier Neural Network (O-FNN) [13].

Another avenue where ML models based on oscillators arise is that of physical neural networks (PNNs) [35], i.e., physical devices that perform machine learning on analog (beyond digital) systems. Such analog systems have been proposed as alternatives or accelerators to the current paradigm of machine learning on conventional electronics, allowing us to significantly reduce the prohibitive energy costs of training state-of-the-art ML models.
In [35], the authors propose a variety of physical neural networks, which include a mechanical network of multi-mode oscillations on a plate, electronic circuits of oscillators, as well as a network of nonlinear oscillators. Coupled with a novel physics-aware training (PAT) algorithm, the authors of [35] demonstrated that their nonlinear oscillatory PNN achieved very good performance on challenging benchmarks such as Fashion-MNIST [36]. Moreover, other oscillatory systems such as coupled lasers and spintronic nano-oscillators have also been proposed as possible PNNs; see [33] for an example of the use of thermally coupled vanadium dioxide oscillators for image recognition, and [24, 32] for the use of spin-torque nano-oscillators for speech recognition and for neuromorphic computing, respectively.

What is the rationale behind the successful use of (networks of) oscillators in many different contexts in machine learning? The authors of [27] attribute it to the inherent stability of oscillatory dynamics, as the state (and its gradients) of an oscillatory system remains within reasonable bounds throughout the time-evolution of the system. However, this is at best a partial explanation, as it does not demonstrate why oscillatory dynamics can learn (approximate) mappings between inputs and outputs, rather than bias the learned states towards oscillatory functions. As an example, consider the problem of classifying MNIST [18] (or Fashion-MNIST) images. It is completely unclear whether the inputs (vectors of pixel values), the outputs (class probabilities) and the underlying mapping possess any (periodic) oscillatory structure. Consequently, how can oscillatory RNNs (such as CoRNN and UnICORNN) or a network of oscillatory PNNs learn the underlying mapping?

Our main aim in this paper is to answer this very question on the ability of neural networks based on oscillators to express (approximate) arbitrary mappings. To this end,

- We introduce an abstract framework of neural oscillators that encompasses both sequence models such as CoRNN and UnICORNN, as well as variants of physical neural networks such as the ones proposed in [35]. These neural oscillators are defined in terms of second-order versions of neural ODEs [4], and combine nonlinear dynamics with a linear read-out.
- We prove a universality theorem for neural oscillators by showing that they can approximate, to any given tolerance, continuous operators between appropriate function spaces. Our proof of universality is based on a novel theoretical result of independent interest, termed the fundamental lemma, which implies that a suitable combination of linear oscillator dynamics with a nonlinear read-out suffices for universality.

Such universality results, see [1, 5, 14, 21] and references therein, have underpinned the widespread use of traditional neural networks (such as multi-layer perceptrons and convolutional neural networks). Hence, our universality result establishes a firm mathematical foundation for the deployment of neural networks based on oscillators in myriad applications. Moreover, our constructive proof provides insight into how networks of oscillators can approximate a large class of mappings.

## 2 Neural Oscillators

**General form of neural oscillators.**
Given u : [0, T] → ℝ^p as an input signal, for any final time T ∈ ℝ_+, we consider the following system of neural ODEs for the evolution of dynamic hidden variables y ∈ ℝ^m, coupled to a linear read-out yielding the output z ∈ ℝ^q:

$$
\ddot{y}(t) = \sigma\big(W y(t) + V u(t) + b\big), \qquad y(0) = \dot{y}(0) = 0, \qquad z(t) = A y(t) + c. \tag{2.1}
$$

Equation (2.1) defines an input-/output-mapping u(t) ↦ z(t), with time-dependent output z : [0, T] → ℝ^q. Specification of this system requires a choice of the hidden variable dimension m and of the activation function σ. The resulting mapping u ↦ z depends on tunable weight matrices $W \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{m\times p}$, $A \in \mathbb{R}^{q\times m}$ and bias vectors $b \in \mathbb{R}^m$, $c \in \mathbb{R}^q$. For simplicity of exposition, we consider only smooth activation functions σ with σ(0) = 0 and σ'(0) = 1, such as tanh or sin, although more general activation functions can readily be considered. This general second-order neural ODE system (2.1) will be referred to as a neural oscillator.

**Multi-layer neural oscillators.** As a special case of neural oscillators, we consider the following much sparser class of second-order neural ODEs:

$$
y^0(t) := u(t), \qquad \ddot{y}^\ell(t) = \sigma\big(w^\ell \odot y^\ell(t) + V^\ell y^{\ell-1}(t) + b^\ell\big) \quad (\ell = 1,\dots,L), \qquad y^\ell(0) = \dot{y}^\ell(0) = 0, \qquad z(t) = A y^L(t) + c. \tag{2.2}
$$

In contrast to the general neural oscillator (2.1), the multi-layer neural oscillator (2.2) defines a hierarchical structure: the solution $y^\ell \in \mathbb{R}^{m_\ell}$ at level ℓ solves a second-order ODE with driving force $y^{\ell-1}$, and the lowest level, $y^0 = u$, is the input signal. Here, the layer dimensions $m_1,\dots,m_L$ can vary across layers, the weights $w^\ell \in \mathbb{R}^{m_\ell}$ are vectors acting by componentwise multiplication (denoted ⊙), $V^\ell \in \mathbb{R}^{m_\ell\times m_{\ell-1}}$ is a weight matrix, and $b^\ell \in \mathbb{R}^{m_\ell}$ is the bias. Given the result of the final layer, $y^L$, the output signal is obtained by an affine output layer $z(t) = A y^L(t) + c$. In the multi-layer neural oscillator, the matrices $V^\ell$, A and the vectors $w^\ell$, $b^\ell$ and c are the trainable parameters.

The system (2.2) is a special case of (2.1), since it can be written in the form (2.1) with $y := [y^L, y^{L-1}, \dots, y^1]^T$, $b := [b^L,\dots,b^1]^T$, and an upper-triangular block-matrix structure for W and V,

$$
W = \begin{pmatrix}
\mathrm{diag}(w^L) & V^L & 0 & \cdots & 0 \\
0 & \mathrm{diag}(w^{L-1}) & V^{L-1} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \mathrm{diag}(w^2) & V^2 \\
0 & \cdots & \cdots & 0 & \mathrm{diag}(w^1)
\end{pmatrix},
\qquad
V = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ V^1 \end{pmatrix}. \tag{2.3}
$$

Given this sparse block structure of the underlying weight matrices, it is clear that the multi-layer neural oscillator (2.2) is a much sparser representation than the general neural oscillator (2.1). Moreover, one can observe from the structure of the neural ODE (2.2) that, within each layer, the individual neurons are causally independent of each other.

Assuming that $w^\ell_i \neq 0$ for all $1 \le i \le m_\ell$ and all $1 \le \ell \le L$, we further highlight that the multi-layer neural oscillator (2.2) is a Hamiltonian system, with the layer-wise, time-dependent Hamiltonian

$$
H(y^\ell, \dot{y}^\ell, t) = \frac{1}{2}\,\|\dot{y}^\ell\|^2 - \sum_{i=1}^{m_\ell} \frac{1}{w^\ell_i}\,\hat{\sigma}\big(w^\ell_i y^\ell_i + (V^\ell y^{\ell-1})_i + b^\ell_i\big), \tag{2.5}
$$

with $\hat\sigma$ the antiderivative of σ, and with $\|x\|^2 = \langle x, x\rangle$ denoting the Euclidean norm of a vector x ∈ ℝ^m and $\langle\cdot,\cdot\rangle$ the corresponding inner product. Hence, any symplectic discretization of the multi-layer neural oscillator (2.2) results in a fully reversible model, which can first be leveraged in the context of normalizing flows [23], and second leads to memory-efficient training, as the intermediate states (i.e., $y^\ell(t_0), \dot{y}^\ell(t_0), y^\ell(t_1), \dot{y}^\ell(t_1), \dots, y^\ell(t_N), \dot{y}^\ell(t_N)$, for some time discretization $t_0, t_1, \dots, t_N$ of length N) do not need to be stored and can be reconstructed during the backward pass. This potentially leads to a drastic memory saving of O(N) during training.
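To make the abstract definition concrete, the following is a minimal numerical sketch, not taken from the paper, of a forward pass through the multi-layer neural oscillator (2.2), discretized with a semi-implicit (symplectic) Euler step; the layer widths, weights, step size and input signal are illustrative choices.

```python
import numpy as np

def oscillator_layer(forcing, w, V, b, dt, sigma=np.tanh):
    """Integrate y'' = sigma(w * y + V @ forcing + b) from rest, y(0) = y'(0) = 0.

    forcing: array of shape (n_steps, m_prev), the driving signal y^{l-1}(t_k).
    Returns the trajectory y^l(t_k), shape (n_steps, m).
    """
    n_steps = forcing.shape[0]
    m = w.shape[0]
    y = np.zeros((n_steps, m))
    v = np.zeros(m)                          # velocity dy/dt
    for k in range(1, n_steps):
        accel = sigma(w * y[k - 1] + V @ forcing[k - 1] + b)
        v = v + dt * accel                   # semi-implicit (symplectic) Euler step
        y[k] = y[k - 1] + dt * v
    return y

def multilayer_oscillator(u, layers, A, c, dt):
    """Multi-layer neural oscillator (2.2): stacked layers plus affine read-out."""
    h = u                                    # y^0(t) := u(t)
    for (w, V, b) in layers:
        h = oscillator_layer(h, w, V, b, dt)
    return h @ A.T + c                       # z(t) = A y^L(t) + c

# toy usage: scalar input, two hidden layers of width 4 (random, untrained weights)
dt = 1e-2
t = np.arange(0.0, 2.0, dt)
u = np.sin(2 * np.pi * t)[:, None]           # continuous input with u(0) = 0
rng = np.random.default_rng(0)
layers = [(rng.normal(size=4), rng.normal(size=(4, 1)), np.zeros(4)),
          (rng.normal(size=4), rng.normal(size=(4, 4)), np.zeros(4))]
z = multilayer_oscillator(u, layers, A=rng.normal(size=(1, 4)), c=np.zeros(1), dt=dt)
print(z.shape)                               # (200, 1): an output signal z(t)
```

Because the step is symplectic, reversing the loop reconstructs the intermediate states, which is the mechanism behind the O(N) memory saving mentioned above.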
### 2.1 Examples of Neural Oscillators

**(Forced) harmonic oscillator.** Let p = m = q = 1, set $W = -\omega^2$ for some ω ∈ ℝ, V = 1, b = 0, and take the activation function to be the identity σ(x) = x. In this case, the neural ODE (2.1) reduces to the ODE modeling the dynamics of a forced simple harmonic oscillator [12],

$$
\ddot{y} = -\omega^2 y + u, \qquad y(0) = \dot{y}(0) = 0. \tag{2.6}
$$

Here, y is the displacement of the oscillator, ω the frequency of oscillation, and u is a forcing term that drives the motion of the oscillator. Note that (2.6) is also a particular example of the multi-layer oscillator (2.2) with L = 1. This simple example justifies our terminology of neural oscillators: in general, the hidden state y can be thought of as the vector of displacements of m coupled oscillators, which are coupled together through the weight matrix W and are forced through a forcing term u, whose effect is modulated via V and a bias term b. The nonlinear activation function mediates possible nonlinear feedback to the system on account of large displacements.

**CoRNN.** The coupled oscillatory RNN (CoRNN) architecture [27] is given by the neural ODE

$$
\ddot{y} = \sigma\big(W y + \hat{W}\dot{y} + V u + b\big) - \gamma y - \epsilon\,\dot{y}.
$$

We recover the neural oscillator (2.1) as a special case of CoRNN by setting $\hat{W} = 0$ and $\gamma = \epsilon = 0$; thus, a universality theorem for neural oscillators immediately implies a corresponding universality result for the CoRNN architecture.

**UnICORNN.** The undamped independent controlled oscillatory RNN (UnICORNN) architecture of [28, eqn. 1] recovers the multi-layer neural oscillator (2.2), in the case where the fundamental frequencies of UnICORNN are absorbed into the weight matrix W in (2.1).

**Nonlinear oscillatory PNN of [35].** In [35, SM, Sect. 4.A], the authors propose an analog machine learning device that simulates a network of nonlinear oscillators, for instance realized through coupled pendula. The resulting mathematical model is the so-called simplified Frenkel-Kontorova model [2], given by the ODE system

$$
M\ddot{\theta} = -K\sin(\theta) - C\sin(\theta) + f,
$$

where $\theta = (\theta_1,\dots,\theta_N)$ is the vector of angles across all coupled pendula, $M = \mathrm{diag}(\mu_1,\dots,\mu_N)$ is a diagonal mass matrix, f an external forcing, $K = \mathrm{diag}(k_1,\dots,k_N)$ collects the spring constants of the pendula, given by $k_i = \mu_i g/\ell$ with ℓ the pendulum length and g the gravitational acceleration, and where $C = C^T$ is a symmetric matrix with $C_{\ell\ell} = -\sum_{\ell'\neq\ell} C_{\ell\ell'}$, so that

$$
[C\sin(\theta)]_\ell = \sum_{\ell'\neq\ell} C_{\ell\ell'}\big(\sin(\theta_{\ell'}) - \sin(\theta_\ell)\big), \tag{2.7}
$$

which quantifies the coupling between different pendula. We note that this simplified Frenkel-Kontorova system can also model other coupled nonlinear oscillators, such as coupled lasers or spintronic oscillators [35]. We can bring the above system into a more familiar form by introducing the variable y according to the relationship $Py = \theta$ for a matrix P. Substitution of this ansatz yields $MP\ddot{y} = -(K + C)\sin(Py) + f$; choosing $P = M^{-1}(K + C)$, we find

$$
\ddot{y} = -\sin\big(M^{-1}(K + C)\,y\big) + f, \tag{2.8}
$$

which can be written in the form $\ddot{y} = \sigma(Wy) + f$ for $\sigma = -\sin(\cdot)$ and $W = M^{-1}(K + C)$. We next make a particular choice of C, by choosing it in the block-tridiagonal form

$$
C = \begin{pmatrix}
\gamma^L I & C^L & 0 & \cdots & 0 \\
C^{L,T} & \gamma^{L-1} I & C^{L-1} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & C^{3,T} & \gamma^2 I & C^2 \\
0 & \cdots & 0 & C^{2,T} & \gamma^1 I
\end{pmatrix},
$$

with corresponding mass matrix M in block form $M = \mathrm{diag}(\mu^L I, \mu^{L-1} I, \dots, \mu^1 I)$. Then, with $\rho^\ell := \gamma^\ell/\mu^\ell$, we have

$$
M^{-1}C = \begin{pmatrix}
\rho^L I & C^L/\mu^L & 0 & \cdots & 0 \\
C^{L,T}/\mu^{L-1} & \rho^{L-1} I & C^{L-1}/\mu^{L-1} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & C^{3,T}/\mu^2 & \rho^2 I & C^2/\mu^2 \\
0 & \cdots & 0 & C^{2,T}/\mu^1 & \rho^1 I
\end{pmatrix}.
$$

With the intent of ordering the masses of different layers such that $\mu^\ell/\mu^{\ell-1} \sim \epsilon$ is small, we now introduce an ordering parameter ε > 0.
We consider the following ordering: $\gamma^\ell,\, C^\ell,\, \mu^\ell \sim \epsilon^{\ell}$. Assuming this ordering, it follows that $\rho^\ell,\ C^\ell/\mu^\ell = O(1)$, while $C^\ell/\mu^{\ell-1} = O(\epsilon)$. This ordering of the masses introduces an effective one-way coupling: up to small terms of order ε, the matrix $M^{-1}C$ reduces to the upper-triangular block form

$$
\begin{pmatrix}
\rho^L I & V^L & 0 & \cdots & 0 \\
0 & \rho^{L-1} I & V^{L-1} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \rho^2 I & V^2 \\
0 & \cdots & \cdots & 0 & \rho^1 I
\end{pmatrix},
\qquad V^\ell := C^\ell/\mu^\ell.
$$

We note that the diagonal entries $\rho^\ell$ of $M^{-1}C$ are determined by the off-diagonal terms through the identity (2.7). The additional degrees of freedom in the (diagonal) K-matrix in (2.8) can be used to tune the diagonal weights of the resulting weight matrix $W = M^{-1}(K + C)$. Thus, physical systems such as the Frenkel-Kontorova system of nonlinear oscillators can be approximated (to leading order) by multi-layer systems of the form

$$
\ddot{y}^\ell = \sigma\big(w^\ell \odot y^\ell + V^\ell y^{\ell-1}\big) + f^\ell, \tag{2.9}
$$

with $f^\ell$ an external forcing, representing a tunable linear transformation of the external input to the system. The only formal differences between (2.9) and (2.2) are (i) the absence of a bias term in (2.9), and (ii) the fact that the external forcing appears outside of the nonlinear activation function σ in (2.9). A bias term could readily be introduced by measuring the angles represented by $y^\ell$ in a suitably shifted reference frame; physically, this corresponds to tuning the initial position $y^\ell(0)$ of the pendula, with $y^\ell(0)$ also serving as the reference value. Furthermore, in our proof of universality for (2.2), it makes very little difference whether the external forcing is applied inside the activation function, as in (2.2) resp. (2.1), or outside, as in (2.9); indeed, the first layer in our proof of universality will in fact approximate the linearized dynamics of (2.2), i.e. a forced harmonic oscillator (2.6). Consequently, a universality result for the multi-layer neural oscillator (2.2) also implies universality of variants of nonlinear oscillator-based physical neural networks, such as those considered in [35].

## 3 Universality of Neural Oscillators

In this section, we state and sketch the proof of our main result regarding the universality of neural oscillators (2.1) or, more specifically, of multi-layer oscillators (2.2). To this end, we start with some mathematical preliminaries to set the stage for the main theorem.

### 3.1 Setting

**Input signal.** We want to approximate operators Φ : u ↦ Φ(u), where u = u(t) is a time-dependent input signal over a time interval t ∈ [0, T], and Φ(u)(t) is a time-dependent output signal. We will assume that the input signal t ↦ u(t) is continuous and that u(0) = 0. To this end, we introduce the space

$$
C_0([0,T];\mathbb{R}^p) := \{\,u : [0,T]\to\mathbb{R}^p \mid t\mapsto u(t) \text{ is continuous and } u(0)=0\,\}.
$$

We will assume that the underlying operator defines a mapping $\Phi : C_0([0,T];\mathbb{R}^p)\to C_0([0,T];\mathbb{R}^q)$. The approximations we discuss in this work are based on oscillatory systems starting from rest, which are forced by the input signal u. For such systems, the assumption that u(0) = 0 is necessary, because an oscillator starting from rest requires an (arbitrarily small) time interval to synchronize with the input signal (to "warm up"); if u(0) ≠ 0, then the oscillator cannot accurately approximate the output during this warm-up phase. This intuitive fact is also implicit in our proofs. We provide a further comment on this issue in Remark 3.2 below.

**Operators of interest.** We consider the approximation of an operator $\Phi : C_0([0,T];\mathbb{R}^p)\to C_0([0,T];\mathbb{R}^q)$, mapping a continuous input signal u(t) to a continuous output signal Φ(u)(t).
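For concreteness, here is a minimal sketch (our own, not from the paper) of one operator of this type: the solution map of a linear ODE driven by the input, $\Phi(u)(t) = \int_0^t e^{-(t-s)}u(s)\,ds$, i.e. the map $u \mapsto x$ with $\dot x = -x + u$, $x(0)=0$. The exponential kernel and decay rate are illustrative assumptions; the point is only that such a map is causal and continuous in the sense defined below.

```python
import numpy as np

def Phi(u, dt, decay=1.0):
    """A causal, continuous operator: Phi(u)(t) = int_0^t exp(-decay*(t-s)) u(s) ds,
    i.e. the solution map of x'(t) = -decay*x(t) + u(t), x(0) = 0."""
    n_steps = u.shape[0]
    out = np.zeros_like(u)
    for k in range(1, n_steps):
        s = np.arange(k + 1)
        weights = np.exp(-decay * dt * (k - s))
        out[k] = dt * weights @ u[: k + 1]   # only past values u(s), s <= t, enter
    return out

dt = 1e-2
t = np.arange(0.0, 1.0, dt)
u = np.stack([np.sin(2 * np.pi * t), t ** 2], axis=1)   # u(0) = 0, p = 2
z = Phi(u, dt)                                          # output signal Phi(u)(t)
```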
We will restrict attention to the uniform approximation of Φ over a compact set of input functions $K \subset C_0([0,T];\mathbb{R}^p)$. We will assume that Φ satisfies the following properties:

- **Φ is causal:** for any t ∈ [0, T], if $u, v \in C_0([0,T];\mathbb{R}^p)$ are two input signals such that $u|_{[0,t]} \equiv v|_{[0,t]}$, then Φ(u)(t) = Φ(v)(t); i.e., the value of Φ(u)(t) at time t does not depend on future values {u(τ) | τ > t}.
- **Φ is continuous** as an operator $\Phi : (C_0([0,T];\mathbb{R}^p), \|\cdot\|_{L^\infty}) \to (C_0([0,T];\mathbb{R}^q), \|\cdot\|_{L^\infty})$, with respect to the $L^\infty$-norm on the input-/output-signals.

Note that the class of continuous and causal operators is very general and natural in the context of mappings between sequence spaces or time-varying function spaces; see [7, 6] and references therein.

### 3.2 Universal Approximation Theorem

The universality of neural oscillators is summarized in the following theorem.

**Theorem 3.1 (Universality of the multi-layer neural oscillator).** Let $\Phi : C_0([0,T];\mathbb{R}^p)\to C_0([0,T];\mathbb{R}^q)$ be a causal and continuous operator. Let $K \subset C_0([0,T];\mathbb{R}^p)$ be compact. Then for any ε > 0, there exist hyperparameters $L, m_1,\dots,m_L$, weights $w^\ell \in \mathbb{R}^{m_\ell}$, $V^\ell \in \mathbb{R}^{m_\ell\times m_{\ell-1}}$, $A \in \mathbb{R}^{q\times m_L}$ and bias vectors $b^\ell \in \mathbb{R}^{m_\ell}$, $c \in \mathbb{R}^q$, for ℓ = 1,…,L, such that the output $z : [0,T]\to\mathbb{R}^q$ of the multi-layer neural oscillator (2.2) satisfies

$$
\sup_{t\in[0,T]} |\Phi(u)(t) - z(t)| \le \epsilon, \qquad \forall\, u \in K.
$$

It is important to observe that the sparse, independent multi-layer neural oscillator (2.2) suffices for universality in the considered class. Thus, there is no need to consider the wider class of neural oscillators (2.1), at least in this respect. We remark in passing that Theorem 3.1 immediately implies another universality result for neural oscillators, showing that they can also be used to approximate arbitrary continuous functions $F : \mathbb{R}^p \to \mathbb{R}^q$. This extension is explained in detail in SM A.

**Remark 3.2.** We note that the theorem can readily be extended to remove the requirements u(0) = 0 and Φ(u)(0) = 0. To this end, let $\Phi : C([0,T];\mathbb{R}^p)\to C([0,T];\mathbb{R}^q)$ be an operator between spaces of continuous functions, u ↦ Φ(u) on [0, T]. Fix $t_0 > 0$, and extend any input function $u : [0,T]\to\mathbb{R}^p$ to a function $E(u) \in C_0([-t_0,T];\mathbb{R}^p)$ by

$$
E(u)(t) := \begin{cases} \dfrac{t+t_0}{t_0}\, u(0), & t \in [-t_0, 0), \\[4pt] u(t), & t \in [0,T]. \end{cases}
$$

Our proof of Theorem 3.1 can readily be used to show that the oscillator system with forcing E(u), initialized at time $-t_0 < 0$, can uniformly approximate Φ(u) over the entire time interval [0, T], without requiring that u(0) = 0 or Φ(u)(0) = 0. In this case, the initial time interval $[-t_0, 0]$ provides the required warm-up phase for the neural oscillator.

**Remark 3.3.** The required compactness of the set of input signals $K \subset C_0([0,T];\mathbb{R}^p)$ in Theorem 3.1 is implied by suitable smoothness constraints; examples include uniform bounds on the Lipschitz norm, Hölder norms, uniform bounds on higher-order derivatives, or the assumption that the input signals are uniformly bounded and band-limited. The proposed compactness assumption subsumes all of these smoothness constraints, resulting in a widely applicable universal approximation theorem.

**Remark 3.4.** In practice, neural ODEs such as (2.2) need to be discretized via suitable numerical schemes. As examples, CoRNN and UnICORNN were implemented in [27] and [28], respectively, with implicit-explicit time discretizations.
Nevertheless, universality also applies to such discretizations as long as the time-step is small enough: the discretization is then a sufficiently accurate approximation of (2.2), and Theorem 3.1 can be used to show universality of the discretized version of the multi-layer neural oscillator (2.2).

### 3.3 Outline of the Proof

*Figure 1: Illustration of the universal 3-layer neural oscillator architecture constructed in the proof of Theorem 3.1.*

In the following, we outline the proof of the universality Theorem 3.1, while postponing the technical details to the SM. For a given tolerance ε, we explicitly construct the weights and biases of the multi-layer neural oscillator (2.2) such that the underlying operator can be approximated within the given tolerance. This construction takes place in the following steps.

**(Forced) harmonic oscillators compute a time-windowed sine transform.** Recall that the forced harmonic oscillator (2.6) is the simplest example of a neural oscillator (2.1). The following lemma, proved by direct calculation in SM B.1, shows that this forced harmonic oscillator actually computes a time-windowed variant of the sine transform at the corresponding frequency.

**Lemma 3.5.** Assume that ω ≠ 0. Then the solution of (2.6) is given by

$$
y(t) = \frac{1}{\omega}\int_0^t u(t-\tau)\,\sin(\omega\tau)\,d\tau. \tag{3.1}
$$

Given this result, for a function u we define its time-windowed sine transform as

$$
L_t u(\omega) := \int_0^t u(t-\tau)\,\sin(\omega\tau)\,d\tau. \tag{3.2}
$$

Lemma 3.5 shows that a forced harmonic oscillator computes (3.2), up to a constant factor.

**Approximation of causal operators from finite realizations of time-windowed sine transforms.** The following novel result, termed the fundamental lemma, shows that the time-windowed sine transform (3.2), composed with a suitable nonlinear function, can approximate causal operators Φ to desired accuracy; as a consequence, forced harmonic oscillators combined with a nonlinear read-out define a universal architecture in the sense of Theorem 3.1.

**Lemma 3.6 (Fundamental Lemma).** Let $\Phi : K \subset C_0([0,T];\mathbb{R}^p) \to C_0([0,T];\mathbb{R}^q)$ be a causal and continuous operator, with $K \subset C_0([0,T];\mathbb{R}^p)$ compact. Then for any ε > 0, there exist N ∈ ℕ, frequencies $\omega_1,\dots,\omega_N$ and a continuous mapping $\Psi : \mathbb{R}^{p\times N}\times[0,T^2/4]\to\mathbb{R}^q$, such that

$$
\big|\Phi(u)(t) - \Psi\big(L_t u(\omega_1),\dots,L_t u(\omega_N);\, t^2/4\big)\big| \le \epsilon,
$$

for all u ∈ K and t ∈ [0, T].

The function Ψ in Lemma 3.6 can be interpreted as a finite-dimensional encoding of the operator Φ. The main insight of the fundamental Lemma 3.6 is that the (infinite-dimensional) operator Φ can effectively be replaced by a (finite-dimensional) function Ψ, making the approximation by neural oscillators more tractable. The proof of the fundamental Lemma 3.6, detailed in SM B.2, is based on first showing that any continuous function can be reconstructed, to desired accuracy, from realizations of its time-windowed sine transform (3.2) at finitely many frequencies $\omega_1,\dots,\omega_N$ (see SM Lemma B.1). Then, we leverage the continuity of the underlying operator Φ to approximate it with a finite-dimensional function Ψ, which takes the time-windowed sine transforms as its arguments.

Given these two results, we can discern a clear strategy to prove the universality Theorem 3.1. First, we will show that a general nonlinear form of the neural oscillator (2.2) can also compute the time-windowed sine transform at arbitrary frequencies. Then, these outputs need to be processed in order to apply the fundamental Lemma 3.6 and approximate the underlying operator Φ. A small numerical sanity check of Lemma 3.5, which underpins the first step, is sketched below.
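The following sketch (our own, not from the paper) integrates the forced harmonic oscillator (2.6) with a simple symplectic Euler scheme and compares $\omega\,y(t)$ against the time-windowed sine transform $L_t u(\omega)$ computed by direct quadrature; the input signal, frequency and step size are illustrative choices, and the remaining discrepancy is discretization error only.

```python
import numpy as np

omega, dt = 3.0, 1e-3
t_grid = np.arange(0.0, 2.0, dt)
u = np.sin(np.pi * t_grid) ** 2                  # smooth input with u(0) = 0

# forced harmonic oscillator (2.6): y'' = -omega^2 y + u, started from rest
y = np.zeros_like(t_grid)
v = 0.0
for k in range(1, len(t_grid)):
    v += dt * (-omega ** 2 * y[k - 1] + u[k - 1])
    y[k] = y[k - 1] + dt * v

# time-windowed sine transform (3.2) by direct quadrature (left Riemann sum)
L = np.zeros_like(t_grid)
for k, t in enumerate(t_grid):
    tau = t_grid[: k + 1]
    L[k] = dt * np.sum(np.interp(t - tau, t_grid, u) * np.sin(omega * tau))

print(np.max(np.abs(omega * y - L)))             # small: omega * y(t) ~= L_t u(omega)
```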
Following this strategy, we will also approximate the function Ψ (mapping finite-dimensional inputs to finite-dimensional outputs) by oscillatory layers. The concrete steps are outlined below.

**Nonlinear oscillators approximate the time-windowed sine transform.** The building block of multi-layer neural oscillators (2.2) is the nonlinear oscillator

$$
\ddot{y} = \sigma(w\odot y + Vu + b). \tag{3.3}
$$

In the following lemma (proved in SM B.3), we show that even for a nonlinear activation function σ, such as tanh or sin, the nonlinear oscillator (3.3) can approximate the time-windowed sine transform.

**Lemma 3.7.** Fix ω ≠ 0. Assume that σ(0) = 0 and σ'(0) = 1. For any ε > 0, there exist w, V, b, A ∈ ℝ, such that the nonlinear oscillator (3.3), initialized at $y(0) = \dot y(0) = 0$, has output

$$
|A y(t) - L_t u(\omega)| \le \epsilon, \qquad \forall\, u \in K,\ t\in[0,T],
$$

with $L_t u(\omega)$ the time-windowed sine transform (3.2).

**Coupled nonlinear oscillators approximate time-delays.** The next step in the proof is to show that coupled oscillators can approximate time-delays of the continuous input signal. This fact will be of crucial importance in subsequent arguments. We have the following lemma (proved in SM B.4).

**Lemma 3.8.** Let $K \subset C_0([0,T];\mathbb{R}^p)$ be a compact subset. For every ε > 0 and Δt ≥ 0, there exist m ∈ ℕ, $w \in \mathbb{R}^m$, $V \in \mathbb{R}^{m\times p}$, $b \in \mathbb{R}^m$ and $A \in \mathbb{R}^{p\times m}$, such that the oscillator (3.3), initialized at $y(0) = \dot y(0) = 0$, has output

$$
\sup_{t\in[0,T]} |u(t-\Delta t) - A y(t)| \le \epsilon, \qquad \forall\, u \in K,
$$

where u(t) is extended to negative values t < 0 by zero.

**Two-layer neural oscillators approximate neural networks pointwise.** As in the strategy outlined above, the final ingredient in our proof of the universality Theorem 3.1 is to show that neural oscillators can approximate continuous functions, such as the Ψ in the fundamental Lemma 3.6, to desired accuracy. To this end, we first show that neural oscillators can approximate general neural networks (perceptrons), and then use the universality of neural networks in the class of continuous functions to prove the desired result. The main difficulty here is that, due to the underlying oscillatory dynamics, it is not clear whether the output of even a shallow neural network can be reproduced by neural oscillators. The following lemma, visually represented by the bottom half of Figure 1, guarantees this.

**Lemma 3.9.** Let $K \subset C_0([0,T];\mathbb{R}^p)$ be compact. For matrices Σ, Λ and bias γ, and any ε > 0, there exists a two-layer (L = 2) oscillator (2.2), initialized at $y^\ell(0) = \dot y^\ell(0) = 0$, ℓ = 1, 2, such that

$$
\sup_{t\in[0,T]} \big| A y^2(t) + c - \Sigma\,\sigma(\Lambda u(t) + \gamma)\big| \le \epsilon, \qquad \forall\, u \in K.
$$

The proof, detailed in SM B.5, is constructive, and the neural oscillator that we construct has two layers. The first layer processes the input through a nonlinear oscillator, and the second layer approximates the second derivative (in time) from time-delayed versions of this signal, constructed as in Lemma 3.8.

**Combining the ingredients to prove the universality Theorem 3.1.** The ingredients constructed above are combined in SM B.6 to prove the universality theorem. In this proof, we explicitly construct a three-layer neural oscillator (2.2) which approximates the underlying operator Φ. The first layer follows the construction of Lemma 3.7 to approximate the time-windowed sine transform (3.2), for as many frequencies as are required in the fundamental Lemma 3.6. The second and third layers imitate the construction of Lemma 3.9 to approximate a neural network (perceptron), which in turn, by the universal approximation property of neural networks, approximates the function Ψ in Lemma 3.6 to desired accuracy. A schematic numerical rendering of this pipeline is sketched below.
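The following sketch is a simplified, purely illustrative rendering of the construction (not the exact proof): a bank of forced harmonic oscillators produces the features $L_t u(\omega_1),\dots,L_t u(\omega_N)$, and a pointwise nonlinear read-out, here a generic small MLP with random, untrained weights standing in for the approximation of Ψ, maps these features together with $t^2/4$ to a candidate output signal. The frequencies and widths are assumptions made for the example only.

```python
import numpy as np

def sine_features(u, dt, omegas):
    """Layer 1: a bank of forced harmonic oscillators whose rescaled states equal
    the time-windowed sine transforms L_t u(omega_j) (cf. Lemmas 3.5 and 3.7)."""
    n_steps = len(u)
    feats = np.zeros((n_steps, len(omegas)))
    for j, om in enumerate(omegas):
        y, v = 0.0, 0.0
        for k in range(1, n_steps):
            v += dt * (-om ** 2 * y + u[k - 1])   # y'' = -om^2 y + u, from rest
            y += dt * v
            feats[k, j] = om * y                  # = L_t u(om), up to discretization
    return feats

def readout(feats, t_grid, W1, b1, W2, b2):
    """Layers 2-3: pointwise nonlinear read-out standing in for Psi of Lemma 3.6."""
    x = np.concatenate([feats, (t_grid[:, None] ** 2) / 4.0], axis=1)
    return np.tanh(x @ W1 + b1) @ W2 + b2

# toy end-to-end pipeline with an untrained (random) read-out
dt = 1e-3
t_grid = np.arange(0.0, 1.0, dt)
u = t_grid * np.sin(4 * np.pi * t_grid)           # input with u(0) = 0
omegas = np.pi * np.arange(1, 9)                  # N = 8 illustrative frequencies
feats = sine_features(u, dt, omegas)
rng = np.random.default_rng(1)
z = readout(feats, t_grid, rng.normal(size=(9, 16)), np.zeros(16),
            rng.normal(size=(16, 1)), np.zeros(1))
print(z.shape)                                    # (1000, 1): candidate output signal
```

In the actual proof, the read-out itself is also realized by oscillatory layers (Lemma 3.9), so that the full model stays within the class (2.2).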
Putting the network together leads to a three-layer oscillator that approximates the continuous and causal operator Φ. This construction is depicted in Figure 1.

## 4 Discussion

Machine learning architectures based on networks of coupled oscillators are being increasingly used, for instance sequence models such as CoRNN [27] and UnICORNN [28], graph neural networks such as GraphCON [26] and, increasingly, so-called physical neural networks (PNNs) such as linear and nonlinear mechanical oscillators [35] and spintronic oscillators [24, 32]. A priori, it is unclear why ML systems based on oscillators can provide competitive performance on a variety of learning benchmarks, e.g. [27, 28, 26, 35], rather than biasing their outputs towards oscillatory functions. In order to address these concerns about their expressivity, we have investigated the theoretical properties of machine learning systems based on oscillators.

Our main aim was to answer a fundamental question: are coupled-oscillator based machine learning architectures universal? In other words, can these architectures, in principle, approximate a large class of input-output maps to desired accuracy? To answer this question, we introduced an abstract framework of neural oscillators (2.1) and its particular instantiation, the multi-layer neural oscillator (2.2). This abstract class of second-order neural ODEs encompasses both sequence models such as CoRNN and UnICORNN, as well as a very general and representative PNN based on the so-called Frenkel-Kontorova model. The main contribution of this paper is the universality Theorem 3.1 on the ability of multi-layer neural oscillators (2.2) to approximate, to desired accuracy, a large class of operators, namely causal and continuous maps between spaces of continuous functions. Despite the fact that the considered neural oscillators possess a very specific and constrained structure, not even encompassing general Hamiltonian systems, the approximated class of operators is nevertheless very general, including solution operators of general ordinary and even time-delay differential equations.

The crucial theoretical ingredient in our proof was the fundamental Lemma 3.6, which implies that linear oscillator dynamics combined with a pointwise nonlinear read-out suffices for universal operator approximation; our construction can correspondingly be thought of as a large number of linear processors coupled with nonlinear read-outs. This construction could have implications for other models, such as structured state space models [9, 8, 10], which follow a similar paradigm, and the extension of our universality results to such models could be of great interest.

Our universality result has many interesting implications. To start with, we rigorously prove that an ML architecture based on coupled oscillators can approximate a very large class of operators. This provides theoretical support to many widely used sequence models and PNNs based on oscillators. Moreover, given the generality of our result, we hope that it can spur the design of innovative architectures based on oscillators, particularly in the realm of analog devices used as ML inference systems or ML accelerators [35].

It is also instructive to lay out some of the limitations of the current article and point to avenues for future work. In this context, our setup currently only considers time-varying functions as inputs and outputs.
Roughly speaking, these inputs and outputs have the structure of (infinite) sequences. However, a large class of learning tasks can be reconfigured to take sequential inputs and outputs. These include text (as evident from the tremendous success of large language models [22]), DNA sequences, images [16], time series and (offline) reinforcement learning [15]. Nevertheless, a natural next step would be to extend such universality results to inputs (and outputs) which have some spatial or relational structure, for instance by considering functions which have a spatial dependence or which are defined on graphs. On the other hand, the class of operators that we consider, i.e., causal and continuous ones, is not only natural in this setting but also very general [7, 6].

Another limitation lies in the feed-forward structure of the multi-layer neural oscillator (2.2). As mentioned before, most physical (and neurobiological) systems exhibit feedback loops between their constituents, whereas such loops are not common in ML systems. In fact, we had to use a mass ordering in the Frenkel-Kontorova system of coupled pendula (2.8) in order to recast it in the form of the multi-layer neural oscillator (2.2). Such asymptotic ordering may not be possible for arbitrary physical neural networks. Exploring how such ordering mechanisms might arise in physical and biological systems, in order to effectively give rise to a feed-forward system, could be very interesting. One possible mechanism for coupled oscillators that can lead to a hierarchical structure is synchronization; see [34, 31] and references therein. How such synchronization interacts with universality is a very interesting question and will serve as an avenue for future work.

Finally, universality is arguably necessary but far from sufficient to analyze the performance of any ML architecture. Other aspects such as trainability and generalization are equally important, and we do not address these issues here. We do mention that trainability of oscillatory systems would profit from the fact that oscillatory dynamics are (gradient) stable; this formed the basis of the proofs of mitigation of the exploding and vanishing gradient problem for CoRNN in [27] and UnICORNN in [28], as well as GraphCON in [26]. Extending these results to the general second-order neural ODE (2.2), for instance through an analysis of the associated adjoint system, is left for future work.

## References

[1] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 1993.

[2] O. M. Braun and Y. S. Kivshar. Nonlinear dynamics of the Frenkel-Kontorova model. Physics Reports, 306:1-108, 1998.

[3] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv:2104.13478, 2021.

[4] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571-6583, 2018.

[5] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, Dec 1989.

[6] Lukas Gonon, Lyudmila Grigoryeva, and Juan-Pablo Ortega. Risk bounds for reservoir computing. The Journal of Machine Learning Research, 21(1):9684-9744, 2020.

[7] Lyudmila Grigoryeva and Juan-Pablo Ortega. Echo state networks are universal. Neural Networks, 108:495-508, 2018.

[8] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré.
HiPPO: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems, pages 1474-1487, 2020.

[9] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.

[10] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your HiPPO: State space models with generalized orthogonal basis projections. In International Conference on Learning Representations, 2022.

[11] B.-M. Gu, H. van Rijn, and W. K. Meck. Oscillatory multiplexing of neural population codes for interval timing and working memory. Neuroscience and Biobehavioral Reviews, 48:160-185, 2015.

[12] J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, New York, 1990.

[13] Bing Han, Cheng Wang, and Kaushik Roy. Oscillatory Fourier neural network: A compact and efficient architecture for sequential processing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36(6), pages 6838-6846, 2022.

[14] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[15] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273-1286, 2021.

[16] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.

[17] T. Anderson Keller and Max Welling. Locally coupled oscillatory recurrent networks learn traveling waves and topographic organization. In Cosyne Abstracts, 2023.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[19] Alexander Norcliffe, Cristian Bodnar, Ben Day, Nikola Simidjievski, and Pietro Liò. On second order behaviour in augmented neural ODEs. Advances in Neural Information Processing Systems, 33:5911-5921, 2020.

[20] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of ICML'13, pages 1310-1318. JMLR.org, 2013.

[21] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, 1999.

[22] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Preprint, 2018.

[23] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530-1538. PMLR, 2015.

[24] M. Romera et al. Vowel recognition with four coupled spin-torque nano-oscillators. Nature, 588:230-234, 2018.

[25] T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. A survey on oversmoothing in graph neural networks. arXiv:2303.10993, 2023.

[26] T. Konstantin Rusch, Ben Chamberlain, James Rowbottom, Siddhartha Mishra, and Michael Bronstein. Graph-coupled oscillator networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18888-18909. PMLR, 2022.

[27] T. Konstantin Rusch and Siddhartha Mishra.
Coupled oscillatory recurrent neural network (CoRNN): An accurate and (gradient) stable architecture for learning long time dependencies. In International Conference on Learning Representations, 2021.

[28] T. Konstantin Rusch and Siddhartha Mishra. UnICORNN: A recurrent model for learning very long time dependencies. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9168-9178. PMLR, 2021.

[29] K. M. Stiefel and G. B. Ermentrout. Neurons as oscillators. Journal of Neurophysiology, 116:2950-2960, 2016.

[30] S. Strogatz. Nonlinear Dynamics and Chaos. Westview, Boulder, CO, 2015.

[31] S. H. Strogatz. Exploring complex networks. Nature, 410:268-276, 2001.

[32] J. Torrejon et al. Neuromorphic computing with nanoscale spintronic oscillators. Nature, 547:428-431, 2017.

[33] A. Velichko, M. Belyaev, and P. Boriskov. A model of an oscillatory neural network with multilevel neurons for pattern recognition and computing. Electronics, 8, 2019.

[34] A. T. Winfree. Biological rhythms and the behavior of populations of coupled oscillators. Journal of Theoretical Biology, 16:15-42, 1967.

[35] Logan G. Wright, Tatsuhiro Onodera, Martin M. Stein, Tianyu Wang, Darren T. Schachter, Zoey Hu, and Peter L. McMahon. Deep physical neural networks trained with backpropagation. Nature, 601(7894):549-555, 2022.

[36] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

## Supplementary Material for: Neural Oscillators are Universal

### A Another universality result for neural oscillators

The universal approximation Theorem 3.1 immediately implies another universal approximation result for neural oscillators, as explained next. We consider a continuous map $F : \mathbb{R}^p \to \mathbb{R}^q$; our goal is to show that F can be approximated to given accuracy ε by suitably defined neural oscillators. Fix a time interval [0, T] for (the arbitrary choice) T = 2. Let $K_0 \subset \mathbb{R}^p$ be a compact set. Given $\xi\in\mathbb{R}^p$, we associate with it a function $u_\xi \in C_0([0,T];\mathbb{R}^p)$ by setting

$$
u_\xi(t) := t\,\xi. \tag{A.1}
$$

Clearly, the set $K := \{u_\xi \mid \xi\in K_0\}$ is compact in $C_0([0,T];\mathbb{R}^p)$. Furthermore, we can define an operator $\Phi : C_0([0,T];\mathbb{R}^p)\to C_0([0,T];\mathbb{R}^q)$ by

$$
\Phi(u)(t) := \begin{cases} 0, & t\in[0,1), \\ (t-1)\,F(u(1)), & t\in[1,T], \end{cases} \tag{A.2}
$$

where $F : \mathbb{R}^p\to\mathbb{R}^q$ is the given continuous function that we wish to approximate. One readily checks that Φ defines a causal and continuous operator. Note, in particular, that

$$
\Phi(u_\xi)(T) = (T-1)\,F(u_\xi(1)) = F(\xi)
$$

is just the evaluation of F at ξ, for any $\xi\in K_0$. Since neural oscillators can uniformly approximate the operator Φ for inputs $u_\xi\in K$, it follows as a consequence of Theorem 3.1 and (2.3) that, for any ε > 0, there exist m ∈ ℕ, matrices $W\in\mathbb{R}^{m\times m}$, $V\in\mathbb{R}^{m\times p}$ and $A\in\mathbb{R}^{q\times m}$, and bias vectors $b\in\mathbb{R}^m$, $c\in\mathbb{R}^q$, such that for any $\xi\in K_0$ the neural oscillator system

$$
\ddot{y}_\xi(t) = \sigma\big(W y_\xi(t) + t\,V\xi + b\big), \qquad y_\xi(0)=\dot y_\xi(0)=0, \qquad z_\xi(t) = A y_\xi(t) + c,
$$

satisfies

$$
|z_\xi(T) - F(\xi)| = |z_\xi(T) - \Phi(u_\xi)(T)| \le \sup_{t\in[0,T]} |z_\xi(t) - \Phi(u_\xi)(t)| \le \epsilon,
$$

uniformly for all $\xi\in K_0$. Hence, neural oscillators can be used to approximate an arbitrary continuous function $F:\mathbb{R}^p\to\mathbb{R}^q$ uniformly over compact sets. Thus, neural oscillators also provide universal function approximation.

### B Proof of Theorem 3.1

#### B.1 Proof of Lemma 3.5

*Proof.* We can rewrite

$$
y(t) = \frac{1}{\omega}\int_0^t u(\tau)\,\sin(\omega(t-\tau))\,d\tau.
$$
By direct differentiation, one readily verifies that y(t) so defined satisfies

$$
\dot{y}(t) = \int_0^t u(\tau)\cos(\omega(t-\tau))\,d\tau + \frac{1}{\omega}\big[u(\tau)\sin(\omega(t-\tau))\big]_{\tau=t} = \int_0^t u(\tau)\cos(\omega(t-\tau))\,d\tau,
$$

on account of the fact that sin(0) = 0. Differentiating once more, we find that

$$
\ddot{y}(t) = -\omega\int_0^t u(\tau)\sin(\omega(t-\tau))\,d\tau + \big[u(\tau)\cos(\omega(t-\tau))\big]_{\tau=t} = -\omega^2 y(t) + u(t).
$$

Thus y(t) solves the ODE (2.6), with initial conditions $y(0)=\dot y(0)=0$. ∎

#### B.2 Proof of the Fundamental Lemma 3.6

**Reconstruction of a continuous signal from its sine transform.** Let [0, T] ⊂ ℝ be an interval. Recall that we define the windowed sine transform $L_t u(\omega)$ of a function $u:[0,T]\to\mathbb{R}^p$ by

$$
L_t u(\omega) = \int_0^t u(t-\tau)\sin(\omega\tau)\,d\tau, \qquad \omega\in\mathbb{R}.
$$

In the following, we fix a compact set $K\subset C_0([0,T];\mathbb{R}^p)$. Note that for any u ∈ K we have u(0) = 0, and hence K can be identified with a subset of $C((-\infty,T];\mathbb{R}^p)$, consisting of functions with supp(u) ⊂ [0, T]. We consider the reconstruction of continuous functions u ∈ K and show that u can be approximately reconstructed from knowledge of $L_t u(\omega)$. More precisely, we provide a detailed proof of the following result.

**Lemma B.1.** Let $K\subset C((-\infty,T];\mathbb{R}^p)$ be compact, such that supp(u) ⊂ [0, T] for all u ∈ K. For any ε, Δt > 0, there exist N ∈ ℕ, frequencies $\omega_1,\dots,\omega_N\in\mathbb{R}\setminus\{0\}$, phase-shifts $\vartheta_1,\dots,\vartheta_N\in\mathbb{R}$ and weights $\alpha_1,\dots,\alpha_N\in\mathbb{R}$, such that

$$
\sup_{\tau\in[0,\Delta t]}\Big| u(t-\tau) - \sum_{j=1}^N \alpha_j\, L_t u(\omega_j)\,\sin(\omega_j\tau - \vartheta_j)\Big| \le \epsilon,
$$

for all u ∈ K and all t ∈ [0, T].

*Proof.* **Step 0 (Equicontinuity).** We recall the following fact from topology: if $K\subset C((-\infty,T];\mathbb{R}^p)$ is compact, then it is equicontinuous; i.e., there exists a continuous modulus of continuity $\phi:[0,\infty)\to[0,\infty)$ with φ(r) → 0 as r → 0, such that

$$
|u(t-\tau) - u(t)| \le \phi(\tau), \qquad \forall\,\tau\ge 0,\ t\in[0,T],\ u\in K. \tag{B.1}
$$

**Step 1 (Connection to the Fourier transform).** Fix $t_0\in[0,T]$ and u ∈ K for the moment. Define $f(\tau) := u(t_0-\tau)$. Note that $f\in C([0,\infty);\mathbb{R}^p)$, and f has compact support supp(f) ⊂ [0, T]. We also note that, by (B.1), we have $|f(t+\tau)-f(t)|\le\phi(\tau)$ for all τ ≥ 0, t ∈ [0, T]. We now consider the following odd extension of f to all of ℝ:

$$
F(\tau) := \begin{cases} f(\tau), & \tau\ge 0, \\ -f(-\tau), & \tau\le 0. \end{cases}
$$

Since F is odd, the Fourier transform of F is given by

$$
\hat F(\omega) := \int_{\mathbb{R}} F(\tau)e^{-i\omega\tau}\,d\tau = -i\int_{\mathbb{R}} F(\tau)\sin(\omega\tau)\,d\tau = -2i\int_0^T f(\tau)\sin(\omega\tau)\,d\tau = -2i\,L_{t_0}u(\omega).
$$

Let ε > 0 be arbitrary. Our goal is to uniformly approximate F(τ) on the interval [0, Δt]. The main complication is that F lacks regularity (it is in general discontinuous at τ = 0), and hence the inverse Fourier transform of $\hat F$ does not converge to F uniformly on this interval; instead, a more careful reconstruction based on a mollification of F is needed. We provide the details below.

**Step 2 (Mollification).** We fix a smooth, non-negative, compactly supported function $\rho:\mathbb{R}\to\mathbb{R}$ with supp(ρ) ⊂ [0, 1], ρ ≥ 0 and $\int_{\mathbb{R}}\rho(t)\,dt=1$, and we define the mollifier $\rho_\epsilon(t):=\epsilon^{-1}\rho(t/\epsilon)$. In the following, we assume throughout that ε ≤ T. We point out that supp(ρ_ε) ⊂ [0, ε], and hence the mollification $F_\epsilon(t)=(F*\rho_\epsilon)(t)$ satisfies, for t ≥ 0:

$$
|F(t)-F_\epsilon(t)| = \Big|\int_0^\epsilon \big(F(t)-F(t+\tau)\big)\rho_\epsilon(\tau)\,d\tau\Big| = \Big|\int_0^\epsilon \big(f(t)-f(t+\tau)\big)\rho_\epsilon(\tau)\,d\tau\Big| \le \sup_{\tau\in[0,\epsilon]}|f(t)-f(t+\tau)|\int_0^\epsilon\rho_\epsilon(\tau)\,d\tau \le \phi(\epsilon).
$$

In particular, this shows that $\sup_{t\in[0,T]}|F(t)-F_\epsilon(t)|\le\phi(\epsilon)$ can be made arbitrarily small, with an error that depends only on the modulus of continuity φ.

**Step 3 (Fourier inversion).** Let $\hat F_\epsilon(\omega)$ denote the Fourier transform of $F_\epsilon$. Since $F_\epsilon$ is smooth and compactly supported, we have the identity

$$
F_\epsilon(\tau) = \frac{1}{2\pi}\int_{\mathbb{R}}\hat F_\epsilon(\omega)e^{i\omega\tau}\,d\omega, \qquad \tau\in\mathbb{R},
$$

where $\omega\mapsto\hat F_\epsilon(\omega)$ decays to zero very quickly (almost exponentially) as |ω| → ∞. In fact, since $F_\epsilon = F*\rho_\epsilon$ is a convolution, we have $\hat F_\epsilon(\omega)=\hat F(\omega)\hat\rho_\epsilon(\omega)$, where $|\hat F(\omega)|\le 2\|f\|_{L^\infty}T$ is uniformly bounded, and $\hat\rho_\epsilon(\omega)$ decays quickly.
In particular, this implies that there exists an L = L(ε, T) > 0, independent of f, such that

$$
\Big|F_\epsilon(\tau) - \frac{1}{2\pi}\int_{-L}^{L}\hat F(\omega)\hat\rho_\epsilon(\omega)e^{i\omega\tau}\,d\omega\Big| \le \frac{T}{\pi}\,\|f\|_{L^\infty}\int_{|\omega|>L}|\hat\rho_\epsilon(\omega)|\,d\omega \le \|f\|_{L^\infty}\,\epsilon, \qquad \forall\,\tau\in\mathbb{R}. \tag{B.2}
$$

**Step 4 (Quadrature).** Next, we observe that, since F and ρ_ε are compactly supported, their Fourier transforms are smooth; in fact, for |τ| ≤ T, the Lipschitz constant of the mapping $\omega\mapsto\hat F(\omega)\hat\rho_\epsilon(\omega)e^{i\omega\tau}$ can be explicitly estimated by noting that

$$
\frac{d}{d\omega}\Big[\hat F(\omega)\hat\rho_\epsilon(\omega)e^{i\omega\tau}\Big] = \frac{d}{d\omega}\int_{\mathrm{supp}(F_\epsilon)} (F*\rho_\epsilon)(t)\,e^{-i\omega(t-\tau)}\,dt = -\int_{\mathrm{supp}(F_\epsilon)} i\,(t-\tau)\,(F*\rho_\epsilon)(t)\,e^{-i\omega(t-\tau)}\,dt.
$$

Taking absolute values, and noting that any t in the support of $F_\epsilon$ obeys the bound |t| ≤ T + ε ≤ 2T, while |τ| ≤ T by assumption, it follows that

$$
\mathrm{Lip}\Big(\omega\mapsto\hat F(\omega)\hat\rho_\epsilon(\omega)e^{i\omega\tau}\Big) \le (2T+T)\,\|F\|_{L^\infty}\,\|\rho_\epsilon\|_{L^1} = 3T\,\|F\|_{L^\infty}, \qquad \tau\in[0,T].
$$

It thus follows from basic results on quadrature that, for an equidistant choice of frequencies $\omega_1<\dots<\omega_N$ in [-L, L] with spacing Δω = 2L/(N-1), we have

$$
\Big|\frac{1}{2\pi}\int_{-L}^{L}\hat F(\omega)\hat\rho_\epsilon(\omega)e^{i\omega\tau}\,d\omega - \frac{\Delta\omega}{2\pi}\sum_{j=1}^N \hat F(\omega_j)\hat\rho_\epsilon(\omega_j)e^{i\omega_j\tau}\Big| \le \frac{C\,L^2\,T\,\|F\|_{L^\infty}}{N}, \qquad \tau\in[0,T],
$$

for an absolute constant C > 0, independent of F, T and N. By choosing N to be even, we can ensure that $\omega_j\neq 0$ for all j. In particular, recalling that L = L(T, ε) depends only on ε and T, and choosing N = N(T, ε) sufficiently large, we can combine the above estimate with (B.2) to ensure that

$$
\Big|F_\epsilon(\tau) - \frac{\Delta\omega}{2\pi}\sum_{j=1}^N \hat F(\omega_j)\hat\rho_\epsilon(\omega_j)e^{i\omega_j\tau}\Big| \le 2\|f\|_{L^\infty}\,\epsilon, \qquad \tau\in[0,T],
$$

where we have taken into account that $\|F\|_{L^\infty}=\|f\|_{L^\infty}$.

**Step 5 (Conclusion).** To conclude the proof, we recall that $\hat F(\omega)=-2i\,L_{t_0}u(\omega)$ can be expressed in terms of the sine transform $L_{t_0}u$ of the function u which was fixed at the beginning of Step 1. Recall also that $f(\tau)=u(t_0-\tau)$, so that $\|f\|_{L^\infty}\le\|u\|_{L^\infty}$. Hence, we can write the real part of

$$
\frac{\Delta\omega}{2\pi}\,\hat F(\omega_j)\hat\rho_\epsilon(\omega_j)e^{i\omega_j\tau} = \frac{\Delta\omega}{2\pi}\,(-2i)\,L_{t_0}u(\omega_j)\,\hat\rho_\epsilon(\omega_j)\,e^{i\omega_j\tau}
$$

in the form $\alpha_j\,L_{t_0}u(\omega_j)\sin(\omega_j\tau-\vartheta_j)$, for coefficients $\alpha_j\in\mathbb{R}$ and phase-shifts $\vartheta_j\in\mathbb{R}$ which depend only on Δω and $\hat\rho_\epsilon(\omega_j)$, but are independent of u. In particular, it follows that

$$
\sup_{\tau\in[0,\Delta t]}\Big|F(\tau) - \sum_{j=1}^N \alpha_j L_{t_0}u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big| \le \sup_{\tau\in[0,\Delta t]}|F(\tau)-F_\epsilon(\tau)| + \sup_{\tau\in[0,\Delta t]}\Big|F_\epsilon(\tau) - \sum_{j=1}^N \alpha_j L_{t_0}u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big|.
$$

By the previous steps, the first term on the right-hand side is bounded by φ(ε), while the second one is bounded by $2\sup_{u\in K}\|u\|_{L^\infty}\epsilon\le C\epsilon$, where C = C(K) < ∞ depends only on the compact set $K\subset C([0,T];\mathbb{R}^p)$. Hence, recalling that $F(\tau)=u(t_0-\tau)$ for τ ≥ 0, we have

$$
\sup_{\tau\in[0,\Delta t]}\Big|u(t_0-\tau) - \sum_{j=1}^N \alpha_j L_{t_0}u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big| \le \phi(\epsilon) + C\epsilon.
$$

In this estimate, the function u ∈ K and $t_0\in[0,T]$ were arbitrary, and the modulus of continuity φ as well as the constant C on the right-hand side depend only on the set K. It thus follows that, for this choice of $\alpha_j$, $\omega_j$ and $\vartheta_j$, we have

$$
\sup_{u\in K}\sup_{t\in[0,T]}\sup_{\tau\in[0,\Delta t]}\Big|u(t-\tau) - \sum_{j=1}^N \alpha_j L_t u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big| \le \phi(\epsilon) + C\epsilon.
$$

Since ε > 0 was arbitrary, the right-hand side can be made arbitrarily small. The claim then readily follows. ∎

The next step in the proof of the fundamental Lemma 3.6 requires the following preliminary result from functional analysis.

**Lemma B.2.** Let X, Y be Banach spaces, and let K ⊂ X be a compact subset. Assume that Φ : X → Y is continuous. Then for any ε > 0, there exists δ > 0 such that if $\|u-u_K\|_X\le\delta$ with u ∈ X, $u_K\in K$, then $\|\Phi(u)-\Phi(u_K)\|_Y\le\epsilon$.

*Proof.* Suppose not. Then there exist $\epsilon_0>0$ and sequences $u_j\in X$, $u_K^j\in K$ (j ∈ ℕ), such that $\|u_j-u_K^j\|_X\le j^{-1}$, while $\|\Phi(u_j)-\Phi(u_K^j)\|_Y\ge\epsilon_0$. By the compactness of K, we can extract a subsequence $j_k\to\infty$, such that $u_K^{j_k}$ converges to some $u_K\in K$. By the assumption on $u_j$, this implies that

$$
\|u_{j_k}-u_K\|_X \le \|u_{j_k}-u_K^{j_k}\|_X + \|u_K^{j_k}-u_K\|_X \to 0 \quad (k\to\infty),
$$

which, by the assumed continuity of Φ, leads to the contradiction $0<\epsilon_0\le\|\Phi(u_{j_k})-\Phi(u_K^{j_k})\|_Y\to 0$ as k → ∞. ∎

**Proof of Lemma 3.6.** We can now prove the fundamental lemma.

*Proof.* Let ε > 0 be given.
We can identify $K\subset C_0([0,T];\mathbb{R}^p)$ with a compact subset of $C((-\infty,T];\mathbb{R}^p)$ by extending all u ∈ K by zero for negative times, i.e., we set u(t) = 0 for t < 0. Applying Lemma B.2 with $X = C_0([0,T];\mathbb{R}^p)$ and $Y = C_0([0,T];\mathbb{R}^q)$, we can find δ > 0 such that, for any $u\in C_0([0,T];\mathbb{R}^p)$ and $u_K\in K$, we have

$$
\|u-u_K\|_{L^\infty}\le\delta \;\Longrightarrow\; \|\Phi(u)-\Phi(u_K)\|_{L^\infty}\le\epsilon. \tag{B.3}
$$

By the inverse sine transform Lemma B.1, there exist N ∈ ℕ, frequencies $\omega_1,\dots,\omega_N\neq 0$, phase-shifts $\vartheta_1,\dots,\vartheta_N$ and coefficients $\alpha_1,\dots,\alpha_N$, such that for any u ∈ K and t ∈ [0, T]:

$$
\sup_{\tau\in[0,T]}\Big|u(t-\tau) - \sum_{j=1}^N\alpha_j L_t u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big| \le \delta.
$$

Given $L_t u(\omega_1),\dots,L_t u(\omega_N)$, we can thus define a reconstruction mapping $R:\mathbb{R}^{p\times N}\times[0,T]\to C([0,T];\mathbb{R}^p)$ by

$$
R(\beta_1,\dots,\beta_N;\,t)(\tau) := \sum_{j=1}^N \alpha_j\beta_j\sin(\omega_j(t-\tau)-\vartheta_j).
$$

Then, for τ ∈ [0, t], we have $|u(\tau) - R(L_t u(\omega_1),\dots,L_t u(\omega_N);\,t)(\tau)|\le\delta$. We can now uniquely define $\Psi:\mathbb{R}^{p\times N}\times[0,T^2/4]\to\mathbb{R}^q$ by the identity

$$
\Psi(\beta_1,\dots,\beta_N;\,t^2/4) := \Phi\big(R(\beta_1,\dots,\beta_N;\,t)\big)(t),
$$

which is well-defined since $t\mapsto t^2/4$ is a bijection from [0, T] to $[0,T^2/4]$. Using the short-hand notation $R_t u := R(L_t u(\omega_1),\dots,L_t u(\omega_N);\,t)$, we have $\sup_{\tau\in[0,t]}|u(\tau)-R_t u(\tau)|\le\delta$ for all t ∈ [0, T]. By (B.3), this implies that

$$
\big|\Phi(u)(t) - \Psi(L_t u(\omega_1),\dots,L_t u(\omega_N);\,t^2/4)\big| = |\Phi(u)(t) - \Phi(R_t u)(t)| \le \epsilon. \qquad\blacksquare
$$

#### B.3 Proof of Lemma 3.7

*Proof.* Let ω ≠ 0 be given. For a (small) parameter s > 0, we consider

$$
\ddot{y}_s = \frac{1}{s}\,\sigma\big(-s\omega^2 y_s + s u\big), \qquad y_s(0)=\dot y_s(0)=0.
$$

Let Y be the solution of

$$
\ddot{Y} = -\omega^2 Y + u, \qquad Y(0)=\dot Y(0)=0.
$$

Then we have, on account of σ(0) = 0 and σ'(0) = 1,

$$
\Big|\frac1s\,\sigma(-s\omega^2 Y + su) - \big[-\omega^2 Y + u\big]\Big| = \Big|\frac{\sigma(-s\omega^2 Y + su)-\sigma(0)}{s} - \sigma'(0)\big[-\omega^2 Y + u\big]\Big| = \Big|\frac1s\int_0^s\big[\sigma'(-\zeta\omega^2 Y + \zeta u)-\sigma'(0)\big]\,d\zeta\Big|\;\big|-\omega^2 Y + u\big|.
$$

It follows from Lemma 3.5 that for any input u ∈ K, with $\sup_{u\in K}\|u\|_{L^\infty}=:B<\infty$, we have a uniform bound $\|Y\|_{L^\infty}\le BT/\omega$; hence we can estimate $|-\omega^2 Y + u|\le B(\omega T+1)$, uniformly for all such u. In particular, it follows that

$$
\Big|\frac1s\,\sigma(-s\omega^2 Y + su) - \big[-\omega^2 Y + u\big]\Big| \le B(\omega T+1)\sup_{|x|\le s\,B(\omega T+1)}|\sigma'(x)-\sigma'(0)|.
$$

Clearly, for any δ > 0 we can choose s ∈ (0, 1] sufficiently small such that the right-hand side is bounded by δ, i.e., with this choice of s,

$$
\Big|\frac1s\,\sigma(-s\omega^2 Y(t) + su(t)) - \big[-\omega^2 Y(t) + u(t)\big]\Big| \le \delta, \qquad \forall\,t\in[0,T],
$$

holds for any choice of u ∈ K. We fix this choice of s in the following and write $g(y,u):=s^{-1}\sigma(-s\omega^2 y + su)$. We note that g is Lipschitz continuous in y, for all |y| ≤ BT/ω and |u| ≤ B, with $\mathrm{Lip}_y(g)\le\omega^2\sup_{|\xi|\le B(\omega T+1)}|\sigma'(\xi)|$. To summarize, we have shown that Y solves

$$
\ddot{Y} = g(Y,u) + f, \qquad Y(0)=\dot Y(0)=0,
$$

where $\|f\|_{L^\infty}\le\delta$. By definition, $y_s$ solves $\ddot y_s = g(y_s,u)$, $y_s(0)=\dot y_s(0)=0$. It follows that

$$
|y_s(t)-Y(t)| \le \int_0^t\int_0^\tau\big\{|g(y_s(\theta),u(\theta))-g(Y(\theta),u(\theta))| + |f(\theta)|\big\}\,d\theta\,d\tau \le \int_0^t\int_0^\tau\big\{\mathrm{Lip}_y(g)\,|y_s(\theta)-Y(\theta)| + \delta\big\}\,d\theta\,d\tau \le T\omega^2\sup_{|\xi|\le B(\omega T+1)}|\sigma'(\xi)|\int_0^t|y_s(\tau)-Y(\tau)|\,d\tau + T^2\delta.
$$

Recalling that, by Lemma 3.5, $\omega Y(t) = L_t u(\omega)$, Gronwall's inequality applied to the last estimate implies that

$$
\sup_{t\in[0,T]}|\omega\,y_s(t) - L_t u(\omega)| = \omega\sup_{t\in[0,T]}|y_s - Y| \le C\delta,
$$

for a constant $C = C(T,\omega,\sup_{|\xi|\le B(\omega T+1)}|\sigma'(\xi)|)>0$, depending only on T, ω, B and σ'. Since δ > 0 was arbitrary, we can ensure that Cδ ≤ ε. Thus, we have shown that a suitably rescaled nonlinear oscillator approximates the harmonic oscillator to any desired degree of accuracy, uniformly for all u ∈ K. To finish the proof, we observe that y solves $\ddot y = \sigma(-\omega^2 y + su)$, $y(0)=\dot y(0)=0$, if, and only if, $y_s = y/s$ solves $\ddot y_s = s^{-1}\sigma(-s\omega^2 y_s + su)$, $y_s(0)=\dot y_s(0)=0$. Hence, with $W=-\omega^2$, V = s, b = 0 and A = ω/s, we have

$$
\sup_{t\in[0,T]}|A y(t) - L_t u(\omega)| = \sup_{t\in[0,T]}|\omega\,y_s(t) - L_t u(\omega)| \le \epsilon.
$$

This concludes the proof. ∎

#### B.4 Proof of Lemma 3.8

*Proof.* Let ε > 0 and Δt ≥ 0 be given. By the sine transform reconstruction Lemma B.1, there exist N ∈ ℕ, frequencies $\omega_1,\dots,\omega_N$, weights $\alpha_1,\dots,\alpha_N$ and phase-shifts $\vartheta_1,\dots,\vartheta_N$, such that
$$
\sup_{\tau\in[0,\Delta t]}\Big|u(t-\tau) - \sum_{j=1}^N \alpha_j L_t u(\omega_j)\sin(\omega_j\tau-\vartheta_j)\Big| \le \frac{\epsilon}{2}, \qquad \forall\,t\in[0,T],\ u\in K, \tag{B.4}
$$

where any u ∈ K is extended by zero to negative times. It follows from Lemma 3.7 that there exists a coupled oscillator network

$$
\ddot{y} = \sigma(w\odot y + Vu + b), \qquad y(0)=\dot y(0)=0,
$$

with dimension m = pN, weights $w\in\mathbb{R}^m$, $V\in\mathbb{R}^{m\times p}$, and a linear output layer $y\mapsto\tilde A y$, $\tilde A\in\mathbb{R}^{m\times m}$, such that $[\tilde A y(t)]_j\approx L_t u(\omega_j)$ for j = 1,…,N; more precisely, such that

$$
\sup_{t\in[0,T]}\;\sum_{j=1}^N|\alpha_j|\,\big|L_t u(\omega_j) - [\tilde A y]_j(t)\big| \le \frac{\epsilon}{2}, \qquad \forall\,u\in K. \tag{B.5}
$$

Composing with another linear layer $B:\mathbb{R}^m\simeq\mathbb{R}^{p\times N}\to\mathbb{R}^p$, which maps $\beta=[\beta_1,\dots,\beta_N]$ to

$$
B\beta := \sum_{j=1}^N \alpha_j\beta_j\sin(\omega_j\Delta t - \vartheta_j)\in\mathbb{R}^p,
$$

we define $A := B\tilde A$ and observe from (B.4) and (B.5) that

$$
\sup_{t\in[0,T]}|u(t-\Delta t) - Ay(t)| \le \sup_{t\in[0,T]}\Big|u(t-\Delta t) - \sum_{j=1}^N\alpha_j L_t u(\omega_j)\sin(\omega_j\Delta t-\vartheta_j)\Big| + \sup_{t\in[0,T]}\sum_{j=1}^N|\alpha_j|\,\big|L_t u(\omega_j)-[\tilde A y]_j(t)\big|\,\big|\sin(\omega_j\Delta t-\vartheta_j)\big| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \qquad\blacksquare
$$

#### B.5 Proof of Lemma 3.9

*Proof.* Fix Σ, Λ, γ as in the statement of the lemma. Our goal is to approximate $u\mapsto\Sigma\sigma(\Lambda u+\gamma)$.

**Step 1 (Nonlinear layer).** We consider a first layer for a hidden state $y=[y_1,y_2]^T\in\mathbb{R}^{p+p}$, given by

$$
\ddot{y}_1(t) = \sigma(\Lambda u(t)+\gamma), \qquad \ddot{y}_2(t) = \sigma(\gamma), \qquad y(0)=\dot y(0)=0.
$$

This layer evidently does not approximate $\sigma(\Lambda u(t)+\gamma)$; however, it does encode this value in the second derivative of the hidden variable $y_1$. The main objective of the following analysis is to approximately compute $\ddot y_1(t)$ through a suitably defined additional layer.

**Step 2 (Second-derivative layer).** To obtain an approximation of $\sigma(\Lambda u(t)+\gamma)$, we first note that the solution operator $S:u(t)\mapsto\eta(t)$, where

$$
\ddot{\eta}(t) = \sigma(\Lambda u(t)+\gamma) - \sigma(\gamma), \qquad \eta(0)=\dot\eta(0)=0,
$$

defines a continuous mapping $S:C_0([0,T];\mathbb{R}^p)\to C_0^2([0,T];\mathbb{R}^p)$, with $\eta(0)=\dot\eta(0)=\ddot\eta(0)=0$. Note that η is very closely related to $y_1$. The fact that $\ddot\eta(0)=0$ is important to us, because it allows us to smoothly extend η to negative times by setting η(t) := 0 for t < 0 (which would not be true for $y_1(t)$). The resulting extension defines a compactly supported function $\eta:(-\infty,T]\to\mathbb{R}^p$ with $\eta\in C^2((-\infty,T];\mathbb{R}^p)$. Furthermore, by continuity of the operator S, the image S(K) of the compact set K under S is compact in $C^2((-\infty,T];\mathbb{R}^p)$. From this, it follows that for small Δt > 0, the second-order backward finite difference formula converges,

$$
\sup_{t\in[0,T]}\Big|\frac{\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)}{\Delta t^2} - \ddot\eta(t)\Big| = o_{\Delta t\to 0}(1), \qquad \eta=S(u),\ u\in K,
$$

where the bound on the right-hand side is uniform in u ∈ K, due to the equicontinuity of $\{\ddot\eta\mid\eta=S(u),\,u\in K\}$. In particular, the second derivative of η can be approximated through linear combinations of time-delays of η. We can now choose Δt > 0 sufficiently small so that

$$
\sup_{t\in[0,T]}\Big|\frac{\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)}{\Delta t^2} - \ddot\eta(t)\Big| \le \frac{\epsilon}{2\|\Sigma\|}, \qquad \eta=S(u),\ u\in K,
$$

where ‖Σ‖ denotes the operator norm of the matrix Σ. By Lemma 3.8, applied to the input set $\tilde K = S(K)\subset C_0([0,T];\mathbb{R}^p)$, there exists a coupled oscillator

$$
\ddot{z}(t) = \sigma(w\odot z(t) + V\eta(t) + b), \qquad z(0)=\dot z(0)=0, \tag{B.6}
$$

and a linear output layer $z\mapsto\tilde A z$, such that

$$
\sup_{t\in[0,T]}\Big|\big[\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)\big] - \tilde A z(t)\Big| \le \frac{\epsilon\,\Delta t^2}{2\|\Sigma\|}, \qquad \eta=S(u),\ u\in K.
$$

Indeed, Lemma 3.8 shows that time-delays of any given input signal can be approximated to any desired accuracy, and $\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)$ is simply a linear combination of time-delays of the input signal η in (B.6). To connect η(t) back to $y(t)=[y_1(t),y_2(t)]^T$ constructed in Step 1, we note that $\ddot\eta = \sigma(\Lambda u(t)+\gamma)-\sigma(\gamma) = \ddot y_1 - \ddot y_2$, and hence, taking into account the initial values, we must have $\eta\equiv y_1 - y_2$ by ODE uniqueness. In particular, upon defining a matrix $\tilde V$ such that $\tilde V y := V y_1 - V y_2 \equiv V\eta$, we can equivalently write (B.6) in the form

$$
\ddot{z}(t) = \sigma(w\odot z(t) + \tilde V y(t) + b), \qquad z(0)=\dot z(0)=0. \tag{B.7}
$$
**Step 3 (Conclusion).** Composing the layers from Steps 1 and 2, we obtain a coupled oscillator

$$
\ddot{y}^\ell = \sigma(w^\ell\odot y^\ell + V^\ell y^{\ell-1} + b^\ell), \qquad (\ell=1,2),
$$

initialized at rest, with $y^1 = y$, $y^2 = z$, such that for $A := \Sigma\tilde A/\Delta t^2$ and $c := \Sigma\sigma(\gamma)$ we obtain

$$
\sup_{t\in[0,T]}\big|Ay^2(t)+c - \Sigma\sigma(\Lambda u(t)+\gamma)\big| \le \|\Sigma\|\sup_{t\in[0,T]}\Big|\frac{\tilde A z(t)}{\Delta t^2} - \big[\sigma(\Lambda u(t)+\gamma)-\sigma(\gamma)\big]\Big| = \|\Sigma\|\sup_{t\in[0,T]}\Big|\frac{\tilde A z(t)}{\Delta t^2} - \ddot\eta(t)\Big| \le \|\Sigma\|\sup_{t\in[0,T]}\Big|\frac{\tilde A z(t) - \big[\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)\big]}{\Delta t^2}\Big| + \|\Sigma\|\sup_{t\in[0,T]}\Big|\frac{\eta(t)-2\eta(t-\Delta t)+\eta(t-2\Delta t)}{\Delta t^2} - \ddot\eta(t)\Big| \le \frac{\epsilon}{2}+\frac{\epsilon}{2} = \epsilon.
$$

This concludes the proof. ∎

#### B.6 Proof of Theorem 3.1

*Proof.* **Step 1.** By the fundamental Lemma 3.6, there exist N, a continuous mapping Ψ and frequencies $\omega_1,\dots,\omega_N$, such that

$$
\big|\Phi(u)(t) - \Psi(L_t u(\omega_1),\dots,L_t u(\omega_N);\,t^2/4)\big| \le \epsilon,
$$

for all u ∈ K and t ∈ [0, T]. Let M be a constant such that $|L_t u(\omega_1)|,\dots,|L_t u(\omega_N)|,\,t^2/4 \le M$ for all u ∈ K and t ∈ [0, T]. By the universal approximation theorem for ordinary neural networks, there exist weight matrices Σ, Λ and a bias γ such that

$$
\big|\Psi(\beta_1,\dots,\beta_N;\,t^2/4) - \Sigma\sigma(\Lambda\beta+\gamma)\big| \le \epsilon, \qquad \beta := [\beta_1,\dots,\beta_N;\,t^2/4]^T,
$$

holds for all t ∈ [0, T] and $|\beta_1|,\dots,|\beta_N|\le M$.

**Step 2.** Fix $\epsilon_1\le 1$ sufficiently small, such that also $\|\Sigma\|\,\|\Lambda\|\,\mathrm{Lip}(\sigma)\,\epsilon_1\le\epsilon$, where $\mathrm{Lip}(\sigma):=\sup_{|\xi|\le\|\Lambda\|M+|\gamma|+1}|\sigma'(\xi)|$ denotes an upper bound on the Lipschitz constant of the activation function over the relevant range of input values. It follows from Lemma 3.7 that there exists an oscillator network

$$
\ddot{y}^1 = \sigma(w^1\odot y^1 + V^1 u + b^1), \qquad y^1(0)=\dot y^1(0)=0, \tag{B.8}
$$

of depth 1, such that

$$
\sup_{t\in[0,T]}\big|[L_t u(\omega_1),\dots,L_t u(\omega_N);\,t^2/4]^T - A^1 y^1(t)\big| \le \epsilon_1,
$$

for all u ∈ K.

**Step 3.** Finally, by Lemma 3.9, there exists an oscillator network

$$
\ddot{y}^2 = \sigma(w^2\odot y^2 + V^2 y^1 + b^2),
$$

of depth 2, such that

$$
\sup_{t\in[0,T]}\big|A^2 y^2(t) - \Sigma\sigma(\Lambda A^1 y^1(t)+\gamma)\big| \le \epsilon,
$$

holds for all $y^1$ belonging to the compact set $K_1 := S(K)\subset C_0([0,T];\mathbb{R}^{N+1})$, where S denotes the solution operator of (B.8).

**Step 4.** Thus, we have for any u ∈ K, with the short-hand $L_t u(\omega) := (L_t u(\omega_1),\dots,L_t u(\omega_N))$,

$$
\big|\Phi(u)(t) - A^2 y^2(t)\big| \le \big|\Phi(u)(t) - \Psi(L_t u(\omega);t^2/4)\big| + \big|\Psi(L_t u(\omega);t^2/4) - \Sigma\sigma(\Lambda[L_t u(\omega);t^2/4]+\gamma)\big| + \big|\Sigma\sigma(\Lambda[L_t u(\omega);t^2/4]+\gamma) - \Sigma\sigma(\Lambda A^1 y^1(t)+\gamma)\big| + \big|\Sigma\sigma(\Lambda A^1 y^1(t)+\gamma) - A^2 y^2(t)\big|.
$$

By Step 1, we can estimate $\big|\Phi(u)(t) - \Psi(L_t u(\omega);t^2/4)\big| \le \epsilon$ for t ∈ [0, T], u ∈ K. By the choice of Σ, Λ, γ, we have $\big|\Psi(L_t u(\omega);t^2/4) - \Sigma\sigma(\Lambda[L_t u(\omega);t^2/4]+\gamma)\big| \le \epsilon$ for t ∈ [0, T], u ∈ K. By the construction of $y^1$ in Step 2, we have

$$
\big|\Sigma\sigma(\Lambda[L_t u(\omega);t^2/4]+\gamma) - \Sigma\sigma(\Lambda A^1 y^1(t)+\gamma)\big| \le \|\Sigma\|\,\mathrm{Lip}(\sigma)\,\|\Lambda\|\,\big|[L_t u(\omega);t^2/4] - A^1 y^1(t)\big| \le \|\Sigma\|\,\mathrm{Lip}(\sigma)\,\|\Lambda\|\,\epsilon_1 \le \epsilon,
$$

for all t ∈ [0, T] and u ∈ K. By the construction of $y^2$ in Step 3, we have $\big|\Sigma\sigma(\Lambda A^1 y^1(t)+\gamma) - A^2 y^2(t)\big| \le \epsilon$ for t ∈ [0, T], u ∈ K. Thus, we conclude that $|\Phi(u)(t) - A^2 y^2(t)| \le 4\epsilon$ for all t ∈ [0, T] and u ∈ K.

Since ε > 0 was arbitrary, we conclude that for any causal and continuous operator $\Phi:C_0([0,T];\mathbb{R}^p)\to C_0([0,T];\mathbb{R}^q)$, compact set $K\subset C_0([0,T];\mathbb{R}^p)$ and ε > 0, there exists a coupled oscillator of depth 3 which uniformly approximates Φ to accuracy ε for all u ∈ K. This completes the proof. ∎
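To close, the following sketch (our own, under simplifying assumptions) numerically illustrates the reconstruction underlying Lemma B.1: it recovers $u(t_0-\tau)$ from finitely many windowed sine transforms $L_{t_0}u(\omega_j)$ via a discretized inverse sine transform. Equidistant frequencies and a Lanczos-type smoothing factor (standing in for the mollifier $\rho_\epsilon$) are assumptions made for the example; the specific signal, cutoff and grid sizes are illustrative.

```python
import numpy as np

# signal and evaluation time t0; u is extended by zero to negative times
t0, dt = 1.5, 1e-3
s = np.arange(0.0, t0, dt)                        # tau-grid covering the support of f
u = lambda t: np.sin(2 * np.pi * t) ** 2          # input with u(0) = 0
f = u(t0 - s)                                     # f(tau) = u(t0 - tau)

# windowed sine transforms L_{t0} u(omega_j) at equidistant nonzero frequencies
d_omega, n_freq = 0.05, 4000
omegas = d_omega * np.arange(1, n_freq + 1)
L = dt * np.sin(np.outer(omegas, s)) @ f

# discretized, smoothed inverse sine transform:
#   u(t0 - tau) ~ (2/pi) * sum_j d_omega * sigma_j * L_{t0}u(omega_j) * sin(omega_j * tau)
sigma_j = np.sinc(omegas / omegas[-1])            # Lanczos factor in place of rho_eps
tau = np.arange(0.05, 1.0, 0.01)
recon = (2.0 * d_omega / np.pi) * (sigma_j * L) @ np.sin(np.outer(omegas, tau))
print(np.max(np.abs(recon - u(t0 - tau))))        # approximate recovery away from tau = 0
```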