# Noisy Recurrent Neural Networks

Soon Hoe Lim (Nordita, KTH Royal Institute of Technology and Stockholm University), soon.hoe.lim@su.se
N. Benjamin Erichson (University of Pittsburgh School of Engineering), erichson@pitt.edu
Liam Hodgkinson (ICSI and Department of Statistics, UC Berkeley), liam.hodgkinson@berkeley.edu
Michael W. Mahoney (ICSI and Department of Statistics, UC Berkeley), mmahoney@stat.berkeley.edu

Abstract

We provide a general framework for studying recurrent neural networks (RNNs) trained by injecting noise into hidden states. Specifically, we consider RNNs that can be viewed as discretizations of stochastic differential equations driven by input data. This framework allows us to study the implicit regularization effect of general noise injection schemes by deriving an approximate explicit regularizer in the small noise regime. We find that, under reasonable assumptions, this implicit regularization promotes flatter minima; it biases towards models with more stable dynamics; and, in classification tasks, it favors models with larger classification margin. Sufficient conditions for global stability are obtained, highlighting the phenomenon of stochastic stabilization, where noise injection can improve stability during training. Our theory is supported by empirical results which demonstrate that the RNNs have improved robustness with respect to various input perturbations.

1 Introduction

Viewing recurrent neural networks (RNNs) as discretizations of ordinary differential equations (ODEs) driven by input data has recently gained attention [7, 27, 16, 49]. The "formulate in continuous time, and then discretize" approach [38] motivates novel architecture designs before experimentation, and it provides a useful interpretation as a dynamical system. This, in turn, has led to gains in reliability and robustness to data perturbations. Recent efforts have shown how adding noise can also improve stability during training, and consequently improve robustness [35]. In this work, we consider discretizations of the corresponding stochastic differential equations (SDEs) obtained from ODE formulations of RNNs through the addition of a diffusion (noise) term. We refer to these as Noisy RNNs (NRNNs). By dropping the noisy elements at inference time, NRNNs become a stochastic learning strategy which, as we shall prove, has a number of important benefits. In particular, stochastic learning strategies (including dropout) are often used as natural regularizers, favoring solutions in regions of the loss landscape with desirable properties (often improved generalization and/or robustness). This mechanism is commonly referred to as implicit regularization [40, 39, 50], differing from explicit regularization, where the loss is explicitly modified. For neural network (NN) models, implicit regularization towards wider minima is conjectured to be a prominent ingredient in the success of stochastic optimization [67, 28]. Indeed, implicit regularization has been linked to increases in classification margins [47], which can lead to improved generalization performance [51]. A common approach to identifying and studying implicit regularization is to approximate it by an appropriate explicit regularizer [40, 39, 1, 6, 21]. Doing so, we will see that NRNNs favor wide minima (like SGD), more stable dynamics, and classifiers with a large classification margin, keeping generalization error small.
SDEs have also seen recent appearances in neural SDEs [59, 24], stochastic generalizations of neural ODEs [9], which can be seen as an analogue of NRNNs for non-sequential data, with a similar relationship to NRNNs as feedforward NNs have to RNNs. They have been shown to be robust in practice [35]. Analogously, we shall show that the NRNN framework leads to more reliable and robust RNN classifiers, whose promise is demonstrated by experiments on benchmark data sets.

Contributions. For the class of NRNNs (formulated first as a continuous-time model, which is then discretized), the following are our main contributions:

- we identify the form of the implicit regularization for NRNNs through a corresponding (data-dependent) explicit regularizer in the small noise regime (see Theorem 1);
- we focus on its effect in classification tasks, providing bounds for the classification margin of the deterministic RNN classifiers (see Theorem 2); in particular, Theorem 2 reveals that stable RNN dynamics can lead to a large classification margin;
- we show that noise injection can also lead to improved stability (see Theorem 3) via a Lyapunov stability analysis of continuous-time NRNNs;
- we demonstrate via empirical experiments on benchmark data sets that NRNN classifiers are more robust to data perturbations when compared to other recurrent models, while retaining state-of-the-art performance on clean data.

Research code is provided here: https://github.com/erichson/NoisyRNN.

Notation. We use $\|v\| := \|v\|_2$ to denote the Euclidean norm of the vector $v$, and $\|A\|_2$ and $\|A\|_F$ to denote the spectral norm and Frobenius norm of the matrix $A$, respectively. The $i$th element of a vector $v$ is denoted by $v_i$ or $[v]_i$, and the $(i,j)$-entry of a matrix $A$ by $A_{ij}$ or $[A]_{ij}$. For a vector $v = (v_1, \ldots, v_d)$, $\mathrm{diag}(v)$ denotes the diagonalization of $v$, with $\mathrm{diag}(v)_{ii} = v_i$. $I$ denotes the identity matrix (with dimension clear from context), while the superscript $T$ denotes transposition. For a matrix $M$, $M^{\mathrm{sym}} = (M + M^T)/2$ denotes its symmetric part, $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote its minimum and maximum eigenvalues respectively, $\sigma_{\max}(M)$ denotes its maximum singular value, and $\mathrm{Tr}(M)$ denotes its trace. For a function $f : \mathbb{R}^n \to \mathbb{R}^m$ such that each of its first-order partial derivatives (with respect to $x$) exists, $\frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}$ is the Jacobian matrix of $f$. For a scalar-valued function $g : \mathbb{R}^n \to \mathbb{R}$, $\nabla_h g$ is the gradient of $g$ with respect to the variable $h \in \mathbb{R}^n$ and $H_h g$ is the Hessian of $g$ with respect to $h$.

2 Related Work

Dynamical Systems and Machine Learning. There are various interesting connections between machine learning and dynamical systems. Formulating machine learning in the framework of continuous-time dynamical systems was recently popularized by [62]. Subsequent efforts focus on constructing learning models by approximating continuous-time dynamical systems [9, 29, 48] and studying them using tools from numerical analysis [36, 64, 69, 68]. On the other hand, dynamical systems theory provides useful theoretical tools for analyzing NNs, including RNNs [60, 15, 34, 7, 16], and useful principles for designing NNs [23, 54]. Other examples of dynamical systems inspired models include the learning of invariant quantities via their Hamiltonian or Lagrangian representations [37, 22, 10, 71, 58]. Another class of models is inspired by Koopman theory, yielding models in which the evolution operator is linear [56, 43, 17, 45, 33, 4, 3, 13].

Stochastic Training and Regularization Strategies.
Regularization techniques such as noise injection and dropout can help to prevent overfitting in NNs. Following the classical work [5], which studies the regularizing effects of noise injection on the input data, several works study the effects of noise injection into different parts of the network for various architectures [25, 44, 35, 54, 2, 26, 66, 61]. In particular, [6] recently studies the regularizing effect of isotropic Gaussian noise injected into the layers of feedforward networks. For RNNs, [12] shows that noise additions on the hidden states outperform Bernoulli dropout in terms of performance and bias, whereas [18] introduces a variant of stochastic RNNs for generative modeling of sequential data. Some specific formulations of RNNs as SDEs were also considered in Chapter 10 of [41]. Implicit regularization has also been studied more generally than in the setting of stochastic gradient based training of NNs [40, 39, 19, 11].

3 Noisy Recurrent Neural Networks

We formulate continuous-time recurrent neural networks (CT-RNNs) in full generality as a system of input-driven ODEs: for a terminal time $T > 0$ and an input signal $x = (x_t)_{t\in[0,T]} \in C([0,T]; \mathbb{R}^{d_x})$, the output $y_t \in \mathbb{R}^{d_y}$, for $t \in [0,T]$, is a linear map of hidden states $h_t \in \mathbb{R}^{d_h}$ satisfying

$$dh_t = f(h_t, x_t)\,dt, \qquad y_t = V h_t, \qquad (1)$$

where $V \in \mathbb{R}^{d_y \times d_h}$, and $f : \mathbb{R}^{d_h} \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_h}$ is typically Lipschitz continuous, guaranteeing existence and uniqueness of solutions to (1). A natural stochastic variant of CT-RNNs arises by replacing the ODE in (1) by an Itô SDE, that is,

$$dh_t = f(h_t, x_t)\,dt + \sigma(h_t, x_t)\,dB_t, \qquad y_t = V h_t, \qquad (2)$$

where $\sigma : \mathbb{R}^{d_h} \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_h \times r}$ and $(B_t)_{t\ge 0}$ is an $r$-dimensional Brownian motion. The functions $f$ and $\sigma$ are referred to as the drift and diffusion coefficients, respectively. Intuitively, (2) amounts to a noisy perturbation of the corresponding deterministic CT-RNN (1). At full generality, we refer to the system (2) as a continuous-time Noisy RNN (CT-NRNN). To guarantee the existence of a unique solution to (2), in the sequel we assume that $\{f(\cdot, x_t)\}_{t\in[0,T]}$ and $\{\sigma(\cdot, x_t)\}_{t\in[0,T]}$ are uniformly Lipschitz continuous, and that $t \mapsto f(h, x_t)$ and $t \mapsto \sigma(h, x_t)$ are bounded on $[0,T]$ for each fixed $h \in \mathbb{R}^{d_h}$. For further details, see Section B in the Supplementary Material (SM).

While much of our theoretical analysis will focus on this general formulation of CT-NRNNs, our empirical and stability analyses focus on the choice of drift function

$$f(h, x) = Ah + a(Wh + Ux + b), \qquad (3)$$

where $a : \mathbb{R} \to \mathbb{R}$ is a Lipschitz continuous scalar activation function extended to act on vectors pointwise, $A, W \in \mathbb{R}^{d_h \times d_h}$, $U \in \mathbb{R}^{d_h \times d_x}$, and $b \in \mathbb{R}^{d_h}$. A typical example of an activation function is $a(x) = \tanh(x)$. The parameters $A, W, U, V$, and $b$ are all assumed to be trainable. This particular choice of drift dates back to the early Cohen-Grossberg formulation of CT-RNNs, and was recently reconsidered in [16].

3.1 Noise Injections as Stochastic Learning Strategies

While precise choices of drift functions $f$ are the subject of existing deterministic RNN theory, good choices of the diffusion coefficient $\sigma$ are less clear. Here, we consider a parametric class of diffusion coefficients given by

$$\sigma(h, x) \equiv \epsilon\left(\sigma_1 I + \sigma_2\,\mathrm{diag}(f(h, x))\right), \qquad (4)$$

where the noise level $\epsilon > 0$ is small, and $\sigma_1 \ge 0$ and $\sigma_2 \ge 0$ are tunable parameters describing the relative strengths of additive and multiplicative noise, respectively. While the stochastic component is an important part of the model, one can set $\epsilon = 0$ at inference time. In doing so, noise injection in NRNNs may be viewed as a learning strategy. A similar stance is taken in [35] for treating neural SDEs.
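For concreteness, the following is a minimal PyTorch sketch of the drift (3) and diffusion (4) coefficients. The class name, parameter names (`eps`, `sigma1`, `sigma2`), and initialization are illustrative assumptions, not the reference implementation from the linked repository.

```python
import torch
import torch.nn as nn

class NoisyRNNCoefficients(nn.Module):
    """Sketch of the drift (3) and diffusion (4) coefficients of a CT-NRNN.
    Names and initialization are illustrative, not the authors' reference code."""

    def __init__(self, d_h, d_x, eps=0.02, sigma1=1.0, sigma2=1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_h, d_h) / d_h ** 0.5)
        self.W = nn.Parameter(torch.randn(d_h, d_h) / d_h ** 0.5)
        self.U = nn.Parameter(torch.randn(d_h, d_x) / d_x ** 0.5)
        self.b = nn.Parameter(torch.zeros(d_h))
        self.eps, self.sigma1, self.sigma2 = eps, sigma1, sigma2

    def drift(self, h, x):
        # f(h, x) = A h + tanh(W h + U x + b), cf. Eq. (3); h: (batch, d_h), x: (batch, d_x)
        return h @ self.A.T + torch.tanh(h @ self.W.T + x @ self.U.T + self.b)

    def diffusion(self, h, x):
        # sigma(h, x) = eps * (sigma1 * I + sigma2 * diag(f(h, x))), cf. Eq. (4)
        f = self.drift(h, x)
        eye = torch.eye(h.shape[-1], device=h.device).expand(h.shape[0], -1, -1)
        return self.eps * (self.sigma1 * eye + self.sigma2 * torch.diag_embed(f))
```

Here the diffusion is returned as a batch of $d_h \times d_h$ matrices (i.e., $r = d_h$), matching the parametric class (4).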
From this point of view, we may relate noise injection to regularization mechanisms considered in previous works. For example, additive noise injection was studied in the context of feedforward NNs in [6], in which case Gaussian noise is injected into the activations at each layer of the NN. Furthermore, multiplicative noise injection includes stochastic depth and dropout strategies as special cases [36, 35]. By taking a Gaussian approximation to Bernoulli noise and taking a continuous-time limit, NNs with stochastic dropout can be weakly approximated by an SDE with appropriate multiplicative noise; see [36]. All of these works highlight various advantages of noise injection for training NNs.

3.2 Numerical Discretizations

As in the deterministic case, exact simulation of the SDE in (2) is infeasible in practice, and so one must specify a numerical integration scheme. We focus on explicit Euler-Maruyama (E-M) integrators [30], which are the stochastic analogues of Euler-type integration schemes for ODEs. Let $0 =: t_0 < t_1 < \cdots < t_M := T$ be a partition of the interval $[0, T]$. Denote $\delta_m := t_{m+1} - t_m$ for each $m = 0, 1, \ldots, M-1$, and $\delta := (\delta_m)$. The E-M scheme provides a family (parametrized by $\delta$) of approximations to the solution of the SDE in (2):

$$h^\delta_{m+1} = h^\delta_m + f(h^\delta_m, \hat x_m)\,\delta_m + \sigma(h^\delta_m, \hat x_m)\,\sqrt{\delta_m}\,\xi_m, \qquad (5)$$

for $m = 0, 1, \ldots, M-1$, where $(\hat x_m)_{m=0,\ldots,M-1}$ is the given sequential input data, the $\xi_m \sim N(0, I)$ are independent $r$-dimensional standard normal random vectors, and $h^\delta_0 = h_0$. As $\Delta := \max_m \delta_m \to 0$, the family of approximations $(h^\delta_m)$ converges strongly to the Itô process $(h_t)$ satisfying (2) (at rate $O(\sqrt{\Delta})$ when the step sizes are uniform; see Theorem 10.2.2 in [30]). See Section C in SM for details on the general case.

4 Implicit Regularization

To highlight the advantages of NRNNs over their deterministic counterparts, we show that, under reasonable assumptions, NRNNs exhibit a natural form of implicit regularization. By this, we mean regularization imposed implicitly by the stochastic learning strategy, without explicitly modifying the loss, but which may, e.g., promote flatter minima. Our goal is achieved by deriving an appropriate explicit regularizer through a perturbation analysis in the small noise regime. This becomes useful when considering NRNNs as a learning strategy, since we can precisely determine the effect of the noise injection as a regularization mechanism. The study of discrete-time NRNNs is of practical interest and is our focus here. Nevertheless, analogous results for continuous-time NRNNs are also valuable for exploring other discretization schemes. For this reason, we also study the continuous-time case in Section E in SM.

Our analysis covers general NRNNs, not necessarily those with the drift term (3) and diffusion term (4), that satisfy the following assumption, which is typically reasonable in practice. We remark that a ReLU activation will violate the assumption. However, RNNs with ReLU activation are less widely used in practice: without careful initialization [31, 57], they typically suffer more from exploding gradient problems than those with bounded activation functions such as tanh.

Assumption A. The drift $f$ and diffusion coefficient $\sigma$ of the SDE in (2) satisfy the following: (i) for all $t \in [0, T]$ and $x \in \mathbb{R}^{d_x}$, $h \mapsto f(h, x)$ and $h \mapsto \sigma_{ij}(h, x)$ have Lipschitz continuous partial derivatives in each coordinate up to order three (inclusive); (ii) for any $h \in \mathbb{R}^{d_h}$, $t \mapsto f(h, x_t)$ and $t \mapsto \sigma(h, x_t)$ are bounded and Borel measurable on $[0, T]$.
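To illustrate the E-M recursion (5), and how the noise is switched off at inference time (Subsection 3.1), here is a minimal sketch of an unrolled forward pass. The function accepts drift and diffusion callables, such as those in the previous sketch; the name `nrnn_forward` and the tensor shapes are illustrative assumptions.

```python
import torch

def nrnn_forward(drift, diffusion, x_seq, h0, deltas, training=True):
    """Unroll the Euler-Maruyama recursion (5).

    drift(h, x)     -> (batch, d_h)      hypothetical callable, e.g. Eq. (3)
    diffusion(h, x) -> (batch, d_h, r)   hypothetical callable, e.g. Eq. (4)
    x_seq: (M, batch, d_x) input sequence, deltas: sequence of M step sizes.
    At inference time (training=False) the noise term is dropped (eps -> 0).
    """
    h = h0
    for x_m, dt in zip(x_seq, deltas):
        f_m = drift(h, x_m)                       # f(h_m, x_m)
        if training:
            s_m = diffusion(h, x_m)               # sigma(h_m, x_m)
            xi = torch.randn(s_m.shape[0], s_m.shape[-1], 1, device=h.device)
            noise = (dt ** 0.5) * (s_m @ xi).squeeze(-1)
        else:
            noise = 0.0
        h = h + dt * f_m + noise                  # Eq. (5)
    return h                                      # final hidden state h^delta_M
```

For instance, with the hypothetical `NoisyRNNCoefficients` module above, `nrnn_forward(coeff.drift, coeff.diffusion, x_seq, h0, deltas)` would produce the final hidden state, which is then mapped through $V$ and the softmax to obtain class probabilities.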
We consider a rescaling of the noise $\sigma \mapsto \epsilon\sigma$ in (2), where $\epsilon > 0$ is assumed to be a small parameter, in line with our noise injection strategies in Subsection 3.1. In the sequel, we let $\bar h^\delta_m$ denote the hidden states of the corresponding deterministic RNN model, satisfying

$$\bar h^\delta_{m+1} = \bar h^\delta_m + \delta_m f(\bar h^\delta_m, \hat x_m), \qquad m = 0, 1, \ldots, M-1, \qquad (6)$$

with $\bar h^\delta_0 = h_0$. Let $\Delta := \max_{m\in\{0,\ldots,M-1\}} \delta_m$, and denote the state-to-state Jacobians by

$$\hat J_m = I + \delta_m \frac{\partial f}{\partial h}(\bar h^\delta_m, \hat x_m). \qquad (7)$$

For $m, k = 0, \ldots, M-1$, also let

$$\hat\Phi_{m,k} = \hat J_m \hat J_{m-1} \cdots \hat J_k, \qquad (8)$$

where the empty product is taken to be the identity. Note that the $\hat\Phi_{m,k}$ are products of the state-to-state Jacobian matrices, which are important for analyzing signal propagation in RNNs [8]. For brevity, we denote $\bar f_m = f(\bar h^\delta_m, \hat x_m)$ and $\bar\sigma_m = \sigma(\bar h^\delta_m, \hat x_m)$ for $m = 0, 1, \ldots, M$. The following result, which is our first main result, relates the loss function, averaged over realizations of the injected noise, used for training the NRNN to the loss used for training the deterministic RNN in the small noise regime.

Theorem 1 (Implicit regularization induced by noise injection). Under Assumption A,

$$\mathbb{E}\,\ell(h^\delta_M) = \ell(\bar h^\delta_M) + \frac{\epsilon^2}{2}\left[\hat Q(\bar h^\delta) + \hat R(\bar h^\delta)\right] + O(\epsilon^3), \qquad (9)$$

as $\epsilon \to 0$, where the terms $\hat Q$ and $\hat R$ are given by

$$\hat Q(\bar h^\delta) = \nabla\ell(\bar h^\delta_M)^T \sum_{k=1}^{M} \delta_{k-1}\,\hat\Phi_{M-1,k} \sum_{m=1}^{k-1} \delta_{m-1}\, v_m, \qquad (10)$$

$$\hat R(\bar h^\delta) = \sum_{m=1}^{M} \delta_{m-1}\,\mathrm{tr}\!\left(\bar\sigma_{m-1}^T\,\hat\Phi_{M-1,m}^T\, H_{\bar h^\delta_M}\ell\,\hat\Phi_{M-1,m}\,\bar\sigma_{m-1}\right), \qquad (11)$$

with $v_m$ a vector whose $p$th component is

$$[v_m]_p = \mathrm{tr}\!\left(\bar\sigma_{m-1}^T\,\hat\Phi_{M-2,m}^T\, H_{\bar h^\delta}[\bar f_M]_p\,\hat\Phi_{M-2,m}\,\bar\sigma_{m-1}\right), \qquad (12)$$

for $p = 1, \ldots, d_h$. Moreover,

$$|\hat Q(\bar h^\delta)| \le C_Q \Delta^2, \qquad |\hat R(\bar h^\delta)| \le C_R \Delta, \qquad (13)$$

for $C_Q, C_R > 0$ independent of $\Delta$.

If the loss is convex, then $\hat R$ is non-negative, but $\hat Q$ need not be. However, $\hat Q$ can be made negligible relative to $\hat R$ provided that $\Delta$ is taken sufficiently small. This also ensures that the E-M approximations are accurate. To summarize, Theorem 1 implies that the injection of noise into the hidden states of a deterministic RNN is, on average, approximately equivalent to a regularized objective functional. Moreover, the explicit regularizer is determined solely by the discrete-time flow generated by the Jacobians $\frac{\partial f}{\partial h}(\bar h^\delta_m, \hat x_m)$, the diffusion coefficients $\bar\sigma_m$, and the Hessian of the loss function, all evaluated along the dynamics of the deterministic RNN. We can therefore expect that the use of NRNNs as a regularization mechanism should reduce the state-to-state Jacobians and the Hessian of the loss function according to the noise level $\epsilon$. Indeed, NRNNs exhibit a smoother Hessian landscape than that of their deterministic counterpart (see Figure 3 in SM). The Hessian of the loss function commonly appears in implicit regularization analyses, and its presence here suggests a preference towards wider minima in the loss landscape. Commonly considered a positive attribute [28], this, in turn, suggests a degree of robustness of the loss to perturbations in the hidden states [65]. More interesting, however, is the appearance of the Jacobians, which is indicative of a preference towards slower, more stable dynamics. Both of these attributes suggest that NRNNs should exhibit a strong tendency towards models which are less sensitive to input perturbations. Overall, we see that the use of NRNNs as a regularization mechanism reduces the state-to-state Jacobians and the Hessian of the loss function according to the noise level.
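As a sanity check on Theorem 1, the following self-contained sketch estimates the gap $\mathbb{E}\,\ell(h^\delta_M) - \ell(\bar h^\delta_M)$ by Monte Carlo on a small toy model and verifies that it scales roughly quadratically in $\epsilon$. The toy drift, the squared-error loss, and all dimensions are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, M, dt = 8, 3, 20, 0.1
A = -0.5 * np.eye(d_h)
W = 0.3 * rng.standard_normal((d_h, d_h))
U = 0.3 * rng.standard_normal((d_h, d_x))
b = np.zeros(d_h)
x_seq = rng.standard_normal((M, d_x))
target = rng.standard_normal(d_h)

def drift(h, x):
    # drift of Eq. (3) with tanh activation
    return A @ h + np.tanh(W @ h + U @ x + b)

def loss(h):
    # a convex toy loss (squared error to a fixed target)
    return 0.5 * np.sum((h - target) ** 2)

def forward(eps, sigma1=1.0, sigma2=1.0, seed=None):
    # Euler-Maruyama recursion (5); eps = 0 recovers the deterministic RNN (6)
    noise = np.random.default_rng(seed)
    h = np.zeros(d_h)
    for x in x_seq:
        f = drift(h, x)
        sigma = eps * (sigma1 * np.eye(d_h) + sigma2 * np.diag(f))  # Eq. (4)
        h = h + dt * f + np.sqrt(dt) * (sigma @ noise.standard_normal(d_h))
    return h

base = loss(forward(eps=0.0))
for eps in (0.05, 0.1):
    gaps = [loss(forward(eps, seed=s)) - base for s in range(20000)]
    print(f"eps={eps}: mean gap = {np.mean(gaps):.6f} (Theorem 1 predicts O(eps^2))")
# Doubling eps should multiply the mean gap by roughly four.
```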
5 Implications in Classification Tasks

Our focus now turns to an investigation of the benefits of NRNNs over their deterministic counterparts in classification tasks. From Theorem 1, it is clear that adding noise to a deterministic RNN implicitly regularizes the state-to-state Jacobians. Here, we show that doing so also enhances an implicit tendency towards classifiers with a large classification margin. Our analysis here covers general deterministic RNNs, although we also apply our results to obtain explicit expressions for Lipschitz RNNs.

Let $S_N$ denote a set of training samples $s_n := (x_n, y_n)$ for $n = 1, \ldots, N$, where each input sequence $x_n = (x_{n,0}, x_{n,1}, \ldots, x_{n,M-1}) \in \mathcal{X} \subset \mathbb{R}^{d_x \times M}$ has a corresponding class label $y_n \in \mathcal{Y} = \{1, \ldots, d_y\}$. Following the statistical learning framework, these samples are assumed to be independently drawn from an underlying probability distribution $\mu$ on the sample space $\mathcal{S} = \mathcal{X} \times \mathcal{Y}$. An RNN-based classifier $g^\delta(x)$ is constructed in the usual way by taking

$$g^\delta(x) = \arg\max_{i=1,\ldots,d_y} p_i(V \bar h^\delta_M[x]), \qquad (14)$$

where $p_i(x) = e^{x_i}/\sum_j e^{x_j}$ is the softmax function. Letting $\ell$ denote the cross-entropy loss, such a classifier is trained from $S_N$ by minimizing the empirical risk (training error), $R_N(g^\delta) := \frac{1}{N}\sum_{n=1}^{N} \ell(g^\delta(x_n), y_n)$, as a proxy for the true (population) risk (test error), $R(g^\delta) = \mathbb{E}_{(x,y)\sim\mu}\,\ell(g^\delta(x), y)$, with $(x, y) \in \mathcal{S}$. The measure used to quantify prediction quality is the generalization error (or estimation error), the difference between the empirical risk of the classifier on the training set and the true risk: $GE(g^\delta) := |R(g^\delta) - R_N(g^\delta)|$.

The classifier is a function of the output of the deterministic RNN, which is an Euler discretization of the ODE (1) with step sizes $\delta = (\delta_m)$. In particular, for the Lipschitz RNN,

$$\hat\Phi_{m,k} = \hat J_m \hat J_{m-1} \cdots \hat J_k, \qquad (15)$$

where $\hat J_l = I + \delta_l (A + D_l W)$, with $D_l$ the diagonal matrix whose entries are $[D_l]_{ii} = a'([W\bar h^\delta_l + U\hat x_l + b]_i)$.

In the following, we let $\mathrm{conv}(\mathcal{X})$ denote the convex hull of $\mathcal{X}$. We let $\hat x_{0:m} := (\hat x_0, \ldots, \hat x_m)$, so that $\hat x = \hat x_{0:M-1}$, and use the notation $f[x]$ to indicate the dependence of the function $f$ on the vector $x$. Our result will depend on two characterizations of a training sample $s_i = (x_i, y_i)$.

Definition 1 (Classification Margin). The classification margin of a training sample $s_i = (x_i, y_i)$, measured by the Euclidean metric $d$, is defined as the radius of the largest $d$-metric ball in $\mathcal{X}$ centered at $x_i$ that is contained in the decision region associated with the class label $y_i$, i.e., it is

$$\gamma^d(s_i) = \sup\{a : d(x_i, x) \le a \implies g^\delta(x) = y_i \ \ \forall x\}.$$

Intuitively, a larger classification margin allows a classifier to associate a larger region of the input space, centered on the point $x_i$, with the same class. This makes the classifier less sensitive to input perturbations: a perturbation of $x_i$ is still likely to fall within this region, leaving the classifier's prediction unchanged. In this sense, the classifier becomes more robust. In our case, the networks are trained with a loss (cross-entropy) that promotes separation of different classes in the network output. This, in turn, maximizes a certain notion of score of each training sample.

Definition 2 (Score). For a training sample $s_i = (x_i, y_i)$, we define its score as

$$o(s_i) = \min_{j \ne y_i} \sqrt{2}\,(e_{y_i} - e_j)^T S^\delta[x_i] \ge 0,$$

where $e_i \in \mathbb{R}^{d_y}$ is the Kronecker delta vector with $[e_i]_i = 1$ and $[e_i]_j = 0$ for $j \ne i$, and $S^\delta[x_i] := p(V \bar h^\delta_M[x_i])$, with $\bar h^\delta_M[x_i]$ denoting the hidden state of the RNN, driven by the input sequence $x_i$, at the terminal index $M$.

Recall that the classifier is $g^\delta(x) = \arg\max_{i\in\{1,\ldots,d_y\}} [S^\delta[x]]_i$, and the decision boundary between class $i$ and class $j$ in the feature space is given by the hyperplane $\{z = S^\delta : z_i = z_j\}$. A positive score implies that, at the network output, classes are separated by a margin corresponding to the score. However, a large score need not imply a large classification margin.
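As a small worked illustration of Definition 2, the snippet below computes the score $o(s_i)$ directly from the classifier's softmax output; the function names and the toy logits are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score(logits, label):
    """Score of Definition 2: min over j != y of sqrt(2) * (e_y - e_j)^T S,
    where S = softmax(V h_M). Positive iff the sample wins by a margin at the output."""
    s = softmax(logits)
    others = np.delete(s, label)
    return np.sqrt(2) * (s[label] - others.max())

# toy usage with illustrative logits V @ h_M for a 4-class problem
logits = np.array([2.0, 0.5, -1.0, 0.3])
print(score(logits, label=0))  # > 0: class 0 wins with this margin in the output space
```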
Following the approach of [52, 63], we obtain our second main result, providing bounds on the classification margin for the deterministic RNN classifiers $g^\delta$. We also provide a generalization bound in terms of the classification margin under additional assumptions (see Theorem 11 in SM).

Theorem 2. Suppose that Assumption A holds. Assume that the score $o(s_i) > 0$ and that

$$\gamma(s_i) := \frac{o(s_i)}{C \sum_{m=0}^{M-1} \delta_m \sup_{\hat x \in \mathrm{conv}(\mathcal{X})} \|\hat\Phi_{M,m+1}[\hat x]\|_2} > 0, \qquad (16)$$

where $C = \|V\|_2 \max_{m=0,1,\ldots,M-1} \left\|\frac{\partial f(\bar h^\delta_m, \hat x_m)}{\partial \hat x_m}\right\| > 0$ is independent of $s_i$ (in particular, $C = \|V\|_2 \left[\max_{m=0,\ldots,M-1}\|D_m U\|_2\right]$ for Lipschitz RNNs), the $\hat\Phi_{m,k}$ are defined in (15), and the $\delta_m$ are the step sizes. Then the classification margin of the training sample $s_i$ satisfies

$$\gamma^d(s_i) \ge \gamma(s_i). \qquad (17)$$

Now, recalling from Section 4, up to $O(\epsilon^2)$ and under the assumption that $\hat Q$ vanishes, the loss minimized by the NRNN classifier is, on average, $\ell(\bar h^\delta_M) + \epsilon^2 \hat R(\bar h^\delta)$ as $\epsilon \to 0$, with regularizer

$$\hat R(\bar h^\delta) = \frac{1}{2}\sum_{m=1}^{M} \delta_{m-1}\left\|\hat M_M\,\hat\Phi_{M-1,m}\,\bar\sigma_{m-1}\right\|_F^2, \qquad (18)$$

where $\hat M_M^T \hat M_M := H_{\bar h^\delta_M}\ell$ is the Cholesky decomposition of the Hessian matrix of the convex cross-entropy loss. The appearance of the state-to-state Jacobians, through the $\hat\Phi_{m,k}$, in both the regularizer (18) and the lower bound (16) suggests that noise injection implicitly aids generalization performance. More precisely, in the small noise regime and on average, NRNNs promote classifiers with large classification margin, an attribute linked to both improved robustness and generalization [63]. In this sense, training with NRNN classifiers is a stochastic strategy to improve generalization over deterministic RNN classifiers, particularly in learning tasks where the given data is corrupted (cf. the caveats pointed out in [53]).

Theorem 2 implies that the lower bound on the classification margin is determined by the spectrum of the $\hat\Phi_{M-1,m}$. To make the lower bound large, keeping the $\delta_m$ and $M$ fixed, the spectral norms of the $\hat\Phi_{M-1,m}$ should be made small. Doing so improves the stability of the RNN, but may also lead to vanishing gradients, hindering the capacity of the model to learn. To maximize the lower bound while avoiding the vanishing gradient problem, one should tune the numerical step sizes $\delta_m$ and the noise level $\epsilon$ in the NRNN appropriately. RNN architectures for the drift which help to ensure moderate Jacobians (e.g., $\|\hat\Phi_{M-1,m}\|_2 \approx 1$ for all $m$ [8]) also remain valuable in this respect.
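Since both the margin bound (16) and the regularizer (18) are governed by the spectral norms of the Jacobian products $\hat\Phi$, one may wish to monitor these quantities during or after training. The sketch below does so with automatic differentiation for a toy deterministic RNN of the form (6) with drift (3). The cell, the dimensions, and the use of `torch.autograd.functional.jacobian` are illustrative assumptions; for long sequences one would typically track the norms incrementally rather than forming the full Jacobians.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d_h, d_x, M, dt = 16, 4, 10, 0.1
A = -0.5 * torch.eye(d_h)
W = 0.3 * torch.randn(d_h, d_h)
U = 0.3 * torch.randn(d_h, d_x)
x_seq = torch.randn(M, d_x)

def step(h, x):
    # one deterministic Euler step with drift (3); its h-Jacobian is J_m of Eq. (7)
    return h + dt * (A @ h + torch.tanh(W @ h + U @ x))

# roll out the deterministic RNN (6) and collect the state-to-state Jacobians
h = torch.zeros(d_h)
Js = []
for x in x_seq:
    Js.append(jacobian(lambda hh: step(hh, x), h))
    h = step(h, x)

# form the products Phi_{M-1,m} = J_{M-1} ... J_m of Eq. (8) and their spectral norms
Phi = torch.eye(d_h)
norms = []
for J in reversed(Js):            # accumulate from the last step backwards
    Phi = Phi @ J
    norms.append(torch.linalg.matrix_norm(Phi, ord=2).item())
print(list(reversed(norms)))      # ||Phi_{M-1,m}||_2 for m = 0, ..., M-1
```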
6 Stability and Noise-Induced Stabilization

Here we obtain sufficient conditions that guarantee stochastic stability of CT-NRNNs. This will also provide another lens to highlight the potential of NRNNs for improved robustness. A dynamical system is considered stable if trajectories which are close to each other initially remain close at subsequent times. As observed in [46, 42, 7], stability plays an essential role in the study of RNNs in order to avoid the exploding gradient problem, a property of unstable systems where the gradient increases in magnitude with the depth. While gradient clipping during training can somewhat alleviate this issue, better performance and robustness are achieved by enforcing stability in the model itself.

Our stability analysis will focus on establishing almost sure exponential stability (for other notions of stability, see SM) for CT-NRNNs with the drift function (3). To preface the definition, consider initializing the SDE at two different random variables $h_0$ and $h_0' := h_0 + \epsilon_0$, where $\epsilon_0 \in \mathbb{R}^{d_h}$ is a constant non-random perturbation with $\|\epsilon_0\| \le \delta$. The resulting hidden states, $h_t$ and $h_t'$, are set to satisfy (2) with the same Brownian motion $B_t$, starting from their initial values $h_0$ and $h_0'$, respectively. The evolution of $\epsilon_t = h_t' - h_t$ satisfies

$$d\epsilon_t = A\epsilon_t\,dt + a_t(\epsilon_t)\,dt + \sigma_t(\epsilon_t)\,dB_t, \qquad (19)$$

where $a_t(\epsilon_t) = a(W h_t' + U x_t + b) - a(W h_t + U x_t + b)$ and $\sigma_t(\epsilon_t) = \sigma(h_t + \epsilon_t, x_t) - \sigma(h_t, x_t)$. Since $a_t(0) = 0$ and $\sigma_t(0) = 0$ for all $t \in [0, T]$, $\epsilon_t = 0$ is a trivial equilibrium of (19). Our objective is to analyze the stability of the solution $\epsilon_t = 0$, that is, to see how the final state $\epsilon_T$ (and hence the output of the RNN) changes for an arbitrarily small initial perturbation $\epsilon_0 \ne 0$. To this end, we consider an extension of the Lyapunov exponent to SDEs at the level of sample paths [41].

Definition 3 (Almost sure global exponential stability). The sample (or pathwise) Lyapunov exponent of the trivial solution of (19) is $\Lambda = \limsup_{t\to\infty} t^{-1}\log\|\epsilon_t\|$. The trivial solution $\epsilon_t = 0$ is almost surely globally exponentially stable if $\Lambda$ is almost surely negative for all $\epsilon_0 \in \mathbb{R}^{d_h}$.

For the sample Lyapunov exponent $\Lambda(\omega)$, there is a constant $C > 0$ and a random variable $0 \le \tau(\omega) < \infty$ such that for all $t > \tau(\omega)$, $\|\epsilon_t\| = \|h_t' - h_t\| \le C e^{\Lambda t}$ almost surely. Therefore, almost sure exponential stability implies that almost all sample paths of (19) tend to the equilibrium solution $\epsilon = 0$ exponentially fast. With this definition in hand, we obtain the following stability result.

Theorem 3. Assume that $a$ is monotone non-decreasing and that $\sigma_1\|\epsilon\| \le \|\sigma_t(\epsilon)\|_F \le \sigma_2\|\epsilon\|$ for all nonzero $\epsilon \in \mathbb{R}^{d_h}$ and $t \in [0, T]$. Then, for any $\epsilon_0 \in \mathbb{R}^{d_h}$, with probability one,

$$\varphi + \lambda_{\min}(A^{\mathrm{sym}}) \le \Lambda \le \psi + L_a\,\sigma_{\max}(W) + \lambda_{\max}(A^{\mathrm{sym}}), \qquad (20)$$

with $\varphi = \frac{\sigma_1^2}{2} - \sigma_2^2$ and $\psi = \frac{\sigma_2^2}{2} - \sigma_1^2$, where $L_a$ is the Lipschitz constant of $a$.

In the special case without noise ($\sigma_1 = \sigma_2 = 0$), we recover case (a) of Theorem 1 in [16]: when $A^{\mathrm{sym}}$ is negative definite and $\sigma_{\min}(A^{\mathrm{sym}}) > L_a\sigma_{\max}(W)$, Theorem 3 implies that (2) is exponentially stable. Most strikingly, and similarly to [35], Theorem 3 implies that even if the deterministic CT-RNN is not exponentially stable, it can be stabilized through a stochastic perturbation. Consequently, injecting noise appropriately can improve training performance.

7 Empirical Results

The evaluation of the robustness of neural networks (RNNs in particular) is an often neglected yet crucial aspect. In this section, we investigate the robustness of NRNNs and compare their performance to other recently introduced state-of-the-art models on both clean and corrupted data. We refer to Section G in SM for further details of our experiments. Here, we study the sensitivity of different RNN models with respect to a sequence of perturbed inputs at inference time. We consider different types of perturbations: (a) white noise; (b) multiplicative white noise; (c) salt and pepper; and (d) adversarial perturbations. To be more concrete, let $x$ be a sequence. The perturbations under consideration are as follows.

Additive white noise perturbations are constructed as $\tilde x = x + \Delta x$, where the additive noise is drawn from a Gaussian distribution, $\Delta x \sim N(0, \sigma)$. This perturbation strategy emulates measurement errors that can result from data acquisition with poor sensors (where $\sigma$ can be used to vary the strength of these errors).

Multiplicative white noise perturbations are constructed as $\tilde x = x \odot \Delta x$, where the multiplicative noise is drawn from a Gaussian distribution, $\Delta x \sim N(1, \sigma_M)$.

Salt and pepper perturbations emulate defective pixels that result from converting analog signals to digital signals. The noise model takes the form $P(\tilde X = X) = 1 - \alpha$ and $P(\tilde X = \max) = P(\tilde X = \min) = \alpha/2$, where $\tilde X(i, j)$ denotes a pixel of the corrupted image and $\min$ and $\max$ denote the minimum and maximum pixel values. The parameter $\alpha$ controls the proportion of defective pixels.

Adversarial perturbations are worst-case, non-random perturbations maximizing the loss $\ell(g^\delta(X + \Delta X), y)$ subject to the constraint $\|\Delta X\| \le r$ on the norm of the perturbation. We consider the fast gradient sign method for constructing these perturbations [55].
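The four perturbation types described above could be implemented, for instance, as in the following sketch. The helper names, the assumed pixel range in `salt_and_pepper`, and the `fgsm` signature are illustrative assumptions rather than the exact evaluation code used for the experiments.

```python
import torch

def additive_white_noise(x, sigma):
    # x_tilde = x + dx, with dx ~ N(0, sigma)
    return x + sigma * torch.randn_like(x)

def multiplicative_white_noise(x, sigma_m):
    # x_tilde = x * dx (elementwise), with dx ~ N(1, sigma_m)
    return x * (1.0 + sigma_m * torch.randn_like(x))

def salt_and_pepper(x, alpha, lo=0.0, hi=1.0):
    # each entry is set to the min or max pixel value with probability alpha/2 each
    u = torch.rand_like(x)
    x = torch.where(u < alpha / 2, torch.full_like(x, lo), x)
    x = torch.where(u > 1 - alpha / 2, torch.full_like(x, hi), x)
    return x

def fgsm(model, loss_fn, x, y, r):
    # fast gradient sign method: x_tilde = x + r * sign(grad_x loss)
    x_adv = x.detach().clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + r * x_adv.grad.sign()).detach()
```

Here $\sigma$, $\sigma_M$, $\alpha$, and $r$ play the roles of the perturbation strengths reported in Tables 1-3.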
We consider, in addition to the NRNN, three other RNNs derived from continuous-time models: the Lipschitz RNN [16] (the deterministic counterpart to our NRNN), the coupled oscillatory RNN (coRNN) [49], and the antisymmetric RNN [7]. We also consider the exponential RNN [32], a discrete-time model that uses orthogonal recurrent weights. We train each model with the prescribed tuning parameters for the ordered (see Sec. 7.1) and permuted (see SM) MNIST tasks. For the Electrocardiogram (ECG) classification task, we performed a non-exhaustive hyperparameter search. For comparison, we train all models with hidden-to-hidden weight matrices of dimension dh = 128. We average the classification performance over ten different seed values.

7.1 Ordered Pixel-by-Pixel MNIST Classification

First, we consider the ordered pixel-by-pixel MNIST classification task [31]. This task sequentially presents 784 pixels to the model and uses the final hidden state to predict the class membership probability of the input image. In the SM, we present additional results for the situation where, instead of the ordered sequence, a fixed random permutation of the input sequence is presented to the model. Table 1 shows the average test accuracy (evaluated for models that are trained with 10 different seed values) for the ordered task. Here we present results for white noise and salt and pepper (S&P) perturbations. While the Lipschitz RNN performs best on clean input sequences, the NRNNs show improved resilience to input perturbations. Here, we consider two different configurations for the NRNN. In both cases, we set the multiplicative noise level to 0.02, whereas we consider the additive noise levels 0.02 and 0.05. We chose these configurations as they appear to provide a good trade-off between accuracy and robustness. Note that the predictive accuracy on clean inputs starts to drop when the noise level becomes too large.

Table 1: Robustness w.r.t. white noise (σ) and S&P (α) perturbations on the ordered MNIST task.

| Name | clean | σ = 0.1 | σ = 0.2 | σ = 0.3 | α = 0.03 | α = 0.05 | α = 0.1 |
|---|---|---|---|---|---|---|---|
| Antisymmetric RNN [7] | 97.5% | 45.7% | 22.3% | 17.0% | 77.1% | 63.9% | 42.6% |
| coRNN [49] | 99.1% | 96.6% | 61.9% | 32.1% | 95.6% | 88.1% | 58.9% |
| Exponential RNN [32] | 96.7% | 86.7% | 58.1% | 33.3% | 83.6% | 70.7% | 43.4% |
| Lipschitz RNN [16] | 99.2% | 98.4% | 78.9% | 47.1% | 97.6% | 93.4% | 73.5% |
| NRNN (mult./add. noise: 0.02/0.02) | 99.1% | 98.9% | 88.4% | 62.9% | 98.3% | 95.6% | 78.7% |
| NRNN (mult./add. noise: 0.02/0.05) | 99.1% | 98.9% | 92.2% | 73.5% | 98.5% | 97.1% | 85.5% |

Figure 1: Test accuracy for the ordered MNIST task as a function of the strength of input perturbations. (a) White noise perturbations. (b) Salt and pepper perturbations.

Table 2 shows the average test accuracy for the ordered MNIST task under adversarial perturbations.
Again, the NRNNs show superior resilience even to large perturbations, whereas the Antisymmetric and Exponential RNNs appear to be sensitive even to small perturbations. Figure 1 summarizes the performance of the different models with respect to white noise and salt and pepper perturbations. The colored bands indicate one standard deviation around the average performance. In all cases, the NRNN appears to be less sensitive to input perturbations than the other models, while maintaining state-of-the-art performance on clean inputs.

7.2 Electrocardiogram (ECG) Classification

Next, we consider the Electrocardiogram (ECG) classification task, which aims to discriminate between normal and abnormal heart beats of a patient with severe congestive heart failure [20]. We use 500 sequences of length 140 for training, 500 sequences for validation, and 4000 sequences for testing. Table 3 shows the average test accuracy (evaluated for models that are trained with 10 different seed values) for this task. We present results for additive white noise and multiplicative white noise perturbations. Here, the NRNN, trained with the multiplicative noise level set to 0.03 and the additive noise level set to 0.06, performs best both on clean and on perturbed input sequences. Figure 2 summarizes the performance of the different models with respect to additive and multiplicative white noise perturbations. Again, the NRNN appears to be less sensitive to input perturbations than the other models, while achieving state-of-the-art performance on clean inputs.

Table 2: Robustness w.r.t. adversarial perturbations on the ordered pixel-by-pixel MNIST task.

| Name | r = 0.01 | r = 0.05 | r = 0.1 | r = 0.15 |
|---|---|---|---|---|
| Antisymmetric RNN [7] | 79.4% | 24.7% | 11.4% | 10.2% |
| coRNN [49] | 97.5% | 85.5% | 55.9% | 35.1% |
| Exponential RNN [32] | 94.5% | 59.3% | 19.7% | 14.3% |
| Lipschitz RNN [16] | 98.1% | 85.7% | 58.9% | 37.1% |
| NRNN (mult./add. noise: 0.02/0.02) | 98.8% | 94.3% | 79.6% | 58.3% |
| NRNN (mult./add. noise: 0.02/0.05) | 98.8% | 95.5% | 86.8% | 70.6% |

Table 3: Robustness w.r.t. white (σ) and multiplicative (σM) noise perturbations on the ECG task.

| Name | clean | σ = 0.4 | σ = 0.8 | σ = 1.2 | σM = 0.4 | σM = 0.8 | σM = 1.2 |
|---|---|---|---|---|---|---|---|
| Antisymmetric RNN [7] | 97.1% | 96.6% | 91.6% | 77.0% | 96.6% | 94.6% | 91.2% |
| coRNN [49] | 97.5% | 96.8% | 92.9% | 87.2% | 93.9% | 85.4% | 78.4% |
| Exponential RNN [32] | 97.4% | 95.6% | 86.4% | 76.7% | 95.7% | 89.4% | 81.3% |
| Lipschitz RNN [16] | 97.7% | 97.4% | 95.1% | 88.9% | 97.6% | 97.0% | 95.6% |
| NRNN (mult./add. noise: 0.03/0.06) | 97.7% | 97.5% | 96.3% | 92.6% | 97.7% | 97.3% | 96.5% |

Figure 2: Test accuracy for the ECG task as a function of the strength of input perturbations. (a) Additive white noise perturbations. (b) Multiplicative white noise perturbations.

8 Conclusion

In this paper, we provide a thorough theoretical analysis of RNNs trained by injecting noise into the hidden states. Within the framework of SDEs, we study the regularizing effects of general noise injection schemes. The experimental results are in agreement with our theory and its implications, showing that Noisy RNNs achieve superior robustness to input perturbations while maintaining state-of-the-art generalization performance. We believe our framework can be used to guide the principled design of a class of reliable and robust RNN classifiers. Our work opens up a range of interesting future directions.
In particular, for deterministic RNNs, it was shown that the models learn optimally near the edge of stability [8]. One could extend these analyses to NRNNs with the ultimate goal of improving their performance. On the other hand, as discussed in Section 5, although the noise is shown here to implicitly stabilize RNNs, it could negatively impact the capacity for long-term memory [42, 70]. Providing analyses that account for this and for the implicit bias due to the stochastic optimization procedure [50, 14] is the subject of future work.

Acknowledgements

We are grateful for the generous support from Amazon AWS. S. H. Lim would like to acknowledge the Nordita Fellowship 2018-2021 for providing support of this work. N. B. Erichson, L. Hodgkinson, and M. W. Mahoney would like to acknowledge IARPA (contract W911NF20C0035), ARO, NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

References

[1] Alnur Ali, Edgar Dobriban, and Ryan Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In International Conference on Machine Learning, pages 233–244. PMLR, 2020.
[2] Raman Arora, Peter Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit forms and capacity control. In International Conference on Machine Learning, pages 351–361. PMLR, 2021.
[3] Omri Azencot, N Benjamin Erichson, Vanessa Lin, and Michael W. Mahoney. Forecasting sequential data using consistent Koopman autoencoders. In International Conference on Machine Learning, pages 475–485. PMLR, 2020.
[4] Kaushik Balakrishnan and Devesh Upadhyay. Deep adversarial Koopman model for reaction-diffusion systems. arXiv preprint arXiv:2006.05547, 2020.
[5] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[6] Alexander Camuto, Matthew Willetts, Umut Simsekli, Stephen J Roberts, and Chris C Holmes. Explicit regularisation in Gaussian noise injections. In Advances in Neural Information Processing Systems, volume 33, pages 16603–16614, 2020.
[7] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019.
[8] Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, pages 873–882. PMLR, 2018.
[9] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[10] Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural networks. In International Conference on Learning Representations, 2019.
[11] Michal Derezinski, Feynman T Liang, and Michael W Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. In Advances in Neural Information Processing Systems, volume 33, pages 5152–5164, 2020.
[12] Adji Bousso Dieng, Rajesh Ranganath, Jaan Altosaar, and David Blei. Noisin: Unbiased regularization for recurrent neural networks. In International Conference on Machine Learning, pages 1252–1261. PMLR, 2018.
[13] Akshunna S. Dogra and William Redman. Optimizing neural networks via Koopman operator theory. In Advances in Neural Information Processing Systems, volume 33, pages 2087–2097, 2020.
[14] Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, and Alyson K Fletcher. Implicit bias of linear RNNs. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2982–2992. PMLR, 2021.
[15] Rainer Engelken, Fred Wolf, and LF Abbott. Lyapunov spectra of chaotic recurrent neural networks. arXiv preprint arXiv:2006.02427, 2020.
[16] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021.
[17] N Benjamin Erichson, Michael Muehlebach, and Michael W Mahoney. Physics-informed autoencoders for Lyapunov-stable fluid flow prediction. arXiv preprint arXiv:1905.10866, 2019.
[18] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, volume 29, 2016.
[19] D. F. Gleich and M. W. Mahoney. Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow. In Proceedings of the 31st International Conference on Machine Learning, pages 1018–1025, 2014.
[20] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
[21] Chengyue Gong, Tongzheng Ren, Mao Ye, and Qiang Liu. Maxup: Lightweight adversarial training with data augmentation improves neural network training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2474–2483, 2021.
[22] Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
[23] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
[24] Liam Hodgkinson, Chris van der Heide, Fred Roosta, and Michael W Mahoney. Stochastic continuous normalizing flows: Training SDEs as ODEs. In Uncertainty in Artificial Intelligence (UAI), 2021.
[25] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[26] Kam-Chuen Jim, C Lee Giles, and Bill G Horne. An analysis of noise in recurrent neural networks: Convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
[27] Anil Kag, Ziming Zhang, and Venkatesh Saligrama. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients? In International Conference on Learning Representations, 2020.
[28] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[29] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. In Advances in Neural Information Processing Systems, volume 33, pages 6696–6707, 2020.
[30] Peter E Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations, volume 23. Springer Science & Business Media, 2013.
[31] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[32] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803, 2019.
[33] Yunzhu Li, Hao He, Jiajun Wu, Dina Katabi, and Antonio Torralba. Learning compositional Koopman operators for model-based control. In International Conference on Learning Representations, 2019.
[34] Soon Hoe Lim. Understanding recurrent neural networks using nonequilibrium response theory. J. Mach. Learn. Res., 22:47-1, 2021.
[35] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. How does noise help robustness? Explanation and exploration under the neural SDE framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 282–290, 2020.
[36] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pages 3276–3285. PMLR, 2018.
[37] Michael Lutter, Christian Ritter, and Jan Peters. Deep Lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.
[38] Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don't. arXiv preprint arXiv:2009.10713, 2020.
[39] M. W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st ACM Symposium on Principles of Database Systems, pages 143–154, 2012.
[40] M. W. Mahoney and L. Orecchia. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning, pages 121–128, 2011.
[41] Xuerong Mao. Stochastic Differential Equations and Applications. Elsevier, 2007.
[42] John Miller and Moritz Hardt. Stable recurrent models. arXiv preprint arXiv:1805.10369, 2018.
[43] Jeremy Morton, Freddie D Witherden, and Mykel J Kochenderfer. Deep variational Koopman models: Inferring Koopman observations for uncertainty-aware dynamics modeling and control. arXiv preprint arXiv:1902.09742, 2019.
[44] Hyeonwoo Noh, Tackgeun You, Jonghwan Mun, and Bohyung Han. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems, pages 5109–5118, 2017.
[45] Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems, 19(1):480–509, 2020.
[46] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.
[47] Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, and Hrushikesh Mhaskar. Theory of deep learning III: Explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.
[48] Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
[49] T. Konstantin Rusch and Siddhartha Mishra. Coupled oscillatory recurrent neural network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies. In International Conference on Learning Representations, 2021.
[50] Samuel L Smith, Benoit Dherin, David GT Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176, 2021.
[51] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Generalization error of deep neural networks: Role of classification margin and data structure. In 2017 International Conference on Sampling Theory and Applications (SampTA), pages 147–151. IEEE, 2017.
[52] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
[53] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.
[54] Qi Sun, Yunzhe Tao, and Qiang Du. Stochastic training of residual networks: A differential equation viewpoint. arXiv preprint arXiv:1812.00174, 2018.
[55] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[56] Naoya Takeishi, Yoshinobu Kawahara, and Takehisa Yairi. Learning Koopman invariant subspaces for dynamic mode decomposition. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1130–1140, 2017.
[57] Sachin S Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity. arXiv preprint arXiv:1511.03771, 2015.
[58] Peter Toth, Danilo J Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and Irina Higgins. Hamiltonian generative networks. In International Conference on Learning Representations, 2019.
[59] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
[60] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov exponents for RNNs: Understanding information propagation using dynamical systems tools. arXiv preprint arXiv:2006.14123, 2020.
[61] Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. In Proceedings of the 37th International Conference on Machine Learning, pages 10181–10192, 2020.
[62] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[63] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.
[64] Yibo Yang, Jianlong Wu, Hongyang Li, Xia Li, Tiancheng Shen, and Zhouchen Lin. Dynamical system inspired adaptive time stepping controller for residual network families. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):6648–6655, 2020.
[65] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pages 4954–4964, 2018.
[66] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[67] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[68] Huishuai Zhang, Da Yu, Mingyang Yi, Wei Chen, and Tie-yan Liu. Stability and convergence theory for learning ResNet: A full characterization. 2019.
[69] Jingfeng Zhang, Bo Han, Laura Wynter, Bryan Kian Hsiang Low, and Mohan Kankanhalli. Towards robust ResNet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019.
[70] Jingyu Zhao, Feiqing Huang, Jia Lv, Yanjie Duan, Zhen Qin, Guodong Li, and Guangjian Tian. Do RNN and LSTM have long memory? In International Conference on Machine Learning, pages 11365–11375. PMLR, 2020.
[71] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. In International Conference on Learning Representations, 2019.