Ode to an ODE

Krzysztof Choromanski (Robotics at Google NY), Jared Quincy Davis (DeepMind & Stanford University), Valerii Likhosherstov (University of Cambridge), Xingyou Song (Google Brain), Jean-Jacques Slotine (Massachusetts Institute of Technology), Jacob Varley (Robotics at Google NY), Honglak Lee (Google Brain), Adrian Weller (University of Cambridge & The Alan Turing Institute), Vikas Sindhwani (Robotics at Google NY)

Abstract: We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group $O(d)$. This nested system of two flows, where the parameter-flow is constrained to lie on the compact manifold, provides stability and effectiveness of training and provably solves the gradient vanishing/explosion problem which is intrinsically related to training deep neural network architectures such as Neural ODEs. Consequently, it leads to better downstream models, as we show on the example of training reinforcement learning policies with evolution strategies, and in the supervised learning setting, by comparing with previous SOTA baselines. We provide strong convergence results for our proposed mechanism that are independent of the depth of the network, supporting our empirical studies. Our results show an intriguing connection between the theory of deep neural networks and the field of matrix flows on compact manifolds.

1 Introduction

Neural ODEs [13, 10, 27] are natural continuous extensions of deep neural network architectures, with the evolution of the intermediate activations governed by an ODE:

$$\frac{dx_t}{dt} = f(x_t, t, \theta), \qquad (1)$$

parameterized by $\theta \in \mathbb{R}^n$, where $f: \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^d$ is some nonlinear mapping defining the dynamics. A solution to the above system with initial condition $x_0$ is of the form:

$$x_t = x_0 + \int_{t_0}^{t} f(x_s, s, \theta)\,ds,$$

and can be approximated with various numerical integration techniques such as Runge-Kutta or Euler methods [48]. The latter gives rise to discretizations:

$$x_{t+dt} = x_t + f(x_t, t, \theta)\,dt,$$

that can be interpreted as discrete flows of ResNet-like computations [29] and establish a strong connection between the theory of deep neural networks and differential equations. This has led to successful applications of Neural ODEs in machine learning. Those include in particular efficient time-continuous normalizing flow algorithms [11] avoiding the computation of the determinant of the Jacobian (a computational bottleneck for discrete variants), as well as modeling latent dynamics in time-series analysis, particularly useful for handling irregularly sampled data [44]. Parameterized Neural ODEs can be efficiently trained via the adjoint sensitivity method [13] and are characterized by constant memory cost, independent of the depth of the system, with different parameterizations encoding different weight-sharing patterns across infinitesimally close layers. Such Neural ODE constructions enable deeper models than would otherwise be possible with a fixed computation budget; however, it has been noted that training instabilities and the problem of vanishing/exploding gradients can arise during the learning of very deep systems [43, 4, 23]. To resolve these challenges for discrete recurrent neural network architectures, several improvements relying on transition transformations encoded by orthogonal/Hermitian matrices were proposed [2, 33].
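For concreteness, the Euler discretization above can be realized as in the following minimal NumPy sketch; the toy dynamics `f` and the dimensions are illustrative assumptions, not the architectures studied in this paper:

```python
import numpy as np

def euler_neural_ode(x0, f, theta, T=1.0, n_steps=25):
    """Euler discretization of dx_t/dt = f(x_t, t, theta): each step
    x <- x + f(x, t, theta) * dt mirrors one ResNet-style layer."""
    x, t = x0.copy(), 0.0
    dt = T / n_steps
    for _ in range(n_steps):
        x = x + dt * f(x, t, theta)
        t += dt
    return x

# Hypothetical dynamics: one weight matrix plus an elementwise nonlinearity,
# just to make the integrator runnable.
theta = 0.5 * np.random.randn(4, 4)
f = lambda x, t, theta: np.tanh(theta @ x)
x_T = euler_neural_ode(np.ones(4), f, theta)
```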
Orthogonal matrices, when coupled with certain classes of nonlinear mappings, provably preserve norms of the loss gradients during backpropagation through layers, yet they also incur potentially substantial Riemannian optimization costs [21, 14, 28, 1]. Fortunately, there exist several efficient parameterizations of subgroups of the orthogonal group $O(d)$ that, even though they in principle reduce representational capacity, in practice produce high-quality models and bypass Riemannian optimization [36, 40, 34]. All these advances address discrete settings, so it is natural to ask what can be done for continuous systems, which by definition are deep.

In this paper, we answer this question by presenting a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group $O(d)$. Such flows can be thought of as analogous to sequences of orthogonal matrices in discrete structured orthogonal models. By linking the theory of training Neural ODEs with the rich mathematical field of matrix flows on compact manifolds, we can reformulate the problem of finding efficient Neural ODE algorithms as a task of constructing expressive parameterized flows on the orthogonal group. Using the example of orthogonal flows corresponding to the so-called isospectral flows [8, 9, 16], we show that such systems, studied by mathematicians for centuries, indeed help in training Neural ODE architectures (see: ISO-ODEtoODEs in Sec. 3.1). There is a voluminous mathematical literature on using isospectral flows as continuous systems that solve a variety of combinatorial problems including sorting, matching and more [8, 9], but to the best of our knowledge, we are the first to propose learning them as stabilizers for training deep Neural ODEs.

Our proposed nested systems of flows, where the parameter-flow is constrained to lie on the compact manifold, provide stability and effectiveness of training, and provably solve the gradient vanishing/exploding problem for continuous systems. Consequently, they lead to better downstream models, as we show on a broad set of experiments (training reinforcement learning policies with evolution strategies, and supervised learning). We support our empirical studies with strong convergence results, independent of the depth of the network. We are not the first to explore nested Neural ODE structure (see: [55]). Our novelty is in showing that such hierarchical architectures can be significantly improved by an entanglement with flows on compact manifolds.

To summarize, in this work we make the following contributions:

- We introduce new, explicit constructions of non-autonomous nested Neural ODEs (ODEtoODEs) where parameters are defined as rich functions of time evolving on compact matrix manifolds (Sec. 3). We present two main architectures: Gated-ODEtoODEs and ISO-ODEtoODEs (Sec. 3.1).
- We establish convergence results for ODEtoODEs on a broad class of Lipschitz-continuous objective functions, in particular in the challenging reinforcement learning setting (Sec. 4.2).
- We then use the above framework to outperform previous Neural ODE variants and baseline architectures on RL tasks from OpenAI Gym and the DeepMind Control Suite, and simultaneously to yield strong results on image classification tasks.
To the best of our knowledge, we are the first to show that well-designed Neural ODE models with a compact number of parameters make good candidates for training reinforcement learning policies via evolution strategies (ES) [15]. All proofs are given in the Appendix. We conclude in Sec. 6 and discuss broader impact in Sec. 7.

2 Related work

Our work lies at the intersection of several fields: the theory of Lie groups, Riemannian geometry, and deep neural systems. We provide an overview of the related literature below.

Matrix flows. Differential equations (DEs) on manifolds lie at the heart of modern differential geometry [17], which is a key ingredient to understanding structured optimization for stabilizing neural network training [14]. In our work, we consider in particular matrix gradient flows on $O(d)$ that are solutions to trainable matrix DEs. In order to efficiently compute these flows, we leverage the theory of compact Lie groups, which are compact smooth manifolds equipped with rich algebraic structure [35, 7, 42]. For efficient inference, we apply local linearizations of $O(d)$ via its Lie algebra (the vector space of skew-symmetric matrices) and exponential mappings [50] (see: Sec. 3).

Hypernetworks. The idea of using one neural network (a hypernetwork) to provide weights for another network [26] can be elegantly adapted to the nested Neural ODE setting, where [55] proposed to construct the time-dependent parameters of one Neural ODE as the output of another, concurrently evolving Neural ODE. While helpful in practice, this is insufficient to fully resolve the challenge of vanishing and exploding gradients. We expand upon this idea in our Gated-ODEtoODE network by constraining the hypernetwork to produce skew-symmetric matrices that are then translated to elements of the orthogonal group $O(d)$ via the aforementioned exponential mapping.

Stable Neural ODEs. In the line of work on stabilizing the training of Neural ODEs, [22] proposes regularization based on optimal transport, while [37] adds Gaussian noise into the ODE equation, turning it into a stochastic dynamical system. [20] lifts the ODE to a higher-dimensional space to prevent trajectory intersections, while [25, 39] add additional terms into the ODE, inspired by well-known physical dynamical systems such as Hamiltonian dynamics.

Orthogonal systems. For classical non-linear networks, a broad class of methods which improve stability and generalization involve orthogonality constraints on the layer weights. Such methods range from merely orthogonal initialization [46, 32] and orthogonal regularization [3, 53] to methods which completely constrain weight matrices to the orthogonal manifold throughout training using Riemannian optimization [35], by projecting the unconstrained gradient to the orthogonal group $O(d)$ [47]. Such methods have been highly useful in preventing exploding/vanishing gradients caused by long-term dependencies in recurrent neural networks (RNNs) [30, 31]. [52, 12] note that dynamical isometry can be preserved by orthogonality, which allows training networks of very large depth.

3 Improving Neural ODEs via flows on compact matrix manifolds

Our core ODEtoODE architecture is the following nested Neural ODE system:

$$\frac{dx_t}{dt} = f(W_t x_t), \qquad \frac{dW_t}{dt} = W_t b_\psi(t, W_t), \qquad (2)$$

for some function $f: \mathbb{R} \to \mathbb{R}$ (applied elementwise), and a parameterized function $b_\psi: \mathbb{R} \times \mathbb{R}^{d \times d} \to \mathrm{Skew}(d)$, where $\mathrm{Skew}(d)$ stands for the vector space of skew-symmetric (antisymmetric) matrices in $\mathbb{R}^{d \times d}$.
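The requirement that $b_\psi$ output skew-symmetric matrices is what keeps the parameter flow on $O(d)$: the matrix exponential maps the Lie algebra $\mathrm{Skew}(d)$ into the orthogonal group. A minimal sketch of this fact (illustrative only; the projection `A - A.T` is just one convenient way to obtain a skew-symmetric matrix):

```python
import numpy as np
from scipy.linalg import expm

d = 5
A = np.random.randn(d, d)
S = A - A.T                              # S is skew-symmetric: S^T = -S
W = expm(S)                              # exp maps Skew(d) into O(d)
print(np.allclose(W.T @ W, np.eye(d)))   # True: W^T W = I up to numerics
```

This holds because $\exp(S)^\top = \exp(S^\top) = \exp(-S) = \exp(S)^{-1}$.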
We take $W_0 \in O(d)$, where the orthogonal group $O(d)$ is defined as: $O(d) = \{M \in \mathbb{R}^{d \times d} : M^\top M = I_d\}$. It can be proven [35] that under these conditions $W_t \in O(d)$ for every $t \ge 0$. In principle, ODEtoODE can exploit an arbitrary compact matrix manifold $\Sigma$ (with $b_\psi$ modified accordingly so that the second equation in Formula 2 represents an on-manifold flow), yet in practice we find that taking $\Sigma = O(d)$ is the most effective. Furthermore, for $\Sigma = O(d)$ the structure of $b_\psi$ is particularly simple: it is only required to output skew-symmetric matrices. A schematic representation of the ODEtoODE architecture is given in Fig. 1. We take as $f$ a function that is non-differentiable at at most a finite number of its inputs and such that $|f'(x)| = 1$ at all others (e.g. $f(x) = |x|$).

Figure 1: Schematic representation of the discretized ODEtoODE architecture. On the left: nested ODE structure with the parameters of the main flow being fed with the output of the matrix flow (inside dashed contour). On the right: the matrix flow evolves on the compact manifold $\Sigma$, locally linearized by vector space $T$ and with mapping $\Gamma$ encoding this linearization ($O_t = b_\psi(t, W_t)$).

3.1 Shaping ODEtoODEs

An ODEtoODE is defined by: $x_0$, $W_0$ (initial conditions) and $b_\psi$. Vector $x_0$ encodes the input data, e.g. the state of an RL agent (Sec. 5.1) or an image (Sec. 5.2). The initial point $W_0$ of the matrix flow can be either learned (see: Sec. 3.2) or sampled upfront uniformly at random from the Haar measure on $O(d)$. We present two different parameterizations of $b_\psi$, leading to two different classes of ODEtoODEs:

ISO-ODEtoODEs: These leverage a popular family of isospectral flows (i.e. flows preserving the matrix spectrum), called double-bracket flows, given as: $\frac{dH_t}{dt} = [H_t, [H_t, N]]$, where $Q \overset{\mathrm{def}}{=} H_0$, $N \in \mathrm{Sym}(d) \subseteq \mathbb{R}^{d \times d}$, $\mathrm{Sym}(d)$ stands for the class of symmetric matrices in $\mathbb{R}^{d \times d}$ and $[\cdot, \cdot]$ denotes the Lie bracket: $[A, B] \overset{\mathrm{def}}{=} AB - BA$. Double-bracket flows with customized parameter matrices $Q, N$ can be used to solve various combinatorial problems such as sorting, matching, etc. [9]. It can be shown that $H_t$ is similar to $H_0$ for every $t \ge 0$, thus we can write: $H_t = W_t H_0 W_t^{-1}$, where $W_t \in O(d)$. The corresponding flow on $O(d)$ has the form $\frac{dW_t}{dt} = W_t [(W_t)^\top Q W_t, N]$, and we take: $b^{\mathrm{iso}}_\psi(t, W_t) \overset{\mathrm{def}}{=} [(W_t)^\top Q W_t, N]$, where $\psi = (Q, N)$ are learnable symmetric matrices. It is easy to see that $b_\psi$ outputs skew-symmetric matrices.

Gated-ODEtoODEs: In this setting, inspired by [10, 55], we simply take $b^{\mathrm{gated}}_\psi = \sum_{i=1}^{d} a_i B^i_\psi$, where $d$ stands for the number of gates, $a = (a_1, ..., a_d)$ are learnable coefficients and $B^i_\psi \overset{\mathrm{def}}{=} f^{\psi_i}_i - (f^{\psi_i}_i)^\top$. Here $\{f^{\psi_i}_i\}_{i=1,...,d}$ are outputs of neural networks parameterized by the $\psi_i$s and producing unstructured matrices, and $\psi = \mathrm{concat}(a, \psi_1, ..., \psi_d)$. As above, $b^{\mathrm{gated}}_\psi$ outputs matrices in $\mathrm{Skew}(d)$.

3.2 Learning and executing ODEtoODEs

Note that most of the parameters of an ODEtoODE from Formula 2 are unstructured and can be trained via standard methods. To train the initial orthogonal matrix $W_0$, we apply tools from the theory of Lie groups and Lie algebras [35]. First, we notice that $O(d)$ can be locally linearized via the skew-symmetric matrices $\mathrm{Skew}(d)$ (i.e. the tangent vector space $T_W$ to $O(d)$ at $W \in O(d)$ is of the form: $T_W = W \cdot \mathrm{Skew}(d)$). We then compute Riemannian gradients, which are projections of unstructured ones onto tangent spaces. For inference on $O(d)$, we apply the exponential mapping $\Gamma(W) \overset{\mathrm{def}}{=} \exp(\eta\, b_\psi(t, W))$, where $\eta$ is the step size, leading to a discretization of the form: $W_{t+dt} = W_t \Gamma(W_t)$ (see: Fig. 1).
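Putting the pieces together, a minimal sketch of how a discretized ISO-ODEtoODE forward pass could look, assuming $f(x) = |x|$, a shared step size $\eta$ for both flows, and randomly initialized symmetric $Q, N$ (illustrative NumPy code, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import expm

def iso_ode_to_ode(x0, W0, Q, N, eta=0.04, T=1.0):
    """Euler-discretized ISO-ODEtoODE: a sketch of Formula 2 with
    b_psi(t, W) = [W^T Q W, N] and f(x) = |x|."""
    x, W = x0.copy(), W0.copy()
    for _ in range(int(round(T / eta))):
        x = x + eta * np.abs(W @ x)          # main flow step
        S = W.T @ Q @ W                      # symmetric, since Q is symmetric
        B = S @ N - N @ S                    # Lie bracket: skew-symmetric
        W = W @ expm(eta * B)                # Gamma(W): W stays on O(d)
    return x, W

d = 8
sym = lambda M: 0.5 * (M + M.T)
Q, N = sym(np.random.randn(d, d)), sym(np.random.randn(d, d))
W0, _ = np.linalg.qr(np.random.randn(d, d))  # random orthogonal initialization
x_T, W_T = iso_ode_to_ode(np.random.randn(d), W0, Q, N)
print(np.allclose(W_T.T @ W_T, np.eye(d)))   # True up to numerics
```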
4 Theoretical results

4.1 ODEtoODEs solve the gradient vanishing/explosion problem

First we show that ODEtoODEs do not suffer from the gradient vanishing/explosion problem, i.e. the norms of loss gradients with respect to intermediate activations $x_t$ do not grow/vanish exponentially while backpropagating through time (through infinitesimally close layers).

Lemma 4.1 (ODEtoODEs for gradient stabilization). Consider a Neural ODE on time interval $[0, T]$, given by Formula 2. Let $\mathcal{L} = \mathcal{L}(x_T)$ be a differentiable loss function. The following holds for any $t \in [0, T]$, where $e = 2.71828...$ is the Euler constant:

$$e^{-(T-t)} \left\|\frac{\partial \mathcal{L}}{\partial x_T}\right\|_2 \;\le\; \left\|\frac{\partial \mathcal{L}}{\partial x_t}\right\|_2 \;\le\; e^{T-t} \left\|\frac{\partial \mathcal{L}}{\partial x_T}\right\|_2. \qquad (3)$$

4.2 ODEtoODEs for training reinforcement learning policies with ES

Here we show strong convergence guarantees for ODEtoODE. For clarity of exposition, we consider ISO-ODEtoODE, even though similar results can be obtained for Gated-ODEtoODE. We focus on training RL policies with ES [45], which is mathematically more challenging to analyze than the supervised setting, since it requires applying the ODEtoODE several times throughout the rollout. Analogous results can be obtained for the conceptually simpler supervised setting.

Let $\mathrm{env}: \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d \times \mathbb{R}$ be an environment function which takes the current state $s_k \in \mathbb{R}^d$ and an action encoded as a vector $a_k \in \mathbb{R}^m$ as its input, and outputs the next state $s_{k+1}$ and the next score value $l_{k+1} \in \mathbb{R}$: $(s_{k+1}, l_{k+1}) = \mathrm{env}(s_k, a_k)$. We treat env as a black-box function, meaning that we can evaluate its value for any given input, but we don't have access to env's gradients. The overall score $L$ is the sum of all per-step scores $l_k$ for the whole rollout from state $s_0$ to state $s_K$:

$$L(s_0, a_0, \ldots, a_{K-1}) = \sum_{k=1}^{K} l_k, \qquad \forall k \in \{1, \ldots, K\}: (s_k, l_k) = \mathrm{env}(s_{k-1}, a_{k-1}). \qquad (4)$$

We assume that $s_0$ is deterministic and known. The goal is to train a policy function $g_\theta: \mathbb{R}^d \to \mathbb{R}^m$ which, for the current state vector $s \in \mathbb{R}^d$, returns an action vector $a = g_\theta(s)$, where $\theta$ are the trained parameters. By the Stiefel manifold $\mathcal{ST}(d_1, d_2)$ we denote the nonsquare extension of orthogonal matrices: if $d_1 \ge d_2$ then $\mathcal{ST}(d_1, d_2) = \{\Omega \in \mathbb{R}^{d_1 \times d_2} \,|\, \Omega^\top \Omega = I\}$, otherwise $\mathcal{ST}(d_1, d_2) = \{\Omega \in \mathbb{R}^{d_1 \times d_2} \,|\, \Omega \Omega^\top = I\}$. We define $g_\theta$ as an ODEtoODE with the Euler discretization given below:

$$g_\theta(s) = \Omega_2 x_N, \quad x_0 = \Omega_1 s, \quad \forall i \in \{1, \ldots, N\}: \; x_i = x_{i-1} + \frac{1}{N} f(W_i x_{i-1} + b), \qquad (5)$$

$$W_i = W_{i-1} \exp\left(\frac{1}{N}\left(W_{i-1}^\top Q W_{i-1} N - N W_{i-1}^\top Q W_{i-1}\right)\right) \in O(n), \qquad (6)$$

$$\theta = \{\Omega_1 \in \mathcal{ST}(n, d),\; \Omega_2 \in \mathcal{ST}(m, n),\; b \in \mathbb{R}^n,\; N \in \mathrm{Sym}(n),\; Q \in \mathrm{Sym}(n),\; W_0 \in O(n)\} \in \mathcal{D},$$

where by $\mathcal{D}$ we denote $\theta$'s domain: $\mathcal{D} = \mathcal{ST}(n, d) \times \mathcal{ST}(m, n) \times \mathbb{R}^n \times \mathrm{Sym}(n) \times \mathrm{Sym}(n) \times O(n)$. Define the final objective $F: \mathcal{D} \to \mathbb{R}$ to maximize as

$$F(\theta) = L(s_0, a_0, \ldots, a_{K-1}), \qquad \forall k \in \{1, \ldots, K\}: a_{k-1} = g_\theta(s_{k-1}). \qquad (7)$$

For convenience, instead of $(s_{out}, l) = \mathrm{env}(s_{in}, a)$ we will write $s_{out} = \mathrm{env}^{(1)}(s_{in}, a)$ and $l = \mathrm{env}^{(2)}(s_{in}, a)$. In our subsequent analysis we will use a rather unrestrictive Lipschitz-continuity assumption on env, which intuitively means that a small perturbation of env's input leads to a small perturbation of its output. In addition, we assume that the per-step loss $l_k = \mathrm{env}^{(2)}(s_{k-1}, a_{k-1})$ is uniformly bounded.

Assumption 4.2. There exist $M, L_1, L_2 > 0$ such that for any $s', s'' \in \mathbb{R}^d$ and $a', a'' \in \mathbb{R}^m$ it holds that:

$$|\mathrm{env}^{(2)}(s', a')| \le M, \quad \|\mathrm{env}^{(1)}(s', a') - \mathrm{env}^{(1)}(s'', a'')\|_2 \le L_1 \delta, \quad |\mathrm{env}^{(2)}(s', a') - \mathrm{env}^{(2)}(s'', a'')| \le L_2 \delta,$$

where $\delta = \|s' - s''\|_2 + \|a' - a''\|_2$.

Assumption 4.3. $f(\cdot)$ is Lipschitz-continuous with Lipschitz constant 1: $\forall x', x'' \in \mathbb{R}: |f(x') - f(x'')| \le |x' - x''|$. In addition, $f(0) = 0$.

Instead of optimizing $F(\theta)$ directly, we opt for optimization of its Gaussian-smoothed proxy $F_\sigma(\theta)$:

$$F_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} F(\theta + \sigma \epsilon),$$

where $\sigma > 0$.
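(For concreteness, Eqs. (5)-(6) admit the following sketch of a forward pass; we rename the symmetric matrix $N$ to `N_sym` to avoid a clash with the number of discretization steps, and the Stiefel projections are built via thin QR purely for illustration.)

```python
import numpy as np
from scipy.linalg import expm

def policy_forward(s, Omega1, Omega2, b, W0, Q, N_sym, n_steps=25):
    """Sketch of g_theta from Eqs. (5)-(6); n_steps plays the role of N."""
    x = Omega1 @ s                                        # x_0 = Omega_1 s
    W = W0.copy()
    for _ in range(n_steps):
        S = W.T @ Q @ W
        W = W @ expm((S @ N_sym - N_sym @ S) / n_steps)   # Eq. (6)
        x = x + np.abs(W @ x + b) / n_steps               # Eq. (5), f = |.|
    return Omega2 @ x                                     # a = Omega_2 x_N

d_s, n, m = 10, 8, 4                                      # state/hidden/action dims
q1, _ = np.linalg.qr(np.random.randn(d_s, n)); Omega1 = q1.T   # in ST(n, d_s)
q2, _ = np.linalg.qr(np.random.randn(n, m));   Omega2 = q2.T   # in ST(m, n)
sym = lambda M: 0.5 * (M + M.T)
W0, _ = np.linalg.qr(np.random.randn(n, n))
a = policy_forward(np.random.randn(d_s), Omega1, Omega2, np.zeros(n),
                   W0, sym(np.random.randn(n, n)), sym(np.random.randn(n, n)))
```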
The gradient of $F_\sigma(\theta)$ has the form $\nabla F_\sigma(\theta) = \frac{1}{\sigma} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [F(\theta + \sigma \epsilon)\, \epsilon]$, which suggests an unbiased stochastic approximation that only requires evaluations of $F(\theta)$, i.e. no access to $F$'s gradient is needed, where $v$ is a chosen natural number:

$$\widetilde{\nabla} F_\sigma(\theta) = \frac{1}{\sigma v} \sum_{w=1}^{v} F(\theta + \sigma \epsilon_w)\, \epsilon_w, \qquad \epsilon_1, \ldots, \epsilon_v \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I).$$

By defining a corresponding metric, $\mathcal{D}$ becomes a product of Riemannian manifolds and a Riemannian manifold itself. Using the standard Euclidean metric, we consider the Riemannian gradient of $F_\sigma(\theta)$:

$$\nabla_R F_\sigma(\theta) = \{\nabla_{\Omega_1} \Omega_1^\top - \Omega_1 \nabla_{\Omega_1}^\top,\; \nabla_{\Omega_2} \Omega_2^\top - \Omega_2 \nabla_{\Omega_2}^\top,\; \nabla_b,\; \tfrac{1}{2}(\nabla_N + \nabla_N^\top),\; \tfrac{1}{2}(\nabla_Q + \nabla_Q^\top),\; \nabla_{W_0} W_0^\top - W_0 \nabla_{W_0}^\top\}, \qquad (8)$$

$$\text{where } \{\nabla_{\Omega_1}, \nabla_{\Omega_2}, \nabla_b, \nabla_N, \nabla_Q, \nabla_{W_0}\} = \nabla F_\sigma(\theta). \qquad (9)$$

We use Stochastic Riemannian Gradient Descent [6] to maximize $F_\sigma(\theta)$. The stochastic Riemannian gradient estimate $\widetilde{\nabla}_R F_\sigma(\theta)$ is obtained by substituting $\nabla \to \widetilde{\nabla}$ in (8)-(9). The following theorem proves that SRGD converges to a stationary point of the maximization objective with rate $O(\tau^{-0.5+\epsilon})$ for any $\epsilon > 0$. Moreover, the constant hidden in the $O(\tau^{-0.5+\epsilon})$ rate estimate doesn't depend on the number $N$ of discretization steps.

Theorem 1. Consider a sequence $\{\theta^{(\tau)} = \{\Omega_1^{(\tau)}, \Omega_2^{(\tau)}, b^{(\tau)}, N^{(\tau)}, Q^{(\tau)}, W_0^{(\tau)}\} \in \mathcal{D}\}_{\tau=0}^{\infty}$, where $\theta^{(0)}$ is deterministic and fixed and for each $\tau > 0$:

$$\Omega_1^{(\tau)} = \exp(\alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,\Omega_1})\, \Omega_1^{(\tau-1)}, \quad \Omega_2^{(\tau)} = \exp(\alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,\Omega_2})\, \Omega_2^{(\tau-1)}, \quad b^{(\tau)} = b^{(\tau-1)} + \alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,b},$$

$$N^{(\tau)} = N^{(\tau-1)} + \alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,N}, \quad Q^{(\tau)} = Q^{(\tau-1)} + \alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,Q}, \quad W_0^{(\tau)} = \exp(\alpha_\tau \widetilde{\nabla}^{(\tau)}_{R,W_0})\, W_0^{(\tau-1)},$$

$$\{\widetilde{\nabla}^{(\tau)}_{R,\Omega_1}, \widetilde{\nabla}^{(\tau)}_{R,\Omega_2}, \widetilde{\nabla}^{(\tau)}_{R,b}, \widetilde{\nabla}^{(\tau)}_{R,N}, \widetilde{\nabla}^{(\tau)}_{R,Q}, \widetilde{\nabla}^{(\tau)}_{R,W_0}\} = \widetilde{\nabla}_R F_\sigma(\theta^{(\tau)}), \qquad \alpha_\tau = \tau^{-0.5}.$$

Then $\min_{0 \le \tau' < \tau} \mathbb{E}[\|\nabla_R F_\sigma(\theta^{(\tau')})\|_2^2 \,|\, \mathcal{F}_{\tau, D, D_b}] \le E\, \tau^{-0.5+\epsilon}$ for any $\epsilon > 0$, where $E$ doesn't depend on $N$, $D, D_b > 0$ are constants, and $\mathcal{F}_{\tau, D, D_b}$ denotes the condition that for all $\tau' \le \tau$ it holds that $\|N^{(\tau')}\|_2, \|Q^{(\tau')}\|_2 < D$ and $\|b^{(\tau')}\|_2 < D_b$.

5 Experiments

We run two sets of experiments comparing ODEtoODE with several other methods: in the supervised setting and to train RL policies with ES. To the best of our knowledge, we are the first to propose Neural ODE architectures for RL policies and to explore how the compactification of the number of parameters they provide can be leveraged by ES methods that benefit from compact models [15].

5.1 Neural ODE policies with ODEtoODE architectures

5.1.1 Basic setup

In all Neural ODE methods we integrated on the time interval $[0, T]$ for $T = 1$ and applied discretization with integration step size $\eta = 0.04$ (in our ODEtoODE we used that $\eta$ for both the main flow and the orthogonal flow on $O(d)$). The dimensionality of the embedding of the input state $s$ was chosen to be $h = 64$ for OpenAI Gym Humanoid (for all methods but HyperNet, where we chose $h = 16$; see the discussion below) and $h = 16$ for all other tasks. Neural ODE policies were obtained by a linear projection of the input state into the embedded space and a Neural ODE flow in that space, followed by another linear projection into action space. In addition to learning the parameters of the Neural ODEs, we also trained their initial matrices and those linear projections. Purely linear policies were proven to get good rewards on OpenAI Gym environments [38], yet they lead to inefficient policies in practice [14]; thus even those environments benefit from deep nonlinear policies. We enriched our studies with additional environments from the DeepMind Control Suite. We used standard deviation stddev = 0.1 of the Gaussian noise defining ES perturbations, ES-gradient step size $\delta = 0.01$ and the function $\sigma(x) = |x|$ as the nonlinear mapping. In all experiments we used $k = 200$ perturbations per iteration [15].
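A compact sketch of the ES estimator $\widetilde{\nabla} F_\sigma$ and of one Riemannian update on an orthogonal factor, mirroring (8) and the updates in Theorem 1 (toy NumPy code; the reward function `F` and all hyperparameters are placeholders, not the experimental pipeline):

```python
import numpy as np
from scipy.linalg import expm

def es_grad(F, theta, sigma=0.1, v=200, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of grad F_sigma(theta):
    (1 / (sigma * v)) * sum_w F(theta + sigma * eps_w) * eps_w.
    Only zeroth-order evaluations of F are needed."""
    eps = rng.standard_normal((v, theta.size))
    vals = np.array([F(theta + sigma * e) for e in eps])
    return (vals[:, None] * eps).sum(axis=0) / (sigma * v)

def riemannian_update(W, grad_W, alpha):
    """One SRGD step on O(d) as in Theorem 1: form the skew Riemannian
    gradient grad W^T - W grad^T, then retract via the exponential map."""
    R = grad_W @ W.T - W @ grad_W.T
    return expm(alpha * R) @ W

F = lambda th: -np.sum(th ** 2)          # placeholder black-box "reward"
g = es_grad(F, np.ones(6))
```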
Number of policy parameters: To avoid favoring ODEtoODEs, the other architectures were scaled in such a way that they have a similar (but not smaller) number of parameters. Ablation studies were conducted for them across different sizes, and those providing the best (in terms of the final reward) mean curves were chosen. No ablation studies were run on ODEtoODEs.

Figure 2: Comparison of different algorithms: ODEtoODE, DeepNet, DeepResNet, BaseODE, NANODE and HyperNet on four OpenAI Gym and four DeepMind Control Suite tasks. HyperNet on Swimmer, Swimmer 15 and Humanoid Stand, as well as DeepNet on Humanoid, were not learning at all, so the corresponding curves were excluded. For HalfCheetah and HyperNet, there was initially a learning signal (red spike on the plot near the origin), but then the curve flattened at 0. Each plot shows mean ± std dev across s = 10 seeds. HyperNet was also much slower than all other methods because of unstructured hypernetwork computations. Humanoid corresponds to the OpenAI Gym environment; its two other versions on the plots correspond to the DeepMind Control Suite environment (with two different tasks).

5.1.2 Tested methods

We compared the following algorithms, including standard deep neural networks and Neural ODEs:

ODEtoODE: Matrices corresponding to linear projections were constrained to be taken from the Stiefel manifold $\mathcal{ST}(d_1, d_2)$ [14] (a generalization of the orthogonal group $O(d)$). Evolution strategies (ES) were applied as follows to optimize the entire policy. Gradients of Gaussian smoothings of the function $F: \mathbb{R}^m \to \mathbb{R}$, mapping the vectorized policy (we denote by $m$ the number of its parameters) to the obtained reward, were computed via the standard Monte Carlo procedure [15]. For the part of the gradient vector corresponding to the linear projection matrices and the initial matrix of the flow, the corresponding Riemannian gradients were computed to make sure that their updates keep them on the respective manifolds [14]. The Riemannian gradients were then used with the exact exponential mapping from $\mathrm{Skew}(d)$ to $O(d)$. For the unconstrained parameters defining the orthogonal flow, the standard ES-update procedure was used [45]. We applied the ISO-ODEtoODE version of the method (see: Sec. 3.1).

Deep(Res)Net: In this case we tested an unstructured deep feedforward fully connected (ResNet) neural network policy with $t = 25$ hidden layers.

BaseODE: A standard Neural ODE $\frac{dx(t)}{dt} = f(x_t)$, where $f$ was chosen to be a feedforward fully connected network with two hidden layers of size $h$.

NANODE: This leverages the recently introduced class of Non-Autonomous Neural ODEs (NANODEs) [19] that were shown to substantially outperform regular Neural ODEs in supervised training [19]. NANODEs rely on flows of the form $\frac{dx(t)}{dt} = \sigma(W_t x_t)$. Entries of the weight matrices $W_t$ are values of degree-$d$ trigonometric polynomials [19] with learnable coefficients (a sketch of this parameterization follows the method descriptions below). In all experiments we used $d = 5$. We observed that this was an optimal choice and that larger values of $d$, even though they improve model capacity, hurt training when the number of perturbations is fixed (as mentioned above, in all experiments we used $k = 200$).

HyperNet: For this method [55], the matrix $W_t$ was obtained as the result of the entangled neural ODE at time $t$ after de-vectorization of its output (to be more precise: the output of size 256 was reshaped into a matrix from $\mathbb{R}^{16 \times 16}$). We encoded the dynamics of that entangled Neural ODE by a neural network $f$ with two hidden layers of size $s = 16$ each.
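For illustration, a sketch of how NANODE-style time-dependent weights could be parameterized; the exact trigonometric basis used in [19] may differ from this guess:

```python
import numpy as np

def nanode_weights(t, bias, coeffs_cos, coeffs_sin):
    """Each entry of W_t is a degree-D trigonometric polynomial in t with
    learnable coefficients (our reading of [19]; basis is an assumption)."""
    D = coeffs_cos.shape[0]
    ks = np.arange(1, D + 1)[:, None, None]        # frequencies 1..D
    return (bias
            + (coeffs_cos * np.cos(ks * t)).sum(axis=0)
            + (coeffs_sin * np.sin(ks * t)).sum(axis=0))

h, D = 16, 5                                       # hidden size, degree d = 5
W_t = nanode_weights(0.3, np.zeros((h, h)),
                     0.1 * np.random.randn(D, h, h),
                     0.1 * np.random.randn(D, h, h))
```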
Note that even relatively small hypernetworks are characterized by a large number of parameters, which becomes a problem while training ES policies. We thus used $h = 16$ for all tasks while running the HyperNet algorithm. We did not put any structural assumptions on the matrices $W_t$; in particular, they were not constrained to belong to $O(d)$.

5.1.3 Discussion of the results

The results are presented in Fig. 2. Our ODEtoODE is the sole best performing algorithm on four out of eight tasks: Humanoid, HalfCheetah, Humanoid Stand and Swimmer 15, and is one of the two best performing on the remaining four. It is clearly the most consistent algorithm across the board and the only one that learns good policies for Humanoid. Each of the remaining methods fails on at least one of the tasks, and some, such as HyperNet, DeepNet and DeepResNet, on more. Exponential mapping (Sec. 3.2) computations for ODEtoODEs with hidden representations of size $h \le 64$ took negligible time (we used the well-optimized scipy.linalg.expm function). Thus all the algorithms but HyperNet (with expensive hypernetwork computations) had similar running times.

5.2 Supervised learning with ODEtoODE architectures

We also show that ODEtoODE can be effectively applied in supervised learning by comparing it with multiple baselines on various image datasets.

5.2.1 Basic setup

All our supervised learning experiments use the Corrupted MNIST [41] dataset (11 different variants). For all models in Table 1, we did not use dropout, applied hidden width $w = 128$, and trained for 100 epochs. For all models in Table 2, we used dropout with rate $r = 0.1$, hidden width $w = 256$, and trained for 100 epochs. For ODEtoODE variants, we used a discretization of $\eta = 0.01$.

5.2.2 Tested methods

We compared our ODEtoODE approach with several strong baselines: feedforward fully connected neural networks (Dense), Neural ODEs inspired by [13] (NODE), Non-Autonomous Neural ODEs [19] (NANODE) and hypernets [55] (HyperNet). For NANODE, we vary the degree of the trigonometric polynomials. For HyperNet, we used its gated version and compared against Gated-ODEtoODE (see: Sec. 3.1). In this experiment, we focus on fully-connected architecture variants of all models. As discussed in prior work, orthogonality takes on a different form in the convolutional case [49], so we reserve this framing for future work.

5.2.3 Discussion of the results

The results are presented in Table 1 and Table 2 below. Our ODEtoODE outperforms the other model variants on 11 out of the 15 tasks in Table 1. On these tasks, we find that ODEtoODE-1, whereby we apply a constant perturbation to the hidden units, works best.
| Corruption | Dense-1 | Dense-10 | NODE | NANODE-1 | NANODE-10 | HyperNet-1 | HyperNet-10 | ODEtoODE-1 | ODEtoODE-10 |
|---|---|---|---|---|---|---|---|---|---|
| Dotted Lines | 92.99 | 88.22 | 91.54 | 92.42 | 92.74 | 91.88 | 91.9 | 95.42 | 95.22 |
| Spatter | 93.54 | 89.52 | 92.49 | 93.13 | 93.15 | 93.19 | 93.28 | 94.9 | 94.9 |
| Stripe | 30.55 | 20.57 | 36.76 | 16.4 | 21.37 | 19.71 | 18.69 | 44.51 | 44.37 |
| Translate | 25.8 | 24.82 | 27.09 | 28.97 | 29.31 | 29.42 | 29.3 | 26.82 | 26.63 |
| Rotate | 82.8 | 80.38 | 82.76 | 83.19 | 83.65 | 83.5 | 83.56 | 83.9 | 84.1 |
| Scale | 58.68 | 62.73 | 62.05 | 66.43 | 66.63 | 67.84 | 68.11 | 66.68 | 66.76 |
| Shear | 91.73 | 89.52 | 91.82 | 92.18 | 93.11 | 92.33 | 92.48 | 93.37 | 92.93 |
| Motion Blur | 78.16 | 67.25 | 75.18 | 76.53 | 79.22 | 78.82 | 78.33 | 78.63 | 77.58 |
| Glass Blur | 91.62 | 84.89 | 87.94 | 90.18 | 91.07 | 91.3 | 91.17 | 93.91 | 93.29 |
| Shot Noise | 96.16 | 91.63 | 94.24 | 94.97 | 94.73 | 94.76 | 94.81 | 96.91 | 96.71 |
| Identity | 97.55 | 95.73 | 97.61 | 97.65 | 97.69 | 97.72 | 97.54 | 97.94 | 97.88 |

Table 1: Test accuracy comparison of different methods. Postfix terms refer to hidden depth for Dense, trigonometric polynomial degree for NANODE, and number of gates for HyperNet and ODEtoODE.

| Corruption | Dense-1 | Dense-2 | Dense-4 | NODE | NANODE-1 | NANODE-2 | NANODE-4 | HyperNet-1 | HyperNet-2 | HyperNet-4 | ODEtoODE-1 | ODEtoODE-2 | ODEtoODE-4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dotted Line | 94.47 | 93.21 | 91.88 | 92.9 | 92 | 92.02 | 92.02 | 92.35 | 92.91 | 92.57 | 95.64 | 95.66 | 95.6 |
| Spatter | 94.52 | 93.59 | 93.63 | 93.32 | 92.81 | 92.82 | 92.84 | 94.09 | 93.94 | 93.73 | 95.28 | 95.47 | 95.29 |
| Stripe | 29.69 | 32.73 | 31.49 | 36.08 | 27.32 | 23.56 | 24.66 | 30.86 | 31.12 | 29.1 | 36.25 | 28.21 | 31.91 |
| Translate | 25.85 | 28.35 | 27.27 | 29.13 | 29.11 | 29.24 | 28.7 | 29.85 | 29.68 | 29.87 | 25.61 | 26.1 | 26.42 |
| Rotate | 83.62 | 83.73 | 83.88 | 83.09 | 82.5 | 82.77 | 83.03 | 83.96 | 84.04 | 84.13 | 85.1 | 85.03 | 84.72 |
| Scale | 61.41 | 65.07 | 63.72 | 63.62 | 65.49 | 65.33 | 64.03 | 69.51 | 68.77 | 69.8 | 67.97 | 66.95 | 67.09 |
| Shear | 92.25 | 92.76 | 92.55 | 92.27 | 92.08 | 92.18 | 92.3 | 93.35 | 92.84 | 93.04 | 93.15 | 93.06 | 93.38 |
| Motion Blur | 76.47 | 73.89 | 74.95 | 76.3 | 75.24 | 75.88 | 76.22 | 80.67 | 81.26 | 81.36 | 78.92 | 79.11 | 79.02 |
| Glass Blur | 92.5 | 89.65 | 90.29 | 89.47 | 89.5 | 89.5 | 89.8 | 93.07 | 92.67 | 92.69 | 94.46 | 94.19 | 94.3 |
| Shot Noise | 96.41 | 95.78 | 95.22 | 95.09 | 94.18 | 94.12 | 93.85 | 95.36 | 96.88 | 96.77 | 96.93 | 96.88 | 96.77 |
| Identity | 97.7 | 97.75 | 97.71 | 97.64 | 97.6 | 97.5 | 97.52 | 97.7 | 97.82 | 97.79 | 98.03 | 98.12 | 98.11 |

Table 2: Additional sweep for the setting of Table 1. This time all models also incorporate dropout with rate r = 0.1 and width w = 256. As in Table 1, ODEtoODE variants achieve the highest performance.

6 Conclusion

In this paper, we introduced nested Neural ODE systems, where the parameter-flow evolves on the orthogonal group $O(d)$. Constraining this matrix flow to develop on the compact manifold provides us with an architecture that can be efficiently trained without exploding/vanishing gradients, as we showed theoretically and demonstrated empirically by presenting gains on various downstream tasks. We are the first to demonstrate that algorithms training RL policies and relying on compact model representations can significantly benefit from such hierarchical systems.

7 Broader impact

We believe our contributions have potential broader impact that we briefly discuss below:

Reinforcement Learning with Neural ODEs: To the best of our knowledge, we are the first to propose applying nested Neural ODEs in Reinforcement Learning, in particular to train Neural ODE policies. More compact architectures encoding deep neural network systems are an especially compelling feature in policy training algorithms, in particular when combined with ES methods, which admit embarrassingly simple and efficient parallelization yet suffer from a high sampling complexity that increases with the number of policy parameters.
Our work shows that RL training of such systems can be conducted efficiently, provided that the evolution of the parameters of the system is highly structured and takes place on compact matrix manifolds.

Learnable Isospectral Flows: We demonstrated that ISO-ODEtoODEs can be successfully applied to learn reinforcement learning policies. These rely on isospectral flows that in the past were demonstrated to be capable of solving combinatorial problems ranging from sorting to (graph) matching [8, 54]. As emphasized before, such flows are however fixed and not trainable, whereas we learn ours. This suggests that isospectral flows can potentially be learned to solve combinatorially-flavored machine learning problems, or even be integrated with non-combinatorial blocks in larger ML computational pipelines. A benefit of such an approach is that we can efficiently backpropagate through these continuous systems; this relates to recent research on differentiable sorting [18, 5, 24].

8 Acknowledgements

Adrian Weller acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 and U/B/000074, and the Leverhulme Trust via CFI. Valerii Likhosherstov acknowledges support from the Cambridge Trust and DeepMind.

References

[1] Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

[2] Martín Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1120-1128, 2016.

[3] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4266-4276, 2018.

[4] Yoshua Bengio, Paolo Frasconi, and Patrice Y. Simard. The problem of learning long-term dependencies in recurrent networks. In Proceedings of the International Conference on Neural Networks (ICNN '93), San Francisco, CA, USA, March 28 - April 1, 1993, pages 1183-1188, 1993.

[5] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. CoRR, abs/2002.08871, 2020.

[6] Silvère Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.

[7] Theodor Bröcker and Tammo tom Dieck. Representations of Compact Lie Groups. Springer, 1985.

[8] Roger W. Brockett. Least squares matching problems. Linear Algebra and its Applications, 1989.

[9] Roger W. Brockett. Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems. Linear Algebra and its Applications, 146:79-91, 1991.

[10] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[11] Changyou Chen, Chunyuan Li, Liquan Chen, Wenlin Wang, Yunchen Pu, and Lawrence Carin. Continuous-time flows for efficient inference and density estimation. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 823-832. PMLR, 2018.
[12] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 872-881, 2018.

[13] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571-6583, 2018.

[14] Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamás Sarlós, Adrian Weller, and Vikas Sindhwani. Stochastic flows and geometric optimization on the orthogonal group. CoRR, abs/2003.13563, 2020.

[15] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969-977, 2018.

[16] Moody T. Chu and Larry K. Norris. Isospectral flows and abstract matrix factorizations. SIAM Journal on Numerical Analysis, 25(6):1383-1391, 1988.

[17] Philippe G. Ciarlet. An introduction to differential geometry with applications to elasticity. Journal of Elasticity, 78-79:1-215, 2005.

[18] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 6858-6868, 2019.

[19] Jared Quincy Davis, Krzysztof Choromanski, Jake Varley, Honglak Lee, Jean-Jacques E. Slotine, Valerii Likhosherstov, Adrian Weller, Ameesh Makadia, and Vikas Sindhwani. Time dependence in non-autonomous neural ODEs. CoRR, abs/2005.01906, 2020.

[20] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 3134-3144, 2019.

[21] Alan Edelman, Tomás A. Arias, and Steven Thomas Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Analysis Applications, 20(2):303-353, 1998.

[22] Chris Finlay, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam M. Oberman. How to train your neural ODE. CoRR, abs/2002.02798, 2020.

[23] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive Computation and Machine Learning. MIT Press, 2016.

[24] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic optimization of sorting networks via continuous relaxations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

[25] Batuhan Güler, Alexis Laignelet, and Panos Parpas.
Towards robust and stable deep learning algorithms for forward-backward stochastic differential equations. CoRR, abs/1910.11623, 2019.

[26] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR, abs/1609.09106, 2016.

[27] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

[28] Ernst Hairer. Important aspects of geometric numerical integration. J. Sci. Comput., 25(1):67-81, 2005.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770-778, 2016.

[30] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1974-1983, 2018.

[31] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2034-2042, 2016.

[32] Wei Hu, Lechao Xiao, and Jeffrey Pennington. Provable benefit of orthogonal initialization in optimizing deep linear networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.

[33] Kui Jia, Shuai Li, Yuxin Wen, Tongliang Liu, and Dacheng Tao. Orthogonal deep neural networks. CoRR, abs/1905.05929, 2019.

[34] Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott A. Skirlo, Yann LeCun, Max Tegmark, and Marin Soljacic. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1733-1741, 2017.

[35] John Lee. Introduction to Smooth Manifolds, 2nd revised edition, volume 218. Springer, 2012.

[36] Valerii Likhosherstov, Jared Davis, Krzysztof Choromanski, and Adrian Weller. CWY parametrization for scalable learning of orthogonal and Stiefel matrices. CoRR, abs/2004.08675, 2020.

[37] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: stabilizing neural ODE networks with stochastic noise. CoRR, abs/1906.02355, 2019.

[38] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 1805-1814, 2018.

[39] Stefano Massaroli, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Stable neural flows. CoRR, abs/2003.08063, 2020.

[40] Zakaria Mhammedi, Andrew D. Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2401-2409, 2017.

[41] Norman Mu and Justin Gilmer. MNIST-C: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337, 2019.

[42] Peter J. Olver. Applications of Lie Groups to Differential Equations. Springer, second edition, 2000.

[43] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio.
On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 1310-1318. JMLR.org, 2013.

[44] Yulia Rubanova, Tian Qi Chen, and David Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5321-5331, 2019.

[45] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864, 2017.

[46] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[47] Uri Shalit and Gal Chechik. Coordinate-descent for learning orthogonal matrices through Givens rotations. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 548-556, 2014.

[48] Endre Süli and David F. Mayers. An Introduction to Numerical Analysis. 2003.

[49] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X. Yu. Orthogonal convolutional neural networks. arXiv preprint arXiv:1911.12207, 2019.

[50] Frank W. Warner. Foundations of Differentiable Manifolds and Lie Groups. 1971.

[51] Ralph M. Wilcox. Exponential operators and parameter differentiation in quantum physics. Journal of Mathematical Physics, 8(4):962-982, 1967.

[52] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5389-5398, 2018.

[53] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5075-5084, 2017.

[54] M. M. Zavlanos and G. J. Pappas. A dynamical systems approach to weighted graph matching. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 3492-3497, December 2006.

[55] Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, and George Biros. ANODEV2: A coupled neural ODE framework. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5152-5162, 2019.