# Dissecting Neural ODEs

Stefano Massaroli (The University of Tokyo, DiffEqML), massaroli@robot.t.u-tokyo.ac.jp
Michael Poli (KAIST, DiffEqML), poli_m@kaist.ac.kr
Jinkyoo Park (KAIST), jinkyoo.park@kaist.ac.kr
Atsushi Yamashita (The University of Tokyo), yamashita@robot.t.u-tokyo.ac.jp
Hajime Asama (The University of Tokyo), asama@robot.t.u-tokyo.ac.jp

Abstract

Continuous deep learning architectures have recently re-emerged as Neural Ordinary Differential Equations (Neural ODEs). This infinite-depth approach theoretically bridges the gap between deep learning and dynamical systems, offering a novel perspective. However, deciphering the inner workings of these models is still an open challenge, as most applications apply them as generic black-box modules. In this work we "open the box", further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics.

1 Introduction

Neural ODEs (Chen et al., 2018) represent the latest instance of continuous deep learning models, first developed in the context of continuous recurrent networks (Cohen and Grossberg, 1983). Since their introduction, research on Neural ODE variants (Tzen and Raginsky, 2019; Jia and Benson, 2019; Zhang et al., 2019b; Yıldız et al., 2019; Poli et al., 2019) has progressed at a rapid pace. However, the search for concise explanations and experimental evaluations of novel architectures has left many fundamental questions unanswered. In this work, we establish a general system-theoretic Neural ODE formulation (1) and dissect it into its core components; we analyze each of them separately, shining light on peculiar phenomena unique to the continuous deep learning paradigm. In particular, augmentation strategies are generalized beyond ANODEs (Dupont et al., 2019), and the novel concepts of data control and adaptive depth, enriching (1), are showcased as effective approaches to learn maps such as reflections or concentric annuli without augmentation. While explicit dependence on the depth variable has been considered in the original formulation (Chen et al., 2018), parameter depth-variance in continuous models has been overlooked. We provide the treatment in infinite-dimensional space required by the true deep limit of ResNets, the solution of which leads to a Neural ODE variant based on a spectral discretization.

Neural Ordinary Differential Equation:

$$\dot z(s) = f_{\theta(s)}(s, x, z(s)), \qquad z(0) = h_x(x), \qquad \hat y(s) = h_y(z(s)), \qquad s \in \mathcal{S} \qquad (1)$$

with input x ∈ R^{n_x}, output ŷ ∈ R^{n_y}, (hidden) state z(s) ∈ R^{n_z}, parameters θ(s) ∈ R^{n_θ}, neural vector field f_{θ(s)} with values in R^{n_z}, input network h_x : R^{n_x} → R^{n_z} and output network h_y : R^{n_z} → R^{n_y}.

*Equal contribution. Author order was decided by flipping a coin.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Depth variance. Vanilla Neural ODEs (Chen et al., 2018) cannot be considered the deep limit of ResNets. We discuss the subtleties involved, uncovering a formal optimization problem in functional space as the price to pay for true depth variance. Obtaining its solution leads to two novel variants of Neural ODEs: a Galërkin-inspired spectral discretization (GalNODE) and a piecewise-constant model. GalNODEs are showcased on a task involving a loss distributed on the depth domain, requiring the introduction of a generalized version of the adjoint in (Chen et al., 2018).
Augmentation strategies. The augmentation idea of ANODEs (Dupont et al., 2019) is taken further and generalized to novel dynamical-system-inspired and parameter-efficient alternatives, relying on different choices of h_x in (1). These approaches, which include input layer and higher-order augmentation, are verified to be more effective than existing methods in terms of performance and parameter efficiency.

Beyond augmentation: data control and adaptive depth. We unveil that, although important, augmentation is not always necessary in challenging tasks such as learning reflections or concentric annuli (Dupont et al., 2019). To start, we demonstrate that depth-varying vector fields alone are sufficient in dimensions greater than one. We then provide theoretical and empirical results motivating two novel Neural ODE paradigms: adaptive depth, where the integration bound is itself determined by an auxiliary neural network, and data-controlled, where f_{θ(s)} is conditioned on the input data x, allowing the ODE to learn a family of vector fields instead of a single one. Finally, we warn against input networks h_x of the multilayer, nonlinear type, as these can make Neural ODE flows superfluous.

2 Continuous-Depth Models

A general formulation. In the context of Neural ODEs, we assume we are given a stream of input-output data {(x_k, y_k)}_{k∈K}, where K is a linearly ordered finite subset of N. Inference in Neural ODEs is carried out by solving the initial value problem (IVP) (1), i.e.

$$z_k(S) = h_x(x_k) + \int_{\mathcal{S}} f_{\theta(\tau)}(\tau, x_k, z_k(\tau))\, d\tau.$$

Our degrees of freedom, other than h_x and h_y, are the choice of the parameters θ within a given pre-specified class W of functions S → R^{n_θ}.

Well-posedness. If f_{θ(s)} is Lipschitz, for each x_k the initial value problem in (1) admits a unique solution z_k defined on the whole of S. If this is the case, there is a mapping φ from W × R^{n_x} to the space of absolutely continuous functions S → R^{n_z} such that z_k := φ(θ, x_k) satisfies the ODE in (1). This in turn implies that, for all k ∈ K, the map (θ, x_k, s) ↦ γ(θ, x_k, s) := h_y(φ(θ, x_k)(s)) satisfies ŷ_k(s) = γ(θ, x_k, s). For compactness, for any s ∈ S we denote φ(θ, x_k)(s) by φ_s(θ, x_k).

Training: optimal control. (Chen et al., 2018) treated the training of constant-parameter Neural ODEs (i.e. W is the space of constant functions), considering only terminal loss functions depending on the terminal state z(S). However, in the framework of Neural ODEs, the latent state evolves through a continuum of layers, steering the model output ŷ(s) towards the label. It thus makes sense to also introduce a loss function distributed on the whole depth domain S, e.g.

$$\ell := L(z(S)) + \int_{\mathcal{S}} l(\tau, z(\tau))\, d\tau. \qquad (2)$$

Training can then be cast into the optimal control (Pontryagin et al., 1962) problem

$$\min_{\theta \in \mathcal{W}} \frac{1}{|K|}\sum_{k\in K} \ell_k \quad \text{subject to} \quad \dot z_k(s) = f_{\theta(s)}(s, x_k, z_k(s)),\; s \in \mathcal{S}, \quad z_k(0) = h_x(x_k), \quad \hat y_k(s) = h_y(z_k(s)), \qquad (3)$$

solved by gradient descent. Here, if θ is constant, the gradients can be computed with O(1) memory efficiency by generalizing the adjoint sensitivity method of (Chen et al., 2018).

Proposition 1 (Generalized Adjoint Method). Consider the loss function (2). Then,

$$\frac{d\ell}{d\theta} = \int_{\mathcal{S}} a^\top(\tau)\,\frac{\partial f_{\theta(\tau)}}{\partial \theta}\, d\tau,$$

where a(s) satisfies

$$\dot a^\top(s) = -a^\top(s)\,\frac{\partial f_{\theta(s)}}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Appendix B contains additional insights on the choice of activation, training regularizers and approximation capabilities of Neural ODEs, along with a detailed derivation of the above result.
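To make the general formulation concrete, the following is a minimal, self-contained sketch of model (1) with an integral loss of the form (2). It assumes a simple fixed-step RK4 solver and backpropagation through the solver (rather than the generalized adjoint of Proposition 1); all layer sizes and module names are illustrative.

```python
# Minimal sketch of (1)-(2): input network h_x, depth-varying vector field f,
# output network h_y, and a left-endpoint quadrature of the integral loss term.
import torch
import torch.nn as nn

class NeuralODE(nn.Module):
    def __init__(self, nx=2, nz=2, ny=1, nh=32, n_steps=20):
        super().__init__()
        self.hx = nn.Linear(nx, nz)                         # input network h_x
        self.f = nn.Sequential(                             # vector field f_theta(s, x, z)
            nn.Linear(nz + nx + 1, nh), nn.Tanh(), nn.Linear(nh, nz))
        self.hy = nn.Linear(nz, ny)                         # output network h_y
        self.n_steps = n_steps

    def vf(self, s, x, z):
        s_col = torch.full((z.shape[0], 1), s)              # broadcast depth s to the batch
        return self.f(torch.cat([z, x, s_col], dim=-1))     # depth-variant, data-controlled field

    def forward(self, x, l=None):
        z = self.hx(x)                                      # z(0) = h_x(x)
        ds, int_loss = 1.0 / self.n_steps, 0.0
        for i in range(self.n_steps):                       # RK4 over S = [0, 1]
            s = i * ds
            k1 = self.vf(s, x, z)
            k2 = self.vf(s + ds / 2, x, z + ds * k1 / 2)
            k3 = self.vf(s + ds / 2, x, z + ds * k2 / 2)
            k4 = self.vf(s + ds, x, z + ds * k3)
            if l is not None:                               # quadrature of the integral term in (2)
                int_loss = int_loss + ds * l(s, z)
            z = z + ds * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        return self.hy(z), int_loss                         # \hat y(S) and approx. of \int_S l dtau
```

Training then amounts to minimizing the batch average of L(z(S)) plus the returned integral term by stochastic gradient descent; Proposition 1 and Theorem 1 describe how the same gradients can instead be obtained with constant memory via the adjoint state a(s).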
3 Depth-Variance: Infinite Dimensions for Infinite Layers

Bringing residual networks to the deep limit. Vanilla Neural ODEs, as they appear in the original paper (Chen et al., 2018), cannot be fully considered the deep limit of ResNets. In fact, while each residual block is characterized by its own parameter vector θ_s, the authors consider the model ż(s) = f_θ(s, z(s)), where the depth variable s enters the dynamics per se² rather than through the map s ↦ θ(s). The first attempt to pursue the true deep limit of ResNets is the hypernetwork approach of (Zhang et al., 2019b), where another neural network parametrizes the dynamics of θ(s). However, this approach is not backed by any theoretical argument and exhibits a considerable parameter inefficiency, as it generally scales polynomially in n_θ. We adopt a different approach, setting out to tackle the problem theoretically in the general formulation. Here, we uncover an optimization problem in functional space, solved by a direct application of the adjoint sensitivity method in infinite dimensions. We then introduce two parameter-efficient, depth-variant Neural ODE architectures based on the solution of this problem: Galërkin Neural ODEs and Stacked Neural ODEs.

Gradient descent in functional space. When the model parameters are depth-varying, θ : S → R^{n_θ}, the nonlinear optimization problem (3) should in principle be solved by iterating a gradient descent algorithm in a functional space (Smyrlis and Zisis, 2004), e.g. θ_{k+1}(s) = θ_k(s) − η δℓ_k/δθ(s), once the Gateaux derivative δℓ_k/δθ(s) is computed. Let L²(S → R^{n_θ}) be the space of square-integrable functions S → R^{n_θ}. Hereafter, we show that if θ(s) ∈ W := L²(S → R^{n_θ}), then the loss sensitivity to θ(s) can be computed through the adjoint method.

Theorem 1 (Infinite-Dimensional Gradients). Consider the loss function (2) and let θ(s) ∈ L²(S → R^{n_θ}). Then, the sensitivity of ℓ with respect to θ(s) (i.e. the directional derivative in functional space) is

$$\frac{\delta \ell}{\delta \theta(s)} = a^\top(s)\,\frac{\partial f_{\theta(s)}}{\partial \theta(s)},$$

where a(s) satisfies

$$\dot a^\top(s) = -a^\top(s)\,\frac{\partial f_{\theta(s)}}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Note that, although Theorem 1 provides a constructive method to compute the loss gradient in the infinite-dimensional setting, its implementation requires choosing a finite-dimensional approximation of the solution. We offer two alternatives: a spectral discretization approach, relying on reformulating the problem on some functional basis, and a depth discretization approach.

Spectral discretization: Galërkin Neural ODEs. The idea is to expand θ(s) on a complete orthogonal basis of a predetermined subspace of L²(S → R^{n_θ}) and truncate the series at the m-th term:

$$\theta(s) = \sum_{j=1}^{m} \alpha_j \psi_j(s).$$

In this way, the problem is turned into a finite-dimensional one and training aims to optimize the parameters α = (α_1, ..., α_m) ∈ R^{m n_θ}, whose gradients can be computed as follows.

Corollary 1 (Spectral Gradients). Under the assumptions of Theorem 1, if θ(s) = Σ_{j=1}^m α_j ψ_j(s), then

$$\frac{d\ell}{d\alpha} = \int_{\mathcal{S}} a^\top(\tau)\,\frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\,\psi(\tau)\, d\tau, \qquad \psi = (\psi_1, \dots, \psi_m).$$
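The following is a minimal sketch of the spectral parametrization behind Galërkin Neural ODEs: a single depth-varying linear layer whose weights θ(s) are expanded on a truncated Fourier basis ψ_j(s) over S = [0, 1], with the coefficients α_j as the only trainable parameters. The class name, the choice of basis and the sizes are illustrative assumptions, not the library's API.

```python
# Sketch of a Galerkin-style layer: theta(s) = sum_j alpha_j * psi_j(s), psi = Fourier basis.
import math
import torch
import torch.nn as nn

class GalLinear(nn.Module):
    """Linear layer whose weight and bias are functions of the depth variable s in [0, 1]."""
    def __init__(self, n_in, n_out, n_harmonics=5):
        super().__init__()
        self.n_in, self.n_out = n_in, n_out
        n_params = n_out * n_in + n_out                    # flattened weight + bias
        n_basis = 2 * n_harmonics + 1                      # 1, cos(2*pi*j*s), sin(2*pi*j*s)
        self.alpha = nn.Parameter(0.1 * torch.randn(n_basis, n_params))
        self.n_harmonics = n_harmonics

    def basis(self, s):
        js = torch.arange(1, self.n_harmonics + 1, dtype=torch.float32)
        return torch.cat([torch.ones(1),
                          torch.cos(2 * math.pi * js * s),
                          torch.sin(2 * math.pi * js * s)])    # psi(s), shape (n_basis,)

    def forward(self, s, z):
        theta_s = self.basis(s) @ self.alpha               # theta(s) as a flat vector
        W = theta_s[: self.n_out * self.n_in].view(self.n_out, self.n_in)
        b = theta_s[self.n_out * self.n_in:]
        return z @ W.T + b
```

Stacking two such layers with a nonlinearity in between yields a two-layer Galërkin vector field of the kind used in the tracking experiment below; choosing Chebyshev or other polynomial bases only changes the `basis` method.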
Depth discretization: Stacked Neural ODEs. An alternative approach to parametrizing θ(s) is to assume it piecewise constant in S, i.e. θ(s) = θ_i for s ∈ [s_i, s_{i+1}], with S = ∪_{i=0}^{p−1} [s_i, s_{i+1}]. It is easy to see that evaluating this model is equivalent to stacking p Neural ODEs with constant parameters,

$$z(S) = h_x(x) + \sum_{i=0}^{p-1}\int_{s_i}^{s_{i+1}} f_{\theta_i}(\tau, x, z(\tau))\, d\tau.$$

Here, training is carried out by optimizing the resulting p·n_θ parameters using the following result.

Corollary 2 (Stacked Gradients). Under the assumptions of Theorem 1, if θ(s) = θ_i for s ∈ [s_i, s_{i+1}], then

$$\frac{d\ell}{d\theta_i} = \int_{s_i}^{s_{i+1}} a^\top(\tau)\,\frac{\partial f_{\theta_i}}{\partial \theta_i}\, d\tau,$$

where a(s) satisfies

$$\dot a^\top(s) = -a^\top(s)\,\frac{\partial f_{\theta_i}}{\partial z} - \frac{\partial l}{\partial z}, \quad s \in [s_i, s_{i+1}], \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

²In practice, s is often concatenated to z and fed to f_θ.

The two approaches offer different perspectives on the problem of parametrizing the evolution of θ(s): while the spectral method imposes a stronger prior on the model class, based on the chosen basis (e.g. Fourier series, Chebyshev polynomials, etc.), the depth discretization method allows for more freedom. Further details on proofs, derivations and implementation of the two models are given in the Appendix.

Tracking signals via depth variance. Consider the problem of tracking a periodic signal β(s). We show how this can be achieved, without introducing additional inductive biases such as (Greydanus et al., 2019), through a synergistic combination of a two-layer Galërkin Neural ODE and the generalized adjoint with integral loss l(s) := ‖β(s) − z(s)‖²₂. The models, trained on s ∈ [0, 1], generalize accurately in extrapolation, recovering the dynamics.

Figure 1: Galërkin Neural ODEs trained with integral losses accurately recover periodic signals. Blue curves correspond to different initial conditions and converge asymptotically to the desired reference trajectory.

Fig. 2 showcases the depth dynamics of θ(s) for the Galërkin and Stacked variants trained to solve a simple binary classification problem. Additional insights and details are reported in the Appendix.

Figure 2: Galërkin and Stacked parameter-varying Neural ODE variants: depth flows (above) and evolution of the parameters (below).

Depth variance brings Neural ODEs closer to the ideal continuum of neural network layers with untied weights, enhancing their expressivity.

4 Augmenting Neural ODEs

Augmented Neural ODEs (ANODEs) (Dupont et al., 2019) propose solving the initial value problem (IVP) in a higher-dimensional space in order to limit the complexity of learned flows, i.e. having n_z > n_x. The approach proposed in the seminal paper relies on initializing to zero the n_a := n_z − n_x augmented dimensions: z(0) = [x, 0]. We will henceforth refer to this augmentation strategy as 0-augmentation. In this section we discuss alternative augmentation strategies for Neural ODEs that match or improve on 0-augmentation in terms of performance or parameter efficiency.

Input layer augmentation. Following the standard deep learning approach of increasing layer width to achieve improved model capacity, 0-augmentation can be generalized by introducing an input network h_x : R^{n_x} → R^{n_z} to compute the initial condition,

$$z(0) = h_x(x), \qquad (4)$$

leading to the general formulation (1). This approach gives the model more freedom in determining the initial condition of the IVP, instead of constraining it to a concatenation of x and 0, at a small parameter cost if h_x is, e.g., a linear layer. We refer to this type of augmentation as input layer (IL) augmentation and to the model as IL-NODE. Note that 0-augmentation is compatible with the general IL formulation, as it corresponds to h_x : x ↦ [x, 0]. In applications where maintaining the structure of the first n_x dimensions is important, e.g. the approximation of dynamical systems, a parameter-efficient alternative to (4) can be obtained by modifying the input network h_x to only affect the additional n_a dimensions, i.e. h_x(x) := [x, ξ(x)], ξ : R^{n_x} → R^{n_a}.
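As a concrete illustration, the following sketch writes the augmentation strategies above as interchangeable input networks h_x for model (1); the class names and the linear maps are illustrative assumptions.

```python
# Three choices of h_x: 0-augmentation, input-layer (IL) augmentation,
# and the parameter-efficient selective variant h_x(x) = [x, xi(x)].
import torch
import torch.nn as nn

class ZeroAug(nn.Module):
    """ANODE-style 0-augmentation: z(0) = [x, 0]."""
    def __init__(self, n_aug):
        super().__init__()
        self.n_aug = n_aug
    def forward(self, x):
        return torch.cat([x, x.new_zeros(x.shape[0], self.n_aug)], dim=-1)

class ILAug(nn.Module):
    """Input-layer augmentation: z(0) = h_x(x) with a single linear layer."""
    def __init__(self, n_x, n_z):
        super().__init__()
        self.hx = nn.Linear(n_x, n_z)
    def forward(self, x):
        return self.hx(x)

class SelectiveILAug(nn.Module):
    """Selective variant: z(0) = [x, xi(x)], leaving the first n_x states untouched."""
    def __init__(self, n_x, n_aug):
        super().__init__()
        self.xi = nn.Linear(n_x, n_aug)
    def forward(self, x):
        return torch.cat([x, self.xi(x)], dim=-1)
```

Any of these modules can be composed with the solver sketched in Sec. 2 to obtain an augmented Neural ODE with n_z > n_x.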
Higher-order Neural ODEs. Further parameter efficiency can be achieved by lifting the Neural ODE to higher orders. For example, let z(s) = [z_q(s), z_p(s)]; a second-order Neural ODE of the form

$$\ddot z_q(s) = f_{\theta(s)}(s, z(s)) \qquad (5)$$

is equivalent to the first-order system

$$\dot z_q(s) = z_p(s), \qquad \dot z_p(s) = f_{\theta(s)}(s, z_q(s), z_p(s)). \qquad (6)$$

The above can be extended to higher-order Neural ODEs as

$$\frac{d^n z_1}{ds^n} = f_{\theta(s)}\!\left(s, z_1, \frac{d z_1}{ds}, \dots, \frac{d^{n-1} z_1}{ds^{n-1}}\right), \qquad z = [z_1, z_2, \dots, z_n], \quad z_i \in \mathbb{R}^{n_z/n}, \qquad (7)$$

or, equivalently, ż_i = z_{i+1}, ż_n = f_{θ(s)}(s, z). Note that the parameter efficiency of this method arises from the fact that f_{θ(s)} maps into R^{n_z/n} instead of R^{n_z}. A limitation of system (6) is that a naive extension to second order requires a number of augmented dimensions n_a = n_x. To allow for flexible augmentations of few dimensions, n_a < n_x, the formulation of second-order Neural ODEs can be modified so that only a few dimensions have higher-order dynamics. We include the formulation and additional details of selective higher-order augmentation in the supplementary material. Finally, higher-order augmentation can itself be combined with input layer augmentation. A minimal code sketch of the second-order system (6) is given at the end of this section.

Revisiting results for augmented Neural ODEs. In higher-dimensional state spaces, such as those of image classification settings, the benefits of augmentation become subtle and manifest as performance improvements and a lower number of function evaluations (NFEs) (Chen et al., 2018). We revisit the image classification experiments of (Dupont et al., 2019) and evaluate four classes of depth-invariant Neural ODEs: vanilla (no augmentation), ANODE (0-augmentation), IL-NODE (input layer augmentation), and second order. The input network h_x is composed of a single, linear layer. The main objective of these experiments is to rank the efficiency of different augmentation strategies; for this reason, the setup involves neither hybrid or composite Neural ODE architectures nor data augmentation. The results for five experiments are reported in Table 1. IL-NODEs consistently preserve lower NFEs than the other variants, whereas second-order Neural ODEs offer a parameter-efficient alternative. The performance gap widens on CIFAR10, where the disadvantage of fixed 0 initial conditions forces 0-augmented Neural ODEs into performing a high number of function evaluations.

Table 1: Mean test results across 10 runs on MNIST and CIFAR. We report the mean NFE at convergence. Input layer and higher-order augmentation improve task performance and preserve low NFEs at convergence.

| Metric | NODE (MNIST) | NODE (CIFAR) | ANODE (MNIST) | ANODE (CIFAR) | IL-NODE (MNIST) | IL-NODE (CIFAR) | 2nd Ord. (MNIST) | 2nd Ord. (CIFAR) |
|---|---|---|---|---|---|---|---|---|
| Test Acc. | 96.8 | 58.9 | 98.9 | 70.8 | 99.1 | 72.8 | – | – |
| NFE | 98 | 93 | 71 | 169 | 44 | 59 | – | – |
| Param. [K] | 21.4 | – | – | – | – | – | – | – |

It should be noted that prepending an input multi-layer neural network to the Neural ODE was the approach chosen in the experimental evaluations of the original Neural ODE paper (Chen et al., 2018), and that (Dupont et al., 2019) opted for a comparison between no input layer and 0-augmentation. However, a significant difference exists between architectures depending on the depth and expressivity of h_x. Indeed, utilizing non-linear, multi-layer input networks can be detrimental, as discussed in Sec. 5.

Augmentation relieves Neural ODEs of their expressivity limitations. Learning initial conditions improves on 0-augmentation in terms of performance and NFEs.
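Below is the promised sketch of the second-order formulation (6): the state is split as z = [z_q, z_p], only the "velocity" component is produced by the network, and the vector field maps R^{n_z} into R^{n_z/2}. Sizes and the class name are illustrative assumptions.

```python
# Sketch of the second-order system (6): z_q' = z_p, z_p' = f_theta(s, z_q, z_p).
import torch
import torch.nn as nn

class SecondOrderField(nn.Module):
    def __init__(self, n_half, n_hidden=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * n_half + 1, n_hidden), nn.Tanh(),
                               nn.Linear(n_hidden, n_half))

    def forward(self, s, z):
        zq, zp = z.chunk(2, dim=-1)                   # z = [z_q, z_p]
        s_col = torch.full((z.shape[0], 1), s)
        dzq = zp                                      # z_q' = z_p
        dzp = self.f(torch.cat([zq, zp, s_col], -1))  # z_p' = f_theta(s, z_q, z_p)
        return torch.cat([dzq, dzp], dim=-1)
```

Plugging this field into any solver, e.g. the RK4 loop of Sec. 2, yields a second-order Neural ODE; the selective variant mentioned above simply restricts the split to a subset of the state dimensions.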
5 Beyond Augmentation: Data Control and Depth Adaptation

Augmentation strategies are not always necessary for Neural ODEs to solve challenging tasks such as the concentric annuli problem (Dupont et al., 2019). While it is indeed true that in the one-dimensional case two distinct trajectories can never intersect in the state space, this does not necessarily hold in general. In fact, dynamics in the first two spatial dimensions are substantially different from those in higher dimensions, e.g. no chaotic behaviors are possible (Khalil and Grizzle, 2002). In R² (and hence in Rⁿ), infinitely "wider" than R, distinct trajectories of a depth-varying process can well intersect in the state space, provided that they do not pass through the same point at the same depth (Khalil and Grizzle, 2002). This implies, in turn, that depth-varying models such as Galërkin Neural ODEs can solve these tasks in all dimensions but R. Starting from the one-dimensional case, we propose new classes of models allowing Neural ODEs to perform challenging tasks, such as approximating reflections (Dupont et al., 2019), without the need for any augmentation.

5.1 Data-Controlled Neural ODEs

We hereby derive a new class of models, namely data-controlled Neural ODEs. To introduce the proposed approach, we start with an analytical result regarding the approximation of reflection maps such as ϕ(x) = −x. The proof provides a design recipe for a simple handcrafted ODE capable of approximating ϕ with arbitrary accuracy by leveraging the input data x. We denote the conditioning of the vector field with x, necessary to achieve the desired result, as data control. This result highlights that, through data control, Neural ODEs can approximate ϕ arbitrarily well without augmentation, providing a novel perspective on existing results about the expressivity limitations of continuous models (Dupont et al., 2019). The result is the following:

Proposition 2. For all ε > 0 and x ∈ R there exists a parameter θ > 0 such that

$$|\phi(x) - z(1)| < \epsilon, \qquad (8)$$

where z(1) is the solution of the Neural ODE

$$\dot z(s) = -\theta\,(z(s) + x), \qquad z(0) = x, \qquad s \in [0, 1]. \qquad (9)$$

The proof is reported in the Appendix. Fig. 3 shows a version of model (9) where θ is trained with standard backpropagation. This model is indeed able to closely approximate ϕ(x) without augmentation, confirming the theoretical result.

Figure 3: Depth trajectories over the vector field of the data-controlled Neural ODE (9) for x = 1 and x = −1. The model learns a family of vector fields conditioned on the input x to approximate ϕ(x).

From this preliminary example, we then define the general data-controlled Neural ODE as

$$\dot z(s) = f_{\theta(s)}(s, x, z(s)), \qquad z(0) = h_x(x). \qquad (10)$$

Model (10) incorporates the input data x into the vector field, effectively allowing the ODE to learn a family of vector fields instead of a single one. Direct dependence on x further constrains the ODE to be smooth with respect to the initial condition, acting as a regularizer. Indeed, in the experimental evaluation at the end of Sec. 5, data-controlled models recover an accurate decision boundary. Further experimental results on the representation of ϕ with the latter general model are reported in the Appendix. It should be noted that (10) does not require explicit dependence of the vector field on x: computationally, x can be passed to f_{θ(s)} in different ways, such as through an additional embedding step. In this setting, data control offers a natural extension to conditional Neural ODEs.
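Below is a minimal sketch of the data-controlled model (10), obtained by concatenating the input x to the state before feeding the vector field, together with the handcrafted reflection field of Proposition 2 for comparison. Class and function names and the layer sizes are illustrative assumptions.

```python
# Data-controlled vector field f_theta(s, x, z): the input x conditions the dynamics.
import torch
import torch.nn as nn

class DataControlledField(nn.Module):
    def __init__(self, n_x, n_z, n_hidden=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(n_z + n_x, n_hidden), nn.Tanh(),
                               nn.Linear(n_hidden, n_z))
    def forward(self, s, x, z):
        return self.f(torch.cat([z, x], dim=-1))     # here depth-invariant for simplicity

def reflection_field(theta, x, z):
    """Handcrafted data-controlled ODE of Proposition 2: dz/ds = -theta * (z + x)."""
    return -theta * (z + x)
```

With θ large, integrating `reflection_field` from z(0) = x over s ∈ [0, 1] drives z(1) towards −x, which is exactly the mechanism exploited in the proof of Proposition 2.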
Figure 4: Data-controlled CNFs can morph prior distributions into distinct posteriors to produce conditional samples. This task often requires crossing trajectories and is not possible with vanilla CNFs.

Data control in normalizing flows. Conditional variants of generative models can be guided to produce samples with different characteristics depending on specific requirements. Data control can be leveraged to obtain a conditional variant of continuous normalizing flows (CNFs) (Chen et al., 2018). Here, we consider the standard setting of learning an unknown data distribution p(x), given samples {x_k}_{k∈K}, through a parametrized density p_θ. Continuous normalizing flows (Chen et al., 2018; Grathwohl et al., 2018; Finlay et al., 2020) obtain p_θ by a change of variables, using the flow of an ODE to warp a (known) prior distribution q(z), i.e.

$$\log p_\theta(x) = \log q(\varphi_S(x)) + \log \det\left|\frac{\partial \varphi_S(x)}{\partial x}\right|,$$

where the log-determinant of the Jacobian is computed via the fluid mechanics identity

$$\frac{d}{ds}\log\det\left|\frac{\partial \varphi_s(x)}{\partial x}\right| = \nabla \cdot f_{\theta(s)}(s, \varphi_s(x))$$

(Villani, 2003). CNFs are trained via maximum likelihood, i.e. by minimizing the Kullback-Leibler divergence between p and p_θ or, equivalently, ℓ := −(1/|K|) Σ_{k∈K} log p_θ(x_k). A CNF can then be used as a generative model for p_θ(x) by sampling the known distribution, z_S ∼ q(z_S), and evolving z_S backward in the depth domain:

$$z(0) = z_S + \int_S^0 f_{\theta(s)}(s, z(s))\, ds.$$

In this context, introducing data control into f_θ allows the CNF to be conditioned on data or task information. Data-controlled CNFs can thus be used in multi-objective generative tasks, e.g. using a single model to sample from N different distributions by warping N predetermined known priors q_i. We train one-dimensional, data-controlled CNFs to approximate two different data distributions p_1, p_2 by sampling from two distinct priors q_1, q_2 and conditioning the vector field with the sample z_S drawn from the prior, i.e.

$$\dot z(s) = f_\theta(z_S, z(s)), \qquad z_S \sim q_1 \text{ or } z_S \sim q_2.$$

Fig. 4 shows how data-controlled CNFs are capable of conditionally sampling from two normal target data distributions. In this case we selected p_1, p_2 as univariate normal distributions with means 1 and −1, respectively, and q_1 ≡ p_2, q_2 ≡ p_1. The resulting learned vector field strongly depends on the value of the prior sample z_S and is almost constant in z, meaning that the prior distributions are shifted almost rigidly along the flow in a direction determined by the initial condition. This task is inaccessible to standard CNFs, as it requires crossing flows in z. Indeed, the proposed benchmark represents a density estimation analogue of the crossing trajectories problem.

5.2 Adaptive-Depth Neural ODEs

Let us come back to the approximation of ϕ(x). Without incorporating the input data into f_{θ(s)}, it is not possible to realize a mapping x ↦ φ_s(x) mimicking ϕ, due to the topology-preserving property of the flows. Nevertheless, a Neural ODE can be employed to approximate ϕ(x) without the need for any crossing trajectories. In fact, if each input is integrated over a different depth domain, S(x) = [0, s_x], it is possible to learn ϕ without crossing flows, as shown in Fig. 5.

Figure 5: Depth trajectories over the vector field of the adaptive-depth Neural ODE. The reflection map can be learned by the proposed model: the key is to assign different integration times to the inputs, thus not requiring the intersection of trajectories.

In general, we can use a hypernetwork g, trained to learn the integration depth of each sample. In this setting, we define the general adaptive-depth class as Neural ODEs performing the mapping x ↦ φ_{g_ω(x)}(x), i.e.

$$h_x(x) + \int_0^{g_\omega(x)} f_{\theta(\tau)}(\tau, x, z(\tau))\, d\tau,$$

where g : R^{n_x} × R^{n_ω} → R is a neural network with trainable parameters ω. Appendix B contains details on differentiation under the integral sign, required to backpropagate the loss gradients into ω.
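A minimal sketch of an adaptive-depth Neural ODE follows. A hypernetwork g_ω predicts a strictly positive integration bound s_x for each sample, and the per-sample depth domain [0, s_x] is rescaled to [0, 1] so that a single batched solve suffices; gradients with respect to ω are obtained here by backpropagating through the solver rather than by differentiation under the integral sign. The softplus output, the Euler solver and all sizes are illustrative assumptions.

```python
# Sketch of an adaptive-depth model: x -> z(g_omega(x)), with per-sample depth rescaling.
import torch
import torch.nn as nn

class AdaptiveDepthODE(nn.Module):
    def __init__(self, n_x, n_hidden=32, n_steps=40):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(n_x, n_hidden), nn.Tanh(), nn.Linear(n_hidden, n_x))
        self.g = nn.Sequential(nn.Linear(n_x, n_hidden), nn.Tanh(), nn.Linear(n_hidden, 1),
                               nn.Softplus())           # g_omega(x) > 0
        self.n_steps = n_steps

    def forward(self, x):
        s_x = self.g(x)                                  # per-sample integration depth
        z, ds = x, 1.0 / self.n_steps
        for _ in range(self.n_steps):                    # Euler steps on the rescaled domain [0, 1]
            z = z + ds * s_x * self.f(z)
        return z
```

Under this change of variables, dz/dσ = s_x · f(z) on σ ∈ [0, 1] integrates the autonomous field f exactly over [0, s_x], recovering the mapping x ↦ φ_{g_ω(x)}(x).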
Experiments on non-augmented models. We inspect the performance of different Neural ODE variants: depth-invariant, depth-variant with s concatenated to z and passed to the vector field, Galërkin, and data-controlled. The concentric annuli dataset (Dupont et al., 2019) is utilized, and the models are qualitatively evaluated based on the complexity of the learned flows and on how accurately they extrapolate to unseen points, i.e. the learned decision boundaries. For Galërkin Neural ODEs, we choose a Fourier series with m = 5 harmonics as the eigenfunctions ψ_k, k = 1, ..., 5, used to compute the parameters θ(s), as described in Sec. 3.

Data control allows Neural ODEs to learn a family of vector fields, conditioning on input data information. Depth adaptation sidesteps known expressivity limitations of continuous-depth models.

Figure 6: Solving concentric annuli without augmentation by prepending a nonlinear transformation performed by a two-layer fully-connected network (original space and flows in latent space).

Mind your input networks. An alternative approach to learning maps that prove challenging for vanilla Neural ODEs involves solving the ODE in a latent state space. Fig. 6 shows that, with no augmentation, a network composed of two fully-connected layers with non-linear activation followed by a Neural ODE can solve the concentric annuli problem. However, the flows learned by the Neural ODE are superfluous: indeed, the clusters were already linearly separable after the first non-linear transformation. This example warns against superficial evaluations of Neural ODE architectures preceded or followed by several layers of non-linear input and output transformations. In these scenarios, the learned flows risk performing unnecessary transformations and, in pathological cases, can collapse into a simple identity map. To sidestep these issues, we propose visually inspecting the trajectories or performing an ablation experiment on the Neural ODE block.

Figure 7: Depth flows of the data in the state space for the depth-concatenated and data-controlled variants. The resulting decision boundaries of the output linear layer h_y are indicated by the dotted orange line.

6 Related Work

We include a brief history of classical approaches to dynamical-system-inspired deep learning.

A brief historical note on continuous deep learning. Continuous neural networks have a long history that goes back to continuous-time variants of recurrent networks (Cohen and Grossberg, 1983). Since then, several works have explored the connection between dynamical systems, control theory and machine learning (Zhang et al., 2014; Li et al., 2017; Lu et al., 2017; Weinan, 2017). (Marcus and Westervelt, 1989) provides stability analyses and introduces delays. Many of these concepts have yet to resurface in the context of Neural ODEs. Haber and Ruthotto (2017) analyze ResNet dynamics and link stability with robustness. Injecting stability into neural networks has inspired the design of a series of architectures (Chang et al., 2019; Haber et al., 2019; Bai et al., 2019; Massaroli et al., 2020).
Hauser et al. (2019) explored the algebraic structure of neural networks governed by finite difference equations, further linking discretizations of ODEs and ResNets. Approximating ODEs with neural networks has been discussed in (Wang and Lin, 1998; Filici, 2008). (Poli et al., 2020a) explores the interplay between Neural ODEs and their solvers. On the optimization front, several works leverage the dynamical systems formalism in continuous time (Wibisono et al., 2016; Maddison et al., 2018; Massaroli et al., 2019).

Neural ODEs. This work concerns Neural ODEs (Chen et al., 2018) and a system-theoretic discussion of their dynamical behavior. The main focus is on Neural ODEs rather than extensions to other classes of differential equations (Li et al., 2020; Tzen and Raginsky, 2019; Jia and Benson, 2019), though the insights developed here can be broadly applied to continuous-depth models. More recently, Finlay et al. (2020) introduced regularization strategies to alleviate the heavy computational overhead of training Neural ODEs. These regularization terms are propagated during the forward pass of the model and thus require state augmentation. Leveraging our generalized adjoint formulation provides an approach to integral regularization terms without augmentation or memory overheads.

7 Conclusion

In this work, we establish a general system-theoretic framework for Neural ODEs and dissect it into its core components. With the aim of shining light on fundamental questions regarding depth variance, we formulate and solve the infinite-dimensional problem linked to the true deep-limit formulation of Neural ODEs. We provide numerical approximations to this infinite-dimensional problem, leading to novel model variants such as Galërkin and piecewise-constant (Stacked) Neural ODEs. Augmentation is developed beyond existing approaches (Dupont et al., 2019) to include input layer and higher-order augmentation strategies, shown to be more performant and parameter efficient. Finally, the novel paradigms of data control and depth adaptation are introduced to perform challenging tasks such as learning reflections without augmentation. The code to reproduce all the experiments in the paper is built on the TorchDyn (Poli et al., 2020b) and PyTorch Lightning (Falcon et al., 2019) libraries and can be found at https://github.com/DiffEqML/diffeqml-research/tree/master/dissecting-neural-odes.

Broader Impact

As continuous deep learning sees increased utilization across fields such as healthcare (Rubanova et al., 2019; Yıldız et al., 2019), it is of utmost importance that we develop appropriate tools to further our understanding of neural differential equations. The search for robustness in traditional deep learning has only recently seen a surge in ideas and proposed solutions; this work aims at providing the exploratory first steps necessary to extend the discussion to this novel paradigm. The leitmotif of this work is injecting system-theoretic concepts into the framework of continuous models. These ideas are of foundational importance in tangential fields such as control and forecasting of dynamical systems, and are routinely used to develop robust algorithms with theoretical and practical guarantees.

Acknowledgment

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, 2018R1D1A1B07050443.

References
S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 688–699, 2019.

B. Chang, M. Chen, E. Haber, and E. H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. arXiv preprint arXiv:1902.09689, 2019.

T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, (5):815–826, 1983.

E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, pages 3134–3144, 2019.

W. Falcon et al. PyTorch Lightning. GitHub, https://github.com/williamFalcon/pytorch-lightning, 2019.

C. Filici. On a neural approximator to ODEs. IEEE Transactions on Neural Networks, 19(3):539–543, 2008.

C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ODE. arXiv preprint arXiv:2002.02798, 2020.

W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pages 15353–15363, 2019.

E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

E. Haber, K. Lensink, E. Triester, and L. Ruthotto. IMEXnet: A forward stable deep neural network. arXiv preprint arXiv:1903.02639, 2019.

M. Hauser, S. Gunn, S. Saab Jr, and A. Ray. State-space representations of deep neural networks. Neural Computation, 31(3):538–554, 2019.

J. Jia and A. R. Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pages 9843–9854, 2019.

H. K. Khalil and J. W. Grizzle. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.

Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.

X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. Duvenaud. Scalable gradients for stochastic differential equations. In Proceedings of Machine Learning Research, volume 108, pages 3870–3882. PMLR, 2020. URL http://proceedings.mlr.press/v108/li20i.html.

Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.

C. J. Maddison, D. Paulin, Y. W. Teh, B. O'Donoghue, and A. Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.

C. Marcus and R. Westervelt. Stability of analog neural networks with delay. Physical Review A, 39(1):347, 1989.
S. Massaroli, M. Poli, F. Califano, A. Faragasso, J. Park, A. Yamashita, and H. Asama. Port-Hamiltonian approach to neural network training. arXiv preprint arXiv:1909.02702, 2019.

S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.

M. Poli, S. Massaroli, J. Park, A. Yamashita, H. Asama, and J. Park. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.

M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park. Hypersolvers: Toward fast continuous-depth models. arXiv preprint arXiv:2007.09601, 2020a.

M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park. TorchDyn: A neural differential equations library. arXiv preprint arXiv:2009.09346, 2020b.

L. S. Pontryagin, E. Mishchenko, V. Boltyanskii, and R. Gamkrelidze. The Mathematical Theory of Optimal Processes. 1962.

P. J. Prince and J. R. Dormand. High order embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 7(1):67–75, 1981.

Y. Rubanova, T. Q. Chen, and D. K. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pages 5321–5331, 2019.

G. Smyrlis and V. Zisis. Local convergence of the steepest descent method in Hilbert spaces. Journal of Mathematical Analysis and Applications, 300(2):436–453, 2004.

B. Tzen and M. Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.

C. Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.

Y.-J. Wang and C.-T. Lin. Runge-Kutta neural network for identification of dynamical systems in high accuracy. IEEE Transactions on Neural Networks, 9(2):294–307, 1998.

E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Ç. Yıldız, M. Heinonen, and H. Lähdesmäki. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. arXiv preprint arXiv:1905.10994, 2019.

H. Zhang, Z. Wang, and D. Liu. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Transactions on Neural Networks and Learning Systems, 25(7):1229–1262, 2014.

H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.

T. Zhang, Z. Yao, A. Gholami, K. Keutzer, J. Gonzalez, G. Biros, and M. Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019b.

H. Zheng, Z. Yang, W. Liu, J. Liang, and Y. Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4. IEEE, 2015.