# Neural Flows: Efficient Alternative to Neural ODEs

Marin Biloš¹, Johanna Sommer¹, Syama Sundar Rangapuram², Tim Januschowski², Stephan Günnemann¹
¹Technical University of Munich, ²AWS AI Labs, Germany

*Work partially done during an internship at Amazon Research. Correspondence to: bilos@in.tum.de.*

Neural ordinary differential equations describe how values change in time. This is the reason why they gained importance in modeling sequential data, especially when the observations are made at irregular intervals. In this paper we propose an alternative by directly modeling the solution curves, i.e., the flow of an ODE, with a neural network. This immediately eliminates the need for expensive numerical solvers while still maintaining the modeling capability of neural ODEs. We propose several flow architectures suitable for different applications by establishing precise conditions on when a function defines a valid flow. Apart from computational efficiency, we also provide empirical evidence of favorable generalization performance via applications in time series modeling, forecasting, and density estimation.

## 1 Introduction

Ordinary differential equations (ODEs) are among the most important tools for modeling complex systems, both in natural and social sciences. They describe the instantaneous change in the system, which is often an easier way to model physical phenomena than specifying the whole system itself. For example, the change of the pendulum angle or the change in population can be naturally expressed in differential form. Similarly, Chen et al. [11] introduce neural ODEs that describe how some quantity of interest, represented as a vector $x$, changes with time: $\dot{x} = f(t, x(t))$, where $f$ is now a neural network. Starting at some initial value $x(t_0)$ we can find the result of this dynamic at any $t_1$:

$$x(t_1) = x(t_0) + \int_{t_0}^{t_1} f(t, x(t)) \, \mathrm{d}t = \mathrm{ODESolve}(x(t_0), f, t_0, t_1). \tag{1}$$

*Figure 1: (Left) An ODE requires a numerical solver which evaluates $f$ at many points along the solution curve. (Right) Our approach returns the solutions directly.*

It is sufficient for $f$ to be continuous in $t$ and Lipschitz continuous in $x$ to have a unique solution, by the Picard–Lindelöf theorem [14]. This mild condition is already satisfied by a large family of neural networks. In most practically relevant scenarios, the integral in Equation 1 has to be solved numerically, requiring a trade-off between computation cost and numerical precision. Much of the follow-up work to [11] focused on retaining expressive dynamics while requiring fewer solver evaluations [22, 37].

In the machine learning context we are given a set of initial conditions (often at $t_0 = 0$) and a loss function for the solution evaluated at time $t_1$. One example is modeling time series where the latent state is evolved in continuous time and is used to predict the observed measurements [16]. Here, unlike in physics for example, the function $f$ is completely unknown and needs to be learned from data. Thus, [11] used neural networks to model it, for their ability to capture complex dynamics. However, note that this comes at the cost of the ODE being non-interpretable. Since solving an ODE is expensive, we want to find a way to keep the desired properties of neural ODEs at a much smaller computation cost.
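To make the contrast in Figure 1 concrete before we formalize it, the sketch below compares the two interfaces in Equation 1: a fixed-step Euler `ODESolve` that evaluates $f$ many times along the curve, and a flow-style model that returns the solution in one forward pass. This is only an illustrative sketch, not the paper's implementation; the network sizes, the `odesolve_euler` helper, and the simple `NeuralFlow` module (which ignores the invertibility conditions formalized in Section 2) are assumptions made for the example.

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))   # f(t, x) with d = 2

def odesolve_euler(x0, f, t0, t1, n_steps=100):
    """Fixed-step Euler approximation of Equation 1: many evaluations of f."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + dt * f(torch.cat([t.expand(x.shape[0], 1), x], dim=-1))
        t = t + dt
    return x

class NeuralFlow(nn.Module):
    """Directly parameterizes the solution curve x(t) = F(t, x0); no solver calls."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, x0):
        # tanh(t) vanishes at t = 0, so F(0, x0) = x0.
        inp = torch.cat([t.expand(x0.shape[0], 1), x0], dim=-1)
        return x0 + torch.tanh(t) * self.g(inp)

x0 = torch.randn(16, 2)
t0, t1 = torch.tensor(0.0), torch.tensor(1.0)
x_solver = odesolve_euler(x0, f, t0, t1)   # ~100 evaluations of f
x_direct = NeuralFlow()(t1, x0)            # a single forward pass
```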
If we take a step back, we see that neural ODEs take initial values as inputs and return non-intersecting solution curves (Figure 1). In this paper we propose to model the solution curves directly, with a neural network, instead of specifying the derivative. That is, given an initial condition we return the solution with a single forward pass through our network. Straight away, this leads to improvements in computation performance because we avoid using ODE solvers altogether. We show how our method can be used as a faster alternative to ODEs in existing models [9, 16, 34, 69], while improving the modeling performance. In the following, we derive the conditions that our method needs to satisfy and propose different architectures that implement them.

## 2 Neural flows

In this section, we present our method, neural flows, that directly models the solution curve of an ODE with a neural network. For simplicity, let us briefly assume that the initial condition $x_0 = x(t_0)$ is specified at $t_0 = 0$. We handle the general case shortly. Then, Equation 1 can be written as $x(t) = F(t, x_0)$, where $F$ is the solution to the initial value problem $\dot{x} = f(t, x(t))$, $x_0 = x(0)$. We will model $F$ with a neural network. For this, we first list the conditions that $F$ must satisfy so that it is a solution to some ODE. Let $F : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ be a smooth function satisfying:

i) $F(0, x_0) = x_0$, (initial condition)
ii) $F(t, \cdot)$ is invertible, $\forall t$. (uniqueness of the solution given the initial value $x_0$; i.e., the curves specified by $F$ corresponding to different initial values should not intersect for any $t$)

There is an exact correspondence between a function $F$ with the above properties and an ODE defined with $f$ such that the derivative $\frac{\mathrm{d}}{\mathrm{d}t} F(t, x_0)$ matches $f(t, x(t))$ everywhere, given $x_0 = x(0)$ [47, Theorem 9.12]. In general, we can say that $f$ defines a vector field and $F$ defines a family of integral curves, also known as the flow in mathematics (not to be confused with normalizing flow). As $F$ will be parameterized with a neural network, condition i) requires that its parameters must depend on $t$ such that we have the identity map at $t = 0$.

Note that by providing $x_0$ we define a smooth trajectory $F(\cdot, x_0)$, the solution to some ODE with the initial condition at $t_0 = 0$. If we relax the restriction $t_0 = 0$ and allow $x_0$ to be specified at an arbitrary $t_0 \in \mathbb{R}$, the solution can be obtained with a simple procedure. We first go back to the case $t = 0$, where we obtain the corresponding initial value $\hat{x}_0 := x(0) = F^{-1}(t_0, x_0)$. This then gives us the required solution $F(\cdot, \hat{x}_0)$ to the original initial value problem. Thus, we often prefer functions with an analytical inverse.

Finally, we tackle implementing $F$. The second property instructs us that the function $F(t, \cdot)$ is a diffeomorphism on $\mathbb{R}^d$. We can satisfy this by drawing inspiration from existing works on normalizing flows and invertible neural networks [e.g., 17, 2]. In our case, the parameters must be conditioned on time, with identity at $t = 0$. As a starting example, consider a linear ODE $f(t, x(t)) = Ax(t)$, with $x(0) = x_0$. Its solution can be expressed as $F(t, x_0) = \exp(At)x_0$, where $\exp$ is the matrix exponential. Here, the learnable parameters $A$ are simply multiplied by $t$ to ensure condition i); and given fixed $t$, the network behaves as an invertible linear transformation. In the following we propose other, more expressive functions suitable for applications such as time series modeling.
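As a sanity check of this linear example, the following sketch (an illustration with $d = 2$, not code from the paper) implements $F(t, x_0) = \exp(At)x_0$ and the procedure described above for an initial condition given at an arbitrary $t_0$.

```python
import torch

d = 2
A = torch.randn(d, d) * 0.1                       # learnable parameters in a real model
x0 = torch.randn(d)

def F(t, x):
    # Condition i): exp(A*0) = I, so F(0, x) = x. Condition ii): for fixed t,
    # exp(At) is an invertible matrix with analytical inverse exp(-At).
    return torch.matrix_exp(A * t) @ x

def F_inv(t, x):
    return torch.matrix_exp(-A * t) @ x

assert torch.allclose(F(torch.tensor(0.0), x0), x0)   # identity map at t = 0

# Initial condition specified at an arbitrary t0: map back to t = 0 first,
# then evaluate the flow at the query time t.
t0, t = torch.tensor(0.7), torch.tensor(1.5)
x_t0 = torch.randn(d)
x0_hat = F_inv(t0, x_t0)                          # corresponds to F^{-1}(t0, x(t0))
x_t = F(t, x0_hat)                                # solution of the original IVP at time t
```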
### 2.1 Proposed flow architectures

**ResNet flow.** A single residual layer $x_{t+1} = x_t + g(x_t)$ [30] bears a resemblance to Equation 1 and can be seen as a discretized version of a continuous transformation, which inspired the development of neural ODEs. Although plain ResNets are not invertible, one could use spectral normalization [26] to enforce a small Lipschitz constant of the network, which guarantees invertibility [2, Theorem 1]. Thus, ResNets become a natural choice for modeling the solution curve $F$, resulting in the following extension, the ResNet flow:

$$F(t, x) = x + \varphi(t) \odot g(t, x), \tag{2}$$

where $\varphi : \mathbb{R} \to \mathbb{R}^d$. This satisfies properties i) and ii) from above when $\varphi(0) = 0$ and $|\varphi(t)_i| < 1$, and $g : \mathbb{R}^{d+1} \to \mathbb{R}^d$ is an arbitrary contractive neural network ($\mathrm{Lip}(g) < 1$). One simple choice for $\varphi$ is a tanh function. The inverse of $F$ can be found via fixed point iteration, similar to [2].

**GRU flow.** Time series data is traditionally modeled with recurrent neural networks, e.g., with a GRU [12], such that the hidden state $h_{t-1}$ is updated at fixed intervals with the new observation $x_t$:

$$h_t = \mathrm{GRUCell}(h_{t-1}, x_t) = z_t \odot h_{t-1} + (1 - z_t) \odot c_t, \tag{3}$$

where $z_t$ and $c_t$ are functions of the previous state $h_{t-1}$ and the new input $x_t$. De Brouwer et al. [16] derived the continuous equivalent of this architecture, called GRU-ODE (see Appendix A.1). Given the initial condition $h_0 = h(t_0)$, they evolve the hidden state $h(t)$ with an ODE until they observe a new $x_{t_1}$ at time $t_1$, when they use Equation 3 to update it:

$$\tilde{h}_{t_1} = \mathrm{ODESolve}(h_0, \text{GRU-ODE}, t_0, t_1), \qquad h_{t_1} = \mathrm{GRUCell}(\tilde{h}_{t_1}, x_{t_1}). \tag{4}$$

Here, we will derive the flow version of GRU-ODE. If we rewrite Equation 3 by regrouping terms, $h_t = h_{t-1} + (1 - z_t) \odot (c_t - h_{t-1})$, we see that the GRU update acts as a single ResNet layer.

**Definition 1.** Let $f_z, f_r, f_c : \mathbb{R}^{d+1} \to \mathbb{R}^d$ be arbitrary neural networks and let
$$z(t, h) = \alpha \cdot \sigma(f_z(t, h)), \quad r(t, h) = \beta \cdot \sigma(f_r(t, h)), \quad c(t, h) = \tanh(f_c(t, r(t, h) \odot h)),$$
where $\alpha, \beta \in \mathbb{R}$ and $\sigma$ is a sigmoid function. Further, let $\varphi : \mathbb{R} \to \mathbb{R}^d$ be a continuous function with $\varphi(0) = 0$ and $|\varphi(t)_i| < 1$. Then the evolution of the GRU state in continuous time is defined as:

$$F(t, h) = h + \varphi(t) \odot (1 - z(t, h)) \odot (c(t, h) - h). \tag{5}$$

**Theorem 1.** A neural network defined by Equation 5 specifies a flow when the functions $f_z$, $f_r$ and $f_c$ are contractive maps, i.e., $\mathrm{Lip}(f_\cdot) < 1$, and $\alpha = \tfrac{2}{5}$, $\beta = \tfrac{4}{5}$.

We prove Theorem 1 in Appendix A.3 by showing that the second summand on the right-hand side of Equation 5 satisfies the Lipschitz constraint, making the whole network invertible. We also show that the GRU flow has the same desired properties as GRU-ODE, namely, bounding the hidden state in $(-1, 1)$ and having a Lipschitz constant of 2. Note that the GRU flow (Equation 5) acts as a replacement for the ODESolve in Equation 4. Alternatively, we can append $x_t$ to the input of $f_z$, $f_r$ and $f_c$, which would give us a continuous-in-time version of the GRU.

**Coupling flow.** The disadvantage of both the ResNet flow and the GRU flow is the missing analytical inverse. To this end, we propose a continuous-in-time version of an invertible transformation based on splitting the input dimensions into two disjoint sets $A$ and $B$, $A \cup B = \{1, 2, \ldots, d\}$ [17]. We copy the values indexed by $B$ and transform the rest conditioned on $x_B$, which gives us the coupling flow:

$$F(t, x)_A = x_A \odot \exp\big(u(t, x_B) \odot \varphi_u(t)\big) + v(t, x_B) \odot \varphi_v(t), \tag{6}$$

where $u, v$ are arbitrary neural networks and $\varphi_u(0) = \varphi_v(0) = 0$. We can easily see that this satisfies condition i), and it is invertible by design regardless of $t$ [17]. Since some values stay constant in a single layer, we apply multiple consecutive transformations, choosing different partitions $A$ and $B$.
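To make the simplest of these architectures concrete, below is a minimal sketch of a single ResNet flow layer (Equation 2) with $\varphi(t) = \tanh(t)$ and a spectrally normalized residual network. The extra scaling factor used to keep $g$ contractive and the fixed-point inverse are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class ResNetFlowLayer(nn.Module):
    """One layer of Equation 2: F(t, x) = x + phi(t) * g(t, x), with phi(t) = tanh(t)."""
    def __init__(self, dim, hidden=64, scale=0.9):
        super().__init__()
        self.scale = scale  # assumed factor to keep Lip(g) < 1 after spectral normalization
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(dim + 1, hidden)), nn.Tanh(),
            spectral_norm(nn.Linear(hidden, dim)),
        )

    def g(self, t, x):
        return self.scale * self.net(torch.cat([t, x], dim=-1))

    def forward(self, t, x):
        # phi(0) = 0 gives the identity map at t = 0 (condition i).
        return x + torch.tanh(t) * self.g(t, x)

    def inverse(self, t, y, n_iter=50):
        # Fixed-point iteration x <- y - phi(t) * g(t, x), as for invertible ResNets [2].
        x = y.clone()
        for _ in range(n_iter):
            x = y - torch.tanh(t) * self.g(t, x)
        return x

layer = ResNetFlowLayer(dim=2)
t, x = torch.rand(8, 1), torch.randn(8, 2)
x_recovered = layer.inverse(t, layer(t, x))   # close to x when the residual branch is contractive
```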
For all three models we can stack multiple layers $F = F_1 \circ \cdots \circ F_n$ and still define a proper flow, since the composition of invertible functions is invertible, and consecutive identities give an identity. We can think of $\varphi$ (including $\varphi_u$, $\varphi_v$) as a time embedding function that has to be zero at $t = 0$. Since it is a function of a single variable, we would like to keep the complexity low and avoid using general neural networks in favor of interpretable and expressive basis functions. A simple example is linear dependence on time, $\varphi(t) = \alpha t$, or $\tanh(\alpha t)$ for the ResNet flow. We use these in the experiments. An alternative, more powerful embedding consists of Fourier features, $\varphi(t)_i = \sum_k \alpha_{ik} \sin(\beta_{ik} t)$.

### 2.2 On approximation capabilities

Previous works established that neural ODEs are sup-universal for diffeomorphic functions [76] and are $L^p$-universal for continuous maps when composed with a terminal family [48]. A similar result also holds for affine coupling flows [75], whereas general residual networks can approximate any function [53]. The ResNet flow, as defined in Equation 2, can be viewed as an Euler discretization, meaning it is enough to stack appropriately many layers to uniformly approximate any ODE solution [48]. The GRU flow can be viewed as a ResNet flow, and the coupling flow shares a similar structure, meaning that if we can set them to act as an Euler discretization we can match any ODE. However, this is of limited use in practice since we use finitely many layers, so the main focus of this paper is to provide empirical evidence that we can outperform neural ODEs on relevant real-world tasks.

Other results [20, 81] consider limitations of neural ODEs in modeling general homeomorphisms (e.g., $x \mapsto -x$) and propose a solution that adds dimensions to the input $x$. Such augmented networks can model higher-order dynamics. This can be explicitly defined through certain constraints for further improvements in performance and better interpretability [59]. We can apply the same trick to our models. However, instead of augmenting $x$, a simpler solution is to relax the conditions on $F$ given the task. For example, if we do not need invertibility, we can remove the Lipschitz constraint in Equation 2. Since neural flows offer such flexibility, they might be of more practical relevance in these use cases.

## 3 Applications

In this section we review two main applications of neural ODEs: modeling irregularly-sampled time series and density estimation. We describe the existing modeling approaches and propose extensions using neural flows. In Section 4 we will use the models presented here to qualitatively and quantitatively compare neural flows with neural ODEs.

### 3.1 Continuous-time latent variable models

Autoregressive [62, 70] and state space models [32, 68] have achieved considerable success modeling regularly-sampled time series. However, many real-world applications do not have a constant sampling rate and may contain missing values; e.g., in healthcare we have very sparse measurements at irregular time intervals. Here we describe how our neural flow models can be used in such a scenario.

**Encoder.** In this setting, we are given a sequence of observations $X = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}^d$, at times $t = (t_1, \ldots, t_n)$. To represent this type of data, previous RNN-based works relied on an exponentially decaying hidden state [8], time gating [58], or simply adding time as an additional input [19].
More recently, various ODE-based models have built on top of RNNs to evolve the hidden state between observations in continuous time, giving rise to, e.g., ODE-RNN [69], and outperforming previous approaches. Another model is GRU-ODE [16], which we already described in Equation 4. We proposed the GRU flow (Equation 5) that can be used as a straightforward replacement. Lechner and Hasani [46] showed that simply evolving the hidden state with a neural ODE can cause vanishing or exploding gradients, a known issue in RNNs [3]. Thus, they propose using an LSTM-based [31] model instead. The difference to ODE-RNN [69] is using an LSTMCell and introducing another hidden state that is not updated continuously in time, which in turn allows gradient propagation via internal LSTM gating. To adapt this to our framework, we simply replace the ODESolve with the ResNet or coupling flow to obtain a neural flow model.

**Decoder.** Once we have a hidden state representation $h_i$ of the irregularly-sampled sequence up to $x_i$, we are interested in making future predictions. The ODE-based models continue evolving the hidden state using a numerical solver to get the representation at time $t_{i+1}$, with $h_{i+1} = \mathrm{ODESolve}(h_i, f, t_i, t_{i+1})$. With neural flows we can simply pass the next time point $t_{i+1}$ into $F$ and get the next hidden state directly. In the following we show how the presented encoder-decoder model is used in both the smoothing and filtering approaches for irregular time series modeling.

**Smoothing approach.** The given sequence of observations $(X, t)$ is modeled with latent variables or states $(z_1, \ldots, z_n)$, $z_i \in \mathbb{R}^h$, such that $x_i \sim p(x_i|z_i)$, conditionally independent of other $x_j$ [11, 69]. There is a predesignated prior state $z_0$ at $t = 0$ from which the latent state is assumed to evolve continuously. More precisely, if $z_0$ is a sample from the initial latent state, then a latent state sample at any future time step $t$ is given by $z_t = F(t, z_0)$. Since the exact inference of the initial state $z_0$, $p(z_0|X, t)$, is intractable, we proceed by doing approximate inference following the variational auto-encoder approach [11, 69]. We use an LSTM-based neural flow encoder that processes $(X, t)$ and outputs the approximate posterior parameters $\mu$ and $\sigma$ from the last state, $q(z_0|X, t) = \mathcal{N}(\mu, \sigma)$. The decoder returns all $z_i$ deterministically at times $t$ with $F(t, z_0)$, with initial condition $z_0 \sim q(z_0|X, t)$. For the latent state at an arbitrary $t_i$, the target is generated according to the model $x_i \sim p(x_i|z_i)$. Given $p(z_0) = \mathcal{N}(0, 1)$, the overall model is trained by maximizing the evidence lower bound:

$$\mathrm{ELBO} = \mathbb{E}_{z_0 \sim q(z_0|X, t)}[\log p(X)] - \mathrm{KL}[q(z_0|X, t) \,\|\, p(z_0)]. \tag{7}$$

Using continuous-time models brings multiple advantages, from handling irregular time points automatically to making predictions at any, and as many, time points as required, allowing us to do reconstruction, missing value imputation, and forecasting. This holds whether we use neural flows or ODEs, but our approach is more computationally efficient, which matters as we scale to bigger data.
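To make the smoothing objective concrete, the following schematic sketch implements Equation 7 with a Gaussian observation model. The module names (`encoder`, `flow`, `decoder`), the GRU stand-in encoder, and the unconstrained residual-style flow are assumptions for illustration only and do not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

latent_dim, obs_dim = 8, 2
encoder = nn.GRU(obs_dim + 1, 2 * latent_dim, batch_first=True)       # stand-in sequence encoder
flow = nn.Sequential(nn.Linear(latent_dim + 1, 64), nn.Tanh(), nn.Linear(64, latent_dim))
decoder = nn.Linear(latent_dim, 2 * obs_dim)                           # emits mean and log-std of p(x_i | z_i)

def elbo(X, t):
    # X: (batch, n, obs_dim) observations, t: (batch, n, 1) observation times
    _, h_n = encoder(torch.cat([X, t], dim=-1))
    mu, log_sigma = h_n[-1].chunk(2, dim=-1)
    q_z0 = Normal(mu, log_sigma.exp())                                 # q(z0 | X, t)
    z0 = q_z0.rsample()
    # Flow decoder: latent state at every observed time from a single z0, no ODE solver.
    z0_rep = z0.unsqueeze(1).expand(-1, t.shape[1], -1)
    z_t = z0_rep + torch.tanh(t) * flow(torch.cat([t, z0_rep], dim=-1))
    x_mu, x_log_sigma = decoder(z_t).chunk(2, dim=-1)
    log_px = Normal(x_mu, x_log_sigma.exp()).log_prob(X).sum((-1, -2)) # reconstruction term
    kl = kl_divergence(q_z0, Normal(torch.zeros_like(mu), torch.ones_like(mu))).sum(-1)
    return (log_px - kl).mean()

X = torch.randn(4, 10, obs_dim)
t = torch.rand(4, 10, 1).cumsum(dim=1)                                 # increasing observation times
loss = -elbo(X, t)
```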
**Filtering approach.** In contrast to the previous approach, we can alternatively do the inference in an online fashion at each of the observed time points, i.e., estimating the posterior $p(z_i|x_{1:i}, t_{1:i})$ after seeing observations up to the current time step $i$. This is known as filtering. Here, the prediction for future time steps is done by evolving the posterior corresponding to the final observed time point, $p(z_n|X, t)$, instead of the initial time point $p(z_0|X, t)$, as was done in the smoothing approach.

In this paper, we follow the general approach suggested by De Brouwer et al. [16] for capturing non-linear dynamics. We use the GRU flow (instead of GRU-ODE) for evolving the hidden state $h_i \in \mathbb{R}^h$ and we output the mean and variance of the approximate posterior $q(z_i|x_{1:i}, t_{1:i})$. The log-likelihood cannot be computed exactly under this model, so [16] suggest using a custom objective that is the analogue of Bayesian filtering (see Appendix A.2 for details). Unlike [16], which needs to solve the ODE for every observation, our method only needs a single pass through the network per observation.

### 3.2 Temporal point processes

Sometimes temporal data is measured irregularly and the times at which we observe the events come from some underlying process modeled with temporal point processes (TPPs). For example, we can use TPPs to model the times of messages between users. One example type of behavior we want to capture is excitation [29], e.g., observing one message increases the chance of seeing others soon after. A realization of a TPP on an interval $[0, T]$ is an increasing sequence of arrival times $t = (t_1, \ldots, t_n)$, $t_i \in [0, T]$, where $n$ is a random variable. The model is defined with an intensity function $\lambda(t)$ that tells us how many events we expect to see in some bounded area [15]. The intensity has to be positive. We define the history $\mathcal{H}_{t_i}$ as the events that precede $t_i$, and further define the conditional intensity function $\lambda^*(t)$ which depends on the history. For convenience, we can also work with inter-event times $\tau_i = t_i - t_{i-1}$, without losing generality. We train the model by maximizing the log-likelihood:

$$\sum_i \log \lambda^*(t_i) - \int_0^T \lambda^*(s) \, \mathrm{d}s. \tag{8}$$

Previous works [72] used autoregressive models (e.g., RNNs) to represent the history with a fixed-size vector $h_i$ [19]. The intensity function can correspond to a simple distribution [19] or a mixture of distributions [71]. Then the integral in Equation 8 can be computed exactly. Another possibility is modeling $\lambda(t)$ with an arbitrary neural network, which requires Monte Carlo integration [6, 56]. On the other hand, Jia and Benson [34] propose a jump ODE model that evolves the hidden state $h(t)$ with an ODE and updates the state with new observations, similar to LSTM-ODE. In this case, obtaining the hidden state and solving the integral in Equation 8 can be done in a single solver call.

**Marked point processes.** Often, we are also interested in what type of event happened at time point $t_i$. Thus, we can assign the observed type $x_i$, also called a mark, and model the arrival times and marks jointly: $p(t, X) = p(t)p(X|t)$. Written like this, we can keep the model for the arrival times as in Equation 8, and add a module that inputs the history $h_i$ and the next time point $t_{i+1}$ and outputs the probabilities for each mark type. The special case of $x_i \in \mathbb{R}^d$ is covered in the next section.

### 3.3 Time-dependent density estimation

Normalizing flows (NFs) define densities with invertible transformations of random variables. That is, given a random variable $z \sim q(z)$, $z \in \mathbb{R}^d$, and an invertible function $F : \mathbb{R}^d \to \mathbb{R}^d$, we can compute the probability density function of $x = F(z)$ with the change of variables formula [65]: $p(x) = q(z)\,|\det J_F(z)|^{-1}$, where $J_F$ is the Jacobian of $F$. As we can see, it is important to define a function $F$ that is easily invertible and has a tractable determinant of the Jacobian. One example is the coupling NF [17], which we used to construct the coupling flow in Equation 6. Other tractable models include autoregressive [41, 64] and matrix factorization based NFs [4, 40].
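As a small illustration of the change of variables formula with a tractable Jacobian, the sketch below computes $\log p(x)$ for a single affine coupling transform in the spirit of [17]. The dimensionality, the network sizes, and the fixed half/half partition of $A$ and $B$ are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

d = 4
dA, dB = d // 2, d - d // 2
u = nn.Sequential(nn.Linear(dB, 32), nn.Tanh(), nn.Linear(32, dA))    # log-scale network
v = nn.Sequential(nn.Linear(dB, 32), nn.Tanh(), nn.Linear(32, dA))    # shift network

def forward(z):
    zA, zB = z[..., :dA], z[..., dA:]
    xA = zA * torch.exp(u(zB)) + v(zB)        # transform A conditioned on B
    return torch.cat([xA, zB], dim=-1)        # B is copied unchanged

def log_prob(x):
    xA, xB = x[..., :dA], x[..., dA:]
    zA = (xA - v(xB)) * torch.exp(-u(xB))     # analytic inverse
    z = torch.cat([zA, xB], dim=-1)
    base_logp = Normal(0.0, 1.0).log_prob(z).sum(-1)   # log q(z) under a standard normal base
    log_det = u(xB).sum(-1)                   # log|det J_F| is the sum of log-scales
    return base_logp - log_det                # p(x) = q(z) |det J_F(z)|^{-1}

x = forward(torch.randn(16, d))
print(log_prob(x).shape)                      # torch.Size([16])
```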
In contrast to this, Chen et al. [11] define the transformation with an ODE: $f(t, z(t)) = \partial_t z(t)$. This allows them to define the instantaneous change in log-density as well as the continuous equivalent to the change of variables formula, giving rise to the continuous normalizing flow (CNF):

$$\partial_t \log p(z(t)) = -\mathrm{tr}\left(\frac{\partial f}{\partial z(t)}\right), \qquad \log p(x) = \log q(z(t_0)) - \int_{t_0}^{t_1} \mathrm{tr}\left(\frac{\partial f}{\partial z(t)}\right) \mathrm{d}t, \tag{9}$$

where $t_0 = 0$ and $t_1 = 1$ are usually fixed. The neural network $f$ can be arbitrary as long as it gives unique ODE solutions. This offers an advantage when we need a special structure of $f$ that cannot be easily implemented with the discrete NFs, e.g., in physics we often require equivariant transformations [5, 43]. Besides the cost of running the solver, calculating the trace at each step in Equation 9 becomes intractable as the dimension of the data grows, so one resorts to stochastic estimation [27]. A similar approximation method is used for estimating the determinant in an invertible ResNet model [2]. We discuss the computation complexity in Appendix A.8. Again, if we consider a linear ODE, we can easily show that calculating the trace and calculating the determinant of the corresponding flow is equivalent (see Appendix A.7). However, we are not interested in a comparison between different normalizing flows for stationary densities [see e.g., 42], since the flow endpoints $t_0$ and $t_1$ are always fixed; thus, our models would be reduced to the discrete NFs. Recently, Chen et al. [9] demonstrated how CNFs can evolve densities in continuous time, with varying $t_0$ and $t_1$, which proves useful for spatio-temporal data. We will show how to do the same with our coupling flow, something that has not been explored before.

**Spatio-temporal processes.** We reuse the notation from Section 3.2 to denote the arrival times with $t$ and marks with $X$, $x_i \in \mathbb{R}^d$, which are now continuous variables. Values $x_i$ often correspond to locations of events, e.g., earthquakes [60] or disease outbreaks [57]. We use the temporal point processes from Section 3.2 to model $p(t)$, and are only left with the conditional density $p(X|t)$. Chen et al. [9] propose several models for this, the first one being the time-varying CNF, where $p(x_i|t_i)$ is estimated by integrating Equation 9 from $t_0 = 0$ to the observed $t_i$. Using our affine coupling flow as defined in Equation 6 we can write:

$$p(x_i|t_i) = q(F^{-1}(t_i, x_i))\,|\det J_{F^{-1}}(x_i)|, \tag{10}$$

where $q$ is the base density (defined with any NF) and the determinant is the product of the diagonal values of the Jacobian w.r.t. $x_i$, which are simply the $\exp$ terms from Equation 6 [17]. The density $p$ evolves with time, the same way as in the CNF model, but without using the solver or trace estimation. To generate new realizations at $t_i$, we first sample from $q$ to get $x_0 \sim q(x_0)$, then evaluate $F(t_i, x_0)$. An alternative model, attentive CNF [9], is more expressive compared to the time-varying CNF and more efficient than jump ODE models [9, 34]. The probability density of $x_i$ depends on all the previous values $x_j$