# Neural Flows: Efficient Alternative to Neural ODEs

Marin Biloš¹, Johanna Sommer¹, Syama Sundar Rangapuram², Tim Januschowski², Stephan Günnemann¹
¹Technical University of Munich, ²AWS AI Labs, Germany

*Work partially done during an internship at Amazon Research. Correspondence to: bilos@in.tum.de.*

Neural ordinary differential equations describe how values change in time. This is the reason why they gained importance in modeling sequential data, especially when the observations are made at irregular intervals. In this paper we propose an alternative by directly modeling the solution curves, i.e., the flow of an ODE, with a neural network. This immediately eliminates the need for expensive numerical solvers while still maintaining the modeling capability of neural ODEs. We propose several flow architectures suitable for different applications by establishing precise conditions on when a function defines a valid flow. Apart from computational efficiency, we also provide empirical evidence of favorable generalization performance via applications in time series modeling, forecasting, and density estimation.

## 1 Introduction

Ordinary differential equations (ODEs) are among the most important tools for modeling complex systems, both in natural and social sciences. They describe the instantaneous change in the system, which is often an easier way to model physical phenomena than specifying the whole system itself. For example, the change of the pendulum angle or the change in population can be naturally expressed in differential form. Similarly, Chen et al. [11] introduce neural ODEs that describe how some quantity of interest, represented as a vector $x$, changes with time: $\dot{x} = f(t, x(t))$, where $f$ is now a neural network. Starting at some initial value $x(t_0)$ we can find the result of this dynamic at any $t_1$:

$$x(t_1) = x(t_0) + \int_{t_0}^{t_1} f(t, x(t)) \, \mathrm{d}t = \mathrm{ODESolve}(x(t_0), f, t_0, t_1). \tag{1}$$

*Figure 1: (Left) An ODE requires a numerical solver which evaluates $f$ at many points along the solution curve. (Right) Our approach returns the solutions directly.*

It is sufficient for $f$ to be continuous in $t$ and Lipschitz continuous in $x$ to have a unique solution, by the Picard–Lindelöf theorem [14]. This mild condition is already satisfied by a large family of neural networks. In most practically relevant scenarios, the integral in Equation 1 has to be solved numerically, requiring a trade-off between computation cost and numerical precision. Much of the follow-up work to [11] focused on retaining expressive dynamics while requiring fewer solver evaluations [22, 37].

In the machine learning context we are given a set of initial conditions (often at $t_0 = 0$) and a loss function for the solution evaluated at time $t_1$. One example is modeling time series where the latent state is evolved in continuous time and is used to predict the observed measurements [16]. Here, unlike in physics for example, the function $f$ is completely unknown and needs to be learned from data. Thus, [11] used neural networks to model it, for their ability to capture complex dynamics. However, note that this comes at the cost of the ODE being non-interpretable. Since solving an ODE is expensive, we want to find a way to keep the desired properties of neural ODEs at a much smaller computation cost.
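To make the contrast in Figure 1 concrete before we formalize it, the sketch below compares the two interfaces in Equation 1: a fixed-step Euler `ODESolve` that evaluates $f$ many times along the curve, and a flow-style model that returns the solution in one forward pass. This is only an illustrative sketch, not the paper's implementation; the network sizes, the `odesolve_euler` helper, and the simple `NeuralFlow` module (which ignores the invertibility conditions formalized in Section 2) are assumptions made for the example.

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))   # f(t, x) with d = 2

def odesolve_euler(x0, f, t0, t1, n_steps=100):
    """Fixed-step Euler approximation of Equation 1: many evaluations of f."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + dt * f(torch.cat([t.expand(x.shape[0], 1), x], dim=-1))
        t = t + dt
    return x

class NeuralFlow(nn.Module):
    """Directly parameterizes the solution curve x(t) = F(t, x0); no solver calls."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, x0):
        # tanh(t) vanishes at t = 0, so F(0, x0) = x0.
        inp = torch.cat([t.expand(x0.shape[0], 1), x0], dim=-1)
        return x0 + torch.tanh(t) * self.g(inp)

x0 = torch.randn(16, 2)
t0, t1 = torch.tensor(0.0), torch.tensor(1.0)
x_solver = odesolve_euler(x0, f, t0, t1)   # ~100 evaluations of f
x_direct = NeuralFlow()(t1, x0)            # a single forward pass
```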
If we take a step back, we see that neural ODEs take initial values as inputs and return non-intersecting solution curves (Figure 1). In this paper we propose to model the solution curves directly, with a neural network, instead of specifying the derivative. That is, given an initial condition we return the solution with a single forward pass through our network. Straight away, this leads to improvements in computation performance because we avoid using ODE solvers altogether. We show how our method can be used as a faster alternative to ODEs in existing models [9, 16, 34, 69], while improving the modeling performance. In the following, we derive the conditions that our method needs to satisfy and propose different architectures that implement them.

## 2 Neural flows

In this section, we present our method, neural flows, that directly models the solution curve of an ODE with a neural network. For simplicity, let us briefly assume that the initial condition $x_0 = x(t_0)$ is specified at $t_0 = 0$. We handle the general case shortly. Then, Equation 1 can be written as $x(t) = F(t, x_0)$, where $F$ is the solution to the initial value problem $\dot{x} = f(t, x(t))$, $x_0 = x(0)$. We will model $F$ with a neural network. For this, we first list the conditions that $F$ must satisfy so that it is a solution to some ODE. Let $F : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ be a smooth function satisfying:

i) $F(0, x_0) = x_0$, (initial condition)
ii) $F(t, \cdot)$ is invertible, $\forall t$. (uniqueness of the solution given the initial value $x_0$; i.e., the curves specified by $F$ corresponding to different initial values should not intersect for any $t$)

There is an exact correspondence between a function $F$ with the above properties and an ODE defined with $f$ such that the derivative $\frac{\mathrm{d}}{\mathrm{d}t} F(t, x_0)$ matches $f(t, x(t))$ everywhere, given $x_0 = x(0)$ [47, Theorem 9.12]. In general, we can say that $f$ defines a vector field and $F$ defines a family of integral curves, also known as the flow in mathematics (not to be confused with normalizing flow). As $F$ will be parameterized with a neural network, condition i) requires that its parameters must depend on $t$ such that we have the identity map at $t = 0$.

Note that by providing $x_0$ we define a smooth trajectory $F(\cdot, x_0)$, the solution to some ODE with the initial condition at $t_0 = 0$. If we relax the restriction $t_0 = 0$ and allow $x_0$ to be specified at an arbitrary $t_0 \in \mathbb{R}$, the solution can be obtained with a simple procedure. We first go back to the case $t = 0$, where we obtain the corresponding initial value $\hat{x}_0 := x(0) = F^{-1}(t_0, x_0)$. This then gives us the required solution $F(\cdot, \hat{x}_0)$ to the original initial value problem. Thus, we often prefer functions with an analytical inverse.

Finally, we tackle implementing $F$. The second property instructs us that the function $F(t, \cdot)$ is a diffeomorphism on $\mathbb{R}^d$. We can satisfy this by drawing inspiration from existing works on normalizing flows and invertible neural networks [e.g., 17, 2]. In our case, the parameters must be conditioned on time, with identity at $t = 0$. As a starting example, consider a linear ODE $f(t, x(t)) = Ax(t)$, with $x(0) = x_0$. Its solution can be expressed as $F(t, x_0) = \exp(At)x_0$, where $\exp$ is the matrix exponential. Here, the learnable parameters $A$ are simply multiplied by $t$ to ensure condition i); and given fixed $t$, the network behaves as an invertible linear transformation. In the following we propose other, more expressive functions suitable for applications such as time series modeling.
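As a sanity check of this linear example, the following sketch (an illustration with $d = 2$, not code from the paper) implements $F(t, x_0) = \exp(At)x_0$ and the procedure described above for an initial condition given at an arbitrary $t_0$.

```python
import torch

d = 2
A = torch.randn(d, d) * 0.1                       # learnable parameters in a real model
x0 = torch.randn(d)

def F(t, x):
    # Condition i): exp(A*0) = I, so F(0, x) = x. Condition ii): for fixed t,
    # exp(At) is an invertible matrix with analytical inverse exp(-At).
    return torch.matrix_exp(A * t) @ x

def F_inv(t, x):
    return torch.matrix_exp(-A * t) @ x

assert torch.allclose(F(torch.tensor(0.0), x0), x0)   # identity map at t = 0

# Initial condition specified at an arbitrary t0: map back to t = 0 first,
# then evaluate the flow at the query time t.
t0, t = torch.tensor(0.7), torch.tensor(1.5)
x_t0 = torch.randn(d)
x0_hat = F_inv(t0, x_t0)                          # corresponds to F^{-1}(t0, x(t0))
x_t = F(t, x0_hat)                                # solution of the original IVP at time t
```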
### 2.1 Proposed flow architectures

**ResNet flow.** A single residual layer $x_{t+1} = x_t + g(x_t)$ [30] bears a resemblance to Equation 1 and can be seen as a discretized version of a continuous transformation, which inspired the development of neural ODEs. Although plain ResNets are not invertible, one could use spectral normalization [26] to enforce a small Lipschitz constant of the network, which guarantees invertibility [2, Theorem 1]. Thus, ResNets become a natural choice for modeling the solution curve $F$, resulting in the following extension, the ResNet flow:

$$F(t, x) = x + \varphi(t) \odot g(t, x), \tag{2}$$

where $\varphi : \mathbb{R} \to \mathbb{R}^d$. This satisfies properties i) and ii) from above when $\varphi(0) = 0$ and $|\varphi(t)_i| < 1$, and $g : \mathbb{R}^{d+1} \to \mathbb{R}^d$ is an arbitrary contractive neural network ($\mathrm{Lip}(g) < 1$). One simple choice for $\varphi$ is a tanh function. The inverse of $F$ can be found via fixed point iteration, similar to [2].

**GRU flow.** Time series data is traditionally modeled with recurrent neural networks, e.g., with a GRU [12], such that the hidden state $h_{t-1}$ is updated at fixed intervals with the new observation $x_t$:

$$h_t = \mathrm{GRUCell}(h_{t-1}, x_t) = z_t \odot h_{t-1} + (1 - z_t) \odot c_t, \tag{3}$$

where $z_t$ and $c_t$ are functions of the previous state $h_{t-1}$ and the new input $x_t$. De Brouwer et al. [16] derived the continuous equivalent of this architecture, called GRU-ODE (see Appendix A.1). Given the initial condition $h_0 = h(t_0)$, they evolve the hidden state $h(t)$ with an ODE until they observe a new $x_{t_1}$ at time $t_1$, when they use Equation 3 to update it:

$$\tilde{h}_{t_1} = \mathrm{ODESolve}(h_0, \text{GRU-ODE}, t_0, t_1), \qquad h_{t_1} = \mathrm{GRUCell}(\tilde{h}_{t_1}, x_{t_1}). \tag{4}$$

Here, we will derive the flow version of GRU-ODE. If we rewrite Equation 3 by regrouping terms, $h_t = h_{t-1} + (1 - z_t) \odot (c_t - h_{t-1})$, we see that the GRU update acts as a single ResNet layer.

**Definition 1.** Let $f_z, f_r, f_c : \mathbb{R}^{d+1} \to \mathbb{R}^d$ be arbitrary neural networks and let
$$z(t, h) = \alpha \cdot \sigma(f_z(t, h)), \quad r(t, h) = \beta \cdot \sigma(f_r(t, h)), \quad c(t, h) = \tanh(f_c(t, r(t, h) \odot h)),$$
where $\alpha, \beta \in \mathbb{R}$ and $\sigma$ is a sigmoid function. Further, let $\varphi : \mathbb{R} \to \mathbb{R}^d$ be a continuous function with $\varphi(0) = 0$ and $|\varphi(t)_i| < 1$. Then the evolution of the GRU state in continuous time is defined as:

$$F(t, h) = h + \varphi(t) \odot (1 - z(t, h)) \odot (c(t, h) - h). \tag{5}$$

**Theorem 1.** A neural network defined by Equation 5 specifies a flow when the functions $f_z$, $f_r$ and $f_c$ are contractive maps, i.e., $\mathrm{Lip}(f_\cdot) < 1$, and $\alpha = \tfrac{2}{5}$, $\beta = \tfrac{4}{5}$.

We prove Theorem 1 in Appendix A.3 by showing that the second summand on the right-hand side of Equation 5 satisfies the Lipschitz constraint, making the whole network invertible. We also show that the GRU flow has the same desired properties as GRU-ODE, namely, bounding the hidden state in $(-1, 1)$ and having a Lipschitz constant of 2. Note that the GRU flow (Equation 5) acts as a replacement for the ODESolve in Equation 4. Alternatively, we can append $x_t$ to the input of $f_z$, $f_r$ and $f_c$, which would give us a continuous-in-time version of the GRU.

**Coupling flow.** The disadvantage of both the ResNet flow and the GRU flow is the missing analytical inverse. To this end, we propose a continuous-in-time version of an invertible transformation based on splitting the input dimensions into two disjoint sets $A$ and $B$, $A \cup B = \{1, 2, \ldots, d\}$ [17]. We copy the values indexed by $B$ and transform the rest conditioned on $x_B$, which gives us the coupling flow:

$$F(t, x)_A = x_A \odot \exp\big(u(t, x_B) \odot \varphi_u(t)\big) + v(t, x_B) \odot \varphi_v(t), \tag{6}$$

where $u, v$ are arbitrary neural networks and $\varphi_u(0) = \varphi_v(0) = 0$. We can easily see that this satisfies condition i), and it is invertible by design regardless of $t$ [17]. Since some values stay constant in a single layer, we apply multiple consecutive transformations, choosing different partitions $A$ and $B$.
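To make the simplest of these architectures concrete, below is a minimal sketch of a single ResNet flow layer (Equation 2) with $\varphi(t) = \tanh(t)$ and a spectrally normalized residual network. The extra scaling factor used to keep $g$ contractive and the fixed-point inverse are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class ResNetFlowLayer(nn.Module):
    """One layer of Equation 2: F(t, x) = x + phi(t) * g(t, x), with phi(t) = tanh(t)."""
    def __init__(self, dim, hidden=64, scale=0.9):
        super().__init__()
        self.scale = scale  # assumed factor to keep Lip(g) < 1 after spectral normalization
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(dim + 1, hidden)), nn.Tanh(),
            spectral_norm(nn.Linear(hidden, dim)),
        )

    def g(self, t, x):
        return self.scale * self.net(torch.cat([t, x], dim=-1))

    def forward(self, t, x):
        # phi(0) = 0 gives the identity map at t = 0 (condition i).
        return x + torch.tanh(t) * self.g(t, x)

    def inverse(self, t, y, n_iter=50):
        # Fixed-point iteration x <- y - phi(t) * g(t, x), as for invertible ResNets [2].
        x = y.clone()
        for _ in range(n_iter):
            x = y - torch.tanh(t) * self.g(t, x)
        return x

layer = ResNetFlowLayer(dim=2)
t, x = torch.rand(8, 1), torch.randn(8, 2)
x_recovered = layer.inverse(t, layer(t, x))   # close to x when the residual branch is contractive
```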
For all three models we can stack multiple layers $F = F_1 \circ \cdots \circ F_n$ and still define a proper flow, since the composition of invertible functions is invertible, and consecutive identities give an identity. We can think of $\varphi$ (including $\varphi_u$, $\varphi_v$) as a time embedding function that has to be zero at $t = 0$. Since it is a function of a single variable, we would like to keep the complexity low and avoid using general neural networks in favor of interpretable and expressive basis functions. A simple example is linear dependence on time, $\varphi(t) = \alpha t$, or $\tanh(\alpha t)$ for the ResNet flow. We use these in the experiments. An alternative, more powerful embedding consists of Fourier features, $\varphi(t)_i = \sum_k \alpha_{ik} \sin(\beta_{ik} t)$.

### 2.2 On approximation capabilities

Previous works established that neural ODEs are sup-universal for diffeomorphic functions [76] and are $L^p$-universal for continuous maps when composed with a terminal family [48]. A similar result also holds for affine coupling flows [75], whereas general residual networks can approximate any function [53]. The ResNet flow, as defined in Equation 2, can be viewed as an Euler discretization, meaning it is enough to stack appropriately many layers to uniformly approximate any ODE solution [48]. The GRU flow can be viewed as a ResNet flow, and the coupling flow shares a similar structure, meaning that if we can set them to act as an Euler discretization we can match any ODE. However, this is of limited use in practice since we use finitely many layers, so the main focus of this paper is to provide empirical evidence that we can outperform neural ODEs on relevant real-world tasks.

Other results [20, 81] consider limitations of neural ODEs in modeling general homeomorphisms (e.g., $x \mapsto -x$) and propose a solution that adds dimensions to the input $x$. Such augmented networks can model higher-order dynamics. This can be explicitly defined through certain constraints for further improvements in performance and better interpretability [59]. We can apply the same trick to our models. However, instead of augmenting $x$, a simpler solution is to relax the conditions on $F$ given the task. For example, if we do not need invertibility, we can remove the Lipschitz constraint in Equation 2. Since neural flows offer such flexibility, they might be of more practical relevance in these use cases.

## 3 Applications

In this section we review two main applications of neural ODEs: modeling irregularly-sampled time series and density estimation. We describe the existing modeling approaches and propose extensions using neural flows. In Section 4 we will use the models presented here to qualitatively and quantitatively compare neural flows with neural ODEs.

### 3.1 Continuous-time latent variable models

Autoregressive [62, 70] and state space models [32, 68] have achieved considerable success modeling regularly-sampled time series. However, many real-world applications do not have a constant sampling rate and may contain missing values; e.g., in healthcare we have very sparse measurements at irregular time intervals. Here we describe how our neural flow models can be used in such a scenario.

**Encoder.** In this setting, we are given a sequence of observations $X = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}^d$, at times $t = (t_1, \ldots, t_n)$. To represent this type of data, previous RNN-based works relied on an exponentially decaying hidden state [8], time gating [58], or simply adding time as an additional input [19].
More recently, various ODE-based models have built on top of RNNs to evolve the hidden state between observations in continuous time, giving rise to, e.g., ODE-RNN [69], and outperforming previous approaches. Another model is GRU-ODE [16], which we already described in Equation 4. We proposed the GRU flow (Equation 5) that can be used as a straightforward replacement. Lechner and Hasani [46] showed that simply evolving the hidden state with a neural ODE can cause vanishing or exploding gradients, a known issue in RNNs [3]. Thus, they propose using an LSTM-based [31] model instead. The difference to ODE-RNN [69] is using an LSTMCell and introducing another hidden state that is not updated continuously in time, which in turn allows gradient propagation via internal LSTM gating. To adapt this to our framework, we simply replace the ODESolve with the ResNet or coupling flow to obtain a neural flow model.

**Decoder.** Once we have a hidden state representation $h_i$ of the irregularly-sampled sequence up to $x_i$, we are interested in making future predictions. The ODE-based models continue evolving the hidden state using a numerical solver to get the representation at time $t_{i+1}$, with $h_{i+1} = \mathrm{ODESolve}(h_i, f, t_i, t_{i+1})$. With neural flows we can simply pass the next time point $t_{i+1}$ into $F$ and get the next hidden state directly. In the following we show how the presented encoder-decoder model is used in both the smoothing and filtering approaches for irregular time series modeling.

**Smoothing approach.** The given sequence of observations $(X, t)$ is modeled with latent variables or states $(z_1, \ldots, z_n)$, $z_i \in \mathbb{R}^h$, such that $x_i \sim p(x_i|z_i)$, conditionally independent of other $x_j$ [11, 69]. There is a predesignated prior state $z_0$ at $t = 0$ from which the latent state is assumed to evolve continuously. More precisely, if $z_0$ is a sample from the initial latent state, then a latent state sample at any future time step $t$ is given by $z_t = F(t, z_0)$. Since the exact inference of the initial state $z_0$, $p(z_0|X, t)$, is intractable, we proceed by doing approximate inference following the variational auto-encoder approach [11, 69]. We use an LSTM-based neural flow encoder that processes $(X, t)$ and outputs the approximate posterior parameters $\mu$ and $\sigma$ from the last state, $q(z_0|X, t) = \mathcal{N}(\mu, \sigma)$. The decoder returns all $z_i$ deterministically at times $t$ with $F(t, z_0)$, with initial condition $z_0 \sim q(z_0|X, t)$. For the latent state at an arbitrary $t_i$, the target is generated according to the model $x_i \sim p(x_i|z_i)$. Given $p(z_0) = \mathcal{N}(0, 1)$, the overall model is trained by maximizing the evidence lower bound:

$$\mathrm{ELBO} = \mathbb{E}_{z_0 \sim q(z_0|X, t)}[\log p(X)] - \mathrm{KL}[q(z_0|X, t) \,\|\, p(z_0)]. \tag{7}$$

Using continuous-time models brings multiple advantages, from handling irregular time points automatically to making predictions at any, and as many, time points as required, allowing us to do reconstruction, missing value imputation, and forecasting. This holds whether we use neural flows or ODEs, but our approach is more computationally efficient, which matters as we scale to bigger data.
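To make the smoothing objective concrete, the following schematic sketch implements Equation 7 with a Gaussian observation model. The module names (`encoder`, `flow`, `decoder`), the GRU stand-in encoder, and the unconstrained residual-style flow are assumptions for illustration only and do not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

latent_dim, obs_dim = 8, 2
encoder = nn.GRU(obs_dim + 1, 2 * latent_dim, batch_first=True)       # stand-in sequence encoder
flow = nn.Sequential(nn.Linear(latent_dim + 1, 64), nn.Tanh(), nn.Linear(64, latent_dim))
decoder = nn.Linear(latent_dim, 2 * obs_dim)                           # emits mean and log-std of p(x_i | z_i)

def elbo(X, t):
    # X: (batch, n, obs_dim) observations, t: (batch, n, 1) observation times
    _, h_n = encoder(torch.cat([X, t], dim=-1))
    mu, log_sigma = h_n[-1].chunk(2, dim=-1)
    q_z0 = Normal(mu, log_sigma.exp())                                 # q(z0 | X, t)
    z0 = q_z0.rsample()
    # Flow decoder: latent state at every observed time from a single z0, no ODE solver.
    z0_rep = z0.unsqueeze(1).expand(-1, t.shape[1], -1)
    z_t = z0_rep + torch.tanh(t) * flow(torch.cat([t, z0_rep], dim=-1))
    x_mu, x_log_sigma = decoder(z_t).chunk(2, dim=-1)
    log_px = Normal(x_mu, x_log_sigma.exp()).log_prob(X).sum((-1, -2)) # reconstruction term
    kl = kl_divergence(q_z0, Normal(torch.zeros_like(mu), torch.ones_like(mu))).sum(-1)
    return (log_px - kl).mean()

X = torch.randn(4, 10, obs_dim)
t = torch.rand(4, 10, 1).cumsum(dim=1)                                 # increasing observation times
loss = -elbo(X, t)
```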
**Filtering approach.** In contrast to the previous approach, we can alternatively do the inference in an online fashion at each of the observed time points, i.e., estimating the posterior $p(z_i|x_{1:i}, t_{1:i})$ after seeing observations up to the current time step $i$. This is known as filtering. Here, the prediction for future time steps is done by evolving the posterior corresponding to the final observed time point, $p(z_n|X, t)$, instead of the initial time point $p(z_0|X, t)$, as was done in the smoothing approach.

In this paper, we follow the general approach suggested by De Brouwer et al. [16] for capturing non-linear dynamics. We use the GRU flow (instead of GRU-ODE) for evolving the hidden state $h_i \in \mathbb{R}^h$ and we output the mean and variance of the approximate posterior $q(z_i|x_{1:i}, t_{1:i})$. The log-likelihood cannot be computed exactly under this model, so [16] suggest using a custom objective that is the analogue of Bayesian filtering (see Appendix A.2 for details). Unlike [16], which needs to solve the ODE for every observation, our method only needs a single pass through the network per observation.

### 3.2 Temporal point processes

Sometimes temporal data is measured irregularly and the times at which we observe the events come from some underlying process modeled with temporal point processes (TPPs). For example, we can use TPPs to model the times of messages between users. One example type of behavior we want to capture is excitation [29], e.g., observing one message increases the chance of seeing others soon after. A realization of a TPP on an interval $[0, T]$ is an increasing sequence of arrival times $t = (t_1, \ldots, t_n)$, $t_i \in [0, T]$, where $n$ is a random variable. The model is defined with an intensity function $\lambda(t)$ that tells us how many events we expect to see in some bounded area [15]. The intensity has to be positive. We define the history $\mathcal{H}_{t_i}$ as the events that precede $t_i$, and further define the conditional intensity function $\lambda^*(t)$ which depends on the history. For convenience, we can also work with inter-event times $\tau_i = t_i - t_{i-1}$, without losing generality. We train the model by maximizing the log-likelihood:

$$\sum_i \log \lambda^*(t_i) - \int_0^T \lambda^*(s) \, \mathrm{d}s. \tag{8}$$

Previous works [72] used autoregressive models (e.g., RNNs) to represent the history with a fixed-size vector $h_i$ [19]. The intensity function can correspond to a simple distribution [19] or a mixture of distributions [71]. Then the integral in Equation 8 can be computed exactly. Another possibility is modeling $\lambda(t)$ with an arbitrary neural network, which requires Monte Carlo integration [6, 56]. On the other hand, Jia and Benson [34] propose a jump ODE model that evolves the hidden state $h(t)$ with an ODE and updates the state with new observations, similar to LSTM-ODE. In this case, obtaining the hidden state and solving the integral in Equation 8 can be done in a single solver call.

**Marked point processes.** Often, we are also interested in what type of event happened at time point $t_i$. Thus, we can assign the observed type $x_i$, also called a mark, and model the arrival times and marks jointly: $p(t, X) = p(t)p(X|t)$. Written like this, we can keep the model for the arrival times as in Equation 8, and add a module that inputs the history $h_i$ and the next time point $t_{i+1}$ and outputs the probabilities for each mark type. The special case of $x_i \in \mathbb{R}^d$ is covered in the next section.

### 3.3 Time-dependent density estimation

Normalizing flows (NFs) define densities with invertible transformations of random variables. That is, given a random variable $z \sim q(z)$, $z \in \mathbb{R}^d$, and an invertible function $F : \mathbb{R}^d \to \mathbb{R}^d$, we can compute the probability density function of $x = F(z)$ with the change of variables formula [65]: $p(x) = q(z)\,|\det J_F(z)|^{-1}$, where $J_F$ is the Jacobian of $F$. As we can see, it is important to define a function $F$ that is easily invertible and has a tractable determinant of the Jacobian. One example is the coupling NF [17], which we used to construct the coupling flow in Equation 6. Other tractable models include autoregressive [41, 64] and matrix factorization based NFs [4, 40].
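As a small illustration of the change of variables formula with a tractable Jacobian, the sketch below computes $\log p(x)$ for a single affine coupling transform in the spirit of [17]. The dimensionality, the network sizes, and the fixed half/half partition of $A$ and $B$ are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

d = 4
dA, dB = d // 2, d - d // 2
u = nn.Sequential(nn.Linear(dB, 32), nn.Tanh(), nn.Linear(32, dA))    # log-scale network
v = nn.Sequential(nn.Linear(dB, 32), nn.Tanh(), nn.Linear(32, dA))    # shift network

def forward(z):
    zA, zB = z[..., :dA], z[..., dA:]
    xA = zA * torch.exp(u(zB)) + v(zB)        # transform A conditioned on B
    return torch.cat([xA, zB], dim=-1)        # B is copied unchanged

def log_prob(x):
    xA, xB = x[..., :dA], x[..., dA:]
    zA = (xA - v(xB)) * torch.exp(-u(xB))     # analytic inverse
    z = torch.cat([zA, xB], dim=-1)
    base_logp = Normal(0.0, 1.0).log_prob(z).sum(-1)   # log q(z) under a standard normal base
    log_det = u(xB).sum(-1)                   # log|det J_F| is the sum of log-scales
    return base_logp - log_det                # p(x) = q(z) |det J_F(z)|^{-1}

x = forward(torch.randn(16, d))
print(log_prob(x).shape)                      # torch.Size([16])
```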
In contrast to this, Chen et al. [11] define the transformation with an ODE: $f(t, z(t)) = \partial_t z(t)$. This allows them to define the instantaneous change in log-density as well as the continuous equivalent to the change of variables formula, giving rise to the continuous normalizing flow (CNF):

$$\partial_t \log p(z(t)) = -\mathrm{tr}\left(\frac{\partial f}{\partial z(t)}\right), \qquad \log p(x) = \log q(z(t_0)) - \int_{t_0}^{t_1} \mathrm{tr}\left(\frac{\partial f}{\partial z(t)}\right) \mathrm{d}t, \tag{9}$$

where $t_0 = 0$ and $t_1 = 1$ are usually fixed. The neural network $f$ can be arbitrary as long as it gives unique ODE solutions. This offers an advantage when we need a special structure of $f$ that cannot be easily implemented with the discrete NFs, e.g., in physics we often require equivariant transformations [5, 43]. Besides the cost of running the solver, calculating the trace at each step in Equation 9 becomes intractable as the dimension of the data grows, so one resorts to stochastic estimation [27]. A similar approximation method is used for estimating the determinant in an invertible ResNet model [2]. We discuss the computation complexity in Appendix A.8. Again, if we consider a linear ODE, we can easily show that calculating the trace and calculating the determinant of the corresponding flow is equivalent (see Appendix A.7). However, we are not interested in a comparison between different normalizing flows for stationary densities [see e.g., 42], since the flow endpoints $t_0$ and $t_1$ are always fixed; thus, our models would be reduced to the discrete NFs. Recently, Chen et al. [9] demonstrated how CNFs can evolve densities in continuous time, with varying $t_0$ and $t_1$, which proves useful for spatio-temporal data. We will show how to do the same with our coupling flow, something that has not been explored before.

**Spatio-temporal processes.** We reuse the notation from Section 3.2 to denote the arrival times with $t$ and marks with $X$, $x_i \in \mathbb{R}^d$, which are now continuous variables. Values $x_i$ often correspond to locations of events, e.g., earthquakes [60] or disease outbreaks [57]. We use the temporal point processes from Section 3.2 to model $p(t)$, and are only left with the conditional density $p(X|t)$. Chen et al. [9] propose several models for this, the first one being the time-varying CNF, where $p(x_i|t_i)$ is estimated by integrating Equation 9 from $t_0 = 0$ to the observed $t_i$. Using our affine coupling flow as defined in Equation 6 we can write:

$$p(x_i|t_i) = q(F^{-1}(t_i, x_i))\,|\det J_{F^{-1}}(x_i)|, \tag{10}$$

where $q$ is the base density (defined with any NF) and the determinant is the product of the diagonal values of the Jacobian w.r.t. $x_i$, which are simply the $\exp$ terms from Equation 6 [17]. The density $p$ evolves with time, the same way as in the CNF model, but without using the solver or trace estimation. To generate new realizations at $t_i$, we first sample from $q$ to get $x_0 \sim q(x_0)$, then evaluate $F(t_i, x_0)$. An alternative model, attentive CNF [9], is more expressive compared to the time-varying CNF and more efficient than jump ODE models [9, 34]. The probability density of $x_i$ depends on all the previous values $x_j$