# Neural ODE Processes

Published as a conference paper at ICLR 2021

Alexander Norcliffe
Department of Computer Science, University College London, London, United Kingdom
ucabino@ucl.ac.uk

Cristian Bodnar, Ben Day, Jacob Moss & Pietro Liò
Department of Computer Science, University of Cambridge, Cambridge, United Kingdom
{cb2015, bjd39, jm2311, pl219}@cam.ac.uk

ABSTRACT

Neural Ordinary Differential Equations (NODEs) use a neural network to model the instantaneous rate of change in the state of a system. However, despite their apparent suitability for dynamics-governed time-series, NODEs present a few disadvantages. First, they are unable to adapt to incoming data points, a fundamental requirement for real-time applications imposed by the natural direction of time. Second, time-series are often composed of a sparse set of measurements that could be explained by many possible underlying dynamics. NODEs do not capture this uncertainty. In contrast, Neural Processes (NPs) are a new class of stochastic processes providing uncertainty estimation and fast data adaptation, but lack an explicit treatment of the flow of time. To address these problems, we introduce Neural ODE Processes (NDPs), a new class of stochastic processes determined by a distribution over Neural ODEs. By maintaining an adaptive data-dependent distribution over the underlying ODE, we show that our model can successfully capture the dynamics of low-dimensional systems from just a few data points. At the same time, we demonstrate that NDPs scale up to challenging high-dimensional time-series with unknown latent dynamics such as rotating MNIST digits.

1 INTRODUCTION

Many time-series that arise in the natural world, such as the state of a harmonic oscillator, the populations in an ecological network or the spread of a disease, are the product of some underlying dynamics. Sometimes, as in the case of a video of a swinging pendulum, these dynamics are latent and do not manifest directly in the observation space. Neural Ordinary Differential Equations (NODEs) (Chen et al., 2018), which use a neural network to parametrise the derivative of an ODE, have become a natural choice for capturing the dynamics of such time-series (Çağatay Yıldız et al., 2019; Rubanova et al., 2019; Norcliffe et al., 2020; Kidger et al., 2020; Morrill et al., 2020).

However, despite their fundamental connection to dynamics-governed time-series, NODEs present certain limitations that hinder their adoption in these settings. Firstly, NODEs cannot adjust predictions as more data is collected without retraining the model. This ability is particularly important for real-time applications, where it is desirable that models adapt to incoming data points as time passes and more data is collected. Secondly, without a large number of regularly spaced measurements, there is usually a range of plausible underlying dynamics that can explain the data. However, NODEs do not capture this uncertainty in the dynamics. As many real-world time-series are composed of sparse sets of measurements, often irregularly sampled, the model can fail to represent the diversity of suitable solutions.

In contrast, the Neural Process (Garnelo et al., 2018a;b) family offers a class of (neural) stochastic processes designed for uncertainty estimation and fast adaptation to changes in the observed data. However, NPs modelling time-indexed random functions lack an explicit treatment of time.
Designed for the general case of an arbitrary input domain, they treat time as an unordered set and do not explicitly consider the time-delay between different observations. To address these limitations, we introduce Neural ODE Processes (NDPs), a new class of stochastic processes governed by stochastic data-adaptive dynamics. Our probabilistic Neural ODE formulation relies on and extends the framework provided by NPs, and runs parallel to other attempts to incorporate application-specific inductive biases in this class of models, such as Attentive NPs (Kim et al., 2019), ConvCNPs (Gordon et al., 2019), and MPNPs (Day et al., 2020). We demonstrate that NDPs can adaptively capture many potential dynamics of low-dimensional systems when faced with limited amounts of data. Additionally, we show that our approach scales to high-dimensional time-series with latent dynamics such as rotating MNIST digits (Casale et al., 2018). Our code and datasets are available at https://github.com/crisbodnar/ndp.

(Equal contribution. Work done as an AI Resident at the University of Cambridge.)

Figure 1: Schematic diagram of Neural ODE Processes. Left: Observations from a time series, the context set, are encoded and aggregated to form $r$, which parametrises the latent variables $D$ and $L_0$. Middle: A sample is drawn from $L_0$ and $D$, initialising and conditioning the ODE, respectively. Each sample produces a plausible, coherent trajectory. Right: Predictions at a target time, $t^T_i$, are made by decoding the state of the ODE, $l(t^T_i)$, together with $t^T_i$. An example is shown with a line connecting the ODE position plot to the Predictions plot. Middle & right: the bold lines in each plot refer to the same sample, fainter lines to other samples. All: the plots are illustrations only.

2 BACKGROUND AND FORMAL PROBLEM STATEMENT

Problem Statement We consider modelling random functions $\mathcal{F} : \mathcal{T} \to Y$, where $\mathcal{T} = [t_0, \infty)$ represents time and $Y \subseteq \mathbb{R}^d$ is a compact subset of $\mathbb{R}^d$. We assume $\mathcal{F}$ has a distribution $\mathcal{D}$, induced by another distribution over some underlying dynamics that govern the time-series. Given a specific instantiation $F$ of $\mathcal{F}$, let $C = \{(t^C_i, y^C_i)\}_{i \in I_C}$ be a set of samples from $F$ with some indexing set $I_C$. We refer to $C$ as the context points, as denoted by the superscript $C$. For a given context $C$, the task is to predict the values $\{y^T_j\}_{j \in I_T}$ that $F$ takes at a set of target times $\{t^T_j\}_{j \in I_T}$, where $I_T$ is another index set. We call $T = \{(t^T_j, y^T_j)\}$ the target set. Additionally, let $t^C = \{t_i \mid i \in I_C\}$ and similarly define $y^C$, $t^T$ and $y^T$. Conventionally, as in Garnelo et al. (2018b), the target set forms a superset of the context set and we have $C \subseteq T$. Optionally, it might also be natural to consider that the initial time and observation $(t_0, y_0)$ are always included in $C$. During training, we let the model learn from a dataset of (potentially irregular) time-series sampled from $\mathcal{F}$. We are interested in learning the underlying distribution over the dynamics as well as the induced distribution over functions. We note that when the dynamics are not latent and manifest directly in the observation space $Y$, the distribution over ODE trajectories and the distribution over functions coincide.

Neural ODEs NODEs are a class of models that parametrise the velocity $\dot{z}$ of a state $z$ with the help of a neural network, $\dot{z} = f_\theta(z, t)$.
Given the initial time $t_0$ and target time $t^T_i$, NODEs predict the corresponding state $\hat{y}^T_i$ by performing the following integration and decoding operations:

$$z(t_0) = h_1(y_0), \qquad z(t^T_i) = z(t_0) + \int_{t_0}^{t^T_i} f_\theta(z(t), t)\,dt, \qquad \hat{y}^T_i = h_2(z(t^T_i)), \qquad (1)$$

where $h_1$ and $h_2$ can be neural networks. When the dimensionality of $z$ is greater than that of $y$ and $h_1, h_2$ are linear, the resulting model is an Augmented Neural ODE (Dupont et al., 2019) with input layer augmentation (Massaroli et al., 2020). The extra dimensions offer the model additional flexibility as well as the ability to learn higher-order dynamics (Norcliffe et al., 2020).

Neural Processes (NPs) NPs model a random function $\mathcal{F} : X \to Y$, where $X \subseteq \mathbb{R}^{d_1}$ and $Y \subseteq \mathbb{R}^{d_2}$. The NP represents a given instantiation $F$ of $\mathcal{F}$ through the global latent variable $z$, which parametrises the variation in $F$. Thus, we have $F(x_i) = g(x_i, z)$. For a given context set $C = \{(x^C_i, y^C_i)\}$ and target set $x_{1:n}, y_{1:n}$, the generative process is given by:

$$p(y_{1:n}, z \mid x_{1:n}, C) = p(z \mid C) \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid g(x_i, z), \sigma^2\big), \qquad (2)$$

where $p(z)$ is chosen to be a multivariate standard normal distribution and $y_{1:n}$ is a shorthand for the sequence $(y_1, \ldots, y_n)$. The model can be trained using an amortised variational inference procedure that naturally gives rise to a permutation-invariant encoder $q_\theta(z \mid C)$, which stores the information about the context points. Conditioned on this information, the decoder $g(x, z)$ can make predictions at any input location $x$. We note that while the domain $X$ of the random function $\mathcal{F}$ is arbitrary, in this work we are interested only in stochastic functions with domain on the real line (time-series). Therefore, from here our notation will reflect that, using $t$ as the input instead of $x$. The output $y$ remains the same.

3 NEURAL ODE PROCESSES

Figure 2: Graphical model of NDPs. The dark nodes denote observed random variables, while the light nodes denote hidden random variables. $I_C$ and $I_T$ represent the indexing sets for the context and target points, respectively. Full arrows show the generative process. Dotted arrows indicate inference.

Model Overview We introduce Neural ODE Processes (NDPs), a class of dynamics-based models that learn to approximate random functions defined over time. To that end, we consider an NP whose context is used to determine a distribution over ODEs. Concretely, the context infers a distribution over the initial position (and, optionally, the initial velocity) and, at the same time, stochastically controls its derivative function. The positions given by the ODE trajectories at any time $t^T_i$ are then decoded to give the predictions. In what follows, we offer a detailed description of each component of the model. A schematic of the model can be seen in Figure 1.

3.1 GENERATIVE PROCESS

We first describe the generative process behind NDPs. A graphical model perspective of this process is also included in Figure 2.

Encoder and Aggregator Consider a given context set $C = \{(t^C_i, y^C_i)\}_{i \in I_C}$ of observed points. We encode this context into two latent variables $L(t_0) \sim q_L(l(t_0) \mid C)$ and $D \sim q_D(d \mid C)$, representing the initial state and the global control of an ODE, respectively. To parametrise the distribution of the latter variable, the NDP encoder produces a representation $r_i = f_e(t^C_i, y^C_i)$ for each context pair $(t^C_i, y^C_i)$. The function $f_e$ is a neural network, fully connected or convolutional, depending on the nature of $y$.
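As a concrete illustration, a minimal sketch of such a per-point encoder is shown below, following the MLP description in Appendix F (two hidden layers with ReLU); the hidden and representation sizes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """f_e: maps each context pair (t_i, y_i) to a representation r_i.

    A sketch only: two hidden ReLU layers as in Appendix F, with
    assumed hidden/representation sizes.
    """
    def __init__(self, y_dim: int, hidden_dim: int = 64, r_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + y_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, r_dim),
        )

    def forward(self, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # t: (num_context, 1), y: (num_context, y_dim) -> r: (num_context, r_dim)
        return self.net(torch.cat([t, y], dim=-1))
```

The per-point representations $r_i$ produced this way are then reduced to a single vector $r$ by the aggregator described next.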
An aggregator combines all the representations $r_i$ to form a global representation $r$ that parametrises the distribution of the global latent context, $q_D(d \mid C) = \mathcal{N}\big(d \mid \mu_D(r), \mathrm{diag}(\sigma_D(r))\big)$. As the aggregator must preserve order invariance, we choose to take the element-wise mean. The distribution of $L_0$ might be parametrised identically as a function of the whole context by $q_L(l(t_0) \mid C)$, and, in particular, if the initial observation $y_0$ is always known, then $q_L(l(0) \mid C) = q_L(l(0) \mid y_0) = \mathcal{N}\big(l(0) \mid \mu_L(y_0), \mathrm{diag}(\sigma_L(y_0))\big)$.

Latent ODE To obtain a distribution over functions, we are interested in capturing the dynamics that govern the time-series and exploiting the temporal nature of the data. To that end, we allow the latent context to evolve according to a Neural ODE (Chen et al., 2018) with initial position $L(0)$ and controlled by $D$. These two random variables factorise the uncertainty in the underlying dynamics into an uncertainty over the initial conditions (given by $L(t_0)$) and an uncertainty over the ODE derivative, given by $D$. By using the target times $t^T_{1:n} = (t^T_1, \ldots, t^T_n)$, the latent state at a given time is found by evolving a Neural ODE:

$$l(t^T_i) = l(t_0) + \int_{t_0}^{t^T_i} f_\theta(l(t), d, t)\,dt, \qquad (3)$$

where $f_\theta$ is a neural network that models the derivative of $l$. As explained above, we allow $d$ to modulate the derivative of this ODE by acting as a global control signal. Ultimately, for fixed initial conditions, this results in an uncertainty over the ODE trajectories.

Decoder To obtain a prediction at a time $t^T_i$, we decode the random state of the ODE at time $t^T_i$, given by $L(t^T_i)$. Assuming that the outputs are noisy, for a given sample $l(t^T_i)$ from this stochastic state, the decoder $g$ produces a distribution $Y^T_{t_i} \sim p\big(y^T_i \mid g(l(t^T_i), t_i)\big)$ parametrised by the decoder output. Concretely, for regression tasks, we take the target output to be normally distributed with constant (or optionally learned) variance, $Y^T_{t_i} \sim \mathcal{N}\big(y_i \mid g(l(t_i), t_i), \sigma^2\big)$. When $Y^T_{t_i}$ is a random vector formed of independent binary random variables (e.g. a black-and-white image), we use a Bernoulli distribution, $Y^T_{t_i} \sim \prod_{j=1}^{\dim(Y)} \mathrm{Bernoulli}\big(g(l(t_i), t_i)_j\big)$. Putting everything together, for a set of observed context points $C$, the generative process of NDPs is given by the expression below, where we emphasise once again that $l(t_i)$ also implicitly depends on $l(0)$ and $d$:

$$p(y_{1:n}, l(0), d \mid t_{1:n}, C) = p\big(l(0) \mid C\big)\, p(d \mid C) \prod_{i=1}^{n} p\big(y_i \mid g(l(t_i), t_i)\big). \qquad (4)$$

We remark that NDPs generalise NPs defined over time. If the latent NODE learns the trivial velocity $f_\theta(l(t), d, t) = 0$, the random state $L(t) = L(t_0)$ remains constant at all times $t$. In this case, the distribution over functions is directly determined by $L(t_0) \sim p(l(t_0) \mid C)$, which substitutes the random variable $Z$ from a regular NP. For greater flexibility, the control signal $d$ can also be supplied to the decoder, $g(l(t), d, t)$. This shows that, in principle, NDPs are at least as expressive as NPs. Therefore, NDPs could be a sensible choice even in applications where the time-series are not solely determined by some underlying dynamics, but are also influenced by other generative factors.
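To summarise Section 3.1, the sketch below assembles the generative half of an NDP: encode and aggregate the context, sample $D$ and $L(t_0)$, integrate the latent ODE of Equation (3) with the torchdiffeq solver, and decode. All layer sizes, the bounded-variance parametrisation, and the assumptions that the first context pair is $(t_0, y_0)$ and that the target times are sorted starting at $t_0$ are illustrative simplifications, not the authors' exact implementation (see the released code and Appendix F for that).

```python
import torch
import torch.nn as nn
from torch.distributions import Normal
from torchdiffeq import odeint  # solver library acknowledged by the authors


class LatentODEFunc(nn.Module):
    """Derivative f_theta(l, d, t); the control d globally modulates the dynamics."""
    def __init__(self, l_dim: int, d_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(l_dim + d_dim + 1, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, l_dim),
        )
        self.d = None  # set to a sample of D before integrating

    def forward(self, t, l):
        return self.net(torch.cat([l, self.d, t.reshape(1)], dim=-1))


class NeuralODEProcessSketch(nn.Module):
    """Generative pass of an NDP for a single time-series (Equations 3-4)."""
    def __init__(self, y_dim: int, r_dim: int = 32, l_dim: int = 10, d_dim: int = 10):
        super().__init__()
        self.f_e = nn.Sequential(  # per-point context encoder
            nn.Linear(1 + y_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, r_dim))
        self.r_to_d = nn.Linear(r_dim, 2 * d_dim)    # mean and pre-variance of q_D(d | C)
        self.y0_to_l0 = nn.Linear(y_dim, 2 * l_dim)  # mean and pre-variance of q_L(l(t0) | y0)
        self.ode_func = LatentODEFunc(l_dim, d_dim)
        self.decoder = nn.Sequential(                # g(l(t), d, t)
            nn.Linear(l_dim + d_dim + 1, 64), nn.ReLU(), nn.Linear(64, y_dim))

    @staticmethod
    def _normal(stats):
        mu, pre_sigma = stats.chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * torch.sigmoid(pre_sigma)  # bounded std, as in Appendix F
        return Normal(mu, sigma)

    def forward(self, t_ctx, y_ctx, t_target):
        # Encode each context pair and aggregate with an element-wise mean.
        r = self.f_e(torch.cat([t_ctx, y_ctx], dim=-1)).mean(dim=0)
        q_d = self._normal(self.r_to_d(r))            # q_D(d | C)
        q_l0 = self._normal(self.y0_to_l0(y_ctx[0]))  # q_L(l(t0) | y0); assumes (t0, y0) comes first
        d, l0 = q_d.rsample(), q_l0.rsample()
        # Integrate the latent ODE over the target times (assumed sorted, starting at t0).
        self.ode_func.d = d
        l_t = odeint(self.ode_func, l0, t_target.flatten())
        # Decode every latent state together with d and the query time.
        dec_in = torch.cat([l_t, d.expand(l_t.shape[0], -1), t_target.view(-1, 1)], dim=-1)
        return self.decoder(dec_in), q_l0, q_d
```

At test time one would draw several samples of $(l(t_0), d)$ and decode each to obtain a distribution over trajectories, mirroring Figure 1.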
3.2 LEARNING AND INFERENCE

Since the true posterior is intractable because of the highly non-linear generative process, the model is trained using an amortised variational inference procedure. The variational lower-bound on the probability of the target values given the known context, $\log p(y^T \mid t^T, y^C)$, is as follows:

$$\mathbb{E}_{q(l(t_0), d \mid t^T, y^T)}\left[ \sum_{i \in I_T} \log p(y_i \mid l(t_0), d, t_i) + \log \frac{q_L(l(t_0) \mid t^C, y^C)}{q_L(l(t_0) \mid t^T, y^T)} + \log \frac{q_D(d \mid t^C, y^C)}{q_D(d \mid t^T, y^T)} \right], \qquad (5)$$

where $q_L, q_D$ give the variational posteriors (the encoders described in Section 3.1). The full derivation can be found in Appendix B. We use the reparametrisation trick to backpropagate the gradients of this loss. During training, we sample random contexts of different sizes to allow the model to become sensitive to the size of the context and the location of its points. We train using mini-batches composed of multiple contexts. For that, we use an extended ODE that concatenates the independent ODE states of each sample in the batch and integrates over the union of all the times in the batch (Rubanova et al., 2019). Pseudo-code for this training procedure is also given in Appendix C.

3.3 MODEL VARIATIONS

Here we present the different ways to implement the model. The majority of the variation is in the architecture of the decoder. However, it is possible to vary the encoder such that $f_e(t^C_i, y^C_i)$ can be a multi-layer perceptron or additionally contain convolutions.

Neural ODE Process (NDP) In this setup the decoder is an arbitrary function $g(l(t^T_i), d, t^T_i)$ of the latent position at the time of interest, the control signal, and time. This type of model is particularly suitable for high-dimensional time-series where the dynamics are fundamentally latent. The inclusion of $d$ in the decoder offers the model additional flexibility and makes it a good default choice for most tasks.

Second Order Neural ODE Process (ND2P) This variation has the same decoder architecture as NDP, however the latent ODE evolves according to a second-order ODE. The latent state $l$ is split into a position $l_1$ and a velocity $l_2$, with $\dot{l}_1 = l_2$ and $\dot{l}_2 = f_\theta(l_1, l_2, d, t)$. This model is designed for time-series where the dynamics are second-order, which is often the case for physical systems (Çağatay Yıldız et al., 2019; Norcliffe et al., 2020).

NDP Latent-Only (NDP-L) The decoder is a linear transformation of the latent state, $g(l(t^T_i)) = W l(t^T_i) + b$. This model is suitable for the setting when the dynamics are fully observed (i.e. they are not latent) and, therefore, do not require any decoding. This would be suitable for simple functions generated by ODEs, for example, sines and exponentials. This decoder implicitly contains information about time and $d$ because the ODE evolution depends on these variables, as described in Equation 3.

ND2P Latent-Only (ND2P-L) This model combines the assumption of second-order dynamics with the idea that the dynamics are fully observed. The decoder is a linear layer of the latent state as in NDP-L, and the phase-space dynamics are constrained as in ND2P.

3.4 NEURAL ODE PROCESSES AS STOCHASTIC PROCESSES

The Kolmogorov Extension Theorem states that exchangeability and consistency are necessary and sufficient conditions for a collection of joint marginal distributions to define a stochastic process (Øksendal, 2003; Garnelo et al., 2018b). We define these conditions and show that the NDP model satisfies them. The proofs can be found in Appendix A.

Definition 3.1 (Exchangeability). Exchangeability refers to the invariance of the joint distribution $\rho_{t_{1:n}}(y_{1:n})$ under permutations of $y_{1:n}$.
That is, for a permutation $\pi$ of $\{1, 2, \ldots, n\}$, with $\pi(t_{1:n}) = (t_{\pi(1)}, \ldots, t_{\pi(n)})$ and $\pi(y_{1:n}) = (y_{\pi(1)}, \ldots, y_{\pi(n)})$, the joint probability distribution $\rho_{t_{1:n}}(y_{1:n})$ is invariant if $\rho_{t_{1:n}}(y_{1:n}) = \rho_{\pi(t_{1:n})}(\pi(y_{1:n}))$.

Proposition 3.1. NDPs satisfy the exchangeability condition.

Definition 3.2 (Consistency). Consistency says that if a part of a sequence is marginalised out, then the joint probability distribution is the same as if it had originally been taken only from the smaller sequence: $\rho_{t_{1:m}}(y_{1:m}) = \int \rho_{t_{1:n}}(y_{1:n})\,dy_{m+1:n}$.

Proposition 3.2. NDPs satisfy the consistency condition.

It is important to note that the stochasticity comes from sampling the latent $l(0)$ and $d$. There is no stochasticity within the ODE, such as Brownian motion, though stochastic ODEs have previously been explored (Liu et al., 2019; Tzen & Raginsky, 2019; Jia & Benson, 2020; Li et al., 2020). For any given pair $l(0)$ and $d$, both the latent state trajectory and the observation space trajectory are fully determined. We also note that outside the NP family and differently from our approach, NODEs have been used to generate continuous stochastic processes by transforming the density of another latent process (Deng et al., 2020).

3.5 RUNNING TIME COMPLEXITY

For a model with $n$ context points and $m$ target points, an NP has running time complexity $O(n + m)$, since the model only has to encode each context point and decode each target point. However, a Neural ODE Process has added complexity due to the integration process. Firstly, the integration itself has runtime complexity $O(\mathrm{NFE})$, where NFE is the number of function evaluations. In turn, the worst-case NFE depends on the minimum step size $\delta$ the ODE solver has to use and the maximum time we are interested in, which we denote by $t_{\max}$. Secondly, for settings where the target times are not already ordered, an additional $O(m \log m)$ term is added for sorting them. This ordering is required by the ODE solver. Therefore, given that $m \geq n$ and assuming a constant $t_{\max}$ exists, the worst-case complexity of NDPs is $O(m \log m)$. For applications where the times are already sorted (e.g. real-time applications), the complexity falls back to the original $O(n + m)$. In either case, NDPs scale well with the size of the input. We note, however, that the integration steps $t_{\max}/\delta$ could result in a very large constant, hidden by the big-O notation. Nonetheless, modern ODE solvers use adaptive step sizes that adjust to the data that has been supplied, and this should alleviate this problem.

Figure 3 (panels: NP on sine data, NDP on sine data, and test MSE over 30 epochs): We present example posteriors of trained models and the loss during training of the NP and NDP models for the sine data. We find that NDPs are able to produce a greater range of functions when a single context point is provided, and a sharper, better-targeted range as more points in the time series are observed. Quantitatively, NDPs train to a lower loss in fewer epochs, as may be expected for functions that are generated by ODEs. Both models were trained for 30 epochs.

In our experiments, when sorting is used, we notice the NDP models are between 1 and 1.5 orders of magnitude slower to train than NPs in terms of wall-clock time. At the same time, this limitation of the method is traded off by a significantly faster loss decay per epoch and superior final performance. We provide a table of time ratios from our 1D experiments, from Section 4.1, in Appendix D.
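Before moving on to the experiments, the loss of Equation (5) can be written down compactly. The sketch below assumes diagonal-Gaussian posteriors over the concatenated latent $z = (l(t_0), d)$ and a fixed observation noise; both are simplifying assumptions (the paper optionally learns the output variance).

```python
import torch
from torch.distributions import Normal, kl_divergence

def ndp_loss(y_target, y_pred, q_context, q_target, noise_std=0.1):
    """Negative variational lower bound of Equation (5) for one time-series.

    y_pred is decoded from a single sample of the target-conditioned
    posterior q_target, as in Appendix C; q_context and q_target are
    torch Normal distributions over z = (l(t0), d).
    """
    # Monte Carlo estimate of the reconstruction term sum_i log p(y_i | l(t0), d, t_i).
    log_likelihood = Normal(y_pred, noise_std).log_prob(y_target).sum()
    # The remaining terms equal -KL(q(z | T) || q(z | C)) in expectation.
    kl = kl_divergence(q_target, q_context).sum()
    return -(log_likelihood - kl)
```

During training, as described in Section 3.2, predictions are decoded from a sample of the target-conditioned posterior and several time-series are batched by concatenating their latent states and integrating once over the union of their target times; at test time, samples are drawn from the context-conditioned posterior instead.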
4 EXPERIMENTS

To test the proposed advantages of NDPs, we carried out various experiments on time-series data. For the low-dimensional experiments in Sections 4.1 and 4.2, we use an MLP architecture for the encoder and decoder. For the high-dimensional experiments in Section 4.3, we use a convolutional architecture for both. We train the models using RMSprop (Tieleman & Hinton, 2012) with learning rate $1 \times 10^{-3}$. Additional model and task details can be found in Appendices F and G, respectively.

4.1 ONE DIMENSIONAL REGRESSION

We begin with a set of 1D regression tasks of differing complexity: sine waves, exponentials, straight lines and damped oscillators, all of which can be described by ODEs. For each task, the functions are determined by a set of parameters (amplitude, shift, etc.) with pre-defined ranges. To generate the distribution over functions, we sample these parameters from a uniform distribution over their respective ranges. We use 490 time-series for training and evaluate on 10 separate test time-series. Each series contains 100 points. We repeat this procedure across 5 different random seeds to compute the standard error. Additional details can be found in Appendix G.1.

The left and middle panels of Figure 3 show how NPs and NDPs adapt on the sine task to incoming data points. When a single data point has been supplied, NPs have incorrectly collapsed the distribution over functions to a set of almost horizontal lines. NDPs, on the other hand, are able to produce a wide range of possible trajectories. Even when a large number of points have been supplied, the NP posterior does not converge on a good fit, whereas NDPs correctly capture the true sine curve. In the right panel of Figure 3, we show the test-set MSE as a function of the training epoch. It can be seen that NDPs train in fewer iterations to a lower test loss despite having approximately 10% fewer parameters than NPs.

We conducted an ablation study, training all model variants on all the 1D datasets, with final test MSE losses provided in Table 1 and training plots in Appendix G.1. We find that NDPs either strongly outperform NPs (sine, linear), or their standard errors overlap (exponential, oscillators). For the exponential and harmonic oscillator tasks, where the models perform similarly, many points are close to zero in each example and as such it is possible to achieve a low MSE score by producing outputs that are also around zero. In contrast, the sine and linear datasets have a significant variation in the y-values over the range, and we observe that NPs perform considerably worse than the NDP models on these tasks.

Table 1: Final MSE loss on 1D regression tasks with standard error (lower is better). Bold indicates that the model performance is within error at the 95% confidence level, with underline indicating the best estimate for the top performer. NPs perform similarly to NDPs and their variants on the exponential and oscillator tasks, where y-values are close to zero. For the sine and linear tasks, where y-values vary significantly over the time range, NPs perform worse than NDPs and their variants.

| Model (MSE ×10⁻²) | Sine | Linear | Exponential | Oscillators |
|---|---|---|---|---|
| NP | 5.93 ± 0.96 | 5.85 ± 0.70 | 0.29 ± 0.03 | 0.64 ± 0.06 |
| NDP | 2.09 ± 0.12 | 3.76 ± 0.32 | 0.31 ± 0.08 | 0.72 ± 0.08 |
| ND2P | 2.75 ± 0.19 | 4.37 ± 1.14 | 0.25 ± 0.04 | 0.55 ± 0.03 |
| NDP-L | 2.51 ± 0.24 | 4.77 ± 0.67 | 0.40 ± 0.04 | 0.72 ± 0.04 |
| ND2P-L | 2.64 ± 0.30 | 3.16 ± 0.46 | 0.39 ± 0.05 | 0.66 ± 0.03 |
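For reference, the task distributions above are produced simply by sampling the function parameters uniformly; a minimal sketch for the sine task is shown below, with the ranges taken from Table 4 in Appendix G.1 and an arbitrary random seed.

```python
import numpy as np

def sample_sine_series(rng, num_points=100):
    """One synthetic sine time-series: y = a * sin(t - b) with uniform a, b."""
    a = rng.uniform(-1.0, 1.0)    # amplitude
    b = rng.uniform(-0.5, 0.5)    # shift
    t = np.linspace(-np.pi, np.pi, num_points)
    return t, a * np.sin(t - b)

rng = np.random.default_rng(0)
train_series = [sample_sine_series(rng) for _ in range(490)]  # 490 training series
test_series = [sample_sine_series(rng) for _ in range(10)]    # 10 held-out series
```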
The difference between NDP and the best of the other model variants is not significant across the set of tasks. As such, we consider only NDPs for the remainder of the paper, as this is the least constrained model version: they have unrestricted latent phase-space dynamics, unlike the second-order counterparts, and a more expressive decoder architecture, unlike the latent-only variants. In addition, NDPs train in a faster wall-clock time than the other variants, as shown in Appendix D.

Active Learning We perform an active learning experiment on the sines dataset to evaluate both the uncertainty estimates produced by the models and how well they adapt to new information. Provided with an initial context point, additional points are greedily queried according to the model uncertainty. Higher-quality uncertainty estimation and better adaptation will result in more information being acquired at each step, and therefore a faster and greater reduction in error. As shown in Figure 4, NDPs also perform better in this setting.

Figure 4: Active learning on the sines dataset. Left: NPs querying the points of highest uncertainty. Middle: NDPs querying the points of highest uncertainty; qualitatively, they outperform NPs. Right: MSE plots of four different querying regimes (NPs and NDPs querying actively and randomly); NDP Active decreases the MSE the fastest.

4.2 PREDATOR-PREY DYNAMICS

The Lotka-Volterra equations are used to model the dynamics of a two-species system, where one species predates on the other. The populations of the prey, $u$, and the predator, $v$, are given by the differential equations $\dot{u} = \alpha u - \beta u v$ and $\dot{v} = \delta u v - \gamma v$, for positive real parameters $\alpha, \beta, \delta, \gamma$. Intuitively, when prey is plentiful, the predator population increases ($+\delta u v$), and when there are many predators, the prey population falls ($-\beta u v$). The populations exhibit periodic behaviour, with the phase-space orbit determined by the conserved quantity $V = \delta u - \gamma \ln(u) + \beta v - \alpha \ln(v)$. Thus, for any predator-prey system there exists a range of stable functions describing the dynamics, with any particular realisation being determined by the initial conditions $(u_0, v_0)$.

We consider the system $(\alpha, \beta, \gamma, \delta) = (2/3, 4/3, 1, 1)$. We generate sample time-series from the Lotka-Volterra system by considering different starting configurations, $(u_0, v_0) = (2E, E)$, where $E$ is sampled from a uniform distribution in the range $(0.25, 1.0)$. The training set consists of 40 such samples, with a further 10 samples forming the test set.

Figure 5: NPs and NDPs on the Lotka-Volterra task. Black is used for targets or ground truth, solid lines for mean predictions over 50 samples, and dashed lines for sample trajectories. In the left and middle plots, the shaded regions show the min-max range over 50 samples; in the right plot the shaded region was produced using kernel density estimation. Left: NPs are less able to model the dynamics, diverging from the ground truth even in regions with dense context sampling, whereas the NDP is both more accurate and varies more appropriately. Middle: Plotting the theoretically conserved quantity $V = \delta u - \gamma \ln u + \beta v - \alpha \ln v$ better exposes how the models deviate from the ground truth. Right: In phase space $(u, v)$, the NDP is more clearly seen to better track the ground truth.
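For reference, trajectories of this kind can be generated with an off-the-shelf ODE solver. The sketch below uses scipy's solve_ivp with the parameter values and initial-condition scheme given above; the solver settings, the evaluation grid and the time rescaling (Appendix G.2) are otherwise assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

ALPHA, BETA, GAMMA, DELTA = 2 / 3, 4 / 3, 1.0, 1.0

def lotka_volterra(t, state):
    u, v = state
    return [ALPHA * u - BETA * u * v, DELTA * u * v - GAMMA * v]

def sample_trajectory(rng, num_points=100):
    """One predator-prey series with (u0, v0) = (2E, E), E ~ U(0.25, 1.0)."""
    e = rng.uniform(0.25, 1.0)
    t = np.linspace(0.0, 15.0, num_points)
    sol = solve_ivp(lotka_volterra, (t[0], t[-1]), [2 * e, e], t_eval=t)
    return t / 10.0, sol.y.T  # times rescaled by a factor of 10, as in Appendix G.2

rng = np.random.default_rng(0)
train_trajectories = [sample_trajectory(rng) for _ in range(40)]  # 40 training trajectories
test_trajectories = [sample_trajectory(rng) for _ in range(10)]   # 10 test trajectories
```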
As before, each time series consists of 100 time samples and we evaluate across 5 different random seeds to obtain a standard error. We find that NDPs are able to train in fewer epochs to a lower loss (Appendix G.2). We record final test MSEs (×10⁻²) of 44 ± 4 for the NPs and 15 ± 2 for the NDPs. As in the 1D tasks, NDPs perform better despite having a representation $r$ and context $z$ with lower dimensionality, leading to 10% fewer parameters than NPs. Figure 5 presents these advantages for a single time series.

4.3 VARIABLE ROTATING MNIST

To test our model on high-dimensional time-series with latent dynamics, we consider the rotating MNIST digits (Casale et al., 2018; Çağatay Yıldız et al., 2019). In the original task, samples of digit 3 start upright and rotate once over 16 frames (corresponding to an angular velocity of $360^\circ\,\mathrm{s}^{-1}$), i.e. constant angular velocity and zero angular shift. However, since we are interested in time-series with variable latent dynamics and increased variability in the initial conditions, as in our formal problem statement, we consider a more challenging version of the task. In our adaptation, the angular velocity varies between samples in the range $(360 \pm 60)^\circ\,\mathrm{s}^{-1}$ and each sample starts at a random initial rotation. To induce some irregularity in each time-series in the training dataset, we remove five randomly chosen time-steps (excluding the initial time $t_0$) from each time-series. Overall, we generate a dataset with 1,000 training time-series, 100 validation time-series and 200 test time-series, each using disjoint combinations of different calligraphic styles and dynamics.

We compare NPs and NDPs using identical convolutional networks for encoding the images in the context. We assume that the initial image $y_0$ (i.e. the image at $t_0$) is always present in the context. As such, for NDPs, we compute the distribution of $L_0$ purely by encoding $y_0$ and disregarding the other samples in the context, as described in Section 3. We train the NP for 500 epochs and use the validation-set error to checkpoint the best model for testing. We follow a similar procedure for the NDP model but, due to the additional computational load introduced by the integration operation, only train for 50 epochs.

Figure 6: Predictions on the test set of Variable Rotating MNIST. NDP is able to extrapolate beyond the training time range, whereas NP cannot even learn to reconstruct the digit.

In Figure 6, we include the predictions offered by the two models on a time-series from the test dataset, which was not seen in training by either of the models. Despite the lower number of epochs they are trained for, NDPs are able to interpolate and even extrapolate on the variable velocity MNIST dataset, while also accurately capturing the calligraphic style of the digit. NPs struggle on this challenging task and are unable to produce anything resembling the digits. In order to better understand this wide performance gap, we also train in Appendix G.3 the exact same models on the easier Rotating MNIST task from Çağatay Yıldız et al. (2019), where the angular velocity and initial rotation are constant. In this setting, the two models perform similarly, since the NP model can rely on simple interpolations without learning any dynamics.
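For reference, a single variable-velocity sequence of the kind described above can be produced as follows. The angular-velocity range, the random initial rotation and the removal of five non-initial frames follow Section 4.3; interpreting the velocity as degrees per 16-frame sequence, the interpolation order and the time normalisation are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def make_rotation_sequence(digit, rng, num_frames=16):
    """digit: a 2-D grayscale MNIST image; returns (times, frames) for one series."""
    theta0 = rng.uniform(0.0, 360.0)                # random initial rotation
    omega = rng.uniform(300.0, 420.0) / num_frames  # 360 +/- 60 degrees over the sequence
    t = np.arange(num_frames)
    frames = np.stack([rotate(digit, theta0 + omega * k, reshape=False, order=1)
                       for k in t])
    # Irregular sampling: drop five randomly chosen non-initial time-steps.
    keep = np.sort(rng.choice(np.arange(1, num_frames), num_frames - 6, replace=False))
    keep = np.concatenate(([0], keep))
    return t[keep] / num_frames, frames[keep]
```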
5 DISCUSSION AND RELATED WORK

We now consider two perspectives on how Neural ODE Processes relate to existing work and discuss the model in these contexts.

NDPs as Neural Processes From the perspective of stochastic processes, NDPs can be seen as a generalisation of NPs defined over time and, as such, existing improvements in this family are likely orthogonal to our own. For instance, following the work of Kim et al. (2019), we would expect adding an attention mechanism to NDPs to reduce uncertainty around context points. Additionally, the intrinsic sequential nature of time could be further exploited to model a dynamically changing sequence of NDPs, as in Sequential NPs (Singh et al., 2019). For application domains where the observations evolve on a graph structure, such as traffic networks, relational information could be exploited with message-passing operations as in MPNPs (Day et al., 2020).

NDPs as Neural ODEs From a dynamics perspective, NDPs can be thought of as an amortised Bayesian Neural ODE. In this sense, ODE2VAE (Çağatay Yıldız et al., 2019) is the model that is most closely related to our method. While there are many common ideas between the two, significant differences exist. Firstly, NDPs do not use an explicit Bayesian neural network but are linked to them through the theoretical connections inherited from NPs (Garnelo et al., 2018b). NDPs handle uncertainty through latent variables, whereas ODE2VAE uses a distribution over the NODE's weights. Secondly, NDPs stochastically condition the ODE derivative function and initial state on an arbitrary context set of variable size. In contrast, ODE2VAE conditions only the initial position and initial velocity on the first element and the first M elements in the sequence, respectively. Therefore, our model can dynamically adapt the dynamics to any observed time points. From that point of view, our model also runs parallel to other attempts to make Neural ODEs capable of dynamically adapting to irregularly sampled data (Kidger et al., 2020).

We conclude this section by remarking that any Latent NODE, as originally defined in Chen et al. (2018), is also a stochastic process. However, regular Latent NODEs are not trained to use a data-adaptive prior over the latent context, but use a fixed standard-normal prior. This corresponds to the case when $C = T$, as remarked by Le et al. (2018). Additionally, they also only model an uncertainty in the initial position of the ODE, but do not consider an uncertainty in the derivative function.

6 CONCLUSION

We introduce Neural ODE Processes (NDPs), a new class of stochastic processes suitable for modelling data-adaptive stochastic dynamics. First, NDPs tackle the two main problems faced by Neural ODEs applied to dynamics-governed time series: adaptability to incoming data points and uncertainty in the underlying dynamics when the data is sparse and, potentially, irregularly sampled. Second, they add an explicit treatment of time as an additional inductive bias inside Neural Processes. To do so, NDPs include a probabilistic ODE as an additional encoded structure, thereby incorporating the assumption that the time-series is the direct or latent manifestation of an underlying ODE. Furthermore, NDPs maintain the scalability of NPs to large inputs. We evaluate our model on synthetic 1D and 2D data, as well as higher-dimensional problems such as rotating MNIST digits. Our method exhibits superior training performance when compared with NPs, yielding a lower loss in fewer iterations. Whether or not the underlying ODE of the data is latent, we find that where there is a fundamental ODE governing the dynamics, NDPs perform well.
REFERENCES

Francesco Paolo Casale, Adrian V. Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolò Fusi. Gaussian process prior variational autoencoders. arXiv preprint arXiv:1810.11738, 2018.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571-6583, 2018.

Ben Day, Cătălina Cangea, Arian R. Jamasb, and Pietro Liò. Message passing neural processes, 2020.

Ruizhi Deng, Bo Chang, Marcus A. Brubaker, Greg Mori, and Andreas Lehrmann. Modeling continuous stochastic processes with dynamic normalizing flows. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 3140-3150. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf.

Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. Conditional neural processes, 2018a.

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes, 2018b.

Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, and Richard E. Turner. Convolutional conditional neural processes, 2019.

Junteng Jia and Austin R. Benson. Neural jump stochastic differential equations, 2020.

Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. arXiv preprint arXiv:2005.08926, 2020.

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes, 2019.

Tuan Anh Le, Hyunjik Kim, Marta Garnelo, Dan Rosenbaum, Jonathan Schwarz, and Yee Whye Teh. Empirical evaluation of neural process objectives. In NeurIPS Workshop on Bayesian Deep Learning, 2018.

Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable gradients for stochastic differential equations, 2020.

Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise, 2019.

Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Dissecting neural ODEs, 2020.

James Morrill, Patrick Kidger, Cristopher Salvi, James Foster, and Terry Lyons. Neural CDEs for long time series via the log-ODE method, 2020.

Alexander Norcliffe, Cristian Bodnar, Ben Day, Nikola Simidjievski, and Pietro Liò. On second order behaviour in augmented neural ODEs, 2020.

Bernt Øksendal. Stochastic differential equations. In Stochastic Differential Equations, pp. 65-84. Springer, 2003.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ODEs for irregularly-sampled time series, 2019.

Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural processes. In Advances in Neural Information Processing Systems, pp. 10254-10264, 2019.

T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit, 2019.

Ben H. Williams, Marc Toussaint, and Amos J. Storkey. Extracting motion primitives from natural handwriting data. In International Conference on Artificial Neural Networks, pp. 634-643. Springer, 2006.

Çağatay Yıldız, Markus Heinonen, and Harri Lähdesmäki. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks, 2019.

ACKNOWLEDGEMENTS

We'd like to thank Cătălina Cangea and Nikola Simidjievski for their feedback on an earlier version of this work, and Felix Opolka for many discussions in this area. We were greatly enabled by and are indebted to the developers of a great number of open-source projects, most notably the torchdiffeq library. Jacob Moss is funded by a GSK grant.

A STOCHASTIC PROCESS PROOFS

Before giving the proofs, we state the following important lemma.

Lemma A.1. As in NPs, the decoder output $g(l(t), t)$ can be seen as a function $F(t)$ for a given fixed $l(t_0)$ and $d$.

Proof. This follows directly from the fact that $l(t) = l(t_0) + \int_{t_0}^{t} f_\theta(l(\tau), \tau, d)\,d\tau$ can be seen as a function of $t$ and that the integration process is deterministic for a given pair $l(t_0)$ and $d$ (i.e. for fixed initial conditions and control).

Proposition 3.1. NDPs satisfy the exchangeability condition.

Proof. This follows directly from Lemma A.1, since any permutation on $t_{1:n}$ would automatically act on $F_{1:n}$ and consequently on $p(y_{1:n}, l(t_0), d \mid t_{1:n})$, for any given $l(t_0), d$.

Proposition 3.2. NDPs satisfy the consistency condition.

Proof. Based on Lemma A.1, we can write the joint distribution (similarly to a regular NP) as follows:

$$\rho_{t_{1:n}}(y_{1:n}) = \int p(F) \prod_{i=1}^{n} p(y_i \mid F(t_i))\,dF. \qquad (6)$$

Because the density of any $y_i$ depends only on the corresponding $t_i$, integrating out any subset of $y_{1:n}$ gives the joint distribution of the remaining random variables in the sequence. Thus, consistency is guaranteed.

B ELBO DERIVATION

As noted in Lemma A.1, the joint probability $p(y, l(t_0), d \mid t) = p(l(t_0))\,p(d)\,p(y \mid g(l(t), d, t))$ can still be seen as a function that depends only on $t$, since the ODE integration process is deterministic for a given $l(t_0)$ and $d$. Therefore, the ELBO derivation proceeds as usual (Garnelo et al., 2018b). For convenience, let $z = (l(t_0), d)$ denote the concatenation of the two latent vectors and $q(z) = q_L(l(t_0))\,q_D(d)$. First, we derive the ELBO for $\log p(y^T \mid t^T)$:

$$\log p(y^T \mid t^T) = D_{KL}\big(q(z \mid t^T, y^T)\,\|\,p(z \mid t^T, y^T)\big) + \mathcal{L}_{ELBO} \qquad (7)$$

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q(z \mid t^T, y^T)}\big[-\log q(z \mid t^T, y^T) + \log p(y^T, z \mid t^T)\big] \qquad (8)$$

$$= \mathbb{E}_{q(z \mid t^T, y^T)}\big[-\log q(z \mid t^T, y^T) + \log p(z) + \log p(y^T \mid t^T, z)\big] \qquad (9)$$

$$= \mathbb{E}_{q(z \mid t^T, y^T)}\Big[\sum_{i \in I_T} \log p(y_i \mid z, t_i) + \log \frac{p(z)}{q(z \mid t^T, y^T)}\Big] \qquad (10)$$

At training time, however, we want to maximise $\log p(y^T \mid t^T, y^C)$.
Using the derivation above, we obtain a similar lower-bound, but with a new prior $p(z \mid t^C, y^C)$, updated to reflect the additional information supplied by the context:

$$\log p(y^T \mid t^T, y^C) \geq \mathbb{E}_{q(z \mid t^T, y^T)}\Big[\sum_{i \in I_T} \log p(y_i \mid z, t_i) + \log \frac{p(z \mid t^C, y^C)}{q(z \mid t^T, y^T)}\Big]$$

If we approximate the true $p(z \mid t^C, y^C)$ with the variational posterior, this takes the final form

$$\log p(y^T \mid t^T, y^C) \geq \mathbb{E}_{q(z \mid t^T, y^T)}\Big[\sum_{i \in I_T} \log p(y_i \mid z, t_i) + \log \frac{q(z \mid t^C, y^C)}{q(z \mid t^T, y^T)}\Big]$$

Splitting $z = (l(t_0), d)$ back into its constituent parts, we obtain the loss function

$$\mathbb{E}_{q(l(t_0), d \mid t^T, y^T)}\Big[\sum_{i \in I_T} \log p(y_i \mid l(t_0), d, t_i) + \log \frac{q_L(l(t_0) \mid t^C, y^C)}{q_L(l(t_0) \mid t^T, y^T)} + \log \frac{q_D(d \mid t^C, y^C)}{q_D(d \mid t^T, y^T)}\Big]$$

C LEARNING AND INFERENCE PROCEDURE

We include below the pseudocode for training NDPs. For clarity of exposition, we give code for a single time-series. However, in practice, we batch all the operations in lines 6-15.

Algorithm 1: Learning and Inference in Neural ODE Processes
Input: a dataset of time-series $\{X_k\}$, $k \in \{1, \ldots, K\}$, where $K$ is the total number of time-series
1. Initialise the NDP model with parameters $\theta$
2. Let $m$ be the number of context points and $n$ the number of extra target points
3. For $i \leftarrow 0$ to training steps do:
4. Sample $m$ from $U[1, \text{max context points}]$
5. Sample $n$ from $U[1, \text{max extra target points}]$
6. Uniformly sample a time-series $X_k$
7. Uniformly sample from $X_k$ the target points $T = (t^T, y^T)$, where $t^T$ is the time batch with shape $(m + n, 1)$ and $y^T$ is the corresponding outputs batch with shape $(m + n, \dim(y))$
8. Extract the (unordered) context set $C = T[0:m]$
9. Compute $q(l(t_0), d \mid C)$ using the variational encoder
10. Compute $q(l(t_0), d \mid T)$ using the variational encoder
11. Sample $l(t_0), d$ from $q(l(t_0), d \mid T)$ (during training, we sample from $q(l(t_0), d \mid T)$)
12. Integrate to compute $l(t)$ as in Equation 3 for all times $t \in t^T$
13. For each time $t \in t^T$ do:
14. Use the decoder to compute $p(y(t) \mid g(l(t)), t)$
15. Compute the loss $\mathcal{L}_{ELBO}$ based on Equation 5
16. $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{ELBO}$

It is worth highlighting that during training we sample $l(t_0), d$ from the target-conditioned posterior, rather than the context-conditioned posterior. In contrast, at inference time we sample from the context-conditioned posterior.

D WALL CLOCK TRAINING TIMES

To explore the additional term in the runtime given in Section 3.5, we record the wall-clock time for each model to train for 30 epochs on the 1D synthetic datasets, over 5 seeds. Then we take the ratio between a given model and the NP. The experiments were run on an Nvidia Titan XP. The results can be seen in Table 2.

E SIZE OF LATENT ODE

To investigate how many dimensions the ODE state $l$ should have, we carry out an ablation study, looking at the performance on the 1D sine dataset. We train models with $l$-dimension in $\{1, 2, 5, 10, 15, 20\}$ for 30 epochs. Figure 7 shows training plots for $\dim(l) \in \{1, 2, 10, 20\}$, and final MSE values are given in Table 3.

Table 2: Ratios of Neural ODE Process to Neural Process training times on different 1D synthetic datasets. We see that NDP/NP is the lowest (i.e. fastest) in each case.
| Time Ratios | Sine | Exponential | Linear | Oscillators |
|---|---|---|---|---|
| NDP/NP | 22.1 ± 0.9 | 23.6 ± 0.9 | 10.9 ± 1.4 | 22.2 ± 2.3 |
| ND2P/NP | 55.2 ± 6.3 | 32.4 ± 1.5 | 14.2 ± 0.3 | 35.8 ± 0.7 |
| NDP-L/NP | 55.2 ± 6.2 | 47.5 ± 18.0 | 14.7 ± 1.5 | 25.3 ± 0.5 |
| ND2P-L/NP | 43.7 ± 1.9 | 27.9 ± 1.1 | 15.1 ± 1.6 | 32.8 ± 1.1 |
| NP training time /s | 22.4 ± 0.2 | 45.5 ± 0.3 | 100.9 ± 0.3 | 23.2 ± 0.4 |

Figure 7 (panels: training loss, test MSE and test −log p over 30 epochs, for dim(l) ∈ {1, 2, 10, 20} and the NP baseline): Training plots of NDPs with latent ODEs of different sizes, training on the sine dataset. We see that for dim(l) = 1 the model trains slowly, as would be expected for a sine curve, where at least 2 dimensions are needed to learn second-order dynamics, and its test performance is close to the standard NP. The other models train at approximately the same rate.

We see that when dim(l) = 1, NDPs are slow to train and require more epochs. This is because sine curves are second-order ODEs, and at least two dimensions are required to learn second-order dynamics (one for the position and one for the velocity). When dim(l) = 1, NDPs perform similarly to NPs, which is expected when the latent ODE is unable to capture the underlying dynamics. We then see that for all other dimensions, NDPs train at approximately the same rate (over epochs) and have similar final MSE scores. As the dimension increases beyond 10, the test MSE increases, indicating overfitting.

F ARCHITECTURAL DETAILS

For the experiments with low dimensionality (1D, 2D), the architectural details are as follows:

- Encoder, $[t_i, y_i] \to r_i$: multilayer perceptron, 2 hidden layers, ReLU activations.
- Aggregator, $r_{1:n} \to r$: taking the mean.
- Representation to hidden, $r \to h$: one linear layer followed by ReLU.
- Hidden to $L(t_0)$ mean, $h \to \mu_L$: one linear layer.
- Hidden to $L(t_0)$ variance, $h \to \sigma_L$: one linear layer, followed by a sigmoid, multiplied by 0.9, plus 0.1, i.e. $\sigma_L = 0.1 + 0.9\,\mathrm{sigmoid}(Wh + b)$.
- Hidden to $D$ mean, $h \to \mu_D$: one linear layer.
- Hidden to $D$ variance, $h \to \sigma_D$: one linear layer, followed by a sigmoid, multiplied by 0.9, plus 0.1, i.e. $\sigma_D = 0.1 + 0.9\,\mathrm{sigmoid}(Wh + b)$.
- ODE layers, $[l, d, t] \to \dot{l}$: multilayer perceptron, two hidden layers, tanh activations.
- Decoder, $g(l(t^T_i), d, t^T_i) \to y^T_i$: for the NDP and ND2P models described in Section 3.3, this function is a linear layer acting on a concatenation of the latent state and a function of $l(t^T_i)$, $d$ and $t^T_i$: $g(l(t^T_i), d, t^T_i) = W\big(l(t^T_i)\,\|\,h(l(t^T_i), d, t^T_i)\big) + b$, where $h$ is a multilayer perceptron with two hidden layers and ReLU activations.

Table 3: Final MSE values for NDPs trained on the sine dataset with different-sized ODEs, with NP performance included for reference. Peak performance is found when dim(l) = 2, which is to be expected as the true dynamics are 2-dimensional. For dim(l) = 1, the MSE is highest and within error of the NP. Performance degrades with increasing dim(l), with overfitting becoming a problem for dim(l) = 20.

| $l$-dimension | MSE ×10⁻² | Training time /s |
|---|---|---|
| NP | 5.9 ± 0.9 | 22.4 ± 0.2 |
| 1 | 5.6 ± 1.3 | 299.7 ± 20.5 |
| 2 | 1.7 ± 0.1 | 413.8 ± 52.9 |
| 5 | 2.2 ± 0.2 | 414.8 ± 13.1 |
| 10 | 2.1 ± 0.1 | 496.7 ± 20.5 |
| 15 | 2.6 ± 0.2 | 618.0 ± 30.7 |
| 20 | 3.1 ± 0.3 | 652.0 ± 38.8 |

For the high-dimensional experiments (Rotating MNIST):

- Encoder, $[t_i, y_i] \to r_i$: convolutional neural network, 4 layers with 16, 32, 64, 128 channels respectively, kernel size 5, stride 2, ReLU activations, batch normalisation.
- Aggregator, $r_{1:n} \to r$: taking the mean.
- Representation to $D$ hidden, $r \to h_D$: one linear layer followed by ReLU.
- Hidden to $D$ mean, $h_D \to \mu_D$: one linear layer.
- Hidden to $D$ variance, $h_D \to \sigma_D$: one linear layer, followed by a sigmoid, multiplied by 0.9, plus 0.1, i.e. $\sigma_D = 0.1 + 0.9\,\mathrm{sigmoid}(Wh + b)$.
- $y_0$ to $L(t_0)$ hidden, $y_0 \to h_L$: convolutional neural network, 4 layers with 16, 32, 64, 128 channels respectively, kernel size 5, stride 2, ReLU activations, batch normalisation.
- $L(t_0)$ hidden to $L(t_0)$ mean, $h_L \to \mu_L$: one linear layer.
- $L(t_0)$ hidden to $L(t_0)$ variance, $h_L \to \sigma_L$: one linear layer, followed by a sigmoid, multiplied by 0.9, plus 0.1, i.e. $\sigma_L = 0.1 + 0.9\,\mathrm{sigmoid}(Wh + b)$.
- ODE layers, $[l, d, t] \to \dot{l}$: multilayer perceptron, two hidden layers, tanh activations.
- Decoder, $g(l(t^T_i)) \to y^T_i$: one linear layer followed by a 4-layer transposed convolutional neural network with 32, 128, 64, 32 channels respectively, ReLU activations, batch normalisation.

G TASK DETAILS AND ADDITIONAL RESULTS

G.1 ONE DIMENSIONAL REGRESSION

We carried out an ablation study over model variations on various 1D synthetic tasks: sines, exponentials, straight lines and harmonic oscillators. Each task is based on some function described by a set of parameters that are sampled over to produce a distribution over functions. In every case, the parameters are sampled from uniform distributions. A trajectory example is formed by sampling from the parameter distributions and then sampling from that function at evenly spaced timestamps $t$ over a fixed range to produce 100 data points $(t, y)$. We give the equations for these tasks in terms of their defining parameters and the ranges for these parameters in Table 4.

To test after each epoch, 10 random context points are taken, and then the mean-squared error and negative log probability are calculated over all the points (not just a subset of the target points). Each model was trained 5 times on each dataset (with different weight initialisations). We used a batch size of 5, with the context size ranging from 1 to 10, and the extra target size ranging from 0 to 5.¹ The results are presented in Figure 8.

¹ As written in the problem statement in Section 2, we make the context set a subset of the target set when training. So we define a context size range and an extra target size range for each task.

Table 4: Task details for 1D regression. $a$ and $b$ are sampled uniformly at random from the given ranges. $t$ is sampled at 100 regularly spaced intervals over the given range. 490 training examples and 10 test examples were used in every case.

| Task | Form | a | b | t | # train | # test |
|---|---|---|---|---|---|---|
| Sines | $y = a \sin(t - b)$ | $(-1, 1)$ | $(-1/2, 1/2)$ | $(-\pi, \pi)$ | 490 | 10 |
| Exponentials | $y = (a/60)\, e^{t - b}$ | $(-1, 1)$ | $(-1/2, 1/2)$ | $(-1, 4)$ | 490 | 10 |
| Straight lines | $y = at + b$ | $(-1, 1)$ | $(-1/2, 1/2)$ | $(0, 5)$ | 490 | 10 |
| Oscillators | $y = a \sin(t - b)\, e^{-t/2}$ | $(-1, 1)$ | $(-1/2, 1/2)$ | $(0, 5)$ | 490 | 10 |

All models perform better than NPs, with fewer parameters (approximately 10% fewer). Because there are no significant differences between the different models, we use NDP in the remainder of the experiments, because it has the fewest model restrictions. The phase-space dynamics are not restricted, unlike its second-order variant, and the decoder has a more expressive architecture than the latent-only variants. It also trains the fastest in wall-clock time, as seen in Appendix D.

G.2 LOTKA-VOLTERRA SYSTEM

To generate samples from the Lotka-Volterra system, we sample different starting configurations, $(u_0, v_0) = (2E, E)$, where $E$ is sampled from a uniform distribution in the range $(0.25, 1.0)$. We then evolve the Lotka-Volterra system

$$\frac{du}{dt} = \alpha u - \beta uv, \qquad \frac{dv}{dt} = \delta uv - \gamma v \qquad (14)$$

using $(\alpha, \beta, \gamma, \delta) = (2/3, 4/3, 1, 1)$.
This is evolved from $t = 0$ to $t = 15$ and then the times are rescaled by dividing by 10. The training curves for the Lotka-Volterra system can be seen in Figure 9. This was taken across 5 seeds, with a training set of 40 trajectories, 10 test trajectories and batch size 5. We use a context size ranging from 1 to 100, and an extra target size ranging from 0 to 45. The test context size was fixed at 90 query times. NDP trains slightly faster with lower loss, as expected.

G.3 ROTATING MNIST & ADDITIONAL RESULTS

To better understand what makes vanilla NPs fail on our Variable Rotating MNIST from Section 4.3, we train the exact same models on the simpler Rotating MNIST dataset (Çağatay Yıldız et al., 2019). In this dataset, all digits start in the same position and rotate with constant velocity. Additionally, the fourth rotation is removed from all the time-series in the training dataset. We follow the same training procedure as in Section 4.3.

We report in Figure 10 the predictions for the two models on a random time-series from the validation dataset. First, NPs and NDPs perform similarly well at interpolation and extrapolation within the time-interval used in training. As an exception, but in agreement with the results from ODE2VAE, NDPs produce a slightly better reconstruction for the fourth time step in the time-series. Second, neither model is able to extrapolate the dynamics beyond the time-range seen in training (i.e. the last five time-steps). Overall, these observations suggest that for the simpler RotMNIST dataset, explicit modelling of the dynamics is not necessary and the task can be learnt easily by interpolating between the context points. And indeed, it seems that even NDPs, which should be able to learn solutions that extrapolate, collapse on these simpler solutions present in the parameter space, instead of properly learning the desired latent dynamics. A possible explanation is that the Variable Rotating MNIST dataset can be seen as an image augmentation process which makes the convolutional features approximately rotation-equivariant. In this way, the NDP can also learn rotation dynamics in the spatial dimensions of the convolutional features.

Finally, in Figure 11, we plot the reconstructions of different digit styles on the test dataset of Variable Rotating MNIST.

Figure 8 (panels: training loss, test MSE and test −log p over 30 epochs for the sine, exponential, damped oscillator and linear data): Training model variants on the 1D synthetic datasets. NPs train slower in all cases. All Neural ODE Process variants train at approximately the same rate,
with the latent-only variants performing slightly worse than the more expressive model variants. Additionally, ND2P performs slightly better than NDP on the damped oscillator and linear sets, because they are naturally easier to learn as second-order ODEs.

G.4 HANDWRITTEN CHARACTERS

The Character Trajectories dataset consists of single-stroke handwritten digits recorded using an electronic tablet (Williams et al., 2006; Dua & Graff, 2017). The trajectories of the pen tip in two dimensions, $(x, y)$, are of varying length, with a force cut-off used to determine the start and end of a stroke. We consider a reduced dataset containing only letters that were written in a single stroke; this disregards letters such as 'f', 'i' and 't'. Whilst it is not obvious that character trajectories should follow an ODE, the related Neural Controlled Differential Equation (NCDE) model has been applied successfully to this task (Kidger et al., 2020). We train with a training set of 49,600 examples, a test set of 400 examples and a batch size of 200. We use a context size ranging between 1 and 100, an extra target size ranging between 0 and 100 and a fixed test context size of 20. We visualise the training of the models in Figure 12 and the posteriors plotted by the models in Figure 13.

Figure 9 (panels: training loss, test MSE and test −log p over 500 epochs): Training NP and NDP on the Lotka-Volterra equations. Due to the additional encoding structure of NDP, it can be seen that NDPs train in fewer iterations, to a lower loss, than NPs.

Figure 10: Predictions on the simpler Rotating MNIST dataset. NPs are also able to perform well on this task, but NDPs are not able to extrapolate beyond the maximum training time.

We observe that NPs and NDPs are unable to learn the time series as successfully as NCDEs. We record final test MSEs (×10⁻¹) of 4.6 ± 0.1 for NPs and a slightly lower 3.4 ± 0.1 for NDPs. We believe the reason is that handwritten digits do not follow an inherent ODE solution, especially given the diversity of handwriting styles for the same letter. We conjecture that Neural Controlled Differential Equations were able to perform well on this dataset due to the control process. Controlled ODEs follow the equation:

$$z(T) = z(t_0) + \int_{t_0}^{T} f_\theta(z(t), t)\,\frac{dX(t)}{dt}\,dt, \qquad z(t_0) = h_1(x(t_0)), \qquad \hat{x}(T) = h_2(z(T)), \qquad (15)$$

where $X(t)$ is the natural cubic spline through the observed points $x(t)$. If the learnt $f_\theta$ is an identity operation, then the result returned will be the cubic spline through the observed points. Therefore, a controlled ODE can learn an identity with a small perturbation, which is easier to learn with the aid of a control process than learning the entire ODE trajectory.

Figure 11: NDP is able to capture different styles in the Variable Rotating MNIST dataset.

Figure 12 (panels: training loss, test MSE and test −log p over 600 epochs): NPs and NDPs training on handwriting. NDPs perform slightly better, achieving a lower loss in fewer iterations. However, this is a marginal improvement, and we believe it is down to the significant diversity in the dataset, there being no fundamental differential equation for handwriting.
Figure 13: We test the models on drawing the letter 'a' with varying numbers of context points. For a few context points, the trajectories are diverse and not entirely recognisable. As more context points are observed, the trajectories become less diverse and start approaching an 'a'. We expect that with more training and tuning of hyperparameters, such as the batch size or the number of hidden layers, this model would improve. Additionally, we observe that NDP qualitatively outperforms NP for both small and large numbers of context points.