# Learning unknown ODE models with Gaussian processes

Markus Heinonen*¹², Çağatay Yıldız*¹, Henrik Mannerström¹, Jukka Intosalmi¹, Harri Lähdesmäki¹

*Equal contribution. ¹Aalto University, Finland. ²Helsinki Institute for Information Technology HIIT, Finland. Correspondence to: Markus Heinonen. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

**Abstract.** In conventional ODE modelling, the coefficients of an equation driving the system state forward in time are estimated. However, for many complex systems it is practically impossible to determine the equations or interactions governing the underlying dynamics, and in these settings a parametric ODE model cannot be formulated. Here, we overcome this issue by introducing a novel paradigm of nonparametric ODE modelling that can learn the underlying dynamics of arbitrary continuous-time systems without prior knowledge. We propose to learn non-linear, unknown differential functions from state observations using Gaussian process vector fields within the exact ODE formalism. We demonstrate the model's capabilities to infer dynamics from sparse data and to simulate the system forward into the future.

## 1. Introduction

Dynamical systems modelling is a cornerstone of the experimental sciences. In biology, as well as in physics and chemistry, modellers attempt to capture the dynamical behaviour of a given system or phenomenon in order to improve its understanding and make predictions about its future state. Systems of coupled ordinary differential equations (ODEs) are undoubtedly the most widely used models in science. Even simple ODE functions can describe complex dynamical behaviours (Hirsch et al., 2004). Typically, the dynamics are firmly grounded in physics with only a few parameters to be estimated from data. However, equally ubiquitous are the cases where the governing dynamics are partially or completely unknown.

We consider the dynamics of a system governed by multivariate ordinary differential functions:

$$\dot{x}(t) = \frac{dx(t)}{dt} = f(x(t)), \tag{1}$$

where $x(t) \in \mathcal{X} = \mathbb{R}^D$ is the state vector of a $D$-dimensional dynamical system at time $t$, $\dot{x}(t) \in \mathbb{R}^D$ is the first-order time derivative of $x(t)$ that drives the state forward, and $f : \mathbb{R}^D \to \mathbb{R}^D$ is the vector-valued derivative function. The ODE solution is determined by

$$x(t) = x_0 + \int_0^t f(x(\tau))\, d\tau, \tag{2}$$

where we integrate the system state forward from an initial state $x(0) = x_0$ up to time $t$. We assume that $f(\cdot)$ is completely unknown and that we only observe one or several multivariate time series $Y = (y_1, \ldots, y_N)^T \in \mathbb{R}^{N \times D}$ obtained from an additive noisy observation model at observation time points $\mathcal{T} = (t_1, \ldots, t_N) \in \mathbb{R}^N$,

$$y(t) = x(t) + \varepsilon_t, \tag{3}$$

where $\varepsilon_t \sim \mathcal{N}(0, \Omega)$ follows a stationary zero-mean multivariate Gaussian distribution with diagonal noise variances $\Omega = \operatorname{diag}(\omega_1^2, \ldots, \omega_D^2)$. The observation time points do not need to be equally spaced. Our task is to learn the differential function $f(\cdot)$ given the observations $Y$, with no prior knowledge of the ODE system.

There is a vast literature on conventional ODEs (Butcher, 2016), where a parametric form of the function $f(x; \theta, t)$ is assumed to be known and its parameters $\theta$ are subsequently optimised with least squares or a Bayesian approach. Both require the expensive forward solution $x_\theta(t_i) = \int_0^{t_i} f(x(\tau); \theta, \tau)\, d\tau$ to evaluate the system responses $x_\theta(t_i)$ from parameters $\theta$ against the observations $y(t_i)$.
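Equations (1)–(3) specify the generative assumptions: a deterministic forward solution observed under additive Gaussian noise. As a minimal illustrative sketch (in Python with SciPy; the authors' own implementation instead relies on the CVODES solver from SUNDIALS), the snippet below integrates a stand-in differential function and corrupts the solution with diagonal noise. The Van der Pol field `f_vdp` and all numerical settings are assumptions for illustration only, since $f$ is unknown in the paper's setting.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f_vdp(t, x):
    # Stand-in differential function f(x); in the paper f is unknown and must be learned.
    x1, x2 = x
    return np.array([x2, (1.0 - x1**2) * x2 - x1])

def simulate_observations(f, x0, t_obs, noise_std):
    # Eq. (2): forward-solve x(t) = x0 + int_0^t f(x(tau)) dtau at the observation times.
    sol = solve_ivp(f, (0.0, t_obs[-1]), x0, t_eval=t_obs, rtol=1e-8, atol=1e-8)
    X = sol.y.T                                   # (N, D) latent states x(t_i)
    # Eq. (3): y(t) = x(t) + eps_t,  eps_t ~ N(0, diag(omega^2))
    Y = X + np.random.randn(*X.shape) * noise_std
    return X, Y

t_obs = np.linspace(0.0, 7.0, 25)                  # unequal spacing is also allowed
X, Y = simulate_observations(f_vdp, x0=np.array([-1.0, 1.0]), t_obs=t_obs, noise_std=0.1)
```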
To overcome the computationally intensive forward solution, a family of methods denoted gradient matching (Varah, 1982; Ellner et al., 2002; Ramsay et al., 2007) has been proposed that replaces the forward solution by matching $f(y_i) \approx \dot{y}_i$ against empirical gradients $\dot{y}_i$ of the data, which does not require the costly integration step. Recently, several authors have proposed embedding a parametric differential function within a Bayesian or Gaussian process (GP) framework (Graepel, 2003; Calderhead et al., 2008; Dondelinger et al., 2013; Wang and Barber, 2014; Macdonald, 2017); see Macdonald et al. (2015) for a review. GPs have been successfully applied to model linear differential equations, as these are analytically tractable (Gao et al., 2008; Raissi et al., 2017). However, conventional ODE modelling can only proceed if a parametric form of the driving function $f(\cdot)$ is known.

Recently, initial work to handle unknown or non-parametric ODE models has been proposed, although with various limiting approximations. Early works include spline-based smoothing and additive functions $\sum_{j=1}^{D} f_j(x_j)$ to infer gene regulatory networks (De Hoon et al., 2002; Henderson and Michailidis, 2014). Äijö and Lähdesmäki (2009) proposed estimating the unknown nonlinear function with GPs using either finite time differences or by analytically solving the derivative function as a function of time only, $\dot{x}(t) = f(t)$ (Äijö et al., 2013). In a seminal technical report, Heinonen and d'Alché-Buc (2014) proposed a full vector-valued kernel model $f(x)$, however still relying on a gradient matching approximation. To our knowledge, there exists no model that can learn non-linear ODE functions $\dot{x}(t) = f(x(t))$ over the state $x$ against the true forward solutions $x(t_i)$.

In this work we propose npODE¹: the first ODE model for learning arbitrary, a priori completely unknown, nonparametric and non-linear differential functions $f : \mathcal{X} \to \mathcal{X}$ from data in a Bayesian way. We do not use gradient matching or other approximative models, but instead directly optimise the exact ODE system with fully forward-simulated responses against the data. We parameterise our model as an augmented Gaussian process vector field with inducing points, and we propose sensitivity equations to efficiently compute the gradients of the system. Our model can forecast continuous-time systems arbitrarily far into the future, and we demonstrate state-of-the-art performance on human motion datasets.

¹The implementation is publicly available at http://www.github.com/cagatayyildiz/npode
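Before moving to the model itself, a minimal, hypothetical sketch of the gradient-matching baseline discussed above may help fix ideas: empirical time derivatives are estimated from the noisy observations (here by finite differences) and a candidate differential function is fitted to reproduce them, so the integral in Equation (2) is never evaluated. The function `f` and its `params` are placeholders and not part of the npODE model, which avoids this approximation entirely.

```python
import numpy as np

def empirical_gradients(Y, t):
    # Central finite differences as a crude estimate of y_dot(t_i) from the noisy data.
    return np.gradient(Y, t, axis=0)

def gradient_matching_loss(f, params, Y, t):
    # Gradient matching: fit f so that f(y_i) matches the empirical gradients,
    # sidestepping the forward integration of Eq. (2) at the cost of an approximation.
    Y_dot = empirical_gradients(Y, t)
    residual = np.stack([f(y, params) for y in Y]) - Y_dot
    return np.sum(residual ** 2)
```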
## 2. Nonparametric ODE Model

The differential function $f(x)$ to be learned defines a vector field² $f$, that is, an assignment of a gradient vector $f(x) \in \mathbb{R}^D$ to every state $x \in \mathbb{R}^D$. We model the vector field as a vector-valued Gaussian process (Rasmussen and Williams, 2006)

$$f(x) \sim \mathcal{GP}(0, K(x, x')), \tag{4}$$

which defines a prior distribution over function values $f(x)$ whose mean and covariances are

$$\mathbb{E}[f(x)] = 0, \tag{5}$$
$$\operatorname{cov}[f(x), f(x')] = K(x, x'), \tag{6}$$

and where the kernel $K(x, x') \in \mathbb{R}^{D \times D}$ is matrix-valued. A GP prior defines that for any collection of states $X = (x_1, \ldots, x_N)^T \in \mathbb{R}^{N \times D}$, the function values $F = (f(x_1), \ldots, f(x_N))^T \in \mathbb{R}^{N \times D}$ follow a matrix-valued normal distribution,

$$p(F) = \mathcal{N}(\operatorname{vec}(F) \mid 0, K(X, X)), \tag{7}$$

where $K(X, X) = (K(x_i, x_j))_{i,j=1}^{N} \in \mathbb{R}^{ND \times ND}$ is a block matrix of matrix-valued kernels $K(x_i, x_j)$. The key property of Gaussian processes is that they encode functions where similar states $x, x'$ induce similar differentials $f(x), f(x')$, and where the state similarity is defined by the kernel $K(x, x')$.

²We use vector field and differential function interchangeably.

Figure 1. (a) Illustration of an ODE system vector field induced by the Gaussian process. The vector field $f(x)$ (grey arrows) at arbitrary states $x$ is interpolated from the inducing points $u, z$ (black arrows), with the trajectory $x(t)$ (red points) following the differential system $f(x)$ exactly. (b) The trajectory $x(t)$ plotted over time $t$.

In standard GP regression we would obtain the posterior of the vector field by conditioning the GP prior on the data (Rasmussen and Williams, 2006). In ODE models the conditional $f(x) \mid Y$ of a vector field is intractable due to the integral mapping (2) between the observed states $y(t_i)$ and the differentials $f(x)$. Instead, we resort to augmenting the Gaussian process with a set of $M$ inducing points $z \in \mathcal{X}$ and $u \in \mathcal{X}$, such that $f(z) = u$ (Quiñonero-Candela and Rasmussen, 2005). We choose to interpolate the differential function between the inducing points as (see Figure 1)

$$f(x) \approx K_\theta(x, Z)\, K_\theta(Z, Z)^{-1} \operatorname{vec}(U), \tag{8}$$

which supports the function $f(x)$ with inducing locations $Z = (z_1, \ldots, z_M)$, inducing vectors $U = (u_1, \ldots, u_M)$ and kernel parameters $\theta$. The function above corresponds to a vector-valued kernel function (Alvarez et al., 2012), or to a multi-task Gaussian process conditional mean without the variance term (Rasmussen and Williams, 2006). This definition is compatible with the deterministic nature of the ODE formalism. Due to the universality of several kernels and kernel functions (Shawe-Taylor and Cristianini, 2004), we can represent arbitrary vector fields with appropriate inducing point and kernel choices.

### 2.1. Operator-valued Kernels

The vector-valued kernel function (8) uses operator-valued kernels, which result in matrix-valued kernels $K_\theta(z, z') \in \mathbb{R}^{D \times D}$ for real-valued states $x, z$, while the kernel matrix over the inducing points becomes $K_\theta = (K(z_i, z_j))_{i,j=1}^{M} \in \mathbb{R}^{MD \times MD}$ (see Alvarez et al. (2012) for a review). The most straightforward operator-valued kernel is the identity-decomposable kernel $K_{\mathrm{dec}}(z, z') = k(z, z') \cdot I_D$, where the scalar Gaussian kernel

$$k(z, z') = \sigma_f^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(z_d - z'_d)^2}{\ell_d^2} \right) \tag{9}$$

with differential variance $\sigma_f^2$ and dimension-specific lengthscales $\ell = (\ell_1, \ldots, \ell_D)$ is expanded into a diagonal matrix of size $D \times D$. We collect the kernel parameters as $\theta = (\sigma_f, \ell)$.

We note that more complex kernels can also be considered given prior information about the underlying system characteristics. The divergence-free matrix-valued kernel induces vector fields that have zero divergence (Wahlström et al., 2013; Solin et al., 2015). Intuitively, such vector fields have no sinks or sources, and every state eventually returns to itself after a sufficient amount of time. Similarly, curl-free kernels induce curl-free vector fields that can contain sources or sinks, that is, trajectories can accelerate or decelerate. For a theoretical treatment of vector field kernels, see Narcowich and Ward (1994), Bhatia et al. (2013) and Fuselier and Wright (2017). Non-stationary vector fields can be modelled with input-dependent lengthscales (Heinonen et al., 2016), while spectral kernels can represent stationary (Wilson et al., 2013) or non-stationary (Remes et al., 2017) recurring patterns in the differential function.
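A minimal sketch of Equation (8) with the identity-decomposable kernel follows. Because $K_{\mathrm{dec}} = k \cdot I_D$, the interpolation decouples into one scalar-kernel smoother per output dimension, which the code exploits. The grid bounds, lengthscales and random inducing vectors are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f, lengthscales):
    # Scalar Gaussian (RBF) kernel with dimension-specific lengthscales, Eq. (9);
    # the identity-decomposable operator-valued kernel is K(x, x') = k(x, x') * I_D.
    diff = X1[:, None, :] / lengthscales - X2[None, :, :] / lengthscales
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def vector_field(x, Z, U, sigma_f, lengthscales, jitter=1e-6):
    # Eq. (8): f(x) ~ K(x, Z) K(Z, Z)^{-1} vec(U).
    # With the decomposable kernel this reduces to one scalar smoother per output dimension.
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + jitter * np.eye(len(Z))
    Kxz = rbf_kernel(x[None, :], Z, sigma_f, lengthscales)
    return (Kxz @ np.linalg.solve(Kzz, U)).ravel()   # (D,) differential at state x

# Example: M = 6^2 inducing locations on a fixed grid in a 2-D state space (bounds assumed).
g = np.linspace(-3.0, 3.0, 6)
Z = np.array([[a, b] for a in g for b in g])          # (M, 2) inducing locations z_m
U = np.random.randn(len(Z), 2)                        # (M, 2) inducing vectors u_m = f(z_m)
print(vector_field(np.array([0.5, -0.2]), Z, U, sigma_f=1.0, lengthscales=np.array([1.0, 1.0])))
```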
### 2.2. Joint Model

We assume a Gaussian likelihood over the observations $y_i$ and the corresponding simulated responses $x(t_i)$ of Equation (2),

$$p(Y \mid x_0, U, Z, \omega) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid x(t_i), \Omega), \tag{10}$$

where the $x(t_i)$ are forward-simulated responses using the integral Equation (2) and the differential Equation (8), and $\Omega = \operatorname{diag}(\omega_1^2, \ldots, \omega_D^2)$ collects the dimension-specific noise variances. The inducing vectors have a Gaussian process prior

$$p(U \mid Z, \theta) = \mathcal{N}(\operatorname{vec}(U) \mid 0, K_\theta(Z, Z)). \tag{11}$$

The model posterior is then

$$p(U, x_0, \theta, \omega \mid Y) \propto p(Y \mid x_0, U, \omega)\, p(U \mid \theta) =: \mathcal{L}, \tag{12}$$

where for brevity we have omitted the dependency on the locations of the inducing points $Z$, as well as the parameter hyperpriors $p(\theta)$ and $p(\omega)$, which we assume to be uniform unless there is specific domain knowledge of the priors. The model parameters are the initial state $x_0$³, the inducing vectors $U$, the noise standard deviations $\omega = (\omega_1, \ldots, \omega_D)$, and the kernel hyperparameters $\theta = (\sigma_f, \ell_1, \ldots, \ell_D)$.

³In the case of multiple time series, we use one initial state per time series.

### 2.3. Noncentral Parameterisation

We apply a latent parameterisation using the Cholesky decomposition $L_\theta L_\theta^T = K_\theta(Z, Z)$, which maps the inducing vectors to a whitened domain (Kuss and Rasmussen, 2005),

$$U = L_\theta \widetilde{U}, \qquad \widetilde{U} = L_\theta^{-1} U. \tag{13}$$

The latent variables $\widetilde{U}$ are projected onto the kernel manifold $L_\theta$ to obtain the inducing vectors $U$. This non-centered parameterisation (NCP) transforms the hierarchical posterior $\mathcal{L}$ of Equation (12) into the reparameterised form

$$p(x_0, \widetilde{U}, \theta, \omega \mid Y) \propto p(Y \mid x_0, \widetilde{U}, \omega, \theta)\, p(\widetilde{U}), \tag{14}$$

where all variables to be optimised are decoupled, with the latent inducing vectors having a standard normal prior $\widetilde{U} \sim \mathcal{N}(0, I)$. Optimising $\widetilde{U}$ and $\theta$ is now more efficient since they have independent contributions to the vector field via $U = L_\theta \widetilde{U}$. The gradients of the whitened posterior can be retrieved analytically as (Heinonen et al., 2016)

$$\nabla_{\widetilde{U}} \log \mathcal{L} = L_\theta^T \nabla_U \log \mathcal{L}. \tag{15}$$

Finally, we find a maximum a posteriori (MAP) estimate of the initial state $x_0$, the latent vector field $\widetilde{U}$, the kernel parameters $\theta$ and the noise variances $\omega$ by gradient ascent,

$$x_{0,\mathrm{MAP}},\ \widetilde{U}_{\mathrm{MAP}},\ \theta_{\mathrm{MAP}},\ \omega_{\mathrm{MAP}} = \arg\max_{x_0,\, \widetilde{U},\, \theta,\, \omega} \log \mathcal{L}, \tag{16}$$

while keeping the inducing locations $Z$ fixed on a sufficiently dense grid (see Figure 1). The partial derivatives of the posterior with respect to the noise parameters $\omega$ can be found analytically, while the derivative with respect to $\sigma_f$ is approximated with finite differences. We select the optimal lengthscales $\ell$ by cross-validation.
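To make the estimation concrete, the following hedged sketch assembles Equations (10)–(14) into a single negative log-posterior suitable for a generic minimiser such as L-BFGS. It reuses `rbf_kernel` and `vector_field` from the earlier sketch and is an illustration under assumed settings, not the authors' implementation (which optimises the positive log posterior by gradient ascent with gradients from the sensitivity equations of the next section).

```python
import numpy as np
from scipy.integrate import solve_ivp

def neg_log_posterior(x0, U_tilde, Z, Y, t_obs, omega, sigma_f, lengthscales):
    # Whitening, Eq. (13): U = L_theta @ U_tilde with L_theta L_theta^T = K_theta(Z, Z).
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + 1e-6 * np.eye(len(Z))
    L = np.linalg.cholesky(Kzz)
    U = L @ U_tilde
    # Forward-simulate the responses x(t_i) of Eq. (2) under the interpolated field of Eq. (8).
    rhs = lambda t, x: vector_field(x, Z, U, sigma_f, lengthscales)
    X = solve_ivp(rhs, (0.0, t_obs[-1]), x0, t_eval=t_obs, rtol=1e-6, atol=1e-6).y.T
    # Gaussian likelihood of Eq. (10) with diagonal noise variances omega^2, ...
    nll = 0.5 * np.sum(((Y - X) / omega) ** 2 + np.log(2.0 * np.pi * omega ** 2))
    # ... plus the standard-normal prior on the whitened inducing vectors, Eq. (14).
    return nll + 0.5 * np.sum(U_tilde ** 2)
```

Minimising this quantity corresponds to the MAP estimate of Equation (16); the practical difficulty is obtaining its gradients through the forward solver, which is exactly what the sensitivity equations below provide.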
## 3. Sensitivity Equations

The key term in the MAP gradient-ascent optimisation is the likelihood $\log p(Y \mid x_0, \widetilde{U}, \omega)$, which requires forward integration, together with its partial derivatives with respect to the whitened inducing vectors $\widetilde{U}$. Given Equation (15), we only need to compute the gradients with respect to the inducing vectors $u = \operatorname{vec}(U) \in \mathbb{R}^{MD}$,

$$\frac{d \log p(Y \mid x_0, u, \omega)}{d u} = \sum_{s} \frac{d \log \mathcal{N}(y_s \mid x(t_s, u), \Omega)}{d x} \, \frac{d x(t_s, u)}{d u}. \tag{17}$$

This requires computing the derivatives of the simulated system response $x(t, u)$ with respect to the vector field parameters $u$,

$$\frac{d x(t, u)}{d u} = S(t) \in \mathbb{R}^{D \times MD}, \tag{18}$$

which we denote elementwise by $S_{ij}(t) = \frac{\partial x(t, u)_i}{\partial u_j}$, expanding the notation to make the dependency of $x$ on $u$ explicit. Approximating these derivatives with finite differences is possible in principle, but it is highly inefficient and has been reported to cause instability (Raue et al., 2013). We instead turn to sensitivity equations for $u$ and $x_0$ that provide computationally efficient, analytical gradients $S(t)$ (Kokotovic and Heller, 1967; Fröhlich et al., 2017).

The solution for $\frac{d x(t, u)}{d u}$ can be derived by differentiating the full nonparametric ODE system with respect to $u$,

$$\frac{d}{d u} \frac{d x(t, u)}{d t} = \frac{d}{d u} f(x(t, u)). \tag{19}$$

The sensitivity equation for the given system is obtained by changing the order of differentiation on the left-hand side and carrying out the differentiation on the right-hand side. The resulting sensitivity equation can then be expressed in the form

$$\underbrace{\frac{d}{d t} \frac{d x(t, u)}{d u}}_{\dot{S}(t)} = \underbrace{\frac{\partial f(x(t, u))}{\partial x}}_{J(t)} \, \underbrace{\frac{d x(t, u)}{d u}}_{S(t)} + \underbrace{\frac{\partial f(x(t, u))}{\partial u}}_{R(t)}, \tag{20}$$

where $J(t) \in \mathbb{R}^{D \times D}$ and $R(t), S(t) \in \mathbb{R}^{D \times MD}$ (see the Supplement for a detailed specification). For our nonparametric ODE system the sensitivity equation is fully determined by

$$J(t) = \frac{\partial K(x, Z)}{\partial x} K(Z, Z)^{-1} u, \tag{21}$$
$$R(t) = K(x, Z) K(Z, Z)^{-1}. \tag{22}$$

The sensitivity equation provides an additional ODE system which describes the time evolution of the derivatives $S(t)$ with respect to the inducing vectors. The sensitivities are coupled with the actual ODE system, and thus both systems $x(t)$ and $S(t)$ are concatenated into a new augmented state that is solved jointly by Equation (2), driven by the differentials $\dot{x}(t)$ and $\dot{S}(t)$ (Leis and Kramer, 1988). The initial sensitivities are computed as $S(0) = \frac{d x_0}{d u}$. In our implementation, we merge $x_0$ with $u$ for the sensitivity analysis in order to obtain the partial derivatives with respect to the initial state, which is estimated along with the other parameters. We use the CVODES solver from the SUNDIALS package (Hindmarsh et al., 2005) to solve the nonparametric ODE models and the corresponding gradients numerically. The sensitivity-equation approach is superior to the finite-difference approximation because we obtain an exact formulation for the gradients of the state with respect to the inducing points, which can be solved up to the numerical accuracy of the ODE solver.
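As a hedged sketch of Equations (20)–(22) for the decomposable RBF field of Equation (8) (reusing `rbf_kernel` from the earlier sketch; this is an illustration rather than the CVODES-based implementation), the augmented right-hand side below evolves the state $x(t)$ jointly with the sensitivities $S(t) = dx(t)/dU$:

```python
import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, state, Z, U, sigma_f, lengthscales, jitter=1e-6):
    # Augmented ODE of Section 3: integrate x(t) together with S(t) = dx(t)/dU
    # for the decomposable RBF vector field of Eq. (8).
    M, D = U.shape
    x = state[:D]
    S = state[D:].reshape(D, M, D)                 # S[d, m, e] = d x_d / d U_{m, e}
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + jitter * np.eye(M)
    A = np.linalg.solve(Kzz, U)                    # (M, D) weights K(Z, Z)^{-1} U
    kxz = rbf_kernel(x[None, :], Z, sigma_f, lengthscales)[0]      # (M,)
    dxdt = kxz @ A                                 # f(x) = K(x, Z) K(Z, Z)^{-1} U
    # J(t) = df/dx, Eq. (21), using dk(x, z_m)/dx_j = -k(x, z_m)(x_j - z_mj)/l_j^2.
    dk_dx = -kxz[:, None] * (x[None, :] - Z) / lengthscales**2     # (M, D)
    J = A.T @ dk_dx                                # (D, D)
    # R(t) = df/dU, Eq. (22): output d depends only on column d of U through weights w.
    w = np.linalg.solve(Kzz, kxz)                  # (M,) = K(Z, Z)^{-1} K(Z, x)
    R = np.zeros((D, M, D))
    R[np.arange(D), :, np.arange(D)] = w           # R[d, :, d] = w
    dSdt = np.einsum('dj,jme->dme', J, S) + R      # S_dot = J S + R, Eq. (20)
    return np.concatenate([dxdt, dSdt.ravel()])
```

Starting from the augmented initial state `np.concatenate([x0, np.zeros(D * M * D)])` (since $S(0) = dx_0/du = 0$ when $x_0$ is held fixed), a single `solve_ivp` call then yields both the responses $x(t_i)$ and the exact gradients needed in Equation (17); merging $x_0$ into the parameter vector, as the authors do, simply adds $D$ further sensitivity columns.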
## 4. Simple Simulated Dynamics

As a first illustration of the proposed nonparametric ODE method we consider three simulated differential systems: the Van der Pol (VDP), FitzHugh-Nagumo (FHN) and Lotka-Volterra (LV) oscillators of the form

$$\mathrm{VDP}: \quad \dot{x}_1 = x_2, \qquad \dot{x}_2 = (1 - x_1^2)\, x_2 - x_1$$
$$\mathrm{FHN}: \quad \dot{x}_1 = 3\Big(x_1 - \frac{x_1^3}{3} + x_2\Big), \qquad \dot{x}_2 = 0.2 - 3 x_1 - 0.2\, x_2$$
$$\mathrm{LV}: \quad \dot{x}_1 = 1.5\, x_1 - x_1 x_2, \qquad \dot{x}_2 = -3 x_2 + x_1 x_2.$$

In the conventional ODE case the coefficients of these equations can be inferred with standard statistical techniques if a sufficient amount of time series data is available (Girolami, 2008; Raue et al., 2013). Our main goal is to infer unknown dynamics, that is, the case where these equations are unavailable and we instead represent the dynamics with the nonparametric vector field of Equation (8). We use these simulated models only to illustrate the model behaviour against the true dynamics. We employ 25 data points from one cycle of noisy observations for the VDP and FHN models, and 25 data points from 1.7 cycles for the LV model, with a noise variance of $\sigma_n^2 = 0.1^2$. We learn the npODE model with five training sequences using $M = 6^2$ inducing locations on a fixed grid, and forecast between 4 and 8 future cycles starting from the true initial state $x_0$ at time 0. Training takes approximately 100 seconds per oscillator.

Figure 2. Estimated dynamics from the Van der Pol, FitzHugh-Nagumo and Lotka-Volterra systems. The top part (a–c) shows the learned vector field (grey arrows) against the true vector field (black arrows). The bottom part (d–f) shows the training data (grey region points) and the forecasted future cycle likelihoods with the learned model (shaded region) against the true trajectory (black line).

Figure 2 (bottom) shows the training datasets (grey regions), initial states, true trajectories (black lines) and the forecasted trajectory likelihoods (coloured regions). The model accurately learns the dynamics from less than two cycles of data and can reproduce them reliably into the future. Figure 2 (top) shows the corresponding true vector field (black arrows) and the estimated vector field (grey arrows). The vector field is a continuous function, which is plotted on an 8×8 grid for visualisation. In general, the most difficult parts of the system to learn are the middle of the loop (as seen in the FHN model) and the outermost regions (bottom left in the LV model). The model learns the underlying differential $f(x)$ accurately close to the observed points, while making only a few errors in the border regions with no data.

## 5. Unknown System Estimation

Next, we illustrate how the model estimates realistic, unknown dynamics from noisy observations $y(t_1), \ldots, y(t_N)$. As in Section 4, we make no assumptions on the structure or form of the underlying system, and capture the underlying dynamics with the nonparametric system alone. We employ no subjective priors, and assume no inputs, controls or other sources of information. The task is to infer the underlying dynamics $f(x)$, and to interpolate or extrapolate the state trajectory outside the observed data.

We use a benchmark dataset of human motion capture data from the Carnegie Mellon University motion capture (CMU mocap) database. Our dataset contains 50-dimensional pose measurements $y(t_i)$ from humans walking, where each pose dimension records a measurement from a different part of the body during movement (Wang et al., 2008). We apply the preprocessing of Wang et al. (2008) by downsampling the datasets by a factor of four and centering the data. This results in a total of 4303 datapoints spread across 43 trajectories with on average 100 frames per trajectory. In order to tackle the problem of dimensionality, we project the original dataset with PCA to a three-dimensional latent space where the system is specified, following Damianou et al. (2011) and Wang et al. (2006). We place $M = 5^3$ inducing vectors on a fixed grid, and optimise our model starting from 100 different initial values, which we set by perturbing the projected empirical differences $y(t_i) - y(t_{i-1})$ assigned to the inducing vectors. We use an L-BFGS optimiser in MATLAB. The whole inference takes approximately a few minutes per trajectory.

We evaluate the method with two types of experiments: imputing missing values and forecasting future cycles. For forecasting, the first half of the trajectory is reserved for model training and the second half is to be forecasted. For imputation, we remove roughly 20% of the frames from the middle of the trajectory, which are to be filled in by the models. We perform model selection for the lengthscales $\ell$ with a cross-validation split of 80/20. In both cases we record the root mean square error (RMSE) over the test points in the original feature space, where we reconstruct the original dimensions from the latent space trajectories.
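The preprocessing and initialisation just described can be sketched as follows. This is a hypothetical illustration of the pipeline, not the authors' MATLAB code: `Y_pose` denotes one centred 50-dimensional mocap trajectory, and the perturbation scale and the nearest-neighbour assignment of empirical differences to inducing locations are assumed details not specified in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def prepare_latent_problem(Y_pose, n_grid=5, n_restarts=100, perturb_scale=0.1):
    # Project the centred 50-D pose trajectory to a 3-D latent space (Section 5).
    pca = PCA(n_components=3)
    X_latent = pca.fit_transform(Y_pose)                       # (N, 3) latent states
    # M = 5^3 inducing locations on a fixed grid covering the latent trajectory.
    axes = [np.linspace(X_latent[:, d].min(), X_latent[:, d].max(), n_grid) for d in range(3)]
    Z = np.stack(np.meshgrid(*axes, indexing='ij'), axis=-1).reshape(-1, 3)   # (125, 3)
    # Initialise inducing vectors from perturbed empirical differences y(t_i) - y(t_{i-1}),
    # copied to each inducing location from its nearest latent observation (assumed scheme).
    diffs = np.diff(X_latent, axis=0)
    nearest = np.argmin(np.linalg.norm(Z[:, None, :] - X_latent[None, :-1, :], axis=-1), axis=1)
    U_inits = [diffs[nearest] + perturb_scale * np.random.randn(len(Z), 3)
               for _ in range(n_restarts)]
    return pca, X_latent, Z, U_inits
```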
Due to the current lack of ODE methods suitable for this nonparametric inference task, we instead compare our method to state-of-the-art state-space models, with which such problems have previously been considered (Wang et al., 2008). In a state-space or dynamical model, a transition function $x(t_{k+1}) = g(x(t_k))$ moves the system forward in discrete steps. With a sufficiently high sampling rate, such models can estimate and forecast finite approximations of smooth dynamics. In the Gaussian process dynamical model (Wang et al., 2006; Frigola et al., 2014; Svensson et al., 2016) a GP transition function is learned in a latent space, which can be inferred with a standard GPLVM (Lawrence, 2004) or with a dependent GPLVM (Zhao and Sun, 2016). In dynamical systems the transition function is replaced by a GP interpolation (Damianou et al., 2011). The discrete-time state-space models emphasise inference of a low-dimensional manifold as an explanation of the high-dimensional measurement trajectories. We compare our method to the dynamical model GPDM of Wang et al. (2006) and to the dynamical system VGPLVM of Damianou et al. (2011), where we directly apply the implementations provided by the authors at inverseprobability.com/vargplvm and dgp.toronto.edu/~jmwang/gpdm. Both methods optimise their latent spaces separately, and they are thus not directly comparable.

### 5.1. Forecasting

In the forecasting task we train all models with the first half of the trajectory, while forecasting the second half starting from the first frame. The models are trained and forecasted within a low-dimensional space, and the results are subsequently projected back into the original space by inverting the PCA or with GPLVM mean predictions. As all methods optimise their latent spaces separately, they are not directly comparable; the mean errors are therefore computed in the original high-dimensional space. Note that the low-dimensional representation necessarily causes some reconstruction error.

Table 1. Means and standard deviations of RMSEs over the 43 datasets in the forecasting and imputation experiments.

| MODEL  | FORECASTING | IMPUTATION  |
|--------|-------------|-------------|
| NPODE  | 4.52 ± 2.31 | 3.94 ± 3.50 |
| GPDM   | 4.94 ± 3.3  | 5.31 ± 3.39 |
| VGPLVM | 8.74 ± 3.43 | 3.91 ± 1.80 |

Figure 3. Forecasting 50 future frames after 49 frames of training data of the human motion dataset 35_12.amc. (a) The estimated locations of the trajectory in the latent space (black points) and the future forecast (coloured lines). (b) The original features reconstructed from the latent predictions, with the grey region showing the training data.

Figure 3 illustrates the models on one of the trajectories, 35_12.amc. The top part (a) shows the training data in the PCA space for npODE, and the optimised training data representations for GPDM and VGPLVM (black points). The coloured lines (npODE) and points (GPDM, VGPLVM) indicate the future forecast. The bottom part (b) shows the first 9 original pose dimensions reconstructed from the latent forecasted trajectories. The training data is shown with a grey background, while test data is shown with circles. VGPLVM has the most trouble forecasting future points: shortly after the training data it reverts to a value close to zero and fails to predict future points. The GPDM model produces more realistic trajectories, but fails to predict any of the poses accurately. Finally, npODE accurately predicts five of the poses and retains adequate performance on the remaining poses, except for pose 2. Furthermore, Table 1 indicates that npODE is also the best performing method on average over the whole dataset in the forecasting task.
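As a hedged sketch of this evaluation protocol (the names `pca`, `vector_field` and the kernel parameters refer to the earlier illustrative sketches, and the model is assumed to have already been fitted on the training half; the published numbers come from the authors' own pipeline), the forecasting error is obtained by simulating the learned latent field forward over the whole trajectory and measuring the RMSE of the test half after mapping back to the 50-dimensional pose space:

```python
import numpy as np
from scipy.integrate import solve_ivp

def forecast_rmse(pca, X_latent, Y_pose, t, Z, U, sigma_f, lengthscales):
    # Train on the first half of the trajectory, forecast the second half (Section 5.1).
    n_train = len(t) // 2
    rhs = lambda _t, x: vector_field(x, Z, U, sigma_f, lengthscales)
    sol = solve_ivp(rhs, (t[0], t[-1]), X_latent[0], t_eval=t)
    # Map the latent forecast back to the original 50-D pose space by inverting the PCA.
    Y_hat = pca.inverse_transform(sol.y.T)
    test = slice(n_train, None)
    return np.sqrt(np.mean((Y_pose[test] - Y_hat[test]) ** 2))
```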
### 5.2. Imputation

In the imputation task we remove approximately 20% of the training data from the middle of the trajectory. The goals are to learn a model with the remaining data and to predict the missing values. Figure 4 highlights the performance of the three models on the trajectory 07_07.amc. The top part (a) shows the training data (black points) in the PCA space (npODE) or the optimised training locations in the latent space (GPDM, VGPLVM). The imputed middle part is shown with coloured points or lines. Interestingly, both npODE and GPDM operate on cyclic representations, while VGPLVM does not.

Figure 4. Imputation of 17 missing frames in the middle of a 94-frame trajectory of the human motion dataset 07_07.amc (subsampled to every fourth frame). (a) The estimated locations of the missing points in the latent space are coloured. (b) The original features reconstructed from the latent trajectory.

The bottom panel (b) of Figure 4 shows the first 9 reconstructed pose dimensions from the three models. The missing values are shown with circles, while training points are shown with black dots. All models can accurately reproduce the overall trends, while npODE seems to fit slightly worse than the other methods. The PCA projection causes the seemingly perfect fit of the npODE prediction (at the top) to lead to slightly warped reconstructions (at the bottom). All methods fit most of the missing parts as well. Table 1 shows that on average npODE and VGPLVM have approximately equal top performance on the missing value imputation task.

## 6. Discussion

We proposed a nonparametric ODE modelling framework that can accurately learn arbitrary, nonlinear continuous-time dynamics from purely observational data without making assumptions about the underlying system dynamics. We demonstrated that the model excels at learning dynamics that can be forecasted into the future. We consider this work the first in a line of studies of nonparametric ODE systems, and foresee several directions for future work. Currently we do not handle non-stationary vector fields, that is, time-dependent differentials $f_t(x)$. Furthermore, an interesting future avenue is the study of various vector field kernels, such as divergence-free, curl-free or spectral kernels (Remes et al., 2017). Finally, including inputs or controls in the system would allow precise modelling in interactive settings, such as robotics.

The proposed nonparametric ODE model operates along a continuous-time trajectory, while dynamic models such as hidden Markov models or state-space models are restricted to discrete time steps. Such models are unable to consider the system state at arbitrary times, for instance between two successive time points. Conventional ODE models have also been considered from the stochastic perspective with stochastic differential equation (SDE) models, which commonly model the deterministic system drift and the diffusion process separately, leading to a distribution over trajectories $p(x(t))$ (Archambeau et al., 2007; García et al., 2017). As future work we will consider stochastic extensions of our nonparametric ODE model, as well as MCMC sampling of the inducing point posterior $p(U \mid Y)$, which would likewise lead to a distribution over trajectories.

Acknowledgements. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, and Academy of Finland grants no. 284597, 311584, 313271 and 299915.

## References

Tarmo Äijö, Kirsi Granberg, and Harri Lähdesmäki. Sorad: a systems biology approach to predict and modulate dynamic signaling pathway response from phosphoproteome time-course measurements. Bioinformatics, 29(10):1283–1291, 2013.
M. Alvarez, L. Rosasco, and N. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 2012.
C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. In Gaussian Processes in Practice, 2007.
H. Bhatia, G. Norgard, V. Pascucci, and P. Bremer. The Helmholtz-Hodge decomposition - a survey. IEEE Transactions on Visualization and Computer Graphics, 2013.
J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 2016.
B. Calderhead, M. Girolami, and N. Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. NIPS, 2008.
Andreas Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2510–2518, 2011.
Michiel J. L. De Hoon, Seiya Imoto, Kazuo Kobayashi, Naotake Ogasawara, and Satoru Miyano. Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. In Biocomputing 2003, pages 17–28. World Scientific, 2002.
F. Dondelinger, M. Filippone, S. Rogers, and D. Husmeier. ODE parameter inference using adaptive gradient matching with Gaussian processes. JMLR, 2013.
S. P. Ellner, Y. Seifu, and R. H. Smith. Fitting population dynamic models to time-series data by gradient matching. Ecology, 83(8):2256–2270, 2002.
Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680–3688, 2014.
Fabian Fröhlich, Barbara Kaltenbacher, Fabian J. Theis, and Jan Hasenauer. Scalable parameter estimation for genome-scale biochemical reaction networks. PLOS Computational Biology, 13(1):1–18, 2017. doi:10.1371/journal.pcbi.1005331.
E. Fuselier and G. Wright. A radial basis function method for computing Helmholtz-Hodge decompositions. IMA Journal of Numerical Analysis, 2017.
Pei Gao, Antti Honkela, Magnus Rattray, and Neil D. Lawrence. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics, 24(16):i70–i75, 2008.
C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez. Nonparametric estimation of stochastic differential equations with sparse Gaussian processes. Physical Review E, 96(2):022104, 2017.
Mark Girolami. Bayesian inference for differential equations. Theoretical Computer Science, 408(1):4–16, 2008.
T. Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. ICML, 2003.
M. Heinonen and F. d'Alché-Buc. Learning nonparametric differential equations with operator-valued kernels and gradient matching. arXiv, Télécom ParisTech, 2014.
M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In AISTATS, volume 51, pages 732–740, 2016.
J. Henderson and G. Michailidis. Network reconstruction using nonparametric additive ODE models. PLOS ONE, 2014.
Alan C. Hindmarsh, Peter N. Brown, Keith E. Grant, Steven L. Lee, Radu Serban, Dan E. Shumaker, and Carol S. Woodward. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software, 31(3):363–396, 2005.
M. Hirsch, S. Smale, and R. Devaney. Differential Equations, Dynamical Systems, and an Introduction to Chaos (2nd edition). Elsevier Science & Technology Books, 2004.
P. Kokotovic and J. Heller. Direct and adjoint sensitivity equations for parameter optimization. IEEE Transactions on Automatic Control, 12(5):609–610, 1967.
Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6(Oct):1679–1704, 2005.
Neil D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
Jorge R. Leis and Mark A. Kramer. The simultaneous solution and sensitivity analysis of systems described by ordinary differential equations. ACM Transactions on Mathematical Software, 14(1), 1988.
Benn Macdonald. Statistical inference for ordinary differential equations using gradient matching. PhD thesis, University of Glasgow, 2017.
Benn Macdonald, Catherine Higham, and Dirk Husmeier. Controversy in mechanistic modelling with Gaussian processes. In International Conference on Machine Learning, pages 1539–1547, 2015.
Francis J. Narcowich and Joseph D. Ward. Generalized Hermite interpolation via matrix-valued conditionally positive definite functions. Mathematics of Computation, 63(208):661–687, 1994.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics, 335:736–746, 2017.
J. Ramsay, G. Hooker, D. Campbell, and J. Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B, 69:741–796, 2007.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Andreas Raue, Marcel Schilling, Julie Bachmann, Andrew Matteson, Max Schelker, Daniel Kaschek, Sabine Hug, Clemens Kreutz, Brian D. Harms, Fabian J. Theis, Ursula Klingmüller, and Jens Timmer. Lessons learned from quantitative dynamical modeling in systems biology. PLOS ONE, 8(9):1–17, 2013.
S. Remes, M. Heinonen, and S. Kaski. Non-stationary spectral kernels. NIPS, 2017.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
A. Solin, M. Kok, N. Wahlström, T. Schön, and S. Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. arXiv:1509.04634, 2015.
Andreas Svensson, Arno Solin, Simo Särkkä, and Thomas Schön. Computationally efficient Bayesian learning of Gaussian process state space models. In Artificial Intelligence and Statistics, pages 213–221, 2016.
J. M. Varah. A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3(1):28–46, 1982.
N. Wahlström, M. Kok, and T. Schön. Modeling magnetic fields using Gaussian processes. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Jack Wang, Aaron Hertzmann, and David M. Blei. Gaussian process dynamical models. In Advances in Neural Information Processing Systems, pages 1441–1448, 2006.
Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
Y. Wang and D. Barber. Gaussian processes for Bayesian estimation in ordinary differential equations. ICML, 2014.
A. Wilson, E. Gilboa, A. Nehorai, and J. Cunningham. Fast multidimensional pattern extrapolation with Gaussian processes. AISTATS, 2013.
Jing Zhao and Shiliang Sun. Variational dependent multi-output Gaussian process dynamical systems. The Journal of Machine Learning Research, 17(1):4134–4169, 2016.
T. Äijö and H. Lähdesmäki. Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics, 25:2937–2944, 2009.