# Learning unknown ODE models with Gaussian processes

Markus Heinonen*¹², Çağatay Yıldız*¹, Henrik Mannerström¹, Jukka Intosalmi¹, Harri Lähdesmäki¹

*Equal contribution. ¹Aalto University, Finland. ²Helsinki Institute for Information Technology HIIT, Finland. Correspondence to: Markus Heinonen. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

**Abstract.** In conventional ODE modelling, the coefficients of an equation driving the system state forward in time are estimated. However, for many complex systems it is practically impossible to determine the equations or interactions governing the underlying dynamics, and in these settings a parametric ODE model cannot be formulated. Here, we overcome this issue by introducing a novel paradigm of nonparametric ODE modelling that can learn the underlying dynamics of arbitrary continuous-time systems without prior knowledge. We propose to learn non-linear, unknown differential functions from state observations using Gaussian process vector fields within the exact ODE formalism. We demonstrate the model's capabilities to infer dynamics from sparse data and to simulate the system forward into the future.

## 1. Introduction

Dynamical systems modelling is a cornerstone of the experimental sciences. In biology, as well as in physics and chemistry, modellers attempt to capture the dynamical behaviour of a given system or phenomenon in order to improve its understanding and make predictions about its future state. Systems of coupled ordinary differential equations (ODEs) are undoubtedly the most widely used models in science. Even simple ODE functions can describe complex dynamical behaviours (Hirsch et al., 2004). Typically, the dynamics are firmly grounded in physics with only a few parameters to be estimated from data. However, equally ubiquitous are the cases where the governing dynamics are partially or completely unknown.

We consider the dynamics of a system governed by multivariate ordinary differential functions:

$$\dot{x}(t) = \frac{dx(t)}{dt} = f(x(t)), \tag{1}$$

where $x(t) \in \mathcal{X} = \mathbb{R}^D$ is the state vector of a $D$-dimensional dynamical system at time $t$, $\dot{x}(t) \in \mathbb{R}^D$ is the first-order time derivative of $x(t)$ that drives the state forward, and $f : \mathbb{R}^D \to \mathbb{R}^D$ is the vector-valued derivative function. The ODE solution is determined by

$$x(t) = x_0 + \int_0^t f(x(\tau))\, d\tau, \tag{2}$$

where we integrate the system state forward from an initial state $x(0) = x_0$ up to time $t$. We assume that $f(\cdot)$ is completely unknown and that we only observe one or several multivariate time series $Y = (y_1, \ldots, y_N)^T \in \mathbb{R}^{N \times D}$ obtained from an additive noisy observation model at observation time points $\mathcal{T} = (t_1, \ldots, t_N) \in \mathbb{R}^N$,

$$y(t) = x(t) + \varepsilon_t, \tag{3}$$

where $\varepsilon_t \sim \mathcal{N}(0, \Omega)$ follows a stationary zero-mean multivariate Gaussian distribution with diagonal noise variances $\Omega = \operatorname{diag}(\omega_1^2, \ldots, \omega_D^2)$. The observation time points do not need to be equally spaced. Our task is to learn the differential function $f(\cdot)$ given the observations $Y$, with no prior knowledge of the ODE system.

There is a vast literature on conventional ODEs (Butcher, 2016), where a parametric form of the function $f(x; \theta, t)$ is assumed to be known and its parameters $\theta$ are subsequently optimised with least squares or a Bayesian approach. Both require the expensive forward solution $x_\theta(t_i) = \int_0^{t_i} f(x(\tau); \theta, \tau)\, d\tau$ to evaluate the system responses $x_\theta(t_i)$ from parameters $\theta$ against the observations $y(t_i)$.
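Equations (1)–(3) specify the generative assumptions: a deterministic forward solution observed under additive Gaussian noise. As a minimal illustrative sketch (in Python with SciPy; the authors' own implementation instead relies on the CVODES solver from SUNDIALS), the snippet below integrates a stand-in differential function and corrupts the solution with diagonal noise. The Van der Pol field `f_vdp` and all numerical settings are assumptions for illustration only, since $f$ is unknown in the paper's setting.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f_vdp(t, x):
    # Stand-in differential function f(x); in the paper f is unknown and must be learned.
    x1, x2 = x
    return np.array([x2, (1.0 - x1**2) * x2 - x1])

def simulate_observations(f, x0, t_obs, noise_std):
    # Eq. (2): forward-solve x(t) = x0 + int_0^t f(x(tau)) dtau at the observation times.
    sol = solve_ivp(f, (0.0, t_obs[-1]), x0, t_eval=t_obs, rtol=1e-8, atol=1e-8)
    X = sol.y.T                                   # (N, D) latent states x(t_i)
    # Eq. (3): y(t) = x(t) + eps_t,  eps_t ~ N(0, diag(omega^2))
    Y = X + np.random.randn(*X.shape) * noise_std
    return X, Y

t_obs = np.linspace(0.0, 7.0, 25)                  # unequal spacing is also allowed
X, Y = simulate_observations(f_vdp, x0=np.array([-1.0, 1.0]), t_obs=t_obs, noise_std=0.1)
```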
To overcome the computationally intensive forward solution, a family of methods denoted gradient matching (Varah, 1982; Ellner et al., 2002; Ramsay et al., 2007) has been proposed that replaces the forward solution by matching $f(y_i) \approx \dot{y}_i$ against empirical gradients $\dot{y}_i$ of the data, which does not require the costly integration step. Recently, several authors have proposed embedding a parametric differential function within a Bayesian or Gaussian process (GP) framework (Graepel, 2003; Calderhead et al., 2008; Dondelinger et al., 2013; Wang and Barber, 2014; Macdonald, 2017); see Macdonald et al. (2015) for a review. GPs have been successfully applied to model linear differential equations, as these are analytically tractable (Gao et al., 2008; Raissi et al., 2017). However, conventional ODE modelling can only proceed if a parametric form of the driving function $f(\cdot)$ is known.

Recently, initial work to handle unknown or non-parametric ODE models has been proposed, although with various limiting approximations. Early works include spline-based smoothing and additive functions $\sum_{j=1}^{D} f_j(x_j)$ to infer gene regulatory networks (De Hoon et al., 2002; Henderson and Michailidis, 2014). Äijö and Lähdesmäki (2009) proposed estimating the unknown nonlinear function with GPs using either finite time differences or by analytically solving the derivative function as a function of time only, $\dot{x}(t) = f(t)$ (Äijö et al., 2013). In a seminal technical report, Heinonen and d'Alché-Buc (2014) proposed a full vector-valued kernel model $f(x)$, however still relying on a gradient matching approximation. To our knowledge, there exists no model that can learn non-linear ODE functions $\dot{x}(t) = f(x(t))$ over the state $x$ against the true forward solutions $x(t_i)$.

In this work we propose npODE¹: the first ODE model for learning arbitrary, a priori completely unknown, nonparametric and non-linear differential functions $f : \mathcal{X} \to \mathcal{X}$ from data in a Bayesian way. We do not use gradient matching or other approximative models, but instead directly optimise the exact ODE system with fully forward-simulated responses against the data. We parameterise our model as an augmented Gaussian process vector field with inducing points, and we propose sensitivity equations to efficiently compute the gradients of the system. Our model can forecast continuous-time systems arbitrarily far into the future, and we demonstrate state-of-the-art performance on human motion datasets.

¹The implementation is publicly available at http://www.github.com/cagatayyildiz/npode
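Before moving to the model itself, a minimal, hypothetical sketch of the gradient-matching baseline discussed above may help fix ideas: empirical time derivatives are estimated from the noisy observations (here by finite differences) and a candidate differential function is fitted to reproduce them, so the integral in Equation (2) is never evaluated. The function `f` and its `params` are placeholders and not part of the npODE model, which avoids this approximation entirely.

```python
import numpy as np

def empirical_gradients(Y, t):
    # Central finite differences as a crude estimate of y_dot(t_i) from the noisy data.
    return np.gradient(Y, t, axis=0)

def gradient_matching_loss(f, params, Y, t):
    # Gradient matching: fit f so that f(y_i) matches the empirical gradients,
    # sidestepping the forward integration of Eq. (2) at the cost of an approximation.
    Y_dot = empirical_gradients(Y, t)
    residual = np.stack([f(y, params) for y in Y]) - Y_dot
    return np.sum(residual ** 2)
```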
## 2. Nonparametric ODE Model

The differential function $f(x)$ to be learned defines a vector field² $f$, that is, an assignment of a gradient vector $f(x) \in \mathbb{R}^D$ to every state $x \in \mathbb{R}^D$. We model the vector field as a vector-valued Gaussian process (Rasmussen and Williams, 2006)

$$f(x) \sim \mathcal{GP}(0, K(x, x')), \tag{4}$$

which defines a prior distribution over function values $f(x)$ whose mean and covariances are

$$\mathbb{E}[f(x)] = 0, \tag{5}$$
$$\operatorname{cov}[f(x), f(x')] = K(x, x'), \tag{6}$$

and where the kernel $K(x, x') \in \mathbb{R}^{D \times D}$ is matrix-valued. A GP prior defines that for any collection of states $X = (x_1, \ldots, x_N)^T \in \mathbb{R}^{N \times D}$, the function values $F = (f(x_1), \ldots, f(x_N))^T \in \mathbb{R}^{N \times D}$ follow a matrix-valued normal distribution,

$$p(F) = \mathcal{N}(\operatorname{vec}(F) \mid 0, K(X, X)), \tag{7}$$

where $K(X, X) = (K(x_i, x_j))_{i,j=1}^{N} \in \mathbb{R}^{ND \times ND}$ is a block matrix of matrix-valued kernels $K(x_i, x_j)$. The key property of Gaussian processes is that they encode functions where similar states $x, x'$ induce similar differentials $f(x), f(x')$, and where the state similarity is defined by the kernel $K(x, x')$.

²We use vector field and differential function interchangeably.

Figure 1. (a) Illustration of an ODE system vector field induced by the Gaussian process. The vector field $f(x)$ (grey arrows) at arbitrary states $x$ is interpolated from the inducing points $u, z$ (black arrows), with the trajectory $x(t)$ (red points) following the differential system $f(x)$ exactly. (b) The trajectory $x(t)$ plotted over time $t$.

In standard GP regression we would obtain the posterior of the vector field by conditioning the GP prior on the data (Rasmussen and Williams, 2006). In ODE models the conditional $f(x) \mid Y$ of a vector field is intractable due to the integral mapping (2) between the observed states $y(t_i)$ and the differentials $f(x)$. Instead, we resort to augmenting the Gaussian process with a set of $M$ inducing points $z \in \mathcal{X}$ and $u \in \mathcal{X}$, such that $f(z) = u$ (Quiñonero-Candela and Rasmussen, 2005). We choose to interpolate the differential function between the inducing points as (see Figure 1)

$$f(x) \approx K_\theta(x, Z)\, K_\theta(Z, Z)^{-1} \operatorname{vec}(U), \tag{8}$$

which supports the function $f(x)$ with inducing locations $Z = (z_1, \ldots, z_M)$, inducing vectors $U = (u_1, \ldots, u_M)$ and kernel parameters $\theta$. The function above corresponds to a vector-valued kernel function (Alvarez et al., 2012), or to a multi-task Gaussian process conditional mean without the variance term (Rasmussen and Williams, 2006). This definition is compatible with the deterministic nature of the ODE formalism. Due to the universality of several kernels and kernel functions (Shawe-Taylor and Cristianini, 2004), we can represent arbitrary vector fields with appropriate inducing point and kernel choices.

### 2.1. Operator-valued Kernels

The vector-valued kernel function (8) uses operator-valued kernels, which result in matrix-valued kernels $K_\theta(z, z') \in \mathbb{R}^{D \times D}$ for real-valued states $x, z$, while the kernel matrix over the inducing points becomes $K_\theta = (K(z_i, z_j))_{i,j=1}^{M} \in \mathbb{R}^{MD \times MD}$ (see Alvarez et al. (2012) for a review). The most straightforward operator-valued kernel is the identity-decomposable kernel $K_{\mathrm{dec}}(z, z') = k(z, z') \cdot I_D$, where the scalar Gaussian kernel

$$k(z, z') = \sigma_f^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(z_d - z'_d)^2}{\ell_d^2} \right) \tag{9}$$

with differential variance $\sigma_f^2$ and dimension-specific lengthscales $\ell = (\ell_1, \ldots, \ell_D)$ is expanded into a diagonal matrix of size $D \times D$. We collect the kernel parameters as $\theta = (\sigma_f, \ell)$.

We note that more complex kernels can also be considered given prior information about the underlying system characteristics. The divergence-free matrix-valued kernel induces vector fields that have zero divergence (Wahlström et al., 2013; Solin et al., 2015). Intuitively, such vector fields have no sinks or sources, and every state eventually returns to itself after a sufficient amount of time. Similarly, curl-free kernels induce curl-free vector fields that can contain sources or sinks, that is, trajectories can accelerate or decelerate. For a theoretical treatment of vector field kernels, see Narcowich and Ward (1994), Bhatia et al. (2013) and Fuselier and Wright (2017). Non-stationary vector fields can be modelled with input-dependent lengthscales (Heinonen et al., 2016), while spectral kernels can represent stationary (Wilson et al., 2013) or non-stationary (Remes et al., 2017) recurring patterns in the differential function.
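A minimal sketch of Equation (8) with the identity-decomposable kernel follows. Because $K_{\mathrm{dec}} = k \cdot I_D$, the interpolation decouples into one scalar-kernel smoother per output dimension, which the code exploits. The grid bounds, lengthscales and random inducing vectors are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f, lengthscales):
    # Scalar Gaussian (RBF) kernel with dimension-specific lengthscales, Eq. (9);
    # the identity-decomposable operator-valued kernel is K(x, x') = k(x, x') * I_D.
    diff = X1[:, None, :] / lengthscales - X2[None, :, :] / lengthscales
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def vector_field(x, Z, U, sigma_f, lengthscales, jitter=1e-6):
    # Eq. (8): f(x) ~ K(x, Z) K(Z, Z)^{-1} vec(U).
    # With the decomposable kernel this reduces to one scalar smoother per output dimension.
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + jitter * np.eye(len(Z))
    Kxz = rbf_kernel(x[None, :], Z, sigma_f, lengthscales)
    return (Kxz @ np.linalg.solve(Kzz, U)).ravel()   # (D,) differential at state x

# Example: M = 6^2 inducing locations on a fixed grid in a 2-D state space (bounds assumed).
g = np.linspace(-3.0, 3.0, 6)
Z = np.array([[a, b] for a in g for b in g])          # (M, 2) inducing locations z_m
U = np.random.randn(len(Z), 2)                        # (M, 2) inducing vectors u_m = f(z_m)
print(vector_field(np.array([0.5, -0.2]), Z, U, sigma_f=1.0, lengthscales=np.array([1.0, 1.0])))
```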
### 2.2. Joint Model

We assume a Gaussian likelihood over the observations $y_i$ and the corresponding simulated responses $x(t_i)$ of Equation (2),

$$p(Y \mid x_0, U, Z, \omega) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid x(t_i), \Omega), \tag{10}$$

where the $x(t_i)$ are forward-simulated responses using the integral Equation (2) and the differential Equation (8), and $\Omega = \operatorname{diag}(\omega_1^2, \ldots, \omega_D^2)$ collects the dimension-specific noise variances. The inducing vectors have a Gaussian process prior

$$p(U \mid Z, \theta) = \mathcal{N}(\operatorname{vec}(U) \mid 0, K_\theta(Z, Z)). \tag{11}$$

The model posterior is then

$$p(U, x_0, \theta, \omega \mid Y) \propto p(Y \mid x_0, U, \omega)\, p(U \mid \theta) =: \mathcal{L}, \tag{12}$$

where for brevity we have omitted the dependency on the locations of the inducing points $Z$, as well as the parameter hyperpriors $p(\theta)$ and $p(\omega)$, which we assume to be uniform unless there is specific domain knowledge of the priors. The model parameters are the initial state $x_0$³, the inducing vectors $U$, the noise standard deviations $\omega = (\omega_1, \ldots, \omega_D)$, and the kernel hyperparameters $\theta = (\sigma_f, \ell_1, \ldots, \ell_D)$.

³In the case of multiple time series, we use one initial state per time series.

### 2.3. Noncentral Parameterisation

We apply a latent parameterisation using the Cholesky decomposition $L_\theta L_\theta^T = K_\theta(Z, Z)$, which maps the inducing vectors to a whitened domain (Kuss and Rasmussen, 2005),

$$U = L_\theta \widetilde{U}, \qquad \widetilde{U} = L_\theta^{-1} U. \tag{13}$$

The latent variables $\widetilde{U}$ are projected onto the kernel manifold $L_\theta$ to obtain the inducing vectors $U$. This non-centered parameterisation (NCP) transforms the hierarchical posterior $\mathcal{L}$ of Equation (12) into the reparameterised form

$$p(x_0, \widetilde{U}, \theta, \omega \mid Y) \propto p(Y \mid x_0, \widetilde{U}, \omega, \theta)\, p(\widetilde{U}), \tag{14}$$

where all variables to be optimised are decoupled, with the latent inducing vectors having a standard normal prior $\widetilde{U} \sim \mathcal{N}(0, I)$. Optimising $\widetilde{U}$ and $\theta$ is now more efficient since they have independent contributions to the vector field via $U = L_\theta \widetilde{U}$. The gradients of the whitened posterior can be retrieved analytically as (Heinonen et al., 2016)

$$\nabla_{\widetilde{U}} \log \mathcal{L} = L_\theta^T \nabla_U \log \mathcal{L}. \tag{15}$$

Finally, we find a maximum a posteriori (MAP) estimate of the initial state $x_0$, the latent vector field $\widetilde{U}$, the kernel parameters $\theta$ and the noise variances $\omega$ by gradient ascent,

$$x_{0,\mathrm{MAP}},\ \widetilde{U}_{\mathrm{MAP}},\ \theta_{\mathrm{MAP}},\ \omega_{\mathrm{MAP}} = \arg\max_{x_0,\, \widetilde{U},\, \theta,\, \omega} \log \mathcal{L}, \tag{16}$$

while keeping the inducing locations $Z$ fixed on a sufficiently dense grid (see Figure 1). The partial derivatives of the posterior with respect to the noise parameters $\omega$ can be found analytically, while the derivative with respect to $\sigma_f$ is approximated with finite differences. We select the optimal lengthscales $\ell$ by cross-validation.
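To make the estimation concrete, the following hedged sketch assembles Equations (10)–(14) into a single negative log-posterior suitable for a generic minimiser such as L-BFGS. It reuses `rbf_kernel` and `vector_field` from the earlier sketch and is an illustration under assumed settings, not the authors' implementation (which optimises the positive log posterior by gradient ascent with gradients from the sensitivity equations of the next section).

```python
import numpy as np
from scipy.integrate import solve_ivp

def neg_log_posterior(x0, U_tilde, Z, Y, t_obs, omega, sigma_f, lengthscales):
    # Whitening, Eq. (13): U = L_theta @ U_tilde with L_theta L_theta^T = K_theta(Z, Z).
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + 1e-6 * np.eye(len(Z))
    L = np.linalg.cholesky(Kzz)
    U = L @ U_tilde
    # Forward-simulate the responses x(t_i) of Eq. (2) under the interpolated field of Eq. (8).
    rhs = lambda t, x: vector_field(x, Z, U, sigma_f, lengthscales)
    X = solve_ivp(rhs, (0.0, t_obs[-1]), x0, t_eval=t_obs, rtol=1e-6, atol=1e-6).y.T
    # Gaussian likelihood of Eq. (10) with diagonal noise variances omega^2, ...
    nll = 0.5 * np.sum(((Y - X) / omega) ** 2 + np.log(2.0 * np.pi * omega ** 2))
    # ... plus the standard-normal prior on the whitened inducing vectors, Eq. (14).
    return nll + 0.5 * np.sum(U_tilde ** 2)
```

Minimising this quantity corresponds to the MAP estimate of Equation (16); the practical difficulty is obtaining its gradients through the forward solver, which is exactly what the sensitivity equations below provide.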
## 3. Sensitivity Equations

The key term in the MAP gradient-ascent optimisation is the likelihood $\log p(Y \mid x_0, \widetilde{U}, \omega)$, which requires forward integration, together with its partial derivatives with respect to the whitened inducing vectors $\widetilde{U}$. Given Equation (15), we only need to compute the gradients with respect to the inducing vectors $u = \operatorname{vec}(U) \in \mathbb{R}^{MD}$,

$$\frac{d \log p(Y \mid x_0, u, \omega)}{d u} = \sum_{s} \frac{d \log \mathcal{N}(y_s \mid x(t_s, u), \Omega)}{d x} \, \frac{d x(t_s, u)}{d u}. \tag{17}$$

This requires computing the derivatives of the simulated system response $x(t, u)$ with respect to the vector field parameters $u$,

$$\frac{d x(t, u)}{d u} = S(t) \in \mathbb{R}^{D \times MD}, \tag{18}$$

which we denote elementwise by $S_{ij}(t) = \frac{\partial x(t, u)_i}{\partial u_j}$, expanding the notation to make the dependency of $x$ on $u$ explicit. Approximating these derivatives with finite differences is possible in principle, but it is highly inefficient and has been reported to cause instability (Raue et al., 2013). We instead turn to sensitivity equations for $u$ and $x_0$ that provide computationally efficient, analytical gradients $S(t)$ (Kokotovic and Heller, 1967; Fröhlich et al., 2017).

The solution for $\frac{d x(t, u)}{d u}$ can be derived by differentiating the full nonparametric ODE system with respect to $u$,

$$\frac{d}{d u} \frac{d x(t, u)}{d t} = \frac{d}{d u} f(x(t, u)). \tag{19}$$

The sensitivity equation for the given system is obtained by changing the order of differentiation on the left-hand side and carrying out the differentiation on the right-hand side. The resulting sensitivity equation can then be expressed in the form

$$\underbrace{\frac{d}{d t} \frac{d x(t, u)}{d u}}_{\dot{S}(t)} = \underbrace{\frac{\partial f(x(t, u))}{\partial x}}_{J(t)} \, \underbrace{\frac{d x(t, u)}{d u}}_{S(t)} + \underbrace{\frac{\partial f(x(t, u))}{\partial u}}_{R(t)}, \tag{20}$$

where $J(t) \in \mathbb{R}^{D \times D}$ and $R(t), S(t) \in \mathbb{R}^{D \times MD}$ (see the Supplement for a detailed specification). For our nonparametric ODE system the sensitivity equation is fully determined by

$$J(t) = \frac{\partial K(x, Z)}{\partial x} K(Z, Z)^{-1} u, \tag{21}$$
$$R(t) = K(x, Z) K(Z, Z)^{-1}. \tag{22}$$

The sensitivity equation provides an additional ODE system which describes the time evolution of the derivatives $S(t)$ with respect to the inducing vectors. The sensitivities are coupled with the actual ODE system, and thus both systems $x(t)$ and $S(t)$ are concatenated into a new augmented state that is solved jointly by Equation (2), driven by the differentials $\dot{x}(t)$ and $\dot{S}(t)$ (Leis and Kramer, 1988). The initial sensitivities are computed as $S(0) = \frac{d x_0}{d u}$. In our implementation, we merge $x_0$ with $u$ for the sensitivity analysis in order to obtain the partial derivatives with respect to the initial state, which is estimated along with the other parameters. We use the CVODES solver from the SUNDIALS package (Hindmarsh et al., 2005) to solve the nonparametric ODE models and the corresponding gradients numerically. The sensitivity-equation approach is superior to the finite-difference approximation because we obtain an exact formulation for the gradients of the state with respect to the inducing points, which can be solved up to the numerical accuracy of the ODE solver.
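As a hedged sketch of Equations (20)–(22) for the decomposable RBF field of Equation (8) (reusing `rbf_kernel` from the earlier sketch; this is an illustration rather than the CVODES-based implementation), the augmented right-hand side below evolves the state $x(t)$ jointly with the sensitivities $S(t) = dx(t)/dU$:

```python
import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, state, Z, U, sigma_f, lengthscales, jitter=1e-6):
    # Augmented ODE of Section 3: integrate x(t) together with S(t) = dx(t)/dU
    # for the decomposable RBF vector field of Eq. (8).
    M, D = U.shape
    x = state[:D]
    S = state[D:].reshape(D, M, D)                 # S[d, m, e] = d x_d / d U_{m, e}
    Kzz = rbf_kernel(Z, Z, sigma_f, lengthscales) + jitter * np.eye(M)
    A = np.linalg.solve(Kzz, U)                    # (M, D) weights K(Z, Z)^{-1} U
    kxz = rbf_kernel(x[None, :], Z, sigma_f, lengthscales)[0]      # (M,)
    dxdt = kxz @ A                                 # f(x) = K(x, Z) K(Z, Z)^{-1} U
    # J(t) = df/dx, Eq. (21), using dk(x, z_m)/dx_j = -k(x, z_m)(x_j - z_mj)/l_j^2.
    dk_dx = -kxz[:, None] * (x[None, :] - Z) / lengthscales**2     # (M, D)
    J = A.T @ dk_dx                                # (D, D)
    # R(t) = df/dU, Eq. (22): output d depends only on column d of U through weights w.
    w = np.linalg.solve(Kzz, kxz)                  # (M,) = K(Z, Z)^{-1} K(Z, x)
    R = np.zeros((D, M, D))
    R[np.arange(D), :, np.arange(D)] = w           # R[d, :, d] = w
    dSdt = np.einsum('dj,jme->dme', J, S) + R      # S_dot = J S + R, Eq. (20)
    return np.concatenate([dxdt, dSdt.ravel()])
```

Starting from the augmented initial state `np.concatenate([x0, np.zeros(D * M * D)])` (since $S(0) = dx_0/du = 0$ when $x_0$ is held fixed), a single `solve_ivp` call then yields both the responses $x(t_i)$ and the exact gradients needed in Equation (17); merging $x_0$ into the parameter vector, as the authors do, simply adds $D$ further sensitivity columns.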
## 4. Simple Simulated Dynamics

As a first illustration of the proposed nonparametric ODE method we consider three simulated differential systems: the Van der Pol (VDP), FitzHugh-Nagumo (FHN) and Lotka-Volterra (LV) oscillators of the form

$$\mathrm{VDP}: \quad \dot{x}_1 = x_2, \qquad \dot{x}_2 = (1 - x_1^2)\, x_2 - x_1$$
$$\mathrm{FHN}: \quad \dot{x}_1 = 3\Big(x_1 - \frac{x_1^3}{3} + x_2\Big), \qquad \dot{x}_2 = 0.2 - 3 x_1 - 0.2\, x_2$$
$$\mathrm{LV}: \quad \dot{x}_1 = 1.5\, x_1 - x_1 x_2, \qquad \dot{x}_2 = -3 x_2 + x_1 x_2.$$

In the conventional ODE case the coefficients of these equations can be inferred with standard statistical techniques if a sufficient amount of time series data is available (Girolami, 2008; Raue et al., 2013). Our main goal is to infer unknown dynamics, that is, the case where these equations are unavailable and we instead represent the dynamics with the nonparametric vector field of Equation (8). We use these simulated models only to illustrate the model behaviour against the true dynamics. We employ 25 data points from one cycle of noisy observations for the VDP and FHN models, and 25 data points from 1.7 cycles for the LV model, with a noise variance of $\sigma_n^2 = 0.1^2$. We learn the npODE model with five training sequences using $M = 6^2$ inducing locations on a fixed grid, and forecast between 4 and 8 future cycles starting from the true initial state $x_0$ at time 0. Training takes approximately 100 seconds per oscillator.

Figure 2. Estimated dynamics from the Van der Pol, FitzHugh-Nagumo and Lotka-Volterra systems. The top part (a–c) shows the learned vector field (grey arrows) against the true vector field (black arrows). The bottom part (d–f) shows the training data (grey region points) and the forecasted future cycle likelihoods with the learned model (shaded region) against the true trajectory (black line).

Figure 2 (bottom) shows the training datasets (grey regions), initial states, true trajectories (black lines) and the forecasted trajectory likelihoods (coloured regions). The model accurately learns the dynamics from less than two cycles of data and can reproduce them reliably into the future. Figure 2 (top) shows the corresponding true vector field (black arrows) and the estimated vector field (grey arrows). The vector field is a continuous function, which is plotted on an 8×8 grid for visualisation. In general, the most difficult parts of the system to learn are the middle of the loop (as seen in the FHN model) and the outermost regions (bottom left in the LV model). The model learns the underlying differential $f(x)$ accurately close to the observed points, while making only a few errors in the border regions with no data.

## 5. Unknown System Estimation

Next, we illustrate how the model estimates realistic, unknown dynamics from noisy observations $y(t_1), \ldots, y(t_N)$. As in Section 4, we make no assumptions on the structure or form of the underlying system, and capture the underlying dynamics with the nonparametric system alone. We employ no subjective priors, and assume no inputs, controls or other sources of information. The task is to infer the underlying dynamics $f(x)$, and to interpolate or extrapolate the state trajectory outside the observed data.

We use a benchmark dataset of human motion capture data from the Carnegie Mellon University motion capture (CMU mocap) database. Our dataset contains 50-dimensional pose measurements $y(t_i)$ from humans walking, where each pose dimension records a measurement from a different part of the body during movement (Wang et al., 2008). We apply the preprocessing of Wang et al. (2008) by downsampling the datasets by a factor of four and centering the data. This results in a total of 4303 datapoints spread across 43 trajectories with on average 100 frames per trajectory. In order to tackle the problem of dimensionality, we project the original dataset with PCA to a three-dimensional latent space where the system is specified, following Damianou et al. (2011) and Wang et al. (2006). We place $M = 5^3$ inducing vectors on a fixed grid, and optimise our model starting from 100 different initial values, which we set by perturbing the projected empirical differences $y(t_i) - y(t_{i-1})$ assigned to the inducing vectors. We use an L-BFGS optimiser in MATLAB. The whole inference takes approximately a few minutes per trajectory.

We evaluate the method with two types of experiments: imputing missing values and forecasting future cycles. For forecasting, the first half of the trajectory is reserved for model training and the second half is to be forecasted. For imputation, we remove roughly 20% of the frames from the middle of the trajectory, which are to be filled in by the models. We perform model selection for the lengthscales $\ell$ with a cross-validation split of 80/20. In both cases we record the root mean square error (RMSE) over the test points in the original feature space, where we reconstruct the original dimensions from the latent space trajectories.
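The preprocessing and initialisation just described can be sketched as follows. This is a hypothetical illustration of the pipeline, not the authors' MATLAB code: `Y_pose` denotes one centred 50-dimensional mocap trajectory, and the perturbation scale and the nearest-neighbour assignment of empirical differences to inducing locations are assumed details not specified in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def prepare_latent_problem(Y_pose, n_grid=5, n_restarts=100, perturb_scale=0.1):
    # Project the centred 50-D pose trajectory to a 3-D latent space (Section 5).
    pca = PCA(n_components=3)
    X_latent = pca.fit_transform(Y_pose)                       # (N, 3) latent states
    # M = 5^3 inducing locations on a fixed grid covering the latent trajectory.
    axes = [np.linspace(X_latent[:, d].min(), X_latent[:, d].max(), n_grid) for d in range(3)]
    Z = np.stack(np.meshgrid(*axes, indexing='ij'), axis=-1).reshape(-1, 3)   # (125, 3)
    # Initialise inducing vectors from perturbed empirical differences y(t_i) - y(t_{i-1}),
    # copied to each inducing location from its nearest latent observation (assumed scheme).
    diffs = np.diff(X_latent, axis=0)
    nearest = np.argmin(np.linalg.norm(Z[:, None, :] - X_latent[None, :-1, :], axis=-1), axis=1)
    U_inits = [diffs[nearest] + perturb_scale * np.random.randn(len(Z), 3)
               for _ in range(n_restarts)]
    return pca, X_latent, Z, U_inits
```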
Due to the current lack of ODE methods suitable for this nonparametric inference task, we instead compare our method to state-of-the-art state-space models, with which such problems have previously been considered (Wang et al., 2008). In a state-space or dynamical model, a transition function $x(t_{k+1}) = g(x(t_k))$ moves the system forward in discrete steps. With a sufficiently high sampling rate, such models can estimate and forecast finite approximations of smooth dynamics. In the Gaussian process dynamical model (Wang et al., 2006; Frigola et al., 2014; Svensson et al., 2016) a GP transition function is learned in a latent space, which can be inferred with a standard GPLVM (Lawrence, 2004) or with a dependent GPLVM (Zhao and Sun, 2016). In dynamical systems the transition function is replaced by a GP interpolation (Damianou et al., 2011). The discrete-time state-space models emphasise inference of a low-dimensional manifold as an explanation of the high-dimensional measurement trajectories. We compare our method to the dynamical model GPDM of Wang et al. (2006) and to the dynamical system VGPLVM of Damianou et al. (2011), where we directly apply the implementations provided by the authors at inverseprobability.com/vargplvm and dgp.toronto.edu/~jmwang/gpdm. Both methods optimise their latent spaces separately, and they are thus not directly comparable.

### 5.1. Forecasting

In the forecasting task we train all models with the first half of the trajectory, while forecasting the second half starting from the first frame. The models are trained and forecasted within a low-dimensional space, and the results are subsequently projected back into the original space by inverting the PCA or with GPLVM mean predictions. As all methods optimise their latent spaces separately, they are not directly comparable; the mean errors are therefore computed in the original high-dimensional space. Note that the low-dimensional representation necessarily causes some reconstruction error.

Table 1. Means and standard deviations of RMSEs over the 43 datasets in the forecasting and imputation experiments.

| MODEL  | FORECASTING | IMPUTATION  |
|--------|-------------|-------------|
| NPODE  | 4.52 ± 2.31 | 3.94 ± 3.50 |
| GPDM   | 4.94 ± 3.3  | 5.31 ± 3.39 |
| VGPLVM | 8.74 ± 3.43 | 3.91 ± 1.80 |

Figure 3. Forecasting 50 future frames after 49 frames of training data of the human motion dataset 35_12.amc. (a) The estimated locations of the trajectory in the latent space (black points) and the future forecast (coloured lines). (b) The original features reconstructed from the latent predictions, with the grey region showing the training data.

Figure 3 illustrates the models on one of the trajectories, 35_12.amc. The top part (a) shows the training data in the PCA space for npODE, and the optimised training data representations for GPDM and VGPLVM (black points). The coloured lines (npODE) and points (GPDM, VGPLVM) indicate the future forecast. The bottom part (b) shows the first 9 original pose dimensions reconstructed from the latent forecasted trajectories. The training data is shown with a grey background, while test data is shown with circles. VGPLVM has the most trouble forecasting future points: shortly after the training data it reverts to a value close to zero and fails to predict future points. The GPDM model produces more realistic trajectories, but fails to predict any of the poses accurately. Finally, npODE accurately predicts five of the poses and retains adequate performance on the remaining poses, except for pose 2. Furthermore, Table 1 indicates that npODE is also the best performing method on average over the whole dataset in the forecasting task.
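As a hedged sketch of this evaluation protocol (the names `pca`, `vector_field` and the kernel parameters refer to the earlier illustrative sketches, and the model is assumed to have already been fitted on the training half; the published numbers come from the authors' own pipeline), the forecasting error is obtained by simulating the learned latent field forward over the whole trajectory and measuring the RMSE of the test half after mapping back to the 50-dimensional pose space:

```python
import numpy as np
from scipy.integrate import solve_ivp

def forecast_rmse(pca, X_latent, Y_pose, t, Z, U, sigma_f, lengthscales):
    # Train on the first half of the trajectory, forecast the second half (Section 5.1).
    n_train = len(t) // 2
    rhs = lambda _t, x: vector_field(x, Z, U, sigma_f, lengthscales)
    sol = solve_ivp(rhs, (t[0], t[-1]), X_latent[0], t_eval=t)
    # Map the latent forecast back to the original 50-D pose space by inverting the PCA.
    Y_hat = pca.inverse_transform(sol.y.T)
    test = slice(n_train, None)
    return np.sqrt(np.mean((Y_pose[test] - Y_hat[test]) ** 2))
```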
### 5.2. Imputation

In the imputation task we remove approximately 20% of the training data from the middle of the trajectory. The goals are to learn a model with the remaining data and to predict the missing values. Figure 4 highlights the performance of the three models on the trajectory 07_07.amc. The top part (a) shows the training data (black points) in the PCA space (npODE) or the optimised training locations in the latent space (GPDM, VGPLVM). The imputed middle part is shown with coloured points or lines. Interestingly, both npODE and GPDM operate on cyclic representations, while VGPLVM does not.

Figure 4. Imputation of 17 missing frames in the middle of a 94-frame trajectory of the human motion dataset 07_07.amc (subsampled to every fourth frame). (a) The estimated locations of the missing points in the latent space are coloured. (b) The original features reconstructed from the latent trajectory.

The bottom panel (b) of Figure 4 shows the first 9 reconstructed pose dimensions from the three models. The missing values are shown with circles, while training points are shown with black dots. All models can accurately reproduce the overall trends, while npODE seems to fit slightly worse than the other methods. The PCA projection causes the seemingly perfect fit of the npODE prediction (at the top) to lead to slightly warped reconstructions (at the bottom). All methods fit most of the missing parts as well. Table 1 shows that on average npODE and VGPLVM have approximately equal top performance on the missing value imputation task.

## 6. Discussion

We proposed a nonparametric ODE modelling framework that can accurately learn arbitrary, nonlinear continuous-time dynamics from purely observational data without making assumptions about the underlying system dynamics. We demonstrated that the model excels at learning dynamics that can be forecasted into the future. We consider this work the first in a line of studies of nonparametric ODE systems, and foresee several directions for future work. Currently we do not handle non-stationary vector fields, that is, time-dependent differentials $f_t(x)$. Furthermore, an interesting future avenue is the study of various vector field kernels, such as divergence-free, curl-free or spectral kernels (Remes et al., 2017). Finally, including inputs or controls in the system would allow precise modelling in interactive settings, such as robotics.

The proposed nonparametric ODE model operates along a continuous-time trajectory, while dynamic models such as hidden Markov models or state-space models are restricted to discrete time steps. Such models are unable to consider the system state at arbitrary times, for instance between two successive time points. Conventional ODE models have also been considered from the stochastic perspective with stochastic differential equation (SDE) models, which commonly model the deterministic system drift and the diffusion process separately, leading to a distribution over trajectories $p(x(t))$ (Archambeau et al., 2007; García et al., 2017). As future work we will consider stochastic extensions of our nonparametric ODE model, as well as MCMC sampling of the inducing point posterior $p(U \mid Y)$, which would likewise lead to a distribution over trajectories.

Acknowledgements. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, and Academy of Finland grants no. 284597, 311584, 313271 and 299915.

## References

Tarmo Äijö, Kirsi Granberg, and Harri Lähdesmäki. Sorad: a systems biology approach to predict and modulate dynamic signaling pathway response from phosphoproteome time-course measurements. Bioinformatics, 29(10):1283–1291, 2013.
M. Alvarez, L. Rosasco, and N. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 2012.
C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. In Gaussian Processes in Practice, 2007.
H. Bhatia, G. Norgard, V. Pascucci, and P. Bremer. The Helmholtz-Hodge decomposition - a survey. IEEE Transactions on Visualization and Computer Graphics, 2013.
J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 2016.
B. Calderhead, M. Girolami, and N. Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. NIPS, 2008.
Andreas Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2510–2518, 2011.
Michiel J. L. De Hoon, Seiya Imoto, Kazuo Kobayashi, Naotake Ogasawara, and Satoru Miyano. Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. In Biocomputing 2003, pages 17–28. World Scientific, 2002.
F. Dondelinger, M. Filippone, S. Rogers, and D. Husmeier. ODE parameter inference using adaptive gradient matching with Gaussian processes. JMLR, 2013.
S. P. Ellner, Y. Seifu, and R. H. Smith. Fitting population dynamic models to time-series data by gradient matching. Ecology, 83(8):2256–2270, 2002.
Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680–3688, 2014.
Fabian Fröhlich, Barbara Kaltenbacher, Fabian J. Theis, and Jan Hasenauer. Scalable parameter estimation for genome-scale biochemical reaction networks. PLOS Computational Biology, 13(1):1–18, 2017. doi:10.1371/journal.pcbi.1005331.
E. Fuselier and G. Wright. A radial basis function method for computing Helmholtz-Hodge decompositions. IMA Journal of Numerical Analysis, 2017.
Pei Gao, Antti Honkela, Magnus Rattray, and Neil D. Lawrence. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics, 24(16):i70–i75, 2008.
C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez. Nonparametric estimation of stochastic differential equations with sparse Gaussian processes. Physical Review E, 96(2):022104, 2017.
Mark Girolami. Bayesian inference for differential equations. Theoretical Computer Science, 408(1):4–16, 2008.
T. Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. ICML, 2003.
M. Heinonen and F. d'Alché-Buc. Learning nonparametric differential equations with operator-valued kernels and gradient matching. arXiv, Télécom ParisTech, 2014.
M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In AISTATS, volume 51, pages 732–740, 2016.
J. Henderson and G. Michailidis. Network reconstruction using nonparametric additive ODE models. PLOS ONE, 2014.
Alan C. Hindmarsh, Peter N. Brown, Keith E. Grant, Steven L. Lee, Radu Serban, Dan E. Shumaker, and Carol S. Woodward. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software, 31(3):363–396, 2005.
M. Hirsch, S. Smale, and R. Devaney. Differential Equations, Dynamical Systems, and an Introduction to Chaos (2nd edition). Elsevier Science & Technology Books, 2004.
P. Kokotovic and J. Heller. Direct and adjoint sensitivity equations for parameter optimization. IEEE Transactions on Automatic Control, 12(5):609–610, 1967.
Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6(Oct):1679–1704, 2005.
Neil D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
Jorge R. Leis and Mark A. Kramer. The simultaneous solution and sensitivity analysis of systems described by ordinary differential equations. ACM Transactions on Mathematical Software, 14(1), 1988.
Benn Macdonald. Statistical inference for ordinary differential equations using gradient matching. PhD thesis, University of Glasgow, 2017.
Benn Macdonald, Catherine Higham, and Dirk Husmeier. Controversy in mechanistic modelling with Gaussian processes. In International Conference on Machine Learning, pages 1539–1547, 2015.
Francis J. Narcowich and Joseph D. Ward. Generalized Hermite interpolation via matrix-valued conditionally positive definite functions. Mathematics of Computation, 63(208):661–687, 1994.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics, 335:736–746, 2017.
J. Ramsay, G. Hooker, D. Campbell, and J. Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B, 69:741–796, 2007.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Andreas Raue, Marcel Schilling, Julie Bachmann, Andrew Matteson, Max Schelker, Daniel Kaschek, Sabine Hug, Clemens Kreutz, Brian D. Harms, Fabian J. Theis, Ursula Klingmüller, and Jens Timmer. Lessons learned from quantitative dynamical modeling in systems biology. PLOS ONE, 8(9):1–17, 2013.
S. Remes, M. Heinonen, and S. Kaski. Non-stationary spectral kernels. NIPS, 2017.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
A. Solin, M. Kok, N. Wahlström, T. Schön, and S. Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. arXiv:1509.04634, 2015.
Andreas Svensson, Arno Solin, Simo Särkkä, and Thomas Schön. Computationally efficient Bayesian learning of Gaussian process state space models. In Artificial Intelligence and Statistics, pages 213–221, 2016.
J. M. Varah. A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3(1):28–46, 1982.
N. Wahlström, M. Kok, and T. Schön. Modeling magnetic fields using Gaussian processes. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Jack Wang, Aaron Hertzmann, and David M. Blei. Gaussian process dynamical models. In Advances in Neural Information Processing Systems, pages 1441–1448, 2006.
Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
Y. Wang and D. Barber. Gaussian processes for Bayesian estimation in ordinary differential equations. ICML, 2014.
A. Wilson, E. Gilboa, A. Nehorai, and J. Cunningham. Fast multidimensional pattern extrapolation with Gaussian processes. AISTATS, 2013.
Jing Zhao and Shiliang Sun. Variational dependent multi-output Gaussian process dynamical systems. The Journal of Machine Learning Research, 17(1):4134–4169, 2016.
T. Äijö and H. Lähdesmäki. Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics, 25:2937–2944, 2009.