# Stochastic Differential Equations with Variational Wishart Diffusions

Martin Jørgensen 1, Marc Peter Deisenroth 2, Hugh Salimbeni 3

1 Department for Mathematics and Computer Science, Technical University of Denmark. 2 Department of Computer Science, University College London. 3 G-Research. Correspondence to: Martin Jørgensen.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We present a Bayesian non-parametric way of inferring stochastic differential equations for both regression tasks and continuous-time dynamical modelling. The work places high emphasis on the stochastic part of the differential equation, also known as the diffusion, and on modelling it by means of Wishart processes. Further, we present a semiparametric approach that allows the framework to scale to high dimensions. This leads us to a way of modelling both latent and autoregressive temporal systems with conditional heteroskedastic noise. We provide experimental evidence that modelling the diffusion often improves performance and that this randomness in the differential equation can be essential to avoid overfitting.

1. Introduction

A popular assumption to make when modelling multivariate phenomena with Gaussian processes (GPs) is that of independence between processes, i.e. every dimension of a multivariate phenomenon is modelled independently. Consider the case of a two-dimensional temporal process $x_t$ evolving as

$x_t := f(x_{t-1}) + \epsilon_t$,  (1)

where $f(x_{t-1}) = (f_1(x_{t-1}), f_2(x_{t-1}))^\top$, $f_1$ and $f_2$ are independent, and $\epsilon_t \sim \mathcal{N}(0, \sigma^2 I)$. This model is commonly used in the machine learning community and is easy to use and understand, but for many real-world cases the noise term is too simplistic. In this paper, we investigate the noise term $\epsilon_t$ and also make it dependent on the state $x_{t-1}$; this is also known as heteroskedastic noise. We refer to the sequence of $\epsilon_t$ as the diffusion or process noise.
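As a toy illustration of such state-dependent (heteroskedastic) process noise, here is a minimal numpy sketch; the drift and noise-scale functions are arbitrary placeholders for illustration and not the model developed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, x = 200, np.zeros(2)
xs = [x]
for t in range(T):
    drift = 0.9 * x                              # placeholder for f(x_{t-1})
    scale = 0.1 + 0.5 * np.tanh(np.abs(x))       # noise scale depends on the state
    x = drift + scale * rng.standard_normal(2)   # heteroskedastic epsilon_t
    xs.append(x)
xs = np.stack(xs)                                # (T+1, 2) sample path
```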
(2014); Wilson & Ghahramani (2010); Heaukulani & van der Wilk (2019). For regression tasks, our model is closely related to several recent works exploring continuous-time deep neural networks (E, 2017; Haber & Ruthotto, 2017; Chen et al., 2018). Here the notion of depth is no longer a discrete quantity (i.e. the number of hidden layers), but an interval on which a continuous flow is defined. In this view, continuous-time learning takes residual networks (He et al., 2016) to their infinite limit, while remaining computationally feasible. The flow, parameterized by a differential equation, allows for time-series modelling, even with temporal observations that are not equidistant. This line of work has been extended with stochastic equivalents (Twomey et al., 2019; Tzen & Raginsky, 2019; Liu et al., 2019; Li et al., 2020), and the work by Stochastic Differential Equations with Variational Wishart Diffusions Andreas & Kandemir (2019), who model the drift and diffusion of an SDE with Bayesian neural networks. These approaches make the framework more robust, as the original approach can fail even on simple tasks (Dupont et al., 2019). The work that inspired our model most was by Hegde et al. (2019). They model the random field that defines the SDE with a Gaussian field. They consider regression and classification problems. To this end, they can take deep GPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) to their infinite limit while avoiding their degeneracy discussed by Duvenaud et al. (2014). Our main focus throughout this paper lies on the stochasticity of the flow, what impact it has and to which degree it can be tamed or manipulated to improve overall performance. Contributions: A model that unifies theory from conditional heteroskedastic dynamics, stochastic differential equations (SDEs) and regression. We show how to perform variational inference in this model. A scalable approach to extend the methods to highdimensional input without compromising with interdimensional independence assumptions. 2. Background In this section, we give an overview of the relevant material on GPs, Wishart processes, and SDEs. 2.1. Gaussian Processes A Gaussian process (GP) is a distribution over functions f : Rd RD, satisfying that for any finite set of points X := x1, . . . , x N RN d, the outputs f(x1), . . . , f(x N) RN D are jointly Gaussian distributed. A GP is fully determined by a mean function µ : Rd RD and a covariance function c : Rd Rd RD D. This notation is slightly unorthodox, and we will elaborate. The usual convention when dealing with multi-output GPs (i.e. D > 1) is to assume D i.i.d. processes that share the same covariance function (Álvarez & Lawrence, 2011), which equivalently can be done by choosing the covariance matrix K = k(X, X) ID, where denotes the Kronecker product and k is a covariance function for univariate output. For ease of notation we shall use k D(a, b) := k(a, b) ID; that is, k(a, b) returns a kernel matrix of dimension number of rows in a times the number of rows in b. This corresponds to the assumption of independence between output dimensions. Furthermore, we write f := f(X), µ := vec(µ(X)) and denote by K the ND ND-matrix with Ki,j = k D(xi, xj). Then we can write in short p(f) = N(µ, K). As the number N of training data points gets large, the size of K becomes a challenge as well, due to a required inversion during training/prediction. To circumvent this, we consider sparse (or low-rank) GP methods. 
As the number $N$ of training data points gets large, the size of $K$ becomes a challenge as well, due to a required inversion during training and prediction. To circumvent this, we consider sparse (or low-rank) GP methods. In this respect, we choose $M$ auxiliary inducing locations $Z = \{z_1, \ldots, z_M\} \in \mathbb{R}^{M \times d}$, and define their function values $\mathbf{u} := f(Z) \in \mathbb{R}^{M \times D}$. Since any finite set of function values is jointly Gaussian, $p(\mathbf{f}, \mathbf{u})$ is Gaussian as well, and we can write $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$, where $p(\mathbf{f}|\mathbf{u}) = \mathcal{N}(\tilde{\mu}, \tilde{K})$ with

$\tilde{\mu} = \boldsymbol{\mu} + \alpha\,\mathrm{vec}(\mathbf{u} - \mu(Z))$,  (2)

$\tilde{K} = k_D(X, X) - \alpha\, k_D(Z, Z)\,\alpha^\top$,  (3)

where $\alpha = k_D(X, Z)\,k_D(Z, Z)^{-1}$. Here it becomes evident why this is computationally attractive: we only have to deal with the inversion of $k_D(Z, Z)$, which, due to its structure, only requires inversion of $k(Z, Z)$ of size $M \times M$. This is opposed to a matrix of size $ND \times ND$, had we not used the low-rank approximation and the independence between the GPs.

We will use variational inference to marginalise $\mathbf{u}$ (Titsias, 2009). Throughout the paper, we choose our variational posterior to be $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$, where $q(\mathbf{u}) := \mathcal{N}(m, S)$, similar to Hensman et al. (2013). Further, $q$ factorises over the dimensions, i.e. $q(\mathbf{u}) = \prod_{j=1}^{D} \mathcal{N}(m_j, S_j)$, where $m = (m_1, \ldots, m_D)$ and $S$ is a block-diagonal $MD \times MD$ matrix with block-diagonal entries $\{S_j\}_{j=1}^{D}$. In this case, we can analytically marginalise $\mathbf{u}$ in (2) to obtain

$q(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u})\,q(\mathbf{u})\,\mathrm{d}\mathbf{u} = \mathcal{N}(\mu^q_f, K^q_f)$,  (4)

$\mu^q_f = \boldsymbol{\mu} + \alpha\,\mathrm{vec}(m - \mu(Z))$,  (5)

$K^q_f = k_D(X, X) - \alpha\big(k_D(Z, Z) - S\big)\alpha^\top$,  (6)

which resembles (2)–(3), but is analytically tractable given the variational parameters $\{m, S, Z\}$.

Recall that a vector field is a mapping $f : \mathbb{R}^d \to \mathbb{R}^D$ that associates a point in $\mathbb{R}^d$ with a vector in $\mathbb{R}^D$. A Gaussian (random) field is a vector field such that, for any finite collection of points $\{x_i\}_{i=1}^{N}$, the associated vectors in $\mathbb{R}^D$ are jointly Gaussian distributed, i.e. a Gaussian field is a GP. We shall use both terminologies, but when we refer to a Gaussian field, we will think of the outputs as having a direction.

2.2. Wishart Processes

The Wishart distribution is a distribution over symmetric, positive semi-definite matrices. It is the multidimensional generalisation of the $\chi^2$-distribution. Suppose $F_v$ is a $D$-variate Gaussian vector, independently for each $v = 1, \ldots, \nu$, say $F_v \sim \mathcal{N}(0, A)$. Then $\Sigma = \sum_{v=1}^{\nu} F_v F_v^\top$ is Wishart distributed with $\nu$ degrees of freedom and scale matrix $A$; we write for short $\Sigma \sim \mathcal{W}_D(A, \nu)$. By Bartlett's decomposition (Kshirsagar, 1959), this can also be represented as $\Sigma = LFF^\top L^\top$, where $F$ is a $D \times \nu$ matrix with all entries unit Gaussian and $A = LL^\top$. With this parametrisation we define Wishart processes, as in Wilson & Ghahramani (2010):

Definition 1. Let $L$ be a $D \times D$ matrix such that $LL^\top$ is positive semi-definite, and let $f_{d,v} \sim \mathcal{GP}\big(0, k_{d,v}(x, x')\big)$ independently for every $d = 1, \ldots, D$ and $v = 1, \ldots, \nu$, where $\nu \geq D$. Write $f_v(x) = (f_{1,v}(x), \ldots, f_{D,v}(x))^\top$. Then if

$\Sigma(x) = L\Big(\sum_{v=1}^{\nu} f_v(x) f_v(x)^\top\Big) L^\top$  (7)

is Wishart distributed for any marginal $x$, and if for any finite collection of points $X = \{x_i\}_{i=1}^{N}$ the joint distribution $\Sigma(X)$ is determined through the covariance functions $k_{d,v}$, then $\Sigma(\cdot)$ is a Wishart process. We write

$\Sigma \sim \mathcal{WP}_D(LL^\top, \nu, \kappa)$,  (8)

where $\kappa$ is the collection of covariance functions $\{k_{d,v}\}$.

If $\Sigma$ follows a Wishart distribution with $\nu$ degrees of freedom and a scale matrix $LL^\top$ of size $D \times D$, then for some $\rho \times D$ matrix $R$ of rank $\rho$ we have that $R\Sigma R^\top \sim \mathcal{W}_\rho(RLL^\top R^\top, \nu)$. That is, $R\Sigma R^\top$ is Wishart distributed on the space of $\rho \times \rho$ symmetric, positive semi-definite matrices. The Wishart distribution is closely related to the Gaussian distribution in a Bayesian framework, as it is the conjugate prior to the precision matrix of a multivariate Gaussian. Furthermore, it is the distribution of the maximum likelihood estimator of the covariance matrix. The name Wishart process is a slight misnomer, as the posterior processes are not marginally Wishart. This is due to the mean function not being constant 0, and a more accurate name could be matrix-Gamma processes. We shall nevertheless not refrain from the usual terminology: a Wishart process is a stochastic process whose prior is a Wishart process.
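To make the construction in Definition 1 concrete, the following sketch draws a single marginal of a Wishart process: at one input $x$, each entry $f_{d,v}(x)$ is (for a unit-variance kernel) a standard normal, and $\Sigma(x) = LF(x)F(x)^\top L^\top$. The toy scale matrix is an assumption for illustration; across several inputs the entries of $F$ would instead be drawn jointly from the GPs sharing the kernels $k_{d,v}$.

```python
import numpy as np

D, nu = 3, 3                      # matrix dimension and degrees of freedom (nu >= D)
A = np.array([[2.0, 0.5, 0.0],    # toy scale matrix LL^T (an assumption)
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
L = np.linalg.cholesky(A)

# Marginal draw of the D x nu matrix F(x): unit-Gaussian entries.
F = np.random.randn(D, nu)

Sigma = L @ F @ F.T @ L.T         # Sigma(x) = L F(x) F(x)^T L^T  ~  W_D(LL^T, nu)
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-10)   # positive semi-definite
```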
2.3. Stochastic Differential Equations

We consider SDEs of the form

$\mathrm{d}x_t = \mu(x_t)\,\mathrm{d}t + \sqrt{\Sigma(x_t)}\,\mathrm{d}B_t$,  (9)

where the last term on the right-hand side is an Itô integral (Itô, 1946). The solution $x_t$ is a stochastic process, often referred to as a diffusion process, and $\mu$ and $\Sigma$ are the drift and diffusion coefficients, respectively. In (9), $B_t$ denotes the Brownian motion. The Brownian motion is the GP satisfying that all increments are independent, in the sense that, for $0 \leq s_1 < t_1 \leq s_2 < t_2$, the increment $B_{t_1} - B_{s_1}$ is independent of $B_{t_2} - B_{s_2}$. Further, any increment has distribution $B_t - B_s \sim \mathcal{N}(0, (t - s)I)$, and $B_0 = 0$. This is equivalent to the GP with constant mean function 0 and covariance function $(t, s) \mapsto \min\{s, t\}$ (Rasmussen & Williams, 2006). Given some initial condition (e.g. $x_0 = 0$), we can generate sample paths $[0, T] \to \mathbb{R}^D$ by the Euler–Maruyama method (Kloeden & Platen, 2013), which finely discretises the temporal dimension $0 = t_0 < t_1 < \ldots < t_l = T$ and pushes $x_{t_i}$ along the vector field:

$x_{t_{i+1}} = x_{t_i} + \mu(x_{t_i})\,\Delta_i + \sqrt{\Sigma(x_{t_i})}\,\sqrt{\Delta_i}\,\mathcal{N}$,

where $\mathcal{N} \sim \mathcal{N}(0, I_D)$ and $\Delta_i = t_{i+1} - t_i$.
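A minimal numpy sketch of this Euler–Maruyama scheme; the drift and diffusion functions used in the example are toy placeholders.

```python
import numpy as np

def euler_maruyama(x0, mu, sigma_sqrt, T=1.0, n_steps=100, rng=None):
    """Simulate x_{t_{i+1}} = x_{t_i} + mu(x_{t_i}) dt + sqrt(Sigma(x_{t_i})) sqrt(dt) N.

    mu(x) returns the drift, shape (D,); sigma_sqrt(x) returns a matrix square
    root of the diffusion Sigma(x), shape (D, D)."""
    rng = np.random.default_rng() if rng is None else rng
    x, dt = np.array(x0, dtype=float), T / n_steps
    for _ in range(n_steps):
        noise = sigma_sqrt(x) @ rng.standard_normal(x.shape)
        x = x + mu(x) * dt + np.sqrt(dt) * noise
    return x

# Toy usage: Ornstein-Uhlenbeck-like drift with a constant diagonal diffusion.
D = 2
x_T = euler_maruyama(np.zeros(D), mu=lambda x: -x, sigma_sqrt=lambda x: 0.1 * np.eye(D))
```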
3. Model and Variational Inference

We consider a random field $f : \mathbb{R}^D \times [0, T] \to \mathbb{R}^D$ and a GP $g : \mathbb{R}^D \to \mathbb{R}^\eta$. Their priors are

$f \sim \mathcal{GP}\big(0, k_f(\cdot, \cdot) \otimes I_D\big), \qquad g \sim \mathcal{GP}\big(0, k_g(\cdot, \cdot) \otimes I_\eta\big)$.  (10)

We also have a Wishart process $\Sigma : \mathbb{R}^D \times [0, T] \to \mathcal{G}$, where $\mathcal{G}$ is the set of symmetric, positive semi-definite $D \times D$ matrices; the specific prior on this will follow in Section 3.1. We will approximate the posteriors of $f$, $g$ and $\Sigma$ with variational inference, but first we formalise the model. We propose a continuous-time deep learning model that can propagate noise in high dimensions. This is done by letting the diffusion coefficient $\Sigma(x_t)$ of an SDE be governed by a Wishart process. The model we present factorises as

$p(y, \Theta) = p(y|g)\,p(g|x_T, u_g)\,p(u_g)\,p(x_T|f)\,p(f|\Sigma, u_f)\,p(u_f)\,p(\Sigma|u_\Sigma)\,p(u_\Sigma)$,  (11)

where $\Theta := \{g, u_g, x_T, f, u_f, \Sigma, u_\Sigma\}$ denotes all variables to be marginalised. We assume that the data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ are i.i.d. given the process, such that $p(y|g) = \prod_{i=1}^{N} p(y_i|g_i)$. We approximate the posterior of $g$ with the variational distribution as in (4), i.e.

$q(g_i) = \int p(g_i|u_g)\,q(u_g)\,\mathrm{d}u_g$  (12)

$\quad\;\, = \mathcal{N}\big(\tilde{\mu}_g(x_i), \tilde{k}_g(x_i, x_i)\big)$,  (13)

$\tilde{\mu}_g(x_i) = \alpha_g(x_i)\,\mathrm{vec}(m_g)$,  (14)

$\tilde{k}_g(x_i, x_i) = k^\eta_g(x_i, x_i) - \alpha_g(x_i)\big(k^\eta_g(Z_g, Z_g) - S_g\big)\alpha_g(x_i)^\top$,  (15)

where $\alpha_g(x_i) := k^\eta_g(x_i, Z_g)\,k^\eta_g(Z_g, Z_g)^{-1}$. Here $m_g$ is an $M \times \eta$ matrix, and $S_g$ is an $M\eta \times M\eta$ matrix, constructed from $\eta$ different $M \times M$ matrices $S_g = \{S_j\}_{j=1}^{\eta}$. During inference (Quiñonero Candela & Rasmussen, 2005), we additionally assume that the marginals $g_i = g(x_i)$ are independent when conditioned on $u_g$. This is an approximation to make inference computationally easier.

[Figure 1. (a) Graphical model based on the factorisation in Eq. (11); (b) the cycle from (a), which represents the field $f$, and how it moves along the time-axis. Here $\mathcal{N} \sim \mathcal{N}(0, (s - t)I)$. Blue represents the flow/SDE; square nodes are variational variables.]

The inputs to $g$ are given as the state distribution of an SDE at a fixed time point $T \geq 0$. We construct this SDE from the viewpoint of a random field. Consider the random walk with step size $\Delta$ on the simplest Gaussian field, where any state has mean $\mu$ and covariance $\Sigma$. For any time point $t$, the state distribution is tractable: $x_t = x_0 + \sum_{s=1}^{S} \epsilon_s$ with $\epsilon_s \sim \mathcal{N}(\Delta_s\mu, \Delta_s\Sigma)$ independently, where $\sum_s \Delta_s = t$ and $S$ is any positive integer. For a state-dependent Gaussian field, we define the random walk

$x_{t+\Delta} = x_t + \mu(x_t)\,\Delta + \sqrt{\Sigma(x_t)}\,\sqrt{\Delta}\,\mathcal{N}$,  (16)

with $\mathcal{N} \sim \mathcal{N}(0, I)$. Given an initial condition $x_0$, the state $x_{S\Delta}$ after $S$ steps is given by

$x_{S\Delta} = x_0 + \sum_{s=0}^{S-1}\Big(\mu(x_{s\Delta})\,\Delta + \sqrt{\Sigma(x_{s\Delta})}\,\sqrt{\Delta}\,\mathcal{N}_s\Big)$.  (17)

In the limit $\Delta \to 0$, this random walk dynamical system is given by the diffusion process (Durrett, 2018)

$x_T - x_0 = \int_0^T \mu(x_t)\,\mathrm{d}t + \int_0^T \sqrt{\Sigma(x_t)}\,\mathrm{d}B_t$,  (18)

where $B$ is a Brownian motion. This is an SDE in the Itô sense, which we can solve numerically by the Euler–Maruyama method. We will see that, by a particular choice of variational distribution, $\Sigma(x_t)$ will be the realisation of a Wishart process. The coefficients in (18) are determined as the mean and covariance of a Gaussian field $f$. The posterior of $f$ is approximated with a Gaussian $q(f_i) = \mathcal{N}\big(\mu^q_f(x_i), k^q_f(x_i, x_i)\big)$, where

$\mu^q_f(x_i) = \alpha_f(x_i)\,\mathrm{vec}(m_f)$,  (19)

$k^q_f(x_i, x_i) = k^D_f(x_i, x_i) - \alpha_f(x_i)\big(k^D_f(Z_f, Z_f) - S_f\big)\alpha_f(x_i)^\top$,  (20)

$\alpha_f(\cdot) = k^D_f(\cdot, Z_f)\,k^D_f(Z_f, Z_f)^{-1}$.  (21)

So far, we have seen how we move a data point $x_0$ through the SDE (18) to $x_T$, and further through the GP $g$, to make a prediction. However, each coordinate of $x$ moves independently. By introducing the Wishart process, we will see how this assumption is removed.

3.1. Wishart-priored Gaussian random field

We are still considering the Gaussian field $f$, whose posterior is approximated by the variational distribution $q(f)$. To regularise (or learn) the noise propagated through this field into $g$, while remaining within the Bayesian variational framework, we define a hierarchical model

$p(f) = \int p(f|u_f, \Sigma)\,p(u_f)\,p(\Sigma|u_\Sigma)\,p(u_\Sigma)\,\mathrm{d}\{\Sigma, u_f, u_\Sigma\}$,  (22)

where $\Sigma$ is a Wishart process. Specifically, its prior is

$\Sigma \sim \mathcal{WP}_D(LL^\top, \nu, k_f)$,  (23)

that is, any marginal is $\Sigma(x_t) = LJJ^\top L^\top$, where $J$ is the $D \times \nu$ matrix whose independent entries $j_{d,v}(x_t)$ are drawn from GPs that share the same prior, $j_{d,v}(\cdot) \sim \mathcal{GP}(0, k_f(\cdot, \cdot))$. To approximate the posterior of the Wishart process we choose a variational distribution

$q(J, u_\Sigma) = q(J|u_\Sigma)\,q(u_\Sigma) := p(J|u_\Sigma)\,q(u_\Sigma)$,  (24)

where $q(u_\Sigma) = \prod_{d=1}^{D}\prod_{v=1}^{\nu} \mathcal{N}(m^\Sigma_{d,v}, S^\Sigma_{d,v})$. Here, $m^\Sigma_{d,v}$ is $M \times 1$ and $S^\Sigma_{d,v}$ is $M \times M$ for each pair $\{d, v\}$. Notice that the same kernel is used for the Wishart process as for the random field $f$; that is, only one kernel controls the vector field $f$. The posterior of $\Sigma$ is naturally defined through the posterior of $J$. Given our choice of kernel, this approximate posterior is identical to Eqs. (19)–(21), only changing the variational parameters to $m^\Sigma$ and $S^\Sigma$, and $D$ to $D\nu$.

What remains to be defined in (11) is $p(f|\Sigma, u_f)$. Since $\Sigma(x_t)$ is a $D \times D$ matrix, we define

$p\big(f\,|\,\{\Sigma(x_i)\}_{i=1}^{N}, u_f\big) = \mathcal{N}\big(\mu(X), k^\Sigma_f(X, X)\big)$,  (25)

$\mu(x_i) = \alpha_f(x_i)\,\mathrm{vec}(u_f)$,  (26)

$k^\Sigma_f(x_i, x_j) = \Sigma(x_i)\,h_{ij}\,\delta_{ij}$,  (27)

where $h_{ij} = \alpha_f(x_i)\,k^D_f(Z_f, Z_f)\,\alpha_f(x_j)^\top$ and $\delta_{ij}$ is Kronecker's delta. Notice that this, conditioned on the Wishart process, constitutes a FITC-type model (Snelson & Ghahramani, 2006). It goes beyond the assumption of independent output dimensions, and instead makes the model learn the interdimensional dependence structure through the Wishart process $\Sigma$. This structure shall also be learned in the variational inference setup.
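As a sketch of how one step of the resulting flow can be simulated: the drift is the variational mean of $f$ from (19) and the diffusion is $\Sigma(x) = LJ(x)J(x)^\top L^\top$ from (23), so injecting noise as $LJ(x)z$ with $z \sim \mathcal{N}(0, I_\nu)$ reproduces the right covariance without forming a matrix square root. The `drift_mean` and `sample_J` callables below are placeholders standing in for the sparse-GP machinery above.

```python
import numpy as np

def wishart_sde_step(x, drift_mean, sample_J, L, dt, rng):
    """One Euler-Maruyama step of dx = mu(x) dt + sqrt(Sigma(x)) dB with
    Sigma(x) = L J(x) J(x)^T L^T (a Wishart-process diffusion).

    drift_mean(x): variational mean alpha_f(x) vec(m_f), shape (D,)   [placeholder]
    sample_J(x):   a posterior sample of the D x nu GP matrix J(x)    [placeholder]
    """
    J = sample_J(x)                                  # (D, nu)
    z = rng.standard_normal(J.shape[1])              # z ~ N(0, I_nu)
    # L @ J @ z has covariance L J J^T L^T = Sigma(x).
    return x + drift_mean(x) * dt + np.sqrt(dt) * (L @ J @ z)

# Toy usage with random placeholders for the variational quantities.
rng = np.random.default_rng(0)
D, nu = 3, 3
L, x = np.eye(D), np.zeros(D)
for _ in range(100):
    x = wishart_sde_step(x, lambda x: -x, lambda x: rng.standard_normal((D, nu)),
                         L, dt=0.01, rng=rng)
```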
The posterior of the conditional $f$ is approximated by

$q\big(f, u_f\,|\,\{\Sigma(x_i)\}_{i=1}^{N}\big) = q\big(f\,|\,\{\Sigma(x_i)\}_{i=1}^{N}, u_f\big)\,q(u_f) = p\big(f\,|\,\{\Sigma(x_i)\}_{i=1}^{N}, u_f\big)\,q(u_f)$,  (28)

where $q(u_f) := \mathcal{N}\big(m_f, k^D_f(Z_f, Z_f)\big)$. At first, this might seem restrictive, but covariance estimation is already in $\Sigma$, and the variational approximation is the simple expression

$q\big(f\,|\,\{\Sigma(x_i)\}_{i=1}^{N}\big) = \prod_{i=1}^{N} \mathcal{N}\big(\alpha_f(x_i)\,\mathrm{vec}(m_f),\; \Sigma(x_i)\big)$.  (29)

The marginalisation can then be computed with Jensen's inequality:

$\log p(y) = \log \int p(y, \Theta)\,\mathrm{d}\Theta \;\geq\; \int \log\big(p(y|g)\big)\,q\big(g\,|\,\Theta_{\setminus\{g\}}\big)\,\mathrm{d}\Theta - \mathrm{KL}\big(q(u_g)\,\|\,p(u_g)\big) - \mathrm{KL}\big(q(u_f)\,\|\,p(u_f)\big) - \mathrm{KL}\big(q(u_\Sigma)\,\|\,p(u_\Sigma)\big)$,  (30)

or, in a more straightforward language,

$\log p(y) \;\geq\; \mathbb{E}_{q(g)}\big[\log p(y|g)\big] - \mathrm{KL}\big(q(u_g)\,\|\,p(u_g)\big) - \mathrm{KL}\big(q(u_f)\,\|\,p(u_f)\big) - \mathrm{KL}\big(q(u_\Sigma)\,\|\,p(u_\Sigma)\big)$.  (31)

The right-hand side of (31) is the so-called evidence lower bound (ELBO). The first term, the expectation, is analytically intractable, due to $q(g)$ being non-conjugate to the likelihood. Therefore, we determine it numerically with Monte Carlo (MC) sampling or with Gauss–Hermite quadrature (Hensman et al., 2015). With MC, a few samples are often enough for reliable inference (Salimans & Knowles, 2013). The KL terms in (31) can be computed analytically, as they all involve multivariate Gaussians. Still, due to some of the modelling constraints, it is helpful to write them out, which yields

$\mathrm{KL}\big(q(u_g)\,\|\,p(u_g)\big) = \sum_{d=1}^{\eta} \mathrm{KL}\big(q(u_{g_d})\,\|\,p(u_{g_d})\big)$,  (32)

$\mathrm{KL}\big(q(u_\Sigma)\,\|\,p(u_\Sigma)\big) = \sum_{d=1}^{D}\sum_{v=1}^{\nu} \mathrm{KL}\big(q(u_{\Sigma_{d,v}})\,\|\,p(u_{\Sigma_{d,v}})\big)$,  (33)

where in both instances we used the independence between the GPs. The remaining term is special. Since both distributions share the same covariance, it reduces to

$\mathrm{KL}\big(q(u_f)\,\|\,p(u_f)\big) = \tfrac{1}{2}\,\mathrm{vec}(m_f)^\top k^D_f(Z_f, Z_f)^{-1}\,\mathrm{vec}(m_f) = \tfrac{1}{2}\sum_{d=1}^{D} m_{f_d}^\top k_f(Z_f, Z_f)^{-1} m_{f_d}$.  (34)

Here, $k_f(Z_f, Z_f)^{-1}$ is already known from the computation of (33), as the kernel and inducing locations are shared.

Summarising this section: we have inputs $x_0 := x$ that are warped through an SDE (governed by a random field $f$) with drift $\mu$ and diffusion $\Sigma$, driven by one kernel $k^D_f$. The value of this SDE at some given time $T$ is then used as input to a final layer $g$, i.e. $g(x_T)$ predicts the targets $y(x)$. All of this is inferred by maximising the ELBO (31).

3.2. Complexity and scalability

The computational cost of estimating $\Sigma$ with a Wishart process, as opposed to a diagonal matrix, can be burdensome. For the diagonal, the cost is $\mathcal{O}(DNM^2)$, since we need to compute (3) $D$ times. Sampling $D\nu$ GP values and then matrix-multiplying with a $D \times \nu$ matrix is of complexity $\mathcal{O}(D\nu NM^2 + D\nu D)$. Hence, if we, for simplicity, let $\nu = D$, we have an overhead cost of $\mathcal{O}(D^2NM^2 + D^3)$. Note this is only the computational budget associated with the diffusion coefficients of the random field, which is the most costly part.

On this inspection, we propose a way to overcome a too heavy burden if $D$ is large. Naturally, this involves an approximation; this time, a low-rank approximation along the dimensionality axis. Recall that, if $\Sigma_\rho \sim \mathcal{WP}_\rho(I, \nu, \kappa)$, then $\Sigma_D := L\Sigma_\rho L^\top \sim \mathcal{WP}_D(LL^\top, \nu, \kappa)$, where $L$ is now a $D \times \rho$ matrix. The resulting matrices are of rank $\rho \leq D$. The computational overhead is reduced to $\mathcal{O}(\rho^2NM^2 + D\rho^2)$ if $\nu = \rho$. This same structure was introduced by Heaukulani & van der Wilk (2019) for time-series modelling of financial data, and it is reminiscent of the structure of semiparametric latent factor models (SLFM) (Seeger et al., 2005). That is, we have $\rho$ GPs, and the $D$-dimensional outputs are all linear combinations of these. For clarity, we need only compute/sample the factor $LJ$, where $J$ is a $\rho \times \nu$ matrix with GP values distributed according to the approximate posterior $q(J)$ (with $D$ replaced by $\rho$), since $\Sigma_D = (LJ)(LJ)^\top$.
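A minimal sketch of this rank-$\rho$ construction, with random draws standing in for the GP posterior samples of $J$:

```python
import numpy as np

D, rho, nu = 30, 5, 5              # full dimension, low rank, degrees of freedom
L = np.random.randn(D, rho)        # D x rho scale factor (learned by maximum likelihood)

J = np.random.randn(rho, nu)       # stand-in for a marginal sample of the rho x nu GP matrix
Sigma_rho = J @ J.T                # Sigma_rho ~ W_rho(I, nu)
Sigma_D = L @ Sigma_rho @ L.T      # Sigma_D = L Sigma_rho L^T, rank <= rho

# Only rho * nu GP values are needed per input instead of D * nu, reducing the
# overhead from O(D^2 N M^2 + D^3) to O(rho^2 N M^2 + D rho^2) when nu = rho.
print(np.linalg.matrix_rank(Sigma_D))   # <= rho
```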
3.3. Further model specifications

If $\rho$ is too small, it can be difficult to identify a good diffusion coefficient, as the matrix is too restricted by the low rank. One possible way to overcome this is to add white noise to the matrix,

$\Sigma = LFF^\top L^\top + \Lambda$,  (35)

where $\Lambda$ is a diagonal $D \times D$ matrix. In many situations, this ensures that the diffusion has full rank, and it provides more freedom in estimating the marginal variances. However, if the values on the diagonal of $\Lambda$ are estimated by maximum likelihood, we have to be cautious. If $\Lambda$ becomes too dominant, inference can turn off the Wishart part, potentially leading to overfitting.

Consider the matrix $L$ that makes up the scale matrix of the Wishart process. It is fully inferred by maximum likelihood, hence there is no KL term to regularise it. Effectively, this can turn off the stochasticity of the flow by making some matrix norm of $L$ approximately zero. Then the flow is only determined by its drift, and overfitting is a likely scenario. To alleviate this concern, we propose to regularise $L$ by its row norms. That is, for $d = 1, \ldots, D$:

$\sum_{r=1}^{\rho} L^2_{d,r} = 1$,  (36)

where $L_{d,r}$ denotes the entries of $L$. First of all, this ensures that the prior variance for all dimensions is determined by the kernel hyperparameters, as it makes the diagonal of the scale matrix $LL^\top$ equal to 1. This way, the variance in each dimension is a fair linear combination of the $\rho$ GPs that control the Wishart.

3.4. Extending to time series

The model can be specified to handle temporal data $\mathcal{D} = \{y_i, t_i\}_{i=1}^{N}$ in a straightforward way. In a few lines (see also Figure 1), we write

$x_t = x_0 + \int_0^t \mu(x_s)\,\mathrm{d}s + \int_0^t \sqrt{\Sigma(x_s)}\,\mathrm{d}B_s$,  (37)

$f(\cdot)\,|\,\Sigma(\cdot), \mathcal{D} \sim \mathcal{GP}\big(\mu(\cdot), \Sigma(\cdot)\big)$,  (38)

$\Sigma(\cdot) \sim \mathcal{WP}(\cdot\,|\,\mathcal{D})$,  (39)

$p(y_t|x_t) = \mathcal{N}\big(g(x_t),\, A\Sigma(x_t)A^\top + \Lambda\big)$.  (40)

If $g$ is not the identity mapping, we can define a latent dynamical model. Say $g$ is a GP mapping from $\mathbb{R}^D$ to $\mathbb{R}^\eta$. This is similar to GP state-space models (GPSSM), where the dynamics, or transitions, are defined as $x_t = f(x_{t-1}) + \epsilon_x$ and $y_t = g(x_t) + \epsilon_y$, for GPs $f$ and $g$ and noise variables $\epsilon_x$ and $\epsilon_y$, usually Gaussian with zero mean and diagonal covariance matrices (Deisenroth et al., 2012; Eleftheriadis et al., 2017). The latent dynamics defined in (37)–(39) are not restricted to equi-temporal measurements, and they model a non-diagonal covariance structure both in the latent states $x$ and in the observed states $y$, through the matrix $A$, which is an $\eta \times D$ matrix. Adding the diagonal $\eta \times \eta$ matrix $\Lambda$ is necessary to avoid singularity. Even though $\Sigma(\cdot)$ is a $D \times D$ matrix, we can still approximate it with a rank-$\rho$ matrix, as described in Section 3.2. The log-likelihood we compute is

$\log p\big(y_t\,|\,g_t, \Sigma(x_t)\big) = -\tfrac{\eta}{2}\log(2\pi) - \tfrac{1}{2}\log\big(\det(B)\big) - \tfrac{1}{2}(y_t - g_t)^\top B^{-1}(y_t - g_t)$,  (41)

where $B := A\Sigma(x_t)A^\top + \Lambda$. As a consequence of the matrix determinant lemma and the Woodbury identity, we can evaluate this likelihood cheaply because of the structure of $B$.
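A sketch of this evaluation under the stated structure, writing $B = UU^\top + \Lambda$ with a low-rank factor $U$ of size $\eta \times \nu$ (e.g. $U = ALJ$); the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def lowrank_gaussian_logpdf(y, mean, U, lam):
    """log N(y | mean, B) with B = U U^T + diag(lam), via the Woodbury identity
    and the matrix determinant lemma. U is eta x nu, lam the diagonal of Lambda."""
    eta, nu = U.shape
    r = y - mean
    Li = 1.0 / lam                                   # Lambda^{-1} (diagonal)
    C = np.eye(nu) + U.T @ (Li[:, None] * U)         # I_nu + U^T Lambda^{-1} U  (nu x nu)
    # Woodbury: r^T B^{-1} r = r^T Li r - (r^T Li U) C^{-1} (U^T Li r)
    w = Li * r
    quad = r @ w - w @ U @ np.linalg.solve(C, U.T @ w)
    # Matrix determinant lemma: log det(B) = log det(C) + sum(log lam)
    logdet = np.linalg.slogdet(C)[1] + np.sum(np.log(lam))
    return -0.5 * (eta * np.log(2 * np.pi) + logdet + quad)

# Quick check against the dense computation.
rng = np.random.default_rng(1)
eta, nu = 6, 2
U, lam = rng.standard_normal((eta, nu)), rng.uniform(0.1, 1.0, eta)
y, mean = rng.standard_normal(eta), np.zeros(eta)
B = U @ U.T + np.diag(lam)
dense = -0.5 * (eta * np.log(2 * np.pi) + np.linalg.slogdet(B)[1]
                + (y - mean) @ np.linalg.solve(B, y - mean))
assert np.allclose(lowrank_gaussian_logpdf(y, mean, U, lam), dense)
```

Only a $\nu \times \nu$ system is solved per observation, which is what makes the non-diagonal observation covariance affordable.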
The ELBO that we optimise during training is similar to (31); only the likelihood term is different: it is swapped for a variational expectation over (41). We assume independence between all temporal observations, i.e. $p(\mathcal{D}) = \prod_{i=1}^{N} p(\{y_i, t_i\})$.

4. Experiments

We evaluate the presented model in both a regression and a dynamical setup. In both instances, we use baselines that are similar to our model, to more easily distinguish the influence the diffusion has on the experiments. We evaluate on a well-studied regression benchmark and on a higher-dimensional dynamical dataset.

4.1. Regression

We compare our model, which we dub Wishart-priored GP flow (diffWGP), to three baseline models in order to shed light on some properties of the diffWGP.

GP flows. Reproducing the model from Hegde et al. (2019) will give indications of whether it is possible to increase overall performance by modelling the randomness in the flow. This model has a diagonal matrix $\Sigma$ with entries determined solely by the chosen covariance function. We refer to this model as diffGP.

No-noise flows. We also evaluate the model where $\Sigma = 0$, i.e. the situation where the flow is deterministic. The remaining part of the flow is still as in (19) to make fair comparisons. All the relevant KL terms are removed from the ELBO (31). We refer to this as No noise.

Sparse GPs. Also in the variational setup, we compare to vanilla sparse GPs, i.e. the model introduced by Titsias (2009). We refer to this as SGP.

4.1.1. Experimental setup

In all experiments, we choose 100 inducing points for the variational distributions, all of which are Gaussians. All models are trained for 50,000 iterations with a mini-batch size of 2000, or the number of samples in the data if smaller. In all instances, the first 10,000 iterations warm-start the final-layer GP $g$, keeping all other parameters fixed; here we use the Adam optimiser with a step size of 0.01. After this, all flows (this excludes SGP) are initialised with a constant mean 0 and covariance functions chosen as RBF with automatic relevance determination (ARD), initialised with tiny signal noise to ensure $x_0 \approx x_T$. The time variable $T$ is always 1. The remaining 40,000 iterations (SGP excluded) again update with Adam, with a more cautious step size of 0.001. For the diffWGP, the first 4000 of these warm-start the KL terms associated with the flow to speed up convergence. Note that this model fits more parameters than the baseline models. For the diffWGP, we update the ELBO

$\mathbb{E}_{q(g)}\big[\log p(y|g)\big] - \mathrm{KL}\big(q(u_g)\,\|\,p(u_g)\big) - c^2\,\mathrm{KL}\big(q(u_f)\,\|\,p(u_f)\big) - c\,\mathrm{KL}\big(q(u_\Sigma)\,\|\,p(u_\Sigma)\big)$,  (42)

where $c = \min(1, \text{iteration}/4000)$, i.e. we warm-start the regularising KL terms.
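A minimal sketch of this warm-up schedule, reading the coefficients in Eq. (42) as $c^2$ and $c$; the function and argument names are illustrative.

```python
def annealed_elbo(exp_loglik, kl_g, kl_f, kl_sigma, iteration, warmup=4000):
    # Warm-started objective of Eq. (42): the flow's KL terms are scaled in
    # gradually so they do not dominate the objective early in training.
    c = min(1.0, iteration / warmup)
    return exp_loglik - kl_g - c**2 * kl_f - c * kl_sigma
```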
4.1.2. UCI regression benchmark

[Figure 2. Test-set log-likelihood values on eight UCI regression datasets (bike, boston, concrete, kin8nm, naval, power, protein, wine_white), shown for diffWGP with ρ = 5 and ρ = 10. The violin plots show the test-set log(likelihood-ratio) of the baseline diffusion models with respect to the SGP baseline. Values greater than 0 indicate an improvement over SGP. Key findings are that No noise can overfit heavily (boston, concrete, naval), and diffWGP performs best on most datasets. The figure has been cut for readability; this explains why occasionally purple violins are missing.]

Figure 2 shows the results on eight UCI benchmark datasets over 20 train-test splits (90/10). On the y-axis, we see the distribution of the test-set log-likelihood subtracted by the SGP log-likelihood on the same split. Values greater than 0 are improvements over the baseline SGP. An analogous plot with RMSE is supplied in the supplementary material. In Table 1, we use Wilcoxon's paired signed-rank test to evaluate whether the more advanced models perform better.

Table 1. Wilcoxon's paired signed-rank test. Listed are the p-values of the hypothesis of equal medians versus the alternative that the location shift is negative. Bold highlights the values significant at a 0.05 confidence level. In parentheses is the input dimensionality of each dataset. Results are for ρ = 5.

| Dataset (dim) | diffGP vs. SGP | diffWGP vs. diffGP |
|---|---|---|
| bike (14) | 0.8695 | 0.2262 |
| boston (13) | **<0.0001** | 0.9867 |
| concrete (8) | **0.0042** | **0.0348** |
| kin8nm (8) | **<0.0001** | **0.0164** |
| naval (26) | 0.8695 | **<0.0001** |
| power (4) | **<0.0001** | 0.1387 |
| protein (9) | **<0.0001** | **<0.0001** |
| wine_white (11) | **0.0003** | 0.3238 |

Key observations are: not having noise in the flow (No noise) seems to lead to overfitting, except in two cases where a more expressive model is preferred. In one of these cases (protein), Wishart modelling improves both the RMSE and the log-likelihood. In one case (boston), the overfitting was absurdly large; on this dataset we were not able to reproduce the results from Hegde et al. (2019) either. In four cases (concrete, kin8nm, power, wine_white), No noise overfitted mildly. In two of these cases, diffWGP improved over diffGP. The two cases where no improvement is significant are simple cases, wine_white and power, which are almost linear or low-dimensional. On the naval dataset, the No noise model could not run due to numerical issues. Here, diffWGP outperforms diffGP in the log-likelihood. We conjecture this is because of the high dimensionality and the fact that almost no observation noise is present. We found no substantial influence of the parameter ρ; if anything, lower-rank approximations actually seem to be preferred. This emphasises that training Wishart processes is difficult, and further research in this area is needed.

4.2. Auto-regressive modelling of air quality

We evaluate our dynamical model on atmospheric air-quality data from Beijing (Zhang et al., 2017). We pre-processed the data for three locations in the city (Shunyi, Tiantan, Dongsi), each of which has hourly observations of ten features over the period 2014–2016. Explicitly, the ten features are: the concentrations of PM2.5, PM10, SO2, NO2, CO and O3, the temperature and dew point temperature, the air pressure, and the amount of precipitation. (The full dataset is available at https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data.) We use the first two years of this dataset for training and aim to forecast the first 48 hours of 2016. Including the variables year, month, day and hour, we have in total 34 features for the three cities and 17,520 temporal observations for training. Missing values were linearly interpolated. All features were standardised.
To analyse the properties of our proposed model, we perform an ablation study with the following models:

diffWGP. The model proposed in this paper, which models the diffusion with Wishart processes.

Diagonal noise. The drift term remains as in the diffWGP model, but the diffusion is restricted to be diagonal, i.e. correlated diffusion cannot be modelled. This becomes the model

$x_t = x_s + \mu(x_s)(t - s) + \sqrt{\Lambda(t - s)}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$.  (43)

No drift. The drift is constantly zero, and the diffusion is modelled by a Wishart process, which results in the model

$x_t = x_s + \sqrt{\big(A\Sigma(x_t)A^\top + \Lambda\big)(t - s)}\,\epsilon_t$.  (44)

This model is a continuous-time version of the model presented by Heaukulani & van der Wilk (2019).

In all instances, we train by mini-batching shorter sequences, and we use the Adam optimiser (Kingma & Ba, 2014) with a learning rate of 0.01. Due to the large number of temporal observations compared to the small batches, we ease off on momentum.

[Figure 3. (a): The performance of predictions plotted over the forecast horizon: the log-likelihood of forecasted measurements up to 48 hours for Diagonal noise, No drift and diffWGP (ρ = 5). The bold lines mark the average log-likelihood at a given hour based on 50 simulations; the associated shaded areas span twice the standard error. (b): The joint development of two temperature measurements over the forecasted time horizon for two different models: the density (colour) of the 48-hour-horizon predictions of the temperature measurements in Tiantan (x-axis) and Dongsi (y-axis), locations within a few kilometres of each other. Left: diagonal noise; right: Wishart noise. The Wishart detects a correlation between these two temperature measurements, as we would expect for such nearby locations.]

Figure 3(a) shows how the different models forecast future observations, by reporting the log-likelihood traces of the individual models at test time. The figure shows the mean and two times the standard error, obtained from 50 simulations. At first, we see that having no drift starts off better, but quickly drops in performance. This is not unexpected, as the data has structure in its evolution. The differences between the models with drift, but different diffusions, are more interesting for this dataset. Overall, Wishart diffusions perform best, and the model seems resilient, taking only a few relatively small dips.

We expect this dataset to have highly correlated features. The three locations in Beijing are close to each other in distance; naturally, the different air measurements are similar in their evolution over time. Figure 3(b) illustrates how a model with diagonal noise is incapable of learning this joint development of temperature measurements. Here, the Wishart learns that when the temperature in Dongsi is high, it is also high in Tiantan. This behaviour is seen in many pairs of the features considered, and it suggests that diffWGP has dynamics moving on a manifold of smaller dimension than if diagonal noise were considered. This supports the hypothesis that diffWGP moves as one dynamical system, as opposed to 34 independent ones.

5. Conclusion

In a non-parametric Bayesian way, we presented a scalable approach to continuous-time learning with high emphasis on correlated process noise. This noise is modelled with a Wishart process, which lets high-dimensional data evolve as a single system, rather than as $D$ independent systems. We presented a way to scale this to high dimensions. We found that it is never worse to take the dependence structure in the process noise into account; with certain types of data, it can mitigate overfitting effects and improve performance. Code is publicly available at: https://github.com/JorgensenMart/Wishart-priored-SDE.

Acknowledgements

MJ was supported by a research grant (15334) from VILLUM FONDEN.

References

Álvarez, M. A. and Lawrence, N. D. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1459–1500, 2011.

Andreas, L. and Kandemir, M. Differential Bayesian neural nets. arXiv:1912.00796, 2019.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.

Damianou, A. and Lawrence, N. D. Deep Gaussian processes. In Artificial Intelligence and Statistics, 2013.
Deisenroth, M. P., Turner, R., Huber, M., Hanebeck, U. D., and Rasmussen, C. E. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information Processing Systems, 2019.

Durrett, R. Stochastic Calculus: A Practical Introduction. CRC Press, 2018.

Duvenaud, D., Rippel, O., Adams, R., and Ghahramani, Z. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, 2014.

E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Eleftheriadis, S., Nicholson, T. F. W., Deisenroth, M. P., and Hensman, J. Identification of Gaussian process state space models. In Advances in Neural Information Processing Systems, 2017.

Engle, R. F. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, pp. 987–1007, 1982.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016.

Heaukulani, C. and van der Wilk, M. Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes. In Advances in Neural Information Processing Systems, 2019.

Hegde, P., Heinonen, M., Lähdesmäki, H., and Kaski, S. Deep learning with differential Gaussian process flows. In Artificial Intelligence and Statistics, 2019.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.

Hensman, J., Matthews, A. G. d. G., Filippone, M., and Ghahramani, Z. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, 2015.

Itô, K. On a stochastic integral equation. Proceedings of the Japan Academy, 22(2):32–35, 1946.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Kloeden, P. E. and Platen, E. Numerical Solution of Stochastic Differential Equations, volume 23. Springer Science & Business Media, 2013.

Kshirsagar, A. M. Bartlett decomposition and Wishart distribution. The Annals of Mathematical Statistics, 30(1):239–241, 1959.

Li, X., Wong, T.-K. L., Chen, R. T. Q., and Duvenaud, D. Scalable gradients for stochastic differential equations. In Artificial Intelligence and Statistics, 2020.

Liu, X., Xiao, T., Si, S., Cao, Q., Kumar, S., and Hsieh, C.-J. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv:1906.02355, 2019.

Quiñonero Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Salimans, T. and Knowles, D. A. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

Salimbeni, H. and Deisenroth, M. P. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, 2017.

Seeger, M., Teh, Y.-W., and Jordan, M. Semiparametric latent factor models. Technical report, 2005.

Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.
Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, 2009.

Twomey, N., Kozłowski, M., and Santos-Rodríguez, R. Neural ODEs with stochastic vector field mixtures. arXiv:1905.09905, 2019.

Tzen, B. and Raginsky, M. Neural stochastic differential equations: deep latent Gaussian models in the diffusion limit. arXiv:1905.09883, 2019.

Wilson, A. G. and Ghahramani, Z. Generalised Wishart processes. arXiv:1101.0240, 2010.

Wu, Y., Hernández-Lobato, J. M., and Ghahramani, Z. Gaussian process volatility model. In Advances in Neural Information Processing Systems, 2014.

Zhang, S., Guo, B., Dong, A., He, J., Xu, Z., and Chen, S. X. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473, 2017.