Journal of Machine Learning Research 17 (2016) 1-87
Submitted 7/15; Revised 6/16; Published 8/16

String and Membrane Gaussian Processes

Yves-Laurent Kom Samo YLKS@ROBOTS.OX.AC.UK
Stephen J. Roberts SJROB@ROBOTS.OX.AC.UK
Department of Engineering Science and Oxford-Man Institute
University of Oxford
Eagle House, Walton Well Road, OX2 6ED, Oxford, United Kingdom

Editor: Neil Lawrence

Abstract

In this paper we introduce a novel framework for making exact nonparametric Bayesian inference on latent functions that is particularly suitable for Big Data tasks. Firstly, we introduce a class of stochastic processes we refer to as string Gaussian processes (string GPs, which are not to be mistaken for Gaussian processes operating on text). We construct string GPs so that their finite-dimensional marginals exhibit suitable local conditional independence structures, which allow for scalable, distributed, and flexible nonparametric Bayesian inference, without resorting to approximations, and while ensuring some mild global regularity constraints. Furthermore, string GP priors naturally cope with heterogeneous input data, and the gradient of the learned latent function is readily available for explanatory analysis. Secondly, we provide some theoretical results relating our approach to the standard GP paradigm. In particular, we prove that some string GPs are Gaussian processes, which provides a complementary global perspective on our framework. Finally, we derive a scalable and distributed MCMC scheme for supervised learning tasks under string GP priors. The proposed MCMC scheme has computational time complexity O(N) and memory requirement O(dN), where N is the data size and d the dimension of the input space. We illustrate the efficacy of the proposed approach on several synthetic and real-world data sets, including a data set with 6 million input points and 8 attributes.
Keywords: String Gaussian processes, scalable Bayesian nonparametrics, Gaussian processes, nonstationary kernels, reversible-jump MCMC, point process priors

© 2016 Yves-Laurent Kom Samo and Stephen J. Roberts.

1. Introduction

Many problems in statistics and machine learning involve inferring a latent function from training data (for instance regression, classification, inverse reinforcement learning, and inference on point processes, to name but a few). Real-valued stochastic processes, among which Gaussian processes (GPs), are often used as functional priors for such problems, thereby allowing for a full Bayesian nonparametric treatment. In the machine learning community, interest in GPs grew out of the observation that some Bayesian neural networks converge to GPs as the number of hidden units approaches infinity (Neal (1996)). Since then, other similarities have been established between GPs and popular models such as Bayesian linear regression, Bayesian basis function regression, spline models and support vector machines (Rasmussen and Williams (2006)). However, GPs often perform poorly on Big Data tasks, primarily for two reasons. Firstly, large data sets are likely to exhibit multiple types of local patterns that should appropriately be accounted for by flexible and possibly nonstationary covariance functions, the development of which is still an active subject of research. Secondly, inference under GP priors often consists of regarding the values of the GP at all input points as a jointly Gaussian vector with fully dependent coordinates, which induces a memory requirement and time complexity respectively squared and cubic in the training data size, and is thus intractable for large data sets. We refer to this approach as the standard GP paradigm. The framework we introduce in this paper addresses both of the above limitations.
Our work is rooted in the observation that, from a Bayesian nonparametric perspective, it is inefficient to define a stochastic process through fully dependent marginals, as is the case for Gaussian processes. Indeed, if a stochastic process $(f(x))_{x \in \mathbb{R}^d}$ has fully dependent marginals and exhibits no additional conditional independence structure then, when f is used as a functional prior and some observations related to $(f(x_1), \ldots, f(x_n))$ are gathered, namely $(y_1, \ldots, y_n)$, the additional memory required to take into account an additional piece of information $(y_{n+1}, x_{n+1})$ grows in O(n), as one has to keep track of the extent to which $y_{n+1}$ informs us about $f(x_i)$ for every $i \leq n$, typically through a covariance matrix whose size will increase by 2n + 1 terms. Clearly, this is inefficient, as $y_{n+1}$ is unlikely to be informative about $f(x_i)$ unless $x_i$ is sufficiently close to $x_{n+1}$. More generally, the larger n, the less information a single additional pair $(y_{n+1}, x_{n+1})$ will add to existing data, and yet the increase in memory requirement will be much higher than that incurred while processing earlier and more informative data. This inefficiency in resource requirements extends to computational time, as the increase in computational time resulting from adding $(y_{n+1}, x_{n+1})$ typically grows in $O(n^2)$, which is the difference between the numbers of operations required to invert an $n \times n$ matrix and to invert an $(n+1) \times (n+1)$ matrix. A solution for addressing this inefficiency is to appropriately limit the extent to which the values $f(x_1), \ldots, f(x_n)$ are related to each other. Existing approaches such as sparse Gaussian processes (see Quinonero-Candela and Rasmussen (2005) for a review) resort to an ex-post approximation of fully dependent Gaussian marginals with multivariate Gaussians exhibiting conditional independence structures.
Unfortunately, these approximations trade off accuracy for scalability through a control variable, namely the number of inducing points, whose choice is often left to the user. The approach we adopt in this paper consists of going back to stochastic analysis basics, and constructing stochastic processes whose finite-dimensional marginals exhibit suitable conditional independence structures, so that we need not resort to ex-post approximations. Incidentally, unlike sparse GP techniques, the conditional independence structures we introduce also allow for flexible and principled learning of local patterns, and this increased flexibility does not come at the expense of scalability.

The contributions of this paper are as follows. We introduce a novel class of stochastic processes, string Gaussian processes (string GPs), that may be used as priors over latent functions within a Bayesian nonparametric framework, especially for large scale problems and in the presence of possibly multiple types of local patterns. We propose a framework for analysing the flexibility of random functions and surfaces, and prove that our approach yields more flexible stochastic processes than isotropic Gaussian processes. We demonstrate that exact inference under a string GP prior scales considerably better than in the standard GP paradigm, and is amenable to distributed computing. We illustrate that popular stationary kernels can be well approximated within our framework, making string GPs a scalable alternative to commonly used GP models. We derive the joint law of a string GP and its gradient, thereby allowing for explanatory analysis on the learned latent function. We propose a reversible-jump Markov chain Monte Carlo sampler for automatic learning of model complexity and local patterns from data.

The rest of the paper is structured as follows. In Section 2 we review recent advances on Gaussian processes in relation to inference on large data sets.
In Section 3 we formally construct string GPs and derive some important results. In Section 4 we provide detailed illustrative and theoretical comparisons between string GPs and the standard GP paradigm. In Section 5 we propose methods for inferring latent functions under string GP priors with time complexity and memory requirement that are linear in the size of the data set. The efficacy of our approach compared to competing alternatives is illustrated in Section 6. Finally, we finish with a discussion in Section 7.

2. Related Work

The two primary drawbacks of the standard GP paradigm on large scale problems are the lack of scalability resulting from postulating a full multivariate Gaussian prior on function values at all training inputs, and the difficulty of postulating a priori a class of covariance functions capable of capturing intricate and often local patterns likely to occur in large data sets. A tremendous amount of work has been published that attempts to address either of the aforementioned limitations. However, scalability is often achieved either through approximations or for specific applications, and nonstationarity is usually introduced at the expense of scalability, again for specific applications.

2.1 Scalability Through Structured Approximations

As far as scalability is concerned, sparse GP methods have been developed that approximate the multivariate Gaussian probability density function (pdf) over training data with the marginal over a smaller set of inducing points multiplied by an approximate conditional pdf (Smola and Bartlett (2001); Lawrence et al. (2003); Seeger (2003b,a); Snelson and Ghahramani (2006)). This approximation yields a time complexity linear rather than cubic in the data size, and squared in the number of inducing points. We refer to Quinonero-Candela and Rasmussen (2005) for a review of sparse GP approximations. More recently, Hensman et al.
(2013, 2015) combined sparse GP methods with Stochastic Variational Inference (Hoffman et al. (2013)) for GP regression and GP classification. However, none of these sparse GP methods addresses the selection of the number of inducing points (or of the size of the mini-batch in the case of Hensman et al. (2013, 2015)), although this may greatly affect scalability. More importantly, although these methods do not impose strong restrictions on the covariance function of the GP model to approximate, they do not address the need for flexible covariance functions inherent to large scale problems, which are more likely to exhibit intricate and local patterns, and applications considered by the authors typically use the vanilla squared exponential kernel. Lázaro-Gredilla et al. (2010) proposed approximating stationary kernels with truncated Fourier series in Gaussian process regression. An interpretation of the resulting sparse spectrum Gaussian process model as Bayesian basis function regression with a finite number K of trigonometric basis functions allows making inference with time complexity and memory requirement that are both linear in the size of the training sample. However, this model has two major drawbacks. Firstly, it is prone to over-fitting. In effect, the learning machine will aim at inferring the K major spectral frequencies evidenced in the training data. This will only lead to appropriate prediction out-of-sample when the underlying latent phenomenon can be appropriately characterised by a finite discrete spectral decomposition that is expected to be the same everywhere on the domain. Secondly, this model implicitly postulates that the covariance between the values of the GP at two points does not vanish as the distance between the points becomes arbitrarily large. This imposes a priori the view that the underlying function is highly structured, which might be unrealistic in many real-life non-periodic applications.
This approach is generalised by the so-called random Fourier features methods (Rahimi and Recht (2007); Le et al. (2013); Yang et al. (2015)). Unfortunately, all existing random Fourier features methods give rise to stationary covariance functions, which might not be appropriate for data sets exhibiting local patterns. The bottleneck of inference in the standard GP paradigm remains inverting and computing the determinant of a covariance matrix, normally achieved through the Cholesky decomposition or the Singular Value Decomposition. Methods have been developed that speed up these decompositions through low rank approximations (Williams and Seeger (2001)) or by exploiting specific structures in the covariance function and in the input data (Saatchi (2011); Wilson et al. (2014)), which typically give rise to Kronecker or Toeplitz covariance matrices. While the Kronecker method used by Saatchi (2011) and Wilson et al. (2014) is restricted to inputs that form a Cartesian grid and to separable kernels,1 low rank approximations such as the Nyström method used by Williams and Seeger (2001) modify the covariance function, and hence the functional prior, in a non-trivial way. Methods have also been proposed to interpolate the covariance matrix on a uniform or Cartesian grid in order to benefit from some of the computational gains of Toeplitz and Kronecker techniques even when the input space is not structured (Wilson and Nickisch (2015)). However, none of these solutions is general, as they require that either the covariance function be separable (Kronecker techniques), or the covariance function be stationary and the input space be one-dimensional (Toeplitz techniques).
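As an illustration of the random Fourier features idea mentioned above, the sketch below (variable names and settings are ours, not from any of the cited works) approximates a squared exponential kernel by sampling frequencies from its spectral density; the feature inner products converge to the exact kernel values as the number of features grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, n_features, lengthscale):
    """Random Fourier features whose inner products approximate the
    squared exponential kernel exp(-|x - y|^2 / (2 lengthscale^2))."""
    d = X.shape[1]
    # The spectral density of the SE kernel is Gaussian with std 1/lengthscale.
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.uniform(-1, 1, size=(5, 2))
Z = rff_features(X, n_features=20000, lengthscale=0.8)
K_approx = Z @ Z.T

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * 0.8 ** 2))
print(np.max(np.abs(K_approx - K_exact)))  # small; shrinks as n_features grows
```

Note that the resulting kernel approximation depends on the inputs only through their differences, which makes the stationarity limitation discussed above apparent.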
2.2 Scalability Through Data Distribution

A family of methods has been proposed to scale up inference in GP models, based on the observation that it is more computationally efficient to compute the pdf of K independent small Gaussian vectors of size n than to compute the pdf of a single bigger Gaussian vector of size nK. For instance, Kim et al. (2005) and Gramacy and Lee (2008) partitioned the input space, and put independent stationary GP priors on the restrictions of the latent function to the subdomains forming the partition, which can be regarded as independent local GP experts. Kim et al. (2005) partitioned the domain using Voronoi tessellations, while Gramacy and Lee (2008) used tree based partitioning. These two approaches are provably equivalent to postulating a (nonstationary) GP prior on the whole domain that is discontinuous along the boundaries of the partition, which might not be desirable if the latent function we would like to infer is continuous, and might affect predictive accuracy. The more local experts there are, the more scalable the model will be, but the more discontinuities the latent function will have, and consequently the less accurate the approach will be. Mixtures of Gaussian process experts models (MoE) (Tresp (2001); Rasmussen and Ghahramani (2001); Meeds and Osindero (2006); Ross and Dy (2013)) provide another implementation of this idea. MoE models assume that there are multiple latent functions to be inferred from the data, on which independent GP priors are placed, and each training input is associated to one latent function. The number of latent functions and the repartition of data between latent functions can then be performed in a fully Bayesian nonparametric fashion (Rasmussen and Ghahramani (2001); Ross and Dy (2013)).
When there is a single continuous latent function to be inferred, as is the case for most regression models, the foregoing Bayesian nonparametric approach will learn a single latent function, thereby leading to a time complexity and a memory requirement that are the same as in the standard GP paradigm, which defies the scalability argument.

1. That is, a multivariate kernel that can be written as a product of univariate kernels.

The last implementation of the idea in this section consists of distributing the training data over multiple independent but identical GP models. In regression problems, examples include the Bayesian Committee Machine (BCM) of Tresp (2000), the generalized product of experts (gPoE) model of Cao and Fleet (2014), and the robust Bayesian Committee Machine (rBCM) of Deisenroth and Ng (2015). These models propose splitting the training data into small subsets, each subset being assigned to a different GP regression model, referred to as an expert, that has the same hyper-parameters as the other experts, although experts are assumed to be mutually independent. Training is performed by maximum marginal likelihood, with time complexity (resp. memory requirement) linear in the number of experts and cubic (resp. squared) in the size of the largest data set processed by an expert. Predictions are then obtained by aggregating the predictions of all GP experts in a manner that is specific to the method used (that is, the BCM, the gPoE or the rBCM). However, these methods present major drawbacks in the training and testing procedures. In effect, the assumption that experts have identical hyper-parameters is inappropriate for data sets exhibiting local patterns.
Even if one were to allow GP experts to be driven by different hyper-parameters, as in Nguyen and Bonilla (2014) for instance, the learned hyper-parameters would lead to overly simplistic GP experts and poor aggregated predictions when the number of training inputs assigned to each expert is small; this is a direct consequence of the (desirable) fact that maximum marginal likelihood GP regression abides by Occam's razor. Another critical pitfall of the BCM, the gPoE and the rBCM is that their methods for aggregating expert predictions are Kolmogorov inconsistent. For instance, denoting $\hat{p}$ the predictive distribution in the BCM, it can easily be seen from Equations (2.4) and (2.5) in Tresp (2000) that the predictive distribution $\hat{p}(f(x^*_1)|\mathcal{D})$ (resp. $\hat{p}(f(x^*_2)|\mathcal{D})$)2 provided by the aggregation procedure of the BCM is not the marginal over $f(x^*_2)$ (resp. over $f(x^*_1)$) of the multivariate predictive distribution $\hat{p}(f(x^*_1), f(x^*_2)|\mathcal{D})$ obtained from the experts' multivariate predictions $p_k(f(x^*_1), f(x^*_2)|\mathcal{D})$ using the same aggregation procedure:
$$\hat{p}(f(x^*_1)|\mathcal{D}) \neq \int \hat{p}(f(x^*_1), f(x^*_2)|\mathcal{D}) \, df(x^*_2).$$
Without Kolmogorov consistency, it is impossible to make principled Bayesian inference of latent function values. A principled Bayesian nonparametric model should not provide predictions about $f(x^*_1)$ that differ depending on whether or not one is also interested in predicting other values $f(x^*_i)$ simultaneously. This pitfall might be the reason why Cao and Fleet (2014) and Deisenroth and Ng (2015) restricted their expositions to predictive distributions over a single function value at a time, $\hat{p}(f(x^*)|\mathcal{D})$, although their procedures (Equation 4 in Cao and Fleet (2014) and Equation 20 in Deisenroth and Ng (2015)) are easily extended to posterior distributions over multiple function values.
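This kind of inconsistency is easy to verify numerically. The sketch below uses a plain product-of-experts aggregation of Gaussian expert predictives with made-up numbers (a simplification of the BCM, which additionally corrects by a power of the prior, but the same phenomenon arises): aggregating the experts' univariate predictives at one test point does not match marginalising the aggregate of their bivariate predictives.

```python
import numpy as np

# Two hypothetical expert predictives over (f(x1*), f(x2*)): correlated
# bivariate Gaussians. Product-of-experts aggregation multiplies densities,
# i.e. adds precisions and precision-weighted means.
m1, S1 = np.array([0.0, 1.0]), np.array([[1.0, 0.6], [0.6, 1.0]])
m2, S2 = np.array([0.5, -0.5]), np.array([[2.0, -0.8], [-0.8, 1.0]])

# Route 1: aggregate the bivariate predictives, then marginalise over f(x2*).
P_joint = np.linalg.inv(S1) + np.linalg.inv(S2)
S_joint = np.linalg.inv(P_joint)
var_marginal_of_joint = S_joint[0, 0]

# Route 2: aggregate the univariate predictives of f(x1*) directly.
prec_uni = 1.0 / S1[0, 0] + 1.0 / S2[0, 0]
var_direct = 1.0 / prec_uni

print(var_marginal_of_joint, var_direct)  # the two variances differ
```

The two routes yield different predictive variances for the same quantity, which is precisely the failure of Kolmogorov consistency discussed above.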
These extensions would also be Kolmogorov inconsistent, and restricting the predictions to exactly one function value is unsatisfactory, as it does not allow determining the posterior covariance between function values at two test inputs.

2. Here f is the latent function to be inferred, $x^*_1, x^*_2$ are test points and $\mathcal{D}$ denotes the training data.

2.3 Expressive Stationary Kernels

In regards to flexibly handling complex patterns likely to occur in large data sets, Wilson and Adams (2013) introduced a class of expressive stationary kernels obtained by summing up convolutions of Gaussian basis functions with Dirac delta functions in the spectral domain. The sparse spectrum kernel can be thought of as the special case where the convolving Gaussian is degenerate. Although such kernels perform particularly well in the presence of globally repeated patterns in the data, their stationarity limits their utility on data sets with local patterns. Moreover, the proposed covariance functions generate infinitely differentiable random functions, which might be too restrictive in some applications.

2.4 Application-Specific Nonstationary Kernels

As for nonstationary kernels, Paciorek and Schervish (2004) proposed a method for constructing nonstationary covariance functions from any stationary one that involves introducing n input dependent $d \times d$ covariance matrices that will be inferred from the data. Plagemann et al. (2008) proposed a faster approximation to the model of Paciorek and Schervish (2004). However, both approaches scale poorly with the input dimension and the data size, as they have time complexity $O(\max(nd^3, n^3))$. MacKay (1998), Schmidt and O'Hagan (2003), and Calandra et al. (2014) proposed kernels that can be regarded as stationary after a non-linear transformation d on the input space: $k(x, x') = h(d(x) - d(x'))$, where h is positive semi-definite.
Although for a given deterministic function d the kernel k is nonstationary, Schmidt and O'Hagan (2003) put a GP prior on d with mean function m(x) = x and covariance function invariant under translation, which unfortunately leads to a kernel that is (unconditionally) stationary, albeit more flexible than $h(x - x')$. To model nonstationarity, Adams and Stegle (2008) introduced a functional prior of the form $y(x) = f(x) \exp(g(x))$, where f is a stationary GP and g is some scaling function on the domain. For a given non-constant function g, such a prior indeed yields a nonstationary Gaussian process. However, when a stationary GP prior is placed on the function g, as Adams and Stegle (2008) did, the resulting functional prior $y(x) = f(x) \exp(g(x))$ becomes stationary. The piecewise GP (Kim et al. (2005)) and treed GP (Gramacy and Lee (2008)) models previously discussed also introduce nonstationarity. The authors' premise is that heterogeneous patterns might be locally homogeneous. However, as previously discussed, such models are inappropriate for modelling continuous latent functions.

2.5 Our Approach

The approach we propose in this paper for inferring latent functions in large scale problems, possibly exhibiting locally homogeneous patterns, consists of constructing a novel class of smooth, nonstationary and flexible stochastic processes we refer to as string Gaussian processes (string GPs), whose finite-dimensional marginals are structured enough that full Bayesian nonparametric inference scales linearly with the sample size, without resorting to approximations. Our approach is analogous to MoE models in that, when the input space is one-dimensional, a string GP can be regarded as a collaboration of local GP experts on non-overlapping supports, that implicitly exchange messages with one another, and that are independent conditional on the aforementioned messages.
Each local GP expert shares just enough information with adjacent local GP experts for the whole stochastic process to be sufficiently smooth (for instance continuously differentiable), which is an important improvement over MoE models, as the latter generate discontinuous latent functions. These messages will take the form of boundary conditions, conditional on which each local GP expert will be independent from any other local GP expert. Crucially, unlike the BCM, the gPoE and the rBCM, we do not assume that local GP experts share the same prior structure (that is, mean function, covariance function, or hyper-parameters). This allows each local GP expert to flexibly learn local patterns from the data if there are any, while preserving global smoothness, which will result in improved accuracy. Similarly to MoEs, the computational gain in our approach stems from the fact that the conditional independence of the local GP experts conditional on shared boundary conditions will enable us to write the joint distribution over function and derivative values at a large number of inputs as the product of pdfs of much smaller Gaussian vectors. The resulting effect on time complexity is a decrease from $O(N^3)$ to $O(\max_k n_k^3)$, where $N = \sum_k n_k$ and $n_k \ll N$. In fact, in Section 5 we will propose Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC) inference methods that achieve memory requirement and time complexity O(N), without any loss of flexibility. All these results are preserved by our extension of string GPs to multivariate input spaces, which we will occasionally refer to as membrane Gaussian processes (or membrane GPs). Unlike the BCM, the gPoE and the rBCM, the approach we propose in this paper, which we will refer to as the string GP paradigm, is Kolmogorov consistent, and enables principled inference of the posterior distribution over the values of the latent function at multiple test inputs.
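The computational benefit of this kind of conditional independence can be seen on a toy Gauss-Markov chain (our own example, not from the paper): the joint Gaussian density factorises into small conditional pdfs, so evaluating it requires only O(K) work rather than the O(K^3) cost of working with the full joint covariance matrix.

```python
import numpy as np

# A Gauss-Markov chain x_{k+1} = a x_k + noise: its joint density factorises
# into K+1 small conditional Gaussian pdfs. We check the factorisation
# against the full joint density.
a, q, K = 0.8, 0.5, 6
rng = np.random.default_rng(4)
x = np.empty(K + 1)
x[0] = rng.normal(scale=1.0)
for k in range(K):
    x[k + 1] = a * x[k] + rng.normal(scale=np.sqrt(q))

def log_norm(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

# Product of small conditionals: O(K) work.
log_factorised = log_norm(x[0], 0.0, 1.0) + sum(
    log_norm(x[k + 1], a * x[k], q) for k in range(K))

# Full joint: build the (K+1)x(K+1) covariance and evaluate it: O(K^3) work.
idx = np.arange(K + 1)
var_t = np.array([a**(2 * k) + (0.0 if k == 0 else q * (1 - a**(2 * k)) / (1 - a**2))
                  for k in idx])
C = np.array([[a**abs(i - j) * var_t[min(i, j)] for j in idx] for i in idx])
sign, logdet = np.linalg.slogdet(C)
log_joint = -0.5 * ((K + 1) * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(C, x))

print(abs(log_factorised - log_joint))  # agrees to numerical precision
```

String GPs exploit the same structure: only the boundary conditions couple adjacent strings, so the joint pdf over all function and derivative values splits into small factors.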
3. Construction of String and Membrane Gaussian Processes

In this section we formally construct string Gaussian processes, and we provide some important theoretical results including smoothness, and the joint law of string GPs and their gradients. We construct string GPs indexed on $\mathbb{R}$, before generalising to string GPs indexed on $\mathbb{R}^d$, which we will occasionally refer to as membrane GPs to stress that the input space is multivariate. We start by considering the joint law of a differentiable GP on an interval and its derivative, and introducing some related notions that we will use in the construction of string GPs.

Proposition 1 (Derivative Gaussian processes) Let I be an interval, $k : I \times I \to \mathbb{R}$ a $C^2$ symmetric positive semi-definite function,3 and $m : I \to \mathbb{R}$ a $C^1$ function.

(A) There exists an $\mathbb{R}^2$-valued stochastic process $(D_t)_{t \in I}$, $D_t = (z_t, z'_t)$, such that for all $t_1, \ldots, t_n \in I$, $(z_{t_1}, \ldots, z_{t_n}, z'_{t_1}, \ldots, z'_{t_n})$ is a Gaussian vector with mean $\left(m(t_1), \ldots, m(t_n), \frac{dm}{dt}(t_1), \ldots, \frac{dm}{dt}(t_n)\right)$ and covariance matrix such that
$$\mathrm{cov}(z_{t_i}, z_{t_j}) = k(t_i, t_j), \quad \mathrm{cov}(z_{t_i}, z'_{t_j}) = \frac{\partial k}{\partial y}(t_i, t_j), \quad \mathrm{cov}(z'_{t_i}, z'_{t_j}) = \frac{\partial^2 k}{\partial x \partial y}(t_i, t_j),$$
where $\frac{\partial}{\partial x}$ (resp. $\frac{\partial}{\partial y}$) refers to the partial derivative with respect to the first (resp. second) variable of k. We herein refer to $(D_t)_{t \in I}$ as a derivative Gaussian process.

(B) $(z_t)_{t \in I}$ is a Gaussian process with mean function m, covariance function k, and that is $C^1$ in the $L^2$ (mean square) sense.

(C) $(z'_t)_{t \in I}$ is a Gaussian process with mean function $\frac{dm}{dt}$ and covariance function $\frac{\partial^2 k}{\partial x \partial y}$. Moreover, $(z'_t)_{t \in I}$ is the $L^2$ derivative of the process $(z_t)_{t \in I}$.

Proof Although this result is known in the Gaussian process community, we provide a proof for the curious reader in Appendix B.

3. $C^1$ (resp. $C^2$) functions denote functions that are once (resp. twice) continuously differentiable on their domains.

We will say of a kernel k that it is degenerate at a when a derivative Gaussian process $(z_t, z'_t)_{t \in I}$
with kernel k is such that $z_a$ and $z'_a$ are perfectly correlated,4 that is, $|\mathrm{corr}(z_a, z'_a)| = 1$. As an example, the linear kernel $k(u, v) = \sigma^2 (u - c)(v - c)$ is degenerate at 0. Moreover, we will say of a kernel k that it is degenerate at b given a when it is not degenerate at a and when the derivative Gaussian process $(z_t, z'_t)_{t \in I}$ with kernel k is such that the variances of $z_b$ and $z'_b$ conditional on $(z_a, z'_a)$ are both zero.5 For instance, the periodic kernel proposed by MacKay (1998) with period T is degenerate at u + T given u.

An important subclass of derivative Gaussian processes in our construction are the processes resulting from conditioning paths of a derivative Gaussian process to take specific values at certain times $(t_1, \ldots, t_c)$. We herein refer to those processes as conditional derivative Gaussian processes. As an illustration, when k is $C^3$ on $I \times I$ with $I = [a, b]$, and neither degenerate at a nor degenerate at b given a, the conditional derivative Gaussian process on $I = [a, b]$ with unconditional mean function m and unconditional covariance function k that is conditioned to start at $(\tilde{z}_a, \tilde{z}'_a)$ is the derivative Gaussian process with mean function
$$\forall t \in I, \quad m_c^a(t; \tilde{z}_a, \tilde{z}'_a) = m(t) + K_{t;a} K_{a;a}^{-1} \begin{pmatrix} \tilde{z}_a - m(a) \\ \tilde{z}'_a - \frac{dm}{dt}(a) \end{pmatrix} \quad (1)$$
and covariance function $k_c^a$ that reads
$$\forall t, s \in I, \quad k_c^a(t, s) = k(t, s) - K_{t;a} K_{a;a}^{-1} K_{s;a}^T, \quad (2)$$
where
$$K_{u;v} = \begin{pmatrix} k(u, v) & \frac{\partial k}{\partial y}(u, v) \\ \frac{\partial k}{\partial x}(u, v) & \frac{\partial^2 k}{\partial x \partial y}(u, v) \end{pmatrix}, \quad \text{and} \quad K_{t;a} = \begin{pmatrix} k(t, a) & \frac{\partial k}{\partial y}(t, a) \end{pmatrix}.$$
Similarly, when the process is conditioned to start at $(\tilde{z}_a, \tilde{z}'_a)$ and to end at $(\tilde{z}_b, \tilde{z}'_b)$, the mean function reads
$$\forall t \in I, \quad m_c^{a,b}(t; \tilde{z}_a, \tilde{z}'_a, \tilde{z}_b, \tilde{z}'_b) = m(t) + K_{t;(a,b)} K_{(a,b);(a,b)}^{-1} \begin{pmatrix} \tilde{z}_a - m(a) \\ \tilde{z}'_a - \frac{dm}{dt}(a) \\ \tilde{z}_b - m(b) \\ \tilde{z}'_b - \frac{dm}{dt}(b) \end{pmatrix} \quad (3)$$
and the covariance function $k_c^{a,b}$ reads
$$\forall t, s \in I, \quad k_c^{a,b}(t, s) = k(t, s) - K_{t;(a,b)} K_{(a,b);(a,b)}^{-1} K_{s;(a,b)}^T, \quad (4)$$
where
$$K_{(a,b);(a,b)} = \begin{pmatrix} K_{a;a} & K_{a;b} \\ K_{b;a} & K_{b;b} \end{pmatrix}, \quad \text{and} \quad K_{t;(a,b)} = \begin{pmatrix} K_{t;a} & K_{t;b} \end{pmatrix}.$$
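Proposition 1 and the conditioning formulas above translate directly into code. A minimal sketch under a squared exponential kernel (helper names are ours; the off-diagonal entries of each block are the analytical $\partial k/\partial y$ and $\partial^2 k/\partial x \partial y$ of that kernel), conditioning a derivative GP to start at (0, 0) and end at (1, 0) as in Figure 1:

```python
import numpy as np

var, ell = 1.0, 0.2

def K_block(u, v):
    """2x2 block K_{u;v} = [[k, dk/dy], [dk/dx, d2k/dxdy]] for the SE kernel."""
    r = u - v
    g = var * np.exp(-r**2 / (2 * ell**2))
    return np.array([[g, r / ell**2 * g],
                     [-r / ell**2 * g, (1 / ell**2 - r**2 / ell**4) * g]])

def cond_mean_cov(ts, a, b, bound):
    """Mean and covariance of (z_t, z'_t) at times ts, conditional on the
    boundary values bound = (z_a, z'_a, z_b, z'_b); zero prior mean assumed."""
    G = np.block([[K_block(a, a), K_block(a, b)],
                  [K_block(b, a), K_block(b, b)]])          # K_{(a,b);(a,b)}
    Lam = np.vstack([np.hstack([K_block(t, a), K_block(t, b)]) for t in ts])
    prior = np.block([[K_block(t, s) for s in ts] for t in ts])
    Gi = np.linalg.inv(G)
    mean = Lam @ Gi @ np.asarray(bound)                     # Equation (3)
    cov = prior - Lam @ Gi @ Lam.T                          # Equation (4)
    return mean, cov

ts = np.linspace(0.02, 0.98, 40)
# Condition to start at 0 with derivative 0, and end at 1.0 with derivative 0.0.
mean, cov = cond_mean_cov(ts, 0.0, 1.0, (0.0, 0.0, 1.0, 0.0))
rng = np.random.default_rng(2)
L = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))
path = (mean + L @ rng.standard_normal(len(mean)))[::2]     # z values only
```

Sampled paths produced this way all pass through the prescribed boundary values with the prescribed derivatives, while behaving like unconditional draws away from the boundaries.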
It is important to note that both $K_{a;a}$ and $K_{(a,b);(a,b)}$ are indeed invertible because the kernel is assumed to be neither degenerate at a nor degenerate at b given a. Hence, the support of $(z_a, z'_a, z_b, z'_b)$ is $\mathbb{R}^4$, and any function and derivative values can be used for conditioning. Figure 1 illustrates example independent draws from a conditional derivative Gaussian process.

4. Or equivalently when the Gaussian vector $(z_a, z'_a)$ is degenerate.
5. Or equivalently when the Gaussian vector $(z_a, z'_a)$ is not degenerate but $(z_a, z'_a, z_b, z'_b)$ is.

3.1 String Gaussian Processes on R

The intuition behind string Gaussian processes on an interval comes from the analogy of collaborative local GP experts, which we refer to as strings, that are connected but independent of each other conditional on some regularity boundary conditions. While each string is tasked with representing local patterns in the data, a string only shares the states of its extremities (value and derivative) with adjacent strings. Our aim is to preserve global smoothness and limit the amount of information shared between strings, thus reducing computational complexity. Furthermore, the conditional independence between strings will allow for distributed inference, greater flexibility and principled nonstationarity construction.

The following theorem, at the core of our framework, establishes that it is possible to connect together GPs on a partition of an interval I, in a manner consistent enough that the newly constructed stochastic object will be a stochastic process on I, and in a manner restrictive enough that any two connected GPs will share just enough information to ensure that the constructed stochastic process is continuously differentiable ($C^1$) on I in the $L^2$ sense.

Theorem 2 (String Gaussian process) Let $a_0 < \cdots < a_k < \cdots < a_K$, $I = [a_0, a_K]$, and let $p_{\mathcal{N}}(x; \mu, \Sigma)$ denote the multivariate Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$.
Furthermore, let $(m_k : [a_{k-1}, a_k] \to \mathbb{R})_{k \in [1..K]}$ be $C^1$ functions, and $(k_k : [a_{k-1}, a_k] \times [a_{k-1}, a_k] \to \mathbb{R})_{k \in [1..K]}$ be $C^3$ symmetric positive semi-definite functions, neither degenerate at $a_{k-1}$, nor degenerate at $a_k$ given $a_{k-1}$.

(A) There exists an $\mathbb{R}^2$-valued stochastic process $(SD_t)_{t \in I}$, $SD_t = (z_t, z'_t)$, satisfying the following conditions:

1) The probability density of $(SD_{a_0}, \ldots, SD_{a_K})$ reads:
$$p_b(x_0, \ldots, x_K) := \prod_{k=0}^{K} p_{\mathcal{N}}\left(x_k; \mu_k^b, \Sigma_k^b\right) \quad (5)$$
where:
$$\Sigma_0^b = {}_1K_{a_0;a_0}, \quad \forall k > 0, \ \Sigma_k^b = {}_kK_{a_k;a_k} - {}_kK_{a_k;a_{k-1}} \, {}_kK_{a_{k-1};a_{k-1}}^{-1} \, {}_kK_{a_k;a_{k-1}}^T, \quad (6)$$
$$\mu_0^b = {}_1M_{a_0}, \quad \forall k > 0, \ \mu_k^b = {}_kM_{a_k} + {}_kK_{a_k;a_{k-1}} \, {}_kK_{a_{k-1};a_{k-1}}^{-1} \left(x_{k-1} - {}_kM_{a_{k-1}}\right), \quad (7)$$
with
$${}_kK_{u;v} = \begin{pmatrix} k_k(u, v) & \frac{\partial k_k}{\partial y}(u, v) \\ \frac{\partial k_k}{\partial x}(u, v) & \frac{\partial^2 k_k}{\partial x \partial y}(u, v) \end{pmatrix}, \quad \text{and} \quad {}_kM_u = \begin{pmatrix} m_k(u) \\ \frac{dm_k}{dt}(u) \end{pmatrix}.$$

2) Conditional on $(SD_{a_k} = x_k)_{k \in [0..K]}$, the restrictions $(SD_t)_{t \in ]a_{k-1}, a_k[}$, $k \in [1..K]$, are independent conditional derivative Gaussian processes, respectively with unconditional mean function $m_k$ and unconditional covariance function $k_k$, and that are conditioned to take values $x_{k-1}$ and $x_k$ at $a_{k-1}$ and $a_k$ respectively. We refer to $(SD_t)_{t \in I}$ as a string derivative Gaussian process, and to its first coordinate $(z_t)_{t \in I}$ as a string Gaussian process, namely $(z_t)_{t \in I} \sim \mathcal{SGP}(\{a_k\}, \{m_k\}, \{k_k\})$.

(B) The string Gaussian process $(z_t)_{t \in I}$ defined in (A) is $C^1$ in the $L^2$ sense and its $L^2$ derivative is the process $(z'_t)_{t \in I}$ defined in (A).

Proof See Appendix C.

In our collaborative local GP experts analogy, Theorem 2 stipulates that each local expert takes as message from the previous expert its left hand side boundary conditions, conditional on which it generates its right hand side boundary conditions, which it then passes on to the next expert. Conditional on their boundary conditions, local experts are independent of each other, and resemble vibrating pieces of string on fixed extremities, hence the name string Gaussian process.

3.2 Pathwise Regularity

Thus far we have dealt with regularity only in the $L^2$ sense.
However, we note that a sufficient condition for the process $(z'_t)_{t \in I}$ in Theorem 2 to be almost surely continuous (i.e. sample paths are continuous with probability 1) and to be the almost sure derivative of the string Gaussian process $(z_t)_{t \in I}$, is that the Gaussian processes on $I_k = [a_{k-1}, a_k]$ with mean and covariance functions $m_{c;k}^{a_{k-1},a_k}$ and $k_{c;k}^{a_{k-1},a_k}$ (as per Equations 3 and 4 with $m := m_k$ and $k := k_k$) are themselves almost surely $C^1$ for every boundary condition.6 We refer to (Adler and Taylor, 2011, Theorem 2.5.2) for a sufficient condition under which a $C^1$ in $L^2$ Gaussian process is also almost surely $C^1$. As the above question is provably equivalent to that of the almost sure continuity of a Gaussian process (see Adler and Taylor, 2011, p. 30), Kolmogorov's continuity theorem (see Øksendal, 2003, Theorem 2.2.3) provides a more intuitive, albeit stronger, sufficient condition than that of (Adler and Taylor, 2011, Theorem 2.5.2).

3.3 Illustration

Algorithm 1 illustrates sampling jointly from a string Gaussian process and its derivative on an interval $I = [a_0, a_K]$. We start off by sampling the string boundary conditions $(z_{a_k}, z'_{a_k})$ sequentially, conditional on which we sample the values of the stochastic process on each string. This we may do in parallel, as the strings are independent of each other conditional on boundary conditions. The resulting time complexity is the sum of $O(\max_k n_k^3)$ for sampling values within strings, and $O(n)$ for sampling boundary conditions, where the sample size is $n = \sum_k n_k$. The memory requirement grows as the sum of $O(\sum_k n_k^2)$, required to store the conditional covariance matrices of the values within strings, and $O(K)$, corresponding to the storage of the covariance matrices of boundary conditions. In the special case where strings are all empty, that is, when inputs and boundary times coincide, the resulting time complexity and memory requirement are $O(n)$.
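Step 1 of Algorithm 1, the sequential sampling of boundary conditions via Equations (6) and (7), can be sketched as follows, here with zero mean functions and a single squared exponential kernel shared by all strings (helper names are ours):

```python
import numpy as np

var, ell = 1.0, 0.5

def K(u, v):
    """2x2 block kK_{u;v} for a zero-mean SE kernel shared by all strings."""
    r = u - v
    g = var * np.exp(-r**2 / (2 * ell**2))
    return np.array([[g, r / ell**2 * g],
                     [-r / ell**2 * g, (1 / ell**2 - r**2 / ell**4) * g]])

def sample_boundary_conditions(boundaries, rng):
    """Sample (z_{a_k}, z'_{a_k}) sequentially, as in Equations (6)-(7)."""
    xs = []
    x_prev = None
    for k, a in enumerate(boundaries):
        if k == 0:
            mu, Sigma = np.zeros(2), K(a, a)
        else:
            a_prev = boundaries[k - 1]
            gain = K(a, a_prev) @ np.linalg.inv(K(a_prev, a_prev))
            mu = gain @ x_prev                           # Equation (7)
            Sigma = K(a, a) - gain @ K(a, a_prev).T      # Equation (6)
        L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(2))
        x_prev = mu + L @ rng.standard_normal(2)
        xs.append(x_prev)
    return np.array(xs)

rng = np.random.default_rng(3)
bconds = sample_boundary_conditions([0.0, 1.0, 2.0, 3.0], rng)  # shape (K+1, 2)
```

Each iteration only conditions on the previous boundary, so the loop runs in O(K) time; Step 2 of Algorithm 1 then fills in the string interiors in parallel given these boundary values.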
Figure 2 illustrates a sample from a string Gaussian process, drawn using this approach.

3.4 String Gaussian Processes on Rd

So far the input space has been assumed to be an interval. We generalise string GPs to hyper-rectangles in $\mathbb{R}^d$ as stochastic processes of the form:
$$f(t_1, \dots, t_d) = \phi\left(z^1_{t_1}, \dots, z^d_{t_d}\right), \tag{10}$$
where the link function $\phi : \mathbb{R}^d \to \mathbb{R}$ is a $C^1$ function and the $(z^j_t)$ are $d$ independent latent string Gaussian processes on intervals. We will occasionally refer to string GPs indexed on $\mathbb{R}^d$ with $d > 1$ as membrane GPs to avoid any ambiguity. We note that when $d = 1$ and when the link function is $\phi(x) = x$, we recover string GPs indexed on an interval as previously defined.

6. The proof is provided in Appendix D.

Figure 1: Draws from a conditional derivative GP conditioned to start at 0 with derivative 0 and to finish at 1.0 with derivative 0.0. The unconditional kernel is the squared exponential kernel with variance 1.0 and input scale 0.2.

Figure 2: Draw from a string GP $(z_t)$ with 3 strings and its derivative $(z'_t)$, under squared exponential kernels (green and yellow strings), and the periodic kernel of MacKay (1998) (red string).

Algorithm 1 Simulation of a string derivative Gaussian process
Inputs: boundary times $a_0 < \dots < a_K$, string times $\{t^k_j \in\, ]a_{k-1}, a_k[\}_{j \in [1..n_k],\, k \in [1..K]}$, unconditional mean (resp. covariance) functions $m_k$ (resp. $k_k$).
Output: $\{\dots, z_{a_k}, z'_{a_k}, \dots, z_{t^k_j}, z'_{t^k_j}, \dots\}$.
Step 1: sample the boundary conditions sequentially.
for $k = 0$ to $K$ do
  Sample $(z_{a_k}, z'_{a_k}) \sim \mathcal{N}\left(\mu^b_k, \Sigma^b_k\right)$, with $\mu^b_k$ and $\Sigma^b_k$ as per Equations (7) and (6).
end for
Step 2: sample the values on each string conditional on the boundary conditions, in parallel.
parfor $k = 1$ to $K$ do
  Let ${}_kM_u$ and ${}_kK_{u;v}$ be as in Theorem 2, and let
$$\mu^s_k = \begin{bmatrix} {}_kM_{t^k_1} \\ \vdots \\ {}_kM_{t^k_{n_k}} \end{bmatrix} + \begin{bmatrix} {}_kK_{t^k_1;a_{k-1}} & {}_kK_{t^k_1;a_k} \\ \vdots & \vdots \\ {}_kK_{t^k_{n_k};a_{k-1}} & {}_kK_{t^k_{n_k};a_k} \end{bmatrix} \begin{bmatrix} {}_kK_{a_{k-1};a_{k-1}} & {}_kK_{a_{k-1};a_k} \\ {}_kK_{a_k;a_{k-1}} & {}_kK_{a_k;a_k} \end{bmatrix}^{-1} \begin{bmatrix} z_{a_{k-1}} - m_k(a_{k-1}) \\ z'_{a_{k-1}} - \frac{dm_k}{dt}(a_{k-1}) \\ z_{a_k} - m_k(a_k) \\ z'_{a_k} - \frac{dm_k}{dt}(a_k) \end{bmatrix},$$
$$\Sigma^s_k = \begin{bmatrix} {}_kK_{t^k_1;t^k_1} & \dots & {}_kK_{t^k_1;t^k_{n_k}} \\ \vdots & \ddots & \vdots \\ {}_kK_{t^k_{n_k};t^k_1} & \dots & {}_kK_{t^k_{n_k};t^k_{n_k}} \end{bmatrix} - \begin{bmatrix} {}_kK_{t^k_1;a_{k-1}} & {}_kK_{t^k_1;a_k} \\ \vdots & \vdots \\ {}_kK_{t^k_{n_k};a_{k-1}} & {}_kK_{t^k_{n_k};a_k} \end{bmatrix} \begin{bmatrix} {}_kK_{a_{k-1};a_{k-1}} & {}_kK_{a_{k-1};a_k} \\ {}_kK_{a_k;a_{k-1}} & {}_kK_{a_k;a_k} \end{bmatrix}^{-1} \begin{bmatrix} {}_kK_{t^k_1;a_{k-1}} & {}_kK_{t^k_1;a_k} \\ \vdots & \vdots \\ {}_kK_{t^k_{n_k};a_{k-1}} & {}_kK_{t^k_{n_k};a_k} \end{bmatrix}^T.$$
  Sample $\left(z_{t^k_1}, z'_{t^k_1}, \dots, z_{t^k_{n_k}}, z'_{t^k_{n_k}}\right) \sim \mathcal{N}(\mu^s_k, \Sigma^s_k)$.
end parfor

When the string GPs $(z^j_t)$ are a.s. $C^1$, the membrane GP $f$ in Equation (10) is also a.s. $C^1$, and the partial derivative with respect to the $j$-th coordinate reads:
$$\frac{\partial f}{\partial t_j}(t_1, \dots, t_d) = \frac{dz^j}{dt_j}(t_j)\, \frac{\partial \phi}{\partial x_j}\left(z^1_{t_1}, \dots, z^d_{t_d}\right). \tag{11}$$
Thus in high dimensions, string GPs easily allow an explanation of the sensitivity of the learned latent function to inputs.

3.5 Choice of Link Function

Our extension of string GPs to $\mathbb{R}^d$ departs from the standard GP paradigm in that we did not postulate a covariance function on $\mathbb{R}^d \times \mathbb{R}^d$ directly. Doing so usually requires using a metric on $\mathbb{R}^d$, which is often problematic for heterogeneous input dimensions, as it introduces an arbitrary comparison between distances in each input dimension. This problem has been partially addressed by approaches such as Automatic Relevance Determination (ARD) kernels, which allow a linear rescaling of input dimensions to be learned jointly with kernel hyper-parameters. However, inference under a string GP prior can be thought of as learning a coordinate system in which the latent function $f$ resembles the link function $\phi$ through a non-linear rescaling of input dimensions. In particular, when $\phi$ is symmetric, the learned univariate string GPs (being interchangeable in $\phi$) implicitly aim at normalizing input data across dimensions, making string GPs naturally cope with heterogeneous data sets. An important question arising from our extension is whether or not the link function $\phi$ needs to be learned to achieve a flexible functional prior.
The flexibility of a string GP as a functional prior depends on both the link function and the covariance structures of the underlying string GP building blocks $(z^j_t)$. To isolate the impact of the choice of $\phi$ on flexibility, we constrain the string GP building blocks by restricting them to be independent identically distributed string GPs with one string each (i.e. the $(z^j_t)$ are i.i.d. Gaussian processes). Furthermore, we restrict ourselves to isotropic kernels, as they provide a consistent basis for postulating the same covariance structure in $\mathbb{R}$ and $\mathbb{R}^d$. One question we might then ask, for a given link function $\phi_0$, is whether or not an isotropic GP indexed on $\mathbb{R}^d$ with covariance function $k$ yields more flexible random surfaces than the stationary string GP $f(t_1, \dots, t_d) = \phi_0(z^1_{t_1}, \dots, z^d_{t_d})$, where the $(z^j_{t_j})$ are stationary GPs indexed on $\mathbb{R}$ with the same covariance function $k$. If we find a link function $\phi_0$ generating more flexible random surfaces than its isotropic GP counterparts, it would suggest that $\phi$ need not be inferred in dimension $d > 1$ to be more flexible than any GP using one of the large number of commonly used isotropic kernels, among which squared exponential kernels, rational quadratic kernels, and Matérn kernels, to name but a few. Before discussing whether such a $\phi_0$ exists, we need to give a rigorous meaning to "flexibility". An intuitive qualitative definition of the flexibility of a stochastic process indexed on $\mathbb{R}^d$ is the ease with which it can generate surfaces with varying shapes from one random sample to another independent one. We recall that the tangent hyperplane to a $C^1$ surface $y - f(x) = 0$, $x \in \mathbb{R}^d$, at some point $x_0 = (t^0_1, \dots, t^0_d)$ has equation $\nabla f(x_0)^T (x - x_0) - (y - f(x_0)) = 0$ and admits $\left(\frac{\partial f}{\partial t_1}(x_0), \dots, \frac{\partial f}{\partial t_d}(x_0), -1\right)$ as normal vector.
As tangent hyperplanes approximate a surface locally, a first criterion of flexibility for a random surface $y - f(x) = 0$, $x \in \mathbb{R}^d$, is the proclivity of the (random) direction of its tangent hyperplane at any point $x$, and hence the proclivity of $\nabla f(x)$, to vary. This criterion alone, however, does not capture the difference between the local shapes of the random surface at two distinct points. A complementary second criterion of flexibility is the proclivity of the (random) directions of the tangent hyperplanes at any two distinct points $x_0, x_1 \in \mathbb{R}^d$, and hence the proclivity of $\nabla f(x_0)$ and $\nabla f(x_1)$, to be independent. The first criterion can be measured using the entropy of the gradient at a point, while the second criterion can be measured through the mutual information between the two gradients. The more flexible a stochastic process, the higher the entropy of its gradient at any point, and the lower the mutual information between its gradients at any two distinct points. This is formalised in the definition below.

Definition 3 (Flexibility of stochastic processes) Let $f$ and $g$ be two real-valued, almost surely $C^1$ stochastic processes indexed on $\mathbb{R}^d$, whose gradients have finite entropy everywhere (i.e. $\forall x$, $H(\nabla f(x)), H(\nabla g(x)) < \infty$). We say that $f$ is more flexible than $g$ if the following conditions are met:

1) $\forall x$, $H(\nabla f(x)) \ge H(\nabla g(x))$,
2) $\forall x \ne y$, $I(\nabla f(x); \nabla f(y)) \le I(\nabla g(x); \nabla g(y))$,

where $H$ is the entropy operator, and $I(X; Y) = H(X) + H(Y) - H(X, Y)$ stands for the mutual information between $X$ and $Y$.

The following proposition establishes that the link function $\phi_s(x_1, \dots, x_d) = \sum_{j=1}^d x_j$ yields more flexible stationary string GPs than their isotropic GP counterparts, thereby providing a theoretical underpinning for not inferring $\phi$.

Proposition 4 (Additively separable string GPs are flexible) Let $k(x, y) := \rho\left(||x - y||^2_{L_2}\right)$ be a stationary covariance function generating a.s. $C^1$ GP paths indexed on $\mathbb{R}^d$, $d > 0$, where $\rho$ is a function that is $C^2$ on $]0, +\infty[$ and continuous at 0.
Let $\phi_s(x_1, \dots, x_d) = \sum_{j=1}^d x_j$, let $(z^j_t)_{t \in I_j}$, $j \in [1..d]$, be independent stationary Gaussian processes with mean 0 and covariance function $k$ (where the $L_2$ norm is on $\mathbb{R}$), and let $f(t_1, \dots, t_d) = \phi_s(z^1_{t_1}, \dots, z^d_{t_d})$ be the corresponding stationary string GP. Finally, let $g$ be an isotropic Gaussian process indexed on $I_1 \times \dots \times I_d$ with mean 0 and covariance function $k$ (where the $L_2$ norm is on $\mathbb{R}^d$). Then:

1) $\forall x \in I_1 \times \dots \times I_d$, $H(\nabla f(x)) = H(\nabla g(x))$,
2) $\forall x \ne y \in I_1 \times \dots \times I_d$, $I(\nabla f(x); \nabla f(y)) \le I(\nabla g(x); \nabla g(y))$.

Proof See Appendix E.

Although the link function need not be inferred in a full nonparametric fashion to yield comparable, if not better, results than most isotropic kernels used in the standard GP paradigm, for some problems certain link functions might outperform others. In Section 4.2 we analyse a broad family of link functions, and argue that they extend successful anisotropic approaches such as Automatic Relevance Determination (MacKay (1998)) and the additive kernels of Duvenaud et al. (2011). Moreover, in Section 5 we propose a scalable inference scheme applicable to any link function.

4. Comparison with the Standard GP Paradigm

We have already established that sampling string GPs scales better than sampling GPs under the standard GP paradigm and is amenable to distributed computing. We have also established that stationary additively separable string GPs are more flexible than their isotropic counterparts in the standard GP paradigm. In this section, we provide further theoretical results relating the string GP paradigm to the standard GP paradigm. Firstly, we establish that string GPs with link function $\phi_s(x_1, \dots, x_d) = \sum_{j=1}^d x_j$ are GPs. Secondly, we derive the global mean and covariance functions induced by the string GP construction for a variety of link functions. Thirdly, we provide a sense in which the string GP paradigm can be thought of as extending the standard GP paradigm.
And finally, we show that the string GP paradigm may serve as a scalable approximation of commonly used stationary kernels.

4.1 Some String GPs are GPs

On the one hand, we note from Theorem 2 that the restriction of a string GP defined on an interval to the support of the first string (in other words, the first local GP expert) is a Gaussian process. On the other hand, the messages passed on from one local GP expert to the next are not necessarily consistent with the unconditional law of the receiving local expert, so that a string GP defined on an interval, when looked at globally and unconditionally, might not be a Gaussian process. However, the following proposition establishes that some string GPs are indeed Gaussian processes.

Proposition 5 (Additively separable string GPs are GPs) String Gaussian processes on $\mathbb{R}$ are Gaussian processes. Moreover, string Gaussian processes on $\mathbb{R}^d$ with link function $\phi_s(x_1, \dots, x_d) = \sum_{j=1}^d x_j$ are also Gaussian processes.

Proof The intuition behind this proof lies in the fact that if $X$ is a multivariate Gaussian, and if conditional on $X$, $Y$ is a multivariate Gaussian, then provided that the conditional mean of $Y$ depends linearly on $X$ and the conditional covariance matrix of $Y$ does not depend on $X$, the vector $(X, Y)$ is jointly Gaussian. This is indeed the case for our collaboration of local GP experts, as the boundary conditions picked up by an expert from the previous one do not influence the conditional covariance structure of the expert (the conditional covariance structure depends only on the partition of the domain, not on the values of the boundary conditions) and affect the mean linearly. See Appendix H for the full proof.

The above result guarantees that commonly used closed-form predictive equations under GP priors are still applicable under some string GP priors, provided the global mean and covariance functions, which we derive in the following section, are available.
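The Gaussian conditioning fact invoked in this proof sketch is standard; for convenience, in the notation of the paragraph above:

```latex
\text{If } X \sim \mathcal{N}(\mu, \Lambda) \text{ and } Y \mid X \sim \mathcal{N}(AX + b, \Sigma),
\text{ with } A, b, \Sigma \text{ not depending on } X, \text{ then}
\begin{pmatrix} X \\ Y \end{pmatrix} \sim
\mathcal{N}\left( \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix},
\begin{pmatrix} \Lambda & \Lambda A^T \\ A\Lambda & \Sigma + A\Lambda A^T \end{pmatrix} \right),
```

since $(X, Y) = (X, AX + b + \epsilon)$, with $\epsilon \sim \mathcal{N}(0, \Sigma)$ independent of $X$, is an affine transformation of the Gaussian vector $(X, \epsilon)$. Applying this fact inductively, with the accumulated boundary conditions playing the role of $X$ and each new string's values playing the role of $Y$, gives joint Gaussianity of the finite-dimensional marginals.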
Proposition 5 also guarantees stability of the corresponding string GPs in the GP family under addition of independent Gaussian noise terms, as in regression settings. Moreover, it follows from Proposition 5 that inference techniques developed for Gaussian processes can be readily used under string GP priors. In Section 5 we provide an additional MCMC scheme that exploits the conditional independence between strings to yield greater scalability and distributed inference.

4.2 String GP Kernels and String GP Mean Functions

The approach we have adopted in the construction of string GPs and membrane GPs did not require explicitly postulating a global mean function or covariance function. In Appendix I we derive the global mean and covariance functions that result from our construction. The global covariance function could be used as a stand-alone kernel in any kernel method, for instance in GP models under the standard GP paradigm, which would provide a flexible and nonstationary alternative to commonly used kernels that may be used to learn local patterns in data sets; some successful example applications are provided in Section 5. That being said, adopting such a global approach should be limited to small-scale problems, as the conditional independence structure of string GPs does not easily translate into structures in covariance matrices over string GP values (without derivative information) that can be exploited to speed up SVD or Cholesky decomposition. Crucially, marginalising out all derivative information in the distribution of derivative string GP values at some inputs would destroy any conditional independence structure, thereby limiting opportunities for scalable inference. In Section 5 we provide an RJ-MCMC inference scheme that fully exploits the conditional independence structure in string GPs and scales to very large data sets.
4.3 Connection Between Multivariate String GP Kernels and Existing Approaches

We recall that for $n \le d$, the $n$-th order elementary symmetric polynomial (Macdonald (1995)) is given by
$$e_0(x_1, \dots, x_d) := 1, \qquad \forall\, 1 \le n \le d, \quad e_n(x_1, \dots, x_d) := \sum_{1 \le j_1 < j_2 < \dots < j_n \le d} x_{j_1} \dots x_{j_n}.$$

|         | K = 2 |      |      | K = 4 |      |      | K = 8 |      |      | K = 16 |      |      |
|---------|-------|------|------|-------|------|------|-------|------|------|--------|------|------|
|         | min   | avg  | max  | min   | avg  | max  | min   | avg  | max  | min    | avg  | max  |
| SE      | 0     | 0.01 | 0.13 | 0     | 0.02 | 0.25 | 0     | 0.03 | 0.37 | 0      | 0.04 | 0.44 |
| RQ 1    | 0     | 0.01 | 0.09 | 0     | 0.03 | 0.20 | 0     | 0.05 | 0.37 | 0      | 0.07 | 0.52 |
| RQ 5    | 0     | 0.01 | 0.12 | 0     | 0.02 | 0.24 | 0     | 0.04 | 0.37 | 0      | 0.05 | 0.47 |
| MA 3/2  | 0     | 0    | 0    | 0     | 0    | 0    | 0     | 0    | 0    | 0      | 0    | 0    |
| MA 5/2  | 0     | 0.01 | 0.07 | 0     | 0.03 | 0.15 | 0     | 0.05 | 0.29 | 0      | 0.08 | 0.48 |

Table 1: Minimum, average, and maximum absolute errors between some commonly used stationary covariance functions on $[0, 1] \times [0, 1]$ (with unit variance and input scale 0.5) and their uniform string GP counterparts with $K > 1$ strings of equal length.

5. Inference under String and Membrane GP Priors

In this section we move on to developing techniques for Bayesian nonparametric inference of latent functions under string GP priors. We begin with marginal likelihood inference in regression problems. We then propose a novel reversible-jump MCMC sampler that enables automatic learning of model complexity (that is, the number of distinct unconditional kernel configurations) from the data, with a time complexity and memory requirement both linear in the number of training inputs.

5.1 Maximum Marginal Likelihood for Small Scale Regression Problems

Firstly, we leverage the fact that additively separable string GPs are Gaussian processes to perform Bayesian nonparametric regression in the presence of local patterns in the data, using standard Gaussian process techniques (see Rasmussen and Williams, 2006, Section 5.4.1, p. 112).
We use as generative model
$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}\left(0, \sigma^2_{k_i}\right), \quad \sigma^2_{k_i} > 0, \quad x_i \in I_1 \times \dots \times I_d, \quad y_i, \epsilon_i \in \mathbb{R},$$
we are given the training data set $\mathcal{D} = \{(x_i, y_i)\}_{i \in [1..N]}$, and we place a mean-zero additively separable string GP prior on $f$, namely
$$f(x) = \sum_{j=1}^d z^j_{x[j]}, \quad (z^j_t) \sim \mathcal{SGP}\left(\{a^j_k\}, \{0\}, \{k^j_k\}\right), \quad \forall\, j < l, \; (z^j_t) \perp (z^l_t),$$
which we assume to be independent of the measurement noise process. Moreover, the noise terms are assumed to be independent, and the noise variance $\sigma^2_{k_i}$ affecting $f(x_i)$ is assumed to be the same for any two inputs whose coordinates lie on the same string intervals. Such a heteroskedastic noise model fits nicely within the string GP paradigm, can be very useful when the dimension of the input space is small, and may be replaced by the typical constant noise variance assumption in high-dimensional input spaces. Let us define $y = (y_1, \dots, y_N)$, $X = (x_1, \dots, x_N)$, $f = (f(x_1), \dots, f(x_N))$, let $K_{X;X}$ denote the auto-covariance matrix of $f$ (which we have derived in Section 4.2), and let $D = \mathrm{diag}(\{\sigma^2_{k_i}\})$ denote the diagonal matrix of noise variances. It follows that $y$ is a Gaussian vector with mean 0 and auto-covariance matrix $K_y := K_{X;X} + D$, and that the log marginal likelihood reads:
$$\log p\left(y \mid X, \{\sigma_{k_i}\}, \{\theta^j_k\}, \{a^j_k\}\right) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log \det(K_y) - \frac{N}{2} \log 2\pi. \tag{13}$$
We obtain estimates of the string measurement noise standard deviations $\{\hat{\sigma}_{k_i}\}$ and estimates of the string hyper-parameters $\{\hat{\theta}^j_k\}$ by maximising the marginal likelihood for a given domain partition $\{a^j_k\}$, using gradient-based methods. We deduce the predictive mean and covariance matrix of the latent function values $f^*$ at test points $X^*$, from the estimates $\{\hat{\theta}^j_k\}, \{\hat{\sigma}_{k_i}\}$, as
$$\mathbb{E}(f^* \mid y) = K_{X^*;X} K_y^{-1} y \quad \text{and} \quad \mathrm{cov}(f^* \mid y) = K_{X^*;X^*} - K_{X^*;X} K_y^{-1} K_{X;X^*}, \tag{14}$$
using the fact that $(f^*, y)$ is jointly Gaussian, and that the cross-covariance matrix between $f^*$ and $y$ is $K_{X^*;X}$, as the additive measurement noise is assumed to be independent of the latent process $f$.
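Since additively separable string GPs are GPs, Equations (13) and (14) are the standard GP regression equations. A numerically stable sketch using a Cholesky factorisation follows, taking the kernel matrix as given (e.g. computed from the string GP kernel of Appendix I); the function names are ours:

```python
import numpy as np

def log_marginal_likelihood(y, K, noise_var):
    """Log marginal likelihood of Equation (13) for y ~ N(0, K + D), where
    K is the prior covariance at the training inputs and D = diag(noise_var)."""
    N = len(y)
    Ky = K + np.diag(noise_var)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()        # = 0.5 * log det(K_y)
            - 0.5 * N * np.log(2 * np.pi))

def posterior_predictive(y, K, K_star_X, K_star_star, noise_var):
    """Predictive mean and covariance of Equation (14) at test points."""
    Ky = K + np.diag(noise_var)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star_X @ alpha
    V = np.linalg.solve(L, K_star_X.T)        # V^T V = K_{*,X} K_y^{-1} K_{X,*}
    cov = K_star_star - V.T @ V
    return mean, cov
```

In practice one would maximise `log_marginal_likelihood` over the hyper-parameters $\{\theta^j_k\}$ and $\{\sigma_{k_i}\}$ with a gradient-based optimiser, as described above.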
5.1.1 REMARKS

The above analysis and equations still hold when a GP prior is placed on $f$ with one of the multivariate string GP kernels derived in Section 4.2 as covariance function. It is also worth noting from the derivation of string GP kernels in Appendix I that the marginal likelihood of Equation (13) is continuously differentiable in the locations of the boundary times. Thus, for a given number of boundary times, the positions of the boundary times can be determined as part of the marginal likelihood maximisation. The derivatives of the marginal log-likelihood (Equation 13) with respect to the aforementioned locations $\{a^j_k\}$ can be determined from the recursions of Appendix I, or approximated numerically by finite differences. The number of boundary times in each input dimension can then be learned by trading off model fit (the maximum marginal log-likelihood) and model simplicity (the number of boundary times, or model parameters), for instance using information criteria such as AIC and BIC. When the input dimension is large, it might be advantageous to further constrain the hypothesis space of boundary times before using information criteria, for instance by assuming that the number of boundary times is the same in each dimension. An alternative Bayesian nonparametric approach to learning the number of boundary times will be discussed in Section 5.4. This method of inference cannot exploit the structure of string GPs to speed up inference, and as a result it scales like the standard GP paradigm. In fact, any attempt to marginalize out univariate derivative processes, including in the prior, will inevitably destroy the conditional independence structure.
Another perspective on this observation is found by noting from the derivation of global string GP covariance functions in Appendix I that the conditional independence structure does not easily translate into a matrix structure that may be exploited to speed up matrix inversion, and that marginalizing out terms relating to derivative processes as in Equation (13) can only make things worse.

5.2 Generic Reversible-Jump MCMC Sampler for Large Scale Inference

More generally, we consider learning a smooth real-valued latent function $f$, defined on a $d$-dimensional hyper-rectangle, under a generative model with likelihood $p(\mathcal{D} \mid f, u)$, where $f$ denotes the values of $f$ at training input points and $u$ denotes other likelihood parameters that are not related to $f$. A large class of machine learning problems aiming at inferring a latent function have a likelihood model of this form. Examples include celebrated applications such as nonparametric regression and nonparametric binary classification problems, but also more recent applications such as learning from the data a profitable portfolio-generating function in stochastic portfolio theory (Karatzas and Fernholz (2009)). In particular, we do not assume that $p(\mathcal{D} \mid f, u)$ factorizes over training inputs. Extensions to likelihood models that depend on the values of multiple latent functions are straightforward and will be discussed in Section 5.3.

5.2.1 PRIOR SPECIFICATION

We place a prior $p(u)$ on the other likelihood parameters. For instance, in regression problems under a Gaussian noise model, $u$ can be the noise variance and we may choose $p(u)$ to be the inverse-Gamma distribution for conjugacy. We place a mean-zero string GP prior on $f$:
$$f(x) = \phi\left(z^1_{x[1]}, \dots, z^d_{x[d]}\right), \quad (z^j_t) \sim \mathcal{SGP}\left(\{a^j_k\}, \{0\}, \{k^j_k\}\right), \quad \forall\, j < l, \; (z^j_t) \perp (z^l_t). \tag{15}$$
As discussed in Section 3.5, the link function $\phi$ need not be inferred, as the symmetric sum was found to yield a sufficiently flexible functional prior.
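For concreteness, the symmetric sum link (and, more generally, the elementary symmetric polynomials of Section 4.3) can be implemented directly; a minimal sketch, with function names of our own choosing:

```python
import numpy as np

def phi_sum(z):
    """Additively separable link: phi_s(x_1, ..., x_d) = sum_j x_j."""
    return np.sum(z, axis=-1)

def elementary_symmetric(x, n):
    """n-th elementary symmetric polynomial e_n(x_1, ..., x_d), computed by
    dynamic programming over prefixes: processing one coordinate at a time,
    e[k] accumulates the sum of all k-fold products seen so far."""
    e = [1.0] + [0.0] * n
    for xi in x:
        # update in decreasing k so each xi is used at most once per product
        for k in range(n, 0, -1):
            e[k] += xi * e[k - 1]
    return e[n]
```

For instance, $e_1$ recovers the symmetric sum and $e_d$ the product of all coordinates.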
Nonetheless, in this section we do not impose any restriction on the link function $\phi$ other than continuous differentiability. Denoting $z$ the vector of univariate string GP processes and their derivatives, evaluated at all distinct input coordinate values, we may re-parametrize the likelihood as $p(\mathcal{D} \mid z, u)$, with the understanding that $f$ can be recovered from $z$ through the link function $\phi$. To complete our prior specification, we need to discuss the choice of boundary times $\{a^j_k\}$ and the choice of the corresponding unconditional kernel structures $\{k^j_k\}$. Before doing so, we would like to stress that the key requirements of our sampler are that i) it should decouple the need for scalability from the need for flexibility, ii) it should scale linearly with the number of training and test inputs, and iii) the user should be able to express prior views on model complexity/flexibility in an intuitive way, while the sampler remains able to validate or invalidate the prior model complexity from the data. While the motivations for the last two requirements are obvious, the first requirement is motivated by the fact that a massive data set may well be more homogeneous than a much smaller data set.

5.2.2 SCALABLE CHOICE OF BOUNDARY TIMES

To motivate our choice of boundary times that achieves great scalability, we first note that the evaluation of the likelihood, which will naturally be needed by the MCMC sampler, will typically have at least linear time complexity and linear memory requirement, as it will require performing computations that use each training sample at least once. Thus, the best we can hope to achieve overall is linear time complexity and linear memory requirement. Second, in MCMC schemes with functional priors, the time complexity and memory requirement for sampling from the posterior $p(f \mid \mathcal{D}) \propto p(\mathcal{D} \mid f)\, p(f)$ are often the same as the resource requirements for sampling from the prior $p(f)$, as evaluating the model likelihood is rarely the bottleneck.
Finally, we note from Algorithm 1 that, when each input coordinate in each dimension is a boundary time, the sampling scheme has time complexity and memory requirement that are linear in the maximum number of unique input coordinates across dimensions, which is at most the number of training samples. In effect, each univariate derivative string GP is sampled in parallel at as many times as there are unique input coordinates in that dimension, before being combined through the link function. In a given input dimension, univariate derivative string GP values are sampled sequentially, one boundary time conditional on the previous. The foregoing sampling operation is very scalable, not only asymptotically but also in absolute terms; it merely requires storing and inverting at most as many $2 \times 2$ matrices as there are input points. We will evaluate the actual overall time complexity and memory requirement when we discuss our MCMC sampler in greater detail. For now, we would like to stress that i) choosing each distinct input coordinate value as a boundary time in the corresponding input dimension before training is a perfectly valid choice, ii) we expect this choice to result in resource requirements that grow linearly with the sample size, and iii) in the string GP theory we have developed thus far there is no requirement that two adjacent strings be driven by different kernel hyper-parameters.

5.2.3 MODEL COMPLEXITY LEARNING AS A CHANGE-POINT PROBLEM

Remark iii) above pertains to model complexity. In the simplest case, all strings are driven by the same kernel and hyper-parameters, as was the case in Section 4.5, where we discussed how this setup departs from postulating the unconditional string covariance function $k^j_k$ globally, similarly to the standard GP paradigm. The more distinct unconditional covariance structures there are, the more complex the model is, as it may account for more types of local patterns.
Thus, we may identify model complexity with the number of distinct kernel configurations across input dimensions. In order to learn model complexity, we require that some (but not necessarily all) strings share their kernel configuration.7 Moreover, we require kernel membership to be dimension-specific, in that two strings in different input dimensions may not explicitly share a kernel configuration in the prior specification, although the posterior distributions over their hyper-parameters might be similar if the data support it. In each input dimension $j$, kernel membership is defined by a partition of the corresponding domain operated by a (possibly empty) set of change-points,8 as illustrated in Figure 4. When there is no change-point, as in Figure 4-(a), all strings are driven by the same kernel and hyper-parameters. Each change-point $c^j_p$ induces a new kernel configuration $\theta^j_p$ that is shared by all strings whose boundary times $a^j_k$ and $a^j_{k+1}$ both lie in $[c^j_p, c^j_{p+1}[$. When one or multiple change-points $c^j_p$ occur between two adjacent boundary times, as illustrated in Figures 4-(b-d), for instance $a^j_k \le c^j_p \le a^j_{k+1}$, the kernel configuration of the string defined on $[a^j_k, a^j_{k+1}]$ is that of the largest change-point that lies in $[a^j_k, a^j_{k+1}]$ (see for instance Figure 4-(d)). For consistency, we denote $\theta^j_0$ the kernel configuration driving the first string in the $j$-th dimension; it also drives strings that come before the first change-point, and all strings when there is no change-point. To place a prior on model complexity, it suffices to define a joint probability measure on the set of change-points and the corresponding kernel configurations. As kernel configurations are not shared across input dimensions, we choose these priors to be independent across input dimensions. Moreover, $\{c^j_p\}$ being a random collection of points on an interval whose number and positions are both random, it is de facto a point process (Daley and Vere-Jones (2008)).
To keep the prior specification of change-points uninformative, it is desirable that, conditional on the number of change-points, the positions of the change-points be i.i.d. uniform on the domain. As for the number of change-points, it is important that the support of its distribution not be bounded, so as to allow for an arbitrarily large model complexity if warranted. The two requirements above are satisfied by a homogeneous Poisson process or HPP (Daley and Vere-Jones (2008)) with constant intensity $\lambda_j$. More precisely, the prior probability measure on $(\{c^j_p, \theta^j_p\}, \lambda_j)$ is constructed as follows:
$$\lambda_j \sim \Gamma(\alpha_j, \beta_j), \quad \{c^j_p\} \mid \lambda_j \sim \mathrm{HPP}(\lambda_j), \quad \theta^j_p[i] \mid \{c^j_p\}, \lambda_j \overset{\text{i.i.d.}}{\sim} \log\mathcal{N}(0, \rho_j), \quad \forall\, (j, p) \ne (l, q), \; \theta^j_p \perp \theta^l_q, \tag{16}$$
where we choose the Gamma distribution $\Gamma$ as prior on the intensity $\lambda_j$ for conjugacy, we assume all kernel hyper-parameters are positive as is often the case in practice,9 the coordinates of the hyper-parameters of a kernel configuration are assumed i.i.d., and kernel hyper-parameters are assumed independent between kernel configurations. Denoting the domain of the $j$-th input $[a_j, b_j]$, it follows from applying the laws of total expectation and total variance to Equation (16) that the expected number of change-points in the $j$-th dimension under our prior is
$$\mathbb{E}\left[\#\{c^j_p\}\right] = (b_j - a_j)\, \frac{\alpha_j}{\beta_j},$$
and the variance of the number of change-points in the $j$-th dimension under our prior is
$$\mathrm{Var}\left[\#\{c^j_p\}\right] = (b_j - a_j)\, \frac{\alpha_j}{\beta_j}\left(1 + \frac{b_j - a_j}{\beta_j}\right).$$
The two equations above may guide the user when setting the parameters $\alpha_j$ and $\beta_j$. For instance, these values may be set so that the expected number of change-points in a given input dimension is a fixed fraction of the number of boundary times in that dimension, and so that the prior variance over the number of change-points is large enough that overall the prior is not too informative.

We could have taken a different approach to constructing our prior on change-points. In effect, assuming for the sake of the argument that the boundaries of the domain of the $j$-th input, namely $a_j$ and $b_j$, are the first and last change-points in that dimension, we note that the mapping
$$\left(\dots, c^j_p, \dots\right) \mapsto \left(\dots, p^j_p, \dots\right) := \left(\dots, \frac{c^j_{p+1} - c^j_p}{b_j - a_j}, \dots\right)$$
defines a bijection between the set of possible change-points in the $j$-th dimension and the set of all discrete probability distributions. Thus, we could have placed as prior on $(\dots, p^j_p, \dots)$ a Dirichlet process (Ferguson (1973)), a Pitman-Yor process (Pitman and Yor (1997)), more generally normalized completely random measures (Kingman (1967)), or any other probability distribution over partitions. We prefer the point process approach primarily because it provides an easier way of expressing prior belief about model complexity through the expected number of change-points $\mathbb{E}[\#\{c^j_p\}]$, while remaining uninformative about their positions. One might also be tempted to regard change-points in an input dimension $j$ as inducing a partition, not of the domain $[a_j, b_j]$, but of the set of boundary times $a^j_k$ in the same dimension, so that one may define a prior over kernel memberships through a prior over partitions of the set of boundary times. However, this approach would be inconsistent with the aim of learning local patterns in the data if the corresponding random measure is exchangeable. In effect, as boundary times are all input coordinates, local patterns may only arise in the data as a result of adjacent strings sharing kernel configurations. An exchangeable random measure would postulate a priori that two kernel membership assignments that have the same kernel configurations (i.e. the same number of configurations and the same set of hyper-parameters) and the same number of boundary times in each kernel cluster (although not exactly the same boundary times) are equally likely to occur, thereby possibly putting more probability mass on kernel membership assignments that do not respect boundary time adjacency. Unfortunately, exchangeable random measures (among which the Dirichlet process and the Pitman-Yor process) are by far more widely adopted in the machine learning community than non-exchangeable random measures.

7. That is, the functional form of the unconditional kernel $k^j_k$ and its hyper-parameters.
8. We would like to stress that change-points do not introduce new input points or boundary times, but solely define a partition of the domain of each input dimension.
9. This may easily be relaxed if needed, for instance by putting normal priors on parameters that may be negative and log-normal priors on positive parameters.

Figure 4: Effects of domain partition through change-points (coloured circles) on kernel membership. Each vertical bar corresponds to a distinct boundary time $a^j_k$. For the same collection of boundary times, we consider four scenarios: (a) no partition, (b) partition of the domain in two by a single change-point that does not coincide with any existing boundary time, (c) partition of the domain in three by two change-points, one of which coincides with an existing boundary time, and (d) partition of the domain in two by two distinct change-points. In each scenario, kernel membership is illustrated by colour-coding. The colour of the interval between two consecutive boundary times $a^j_k$ and $a^j_{k+1}$ reflects which kernel configuration drives the corresponding string; in particular, the colour of the vertical bar corresponding to boundary time $a^j_{k+1}$ determines which kernel configuration should be used to compute the conditional distribution of the value of the derivative string GP $(z^j_t, z'^j_t)$ at $a^j_{k+1}$ given its value at $a^j_k$.
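A draw from the change-point prior of Equation (16) for a single input dimension can be sketched as follows. This assumes the rate parameterisation of the Gamma prior and treats $\rho_j$ as a variance; the function names are ours:

```python
import numpy as np

def sample_changepoint_prior(a, b, alpha, beta, rho, n_hyper, rng):
    """One draw from the change-point prior of Equation (16) on [a, b]:
    intensity ~ Gamma(alpha, beta) (rate parameterisation), change-point
    count ~ Poisson(intensity * (b - a)), positions i.i.d. uniform, and each
    configuration of n_hyper kernel hyper-parameters i.i.d. log-normal(0, rho)."""
    lam = rng.gamma(alpha, 1.0 / beta)      # numpy's scale = 1 / rate
    n_cp = rng.poisson(lam * (b - a))
    changepoints = np.sort(rng.uniform(a, b, size=n_cp))
    # one configuration per change-point, plus theta_0 for the first string
    thetas = rng.lognormal(0.0, np.sqrt(rho), size=(n_cp + 1, n_hyper))
    return lam, changepoints, thetas
```

Conditionally on the count, the ordered uniform positions are exactly the points of a homogeneous Poisson process restricted to $[a, b]$, which is what makes this two-stage sampling valid.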
Thus, this approach might be perceived as overly complex. That being said, as noted by Foti and Williamson (2015), non-exchangeable normalized random measures may be regarded as Poisson point processes (with varying intensity functions) on some augmented spaces, which makes this choice of prior specification somewhat similar to, but stronger (that is, more informative) than, the one we adopt in this paper. Before deriving the sampling algorithm, it is worth noting that the prior defined in Equation (16) does not admit a density with respect to the same base measure,10 as the number of change-points $\#\{c^j_p\}$, and subsequently the number of kernel configurations, may vary from one sample to another. Nevertheless, the joint distribution over the data D and all other model parameters is well defined and, as we will see later, we may leverage reversible-jump MCMC techniques (Green (1995); Green and Hastie (2009)) to construct a Markov chain that converges to the posterior distribution.

5.2.4 OVERALL STRUCTURE OF THE MCMC SAMPLER

To ease notations, we denote c the set of all change-points in all input dimensions, $n = (\dots, \#\{c^j_p\}, \dots) \in \mathbb{N}^d$ the vector of the numbers of change-points in each input dimension, θ the set of kernel hyper-parameters,11 and $\rho := (\dots, \rho_j, \dots)$ the vector of variances of the independent log-normal priors on θ. We denote $\lambda := (\dots, \lambda_j, \dots)$ the vector of change-point intensities, $\alpha := (\dots, \alpha_j, \dots)$ and $\beta := (\dots, \beta_j, \dots)$ the vectors of parameters of the Gamma priors we placed on the change-point intensities across the d input dimensions, and we recall that u denotes the vector of likelihood parameters other than the values of the latent function f.
We would like to sample from the posterior distribution $p(f, f^*, \nabla f, \nabla f^* \mid D, \alpha, \beta, \rho)$, where f and f* are the vectors of values of the latent function f at training and test inputs respectively, and ∇f, ∇f* the corresponding gradients. Denoting z the vector of univariate string GP processes and their derivatives, evaluated at all distinct training and test input coordinate values, we note that to sample from $p(f, f^*, \nabla f, \nabla f^* \mid D, \alpha, \beta, \rho)$ it suffices to sample from $p(z \mid D, \alpha, \beta, \rho)$, compute f and f* using the link function, and compute the gradients using Equation (11). To sample from $p(z \mid D, \alpha, \beta, \rho)$, we may sample from the target distribution

$$\pi(n, c, \theta, \lambda, z, u) := p(n, c, \theta, \lambda, z, u \mid D, \alpha, \beta, \rho),    (19)$$

and discard variables that are not of interest. As previously discussed, π is not absolutely continuous with respect to the same base measure, though we may still decompose it as

$$\pi(n, c, \theta, \lambda, z, u) = \frac{1}{p(D \mid \alpha, \beta, \rho)}\, p(n \mid \lambda)\, p(\lambda \mid \alpha, \beta)\, p(c \mid n)\, p(\theta \mid n, \rho)\, p(u)\, p(z \mid c, \theta)\, p(D \mid z, u),    (20)$$

where we use the notation p(.) and p(.|.) to denote probability measures rather than probability density functions or probability mass functions, and where product and scaling operations are the usual measure operations. Before proceeding any further, we will introduce a slight re-parametrization of Equation (20) that will improve the inference scheme. Let $n_a = (\dots, \#\{a^j_k\}_k, \dots)$ be the vector of the numbers of unique boundary times in all d input dimensions. We recall from our prior on f that

$$p(z \mid c, \theta) = \prod_{j=1}^{d} p\!\left(z^j_{a^j_0}, z'^j_{a^j_0}\right) \prod_{k \geq 1} p\!\left(z^j_{a^j_k}, z'^j_{a^j_k} \,\middle|\, z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}}\right),$$

10. That is, the joint prior probability measure is neither discrete nor continuous. 11. To simplify the exposition, we assume without loss of generality that each kernel configuration has the same kernel functional form, so that configurations are defined by kernel hyper-parameters.
where each factor in the decomposition above is a bivariate Gaussian density whose mean vector and covariance matrix are obtained from the partitions c, the kernel hyper-parameters θ, and the kernel membership scheme described in Section 5.2.3 and illustrated in Figure 4, using Equations (6-7). Let ${}^j_k K_{u;v}$ be the unconditional covariance matrix between $(z^j_u, z'^j_u)$ and $(z^j_v, z'^j_v)$ as per the unconditional kernel structure driving the string defined on the interval $[a^j_k, a^j_{k+1}[$. Let

$$\Sigma^j_0 := {}^j_0 K_{a^j_0; a^j_0}$$

be the auto-covariance matrix of $(z^j_{a^j_0}, z'^j_{a^j_0})$, let

$$\Sigma^j_k := {}^j_k K_{a^j_k; a^j_k} - {}^j_k K_{a^j_k; a^j_{k-1}} \; {}^j_k K^{-1}_{a^j_{k-1}; a^j_{k-1}} \; {}^j_k K^{T}_{a^j_k; a^j_{k-1}}$$

be the covariance matrix of $(z^j_{a^j_k}, z'^j_{a^j_k})$ given $(z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}})$, and let

$$M^j_k := {}^j_k K_{a^j_k; a^j_{k-1}} \; {}^j_k K^{-1}_{a^j_{k-1}; a^j_{k-1}}.$$

Finally, let $L^j_k := U^j_k (D^j_k)^{\frac{1}{2}}$, with $\Sigma^j_k = U^j_k D^j_k (U^j_k)^T$ the singular value decomposition (SVD) of $\Sigma^j_k$. We may choose to represent $(z^j_{a^j_0}, z'^j_{a^j_0})$ as

$$\left(z^j_{a^j_0}, z'^j_{a^j_0}\right) = L^j_0 x^j_0,    (22)$$

and for k > 0 we may also choose to represent $(z^j_{a^j_k}, z'^j_{a^j_k})$ as

$$\left(z^j_{a^j_k}, z'^j_{a^j_k}\right) = M^j_k \left(z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}}\right) + L^j_k x^j_k,    (23)$$

where $\{x^j_k\}$ are independent bivariate standard normal vectors. Equations (22-23) provide an equivalent representation of z. In effect, we recall that if Z = M + LX, where X ∼ N(0, I) is a standard multivariate Gaussian, M is a real vector, and L is a real matrix, then Z ∼ N(M, LL^T). Equations (22-23) result from applying this result to $(z^j_{a^j_0}, z'^j_{a^j_0})$ and to $(z^j_{a^j_k}, z'^j_{a^j_k})$ given $(z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}})$, and we note that at training time $M^j_k$ and $L^j_k$ only depend on kernel hyper-parameters. Denoting x the vector of all $x^j_k$, x is a so-called whitened representation of z, which we prefer for reasons we will discuss shortly.
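As a concrete illustration of the recursion in Equations (22-23), the following minimal NumPy sketch (function names are ours, not the paper's) builds the SVD-based factor $L^j_k$ and recovers z from a whitened vector x; the SVD route is used precisely because it tolerates degenerate $\Sigma^j_k$:

```python
import numpy as np

def svd_factor(Sigma):
    """Return L = U D^{1/2} where Sigma = U D U^T is the SVD of a
    symmetric PSD matrix; robust to (near-)singular Sigma since no
    inversion or Cholesky factorization is required."""
    U, d, _ = np.linalg.svd(Sigma)
    return U * np.sqrt(np.clip(d, 0.0, None))

def whitened_to_z(Ms, Ls, xs):
    """Recover z from its whitened representation x along one dimension:
    z_0 = L_0 x_0 and z_k = M_k z_{k-1} + L_k x_k (Equations 22-23)."""
    z = [Ls[0] @ xs[0]]
    for M, L, x in zip(Ms, Ls[1:], xs[1:]):
        z.append(M @ z[-1] + L @ x)
    return z
```

Note that the update within a dimension is inherently sequential (each $z^j_{a^j_k}$ depends on its predecessor), whereas different input dimensions may be processed in parallel, which is the source of the O(N) time complexity with d computing cores.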
In the whitened representation, the target distribution π is re-parameterized as

$$\pi(n, c, \theta, \lambda, x, u) = \frac{1}{p(D \mid \alpha, \beta, \rho)}\, p(n \mid \lambda)\, p(\lambda \mid \alpha, \beta)\, p(c \mid n)\, p(\theta \mid n, \rho)\, p(u)\, p(x)\, p(D \mid x, c, \theta, u),    (24)$$

where the dependency of the likelihood term on the partitions and the hyper-parameters stems from the need to recover z, and subsequently f, from x through Equations (22) and (23). The whitened representation of Equation (24) has two primary advantages. Firstly, it is robust to ill-conditioning of $\Sigma^j_k$, which would typically occur when two adjacent boundary times are too close to each other. In the representation of Equation (20), as one needs to evaluate the density p(z|c, θ), ill-conditioning of $\Sigma^j_k$ would result in numerical instabilities. In contrast, in the whitened representation one needs to evaluate the density p(x), which is that of i.i.d. standard Gaussians and as such can be evaluated robustly. Moreover, the SVD required to evaluate $L^j_k$ is also robust to ill-conditioning of $\Sigma^j_k$, so that Equations (22) and (23) hold and can be robustly evaluated for degenerate Gaussians too. The second advantage of the whitened representation is that it improves mixing by establishing a link between kernel hyper-parameters and the likelihood. Equation (24) allows us to cast our inference problem as a Bayesian model selection problem under a countable family of models indexed by $n \in \mathbb{N}^d$, each defined on a different parameter subspace $C_n$, with cross-model normalizing constant p(D|α, β, ρ), model probability driven by p(n|λ)p(λ|α, β), model-specific prior p(c|n)p(θ|n, ρ)p(u)p(x), and likelihood p(D|x, c, θ, u). Critically, it can be seen from Equation (24) that the conditional probability distribution π(c, θ, λ, x, u|n) admits a density with respect to Lebesgue's measure on $C_n$.
Our setup is therefore analogous to that which motivated the seminal paper of Green (1995), so that to sample from the posterior π(c, θ, λ, x, u, n) we may use any Reversible-Jump Metropolis-Hastings (RJ-MH) scheme satisfying detailed balance and dimension-matching as described in Section 3.3 of Green (1995). To improve mixing of the Markov chain, we will alternate between a between-models RJ-MH update with target distribution π(n, c, θ, λ, x, u), and a within-model MCMC-within-Gibbs sampler with target distribution π(c, θ, λ, x, u|n). Constructing reversible-jump samplers by alternating between within-model sampling and between-models sampling is standard practice, and it is well known that doing so yields a Markov chain that converges to the target distribution of interest (see Brooks et al., 2011, p. 50). In a slight abuse of notation, in the following we might use the notations p(.|.) and p(.), which we previously used to denote probability measures, to refer to the corresponding probability density functions or probability mass functions.

5.2.5 WITHIN-MODEL UPDATES

We recall from Equation (24) that c, θ, λ, x, u | D, α, β, ρ, n has probability density function

$$p(n \mid \lambda)\, p(\lambda \mid \alpha, \beta)\, p(c \mid n)\, p(\theta \mid n, \rho)\, p(u)\, p(x)\, p(D \mid x, c, \theta, u),    (25)$$

up to a normalizing constant.

Updating λ: By independence of the priors over (λ[j], n[j]), the distributions λ[j] | n[j] are also independent, so that the updates may be performed in parallel. Moreover, recalling that the prior number of change-points in the j-th input dimension is Poisson distributed with intensity $\lambda[j](b_j - a_j)$, and by conjugacy of the Gamma distribution to the Poisson likelihood, it follows that

$$\lambda[j] \mid n[j] \sim \Gamma\big(n[j] + \alpha[j],\ (b_j - a_j) + \beta[j]\big).    (26)$$

This update step has memory requirement and time complexity both constant in the number of training and test samples.

Updating u: When the likelihood has additional parameters u, they may be updated with a Metropolis-Hastings step.
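The Gibbs update for λ[j] can be sketched in a few lines (a minimal illustration under the rate parameterization of the Gamma distribution, with Equation 26 reconstructed from the stated Poisson-Gamma conjugacy; the function name is ours):

```python
import numpy as np

def sample_lambda(n_j, a_j, b_j, alpha_j, beta_j, rng):
    """Gibbs update for the change-point intensity of dimension j:
    with n_j | lam ~ Poisson(lam * (b_j - a_j)) and lam ~ Gamma(alpha_j,
    rate=beta_j), conjugacy gives
    lam | n_j ~ Gamma(n_j + alpha_j, rate=(b_j - a_j) + beta_j).
    NumPy's gamma sampler is parameterized by shape and scale = 1/rate."""
    return rng.gamma(shape=n_j + alpha_j, scale=1.0 / ((b_j - a_j) + beta_j))
```

The cost of this draw does not depend on the data size, which is why the λ update is constant-time per dimension and trivially parallel across the d dimensions.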
Denoting $q(u \to u')$ the proposal probability density function, the acceptance ratio reads

$$r_u = \min\left(1,\ \frac{p(u')\, p(D \mid x, c, \theta, u')\, q(u' \to u)}{p(u)\, p(D \mid x, c, \theta, u)\, q(u \to u')}\right).    (27)$$

In some cases, however, it might be possible and more convenient to choose p(u) to be conjugate to the likelihood p(D|x, c, θ, u). For instance, in regression problems under a Gaussian noise model, we may take u to be the noise variance, on which we may place an inverse-gamma prior. Either way, the computational bottleneck of this step is the evaluation of the likelihood p(D|x, c, θ, u'), which in most cases can be done with a time complexity and memory requirement that are both linear in the number of training samples.

Updating c: We update the positions of change-points sequentially using the Metropolis-Hastings algorithm, one input dimension j at a time, and for each input dimension we proceed in increasing order of change-points. The proposal new position for the change-point $c^j_p$ is sampled uniformly at random on the interval $]c^j_{p-1}, c^j_{p+1}[$, where $c^j_{p-1}$ (resp. $c^j_{p+1}$) is replaced by $a_j$ (resp. $b_j$) for the first (resp. last) change-point. The acceptance probability of this proposal is easily found to be

$$r_{c^j_p} = \min\left(1,\ \frac{p(D \mid x, c', \theta, u)}{p(D \mid x, c, \theta, u)}\right),    (28)$$

where c' is identical to c except for the change-point to update. This step requires computing the factors $\{L^j_k, M^j_k\}$ corresponding to inputs in the j-th dimension whose kernel configuration would change if the proposal were to be accepted, the corresponding vector of derivative string GP values z, and the observation likelihood under the proposal p(D|x, c', θ, u). The computational bottleneck of this step is therefore once again the evaluation of the new likelihood p(D|x, c', θ, u).

Updating x: The target conditional density of x is proportional to

$$p(x)\, p(D \mid x, c, \theta, u).    (29)$$

Recalling that p(x) is a multivariate standard normal, it follows that the form of Equation (29) makes it convenient to use elliptical slice sampling (Murray et al.
(2010)) to sample from the unnormalized conditional p(x)p(D|x, c, θ, u). The two bottlenecks of this update step are sampling a new proposal from p(x) and evaluating the likelihood p(D|x, c, θ, u). Sampling from the multivariate standard normal p(x) may be massively parallelized, for instance by using GPU Gaussian random number generators. When no parallelism is available, the overall time complexity reads $O\big(\sum_{j=1}^{d} n_a[j]\big)$, where we recall that $n_a[j]$ denotes the number of distinct training and testing input coordinates in the j-th dimension. In particular, if we denote N the total number of training and testing d-dimensional input samples, then $\sum_{j=1}^{d} n_a[j] \leq dN$, although for many classes of data sets with sparse input values, such as images, where each input (single-colour pixel value) may have at most 256 distinct values, we might have $\sum_{j=1}^{d} n_a[j] \ll dN$. As for the memory required to sample from p(x), it grows proportionally to the size of x, that is in $O\big(\sum_{j=1}^{d} n_a[j]\big)$. In regards to the evaluation of the likelihood p(D|x, c, θ, u), as previously discussed its resource requirements are application-specific, but it will typically have time complexity that grows in O(N) and memory requirement that grows in O(dN). For instance, the foregoing resource requirements always hold for i.i.d. observation models such as in nonparametric regression and nonparametric classification problems.

Updating θ: We note from Equation (25) that the conditional distribution of θ given everything else has unnormalized density

$$p(\theta \mid n, \rho)\, p(D \mid x, c, \theta, u),    (30)$$

which we may choose to represent as

$$p(\log \theta \mid n, \rho)\, p(D \mid x, c, \log \theta, u).    (31)$$

As we have put independent log-normal priors on the coordinates of θ (see Equation 16), we may once again use elliptical slice sampling to sample from log θ before taking the exponential.
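Both ESS updates above (for x and for log θ) rely on the same primitive: a transition targeting a density of the form N(·; 0, Σ) × likelihood. A minimal sketch of one elliptical slice sampling transition for a standard normal prior, following Murray et al. (2010) (the function name is ours), reads:

```python
import numpy as np

def elliptical_slice(x, log_lik, rng):
    """One elliptical slice sampling transition for a target proportional
    to N(x; 0, I) * exp(log_lik(x)); always returns an accepted state."""
    nu = rng.standard_normal(x.shape)           # auxiliary draw from the prior
    log_y = log_lik(x) + np.log(rng.uniform())  # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)       # initial angle and bracket
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        prop = x * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
        if log_lik(prop) > log_y:
            return prop
        # shrink the angle bracket towards 0 and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

ESS has no tuning parameters and never rejects, which is convenient here since each transition already pays for one or more likelihood evaluations.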
The time complexity of generating a new sample from p(log θ|n, ρ) will typically be at most linear in the total number of distinct kernel hyper-parameters. Overall, the bottleneck of this update is the evaluation of the likelihood p(D|x, c, log θ, u). In this update, the latter operation requires recomputing the factors $M^j_k$ and $L^j_k$ of Equations (22) and (23), which requires computing and taking the SVD of unrelated 2 × 2 matrices, computations we may perform in parallel. Once the foregoing factors have been computed, we evaluate z, the derivative string GP values at boundary times, parallelizing over input dimensions, and running a sequential update within an input dimension using Equations (22) and (23). Updating z therefore has time complexity that is, in the worst case where no distributed computing is available, O(dN), and O(N) when there are up to d computing cores. The foregoing time complexity will also be that of this update step, unless the observation likelihood is more expensive to evaluate. The memory requirement, as in previous updates, is O(dN).

Overall resource requirement: To summarize the previous remarks, the overall computational bottleneck of a within-model iteration is the evaluation of the likelihood p(D|x, c, θ, u). For i.i.d. observation models such as classification and regression problems for instance, the corresponding time complexity grows in O(N) when d computing cores are available, or O(dN) otherwise, and the memory requirement grows in O(dN).

5.2.6 BETWEEN-MODELS UPDATES

Our reversible-jump Metropolis-Hastings update proceeds as follows. We choose an input dimension, say j, uniformly at random. If j has no change-points, that is n[j] = 0, we randomly choose between not doing anything and adding a change-point, each outcome having the same probability. If n[j] > 0, we either do nothing, add a change-point, or delete a change-point, each outcome having the same probability of occurrence.
Whenever we choose not to do anything, the acceptance ratio is easily found to be one:

$$r_{j0} = 1.    (32)$$

Whenever we choose to add a change-point, we sample the position $c^j_*$ of the proposal new change-point uniformly at random on the domain $[a_j, b_j]$ of the j-th input dimension. This proposal will almost surely break an existing kernel membership cluster, say the p-th, into two; that is $c^j_p < c^j_* < c^j_{p+1}$, where we may have $a_j = c^j_p$ and/or $b_j = c^j_{p+1}$. In the event $c^j_*$ coincides with an existing change-point, which should happen with probability 0, we do nothing. When adding a change-point, we sample a new vector of hyper-parameters $\theta^j_*$ from the log-normal prior of Equation (16), and we propose as hyper-parameters for the tentative new clusters $[c^j_p, c^j_*[$ and $[c^j_*, c^j_{p+1}[$ the vectors $\theta^j_{\text{add-left}}$ and $\theta^j_{\text{add-right}}$ defined as

$$\log \theta^j_{\text{add-left}} := \cos(\alpha) \log \theta^j_p - \sin(\alpha) \log \theta^j_*    (33)$$

and

$$\log \theta^j_{\text{add-right}} := \sin(\alpha) \log \theta^j_p + \cos(\alpha) \log \theta^j_*    (34)$$

respectively, where $\alpha \in [0, \frac{\pi}{2}]$ and $\theta^j_p$ is the vector of hyper-parameters currently driving the kernel membership defined by the cluster $[c^j_p, c^j_{p+1}[$. We note that if $\theta^j_p$ is distributed as per the prior in Equation (16), then $\theta^j_{\text{add-left}}$ and $\theta^j_{\text{add-right}}$ are i.i.d. distributed as per the foregoing prior. More generally, this elliptical transformation determines the extent to which the new proposal kernel configurations should deviate from the current configuration $\theta^j_p$. α is restricted to $[0, \frac{\pi}{2}]$ so as to give a positive weight to the current vector of hyper-parameters $\theta^j_p$. When α = 0, the left hand-side cluster $[c^j_p, c^j_*[$ will fully exploit the current kernel configuration, while the right hand-side cluster $[c^j_*, c^j_{p+1}[$ will use the prior to explore a new set of hyper-parameters. When $\alpha = \frac{\pi}{2}$ the reverse occurs.
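The split transformation of Equations (33-34) and its inverse (used by the delete move) can be sketched as follows (a minimal NumPy illustration; function names are ours). Since the map is a rotation of the pair of log hyper-parameter vectors, it is exactly invertible and has unit Jacobian:

```python
import numpy as np

def split_hypers(log_theta_p, log_theta_new, alpha=np.pi / 4):
    """Elliptical proposal for an 'add change-point' move (Equations 33-34):
    rotate the current (log_theta_p) and freshly sampled (log_theta_new)
    log hyper-parameter vectors into left and right cluster proposals."""
    left = np.cos(alpha) * log_theta_p - np.sin(alpha) * log_theta_new
    right = np.sin(alpha) * log_theta_p + np.cos(alpha) * log_theta_new
    return left, right

def merge_hypers(left, right, alpha=np.pi / 4):
    """Inverse rotation used by the 'delete change-point' move
    (Equations 36-37): recovers the pair that split_hypers rotated."""
    merged = np.cos(alpha) * left + np.sin(alpha) * right
    residual = -np.sin(alpha) * left + np.cos(alpha) * right
    return merged, residual
```

Because a rotation preserves squared norms, applying `split_hypers` to two i.i.d. centred Gaussian vectors yields two vectors that are again i.i.d. centred Gaussian with the same variance, which is exactly the prior-preservation property noted in the text.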
To preserve symmetry between the left and right hand-side kernel configurations, we choose

$$\alpha = \frac{\pi}{4}.    (35)$$

Whenever we choose to delete a change-point, we choose an existing change-point uniformly at random, say $c^j_p$. Deleting $c^j_p$ would merge the clusters $[c^j_{p-1}, c^j_p[$ and $[c^j_p, c^j_{p+1}[$, where we may have $a_j = c^j_{p-1}$ and/or $b_j = c^j_{p+1}$. We propose as vector of hyper-parameters for the tentative merged cluster $[c^j_{p-1}, c^j_{p+1}[$ the vector $\theta^j_{\text{del-merged}}$ satisfying

$$\log \theta^j_{\text{del-merged}} = \cos(\alpha) \log \theta^j_{p-1} + \sin(\alpha) \log \theta^j_p,    (36)$$

which together with

$$\log \theta^j_{\text{del-}*} = -\sin(\alpha) \log \theta^j_{p-1} + \cos(\alpha) \log \theta^j_p    (37)$$

constitutes the inverse of the transformation defined by Equations (33) and (34). Whenever a proposal to add or delete a change-point occurs, the factors $L^j_k$ and $M^j_k$ that would be affected by the change in kernel membership structure are recomputed, and so are the affected coordinates of z. This scheme satisfies the reversibility and dimension-matching requirements of Green (1995). Moreover, the absolute value of the Jacobian of the mapping $(\log \theta^j_p, \log \theta^j_*) \to (\log \theta^j_{\text{add-left}}, \log \theta^j_{\text{add-right}})$ of the move to add a change-point in $[c^j_p, c^j_{p+1}[$ reads

$$\left|\frac{\partial\big(\log \theta^j_{\text{add-left}}, \log \theta^j_{\text{add-right}}\big)}{\partial\big(\log \theta^j_p, \log \theta^j_*\big)}\right| = 1.    (38)$$

Similarly, the absolute value of the Jacobian of the mapping corresponding to a move to delete change-point $c^j_p$, namely $(\log \theta^j_{p-1}, \log \theta^j_p) \to (\log \theta^j_{\text{del-merged}}, \log \theta^j_{\text{del-}*})$, reads

$$\left|\frac{\partial\big(\log \theta^j_{\text{del-merged}}, \log \theta^j_{\text{del-}*}\big)}{\partial\big(\log \theta^j_{p-1}, \log \theta^j_p\big)}\right| = 1.    (39)$$

Applying the standard result Equation (8) of Green (1995), the acceptance ratio of the move to add a change-point is found to be

$$r^j_+ = \min\left(1,\ \frac{p(D \mid x, c^+, \theta^+, u)}{p(D \mid x, c, \theta, u)} \times \frac{\lambda[j](b_j - a_j)}{1 + n[j]} \times \frac{p_{\log \theta^j}\big(\log \theta^j_{\text{add-left}}\big)\, p_{\log \theta^j}\big(\log \theta^j_{\text{add-right}}\big)}{p_{\log \theta^j}\big(\log \theta^j_p\big)\, p_{\log \theta^j}\big(\log \theta^j_*\big)}\right),    (40)$$

where $p_{\log \theta^j}$ is the prior over log hyper-parameters in the j-th input dimension (as per the prior specification Equation 16), which we recall is i.i.d.
centred Gaussian with variance ρ[j], and $c^+$ and $\theta^+$ denote the proposal new vector of change-points and the corresponding vector of hyper-parameters. The three coloured terms in the acceptance probability are very intuitive. The green term $\frac{p(D|x, c^+, \theta^+, u)}{p(D|x, c, \theta, u)}$ represents the fit improvement that would occur if the new proposal is accepted. In the red term $\frac{\lambda[j](b_j - a_j)}{1 + n[j]}$, $\lambda[j](b_j - a_j)$ represents the average number of change-points in the j-th input dimension as per the HPP prior, while 1 + n[j] corresponds to the proposal new number of change-points in the j-th dimension, so that the whole red term acts as a complexity regulariser. Finally, the blue term $\frac{p_{\log \theta^j}(\log \theta^j_{\text{add-left}})\, p_{\log \theta^j}(\log \theta^j_{\text{add-right}})}{p_{\log \theta^j}(\log \theta^j_p)\, p_{\log \theta^j}(\log \theta^j_*)}$ plays the role of hyper-parameter regulariser. Similarly, the acceptance ratio of the move to delete change-point $c^j_p$, thereby changing the number of change-points in the j-th input dimension from n[j] to n[j] − 1, is found to be

$$r^j_- = \min\left(1,\ \frac{p(D \mid x, c^-, \theta^-, u)}{p(D \mid x, c, \theta, u)} \times \frac{n[j]}{\lambda[j](b_j - a_j)} \times \frac{p_{\log \theta^j}\big(\log \theta^j_{\text{del-merged}}\big)\, p_{\log \theta^j}\big(\log \theta^j_{\text{del-}*}\big)}{p_{\log \theta^j}\big(\log \theta^j_{p-1}\big)\, p_{\log \theta^j}\big(\log \theta^j_p\big)}\right),    (41)$$

where $c^-$ and $\theta^-$ denote the proposal new vector of change-points and the corresponding vector of hyper-parameters. Once more, each coloured term plays the same intuitive role as its counterpart in Equation (40).

Overall resource requirement: The bottleneck of between-models updates is the evaluation of the new likelihoods p(D|x, c^+, θ^+, u) or p(D|x, c^-, θ^-, u), whose resource requirements, which are the same as those of within-model updates, we already discussed. Algorithm 2 summarises the proposed MCMC sampler.
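The three-way decomposition of the add-move acceptance ratio (fit improvement × complexity regulariser × hyper-parameter regulariser) can be made concrete in log space (a minimal sketch with hypothetical function names; the Jacobian term is 1 and therefore omitted):

```python
import numpy as np

def log_norm_pdf(x, var):
    """Log density of an i.i.d. centred Gaussian vector with variance var."""
    x = np.asarray(x, dtype=float)
    return -0.5 * float(np.sum(np.log(2.0 * np.pi * var) + x * x / var))

def log_accept_add(loglik_new, loglik_old, lam_j, a_j, b_j, n_j,
                   log_left, log_right, log_p, log_new, rho_j):
    """Log acceptance ratio of a move adding a change-point (Equation 40):
    fit improvement + complexity regulariser + hyper-parameter regulariser."""
    fit = loglik_new - loglik_old
    complexity = np.log(lam_j * (b_j - a_j)) - np.log(1.0 + n_j)
    hypers = (log_norm_pdf(log_left, rho_j) + log_norm_pdf(log_right, rho_j)
              - log_norm_pdf(log_p, rho_j) - log_norm_pdf(log_new, rho_j))
    return min(0.0, fit + complexity + hypers)
```

Note that when the proposals are built with the rotation of Equations (33-34), the hyper-parameter regulariser is exactly zero (the rotation preserves the i.i.d. Gaussian prior density), so the add move is effectively judged on fit versus complexity.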
5.3 Multi-Output Problems

Although we have restricted ourselves to cases where the likelihood model depends on a single real-valued function for brevity and to ease notations, cases where the likelihood depends on vector-valued functions, or equivalently multiple real-valued functions, present no additional theoretical or

Algorithm 2 MCMC sampler for nonparametric Bayesian inference of a real-valued latent function under a string GP prior
Inputs: Likelihood model p(D|f, u), link function φ, training data D, test inputs, type of unconditional kernel, prior parameters α, β, ρ.
Outputs: Posterior samples of the values of the latent function at training and test inputs f and f*, and the corresponding gradients ∇f and ∇f*.
Step 0: Set n = 0 and c = ∅, and sample θ, λ, x, u from their priors.
repeat
  Step 1: Perform a within-model update.
    1.1: Update each λ[j] by sampling from the Gamma distribution in Equation (26).
    1.2: Update u, the vector of other likelihood parameters, if any, using Metropolis-Hastings (MH) with proposal q and acceptance ratio Equation (27), or by sampling directly from the posterior when p(u) is conjugate to the likelihood model.
    1.3: Update θ using elliptical slice sampling (ESS) with target distribution Equation (31), and record the newly computed factors $\{L^j_k, M^j_k\}$ that relate z to its whitened representation x.
    1.4: Update x using ESS with target distribution Equation (29).
    1.5: Update change-point positions c sequentially using MH, drawing a proposal update for $c^j_p$ uniformly at random on $]c^j_{p-1}, c^j_{p+1}[$, and accepting the update with probability $r_{c^j_p}$ (defined Equation 28). On accept, update the factors $\{L^j_k, M^j_k\}$.
  Step 2: Perform a between-models update.
    2.1: Sample a dimension to update, say j, uniformly at random.
    2.2: Consider adding or deleting a change-point.
    if n[j] = 0 then
      Randomly choose to add a change-point with probability 1/2.
      if we should consider adding a change-point then
        Construct proposals to update following Section 5.2.6. Accept proposals with probability $r^j_+$ (see Equation 40).
      end if
    else
      Randomly choose to add/delete a change-point, each with probability 1/3.
      if we should consider adding a change-point then
        Construct proposals to update following Section 5.2.6. Accept proposals with probability $r^j_+$ (see Equation 40).
      else if we should consider deleting a change-point then
        Construct proposals to update following Section 5.2.6. Accept proposals with probability $r^j_-$ (see Equation 41).
      else
        Continue.
      end if
    end if
  Step 3: Compute f, f*, ∇f and ∇f*, first recovering z from x, and then recalling that $f(x) = \phi\big(z^1_{x[1]}, \dots, z^d_{x[d]}\big)$ and $\nabla f(x) = \big(z'^1_{x[1]} \frac{\partial \phi}{\partial x[1]}(x), \dots, z'^d_{x[d]} \frac{\partial \phi}{\partial x[d]}(x)\big)$.
until enough samples are generated after mixing.

practical challenge. We may simply put independent string GP priors on each of the latent functions. An MCMC sampler almost identical to the one introduced herein may be used to sample from the posterior. All that is required to adapt the proposed MCMC sampler to multi-output problems is to redefine z to include all univariate derivative string GP values across input dimensions and across latent functions, perform step 1.1 of Algorithm 2 for each of the latent functions, and update step 2.1 so as to sample uniformly at random not only what dimension to update but also what latent function. Previous analyses and derived acceptance ratios remain unchanged. The resource requirements of the resulting multi-output MCMC sampler on a problem with K latent functions and N training and test d-dimensional inputs are the same as those of the MCMC sampler for a single output (Algorithm 2) with N training and test dK-dimensional inputs. The time complexity is O(N) when dK computing cores are available, O(dKN) when no distributed computing is available, and the memory requirement becomes O(dKN).
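Step 3 of Algorithm 2 is a pure post-processing step; for one input x it can be sketched as follows (a minimal illustration with hypothetical function names, shown for the additively separable link function, for which each partial derivative of φ is 1):

```python
import numpy as np

def latent_value_and_grad(z, z_prime, phi, phi_partials):
    """Step 3 of Algorithm 2 for a single input x: given the univariate
    string GP values z = (z^1_{x[1]}, ..., z^d_{x[d]}) and the derivative
    values z', return f(x) = phi(z) and, by the chain rule, the gradient
    whose j-th coordinate is z'^j_{x[j]} times the j-th partial of phi."""
    z = np.asarray(z, dtype=float)
    z_prime = np.asarray(z_prime, dtype=float)
    return phi(z), z_prime * phi_partials(z)

# Additively separable link function phi(z) = sum_j z_j: grad f(x) = z'.
phi_sum = lambda z: float(np.sum(z))
phi_sum_partials = lambda z: np.ones_like(z)
```

This is why the gradient of the learned latent function is available essentially for free: no extra sampling is needed beyond the derivative string GP values z' already maintained by the sampler.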
5.4 Flashback to Small Scale GP Regressions with String GP Kernels

In Section 5.1 we discussed maximum marginal likelihood inference in Bayesian nonparametric regressions under additively separable string GP priors, or GP priors with string GP covariance functions. We proposed learning the positions of boundary times, conditional on their number, jointly with kernel hyper-parameters and noise variances by maximizing the marginal likelihood using gradient-based techniques. We then suggested learning the number of strings in each input dimension by trading off goodness-of-fit with model simplicity using information criteria such as AIC and BIC. In this section, we propose a fully Bayesian nonparametric alternative. Let us consider the Gaussian process regression model

$$y_i = f(x_i) + \epsilon_i, \quad f \sim GP\big(0, k_{SGP}(., .)\big), \quad \epsilon_i \sim N\big(0, \sigma^2\big),    (42)$$
$$x_i \in [a_1, b_1] \times \dots \times [a_d, b_d], \quad y_i, \epsilon_i \in \mathbb{R},    (43)$$

where $k_{SGP}$ is the covariance function of some string GP with boundary times $\{a^j_k\}$ and corresponding unconditional kernels $\{k^j_k\}$ in the j-th input dimension. It is worth stressing that we place a GP (not string GP) prior on the latent function f, but the covariance function of the GP is a string GP covariance function (as discussed in Section 4.2 and as derived in Appendix I). Of course, when the string GP covariance function $k_{SGP}$ is additively separable, the two functional priors are the same. However, we impose no restriction on the link function of the string GP that $k_{SGP}$ is the covariance function of, other than continuous differentiability. To make full Bayesian nonparametric inference, we may place on the boundary times $\{a^j_k\}$ independent homogeneous Poisson process priors, each with intensity $\lambda_j$. Similarly to the previous section (Equation 16), our full prior specification of the string GP kernel reads

$$\lambda_j \sim \Gamma(\alpha_j, \beta_j), \quad \{a^j_k\} \mid \lambda_j \sim HPP(\lambda_j), \quad \theta^j_k[i] \mid \{a^j_k\}, \lambda_j \overset{i.i.d.}{\sim} \log N(0, \rho_j), \quad \forall (j, k) \neq (l, p)\ \ \theta^j_k \perp \theta^l_p,$$

where $\theta^j_k$ is the vector of hyper-parameters driving the unconditional kernel $k^j_k$.
The method developed in the previous section and the resulting MCMC sampling scheme (Algorithm 2) may be reused to sample from the posterior over function values, pending the following two changes. First, the gradients ∇f and ∇f* are no longer necessary. Second, we may work with function values (f, f*) directly (that is, in the original as opposed to whitened space). The resulting (Gaussian) distribution of function values (f, f*) conditional on all other variables is then analytically derived using standard Gaussian identities, as is done in vanilla Gaussian process regression, so that the within-model update of (f, f*) is performed using a single draw from a multivariate Gaussian. This approach to model complexity learning is advantageous over the information criteria alternative of Section 5.1 in that it scales better with large input dimensions. Indeed, rather than performing complete maximum marginal likelihood inference a number of times that grows exponentially with the input dimension, the approach of this section alternates between exploring a new combination of numbers of kernel configurations in each input dimension, and exploring function values and kernel hyper-parameters (given their number). That being said, this approach should only be considered as an alternative to commonly used kernels for small scale regression problems to enable the learning of local patterns. Crucially, it scales as poorly as the standard GP paradigm, and Algorithm 2 should be preferred for large scale problems.

6. Experiments

We now move on to presenting empirical evidence for the efficacy of string GPs in coping with local patterns in data sets, and in doing so in a scalable manner. Firstly, we consider maximum marginal likelihood inference on two small scale problems exhibiting local patterns. We begin with a toy experiment that illustrates the limitations of the standard GP paradigm in extrapolating and interpolating simple local periodic patterns.
Then, we move on to comparing the accuracy of Bayesian nonparametric regression under a string GP prior to that of the standard Gaussian process regression model and existing mixture-of-experts alternatives on the motorcycle data set of Silverman (1985), commonly used for the local patterns and heteroskedasticity it exhibits. Finally, we illustrate the performance of the previously derived MCMC sampler on two large scale Bayesian inference problems, namely the prediction of U.S. commercial airline arrival delays of Hensman et al. (2013) and a new large scale dynamic asset allocation problem.

6.1 Extrapolation and Interpolation of Synthetic Local Patterns

In our first experiment, we illustrate a limitation of the standard approach consisting of postulating a global covariance structure on the domain, namely that this approach might result in unwanted global extrapolation of local patterns, and we show that this limitation is addressed by the string GP paradigm. To this aim, we use 2 toy regression problems. We consider the following functions:

$$f_0(t) = \begin{cases} \sin(60\pi t) & t \in [0, 0.5] \\ \frac{15}{4}\sin(16\pi t) & t \in \,]0.5, 1] \end{cases}, \qquad f_1(t) = \begin{cases} \sin(16\pi t) & t \in [0, 0.5] \\ \frac{1}{2}\sin(32\pi t) & t \in \,]0.5, 1] \end{cases}.    (45)$$

$f_0$ (resp. $f_1$) undergoes a sharp (resp. mild) change in frequency and amplitude at t = 0.5. We consider using their restrictions to [0.25, 0.75] for training. We sample those restrictions with frequency 300, and we would like to extrapolate the functions to the rest of their domains using Bayesian nonparametric regression. We compare marginal likelihood string GP regression models, as described in Section 5.1, to vanilla GP regression models using popular and expressive kernels. All string GP models have two strings, and the partition is learned in the marginal likelihood maximisation. Figure 5 illustrates plots of the posterior means for each kernel used, and Table 2 compares predictive errors.
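The two synthetic functions of the reconstructed Equation (45) and their training restrictions can be written down directly (a minimal NumPy sketch; variable names are ours):

```python
import numpy as np

def f0(t):
    """Sharp change in frequency and amplitude at t = 0.5 (Equation 45)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 0.5, np.sin(60.0 * np.pi * t),
                    (15.0 / 4.0) * np.sin(16.0 * np.pi * t))

def f1(t):
    """Mild change in frequency and amplitude at t = 0.5 (Equation 45)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 0.5, np.sin(16.0 * np.pi * t),
                    0.5 * np.sin(32.0 * np.pi * t))

# Training restrictions to [0.25, 0.75], sampled with frequency 300.
t_train = np.arange(0.25, 0.75, 1.0 / 300.0)
```

Note that both pieces of each function vanish at the change-point t = 0.5, so the latent function is continuous there even though its frequency and amplitude change abruptly.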
Overall, it can be noted that the string GP kernel with the periodic kernel (MacKay (1998)) as building block outperforms competing kernels, including the expressive spectral mixture kernel

$$k(r) = \sum_{k=1}^{K} \sigma_k^2 \exp\left(-2\pi^2 r^2 \gamma_k^2\right) \cos\left(2\pi r \mu_k\right)$$

of Wilson and Adams (2013) with K = 5 mixture components.12 The comparison between the spectral mixture kernel and the string spectral mixture kernel is of particular interest, since spectral mixture kernels are pointwise dense in the family of stationary kernels, and thus can be regarded as flexible enough for learning stationary kernels from the data. In our experiment, the string spectral mixture kernel with a single mixture component per string significantly outperforms the spectral mixture kernel with 5 mixture components. This intuitively can be attributed to the fact that, regardless of the number of mixture components in the spectral mixture kernel, the learned kernel must account for both types of patterns present in each training data set. Hence, each local extrapolation on each side of 0.5 will attempt to make use of both amplitudes and both frequencies evidenced in the corresponding training data set, and will struggle to recover the true local sine function. We would expect that the performance of the spectral mixture kernel in this experiment will not improve drastically as the number of mixture components increases. However, under a string GP prior, the left and right hand-side strings are independent conditional on the (unknown) boundary conditions. Therefore, when the string GP domain partition occurs at time 0.5, the training data set on [0.25, 0.5] influences the hyper-parameters of the string to the right of 0.5 only to the extent that both strings should agree on the value of the latent function and its derivative at 0.5. To see why this is a weaker condition, we consider the family of pairs of functions

$$\big(\alpha\omega_1 \sin(\omega_2 t),\ \alpha\omega_2 \sin(\omega_1 t)\big), \quad \omega_i = 2\pi k_i,\ k_i \in \mathbb{N},\ \alpha \in \mathbb{R}.$$
Such functions always have the same value and derivative at 0.5, regardless of their frequencies, and they are plausible GP paths under a spectral mixture kernel with one single mixture component ($\mu_k = k_i$ and $\gamma_k \ll 1$), and under a periodic kernel. As such, it is not surprising that extrapolation under a string spectral mixture kernel or a string periodic kernel should perform well. To further illustrate that string GPs are able to learn local patterns that GPs with commonly used and expressive kernels cannot, we consider interpolating two bivariate functions $f_2$ and $f_3$ that exhibit local patterns. The functions are defined for $u, v \in [0.0, 1.0]$ as:

$$f_2(u, v) = f_0(u) f_1(v), \qquad f_3(u, v) = \sqrt{f_0(u)^2 + f_1(v)^2}.    (46)$$

We consider recovering the original functions as the posterior mean of a GP regression model trained on $\big([0.0, 0.4] \cup [0.6, 1.0]\big) \times \big([0.0, 0.4] \cup [0.6, 1.0]\big)$. Each bivariate kernel used is a product of two univariate kernels in the same family, and we used standard Kronecker techniques to speed up inference (see Saatchi, 2011, p. 134). The univariate kernels we consider are the same as previously. Each univariate string GP kernel has one change-point (two strings) whose position is learned by maximum marginal likelihood. Results are illustrated in Figures 6 and 7. Once again, it can be seen that, unlike any competing kernel, the product of string periodic kernels recovers both functions almost perfectly. In particular, it is impressive to see that, despite $f_3$ not being a separable function, a product of string periodic kernels recovered it almost perfectly. The interpolations performed by the spectral mixture kernel (see Figures 6 and 7) provide further evidence for our previously developed narrative: the spectral mixture kernel tries to blend all local patterns found in the training data during the interpolation.

12. The sparse spectrum kernel of Lazaro-Gredilla et al. (2010) can be thought of as the special case $\gamma_k \ll 1$.
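The Kronecker speed-up mentioned above rests on the identity that a product kernel evaluated on a full grid factorizes as a Kronecker product, so matrix-vector products never require forming the full covariance matrix. A minimal sketch of the core identity (row-major NumPy convention; the function name is ours):

```python
import numpy as np

def kron_mvm(A, B, y):
    """Compute np.kron(A, B) @ y without forming the Kronecker product.
    With Y = y.reshape(nA, nB) in row-major order,
    (kron(A, B) @ y) reshaped to (nA, nB) equals A @ Y @ B.T,
    reducing an O((nA*nB)^2) product to two small matrix products."""
    Y = y.reshape(A.shape[0], B.shape[0])
    return (A @ Y @ B.T).ravel()
```

For a grid of nA × nB inputs this turns covariance matrix-vector products of size nA·nB into two dense products of sizes nA and nB, which is what makes the bivariate GP baselines in this experiment tractable.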
The periodic kernel learns a single global frequency characteristic of the whole data set, ignoring local patterns, while the squared exponential, Matérn and rational quadratic kernels merely attempt to perform interpolation by smoothing. Although we used synthetic data to ease illustrating our argument, it is reasonable to expect that in real-life problems, the bigger the data set, the more likely it is to contain local patterns that should not be interpreted as noise and yet are not indicative of the data set as a whole.

Figure 5: Extrapolation of two functions f0 and f1 through Bayesian nonparametric regression under string GP priors and vanilla GP priors with popular and expressive kernels. Each model is trained on [0.25, 0.5] and extrapolates to [0, 1.0].

| Kernel | Absolute Error (f0) | Absolute Error (f1) | Squared Error (f0) | Squared Error (f1) |
|---|---|---|---|---|
| Squared exponential | 1.44 ± 2.40 | 0.48 ± 0.58 | 3.50 ± 9.20 | 0.31 ± 0.64 |
| Rational quadratic | 1.39 ± 2.31 | 0.51 ± 0.83 | 3.28 ± 8.79 | 0.43 ± 1.15 |
| Matérn 3/2 | 1.63 ± 2.53 | 1.26 ± 1.37 | 4.26 ± 11.07 | 2.06 ± 3.55 |
| Matérn 5/2 | 1.75 ± 2.77 | 0.48 ± 0.58 | 5.00 ± 12.18 | 0.31 ± 0.64 |
| Periodic | 1.51 ± 2.45 | 0.53 ± 0.60 | 3.79 ± 9.62 | 0.37 ± 0.72 |
| Spec. Mix. (5 comp.) | 0.75 ± 1.15 | 0.39 ± 0.57 | 0.94 ± 2.46 | 0.24 ± 0.58 |
| String Spec. Mix. (2 strings, 1 comp.) | 0.23 ± 0.84 | 0.01 ± 0.03 | 0.21 ± 1.07 | 0.00 ± 0.00 |
| String Periodic | 0.02 ± 0.02 | 0.00 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 |

Table 2: Predictive accuracies in the extrapolation of the two functions f0 and f1 of Section 6.1 through Bayesian nonparametric regression under string GP priors and vanilla GP priors with popular and expressive kernels. Each model is trained on [0.25, 0.5] and extrapolates to [0, 1.0]. The predictive errors are reported as average ± 2 standard deviations.
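The periodic kernel discussed above admits a standard closed form, $k(r) = \sigma^2 \exp\left(-2\sin^2(\pi r / p)/\ell^2\right)$, where $p$ is the (single, global) period that a vanilla GP must fit to the whole data set. A minimal sketch follows; the parameterisation is the commonly used one and may differ in minor conventions from the one in the paper.

```python
import numpy as np

def periodic_kernel(t1, t2, sigma2=1.0, period=1.0, lengthscale=1.0):
    """Periodic kernel (after MacKay, 1998):
    k(r) = sigma2 * exp(-2 sin^2(pi r / period) / lengthscale^2)."""
    r = np.subtract.outer(np.atleast_1d(t1), np.atleast_1d(t2))
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / lengthscale**2)
```

Under a string periodic kernel, each string would carry its own `period` and `lengthscale`, which is what lets the model recover a different local frequency on each side of a change-point.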
6.2 Small Scale Heteroskedastic Regression

In our second experiment, we consider illustrating the advantage of the string GP paradigm over the standard GP paradigm, but also over the alternatives of Kim et al. (2005), Gramacy and Lee (2008), Tresp (2000) and Deisenroth and Ng (2015) that consist of considering independent GP experts on disjoint parts of the domain or handling disjoint subsets of the data. Using the motorcycle data set of Silverman (1985), commonly used for the local patterns and heteroskedasticity it exhibits, we show that our approach outperforms the aforementioned competing alternatives, thereby providing empirical evidence that the collaboration between consecutive GP experts introduced in the string GP paradigm vastly improves predictive accuracy and certainty in regression problems with local patterns. We also illustrate learning of the derivative of the latent function, solely from noisy measurements of the latent function. The observations consist of accelerometer readings taken through time in an experiment on the efficacy of crash helmets. It can be seen at a glance in Figure 8 that the data set exhibits roughly 4 regimes. Firstly, between 0ms and 15ms the acceleration was negligible. Secondly, the impact slowed down the helmet, resulting in a sharp deceleration between 15ms and 28ms. Thirdly, the helmet seems to have bounced back between 28ms and 32ms, before it finally gradually slowed down and came to a stop between 32ms and 60ms. It can also be noted that the measurement noise seems to have been higher in the second half of the experiment. We ran 50 independent random experiments, leaving out 5 points selected uniformly at random from the data set for prediction, the rest being used for training.
The models we considered include the vanilla GP regression model, the string GP regression model with marginal maximum likelihood inference as described in Section 5.1, mixtures of independent GP experts acting on disjoint subsets of the data both for training and testing, the Bayesian committee machine (Tresp (2000)), and the robust Bayesian committee machine (Deisenroth and Ng (2015)). We considered string GPs with 4 and 6 strings whose boundary times are learned as part of the maximum likelihood inference. For consistency, we used the resulting partitions of the domain to define the independent experts in the competing alternatives we considered. The Matérn 3/2 kernel was used throughout. The results are reported in Table 3. To gauge the ability of each model to capture the physics of the helmet crash experiment, we have also trained all models with all data points. The results are illustrated in Figures 8 and 9. It can be seen at a glance from Figure 9 that mixtures of independent GP experts are inappropriate for this experiment as i) the resulting posterior means exhibit discontinuities (for instance at t = 30ms and t = 40ms) that are inconsistent with the physics of the underlying phenomenon, and ii) they overfit the data towards the end. The foregoing discontinuities do not come as a surprise, as each GP regression expert acts on a specific subset of the domain that is disjoint from the ones used by the other experts, both for training and prediction. Thus, there is no guarantee of consistency between expert predictions at the boundaries of the domain partition. Another perspective on this observation is found in noting that postulating independent GP experts, each acting on an element of a partition of the domain, is equivalent to putting as prior on the whole function a stochastic process that is discontinuous at the boundaries of the partition.
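The boundary discontinuity of independent experts is easy to reproduce numerically. The sketch below, under our own illustrative setup (a plain RBF-kernel GP rather than the Matérn 3/2 used in the experiment), fits two experts on the two halves of [0, 1] and evaluates their posterior means just left and right of the boundary; the two values generally disagree, which is exactly the jump visible in Figure 9.

```python
import numpy as np

def gp_posterior_mean(X, y, Xs, ls=0.1, sig2=1.0, noise=1e-2):
    """Posterior mean of vanilla GP regression with an RBF kernel."""
    def k(a, b):
        return sig2 * np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls**2)
    K = k(X, X) + noise * np.eye(len(X))
    return k(Xs, X) @ np.linalg.solve(K, y)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(60)

# Independent experts on a partition of the domain: each expert only ever
# sees, and predicts on, its own subdomain.
left, right = X < 0.5, X >= 0.5
eps = 1e-3
m_left = gp_posterior_mean(X[left], y[left], np.array([0.5 - eps]))[0]
m_right = gp_posterior_mean(X[right], y[right], np.array([0.5 + eps]))[0]
gap = abs(m_left - m_right)  # generally nonzero: the stitched mean jumps at 0.5
```

String GPs remove this jump by construction, since adjacent strings are required to agree on the value and derivative of the latent function at the boundary.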
Thus, the posterior stochastic process should not be expected to be continuous at the boundaries of the partition either. This discontinuity issue is addressed by the Bayesian committee machine (BCM) and the robust Bayesian committee machine (rBCM) because, despite each independent expert being trained on a disjoint subset of the data, each expert is tasked with making predictions about all test inputs, not just the ones that fall into its input subspace. Each GP expert prediction is therefore continuous on the whole input domain,13 and the linear weighting schemes operated by the BCM and the rBCM on expert predictions to construct the overall predictive mean preserve continuity. However, we found that the BCM and the rBCM suffer from three pitfalls. First, we found them to be less accurate than any other alternative out-of-sample on this data set (see Table 3). Second, their predictions of latent function values are overly uncertain.

Figure 6: Extrapolation of a synthetic function f2 (top left corner), cropped in the middle for training (top right corner), using string GP regression and vanilla GP regression with various popular and expressive kernels.

Figure 7: Extrapolation of a synthetic function f3 (top left corner), cropped in the middle for training (top right corner), using string GP regression and vanilla GP regression with various popular and expressive kernels.
This might be due to the fact that, each GP expert being trained only with training samples that lie on its input subspace, its predictions about test inputs that lie farther away from its input subspace will typically be much more uncertain, so that, despite the weighting scheme of the Bayesian committee machine putting more mass on confident experts, the overall posterior variance over latent function values might still be much higher than in the standard GP paradigm for instance. This is well illustrated by both the last column of Table 3 and the BCM and rBCM plots in Figure 9. On the contrary, no string GP model suffers from this excess uncertainty problem. Third, the posterior means of the BCM, the rBCM and the vanilla GP regression exhibit oscillations towards the end (t > 40ms) that are inconsistent with the experimental setup; the increases in acceleration as the helmet slows down suggested by these posterior means would require an additional source of energy after the bounce. In addition to being more accurate and more certain about predictions than vanilla GP regression, the BCM and the rBCM (see Table 3), string GP regressions yield posterior mean acceleration profiles that are more consistent with the physics of the experiment: steady speed prior to the shock, followed by a deceleration resulting from the shock, a brief acceleration resulting from the change in direction after the bounce, and finally a smooth slow down due to the dissipation of kinetic energy. Moreover, unlike the vanilla GP regression, the rBCM and the BCM, string GP regressions yield smaller posterior variances towards the beginning and the end of the experiment than in the middle, which is consistent with the fact that the operator would be less uncertain about the acceleration at the beginning and at the end of the experiment; one would indeed expect the acceleration to be null at those times.
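The "linear weighting schemes" of the BCM and rBCM referred to above combine per-expert predictive means and variances at a test input by precision weighting. A minimal sketch of the aggregation rule, as usually stated (Tresp, 2000; Deisenroth and Ng, 2015); the function name is ours:

```python
import numpy as np

def bcm_combine(mu, var, prior_var, beta=None):
    """(robust) Bayesian committee machine aggregation at one test input.
    beta=None gives the plain BCM (beta_k = 1 for all experts); the rBCM
    instead uses beta_k = 0.5 * (log prior_var - log var_k), down-weighting
    uninformative experts."""
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    if beta is None:
        beta = np.ones_like(mu)
    # Combined precision: weighted expert precisions plus a prior correction
    # term that removes the prior counted multiple times.
    prec = np.sum(beta / var) + (1.0 - np.sum(beta)) / prior_var
    mean = np.sum(beta * mu / var) / prec
    return mean, 1.0 / prec
```

With a single expert the rule recovers that expert's prediction exactly; the excess-uncertainty pitfall discussed above arises when many experts, each far from the test input, contribute wide predictive variances.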
This desirable property can be attributed to the heteroskedasticity of the noise structure in the string GP regression model. We also learned the derivative of the latent acceleration with respect to time, purely from noisy acceleration measurements, using the joint law of a string GP and its derivative (Theorem 2). This is illustrated in Figure 8.

13. So long as the functional prior is continuous, which is the case here.

| Model | Training Log. lik. | Prediction Log. lik. | Absolute Error | Squared Error | Pred. Std (min/avg/max) |
|---|---|---|---|---|---|
| String GP (4 strings) | −388.36 ± 0.36 | −22.16 ± 0.41 | 15.70 ± 1.05 | 466.47 ± 50.74 | 0.70/2.25/3.39 |
| String GP (6 strings) | −367.21 ± 0.43 | −21.99 ± 0.37 | 15.89 ± 1.06 | 475.59 ± 51.95 | 0.64/2.21/3.46 |
| Vanilla GP | −420.69 ± 0.24 | −22.77 ± 0.24 | 16.84 ± 1.09 | 524.18 ± 58.33 | 2.66/3.09/4.94 |
| Mix. of 4 GPs | −388.37 ± 0.38 | −20.90 ± 0.38 | 16.61 ± 1.10 | 512.30 ± 56.08 | 1.67/2.85/4.59 |
| Mix. of 6 GPs | −369.05 ± 0.45 | −20.11 ± 0.45 | 16.05 ± 1.11 | 500.43 ± 58.26 | 0.62/2.83/4.63 |
| BCM with 4 GPs | −419.08 ± 0.30 | −22.94 ± 0.26 | 17.17 ± 1.13 | 538.94 ± 61.91 | 7.20/9.92/22.92 |
| BCM with 6 GPs | −422.15 ± 0.30 | −22.91 ± 0.26 | 16.93 ± 1.12 | 533.21 ± 61.78 | 7.09/9.93/25.10 |
| rBCM with 4 GPs | −419.08 ± 0.30 | −22.99 ± 0.27 | 17.29 ± 1.11 | 546.95 ± 61.21 | 5.86/9.08/27.52 |
| rBCM with 6 GPs | −422.15 ± 0.30 | −22.96 ± 0.28 | 16.79 ± 1.12 | 542.95 ± 61.95 | 5.15/8.61/29.15 |

Table 3: Performance comparison between string GPs, vanilla GPs, mixtures of independent GPs, the Bayesian committee machine (Tresp (2000)) and the robust Bayesian committee machine (Deisenroth and Ng (2015)) on the motorcycle data set of Silverman (1985). The Matérn 3/2 kernel was used throughout. The domain partitions were learned in the string GP experiments by maximum likelihood. The learned partitions were then reused to allocate data between GP experts in the other models. 50 random runs were performed, each run leaving 5 data points out for testing and using the rest for training. All results (except for predictive standard deviations) are reported as average over the 50 runs ± standard error.
The last column contains the minimum, average and maximum of the predictive standard deviation of the values of the latent (noise-free) function at all test points across random runs.

Figure 8: Posterior mean ± 2 predictive standard deviations on the motorcycle data set (see Silverman, 1985), under a Matérn 3/2 derivative string GP prior with 6 learned strings. The top panel shows the noisy acceleration measurements and the learned latent function. The bottom panel illustrates the derivative of the acceleration with respect to time, learned from noisy acceleration samples. Posterior credible bands are over the latent functions rather than the noisy measurements, and as such they do not include the measurement noise.

Figure 9: Bayesian nonparametric regressions on the motorcycle data set of Silverman (1985). Models compared are string GP regression, vanilla GP regression, mixtures of independent GP regression experts on a partition of the domain, the Bayesian committee machine (BCM) and the robust Bayesian committee machine (rBCM). Domain partitions were learned during string GP maximum likelihood inference (red vertical bars), and reused in the other experiments.
Blue stars are noisy samples, red lines are posterior means of the latent function, and grey bands correspond to ± 2 predictive standard deviations of the (noise-free) latent function about its posterior mean.

6.3 Large Scale Regression

To illustrate how our approach fares against competing alternatives on a standard large scale problem, we consider predicting arrival delays of commercial flights in the USA in 2008, as studied by Hensman et al. (2013). We choose the same covariates as in Hensman et al. (2013), namely the age of the aircraft (number of years since deployment), the distance that needs to be covered, the airtime, the departure time, the arrival time, the day of the week, the day of the month and the month. Unlike Hensman et al. (2013), who only considered commercial flights between January 2008 and April 2008, we consider commercial flights throughout the whole year, for a total of 5.93 million records. In addition to the whole data set, we also consider subsets so as to empirically illustrate the sensitivity of computational time to the number of samples. Selected subsets consist of 10,000, 100,000 and 1,000,000 records selected uniformly at random. For each data set, we use 2/3 of the records, selected uniformly at random, for training, and the remaining 1/3 for testing. In order to level the playing field between stationary and nonstationary approaches, we normalize training and testing data sets.14 As competing alternatives to string GPs, we consider the SVIGP of Hensman et al. (2013), the Bayesian committee machine (BCM) of Tresp (2000), and the robust Bayesian committee machine (rBCM) of Deisenroth and Ng (2015).
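The split-and-normalise protocol described above (and detailed in footnote 14) can be sketched as follows; note that, to avoid leakage, the test features are standardised with the in-sample statistics only:

```python
import numpy as np

def split_and_normalise(X, y, train_frac=2 / 3, seed=0):
    """Uniformly random train/test split, followed by z-scoring of the
    features using in-sample mean and standard deviation only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = idx[:n_train], idx[n_train:]
    mean, std = X[tr].mean(axis=0), X[tr].std(axis=0)
    return (X[tr] - mean) / std, y[tr], (X[te] - mean) / std, y[te]
```

This is a generic sketch of the stated protocol, not the authors' code.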
As previously discussed, the prediction scheme operated by the BCM is Kolmogorov-inconsistent, in that the resulting predictive distributions are not consistent by marginalization.15 Moreover, jointly predicting all function values by using the set of all test inputs as query set, as originally suggested in Tresp (2000), would be impractical in this experiment given that the BCM requires inverting a covariance matrix of the size of the query set which, considering the numbers of test inputs in this experiment (which we recall can be as high as 1.97 million), would be computationally intractable. To circumvent this problem, we use the BCM algorithm to query one test input at a time. This approach is in line with that adopted by Deisenroth and Ng (2015), where the authors did not address determining joint predictive distributions over multiple latent function values. For the BCM and rBCM, the number of experts is chosen so that each expert processes 200 training points. For SVIGP we use the implementation made available by the GPy authors (2012–2016), and we use the same configuration as in Hensman et al. (2013). As for string GPs, we use the symmetric sum as link function, and we run two types of experiments, one allowing for inference of change-points (String GP), and the other enforcing a single kernel configuration per input dimension (String GP*). The parameters α and β are chosen so that the prior mean number of change-points in each input dimension is 5% of the number of distinct training and testing values in that input dimension, and so that the prior variance of the foregoing number of change-points is 50 times the prior mean; the aim is to be uninformative about the number of change-points. We run 10,000 iterations of our RJ-MCMC sampler and discard the first 5,000 as burn-in. After burn-in, we record the states of the Markov chains for analysis using a 1-in-100 thinning rate.
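Matching the two prior moments stated above (mean m, variance 50m) to the parameters (α, β) can be done in closed form if, for illustration, one takes them to be the shape and rate of a Gamma distribution, with mean α/β and variance α/β². This is a hypothetical sketch of the moment-matching step; the paper does not spell out the prior family here.

```python
def gamma_params_from_moments(mean, variance):
    """Shape/rate (alpha, beta) of a Gamma distribution with the given
    moments: mean = alpha / beta, variance = alpha / beta**2."""
    beta = mean / variance
    alpha = mean * beta
    return alpha, beta

# Prior mean number of change-points: 5% of the distinct input values in a
# dimension (n_distinct is a hypothetical count); prior variance: 50x the mean.
n_distinct = 2000
m = 0.05 * n_distinct  # = 100.0
alpha, beta = gamma_params_from_moments(m, 50.0 * m)
```

With these targets, β = 1/50 regardless of the dimension, and α scales with the number of distinct input values.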
Predictive accuracies are reported in Table 4 and CPU time requirements16 are illustrated in Figure 10. We stress that all experiments were run on a multi-core machine, and that we prefer using the cumulative CPU clock resource as time complexity metric, instead of wall-clock time, so as to be agnostic to the number of CPU cores used in the experiments. This metric has the merit of illustrating how the number of CPU cores required grows as a function of the number of training samples for a fixed/desired wall-clock execution time, but also how the wall-clock execution time grows as a function of the number of training samples for a given number of available CPU cores. The BCM and the rBCM perform the worst in this experiment, both in terms of predictive accuracy (Table 4) and total CPU time (Figure 10). The poor scalability of the BCM and the rBCM is primarily due to the testing phase. Indeed, if we denote M the total number of experts, then M = N/300, as each expert processes 200 training points, of which there are 2N/3.

14. More precisely, we subtract from every feature sample (both in-sample and out-of-sample) the in-sample mean of the feature and we divide the result by the in-sample standard deviation of the feature.

15. For instance, the predictive distribution of the value of the latent function at a test input x1, namely f(x1), obtained by using {x1} as set of test inputs in the BCM, differs from the predictive distribution obtained by using {x1, x2} as set of test inputs in the BCM and then marginalising with respect to the second input x2.

16. We define CPU time as the cumulative CPU clock resource usage of the whole experiment (training and testing), across child processes and threads, and across CPU cores.
In the prediction phase, each expert is required to make predictions about all N/3 test inputs, which requires evaluating M products of an (N/3) × 200 matrix with a 200 × 200 matrix, resulting in a total CPU time requirement that grows in O(M × (N/3) × 200²), which is the same as O(N²). Given that training CPU time grows linearly in N, the cumulative training and testing CPU time grows quadratically in N. This is well illustrated in Figure 10, where it can be seen that the slopes of the total CPU time profiles of the BCM and the rBCM in log-log scale are approximately 2. The airline delays data set was also considered by Deisenroth and Ng (2015), but the authors restricted themselves to a fixed test set size of 100,000 points. However, this setup might be restrictive, as in many smoothing applications the test data set can be as large as the training data set; neither the BCM nor the rBCM would be sufficiently scalable in such applications. As for SVIGP, although it was slightly more accurate than string GPs on this data set, it can be noted from Figure 10 that string GPs required 10 times fewer CPU resources. In fact, we were unable to run the experiment on the full data set with SVIGP; we gave up after 500 CPU hours, or more than a couple of weeks of wall-clock time, given that the GPy implementation of SVIGP makes little use of multiple cores. As a comparison, the full experiment took 91.0 hours of total CPU time (about 15 hours of wall-clock time on our 8-core machine) when change-points were inferred, and 83.11 hours of total CPU time (about 14 hours of wall-clock time on our 8-core machine) when change-points were not inferred. Another advantage of additively separable string GPs over GPs, and subsequently over SVIGP, is that they are more interpretable. Indeed, one can determine at a glance from the learned posterior mean string GPs of Figure 11 the effect of each of the 8 covariates considered on arrival delays.
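The quadratic-cost argument above is simple arithmetic, and can be checked directly. The sketch below counts the order-of-magnitude cost of the BCM prediction phase under the stated setup (2N/3 training points, 200 points per expert, N/3 test inputs); the function name is ours.

```python
def bcm_test_flops(N, points_per_expert=200):
    """Order-of-magnitude operation count of the BCM prediction phase:
    each of the M = (2N/3)/points_per_expert experts multiplies an
    (N/3) x 200 matrix with a 200 x 200 matrix."""
    M = (2 * N / 3) / points_per_expert
    return M * (N / 3) * points_per_expert**2

# Quadratic growth: multiplying N by 10 multiplies the cost by ~100,
# matching the slope of ~2 seen in the log-log plots of Figure 10.
ratio = bcm_test_flops(1_000_000) / bcm_test_flops(100_000)
```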
It turns out that the three most informative factors in predicting arrival delays are the departure time, the distance and the arrival time, while the age of the aircraft, the day of the week and the day of the month seem to have little to no effect. Finally, posterior distributions of the number of change-points are illustrated in Figure 12, and posterior distributions of the locations of change-points are illustrated in Figure 13.

| N | String GP | String GP* | BCM | rBCM | SVIGP |
|---|---|---|---|---|---|
| 10,000 | 1.03 ± 0.10 | 1.06 ± 0.10 | 1.06 ± 0.10 | 1.06 ± 0.10 | 0.90 ± 0.09 |
| 100,000 | 0.93 ± 0.03 | 0.96 ± 0.03 | 1.66 ± 0.03 | 1.04 ± 0.04 | 0.88 ± 0.03 |
| 1,000,000 | 0.93 ± 0.01 | 0.92 ± 0.01 | N/A | N/A | 0.82 ± 0.01 |
| 5,929,413 | 0.90 ± 0.01 | 0.93 ± 0.01 | N/A | N/A | N/A |

Table 4: Predictive mean squared errors (MSEs) ± one standard error on the airline arrival delays experiment. Squared errors are expressed as a fraction of the sample variance of airline arrival delays, and hence are unitless. With this normalisation, an MSE of 1.00 is as good as using the training mean arrival delay as predictor. The * in String GP* indicates that inference was performed without allowing for change-points. N/A entries correspond to experiments that were not over after 500 CPU hours.

Figure 10: Total CPU time (training and testing) taken by various regression approaches on the airline delays data set, as a function of the size of the subset considered, in log-log scale. The experimental setup is described in Section 6.3. The CPU time reflects actual CPU clock resource usage in each experiment, and is therefore agnostic to the number of CPU cores used. It can be regarded as the wall-clock time the experiment would have taken to complete on a single-core computer (with the same CPU frequency). Dashed lines are extrapolated values, and correspond to experiments that did not complete after 500 hours of CPU time.
Figure 11: Posterior mean ± one posterior standard deviation of the univariate string GPs in the airline delays experiment of Section 6.3. Change-points were automatically inferred in this experiment.

Figure 12: Posterior distributions of the numbers of change-points in each input dimension in the airline delays experiment of Section 6.3.

Figure 13: Posterior distributions of the locations of change-points in each input dimension in the airline delays experiment of Section 6.3. Dimensions that were learned to exhibit no change-point have been omitted here.
6.4 Large Scale Dynamic Asset Allocation

An important feature of our proposed RJ-MCMC sampler (Algorithm 2) is that, unlike the BCM, the rBCM and SVIGP, which are restricted to Bayesian nonparametric regression and classification, Algorithm 2 is agnostic with regard to the likelihood model, so long as it takes the form p(D|f, u). Thus, it may be used as is on a wide variety of problems that go beyond classification and regression. In this experiment we aim to illustrate the efficacy of our approach on one such large scale problem in quantitative finance.

6.4.1 BACKGROUND

Let $(x_i(t))_{t>0}$ for $i = 1, \dots, n$ be $n$ stock price processes. Let $(X_i(t))_{t>0}$ for $i = 1, \dots, n$ denote the corresponding market capitalisation processes, that is $X_i(t) = n_i(t) x_i(t)$, where $n_i(t)$ is the number of shares of company $i$ trading in the market at time $t$. We call long-only portfolio any vector-valued stochastic process $\pi = (\pi_1, \dots, \pi_n)$ taking values on the unit simplex of $\mathbb{R}^n$, that is $\forall i, t,\ \pi_i(t) \geq 0$ and $\sum_{i=1}^n \pi_i(t) = 1$. Each process $\pi_i$ represents the proportion of an investor's wealth invested in (holding) shares of asset $i$. An example long-only portfolio is the market portfolio $\mu = (\mu_1, \dots, \mu_n)$, where

$$\mu_i(t) = \frac{X_i(t)}{X_1(t) + \dots + X_n(t)} \quad (47)$$

is the market weight of company $i$ at time $t$, that is its size relative to the total market size (or that of the universe of stocks considered). The market portfolio is very important to practitioners, as it is often perceived not to be subject to idiosyncrasies, but only to systemic risk. It is often used as an indicator of how the stock market (or a specific universe of stocks) performs as a whole. We denote $Z_\pi$ the value process of a portfolio $\pi$ with initial capital $Z_\pi(0)$. That is, $Z_\pi(t)$ is the wealth at time $t$ of an investor who had an initial wealth of $Z_\pi(0)$, and dynamically re-allocated all their wealth between the $n$ stocks in our universe up to time $t$ following the continuous-time strategy $\pi$.
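Equation (47) is straightforward to compute; the sketch below (illustrative naming, toy numbers) also checks the defining simplex property of a long-only portfolio:

```python
import numpy as np

def market_weights(prices, shares):
    """Market weights mu_i = X_i / sum_j X_j, where X_i = n_i * x_i is the
    market capitalisation of company i (Equation 47)."""
    caps = np.asarray(prices, float) * np.asarray(shares, float)
    return caps / caps.sum()

# Toy universe of three stocks: caps are 1000, 1000 and 2000.
mu = market_weights([10.0, 20.0, 5.0], [100.0, 50.0, 400.0])
```

By construction the weights are non-negative and sum to one, so the market portfolio is indeed long-only.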
A mathematical theory has recently emerged, namely stochastic portfolio theory (SPT) (see Karatzas and Fernholz, 2009), that studies the stochastic properties of the wealth processes of certain portfolios called functionally-generated portfolios, under realistic assumptions on the market capitalisation processes $(X_i(t))_{t>0}$. Functionally-generated portfolios are rather specific in that the allocation at time $t$, namely $(\pi_1(t), \dots, \pi_n(t))$, solely depends on the market weights vector $(\mu_1(t), \dots, \mu_n(t))$. Nonetheless, some functionally-generated portfolios $\pi^*$ have been found that, under the mild (so-called diversity) condition

$$\exists\, \mu_{\max},\ 0 < \mu_{\max} < 1 \ \text{ s.t. } \ \forall i \leq n,\ \forall t \leq T,\ \mu_i(t) \leq \mu_{\max}, \quad (48)$$

outperform the market portfolio over the time horizon $[0, T]$ with probability one (see Vervuurt and Karatzas, 2015; Karatzas and Fernholz, 2009). More precisely,

$$P\left(Z_{\pi^*}(T) \geq Z_\mu(T)\right) = 1 \quad \text{and} \quad P\left(Z_{\pi^*}(T) > Z_\mu(T)\right) > 0. \quad (49)$$

Galvanized by this result, we here consider the inverse problem consisting of learning from historical market data a portfolio whose wealth process has desirable user-specified properties. This inverse problem is perhaps more akin to the problems faced by investment professionals: i) their benchmarks depend on the investment vehicles pitched to investors and may vary from one vehicle to another, ii) they have to take into account liquidity costs, and iii) they often find it more valuable to go beyond market weights and leverage multiple company characteristics in their investment strategies.

6.4.2 MODEL CONSTRUCTION

We consider portfolios $\pi^f = \left(\pi^f_1, \dots, \pi^f_n\right)$ of the form

$$\pi^f_i(t) = \frac{f(c_i(t))}{f(c_1(t)) + \dots + f(c_n(t))}, \quad (50)$$

where $c_i(t) \in \mathbb{R}^d$ are some quantifiable characteristics of asset $i$ that may be observed in the market at time $t$, and $f$ is a positive-valued function.
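Equation (50) can be sketched as follows. The generating function `f` used here is a hypothetical stand-in (an exponential of a linear score of two characteristics) rather than the string GP sample the paper actually uses; the point is that any positive-valued f yields valid long-only weights.

```python
import numpy as np

def portfolio_weights(f, characteristics):
    """Equation (50): pi^f_i = f(c_i) / sum_j f(c_j) for a positive-valued f
    of per-asset characteristics. Placing the prior on log f (as in the text)
    guarantees positivity, so the weights always lie on the unit simplex."""
    vals = np.array([f(c) for c in characteristics], dtype=float)
    return vals / vals.sum()

# Hypothetical generating function of two characteristics per asset.
f = lambda c: np.exp(0.3 * c[0] - 0.1 * c[1])
w = portfolio_weights(f, [(1.0, 2.0), (0.5, 0.1), (2.0, 1.0)])
```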
Portfolios of this form include all functionally-generated portfolios studied in SPT as a special case.17 A crucial departure of our approach from the aforementioned type of portfolios is that the market characteristics processes $c_i$ need not be restricted to size-based information, and may contain additional information such as social media sentiments, stock price path-properties, but also characteristics relative to other stocks, such as performance relative to the best/worst performing stock last week/month/year, etc. We place a mean-zero string GP prior on $\log f$. Given some historical data $\mathcal{D}$ corresponding to a training time horizon $[0, T]$, the likelihood model $p\left(\mathcal{D}|\pi^f\right)$ is defined by the investment professional, and reflects the extent to which applying the investment strategy $\pi^f$ over the training time horizon would have achieved a specific investment objective. An example investment objective is to achieve a high excess return relative to a benchmark portfolio $\alpha$:

$$U_{ER}\left(\pi^f\right) = \log Z_{\pi^f}(T) - \log Z_\alpha(T). \quad (51)$$

Here $\alpha$ can be the market portfolio (as in SPT) or any stock index. Other risk-adjusted investment objectives may also be used. One such objective is to achieve a high Sharpe ratio, defined as

$$\frac{\bar{r}}{\sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(r(t) - \bar{r}\right)^2}}, \quad (52)$$

where the time $t$ is in days, $r(t) := \log Z_{\pi^f}(t) - \log Z_{\pi^f}(t-1)$ are the daily returns of the portfolio $\pi^f$, and $\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r(t)$ its average daily return. More generally, denoting $U\left(\pi^f\right)$ the performance of the portfolio $\pi^f$ over the training horizon $[0, T]$ (as per the user-defined investment objective), we may choose as likelihood model a distribution over $U\left(\pi^f\right)$ that reflects what the investment professional considers good and bad performance.
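The Sharpe ratio of Equation (52) can be computed directly from a daily wealth curve. The sketch below follows the definitions above (daily log-returns, average divided by standard deviation); any annualisation factor is omitted here.

```python
import numpy as np

def sharpe_ratio(wealth):
    """Sharpe ratio of a daily wealth curve: average daily log-return divided
    by the standard deviation of the daily log-returns (Equation 52)."""
    r = np.diff(np.log(wealth))  # r(t) = log Z(t) - log Z(t-1)
    return r.mean() / r.std()
```

A wealth curve that drifts upward, even with fluctuations, yields a positive Sharpe ratio.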
For instance, in the case of the excess return relative to a benchmark portfolio, or of the Sharpe ratio, we may choose $U\left(\pi^f\right)$ to be supported on $]0, +\infty[$ (for instance $U\left(\pi^f\right)$ can be chosen to be Gamma distributed), so as to express that portfolios that do not outperform the benchmark, or that lose money overall in the training data, are not of interest. We may then choose the mean and standard deviation of the Gamma distribution based on our expectation as to what performance a good candidate portfolio can achieve, and how confident we feel about this expectation. Overall we have

$$p\left(\mathcal{D}|\pi^f\right) = \gamma\left(U\left(\pi^f\right); \alpha_e, \beta_e\right), \quad (53)$$

where $\gamma(\cdot; \alpha, \beta)$ is the probability density function of the Gamma distribution. Noting from Equation (50) that $\pi^f(t)$ only depends on $f$ through its values at $(c_1(t), \dots, c_n(t))$, and assuming that $U\left(\pi^f\right)$ only depends on $\pi^f$ evaluated at a finite number of times (as is the case for excess returns and the Sharpe ratio), it follows that $U\left(\pi^f\right)$ only depends on $\mathbf{f}$, a vector of values of $f$ at a finite number of points. Hence the likelihood model, which we may rewrite as

$$p(\mathcal{D}|\mathbf{f}) = \gamma\left(U\left(\pi^{\mathbf{f}}\right); \alpha_e, \beta_e\right), \quad (54)$$

is of the form required by the RJ-MCMC sampler previously developed. By sampling from the posterior distribution over the values of $f$ and its gradient, the hope is to learn a portfolio that did well during the training horizon, to analyse the sensitivity of its investment strategy to the underlying market characteristics through the gradient of $f$, and to evaluate the learned investment policy on future market conditions.

17. We refer the reader to Karatzas and Fernholz (2009) for the definition of functionally-generated portfolios.

6.4.3 EXPERIMENTAL SETUP

The universe of stocks we considered for this experiment are the constituents of the S&P 500 index, accounting for changes in constituents over time and corporate events.
We used the period 1st January 1990 to 31st December 2004 for training, and we tested the learned portfolio during the period 1st January 2005 to 31st December 2014. We rebalanced the portfolio daily, for a total of 2.52 million input points at which the latent function f must be learned. We considered as market characteristics the market weight (CAP), the latest return on assets (ROA), defined as the ratio between the net yearly income and the total assets as per the latest balance sheet of the company known at the time of investment, the previous close-to-close return (PR), the close-to-close return before the previous (PR2), and the S&P long and short term credit ratings (LCR and SCR). While the market weight is a company size characteristic, the ROA reflects how well a company performs relative to its size, and we hope that the S&P credit ratings will help further discriminate successful companies from others. The close-to-close returns are used to learn possible momentum patterns from the data. The data originate from the CRSP and Compustat databases. In the experiments we considered as performance metric the annualised excess return $U_{ER-EWP}$ relative to the equally-weighted portfolio. We found the equally-weighted portfolio to be a harder benchmark to outperform than the market portfolio. We chose $\alpha_e$ and $\beta_e$ in Equation (54) so that the mean of the Gamma distribution is 10.0 and its variance 0.5, which expresses a very greedy investment target. It is worth pointing out that none of the scalable GP alternatives previously considered can cope with our likelihood model, Equation (54).
We compared the out-of-sample performance of the learned string GP portfolio to those of the best three SPT portfolios studied in Vervuurt and Karatzas (2015), namely the equally-weighted portfolio

π_i^EWP(t) = 1/n,   (55)

the diversity-weighted portfolios

π_i^DWP(t; p) = μ_i(t)^p / (μ_1(t)^p + · · · + μ_n(t)^p),   (56)

with parameter p equal to 0.5 and −0.5, and the market portfolio. Results are provided in Table 5, and Figure 14 displays the evolution of the wealth process of each strategy. It can be seen that the learned string GP strategy considerably outperforms the next best SPT portfolio. This experiment not only demonstrates (once more) that string GPs scale to large scale problems, it also illustrates that our inference scheme is able to unlock commercial value in new intricate large scale applications where no alternative is readily available. In effect, this application was first introduced by Kom Samo and Vervuurt (2016), where the authors used a Gaussian process prior on a Cartesian grid and under a separable covariance function so as to speed up inference with Kronecker techniques.

Figure 14: Evolution of the wealth processes of various long-only trading strategies on the S&P 500 universe of stocks between 1st January 2005 (where we assume a starting wealth of 1) and 31st December 2014. The string GP strategy was learned using market data from 1st January 1990 to 31st December 2004 as described in Section 6.4. EWP refers to the equally-weighted portfolio, MKT refers to the market portfolio (which weights stocks proportionally to their market capitalisations) and DWP (p) refers to the diversity-weighted portfolio with exponent p (which weights stocks proportionally to the p-th power of their market weights).
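The benchmark weights of Equations (55) and (56) are straightforward to compute; the sketch below (ours, not the paper's code) also recovers the market and equally-weighted portfolios as limiting cases of the diversity-weighted family:

```python
import numpy as np

# Sketch of the diversity-weighted portfolio of Equation (56): each stock
# is weighted proportionally to the p-th power of its market weight mu_i(t).
def diversity_weighted_portfolio(market_weights, p):
    w = np.asarray(market_weights, dtype=float) ** p
    return w / w.sum()

mu = np.array([0.5, 0.3, 0.2])               # market weights at time t, summing to 1
dwp = diversity_weighted_portfolio(mu, 0.5)  # the p = 0.5 portfolio
mkt = diversity_weighted_portfolio(mu, 1.0)  # p = 1 recovers the market portfolio
ewp = diversity_weighted_portfolio(mu, 0.0)  # p = 0 recovers the equally-weighted portfolio
```

For 0 < p < 1 the weights interpolate between the two: large-cap positions are shrunk and small-cap positions inflated relative to the market portfolio.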
Although the resulting Kronecker-based inference scheme has time complexity that is linear in the total number of points on the grid, for a given grid resolution the time complexity grows exponentially in the dimension of the input space (that is, the number of trading characteristics), which is impractical for d ≥ 4. On the other hand, string GPs allow for a time complexity that grows linearly with the number of trading characteristics, thereby enabling the learning of subtler market inefficiencies from the data.

Strategy        | Sharpe Ratio | Zπ(T)/ZEWP(T) | Avg. Ann. Ret.
String GP       | 0.73         | 2.87          | 22.07%
DWP (p = 0.5)   | 0.55         | 1.07          | 10.56%
EWP             | 0.53         | 1.00          | 9.84%
MKT             | 0.34         | 0.62          | 4.77%
DWP (p = −0.5)  | 0.33         | 0.61          | 4.51%

Table 5: Performance of various long-only trading strategies on the S&P 500 universe of stocks between 1st January 2005 (where we assume a starting wealth of 1) and 31st December 2014. The string GP strategy was learned using market data from 1st January 1990 to 31st December 2004 as described in Section 6.4. EWP refers to the equally-weighted portfolio, MKT refers to the market portfolio (which weights stocks proportionally to their market capitalisations) and DWP (p) refers to the diversity-weighted portfolio with exponent p (which weights stocks proportionally to the p-th power of the market weight of the asset). Zπ(T) denotes the terminal wealth of strategy π, and Avg. Ann. Ret. is the strategy's equivalent constant annual return over the test horizon.

7. Discussion

In this paper, we introduce a novel class of smooth functional priors (or stochastic processes), which we refer to as string GPs, with the aim of simultaneously addressing the lack of scalability and the lack of flexibility of Bayesian kernel methods. Unlike existing approaches, such as Gaussian process priors (Rasmussen and Williams (2006)) or Student-t process priors (Shah et al.
(2014)), which are parametrised by global mean and covariance functions, and which postulate fully dependent finite-dimensional marginals, the alternative construction we propose adopts a local perspective, and the resulting finite-dimensional marginals exhibit conditional independence structures. Our local approach to constructing string GPs provides a principled way of postulating that the latent function we wish to learn might exhibit locally homogeneous patterns, while the conditional independence structures constitute the core ingredient needed for developing scalable inference methods. Moreover, we provide theoretical results relating our approach to Gaussian processes, and we illustrate that our approach can often be regarded as a more scalable and/or more flexible extension. We argue and empirically illustrate that string GPs present an unparalleled opportunity for learning local patterns in small scale regression problems using nothing but standard Gaussian process regression techniques. More importantly, we propose a novel scalable RJ-MCMC inference scheme to learn latent functions in a wide variety of machine learning tasks, while simultaneously determining whether the data set exhibits local patterns, how many types of local patterns the data might exhibit, and where changes in these patterns are likely to occur. The proposed scheme has time complexity and memory requirement that are both linear in the sample size N. When the number of available computing cores is at least equal to the dimension d of the input space, the time complexity is independent of the dimension of the input space. Otherwise, the time complexity grows as O(dN), as does the memory requirement. We empirically illustrate that our approach scales considerably better than competing alternatives on a standard benchmark data set, and is able to process data sizes that competing approaches cannot handle in a reasonable time.
7.1 Limitations

The main limitation of our approach is that, unlike the standard GP paradigm, in which the time complexity of marginal likelihood evaluation does not depend on the dimension of the input space (other than through the evaluation of the Gram matrix), the string GP paradigm requires a number of computing cores that increases linearly with the dimension of the input space, or alternatively has a time complexity linear in the input space dimension on single-core machines. This is a by-product of the fact that in the string GP paradigm we jointly infer the latent function and its gradient. If the gradient of the latent function were inferred in the standard GP paradigm, the resulting complexity would also be linear in the input dimension. That being said, overall our RJ-MCMC inference scheme will typically scale better per iteration to large input dimensions than gradient-based marginal likelihood inference in the standard GP paradigm, as the latter typically requires numerically evaluating a Hessian matrix, which requires computing the marginal likelihood a number of times per iterative update that grows quadratically with the input dimension. In contrast, a Gibbs cycle in our MCMC sampler has worst case time complexity that is linear in the input dimension.

7.2 Extensions

Some of the assumptions we have made in the construction of string GPs and membrane GPs can be relaxed, which we consider in detail below.

7.2.1 STRONGER GLOBAL REGULARITY

We could have imposed more regularity (continuous differentiability of multiple orders) or less regularity (mere continuity) as boundary conditions in the construction of string GPs. We chose continuous differentiability as it is a relatively mild condition guaranteed by most popular kernels, and yet the corresponding treatment can be easily generalised to other regularity requirements.
It is also possible to allow for a discontinuity at a boundary time a_k, by replacing μ^b_k and Σ^b_k in Equation (5) with kM_{a_k} and kK_{a_k;a_k} respectively, or equivalently by preventing any communication between the k-th and the (k+1)-th strings. This would effectively be equivalent to having two independent string GPs on [a_0, a_k] and ]a_k, a_K].

7.2.2 DIFFERENTIAL OPERATORS AS LINK FUNCTIONS

Our framework can be further extended to allow differential operators as link functions, thereby considering the latent multivariate function to infer as the response of a differential system to independent univariate string GP excitations. The RJ-MCMC sampler we propose in Section 5 would still work in this framework, with the only exception that, when the differential operator is of first order, the latent multivariate function will be continuous but not differentiable, unless global regularity is upgraded as discussed above. Moreover, Proposition 5 can be generalised to first order linear differential operators.

7.2.3 DISTRIBUTED STRING GPS

The RJ-MCMC inference scheme we propose may be easily adapted to handle applications where the data set is so big that it has to be stored across multiple clusters, and inference techniques have to be developed as data flow graphs18 (for instance using libraries such as TensorFlow). To do so, the choice of string boundary times can be adapted so that each string has the same number of inner input coordinates, and such that in total there are as many strings across dimensions as a target number of available computing cores. We may then place a prior on kernel memberships similar to that of previous sections. Here, the change-points may be restricted to coincide with boundary times, and we may choose priors such that the sets of change-points are independent between input dimensions.
In each input dimension, the prior on the number of change-points can be chosen to be a truncated Poisson distribution (truncated so as never to exceed the total number of boundary times), and conditional on their number we may choose the change-points to be uniformly distributed in the set of boundary times. In so doing, any two strings whose shared boundary time is not a change-point will be driven by the same kernel configuration. This new setup presents no additional theoretical or practical challenges, and the RJ-MCMC techniques previously developed are easily adaptable to jointly learn change-points and function values. Unlike the case we developed in previous sections, where an update of the univariate string GP corresponding to an input dimension, say the j-th, requires looping through all distinct j-th input coordinates, here no step in the inference scheme requires a full view of the data set in any input dimension. Full RJ-MCMC inference can be constructed as a data flow graph. An example of such a graph may be constructed as follows. The leaves correspond to computing cores responsible for generating change-points and kernel configurations, and for mapping strings to kernel configurations. The following layer is made of compute cores that use kernel configurations coming out of the previous layer to sequentially compute boundary conditions corresponding to a specific input dimension; there are d such compute cores, where d is the input dimension. These compute cores then pass computed boundary conditions to subsequent compute cores we refer to as string compute cores. Each string compute core is tasked with computing derivative string GP values for a specific input dimension and for a specific string in that input dimension, conditional on previously computed boundary conditions.
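The change-point prior described above can be sketched as follows (a hypothetical implementation, ours; the rejection-based handling of the truncation is an assumption):

```python
import numpy as np

# Hypothetical sketch of the prior on change-points in one input dimension:
# a Poisson number of change-points, truncated so as never to exceed the
# number of boundary times, followed by a uniform draw (without
# replacement) among the boundary times.
def sample_change_points(boundary_times, rate, rng):
    n_max = len(boundary_times)
    n = rng.poisson(rate)
    while n > n_max:          # rejection sampling implements the truncation
        n = rng.poisson(rate)
    idx = rng.choice(n_max, size=n, replace=False)
    return sorted(boundary_times[i] for i in idx)

rng = np.random.default_rng(0)
change_points = sample_change_points([0.2, 0.4, 0.6, 0.8], rate=1.5, rng=rng)
```

Drawing one such set independently per input dimension yields the full change-point configuration; strings sharing a non-change-point boundary then inherit the same kernel configuration.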
The derivative string GP values so computed are then passed to a fourth layer of compute cores, each of which is tasked with computing function and gradient values corresponding to a small subset of training inputs from previously computed derivative string GP values. The final layers then compute the log-likelihood using a distributed algorithm such as Map-Reduce where possible. This proposed data flow graph is illustrated in Figure 15. We note that the approaches of Kim et al. (2005), Gramacy and Lee (2008), Tresp (2000), and Deisenroth and Ng (2015) also allow for fully-distributed inference on regression problems. Distributed string GP RJ-MCMC inference improves on these in that it places little restriction on the type of likelihood. Moreover, unlike Kim et al. (2005) and Gramacy and Lee (2008), which yield discontinuous latent functions, string GPs are continuously differentiable, and unlike Tresp (2000) and Deisenroth and Ng (2015), local experts in the string GP paradigm (i.e. strings) are driven by possibly different sets of hyper-parameters, which facilitates the learning of local patterns.

18. A data flow graph is a computational (directed) graph whose nodes represent calculations (possibly taking place on different computing units) and whose directed edges correspond to data flowing between calculations or computing units.

7.2.4 APPROXIMATE MCMC FOR I.I.D. OBSERVATIONS LIKELIHOODS

As discussed in Section 5.2, the bottleneck of our proposed inference scheme is the evaluation of the likelihood. When the likelihood factorises across training samples, the linear time complexity of our proposed approach can be further reduced using a Monte Carlo approximation of the log-likelihood (see for instance Bardenet et al. (2014) and references therein).
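For a likelihood that factorises across N samples, the subsampling idea can be sketched as follows (an illustration on a toy setup of our own, not the paper's code):

```python
import numpy as np

# Sketch of a Monte Carlo approximation of a factorised log-likelihood:
# the sum of N per-sample log-likelihood terms is estimated, without bias,
# by rescaling the mean over a random subsample.
def approximate_log_likelihood(per_sample_log_lik, n_total, subsample_size, rng):
    idx = rng.choice(n_total, size=subsample_size, replace=False)
    return n_total * np.mean([per_sample_log_lik(i) for i in idx])

rng = np.random.default_rng(0)
log_terms = np.log(np.linspace(0.1, 1.0, 1000))  # toy per-sample log-likelihoods
estimate = approximate_log_likelihood(lambda i: log_terms[i], 1000, 200, rng)
exact = log_terms.sum()
```

The estimator is unbiased, and its variance shrinks as the subsample grows, which is the trade-off controlling how far the resulting chain departs from the exact one.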
Although the Markov chain resulting from this approximation will typically not converge to the true posterior distribution, in practice its stationary distribution can be sufficiently close to the true posterior when reasonable Monte Carlo sample sizes are used. Convergence results for such approximations have recently been studied by Bardenet et al. (2014) and Alquier et al. (2016). We expect this extension to speed up inference when the number of compute cores is of the order of the input dimension, but we would recommend the previously mentioned fully-distributed string GP inference extension when compute cores are not scarce.

7.2.5 VARIATIONAL INFERENCE

It would be useful to develop suitable variational methods for inference under string GP priors, which we hope will scale similarly to our proposed RJ-MCMC sampler but will converge faster. We anticipate that the main challenge here will perhaps be the learning of model complexity, that is the number of distinct kernel configurations in each input dimension.

Figure 15: Example data flow graph for fully-distributed string GP inference under an i.i.d. observations likelihood model, with successive layers performing: change-points computations and generation of kernel hyper-parameters, computation of boundary conditions, computation of inner string GP values and derivatives conditional on boundary conditions, computation of latent function values and gradients, computation of partial log-likelihoods (i.e. Map), and the sum of partial log-likelihoods (i.e. Reduce). Here the input space is three-dimensional to ease illustration. Filled circles represent compute cores, and edges correspond to flows of data. Compute cores with the same colour (green, red or yellow) perform operations pertaining to the same input dimension, while black-filled circles represent compute cores performing cross-dimensional operations.
The blue rectangle plays the role of a hub that relays string GP values to the compute cores that need them to compute the subset of latent function values and gradients they are responsible for. These values are then used to compute the log-likelihood in a distributed fashion using the Map-Reduce algorithm. Each calculation in the corresponding RJ-MCMC sampler would be initiated at one of the compute cores, and would trigger updates of all edges accessible from that compute core.

Acknowledgments

Yves-Laurent is a Google Fellow in Machine Learning and would like to acknowledge support from the Oxford-Man Institute. Wharton Research Data Services (WRDS) was used in preparing the data for Section 6.4 of this paper. This service and the data available thereon constitute valuable intellectual property and trade secrets of WRDS and/or its third-party suppliers.

Appendix A.

We begin by recalling Kolmogorov's extension theorem, which we will use to prove the existence of derivative Gaussian processes and string Gaussian processes.

Theorem 7 (Kolmogorov's extension theorem, (Øksendal, 2003, Theorem 2.1.5)) Let I be an interval. For all t_1, . . . , t_i ∈ I, i ∈ N*, let ν_{t_1,...,t_i} be probability measures on R^{ni} such that:

ν_{t_π(1),...,t_π(i)}(F_π(1), . . . , F_π(i)) = ν_{t_1,...,t_i}(F_1, . . . , F_i)   (57)

for all permutations π on {1, . . . , i}, and

ν_{t_1,...,t_i}(F_1, . . . , F_i) = ν_{t_1,...,t_i,t_{i+1},...,t_{i+m}}(F_1, . . . , F_i, R^n, . . . , R^n)   (58)

for all m ∈ N*, where the set on the right hand side has a total of i + m factors. Then there exists a probability space (Ω, F, P) and an R^n-valued stochastic process (X_t)_{t∈I} on Ω, such that

ν_{t_1,...,t_i}(F_1, . . . , F_i) = P(X_{t_1} ∈ F_1, . . . , X_{t_i} ∈ F_i)   (59)

for all t_1, . . . , t_i ∈ I, i ∈ N*, and for all Borel sets F_1, . . . , F_i.

It is easy to see that every stochastic process satisfies the permutation and marginalisation conditions (57) and (58).
The power of Kolmogorov's extension theorem is that it states that those two conditions are sufficient to guarantee the existence of a stochastic process.

Appendix B. Proof of Proposition 1

In this section we prove Proposition 1, which we recall below.

Proposition 1 (Derivative Gaussian processes) Let I be an interval, k : I × I → R a C2 symmetric positive semi-definite function,19 and m : I → R a C1 function.

(A) There exists an R²-valued stochastic process (D_t)_{t∈I}, D_t = (z_t, z'_t), such that for all t_1, . . . , t_n ∈ I,

(z_{t_1}, . . . , z_{t_n}, z'_{t_1}, . . . , z'_{t_n})

is a Gaussian vector with mean

(m(t_1), . . . , m(t_n), dm/dt(t_1), . . . , dm/dt(t_n))

and covariance matrix such that

cov(z_{t_i}, z_{t_j}) = k(t_i, t_j), cov(z_{t_i}, z'_{t_j}) = ∂k/∂y(t_i, t_j), and cov(z'_{t_i}, z'_{t_j}) = ∂²k/∂x∂y(t_i, t_j).

We herein refer to (D_t)_{t∈I} as a derivative Gaussian process.

(B) (z_t)_{t∈I} is a Gaussian process with mean function m, covariance function k, and that is C1 in the L2 (mean square) sense.

(C) (z'_t)_{t∈I} is a Gaussian process with mean function dm/dt and covariance function ∂²k/∂x∂y. Moreover, (z'_t)_{t∈I} is the L2 derivative of the process (z_t)_{t∈I}.

19. C1 (resp. C2) functions denote functions that are once (resp. twice) continuously differentiable on their domains.

Appendix B.1 Proof of Proposition 1 (A)

Firstly, we need to show that the matrix suggested in the proposition as the covariance matrix of (z_{t_1}, . . . , z_{t_n}, z'_{t_1}, . . . , z'_{t_n}) is indeed positive semi-definite. To do so, we will show that it is the limit of positive definite matrices (which is sufficient to conclude it is positive semi-definite, as x^T M_n x ≥ 0 for a convergent sequence of positive definite matrices implies x^T M_∞ x ≥ 0). Let k be as in the proposition, let h be such that ∀i ≤ n, t_i + h ∈ I, and let (z̃_t)_{t∈I} be a Gaussian process with covariance function k. The vector

(z̃_{t_1}, . . . , z̃_{t_n}, (z̃_{t_1+h} − z̃_{t_1})/h, . . .
, (z̃_{t_n+h} − z̃_{t_n})/h)

is a Gaussian vector whose covariance matrix is positive definite and such that

cov(z̃_{t_i}, z̃_{t_j}) = k(t_i, t_j),   (60)

cov(z̃_{t_i}, (z̃_{t_j+h} − z̃_{t_j})/h) = (k(t_i, t_j + h) − k(t_i, t_j))/h,   (61)

cov((z̃_{t_i+h} − z̃_{t_i})/h, (z̃_{t_j+h} − z̃_{t_j})/h) = (1/h²)(k(t_i + h, t_j + h) − k(t_i + h, t_j) − k(t_i, t_j + h) + k(t_i, t_j)).   (62)

As k is C2, h ↦ k(x, y + h) admits a second order Taylor expansion about h = 0 for every x, and we have:

k(x, y + h) = k(x, y) + ∂k/∂y(x, y) h + (1/2) ∂²k/∂y²(x, y) h² + o(h²) = k(y + h, x).   (63)

Similarly, h ↦ k(x + h, y + h) admits a second order Taylor expansion about h = 0 for every x, y, and we have:

k(x + h, y + h) = k(x, y) + [∂k/∂x(x, y) + ∂k/∂y(x, y)] h + [∂²k/∂x∂y(x, y) + (1/2) ∂²k/∂x²(x, y) + (1/2) ∂²k/∂y²(x, y)] h² + o(h²).   (64)

Hence,

k(t_i, t_j + h) − k(t_i, t_j) = ∂k/∂y(t_i, t_j) h + o(h),   (65)

k(t_i + h, t_j + h) − k(t_i + h, t_j) − k(t_i, t_j + h) + k(t_i, t_j) = ∂²k/∂x∂y(t_i, t_j) h² + o(h²).   (66)

Dividing Equation (65) by h, dividing Equation (66) by h², and taking the limits, we obtain:

lim_{h→0} cov(z̃_{t_i}, (z̃_{t_j+h} − z̃_{t_j})/h) = ∂k/∂y(t_i, t_j), lim_{h→0} cov((z̃_{t_i+h} − z̃_{t_i})/h, (z̃_{t_j+h} − z̃_{t_j})/h) = ∂²k/∂x∂y(t_i, t_j),

which corresponds to the covariance structure of Proposition 1. In other words, the proposed covariance structure is indeed positive semi-definite. Let ν^N_{t_1,...,t_n} be the Gaussian probability measure corresponding to the joint distribution of (z_{t_1}, . . . , z_{t_n}, z'_{t_1}, . . . , z'_{t_n}) as per Proposition 1, and let ν^D_{t_1,...,t_n} be the measure on the Borel σ-algebra B(R² × · · · × R²) (n times) such that for any 2n intervals I_11, I_12, . . . , I_n1, I_n2,

ν^D_{t_1,...,t_n}(I_11 × I_12, . . . , I_n1 × I_n2) := ν^N_{t_1,...,t_n}(I_11, . . . , I_n1, I_12, . . . , I_n2).   (67)

The measures ν^D_{t_1,...,t_n} are the finite dimensional measures corresponding to the stochastic object (D_t)_{t∈I} sampled at times t_1, . . . , t_n. They satisfy the time permutation and marginalisation conditions of Kolmogorov's extension theorem, as the Gaussian measures ν^N_{t_1,...,t_n} do. Hence, the R²-valued stochastic process (D_t)_{t∈I} defined in Proposition 1 does exist.
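The covariance structure of Proposition 1 can also be checked numerically. The sketch below (ours, not part of the proof) uses the squared exponential kernel, whose cross-derivatives are available in closed form, and verifies that the joint covariance matrix of (z_{t_1}, ..., z_{t_n}, z'_{t_1}, ..., z'_{t_n}) is positive semi-definite:

```python
import numpy as np

# Numerical check of Proposition 1 (A) for the squared exponential kernel
# k(s, t) = exp(-(s - t)^2 / (2 l^2)), for which
#   dk/dy(s, t)    = ((s - t) / l^2) k(s, t),
#   d2k/dxdy(s, t) = (1 / l^2 - (s - t)^2 / l^4) k(s, t).
def derivative_gp_covariance(times, l=1.0):
    s = np.asarray(times, dtype=float)[:, None]
    t = np.asarray(times, dtype=float)[None, :]
    u = s - t
    k = np.exp(-u**2 / (2.0 * l**2))
    dk_dy = (u / l**2) * k                  # cov(z_s, z'_t)
    d2k = (1.0 / l**2 - u**2 / l**4) * k    # cov(z'_s, z'_t)
    return np.block([[k, dk_dy], [dk_dy.T, d2k]])

cov = derivative_gp_covariance([0.0, 0.5, 1.2, 2.0])
eigenvalues = np.linalg.eigvalsh(cov)
```

All eigenvalues come out (numerically) non-negative, in agreement with the limiting argument above.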
Appendix B.2 Proof of Proposition 1 (B)

That (z_t)_{t∈I} is a Gaussian process results from the fact that the marginals (z_{t_1}, . . . , z_{t_n}) are Gaussian vectors with mean (m(t_1), . . . , m(t_n)) and covariance matrix [k(t_i, t_j)]_{i,j∈[1..n]}. The fact that (z_t)_{t∈I} is C1 in the L2 sense is a direct consequence of the twice continuous differentiability of k.

Appendix B.3 Proof of Proposition 1 (C)

In effect, it follows from Proposition 1 (A) that (z_{t+h} − z_t)/h − z'_t is a Gaussian random variable with mean

(m(t + h) − m(t))/h − dm/dt(t)

and variance

[k(t + h, t + h) − 2k(t + h, t) + k(t, t) − 2 ∂k/∂y(t + h, t) h + 2 ∂k/∂y(t, t) h + ∂²k/∂x∂y(t, t) h²] / h².

Taking the second order Taylor expansion of the numerator in the fraction above about h = 0, we get o(h²), hence

lim_{h→0} Var((z_{t+h} − z_t)/h − z'_t) = 0.

We also have

lim_{h→0} E((z_{t+h} − z_t)/h) = dm/dt(t) = E(z'_t),

which proves that (z'_t) is the L2 derivative of (z_t). The fact that (z'_t) is a Gaussian process with mean function dm/dt and covariance function ∂²k/∂x∂y is a direct consequence of the distribution of the marginals (z'_{t_1}, . . . , z'_{t_n}). Moreover, the continuity of (z'_t) in the L2 sense is a direct consequence of the continuity of ∂²k/∂x∂y (see Rasmussen and Williams, 2006, p. 81, 4.1.1).

Appendix C. Proof of Theorem 2

In this section we prove Theorem 2, which we recall below.

Theorem 2 (String Gaussian process) Let a_0 < · · · < a_k < · · · < a_K, I = [a_0, a_K], and let p_N(x; μ, Σ) be the multivariate Gaussian density with mean vector μ and covariance matrix Σ. Furthermore, let (m_k : [a_{k−1}, a_k] → R)_{k∈[1..K]} be C1 functions, and (k_k : [a_{k−1}, a_k] × [a_{k−1}, a_k] → R)_{k∈[1..K]} be C3 symmetric positive semi-definite functions, neither degenerate at a_{k−1}, nor degenerate at a_k given a_{k−1}.

(A) There exists an R²-valued stochastic process (SD_t)_{t∈I}, SD_t = (z_t, z'_t), satisfying the following conditions:

1) The probability density of (SD_{a_0}, . . . , SD_{a_K}) reads:

p_b(x_0, . . .
, x_K) := ∏_{k=0}^{K} p_N(x_k; μ^b_k, Σ^b_k)

where:

Σ^b_0 = 1K_{a_0;a_0}, and ∀k > 0, Σ^b_k = kK_{a_k;a_k} − kK_{a_k;a_{k−1}} kK^{−1}_{a_{k−1};a_{k−1}} kK^T_{a_k;a_{k−1}},

μ^b_0 = 1M_{a_0}, and ∀k > 0, μ^b_k = kM_{a_k} + kK_{a_k;a_{k−1}} kK^{−1}_{a_{k−1};a_{k−1}} (x_{k−1} − kM_{a_{k−1}}),

with

kK_{u;v} = [ k_k(u, v), ∂k_k/∂y(u, v) ; ∂k_k/∂x(u, v), ∂²k_k/∂x∂y(u, v) ], and kM_u = ( m_k(u), dm_k/dt(u) ).

2) Conditional on (SD_{a_k} = x_k)_{k∈[0..K]}, the restrictions (SD_t)_{t∈]a_{k−1},a_k[}, k ∈ [1..K], are independent conditional derivative Gaussian processes, respectively with unconditional mean function m_k and unconditional covariance function k_k, and that are conditioned to take values x_{k−1} and x_k at a_{k−1} and a_k respectively. We refer to (SD_t)_{t∈I} as a string derivative Gaussian process, and to its first coordinate (z_t)_{t∈I} as a string Gaussian process, namely (z_t)_{t∈I} ∼ SGP({a_k}, {m_k}, {k_k}).

(B) The string Gaussian process (z_t)_{t∈I} defined in (A) is C1 in the L2 sense and its L2 derivative is the process (z'_t)_{t∈I} defined in (A).

Appendix C.1 Proof of Theorem 2 (A)

We will once again turn to Kolmogorov's extension theorem to prove the existence of the stochastic process (SD_t)_{t∈I}. The core of the proof is in the finite dimensional measures implied by Theorem 2 (A-1) and (A-2). Let {t^k_i ∈ ]a_{k−1}, a_k[}_{i∈[1..N_k], k∈[1..K]} be n times. We first formally construct the finite dimensional measures implied by Theorem 2 (A-1) and (A-2), and then verify that they satisfy the conditions of Kolmogorov's extension theorem. Let us define the measure ν^SD_{t^1_1,...,t^1_{N_1},...,t^K_1,...,t^K_{N_K},a_0,...,a_K} as the probability measure having density with respect to the Lebesgue measure on B(R² × · · · × R²) (1 + n + K times) that reads:

p_SD(x_{t^1_1}, . . . , x_{t^1_{N_1}}, . . . , x_{t^K_1}, . . . , x_{t^K_{N_K}}, x_{a_0}, . . . , x_{a_K}) = p_b(x_{a_0}, . . . , x_{a_K}) ∏_{k=1}^{K} p^{x_{a_{k−1}},x_{a_k}}_N(x_{t^k_1}, . . . , x_{t^k_{N_k}})   (68)

where p_b is as per Theorem 2 (A-1) and p^{x_{a_{k−1}},x_{a_k}}_N(x_{t^k_1}, . . .
, x_{t^k_{N_k}}) is the (Gaussian) pdf of the joint distribution of the values at times {t^k_i ∈ ]a_{k−1}, a_k[} of the conditional derivative Gaussian process with unconditional mean function m_k and unconditional covariance function k_k, that is conditioned to take values x_{a_{k−1}} = (z_{a_{k−1}}, z'_{a_{k−1}}) and x_{a_k} = (z_{a_k}, z'_{a_k}) at times a_{k−1} and a_k respectively (the corresponding conditional mean and covariance functions are derived from Equations (3) and (4)). Let us extend the family of measures ν^SD to cases where some or all boundary times a_k are missing, by integrating out the corresponding variables in Equation (68). For instance, when a_0 and a_1 are missing,

ν^SD_{t^1_1,...,t^1_{N_1},...,t^K_1,...,t^K_{N_K},a_2,...,a_K}(T^1_1, . . . , T^1_{N_1}, . . . , T^K_1, . . . , T^K_{N_K}, A_2, . . . , A_K) := ν^SD_{t^1_1,...,t^1_{N_1},...,t^K_1,...,t^K_{N_K},a_0,...,a_K}(T^1_1, . . . , T^1_{N_1}, . . . , T^K_1, . . . , T^K_{N_K}, R², R², A_2, . . . , A_K)   (69)

where the A_i and T^i_j are rectangles in R². Finally, we extend the family of measures ν^SD to any arbitrary set of indices {t_1, . . . , t_n} as follows:

ν^SD_{t_1,...,t_n}(T_1, . . . , T_n) := ν^SD_{t_{π*(1)},...,t_{π*(n)}}(T_{π*(1)}, . . . , T_{π*(n)}),   (70)

where π* is a permutation of {1, . . . , n} such that {t_{π*(1)}, . . . , t_{π*(n)}} verify the following conditions:

1. ∀i, j, if t_i ∈ ]a_{k_1−1}, a_{k_1}[, t_j ∈ ]a_{k_2−1}, a_{k_2}[, and k_1 < k_2, then Idx(t_i) < Idx(t_j), where Idx(t_i) stands for the index of t_i in {t_{π*(1)}, . . . , t_{π*(n)}};

2. if t_i ∉ {a_0, . . . , a_K} and t_j ∈ {a_0, . . . , a_K} then Idx(t_i) < Idx(t_j);

3. if t_i ∈ {a_0, . . . , a_K} and t_j ∈ {a_0, . . . , a_K} then Idx(t_i) < Idx(t_j) if and only if t_i < t_j.

Any such measure ν^SD_{t_{π*(1)},...,t_{π*(n)}} will fall in the category of either Equation (68) or Equation (69). Although π* is not unique, any two permutations satisfying the above conditions will only differ by a permutation of times belonging to the same string interval ]a_{k−1}, a_k[.
Moreover, it follows from Equations (68) and (69) that the measures ν^SD_{t_{π*(1)},...,t_{π*(n)}} are invariant under permutations of times belonging to the same string interval ]a_{k−1}, a_k[, and as a result any two π* satisfying the above conditions will yield the same probability measure. The finite dimensional probability measures ν^SD_{t_1,...,t_n} are the measures implied by Theorem 2. The permutation condition (57) of Kolmogorov's extension theorem is met by virtue of Equation (70). In effect, for every permutation π of {1, . . . , n}, if we let π′ : {π(1), . . . , π(n)} → {π*(1), . . . , π*(n)}, then

ν^SD_{t_π(1),...,t_π(n)}(T_π(1), . . . , T_π(n)) := ν^SD_{t_{π′∘π(1)},...,t_{π′∘π(n)}}(T_{π′∘π(1)}, . . . , T_{π′∘π(n)}) = ν^SD_{t_{π*(1)},...,t_{π*(n)}}(T_{π*(1)}, . . . , T_{π*(n)}) = ν^SD_{t_1,...,t_n}(T_1, . . . , T_n).

As for the marginalisation condition (58), it is met for every boundary time by virtue of how we extended ν^SD to missing boundary times. All we need to prove now is that the marginalisation condition is also met at any non-boundary time. To do so, it is sufficient to prove that the marginalisation condition holds for t^1_1, that is:

ν^SD_{t^1_1,...,t^1_{N_1},...,t^K_1,...,t^K_{N_K},a_0,...,a_K}(R², T^1_2, . . . , T^1_{N_1}, . . . , T^K_1, . . . , T^K_{N_K}, A_0, . . . , A_K) = ν^SD_{t^1_2,...,t^1_{N_1},...,t^K_1,...,t^K_{N_K},a_0,...,a_K}(T^1_2, . . . , T^1_{N_1}, . . . , T^K_1, . . . , T^K_{N_K}, A_0, . . . , A_K)   (71)

for every rectangle A_i and T^i_j in R². In effect, cases where some boundary times are missing are special cases with the corresponding rectangles A_j set to R². Moreover, if we prove Equation (71), the permutation property (57) will allow us to conclude that the marginalisation also holds true for any other (single) non-boundary time. Furthermore, if Equation (71) holds true, it can be shown that the marginalisation condition will also hold over multiple non-boundary times, by using the permutation property (57) and marginalising one non-boundary time after another.
By Fubini's theorem, and considering Equation (68), showing that Equation (71) holds true is equivalent to showing that:

∫_{R²} p^{x_{a_0},x_{a_1}}_N(x_{t^1_1}, . . . , x_{t^1_{N_1}}) dx_{t^1_1} = p^{x_{a_0},x_{a_1}}_N(x_{t^1_2}, . . . , x_{t^1_{N_1}}),   (72)

which holds true as p^{x_{a_0},x_{a_1}}_N(x_{t^1_1}, . . . , x_{t^1_{N_1}}) is a multivariate Gaussian density, and the corresponding marginal is indeed the density of the same conditional derivative Gaussian process at times t^1_2, . . . , t^1_{N_1}. This concludes the proof of the existence of the stochastic process (SD_t)_{t∈I}.

Appendix C.2 Proof of Theorem 2 (B)

As, conditional on boundary conditions, the restriction of a string derivative Gaussian process to a string interval [a_{k−1}, a_k] is a derivative Gaussian process, it follows from Proposition 1 (C) that ∀ x_{a_0}, . . . , x_{a_K}, ∀ t, t + h ∈ [a_{k−1}, a_k],

lim_{h→0} E[ ((z_{t+h} − z_t)/h − z'_t)² | x_{a_0} = x_{a_0}, . . . , x_{a_K} = x_{a_K} ] = 0,   (73)

or equivalently that:

ε_h := E[ ((z_{t+h} − z_t)/h − z'_t)² | x_{a_0}, . . . , x_{a_K} ] → 0 a.s. as h → 0.   (74)

Moreover,

ε_h = Var[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ] + E[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ]².   (75)

As both terms in the sum of the above equation are non-negative, it follows that

Var[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ] → 0 a.s. and E[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ]² → 0 a.s.

as h → 0. As E[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ] depends linearly on the boundary conditions, and as the boundary conditions are jointly Gaussian (see Appendix H, step 1), it follows that E[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ] is Gaussian. Finally, we note that

Var[ (z_{t+h} − z_t)/h − z'_t | x_{a_0}, . . . , x_{a_K} ]

does not depend on the values of the boundary conditions x_{a_k} (but rather on the boundary times), and we recall that almost sure convergence of Gaussian random variables implies convergence in L2. Hence, taking the expectation on both sides of Equation (75) and then the limit as h goes to 0, we get

E[ ((z_{t+h} − z_t)/h − z'_t)² ] = E(ε_h) → 0 as h → 0,

which proves that the string GP (z_t)_{t∈I} is differentiable in the L2 sense on I and has derivative (z'_t)_{t∈I}.
We prove the continuity in the L2 sense of (z'_t)_{t∈I} in a similar fashion, noting that, conditional on the boundary conditions, (z'_t)_{t∈I} is a Gaussian process whose mean function dm^{a_{k−1},a_k}_{c_k}/dt and covariance function ∂²k^{a_{k−1},a_k}_{c_k}/∂x∂y are continuous, and thus it is continuous in the L2 sense on [a_{k−1}, a_k] (conditional on the boundary conditions). We therefore have that ∀ x_{a_0}, . . . , x_{a_K}, ∀ t, t + h ∈ [a_{k−1}, a_k],

lim_{h→0} E[ (z'_{t+h} − z'_t)² | x_{a_0} = x_{a_0}, . . . , x_{a_K} = x_{a_K} ] = 0,

from which we get that:

ε'_h := E[ (z'_{t+h} − z'_t)² | x_{a_0}, . . . , x_{a_K} ] → 0 a.s. as h → 0.   (77)

Moreover,

ε'_h = Var[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ] + E[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ]²,   (78)

which implies that Var[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ] → 0 a.s. and E[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ]² → 0 a.s. as h → 0, as both terms in the sum in Equation (78) are non-negative. Finally, Var[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ] does not depend on the values of the boundary conditions, and E[ z'_{t+h} − z'_t | x_{a_0}, . . . , x_{a_K} ] is Gaussian for the same reason as before. Hence, taking the expectation on both sides of Equation (78), we get that

E[ (z'_{t+h} − z'_t)² ] = E(ε'_h) → 0 as h → 0,

which proves that (z'_t) is continuous in the L2 sense.

Appendix D. Proof of the Condition for Pathwise Regularity Upgrade of String GPs from L2

In this section we prove that a sufficient condition for the process (z'_t)_{t∈I} in Theorem 2 to be almost surely continuous and to be the almost sure derivative of the string Gaussian process (z_t)_{t∈I} is that the Gaussian processes on I_k = [a_{k−1}, a_k] with mean and covariance functions m^{a_{k−1},a_k}_{c_k} and k^{a_{k−1},a_k}_{c_k} (as per Equations (3) and (4) with m := m_k and k := k_k) are themselves almost surely C1 for every boundary condition. Firstly, we note that the above condition guarantees that the result holds at non-boundary times. As for boundary times, the condition implies that the string GP is almost surely right differentiable (resp. left differentiable) at every left (resp. right) boundary time, including a_0 and a_K.
Moreover, the string GP being differentiable in L2, the right hand side and left hand side almost sure derivatives are the same, and are equal to the L2 derivative, which proves that the L2 derivatives at inner boundary times are also in the almost sure sense. A similar argument holds to conclude that the right (resp. left) hand side derivative at a_0 (resp. a_K) is also in the almost sure sense. Moreover, the derivative process (z'_t)_{t∈I} admits an almost sure right hand side limit and an almost sure left hand side limit at every inner boundary time, and both are equal as the derivative is continuous in L2, which proves its almost sure continuity at inner boundary times. Almost sure continuity of (z'_t)_{t∈I} on the right (resp. left) of a_0 (resp. a_K) is a direct consequence of the above condition.

Appendix E. Proof of Proposition 4

In this section, we prove Proposition 4, which we recall below.

Proposition 4 (Additively separable string GPs are flexible) Let k(x, y) := ρ(||x − y||²_{L2}) be a stationary covariance function generating a.s. C1 GP paths indexed on R^d, d > 0, with ρ a function that is C2 on ]0, +∞[ and continuous at 0. Let φ_s(x_1, . . . , x_d) = Σ_{j=1}^d x_j, let (z^j_t)_{t∈I_j}, j ∈ [1..d], be independent stationary Gaussian processes with mean 0 and covariance function k (where the L2 norm is on R), and let f(t_1, . . . , t_d) = φ_s(z^1_{t_1}, . . . , z^d_{t_d}) be the corresponding stationary string GP. Finally, let g be an isotropic Gaussian process indexed on I_1 × · · · × I_d with mean 0 and covariance function k (where the L2 norm is on R^d). Then:

1) ∀x ∈ I_1 × · · · × I_d, H(∇f(x)) = H(∇g(x)),

2) ∀x ≠ y ∈ I_1 × · · · × I_d, I(∇f(x); ∇f(y)) ≤ I(∇g(x); ∇g(y)).

To prove Proposition 4 we need a lemma, which we state and prove below.

Lemma 8 Let X_n be a sequence of Gaussian random vectors with auto-covariance matrices Σ_n and means μ_n, converging almost surely to X_∞. If Σ_n → Σ_∞ and μ_n → μ_∞, then X_∞ is Gaussian with mean μ_∞ and auto-covariance matrix Σ_∞.
Proof We need to show that the characteristic function of $X_\infty$ is
\[ \phi_{X_\infty}(t) := \mathbb{E}\left(e^{i t^T X_\infty}\right) = e^{i t^T \mu_\infty - \frac{1}{2} t^T \Sigma_\infty t}. \]
As $\Sigma_n$ is positive semi-definite, $\forall n$, $\left| e^{i t^T \mu_n - \frac{1}{2} t^T \Sigma_n t} \right| = e^{-\frac{1}{2} t^T \Sigma_n t} \leq 1$. Hence, by Lebesgue's dominated convergence theorem,
\[ \phi_{X_\infty}(t) = \mathbb{E}\left( \lim_{n\to+\infty} e^{i t^T X_n} \right) = \lim_{n\to+\infty} \mathbb{E}\left(e^{i t^T X_n}\right) = \lim_{n\to+\infty} e^{i t^T \mu_n - \frac{1}{2} t^T \Sigma_n t} = e^{i t^T \mu_\infty - \frac{1}{2} t^T \Sigma_\infty t}. \]

Appendix E.1 Proof of Proposition 4 1)

Let $x = (t^x_1, \dots, t^x_d) \in I_1 \times \dots \times I_d$. We want to show that $H(\nabla f(x)) = H(\nabla g(x))$, where $f$ and $g$ are as per Proposition 4 and $H$ is the entropy operator. Firstly, we note from Equation (11) that $\nabla f(x) = \left( z'^1_{t^x_1}, \dots, z'^d_{t^x_d} \right)$, where the joint law of the GP $(z^j_t)_{t\in I_j}$ and its derivative $(z'^j_t)_{t\in I_j}$ is provided in Proposition 1. As the processes $\left(z^j_t, z'^j_t\right)_{t\in I_j}$, $j \in [1..d]$, are assumed to be independent of each other, $\nabla f(x)$ is a Gaussian vector and its covariance matrix reads
\[ \Sigma_{\nabla f(x)} = -2 \frac{d\rho}{dx}(0)\, I_d, \tag{80} \]
where $I_d$ is the $d \times d$ identity matrix. Hence,
\[ H(\nabla f(x)) = \frac{d}{2}\left(1 + \ln(2\pi)\right) + \frac{1}{2} \ln \left| \Sigma_{\nabla f(x)} \right|. \tag{81} \]
Secondly, let $e_j$ denote the $d$-dimensional vector whose $j$-th coordinate is $1$ and every other coordinate is $0$, and let $h \in \mathbb{R}$. As the proposition assumes the covariance function $k$ generates almost surely $\mathcal{C}^1$ surfaces, the vectors $\left( \frac{g(x+he_1)-g(x)}{h}, \dots, \frac{g(x+he_d)-g(x)}{h} \right)$ are Gaussian vectors converging almost surely as $h \to 0$. Moreover, their mean is $0$ and their covariance matrices have as element on the $i$-th row and $j$-th column ($i \neq j$)
\[ \mathrm{cov}\left( \frac{g(x+he_i)-g(x)}{h}, \frac{g(x+he_j)-g(x)}{h} \right) = \frac{\rho(2h^2) - 2\rho(h^2) + \rho(0)}{h^2}, \tag{82} \]
and as diagonal terms
\[ \mathrm{Var}\left( \frac{g(x+he_j)-g(x)}{h} \right) = \frac{2\left(\rho(0) - \rho(h^2)\right)}{h^2}. \tag{83} \]
Taking the limit of Equations (82) and (83) using the first order Taylor expansion of $\rho$ (which the proposition assumes is $\mathcal{C}^2$), we get that
\[ \Sigma_{\nabla g(x)} = -2 \frac{d\rho}{dx}(0)\, I_d = \Sigma_{\nabla f(x)}. \tag{84} \]
It then follows from Lemma 8 that the limit $\nabla g(x)$ of $\left( \frac{g(x+he_1)-g(x)}{h}, \dots, \frac{g(x+he_d)-g(x)}{h} \right)$ is also a Gaussian vector, which proves that $H(\nabla f(x)) = H(\nabla g(x))$.
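The finite-difference limits in Equations (82)-(84), and the gradient cross-covariance limits derived in Appendix E.2, can be sanity-checked numerically. The sketch below assumes a squared-exponential radial profile $\rho(r) = \exp(-r/2\ell^2)$; this particular choice of $\rho$ is our illustrative assumption, not one prescribed by the paper.

```python
import numpy as np

ell = 1.0
rho   = lambda r: np.exp(-r / (2 * ell**2))      # assumed radial profile
drho  = lambda r: -rho(r) / (2 * ell**2)         # rho'
d2rho = lambda r: rho(r) / (4 * ell**4)          # rho''
k = lambda x, y: rho(np.sum((x - y)**2))         # isotropic kernel of g

h = 1e-3
# Equations (82)-(83): finite-difference (co)variances of gradient components at x
var_fd = 2 * (rho(0.0) - rho(h**2)) / h**2                   # -> -2 rho'(0)
cov_fd = (rho(2 * h**2) - 2 * rho(h**2) + rho(0.0)) / h**2   # -> 0

# Cross-covariances between gradients at x != y (Equations (86), (90), (93))
x = np.array([0.3, -0.2, 0.9]); y = np.array([1.0, 0.4, 0.5])
r = np.sum((x - y)**2)
def grad_cov_fd(i, j):
    # finite-difference estimate of cov(dg/dx_i(x), dg/dy_j(y))
    ei, ej = np.eye(3)[i], np.eye(3)[j]
    return (k(x + h*ei, y + h*ej) - k(x + h*ei, y)
            - k(x, y + h*ej) + k(x, y)) / h**2

diag_closed = -2 * (drho(r) + 2 * (x[0] - y[0])**2 * d2rho(r))   # i = j case
off_closed  = -4 * (x[0] - y[0]) * (x[1] - y[1]) * d2rho(r)      # i != j case
```

The finite-difference estimates should agree with the closed forms up to $O(h)$ truncation error.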
Appendix E.2 Proof of Proposition 4 2)

We start by stating and proving another lemma we will later use.

Lemma 9 Let $A$ and $B$ be two $d$-dimensional jointly Gaussian vectors with diagonal covariance matrices $\Sigma_A$ and $\Sigma_B$ respectively. Let $\Sigma_{A,B}$ be the cross-covariance matrix between $A$ and $B$, and let $\mathrm{diag}(\Sigma_{A,B})$ be the diagonal matrix whose diagonal is that of $\Sigma_{A,B}$. Then:
\[ \det \begin{pmatrix} \Sigma_A & \mathrm{diag}(\Sigma_{A,B}) \\ \mathrm{diag}(\Sigma_{A,B}) & \Sigma_B \end{pmatrix} \geq \det \begin{pmatrix} \Sigma_A & \Sigma_{A,B} \\ \Sigma_{A,B}^T & \Sigma_B \end{pmatrix}. \]
Proof Firstly we note that
\[ \det \begin{pmatrix} \Sigma_A & \mathrm{diag}(\Sigma_{A,B}) \\ \mathrm{diag}(\Sigma_{A,B}) & \Sigma_B \end{pmatrix} = \det(\Sigma_A) \det\left( \Sigma_B - \mathrm{diag}(\Sigma_{A,B})\, \Sigma_A^{-1}\, \mathrm{diag}(\Sigma_{A,B}) \right) \]
and
\[ \det \begin{pmatrix} \Sigma_A & \Sigma_{A,B} \\ \Sigma_{A,B}^T & \Sigma_B \end{pmatrix} = \det(\Sigma_A) \det\left( \Sigma_B - \Sigma_{A,B}^T \Sigma_A^{-1} \Sigma_{A,B} \right). \]
As the matrix $\Sigma_A$ is positive semi-definite, $\det(\Sigma_A) \geq 0$. The case $\det(\Sigma_A) = 0$ is straightforward. Thus we assume that $\det(\Sigma_A) > 0$, so that all we need to prove is that
\[ \det\left( \Sigma_B - \mathrm{diag}(\Sigma_{A,B})\, \Sigma_A^{-1}\, \mathrm{diag}(\Sigma_{A,B}) \right) \geq \det\left( \Sigma_B - \Sigma_{A,B}^T \Sigma_A^{-1} \Sigma_{A,B} \right). \]
Secondly, the matrix $\Sigma^{\mathrm{diag}}_{B|A} := \Sigma_B - \mathrm{diag}(\Sigma_{A,B})\, \Sigma_A^{-1}\, \mathrm{diag}(\Sigma_{A,B})$ being diagonal, its determinant is the product of its diagonal terms:
\[ \det\left(\Sigma^{\mathrm{diag}}_{B|A}\right) = \prod_{i=1}^d \Sigma^{\mathrm{diag}}_{B|A}[i,i] = \prod_{i=1}^d \left( \Sigma_B[i,i] - \frac{\Sigma_{A,B}[i,i]^2}{\Sigma_A[i,i]} \right). \]
As for the matrix $\Sigma_{B|A} := \Sigma_B - \Sigma_{A,B}^T \Sigma_A^{-1} \Sigma_{A,B}$, we note that it happens to be the covariance matrix of the (Gaussian) distribution of $B$ given $A$, and thus is positive semi-definite and admits a Cholesky decomposition $\Sigma_{B|A} = LL^T$. It follows that
\[ \det(\Sigma_{B|A}) = \prod_{i=1}^d L[i,i]^2 \leq \prod_{i=1}^d \Sigma_{B|A}[i,i] = \prod_{i=1}^d \left( \Sigma_B[i,i] - \sum_{j=1}^d \frac{\Sigma_{A,B}[j,i]^2}{\Sigma_A[j,j]} \right) \leq \prod_{i=1}^d \left( \Sigma_B[i,i] - \frac{\Sigma_{A,B}[i,i]^2}{\Sigma_A[i,i]} \right) = \det\left(\Sigma^{\mathrm{diag}}_{B|A}\right), \tag{85} \]
where the first inequality results from the fact that $\Sigma_{B|A}[i,i] = \sum_{j=1}^i L[i,j]^2 \geq L[i,i]^2$ by definition of the Cholesky decomposition. This proves that $\det\left( \Sigma_B - \mathrm{diag}(\Sigma_{A,B})\, \Sigma_A^{-1}\, \mathrm{diag}(\Sigma_{A,B}) \right) \geq \det\left( \Sigma_B - \Sigma_{A,B}^T \Sigma_A^{-1} \Sigma_{A,B} \right)$, which as previously discussed concludes the proof of the lemma.

Proof of Proposition 4 2): Let $x = (t^x_1, \dots, t^x_d)$, $y = (t^y_1, \dots, t^y_d) \in I_1 \times \dots \times I_d$, $x \neq y$. We want to show that $I(\nabla f(x); \nabla f(y)) \leq I(\nabla g(x); \nabla g(y))$, where $f$ and $g$ are as per Proposition 4, and $I(X; Y) = H(X) + H(Y) - H(X, Y)$ is the mutual information between $X$ and $Y$.
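As an aside, Lemma 9's determinant inequality can be checked numerically. The construction below, with identity marginal blocks and a scaled orthogonal cross-block, is purely an illustrative assumption chosen to guarantee that the full block matrix is a valid (positive definite) joint covariance; it is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# A valid joint covariance with diagonal marginal blocks: Sigma_A = Sigma_B = I_d,
# cross-block C with spectral norm 1/2 (so eigenvalues of the block matrix are 1 +/- 1/2 > 0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
C = 0.5 * Q
S_full = np.block([[np.eye(d), C], [C.T, np.eye(d)]])

D = np.diag(np.diag(C))  # diag(Sigma_{A,B})
S_diag = np.block([[np.eye(d), D], [D, np.eye(d)]])

det_diag = np.linalg.det(S_diag)
det_full = np.linalg.det(S_full)
```

Here $\det(S_{\text{full}}) = \det(I - C^T C) = 0.75^d$ exactly, while $\det(S_{\text{diag}}) = \prod_i (1 - D[i,i]^2) \geq 0.75^d$, as the lemma predicts.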
As we have proved that $\forall x$, $H(\nabla f(x)) = H(\nabla g(x))$, all we need to prove now is that $H(\nabla f(x), \nabla f(y)) \geq H(\nabla g(x), \nabla g(y))$. Firstly, it follows from Equation (79) and the fact that the derivative Gaussian processes $\left(z^j_t, z'^j_t\right)_{t\in I_j}$ are independent that $(\nabla f(x), \nabla f(y))$ is a jointly Gaussian vector. Moreover, the cross-covariance matrix $\Sigma_{\nabla f(x), \nabla f(y)}$ is diagonal with diagonal terms
\[ \Sigma_{\nabla f(x), \nabla f(y)}[i,i] = -2\left[ \frac{d\rho}{dx}\left(||x-y||^2_{L^2}\right) + 2(t^x_i - t^y_i)^2 \frac{d^2\rho}{dx^2}\left(||x-y||^2_{L^2}\right) \right]. \tag{86} \]
Secondly, it follows from a similar argument to the previous proof that $(\nabla g(x), \nabla g(y))$ is also a jointly Gaussian vector, and the terms $\Sigma_{\nabla g(x), \nabla g(y)}[i,j]$ are evaluated as limits of the cross-covariance terms $\mathrm{cov}\left( \frac{g(x+he_i)-g(x)}{h}, \frac{g(y+he_j)-g(y)}{h} \right)$ as $h \to 0$. For $i = j$,
\[ \mathrm{cov}\left( \frac{g(x+he_i)-g(x)}{h}, \frac{g(y+he_i)-g(y)}{h} \right) = \frac{1}{h^2}\left[ 2\rho\left(\sum_k (t^x_k - t^y_k)^2\right) - \rho\left(\sum_{k\neq i} (t^x_k - t^y_k)^2 + (t^x_i + h - t^y_i)^2\right) - \rho\left(\sum_{k\neq i} (t^x_k - t^y_k)^2 + (t^x_i - h - t^y_i)^2\right) \right]. \tag{87} \]
As $\rho$ is assumed to be $\mathcal{C}^2$, the below Taylor expansions around $h = 0$ hold true:
\[ \rho\left(\sum_k (t^x_k - t^y_k)^2\right) - \rho\left(\sum_{k\neq i} (t^x_k - t^y_k)^2 + (t^x_i - h - t^y_i)^2\right) = 2(t^x_i - t^y_i)\, h\, \frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) - h^2\left[ \frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) + 2(t^x_i - t^y_i)^2 \frac{d^2\rho}{dx^2}\left(\sum_k (t^x_k - t^y_k)^2\right) \right] + o(h^2), \tag{88} \]
\[ \rho\left(\sum_k (t^x_k - t^y_k)^2\right) - \rho\left(\sum_{k\neq i} (t^x_k - t^y_k)^2 + (t^x_i + h - t^y_i)^2\right) = -2(t^x_i - t^y_i)\, h\, \frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) - h^2\left[ \frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) + 2(t^x_i - t^y_i)^2 \frac{d^2\rho}{dx^2}\left(\sum_k (t^x_k - t^y_k)^2\right) \right] + o(h^2). \tag{89} \]
Plugging Equations (88) and (89) into Equation (87) and taking the limit we obtain
\[ \Sigma_{\nabla g(x), \nabla g(y)}[i,i] = -2\left[ \frac{d\rho}{dx}\left(||x-y||^2_{L^2}\right) + 2(t^x_i - t^y_i)^2 \frac{d^2\rho}{dx^2}\left(||x-y||^2_{L^2}\right) \right] = \Sigma_{\nabla f(x), \nabla f(y)}[i,i]. \tag{90} \]
Similarly, for $i \neq j$,
\[ \mathrm{cov}\left( \frac{g(x+he_i)-g(x)}{h}, \frac{g(y+he_j)-g(y)}{h} \right) = \frac{1}{h^2}\left[ \rho\left(\sum_{k\neq i,j} (t^x_k - t^y_k)^2 + (t^x_i + h - t^y_i)^2 + (t^x_j - h - t^y_j)^2\right) - \rho\left(\sum_{k\neq i} (t^x_k - t^y_k)^2 + (t^x_i + h - t^y_i)^2\right) - \rho\left(\sum_{k\neq j} (t^x_k - t^y_k)^2 + (t^x_j - h - t^y_j)^2\right) + \rho\left(\sum_k (t^x_k - t^y_k)^2\right) \right], \tag{91} \]
and the corresponding Taylor expansion reads
\[ \rho\left(\sum_{k\neq i,j} (t^x_k - t^y_k)^2 + (t^x_i + h - t^y_i)^2 + (t^x_j - h - t^y_j)^2\right) - \rho\left(\sum_k (t^x_k - t^y_k)^2\right) = 2\left(t^x_i - t^y_i - t^x_j + t^y_j\right) h\, \frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) + h^2\left[ 2\frac{d\rho}{dx}\left(\sum_k (t^x_k - t^y_k)^2\right) + 2\left( (t^x_i - t^y_i) - (t^x_j - t^y_j) \right)^2 \frac{d^2\rho}{dx^2}\left(\sum_k (t^x_k - t^y_k)^2\right) \right] + o(h^2). \tag{92} \]
Plugging Equations (88), (89) and (92) into Equation (91) and taking the limit we obtain
\[ \Sigma_{\nabla g(x), \nabla g(y)}[i,j] = -4(t^x_i - t^y_i)(t^x_j - t^y_j)\, \frac{d^2\rho}{dx^2}\left(||x-y||^2_{L^2}\right). \tag{93} \]
To summarize, $(\nabla f(x), \nabla f(y))$ and $(\nabla g(x), \nabla g(y))$ are both jointly Gaussian vectors; $\nabla f(x)$, $\nabla g(x)$, $\nabla f(y)$, and $\nabla g(y)$ are (Gaussian) identically distributed with a diagonal covariance matrix; $\Sigma_{\nabla f(x), \nabla f(y)}$ is diagonal; and $\Sigma_{\nabla g(x), \nabla g(y)}$ has the same diagonal as $\Sigma_{\nabla f(x), \nabla f(y)}$ but possibly non-zero off-diagonal terms. Hence, it follows from Lemma 9 that the determinant of the auto-covariance matrix of $(\nabla f(x), \nabla f(y))$ is higher than that of the auto-covariance matrix of $(\nabla g(x), \nabla g(y))$; or equivalently, the entropy of $(\nabla f(x), \nabla f(y))$ is higher than that of $(\nabla g(x), \nabla g(y))$ (as both are Gaussian vectors), which as previously discussed is sufficient to conclude that the mutual information between $\nabla f(x)$ and $\nabla f(y)$ is smaller than that between $\nabla g(x)$ and $\nabla g(y)$.

Appendix F. Proof of Proposition 6

In this section, we prove Proposition 6, which we recall below.

Proposition 6 (Extension of the standard GP paradigm) Let $K \in \mathbb{N}^*$, let $I = [a_0, a_K]$ and $I_k = [a_{k-1}, a_k]$ be intervals with $a_0 < \dots < a_K$. Furthermore, let $m : I \to \mathbb{R}$ be a $\mathcal{C}^1$ function, $m_k$ the restriction of $m$ to $I_k$, $h : I \times I \to \mathbb{R}$ a $\mathcal{C}^3$ symmetric positive semi-definite function, and $h_k$ the restriction of $h$ to $I_k \times I_k$. If $(z_t)_{t\in I} \sim \mathcal{SGP}(\{a_k\}, \{m_k\}, \{h_k\})$, then $\forall k \in [1..K]$, $(z_t)_{t\in I_k} \sim \mathcal{GP}(m, h)$.

Proof To prove Proposition 6, we consider the string derivative Gaussian process (Theorem 2) $(SD_t)_{t\in I}$, $SD_t = (z_t, z'_t)$, with unconditional string mean and covariance functions as per Proposition 6, and prove that its restrictions to the intervals $I_k = [a_{k-1}, a_k]$ are derivative Gaussian processes with the same mean function $m$ and covariance function $h$. Proposition 1(B) will then allow us to conclude that the $(z_t)_{t\in I_k}$ are GPs with mean $m$ and covariance function $h$. Let $t_1, \dots, t_n \in ]a_{k-1}, a_k[$ and let $p_D(x_{a_{k-1}})$ (respectively $p_D(x_{a_k}|x_{a_{k-1}})$ and $p_D(x_{t_1}, \dots, x_{t_n}|x_{a_{k-1}}, x_{a_k})$)
denote the pdf of the value of the derivative Gaussian process with mean function $m$ and covariance function $h$ at $a_{k-1}$ (respectively of its value at $a_k$ conditional on its value at $a_{k-1}$, and of its values at $t_1, \dots, t_n$ conditional on its values at $a_{k-1}$ and $a_k$). Saying that the restriction of the string derivative Gaussian process $(SD_t)$ to $[a_{k-1}, a_k]$ is the derivative Gaussian process with mean $m$ and covariance $h$ is equivalent to saying that all finite dimensional marginals of the string derivative Gaussian process $p_{SD}(x_{a_{k-1}}, x_{t_1}, \dots, x_{t_n}, x_{a_k})$, $t_i \in [a_{k-1}, a_k]$, factorise as²⁰
\[ p_{SD}(x_{a_{k-1}}, x_{t_1}, \dots, x_{t_n}, x_{a_k}) = p_D(x_{a_{k-1}})\, p_D(x_{a_k}|x_{a_{k-1}})\, p_D(x_{t_1}, \dots, x_{t_n}|x_{a_{k-1}}, x_{a_k}). \]
Moreover, we know from Theorem 2 that by design $p_{SD}(x_{a_{k-1}}, x_{t_1}, \dots, x_{t_n}, x_{a_k})$ factorises as
\[ p_{SD}(x_{a_{k-1}}, x_{t_1}, \dots, x_{t_n}, x_{a_k}) = p_{SD}(x_{a_{k-1}})\, p_D(x_{a_k}|x_{a_{k-1}})\, p_D(x_{t_1}, \dots, x_{t_n}|x_{a_{k-1}}, x_{a_k}). \]
In other words, all we need to prove is that $p_{SD}(x_{a_k}) = p_D(x_{a_k})$ for every boundary time, which we will do by induction. We note by integrating out every boundary condition but the first in $p_b$ (as per Theorem 2 (a-1)) that $p_{SD}(x_{a_0}) = p_D(x_{a_0})$. If we assume that $p_{SD}(x_{a_{k-1}}) = p_D(x_{a_{k-1}})$ for some $k > 0$, then as previously discussed the restriction of the string derivative Gaussian process to $[a_{k-1}, a_k]$ will be the derivative Gaussian process with the same mean and covariance functions, which will imply that $p_{SD}(x_{a_k}) = p_D(x_{a_k})$. This concludes the proof.

20. We emphasize that the terms on the right hand side of this equation involve $p_D$, not $p_{SD}$.

Appendix G. Proof of Lemma 10

In this section, we prove Lemma 10, which we recall below.

Lemma 10 Let $X$ be a multivariate Gaussian with mean $\mu_X$ and covariance matrix $\Sigma_X$.
If, conditional on $X$, $Y$ is a multivariate Gaussian with mean $MX + A$ and covariance matrix $\Sigma^c_Y$, where $M$, $A$ and $\Sigma^c_Y$ do not depend on $X$, then $(X, Y)$ is a jointly Gaussian vector with mean
\[ \mu_{X;Y} = \begin{pmatrix} \mu_X \\ M\mu_X + A \end{pmatrix} \]
and covariance matrix
\[ \Sigma_{X;Y} = \begin{pmatrix} \Sigma_X & \Sigma_X M^T \\ M\Sigma_X & \Sigma^c_Y + M\Sigma_X M^T \end{pmatrix}. \]
Proof To prove this lemma we introduce two vectors $\tilde{X}$ and $\tilde{Y}$ whose lengths are the same as those of $X$ and $Y$ respectively, and such that $(\tilde{X}, \tilde{Y})$ is jointly Gaussian with mean $\mu_{X;Y}$ and covariance matrix $\Sigma_{X;Y}$. We then prove that the (marginal) distribution of $\tilde{X}$ is the same as the distribution of $X$, and that the distribution of $\tilde{Y}|\tilde{X} = x$ is the same as that of $Y|X = x$ for any $x$, which is sufficient to conclude that $(X, Y)$ and $(\tilde{X}, \tilde{Y})$ have the same distribution. It is obvious from the joint distribution of $(\tilde{X}, \tilde{Y})$ that $\tilde{X}$ is Gaussian with mean $\mu_X$ and covariance matrix $\Sigma_X$. As for the distribution of $\tilde{Y}$ conditional on $\tilde{X} = x$, it follows from the usual Gaussian identities that it is Gaussian with mean $M\mu_X + A + M\Sigma_X \Sigma_X^{-1}(x - \mu_X) = Mx + A$ and covariance matrix $\Sigma^c_Y + M\Sigma_X M^T - M\Sigma_X \Sigma_X^{-1} \Sigma_X^T M^T = \Sigma^c_Y$ (using the symmetry of the covariance matrix $\Sigma_X$), which is the same distribution as that of $Y|X = x$. This concludes our proof.

Appendix H. Proof of Proposition 5

In this section we prove that string GPs with link function $\phi_s$ are GPs; in other words, if $f$ is a string GP indexed on $\mathbb{R}^d$, $d > 0$, with link function $\phi_s(x_1, \dots, x_d) = \sum_{j=1}^d x_j$, then $(f(x_1), \dots, f(x_n))$ has a multivariate Gaussian distribution for every set of distinct points $x_1, \dots, x_n \in \mathbb{R}^d$.

Proof As the sum of independent Gaussian processes is a Gaussian process, a sufficient condition for additively separable string GPs to be GPs in dimensions $d > 1$ is that string GPs be GPs in dimension 1. Hence, all we need to do is prove that string GPs are GPs in dimension 1. Let $\left(z^j_t, z'^j_t\right)_{t\in I_j}$ be a string derivative GP in dimension 1, with boundary times $a^j_0, \dots, a^j_{K_j}$, and unconditional string mean and covariance functions $m^j_k$ and $k^j_k$ respectively.
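As an aside, Lemma 10's mean and covariance formulas can be verified by exact linear algebra: writing $Y = MX + A + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \Sigma^c_Y)$ independent of $X$, the pair $(X, Y)$ is a linear image of $(X, \varepsilon)$. The sketch below uses arbitrary random shapes and matrices as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dX, dY = 3, 2
W = rng.standard_normal((dX, dX)); Sigma_X = W @ W.T + np.eye(dX)  # valid cov of X
V = rng.standard_normal((dY, dY)); Sigma_c = V @ V.T + np.eye(dY)  # conditional cov of Y
M = rng.standard_normal((dY, dX))

# (X, Y) = L (X, eps), with Y = M X + A + eps and eps ~ N(0, Sigma_c) independent of X;
# A only shifts the mean and does not affect the covariance.
L = np.block([[np.eye(dX), np.zeros((dX, dY))],
              [M,          np.eye(dY)]])
Sigma_joint = L @ np.block([[Sigma_X,             np.zeros((dX, dY))],
                            [np.zeros((dY, dX)),  Sigma_c]]) @ L.T

# Covariance claimed by Lemma 10
Sigma_claim = np.block([[Sigma_X,      Sigma_X @ M.T],
                        [M @ Sigma_X,  Sigma_c + M @ Sigma_X @ M.T]])
```

The two matrices agree exactly, since $L \,\mathrm{diag}(\Sigma_X, \Sigma^c_Y)\, L^T$ expands term by term into the claimed block covariance.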
We want to prove that $(z^j_{t_1}, \dots, z^j_{t_n})$ is jointly Gaussian for any $t_1, \dots, t_n \in I_j$.

Appendix H.0.1 Step 1: $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}\right)$ is jointly Gaussian

We first prove recursively that the vector $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}\right)$ is jointly Gaussian. We note from Theorem 2 that $\left(z^j_t, z'^j_t\right)_{t\in[a_0, a_1]}$ is the derivative Gaussian process with mean $m^j_1$ and covariance function $k^j_1$. Hence, $\left(z^j_{a_0}, z'^j_{a_0}, z^j_{a_1}, z'^j_{a_1}\right)$ is jointly Gaussian. Moreover, let us assume that $B_{k-1} := \left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{k-1}}, z'^j_{a_{k-1}}\right)$ is jointly Gaussian for some $k > 1$. Conditional on $B_{k-1}$, $\left(z^j_{a_k}, z'^j_{a_k}\right)$ is Gaussian with a covariance matrix independent of $B_{k-1}$, and with mean
\[ \begin{pmatrix} m^j_k(a^j_k) \\ \frac{dm^j_k}{dt}(a^j_k) \end{pmatrix} + {}^j_k K_{a^j_k; a^j_{k-1}}\; {}^j_k K_{a^j_{k-1}; a^j_{k-1}}^{-1} \begin{pmatrix} z^j_{a^j_{k-1}} - m^j_k(a^j_{k-1}) \\ z'^j_{a^j_{k-1}} - \frac{dm^j_k}{dt}(a^j_{k-1}) \end{pmatrix}, \]
which depends linearly on $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{k-1}}, z'^j_{a_{k-1}}\right)$. Hence, by Lemma 10, $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_k}, z'^j_{a_k}\right)$ is jointly Gaussian.

Appendix H.0.2 Step 2: $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}, \dots, z^j_{t^k_i}, z'^j_{t^k_i}, \dots\right)$ is jointly Gaussian

Let $t^k_1, \dots, t^k_{n_k} \in ]a^j_{k-1}, a^j_k[$, $k \leq K_j$, be distinct string times. We want to prove that the vector $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}, \dots, z^j_{t^k_i}, z'^j_{t^k_i}, \dots\right)$, where all boundary times are represented and any finite number of string times is considered, is jointly Gaussian. Firstly, we have already proved that $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}\right)$ is jointly Gaussian. Secondly, we note from Theorem 2 that conditional on $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}\right)$, $\left(\dots, z^j_{t^k_i}, z'^j_{t^k_i}, \dots\right)$ is a Gaussian vector whose covariance matrix does not depend on $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}\right)$, and whose mean depends linearly on $z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}$. Hence, $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}, \dots, z^j_{t^k_i}, z'^j_{t^k_i}, \dots\right)$ is jointly Gaussian (by Lemma 10).

Appendix H.0.3 Step 3: $\left(z^j_{t_1}, \dots, z^j_{t_n}\right)$ is jointly Gaussian
$\left(z^j_{t_1}, z'^j_{t_1}, \dots, z^j_{t_n}, z'^j_{t_n}\right)$ is jointly Gaussian as it can be regarded as a marginal of some joint distribution of the form $\left(z^j_{a_0}, z'^j_{a_0}, \dots, z^j_{a_{K_j}}, z'^j_{a_{K_j}}, \dots, z^j_{t^k_i}, z'^j_{t^k_i}, \dots\right)$. Hence, its marginal $\left(z^j_{t_1}, \dots, z^j_{t_n}\right)$ is also jointly Gaussian, which concludes our proof.

Appendix I. Derivation of Global String GP Mean and Covariance Functions

We begin with derivative string GPs indexed on $\mathbb{R}$. Extensions to membrane GPs are easily achieved for a broad range of link functions. In our exposition, we focus on the class of elementary symmetric polynomials (Macdonald (1995)). In addition to containing the link function $\phi_s$ previously introduced, this family of polynomials yields global covariance structures that have many similarities with existing kernel approaches, which we discuss in Section 4.3. For $n \leq d$, the $n$-th order elementary symmetric polynomial is given by $e_0(x_1, \dots, x_d) := 1$ and, for $1 \leq n \leq d$, $e_n(x_1, \dots, x_d) = \sum_{1 \leq j_1 < \dots < j_n \leq d} x_{j_1} \cdots x_{j_n}$.
Appendix I.0.1 Global String GP Mean Functions

Boundary times: For $k > 1$, we recall that conditional on $\left(z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}}\right)$, $\left(z^j_{a^j_k}, z'^j_{a^j_k}\right)$ is Gaussian with mean
\[ \begin{pmatrix} m^j_k(a^j_k) \\ \frac{dm^j_k}{dt}(a^j_k) \end{pmatrix} + {}^j_k K_{a^j_k; a^j_{k-1}}\; {}^j_k K_{a^j_{k-1}; a^j_{k-1}}^{-1} \begin{pmatrix} z^j_{a^j_{k-1}} - m^j_k(a^j_{k-1}) \\ z'^j_{a^j_{k-1}} - \frac{dm^j_k}{dt}(a^j_{k-1}) \end{pmatrix}, \]
with
\[ {}^j_k K_{u;v} = \begin{pmatrix} k^j_k(u, v) & \frac{\partial k^j_k}{\partial y}(u, v) \\ \frac{\partial k^j_k}{\partial x}(u, v) & \frac{\partial^2 k^j_k}{\partial x \partial y}(u, v) \end{pmatrix}. \]
It then follows from the law of total expectation that for all $k > 1$,
\[ \begin{pmatrix} m^j(a^j_k) \\ m^{j\prime}(a^j_k) \end{pmatrix} = \begin{pmatrix} m^j_k(a^j_k) \\ \frac{dm^j_k}{dt}(a^j_k) \end{pmatrix} + {}^j_k K_{a^j_k; a^j_{k-1}}\; {}^j_k K_{a^j_{k-1}; a^j_{k-1}}^{-1} \begin{pmatrix} m^j(a^j_{k-1}) - m^j_k(a^j_{k-1}) \\ m^{j\prime}(a^j_{k-1}) - \frac{dm^j_k}{dt}(a^j_{k-1}) \end{pmatrix}. \]
String times: As for non-boundary times $t \in ]a^j_{k-1}, a^j_k[$, conditional on $\left(z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}}\right)$ and $\left(z^j_{a^j_k}, z'^j_{a^j_k}\right)$, $\left(z^j_t, z'^j_t\right)$ is Gaussian with mean
\[ \begin{pmatrix} m^j_k(t) \\ \frac{dm^j_k}{dt}(t) \end{pmatrix} + {}^j_k K_{t;(a^j_{k-1},a^j_k)}\; {}^j_k K_{(a^j_{k-1},a^j_k);(a^j_{k-1},a^j_k)}^{-1} \begin{pmatrix} z^j_{a^j_{k-1}} - m^j_k(a^j_{k-1}) \\ z'^j_{a^j_{k-1}} - \frac{dm^j_k}{dt}(a^j_{k-1}) \\ z^j_{a^j_k} - m^j_k(a^j_k) \\ z'^j_{a^j_k} - \frac{dm^j_k}{dt}(a^j_k) \end{pmatrix}, \]
where
\[ {}^j_k K_{(a^j_{k-1},a^j_k);(a^j_{k-1},a^j_k)} = \begin{pmatrix} {}^j_k K_{a^j_{k-1}; a^j_{k-1}} & {}^j_k K_{a^j_{k-1}; a^j_k} \\ {}^j_k K_{a^j_k; a^j_{k-1}} & {}^j_k K_{a^j_k; a^j_k} \end{pmatrix} \quad \text{and} \quad {}^j_k K_{t;(a^j_{k-1},a^j_k)} = \begin{pmatrix} {}^j_k K_{t; a^j_{k-1}} & {}^j_k K_{t; a^j_k} \end{pmatrix}. \]
Hence, using once again the law of total expectation, it follows that for any $t \in ]a^j_{k-1}, a^j_k[$,
\[ \begin{pmatrix} m^j(t) \\ m^{j\prime}(t) \end{pmatrix} = \begin{pmatrix} m^j_k(t) \\ \frac{dm^j_k}{dt}(t) \end{pmatrix} + {}^j_k K_{t;(a^j_{k-1},a^j_k)}\; {}^j_k K_{(a^j_{k-1},a^j_k);(a^j_{k-1},a^j_k)}^{-1} \begin{pmatrix} m^j(a^j_{k-1}) - m^j_k(a^j_{k-1}) \\ m^{j\prime}(a^j_{k-1}) - \frac{dm^j_k}{dt}(a^j_{k-1}) \\ m^j(a^j_k) - m^j_k(a^j_k) \\ m^{j\prime}(a^j_k) - \frac{dm^j_k}{dt}(a^j_k) \end{pmatrix}. \]
We note in particular that when $\forall j, k$, $m^j_k = 0$, it follows that $m^j = 0$, $m_f = 0$, and $m_{\nabla f} = 0$.

Appendix I.0.2 Global String GP Covariance Functions

As for the evaluation of ${}^j K_{u,v}$, we start by noting that the covariance function of a univariate string GP is the same as that of another string GP whose strings have the same unconditional kernels but unconditional mean functions $m^j_k = 0$, so that to evaluate univariate string GP kernels we may assume that $\forall j, k$, $m^j_k = 0$ without loss of generality. We start with the case where $u$ and $v$ are both boundary times, after which we will generalise to other times.
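As a consistency check on the boundary-time mean recursion above: if every string shares the same unconditional mean $m$ and kernel, the correction term vanishes by induction, so the global mean reproduces $(m, m')$ at every boundary time. The sketch below assumes a squared-exponential kernel and $m = \sin$; both choices are illustrative assumptions, not from the paper.

```python
import numpy as np

ell = 1.0
def K(u, v):
    # derivative-GP covariance block [[k, dk/dv], [dk/du, d2k/dudv]] for an
    # assumed squared-exponential kernel k(u, v) = exp(-(u - v)^2 / (2 ell^2))
    d = u - v
    kv = np.exp(-d**2 / (2 * ell**2))
    return np.array([[kv,               d / ell**2 * kv],
                     [-d / ell**2 * kv, (1 / ell**2 - d**2 / ell**4) * kv]])

m, dm = np.sin, np.cos               # common unconditional string mean and its derivative
bounds = [0.0, 0.7, 1.5, 2.4]        # boundary times a_0 < ... < a_K
mu = np.array([m(bounds[0]), dm(bounds[0])])   # global mean vector at a_0
for a_prev, a_k in zip(bounds, bounds[1:]):
    # boundary-time mean recursion: unconditional mean plus a linear correction
    M_hat = K(a_k, a_prev) @ np.linalg.inv(K(a_prev, a_prev))
    mu = np.array([m(a_k), dm(a_k)]) + M_hat @ (mu - np.array([m(a_prev), dm(a_prev)]))
```

With identical strings, `mu` ends up equal to $(\sin(a_K), \cos(a_K))$, as expected.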
Boundary times: As previously discussed, the restriction $\left(z^j_t, z'^j_t\right)_{t\in[a^j_0, a^j_1]}$ is the derivative Gaussian process with mean $0$ and covariance function $k^j_1$. Thus,
\[ {}^j K_{a^j_0; a^j_0} = {}^j_1 K_{a^j_0; a^j_0}, \quad {}^j K_{a^j_1; a^j_1} = {}^j_1 K_{a^j_1; a^j_1}, \quad {}^j K_{a^j_0; a^j_1} = {}^j_1 K_{a^j_0; a^j_1}. \tag{95} \]
We recall that conditional on the boundary conditions at or prior to $a^j_{k-1}$, $\left(z^j_{a^j_k}, z'^j_{a^j_k}\right)$ is Gaussian with mean
\[ \hat{M}_k \begin{pmatrix} z^j_{a^j_{k-1}} \\ z'^j_{a^j_{k-1}} \end{pmatrix}, \quad \text{with} \quad \hat{M}_k = {}^j_k K_{a^j_k; a^j_{k-1}}\; {}^j_k K_{a^j_{k-1}; a^j_{k-1}}^{-1}, \]
and covariance matrix
\[ \hat{\Sigma}_k = {}^j_k K_{a^j_k; a^j_k} - \hat{M}_k\; {}^j_k K_{a^j_{k-1}; a^j_k}. \]
Hence, using Lemma 10 with $M = \begin{pmatrix} 0 & \dots & 0 & \hat{M}_k \end{pmatrix}$, where there are $(k-1)$ null $2 \times 2$ blocks, and noting that $\left(z^j_{a^j_0}, z'^j_{a^j_0}, \dots, z^j_{a^j_{k-1}}, z'^j_{a^j_{k-1}}\right)$ is jointly Gaussian, it follows that the vector $\left(z^j_{a^j_0}, z'^j_{a^j_0}, \dots, z^j_{a^j_k}, z'^j_{a^j_k}\right)$ is jointly Gaussian, that $\left(z^j_{a^j_k}, z'^j_{a^j_k}\right)$ has covariance matrix
\[ {}^j K_{a^j_k; a^j_k} = \hat{\Sigma}_k + \hat{M}_k\; {}^j K_{a^j_{k-1}; a^j_{k-1}}\, \hat{M}_k^T, \]
and that the covariance matrix between the boundary conditions at $a^j_k$ and at any earlier boundary time $a^j_l$, $l < k$, reads
\[ {}^j K_{a^j_k; a^j_l} = \hat{M}_k\; {}^j K_{a^j_{k-1}; a^j_l}. \]
String times: Let $u \in [a^j_{p-1}, a^j_p]$, $v \in [a^j_{q-1}, a^j_q]$. By the law of total expectation, we have that
\[ {}^j K_{u;v} := \mathbb{E}\left( \begin{pmatrix} z^j_u \\ z'^j_u \end{pmatrix} \begin{pmatrix} z^j_v & z'^j_v \end{pmatrix} \right) = \mathbb{E}\left( \mathbb{E}\left( \begin{pmatrix} z^j_u \\ z'^j_u \end{pmatrix} \begin{pmatrix} z^j_v & z'^j_v \end{pmatrix} \,\middle|\, B(p, q) \right) \right), \]
where $B(p, q)$ refers to the boundary conditions at the boundaries of the $p$-th and $q$-th strings, in other words $\left(z^j_x, z'^j_x\right)$, $x \in \left\{a^j_{p-1}, a^j_p, a^j_{q-1}, a^j_q\right\}$. Furthermore, using the definition of the covariance matrix under the conditional law, it follows that
\[ \mathbb{E}\left( \begin{pmatrix} z^j_u \\ z'^j_u \end{pmatrix} \begin{pmatrix} z^j_v & z'^j_v \end{pmatrix} \,\middle|\, B(p, q) \right) = {}^j_c K_{u;v} + \mathbb{E}\left( \begin{pmatrix} z^j_u \\ z'^j_u \end{pmatrix} \,\middle|\, B(p, q) \right) \mathbb{E}\left( \begin{pmatrix} z^j_v & z'^j_v \end{pmatrix} \,\middle|\, B(p, q) \right), \tag{96} \]
where ${}^j_c K_{u;v}$ refers to the covariance matrix between $(z^j_u, z'^j_u)$ and $(z^j_v, z'^j_v)$ conditional on the boundary conditions $B(p, q)$, and can be easily evaluated from Theorem 2.
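The boundary-time covariance recursion above can be checked against Proposition 6: when all strings share the same unconditional kernel, ${}^j K_{a^j_k; a^j_k}$ must coincide with the unconditional derivative-GP covariance block at $a^j_k$. A sketch with an assumed squared-exponential kernel (our illustrative choice):

```python
import numpy as np

ell = 0.8
def K(u, v):
    # derivative-GP covariance block for an assumed squared-exponential kernel
    d = u - v
    kv = np.exp(-d**2 / (2 * ell**2))
    return np.array([[kv,               d / ell**2 * kv],
                     [-d / ell**2 * kv, (1 / ell**2 - d**2 / ell**4) * kv]])

bounds = [0.0, 0.5, 1.3, 2.0]
K_glob = K(bounds[0], bounds[0])    # {}^jK_{a_0;a_0} = {}^j_1 K_{a_0;a_0}
for a_prev, a_k in zip(bounds, bounds[1:]):
    # recursion: {}^jK_{a_k;a_k} = Sigma_hat_k + M_hat_k {}^jK_{a_{k-1};a_{k-1}} M_hat_k^T
    M_hat = K(a_k, a_prev) @ np.linalg.inv(K(a_prev, a_prev))
    Sigma_hat = K(a_k, a_k) - M_hat @ K(a_prev, a_k)
    K_glob = Sigma_hat + M_hat @ K_glob @ M_hat.T
```

With identical strings the conditional and unconditional corrections cancel algebraically at every step, so `K_glob` equals `K(bounds[-1], bounds[-1])`.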
In particular, if $p \neq q$, ${}^j_c K_{u;v} = 0$, and if $p = q$,
\[ {}^j_c K_{u;v} = {}^j_p K_{u;v} - {}^j_p \Lambda_u \begin{pmatrix} {}^j_p K_{v; a^j_{p-1}}^T \\ {}^j_p K_{v; a^j_p}^T \end{pmatrix}, \tag{97} \]
where
\[ \forall x, l, \quad {}^j_l \Lambda_x = \begin{pmatrix} {}^j_l K_{x; a^j_{l-1}} & {}^j_l K_{x; a^j_l} \end{pmatrix} \begin{pmatrix} {}^j_l K_{a^j_{l-1}; a^j_{l-1}} & {}^j_l K_{a^j_{l-1}; a^j_l} \\ {}^j_l K_{a^j_l; a^j_{l-1}} & {}^j_l K_{a^j_l; a^j_l} \end{pmatrix}^{-1}. \]
We also note that
\[ \mathbb{E}\left( \begin{pmatrix} z^j_u \\ z'^j_u \end{pmatrix} \,\middle|\, B(p, q) \right) = {}^j_p \Lambda_u \begin{pmatrix} z^j_{a^j_{p-1}} \\ z'^j_{a^j_{p-1}} \\ z^j_{a^j_p} \\ z'^j_{a^j_p} \end{pmatrix} \quad \text{and} \quad \mathbb{E}\left( \begin{pmatrix} z^j_v & z'^j_v \end{pmatrix} \,\middle|\, B(p, q) \right) = \begin{pmatrix} z^j_{a^j_{q-1}} & z'^j_{a^j_{q-1}} & z^j_{a^j_q} & z'^j_{a^j_q} \end{pmatrix} {}^j_q \Lambda_v^T. \]
Hence, taking the expectation with respect to the boundary conditions on both sides of Equation (96), we obtain: $\forall u \in [a^j_{p-1}, a^j_p]$, $v \in [a^j_{q-1}, a^j_q]$,
\[ {}^j K_{u;v} = {}^j_c K_{u;v} + {}^j_p \Lambda_u \begin{pmatrix} {}^j K_{a^j_{p-1}; a^j_{q-1}} & {}^j K_{a^j_{p-1}; a^j_q} \\ {}^j K_{a^j_p; a^j_{q-1}} & {}^j K_{a^j_p; a^j_q} \end{pmatrix} {}^j_q \Lambda_v^T, \]
where ${}^j_c K_{u;v}$ is provided in Equation (97).

References

R. P. Adams and O. Stegle. Gaussian process product models for nonparametric nonstationarity. In International Conference on Machine Learning (ICML), pages 1–8, 2008.
R. J. Adler and J. E. Taylor. Topological Complexity of Smooth Random Functions: École d'Été de Probabilités de Saint-Flour XXXIX-2009. Lecture Notes in Mathematics / École d'Été de Probabilités de Saint-Flour. Springer, 2011.
P. Alquier, N. Friel, R. Everitt, and A. Boland. Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Statistics and Computing, 26(1-2):29–47, 2016.
R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning (ICML), pages 405–413, 2014.
S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014.
Y. Cao and D. J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. arXiv preprint arXiv:1410.7827, 2014.
D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes.
Springer-Verlag, 2008.
M. Deisenroth and J. W. Ng. Distributed Gaussian processes. In International Conference on Machine Learning (ICML), pages 1481–1490, 2015.
J. L. Doob. The elementary Gaussian processes. The Annals of Mathematical Statistics, 15(3):229–282, 1944.
N. Durrande, D. Ginsbourger, and O. Roustant. Additive covariance kernels for high-dimensional Gaussian process modeling. Annales de la Faculté des Sciences de Toulouse, 21(3), 2012.
D. Duvenaud, H. Nickisch, and C. E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), pages 226–234, 2011.
T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.
N. J. Foti and S. Williamson. A survey of non-exchangeable priors for Bayesian nonparametric models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):359–371, 2015.
R. B. Gramacy and H. K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483), 2008.
P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
P. J. Green and D. I. Hastie. Reversible jump MCMC. Genetics, 155(3):1391–1403, 2009.
J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), pages 282–290, 2013.
J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 351–360, 2015.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
I. Karatzas and R. Fernholz. Stochastic portfolio theory: an overview. Handbook of Numerical Analysis, 15:89–167, 2009.
H. Kim, B. K. Mallick, and C. C. Holmes.
Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association, 100(470):653–668, 2005.
J. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
Y.-L. Kom Samo and A. Vervuurt. Stochastic portfolio theory: a machine learning perspective. In Uncertainty in Artificial Intelligence (UAI), pages 657–665, 2016.
N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems (NIPS), pages 625–632, 2003.
M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1866–1881, 2010.
Q. Le, T. Sarlós, and A. Smola. Fastfood: approximating kernel expansions in loglinear time. In International Conference on Machine Learning (ICML), 2013.
I. G. Macdonald. Symmetric Functions and Hall Polynomials. Oxford University Press, 1995.
D. J. C. MacKay. Introduction to Gaussian processes. In NATO ASI Series F: Computer and Systems Sciences, pages 133–166. Springer, Berlin, 1998.
E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Advances in Neural Information Processing Systems (NIPS), 2006.
I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 9–16, 2010.
R. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer, 1996.
T. Nguyen and E. Bonilla. Fast allocation of Gaussian process experts. In International Conference on Machine Learning (ICML), pages 145–153, 2014.
B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Hochschultext / Universitext. Springer, 2003.
C. Paciorek and M. Schervish. Nonstationary covariance functions for Gaussian process regression.
In Advances in Neural Information Processing Systems (NIPS), pages 273–280, 2004.
J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pages 855–900, 1997.
C. Plagemann, K. Kersting, and W. Burgard. Nonstationary Gaussian process regression using point estimates of local smoothness. In European Conference on Machine Learning (ECML), pages 204–219, 2008.
J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2007.
C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems (NIPS), pages 881–888, 2001.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
J. Ross and J. Dy. Nonparametric mixture of Gaussian processes with constraints. In International Conference on Machine Learning (ICML), pages 1346–1354, 2013.
Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
A. M. Schmidt and A. O'Hagan. Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):743–758, 2003.
M. Seeger. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. Technical report, 2003a.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2003b.
A. Shah, A. G. Wilson, and Z. Ghahramani. Student-t processes as alternatives to Gaussian processes.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 877–885, 2014.
B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47(1):1–52, 1985.
A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems (NIPS), pages 619–625. MIT Press, 2001.
E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), pages 1257–1264, 2006.
The GPy authors. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, 2012–2016.
V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
V. Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), pages 654–660, 2001.
A. Vervuurt and I. Karatzas. Diversity-weighted portfolios with negative parameter. Annals of Finance, 11(3):411–432, 2015.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2001.
A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning (ICML), pages 1067–1075, 2013.
A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes. In International Conference on Machine Learning (ICML), pages 1775–1784, 2015.
A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems (NIPS), pages 3626–3634, 2014.
Z. Yang, A. Smola, L. Song, and A. G. Wilson. À la carte: learning fast kernels.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1098–1106, 2015.