# Vector-Valued Control Variates

Zhuo Sun 1, Alessandro Barp 2 3, François-Xavier Briol 1 3

1 University College London, London, UK. 2 University of Cambridge, Cambridge, UK. 3 The Alan Turing Institute, London, UK. Correspondence to: François-Xavier Briol. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Control variates are variance reduction tools for Monte Carlo estimators. They can provide significant variance reduction, but usually require a large number of samples, which can be prohibitive when sampling or evaluating the integrand is computationally expensive. Furthermore, there are many scenarios where we need to compute multiple related integrals simultaneously or sequentially, which can further exacerbate computational costs. In this paper, we propose vector-valued control variates, an extension of control variates which can be used to reduce the variance of multiple Monte Carlo estimators jointly. This allows for the transfer of information across integration tasks, and hence reduces the need for a large number of samples. We focus on control variates based on kernel interpolants, and our novel construction is obtained through a generalised Stein identity and the development of novel matrix-valued Stein reproducing kernels. We demonstrate our methodology on a range of problems including multifidelity modelling, Bayesian inference for dynamical systems, and model evidence computation through thermodynamic integration.

1. Introduction

A significant computational challenge in statistics and machine learning is the approximation of intractable integrals. Examples include the computation of posterior moments, the model evidence (or marginal likelihood), Bayes factors, or integrating out latent variables. This challenge has led to the development of a wide range of Monte Carlo (MC) methods; see (Green et al., 2015) for a review.

Let $f : \mathbb{R}^d \to \mathbb{R}$ denote some integrand of interest, and $\Pi$ some distribution with Lebesgue density $\pi$ known up to an intractable normalisation constant. The integration task we consider can be expressed as estimating $\int_{\mathbb{R}^d} f(x)\pi(x)\,\mathrm{d}x$ using evaluations of the integrand at some points in the domain: $\{x_i, f(x_i)\}_{i=1}^n$. These evaluations are usually combined to create an estimate of $\Pi[f]$ of the form $\hat{\Pi}[f] = \frac{1}{n}\sum_{i=1}^n f(x_i)$. For example, when realisations are independent and identically distributed (IID), this corresponds to a MC estimator. In that case, assuming that $f$ is square-integrable with respect to $\Pi$ (i.e. $\Pi[f^2] < \infty$), we can use the central limit theorem (CLT) to show that such estimators converge to $\Pi[f]$ as $n \to \infty$, and this convergence is controlled by the asymptotic variance of the integrand $f$. Analogous results can also be obtained for Markov chain Monte Carlo (MCMC) realisations (Jones, 2004), in which case $\{x_i\}_{i=1}^n$ are realisations from a Markov chain with invariant distribution $\Pi$, or for randomised quasi-Monte Carlo (Hickernell et al., 2005), in which case $\{x_i\}_{i=1}^n$ form some lattice or sequence filling some hypercube domain.

The main insight behind the concept of control variate (CV) is that it is often possible to instead use an estimator of $\Pi[f - g]$ for some $g : \mathbb{R}^d \to \mathbb{R}$. This is justified if $\Pi[g]$ is known in closed form, in which case we may use $\hat{\Pi}_{\mathrm{CV}}[f] := \hat{\Pi}[f - g] + \Pi[g]$.
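As an illustration of this identity (a minimal sketch with our own toy example, not taken from the paper), consider $\Pi = N(0,1)$, the integrand $f(x) = x + 0.1x^2$, and the simple control variate $g(x) = x$, whose mean under $\Pi$ is known to be zero:

```python
import numpy as np

# Minimal sketch (toy example) of the control variate identity
# Pi_CV[f] = Pi_hat[f - g] + Pi[g], with Pi = N(0, 1),
# f(x) = x + 0.1 * x**2 and control variate g(x) = x, for which Pi[g] = 0.
rng = np.random.default_rng(0)

def f(x):
    return x + 0.1 * x**2

def g(x):
    return x  # Pi[g] = 0 under the standard normal

n, n_rep = 100, 2000
mc, cv = np.empty(n_rep), np.empty(n_rep)
for r in range(n_rep):
    x = rng.standard_normal(n)
    mc[r] = f(x).mean()                 # plain Monte Carlo estimator of Pi[f] = 0.1
    cv[r] = (f(x) - g(x)).mean() + 0.0  # CV estimator: Pi_hat[f - g] + Pi[g]

print(f"MC variance: {mc.var():.5f}, CV variance: {cv.var():.5f}")
```

Because $g$ absorbs most of the variability of $f$ in this example, the CV estimator has a much smaller variance than the plain MC estimator for the same number of samples.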
Furthermore, if $g$ is chosen appropriately, the variance of the CLT for this new estimator will be much smaller than that of the original, and a smaller number of samples will be required to approximate $\Pi[f]$ at a given level of accuracy.

Suppose now that $n = (n_1, \ldots, n_T) \in \mathbb{N}^T$ ($T \in \mathbb{N}_+$) is a multi-index. In this paper, we will focus on cases where we have not just one integral, but a sequence of integrands $f_t : \mathbb{R}^d \to \mathbb{R}$ and distributions $\Pi_t$ for which we would like to use $\{\{x_{tj}, f_t(x_{tj})\}_{j=1}^{n_t}\}_{t=1}^T$ to estimate

$\Pi_t[f_t] := \int_{\mathbb{R}^d} f_t(x)\pi_t(x)\,\mathrm{d}x$ for $t \in [T]$,   (1)

where $[T] = \{1, \ldots, T\}$. This is a common situation in practice; for example, our paper considers the case of multifidelity modelling (Peherstorfer et al., 2018), where $f_1, \ldots, f_T$ may be models of a computationally expensive physical system at different levels of fidelity, and we might be interested in expectations of that model with respect to unknown parameters. Another example we will also study is when $\pi_1, \ldots, \pi_T$ are closely related posterior distributions, such as in the case of power posteriors (Friel & Pettitt, 2008).

Of course, the estimation of the integrals in (1) could be tackled individually, and this is in fact the most common approach. However, the main insight of this paper is that if both the integrands and the distributions are related across tasks, we can improve on this by sharing computation across these tasks. We propose to construct a CV to jointly reduce the variance of estimators for these integrals and hence obtain a more accurate approximation. We will call such a function a vector-valued control variate (vv-CV). In order to encode the relationship between integration tasks, we will propose a flexible class of CVs based on interpolation in reproducing kernel Hilbert spaces of vector-valued functions (vv-RKHSs). More precisely, we generalise existing constructions of Stein reproducing kernels to derive novel vv-RKHSs with the property that each output has mean zero.

We note that very few methods exist to tackle multiple integrals jointly. One exception is (Xi et al., 2018), which also proposes an algorithm based on vv-RKHSs. However, that work is limited to cases where the integral of the kernel is known in closed form, which is rarely possible in practice. In contrast, our vv-CVs are applicable so long as $\pi_t$ is known up to an unknown constant and $\nabla_x \log \pi_t$ can be evaluated pointwise for all $t \in [T]$ (where $\nabla_x = (\partial/\partial x_1, \ldots, \partial/\partial x_d)^\top$). This will usually be satisfied in Bayesian statistics, and is a requirement for the implementation of most gradient-based MCMC algorithms.

The remainder of the paper is as follows. In Section 2, we review existing CVs based on Stein's method. In Section 3, we introduce vv-CVs, then in Section 4 we show how to find an optimal vv-CV. Finally, in Section 5, we demonstrate the advantage of our approach on problems in multifidelity modelling, Bayesian inference for differential equations, and model evidence computation through thermodynamic integration.

Notation. Vectors $x \in \mathbb{R}^d$ are column vectors, $\|x\|_q = (\sum_{i=1}^d x_i^q)^{1/q}$ for $q \in \mathbb{N}$, and $\mathbf{1}_d = (1, \ldots, 1)^\top \in \mathbb{R}^d$. For a multi-index $m \in \mathbb{N}^d$, we write $|m| = \sum_{i=1}^d m_i$ for its total degree. For a matrix $M \in \mathbb{R}^{p \times q}$, $M_{ij}$ denotes the entry in row $i$ and column $j$, $\|M\|_F^2 = \sum_{i=1}^p \sum_{j=1}^q M_{ij}^2$ is the squared Frobenius norm, $\mathrm{Tr}(M) = \sum_i M_{ii}$ is the trace, and $M^\dagger$ is the pseudo-inverse. $I_m$ denotes the $m$-dimensional identity matrix, and $S_+^m$ the set of symmetric strictly positive definite matrices in $\mathbb{R}^{m \times m}$.
We denote by $C^j$ the set of functions whose mixed partial derivatives of order at most $j$ are continuous, and given a differentiable function $g$ on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, $\partial_x^r g(x, y)$ denotes its partial derivative in the $r$th coordinate of its first argument, evaluated at $(x, y)$.

2. Background

We now briefly review existing constructions of Stein-based CVs for a single integration problem. The first step consists of constructing a set of functions $\mathcal{G}$ which all integrate to a known value against $\Pi$. This can generally be challenging since $\pi$ may not be computationally tractable, e.g. it may involve an unknown normalisation constant. Without loss of generality, we will discuss the construction of functions which integrate to zero, but note that we can obtain functions with mean equal to any constant $\beta \in \mathbb{R}$ by simply adding this constant $\beta$ to a zero-mean function.

Zero-Mean Functions through Stein Operators. One way of constructing zero-mean functions is to use Stein's method (Anastasiou et al., 2023). The main ingredients of Stein's method are a function class and an operator acting on this class. More precisely, a Stein class of $\Pi$ is a class of functions $\mathcal{U}$ associated to an operator $S$, called a Stein operator, such that a Stein identity holds: $\Pi[S[u]] = 0$ for all $u \in \mathcal{U}$. An obvious choice for the class of zero-mean functions $\mathcal{G}$ is to consider all functions of the form $g = S[u]$ for $u \in \mathcal{U}$. To ensure such $g$ has finite variance, we assume that all functions in $\mathcal{G}$ are square-integrable with respect to $\Pi$. This can be guaranteed under weak regularity conditions on $\mathcal{U}$ and $S$; see Theorem 3.2. Note also that $S$ depends implicitly on $\Pi$, but we only make this explicit in our notation (i.e. $S_\Pi$) when it is helpful for clarity.

The most common choice of Stein operator is the Langevin Stein operator, which acts on differentiable vector-valued functions (vv-functions) $u : \mathbb{R}^d \to \mathbb{R}^d$:

$\mathcal{L}[u](x) := \nabla_x \cdot u(x) + u(x) \cdot \nabla_x \log \pi(x)$.   (2)

The advantage of $\mathcal{L}$ is that it only requires knowledge of $\Pi$ through evaluations of $\nabla_x \log \pi$, which does not require the normalisation constant of $\pi$. Indeed, let $\tilde{\pi} = \pi / C$ for some unknown $C \in \mathbb{R}$; then $\nabla_x \log \tilde{\pi} = \nabla_x \log \pi$. For more general Stein operators, see (Anastasiou et al., 2023).

Parametric Spaces. $\mathcal{G}$ is usually chosen to be a parametric space and we will hence write it $\mathcal{G}_\Theta$, where $\Theta$ denotes the space of parameter values. Most existing CVs can be obtained by taking $g_\theta = S[u_\theta]$ for some $u_\theta \in \mathcal{U}_\Theta$, where $\mathcal{U}_\Theta$ is another parametric class. However, note that there might not be a unique $u_\theta$ leading to $g_\theta$. For the remainder of the paper, $\theta$ will hence be a parameter indexing $g_\theta$ directly, as opposed to an element of the Stein class. Examples of parametric CVs are the polynomial-based CVs of (Mira et al., 2013) (see also (Assaraf & Caffarel, 1999; Papamarkou et al., 2014; Oates et al., 2016)), in which case the Stein class is parametrised directly by the coefficients of a polynomial. Using a space of neural networks has also been studied in (Wan et al., 2019; Si et al., 2021). This latter choice can be advantageous due to the flexibility of this function class, but is much more challenging to implement because selecting a CV becomes a non-convex problem. Another example are kernel interpolants, which will be the main focus of our paper. This class is nonparametric, but it is often convenient to fix the dataset size and parametrise it.
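Before turning to kernel interpolants, the Stein identity for the Langevin operator in (2) can be checked numerically in a few lines. The sketch below (our own toy example, with $d = 1$ and $\Pi = N(0,1)$ so that the score is $-x$) verifies that $\mathcal{L}[u]$ integrates to approximately zero under $\Pi$ for a simple choice of $u$:

```python
import numpy as np

# Toy check of the Langevin Stein identity in d = 1:
# L[u](x) = u'(x) + u(x) * d/dx log pi(x).  For pi = N(0, 1) the score is -x,
# and the image L[u] integrates to zero under pi for any sufficiently nice u.
rng = np.random.default_rng(1)

def score(x):        # d/dx log pi(x) for the standard normal
    return -x

def u(x):
    return np.sin(x)

def u_prime(x):
    return np.cos(x)

def stein_op(x):     # L[u](x) = u'(x) + u(x) * score(x)
    return u_prime(x) + u(x) * score(x)

x = rng.standard_normal(10**6)
print(f"MC estimate of Pi[L[u]] = {stein_op(x).mean():.4f}  (should be close to 0)")
```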
Let $\mathcal{H}_k$ denote a reproducing kernel Hilbert space (RKHS) with kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (Berlinet & Thomas-Agnan, 2011), so that $k$ is symmetric ($k(x, y) = k(y, x)$ for all $x, y \in \mathbb{R}^d$) and positive semi-definite (for all $m \in \mathbb{N}_+$, $\sum_{i,j=1}^m c_i c_j k(x_i, x_j) \geq 0$ for all $c_1, \ldots, c_m \in \mathbb{R}$ and $x_1, \ldots, x_m \in \mathbb{R}^d$). The kernel could be a squared-exponential kernel $k(x, y) = \exp(-\|x - y\|_2^2 / 2l^2)$ with lengthscale $l > 0$, or a polynomial kernel $k(x, y) = (x^\top y + c)^l$, where $c \in \mathbb{R}$ and $l \in \mathbb{N}$ is the degree of the polynomial.

Oates et al. (2017) noticed that the image of $\mathcal{U} = \mathcal{H}_k^d := \mathcal{H}_k \times \ldots \times \mathcal{H}_k$ under $\mathcal{L}$ is an RKHS with kernel

$k_0(x, y) := \nabla_x \cdot \nabla_y k(x, y) + \nabla_x \log \pi(x) \cdot \nabla_y k(x, y) + \nabla_y \log \pi(y) \cdot \nabla_x k(x, y) + (\nabla_x \log \pi(x) \cdot \nabla_y \log \pi(y))\, k(x, y)$;   (3)

see also Theorem 2.6 of Barp et al. (2022b) for a more general result. Given $m$ observations, it is known that the optimal interpolant in $\mathcal{H}_{k_0}$ is of the form $g_\theta(x) = \sum_{i=1}^m \theta_i k_0(x, x_i)$ where $\theta_i \in \mathbb{R}$ for all $i \in [m]$. This therefore provides a natural parametrisation for practical implementation. Oates et al. (2017) called this class of CVs control functionals (CFs); see also (Briol et al., 2017; Oates et al., 2019; South et al., 2022a) for more details.
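To make (3) concrete, the sketch below (our own illustration, not code from the paper's repository) evaluates $k_0$ for a squared-exponential base kernel, whose required derivatives are available in closed form; the score function $\nabla_x \log \pi$ of the target is passed in as evaluations:

```python
import numpy as np

# Sketch of the Langevin Stein kernel k_0 from (3) for a squared-exponential
# base kernel k(x, y) = exp(-||x - y||^2 / (2 l^2)) and a target with score
# function s(x) = grad_x log pi(x), which is supplied as evaluations.
def stein_kernel_se(x, y, score_x, score_y, lengthscale=1.0):
    d = x.shape[0]
    diff = x - y
    sq = diff @ diff
    l2 = lengthscale**2
    k = np.exp(-sq / (2 * l2))
    grad_x_k = -diff / l2 * k             # grad_x k(x, y)
    grad_y_k = diff / l2 * k              # grad_y k(x, y)
    div_xy_k = (d / l2 - sq / l2**2) * k  # grad_x . grad_y k(x, y)
    return (div_xy_k + score_x @ grad_y_k + score_y @ grad_x_k
            + (score_x @ score_y) * k)

# Example with pi = N(0, I_d), whose score is s(x) = -x.
x, y = np.array([0.3, -0.5]), np.array([1.0, 0.2])
print(stein_kernel_se(x, y, -x, -y))
```

By construction, any function of the form $\sum_i \theta_i k_0(\cdot, x_i)$ integrates to zero under $\Pi$, which is exactly what makes this parametrisation useful as a CV.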
Selecting a CV. In order to select a CV, we pick the best element from $\mathcal{G}_\Theta$, where "best" refers to minimising the MC variance:

$J(\theta) = \mathrm{Var}_\Pi[f - g_\theta] := \Pi[(f - g_\theta - \Pi[f])^2]$.   (4)

Following the framework of empirical risk minimisation, this can be approximated with $\{x_j, f(x_j)\}_{j=1}^m$ as $J_m(\theta, \beta) = \frac{1}{m}\sum_{j=1}^m (f(x_j) - g_\theta(x_j) - \beta)^2 + \lambda \|g_\theta\|^2$, where $\beta \in \mathbb{R}$ is an additional parameter which tends to $\Pi[f]$ as $m \to \infty$ and $\lambda \to 0$. Here, $\lambda \geq 0$ is a regularisation parameter and $\|g_\theta\|$ can be any suitable norm; for example, $\|g_\theta\| = \|\theta\|_2$ or $\|g_\theta\| = \|g_\theta\|_{\mathcal{H}_k}$ for some kernel $k$. Assuming that $\Theta \subseteq \mathbb{R}^p$, this objective can be minimised by the solution of a linear system when $\theta \mapsto g_\theta$ is linear and $\theta \mapsto \|g_\theta\|^2$ is quadratic. In more general cases, it can be minimised using stochastic optimisation (Si et al., 2021). In that case, we initialise $\theta^{(0)}$ and $\beta^{(0)}$, then iteratively take gradient steps with mini-batches of size $\tilde{m} \leq m$.

There are two particular perspectives which motivate the objective in (4). Firstly, it can be interpreted as a least-squares objective for the function $f - \Pi[f]$ (Leluc et al., 2021). Secondly, since for any square-integrable function $h$ and MC estimator with $n$ samples we have $\mathrm{Var}_\Pi[\hat{\Pi}_{\mathrm{MC}}[h]] = \mathrm{Var}_\Pi[h]/n$, the quantity $J(\theta)$ fully determines the CLT variance of a MC estimator of $\Pi[f - g_\theta]$, which makes it particularly well suited to these estimators. This latter viewpoint has also motivated alternative objectives based on the variance of the randomised quasi-Monte Carlo or MCMC CLT; see e.g. Hickernell et al. (2005); Oates & Girolami (2016); Dellaportas & Kontoyiannis (2012); Mijatovic & Vogrinc (2018). The MCMC case is briefly discussed in Appendix A.1, but the remainder of the paper will focus on the objective in (4) for simplicity.

The CV Estimator. Recall that a CV estimator takes the form $\hat{\Pi}_{\mathrm{CV}}[f] := \hat{\Pi}[f - g] + \Pi[g]$ for some CV $g$. Once a function $g_{\hat{\theta}}$ has been selected through the procedure in the previous subsection, the only part missing is therefore an estimator of $\Pi[f - g_{\hat{\theta}}]$. We present two possible approaches below.

A first option is to create an estimator based on the remainder of the data $\{x_i, f(x_i)\}_{i=m+1}^{m+n}$: $\hat{\Pi}[f - g_{\hat{\theta}}] = \frac{1}{n}\sum_{i=m+1}^{m+n} (f(x_i) - g_{\hat{\theta}}(x_i))$. This estimator is unbiased whenever the underlying estimator $\hat{\Pi}$ is unbiased; this is the case when using a MC estimator, but not when using a MCMC estimator. The question of how to select $m$ is of practical importance for the quality of the estimator. When $m$ is small, most of the function evaluations are used for the MC estimator, whereas when $m$ is large, most of the evaluations are used to construct the CV. Due to the difficulty of choosing $m$, a second option is to use all of the data for selecting $g_{\hat{\theta}}$ (i.e. $n = 0$). More precisely, denoting by $(\hat{\theta}, \hat{\beta})$ the minimiser of $J_m(\theta, \beta)$, we could use $\hat{\Pi}[f - g_{\hat{\theta}}] = \hat{\beta}$. This estimator will be biased, but will also be more accurate than the first estimator if the least-squares problem can be solved at a faster rate than the MC CLT.

3. Methodology

We will now extend existing CVs to the case where multiple related integration problems are tackled jointly. This will be done through a multi-task learning approach (Micchelli & Pontil, 2005; Evgeniou et al., 2005), where each integral corresponds to a task and the relationship between tasks is modelled explicitly. This allows us to share information across integration tasks, and hence improve accuracy when the number of integrand evaluations is limited.

3.1. Vector-Valued Functions using Stein's Method

For the remainder of this paper, we consider integrands corresponding to the outputs of a vector-valued function $f : \mathbb{R}^d \to \mathbb{R}^T$, so that $f(x) = (f_1(x), \ldots, f_T(x))^\top$. Our objective is to approximate $\Pi[f] = (\Pi_1[f_1], \ldots, \Pi_T[f_T])^\top$, and we shall assume that $f_t$ is square-integrable with respect to $\Pi_t$ for all $t \in [T]$. Formally, $\Pi$ is known as a vector probability distribution. To approximate $\Pi[f]$, we will construct a class of zero-mean vv-functions through Stein's method. This can be done by considering a Stein class $\mathcal{U}$ whose image $\mathcal{G}$ under the Stein operator $S_{\mathrm{vv}} : \mathcal{U} \to \mathcal{G}$ is a class of $\mathbb{R}^T$-valued functions on $\mathbb{R}^d$. We will need a generalised form of the Stein identity for vv-functions:

$\Pi_t[g_t] = \Pi_t[(S_{\mathrm{vv}}[u])_t] = 0$, for all $u \in \mathcal{U}$ and $t \in [T]$.   (5)

In other words, each output of the vv-function should integrate to zero against the corresponding probability distribution. Of course, the ordering of the sequence of integrands and distributions matters here, as we do not guarantee that $\Pi_t[g_{t'}] = 0$ for $t \neq t'$. The property above can be obtained by constructing an operator $S_{\mathrm{vv}}$ through a sequence of Stein operators $S_{\Pi_t}^{\mathrm{sv}}$ for $t \in [T]$ whose images are scalar-valued functions integrating to zero under $\Pi_t$. These can then be applied in an element-wise fashion as follows:

$g = S_{\mathrm{vv}}[u] = (S_{\Pi_1}^{\mathrm{sv}}[u_1], \ldots, S_{\Pi_T}^{\mathrm{sv}}[u_T])^\top$.   (6)

Once again, $\mathcal{G}$ can be parametrised and we will denote it $\mathcal{G}_\Theta$. We can then use an objective based on the variances of $f - g_\theta$ to select an optimal element:

$J_{\mathrm{vv}}(\theta) = \|\mathrm{Var}_\Pi[f - g_\theta]\| = \|\Pi[(f - g_\theta - \Pi[f])^2]\|$,   (7)

where $g_\theta \in \mathcal{G}_\Theta$. In the above, $\mathrm{Var}_\Pi$ should be thought of as applying $\mathrm{Var}_{\Pi_t}$, the variance under $\Pi_t$, to the $t$th element of the vv-function. The norm could be any norm on $\mathbb{R}^T$, but we will usually make use of the 1-norm so as to interpret this objective as the sum of variances over the individual integrands. For this objective to make sense, we require $(g_\theta)_t$ to be square-integrable with respect to $\Pi_t$ for all $t \in [T]$. Similarly to the $T = 1$ case, we are focusing on a least-squares objective, which directly controls the variance of MC estimators, but this could be adapted to other estimators as previously discussed.

Let $m = (m_1, \ldots, m_T) \in \mathbb{N}_+^T$. Once again, the objective can be approximated via MC estimates based on the dataset $D = \{\{x_{1j}, f_1(x_{1j})\}_{j=1}^{m_1}, \ldots, \{x_{Tj}, f_T(x_{Tj})\}_{j=1}^{m_T}\}$, following the framework of empirical risk minimisation.
For example, when the norm above is the 1-norm, the objective is simply the sum of the individual variances:

$L_m^{\mathrm{vv}}(\theta, \beta) := J_m^{\mathrm{vv}}(\theta, \beta) + \lambda \|g_\theta\|^2 = \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} \big(f_t(x_{tj}) - (g_\theta(x_{tj}))_t - \beta_t\big)^2 + \lambda \|g_\theta\|^2$,   (8)

where $\lambda \geq 0$ and now $\beta = (\beta_1, \ldots, \beta_T)^\top \in \mathbb{R}^T$. Once again, the second term is used to regularise $J_m^{\mathrm{vv}}$, but the norm now acts on vv-functions. For $\mathcal{U}$, we could take a class of polynomials, kernels or neural networks, but we will usually require flexible classes of vv-functions in order to encode relationships between tasks. Assuming that each $\Pi_t$ has a $C^1$ and strictly positive density $\pi_t$ with respect to the Lebesgue measure, we will also be able to use the Langevin Stein operator $\mathcal{L}$, which only requires access to the score functions of the distributions $\Pi_1, \ldots, \Pi_T$. For this purpose, we now define $l : \mathbb{R}^d \to \mathbb{R}^{T \times d}$ to be the matrix-valued function with entries $l_{ij}(x) = \partial_j \log \pi_i(x)$.

3.2. Kernel-based Vector-Valued CVs

The main choice of Stein class $\mathcal{U}$ studied in this paper is vv-RKHSs (Carmeli et al., 2006; 2010; Alvarez et al., 2012). This choice is particularly convenient as it allows us to build on the rich literature in statistical learning theory which considers kernels encoding relationships between tasks. A main contribution of this section is the design of novel Stein reproducing kernels specifically for numerical integration, a task not commonly tackled in statistical learning theory.

A vv-RKHS $\mathcal{H}_K$ is a Hilbert space of functions mapping from $\mathbb{R}^d$ to $\mathbb{R}^T$ with an associated matrix-valued reproducing kernel (mv-kernel) $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$ which is symmetric ($K(x, y) = K(y, x)^\top$ for all $x, y \in \mathbb{R}^d$) and positive semi-definite (for all $m \in \mathbb{N}_+$, $\sum_{i,j=1}^m c_i^\top K(x_i, x_j) c_j \geq 0$ for all $c_1, \ldots, c_m \in \mathbb{R}^T$ and $x_1, \ldots, x_m \in \mathbb{R}^d$). Any vv-RKHS satisfies a reproducing property, so that for $f \in \mathcal{H}_K$, $f(x)^\top c = \langle f, K(\cdot, x) c \rangle_{\mathcal{H}_K}$ and $K(\cdot, x) c \in \mathcal{H}_K$ for all $x \in \mathbb{R}^d$ and $c \in \mathbb{R}^T$. The reproducing kernels discussed in Section 2 are a special case which can be recovered when $T = 1$. We say that $K$ is $C^{r,r}(\mathbb{R}^d \times \mathbb{R}^d)$ provided that $\partial_x^\alpha \partial_y^\alpha K(x, y)$ is continuous for all multi-indices $\alpha = (\alpha_1, \ldots, \alpha_d)$ with $\alpha_1 + \cdots + \alpha_d \leq r$, and that $K$ is bounded with bounded derivatives if there exists $C \geq 0$ for which $\|\partial_x^\alpha \partial_y^\alpha K(x, y)\|_F \leq C$ for all $x, y \in \mathbb{R}^d$ and multi-indices $\alpha$ with $\alpha_1 + \cdots + \alpha_d \leq 1$.

To construct vv-CVs, a natural approach is to construct an mv-kernel $K_0$ using another mv-kernel $K$ and an operator $S_{\mathrm{vv}}$. Then, assuming we have access to such a $K_0$ and to data $D$, a natural generalisation of the scalar-valued case is

$g_\theta(x) = \sum_{t=1}^T \sum_{j=1}^{m_t} K_0(x, x_{tj})\, \theta_{tj}$,   (9)

where $\theta_{tj} \in \mathbb{R}^T$ for all $t \in [T]$ and $j \in [m_t]$. The remainder of this section will hence focus on how to obtain $K_0$. Our construction is based on the CFs of (Oates et al., 2017; 2019), which use a scalar-valued kernel $k_0$ obtained through a tensor product structure $\mathcal{U} = \mathcal{H}_k^d$. The initial motivation for introducing vv-functions in that setting was that $\mathcal{L}$ requires inputs which are vv-functions; this can be thought of as introducing a dummy dimension for convenience, and is in no way related to the multi-task setting of this paper. We now present the natural extension of $k_0$ to an mv-kernel.

Theorem 3.1. Consider a vv-RKHS $\mathcal{H}_K$ with mv-kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$, and suppose that $K \in C^{1,1}(\mathbb{R}^d \times \mathbb{R}^d)$. Furthermore, let $u \in \mathcal{H}_K^d$, which we identify with a matrix-valued function whose columns are $u_1, \ldots, u_T : \mathbb{R}^d \to \mathbb{R}^d$, and let $S_{\mathrm{vv}}[u] = (\mathcal{L}_{\Pi_1}[u_1], \ldots, \mathcal{L}_{\Pi_T}[u_T])^\top$. Then, the image of $\mathcal{H}_K^d = \mathcal{H}_K \times \ldots \times \mathcal{H}_K$ under $S_{\mathrm{vv}}$ is a vv-RKHS with kernel $K_0 : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$ given by

$(K_0(x, y))_{tt'} = \sum_{r=1}^d \partial_x^r \partial_y^r (K(x, y))_{tt'} + l_{t'r}(y)\, \partial_x^r (K(x, y))_{tt'} + l_{tr}(x)\, \partial_y^r (K(x, y))_{tt'} + l_{tr}(x)\, l_{t'r}(y)\, (K(x, y))_{tt'}$.

The proof is given in Appendix B.1.
Notice that $(K_0(x, y))_{tt'}$ depends only on the score functions of $\Pi_t$ and $\Pi_{t'}$, and so we (once again) do not require knowledge of normalisation constants. However, in order to evaluate $K_0(x, y)$ for some $x, y \in \mathbb{R}^d$, we will require pointwise evaluation of $\nabla_x \log \pi_t(x)$ and $\nabla_y \log \pi_t(y)$ for all $t \in [T]$. Finally, another interesting point is that $(K_0(x, y))_{tt'}$ is a scalar-valued kernel when $t = t'$, but this is not the case for $t \neq t'$ since it is not symmetric in that case.

To use elements of this RKHS as vv-CVs, we require that the least-squares objective in (7) is well-defined. This can be guaranteed when the elements of the RKHS are square-integrable, and the theorem below, proved in Appendix B.2, provides sufficient conditions for this to hold.

Theorem 3.2. Suppose that $K$ is bounded with bounded derivatives, and $\Pi_t[\|\nabla_x \log \pi_t\|_2^2] < \infty$ for all $t \in [T]$. Then, for any $g \in \mathcal{H}_{K_0}$, $g_t$ is square-integrable with respect to $\Pi_t$ for all $t \in [T]$.

The mv-kernel in Theorem 3.1 takes a very general form, as it places minimal requirements on $K$ or $\Pi_1, \ldots, \Pi_T$. This is convenient as it can be applied in a wide range of settings, but this generality comes at the cost of computational complexity. We will now study several special cases which will often be sufficient for applications.

Special Case I: Separable kernel $K$. For simplicity, the literature on multi-task learning often focuses on the case of separable kernels. We say an mv-kernel $K$ is separable if it can be written as $K(x, y) = B k(x, y)$, where $k$ is a scalar-valued kernel and $B \in S_+^T$. The advantage of this formulation is that it decouples the model for individual outputs (as given by $k$) from the model of their relationship, as given by the components of the matrix $B$, which can be thought of as a covariance matrix for tasks. As we will see in Section 4, this can be particularly advantageous for selecting the hyperparameters of vv-CVs. Using such a kernel $K$, the kernel $K_0$ in Theorem 3.1 becomes

$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r=1}^d \big[ \partial_x^r \partial_y^r k(x, y) + l_{t'r}(y)\, \partial_x^r k(x, y) + l_{tr}(x)\, \partial_y^r k(x, y) + l_{tr}(x)\, l_{t'r}(y)\, k(x, y) \big]$,

for $t, t' \in [T]$. This expression is interesting because it reduces the choice of $K$ to the choice of a matrix $B$ and a kernel $k$, and this matrix $B$ has a natural interpretation: $B_{tt'}$ denotes the covariance between $f_t$ and $f_{t'}$. See Appendix E.1 for illustrations of such $K_0$.

Special Case II: Separable kernel $K$ with one target distribution. A further simplification of the kernel is possible when $K$ is separable and all distributions are the same: $\Pi_1 = \cdots = \Pi_T \equiv \Pi$. In this case $K_0$ itself becomes a separable kernel of the form $(K_0(x, y))_{tt'} = B_{tt'} k_0(x, y)$ for all $t, t' \in [T]$, where $k_0$ is given in (3). The scalar case can then be recovered by taking $T = 1$ and $B = 1$.

Selecting a Base Kernel and Alternative Constructions. The base kernel $K$ needs to be chosen by the user. As in the scalar case, we expect that the performance of our approach will depend on whether the smoothness of $K_0$ closely matches the smoothness of the integrand $f$, and the smoothness of $K$ should therefore be chosen accordingly. We also suggest selecting the hyperparameters of $K$ through either cross-validation or maximisation of the log-marginal likelihood of a zero-mean multi-output Gaussian process with kernel $K_0$; see Appendix D.2 for details. Furthermore, see Appendix C for alternative constructions, including vv-CVs derived from second-order matrix-valued Stein kernels and vv-CVs based on other parametric spaces such as polynomials and neural networks.
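For Special Case I, each entry of $K_0(x, y)$ is just the scalar Stein kernel built with the scores of $\Pi_t$ at $x$ and $\Pi_{t'}$ at $y$, rescaled by $B_{tt'}$. The sketch below (our own illustration, reusing the stein_kernel_se function defined earlier) assembles the $T \times T$ matrix for two Gaussian targets:

```python
import numpy as np

# Separable matrix-valued Stein kernel of Special Case I:
# (K_0(x, y))_{tt'} = B_{tt'} * [scalar Stein kernel using the score of Pi_t
# at x and the score of Pi_{t'} at y].  Assumes stein_kernel_se from the
# earlier sketch is in scope.  The T = 2 targets here are Gaussians N(mu_t, I).
def separable_stein_kernel(x, y, scores, B, lengthscale=1.0):
    T = B.shape[0]
    K0 = np.empty((T, T))
    for t in range(T):
        for tp in range(T):
            K0[t, tp] = B[t, tp] * stein_kernel_se(
                x, y, scores[t](x), scores[tp](y), lengthscale)
    return K0

mus = [np.zeros(2), np.ones(2)]
scores = [lambda z, mu=mu: -(z - mu) for mu in mus]   # score of N(mu, I)
B = np.array([[1.0, 0.5], [0.5, 1.0]])                # task covariance matrix
x, y = np.array([0.3, -0.5]), np.array([1.0, 0.2])
print(separable_stein_kernel(x, y, scores, B))
```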
4. Selecting a Vector-Valued CV

We will now derive both a closed-form expression for the optimal parameters of these kernel-based vv-CVs and a stochastic optimisation scheme to approximate it.

Closed-form Solutions. We start with a theorem which provides a result akin to the RKHS representer theorem, but which focuses specifically on the function which minimises the objective in (8). This theorem shows that there exists a unique parameter minimising the variance objective.

Theorem 4.1. Let $D = \{\{x_{1j}, f_1(x_{1j})\}_{j=1}^{m_1}, \ldots, \{x_{Tj}, f_T(x_{Tj})\}_{j=1}^{m_T}\}$. The function which minimises the objective in (8), with $\|g_\theta\| := \|g_\theta\|_{\mathcal{H}_{K_0}}$ and fixed $\beta \in \mathbb{R}^T$, is of the form $g_\theta(x) = \sum_{t=1}^T \sum_{j=1}^{m_t} K_0(x, x_{tj})\, \theta_{tj}$ with $\theta_{tj} \in \mathbb{R}^T$, where the optimal parameter $\theta^*$ is given by the solution of the convex linear system of equations

$\sum_{t''=1}^{T} \sum_{j''=1}^{m_{t''}} \Big[ \sum_{t=1}^{T} \tfrac{1}{m_t} \sum_{j=1}^{m_t} \big(K_0(x_{t'j'}, x_{tj})\big)_{\cdot t} \big(K_0(x_{tj}, x_{t''j''})\big)_{t \cdot} + \lambda K_0(x_{t'j'}, x_{t''j''}) \Big] \theta^*_{t''j''} = \sum_{t=1}^{T} \tfrac{1}{m_t} \sum_{j=1}^{m_t} \big(K_0(x_{t'j'}, x_{tj})\big)_{\cdot t} \big(f_t(x_{tj}) - \beta_t\big)$,

for all $t' \in [T]$ and $j' \in [m_{t'}]$, where $(K_0)_{\cdot t}$ and $(K_0)_{t \cdot}$ denote the $t$th column and $t$th row of $K_0$ respectively. Furthermore, if $K_0$ is strictly positive definite and the points $x_{tj}$ are distinct, then the system is strictly convex and $\theta^*$ is unique.

See Appendix B.3 for the proof. Theorem 4.1 assumes that $\beta$ is known and fixed, which may not be the case in practice. However, given a fixed $g_\theta$, the objective (8) is quadratic in $\beta$, with optimal value $\beta_t^* = \frac{1}{m_t} \sum_{j=1}^{m_t} \big[ f_t(x_{tj}) - (g_\theta(x_{tj}))_t \big]$. This naturally leads to the use of block-coordinate descent approaches which iterate between optimising $\beta$ and $\theta$. This could either be implemented directly using the closed-form solutions, or through numerical optimisers such as the stochastic optimisation approaches we present next. The next paragraph highlights how this can be implemented for Special Cases I and II. We remark that the use of stochastic optimisation tools will be essential in most applications, since the size of the linear systems renders exact solutions intractable in practice.

Unknown Task Relationship. We now extend our approach to account for simultaneously estimating $\beta$, $\theta$, and the matrix $B$. In this setting, we use the same objective as in (8), but penalised by the norm of $B$:

$L_m^{\mathrm{vv}}(\theta, \beta, B) = J_m^{\mathrm{vv}}(\theta, \beta, B) + \lambda \|g_\theta\|^2 + \|B\|^2$.   (10)

We now write $J_m^{\mathrm{vv}}(\theta, \beta, B)$ for $J_m^{\mathrm{vv}}(\theta, \beta)$ in order to emphasise the dependence on $B$. A second regularisation parameter is unnecessary, as this would be equivalent to rescaling $k$. The objective in (10) is a natural extension of (8) and can be straightforwardly minimised through stochastic optimisation; see Algorithm 1. To ensure $B$ is strictly positive definite, we can take $B = LL^\top$, where $L$ is a lower-triangular matrix whose diagonal elements are forced to be greater than zero via an exponential transformation.

Algorithm 1: Block-coordinate descent for vv-CVs with unknown task relationship
  Input: $D$, $\tilde{m}$, $L$, $\lambda$, $\beta^{(0)}$, $\theta^{(0)}$ and $B^{(0)}$
  for iteration $l = 1$ to $L$ do
    Select a mini-batch $D_{\tilde{m}}$ of size $\tilde{m} = (\tilde{m}_1, \ldots, \tilde{m}_T)$
    $\theta^{(l)}, \beta^{(l)} \leftarrow \mathrm{UPDATE}_{\theta,\beta}(\theta^{(l-1)}, \beta^{(l-1)}, B^{(l-1)}; D_{\tilde{m}})$
    $B^{(l)} \leftarrow \mathrm{UPDATE}_B(\theta^{(l)}, \beta^{(l)}, B^{(l-1)}; D_{\tilde{m}})$
  end for
  Return: $\theta^{(L)}$, $\beta^{(L)}$ and $B^{(L)}$

The pseudo-code in Algorithm 1 presents the update steps abstractly as functions UPDATE. This is because different choices of vv-CVs might benefit from different updates. For example, pre-conditioners for the gradients could be used when readily available, or when these can be estimated from data. In Section 5, we will exclusively be using the Adam optimiser (Kingma & Ba, 2015), a first-order method with estimates of lower-order moments. Additionally, we study the convex case when $B$ is known and only $\beta$ and $\theta$ need to be estimated in Appendix A.2.
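The PyTorch sketch below (our own simplified illustration, not the authors' implementation) shows how these pieces fit together for Special Case II when all tasks share the same sample points: a Gram matrix of the scalar Stein kernel, the $B = LL^\top$ parametrisation with exponentiated diagonal, and the penalised objective of (10). For brevity it updates $\theta$, $\beta$ and $B$ jointly with Adam rather than in alternating blocks, and it uses the Euclidean norm of $\theta$ as the regulariser (as in Appendix A.2):

```python
import torch

# Simplified vv-CV fit for Special Case II (separable K_0 = B * k_0) when all
# T tasks share the same m sample points, so one R^T coefficient per sample.
def fit_vv_cv(K0_scalar, F, n_iter=500, lam=1e-3, lr=1e-2):
    # K0_scalar: (m, m) Gram matrix of the scalar Stein kernel k_0 at the samples
    # F: (m, T) matrix of integrand evaluations f_t(x_j)
    m, T = F.shape
    theta = torch.zeros(m, T, requires_grad=True)
    beta = F.mean(dim=0).clone().requires_grad_(True)  # initialise at the MC estimates
    L_raw = torch.zeros(T, T, requires_grad=True)      # unconstrained parametrisation of B
    opt = torch.optim.Adam([theta, beta, L_raw], lr=lr)
    strict_lower = torch.tril(torch.ones(T, T), diagonal=-1)
    for _ in range(n_iter):
        opt.zero_grad()
        L = L_raw * strict_lower + torch.diag(torch.exp(torch.diagonal(L_raw)))
        B = L @ L.T                                    # strictly positive definite
        G = K0_scalar @ theta @ B                      # g_theta evaluated at the samples, (m, T)
        loss = ((F - G - beta) ** 2).mean(dim=0).sum() # sum of per-task empirical variances
        loss = loss + lam * (theta ** 2).sum() + (B ** 2).sum()
        loss.backward()
        opt.step()
    return beta.detach()                               # vv-CV estimates of Pi[f_t] (n = 0 case)
```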
Computational Complexity. Although vv-CVs can be beneficial from an accuracy viewpoint, they also incur a significant computational cost. Whether they should be used will therefore depend on the computational budget available. In particular, when evaluations of $f$ or $\nabla \log \pi$ are expensive, the higher cost of using vv-CVs may be negligible. Table 1 provides the computational complexity of all approaches considered in the paper.

Table 1. Computational complexity of kernel-based CVs and vv-CVs as a function of $d$, $m$, $\tilde{m}$, $L$ and $T$. We assume that $m_t$ is the same for all $t \in [T]$ up to additive or multiplicative constants (and similarly for all $\tilde{m}_t$ with $t \in [T]$). The cost of stochastic optimisation algorithms is assumed to scale only with the cost of stochastic estimates of the gradient of $J_{\mathrm{vv}}$.

Method            | CV                          | vv-CV
Exact solution    | $O((d m_t^2 + m_t^3) T)$    | $O(d m_t^2 T^4 + m_t^3 T^6)$
Stochastic optim. | $O(d \tilde{m}_t m_t L T)$  | $O(d \tilde{m}_t m_t L T^4)$

We emphasise the impact of $T$, the number of tasks. In the case of existing kernel-based CVs, the dependence is $O(T)$. In contrast, the computational cost of vv-CVs is between $O(T^4)$ and $O(T^6)$. Table 1 also highlights the difference in computational complexity between obtaining closed-form solutions of $\theta$ and $\beta$ by solving the linear system of equations in Theorem 4.1, and using the stochastic optimisation approaches in Appendix A.2 and Section 4. Here, the difference is mainly in terms of powers of $T$, $m_t$ and $\tilde{m}_t$. As we can see, the exact solutions usually come with an $O(m_t^3)$ cost, whereas the stochastic optimisation approach is associated with an $O(\tilde{m}_t m_t L)$ cost; whenever $\tilde{m}_t L$ is small relative to $m_t$, this will lead to computational gains. In all applications considered, both $m_t$ and $T$ were small, so the overall cost is controlled. However, this computational complexity can be significantly reduced further in special cases. When using a kernel corresponding to a finite-dimensional RKHS (e.g. a polynomial kernel), the scaling becomes linear in $m_t$, but is $O(q^3)$ instead of $O(m_t^3)$, where $q \leq m_t$ is the dimensionality of the RKHS. Alternatively, for certain choices of point sets and kernels, it is possible to reduce the computational complexity to $O(m_t \log m_t)$ instead of $O(m_t^3)$ by using scalable kernel methods such as fast Fourier features or inducing points. When the integrands are evaluated at the same set of points and a separable kernel is used, the cost in $T$ also becomes $O(T^2)$ instead of $O(T^6)$, once again significantly reducing the computational complexity.

5. Experimental Results

We now illustrate our method on a range of problems, including multi-fidelity models, computation of the model evidence for dynamical systems through thermodynamic integration, and Bayesian inference for the abundance of prey using a Lotka-Volterra system. See Appendix E for additional experiments, including illustrations of matrix-valued Stein kernels $K_0$ in Appendix E.1 and a synthetic example where the Stein kernel matches the smoothness of the integrands in Appendix E.2. Since we are interested in the gains obtained from the CVs, we fix $n = 0$, which means we are using all the data to construct vv-CVs. The code to reproduce our results is available at: https://github.com/jz-fun/Vector-valued-Control-Variates-Code.

[Figure 1. Numerical integration of a univariate discontinuous multifidelity model. Upper: fitted CVs for both functions (low-fidelity model, high-fidelity model, vv-CV with squared-exponential k, vv-CV with first-order polynomial k, CV with squared-exponential k). Lower left: absolute error for $\Pi_H[f_H]$ as a function of the number of epochs of the optimisation algorithm for CVs based on a squared-exponential kernel (MC, CF CV, vv-CV with fixed B (1), vv-CV with fixed B (2), vv-CV with estimated B); the lines give the mean over 100 repetitions of the experiment, and the shaded areas one standard deviation above and below the mean. Lower right: same experiment with a first-order polynomial kernel for the high-fidelity function.]

5.1. Multifidelity Modelling in the Physical Sciences

Many problems in the engineering and physical sciences can be tackled with multiple models of a single system of interest. These models are often associated with varying computational costs and levels of accuracy, and their combination to solve a task is usually called multi-fidelity modelling; see (Peherstorfer et al., 2018) for a review. We will consider a high-fidelity model $f_H$ and a low-fidelity model $f_L$, and will attempt to estimate the integral of $f_H$ with our vv-CVs using function evaluations from both the high- and low-fidelity models. For clarity, we will now denote the function $f = (f_L, f_H)$ and the vector probability distribution $\Pi = (\Pi_L, \Pi_H)$. We note that this is a special case of the problem considered in our paper, since we use evaluations of multiple functions but are only interested in $\Pi_H[f_H]$ (whereas $\Pi_L[f_L]$ is not of interest).

Univariate Step Function. The first example considered is a toy problem from the multi-fidelity literature (Xi et al., 2018). The low-fidelity function is $f_L(x) = 2$ if $x \geq 0$ and $1$ otherwise. The high-fidelity function is $f_H(x) = 1$ if $x \geq 0$ and $0$ otherwise. In this example, $K_0$ is smoother than $f_1$ and $f_2$, which are both discontinuous. The integral is over the real line and taken against $\Pi = N(0, 1)$, and we fix the sample sizes to $m = (m_L, m_H) = (40, 40)$. Results with a squared-exponential and a first-order polynomial kernel $k$ can be found in Figure 1. The upper plots clearly show that the approximations are not of very high quality, but the lower plots show that all CVs can still lead to a gain in accuracy of around two orders of magnitude over MC methods. We also observe that vv-CVs can lead to further gains over existing CVs by leveraging evaluations of $f_L$.

For both kernels, we provide three different versions of the vv-CVs with separable structure to highlight the impact of the matrix $B$. The first two cases use Algorithm 2 with a fixed value of $B$: in the first instance $B_{11} = B_{22} = 0.1$ and $B_{12} = B_{21} = 0.01$, whereas in the second instance $B_{11} = B_{22} = 0.5$ and $B_{12} = B_{21} = 0.01$. The third case is based on estimating $B$ through Algorithm 1. Clearly, $B$ can have a significant impact on the performance of the vv-CV, and estimating a good value from data can provide further gains. The choice of $k$ is also significant: all CVs based on the squared-exponential kernel significantly outperform the CVs based on a first-order polynomial kernel. We also find that even when the model is mis-specified, the proposed method still performs better than standard scalar-valued CVs.
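For reference, the toy integrands above take only a few lines to set up; the sketch below (our own code, using the step-function definitions from the text) draws the $m = (40, 40)$ samples from $N(0, 1)$ and reports the plain MC baseline for $\Pi_H[f_H]$, whose true value is $0.5$:

```python
import numpy as np

# Setup of the univariate step-function example: low-fidelity f_L and
# high-fidelity f_H are both discontinuous at 0, the target is Pi = N(0, 1),
# and m = (m_L, m_H) = (40, 40) samples are used.  Pi_H[f_H] = P(x >= 0) = 0.5.
rng = np.random.default_rng(0)

def f_L(x):  # low-fidelity step function
    return np.where(x >= 0, 2.0, 1.0)

def f_H(x):  # high-fidelity step function
    return np.where(x >= 0, 1.0, 0.0)

m_L, m_H = 40, 40
x_L, x_H = rng.standard_normal(m_L), rng.standard_normal(m_H)
print(f"MC estimate of Pi_H[f_H]: {f_H(x_H).mean():.3f}  (true value 0.5)")
```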
Modelling of Water Flow through a Borehole. A more complex example often used to assess multi-fidelity methods is the following model of water flow passing through a borehole (Xiong et al., 2013; Kandasamy et al., 2016; Park et al., 2017). Both $f_L$ and $f_H$ have $d = 8$ inputs representing a range of parameters influencing the geometry and hydraulic conductivity of the borehole, as well as the transmissivity of the aquifer. Prior distributions have been elicited from scientists over the input parameters to account for uncertainty about their exact values. See Appendix E.4 for the details of each input and the multi-fidelity models. One quantity of interest here is the expected water flow under these distributions, and we hence have $\Pi_L = \Pi_H$.

Table 2. Expected values of the flow of water through a borehole. The numbers provided give the mean absolute integration error over 100 repetitions of the task of estimating $\Pi_H[f_H]$, and the numbers in brackets provide the sample standard deviation. To provide the absolute error, the true value (72.8904) is estimated by a MC estimator with $5 \times 10^5$ samples.

m   | vv-CV (est. B) | vv-CV (fixed B) | CF          | MC
10  | 3.72 (0.27)    | 1.94 (0.15)     | 2.24 (0.16) | 6.42 (0.44)
20  | 1.29 (0.10)    | 1.35 (0.10)     | 1.96 (0.10) | 4.31 (0.31)
50  | 1.04 (0.06)    | 1.77 (0.12)     | 1.76 (0.07) | 2.63 (0.17)
100 | 1.07 (0.06)    | 1.65 (0.14)     | 1.71 (0.05) | 1.83 (0.15)
150 | 0.85 (0.05)    | 1.30 (0.09)     | 1.67 (0.04) | 1.42 (0.10)

Results of our simulation study are presented in Table 2. We compare a standard MC estimator with a kernel-based CV fitted with a closed-form solution (denoted CF) and two kernel-based vv-CVs corresponding to Special Case II in Section 3: the first with $B_{11} = B_{22} = 5 \times 10^{-4}$ and $B_{12} = B_{21} = 5 \times 10^{-5}$, and the second with $B$ estimated using Algorithm 1. The kernel used is a tensor product of squared-exponential kernels with a separate lengthscale for each dimension. Clearly, vv-CVs significantly outperform MC in the large majority of cases, and estimating $B$ can lead to significant gains over using a fixed $B$. The worst performance for vv-CVs with estimated $B$ occurs for the lowest values of $m$; this is because $m$ is not large enough to learn a good $B$. See Appendix E.4 for further details.

5.2. Model Evidence for Dynamical Systems

We now consider Bayesian inference for non-linear differential equations such as dynamical systems, which can be particularly challenging due to the need to compute the model evidence. This is usually a computationally expensive task, since sampling from the posterior repeatedly requires the use of a numerical solver for the differential equations at a fine resolution.

[Figure 2. Model evidence computation through thermodynamic integration. Left: illustration of the van der Pol oscillator model (black line) and corresponding observations (red dots). Centre: estimates of the model evidence as a function of the number of posterior samples for kernel-based CVs. The box-plots were created by repeating the experiment 20 times, and the black line gives an estimate of the truth obtained from (Oates et al., 2017) (25.58). Right: same experiment but with kernel-based vv-CVs.]

In (Calderhead & Girolami, 2009), the authors propose to use thermodynamic integration (TI) (Friel & Pettitt, 2008) to tackle this problem, and (Oates et al., 2016; 2017) later showed that CVs can lead to significant gains in accuracy in this context.
TI introduces a path from the prior $p(x)$ to the posterior $p(x|y)$, where $y$ and $x$ represent the observations and the unknown parameters respectively. This is accomplished through the power posterior $p(x|y, t) \propto p(y|x)^t p(x)$, where $t \in [0, 1]$ is called the inverse temperature. When $t = 0$, $p(x|y, t) = p(x)$, whereas when $t = 1$, $p(x|y, t) = p(x|y)$. The standard TI formula for the model evidence has a simple form which can be approximated using second-order quadrature over a discretised temperature ladder $0 = t_1 < \cdots < t_w = 1$ (Friel et al., 2014). It takes the form

$\log p(y) = \int_0^1 \int_{\mathcal{X}} \log p(y|x)\, p(x|y, t)\, \mathrm{d}x\, \mathrm{d}t \approx \sum_{i=1}^{w-1} \frac{t_{i+1} - t_i}{2}\, (\mu_{i+1} + \mu_i) - \frac{(t_{i+1} - t_i)^2}{12}\, (v_{i+1} - v_i)$,

where $\mu_i$ is the mean and $v_i$ the variance of the integrand $f(x) := \log p(y|x)$ with respect to $\pi_i(x) := p(x|y, t_i)$. To estimate $\{\mu_i, v_i\}_{i=1}^w$, we need to sample from all power posteriors on the ladder, then use an MCMC estimator which can be enhanced through CVs. This gives $T = 2w$ related integrals: $w$ integrals to compute means and $w$ integrals to compute variances, each against a different power posterior. As we will see, this structure allows vv-CVs to provide significant gains in accuracy.

Our experiments focus on the van der Pol oscillator, which describes an oscillator $u : \mathbb{R}_+ \times \mathcal{X} \to \mathbb{R}$ (where $\mathcal{X} \subseteq \mathbb{R}$) given by the solution of $\mathrm{d}^2 u / \mathrm{d}s^2 - x(1 - u^2)\, \mathrm{d}u/\mathrm{d}s + u = 0$, where $s$ represents the time index. For this experiment, we follow the exact setup of (Oates et al., 2017) and transform the equation into a system of first-order equations, $\mathrm{d}u_1/\mathrm{d}s = u_2$ and $\mathrm{d}u_2/\mathrm{d}s = x(1 - u_1^2)u_2 - u_1$, which can be tackled with ODE solvers. Our data consist of noisy observations of $u_1$ (the first component of that system) given by $y(s) \sim N(u_1(s; x), \sigma^2)$ with $\sigma = 0.1$ at each point $s \in \{0, 1, \ldots, 10\}$; see the left-most plot of Figure 2 for an illustration. We take a ladder of size $w = 31$ with $t_i = ((i-1)/30)^5$ for $i \in \{1, \ldots, 31\}$. This gives a total of $T = 62$ integrals that need to be computed simultaneously, which is likely to be too computationally expensive for vv-CVs in their full generality. As a result, we chose $B$ to be a block-diagonal matrix which puts the integrands in groups of 4 means or 4 variances (except for one group of 3 for the means and one for the variances). To sample from the power posteriors, we use population MC with the manifold Metropolis-adjusted Langevin algorithm (Girolami & Calderhead, 2011). Due to the high computational cost of using ODE solvers, our number of samples is limited to less than 100 per integrand, and this number is the same for each integration task.

Our results are presented in Figure 2. The kernel parameters were taken to be identical to those in (Oates et al., 2017). As observed in the centre plot, kernel-based CVs provide relatively accurate estimates of the model evidence. As the sample size increases, we notice less variability in these estimates, but the central 50% of the runs are contained in an interval which excludes the true value. In comparison, the right-most plot shows that kernel-based vv-CVs can provide a significant further reduction in variance. The distribution of estimates is also much more concentrated and centred around the true value.

5.3. Bayesian Inference of the Lotka-Volterra System

We now consider another model: the Lotka-Volterra system (Lotka, 1925; Volterra, 1926; Lotka, 1927) of ordinary differential equations. This system is given by

$\mathrm{d}v_1(s)/\mathrm{d}s = \alpha v_1(s) - \beta v_1(s) v_2(s)$, $\quad \mathrm{d}v_2(s)/\mathrm{d}s = \delta v_1(s) v_2(s) - \gamma v_2(s)$.
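The integrands in this experiment require solving the system above numerically at each parameter value; a minimal sketch (our own, with hypothetical parameter values rather than those inferred from the hare-lynx data) using scipy is given below.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Solving the Lotka-Volterra system, as needed to evaluate the integrands
# f_t(x) = v_1(s'_t; x) at a given parameter value x.  Parameter values and
# time points below are hypothetical, for illustration only.
def lotka_volterra(s, v, alpha, beta, delta, gamma):
    v1, v2 = v
    return [alpha * v1 - beta * v1 * v2, delta * v1 * v2 - gamma * v2]

alpha, beta, delta, gamma = 0.55, 0.028, 0.026, 0.84   # hypothetical parameters
v0 = [30.0, 4.0]                                       # hypothetical initial prey/predator counts
s_eval = np.linspace(0.0, 20.0, 11)                    # time points s'_1, ..., s'_T
sol = solve_ivp(lotka_volterra, (0.0, 20.0), v0,
                args=(alpha, beta, delta, gamma), t_eval=s_eval, rtol=1e-8)
print(sol.y[0])   # prey trajectory v_1(s'_t; x), one integrand value per time point
```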
Here, $s \in [0, S]$ for some $S \in \mathbb{R}_+$ denotes time, and $v_1(s)$ and $v_2(s)$ are the numbers of prey and predators, respectively. The system has initial conditions $v_1(0)$ and $v_2(0)$. We have access to noisy observations of $v = (v_1, v_2)$ at points $s_1, \ldots, s_m \in [0, S]$, denoted $y_{1j}, y_{2j}$, which are observed with log-normal noise with standard deviations $\sigma_{y_1}$ and $\sigma_{y_2}$ respectively for all $j \in \{1, \ldots, m\}$, given some unknown parameter value $x^* = (\alpha^*, \beta^*, \delta^*, \gamma^*, v_1^*(0), v_2^*(0), \sigma_{y_1}^*, \sigma_{y_2}^*)^\top$. In practice, we reparameterise $x$ so that the model parameters are defined on $\mathbb{R}^8$; see Appendix E.6 for details. Given these observations, we can construct a posterior $\Pi$ on the value of $x^*$. We will then be interested in computing posterior expectations of $v_1$ at a set of time points $s'_1, \ldots, s'_T$, and hence have $T$ integrands of the form $f_t(x) = v_1(s'_t; x)$, where $x$ highlights the dependence on the parameter value. CVs were previously considered for individual tasks in this context by Si et al. (2021). However, these $T$ integrands are related when $s'_1, \ldots, s'_T$ are close to each other.

We use the dataset of snowshoe hares (prey) and Canadian lynxes (predators) from Hewitt (1921), and implement Bayesian inference on the model parameters $x$ using the No-U-Turn Sampler (NUTS) in Stan (Carpenter et al., 2017). For sv-CVs, we estimate each individual $\Pi[f_t]$ separately, while for vv-CVs we estimate the collection of tasks $\Pi[f] := (\Pi[f_1], \ldots, \Pi[f_T])^\top$ jointly. See Appendix E.6 for experimental details.

Table 3. Posterior expected abundance of prey. The numbers provided give the sum of the mean absolute integration errors over 10 repetitions of each task. To provide the absolute error, the true values of the associated expectations are estimated by MCMC estimators with $8 \times 10^5$ posterior samples.

T  | m   | vv-CV (est. B) | vv-CV (fixed B) | CF    | MCMC
2  | 500 | 0.462          | 0.404           | 0.666 | 0.568
5  | 500 | 0.393          | 0.419           | 0.521 | 0.987
10 | 500 | 0.938          | 1.031           | 2.540 | 2.663

In Table 3, we compare kernel-based vv-CVs (Special Case II) with standard MCMC estimators and CF estimators. We consider two cases of vv-CVs: the first fixes $B$ with $B_{tt} = 5 \times 10^{-4}$ for all $t$ and $B_{tt'} = 5 \times 10^{-5}$ for $t \neq t'$, and the second estimates $B$ using Algorithm 1. The kernel used is a tensor product of squared-exponential kernels with a separate lengthscale for each dimension. We increase the number of tasks from $T = 2$ to $T = 10$. Once again, vv-CVs significantly outperform MCMC, especially for large $T$, and estimating $B$ provides further gains over using a fixed $B$.

6. Conclusion

This paper considered variance reduction techniques that share information across related integration problems. The proposed solution, vector-valued control variates, was shown to lead to significant variance reduction for problems in multi-fidelity modelling and Bayesian computation. Our approach is, to the best of our knowledge, the first algorithm able to perform multi-task learning for numerical integration using only evaluations of the score functions of the corresponding target distributions. It is also the first algorithm which can simultaneously learn the relationship between integrands and provide estimates of the corresponding integrals without requiring a tractable kernel mean, as in (Xi et al., 2018). On an algorithmic level, further work will be needed to make the method more computationally practical and efficient. One particular line of research which could be considered is how special cases of our matrix-valued Stein kernels in Theorem 3.1 can be selected to reduce the computational cost whilst still producing a rich class of vv-CVs.
Since a preprint version of this paper appeared online, it has been shown that approaches based on meta-learning could be competitive for very large T; see Sun et al. (2023). On a theoretical level, we could also look at the question of when transferring information across tasks will lead to sufficient gains in accuracy to warrant the additional computational cost. In addition, it could be of interest to understand the negative transfer problems which can arise when the integrands are not in RKHS HK0 or when the sample size is too small to estimate B well. Acknowledgements The authors would like to thank Chris J. Oates for helpful discussions and for sharing some of his code for the thermodynamic integration example. ZS was supported under the EPSRC grant [EP/R513143/1]. AB was supported by the Department of Engineering at the University of Cambridge, and this material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence and under the EPSRC grant [EP/R018413/2]. FXB was supported by the Lloyd s Register Foundation Programme on Data-Centric Engineering and The Alan Turing Institute under the EPSRC grant [EP/N510129/1], and through an Amazon Research Award on Transfer Learning for Numerical Integration in Expensive Machine Learning Systems . Alvarez, M. A., Rosasco, L., and Lawrence, N. D. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195 266, 2012. Anastasiou, A., Barp, A., Briol, F.-X., Ebner, B., Gaunt, R. E., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, Vector-Valued Control Variates C., Liu, Q., Mackey, L., Oates, C. J., Reinert, G., and Swan, Y. Stein s method meets Statistics: A review of some recent developments. Statistical Science, 38(1): 120 139, 2023. Assaraf, R. and Caffarel, M. Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83(23): 4682, 1999. Barp, A., Takao, S., Betancourt, M., Arnaudon, A., and Girolami, M. A unifying and canonical description of measurepreserving diffusions. ar Xiv preprint ar Xiv:2105.02845, 2021. Barp, A., Oates, C. J., Porcu, E., and Girolami, M. A riemann stein kernel method. Bernoulli, 28(4):2181 2208, 2022a. Barp, A., Simon-Gabriel, C.-J., Girolami, M., and Mackey, L. Targeted separation and convergence with kernel discrepancies. ar Xiv preprint ar Xiv:2209.12835, 2022b. Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011. Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018. ISSN 00361445. doi: 10.1137/ 16M1080173. Briol, F.-X., Oates, C. J., Cockayne, J., Chen, W. Y., and Girolami, M. On the sampling problem for kernel quadrature. In Proceedings of the International Conference on Machine Learning, pp. 586 595, 2017. Calderhead, B. and Girolami, M. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028 4045, 2009. Carmeli, C., De Vito, E., and Toigo, A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4): 377 408, 2006. Carmeli, C., De Vito, E., Toigo, A., and Umanita, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(1):19 61, 2010. Carpenter, B., Gelman, A., Hoffman, M. 
D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of statistical software, 76(1), 2017. Ciliberto, C., Mroueh, Y., Poggio, T., and Rosasco, L. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning, pp. 1548 1557, 2015. Dellaportas, P. and Kontoyiannis, I. Control variates for estimation based on reversible Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(1):133 161, 2012. Dinuzzo, F., Ong, C. S., Gehler, P. V., and Pillonetto, G. Learning output kernels with block coordinate descent. In International Conference on Machine Learning, 2011. Evgeniou, T., Micchelli, C. A., and Pontil, M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615 637, 2005. Friel, N. and Pettitt, A. N. Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):589 607, 2008. Friel, N., Hurn, M., and Wyse, J. Improving power posterior estimation of statistical evidence. Statistics and Computing, 24(5):709 723, 2014. Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73(2):123 214, 2011. Green, P., Latuszyski, K., Pereyra, M., and Robert, C. Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25:835 862, 2015. Hewitt, C. G. The conservation of the wild life of Canada. New York: C. Scribner, 1921. Hickernell, F. J., Lemieux, C., and Owen, A. B. Control variates for quasi-Monte Carlo. Statistical Science, 20(1): 1 31, 2005. Jones, G. L. On the Markov chain central limit theorem. Probability Surveys, 1:299 320, 2004. Kandasamy, K., Dasarathy, G., Oliva, J., Schneider, J., and Poczos, B. Gaussian process optimisation with multifidelity evaluations. In Neural Information Processing Systems, pp. 992 1000, 2016. Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Leluc, R., Portier, F., and Segers, J. Control variate selection for monte carlo integration. Statistics and Computing, 31 (4):1 27, 2021. Lotka, A. J. Elements of physical biology. Williams & Wilkins, 1925. Vector-Valued Control Variates Lotka, A. J. Fluctuations in the abundance of a species considered mathematically. Nature, 119(2983):12 12, 1927. Micchelli, C. A. and Pontil, M. On learning vector-valued functions. Neural Computation, 17(1):177 204, 2005. Mijatovic, A. and Vogrinc, J. On the Poisson equation for Metropolis-Hastings chains. Bernoulli, 24(3):2401 2428, 2018. ISSN 13507265. Mira, A., Solgi, R., and Imparato, D. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653 662, 2013. Oates, C. J. and Girolami, M. Control functionals for quasi Monte Carlo integration. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 51, pp. 56 65, 2016. Oates, C. J., Papamarkou, T., and Girolami, M. The controlled thermodynamic integral for Bayesian model comparison. Journal of the American Statistical Association, 111(514):634 645, 2016. Oates, C. J., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79 (3):695 718, 2017. Oates, C. J., Cockayne, J., Briol, F.-X., Girolami, M., et al. Convergence rates for a class of estimators based on Stein s method. Bernoulli, 25(2):1141 1159, 2019. Papamarkou, T., Mira, A., and Girolami, M. Zero variance differential geometric Markov chain Monte Carlo algorithms. Bayesian Analysis, 9(1):97 128, 2014. Park, C., Haftka, R. T., and Kim, N. H. Remarks on multifidelity surrogates. Structural and Multidisciplinary Optimization, 55(3):1029 1050, 2017. Peherstorfer, B., Willcox, K., and Gunzburger, M. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550 591, 2018. Si, S., Oates, C. J., Duncan, A. B., Carin, L., and Briol, F.-X. Scalable control variates for Monte Carlo methods via stochastic optimization. Proceedings of the 14th Conference on Monte Carlo and Quasi-Monte Carlo Methods. ar Xiv:2006.07487, 2021. South, L. F., Karvonen, T., Nemeth, C., Girolami, M., and Oates, C. J. Semi-exact control functionals from Sard s method. Biometrika, 109(2):351 367, 2022a. South, L. F., Riabiz, M., Teymur, O., and Oates, C. J. Post Processing of MCMC. Annual Review of Statistics and Its Application, 2022b. Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008. Sun, Z., Oates, C. J., and Briol, F.-X. Meta-learning control variates: Variance reduction with limited data. ar Xiv:2303.04756, to appear at UAI, 2023. Volterra, V. Variazioni e fluttuazioni del numero d individui in specie animali conviventi. Societ a anonima tipografica Leonardo da Vinci , 1926. Wan, R., Zhong, M., Xiong, H., and Zhu, Z. Neural control variates for variance reduction. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 533 547, 2019. Xi, X., Briol, F.-X., and Girolami, M. Bayesian quadrature for multiple related integrals. In International Conference on Machine Learning, pp. 5373 5382, 2018. Xiong, S., Qian, P. Z. G., and Wu, C. F. J. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 55(1):37 46, 2013. Vector-Valued Control Variates We now complement the main text with additional details. Firstly, we provide additional methodology in Appendix A, including alternative objectives based on the variance of MCMC, stochastic optimisation algorithm for vv-CVs when the task relationship B is known (under special case I and II) and the calculation of the computational complexity that is presented in Table 1. Secondly, in Appendix B, we provide the proofs of all the theoretical results in the main text. Then, in Appendix C, we present alternative constructions of vv-CVs. These constructions include a more general form of kernel-based vv-CVs as well as some polynomial-based CVs. Finally, in Appendix D and Appendix E, we provide additional details on our implementation of vv-CVs and provide additional numerical experiments. A. Additional Methodology In this first appendix, we will present additional methodology for Section 3. We briefly discuss an alternative objective based on the variance of the MCMC case in Appendix A.1. The selection of vv-CVs under special case I and II when B is known and the corresponding algorithm is presented in Appendix A.2. A.1. Alternative Objectives based on the Variance of MCMC We present an alternative objective based on the variance of MCMC here, which is given by the following remark. Remark 1. 
Given {Xi} i=1 an ergodic Markov chain with invariant distribution Π, the variance of the MCMC CLT is proportional to J(θ) + 2 P i=1 CovΠ[f(X1) gθ(X1), f(Xi) gθ(Xi)] where the second term accounts for the correlation in the Markov chain. Given some realisation of the Markov chain {xj}m j=1 and the corresponding evaluations {f(xj)}m j=1, this can be approximated as: Jm(θ, β) + 2 Pm 1 s=1 1 m Pm s i=1 (f(xi) gθ(xi) ˆΠMC[f gθ])(f(xi+s gθ(xi+s) ˆΠMC[f gθ])). A.2. Stochastic Optimisation with a Known Task Relationship In this section, we discuss the selection of vv-CVs under special case I and II when B is known and give the corresponding algorithm in Algorithm 2. In this case, we will assume that B ST + is known. The proposed algorithm is a stochastic optimisation algorithm which we run for L time steps; pseudo-code is provided in Algorithm 2. We propose to initialise the algorithm at θ(0) = (0, . . . , 0) Rp, since this is equivalent to having gθ(x) = 0 (i.e. having no CV) for both the kerneland polynomial-based vv-CVs. We also suggest initialising the parameter β RT with any estimate of Π[f]. This is a natural initialisation since we expect βt to equal Π[ft] for all t [T] when m1, . . . , m T . For example, when the data consists of IID realisations from Π1, . . . , ΠT , a natural initialisation point is β(0) = (ˆΠMC 1 [f1], . . . , ˆΠMC T [f T ]) . For each iteration, we take mini-batches of size m NT where | m| |m|. Note here that m = ( m1, . . . , m T ) is a multi-index giving the size of the mini-batch for each of the T datasets. This formulation allows for the use of different datasets across integrands, but also different mini-batch sizes for each task (which may be useful if the datasets are of different size for each integrand). An epoch consists of having gone through all data points for all T tasks, and we randomly shuffle the indices for mini-batches after each epoch. As default, we propose to take mt mt/(PT t=1 mt) for all t [T]. This choice guarantees that the number of samples for each integrand in the mini-batches is proportional to the proportion of samples for that integrand in the full dataset. Once a mini-batch has been selected, we update our current estimate of the parameters θ and β using steps based on the gradient of our loss function: (θ,β)Lvv m(θ, β). The pseudo-code in Algorithm 2 presents this abstractly as a function UPDATEθ,β(θ, β, B; D), which takes in the current estimates of the parameters, the value of B and the dataset (or minibatch) used for the update; this is because different choices of vv-CVs might benefit from different updates. For example, preconditioners for the gradients could be used when readily available, or when these can be estimated from data. In Section 5, we will exclusively be using the Adam optimiser (Kingma & Ba, 2015), a first order method with estimates of lower-order moments. Vector-Valued Control Variates Algorithm 2 Stochastic optimisation for vv-CVs with known task relationship Input: D, m, L, λ, β(0) and θ(0). for iterations l from 1 to L do Select a mini-batch D m of size m θ(l), β(l) UPDATEθ,β(θ(l 1), β(l 1), B; D m end for Return: θ(L), β(L) For the penalisation term, it would be natural to take the RKHS norm gθ = gθ HK0 since this would lead to the objective used in Theorem 4.1. However, this can be impractical from a computational viewpoint since this norm depends on kernel evaluations for all of the training points. 
For the penalisation term, it would be natural to take the RKHS norm $\|g_\theta\| = \|g_\theta\|_{\mathcal{H}_{K_0}}$, since this would lead to the objective used in Theorem 4.1. However, this can be impractical from a computational viewpoint, since this norm depends on kernel evaluations at all of the training points. For this reason, we follow the recommendation of (Si et al., 2021) and instead use the Euclidean norm $\|g_\theta\| = \|\theta\|_2$. This still leads to a convex objective, since the objective remains quadratic in $\theta$.

Algorithm 2 is a natural approach to minimising our objective, since our kernel-based vv-CVs are linear in $\theta$ and $L^{\mathrm{vv}}_{\tilde{m}}$ is convex in $(\theta, \beta)$. Many stochastic optimisation methods, such as stochastic gradient descent, will hence converge to a global minimum under regularity conditions (Bottou et al., 2018). However, note that Algorithm 2 applies equally to other vv-CVs, whether linear or not.

A.3. Calculation of Computational Complexity

In this section, we derive the computational complexity reported in Table 1. Suppose we have $m_t$ samples and mini-batches of size $\tilde{m}_t$ for each of the $T$ tasks, for all $t \in [T]$.

Computational cost of CV:
Exact solution: For each task, we need to compute $k_0(\cdot, \cdot)$ for all pairs of samples, which costs $O(d m_t^2)$ per task. Computing the exact solution of the kernel-based control variate then costs $O(m_t^3)$ per task. In total, the computational cost of the exact solution of the CV is therefore $O((d m_t^2 + m_t^3) T)$.
Stochastic optimisation: Suppose we use $L$ epochs. At each iteration within an epoch, we need to compute $k_0(\cdot, \cdot)$ for $\tilde{m}_t m_t$ pairs, which costs $O(d \tilde{m}_t m_t)$. Doing this for all $L$ iterations results in $O(d \tilde{m}_t m_t L)$ per task. Hence, in total, the computational cost of the stochastic optimisation of the CV is $O(d \tilde{m}_t m_t L T)$ for all $T$ tasks.

Computational cost of vv-CV:
Exact solution: There are $m_t T$ samples (i.e. $m_t^2 T^2$ pairs) in total, and we are using mv-Stein kernels, so each kernel evaluation returns a $T \times T$ matrix. The computational cost of computing $K_0(\cdot, \cdot)$ is thus $O(d (m_t T)^2 T^2) = O(d m_t^2 T^4)$. To compute the exact solution of the vv-CV, we can re-arrange the Gram tensor of all samples into a matrix of size $m_t T^2 \times m_t T^2$. Hence, the computational cost of computing the exact solution of the vv-CV is $O((m_t T^2)^3) = O(m_t^3 T^6)$. In total, the computational cost of the exact solution of the vv-CV is therefore $O(d m_t^2 T^4 + m_t^3 T^6)$.
Stochastic optimisation: At each iteration, one mini-batch of the stochastic optimisation of the vv-CV contains $\tilde{m}_t T$ samples. We need to compute $K_0(\cdot, \cdot)$ between these samples and all $m_t T$ samples. This leads to a cost of $O(\tilde{m}_t T \cdot m_t T \cdot d \cdot T^2) = O(d \tilde{m}_t m_t T^4)$ per iteration. Doing this for all $L$ iterations gives a total computational cost of the stochastic optimisation of the vv-CV of $O(d \tilde{m}_t m_t L T^4)$.
B. Proofs of the Main Theoretical Results

In this second appendix, we provide the proofs of the theoretical results in the main text. The derivation of the mv-kernel $K_0$ from Theorem 3.1 can be found in Appendix B.1. The proof that kernel-based vv-CVs are square-integrable (Theorem 3.2) is in Appendix B.2. Finally, the proof of the existence of the optimal parameters as the solution of a linear system (Theorem 4.1) is given in Appendix B.3.

B.1. Proof of Theorem 3.1

Proof. We will show that $K_0$ is a kernel and derive its matrix components by constructing an appropriate feature map. The first-order Stein operator maps matrix-valued functions $u = (u_1, u_2, \ldots, u_T): \mathbb{R}^d \to \mathbb{R}^{d \times T}$ to the vv-function $S^{\mathrm{vv}}[u]: \mathbb{R}^d \to \mathbb{R}^T$,
$S^{\mathrm{vv}}[u] = (\mathcal{L}_{\Pi_1}[u_1], \ldots, \mathcal{L}_{\Pi_T}[u_T])^\top$, where $\mathcal{L}_{\Pi_t}[u_t](x) = \nabla_x \cdot u_t(x) + \nabla_x \log \pi_t(x) \cdot u_t(x)$, for $t \in [T]$.

Since $K \in C^{1,1}(\mathbb{R}^d \times \mathbb{R}^d)$, we can use Corollary 4.36 of (Steinwart & Christmann, 2008) to conclude that $\mathcal{H}_K$ is a vector-valued RKHS of continuously differentiable functions from $\mathbb{R}^d$ to $\mathbb{R}^T$; hence the tensor product $\mathcal{H}_K^d$ consists of suitable functions $u \in \mathcal{H}_K^d$, with components $u^i = (u_1^i, \ldots, u_T^i) \in \mathcal{H}_K$ for $i \in [d]$. Now, since the RKHS consists of differentiable functions, we have by Lemma C.8 in Barp et al. (2022b):
$\langle \partial_x^j K(\cdot, x) e_t, u^i \rangle_{\mathcal{H}_K} = e_t^\top \partial^j u^i(x) = \partial^j u_t^i(x) \equiv \frac{\partial u_t^i}{\partial x_j}(x)$, for $t \in [T]$,
where $e_t \in \mathbb{R}^T$ is a vector of zeros with value 1 in the $t$-th component. Then, writing $K_x^{e_t} \equiv K(\cdot, x) e_t$,
$S^{\mathrm{vv}}[u](x) = \sum_{t=1}^T \sum_{r=1}^d \big( \partial_x^r u_t^r(x) + \partial_x^r \log \pi_t(x)\, u_t^r(x) \big) e_t = \sum_{t=1}^T \sum_{r=1}^d \langle \partial_x^r K_x^{e_t} + l_{tr}(x) K_x^{e_t},\, u^r \rangle_{\mathcal{H}_K}\, e_t = \sum_{t=1}^T \langle \nabla_x K_x^{e_t} + l_t(x) K_x^{e_t},\, u \rangle_{\mathcal{H}_K^d}\, e_t$,
where $l_{tr}(x) = \partial_x^r \log \pi_t(x)$, and $\nabla_x K_x^{e_t}$ and $l_t(x)$ denote respectively the tuples $(\partial_x^1 K_x^{e_t}, \ldots, \partial_x^d K_x^{e_t}) \in \mathcal{H}_K^d$ and $(l_{t1}(x), \ldots, l_{td}(x)) \in \mathbb{R}^d$. We have thus obtained a feature map, i.e. a map $\gamma: \mathbb{R}^d \to B(\mathcal{H}_K^d, \mathbb{R}^T)$, where $B(\mathcal{H}_K^d, \mathbb{R}^T)$ denotes the space of bounded linear maps from $\mathcal{H}_K^d$ to $\mathbb{R}^T$, via the relation $\gamma(x)[u] = S^{\mathrm{vv}}[u](x)$, with adjoint $\gamma(x)^* = \sum_{t=1}^T (\nabla_x K_x^{e_t} + l_t(x) K_x^{e_t}) e_t^\top$. Recall that the adjoint map $\gamma(y)^* \in B(\mathbb{R}^T, \mathcal{H}_K^d)$ of $\gamma(y)$ is defined, for any $a \in \mathbb{R}^T$ and $u \in \mathcal{H}_K^d$, by the relation $\langle \gamma(y)^*[a], u \rangle_{\mathcal{H}_K^d} = \gamma(y)[u]^\top a$. In particular, by Proposition 1 of Carmeli et al. (2010), $K_0(x, y) \equiv \gamma(x) \gamma(y)^* \in \mathbb{R}^{T \times T}$ will then be the kernel associated to the feature operator (that is, a surjective partial isometry whose image is $\mathcal{H}_{K_0}$) $S^{\mathrm{vv}}: \mathcal{H}_K^d \to \mathcal{H}_{K_0}$. Substituting in the expressions for the feature map and its adjoint derived above, and using the equalities
$\langle \partial_x^s K(\cdot, x) e_t, \partial_y^r K(\cdot, y) e_{t'} \rangle_{\mathcal{H}_K} = e_t^\top \partial_x^s \partial_y^r K(x, y) e_{t'} = (\partial_x^s \partial_y^r K(x, y))_{tt'}$ and $\langle \partial_x^{ss} K(\cdot, x) e_t, \partial_y^r K(\cdot, y) e_{t'} \rangle_{\mathcal{H}_K} = (\partial_x^{ss} \partial_y^r K(x, y))_{tt'}$, for $t, t' \in [T]$,
which hold for any differentiable matrix-valued kernel (Barp et al., 2022b), we obtain the following expression for the components of $K_0$:
$(K_0(x, y))_{tt'} = \sum_{r=1}^d (\partial_x^r \partial_y^r K(x, y))_{tt'} + l_{t'r}(y) (\partial_x^r K(x, y))_{tt'} + l_{tr}(x) (\partial_y^r K(x, y))_{tt'} + l_{tr}(x) l_{t'r}(y) (K(x, y))_{tt'}$.
In particular, for separable kernels (i.e. $K(x, y) = B k(x, y)$) we have
$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r=1}^d \big[ \partial_x^r \partial_y^r k(x, y) + l_{t'r}(y) \partial_x^r k(x, y) + l_{tr}(x) \partial_y^r k(x, y) + l_{tr}(x) l_{t'r}(y) k(x, y) \big]$.
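To make the separable construction above concrete, the following is a minimal NumPy sketch of the resulting matrix-valued Stein kernel for a squared-exponential base kernel $k$. The function names, lengthscale parameterisation and the choice of Gaussian targets are illustrative assumptions, not the code used in our experiments.

```python
import numpy as np

def k_se(x, y, ell=1.0):
    """Squared-exponential base kernel and its first/cross derivatives in closed form."""
    diff = x - y
    k = np.exp(-0.5 * np.dot(diff, diff) / ell**2)
    grad_x = -diff / ell**2 * k                                     # \nabla_x k
    grad_y = diff / ell**2 * k                                      # \nabla_y k
    div_xy = (len(x) / ell**2 - np.dot(diff, diff) / ell**4) * k    # sum_r d^2 k / dx_r dy_r
    return k, grad_x, grad_y, div_xy

def K0_separable(x, y, scores, B, ell=1.0):
    """Separable mv-Stein kernel from Theorem 3.1:
    (K0(x,y))_{tt'} = B_{tt'} sum_r [ d_x^r d_y^r k + l_{t'r}(y) d_x^r k
                                      + l_{tr}(x) d_y^r k + l_{tr}(x) l_{t'r}(y) k ].
    `scores` is a list of T score functions x -> grad_x log pi_t(x)."""
    k, gx, gy, div_xy = k_se(x, y, ell)
    lx = np.stack([s(x) for s in scores])        # (T, d), rows are l_t(x)
    ly = np.stack([s(y) for s in scores])        # (T, d), rows are l_{t'}(y)
    stein = div_xy + (ly @ gx)[None, :] + (lx @ gy)[:, None] + k * (lx @ ly.T)
    return B * stein                             # elementwise product with B gives the (T, T) block

# Example: two Gaussian targets N(0, 1) and N(0, 1.25), as in Appendix E.1.
scores = [lambda x: -x, lambda x: -x / 1.25]
B = np.array([[1.0, 0.1], [0.1, 1.0]])
print(K0_separable(np.array([0.3]), np.array([-0.5]), scores, B))
```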
B.2. Proof of Theorem 3.2

Proof. Recall that if a scalar kernel $k$ satisfies $\int_{\mathbb{R}^d} k(x, x)\, d\mu(x) < \infty$, then its RKHS consists of square $\mu$-integrable functions, for any finite measure $\mu$ (Steinwart & Christmann, 2008, Theorem 4.26). If $g \in \mathcal{H}_{K_0}$, then $g_t$ belongs to the RKHS with scalar-valued kernel (this follows from (Carmeli et al., 2010, Prop. 1), using as feature operator the dot product with respect to $e_t$, where $e_t$ is defined in Appendix B.1)
$(K_0(x, y))_{tt} = \sum_{r=1}^d (\partial_x^r \partial_y^r K(x, y))_{tt} + l_{tr}(y)\, \partial_x^r (K(x, y))_{tt} + l_{tr}(x)\, \partial_y^r (K(x, y))_{tt} + l_{tr}(x) l_{tr}(y) (K(x, y))_{tt}$, for $t \in [T]$.
In particular, since $K$ is bounded with bounded derivatives, and
$\Pi_t[|l_{tr}|] + \Pi_t[|l_{tr}|^2] \le \sqrt{\Pi_t[|l_{tr}|^2]} + \Pi_t[|l_{tr}|^2]$, for $t \in [T]$, $r \in [d]$,
we have $\int_{\mathbb{R}^d} (K_0(x, x))_{tt}\, d\Pi_t(x) < \infty$ whenever $\|\nabla_x \log \pi_t(x)\|_2$ is square-integrable with respect to $\Pi_t$, and the result follows.

B.3. Proof of Theorem 4.1

Proof. We want to find $\arg\min_{g \in \mathcal{H}_{K_0}} L^{\mathrm{vv}}_m(g, \beta)$, where
$L^{\mathrm{vv}}_m(g, \beta) := \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} \big( f_t(x_{tj}) - g_t(x_{tj}) - \beta_t \big)^2 + \lambda \|g\|^2_{\mathcal{H}_{K_0}}$.
Note that the objective is the same as that in (8), with the only difference being that the first input is now a function as opposed to the parameter value parameterising this function. We will abuse notation by using the same mathematical expression for both objectives. By Ciliberto et al. (2015, Section 2.1), any solution of the minimisation problem has the form $\hat{g}(\cdot) = \sum_{t'=1}^T \sum_{j'=1}^{m_{t'}} K_0(\cdot, x_{t'j'})\, \theta_{t'j'}$. Substituting this solution into $L^{\mathrm{vv}}_m(g, \beta)$ yields
$L^{\mathrm{vv}}_m(\hat{g}, \beta) = \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} \Big( f_t(x_{tj}) - \big( \sum_{t'=1}^T \sum_{j'=1}^{m_{t'}} K_0(x_{tj}, x_{t'j'}) \theta_{t'j'} \big)_t - \beta_t \Big)^2 + \lambda \sum_{t', t''=1}^T \sum_{j'=1}^{m_{t'}} \sum_{j''=1}^{m_{t''}} \theta_{t'j'}^\top K_0(x_{t'j'}, x_{t''j''}) \theta_{t''j''}$
$= \lambda \sum_{t', t''=1}^T \sum_{j'=1}^{m_{t'}} \sum_{j''=1}^{m_{t''}} \theta_{t'j'}^\top K_0(x_{t'j'}, x_{t''j''}) \theta_{t''j''} + \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} \Big[ y_{tj}^2 + \big( \sum_{t'=1}^T \sum_{j'=1}^{m_{t'}} K_0(x_{tj}, x_{t'j'}) \theta_{t'j'} \big)_t^2 - 2 \sum_{t'=1}^T \sum_{j'=1}^{m_{t'}} y_{tj} \big( K_0(x_{tj}, x_{t'j'}) \theta_{t'j'} \big)_t \Big]$,
where $y_{tj} \equiv f_t(x_{tj}) - \beta_t$. The problem thus becomes a minimisation problem over the coefficients $\theta$:
$\arg\min_{\theta \in \mathbb{R}^{|D|}} \; \lambda \sum_{t', t''=1}^T \sum_{j'=1}^{m_{t'}} \sum_{j''=1}^{m_{t''}} \theta_{t'j'}^\top K_0(x_{t'j'}, x_{t''j''}) \theta_{t''j''} - 2 \sum_{t, t'=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} \sum_{j'=1}^{m_{t'}} \theta_{t'j'}^\top K_0(x_{t'j'}, x_{tj})\, e_t\, y_{tj} + \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} y_{tj}^2 + \sum_{t, t', t''=1}^T \sum_{j=1}^{m_t} \sum_{j'=1}^{m_{t'}} \sum_{j''=1}^{m_{t''}} \theta_{t'j'}^\top K_0(x_{t'j'}, x_{tj})\, e_t \frac{1}{m_t} e_t^\top K_0(x_{tj}, x_{t''j''})\, \theta_{t''j''}$.
Since the quadratic terms are positive definite, the resulting objective is a convex function of $\theta$; thus, by differentiating it, we obtain that the solution $\theta^*$ solves
$\sum_{t''=1}^T \sum_{j''=1}^{m_{t''}} \Big[ \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} K_0(x_{t'j'}, x_{tj})\, e_t e_t^\top K_0(x_{tj}, x_{t''j''}) + \lambda K_0(x_{t'j'}, x_{t''j''}) \Big] \theta_{t''j''} = \sum_{t=1}^T \frac{1}{m_t} \sum_{j=1}^{m_t} K_0(x_{t'j'}, x_{tj})\, e_t \big( f_t(x_{tj}) - \beta_t \big)$, for all $t' \in [T]$ and $j' \in [m_{t'}]$.
Finally, generalising the scalar case, we say that a matrix-valued reproducing kernel $K_0$ is strictly positive definite if, for any finite set of $\gamma_s \in \mathbb{R}^T$ and distinct points $y_s \in \mathbb{R}^d$, $\sum_{s, \ell} \gamma_s^\top K_0(y_s, y_\ell) \gamma_\ell = 0$ implies that each $\gamma_s$ is zero; this means that the mean embedding of $K_0$ is injective (or characteristic) over the set of linear functionals of the form $\delta_y^\gamma: f \mapsto \sum_t f_t(y) \gamma_t$. It follows that the map $\mathbb{R}^{D_x} \otimes \mathbb{R}^T \to \mathbb{R}^{D_x} \otimes \mathbb{R}^T$, where $D_x = \{x_{11}, \ldots, x_{T m_T}\}$, defined as
$\theta \mapsto \Big( \sum_{t=1}^T \sum_{j=1}^{m_t} K_0(x_{11}, x_{tj}) \theta_{tj}, \ldots, \sum_{t=1}^T \sum_{j=1}^{m_t} K_0(x_{T m_T}, x_{tj}) \theta_{tj} \Big)$,
is injective between vector spaces of the same dimension, and thus invertible by the rank theorem. Hence, since by the above the quadratic term is positive definite, the linear system may be inverted to find $\theta^*$.
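When the total number of samples is small, a closed-form solution of this type can be assembled and solved directly. The following is a minimal sketch, assuming NumPy, a single shared set of sample locations, uniform task weights and $\beta$ fixed at the plain MC means, in which case the problem reduces to standard vector-valued kernel ridge regression on the stacked Gram matrix rather than the full objective above; `K0_fn` could, for instance, be the `K0_separable` sketch given after Appendix B.1.

```python
import numpy as np

def fit_vv_cv_exact(X, F, K0_fn, lam=1e-5):
    """Closed-form vv-CV coefficients under the simplifications stated above.
    X: (n, d) shared sample locations; F: (n, T) integrand evaluations;
    K0_fn(x, y) -> (T, T) block of the mv-Stein kernel."""
    n, T = F.shape
    beta = F.mean(axis=0)                                 # fix beta at the plain MC estimates
    G = np.zeros((n * T, n * T))
    for i in range(n):                                    # assemble the stacked Gram matrix
        for j in range(n):
            G[i*T:(i+1)*T, j*T:(j+1)*T] = K0_fn(X[i], X[j])
    y = (F - beta).reshape(-1)                            # stacked residuals f_t(x_i) - beta_t
    theta = np.linalg.solve(G + lam * np.eye(n * T), y)   # representer-theorem coefficients
    return theta.reshape(n, T), beta
```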
C. Alternative Constructions

In this third appendix, we provide alternative constructions to those presented in the main text. First, in Appendix C.1, we present kernel-based vv-CVs derived from the second-order Langevin Stein operator. Then, in Appendix C.2 and Appendix C.3, we point out how these constructions can lead to polynomial-based vv-CVs.

C.1. Kernel-based vv-CVs from Second-Order Langevin Stein Operators

The Langevin Stein operator can also be adapted to apply to twice-differentiable scalar-valued functions $u: \mathbb{R}^d \to \mathbb{R}$, in which case it is called the second-order Langevin Stein operator:
$\mathcal{L}[u](x) := \Delta_x u(x) + \nabla_x u(x) \cdot \nabla_x \log \pi(x)$, (11)
where $\Delta_x = \nabla_x \cdot \nabla_x$. In this section we consider the second-order Langevin Stein operator, which acts on scalar-valued functions. The following theorem provides a characterisation of the class of vv-functions obtained when applying this operator to functions in a vv-RKHS.

Theorem C.1. Consider a vv-RKHS $\mathcal{H}_K$ with mv-kernel $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$, and suppose that $K \in C^{2,2}(\mathbb{R}^d \times \mathbb{R}^d)$. Furthermore, for suitably regular vv-functions $u = (u_1, \ldots, u_T): \mathbb{R}^d \to \mathbb{R}^T$, define the differential operator $S^{\mathrm{vv}}[u] = (\mathcal{L}_{\Pi_1}[u_1], \ldots, \mathcal{L}_{\Pi_T}[u_T])^\top$. Then, the image of $\mathcal{H}_K$ under $S^{\mathrm{vv}}$ is a vv-RKHS with reproducing kernel $K_0: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$:
$(K_0(x, y))_{tt'} = \sum_{r,s=1}^d \partial_x^{ss} \partial_y^{rr} (K(x, y))_{tt'} + l_{t'r}(y)\, \partial_x^{ss} \partial_y^{r} (K(x, y))_{tt'} + l_{ts}(x)\, \partial_x^{s} \partial_y^{rr} (K(x, y))_{tt'} + l_{ts}(x) l_{t'r}(y)\, \partial_x^{s} \partial_y^{r} (K(x, y))_{tt'}$, for $t, t' \in [T]$.

We note that this theorem is very similar to Theorem 3.1, and recovers the kernel of (Barp et al., 2022a) when $T = 1$, provided we use the manifold analogue of (11). Indeed, one advantage of (11) is that the associated Theorem C.1 can easily be extended to manifolds (more generally, we can obtain a similar result for any generator of a measure-preserving diffusion, as given in Corollary 5.3 of Barp et al. (2021)). However, one particular disadvantage of this construction from a computational viewpoint is that it requires higher-order derivatives of the kernel $K$. It also requires the evaluation of a double sum, which significantly increases the computational cost relative to our construction in the main text. For this reason, we did not explore this construction in more detail.

Proof. We proceed as in the proof of Theorem 3.1 and derive a feature map for $K_0$. Recall that $g = S^{\mathrm{vv}}[u] = (\mathcal{L}_{\Pi_1}[u_1], \ldots, \mathcal{L}_{\Pi_T}[u_T])^\top$, where $\mathcal{L}_{\Pi_t}$ is the second-order Stein operator, which maps scalar functions to scalar functions. Here $u$ belongs to an RKHS of $\mathbb{R}^T$-valued functions with matrix kernel $K$. From the differentiability assumption on $K$, we have $\mathcal{H}_K \subset C^2$, i.e. it is a space of twice continuously differentiable functions. Note that (here $\partial^{jj} = \partial^j \partial^j = \partial^2 / \partial x_j \partial x_j$)
$\langle \partial_x^{jj} K(\cdot, x) e_t, u \rangle_{\mathcal{H}_K} = \partial^{jj} u_t(x) \equiv \frac{\partial^2 u_t}{\partial x_j \partial x_j}(x)$, for $t \in [T]$,
where $e_t$ is the $t$-th standard basis vector of $\mathbb{R}^T$ as before. Thus
$\mathcal{L}_{\Pi_t}[u_t](x) = \Delta_x u_t(x) + \nabla_x \log \pi_t(x) \cdot \nabla_x u_t(x) = \sum_{s=1}^d \partial^{ss} u_t(x) + \sum_{s=1}^d l_{ts}(x)\, \partial^s u_t(x) = \sum_{s=1}^d \langle \partial_x^{ss} K(\cdot, x) e_t, u \rangle_{\mathcal{H}_K} + \sum_{s=1}^d l_{ts}(x) \langle \partial_x^s K(\cdot, x) e_t, u \rangle_{\mathcal{H}_K} = \big\langle \sum_{s=1}^d \partial_x^{ss} K(\cdot, x) e_t + l_{ts}(x)\, \partial_x^s K(\cdot, x) e_t,\, u \big\rangle_{\mathcal{H}_K}$, for $t \in [T]$.
Hence
$S^{\mathrm{vv}}[u](x) = \Big( \big\langle \sum_{s=1}^d \partial_x^{ss} K(\cdot, x) e_1 + l_{1s}(x)\, \partial_x^s K(\cdot, x) e_1,\, u \big\rangle_{\mathcal{H}_K}, \ldots, \big\langle \sum_{s=1}^d \partial_x^{ss} K(\cdot, x) e_T + l_{Ts}(x)\, \partial_x^s K(\cdot, x) e_T,\, u \big\rangle_{\mathcal{H}_K} \Big)^\top$.
Note that for each $x \in \mathbb{R}^d$, each component of the above is a bounded linear operator $\mathcal{H}_K \to \mathbb{R}$ (i.e. the map $u \mapsto (S^{\mathrm{vv}}[u](x))_s \in \mathbb{R}$ is a bounded linear operator), so we have obtained a feature map, i.e. a map $\gamma: \mathbb{R}^d \to B(\mathcal{H}_K, \mathbb{R}^T)$, where $B(\mathcal{H}_K, \mathbb{R}^T)$ denotes the space of bounded linear maps from $\mathcal{H}_K$ to $\mathbb{R}^T$. Specifically, $\gamma(x) \equiv S^{\mathrm{vv}}[\cdot](x) \in B(\mathcal{H}_K, \mathbb{R}^T)$. In particular, as before, $K_0(x, y) \equiv \gamma(x) \gamma(y)^* \in B(\mathbb{R}^T, \mathbb{R}^T)$ will be the kernel associated to the feature operator $S^{\mathrm{vv}}: \mathcal{H}_K \to \mathcal{H}_{K_0}$. Recall that $\gamma(y)^* \in B(\mathbb{R}^T, \mathcal{H}_K)$ is the adjoint map to $\gamma(y)$, i.e. it satisfies, for any $a \in \mathbb{R}^T$ and $u \in \mathcal{H}_K$, $\langle \gamma(y)^*[a], u \rangle_{\mathcal{H}_K} = \gamma(y)[u]^\top a$. From this we obtain
$\gamma(y)^*: a \mapsto \sum_{r=1}^d \sum_{t=1}^T a_t \big( \partial_y^{rr} K(\cdot, y) e_t + l_{tr}(y)\, \partial_y^r K(\cdot, y) e_t \big) \in \mathcal{H}_K$.
From $K_0(x, y) a = \gamma(x) \gamma(y)^*[a]$ for all $a \in \mathbb{R}^T$ and the above expressions, we can finally calculate $K_0$. We have $K_0(x, y) a = S^{\mathrm{vv}}[\gamma(y)^* a](x)$, and therefore $K_0(x, y) a$ is a vector with components
$(K_0(x, y) a)_t = \sum_{r,s=1}^d \sum_{t'=1}^T a_{t'} \big[ (\partial_x^{ss} \partial_y^{rr} K(x, y))_{tt'} + l_{t'r}(y) (\partial_x^{ss} \partial_y^r K(x, y))_{tt'} + l_{ts}(x) (\partial_x^s \partial_y^{rr} K(x, y))_{tt'} + l_{ts}(x) l_{t'r}(y) (\partial_x^s \partial_y^r K(x, y))_{tt'} \big]$, for $t \in [T]$.
Thus the components of $K_0(x, y) \in \mathbb{R}^{T \times T}$ are
$(K_0(x, y))_{tt'} = \sum_{r,s=1}^d (\partial_x^{ss} \partial_y^{rr} K(x, y))_{tt'} + l_{t'r}(y) (\partial_x^{ss} \partial_y^r K(x, y))_{tt'} + l_{ts}(x) (\partial_x^s \partial_y^{rr} K(x, y))_{tt'} + l_{ts}(x) l_{t'r}(y) (\partial_x^s \partial_y^r K(x, y))_{tt'}$, for $t, t' \in [T]$.

Analogously to the mv-kernel in Theorem 3.1, there are several cases of practical interest. The first is when $K(x, y) = B k(x, y)$ is a separable kernel, in which case:
$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r,s=1}^d \big[ \partial_x^{ss} \partial_y^{rr} k(x, y) + l_{t'r}(y)\, \partial_x^{ss} \partial_y^r k(x, y) + l_{ts}(x)\, \partial_x^s \partial_y^{rr} k(x, y) + l_{ts}(x) l_{t'r}(y)\, \partial_x^s \partial_y^r k(x, y) \big]$, for $t, t' \in [T]$.
The second is when $K$ is separable and $\Pi_1 = \ldots = \Pi_T$, in which case $l_r(x) := l_{1r}(x) = \ldots = l_{Tr}(x)$ for $r \in [d]$ and:
$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r,s=1}^d \big[ \partial_y^{ss} \partial_x^{rr} k(x, y) + l_r(x)\, \partial_y^{ss} \partial_x^r k(x, y) + l_s(y)\, \partial_y^s \partial_x^{rr} k(x, y) + l_s(y) l_r(x)\, \partial_y^s \partial_x^r k(x, y) \big]$, for $t, t' \in [T]$.
C.2. Alternative Constructions beyond Kernels

Although kernels are a natural way of constructing functions for multi-task problems, it is also possible to generalise constructions based on other parametric families, such as polynomials or neural networks. We will not explore this avenue in detail in the present paper, but now provide brief comments on how such generalisations could be obtained.

Firstly, $u_\theta$ could be based on any additive model, such as a polynomial or wavelet expansion. In that case, it is straightforward to construct vv-CVs with a separable structure as follows:
$(u_\theta(x))_t = \sum_i \sum_{t'=1}^T B_{tt'}\, \theta_i\, \phi_i(x)$, $\quad (g_\theta(x))_t = \sum_i \sum_{t'=1}^T B_{tt'}\, \theta_i\, S^{\mathrm{sv}}_{\Pi_t}[\phi_i](x)$, for $t \in [T]$, (12)
where $B \in S_+^T$ and $\phi_i: \mathbb{R}^d \to \mathbb{R}$ is a (sufficiently regular) basis function. In particular, taking the basis functions to be of the form $x^\alpha$ for $\alpha \in \mathbb{N}^d$ recovers the polynomial-based CVs of (Mira et al., 2013). We also note that any model of this form leads to a quadratic MC variance objective, whose solution can be obtained in closed form under mild regularity conditions on the basis functions. A small sketch of this separable construction is given at the end of this appendix.

Secondly, we could use non-linear models for $u_\theta$. In that case, one approach would be to use a separable structure of the form:
$(u_\theta(x))_t = \sum_{t'=1}^T B_{tt'}\, \phi_\theta(x)$, $\quad (g_\theta(x))_t = \sum_{t'=1}^T B_{tt'}\, S^{\mathrm{sv}}_{\Pi_t}[\phi_\theta](x)$, for $t \in [T]$, (13)
where $\phi_\theta(x)$ is a non-linear function of the parameters $\theta$. The above is a generalisation of the neural network-based CVs of (Wan et al., 2019; Si et al., 2021) whenever $\phi_\theta$ is a neural network. Unfortunately, the MC variance objective will usually be non-convex in those cases, and we therefore have no guarantee of recovering the optimal parameter value when using most numerical optimisers.

C.3. Polynomial vv-CVs

In Appendix C.2, we discussed a construction for vv-CVs based on polynomials which recovers the work of (Mira et al., 2013). However, it is also possible to obtain polynomial-based vv-CVs directly through our kernel constructions in Theorem 3.1 and Appendix C.1. In particular, one option would be to take $K(x, y) = B k(x, y)$ where $B \in S_+^T$ and $k(x, y) = (x^\top y + c)^l$ with $c \in \mathbb{R}$ and $l \in \mathbb{N}$.

Firstly, using the first-order Langevin Stein operator and setting $l = 1$, we obtain:
$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r=1}^d \big[ 1 + l_{t'r}(y) y_r + l_{tr}(x) x_r + l_{tr}(x) l_{t'r}(y) (x^\top y + c) \big]$, for $t, t' \in [T]$.
Similarly, when $l = 2$, we get:
$(K_0(x, y))_{tt'} = B_{tt'} \sum_{r=1}^d \big[ 2 x_r y_r + 2 (x^\top y + c) + 2 y_r l_{t'r}(y) (x^\top y + c) + 2 x_r l_{tr}(x) (x^\top y + c) + l_{tr}(x) l_{t'r}(y) (x^\top y + c)^2 \big]$, for $t, t' \in [T]$.
These two choices were considered in the experiments in Section 5. An alternative would be to consider this same kernel, but using the construction based on second-order Langevin Stein operators. Again taking $l = 1$, we obtain:
$(K_0(x, y))_{tt'} = \sum_{r=1}^d l_{tr}(x) l_{t'r}(y)\, B_{tt'}$, for $t, t' \in [T]$.
Similarly, when $l = 2$, we get:
$(K_0(x, y))_{tt'} = B_{tt'} \Big[ 4 \big( d + \sum_{r=1}^d l_{t'r}(y) y_r + l_{tr}(x) x_r \big) + 2 \big( \sum_{r=1}^d l_{tr}(x) l_{t'r}(y) (x^\top y + c) + \sum_{r,s=1}^d l_{ts}(x) l_{t'r}(y) x_r y_s \big) \Big]$, for $t, t' \in [T]$.
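As a concrete illustration of the separable construction in (12), the following is a minimal sketch assuming NumPy, a linear basis $\phi_i(x) = x_i$, Gaussian targets with closed-form scores, and the second-order operator (11) used as the scalar Stein operator $S^{\mathrm{sv}}_{\Pi_t}$ (one possible choice made for this sketch). The variable names and parameter values are illustrative.

```python
import numpy as np

# Scores of two (assumed) Gaussian targets N(0, I) and N(0, 1.25 I).
scores = [lambda x: -x, lambda x: -x / 1.25]

def stein_basis(x, score):
    """Second-order Langevin Stein operator (11) applied to phi_i(x) = x_i:
    since laplacian(phi_i) = 0 and grad(phi_i) = e_i, L[phi_i](x) = d/dx_i log pi(x).
    Each of these terms has mean zero under pi, which is the defining CV property."""
    return score(x)

def g_poly_vv(x, theta, B, scores):
    """Separable polynomial vv-CV from (12): (g_theta(x))_t = sum_i sum_t' B_{tt'} theta_i L_{Pi_t}[phi_i](x)."""
    T = len(scores)
    out = np.zeros(T)
    for t in range(T):
        out[t] = B[t].sum() * (theta @ stein_basis(x, scores[t]))
    return out

theta = np.array([0.5, -0.2])                  # illustrative coefficients, d = 2
B = np.array([[1.0, 0.1], [0.1, 1.0]])
print(g_poly_vv(np.array([0.3, -1.0]), theta, B, scores))
```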
D. Implementation Details

In this appendix, we focus on implementation details which may be helpful for implementing the algorithms in the main text. Firstly, in Appendix D.1, we derive the derivatives of several common kernels; this is essential for the implementation of Stein reproducing kernels. Then, in Appendix D.2, we provide details on how to select hyperparameters. Finally, in Appendix D.3, we discuss how to turn the problem of estimating $B$ from data into a sequence of convex optimisation problems.

D.1. Kernels and Their Derivatives

We now provide details of all the kernels used in the paper, as well as expressions for their derivatives.

Polynomial Kernel. The polynomial kernel $k_l(x, y) = (x^\top y + c)^l$, with constant $c \in \mathbb{R}$ and power $l \in \mathbb{N}$, has derivatives given by
$\nabla_x k_l(x, y) = l (x^\top y + c)^{l-1} y$, $\quad \nabla_y k_l(x, y) = l (x^\top y + c)^{l-1} x$,
$\nabla_x \cdot \nabla_y k_l(x, y) = \sum_{j=1}^d \frac{\partial^2}{\partial x_j \partial y_j} k_l(x, y) = \sum_{j=1}^d \frac{\partial}{\partial x_j} \big[ l (x^\top y + c)^{l-1} x_j \big] = \sum_{j=1}^d \big[ l(l-1)(x^\top y + c)^{l-2} y_j x_j + l (x^\top y + c)^{l-1} \big] = l(l-1)(x^\top y + c)^{l-2} x^\top y + d\, l (x^\top y + c)^{l-1}$.

Squared-Exponential Kernel. The squared-exponential kernel (sometimes called the Gaussian kernel) $k(x, y) = \exp\big( -\frac{\|x - y\|_2^2}{2\lambda} \big)$, with lengthscale $\lambda > 0$, has derivatives given by
$\nabla_x k(x, y) = -\frac{(x - y)}{\lambda} k(x, y)$, $\quad \nabla_y k(x, y) = \frac{(x - y)}{\lambda} k(x, y)$,
$\nabla_x \cdot \nabla_y k(x, y) = \sum_{j=1}^d \frac{\partial^2}{\partial y_j \partial x_j} k(x, y) = \sum_{j=1}^d \frac{\partial}{\partial y_j} \Big[ -\frac{(x_j - y_j)}{\lambda} k(x, y) \Big] = \sum_{j=1}^d \Big[ \frac{1}{\lambda} - \frac{(x_j - y_j)^2}{\lambda^2} \Big] k(x, y) = \Big[ \frac{d}{\lambda} - \frac{(x - y)^\top (x - y)}{\lambda^2} \Big] k(x, y)$.

Preconditioned Squared-Exponential Kernel. Following Oates et al. (2017), we also considered a preconditioned squared-exponential kernel:
$k(x, y) = \frac{1}{(1 + \alpha \|x\|_2^2)(1 + \alpha \|y\|_2^2)} \exp\big( -\frac{\|x - y\|_2^2}{2\lambda^2} \big)$,
with lengthscale $\lambda > 0$ and preconditioner parameter $\alpha > 0$. This kernel has derivatives given by:
$\nabla_x k(x, y) = \Big[ -\frac{2\alpha x}{1 + \alpha \|x\|_2^2} - \frac{(x - y)}{\lambda^2} \Big] k(x, y)$, $\quad \nabla_y k(x, y) = \Big[ -\frac{2\alpha y}{1 + \alpha \|y\|_2^2} + \frac{(x - y)}{\lambda^2} \Big] k(x, y)$,
$\nabla_x \cdot \nabla_y k(x, y) = \sum_{j=1}^d \frac{\partial^2}{\partial x_j \partial y_j} k(x, y) = \sum_{j=1}^d \frac{\partial}{\partial y_j} \Big\{ \Big[ -\frac{2\alpha x_j}{1 + \alpha \|x\|_2^2} - \frac{(x_j - y_j)}{\lambda^2} \Big] k(x, y) \Big\} = \sum_{j=1}^d \Big\{ \frac{1}{\lambda^2} k(x, y) + \Big[ -\frac{2\alpha x_j}{1 + \alpha \|x\|_2^2} - \frac{(x_j - y_j)}{\lambda^2} \Big] \frac{\partial}{\partial y_j} k(x, y) \Big\} = \sum_{j=1}^d \Big\{ \frac{1}{\lambda^2} k(x, y) + \Big[ -\frac{2\alpha x_j}{1 + \alpha \|x\|_2^2} - \frac{(x_j - y_j)}{\lambda^2} \Big] \Big[ -\frac{2\alpha y_j}{1 + \alpha \|y\|_2^2} + \frac{(x_j - y_j)}{\lambda^2} \Big] k(x, y) \Big\} = k(x, y) \Big[ \frac{4\alpha^2 x^\top y}{(1 + \alpha \|x\|_2^2)(1 + \alpha \|y\|_2^2)} + \frac{2\alpha (x - y)^\top y}{\lambda^2 (1 + \alpha \|y\|_2^2)} - \frac{2\alpha (x - y)^\top x}{\lambda^2 (1 + \alpha \|x\|_2^2)} + \frac{d}{\lambda^2} - \frac{(x - y)^\top (x - y)}{\lambda^4} \Big]$.

Product of Kernels. Finally, some of our examples also use products of well-known kernels. Consider the kernel $k(x, y) = \prod_{j=1}^d k_j(x_j, y_j)$. The derivatives of this kernel can be expressed in terms of the components of the product and their derivatives as follows:
$\nabla_x k(x, y) = \Big( \frac{\partial k_1(x_1, y_1)}{\partial x_1} \prod_{j \ne 1} k_j(x_j, y_j), \ldots, \frac{\partial k_d(x_d, y_d)}{\partial x_d} \prod_{j \ne d} k_j(x_j, y_j) \Big)^\top$,
$\nabla_y k(x, y) = \Big( \frac{\partial k_1(x_1, y_1)}{\partial y_1} \prod_{j \ne 1} k_j(x_j, y_j), \ldots, \frac{\partial k_d(x_d, y_d)}{\partial y_d} \prod_{j \ne d} k_j(x_j, y_j) \Big)^\top$,
$\nabla_y \cdot \nabla_x k(x, y) = \sum_{j=1}^d \frac{\partial^2}{\partial x_j \partial y_j} k(x, y) = \sum_{j=1}^d \frac{\partial^2 k_j(x_j, y_j)}{\partial x_j \partial y_j} \prod_{i \ne j} k_i(x_i, y_i)$.
These closed-form expressions can also be checked numerically; a short sketch is given below.
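The following is a minimal sketch, assuming NumPy, that checks the closed-form squared-exponential derivatives above against central finite differences; the test points, lengthscale and tolerance are arbitrary choices.

```python
import numpy as np

def k_se(x, y, lam=0.7):
    return np.exp(-np.sum((x - y) ** 2) / (2 * lam))

def grads_se(x, y, lam=0.7):
    """Closed-form derivatives from Appendix D.1: grad_x k, grad_y k and sum_j d^2 k / dx_j dy_j."""
    k = k_se(x, y, lam)
    gx = -(x - y) / lam * k
    gy = (x - y) / lam * k
    div_xy = (len(x) / lam - np.sum((x - y) ** 2) / lam**2) * k
    return gx, gy, div_xy

def finite_diff_grad_x(x, y, lam=0.7, eps=1e-5):
    """Central finite-difference approximation of grad_x k."""
    g = np.zeros(len(x))
    for j in range(len(x)):
        e = np.zeros(len(x)); e[j] = eps
        g[j] = (k_se(x + e, y, lam) - k_se(x - e, y, lam)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
gx, gy, div_xy = grads_se(x, y)
assert np.allclose(gx, finite_diff_grad_x(x, y), atol=1e-6)
print("closed-form grad_x matches finite differences")
```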
D.2. Hyper-parameter Selection

Most kernels (whether scalar- or matrix-valued) have hyperparameters which must be selected. For example, the squared-exponential kernel will often have a lengthscale or amplitude parameter, and these will have a significant impact on performance. We propose to select kernel hyperparameters through a marginal likelihood objective, by noticing the equivalence between the optimal vv-CV based on the objective in (8) and the posterior mean of a zero-mean Gaussian process model with covariance matrix $K_0(x, y)$; see (Oates et al., 2017) for a discussion in the sv-CV case. Unfortunately, computing the marginal likelihood in the general case can be prohibitively expensive due to the need to take inverses of large kernel matrices; the exact issue we were attempting to avoid through the use of stochastic optimisation. For simplicity, we instead maximise the marginal likelihood corresponding to $B = I_T$:
$\nu^* := \arg\max_\nu -\frac{1}{2} \sum_{t=1}^T \Big[ \sum_{j, j'=1}^{m_t} f_t(x_{tj}) \big( K_{\Pi_t}(\nu) + \lambda I_{m_t} \big)^{-1}_{jj'} f_t(x_{tj'}) + \log \det\big[ K_{\Pi_t}(\nu) + \lambda I_{m_t} \big] \Big]$,
where $K_{\Pi_t}(\nu)$ is a matrix with entries $K_{\Pi_t}(\nu)_{ij} = k_{\Pi_t}(x_{ti}, x_{tj}; \nu)$, and $k_{\Pi_t}$ is a Stein reproducing kernel of the form in (3), specialised to $\Pi_t$, with hyperparameters given by some vector $\nu$. This form is not optimal when $B \ne I_T$, but we found that it tends to perform well in our numerical experiments. The regularisation parameter $\lambda$ could also be selected through the marginal likelihood. However, in practice we are in an interpolation setting, and we therefore choose $\lambda$ as small as possible whilst still being large enough to guarantee numerically stable computation of the matrix inverses above.

D.3. Convex Optimisation for Estimating B

As discussed in Section 4, estimating the matrix $B$ of a separable kernel from data leads to a non-convex optimisation problem. Thankfully, we can approximate the optimum using a sequence of convex problems by extending the work of Dinuzzo et al. (2011) and Ciliberto et al. (2015), together with Theorem 4.1 above. For this, we require that the kernel $K_0$ is separable, and shall thus restrict ourselves to the case where we have a single target distribution (i.e. special case II).

Theorem D.1. Suppose that $\Pi_t = \Pi$ for $t \in [T]$ and $K(x, y) = B k(x, y)$, so that $K_0(x, y) = B k_0(x, y)$ where $k_0$ is defined in (3). Then the following objective is convex in $(\theta, \beta, B)$ for any value of $\delta > 0$:
$L^{\mathrm{vv}}_{m, \delta}(\theta, \beta, B) = J^{\mathrm{vv}}_m(\theta, \beta, I_T) + \lambda\, \mathrm{Tr}\Big[ B^\dagger \Big( \sum_{t, t'=1}^T \sum_{j=1}^{m_t} \sum_{j'=1}^{m_{t'}} k_0(x_{tj}, x_{t'j'})\, \theta_{tj} \theta_{t'j'}^\top + \delta^2 I_T \Big) \Big] + \|B\|^2$,
and for each $\beta$ and any sequence $\delta_\ell \to 0$, the associated sequence of minimisers $(\theta_\ell, B_\ell)$ converges to $(\theta^*, B^*)$ such that $(\theta^* B^{*\dagger}, B^*)$ minimises the objective in (10).

Proof. Since the kernel $K_0$ is separable, the objective (10) may be written in the form of Ciliberto et al. (2015, Problem (Q)). As shown therein, $\sum_{t, t'=1}^T \sum_{j=1}^{m_t} \sum_{j'=1}^{m_{t'}} \mathrm{Tr}\big[ B^\dagger k_0(x_{tj}, x_{t'j'})\, \theta_{tj} \theta_{t'j'}^\top \big]$ is jointly convex in $B$ and $\theta$, and since the first term in $L^{\mathrm{vv}}_{m, \delta}(\theta, \beta, B)$ is jointly convex in $\beta$ and $\theta$, $L^{\mathrm{vv}}_{m, \delta}(\theta, \beta, B)$ is jointly convex in $(\theta, B, \beta)$. Moreover, by Theorems 3.1 and 3.3 in (Ciliberto et al., 2015), when $\delta \to 0$, $(\theta, B)$ converges in Frobenius norm to $(\theta^*, B^*)$, where $(\theta^* B^{*\dagger}, B^*)$ is a minimiser of (10) and $B^{*\dagger}$ denotes the pseudo-inverse of $B^*$.

This theorem could therefore be used to construct an approach based on convex optimisation algorithms, applied iteratively for a decreasing sequence of penalisation parameters, in order to converge to an optimum approaching the global optimum. However, this approach is limited to the case where all distributions are identical, and is hence not as widely applicable as Algorithm 1.
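To illustrate the role of the $\delta$-perturbation, here is a minimal sketch, assuming NumPy, of the penalty term in $L^{\mathrm{vv}}_{m,\delta}$ as written above. `Theta` stacks the coefficient vectors $\theta_{tj} \in \mathbb{R}^T$ as rows and `K0_scalar` is the Gram matrix of the scalar Stein kernel $k_0$ over all samples; the names, interface and the placeholder inputs are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

def delta_penalty(B, Theta, K0_scalar, lam, delta):
    """Penalty of L^vv_{m,delta}: lam * Tr( B^+ ( C + delta^2 I_T ) ) + ||B||_F^2,
    where C = sum_{i,j} k0(x_i, x_j) theta_i theta_j^T aggregates the coefficient outer products."""
    C = Theta.T @ K0_scalar @ Theta                       # (T, T)
    T = B.shape[0]
    pinvB = np.linalg.pinv(B)
    return lam * np.trace(pinvB @ (C + delta**2 * np.eye(T))) + np.linalg.norm(B, "fro") ** 2

# Illustrative inputs: 10 samples, T = 3 tasks, a placeholder scalar Gram matrix.
B = 0.5 * np.eye(3)
Theta = np.random.default_rng(1).normal(size=(10, 3))
K0_scalar = np.eye(10)
for delta in [1.0, 0.1, 0.01]:                            # decreasing delta, as in Theorem D.1
    print(delta, delta_penalty(B, Theta, K0_scalar, lam=1e-3, delta=delta))
```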
E. Additional Details for the Experimental Study

This last appendix provides additional experiments, including: an illustration of matrix-valued Stein reproducing kernels in Appendix E.1; a synthetic example from (South et al., 2022b), for which the Stein kernel matches the smoothness of the integrands, in Appendix E.2; and extra experiments for the physical modelling of water flow with unbalanced datasets in Appendix E.4.2. In addition, further details on the numerical experiments in Section 5 of the main paper are provided: multifidelity univariate step functions in Appendix E.3; multifidelity modelling of water flow in Appendix E.4; model evidence for dynamical systems in Appendix E.5; and Bayesian inference for the Lotka-Volterra system in Appendix E.6.

E.1. Illustration of Matrix-valued Stein Kernels

An illustration of matrix-valued Stein kernels $K_0$ is given in Figure 3 for the case $T = 2$. As observed, the choice of kernel $k$ can have a significant impact on $K_0$. Moreover, $K_0$ possesses a well-known property of Stein kernels: even when $k$ is translation-invariant (see the top row), this may not be the case for $K_0$. This is due to the fact that $K_0$ depends on $l$. Finally, we can also observe that the two outputs of $\mathbf{1}^\top K_0(x, y)$ are correlated, a property which will be key when it comes to vv-CVs.

Figure 3. Illustration of a separable mv-kernel $K_0$ for $T = 2$ through the projections $(\mathbf{1}^\top K_0(x, y))_1$ and $(\mathbf{1}^\top K_0(x, y))_2$ at $y \in \{0, 1, 2\}$, with $\mathbf{1} = (1, 1)^\top$. Here, $\Pi_1 = N(0, 1)$, $\Pi_2 = N(0, 1.25)$, $B_{11} = B_{22} = 1$ and $B_{12} = B_{21} = 0.1$. The first row corresponds to taking $k$ to be a squared-exponential kernel, whereas the second and third rows correspond to taking a polynomial kernel $k(x, y) = (x^\top y + 1)^l$ with $l = 1$ and $l = 2$ respectively.

E.2. Additional Experiment: A Synthetic Example

Here we consider a synthetic example selected from (South et al., 2022b) (denoted $f_2$); to make the problem fit into our framework, we introduce another, similar integrand (denoted $f_1$):
$f_1(x) = 1.5 + x + 1.5 x^2 + 1.75 \sin(\pi x) \exp(-x^2)$,
$f_2(x) = 1 + x + x^2 + \sin(\pi x) \exp(-x^2)$.
For this problem, we trained all CVs through stochastic optimisation and used $m = (50, 50)$ MC samples. This synthetic example was originally used by (South et al., 2022b) to highlight one of the drawbacks of kernel-based CVs, namely that the fitted CV will usually tend to $\beta$ in parts of the domain where we do not have any function evaluations. This phenomenon can be observed on the red lines in Figure 4 (left and centre), which correspond to a CV based on a squared-exponential kernel. This behaviour is clearly one of the biggest drawbacks of existing kernel-based approaches. However, the blue curve, representing a kernel-based vv-CV with separable kernel where $B$ was inferred through optimisation, partially overcomes this issue by using evaluations of both integrands, hence clearly demonstrating the potential advantages of sharing function values across integration tasks.

The right-most plot in Figure 4 presents box plots of the sum of squared errors for each integration problem, calculated over 100 repetitions of the experiment. The different box plots show the impact of the difference between $\Pi_1$ and $\Pi_2$. As observed, vv-CVs tend to outperform CVs, although this difference in performance is more stark when $\Pi_2$ has a heavier tail than $\Pi_1$. This reinforces the previous point, since a more dispersed $\Pi_2$ means that the second integrand will be evaluated more often in extreme areas of the domain, which helps obtain a better vv-CV by improving the fit in the tails of the distribution. In this experiment, the choice of $k$ as a squared-exponential kernel was motivated by the fact that this makes $k_0$ infinitely differentiable, hence matching the smoothness of both $f_1$ and $f_2$.

Figure 4. Numerical integration of the problem from (South et al., 2022b). Left and centre: illustration of $f_1$ and $f_2$, as well as the corresponding kernel-based CVs and vv-CVs obtained through stochastic optimisation when $\Pi_1 = N(0, 1)$ and $\Pi_2 = N(0, 1.25)$. Right: sum of the squared errors in estimating $\Pi_1[f_1]$ and $\Pi_2[f_2]$. Here, $\Pi_1 = N(0, 1)$ whilst $\Pi_2 = N(0, \sigma^2)$ with $\sigma^2 \in \{1, 1.1, 1.15, 1.2, 1.25\}$. The experiment was replicated 100 times for all methods.
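For reference, the following is a minimal sketch, assuming NumPy, of the two integrands and their plain MC estimates under $\Pi_1 = N(0, 1)$ and $\Pi_2 = N(0, 1.25)$; these are the baselines that the CVs and vv-CVs above aim to improve upon. The seed and sample sizes are illustrative.

```python
import numpy as np

def f1(x):
    return 1.5 + x + 1.5 * x**2 + 1.75 * np.sin(np.pi * x) * np.exp(-x**2)

def f2(x):
    return 1.0 + x + x**2 + np.sin(np.pi * x) * np.exp(-x**2)

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=50)               # 50 samples from Pi_1 = N(0, 1)
x2 = rng.normal(0.0, np.sqrt(1.25), size=50)     # 50 samples from Pi_2 = N(0, 1.25)
# The sin term integrates to 0 by symmetry, so Pi_1[f1] = 3 and Pi_2[f2] = 2.25 exactly.
print("MC estimate of Pi_1[f1]:", f1(x1).mean())
print("MC estimate of Pi_2[f2]:", f2(x2).mean())
```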
The exact details of the implementation are as follows.

CV:
* Sample size: 50.
* Hyper-parameter tuning: batch size 5; learning rate 0.05; total number of epochs 30.
* Base kernel: squared-exponential kernel.
* Optimisation: λ = 0.001; batch size 5; learning rate 0.001; total number of epochs 400.

vv-CV (estimated B):
* Sample size: (50, 50) from (Π1, Π2) for (f1, f2).
* Hyper-parameter tuning: batch size 5 (10 in total for (f1, f2)); learning rate 0.05; total number of epochs 30.
* Base kernel: squared-exponential kernel.
* Optimisation: B(0) initialised at the identity matrix I2; λ = 0.001; batch size 5 (10 in total for (f1, f2)); learning rate 0.001; total number of epochs 400.

E.3. Experimental Details for the Multifidelity Univariate Step Functions

The experiment was replicated 100 times for all methods. Details of the implementation are given below:

Squared-exponential kernel:
* Sample size: 40.
* Base kernel: squared-exponential kernel.
* Hyper-parameter tuning: batch size 10; learning rate 0.02; total number of epochs 15.
* Optimisation: λ = 1e-5; batch size 10; learning rate 3e-4; total number of epochs 400.

vv-CV (estimated B / fixed B):
* Sample size: (40, 40) from (Normal(0, 1), Normal(0, 1)) for (f_L, f_H).
* Hyper-parameter tuning: batch size 5 (10 in total for (f_L, f_H)); learning rate 0.02; total number of epochs 15.
* Base kernel: squared-exponential kernel.
* Optimisation: when B is fixed, we set B11 = B22 = 0.5 and B12 = B21 = 0.01; otherwise, B(0) is initialised at the identity matrix I2. λ = 1e-5; batch size 5 (10 in total for (f_L, f_H)); learning rate 3e-4; total number of epochs 400.

First-order polynomial kernel:
* Sample size: 40.
* Base kernel: first-order polynomial kernel.
* Optimisation: λ = 1e-5; batch size 10; learning rate 3e-4; total number of epochs 400.

vv-CV (estimated B / fixed B):
* Sample size: (40, 40) from (Normal(0, 1), Normal(0, 1)) for (f_L, f_H).
* Base kernel: first-order polynomial kernel.
* Optimisation: when B is fixed, we set B11 = B22 = 0.5 and B12 = B21 = 0.01; otherwise, B(0) is initialised at the identity matrix I2. λ = 1e-5; batch size 5 (10 in total for (f_L, f_H)); learning rate 3e-4; total number of epochs 400.

The empirical computational cost for all tasks is as follows: scalar-valued CVs take approximately 2.4 seconds for either choice of kernel; vv-CVs with fixed B take around 3.3 seconds with a squared-exponential kernel and around 3.1 seconds with a first-order polynomial kernel; vv-CVs with estimated B take around 6.6 seconds with a squared-exponential kernel and around 6.2 seconds with a first-order polynomial kernel.

E.4. Experimental Details for the Physical Modelling (Borehole) of Water Flow

In this section, we provide details on the Borehole example from the main paper, as well as complementary experiments. The distribution with respect to which the integrals are taken is an eight-dimensional Gaussian with independent marginals, as given in Table 4.

Table 4. Prior distributions for the inputs of the Borehole function.
$r_w \sim$ Normal(0.1, 0.0161812^2)    $r \sim$ Normal(100, 0.01)
$T_u \sim$ Normal(89335, 20)           $T_l \sim$ Normal(89.55, 1)
$H_u \sim$ Normal(1050, 1)             $H_l \sim$ Normal(760, 1)
$L \sim$ Normal(1400, 10)              $K_w \sim$ Normal(10950, 30)

The low-fidelity and high-fidelity models of water flow (Xiong et al., 2013) are given by
$f_L(x) = \dfrac{5 T_u (H_u - H_l)}{\log(r/r_w)\big( 1.5 + \frac{2 L T_u}{\log(r/r_w) r_w^2 K_w} + \frac{T_u}{T_l} \big)}$, $\qquad f_H(x) = \dfrac{2\pi T_u (H_u - H_l)}{\log(r/r_w)\big( 1 + \frac{2 L T_u}{\log(r/r_w) r_w^2 K_w} + \frac{T_u}{T_l} \big)}$,
where $x = (r_w, r, T_u, T_l, H_u, H_l, L, K_w)$.
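A minimal sketch, assuming NumPy, of the two fidelity levels above, evaluated at the prior means from Table 4; the function name and interface are illustrative choices.

```python
import numpy as np

def borehole(rw, r, Tu, Tl, Hu, Hl, L, Kw, high_fidelity=True):
    """Low/high-fidelity Borehole water-flow models (Xiong et al., 2013)."""
    log_ratio = np.log(r / rw)
    numer = (2 * np.pi if high_fidelity else 5.0) * Tu * (Hu - Hl)
    offset = 1.0 if high_fidelity else 1.5
    denom = log_ratio * (offset + 2 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl)
    return numer / denom

# Evaluate both fidelities at the prior means from Table 4.
mean_inputs = dict(rw=0.1, r=100.0, Tu=89335.0, Tl=89.55, Hu=1050.0, Hl=760.0, L=1400.0, Kw=10950.0)
print("f_L:", borehole(**mean_inputs, high_fidelity=False))
print("f_H:", borehole(**mean_inputs, high_fidelity=True))
```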
E.4.1. Experiment in the Main Text: Balanced vv-CVs

The number of replications is 100 for all methods. Details of the implementation are given below:

* Base kernel: instead of using $k(x, x') = \exp(-\|x - x'\|_2^2 / 2\nu)$ with $\nu > 0$, which implicitly assumes that the lengthscales are identical in all directions, we allow each dimension to have its own lengthscale. That is, $k(x, x') := \prod_{j=1}^d k_j(x_j, x'_j)$, where $k_j(x_j, x'_j) = \exp\big( -\frac{(x_j - x'_j)^2}{2\nu_j} \big)$ and each component has its own lengthscale $\nu_j > 0$ to be determined. Since $\pi(x) = \prod_{j=1}^d \pi_j(x_j)$, the score function is $\nabla_x \log \pi(x) = \big( \frac{\partial \log \pi_1(x_1)}{\partial x_1}, \ldots, \frac{\partial \log \pi_d(x_d)}{\partial x_d} \big)^\top$.
* Hyper-parameter tuning: batch size 5 (10 in total for (f_L, f_H)); learning rate 0.05; total number of epochs 20.
* Optimisation (estimated B / fixed B): when B is fixed, we set B11 = B22 = 5e-4 and B12 = B21 = 5e-5; otherwise, B(0) is initialised at 1e-5 · I2. λ = 1e-5; batch size 5 (10 in total for (f_L, f_H)); the learning rates for sample sizes (10, 20, 50, 100, 150) are (0.09, 0.06, 0.012, 0.0035, 0.002), respectively.

The empirical computational cost (measured in seconds) of this example is as follows: when m = (10, 10), CF, vv-CVs with fixed B and vv-CVs with estimated B take around 0.03, 1.2 and 2.6 seconds, respectively; when m = (20, 20), around 0.1, 3 and 5 seconds; when m = (50, 50), around 0.7, 7.5 and 13.5 seconds; when m = (100, 100), around 2.7, 17 and 27.6 seconds; when m = (150, 150), around 6, 29 and 49 seconds.

E.4.2. Additional Experiment: Unbalanced Datasets for the Physical Modelling of Water Flow

In Figure 5, we present the results of vv-CVs when the sample sizes are unbalanced; that is, we have a different number of samples for the low-fidelity and high-fidelity models. The exact setup is given below, and we replicated the experiment 100 times.

* Sample size: $m_H$ is fixed to 20, while $m_L \in \{20, 40, 60\}$.
* Base kernel: product of squared-exponential kernels, $k(x, x') := \prod_{j=1}^d k_j(x_j, x'_j)$, where each $k_j(x_j, x'_j) = \exp(-(x_j - x'_j)^2 / 2\nu_j)$ has its own lengthscale $\nu_j > 0$ to be determined.
* Hyper-parameter tuning: batch size 5 (10 in total for (f_L, f_H)); learning rate 0.05; total number of epochs 20.
* Optimisation (estimated B / fixed B): when B is fixed, we set B11 = B22 = 5e-4 and B12 = B21 = 5e-5; otherwise, B(0) is initialised at 1e-5 · I2. λ = 1e-5; the learning rate is (0.06, 0.04, 0.02) for $m_L \in \{20, 40, 60\}$, respectively.

Figure 5. Performance (absolute error) of vv-CVs with estimated B and fixed B on the low-fidelity and high-fidelity models, with unbalanced sample sizes. Here we fix $m_H = 20$ and vary $m_L \in \{20, 40, 60\}$. Each experiment is repeated 100 times.

Interestingly, we notice that not much is gained by increasing the number of samples for the low-fidelity model. In fact, in the case of a fixed B, the performance tends to decrease with a larger $m_L$. This is likely due to the negative transfer phenomenon, which is well known in machine learning: it can occur when two tasks are not similar enough to provide any gains in accuracy. In this case, there is clearly no advantage in using a larger $m_L$, since this increases the computational cost without providing any gains in accuracy.

E.5. Experimental Details for the Computation of the Model Evidence through Thermodynamic Integration

To implement our vv-CVs, we need to derive the corresponding score functions.
For a power posterior, the score function is of the form
$\nabla_\theta \log p(\theta \mid y, t) = t\, \nabla_\theta \log p(y \mid \theta) + \nabla_\theta \log p(\theta)$,
where $\nabla_\theta \log p(\theta)$ is the score function corresponding to the prior. In our case, the prior is a log-normal distribution, $\log \theta \sim N(\mu, \sigma^2)$ (where $\sigma = 0.25$), and its score function is given by
$\nabla_\theta \log p(\theta) = -\frac{1}{\theta}\Big( 1 + \frac{\log \theta - \mu}{\sigma^2} \Big)$.
The score functions for all temperatures are plotted in Figure 6; as observed, the score functions at consecutive temperatures are very similar to one another. In order to keep the computational cost manageable, we split the T = 62 integration problems into groups of closely related problems. In particular, we jointly estimate the means for 4 consecutive temperatures on the ladder (group 1 is $\mu_1, \mu_2, \mu_3, \mu_4$, group 2 is $\mu_5, \mu_6, \mu_7, \mu_8$, and so on). Since 31 is not divisible by 4, our last group consists of the three means $\mu_{29}, \mu_{30}, \mu_{31}$. The same approach is then taken to create groups of 4 (or 3 for the last group) variances.

Figure 6. Score functions corresponding to the power posteriors at the different temperatures on the temperature ladder.

The number of replications was 20 for each method. Details are given below:

* Base kernel: preconditioned squared-exponential kernel (Oates et al., 2017).
* Hyperparameter tuning: we use the values (0.1, 3) from (Oates et al., 2017).
* Optimisation: λ = 1e-3; batch size 5; total number of epochs 400.

vv-CV (estimated B):
* Base kernel: preconditioned squared-exponential kernel (Oates et al., 2017).
* Hyperparameter tuning: we use the values (0.1, 3) from (Oates et al., 2017).
* Optimisation: λ = 1e-3; batch size 5; learning rate 0.01; number of epochs 400.
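As an illustration of the score functions used above, here is a minimal sketch, assuming NumPy, of the power-posterior score with the log-normal prior term derived above. The log-likelihood gradient `grad_log_lik` is a placeholder for the model-specific term, and the default value of $\mu$ is an illustrative assumption.

```python
import numpy as np

def lognormal_prior_score(theta, mu=0.0, sigma=0.25):
    """Score of the log-normal prior: -(1/theta) * (1 + (log(theta) - mu) / sigma^2)."""
    return -(1.0 / theta) * (1.0 + (np.log(theta) - mu) / sigma**2)

def power_posterior_score(theta, t, grad_log_lik, mu=0.0, sigma=0.25):
    """d/dtheta log p(theta | y, t) = t * d/dtheta log p(y | theta) + d/dtheta log p(theta)."""
    return t * grad_log_lik(theta) + lognormal_prior_score(theta, mu, sigma)

# Example with a hypothetical, placeholder log-likelihood gradient:
grad_log_lik = lambda theta: -(theta - 1.0)
for t in [0.0, 0.5, 1.0]:
    print(t, power_posterior_score(1.2, t, grad_log_lik))
```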
E.6. Experimental Details for the Lotka-Volterra System

We apply a log-exp transform to the model parameters to avoid placing constrained parameters on the ODE directly. The Lotka-Volterra system can be re-parameterised as
$\frac{d v_1}{ds} = \tilde{\alpha} v_1(s) - \tilde{\beta} v_1(s) v_2(s)$, $\qquad \frac{d v_2}{ds} = \tilde{\delta} v_1(s) v_2(s) - \tilde{\gamma} v_2(s)$,
$\tilde{\alpha} = \exp(\alpha)$, $\tilde{\beta} = \exp(\beta)$, $\tilde{\delta} = \exp(\delta)$, $\tilde{\gamma} = \exp(\gamma)$,
where $v_1$ and $v_2$ represent the number of prey and predators, respectively. The observation model is
$y_{10} \sim$ Log-Normal$(\log \tilde{v}_1(0), \tilde{\sigma}_{y_1})$, $\quad y_{20} \sim$ Log-Normal$(\log \tilde{v}_2(0), \tilde{\sigma}_{y_2})$,
$y_{1s} \sim$ Log-Normal$(\log v_1(s), \tilde{\sigma}_{y_1})$, $\quad y_{2s} \sim$ Log-Normal$(\log v_2(s), \tilde{\sigma}_{y_2})$,
$\tilde{v}_1(0) := \exp(v_1(0))$, $\tilde{v}_2(0) := \exp(v_2(0))$, $\tilde{\sigma}_{y_1} := \exp(\sigma_{y_1})$, $\tilde{\sigma}_{y_2} := \exp(\sigma_{y_2})$.
By doing so, $x := (\alpha, \beta, \delta, \gamma, v_1(0), v_2(0), \sigma_{y_1}, \sigma_{y_2})$ can be defined on the whole of $\mathbb{R}^8$, since the exponential transformation guarantees that the transformed quantities are positive; the parameters can therefore be assigned priors on $\mathbb{R}^8$, e.g. Gaussian priors. As a result, the expectations associated with $\pi(x)$ are defined on $\mathbb{R}^8$, and Stan returns the scores of these parameters directly, as the 8 parameters $x$ are unconstrained through this manual reparameterisation. The priors are:
$\alpha, \gamma \sim$ Normal$(0, 0.5^2)$, $\qquad \beta, \delta \sim$ Normal$(-3, 0.5^2)$,
$\sigma_{y_1}, \sigma_{y_2} \sim$ Normal$(-1, 1^2)$, $\qquad v_1(0), v_2(0) \sim$ Normal$(\log 10, 1^2)$.

The fit for the prey observations $y_{1s}$ and $v_1(s)$ at points $s_1, \ldots, s_m$ is shown in Figure 7, and the fit for the predator observations $y_{2s}$ and $v_2(s)$ is shown in Figure 8.

Figure 7. Bayesian inference for the abundance of prey in the Lotka-Volterra system. Dots are observations; solid lines are posterior means, and dotted lines are the corresponding 95% credible intervals. Tasks are chosen in the area between the two vertical red lines.

Figure 8. Bayesian inference for the abundance of predators in the Lotka-Volterra system. Dots are observations; solid lines are posterior means, and dotted lines are the corresponding 95% credible intervals. Tasks are chosen in the area between the two vertical red lines.

The number of replications is 10 for each method and for each task. Details are given below:

* Tasks: $\Pi[f_t]$ at times $s_1, \ldots, s_T$ (the base unit is 1 year):
  T = 2: 1913., 1913.2;
  T = 5: 1912., 1912.2, 1912.4, 1912.6, 1912.8;
  T = 10: 1912., 1912.2, 1912.4, 1912.6, 1912.8, 1913., 1913.2, 1913.4, 1913.6, 1913.8.
* Base kernel: the same as in Appendix E.4, a product of squared-exponential kernels.
* Hyperparameter tuning: batch size 10; learning rate 0.01; total number of epochs 10.
* Optimisation: λ = 1e-5; batch size 10; learning rate 1e-3; total number of epochs 400.

The empirical computational cost (measured in seconds) of this example is as follows: when T = 2, CF, scalar-valued CVs, vv-CVs with fixed B and vv-CVs with estimated B take around 67, 80, 150 and 159 seconds respectively for all T tasks; when T = 5, around 170, 200, 854 and 882 seconds respectively; when T = 10, around 340, 400, 3435 and 3469 seconds respectively.

E.6.1. Additional Experiments for Bayesian Inference of the Abundance of Prey in the Lotka-Volterra System

We present additional experiments in Table 5, under the same settings as those in the section above. We consider estimating $\Pi[f_t]$ at times $s_1, \ldots, s_T$ (the base unit is 1 year):
* T = 2: 1915., 1915.2;
* T = 5: 1915., 1915.2, 1915.4, 1915.6, 1915.8;
* T = 10: 1914., 1914.2, 1914.4, 1914.6, 1914.8, 1915., 1915.2, 1915.4, 1915.6, 1915.8.

Table 5. Additional experiments for Bayesian inference of the Lotka-Volterra system: sum of the mean absolute errors across tasks.
T    m    vv-CV (estimated B)   vv-CV (fixed B)   CF      MC
2    500  0.169                 0.144             0.242   0.275
5    500  0.322                 0.245             1.246   0.661
10   500  0.916                 0.792             5.835   1.797
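For completeness, a minimal sketch, assuming SciPy, of the reparameterised Lotka-Volterra dynamics used in this appendix; the parameter values below are illustrative values near the prior means listed above, not the posterior values used in the experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lv_rhs(s, v, alpha, beta, delta, gamma):
    """Reparameterised Lotka-Volterra dynamics: rates are exp() of the unconstrained parameters."""
    a, b, d, g = np.exp([alpha, beta, delta, gamma])
    v1, v2 = v
    return [a * v1 - b * v1 * v2, d * v1 * v2 - g * v2]

# Unconstrained parameters near the prior means (alpha, gamma ~ 0; beta, delta ~ -3).
theta = dict(alpha=0.0, beta=-3.0, delta=-3.0, gamma=0.0)
v0 = np.exp([np.log(10.0), np.log(10.0)])        # exp-transformed initial conditions (prior mean log 10)
sol = solve_ivp(lv_rhs, t_span=(0.0, 20.0), y0=v0, args=tuple(theta.values()), dense_output=True)
print(sol.y[:, -1])                               # prey and predator abundance at the final time
```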