# Robust and Conjugate Gaussian Process Regression

Matias Altamirano¹, François-Xavier Briol¹, Jeremias Knoblauch¹

¹Department of Statistical Science, University College London, London, United Kingdom. Correspondence to: Matias Altamirano, François-Xavier Briol, Jeremias Knoblauch.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract: To enable closed form conditioning, a common assumption in Gaussian process (GP) regression is independent and identically distributed Gaussian observation noise. This strong and simplistic assumption is often violated in practice, which leads to unreliable inferences and uncertainty quantification. Unfortunately, existing methods for robustifying GPs break closed-form conditioning, which makes them less attractive to practitioners and significantly more computationally expensive. In this paper, we demonstrate how to perform provably robust and conjugate Gaussian process (RCGP) regression at virtually no additional cost using generalised Bayesian inference. RCGP is particularly versatile as it enables exact conjugate closed form updates in all settings where standard GPs admit them. To demonstrate its strong empirical performance, we deploy RCGP for problems ranging from Bayesian optimisation to sparse variational Gaussian processes.

1. Introduction

GPs (Rasmussen & Williams, 2006) are one of the most widely used methods for Bayesian inference on latent functions, especially when uncertainty quantification is required. They have numerous appealing properties, including that the prior is relatively interpretable and can be elicited through a choice of mean and covariance functions, as well as the fact that they have closed form posteriors under Gaussian likelihoods. Their convergence is also well understood, even under prior misspecification (Wynne et al., 2021). Thanks to these advantages, GPs have found applications in diverse problems including single- and multi-output regression (Bonilla et al., 2007; Moreno-Muñoz et al., 2018), emulation of expensive simulators (Santner et al., 2018), Bayesian optimisation (Shahriari et al., 2015; Garnett, 2021) and Bayesian deep learning (Damianou & Lawrence, 2013; Salimbeni et al., 2019; Dutordoir et al., 2020). Their use is enabled by a plethora of packages including GPflow (Matthews et al., 2017), GPyTorch (Gardner et al., 2018), BoTorch (Balandat et al., 2020), ProbNum (Wenger et al., 2021) and Emukit (Paleyes et al., 2023).

By far the most common use of GPs is in regression. Here, the observations correspond to noisy realisations from an unknown latent function that is assumed to be drawn from a GP prior. To obtain a conjugate GP posterior distribution on the latent function, the observation noise is usually assumed to be Gaussian. While assuming Gaussian observation noise makes the posterior tractable, it also makes inferences non-robust. In particular, Gaussian noise makes GPs highly susceptible to extreme values, heterogeneities, and outliers.

Figure 1. The posterior predictive mean of a GP (green) and the RCGP (blue) on a synthetic dataset where 10% of the data are uniformly generated outliers. Unlike the RCGP, the GP is adversely affected.
This is illustrated in Figure 1 on a synthetic dataset corrupted with outliers: the standard GP is adversely affected, leading to considerable deviations between the inferred function and the ground truth. In many real-world applications and data sets, the presence of outliers is almost inevitable. They can occur for a variety of different reasons, including faulty measurements, broken sensors, extreme weather events, stock market sell-offs, or genetic mutations.

Existing Work. The lack of robustness in GPs is a well-known fundamental challenge for their widespread application, and a number of methods have been proposed to address this. Broadly, these fall into two categories. The first replaces the Gaussian measurement error with more heavy-tailed error distributions such as Student's t (Jylänki et al., 2011; Ranjan et al., 2016), Laplace (Kuss, 2006), Huber densities (Algikar & Mili, 2023), data-dependent noise (Goldberg et al., 1997), or mixture distributions (Naish-Guzman & Holden, 2007; Stegle et al., 2008; Daemi et al., 2019; Lu et al., 2023). Heavy tails allow these distributions to better accommodate outliers, rendering them more robust to corruptions. Their main limitation lies in their computational cost, as abandoning Gaussian noise nullifies one key advantage of GPs: conjugacy. As a consequence, these techniques rely on approximations via variational methods or Markov chain Monte Carlo, which decreases their accuracy while increasing computational costs. The second set of approaches consists in removing outlying observations before using a standard GP with Gaussian noise (Li et al., 2021; Park et al., 2022; Andrade & Takeda, 2023). While such approaches retain conjugacy, it can be challenging to detect outliers in irregularly spaced data or higher dimensions. Outlier detection also tends to be computationally costly, and often requires estimating large numbers of parameters.

In this paper, we propose a new and third way to achieve robustness that uses generalised Bayesian inference (see e.g. Bissiri et al., 2016; Jewson et al., 2018; Knoblauch et al., 2022). In doing so, we significantly improve upon an earlier attempt in this direction due to Knoblauch (2019), which was applicable only to variational deep GPs, lacked closed form solutions, and was based on hyperparameters that were difficult to choose. In line with the ideas of generalised Bayesian methods, we do not modify the Gaussian noise model. Instead, we change how information is assimilated, and leverage robust loss functions instead of robust error models.

Contributions. This paper proposes a novel robust and conjugate Gaussian process (RCGP) inspired by a generalised Bayesian inference scheme proposed in Altamirano et al. (2023). The posteriors rely on a generalised form of score matching (Hyvärinen, 2006; Barp et al., 2019), which effectively down-weights outlying observations. The resulting inference resolves the trade-off between robustness and computation inherent in existing methods: it is robust in the sense of Huber (1981) [Proposition 3.2] while retaining closed form solutions for both its posterior and posterior predictive [Proposition 3.1].
Additionally, and unlike other robust GPs, RCGPs can easily be plugged into various GP techniques such as sparse variational GPs (Titsias, 2009; Hensman et al., 2013) [Proposition 4.1], deep GPs (Damianou & Lawrence, 2013), multi-output GPs (Bonilla et al., 2007), and Bayesian optimisation (Shahriari et al., 2015) [Proposition 4.2]. Finally, even in settings where robustness is not required, our experiments show that RCGPs perform as well as standard GPs, raising the possibility that RCGPs may become a preferred default choice over GPs in the future.

The remainder of the paper reviews GPs and generalised Bayesian inference (Section 2), introduces RCGPs and proves their robustness (Section 3), and investigates their empirical performance and versatility in a range of experiments (Section 4).

2. Background

Our method applies the logic of generalised Bayesian posteriors to GPs. Here, we briefly explain the concepts relevant to understanding this interface.

Gaussian Processes. Let y = (y_1, ..., y_n)^T denote n observations with covariates x = (x_1, ..., x_n)^T, where y_i ∈ Y ⊆ R and x_i ∈ X ⊆ R^d. While we take Y ⊆ R for simplicity, our method can be straightforwardly generalised to multi-output regression (i.e., Y ⊆ R^T). We consider a regression setting where the noisy observations y come from a latent function f : X → R:

y_i = f(x_i) + ε_i.

Here, ε = (ε_1, ..., ε_n)^T ∈ R^n are independent observation errors. We place a GP prior on f, so that f ~ GP(m, k) with m : X → R and k : X × X → R being mean and kernel functions. These functions determine key properties of the draws from the GP such as differentiability, periodicity, long-range correlation or stationarity, and are parameterised by θ ∈ Θ ⊆ R^p. Throughout, we write f = (f(x_1), ..., f(x_n))^T, and use the fact that the GP prior implies the Gaussian prior p(f|x) = N(f; m, K), where K is the matrix with K_ij = k(x_i, x_j) and m = (m(x_1), ..., m(x_n))^T. Finally, while there are various options for modelling observation error, almost all of them break conjugacy. The main exception is the choice ε ~ N(0, σ²I_n), where I_n is the n × n identity matrix (or equivalently p(y|f, x) = N(y; f, σ²I_n)). This leads to the posterior

p(f|y, x) = N(f; µ, Σ),
µ = m + K (K + σ²I_n)^{-1} (y − m),
Σ = K (K + σ²I_n)^{-1} σ²I_n.

A key quantity of interest is then the posterior predictive over f* = f(x*), the value of f at a new point x* ∈ X:

p(f*|x*, x, y) = ∫_{R^n} p(f*|x*, f, x, y) p(f|y, x) df = N(f*; µ*, Σ*),
µ* = m* + k*^T (K + σ²I_n)^{-1} (y − m),
Σ* = k** − k*^T (K + σ²I_n)^{-1} k*,

for k* = (k(x*, x_1), ..., k(x*, x_n))^T, k** = k(x*, x*), and m* = m(x*). Though the posterior also depends on θ and σ², we omit this for brevity. Commonly, they are chosen to maximise the marginal likelihood

p(y|x, θ, σ²) = ∫_{R^n} p(f|x, θ, σ²) p(y|f, x, θ, σ²) df = N(y; m, K + σ²I_n).

Alternatively, there are computationally efficient approaches for leave-one-out cross-validation, and computationally demanding Markov chain Monte Carlo methods for hierarchical Bayes (see Rasmussen & Williams, 2006, Section 5). If f and ε are both modelled as Gaussians, GP regression is conjugate, so that the posterior, posterior predictive, and marginal likelihood can all be obtained in closed form. These operations have O(n³) computational and O(n²) storage cost, but are exact. For this reason, practitioners often model data using Gaussian errors even when this assumption is wholly inappropriate and yields severe misspecification. Note that conjugacy also holds for GP interpolation, where ε is a Dirac measure at (0, ..., 0)^T, in which case all formulae above remain correct for σ = 0. However, this assumes that the observations y are noise-free, which is even more susceptible to model misspecification than Gaussianity.
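To make the closed-form conditioning concrete, below is a minimal numpy sketch of the posterior predictive equations above. It is an illustration rather than the authors' implementation, and the callables `kernel` (returning pairwise kernel evaluations) and `mean_fn` (returning prior means) are assumptions of this sketch.

```python
import numpy as np

def gp_posterior_predictive(X, y, X_star, kernel, mean_fn, sigma2):
    """Standard GP conditioning: returns (mu_star, cov_star) at the test inputs."""
    K = kernel(X, X)                      # K_ij = k(x_i, x_j)
    k_star = kernel(X, X_star)            # cross-covariances k(x_i, x*_j)
    k_ss = kernel(X_star, X_star)         # prior covariance at test inputs
    m, m_star = mean_fn(X), mean_fn(X_star)

    A = K + sigma2 * np.eye(len(y))       # K + sigma^2 I_n
    # Solve linear systems rather than forming the inverse explicitly.
    mu_star = m_star + k_star.T @ np.linalg.solve(A, y - m)
    cov_star = k_ss - k_star.T @ np.linalg.solve(A, k_star)
    return mu_star, cov_star
```

Setting `sigma2 = 0` recovers the interpolation case discussed above, subject to the same conditioning caveats.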
Generalised Bayesian Inference. If a statistical model is misspecified, so that the model cannot correctly describe the true data-generating mechanism, standard Bayesian updating is not the optimal way of integrating prior information with data (Zellner, 1988). Indeed, standard Bayes then results in miscalibrated uncertainties and misleading inferences. In the parametric setting, a line of research has tackled this through generalised Bayesian methodology (see e.g. Grünwald, 2012; Hooker & Vidyashankar, 2014; Bissiri et al., 2016; Ghosh & Basu, 2016; Jewson et al., 2018; Miller & Dunson, 2018; Knoblauch et al., 2022; Miller, 2021; Fong et al., 2021; Jewson & Rossell, 2022; Matsubara et al., 2022b). Recently, generalised Bayesian posteriors have also been proposed for the non-parametric case (see e.g. Knoblauch, 2019; Wild et al., 2022). For regression, they take the form

p_β^L(f|y, x) ∝ p(f|x) exp(−β n L_n(f, y, x)),    (1)

where ∝ denotes equality up to a multiplicative constant not depending on f. The learning rate β > 0 is a scaling parameter that determines how quickly the posterior learns from data, and L_n is a loss connecting the data and the posited statistical model. Most choices for L_n are estimators of statistical divergences between the true data-generating process, p_0, and the model, such as the kernel Stein discrepancy (Matsubara et al., 2022b), β-divergences (Knoblauch et al., 2018), maximum mean discrepancies (Chérief-Abdellatif & Alquier, 2020), and Fisher divergences (Altamirano et al., 2023; Matsubara et al., 2022a). The distributions in (1) are called generalised posteriors since they recover the standard Bayes posterior for β = 1 and L_n(f, y, x) = −(1/n) log p(y|f, x). By instead choosing L_n to be a robust loss, generalised Bayesian inference has enhanced applications including filtering (Boustati et al., 2020; Duran-Martin et al., 2024), changepoint detection (Knoblauch et al., 2018; Altamirano et al., 2023), deep Gaussian processes (Knoblauch, 2019), doubly-intractable problems (Matsubara et al., 2022a;b) and Bayesian neural networks (Futami et al., 2018). In addition, generalised posteriors have been leveraged for computational efficiency. For instance, Matsubara et al. (2022b;a) used generalised posteriors for accelerated computation with unnormalised models in both continuous and discrete domains. Similarly, Schmon et al. (2020), Dellaporta et al. (2022), Pacchiardi & Dutta (2021), Legramanti et al. (2022), and Frazier et al. (2024) deployed them for simulation-based inference.
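Before moving on, it is worth spelling out the recovery claim above. With β = 1 and L_n(f, y, x) = −(1/n) log p(y|f, x), the generalised posterior (1) reduces to

p_β^L(f|y, x) ∝ p(f|x) exp(−n · (−1/n) log p(y|f, x)) = p(f|x) p(y|f, x) ∝ p(f|y, x),

which is exactly the standard Bayes posterior; robust choices of L_n depart from this special case only through the loss.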
3. Methodology

We now present RCGPs in three steps. First, we introduce our loss and explain how it ensures conjugacy. Second, we provide formal robustness guarantees. Finally, we show how to select hyperparameters.

The Loss Function. Altamirano et al. (2023) present a posterior based on a generalised score matching loss (Hyvärinen, 2006; Lyu, 2009; Yu et al., 2022) due to Barp et al. (2019). The resulting posterior is provably robust, and conjugate for exponential family models. Importantly, however, it is not applicable to regression settings and covariate-dependent models. To rectify this, we follow Xu et al. (2022) and leverage the tower property of expectations. Using this, and denoting by p_{0,x} the marginal distribution of the covariates, score matching losses in our setting lead to

E_{X∼p_{0,x}} E_{Y∼p_0(·|X)} [ ‖(s_model − s_truth)(X, Y)‖²_2 ],

where s_truth(x, y) = ∇_y log p_0(y|x) is the score function of the true data-generating conditional density and s_model is the score function of our model. In the case of GP regression, this is s_model(x, y) = σ^{-2}(f(x) − y). As in Altamirano et al. (2023), we instead use a weighted generalisation due to Barp et al. (2019):

E_{X∼p_{0,x}} E_{Y∼p_0(·|X)} [ ‖(w · (s_model − s_truth))(X, Y)‖²_2 ].    (2)

Here, w : X × Y → R \ {0} is a weighting function depending on both x and y, and we discuss how it should be chosen at the end of this section. Evaluating (2) would require the unknown score s_truth. Luckily, under mild smoothness and boundary conditions (Liu et al., 2022), we can use integration by parts to rewrite it, up to a constant not depending on f, as

E_{X∼p_{0,x}} E_{Y∼p_0(·|X)} [ ((w s_model)² + 2 ∇_y(w² s_model))(X, Y) ].

Importantly, this expression no longer depends on s_truth, and only features p_0 through an expectation. This leads to a natural estimator and the proposed loss function, given by

L_n^w(f, y, x) = (1/n) Σ_{i=1}^n [ (w s_model)² + 2 ∇_y(w² s_model) ](x_i, y_i).

While we have motivated the loss using the marginal p_{0,x} for simplicity, p_{0,x} can be replaced with any other measure over x. The active learning setting is a relevant example of this, and we use L_n^w in a Bayesian optimisation experiment in Section 4 to showcase it. Finally, when the model is Gaussian, i.e. p(y|f, x) = N(y; f, σ²I_n), this loss function becomes quadratic in f:

L_n^w(f, y, x) = f^T A_n f + b_n^T f + c_n

for some A_n, b_n, c_n given in Appendix A.1. This follows the same idea as in Matsubara et al. (2022b) and Altamirano et al. (2023), and will be crucial for obtaining closed form solutions for both the posterior and the predictive.

Robust and Conjugate Gaussian Processes. Based on L_n^w, we propose the RCGP posterior

p_w(f|y, x) ∝ p(f|x) exp(−n L_n^w(f, y, x)),

where we absorb β into w. Considering the same setting as for the standard GP in Section 2, the RCGP posterior and its posterior predictive have closed forms. To state them, take diag(v) to be the d × d diagonal matrix D with D_ij = 0 if i ≠ j and D_ii = v_i.

Proposition 3.1. Suppose f ∼ GP(m, k) and ε ∼ N(0, I_n σ²). Then, the RCGP posterior is

p_w(f|y, x) = N(f; µ_R, Σ_R),
µ_R = m + K (K + σ²J_w)^{-1} (y − m_w),
Σ_R = K (K + σ²J_w)^{-1} σ²J_w,

for w = (w(x_1, y_1), ..., w(x_n, y_n))^T, m_w = m + σ²∇_y log(w²) and J_w = diag((σ²/2) w^{-2}). The RCGP's posterior predictive over f* = f(x*) at x* ∈ X is

p_w(f*|x*, x, y) = ∫_{R^n} p(f*|x*, f, x, y) p_w(f|y, x) df = N(f*; µ*_R, Σ*_R),
µ*_R = m* + k*^T (K + σ²J_w)^{-1} (y − m_w),
Σ*_R = k** − k*^T (K + σ²J_w)^{-1} k*.

Table 1. Existing methods as special cases of RCGPs.

Method | w(x, y)
Standard GP | σ/√2
Heteroskedastic GP | σ² (2 r(x))^{-1/2}
Robust GP | β (1 + (y − m(x))²/c²)^{-1/2}

Throughout, exponents are applied entry-wise, and the proofs can be found in Appendix A.1. The distributions derived in the result have the same structure as their standard GP counterparts, but replace σ²I_n with the noise term σ²J_w = σ² diag((σ²/2) w^{-2}), and m with the shrinkage term m_w = m + σ²∇_y log(w²). We interpret both terms after discussing how w should be chosen. Similar to standard GP regression, RCGP regression is conjugate, has a computational cost of O(n³), and a storage cost of O(n²). In addition, a variety of GP schemes fall into the proposed framework through a specific choice of w. For example, w(x, y) = σ/√2 recovers the standard GP, and w(x, y) = σ²(2 r(x))^{-1/2} a heteroskedastic GP with noise rate r(x); see Table 1.
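The closed forms of Proposition 3.1 translate directly into code. The following is a minimal numpy sketch (not the authors' implementation); the callables `kernel` and `mean_fn` are as in the earlier GP sketch, and `w` and `grad_y_log_w2` (returning w(x_i, y_i) and the y-derivative of log w(x_i, y_i)² at the data) are assumed helper functions.

```python
import numpy as np

def rcgp_posterior_predictive(X, y, X_star, kernel, mean_fn, sigma2,
                              w, grad_y_log_w2):
    """RCGP predictive from Proposition 3.1: replaces sigma^2 I_n by sigma^2 J_w
    and the residual y - m by y - m_w."""
    K = kernel(X, X)
    k_star = kernel(X, X_star)
    k_ss = kernel(X_star, X_star)
    m, m_star = mean_fn(X), mean_fn(X_star)

    weights = w(X, y)                               # w = (w(x_1, y_1), ..., w(x_n, y_n))
    m_w = m + sigma2 * grad_y_log_w2(X, y)          # shrinkage term m_w
    J_w = np.diag(sigma2 / (2.0 * weights**2))      # J_w = diag((sigma^2 / 2) w^{-2})

    A = K + sigma2 * J_w
    mu_star = m_star + k_star.T @ np.linalg.solve(A, y - m_w)
    cov_star = k_ss - k_star.T @ np.linalg.solve(A, k_star)
    return mu_star, cov_star
```

As a sanity check, the constant weight w(x, y) = σ/√2 gives J_w = I_n and m_w = m, so the sketch collapses to the standard GP predictive above, in line with Table 1.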
Importantly, it would be flawed to simply interpret RCGPs as GPs with a different noise model: w depends directly on y. Finally, we note that RCGPs are in principle not suited to interpolation: the score s_model, and thus the loss L_n^w, is not defined if σ² = 0. However, it is common to add a nugget to the kernel for regularisation (Andrianakis & Challenor, 2012), which makes the problem equivalent to regression with a very small σ² > 0. Consequently, RCGP regression is still applicable to interpolation problems whenever this regularisation is used, and can in fact strongly improve robustness in this setting.

Hyperparameter Selection. To make RCGPs practically viable, we need to estimate θ and σ². A first idea would be to maximise the pseudo marginal likelihood p_w(y|θ, σ²), the RCGP's equivalent of the marginal likelihood, whose closed form is given in Appendix A.2. Unfortunately, this would be ill-posed: neither exp(−n L_n^w(f, y, x)) nor p_w(y|θ, σ²) is a probability density over y. Hence, maximising p_w(y|θ, σ²) is akin to maximum likelihood estimation for an un-normalised density whose normaliser depends on the estimated parameter, leading to implausible hyperparameters and numerical issues, a well-known problem for generalised Bayesian methods (Jewson & Rossell, 2022). Instead of p_w(y|θ, σ²), we thus maximise the leave-one-out cross-validation (LOO-CV) predictive posteriors via

(σ̂², θ̂) = argmax_{σ², θ} { Σ_{i=1}^n log p_w(y_i | x, y_{−i}, θ, σ²) },

where y_{−i} = (y_1, ..., y_{i−1}, y_{i+1}, ..., y_n)^T (for details, see Section 5.2 of Rasmussen & Williams, 2006). LOO-CV has been one of the primary methods for setting hyperparameters in Gaussian processes (see e.g. Sundararajan & Keerthi, 1999; Bachoc, 2013b; Vehtari et al., 2016; Petit et al., 2020). A naive implementation of this objective function would require fitting the model n times, leading to a computational cost of O(n⁴). Hence, we follow Sundararajan & Keerthi (1999) to obtain an analytical formulation that allows us to compute the LOO objective in O(n³). By Proposition 3.1, p_w(y_i | x, y_{−i}, θ, σ²) = N(y_i; µ_i^R, σ_i^R + σ²) with

µ_i^R = z_i + m_i − [(K + σ²J_w)^{-1} z]_i [(K + σ²J_w)^{-1}]_{ii}^{-1},
σ_i^R = [(K + σ²J_w)^{-1}]_{ii}^{-1} − (σ⁴/2) w(x_i, y_i)^{-2},

for z = y − m_w and z = (z_1, ..., z_n)^T. The full derivation can be found in Appendix A.3. Crucially, we only need to compute (K + σ²J_w)^{-1} once, which leads to a computational cost of O(n³), the same as the standard marginal likelihood method.
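A small numpy sketch of this closed-form LOO objective is given below; it assumes the reconstruction of µ_i^R and σ_i^R stated above, reuses a single matrix inverse, and is meant as an illustration rather than the authors' implementation.

```python
import numpy as np

def rcgp_loo_objective(K, J_w, m, m_w, y, weights, sigma2):
    """Sum of leave-one-out predictive log densities, to be maximised over (theta, sigma^2)."""
    A_inv = np.linalg.inv(K + sigma2 * J_w)       # computed once, O(n^3)
    z = y - m_w
    diag_inv = 1.0 / np.diag(A_inv)               # [(K + sigma^2 J_w)^{-1}]_{ii}^{-1}

    mu_loo = z + m - (A_inv @ z) * diag_inv                        # mu_i^R
    var_loo = diag_inv - sigma2**2 / (2.0 * weights**2) + sigma2   # sigma_i^R + sigma^2

    return -0.5 * np.sum(np.log(2 * np.pi * var_loo) + (y - mu_loo) ** 2 / var_loo)
```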
Figure 2. PIF_GP (green) and PIF_RCGP (blue) for the dataset in Figure 1. PIF_GP → ∞ as |y_m − y_m^c| → ∞, so that standard GPs are not robust. In contrast, PIF_RCGP is bounded, showing the robustness of the RCGP.

Robustness. RCGPs are not only computationally attractive, but also robust to outliers and non-Gaussian errors. While robustness for Bayesian methods can refer to a number of other aspects, including calibration (Grünwald, 2012; Huggins & Miller, 2023; Lyddon et al., 2019), adversarial robustness (Bogunovic et al., 2020; Kirschner & Krause, 2021), and robustness to misspecified priors (van der Vaart & van Zanten, 2011; Bachoc, 2013a; Teckentrup, 2020; Wang et al., 2020; Karvonen, 2021; Bogunovic & Krause, 2021; Stephenson et al., 2022; Naslidnyk et al., 2023), we prove RCGP's robustness to misspecification in the error model. We do so using the classical framework of Huber (1981). To this end, we first define the contamination of a dataset D = {(x_i, y_i)}_{i=1}^n by the datum y_m^c as D_m^c = (D \ {(x_m, y_m)}) ∪ {(x_m, y_m^c)} for some m ∈ {1, ..., n}.

We quantify the impact of y_m^c on inference through the divergence between the contaminated and uncontaminated posteriors. As a function of |y_m^c − y_m|, this divergence is also sometimes called the posterior influence function (PIF), and was studied for parametric models in Ghosh & Basu (2016) and Matsubara et al. (2022b). To operationalise this, we consider the Kullback-Leibler (KL) divergence:

PIF(y_m^c, D) = KL( p_β^L(f|D) ‖ p_β^L(f|D_m^c) ).

We call a posterior robust if sup_{y_m^c ∈ Y} |PIF(y_m^c, D)| < ∞, as this implies that even as |y_m^c − y_m| → ∞, the contamination's effect on the posterior (as measured by the KL) is bounded. Choosing the KL is convenient as it allows closed form expressions for Gaussians, but we could in principle pick any other divergence with closed form expressions that is not uniformly bounded.

Figure 3. Comparing kernel-based w with the same hyperparameters: IMQ (blue) and squared exponential (SE) (orange). The dashed vertical lines indicate the soft threshold c past which a point is increasingly treated as an outlier. The SE down-weights observations more rapidly than the IMQ once they exceed c. The maximum possible weight for any observation is β = 1.

Proposition 3.2. Suppose f ∼ GP(m, k), ε ∼ N(0, I_n σ²), and let C_k ∈ R, k = 1, 2, 3, be constants independent of y_m^c. Then, GP regression has the PIF

PIF_GP(y_m^c, D) = C_1 (y_m − y_m^c)²,

and is not robust: PIF_GP(y_m^c, D) → ∞ as |y_m^c| → ∞. In contrast, for RCGPs with sup_{x,y} w(x, y) < ∞,

PIF_RCGP(y_m^c, D) ≤ C_2 (w(x_m, y_m^c)² y_m^c)² + C_3.

Thus, if sup_{x,y} |y| w(x, y)² < ∞, RCGP is robust since sup_{y_m^c} |PIF_RCGP(y_m^c, D)| < ∞.

The proof is in Appendix A.4. The two conditions on w in Proposition 3.2 have clear interpretations: sup_{x,y} w(x, y) < ∞ ensures that no observation has infinite weight, which would be antithetical to robustness. Further, sup_{x,y} |y| w(x, y)² < ∞ shows that, for robustness, w must down-weight observations at least at rate 1/y for |y| large enough. Figure 2 shows the PIF for an RCGP with a weighting function that satisfies these conditions. Notably, the conditions on w in Proposition 3.2 can only hold if w depends on y. A consequence is that the w associated with heteroskedastic GPs in Table 1 does not lead to a bounded PIF, and is therefore not robust in the sense defined above.

Choice of Weighting Function. Many choices of w satisfy the conditions of Proposition 3.2. An ideal choice of w attains its maximum close to reasonable data points, and decreases as outliers become more extreme. In the remainder, our measure of how outlying an observation y_i is will be |y_i − m(x_i)|. Additionally, w should be smooth to ensure its derivatives exist. Conveniently, this makes weighting functions constructed from infinitely differentiable radial kernels k via w(x_i, y_i) = k(y_i, m(x_i)) well-suited. We advocate for the Inverse Multi-Quadratic (IMQ) kernel: it has heavy tails, so that extreme observations are not weighted down too much; see Figure 3. It is given by

w_IMQ(x, y) = β (1 + (y − m(x))²/c²)^{-1/2}

with β, c > 0.
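For illustration, the IMQ weight and the gradient term entering m_w can be written as follows (a sketch with `mean_fn`, `c` and `beta` as assumed arguments); both plug directly into the RCGP sketch given after Proposition 3.1.

```python
import numpy as np

def w_imq(X, y, mean_fn, c, beta):
    """IMQ weight w(x, y) = beta * (1 + (y - m(x))^2 / c^2)^{-1/2}."""
    r = y - mean_fn(X)
    return beta / np.sqrt(1.0 + r**2 / c**2)

def grad_y_log_w2_imq(X, y, mean_fn, c):
    """d/dy log w_IMQ(x, y)^2 = -2 (y - m(x)) / (c^2 + (y - m(x))^2)."""
    r = y - mean_fn(X)
    return -2.0 * r / (c**2 + r**2)
```

Since w_IMQ is bounded by β and y · w_IMQ(x, y)² decays at rate 1/y for large |y|, both conditions of Proposition 3.2 are satisfied.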
Weighting functions chosen this way have two hyperparameters: the soft threshold c, and the learning rate β from (1) that we pulled into w. While it is in principle possible to pick both β and c jointly, we fix β = σ/√2 and choose only c. We do so since joint selection of β and c is numerically unstable due to near non-identifiability. While we could instead fix c and select β, this is difficult since learning rate selection is largely unresolved (see e.g. Lyddon et al., 2019; Syring & Martin, 2019; Bochkina, 2023; Wu & Martin, 2023; Frazier et al., 2023). Fixing β = σ/√2 and choosing c is much easier, as c is interpretable. In particular, as c → ∞, w → σ/√2, so that the limiting setting recovers the standard GP (cf. Table 1). On the other hand, finite values c < ∞ can be interpreted as a soft outlier threshold: points y_i for which |y_i − f(x_i)| = c + ξ are increasingly treated as outliers the larger ξ becomes. This is illustrated in Figure 3, which depicts two kernels with c = 1 = β, and shows that the weights decrease rapidly to 0 once the threshold is exceeded.

While it is in principle possible to choose c by maximising the leave-one-out cross-validation predictive posterior, the performance is not optimal in practice. This is likely because maximising the predictive posterior for extreme observations tends to fit these outliers by increasing c, leading to a less robust method. See Appendix B.3 for further discussion. Therefore, we propose choosing c via the quantile absolute deviation around the prior mean as c = Q_n(1 − ε), where Q_n(1 − ε) is the (1 − ε)-th quantile of {|y_i − m(x_i)|}_{i=1}^n for ε ∈ [0, 1]. As a default setting, we suggest ε = 0.05, which implies that we expect at most 5% of the data to be outliers. Figure 4 shows an example of the impact of the choice of ε on the posterior, where we observe that choosing ε = 0.99 leads to inferences that are too robust and treat nearly all points as outliers. Conversely, setting ε = 0 nearly reproduces the non-robust standard GP. Intermediate values exhibit a well-balanced and effective performance.

Figure 4. Posterior predictive mean for varying values of c, obtained by adjusting ε using the quantile absolute deviation method proposed in Section 3, applied to a synthetic dataset with 10% uniformly generated outliers. Left: full RCGP. Center: RCGP with no shrinkage term. Right: RCGP with no noise term.

Crucially, the proposed weighting function depends on the prior mean m(x_i). Hence, if the prior mean is not well specified, the method may discard observations that are not outliers, or vice versa (see Figure 7). Therefore, we emphasise the importance of carefully selecting the prior mean, as this choice can impact the performance of our method. Note that this consideration is crucial not only for our approach, but also for the standard GP (see e.g. De Ath et al., 2020; Hwang et al., 2023).
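In code, the quantile absolute deviation rule for c is one line; the sketch below assumes the same `mean_fn` helper as before and the default ε = 0.05 suggested in the text.

```python
import numpy as np

def soft_threshold_c(X, y, mean_fn, eps=0.05):
    """c = Q_n(1 - eps), the (1 - eps)-quantile of the absolute deviations |y_i - m(x_i)|."""
    abs_dev = np.abs(y - mean_fn(X))
    return np.quantile(abs_dev, 1.0 - eps)
```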
Interpretation of RCGP Terms. Having discussed our choice of w, we are now ready to interpret the new terms arising in Proposition 3.1. For the RCGP, the noise term σ²J_w = σ² diag(1 + (y − m)²/c²) replaces the standard GP term σ²I_n. Here, the exponents are applied element-wise and 1 = (1, ..., 1)^T ∈ R^n. Functionally, σ²J_w treats outliers as though they were noisier than other observations. Note, however, that the weight for y_i depends on y_i itself, and thus cannot be interpreted as coming from a different or heteroskedastic noise model on ε. Conceptually, the term treats information as unreliable whenever it comes from observations y_i that deviate extremely from the prior mean m(x_i). This not only ensures robustness, but also has computational benefits: in particular, the term tends to improve conditioning, making it more numerically stable to invert K + σ²J_w than K + σ²I_n. This is especially relevant when σ is small (e.g. for interpolation) and n is large, in which case numerical conditioning is typically a significant issue for GPs (Andrianakis & Challenor, 2012).

Next, y − m_w = y − m − σ²∇_y log(w²) replaces the standard GP's y − m for RCGPs, where for the IMQ weight σ²∇_y log(w²) = −2σ²(y − m)/(c²·1 + (y − m)²). Unlike the noise term, the i-th shrinkage term [m_w]_i is not monotonic in |y_i − m(x_i)|: its deviation from m(x_i) increases as |y_i − m(x_i)| → c, peaks at σ²/c for |y_i − m(x_i)| = c, and then decreases monotonically as |y_i − m(x_i)| increases further. The effect is most obvious for m = 0, where the correction to y has magnitude 2σ²|y|/(c² + y²). Generally, shrinkage is associated with trading off a slight bias for a reduction in variance (e.g. Von Luxburg & Schölkopf, 2011). This also applies here: relative to the GP, the RCGP's posterior mean will exhibit smaller variance at the expense of a slight bias towards m. Figure 4 shows the impact of each term on the predictive posterior. It is clear that the main term providing robustness to our method is the noise term.

Figure 5. Considered contamination regimes are asymmetric (left) and focused (right).

4. Experiments

The code is available at https://github.com/maltamiranomontero/RCGP. The extensive literature on robust GPs makes it impossible to compare RCGPs to all competitors. Thus, we choose two representative and popular approaches: a GP with heavy-tailed Student's t errors (Jylänki et al., 2011, "t-GP"), and a GP directly modelling outliers via mixture distributions (Lu et al., 2023, "m-GP"). We do not compare against outlier-removal methods: they complement rather than compete with RCGPs. Throughout, we picked w, c and β as proposed in Section 3. All hyperparameters are selected via L-BFGS.

Benchmarking. We assess our method on four benchmark datasets, including the synthetic problem in Figure 1 (d = 1, n = 300), the Boston dataset (d = 13, n = 506), and two datasets from the UCI repository: Energy (d = 8, n = 768) and Yacht (d = 6, n = 308). These allow us to compare the performances of robust GP methods across a range of dimensions and numbers of data points.

Table 2. Average test set mean absolute error and standard deviation (in brackets) over 50 train-test splits.

No outliers:
Dataset | GP | RCGP | t-GP | m-GP
Synthetic | 0.09 (0.00) | 0.09 (0.00) | 0.09 (0.00) | 0.33 (0.00)
Boston | 0.19 (0.01) | 0.19 (0.01) | 0.19 (0.01) | 0.28 (0.00)
Energy | 0.03 (0.00) | 0.02 (0.00) | 0.03 (0.00) | 0.61 (0.00)
Yacht | 0.02 (0.01) | 0.02 (0.01) | 0.01 (0.00) | 0.33 (0.00)

Focused outliers:
Synthetic | 0.19 (0.00) | 0.15 (0.00) | 0.18 (0.00) | 0.23 (0.00)
Boston | 0.23 (0.06) | 0.22 (0.01) | 0.27 (0.00) | 0.27 (0.00)
Energy | 0.03 (0.04) | 0.02 (0.00) | 0.03 (0.05) | 0.24 (0.00)
Yacht | 0.26 (0.15) | 0.10 (0.14) | 0.20 (0.04) | 0.24 (0.00)

Asymmetric outliers:
Synthetic | 1.14 (0.00) | 0.63 (0.00) | 1.06 (0.00) | 0.61 (0.00)
Boston | 0.63 (0.02) | 0.49 (0.00) | 0.52 (0.00) | 0.52 (0.00)
Energy | 0.54 (0.02) | 0.44 (0.04) | 0.42 (0.02) | 0.41 (0.00)
Yacht | 0.54 (0.06) | 0.35 (0.02) | 0.41 (0.00) | 0.40 (0.00)

Uniform outliers:
Synthetic | 0.34 (0.00) | 0.21 (0.00) | 0.30 (0.00) | 0.27 (0.00)
Boston | 0.53 (0.13) | 0.51 (0.12) | 0.30 (0.01) | 0.24 (0.00)
Energy | 0.25 (0.03) | 0.24 (0.03) | 0.23 (0.05) | 0.23 (0.00)
Yacht | 0.36 (0.16) | 0.29 (0.07) | 0.16 (0.00) | 0.34 (0.00)

Table 3. Average clock time in seconds and its standard deviation (in brackets) across 50 repetitions.

Dataset | GP | RCGP | t-GP | m-GP
Synthetic | 1.5 (0.1) | 1.2 (0.0) | 2.2 (0.0) | 3.0 (0.0)
Boston | 1.9 (0.5) | 5.1 (0.9) | 30.7 (6.1) | 16.7 (1.7)
Energy | 3.8 (0.9) | 4.6 (2.0) | 34.0 (11) | 33.8 (0.3)
Yacht | 1.6 (0.3) | 2.1 (0.2) | 5.6 (0.7) | 4.5 (0.4)
For each dataset, we consider four settings: the original dataset (i.e. no outliers), focused outliers (i.e. outliers clustered in x- and y-space), asymmetric outliers (i.e. observations corrupted by strictly negative shifts) and uniform outliers (both positive and negative shifts at random); see Figures 1 and 5 for illustrations. In each case, 10% of observations are perturbed to become outliers. Results are provided in Tables 2 and 3, and a full description of the datasets and outlier generation process is provided in Appendix B.3.

As expected, standard GPs outperform robust methods in mean absolute error when there are no outliers. That being said, RCGPs can easily compete in this setting, indicating that there are no clear drawbacks to using RCGPs as a default. For example, on both the Energy and Boston datasets, the performance of RCGPs in the setting with no outliers is comparable to that of GPs. Further, RCGPs tend to outperform their competitors in the presence of focused and asymmetric outliers. This happens because the competitors' posited noise models are symmetric, but the outlier generation is not. This phenomenon demonstrates the advantage of using a generalised Bayesian approach as opposed to a more refined noise model. In the uniform outlier setting, t-GPs outperform all alternatives. This is to be expected: Student's t errors are a reasonable approximation to that error-generating process. Though t-GPs perform even better here, RCGPs still significantly outperform GPs. Further, while RCGPs and GPs have comparable computational cost, t-GPs and m-GPs are significantly more computationally expensive. To demonstrate this, Table 3 compares the cost of training (including hyperparameter selection) for each method. The slowdown of t-GPs and m-GPs is due to their need for variational approximations, and is especially noticeable for datasets with larger n. Both GPs and RCGPs have runtimes of O(n³), though there are minor numerical differences due to the adaptive stopping rule of L-BFGS for hyperparameter optimisation.

Sparse Variational Gaussian Processes (SVGPs). The O(n³) cost of GPs is prohibitive for large n. A popular remedy is SVGPs (Titsias, 2009), which reduce the cost to O(nm²) using m ≪ n inducing points u = (u_1, ..., u_m)^T. RCGPs are amenable to this type of inference. The resulting robust and conjugate SVGP (RCSVGP) is derived below.

Proposition 4.1. For f ∼ GP(m, k) and ε ∼ N(0, σ²), the RCSVGP posterior is f ∼ GP(µ̃, Σ̃), where

µ̃(x) = φ_u(x)^T µ_u,
Σ̃(x, x') = k(x, x') − φ_u(x)^T (K_uu − Σ_u) φ_u(x'),
µ_u = m_u + K_uu P_u^{-1} K_u σ^{-2} J_w^{-1} (y − m_w),
Σ_u = K_uu P_u^{-1} K_uu,

for P_u = K_uu + K_u σ^{-2} J_w^{-1} K_u^T, m_u = (m(u_1), ..., m(u_m))^T, [K_uu]_ij = k(u_i, u_j), [K_u]_ij = k(u_i, x_j), [k_u(x)]_i = k(u_i, x), and φ_u(x) = K_uu^{-1} k_u(x). The inducing inputs u and the hyperparameters (θ, σ²) can be selected by maximising an objective J(u, θ, σ²) whose leading term is (1/2) ν^T K_u^T Q_u^{-1} K_u ν, together with a trace term Tr(σ^{-2} J_w^{-1/2} (K − K_u^T K_uu^{-1} K_u) J_w^{-1/2}), where Q_u = K_uu + K_u σ^{-2} J_w^{-1} K_u^T, ν = σ^{-2} J_w^{-1} (y − m_w), and C(σ²) is a function that only depends on σ²; the full expression is given in Appendix A.5. See Appendix A.5 for a proof and details.
A numerical comparison between SVGPs and RCSVGPs for the synthetic example with 10% uniform outliers is presented in Figure 6. While RCSVGPs and SVGPs have similar runtimes, RCSVGPs make far more accurate predictions. In Appendix B.4, we show that this generalises to the other datasets and outlier types. Twitter Flash Crash To illustrate the practical utility of RCGPs, we next analyse the Dow Jones Industrial Average (DJIA) index on the 17/04/2013. On this day, the Associated Press Twitter account was hacked and falsely tweeted that explosions at the White House had injured the president. This lead to a rapid sell-off in American stock markets within seconds, quickly followed by an equally fast bounceback. This resulted in a few moments after 1pm that day where the DJIA did not accurately reflect the USA s economic realities. In this sense, it is reasonable to regard the resulting group of extreme observations as outliers. Figure 7 depicts the raw data as well as a GP and RCGP fit. The figure not only illustrates the robustness of RCGPs, but also sheds further light on relevant trade-offs when choosing w. As the bottom panel shows, most points have weights 1. There are two exceptions to this: one of these is the group of points past 1pm where the crash occurred. Here, the observations are down-weighted almost all the way to zero. This is desirable, and exactly what we would expect RCGP to do. Perhaps more surprisingly, weights are also significantly smaller than 1 before 10am. This is not obviously desirable, and due to the choice of prior mean. In particular, recall that w(xi, yi) is small if |yi m(xi)| is large. For Figure 7, the constant function m(x) = 1 n Pn i=1 yi is chosen as the prior mean. Relative to this prior mean, the first observations are outliers. While this is not ideal, it is worth noting that this behaviour can be easily remedied by choosing m more carefully. For example, one can fit a simple parametric model for m, as is extremely common in the literature. See Robust and Conjugate Gaussian Process Regression Six-hump Camel 0 100 200 Iteration GP RCGP t-GP 0 100 200 Iteration 0 100 200 Iteration 0 100 200 Iteration Figure 8. Mean cumulative regret (top) and clock time (bottom) for BO with GP (green), RCGP (blue) and t-GP (orange) with 20% asymmetric outliers and UCB acquisition function over 10 realisations. Appendix B.1 for a further discussion. Bayesian Optimisation One of the most successful applications of GPs has been Bayesian optimisation (BO) (Garnett, 2021). In BO, a data-dependent acquisition function a : X R based on the posterior GP characterises desirable regions of X to choose the next often noisy function evaluations of f from. Here, misspecification of the noise model is a significant concern: it leads to poor acquisition functions; see e.g. Martinez-Cantin et al. (2018); Bogunovic et al. (2020); Kirschner & Krause (2021). Using RCGPs instead of GPs leads to a BO algorithm that is naturally robust to such misspecification. We illustrate this using the upperconfidence bound (UCB) and probability of improvement (PI) acquisition functions. Proposition 4.2. Let f GP(m, k), ε N(0, σ2). The UCB and PI acquisition functions for RCGPs are a UCB-RCGP(x ) = µR + λ(ΣR )1/2 a PI-RCGP(x ) = Φ µR f(xmax) (ΣR ) 1/2 where Φ is the cdf of a standard normal, and xmax is the best solution we have so far. Here, we have re-used the notations of Proposition 3.1, and full derivations are available in Appendix A.6. 
Figure 8. Mean cumulative regret (top) and clock time (bottom) for BO with GP (green), RCGP (blue) and t-GP (orange) on test functions including the six-hump camel, with 20% asymmetric outliers and the UCB acquisition function, over 10 realisations.

The only paper explicitly studying BO in the presence of outliers is Martinez-Cantin et al. (2018), which uses a t-GP. We thus compare RCGPs to GPs and t-GPs on the classical six-hump camel, Branin, McCormick and Rosenbrock functions (see Appendix B.5). Figure 8 shows the results when each new function evaluation has a 20% chance of being contaminated by an asymmetric outlier generated as in Figure 5, and the BO uses UCB as the acquisition function. In terms of cumulative regret, RCGPs outperform GPs. While t-GPs can match this, they take orders of magnitude longer to run. The results remain the same if the PI acquisition function is used (see Appendix B.5). It is notable that, even without outliers, RCGPs match or outperform GPs (see Appendix B.5). This is likely because the GP priors on f are misspecified, and robustness to misspecified error models may help in this setting.

5. Conclusion

A major issue with existing GP regression techniques is that they are either robust or conjugate, never both. RCGPs solve this issue through generalised Bayesian inference, and demonstrate that robustness does not require prohibitive computational cost. Intriguingly, our experiments also indicate that there is no clear disadvantage to using RCGPs in the absence of misspecification, raising the possibility that RCGPs may become a preferred default choice over GPs in the future.

Limitations and future work. One limitation of our work is its dependency on a well-specified prior mean through the weighting function, as illustrated in Figure 7. Despite this, the weighting function offers flexibility for practitioners to select one that best suits their specific problem. For example, a practitioner concerned only with outliers in one direction could choose a weighting function that penalises deviations only in that direction. Additionally, we provide guidelines on selecting a weighting function to ensure that RCGP remains provably outlier-robust (see Proposition 3.2). Another limitation of our method is the inability to obtain a posterior over the kernel hyperparameters due to the use of generalised Bayes. However, it is important to emphasise that the primary focus of our method is on speed and robustness rather than on implementing a fully Bayesian procedure. Fully Bayesian approaches that account for kernel hyperparameter uncertainty with standard GPs are significantly slower, so they are not our primary comparison. Instead, we focus on conjugacy, which is generally not achievable with a prior on hyperparameters. Lastly, the flexibility and conjugacy of RCGPs mean that there is significant scope to extend their use to settings beyond those presented here, including multi-output GPs (Bonilla et al., 2007; Altamirano & Tobar, 2022), linear-time GPs (Hartikainen & Sarkka, 2010), GPs with derivative (Morris et al., 1993; Wu et al., 2017) or integral data (Yousefi et al., 2019; Tanaka et al., 2019), deep GPs (Damianou & Lawrence, 2013), transformed GPs (Maroñas et al., 2021), and probabilistic numerics (Hennig et al., 2022).

Acknowledgements

FXB and JK were supported by the EPSRC grant [EP/Y011805/1].

Impact statement

This paper presents work whose goal is to advance the field of statistical machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Algikar, P. and Mili, L. Robust Gaussian process regression with Huber likelihood. arXiv:2301.07858, 2023.

Altamirano, M. and Tobar, F.
Nonstationary multi-output gaussian processes via harmonizable spectral mixtures. In International Conference on Artificial Intelligence and Statistics, pp. 3204 3218. PMLR, 2022. Altamirano, M., Briol, F.-X., and Knoblauch, J. Robust and scalable Bayesian online changepoint detection. In Proceedings of the 40th International Conference on Machine Learning, pp. 642 663. PMLR, 2023. Andrade, D. and Takeda, A. Robust Gaussian process regression with the trimmed marginal likelihood. In Uncertainty in Artificial Intelligence, pp. 67 76, 2023. Andrianakis, I. and Challenor, P. The effect of the nugget on Gaussian process emulators of computer models. Computational Statistics and Data Analysis, 56(12):4215 4228, 2012. Bachoc, F. Cross validation and maximum likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Computational Statistics and Data Analysis, 66:55 69, 2013a. Bachoc, F. Cross validation and maximum likelihood estimations of hyper-parameters of gaussian processes with model misspecification. Computational Statistics & Data Analysis, 66:55 69, 2013b. Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. Bo Torch: A framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems 33, 2020. URL http://arxiv.org/abs/1910. 06403. Barp, A., Briol, F.-X., Duncan, A., Girolami, M., and Mackey, L. Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems, pp. 12964 12976, 2019. Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103 1130, 2016. Bochkina, N. Bernstein-von Mises theorem and misspecified models: a review. In Foundations of Modern Statistics, pp. 355 380, 2023. Bogunovic, I. and Krause, A. Misspecified Gaussian process bandit optimization. Advances in Neural Information Processing Systems, 34:3004 3015, 2021. Bogunovic, I., Krause, A., and Scarlett, J. Corruptiontolerant Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1071 1081, 2020. Bonilla, E. V., Chai, K., and Williams, C. Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems, 20, 2007. Boustati, A., Akyildiz, O. D., Damoulas, T., and Johansen, A. Generalised Bayesian filtering via sequential Monte Carlo. In Advances in Neural Information Processing Systems, pp. 418 429, 2020. Ch erief-Abdellatif, B.-E. and Alquier, P. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Symposium on Advances in Approximate Bayesian Inference, 2020. Daemi, A., Alipouri, Y., and Huang, B. Identification of robust Gaussian process regression with noisy input using EM algorithm. Chemometrics and Intelligent Laboratory Systems, 191:1 11, 2019. Damianou, A. and Lawrence, N. D. Deep Gaussian processes. In Artificial intelligence and statistics, pp. 207 215. PMLR, 2013. De Ath, G., Fieldsend, J. E., and Everson, R. M. What do you mean? the role of the mean function in bayesian optimisation. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 1623 1631, 2020. Dellaporta, C., Knoblauch, J., Damoulas, T., and Briol, F.-X. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In International Conference on Artificial Intelligence and Statistics, pp. 943 970, 2022. 
Duran-Martin, G., Altamirano, M., Shestopaloff, A. Y., Knoblauch, J., Jones, M., Briol, F.-X., and Murphy, K. Outlier-robust Kalman filtering through generalised Bayes. ar Xiv:2403.05646, 2024. Robust and Conjugate Gaussian Process Regression Dutordoir, V., Wilk, M., Artemev, A., and Hensman, J. Bayesian image classification with deep convolutional Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1529 1539. PMLR, 2020. Fong, E., Holmes, C., and Walker, S. G. Martingale posterior distributions. ar Xiv:2103.15671, 2021. Frazier, D. T., Kohn, R., Drovandi, C., and Gunawan, D. Reliable Bayesian inference in misspecified models. ar Xiv:2302.06031, 2023. Frazier, D. T., Knoblauch, J., and Drovandi, C. The impact of loss estimation on gibbs measures. ar Xiv preprint ar Xiv:2404.15649, 2024. Futami, F., Sato, I., and Sugiyama, M. Variational inference based on robust divergences. In International Conference on Artificial Intelligence and Statistics, pp. 813 822, 2018. Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., and Wilson, A. GPy Torch: Blackbox matrix-matrix Gaussian Pprocess inference with GPU acceleration. In Advances in Neural Information Processing Systems, pp. 7587 7597, 2018. Garnett, R. Bayesian Optimization. Cambridge University Press, 2021. Ghosh, A. and Basu, A. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413 437, 2016. Goldberg, P., Williams, C., and Bishop, C. Regression with input-dependent noise: A Gaussian process treatment. Advances in Neural Information Processing Systems, 10, 1997. Gr unwald, P. The safe Bayesian. In International Conference on Algorithmic Learning Theory, pp. 169 183, 2012. Hartikainen, J. and Sarkka, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, pp. 379 384, 2010. ISBN 9781424478774. URL https://users.aalto. fi/$\sim$ssarkka/pub/gp-ts-kfrts.pdf. Hennig, P., Osborne, M. A., and Kersting, H. Probabilistic Numerics: Computation as Machine Learning. Cambridge University Press, 2022. Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. ar Xiv preprint ar Xiv:1309.6835, 2013. Hooker, G. and Vidyashankar, A. Bayesian model robustness via disparities. TEST, 23(3):556 584, 2014. Huber, P. J. Robust statistics. Wiley Series in Probability and Mathematical Statistics, 1981. Huggins, J. H. and Miller, J. W. Reproducible model selection using bagged posteriors. Bayesian Analysis, 18(1): 79 104, 2023. Hwang, S.-g., L Huillier, B., Keeley, R. E., Jee, M. J., and Shafieloo, A. How to use gp: effects of the mean function and hyperparameter selection on gaussian process regression. Journal of Cosmology and Astroparticle Physics, 2023(02):014, 2023. Hyv arinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695 708, 2006. Jewson, J. and Rossell, D. General Bayesian loss function selection and the use of improper models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5):1640 1665, 2022. Jewson, J., Smith, J. Q., and Holmes, C. Principled Bayesian minimum divergence inference. Entropy, 20(6):442, 2018. Jyl anki, P., Vanhatalo, J., and Vehtari, A. Robust Gaussian process regression with a student-t likelihood. Journal of Machine Learning Research, 12(11), 2011. Karvonen, T. 
Estimation of the scale parameter for a misspecified Gaussian process model. ar Xiv:2110.02810, 2021. Kirschner, J. and Krause, A. Bias-robust Bayesian optimization via dueling bandits. In International Conference on Machine Learning, pp. 5595 5605, 2021. Knoblauch, J. Robust deep Gaussian processes. ar Xiv preprint ar Xiv:1904.02303, 2019. Knoblauch, J., Jewson, J. E., and Damoulas, T. Doubly robust Bayesian inference for non-stationary streaming data with β-divergences. In Advances in Neural Information Processing Systems, pp. 64 75, 2018. Knoblauch, J., Jewson, J., and Damoulas, T. An optimization-centric view on Bayes rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1 109, 2022. Kuss, M. Gaussian process models for robust regression, classification, and reinforcement learning. Ph D thesis, echnische Universit at Darmstadt Darmstadt, Germany, 2006. Robust and Conjugate Gaussian Process Regression Legramanti, S., Durante, D., and Alquier, P. Concentration and robustness of discrepancy-based ABC via Rademacher complexity. ar Xiv:2206.06991, 2022. Li, Z.-Z., Li, L., and Shao, Z. Robust Gaussian process regression based on iterative trimming. Astronomy and Computing, 36:100483, 2021. Liu, S., Kanamori, T., and Williams, D. J. Estimating density models with truncation boundaries using score matching. Journal of Machine Learning Research, 23(186):1 38, 2022. Lu, Y., Ma, J., Fang, L., Tian, X., and Jiang, J. Robust and scalable Gaussian process regression and its applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21950 21959, 2023. Lyddon, S. P., Holmes, C. C., and Walker, S. G. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2):465 478, 2019. Lyu, S. Interpretation and generalization of score matching. In Conference on Uncertainty in Artificial Intelligence, pp. 359 366, 2009. Maro nas, J., Hamelijnck, O., Knoblauch, J., and Damoulas, T. Transforming gaussian processes with normalizing flows. In International Conference on Artificial Intelligence and Statistics, pp. 1081 1089. PMLR, 2021. Martinez-Cantin, R., Tee, K., and Mc Court, M. Practical Bayesian optimization in the presence of outliers. International Conference on Artificial Intelligence and Statistics, pp. 1722 1731, 2018. Matsubara, T., Knoblauch, J., Briol, F.-X., Oates, C., et al. Generalised Bayesian inference for discrete intractable likelihood. ar Xiv:2206.08420, 2022a. Matsubara, T., Knoblauch, J., Briol, F.-X., and Oates, C. J. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(3):997 1022, 2022b. Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., Le on-Villagr a, P., Ghahramani, Z., and Hensman, J. GPflow: A Gaussian process library using Tensor Flow. Journal of Machine Learning Research, 18(40):1 6, apr 2017. URL http: //jmlr.org/papers/v18/16-537.html. Miller, J. W. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(1):7598 7650, 2021. Miller, J. W. and Dunson, D. B. Robust bayesian inference via coarsening. Journal of the American Statistical Association, 2018. Moreno-Mu noz, P., Art es, A., and Alvarez, M. Heterogeneous multi-output Gaussian process prediction. Advances in Neural Nnformation Processing Systems, 31, 2018. Morris, M. D., Mitchell, T. J., and Ylvisaker, D. 
Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics, 35(3): 243 255, 1993. Naish-Guzman, A. and Holden, S. Robust regression with twinned Gaussian processes. Advances in Neural Information Processing Systems, 20, 2007. Naslidnyk, M., Kanagawa, M., Karvonen, T., and Mahsereci, M. Comparing scale parameter estimators for Gaussian process regression Cross validation and maximum likelihood. ar Xiv:2307.07466, 2023. Opper, M. and Archambeau, C. The variational gaussian approximation revisited. Neural computation, 21(3):786 792, 2009. Pacchiardi, L. and Dutta, R. Generalized Bayesian likelihood-free inference using scoring rules estimators. ar Xiv:2104.03889, 2021. Paleyes, A., Mahsereci, M., and Lawrence, N. D. Emukit: A Python toolkit for decision making under uncertainty. Proceedings of the Python in Science Conference, 2023. Park, C., Borth, D. J., Wilson, N. S., Hunter, C. N., and Friedersdorf, F. J. Robust Gaussian process regression with a bias model. Pattern Recognition, 124:108444, 2022. Petit, S., Bect, J., Da Veiga, S., Feliot, P., and Vazquez, E. Towards new cross-validation-based estimators for gaussian process regression: efficient adjoint computation of gradients. ar Xiv preprint ar Xiv:2002.11543, 2020. Ranjan, R., Huang, B., and Fatehi, A. Robust Gaussian process modeling using EM algorithm. Journal of Process Control, 42:125 136, 2016. Rasmussen, C. E. and Williams, C. K. Gaussian processes for machine learning. MIT press Cambridge, MA, 2006. Salimbeni, H., Dutordoir, V., Hensman, J., and Deisenroth, M. Deep Gaussian processes with importance-weighted variational inference. In International Conference on Machine Learning, pp. 5589 5598. PMLR, 2019. Robust and Conjugate Gaussian Process Regression Santner, T. J., Williams, B. J., and Notz, W. I. The Design and Analysis of Computer Experiments. Springer, 2nd edition, 2018. ISBN 9781493988457. Schmon, S. M., Cannon, P. W., and Knoblauch, J. Generalized posteriors in approximate Bayesian computation. In Symposium on Advances in Approximate Bayesian Inference, 2020. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2015. Stegle, O., Fallert, S. V., Mac Kay, D. J., and Brage, S. Gaussian process robust regression for noisy heart rate data. IEEE Transactions on Biomedical Engineering, 55 (9):2143 2151, 2008. Stephenson, W. T., Ghosh, S., Nguyen, T. D., Yurochkin, M., Deshpande, S. K., and Broderick, T. Measuring the robustness of Gaussian processes to kernel choice. In International Conference on Artificial Intelligence and Statistics, PMLR, volume 151, pp. 3308 3331, 2022. Sundararajan, S. and Keerthi, S. Predictive app roaches for choosing hyperparameters in gaussian processes. Advances in neural information processing systems, 12, 1999. Syring, N. and Martin, R. Calibrating general posterior credible regions. Biometrika, 106(2):479 486, 2019. Tanaka, Y., Tanaka, T., Iwata, T., Kurashima, T., Okawa, M., Akagi, Y., and Toda, H. Spatially aggregated Gaussian processes with multivariate areal outputs. In Advances in Neural Information Processing Systems, 2019. Teckentrup, A. L. Convergence of Gaussian process regression with estimated hyper-parameters and applications in Bayesian inverse problems. SIAM-ASA Journal on Uncertainty Quantification, 8(4):1310 1337, 2020. ISSN 21662525. doi: 10.1137/19M1284816. Titsias, M. 
Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pp. 567 574. PMLR, 2009. van Der Vaart, A. and van Zanten, H. Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12:2095 2119, 2011. Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., and Winther, O. Bayesian leave-one-out cross-validation approximations for gaussian latent variable models. Journal of Machine Learning Research, 17(103):1 38, 2016. Von Luxburg, U. and Sch olkopf, B. Statistical learning theory: Models, concepts, and results. In Handbook of the History of Logic, volume 10, pp. 651 706. Elsevier, 2011. Wang, W., Tuo, R., and Wu, C. F. J. On prediction properties of kriging: uniform error bounds and robustness. Journal of the American Statistical Association, 115(530):920 930, 2020. Wenger, J., Kr amer, N., Pf ortner, M., Schmidt, J., Bosch, N., Effenberger, N., Zenn, J., Gessner, A., Karvonen, T., Briol, F.-X., Mahsereci, M., and Hennig, P. Prob Num: Probabilistic numerics in Python. ar Xiv:2112.02100, 2021. URL http://arxiv.org/ abs/2112.02100. Wild, V. D., Hu, R., and Sejdinovic, D. Generalized variational inference in function spaces: Gaussian measures meet Bayesian deep learning. Advances in Neural Information Processing Systems, 35:3716 3730, 2022. Wu, J., Poloczek, M., Gordon Wilson, A., and Frazier, P. I. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, 2017. Wu, P.-S. and Martin, R. A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Analysis, 18(1):105 132, 2023. Wynne, G., Briol, F.-X., and Girolami, M. Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness. Journal of Machine Learning Research, 22(123):1 40, 2021. Xu, J., Scealy, J. L., Wood, A. T., and Zou, T. Generalized score matching for regression. ar Xiv preprint ar Xiv:2203.09864, 2022. Yousefi, F., Smith, M. T., and Alvarez, M. A. Multi-task learning for aggregated data using Gaussian processes. In Advances in Neural Information Processing Systems, 2019. Yu, S., Drton, M., and Shojaie, A. Generalized score matching for general domains. Information and Inference, 11(2):739 780, 2022. ISSN 20498772. doi: 10.1093/imaiai/iaaa041. Zellner, A. Optimal information processing and Bayes s theorem. The American Statistician, 42(4):278 280, 1988. Robust and Conjugate Gaussian Process Regression Robust and Conjugate Gaussian Process Regression: Supplementary Materials Our supplementary material is structured as follows. In Appendix A, we provide the proofs of all our theoretical results, as well as additional results which complement those in the main text. In Appendix B, we provide additional details on our numerical experiments. In this section, we recap the notation used throughout the paper. If g : X Y Rd R R, then yg = g y is the partial derivative of g with respect to y. X N(µ, Σ) denotes that X is has a multivariate Gaussian distribution with mean µ and covariance Σ. Furthermore, p(x) = N(x; µ, Σ) denotes that p is the density of this multivariate Gaussian distribution. For some d-dimensional vector v = (v1, . . . , vd) , diag(v) is the d d diagonal matrix D so that Dij = 0 if i = j and Dii = vi. In addition, exponents are applied entry-wise; e.g. v2 = (v2 1, . . . , v2 d) . Tr denotes the trace operator, which for some d d matrix A is given by: Tr(A) = Pd i=1 Aii. 
A positive-definite kernel k : X X R is a symmetric function, i.e. k(x, x ) = k(x , x) for any x, x , that is positive-definite, i.e. {x1, ..., xn} X, {c1, ..., cn} R and n N: Pn i=1 Pn j=1 cicjk(xi, xj) 0, which equivalent that for any n points the n n matrix given by k(x, x) is positive-semidefinite. According to the Moore Aronszajn theorem, every positive-definite kernel is also a reproducing kernel, and according to Loeve s theorem, any reproducing kernel is the covariance function of a second-order stochastic process and vice-versa. Let X p be a random variable distributed according to the probability density p. We define the expectation and the variance of the random variable X as EX p[X] = Z Rn xp(x)dx, VX p[X] = Z Rn(x EX p[X])(x EX p[X]) p(x)dx A. Proofs of Theoretical Results A.1. Proof of Proposition 3.1 Recall the loss function defined in Section 3: Lw n (f, y, x) = 1 (wsmodel)2 + 2 y(w2smodel) (xi, yi). Here, smodel : X Y R is the score function, and w : X Y R is the weighting function. Proposition 3.1 gives the posterior and posterior predictive distributions for RCGPs, and we now derive these one-by-one. Posterior Distribution Firstly, we show that the RCGP posterior has density pw(f|y, x) = N(f; µR, ΣR), µR = m + K K + σ2Jw 1 (y mw) , ΣR = K K + σ2Jw 1 σ2Jw, where w = (w(x1, y1), . . . , w(xn, yn)) , mw = m + σ2 y log(w2), Jw = diag( σ2 Robust and Conjugate Gaussian Process Regression Proof. Firstly, it is worth noting that for smodel(x, y) = (f(x) y)σ 2, the loss function L e w n constructed with the weight function w can be written as L w n (f, y, x) = 1 w(xi, yi)2(f(xi) yi)2 w(xi, yi)2(f(xi) yi) w(xi, yi)2(f(xi)2 2f(xi)yi + y2 i ) σ4 + 2 y w(xi, yi)2(f(xi) yi) w(xi, yi)2(f(xi)2 2f(xi)yi) w(xi, yi)2f(xi) + w(xi, yi)2y2 i σ4 2 y w(xi, yi)2yi Let us consider ew(x, y) = 2σ 2w(x, y), then L e w n (f, y, x) = 1 ew(xi, yi)2(f(xi)2 2f(xi)yi) 2σ2 + f(xi) y ew(xi, yi)2 + ew(xi, yi)2y2 i 2σ2 y( ew(xi, yi)2yi) 2n f σ 2 diag(ew2)f 2f σ 2 diag(ew2)y y ew2 + y σ 2 diag ew2)y 2 y(y ew2) 2n f σ 2 diag(ew2)f 2f σ 2 diag(ew2) y σ2 y log(ew2) + y σ 2 diag(ew2)y 2 y(y ew2) 2n f σ 2J 1 w f 2f σ 2J 1 w (y mw + m) + C(x, y, σ2) Where we use the fact that y ew2 = σ 2 diag(ew2) σ2 y log(ew2), and mw = m + σ2 y log(ew2) = m + σ2 y log(w2) Jw = diag(ew 2) = diag(σ2 C(x, y, σ2) = y σ 2 diag(ew2)y 2 yy ew2 = y σ 2 diag(2σ 2w2)y 4σ2 yy w2 One remark is that C(x, y, σ2) does not depend on f. Thus, it will not have an impact on the posterior. Now, we can calculate the density of the generalised posterior of f using the loss function defined before as follows pw(f|y) p(f) exp{ n Lw n (f, y, x)} 2(f m) K 1(f m) exp 1 2 f σ 2J 1 w f 2f σ 2J 1 w (y mw + m) 2 (f m) K 1(f m) + f σ 2J 1 w f 2f σ 2J 1 w (y mw + m) 2 f K 1f 2f K 1m + f σ 2J 1 w f 2f σ 2J 1 w (y mw + m) 2 f (K 1 + σ 2J 1 w )f 2f (K 1 + σ 2J 1 w )m + σ 2J 1 w (y mw) . By completing squares, the posterior has the form pw(f|y) exp 1 2 (f µR) Σ 1 R (f µR) , ΣR = (K 1 + σ 2J 1 w ) 1 = K K + σ2Jw) 1 σ2Jw, µR = ΣR (K 1 + σ 2J 1 w )m + σ 2J 1 w (y mw) = m + K K + σ2Jw 1 (y mw), where we use the fact that for two invertible matrices A, B, we have (A 1 + B 1) 1 = A(A + B) 1B. One remark is that ΣR is positive semidefinite, since is the inverse of a sum of positive semidefinite matrices. 
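For concreteness, the closed-form expressions of Proposition 3.1 can be evaluated with a few lines of linear algebra. The following NumPy sketch computes the RCGP posterior derived above and the posterior predictive derived in the next part of the proof; it assumes the user supplies the weights w(x_i, y_i) and the entries of ∂_y log w(x, y)^2 (the particular weight function is left unspecified here), and all function and variable names are ours rather than the paper's.

```python
import numpy as np

def rcgp_posterior_predictive(K, K_star, k_star_star, m, m_star, y, w, dlogw2_dy, sigma2):
    """Sketch of the closed-form RCGP posterior and posterior predictive (Proposition 3.1).

    K           : (n, n) kernel matrix at the training inputs
    K_star      : (n, n_test) cross-kernel between training and test inputs
    k_star_star : (n_test, n_test) kernel matrix at the test inputs
    m, m_star   : prior mean at the training and test inputs
    y           : (n,) observations
    w           : (n,) weights w(x_i, y_i)
    dlogw2_dy   : (n,) entries of d/dy log w(x, y)^2 evaluated at the data
    sigma2      : observation-noise variance
    """
    Jw = np.diag(sigma2 / (2.0 * w**2))            # J_w = diag(sigma^2 / (2 w^2))
    m_w = m + sigma2 * dlogw2_dy                   # shifted pseudo-mean m_w
    A = np.linalg.inv(K + sigma2 * Jw)             # (K + sigma^2 J_w)^{-1}; use a Cholesky solve in practice

    mu_post = m + K @ A @ (y - m_w)                # posterior mean at the training inputs
    Sigma_post = K @ A @ (sigma2 * Jw)             # posterior covariance at the training inputs

    mu_pred = m_star + K_star.T @ A @ (y - m_w)    # predictive mean at the test inputs
    Sigma_pred = k_star_star - K_star.T @ A @ K_star   # predictive covariance at the test inputs
    return mu_post, Sigma_post, mu_pred, Sigma_pred
```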
Robust and Conjugate Gaussian Process Regression Predictive distribution The RCGP posterior predictive distribution over f = f(x ) at x X is a multivariate Gaussian distribution with density pw(f |x , x, y) = Z Rn p(f |x , f, x, y)pw(f|y, x)df N(f ; µR , ΣR ), µR = m + k K + σ2Jw 1 (y mw) , ΣR = k k (K + σ2Jw) 1k . Proof. We first derive the predictive for m(x) = 0 and then extend it to an arbitrary prior mean m. In order to compute the predictive, we first need the conditional density p(f |x , f, x, y). Using the fact that f is a mean-zero GP, the joint distribution of f and f is f f N 0, K k k k where k = (k(x , x1), ..., k(x , xn)) , and k , = k(x , x ). Therefore, the density of the conditional distribution of a multivariate normal is well known and has the form p(f |x , f, x, y) = N(f ; µ, Σ) where µ = k K 1f, Σ = k k K 1k . Let us define a = K 1k , then µ = a f, and we can write the density of the predictive distribution as follows pw(f |x , x, y) = Z Rn p(f |x , f, x, y)pw(f|y, x)df. 2 (f a f) Σ 1(f a f) + (f µR) (ΣR) 1(f µR) df 2 f Σ 1f 2f aΣ 1f + f aΣ 1a f + f (ΣR) 1f 2f (ΣR) 1µR df 2 f ((ΣR) 1 + aΣ 1a )f 2f aΣ 1f + (ΣR) 1µR df. where the steps follow from basic arithmetic and taking out all terms which do not depend on f, and indicates we do not consider the normalisation constants. Integrating over f, we get pw (f |x , x, y) 2 f Σ 1f (aΣ 1f + (ΣR) 1µR) ((ΣR) 1 + aΣ 1a ) 1(aΣ 1f + (ΣR) 1µR) 2 f (Σ 1 Σ 1a ((ΣR) 1 + aΣ 1a ) 1aΣ 1)f 2f Σ 1a ((ΣR) 1 + aΣ 1a ) 1(ΣR) 1µR . Therefore, by completing squares, we obtain p(f |x , x, y) = N(f ; µR , ΣR ), where, µR = (1 a ((ΣR) 1 + aΣ 1a ) 1aΣ 1) 1ΣΣ 1a ((ΣR) 1 + aΣ 1a ) 1(ΣR) 1µR, ΣR = (1 a ((ΣR) 1 + aΣ 1a ) 1aΣ 1) 1Σ. Now, we expand the terms by arithmetic rules for matrix-vector multiplication, to obtain the final expressions: µR = (1 a ((ΣR) 1 + aΣ 1a ) 1aΣ 1) 1ΣΣ 1a ((ΣR) 1 + aΣ 1a ) 1(ΣR) 1µR = (1 a ((ΣR) 1 + aΣ 1a ) 1aΣ 1) 1(((ΣR) 1 + aΣ 1a )(a ) 1)(ΣR) 1µR = (((ΣR) 1 + aΣ 1a )(a ) 1 aΣ 1) 1(ΣR) 1µR = ((ΣR) 1(a ) 1 + aΣ 1 aΣ 1) 1(ΣR) 1µR = ((ΣR) 1(a ) 1) 1(ΣR) 1µR = k K + σ2Jw 1 (y mw). Robust and Conjugate Gaussian Process Regression The covariance follows the form: ΣR = (Σ 1 Σ 1a ((ΣR) 1 + aΣ 1a ) 1aΣ 1) 1, we use the Woodbury matrix identity on the term ((ΣR) 1 + aΣ 1a ) 1 and by arithmetic rules for matrix-vector multiplication we obtain: ΣR = (Σ 1 Σ 1a (ΣR ΣRa(Σ + a ΣRa) 1a ΣR)aΣ 1) 1 = (Σ 1 Σ 1a (ΣRa(Σ + a ΣRa) 1((Σ + a ΣRa)a 1 a ΣR))aΣ 1) 1 = (Σ 1 Σ 1a (ΣRa(Σ + a ΣRa) 1(Σa 1 + a ΣR a ΣR)aΣ 1) 1 = (Σ 1 Σ 1a (ΣRa(Σ + a ΣRa) 1Σa 1aΣ 1) 1 = (Σ 1 Σ 1a ΣRa(Σ + a ΣRa) 1) 1 = (Σ + a ΣRa)(Σ 1(Σ + a ΣRa) Σ 1a ΣRa) 1 = Σ + a ΣRa. Finally, we replace ΣR and use the Woodbury matrix identity on it, to get: ΣR = Σ + k K 1(K 1 + σ 2J 1 w ) 1K 1k = Σ + k K 1(K K(K + σ2Jw) 1K)K 1k = Σ + k K 1k k (K + σ2Jw) 1k = k k (K + σ2Jw) 1k . If the prior mean function m : X R is non-zero, we only need a minor modification of the derivation above. This involves recognising that if we have a function f GP(m, k), then f = f m f GP(0, k). Therefore, when we have observed values of f, we can adjust them by subtracting the corresponding prior mean function values, thus obtaining observations of f . We then perform inference on f , and once we have the posterior on this function, we can simply add the prior mean back to the posterior mean to obtain the posterior estimate for f. In other words, for a general prior mean m the density of the predictive posterior will be: pw(f |x , x, y) = N(f ; µR , ΣR ), µR = m + k K + σ2Jw 1 (y mw) , ΣR = k k (K + σ2Jw) 1k . A.2. 
Pseudo Marginal Likelihood for RCGPs While maximising the pseudo marginal likelihood for RCGP pw(y|θ, σ2) is ill-posed, it is available in closed form given below. Proposition A.1. The pseudo marginal likelihood for RCGPs takes the form pw(y | x, θ, σ2) |K||K 1 + σ 2J 1 w | exp 1 2(y mw) σ 2J 1 w (K 1 + σ 2J 1 w ) 1σ 2J 1 w (y mw) C(x, y, σ2) . pw(y|x, θ, σ2) = Z Rn p(f|x, θ, σ2) exp{ n Lw n (f, y, x)}df 2f σ 2J 1 w f + f σ 2J 1 w (y mw) C(x, y, σ2) df 2f (K 1 + σ 2J 1 w )f + f σ 2J 1 w (y mw) C(x, y, σ2) df. Robust and Conjugate Gaussian Process Regression Integrating over f now yields pw(y|x, θ, σ2) |K 1 + σ 2J 1 w | exp 1 2(y mw) σ 2J 1 w (K 1 + σ 2J 1 w ) 1σ 2J 1 w (y mw) C(x, y, σ2) |K||K 1 + σ 2J 1 w | exp 1 2(y mw) σ 2J 1 w (K 1 + σ 2J 1 w ) 1σ 2J 1 w (y mw) C(x, y, σ2) . A.3. Leave-one-out Cross-validation Predictive The hyperparameters obtained by the leave-one-out cross-validation predictive posteriors are ˆσ2, ˆθ = arg max σ2,θ i=1 log pw(yi|x, y i, θ, σ2) o , where y i = (y1, . . . , yi 1, yi+1, . . . , yn). By Proposition 3.1, pw(yi|x, y i, θ, σ2) = N(µR i , σR i + σ2) with µR i = zi + mi [ K + σ2Jw 1 z]i[(K + σ2Jw) 1] 1 ii , σR i = [(K + σ2Jw) 1] 1 ii σ4 2 w(xi, yi) 2, for z = y mw and z = (z1, . . . zn). Proof. Without loss of generality, we will derive the predictive for i = n, which can be extended for an arbitrary i {1, ..., n} using a permutation matrix. Let pw(yn|x, y n, θ, σ2) = N(µR n , σR n + σ2) with µR n = mn + K 1:n 1;n[K + σ2Jwc] 1 1:n 1;1:n 1z1:n 1, ΣR = knn K 1:n 1;n[K + σ2Jwc] 1 1:n 1;1:n 1K1:n 1;n. where [K +σ2Jw]1:n 1;1:n 1 denotes the submatrix formed from rows {1, ..., n 1} and columns {1, ..., n 1}, K1:n 1;n denotes the submatrix formed from nth row and columns {1, ..., n 1}, and knn denotes the element from nth row and nth columns. Now, we observe that we can write the matrix K + σ2Jwc as K + σ2Jw = [K + σ2Jw]1:n 1;1:n 1 Kn;1:n 1 K1:n 1;n Knn + σ4 2 w(xn, yn) 2 where we use A, B, C and D for clarity in the derivation. Applying block matrix inversion it is now notationally cumbersome, but it is easy to show that. (K + σ2Jw) 1 = A 1 + A 1B D CA 1B 1 CA 1 A 1B D CA 1B 1 D CA 1B 1 CA 1 D CA 1B 1 Therefore, we have that [(K + σ2Jw) 1]nn = D CA 1B 1 2 w(xn, yn) 2 K 1:n 1;n[K + σ2Jw] 1 1:n 1;1:n 1K1:n 1;n This leads to Knn K 1:n 1;n[K + σ2Jw] 1 1:n 1;1:n 1K1:n 1;n = 1 [(K + σ2Jw) 1]nn σ4 2 w(xn, yn) 2, Robust and Conjugate Gaussian Process Regression which gives us the desired variance: σR n = [(K + σ2Jw) 1] 1 nn σ4 2 w(xn, yn) 2, Now, for the mean, we use again the block matrix inversion to get (K + σ2Jw) 1z = A 1 + A 1B D CA 1B 1 CA 1 A 1B D CA 1B 1 D CA 1B 1 CA 1 D CA 1B 1 ! z1:n 1 zn Therefore, we have that [(K + σ2Jw) 1z]n = D CA 1B 1 CA 1z1:n 1 + D CA 1B 1 zn = D CA 1B 1 (zn CA 1z1:n 1). Noting that D CA 1B 1 = [(K + σ2Jwc) 1]nn, and replacing A and C by their values we obtain [(K + σ2Jw) 1z]n = [(K + σ2Jw) 1]nn(zn K 1:n 1;n[K + σ2Jw] 1 1:n 1;1:n 1z1:n 1). Finally, rearranging terms, we note that K 1:n 1;n[K + σ2Jw] 1 1:n 1;1:n 1z1:n 1 = zn [ K + σ2Jw 1 z]n[(K + σ2Jw) 1] 1 nn Leading to the desired mean function: µR n = zn + mn [ K + σ2Jw 1 z]n[(K + σ2Jw) 1] 1 nn, A.4. Proof of Proposition 3.2 First, define the contamination of the dataset D = {(xi, yi)}n i=1 by the datum yc m as Dc m = (D\{(xm, ym)}) {(xm, yc m)} for some m {1, . . . , n}. Let y = (y1, ..., yn) and yc = (y1, ..., ym 1, yc m, ym+1, ..., yn) . PIF for the standard GP GP regression has the PIF for some constant C1 R. 
PIFGP(yc m, D) = C1(ym yc m)2, and is not robust: PIFGP(yc m, f, D) as |yc m| . Proof. Let p(f|D) = N(f; µ, Σ) and p(f|Dc m) = N(f; µc, Σc) the uncontaminated and contaminated standard GP posterior respectively. Here, µ = m + K(K + σ2In) 1(y m) µc = m + K(K + σ2In) 1(yc m) Σ = K(K + σ2In) 1σ2In Σc = K(K + σ2In) 1σ2In. Therefore, the PIF has the form PIFGP(yc m, D) = 1 Tr Σ 1 c Σ n + (µc µ)T (Σc) 1 (µc µ) + ln det Σc We observe that Σc = Σ since they do not depend on y, so that Tr Σ 1 c Σ n = Tr (In) n = n n = 0, = ln det(Σc) det(Σ 1) = ln det(ΣcΣ 1) = ln (det(In)) = 0. Robust and Conjugate Gaussian Process Regression This finally leads to the PIF PIFGP(yc m, D) = 1 (µc µ)T (Σc) 1 (µc µ) . We now notice that the term µc µ can be written as µc µ = (m + K(K + σ2In) 1(y m)) (m + K(K + σ2In) 1(yc m)) = K(K + σ2In) 1(y yc). Substituting the relevant expressions for µc µ and Σc above, we find PIFGP(yc m, D) = 1 2 (y yc)T(K + σ2In) 1K(K(K + σ2In) 1σ2In) 1K(K + σ2In) 1(y yc) . 2 (y yc)T(K + σ2In) 1Kσ 2In(y yc) . Finally, since y and yc are the same except for the mth element, the above expression is equal to PIFGP(yc m, D) = 1 2 (K + σ2In) 1Kσ 2In mm (ym yc m)2 . PIF for the RCGP For RCGPs with supx,y w(x, y) < , it holds that PIFRCGP(yc m, D) C1(w(xm, yc m)2yc m)2 + C2, for some constants C1, C2 R. Thus, if supx,y y w(x, y)2 < , RCGP is robust since supycm | PIFRCGP(yc m, D)| < . Proof. Without loss of generality, we will prove the bound for m = n, which can be extended for an arbitrary m {1, . . . , n} using a permutation matrix. Let pw(f|D) = N(f; µR, ΣR) and pw(f|Dc m) = N(f; µc R, Σc R) denote the uncontaminated and contaminated standard posterior respectively. Here, µR = m + K K + σ2Jw 1 (y mw) µR c = m + K K + σ2Jwc 1 (y mwc) ΣR = K K + σ2Jw 1 σ2Jw ΣR c = K K + σ2Jwc 1 σ2Jwc. Here wc = (w(x1, y1), ..., w(xn, yc n)) . Therefore, the PIF has the form PIFRCGP(yc m, D) = 1 Tr (ΣR c ) 1ΣR n | {z } (1) + µR c µR T (ΣR c ) 1 µR c µR + ln det ΣR c det ΣR Now, we will get a bound for each term in the PIF. The first term can be bound as (1) = Tr (ΣR c ) 1ΣR n = Tr σ 2J 1 wc K + σ2Jwc K + σ2Jw 1 σ2Jw n Tr σ 2J 1 wc K + σ2Jwc Tr K + σ2Jw 1 σ2Jw n, where we use the fact that for two positive semidefinite matrices A, B, it holds that Tr(AB) Tr(A) Tr(B). Observing that K + σ2Jw 1 σ2Jw does not depend on the contamination, we can now write e C1 = Tr( K + σ2Jw 1 σ2Jw), so that by using the arithmetic rules of traces, we obtain (1) Tr σ 2J 1 wc K + σ2Jwc e C1 n = Tr σ 2J 1 wc K + In e C1 n = (Tr σ 2J 1 wc K + n) e C1 n i=1 σ 2w2(xi, yy)Kii + n Robust and Conjugate Gaussian Process Regression Finally, since supx,y w(x, y) < , the entire expression can be bounded by a constant e C2 that does not depend on the contamination yc n as σ 2 sup x,y w(x, y) i=1 Kii + n ! e C1 n = e C2. Next, we tackle the second term by noting that (2) = µR c µR T (ΣR c ) 1 µR c µR λmax((ΣR c ) 1) µR c µR 2 2 λmax((ΣR c ) 1) µR c µR 2 1, where λmax((ΣR c ) 1) is the maximum eigenvalue of (ΣR c ) 1. Then, expanding λmax((ΣR c ) 1) and using Weyl s inequality, we get: λmax((ΣR c ) 1) = λmax(K 1 + σ 2J 1 wc ) λmax(K 1) + λmax(σ 2J 1 wc )). Since J 1 wc = diag((wc)2) , and supx,y w(x, y) < , it holds that λmax(σ 2J 1 wc )) = e C3 < + , so that we have λmax((ΣR c ) 1) = λmax(K 1) + e C3 = e C4. We replace this in the expression for (2) to obtain (2) e C4 µR c µR 2 1 = e C4 m + K K + σ2Jwc 1 (y mwc) m K K + σ2Jw 1 (y mw) 2 1 = e C4 K( K + σ2Jwc 1 (y mwc) K + σ2Jw 1 (y mw)) 2 1. 
Applying Cauchy-Schwartz we obtain: (2) e C4 K F K + σ2Jwc 1 (y mwc) K + σ2Jw 1 (y mw) 2 1, where . F denotes the Frobenius norm. Now, let s write K + σ2Jwc as the block matrix K + σ2Jwc = [K + σ2Jwc]1:n 1;1:n 1 [K + σ2Jwc]n;1:n 1 [K + σ2Jwc]1:n 1;n [K + σ2Jwc]nn = [K + σ2Jwc]1:n 1;1:n 1 Kn;1:n 1 K1:n 1;n Knn + σ2w(xn, yc n) 2 where [K + σ2Jwc]1:n 1;1:n 1 denotes the submatrix formed from rows {1, ..., n 1} and columns {1, ..., n 1}, and K1:n 1;n denotes the submatrix formed from nth row and columns {1, ..., n 1}. The second equality holds since Jwc is diagonal. Because we assumed that the contamination is in the nth term, [K + σ2Jwc]1:n 1;1:n 1 = [K + σ2Jw]1:n 1;1:n 1, so that K + σ2Jwc = [K + σ2Jw]1:n 1;1:n 1 Kn;1:n 1 K1:n 1;n Knn + σ2w(xn, yc n) 2 = A B C q(yc n) where we use A, B, and C for clarity in the derivation and to emphasise that these submatrices do not depend on the contamination and therefore can be treated as constants. Applying block matrix inversion, it is now notationally cumbersome but easy to show that (K + σ2Jwc) 1 = A 1 + A 1Bq(yc n) 1CA 1 A 1Bq(yc n) 1 q(yc n) 1CA 1 q(yc n) 1 We can now use this to rewrite the matrix-vector product K + σ2Jwc 1 (y mwc) = A 1 + A 1Bq(yc n) 1CA 1 A 1Bq(yc n) 1 q(yc n) 1CA 1 q(yc n) 1 y1:n 1 yc n = (A 1 + A 1Bq(yc n) 1CA 1)y1:n 1 A 1Bq(yc n) 1yc n q(yc n) 1CA 1y1:n 1 + q(yc n) 1yc n Robust and Conjugate Gaussian Process Regression where y1:n 1 = (y1, ..., yn 1) . Replicating the same steps for the uncontaminated terms, we similarly obtain: K + σ2Jw 1 (y mw) = (A 1 + A 1Bq(yn) 1CA 1)y1:n 1 A 1Bq(yn) 1yn q(yn) 1CA 1y1:n 1 + q(yn) 1yn Now, we can write K + σ2Jwc 1 (y mwc) K + σ2Jw 1 (y mw) A 1Bq(yc n) 1CA 1y1:n 1 A 1Bq(yc n) 1yc n A 1Bq(yn) 1CA 1y1:n 1 + A 1Bq(yn) 1yn Using the triangle inequality, we can bound this as A 1Bq(yc n) 1CA 1y1:n 1 + A 1Bq(yc n) 1yc n + A 1Bq(yn) 1CA 1y1:n 1 + A 1Bq(yn) 1yn | {z } e C5 q(yc n) 1 A 1BCA 1y1:n 1 + q(yc n) 1yc n A 1B = e C5 + q(yc n) 1 A 1BCA 1y1:n 1 + q(yc n) 1yc n where e C5 are all terms that do not depend on the contamination. Now, we can observe that: q(yc n) 1 = 1 Knn + σ2w(xn, ycn) 2 = σ 2w(xn, yc n)2 Knnσ 2w(xn, ycn)2 + 1 σ 2w(xn, yc n)2, where the last inequality holds because Knnσ 2w(xn, yc n)2 > 0. Therefore, since supx,y w(x, y) < , we can write e C5 + q(yc n) 1 A 1BCA 1y1:n 1 + q(yc n) 1yc n e C5 + q(yc n) 1 A 1BCA 1y1:n 1 + q(yc n) 1yc n e C6 + w(xn, yc n)2yc n e C7 For e C6 = e C5 + σ 2 supx,y w(x, y)2 Pn 1 i=1 A 1BCA 1y1:n 1 , and e C7 = σ 2 Pn 1 i=1 A 1B . Now, similarly for the nth term we have K + σ2Jwc 1 (y mwc) (K + σ2Jw) 1(y mw) = q(yc n) 1CA 1y1:n 1 + q(yc n) 1yc n + q(yn) 1CA 1y1:n 1 q(yn) 1yn q(yc n) 1CA 1y1:n 1 + q(yc n) 1yc n + q(yn) 1CA 1y1:n 1 + q(yn) 1yn e C8 + e C9 w(xn, yc n)2yc n , Robust and Conjugate Gaussian Process Regression for e C8 = σ 2 supx,y w(x, y)2 CA 1y1:n 1 + q(yn) 1CA 1y1:n 1 + q(yn) 1yn and e C9 = σ 2. Putting both expressions together, we obtain: (2) e C4 K F K + σ2Jwc 1 (y mwc) K + σ2Jw 1 (y mw) 2 1 e C4 K F 2(( e C6 + e C8)2 + ( e C7 + e C9)2(w(xn, yc n)2yc n)2) e C10 + e C11(w(xn, yc n)2yc n)2, where e C10 = e C4 K F 2(( e C6 + e C8)2, e C11 = e C4 K F 2( e C7 + e C9)2. Lastly, the third and final term can be rewritten using properties of determinants as (3) = ln det ΣR c det ΣR det K + σ2Jwc 1 σ2Jwc det (K + σ2Jw) 1 σ2Jw = ln det K + σ2Jw σ 2J 1 w det K + σ2Jwc 1 det σ2Jwc . 
Here, we defined e C12 = det K + σ2Jw σ 2J 1 w since it does not depend on the contamination, and write (3) = ln e C12 det K + σ2Jwc 1 det σ2Jwc e C12 det σ2Jwc det (K + σ2Jwc) e C12 det σ2Jwc det (K) + det (σ2Jwc) where in the last inequality, we use the fact that for two positive semidefinite matrices A, B, it also holds that det(A + B) det(A) + det(B). Finally, det (K) > 0 and det σ2Jwc > 0, since both are positive definite matrices. Therefore, det (K) + det (σ2Jwc) 1, which leads to (3) ln e C12 = e C13. Finally, putting the three terms together we obtain the desire bound: PIFRCGP(yc m, D) e C2 + e C10 + e C11(w(xn, yc n)2yc n)2 + e C13 = C1(w(xn, yc n)2yc n)2 + C2 where C2 = e C2 + e C10 + e C11 + e C13, and C1 = e C11. A.5. Proof of Proposition 4.1 For f GP(m, k), ε N(0, σ2), the RCSVGP posterior is f GP(eµ, eΣ), where eµ(x) = ϕu(x) µu, eΣ(x, x ) = k(x, x ) ϕu(x) (Kuu Σu) ϕu(x ), µu = m + Kuu P 1 u Kuσ 2J 1 w (y mw), Σu = Kuu P 1 u Kuu, for Pu = Kuu + K u σ 2J 1 w Ku , [Kuu]ij = k(ui, uj), [Ku]ij = k(ui, xj), [ku(x)]i = k(ui, x), and ϕu(x) = K 1 uuku(x). As a function of u, θ, and σ2, the corresponding variational objective is J(u, θ, σ2) = 1 2ν K u Q 1 u Kuν + C(σ2) + 1 2 w (K K u K 1 uu Ku)J 1 Robust and Conjugate Gaussian Process Regression where Qu = Kuu + K u σ 2J 1 w Ku, ν = σ 2J 1 w (y mw), and C(σ2) = C(x, y, σ2) is a function that depends on the data and σ2. Proof. Recall that for the standard case, we consider fu = (f(u1), . . . , f(um)) as the evaluations corresponding to the inducing input locations u = (u1, ..., um) . Given this, we define the variational distribution as q(f, fu) = p(f|fu)q(fu), where q(u) = N(u; µ, Σ). We now seek to approximate the exact posterior p(f, fu|y) by the variatonal distribution. We do so by minimising the Kullback-Leibler (KL) divergence between q(f, fu) and p(f, fu|y). This is equivalent to maximising the ELBO, which is defined as: ELBO(u) = Z log Ψ(y, fu)p(fu) q(fu) q(fu)dfu, Ψ(y, fu) = exp Z Rn log(p(y|f))p(f|fu)df . It is straightforward to verify that the optimal variational distribution is q(fu) = Ψ(y, fu)p(fu) R Ψ(y, fu)p(fu)dfu . Plugging this back into the ELBO, we then get ELBO(u, µ, Σ) = log Z Ψ(y, fu)p(fu)dfu For the RCSVGP, we replace the standard likelihood p(y|f), with the pseudo likelihood exp{ n Lw n (f, y, x)}, leading to the RCSVGP s optimal variational distribution, which is given by qw(fu) = Ψw(y, fu)p(fu) R Ψw(y, fu)p(fu)dfu , log Ψw(y, fu) = Z Rn log(exp{ n Lw n (f, y, x)})p(f|fu)df 2 f σ 2J 1 w f 2f ν + C(x, y, σ2) p(f|fu)df, where we used the expression of Lw n (f, y, x) from Appendix A.1 and defined ν = σ 2J 1 w (y mw). Recallling that C(x, y, σ2) = y σ 2 diag(w2)y 2 yy w2. Recalling that the density of the conditional distribution of f given fu is p(f|fu) = N(f; µf|fu, Σf|fu), µf|fu = K u K 1 uufu, Σf|fu = K K u K 1 uu Ku, [Kuu]ij = k(ui, uj), [Ku]ij = k(ui, xj), [ku(x)]i = k(ui, x). Then, we can write the above expression as: log Ψw(y, fu) = Z 2f σ 2J 1 w fp(f|fu)df + Z Rn f νp(f|fu)df Z Rn 1 2C(x, y)p(f|fu)df Rn f σ 2J 1 w fp(f|fu)df + Ef p(f|fu)[f ν] 1 2C(x, y, σ2) Robust and Conjugate Gaussian Process Regression Recalling that we can write the inner product between two vectors as the trace of their outer product, in this case f σ 2J 1 w f = (σ 1J 1/2 w f) (σ 1J 1/2 w f) = Tr(σ 1J 1/2 w ff σ 1J 1/2 w ). log Ψw(y, fu) = 1 Rn Tr(σ 1J 1/2 w ff σ 1J 1/2 w )p(f|fu)df + Ef p(f|fu)[f ν] 1 2C(x, y, σ2) 2Ef p(f|fu)[Tr(σ 1J 1/2 w ff σ 1J 1/2 w )|fu] + Ef p(f|fu)[f ν] 1 2C(x, y, σ2). 
It is well known that the expectation of the trace is equal to the trace of the expectation so that log Ψw(y, fu) = 1 2 Tr(Ef p(f|fu)[σ 1J 1/2 w ff σ 1J 1/2 w ]) + Ef p(f|fu)[f ν] 1 2C(x, y, σ2) 2 Tr(σ 1J 1/2 w Ef p(f|fu)[ff ]σ 1J 1/2 w ) + Ef p(f|fu)[f] ν 1 2C(x, y, σ2). Recalling that the variance of a random variable is defined for a random variable X distributed by p as VX p[X] = EX p[XX ] EX p[X]EX[X] , which implies that EX p[XX ] = VX p[X] + EX p[X]EX p[X] . Therefore, we can use this relationship to write logΨw(y, fu) = 1 2 Tr(σ 1J 1/2 w Ef p(f|fu)[ff ]σ 1J 1/2 w ) + Ef p(f|fu)(f) ν 1 2C(x, y, σ2) 2 Tr(σ 1J 1/2 w (Vf p(f|fu)[f] + Ef p(f|fu)[f]Ef p(f|fu)[f] )σ 1J 1/2 w ) + Ef p(f|fu)[f] ν 1 2C(x, y, σ2). Since p(f|fu) = N(f; µf|fu, Σf|fu), we know explicitly Vf p(f|fu)[f] and Ef p(f|fu)[f]. Plug these expressions in and rearranging terms using traces properties, we obtain the final expression log Ψw(y, fu) = 1 2 Tr(σ 1J 1/2 w (Vf p(f|fu)[f] + Ef p(f|fu)[f]Ef p(f|fu)[f] )σ 1J 1/2 w ) + E(f|fu) ν 1 2C(x, y, σ2) 2 Tr(σ 1J 1/2 w (Σf|fu + µf|fuµ f|fu)σ 1J 1/2 w ) + µ f|fuν 1 2C(x, y, σ2) 2 Tr(σ 2J 1/2 w Σf|fu J 1/2 w ) 1 2 Tr(σ 2J 1/2 w µf|fuµ f|fu J 1/2 w ) + µ f|fuν 1 2C(x, y, σ2) 2 Tr(σ 2J 1/2 w Σf|fu J 1/2 w ) 1 2µ f|fuσ 2J 1 w µf|fu + µ f|fuν 1 2C(x, y, σ2). We now will plug Ψ in optimal variational distribution to finish the proof. Replacing and rearranging terms, we obtain qw(fu) Ψw(y, fu)p(fu) 2µ f|fuσ 2J 1 w µf|fu + µ f|fuν 1 2f u K 1 uufu 2f u K 1 uu Kuσ 2J 1 w K u K 1 uufu + f u K 1 uu Kuν 1 2f u K 1 uufu 2f u (K 1 uu Kuσ 2J 1 w K u K 1 uu + K 1 uu)fu + f u K 1 uu Kuν completing squares, we obtain qw(fu) exp 1 2(fu µu) Σ 1 S (fu µu) µu = (K 1 uu + K 1 uu Kuσ 2J 1 w K u K 1 uu) 1K 1 uu Kuν = Kuu(Kuu + Kuσ 2J 1 w K u ) 1Kuν = Kuu P 1 u Kuσ 2J 1 w (y mw), Σu = (K 1 uu + K 1 uu Kuσ 2J 1 w K u K 1 uu) 1 = Kuu(Kuu + Kuσ 2J 1 w K u ) 1Kuu = Kuu P 1 u Kuu, Robust and Conjugate Gaussian Process Regression for Pu = Kuu + K u σ 2J 1 w Ku . Now, for the predictive posterior over f = f(x ) at new point x X, we have pw(f |, y, x) = Z R pw(f |x , fu)qw(fu)dfu = N(f ; µf |u, Σf |u)N(fu; µu, Σu)dfu. Here, we can use the integral obtain for the predictive in Appendix A.1 to get pw(f |x, y, x) = N(f ; eµ(x ), eΣ(x , x )) eµ(x ) = ϕu(x ) µu, eΣ(x , x ) = k(x , x ) ϕu(x ) (Kuu Σu) ϕu(x ), for [ku(x)]i = k(ui, x), and ϕu(x) = K 1 uuku(x). Finally, plugging Ψw in the ELBO and using log and exponential properties, we get ELBO(u) = log Z Ψw(y, fu)p(fu)dy 2 Tr(σ 2J 1/2 w Σf|fu J 1/2 w ) 1 2µ f|fuσ 2J 1 w µf|fu + µ f|fuν 1 2C(x, y, σ2) 1 2f u K 1 uufu 2(Tr(σ 2J 1/2 w (K K u K 1 uu Ku)J 1/2 w ) C(x, y, σ2)) 2f u (K 1 uu Kuσ 2J 1 w K u K 1 uu + K 1 uu)fu + f u K 1 uu Kuν dfu integrating over fu and just arithmetic rules for matrix-vector multiplication, we obtain ELBO(u) = 1 2(Tr(σ 2J 1/2 w (K K u K 1 uu Ku)J 1/2 w ) C(x, y, σ2)) 2ν K u K 1 uu(K 1 uu + K 1 uu Kuσ 2J 1 w K u K 1 uu) 1K 1 uu Kuν det K 1 uu + K 1 uu Kuσ 2J 1 w K u K 1 uu 2(Tr(σ 2J 1/2 w (K K u K 1 uu Ku)J 1/2 w ) C(x, y, σ2)) 2ν K u (Kuu + Kuσ 2J 1 w K u ) 1Kuν det K 1 uu + K 1 uu Kuσ 2J 1 w K u K 1 uu 2(Tr(σ 2J 1/2 w (K K u K 1 uu Ku)J 1/2 w ) C(x, y, σ2)) 2ν K u (Kuu + Kuσ 2J 1 w K u ) 1Kuν + 1 det Kuu + Kuσ 2J 1 w K u A.6. Proof of Proposition 4.2 Let f GP(m, k), ε N(0, σ2). The UCB and PI acquisition functions for RCGPs are a UCB-RCGP(x ) = µR + λ(ΣR )1/2 a PI-RCGP(x ) = Φ µR f(xmax) (ΣR ) 1/2 where Φ is the cdf of a standard normal, and xmax is the best solution we have so far. Robust and Conjugate Gaussian Process Regression Proof. 
Since the RCGP posterior is Gaussian, we can straightforwardly define the upper confidence bound acquisition function as

a_{UCB-RCGP}(x^*) = \mu_*^R + \lambda (\Sigma_*^R)^{1/2}.

For probability of improvement, suppose that the best solution found so far is x_max. We define the improvement function as

I(x^*) = \max(f(x^*) - f(x_{max}), 0).

Since f(x^*) \sim N(\mu_*^R, \Sigma_*^R), we can rewrite this as f(x^*) = \mu_*^R + (\Sigma_*^R)^{1/2} z with z \sim N(0, 1), leading to

I(x^*) = \max(\mu_*^R + (\Sigma_*^R)^{1/2} z - f(x_{max}), 0), \quad z \sim N(0, 1).

Finally, we define the probability of improvement acquisition function as

a_{PI-RCGP}(x^*) = P(I(x^*) > 0) = P(f(x^*) > f(x_{max})) = P\left(z > \frac{f(x_{max}) - \mu_*^R}{(\Sigma_*^R)^{1/2}}\right) = \Phi\left(\frac{\mu_*^R - f(x_{max})}{(\Sigma_*^R)^{1/2}}\right).

B. Additional Experimental Results

All experiments were run on an Apple M2 Pro CPU with 16 GB of memory.

B.1. Prior mean

This section illustrates how the choice of the prior mean affects our method. In this example, we generated data from a Gaussian process (GP) with zero mean and a periodic squared exponential kernel, where the length scale and variance are set to 1 and the period is set to 3. We then added 1% of focused contamination near zero. Figure 9 shows the performance of two RCGPs on this dataset: one with a zero prior mean (blue) and one with a prior mean selected by fitting a polynomial regression (brown). Since the outliers in this dataset are close to zero, the method with a zero prior mean does not identify them as outliers. In fact, because the outliers align exactly with the prior mean, the weights assigned to them are the largest, which leads to suboptimal performance. In contrast, for the brown RCGP the prior mean is carefully chosen, so the outliers lie far from it and are therefore down-weighted, resulting in a posterior that matches the true generative process.

Figure 9. The posterior predictive mean of the RCGP with a zero prior mean (blue) and of the RCGP with a prior mean selected by fitting a polynomial regression (brown) on a synthetic dataset where 1% of the data are focused outliers.

B.2. Implementation of Competing Methods

GP For the standard GP, we choose hyperparameters via maximum likelihood and utilise the implementation in GPflow (Matthews et al., 2017).

t-GP This model replaces the usual choice of Gaussian observation noise with Student-t distributed observation noise, i.e.

p(y | f(x), \nu, \sigma^2) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu\sigma^2}} \left(1 + \frac{(y - f(x))^2}{\nu\sigma^2}\right)^{-(\nu+1)/2},

where \nu > 0 is the degrees of freedom and \sigma^2 is the scale parameter. The challenge in this scenario lies in the fact that both its posterior and posterior predictive distributions are no longer analytically tractable, necessitating the use of approximate inference methods. There are several ways to approximate this posterior: MCMC, Laplace's method, or variational inference (Jylänki et al., 2011). Our experiments use the variational inference technique implemented in GPflow (Matthews et al., 2017) for the posterior, the posterior predictive, and hyperparameter selection; an official usage example of this implementation can be found at https://gpflow.github.io/GPflow/develop/notebooks/getting_started/classification_and_other_data_distributions.html#Non-gaussian-regression. In particular, the Student-t GPflow implementation uses the variational approximation proposed by Opper & Archambeau (2009).
The main result of their work is that for a Gaussian process with a non-Gaussian likelihood, the optimal Gaussian approximation in terms of the Kullback-Leibler divergence is given by

q(f) = N(K\alpha, [K^{-1} + \mathrm{diag}(\lambda)]^{-1}),

where K is the kernel matrix of the GP and \alpha and \lambda are the variational parameters.

m-GP This model explicitly models the generation process of outliers as a uniform distribution over a bounded region that covers the output y. Each observation is associated with a latent variable z \in \{0, 1\}, where z = 0 indicates that the observation is generated by the outlier distribution and z = 1 that it is an inlier. The observation model is therefore

p(y, z | f(x), \gamma, \sigma^2) = \left(\frac{1 - \gamma}{a}\right)^{1 - z} \left(\gamma\, N(y; f(x), \sigma^2)\right)^{z},

where \gamma = p(z = 1) and a > 0 denotes the volume of the outlier region. The variable \gamma controls the probability of occurrence of the two components, and a Beta prior is placed on it. Like the t-GP, its posterior and posterior predictive distributions are no longer analytically tractable; thus, an approximation is needed. We follow the official implementation provided with the paper, available at https://github.com/YifanLu2000/Robust-Scalable-GPR.

B.3. Benchmarking

B.3.1. DESCRIPTION OF THE DATASETS

Synthetic The dataset consists of n = 300 samples from a GP with zero mean and squared exponential kernel (length scale and variance equal to 1). Then, we added Gaussian noise \varepsilon \sim N(0, 0.3) to the observations. For the experiments, we selected a GP prior with mean function m(x) = \frac{1}{n}\sum_{i=1}^n y_i and a squared exponential kernel as covariance function.

Boston The dataset consists of n = 506 observations, each representing a suburban or town area in Boston. It encompasses d = 13 features containing data such as the average number of rooms in dwellings, pupil-teacher ratios, and per capita crime rates. We try to predict the median price of homes residents own (excluding rented properties). The dataset can be found at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html. We selected a GP prior with mean function m(x) = \frac{1}{n}\sum_{i=1}^n y_i and a squared exponential kernel as covariance function.

Energy The dataset describes the energy efficiency of buildings by correlating their heating and cooling load requirements with various building parameters. It consists of n = 768 data samples, each characterised by d = 8 distinct features, with the goal of predicting a single continuous response variable found in the last column. The dataset can be found at https://archive.ics.uci.edu/dataset/242/energy+efficiency. We selected a GP prior with mean function m(x) = \frac{1}{n}\sum_{i=1}^n y_i and a squared exponential kernel as covariance function.

Yacht The dataset's main focus is on predicting the residuary resistance of sailing yachts during their initial design phase, a critical aspect in evaluating a vessel's performance and estimating the essential propulsive power required. This prediction relies on d = 6 primary input parameters, which include the fundamental hull dimensions and the boat velocity. The dataset contains n = 308 observations and can be found at https://archive.ics.uci.edu/dataset/243/yacht+hydrodynamics. We selected a GP prior with mean function m(x) = \frac{1}{n}\sum_{i=1}^n y_i and a squared exponential kernel as covariance function.
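All four benchmarks use the same prior specification, namely a constant mean equal to the empirical average of the training targets and a squared exponential kernel. The following NumPy sketch shows one way to construct such a prior; the length scale and variance defaults are placeholders (in the experiments they are chosen as described in the main text), and the function names are ours rather than from the paper's code.

```python
import numpy as np

def squared_exponential(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def benchmark_prior(X_train, y_train, lengthscale=1.0, variance=1.0):
    """Constant prior mean m(x) = mean(y) and squared exponential covariance, as used in Appendix B.3."""
    m = np.full(len(y_train), y_train.mean())   # m(x) = (1/n) * sum_i y_i
    K = squared_exponential(X_train, X_train, lengthscale, variance)
    return m, K
```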
B.3.2. DESCRIPTION OF THE OUTLIER GENERATION PROCESS

While contamination can occur in both the covariates x and the observations y, our work focuses on cases where contamination exclusively affects the observations y. We now describe the three outlier generation processes used in this paper.

Uniform In this setting, we first uniformly select a specified proportion of the dataset to be contaminated. To contaminate the selection uniformly, we perform a random 50-50 split of this subset: half of the selected subset is contaminated by adding z \sim U(3\sigma, 9\sigma), while the other half is contaminated by subtracting z \sim U(3\sigma, 9\sigma), where \sigma is the standard deviation of the original observations and U denotes the uniform distribution.

Asymmetric Much like the uniform outliers, we randomly select a subset of data points to contaminate. The key distinction is that we do not split the selected subset; instead, we contaminate the entire subset by subtracting z \sim U(3\sigma, 9\sigma), where \sigma is the standard deviation of the original observations.

Focused In this outlier generation process, we randomly select and remove a subset of data points, which are replaced by outliers. For these outliers, we choose their values in X deterministically. To do so, we calculate the median value of each input dimension j. However, we do not place the outliers exactly at this median position. Instead, we replace the removed input values by (m_1 + \delta_1, m_2 + \delta_2, ..., m_d + \delta_d)^T, where m_j is the median of the j-th input dimension and \delta_j = \alpha_j u, with \alpha_j equal to 0.1 times the median absolute deviation of the j-th input dimension and u \sim U(0, 1). Simultaneously, the outlier values in Y are obtained by subtracting three times the standard deviation of the observations from their median M_y. So that not every outlier takes the same value, we also add a small perturbation \delta_y = \alpha_y u, where \alpha_y is 0.1 times the median absolute deviation of y and u \sim U(0, 1).
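For reference, the three contamination schemes described above can be reproduced with the short NumPy sketch below. This is our own rendering of Appendix B.3.2; the exact random-number handling of the original experiments may differ, and the 10% contamination proportion is only a default.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_outliers(y, prop=0.1):
    """Half of the selected points shifted up and half shifted down by z ~ U(3*sd, 9*sd)."""
    y = y.copy()
    sd = y.std()
    idx = rng.choice(len(y), size=int(prop * len(y)), replace=False)
    half = len(idx) // 2
    y[idx[:half]] += rng.uniform(3 * sd, 9 * sd, size=half)
    y[idx[half:]] -= rng.uniform(3 * sd, 9 * sd, size=len(idx) - half)
    return y

def asymmetric_outliers(y, prop=0.1):
    """All selected points shifted down by z ~ U(3*sd, 9*sd)."""
    y = y.copy()
    sd = y.std()
    idx = rng.choice(len(y), size=int(prop * len(y)), replace=False)
    y[idx] -= rng.uniform(3 * sd, 9 * sd, size=len(idx))
    return y

def focused_outliers(X, y, prop=0.1):
    """Selected points replaced by outliers near the componentwise input median,
    with outputs three standard deviations below the median of y plus a small jitter."""
    X, y = X.copy(), y.copy()
    idx = rng.choice(len(y), size=int(prop * len(y)), replace=False)

    def mad(a):
        return np.median(np.abs(a - np.median(a, axis=0)), axis=0)

    alpha_x, alpha_y = 0.1 * mad(X), 0.1 * mad(y)
    X[idx] = np.median(X, axis=0) + alpha_x * rng.uniform(0, 1, size=(len(idx), X.shape[1]))
    y[idx] = np.median(y) - 3 * y.std() + alpha_y * rng.uniform(0, 1, size=len(idx))
    return X, y
```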
B.3.3. ADDITIONAL RESULTS

We ran all benchmark experiments choosing c via leave-one-out ("c-LOO") and compared this with the proposed way of choosing c ("c-Qn"). Overall, the performance is slightly worse for c-LOO, likely because maximising the predictive posterior for extreme observations tends to fit these outliers by increasing c, leading to a less robust method. Illustrative results can be seen in Table 4.

Table 4. Average test set mean absolute error and standard deviation (in brackets) for 50 train-test splits.

                      GP            c-LOO         c-Qn
No Outliers
  Synthetic           0.09 (0.00)   0.09 (0.00)   0.09 (0.00)
  Boston              0.19 (0.01)   0.19 (0.01)   0.19 (0.01)
  Energy              0.03 (0.00)   0.02 (0.00)   0.02 (0.00)
  Yacht               0.02 (0.01)   0.02 (0.01)   0.02 (0.01)
Focused Outliers
  Synthetic           0.19 (0.00)   0.17 (0.00)   0.15 (0.00)
  Boston              0.23 (0.06)   0.22 (0.03)   0.22 (0.01)
  Energy              0.03 (0.04)   0.02 (0.00)   0.02 (0.00)
  Yacht               0.26 (0.15)   0.11 (0.13)   0.10 (0.14)
Asymmetric Outliers
  Synthetic           1.14 (0.00)   1.08 (0.00)   0.63 (0.00)
  Boston              0.63 (0.02)   0.61 (0.01)   0.49 (0.00)
  Energy              0.54 (0.02)   0.37 (0.10)   0.44 (0.04)
  Yacht               0.54 (0.06)   0.47 (0.03)   0.35 (0.02)

B.4. Sparse Variational Gaussian Processes

This section presents a numerical comparison between SVGP and RCSVGP on the four benchmark datasets presented in Appendix B.3, under three outlier regimes: no outliers, focused outliers, and asymmetric outliers. For both methods, we selected a GP prior with mean function m(x) = \frac{1}{n}\sum_{i=1}^n y_i and a squared exponential kernel as covariance function. We consider a fixed noise variance \sigma^2 = 0.01. Table 5 shows that RCSVGP outperforms SVGP even in cases without outliers.

Table 5. Average test set mean absolute error and standard deviation (in brackets) for 10 train-test splits.

                No Outliers                  Focused Outliers             Asymmetric Outliers
                SVGP          RCSVGP         SVGP          RCSVGP         SVGP          RCSVGP
  Synthetic     0.08 (0.00)   0.08 (0.00)    0.19 (0.00)   0.13 (0.00)    1.32 (0.00)   1.02 (0.00)
  Boston        0.65 (0.01)   0.56 (0.18)    0.64 (0.02)   0.55 (0.16)    0.65 (0.04)   0.64 (0.01)
  Energy        0.02 (0.00)   0.03 (0.00)    0.05 (0.05)   0.05 (0.00)    0.64 (0.05)   0.09 (0.16)
  Yacht         0.01 (0.00)   0.03 (0.00)    0.29 (0.00)   0.13 (0.02)    0.65 (0.04)   0.60 (0.02)

B.5. Bayesian Optimisation

In the Bayesian optimisation experiments, we compared RCGPs to GPs and t-GPs on four classical test functions: the Six-Hump Camel, Branin, McCormick, and Rosenbrock functions. Here, we state the functions explicitly:

Six-Hump Camel: g_1(x, x') = (4 - 2.1x^2 + x^4/3)x^2 + xx' + (4x'^2 - 4)x'^2, for x \in (-2, 2), x' \in (-1, 1)

Branin: g_2(x, x') = \left(x' - \frac{5.1}{4\pi^2}x^2 + \frac{5}{\pi}x - 6\right)^2 + 10\left(1 - \frac{1}{8\pi}\right)\cos(x) + 10, for x \in (-5, 10), x' \in (1, 15)

McCormick: g_3(x, x') = \sin(x + x') + (x - x')^2 - 1.5x + 2.5x' + 1, for x \in (-1.5, 4), x' \in (-3, 4)

Rosenbrock: g_4(x, x') = 100(x' - x^2)^2 + (x - 1)^2, for x \in (-5, 10), x' \in (-5, 1)

with global minima g_1(x^*, x'^*) = -1.0316, g_2(x^*, x'^*) = 0.3979, g_3(x^*, x'^*) = -1.9133, and g_4(x^*, x'^*) = 0.

In order to contaminate with outliers, we consider the scenario where, for each function evaluation, there is a 20% chance of being contaminated by an asymmetric outlier generated as in Appendix B.3.2. So as not to change the global minimum, we consider the case where the outlier is larger than the actual observation, i.e. asymmetric contamination that adds rather than subtracts a value. For all three models, we selected a GP prior with a zero mean function and a squared exponential kernel.

Figure 10 shows the results using PI as the acquisition function. In terms of cumulative regret, RCGPs outperform GPs. While t-GPs can match this, they take orders of magnitude longer to run.

Figure 10. Mean cumulative regret (top) and clock time (bottom) for BO with GP (green), RCGP (blue) and t-GP (orange), with 20% asymmetric outliers and the PI acquisition function, over 10 realisations.

Figure 11 shows the results without outliers. Notably, even in the absence of outliers, RCGPs match or sometimes outperform GPs in terms of cumulative regret. While t-GPs can match this, they take orders of magnitude longer to run.

Figure 11. Mean cumulative regret (top) and clock time (bottom) for BO with GP (green), RCGP (blue) and t-GP (orange), without outliers, over 10 realisations.
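Because the RCGP posterior predictive is Gaussian, the acquisition functions of Proposition 4.2 are one-liners given a predictive mean and variance. The sketch below is our own minimal implementation using SciPy's standard normal CDF; the trade-off parameter lambda and the toy numbers in the usage example are arbitrary choices, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def ucb_rcgp(mu_pred, var_pred, lam=2.0):
    """Upper confidence bound: a_UCB(x*) = mu*_R + lam * (Sigma*_R)^(1/2)."""
    return mu_pred + lam * np.sqrt(var_pred)

def pi_rcgp(mu_pred, var_pred, f_best):
    """Probability of improvement: a_PI(x*) = Phi((mu*_R - f(x_max)) / (Sigma*_R)^(1/2))."""
    return norm.cdf((mu_pred - f_best) / np.sqrt(var_pred))

# Usage example: pick the next candidate from a small grid of test points,
# given predictive means and variances from the RCGP posterior predictive.
mu = np.array([0.1, 0.4, 0.2])
var = np.array([0.05, 0.20, 0.10])
next_idx = np.argmax(pi_rcgp(mu, var, f_best=0.35))
```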