# adversarial_robustness_guarantees_for_gaussian_processes__b6d58d05.pdf Journal of Machine Learning Research 23 (2022) 1-55 Submitted 4/21; Revised 12/21; Published 4/22 Adversarial Robustness Guarantees for Gaussian Processes Andrea Patan e andrea.patane@cs.ox.ac.uk Department of Computer Science University of Oxford Oxford, OX1 3QG, United Kingdom Arno Blaas arno@robots.ox.ac.uk Department of Engineering Science University of Oxford Oxford, OX2 6ED, United Kingdom Luca Laurenti l.laurenti@tudelft.nl Department of Computer Science University of Oxford Oxford, OX1 3QG, United Kingdom Luca Cardelli luca.cardelli@cs.ox.ac.uk Department of Computer Science University of Oxford Oxford, OX1 3QG, United Kingdom Stephen Roberts sjrob@robots.ox.ac.uk Department of Engineering Science University of Oxford Oxford, OX2 6ED, United Kingdom Marta Kwiatkowska marta.kwiatkowska@cs.ox.ac.uk Department of Computer Science University of Oxford Oxford, OX1 3QG, United Kingdom Editor: John Cunningham Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications. Such scenarios demand that GP decisions are not only accurate, but also robust to perturbations. In this paper we present a framework to analyse adversarial robustness of GPs, defined as invariance of the model s decision to bounded perturbations. Given a compact subset of the input space T Rd, a point x and a GP, we provide provable guarantees of adversarial robustness of the GP by computing lower and upper bounds on its prediction range in T. We develop a branch-and-bound scheme to refine the bounds and show, for any ϵ > 0, that our algorithm is guaranteed to converge to values ϵ-close to the actual values in finitely many iterations. The algorithm is anytime and can handle both regression and classification tasks, with analytical formulation for most kernels used in practice. We evaluate our methods on a collection of synthetic and standard benchmark data sets, including SPAM, MNIST and Fashion MNIST. We study the effect of approximate inference techniques on robustness and demonstrate how our c 2022 Andrea Patan e, Arno Blaas, Luca Laurenti, Luca Cardelli, Stephen Roberts and Marta Kwiatkowska. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0382.html. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska method can be used for interpretability. Our empirical results suggest that the adversarial robustness of GPs increases with accurate posterior estimation. Keywords: Gaussian processes, adversarial robustness, non-linear optimisation, Bayesian learning, branch-and-bound methods 1. Introduction Adversarial examples are input points intentionally crafted to trick a machine learning model into a misclassification. Imperceptible perturbations that can fool deep learning models in computer vision have been popularised by Szegedy et al. (2013) and, in the context of security, account for the growth in adversarial machine learning techniques, see review in (Biggio and Roli, 2018). Since test accuracy fails to account for the behaviour of a model in adversarial settings, algorithmic techniques for quantifying adversarial robustness of machine learning models are needed to aid their deployment in safety-critical scenarios. As a consequence, a number of methods that provide exact or approximate guarantees on the model output have been developed for neural networks, e.g., (Huang et al., 2017; Katz et al., 2017; Zhang et al., 2018). Gaussian process (GP) models (Rasmussen and Williams, 2006) provide a flexible probabilistic framework for performing inference over functions, which integrates information from prior and data into a predictive posterior distribution that informs the optimal model decision. GPs are particularly attractive in view of their favourable analytical properties and support for Bayesian inference. One advantage of GPs compared to neural network models is that they support the computation of uncertainty over model predictions, which can then be propagated through the decision-making pipelines. Various notions of robustness have been investigated for Gaussian process models, such as robustness against outliers (Kim and Ghahramani, 2008) or against labelling errors (Hern andez-Lobato et al., 2011). However, to the best of our knowledge, studies of adversarial robustness of GPs have been limited to statistical (i.e., input distribution dependent) (Abdelaziz, 2017) and heuristic analyses (Grosse et al., 2018; Bradshaw et al., 2017) or limited to an analysis on the behaviour of the latent mean (Smith et al., 2019). In this work, we develop a novel algorithmic framework to quantify the adversarial robustness of optimal predictions of Gaussian process models trained on a data set D. To this end, we adapt the notion of adversarial robustness commonly employed for neural networks models to the GP setting, defined as the invariance of the decision in a small neighbourhood of a test point (Huang et al., 2017), and thus study the worst-case effect of bounded perturbations of the input on the GP optimal decision. We represent bounded perturbations by a compact subset of the input space T Rd enclosing a test point x Rd, and consider the prediction range of the GP over T. Similarly to (Ruan et al., 2018), we observe that, to provide provable guarantees on the model prediction over T, it suffices to compute the minimum and maximum of the reachable prediction range. Unfortunately, exact direct computation of the minimum and maximum class probabilities over compact sets is not possible, as these would require providing an exact solution of a global non-linear optimisation problem, for which no general method exists (Neumaier, 2004). Instead, we approximate each extremum of the prediction range by lower and upper bounds. We show how such upper and lower bounds for the minimum and maximum prediction probabilities Adversarial Robustness Guarantees for Gaussian Processes of the GP can be computed on any given compact set T, and then iteratively refine these bounds in a branch-and-bound algorithmic optimisation scheme until convergence to the minimum and maximum is obtained. The method we propose is anytime (the bounds provided are at every step an over-estimation of the actual classification ranges over T, and can thus be used to provide guarantees) and ϵ-exact (the actual values are reached in finitely many steps up to an error ϵ selected a-priori). Our framework can handle robustness for both regression and classification tasks, with analytical formulation for most kernels used in practice, including generalised spectral kernels. We implement the methods in Matlab and apply our approach to analyse the robustness profile of GP models on a synthetic two-dimensional data set, the SPAM data set, feature-based analysis of both binary and 3-class subsets of the MNIST and Fashion-MNIST (F-MNIST) data sets, and on a Water Quality multi-output regression data set (Dˇzeroski et al., 2000).1 In particular, we compare the guarantees computed by our method with the robustness estimation approximated by adversarial attack methods for GPs (Grosse et al., 2018), discussing in which settings the latter fails. Then, we analyse the effect of approximate Bayesian inference techniques and hyper-parameter optimisation procedures on the GP model robustness. Across the four data sets analysed here, we observe that approximations based on Expectation Propagation (Minka, 2001) give more robust classification models than approximations based on Laplace approximation. We further find that GP robustness increases with the number of hyper-parameter training epochs, and that sparse GP model robustness generally increases with the number of training points (for a fixed number of inducing points). Finally, we show how our framework can be used to perform global interpretability analysis of GP predictions, highlighting differences over LIME (Ribeiro et al., 2016) To the best of our knowledge, ours is the first comprehensive framework that provides methods to compute provable guarantees for the adversarial robustness of Gaussian process models. In summary, the paper presents the following contributions: We design a flexible framework for the bounding of the posterior mean and variance of GPs in compact subsets of the input space. Using the mean and variance bounds, we develop methods to lowerand upper-bound the minimum and maximum of a GP output over compact sets for the adversarial analysis of GPs in both classification and regression settings. We incorporate the bounding procedures in a branch-and-bound algorithmic optimisation scheme, which we show converges for any specified error ϵ > 0 in finitely many steps. We empirically evaluate the robustness of a variety of GP models on four classification and a multi-output regression data sets, for different training regimes including sparse approximations, and demonstrate how our method can be used for global interpretability analysis of classification models. A preliminary version of this work appeared in (Blaas et al., 2020). This paper extends previous work in several aspects. In (Blaas et al., 2020) we provide analytical bounds only for 1. The code can be found at https://github.com/andreapatane/check-GPclass. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska GP classification using a probit link function and consider GPs with the squared exponential kernels. Here, we also derive analytical bounds in the case of logit link function, show that regression models can be analysed using a subset of the methods developed for classification, and extend our framework to a class of kernel functions that satisfy certain smoothness conditions (see Section 4). Furthermore, we extend the experimental evaluation to show that our framework can be employed to analyse the robustness of sparse GP approximations, and additionally consider the Fashion-MNIST data set and a multi-output regression task. This paper is structured as follows. In Section 2 we introduce background on GP regression and classification. The definition of adversarial robustness of GP models and the problem statements we consider are given in Section 3. Computation of adversarial robustness of a GP requires lowerand upper-bounding of the variation of the GP mean and variance in a neighbourhood of a test point. These bounds are presented in Section 4 and then employed in Section 5 to compute adversarial robustness for both (binary and multiclass) classification and regression. A branch-and-bound algorithm that incorporates the bounding methods is presented in Section 6, where we also show that it is guaranteed to converge to the true adversarial robustness of a GP model. Finally, empirical results on multiple data sets are discussed in Section 7. 1.1 Related Works Following on from seminal work that drew attention to deep learning models being susceptible to adversarial attacks in computer vision (Szegedy et al., 2013) and security (Biggio and Roli, 2018), a range of techniques have been proposed for the analysis of adversarial robustness of machine learning models. The developed techniques mainly focus on neural networks and the prevailing approach is to compute worst-case guarantees on the model prediction at a given test point (Huang et al., 2017; Katz et al., 2017). Various approaches have been considered to compute such robustness measures, including constraint solving (Huang et al., 2017; Katz et al., 2017), optimisation (Ruan et al., 2018; Bunel et al., 2020), convex relaxation (Zhang et al., 2018), and abstract interpretation (Gehr et al., 2018). Such methods have also been extended to Bayesian Neural Networks (BNNs) (i.e., neural networks with a prior distribution over their weights and biases) with both sampling-based (Cardelli et al., 2019a; Wicker et al., 2021) and numerical (Wicker et al., 2020; Berrada et al., 2021) solution methods. However, these techniques rely on the parametric nature of neural networks, and therefore cannot be directly applied to GPs. While various notions of robustness have been studied for Gaussian process models, such as robustness against outliers (Kim and Ghahramani, 2008) or against labelling errors (Hern andez-Lobato et al., 2011), studies of adversarial robustness of GPs have been limited to heuristic analyses (Grosse et al., 2018; Bradshaw et al., 2017) and binary classification (Smith et al., 2019). In particular, in Smith et al. (2019), the authors give guarantees for GPs in a binary classification setting under the ℓ0-norm and only consider the mean of the distribution in the latent space without taking into account the uncertainty intrinsic in the GP framework. In contrast, our approach also considers multi-class classification and regression, takes into account the full posterior distribution, and allows for exact (up to ϵ > 0) computation under any ℓp-norm. Adversarial Robustness Guarantees for Gaussian Processes Formal probabilistic guarantees for learning with GPs have been developed in the context of GP optimisation (Bogunovic et al., 2018) and GP regression (Cardelli et al., 2019b). Cardelli et al. (2019b) derive an upper bound on the probability that a function sampled from a trained GP is invariant to bounded perturbations at a specific test point, whereas Bogunovic et al. (2018) consider a GP optimisation algorithm, in which the returned solution is guaranteed to be robust to adversarial perturbations with a certain probability. We note that our problem formulation is different, and the methods developed in the above papers cannot be applied to classification with GPs due to its non-Gaussian nature. Further, our approach yields guarantees on the optimal model decision rather than on the latent GP posterior, is guaranteed to converge to any given error ϵ > 0 in finite time, and is anytime (i.e., at any time it gives sound upper and lower bounds of the classification probabilities). We also remark that the guarantees we provide in this paper are substantially different from those that can be obtained via randomized smoothing (Cohen et al., 2019). Randomized smoothing can smooth any base classifier at a given input point by perturbing the point with Gaussian isotropic noise. In contrast, our methods aim to directly quantify the robustness of the base machine learning model obtained by Gaussian process classification or regression. 2. Bayesian Learning with Gaussian Processes This section provides background material on Gaussian process modelling for regression and classification. More information can be found in (Rasmussen and Williams, 2006). An Rm-valued Gaussian process over a real-valued vector space Rd is a particular stochastic process f : Ω Rd Rm, where Ωis a suitable sample space, such that for every finite subset of input points their joint distribution under the GP is Gaussian. Namely, denoting with f(x) := f( , x) : Ω Rm the random variable induced by the stochastic process in the input point x, and given a collection of input points x = [x(1), . . . , x(N)], with x(i) Rd, a GP is such that f(x) N(µ(x), Σx,x), where µ : Rd Rm is the mean function and Σ : Rd Rd Rm2 is the covariance (or kernel) function, which fully characterise the behaviour of the GP. Consider now a data set D = {(x(i), y(i)) | x(i) Rd, y(i) Y, i = 1, . . . , N} for some input space Rd and output space Y. We denote with x = [x(1), . . . , x(N)] the aggregate vector of input points, and similarly y = [y(1), . . . , y(N)] is the aggregate vector of output points. We let Y to be (a subset of) Rm for regression, and the discrete set {1, . . . , m} in case of an m-class classification problem. Gaussian processes provide a probabilistic framework for performing inference over functions, where a prior is combined with data through an appropriate likelihood to obtain a posterior process that is consistent with the prior and data. In a Bayesian framework this is done by introducing a latent space F = Rm, and defining a GP prior f over the latter by instantiating a specific form for its mean, µ, and kernel, Σ, functions. The prior is updated to take into account the information contained in the data set D by means of the Bayes formula p(f(x)|D) p(y|f(x))p(f(x)), where p(y|f(x)) denotes the likelihood function, resulting in the posterior distribution, p(f(x)|D), over the latent space F. Given a previously unseen point x , the predictive posterior distribution over its associated output y can be obtained by marginalising the posterior evaluated on x over the latent space, i.e., p(y |D) = R F p(y |f(x ))p(f(x )|D)df(x ). For practical applications, Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska we typically extract a point value, ˆy(x ), from the posterior predictive distribution p(y |D) that satisfies specific criteria. In Bayesian decision theory, one proceeds by assuming a loss function, L(y , ˆy ), and minimising it with respect to the posterior distribution on the specific test point, that is, ˆy(x ) = arg min y Y Y L(y , y)p(y |D)dy . Since y is a continuous variable for regression models and a discrete variable for classification, different likelihood and loss functions are used in each case, resulting in different treatment for the posterior distribution and the model decision. Below, we review the specific details separately. Regression For regression models we typically assume a Gaussian likelihood function with uncorrelated noise σ2 noise, i.e., p(y|f) = N(y|f, σ2 noise IN). The posterior distribution over F is still Gaussian, and is characterised by the following inference equations for its posterior mean and variance: µ(x ) = µ(x ) + Σx ,xt (1) Σ(x ) := Σx ,x = Σx ,x Σx ,x SΣx,x , (2) where S and t are computed using the conditioning formula for Gaussian distributions (Rasmussen and Williams, 2006). Namely, S is a matrix in RN N with S = (Σx,x+σ2IN) 1 and t is a vector in RN with t = S(y µ(x)). Furthermore, the predictive posterior distribution over Y has the same mean as the posterior and variance equal to that of the posterior plus the underlying noise σ2 noise. Assuming a symmetric loss (e.g. the squared distance loss), which we refer to as the canonical loss for regression, the optimal model decision is simply given by the posterior mean, i.e., ˆy(x ) = µ(x ). Classification For classification models, the likelihood is generally defined in terms of a sigmoid function p(y = i|f) = σi(f), for i {1, . . . , m}, as the probit or softmax function. Unfortunately, this does not result in a Gaussian posterior and is intractable. Instead, analytical approximations are applied to estimate a Gaussian distribution of the form q(f|D) = N(f | µ(x ), Σ(x )), which approximates the true distribution p(f|D). In this paper we consider q derived using the Laplace approximation method (Williams and Barber, 1998), the Expectation Propagation (EP) method (Minka, 2001), as well as several sparse approximation techniques (Snelson and Ghahramani, 2005); more details can be found in Section 7. We observe that, in all these settings, the inference equations for q(f|D) have the same form as those given in Equations (1) and (2), with S and t defined depending on the method chosen (Rasmussen and Williams, 2006).2 Once the approximate posterior q has been computed, the predictive posterior distribution for class i {1, . . . , m} is πi(x ) := p(y = i|D) = Z F σi(ξ)N(ξ| µ(x ), Σ(x ))dξ. (3) 2. We remark that this form of inference equations is common for Gaussian approximations, as it results from conditioning formulas for multivariate Gaussian distribution. Our method can thus be applied in any situation in which Gaussian approximations are used (i.e., not necessarily resulting from Laplace or EP techniques). Adversarial Robustness Guarantees for Gaussian Processes Since Equation (3) includes a non-linear multi-dimensional integral, its solution cannot in general be found in closed form. However, when there are two classes, i.e. Y = {1, 2}, it suffices to compute π1 and then simply set π2 = 1 π1. This allows us to simplify the latent variable space as uni-dimensional, so that ξ R. Assuming standard 0-1 loss,3 which we consider canonical for classification, the optimal model decision is the class that maximises the predictive posterior distribution, that is, ˆy(x ) = arg maxi=1,...,m p(y = i|D). 3. Problem Formulation Let f be a Gaussian process trained on a data set D. We wish to analyse its adversarial robustness, in the sense of studying the worst-case effect of bounded perturbations on the model s optimal decision ˆy(x). For a generic test point x , we represent the possible adversarial perturbations by defining a compact neighbourhood T around x , and measure the changes in the decisions caused by limiting the perturbations to lie within T. Definition 1 (Adversarial Robustness w.r.t. Model Decision) Let T Rd be a compact subset and x T. Consider a GP f, a loss function L and the resulting optimal decision ˆy( ). Given an ℓp norm || ||, we say that f is δ-adversarially robust in T at a point x with respect to the optimal decision induced by L iff ||ˆy(x ) ˆy(x)|| δ x T. (4) In the remainder of this paper, we will formulate a method for the worst-case analysis of the GP decision function ˆy(x), which enables computing provable guarantees on whether a given f is adversarially robust around a test point x (that is, whether it satisfies the condition in Equation 4). Since regression and classification differ in how optimal decisions are made, which is reflected in the definition of the function ˆy( ), to simplify the presentation we will discuss the two cases separately. 3.1 Classification For classification problems, adversarial robustness is customarily defined in terms of invariance of the decision over the neighbourhood of an input (Huang et al., 2017; Ruan et al., 2018). This can be obtained by selecting δ = 0 in Definition 1, and noting that for classification ˆy(x) = arg maxi {1,...,m} πi(x). Definition 2 (Adversarial Robustness in Classification) Let T Rd be a compact subset and x T. Consider a classification GP f and its predictive posterior distribution πi(x), i {1, . . . , m}, defined as in Equation (3). Then we say that f is adversarially robust in T at a point x iff arg max i {1,...,m} πi(x) = arg max i {1,...,m} πi(x ) x T. (5) Adversarial robustness therefore provides guarantees that the classification decision is not influenced by adversarial perturbations applied to x , as long as the perturbations are constrained to remain within T. Recall that the optimal decision for classification accounts for 3. Our method is sufficiently general to accommodate other loss functions, e.g., weighted loss, which can be computed from the predictive posterior. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska model uncertainty by moderating class probabilities with respect to the posterior distribution. In general, the outcome differs from the most likely class, because the decisions are affected by variance. Similarly to (Ruan et al., 2018), we note that, in order to check the condition in Equation (5), it suffices to compute the minimum and maximum of the prediction ranges in T, i.e.: πmin,i(T) = min x T πi(x) πmax,i(T) = max x T πi(x), (6) for i = 1, . . . , m. It is easy to see that the knowledge of πmin,i(T) and πmax,i(T) for all i = 1, . . . , m can be used to provide guarantees on the absence of adversarial attacks of the model output, where an adversarial attack is a point x T that is classified differently from x , that is, such that arg maxi {1,...,m} πi(x) = arg maxi {1,...,m} πi(x ). More specifically, by letting ˆy = arg maxi {1,...,m} πi(x ) and defining the vector ( πmax,i(T) if i = ˆy πmin,i(T) if i = ˆy , (7) we can check whether the (stronger) condition arg maxi {1,...,m} π i (T) = arg maxi {1,...,m} πi(x ) holds. That is, in order to decide whether a GP classification model f satisfies Definition 2 around a point x we need to solve the following problem. Problem 1 (Computation of Adversarial Prediction Ranges) Let T Rd be a compact subset. Consider a classification GP f and its predictive posterior distribution πi(x), i {1, . . . , m}, defined as in Equation (3). For i = 1, . . . , m, compute the adversarial prediction ranges for πi(x) in T, that is: πmin,i(T) = min x T πi(x) πmax,i(T) = max x T πi(x). Unfortunately, the solution of Problem 1 requires solving 2m non-linear optimisation problems, for which no general solution method exists (Neumaier, 2004). We discuss the bounding of Problem 1 in Sections 5.1 and 5.2, and then show how to refine the bounds in Section 6.4 3.2 Regression For regression models, since the output is a continuous variable, we define adversarial robustness in terms of a small, bounded variation of the decision over a compact neighbourhood T of a test point x . This follows from Definition 1, since for regression ˆy(x) = µ(x). Formally, we have the following. Definition 3 (Adversarial Robustness in Regression) Let T Rd be a compact subset, x T and consider a GP f. We say that f is adversarially δ-robust in T at a point x with respect to ℓp norm || || iff || µ(x ) µ(x)|| δ x T, (8) where µ(x) = E[f(x)] is the posterior mean of the GP. 4. While we focus on adversarial robustness w.r.t. the 0-1 loss, the computation of the prediction ranges poses a more general problem and the methods developed here can be used for classifiers associated to different loss functions (e.g., a weighted classification loss) through an appropriate definition of a vector in Equation (7). Adversarial Robustness Guarantees for Gaussian Processes This definition is analogous to the computation of the reachable set of outputs (or confidence values) for neural networks (Ruan et al., 2018). Since for a GP the mean corresponds to the maximum of the distribution, it thus follows, under the assumption of convergence, that it can be computed by a deterministic scheme that relies on regularised maximum likelihood estimation. We remark that, in contrast to classification, adversarial robustness for GP regression does not take into consideration model variance, and analyses only the most likely model among those obtained by Bayesian inference. As a consequence, the computation of adversarial robustness for regression reduces to the adversarial robustness of the posterior mean function. More specifically, Definition 3 can be checked once the value of supx T || µ(x ) µ(x)|| is known. That is, in order to decide whether a GP regression model f satisfies Definition 3 around a point x we need to solve the following problem. Problem 2 (Computation of Posterior Mean Ranges) Let T Rd be a compact subset. Consider a regression GP f and its posterior mean µi(x), i {1, . . . , m}, defined as in Equation (1). For i = 1, . . . , m, compute the minimum and maximum of the posterior mean µi(x) in T, that is: µmin,i(T) = min x T µi(x) µmax,i(T) = max x T µi(x). As for Problem 1, solving Problem 2 requires the solution of 2m non-linear optimisation problems. Similarly to classification, for regression we will develop a bound for Problem 2 in Section 5.3 and refine it through a branch-and-bound technique in Section 6. 3.3 Outline of the Approach We now give an outline of a computational scheme to solve Problems 1 and 2 introduced in Sections 3.1 and 3.2, respectively, which will be developed in detail in Section 5. We first discuss classification, and then show how the regression scenario can be obtained as a special case of classification. Classification For Problem 1, we devise a branch-and-bound optimisation scheme for the lowerand upper-bounding computation of the prediction ranges of a GP classification model over the input region T. In particular, for i = 1, . . . , m, we first compute lower and upper bounds for πmin,i(T) and πmax,i(T), that is, we compute a set of real values πL min,i(T), πU min,i(T), πL max,i(T) and πU max,i(T) such that πL min,i(T) πmin,i(T) πU min,i(T) (9) πL max,i(T) πmax,i(T) πU max,i(T). (10) We refer to πL min,i(T) and πU max,i(T) as over-approximations of the ranges, as they provide pessimistic estimation of the actual values of πmin,i(T) and πmax,i(T) for the purpose of adversarial robustness, and hence tighter guarantees. On the other hand, we refer to πU min,i(T) and πL max,i(T) as under-approximations, because they provide an optimistic estimation of the actual values that we want to compute. The branch-and-bound scheme proceeds by iterative refinement of lower and upper bounds for the minimum and maximum of prediction ranges and is illustrated in Figure 1 Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Gaussian Process Classifier Lower Bound Function Best Upper Bound on min Best Lower Bound on min Figure 1: Left: Computation of upper and lower bounds on πmin(T), i.e., the minimum of the classification range on T. Right: The search region is repeatedly partitioned into sub-regions (only first partitioning is visualised), reducing the gap between best lower and upper bounds until convergence (up to ϵ). for the simplified case of a GP with a single output value. First, we compute a lowerand an upper-bound function (the lower-bound function is depicted with a dashed red curve in Figure 1) for the GP output (solid blue curve) in the region T. We obtain this by deriving explicit bounds over the posterior mean and variance and relying on kernel bounding, which we then propagate through the predictive function. We then find the minimum of the lower-bound function, πL min(T) (shown in the plot), and the maximum of the upper bound function, πU max(T) (not shown). Then, valid values for πU min(T) and πL max(T) can be computed by evaluating the GP predictive distribution on any point in T (a specific πU min(T) is depicted in Figure 1). Next, the region T is iteratively subdivided into sub-regions (R1 and R2 in the plot), for which we compute new (tighter) bounds by repeating the procedure previously applied to T. This procedure repeats until the bounds converge up to a desired tolerance ϵ > 0. For each iteration, the bounds computed are valid, and therefore our method is anytime and can be terminated after a fixed number of iterations, at a cost of precision. The bounds on the predictive distribution depend analytically on the maximum variations of the posterior mean and variance over the region T, which we therefore need to compute beforehand. For this purpose, in Section 4, we develop an optimisation framework for the computation of a set of real values µL T,i, µU T,i, ΣL T,i,j and ΣU T,i,j that underand over-approximate the posterior mean and variance in T, i.e. µL T,i min x T µi(x) µU T,i max x T µi(x) (11) ΣL T,i,j min x T Σi,j(x) ΣU T,i,j max x T Σi,j(x), (12) for a general GP. We will utilise this framework in Section 5 to compute the desired upper and lower bounds on the ranges of the predictive posterior distribution. Adversarial Robustness Guarantees for Gaussian Processes Regression For regression, in Section 5.3 we develop a similar branch-and-bound approach to that for classification, except that (see Problem 2) we only need to consider the mean of the predictive posterior distribution (discussed in the next section). 4. Bounding Posterior Mean and Variance Function In Section 5 we will develop a method for the computation of adversarial robustness guarantees for GPs. This method utilises upper and lower bounds on the variation of the mean and variance in the compact region T. Therefore, in this section, we formulate a general framework for the computation of lower and upper bounds on the posterior mean (Section 4.1) and variance (Section 4.2) of a GP model. Hence, we propose a method for the computation of µL T,i, µU T,i, ΣL T,i,j and ΣU T,i,j that satisfy Equations (11) and (12), which will be used in Section 5. To simplify the presentation, we consider a GP with a single output value, eliding the explicit dependence on i. Since T is compact and therefore bounded, it can be covered by a finite union of hyper-boxes Tl, l = 1, . . . , n L, i.e., T Sn L l=1 Tl, and furthermore the overapproximation error can be made vanishingly small. The bounds can thus be computed for each of the boxes, Tl, and the minimum and maximum across l = 1, . . . , n L can be used as bounds for the infimum and supremum over the original set T. Thus, without loss of generality, in the following we assume that T is a box in the input space, i.e., T = [x L, x U]. We proceed by restricting the setting to kernel functions that admit an upper-bounding function U, which we propagate through inference equations to obtain bounds on mean and variance. Our construction admits analytical bounds for a large class of kernels. Definition 4 (Bounded Kernel Decomposition) Consider a one-dimensional kernel function Σ : Rd Rd R and a compact set T. We say that (ϕ, ψ, U) is a bounded decomposition for Σ in T if Σx ,x = ψ (ϕ (x , x )) and the following conditions are satisfied: 1. ϕ : Rd Rd R is linearly separable and continuously differentiable along each coordinate, so that ϕ(x , x ) = Pd j=1 ϕj(x j, x j ); 2. ψ : R R is continuously differentiable and with a finite number of flex points; 3. U is an upper bounding function such that, for any vector of coefficients c = [c1, . . . , c N] RN and finite set of associated input points [x(1), . . . , x(N)], with N N, we have that U(c) supx T PN i=1 ciϕ(x, x(i)). Intuitively, a kernel decomposition separates the part of the kernel function that depends on the two inputs (represented by ϕ) with the part of the kernel that relates their dependence to the variance of the GP (represented by ψ). Assumptions 1 and 2 usually follow immediately from the smoothness of kernel functions.5 Assumption 3 guarantees that we are able to upper bound the kernel function. The key idea is that, in view of the linearity of the inference equations for GPs, we can then propagate this bound through the inference equations to obtain bounds on the posterior mean and variance of the GP. We remark that, although not all kernel functions Σ admit kernel decomposition (for example if they are not smooth), the majority of kernel functions used in practice do. In Appendix B, we provide 5. The finite number of flex points can be guaranteed, for example, by inspecting the function derivatives. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska explicit computations for the squared exponential, the Matern, rational quadratic, and the periodic families of kernels, as well as sums and products thereof. Further, we describe the computation of bounded kernel decompositions for generalised (stationary and nonstationary) spectral kernels, which by means of Bochner s theorem can be shown to define a dense subset of the set of all the possible covariance functions (Samo and Roberts, 2015). In the remainder of the paper we assume that we are dealing with a kernel that admits a bounded decomposition. Before computing bounds on mean and variance, we state the following result (proved in Appendix A) that ensures that the knowledge of a kernel decomposition allows us to compute a Lower Bounding Function (LBF) and an Upper Bounding Function (UBF) on the kernel values, linearly on ϕ. Lemma 1 Let Σ be a kernel and (ϕ, ψ, U) be a bounded decomposition. Let T = [x L, x U] Rd be a box in the input space, then for every x Rd there exists a set of real coefficients a L, b L, a U and b U such that: g L(x) := a L + b Lϕ (x, x) Σx, x a U + b Uϕ (x, x) =: g U(x) x T. In other words, g L and g U respectively represent an LBF and a UBF for the kernel function, given a fixed input point. The above proposition allows us to explicitly compute coefficients of an LBF and a UBF on the overall kernel value, for any fixed point x in the input space. The main idea is that, since the posterior mean and variance are defined in terms of the summation and multiplication of pieces of the form Σx,x(i), for all x(i) in the training data set D, we can compute LBFs and UBFs corresponding to each point in the training set, and propagate them through the inference equations for any unseen test point in T. By the design of the upper-bounding function U, we can then use the resulting LBFs and UBFs to bound the overall mean and variance functions. This is formalised in the following two subsections. 4.1 Bounding the Posterior Mean Let T Rd be an axis aligned hyper-rectangle. In this section we show how to compute a lower bound µL T for the posterior mean function in T, i.e. such that µL T infx T µ(x), for a kernel Σ with an associated bounded kernel decomposition (ϕ, ψ, U). Analogous techniques can be used to compute an upper bound µU T by considering the function µ(x). We will then show that the bounds provided on the mean converge to the actual values as the diameter of the input region T tends to 0. For simplicity, we assume that the prior mean function is identically null (Rasmussen and Williams, 2006). Then, the posterior mean (Equation 1) can be written down as µ(x) = Σx,xt = i=1 Σx,x(i)ti. (13) A lower bound for the mean function can thus be computed analytically, as shown in the following proposition. Adversarial Robustness Guarantees for Gaussian Processes Proposition 2 Let Σ be a kernel with bounded decomposition (ϕ, ψ, U). Consider a(i) L , b(i) L , a(i) U and b(i) U , the set of coefficients for LBFs and UBFs associated to each training point x(i), i = 1, . . . , N, in an axis-aligned hyper-rectangle T Rd (computed as for Lemma 1). Define ( a(i) L , b(i) L ) = ( (a(i) L , b(i) L ), if ti 0 (a(i) U , b(i) U ), otherwise . i=1 a(i) L U([ b(1) L , . . . , b(N) L ]) inf x T µ(x). Proof By construction of the coefficients a(i) L , b(i) L , a(i) U and b(i) U we have that a(i) L + b(i) L ϕ(x, x(i)) Σx,x(i) a(i) U + b(i) U ϕ(x, x(i)). We can propagate the bounding functions through linear transformations (see Lemma 15 in Appendix A), so that we obtain Σx,x(i)ti a(i) L + b(i) L ϕ(x, x(i)) x T. (14) By summing over the index i and taking the infimum of both sides of the inequality above we obtain i=1 Σx,x(i)ti i=1 a(i) L + inf x T i=1 b(i) L ϕ(x, x(i)). (15) We then observe that infx T PN i=1 b(i) L ϕ(x, x(i)) = supx T PN i=1 b(i) L ϕ(x, x(i)) and that according to the definition of U (point 3 of Definition 4) we have that supx T PN i=1 b(i) L ϕ(x, x(i)) U([ b(1) L , . . . , b(N) L ]). By putting these two equations together we have that i=1 b(i) L ϕ(x, x(i)) U([ b(1) L , . . . , b(N) L ]). Finally, chaining the inequality above with that in Equation (15), we obtain i=1 Σx,x(i)ti i=1 a(i) L U([ b(1) L , . . . , b(N) L ]), which proves the proposition statement. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska 4.1.1 Convergence of Mean Bounds We are able to show, importantly, that the bounds provided for the mean converge uniformly to the actual mean function, when the input region T is small enough. We first state the following lemma that proves that the LBFs and UBFs given by Lemma 1 yield converging bounds. The proof is provided in Appendix A. Lemma 3 Let Σ be a kernel with bounded decomposition (ϕ, ψ, U). Let T = [x L, x U] Rd, x T, and let, for every axis-aligned hyper-rectangle R T, g R L(x) and g R U(x) be the LBF and UBF computed on R for Σ x,x using Lemma 1. Then we have that g R L and g R U converge uniformly to Σ x,x as diam(R) 0. As the lower bound that we compute on the mean over T is obtained by summing together the individual LBFs g R L computed over each training point x(i) on R, it then follows that convergence of all LBFs g R L combined with a tight bounding function U implies convergence of the posterior mean lower bound, and similarly for the upper bound. This is formally shown in the proposition below. Proposition 4 Let Σ be a kernel with bounded decomposition (ϕ, ψ, U). Then bounds for the posterior mean µL R and µU R computed through the application of Proposition 2 converge if the bounds provided by U do so. Proof We discuss the case of µL R; the arguments are analogous for µU R. We have that µ(x) = PN i=1 Σx,x(i)ti. By Proposition 2, we obtain that: i=1 ti g(i) L (x) i=1 Σx,x(i)ti, (16) where g(i) L (x) = g(i) L (x) if ti 0 and g(i) L (x) = g(i) U (x) otherwise. For Lemma 3 we have that each g(i) L converges uniformly to Σx,x(i) for each x(i). As ti is a scalar quantity, we also have that each ti g(i) L (x) converges uniformly to Σx,x(i)ti. Hence, we obtain that the bounds in Equation (16) converge uniformly as diam(R) = r 0, by virtue of being a linear combination of bounds that converge uniformly. The statement of the proposition then follows by the definition of U. Therefore, convergence of the bounds for the posterior mean is reduced to a property of the kernel bounding function U. In Appendix B we show how explicit kernel decomposition can be computed for many kernel functions used in practice, where the derived functions U converge to the actual desired values (as further discussed in Appendix B.8). 4.2 Bounding the Posterior Variance We now show how to find a lower and an upper bound for the posterior variance from Equation (2). For simplicity, we assume that Σx,x = σ2 p for all x Rd,6 so that we need 6. This is always the case for stationary kernels. In the general case Σx,x can be replaced by either its maximum or minimum value depending on whether we want to compute the minimum or the maximum of the posterior variance. Adversarial Robustness Guarantees for Gaussian Processes only to compute: min x T Σ(x) = σ2 p + min x T Σx,x SΣT x,x (17) max x T Σ(x) = σ2 p min x T Σx,x SΣT x,x. (18) We first show how an upper bound for Equation (18) can be computed by means of convex quadratic programming. 4.2.1 Variance Upper Bound The key observation is that S given in Equation (2) is a positive semi-definite matrix, so that the objective function to optimise in the case of the upper-bounding computation is a quadratic convex function on the variables Σx,x (but not on the optimisation variable x). In the following proposition, we show how the problem can be relaxed to obtain a quadratic convex program on the variable x and a suitably defined vector of slack variables. Proposition 5 Let Σ be a kernel with bounded decomposition (ϕ, ψ, U) and T = [x L, x U] a box of the input space Rd. Consider a(i) L , b(i) L , a(i) U and b(i) U , a set of coefficients for LBFs and UBFs associated to each training point x(i), i = 1, . . . , N, computed according to Lemma 1. Let r = [r(1), . . . , r(N)], ϕ(i), ϕ(i) j , for i = 1, . . . , N and j = 1, . . . , d, be slack continuous variables. Let σ2 be the solution of the following convex quadratic programming problem: min x T r Sr T subject to: r(i) + a(i) L + b(i) L ϕ(i) 0 i = 1, . . . , N r(i) a(i) U b(i) U ϕ(i) 0 i = 1, . . . , N a(i) j,L + b(i) j,Lxj ϕ(i) j 0 i = 1, . . . , N j = 1, . . . , d ϕ(i) j a(i) j,U b(i) j,Uxj 0 i = 1, . . . , N j = 1, . . . , d j=1 ϕ(i) j i = 1, . . . , N j = 1, . . . , d. Then ΣU T := σ2 p σ2 is an upper bound for the posterior variance Σ(x) in T. Proof By setting r = Σx,x in the minimum computation in Equation (18), we obtain the objective function of the problem statement, r Sr T , which is quadratic on the vector variable r. Since S is symmetric and positive semi-definite it follows that the objective function is a quadratic convex function in the slack variable vector r. In order to obtain a convex program we then need to linearise the constraint r = Σx,x We show how this is done for a generic index i = 1, . . . , N. We have that r(i) = Σx,x(i) = ψ(ϕ(x, x(i))). By Lemma 1 we obtain that a(i) L + b(i) L ϕ x, x(i) Σx,x(i) a(i) U + b(i) U ϕ x, x(i) . Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Hence, the dependence of ψ on the constraints can be linearised by considering the following over-approximation for the definition of r(i): r(i) + a(i) L + b(i) L ϕ x, x(i) 0 r(i) a(i) U b(i) U ϕ x, x(i) 0. The final step is to linearise the dependency over ϕ x, x(i) . We introduce slack variables ϕ(i) = ϕ(x, x(i)), and ϕ(i) j = ϕj(xj, x(i) j ). For Assumption 1 of Definition 4 we have that ϕ(x, x(i)) = Pd j=1 ϕj(xj, x(i) j ). Let i {1, . . . , N} and let j {1, . . . , d}, then by applying Lemma 1 with ψ := ϕj( , x(i) j ) and ϕ := x, we obtain that there exists a set of coefficients a(i) j,L, b(i) j,L, a(i) j,U and b(i) j,U such that: a(i) j,L + b(i) j,Lxj ϕj(xj, x(i) j ) a(i) j,U + b(i) j,Uxj. Hence, we can over-approximate the set of constraints ϕ(i) = ϕ(x, x(i)) and ϕ(i) j = ϕ(xj, x(i) j ) with the following set of linear constraints: a(i) j,L + b(i) j,Lxj ϕ(i) j 0 ϕ(i) j a(i) j,U b(i) j,Uxj 0 j=1 ϕ(i) j . The formula for ΣU T then follows by the definition of minimum and by Equation (18). Crucially, the proposition above casts the computation of the quantity ΣU T as the solution of a convex quadratic programming problem, for which ready-made solver software exists (Rosen and Pardalos, 1986). 4.2.2 Variance Lower Bound The situation is, unfortunately, more complicated for the lower-bounding computation of minx T Σx,x SΣT x,x. In fact, though we can write down an optimisation problem akin to that of Proposition 5, since S is positive definite we have that S is negative definite, which implies that the function we want to optimise is quadratic concave. Thus, a number of local minima may exist, and simple quadratic optimisation is not guaranteed to yield the global solution. However, as we are interested in worst-case scenario analysis, we need to compute the global minimum. Unfortunately, this is an NP-hard problem, whose exact solution would be impractical to compute. Instead, we apply the methods proposed in (Rosen and Pardalos, 1986) and proceed by computing a safe lower bound to the global minimum, that is, we want to compute a lower Adversarial Robustness Guarantees for Gaussian Processes bound to the solution of: min x T r Sr T (19) subject to: r(i) + a(i) L + b(i) L ϕ(i) 0 i = 1, . . . , N r(i) a(i) U b(i) U ϕ(i) 0 i = 1, . . . , N a(i) j,L + b(i) j,Lxj ϕ(i) j 0 i = 1, . . . , N j = 1, . . . , d ϕ(i) j a(i) j,U b(i) j,Uxj 0 i = 1, . . . , N j = 1, . . . , d j=1 ϕ(i) j i = 1, . . . , N j = 1, . . . , d. We highlight the details of the procedure applied to our specific setting below. First, we start by re-writing the constraints of the optimisation problem above in matrix form. Next, we introduce the aggregate variable vector z = [x1, . . . , xd, ϕ(1), . . . , ϕ(N), ϕ(1) 1 , . . . , ϕ(N) d ]. Since the constraints are linear, it is possible to define two matrices Ar and Az such that the optimisation problem above can be equivalently written down as: min r T Sr (20) Subject to: Arr + Azz b for suitably defined vectors b, r L, r U, z L, z U. Now, as S is symmetric and positive definite, there exists a matrix of eigenvectors U = [u(1), . . . , u(N)] and a diagonal matrix Λ of the associated eigenvalues λ(i), for i = 1, . . . , N, such that S = UΛUT . We then define ˆr(i) = u(i) r for i = 1, . . . , N, the rotated variables, and ˆr the aggregated vector of rotated variables, and compute their ranges [ˆr(i),L, ˆr(i),U] by solving the following 2N linear programming problems: min / max u(i) r Subject to: Arr + Azz b Implementing the change of variables into the optimisation problem defined in Equation (20), we obtain min ˆr T Λˆr Subject to: Aˆrˆr + Azz b ˆr L ˆr ˆr U Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska where we have set Aˆr = Ar U. We then notice that ˆr T Λˆr = PN i=1 λ(i)ˆr(i)2. Each summand is a simple one-dimensional quadratic function, for which we can find a linear LBF by relying on Lemma 1. Let α(i) and β(i) be coefficients of such LBFs, then we have that α(i)+β(i)ˆr(i) λ(i)ˆr(i),2 for all i = 1, . . . , N. Let β = [β(1), . . . , β(N)] and ˆα = PN i=1 α(i), then we can lower-bound the optimisation problem defined in Equation (20) with the following linear programming problem: min ˆα + βTˆr (21) Subject to: Aˆrˆr + Azz b ˆr L ˆr ˆr U Hence, we have that a solution of the latter problem yields a lower bound for the solution of the optimisation problem in Equation (19). That is, we have proved the following statement. Proposition 6 Let σ2 be the solution of the linear programming problem defined in Equation (21). Then ΣL T := σ2 p + σ2 is a lower bound for the posterior variance Σ(x) in T. 4.2.3 Convergence of Variance Bounds The convergence of the bounds computed for the variance to the actual values in hyperrectangles R T, with diam(R) 0, is an immediate consequence of Lemma 3, and proceeds similarly to what we have shown for the posterior mean. In fact, the objective function for the upper bound (Proposition 5) is exact, and the over-approximation results only from the feasible region of the optimisation problem. This is relaxed by using LBFs and UBFs introduced in Lemma 1, so that their uniform convergence implies that the over-approximated feasible region converges to the actual one in the limit of the diameter diam(R) tending to 0. Similarly, for the lower-bounding of the variance the only difference arises from the use of Lemma 1, also for the lower-bounding of the optimisation function. However, this also converges to the actual objective function. Thus, the exact solution of both optimisation problems converges uniformly to the actual values, for R small enough. We summarise the discussion as the following proposition. The proof is a straightforward generalisation of the proof of Proposition 4 and is therefore omitted. Proposition 7 Let Σ be a kernel with bounded decomposition (ϕ, ψ, U). Then bounds on the posterior variance, ΣL R and ΣU R, computed through the application of Propositions 5 and 6 converge if the bounds provided by U do so. 5. Bounds on Adversarial Robustness In this section we show how the lower and upper bounds for the posterior mean and variance can be propagated through the predictive distribution of a GP to compute adversarial robustness guarantees, in the sense of ensuring invariance of the GP decision to perturbations constrained to a small neighbourhood around a test point. Thus developed bound will then be included in a branch-and-bound scheme in Section 6 for its iterative refinement. Recall from Section 3.1 and Problem 1 that for classification this reduces to bounding the Adversarial Robustness Guarantees for Gaussian Processes minimum and maximum of the prediction ranges over the neighbourhood. We first discuss the bound for the two-class classification problem, and then show how the two-class bound can be extended to the multi-class setting. Finally, we discuss how to obtain guarantees for regression (Problem 2) as a particular case of the techniques derived for classification. 5.1 Bounds for Two-class Classification As discussed in Section 2, for a two-class GP it suffices to consider a one-dimensional output space, which greatly simplifies the computations. Namely, we have that the predictive posterior distribution of Equation (3) evaluated on a generic point x can be simplified to one-dimensional integral, i.e. R σ(ξ)N(ξ| µ(x), Σ(x))dξ, (22) where µ and Σ are the posterior mean and variance functions, respectively, and σ( ) denotes the likelihood function. We give analytical bounds for the case where the likelihood function is either the probit function or the logistic sigmoid, which entail the majority of applications for GP classification (Rasmussen and Williams, 2006). A general bound based on latent space discretisation is discussed for the multi-class problem in Section 5.2, and can also be used for a generic two-class likelihood function. Let µL T , µU T , ΣL T and ΣU T be lower and upper bounds for the posterior mean and variance of the GP, computed according to the methods discussed Section 4. We consider the function that describes the dependence of the predictive posterior distribution directly on the mean and variance by dropping their dependence on x: Π(µ, Σ) = Z R σ(ξ)N(ξ|µ, Σ)dξ for µ [µL T , µU T ], Σ [ΣL T , ΣU T ]. (23) Then, by definition of lower and upper bounds we have that: min µ [µL T ,µU T ] Σ [ΣL T ,ΣU T ] Π(µ, Σ) min x T π(x) and max µ [µL T ,µU T ] Σ [ΣL T ,ΣU T ] Π(µ, Σ) max x T π(x), that is, over-approximations of the prediction ranges can be found by optimising the function Π over the mean/variance box domain [µL T , µU T ] [ΣL T , ΣU T ]. In the next two subsections we show how this can be done depending on the particular form of the chosen likelihood σ. 5.1.1 Classification with the Probit Likelihood We first consider the probit likelihood, i.e., σ(ξ) = Φ(λξ) is the cdf of the univariate standard Gaussian distribution scaled by λ > 0. In this case, the predictive distribution can be written down in closed form, which greatly simplifies the computation of the bounds: We can use this explicit form to derive analytic upper and lower bounds by direct inspection of the predictive distribution derivatives with respect to the induced mean and variance Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska variables. The following proposition provides a solution for Problem 1 in the case of twoclass classification with the probit likelihood. Proposition 8 Consider a predictive posterior distribution π(x) defined as in Equation (22), input box T Rd, and µL T , µU T , ΣL T and ΣU T , lower and upper bounds on the GP posterior variance, computed as detailed in Section 4. Let σ(ξ) = Φ(λξ), with λ > 0, then we have that πL min(T) := Φ πmin(T) (24) πmax(T) Φ µU T =: πU max(T), (25) ( ΣU T if µL T 0 ΣL T otherwise Σ = ( ΣL T if µU T 0 ΣU T otherwise. Proof We have: Π(µ, Σ) = Φ µ As Φ is monotonically increasing, it suffices to optimise for the argument φ(µ, Σ) = µ By computing the partial derivatives it is easy to see that φ(µ,Σ) µ > 0 for all values of µ and Σ. Therefore, for every value of Σ the minimum is obtained for µ = µL T . On the other hand, for the derivative wrt to Σ we have that: φ(µL T , Σ) Σ < 0 if µL T > 0 = 0 if µL T = 0 > 0 if µL T < 0 as Σ > 0. Hence, given µL T , we have that φ is monotonic in Σ and the proposition follows. 5.1.2 Classification via Logistic Likelihood We now consider the case where σ is defined as the logistic sigmoid. We will show that the minimum and maximum are to be found in the same extrema as for the probit likelihood. However, as the predictive distribution cannot be expressed in closed form (Rasmussen and Williams, 2006), we first show that the derivative of the predictive distribution can be computed by passing the sign of the derivative under the integral sign. First, we note that upper and lower bounds on the variance Σ naturally induce upper and lower bounds on the standard deviation s = Σ, which we denote s L T and s U T . By substituting s in the definition of Π in Equation (23), which yields Φ(µ, s) := Π(µ, s2) = Π(µ, Σ), and changing the integration variable to t = (ξ µ)/s, we have: Π(µ, Σ) =: Φ(µ, s) = Z R h(t, µ, s)dt where h(t, µ, s) = σ(st + µ)N(t|0, 1). Adversarial Robustness Guarantees for Gaussian Processes We now want to compute Φ s . It is easy to show that all the conditions to apply differentiation under the integral sign theorem are satisfied. Thus, we have: µ = Z σ (st + µ)N(t|0, 1)dt, Φ(µ, s) s = Z tσ (st + µ)N(t|0, 1)dt. By relying on the derivatives, we can establish the following bounds. Specifically, the following proposition provides a solution for Problem 1 for two-class classification with the logistic likelihood. Proposition 9 Consider T, π(x), µL T , µU T , Σ and Σ defined as in Proposition 8. Let σ(ξ) be the sigmoid, then we have that: πL min(T) := Π µL T , Σ πmin(T) (26) πmax(T) Π µU T , Σ =: πU max(T). (27) Proof We show that the derivatives have the same sign as the probit, and then the proof follows as for probit. More specifically, we have that µ = Z σ (st + µ)N(t|0, 1)dt > 0, since the sigmoid is a monotonically increasing function. For the derivative with respect to s we want to show that s = Z tσ (st + µ)N(t|0, 1)dt = < 0 if µ > 0 = 0 if µ = 0 > 0 if µ < 0 The case for µ = 0 is trivial. For the remaining cases we have: Z tσ (st + µ)N(t|0, 1)dt = Z 0 tσ (st + µ)N(t|0, 1)dt + Z + 0 tσ (st + µ)N(t|0, 1)dt 0 tσ ( st + µ)N(t|0, 1)dt + Z + 0 tσ (st + µ)N(t|0, 1)dt 0 t σ (µ + st) σ (µ st) N(t|0, 1)dt, and Lemma 16 (see Appendix A) can be applied to get the sign of the integral, since t and N(t|0, 1) are always positive in [0, + ). Though the methods provided in this section suffice for the solution of Problem 1 in the two-class case, in Section 6 we will show how the bounds described above can be utilised to develop a branch-and-bound scheme for their refinement to ensure convergence to πmin(T) and πmax(T). Before we do this, we show in the next subsection how to compute bounds for multi-class classification. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska 5.2 Bounds for Multi-class Classification In this section we generalise the results for two-class classification. Given a class index i {1, . . . , m}, we are interested in computing upper and lower bounds on the ith component of the predictive posterior distribution πi(x) (see Equation 3) for every x T, with T an axis-aligned hyper-rectangle in the input space. For simplicity, we explicitly tackle only the softmax likelihood, but similar arguments can be applied to the case of the multidimensional probit, as well as other likelihood functions that have similar monotonicity properties. In the following we show that bounds on the multi-class predictive distribution can be computed by discretising the integral over the latent space. Proposition 10 Consider a predictive posterior distribution π(x) defined as in Equation (3), an input box T Rm, and define πmin,i(T) and πmax,i(T) as in Equation (6). Let S = {Sl = [al, bl] | l {1, . . . , M}} be a finite partition of the latent space F = Rm, with [al, bl] = [al,1, bl,1] . . . [al,m, bl,m]. Then, for i {1, . . . , m}: l=1 σi( ξl) min x T Sl N(ξ| µ(x), Σ(x))dξ l=1 σi( ξl) max x T Sl N(ξ| µ(x), Σ(x))dξ. ξl = [bl,1, . . . , bl,i 1, al,i, bl,i+1, . . . , bl,m] ξl = [al,1, . . . , al,i 1, bl,i, al,i+1, . . . , al,m]. Proof We prove the statement for the minimum; the arguments for the maximum are analogous. By simple properties of integrals and definition of the minimum we have that: πmin,i(T) = min x T Rm σ(ξ)N(ξ| µ(x), Σ(x))dξ = min x T Sl σ(ξ)N(ξ| µ(x), Σ(x))dξ l=1 min x T Sl σ(ξ)N(ξ| µ(x), Σ(x))dξ. Taking the partial derivatives of the softmax function with respect to coordinate k {1, . . . , m} we have that: ( σi(ξ)(1 σi(ξ)) if k = i σi(ξ)σk(ξ) if k = i and hence we obtain that the i-th component of the softmax function is monotonically increasing along the direction i and monotonically decreasing along all the other dimensions k = i. Thus, its minimum in a generic axis-aligned hyper-rectangle [al,1, bl,1] . . . [al,m, bl,m] will be found in the vertex defined as ξl = [bl,1, . . . , bl,i 1, al,i, bl,i+1, . . . , bl,m]. Therefore, we Adversarial Robustness Guarantees for Gaussian Processes have that the chain of inequalities above can be lower-bounded by computing the softmax on ξl and taking it outside of the integral computation, which yields: l=1 σi( ξl) min x T Sl N(ξ| µ(x), Σ(x))dξ. Summing up, Proposition 10 guarantees that, for all x T, πi(x) can be upperand lower-bounded by solving M optimisation problems over a multi-dimensional Gaussian integral. In Proposition 11 below, we show that upper and lower bounds for the integral of a multi-dimensional Gaussian distribution, such as those appearing in Proposition 10, can be obtained by optimising a marginalised product of uni-dimensional Gaussian integrals over both the input and the latent space. We first introduce the following notation. We denote with µi:j(x) the subvector of µ(x) containing only the components from i to j, with i j, and similarly we define Σi:k,j:l(x) to be the submatrix of Σ(x) containing rows from i to k and columns from j to l, with i k and j l. Proposition 11 Let S = Qm i=1[ai, bi] Rm be an axis-aligned hyper-rectangle in the latent space, and consider the posterior mean and variance functions µ(x) and Σ(x). For i {1, . . . , m 1} and f I Rm i 1, define I = (i + 1) : m and µf i (x) = µi(x) Σi,I(x) Σ 1 I,I(x)(f I µI(x)) (28) Σf i (x) = Σi,i(x) Σi,I(x) Σ 1 I,I(x) ΣT i,I(x). (29) Let SI = Qm j=i+1[ai, bi], then we have that: S N(ξ| µ(x), Σ(x))dξ am N(ξ| µm(x), Σm,m(x))dξ i=1 max x T f SI ai N(ξ| µf i (x), Σf i (x))dξ (30) S N(ξ| µ(x), Σ(x))dξ am N(ξ| µm(x), Σm,m(x))dξ i=1 min x T f SI ai N(ξ| µf i (x), Σf i (x))dξ. (31) Proof We consider the case of the minimum; the maximum follows similarly. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Consider the latent posterior process f, whose mean and variance function we denote with µ(x) and Σ(x). Then, we have S N(ξ| µ(x), Σ(x))dξ = min x T P(f(x) S) = min x T P(ai fi(x) bi, i = 1, . . . , m) = min x T P(am fm(x) bm) i=1 P(ai fi(x) bi|f I(x) SI) (By Lemma 17 included in the Appendix A) min x T P(am fm(x) bm) i=1 min f I SI P(ai fi(x) bi|f I(x) = f I) min x T P(am fm(x) bm) i=1 min x T f I SI P(ai fi(x) bi|f I(x) = f I). Notice that, for each i {1, . . . , m 1}, P(ai fi(x) bi|f I(x) = f I) is the integral of a uni-dimensional Gaussian random variable conditioned on a jointly Gaussian random variable. The statement of the proposition then follows by the application of the conditioning equations for Gaussian distributions. Proposition 11 reduces the computation of the multi-class bounds to a product of extrema computations over univariate Gaussian distributions. To solve this, we first need to compute lower and upper bounds for the conditional latent mean and the conditional latent variance defined in Equations (28) and (29). Observe that Equations (28) and (29) can be expressed as a rational function in the entries of the mean vector, variance matrix and latent variable vector. We can thus propagate the upper and lower bound of each entry from the mean vector and covariance matrix down through the rational function equations by simple interval bound propagation techniques, which results in an upper and lower bound on µf i (x) and Σf i (x) for x T and f SI, which we denote with µL,f i,T , µU,f i,T , ΣL,f i,T and ΣU,f i,T . This process can then be iterated backward from i = m to i = 1, up until all the required bounds are computed. Unfortunately, because of the need to symbolically compute a matrix inversion, the explicit formulas for the computation of µL,f i,T , µU,f i,T , ΣL,f i,T and ΣU,f i,T in general are rather convoluted and long (though still in the form of a simple ratio between polynomials). Once those bounds are computed, we rely on the following lemma for the solution of the optimisation problem over the Gaussian integrals. Lemma 12 Consider SI, ai, bi, µf i (x) and Σf i (x) defined as in Proposition 11, an input box T Rd, and µL,f i,T , µU,f i,T , ΣL,f i,T and ΣU,f i,T , lower and upper bounds on µf i (x) and Σf i (x) in T computed as discussed above. Define ζ := [x, f] and its input region as Z = T SI. Adversarial Robustness Guarantees for Gaussian Processes Let i = {1, . . . , m}, µc i = ai+bi 2 and Σc i(µ) = (µ ai)2 (µ bi)2 µ bi . Then it holds that: ai N(ξ| µf i (x), Σf i (x))dξ Z bi ai N(ξ| µ , Σ )dξ ai N(ξ| µf i (x), Σf i (x))dξ Z bi ai N(ξ| µ , Σ )dξ where we have: µ = arg min µ [µL,f i,T ,µU,f i,T ] |µc i µ|, Σ = ( ΣL,f i,T if µ [ai, bi] arg minΣ [ΣL,f i,T ,ΣU,f i,T ] |Σc i( µ ) Σ| otherwise µ = arg max µ [µL,f i,T ,µU,f i,T ] |µc i µ|, Σ = arg min Σ {ΣL,f i,T ,ΣU,f i,T } [erf(bi| µ , Σ) erf(ai| µ , Σ)]. By iterating the computation of Lemma 12 for each integral in Proposition 11, we obtain the bounds on the predictive distribution. The discretised bound can also be used for two-class classification, in cases where a likelihood function different from the probit and the logistic sigmoid is desired. 5.3 Bounds for Regression While computing adversarial robustness guarantees for classification models involves the computation of upper and lower bounds on the GP posterior predictive distribution, the analysis is much simpler for regression. As stated in Section 2 and formalised in Problem 2, for the canonical loss function the optimal decision corresponds to the posterior latent mean function µ(x) of the posterior GP distribution, whose computation is given in Section 3. Guarantees over the decision can then be made simply by relying on upper and lower bounds for the mean function, that is, µL i,T and µU i,T for every i = 1, . . . , m, which makes over-approximation of Definition 3 much faster and simpler in practice. Proposition 13 Consider a box T Rd of the input space, a test point x T, an ℓp metric || || in the output space Rm and a δ > 0. Let µ be the predictive posterior mean, and µL i,T and µU i,T , for every i = 1, . . . , m, its upper and lower bounds computed according to Proposition 2. Define µ T as the vector of entries: ( µL i,T if | µi(x ) µL i,T | | µi(x ) µU i,T | µU i,T if | µi(x ) µL i,T | < | µi(x ) µU i,T |. sup x T || µ(x ) µ(x)|| || µ(x ) µ T ||. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Consequently, if: || µ(x ) µ T || δ then the GP is δ-robust in x w.r.t. T and norm || ||. Proof By construction, we have that µi(x) [µL i,T , µU i,T ] for every x T. Hence, by monotonicity of ℓp norms along the coordinate directions and by definition of µ(x), it follows that supx T || µ(x ) µ(x)|| || µ(x ) µ T ||. Thus: δ || µ(x ) µ T || sup x T || µ(x ) µ(x)|| || µ(x ) µ(x)||, for x T. In particular, δ || µ(x ) µ(x)||, which is equivalent to Definition 3. 6. Branch-and-Bound Algorithm In this section we formulate a branch-and-bound algorithmic scheme that incorporates the lowerand upper-bounding procedures for Gaussian process models introduced in Section 5 and prove its convergence up to any a-priori specified ϵ > 0. For simplicity of exposition, we restrict the discussion to two-class classification, noting that the multi-class classification and regression problems follow analoguously by substituting appropriate bounding procedures. The main idea behind branch-and-bound optimisation is to alternate between bounding the function we are interested in optimising in our input box T and splitting T into smaller boxes, i.e., candidate search regions, on which we compute the bound in the next iteration. This procedure creates a search tree, in which descending depth implies smaller search regions. The intuition is that, as we explore the branch-and-bound search tree depth-first, the search regions become smaller, so that the bounds get closer to the true function, and we thus slowly converge to the actual optimum. By computing lower and upper bounds on the quantity of interest, we are then able to prune our search tree for regions in which optimal values cannot occur. We now describe the proposed branch-and-bound scheme for the computation of lower and upper bounds for πmin(T) derived in Section 5.1, which is summarised in Algorithm 1. After initialising πL min(T) and πU min(T) to trivial values and the exploration regions stack R to the singleton {T}, the main optimisation loop is entered until convergence (lines 2 10). Among the regions in the stack, we select the region R with the most promising lower bound (line 3). After bounding posterior mean and variance in R (line 4), we refine its lower bound using Proposition 8 for the probit likelihood and Proposition 9 for the logistic sigmoid likelihood (line 5), as well as its upper bound through evaluation of points in R (line 6). If further exploration of R is necessary for convergence (line 7), then the region R is partitioned into two smaller regions R1 and R2, which are added to the regions stack and inherit R s bound values (line 8). We perform the split by randomly selecting an index j {1, . . . , d} from the input dimensions, and by splitting R at the mid-point along the jth dimension. Finally, the freshly computed bounds local to R T are used to update the global bounds for T (line 10). Namely, πL min(T) is updated to the smallest value among the Adversarial Robustness Guarantees for Gaussian Processes Algorithm 1 Branch and bound for πmin(T) Input: Input space subset T; error tolerance ϵ > 0; latent mean/variance functions µ( ) and Σ( ). Output: Lower and upper bounds on πmin(T) with πU min(T) πL min(T) ϵ 1: Initialisation: Stack of regions R {T}; πL min(T) ; πU min(T) + 2: while πU min(T) πL min(T) > ϵ do 3: Select region R R with lowest bound πL min(R) and delete it from stack 4: Find [µL R, µU R] and [ΣL R, ΣU R] applying Propositions 2, 5 and 6 over R 5: Compute πL min(R) from [µL R, µU R] and [ΣL R, ΣU R] using Proposition 8 or 9 resp. 6: Find πU min(R) by evaluating π(x) in a point in R 7: if πU min(R) πL min(R) > ϵ then 8: Split R into two sub-regions R1, R2, add them to stack 9: Use πL min(R), πU min(R) as initial bounds for both sub-regions R1, R2 10: end if 11: Update πL min(T) and πU min(T) with current best bounds found 12: end while 13: return [πL min(T), πU min(T)] πL min(R) values for R R, while πU min(T) is set to the lowest observed value yet explicitly computed in line 6. We remark that to derive a branch-and-bound scheme for multi-class classification (respectively, regression) it suffices to replace line 5 in Algorithm 1 with the bounding methods of Section 5.2 (respectively, Section 5.3). Computation of Under-approximations As discussed in Section 3.3, in order to obtain valid values for πU min(T) and πL max(T) it suffices to evaluate the GP posterior predictive distribution in any point of T. However, the closer πU min(T) and πL max(T) are to πmin(T) and πmax(T), respectively, the faster a branch-and bound-algorithm will converge. By solving the optimisation problems associated to µL T , µU T , ΣL T and ΣU T , we obtain four extrema points in T on which the GP assumes the optimal values for the posterior mean and variance bounds. As these points belong to T and provide extreme points for the latent function, they are promising candidates for the evaluation of πU min(T) and πL max(T). We thus evaluate the GP predictive posterior distribution on all four extremal points and select the one that yields the best bound. Convergence By construction it is clear that, if Algorithm 1 terminates, the resulting values overand under-approximate the true value πmin(T) with a known error ϵ > 0. We now show, by relying on the theory of convergence for branch-and-bound algorithms, that the loop of lines 2 9 terminates in a finite number of iterations. In particular, to prove convergence of a branch-and-bound scheme up to an error ϵ > 0 it suffices to show that the two following conditions hold (Balakrishnan et al., 1991): 1. Consistency Condition: πL min(R) πmin(R) πU min(R) R T. 2. Uniform Convergence: ϵ > 0 r > 0 s.t. R T with diam(R) r |πU min(R) πL min(R)| ϵ. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Intuitively, the first condition ensures that the computed bounds are consistent across all the subsets of the initial input region T. This is clearly satisfied by construction, see Section 5.1. The second condition enforces that the lower and the upper bounds converge uniformly to each other as we reduce the maximum diameter of the branch-and-bound search region to zero. In the following theorem we show that the bound based on latent space discretisation has the uniform convergence property and converges in finitely many steps. Consequently, as the analytical bounds that we compute for the probit and the logistic function are tighter than for discretisation, they will also converge. For simplicity of exposition, we prove convergence for two-class classification, which also captures regression as a special case; we provide details below for how the result can be generalised to the multi-class case. Theorem 14 Let T be a box in the input space Rd. Consider a two-class classification GP with posterior mean and variance given by µ(x) and Σ(x). Assume that µL R, µU R, ΣL R, ΣU R are bounding functions for the posterior mean and variance such that: µL R min x R µ(x), µU R max x R µ(x), ΣL R min x R Σ(x), ΣU R max x R Σ(x) (34) whenever diam(R) 0. Then, for ϵ > 0, there exists a partition of the latent space S and r > 0 such that, for every R T with diam(R) < r, it holds that |πU min(R) πL min(R)| ϵ. (35) Proof Consider an ϵ > 0, and a generic axis-aligned hyper-rectangle R T of diameter diam(R) := r > 0 less than a fixed r > 0. We want to find a value for r for which the condition in Equation (35) is surely met. We start by observing that πU min(R) is defined by computing the predictive posterior distribution on a fixed point of R. Let x R be such a point, and define µ := µ( x) and Σ := Σ( x), then we have that πU min(R) = Z σ(ξ)N(ξ| µ, Σ)dξ. Now consider a generic M > 0; we define the discretisation of the latent space SM = {[al, bl] | l = 1 . . . , M} with the following equations: bl = σ 1 σ(al) + 1 l = 1, . . . , M al+1 = bl l = 1, . . . , M, that is, we discretise the y-axis into M equally distanced intervals and map that discretisation back to the x-axis through the link function, σ 1. We then have that the left-hand-side of Equation (35) can be written explicitly as Z σ(ξ)N(ξ| µ, Σ)dξ l=1 σ(al) min µ [µL R,µU R] Σ [ΣL R,ΣU R] al N(ξ|µ, Σ)dξ Adversarial Robustness Guarantees for Gaussian Processes Let µ ,(l) and Σ ,(l) be the solutions to the lth minimisation problems defined inside the summation of the equation above, then we have Z σ(ξ)N(ξ| µ, Σ)dξ l=1 σ(al) Z bl al N(ξ|µ ,(l), Σ ,(l))dξ al σ(ξ)N(ξ| µ, Σ)dξ σ(al) Z bl al N(ξ|µ ,(l), Σ ,(l))dξ al N(ξ| µ, Σ)dξ σ(al) Z bl al N(ξ|µ ,(l), Σ ,(l))dξ al N(ξ| µ, Σ)dξ l=1 σ(al) Z bl N(ξ| µ, Σ) N(ξ|µ ,(l), Σ ,(l)) dξ R N(ξ| µ, Σ)dξ + N(ξ| µ, Σ) N(ξ|µ ,(l), Σ ,(l)) dξ N(ξ| µ, Σ) N(ξ|µ ,(l), Σ ,(l)) dξ . (37) Now, thanks to the conditions in Equation (34), we have that as r 0 both mean and variance converge to the actual maximum and minimum values in R. By further observing that µ and Σ are by construction always inside the (vanishing) interval [µL R, µU R] [ΣL R, ΣU R], then for continuity of the Gaussian pdf we have that for each l = 1, . . . , M: N(ξ| µ, Σ) N(ξ|µ ,(l), Σ ,(l)) dξ = 0 which means that the second term in Equation (37) can be made vanishingly small, in particular less than ϵ 2. By selecting M = 2 ϵ the theorem statement holds. We have proved in Propositions 4 and 7 that the bounds for the mean and variance of Section 4 guarantee that the condition in Equation (34) holds. For multi-class classification (case m > 2), Theorem 14 can be generalised by further noticing that the error introduced by Proposition 11 also vanishes. For any ϵ > 0, to ensure that convergence holds for the multi-class problem one has to select a number of discretisation boxes of the order of 1 ϵm . 6.1 Time Complexity The method we have developed for the computation of adversarial robustness properties of GPs relies on the bounding of the posterior GP statistics, integrated within a branch-andbound scheme for the iterative refinement of the bound. Cost of Bounding Consider a kernel Σ with bounded kernel decomposition (ϕ, ψ, U), and let K denote the time complexity for the evaluation of the bounding function U. This is dependent on the particular function chosen, and in Appendix B we discuss its value for Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska each kernel that we analyse. The time complexity for the computation of the mean bound is O(m K), where m is the output dimension of the GP. The computation of an upper bound on the posterior variance requires solving a convex quadratic problem, whose computational complexity is cubic in the number of input variables (Nesterov and Nemirovskii, 1994), i.e. O((d + 2N + Nd)3), where d is the input dimensionality of the GP and N is the number of training points. Concerning the computation of the lower bound on the variance, we have to solve 2N + 1 linear programming problems, where N is the size of the training set. This again depends on the number of optimisation variables and can be done in O((d + 2N + Nd)2.5 log(d + 2N + Nd)) (Cohen et al., 2021). We emphasise that, while computing the mean is straightforward, bound computations for the variance are more involved. As a result, adversarial robustness for multi-output regression can be obtained much faster in practice than for multi-class classification. To demonstrate this, in Section 7 we investigate a multi-output regression problem with 14 output dimensions. Cost of Refinement Once the bounds on the mean and variance have been computed, refining them through branch-and-bound up to a desired threshold ϵ > 0 has a worst-case cost exponential in the number of dimensions of T. Furthermore, for multi-class classification, to guarantee convergence we have to discretise the region into a grid of size 1 ϵm , where m > 2 is the number of classes. This adds to the overall time complexity, which in the multi-class case is exponential also with respect to the number of classes. 7. Experimental Results We employ our methods to experimentally analyse the robustness of GP models in adversarial settings. We give results for four classification data sets: (i) Synthetic2D, generated by shifting a two-dimensional standard-normal either along the first (class 1) or second dimension (class 2); (ii) the SPAM data set (Dua and Graff, 2017), a binary data set with the split between the negative and positive classes respectively of 41% and 59%; (iii) a two-class subset of the MNIST data set (Le Cun, 1998) with classes 3 and 8 (i.e., MNIST38) and a three-class subset with classes 3, 5 and 8 (i.e., MNIST358); (iv) a two-class subset of Fashion MNIST (F-MNIST) (Xiao et al., 2017) with classes t-shirt/top and shirt (which we refer to as F-MNIST-TS) and a three-class subset with classes t-shirt/top , shirt and pullover (F-MNIST-TSP). Furthermore, we analyse the robustness of the Water Quality data set (Dˇzeroski et al., 2000) for multi-output regression. Training We learn the GP models using a squared-exponential kernel and zero mean prior and select the hyper-parameters by means of MLE (Rasmussen and Williams, 2006). For the Synthetic2D data set we learn the GP over 1000 training samples and test it over 200 test samples, obtaining an accuracy of 98%. For the SPAM data set we first standardise the data to zero mean and unit variance. Then, we perform feature-reduction by iteratively training an ℓ1-penalised logistic regression classifier and discarding the least relevant features, up until test set accuracy starts to diminish. This procedure leaves us with 11 features out of the initial 57. We then train a two-class classification GP over the resulting reduced feature vector. The GP thus computed achieves a test set accuracy of around 93%. Adversarial Robustness Guarantees for Gaussian Processes For MNIST and F-MNIST we first sub-sample the images to 14 14 pixels,7 and use similar learning settings as for the SPAM data set, with 1000 training samples randomly picked from the two data sets. We achieve a test set accuracy of around 98% for MNIST38 and 90% for F-MNIST-TS. For the two multi-class problems we use the softmax likelihood function and training setting similar to the two-class classification problems, obtaining a test set accuracy of around 93% for MNIST358 and 85% for F-MNIST-TSP. Finally, we standardise the Water Quality data set and use the full set of 16 input features to predict the 14 regression outputs. To do so, we learn a multi-output regression GP over a 50%/50% train/validation split of the full data set (1060 data points), obtaining a mean absolute error of around 0.15. We rely on the GPML Matlab toolbox for the training of two-class GPs (Rasmussen and Nickisch, 2010) and on the GPStufftoolbox for the training of multi-class GPs, sparse GP models and the multi-output regression model (Vanhatalo et al., 2013). Parameter Selection for the Analysis We compute adversarial robustness in neighbourhoods of the form T = [x γ, x +γ] around a given point x and for a range of γ > 0. Unless otherwise stated, we run the branch-and-bound algorithm until convergence up to an error threshold ϵ = 0.01. For MNIST38 and F-MNIST-TS we perform feature-level analysis for scalability reasons, similarly to Ruan et al. (2018). Namely, we restrict our methods to salient patches of each image only, as detected by SIFT (Lowe, 2004). We note that any other image feature extraction method can be used instead. In the remainder of this section, we discuss results concerning three types of analyses. First (Section 7.1), on four samples selected from the classification data sets, we provide empirical evidence illustrating the advantages of computing guarantees (as those provided by our branch-and-bound method) versus evaluating model robustness using gradient-based attacks for classification models. Next we consider the robustness of GPs learned by using a selection of latent-variable methods (Section 7.2) and sparse approximation techniques (Section 7.3), discussing the adversarial robustness properties of these state-of-the-art approximate inference methods. Finally, we show how the techniques developed here for adversarial robustness can be applied to perform interpretability analysis of classification GP models predictions (Section 7.4). 7.1 Local Adversarial Safety We study local adversarial safety for four points selected from the classification data sets, i.e. Synthethic2D, SPAM, MNIST38 and F-MNIST-TS data sets and summarise the results in Figure 2. To this end, we set T Rd to be a ℓ γ ball around the chosen test point and iteratively increase γ (x-axis in the second row plots), checking whether there are adversarial examples in T. Namely, if the point is originally assigned to class 1 (respectively class 2) we check whether the minimum classification probability in T is below the decision boundary threshold, that is, if πmin(T) < 0.5 (resp. πmax(T) > 0.5). We compare the values provided by our method (blue solid and dashed lines for class 2, green solid and dashed lines for class 1) with GPFGS, a gradient-based method for attacking GP mean prediction by Grosse et al. (2017) (pink curve in the plot), and Carlini & Wagner (CW) attack (Carlini 7. This reduces the number of hyper-parameters that need to be estimated by MLE and increases the numerical stability of the GP, while achieving comparable accuracy. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska 0 0.4 0.8 1.2 0.2 0 0.03 0.06 0.09 0.12 0.1 0 0.3 0.6 0.9 0 0.3 0.6 0.9 0.4 Figure 2: First row: Contour plot and test points for Synthetic2D; projected contour plot and test points for 2 dimensions of SPAM (dimensions 2 and 8 as selected by ℓ1-penalised logistic regression); sample of 8 from MNIST38 along with 10 pixels selected by SIFT; sample of shirt from F-MNIST-TS along with the 10 pixels selected by SIFT. Second row: Safety analysis for the four selected test points. Shown are the upper and lower bounds on max(T) (solid and dashed blue curves), min(T) (solid and dashed green curves), the GPFGS adversarial attack (pink curve), and the GPCW attack (violet curve). and Wagner, 2017) for the ℓ norm (GPCW) (violet curve in the plot). Naturally, as γ increases, the neighbourhood region T becomes larger, hence the confidence for the initial class can decrease. Interestingly, while our method succeeds in finding adversarial examples in all cases shown (i.e. both the lower and upper bound on the computed quantity cross the decision boundary), both GPFGS and GPCW often underestimate the effect of the worst-case perturbations, such as in Figure 2 (bottom left) for GPFGS and in Figure 2 (bottom, second figure from left) for GPCW. In particular, GPFGS, because of its local nature, tends to underestimate the true robustness values for large values of γ. On the other hand, GPCW, while more accurate for large values of γ, in some cases generates less accurate attacks than GPCW for small values of γ, such as Figure 2 (bottom left). We stress that our method provides converging bounds of the true robustness, and as a consequence GPCW and GPFSM attacks are always contained within the bounds given by our method. 7.2 Local Adversarial Robustness We now evaluate the empirical distribution of the adversarial robustness of the trained GP models. To this end, we introduce a quantitative measure of robustness analogous to that used by Ruan et al. (2018). More specifically, we consider the difference between the max- Adversarial Robustness Guarantees for Gaussian Processes Laplace EP Synthetic2D Laplace EP SPAM Laplace EP MNIST38 Laplace EP F-MNIST-TS Water Quality Figure 3: Boxplots for the distribution of robustness on the five data sets studied, comparing Laplace and EP approximation for the classification models, for γ = 0.1. imum and minimum prediction probability in the region T, δ = πU max(T) πL min(T), which utilises the computed quantities. We evaluate the moments of the empirical distribution of values of δ on 50 randomly selected test points for each of the four data sets considered. Note that a smaller value of δ implies a more robust model. Furthermore, in the classification cases, we analyse how the GP model robustness is affected by the training procedure used. To achieve this, we compare the robustness obtained when using either the Laplace or the Expectation Propagation (EP) (Rasmussen and Williams, 2006) posterior approximations technique, and investigate the influence of the number of marginal likelihood evaluations (epochs) performed during MLE hyper-parameter optimisation on robustness. Results for this analysis are depicted in Figure 3, for 10, 40 and 100 hyper-parameter optimisation epochs. As explained above, the analyses for the MNIST38 and F-MNIST-TS samples are restricted to the most influential SIFT features only, and thus δ values for them are smaller in magnitude than for the other two data sets (for which all the input variables are simultaneously changed). Interestingly, this empirical analysis demonstrates Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska that GPs trained with EP are consistently more robust than those trained using Laplace. In fact, for both Synthetic2D and MNIST38, EP yields a model about 5 times more robust than Laplace. For SPAM, the difference in robustness is the least pronounced. While Laplace approximation works by local approximations, EP calibrates mean and variance estimation by a global approach, which generally results in a more accurate approximation (Rasmussen and Williams, 2006). These results quantify and confirm for GPs that the posterior distribution is robust to adversarial attacks in the limit, as theorethically analysed by Carbone et al. (2020) in the case of over-parameterised Bayesian neural networks, of which GPs are a particular case (De Matthews et al., 2018). We observe that the values of δ decrease as the number of training epochs increases, and thus robustness improves with the increase in the number of training epochs. More training in Bayesian settings may imply better calibration of the latent mean and variance function to the observed data. Interestingly, we note that, also in the regression case, we observe the same trend as in the classification experiments, with robustness of the GP increasing with the number of hyper-parameter training epochs. 7.3 Robustness of Sparse Approximations In Section 7.2 we have empirically observed that a more refined training procedure may lead to more robust GP models. In standard GP settings it is infeasible to work with large-scale data sets that approximate the exact data manifold, as inference scales with the cube of the number of data (and storage with its square) (Bauer et al., 2016). For large-scale data sets, sparse GPs (Bauer et al., 2016) are customarily used for approximating the GP posterior distribution. While sparse GP approximations are usually evaluated in terms of mean and uncertainty calibration, here we consider adversarial robustness of GP sparse approximation techniques. As inference equations for sparse approximation can be generally cast in the form of Equations (1) (2), our methodology can be applied directly, modulo the definition of the matrix S, vector t and the vector of inducing points u (that is, the set of eventually synthetic points on which training is performed). We rely on the EP latent method and compare the results for FIC, DTC, and VAR sparse approximation methods (Qui nonero-Candela and Rasmussen, 2005) on the MNIST38 and F-MNIST-TS data sets. We vary the number of training points from 250 to 7500 and the number of inducing points (selected at random from the training points) from 100 to 500. For each of the resulting GPs we analyse the empirical distribution of δ-robustness on 50 randomly sampled test points with respect to their most relevant features (as detected by SIFT) with γ = 0.15. Results for this analysis are plotted in Figure 4, where boxplots are grouped according to the number of training points, with each boxplot in each group representing an increase of 100 inducing points (starting from 100). The test set accuracy of each GP, as estimated over 1000 test samples, is plotted in the same figure on a separate y-axis (red dots). In agreement with the literature on sparse GPs (Bauer et al., 2016), we observe that an increasing number of training and/or inducing points generally leads to more accurate models. Among the two data sets analysed here, this aspect is more pronounced on F-MNIST ( 6% increase), which poses a more complex classification task than MNIST ( 2% increase), so that the GP further benefits from more information from data. Adversarial Robustness Guarantees for Gaussian Processes 250 500 1000250050007500 mnist - FIC 250 500 1000 2500 5000 7500 fmnist - FIC 250 500 1000250050007500 mnist - DTC 250 500 1000 2500 5000 7500 fmnist - DTC 250 500 1000250050007500 mnist - VAR 250 500 1000 2500 5000 7500 fmnist - VAR Figure 4: Empirical distribution of δ-robustness for γ = 0.15. First Row: FIC sparse approximation. Second Row: DTC sparse approximation. Third Row: VAR sparse approximation. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska The robustness trends instead depend on the approximation techniques used. For FIC and VAR, we generally obtain that more training input points corresponds to an increase in the model robustness (i.e., lower value of δ). More specifically, sparse GPs successfully take into account the information from a larger pool of training samples in refining its posterior estimation. Unfortunately, for the VAR models the EP computations become numerically unstable after 2500 training samples and we have to increase the data jitter (which results in a widening of the boxplot and reduced robustness). For DTC, instead, we observe that the robustness slightly worsens in the case of MNIST and remains stable for F-MNIST. Finally, we remark that the number of inducing points has little effect on the overall robustness when compared to the number of training points used. 7.4 Interpretability Analysis Adversarial robustness and model prediction interpretability are closely linked together (Darwiche, 2020). To demonstrate this, we can utilise the bounds we compute on πmin(T) and πmax(T) to formulate an interpretability metric similar to that defined for linear classifiers in (Ribeiro et al., 2016) and implemented in a black-box tool called LIME. In particular, we consider a test point x and the one-sided input box T i γ(x ) = [x , x + γei] (where ei denotes the vector of 0s except for 1 at dimension i). We compute how much the maximum and minimum values can change over the one-sided intervals in both directions: i γ(x ) = πmax(T i γ(x )) πmax(T i γ(x )) + πmin(T i γ(x )) πmin(T i γ(x )) . If increasing the value of dimension i makes the model favour assigning lower class probabilities, we would expect this value to be negative and vice versa. Intuitively, this provides a non-linear generalisation of numerical gradient estimation, which resembles exactly the metric used by Ribeiro et al. (2016) as γ tends to 0 or if the model considered is linear. Global estimation measures can be computed by estimating the expected value of i γ(x ) with x sampled from a test set. However, since our method relies on the analytic form of the inference equations of GPs (rather than being model-agnostic, which LIME is), we are able to formally bound these quantities, which allows as to provide guarantees over interpretability results. Next, we first graphically demonstrate why linear approximation can be misleading for global interpretability analysis for the 2D-synthetic and SPAM data sets, and then show how we can rely on formal quantification of interpretability to investigate the adversarial vulnerability of a GP model around specific test points. Global Interpretability Analysis for Synthethic2D and SPAM We perform global interpretability analysis on GP models trained on the Synthetic2D and SPAM data sets, estimating the expected value of i γ with 50 random test points. The results are shown in Figure 5. For Synthetic2D (top row), LIME suggests that a higher probability of belonging to class 1 (depicted as the direction of the arrow in the plot) corresponds to lower values along dimension 1 and higher values along dimension 2. As can be seen in the corresponding contour plot in Figure 2 (top left), the exact opposite is true, however. LIME, as it is built on linearity approximations, fails to take into account the global behaviour of the GP. When using a small value of γ our approach obtains similar results to LIME. However, with γ = 2.0 the global relationship between input and output values is correctly captured. For Adversarial Robustness Guarantees for Gaussian Processes -1 -0.5 0 0.5 1 Dimension 1 Dimension 2 1 2 3 4 5 6 7 8 9 10 11 -1 Feature sensitivity Figure 5: Global interpretability, i γ, as analysed by LIME and our method. Left: Results for the Synthetic2D data set. Right: Results for the SPAM data set. SPAM, on the other hand (Figure 5, bottom), due to the linearity of the data set and the GP, a local analysis correctly reflects the global picture. Interpretability for MNIST358 and F-MNIST-TSP Predictions As shown in (Darwiche, 2020), interpretability metrics can be used to synthesise adversarial examples, because pixels and features that are deemed important for the prediction are also likely to be highly vulnerable to adversarial perturbations. These results can be used to glimpse further information about interpretability of the predictions by qualitatively examining the obtained adversarial examples. To this end, given a test point x and a point x taken from a small neighbourhood around x , we define the adversarial gap, πgap(x), as the minimum difference between the confidence in the true class and those of the remaining classes on x, so that πgap(x) < 0 implies that x is an adversarial example for x . We analyse how πgap changes as we change an increasing number of pixels, β, and compare the results obtained with our method to those of LIME. We plot the results on six images randomly selected from the MNIST358 and F-MNISTTSP data sets in Figure 6. The selected clean test images are reported in the first row of the figure, and the interpretability values are reported as a heatmap in the second row directly below the corresponding images. The colour gradient varies from red (positive impact, pixel value increase resulting in increased class probability of shown digit) to blue (negative impact, pixel value increase decreasing the class probability). The values of πgap obtained with our method (blue line) are compared with those from LIME (red line) in the third row, and in the fourth row we plot the minimal adversarial examples found with our method. We observe that, for each image, and for each value of β, relying on the values estimated by LIME leads to an over-estimation of model robustness, and in some cases (e.g., third and fifth column) more than triple the number of pixel modifications is required to find an adversarial example. We note that the adversarial examples that we obtained for MNIST and F-MNIST are qualitatively different. For the MNIST image, our method modified salient bits of the image. For digit 3, for example, the interpretability analysis retrieves a contiguous blue patch on the left, which is deemed to be the most important part for the prediction. When this is modified in adversarial fashion, the image obtained resembles an 8 in the upper part, and a 3 in the lower part, and is (understandably) classified as an 8 by the GP. Similarly, digit 5 is modified so that the lower part resembles an 8, whereas in the image of the 8 a 3 shaped contour is highlighted in the adversarial example. For Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska 0 10 20 -0.1 0 10 20 -0.1 0 10 20 -0.2 Figure 6: First row: 6 test images randomly selected from MNIST358 and F-MNIST-TSP. Second row: Interpretability metric estimation using our method. Third row: Comparison of adversarial gaps (y-axis) obtained for a given budget β (x-axis) when using our method for interpretability estimation (blue line) and when using LIME (red line). The dashed grey line represents the threshold below which an adversarial is found. Fourth row: Minimal adversarial examples found by utilising our interpretability metric. the F-MNIST image, instead, our method detects pixels that are important for the GP prediction but have little semantic meaning for a human, that is, where modifying pixels in the border of the image suffices to find adversarial examples. 8. Conclusion We presented a method for computing, for any compact region of the input space surrounding a test point, provable guarantees on the adversarial robustness of a GP model for all points in that region, up to any desired precision ϵ > 0. To achieve this, we have developed a branch-and-bound optimisation scheme that computes upper and lower bounds on the minimum and maximum of the model prediction ranges, and proved that it converges in finitely many steps up to an error tolerance ϵ > 0 selected a-priori. Adversarial Robustness Guarantees for Gaussian Processes We have experimentally evaluated our method on four classification data sets and a regression one, providing results for adversarial robustness, bounds over the predictive posterior distribution and local/global interpretability analysis. Empirically, we have observed that, in Bayesian prediction settings and with GPs, the adversarial robustness of the model increases with the accuracy of the posterior distribution approximation, and with better hyper-parameter calibration. This differs from what is generally observed in frequentist approaches to learning, for example, in deep neural networks, where better accuracy was empirically observed to imply worse adversarial robustness (Zhang et al., 2019; Su et al., 2018). We have also observed that increasing the number of training samples might still be beneficial for adversarial robustness even when using sparse approximations for GP training. One limitation of the approach presented in this paper is its high, exponential time, computational complexity. This unsurprising since the problem we are solving is nonlinear optimisation. To reduce the computational time requirement, we have formulated analytical solutions for the main types of kernels used in practical applications. We have also observed that sparse GPs, as well as improving training time, can significantly reduce the time requirement of our methods, as bounding functions need to be computed only with respect to the inducing points. We believe that the methods proposed in this paper are therefore widely applicable in practice. Acknowledgments SR is grateful for support from the UK Royal Academy of Engineering and the Oxford Man Institute. AB thanks the Konrad-Adenauer-Stiftung and the Oxford-Man Institute for their support. MK, LL and AP received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (FUN2MODEL, grant agreement No. 834115). AP and MK acknowledge partial funding from the European Union s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 722022 Affec Tech . LC is supported by a Royal Society Research Professorship. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Appendix A. Additional Lemmas and Proofs In this section we provide statements of additional lemmas referred to in the main text of the paper, as well as proofs of that were omitted for space reasons. Lemma 15 Let g L(t) = a L + b Lt and g U(t) = g U(t) = a U + b Ut be an LBF and UBF to a function g(t) t T , i.e. g L(t) g(t) g U(t) t T R. Consider two real coefficients α R and β R. Define ( αb L if α 0 αb U if α < 0 a L = ( αa L + β if α 0 αa U + β if α < 0 (38) ( αb U if α 0 αb L if α < 0 a U = ( αa U + β if α 0 αa L + β if α < 0. (39) g L(t) := a L + b Lt αg(t) + β a U + b Ut =: g U(t). That is, LBFs can be propagated through linear transformation by redefining the coefficients through Equations (38) (39). Proof The proof is an immediate consequence of multiplying the inequalities g L(t) g(t) g U(t) with the coefficients α and β and re-writing the new inequality using the constants defined in Equations (38) (39). Lemma 16 Consider the sigmoid function σ(x) = 1 1+e x .Let z > 0, then we have: ( > σ (µ + z) if µ > 0 < σ (µ + z) if µ < 0. Proof Let µ > 0; the proof when µ < 0 is similar, because σ is an even function. Case 1: µ z 0. Since σ is strictly concave in [0, + ), the derivative is a monotonic crescent in the relevant region. Thus, σ (µ z) > σ (µ + z). Case 2: µ z < 0. Since σ is even we have σ (µ z) = σ (z µ). Now z µ > 0, similarly to Case 1 we have σ (z µ) < σ (z + µ), which proves the lemma. Lemma 17 Let X and Y be random variables with joint density function f X,Y . Consider measurable sets A and B. Then, it holds that P(X A|Y B) inf y B P(X A|Y = y). Adversarial Robustness Guarantees for Gaussian Processes P(X A|Y B) = P(X A, Y B) A f X,Y (x, y)dxdy R B f Y (y)dy A f X|Y (x|y)f Y (y)dxdy R B f Y (y)dy B f Y (y)dy infy B R A f X|Y (x|y)dx R B f Y (y)dy = inf y B P(X A|Y = y). A.1 Proof of Lemma 1 Proof We show how to compute a L and b L; the same arguments also apply to the computation of a U and b U by simply considering Σx, x. Consider c L = 1 and c U = 1 coefficients associated to the input point x. Let ϕL = U(c L) and ϕU = U(c U), then by Assumption 3 of bounded kernel decomposition we have that ϕ(x, x) [ϕL, ϕU] for all x T. Consider now the function ψ restricted to the interval [ϕL, ϕU]. Then there are four cases to consider for ψ. Case 1 If ψ happens to be concave in [ϕL, ϕU], then, by definition of concave function, a lower bound is given by the line that links the points (ϕL, ψ(ϕL)) and (ϕU, ψ(ϕU)), that is, g L is simply the LBF with coefficients: b L = ψ(ϕL) ψ(ϕU) a L = ψ(ϕL) b LϕL. Case 2 If ψ happens to be a convex function, then, by definition of convex function and by the differentiability of ψ, a valid lower bound is given by the tangent line in the middle point ϕC = (ϕL + ϕU)/2 of the interval, that is, g L is the LBF with coefficients: b L = dψ(ϕC) dϕ a L = ψ(ϕL) b LϕL. Case 3 Assume now that ψ is concave in [ϕL, ϕF ], and convex in [ϕF , ϕU] (the arguments are very similar if we assume the first interval is that in which ψ is convex and the second concave). In other words, there is only one flex point ϕF (ϕL, ϕU). Let a L, b L be coefficients for linear lower approximation in [ϕL, ϕF ] and a L, b L analogous coefficients in [ϕF , ϕU] (respectively computed as for Case 1 and Case 2 above), and call g and g the corresponding functions. Define g L to be the LBF function of coefficients a L and b L that goes through the two points (ϕL, min(g (ϕL), g (ϕL))) and (ϕU, g (ϕU)). We then have that g L is a valid lower bound function for ψ in [ϕL, ϕU]. In order to prove this we distinguish between two cases: Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska 1. If min(g (ϕL), g (ϕL)) = g (ϕL), then we have that g L(ϕL) = g (ϕL) g (ϕL), and g L(ϕU) = g (ϕU). Hence, because of linearity, g L(ϕ) g (ϕ) in [ϕL, ϕU], and in particular in [ϕF , ϕU] as well. This also implies that g L(ϕF ) g (ϕF ) g (ϕF ). On the other hand, g L(ϕL) = g (ϕL), hence g L(ϕ) g (ϕ) in [ϕL, ϕF ]. Combining these two results and by construction of g and g we have that g L(ϕ) ψ(ϕ) in [ϕL, ϕU]. 2. If min(g (ϕL), g (ϕL)) = g (ϕL), then in this case we have g L = g , and just have to show that g(ϕ) g (ϕ) in [ϕL, ϕF ]. This immediately follows by noticing that g (ϕF ) g (ϕF ) and g (ϕL) g (ϕL). Case 4 In the general case, as we have a finite number of flex points, we can divide [ϕL, ϕU] in subintervals in which ψ is either convex or concave. We can then proceed iteratively from the two left-most intervals by repeatedly applying Case 3. A.2 Proof of Lemma 3 Proof We prove the lemma for the LBF. An analogous argument can be made for the UBF. Letting ϵ > 0, we want to find an r > 0 such that diam(R) < r implies maxx R |g R L(x) Σ x,x| < ϵ. Consider ϕR L and ϕR U, lower and upper bound values for ϕ in R. By taking r small enough we can assume without loss of generality that ψ(ϕ) has at most one flex point in [ϕR L, ϕR U]. We then have the following three cases. CASE 1: if ψ(ϕ) is concave then g R L is defined as the line connecting the two extreme points of the interval [ϕR L, ϕR U]. Since ψ(ϕ) is concave, we have that it obtains its minimum in one of these two extrema, so that we have min x R g R L(x) = min x R Σ x,x. By Assumption 2 of kernel decompositions (see Definition 4), it follows that ψ is Lipschitz continuous on any compact interval, so that we have that min x R Σ x,x max x R Σ x,x where r = diam(R). Putting the two results together we have that the difference between minx R g R L(x) and maxx R Σ x,x vanishes whenever that r tends to zero, which proves the statement. CASE 2: if ψ(ϕ) is convex then g R L is the Taylor expansion of ψ(ϕ) around the mid-point of the interval, truncated at the first-order. By continuity of ϕ we then obtain that shrinking r shrinks also the width of the interval [ϕR L, ϕR U], which then, relying on the properties of truncation error of Taylor expansions, proves the lemma statement. CASE 3: in the case in which a flex point exists, g R L is defined to be the maximum line that is below the two LBFs respectively defined over the convex and the concave part of the interval. Since by Case 1 and Case 2 these converge, we also have that g R L converges. Adversarial Robustness Guarantees for Gaussian Processes A.3 Proof of Lemma 12 Proof We provide the proof for the minimum; similar arguments also hold for the maximum. By definition of µL T , µU T , ΣL T , ΣU T , we have that for every x T, µ(x) [µL T , µU T ] and Σ(x) [ΣL T , ΣU T ]. Thus: a N(ξ| µ(x), Σ(x))dξ min µ [µL T ,µU T ] Σ [ΣL T ,ΣU T ] a N(ξ|µ, Σ)dξ = 1 2 min µ [µL T ,µU T ] Σ [ΣL T ,ΣU T ] 2 min µ [µL T ,µU T ] Σ [ΣL T ,ΣU T ] where we have set Φ(µ, Σ) := erf µ a . By looking at the partial derivatives we have that: 2Σ e (µ a)2 2Σ 0 µ a + b and that if µ [a, b]: (µ bi)e (µ bi)2 2Σ2 (µ ai)e (µ ai)2 Σ (µ a)2 (µ b)2 µ b = Σc(µ) as otherwise the last inequality has no solutions. As such, µc and Σc will correspond to global maximum with respect to µ and Σ, respectively. As Φ is symmetric w.r.t. µc we have that the minimum value w.r.t. to µ is always obtained for the point furthest away from µc, that is, at µ = arg maxµ [µL T ,µU T ] |µc µ|. The minimum value w.r.t. to Σ will hence be either for ΣL T or ΣU T , that is Σ = arg minΣ {ΣL T ,ΣU T } Φ( µ , Σ). Appendix B. Kernel Function Decomposition In this section, we compute explicit kernel decompositions (ϕ, ψ, U) for several kernels of practical relevance in applications. In particular, we give explicit formulas for the squaredexponential, the rational quadratic, the Mat ern (for half-integer values) and the periodic kernels, along with how kernel decomposition can be propagated through addition and multiplication with kernels. We remark that the formula for addition and multiplication can be used recursively so to obtaine bounded decomposition ofr variously composed kernels. Furthermore, we show how to compute kernel decompositions for generalised spectral kernels, both in the stationary and non-stationary case. Throughout this section, we assume T = [x L, x U], for some x L, x U Rd. For building the bounding function U we use the notation x(i), i = 1, . . . , N for the set of input points, and ci, i = 1, . . . , N, for their associated multiplying coefficients. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska B.1 Squared-Exponential Kernel For the squared-exponential kernel, we build a bounded kernel decomposition by setting: ψ(ϕ) = σ2 exp ( ϕ) ϕ(x , x ) = j=1 θj(x j x j )2. It is straightforward to notice that Assumptions 1 and 2 of Definition 4 are met by this decomposition. Concerning the definition of U, consider a set x(1), . . . , x(N) of N points in the input space and associated real coefficients c1, . . . , c N. For a hyper-rectangle T = [x L, x U] we have that: i=1 ciϕ(x, x(i)) = sup x T j=1 θj(xj x(i) j )2 = sup x T i=1 ci(xj x(i) j )2 i=1 cix2 j 2θj i=1 cix(i)xj + θj i=1 cix(i)2 ! j=1 sup xj [x L j ,x U j ] i=1 cix2 j 2θj i=1 cix(i)xj + θj i=1 cix(i)2 ! The right-hand-side of the last equation simply involves the computation of the maximum of a 1-d parabola over an interval of the real line, which can be done exactly and in constant time by simple inspection of the derivative function and by evaluating the function at the extrema of the interval. Call xj the only critical point of the jth parabola, and denote with hj(xj) = αjx2 j + βjxj + γj the parabola associated with the jth coordinate value, with αj = θj PN i=1 ci, βj = 2θj PN i=1 cix(i) and γj = θj PN i=1 cix(i)2, then we set j=1 Uj(c), (40) ( max {hj(x L j ), hj(x U j ), hj( xj)} if xj [x L j , x U j ] max {hj(x L j ), hj(x U j )} otherwise . Furthermore it follows that the time complexity for the computation of U in the squaredexponential case is O(N + d). Adversarial Robustness Guarantees for Gaussian Processes B.2 Rational Quadratic Kernel An analogous argument to the above holds for the rational quadratic kernel, where we can set ψ(ϕ) = σ2 1 + ϕ ϕ(x , x ) = j=1 θj(x j x j )2. As the definition of ϕ is exactly the same as for the squared-exponential kernel, the bounding function U can be defined as in Equation (40). B.3 Mat ern Kernel For half-integer values, the explicit form of the Mat ern Kernel allows us to find an analogous kernel decomposition to the two discussed above: ψ(ϕ) = σ2kp exp ( q l=0 kl,p p lq ϕ(x , x ) = j=1 θj(x j x j )2. B.4 Periodic Kernel For the periodic kernel we define ψ(ϕ) = σ2 exp( 0.5ϕ) ϕ(x , x ) = j=1 θj sin(pj(x j x j ))2. Assumptions 1 and 2 are trivially satisfied because of the smoothness of ψ and ϕ. For the definition of the bounding function U we have that: i=1 ciϕ(x, x(i)) = sup x T j=1 θj sin(pj(xj x(i) j ))2 (41) j=1 sup xj [x L j ,x U j ] ciθj sin pj(xj x(i) j ) 2 . The supremum in the final equation can be obtained by simply inspecting the derivative of ciθj sin pj(xj x(i) j ) 2 and its function value at the extrema of each interval [x L j , x U j ]. Let Uij(ci) be the value computed in such a way for each i and j, then we define: i=1 Uij(ci). (42) Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Furthermore it follows that the time complexity for the computation of U in the squaredexponential case is O(Nd). B.5 Kernel Addition Consider now the case in which the kernel function Σ is defined by linear composition of two kernels Σ and Σ such as: Σx ,x = k Σ x ,x + k Σ x ,x x , x Rd, (43) for some given k and k 0. Then, we have that kernel decomposition for Σ and Σ can be simply propagated through the sum. To see that, let (ϕ , ψ , U ) and (ϕ , ψ , U ) be the two kernel decomposition. Then, by simply summing up the LBFs and UBFs for Σ and Σ , Lemma 1 can be generalised to this case as follows. Proposition 18 Let g L, g U, g L and g U be lower and upper bounding function for Σ x, x and Σ x, x, for all x T, as computed in Lemma 1. Then g L(x) = k g L(x) + k g L(x) g U(x) = k g U(x) + k g U(x) are respectively lower and upper bounding functions on Σx, x. As a consequence of the above proposition, it immediately follows that the infimum of the posterior mean function over the compact set T can be safely lower-bounded for the kernel Σ by setting: µL T = k µ L T + k µ L T , where µ L T and µ L T are computed by applying Proposition 2 to the kernels Σ and Σ . Similarly, Propositions 5 and 6 can be generalised by considering two sets of slack variables, one associated to ϕ and one to ϕ , and relying directly on the lowerand upper-bounding functions defined in Proposition 18. B.6 Kernel Multiplication When two kernels are combined through multiplication, we have that Σx ,x = Σ x ,x Σ x ,x . This case can be reduced to the addition by considering the following Mc Cormick s inequalities (Mc Cormick, 1976): Σx ,x = Σ x ,x Σ x ,x Σ LΣ x ,x + Σ x ,x Σ L Σ LΣ L (44) Σx ,x = Σ x ,x Σ x ,x Σ UΣ x ,x + Σ x ,x Σ L Σ UΣ L, (45) where Σ L, Σ U, Σ L and Σ U are lower and upper bound values for Σ and Σ in T, respectively. Then we can proceed by using the kernel summation of Equation (44) when computing lower bounding function on the kernel, and Equation (45) when computing the upper bounding function, and by using the techniques discussed in the section just above. Adversarial Robustness Guarantees for Gaussian Processes B.7 Generalised Spectral Kernel We show how to find kernel decompositions compatible with our optimisation framework for generalised spectral kernels (Samo and Roberts, 2015). We note that these are dense in the space of kernel functions, so that they can be used to derive any kernel up to an arbitrary small error tolerance. Stationary Kernel For the stationary case, we have: k=1 σ2h((x x ) θ(k)) cos(w T k (x x )), (46) where θ(k) 0, and h is any given positive definite function; in particular, we choose h((x x ) θ(k)) = exp Pm j=1 θ(k) j (x x )2 . We now show how a bounded kernel decomposition (ϕ, ψ, U) can be derived for this kernel. We first observe that the kernel is obtained by summing over K different kernel components. According to the results for kernel addition described in Appendix B.5, it suffices to find a bounded kernel decomposition for each summand of Equation (46), i.e., for k(x , x ) = σ2 exp j=1 θj(x x )2 cos(w T (x x )). In turn, by setting k1(x , x ) = σ2 exp Pd j=1 θj(x x )2 and k2(x , x ) = cos(w T (x x )), we have that k(x , x ) = k1(x , x )k2(x , x ), and thus a kernel decomposition can be found by using the formulas for kernel multiplication derived in Appendix B.6 to k1(x , x ) and k2(x , x ). Observe that k1(x , x ) has the same shape as the squared-exponential kernel, for which kernel decomposition was derived in Appendix B.1. For k2(x , x ), we set ϕ(x , x ) = j=1 wj(x j x j ) (47) ψ(ϕ) = cos(ϕ). (48) We note that Assumptions 1 and 2 of Definition 4 are satisfied by this decomposition. For the definition of a bounding function U, i.e., Assumption 3, we have i=1 ciϕ(x, x(i)) = sup x T j=1 wj(xj x(i) j ) = sup x T i=1 ciwj(xj x(i) j ) = i=1 wjx(i) j = j=1 sup x T wjxj β, where wj = PN i=1 ci wj and β = Pd j=1 PN i=1 wjx(i) j . As the above is a linear form, we have that the supremum of wjxj occurs in the point x j = x U j if wj 0 and in x j = x L j otherwise. Thus, we have that Uk2(c) = Pd j=1 wjx j β is a valid upper-bound function for the sub-kernel k2. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Non-Stationary Kernel In the non-stationary case, we have: k=1 σ2 k k x θ(k), x θ(k) Ψk(x )T Ψk(x ), where θ(k) 0, k is a positive semi-definite, continuous and integrable function and Ψk(x) = h cos x T w(k) 1 + cos x T w(k) 2 , sin x T w(k) 1 + sin x T w(k) 2 i . In particular we choose k x θ, x θ = k(x θ)k(x θ) = ex θex θ = e P j θj(x j+x j ). Proceeding similarly to the case of stationary kernels, we can analyse each summand and factor in isolation. The final decomposition can then be obtained by using the addition and multiplication formulas for kernel decompositions derived in Appendix B.5 and B.6. For k (x , x ), we select ϕ(x , x ) = X j θj(x j + x j ) Since ϕ has exactly the same shape as for that in Equation (47), a similar argument can be used for finding the upper-bound function. It remains only to find a decomposition for Ψk(x )T Ψk(x ). In particular, we have Ψ(x )T Ψ(x ) = cos(x T w1) + cos(x T w2) cos(x T w1) + cos(x T w2) + sin(x T w1) + sin(x T w2) sin(x T w1) + sin(x T w2) = cos(x T w1) cos(x T w1) + cos(x T w1) cos(x T w2) + cos(x T w2) cos(x T w1) + cos(x T w2) cos(x T w2) + sin(x T w1) sin(x T w1) + sin(x T w1) sin(x T w2) + sin(x T w2) sin(x T w1) + sin(x T w2) sin(x T w2) Again, we can focus on the single factor from the equation above, and rely on the addition and multiplication formulas to obtain the overall result. We consider the first factor, cos(x T w1) cos(x T w1), and select: ϕ(x , x ) = cos(x T w1) cos(x T w1) For the computation of the upper-bound function, we have the following: i ci cos(x(i),T w1) cos(x T w1) X i sup x ci cos(x(i),T w1) cos(x T w1) = X i sup x γi cos(x T w1), where we define γi = ci cos(x(i),T w1). It is thus straightforward to find the maximum of the right-hand-side equation by inspecting the derivatives of the cosine function. Adversarial Robustness Guarantees for Gaussian Processes 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 Figure 7: Convergence of upper and lower bounds to the maximum and minimum estimated via grid search for the Synthetic2D data set for varying values of γ. Each column corresponds to the converging computation of the branch-and-bound algorithm for up to a maximum specified number of iterations, namely (from left to right): 10, 100 and 10000. Top row: Lower bound (solid line) and estimated minimum (dashed line) for a GP trained on the Synthethic2D data set on a test point from class 2. Bottom row: Upper bound (solid line) and estimated maximum (dashed line) for a GP trained on the Synthethic2D data set on a test point from class 1. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Figure 8: Boxplots of the empirical distribution for the gap (πgap) between the bound and the optimum value estimated for 50 test points selected at random from the Synthetic2D test data set plotted against the number of branch-and-bound iterations. B.8 Convergence of Kernel Bounding functions As stated in the proof of Theorem 14, the finite-time convergence of our branch-and-bound methodology relies on the convergence discussed in Propositions 4 and 7, which in turn rests on the convergence of the kernel bounding function U to the actual supremum as the diameter of the input region T shrinks to zero. The convergence of U itself naturally depends on the explicit form derived for each specific kernel type. For the kernels that we have discussed above the following observations hold, from which convergence can be shown to follow. For the squared-exponential kernel, the rational quadratic kernel, the Mat ern kernel and for the kernel addition, the bound U we provide is the exact analytical solution of the supremum computations. As a consequence, the bounds trivially converge as the compact region T decreases in size. Regarding the periodic kernel, in Section B.4 we have shown how to compute a bounding function U that is an over-approximation of the supremum. The over-approximation arises because, in the equations for the supremum computation, we pass the supremum under the sign of summation (Equation 41). However, it is easy to see that, for every finite set of continuous functions gi(x) : Rm R, i = 1, . . . , N, if we set r = diam(T) then supx T PN i=1 gi(x) is equal to PN i=1 supx T gi(x) in the limit of r 0, while T remains compact. Therefore, the bound U in Equation (42) converges to the actual supremum as T shrinks, and hence the branch-and-bound computation on the periodic kernel converges too. For the multiplication of kernels, Mc Cormick inequalities are known to converge to the actual multiplication when the region of variability of the multiplication variables decreases (Mc Cormick, 1976), and thus convergence of the bound U in the case of multiplication of kernels ultimately depends on the convergence of the upper and lower bounding Σ L, Σ U, Σ L and Σ U that we compute. These, in turn again, depend on the specific forms of the sub- Adversarial Robustness Guarantees for Gaussian Processes kernels. If the kernels used are squared-exponential, rational quadratic, Mat ern kernel, periodic kernel or their addition, then convergence follows by the above argument. Finally, for the generalised spectral kernels, in the stationary case we have that for kernels derived as a summation or multiplication of kernels, for which we compute the exact form of U, convergence follows as above. For the non-stationary case, we have to compute supx P i ci cos(x(i),T w1) cos(x T w1), for which we again give an upper bound by swapping the summation and the supremum signs, as we did in the case of the periodic kernel. Hence, convergence of the function U as T reduces in size follows by a similar argument. Appendix C. Empirical Convergence Analysis We empirically investigate the convergence of our branch-and-bound methodology on two points selected from the Synthetic2D test data set. In particular, as an exact computation of πmin(T) and πmax(T) is not possible, we compare the bounds with an empirical approximation obtained by discretising each γ-ball using 10000 grid points, and computing the minimum and maximum over the grid by brute force search. We refer to the minimum and maximum thus estimated, respectively, as ˆπmin and ˆπmax. We report the results of this analysis in Figure 7 for two specular points selected from the test set, namely [ 0.4, 0.4] (top row) and [0.4, 0.4] (bottom row). We vary γ from 0 to 1 and report the results for three different values of the maximum number of branchand-bound iterations (from left to right: 10, 100 and 10000). First, we empirically confirm that the bounds are entirely safe, since the lower bound is always below ˆπmin and the upper bound is always above ˆπmax. It is interesting to note that there is a non-trivial relationship between the value of γ and the tightness of the bounds. When γ is equal to 0.4 the point that optimises the adversarial classification probability happens to be one of the vertices of the initial search space (i.e., the point [0, 0]), to which the branch-and-bound algorithm immediately converges. Convergence is slower when the optimal points lie further away from the vertices of the initial search space. However, after 10000 iterations the two curves overlap almost perfectly. In Figure 8, we show the boxplots of the empirical distribution for the gap (πgap) between the bound and the optimum value estimated for 50 test points selected at random. After 250 iterations the gap between our bound and the brute force empirical estimation is already, on average, almost zero. Hamzah Abdelaziz. A Data-Driven Approach for Modeling, Analysis and Control of Stochastic Hybrid Systems using Gaussian Processes. Ph D thesis, 2017. Venkataramanan Balakrishnan, Stephen Boyd, and Silvano Balemi. Branch and bound algorithm for computing the minimum stability degree of parameter-dependent linear systems. International Journal of Robust and Nonlinear Control, 1(4):295 317, 1991. Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In International Conference on Neural Information Processing Systems, pages 1533 1541, 2016. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Leonard Berrada, Sumanth Dathathri, Robert Stanforth, Rudy Bunel, Jonathan Uesato, Sven Gowal, M Pawan Kumar, et al. Verifying probabilistic specifications with functional Lagrangians. ar Xiv preprint ar Xiv:2102.09479, 2021. Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317 331, 2018. Arno Blaas, Andrea Patane, Luca Laurenti, Luca Cardelli, Marta Kwiatkowska, and Stephen Roberts. Adversarial robustness guarantees for classification with Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 3372 3382. PMLR, 2020. Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, pages 5760 5770, 2018. John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. ar Xiv preprint ar Xiv:1707.02476, 2017. Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research, 21:42:1 42:39, 2020. Ginevra Carbone, Matthew Wicker, Luca Laurenti, Andrea Patane, Luca Bortolussi, and Guido Sanguinetti. Robustness of Bayesian neural networks to gradient-based attacks. In Advances in Neural Information Processing Systems, 2020. Luca Cardelli, Marta Kwiatkowska, Luca Laurenti, Nicola Paoletti, Andrea Patane, and Matthew Wicker. Statistical guarantees for the robustness of Bayesian neural networks. In International Joint Conference on Artificial Intelligence, pages 5693 5700, 7 2019a. Luca Cardelli, Marta Kwiatkowska, Luca Laurenti, and Andrea Patane. Robustness guarantees for Bayesian inference with Gaussian processes. In AAAI Conference on Artificial Intelligence, volume 33, pages 7759 7768, 2019b. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39 57, 2017. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310 1320. PMLR, 2019. Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM, 68(1):1 39, 2021. Adnan Darwiche. Three modern roles for logic in AI. In ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 229 243, 2020. Adversarial Robustness Guarantees for Gaussian Processes AGG De Matthews, J Hron, M Rowland, RE Turner, and Z Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml. Saˇso Dˇzeroski, Damjan Demˇsar, and Jasna Grbovi c. Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13(1):7 17, 2000. Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy, pages 3 18, 2018. Kathrin Grosse, David Pfaff, Michael Thomas Smith, and Michael Backes. How wrong am I? Studying adversarial examples and their impact on uncertainty in Gaussian process machine learning models. ar Xiv preprint ar Xiv:1711.06598, 2017. Kathrin Grosse, David Pfaff, Michael T Smith, and Michael Backes. The limitations of model uncertainty in adversarial settings. ar Xiv preprint ar Xiv:1812.02606, 2018. Daniel Hern andez-Lobato, Jos e M Hern andez-Lobato, and Pierre Dupont. Robust multiclass Gaussian process classification. In Advances in Neural Information Processing Systems, pages 280 288, 2011. Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pages 3 29. Springer, 2017. Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97 117. Springer, 2017. Hyun-Chul Kim and Zoubin Ghahramani. Outlier robust Gaussian process classification. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recognition, pages 896 905. Springer, 2008. Yann Le Cun. The MNIST database of handwritten digits. 1998. URL http://yann.lecun. com/exdb/mnist/. David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91 110, 2004. Garth P Mc Cormick. Computability of global solutions to factorable nonconvex programs: Part I - Convex underestimating problems. Mathematical Programming, 10(1):147 175, 1976. Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Conference on Uncertainty in Artificial Intelligence, pages 362 369. Morgan Kaufmann Publishers Inc., 2001. Patan e, Blaas, Laurenti, Cardelli, Roberts and Kwiatkowska Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994. Arnold Neumaier. Complete search in continuous global optimization and constraint satisfaction. Acta Numerica, 13:271 369, 2004. Joaquin Qui nonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec): 1939 1959, 2005. Carl Edward Rasmussen and Hannes Nickisch. Gaussian Processes for Machine Learning (GPML) toolbox. Journal of Machine Learning Research, 11(Nov):3011 3015, 2010. Carl Edward Rasmussen and Christopher K Williams. Gaussian Processes for Machine Learning. MIT press Cambridge, MA, 2006. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you? Explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135 1144. ACM, 2016. J Ben Rosen and Panos M Pardalos. Global minimization of large-scale constrained concave quadratic problems by separable programming. Mathematical Programming, 34(2):163 174, 1986. Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. Reachability analysis of deep neural networks with provable guarantees. In International Joint Conference on Artificial Intelligence, pages 2651 2659, 2018. Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. ar Xiv preprint ar Xiv:1506.02236, 2015. Michael Thomas Smith, Kathrin Grosse, Michael Backes, and Mauricio A Alvarez. Adversarial vulnerability bounds for Gaussian process classification. ar Xiv preprint ar Xiv:1909.08864, 2019. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257 1264, 2005. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In European Conference on Computer Vision, pages 631 648, 2018. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Jarno Vanhatalo, Jaakko Riihim aki, Jouni Hartikainen, Pasi Jyl anki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(Apr):1175 1179, 2013. Adversarial Robustness Guarantees for Gaussian Processes Matthew Wicker, Luca Laurenti, Andrea Patane, and Marta Kwiatkowska. Probabilistic safety for Bayesian neural networks. In Conference on Uncertainty in Artificial Intelligence, pages 1198 1207. PMLR, 2020. Matthew Wicker, Luca Laurenti, Andrea Patane, Zhuotong Chen, Zheng Zhang, and Marta Kwiatkowska. Bayesian inference with certifiable adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pages 2431 2439. PMLR, 2021. Christopher KI Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342 1351, 1998. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-offbetween robustness and accuracy. In International Conference on Machine Learning, pages 7472 7482. PMLR, 2019. Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pages 4939 4948, 2018.