# Bayesian Transformed Gaussian Processes

Published in Transactions on Machine Learning Research (04/2023)

Xinran Zhu (xz584@cornell.edu), Cornell University, Center for Applied Mathematics
Leo Huang (ah839@cornell.edu), Cornell University, Department of Computer Science
Eric Hans Lee (eric.lee@intel.com), SigOpt, an Intel Company
Cameron Ibrahim (cibrahim@udel.edu), University of Delaware, Department of Computer and Information Sciences
David Bindel (bindel@cornell.edu), Cornell University, Department of Computer Science

Reviewed on OpenReview: https://openreview.net/forum?id=4zCgjqjzAv

Abstract: The Bayesian transformed Gaussian (BTG) model, proposed by Kedem and Oliveira in 1997, was developed as a Bayesian approach to trans-Kriging in the spatial statistics community. In this paper, we revisit BTG in the context of the modern Gaussian process literature by framing it as a fully Bayesian counterpart to the warped Gaussian process that marginalizes out a joint prior over input warping and kernel hyperparameters. As with any other fully Bayesian approach, this treatment introduces prohibitively expensive computational overhead; unsurprisingly, the BTG posterior predictive distribution, itself estimated through high-dimensional integration, must be inverted in order to perform model prediction. To address these challenges, we introduce principled numerical techniques for computing with BTG efficiently, using a combination of doubly sparse quadrature rules, tight quantile bounds, and rank-one matrix algebra to enable both fast model prediction and model selection. These efficient methods allow us to compute with higher-dimensional datasets and to apply BTG with layered transformations that greatly improve its expressibility. We demonstrate that BTG achieves superior empirical performance over MLE-based models in the low-data regime, situations in which MLE tends to overfit.
*These authors contributed equally to this work.

1 Introduction

Gaussian processes (GPs), also known as Kriging models in the spatial statistics community (Cressie, 1993), are a powerful probabilistic learning framework that includes a marginal likelihood representing the probability of the data given only the GP hyperparameters. The marginal likelihood automatically balances model fit and complexity terms to favor simple models that explain the data well.

A GP assumes normally distributed observations. In practice, however, this condition is not always adequately met. The classic approach to moderate departures from normality is trans-Gaussian Kriging, which applies a normalizing nonlinear transformation to the data (Cressie, 1993). This idea was reprised and expanded upon in the machine learning literature. One instance is the warped GP (WGP), which maps the observation space to a latent space in which the data is well modeled by a GP, and which learns GP hyperparameters through maximum likelihood estimation (Snelson et al., 2004). The WGP paper employs a class of parametrized hyperbolic tangent transformations. Later, Rios & Tobar (2019) introduced compositionally warped GPs (CWGP), which chain together a sequence of parametric transformations with closed-form inverses. Bayesian warped GPs further generalize WGPs by modeling the transformation as a GP (Lázaro-Gredilla, 2012). These are in turn generalized to deep GPs by Damianou & Lawrence (2013), which stack GPs together in the layers of a neural network. Throughout this line of work, the GP transformation and kernel hyperparameters are typically learned through joint maximum likelihood estimation (MLE). A known drawback of MLE is overconfidence in the data-sparse or low-data regime, which may be exacerbated by warping (Chai & Garnett, 2019). Bayesian approaches, on the other hand, offer a way to account for uncertainty in the values of model parameters.
Bayesian trans-Kriging (Spöck et al., 2009) treats both transformation and kernel parameters in a Bayesian fashion. A prototypical Bayesian trans-Kriging model is the BTG model developed by Oliveira et al. (1997). The model places an uninformative prior on the precision hyperparameter and analytically marginalizes it out to obtain a posterior distribution that is a mixture of Student's t-distributions. It then uses a numerical integration scheme to marginalize out the transformation and remaining kernel parameters. In this latter regard, BTG is consistent with other Bayesian methods in the GP literature, including those of Gibbs (1998); Adams et al. (2009); Lalchand & Rasmussen (2020). While BTG was shown to have improved prediction accuracy and better uncertainty propagation, it comes with several computational challenges, which hinder its scalability and limit its competitiveness with the MLE approach. First, the cost of numerical integration in BTG scales with the dimension of the hyperparameter space, which can be large when transform and noise model parameters are both incorporated. Traditional methods such as Monte Carlo (MC) suffer from slow convergence. As such, we leverage sparse grid quadrature and quasi-Monte Carlo (QMC), which have a higher degree of precision but require a sufficiently smooth integrand. Second, the posterior mean predictor of BTG is not guaranteed to exist, hence the need to use the posterior median predictor. The posterior median and credible intervals do not generally have closed forms, so one must resort to expensive numerical root-finding to compute them. Finally, while fast cross-validation schemes are known for vanilla GP models, leave-one-out cross-validation (LOOCV) on BTG, which naively incurs quartic cost, is less straightforward to perform because of an embedded generalized least squares problem. In this paper, we reduce the overall computational cost of end-to-end BTG inference, including model prediction and selection.
Our main contributions follow:

- We revisit the Bayesian transformed Gaussian (BTG) model in the context of the machine learning literature and GPs. We show that BTG fills an important gap in GP research, in which a fully Bayesian treatment of input-warped GPs has not been thoroughly investigated. We extend BTG to support multi-layered transformations.
- We propose efficient methods for computing BTG predictive medians and predictive quantiles through a combination of doubly sparse quadrature and quantile bounds. We also propose fast LOOCV using rank-one matrix algebra.
- We develop a framework to control the tradeoff between speed and accuracy for BTG and analyze the error in sparsifying QMC and sparse grid quadrature rules.
- We conduct comparisons between the Bayesian and MLE approaches and provide experimental results for BTG and WGP coupled with 1-layer and 2-layer transformations. We find evidence that BTG is well suited for low-data regimes, where hyperparameters are under-specified by the data. In these regimes, our empirical testing suggests that BTG provides superior point estimation and uncertainty quantification.
- We develop a modular Julia package for computing with transformed GPs (e.g., BTG and WGP) which exploits vectorized linear algebra operations and supports MLE and Bayesian inference.¹

¹The code used to run experiments in this paper can be found at https://github.com/xinranzhu/BTG.

2 Background

In this section, we provide a brief overview of Gaussian processes and their transformed counterparts. This section is by no means comprehensive; it is meant to establish common notation and definitions for the reader's convenience. Note that we modify the standard GP treatment of Rasmussen & Williams (2008) to improve clarity when explaining the BTG model; notational differences aside, our formulations are equivalent.
2.1 Gaussian Process Regression

A GP $f \sim \mathcal{GP}(\mu, \tau^{-1}k)$ is a distribution over functions on $\mathbb{R}^d$, where $\mu(x)$ is the expected value of $f(x)$ and $\tau^{-1}k(x, x')$ is the positive (semi-)definite covariance between $f(x)$ and $f(x')$. For later clarity, we separate the precision hyperparameter τ from the lengthscales and other kernel hyperparameters (typically denoted by θ).² Unless otherwise specified, we assume a linear mean field and the squared exponential kernel:

$$\mu_\beta(x) = \beta^T m(x), \quad m: \mathbb{R}^d \to \mathbb{R}^p, \qquad k_\theta(x, x') = \exp\left(-\tfrac{1}{2}\|x - x'\|^2_{D_\theta^{-2}}\right).$$

Here $m(x)$ is a known function mapping a location x to a vector of covariates, β is a vector of coefficients, and $D_\theta^2$ is a diagonal matrix of length scales determined by the parameter(s) θ. For any finite set of input locations, let

$$X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}, \quad M_X = [m(x_1), \ldots, m(x_n)]^T \in \mathbb{R}^{n \times p}, \quad f_X = [f(x_1), \ldots, f(x_n)]^T \in \mathbb{R}^n,$$

where X is the matrix of observation locations, $M_X$ is the matrix of covariates at X, and $f_X$ is the vector of observations. A GP has the property that any finite number of evaluations of f has a joint Gaussian distribution:

$$f_X \mid \beta, \tau, \theta \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X),$$

where $(\tau^{-1}K_X)_{ij} = \tau^{-1}k_\theta(x_i, x_j)$ is the covariance matrix of $f_X$. We assume $M_X$ to be full rank. Typically, the kernel hyperparameters τ, β, and θ are fit by minimizing the negative log likelihood

$$-\log L(f_X \mid X, \beta, \tau, \theta) \propto \tfrac{1}{2}\|f_X - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X|.$$

This is known as type II maximum likelihood estimation (MLE) of the kernel hyperparameters. Having performed MLE to compute the optimal β, τ, and θ, the posterior predictive density at a point x is:

Definition 1 (Posterior predictive density, GP model, type II MLE).
$$f(x) \mid \beta, \tau, \theta, f_X, X \sim \mathcal{N}(\mu_{\theta,\beta}, s_{\theta,\beta}),$$
where $\mathcal{N}(\mu_{\theta,\beta}, s_{\theta,\beta})$ is a normal distribution with mean and variance
$$\mu_{\theta,\beta} = \beta^T m(x) + K_{Xx}^T K_X^{-1}(f_X - M_X\beta), \qquad s_{\theta,\beta} = \tau^{-1}\left(k_\theta(x, x) - K_{Xx}^T K_X^{-1} K_{Xx}\right),$$
and $(K_{Xx})_i = k_\theta(x_i, x)$.
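To make Definition 1 concrete, here is a minimal NumPy sketch (our illustration, not the paper's Julia package; all names are ours) of the type II predictive mean and variance under a linear mean field and the squared exponential kernel:

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscales):
    # Squared exponential kernel with diagonal lengthscale matrix D_theta.
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * np.sum(diff * diff, axis=-1))

def gp_predict(X, fX, Xstar, m, beta, tau, lengthscales, jitter=1e-8):
    # Posterior predictive mean and variance for a GP with mean field
    # mu_beta(x) = beta^T m(x) and covariance tau^{-1} k_theta (Definition 1).
    n = len(X)
    KX = sq_exp_kernel(X, X, lengthscales) + jitter * np.eye(n)
    KXs = sq_exp_kernel(X, Xstar, lengthscales)       # K_{Xx}, shape (n, n*)
    MX = np.array([m(x) for x in X])                  # covariates at X
    Ms = np.array([m(x) for x in Xstar])
    alpha = np.linalg.solve(KX, fX - MX @ beta)
    mean = Ms @ beta + KXs.T @ alpha
    # k_theta(x, x) = 1 for the squared exponential kernel.
    var = (1.0 - np.sum(KXs * np.linalg.solve(KX, KXs), axis=0)) / tau
    return mean, var
```

The small jitter on the diagonal is a standard numerical safeguard against the ill-conditioning of smooth kernel matrices; it is not part of the model above.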
The Bayesian model selection approach forgoes MLE of the negative log likelihood, and instead assumes a prior distribution over kernel hyperparameters p(β, τ, θ). This induces a posterior distribution over the kernel hyperparameters, since the posterior is proportional to the likelihood times the prior:

$$p(\beta, \tau, \theta \mid f_X, X) \propto L(f_X \mid X, \beta, \tau, \theta)\, p(\beta, \tau, \theta).$$

Given this posterior distribution, the posterior predictive density is given by the following.

Definition 2 (Posterior predictive density, fully Bayesian GP).
$$p(f(x) \mid f_X, X) = \int_{\beta,\tau,\theta} p(f(x) \mid \beta, \tau, \theta, f_X, X)\, p(\beta, \tau, \theta \mid f_X, X)\, d\beta\, d\tau\, d\theta.$$

We call this the fully Bayesian approach to Gaussian process modeling. The well-known tradeoff between type II MLE and the fully Bayesian approach is that of tractability; the fully Bayesian approach requires computing integrals with no generic analytic solution (often through MCMC), whereas type II MLE is much more straightforward.

²Standard GP literature includes an additional hyperparameter σ which models the observation noise. We omit σ in this section for brevity but treat it in our experiments.

2.2 Warped Gaussian Processes

While GPs are powerful tools for modeling nonlinear functions, they make the fairly strong assumptions of Gaussianity and homoscedasticity. WGPs (Snelson et al., 2004) relax these assumptions by transforming ("warping") the observation space to a latent space, which is itself modeled by a GP. Given a strictly increasing, differentiable parametric transformation $g_\lambda$ determined by a parameter λ, WGPs model the composite function $g_\lambda \circ f$ with a GP.

Definition 3 (Warped Gaussian Process).
$$(g_\lambda \circ f) \mid \beta, \tau, \lambda, \theta \sim \mathcal{GP}(\mu_\beta, \tau^{-1}k_\theta).$$

Let $(g_\lambda(f_X))_i = g_\lambda(f(x_i))$. WGP jointly computes the parameters through MLE in the latent space, where the negative log likelihood is

$$-\log L(g_\lambda(f_X) \mid X, \beta, \tau, \theta, \lambda) \propto \tfrac{1}{2}\|g_\lambda(f_X) - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X| - \log J_\lambda.$$
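This latent-space negative log likelihood can be sketched as follows (our illustration; `g` and `gprime` stand in for the warp $g_\lambda$ and its derivative), with the quadratic form and log determinant computed from a Cholesky factor:

```python
import numpy as np

def wgp_nll(y, MX, KX, beta, g, gprime):
    # Warped-GP negative log likelihood, up to additive constants:
    #   0.5*||g(y) - MX beta||^2_{KX^{-1}} + 0.5*log|KX| - log J_lambda,
    # where J_lambda = prod_i |g'(y_i)| is the Jacobian of the warp.
    z = g(y) - MX @ beta
    L = np.linalg.cholesky(KX)
    w = np.linalg.solve(L, z)            # whitened residual
    quad = 0.5 * (w @ w)
    logdet = np.sum(np.log(np.diag(L)))  # equals 0.5 * log|KX|
    logjac = np.sum(np.log(np.abs(gprime(y))))
    return quad + logdet - logjac
```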
Note that the difference between this equation and the standard GP negative log likelihood is the inclusion of $J_\lambda$, the Jacobian of the transformation:

$$J_\lambda = \prod_{i=1}^n \left|\frac{\partial}{\partial f(x_i)} g_\lambda(f(x_i))\right|.$$

WGPs predict the value at a point x by computing its posterior mean in the latent space and then inverting the transformation back to the observation space: $g_\lambda^{-1}(\hat{\mu}(x))$. Snelson et al. (2004) use the tanh transform family, whose members do not generally have closed-form inverses; they must be computed numerically.

2.3 Other Transformed GP Models

There exist a variety of other transformed GP models, and in this section we briefly summarize a few of them. This overview is by no means comprehensive, but we hope it succinctly communicates not only the breadth of existing work, but also the gaps that we believe BTG fills. The compositionally warped GP (CWGP) (Rios & Tobar, 2019) considers a set of augmented warping functions, extending those used by the standard WGP, which possess analytic inverses (for computational simplicity) and are composed together for additional expressiveness. The Bayesian warped GP (BWGP) (Lázaro-Gredilla, 2012) takes a different approach to transformations, placing a GP prior on the warping function and variationally marginalizing it out so that model prediction and selection do not require numerical integration.

Table 1: A summary of some of the different GP models in the literature today. The main differences among models are whether or not they use transformations (all do except for the standard GP), what sorts of transformations are used, and what type of inference is used. "Single" denotes a single function used for transformations; "Multiple" refers to a composition of such functions. At the bottom of the table is the topic of this paper, the Bayesian transformed GP (BTG).
| Name | Transformations | Type | Inference |
|---|---|---|---|
| GP-MLE | No | N/A | MLE |
| GP-Bayesian | No | N/A | Fully Bayesian |
| Warped GP (WGP) | Yes | Single analytic | MLE |
| Compositionally Warped GP (CWGP) | Yes | Multiple analytic | MLE |
| Bayesian Warped GP (BWGP) | Yes | Single GP | Variational |
| Deep GP (DGP) | Yes | Multiple GPs | Variational |
| Bayesian Transformed GP (BTG) | Yes | Multiple analytic | Fully Bayesian |

Deep GPs (Damianou & Lawrence, 2013) extend the "GP of a GP" idea by building a network of GPs, and also use variational inference to accelerate model prediction and selection by avoiding numerical integration. Indeed, a whole body of literature exists solely on the application of variational inference to Gaussian process regression (Tran et al., 2015; Salimbeni & Deisenroth, 2017; Jakkala, 2021). Like these other models, BTG considers a family of warping functions, either a single layer or multiple compositions. However, instead of using type II MLE (which can sometimes be overconfident) or variational inference (whose approximations often require significant data before showing their advantages), BTG opts for a carefully chosen prior and explicit numerical integration during the model selection process, yielding a fully Bayesian approach to transformed Gaussian process regression.

While we focus on improving the Bayesian treatment and scaling BTG to higher dimensions using exact kernel inference, another line of work, orthogonal to ours, focuses on combining the warping setting with sparse GP techniques to scale to large datasets. Graßhoff et al. (2020) adapt the Structured Kernel Interpolation (SKI) method (Wilson & Nickisch, 2015) to the WGP setting, which scales WGP up to large training sets in low dimensions. Rossi et al. (2021) consider the sparse GP setting with inducing points and focus on a fully Bayesian treatment of both the inducing points and model hyperparameters, based on stochastic gradient Hamiltonian Monte Carlo.
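As noted in 2.2, the tanh warp family of Snelson et al. (2004) has no closed-form inverse, so latent predictions must be mapped back to the observation space numerically. A bracketed root-finding sketch (our illustration, using SciPy's `brentq`; the parameter values below are purely illustrative):

```python
import numpy as np
from scipy.optimize import brentq

def tanh_warp(y, a, b, c):
    # Sum-of-tanh warp in the style of Snelson et al. (2004);
    # strictly increasing when all a_i, b_i > 0.
    return y + sum(ai * np.tanh(bi * (y + ci)) for ai, bi, ci in zip(a, b, c))

def tanh_warp_inv(z, a, b, c):
    # Since |g(y) - y| <= sum(a), the interval [z - pad, z + pad]
    # is guaranteed to bracket the root of g(y) - z.
    pad = sum(a) + 1.0
    return brentq(lambda y: tanh_warp(y, a, b, c) - z, z - pad, z + pad)
```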
3 The Bayesian Transformed Gaussian Process (BTG) Model

One might think of the Bayesian transformed Gaussian (BTG) model (Oliveira et al., 1997) as a fully Bayesian generalization of WGP. WGP uses MLE to learn optimal values for the transformation parameters λ, mean function parameters β, precision parameter τ, and GP lengthscales θ. Just like WGP, BTG models a function f(x) as

$$(g_\lambda \circ f) \mid \beta, \tau, \lambda, \theta \sim \mathcal{GP}(\mu_\beta, \tau^{-1}k_\theta).$$

Unlike WGP, BTG places a joint prior p(β, τ, θ, λ) over all model parameters and performs fully Bayesian model selection by computing the usual marginal likelihood instead of performing MLE. A key idea of BTG is to carefully select a joint prior over λ, β, τ, and θ to simplify the computation so that BTG's posterior predictive density is a mixture of t-distributions; this is derived in the following subsections.

Definition 4 (Posterior predictive density, BTG).
$$f(x) \mid f_X, X \sim \sum_{i=1}^{M} w_i\, T^i_{n-p},$$
where each $T^i_{n-p}$ is a different transformed t-distribution whose mean and variance we will derive in 3.2 and 4; for a comprehensive analysis, see Box & Cox (1964).

Figure 1: A comparison of GP and BTG, both trained on 48 random samples from the rounded sine function with noise from N(0, 0.05). Left: the predictive mean and 95% equal-tailed credible interval of a Gaussian process using the squared exponential kernel. Right: the predictive median and 95% equal-tailed credible interval of BTG using the squared exponential kernel. BTG is better able to capture the irregularity in the training data despite a highly smooth kernel, largely due to its Bayesian treatment of warping functions that map the data points to a latent space.
As with WGP, this predictive distribution must then be inverted to perform prediction or uncertainty quantification. BTG possesses the advantage that any fully Bayesian model has over its MLE counterpart: robustness to overconfidence. This results in a better fit, as demonstrated in Figure 1. The underlying kernel is the highly smooth squared exponential, resulting in a GP that is unable to capture the underlying function, which is not smooth at all. Due to a combination of input warping and Bayesian inference, BTG is able to resolve the underlying data points much better than a GP despite using the same smooth kernel. In later sections, we explore the advantages of being Bayesian in the low-data regime.

3.1 The Parameter Posterior

A key idea of the BTG model is that, conditioned on λ, θ, and $f_X$, the resulting WGP is a generalized linear model (Oliveira et al., 1997). We estimate β by $\hat{\beta}_{\theta,\lambda}$, the solution to the weighted least squares problem

$$q_{\theta,\lambda} = \min_\beta \|g_\lambda(f_X) - M_X\beta\|^2_{K_X^{-1}},$$

where $q_{\theta,\lambda}$ is the residual norm. Explicitly, we have

$$\hat{\beta}_{\theta,\lambda} = (M_X^T K_X^{-1} M_X)^{-1} M_X^T K_X^{-1} g_\lambda(f_X), \qquad q_{\theta,\lambda} = (g_\lambda(f_X) - M_X\hat{\beta}_{\theta,\lambda})^T K_X^{-1} (g_\lambda(f_X) - M_X\hat{\beta}_{\theta,\lambda}).$$

Therefore, we have the marginal probability

$$p(f_X \mid X) = \int p(f_X \mid \beta, \tau, \theta, \lambda, X)\, p(\beta, \tau, \theta, \lambda)\, d\lambda\, d\beta\, d\theta\, d\tau \propto \int |K_X|^{-1/2} |M_X^T K_X^{-1} M_X|^{-1/2}\, q_{\theta,\lambda}^{-(n-p)/2}\, J_\lambda^{1-p/n}\, p(\theta)p(\lambda)\, d\theta\, d\lambda.$$

Further using Bayes' theorem, we have

$$p(\theta, \lambda \mid f_X, X) \propto |K_X|^{-1/2} |M_X^T K_X^{-1} M_X|^{-1/2}\, q_{\theta,\lambda}^{-(n-p)/2}\, J_\lambda^{1-p/n}\, p(\theta)p(\lambda). \tag{1}$$

On the other hand, because appropriate values for β, τ, and θ depend nontrivially on λ, BTG adopts the improper joint prior

$$p(\beta, \tau, \theta, \lambda) \propto p(\theta)p(\lambda) \,/\, (\tau J_\lambda^{p/n}).$$

This effectively decouples the priors on θ and λ; p(θ) and p(λ) are independent and user-specified.
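The estimate $\hat{\beta}_{\theta,\lambda}$ and residual norm $q_{\theta,\lambda}$ above form a generalized least squares problem; a numerically safer route than forming $K_X^{-1}$ explicitly is to whiten with a Cholesky factor of $K_X$. A sketch (our illustration, not the authors' implementation):

```python
import numpy as np

def gls_fit(KX, MX, z):
    # Solve min_beta ||z - MX beta||^2_{KX^{-1}}: with KX = L L^T this is
    # ordinary least squares on the whitened data (L^{-1} MX, L^{-1} z).
    L = np.linalg.cholesky(KX)
    Mw = np.linalg.solve(L, MX)
    zw = np.linalg.solve(L, z)
    beta, *_ = np.linalg.lstsq(Mw, zw, rcond=None)
    r = zw - Mw @ beta
    return beta, r @ r      # (beta_hat, residual norm q)
```

With KX equal to the identity, this reduces to ordinary least squares, which gives an easy sanity check.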
BTG then selects the prior so that (β, τ) follows a conditional normal-inverse-gamma-type distribution:

$$\beta \mid \tau, \lambda, \theta, f_X, X \sim \mathcal{N}\left(\hat{\beta}_{\lambda,\theta},\, \tau^{-1}(M_X^T K_X^{-1} M_X)^{-1}\right), \qquad \tau \mid \lambda, \theta, f_X, X \sim \Gamma\left(\frac{n-p}{2},\, \frac{q_{\theta,\lambda}}{2}\right). \tag{2}$$

Note that, due to the improper joint prior, the distribution of β and τ depends on the values of θ and λ. Using (1) and (2), the posterior distribution is determined up to a constant:

$$p(\beta, \tau, \theta, \lambda \mid f_X, X) = p(\beta, \tau \mid \theta, \lambda, f_X, X)\, p(\theta, \lambda \mid f_X, X). \tag{3}$$

The reason for the normal-inverse-gamma prior becomes apparent: the marginal of a normal-inverse-gamma is a t-distribution, and so once we marginalize out equation 3, the mixture of t-distributions seen in equation 4 results. We derive this in the following subsection.

3.2 The Predictive Density

BTG bases the prediction at x on the Bayesian predictive density function (Aitchison & Dunsmore, 1975):

$$p(f(x) \mid f_X, X) = \int p(f(x) \mid \beta, \tau, \lambda, \theta, f_X, X)\, p(\beta, \tau, \lambda, \theta \mid f_X, X)\, d\lambda\, d\beta\, d\theta\, d\tau. \tag{4}$$

The Bayesian predictive density function is simply an integral of a product of two probability densities. In the previous section, we derived $p(\beta, \tau, \lambda, \theta \mid f_X, X)$, so we know the second probability density. The first probability density is that of a warped GP (Definition 3):

$$g_\lambda \circ f(x) \mid \beta, \tau, \lambda, \theta, f_X, X \sim \mathcal{N}(m_{\beta,\theta,\lambda}, D_\theta),$$
$$m_{\beta,\theta,\lambda} = \beta^T m(x) + K_{Xx}^T K_X^{-1}\left(g_\lambda(f_X) - M_X\beta\right), \qquad D_\theta = \tau^{-1}\left(k_\theta(x, x) - K_{Xx}^T K_X^{-1} K_{Xx}\right).$$

We consequently know everything needed to specify Equation 4.
Because we adopted a conditional normal-inverse-gamma distribution, once we marginalize out β and τ, the posterior of $g_\lambda \circ f(x)$ is the t-distribution

$$g_\lambda \circ f(x) \mid \lambda, \theta, f_X, X \sim T_{n-p}\left(m_{\lambda,\theta},\, (q_{\theta,\lambda}C_{\theta,\lambda})^{-1}\right), \tag{5}$$

where the mean largely resembles that of a GP,

$$m_{\lambda,\theta} = K_{xX} K_X^{-1}\left(g_\lambda(f_X) - M_X\hat{\beta}_{\lambda,\theta}\right) + \hat{\beta}_{\lambda,\theta}^T m(x),$$

and $C_{\lambda,\theta}$ is the final Schur complement $B(x)/[k_\theta(x,x)]$ of the bordered matrix

$$B(x) = \begin{pmatrix} 0 & M_X^T & m(x) \\ M_X & K_X & K_{Xx} \\ m(x)^T & K_{Xx}^T & k_\theta(x, x) \end{pmatrix}.$$

Therefore, by Bayes' theorem, the marginal posterior of BTG is

$$p(f(x) \mid f_X, X) = \frac{\int_{\Theta,\Lambda} \phi(f(x) \mid \theta, \lambda, f_X, X)\, p(f_X \mid \lambda, \theta)\, p(\theta)p(\lambda)\, d\lambda\, d\theta}{\int_{\Theta,\Lambda} p(f_X \mid \theta, \lambda)\, p(\theta)p(\lambda)\, d\lambda\, d\theta}, \tag{6}$$

where the density being integrated is a transformed t-distribution with pdf

$$\phi(f(x) \mid \theta, \lambda, f_X, X) = \frac{\Gamma\left(\tfrac{n-p+1}{2}\right)\, |g'_\lambda(f(x))|}{\Gamma\left(\tfrac{n-p}{2}\right)\, \pi^{1/2}\, |q_{\theta,\lambda}C_{\theta,\lambda}|^{1/2}} \left[1 + \left(g_\lambda(f(x)) - m_{\theta,\lambda}\right)^T (q_{\theta,\lambda}C_{\theta,\lambda})^{-1} \left(g_\lambda(f(x)) - m_{\theta,\lambda}\right)\right]^{-(n-p+1)/2}. \tag{7}$$

Unlike WGP, BTG may not have first or second moments, because its marginal posterior may be, for example, a mixture of log-t distributions. If this occurs, the probability density function (pdf) will not have a mean or variance. Therefore, BTG instead uses the median and credible intervals, computed by inverting its cumulative distribution function (cdf).

3.3 The Approximate Predictive Density

For general nonlinear transformations, the posterior distribution of BTG (Equation 6) is intractable, and we therefore approximate it using a set of M quadrature nodes and weights $([\theta_i, \lambda_i], w_i)$, yielding the mixture of transformed t-distributions

$$p(f(x) \mid f_X, X) \approx \frac{\sum_{i=1}^M w_i\, \phi(f(x) \mid \theta_i, \lambda_i, f_X, X)\, p(f_X \mid \theta_i, \lambda_i)\, p(\theta_i)p(\lambda_i)}{\sum_{i=1}^M w_i\, p(f_X \mid \theta_i, \lambda_i)\, p(\theta_i)p(\lambda_i)}.$$

Here, $\phi(f(x) \mid \theta_i, \lambda_i, f_X, X)$ is the pdf of the transformed t-distribution (Equation 7), $p(f_X \mid \theta_i, \lambda_i)$ is the likelihood of the data given the hyperparameters, $p(\theta_i)$ and $p(\lambda_i)$ are the user-specified hyperparameter priors, and $w_i$ is the quadrature weight at the quadrature node $(\theta_i, \lambda_i)$.
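Two pieces of the machinery above are easy to sketch in isolation (our illustrations; SciPy's Student-t stands in for the latent distribution): the change-of-variables density behind Equation 7, and a log-space normalization of the per-node weights, which guards against likelihood underflow:

```python
import numpy as np
from scipy.stats import t as student_t

def transformed_t_pdf(y, g, gprime, loc, scale, dof):
    # Change of variables: if g(f) is Student-t, the density of f is the
    # latent t pdf evaluated at g(y) times the Jacobian |g'(y)|.
    return student_t.pdf(g(y), dof, loc=loc, scale=scale) * np.abs(gprime(y))

def normalized_weights(quad_w, loglik, logprior):
    # Fold quadrature weights, likelihoods, and priors into mixture weights,
    # normalizing in log space (log-sum-exp trick) for numerical stability.
    logw = np.log(quad_w) + loglik + logprior
    logw -= logw.max()         # shift so the largest term is exp(0) = 1
    w = np.exp(logw)
    return w / w.sum()
```

The weight normalization as written assumes positive quadrature weights, as in QMC; sparse grid rules with negative weights require separate handling.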
To simplify notation, we fold all terms except the transformed t-distribution pdf into weights $\bar{w}_i$, reducing the predictive distribution to our desired mixture of transformed t-distributions:

$$\bar{w}_i := \frac{w_i\, p(f_X \mid \theta_i, \lambda_i)\, p(\theta_i)p(\lambda_i)}{\sum_{j=1}^M w_j\, p(f_X \mid \theta_j, \lambda_j)\, p(\theta_j)p(\lambda_j)}, \qquad p(f(x) \mid f_X, X) \approx \sum_{i=1}^M \bar{w}_i\, \phi(f(x) \mid \theta_i, \lambda_i, f_X, X). \tag{8}$$

As mentioned earlier, $p(f(x) \mid f_X, X)$ is not guaranteed to have a mean, so we must use the median predictor instead. We do so by computing the quantile $P^{-1}(0.5)$ by numerical root-finding, where P is the cdf of $p(f(x) \mid f_X, X)$ and is therefore a mixture of transformed t-distribution cdfs.

4 Methodology

BTG regression via the median predictor (or any other quantile) of Equation 8 is challenging. The dimensionality of the integral scales with the total number of hyperparameters, which includes the kernel lengthscale hyperparameters θ, which grow with the ambient dimension of the data, as well as the transformation parameters λ, which grow with the number of transformations used. Furthermore, the cdf must be numerically inverted, requiring many such quadrature computations for even a single, point-wise regression task. This is further complicated by the difficulty of assessing model fit through LOOCV, which must be repeated at every quadrature node as well. As a result, a naive implementation of BTG scales poorly. In this section, we discuss and propose efficient algorithms that make BTG model prediction and model validation far faster, and indeed comparable in speed to its MLE counterparts. First, we discuss our doubly sparse quadrature rules for computing the BTG predictive distribution ( 4.1 and 4.2). We then provide quantile bounds that accelerate root-finding convergence ( 4.3). Next, we propose an O(n³) LOOCV algorithm for BTG using Cholesky downdates and rank-one matrix algebra ( 4.4), which greatly improves on the naive O(n⁴) LOOCV cost. Finally, we discuss the single- and multi-layer nonlinear transformations used in our experiments ( 4.5).
4.1 Sparse Grid Quadrature

Sparse grid methods, or Smolyak algorithms, are effective for approximating integrals of sufficient regularity in moderate to high dimensions. While the conventional Monte Carlo (MC) quadrature approach used by Oliveira et al. (1997) converges at the rate $O(1/\sqrt{M})$, where M is the number of quadrature nodes, the approximation error of Smolyak's quadrature rule is

$$O\left(M^{-r}\, |\log_2 M|^{(d-1)(r+1)}\right),$$

where d is the dimensionality of the integral and r is the integrand's regularity, i.e., its number of derivatives. In this paper, we use a sparse grid rule detailed by Bungartz & Griebel (2004) and used for likelihood approximation by Heiss & Winschel (2008).

4.2 Quadrature Sparsification

Numerical integration schemes such as sparse grids and QMC use M fixed quadrature nodes, where M depends on the dimensionality of the domain and the fineness of the grid. In the Bayesian approach, expensive GP operations such as computing a log determinant and solving a linear system are repeated across quadrature nodes, for a total time complexity of O(Mn³). In practice, many nodes are associated with negligible mixture weights, so their contribution to the posterior predictive distribution can effectively be ignored. We thus adaptively drop nodes when their associated weights fall below a certain threshold. To do so in a principled way, we approximate the mixture with a subset of dominant weights and then quantify the error in terms of the total mass of discarded weights. We assume the posterior cdf is the mixture of cdfs $F(x) = \sum_{i=1}^M w_i f_i(x)$, where each $f_i$ is a cdf, and that the weights $\{w_i\}_{i=1}^M$ are ordered by decreasing magnitude. Consider $F_k(x)$, a truncated and re-scaled F(x). We first quantify the pointwise approximation error in Lemma 4.1. We then quantify the error in quantile computation: Propositions 4.1 and 4.2 show that the approximated quantile $F_k^{-1}(p)$ can be bounded by perturbed true quantiles.
Proposition 4.3 gives a simple bound between $F_k^{-1}(p)$ and $F^{-1}(p)$ within the region of interest, and applies both to QMC, which uses positive weights, and to sparse grid quadrature, which uses positive and negative weights. See Appendix A.1 for proofs.

Lemma 4.1. Let k be the smallest integer such that $\sum_{i=1}^k w_i \geq 1 - \epsilon$. Then define the scaled, truncated mixture
$$F_k(x) = \frac{1}{c}\sum_{i=1}^k w_i f_i(x), \qquad c := \sum_{i=1}^k w_i.$$
We have $|F(x) - F_k(x)| \leq 2\epsilon$.

Proposition 4.1 (Error Bound for Positive Weights). For any $\epsilon \in (0, 1)$, let k be the smallest integer such that $\sum_{i=1}^k w_i \geq 1 - \epsilon$. Define the scaled, truncated mixture
$$F_k(x) = \frac{1}{c}\sum_{i=1}^k w_i f_i(x), \qquad c := \sum_{i=1}^k w_i.$$
Let $p \in (0, 1)$ and assume that $p \pm 2\epsilon \in (0, 1)$. Then the approximate quantile $F_k^{-1}(p)$ is bounded by perturbed true quantiles:
$$F^{-1}(p - 2\epsilon) \leq F_k^{-1}(p) \leq F^{-1}(p + 2\epsilon).$$

Proposition 4.2 (Error Bound for Negative Weights). Let F(x) be defined as before, except each $w_i$ is no longer required to be positive. Consider the split $F(x) = F_{M'}(x) + R_{M'}(x)$, where $F_{M'}(x) = \sum_{i=1}^{M'} w_i f_i(x)$ and $R_{M'}(x) = \sum_{i=M'+1}^{M} w_i f_i(x)$. Then for any x, we have $R_{M'}(x) \in [\epsilon_-, \epsilon_+]$, where the epsilons are the sums of negative (resp. positive) weights of $R_{M'}(x)$:
$$\epsilon_- = \sum_{i=M'+1}^{M} [w_i]_- \leq 0, \qquad \epsilon_+ = \sum_{i=M'+1}^{M} [w_i]_+ \geq 0.$$
Let $p \in (0, 1)$ and assume $p + \epsilon_-,\, p + \epsilon_+ \in (0, 1)$. Then the approximate quantile $F_{M'}^{-1}(p)$ is bounded by perturbed true quantiles:
$$F^{-1}(p + \epsilon_-) \leq F_{M'}^{-1}(p) \leq F^{-1}(p + \epsilon_+).$$

Proposition 4.3 (Error Bound at a Quantile). Let F(x) be defined as before, let $\epsilon_1, \epsilon_2 \in (0, 1)$, and let $F_k(x)$ be an approximation to F(x) such that $F^{-1}(p - \epsilon_1) \leq F_k^{-1}(p) \leq F^{-1}(p + \epsilon_2)$ for some $p \in (0, 1)$. Assuming $p - \epsilon_1,\, p + \epsilon_2 \in (0, 1)$, we have the following error bound at the quantile:
$$\left|F_k^{-1}(p) - F^{-1}(p)\right| \leq \epsilon \max_{\xi \in (p - \epsilon_1,\, p + \epsilon_2)} \left(F^{-1}\right)'(\xi),$$
where $\epsilon = \max\{\epsilon_1, \epsilon_2\}$.

By adaptively sparsifying our numerical quadrature schemes, we are able to discard a significant portion of the summands in the mixture F(x), which in turn enables significant speedup of BTG model prediction. Empirical results are shown in 5.
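In the positive-weight (QMC) case, the truncation rule in Lemma 4.1 and Proposition 4.1 amounts to keeping the largest weights until their mass reaches 1 − ϵ and rescaling. A sketch (our illustration; sparse grid rules with negative weights need the split of Proposition 4.2 instead):

```python
import numpy as np

def sparsify(w, eps):
    # Keep the smallest k largest weights whose total mass reaches 1 - eps,
    # then rescale by c = sum of kept weights. Discarding mass at most eps
    # perturbs the mixture cdf by at most 2*eps pointwise (Lemma 4.1).
    order = np.argsort(w)[::-1]                     # decreasing order
    csum = np.cumsum(w[order])
    k = int(np.searchsorted(csum, 1.0 - eps)) + 1   # smallest k with csum >= 1-eps
    keep = order[:k]
    return keep, w[keep] / w[keep].sum()
```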
4.3 Quantile Bounds

To compute posterior quantiles, we apply Brent's algorithm, a standard root-finding algorithm combining the secant and bisection methods, to the cdf defined in Equation 8. Since Brent's algorithm is box-constrained, we use quantile bounds to narrow down the locations of the quantiles for $p \in \{0.025, 0.5, 0.975\}$. Let $F(x) = \sum_{i=1}^M w_i f_i(x)$. Then we have the following bounds on the quantile $F^{-1}(p)$. See Appendix A.2 for proofs.

Proposition 4.4 (Convex Hull). Let F(x) be defined as before with $w_i > 0$ and $\sum_{i=1}^M w_i = 1$. Then
$$\min_i f_i^{-1}(p) \leq F^{-1}(p) \leq \max_i f_i^{-1}(p).$$

Proposition 4.5 (Singular Weight). Let F(x) be defined as before with $w_i > 0$ and $\sum_{i=1}^M w_i = 1$. Let $\bar{w}_i = 1 - w_i$. Then
$$\max_{i:\, p - \bar{w}_i \geq 0} f_i^{-1}(p - \bar{w}_i) \leq F^{-1}(p) \leq \min_{i:\, p + \bar{w}_i \leq 1} f_i^{-1}(p + \bar{w}_i).$$

When solving for $y = F^{-1}(p)$, we run Brent's algorithm using our quantile bounds as the box constraints. Furthermore, we adaptively set the termination tolerances xtol and ftol to be of the same order of magnitude as the quadrature sparsification error from 4.2. This greatly accelerates convergence in practice. A performance comparison between the quantile bounds outlined in this section can be found in 5.

4.4 Fast Leave-One-Out Cross-Validation (LOOCV)

LOOCV is a standard measure of model fit: in practice, it is most commonly used for tuning hyperparameters and for model selection. While fast LOOCV schemes are known for GP regression, it is less straightforward to perform LOOCV on BTG. In particular, the computational difficulty lies in two LOOCV sub-problems: a generalized least squares problem and a principal sub-matrix determinant computation. These correspond to the terms in the BTG likelihood function and the BTG conditional posterior in Equation 6. Being Bayesian about covariance and transform hyperparameters introduces additional layers of cost: LOOCV must be repeated at each quadrature node in hyperparameter space. This further motivates the need for an efficient algorithm.
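The convex-hull bound of Proposition 4.4 supplies a valid bracket for Brent's method: each component quantile bounds the mixture quantile from one side. A sketch for a mixture of Student-t components (our illustration; SciPy's Student-t stands in for the transformed t-distributions):

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.optimize import brentq

def mixture_quantile(w, locs, scales, dof, p):
    # F(x) = sum_i w_i f_i(x) with positive weights summing to one.
    # The component quantiles bracket F^{-1}(p) (Proposition 4.4),
    # giving Brent's method a tight box to search.
    comp = student_t.ppf(p, dof, loc=locs, scale=scales)
    lo, hi = comp.min(), comp.max()
    if np.isclose(lo, hi):
        return lo
    F = lambda x: np.dot(w, student_t.cdf(x, dof, loc=locs, scale=scales))
    return brentq(lambda x: F(x) - p, lo, hi)
```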
For notational clarity, let $(-i)$ denote the omission of the ith point: for a kernel matrix, deletion of the ith row and column; for a vector, omission of the ith entry. We seek to compute the mean $m^{(-i)}_{\theta,\lambda}$ and standard deviation $\sigma^{(-i)}_{\theta,\lambda} = \left(C^{(-i)}_{\theta,\lambda}\, q^{(-i)}_{\theta,\lambda}\right)^{1/2}$ of the t-distributions (Equation 5) for each submodel obtained by leaving out the ith training point. Specifically, computing $\{q^{(-i)}_{\theta,\lambda}\}_{i=1}^n$ entails solving the generalized least squares problems, for i = 1, ..., n,

$$\arg\min_{\beta^{(-i)}} \left\|Y^{(-i)} - M_X^{(-i)}\beta^{(-i)}\right\|^2_{(K_X^{(-i)})^{-1}},$$

where $Y = g \circ f_X$.

Table 2: Elementary transformations: analytic function forms and parameter constraints. Parameters are assumed to be in R unless stated otherwise.

| Name | g(y) | Req. | Num. Params |
|---|---|---|---|
| Affine | $a + by$ | $b > 0$ | 2 |
| Arc Sinh | $a + b\,\mathrm{asinh}(y - c)$ | | 3 |
| Sinh Arc Sinh | $\sinh(b\,\mathrm{asinh}(y) - a)$ | $b > 0$ | 2 |
| Box-Cox | $(y^\lambda - 1)/\lambda$ if $\lambda \neq 0$; $\log(y)$ if $\lambda = 0$ | $\lambda \geq 0$ | 1 |

In addition, computing $C^{(-i)}_{\theta,\lambda}$ and $m^{(-i)}_{\theta,\lambda}$ entails solves with $K_X^{(-i)}$, which naively takes O(n³) per sub-problem. Therefore, the BTG LOOCV procedure naively takes O(n⁴) total time. We develop an O(n³) fast LOOCV algorithm for BTG using three building blocks: fast determinant computations (Proposition 4.6), fast abridged linear system solves (Proposition 4.7), and fast rank-one O(p²) Cholesky downdates (Proposition 4.8). We refer to Stewart (1998) for the rank-one Cholesky downdate algorithm. For algorithm details as well as proofs, see Appendix B. The scaling behavior of our fast LOOCV algorithm is shown in Figure 4.

Proposition 4.6 (Determinant of a Principal Minor).
$$\det \Sigma^{(-i)} = \det(\Sigma)\, \left(e_i^T \Sigma^{-1} e_i\right).$$

Proposition 4.7 (Abridged Linear System). Let $K \in \mathbb{R}^{n \times n}$ be of full rank, and let $c, y \in \mathbb{R}^n$ satisfy $Kc = y$. Then if $r_i = c_i / (e_i^T K^{-1} e_i)$, we have
$$c^{(-i)} = (K^{(-i)})^{-1} y^{(-i)} = \left(c - r_i K^{-1} e_i\right)^{(-i)}.$$

Proposition 4.8 (Rank-one Matrix Downdate).
If $X \in \mathbb{R}^{n \times m}$ with $m < n$ has full column rank and $\Sigma$ is a positive definite matrix in $\mathbb{R}^{n \times n}$, then we have
$$\big(X^{(-i)}\big)^T \big(\Sigma^{(-i)}\big)^{-1} X^{(-i)} = X^T \left( \Sigma^{-1} - \frac{\Sigma^{-1} e_i e_i^T \Sigma^{-1}}{e_i^T \Sigma^{-1} e_i} \right) X,$$
where $\Sigma^{(-i)} \in \mathbb{R}^{(n-1) \times (n-1)}$ is the (i, i)th minor of $\Sigma$ and $e_i$ is the ith canonical basis vector.

4.5 Transformations

The original BTG model of Oliveira et al. (1997) uses the Box-Cox family of power transformations and places a uniform prior on λ. Recent research has greatly expanded the set of flexible transformations available: Snelson et al. (2004) use a sum of tanh(·) transforms in the WGP model, and Rios & Tobar (2019) compose various transformations to provide a flexible compositional framework in the CWGP model. We apply BTG with more elementary transformations and compositions thereof, summarized in Table 2. As we show in Section 5, these compositions have greater expressive power and generally outperform single transformations, at the expense of greater computational overhead.

5 Experiments

This section contains:
- Experimental details necessary to reproduce our experiments, including the datasets we used as well as the decisions we made regarding the BTG and GP models (such as the kernel).

Figure 2: Plots of marginal log likelihoods for a Sinh Arc Sinh-transformed WGP and corresponding training data, comparing likelihood functions when n = 5 and when n = 30. In the first four columns, a, b are transformation hyperparameters and θ, σ² are kernel hyperparameters; the optimal values of the marginal log likelihoods are marked with a red star. The last column shows the underlying function (dashed curves) and the training data (red dots).
The top row represents a data-sparse setting with 5 training points, while the bottom row represents a data-rich setting with 30 training points. We can see that in the data-sparse setting, the optimal marginal log likelihood values with respect to θ and σ² are clearly not as well defined as in the data-rich setting.

- A small example highlighting that when data is sparse, the likelihoods for transformation and kernel hyperparameters can be very flat, demonstrating the need for BTG in these scenarios.
- Experiments that validate the efficiency of our computational techniques.
- Thorough regression experiments, which demonstrate BTG's strong empirical performance when compared to appropriately selected baselines. We also provide timing results for our fast BTG implementation, showing that BTG has a computational cost comparable to WGP.

We hope these experiments provide a thorough and nuanced picture of when and why BTG can perform better than its GP counterparts. All methods possess their strengths and weaknesses, and BTG is no exception.

5.1 Motivation for the Bayesian Approach

5.2 Experiment Details

In this subsection, we describe the datasets we use, the performance metrics we measure, and additional implementation details necessary to reproduce our experiments.

Two synthetic datasets: Int Sine and Six Hump Camel. The Int Sine dataset, also used by Lázaro-Gredilla (2012), is sampled from a rounded 1-dimensional sine function with Gaussian noise of a given variance. The training set is comprised of 51 uniformly spaced samples on [−π, π]; the testing set consists of 400 uniformly spaced points on [−π, π]. The Six Hump Camel function is a 2-dimensional benchmark optimization function usually evaluated on [−3, 3] × [−2, 2] (Molga & Smutnicki, 2005). We shift its values to be strictly positive. The training set is comprised of 50 quasi-uniform samples, i.e., a 2d Sobol sequence, from [−1, 1] × [−2, 2].
The testing set consists of 400 uniformly distributed points on the same domain.

Three real datasets: Abalone, Wine Quality and Creep. Abalone is an 8-dimensional dataset, for which the prediction task is to determine the age of an abalone using eight physical measurements (Dua & Graff, 2017). The Wine Quality dataset has 12-dimensional explanatory variables and relates the quality of wine to input attributes (Cortez et al., 2009). The Creep dataset is 30-dimensional and relates the creep rupture stress (in MPa) for steel to chemical composition and other features (Cole et al., 2000). To simulate data-sparse training scenarios, we randomly select training samples of size 30, 200, and 100 from Abalone, Wine Quality, and Creep, respectively, and test on 500, 1000 and 1500 out-of-sample points.

Figure 3: Comparison of sparse grid quadrature (blue) and QMC (red). Weights are ordered by decreasing magnitude. (Left) Number of nodes versus the percent of total mass they comprise. (Middle) Number of nodes versus BTG prediction MSE. (Right) Number of nodes versus prediction time. We can see that sparse grid quadrature has a prediction error that decreases much faster than QMC.

Performance Metrics: Our empirical experiments involve three loss functions to evaluate model performance: root mean squared error (RMSE), mean absolute error (MAE) and mean negative log predictive density (NLPD). Let $\{f(x_i)\}_{i=1}^n$ be true test labels and $\{\hat f(x_i)\}_{i=1}^n$ be predictions, which are taken to be predictive medians in WGP and BTG; the predictive median is also used by Snelson et al. (2004). Let $\hat p$ be the predictive distribution from the model. The RMSE, MAE and NLPD are defined as follows:
$$\mathrm{RMSE} = \left( \frac{1}{n} \sum_{i=1}^n \big( f(x_i) - \hat f(x_i) \big)^2 \right)^{1/2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n \big| f(x_i) - \hat f(x_i) \big|, \qquad \mathrm{NLPD} = -\frac{1}{n} \sum_{i=1}^n \log \hat p\big(f(x_i)\big).$$
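For concreteness, the three metrics can be written directly as below. This is an illustrative sketch on synthetic arrays, not code from our experiments; the names `y_true`, `y_pred`, and the density values are placeholders.

```python
# Sketch of the RMSE, MAE, and NLPD metrics defined above, on toy data.
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def nlpd(density_at_truth):
    # mean negative log predictive density: -(1/n) sum_i log p_hat(f(x_i))
    return float(-np.mean(np.log(density_at_truth)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])    # e.g. predictive medians
dens = np.array([0.30, 0.25, 0.10])   # p_hat evaluated at the true labels
print(rmse(y_true, y_pred), mae(y_true, y_pred), nlpd(dens))
```

Note that RMSE and MAE score only the point prediction, while NLPD also rewards well-calibrated predictive uncertainty through the density values.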
Implementation: We run all experiments using our Julia software package, which supports a variety of models (WGP, CWGP and BTG) and allows for flexible treatment of hyperparameters. We also implement several single and composed transformations. For MLE optimization, we use the L-BFGS algorithm from the Julia Optim package (Mogensen & Riseth, 2018).

Kernel: We use the squared exponential (RBF) kernel for all experiments:
$$k_\theta(x, x') = \exp\!\Big( -\tfrac{1}{2}\, \|x - x'\|^2_{D_\theta^{-2}} \Big),$$
where $D_\theta$ is the diagonal matrix of lengthscales.

Model: To model observation input noise for BTG, we add a regularization term to make the analytical marginalization of the mean and precision tractable. We also assume the constant covariate $m(x) = 1_n$ in the BTG model, and normalize observations to the unit interval. We assume a constant mean field for both BTG and WGP.

Our motivation for using a Bayesian approach like BTG rather than an MLE-based one like WGP is that the maximum of the marginal log likelihood in a data-sparse setting is not as well-defined as in a data-rich setting. We demonstrate this motivation using a 1D toy example: we examine the marginal log likelihoods of transformation and kernel hyperparameters in the WGP model in Figure 2.

Table 3: Compute time (seconds) for computing the predictive median and credible intervals using no quantile bound, the convex hull quantile bound, and the singular weight quantile bound. Results are averaged over 10 trials. Both bounds significantly reduce the total time used to compute the predictive median and credible intervals, though the convex hull bound performs better on average than the singular weight bound, likely because the former is often the tighter bound.

| | Total (s) | Median (s) | Credible Interval (s) |
|---|---|---|---|
| Baseline | 13.0 | 3.24 | 7.87 |
| Convex Hull | 6.21 | 1.11 | 3.19 |
| Singular Weight | 11.0 | 2.52 | 6.54 |
We consider a 1D synthetic function, the 1D Levy function:
$$f(x) = \sin^2(\pi w) + (w - 1)^2 \big( 1 + \sin^2(2\pi w) \big), \qquad w = \frac{x + 3}{4}.$$
As a toy example, we use n = 5 training points for the data-sparse setting and n = 30 for the data-rich setting. We use a Sinh Arc Sinh transformation in both settings, which is parameterized by two parameters a and b (see Table 2). We thus have four total hyperparameters $(a, b, \theta, \sigma)$, and we plot their marginal log likelihoods in Figure 2 for both the data-sparse and data-rich settings. We also identify the optimal hyperparameters $(a^*, b^*, \theta^*, \sigma^*)$ using MLE, which we plot with a red star. We observe that in the data-sparse setting, the marginal log likelihoods tend to be quite flat, even on a log scale; this is especially true of θ and σ². This means that many possible hyperparameters explain the data, and consequently that an MLE-based model (which selects only one hyperparameter value) will likely be a poor fit. It also suggests that a fully Bayesian approach, which is what BTG takes, could be more appropriate in the data-sparse setting than MLE. In the data-rich setting, the distributions of θ and σ² are tightly concentrated, and thus the MLE approach does make sense.

5.3 Scaling Experiments

We demonstrate the efficiency of the computational techniques proposed in Section 4. First, we illustrate the quadrature sparsification scheme of Section 4.2 on two different quadrature rules and compare performance under varying sparsification levels. Second, we evaluate the fast quantile bounds for root-finding used in BTG prediction (Section 4.3) by comparing the total time cost of BTG with that of WGP. Last, we verify the complexity improvement of the proposed fast cross-validation.

We compare sparse grid and QMC quadrature rules under our quadrature sparsification framework. We use the Six Hump Camel dataset with 30 training data points and 100 testing data points as a toy problem.
We train BTG with the composed transformation Affine-Sinh Arc Sinh; the hyperparameter space is 7-dimensional. In our experiment, we begin with a handful of quadrature nodes and gradually extend this set to the entire quadrature grid. As more quadrature nodes are incorporated into the approximate integral (the approximate posterior distribution), we record the percentage of total weight used, and compare how the prediction MSE and the prediction time cost vary. Intuitively, increasing the number of quadrature nodes should reduce the prediction MSE and increase the prediction time cost. To better study our quadrature sparsification framework and the balance between prediction accuracy and cost, we closely compare this behavior on the two quadrature rules supported in BTG. Results are plotted in Figure 3.

From Figure 3 we observe that the sparse grid quadrature rule yields lower prediction MSE: QMC converges to an MSE of 3.99, while sparse grid converges to an MSE of 3.80. We also observe that the sub-grids have similar weight concentration, which we showed was a proxy for quadrature approximation error in Section 4. Therefore, as the mass of the dropped weights falls below 0.1, the error in the integration scheme can increasingly be attributed to the error in the quadrature rule itself as opposed to sparsification. Since the error of the sparse grid rule decays faster than that of QMC, we expect sparse grid prediction error to also decay more quickly, and this trend is supported by Figure 3.

Figure 4: LOOCV timing for computing the posterior distribution parameters of the n sub-models, with and without the proposed Cholesky downdate. The naive approach has O(n⁴) cost, while our proposed Cholesky downdate reduces the cost by a factor of n, to O(n³).
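The sparsification rule exercised in this experiment (Section 4.2 and Lemma A.1: keep the largest-magnitude weights until their mass reaches 1 − ε, then rescale) can be sketched in a few lines. The weights below are synthetic placeholders, not an actual quadrature rule.

```python
# Sketch of quadrature sparsification: truncate to the smallest k with
# cumulative weight >= 1 - eps, then rescale the kept weights by 1/c.
import numpy as np

def sparsify(weights, eps):
    order = np.argsort(-np.abs(weights))          # decreasing |w_i|
    cum = np.cumsum(weights[order])
    k = int(np.searchsorted(cum, 1.0 - eps)) + 1  # smallest k with mass >= 1-eps
    keep = order[:k]
    c = weights[keep].sum()                       # kept mass, c >= 1 - eps
    return keep, weights[keep] / c

w = np.array([0.4, 0.25, 0.2, 0.1, 0.04, 0.01])  # synthetic node weights
keep, w_scaled = sparsify(w, eps=0.05)
assert len(keep) == 4 and np.isclose(w_scaled.sum(), 1.0)
```

By Lemma A.1, the truncated mixture built from the rescaled weights deviates from the full mixture cdf by at most 2ε pointwise.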
We empirically find that the joint likelihood function is sufficiently smooth for sparse grids to be effective. Finally, we confirm that inference time scales linearly with the number of quadrature nodes.

5.3.1 Quantile Bounds Speedup

To assess the effectiveness of quantile bounds for root-finding, we record BTG prediction times using the convex hull bound and the singular weight bound proposed in Section 4.3. We set up a problem using the Levy 1D dataset (Molga & Smutnicki, 2005) and 200 training points. The tolerance of Brent's root-finding algorithm is set to $10^{-3}$. We record the total time cost of prediction, the time spent computing medians, and the time spent computing credible intervals. Intuitively, a tighter bracket should greatly improve the convergence speed of a root-finding algorithm, and this is verified by the results in Table 3. We find that the convex hull bound decreases the overall computational overhead by a factor of at least two. The convex hull bound outperforms the singular weight bound for finding credible intervals and in overall time, but the singular weight bound was faster for finding the median in many scenarios.

5.3.2 Fast Cross Validation

To assess the fast LOOCV scheme proposed in Section 4.4, we infer the sub-model posterior distribution moments on a toy problem in two ways: with and without using Cholesky rank-one downdates on $R_X$ to compute the Cholesky factors $R_X^{(-i)}$. We plot timing results in Figure 4, which confirm that our $O(n^3)$ method scales significantly better than the naive $O(n^4)$ method. We note that these timings are generated using exact GP inference, and that more generally our fast cross-validation is a factor of $O(n)$ faster than standard cross-validation. This speed-up applies not only to the typical $O(n^3)$ cost of dense GP inference, but also to other scalable GP methods; for variational GPs, for example, the cost would be $O(n^2)$ instead.
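The rank-one Cholesky downdate underlying these timings can be sketched compactly. The sketch below follows the standard downdate algorithm (cf. Stewart, 1998), not our Julia implementation, and uses a synthetic positive definite matrix.

```python
# Sketch: given upper-triangular R with R^T R = A, compute R1 with
# R1^T R1 = A - x x^T in O(n^2), without refactoring from scratch.
# Assumes A - x x^T remains positive definite.
import numpy as np

def chol_downdate(R, x):
    R, x = R.astype(float).copy(), x.astype(float).copy()
    n = len(x)
    for k in range(n):
        r = np.sqrt(R[k, k] ** 2 - x[k] ** 2)  # real iff downdate stays spd
        c, s = r / R[k, k], x[k] / R[k, k]
        R[k, k] = r
        R[k, k + 1:] = (R[k, k + 1:] - s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * R[k, k + 1:]
    return R

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)        # synthetic spd matrix
R = np.linalg.cholesky(A).T        # upper-triangular Cholesky factor
x = 0.3 * rng.standard_normal(5)   # small downdate vector
R1 = chol_downdate(R, x)
assert np.allclose(R1.T @ R1, A - np.outer(x, x))
```

Each downdate costs $O(n^2)$, which is what turns the naive $O(n^4)$ LOOCV loop into the $O(n^3)$ procedure of Section 4.4.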
5.4 Regression and Uncertainty Quantification Experiments

Our efficient algorithms allow us to test BTG on a set of synthetic and real-world problems. We consider 5 datasets: Int Sine, Six Hump Camel, Abalone, Wine, and Creep, with dimension ranging from 1 to 30. The total dimension of the hyperparameter space is further increased by transform parameters, by as much as 8.

Table 4: A comprehensive set of RMSE and NLPD results for prediction on the Int Sine, Six Hump Camel, Abalone, Wine and Creep datasets. We generally kept the training sets small to highlight BTG's advantages in the data-sparse regime, though for the Int Sine function there was a good amount of data. We compare the following models: GP, Deep GP, WGP, CWGP, BTG, and use the following transformations (I: Identity, A: Arc Sinh, SA: Sinh Arc Sinh, BC: Box-Cox, L: Affine). The best method is bolded; we find that BTG has lower RMSE and NLPD than the other models, though unsurprisingly, different transformations are more suitable for different datasets.
| Model | Int Sine RMSE | NLPD | Camel RMSE | NLPD | Abalone RMSE | NLPD | Wine RMSE | NLPD | Creep RMSE | NLPD |
|---|---|---|---|---|---|---|---|---|---|---|
| GP | 0.227 | -1.179 | 2.003 | 11.20 | 3.290 | 1.120 | 1.994 | -0.074 | 37.88 | -1.189 |
| Deep GP | 0.241 | 0.013 | 1.344 | 1.761 | 3.212 | 3.988 | 0.824 | 1.377 | 91.95 | 20.39 |
| WGP-A | 0.179 | -1.392 | 2.055 | 44.10 | 3.097 | 1.105 | 0.811 | -1.108 | 35.48 | -1.421 |
| WGP-SA | 0.172 | -1.693 | 2.012 | 45.10 | 2.992 | 4.879 | 0.809 | -1.269 | 61.62 | 1.001 |
| WGP-BC | 0.184 | -1.371 | 1.964 | 44.70 | 2.826 | 0.837 | 1.045 | -0.186 | 40.38 | -1.357 |
| CWGP-L-SA | 0.172 | -1.693 | 2.055 | 13.80 | 3.117 | 3.444 | 0.808 | -1.415 | 39.71 | -1.134 |
| CWGP-A-BC | 0.174 | -1.457 | 1.960 | 37.90 | 3.088 | 1.151 | 0.808 | **-1.416** | 37.89 | -1.399 |
| BTG-I | 0.169 | -1.429 | 1.820 | 2.342 | 2.804 | -0.070 | 0.808 | -1.356 | 38.11 | -1.273 |
| BTG-A | 0.168 | -1.443 | 1.827 | 2.331 | 2.890 | -0.174 | 0.807 | -1.360 | 38.65 | -1.239 |
| BTG-SA | 0.165 | -1.759 | 1.801 | 0.907 | **2.791** | -0.123 | 0.820 | -0.856 | 91.69 | -0.358 |
| BTG-BC | 0.170 | -1.439 | 1.675 | **0.659** | 3.225 | 0.110 | 0.808 | -1.358 | 39.10 | -1.332 |
| BTG-L-SA | **0.145** | **-1.781** | **1.673** | 0.730 | 2.871 | -0.023 | 0.809 | -1.359 | 91.83 | -0.374 |
| BTG-A-BC | 0.159 | -1.516 | 1.828 | 2.330 | 2.832 | **-0.299** | **0.802** | -1.361 | **35.25** | **-1.433** |

We compare with a standard GP model, WGP models, and CWGP models using transformations⁴ and their compositions from Table 2. For each of these datasets, we perform both regression and uncertainty quantification (UQ) by computing the RMSE and NLPD, which we record in Table 4. These two metrics evaluate the predictive mean and predictive variance, respectively: RMSE is the root mean squared error, which evaluates the predictive mean, and NLPD is the mean negative log predictive density, which evaluates both the predictive mean and the predictive variance (uncertainty). For space's sake, we record additional performance metrics such as MAE in the appendix (Table 6). We also compare end-to-end inference time in Table 5 to give the reader a better idea of the computational overhead associated with BTG.
This compute time generally depends on the dimension of the integral, i.e., the total number of hyperparameters that we must marginalize out in the fully Bayesian approach, and thus the numbers vary somewhat by dataset.

Our results are largely favorable. We find that composed BTG models tend to outperform single-transformation BTG models, which themselves tend to outperform their GP, WGP, and CWGP counterparts. The Deep GP model performs poorly on most of these problems due to the lack of enough data, which would otherwise allow it to overcome the approximation error associated with variational inference. More generally, BTG outperforms the other baselines on almost all problems for both regression and uncertainty quantification (UQ), and quite convincingly in certain cases: for example, BTG achieves an NLPD around two orders of magnitude better than existing models on the Camel dataset.⁵ These results demonstrate the improved flexibility afforded by layered transformations, and they are evidence of the superior performance possible with a fully Bayesian approach on small to medium sized datasets.

⁴ We omit the tanh(·) transform used in the original WGP paper (Snelson et al., 2004), since Rios & Tobar (2019) show that the elementary transforms in Table 2 are competitive with tanh(·).

⁵ We tried to compare against BWGP, but could not find a code release. We suspect BWGP to have performance similar to the Deep GP, as both use variational inference and highly flexible families of warping functions.

Table 5: End-to-end regression times for the Six Hump Camel, Abalone and Wine datasets, using the following models: GP, WGP, CWGP, BTG, and the following transformations: SA: Sinh Arc Sinh, BC: Box-Cox, L: Affine.
We note that the timings for BTG depend on the number of quadrature nodes it uses; we made sure that these settings were consistent with those of Table 4, in which BTG outperforms GP, WGP, and CWGP.

| Model | Camel Dim | Time (min) | Abalone Dim | Time (min) | Wine Dim | Time (min) |
|---|---|---|---|---|---|---|
| WGP-BC | 5 | 0.68 | 10 | 1.28 | 14 | 1.52 |
| WGP-SA | 6 | 0.78 | 11 | 1.22 | 15 | 1.60 |
| CWGP-L-SA | 8 | 1.08 | 13 | 1.48 | 17 | 2.50 |
| CWGP-A-BC | 9 | 1.14 | 15 | 1.56 | 18 | 2.82 |
| BTG-BC | 4 | 1.10 | 9 | 1.02 | 13 | 1.40 |
| BTG-SA | 5 | 0.95 | 10 | 1.04 | 14 | 1.29 |
| BTG-L-SA | 7 | 1.74 | 12 | 1.11 | 16 | 1.79 |
| BTG-A-BC | 8 | 1.65 | 14 | 1.09 | 17 | 1.87 |

Additionally, due to our efficient numerical methods, the end-to-end inference time of BTG is generally comparable to the other baselines; the compute time of BTG is only slightly higher than that of WGP.

6 Conclusion

In this paper, we revisited the Bayesian transformed Gaussian (BTG) model, a variant of Bayesian trans-Kriging. BTG can be seen as a fully Bayesian treatment of input-warped GPs, and we believe it fills an important niche in the GP literature, which has not yet explored the combination of transformed GPs and Bayesian inference. We presented a set of methods for computing with the BTG model efficiently. These methods, which combine sparse grid quadrature, quadrature sparsification, and tight quantile bounds, significantly reduce the expense of computing BTG's predictive median, in certain cases rivaling even the speed of MLE, without degrading prediction accuracy. Furthermore, we proposed a fast LOOCV algorithm for BTG for model selection and assessing model fit. An advantage of our framework is that it allows the practitioner to control the trade-off between the speed and accuracy of the Bayesian approach by modulating the sparsification of the grid and the tolerance of the quantile-finding routine.
Using these efficient methods, we were able to run a set of experiments suggesting that BTG delivers superior performance over comparable models in low-data and medium-data regimes, where hyperparameters are likely under-specified. In future work, we would like to combine our approach with approximate GP inference to further improve computational efficiency. Furthermore, though our sparse quadrature rules allow us to go to higher dimensions, we will still eventually hit a limit (depending on the problem, perhaps around 50); to reach dimensions in the hundreds or thousands, additional assumptions and methodologies will be needed. Lastly, we would like to apply BTG to Bayesian optimization, an application we believe to be particularly suitable for it, since we expect data to be sparse.

References

Ryan Prescott Adams, Iain Murray, and David J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 9-16, 2009.

J. Aitchison and J. R. Dunsmore. Statistical Prediction Analysis. Technical report, 1975.

George E. P. Box and David R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2):211-243, 1964.

Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:147-269, 2004.

Henry R. Chai and Roman Garnett. Improving quadrature for constrained integrands. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2751-2759. PMLR, 2019.

D. Cole, C. Martin-Moran, A. G. Sheard, H. K. D. H. Bhadeshia, and D. J. C. MacKay. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding and Joining, 5(2):81-89, 2000.

Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems, 47(4):547-553, 2009.

Noel A. C. Cressie. Statistics for Spatial Data. Wiley, 1993.

Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207-215. PMLR, 2013.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1998.

Jan Graßhoff, Alexandra Jankowski, and Philipp Rostalski. Scalable Gaussian process separation for kernels with a non-stationary phase. In International Conference on Machine Learning, pp. 3722-3731. PMLR, 2020.

Florian Heiss and Viktor Winschel. Likelihood approximation by numerical integration on sparse grids. Journal of Econometrics, 144(1):62-80, 2008.

Kalvik Jakkala. Deep Gaussian processes: A survey. arXiv preprint arXiv:2106.12135, 2021.

Vidhi Lalchand and Carl Edward Rasmussen. Approximate inference for fully Bayesian Gaussian process regression. In Symposium on Advances in Approximate Bayesian Inference, pp. 1-12. PMLR, 2020.

Miguel Lázaro-Gredilla. Bayesian warped Gaussian processes. Advances in Neural Information Processing Systems, 25:1619-1627, 2012.

Patrick Kofod Mogensen and Asbjørn Nilsen Riseth. Optim: A mathematical optimization package for Julia. Journal of Open Source Software, 3(24):615, 2018.

Marcin Molga and Czesław Smutnicki. Test functions for optimization needs. 2005.

Victor De Oliveira, Benjamin Kedem, and David A. Short. Bayesian prediction of transformed Gaussian random fields. Journal of the American Statistical Association, 92(440):1422-1433, 1997.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2008.

Gonzalo Rios and Felipe Tobar. Compositionally-warped Gaussian processes. Neural Networks, 118:235-246, 2019.

Simone Rossi, Markus Heinonen, Edwin Bonilla, Zheyang Shen, and Maurizio Filippone.
Sparse Gaussian processes revisited: Bayesian approaches to inducing-variable approximations. In International Conference on Artificial Intelligence and Statistics, pp. 1837-1845. PMLR, 2021.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 30, 2017.

Edward Snelson, Zoubin Ghahramani, and Carl E. Rasmussen. Warped Gaussian processes. Advances in Neural Information Processing Systems, 16:337-344, 2004.

Gunter Spöck, Hannes Kazianka, and Jürgen Pilz. Bayesian trans-Gaussian kriging with log-log transformed skew data. In Interfacing Geostatistics and GIS, pp. 29-43. Springer, 2009.

Gilbert W. Stewart. Matrix Algorithms, Volume I. SIAM, 1998.

Dustin Tran, Rajesh Ranganath, and David M. Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.

Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775-1784. PMLR, 2015.

A Methodology

A.1 Quadrature Sparsification

We assume the posterior cdf is a mixture of cdfs, $F(x) = \sum_{i=1}^M w_i f_i(x)$, where $0 \le f_i(x) \le 1$ and each $f_i(x)$ is monotone increasing for $i = 1, \ldots, M$. Assume the weights $\{w_i\}_{i=1}^M$ are ordered by decreasing magnitude from 1 to M. We consider the quantiles of the approximant $F_k(x)$, a truncated and re-scaled version of $F(x)$.

Lemma A.1. Define k to be the smallest integer such that $\sum_{i=1}^k w_i \ge 1 - \epsilon$. Then define the scaled, truncated mixture
$$F_k(x) := \frac{1}{c} \sum_{i=1}^k w_i f_i(x), \qquad c := \sum_{i=1}^k w_i.$$
We have $|F(x) - F_k(x)| \le 2\epsilon$.

Proof. We have
$$|F(x) - F_k(x)| \le \left( \frac{1}{c} - 1 \right) \sum_{i=1}^k w_i f_i(x) + \sum_{i=k+1}^M w_i f_i(x) \le \left( \frac{1}{c} - 1 \right) c + (1 - c) = 2(1 - c) \le 2\epsilon,$$
since $\sum_{i=1}^k w_i f_i(x) \le c$, $\sum_{i=k+1}^M w_i = 1 - c$, and $c \ge 1 - \epsilon$.

Proposition A.1 (Error Bound for Positive Weights). For any $\epsilon \in (0, 1)$, let k be the smallest integer such that $\sum_{i=1}^k w_i \ge 1 - \epsilon$.
Then define the scaled, truncated mixture
$$F_k(x) := \frac{1}{c} \sum_{i=1}^k w_i f_i(x), \qquad c := \sum_{i=1}^k w_i.$$
Let $p \in (0, 1)$ and assume that $p - 2\epsilon,\, p + 2\epsilon \in (0, 1)$. Then we have the bound
$$F^{-1}(p - 2\epsilon) \le F_k^{-1}(p) \le F^{-1}(p + 2\epsilon).$$

Proof. Let $F_k(x^*) = p$. Then by Lemma A.1, $|p - F(x^*)| \le 2\epsilon$, so
$$p - 2\epsilon \le F(x^*) \le p + 2\epsilon.$$
It follows that $F^{-1}(p - 2\epsilon) \le x^* \le F^{-1}(p + 2\epsilon)$.

Proposition A.2 (Error Bound for Negative Weights). Let $F(x)$ be defined as before, except that each $w_i$ is no longer required to be positive. Consider the split $F(x) = F_{M'}(x) + R_{M'}(x)$, where $F_{M'}(x) = \sum_{i=1}^{M'} w_i f_i(x)$ and $R_{M'}(x) = \sum_{i=M'+1}^{M} w_i f_i(x)$. Then for any x, we have $R_{M'}(x) \in [\epsilon_-, \epsilon_+]$, where the epsilons are defined as the sums of negative (resp. positive) weights of $R_{M'}(x)$:
$$\epsilon_- = \sum_{i=M'+1}^{M} [w_i]_- \le 0, \qquad \epsilon_+ = \sum_{i=M'+1}^{M} [w_i]_+ \ge 0.$$
Then we bound $F^{-1}(p)$ as follows:
$$F^{-1}(p + \epsilon_-) \le F_{M'}^{-1}(p) \le F^{-1}(p + \epsilon_+).$$

Proof. Let $F_{M'}(x^*) = p$. Then we can write $F(x^*) = p + R_{M'}(x^*)$. Since $\epsilon_- \le R_{M'}(x^*) \le \epsilon_+$, it follows that
$$p + \epsilon_- \le F(x^*) \le p + \epsilon_+,$$
from which the result follows.

Proposition A.3 (Error Bound at a Quantile). Let $F(x)$ be defined as before, let $\epsilon_1, \epsilon_2 \in (0, 1)$, and let $F_k(x)$ be an approximation to $F(x)$ such that
$$F^{-1}(p - \epsilon_1) \le F_k^{-1}(p) \le F^{-1}(p + \epsilon_2)$$
for some $p \in (0, 1)$. Assuming $p - \epsilon_1,\, p + \epsilon_2 \in (0, 1)$, we have the following error bound at the quantile:
$$\big| F_k^{-1}(p) - F^{-1}(p) \big| \le \epsilon \max_{\xi \in (p - \epsilon_1,\, p + \epsilon_2)} (F^{-1})'(\xi), \qquad \epsilon = \max\{\epsilon_1, \epsilon_2\}.$$

Proof. We have
$$\big| F_k^{-1}(p) - F^{-1}(p) \big| \le \max\big\{ |F^{-1}(p - \epsilon_1) - F^{-1}(p)|,\, |F^{-1}(p + \epsilon_2) - F^{-1}(p)| \big\} \le \max\{\epsilon_1, \epsilon_2\} \max_{\xi \in (p - \epsilon_1,\, p + \epsilon_2)} (F^{-1})'(\xi),$$
where the last step follows from the mean value theorem.

A.2 Quantile Bounds

Proposition A.4 (Convex Hull). Let $F(x)$ be defined as before with $w_i > 0$ and $\sum_{i=1}^M w_i = 1$. Then
$$\min_i f_i^{-1}(p) \le F^{-1}(p) \le \max_i f_i^{-1}(p).$$

Proof. Assume for the sake of contradiction that $F^{-1}(p) > \max_i f_i^{-1}(p)$. Let $k = \arg\max_i f_i^{-1}(p)$. Using the fact that $f_k^{-1}(p) \ge f_j^{-1}(p)$ for any $j \ne k$, we have $f_j(f_k^{-1}(p)) \ge p$ for any $j \ne k$. Therefore,
$$p > F\big(f_k^{-1}(p)\big) = \sum_{j=1}^M w_j f_j\big(f_k^{-1}(p)\big) = w_k p + \sum_{j \ne k} w_j f_j\big(f_k^{-1}(p)\big) \ge w_k p + (1 - w_k) p = p,$$
which is a contradiction. The lower bound is analogous.

Proposition A.5 (Singular Weight). Let $F(x)$ be defined as before with $w_i > 0$ and $\sum_{i=1}^M w_i = 1$.
Let $\bar w_i = 1 - w_i$. Then
$$\max_{i \,:\, p - \bar w_i \ge 0} f_i^{-1}(p - \bar w_i) \;\le\; F^{-1}(p) \;\le\; \min_{i \,:\, p + \bar w_i \le 1} f_i^{-1}(p + \bar w_i).$$

Proof. Assume for the sake of contradiction that $F^{-1}(p) > f_i^{-1}(p + \bar w_i)$ for some i with $p + \bar w_i \le 1$. Then
$$p > F\big( f_i^{-1}(p + \bar w_i) \big) = \sum_{j=1}^M w_j f_j\big( f_i^{-1}(p + \bar w_i) \big) = w_i (p + \bar w_i) + \sum_{j \ne i} w_j f_j\big( f_i^{-1}(p + \bar w_i) \big) \ge w_i (p + \bar w_i).$$
However, this implies that $0 > \bar w_i (w_i - p)$, which is a contradiction because $\bar w_i \ge 0$ and $w_i \ge p$ (by the assumption that $p + \bar w_i \le 1$). The lower bound is analogous.

B Fast Cross Validation

In this section, we discuss the results leading up to the $O(n^3)$ LOOCV algorithms for BTG, which are given by Algorithm 1 and Algorithm 2. Naively, the BTG LOOCV procedure has $O(n^4)$ time cost, due to the costs associated with solving generalized least squares problems related by single-point deletion and evaluating determinants of principal submatrices. We present the relevant propositions used to solve these LOOCV subproblems efficiently in B.1, and derive our full algorithm in B.2.

Notation. Let $f_X$, $M_X$, $K_X$, $x$, $\sigma_{\theta,\lambda}$, $q_{\theta,\lambda}$, $m_{\theta,\lambda}$ and $C_{\theta,\lambda}$ be defined as in the main text. As before, we use the $(-i)$ notation to represent the omission of information from the ith data point. For the BTG LOOCV problem, we consider the n submodels $\{\mathrm{Model}^{(-i)}\}_{i=1}^n$ trained on $\{x^{(-i)}, f_X^{(-i)}, M_X^{(-i)}\}_{i=1}^n$: the location-covariate-label triples obtained by omitting data points one at a time. We wish to efficiently compute the posterior predictive distributions of all n submodels indexed by $i \in \{1, \ldots, n\}$:
$$p\big( f(x^{(-i)}) \mid f_X^{(-i)} \big) \propto \sum_{j=1}^M w_j L_j J_j\, p(\theta_j)\, p(\lambda_j), \qquad (9)$$
where $L_j = p\big( g_{\lambda_j}(f(x^{(-i)})) \mid \theta_j, \lambda_j, f_X^{(-i)} \big)$ and $J_j = p\big( f_X^{(-i)} \mid \theta_j, \lambda_j \big)$ for $j \in \{1, 2, \ldots, M\}$. Recall that in Equation 9, $p\big( g_\lambda(f(x)) \mid \theta, \lambda, f_X \big)$ is the probability density function of the t-distribution $T_{n-p}\big( m_{\theta,\lambda},\, (q_{\theta,\lambda} C_{\theta,\lambda})^{-1} \big)$ and $p(f_X \mid \theta, \lambda)$ is the likelihood of the data given the hyperparameters.
Problem Formulation. We must efficiently compute the parameters of the posterior mixture of t-distributions in Equation 9:
$$\{\mathrm{TParameters}^{(-i)}\}_{i=1}^n := \Big\{ m^{(-i)}_{\theta_i,\lambda_i},\; q^{(-i)}_{\theta_i,\lambda_i},\; C^{(-i)}_{\theta_i,\lambda_i} \Big\}.$$
For definitions of these quantities, we refer to the main text. We emphasize here that solving for $q^{(-i)}_{\theta,\lambda}$ entails solving perturbed generalized least squares problems, and that solving for $m^{(-i)}_{\theta,\lambda}$ and $C^{(-i)}_{\theta,\lambda}$ entails solving perturbed linear systems. For the likelihood term in Equation 9, we have
$$p(f_X \mid \theta, \lambda) \propto \big| \Sigma_\theta \big|^{-1/2}\, \big| M_X^T \Sigma_\theta^{-1} M_X \big|^{-1/2}\, q_{\theta,\lambda}^{-(n-p)/2},$$
hence we are interested in computing the following for $i \in \{1, 2, \ldots, n\}$:
$$\mathrm{Det}^{(-i)} = \Big\{ \big| \Sigma_\theta^{(-i)} \big|,\; \big| (M_X^{(-i)})^T (\Sigma_\theta^{(-i)})^{-1} M_X^{(-i)} \big| \Big\}. \qquad (10)$$

The perturbed least squares problems and linear systems can each be solved independently in $O(n^3)$ time, hence a naive LOOCV procedure would take $O(n^4)$ time. However, using matrix decompositions, we can improve this to $O(n^3)$ total time.

Algorithms. Algorithms 1 and 2 efficiently compute $\{\mathrm{TParameters}^{(-i)}\}_{i=1}^n$ and $\{\mathrm{Det}^{(-i)}\}_{i=1}^n$ for fixed hyperparameters (θ, λ). The total time complexity is $O(n^3)$, because the dominant costs are precomputing a Cholesky factorization of a kernel matrix and repeating $O(n^2)$ operations across the n submodels.

Algorithm 1: T-Distributions of Sub-Models
Inputs: $Y = g_\lambda(f_X)$, $M_X$, $K_X$, $x$
Outputs: $\{m^{(-i)}_{\theta,\lambda}\}_{i=1}^n$, $\{q^{(-i)}_{\theta,\lambda}\}_{i=1}^n$, $\{C^{(-i)}_{\theta,\lambda}\}_{i=1}^n$
Pre-compute $R$, $R_X$, and $\hat x$, where $R^T R = K_X$, $R_X^T R_X = M_X^T K_X^{-1} M_X$, $\hat x = K_X^{-1} Y$
for $i = 1, \ldots$
$n$ do:
  $\ell_i \leftarrow K_X^{-1} e_i / |e_i^T K_X^{-1} e_i|$
  $R_X^{(-i)} \leftarrow \mathrm{Downdate}(R_X, \ell_i)$ (Proposition B.3)
  $r_i \leftarrow \hat x_i / \| R^{-T} e_i \|_2^2$
  $\hat x^{(-i)} \leftarrow \hat x - r_i\, R^{-1} (R^{-T} e_i)$ (Proposition B.2)
  $\beta^{(-i)}_{\theta,\lambda} \leftarrow (R_X^{(-i)})^{-1} (R_X^{(-i)})^{-T} (M_X^{(-i)})^T \hat x^{(-i)}$
  $r^{(-i)} \leftarrow Y^{(-i)} - M_X^{(-i)} \beta^{(-i)}_{\theta,\lambda}$
  $q^{(-i)}_{\theta,\lambda} \leftarrow \| r^{(-i)} \|^2_{(K_X^{(-i)})^{-1}}$
  $m^{(-i)}_{\lambda,\theta} \leftarrow K_{xX} (K_X^{(-i)})^{-1} r^{(-i)} + (\beta^{(-i)}_{\lambda,\theta})^T m(x)$
  $C^{(-i)}_{\theta,\lambda} \leftarrow B(x^{(-i)}) / k_\theta(x^{(-i)}, x^{(-i)})$
end for
return $\{m^{(-i)}_{\theta,\lambda}\}_{i=1}^n$, $\{q^{(-i)}_{\theta,\lambda}\}_{i=1}^n$, $\{C^{(-i)}_{\theta,\lambda}\}_{i=1}^n$

Frozen Hyperparameters. We remark that our LOOCV algorithm is possible because sparse grids and QMC are deterministic: the underlying sampling grids in hyperparameter space are frozen, in contrast to Monte Carlo (MC) methods, which are stochastic. Since we use fixed sparse grids, we are in fact interested in evaluating the posterior distribution at fixed hyperparameters $\{\theta_i, \lambda_i\}_{i=1}^M$. If the sampling grid were not frozen across sub-models, our approach would not be viable, because the sampled points in hyperparameter space would differ for each sub-model. Likewise, in the MLE approach, the hyperparameters $\{\theta_i, \lambda_i\}_{i=1}^M$ should in principle be retrained on the submodels, hence we cannot re-use computed values.

Algorithm 2: Fast Determinant Computation
1: Inputs: $K_X$
2: Output: $\{\log |K_X^{(-i)}|\}_{i=1}^n$
3: Precompute $R^T R = K_X$
4: Precompute $\log(|K_X|)$
5: for $i = 1, \ldots, n$ do
6:   $b_i = e_i^T K_X^{-1} e_i$
7:   $\log |K_X^{(-i)}| \leftarrow \log(|K_X|) + \log(b_i)$ (Proposition B.1)
8: end for
9: return $\{\log |K_X^{(-i)}|\}_{i=1}^n$

B.1 Auxiliary Results

In this section, we present the linear algebra results used in the derivations of Algorithms 1 and 2 in B.2.

Proposition B.1 (Determinant of a Principal Minor). $\det \Sigma^{(-i)} = \det(\Sigma)\, e_i^T \Sigma^{-1} e_i$.

Proposition B.2 (Abridged Linear System). Let $K \in \mathbb{R}^{n \times n}$ be of full rank, and let $c, y \in \mathbb{R}^n$ satisfy $Kc = y$. Then if $r_i = c_i / (e_i^T K^{-1} e_i)$, we have
$$c^{(-i)} = (K^{(-i)})^{-1} y^{(-i)} = \big( c - r_i K^{-1} e_i \big)^{(-i)}.$$

Lemma B.1 (Determinant of the Schur Complement of a Principal Minor).
If $X \in \mathbb{R}^{n \times m}$ with $m < n$ has full column rank and $\Sigma \in \mathbb{R}^{n \times n}$ is a positive definite matrix, then
$$\det\Big( (X^{(-i)})^T (\Sigma^{(-i)})^{-1} X^{(-i)} \Big) = \frac{(-1)^m}{\det \Sigma^{(-i)}} \det W^{(-i)}, \qquad W := \begin{bmatrix} \Sigma & X \\ X^T & O \end{bmatrix},$$
where $W^{(-i)}$ deletes the $i$th row and column of the $\Sigma$ block.

Proof. Extend the Cholesky factorization $R_{11}^T R_{11}$ of $\Sigma$ to obtain the decomposition
$$W = \begin{bmatrix} R_{11}^T & 0 \\ R_{12}^T & R_{22}^T \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix},$$
where $R_{12} = R_{11}^{-T} X$ and $R_{22} = \mathrm{Cholesky}(R_{12}^T R_{12})$. Observe that $-(X^{(-i)})^T (\Sigma^{(-i)})^{-1} X^{(-i)}$ is the Schur complement of $\Sigma^{(-i)}$ in $W^{(-i)}$. This implies that
$$\det W^{(-i)} = (-1)^m \det \Sigma^{(-i)}\, \det\Big( (X^{(-i)})^T (\Sigma^{(-i)})^{-1} X^{(-i)} \Big).$$
By Proposition B.1 (which requires only that $W$ be invertible),
$$\det W^{(-i)} = \det(W)\, e_i^T W^{-1} e_i,$$
hence
$$\det\Big( (X^{(-i)})^T (\Sigma^{(-i)})^{-1} X^{(-i)} \Big) = \frac{(-1)^m}{\det(\Sigma^{(-i)})} \det(W)\, e_i^T W^{-1} e_i.$$

Lemma B.2 (Rank-one downdate for bilinear forms). If $x \in \mathbb{R}^n$ and $\Sigma$ is a positive definite matrix in $\mathbb{R}^{n \times n}$, then
$$(x^{(-i)})^T (\Sigma^{(-i)})^{-1} x^{(-i)} = x^T \left( \Sigma^{-1} - \frac{\Sigma^{-1} e_i e_i^T \Sigma^{-1}}{e_i^T \Sigma^{-1} e_i} \right) x,$$
where $\Sigma^{(-i)} \in \mathbb{R}^{(n-1) \times (n-1)}$ is the $(i,i)$th principal minor of $\Sigma$ and $x^{(-i)} \in \mathbb{R}^{n-1}$ results from deleting the $i$th entry of $x$.

Proof. By Lemma B.1 with $m = 1$, we have
$$(x^{(-i)})^T (\Sigma^{(-i)})^{-1} x^{(-i)} = \det\Big( (x^{(-i)})^T (\Sigma^{(-i)})^{-1} x^{(-i)} \Big) = \frac{-1}{\det(\Sigma^{(-i)})} \det(W)\, e_i^T W^{-1} e_i.$$
In this equation,
$$W = \begin{bmatrix} \Sigma & x \\ x^T & 0 \end{bmatrix} = R^T \begin{bmatrix} I & 0 \\ 0 & -1 \end{bmatrix} R, \qquad R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix},$$
where $R_{11} = \mathrm{chol}(\Sigma)$, $R_{12} = R_{11}^{-T} x$, and $R_{22} = \sqrt{x^T \Sigma^{-1} x}$. Using this decomposition, we may compute the term $e_i^T W^{-1} e_i$. Since
$$R^{-T} e_i = \begin{bmatrix} R_{11}^{-T} e_i \\ -\,x^T \Sigma^{-1} e_i / R_{22} \end{bmatrix},$$
we have
$$e_i^T W^{-1} e_i = e_i^T \Sigma^{-1} e_i - \frac{(x^T \Sigma^{-1} e_i)^2}{x^T \Sigma^{-1} x}.$$
Lastly, since $\det(W) = -\det(\Sigma)\, x^T \Sigma^{-1} x$, we have
$$\frac{\det(W)}{\det(\Sigma^{(-i)})} = \frac{-\det(\Sigma)\, x^T \Sigma^{-1} x}{\det(\Sigma)\, e_i^T \Sigma^{-1} e_i} = -\frac{x^T \Sigma^{-1} x}{e_i^T \Sigma^{-1} e_i}.$$
These together imply that
$$(x^{(-i)})^T (\Sigma^{(-i)})^{-1} x^{(-i)} = x^T \Sigma^{-1} x - \frac{(x^T \Sigma^{-1} e_i)(e_i^T \Sigma^{-1} x)}{e_i^T \Sigma^{-1} e_i},$$
as desired.

Proposition B.3 (Rank-one matrix downdate). Let $X \in \mathbb{R}^{n \times m}$ with $m < n$ have full column rank and let $\Sigma$ be a positive definite matrix in $\mathbb{R}^{n \times n}$. Let $v_i = \Sigma^{-1} e_i$; then
$$(X^{(-i)})^T (\Sigma^{(-i)})^{-1} X^{(-i)} = X^T \left( \Sigma^{-1} - \frac{v_i v_i^T}{e_i^T \Sigma^{-1} e_i} \right) X,$$
where $\Sigma^{(-i)} \in \mathbb{R}^{(n-1) \times (n-1)}$ is the $(i,i)$th principal minor of $\Sigma$ and $X^{(-i)} \in \mathbb{R}^{(n-1) \times m}$ results from deleting row $i$ from $X$.

Proof. Let
$$\hat{\Sigma} := \Sigma^{-1} - \frac{\Sigma^{-1} e_i e_i^T \Sigma^{-1}}{e_i^T \Sigma^{-1} e_i}.$$
It suffices to prove that
$$(x^{(-i)})^T (\Sigma^{(-i)})^{-1} y^{(-i)} = x^T \hat{\Sigma} y \quad \text{for all } x, y \in \mathbb{R}^n.$$
This follows from Lemma B.2, because
$$\big((x + y)^{(-i)}\big)^T (\Sigma^{(-i)})^{-1} (x + y)^{(-i)} = (x + y)^T \hat{\Sigma} (x + y).$$
Expanding and canceling the symmetric (quadratic) terms, which agree by Lemma B.2, yields
$$(x^{(-i)})^T (\Sigma^{(-i)})^{-1} y^{(-i)} + (y^{(-i)})^T (\Sigma^{(-i)})^{-1} x^{(-i)} = x^T \hat{\Sigma} y + y^T \hat{\Sigma} x,$$
implying the result.

B.2 Algorithm Derivation

Recall the following definitions of the elements in $\{\mathrm{TParameters}^{(-i)}\}_{i=1}^n$ from B:

$$q_{\theta,\lambda} = \min_\beta \big\| g_\lambda(f_X) - M_X \beta \big\|^2_{K_X^{-1}} \qquad (11)$$
$$\hat{\beta}_{\theta,\lambda} = \operatorname*{argmin}_\beta \big\| g_\lambda(f_X) - M_X \beta \big\|^2_{K_X^{-1}} \qquad (12)$$
$$m_{\lambda,\theta} = K_{xX} K_X^{-1} \big( g_\lambda(f_X) - M_X \hat{\beta}_{\lambda,\theta} \big) + \hat{\beta}_{\lambda,\theta}^T m(x) \qquad (13)$$
$$C_{\lambda,\theta} = B(x) / [k_\theta(x, x)] \quad \text{(Schur complement)} \qquad (14)$$

We use the generalized least squares LOOCV subroutine outlined in Section B.2.1 to compute $q^{(-i)}_{\theta,\lambda}$ and $\hat{\beta}^{(-i)}_{\theta,\lambda}$ efficiently for all $i \in \{1, \ldots, n\}$, and we use Proposition B.2 to efficiently compute $m_{\lambda,\theta}$ and $C_{\lambda,\theta}$ whenever a perturbed linear system arises. Generally, these routines involve precomputing a Cholesky decomposition and reusing it for back-substitution. These steps are enumerated in Algorithm 1. The computation of $\{\mathrm{Det}^{(-i)}\}_{i=1}^n$ is straightforward given Proposition B.1 and a Cholesky decomposition of the kernel matrix.

B.2.1 Generalized Least Squares

The generalized least squares (GLS) LOOCV problem is that of solving the following set of problems efficiently:
$$\operatorname*{arg\,min}_{x \in \mathbb{R}^p} \big\| b^{(-i)} - A^{(-i)} x \big\|^2_{(\Sigma^{(-i)})^{-1}}, \qquad i \in \{1, 2, \ldots, n\}.$$
It is assumed that $\Sigma \in \mathbb{R}^{n \times n}$ is positive definite, $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times p}$, and $\mathrm{Rank}(A) = \mathrm{Rank}(A^{(-i)}) = p$ for some $p < n$ and for all $i \in \{1, 2, \ldots, n\}$. We consider the normal equations for the $i$th subproblem,
$$\operatorname*{arg\,min}_{x \in \mathbb{R}^p} \big( A^{(-i)} x - b^{(-i)} \big)^T (\Sigma^{(-i)})^{-1} \big( A^{(-i)} x - b^{(-i)} \big),$$
namely,
$$(A^{(-i)})^T (\Sigma^{(-i)})^{-1} A^{(-i)}\, x = (A^{(-i)})^T (\Sigma^{(-i)})^{-1} b^{(-i)}. \qquad (15)$$

We first show that Equation 15 has a unique solution. By Proposition B.3, we have
$$(A^{(-i)})^T (\Sigma^{(-i)})^{-1} A^{(-i)} = A^T \Sigma^{-1} A - \frac{v_i v_i^T}{e_i^T \Sigma^{-1} e_i}, \qquad v_i = A^T \Sigma^{-1} e_i.$$
The right-hand side is a rank-one downdate applied to $A^T \Sigma^{-1} A$. Moreover, the left-hand side is positive definite and hence invertible, because by assumption $\mathrm{Rank}(A^{(-i)}) = p$ and $\Sigma^{(-i)}$ is positive definite.
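The identities in Propositions B.1 and B.3 are easy to verify numerically. The sketch below (NumPy, synthetic SPD matrices; the variable names and sizes are ours, not the authors' code) checks the determinant identity that drives Algorithm 2 and the rank-one downdate of the bilinear form:

```python
import numpy as np

# Numerical check of Propositions B.1 and B.3 on synthetic data
# (a sketch; names and sizes are illustrative, not the paper's code).
rng = np.random.default_rng(0)
n, m, i = 7, 3, 2
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)          # SPD "covariance" matrix
X = rng.standard_normal((n, m))          # full column rank w.h.p.
Sigma_inv = np.linalg.inv(Sigma)

# Proposition B.1: det(Sigma^(-i)) = det(Sigma) * e_i^T Sigma^{-1} e_i.
# In log form this is the O(1)-per-i update used by Algorithm 2.
Sigma_sub = np.delete(np.delete(Sigma, i, axis=0), i, axis=1)
lhs = np.linalg.slogdet(Sigma_sub)[1]
rhs = np.linalg.slogdet(Sigma)[1] + np.log(Sigma_inv[i, i])
assert abs(lhs - rhs) < 1e-9

# Proposition B.3: rank-one downdate of the bilinear form X^T Sigma^{-1} X.
v = Sigma_inv[:, i]                       # v_i = Sigma^{-1} e_i
downdated = X.T @ (Sigma_inv - np.outer(v, v) / v[i]) @ X
X_sub = np.delete(X, i, axis=0)
reference = X_sub.T @ np.linalg.solve(Sigma_sub, X_sub)
assert np.allclose(downdated, reference, atol=1e-9)
```

The point of the downdate is that `downdated` is formed from quantities of the full problem only, so all $n$ leave-one-out matrices can be produced without refactorizing.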
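The GLS LOOCV step can then be sketched end-to-end. The `chol_downdate` below is a standard $O(p^2)$ hyperbolic-rotation Cholesky downdate standing in for the Downdate routine (a sketch, not the authors' implementation; it assumes the downdated matrix remains positive definite, as the rank assumption guarantees), and Proposition B.2 supplies $(\Sigma^{(-i)})^{-1} b^{(-i)}$ without refactorizing:

```python
import numpy as np

def chol_downdate(R, u):
    """O(p^2) hyperbolic-rotation downdate: returns upper-triangular R1
    with R1.T @ R1 == R.T @ R - np.outer(u, u), assuming the result is
    positive definite."""
    R1, u = R.astype(float).copy(), u.astype(float).copy()
    for k in range(R1.shape[0]):
        r = np.sqrt(R1[k, k] ** 2 - u[k] ** 2)
        c, s = r / R1[k, k], u[k] / R1[k, k]
        R1[k, k] = r
        R1[k, k + 1:] = (R1[k, k + 1:] - s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * R1[k, k + 1:]
    return R1

rng = np.random.default_rng(1)
n, p, i = 8, 3, 4
C = rng.standard_normal((n, n))
Sigma = C @ C.T + n * np.eye(n)          # SPD covariance
A = rng.standard_normal((n, p))          # full column rank w.h.p.
b = rng.standard_normal(n)

Sigma_inv = np.linalg.inv(Sigma)
G = A.T @ Sigma_inv @ A                  # normal matrix of the full problem
R = np.linalg.cholesky(G).T              # upper triangular: G = R^T R
c_full = Sigma_inv @ b                   # Sigma^{-1} b, reused for every i
d = Sigma_inv[i, i]                      # e_i^T Sigma^{-1} e_i
v = A.T @ Sigma_inv[:, i]                # v_i = A^T Sigma^{-1} e_i

# Cholesky factor of the downdated normal matrix (Proposition B.3):
R_i = chol_downdate(R, v / np.sqrt(d))
# (Sigma^(-i))^{-1} b^(-i) via Proposition B.2, with no refactorization:
y_i = np.delete(c_full - (c_full[i] / d) * Sigma_inv[:, i], i)
rhs = np.delete(A, i, axis=0).T @ y_i
x_loocv = np.linalg.solve(R_i.T @ R_i, rhs)  # two triangular solves in practice

# Reference: delete row i and re-solve the GLS subproblem from scratch.
A_sub, b_sub = np.delete(A, i, axis=0), np.delete(b, i)
S_inv = np.linalg.inv(np.delete(np.delete(Sigma, i, 0), i, 1))
x_direct = np.linalg.solve(A_sub.T @ S_inv @ A_sub, A_sub.T @ S_inv @ b_sub)
assert np.allclose(x_loocv, x_direct, atol=1e-8)
```

In a real implementation the two back-substitutions with $R^{(-i)}$ cost $O(p^2)$ each, so each subproblem costs $O(n^2)$ overall, dominated by forming $y^{(-i)}$.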
We find the solution to Equation 15 by first computing the Cholesky factorization of its left-hand side. Specifically, given a Cholesky factorization $R^T R$ of $A^T \Sigma^{-1} A$ from the full problem, the Cholesky factor for the subproblem can be computed by an $O(p^2)$ Cholesky downdate
$$R^{(-i)} = \mathrm{Downdate}\big(R,\; v_i / \sqrt{e_i^T \Sigma^{-1} e_i}\big)$$
such that $(R^{(-i)})^T R^{(-i)} = (A^{(-i)})^T (\Sigma^{(-i)})^{-1} A^{(-i)}$. We can therefore solve the normal equations (15) as
$$x^{(-i)} = \Big( (A^{(-i)})^T (\Sigma^{(-i)})^{-1} A^{(-i)} \Big)^{-1} (A^{(-i)})^T y^{(-i)}, \qquad y^{(-i)} = (\Sigma^{(-i)})^{-1} b^{(-i)}.$$
A per-subproblem cost of $O(n^2)$ is attained by evaluating terms from right to left: we first evaluate $y^{(-i)}$ in $O(n^2)$ time by making use of Proposition B.2, then perform back-substitution with the Cholesky factor $R^{(-i)}$ in $O(p^2)$ time. Across all $n$ subproblems, the overall time complexity is thus $O(n^3)$.

C Supplementary Numerical Results

In addition to the main results in Table 4, we provide results for the MAE metric as well in Table 6. The conclusions are similar to those in Section 5.

| Model | Int Sine RMSE | Int Sine MAE | Camel RMSE | Camel MAE | Abalone RMSE | Abalone MAE | Wine RMSE | Wine MAE | Creep RMSE | Creep MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| GP | 0.227 | 0.171 | 2.003 | 1.781 | 3.290 | 2.208 | 1.994 | 1.792 | 37.88 | 25.68 |
| WGP-A | 0.179 | 0.117 | 2.055 | 1.815 | 3.097 | 2.068 | 0.811 | 0.677 | 35.48 | **24.27** |
| WGP-SA | 0.172 | 0.109 | 2.012 | 1.788 | 2.992 | 2.022 | 0.809 | 0.670 | 61.62 | 45.85 |
| WGP-BC | 0.184 | 0.123 | 1.964 | 1.751 | 2.826 | 1.940 | 1.045 | 0.784 | 40.38 | 25.49 |
| CWGP-L-SA | 0.172 | 0.109 | 2.055 | 1.815 | 3.117 | 2.100 | 0.808 | 0.670 | 91.89 | 73.05 |
| CWGP-A-BC | 0.174 | 0.113 | 1.960 | 1.747 | 3.088 | 2.069 | 0.808 | 0.670 | 37.89 | 25.34 |
| BTG-I | 0.169 | 0.100 | 1.820 | 1.731 | 2.804 | 1.842 | 0.808 | 0.670 | 38.11 | 26.18 |
| BTG-A | 0.168 | 0.101 | 1.827 | 1.741 | 2.890 | 1.900 | 0.807 | 0.669 | 38.68 | 27.68 |
| BTG-SA | 0.165 | 0.104 | 1.801 | 1.796 | **2.791** | 1.822 | 0.820 | 0.696 | 91.69 | 74.04 |
| BTG-BC | 0.170 | 0.102 | 1.675 | 1.666 | 3.225 | 2.172 | 0.808 | 0.670 | 39.10 | 26.52 |
| BTG-L-SA | **0.145** | **0.082** | **1.673** | **1.658** | 2.871 | 1.870 | 0.809 | 0.670 | 91.83 | 73.05 |
| BTG-A-BC | 0.159 | 0.090 | 1.828 | 1.742 | 2.832 | **1.814** | **0.802** | **0.664** | **35.25** | 24.44 |

Table 6: A comprehensive set of RMSE and MAE results for prediction using the Int Sine, Six Hump Camel, Abalone, Wine and
Creep datasets. We generally kept the training sets small to highlight BTG's advantages in the data-sparse regime, though for the Int Sine function there was a good amount of data. We compare the following models: GP, WGP, CWGP, and BTG, and use the following transformations: I: identity, A: ArcSinh, SA: SinhArcSinh, BC: Box-Cox, L: affine. The best method in each column is bolded; we find that BTG generally has lower RMSE and MAE than the other models, though unsurprisingly, different transformations are more suitable for different datasets.
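For reference, the transformation families abbreviated in the caption have standard parametric forms; the sketch below gives illustrative NumPy definitions (our parametrizations, which may differ in detail from the paper's):

```python
import numpy as np

# Standard forms of the transformation families abbreviated above
# (illustrative parametrizations; the paper's may differ in detail).
def identity(y):                      # I
    return y

def arcsinh(y, a=0.0, b=1.0):         # A: shift-scale arcsinh warping
    return np.arcsinh((y - a) / b)

def sinh_arcsinh(y, a=0.0, b=1.0):    # SA: sinh-arcsinh warping
    return np.sinh(b * np.arcsinh(y) - a)

def box_cox(y, lam=0.5):              # BC: Box-Cox (y > 0; log when lam == 0)
    return np.log(y) if lam == 0 else (np.power(y, lam) - 1.0) / lam

def affine(y, a=0.0, b=1.0):          # L: affine layer, used in compositions
    return a + b * y

# Layered transformations (e.g. BTG-L-SA) compose these maps:
y = np.linspace(0.5, 2.0, 4)
warped = sinh_arcsinh(affine(y, a=0.1, b=2.0), b=1.5)
assert warped.shape == y.shape
```

Composing an affine layer with a nonlinear warping, as above, is what the layered variants (L-SA, A-BC) in the table denote.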