# sparse_additive_gaussian_process_regression__ec73dc99.pdf Journal of Machine Learning Research 23 (2022) 1-34 Submitted 7/19; Revised 9/20; Published 2/22 Sparse Additive Gaussian Process Regression Hengrui Luo luo.619@osu.edu Department of Statistics The Ohio State University Columbus, OH 43210, USA Giovanni Nattino nattino.1@osu.edu Division of Biostatistics, College of Public Health The Ohio State University Columbus, OH 43210, USA Matthew T. Pratola mpratola@stat.osu.edu Department of Statistics The Ohio State University Columbus, OH 43210, USA Editor: Robert Mc Culloch In this paper we introduce a novel model for Gaussian process (GP) regression in the fully Bayesian setting. Motivated by the ideas of sparsification, localization and Bayesian additive modeling, our model is built around a recursive partitioning (RP) scheme. Within each RP partition, a sparse GP (SGP) regression model is fitted. A Bayesian additive framework then combines multiple layers of partitioned SGPs, capturing both global trends and local refinements with efficient computations. The model addresses both the problem of efficiency in fitting a full Gaussian process regression model and the problem of prediction performance associated with a single SGP. Our approach mitigates the issue of pseudo-input selection and avoids the need for complex inter-block correlations in existing methods. The crucial trade-offbecomes choosing between many simpler local model components or fewer complex global model components, which the practitioner can sensibly tune. Implementation is via a Metropolis-Hasting Markov chain Monte-Carlo algorithm with Bayesian back-fitting. We compare our model against popular alternatives on simulated and real datasets, and find the performance is competitive, while the fully Bayesian procedure enables the quantification of model uncertainties. Keywords: Sparse Gaussian Process, Recursive Partition Scheme, Bayesian Additive Model, Nonparametric Regression. c 2022 Hengrui Luo, Giovanni Nattino and Matthew T. Pratola. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/19-597.html. Luo, Nattino and Pratola 1. Introduction Gaussian process (GP) regression is a widely adopted regression model (Rasmussen and Williams, 2006). Taking a Bayesian approach, its posterior distribution provides a principled way to quantify uncertainties while having nice theoretical properties (Gelman et al., 2013). However, the computational cost of GP likelihood evaluations based on an observed dataset {y, X} of size n is of order O(n3), which primarily results from the need to invert an n n covariance matrix. Therefore, the computational cost could be prohibitively high in scenarios where large datasets need to be analyzed. It is a focus of much current research to solve this problem of high computational cost for GP regression (Banerjee et al., 2012; Liu et al., 2020). Many approaches to circumvent this problem have been explored, such as low-rank covariance approximation (Titsias, 2009), model likelihood approximations (Kaufman et al., 2008) and local GP approximations (Snelson and Ghahramani, 2007; Gramacy and Apley, 2015). However, most of these approaches are not fully Bayesian. We are inspired by the idea of low-rank sparse GP regression (Snelson and Ghahramani, 2006) and localization ideas (Chipman et al., 1998; Lee et al., 2017; Gramacy and Apley, 2015; Park and Huang, 2016; Nguyen-Tuong et al., 2009; Chipman et al., 2016; Lee et al., 2017), but we still want to incorporate these methods within a fully Bayesian framework. Borrowing the framework of Bayesian (generalized) additive modeling (Hastie and Tibshirani, 1990, 2000), we propose the Sparse Additive Gaussian Process (SAGP) model. SAGP combines sparse GP regression and a recursive partition (RP) scheme within a fully Bayesian model. It turns out that our approach can simultaneously handle both local and global features in large datasets while realizing gains in computational efficiency. Furthermore, it provides principled uncertainty quantification for parameters and posterior predictions. A key feature of the approach is a much simplified fixed partitioning scheme that avoids the added computational costs of stochastic tree-based partitioning models (e.g. (Chipman et al., 1998, 2016; Gramacy et al., 2007)). To the best of our knowledge, this kind of additive Bayesian model, combining both sparsification and localization, has never been explored. The paper is organized as it follows. In section 2 we will briefly review the background knowledge for sparse GP, localization and Bayesian additive modeling as they are essential ingredients of SAGP modeling. In section 3 we will specify the SAGP model. Sections 4 and 5 are analyses of simulated and real-world datasets. Finally, we conclude our paper with a discussion in section 6. 2. Background 2.1 Gaussian Process Regression We start with GP regression on the input domain X and use the notation Nd(m, Σ) to denote the d-dimensional Gaussian distribution with mean vector m and covariance matrix Σ, and the notation Nd(y | m, Σ) to denote the d-dimensional Normal density evaluated at y Rn. The prior of the mean regression function is assumed to be a GP with known mean and covariance kernel function. Posterior estimation and prediction arise from combining the prior belief with the information contained in the likelihood of response variables y = Sparse Additive Gaussian Process Regression (y1, . . . , yn)T , observed at known input locations X = {xi Rd, i = 1, . . . , n} X, by using Bayes theorem. We also call f the target and the variable xi the input, based on the model form y(xi) = f(xi) + ϵi, i = 1, . . . , n ϵi N1(0, σ2 ϵ ) (1) which expresses the relationship between input xi and the unknown response f(xi) observed as yi with observational error ϵi having variance σ2 ϵ . Using vector notations we write y = (y1, . . . , yn)T = (y(x1), y(x2), . . . , y(xn))T , f = (f(x1), . . . , f(xn))T and the noise ϵ Nn(0n, σ2 ϵ In) to yield y = f + ϵ. Without loss of generality, it is often convenient to assume that the mean vector f is a realization of a zero mean Gaussian process, f Nn(0, Kn), where Kn = [K(xi, xj)]n i,j=1 , with covariance kernel K( , ) : Rd Rd R encoding assumed properties of the unknown function f to satisfy the application of interest (Rasmussen and Williams, 2006). 2.2 Sparsification of Gaussian Processes There are a variety of sparse approximation approaches to GP regression (e.g. Lawrence et al. 2003; Quinonero-Candela and Rasmussen 2005). A popular approach is the pseudoinput (or latent variable) approach. By replacing the exact covariance matrix in the likelihood computation with a low-rank approximation, one can greatly reduce computational cost. Snelson and Ghahramani (2006) propose the Sparse Gaussian Process (SGP) model by using a subset of the full inputs X = {x1, . . . , xn} as pseudo-inputs, denoted as X = { x1, . . . , xm} X, for m n. Then, f = (f( x1), f( x2), . . . , f( xm))T are called pseudo-targets, and Kn := [K(xk, xl)]n k,l=1 , Km := [K( xk, xl)]m k,l=1 , Knm = [K(xi, xj)]n,m i,j=1 = KT mn denote the (cross-)covariance matrices among and between the full targets f and pseudotargets f. Their approach treats the pseudo-inputs as (hyper-)parameters, resulting in a likelihood function that only requires the inversion of the dense m m matrix Km, a significant computational savings. The posterior and posterior predictive distributions can then be written in closed form by Gaussian conjugacy (Snelson and Ghahramani, 2006). For an SGP model with m pseudo-inputs, the full likelihood is P(y | X, X, f, σ2 ϵ ) = Nn(y | Knm K 1 m f, Λ + σ2 ϵ In) where Λ = diag K(xi, xi) k T i K 1 m ki n i=1 and ki = (K( x1, xi), . . . , K( xm, xi))T . Using Bayes theorem, we can write the posterior distribution of pseudo-targets as P( f X, y, X, σ2 ϵ ) = Nm( f | Km Q 1 m Kmn Λ + σ2 ϵ I 1 y, Km Q 1 m Km) where Qm = Km + Kmn Λ + σ2 ϵ I 1 Knm. The posterior predictive distribution for y at a new input x , after integrating out the pseudo-target f, can be written as P(y | x , X, y, σ2 ϵ ) = N1(k T Q 1 m Kmn Λn + σ2 ϵ In 1 y, σ2 ϵ +K k T K 1 m k +k T Q 1 mjk ), where Luo, Nattino and Pratola k = (K( x1, x ), . . . , K( xm, x ))T . In particular, when n = m we obtain the posterior distributions of the full Gaussian process model. One central problem of the SGP approach is that the sparsification depends on the choice of the pseudo-inputs X, which are treated as hyperparameters to be (somehow) selected once and then held fixed. In the original work, Snelson and Ghahramani (2006) propose to choose the pseudo-inputs by optimizing the marginal likelihood. Others have suggested to minimize the KL divergence (Titsias, 2009; Damianou and Lawrence, 2013). In our Bayesian approach, instead of using a fixed choice of pseudo-inputs (Titsias, 2009; Lee et al., 2017), we draw the pseudo-inputs from a prior distribution. 2.3 Bayesian Additive Modeling and Back-fitting Bayesian additive modeling (Hastie and Tibshirani, 1990; Chipman et al., 1998) is a flexible technique that is widely adopted. Such additive models are formed by taking the sum of many model components, where each component captures a portion of the overall response variability. In the Gaussian setting, fitting a Bayesian additive model can be accomplished by using partial residuals and updating each component sequentially in the so-called backfitting scheme (Hastie and Tibshirani, 2000). Following this scheme, we can represent an additive model with N components without intercept term in vector form as y = PN j=1 f j + ϵ, ϵ Nn(0n, σ2 ϵ In). Bayesian back-fitting proceeds by fitting each additive component, f j, by using the jth partial residuals , rj = y P i =j f i. These residuals are used as data for the j-th component. Starting with a particular initial value, the back-fitting algorithm (Algorithm 3.1 in Hastie and Tibshirani (2000)) iterates until the joint distribution of all mean functions f 1, f 2, . . . , f N stabilizes. One insight into the usefulness of this algorithm is to recognize that it allows updating the fj s one at a time rather than requiring an expensive joint update like the full GP regression on a large dataset. Therefore, it would be advantageous to make the fj updates computationally cheap. 2.4 Localization via Partition Schemes Partitioning the input space X = S j Xj has been another popular way of scaling-up regression models. In this line of research, pioneering works were performed by Breiman (1984), Denison et al. (1998) and Chipman et al. (1998, 2010, 2016). Furthermore, various choices of partition schemes of the input domain are discussed in the local GP regression literature (Nguyen-Tuong et al., 2009; Gramacy and Apley, 2015; Park and Huang, 2016). In terms of Bayesian additive modeling, Chipman et al. (1998) model the data in each partition Xj using an independent model component, conditional on the partitioning defined by a binary tree. This associates the fitted mean function (or target) f j with the data lying in the specific partition Xj. Subsequent works (Gramacy and Apley, 2015; Chipman et al., 2016; Pratola et al., 2020) demonstrate that assembling many simpler models over such partitioning schemes can usually out-perform a single complex model fitted to the entire modeling domain. As pointed out in Gramacy and Apley (2015) and Park and Apley (2018), such localization of the input-domain will fit and predict non-stationary datasets better. Also, Sparse Additive Gaussian Process Regression multi-scale features of a dataset can usually be well captured by introducing a hierarchical structure on the input domain (Fox and Dunson, 2012; Lee et al., 2017). In our approach, capturing global and local features is accomplished through a fixed partition scheme informed by the data X. We will show how this can be done so that the partition scheme is well suited to the Bayesian backfitting algorithm, and use sparse model components to further enhance the scalability of the model. 3. Sparse Additive Gaussian Process Regression (SAGP) The proposed SAGP model combines the three key ingredients of sparsification, Bayesian additive modeling (via backfitting), and localization in a clever way. In particular, our model has the usual additive form, j=1 f j + ϵ, (2) for some error component ϵ with variance σ2 ϵ . Much effort in statistical modeling focuses on the f j. For instance, in linear regression, f j = Xjβj for the jth column of some design matrix X and vector parameter β. In our approach, each f j has entries which are formed by weighted linear combinations of the pseudo-targets, W T f j, and each vector of pseudotargets f j arise from the pseudo-inputs X (j) belonging to the jth subdomain of the input domain X. Additional parameters κ will be involved in each component in forming the weights W . Finally, the subdomains are defined by a partitioning scheme, BN. Let the collection of pseudo-inputs belonging to each partition be X (1), . . . , X (N). Then, our model takes a hierarchical form involving the likelihood function L(y| f1, . . . , f N, κ, σ2 ϵ , BN, X (1), . . . , X (N), X) as well as the prior distributions of the various additive model components in the overall model, P( f1, . . . , f N|κ, BN, X (1), . . . , X (N), X), P(κ | BN, X), P(σ2 ϵ ), and P( X (1), . . . , X (N)|BN, X). To perform inference and prediction, we will be interested in the marginal posterior distribution P( f1, . . . , f N, κ, σ2 ϵ |y, BN, X). Note that the posterior is dependent on the partitioning scheme BN, since it is held fixed in our modeling approach. Therefore, we will start by describing the proposed partitioning scheme. The partitioning scheme reduces the computational cost by limiting the sample size in each model component by exploiting localization. Second, conditional on this localization scheme, the model components (i.e. the fj s) themselves leverage the sparse Gaussian Process. This sparsification reduces the computational cost as described earlier. Finally, our overall model combines all of these sparse localized components into a Bayesian additive model as defined by the likelihood function, and the overall model can be efficiently fit using Bayesian back-fitting. 3.1 A Recursive Partitioning Scheme We consider a recursive partitioning of the domain X that can be represented as a 2d-ary tree. Each node of the tree corresponds to a subregion Bj of X Rd called a block. Only the node at the first level, i.e., the root of the tree, corresponds to the whole domain (B1 = X). The collection of blocks corresponding to nodes at the same depth of the tree is referred to as a layer. The collection of all blocks across all layers of the tree comprises the partitioning Luo, Nattino and Pratola of X. More formally, we define a Recursive Partitioning (RP) scheme as a collection of blocks {B1, ..., BN} and layers L1, . . . , LL of these blocks satisfying the following properties: P1. (Nestedness) For a block Bi Rd in the j-th layer Lj, there exists a unique block Bk Lj 1 Rd in the (j 1)-layer Lj 1 such that Bi Bk. P2. (Disjointedness, or non-overlapping) For two blocks Bi, Bk in the j-th layer Lj such that Bi = Bk, their interiors do not intersect. To facilitate manipulating and storing the RP scheme on a computer, we encode each block by its centroid cj = (c1 j, , cd j) and half-width wj = (w1 j, . . . , wd j ) where for simplicity we take the half-widths to be the same in each dimension given a layer l, wk j = Rl, k = 1, . . . , d. The j-th block is then defined as Bj := B(cj, wj) = n x = (x1, . . . , xd) Rd |xk ck j | wk j , k = 1, . . . , d o . We require each block to have a minimum of mj observations, allowing us to later define an SGP with mj pseudo-inputs in each Bj. For simplicity of exposition, we will assume mj = m for all j = 1, . . . , N. We also require pseudo-inputs X (j) to be mutually disjoint, so that each input setting in X is chosen as a pseudo-input at most once. An example RP scheme construction with L = 3 layers and m = 3 pseudo-inputs is shown in Figure 1. The construction starts with a complete 2d-ary tree consisting of layers L1 = {B1}, L2 = {B2, B3} and L3 = {B4, B5, B6, B7}, and a dataset of n = 15 observations, shown as black dots. Then, the complete tree is pruned according to Algorithm 1, which ensures that each block Bj will have at least m pseudo-inputs available while also satisfying the required properties P1 and P2. Finally, given an RP scheme, one possible random selection of pseudo-inputs is shown. Algorithm 1 is able to perform the required pruning in general. Essentially, the algorithm works by requiring that the total number of observations in block Bj and all of Bj s children satisfies the total required number of pseudo-inputs for these components. Starting from the bottom layer, Algorithm 1 recursively works up the tree, pruning sub-trees that do not satisfy this constraint on total number of observations. Once the pruning is complete, the random selection of pseudo-inputs to blocks can be drawn by starting with blocks in layer L and working back to B1, thereby guaranteeing that all blocks meet the minimum of m pseudo-inputs per block. 3.2 SAGP Model Given a (pruned) RP scheme BN, we propose the additive model (2) for the response y, where each component fj = (fj(x1), . . . , fj(xn))T is described by an SGP model on the domain Bj and ϵ Nn(0n, σ2 ϵ In). For block B(cj, wj), we use X (j) to denote the pseudoinputs for that block, X (j) = { x(j) 1 , . . . , x(j) mj} B(cj, wj) s.t. x(j) k X k. The SGP associated with f j and pseudo-inputs X (j) has corresponding pseudo-targets, f j = fj( x(j) 1 ), fj( x(j) 2 ), . . . , fj( x(j) mj) T Rmj, j = 1, . . . , N. (3) Sparse Additive Gaussian Process Regression The initial complete RP scheme as a binary tree with 3-layers on [0, 1]. Starting from layer 3, we prune B(c7, w7) since there are only 2 < 3 observations available. We keep B(c6, w6), B(c5, w5), B(c4, w4) as they all contain at least m = 3 observations. Moving to layer 2, B(c3, w3) has the 6 observations required by itself and its child B(c6, w6). Checking B(c2, w2), it contains only 6 observations so we prune its children B(c4, w4), B(c5, w5). Moving to layer 1, B(c1, w1) contains 15 observations, therefore there are sufficient observations for B(c1, w1) and its children B(c2, w2) , B(c3, w3) and B(c6, w6). We keep B(c1, w1) and its children, completing the partitioning. Given the final RP scheme B(c1, w1), B(c2, w2), B(c3, w3) and B(c6, w6), one possible random selection of the pseudo-inputs Xj for the j = 1, . . . , N different additive components (here N = 4) conditional on the RP scheme is shown as colored dots. Points with the same color belong to blocks on the same layer. Figure 1: RP scheme on the domain X = [0, 1] as a 21-ary tree with 3 layers and m = 3 pseudo-inputs per block. The n = 15 data points X are represented as dots. The right panels describe the application of the RP pruning Algorithm 1 (a)-(d), and the selection of pseudo-inputs given the RP scheme in (e). The left panels provide the analogous graphical construction of the RP scheme. Luo, Nattino and Pratola Algorithm 1 Pruning algorithm for RP scheme. Input : RP partition scheme A consisting of N components, Observed dataset {X, y}. Output: RP partition scheme A consisting of N N components, 1 for l in L : 1 do 2 for each component j in the l-th layer Ll do 3 for s in L : l do 4 mreq Sum of the numbers of pesudo-inputs required for all components contained in B(cj, wj) in A . 5 if | X B(cj, wj) | mreq then 8 Remove all the children components of component j from the model in s-th layer. Conditional on the RP scheme and pseudo-inputs, the joint posterior of pseudo-targets and other parameters in (2) can be written as: P( f 1, . . . , f N, κ, σ2 ϵ | BN, y, X, X (1), . . . , X (N)) P(y | f 1, . . . , f N, κ, σ2 ϵ, BN, X (1), . . . , X (N), X) | {z } Likelihood Function P( f 1, . . . , f N | BN, X (1), . . . , X (N), X, κ) | {z } Pseudo-target Prior P(κ | BN) | {z } Kernel Prior P(σ2 ϵ) | {z } Error Prior In effect, we view the choice of pseudo-inputs X (1), . . . , X (N) as nuisance parameters, and ultimately will integrate them out with respect to the prior P( X (1), . . . , X (N) | BN), which gives the marginal posterior of interest, P( f 1, . . . , f N, κ, σ2 ϵ | BN, y, X) = Z P( f 1, . . . , f N, κ, σ2 ϵ | BN, y, X, X (1), . . . , X (N)) P( X (1), . . . , X (N) | BN) | {z } Pseudo-input Prior d xm1 . . . d xm N , where d xmj = d x(j) 1 . . . d x(j) mj. In subsection 3.2.7 we will show a Gibbs sampler algorithm for SAGP fitting and for calculating predictions, but first we describe in greater detail the likelihood function and various prior distributions involved in the SAGP model. 3.2.1 Likelihood Function, P(y | f1, . . . , f N, κ, σ2 ϵ , BN, X (1), . . . , X (N), X) Let us denote the covariance kernel for the SGP in the j-th block by K(j), j = 1, . . . , N . We use the Gaussian covariance kernel supported inside Bj with parameters κ(j) = (ρ(j), η(j)). Sparse Additive Gaussian Process Regression K(j)(x, x ) := 1 η(j) ρ(j) [(x x )T (x x )] , x, x Bj. (5) Using (5), we can write down the (cross-)covariance matrices among and between inputs in X and X (j) as: K(j) n := h K(j)(xk, xl) in K(j) mj := h K(j)( x(j) k , x(j) l ) imj K(j) nmj := h K(j)(xk, x(j) l ) in,mj k,l=1 = K(j) mjn T . For a general x Rd, we also have k(j) x := K(j)( x(j) 1 , x), . . . , K(j)( x(j) mj, x) T . (6) Assuming the additive components are conditionally independent, the likelihood is (see Lemma 1 in Appendix D) P(y | f 1, . . . , f N, κ, σ2 ϵ, BN, X (1), . . . , X (N)) = Nn j=1 K(j) nmj K(j) mj 1 f j, σ2 ϵ In + where the matrix Λ(j) n := diag K(j) ii k(j)T i K(j) mj 1 k(j) i n n takes the diagonal form, with k(j) i as defined in (6) (with subscript i being shorthand for xi). 3.2.2 Pseudo-target Prior, P( f1, . . . , f N | BN, X (1), . . . , X (N), X, κ) The prior distribution of pseudo-targets given pseudo-inputs and covariance function parameters is straight-forward. Following Snelson and Ghahramani (2006), the pseudo-targets are assumed to be a priori conditionally independent, and so have Gaussian distributions with prescribed kernels: P( f 1, . . . , f N | BN, X (1), . . . , X (N), X, κ) = j=1 P( f j | BN, X (j), X, κ(j)) j=1 Nmj f j 0mj, K(j) mj . (8) 3.2.3 Pseudo-input Prior, P( X (1), . . . , X (N) | BN, X) The idea of the proposed pseudo-input prior is to sample pseudo-inputs uniformly within each block Bj while satisfying properties P1 P3 required for the RP scheme, BN. Algorithm 2 implements such a sampling scheme, which we now motivate. Let the index set Ij Luo, Nattino and Pratola Algorithm 2 Sampling pseudo-inputs given RP scheme BN. Input : RP scheme BN consisting of N blocks, Observed inputs X. Output: Sample of X (j), j = 1, . . . , N conditional on RP scheme BN. 1 Initialize XA = X as available inputs. 2 for j in N : 1 do 3 Sample a random sample X (j) X Rd of size mj from X Bj XA. 4 XA XA \ X (j) // Remove X (j) sampled in the previous step from XA. representing the indices of children blocks of block Bj, which is defined as Ij := {k = j such that Bk Bj}, and also define the collection of already selected pseudo-inputs of these child blocks as C(Bj) := k Ij X (k). Then, P( X (1), . . . , X (N) | BN) = j:Bj Lℓ P( X (j) | C(Bj)) P( X (j) | C(Bj)) = i=1 P( x(j) i | C(Bj)) P( x(j) i | C(Bj)) = Discrete Uniform ({x X Bj\C(Bj)}) . In the expression above, we essentially draw a random sample from all those observed locations that have not been selected as pseudo-inputs of any children components in the lower layers of the RP scheme. Unlike the standard SGP approach, this allows us to capture the uncertainty of pseudoinput selection by sampling the pseudo-inputs using Algorithm 2 and propagating this uncertainty to the posterior. Alternatives such as a continuous uniform prior over each component domain Bj, or sampling accordingly to design-theoretic considerations (Pratola et al., 2019), are possible. 3.2.4 Additional Prior Distributions, P(κ | BN) and P(σ2 ϵ ) We place a conjugate inverse gamma prior on the noise variance, σ2 ϵ , σ2 ϵ Inverse Gamma(αϵ, βϵ). The hyper-parameters αϵ and βϵ may be chosen as the hyper-parameters of the noise variance in traditional Bayesian GP regression. Sparse Additive Gaussian Process Regression We assume independent priors on the scale and correlation parameters of the kernel, η(j) and ρ(j), P(κ) = P(κ(1), . . . , κ(j)) = Bj Lℓ P(η(j) | αl η, βl η)P(ρ(j)). The precision parameters η(j) are assumed to have gamma priors, η(j) Gamma(αl η, βl η), with αl η, βl η > 0, l = 1, . . . , L. The hyper-parameters αl η, βl η are the same for components within the same layer. We set up these hyper-parameters so that the variance of the response explained by the SAGP model is unequally partitioned across the L layers, with components in higher layers of the partitioning scheme explaining larger portions of the variance. To facilitate the set-up of the hyper-parameters, we first normalize the observed responses y1, . . . , yn, re-centering and re-scaling so that they have mean 0 and variance 1. For all the components in layer l, we set αl η =c1η + 1, βl η =c1η(1 c2η)cl 1 2η , with c1η > 0 and 0 < c2η < 1. For each component j in layer l, this choice implies that 1/η(j), the marginal variance of the component, has prior mean E[1/η(j)] = βl η αlη 1 = (1 c2η)cl 1 2η . For example, if c2η = .1, components on layer l = 1 are expected to have variance 1 c2η = .9, which is 90% of the variance of the response because of the normalization. Components on layer l = 2 are expected to have (1 c2η)c2η = 0.09, 9% of the variance of the response. Similarly, as l increases, components are expected to explain smaller portions of the variance of the response. In particular, the geometric decay of the prior mean of 1/η(j) is chosen so that the expected layer-specific variances add up to approximately the total response variance, which is guaranteed because, if L is sufficiently big, PL l=1(1 c2η)cl 1 2η 1. The other hyper-parameter c1η controls the spread of the prior distributions of 1/η(j), with larger values of c1η imposing a tighter constraint to the prior mean. In our experience, values of c2η = .1 and c1η between 10 and 50 appear to provide the best results in our applications. We set the prior distributions on the parameters ρ(j) in the following way. First of all, we assume that the inputs xi s have been appropriately scaled, so that the domain X is mapped into the unit cube [0, 1]d. This facilitates the definition of priors for ρ(j). Second, as for the η(j), we assume the same prior distribution for parameters corresponding to components in the same layer l. Third, we adopt a structure of priors imposing smoother behaviors for components in the top layers of the partitioning scheme. In other words, we impose a structure of priors where ρ(j) is expected to be greater than ρ(j ) if component j belongs to a layer on a higher level than the layer of component j . Despite a family of beta priors on the ρ(j) may be tuned to satisfy these properties, we empirically observed that setting the values Luo, Nattino and Pratola of these parameters to fixed layer-specific constants ρl (i.e., P(ρ(j)) = δρl, the Dirac delta function on ρl) worked as well but was computationally less expensive. To get the sense on how the values of ρl affect the layer-specific correlations, one may plot the correlation for two responses as a function of the distance of their inputs, as specified by Equation (5). We provide this plot in Appendix A, in the case L = 5 and using the values ρ1 = 10 1, ρL = 10 50 and the intermediate values ρl, l = 2, . . . , L 1 to be equally spaced between ρ1 and ρL on the logarithm (base 10) scale. Even though these values of ρl may appear to quickly become excessively small, the sizes of the subdomains where the components are defined (i.e., the blocks Bj) shrink as l increases. In our numerical example, if we consider a one-dimensional case with two inputs x and x at distance 0.0625 (i.e., the largest distance between two points in one block on the fifth layer), the assumed correlations on components on layer l = 1 to 5 are 0.99, 0.89, 0.80, 0.71 and 0.64, respectively. Notably, the decay of such values depends on the number of layers L, which can be tuned using prior beliefs and the information in the data. In our applications, trading the conventional estimation of the parameters ρ(j) with a set of fixed ρl and a data-driven selection of L via cross-validation (see Section 3.2.8) resulted in sufficiently flexible models. 3.2.5 Full Conditional Distribution of Pseudo-targets In order to implement an MCMC algorithm for SAGP, we apply Bayes theorem on the pseudo-inputs f j in order to yield its full conditional distribution from (7) and (8) and the conditional independence assumption, P( f j | y, X, f 1, . . . , f j 1, f j+1, . . . f N, BN, X (1), . . . , X (N), κ, σ2 ϵ) P(y | f 1, . . . f N, BN, X (1), . . . , X (N), κ, σ2 ϵ) P( f j | X, BN, X (j), κ, σ2 ϵ) rj| K(j) nmj K(j) mj 1 f j, Λ(j) n + σ2 ϵ In Nmj f j 0mj, K(j) mj , (9) where rj = y P l =j K(l) nml K(l) ml 1 f l. Using normal-normal conjugacy, we can identify the mean and variance of this normal distribution, f j | Meanj, V arj, where (see Appendix D) Meanj = K(j) mj Q(j) 1 mj K(j) mjn Λ(j) n + σ2 ϵ In 1 rj, V arj = K(j) mj Q(j) 1 mj K(j) mj, and Q(j) mj =K(j) mj + K(j) mjn Λ(j) n + σ2 ϵ In 1 K(j) nmj. (10) Although we still need to invert an n n matrix Λ(j) n + σ2 ϵ In, it is a diagonal matrix and hence its computational cost will be O(n). 3.2.6 Full Conditional Distribution of Noise Variance As we mentioned in the previous section, we want to make use of the Gaussian-inverse gamma conjugacy. For the observation of sample size n, by conjugacy, the distribution σ2 ϵ Sparse Additive Gaussian Process Regression is again inverse gamma, P(σ2 ϵ | y, X, f 1, . . . f N, BN, X (1), . . . , X (N), κ) = Inverse Gamma αϵ + n 2(y ˆy)T (y ˆy) , where ˆy := PN j=1 K(j) nmj K(j) mj 1 f j is the fitted value from the SAGP model. We can directly sample this parameter using a Gibbs step. 3.2.7 Sampling Algorithm SAGP is fitted by a Metropolis-Hastings Markov chain Monte Carlo (MCMC) algorithm (Gelfand et al., 1990). For each additive component of the model, we have to use the partial residuals rj defined in (9) as data. This step is from the back-fitting scheme designed for fitting additive Bayesian models (Hastie and Tibshirani, 2000). From the likelihood derivations presented in section 3.2.5, we know that f j can be directly sampled from their conditional distributions for each j = 1, . . . , N components. The difficulty in this step is to compute the Meanj, V arj in (10). As mentioned earlier, the main computational cost occurs in the inversion of the covariance matrices in section 3.2.1, which has been reduced compared to a full GP covariance matrix. Numerical instability in inversion of these matrices may cause additional problems, so we adopt the Cholesky decomposition method with diagonal perturbation to solve this instability problem as in Rasmussen and Williams (2006). For each ηj we do not have normal conjugacy, therefore an adaptive Metropolis-Hasting step is used for sampling ηj (Banerjee et al., 2012). The advantage of using such a fully Bayesian model is that the uncertainty quantification comes naturally with the posterior samples from the sampler. Our posterior inference below can be based on all these posterior samples. The algorithm for overall sampling is presented in Algorithm 3 in the Appendix C. 3.2.8 Tuning Parameters and Complexity The trade-offbetween the number of layers L in the RP scheme and the number of pseudoinputs m is central to the SAGP model. On one hand, in SGP modeling (Snelson and Ghahramani, 2006, 2007; Lee et al., 2017), we need to increase the number of pseudo-inputs m to get a better fit of the SGP model. On the other hand, for regression tree partitionining models (Chipman et al., 1998, 2016; Pratola et al., 2020), the more additive components a model has, the better fit we can expect. In the SAGP model, increasing both factors (number of pseudo-inputs, m, and number of layers, L,) would certainly improve the overall fit, but the interesting observation is that there exists a trade-offbetween these two tuning parameters. Increasing the number of layers L may counter-act the effect of decreasing the number of pseudo-inputs m, and vice versa. Theoretically, we can tune the choice of the number of layers using cross-validation (see Figure 4). Practically, we can usually choose reasonable m and L depending on the desired granularity of the RP scheme. This trade-offbetween m and L can also be observed by considering the model s computational complexity. We already mentioned above that for a full GP model based on X the complexity is of order O(n3); for an SGP model with m n pseudo-inputs selected Luo, Nattino and Pratola from X, the complexity is of order O(nm2) (Snelson and Ghahramani, 2006). Since each additive component in our SAGP model is essentially an SGP model, the overall complexity is given by the following proposition. Proposition 1 For an RP scheme on input-domain [0, 1]d, with bi-ary tree (Storer, 2012) in the i-th dimension, the complexity of fitting an L-layer SAGP model with m pseudo-inputs for each block and an overall sample of size n is at most O PL ℓ=1 Qd i=1 bℓ 1 i n m2 . Proof See Appendix B. For N = 1, where there is only one layer and one component, this complexity reduces to SGP complexity with m pseudo-inputs. We will revisit this component number pseudo-input trade-offin our data analyses. 4. Simulation Study To evaluate the performance of our methodology and compare it to competing approaches, we run a family of simulations. We focus on the one-dimensional case (d = 1) and we simulate from a GP with a mean function f(x) = 5 6x3 + 30(x .5)2 + 3 exp(2x 1) + 3x2 sin(12πx) + cos(6πx), (11) which is represented in Figure 2 in the interval [0,1]. We generate a sample of n = 200 locations from a uniform distribution on [0, 1] and we define the observed responses as yi = f(xi) + ϵi for all xi X, with ϵi N1(0, 0.1). The data are split into training and testing sets, with sizes 150 and 50, respectively. We consider two scenarios. In the first scenario, the testing set is selected at random. In the second scenario, the testing set is chosen as the subset with 50 data points with xi closest to a point randomly chosen in [0.25, 0.75]. Figure 2 shows an example for each of these scenarios. We generate 1000 datasets for each scenario (random and interval testing set). In each dataset, we fit the SAGP model with three configurations of m, L: (i) m = 5, L = 4; (ii) m = 10, L = 3; (iii) m = 15, L = 3. We compare the SAGP models to the following methods: Full GP regression. We use the implementation of GP regression model in the R package Dice Kriging by Roustant et al. (2012). SGP regression. We consider the choices m = 5, m = 10 and m = 15 and use the implementation of SGP in the Matlab package implementation SPGP at http: //www.gaussianprocess.org accompanying the paper by Snelson and Ghahramani (2006). Bayesian Additive Regression Trees (BART) Chipman et al. (1998, 2010); Pratola et al. (2020). We used the default number of trees as specified in Chipman et al. (2010) and the implementation at http://bitbucket.org/mpratola/openbt. Sparse Additive Gaussian Process Regression 0.00 0.25 0.50 0.75 1.00 x 0.00 0.25 0.50 0.75 1.00 x SAGP m=15 L=3 SGP m=15 GP BART Figure 2: Example of data generated in the simulation study. The gray, bold, curve represents the true mean function f(x). Training and testing sets are represented as black and red points. Panels (a) and (b) show the scenarios where the 50 data points of the testing set are chosen at random or as the input location that is closest to a randomly chosen point (0.5 in the example), respectively. The posterior predictive functions of four models, fit on the training portion of the data, are provided in both panels. For each generated dataset, the models are fit on the training data and used to predict the response on the testing data. For each point in the testing set, we compute the estimated mean function ˆy(xi) (see Section 3.2.6) and the 95% prediction interval (PI) for yi. The performance of the estimators of the mean function is evaluated in terms of root mean squared error (RMSE). To assess the appropriateness of the uncertainty quantification, we compute the coverage of the PIs and compare it to the nominal prediction level. Finally, we compare the methods in terms of the average value of interval scores, which is a summary measure to assess the quality of prediction intervals (Gneiting and Raftery, 2007). Given a (1 α)100% PI for yi with extremes (li, ui), the interval score at yi is defined as sα(li, ui; yi) = (ui li) + 2 α(li yi)1(yi < li) + 2 α(yi ui)1(yi > ui). We choose this metric to jointly evaluate a family of intervals in terms of precision (i.e. the width of the intervals) and accuracy (i.e., the coverage of the true value). Notably, low values of the score indicate good performance. Luo, Nattino and Pratola 4.2 Results Figure 3 summarizes the resulting RMSEs, PI coverages and averages of the interval scores across the 1000 generated datasets for the two scenarios. G GGGGGGGG G GGG G G GGGGGGGG G GGG G GGGG GGGG GG GGGGGGGGGGGG GGG G G G G GG GGGG GGGG G G GGGGGGG GGGG GGGGGGGGGGGGGGGG G G G GG GGGG GGGG G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G GGGGGGG GGG G GG G GGG GGG GG G G GGGG GGGGG G GGGG G G GGGGGG GG G GG GGGGGGGGGGG GG G GGG G G GG GG GG G GGG GGGGGGGGG GGGG G G G G G G GGG G G G GGG GG GG G GGGG G GG GG GGG GG G GG GG GGG GG G GGGGGGGGG GG G GG GGGGG G G G G GG G GGGGGGG G GGGGGG GGG GGGGGG RMSE Coverage 95% PI Interval Score SAGP m=5 L=4 SAGP m=10 L=3 SAGP m=15 L=3 SGP m=5 SGP m=10 SGP m=15 GP BART GGGGGGGGGGG G GGGG GGGGGGGGGGG G G GGGGGGGG GG GGG G GGGGGG G GGGGGGGG GGG GG G G G GG G G GG GGGGGG G GG GG GGGG G G GG GGG G G G GGGGGGGGGGG G G G GGG G GGGGGG GGG GGG G G GGGGGGGG GGGG GG GG G GGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGGGGGGGGG RMSE Coverage 95% PI Interval Score SAGP m=5 L=4 SAGP m=10 L=3 SAGP m=15 L=3 SGP m=5 SGP m=10 SGP m=15 GP BART Figure 3: RMSE (top panels), coverage (central panels) and interval score (bottom panels) over the 1000 simulated datasets. Panel (a) shows the results in the case where the testing set is chosen at random over [0, 1]. Panel (b) shows the results in the case where the testing set is chosen in a random interval with center uniformly selected from [0.25, 0.75]. Panel (a) provides the results in the scenario where the testing set is selected at random. In terms of RMSE, both the SAGP and SGP models perform better with larger values of m. As expected, the full GP model attains the smallest RMSEs. The SAGP models with m = 5 and 10 perform better than the SGP models with the same number of pseudo-inputs. For m = 15, the median RMSEs in the SAGP and SGP models are similar, but the performance of the SAGP model is more consistent across simulations (the upper quartile of SAGP with m = 15 is considerably smaller than the one of SGP with m = 15). With the considered configuration of the parameters, the BART model performs slightly better than the SAGP model with m = 5, but worse than the SAGP model with m = 10 and m = 15. The coverage of the 95% PIs is close to the nominal level for all the methods except for BART. The PIs of the SAGP model appear to be slightly too narrow, as most of the coverages are a little lower than .95. SGP and GP models show coverages perfectly matching the nominal value. Sparse Additive Gaussian Process Regression However, the BART model produces overly wide PIs, as the median coverage is 100%. The ranking of the methods in terms of interval score is similar to the one based on the RMSE. Again, better performances are attained by SAGP and SGP models with larger values of m. The SAGP models with m = 10 and m = 15 perform better than all the other methods, except for the full GP model. Panel (b) provides the results of the simulations in the scenario where the testing set is an interval with random mid-point. Notably, this prediction problem is much harder than the one evaluated by the previous scenario, as the models are forced to a certain degree of extrapolation due to the lack of data. BART, GP and SAGP with m = 15 attain the best performance in terms of RMSE. Overall, the SAGP model seems to perform better than SGP. The coverage is suboptimal for all the methods, being much lower (undercoverage) than expected for SAGP and SGP with m = 5 and higher (overcoverage) for GP. A wide range of coverages is observed for all the other methods. With respect to the interval score, GP and BART are the methods that appear to perform best. Among the SAGP and SGP models, SAGP with m = 15 is the best performing and competitive with GP and BART. 4.3 Computational Details As for any Bayesian model that is fit using MCMC algorithms, the convergence to the stationary distribution must be investigated also for the SAGP model. In our specific implementation, we discard the first 10,000 samples as burn-in and keep the following 1,000 samples to compute posterior estimates. We monitor the convergence of σ2 ϵ , sampled with Gibbs steps, and of the parameters η(j), which are sampled with Metropolis-Hastings steps with an adaptive choice of the bandwidth of the proposal distribution to control the acceptance rate. The considered SAGP models turned out to mix well and reasonably fast on the basis of trace-plots of the parameters (not shown) and the diagnostics suggested by Gelman et al. (2013), which are provided in Appendix E. Notably, in our experience, similar satisfying mixing diagnostic for the SAGP model may be achieved with much fewer steps than 10,000. Table 1: Computation time needed to fit the SAGP model on 1,000 simulated datasets on a 40-core cluster. m L Testing set CPU time (hh:mm:ss) 10 3 Random 106:53:21 10 3 Interval 107:56:21 5 4 Random 241:17:31 5 4 Interval 175:45:35 15 3 Random 477:22:52 15 3 Interval 479:06:39 With respect to the computation time, setting the burn-in size to 10,000 and the size of posterior samples to 1,000, an SAGP model can be fit on one dataset of size n = 200 in Luo, Nattino and Pratola 3 to 5 minutes, depending on the configuration of m and L, using a laptop with an Intel Core-i5 2.30GHz processor. The time that was needed to fit the model on one batch of 1,000 simulated datasets are summarized in Table 1. 5. Real Data Applications In this section, the proposed model is applied to real data. We considered four datasets that differ in terms of sample size and number of predictors: Heart rate data: n = 1, 664, d = 1; Temperature data: n = 247, d = 2; Ice Sheet data: n = 2, 226, d = 2; UK Housing data: n = 1, 519 and d = 8. The performance is evaluated quantitatively with the out-of-sample RMSE, coverage of 95% PIs and average interval score on a 25% test set. Our model is compared to other popular methods. We considered two Bayesian models: BART and Bayesian CART (BCART) (Chipman et al., 1998), implemented in the Bayes Tree package on CRAN (version 0.3-1.4). We also considered two frequentist models: full GP and Local Approximate GP (la GP) (Gramacy et al., 2016), implemented in the la GP package on CRAN (version 1.5-5). For the d = 1 and d = 2 datasets, we also provide a qualitative assessment of the fits via graphical plots. 5.1 Heart Rate Data The heart rate (HR) dataset we study here can be used to evaluate the level of physical preparation and design training/rehabilitation activities (Zakynthinaki, 2015). In this study, a single runner was asked to run on a treadmill at constant speed. The HR (in beats/minute) was recorded for about 7 minutes from the beginning of the exercise. After the exercise, the HR of the subject was measured for about 10 minutes during the recovery. The experiment was repeated four times, varying the speed of the exercise (v = 13.4, 14.4, 15.7 and 17 km/h). For our illustrative purposes, we use the data of the exercise performed at speed v = 13.4 km/h, which are graphically represented in Figure 5. We consider SAGP models with m = 5 and m = 10 pseudo-inputs, and use 10-fold crossvalidation to select the number of layers L as shown in Figure 4(a). This plot demonstrates the trade-offbetween the values of pseudo imputs m and the number of layers L, with L = 3, m = 10 being a good choice. The resulting fitted SAGP model, which consists of 7 additive components, is summarized in Figure 5, both in terms of how the fit is decomposed by layer in panel (a) and the overall fit shown in panel (b). An alternative fit with L = 4, m = 5 is provided in Appendix F. 5.2 Temperature Data In this section we study a moderate sized 2-dimensional dataset of average daily maximum temperature in degrees centigrade at 247 locations in Colorado during 1997 (https:// Sparse Additive Gaussian Process Regression 1 2 3 4 5 6 L m G 5 10 15 20 25 Figure 4: (a) Out-of-sample MSE (on log10 scale) attained by different models fitted on 1 dimensional heart-rate dataset with m = 5, 10 and L = 1, . . . , 6. (b) Out-of-sample MSE (on log10 scale) attained by different models fitted on 2 dimensional temperature dataset with m = 5, 10, 15, 20, 25 and L = 1, . . . , 4. For any L > 4, our pruning algorithm 1 will reduce it to L = 4; for m = 25 our pruning algorithm will reduce SAGP model to L = 3 www.image.ucar.edu/Data/US.monthly.met/USmonthly Met.shtml, US precipitation and temperature (1895-1997) dataset). Qualitative comparisons of GP, BART and SAGP are shown in Figure 6. For GP regression we used MLE estimates with the Matern(5/2) kernel. For BART we use the default settings (Chipman et al., 2010). For SAGP, we choose L = 3, m = 25 and calibrate the α, β of the noise prior in SAGP and the noise estimate in BART according to MLE of noise estimate from GP. The GP model shows reasonable predictions, however, the prediction comes with high predictive variance in locations away from the observations and especially near the boundary (not shown). The predictive mean of BART shows it has a slight grid-like artifact due to its decision tree construction. In addition, the shape of the response around the mode is noticeably more rectangular than suggested by the other models. This dataset provides us a 2-dimensional example where the data is limited, which is actually a disadvantage for SAGP since the sparsification does not cut down the computational cost significantly yet some information is lost in the procedure. Nonetheless, the SAGP method captures the major trends and even some of the extremal temperatures close to 40 degrees centigrade. Compared to BART and GP, the SAGP model behaves in-between these two methods and provides us with very competitive performance. 5.3 Ice Sheet Data The Ice Sheet data is a larger 2-dimensional dataset but this time with noticeably uneven sampling as discussed in Park and Apley (2018).The response is ice sheet thickness in meters collected over a region of west Antarctica (Blankenship et al., 2004).We used the data from 1991, first converting the longitude and latitude into 2-dimensional Euclidean coordinates and standardizing the dataset to [0, 1]2. Luo, Nattino and Pratola GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG 0 250 500 750 1000 Time (seconds) HR (beats/min) GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG 0 250 500 750 1000 Time (seconds) HR (beats/min) GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG G GG GGG 0 250 500 750 1000 Time (seconds) HR (beats/min) 5 6 7 Layer 1 1:2 1:3 Figure 5: The panels show the observed HR values over time as black dots and results about the fit of the SAGP model with m = 10 and L = 3. Panel (a) shows the posterior means and the 95% CIs of the 7 additive components of the SAGP model on 100 equispaced locations on the support of the data. Panel (b) shows the posterior means and the 95% CIs of the sole component in layer 1 (red), of the components belonging to layer 1 and 2 (green) and of the complete model, including components from layer 1, 2 and 3. Panel (c) provides the predictive mean and the corresponding 95% prediction intervals. A plot of the data and predictive fits for the GP (exponential correlation), la GP, BART, treed GP (TGP; (Gramacy et al., 2007)) and SAGP models are shown in Figure 7. We included TGP in this plot as we thought it may be helpful with the unevenly sampled data but did not end up including it in our overall quantitative results below. For the SAGP model, we show the fit obtained with L = 3, m = 10. The fits obtained among these models show quite different behaviors. The full GP fit possess extreme boundary behavior due to the lack of data near the boundary. The BART model shows more noticeable grid-like artifacts in this dataset, but does not suffer from the boundary effects seen with the GP. The TGP regression also does not exhibit boundary effects but has much higher variability of the mean response in the data-rich region which does not agree with the other models. The dynamic partitioning of TGP also introduces considerable computational cost. The la GP model with its default settings and MSPE criterion exhibits some degree of variability in the fitted mean response, particularly near the boundaries, however, it is the most computationally efficient method. Sparse Additive Gaussian Process Regression 107.5 105.0 102.5 X 108 106 104 102 X 108 106 104 102 X BART posterior mean 108 106 104 102 X SAGP(L=3,m=25) posterior mean Figure 6: The original max temperature dataset for Colorado in 1997 summer. The horizontal axis is longitude; the vertical axis is latitude; the response is the observed values of maximal temperature in degrees Celsius. The typical raster plot for predictive means of Gaussian Process regression (Universal kriging with Gaussian kernel and MLE nuggets)/BART(number of trees m = 200)/SAGP(m = 25, L = 3) evaluated on a fine meshed grid (generated by steplengths of 0.1) on the original input domain. Luo, Nattino and Pratola Table 2: Performance of SAGP model and of other competing methods on four datasets. Dataset Model Details RMSE Coverage (%) Avg. Int. Score (log10 scale) Heart Rate (n=1,664, d=1) SAGP L=3, m=10 1.340 102 88.8 2.800 SAGP L=4, m=5 1.342 102 89.5 2.796 GP - 2.727 102 12.0 3.855 SGP m=5 1.366 102 100.0 4.605 SGP m=15 1.347 102 100.0 4.633 SGP m=150 1.325 102 100.0 4.720 la GP ALC 1.325 102 91.7 2.777 la GP MSPE 1.879 102 91.1 2.821 BART - 1.331 102 18.1 3.494 BCART - 1.355 102 94.7 2.751 Temperature (n=247, d=2) SAGP L=2, m=5 3.412 100 79.7 1.345 SAGP L=4, m=10 2.910 100 77.4 1.356 GP - 3.041 100 92.5 1.228 SGP m=5 3.401 100 100.0 2.010 SGP m=15 3.345 100 100.0 2.033 SGP m=150 3.043 100 100.0 1.709 la GP ALC 3.206 100 86.6 1.247 la GP MSPE 3.431 100 85.3 1.273 BART - 3.123 100 52.3 1.658 BCART - 3.432 100 90.8 1.195 Ice Sheet (n=2,226, d=2) SAGP L=3, m=10 1.944 102 89.6 3.048 SAGP L=3, m=15 1.858 102 89.1 3.038 SAGP L=4, m=5 2.126 102 89.7 3.073 GP - 0.766 102 93.8 2.570 SGP m=5 2.575 102 100.0 5.638 SGP m=15 2.278 102 100.0 5.841 SGP m=150 1.637 102 100.0 6.365 la GP ALC 1.672 102 88.7 2.892 la GP MSPE 1.715 102 88.7 2.894 BART - 1.532 102 49.9 3.322 BCART - 2.231 102 91.3 3.026 UK Budget (n=1,519, d=8) SAGP L=2, m=10 3.486 101 92.3 2.327 SAGP L=2, m=15 3.370 101 92.8 2.312 SAGP L=3, m=10 3.478 101 92.2 2.327 SAGP L=3, m=15 3.366 101 92.7 2.313 GP - 3.105 101 94.2 2.245 SGP m=5 3.087 101 5.0 2.904 SGP m=15 3.053 101 13.9 2.855 SGP m=150 3.112 101 49.0 2.647 la GP ALC 4.459 101 55.3 2.679 la GP MSPE 4.529 101 54.7 2.685 BART - 3.065 101 48.1 2.605 BCART - 3.613 101 92.4 2.286 Sparse Additive Gaussian Process Regression 10500 10250 10000 9750 X 10600 10400 10200 10000 9800 X 10600 10400 10200 10000 9800 X Local Approximate GP mean 10600 10400 10200 10000 9800 X BART posterior mean 10600 10400 10200 10000 9800 X Treed GP Regression posterior mean 10600 10400 10200 10000 9800 X SAGP(L=3,m=10) posterior mean Figure 7: The scatter point plot shows the ice thickness in a region of Antarctica. The horizontal and the vertical axis are geographical coordinates in kilometers (km). The raster plot for predictive means of GP/la GP/BART(number of trees m = 200)/TGP/SAGP(m = 10, L = 3) for ice sheet dataset and evaluation on a fine meshed grid (generated by steplengths of 0.02) on the original input domain. The color scales and the axis are the same in these plots. Luo, Nattino and Pratola The SAGP model fit is reasonable. It does not have extreme values, high variability in the mean response or boundary effects like some of the other models, yet retains most of the smoothness suggested by the full GP fit. 5.4 United Kingdom Budget Data This dataset is a well-known econometric dataset first studied in Blundell et al. (1998). The dataset consists of a cross-section of 1,519 households drawn from the 1980-1982 British Family Expenditure Surveys. We attempt to predict the total household expenditure (rounded to the nearest 10 UK pounds sterling) with 8 variables as inputs. We do not use the variable of the number of children per household in the regression, since the dataset is cleaned in such a way that it contains only households with one or two children, as presented in Blundell et al. (1998). We choose this dataset to explore the performance of our SAGP model in the higherdimensional scenario. As mentioned by Gramacy et al. (2016), such a dataset of highdimensionality (d = 8) will usually present computational challenges to classical GP models. Our main goal is to show that with reasonable increase of computational time, SAGP model has competitive performance. Since this dataset cannot be easily visualized, we only present quantitative results as shown in the next section. 5.5 Quantitative Performance Summary The performance of SAGP and the alternative models considered is summarized quantitatively in Table 2. As in Section 4, we summarize the quantitative performance using 25% test set of original dataset to calculate out-of-sample RMSE, coverage of 95% credible intervals and interval scores. SAGP, SGP, GP, la GP, BART and BCART models were applied to all datasets. For SAGP, we generally selected L = 2 4 and m = 5 15 while for SGP we selected m = 5, 15 or 150. BART and BCART models were fit using their defaults, and la GP was fit using defaults but with both ALC and MSPE local design criteria. Generally, we see that models could excel in one aspect (say RMSE) typically at the expense of another aspect of model fit quality, where the quality of fit depends on the dataset and application scenario. We notice that BART generally had lower coverage probability for the 95% PI and higher interval score. BCART had better coverage probability but generally was not the best in terms of RMSE. For the frequentist GP, two datasets exhibited good RMSE and two exhibited weaker RMSE performance. The GP is also less informative in terms of uncertainty quantification than the Bayesian models we considered. The la GP models often provided good RMSE performance, particularly with the ALC criterion, however the coverage was lower on the UK dataset. In comparison, the SAGP model generally provided RMSE performance on par or near the best model for each dataset. The coverage also shows that SAGP models were consistent performers, especially compared to BART and la GP. We also see that SGP is uniformly worse than SAGP, often having either higher RMSE or worse coverage behavior. Overall, it is clear that SAGP is competitive with the best models for each dataset as summarized in Table 2, and we often prefer the qualitative aspect of the SAGP fits compared to some of the alternative models, as demonstrated earlier. Sparse Additive Gaussian Process Regression 6. Discussion The SAGP model effectively borrows ideas from both sparsification and localization. In particular, we divide the input domain X in such a way that we can choose enough pseudoinputs and fit a sparse GP regression within the sub-region block of the partition, which also produces a trade-offfor model parameters. We also showed that SAGP can achieve an effective reduction in computational cost (see Proposition 1) since all components within a layer can effectively be fit in parallel. As a Bayesian additive model, SAGP provides uncertainty quantification and leads to accurate posterior inference. Along the model building process, we exhibit how the pseudoinputs can be sampled to capture the aspect of model uncertainty, which is ignored with the fixed pseudo-inputs of SGP. The RP partition scheme outlined not only serves as a localization construction but also guarantees adequate pseudo-inputs for this resampling are available in each SAGP model component.As shown in the data analysis examples, the SAGP model is a competitive candidate compared to other generalization of GP regression methods. SAGP model can easily be generalized to higher dimensions, and our RP partition scheme is very flexible since it carries a hierarchical structure that allows us to analyze dataset in a multiscale way. With the homogeneous partition in one dimension, our RP scheme is similar to the one proposed by Bui and Turner (2014) and Lee et al. (2017); with heterogeneous partition in higher dimensions, our RP partition scheme is more flexible. For example, we can use binary partitioning in the first dimension but ternary partitioning in the second dimension. This will also preserve the hierarchical structure and allow us to decompose the high-dimensional data through different layers. As for future works, there are various possible extensions of the proposed SAGP model. In terms of generalization of our current base model, we are interested in making the SAGP model admit different covariance kernels and different number of pseudo-inputs in each component. It is also of interest to extend the SAGP model to binary, count and categorical responses. To push the computational implementation of SAGP further and since independent sparse Gaussian process (SGP) regression models are fitted for each local component, it is readily seen that our model is parallelizable for efficient computation. Theoretically, we would also like to see a (frequentist) consistency result (Rockov a and van der Pas, 2017) for the SAGP model and a careful analysis of the effect of the choice of priors in this model. Acknowledgments The authors wish to thank the helpful feedback of the editor, associate editor and two anonymous reviewers, which helped to substantially improve the paper. The work of M.T.P. was supported in part by the National Science Foundation under Agreement DMS-1916231 and in part by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2018-CRG7-3800.3. Luo, Nattino and Pratola Appendix A. Correlation between Targets as Function of the Distance between Inputs 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Correlation Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 Figure 8: Given two one-dimensional inputs xi and xi (d = 1), the figure represents the correlation between fj(xi) and fj(xi ) (i.e., the targets of component j) as a function of the distance between xi and xi . We represent with different colors the correlation that is assumed for components at different layers (L = 5 in the Figure). Notably, the size of a component s domain Bj depends on the layer where the component is defined (the higher the layer index, the smaller the domain). Therefore, in our binary recursive partition scheme, the maximum possible distance between two inputs (highlighted with a vertical dashed line in the Figure) halves at each layer. Appendix B. Proof of Proposition 1 We first calculate the total number of components in the SAGP models when the inputdomain is [0, 1]d. For each dimension i = 1, 2, . . . , d, if we bi-ary subdivide the [0, 1] interval, then there are at most |Lℓ|= Qd i=1 bℓ 1 i individual components in form of B(cj, wj) in the ℓ-th layer of the RP scheme for ℓ= 1, 2, . . . , L. For each component in the ℓ-th layer, the number of observations fitted to the j-th component is at most |X (j)| n. Then we fit a SGP model with m pseudo-inputs, whose complexity is O(|X (j)| m2). Then for the ℓ-th layer the total complexity is O(P Bj Lℓ|X (j)| m2). Sparse Additive Gaussian Process Regression Therefore, for the ℓ-th layer the complexity is at most O(|Lℓ| n m2). Therefore we can compute the total complexity of the model as O(PL ℓ=1|Lℓ| n m2) O PL ℓ=1 Qd i=1 bℓ 1 i n m2 . Appendix C. MCMC Algorithm for SAGP Model Fitting (Algorithm 3) Algorithm 3 MCMC algorithm for SAGP model. Input : RP partition scheme consisting of N components, Number of pseudo-inputs for each component mj, Hyper-parameters for the prior of parameters, Observed dataset {X, y}. Output: Posterior samples for parameters, X (j), f j, Predictive posterior samples for y , f j, y. 1 Initialization of the parameter values 2 while not converged do 3 Sample X (j), j = 1, . . . , N as in Algorithm 2. 4 for j in 1 : N do 5 rj y P l =j K(l) nml K(l) ml 1 f l 6 f j Nmj(Meanj, V arj) as in (10) 7 ηj,new Uniform(ηj bandwidth) 8 α min(1, C Model Likelihood(ηj,new)/C Model Likelihood(ηj)) 9 if Uniform(0,1) α then 10 ηj ηj,new 11 end // For every burn-in steps/20 steps, we adjust bandwidth. 12 if Acceptance rate of ηj / (0.39, 0.49] then 13 Band width for proposing ηj Acceptance rate of ηj/0.44 16 σ2 ϵ Inverse Gamma αϵ + n 2(y ˆy)T (y ˆy) Appendix D. Detailed Derivation of Posterior Distribution in Section 3.2.5 To clarify our derivations, we first stated following simple lemma that will be used, which can be derived from Woodbury identity (Horn and Johnson, 1990) or a direct verification (Rasmussen and Williams, 2006). Lemma 1 For a joint Gaussian distribution a Rn, b Rn if a b , Caa Cab Cba Cbb then its conditional distribution is: a | b Nn µa + Cab (Cbb) 1 (b µb), Caa Cab (Cbb) 1 Cba (13) Luo, Nattino and Pratola In particular, for fl = f(xl), f j = ( fj( x1), . . . , fj( xmj))T and covariance kernel function K = K(j), K(j) ll = K(j)(xl, xl) if fl f j X (j), xl N1+mj K(j) ll k(j)T l k(j) l K(j) mj then its conditional distribution is: fl | f j, X (j), xl N1 k(j)T l K(j) mj 1 f j, K(j) ll k(j)T l K(j) mj 1 k(j) l We assume a Gaussian prior on the pseudo-targets as in (3.2.2). P( f j | X (j)) Nmj f j | 0mj, K(j) mj (16) and then use Bayesian rule on the parameter f j, recalling that (2) determines the form of mean and variance of the Gaussian distribution P(y | f 1, . . . , f N, X (1), . . . , X (N), X, κ). P( f j | y, X, f 1, . . . , f N, X (1), . . . , X (N), κ) P(y | f 1, . . . f N, X (1), . . . , X (N), κ) P( f j | {x}n, f 1, . . . , f j 1, f j+1, . . . f N, X (1), . . . , X (N), κ) (17) l =j K(l) nml K(l) ml 1 f l K(j) nmj K(j) mj 1 f j, Λ(j) n + σ2 ϵ In Nmj f j|0mj, K(j) mj (18) We can derive the posterior using the normal normal conjugacy: P( f j | y, X, f 1, . . . , f j 1, f j+1, . . . f N, X (1), . . . , X (N), κ) 1 r 2π Λ(j) n + σ2ϵ In exp l=1 K(l) nml K(l) ml 1 f l #T Λ(j) n + σ2 ϵ In 1 l=1 K(l) nml K(l) ml 1 f l 1 q 2πKmj exp 1 2 f T j K 1 mj f j We complete the squares inside the exponent, K 1 mj + K(j) mj 1 K(j) mjn Λ(j) n + σ2 ϵ In 1 K(j) nmj K(j) mj 1 f j f T j Λ(j) n + σ2 ϵ In 1 K(j) nmj K(j) mj 1 f j + proportionally constant terms (20) Sparse Additive Gaussian Process Regression After completing square we can obtain the mean and variance of the j-th component: Meanj = K(j) mj 1 + K(j) mj 1 K(j) mjn Λ(j) n + σ2 ϵ In 1 K(j) nmj K(j) mj 1 l =j K(l) nml K(l) ml 1 f l T Λ(j) n + σ2 ϵ In 1 K(j) nmj K(j) mj 1 =K(j) mj Q(j) 1 mj K(j) mjn Λ(j) n + σ2 ϵ In 1 l =j K(l) nml K(l) ml 1 f l By Woodbury identity, we know that for Q(j) mj = K(j) mj + K(j) mjn Λ(j) n + σ2 ϵ In 1 K(j) nmj we can write its inverse as Q(j) 1 mj = K(j) 1 mj K(j) 1 mj K(j) mjn h Λ(j) n + σ2 ϵ In + K(j) nmj K(j) 1 mj K(j) mjn i 1 K(j) nmj K(j) 1 mj Using this mj mj matrix Qmj, we can write down the covariance matrix V arj: V arj = K(j) mj 1 + K(j) mj 1 K(j) mjn Λ(j) n + σ2 ϵ In 1 K(j) nmj K(j) mj 1 1 (22) =K(j) mj K(j) mj K(j) mj 1 K(j) mjn Λ(j) n + σ2 ϵ In + K(j) nmj K(j) mj 1 K(j) mj K(j) mj 1 K(j) mjn K(j) nmj K(j) mj 1 K(j) mj (23) =K(j) mj K(j) mjn Λ(j) n + σ2 ϵ In + K(j) nmj K(j) mj 1 K(j) mjn K(j) mj 1 K(j) mjn Λ(j) n + σ2 ϵ In + K(j) nmj K(j) mj 1 K(j) mjn 1 K(j) mj 1 ) =K(j) mj Q(j) 1 mj K(j) mj (25) Note that although we do need to invert an n n matrix Λ(j) n +σ2 ϵ In, it is a diagonal matrix and hence easy to invert as claimed before. Luo, Nattino and Pratola Appendix E. Diagnostic Statistics for the SAGP Model on 1000 Batches of Simulated Dataset eta1 eta2 eta3 eta4 eta5 eta6 eta7 sig_eps Geweke's convergence diagnostic eta1 eta2 eta3 eta4 eta5 eta6 eta7 sig_eps Heidelberger Welch's convergence diagnostic Figure 9: The panels of box plots show the Geweke s convergence diagnostic (Geweke et al., 1991) and Heidelberger-Welch s convergence diagnostic (Heidelberger and Welch, 1983) based on the MCMC sample of SAGP model, for parameter η(j) and σ2 ϵ , calculated from the 1000 batches of simulated dataset from formula (11) with the testing set is random or interval. Appendix F. Heart Rate Dataset Analyzed by SAGP Model Fitted with m = 5 and L = 4 (Figure 10) Sparse Additive Gaussian Process Regression GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG G GG GGG 0 250 500 750 1000 Time (seconds) HR (beats/min) GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG G GG GGG 0 250 500 750 1000 Time (seconds) HR (beats/min) GG G GGGGGGG G G G G GGGG GG G GGGGG G G GGG GG G GG GG GGGG G G G G 0 250 500 750 1000 Time (seconds) HR (beats/min) 11 12 13 14 15 Layer 1 1:2 1:3 1:4 Figure 10: The panels show the observed HR values over time as black dots and results about the fit of the SAGP model with m = 5 and L = 4. Panel (a) shows the posterior means and the 95% CIs of the 15 additive components of the SAGP model on 100 equispaced locations on the support of the data. Panel (b) shows the posterior means and the 95% CIs of the sole component in layer 1 (red), of the components belonging to layer 1 and 2 (green) , of the components belonging to layer 1, 2, 3 and of the complete model, including components from layer 1, 2, 3 and 4. Panel (c) provides the predictive mean and the corresponding 95% prediction intervals. Anjishnu Banerjee, David B. Dunson, and Surya T. Tokdar. Efficient gaussian process regression for large datasets. Biometrika, 100.1:75 89, 2012. D. D. Blankenship et al. Ice thickness and surface elevation, southeastern ross embayment, west antarctica. U.S. Antarctic Program (USAP) Data Center, 2004. doi: doi:10.7265/ N5WW7FKC. URL https://www.usap-dc.org/view/dataset/609099. Richard Blundell, Alan Duncan, and Krishna Pendakur. Semiparametric estimation and consumer demand. Journal of applied econometrics, 13(5):435 461, 1998. Leo Breiman. Classification and Regression Trees. Belmont, California: Wadsworth, 1984. Thang D. Bui and Richard E. Turner. Tree-structured gaussian process approximations. In Advances in Neural Information Processing Systems, 2014. Hugh A. Chipman, Edward I. George, and Robert E. Mc Culloch. Bayesian cart model search. Journal of the American Statistical Association, 93.443:935 948, 1998. Hugh A. Chipman, Edward I. George, and Robert E. Mc Culloch. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4.1:266 298, 2010. Luo, Nattino and Pratola Hugh A. Chipman et al. High-dimensional nonparametric monotone function estimation using bart. ar Xiv preprint ar Xiv:1612.01619, 2016. Andreas Damianou and Neil D. Lawrence. Deep gaussian processes. Artificial Intelligence and Statistics, 2013. David G.T. Denison, Bani K. Mallick, and Adrian F.M. Smith. A bayesian cart algorithm. Biometrika, 85.2:363 377, 1998. Emily Fox and David B. Dunson. Multiresolution gaussian processes. Advances in Neural Information Processing Systems, 2012. Alan E. Gelfand, Susan E. Hills, Amy Racine-Poon, and Adrian F. M. Smith. Illustration of bayesian inference in normal data models using gibbs sampling. Journal of the American Statistical Association, 85.412:972 985, 1990. Andrew Gelman et al. Bayesian Data Analysis. New York : CRC Press : Taylor & Francis Group, 2013. John Geweke et al. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, volume 196. Federal Reserve Bank of Minneapolis, Research Department Minneapolis, MN, 1991. Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102.477:359 378, 2007. Robert B. Gramacy and Daniel W. Apley. Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24.2:561 578, 2015. Robert B. Gramacy et al. tgp: an r package for bayesian nonstationary, semiparametric nonlinear regression and design by treed gaussian process models. Journal of Statistical Software, 19.9:1 46, 2007. Robert B. Gramacy et al. lagp: large-scale spatial modeling via local approximate gaussian processes in r. Journal of Statistical Software, 72.1:1 46, 2016. Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. New York : Chapman and Hall, 1990. Trevor J. Hastie and Robert J. Tibshirani. Bayesian backfitting (with comments and a rejoinder by the authors. Statistical Science, 15.3:196 223, 2000. Philip Heidelberger and Peter D. Welch. Simulation run length control in the presence of an initial transient. Operations Research, 31.6:1109 1144, 1983. Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge: Cambridge university press, 1990. Sparse Additive Gaussian Process Regression Cari G. Kaufman, Mark J. Schervish, and Douglas W. Nychka. Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103.484:1545 1555, 2008. Neil D. Lawrence, Ralf Herbrich, and Matthias Seeger. Fast sparse gaussian process methods: The informative vector machine. In Advances in neural information processing systems, 2003. Byung-Jun Lee, Jongmin Lee, and Kee-Eung Kim. Hierarchically-partitioned gaussian process approximation. In Artificial Intelligence and Statistics, 2017. Haitao Liu et al. When gaussian process meets big data: A review of scalable gps. IEEE Transactions on Neural Networks and Learning Systems, 31(11):4405 4423, 2020. Duy Nguyen-Tuong, Matthias Seeger, and Jan Peters. Model learning with local gaussian process regression. Advanced Robotics, 23.15:2015 2034, 2009. Chiwoo Park and Daniel Apley. Patchwork kriging for large-scale gaussian process regression. The Journal of Machine Learning Research, 19.1:269 311, 2018. Chiwoo Park and Jianhua Z. Huang. Efficient computation of gaussian process regression for large spatial data sets by patching local gaussian processes. Journal of Machine Learning Research, 17.174:1 29, 2016. Matthew T. Pratola, C Devon Lin, and Peter F. Craigmile. Optimal design emulators: A point process approach. ar Xiv preprint ar Xiv:1804.02089, 2019. Matthew T. Pratola, Hugh A. Chipman, Edward I. George, and Robert E. Mc Culloch. Heteroscedastic bart via multiplicative regression trees. Journal of Computational and Graphical Statistics, 29:405 417, 2020. Joaquin Quinonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6:1939 1959, 2005. Carl E. Rasmussen and Christopher K.I. Williams. Gaussian Process for Machine Learning. Cambridge: MIT press, 2006. Veronika Rockov a and St ephanie van der Pas. Posterior concentration for bayesian regression trees and forests. ar Xiv preprint ar Xiv:1708.08734, 2017. Olivier Roustant, David Ginsbourger, and Yves Deville. Dicekriging, diceoptim: Two r packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, Articles, 51.1, 2012. Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, 2006. Edward Snelson and Zoubin Ghahramani. Local and global sparse gaussian process approximations. In Artificial Intelligence and Statistics, 2007. Luo, Nattino and Pratola James Andrew Storer. An Introduction to Data structures and Algorithms. New York : Springer, 2012. Michali Titsias. Variational learning of inducing variables in sparse gaussian processes. Artificial Intelligence and Statistics, 2009. Maria S. Zakynthinaki. Modelling heart rate kinetics. Plo S one, 10.4, 2015.