# gaussian_process_boosting__af1c94fe.pdf

Journal of Machine Learning Research 23 (2022) 1-46 Submitted 4/20; Revised 8/22; Published 9/22

Gaussian Process Boosting

Fabio Sigrist fabio.sigrist@hslu.ch Lucerne University of Applied Sciences and Arts Suurstoﬃ1 6343 Rotkreuz, Switzerland

Editor: John Cunningham

We introduce a novel way to combine boosting with Gaussian process and mixed effects models. This allows for relaxing, ﬁrst, the zero or linearity assumption for the prior mean function in Gaussian process and grouped random eﬀects models in a ﬂexible non-parametric way and, second, the independence assumption made in most boosting algorithms. The former is advantageous for prediction accuracy and for avoiding model misspeciﬁcations. The latter is important for eﬃcient learning of the ﬁxed eﬀects predictor function and for obtaining probabilistic predictions. Our proposed algorithm is also a novel solution for handling high-cardinality categorical variables in tree-boosting. In addition, we present an extension that scales to large data using a Vecchia approximation for the Gaussian process model relying on novel results for covariance parameter inference. We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets. Keywords: non-linear mixed eﬀects models, mixed eﬀects machine learning, grouped random eﬀects, longitudinal data, spatial and spatio-temporal data, tree-boosting with high-cardinality categorical variables

1. Introduction

Boosting (Freund and Schapire, 1996; Friedman, 2001) is a machine learning technique that often achieves superior prediction accuracy on tabular data sets (Nielsen, 2016; Shwartz Ziv and Armon, 2022; Grinsztajn et al., 2022). This is reﬂected in statements such as [i]n general boosted decision trees is regarded as the most eﬀective oﬀ-the-shelf nonlinear learning method for a wide range of application problems (Johnson and Zhang, 2013). Apart from this, the wide adoption of tree-boosting in applied machine learning and data science is due to several advantages: boosting with trees as base learners can automatically account for complex non-linearities, discontinuities, and high-order interactions, it is robust to outliers in and multicollinearity among predictor variables, it is scale-invariant to monotone transformations of the predictor variables, it can handle missing values in predictor variables automatically by loosing almost no information (Elith et al., 2008), and it can perform variable selection. In boosting and in many other state-of-the-art supervised machine learning algorithms, a ﬂexible and potentially complex function relates a set of predictor variables to a response variable. However, the response variable is usually assumed to be independent across observations conditional on the predictor variables. This means that potential residual correlation, i.e., correlation that is not accounted for by the predictor function, is ignored. As we show in our experiments, modeling such correlation allows for more eﬃcient learning

c 2022 Fabio Sigrist.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-322.html.

of the predictor function, and it is important for prediction accuracy, in particular for probabilistic predictions and for predicting sums (or averages) over, e.g., space. Example applications of the latter include the prediction of global average temperatures, the total rainfall in a catchment area, or the total value of a portfolio of real estate objects. Apart from the potentially unrealistic independence assumption, tree-boosting can have diﬃculty with high-cardinality categorical variables and predictions are discontinuous. The latter is often unrealistic for spatial and spatio-temporal data.

Gaussian processes (Williams and Rasmussen, 2006) are ﬂexible non-parametric function models that achieve state-of-the-art prediction accuracy and allow for making probabilistic predictions (Gneiting et al., 2007). They are used in areas such as (i) non-parametric regression, (ii) modeling of time series, spatial, and spatio-temporal data (Shumway and Stoﬀer, 2017; Banerjee et al., 2014; Cressie and Wikle, 2015), (iii) emulation of large computer experiments (Kennedy and O Hagan, 2001), (iv) optimization of expensive black-box functions (Jones et al., 1998), (v) parameter tuning in machine learning models (Snoek et al., 2012), and others. Further, mixed eﬀects models (Laird et al., 1982; Pinheiro and Bates, 2006) with grouped, or clustered, random eﬀects are widely used in various scientiﬁc disciplines for modeling data with a grouping structure such as panel and longitudinal data.

In Gaussian process and mixed eﬀects models, the prior mean is traditionally assumed to be either zero or a linear function of predictor variables. Residual structured variation is then modeled using a zero-mean Gaussian process and/or a grouped random eﬀects model. However, both the zero-mean and the linearity assumption can be unrealistic, and higher prediction accuracy can be obtained by relaxing these assumptions; see, e.g., our experiments in Sections 4 and 5. Furthermore, if the prior mean function of a Gaussian process model is misspeciﬁed, spurious second-order non-stationarity can occur as the covariance function of a misspeciﬁed model equals the true covariance function plus the squared bias of the mean function (Fuglstad et al., 2015; Schmidt and Guttorp, 2020). It is thus important to ﬁrst correctly model the prior mean function before accounting for potential residual second-order non-stationarity.

In this article, we propose a novel way to combine boosting with Gaussian process and grouped random eﬀects models to remedy the above-mentioned drawbacks and leverage the advantages of both approaches. In particular, the goal is to relax, ﬁrst, the linearity or zero prior mean assumption in Gaussian process and mixed eﬀects models and, second, the independence assumption in boosting. This is done by considering a mixed eﬀects model where the prior mean ﬁxed eﬀects function, also denoted as predictor function in the following, is assumed to be a non-parametric function of predictor variables, and the random eﬀects can consist of various combinations of Gaussian processes and grouped random eﬀects. We propose to model the non-linear predictor function by an ensemble of base learners, such as regression trees (Breiman et al., 1984), learned in a stage-wise manner using boosting, and the parameters of the covariance structure of the random eﬀects are jointly estimated with the predictor function. As every categorical variable corresponds to a grouping of the data, grouped random eﬀects models can be seen as a tool for modeling categorical variables. In light of this, we also present a novel approach for dealing with high-cardinality categorical predictor variables in tree-boosting.

1.1 Relation to existing work

The majority of existing Gaussian process and mixed eﬀects models assume that the prior mean ﬁxed eﬀects function is linear or zero. Little research has been done on combining modern supervised machine learning techniques, such as boosting or random forest, with mixed eﬀects models and Gaussian processes. In the following, we give a brief review

Gaussian Process Boosting

of existing literature for mixed eﬀects models where a non-linear predictor function is estimated in a ﬂexible, non-parametric way. To relax the linearity assumption in mixed eﬀects models, Tutz and Reithinger (2007) and Groll and Tutz (2012) propose to use generalized additive models (GAMs) (Hastie and Tibshirani, 1986; Wood, 2017). However, the structure of the predictor function has to be determined a priori by specifying, for instance, main and second-order interaction eﬀects. In general, this can thus result in model misspeciﬁcation. For the special case of grouped random eﬀects models for clustered or longitudinal data, several non-parametric, machine learning-based approaches have been proposed. This includes Hajjem et al. (2011), Sela and Simonoﬀ(2012), and Fu and Simonoﬀ(2015) which use regression trees for the predictor function, and Hajjem et al. (2014) which use random forest to model the predictor function. Both the MERT and MERF algorithms of Hajjem et al. (2011, 2014) and the REEM tree algorithm of Sela and Simonoﬀ(2012) repeatedly (i) learn the predictor function using a machine learning technique, (ii) estimate covariance parameters, and (iii) calculate predictions for random eﬀects. This can make such approaches computationally demanding, in particular when the sample size is large and the predictor function is modeled using a complex model because both covariance parameters and the ﬁxed eﬀects functions are repeatedly learned in every iteration. Since it is not clearly stated in the previous literature which optimization problems are solved by these algorithms and how they are related to the EM algorithm for mixed eﬀects models (Laird et al., 1982), we elaborate on this in Appendix A. Pande et al. (2017) propose a tree-boosting approach called boostmtree for a special type of mixed eﬀects models for longitudinal data with single-level grouped random eﬀects and where predictor variables are assumed to be constant within subjects. Focusing on modeling complex interactions between time and predictor variables, their approach diﬀers from ours in several directions. For instance, they do not cover Gaussian processes or other forms of random eﬀects models with complex clustering such as crossed or nested random eﬀects. Furthermore, they (re-)estimate the covariance parameters in every boosting iteration using the nlme R package, and their boostmtree algorithm uses a certain form of in-sample cross-validation to avoid overﬁtting. Table 1 provides a brief overview of the above-mentioned methods for linear and non-linear mixed eﬀects models.

Method Fixed eﬀects F( ) Random eﬀects Zb LMM Linear Grouped random eﬀects Gaussian processes (GP) Linear or F( ) 0 GP Grouped random eﬀects with GAM ﬁxed eﬀects Non-linear, GAM Grouped random eﬀects

Classical boosting Non-linear, boosting None MERT Non-linear, regression tree Grouped random eﬀects RE-EM tree Non-linear, regression tree Grouped random eﬀects MERF Non-linear, random forest Grouped random eﬀects

boostmtree Non-linear, boosting, predictor variables are constant within groups

Special type of grouped random eﬀects for longitudinal data

GPBoost Non-linear, boosting GP and grouped random eﬀects

Table 1: Short overview of methods for linear and non-linear mixed eﬀects models.

A straightforward approach to applying machine learning methods to spatial and grouped data consists of simply including spatial locations as continuous variables and

grouping variables as categorical variables in the set of predictor variables for the predictor function. For linear models, this is equivalent to using ﬁxed eﬀects instead of random eﬀects. However, this approach has several drawbacks. For instance, when modeling spatial data, it is often required that the spatial eﬀect is continuous over space, but tree-boosting and random forest produce discontinuous functions. A way to avoid this problem in boosting is to use base learners that are continuous in the locations. This is the approach proposed in Hothorn et al. (2010) where splines are used to model spatial eﬀects and ridge regression is used to model grouped eﬀects. As we argue in the following, this approach still has some weaknesses. First, all boosting approaches which model spatial and grouped eﬀects using deterministic base learners have the drawback that they assume a deterministic relationship and that the residual error term is the only source of stochastic variation. In contrast, Gaussian process and mixed eﬀects models are probabilistic models that explicitly account for correlations and provide probabilistic predictions. Modeling correlation is particularly important for probabilistic predictions of, e.g., areal sums as correlation between the different random eﬀects needs to be taken into account; see our experiments in Sections 4 and 5. Besides, estimates are usually less eﬃcient when using ﬁxed eﬀects instead of random eﬀects in linear models. It is thus likely that the predictor function is also less eﬃciently estimated in such an independent ﬁxed-eﬀects boosting approach. Our simulated experiments in Section 4 support this hypothesis. Moreover, splines have the disadvantage that they suﬀer from the so-called curse of dimensionality in non-spatial applications when the dimension of the locations , or features, is large, and the locations are thus sparse in space; see Section 1 for examples where this occurs. The above-mentioned eﬃciency problem of ﬁxed eﬀects models is related to the fact that classical tree-boosting algorithms can have diﬃculties with high-cardinality categorical variables. Concerning this, the approach of the Light GBM boosting library (Ke et al., 2017) works by partitioning all groups into two subsets when ﬁnding splits in the tree-building algorithm. Since there are 2m 1 possible partitions of m diﬀerent groups, an eﬃcient approach based on Fisher (1958) is used in Light GBM. Further, Prokhorenkova et al. (2018) present an approach based on ordered target statistics calculated using random partitions of the training data for handling categorical predictor variables. Finally, we also mention Song and Zhang (2005) who use Gaussian processes as base learners in special types of boosting algorithms, and Sigrist (2021b) who combines reproducing kernel Hilbert space (RKHS) regression functions with trees as base learners in boosting algorithms. However, these approaches using Gaussian processes or their deterministic counterparts, kernel machines, as base learners assume conditional independence among observations and are thus diﬀerent from our approach.

2. A Non-linear and Non-parametric Mixed Eﬀects Model

2.1 Model assumptions and notation

Following the terminology used in the mixed eﬀects models literature (Laird et al., 1982; Pinheiro and Bates, 2006), we assume a mixed eﬀects model of the form

y = F(X) + Zb + ϵ, b N(0, Σ), ϵ N(0, σ2In), (1)

where y = (y1, . . . , yn)T Rn is the response variable, F(X) Rn are called ﬁxed effects, b Rm are zero-mean random eﬀects with covariance matrix Σ Rm m, and ϵ = (ϵ1, . . . , ϵn)T Rn is an independent error term. Speciﬁcally, F(X) is the rowwise evaluation of a function F : Rp R, F(X) = (F(X1), . . . , F(Xn))T , where Xi = (Xi1 . . . , Xip)T Rp is the i-th row of X containing predictor variables for observation i, i = 1, . . . , n. The matrices X Rn p and Z Rn m are the ﬁxed and random eﬀects

Gaussian Process Boosting

predictor variable matrices. Further, n denotes the number of data points, m denotes the dimension of the random eﬀects b, and p denotes the number of predictor variables in X. The random eﬀects vector b is either a ﬁnite-dimensional version of a Gaussian process and/or it can contain grouped random eﬀects.1 The covariance matrix Cov(b) = Σ is usually assumed to be parametrized by a relatively low number of parameters, and it can depend on predictor variables S Rn d. For instance, the predictor variables S can consist of locations and time points for spatial and time series data, or they can be any input features in machine learning applications. For notational simplicity, we suppress the dependence of Σ on its parameters and also on S. The matrix Z relates the random eﬀects b to the response variable y and it is often an incidence matrix with entries in {0, 1}. In the case of grouped random eﬀects, the columns of Z correspond to dummy variables, also called one-hot encoded variables, of the categorical variables that deﬁne the grouping structure. For Gaussian processes, Z is usually simply an identity matrix, but it can also be a binary incidence matrix to model multiple observations at the same locations. Further, Z can also contain covariate data, e.g., in the case of random coeﬃcient models (Gelfand et al., 2003; Pinheiro and Bates, 2006) also called spatially varying coeﬃcient models for spatial data. The predictor variables in Z and S may or may not be a subset of the predictor variables in X, and vice versa. In Section 2.2, we outline several examples and special cases of models for Zb. Note that conditional on F(X) and Z, dependence among the response variable y can arise either due to the matrix Z being non-diagonal or due to the covariance matrix Σ of the random eﬀects being non-diagonal. In linear mixed eﬀects models (LMMs) and linear Gaussian process models in spatial statistics, it is assumed that F(X) = XT β, where β Rp is a vector of coeﬃcients. We denote models with F(X) = XT β as linear models in this article. Further, in the machine learning literature on Gaussian processes, is usually assumed that the prior mean function is zero, F(X) 0. As mentioned above, one of the novel contributions of our approach is to relax this linearity or zero prior mean assumption in a ﬂexible and non-parametric manner. Concerning F( ), we assume that F( ) is a function in a function space H which is the linear span of a set S of so-called base learners fj( ) : Rp R. Possible base learners are linear functions (B uhlmann et al., 2006), smoothing splines (B uhlmann and Yu, 2003), wavelets (Saberian et al., 2011), reproducing kernel Hilbert space (RKHS) regression functions (Sigrist, 2021b), neural networks (Schwenk and Bengio, 2000), and regression trees (Breiman et al., 1984), with the latter being the most popular choice, in particular, in applied machine learning due to the advantages listed in Section 1. For deﬁning functional derivatives, we additionally assume that the space H is normed. For instance, assuming that the Xi s are identically distributed and that all F H are square integrable with respect to the law of X1, a norm on H can deﬁned by the inner product F, G = EX1(F(X1)G(X1)) for F, G H. The marginal distribution of y in (1) is

y N (F(X), Ψ) , Ψ = ZΣZT + σ2In,

and the negative log-likelihood of this model is given by

L(y, F, θ) = 1

2(y F)T Ψ 1(y F) + 1

2 log det (Ψ) + n

2 log(2π), (2)

where we abbreviate F = F(X) and θ Θ Rq denotes all variance and covariance parameters, i.e., σ2 and the parameters of Σ. In this article, we assume that the ﬁrst

1. We assume that the random eﬀects follow a normal distribution, but moderate violations of this assumption have been shown to have only a small eﬀect on prediction accuracy in the context of generalized linear mixed models (Mc Culloch and Neuhaus, 2011).

element of θ is the error variance, θ1 = σ2. Further, to distinguish a function from its evaluation, we use the symbols F( ) to denote a function and F for the function evaluated at X. Factoring out the error variance σ2 by setting

Σ = Σ/σ2 and Ψ = Ψ/σ2

gives the equivalent form y N F(X), σ2Ψ .

As this leads to a closed form expression for estimating σ2 (see Section 3.2), it can be beneﬁcial to use a reparametrization in which, after factoring out σ2, all other variance parameters are replaced by the ratio of their original value and the error variance σ2, and Ψ does not depend on σ2. Calculations in the GPBoost library (see Section 3.6), which implements our methodology, are based on this reparametrization.

2.2 Examples and special cases of random eﬀects models

In this section, we outline several examples of random eﬀects models. We distinguish between the following broad classes of random eﬀects models: (i) grouped random eﬀects models, (ii) Gaussian process models, and (iii) combinations of these two types of random eﬀects.

2.2.1 Grouped random effects model

Grouped, or clustered, random eﬀects models are applied to grouped data. For instance, a grouping of data can occur if there are several units each having multiple observations which are potentially dependent within units. Grouped random eﬀects then account for correlation due to this grouping structure. Depending on the application area, such grouped data is also denoted as longitudinal, panel, or repeated measurement data if diﬀerent units are repeatedly measured over time. In a single-level grouped random eﬀects model, there is a single grouping of the data into m diﬀerent groups. For every group j, j = 1, . . . , m, there is a random eﬀect bj R, and the random eﬀects bj are assumed to be independent and identically distributed with Cov(b) = Σ = σ2 1Im. We thus have

Ψ = σ2 1ZZT + σ2In,

where the matrix Z {0, 1}n m is an incidence matrix that relates group-level random eﬀects to observations. Such a single-level model can be easily extended to allow for multiple random eﬀects which are crossed and hierarchically nested and can also consist of random coeﬃcients (random slopes). In the latter case, Z is no longer a binary incidence matrix but contains covariate data. Note that every categorical variable of cardinality m corresponds to a grouping of the data into m diﬀerent groups, and the columns of Z correspond to dummy, or one-hot encoded, variables of the categorical variable representing the grouping structure. In general, grouped random eﬀects models can thus be used to model (high-cardinality) categorical predictor variables.

2.2.2 Gaussian process model

In a Gaussian process model, the random eﬀects b = (b(s1), . . . , b(sm))T are a ﬁnitedimensional version of a Gaussian process b(s) with a covariance function

Cov(b(s), b(s )) = c(s, s ), s, s Rd,

Gaussian Process Boosting

observed at locations s1, . . . , sm. Note that we use the terminology locations adopted in spatial statistics, but the locations can in general consist of non-spatial predictor variables also called input features in machine learning. In particular, the locations can also contain the predictor variables Xi or a subset of them. Often, the covariance function is assumed to be second-order stationary, mean-square continuous, and parametrized of the form

c(s, s ) = σ2 1r( s s /ρ),

where r is an isotropic autocorrelation function with r(0) = 0, σ2 1 = V ar(b(s)), and ρ is a socalled range parameter that determines how fast the autocorrelation decays with distance. Examples of autocorrelation functions include the exponential function r( s s /ρ) =

exp( s s /ρ) and the Gaussian function r( s s /ρ) = exp ( s s /ρ)2 . The extension to more general covariance functions and also to multivariate Gaussian processes is straightforward, and our methodology presented in Section 3 does not rely on stationarity assumptions. For a stationary Gaussian process, we obtain the following covariance matrix

Ψ = ZΣZT + σ2In,

where Σ Rm m has entries (Σ)jk = σ2 1r(djk/ρ)

and djk = sj sk , j, k = 1, . . . , m.

2.2.3 Joint grouped random effects and Gaussian process models

Grouped random eﬀects and Gaussian processes can also be combined. For instance, such models are used for repeated measures and longitudinal data. One can assume that within groups, there is temporal and/or spatial dependence modeled by a Gaussian process. In a single-level grouped random eﬀect model, this means that every group contains a Gaussian process and the diﬀerent Gaussian processes are independent among each other. Alternatively, one can assume that there is a single global Gaussian process in combination with grouped random eﬀects, i.e. the same Gaussian process is related to all observations and it accounts, for instance, for spatial or temporal correlation among all observations.

3. Combining Gaussian Process and Mixed Eﬀects Models with Boosting

Our goal is to ﬁnd the joint minimizer

( ˆF( ), ˆθ) = argmin (F ( ),θ) (H,Θ) R(F( ), θ), (3)

where R(F( ), θ) is a risk functional deﬁned as

R(F( ), θ) : (F( ), θ) 7 L(y, F, θ) F =F (X), (4)

and L(y, F, θ) is a loss function corresponding to the negative log-likelihood given in (2). Note that R(F( ), θ) is determined by evaluating F( ) at X and then calculating L(y, F = F(X), θ). We recall that F( ) is a function in a function space H and θ Θ Rq

is a vector of all variance and covariance parameters. The risk functional R(F( ), θ) is, in general, inﬁnite-dimensional in its ﬁrst argument and ﬁnite-dimensional in its second argument. We propose to do the minimization in (3) using a boosting algorithm presented in the following.

3.1 Boosting for θ ﬁxed

We ﬁrst show how boosting (Freund and Schapire, 1996; Breiman, 1998; Friedman et al., 2000; Mason et al., 2000; Friedman, 2001; B uhlmann and Hothorn, 2007) works in our case when the covariance parameters θ are given. For ﬁxed θ, boosting ﬁnds a minimizer of the empirical risk functional R(F( ), θ) in a stagewise manner by sequentially adding an update fm( ) to the current estimate Fm 1( ):

Fm( ) = Fm 1( ) + fm( ), fm S, m = 1, . . . , M, (5)

where fm( ) is chosen in a way such that its addition results in the minimization of the risk. This minimization can usually not be done analytically and, consequently, an approximation is used. In most boosting algorithms, such an approximation is obtained using either a penalized functional ﬁrst-order or a functional second-order Taylor expansion of the risk functional around the current estimate Fm 1( ). This then corresponds to functional gradient descent or functional Newton steps. It is also possible to combine gradient and Newton steps by ﬁrst learning part of the parameters of the base learners using a gradient step and the remaining part using a Newton update (Friedman, 2001). See Sigrist (2021a) for more information on the distinction between gradient, Newton, and hybrid gradient-Newton boosting. In contrast to classical boosting algorithms, the risk functional in (4) is not a sum of n independent terms as all observations y are dependent conditional on X. In other words, we generally have only one independent multivariate sample instead of n independent samples. In gradient boosting, fm( ) is found as the least squares approximation to the vector obtained when evaluating the negative functional gradient, i.e., the negative Gˆateaux derivative, of R(F( ), θ) at (Fm 1( ), IXi( )), i = 1, . . . , n, where IXi( ) are indicator functions which equal 1 at Xi and 0 otherwise. Equivalently, one can show that fm( ) corresponds to a minimizer of a ﬁrst-order functional Taylor approximation of R(Fm 1( )+f( ), θ) around Fm 1( ) with an L2 penalty on f( ) evaluated at (Xi), i = 1, . . . , n; see, e.g., Sigrist (2021a) for more information. It is easily seen that the negative Gˆateaux derivative of R(F( ), θ) evaluated at (Fm 1( ), IXi( )), i = 1, . . . , n, is given by

F L(y, F, θ) F =Fm 1 = Ψ 1(y Fm 1).

This means that for θ ﬁxed, gradient boosting ﬁnds fm( ) as the least squares approximation

fm( ) = argmin f( ) S

Ψ 1(y Fm 1) f 2 , (6)

where f = (f(X1), . . . , f(Xn))T . In Newton boosting, fm( ) is found as the minimizer of a second-order functional Taylor approximation to R(Fm 1( ) + f( ), θ) around Fm 1( ). In our case, this gives

fm( ) = argmin f( ) S (y Fm 1 f)T Ψ 1 (y Fm 1 f) . (7)

As mentioned, there also exists a hybrid gradient-Newton boosting version which learns part of the parameters of the base learner fm( ) using a gradient step and the remaining part using a Newton step. In the following paragraph, we assume that the base learners can be written in the form

f( ) = h( ; α)T γ, h( ; α), γ RK, α RQ,

where α and γ denote parameters of the base learners and h( ; α) : Rp RK. For instance, for regression trees, α encodes split locations and variables, γ contains the terminal node

Gaussian Process Boosting

values, and h( ; α) is a function that maps predictor variables to terminal tree nodes. In this case, hybrid gradient-Newton boosting ﬁrst learns ˆα and ˆγ using a gradient boosting step given in (6) and sets αm = ˆα, and then updates γm = ˆγ using a Newton boosting step deﬁned in (7) with αm ﬁxed. For the latter Newton step, an explicit generalized least squares solution is obtained as

γm = h T αmΨ 1hαm 1 h T αmΨ 1(y Fm 1),

where hαm Rn K is the matrix with entries (hαm)ik = h(Xi; αm)k, i = 1, . . . , n, k = 1, . . . , K. We note that in Newton boosting as well as hybrid gradient-Newton boosting, the norm of fm( ), i.e., the step-length, does not depend on the scaling of the loss function, in particular not on σ2, because

argmin f( ) S (y Fm 1 f)T Ψ 1 (y Fm 1 f) = argmin f( ) S (y Fm 1 f)T Ψ 1 (y Fm 1 f) ,

and Ψ , which is deﬁned in Section 2.1, does not depend on σ2. Since gradients are scaledependent, this does not hold true for gradient boosting. It has been empirically observed (Friedman, 2001) that higher prediction accuracy can be obtained by damping the update in (5):

Fm( ) = Fm 1( ) + νfm( ), ν > 0, (8)

where ν is called the shrinkage parameter or learning rate. Further, as in ﬁnite-dimensional optimization, functional gradient descent can be accelerated using momentum. For instance, Biau et al. (2019) and Lu et al. (2020) propose to use Nesterov acceleration (Nesterov, 2004) for gradient boosting.

3.2 Gaussian process boosting

A straightforward approach for ﬁnding a joint minimizer in (3) would consist of iteratively ﬁrst doing one approximate functional gradient or Newton descent step for F( ), and then performing one ﬁrstor second-order optimization step for θ. Despite being attractive from a computational point of view, this has the following drawback. For ﬁnite samples, boosting tends to overﬁt, in particular for regression, and early stopping has to be applied as a form of regularization to prevent this. However, there is no guarantee that θ has converged to a minimum when early stopping is applied after a certain number of iterations. A possible solution to avoid this problem consists of doing coordinate descent, also called block descent, for both F( ) and θ, i.e., iteratively doing full optimization in both directions. But this has the drawback that it is computationally expensive as both F( ) and the covariance parameters θ need to be repeatedly learned. Our proposed solution presented in Algorithm 1 is to combine functional gradient or Newton boosting steps for F( ) with coordinate descent steps for θ. Speciﬁcally, in every iteration, we ﬁrst determine θm = argminθ Θ L(y, Fm 1, θ) and then update the ensemble of base learners using either a functional gradient descent step, a functional Newton step, or a combination of the two to obtain Fm( ) = Fm 1( ) + νfm( ). We thus avoid the above-mentioned overﬁtting problem while being computationally more eﬀective compared to doing coordinate descent in both directions. Note that Ψm in Algorithm 1 denotes the covariance matrix Ψ for the covariance parameters θm of iteration m. The coordinate descent step for ﬁnding θm = argminθ Θ L(y, Fm 1, θ) can be done using a ﬁrstor second-order method for convex optimization initialized with the covariance parameters θm 1 of the previous iteration. In doing so, we avoid the full re-estimation of

Algorithm 1: GPBoost: Gaussian Process Boosting

Input : Initial value θ0 Θ, learning rate ν > 0, number of boosting iterations M N, Boost Type {"gradient", "newton", "hybrid"}, Nesterov Accel {True, False}, and if Nesterov Accel==True momentum sequence µm (0, 1] Output: Predictor function ˆF( ) = FM( ) and covariance parameters ˆθ = θM

1: Initialize F0( ) = argminc R L(y, c 1, θ0)

2: for m = 1 to M do

3: Find θm = argminθ Θ L(y, Fm 1, θ) using a method for convex optimization initialized with θm 1 4: if Nesterov Accel==True then

5: Set Gm 1( ) = Fm 1( )

6: if m > 1 then

7: Update Fm 1( ) = Gm 1( ) + µm(Gm 1( ) Gm 2( ))

10: if Boost Type=="gradient" then

11: Find fm( ) = argminf( ) S Ψ 1 m (Fm 1 y) f 2

12: else if Boost Type=="newton" then

13: Find fm( ) = argminf( ) S (y Fm 1 f)T Ψ 1 m (y Fm 1 f)

14: else if Boost Type=="hybrid" then

15: Find αm = ˆα, (ˆα, ˆγ) = argmin(α,γ):f( )=h( ;α)T γ S Ψ 1 m (Fm 1 y) f 2

16: Calculate γm = h T αmΨ 1 m hαm 1 h T αmΨ 1 m (y Fm 1)

17: Set fm( ) = h( ; αm)T γm 18: end if

19: Update Fm( ) = Fm 1( ) + νfm( )

20: end for

the covariance parameters in every boosting iteration. Examples of optimization algorithms for determining θm include various forms of gradient descent and quasi-Newton methods such as Fisher scoring which is often called natural gradient descent in machine learning (Amari, 1998). For ﬁrstand second-order optimization methods, the gradient of L(y, F, θ) with respect to θ is required. This gradient is given by

2(y F)T Ψ 1 Ψ

θk Ψ 1(y F) + 1

, k = 1, . . . , q. (9)

In our software implementation (see Section 3.6), we reparametrize all parameters θk with positivity constraints, such as marginal variance and range parameters, on the log-scale log(θk) in order to constrain them to positive values during the numerical optimization. Further, there is an explicit solution for the error variance parameter:

n (y Fm 1)T Ψ 1 (y Fm 1) . (10)

For Gaussian processes, we have found in simulated experiments that when doing gradient descent, proﬁling out the error variance using the analytic formula in (10) increases con-

Gaussian Process Boosting

vergence speed (results not tabulated). For Fisher scoring, on the other hand, proﬁling out the error variance can reduce convergence speed considerably (results not tabulated). The Fisher information matrix I Rq q for Fisher scoring is given by

, 1 k, l q.

For the linear case, F(X) = XT β, asymptotic theory (Stein, 1999) suggests that if the smallest eigenvalue of the Fisher information I tends to inﬁnity as n , we can expect that I(ˆθ)1/2(ˆθ θ0) d N(0, I),

where θ0 denotes the population parameter and I(ˆθ)1/2 is a matrix square root. Based on this, one can construct approximate conﬁdence sets or intervals for θ. If the risk functional R(F( ), θ) = L(y, F, θ) F =F (X) is convex in its two arguments F( )

and θ and Θ is a convex set, then (3) is a convex optimization problem since H = span(S) is also convex. Such an algorithm with gradient or Newton steps in F( ) and coordinate descent in θ thus converges to a minimizer of R(F( ), θ) as long as the learning rate ν is not too large to avoid overshooting, i.e., that the risk increases when doing too large steps. The computational complexity of the GPBoost algorithm depends on the speciﬁc random eﬀects model. For Gaussian processes, computational costs are dominated by the learning of the covariance parameters and the calculation of derivatives with respect to F. Speciﬁcally, the computational time and space complexity are O(m3) and O(m2), respectively, of a single evaluation of the likelihood or its gradient using the Cholesky decomposition when not applying an approximation for large data. Compared to this, the learning of trees is faster in the gradient boosting version and, except for the leaf updates, also in the hybrid gradient-Newton version. In linear mixed eﬀects models, L(y, F, θ) is usually optimized by ﬁrst proﬁling out the ﬁxed eﬀect part and then optimizing over θ. In our case, this is not an option since there is no explicit solution for F( ) conditional on θ. Also, note that restricted maximum likelihood (REML) estimation is often used for linear mixed eﬀects models since otherwise covariance parameter estimates can be biased. This is, however, not applicable to our approach.

3.3 Out-of-sample learning for covariance parameters

It has recently been observed for both regression and classiﬁcation that state-of-the-art machine learning techniques such as neural networks, kernel machines, or boosting algorithms can achieve zero training loss and interpolate the training data while at the same time having excellent generalization properties (Zhang et al., 2017; Wyner et al., 2017; Belkin et al., 2018, 2019; Bartlett et al., 2020). In line with this, we ﬁnd in our simulated experiments in Section 4 that estimates of the error variance σ2 are often too small also when doing early stopping by monitoring a validation error. A way to alleviate this potential variance parameter bias problem is to estimate the covariance parameters using out-of-sample validation data obtained by applying crossvalidation or by partitioning the data into two disjoint training and validation sets. To avoid that the predictor function and the covariance parameters θ are only learned on a fraction of the full data, we propose a two-step approach presented in the GPBoost OOS Algorithm 2. In brief, the GPBoost OOS algorithm ﬁrst runs the GPBoost algorithm on subsamples of the data and obtains predictions for the predictor function on the leftout validation data. The covariance parameters are then estimated on the out-of-sample data using the predicted predictor function. Finally, the GPBoost algorithm is run a second time on the full data without estimating the covariance parameters. When k-fold

cross-validation is used in steps 1.-3. of Algorithm 2, both the predictor function and the covariance parameters are learned using the full data.

Algorithm 2: GPBoost OOS: Gaussian Process Boosting with Out-Of-Sample covariance parameter estimation

Input : Initial value θ0 Θ, learning rate ν > 0, number of boosting iterations M N, Boost Type {"gradient", "newton", "hybrid"}, Nesterov Accel {True, False}, and if Nesterov Accel==True momentum sequence µm (0, 1] Output: Predictor function ˆF( ) and covariance parameters ˆθ

1: Partition the data into training and validation sets, e.g., using k-fold cross-validation or by splitting the data into two disjoint sets

2: Run the GPBoost algorithm on the training data and generate predictions for the predictor function on the validation data ˆFval 3: Find ˆθ = argminθ Θ L(yval, ˆFval, θ) using the validation data with response yval 4: Run the GPBoost algorithm on the full data while holding the covariance parameters θ ﬁxed at ˆθ, i.e., by skipping line 3 in Algorithm 1, to obtain ˆF( )

3.4 Computationally eﬃcient learning for large data

For computationally eﬃcient learning of trees, several approaches exist so that computations scale well to large data (Chen and Guestrin, 2016; Ke et al., 2017; Prokhorenkova et al., 2018). In this article, we use the approach of Ke et al. (2017). Concerning covariance parameters, we adopt the following solutions for reducing computational costs. If the random eﬀects b consist of only grouped random eﬀects, Ψ is usually a sparse matrix, and computations can be done eﬃciently using sparse matrix algebra. Further, we can use the Sherman-Morrison-Woodbury formula

ZΣZT + σ2In 1 = σ 2In σ 2Z σ2Σ 1 + ZT Z 1 ZT (11)

for calculating gradients with respect to F and θ and for evaluating the log-likelihood since the dimension of the random eﬀects m is typically smaller than the number of samples n. If b contains a Gaussian process with a non-sparse covariance matrix, both the computational cost and the required memory do not scale well in the number of observed locations m as standard approaches relying on the Cholesky factorization require O(m3) calculations and O(m2) memory storage. In this case, one has to use some approximation to make calculations feasible. We choose to use Vecchia s approximation (Vecchia, 1988; Datta et al., 2016; Katzfuss and Guinness, 2021; Finley et al., 2019), also denoted as nearest-neighbor Gaussian process (NNGP) model (Datta et al., 2016), as it is very accurate for spatial data, embarrassingly parallel, and it has the desirable property that maximizing it corresponds to solving a set of unbiased estimating equations (Guinness, 2018). This has led some authors in the spatial statistics community to declare that [a]mong the sea of Gaussian process approximations proposed over the past several decades, Vecchia s approximation has emerged as a leader (Guinness, 2021). However, we note that this is not the only large-data Gaussian process approximation that can be applied to the GPBoost algorithm. Other potential Gaussian process approximations include Snelson and Ghahramani (2006); Quinonero-Candela et al. (2007); Cunningham et al. (2008); Titsias (2009); Hensman et al.

Gaussian Process Boosting

(2013); Wilson and Nickisch (2015); Gardner et al. (2018); see also the review of Liu et al. (2020). Intuitively, the idea of Vecchia s approximation is to approximate a Cholesky factor of the precision matrix using a sparse matrix and thus to obtain a sparse approximate precision matrix. In the following, we brieﬂy review how this is obtained in our case and then show how gradients of the negative log-likelihood given in (9) can be calculated eﬃciently. To the best of our knowledge, the latter result is novel.

3.4.1 Vecchia approximation for the response variable y

Vecchia approximations can be seen as a special form of composite likelihood methods (Varin et al., 2011). In our case, the likelihood p(y|F, θ) is approximated as

p(y|F, θ) =

i=1 p(yi|(y1, . . . , yi 1), F, θ)

i=1 p(yi|y N(i), F, θ),

where y N(i) are subsets of the conditioning sets (y1, . . . , yi 1), and N(i) denotes the corresponding subsets of indices. As is commonly done, we choose N(i) as the indices of the mv nearest neighbors of si among s1, . . . , si 1 if i > mv + 1, and, in the case i mv + 1, N(i) equals (1, . . . , i 1). By standard arguments for conditional Gaussian distributions, we have

p(yi|y N(i), F, θ) = N yi | Fi + Ai y N(i) FN(i) , Di ,

where N(x|µ, Ξ) denotes a Gaussian density with mean vector µ and covariance matrix Ξ evaluated at x, and the matrices Ai R1 |N(i)| and Di R, where |N(i)| denotes the size of the set N(i), are given by

ZΣZT + σ2In

Di = ZΣZT + σ2In

i,i Ai ZΣZT

N(i),i , (13)

where Σ = (c(sl, sk))l,k, 1 l, k n is the covariance matrix of b, c( , ) is the covariance function, Mi,N(i) denotes the sub-matrix of a matrix M consisting of row i and columns N(i), and MN(i) denotes the sub-matrix of a matrix M consisting of rows N(i) and columns N(i). Note that if Z is a diagonal matrix, we have ZΣZT

N(i) = ZN(i)ΣN(i)ZN(i) = ΣN(i) zz T

where z is the diagonal of Z and denotes the Hadamard product (Dambon et al., 2021). Using this relationship can lead to a reduction in computational cost, in particular for random coeﬃcient models. We further denote by B the lower triangular matrix with 1 s on the diagonal, oﬀdiagonal entries (B)i,N(i) = Ai, (14)

and 0 s otherwise, and by D a diagonal matrix with Di on the diagonal. We then obtain the following approximate distribution

y approx N F(X), Ψ , Ψ = B 1DB T , (15)

and the corresponding precision matrix is given by

Ψ 1 = BT D 1B, (16)

where B and Ψ 1 are sparse.

Concerning computational complexity, the main burden is the calculation of Cholesky factors of Cov(y N(i)) = ZΣZT + σ2In

N(i). Calculating a Vecchia approximation has O(nm3 v) computational cost and requires O(nmv) memory storage. Thus, for given mv, both the computational time and the memory storage grow linearly. Concerning the choice of the numbers of neighbors mv, Datta et al. (2016) report that usually a small value of [mv] between 10 and 15 produces performance at par with a full geostatistical model .

3.4.2 Efficient calculation of the gradient and Fisher information for the Vecchia approximation

In the following, we show how the gradient and the Fisher information of the approximate log-likelihood of the Vecchia approximation given in (15) can be calculated eﬃciently. To the best of our knowledge, the following results are novel. Guinness (2021) also presents a way for computing the gradient and Fisher information for the Vecchia approximation. However, Guinness (2021) uses a diﬀerent representation of the approximate likelihood by writing conditional densities in (12) as ratios of joint and marginal densities and, in doing so, obtains a diﬀerent way for calculating the gradient and Fisher information compared to our result. Guinness (2021) motivates his approach by claiming that for calculating the

gradient in (9), [n]ot only is h Ψ θk

i too large to store in memory, the covariances [Ψ] are not easily computable, nor are their partial derivatives . The following Proposition 1 and its proof show that Ψ

θk does not need to be stored in memory, and neither Ψ nor its partial derivatives need to be computed. In the approach of Guinness (2021) for calculating the gradient, the computational complexity is dominated by the need to calculate two Cholesky factorizations of matrices of sizes R|N(i)| |N(i)| and R(|N(i)|+1) (|N(i)|+1) for every data point i, where we recall that |N(i)| denotes the number of neighbors of sample i. In contrast, in our approach in Proposition 1 below, only one matrix of size R|N(i)| |N(i)| needs to be factorized for every sample i. This means that our approach for calculating the gradient has approximately only half the computational cost compared to the one of Guinness (2021).

Proposition 1 The gradient of the negative log-likelihood L(y, F, θ) for the Vecchia approximation given in (15) can be calculated as

2u T k u u T D

θk , 1 k q,

u = D 1B(y F) and uk = B

θk (y F), (17)

Gaussian Process Boosting

and B θk are lower triangular and D

θk diagonal matrices with non-zero entries given by

i,N(i) = Ai

ZΣZT + σ2In

ZΣZT + σ2In

ZΣZT + σ2In

for 1 < k q, and for k = 1, the non-zero entries of B θk and D

i,N(i) = ZΣZT

ZΣZT + σ2In

A proof can be found in Appendix B. As indicated above, the computational costs for calculating the gradient are O(nm3 v). The Fisher information for the Vecchia approximation in (15) can be calculated using the following result.

Proposition 2 The Fisher information for the Vecchia approximation matrix in (15) has entries

i=1 Di 2 Di

θl , 1 k q, (18)

where B θk and D

θk are lower triangular and diagonal matrices deﬁned in Proposition 1

A proof can be found in Appendix B. This Fisher information can be used for ﬁnding a maximum of the (approximate) likelihood using Fisher scoring. Concerning an approximate covariance matrix for ˆθ using the asymptotic result mentioned in Section 3.2, we note that since the Vecchia approximation results in a misspeciﬁed model, the Fisher information matrix needs to be replaced by the Godambe information matrix (Godambe, 1960) G = HI 1H, where H is the negative expected Hessian of the log-likelihood.

3.5 Prediction

Let yp Rnp denote the random variables for which predictions should be made. We have y yp

= F(X) F(Xp)

+ (Z, 0n mp) Zp

ZΣZT + σ2In Z(Σ, Σop)ZT p

Zp(Σ, Σop)T ZT Zp

Σ Σop ΣT op Σp

ZT p + σ2Inp

where bp Rmp is a vector of mp random eﬀects, for which no data has been observed in y, (Z, 0n mp) Rn (m+mp), 0n mp Rn mp is a matrix of zeros, the matrix Zp Rnp (m+mp) relates the vector of observed and new random eﬀects (b T , b T p )T Rm+mp

to yp, (Σ, Σop) Rm (m+mp), Σop = Cov(b, bp), Σp = Cov(bp), and Xp Rnp p is the predictor variable matrix of the predictions. Note that yp can be related to both existing random eﬀects b, for which data y has been observed, and also new, unobserved random eﬀects bp. This distinction is particularly relevant for grouped random eﬀects, where predictions can be made for samples of groups for which data has already been observed in y, or also for new groups for which no data has been observed in y. Further, this can also be useful for Gaussian process models for distinguishing when predictions should be made for yp, or bp, at new locations, or when b should be predicted at observed locations. From (19), it follows that the conditional distribution yp|y is given by

yp|y N (µp, Ξp) ,

µp =F(Xp) + Zp(Σ, Σop)T ZT ZΣZT + σ2In 1 (y F(X))

Σ Σop ΣT op Σp

ZT p + σ2Inp Zp(Σ, Σop)T ZT ZΣZT + σ2In 1 Z(Σ, Σop)ZT p . (20)

Depending on the application, if n m, the above quantities can be more eﬃciently calculated using the Sherman-Morrison-Woodbury formula given in (11). Further, pre-

dictions for the latent b, bp, or F(Xp) + Zp

can be done analogously with minor

modiﬁcations, e.g., dropping the error variance term σ2Inp from the covariance matrix in (20).

3.5.1 Prediction using the Vecchia approximation

Similarly as for parameter estimation, Vecchia approximations can also be used for making predictions. Speciﬁcally, predictions can be obtained by applying a Vecchia approximation to the joint response vector of observed and prediction locations. When doing so, one has to choose an ordering among the joint set of observed and predicted locations. We assume that either the observed or the prediction locations appear ﬁrst in the ordering of the response variable. The former has the advantage that the nearest neighbors found for estimation can be reused and that the predictive distributions have the simple form given below in (21). On the other hand, if prediction locations appear ﬁrst in the ordering, the approximations of predictive distributions are generally more accurate. See Katzfuss et al. (2020) for a comparison of diﬀerent approaches for making predictions with Vecchia approximations.

Proposition 3 Assume that prediction are made at np locations sp,1, . . . , sp,np with predictor variable data Xp. When applying the Vecchia approximation in (15) to the response vector (y, yp)T with the observed response y appearing ﬁrst in the ordering, the conditional distribution yp|y is given by

yp|y N (µp, Ξp) ,

where µp =F(Xp) B 1 p Bpo (y F(X)) ,

Ξp =B 1 p Dp B T p , (21)

and Bpo Rnp n, Bp Rnp np, Dp 1 Rnp np are the following submatrices of the Vecchia approximated precision matrix Cov (y, yp)T 1:

Cov (y, yp)T 1 = B 0 Bpo Bp

T D 1 0 0 Dp 1

Gaussian Process Boosting

and B and D are deﬁned in (13) and (14).

A proof can be found in Appendix B. Note that Dp is a diagonal matrix, Bp is a lower triangular matrix with 1 s on the diagonal and non-zero oﬀ-diagonal entries corresponding to the nearest neighbors of the prediction locations among the prediction locations themselves sp,1, . . . , sp,np, and Bpo has non-zero entries corresponding to the nearest neighbors of the prediction locations among the observed locations s1, . . . , sn. If only univariate predictive distributions are of interest, computational costs can be additionally reduced by restricting that one conditions on observed locations only in (12). The latter means that for every prediction location sp,i, one conditions only on observed data y N(i) where N(i) denotes the set of nearest neighbors for location sp,i. In this case, Bp is an identity matrix and the predictive covariance matrix Ξp is a diagonal matrix. The latter can be a drawback if multivariate predictive distributions are required. When prediction locations appear ﬁrst in the ordering of the response variable, predictions can be obtained as follows.

Proposition 4 Assume that prediction are made at np locations sp,1, . . . , sp,np with predictor variable data Xp. When applying the Vecchia approximation in (15) to the response vector (yp, y)T with the predicted response yp appearing ﬁrst in the ordering, the conditional distribution yp|y is given by

yp|y N (µp, Ξp) ,

µp =F(Xp) BT p Dp 1Bp + BT op Do 1Bop 1 BT op Do 1Bo (y F(X)) ,

Ξp = BT p Dp 1Bp + BT op Do 1Bop 1 , (23)

where Bo, Do Rn n, Bop Rn np, Bp, Dp Rnp np, Dp 1 Rnp np are the following submatrices of the Vecchia approximated precision matrix Cov (yp, y)T 1:

Cov (yp, y)T 1 = Bp 0 Bop Bo

T Dp 1 0 0 Do 1

Bp 0 Bop Bo

A proof can be found in Appendix B.

3.6 Software implementation

The GPBoost algorithm is implemented in the GPBoost library written in C++ with a C application programming interface (API) and corresponding Python and R packages. See https://github.com/fabsig/GPBoost for more information. For linear algebra calculations, we rely on the Eigen library (Guennebaud et al., 2010). Sparse matrix algebra is used, in particular for calculating Cholesky decompositions, whenever covariance matrices are sparse, e.g., in the case of grouped random eﬀects. In addition, to speed up computations for solving sparse linear triangular equation systems where the right-hand side is also sparse, we use the function cs spsolve from the CSparse library (Davis, 2005) where the non-zero entries of the solutions are determined using a depth-ﬁrst search algorithm. Further, multi-processor parallelization is done using Open MP. For the tree-boosting part, in particular the tree growing algorithm, we use the Light GBM library (Ke et al., 2017). Note that the GPBoost library allows for modeling Gaussian processes, grouped random eﬀects including hierarchically nested and crossed ones, random coeﬃcients, and combinations of the former.

4. Simulated Experiments

In the following, we use simulation to investigate the prediction accuracy as well as the properties of the covariances parameter and predictor function estimates of the GPBoost algorithm.

4.1 Simulation setting

We simulate data from the model given in (1). For the random eﬀects part Zb, we use both a grouped random eﬀects model and a spatial Gaussian process model. The sample size is n = 5000 for the grouped data and n = 500 for the spatial data. The reason for using a smaller sample size for the spatial data is that this allows us to do all calculations exactly without relying on an approximation for large data. In Section 3, we show how learning can be done for large data, and we use this in the application in Section 5. For simulating grouped data, we use a single-level grouped random eﬀects model with 10 samples per group, i.e., m = 500 diﬀerent groups. In other words, there is a single high-cardinality categorical variable with 500 diﬀerent categories. For simulating spatial data, we use a spatial Gaussian process model with an exponential covariance function

c(s, s ) = σ2 1 exp( s s /ρ),

where the locations s are in [0, 1]2 and ρ = 0.1. The marginal variance in both models is set to σ2 1 = 1, and the error variance equals σ2 = 1 such that the signal-to-noise ratio between the random eﬀects Zb and the error term ϵ is 1. Concerning the ﬁxed eﬀects predictor function F( ) and the predictor variables X, we use the following diﬀerent speciﬁcations:

F(x) = C (2x1 + x2 2 + 4 1{x3>0} + 2 log(|x1|)x3), x = (x1, . . . , x9)T , x N(0, I9), ( hajjem )

F(x) = C tan 1 x2x3 1 1 x2x4 x1

, x = (x1, x2, x3, x4)T , ( friedman3 )

x1 Unif(0, 100), x2 Unif(40π, 560π), x3 Unif(0, 1), x4 Unif(1, 11),

F(x) = C (1 + x1 + x2), x = (x1, x2)T , x1, x2 iid Unif(0, 1). ( linear )

The function hajjem has been used in Hajjem et al. (2014) to compare non-parametric mixed eﬀects models, and the function friedman3 was ﬁrst used in Friedman (1991) and has since then often been used to compare non-parametric regression models. We also include a linear function to investigate how our approach compares to a linear mixed eﬀects model when the data generating process is linear. The constant C is chosen such that the variance of F(x) equals approximately 1, i.e., that F(x) has the same signal strength as the random eﬀects part. Predictor variable data is simulated independently from the random eﬀects. We simulate 100 times both a training data set of size n and two diﬀerent test data sets each also of size n. Learning and selection of tuning parameters are done on the training data, and evaluation is done on the test data for all models considered. The two test data sets, denoted brieﬂy as interpolation and extrapolation test sets, are generated as follows in every simulation run. For the grouped data, the interpolation test data set consists of random eﬀects for the same groups as in the training data, and the extrapolation test data contains m independent random eﬀects for new groups that have not been observed in the training data. For the spatial data, the training data locations are sampled uniformly from [0, 1]2 excluding [0.5, 1]2, the interpolation test data set is obtained by also simulating locations uniformly in the same area, and the extrapolation test data contains locations sampled uniformly from [0.5, 1]2. Predictions

Gaussian Process Boosting

for the extrapolation test data are thus to some degree extrapolations. Figure 1 illustrates this.

Figure 1: Example of locations for training and test data for the spatial data. Test and Test ext refers to locations of the interpolation and extrapolation test data sets, respectively. The black crosses show examples of locations for which predictions of sums are made.

4.2 Methods considered

We compare the GPBoost algorithm to the following alternative approaches: linear grouped mixed eﬀects models ( Linear ME ) and linear Gaussian process models ( Linear GP ) with F(X) = XT β and an exponential covariance function, mixed-eﬀects random forest ( MERF ) (Hajjem et al., 2014), RE-EM trees ( REEMtree ) (Sela and Simonoﬀ, 2012), model-based boosting ( mboost ) (Hothorn et al., 2010), independent gradient boosting with a square loss ( LSBoost ), and the boosting approach for categorical predictor variables of Prokhorenkova et al. (2018) ( Cat Boost ). Besides the GPBoost algorithm, we apply the GPBoost OOS algorithm where covariance parameters are estimated using 4-fold cross-validation. For the latter cross-validation, sampling of data indices is done independently of the grouping variable and the spatial coordinates. For the spatial data, we additionally consider a two-step approach ( Two Step ) where F( ) and θ in (1) are estimated separately in an iterative manner instead of jointly as in the GPBoost algorithm. Speciﬁcally, we ﬁrst apply independent gradient boosting using the predictor variables X and then ﬁt a zero-mean spatial Gaussian process to the residuals of the ﬁrst step. For all boosting algorithms, we use gradient boosting without Nesterov acceleration and trees as base learners, except for the grouped and spatial random eﬀects in mboost where Ridge regression and splines are used. In independent gradient boosting with a square loss, the spatial locations and the categorical grouping variable are included as additional predictor variables in the predictor function F( ). In doing so, we consider the grouping variable as a categorical variable and use the approach of Light GBM to handle categorical variables. Considering the grouping variable as a continuous variable or using dummy variables leads to worse results (results

not tabulated). The MERF, REEMtree, and Cat Boost algorithms are only used for the grouped data and not for the spatial data. Learning and prediction with the GPBoost and GPBoost OOS algorithms, the linear grouped random eﬀects and Gaussian process models, and independent boosting with a square loss (LSBoost) is done using the GPBoost library version 0.7.8 compiled with the MSVC compiler version 19.24.28315.0 and Open MP version 2.0.2 For the linear mixed eﬀects and Gaussian process models, the GPBoost algorithm, and the GPBoost OOS algorithm, optima for covariance parameters θ are found using Nesterov accelerated gradient descent (Nesterov, 2004). For the mboost algorithm, we use the mboost R package (Hofner et al., 2014) version 2.9-2 and model spatial eﬀects using bivariate P-spline base learner (bspatial with df=6) and grouped random eﬀects using random eﬀects base learners (brandom with df=4). All other predictor variables are modeled using trees as base learners. For the MERF algorithm, we use the merf Python package version 0.33. The number of iterations of the MERF algorithm is set to 100. Increasing this value does not change our ﬁndings (results not tabulated). However, we note that we often do not observe convergence of the MERF algorithm, no matter how long we let it run.4 For the REEMtree algorithm, we use the REEMtree R package version 0.90.3. For the Cat Boost algorithm, we use the Cat Boost library version 1.0.6. All calculations are done on a laptop with a 2.9 GHz quad-core processor and 16 GB of random-access memory (RAM).

4.3 Evaluation criteria

We measure prediction accuracy using the the root mean square error (RMSE)

i=1 (yp,i µp,i)2,

where yp,i are test response variables, are µp,i predictive means, and np is the number of test samples. In addition, we analyze the accuracy of probabilistic predictions for the spatial data. As is commonly done in spatial statistics, probabilistic predictions are evaluated using the continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007) given by Z Fp,i(y) 1{yp,i y} 2 dy

for one sample i, where Fp,i is the predictive cumulative distribution function (CDF). Intuitively, the CRPS corresponds to the mean square error between the predictive CDF Fp,i and the empirical CDF 1{yp,i y}. The CRPS is a proper scoring rule (Gneiting and Raftery, 2007) that measures both calibration and sharpness of predictive distributions; see Gneiting et al. (2007) for more information. In our case, the predictive distributions

2. When using another compiler such as a GNU compiler which supports a higher version of Open MP, computations for the random eﬀects part can become considerably faster (e.g., approximately twice as fast for the Gaussian process model when using Open MP 4.5). However, the tree-boosting part of Light GBM is slower when using a GNU compiler. 3. We have also tried the merf Python package version 1.0. While the results are overall very similar, the estimates of the error variance σ2 are highly biased when using version 1.0. For this reason, we use version 0.3. 4. This is an observation that has also been made by the creator of the MERF package in a blog post; see https://towardsdatascience.com/mixed-eﬀects-random-forests-6ecbb85cb177 (retrieved on August 22, 2022).

Gaussian Process Boosting

are Gaussian distributions, and the average CRPS can be calculated explicitly as

1 π 2φ yp,i µp,i

2Φ yp,i µp,i

where µp,i and σ2 p,i are predictive means and variances, and φ and Φ denote the probability density function and CDF of a standard Gaussian variable. Predictive variances σ2 p,i are obtained as described in Section 3.5 for the Gaussian process-based models. For the deterministic independent boosting approaches (mboost and LSBoost), we use sample variances of the residuals on the training data as predictive variances. For the spatial data, we additionally evaluate the accuracy when predicting sums of n = 20 observations yp = (yp,1, . . . , yp,n )T . For instance, such predictions are required in meteorology for predicting the total precipitation over a catchment area or in real estate for predicting the total value of a portfolio of objects. Note that if a multivariate predictive distribution for yp is given by yp|y N (µp, Ξp), the predictive distribution of the sum is obtained as 1T yp y N 1T µp, 1T Ξp1 ,

where 1 is a vector of ones 1 = (1, . . . , 1)T . In total, we predict 50 times sums of diﬀerent samples yp R20 in every simulation run. Half of these 50 samples for predicting sums are from the interpolation and the extrapolation test data each. Speciﬁcally, for both the interpolation and the extrapolation test data, we randomly select 25 times disjoint sets of 20 observations that are close together in space. We obtain these sets of 20 close-by observations by randomly selecting a location and determining its 19 nearest neighbors and then iteratively continuing in the same manner with the remaining locations until there are 25 disjoints set for both the interpolation and the extrapolation test data. This is illustrated in Figure 1. Further, we also measure the accuracy to learn the predictor function F( ) and predict the random eﬀects b. This is done by evaluating predictions for F( ) and b on the interpolation test data sets. In addition, we evaluate the accuracy of estimates for the variance and covariance parameters θ. For the independent deterministic boosting approaches which model spatial or grouped random eﬀects using deterministic base learners (LSBoost, mboost, and Cat Boost), no estimates for covariance parameters, the predictor function F( ), and the random eﬀects b can be obtained. The exception are grouped random eﬀects where mboost reports estimated random eﬀects b. This then also allows for obtaining an estimate for F( ) by assigning the test data a certain group from the training data and subtracting the value of the estimated random eﬀect for this group from the obtained predictions. In the same way, we also obtain response variable predictions for data with new groups for mboost.

4.4 Choice of tuning parameters

Tuning parameters are chosen using 4-fold cross-validation on the training data in every simulation run by selecting the combination of tuning parameters from a full grid that minimizes the average mean square error on the validation data. For every boosting algorithm (LSBoost, mboost, Cat Boost, GPBoost, and GPBoost OOS), we consider the following candidate tuning parameters: the number of boosting iterations M {1, . . . , 1000}, the learning rate ν {0.1, 0.05, 0.01}, the maximal tree depth {1, 5, 10}, and the minimal number of samples per leaf {1, 10, 100}. For the MERF algorithm, we choose the proportion of variables considered for making splits {0.5, 0.75, 1}. As in Hajjem et al. (2014), we do not impose a maximal tree depth limit and set the number of trees to 300. Note that the MERF algorithm implemented in the MERF package is very slow (see the results

below) and choosing tuning parameters is thus computationally demanding. Generally, for random forests, the choice of tuning parameters is less important compared to boosting. For the REEMtree package, which relies on the rpart R package, trees are cost-complexity pruned and the amount of pruning is chosen using 10-fold cross-validation on the training data.

4.5 Results

The results for the hajjem predictor function are reported in Table 2 for the grouped data and in Table 3 for the spatial data. The results for the other two predictor functions are reported in Appendix C. In the tables, we report average values of the prediction accuracy measures over the simulation runs as well as corresponding standard errors. Further, we calculate p-values of paired two-sided t-tests comparing the GPBoost algorithm to the other approaches. Note that sometimes diﬀerences in accuracy measures are relatively small compared to standard errors, but p-values are highly signiﬁcant nonetheless. The reason for this is that standard errors do not reﬂect correlation among methods over diﬀerent simulation runs but paired t-tests do.5 For the covariance parameters, we report the RMSE and bias over the simulation runs. We also report the wall-clock time in seconds for training the diﬀerent models. Note that the wall-clock times reported depend on the chosen tuning parameters. In particular, lower learning rates ν usually imply higher computational times for all boosting algorithms including the GPBoost algorithm.

GPBoost Linear ME LSBoost Cat Boost mboost MERF REEMtree GPBOOS RMSE 1.100 1.342 1.156 1.183 1.335 1.104 1.171 1.102 (SE) (0.00144) (0.00181) (0.00205) (0.00184) (0.00215) (0.00137) (0.00163) (0.00151) [p-val] [4.66e-129] [2.85e-59] [3.45e-88] [1.4e-112] [1.31e-09] [2.02e-83] [8.06e-06] RMSE new 1.458 1.635 1.493 1.464 1.520 1.460 1.506 1.459 (SE) (0.00292) (0.00282) (0.00294) (0.00313) (0.00285) (0.00292) (0.00298) (0.00293) [p-val] [6.43e-122] [1.5e-51] [2.09e-05] [8.47e-86] [6.08e-06] [2.45e-78] [6.45e-05] RMSE F 0.3370 0.8141 0.5495 0.3494 0.5111 0.3411 (SE) (0.00243) (0.00197) (0.0024) (0.00211) (0.00227) (0.00253) [p-val] [2.37e-139] [1.79e-102] [4.13e-14] [1.05e-91] [2.59e-06] RMSE b 0.3193 0.3793 0.6934 0.3244 0.3363 0.3197 (SE) (0.00109) (0.00136) (0.00221) (0.00137) (0.0012) (0.00114) [p-val] [8.1e-82] [4.01e-121] [1.55e-08] [5.04e-46] [0.0611] RMSE σ2 1 0.07466 0.07812 0.07484 0.07402 0.07531 Bias σ2 1 0.005512 0.005330 0.009059 0.003182 0.005787 RMSE σ2 0.1703 0.6583 0.1361 0.07989 0.6286 0.1183 0.1477 0.1375 Bias σ2 -0.1672 0.6565 -0.1172 0.07059 0.6266 0.1151 0.1436 0.1348 Time (s) 0.9356 0.05323 0.3289 3.325 14.00 416.7 1.349 3.077

Table 2: Results of the simulated experiments for the grouped data and the predictor function F = hajjem . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the test data with new groups are denoted by new . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

5. An example of this is the RMSE new in Table 2 when comparing GPBoost and Cat Boost. The diﬀerence in the average RMSE new is relatively small compared to the corresponding standard errors, but the p-value is very small. In this case, GPBoost has a lower RMSE new than Cat Boost in approximately 80 of 100 simulation runs, and the correlation between RMSE new of the two methods is approximately 0.9 (results not tabulated).

Gaussian Process Boosting

Considering the results for the grouped data reported in Table 2, we ﬁnd that the GPBoost algorithm signiﬁcantly outperforms all other methods in all prediction accuracy measures including the ones concerning the learning, or prediction, of both the predictor function F( ) and the random eﬀects b. Apart from the GPBoost OOS algorithm, the MERF algorithm has the second highest prediction accuracy. Gradient boosting with a square loss including the grouping variable as a categorical variable using the approach of Light GBM (LSBoost), Cat Boost, and mboost all have signiﬁcantly lower prediction accuracy compared to the GPBoost algorithm for data with both existing groups ( RMSE ) and new groups ( RMSE new ). LSBoost, i.e., Light GBM, performs better than Cat Boost for predicting data with existing groups and the opposite holds for new groups. Apparently, LSBoost can learn group eﬀects, i.e., the eﬀect of a high-cardinality categorical variable, better than Cat Boost, but Cat Boost learns the predictor function F( ) for the non-categorical predictor variables X better than LSBoost. The reason for the former is unclear to us. The likely reason for the latter ﬁnding is that LSBoost has higher variance and does some overﬁtting due to the highcardinality categorical variable, and Cat Boost mitigates this due to its ordered boosting mode. Not surprisingly, a linear mixed eﬀects model (Linear ME) performs considerably worse than the GPBoost algorithm in all prediction accuracy measures and the estimation of the predictor function F( ). Concerning variance parameter estimates, we observe no major diﬀerences among the methods in the RMSE of the variance of the random eﬀects σ2 1. For the error variance σ2, we observe large diﬀerences, though. In particular, all methods have biased estimates for the error variance parameter. As expected, the linear model has an upward bias which compensates for the misspeciﬁcation of F( ), and the GPBoost algorithm has a downward bias. The latter ﬁnding is in line with the recent observation that state-of-the-art machine learning methods can interpolate the training data while at the same time having a low generalization error as discussed in Section 3.3. When estimating the covariance parameters on out-of-sample data using the GPBoost OOS algorithm, the RMSE of σ2 is smaller and there is no downward bias anymore. Concerning computational time, not surprisingly, the linear model has the lowest computational time. LSBoost is approximately three times faster than GPBoost, and Cat Boost is approximately three times slower than GPBoost. Further, the GPBoost algorithm runs approximately one order of magnitude faster than mboost and several orders of magnitude faster than the MERF algorithm. For the friedman3 predictor function for the grouped data, we ﬁnd qualitatively similar results as for the hajjem predictor function; see Table 8 in the appendix. Further, as expected, a linear mixed eﬀects model performs best in the case where the true predictor function is linear; see Table 9 in the appendix. The diﬀerences between the linear model and the GPBoost algorithm are of small, albeit signiﬁcant, magnitude. Except for the linear model, the GPBoost algorithm signiﬁcantly outperforms all other approaches when F( ) is a linear function. The MERF algorithm has a relatively high RMSE for the predictor function F( ) which also translates to low prediction accuracy for the response variable for data with both existing groups ( RMSE ) and new groups ( RMSE new ). For the friedman3 and linear predictor functions, we again observe that the GPBoost OOS algorithm has a lower RMSE for the error variance σ2 compared to the GPBoost algorithm and no downward bias. We next discuss the results for the spatial data reported in Table 3. We ﬁnd that the GPBoost algorithm has higher prediction accuracy compared to all alternative approaches in all measures with most of the diﬀerences being highly signiﬁcant. In particular, the GPBoost algorithm has higher prediction accuracy in terms of the RMSE and CRPS compared to both a linear Gaussian process model with F(X) = XT β and independent boosting including the coordinates in the predictor function F( ). A two-step approach performs better than a linear model and independent boosting in the majority of prediction accuracy

GPBoost Linear GP LSBoost mboost Two Step GPBOOS RMSE 1.374 1.466 1.474 1.479 1.406 1.372 (SE) (0.00574) (0.00567) (0.0067) (0.00633) (0.00625) (0.00556) [p-val] [1.83e-45] [2.33e-35] [8.74e-42] [1.87e-12] [0.329] CRPS 0.8011 0.8183 0.8807 0.8280 0.8043 0.7683 (SE) (0.00564) (0.00295) (0.00673) (0.00354) (0.00448) (0.00305) [p-val] [0.000359] [1.09e-22] [5.02e-07] [0.484] [1.23e-10] RMSE ext 1.519 1.599 1.611 1.899 1.536 1.517 (SE) (0.00978) (0.00896) (0.0164) (0.0377) (0.00963) (0.00928) [p-val] [2.8e-33] [3.91e-11] [3.08e-16] [5.15e-05] [0.551] CRPS ext 0.8689 0.8960 0.9807 1.101 0.8808 0.8526 (SE) (0.00686) (0.00519) (0.0131) (0.0261) (0.00626) (0.00535) [p-val] [7.15e-11] [1.92e-17] [1.32e-13] [0.000694] [9.99e-06] RMSE sum 11.70 11.90 13.80 18.44 11.90 11.62 (SE) (0.211) (0.204) (0.328) (0.697) (0.206) (0.206) [p-val] [0.00573] [1.14e-12] [4.05e-15] [0.00416] [0.151] CRPS sum 6.468 6.509 8.508 10.45 6.710 6.331 (SE) (0.115) (0.103) (0.22) (0.408) (0.117) (0.102) [p-val] [0.403] [5.92e-19] [1.69e-15] [2.69e-06] [0.000978] RMSE F 0.7125 0.8741 0.7538 0.7538 0.7102 (SE) (0.00869) (0.00692) (0.00854) (0.00854) (0.00817) [p-val] [1.17e-49] [4.4e-13] [4.4e-13] [0.577] RMSE b 0.7080 0.7091 0.7285 0.7285 0.7034 (SE) (0.00729) (0.0065) (0.00714) (0.00714) (0.00696) [p-val] [0.717] [6.17e-07] [6.17e-07] [0.0609] RMSE σ2 1 0.2810 0.2690 0.4950 0.2597 Bias σ2 1 -0.004313 -0.006602 -0.4160 0.005215 RMSE ρ 0.08537 0.03392 0.03529 0.03221 Bias ρ 0.008375 -0.004949 -0.002940 -0.004969 RMSE σ2 0.5672 0.6185 0.4955 0.9024 0.3251 0.4975 Bias σ2 -0.4539 0.5778 -0.2283 0.8832 -0.1561 0.4559 Time (s) 27.30 0.5509 0.09183 3.204 0.5057 36.12

Table 3: Results of the simulated experiments for the spatial data and the predictor function F = hajjem . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the extrapolation test data are denoted by ext , and results for the predictions of sums are denoted by sum . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

measures, but it has lower prediction accuracy compared to the GPBoost algorithm. The results for the friedman3 predictor function reported in Table 10 in the appendix are similar. The GPBoost algorithm signiﬁcantly outperforms all other approaches in all prediction accuracy measures. Not surprisingly, a linear model has the highest prediction accuracy when the data generating process is linear; see Table 11 in the appendix. However, despite the relatively small sample size of n = 500, the diﬀerences in prediction accuracy between the linear Gaussian process model and the GPBoost algorithm are relatively small. As expected due to the arguments laid out in Section 1.1, the diﬀerences between the GPBoost algorithm and independent boosting including the coordinates in the predictor variables (LSBoost) are the largest for probabilistic predictions of sums ( CRPS sum ) for all three predictor functions.

Gaussian Process Boosting

5. Real-world Applications

In the following, we apply the GPBoost algorithm to several real-world data sets and compare its prediction accuracy to alternative methods. We consider grouped data, spatial data, and several UCI benchmark data sets.

5.1 Grouped random eﬀects for categorical data: wages data

We ﬁrst apply the GPBoost algorithm to a grouped data set, i.e., a data set with a highcardinality categorical predictor variable. We use panel data from the National Longitudinal Survey of Young Working Women consisting of 28 534 observations for 4 711 young working women. This data was collected within the National Longitudinal Survey over the years 1968-1988, and it can be downloaded from https://www.stata-press.com/ data/r10/nlswork.dta. The response variable is the logarithmic real wage, the persons ID number constitutes the high-cardinality categorical grouping variable, and the data includes the following predictor variables: age, ttl exp (total work experience), tenure (job tenure in years), not smsa (1 if not SMSA), south (1 if south), year (interview year), msp (1 if married, spouse present), nev mar (1 if never married), collgrad (1 if college graduate), c city (1 if central city), hours (usual hours worked), grade (current grade completed), ind code (industry of employment), occ code (occupation), and race (1=white, 2=black, 3=other). The low-cardinality categorical variables ind code, occ code, race, and year are dummy coded. Further, we also include the square of age, ttl exp, and tenure in the linear model. We compare the prediction accuracy of diﬀerent approaches using nested 4-fold crossvalidation. Speciﬁcally, all observations are partitioned into four disjoint sets, and, in every fold, one of the sets is used as test data and the remaining data is used for training. Note that the test data sets contain both groups that are observed and unobserved in the training data. We compare the GPBoost algorithm to the same alternative methods as in the simulation study; see Section 4.2. Concerning independent gradient boosting, we also report the results when including the high-cardinality categorical variable as a continuous variable ( LSBoost Cont ) in addition to including it as a categorical variable using the approach of Light GBM ( LSBoost Cat ). Tuning parameters are chosen by doing an additional inner 4-fold cross-validation in the same manner as the outer cross-validation on every of the four training data sets using the mean square error as selection criterion.6

We consider the same set of tuning parameters as in the simulated experiments; see Section 4.4. The results are summarized in Table 4 and Figure 2. Table 4 reports the average test RMSE over all folds, and Figure 2 graphically displays the test RMSE per fold. We ﬁnd that the GPBoost algorithm has higher prediction accuracy compared to all other methods on average over all folds and also for every fold separately. The second-best method is the MERF algorithm, and Cat Boost has the third-lowest test RMSE. In contrast to the simulated experiments but as suggested by the authors of Light GBM7, considering the grouping variable as a continuous variable ( LSBoost Cont ) results in higher prediction accuracy compared to the approach of Light GBM for categorical variables ( LSBoost Cat ).

6. Note that one has to be careful when doing cross-validation for dependent data to avoid biased estimates of the generalization error as pointed out by, e.g., Rabinowicz and Rosset (2020). However, apart from the fact that the validation and test data sets are of slightly diﬀerent sizes, our cross-validation setting preserves the distributional relation between the inner fold training and validation data sets and the training and test data sets and, consequently, no bias is introduced. 7. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html# categorical-feature-support (retrieved on August 22, 2022)

GPBoost Linear ME LSBoost Cat LSBoost Cont Cat Boost mboost MERF REEMtree 0.296 0.305 0.321 0.313 0.303 0.331 0.299 0.322

Table 4: Average test RMSE for the wages data.

Figure 2: Test RMSE for the wages data for every fold separately.

5.2 Gaussian processes for spatial data: house price data

We next apply the GPBoost algorithm to a spatial data set. We use house price data for 25 357 single-family homes sold in Lucas County, Ohio. This data is available in the sp Data R package (Bivand et al., 2008), and it has been previously studied by Le Sage and Pace (2004); Bivand (2011); Dub e and Legros (2013). The response variable is the logarithmic selling price, and the data includes the following predictor variables: age, stories (factor with levels {one, bilevel, multilvl, one+half, two, two+half, three}), TLA (total living area), wall (factor with levels {stucdrvt, ccbtile, metlvnyl, brick, stone, wood partbrk}), beds (number of bedrooms), baths (number of full baths), halfbaths (number of halfhbaths), frontage (lot frontage), depth, garage (factor with levels {no garage, basement, attached, detached, carport}), garagesqft, rooms (number of rooms), lotsize, sdate (year in which the house was sold, 1993 sdate 1998), as well as longitude-latitude coordinates for the location. For the Gaussian process model with a linear predictor function, we follow Bivand (2011) and also include the square and cube of age as predictor variables, and as in Dub e and Legros (2013), we logarithmize the total living area and the lot size. The left plot of Figure 3 shows the observation locations and observed logarithmic prices. Further, the right plot of Figure 3 shows smoothed diﬀerences of log-prices from the global mean. The latter are obtained by estimating a zero-mean Gaussian process to the diﬀerences of the log-prices from the global mean and calculating the posterior mean. We compare the GPBoost algorithm to the same approaches as in the simulation study: a linear Gaussian process model with F(X) = XT β, independent gradient boosting, mboost with spatial spline base learners, and a two-step approach; see Section 4.2 for more information. For independent boosting and the GPBoost algorithm, we include the coordinates as predictor variables in the predictor function F( ). For comparison, we also report results

Gaussian Process Boosting

Figure 3: Illustration of house price data: map with observation locations and log-prices (left plot) and smoothed diﬀerences of log-prices from the global mean (right plot).

when excluding the coordinates from the predictor function. For the Gaussian process model and the GPBoost algorithm, we use an exponential covariance function. Due to the relatively large sample size, we use a Vecchia approximation as outlined in Sections 3.4 and 3.5.1 for the Gaussian process-based models. Speciﬁcally, for training, we use a Vecchia approximation for the response variable with 50 nearest neighbors and a random ordering of the observations; see Section 3.4.1. For prediction, we use the result in Proposition 3 with the observed data ordered ﬁrst, conditioning on observed data only when calculating the Vecchia approximation, and using 500 nearest neighbors.8

Prediction accuracy is evaluated by partitioning the data into expanding window training data sets and temporal out-of-sample test data sets. Speciﬁcally, learning is done on an expanding window containing all data up to the year t 1, and predictions are calculated for the next year t. We use the three years t {1996, 1997, 1998} as test data. For every t, we additionally split the training data into two subsets: inner training data containing all data up to year t 2 and validation data for the year t 1. Tuning parameters of the boosting methods are chosen separately for t {1996, 1997, 1998} by learning on the inner training data and minimizing the mean square error on the validation data. We consider the same grid of tuning parameters as in the simulated experiments; see Section 4.4. In addition to univariate predictions, we also generate predictions for the total value of multiple houses as explained in the following. For every test set, we randomly select 100 times disjoint sets of 20 observations that are close-by in space. Such sets of spatially close samples are determined by ﬁrst randomly selecting a sample, ﬁnding its 19 nearest neighbors, and then iteratively continuing in the same manner with the remaining locations. For these sets of 20 objects, we calculate predictive distributions for their sums as described in Section 4.3.9 For generating multivariate predictive distributions of dimension 20 required for the latter, we also use the Vecchia approximation as outlined in Proposition 3 with the

8. We set the number of nearest neighbors for prediction to the largest value that our computational budget supports. Using a smaller number of nearest neighbors decreases the prediction accuracy. Whether a higher prediction accuracy can be achieved with an even larger number of nearest neighbors is unclear. 9. For simplicity, we predict the sum on the logarithmic scale. If the sum should be predicted on the original scale, simulation is required as the sum of dependent log-normal variables does not follow a standard distribution.

observed data ordered ﬁrst, but we condition on all data and not just the observed data. The latter is computationally more expensive, but it allows for obtaining non-diagonal and more accurate predictive covariance matrices; see Section 3.5.1 for more information. Further, to disentangle the eﬀect of the Vecchia approximation and the standard GPBoost algorithm without an approximation for large data, we also run the experiments on smaller subsets of the full data. This allows to do all Gaussian process-related calculations exactly. Such smaller-sized subsets are obtained as follows. For every year, we randomly partition the data into ten disjoint equal-sized subsets. Every such subset is used as a training data set, and for every training data set, we use another one of the ten subsets as test set such that each subset is a test set once. In doing so, we obtain 60 training and test data sets with an average sample size of approximately 400. For these small subsets, tuning parameters are chosen using 4-fold cross-validation on every training data set. As in the simulated experiments in Section 4, we measure the accuracy of both point predictions and probabilistic predictions using the RMSE and the CRPS. Further, we also evaluate the accuracy of α quantile predictions for α = 0.05. We do this since, in practice, predictions for lower quantiles are required for risk management purposes such as calculating a value-at-risk. Quantile predictions are obtained by assuming Gaussian predictive distributions with means and variances determined by the ﬁxed and random eﬀects. If a method does not contain random eﬀects, we use the residual variance as in the simulated experiments. Quantile predictions are evaluated using the quantile loss (Gneiting et al., 2007) S(ys, ˆyα) = (ys ˆyα)(α 1{ys ˆyα}), (24)

where ˆyα denotes the predicted quantile and ys the observed data. Table 5 reports the prediction accuracy results averaged over the diﬀerent test sets, Figure 4 graphically displays the results per test set for the full data, and Figure 5 illustrates the results of the small subsets over diﬀerent folds. Overall, the GPBoost algorithm has the highest prediction accuracy. Speciﬁcally, we ﬁnd that the GPBoost algorithm has higher prediction accuracy compared to a linear Gaussian process model, the mboost algorithm, and a two-step approach in all prediction accuracy measures. This holds true for both the full data set and the small random subsets for which exact Gaussian process inference is carried out. For univariate predictions ( RMSE , CRPS , and QL ), the GPBoost algorithm also has almost the same prediction accuracy as independent boosting ( LSBoost ). For the prediction of sums of several objects and the small subsets, the GPBoost algorithm has considerably higher prediction accuracy compared to independent boosting in all measures ( RMSE sum , CRPS sum , QL sum , RMSE small , CRPS small , and QL small ).

GPBoost Linear GP LSBoost mboost Two Step GPBoost ex LSBoost ex RMSE 0.266 0.353 0.266 0.332 0.306 0.290 0.338 CRPS 0.144 0.176 0.143 0.174 0.162 0.155 0.180 QL 0.0358 0.0516 0.0360 0.0468 0.0384 0.0370 0.0461 RMSE sum 1.72 2.50 1.94 3.14 2.31 2.17 3.73 CRPS sum 1.07 1.34 1.23 1.85 1.38 1.29 2.30 QL sum 0.185 0.446 0.234 0.517 0.320 0.263 0.705 RMSE small 0.307 0.331 0.341 0.358 0.334 0.307 0.368 CRPS small 0.162 0.171 0.189 0.189 0.179 0.162 0.196 QL small 0.0402 0.0431 0.0611 0.0489 0.0470 0.0400 0.0554

Table 5: Average prediction accuracy measures for the housing data. Results for the predictions of sums are denoted by sum . Results for the small subsampled data sets are denoted by small . QL denotes the quantile loss. ex denotes results when not including the coordinates in the variables for the predictor function.

Gaussian Process Boosting

The comparison between the GPBoost algorithm when not including the locations in the predictor function F( ) ( GPBoost ex ) and the corresponding independent boosting version ( LSBoost ex ) shows that there is important residual spatial variation also conditional on the eﬀect of the predictor variables. Considering the results for the full data, the GPBoost algorithm with the coordinates included in the predictor function ( GPBoost ) performs better compared to the GPBoost algorithm when the coordinates are not included in the predictor function ( GPBoost ex ). This is an indication that there are interaction eﬀects between the predictor variables and the spatial locations. For the small data, the results of GPBoost and GPBoost ex are almost equivalent. A possible explanation for this is that the sample size is too small to allow for modeling such interaction eﬀects.

Figure 4: Prediction accuracy for the housing data for every fold separately. Results for the predictions of sums are denoted by sum . QL denotes the quantile loss.

For illustration, we plot in Figure 6 predictive posterior means and variances of the Gaussian process part of the GPBoost model applied to the entire data set when not including the locations in the ﬁxed eﬀects function. We use the tuning parameters obtained on the last validation data set. Comparing the predicted mean ﬁeld to the one of Figure 3, we see that in some areas the spatial eﬀect is diﬀerent when factoring out the eﬀect of the other predictor variables. As expected, predictive variances are high in areas with few observations.

5.3 Gaussian processes for non-spatial data: UCI benchmark data sets

In spatial statistics, one usually distinguishes between locations S and other predictor variables X. The locations S are used as input features for a spatial Gaussian process and the eﬀect of X is modeled using a predictor function F( ). This is the way we have used Gaussian processes so far in this paper. In machine learning, Gaussian processes are

Figure 5: Violin plots illustrating the prediction accuracy for the small subsets of the housing data. The red rhombi represent means over the sample splits.

Figure 6: Predicted random eﬀects means and variances obtained using the GPBoost algorithm when not including the locations in the ﬁxed eﬀects function.

often used to model non-spatial data by including non-spatial predictor variables as input features for Gaussian process models. In the following, we compare the GPBoost algorithm to independent boosting and Gaussian process regression using several UCI data set repository10 benchmark data sets. In doing so, all predictor variables are used as features X in F( ) for boosting and as input features S in the Gaussian process models. In particular, for the GPBoost algorithm, the predictor variables are used both as features X for the function F( ) and as input features for the Gaussian process part, and X and S are equal. Table 6 shows the data sets. All data sets have a relatively small sample size such that the Gaussian process-related calculations can be done exactly without using an approximation for large data. The analysis of the GPBoost algorithm for larger benchmark data sets with diﬀerent approximations for large (non-spatial) data is left for future research.

10. http://archive.ics.uci.edu/ml/index.php

Gaussian Process Boosting

fertility servo machine yacht autompg boston pendulum energy concrete n: 100 167 209 308 392 506 630 768 1030 p = d: 9 4 7 6 7 13 9 8 8

Table 6: Overview of data sets.

We use 10-fold cross-validation to analyze the prediction accuracy. All input features S for the Gaussian process models are standardized to have mean zero and standard deviation one, where the parameters for the standardization are calculated on the training data only. Tuning parameters for independent tree-boosting and the GPBoost algorithm are chosen by doing an additional inner 4-fold cross-validation on every training data set. We use the same set of candidate tuning parameters as in all experiments above and an exponential covariance function. In the machine learning Gaussian process literature, probabilistic predictions are often evaluated using the average (univariate Gaussian) test negative log-likelihood (NLL). We also include the average test NLL given by

i=1 0.5(yp,i µp,i)2/σ2 p,i + 0.5 log(σ2 p,i2π)

as prediction accuracy measure. The results shown in the following for the test NLL are obtained by also using the test NLL as criterion for choosing tuning parameters in the inner 4-fold cross-validation. Table 7 shows average prediction accuracy measures for the diﬀerent methods. In parentheses are standard errors. Note that such standard errors have to be taken with a grain of salt.11 We observe that the GPBoost algorithm has higher prediction accuracy compared to independent boosting and Gaussian process regression for the majority of the data sets and prediction accuracy measures. Speciﬁcally, the GPBoost algorithm has an average rank of approximately 1.33 for the test RMSE and 1.22 for the CRPS and the test NLL. Further, in the few cases where either independent boosting or Gaussian process regression has the highest prediction accuracy, the GPBoost algorithm always ranks second.

6. Conclusion

We have introduced a novel way of combining boosting with Gaussian process and mixed eﬀects models. This allows for relaxing, ﬁrst, the zero or linearity assumption for the prior mean function in Gaussian process and mixed eﬀects models in a ﬂexible non-parametric way and, second, the independence assumption made in most boosting algorithms. Further, it can be used as an approach for handling high-cardinality categorical variables in treeboosting. In simulation experiments and real-world applications, we have shown that this leads to improved prediction accuracy compared to existing state-of-the-art methods. We are currently investigating how our approach can be extended to non-Gaussian data using, e.g., a Laplace approximation (Sigrist, 2022). Further, future research can investigate and compare properties of diﬀerent Gaussian process approximations for large data applied to the GPBoost algorithm. Future research is also required for computationally eﬃcient

11. First, the standard errors are calculated assuming independence of the sample splits which likely does not hold true given that the training data sets for the diﬀerent sample splits have substantial overlap. Further, similarly as mentioned for the simulated experiments in Section 4.5, such univariate standard errors do not reﬂect correlation among methods across diﬀerent sample splits. The main reason we report such standard errors is that this is common practice in the machine learning literature.

GPBoost GP LSBoost RMSE fertility 0.171 (0.011) 0.182 (0.0092) 0.166 (0.011) servo 0.270 (0.020) 0.281 (0.020) 0.288 (0.026) machine 0.342 (0.013) 0.368 (0.014) 0.347 (0.012) yacht 0.492 (0.077) 2.47 (0.35) 0.686 (0.090) autompg 2.53 (0.18) 2.52 (0.19) 2.79 (0.24) boston 2.81 (0.22) 2.96 (0.22) 3.26 (0.19) pendulum 1.62 (0.14) 1.64 (0.14) 2.23 (0.18) energy 0.393 (0.018) 1.32 (0.061) 0.330 (0.015) concrete 3.96 (0.24) 5.21 (0.22) 4.03 (0.20) Avg. rank 1.33 2.44 2.22 CRPS fertility 0.0986 (0.0065) 0.103 (0.0046) 0.0946 (0.0064) servo 0.136 (0.011) 0.152 (0.0076) 0.146 (0.013) machine 0.195 (0.0081) 0.208 (0.0072) 0.224 (0.0072) yacht 0.259 (0.027) 1.28 (0.100) 0.286 (0.031) autompg 1.36 (0.075) 1.36 (0.076) 1.59 (0.12) boston 1.45 (0.085) 1.48 (0.060) 1.87 (0.095) pendulum 0.692 (0.041) 0.701 (0.041) 1.13 (0.071) energy 0.205 (0.0076) 0.749 (0.026) 0.172 (0.0051) concrete 1.99 (0.084) 2.75 (0.079) 2.12 (0.059) Avg. rank 1.22 2.56 2.22 NLL fertility -0.294 (0.087) -0.269 (0.051) -0.339 (0.097) servo 0.142 (0.088) 0.199 (0.052) 0.416 (0.21) machine 0.436 (0.040) 0.460 (0.036) 0.587 (0.063) yacht 0.738 (0.11) 2.30 (0.042) 1.62 (0.30) autompg 2.36 (0.079) 2.37 (0.073) 2.52 (0.084) boston 2.46 (0.074) 2.45 (0.070) 2.86 (0.11) pendulum 1.55 (0.063) 1.56 (0.060) 2.44 (0.090) energy 0.439 (0.052) 1.77 (0.025) 0.694 (0.073) concrete 2.87 (0.075) 3.08 (0.043) 3.14 (0.041) Avg. rank 1.22 2.22 2.56

Table 7: Average prediction accuracy measures when doing 10-fold cross-validation for the UCI benchmark data sets. In parenthesis are standard errors.

uncertainty quantiﬁcation for the estimated predictor function F( ) and for generating probabilistic predictions that also take into account uncertainty due to F( ). Another direction for future work are convergence results concerning the generalization error as well as ﬁnite-sample and asymptotic properties of the estimators obtained from the GPBoost algorithm.

Acknowledgments

We thank Hansruedi K unsch and three anonymous reviewers for their helpful comments which have helped to improve this article. This research was partially supported by the Swiss Innovation Agency - Innosuisse (grant numbers 25746.1 PFES-ES and 28408.1 PFES-ES ).

Gaussian Process Boosting

A. Which optimization problems are solved by the MERT/MERF and RE-EM tree algorithms and how are they related to the EM algorithm?

The MERT and MERF algorithms (Hajjem et al., 2011, 2014) and the RE-EM tree (Sela and Simonoﬀ, 2012; Fu and Simonoﬀ, 2015) algorithm allow for combining trees and random forests with grouped random eﬀects models. Hajjem et al. (2011) and Hajjem et al. (2014) claim that the MERT and MERF algorithm are versions of the EM algorithm.12

Sela and Simonoﬀ(2012) call their algorithm RE-EM tree algorithm because it is reminiscent of the EM algorithm, but they acknowledge that it is not a true EM algorithm. Unfortunately, it is not clearly explained in the prior literature which optimization problems these algorithms aim to solve and how they are related to the EM algorithm. In this section, we shed some light on this.

The EM algorithm for linear mixed effects models

We ﬁrst brieﬂy describe the EM algorithm (Dempster et al., 1977) for linear mixed eﬀects models. Using the notation of Section 2, the E-step in iteration t of an EM algorithm for mixed eﬀect models works by, ﬁrst, ﬁnding a maximizer for the predictor function F of the multivariate normal likelihood given the current estimate for the covariance matrix Ψt:

ˆFt+1 = argmin F (y F)T Ψt 1(y F), (25)

second, conditional on this obtaining predictions for the random eﬀects

ˆbt+1 = ZΣt ZT Ψt 1(y ˆFt+1),

and then using these two quantities to calculate the expectation of the full data loglikelihood; see, e.g., Laird et al. (1982) and Wu and Zhang (2006). The EM algorithm then proceeds with the M-step by maximizing this expected full data log-likelihood to obtain an estimate for Ψt+1 or its parameters θt+1.

The RE-EM tree algorithm

The RE-EM tree algorithm of Sela and Simonoﬀ(2012) and Fu and Simonoﬀ(2015) iterates between, ﬁrst, estimating the structure of a tree using an independent normal likelihood obtained after subtracting predicted values of the random eﬀects from the response variable and, second, jointly estimating the leaf values and covariance parameters using a classical linear mixed eﬀects model. This is not an EM algorithm as it does not involve an E-step that calculates an expectation of a full data log-likelihood. It can, however, be interpreted as a component-wise, or coordinate descent, minimization algorithm that iterates between ﬁnding an optimizer for (parts of) F, the covariance parameters θ, and the random eﬀects b. This can be seen by ﬁrst noting that, regarding F and θ, the minimization problem

( ˆF, ˆθ) = argmin (F,θ)

1 2(y F)T Ψ 1(y F) + 1

2 log det (Ψ) (26)

12. For instance, Hajjem et al. (2011, 2014) state that The MERT algorithm is the ML-based EM-algorithm in which we replace the linear structure used to estimate the ﬁxed part of the model by a standard tree structure. and that It [the MERT algorithm] is basically an iterative call to a standard RT [regression tree] algorithm within the framework of the expectation maximization (EM) algorithm .

is equivalent to

( ˆF, ˆθ,ˆb) = argmin (F,θ,b)

1 2σ2 (y F Zb)T (y F Zb) + 1

2b T Σ 1b + 1

2 log det (Ψ) . (27)

Further, the componentwise minimizer for b given F and θ corresponds to the best linear unbiased estimator for b:

ˆb =(ZT Z + σ2Σ 1) 1ZT (y F)

=ΣZT (ZΣZT + σ2In) 1(y F).

Using the notation of Section 3.1, F( ) = h( ; α)T γ, where α denotes the splits of a tree and γ the leaf values, one iteration of the RE-EM tree algorithm can thus be written as

ˆαt+1 = argmin α L F = h( ; α)T ˆγt, ˆθt,ˆbt ,

(ˆγt+1, ˆθt+1) = argmin γ,θ L F = h( ; ˆαt+1)T γ, θ,ˆbt ,

ˆbt+1 = argmin b L F = h( ; ˆαt+1)T ˆγt+1, ˆθt+1, b ,

L (F, θ, b) = 1 2σ2 (y F Zb)T (y F Zb) + 1

2b T Σ 1b + 1

2 log det (Ψ) .

This corresponds to one iteration of a coordinate descent algorithm applied to the minimization problem in (27) which is equivalent to (26).

The MERT and MERF algorithms

In contrast to a proper E-step of an EM algorithm for mixed eﬀects model, in the E-step of the MERT and MERF algorithms of Hajjem et al. (2011) and Hajjem et al. (2014), the predictor function ˆFt+1 is not obtained as maximizer of the multivariate normal likelihood as in (25). Instead, the following is done. First, an independent normal likelihood obtained after subtracting predicted values for the random eﬀects ˆbt of the previous iteration from the response variable y is used to learn the predictor function ˆFt+1 using a regression tree or a random forest:

ˆFt+1 = argmin F

1 2(y F ˆbt)T (y F ˆbt).

Second, predictions are calculated for the random eﬀects:

ˆbt+1 = ZΣt ZT Ψt 1(y ˆFt+1).

The M-step is then analogous to a correctly speciﬁed EM algorithm. It is unclear to us whether and to which quantities the MERT and MERF algorithms converge as it is nowhere shown that they correspond to correctly speciﬁed EM algorithms.

B. Proofs for the results involving the Vecchia approximation

Proof [Proof of Proposition 1] The negative log-likelihood of the Vecchia approximation in (15) is given by

L(y, F, θ) = 1

2(y F)T Ψ 1(y F) + 1

i=1 log Di + n

2 log(2π). (28)

Gaussian Process Boosting

It follows that

θk = 1 2σ2 (y F)T

θk Ψ 1(y F) + 1

Further, we have

θk Ψ 1 = BT

θk D 1B + BT D 1 B

θk BT D 1 D

We thus obtain the result in the proposition by noting that

θk Ψ 1(y F) = 2u T k u u T D

where u and uk are given in (17).

Proof [Proof of Proposition 2] We have

Ψ θk = B 1 B

θk B 1DB T B 1DB T BT

θk B T + B 1 D

It follows that

θk = BT D 1 B

θk B 1DB T BT

θk B T + BT D 1 D

We thus have

θl =BT D 1 B

θl B 1DB T + BT

θk B 1DB T BT

θl B T + BT

Due to the cyclicality of the trace, it follows that

θl B 1 + tr BT

θk B 1DB T BT

θk is lower triangular with zeros on the diagonal, we obtain

θl B 1DB T + tr D 1 B

θk B 1DB T BT

=2tr B T BT

θl B 1D + tr D 1 D

and the statement in (18) follows.

Proof [Proof of Proposition 3] We have

T D 1 0 0 Dp 1

= BT o Do 1Bo + BT po Dp 1Bpo BT po Dp 1Bp BT p Dp 1Bpo BT p Dp 1Bp

Since BT p Dp 1Bp 1 BT p Dp 1Bpo = B 1 p Bpo,

the result follows from Theorem 12.2 in Rue and Held (2010).

Proof [Proof of Proposition 4] We have

Bp 0 Bop Bo

T Dp 1 0 0 Do 1

Bp 0 Bop Bo

= BT p Dp 1Bp + BT op Do 1Bop BT op Do 1Bo BT o Do 1Bop BT o Do 1Bo

and the result follows from Theorem 12.2 in Rue and Held (2010).

Gaussian Process Boosting

C. Additional results for the simulated experiments

In the following, we present the results for the simulated experiments in Section 4 for the two predictor functions which are not reported in the body of the article.

GPBoost Linear ME LSBoost Cat Boost mboost MERF REEMtree GPBOOS RMSE 1.065 1.237 1.112 1.141 1.239 1.088 1.119 1.065 (SE) (0.00118) (0.00123) (0.00154) (0.00134) (0.00167) (0.00118) (0.00131) (0.00117) [p-val] [1.3e-120] [9.65e-61] [1.01e-90] [2.88e-106] [3.33e-82] [3.75e-86] [0.0624] RMSE new 1.426 1.549 1.446 1.425 1.429 1.442 1.466 1.426 (SE) (0.00267) (0.00273) (0.00286) (0.0027) (0.00266) (0.00262) (0.00266) (0.00265) [p-val] [3.51e-112] [5e-44] [0.00398] [4.22e-13] [9.94e-62] [1.25e-78] [0.905] RMSE F 0.2073 0.6435 0.2280 0.2962 0.3916 0.2082 (SE) (0.00129) (0.00125) (0.00161) (0.000934) (0.00184) (0.00128) [p-val] [1.64e-144] [1.7e-32] [6.05e-98] [1.12e-103] [0.0199] RMSE b 0.3111 0.3542 0.6986 0.3179 0.3242 0.3112 (SE) (0.00113) (0.00125) (0.00229) (0.00108) (0.00113) (0.00111) [p-val] [9.74e-76] [9.43e-122] [2.67e-23] [1.01e-54] [0.238] RMSE σ2 1 0.07503 0.08060 0.07495 0.07532 0.07527 Bias σ2 1 0.003267 0.005301 0.004099 0.0008061 0.004154 RMSE σ2 0.1177 0.4077 0.08416 0.09373 0.3652 0.08100 0.06141 0.05531 Bias σ2 -0.1144 0.4067 -0.07411 0.09006 0.3628 0.07779 0.05595 0.05079 Time (s) 0.5760 0.02151 0.2079 5.754 9.129 367.4 0.8807 2.539

Table 8: Results of the simulated experiments for the grouped data and the predictor function F = friedman3 . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the test data with new groups are denoted by new . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

GPBoost Linear ME LSBoost Cat Boost mboost MERF REEMtree GPBOOS RMSE 1.049 1.044 1.074 1.139 1.225 1.124 1.087 1.049 (SE) (0.00107) (0.00105) (0.00159) (0.00145) (0.00177) (0.00127) (0.00122) (0.00108) [p-val] [4.2e-48] [1.88e-43] [7.1e-96] [4.56e-112] [2.06e-111] [1.43e-82] [5.26e-08] RMSE new 1.418 1.415 1.425 1.421 1.424 1.471 1.445 1.419 (SE) (0.00256) (0.00256) (0.00257) (0.00255) (0.0026) (0.00254) (0.00262) (0.00256) [p-val] [8.24e-40] [2.49e-30] [1.66e-08] [1.41e-42] [1.68e-99] [2.68e-76] [3.06e-05] RMSE F 0.1077 0.04600 0.1654 0.4015 0.2954 0.1111 (SE) (0.0014) (0.00253) (0.00125) (0.000874) (0.00151) (0.00145) [p-val] [7.6e-63] [4.52e-76] [2.11e-137] [7.61e-106] [9.73e-08] RMSE b 0.3068 0.3055 0.6920 0.3283 0.3173 0.3069 (SE) (0.000974) (0.000972) (0.00215) (0.000933) (0.00103) (0.000975) [p-val] [1.36e-18] [5.84e-122] [1.43e-60] [5.33e-49] [7.62e-05] RMSE σ2 1 0.06990 0.06986 0.07142 0.07124 0.06998 Bias σ2 1 -0.01381 -0.01199 -0.01348 -0.01570 -0.01279 RMSE σ2 0.03107 0.02032 0.04995 0.1274 0.3616 0.1322 0.03238 0.02252 Bias σ2 -0.02395 -0.004143 -0.03722 0.1242 0.3595 0.1289 0.02331 0.009169 Time (s) 0.6096 0.01585 0.1182 3.278 8.070 323.7 0.5651 2.656

Table 9: Results of the simulated experiments for the grouped data and the predictor function F = linear . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the test data with new groups are denoted by new . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

Gaussian Process Boosting

GPBoost Linear GP LSBoost mboost Two Step GPBOOS RMSE 1.271 1.361 1.379 1.331 1.287 1.273 (SE) (0.00456) (0.00511) (0.00515) (0.00519) (0.00475) (0.00468) [p-val] [2.14e-45] [2.6e-46] [1.38e-24] [1.21e-07] [0.188] CRPS 0.7225 0.7644 0.8092 0.7530 0.7300 0.7174 (SE) (0.00291) (0.0028) (0.00415) (0.00299) (0.00304) (0.0026) [p-val] [7.5e-35] [8.07e-46] [4.01e-20] [7.2e-05] [2.96e-05] RMSE ext 1.437 1.511 1.556 1.743 1.450 1.440 (SE) (0.0107) (0.0101) (0.0169) (0.0412) (0.0109) (0.0107) [p-val] [7.57e-35] [3.05e-16] [1.31e-11] [2.06e-06] [0.0808] CRPS ext 0.8158 0.8522 0.9358 1.032 0.8259 0.8138 (SE) (0.00652) (0.00581) (0.013) (0.0299) (0.00703) (0.00619) [p-val] [4.31e-27] [5.49e-21] [3.83e-11] [9.96e-07] [0.0648] RMSE sum 11.60 11.87 14.16 17.50 11.76 11.63 (SE) (0.208) (0.195) (0.319) (0.726) (0.21) (0.208) [p-val] [1.35e-05] [2.06e-18] [1.27e-12] [0.00157] [0.227] CRPS sum 6.342 6.480 8.604 10.21 6.495 6.330 (SE) (0.105) (0.098) (0.219) (0.444) (0.116) (0.103) [p-val] [9.77e-05] [3.63e-24] [4.58e-14] [0.00011] [0.438] RMSE F 0.5061 0.6870 0.5360 0.5360 0.5096 (SE) (0.007) (0.00571) (0.00732) (0.00732) (0.00719) [p-val] [1.98e-54] [7.62e-10] [7.62e-10] [0.279] RMSE b 0.6690 0.6790 0.6775 0.6775 0.6676 (SE) (0.00458) (0.00531) (0.00504) (0.00504) (0.00462) [p-val] [0.000699] [4.82e-05] [4.82e-05] [0.212] RMSE σ2 1 0.2480 0.2763 0.3507 0.2473 Bias σ2 1 -0.006707 -0.001842 -0.2370 0.0006891 RMSE ρ 0.02832 0.02810 0.02905 0.02805 Bias ρ -0.005450 -0.004948 -0.004900 -0.006572 RMSE σ2 0.3187 0.3951 0.4018 0.4257 0.1943 0.2367 Bias σ2 -0.2290 0.3643 -0.1137 0.3940 -0.08988 0.1921 Time (s) 21.60 0.6269 0.09670 2.857 0.5620 32.14

Table 10: Results of the simulated experiments for the spatial data and the predictor function F = friedman3 . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the extrapolation test data are denoted by ext , and results for the predictions of sums are denoted by sum . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

GPBoost Linear GP LSBoost mboost Two Step GPBOOS RMSE 1.205 1.183 1.328 1.276 1.214 1.208 (SE) (0.00422) (0.0039) (0.00594) (0.00505) (0.00464) (0.00441) [p-val] [4.69e-30] [1.79e-52] [3.41e-47] [1.95e-06] [0.00153] CRPS 0.6807 0.6677 0.7632 0.7222 0.6861 0.6819 (SE) (0.00246) (0.00219) (0.00431) (0.00296) (0.00277) (0.00248) [p-val] [2.5e-29] [2.73e-44] [1.75e-46] [4.65e-05] [0.0474] RMSE ext 1.392 1.373 1.518 1.700 1.400 1.395 (SE) (0.00979) (0.00965) (0.0155) (0.0403) (0.00981) (0.01) [p-val] [8.13e-20] [2.03e-13] [4.93e-12] [3.78e-05] [0.00458] CRPS ext 0.7879 0.7764 0.8871 1.004 0.7924 0.7890 (SE) (0.0058) (0.00562) (0.0112) (0.0289) (0.00584) (0.00578) [p-val] [1.82e-19] [3.68e-15] [1.26e-11] [6.05e-05] [0.0728] RMSE sum 11.44 11.32 14.15 17.29 11.51 11.49 (SE) (0.198) (0.196) (0.287) (0.702) (0.198) (0.201) [p-val] [4.8e-05] [1.78e-16] [1.11e-13] [0.0177] [0.00526] CRPS sum 6.152 6.070 8.339 9.928 6.200 6.177 (SE) (0.0969) (0.0952) (0.183) (0.42) (0.0986) (0.0979) [p-val] [2.88e-07] [9.54e-23] [3.69e-15] [0.00553] [0.00862] RMSE F 0.3265 0.2099 0.3568 0.3568 0.3382 (SE) (0.0102) (0.013) (0.0111) (0.0111) (0.0106) [p-val] [1.32e-32] [6.59e-11] [6.59e-11] [4.45e-05] RMSE b 0.6662 0.6600 0.6721 0.6721 0.6673 (SE) (0.0059) (0.00554) (0.0064) (0.0064) (0.00607) [p-val] [0.0135] [4e-05] [4e-05] [0.0592] RMSE σ2 1 0.2279 0.2265 0.2483 0.2275 Bias σ2 1 -0.03959 -0.03913 -0.1413 -0.03173 RMSE ρ 0.03602 0.03598 0.03643 0.03506 Bias ρ -0.007792 -0.007953 -0.008058 -0.009017 RMSE σ2 0.1644 0.1123 0.4185 0.3645 0.1201 0.1347 Bias σ2 -0.1222 -0.02047 0.1885 0.3444 -0.05652 0.06216 Time (s) 26.46 0.6568 0.06625 1.989 0.5880 43.50

Table 11: Results of the simulated experiments for the spatial data and the predictor function F = linear . For the prediction accuracy metrics for y, F, and b, averages over the simulation runs are reported. Corresponding standard errors are in parentheses. P-values are calculated using paired t-tests comparing the GPBoost algorithm to the other approaches. GBPOOS refers to the GPBoost OOS algorithm. Results for the extrapolation test data are denoted by ext , and results for the predictions of sums are denoted by sum . The smallest values are in boldface (excluding GPBOOS ). An empty value indicates that the required predictions or estimates cannot be calculated. Time refers to the average wall-clock time in seconds.

Gaussian Process Boosting

Shun-Ichi Amari. Natural gradient works eﬃciently in learning. Neural computation, 10(2):251 276, 1998.

Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical modeling and analysis for spatial data. Chapman and Hall/CRC, 2014.

Peter L Bartlett, Philip M Long, G abor Lugosi, and Alexander Tsigler. Benign overﬁtting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020.

Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, volume 80, pages 541 549, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-oﬀ. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019.

G erard Biau, Benoˆıt Cadre, and Laurent Rouv ı ere. Accelerated gradient boosting. Machine Learning, 108(6):971 992, 2019.

Roger Bivand. After raising the bar : applied maximum likelihood estimation of families of models in spatial econometrics. Technical report, Norwegian School of Economics, Department of Economics, 2011.

Roger S Bivand, Edzer J Pebesma, Virgilio Gomez-Rubio, and Edzer Jan Pebesma. Applied spatial data analysis with R, volume 747248717. Springer, 2008.

Leo Breiman. Arcing classiﬁers. Annals of Statistics, pages 801 824, 1998.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classiﬁcation and regression trees. CRC press, 1984.

Peter B uhlmann and Torsten Hothorn. Boosting algorithms: Regularization, prediction and model ﬁtting. Statistical Science, pages 477 505, 2007.

Peter B uhlmann and Bin Yu. Boosting with the L2 loss: regression and classiﬁcation. Journal of the American Statistical Association, 98(462):324 339, 2003.

Peter B uhlmann et al. Boosting for high-dimensional linear models. The Annals of Statistics, 34(2): 559 583, 2006.

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785 794. ACM, 2016.

Noel Cressie and Christopher K Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2015.

John P Cunningham, Krishna V Shenoy, and Maneesh Sahani. Fast gaussian process methods for point process intensity estimation. In International Conference on Machine Learning, pages 192 199, 2008.

Jakob A Dambon, Fabio Sigrist, and Reinhard Furrer. Maximum likelihood estimation of spatially varying coeﬃcient models for large data with an application to real estate price prediction. Spatial Statistics, 41:100470, 2021.

Abhirup Datta, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. Hierarchical nearestneighbor gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800 812, 2016.

Tim Davis. Csparse: a concise sparse matrix package, 2005. URL http://www.cise.ufl.edu/ research/sparse/CSparse.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1 22, 1977.

Jean Dub e and Di ego Legros. Dealing with spatial data pooled over time in statistical models. Letters in Spatial and Resource Sciences, 6(1):1 18, 2013.

Jane Elith, John R Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802 813, 2008.

Andrew O Finley, Abhirup Datta, Bruce D Cook, Douglas C Morton, Hans E Andersen, and Sudipto Banerjee. Eﬃcient algorithms for bayesian nearest neighbor gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401 414, 2019.

Walter D Fisher. On grouping for maximum homogeneity. Journal of the American statistical Association, 53(284):789 798, 1958.

Yoav Freund and Robert E Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, volume 96, pages 148 156, 1996.

Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28 (2):337 407, 2000.

Jerome H Friedman. Multivariate adaptive regression splines. The Annals of Statistics, pages 1 67, 1991.

Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189 1232, 2001.

Wei Fu and Jeﬀrey S Simonoﬀ. Unbiased regression trees for longitudinal and clustered data. Computational Statistics & Data Analysis, 88:53 74, 2015.

Geir-Arne Fuglstad, Daniel Simpson, Finn Lindgren, and H avard Rue. Does non-stationary spatial data always require non-stationary random ﬁelds? Spatial Statistics, 14:505 531, 2015.

Jacob Gardner, GeoﬀPleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPy Torch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. Advances in Neural Information Processing Systems, 31:7576 7586, 2018.

Alan E Gelfand, Hyon-Jung Kim, CF Sirmans, and Sudipto Banerjee. Spatial modeling with spatially varying coeﬃcient processes. Journal of the American Statistical Association, 98(462):387 396, 2003.

Gaussian Process Boosting

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007.

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2): 243 268, 2007.

Vidyadhar P Godambe. An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4):1208 1211, 1960.

Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. Why do tree-based models still outperform deep learning on tabular data? In Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

Andreas Groll and Gerhard Tutz. Regularization for generalized additive mixed models by likelihoodbased boosting. Methods of Information in Medicine, 51(02):168 177, 2012.

Ga el Guennebaud, Benoˆıt Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Joseph Guinness. Permutation and grouping methods for sharpening gaussian process approximations. Technometrics, 60(4):415 429, 2018.

Joseph Guinness. Gaussian process learning via Fisher scoring of Vecchia s approximation. Statistics and Computing, 31(3):1 8, 2021.

Ahlem Hajjem, Fran cois Bellavance, and Denis Larocque. Mixed eﬀects regression trees for clustered data. Statistics & probability letters, 81(4):451 459, 2011.

Ahlem Hajjem, Fran cois Bellavance, and Denis Larocque. Mixed-eﬀects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6):1313 1328, 2014.

Trevor Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, 1(3):297 310, 1986.

James Hensman, Nicol o Fusi, and Neil D Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artiﬁcial Intelligence, pages 282 290, 2013.

Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov, and Matthias Schmid. Model-based boosting in R: A hands-on tutorial using the R package mboost. Computational Statistics, 29:3 35, 2014.

Torsten Hothorn, Peter B uhlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. Modelbased boosting 2.0. Journal of Machine Learning Research, 11(Aug):2109 2113, 2010.

Rie Johnson and Tong Zhang. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis & Machine Intelligence, 36(5):942 954, 2013.

Donald R Jones, Matthias Schonlau, and William J Welch. Eﬃcient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455 492, 1998.

Matthias Katzfuss and Joseph Guinness. A general framework for vecchia approximations of gaussian processes. Statistical Science, 36(1):124 141, 2021.

Matthias Katzfuss, Joseph Guinness, Wenlong Gong, and Daniel Zilber. Vecchia approximations of gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics, 25 (3):383 414, 2020.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie Yan Liu. Light GBM: A highly eﬃcient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149 3157, 2017.

Marc C Kennedy and Anthony O Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425 464, 2001.

Nan M Laird, James H Ware, et al. Random-eﬀects models for longitudinal data. Biometrics, 38 (4):963 974, 1982.

James P Le Sage and R Kelley Pace. Models for spatially dependent missing data. The Journal of Real Estate Finance and Economics, 29(2):233 254, 2004.

Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When Gaussian process meets big data: A review of scalable GPs. IEEE transactions on neural networks and learning systems, 31(11): 4405 4423, 2020.

Haihao Lu, Sai Praneeth Karimireddy, Natalia Ponomareva, and Vahab Mirrokni. Accelerating gradient boosting machines. In International Conference on Artiﬁcial Intelligence and Statistics, volume 108, pages 516 526, 2020.

Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512 518, 2000.

Charles E Mc Culloch and John M Neuhaus. Misspecifying the shape of a random eﬀects distribution: why getting it wrong may not matter. Statistical science, pages 388 402, 2011.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2004.

Didrik Nielsen. Tree boosting with xgboost-why does xgboost win every machine learning competition? Master s thesis, NTNU, 2016.

Amol Pande, Liang Li, Jeevanantham Rajeswaran, John Ehrlinger, Udaya B Kogalur, Eugene H Blackstone, and Hemant Ishwaran. Boosted multivariate trees for longitudinal data. Machine learning, 106(2):277 305, 2017.

Jos e Pinheiro and Douglas Bates. Mixed-eﬀects models in S and S-PLUS. Springer Science & Business Media, 2006.

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Cat Boost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31, pages 6638 6648. 2018.

Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Christopher KI Williams. Approximation methods for gaussian process regression. In Large-scale kernel machines, pages 203 223. MIT Press, 2007.

Assaf Rabinowicz and Saharon Rosset. Cross-validation for correlated data. Journal of the American Statistical Association, pages 1 14, 2020.

H arvard Rue and Leonhard Held. Discrete spatial variation. Handbook of spatial statistics, pages 171 200, 2010.

Gaussian Process Boosting

Mohammad J Saberian, Hamed Masnadi-Shirazi, and Nuno Vasconcelos. Taylorboost: First and second-order boosting algorithms with explicit margin control. In Computer Vision and Pattern Recognition, pages 2929 2934, 2011.

Alexandra M Schmidt and Peter Guttorp. Flexible spatial covariance functions. Spatial Statistics, page 100416, 2020.

Holger Schwenk and Yoshua Bengio. Boosting neural networks. Neural computation, 12(8):1869 1887, 2000.

Rebecca J Sela and Jeﬀrey S Simonoﬀ. RE-EM trees: a data mining approach for longitudinal and clustered data. Machine learning, 86(2):169 207, 2012.

Robert H Shumway and David S Stoﬀer. Time series analysis and its applications: with R examples. Springer, 2017.

Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84 90, 2022.

Fabio Sigrist. Gradient and Newton boosting for classiﬁcation and regression. Expert Systems With Applications, 167:114080, 2021a.

Fabio Sigrist. KTBoost: Combined kernel and tree boosting. Neural Processing Letters, 53(2): 1147 1160, 2021b.

Fabio Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis & Machine Intelligence (accepted), 2022.

Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257, 2006.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951 2959, 2012.

Yangqiu Song and Changshui Zhang. New boosting methods of Gaussian processes for regression. In International Joint Conference on Neural Networks, volume 2, pages 1142 1147, 2005.

Michael L Stein. Interpolation of spatial data: some theory for kriging. Springer Science & Business Media, 1999.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artiﬁcial Intelligence and Statistics, pages 567 574, 2009.

Gerhard Tutz and Florian Reithinger. A boosting approach to ﬂexible semiparametric mixed models. Statistics in medicine, 26(14):2872 2900, 2007.

Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Statistica Sinica, pages 5 42, 2011.

Aldo V Vecchia. Estimation and model identiﬁcation for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297 312, 1988.

Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press Cambridge, MA, 2006.

Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775 1784, 2015.

Simon N Wood. Generalized additive models: an introduction with R. Chapman and Hall/CRC, 2017.

Hulin Wu and Jin-Ting Zhang. Nonparametric regression methods for longitudinal data analysis: mixed-eﬀects modeling approaches, volume 515. John Wiley & Sons, 2006.

Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the Success of Ada Boost and Random Forests as Interpolating Classiﬁers. Journal of Machine Learning Research, 18(48):1 33, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.