# gradient_boosting_performs_gaussian_process_inference__9a74f5ae.pdf Published as a conference paper at ICLR 2023 GRADIENT BOOSTING PERFORMS GAUSSIAN PROCESS INFERENCE Aleksei Ustimenko Share Chat research@aleksei.uk Artem Beliakov Yandex Research, HSE University belyakov.arteom2015@gmail.com Liudmila Prokhorenkova Yandex Research ostroumova-la@yandex.com This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain the convergence to a Gaussian Process posterior mean, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior to provide better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance. We show that the proposed sampler allows for better knowledge uncertainty estimates leading to improved out-of-domain detection. 1 INTRODUCTION Gradient boosting (Friedman, 2001) is a classic machine learning algorithm successfully used for web search, recommendation systems, weather forecasting, and other problems (Roe et al., 2005; Caruana & Niculescu-Mizil, 2006; Richardson et al., 2007; Wu et al., 2010; Burges, 2010; Zhang & Haghani, 2015). In a nutshell, gradient boosting methods iteratively combine simple models (usually decision trees), minimizing a given loss function. Despite the recent success of neural approaches in various areas, gradient-boosted decision trees (GBDT) are still state-of-the-art algorithms for tabular datasets containing heterogeneous features (Gorishniy et al., 2021; Katzir et al., 2021). This paper aims at a better theoretical understanding of GBDT methods for regression problems assuming the widely used RMSE loss function. First, we show that the gradient boosting with regularization can be reformulated as an optimization problem in some Reproducing Kernel Hilbert Space (RKHS) with implicitly defined kernel structure. After obtaining that connection between GBDT and kernel methods, we introduce a technique for sampling from prior Gaussian process distribution with the same kernel that defines RKHS so that the final output would converge to a sample from the Gaussian process posterior. Without this technique, we can view the output of GBDT as the mean function of the Gaussian process. Importantly, our theoretical analysis assumes the regularized gradient boosting procedure (Algorithm 2) without any simplifications we only need decision trees to be symmetric (oblivious) and properly randomized (Algorithm 1). These assumptions are non-restrictive and are satisfied in some popular gradient boosting implementations, e.g., Cat Boost (Prokhorenkova et al., 2018). Our experiments confirm that the proposed sampler from the Gaussian process posterior outperforms the previous approaches (Malinin et al., 2021) and gives better knowledge uncertainty estimates and improved out-of-domain detection. A major part of the work was done while working at Yandex Research. Published as a conference paper at ICLR 2023 2 BACKGROUND Assume that we are given a distribution ϱ over X Y , where X Rd is called a feature space and Y R a target space. Further assume that we are given a dataset z = {(xi, yi)}N i=1 X Y of size N 1 sampled i.i.d. from ϱ. Let us denote by ρ(dx) = R Y dϱ(dx, dy). W.l.o.g., we also assume that X = supp ρ = x Rd : ϵ > 0, ρ({x Rd : x x Rd < ϵ}) > 0 which is a closed subset of Rd. 
Moreover, for technical reasons, we assume that 1 2N PN i=1 y2 i R2 for some constant R > 0 almost surely, which can always be enforced by clipping. Throughout the paper, we also denote by x N and y N the matrix of all feature vectors and the vector of targets. 2.1 GRADIENT BOOSTED DECISION TREES Given a loss function L : R2 R, a classic gradient boosting algorithm (Friedman, 2001) iteratively combines weak learners (usually decision trees) to reduce the average loss over the train set z: L(f) = Ez[L(f(x), y)]. At each iteration τ, the model is updated as: fτ(x) = fτ 1(x) + ϵwτ(x), where wτ( ) W is a weak learner chosen from some family of functions W, and ϵ is a learning rate. The weak learner wτ is usually chosen to approximate the negative gradient of the loss function gτ(x, y) := L(s,y) s s=fτ 1(x): wτ = arg min w W Ez gτ(x, y) w(x) 2 . (1) The family W usually consists of decision trees. In this case, the algorithm is called GBDT (Gradient Boosted Decision Trees). A decision tree is a model that recursively partitions the feature space into disjoint regions called leaves. Each leaf Rj of the tree is assigned to a value, which is the estimated response y in the corresponding region. Thus, we can write w(x) = Pd j=1 θj1{x Rj}, so the decision tree is a linear function of the leaf values θj. A recent paper Ustimenko & Prokhorenkova (2021) proposes a modification of classic stochastic gradient boosting (SGB) called Stochastic Gradient Langevin Boosting (SGLB). SGLB combines gradient boosting with stochastic gradient Langevin dynamics to achieve global convergence even for non-convex loss functions. As a result, the obtained algorithm provably converges to some stationary distribution (invariant measure) concentrated near the global optimum of the loss function. We mention this method because it samples from similar distribution as our method but with a different kernel. 2.2 ESTIMATING UNCERTAINTY In addition to the predictive quality, it is often important to detect when the system is uncertain and can be mistaken. For this, different measures of uncertainty can be used. There are two main sources of uncertainty: data uncertainty (a.k.a. aleatoric uncertainty) and knowledge uncertainty (a.k.a. epistemic uncertainty). Data uncertainty arises due to the inherent complexity of the data, such as additive noise or overlapping classes. For instance, if the target is distributed as y|x N(f(x), σ2(x)), then σ(x) reflects the level of data uncertainty. This uncertainty can be assessed if the model is probabilistic. Knowledge uncertainty arises when the model gets input from a region either sparsely covered by or far from the training data. Since the model does not have enough data in this region, it will likely make a mistake. A standard approach to estimating knowledge uncertainty is based on ensembles (Gal, 2016; Malinin, 2019). Assume that we have trained an ensemble of several independent models. If all the models understand an input (low knowledge uncertainty), they will give similar predictions. However, for out-of-domain examples (high knowledge uncertainty), the models are likely to provide diverse predictions. For regression tasks, one can obtain knowledge uncertainty by measuring the variance of the predictions provided by multiple models (Malinin, 2019). Such ensemble-based approaches are standard for neural networks (Lakshminarayanan et al., 2017). Recently, ensembles were also tested for GBDT models (Malinin et al., 2021). 
The authors consider two ways of generating ensembles: ensembles of independent SGB models and ensembles of independent SGLB models. While empirically methods are very similar, SGLB has better theoretical properties: the convergence of parameters to the stationary distribution allows one to sample models Published as a conference paper at ICLR 2023 from a particular posterior distribution. One drawback of SGLB is that its convergence rate is unknown, as the proof is asymptotic. However, the convergence rate can be upper bounded by that of Stochastic Gradient Langevin Dynamics for log-concave functions, e.g., Zou et al. (2021), which is not dimension-free. In contrast, our rate is dimension-free and scales linearly with inverse precision. 2.3 GAUSSIAN PROCESS INFERENCE In this section, we briefly describe the basics of Bayesian inference with Gaussian processes that is closely related to our analysis. A random variable f with values in L2(ρ) is said to be a Gaussian process f GP f0, σ2K+δ2Id L2 with covariance defined via a kernel K(x, x ) = cov f(x), f(x ) and mean value f0 L2(ρ) iff g L2(ρ) we have Z X f(x)g(x)ρ(dx) N Z X f0(x)g(x)ρ(dx), σ2 Z X X g(x)K(x, x )g(x )ρ(dx) ρ(dx ) + δ2 g 2 L2 . A typical Gaussian Process setup is to assume that the target y|x is distributed as GP 0L2(ρ), σ2K+ δ2Id L2 for some kernel function1 K(x, x ) with scales σ > 0 and δ > 0: y|x = f(x), f GP(0L2(ρ), σ2K + δ2Id L2). The posterior distribution f(x)|x, x N, y N is again a Gaussian Process GP(f , σ2 e K + δ2Id L2) with mean and covariance given by (see Rasmussen & Williams (2006)): f λ (x) = K(x, x N) K(x N, x N) + λIN 1y N, e K(x, x) = K(x, x) K(x, x N) K(x N, x N) + λIN 1K(x N, x). with λ = δ2 σ2 . To estimate the posterior mean f λ (x) = R R fp(f|x, x N, y N) df, we can use the maximum a posteriori probability estimate (MAP) the only solution for the following Kernel Ridge Regression (KRR) problem: L(f) = 1 2N f(xi) yi 2 + λ 2N f 2 H min f H , where H = span K( , x) x X L2(ρ) is the reproducing kernel Hilbert space (RKHS) for the kernel K( , ) and L(f) minf H means that we are seeking minimizers of this function L(f). We refer to Appendix D for the details on KRR and RKHS. To solve the KRR problem, one can apply the gradient descent (GD) method in the functional space: fτ+1 = (1 λϵ i=1 (fτ(xi) yi)Kxi , f0 = 0L2(ρ) , (2) where ϵ > 0 is a learning rate. Since this objective is strongly convex due to the regularization, the gradient descent rapidly converges to the unique optimum: f λ = lim τ fτ = K( , x N) K(x N, x N) + λIN 1y N, see Appendices C and E for the details. Gradient descent guides fτ to the posterior mean f λ of the Gaussian Process with kernel σ2K + δ2Id L2. To obtain the posterior variance estimate e K(x, x) for any x, one can use sampling and introduce a source of randomness in the above iterative scheme as follows: 1. sample f init GP(0L2(ρ), σ2K + δ2Id L2); 2. set new labels ynew N = y N f init(x N); 3. fit GD fτ( ) on ynew N assuming F0( ) = 0L2(ρ); 4. output f init( ) + fτ( ) as final model. 1K(x, x ) is a kernel function if K(xi, xj) N,N i=1,j=1 0 for any N 1 and any xi X almost surely. Published as a conference paper at ICLR 2023 This method is known as Sample-then-Optimize (Matthews et al., 2017) and is widely adopted for Bayesian inference. As τ , we get f init + f distributed as a Gaussian Process posterior with the desired mean and variance. Formally, the following result holds: Lemma 2.1. 
f init + f follows the Gaussian Process posterior GP(f λ , σ2 e K + δ2Id L2) with: f λ (x) = K(x, x N) K(x N, x N) + λIN 1y N , e K(x, x) = K(x, x) K(x, x N) K(x N, x N) + λIN 1K(x N, x) . 3 EVOLUTION OF GBDT IN RKHS 3.1 PRELIMINARIES In our analysis, we assume that we are given a finite set V of weak learners used for the gradient boosting.2 Each ν corresponds to a decision tree that defines a partition of the feature space into disjoint regions (leaves). For each ν V, we denote the number of leaves in the tree by Lν 1. Also, let φν : X {0, 1}Lν be a mapping that maps x to the vector indicating its leaf index in the tree ν. This mapping defines a decomposition of X into the disjoint union: X = Lν j=1 x X φ(j) ν (x) = 1 . Having φν, we define a weak learner associated with it as x 7 θ, φν(x) RLν for any choice of θ RLν which we refer to as leaf values . In other words, θ corresponds to predictions assigned to each region of the space defined by ν. Let us define a linear space F L2(ρ) of all possible ensembles of trees from V: F = span φ(j) ν ( ) : X {0, 1} ν V, j {1, . . . , Lν} . We note that the space F can be data-dependent since V may depend on z, but we omit this dependence in the notation for simplicity. Note that we do not take the closure w.r.t. the topology of L2(ρ) since we assume that V is finite and therefore F is finite-dimensional and thus topologically closed. 3.2 GBDT ALGORITHM UNDER CONSIDERATION Our theoretical analysis holds for classic GBDT algorithms discussed in Section 2.1 equipped with regularization from Ustimenko & Prokhorenkova (2021). The only requirement we need is that the procedure of choosing each new tree has to be properly randomized. Let us discuss a tree selection algorithm that we assume in our analysis. Each new tree approximates the gradients of the loss function with respect to the current predictions of the model. Since we consider the RMSE loss function, the gradients are proportional to the residuals rj = yj f(xj), where f is the currently built model. The tree structure is defined by the features and the corresponding thresholds used to split the space. The analysis in this paper assumes the Sample Tree procedure (see Algorithm 1), which is a classic approach equipped with proper randomization. Sample Tree builds an oblivious decision tree (Prokhorenkova et al., 2018), i.e., all nodes at a given level share the same splitting criterion (feature and threshold).3 To limit the number of candidate splits, each feature is quantized into n + 1 bins. In other words, for each feature, we have n thresholds that can be chosen arbitrarily.4 The maximum tree depth is limited by m. Recall that we denote the set of all possible tree structures by V. We build the tree in a top-down greedy manner. At each step, we choose one split among all the remaining candidates based on the following score defined for ν V and residuals r: D(ν, r) := 1 PN i=1 φ(j) ν (xi) ri 2 PN i=1 φ(j) ν (xi) . (3) 2The finiteness of V is important for our analysis, and it usually holds in practice, see Section 3.2. 3In fact, the procedure can be extended to arbitrary trees, but this would over-complicate formulation of the algorithm and would not change the space of tree ensembles as any non-symmetric tree can be represented as a sum of symmetric ones. 4A standard approach is to quantize the feature such that all n + 1 buckets have approximately the same number of training samples. 
Published as a conference paper at ICLR 2023 Algorithm 1 Sample Tree(r; m, n, β) input: residuals r = (ri)N i=1 output: oblivious tree structure ν V hyper-parameters: number of feature splits n, max. tree depth m, random strength β [0, ) definitions: S = (j, k) j {1, . . . , d}, k {1, . . . , n} indices of all possible splits instructions: initialize i = 0, ν0 = , S(0) = S repeat sample (ui(s))s S(i) U [0, 1]nd i choose next split as {si+1} = arg max s S(i) D (νi, s), r β log( log ui(s)) update tree: νi+1 = (νi, si+1) update candidate splits: S(i+1) = S(i)\{si+1} i = i + 1 until i m or S(i) = return: νi Algorithm 2 Train GBDT(z; ϵ, T, m, n, β, λ) input: dataset z = (x N, y N) hyper-parameters: learning rate ϵ > 0, regularization λ > 0, iterations of boosting T, parameters of Sample Tree m, n, β instructions: initialize τ = 0, f0( ) = 0L2(ρ) repeat rτ = y N fτ(x N) compute residuals ντ = Sample Tree(rτ; m, n, β) construct a tree θτ = PN i=1 φ(j) ντ (xi)r(i) τ PN i=1 φ(j) ντ (xi) j=1 set val- ues in leaves fτ+1( ) = (1 λϵ N )fτ( ) + ϵ D φντ ( ), θτ E RLντ update model τ = τ + 1 until τ T return: f T ( ) In classic gradient boosting, one builds a tree recursively by choosing such split s that maximizes the score D((νi, s), r) (Ibragimov & Gusev, 2019).5 Random noise is often added to the scores to improve generalization. In Sample Tree, we choose a split that maximizes D (νi, s), r + ε, where ε Gumbel(0, β) . (4) Here β is random strength: β = 0 gives the standard greedy approach, while β gives the uniform distribution among all possible split candidates. To sum up, Sample Tree is a classic oblivious tree construction but with added random noise. We do this to make the distribution of trees regular in a certain sense: roughly speaking, the distributions should stabilize with iterations by converging to some fixed distribution. Given the algorithm Sample Tree, we describe the gradient boosting procedure assumed in our analysis in Algorithm 2. It is a classic GBDT algorithm but with the update rule fτ+1( ) = (1 λϵ/N) fτ( ) + ϵwτ(x). In other words, we shrink the model at each iteration, which serves as regularization (Ustimenko & Prokhorenkova, 2021). Such shrinkage is available, e.g., in the Cat Boost library. 3.3 DISTRIBUTION OF TREES The Sample Tree algorithm induces a local family of distributions p( |f, β) for each f F: p(dν|f, β) = P n Sample Tree y N f(x N); m, n, β dν o . Remark 3.1. Lemma D.5 ensure that such distribution coincides with the one where we use f (x N) instead of y N. This is due to the fact that D ν, y N f(x N) = D ν, f f(x N) ν V, f F. The following lemma describes the distribution p(dν|f, β), see Appendix F for the proof. Lemma 3.2. (Probability of a tree)6 p(ν|f, β) = X s S\νς,i 1 e D((νς,i 1,s),r) 5Maximizing (3) is equivalent to minimizing the squared error between the residuals and the mean values in the leaves. 6Note that for oblivious decision trees, changing the order of splits does not affect the obtained partition. Hence, we assume that each tree is defined by an unordered set of splits. Published as a conference paper at ICLR 2023 where the sum is over all permutations ς Pm, νς,i = (sς(1), . . . , sς(i)), and ν = (s1, . . . , sm). Let us define the stationary distribution of trees as π( ) = limβ p( |f, β). It follows from Remark 3.1 that we also have π( ) = p( |f , β). Corollary 3.3. (Stationary distribution is the uniform distribution over tree structures) We have π(dν) = dν nd m , where nd m = (nd)! (nd m)!m!. 
3.4 RKHS STRUCTURE In this section, we describe the evolution of GBDT in a certain Reproducing Kernel Hilbert Space (RKHS). Even though the problem is finite dimensional, treating it as functional regression is more beneficial as dimension of the ensembles space grows rapidly and therefore we want to obtain dimension-free constants which is impossible if we treat it as finite dimensional optimization problem. Let us start with defining necessary kernels. For convenience, we also provide a diagram illustrating the introduced kernels and relations between them in Appendix A. Definition 3.4. A weak learner s kernel kν( , ) is a kernel function associated with a tree structure ν V which can be defined as: kν(x, x ) = j=1 w(j) ν φ(j) ν (x)φ(j) ν (x ), where w(j) ν = N max{N (j) ν , 1} , N (j) ν = i=1 φ(j) ν (xi) . This weak learner s kernel is a building block for any other possible kernel in boosting and is used to define the iterations of the boosting algorithm analytically. Definition 3.5. We also define a greedy kernel of the gradient boosting algorithm as follows: Kf(x, x ) = X ν V kν(x, x )p(ν|f, β) . This greedy kernel is a kernel that guides the GBDT iterations, i.e., we can think of each iteration as SGD with a kernel from 3.5, and 3.4 is used as a stochastic gradient estimator of the Fr echet derivative in RKHS defined by the kernel from 3.5. Definition 3.6. Finally, there is a stationary kernel K(x, x ) that is independent from f: K(x, x ) = X ν V kν(x, x )π(ν) , which we call a prior kernel of the gradient boosting. This kernel defines the limiting solution since the gradient projection on RKHS converges to zero, and thus 3.5 converges to 3.6. Note that F = span K( , x) | x X . Having the space of functions F, we define RKHS structure H = F, , H on it using a scalar product defined as f, K( , x) H = f(x). Now, let us define the empirical error of a model f: L(f, λ) = 1 2N y N f(x N) 2 RN + λ Then, we define V (f, λ) = L(f, λ) inff F L(f , λ). Let us also define the following functions: f λ arg minf F V (f, λ) and f = lim λ 0 f λ arg min f F V (f), where V (f) = V (f, 0). It is known that such f exists and is unique since the set of all solutions is convex, and therefore there is a unique minimizer of the norm H. Finally, the following lemma gives the formula of the GBDT iteration in terms of kernels in Lemma 3.7 which will be useful in proving our results. See Appendix D for the proofs. Lemma 3.7. Iterations fτ of Gradient Boosting (Algorithm 2) can be written in the form: fτ+1 = (1 λϵ N kντ ( , x N) f (x N) fτ(x N) , ντ p(ν|fτ, β) . Published as a conference paper at ICLR 2023 3.5 KERNEL GRADIENT BOOSTING CONVERGENCE TO KRR Consider the sequence {fτ}τ N generated by the gradient boosting algorithm. Its evolution is described by Lemma 3.7. The following theorem estimates the expected (w.r.t. the randomness of tree selection) empirical error of f T relative to the best possible ensemble. The full statement of the theorem and its proof can be found in Appendix G. Theorem 3.8. Assume that β, T1 are sufficiently large and ϵ is sufficiently small (see Appendix G). Then, T T1, EV (f T , λ) O e O ϵ(T T1) N 2 + ϵ + λ βN 2 Corollary 3.9. 
(Convergence to the solution of the KRR problem) Under the assumptions of the previous theorem, we have the following dimension-free bound: E f T f λ 2 L2 O e O ϵ(T T1) N + Nϵ + λ βN This bound is dimension-free thanks to functional treatment and exponentially decaying to small value with iterations and therefore justifies the observed rapid convergence of gradient boosting algorithms in practice even though dimension of space H is enormous. 4 GAUSSIAN INFERENCE So far, the main result of the paper proved in Section 3.5 shows that Algorithm 2 solves the Kernel Ridge Regression problem, which can be interpreted as learning Gaussian Process posterior mean f λ under the assumption that f GP(0, σ2K+δ2Id L2) where λ = σ2 δ2 . I.e., Algorithm 2 does not give us the posterior variance. Still, as mentioned earlier, we can estimate the posterior variance through Monte-Carlo sampling in a sample-then-optimize way. For that, we need to somehow sample from the prior distribution GP(0, σ2K + δ2Id L2). 4.1 PRIOR SAMPLING We introduce Algorithm 3 for sampling from the prior distribution. Sample Prior generates an ensemble of random trees (with random splits and random values in leaves). Note that while being random, the tree structure depends on the dataset features x N since candidate splits are based on x N. We first note that the process h T ( ) is centered with covariance operator K: Eh T (x) = 0 x X , Eh T (x)h T (y) = K(x, y) x, y X . (5) Then, we show that h T ( ) converges to the Gaussian Process in the limit. Lemma 4.1. The following convergence holds almost surely in x X: h T ( ) T GP 0L2(ρ), K . Algorithm 3 Sample Prior(T, m, n) hyper-parameters: number of iterations T, parameters of Sample Tree m, n instructions: initialize τ = 0, h0(x) = 0 repeat ντ = Sample Tree(0RN ; m, n, 1) sample random tree θτ N 0RLντ , diag N max{N (j) ντ ,1} : j {1, . . . , Lντ } generate random values in leaves hτ+1( ) = hτ( ) + 1 T φντ ( ), θτ RLντ update model τ = τ + 1 until τ T return: h T ( ) 4.2 POSTERIOR SAMPLING Now we are ready to introduce Algorithm 4 for sampling from the posterior. The procedure is simple: we first perform T0 iterations of Sample Prior to obtain a function h T0( ) and then we train a standard GBDT model f T1( ) approximating y N σh T0(x N) + N(0N, δ2IN). Our final model is σh T0( ) + f T1( ). Published as a conference paper at ICLR 2023 Figure 1: KGB on a synthetic dataset. We further refer to this procedure as Sample Posterior or KGB (Kernel Gradient Boosting) for brevity. Denote h = lim T0 h T0 , f = lim f T1 , where the first limit is with respect to the point-wise convergence of stochastic processes and the second one with respect to L2(ρ) convergence. The following theorem shows that KGB indeed samples from the desired posterior. The proof directly follows from Lemmas 4.1 and 2.1. Algorithm 4 Sample Posterior(z; ϵ, T1, T0, m, n, β, σ, δ) input: dataset z = (x N, y N) hyper-parameters: learning rate ϵ > 0, boosting iteration T1, Sample Prior iterations T0, parameters of Sample Tree m, n, β, kernel scale σ > 0 (default σ = 1), noise scale δ > 0 (default: δ = 0.01) instructions: h T0( ) = Sample Prior(T0, m, n) ynew N = y N σh T0(x N) + N(0N, δ2IN) f T1( ) = Train GBDT (x N, ynew N ); ϵ, T1, m, n, β, δ2 return: σh T0( ) + f T1( ) Theorem 4.2. 
In the limit, the output of Algorithm 4 follows the Gaussian process posterior: σh ( ) + f ( ) + N(0, δ2) GP ef, e K with mean ef(x) = K(x, x N) K(x N, x N) + λIN 1y N and covariance e K(x, x) = δ2 + σ2 K(x, x) K(x, x N) K(x N, x N) + λIN 1K(x N, x) . 5 EXPERIMENTS This section empirically evaluates the proposed KGB algorithm and shows that it indeed allows for better knowledge uncertainty estimates. Synthetic experiment To illustrate the KGB algorithm in a controllable setting, we first conduct a synthetic experiment. For this, we defined the feature distribution as uniform over D = {(x, y) [0, 1]2 : 1 10 (x 1 5)}. We sample 10K points from U([0, 1]2) and take into the train set only those that fall into D. The target is defined as f(x, y) = x + y. Figure 1(a) illustrates the training dataset colored with the target values. For evaluation, we take the same 10K points without restricting them to D. For KGB, we fix ϵ = 0.3, T0 = 100, T1 = 900, σ = 10 2, δ = 10 4 β = 0.1, m = 4, n = 64, and sampled 100 KGB models. Figure 1(b) shows the estimated by Monte-Carlo posterior mean eµ. On Figure 1(c), we show log eσ, where eσ2 is the posterior variance estimated by Monte-Carlo. One can see that the posterior variance is small in-domain and grows when we move outside the dataset D, as desired. Experiment on real datasets Uncertainty estimates for GBDTs have been previously analyzed by Malinin et al. (2021). Our experiments on real datasets closely follow their setup, and we compare the proposed KGB with SGB, SGLB, and their ensembles. For the experiments, we use several standard regression datasets (Gal & Ghahramani, 2016). The implementation details can be found in Appendix H. The code of our experiments can be found on Git Hub.7 7https://github.com/Take Over/Gradient-Boosting-Performs-Gaussian-Process-Inference Published as a conference paper at ICLR 2023 Table 1: Predictive performance, RMSE Dataset Single Ensemble SGB SGLB KGB SGB SGLB KGB Boston 3.06 3.12 2.81 3.04 3.10 2.82 Concrete 5.21 5.11 4.36 5.21 5.10 4.30 Energy 0.57 0.54 0.33 0.57 0.54 0.33 Kin8nm 0.14 0.14 0.11 0.14 0.14 0.10 Naval 0.00 0.00 0.00 0.00 0.00 0.00 Power 3.55 3.56 3.48 3.52 3.54 3.43 Protein 3.99 3.99 3.79 3.99 3.99 3.76 Wine 0.63 0.63 0.61 0.63 0.63 0.60 Yacht 0.82 0.84 0.52 0.83 0.84 0.50 Year 8.99 8.96 8.97 8.97 8.93 8.94 Table 2: Error and OOD detection Dataset PRR AUC SGB SGLB KGB SGB SGLB KGB Boston 36 37 43 80 80 88 Concrete 29 29 37 92 92 93 Energy 36 31 60 100 100 99 Kin8nm 18 19 20 45 45 41 Naval 55 56 35 100 100 100 Power 8 9 31 72 73 76 Protein 30 29 35 99 99 100 Wine 25 19 37 74 72 87 Yacht 74 78 86 62 60 69 Year 30 30 32 67 57 71 We note that in our setup, we cannot compute likelihoods as kernel K is defined implicitly, and its evaluation requires summing up among all possible trees structures number of which grows as (nd)m which is unfeasible, not to mention the requirement to inverse the kernel which requires O(N 2+ω) operations which additionally rules out the applicability of classical Gaussian Processes methods with our kernel. Therefore, a typical Bayesian setup is not applicable, and we resort to the uncertainty estimation setup described in Malinin et al. (2021). Also, the intractability of the kernel does not allow us to treat σ, δ in a fully Bayesian way, as it will require estimating the likelihood. Therefore, we fix them as constants, but we note that this will not affect the evaluation metrics for our setup as they are scale and translation invariant. 
First, we compare KGB with SGLB since they both sample from similar posterior distributions. Thus, this comparison allows us to find out which of the algorithms does a better sampling from the posterior and thus provides us with more reliable estimates of knowledge uncertainty. Moreover, we consider the SGB approach as the most straightforward way to generate an ensemble of models. In Table 1, we compare the predictive performance of the methods. Interestingly, we obtain improvements on almost all the datasets. Here we perform cross-validation to estimate statistical significance with paired t-test and highlight the approaches that are insignificantly different from the best one (p-value > 0.05). Then, we check whether uncertainty measured as the variance of the model s predictions can be used to detect errors and out-of-domain inputs. Detecting errors can be evaluated via the Prediction-Rejection Ratio (PRR) (Malinin, 2019; Malinin et al., 2020). PRR measures how well uncertainty estimates correlate with errors and rank-order them. Out-of-domain (OOD) detection is usually assessed via the area under the ROC curve (AUC-ROC) for the binary task of detecting whether a sample is OOD (Hendrycks & Gimpel, 2017). For this, one needs an OOD test set. We use the same OOD test sets (sampled from other datasets) as Malinin et al. (2021). The results of this experiment are given in Table 2. We can see that the proposed method significantly outperforms the baselines for out-of-domain detection. These improvements can be explained by the theoretical soundness of KGB: convergence properties are theoretically grounded and non-asymptotic. In contrast, for SGB, there are no general results applicable in our setting, while for SGLB the guarantees are asymptotic. In summary, these results show that our approach is superior to SGB and SGLB, achieving smaller values of RMSE and having better knowledge uncertainty estimates. 6 CONCLUSION This paper theoretically analyses the classic gradient boosting algorithm. In particular, we show that GBDT converges to the solution of a certain Kernel Ridge Regression problem. We also introduce a simple modification of the classic algorithm allowing one to sample from the Gaussian posterior. The proposed method gives much better knowledge uncertainty estimates than the existing approaches. We highlight the following important directions for future research. First, to explore how one can control the kernel and use it for better knowledge uncertainty estimates. Also, we do not analyze generalization in the current work, which is another important research topic. Finally, we need to establish universal approximation property which further justifies need for functional formalism. Published as a conference paper at ICLR 2023 Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning, pp. 242 252. PMLR, 2019. Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of Statistics, 47(2):1148 1178, 2019. Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. The Mondrian kernel. In 32nd Conference on Uncertainty in Artificial Intelligence (UAI), 2016. Christopher JC Burges. From Rank Net to Lambda Rank to Lambda MART: An overview. Learning, 11(23-581):81, 2010. Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. 
In Proceedings of the 23rd International Conference on Machine Learning, pp. 161 168. ACM, 2006. Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, volume 22, 2009. Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189 1232, 2001. Yarin Gal. Uncertainty in Deep Learning. Ph D thesis, University of Cambridge, 2016. Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proc. 33rd International Conference on Machine Learning, 2016. Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. In Advances in Neural Information Processing Systems 34 (Neur IPS 2021), 2021. Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017. Bulat Ibragimov and Gleb Gusev. Minimal variance sampling in stochastic gradient boosting. Advances in Neural Information Processing Systems, 32, 2019. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018. Ryuichi Kanoh and Mahito Sugiyama. A neural tangent kernel perspective of infinite tree ensembles. In International Conference on Learning Representations, 2022. Liran Katzir, Gal Elidan, and Ran El-Yaniv. Net-DNF: Effective deep modeling of tabular data. In International Conference on Learning Representations, 2021. B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proc. Conference on Neural Information Processing Systems, 2017. Jaehoon Lee, Jascha Sohl-dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2020. Published as a conference paper at ICLR 2023 Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018. David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1969. Andrey Malinin. Uncertainty Estimation in Deep Learning with application to Spoken Language Assessment. Ph D thesis, University of Cambridge, 2019. Andrey Malinin, Bruno Mlodozeniec, and Mark JF Gales. Ensemble distribution distillation. In International Conference on Learning Representations, 2020. Andrey Malinin, Liudmila Prokhorenkova, and Aleksei Ustimenko. Uncertainty in gradient boosting via ensembles. In International Conference on Learning Representations, 2021. AG de G Matthews, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Sample-then-optimize posterior sampling for bayesian linear models. In Neur IPS Workshop on Advances in Approximate Bayesian Inference, 2017. 
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Cat Boost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6638 6648, 2018. C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT press, 2006. Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the clickthrough rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pp. 521 530. ACM, 2007. Byron P Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon Mc Gregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2):577 584, 2005. Aleksei Ustimenko and Liudmila Prokhorenkova. SGLB: Stochastic Gradient Langevin Boosting. In International Conference on Machine Learning, pp. 10487 10496. PMLR, 2021. Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254 270, 2010. Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ar Xiv preprint ar Xiv:1902.04760, 2019. Yanru Zhang and Ali Haghani. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58:308 324, 2015. Difan Zou, Pan Xu, and Quanquan Gu. Faster convergence of stochastic gradient langevin dynamics for non-log-concave sampling. In Uncertainty in Artificial Intelligence, pp. 1152 1162. PMLR, 2021. Published as a conference paper at ICLR 2023 A NOTATION USED IN THE PAPER For convenience, let us list some frequently used notation: X Rd feature space; d dimension of feature vectors; Y R target space; ρ distribution of features; N number of samples; z = (x N, y N) dataset; V set of all possible tree structures; Lν : V N number of leaves for ν V; D(ν, r) score used to choose a split (3); S indices of all possible splits; n number of borders in our implementation of Sample Tree; m depth of the tree in our implementation of Sample Tree; β random strength; ϵ learning rate; λ regularization; F space of all possible ensembles of trees from V; φν : X {0, 1}Lν tree structure; φ(j) ν indicator of j-th leaf; V (f) empirical error of a model f relative to the best possible f F; kν( , ) single tree kernel; K( , ) stationary kernel of the gradient boosting; p( |f, β) distribution of trees, f F; π( ) = limβ p( |f, β) = p( |f , β) stationary distribution of trees; σ kernel scale. The following diagram illustrates the kernels introduced in our paper: Weak learner kernel kντ : X X R+ Σντ : F F Σfτ : F F Iteration kernel Kfτ : X X R+ Stationary Kernel K : X X R+ acts as f7 kντ f(x N ) expected value expected valuefτ used in dot product B ADDITIONAL RELATED WORK Let us briefly discuss some additional related work. Mondrian forest method (Balog et al., 2016) and Generalized Random Forests (Athey et al., 2019), besides having links to the kernel methods, in fact, have the underlying limiting RKHS space that is much smaller than the space of all possible ensembles built on the same weak learners due to the independence of the trees that are added to the ensemble. 
Therefore, there is an issue of high bias when comparing plain gradient boosting with the plain random forest method. Also, these two methods are built from scratch to obtain an RKHS Published as a conference paper at ICLR 2023 interpretation while we provide a link between the existing standard gradient boosting approaches to the kernel methods, i.e., we do not create a novel gradient boosting algorithm but rather show that the existing ones already have such a link to derive convergence rates and to exploit such linkage to obtain formal Gaussian process interpretation of the gradient boosting learning to get uncertainty estimates using well-established gradient boosting libraries. Let us mention that there are approaches that study kernels induced by tree ensembles through the perspective of Neural Tangent Kernel (Kanoh & Sugiyama, 2022), though this analysis is not applicable for classical gradient boosting, while ours is. Let us also briefly discuss the papers on Neural Tangent Kernel, e.g., Jacot et al. (2018); Li & Liang (2018); Allen-Zhu et al. (2019); Du et al. (2019), that study deep learning convergence through the perspective of kernel methods. Though such works share similarities with what we do, there are fundamental differences. First, our work is not in the over-parametrization regime, i.e., our kernel method correspondence works for tree ensembles with fixed parameters, but the correspondence is achieved as the number of iterations goes to infinity. It is worth noting that the kernel method perspective on deep learning basically establishes that each trained deep learning model is a sample from Gaussian Process posterior (Lee et al., 2020; 2018; Yang, 2019; Cho & Saul, 2009), i.e., is sample-then-optimize. For boosting, we achieve this only by introducing Algorithm 4 relying on Algorithm 3, which in its essence, is random initialization for gradient boosting. The classic gradient boosting (Algorithm 2) can be considered as the mean value of the Gaussian Process, which has no analogs in the world of deep learning, and to achieve convergence to posterior mean there, one needs to average among many trained models. This can be considered as an advantage of gradient boosting over deep learning that we derive in our paper. C CONVEX OPTIMIZATION IN FUNCTIONAL SPACES In this section, we formulate basic definitions of differentiability in functional spaces and the theorem on the convergence of gradient descent in functional spaces. For the proof of the theorem and further details on convex optimization in functional space, the reader can consult Luenberger (1969). We consider H to be a Hilbert space with some scalar product , H. Definition C.1. We say that F : H R is Fr echet differentiable if for any f H there exists a bounded linear functional Lf : H R such that h H F(f + h) = F(f) + Lf[h] + o( h ) . The value of Lf : H R is denoted by Df F(f) and is called a Fr echet differential of F at point f. So, Fr echet differential is a functional Df F : H B(H, R), where B(X, Y ) denotes a normed space of linear bounded functionals from X to Y . Definition C.2. Let F : H R be Fr echet differentiable with a Fr echet differential Df F(f) that is a bounded linear functional. Then, by the Riesz theorem there exists a unique hf such that Df F(f) [h] = hf, h H h H. We call such element a gradient of F in H at f H and denote it by HF(f) = hf H. Definition C.3. 
F : H R is said to be twice Fr echet differentiable if Df F is Fr echet differentiable, where the definition of Fr echet differential is analogous to Definition C.1 with the only difference that Df F takes values in B(H, R). The second Fr echet differential is denoted by D2 f F : H B(H, B(H, R)). As there is an isomorphism between B(H, B(H, R)) and B(H H, R), we can consider the second Fr echet differential to take values in B(H H, R). Henceforth, we will not differentiate between B(H, B(H, R)) and B(H H, R). Definition C.4. A linear operator P : H H R is said to be semi-positive definite (denoted by P 0) if f H we have P(f, f) 0. P is said to be positive definite (P 0) if f H \ {0} we have P(f, f) > 0. Definition C.5. Given two linear operators P, G : H H R we write P G if P G 0 and P G if P G 0. Let I B(H H, R) be a linear operator defined as I(g, h) = (g, h)H. Theorem C.6. Let F be bounded below and twice Fr echet differentiable functional on a Hilbert space H. Assume that D2 f F(f) satisfies 0 m I D2 f F(f) µI. Then the gradient descent Published as a conference paper at ICLR 2023 scheme: fk+1 = fk ϵ HF(fk) converges to f that minimizes F. Proof. For the proof see Luenberger (1969). D KERNEL RIDGE REGRESSION AND RKHS Definition D.1. K : X X R is called a kernel function if it is positive semi-definite, i.e., N N+ x N X N : K(x N, x N) 0. Definition D.2. For any kernel function we can define a Reproducing Kernel Hilbert Space (RKHS) H(K) = span K( , x) x X with a scalar product such that f, K( , x) H(K) = f(x) . Consider the following Kernel Ridge Regression problem: V (f, λ) = 1 2N y N f(x N) 2 RN + λ 2N f 2 H(K) min f H(K) V (f, λ) min f H(K) and the following Kernel Ridgeless Regression problem: V (f) = 1 2N y N f(x N) 2 RN min f H(K) V (f) min f H(K) . Lemma D.3. min H(K) V (f, λ) has the only solution f λ = K( , x N)(K(x N, x N) + λI) 1y N . Proof. First, let us show that f λ span K( , xi) . Let H(K) = span K( , xi) span K( , xi) and consider the projector P : H(K) H(K) onto the space span K( , xi) . It is easy to show that P(f)(x N) = f(x N) for any f H(K). Indeed, (f P(f))[x N] = f P(f), K( , x N) = 0 . If f λ does not lie in span K( , xi) , then f λ H(K) > P(f λ ) H(K) and V (P(f λ ), λ) < V (f λ , λ). We get a contradiction with the minimality of f λ . Now, let us prove the existence of f λ . Consider f = K( , x N)c, where c RN. Then, we find the optimal c by taking a derivative of V (f, λ) with respect to c and equating it to zero: K(x N, x N)(K(x N, x N)c y N) + λK(x N, x N)c = 0 . Then, cv = (K(x N, x N) + λI) 1(y N + v) , where v ker K(x N, x N). Note that all K( , x N)cv are equal. Then, we have the only solution of the KRR problem: f λ = K( , x N)(K(x N, x N) + λI) 1y N . Lemma D.4. min H(K) V (f) has the only solution in span K( , xi) and it is the solution of minimum RKHS norm: f = K( , x N)K(x N, x N) y N . Published as a conference paper at ICLR 2023 Proof. Consider f = K( , x N)c, where c RN. Now consider V (f) and differentiate it with respect to c. If we equate the derivative to zero, we get: 1 N K(x N, x N)(K(x N, x N)c y N) = 0 . Then, K(x N, x N)c (y N +v) = 0 for some v ker K(x N, x N). Note that y N +ker K(x N, x N) Im K(x N, x N) = . Then, for any v such that y N +v Im K(x N, x N) there exists a solution cv = K(x N, x N) (y N + v) + ker K(x N, x N). This follows from the fact that K(x N, x N)K(x N, x N) is an orthoprojector onto Im K(x N, x N). Note that f = K( , x N)(K(x N, x N) (y N + v) + ker K(x N, x N)) = K( , x N)K(x N, x N) y N. 
Then, the existence and uniqueness of f follow. Now, consider a linear space F L2(ρ) of all possible ensembles of trees from V: F = span φ(j) ν ( ) : X {0, 1} ν V, j {1, . . . , Lν} . Define the unique function: {f } = lim λ 0+ arg min f F V (f, λ) arg min f F V (f). Then following two lemmas hold: Lemma D.5. y N f (x N), f(x N) RN = 0 for any f F. Proof. Assume that y N f (x N), f(x N) RN = 0 for some f F. Then, for some f F, y N f (x N), f(x N) RN > 0. We have: y N (f + αf)(x N) 2 RN = y N f (x N) 2 RN 2α y N f (x N), f(x N) RN + α2 f(x N) 2 RN < y N f (x N) 2 RN for small enough α > 0, which contradicts with the definition of f . V (f, λ) = 1 2N f (x N) f(x N) 2 RN + λ 2N f 2 H Cλ where Cλ = inff F L(f, λ) inff F L(f) 0. Proof. We need to prove it only for V (f) without regularization as for regularized it follows immediately (note that f is minimizer of not regularized objective). By definition, V (f) = 1 2N y N f(x N) 2 RN 1 2N y N f (x N) 2 RN . Now, let us prove that y N f(x N) 2 RN y N f (x N) 2 RN f (x N) f(x N) 2 RN = 0 . y N f(x N) 2 RN y N f (x N) 2 RN f (x N) f(x N) 2 RN = 2 f (x N), f (x N) RN 2 y N, f(x N) f (x N) RN + 2 f (x N), f(x N) RN = 2 y N f (x N), f(x N) f (x N) RN = 0 , where the last equality follows from the previous lemma. Lemma D.7. V (f, λ) = 1 2N f λ (x N) f(x N) 2 RN + λ 2N f λ f 2 H. Published as a conference paper at ICLR 2023 Proof. f λ is optimum for V (f, λ). Then Fr echet derivative at f λ equals 0: Df V (f λ , λ) = 0. Consider then writing: V (f, λ) = V (f λ , λ) + Df V (f λ , λ)[f f λ ] + 1 2D2 f V (f λ , λ)[f f λ , f f λ ] = 1 2N f λ (x N) f(x N) 2 RN + λ 2N f λ (x N) f(x N) 2 RN . The explicit formula for the Fr echet Derivative of V (f, λ) can be found in Appendix E. E GAUSSIAN PROCESS INFERENCE In this section, we prove Lemma 2.1 from Section 2.3 of the main text. Firstly, consider the following regularized error functional: V (f, λ) = 1 2N f(xi) yi 2 + λ 2N f 2 H min f H(K) f(xi) yi 2 + λ With this functional we can consider the following optimization problem: min f H(K) V (f, λ), which is called as Kernel Ridge Regression. We will show that this functional satisfies the conditions needed for Theorem C.6. We will also deduce the formula of the gradient of V in order to show that gradient descent takes the form (2). Lemma E.1. V (f, λ) is Fr echet differentiable with the differential given by: Df V (f, λ) = λ N f, H(K) + 1 f(xi) yi evxi , where evxi : H(K) R is a bounded linear functional such that evxi(f) = f(xi) = (f, K(xi, ))H(K).8 Proof. As Fr echet differential is linear, we only need to find Fr echet differential for (f(xi) yi)2. Note that (f(xi) yi)2 is a composition of two functions: F : H(K) R, F = evxi yi , G : R R, G(x) = x2 . The differential of the composition can be found as: Df G(F(f)) = x G(F(f))Df F(f) , Df G(F(f)) = 2(f(xi) yi)evxi , where Df F(f) = evxi because evxi(f + h) yi = evxi(f) yi + evxi(h) . Lemma E.2. The gradient of V (f, λ), Riesz representative of the functional above, is given by: f V (f, λ) = λ i=1 (f(xi) yi)Kxi . 8We further use the notation Kxi := K( , xi). Published as a conference paper at ICLR 2023 Proof. Follows from the previous lemma. Lemma E.3. V (f, λ) is twice Fr echet differentiable with the differential given by: D2 f V : H(K) B(H(K), B(H(K), R)) , D2 f V (f, λ)[h] = λ N h, H(K) + 1 i=1 h(xi)evxi . Proof. Due to the linearity of Fr echet differential and lemma E.1 we need to find only Fr echet differential for (f(xi) yi)evxi. Consider S(f) = (f(xi) yi)evxi. 
Then we need to find Vf B(H(K), B(H(K), R)) such that S(f + h) = S(f) + Vf[h] + o( h ). It is easy to show that h 7 h(xi)evxi B(H(K), B(H(K), R)) and S(f +h) = S(f)+h(xi)evxi. Thus, we get that Df S(f)[h] = h(xi)evxi. From this the statement of the lemma follows. Given all the above lemmas, as a corollary of Theorem C.6, we have the following. Corollary E.4. Gradient descent, defined by the following iterative scheme fτ+1 = 1 λϵ i=1 (fτ(xi) yi)Kxi , f0 = 0L2(ρ) converges to the optimum of V (f, λ). Thus, f λ = lim τ fτ = K( , x N) K(x N, x N) + λIN 1 y N . (6) Proof. By Lemma E.2, our update rule has the form fτ+1 = fτ ϵ HV (fτ, λ). Then, we will find m, µ such that 0 m I D2 f V (f, λ) µI. By Lemma E.3, D2 f V (f, λ)[g, h] = λ N g, h H(K) + 1 N g(x N)T h(x N) . Then, we can take m = λ N . Let us also write D2 f V (f, λ)[g, g] = λ N g 2 H(K) + 1 N g(x N) 2 = λ N g 2 H(K) + 1 N g, K( , x N) H(K) 2 ( λ N max x X K(x, x)) g 2 H(K). Then, we can take µ = λ N + 1 N maxx X K(x, x). By theorem C.6 and lemma D.3 the corollary follows. Lemma E.5. Consider the gradient descent: fτ+1 = 1 λϵ i=1 (fτ(xi) yi)Kxi , f0 = 0L2(ρ) , f = lim τ fτ and the following randomization scheme: 1. sample f init GP(0L2(ρ), σ2K + δ2Id L2); 2. set new labels ynew N = y N f init(x N); 3. fit GD fτ( ) on ynew N assuming f0( ) = 0L2(ρ); 4. output ˆf( ) = f init( ) + f ( ) as final model. Published as a conference paper at ICLR 2023 Then, ˆf from the scheme above follows the Gaussian Process posterior with the following mean and covariance: E ˆf(x) = K(x, x N) K(x N, x N) + λIN 1 y N , cov( ˆf(x)) = δ2 + σ2 K(x, x) K(x, x N) K(x N, x N) + λIN 1 K(x N, x) . f = K( , x N) K(x N, x N)+λIN 1 ynew N = K( , x N) K(x N, x N)+λIN 1 (y N f init(x N)) . Let us find the distribution of ˆf at x Rn. It can be easily seen that: E ˆf(x) = K(x, x N) K(x N, x N) + λIN 1 y N . Let us now calculate covariance: cov ˆf(x) = E( ˆf(x) E ˆf(x))( ˆf(x) E ˆf(x))T = E f init(x) K(x, x N) K(x N, x N) + λIN 1 f init(x N) f init(x) K(x, x N) K(x N, x N) + λIN 1 f init(x N) T = Ef init(x)f init(x)T Ef init(x)f init(x N)T K(x N, x N) + λIN 1 K(x N, x) K(x, x N) K(x N, x N) + λIN 1 Ef init(x N)f init(x)T + K(x, x N) K(x N, x N) + λIN 1 Ef init(x N)f init(x N)T K(x N, x N) + λIN 1 K(x N, x) = δ2 + σ2 K(x, x) 2K(x, x N)(K(x N, x N) + λIN) 1K(x N, x) + σ2K(x, x N)(K(x N, x N) + λIN) 1K(x N, x) = δ2 + σ2 K(x, x) K(x, x N)(K(x N, x N) + λIN) 1K(x N, x) , which is exactly what we need. F DISTRIBUTION OF TREES Lemma F.1 (Lemma 3.2 in the main text). p(ν|f, β) = X s S\νς,i 1 e D((νς,i 1,s),r) where the sum is over all permutations ς Pm, νς,i = (sς(1), . . . , sς(i)), and ν = (s1, . . . , sm). Proof. Let us fix some permutation ς Pm. W.l.o.g., let ς = id Pm, i.e. ς(i) = i i. It remains to derive the formula for the fixed permutation. The probability of adding the next split given the previously build tree is: P(νi 1 si|νi 1) = e 1 β D(νi,r) P s S\νi 1 e 1 β D((νi 1,s),r) , which comes from (4) and the Gumbel-Soft Max trick. Then, we decompose the probability P(ν) of a tree as: i=1 P(νi 1 si|νi 1) , Published as a conference paper at ICLR 2023 and so for the fixed permutation we have e 1 β D(νi,r) P s S\νi 1 e 1 β D((νi 1,s),r) . Then we sum over all permutations and the lemma follows. Now, let us define the following value indicating how different are the distribution of trees for f and f : Γβ(f) = max max ν V Lemma F.2. The following bound relates the distributions. Proof. 
Consider π = p( |f , β) and the following expression P(ν, ς): e 1 β D(νς,i,r) P s S\νς,i 1 e 1 β D((νς,i 1,s),r) . ς Pm P(ν, ς) X β D(ν,r) m Y s S\νς,i 1 1 e where in second inequality we used D(ν, r) 2V (f) which straightly follows from the definition. By noting that the probabilities remain the same if we shift D( , r) D( , r) 2V (f) which becomes everywhere non-positive and allows us to do the above trick once more but in reverse manner: if we formally replace the D with such modified function and repeat the steps with reversing the inequalities which is needed since the new function is everywhere negative then the lemma follows. X ς Pm P(ν, ς) X ς Pm e Pm i=1 1 β D(νς,i,r) m β 2V (f) m Y 1 P s S\νi 1 1 e 2m V (f) G PROOF OF THEOREM 3.8 G.1 RKHS STRUCTURE In section 3.4 we defined RKHS structure on F as: f, K( , x) H(K) = f(x) and we introduced the kernels kν, Kf, Kπ. Let us also define a kernel Kp( , ) = P ν V kν( , )p(ν) for arbitrary distribution p on V. This way, taking p as δν( ), p(ν | f, β), π( ) we get Kp equal to kν, Kf, K, respectively. For each kernel, we define the operator associated with it denoted similarly: X Kp( , x)f(x)ρ(dx). Lemma G.1. Consider two positive semidefinite operators on a finite dimensional vector space V : A : V V and B : V V such that A B. Then, Im A Im B. Lemma G.2. Kp : F F is invertible for p non-vanishing on V. Published as a conference paper at ICLR 2023 Proof. Note that Im kν = span {φj ν | j = 1, . . . , Lν} and p(ν)kν Kp. Then, Im Kp = F follows from lemma G.1. Thus, Kp is invertible. Lemma G.3. F = span Kp( , x) | x X for p non-vanishing on V. Proof. From lemma G.2, Im Kp = F. From the definition of the operator, Im Kp span Kp( , x) | x X . Then, the lemma follows. Lemma G.4. For non-vanishing distribution p on V Kpf, g H(Kp) = f, g L2(ρ). Proof. It is sufficient to check on the basis. So, we take g = Kp( , x). Then, Kpf, Kp( , x) H(Kp) = Kp[f](x) = f, Kp( , x) L2(ρ), where the second equality holds by the definition of the operator Kp. For a weak learner ν, we define a covariance operator: N kν( , x N)f(x N), Σν : H H . Also, for an arbitrary probability distribution p over V, we denote Σp = P ν V Σνp(ν). These operators are typically referred to as covariance operators. Let us formulate and prove several lemmas about the RKHS structure and operators Σ, Σf, Σν. Lemma G.5. (Courant-Fischer) Let A be an n n real symmetric matrix and λ1 . . . λn its eigenvalues. Then, λk = min dim U=k max x U RA(x), λk = max dim U=n k+1 min x U RA(x), where RA(x) = Ax,x x,x is the Rayleigh-Ritz quotient. Lemma G.6. ρ(kν(x N, x N)) = kν(x N, x N) = N for any ν V. Proof. It is easy to see that kν(x N, x N) = Lν i=1w(i) ν 1N iν Niν, where 1n n is a matrix of size n n consisting of ones. Then, we note that 1n n = n and now the statement of the lemma follows. Lemma G.7. (Covariation majorization) The following operator inequality holds for probability distributions p, p over V, where p is arbitrary and p does not vanish at any ν V: λmax(Σp) 1, λmin(Σp ) 1 Proof. Consider the following operators: A : F Rn, f 7 f(x N), B : Rn F, v 7 Kp( , x N)v. Then, Σp = 1 N BA and KN = 1 N Kp(x N, x N) = 1 N AB. As AB and BA have the same spectra, we further study the spectrum of KN. We have KN = 1 N Kp(x N, x N) = P ν V 1 N kν(x N, x N)p(ν) and λmax(KN) 1 follows from lemma G.6. Then, λmax(Σp) 1 follows. Published as a conference paper at ICLR 2023 Now we need to show that λmin(Σp ) 1 N . 
Consider the following formula: i=1 Kp ( , xi) H(Kp ) Kp ( , xi) , i=1 Kp (xi, xi) Kp ( , xi) p Kp (xi, xi) H(Kp ) Kp ( , xi) p Kp (xi, xi) where (a H(Kp ) b)[c] = b, c H(Kp )a. If a = b and a H(Kp ) = 1, then 1 and 0 are the only eigenvalues of a H(Kp ) a. Denote by S = span{Kp ( , xi) | i = 1, . . . , N} H(Kp ) and m = dim S, n = dim H(Kp ). Then, λmin(Σp ) = λn m+1(Σp ) = min dim U=n m+1 max x U RΣp (x), where RΣp (x) = (Σp x,x)H(Kp ) (x,x)H(Kp ) . As dim U = n m + 1, then U S = . Suppose Kp ( , xi) max x U RΣp (x) RΣp ( Kp ( , xi) p Kp (xi, xi) ) * Kp (xi, xi) where (*) is fulfilled as a H(Kp ) a is a positive semidefinite operator and the last inequality follows from Kp (x, x) 1 x X. G.2 NORM MAJORIZATION The following lemmas relate the norms L2, H, RN with respect to each other.9 Indeed, by these lemmas we can consider the bound L2 H RN . Further, in the main theorems we will use these relations extensively. Corollary G.8. f(x N) f H for f span{K( , xi) | i = 1, . . . , N}. 1 N f(x N) 2 = Σf, f H 1 N I on span{K( , xi) | i = 1, . . . , N}. Lemma G.9. λmax(K) maxx X K(x, x). Proof. Consider K as an operator on (F, L2(ρ)). We will prove that K B((F,L2(ρ))) maxx X K(x, x) and the lemma will follow. Consider the inequality K[f](x) = Kx, f L2 Kx L2 f L2. Then, K[f] L2 maxx X Kx L2 f L2. Note also that K(x, x ) min(K(x, x), K(x , x )) which can be easily seen from the definition. Then, Kx L2 K(x, x) and the lemma follows. Corollary G.10. (Expected squared norm majorization by RKHS norm) The following bound holds f H: f 2 L2(ρ) max x X K(x, x) f 2 H. Proof. We have λmax(K) f 2 H Kf, f H = f, f L2 = f 2 L2. Then, from the previous lemma the bound holds. 9Note that RN indeed becomes a norm once we restrict our space to span{K( , xi) | i = 1, . . . , N}. Published as a conference paper at ICLR 2023 G.3 SYMMETRY OF OPERATORS In this section, we establish symmetry of various operators with respect to the norms L2, H, RN. These results are mainly required to claim that the spectral radii of these operators coincide with their operator norms in various spaces: B(F, L2), B(H), B(span{K( , xi) | i = 1, . . . , N}, RN ). Though, we use symmetry of operators not only this way. Lemma G.11. (Universal symmetry of covariance operators in H) The operator Σp for any p is symmetric w.r.t. the dot product of H(Kp ) for any non-singular p . Proof. First, let us prove the statement for non-singular distribution p. To see that, we consider the following quantity: Σpf, g H(Kp) = 1 i=1 Kp( , xi), f H(Kp) Kp( , xi), g H(Kp). Then, we use the following trick: Kp( , xi), f H(Kp) = Kp K 1 p Kp( , xi), f H(Kp ). It allows us to rewrite: Σpf, g H(Kp) = 1 i=1 Kp K 1 p Kp( , xi), f H(Kp ) Kp K 1 p Kp( , xi), g H(Kp ). From this, it immediately follows: Kp K 1 p Kp( , xi) H(Kp ) Kp K 1 p Kp( , xi) . This shows that Σp is indeed symmetric w.r.t. the dot product of H(Kp ). Finally, we can use the continuity argument, which we can use due to intrinsic finite dimension, to conclude that symmetry must hold for arbitrary distributions, in particular for p = δν which corresponds to Σν. Lemma G.12. (Universal symmetry of covariance operators in L2) The operator Σp for any p is symmetric w.r.t. the dot product of L2. Proof. We consider similarly the following quantity: Σpf, g H(Kp) = 1 i=1 Kp( , xi), f H(Kp) Kp( , xi), g H(Kp). Then, we use the following trick Kp( , xi), f H(Kp) = K 1 p Kp( , xi), f L2. It allows us to rewrite: Σpf, g H(Kp) = 1 i=1 K 1 p Kp( , xi), f L2 K 1 p Kp( , xi), g L2. 
Lemma G.12 (Universal symmetry of covariance operators in $L_2$). The operator $\Sigma_p$ for any $p$ is symmetric w.r.t. the dot product of $L_2$.

Proof. We similarly consider the following quantity:
\[
\langle \Sigma_p f, g\rangle_{H(K_p)} = \frac{1}{N} \sum_{i=1}^{N} \langle K_p(\cdot, x_i), f\rangle_{H(K_p)}\, \langle K_p(\cdot, x_i), g\rangle_{H(K_p)}.
\]
Then, we use the following trick: $\langle K_p(\cdot, x_i), f\rangle_{H(K_p)} = \langle K_p^{-1} K_p(\cdot, x_i), f\rangle_{L_2}$. It allows us to rewrite:
\[
\langle \Sigma_p f, g\rangle_{H(K_p)} = \frac{1}{N} \sum_{i=1}^{N} \langle K_p^{-1} K_p(\cdot, x_i), f\rangle_{L_2}\, \langle K_p^{-1} K_p(\cdot, x_i), g\rangle_{L_2}.
\]
From this, it immediately follows that
\[
\Sigma_p = \frac{1}{N} \sum_{i=1}^{N} \big(K_p^{-1} K_p(\cdot, x_i)\big) \otimes_{L_2} \big(K_p^{-1} K_p(\cdot, x_i)\big),
\]
which shows that $\Sigma_p$ is indeed symmetric w.r.t. the dot product of $L_2$.

Lemma G.13 (Universal symmetry of kernel operators in $H$). The operator $K_p$ for any $p$ is symmetric w.r.t. the dot product of $H$.

Proof. We start by decomposing $K_p$ as:
\[
\langle K_p f, g\rangle_{L_2} = \int_{X \times X} K_p(x, y)\, f(x)\, g(y)\, \rho(dx)\, \rho(dy).
\]
Then, we use the following trick: $f(x) = \langle K(\cdot, x), f\rangle_H$. It allows us to rewrite:
\[
\langle K_p f, g\rangle_{L_2} = \int_{X \times X} K_p(x, y)\, \langle K(\cdot, x), f\rangle_H\, \langle K(\cdot, y), g\rangle_H\, \rho(dx)\, \rho(dy).
\]
From this, it immediately follows that
\[
K_p = \int_{X \times X} K_p(x, y)\, K(\cdot, x) \otimes_H K(\cdot, y)\, \rho(dx)\, \rho(dy).
\]
This shows that $K_p$ is indeed symmetric w.r.t. the dot product of $H$, since both $K(\cdot, x) \otimes_H K(\cdot, y)$ and $K(\cdot, y) \otimes_H K(\cdot, x)$ are present with the same weight $K_p(x, y)\, \rho(dx)\, \rho(dy) = K_p(y, x)\, \rho(dy)\, \rho(dx)$.

G.4 ITERATIONS OF GRADIENT BOOSTING

Lemma G.14. For any $\nu \in V$, we have $k_\nu(\cdot, x_N)\big[y_N - f^*(x_N)\big] = 0$.

Proof. Follows from Lemma D.5.

Lemma G.15 (Lemma 3.7 in the main text). The iterations $f_\tau$ of gradient boosting (Algorithm 2) can be written in the form:
\[
f_{\tau+1} = \Big(1 - \frac{\lambda \epsilon}{N}\Big) f_\tau + \frac{\epsilon}{N}\, k_{\nu_\tau}(\cdot, x_N)\big[y_N - f_\tau(x_N)\big]
= \Big(1 - \frac{\lambda \epsilon}{N}\Big) f_\tau + \frac{\epsilon}{N}\, k_{\nu_\tau}(\cdot, x_N)\big[f^*(x_N) - f_\tau(x_N)\big], \qquad \nu_\tau \sim p(\nu \mid f_\tau, \beta).
\]

Proof. According to Algorithm 2:
\[
f_{\tau+1}(\cdot) = \Big(1 - \frac{\lambda \epsilon}{N}\Big) f_\tau(\cdot) + \epsilon\, \langle \varphi_{\nu_\tau}(\cdot), \theta_\tau\rangle_{\mathbb{R}^{L_{\nu_\tau}}}, \qquad
\theta_\tau^{(j)} = \frac{\sum_{i=1}^{N} \varphi^{(j)}_{\nu_\tau}(x_i)\, r^{(i)}_\tau}{\sum_{i=1}^{N} \varphi^{(j)}_{\nu_\tau}(x_i)},
\]
so that
\[
f_{\tau+1} = \Big(1 - \frac{\lambda \epsilon}{N}\Big) f_\tau + \frac{\epsilon}{N} \sum_{j=1}^{L_{\nu_\tau}} \omega^j_{\nu_\tau}\, \varphi^j_{\nu_\tau} \sum_{i:\ \varphi^j_{\nu_\tau}(x_i) = 1} r^i_\tau.
\]
Now note that $k_{\nu_\tau}(\cdot, x_i) = \omega^j_{\nu_\tau}\, \varphi^j_{\nu_\tau}(\cdot)$, where $j$ is such that $\varphi^j_{\nu_\tau}(x_i) = 1$. From this, the first form follows, and the second form is obtained via Lemma G.14.

From Lemmas G.15 and D.4, it is easy to show that $f_\tau, f^* \in \operatorname{span}\{K(\cdot, x_i) \mid i = 1, \ldots, N\}$. Hence, from here on we can use Corollary G.8 to bound the $\|\cdot\|_H$ norm by the $\|\cdot\|_{\mathbb{R}^N}$ norm.

Lemma G.16. The iterations of gradient boosting can be represented as:
\[
\epsilon^{-1}\, \mathbb{E}\big[f_\tau - f_{\tau+1} \mid f_\tau\big] = K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,
\]
where $D : F \to F$ is the bounded linear operator defined as the Riesz representative, with respect to the $L_2$ scalar product, of the bilinear form $\frac{1}{N} f(x_N)^{T} h(x_N) = \langle Df, h\rangle_{L_2}$. A similar decomposition holds for the gradient: $\nabla_H V(f, \lambda) = K D[f - f^*] + \frac{\lambda}{N} f$.

Proof. First, observe that $\mathbb{E}\big[f_{\tau+1} \mid f_\tau\big] = f_\tau - \epsilon\, \nabla V(f_\tau, \lambda)$, where the gradient is taken with respect to $H(K_{f_\tau})$. Keep in mind that in the definition of $V(f_\tau, \lambda)$, the norm in the regularizer term is taken with respect to $H(K_{f_\tau})$ instead of $H(K)$. Thus, we only need to prove that $\nabla_H V(f, \lambda) = K D[f - f^*] + \frac{\lambda}{N} f$. Consider the Fréchet differential
\[
D_f V(f)[h] = \frac{1}{N}\big(f(x_N) - f^*(x_N)\big)^{T} h(x_N) = \langle D[f - f^*], h\rangle_{L_2}.
\]
By Lemma G.4, we deduce $D_f V(f)[h] = \langle K D[f - f^*], h\rangle_H$, which implies $\nabla_H V(f) = K D[f - f^*]$, and the lemma follows.
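Before turning to the main lemmas, note that the kernel form of Lemma G.15 has a very concrete reading: since $\frac{1}{N} k_\nu(x_N, x_N)$ averages a vector over the leaves of $\nu$, each update is leaf-wise residual averaging followed by shrinkage. A minimal simulation restricted to the training sample, with trees drawn uniformly at random instead of from $p(\nu \mid f_\tau, \beta)$ (a simplification made only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, lam, eps, T = 100, 8, 1.0, 0.1, 500
y = rng.normal(size=N)

f = np.zeros(N)  # f_tau evaluated at the training points x_N
for _ in range(T):
    leaf = rng.integers(0, L, size=N)  # a freshly drawn random tree nu_tau
    resid = y - f
    # (1/N) k_nu(x_N, x_N) r equals per-leaf averages of r, by Lemma G.6's weights.
    leaf_mean = np.array([resid[leaf == j].mean() if (leaf == j).any() else 0.0
                          for j in range(L)])
    # The update of Lemma G.15: shrinkage plus leaf-wise residual averaging.
    f = (1 - lam * eps / N) * f + eps * leaf_mean[leaf]

print(np.mean((y - f) ** 2))  # training residuals shrink over the iterations
```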
G.5 MAIN LEMMAS

Lemma G.17. Let $A, B \in B(H, H)$ be two PSD operators such that $\xi B - A$ and $\xi A - B$ are PSD for some $\xi \in (1, \infty)$. Let $g, h \in H$ be two arbitrary vectors and $\lambda \in \mathbb{R}_{++}$ a constant. Then,
\[
\langle A[g] + \lambda \xi h,\; B[g] + \lambda h\rangle_H \;\ge\; \frac{1}{2\xi}\, \|A[g] + \lambda \xi h\|^2_H \;-\; \frac{\xi(\xi^2 - 1)\lambda^2}{2}\, \|h\|^2_H.
\]

Proof. Consider the following equality:
\[
\langle A[g] + \lambda \xi h,\; B[g] + \lambda h\rangle_H = \xi\, \langle \xi^{-1} A[g] + \lambda h,\; B[g] + \lambda h\rangle_H
= \frac{\xi}{2}\Big( \|\xi^{-1} A[g] + \lambda h\|^2_H + \|B[g] + \lambda h\|^2_H - \|(B - \xi^{-1} A)[g]\|^2_H \Big),
\]
which is basically the classical decomposition of the dot product $\langle x, y\rangle = \frac{1}{2}\big(\|x\|^2 + \|y\|^2 - \|x - y\|^2\big)$. Then, we note that $(1 - \xi^{-2}) B - (B - \xi^{-1} A) = \xi^{-2}(\xi A - B)$ is PSD by assumption, and since $B - \xi^{-1} A$ is also PSD, we obtain $0 \preceq B - \xi^{-1} A \preceq (1 - \xi^{-2}) B$, which implies:
\[
\langle A[g] + \lambda \xi h,\; B[g] + \lambda h\rangle_H \ge \frac{\xi}{2}\Big( \|\xi^{-1} A[g] + \lambda h\|^2_H + \|B[g] + \lambda h\|^2_H - (1 - \xi^{-2})^2\, \|B[g]\|^2_H \Big).
\]
Finally, note that $\frac{\xi^2 (1 - \xi^{-2})^2}{2 - \xi^{-2}} \le \xi^2 - 1$. Then, the result directly follows from the following identity:
\[
\|B[g] + \lambda h\|^2_H - (1 - \xi^{-2})^2\, \|B[g]\|^2_H = \xi^{-2}(2 - \xi^{-2})\, \Big\|B[g] + \frac{\lambda}{\xi^{-2}(2 - \xi^{-2})}\, h\Big\|^2_H - \lambda^2\, \frac{\xi^2 (1 - \xi^{-2})^2}{2 - \xi^{-2}}\, \|h\|^2_H.
\]

Denote $\kappa(A, B) = \|\mathrm{Id}_H - BA^{-1}\|_{B(H,H)} = \|(B - A) A^{-1}\|_{B(H,H)}$.

Lemma G.18. If $\xi^{-1} K \preceq K_f \preceq \xi K$ for $\xi > 1$, then $\kappa(K, K_f) \le \xi - 1$.

Proof. First, we note that both operators are symmetric positive semidefinite in $L_2$. Now, let us look at the Rayleigh quotient:
\[
\|(K - K_f) K^{-1}\|_{B(H,H)} = \max_{f \in F \setminus \{0\}} \frac{\|(K - K_f) K^{-1} f\|_H}{\|f\|_H} = \max_{f \in F \setminus \{0\}} \frac{\|K^{-\frac{1}{2}} (K - K_f) K^{-\frac{1}{2}} f\|_{L_2}}{\|f\|_{L_2}}.
\]
In the last equality, we used the fact that $K$ is symmetric positive definite (and therefore so is $K^{-\frac{1}{2}}$), which allows us to substitute $f \mapsto K^{\frac{1}{2}} f$ and to use the explicit formula for the dot product in $H$ via the product in $L_2$. Now we observe that
\[
-(\xi - 1)\, \mathrm{Id}_{L_2} \preceq K^{-\frac{1}{2}} (K - \xi K) K^{-\frac{1}{2}} \preceq K^{-\frac{1}{2}} (K - K_f) K^{-\frac{1}{2}} \preceq K^{-\frac{1}{2}} (K - \xi^{-1} K) K^{-\frac{1}{2}} = (1 - \xi^{-1})\, \mathrm{Id}_{L_2} \preceq (\xi - 1)\, \mathrm{Id}_{L_2},
\]
which implies that the spectral radius of $K^{-\frac{1}{2}} (K - K_f) K^{-\frac{1}{2}}$ is bounded by $\xi - 1$. Therefore, we obtain:
\[
\max_{f \in F \setminus \{0\}} \frac{\|K^{-\frac{1}{2}} (K - K_f) K^{-\frac{1}{2}} f\|_{L_2}}{\|f\|_{L_2}} = \rho\big(K^{-\frac{1}{2}} (K - K_f) K^{-\frac{1}{2}}\big) \le \xi - 1.
\]

Lemma G.19. Let $A, B \in B(H, H)$. Then, the following inequality holds:
\[
\langle A[g] + \lambda h,\; B[g] + \lambda h\rangle_H \;\ge\; \Big(\frac{1}{2} - \kappa(A, B)\Big) \|A[g] + \lambda h\|^2_H - \kappa^2(A, B)\, \lambda^2\, \|h\|^2_H.
\]

Proof. Let us rewrite the left part as
\[
\langle A[g] + \lambda h,\; B[g] + \lambda h\rangle_H
= \big\langle A[g] + \lambda h,\; \big(\mathrm{Id} - (\mathrm{Id} - BA^{-1})\big)(A[g] + \lambda h) + \lambda\, (\mathrm{Id} - BA^{-1})\, h\big\rangle_H
\]
\[
= \|A[g] + \lambda h\|^2_H - \langle A[g] + \lambda h,\; (\mathrm{Id} - BA^{-1})(A[g] + \lambda h)\rangle_H + \langle A[g] + \lambda h,\; \lambda\, (\mathrm{Id} - BA^{-1})\, h\rangle_H.
\]
For the middle term, we use
\[
\langle A[g] + \lambda h,\; (\mathrm{Id} - BA^{-1})(A[g] + \lambda h)\rangle_H \le \|A[g] + \lambda h\|_H\, \|(\mathrm{Id} - BA^{-1})(A[g] + \lambda h)\|_H \le \kappa(A, B)\, \|A[g] + \lambda h\|^2_H,
\]
and for the last term, the equality $\langle x, y\rangle = \frac{1}{2}\big(\|x\|^2 + \|y\|^2 - \|x - y\|^2\big)$ together with $\|\lambda (\mathrm{Id} - BA^{-1}) h\|_H \le \kappa(A, B)\, \lambda\, \|h\|_H$, to obtain
\[
\langle A[g] + \lambda h,\; B[g] + \lambda h\rangle_H \ge \big(1 - \kappa(A, B)\big) \|A[g] + \lambda h\|^2_H - \frac{1}{2} \|A[g] + \lambda h\|^2_H - \frac{\kappa^2(A, B)\, \lambda^2}{2} \|h\|^2_H
\ge \Big(\frac{1}{2} - \kappa(A, B)\Big) \|A[g] + \lambda h\|^2_H - \kappa^2(A, B)\, \lambda^2\, \|h\|^2_H.
\]

Lemma G.20. If $\big(\frac{\lambda}{N} + 1\big)\epsilon < 1$ and $f_0 = 0_H$, then for all $\tau$ the bound $\|f_\tau - f^*\| \le \|f^*\|$ holds almost surely for each of the norms $\|\cdot\|_{L_2}$, $\|\cdot\|_H$, and $\|\cdot\|_{\mathbb{R}^N}$.

Proof. Note that
\[
f_{\tau+1} - f^* = \Big(\mathrm{Id} - \frac{\lambda \epsilon}{N}\, \mathrm{Id} - \epsilon\, \Sigma_{\nu_\tau}\Big)\big(f_\tau - f^*\big) - \frac{\lambda \epsilon}{N}\, f^*.
\]
Now observe that $S := \mathrm{Id} - \frac{\lambda \epsilon}{N}\, \mathrm{Id} - \epsilon\, \Sigma_{\nu_\tau}$ is symmetric with eigenvalues $0 < \lambda_i(S) \le 1 - \frac{\lambda \epsilon}{N}$; therefore, its operator norm in $B(L_2)$, in $B(H)$, and w.r.t. $\|\cdot\|_{\mathbb{R}^N}$ is at most $1 - \frac{\lambda \epsilon}{N}$. Taking the norm of both sides and using the sub-additivity of the norm, we obtain:
\[
\|f_{\tau+1} - f^*\| \le \Big(1 - \frac{\lambda \epsilon}{N}\Big) \|f_\tau - f^*\| + \frac{\lambda \epsilon}{N} \|f^*\|.
\]
Since $\|f_0 - f^*\| = \|f^*\|$, this recurrence inductively yields the statement of the lemma.

Corollary G.21. Under the same conditions, $\|f_\tau\| \le 2\|f^*\|$.
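Both Lemma G.20 and the convergence established in the next subsection are easy to observe in the same toy simulation. The sketch below is an illustration under our own simplifications: trees are drawn uniformly, $f^*(x_N)$ is taken to be $y_N$ (an interchange that Lemma G.14 permits inside the update), and the expected Gram matrix is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L, lam, eps, T = 100, 8, 10.0, 0.05, 20000
y = rng.normal(size=N)

def tree_gram():
    leaf = rng.integers(0, L, size=N)
    K = np.zeros((N, N))
    for j in range(L):
        idx = np.where(leaf == j)[0]
        if idx.size > 0:
            K[np.ix_(idx, idx)] = N / idx.size
    return K

# Monte-Carlo estimate of the expected Gram matrix under the tree distribution.
K_bar = sum(tree_gram() for _ in range(300)) / 300
f_krr = np.linalg.solve(K_bar + lam * np.eye(N), K_bar @ y)  # KRR solution

f, worst = np.zeros(N), 0.0
for _ in range(T):
    leaf = rng.integers(0, L, size=N)
    resid = y - f
    means = np.array([resid[leaf == j].mean() if (leaf == j).any() else 0.0
                      for j in range(L)])
    f = (1 - lam * eps / N) * f + eps * means[leaf]
    worst = max(worst, float(np.linalg.norm(f - y)))

print(worst <= np.linalg.norm(y) + 1e-9)  # ||f_tau - f*|| <= ||f*||, Lemma G.20
print(np.linalg.norm(f - f_krr) / np.linalg.norm(f_krr))  # gap to KRR; shrinks with eps
```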
G.6 MAIN THEOREMS

Let us denote $\bar{R} = \|f^*\|_{\mathbb{R}^N}$. We treat it as a constant, since both the kernel $K$ and $f^*$ converge as $N \to \infty$, which makes $\bar{R}$ bounded by some constant with probability arbitrarily close to one. By Lemma F.2, $\Gamma_\beta(f) \ge e^{-\frac{2m}{\beta} V(f)}$. Then,
\[
\Gamma_\beta(f_\tau) \ge e^{-\frac{2m}{\beta} \cdot \frac{1}{2N} \|f_\tau - f^*\|^2_{\mathbb{R}^N}} \ge e^{-\frac{m \bar{R}^2}{N \beta}},
\]
and we denote $M_\beta = e^{\frac{m \bar{R}^2}{N \beta}}$.

Theorem G.22. Consider an arbitrary $\epsilon$ with $0 < \epsilon\big(\frac{\lambda}{N} + 1\big) < 1$ and $\epsilon \le \frac{1 + M_\beta \lambda}{4 M_\beta N}$. The following inequality holds:
\[
\mathbb{E} V(f_T) \le \frac{\bar{R}^2}{2N}\, e^{-\frac{1 + M_\beta \lambda}{2 M_\beta N} T \epsilon} + \frac{M_\beta \lambda\, \bar{R}^2}{2N} + \frac{4 M_\beta^2 (M_\beta^2 - 1)\, \lambda^2}{(1 + M_\beta \lambda)\, N}\, \bar{R}^2 + \frac{4 \epsilon\, M_\beta \lambda N}{1 + M_\beta \lambda} \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \bar{R}^2.
\]

Proof. To prove the theorem, we will bound $V(f) \le V(f, M_\beta \lambda) + \mathrm{const}$, which allows us to invoke Lemma G.17. After that, using strong convexity, we obtain a bound on $\mathbb{E} V(f_\tau, M_\beta \lambda)$, and a bound on $\mathbb{E} V(f_\tau)$ then follows straightforwardly. To get the result for $V(f_\tau, M_\beta \lambda)$, we expand $V(f_{\tau+1}, M_\beta \lambda)$ by substituting the formula for $f_{\tau+1}$, first dealing with the term $V(f_{\tau+1})$:
\[
V(f_{\tau+1}) = \frac{1}{2N} \|f_{\tau+1}(x_N) - f^*(x_N)\|^2_{\mathbb{R}^N}
= V(f_\tau) + \frac{1}{N} \big\langle f_{\tau+1}(x_N) - f_\tau(x_N),\, f_\tau(x_N) - f^*(x_N)\big\rangle_{\mathbb{R}^N} + \frac{1}{2N} \|f_{\tau+1}(x_N) - f_\tau(x_N)\|^2_{\mathbb{R}^N}
\]
\[
= V(f_\tau) - \epsilon\, \Big\langle \Sigma_{\nu_\tau}[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] \Big\rangle_H + \frac{\epsilon^2}{2N} \Big\| \Sigma_{\nu_\tau}[f_\tau - f^*](x_N) + \frac{\lambda}{N} f_\tau(x_N) \Big\|^2_{\mathbb{R}^N}
\]
\[
\le V(f_\tau) - \epsilon\, \Big\langle \Sigma_{\nu_\tau}[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] \Big\rangle_H + 2\epsilon^2 V(f_\tau) + \frac{\lambda^2 \epsilon^2}{N^3} \|f_\tau(x_N)\|^2_{\mathbb{R}^N},
\]
where we used the inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and Lemma G.7, which allows us to bound $\|\Sigma_{\nu_\tau}(f_\tau - f^*)\|_{\mathbb{R}^N} \le \|f_\tau - f^*\|_{\mathbb{R}^N}$.

Then, we analyze the regularization term:
\[
\frac{\lambda M_\beta}{2N} \|f_{\tau+1}\|^2_H = \frac{\lambda M_\beta}{2N} \|f_\tau\|^2_H + \Big\langle f_{\tau+1} - f_\tau,\; \frac{\lambda M_\beta}{N} f_\tau \Big\rangle_H + \frac{\lambda \epsilon^2 M_\beta}{2N} \Big\| \Sigma_{\nu_\tau}[f_\tau - f^*] + \frac{\lambda}{N} f_\tau \Big\|^2_H
\]
\[
\le \frac{\lambda M_\beta}{2N} \|f_\tau\|^2_H - \epsilon\, \Big\langle \frac{\lambda M_\beta}{N} f_\tau,\; \Sigma_{\nu_\tau}[f_\tau - f^*] + \frac{\lambda}{N} f_\tau \Big\rangle_H + \frac{\epsilon^2 \lambda M_\beta}{N} \|f_\tau - f^*\|^2_H + \frac{\epsilon^2 \lambda^3 M_\beta}{N^3} \|f_\tau\|^2_H
\]
\[
\le \frac{\lambda M_\beta}{2N} \|f_\tau\|^2_H - \epsilon\, \Big\langle \frac{\lambda M_\beta}{N} f_\tau,\; \Sigma_{\nu_\tau}[f_\tau - f^*] + \frac{\lambda}{N} f_\tau \Big\rangle_H + \frac{\epsilon^2 \lambda M_\beta}{N} \Big(1 + \frac{4\lambda^2}{N^2}\Big) \|f^*\|^2_H,
\]
where in the first inequality we used $\|\Sigma_{\nu_\tau}[f_\tau - f^*]\|_H \le \|f_\tau - f^*\|_H$, which is due to Lemma G.7, and in the second, Lemma G.20 and Corollary G.21.

Summing up the expectations of these two expressions, and using $\mathbb{E}\big[\Sigma_{\nu_\tau} \mid f_\tau\big] = \Sigma_{f_\tau} = K_{f_\tau} D$ (Lemma G.16), we obtain:
\[
\mathbb{E} V(f_{\tau+1}, M_\beta \lambda) = \mathbb{E} V(f_{\tau+1}) + \frac{M_\beta \lambda}{2N}\, \mathbb{E} \|f_{\tau+1}\|^2_H - C_{M_\beta \lambda}
\]
\[
\le (1 + 2\epsilon^2)\, \mathbb{E} V(f_\tau, M_\beta \lambda) - \epsilon\, \mathbb{E} \Big\langle K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] + \frac{M_\beta \lambda}{N} f_\tau \Big\rangle_H + 2\epsilon^2 \lambda \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \bar{R}^2.
\]
Here we used
\[
C_\lambda := \inf_{f \in F} L(f, \lambda) - \inf_{f \in F} L(f) \le L(f^*, \lambda) - L(f^*) = \frac{\lambda}{2N} \|f^*\|^2_H \le \frac{\lambda}{2N} \bar{R}^2.
\]
Then, applying Lemma G.17 with $\xi = M_\beta$, $A = K$, $B = K_{f_\tau}$, $g = D[f_\tau - f^*]$, and $h = f_\tau$, we obtain the following bound:
\[
-\epsilon\, \mathbb{E} \Big\langle K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] + \frac{M_\beta \lambda}{N} f_\tau \Big\rangle_H
\le -\frac{\epsilon}{2 M_\beta}\, \mathbb{E} \|\nabla_H V(f_\tau, M_\beta \lambda)\|^2_H + \frac{\epsilon\, M_\beta (M_\beta^2 - 1)}{2}\, \frac{\lambda^2}{N^2}\, \mathbb{E} \|f_\tau\|^2_H
\]
\[
\le -\frac{\epsilon}{2 M_\beta}\, \mathbb{E} \|\nabla_H V(f_\tau, M_\beta \lambda)\|^2_H + 2 \epsilon\, M_\beta (M_\beta^2 - 1)\, \frac{\lambda^2}{N^2}\, \bar{R}^2.
\]
Then, by the Polyak-Łojasiewicz inequality $\frac{1}{2} \|\nabla V\|^2_H \ge \mu V$ for the $\mu$-strongly convex function $V(\cdot, M_\beta \lambda)$ (restricted to $\operatorname{span}\{K(\cdot, x_i) \mid i = 1, \ldots, N\}$) with $\mu \ge \frac{1 + M_\beta \lambda}{N} > 0$, which is due to Corollary G.8, we obtain:
\[
-\epsilon\, \mathbb{E} \Big\langle K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] + \frac{M_\beta \lambda}{N} f_\tau \Big\rangle_H
\le -\frac{\epsilon (1 + M_\beta \lambda)}{M_\beta N}\, \mathbb{E} V(f_\tau, M_\beta \lambda) + 2 \epsilon\, M_\beta (M_\beta^2 - 1)\, \frac{\lambda^2}{N^2}\, \bar{R}^2.
\]
Substituting this into the bound on $\mathbb{E} V(f_{\tau+1}, M_\beta \lambda)$ gives:
\[
\mathbb{E} V(f_{\tau+1}, M_\beta \lambda) \le \Big(1 - \epsilon \Big( \frac{1 + M_\beta \lambda}{M_\beta N} - 2\epsilon \Big)\Big)\, \mathbb{E} V(f_\tau, M_\beta \lambda) + 2 \epsilon \lambda \Big[ M_\beta (M_\beta^2 - 1) \frac{\lambda}{N^2} + \epsilon \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \Big] \bar{R}^2
\]
\[
\le \Big(1 - \frac{\epsilon (1 + M_\beta \lambda)}{2 M_\beta N}\Big)\, \mathbb{E} V(f_\tau, M_\beta \lambda) + 2 \epsilon \lambda \Big[ M_\beta (M_\beta^2 - 1) \frac{\lambda}{N^2} + \epsilon \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \Big] \bar{R}^2,
\]
which yields
\[
\mathbb{E} V(f_T, M_\beta \lambda) \le \frac{\bar{R}^2}{2N}\, e^{-\frac{1 + M_\beta \lambda}{2 M_\beta N} T \epsilon} + \frac{4 M_\beta \lambda N}{1 + M_\beta \lambda} \Big[ M_\beta (M_\beta^2 - 1) \frac{\lambda}{N^2} + \epsilon \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \Big] \bar{R}^2,
\]
where we used the bound $V(f_0, M_\beta \lambda) = V(0, M_\beta \lambda) \le \frac{1}{2N} \|f^*(x_N)\|^2_{\mathbb{R}^N} \le \frac{\bar{R}^2}{2N}$.

Next, we use the following inequality:
\[
V(f) = L(f) - \min L(f) \le L(f, M_\beta \lambda) - \min L(f, M_\beta \lambda) + \min L(f, M_\beta \lambda) - \min L(f)
\le V(f, M_\beta \lambda) + L(f^*, M_\beta \lambda) - L(f^*) = V(f, M_\beta \lambda) + \frac{M_\beta \lambda}{2N} \|f^*\|^2_H,
\]
which finally gives us the stated bound on $\mathbb{E} V(f_T)$.

Theorem G.23 (Theorem 3.8 in the main text). Let
\[
C = \frac{M_\beta \lambda\, \bar{R}^2}{2N} + \frac{4 M_\beta^2 (M_\beta^2 - 1)\, \lambda^2}{(1 + M_\beta \lambda)\, N}\, \bar{R}^2 + \frac{4 \epsilon\, M_\beta \lambda N}{1 + M_\beta \lambda} \Big( \frac{2\lambda}{N^2} + M_\beta \Big(1 + \frac{2\lambda^2}{N^2}\Big) \Big) \bar{R}^2
\]
be the constant part of the bound in Theorem G.22. Assume that $0 < \epsilon\big(\frac{\lambda}{N} + 1\big) < 1$, $\epsilon \le \frac{1 + M_\beta \lambda}{8N}$, and $e^{\frac{4mC}{\beta}} \le \frac{5}{4}$ (the latter can be achieved by taking $\beta$ arbitrarily large), and define
\[
T_1 = \Big\lceil \frac{2 M_\beta N}{\epsilon (1 + M_\beta \lambda)}\, \log \frac{\bar{R}^2}{2 C N} \Big\rceil + 1.
\]
Then, for all $T \ge T_1$, it holds that
\[
\mathbb{E} V(f_T, \lambda) \le 2 \Big(C + \frac{\lambda \bar{R}^2}{N}\Big)\, e^{-\frac{1+\lambda}{4N} \epsilon (T - T_1)} + \frac{8\lambda}{1+\lambda} \Big[ \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big] \bar{R}^2.
\]

Proof. First, we apply the previous theorem to obtain a bound on $\mathbb{E} V(f_\tau)$, which we will use to claim that the kernels $K_{f_\tau}$ and $K$ are close to each other in expectation. With $T_1$ as above, the following inequalities hold for all $\tau \ge T_1$:
\[
\mathbb{E} V(f_\tau) \le 2C, \qquad \mathbb{E} V(f_\tau, \lambda) \le 2 \Big(C + \frac{\lambda \bar{R}^2}{N}\Big).
\]
Then, analogously to the previous theorem, we estimate:
\[
\mathbb{E} V(f_{\tau+1}, \lambda) \le (1 + 2\epsilon^2)\, \mathbb{E} V(f_\tau, \lambda) - \epsilon\, \mathbb{E} \Big\langle K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau \Big\rangle_H + \frac{2 \epsilon^2 \lambda}{N} \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \bar{R}^2.
\]
Further, we bound the scalar product term by Lemma G.19, instead of Lemma G.17, which we used in the previous theorem. By Lemmas F.2 and G.18, $\kappa(K, K_{f_\tau}) \le e^{\frac{2m}{\beta} V(f_\tau)} - 1$, so that, using $\mathbb{E} V(f_\tau) \le 2C$, we get $\mathbb{E}\, \kappa(K, K_{f_\tau}) \le e^{\frac{4mC}{\beta}} - 1 \le \frac{1}{4}$. Together with $\|f_\tau\|^2_H \le 4\bar{R}^2$, this yields:
\[
\mathbb{E} \Big\langle K_{f_\tau} D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau,\; K D[f_\tau - f^*] + \frac{\lambda}{N} f_\tau \Big\rangle_H
\ge \frac{1}{4}\, \mathbb{E} \|\nabla_H V(f_\tau, \lambda)\|^2_H - \mathbb{E} \Big[ \big(e^{\frac{2m}{\beta} V(f_\tau)} - 1\big)^2\, \frac{\lambda^2}{N^2}\, \|f_\tau\|^2_H \Big]
\ge \frac{1+\lambda}{2N}\, \mathbb{E} V(f_\tau, \lambda) - \frac{2 \lambda^2 M_\beta^2}{N^2}\, \bar{R}^2,
\]
where the last step uses the Polyak-Łojasiewicz inequality with $\mu \ge \frac{1+\lambda}{N}$, as in the previous theorem.
Substituting this into the recursion, we get:
\[
\mathbb{E} V(f_{\tau+1}, \lambda) \le \Big(1 - \epsilon \Big( \frac{1+\lambda}{2N} - 2\epsilon \Big)\Big)\, \mathbb{E} V(f_\tau, \lambda) + \frac{2 \epsilon \lambda}{N} \Big[ \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big] \bar{R}^2
\le \Big(1 - \frac{\epsilon (1+\lambda)}{4N}\Big)\, \mathbb{E} V(f_\tau, \lambda) + \frac{2 \epsilon \lambda}{N} \Big[ \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big] \bar{R}^2.
\]
Iterating the bound yields
\[
\mathbb{E} V(f_T, \lambda) \le \mathbb{E} V(f_{T_1}, \lambda)\, e^{-\frac{1+\lambda}{4N} \epsilon (T - T_1)} + \frac{8\lambda}{1+\lambda} \Big[ \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big] \bar{R}^2
\le 2 \Big(C + \frac{\lambda \bar{R}^2}{N}\Big)\, e^{-\frac{1+\lambda}{4N} \epsilon (T - T_1)} + \frac{8\lambda}{1+\lambda} \Big[ \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big] \bar{R}^2.
\]

Corollary G.24 (Convergence to the solution of KRR / convergence to the Gaussian process posterior mean function). Under the assumptions of both previous theorems, we have:¹⁰
\[
\mathbb{E} \|f_T - f^*_\lambda\|^2_{L_2} \le \max_{x \in X} K(x, x) \Big[ 4N \Big(C + \frac{\lambda \bar{R}^2}{N}\Big)\, e^{-\frac{1+\lambda}{4N} \epsilon (T - T_1)} + \frac{16 N \lambda}{1+\lambda} \Big( \frac{\lambda M_\beta^2}{N} + \epsilon \Big(1 + \frac{2\lambda(1+\lambda)}{N}\Big) \Big) \bar{R}^2 \Big].
\]

Proof. By Lemma D.7,
\[
V(f, \lambda) = \frac{1}{2N} \|f(x_N) - f^*_\lambda(x_N)\|^2_{\mathbb{R}^N} + \frac{\lambda}{2N} \|f - f^*_\lambda\|^2_H.
\]
Then, by the previous theorem, we get a bound on $\frac{1}{2N}\, \mathbb{E} \|f_T(x_N) - f^*_\lambda(x_N)\|^2_{\mathbb{R}^N}$, and by Corollaries G.10 and G.8, we majorize the $L_2$ norm by the $\mathbb{R}^N$ norm. The corollary follows.

¹⁰ When $N \to \infty$, $K$ converges to a certain kernel. Thus, $\max_{x \in X} K(x, x)$ can be estimated by a constant with probability arbitrarily close to one.

Lemma G.25 (Lemma 4.1 in the main text). The following convergence holds almost surely in $x \in X$:
\[
h_T(\cdot) \xrightarrow[T \to \infty]{d} \mathrm{GP}\big(0_{L_2(\rho)},\, K\big).
\]

Proof. From (5), we have that the covariance of $h_T$ is $K$ independently of $T$. Thus, it remains to show that the limit is Gaussian, which holds, almost surely in $x \in X$, by the central limit theorem:
\[
h_T(x) = \frac{1}{\sqrt{T}} \sum_{i=1}^{T} h_{T,i}(x) \xrightarrow{d} \mathcal{N}\big(0_{L_2},\, K(x, x)\big),
\]
where the individual trees $h_{T,i}$ are centered i.i.d. (with the same distribution as $h_1$).

H IMPLEMENTATION DETAILS

In the experiments, we fix $\sigma = 10^{-2}$ (the scale of the kernel) and $\delta = 10^{-4}$ (the scale of the noise), which, in theory, can be taken arbitrary. As a hyperparameter (estimated on the validation set), we consider $\beta \in \{10^{-2}, 10^{-1}, 1\}$. We use the standard CatBoost library and add the Gumbel noise term when selecting trees with the L2 score function; this option is implemented in CatBoost out of the box but is not used by SGB and SGLB, since it is not the default for the library. Moreover, we do not use subsampling of the data (SGLB does not use it either), and, differently from SGB and SGLB, we disable the boost-from-average option. Finally, we set the l2_leaf_reg value to 0, as SGLB does.
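For completeness, below is a sketch of the per-model configuration described above, using the CatBoost Python package. The parameter names follow our reading of the public CatBoost API and may differ across versions; in particular, treating posterior_sampling as the switch that enables the randomized (Gumbel-noise) tree selection is our assumption, and the learning rate and iteration count are illustrative placeholders:

```python
from catboost import CatBoostRegressor

# A hypothetical configuration mirroring the settings listed above; flag
# names and their interactions may vary across CatBoost versions.
model = CatBoostRegressor(
    loss_function="RMSE",
    score_function="L2",       # L2 scoring for split selection
    posterior_sampling=True,   # assumed switch for the Gumbel-noise tree choice
    bootstrap_type="No",       # no subsampling of the data
    boost_from_average=False,  # disabled, unlike the SGB/SGLB setups
    l2_leaf_reg=0,             # as in SGLB
    learning_rate=0.1,         # illustrative value
    iterations=1000,           # illustrative value
    random_seed=0,
)
# Knowledge uncertainty is then estimated as the variance of the predictions
# of several such models trained with different random seeds.
```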