SGLB: Stochastic Gradient Langevin Boosting

Aleksei Ustimenko 1, Liudmila Prokhorenkova 1 2 3

1 Yandex, Moscow, Russia; 2 Moscow Institute of Physics and Technology, Moscow, Russia; 3 HSE University, Moscow, Russia. Correspondence to: Aleksei Ustimenko. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

This paper introduces Stochastic Gradient Langevin Boosting (SGLB), a powerful and efficient machine learning framework that can handle a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only a local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with the 0-1 loss function, which is known to be multimodal.

1. Introduction

Gradient boosting is a powerful machine-learning method that iteratively combines weak models to obtain more accurate ones (Friedman, 2001). Nowadays, this technique remains the primary method for web search, recommendation systems, weather forecasting, and many other problems with complex dependencies and heterogeneous data. Combined with decision trees, gradient boosting underlies such widely used software libraries as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018).

For convex loss functions and under some regularity assumptions, stochastic gradient boosting (SGB) converges to the optimal solution (Zhou & Hooker, 2018). However, for general losses, even convergence to a local optimum cannot be guaranteed. We fill this gap and build the first globally convergent gradient boosting algorithm for convex and non-convex optimization with provable generalization guarantees. For this purpose, we combine gradient boosting with stochastic gradient Langevin dynamics (SGLD), which is a powerful iterative optimization algorithm (Raginsky et al., 2017). It turns out that gradient boosting can be easily modified into a globally converging method: at each step, one has to shrink the currently built model and add proper noise to the stochastic gradient estimates and to the weak learner selection algorithm. To prove the global convergence and generalization, we develop a novel theoretical framework and show that the dynamics of SGLB can be approximated by a special form of the Langevin diffusion equation.

The proposed algorithm is implemented within the CatBoost open-source gradient boosting library (option langevin=True) (CatBoost, 2020). Our experiments on synthetic and real datasets show that SGLB outperforms standard SGB and can globally optimize such non-convex losses as the 0-1 loss, which was previously claimed to be a challenge (Nguyen & Sanner, 2013). Since the first version of this paper appeared online, SGLB has found new exciting applications in learning to rank and uncertainty estimation (Ustimenko & Prokhorenkova, 2020; Malinin et al., 2021); see Section 8 for more details.

In the next section, we briefly discuss the related research on gradient boosting convergence and 0-1 loss optimization. Then, in Section 3, we give the necessary background on gradient boosting and stochastic gradient Langevin dynamics.
The proposed SGLB method is described in Algorithm 2 in Section 4. This approach is easy to implement, and it is backed by strong theoretical guarantees. Our main results on the convergence of SGLB are given in Section 5: convergence on the train set is given in Theorem 2, the generalization error is bounded in Theorem 3, and our results are summarized in Corollary 1. The experiments comparing SGLB with SGB on synthetic and real datasets are discussed in Section 7. Section 8 concludes the paper, discusses further applications of SGLB, and outlines promising directions for future research.

2. Related Work

In this section, we discuss related work on SGB convergence and 0-1 loss optimization. Our work is also closely related to stochastic gradient Langevin dynamics, which we discuss in Section 3.3.

Convergence of gradient boosting. There are several theoretical attempts to study SGB convergence, e.g., Boulevard (Zhou & Hooker, 2018), AnyBoost (Mason et al., 2000), or gradient boosting in the general L2 setting (Bühlmann & Yu, 2003; Zhang & Yu, 2005; Bühlmann & Hothorn, 2008; Dombry & Esstafa, 2020). These works consider a general boosting algorithm, but under restrictive assumptions like exact greediness of the weak learner selection algorithm (Mason et al., 2000), Structure Value Isolation properties (Zhou & Hooker, 2018), and, most importantly, convexity of the loss function. However, many practical tasks involve non-convex losses, e.g., 0-1 loss optimization (Nguyen & Sanner, 2013), regret minimization in non-convex games (Hazan et al., 2017), learning to rank (Liu, 2009), learning to select with order (Vorobev et al., 2019), and many others. Thus, existing frameworks fail to solve these tasks as they strongly rely on convexity.

Many practical implementations of boosting like XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018) use a constant learning rate in their default settings since in practice it outperforms dynamically decreasing ones. However, existing works on the convergence of boosting algorithms assume decreasing learning rates (Zhang & Yu, 2005; Zhou & Hooker, 2018), thus leaving an open question: can convergence be guaranteed for a constant learning rate $\epsilon > 0$?

0-1 loss optimization. For binary classification problems, convex loss functions are usually used since they can be efficiently optimized. However, as pointed out by Nguyen & Sanner (2013), such losses are sensitive to outliers. On the other hand, the 0-1 loss (the fraction of incorrectly predicted labels) is more robust and more interpretable but harder to optimize. Nguyen & Sanner (2013) propose smoothing the 0-1 loss with a sigmoid function and show that an iteratively unrelaxed coordinate descent approach for gradient optimization of this smoothed loss outperforms optimization of convex upper bounds of the original 0-1 loss. In the current paper, we use a smoothed 0-1 loss as an example of a multimodal function and show that the SGLB algorithm achieves superior performance for this loss.

3. Background

3.1. General Setup

Assume that we are given a distribution $D$ on $X \times Y$, where $X$ is a feature space (typically $\mathbb{R}^k$) and $Y$ is a target space (typically $\mathbb{R}$ for regression or $\{0, 1\}$ for classification).1 We are also given a loss function $L(z, y) : Z \times Y \to \mathbb{R}$, where $Z$ is a space of predictions (typically $\mathbb{R}$ or $\{0, 1\}$).

1 Table 2 of the supplementary materials lists notation frequently used in the paper.
Our goal is to minimize the expected loss $\mathcal{L}(f|D) := \mathbb{E}_D\,L(f(x), y)$ over functions $f$ belonging to some family $\mathcal{F} \subseteq \{f : X \to Z\}$. In practice, the distribution $D$ is unknown and we are given i.i.d. samples $(x_1, y_1), \ldots, (x_N, y_N) \sim D$, denoted by $D_N$, so the expected loss is replaced by the empirical average:

$$\mathcal{L}_N(f) := \mathcal{L}(f|D_N) = \frac{1}{N}\sum_{i=1}^N L(f(x_i), y_i). \quad (1)$$

Typically, a regularization term is added to improve generalization:

$$\mathcal{L}_N(f, \gamma) := \mathcal{L}_N(f) + \frac{\gamma}{2}\|f\|_2^2. \quad (2)$$

We consider only $\mathcal{F}$ corresponding to additive ensembles of weak learners $\mathcal{H} := \{h_s(x, \theta_s) : X \times \mathbb{R}^{m_s} \to \mathbb{R},\ s \in S\}$, where $S$ is some finite index set and $h_s$ depends linearly on its parameters $\theta_s$. Linearity in $\theta_s$ is crucial for the convergence guarantees and is typical for practical implementations, where decision trees are used as base learners. A decision tree is a model built by a recursive partition of the feature space into disjoint regions called leaves.2 Each leaf is assigned a value that estimates the response $y$ in the corresponding region. Denoting these regions by $R_j$, we can write $h(x, \theta) = \sum_j \theta_j \mathbb{1}\{x \in R_j\}$. Thus, once the tree structure given by the $R_j$ is known, a decision tree is a linear function of the leaf values $\theta$. The finiteness of $S$ is a natural assumption since the training dataset is finite: e.g., for decision trees, there is a finite number of ways to split the dataset into disjoint regions. Further in the paper, we denote by $H_s : \mathbb{R}^{m_s} \to \mathbb{R}^N$ the linear operator converting $\theta_s$ to $(h_s(x_i, \theta_s))_{i=1}^N$.

2 Usually, a tree is constructed in a top-down greedy manner. We assume an arbitrary tree construction procedure satisfying mild assumptions formalized in Section 5.

Due to the linear dependence of $h_s$ on $\theta_s$ and the finiteness of $S$, we can represent any ensemble of models from $\mathcal{H}$ as a linear model $f_\Theta(x) = \langle\varphi(x), \Theta\rangle$ for some feature map $\varphi(x) : X \to \mathbb{R}^m$. Here $\Theta \in \mathbb{R}^m$ is the vector of parameters of the whole ensemble. To obtain this vector, for each $h_s \in \mathcal{H}$ we take all its occurrences in the ensemble, sum the corresponding vectors of parameters, and concatenate the vectors obtained for all $s \in S$. If $\mathcal{H}$ consists of decision trees, we sum all trees with the same leaves $R_j$, and $\varphi(x)$ maps the feature vector to the binary vector encoding all possible leaves containing $x$. The parameters of the model $\hat F_\tau$ obtained at iteration $\tau$ are denoted by $\hat\Theta_\tau$.3

3 We use the notation $\hat\Theta_\tau$, $\hat F_\tau$ to stress that a process is discrete, while $F(t)$ is used for a continuous process.
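To make the "decision tree = linear model" view concrete, the following minimal sketch (ours, not from the paper; the helper leaf_indicators and the oblivious-tree structure given by (feature, threshold) pairs are illustrative assumptions) materializes the operator $H_s$ for a fixed tree structure $s$:

```python
# A sketch of H_s: once the structure (the regions R_j) is fixed, the
# prediction h_s(x, theta) = sum_j theta_j 1{x in R_j} is linear in the
# leaf values theta.
import numpy as np

def leaf_indicators(X, splits):
    """Rows are phi(x_i): one-hot encodings over the 2^len(splits) leaves."""
    codes = np.zeros(len(X), dtype=int)
    for feature, threshold in splits:        # leaf index = binary code of split outcomes
        codes = 2 * codes + (X[:, feature] > threshold)
    phi = np.zeros((len(X), 2 ** len(splits)))
    phi[np.arange(len(X)), codes] = 1.0
    return phi

X = np.random.randn(6, 3)
splits = [(0, 0.0), (2, 0.5)]                # a fixed tree structure s
theta = np.array([0.1, -0.3, 0.7, 0.0])      # leaf values theta_s
H_s = leaf_indicators(X, splits)             # the linear operator H_s as a matrix
print(H_s @ theta)                           # (h_s(x_i, theta_s))_{i=1}^N
```

Stacking such indicator blocks over all structures $s \in S$ yields the feature map $\varphi(x)$ and the global parameter vector $\Theta$ described above.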
3.2. Stochastic Gradient Boosting

This section describes the classic stochastic gradient boosting (SGB) algorithm (Friedman, 2002) using notation convenient for our analysis. SGB is a recursive procedure that can be characterized by a tuple $\mathcal{B} := \langle\mathcal{H}, p\rangle$: $\mathcal{H}$ is a set of weak learners; $p(s|g)$ is a probability distribution over weak learner indices $s \in S$ given a vector $g \in \mathbb{R}^N$, typically taken to be the vector of gradients. In the case of decision trees, we refer to $p(s|g)$ as a distribution over tree structures.

Algorithm 1 SGB
input: dataset $D_N$, learning rate $\epsilon > 0$
initialize $\tau = 0$, $f_{\hat\Theta_0}(\cdot) = 0$
repeat
  estimate gradients $\hat g_\tau \in \mathbb{R}^N$ on $D_N$ using $f_{\hat\Theta_\tau}(\cdot)$
  sample tree structure $s_\tau \sim p(s|\hat g_\tau)$
  estimate leaf values $\theta_{s_\tau} = -\epsilon\,\Phi_{s_\tau}\hat g_\tau$
  update ensemble $f_{\hat\Theta_{\tau+1}}(\cdot) = f_{\hat\Theta_\tau}(\cdot) + h_{s_\tau}(\cdot, \theta_{s_\tau})$
  update $\tau = \tau + 1$
until stopping criterion met
return: $f_{\hat\Theta_\tau}(\cdot)$

SGB constructs an ensemble of decision trees iteratively. At each step, we compute unbiased gradient estimates $\hat g_\tau \in \mathbb{R}^N$ such that $\mathbb{E}\,\hat g_\tau = \big(\partial_f L(f_{\hat\Theta_\tau}(x_i), y_i)\big)_{i=1}^N$, where $f_{\hat\Theta_\tau}$ is the currently built model. Then, we pick a weak learner index $s_\tau$ according to the distribution $p(s|\hat g_\tau)$. It remains to estimate the parameters $\theta_{s_\tau}$ of this weak learner, which is done as follows:

$$\text{minimize } \|\theta_{s_\tau}\|_2^2 \quad \text{subject to} \quad \theta_{s_\tau} \in \arg\min_{\theta\in\mathbb{R}^{m_{s_\tau}}} \big\|{-\epsilon\hat g_\tau} - H_{s_\tau}\theta\big\|_2^2, \quad (3)$$

where $\epsilon > 0$ is a learning rate. In other words, if the $\arg\min$ returns not a single value but a set, then we choose the value that minimizes $\|\theta_{s_\tau}\|_2^2$. The above optimization problem can be solved exactly as $\theta_{s_\tau} = -\epsilon\,\Phi_{s_\tau}\hat g_\tau$, where $\Phi_{s_\tau} = (H_{s_\tau}^T H_{s_\tau})^+ H_{s_\tau}^T$.4 This expression is general and holds for any weak learners depending linearly on their parameters. However, for decision trees, the matrix $\Phi_{s_\tau}$ has a simple explicit form: for a decision tree $h_{s_\tau}(x, \theta_{s_\tau}) = \sum_i \theta_{s_\tau,i}\,\mathbb{1}\{x \in R_i\}$ we have

$$(\Phi_{s_\tau})_{i,j} = \frac{\mathbb{1}\{x_j \in R_i\}}{|\{k : x_k \in R_i\}|}.$$

Thus, $\Phi_{s_\tau}\hat g_\tau$ corresponds to averaging the gradient estimates in each leaf of the tree. After obtaining $\theta_{s_\tau}$, the algorithm updates the ensemble as

$$f_{\hat\Theta_{\tau+1}}(\cdot) := f_{\hat\Theta_\tau}(\cdot) + h_{s_\tau}(\cdot, \theta_{s_\tau}). \quad (4)$$

4 $A^+$ denotes the pseudo-inverse of the matrix $A$; see Supplementary A for more details.

Under the convexity of $L(z, y)$ in $z$ and some regularity assumptions on the tuple $\mathcal{B}$ (Zhou & Hooker, 2018), the ensemble converges to the optimal one with respect to the closure of the set of all possible finite ensembles. Thus, one can construct a converging SGB algorithm for convex losses. For non-convex losses, SGB cannot guarantee convergence for the same reasons as stochastic gradient descent (SGD): it gives only a first-order stationarity guarantee (Ghadimi & Lan, 2013), which covers not only local minima but also saddle points, and this prevents effective optimization. Our paper fills this gap: we build a globally converging gradient boosting algorithm for convex and non-convex optimization with provable (under some assumptions on $\mathcal{H}$) generalization gap bounds.
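The closed-form solution of (3) is worth seeing in code: for a decision tree, $\theta_{s_\tau} = -\epsilon\,\Phi_{s_\tau}\hat g_\tau$ reduces to averaging $-\epsilon\,\hat g_\tau$ within each leaf. Below is a minimal sketch of one SGB iteration (ours, under the oblivious-tree assumptions of the previous snippet; squared error serves as an example loss, and the structure selection $p(s|\hat g_\tau)$ is abstracted into a callback):

```python
# One SGB iteration on the vector of train predictions F, reusing
# leaf_indicators from the previous snippet.
import numpy as np

def sgb_step(F, X, y, sample_structure, eps):
    g = 2.0 * (F - y)                        # gradient estimates (squared error)
    splits = sample_structure(g)             # s_tau ~ p(s | g), e.g. a greedy MSE fit
    H = leaf_indicators(X, splits)
    counts = np.maximum(H.sum(axis=0), 1.0)  # leaf sizes |{k : x_k in R_j}|
    theta = -eps * (H.T @ g) / counts        # theta = -eps * Phi g: per-leaf means
    return F + H @ theta                     # update (4): add the new weak learner
```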
3.3. Stochastic Gradient Langevin Dynamics

Our algorithm combines SGB with stochastic gradient Langevin dynamics (SGLD), which we briefly introduce here. Assume that we are given a Lipschitz-smooth function $U(\theta) : \mathbb{R}^m \to \mathbb{R}$ such that $U(\theta) \to \infty$ as $\|\theta\|_2 \to \infty$. Importantly, the function $U(\theta)$ can be non-convex. SGLD (Gelfand et al., 1992; Welling & Teh, 2011; Raginsky et al., 2017; Erdogdu et al., 2018) aims at finding the global minimum of $U(\theta)$, which is in sharp contrast with SGD, which can get stuck at a local minimum. It performs an iterative procedure that updates $\hat\theta_\tau$ as:

$$\hat\theta_{\tau+1} = \hat\theta_\tau - \epsilon P\,\widehat{\nabla U}(\hat\theta_\tau) + \mathcal{N}(0_m, 2\epsilon\beta^{-1}P), \quad (5)$$

where $\widehat{\nabla U}(\theta)$ is an unbiased stochastic gradient estimate (i.e., $\mathbb{E}\,\widehat{\nabla U}(\theta) = \nabla U(\theta)$), $\epsilon > 0$ is a learning rate, $\beta > 0$ is an inverse diffusion temperature, and $P \in \mathbb{R}^{m\times m}$ is a symmetric positive definite preconditioner matrix (Welling & Teh, 2011). Informally, one injects $\sqrt{\epsilon}$-rescaled Gaussian noise into a standard SGD update to force the algorithm to explore the whole space and eventually find the global optimum. The diffusion temperature controls the level of exploration.5 The preconditioner matrix $P$ affects the convergence rate of $\hat\theta_\tau$ while still preserving convergence to the global optimum (Welling & Teh, 2011). We write (5) in such a general form since our analysis of SGLB yields a particular non-trivial preconditioner matrix.

5 For the proposed approach, $\beta$ also affects the generalization. In our experiments, we tune this parameter on the validation set.

Under mild assumptions6 (Raginsky et al., 2017; Erdogdu et al., 2018), the chain $\hat\theta_\tau$ converges in distribution to a random variable with density $p_\beta(\theta) \propto \exp(-\beta U(\theta))$ (the Gibbs distribution) as $\epsilon\tau \to \infty$, $\epsilon \to 0^+$. Moreover, according to Raginsky et al. (2017); Erdogdu et al. (2018), we have

$$\mathbb{E}_{\theta\sim p_\beta(\theta)}\,U(\theta) - \min_{\theta\in\mathbb{R}^m} U(\theta) = O\Big(\frac{m}{\beta}\log\frac{\beta}{m}\Big),$$

so the distribution $p_\beta(\theta)$ concentrates around the global optimum of $U(\theta)$ for large $\beta$. Following Raginsky et al. (2017), we refer to such random variables as almost minimizers of the function, meaning that by choosing the parameters of the algorithm, we can obtain $\varepsilon$-minimizers with probability arbitrarily close to one for any fixed $\varepsilon > 0$.

6 Typical assumptions are Lipschitz smoothness and convexity outside a large enough ball, often referred to as dissipativity.

The main idea behind the proof of convergence is to show that the continuous process $\theta_\epsilon(t) := \hat\theta_{[\epsilon^{-1}t]}$ weakly converges to the solution of the following associated Langevin dynamics stochastic differential equation (SDE) as $\epsilon \to 0^+$:

$$d\theta(t) = -P\,\nabla U(\theta(t))\,dt + \sqrt{2\beta^{-1}P}\,dW(t), \quad (6)$$

where $W(t)$ is a standard Wiener process. The Gibbs measure $p_\beta(\theta)$ is an invariant measure of this SDE, whose solution $\theta(t)$ is typically defined via the Itô integral (Chung & Williams, 1990). A key property of the solution $\theta(t)$ is that its distribution converges to the invariant measure $p_\beta(\theta)$ in a suitable distance (see Raginsky et al. (2017) for details). We use a similar technique to prove the convergence of the proposed SGLB algorithm.
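As a reference point for the boosting analogue developed next, here is the plain SGLD step (5) with the preconditioner taken to be the identity, $P = I_m$ (a minimal sketch of the cited method, not code from the paper):

```python
# One SGLD step: an SGD step plus Gaussian noise with covariance (2*eps/beta) I.
import numpy as np

def sgld_step(theta, grad_estimate, eps, beta, rng):
    noise = rng.normal(size=theta.shape)     # N(0_m, I_m)
    return theta - eps * grad_estimate(theta) + np.sqrt(2.0 * eps / beta) * noise
```

Iterating this map with $\epsilon \to 0^+$, $\epsilon\tau \to \infty$ produces samples whose distribution approaches the Gibbs measure $p_\beta$.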
4. SGLB Algorithm

In this section, we describe the proposed SGLB algorithm that combines SGB with Langevin dynamics. Surprisingly, a few simple modifications suffice to convert SGB (Algorithm 1) into a globally converging method (Algorithm 2). The obtained approach is easy to implement, but proving the convergence is non-trivial (see Sections 5 and 6). We further assume that the loss $L(z, y)$ is Lipschitz continuous and Lipschitz smooth in the variable $z$ and that $\inf_z L(z, y) > -\infty$ for any $y$. Since $\mathcal{L}_N(F)$ is a sum of several $L(\cdot, \cdot)$, it inherits all these properties.7

7 We use $\mathcal{L}_N(F)$ and $\mathcal{L}_N(f)$ interchangeably.

First, we add the L2 regularization (2) to the loss for two reasons: 1) regularization is known to improve generalization; 2) we do not assume that $\mathcal{L}_N(F) \to \infty$ as $\|F\|_2 \to \infty$, but we need this property for Langevin dynamics to guarantee the existence of the invariant measure. Lipschitz continuity of the loss implies at most linear growth at infinity, and the L2 regularizer in (2) grows faster than linearly. Thus, we get $\mathcal{L}_N(F, \gamma) \to \infty$ as $\|F\|_2 \to \infty$. According to the SGB procedure (Section 3.2), a new weak learner should fit

$$-\epsilon\,\nabla_{F}\,\mathcal{L}_N(\hat F_\tau, \gamma) = -\epsilon\,\nabla_{F}\,\mathcal{L}_N(\hat F_\tau) - \epsilon\,\nabla_{F}\,\frac{\gamma}{2}\|\hat F_\tau\|_2^2.$$

Note that $-\epsilon\,\nabla_F\,\frac{\gamma}{2}\|\hat F_\tau\|_2^2 = -\epsilon\gamma\hat F_\tau$; thus we can make the exact step for $\frac{\gamma}{2}\|F\|_2^2$ by shrinking the predictions as $(1-\gamma\epsilon)\hat F_\tau$. Due to the linearity of the model, this is equivalent to shrinking the parameters: $(1-\gamma\epsilon)\hat\Theta_\tau$.

Second, we inject Gaussian noise directly into the SGB gradient estimation procedure. Namely, instead of approximating $\hat g_\tau$ by a weak learner as in (3), we approximate $\hat g_\tau + \mathcal{N}\big(0_N, \frac{2N}{\epsilon\beta}I_N\big)$:

$$\text{minimize } \|\theta_{s_\tau}\|_2^2 \quad \text{subject to} \quad \theta_{s_\tau} \in \arg\min_{\theta\in\mathbb{R}^{m_{s_\tau}}} \Big\|{-\epsilon}\Big(\hat g_\tau + \sqrt{\tfrac{2N}{\epsilon\beta}}\,\zeta\Big) - H_{s_\tau}\theta\Big\|_2^2, \quad (7)$$

where $\zeta \sim \mathcal{N}(0_N, I_N)$. Adding Gaussian noise allows us to show that, after a proper time interpolation, the process weakly converges to Langevin dynamics, which in turn leads to global convergence.

Finally, we also add noise to the weak learner selection algorithm. This is done to make the distribution of the trees independent of the gradient noise. Formally, instead of a given distribution $p(s|g)$, we consider the distribution

$$p'(s|g) = \int_{\mathbb{R}^N} p(s|z)\,\mathcal{N}\Big(z\,\Big|\,g, \tfrac{2N}{\epsilon\beta}I_N\Big)\,dz,$$

which is a convolution of $p(s|z)$ with the Gaussian distribution.8 We can easily sample from this distribution by first sampling $z \sim \mathcal{N}\big(z\,\big|\,g, \frac{2N}{\epsilon\beta}I_N\big)$ and then sampling $s \sim p(s|z)$. The overall SGLB procedure is described in Algorithm 2.

8 $\mathcal{N}(z|a, \sigma)$ denotes the density of $\mathcal{N}(a, \sigma)$.

Algorithm 2 SGLB
input: dataset $D_N$, learning rate $\epsilon > 0$, inverse temperature $\beta > 0$, regularization $\gamma > 0$
initialize $\tau = 0$, $f_{\hat\Theta_0}(\cdot) = 0$
repeat
  estimate gradients $\hat g_\tau \in \mathbb{R}^N$ on $D_N$ using $f_{\hat\Theta_\tau}(\cdot)$
  sample $\zeta, \zeta' \sim \mathcal{N}(0_N, I_N)$ independently
  sample tree structure $s_\tau \sim p\big(s\,\big|\,\hat g_\tau + \sqrt{2N/(\epsilon\beta)}\,\zeta'\big)$
  estimate leaf values $\theta_{s_\tau} = -\epsilon\,\Phi_{s_\tau}\big(\hat g_\tau + \sqrt{2N/(\epsilon\beta)}\,\zeta\big)$
  update ensemble $f_{\hat\Theta_{\tau+1}}(\cdot) = (1-\gamma\epsilon)\,f_{\hat\Theta_\tau}(\cdot) + h_{s_\tau}(\cdot, \theta_{s_\tau})$
  update $\tau = \tau + 1$
until stopping criterion met
return: $f_{\hat\Theta_\tau}(\cdot)$
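Putting the three modifications together, a self-contained sketch of the SGLB loop (ours, under the same illustrative assumptions as the SGB snippet above; CatBoost's actual implementation differs in many details) looks as follows:

```python
# Algorithm 2 on the vector of train predictions F: noisy gradients of scale
# sqrt(2N/(eps*beta)) are used with two independent draws -- one for the
# structure selection, one for the leaf values -- and the ensemble is shrunk
# by (1 - gamma*eps) before each new tree is added.
import numpy as np

def sglb_train(X, y, sample_structure, n_iters, eps, beta, gamma, rng):
    N = len(y)
    F = np.zeros(N)
    scale = np.sqrt(2.0 * N / (eps * beta))
    for _ in range(n_iters):
        g = 2.0 * (F - y)                    # gradient estimates (squared error example)
        zeta, zeta_prime = rng.normal(size=(2, N))
        splits = sample_structure(g + scale * zeta_prime)   # s_tau ~ p'(s|g)
        H = leaf_indicators(X, splits)
        counts = np.maximum(H.sum(axis=0), 1.0)
        theta = -eps * (H.T @ (g + scale * zeta)) / counts  # -eps * Phi(g + noise)
        F = (1.0 - gamma * eps) * F + H @ theta             # shrink, then add the tree
    return F
```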
5. Convergence of SGLB

In this section, we formulate our main theoretical results. For this, we need some notation. Let

$$V_\mathcal{B} := \Big\{F = \big(f_\Theta(x_i)\big)_{i=1}^N \ \Big|\ \text{ensembles } \Theta\Big\} \subseteq \mathbb{R}^N.$$

This set encodes all possible predictions of all possible ensembles. It is easy to see that $V_\mathcal{B}$ is a linear subspace of $\mathbb{R}^N$: the sum of any two ensembles is again an ensemble.

Let us also define $P_{s_\tau} := H_{s_\tau}\Phi_{s_\tau}$. For given unbiased estimates of the gradients $\hat g_\tau \in \mathbb{R}^N$ and a weak learner index $s_\tau \in S$, this operation first estimates $\theta_{s_\tau}$ according to (3) and then converts it to the predictions of the weak learner. Clearly, we have $P_{s_\tau}v \in V_\mathcal{B}$ for any $v \in \mathbb{R}^N$. The following lemma characterizes the structure of $P_{s_\tau}$ and is proven in Supplementary A.

Lemma 1. $P_{s_\tau}$ is an orthoprojector onto the image of the weak learner $h_{s_\tau}$, i.e., $P_{s_\tau}$ is symmetric, $P_{s_\tau}^2 = P_{s_\tau}$, and $\operatorname{im} P_{s_\tau} = \operatorname{im} H_{s_\tau}$.

We are ready to formulate the required properties of the probability distribution $p(s|g)$ that determines the choice of the weak learners:

Non-degeneracy: for every $s \in S$ there exists $g^* \in \mathbb{R}^N$ such that for some small $\delta > 0$ we have $p(s|g) > 0$ almost surely for all $g$ with $\|g - g^*\|_2 < \delta$.

Zero-order Positive Homogeneity: for all $\lambda > 0$ and all $g \in \mathbb{R}^N$, we have $p(s|\lambda g) \equiv p(s|g)$ for all $s \in S$.

Informally, the first property requires that each tree structure (denoted by $s$) has a non-zero probability, i.e., by feeding different $g \in \mathbb{R}^N$ to $p(s|g)$ and then sampling $s$, we would eventually sample every possible structure. The second property means that the weak learner selection does not depend on the scale of $g$. This property is usually satisfied by standard gradient boosting algorithms: typically, one sets $p(s|g) = 1$ for the weak learner achieving the lowest MSE when predicting the vector $g \in \mathbb{R}^N$. Clearly, if we rescale $g$, the best-fit weak learner remains the same since it is a linear function of its parameters.

Recall that we also assume the loss $L(z, y)$ to be Lipschitz continuous and Lipschitz smooth in the variable $z$, with $\inf_z L(z, y) > -\infty$ for all $y$. From the stochastic gradient estimates $\hat g_\tau$ we also require $\|\hat g_\tau - \mathbb{E}\,\hat g_\tau\|_2 = O(1)$ with probability one. Now we can state our main theorem.

Theorem 1. The Markov chain $\hat F_\tau$ generated by SGLB (Algorithm 2) weakly converges to the solution of the following SDE as $\epsilon \to 0^+$:

$$dF(t) = -\gamma F(t)\,dt - P'\,\nabla_F\mathcal{L}_N(F(t))\,dt + \sqrt{2\beta^{-1}P'}\,dW(t), \quad (8)$$

where $P' := N\,\mathbb{E}_{s\sim p'(s|0_N)}P_s$.

Based on this theorem, we prove that in the limit, $\mathcal{L}_N(\hat F_\tau)$ concentrates around the global optimum of $\mathcal{L}_N(F)$.

Theorem 2. The following bound holds almost surely for $\epsilon \to 0^+$, $\epsilon\tau \to \infty$:

$$\lim \mathbb{E}\,\mathcal{L}_N(\hat F_\tau) - \inf_{F\in V_\mathcal{B}}\mathcal{L}_N(F) = O\Big(\delta_\Gamma(\gamma) + \frac{d}{\beta}\log\frac{\beta}{d}\Big),$$

where $d = \dim V_\mathcal{B}$ and $\delta_\Gamma(\gamma)$ encodes the error from the regularization and is of order $o(1)$ as $\gamma \to 0^+$.

Note that this theorem is stronger than convergence in probability. Indeed, the random variable $l_\tau = \mathcal{L}_N(\hat F_\tau)$ satisfies $l_\tau \ge l_* = \inf_{F\in V_\mathcal{B}}\mathcal{L}_N(F)$ almost surely, and Theorem 2 implies that we can force $\mathbb{E}\,l_\tau \le l_* + \varepsilon$ for arbitrary $\varepsilon > 0$. Thus, by considering the non-negative random variable $l_\tau - l_* \ge 0$ and applying Markov's inequality, we get $\mathbb{P}(l_\tau - l_* < k\varepsilon) \ge 1 - 1/k$ for arbitrary $k > 0$. Thus, for any $\delta > 0$ and $\eta > 0$, we can take $k = 1/\delta$ and $\varepsilon = \eta\delta$ and obtain:

$$\mathbb{P}\Big(\mathcal{L}_N(\hat F_\tau) - \inf_{F\in V_\mathcal{B}}\mathcal{L}_N(F) < \eta\Big) \ge 1 - \delta,$$

which is exactly convergence to an $\eta$-minimum with probability at least $1-\delta$, where $\delta$ and $\eta$ can be made arbitrarily small.

To derive a generalization bound, we need an additional assumption on the set of weak learners. We assume independence of prediction vectors, i.e., that $(h_{s_1}(x_i, \theta_{s_1}))_{i=1}^N, \ldots, (h_{s_k}(x_i, \theta_{s_k}))_{i=1}^N$ are linearly independent for an arbitrary choice of distinct weak learner indices $s_1, \ldots, s_k \in S$ and parameters $\theta_{s_j} \in \mathbb{R}^{m_{s_j}}$ such that the vector $(h_{s_j}(x_i, \theta_{s_j}))_{i=1}^N$ is non-zero for every $j$. This condition is quite restrictive,9 but it allows us to deduce a generalization gap bound, which gives insight into how the choice of $\mathcal{B}$ affects generalization. The independence assumption can be satisfied if we assume that each feature vector $x \in \mathbb{R}^k$ has independent binary components and $N \gg 1$. Alternatively, we may enforce this condition by restricting the space $\mathcal{H}$ at each iteration, taking into account the currently used weak learners. The latter is a promising direction for future analysis.

9 We are currently working on a much less restrictive generalization result for gradient boosting methods.

Theorem 3. If the above assumptions are satisfied, then the generalization gap can be bounded as

$$\mathbb{E}_{\Theta\sim p_\beta(\Theta)}\,\mathcal{L}(f_\Theta(\cdot)) - \mathbb{E}_{\Theta\sim p_\beta(\Theta)}\,\mathcal{L}_N(f_\Theta(\cdot)) = O\Big(\frac{(\beta+d)^2}{\lambda_* N}\Big),$$

where $p_\beta(\Theta)$ is the limiting distribution of $\Theta_t$ as $\epsilon \to 0^+$, $\epsilon t \to \infty$; $d = \dim V_\mathcal{B}$ is independent of $N$ for large enough $N \gg 1$; and $\lambda_*$ is a so-called uniform spectral gap parameter that encodes the hardness of the problem and depends on the choice of $L(z, y)$.

Note that $1/\lambda_*$ is not dimension-free and in general may depend on $d$, $\beta$, $\Gamma$, and $\gamma$. Moreover, the dependence can be exponential in $d$ (Raginsky et al., 2017). However, such exponential dependence is very conservative since the bound applies to any Lipschitz continuous and Lipschitz smooth function. For instance, in the convex case, we can get a dimension-free bound for $1/\lambda_*$. We refer to Section 6.4 for further details and intuition behind the statement of Theorem 3.

Corollary 1. It follows from Theorems 2 and 3 that SGLB has the following performance as $\epsilon \to 0^+$, $\epsilon\tau \to \infty$:

$$\lim \mathcal{L}(\hat F_\tau) - \inf_{F\in V_\mathcal{B}}\mathcal{L}(F) = O\Big(\delta_\Gamma(\gamma) + \frac{d}{\beta}\log\frac{\beta}{d} + \frac{(\beta+d)^2}{\lambda_* N}\Big). \quad (9)$$

Here $\delta_\Gamma(\gamma)$ encodes the error from the regularization, which is negligible in the $\gamma \to 0^+$ limit. However, $1/\lambda_*$ also depends on $\gamma$ (e.g., in the convex case the dependence is of order $O(1/\gamma)$), and thus the optimal $\gamma$ must be strictly greater than zero. Finally, by taking large enough $N$, the bound (9) can be made arbitrarily small, and therefore our algorithm reaches the ultimate goal stated in Section 3.2. The obtained bound answers the question of how the choice of the tuple $\mathcal{B} = (\mathcal{H}, p)$ affects the optimization quality. Note that our analysis, unfortunately, gives no insight into the speed of the algorithm's convergence, and we are currently working on filling this gap.
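Lemma 1 is the backbone of the analysis above and is easy to verify numerically. A quick sanity check (ours, reusing the illustrative leaf_indicators helper) for $P_s = H_s\Phi_s$ with $\Phi_s = (H_s^T H_s)^+ H_s^T$:

```python
# P_s should be the orthogonal projector onto im H_s: symmetric, idempotent,
# and acting as the identity on the image of H_s.
import numpy as np

H = leaf_indicators(np.random.randn(8, 3), [(0, 0.0), (1, -0.2)])
P = H @ np.linalg.pinv(H.T @ H) @ H.T
assert np.allclose(P, P.T)       # symmetric
assert np.allclose(P @ P, P)     # idempotent
assert np.allclose(P @ H, H)     # identity on im H_s
```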
6. Proofs

6.1. Preliminaries

Before we prove the main theorems, let us analyze the distribution $p'(s|g)$.

Lemma 2. $p'(s|g) = p'(s|0_N) + O(\|g\|_2)$ as $\|g\|_2 \to 0$.

Proof. The statement follows from the uniform boundedness of the probabilities $p(s|g)$. By definition, $p'(s|g)$ is a Gaussian smoothing of $p(s|g)$, and by the properties of Gaussian smoothing it is a Lipschitz-continuous function (Nesterov & Spokoiny, 2017).

Statement 1. $\mathbb{E}_{s\sim p'(s|g)}P_s$ is non-degenerate as an operator from $V_\mathcal{B}$ to $V_\mathcal{B}$ for all $g \in \mathbb{R}^N$.

Proof. First, we use the Non-degeneracy property of $p(s|g)$: it implies that after smoothing $p(s|g)$ into $p'(s|g)$ we have $p'(s|g) > 0$ for all $s \in S$ and all $g \in \mathbb{R}^N$. Finally, we note that $V_\mathcal{B}$ is formed as a span of both images and coimages of the $P_s$. Therefore, $\mathbb{E}_{s\sim p'(s|g)}P_s$ is a non-degenerate operator within $V_\mathcal{B}$.

6.2. Proof of Theorem 1

We heavily rely on Theorem 1 (p. 464) of Gikhman & Skorokhod (1996). We note that this theorem is proved in dimension one in Gikhman & Skorokhod (1996), but it remains valid in arbitrary dimension (Kushner, 1974). We formulate a slightly weaker version of the theorem that is sufficient for our case and follows straightforwardly from the original one.

Theorem 4 (Gikhman & Skorokhod (1996)). Assume that we are given a Markov chain $\hat F_\tau$ satisfying the following relations:

$$\epsilon^{-1}\,\mathbb{E}\big[\hat F_{\tau+1} - \hat F_\tau\,\big|\,\hat F_\tau\big] = A(\hat F_\tau) + O(\sqrt{\epsilon}),$$
$$\epsilon^{-1}\,\mathrm{Var}\big[\hat F_{\tau+1}\,\big|\,\hat F_\tau\big] = B(\hat F_\tau)B(\hat F_\tau)^T + O(\sqrt{\epsilon}),$$

for some vector and matrix fields $A(\cdot)$, $B(\cdot)$ that are both Lipschitz, with $B(\cdot)$ a non-degenerate matrix for every argument. Then, for any fixed $T > 0$, the process $F_\epsilon(t) := \hat F_{[\epsilon^{-1}t]}$ converges weakly on the interval $t \in [0, T]$ to the solution process of the SDE

$$dF(t) = A(F(t))\,dt + B(F(t))\,dW(t),$$

where $W(t)$ is a standard Wiener process.

Recall that $P_{s_\tau}$ is the orthoprojector defined by the sampled weak learner index $s_\tau$. Let us denote the corresponding random projector by $P_\tau$ ($P_\tau$ depends on a randomly sampled index). Then, $\hat F_\tau$ obeys the following update equation:

$$\hat F_{\tau+1} = (1-\gamma\epsilon)\hat F_\tau - \epsilon N P_\tau\,\widehat{\nabla_F}\mathcal{L}_N(\hat F_\tau) + \sqrt{2\epsilon\beta^{-1}N}\,P_\tau\zeta.$$

Here we exploit the fact that $P_\tau$ is a projector: $P_\tau^2 \equiv P_\tau$. The equation clearly mimics the SGLD update (5), with the only difference that $P_\tau$ is not a constant preconditioner but a random projector. So, the SGLB update can be seen as SGLD on random subspaces of $V_\mathcal{B}$.

Recall that we require $\|\hat g_\tau - \mathbb{E}\,\hat g_\tau\|_2 \le C_1$ with probability one for some constant $C_1$. Also, $\nabla_F\mathcal{L}_N(F)$ is uniformly bounded by some constant $C_2$ due to Lipschitz continuity. So, $\hat g_\tau$ is uniformly bounded with probability one by the constant $C_3 = C_1 + C_2$.

Using Zero-order Positive Homogeneity, we get $p'(s|g) = p'(s) + O(\sqrt{\epsilon})$, where $p'(s) := p'(s|0_N)$. To prove this, we note that

$$p'(s|g) = \mathbb{E}_{\varepsilon\sim\mathcal{N}(0_N, I_N)}\,p\Big(s\,\Big|\,g + \sqrt{\tfrac{2N}{\epsilon\beta}}\,\varepsilon\Big) = \mathbb{E}_{\varepsilon\sim\mathcal{N}(0_N, I_N)}\,p\Big(s\,\Big|\,\sqrt{\tfrac{\epsilon\beta}{2N}}\,g + \varepsilon\Big) = p'(s) + O\big(\sqrt{\epsilon}\,\|g\|_2\big),$$

where the second equality uses homogeneity (rescaling the argument by $\sqrt{\epsilon\beta/2N}$) and the final equality comes from Lemma 2. So, noting that $C_4 := C_3\sqrt{\beta/2N}$ is a constant and $\|\hat g_\tau\|_2 \le C_3$, we obtain that the distribution $p'(s|\hat g_\tau)$ converges to $p'(s)$ as $\epsilon \to 0^+$ uniformly for each $s$ (note that $s \in S$ and we assumed $|S| < \infty$).

Let us define the implicit limiting preconditioner matrix of the boosting algorithm $P' := N\,\mathbb{E}_{s\sim p'(s)}H_s\Phi_s : \mathbb{R}^N \to \mathbb{R}^N$. This expectation exists since each term is a projector and hence is bounded by 1 in the spectral norm. Since $p'(s|g) = p'(s) + O(\sqrt{\epsilon}\,N^{-1})$, we get $N\,\mathbb{E}_{s\sim p'(s|g)}H_s\Phi_s = P' + O(\sqrt{\epsilon}\,N)$ in the operator norm, since each $P_s = H_s\Phi_s$ is essentially bounded by one in the operator norm. Therefore, we obtain

$$\epsilon^{-1}\,\mathbb{E}\big[\hat F_{\tau+1} - \hat F_\tau\,\big|\,\hat F_\tau\big] = -\gamma\hat F_\tau - P'\,\nabla_F\mathcal{L}_N(\hat F_\tau) + O(\sqrt{\epsilon}),$$
$$\epsilon^{-1}\,\mathrm{Var}\big[\hat F_{\tau+1}\,\big|\,\hat F_\tau\big] = 2\beta^{-1}P' + O(\sqrt{\epsilon}).$$

Henceforth, Theorem 4 applies, ensuring the weak convergence of $F_\epsilon(t) := \hat F_{[\epsilon^{-1}t]}$ to the Langevin dynamics (8) as $\epsilon \to 0^+$ for $t \in [0, T]$, for all $T > 0$.
6.3. Proof of Theorem 2

To prove the theorem, we study the properties of the limiting Langevin equation, which describes the evolution of $F(t)$ in the space $V_\mathcal{B}$. The trick is to observe that $V_\mathcal{B} = \operatorname{im} P' = \operatorname{coim} P'$ due to the Non-degeneracy of the sampling $p(s|g)$. Henceforth, we can easily factorize $\mathbb{R}^N = V_\mathcal{B} \oplus \ker P'$ and assume that we actually live in $V_\mathcal{B}$, where formally $P' > 0$ as an operator from $V_\mathcal{B}$ to $V_\mathcal{B}$.10 Now, consider the following Langevin-like SDE for $P' > 0$, $(P')^T = P'$ in $\mathbb{R}^d$:

$$dF(t) = -\gamma F(t)\,dt - P'\,\nabla_F\mathcal{L}_N(F(t))\,dt + \sqrt{2\beta^{-1}P'}\,dW(t). \quad (10)$$

10 $P' > 0$ means that all eigenvalues of $P'$ are positive reals.

The classic form of the equation can be obtained using an implicitly regularized loss function:11

$$\mathcal{L}_N(F, \gamma) := \mathcal{L}_N(F) + \frac{\gamma}{2}\|\Gamma F\|_2^2,$$

where $\Gamma := (P')^{-1/2} > 0$ is a regularization matrix, so that $P'\,\nabla_F\,\frac{\gamma}{2}\|\Gamma F\|_2^2 = \gamma F$. Due to the well-known properties of L2 regularization, for small enough $\gamma > 0$, minimization of $\mathcal{L}_N(F, \gamma)$ leads to an almost minimization of $\mathcal{L}_N(F)$ with an error of order $O(\delta_\Gamma(\gamma))$ for some function $\delta_\Gamma(\gamma)$ depending on $\mathcal{L}_N$ and $\Gamma$, which is negligible as $\gamma \to 0^+$.12 Thus, the error $\delta_\Gamma(\gamma)$ heavily depends on $\Gamma = (P')^{-1/2}$. Using $\mathcal{L}_N(F, \gamma)$, we rewrite Equation (10) as:

$$dF(t) = -P'\,\nabla_F\mathcal{L}_N(F(t), \gamma)\,dt + \sqrt{2\beta^{-1}P'}\,dW(t).$$

11 Here we redefine the notation $\mathcal{L}_N(F, \gamma)$ used before.
12 If $\min_F \mathcal{L}_N(F)$ exists, one can show that $\delta_\Gamma(\gamma) = O(\lambda_{\max}(\Gamma^2)\gamma)$ with the $O(\cdot)$ depending on $\mathcal{L}_N$.

Then, the results of Erdogdu et al. (2018); Raginsky et al. (2017) apply, ensuring that $\mathcal{L}_N(\hat F_\tau, \gamma)$ converges to an almost-minimizer of $\mathcal{L}_N(F, \gamma)$ with an error of order $O\big(\frac{d}{\beta}\log\frac{\beta}{d}\big)$, where $d = \dim V_\mathcal{B}$. From this the theorem follows.

6.4. Proof of Theorem 3

We reduced the problem of convergence of a general boosting algorithm to the problem of convergence of the predictions $\hat F_\tau := (f_{\hat\Theta_\tau}(x_i))_{i=1}^N$ in the space $V_\mathcal{B}$ of all possible ensemble predictions on the dataset $D_N$. Using $|S| < \infty$, we define the design matrix $\Psi := (\varphi(x_1), \ldots, \varphi(x_N))^T \in \mathbb{R}^{N\times m}$, so we can write $F = \Psi\Theta$ and $\mathcal{L}_N(F, \gamma) = \mathcal{L}_N(\Psi\Theta) + \frac{\gamma}{2}\|\Gamma\Psi\Theta\|_2^2$. Note that $V_\mathcal{B}$ can be obtained as the image of $\Psi$. Let us consider the uniform spectral gap parameter13 $\lambda_* \ge 0$ for the distribution

$$p_\beta(\Theta) := \frac{\exp(-\beta\mathcal{L}_N(\Psi\Theta, \gamma))}{\int_{\mathbb{R}^m}\exp(-\beta\mathcal{L}_N(\Psi\Theta, \gamma))\,d\Theta}.$$

13 We refer to Raginsky et al. (2017) for the definition of a uniform spectral gap.

To show that $p_\beta(\Theta)$ is a completely and correctly defined distribution, we first note that $\ker\Psi = \bigoplus_{s\in S}\ker h_s$, where $\ker h_s := \{\theta_s \in \mathbb{R}^{m_s} : h_s(x_i, \theta_s) = 0\ \forall x_i \in D_N\} \subseteq \mathbb{R}^{m_s}$. Indeed, this is equivalent to our assumption on the independence of prediction vectors. Then, $\ker\Psi$ has the right structure, i.e., we have basis weak learners for $\mathcal{B}$. Second, we use the factorization trick to factorize $\mathbb{R}^{m_s} = \ker h_s \oplus (\ker h_s)^\perp$, and hence w.l.o.g. we can assume $\ker h_s = \{0_{m_s}\}$. The latter implies that w.l.o.g. $\ker\Psi = \{0_m\}$, so $p_\beta(\Theta)$ is a correctly defined distribution on $\mathbb{R}^m$. Moreover, observe that in this case $m = d = \dim V_\mathcal{B}$, and thus $d$ is independent of $N$ for large enough $N \gg 1$.

Having $p_\beta(\Theta) \propto \exp(-\beta\mathcal{L}_N(\Psi\Theta, \gamma))$, from Raginsky et al. (2017) we can transfer the bound $O\big(\frac{(\beta+d)^2}{\lambda_* N}\big)$ for the generalization gap. Since we added the L2 regularizer to the loss, which is Lipschitz smooth and continuous, we necessarily obtain dissipativity and thus $\lambda_* > 0$ (Raginsky et al., 2017). This concludes the proof of the theorem.

In the presence of convexity, we can get a dimension-free bound for $1/\lambda_*$. To see this, we bound $1/\lambda_* \le c_P(p_\beta)$, where $c_P$ is the Poincaré constant of $p_\beta(\Theta)$. If $L(\cdot, \cdot)$ is convex, then $\beta\mathcal{L}_N(\Psi\Theta, \gamma)$ must be strongly convex with constant $\frac{\kappa\beta\gamma}{2}$, where $\kappa := \lambda_{\min}(\Psi^T\Gamma^2\Psi)$ is the smallest eigenvalue of $\Psi^T\Gamma^2\Psi > 0$. Then, $p_\beta$ is strongly log-concave, so by transferring the Poincaré inequality for strongly log-concave distributions from Milman (2007), we obtain the dimension-free bound $1/\lambda_* \le \frac{1}{\kappa\gamma\beta}$.

7. Experiments

7.1. Implementation

Our implementation of SGLB is available within the open-source CatBoost gradient boosting library. CatBoost is known to achieve state-of-the-art results across a wide variety of datasets (Prokhorenkova et al., 2018). As the baseline, we use the standard SGB algorithm implemented within the same library. Recall that in Section 5 we require several properties to be satisfied for the distribution $p(s|g)$. Importantly, they are all satisfied for CatBoost, as we show in Supplementary B.
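For practitioners, enabling SGLB in CatBoost amounts to a single flag. The snippet below is our usage sketch: the langevin option is stated by the authors, while reading diffusion_temperature and model_shrink_rate as the knobs for $\beta$ and $\gamma$ is our interpretation of the library's parameter names and should be checked against the CatBoost documentation.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    langevin=True,               # switch from SGB to SGLB
    diffusion_temperature=1e3,   # assumed to play the role of beta
    model_shrink_rate=1e-3,      # assumed to play the role of gamma
)
# model.fit(X_train, y_train, eval_set=(X_val, y_val))
```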
7.2. Direct Accuracy Optimization via Smooth Loss Approximation

To show the power of SGLB for non-convex multimodal optimization, we select accuracy (the 0-1 loss) for direct optimization by our framework:

$$0\text{-}1\ \mathrm{loss} = 1 - \mathrm{accuracy} = \frac{1}{N}\sum_{i=1}^N \mathbb{1}\{(2y_i - 1)f(x_i|\theta) \le 0\}, \quad (11)$$

where $y_i \in \{0, 1\}$. To make this function Lipschitz smooth and Lipschitz continuous, we approximate the indicator by a sigmoid function and minimize:

$$\mathcal{L}_N(f) := 1 - \frac{1}{N}\sum_{i=1}^N \sigma\big(\varsigma^{-1}(2y_i - 1)f(x_i|\theta)\big), \quad (12)$$

where $\sigma(x) = (1 + \exp(-x))^{-1}$ and $\varsigma > 0$ is a hyperparameter. Such smoothing is known as Smooth Loss Approximation (SLA) (Nguyen & Sanner, 2013). If $f(x_i|\theta) \ne 0$ for all $i$, then $\mathcal{L}_N(f)$ converges to the 0-1 loss as $\varsigma \to 0^+$.

To apply SGLB, we need to ensure Lipschitz smoothness and continuity. Observe that the gradient is uniformly bounded since $\frac{d}{dz}\sigma(z) = (1-\sigma(z))\sigma(z) \le 1$, which implies Lipschitz continuity. Moreover, $\frac{d^2}{dz^2}\sigma(z) = (1 - 2\sigma(z))\frac{d}{dz}\sigma(z)$ is also uniformly bounded, which implies Lipschitz smoothness. Thus, SGLB is applicable to SLA.
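The SLA objective (12) and the per-object gradient that a boosting iteration consumes can be written in a few lines (a sketch under our conventions; scipy's expit is used as a numerically stable sigmoid):

```python
import numpy as np
from scipy.special import expit  # sigma(x) = 1 / (1 + exp(-x))

def sla_loss_and_grad(z, y, varsigma=0.1):
    """Smoothed 0-1 loss (12) for labels y in {0, 1} and raw scores z = f(x)."""
    t = (2.0 * y - 1.0) * z / varsigma           # scaled margins
    s = expit(t)
    loss = 1.0 - s.mean()
    grad = -(2.0 * y - 1.0) * s * (1.0 - s) / (varsigma * len(z))  # d loss / d z_i
    return loss, grad
```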
7.3. Illustration on Synthetic Data

First, we analyze the performance of SGLB in a simple synthetic experiment. We randomly generate three-dimensional feature vectors $x \sim \mathcal{N}(0_3, I_3)$ and let $y = \mathbb{1}\{\tilde y > 0\}$ for $\tilde y \sim \mathcal{N}(\sin(x_1x_2x_3), 1)$. We add a significant amount of noise to the target since in this case the loss is very likely to be multimodal.

We performed cross-validation with 100 folds, each containing 1000 examples for training and 1000 examples for testing. To see the difference between the methods, we consider simple models based on decision trees of depth 1. We fix border_count = 5 (the number of different splits allowed for each feature). We set the learning rate to 0.1 and $\varsigma = 10^{-1}$ for SLA. For SGLB, we set $\beta = 10^3$ and $\gamma = 10^{-3}$. Moreover, we set the subsampling rate of SGB to 0.5.

The results are presented in Table 1. Here the p-values (according to the t-test) are reported relative to SGLB.

Table 1. Optimization on synthetic data
Approach        0-1 loss   p-value
Logloss + GB    0.482      2·10⁻⁸
SLA + GB        0.475      5·10⁻³
SLA + SGB       0.474      7·10⁻³
SLA + SGLB      0.470      —

We see that for the SLA optimization, SGLB outperforms classic gradient boosting (GB). Then, we analyze whether the standard subsampling used in SGB can help to avoid local optima and improve the performance of GB. We see that SGB with a sample rate of 0.5 is indeed slightly better than GB but is still worse than SGLB, which has a theoretically grounded randomization. Finally, we note that optimizing the convex logistic loss instead of SLA leads to the worst performance. This means that the logistic loss has a different optimum and, for better performance, non-convex loss functions should not be replaced by convex substitutes.

7.4. Comparison on Real Data

In this section, we show that SGLB has superior performance on various real datasets. We mostly focus on the classification task with the 0-1 loss (11) and optimize the SLA loss $L(z, y) = 1 - \sigma(\varsigma^{-1}z(2y-1))$ with $\varsigma = 10^{-1}$, which is Lipschitz smooth and continuous, so SGLB can be applied. The datasets are described in Table 1 of the supplementary materials. We split each dataset into train, validation, and test sets in proportion 65/15/20. We tune the parameters on the validation set using 200 iterations of random search and select the best iteration on the validation set; the details are given in Supplementary C.2. For all algorithms, the maximal number of trees is set to 1000.

The results are shown in Table 2. Differences are significant according to the two-tailed t-test when the p-value is below 0.05. We note that SGB uses leaf regularization, while SGLB does not. We see that for the non-convex 0-1 loss, SGLB's performance is superior to SGB, which clearly shows the necessity of non-convex optimization methods in machine learning.

Table 2. 0-1 loss optimized via SLA
Dataset    SGB    SGLB   p-value
Appetency  1.8    1.8    1
Churn      7.1    7.2    0.18
Upselling  4.8    4.7    0.04
Adult      13.2   12.8   0.01
Amazon     5.2    4.8    0.01
Click      16.0   15.9   3·10⁻⁶
Epsilon    11.7   11.7   0.44
Higgs      25.2   24.8   0.04
Internet   10.1   9.8    0.05
Kick       9.7    9.6    0.02

We also compare SGB with SGLB for the convex logistic regression loss $L(z, y) = -y\log\sigma(z) - (1-y)\log(1-\sigma(z))$. Similarly to accuracy, it is easy to show that this loss is Lipschitz smooth and Lipschitz continuous since $\frac{d}{dz}L(z, y) = -y + \sigma(z)$ and $\frac{d^2}{dz^2}L(z, y) = \sigma(z)(1-\sigma(z))$. The results of the comparison are shown in Table 3. We see that in most cases, SGLB and SGB are comparable, but SGLB is preferable. Importantly, SGLB has better performance on the large Epsilon and Higgs datasets.

Table 3. Logloss optimization
Dataset    SGB     SGLB    p-value
Appetency  0.074   0.073   0.6
Churn      0.229   0.230   0.02
Upselling  0.164   0.163   0.05
Adult      0.274   0.276   0.07
Amazon     0.144   0.145   0.09
Click      0.395   0.394   4·10⁻⁴
Epsilon    0.274   0.273   0.01
Higgs      0.480   0.479   2·10⁻³⁵
Internet   0.226   0.225   0.26
Kick       0.288   0.289   0.04

8. Conclusion & Future Work

Our experiments demonstrate that the theoretically grounded SGLB algorithm also shows promising experimental results. Namely, SGLB can be successfully applied to the optimization of the 0-1 loss, which is known to be non-convex. Interestingly, since the first version of this paper appeared online, SGLB has found other exciting applications. For instance, Ustimenko & Prokhorenkova (2020) show that SGLB can be applied to learning to rank, which is a classic information retrieval problem. The authors propose to smooth any given ranking loss and then directly optimize it. The obtained function turns out to be non-convex, and SGLB provably allows for reaching the global optimum. Malinin et al. (2021) apply SGLB to uncertainty estimation. Namely, they use the convergence of the parameters to the stationary distribution $p_\beta(\Theta)$ to sample from the posterior, which allows for theoretically grounded uncertainty estimates.

There are plenty of directions for future research that can potentially further improve the performance of SGLB. Recall that our generalization gap estimate relies on the restrictive assumption of linear independence of weak learners. Thus, a promising direction is to modify the algorithm so that some form of Langevin diffusion is still preserved in the limit with good provable generalization gap guarantees. Another idea is to incorporate momentum into boosting so that the limit is governed by Hamiltonian dynamics (Gao et al., 2018) instead of ordinary Langevin dynamics. There are several theoretical attempts to incorporate momentum into boosting, like the Historical GBM (Feng et al., 2018), so the question is: if we use the Historical GBM approach or its modification, would that be enough to claim a Hamiltonian dynamics equation in the limit? Finally, our research does not investigate the rates of convergence, which is another promising direction. It would give a better understanding of the trade-offs between the algorithm's parameters.

References

Bühlmann, P. and Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. arXiv e-prints, arXiv:0804.2752, Apr 2008.
Bühlmann, P. and Yu, B. Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003. doi: 10.1198/016214503000125.

CatBoost. CatBoost GitHub repository. https://github.com/catboost/catboost, 2020.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794, New York, NY, USA, 2016. ACM.

Chung, K. L. and Williams, R. J. Introduction to Stochastic Integration, volume 2. Springer, 1990.

Dombry, C. and Esstafa, Y. Behavior of linear L2-boosting algorithms in the vanishing learning rate asymptotic. arXiv preprint arXiv:2012.14657, 2020.

Erdogdu, M. A., Mackey, L., and Shamir, O. Global non-convex optimization with discretized diffusions. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pp. 9694–9703, 2018.

Feng, Z., Xu, C., and Tao, D. Historical gradient boosting machine. In Lee, D., Steen, A., and Walsh, T. (eds.), GCAI-2018. 4th Global Conference on Artificial Intelligence, volume 55 of EPiC Series in Computing, pp. 68–80, 2018.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.

Gao, X., Gürbüzbalaban, M., and Zhu, L. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration, 2018.

Gelfand, S. B., Doerschuk, P. C., and Nahhas-Mohandes, M. Theory and application of annealing algorithms for continuous optimization. In Proceedings of the 24th Conference on Winter Simulation, pp. 494–499, 1992.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Gikhman, I. and Skorokhod, A. Introduction to the Theory of Random Processes. Dover Books on Mathematics. Dover Publications, 1996.

Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. CoRR, abs/1708.00075, 2017.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, pp. 3146–3154, 2017.

Kushner, H. J. On the weak convergence of interpolated Markov chains to a diffusion. Annals of Probability, 2(1):40–50, 1974.

Liu, T.-Y. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

Malinin, A., Prokhorenkova, L., and Ustimenko, A. Uncertainty in gradient boosting via ensembles. In International Conference on Learning Representations, 2021.

Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pp. 512–518, 2000.

Milman, E. On the role of convexity in isoperimetry, spectral gap and concentration. arXiv e-prints, arXiv:0712.4092, Dec 2007.

Nesterov, Y. and Spokoiny, V. G. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.

Nguyen, T. and Sanner, S. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085–1093, 2013.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648, 2018.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. CoRR, abs/1702.03849, 2017.

Ustimenko, A. and Prokhorenkova, L. StochasticRank: Global optimization of scale-free discrete functions. In International Conference on Machine Learning, pp. 9669–9679, 2020.

Vorobev, A., Ustimenko, A., Gusev, G., and Serdyukov, P. Learning to select for a predefined ranking. In International Conference on Machine Learning, pp. 6477–6486, 2019.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), 2011.

Zhang, T. and Yu, B. Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33(4):1538–1579, 2005.

Zhou, Y. and Hooker, G. Boulevard: Regularized stochastic gradient boosted trees and their limiting distribution. arXiv preprint arXiv:1806.09762, 2018.