# Regularized Gradient Boosting

Corinna Cortes, Google Research, New York, NY 10011, corinna@google.com
Mehryar Mohri, Google & Courant Institute, New York, NY 10012, mohri@google.com
Dmitry Storcheus, Courant Institute & Google, New York, NY 10012, dstorcheus@google.com

Abstract

Gradient Boosting (GB) is a popular and very successful ensemble method for binary trees. While various types of regularization of the base predictors are used with this algorithm, the theory that connects such regularizations with generalization guarantees is poorly understood. We fill this gap by deriving data-dependent learning guarantees for GB used with regularization, expressed in terms of the Rademacher complexities of the constrained families of base predictors. We introduce a new algorithm, called RGB, that directly benefits from these generalization bounds and that, at every boosting round, applies the Structural Risk Minimization principle to search for a base predictor with the best empirical fit versus complexity trade-off. Inspired by Randomized Coordinate Descent, we provide a scalable implementation of our algorithm, able to search over large families of base predictors. Finally, we provide experimental results demonstrating that our algorithm achieves significantly better out-of-sample performance on multiple datasets than the standard GB algorithm used with its regularization.

1 Introduction

Ensemble methods form a powerful family of techniques in machine learning that combine multiple base predictors to create more accurate ones. These methods are often very effective in practice and can achieve a significant performance improvement over the individual base predictors [Quinlan et al., 1996, Caruana et al., 2004, Freund et al., 1996, Dietterich, 2000]. ADABOOST [Freund and Schapire, 1997] and its variants are among the most prominent ensemble methods, since they are both very effective in practice and benefit from well-studied theoretical margin guarantees [Freund and Schapire, 1997, Koltchinskii and Panchenko, 2002]. Gradient Boosting (GB) [Friedman, 2001] is another popular tree-based ensemble method that has inspired a number of widely-used software libraries (e.g., XGBOOST [Chen and Guestrin, 2016], MART [Friedman, 2002], and DART [Rashmi and Gilad-Bachrach, 2015]) and has frequently ranked among the top methods in benchmark competitions such as Kaggle. But, while it is often introduced and presented differently, GB exactly coincides with ADABOOST when the objective function used is the exponential function, as shown for example by Schapire and Freund [2012]. More generally, both of these algorithms are instances of Functional Gradient Descent [Mason et al., 2000, Grubb and Bagnell, 2011] when non-increasing, convex, and differentiable upper bounds on the zero-one loss are used.

Viewed from the Functional Gradient Descent perspective, at every boosting step, GB seeks a predictor function h that is closest to the functional gradient of the objective within some constrained family of base predictors H. Specifying this base predictor family H such that the selected function does not overfit the gradient, as well as defining an efficient search procedure over H, is crucial for the success of the algorithm. In most practical instances, several types of constraints are imposed to do so. As an example, for binary regression trees, XGBOOST bounds the number of leaves and the norm of the vector of leaf values. This can be viewed as a regularization.
However, to our knowledge, no theoretical analysis has been provided for these commonly-used constraints. A natural question is whether one can derive learning guarantees that explain how this regularization on H, and perhaps even more general forms of constraints on functions $h \in H$, are connected to the generalization performance of GB. We seek inspiration from the margin-based learning bounds given for ADABOOST [Schapire et al., 1997, Mohri et al., 2012]. These guarantees, however, do not provide a detailed analysis of the constraints on the families of tree base predictors, nor do they provide guidance on how to conduct an efficient search of these families to select a predictor during each boosting round.

We fill this gap by providing a comprehensive analysis of regularization in GB and derive learning guarantees that explain what type of regularization should be used and how. We give data-dependent learning bounds for GB with regularization, expressed in terms of the Rademacher complexities of the constrained hypothesis sub-families from which the base predictors are selected, as well as the ensemble mixture weights. We present a new algorithm, called RGB for Regularized Gradient Boosting, which generalizes the existing gradient boosting methods by introducing a general functional q-norm constraint for the families of the tree base predictors. Our algorithm and its objective function are directly guided by the theory we develop. Our bound suggests that the Structural Risk Minimization (SRM) principle [Vapnik, 1992] should be used to break down H into subsets of varying complexity and, at each round, to select a base learner h from a subset that provides the best trade-off between proximity to the functional gradient and complexity. Applying SRM to search over subsets of H is challenging, since these subsets are often extremely rich, possibly infinite; an example is the family of decision trees with bounded depth used in GB. We provide a solution to this problem of expensive search and show how Randomized Coordinate Descent [Nesterov, 2012] can be used to search over H efficiently, guided by our generalization bounds. Finally, this paper provides experimental results demonstrating that our algorithm achieves significantly better out-of-sample performance than baselines such as XGBOOST on multiple datasets. We give specific bounds, as well as the pseudocode and experimental results, for the families of binary regression trees, but our analysis can be extended to broader families of functions, such as SVMs [Cortes and Vapnik, 1995] and Deep Neural Networks [LeCun et al., 2015].

The paper is organized as follows. In Section 2, we introduce what we name the Regularized Gradient Boosting framework. In Section 3, we derive a Rademacher complexity bound for the families of regularized regression trees, which allows us to establish learning guarantees for Regularized Gradient Boosting. This bound directly inspires the optimization objective and the RGB algorithm presented in Section 4, which benefits from the guarantees following from the SRM principle. A non-uniform randomized search over the families of base predictors provides an efficient solution. In Section 5, we present our experimental results, which illustrate the benefits of the RGB algorithm.
2 Regularized Gradient Boosting

In this section, we examine the correspondence between gradient descent in functional spaces and coordinate descent in vector spaces. This connection will help us rigorously define a Regularized Gradient Boosting learning scenario and develop a scalable implementation for it.

2.1 Gradient Boosting as Functional Gradient Descent

Let $\mathcal{X}$ denote the input space, and let $\mathcal{F}$ be an inner product space of functions from $\mathcal{X}$ to $\mathbb{R}$. We define a restricted family of functions $\mathcal{H} \subseteq \mathcal{F}$ to be a set of base hypotheses. In a standard supervised learning scenario, the training and test points are drawn i.i.d. according to some distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, +1\}$, and $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is a training sample of size $m$ drawn from $\mathcal{D}^m$. In this scenario, a general boosting algorithm selects a sequence of functions $h_1, \ldots, h_T$ from $\mathcal{H}$ to minimize a certain empirical loss $L \colon \mathcal{F} \to \mathbb{R}$ [Friedman, 2001, Grubb and Bagnell, 2011, Mason et al., 2000, Schapire, 1999, Cortes et al., 2014]. The specification of $\mathcal{H}$ and the method of selecting each $h_t \in \mathcal{H}$ are essential for the success of boosting algorithms. In fact, different answers to these two questions have resulted in distinct and separately-studied algorithms, such as GB, ADABOOST, and LOGITBOOST [Friedman et al., 1998].

The goal of boosting algorithms is typically to minimize an empirical loss functional:

$$L(F) = \frac{1}{m} \sum_{i=1}^{m} \Phi\big(y_i, F(x_i)\big), \qquad (1)$$

where $F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ such that $\forall t \in [1, T]\colon h_t \in \mathcal{H}$. Popular ensemble learning algorithms, such as ADABOOST and GB, despite having originated in different research communities at different times, are particular instances of a more general algorithm, Functional Gradient Descent. The objective in Equation 1 is viewed by Functional Gradient Descent as a functional rather than a vector-valued function, with the goal of minimizing $L$ over $\mathcal{F}$ by taking steps in the direction of steepest descent, $F \leftarrow F - \eta \nabla L(F)$, for some positive learning rate $\eta \in \mathbb{R}$. In the learning scenario described above, only the trace of $F$ on $x_1, \ldots, x_m$ is observable; therefore, the functional gradient of $L$ is $\nabla L = \big(\frac{\partial L(F)}{\partial F(x_1)}, \ldots, \frac{\partial L(F)}{\partial F(x_m)}\big)$. This makes the Functional Gradient Descent update equal to $F(x_i) \leftarrow F(x_i) - \eta \frac{\partial L(F)}{\partial F(x_i)}$. Of course, to make sure this functional update is well defined and to avoid over-fitting, it is natural to restrict the functional update to some hypothesis set $\mathcal{H}$, which implies the following form of the functional update:

$$h^{*} = \operatorname*{argmin}_{h \in \mathcal{H}} d(-\nabla L, h), \qquad (2)$$

where $d$ is some distance measure. This means that $h^{*}$ is chosen to be the function in $\mathcal{H}$ closest to $-\nabla L$, that is, the projection of $-\nabla L$ onto $\mathcal{H}$. The update in Equation 2 is a fundamental but not well-studied component of virtually all boosting methods. Simply by varying the choice of $\mathcal{H}$ and $d$, this single equation recovers most widely-used boosting algorithms. If we restrict the optimization steps to a set of base hypotheses $\mathcal{H}$, then each step is chosen to be the function closest in direction to the negative gradient, which means it maximizes

$$-\frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(F)}{\partial F(x_i)}\, h(x_i). \qquad (3)$$

In particular, if $\Phi(y_i, f(x_i)) = e^{-y_i f(x_i)}$, then Functional Gradient Descent recovers ADABOOST, and if $\Phi(y_i, f(x_i)) = \log\big(1 + e^{-y_i f(x_i)}\big)$, then it recovers LOGITBOOST. When, instead of maximizing the negative inner product $\langle -\nabla L, h \rangle$, we minimize the distance $\|-\nabla L(F) - h\|_2^2$, we recover the GB algorithm.
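To make the projection step concrete, the following minimal Python sketch (not part of the paper; the loss, learning rate, tree size, and use of scikit-learn trees are our own illustrative choices) implements the unregularized update of Equation 2 with the squared-distance choice of d that recovers GB: the negative functional gradient is computed on the sample and a regression tree is fit to it by least squares.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logistic_loss_gradient(y, F):
    """Pointwise gradient dPhi/dF for Phi(y, F) = log(1 + exp(-y F))."""
    return -y / (1.0 + np.exp(y * F))

def functional_gradient_step(X, y, F, max_leaf_nodes=8, eta=0.1):
    """One GB step: fit a regression tree to the negative functional gradient
    (the squared-distance choice of d in Equation 2) and take a small step."""
    neg_grad = -logistic_loss_gradient(y, F)       # -dL/dF(x_i), up to the 1/m factor
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
    tree.fit(X, neg_grad)                           # least-squares projection onto H
    return F + eta * tree.predict(X), tree

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))
F = np.zeros(200)
for _ in range(10):
    F, _ = functional_gradient_step(X, y, F)
print("training error:", np.mean(np.sign(F) != y))
```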
2.2 Gradient Boosting as Vector Space Coordinate Descent

There is an equivalence relation between gradient descent in functional spaces and coordinate descent in vector spaces that often helps to obtain efficient algorithms for ensemble learning. At each of the $T$ steps of Functional Gradient Descent, $-\nabla L$ is projected onto $\mathcal{H}$; hence the final solution $F$ can be expressed as $F_\alpha = \sum_{t=1}^{T} \alpha_t h_t$ for some $\alpha \in \mathbb{R}^T$, where $\forall\, 1 \le t \le T\colon h_t \in \mathcal{H}_{I_t} \subseteq \mathcal{H}$, and where $\mathcal{H}_{I_t}$ indicates the subset of $\mathcal{H}$ selected at the $t$-th step. The subsets $\mathcal{H}_{I_t}$ can be viewed as coordinate blocks in $\mathcal{H}$. In this view, at boosting step $t$, a particular subspace $\mathcal{H}_{I_t}$ out of $\{\mathcal{H}_1, \ldots, \mathcal{H}_K\}$ is selected; then a base predictor $h_t \in \mathcal{H}_{I_t}$ from that subspace is added to the ensemble. This allows switching from minimizing the loss functional $L(F)$ to minimizing the loss function $L(\alpha) = L(F_\alpha)$,

$$L(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi\Big(y_i, \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big), \qquad (4)$$

over the ensemble weight vector $\alpha \in \mathbb{R}^T$. Selecting a projection $h_t$ and a step size $\alpha_t$ at the $t$-th step of Functional Gradient Descent on $L(F_\alpha)$, or alternatively selecting a coordinate $\alpha_t$ at the $t$-th step of vector space coordinate descent on $L(\alpha)$, both result in the same form of the update $F_{\alpha,t} = F_{\alpha,t-1} + \alpha_t h_t$. Additionally, the full sequence of these updates for $t$ from 1 to $T$ is equal since, by the chain rule, for all $1 \le t \le T$:

$$\frac{\partial L(\alpha)}{\partial \alpha_t} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(F_\alpha)}{\partial F_\alpha(x_i)}\, h_t(x_i) = \langle \nabla L, h_t \rangle, \qquad (5)$$

which means that $\min_{1 \le t \le T} \frac{\partial L(\alpha)}{\partial \alpha_t}$ selected by coordinate descent is equal to $\min_{1 \le t \le T} \langle \nabla L, h_t \rangle$ selected by Functional Gradient Descent. This equivalence illustrates two important points. First, coordinate descent methods can be used to provide efficient numerical solutions for boosting. Second, the proper construction of the subsets $\mathcal{H}_k$ such that $h_t \in \mathcal{H}_{I_t} \subseteq \mathcal{H}$ is crucial for the success of boosting algorithms. We rely on this equivalence when presenting a coordinate-descent-style algorithm for minimizing the regularized boosting objective that scales well to large families of base predictors.
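As a small numerical illustration of the chain-rule identity in Equation 5 (our own sketch; the logistic loss and random stand-ins for the base predictor values are arbitrary choices), the snippet below checks by finite differences that the partial derivative of $L(\alpha)$ with respect to a coordinate $\alpha_t$ equals the empirical inner product of the functional gradient with $h_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 100, 5
y = rng.choice([-1.0, 1.0], size=m)
H = rng.normal(size=(T, m))          # row t holds h_t(x_i): stand-ins for base predictors
alpha = rng.normal(size=T)

def L(alpha):
    """Empirical logistic loss L(alpha) = (1/m) sum_i log(1 + exp(-y_i F_alpha(x_i)))."""
    F = alpha @ H
    return np.mean(np.log1p(np.exp(-y * F)))

# Left-hand side of Equation 5: dL/dalpha_t, estimated by central finite differences.
t, eps = 2, 1e-6
e_t = np.eye(T)[t]
lhs = (L(alpha + eps * e_t) - L(alpha - eps * e_t)) / (2 * eps)

# Right-hand side: (1/m) sum_i dL/dF(x_i) * h_t(x_i), i.e. <grad L, h_t>.
F = alpha @ H
dL_dF = -y / (1.0 + np.exp(y * F))   # pointwise derivative of the logistic loss
rhs = np.mean(dL_dF * H[t])

print(lhs, rhs)                       # the two values agree up to finite-difference error
```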
2.3 Regularized Gradient Boosting

In this subsection, we describe the main novelty of our work: the analysis of regularization applied to GB. We formulate what we name the Regularized Gradient Boosting framework and show the subtle connection between the regularization and the properties of $\mathcal{H}_k \subseteq \mathcal{H}$. As we shall see, the regularization terms are not explicitly introduced in the definition of the objective, but only in the definition of an approximation to the functional gradient. While the unregularized projection step, as in Equation 2, has been extensively studied for GB, the fundamental theory of the regularization commonly used is missing. However, a number of empirical studies and software frameworks [Sun et al., 2014, Chen and Guestrin, 2016] indicate that introducing regularization to this step is extremely beneficial. For example, the popular XGBOOST library, dedicated to boosted decision trees, regularizes the norm of the leaf values, as well as the number of leaves. We are filling this gap by providing a theory that links regularization with learning guarantees for GB algorithms.

For a convex function $\Omega \colon \mathcal{F} \to \mathbb{R}$, a closed subspace $\mathcal{H} \subseteq \mathcal{F}$, and $\beta \in \mathbb{R}_+$, let the Regularized Gradient Boosting step be defined by

$$h^{*} = \operatorname*{argmin}_{h \in \mathcal{H}} d(-\nabla L, h) + \beta\, \Omega(h). \qquad (6)$$

Given the convexity of $\Omega$, this step is equivalent to $h^{*} = \operatorname*{argmin}_{h \in \widehat{\mathcal{H}}} d(-\nabla L, h)$, where $\widehat{\mathcal{H}} = \mathcal{H} \cap \{h \colon \Omega(h) \le \bar{\beta}\}$ for a constraint level $\bar{\beta}$ corresponding to $\beta$. Such a reduction illustrates the subtle, yet extremely important, connection between regularization and the definition of the hypothesis set $\mathcal{H}$.

The equivalence between vector space coordinate descent and Functional Gradient Descent presented in Section 2, meaning that both of these methods iteratively select the same sequence of functions $h_t \in \mathcal{H}_{I_t} \subseteq \mathcal{H}$, suggests that a natural way to use regularization for boosting is to define $\mathcal{F} = \operatorname{conv}(\cup_{k=1}^{K} \mathcal{H}_k)$, where $\mathcal{H}_k = \{h \colon \theta_{k-1} < \Omega(h) \le \theta_k\}$ are disjoint sets of functions for a set of parameters $[\theta_1, \ldots, \theta_K]$. Note that, with this formulation, the regularization is not in the objective function; instead, the search for the gradient approximation is constrained by a regularization. We show, in the following section, that such a definition of $\mathcal{F}$ allows us to obtain margin-based learning guarantees for Regularized Gradient Boosting that depend on the complexities of each individual $\mathcal{H}_k$.

3 Learning Guarantees

As described in the previous section, by projecting the functional gradient onto $\mathcal{F} = \operatorname{conv}(\cup_{k=1}^{K} \mathcal{H}_k)$ at each step, we are able to learn an ensemble function $f = \sum_{t=1}^{T} \alpha_t h_t \in \mathcal{F}$, where the $\mathcal{H}_k$s are families of functions with varying complexity. Thus, it is natural to seek learning guarantees depending on the properties of each $\mathcal{H}_k$ and the mixture weight vector $\alpha = [\alpha_1, \ldots, \alpha_T]$. The first margin bound based on the VC-dimension for ensembles $\sum_{t=1}^{T} \alpha_t h_t$ was given by Freund and Schapire [1997]. Later, tighter data-dependent bounds in terms of the Rademacher complexity of the underlying function class $\mathcal{H}$ were given by Koltchinskii and Panchenko [2002], see also [Mohri et al., 2018]. For the specific case where $\mathcal{H} = \operatorname{conv}(\cup_{k=1}^{K} \mathcal{H}_k)$, Rademacher complexity-based guarantees were given in [Cortes et al., 2014]. In this section, we use these theoretical results to derive margin-based guarantees based on the Rademacher complexities of the families of regularized decision trees $\mathcal{H}_k$ and the mixture weights $\alpha$. The bounds that we show, being data-dependent, not only fill in the missing generalization theory for the existing gradient tree boosting frameworks but also motivate a new scalable learning algorithm for the Regularized Gradient Boosting framework, called RGB, presented in Section 4.

Here, we restrict our analysis to hypothesis families $\mathcal{H}_k$ of regression trees. However, our results can be extended to other families, such as kernel-based hypotheses and neural networks, so long as the sample Rademacher complexities of these families can be bounded. Each leaf $l$ in a regression tree contains a real-valued number $w_l$ providing the output value of the tree for any sample point allocated to that leaf; thus, we let $w$ be the vector of stacked leaf values. The function computed by a regression tree can thus be represented by $h(x) = \sum_{l \in \text{leaves}(h)} w_l\, \mathbb{1}\{x \in \text{leaf}_l\}$, where $\mathbb{1}\{x \in \text{leaf}_l\}$ is the indicator function for sample point $x \in \mathbb{R}^d$ being allocated to $\text{leaf}_l$; this value $h(x)$ can be used for classification in a straightforward manner by thresholding. The node partition functions in binary regression trees are of the form $[x]_j \le \theta$ for some feature index $j \in [1, d]$ and $\theta \in \mathbb{R}$, which means that if $[x_i]_j \le \theta$ for a sample point $x_i \in \mathbb{R}^d$, then $x_i$ is allocated to the left subtree, and to the right subtree otherwise. Let $\mathcal{H}_{n,\lambda,q}$ be the set of all regularized binary regression trees with the number of internal nodes bounded by $n$ and a leaf values vector $w$ such that $\|w\|_q \le \lambda$, $q \ge 1$. Special instances of these families of trees are widely used in practice. For example, $\mathcal{H}_{n,\lambda,1}$ and $\mathcal{H}_{n,\lambda,2}$ are implemented in XGBOOST and frequently used in practice.
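For illustration, here is one simple way (our own heuristic, not the paper's construction) to produce a member of the family $\mathcal{H}_{n,\lambda,2}$ just defined: fit an unconstrained scikit-learn regression tree with at most n internal nodes and then uniformly rescale its leaf values so that the stacked leaf-value vector satisfies the norm constraint; scaling every leaf by the same factor is equivalent to scaling the tree's predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_constrained_tree(X, target, n_internal_nodes, lam):
    """Fit a regression tree, then rescale its leaf values so that the stacked
    leaf-value vector w satisfies ||w||_2 <= lam, i.e. land in H_{n,lam,2}."""
    # A binary regression tree with n internal nodes has n + 1 leaves.
    tree = DecisionTreeRegressor(max_leaf_nodes=n_internal_nodes + 1).fit(X, target)
    is_leaf = tree.tree_.children_left == -1
    w = tree.tree_.value[is_leaf].ravel()          # leaf values of the fitted tree
    norm = np.linalg.norm(w, ord=2)
    scale = min(1.0, lam / norm) if norm > 0 else 1.0
    return lambda Z: scale * tree.predict(Z)       # scaled predictions = scaled leaves

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
target = np.tanh(X[:, 0]) + 0.1 * rng.normal(size=300)
h = fit_constrained_tree(X, target, n_internal_nodes=7, lam=1.0)
print(np.abs(h(X)).max() <= 1.0)                   # every output is a (scaled) leaf value
```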
Theorem 1. For any sample $S = (x_1, \ldots, x_m)$, the empirical Rademacher complexity of a hypothesis set $\mathcal{H}$ is defined by $\widehat{\mathfrak{R}}_S(\mathcal{H}) = \frac{1}{m}\, \mathbb{E}_\sigma\big[\sup_{h \in \mathcal{H}} \sum_{i=1}^{m} \sigma_i h(x_i)\big]$, where the $\sigma_i$, $i \in [m]$, are independent uniformly distributed random variables taking values in $\{-1, +1\}$. Let $d$ be the input data dimension. The following upper bound holds for the empirical Rademacher complexity of $\mathcal{H}_{n,\lambda,q}$:

$$\widehat{\mathfrak{R}}_S(\mathcal{H}_{n,\lambda,q}) \le \lambda \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}.$$

The proof of Theorem 1 is given in the Appendix. This bound shows how the empirical Rademacher complexity of the regularized decision trees depends both on the number of internal nodes $n$ and on the upper bound $\lambda$ on the $q$-norm of the leaf values. Using this bound, we can now derive our margin-based learning guarantees for the family $\mathcal{F}$. Let $R(f)$ denote the binary classification error of $f \in \mathcal{F}$, $R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbb{1}\{yf(x) \le 0\}\big]$, let $R_\rho(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbb{1}\{yf(x) \le \rho\}\big]$ denote its $\rho$-margin loss, and let $\widehat{R}_{S,\rho}(f) = \mathbb{E}_{(x,y) \sim S}\big[\mathbb{1}\{yf(x) \le \rho\}\big]$ denote its empirical $\rho$-margin loss for a sample $S$.

Theorem 2. Fix $\rho > 0$. Let $\mathcal{H}_k = \mathcal{H}_{n_k,\lambda_k,q_k}$, where $(n_k)$, $(\lambda_k)$ are sequences of constraints on the number of internal nodes $n$ and the leaf vector norm $\|w\|_q$. Define $\mathcal{F} = \operatorname{conv}(\cup_{k=1}^{K} \mathcal{H}_k)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, the following inequality holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in \mathcal{F}$:

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t\, \lambda_{I_t} \sqrt{\frac{(4 n_{I_t} + 2)\log_2(d + 2)\log(m + 1)}{m}} + C(m, K),$$

where $I_t$ is the index of the subclass selected at time $t$ and $C(m, K) = O\Big(\sqrt{\tfrac{\log K}{\rho^2 m}\, \log\big[\tfrac{\rho^2 m}{\log K}\big]}\Big)$.

The proof of Theorem 2 is given in the Appendix. The generalization bound of Theorem 2 motivates a specific algorithm for Regularized Gradient Boosting, described and discussed in the next section.

4 Algorithm

The multiplicative structure of the bound in Theorem 2 with respect to the mixture weights $[\alpha_1, \ldots, \alpha_T]$ and the complexities of the $\mathcal{H}_{I_t}$ suggests the use of these complexities (or their upper bounds) in the regularization $\Omega(h)$. Additionally, one may upper-bound the empirical margin loss in Theorem 2 by a convex surrogate $\Phi$ of the function $u \mapsto \mathbb{1}\{u \le 0\}$, leading to the following objective:

$$L(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi\Big(y_i, \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big) + \beta \sum_{t=1}^{T} |\alpha_t|\, \lambda_{I_t} \sqrt{\frac{(4 n_{I_t} + 2)\log_2(d + 2)\log(m + 1)}{m}}. \qquad (7)$$

Minimizing this function with vector space coordinate descent is equivalent to solving, at each Functional Gradient Descent step, a projection of the form

$$h_t = \operatorname*{argmin}_{h \in \mathcal{H}} d(-\nabla L, h) + \beta\, \lambda_k \sqrt{\frac{(4 n_k + 2)\log_2(d + 2)\log(m + 1)}{m}}\, \mathbb{1}\{h \in \mathcal{H}_k\}. \qquad (8)$$

In this section, we devise an algorithm for minimizing the regularized objective $L(\alpha)$, called RGB, that is able to scale to large families of base predictors.
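For concreteness, here is a small helper (our own illustrative code, not part of the paper) that evaluates the complexity bound of Theorem 1, which is the per-tree penalty used in the objectives of Equations 7 and 8; the example grid values are arbitrary.

```python
import math

def tree_complexity_bound(n_internal_nodes, lam, d, m):
    """Theorem 1 upper bound on the empirical Rademacher complexity of H_{n,lambda,q}:
    lambda * sqrt((4n + 2) * log2(d + 2) * log(m + 1) / m)."""
    return lam * math.sqrt(
        (4 * n_internal_nodes + 2) * math.log2(d + 2) * math.log(m + 1) / m
    )

# Example: penalties for two subfamilies on a sample of m = 1000 points in d = 20 dimensions.
# Deeper trees with larger leaf norms receive a larger penalty, so they are selected only
# when they yield a sufficiently better fit to the functional gradient.
print(tree_complexity_bound(n_internal_nodes=4, lam=0.1, d=20, m=1000))
print(tree_complexity_bound(n_internal_nodes=64, lam=2.0, d=20, m=1000))
```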
4.1 Randomized Coordinate Descent

The practical challenge of building an ensemble of base predictors in the Regularized Gradient Boosting scenario is to both define the hypothesis sets $\mathcal{H}_k$ and implement an efficient search across these sets to select the best update direction $h_t$ at each optimization step. Applying coordinate descent to the objective in Equation 7 may be feasible for finite hypothesis sets; however, we are often required to work with infinite spaces of subfamilies of functions. A typical example would be one where each subfamily is a decision tree with a fixed topology and fixed leaf values. It is common to resort to heuristics or to discretize the search space to define an approximate search. To solve the problem of an extensive search over the $\mathcal{H}_k$, we propose a novel method for boosting updates using randomization applied to the functional space. Random selection of base learners for GB in the context of Randomized Coordinate Descent has been shown to be successful in practice. For example, Lu and Mazumder [2018] demonstrated that uniform sampling helps make the search over base hypothesis classes more scalable and gave favorable convergence guarantees for this method. Nesterov [2012] introduced probabilistic convergence guarantees for Randomized Coordinate Descent expressed in terms of the local smoothness properties of the objective and suggested a distribution from which to sample the coordinates.

Inspired by the analysis of Nesterov [2012], our work is the first one to provide a fundamentally justified method of searching over the subspaces $\mathcal{H}_k$, an algorithm that is both scalable and admits convergence guarantees. At each round, the RGB algorithm picks at random a subset $\{\mathcal{H}_{t_1}, \ldots, \mathcal{H}_{t_S}\}$. Given a meaningful distribution over $\mathcal{H}$ that captures the steepness of the objective $L(\alpha)$ within each of these subsets, RGB is able to learn an ensemble of functions from families $\mathcal{H}_k$ of varying complexity. In the following, we show how to apply the Randomized Coordinate Descent method, as in [Nesterov, 2012], to the objective in Equation 7.

4.2 Lipschitz-Continuous Gradients

Consider the problem of minimizing $L(\alpha)$ as in Equation 7. The following lemma describes the continuity properties of the partial derivatives of $L(\alpha)$, which are needed for the application of Randomized Coordinate Descent.

Lemma 3. Assume that $\Phi(y, h)$ is differentiable with respect to the second argument, and that $\frac{\partial \Phi}{\partial h}$ is $C_\Phi(y)$-Lipschitz with respect to the second argument, for any fixed value $y$ of the first argument. For all $k \in [1, K]$, define $L'_k(\alpha) = \frac{\partial L(\alpha)}{\partial \alpha_k}$. Then, $L'_k(\alpha)$ is $C_k$-Lipschitz with $C_k$ bounded as follows:

$$C_k \le \frac{1}{m} \sum_{i=1}^{m} h_k^2(x_i)\, C_\Phi(y_i). \qquad (9)$$

Randomized Coordinate Descent samples the $k$-th coordinate with probability $p_k = C_k / \sum_{k=1}^{K} C_k$. The convergence guarantees for this procedure are given in [Nesterov, 2012] as a function of the Lipschitz constants $C_k$. We can further give upper bounds for the Lipschitz constants above to avoid the computation of the sums $\sum_{i=1}^{m} h_k^2(x_i)$. If we introduce the vectors $\mathbf{h}_k$ and $\mathbf{C}_\Phi$ in $\mathbb{R}^m$ such that $[\mathbf{h}_k]_i = h_k^2(x_i)$ and $[\mathbf{C}_\Phi]_i = C_\Phi(y_i)$, then, by Hölder's inequality,

$$C_k \le \frac{1}{m} \sum_{i=1}^{m} h_k^2(x_i)\, C_\Phi(y_i) \le \frac{1}{m}\, \|\mathbf{h}_k\|_r\, \|\mathbf{C}_\Phi\|_q, \qquad (10)$$

with $\frac{1}{r} + \frac{1}{q} = 1$. Various $(r, q)$-dual norms can be used, depending on the computational constraints and the complexity of the hypothesis classes for the application of the Randomized Coordinate Descent method. For example, using $\|\mathbf{h}_k\|_1$ and $\|\mathbf{C}_\Phi\|_\infty$ gives the following upper bound: $C_k \le \frac{1}{m} \max_{1 \le i \le m} C_\Phi(y_i) \sum_{i=1}^{m} h_k^2(x_i)$.

The developed generalization bounds imply bounds on the Lipschitz constants and thus define the Randomized Coordinate Descent steps for the minimization of $L(\alpha)$, controlling its convergence. To illustrate this point, the convergence rate stated in [Nesterov, 2012] is as follows:

$$\mathbb{E}_{t-1}\big[L(\alpha_t)\big] - L(\alpha^{*}) \le \frac{2}{t + 1}\, R_0^2(\alpha_0), \qquad (11)$$

where $\alpha_0$ is the starting point, $\alpha^{*}$ is the global minimizer of $L(\alpha)$, and $R_0(\alpha_0)$ is the size of the initial level set of the objective. The conditional expectation is taken over the random choice of the next coordinate. The regularization applied to the base predictor families in our Regularized Gradient Boosting framework implies the bounds on $C_k$, thus controlling the convergence of the algorithm.

4.3 Pseudocode

The pseudocode of our RGB algorithm is given in Algorithm 1.

Algorithm 1 RGB
Input: α = 0, F = 0
1: for t ∈ [1, T] do
2:   [t_1, ..., t_S] ∼ P
3:   for s ∈ [1, S] do
4:     h_s ← argmin_{h ∈ H_{t_s}} (1/m) Σ_{i=1}^m Φ(y_i, F(x_i) − (1/C_{t_s}) L'_{t_s}(α) h(x_i))
5:   end for
6:   s* ← argmin_{s ∈ [1, S]} (1/m) Σ_{i=1}^m Φ(y_i, F(x_i) − (1/C_{t_s}) L'_{t_s}(α) h_s(x_i)) + β Ω(h_s)
7:   α ← α − (1/C_{t_{s*}}) L'_{t_{s*}}(α) e_{t_{s*}}
8:   F ← F − (1/C_{t_{s*}}) L'_{t_{s*}}(α) h_{s*}
9: end for
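As a small illustration of how the distribution P sampled in line 2 of Algorithm 1 can be built, the sketch below (our own code, under our own assumptions) replaces each Lipschitz constant by the Hölder upper bound of Equation 10 with the (r, q) = (1, ∞) choice, and assumes the logistic loss, for which the second derivative of the surrogate is bounded by 1/4, so C_Φ(y) ≤ 1/4 for every y.

```python
import numpy as np

def sampling_distribution(H_values, C_phi_max=0.25):
    """Coordinate-block probabilities p_k = C_k / sum_j C_j, with each C_k replaced by
    the Hölder upper bound of Equation 10 for (r, q) = (1, inf):
    C_k <= (1/m) * max_i C_phi(y_i) * sum_i h_k(x_i)^2.
    H_values[k] holds the values h_k(x_i) of a representative of H_k on the sample;
    for the logistic loss, C_phi(y) <= 1/4 for every y."""
    m = H_values.shape[1]
    C_upper = C_phi_max * np.sum(H_values ** 2, axis=1) / m
    return C_upper / C_upper.sum()

# toy usage: K = 4 coordinate blocks evaluated on m = 6 sample points
rng = np.random.default_rng(0)
H_values = rng.normal(scale=[[0.1], [0.5], [1.0], [2.0]], size=(4, 6))
p = sampling_distribution(H_values)
print(p)   # blocks whose representatives take larger values are sampled more often
```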
The algorithm seeks to minimize the objective given in Equation 7, using Randomized Coordinate Descent. Let $P$ be a discrete probability distribution over $[1, K]$ with $p_k = C_k / \sum_{j=1}^{K} C_j$. Equivalently, $P$ is a distribution over the base hypothesis sets $\mathcal{H}_1, \ldots, \mathcal{H}_K$. At each draw from $P$, we select a sample $\mathcal{H}_{t_1}, \ldots, \mathcal{H}_{t_S}$ of size $S$ and, from this sample, select one function that provides the best trade-off between the decrease in the objective $L(\alpha)$ and the complexity bound of Theorem 1 for the underlying hypothesis family. The local optimization procedure in Line 6 is an extra step required in the coordinate descent procedure to select a single function from $\mathcal{H}_{t_s}$. The step in Line 8 is required to select, out of $S$ sampled directions, the one with the best trade-off between sample fit and complexity bounds. Note that the evaluation of sampled candidates in Line 5 can be done in parallel, making the time of RGB per thread comparable to that of standard GB. More specifically, given a fixed sample of $S$ coordinates, the runtime of one RGB round is equal to the runtime of $S$ rounds of GB when the same subroutine is used for tree splitting.

5 Experiments

In this section, we present the results of experiments with our RGB algorithm. We restrict our attention to learning an ensemble of the regularized regression trees as defined in the family $\mathcal{H}_{n,\lambda,q}$ and, to simplify the presentation, we let $q = 2$, although similar experiments can easily be carried out for other norms. For the complexity of these base classifiers, we use the bound derived in Theorem 1. To define the subfamilies of base learners, we impose a grid of size 7 on the maximum number of internal nodes, $n \in \{2, 4, 8, 16, 32, 64, 256\}$, and a grid of size 7 on $\lambda \in \{0.001, 0.01, 0.1, 0.5, 1, 2, 4\}$. To each element of the Cartesian product of these grids, we assign a pair $(n_k, \lambda_k)$, thus defining the base families $\mathcal{H}_{n_k,\lambda_k,2}$ and $\mathcal{F} = \operatorname{conv}\big(\cup_{k=1}^{49} \mathcal{H}_{n_k,\lambda_k,2}\big)$. Given such a decomposition of the functional space, we directly minimize the regularized objective in Equation 7 using Randomized Coordinate Descent with the distribution over the coordinate blocks described above. We use the logistic loss as the per-instance loss $\Phi$. For a given training sample, we normalize the regularization $\Omega(h)$ to be in $[0, 1]$ and tune the RGB parameter $\beta$ using a grid search over $\beta \in \{0.001, 0.01, 0.1, 0.3, 1\}$.

Section 4 describes multiple ways to bound the coordinate-wise Lipschitz constants of the derivative of the objective function, resulting in various coordinate sampling distributions for Randomized Coordinate Descent. For our experiments, and specifically for the families $\mathcal{H}_{n_k,\lambda_k,2}$, we bound the Lipschitz constants by $C_k \le \lambda_k \max_{1 \le i \le m} C_\Phi(y_i)$, which implies that the $k$-th coordinate is sampled with probability $p_k = \lambda_k / \sum_{j=1}^{K} \lambda_j$, since the $\max_{1 \le i \le m} C_\Phi(y_i)$ terms cancel out (see Lemma 4 in the Appendix for the derivation of this bound).
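The following simplified end-to-end sketch mirrors this experimental setup. It is illustrative only and not the authors' implementation: it uses scikit-learn trees as base learners, rescales leaf values to enforce the norm constraint, fits each candidate to the negative functional gradient with a fixed step size instead of the exact coordinate update $\frac{1}{C_{t_s}} L'_{t_s}(\alpha)$, and samples coordinate blocks with $p_k \propto \lambda_k$ as described above; all helper names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def phi(y, F):                        # logistic surrogate loss
    return np.mean(np.log1p(np.exp(-y * F)))

def dphi_dF(y, F):                    # pointwise derivative of the surrogate
    return -y / (1.0 + np.exp(y * F))

def complexity(n, lam, d, m):         # Theorem 1 bound, used as Omega(h) for h in H_{n,lam,2}
    return lam * np.sqrt((4 * n + 2) * np.log2(d + 2) * np.log(m + 1) / m)

def fit_in_subfamily(X, residual, n, lam):
    tree = DecisionTreeRegressor(max_leaf_nodes=n + 1).fit(X, residual)
    w = tree.tree_.value[tree.tree_.children_left == -1].ravel()
    norm = np.linalg.norm(w)
    scale = min(1.0, lam / norm) if norm > 0 else 1.0
    return lambda Z: scale * tree.predict(Z)       # element of H_{n,lam,2}

def rgb(X, y, subfamilies, T=50, S=3, beta=0.1, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    lams = np.array([lam for _, lam in subfamilies], dtype=float)
    p = lams / lams.sum()              # p_k proportional to lam_k
    F, ensemble = np.zeros(m), []
    for _ in range(T):
        best = None
        for k in rng.choice(len(subfamilies), size=S, p=p):
            n, lam = subfamilies[k]
            h = fit_in_subfamily(X, -dphi_dF(y, F), n, lam)
            score = phi(y, F + step * h(X)) + beta * complexity(n, lam, d, m)
            if best is None or score < best[0]:
                best = (score, h)
        F = F + step * best[1](X)
        ensemble.append((step, best[1]))
    return ensemble

# toy usage with the 7 x 7 grid of subfamilies from the text
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=400))
fams = [(n, lam) for n in (2, 4, 8, 16, 32, 64, 256)
        for lam in (0.001, 0.01, 0.1, 0.5, 1, 2, 4)]
ens = rgb(X, y, fams)
scores = sum(a * h(X) for a, h in ens)
print("training error:", np.mean(np.sign(scores) != y))
```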
As a comparison, we run the standard GB, using the XGBOOST library with $\ell_2$ regularization on the vector of leaf scores $w$. We use grid search to tune the hyperparameters of XGBOOST with a grid that makes the families of trees explored comparable to the $\mathcal{H}$ defined for RGB above. Specifically, we let the $\ell_2$ norm regularization parameter be in $\{0.001, 0.01, 0.1, 0.5, 1, 2, 4\}$, the maximum tree depth parameter in $\{1, 2, 3, 4, 5, 6, 7\}$, and the learning rate parameter in $\{0.001, 0.01, 0.1, 0.5, 1\}$. Both GB and RGB are run for $T = 100$ boosting rounds. The hyperparameters are chosen via 5-fold cross-validation, and the standard errors for the best set of hyperparameters are reported.

Table 1: Experimental results (test error %).

| | sonar | cancer | diabetes | ocr17 | ocr49 | mnist17 | mnist49 | higgs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RGB Mean | 26.94 | 5.19 | 28.86 | 0.90 | 3.10 | 0.43 | 1.53 | 28.60 |
| RGB (Std) | (2.10) | (0.97) | (4.85) | (0.45) | (0.69) | (0.10) | (0.38) | (0.41) |
| GB Mean | 28.64 | 6.14 | 28.39 | 1.35 | 3.50 | 0.55 | 1.66 | 29.11 |
| GB (Std) | (2.13) | (0.94) | (5.08) | (0.52) | (0.65) | (0.11) | (0.32) | (0.37) |
| Signif. (one-tailed, paired sample t-test) | 5% | 5% | - | 2.5% | 2.5% | 2% | 5% | 2.5% |

Table 1 shows the classification errors on the test sets for the UCI datasets studied, for both RGB and GB; see Table 2 in the Appendix for details on the datasets. A one-tailed, paired sample t-test on the pairs of results from the different trials demonstrates that these results are in general significant at a 5% level or better. Only for one of the datasets, diabetes, with an input dimension of just 8, do we not observe an improvement of RGB over GB. One natural hypothesis is that the larger the input dimension, the greater the need for proper regularization of the binary regression trees forming the base learners, and the larger the advantage of the RGB algorithm. In general, the results demonstrate that, by randomly taking steps along coordinates that correspond to subspaces $\mathcal{H}_k$ with a theoretically justified distribution, RGB can explore larger hypothesis families more efficiently than the baseline methods. Furthermore, compared to baselines that operate on the same hypothesis space $\mathcal{H}$, by optimizing for the trade-off between sample fit and functional subclass complexity, RGB can reduce over-fitting, thereby achieving greater test accuracy on multiple datasets.

6 Conclusion

We introduced and analyzed a general framework of Regularized Gradient Boosting, for which we also devised an effective algorithm, RGB. In this framework, regularization is not directly incorporated as a term in the loss function. Instead, its definition affects each boosting step by restricting the search for the gradient approximation to a constrained subset of base functions. Our analysis is based upon strong margin-based Rademacher complexity learning guarantees. These bounds suggest a natural approach for our optimization solution, which consists of dividing the space of base learners into subfamilies of increasing complexity. For the special case of binary regression trees, we derived explicit Rademacher complexity bounds that we subsequently exploited in the definition of our RGB algorithm. Randomization over the subfamilies of base functions allows us to scale our algorithm to large families of base predictors. Our experimental results suggest improved performance, thanks to a more efficient and theoretically motivated exploration of large function spaces without over-fitting. Also, as already stated, the run-times of the algorithms are comparable, thereby making RGB a strong alternative to XGBOOST. Finally, our analysis can be extended in a similar way to boosting with other families of base predictors, such as kernel-based hypothesis sets and Deep Neural Networks.

Acknowledgments

We thank our colleagues Natalia Ponomareva and Vitaly Kuznetsov for insightful discussions and feedback. This work was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award.

References

R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In Proceedings of ICML, page 18. ACM, 2004.

T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

C. Cortes, M. Mohri, and U. Syed. Deep boosting. In Proceedings of ICML, 2014.

T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In Proceedings of ICML, volume 96, pages 148-156, 1996.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:2000, 1998.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.

J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

A. Grubb and J. A. Bagnell. Generalized boosting algorithms for convex optimization. arXiv preprint arXiv:1105.2054, 2011.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

H. Lu and R. Mazumder. Randomized gradient boosting machine. arXiv preprint arXiv:1810.10158, 2018.

L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512-518, 2000.

P. Massart and J. Picard. Concentration Inequalities and Model Selection: Ecole d'Eté de Probabilités de Saint-Flour XXXIII - 2003. Number 1896 in Ecole d'Eté de Probabilités de Saint-Flour. Springer-Verlag, 2007.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, second edition, 2018.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

J. R. Quinlan et al. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pages 725-730, 1996.

K. V. Rashmi and R. Gilad-Bachrach. DART: Dropouts meet multiple additive regression trees. In AISTATS, pages 489-497, 2015.

R. E. Schapire. A brief introduction to boosting. In IJCAI, volume 99, pages 1401-1406, 1999.

R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of ICML, pages 322-330, 1997.

P. Sun, T. Zhang, and J. Zhou. A convergence rate analysis for LogitBoost, MART and their variant. In ICML, pages 1251-1259, 2014.

V. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831-838, 1992.