# Deep Boosting

Corinna Cortes (CORINNA@GOOGLE.COM), Google Research, 111 8th Avenue, New York, NY 10011
Mehryar Mohri (MOHRI@CIMS.NYU.EDU), Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012
Umar Syed (USYED@GOOGLE.COM), Google Research, 111 8th Avenue, New York, NY 10011

We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.

1. Introduction

Ensemble methods are general techniques in machine learning for combining several predictors or experts to create a more accurate one. In the batch learning setting, techniques such as bagging, boosting, stacking, error-correction techniques, Bayesian averaging, or other averaging schemes are prominent instances of these methods (Breiman, 1996; Freund & Schapire, 1997; Smyth & Wolpert, 1999; MacKay, 1991; Freund et al., 2004). Ensemble methods often significantly improve performance in practice (Quinlan, 1996; Bauer & Kohavi, 1999; Caruana et al., 2004; Dietterich, 2000; Schapire, 2003) and benefit from favorable learning guarantees. In particular, AdaBoost and its variants are based on a rich theoretical analysis, with performance guarantees in terms of the margins of the training samples (Schapire et al., 1997; Koltchinskii & Panchenko, 2002).

Standard ensemble algorithms such as AdaBoost combine functions selected from a base classifier hypothesis set H. In many successful applications of AdaBoost, H is reduced to the so-called boosting stumps, that is, decision trees of depth one. For some difficult tasks in speech or image processing, simple boosting stumps are not sufficient to achieve a high level of accuracy. It is tempting then to use a more complex hypothesis set, for example the set of all decision trees with depth bounded by some relatively large number. But existing learning guarantees for AdaBoost depend not only on the margin and the number of training examples, but also on the complexity of H measured in terms of its VC-dimension or its Rademacher complexity (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). These learning bounds become looser when using too complex a base classifier set H. They suggest a risk of overfitting, which indeed can be observed in some experiments with AdaBoost (Grove & Schuurmans, 1998; Schapire, 1999; Dietterich, 2000; Rätsch et al., 2001b).
This paper explores the design of alternative ensemble algorithms using as base classifiers a hypothesis set H that may contain very deep decision trees, or members of some other very rich or complex families, and that can yet succeed in achieving a higher performance level. Assume that the set of base classifiers H can be decomposed as the union of p disjoint families H_1, ..., H_p ordered by increasing complexity, where H_k, k ∈ [1, p], could be for example the set of decision trees of depth k, or a set of functions based on monomials of degree k. Figure 1 shows a pictorial illustration.

[Figure 1. Base classifier set H decomposed in terms of sub-families H_1, ..., H_p or their unions.]

Of course, if we strictly confine ourselves to using hypotheses belonging only to families H_k with small k, then we are effectively using a smaller base classifier set H with favorable guarantees. But, to succeed in some challenging tasks, the use of a few more complex hypotheses could be needed. The main idea behind the design of our algorithms is that an ensemble based on hypotheses drawn from H_1, ..., H_p can achieve a higher accuracy by making use of hypotheses drawn from H_k's with large k if it allocates more weight to hypotheses drawn from H_k's with small k. But can we determine quantitatively the amount of mixture weight apportioned to different families? Can we provide learning guarantees for such algorithms? Note that our objective is somewhat reminiscent of that of model selection, in particular Structural Risk Minimization (SRM) (Vapnik, 1998), but it differs from it in that we do not wish to limit our base classifier set to some optimal $H_q = \bigcup_{k=1}^q H_k$. Rather, we seek the freedom of using as base hypotheses even relatively deep trees from rich H_k's, with the promise of doing so infrequently, or that of reserving for them a somewhat small weight contribution. This provides the flexibility of learning with deep hypotheses.

We present a new algorithm, DeepBoost, whose design is precisely guided by the ideas just discussed. Our algorithm is grounded in a solid theoretical analysis that we present in Section 2. We give new data-dependent learning bounds for convex ensembles. These guarantees are expressed in terms of the Rademacher complexities of the sub-families H_k and the mixture weight assigned to each H_k, in addition to the familiar margin terms and sample size. Our capacity-conscious algorithm is derived via the application of a coordinate descent technique seeking to minimize such learning bounds. We give a full description of our algorithm, including the details of its derivation and its pseudocode (Section 3), and discuss its connection with previous boosting-style algorithms. We also report the results of several experiments (Section 4) demonstrating that its performance compares favorably to that of AdaBoost, which is known to be one of the most competitive binary classification algorithms.

2. Data-dependent learning guarantees for convex ensembles with multiple hypothesis sets

Non-negative linear combination ensembles such as boosting or bagging typically assume that base functions are selected from the same hypothesis set H. Margin-based generalization bounds were given for ensembles of base functions taking values in {−1, +1} by Schapire et al. (1997) in terms of the VC-dimension of H.
Tighter margin bounds with simpler proofs were later given by Koltchinskii & Panchenko (2002), see also (Bartlett & Mendelson, 2002), for the more general case of a family H taking arbitrary real values, in terms of the Rademacher complexity of H. Here, we also consider base hypotheses taking arbitrary real values but assume that they can be selected from several distinct hypothesis sets H_1, ..., H_p with p ≥ 1, and present margin-based learning guarantees in terms of the Rademacher complexities of these sets. Remarkably, the complexity term of these bounds admits an explicit dependency in terms of the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is $F = \mathrm{conv}(\bigcup_{k=1}^p H_k)$, that is the family of functions f of the form $f = \sum_{t=1}^T \alpha_t h_t$, where α = (α_1, ..., α_T) is in the simplex and where, for each t ∈ [1, T], h_t is in H_{k_t} for some k_t ∈ [1, p].

Let X denote the input space. H_1, ..., H_p are thus families of functions mapping from X to R. We consider the familiar supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {−1, +1} and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m. Let ρ > 0. For a function f taking values in R, we denote by R(f) its binary classification error, by R_ρ(f) its ρ-margin error, and by R̂_{S,ρ}(f) its empirical margin error:

$$R(f) = \mathop{\mathbb{E}}_{(x,y)\sim D}\big[1_{y f(x) \le 0}\big], \qquad R_\rho(f) = \mathop{\mathbb{E}}_{(x,y)\sim D}\big[1_{y f(x) \le \rho}\big], \qquad \widehat R_{S,\rho}(f) = \mathop{\mathbb{E}}_{(x,y)\sim S}\big[1_{y f(x) \le \rho}\big],$$

where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S. The following theorem gives a margin-based Rademacher complexity bound for learning with such functions in the binary classification case. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results. For p = 1, that is for the special case of a single hypothesis set, the analysis coincides with that of the standard ensemble margin bounds (Koltchinskii & Panchenko, 2002).

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D^m, the following inequality holds for all $f = \sum_{t=1}^T \alpha_t h_t \in F$:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t\, R_m(H_{k_t}) + C(m, p), \qquad \text{with } C(m, p) = O\!\left(\sqrt{\frac{\log p}{\rho^2 m}\,\log\!\Big[\frac{\rho^2 m}{\log p}\Big]}\right).$$

This result is remarkable since the complexity term in the right-hand side of the bound admits an explicit dependency on the mixture coefficients α_t. It is a weighted average of Rademacher complexities with mixture weights α_t, t ∈ [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not be detrimental to generalization if the corresponding total mixture weight (sum of the α_t's corresponding to that hypothesis set) is relatively small. Such complex families offer the potential of achieving a better margin on the training sample.

The theorem cannot be proven via a standard Rademacher complexity analysis such as that of Koltchinskii & Panchenko (2002), since the complexity term of the bound would then be the Rademacher complexity of the family of hypotheses $F = \mathrm{conv}(\bigcup_{k=1}^p H_k)$ and would not depend on the specific weights α_t defining a given function f. Furthermore, the complexity term of a standard Rademacher complexity analysis is always lower bounded by the complexity term appearing in our bound.
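To see the practical import of this before the formal comparison, here is a small numeric sketch; the complexity values below are invented for illustration and are not taken from the paper. It compares the α-weighted complexity term appearing in Theorem 1 with the single worst-case complexity that a standard analysis would charge.

```python
# Illustrative only: hypothetical Rademacher complexity estimates for two
# sub-families, e.g. stumps (simple) and depth-4 trees (complex).
r = {"H1_stumps": 0.01, "H4_trees": 0.20}   # assumed values, not from the paper

# An ensemble that mostly uses stumps but occasionally a deep tree.
alphas = [("H1_stumps", 0.9), ("H4_trees", 0.1)]

weighted_term = sum(a * r[k] for k, a in alphas)   # capacity term in Theorem 1
uniform_term = max(r.values())                     # what a standard bound charges

print(f"alpha-weighted complexity: {weighted_term:.3f}")  # 0.029
print(f"max (standard) complexity: {uniform_term:.3f}")   # 0.200
```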
Formally, since $R_m(\mathrm{conv}(\bigcup_{k=1}^p H_k)) = R_m(\bigcup_{k=1}^p H_k)$, the following lower bound holds for any choice of the non-negative mixture weights α_t summing to one:

$$R_m\Big(\bigcup_{k=1}^p H_k\Big) \;\ge\; \max_{k \in [1,p]} R_m(H_k) \;\ge\; \sum_{t=1}^T \alpha_t\, R_m(H_{k_t}). \tag{1}$$

Thus, Theorem 1 provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis. The full proof of the theorem is given in Appendix A. Our proof technique exploits standard tools used to derive Rademacher complexity learning bounds (Koltchinskii & Panchenko, 2002) as well as a technique used by Schapire, Freund, Bartlett, and Lee (1997) to derive early VC-dimension margin bounds. Using other standard techniques as in (Koltchinskii & Panchenko, 2002; Mohri et al., 2012), Theorem 1 can be straightforwardly generalized to hold uniformly for all ρ > 0 at the price of an additional term in $O\big(\sqrt{\log\log_2(2/\rho)/m}\big)$.

3. Algorithm

In this section, we will use the learning guarantees of Section 2 to derive a capacity-conscious ensemble algorithm for binary classification.

3.1. Optimization problem

Let H_1, ..., H_p be p disjoint families of functions taking values in [−1, +1] with increasing Rademacher complexities R_m(H_k), k ∈ [1, p]. We will assume that the hypothesis sets H_k are symmetric, that is, for any h ∈ H_k, we also have (−h) ∈ H_k, which holds for most hypothesis sets typically considered in practice. This assumption is not necessary but it helps simplify the presentation of our algorithm. For any hypothesis h ∈ ∪_{k=1}^p H_k, we denote by d(h) the index of the hypothesis set it belongs to, that is h ∈ H_{d(h)}.

The bound of Theorem 1 holds uniformly for all ρ > 0 and functions f ∈ conv(∪_{k=1}^p H_k).[1] Since the last term of the bound does not depend on α, it suggests selecting α to minimize

$$\frac{1}{m}\sum_{i=1}^m 1_{\,y_i \sum_{t=1}^T \alpha_t h_t(x_i)\,\le\,\rho} \;+\; \frac{4}{\rho}\sum_{t=1}^T \alpha_t r_t,$$

where $r_t = R_m(H_{d(h_t)})$. Since for any ρ > 0, f and f/ρ admit the same generalization error, we can instead search for α ≥ 0 with $\sum_{t=1}^T \alpha_t \le 1/\rho$, which leads to

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m 1_{\,y_i \sum_{t=1}^T \alpha_t h_t(x_i)\,\le\,1} \;+\; 4\sum_{t=1}^T \alpha_t r_t.$$

The first term of the objective is not a convex function of α and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let u ↦ Φ(−u) be a non-increasing convex function upper bounding u ↦ 1_{u ≤ 0}, with Φ differentiable over R and Φ′(u) ≠ 0 for all u. Φ may be selected to be for example the exponential function as in AdaBoost (Freund & Schapire, 1997) or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \lambda \sum_{t=1}^T \alpha_t r_t \quad \text{s.t.}\; \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}, \tag{2}$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term. Introducing a Lagrange variable β ≥ 0 associated to the constraint in (2), the problem can be equivalently written as

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \sum_{t=1}^T (\lambda r_t + \beta)\,\alpha_t.$$

Here, β is a parameter that can be freely selected by the algorithm since any choice of its value is equivalent to a choice of ρ in (2). Let {h_1, ..., h_N} be the set of distinct base functions, and let G be the objective function based on that collection:

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N (\lambda r_j + \beta)\,\alpha_j,$$

with α = (α_1, ..., α_N) ∈ R^N. Note that we can drop the requirement α ≥ 0 since the hypothesis sets are symmetric and α_t h_t = (−α_t)(−h_t). For each hypothesis h, we keep either h or −h in {h_1, ..., h_N}. Using the notation

$$\Lambda_j = \lambda r_j + \beta \tag{3}$$

for all j ∈ [1, N], our optimization problem can then be rewritten as min_α F(α) with

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N \Lambda_j |\alpha_j|, \tag{4}$$

with no non-negativity constraint on α. The function F is convex as a sum of convex functions and admits a subdifferential at all α ∈ R^N.

[1] The condition Σ_{t=1}^T α_t = 1 of Theorem 1 can be relaxed to Σ_{t=1}^T α_t ≤ 1. To see this, use for example a null hypothesis (h_t = 0 for some t).
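The objective (4) is straightforward to evaluate. The following is a minimal NumPy sketch, not the authors' implementation: the function and variable names are ours, it assumes the base hypotheses are supplied as callables, and it uses the exponential surrogate Φ(v) = exp(v) mentioned above.

```python
import numpy as np

def deepboost_objective(alpha, H, X, y, r, lam, beta):
    """Evaluate F(alpha) from Eq. (4) with the exponential surrogate Phi(v) = exp(v).

    alpha : (N,) mixture weights (may be negative; the sets are symmetric)
    H     : list of N base hypotheses, each mapping an (m, d) array to values in [-1, +1]
    X, y  : training sample, y in {-1, +1}^m
    r     : (N,) Rademacher-complexity estimates r_j for the family containing h_j
    lam, beta : regularization parameters lambda and beta
    """
    m = len(y)
    preds = np.column_stack([h(X) for h in H])            # (m, N), column j = h_j(x_i)
    f = preds @ alpha                                      # f(x_i) = sum_j alpha_j h_j(x_i)
    surrogate = np.exp(1.0 - y * f).mean()                 # (1/m) sum_i Phi(1 - y_i f(x_i))
    capacity = (lam * r + beta) @ np.abs(alpha)            # sum_j Lambda_j |alpha_j|
    return surrogate + capacity
```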
We can design a boosting-style algorithm by applying coordinate descent to F(α). Let α_t = (α_{t,1}, ..., α_{t,N})ᵀ denote the vector obtained after t − 1 iterations and let α_0 = 0. Let e_k denote the kth unit vector in R^N, k ∈ [1, N]. The direction e_k and the step η selected at the tth round are those minimizing F(α_{t−1} + η e_k), that is

$$F(\alpha_{t-1} + \eta e_k) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta\, y_i h_k(x_i)\big) + \sum_{j \ne k} \Lambda_j |\alpha_{t-1,j}| + \Lambda_k |\alpha_{t-1,k} + \eta|,$$

where $f_{t-1} = \sum_{j=1}^N \alpha_{t-1,j} h_j$. For any t ∈ [1, T], we denote by D_t the distribution defined by

$$D_t(i) = \frac{\Phi'\big(1 - y_i f_{t-1}(x_i)\big)}{S_t},$$

where S_t is a normalization factor, $S_t = \sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)$. For any s ∈ [1, T] and j ∈ [1, N], we denote by ε_{s,j} the weighted error of hypothesis h_j for the distribution D_s:

$$\epsilon_{s,j} = \frac{1}{2}\Big[1 - \mathop{\mathbb{E}}_{i\sim D_s}\big[y_i h_j(x_i)\big]\Big]. \tag{6}$$

3.2. DeepBoost

Figure 2 shows the pseudocode of the algorithm DeepBoost derived by applying coordinate descent to the objective function (4). The details of the derivation of the expression are given in Appendix B.

    DEEPBOOST(S = ((x_1, y_1), ..., (x_m, y_m)))
     1  for i ← 1 to m do
     2      D_1(i) ← 1/m
     3  for t ← 1 to T do
     4      for j ← 1 to N do
     5          if (α_{t−1,j} ≠ 0) then
     6              d_j ← ε_{t,j} − 1/2 + sgn(α_{t−1,j}) Λ_j m / (2 S_t)
     7          elseif (|ε_{t,j} − 1/2| ≤ Λ_j m / (2 S_t)) then
     8              d_j ← 0
     9          else d_j ← ε_{t,j} − 1/2 − sgn(ε_{t,j} − 1/2) Λ_j m / (2 S_t)
    10      k ← argmax_j |d_j|
    11      ε_t ← ε_{t,k}
    12      if (|(1 − ε_t) exp(α_{t−1,k}) − ε_t exp(−α_{t−1,k})| ≤ Λ_k m / S_t) then
    13          η_t ← −α_{t−1,k}
    14      elseif ((1 − ε_t) exp(α_{t−1,k}) − ε_t exp(−α_{t−1,k}) > Λ_k m / S_t) then
    15          η_t ← log[ −Λ_k m/(2 ε_t S_t) + sqrt( (Λ_k m/(2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
    16      else η_t ← log[ Λ_k m/(2 ε_t S_t) + sqrt( (Λ_k m/(2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
    17      α_t ← α_{t−1} + η_t e_k
    18      S_{t+1} ← Σ_{i=1}^m Φ′(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i))
    19      for i ← 1 to m do
    20          D_{t+1}(i) ← Φ′(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i)) / S_{t+1}
    21  f ← Σ_{j=1}^N α_{T,j} h_j
    22  return f

Figure 2. Pseudocode of the DeepBoost algorithm for both the exponential loss and the logistic loss. The expression of the weighted error ε_{t,j} is given in (6). In the generic case of a surrogate loss Φ different from the exponential or logistic losses, η_t is found instead via a line search or other numerical methods from η_t = argmin_η F(α_{t−1} + η e_k).

In the special cases of the exponential loss (Φ(−u) = exp(−u)) or the logistic loss (Φ(−u) = log₂(1 + exp(−u))), a closed-form expression is given for the step size (lines 12–16), which is the same in both cases (see Sections B.4 and B.5). In the generic case, the step size η_t can be found using a line search or other numerical methods. Note that when the condition of line 12 is satisfied, the step taken by the algorithm cancels out the coordinate along the direction k, thereby leading to a sparser result. This is consistent with the fact that the objective function contains a second term based on a (weighted) L1-norm, which favors sparsity.

Our algorithm is related to several other boosting-type algorithms devised in the past. For λ = 0 and β = 0 and using the exponential surrogate loss, it coincides with AdaBoost (Freund & Schapire, 1997) with precisely the same direction and same step η_t = (1/2) log((1 − ε_t)/ε_t), using H = ∪_{k=1}^p H_k as the hypothesis set for base learners. This corresponds to ignoring the complexity term of our bound as well as the control of the sum of the mixture weights via β. For λ = 0 and β = 0 and using the logistic surrogate loss, our algorithm also coincides with the additive Logistic Regression algorithm of Friedman et al. (1998). In the special case where λ = 0 and β ≠ 0 and for the exponential surrogate loss, our algorithm matches the L1-norm regularized AdaBoost (e.g., see (Rätsch et al., 2001a)). For the same choice of the parameters and for the logistic surrogate loss, our algorithm matches the L1-norm regularized additive Logistic Regression studied by Duchi & Singer (2009), using the base learner hypothesis set H = ∪_{k=1}^p H_k. H may in general be very rich. The key foundation of our algorithm and analysis is instead to take into account the relative complexity of the sub-families H_k.
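To make the coordinate-descent round concrete, here is a minimal NumPy/SciPy sketch of a single round; it is our own code and not the authors' implementation. It uses the direction rule of lines 4–10 of Figure 2 and, in place of the closed-form step of lines 12–16, the generic line-search variant mentioned in the caption of Figure 2. The prediction matrix P and all names are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def deepboost_round(alpha, P, y, Lam, Phi=np.exp, dPhi=np.exp):
    """One round of coordinate descent on F (Eq. (4)).

    alpha : (N,) current mixture weights
    P     : (m, N) matrix with P[i, j] = h_j(x_i) in {-1, +1}
    y     : (m,) labels in {-1, +1}
    Lam   : (N,) per-hypothesis penalties Lambda_j = lambda * r_j + beta
    Phi, dPhi : surrogate loss and its derivative (exponential loss by default)
    """
    m, N = P.shape
    margins = 1.0 - y * (P @ alpha)
    w = dPhi(margins)                       # unnormalized distribution D_t
    S = w.sum()
    D = w / S
    eps = 0.5 * (1.0 - (D * y) @ P)         # weighted errors eps_{t,j}, Eq. (6)

    # Direction: largest-magnitude (sub)gradient coordinate (lines 4-10 of Fig. 2).
    thresh = Lam * m / (2.0 * S)
    d = np.where(
        alpha != 0,
        eps - 0.5 + np.sign(alpha) * thresh,
        np.where(np.abs(eps - 0.5) <= thresh, 0.0,
                 eps - 0.5 - np.sign(eps - 0.5) * thresh),
    )
    k = int(np.argmax(np.abs(d)))

    # Step: numerical line search on F along e_k (generic-surrogate variant).
    def F_along_k(eta):
        f = P @ alpha + eta * P[:, k]
        pen = Lam @ np.abs(alpha) + Lam[k] * (abs(alpha[k] + eta) - abs(alpha[k]))
        return Phi(1.0 - y * f).mean() + pen

    eta = minimize_scalar(F_along_k).x
    new_alpha = alpha.copy()
    new_alpha[k] += eta
    return new_alpha, k, eta
```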
Also, note that L1-norm regularized AdaBoost and Logistic Regression can be viewed as algorithms minimizing the learning bound obtained via the standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), using the exponential or logistic surrogate losses. Instead, the objective function minimized by our algorithm is based on the generalization bound of Theorem 1, which as discussed earlier is a finer bound (see (1)). For λ = 0 but β ≠ 0, our algorithm is also close to the so-called unnormalized Arcing (Breiman, 1999) or AdaBoostρ (Rätsch & Warmuth, 2002) using H as a hypothesis set. AdaBoostρ coincides with AdaBoost modulo the step size, which is more conservative than that of AdaBoost and depends on ρ. Rätsch & Warmuth (2005) give another variant of the algorithm that does not require knowing the best ρ; see also the related work of Kivinen & Warmuth (1999) and Warmuth et al. (2006).

Our algorithm directly benefits from the learning guarantees given in Section 2 since it seeks to minimize the bound of Theorem 1. In the next section, we report the results of our experiments with DeepBoost. Let us mention that we have also designed an alternative deep boosting algorithm that we briefly describe and discuss in Appendix C.

4. Experiments

An additional benefit of the learning bounds presented in Section 2 is that they are data-dependent. They are based on the Rademacher complexity of the base hypothesis sets H_k, which in some cases can be well estimated from the training sample. The algorithm DeepBoost directly inherits this advantage. For example, if the hypothesis set H_k is based on a positive definite kernel with sample matrix K_k, it is known that its empirical Rademacher complexity can be upper bounded by $\sqrt{\mathrm{Tr}[K_k]}/m$ and lower bounded by $\sqrt{\mathrm{Tr}[K_k]}/(\sqrt{2}\,m)$. In other cases, when H_k is a family of functions taking binary values, we can use an upper bound on the Rademacher complexity in terms of the growth function of H_k, Π_{H_k}(m):

$$R_m(H_k) \le \sqrt{\frac{2 \log \Pi_{H_k}(m)}{m}}.$$

Thus, for the family H_1^stumps of boosting stumps in dimension d, Π_{H_1^stumps}(m) ≤ 2md, since there are 2m distinct threshold functions for each dimension with m points. Thus, the following inequality holds:

$$R_m(H_1^{\mathrm{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}.$$

Similarly, we consider the family of decision trees H_2^stumps of depth 2 with the same question at the internal nodes of depth 1. We have Π_{H_2^stumps}(m) ≤ (2m)² d(d−1)/2, since there are d(d−1)/2 distinct trees of this type and since each induces at most (2m)² labelings. Thus, we can write

$$R_m(H_2^{\mathrm{stumps}}) \le \sqrt{\frac{2\log\big(2m^2 d(d-1)\big)}{m}}.$$

More generally, we also consider the family of all binary decision trees H_k^trees of depth k. For this family it is known that VC-dim(H_k^trees) ≤ (2^k + 1) log₂(d + 1) (Mansour, 1997). More generally, the VC-dimension of T_n, the family of decision trees with n nodes in dimension d, can be bounded by (2n + 1) log₂(d + 2) (see for example (Mohri et al., 2012)). Since $R_m(H) \le \sqrt{2\,\mathrm{VC\text{-}dim}(H)\log(m+1)/m}$ for any hypothesis class H, we have

$$R_m(T_n) \le \sqrt{\frac{(4n+2)\log_2(d+2)\log(m+1)}{m}}.$$

The experiments with DeepBoost described below use either H^stumps = H_1^stumps ∪ H_2^stumps or ∪_{k=1}^K H_k^trees, for some K > 0, as the base hypothesis sets. For any hypothesis in these sets, DeepBoost will use the upper bounds given above as a proxy for the Rademacher complexity of the set to which it belongs. We leave it to the future to experiment with finer data-dependent estimates or upper bounds on the Rademacher complexity, which could further improve the performance of our algorithm.
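These complexity proxies are simple closed-form expressions. The following hedged sketch (function names are ours) computes them so that they can be plugged in as the r_j values in Λ_j = λ r_j + β; the sample size and dimension below are illustrative, not taken from the paper's datasets.

```python
import numpy as np

def rademacher_stumps_depth1(m, d):
    """Upper bound for boosting stumps: sqrt(2 log(2 m d) / m)."""
    return np.sqrt(2.0 * np.log(2.0 * m * d) / m)

def rademacher_stumps_depth2(m, d):
    """Upper bound for the depth-2 stump trees: sqrt(2 log(2 m^2 d (d-1)) / m)."""
    return np.sqrt(2.0 * np.log(2.0 * m**2 * d * (d - 1)) / m)

def rademacher_trees_n_nodes(m, d, n):
    """Bound via VC-dim <= (2n+1) log2(d+2): sqrt((4n+2) log2(d+2) log(m+1) / m)."""
    return np.sqrt((4.0 * n + 2.0) * np.log2(d + 2) * np.log(m + 1) / m)

# Example: the proxy grows with tree size, which is what DeepBoost penalizes.
m, d = 1000, 20   # illustrative sample size and dimension
for n in (1, 3, 7, 15):
    print(n, round(rademacher_trees_n_nodes(m, d, n), 4))
```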
Recall that each iteration of DeepBoost searches for the base hypothesis that is optimal with respect to a certain criterion (see lines 5–10 of Figure 2). While an exhaustive search is feasible for H_1^stumps, it would be far too expensive to visit all trees in H_K^trees when K is large. Therefore, when using H_K^trees and also H_2^stumps as the base hypotheses, we use the following heuristic search procedure in each iteration t: First, the optimal tree h*_1 is found via exhaustive search. Next, for all 1 < k ≤ K, a locally optimal tree h*_k is found by considering only trees that can be obtained by adding a single layer of leaves to h*_{k−1}. Finally, we select the best hypothesis in the set {h*_1, ..., h*_K, h_1, ..., h_{t−1}}, where h_1, ..., h_{t−1} are the hypotheses selected in previous iterations.

Table 1. Results for boosted decision stumps and the exponential loss function.

| Dataset | Algorithm | Error | (std dev) | Avg tree size | Avg no. of trees |
|---|---|---|---|---|---|
| breastcancer | AdaBoost, H_1^stumps | 0.0429 | (0.0248) | 1 | 100 |
| breastcancer | AdaBoost, H_2^stumps | 0.0437 | (0.0214) | 2 | 100 |
| breastcancer | AdaBoost-L1 | 0.0408 | (0.0223) | 1.436 | 43.6 |
| breastcancer | DeepBoost | 0.0373 | (0.0225) | 1.215 | 21.6 |
| ocr17 | AdaBoost, H_1^stumps | 0.0085 | (0.0072) | 1 | 100 |
| ocr17 | AdaBoost, H_2^stumps | 0.008 | (0.0054) | 2 | 100 |
| ocr17 | AdaBoost-L1 | 0.0075 | (0.0068) | 1.086 | 37.8 |
| ocr17 | DeepBoost | 0.0070 | (0.0048) | 1.369 | 36.9 |
| ionosphere | AdaBoost, H_1^stumps | 0.1014 | (0.0414) | 1 | 100 |
| ionosphere | AdaBoost, H_2^stumps | 0.075 | (0.0413) | 2 | 100 |
| ionosphere | AdaBoost-L1 | 0.0708 | (0.0331) | 1.392 | 39.35 |
| ionosphere | DeepBoost | 0.0638 | (0.0394) | 1.168 | 17.45 |
| ocr49 | AdaBoost, H_1^stumps | 0.0555 | (0.0167) | 1 | 100 |
| ocr49 | AdaBoost, H_2^stumps | 0.032 | (0.0114) | 2 | 100 |
| ocr49 | AdaBoost-L1 | 0.03 | (0.0122) | 1.99 | 99.3 |
| ocr49 | DeepBoost | 0.0275 | (0.0095) | 1.96 | 96 |
| german | AdaBoost, H_1^stumps | 0.243 | (0.0445) | 1 | 100 |
| german | AdaBoost, H_2^stumps | 0.2505 | (0.0487) | 2 | 100 |
| german | AdaBoost-L1 | 0.2455 | (0.0438) | 1.54 | 54.1 |
| german | DeepBoost | 0.2395 | (0.0462) | 1.76 | 76.5 |
| ocr17-mnist | AdaBoost, H_1^stumps | 0.0056 | (0.0017) | 1 | 100 |
| ocr17-mnist | AdaBoost, H_2^stumps | 0.0048 | (0.0014) | 2 | 100 |
| ocr17-mnist | AdaBoost-L1 | 0.0046 | (0.0013) | 2 | 100 |
| ocr17-mnist | DeepBoost | 0.0040 | (0.0014) | 1.99 | 100 |
| diabetes | AdaBoost, H_1^stumps | 0.253 | (0.0330) | 1 | 100 |
| diabetes | AdaBoost, H_2^stumps | 0.260 | (0.0518) | 2 | 100 |
| diabetes | AdaBoost-L1 | 0.254 | (0.04868) | 1.9975 | 100 |
| diabetes | DeepBoost | 0.253 | (0.0510) | 1.9975 | 100 |
| ocr49-mnist | AdaBoost, H_1^stumps | 0.0414 | (0.00539) | 1 | 100 |
| ocr49-mnist | AdaBoost, H_2^stumps | 0.0209 | (0.00521) | 2 | 100 |
| ocr49-mnist | AdaBoost-L1 | 0.0200 | (0.00408) | 1.9975 | 100 |
| ocr49-mnist | DeepBoost | 0.0177 | (0.00438) | 1.9975 | 100 |

Breiman (1999) and Reyzin & Schapire (2006) extensively investigated the relationship between the complexity of decision trees in an ensemble learned by AdaBoost and the generalization error of the ensemble. We tested DeepBoost on the same UCI datasets used by these authors, http://archive.ics.uci.edu/ml/datasets.html, specifically breastcancer, ionosphere, german (numeric) and diabetes. We also experimented with two optical character recognition datasets used by Reyzin & Schapire (2006), ocr17 and ocr49, which contain the handwritten digits 1 and 7, and 4 and 9 (respectively). Finally, because these OCR datasets are fairly small, we also constructed the analogous datasets from all of MNIST, http://yann.lecun.com/exdb/mnist/, which we call ocr17-mnist and ocr49-mnist. More details on all the datasets are given in Table 4, Appendix D.1.

As we discussed in Section 3.2, by fixing the parameters β and λ to certain values, we recover some known algorithms as special cases of DeepBoost.
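For quick reference, the sketch below is our own summary of these special cases (not code from the paper); it records the parameter settings used in the comparisons that follow.

```python
# Hedged summary of the special cases discussed in Section 3.2:
# (lambda, beta, surrogate loss) -> algorithm recovered by the DeepBoost objective.
SPECIAL_CASES = {
    ("lambda=0", "beta=0", "exponential"): "AdaBoost",
    ("lambda=0", "beta=0", "logistic"):    "additive Logistic Regression (LogReg)",
    ("lambda=0", "beta>0", "exponential"): "L1-norm regularized AdaBoost (AdaBoost-L1)",
    ("lambda=0", "beta>0", "logistic"):    "L1-norm regularized LogReg (LogReg-L1)",
    ("lambda>0", "beta>=0", "any"):        "DeepBoost (capacity-conscious penalty)",
}
```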
Our experiments compared DeepBoost to AdaBoost (β = λ = 0 with the exponential loss), to Logistic Regression (β = λ = 0 with the logistic loss), which we abbreviate as LogReg, to L1-norm regularized AdaBoost (e.g., see (Rätsch et al., 2001a)), abbreviated as AdaBoost-L1, and also to the L1-norm regularized additive Logistic Regression algorithm studied by Duchi & Singer (2009) (β > 0, λ = 0), abbreviated as LogReg-L1.

In the first set of experiments, reported in Table 1, we compared AdaBoost, AdaBoost-L1, and DeepBoost with the exponential loss (Φ(−u) = exp(−u)) and base hypotheses H^stumps. We tested standard AdaBoost with base hypotheses H_1^stumps and H_2^stumps. For AdaBoost-L1, we optimized over β ∈ {2^−i : i = 6, ..., 0}, and for DeepBoost, we optimized over β in the same range and λ ∈ {0.0001, 0.005, 0.01, 0.05, 0.1, 0.5}. The exact parameter optimization procedure is described below.

In the second set of experiments, reported in Table 2, we used base hypotheses H_K^trees instead of H^stumps, where the maximum tree depth K was an additional parameter to be optimized. Specifically, for AdaBoost we optimized over K ∈ {1, ..., 6}, for AdaBoost-L1 we optimized over those same values for K and β ∈ {10^−i : i = 3, ..., 7}, and for DeepBoost we optimized over those same values for K and β, and over λ ∈ {10^−i : i = 3, ..., 7}.

Table 2. Results for boosted decision trees and the exponential loss function.

| Dataset | Algorithm | Error | (std dev) | Avg tree size | Avg no. of trees |
|---|---|---|---|---|---|
| breastcancer | AdaBoost | 0.0267 | (0.00841) | 29.1 | 67.1 |
| breastcancer | AdaBoost-L1 | 0.0264 | (0.0098) | 28.9 | 51.7 |
| breastcancer | DeepBoost | 0.0243 | (0.00797) | 20.9 | 55.9 |
| ocr17 | AdaBoost | 0.004 | (0.00316) | 15.0 | 88.3 |
| ocr17 | AdaBoost-L1 | 0.003 | (0.00100) | 30.4 | 65.3 |
| ocr17 | DeepBoost | 0.002 | (0.00100) | 26.0 | 61.8 |
| ionosphere | AdaBoost | 0.0661 | (0.0315) | 29.8 | 75.0 |
| ionosphere | AdaBoost-L1 | 0.0657 | (0.0257) | 31.4 | 69.4 |
| ionosphere | DeepBoost | 0.0501 | (0.0316) | 26.1 | 50.0 |
| ocr49 | AdaBoost | 0.0180 | (0.00555) | 30.9 | 92.4 |
| ocr49 | AdaBoost-L1 | 0.0175 | (0.00357) | 62.1 | 89.0 |
| ocr49 | DeepBoost | 0.0175 | (0.00510) | 30.2 | 83.0 |
| german | AdaBoost | 0.239 | (0.0165) | 3 | 91.3 |
| german | AdaBoost-L1 | 0.239 | (0.0201) | 7 | 87.5 |
| german | DeepBoost | 0.234 | (0.0148) | 16.0 | 14.1 |
| ocr17-mnist | AdaBoost | 0.00471 | (0.0022) | 15 | 88.7 |
| ocr17-mnist | AdaBoost-L1 | 0.00471 | (0.0021) | 33.4 | 66.8 |
| ocr17-mnist | DeepBoost | 0.00409 | (0.0021) | 22.1 | 59.2 |
| diabetes | AdaBoost | 0.249 | (0.0272) | 3 | 45.2 |
| diabetes | AdaBoost-L1 | 0.240 | (0.0313) | 3 | 28.0 |
| diabetes | DeepBoost | 0.230 | (0.0399) | 5.37 | 19.0 |
| ocr49-mnist | AdaBoost | 0.0198 | (0.00500) | 29.9 | 82.4 |
| ocr49-mnist | AdaBoost-L1 | 0.0197 | (0.00512) | 66.3 | 81.1 |
| ocr49-mnist | DeepBoost | 0.0182 | (0.00551) | 30.1 | 80.9 |

The last set of experiments, reported in Table 3, is identical to the experiments reported in Table 2, except that we used the logistic loss Φ(−u) = log₂(1 + exp(−u)).

We used the following parameter optimization procedure in all experiments: Each dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each run i ∈ {0, ..., 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average error and the standard deviation of the error over all 10 runs are reported in Tables 1, 2 and 3, as are the average number of trees and the average size of the trees in the ensembles. In all of our experiments, the number of iterations was set to 100.
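This fold-rotation scheme is easy to reproduce. Here is a minimal sketch of the splitting logic (our own code with hypothetical names, not the authors' implementation), assuming folds indexed 0–9:

```python
import numpy as np

def rotating_splits(n_examples, n_folds=10, seed=0):
    """Yield (train_idx, valid_idx, test_idx) for the rotation described above:
    in run i, fold i is the test set, fold (i+1) mod n_folds the validation set,
    and the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), n_folds)
    for i in range(n_folds):
        test = folds[i]
        valid = folds[(i + 1) % n_folds]
        train = np.concatenate(
            [f for j, f in enumerate(folds) if j not in (i, (i + 1) % n_folds)]
        )
        yield train, valid, test
```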
Table 3. Results for boosted decision trees and the logistic loss function.

| Dataset | Algorithm | Error | (std dev) | Avg tree size | Avg no. of trees |
|---|---|---|---|---|---|
| breastcancer | LogReg | 0.0351 | (0.0101) | 15 | 65.3 |
| breastcancer | LogReg-L1 | 0.0264 | (0.0120) | 59.9 | 16.0 |
| breastcancer | DeepBoost | 0.0264 | (0.00876) | 14.0 | 23.8 |
| ocr17 | LogReg | 0.00300 | (0.00100) | 15.0 | 75.3 |
| ocr17 | LogReg-L1 | 0.00400 | (0.00141) | 7 | 53.8 |
| ocr17 | DeepBoost | 0.00250 | (0.000866) | 22.1 | 25.8 |
| ionosphere | LogReg | 0.074 | (0.0236) | 7 | 44.7 |
| ionosphere | LogReg-L1 | 0.060 | (0.0219) | 30.0 | 25.3 |
| ionosphere | DeepBoost | 0.043 | (0.0188) | 18.4 | 29.5 |
| ocr49 | LogReg | 0.0205 | (0.00654) | 31.0 | 63.5 |
| ocr49 | LogReg-L1 | 0.0200 | (0.00245) | 31.0 | 54.0 |
| ocr49 | DeepBoost | 0.0170 | (0.00361) | 63.2 | 37.0 |
| german | LogReg | 0.233 | (0.0114) | 7 | 72.8 |
| german | LogReg-L1 | 0.232 | (0.0123) | 7 | 66.8 |
| german | DeepBoost | 0.225 | (0.0103) | 14.4 | 67.8 |
| ocr17-mnist | LogReg | 0.00422 | (0.00191) | 15 | 71.4 |
| ocr17-mnist | LogReg-L1 | 0.00417 | (0.00188) | 15 | 55.6 |
| ocr17-mnist | DeepBoost | 0.00399 | (0.00211) | 25.9 | 27.6 |
| diabetes | LogReg | 0.250 | (0.0374) | 3 | 46.0 |
| diabetes | LogReg-L1 | 0.246 | (0.0356) | 3 | 45.5 |
| diabetes | DeepBoost | 0.246 | (0.0356) | 3 | 45.5 |
| ocr49-mnist | LogReg | 0.0211 | (0.00412) | 28.7 | 79.3 |
| ocr49-mnist | LogReg-L1 | 0.0201 | (0.00433) | 33.5 | 61.7 |
| ocr49-mnist | DeepBoost | 0.0201 | (0.00411) | 72.8 | 41.9 |

We also experimented with running each algorithm for up to 1,000 iterations, but observed that the test errors did not change significantly and, more importantly, that the ordering of the algorithms by their test errors was unchanged from 100 iterations to 1,000 iterations.

Observe that with the exponential loss, DeepBoost has a smaller test error than AdaBoost and AdaBoost-L1 on every dataset and for every set of base hypotheses, except for the ocr49 dataset with decision trees, where its performance matches that of AdaBoost-L1. Similarly, with the logistic loss, DeepBoost always performs at least as well as LogReg or LogReg-L1. For the small-sized UCI datasets it is difficult to obtain statistically significant results but, for the larger ocrXX-mnist datasets, our results with DeepBoost are statistically significantly better at the 2% level using one-sided paired t-tests in all three sets of experiments (three tables), except for ocr49-mnist in Table 3, where this holds only for the comparison with LogReg.
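For reference, such a one-sided paired t-test over the 10 runs can be computed as follows. This is a hedged sketch: the error values below are invented for illustration (not the paper's measurements), and it assumes SciPy ≥ 1.6 for the `alternative` argument.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run test errors over the 10 fold rotations (illustrative only).
deepboost_err = np.array([0.0180, 0.0175, 0.0185, 0.0178, 0.0182,
                          0.0179, 0.0176, 0.0184, 0.0181, 0.0180])
baseline_err  = np.array([0.0199, 0.0195, 0.0202, 0.0198, 0.0200,
                          0.0197, 0.0196, 0.0201, 0.0199, 0.0198])

# One-sided paired t-test: alternative hypothesis is "DeepBoost error < baseline error".
t_stat, p_value = stats.ttest_rel(deepboost_err, baseline_err, alternative="less")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")  # significant at 2% if p < 0.02
```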
This across-the-board improvement is the result of DeepBoost's complexity-conscious ability to dynamically tune the sizes of the decision trees selected in each boosting round, trading off between training error and hypothesis class complexity. The selected tree sizes should depend on properties of the training set, and this is borne out by our experiments: for some datasets, such as breastcancer, DeepBoost selects trees that are smaller on average than the trees selected by AdaBoost-L1 or LogReg-L1, while, for other datasets, such as german, the average tree size is larger. Note that AdaBoost and AdaBoost-L1 produce ensembles of trees that have a constant depth, since neither algorithm penalizes tree size except for imposing a maximum tree depth K, while for DeepBoost the trees in one ensemble typically vary in size. Figure 3 plots the distribution of tree sizes for one run of DeepBoost.

[Figure 3. Distribution of tree sizes when DeepBoost is run on the ionosphere dataset.]

It should be noted that the columns for AdaBoost in Table 1 simply list the number of stumps to be the same as the number of boosting rounds; a careful examination of the ensembles for 100 rounds of boosting typically reveals a 5% duplication of stumps in the ensembles.

Theorem 1 is a margin-based generalization guarantee, and is also the basis for the derivation of DeepBoost, so we should expect DeepBoost to induce large margins on the training set. Figure 4 shows the margin distributions for AdaBoost, AdaBoost-L1 and DeepBoost on the same subset of the ionosphere dataset.

[Figure 4. Distribution of normalized margins for AdaBoost (upper right), AdaBoost-L1 (upper left) and DeepBoost (lower left) on the same subset of ionosphere. The cumulative margin distributions (lower right) illustrate that DeepBoost (red) induces larger margins on the training set than either AdaBoost (black) or AdaBoost-L1 (blue).]

5. Conclusion

We presented a theoretical analysis of learning with a base hypothesis set composed of increasingly complex sub-families, including very deep or complex ones, and derived an algorithm, DeepBoost, which is precisely based on those guarantees. We also reported the results of experiments with this algorithm and compared its performance with that of AdaBoost and additive Logistic Regression, and their L1-norm regularized counterparts, in several tasks. We have derived similar theoretical guarantees in the multi-class setting and used them to derive a family of new multi-class deep boosting algorithms that we will present and discuss elsewhere. Our theoretical analysis and algorithmic design could also be extended to ranking and to a broad class of loss functions. This should also lead to the generalization of several existing algorithms and their use with a richer hypothesis set structured as a union of families with different Rademacher complexities. In particular, the broad family of maximum entropy models and conditional maximum entropy models and their many variants, which includes the already discussed logistic regression, could all be extended in a similar way. The resulting Deep Maxent models (or their conditional versions) may admit an alternative theoretical justification that we will discuss elsewhere. Our algorithm can also be extended by considering non-differentiable convex surrogate losses such as the hinge loss. When used with kernel base classifiers, this leads to an algorithm we have named DeepSVM. The theory we developed could perhaps be further generalized to encompass the analysis of other learning techniques such as multi-layer neural networks.

Our analysis and algorithm also shed some new light on some remaining questions about the theory underlying AdaBoost. The primary theoretical justification for AdaBoost is a margin guarantee (Schapire et al., 1997; Koltchinskii & Panchenko, 2002).
However, AdaBoost does not precisely maximize the minimum margin, while other algorithms such as arc-gv (Breiman, 1999) that are designed to do so tend not to outperform AdaBoost (Reyzin & Schapire, 2006). Two main reasons are suspected for this observation: (1) in order to achieve a better margin, algorithms such as arc-gv may tend to select deeper decision trees or, in general, more complex hypotheses, which may then affect their generalization; (2) while those algorithms achieve a better minimum margin, they do not achieve a better margin distribution. Our theory may help better understand and evaluate the effect of factor (1), since our learning bounds explicitly depend on the mixture weights and the contribution of each hypothesis set H_k to the definition of the ensemble function. However, our guarantees also suggest a better algorithm, DeepBoost.

Acknowledgments

We thank Vitaly Kuznetsov for his comments on an earlier draft of this paper. The work of M. Mohri was partly funded by the NSF award IIS-1117591.

References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bauer, Eric and Kohavi, Ron. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Breiman, Leo. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In ICML, 2004.

Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

Duchi, John C. and Singer, Yoram. Boosting with structural sparsity. In ICML, pp. 38, 2009.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Freund, Yoav, Mansour, Yishay, and Schapire, Robert E. Generalization bounds for averaged classifiers. The Annals of Statistics, 32:1698–1722, 2004.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Grove, Adam J. and Schuurmans, Dale. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pp. 692–699, 1998.

Kivinen, Jyrki and Warmuth, Manfred K. Boosting as entropy projection. In COLT, pp. 134–144, 1999.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

MacKay, David J. C. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1991.

Mansour, Yishay. Pessimistic decision tree pruning based on tree size. In Proceedings of ICML, pp. 195–201, 1997.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Quinlan, J. Ross. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pp. 725–730, 1996.

Rätsch, Gunnar and Warmuth, Manfred K. Maximizing the margin with boosting. In COLT, pp. 334–350, 2002.

Rätsch, Gunnar and Warmuth, Manfred K. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131–2152, 2005.

Rätsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, pp. 487–494, 2001a.
Rätsch, Gunnar, Onoda, Takashi, and Müller, Klaus-Robert. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001b.

Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier complexity. In ICML, pp. 753–760, 2006.

Schapire, Robert E. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science, pp. 13–25. Springer, 1999.

Schapire, Robert E. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pp. 149–172. Springer, 2003.

Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pp. 322–330, 1997.

Smyth, Padhraic and Wolpert, David. Linearly combining density estimators via stacking. Machine Learning, 36:59–83, July 1999.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.

Warmuth, Manfred K., Liao, Jun, and Rätsch, Gunnar. Totally corrective boosting algorithms that maximize the margin. In ICML, pp. 1001–1008, 2006.