# Multi-Class Deep Boosting

Vitaly Kuznetsov, Courant Institute, 251 Mercer Street, New York, NY 10012, vitaly@cims.nyu.edu
Mehryar Mohri, Courant Institute & Google Research, 251 Mercer Street, New York, NY 10012, mohri@cims.nyu.edu
Umar Syed, Google Research, 76 Ninth Avenue, New York, NY 10011, usyed@google.com

## Abstract

We present new ensemble learning algorithms for multi-class classification. Our algorithms can use as a base classifier set a family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multi-class classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble's mixture weights. We introduce and discuss several new multi-class ensemble algorithms benefiting from these guarantees, prove positive results for the H-consistency of several of them, and report the results of experiments showing that their performance compares favorably with that of multi-class versions of AdaBoost and Logistic Regression and their L1-regularized counterparts.

## 1 Introduction

Devising ensembles of base predictors is a standard approach in machine learning which often helps improve performance in practice. Ensemble methods include the family of boosting meta-algorithms, among which the most notable and widely used one is AdaBoost [Freund and Schapire, 1997], also known as forward stagewise additive modeling [Friedman et al., 1998]. AdaBoost and its other variants learn convex combinations of predictors. They seek to greedily minimize a convex surrogate function upper-bounding the misclassification loss by augmenting, at each iteration, the current ensemble with a new suitably weighted predictor. One key advantage of AdaBoost is that, since it is based on a stagewise procedure, it can learn an effective ensemble of base predictors chosen from a very large and potentially infinite family, provided that an efficient algorithm is available for selecting a good predictor at each stage. Furthermore, AdaBoost and its L1-regularized counterpart [Rätsch et al., 2001a] benefit from favorable learning guarantees, in particular theoretical margin bounds [Schapire et al., 1997, Koltchinskii and Panchenko, 2002]. However, those bounds depend not just on the margin and the sample size, but also on the complexity of the base hypothesis set, which suggests a risk of overfitting when using too complex base hypothesis sets. And indeed, overfitting has been reported in practice for AdaBoost in the past [Grove and Schuurmans, 1998, Schapire, 1999, Dietterich, 2000, Rätsch et al., 2001b]. Cortes, Mohri, and Syed [2014] introduced a new ensemble algorithm, DeepBoost, which they proved to benefit from finer learning guarantees, including favorable ones even when using as base classifier set relatively rich families, for example a family of very deep decision trees, or other similarly complex families.
In DeepBoost, the decisions in each iteration of which classifier to add to the ensemble and which weight to assign to that classifier depend on the (data-dependent) complexity of the sub-family to which the classifier belongs: one interpretation of DeepBoost is that it applies the principle of structural risk minimization to each iteration of boosting. Cortes, Mohri, and Syed [2014] further showed that empirically DeepBoost achieves a better performance than AdaBoost, Logistic Regression, and their L1-regularized variants.

The main contribution of this paper is an extension of these theoretical, algorithmic, and empirical results to the multi-class setting. Two distinct approaches have been considered in the past for the definition and the design of boosting algorithms in the multi-class setting. One approach consists of combining base classifiers mapping each example x to an output label y. This includes the SAMME algorithm [Zhu et al., 2009] as well as the algorithm of Mukherjee and Schapire [2013], which is shown to be, in a certain sense, optimal for this approach. An alternative approach, often more flexible and more widely used in applications, consists of combining base classifiers mapping each pair (x, y) formed by an example x and a label y to a real-valued score. This is the approach adopted in this paper, which is also the one used for the design of AdaBoost.MR [Schapire and Singer, 1999] and other variants of that algorithm.

In Section 2, we prove a novel generalization bound for multi-class classification ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong. Our result generalizes the main result of Cortes et al. [2014] to the multi-class setting, and also represents an improvement on the multi-class generalization bound due to Koltchinskii and Panchenko [2002], even if we disregard our finer analysis related to Rademacher complexity. In Section 3, we present several multi-class surrogate losses that are motivated by our generalization bound, and discuss and compare their functional and consistency properties. In particular, we prove that our surrogate losses are realizable H-consistent, a hypothesis-set-specific notion of consistency that was recently introduced by Long and Servedio [2013]. Our results generalize those of Long and Servedio [2013] and admit simpler proofs. We also present a family of multi-class DeepBoost learning algorithms based on each of these surrogate losses, and prove a general convergence guarantee for them. In Section 4, we report the results of experiments demonstrating that multi-class DeepBoost outperforms AdaBoost.MR and multinomial (additive) logistic regression, as well as their L1-norm regularized variants, on several datasets.

## 2 Multi-class data-dependent learning guarantee for convex ensembles

In this section, we present a data-dependent learning bound in the multi-class setting for convex ensembles based on multiple base hypothesis sets. Let X denote the input space. We denote by Y = {1, ..., c} a set of c ≥ 2 classes. The label associated by a hypothesis f : X × Y → R to x ∈ X is given by argmax_{y∈Y} f(x, y). The margin ρ_f(x, y) of the function f for a labeled example (x, y) ∈ X × Y is defined by

$$\rho_f(x, y) = f(x, y) - \max_{y' \neq y} f(x, y'). \qquad (1)$$

Thus, f misclassifies (x, y) iff ρ_f(x, y) ≤ 0.
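As a concrete reading of definition (1), the following NumPy sketch computes margins from a matrix of scores and the resulting empirical margin error. The array names and shapes are our own illustrative choices, not the paper's.

```python
import numpy as np

def margins(scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """rho_f(x_i, y_i) = f(x_i, y_i) - max_{y != y_i} f(x_i, y).

    scores: (m, c) array with scores[i, y] = f(x_i, y);
    labels: (m,) array of true labels in {0, ..., c - 1}.
    """
    m = scores.shape[0]
    true_scores = scores[np.arange(m), labels]
    competitors = scores.copy()
    competitors[np.arange(m), labels] = -np.inf  # exclude the true label
    return true_scores - competitors.max(axis=1)

def empirical_margin_error(scores, labels, rho=0.0):
    """Fraction of points with margin at most rho; rho = 0 gives the
    training misclassification rate, since f errs iff rho_f(x, y) <= 0."""
    return float(np.mean(margins(scores, labels) <= rho))
```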
We consider p families H_1, ..., H_p of functions mapping from X × Y to [0, 1] and the ensemble family F = conv(∪_{k=1}^p H_k), that is, the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) is in the simplex and where, for each t ∈ [1, T], h_t is in H_{k_t} for some k_t ∈ [1, p]. We assume that training and test points are drawn i.i.d. according to some distribution D over X × Y and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m. For any ρ > 0, the generalization error R(f), its ρ-margin error R_ρ(f) and its empirical margin error are defined as follows:

$$R(f) = \mathop{\mathbb{E}}_{(x,y)\sim D}\big[1_{\rho_f(x,y) \le 0}\big], \quad R_\rho(f) = \mathop{\mathbb{E}}_{(x,y)\sim D}\big[1_{\rho_f(x,y) \le \rho}\big], \quad \widehat{R}_{S,\rho}(f) = \mathop{\mathbb{E}}_{(x,y)\sim S}\big[1_{\rho_f(x,y) \le \rho}\big], \qquad (2)$$

where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S. For any family of hypotheses G mapping X × Y to R, we define Π₁(G) by

$$\Pi_1(G) = \{x \mapsto h(x, y) \colon y \in Y,\, h \in G\}. \qquad (3)$$

The following theorem gives a margin-based Rademacher complexity bound for learning with ensembles of base classifiers with multiple hypothesis sets. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results.

**Theorem 1.** Assume p > 1 and let H_1, ..., H_p be p families of functions mapping from X × Y to [0, 1]. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D, the following inequality holds for all f = Σ_{t=1}^T α_t h_t ∈ F:

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{8c}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m\big(\Pi_1(H_{k_t})\big) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\Big\lceil \frac{4}{\rho^2}\log\Big(\frac{c^2\rho^2 m}{4\log p}\Big)\Big\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus,

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{8c}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(H_{k_t}) + O\bigg(\sqrt{\frac{\log p}{\rho^2 m}\log\Big[\frac{\rho^2 c^2 m}{4\log p}\Big]}\bigg).$$

The full proof of Theorem 1 is given in Appendix B. Even for p = 1, that is for the special case of a single hypothesis set, our analysis improves upon the multi-class margin bound of Koltchinskii and Panchenko [2002], since our bound admits only a linear dependency on the number of classes c instead of a quadratic one. However, the main remarkable benefit of this learning bound is that its complexity term admits an explicit dependency on the mixture coefficients α_t: it is a weighted average of Rademacher complexities with mixture weights α_t, t ∈ [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not negatively affect generalization if the corresponding total mixture weight (sum of the α_t's corresponding to that hypothesis set) is relatively small. Using such potentially complex families could help achieve a better margin on the training sample.

The theorem cannot be proven via the standard Rademacher complexity analysis of Koltchinskii and Panchenko [2002], since the complexity term of the bound would then be $\mathfrak{R}_m(\mathrm{conv}(\cup_{k=1}^p H_k)) = \mathfrak{R}_m(\cup_{k=1}^p H_k)$, which does not admit an explicit dependency on the mixture weights and is lower-bounded by $\sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})$. Thus, the theorem provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis.

## 3 Algorithms

In this section, we will use the learning guarantees just described to derive several new ensemble algorithms for multi-class classification.

### 3.1 Optimization problem

Let H_1, ..., H_p be p disjoint families of functions taking values in [0, 1] with increasing Rademacher complexities R_m(H_k), k ∈ [1, p]. For any hypothesis h ∈ ∪_{k=1}^p H_k, we denote by d(h) the index of the hypothesis set it belongs to, that is h ∈ H_{d(h)}.
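To see concretely why the weighted complexity term of Theorem 1 can be much smaller than a uniform bound, consider the following toy computation; the complexity values and weights are invented for illustration.

```python
import numpy as np

# Hypothetical Rademacher complexity estimates for p = 3 families,
# e.g. stumps, depth-3 trees, very deep trees.
R = np.array([0.01, 0.05, 0.25])
alpha = np.array([0.6, 0.3, 0.1])   # mixture weights, most mass on simple families
k = np.array([0, 1, 2])             # family index k_t of each ensemble member

weighted = float(np.sum(alpha * R[k]))  # sum_t alpha_t R_m(H_{k_t}) = 0.046
uniform = float(R.max())                # a standard analysis pays max_k R_m(H_k) = 0.25
print(weighted, uniform)
```

Even though the deepest family contributes its full complexity 0.25 under a standard analysis, here it costs only 0.1 × 0.25 = 0.025 because its total mixture weight is small.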
The bound of Theorem 1 holds uniformly for all ρ > 0 and functions f ∈ conv(∪_{k=1}^p H_k). Since the last term of the bound does not depend on α, it suggests selecting α to minimize

$$\frac{1}{m}\sum_{i=1}^m 1_{\rho_f(x_i,y_i) \le \rho} + \frac{8c}{\rho}\sum_{t=1}^T \alpha_t r_t,$$

where r_t = R_m(H_{d(h_t)}) and α ∈ Δ.¹ Since for any ρ > 0, f and f/ρ admit the same generalization error, we can instead search for α ≥ 0 with Σ_{t=1}^T α_t ≤ 1/ρ, which leads to

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m 1_{\rho_f(x_i,y_i) \le 1} + 8c\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.} \quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}. \qquad (4)$$

The first term of the objective is not a convex function of α and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let u ↦ Φ(−u) be a non-increasing convex function upper-bounding u ↦ 1_{u≤0} over R; Φ may be selected to be, for example, the exponential function as in AdaBoost [Freund and Schapire, 1997] or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - \rho_f(x_i,y_i)\big) + \lambda\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.} \quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}, \qquad (5)$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term.² Introducing a Lagrange variable β ≥ 0 associated to the constraint in (5), the problem can be equivalently written as

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - \min_{y \neq y_i}\sum_{t=1}^T \alpha_t\big[h_t(x_i, y_i) - h_t(x_i, y)\big]\Big) + \sum_{t=1}^T(\lambda r_t + \beta)\alpha_t.$$

Here, β is a parameter that can be freely selected by the algorithm, since any choice of its value is equivalent to a choice of ρ in (5). Since Φ is a non-decreasing function, the problem can be equivalently written as

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \max_{y \neq y_i}\Phi\Big(1 - \sum_{t=1}^T \alpha_t\big[h_t(x_i, y_i) - h_t(x_i, y)\big]\Big) + \sum_{t=1}^T(\lambda r_t + \beta)\alpha_t.$$

Let {h_1, ..., h_N} be the set of distinct base functions, and let F_max be the objective function based on that expression:

$$F_{\max}(\alpha) = \frac{1}{m}\sum_{i=1}^m \max_{y \neq y_i}\Phi\Big(1 - \sum_{j=1}^N \alpha_j h_j(x_i, y_i, y)\Big) + \sum_{j=1}^N \Lambda_j\alpha_j, \qquad (6)$$

with α = (α_1, ..., α_N) ∈ R^N, h_j(x_i, y_i, y) = h_j(x_i, y_i) − h_j(x_i, y), and Λ_j = λr_j + β for all j ∈ [1, N]. Then, our optimization problem can be rewritten as min_{α≥0} F_max(α). This defines a convex optimization problem, since the domain {α ≥ 0} is a convex set and since F_max is convex: each term of the sum in its definition is convex as a pointwise maximum of convex functions (composition of the convex function Φ with an affine function) and the second term is a linear function of α. In general, F_max is not differentiable even when Φ is, but, since it is convex, it admits a sub-differential at every point. Additionally, along each direction, F_max admits left and right derivatives, both non-decreasing, and a differential everywhere except for a set that is at most countable.

### 3.2 Alternative objective functions

We now consider the following three natural upper bounds on F_max, which admit useful properties that we will discuss later, the third one valid when Φ can be written as the composition of two functions Φ₁ and Φ₂ with Φ₁ a non-decreasing function:

$$F_{\mathrm{sum}}(\alpha) = \frac{1}{m}\sum_{i=1}^m \sum_{y \neq y_i}\Phi\Big(1 - \sum_{j=1}^N \alpha_j h_j(x_i, y_i, y)\Big) + \sum_{j=1}^N \Lambda_j\alpha_j \qquad (7)$$

$$F_{\mathrm{maxsum}}(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - \sum_{j=1}^N \alpha_j \rho_{h_j}(x_i, y_i)\Big) + \sum_{j=1}^N \Lambda_j\alpha_j \qquad (8)$$

$$F_{\mathrm{compsum}}(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi_1\Big(\sum_{y \neq y_i}\Phi_2\Big(1 - \sum_{j=1}^N \alpha_j h_j(x_i, y_i, y)\Big)\Big) + \sum_{j=1}^N \Lambda_j\alpha_j. \qquad (9)$$

F_sum is obtained from F_max simply by replacing the max operator in the definition of F_max by a sum. Clearly, F_sum is convex and inherits the differentiability properties of Φ. A drawback of F_sum is that, for problems with very large c as in structured prediction, the computation of the sum over the incorrect labels may require resorting to approximations.

¹ The condition Σ_{t=1}^T α_t = 1 of Theorem 1 can be relaxed to Σ_{t=1}^T α_t ≤ 1. To see this, use for example a null hypothesis (h_t = 0 for some t).
² Note that this is a standard practice in the field of optimization. The optimization problem in (4) is equivalent to a vector optimization problem, where (Σ_{i=1}^m 1_{ρ_f(x_i,y_i) ≤ 1}, Σ_{t=1}^T α_t r_t) is minimized over α. The latter problem can be scalarized, leading to the introduction of the parameter λ in (5).
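As a concrete reading of (6) and (7), here is a small NumPy sketch of F_max and F_sum for the exponential loss Φ(v) = e^v; the tensor layout (an (N, m, c) array of base scores) is an assumption of ours, not something specified in the paper.

```python
import numpy as np

def score_differences(alpha, H, labels):
    """sum_j alpha_j h_j(x_i, y_i, y) for all i, y.
    H: (N, m, c) array, H[j, i, y] = h_j(x_i, y); labels: (m,)."""
    f = np.tensordot(alpha, H, axes=1)           # (m, c): f(x_i, y)
    m = f.shape[0]
    return f[np.arange(m), labels][:, None] - f  # (m, c); zero at y = y_i

def F_sum(alpha, H, labels, Lam, Phi=np.exp):
    d = score_differences(alpha, H, labels)
    m, c = d.shape
    off = np.ones_like(d, dtype=bool)
    off[np.arange(m), labels] = False            # restrict to y != y_i
    per_point = Phi(1.0 - d[off]).reshape(m, c - 1).sum(axis=1)
    return per_point.mean() + float(np.dot(Lam, alpha))

def F_max(alpha, H, labels, Lam, Phi=np.exp):
    d = score_differences(alpha, H, labels)
    m, c = d.shape
    off = np.ones_like(d, dtype=bool)
    off[np.arange(m), labels] = False
    per_point = Phi(1.0 - d[off]).reshape(m, c - 1).max(axis=1)
    return per_point.mean() + float(np.dot(Lam, alpha))
```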
F_maxsum is obtained from F_max by noticing that, by the sub-additivity of the max operator, the following inequality holds:

$$\max_{y \neq y_i}\Big({-}\sum_{j=1}^N \alpha_j h_j(x_i, y_i, y)\Big) \;\le\; \sum_{j=1}^N \max_{y \neq y_i}\big({-}\alpha_j h_j(x_i, y_i, y)\big) \;=\; -\sum_{j=1}^N \alpha_j \rho_{h_j}(x_i, y_i).$$

As with F_sum, function F_maxsum is convex and admits the same differentiability properties as Φ. Unlike F_sum, F_maxsum does not require computing a sum over the classes. Furthermore, note that the expressions ρ_{h_j}(x_i, y_i), i ∈ [1, m], can be pre-computed prior to the application of any optimization algorithm. Finally, for Φ = Φ₁ ∘ Φ₂ with Φ₁ non-decreasing, the max operator can be replaced by a sum before applying Φ₁, as follows:

$$\max_{y \neq y_i}\Phi\big(1 - f(x_i, y_i, y)\big) = \Phi_1\Big(\max_{y \neq y_i}\Phi_2\big(1 - f(x_i, y_i, y)\big)\Big) \le \Phi_1\Big(\sum_{y \neq y_i}\Phi_2\big(1 - f(x_i, y_i, y)\big)\Big),$$

where f(x_i, y_i, y) = Σ_{j=1}^N α_j h_j(x_i, y_i, y). This leads to the definition of F_compsum.

In Appendix C, we discuss the consistency properties of the loss functions just introduced. In particular, we prove that the loss functions associated to F_max and F_sum are realizable H-consistent (see Long and Servedio [2013]) in the common cases where the exponential or logistic losses are used and that, similarly, in the common case where Φ₁(u) = log(1 + u) and Φ₂(u) = exp(u + 1), the loss function associated to F_compsum is H-consistent. Furthermore, in Appendix D, we show that, under some mild assumptions, the objective functions we just discussed are essentially within a constant factor of each other. Moreover, in the case of binary classification, all of these objectives coincide.

### 3.3 Multi-class DeepBoost algorithms

In this section, we discuss in detail a family of multi-class DeepBoost algorithms, which are derived by application of coordinate descent to the objective functions discussed in the previous paragraphs. We will assume that Φ is differentiable over R and that Φ′(u) ≠ 0 for all u. This condition is not necessary (in particular, our presentation can be extended to non-differentiable functions such as the hinge loss), but it simplifies the presentation. In the case of the objective function F_compsum, we will assume that both Φ₁ and Φ₂, where Φ = Φ₁ ∘ Φ₂, are differentiable. Under these assumptions, F_sum, F_maxsum, and F_compsum are differentiable. F_max is not differentiable due to the presence of the max operators in its definition, but it admits a sub-differential at every point.

For convenience, let α_t = (α_{t,1}, ..., α_{t,N}) denote the vector obtained after t ≥ 1 iterations and let α_0 = 0. Let e_k denote the kth unit vector in R^N, k ∈ [1, N]. For a differentiable objective F, we denote by F′(α, e_j) the directional derivative of F along the direction e_j at α. Our coordinate descent algorithm consists of first determining the direction of maximal descent, that is k = argmax_{j∈[1,N]} |F′(α_{t−1}, e_j)|, next of determining the best step η along that direction that preserves non-negativity of α, η = argmin_{α_{t−1}+ηe_k ≥ 0} F(α_{t−1} + ηe_k), and of updating α_{t−1} to α_t = α_{t−1} + ηe_k. We will refer to this method as projected coordinate descent.
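The following sketch shows one way to implement projected coordinate descent for a differentiable objective such as F_sum. The stopping rule, step cap, and use of scipy's scalar minimizer are our simplifications, not the paper's prescriptions; in particular, the closed-form steps derived later for the exponential and logistic losses replace the generic line search.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def projected_coordinate_descent(F, grad_F, N, T=100, step_cap=10.0):
    """Minimize F over {alpha >= 0} by coordinate descent.

    F: callable on R^N; grad_F: its gradient (both assumed given).
    Each round: pick the coordinate with the largest-magnitude partial
    derivative, then line-search along it while keeping alpha_k >= 0.
    """
    alpha = np.zeros(N)
    for _ in range(T):
        g = grad_F(alpha)
        k = int(np.argmax(np.abs(g)))  # direction of maximal descent
        e_k = np.eye(N)[k]
        # Best step eta subject to alpha_k + eta >= 0 (bounded scalar search;
        # step_cap is an arbitrary upper limit for this sketch).
        res = minimize_scalar(lambda eta: F(alpha + eta * e_k),
                              bounds=(-alpha[k], step_cap), method="bounded")
        alpha[k] = max(alpha[k] + res.x, 0.0)
    return alpha
```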
The following theorem provides a convergence guarantee for our algorithms in that case.

**Theorem 2.** Assume that Φ is twice differentiable and that Φ″(u) > 0 for all u ∈ R. Then, the projected coordinate descent algorithm applied to F converges to the solution α* of the optimization min_{α≥0} F(α) for F = F_sum, F = F_maxsum, or F = F_compsum. If additionally Φ is strongly convex over the path of the iterates α_t, then there exist τ > 0 and γ > 0 such that for all t > τ,

$$F(\alpha_{t+1}) - F(\alpha^*) \le \Big(1 - \frac{1}{\gamma}\Big)\big(F(\alpha_t) - F(\alpha^*)\big). \qquad (10)$$

The proof is given in Appendix I and is based on the results of Luo and Tseng [1992]. The theorem can in fact be extended to the case where, instead of the best direction, the derivative for the direction selected at each round is within a constant threshold of the best [Luo and Tseng, 1992]. The conditions of Theorem 2 hold for many cases in practice, in particular in the case of the exponential loss (Φ = exp) or the logistic loss (Φ(−x) = log₂(1 + e^{−x})). In particular, linear convergence is guaranteed in those cases since both the exponential and logistic losses are strongly convex over a compact set containing the converging sequence of α_t's.

We will refer to the algorithm defined by projected coordinate descent applied to F_sum by MDeepBoostSum, to F_maxsum by MDeepBoostMaxSum, to F_compsum by MDeepBoostCompSum, and to F_max by MDeepBoostMax. In the following, we briefly describe MDeepBoostSum, including its pseudocode. We give a detailed description of all of these algorithms in the supplementary material: MDeepBoostSum (Appendix E), MDeepBoostMaxSum (Appendix F), MDeepBoostCompSum (Appendix G), MDeepBoostMax (Appendix H).

Define f_{t−1} = Σ_{j=1}^N α_{t−1,j} h_j. Then, F_sum(α_{t−1}) can be rewritten as follows:

$$F_{\mathrm{sum}}(\alpha_{t-1}) = \frac{1}{m}\sum_{i=1}^m \sum_{y \neq y_i}\Phi\big(1 - f_{t-1}(x_i, y_i, y)\big) + \sum_{j=1}^N \Lambda_j\alpha_{t-1,j}.$$

For any t ∈ [1, T], we denote by D_t the distribution over [1, m] × [1, c] defined for all i ∈ [1, m] and y ∈ Y − {y_i} by

$$D_t(i, y) = \frac{\Phi'\big(1 - f_{t-1}(x_i, y_i, y)\big)}{S_t}, \qquad (11)$$

where S_t is a normalization factor, S_t = Σ_{i=1}^m Σ_{y≠y_i} Φ′(1 − f_{t−1}(x_i, y_i, y)). For any j ∈ [1, N] and s ∈ [1, T], we also define the weighted error ε_{s,j} as follows:

$$\epsilon_{s,j} = \frac{1}{2}\Big[1 - \mathop{\mathbb{E}}_{(i,y)\sim D_s}\big[h_j(x_i, y_i, y)\big]\Big]. \qquad (12)$$

Figure 1 gives the pseudocode of the MDeepBoostSum algorithm. The details of the derivation of the expressions are given in Appendix E. In the special cases of the exponential loss (Φ(−u) = exp(−u)) or the logistic loss (Φ(−u) = log₂(1 + exp(−u))), a closed-form expression is given for the step size (lines 6-8), which is the same in both cases (see Sections E.2.1 and E.2.2). In the generic case, the step size can be found using a line search or other numerical methods.

```
MDEEPBOOSTSUM(S = ((x1, y1), ..., (xm, ym)))
 1  for i ← 1 to m do
 2      for y ∈ Y − {yi} do
 3          D1(i, y) ← 1/(m(c − 1))
 4  for t ← 1 to T do
 5      k ← argmin_{j ∈ [1,N]} εt,j + (Λj m)/(2 St)
 6      if (1 − εt,k) e^{αt−1,k} − εt,k e^{−αt−1,k} ≤ (Λk m)/St then
 7          ηt ← −αt−1,k
 8      else ηt ← log( −(Λk m)/(2 εt,k St) + sqrt( ((Λk m)/(2 εt,k St))^2 + (1 − εt,k)/εt,k ) )
 9      αt ← αt−1 + ηt ek
10      St+1 ← Σ_{i=1}^m Σ_{y≠yi} Φ′(1 − Σ_{j=1}^N αt,j hj(xi, yi, y))
11      for i ← 1 to m do
12          for y ∈ Y − {yi} do
13              Dt+1(i, y) ← Φ′(1 − Σ_{j=1}^N αt,j hj(xi, yi, y)) / St+1
14  f ← Σ_{j=1}^N αT,j hj
15  return f
```

Figure 1: Pseudocode of the MDeepBoostSum algorithm for both the exponential loss and the logistic loss. The expression of the weighted error ε_{t,j} is given in (12).

The algorithms presented above have several connections with other boosting algorithms, particularly in the absence of regularization. We discuss these connections in detail in Appendix K.
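To make Figure 1 concrete, here is a sketch of a single round (lines 5-13) for the exponential loss, using the same (N, m, c) tensor layout assumed earlier; it mirrors our reconstruction of the pseudocode and should be read as illustrative rather than as the authors' implementation.

```python
import numpy as np

def mdeepboost_sum_round(alpha, H, labels, Lam):
    """One MDeepBoostSum round for Phi = exp, so Phi'(1 - u) = e^{1-u}."""
    N, m, c = H.shape
    f = np.tensordot(alpha, H, axes=1)                   # f(x_i, y)
    d = f[np.arange(m), labels][:, None] - f             # ensemble differences
    w = np.exp(1.0 - d)                                  # Phi'(1 - d)
    w[np.arange(m), labels] = 0.0                        # only y != y_i
    S = w.sum()
    D = w / S                                            # distribution D_t(i, y)

    # Weighted errors eps_j = (1/2)(1 - E_{(i,y)~D}[h_j(x_i, y_i, y)]), Eq. (12).
    Hd = H[:, np.arange(m), labels][:, :, None] - H      # h_j(x_i, y_i, y)
    eps = 0.5 * (1.0 - np.tensordot(Hd, D, axes=([1, 2], [0, 1])))

    k = int(np.argmin(eps + Lam * m / (2.0 * S)))        # line 5
    e, L = eps[k], Lam[k] * m / S                        # assumes eps[k] > 0
    if (1.0 - e) * np.exp(alpha[k]) - e * np.exp(-alpha[k]) <= L:
        eta = -alpha[k]                                  # prune coordinate k
    else:
        r = L / (2.0 * e)
        eta = np.log(-r + np.sqrt(r * r + (1.0 - e) / e))
    out = alpha.copy()
    out[k] += eta
    return out
```

With Lam = 0 the step reduces to η = (1/2) log((1 − ε)/ε), the familiar AdaBoost-style step for the weighted error ε.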
## 4 Experiments

The algorithms presented in the previous sections can be used with a variety of different base classifier sets. For our experiments, we used multi-class binary decision trees. A multi-class binary decision tree in dimension d can be defined by a pair (t, h), where t is a binary tree with a variable-threshold question at each internal node, e.g., X_j ≤ θ, j ∈ [1, d], and h = (h_l)_{l∈Leaves(t)} a vector of distributions over the leaves Leaves(t) of t. At any leaf l ∈ Leaves(t), h_l(y) ∈ [0, 1] for all y ∈ Y and Σ_{y∈Y} h_l(y) = 1. For convenience, we will denote by t(x) the leaf l ∈ Leaves(t) associated to x by t. Thus, the score associated by (t, h) to a pair (x, y) ∈ X × Y is h_l(y) where l = t(x).

Let T_n denote the family of all multi-class decision trees with n internal nodes in dimension d. In Appendix J, we derive the following upper bound on the Rademacher complexity of T_n:

$$\mathfrak{R}_m(T_n) \le \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}. \qquad (13)$$

All of the experiments in this section use T_n as the family of base hypothesis sets (parametrized by n). Since T_n is a very large hypothesis set when n is large, for the sake of computational efficiency we make a few approximations. First, although our MDeepBoost algorithms were derived in terms of Rademacher complexity, we use the upper bound in Eq. (13) in place of the Rademacher complexity (thus, in Algorithm 1 we let Λ_n = λB_n + β, where B_n is the bound given in Eq. (13); see the short sketch below). Secondly, instead of exhaustively searching for the best decision tree in T_n for each possible size n, we use the following greedy procedure: given the best decision tree of size n (starting with n = 1), we find the best decision tree of size n + 1 that can be obtained by splitting one leaf, and continue this procedure until some maximum depth K. Decision trees are commonly learned in this manner, and so in this context our Rademacher-complexity-based bounds can be viewed as a novel stopping criterion for decision tree learning. Let H_K be the set of trees found by the greedy algorithm just described. In each iteration t of MDeepBoost, we select the best tree in the set H_K ∪ {h_1, ..., h_{t−1}}, where h_1, ..., h_{t−1} are the trees selected in previous iterations.

While we described many objective functions that can be used as the basis of a multi-class deep boosting algorithm, the experiments in this section focus on algorithms derived from F_sum. We also refer the reader to Table 3 in Appendix A for results of experiments with F_compsum objective functions. The F_sum and F_compsum objectives combine several advantages that suggest they will perform well empirically. F_sum is consistent and both F_sum and F_compsum are (by Theorem 4) H-consistent. Also, unlike F_max, both of these objectives are differentiable, and therefore the convergence guarantee in Theorem 2 applies. Our preliminary findings also indicate that algorithms based on the F_sum and F_compsum objectives perform better than those derived from F_max and F_maxsum.

All of our objective functions require a choice for Φ, the loss function. Since Cortes et al. [2014] reported comparable results for the exponential and logistic losses for the binary version of DeepBoost, we let Φ be the exponential loss in all of our experiments with MDeepBoostSum. For MDeepBoostCompSum we select Φ₁(u) = log₂(1 + u) and Φ₂(−u) = exp(−u).

In our experiments, we used 8 UCI datasets: abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog and yeast; see more details on these datasets in Table 4, Appendix L. In Appendix K, we explain that when λ = β = 0, MDeepBoostSum is equivalent to AdaBoost.MR. Also, if we set λ = 0 and β > 0, then the resulting algorithm is an L1-norm regularized variant of AdaBoost.MR.
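For reference, the complexity penalty derived from Eq. (13) is straightforward to compute; a minimal sketch follows (the function names are ours):

```python
import numpy as np

def tree_complexity_bound(n: int, d: int, m: int) -> float:
    """Upper bound (13) on R_m(T_n): trees with n internal nodes,
    input dimension d, sample size m."""
    return float(np.sqrt((4 * n + 2) * np.log2(d + 2) * np.log(m + 1) / m))

def tree_penalty(n, d, m, lam, beta):
    """Per-size penalty Lambda_n = lambda * B_n + beta used by the algorithm."""
    return lam * tree_complexity_bound(n, d, m) + beta
```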
We compared MDeepBoostSum to these two algorithms, with the results also reported in Table 1 and Table 2 in Appendix A. Likewise, we compared MDeepBoostCompSum with multinomial (additive) logistic regression, LogReg, and its L1-regularized version LogReg-L1, which, as discussed in Appendix K, are equivalent to MDeepBoostCompSum when λ = β = 0 and λ = 0, β > 0, respectively. Finally, we remark that it can be argued that the parameter optimization procedure (described below) significantly extends AdaBoost.MR, since it effectively implements structural risk minimization: for each tree depth, the empirical error is minimized and we choose the depth to achieve the best generalization error.

All of these algorithms use the maximum tree depth K as a parameter. L1-norm regularized versions admit two parameters: K and β ≥ 0. Deep boosting algorithms have a third parameter, λ ≥ 0. To set these parameters, we used the following parameter optimization procedure: we randomly partitioned each dataset into 4 folds and, for each tuple (λ, β, K) in the set of possible parameters (described below), we ran MDeepBoostSum with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each run i ∈ {0, 1, 2, 3}, fold i was used for testing, fold i + 1 (mod 4) was used for validation, and the remaining folds were used for training (a short code sketch of this rotation is given at the end of this section). For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average test error and the standard deviation of the test error over all 4 runs is reported in Table 1.

Table 1: Empirical results for MDeepBoostSum, Φ = exp. AB stands for AdaBoost; standard deviations are in parentheses.

| Dataset | AB.MR | AB.MR-L1 | MDeepBoost |
| --- | --- | --- | --- |
| abalone | 0.739 (0.0016) | 0.737 (0.0065) | 0.735 (0.0045) |
| handwritten | 0.024 (0.0011) | 0.025 (0.0018) | 0.021 (0.0015) |
| letters | 0.065 (0.0018) | 0.059 (0.0059) | 0.058 (0.0039) |
| pageblocks | 0.035 (0.0045) | 0.035 (0.0031) | 0.033 (0.0014) |
| pendigits | 0.014 (0.0025) | 0.014 (0.0013) | 0.012 (0.0011) |
| satimage | 0.112 (0.0123) | 0.117 (0.0096) | 0.117 (0.0087) |
| statlog | 0.029 (0.0026) | 0.026 (0.0071) | 0.024 (0.0008) |
| yeast | 0.415 (0.0353) | 0.410 (0.0324) | 0.407 (0.0282) |

Note that an alternative procedure to compare algorithms, adopted in a number of previous studies of boosting [Li, 2009a,b, Sun et al., 2012], is to simply record the average test error of the best parameter tuples over all runs. While it is of course possible to overestimate the performance of a learning algorithm by optimizing hyperparameters on the test set, this concern is less valid when the size of the test set is large relative to the complexity of the hyperparameter space. We report results for this alternative procedure in Table 2 and Table 3, Appendix A.

For each dataset, the set of possible values for λ and β was initialized to {10⁻⁵, 10⁻⁶, ..., 10⁻¹⁰}, and to {1, 2, 3, 4, 5} for the maximum tree depth K. However, if we found an optimal parameter value to be at the end point of these ranges, we extended the interval in that direction (by an order of magnitude for λ and β, and by 1 for the maximum tree depth K) and re-ran the experiments.
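To make the fold rotation above concrete, here is the assignment it produces for the four runs (a sketch; the fold-handling details are ours):

```python
for i in range(4):
    test_fold = i
    valid_fold = (i + 1) % 4
    train_folds = [j for j in range(4) if j not in (test_fold, valid_fold)]
    print(f"run {i}: train={train_folds}, valid={valid_fold}, test={test_fold}")
```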
We have also experimented with 200 and 500 iterations, but we observed that the errors do not change significantly and the ranking of the algorithms remains the same. The results of our experiments show that, for each dataset, deep boosting algorithms outperform the other algorithms evaluated in our experiments. Let us point out that, even though not all of our results are statistically significant, MDeepBoostSum outperforms AdaBoost.MR and AdaBoost.MR-L1 (and, hence, effectively structural risk minimization) on each dataset. More importantly, for each dataset MDeepBoostSum outperforms the other algorithms on most of the individual runs. Moreover, results for some datasets presented here (namely pendigits) appear to be state-of-the-art. We also refer our reader to the experimental results summarized in Table 2 and Table 3 in Appendix A. These results provide further evidence in favor of DeepBoost algorithms. The consistent performance improvement by MDeepBoostSum over AdaBoost.MR or its L1-norm regularized variant shows the benefit of the new complexity-based regularization we introduced.

## 5 Conclusion

We presented new data-dependent learning guarantees for convex ensembles in the multi-class setting where the base classifier set is composed of increasingly complex sub-families, including very deep or complex ones. These learning bounds generalize to the multi-class setting the guarantees presented by Cortes et al. [2014] in the binary case. We also introduced and discussed several new multi-class ensemble algorithms benefiting from these guarantees and proved positive results for the H-consistency and convergence of several of them. Finally, we reported the results of several experiments with DeepBoost algorithms, and compared their performance with that of AdaBoost.MR and additive multinomial Logistic Regression and their L1-regularized variants.

## Acknowledgments

We thank Andrés Muñoz Medina and Scott Yang for discussions and help with the experiments. This work was partly funded by the NSF award IIS-1117591 and supported by an NSERC PGS grant.

## References

P. Bühlmann and B. Yu. Boosting with the L2 loss. Journal of the American Statistical Association, 98(462):324-339, 2003.

M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48:253-285, 2002.

C. Cortes, M. Mohri, and U. Syed. Deep boosting. In ICML, pages 1179-1187, 2014.

T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

J. C. Duchi and Y. Singer. Boosting with structural sparsity. In ICML, page 38, 2009.

N. Duffy and D. P. Helmbold. Potential boosters? In NIPS, pages 258-264, 1999.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2000.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692-699, 1998.

J. Kivinen and M. K. Warmuth. Boosting as entropy projection. In COLT, pages 134-144, 1999.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

P. Li. ABC-boost: adaptive base class boost for multi-class classification. In ICML, page 79, 2009a.

P. Li. ABC-logitboost for multi-class classification. Technical report, Rutgers University, 2009b.

P. M. Long and R. A. Servedio. Consistency versus realizable H-consistency for multiclass classification. In ICML (3), pages 801-809, 2013.

Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.

L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. JMLR, 14(1):437-497, 2013.

G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In COLT, pages 334-350, 2002.

G. Rätsch and M. K. Warmuth. Efficient margin maximizing with boosting. JMLR, 6:2131-2152, 2005.

G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. In NIPS, pages 487-494, 2001a.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001b.

R. E. Schapire. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science, pages 13-25. Springer, 1999.

R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.

R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322-330, 1997.

P. Sun, M. D. Reid, and J. Zhou. AOSO-LogitBoost: Adaptive one-vs-one LogitBoost for multi-class problem. In ICML, 2012.

A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.

M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, pages 1001-1008, 2006.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. JMLR, 5:1225-1251, 2004a.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004b.

J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2, 2009.

H. Zou, J. Zhu, and T. Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Annals of Applied Statistics, 2(4):1290-1306, 2008.