# Generalized Boosting

Arun Sai Suggala, Bingbin Liu, Pradeep Ravikumar
Carnegie Mellon University, Pittsburgh, PA 15213
{asuggala,bingbinl,pradeepr}@cs.cmu.edu

## Abstract

Boosting is a widely used learning technique in machine learning for solving classification problems. In boosting, one predicts the label of an example using an ensemble of weak classifiers. While boosting has shown tremendous success on many classification problems involving tabular data, it performs poorly on complex classification tasks involving low-level features, such as image classification. This drawback stems from the fact that boosting builds an additive model of weak classifiers, each of which has very little predictive power. Often, the resulting additive models are not powerful enough to approximate the complex decision boundaries of real-world classification problems. In this work, we present a general framework for boosting where, similar to traditional boosting, we aim to boost the performance of a weak learner and transform it into a strong learner. However, unlike traditional boosting, our framework allows for more complex forms of aggregation of weak learners. In this work, we specifically focus on one form of aggregation: function composition. We show that many popular greedy algorithms for learning deep neural networks (DNNs) can be derived from our framework using function compositions for aggregation. Moreover, we identify the drawbacks of these greedy algorithms and propose new algorithms that fix these issues. Using thorough empirical evaluation, we show that our learning algorithms have superior performance over traditional additive boosting algorithms, as well as existing greedy learning techniques for DNNs. An important feature of our algorithms is that they come with strong theoretical guarantees.

## 1 Introduction

Boosting is a widely used learning technique in machine learning for solving classification problems. Boosting aims to improve the performance of a weak learner by combining multiple weak classifiers to produce a strong classifier with good predictive performance. Since the seminal works of Freund [13] and Schapire [32], a number of practical boosting algorithms have been proposed, such as AdaBoost [16], gradient boosting [26], and XGBoost [9]. Over the years, boosting-based methods, XGBoost in particular, have shown tremendous success on many real-world classification problems, as well as in competitive settings such as Kaggle competitions. However, this success is mostly limited to classification tasks involving structured or tabular data with hand-engineered features. On classification problems involving low-level features and complex decision boundaries, boosting tends to perform poorly [3, 30] (also see Section 5). One example where this is evident is image classification, where the decision boundaries are often complex and the features are low-level pixel intensities. This drawback stems from the fact that boosting builds an additive model of weak classifiers, each of which has very little predictive power. Since such additive models with any reasonable number of weak classifiers are usually not powerful enough to approximate complex decision boundaries, the models output by boosting tend to have poor performance. In this work, we aim to overcome this drawback of traditional boosting by considering a generalization of boosting which allows for more complex forms of aggregation than linear combinations of weak classifiers.
To achieve this goal, we work in the feature representation space and boost the performance of weak feature transformers. Working in the representation space allows for more flexible combinations of weak feature transformers. This is unlike traditional boosting, which works in the label space and builds an additive model on the predictions of the weak classifiers.

The starting point for our approach is the greedy view of boosting, originally studied by Friedman et al. [18] and Mason et al. [26]. Letting $\hat{R}_S(f)$ be the risk of a classifier $f$ on training samples $S$, boosting techniques aim to approximate the minimizer of $\hat{R}_S$ in terms of linear combinations of elements from a set of weak classifiers $\mathcal{F}$. Many popular boosting algorithms, including AdaBoost and XGBoost, rely on greedy techniques to find such an approximation. In our generalized framework for boosting, we take this greedy view, but differ in how we aggregate the weak learners. We approximate the minimizer of $\hat{R}_S$ using models of the form $f_T = W\phi_T$, where $\phi_T = \sum_{t=0}^{T} g_t$, the $\{g_t\}_{t=0}^{T}$ are feature transformations learned in each iteration of the greedy algorithm, and $W$ is the linear classifier on top of the feature transformation. Unlike additive boosting, where each $g_t$ comes from a fixed weak feature transformer class $\mathcal{G}$, in our framework each $g_t$ comes from a class $\mathcal{G}_t$ which evolves over time $t$ and is allowed to depend on the past iterates $\{\phi_i\}_{i=0}^{t-1}$. Some potential choices for $\mathcal{G}_t$ that could be of interest are $\{g \circ \phi_{t-1} : g \in \mathcal{G}\}$ and $\{g([\phi_0, \ldots, \phi_{t-1}]) : g \in \mathcal{G}\}$, where $g \circ \phi(x) = g(\phi(x))$ denotes the function composition of $g$ and $\phi$, and $\mathcal{G}$ is a weak feature transformer class. Note that the former choice of $\mathcal{G}_t$ is connected to layer-by-layer training of models with the ResNet architecture [21]. As one particular instantiation of our framework, we consider weak feature transformers that are neural networks and use function compositions to combine them; that is, we use $\mathcal{G}_t$'s constructed using function compositions. We show that for certain choices of $\mathcal{G}_t$, our framework recovers the layer-by-layer training techniques developed in deep learning [6, 22]. Greedy layer-by-layer training techniques have seen a revival in recent years [5, 8, 22, 25, 29]. One reason for this revival is that greedy techniques consume less memory than end-to-end training of deep networks, as they do not perform end-to-end back-propagation. Consequently, they can accommodate much larger models in limited memory.

As a primary contribution of the paper, we identify several drawbacks of existing layer-by-layer training techniques, and show that the choice of $\mathcal{G}_t$ used by these algorithms can lead to a drop in performance. We propose alternative choices for $\mathcal{G}_t$ which fix these issues, and we empirically demonstrate that the resulting algorithms have superior performance over existing layer-by-layer training techniques, and in some cases achieve performance close to that of end-to-end trained DNNs. Moreover, we show that the proposed algorithms perform much better than traditional additive boosting algorithms on a variety of classification tasks. As a second contribution, we provide excess risk bounds for models learned using our generalized boosting framework. Our results depend on a certain weak learning condition on the feature transformer classes $\{\mathcal{G}_t\}_{t=1}^{T}$, which is a natural generalization of the weak learning condition typically imposed in traditional boosting.
The resulting risk bounds are modular and depend on the generalization bounds of $\{\mathcal{G}_t\}_{t=1}^{T}$. An advantage of such modular bounds is that one can rely on the best-known generalization bounds for the weak transformation classes $\{\mathcal{G}_t\}_{t=1}^{T}$ and obtain tight risk bounds for boosting. As an immediate consequence of this result, we obtain excess risk bounds for existing greedy layer-by-layer training techniques.

**Related Work.** Several works have proposed generalizations of traditional boosting [10, 11, 20, 22]. Cortes et al. [10] propose a boosting algorithm where the hypothesis set of weak classifiers is chosen adaptively. However, the resulting models are still additive models of weak classifiers and usually perform poorly on hard classification problems. Several recent works have attempted to learn neural networks greedily based on boosting theory. Cortes et al. [11] propose a boosting-style algorithm to learn both the structure and weights of neural networks in an adaptive way. However, the algorithms developed are restricted to feedforward neural networks and are mostly theoretical in nature; the experimental evidence in the paper is a proof-of-concept and only considers small-scale binary classification tasks. Huang et al. [22] and Nitanda and Suzuki [29] use ideas from classical boosting to learn neural networks in a layer-by-layer fashion. As we show later, these algorithms are specific instances of our generalized framework, and have certain drawbacks arising from the choice of $\mathcal{G}_t$ they use.

## 2 Preliminaries

In this section, we set up the notation and review the necessary background on additive boosting. A consolidated list of notation can be found in Appendix A.

**Notation.** Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ denote a feature-label pair following a probability distribution $P$. Let $P^X, P^Y$ denote the marginal distributions of $X$ and $Y$. In this work, we consider the multi-class classification problem where $\mathcal{Y} = \{0, \ldots, K-1\}$, and assume $\mathcal{X} \subseteq \mathbb{R}^d$. Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be $n$ i.i.d. samples drawn from $P$. Let $P_n$ be the empirical distribution of $S$, and let $P_n^X, P_n^Y$ be the marginal distributions of $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$. In classification, our goal is to find a predictor that can accurately predict the label of any feature from just the samples $S$. Let $f: \mathcal{X} \to \mathbb{R}^K$ denote a score-based classifier which assigns $X$ to class $\arg\max_i f_i(X)$. The expected classification risk of $f$ is defined as $\mathbb{E}_{X,Y}[\ell_{0\text{-}1}(f(X), Y)]$, where $\ell_{0\text{-}1}(f(X), Y) = 0$ if $\arg\max_i f_i(X) = Y$, and $1$ otherwise. Since optimizing the 0/1 risk is computationally intractable, we consider convex surrogates of $\ell_{0\text{-}1}(f(X), Y)$, which we denote by $\ell(f(X), Y)$; typical choices for $\ell$ include the logistic loss and the exponential loss. The population risk of $f$ is then defined as $R(f) = \mathbb{E}_{X,Y}[\ell(f(X), Y)]$. Since directly optimizing the population risk is impossible, we approximate it with the empirical risk $\hat{R}_S(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)$ and try to find its minimizer. We consider classifiers of the form $f(X) = W\phi(X)$, where $\phi: \mathcal{X} \to \mathbb{R}^D$ is the feature transformer and $W \in \mathbb{R}^{K\times D}$ is the linear classifier on top. A popular choice for $\phi$ is a neural network. We denote the population and empirical risks of such an $f$ by $R(W, \phi)$ and $\hat{R}_S(W, \phi)$. We usually work in the space of feature transforms. Let $L^2(P)$ denote the space of square-integrable functions w.r.t. $P$, and define the inner product between $\phi_1, \phi_2 \in L^2(P)$ as $\langle \phi_1, \phi_2 \rangle_P = \mathbb{E}_{X \sim P}\left[\langle \phi_1(X), \phi_2(X) \rangle\right]$.
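To make this notation concrete, the following small sketch (ours, not from the paper) computes the empirical risk $\hat{R}_S(W, \phi)$ of a classifier $f(X) = W\phi(X)$ and the empirical inner product $\langle \phi_1, \phi_2 \rangle_{P_n^X}$. It assumes the multiclass logistic loss and NumPy arrays for the data; the toy dimensions and the $\tanh$ feature map are illustrative assumptions only.

```python
import numpy as np

def logistic_loss(scores, y):
    # Multiclass logistic (cross-entropy) loss for one example:
    # ell(f(x), y) = -log softmax(f(x))[y]
    scores = scores - scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[y]

def empirical_risk(W, phi, X, y):
    # \hat{R}_S(W, phi) = (1/n) sum_i ell(W phi(x_i), y_i)
    return np.mean([logistic_loss(W @ phi(x), yi) for x, yi in zip(X, y)])

def empirical_inner_product(phi1, phi2, X):
    # <phi1, phi2>_{P_n^X} = (1/n) sum_i <phi1(x_i), phi2(x_i)>
    return np.mean([phi1(x) @ phi2(x) for x in X])

# Toy usage: d = 4 input dims, D = 3 features, K = 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4)); y = rng.integers(0, 2, size=10)
W = rng.normal(size=(2, 3))
phi = lambda x: np.tanh(x[:3])                        # a toy feature transformer
print(empirical_risk(W, phi, X, y))
print(empirical_inner_product(phi, phi, X))
```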
We denote by $\nabla_\phi R(W, \phi)$ the functional gradient of $R(W, \phi)$ w.r.t. $\phi$ in the $L^2(P^X)$ space, which is defined as
$$\nabla_\phi R(W, \phi)(x) = \mathbb{E}_{Y|x}\left[W^T \nabla\ell(W\phi(x), Y)\right],$$
where $\nabla\ell(W\phi(x), y)$ denotes the gradient of $\ell$ w.r.t. its first argument, evaluated at $W\phi(x)$. Similarly, we let $\nabla_\phi \hat{R}_S(W, \phi)$ denote the functional gradient of $\hat{R}_S(W, \phi)$ in the $L^2(P_n^X)$ space:
$$\nabla_\phi \hat{R}_S(W, \phi)(x) = \begin{cases} W^T \nabla\ell(W\phi(x_i), y_i), & \text{if } x = x_i, \\ 0, & \text{otherwise.} \end{cases}$$

**Additive Boosting.** In this work, we refer to traditional boosting as additive boosting, as it constructs additive models of weak classifiers. Let $\mathcal{F}$ be a hypothesis class of weak classifiers, a typical example being decision trees of bounded depth. Additive boosting aims to find an element in the linear span of $\mathcal{F}$ which minimizes the empirical risk $\hat{R}_S(f)$. As previously mentioned, there exists a duality between boosting and greedy algorithms [18, 19, 26]. Many popular boosting algorithms use a greedy forward stagewise approach to find a minimizer of $\hat{R}_S(f)$, solving the following in each iteration:
$$\eta_t, f_t = \operatorname*{argmin}_{\eta \in \mathbb{R},\, f \in \mathcal{F}} \hat{R}_S\!\left(\sum_{i=1}^{t-1} \eta_i f_i + \eta f\right),$$
where $\eta$ is the learning rate. Various algorithms differ in how they solve this optimization problem. In gradient boosting, one uses a linear approximation of $\hat{R}_S$ around $\sum_{i=1}^{t-1}\eta_i f_i$ [26]. In this work, we take this greedy view of boosting to design the generalized boosting framework.

**Additive Representation Boosting.** In this work, we perform boosting in the representation space, in contrast with traditional boosting, which works in the output space. Let $\mathcal{G}$ be a hypothesis class of weak feature transformers; examples include the set of one-layer neural networks of bounded width and the set of vector-valued polynomials of bounded degree. More generally, $\mathcal{G}$ can be any set of non-linear transformations. In additive representation boosting, we aim to find a strong feature transform $\phi$ in the linear span of $\mathcal{G}$, and a linear predictor $W \in \mathcal{W} \subseteq \mathbb{R}^{K\times D}$, that minimize $\hat{R}_S(W, \phi)$. To this end, we consider greedy algorithms that solve the following problem in each iteration:
$$W_t, g_t = \operatorname*{argmin}_{W \in \mathcal{W},\, g \in \mathcal{G}} \hat{R}_S(W, \phi_{t-1} + \eta_t g), \tag{1}$$
where $\phi_t = \phi_0 + \sum_{i=1}^{t} \eta_i g_i$, with $\phi_0$ being the initial feature transformation, and $\{\eta_i\}_{i \ge 1}$ is a predefined learning rate schedule.

## 3 Generalized Boosting

The starting point for our generalized boosting framework is the additive representation boosting described in Section 2. Typically, linear combinations of weak feature transformations are not powerful enough to model complex decision boundaries. Consequently, the minimizer of $\hat{R}_S(W, \phi)$ over the linear span of $\mathcal{G}$ tends to have a high risk. A simple workaround for this issue would be to perform additive boosting with a complex hypothesis class $\mathcal{G}$. For example, if the weak feature transformers are one-layer neural networks, then one could increase the complexity of $\mathcal{G}$ by using deeper networks. However, such an alternative has several drawbacks, both from an optimization and a generalization perspective, and defeats the purpose of boosting, which aims to convert weak learners into strong learners. From an optimization perspective, moving to a complex $\mathcal{G}$ makes each greedy step harder to optimize. For example, compared to deep neural networks, shallow networks are easier to optimize, require fewer resources, and are easier to analyze or interpret [5]. From a generalization perspective, since the generalization bounds of boosting depend on the complexity of $\mathcal{G}$, larger hypothesis classes can lead to overfitting and poor performance on unseen data.
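Before describing our alternatives, it may help to see the functional gradient from Section 2 in concrete form. The sketch below is ours, not the authors' code; it assumes the multiclass logistic loss, whose gradient w.r.t. the score vector is $\mathrm{softmax}(u) - e_y$. It computes the empirical functional gradient at a single training point, which is exactly the quantity the gradient greedy updates below try to align against.

```python
import numpy as np

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def functional_gradient_at_sample(W, phi, x_i, y_i):
    # nabla_phi \hat{R}_S(W, phi)(x_i) = W^T nabla ell(W phi(x_i), y_i),
    # with nabla ell(u, y) = softmax(u) - e_y for the logistic loss.
    u = W @ phi(x_i)                 # score vector, shape (K,)
    grad_ell = softmax(u)
    grad_ell[y_i] -= 1.0             # softmax(u) - e_y
    return W.T @ grad_ell            # shape (D,): a direction in feature space

# The gradient greedy step of boosting then looks for a g in G_t that best
# aligns with the *negative* of these vectors under the empirical distribution.
```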
In this work, we are interested in other approaches for increasing the complexity of the models produced by boosting, while ensuring that the boosting/greedy steps remain easy to implement. One way to achieve this is to consider more complex combinations of weak feature transformers than the linear combinations used in additive representation boosting. Formally, let $\mathcal{G}_t$ denote the hypothesis class of feature transformations used in the $t$-th iteration of boosting. In additive boosting, $\mathcal{G}_t = \mathcal{G}$ for all $t$. In our generalized boosting framework, we increase the complexity of $\mathcal{G}_t$ by letting it depend on the past iterates $\{\phi_i\}_{i=0}^{t-1}$. Here are some potential choices for $\mathcal{G}_t$, other than the ones stated in the introduction: $\{g \circ (\sum_{i=0}^{t-1}\alpha_i\phi_i) : g \in \mathcal{G},\ \alpha_i \in \mathbb{R}\}$ and $\{g \circ \phi_{t-1} \circ \phi_{t-2} \circ \cdots \circ \phi_0 : g \in \mathcal{G}\}$. Depending on the problem domain, one could consider several other ways of constructing $\mathcal{G}_t$ from the past iterates. Note that even with these complex choices of $\mathcal{G}_t$, the greedy steps are easy to implement and only need a weak learner which can identify an element of $\mathcal{G}$ that best fits the data. As a result, this remains in the spirit of boosting, while at the same time ensuring that the learned models are complex enough for real-world problems.

We now present our algorithm for generalized boosting (see Algorithm 1). Similar to additive representation boosting, our algorithm proceeds in a greedy fashion. In the $t$-th iteration of the algorithm, we aim to solve the following optimization problem:
$$W_t, g_t = \operatorname*{argmin}_{W \in \mathcal{W},\, g \in \mathcal{G}_t} \hat{R}_S(W, \phi_{t-1} + \eta_t g). \tag{2}$$
We provide two approaches for solving this problem. One is the exact greedy approach, which directly solves the optimization problem (Algorithm 2). For problems where direct optimization of Equation (2) is difficult (such scenarios can arise, for example, if the feature transformations are non-differentiable functions), we provide an approximate technique which performs functional gradient descent on the objective. In this approach, which we call the gradient greedy approach, we approximate the objective with the linear approximation of $\hat{R}_S$ around $\phi_{t-1}$ (Algorithm 3):
$$\hat{R}_S(W, \phi_{t-1} + \eta_t g) \approx \hat{R}_S(W, \phi_{t-1}) + \eta_t \left\langle \nabla_\phi \hat{R}_S(W, \phi_{t-1}),\, g \right\rangle_{P_n^X}.$$
To optimize the linear approximation, we first fix $W$ to $W_{t-1}$ and find a minimizing $g_t \in \mathcal{G}_t$. Intuitively, this step can be seen as finding a $g$ which best aligns with the negative functional gradient of the empirical risk at the current iterate. For an appropriate choice of learning rate $\eta$, moving along $g_t$ results in a reduction of $\hat{R}_S$. Next, we fix $g_t$ and find a linear predictor $W$ which minimizes the empirical risk $\hat{R}_S(W, \phi_t)$. This alternating optimization of $g$ and $W$ makes the algorithm easy to implement in practice. Moreover, it is more stable than joint optimization of $g$ and $W$. We note that such gradient greedy approaches have been developed for traditional boosting [26].

### 3.1 Compositional Boosting

As one particular instantiation of our framework, we consider $\mathcal{G}_t$'s constructed by composing elements of a weak feature transformer class $\mathcal{G}$ with the past iterates $\{\phi_i\}_{i=0}^{t-1}$, and we study the resulting boosting algorithms. We refer to such algorithms as compositional boosting algorithms, since the strong feature transformer is constructed from weak feature transformers via function composition. When $\mathcal{G}_t = \{g \circ \phi_{t-1} : g \in \mathcal{G}\}$, the models in our framework have the ResNet architecture and can be defined recursively as $\phi_t = \phi_{t-1} + \eta_t\, g_t \circ \phi_{t-1}$. Moreover, Algorithm 1 with this choice of $\mathcal{G}_t$ and Algorithm 2 as the update routine gives us the greedy layer-wise supervised training technique proposed by Bengio et al. [6] and recently revisited by Belilovsky et al. [5].
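To make this instantiation concrete, here is a PyTorch sketch (ours, not the authors' implementation; the block architecture, optimizer settings, and the identity choice of $\phi_0$, which assumes the input dimension equals the feature dimension, are illustrative assumptions). It runs the exact greedy update of Algorithm 2 (stated below) with $\mathcal{G}_t = \{g \circ \phi_{t-1} : g \in \mathcal{G}\}$, so that $\phi_t = \phi_{t-1} + \eta_t\, g_t \circ \phi_{t-1}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def exact_greedy_round(phi_prev, X, y, feat_dim, num_classes, eta=1.0, steps=200):
    # One exact greedy step with G_t = {g o phi_{t-1}}: jointly fit a new residual
    # block g and a linear head W on top of phi_{t-1} + eta * g(phi_{t-1}).
    g = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                      nn.Linear(feat_dim, feat_dim))
    W = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(list(g.parameters()) + list(W.parameters()), lr=0.1)
    with torch.no_grad():
        feats_prev = phi_prev(X)                    # features from earlier rounds, kept frozen
    for _ in range(steps):
        opt.zero_grad()
        feats = feats_prev + eta * g(feats_prev)    # phi_t = phi_{t-1} + eta * g o phi_{t-1}
        loss = F.cross_entropy(W(feats), y)         # empirical risk \hat{R}_S(W, phi_t)
        loss.backward()
        opt.step()
    return g, W

def compositional_boosting(X, y, num_rounds, feat_dim, num_classes):
    # Algorithm 1 with exact greedy updates and compositional G_t.
    phi = lambda Z: Z                               # phi_0: identity (assumes d == feat_dim)
    W = None
    for _ in range(num_rounds):
        g, W = exact_greedy_round(phi, X, y, feat_dim, num_classes)
        prev = phi
        phi = lambda Z, prev=prev, g=g: prev(Z) + 1.0 * g(prev(Z))   # eta_t = 1
    return W, phi                                   # final linear head and feature map
```

In practice one would mini-batch the inner optimization and re-tune the head after each round; the sketch keeps only the structure of the greedy loop.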
In another recent work, Huang et al. [22] propose a boosting-based algorithm for learning ResNets (see Algorithm 4 in the Appendix). We now show that their approach is equivalent to the greedy technique of Bengio et al. [6], and thus can be seen as an instance of our general framework. We note that such a connection was not previously known.

**Proposition 3.1.** Suppose the classification loss $\ell$ is the exponential loss. Then the greedy technique of Huang et al. [22] for learning ResNets is equivalent to the greedy layer-wise supervised training technique of Bengio et al. [6].

In another recent work, Nitanda and Suzuki [29] propose a gradient boosting technique to greedily learn a ResNet. This algorithm is closely related to the gradient greedy approach described in Algorithm 3, with $\mathcal{G}_t = \{g \circ \phi_{t-1} : g \in \mathcal{G}\}$.

**Algorithm 1: Generalized Boosting**
1. Input: training data $S = \{(x_i, y_i)\}_{i=1}^{n}$, number of iterations $T$, initial linear predictor $W_0$, initial feature transformer $\phi_0$, learning rates $\{\eta_i\}_{i=1}^{T}$, and an update routine UPDATE.
2. $t \leftarrow 1$.
3. While $t \le T$:
   - Construct the feature transformer class $\mathcal{G}_t$ based on the past iterates $\{(W_i, \phi_i)\}_{i=0}^{t-1}$.
   - $(W_t, \phi_t, g_t) \leftarrow \mathrm{UPDATE}(S, W_{t-1}, \phi_{t-1}, \eta_t, \mathcal{G}_t)$.
   - $t \leftarrow t + 1$.
4. Return $(W_T, \phi_T)$.

**Algorithm 2: Exact Greedy Update**
1. Input: training data $S$, previous iterate $(W, \phi)$, learning rate $\eta$, feature transformer class $\mathcal{G}$.
2. $W^+, g^+ \leftarrow \operatorname*{argmin}_{\widetilde{W} \in \mathcal{W},\, g \in \mathcal{G}} \hat{R}_S(\widetilde{W}, \phi + \eta g)$.
3. $\phi^+ \leftarrow \phi + \eta g^+$.
4. Return $(W^+, \phi^+, g^+)$.

**Algorithm 3: Gradient Greedy Update**
1. Input: training data $S$, previous iterate $(W, \phi)$, learning rate $\eta$, feature transformer class $\mathcal{G}$.
2. Pick a descent direction: $g^+ \leftarrow \operatorname*{argmin}_{g \in \mathcal{G}} \left\langle \nabla_\phi \hat{R}_S(W, \phi),\, g \right\rangle_{P_n^X}$.
3. $\phi^+ \leftarrow \phi + \eta g^+$.
4. Update the linear predictor: $W^+ \leftarrow \operatorname*{argmin}_{\widetilde{W} \in \mathcal{W}} \hat{R}_S(\widetilde{W}, \phi^+)$.
5. Return $(W^+, \phi^+, g^+)$.

We now highlight certain drawbacks of existing greedy layer-wise training techniques, which arise from the particular choice of $\mathcal{G}_t$ they use. Since $\{g \circ \phi_{t-1} : g \in \mathcal{G}\}$ is constructed solely from the past iterate $\phi_{t-1}$, any mistake in $\phi_{t-1}$ is propagated to all future iterates. As a result, these algorithms cannot recover from their past mistakes. As an example, consider the following scenario: two points $x_1, x_2$ belonging to two different classes are placed close to each other in the feature space after the first iteration of the greedy procedure, that is, $\phi_1(x_1) \approx \phi_1(x_2)$. In such a scenario, the future iterates $\{\phi_t\}_{t \ge 2}$ generated by existing greedy algorithms will always place $x_1, x_2$ close to each other in the representation space. As a result, the algorithm will always misclassify at least one of $x_1, x_2$. Another issue with existing greedy techniques is that they do not guarantee that the complexity of $\mathcal{G}_t$ increases with time $t$. In such scenarios, Algorithm 1 does not make much progress in each iteration, which can result in poor models. As an example, consider the setting where $\mathcal{G}$ is the set of all linear transformations. Suppose $\phi_0$ is the identity transform and $\phi_1$ is such that its range lies in a low-dimensional subspace. Then it is evident that $\mathcal{G}_1 \supseteq \mathcal{G}_t$ for all $t \ge 2$.

To fix these issues, we propose two new compositional boosting algorithms obtained with a more careful choice of $\mathcal{G}_t$. In our first algorithm, which we call DenseCompBoost, we choose $\mathcal{G}_t$ as follows:
$$\mathcal{G}_t = \left\{ g \circ \left(\mathrm{Id} + \sum_{i=0}^{t-1} \alpha_i \phi_i\right) : g \in \mathcal{G},\ \alpha_i \in \mathbb{R} \right\}, \tag{3}$$
where $\mathrm{Id}(\cdot)$ is the identity function. Such a choice of $\mathcal{G}_t$ helps us recover from past mistakes.
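As an illustration of Equation (3) (a sketch of ours, not the authors' code; it assumes, as Equation (3) itself does, that the input dimension matches the representation dimension), a member of this $\mathcal{G}_t$ can be written as a module that learns the mixing weights $\alpha_i$ jointly with the weak transformer $g$, while keeping the past iterates $\phi_0, \ldots, \phi_{t-1}$ frozen.

```python
import torch
import torch.nn as nn

class DenseCompBlock(nn.Module):
    # One member of G_t in Eq. (3): g o (Id + sum_i alpha_i * phi_i),
    # where the alpha_i are learned jointly with the weak transformer g.
    def __init__(self, past_phis, dim, hidden=64):
        super().__init__()
        self.past_phis = past_phis                        # frozen phi_0, ..., phi_{t-1}
        self.alphas = nn.Parameter(torch.zeros(len(past_phis)))
        self.g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim))    # weak transformer g in G

    def forward(self, x):
        z = x                                             # the Id(.) term
        for a, phi in zip(self.alphas, self.past_phis):
            with torch.no_grad():
                feat = phi(x)                             # past iterates are not re-trained
            z = z + a * feat
        return self.g(z)
```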
For example, if $\phi_1$ is a constant function, the algorithm can still learn a good feature transformer by relying on the input $x$ and the initial feature transform $\phi_0$. Moreover, our choice of $\mathcal{G}_t$ ensures that its complexity grows with $t$ and satisfies $\mathcal{G}_{t-1} \subseteq \mathcal{G}_t$ for all $t$. We call our algorithm DenseCompBoost, since the resulting model for this choice of $\mathcal{G}_t$ resembles a DenseNet [23], where each layer is allowed to be connected to all the previous layers. That being said, the models output by DenseCompBoost differ from DenseNet in how they aggregate the previous layers: DenseNet concatenates the features from previous layers, whereas DenseCompBoost adds them.

Our second algorithm, which we call CmplxCompBoost, tries to increase the complexity of $\mathcal{G}_t$ in each iteration as follows:
$$\mathcal{G}_t = \left\{ g \circ \phi_{t-1} : g \in \widetilde{\mathcal{G}}_t \right\}, \tag{4}$$
where $\widetilde{\mathcal{G}}_t$ is a weak feature transformer class satisfying $\widetilde{\mathcal{G}}_{t-1} \subset \widetilde{\mathcal{G}}_t$ for all $t$. In the case of one-layer neural networks, such $\widetilde{\mathcal{G}}_t$'s can be constructed by increasing the layer width with $t$. We note that the $\widetilde{\mathcal{G}}_t$ in this algorithm is independent of the past iterates. By increasing the complexity of $\widetilde{\mathcal{G}}_t$ with $t$, we expect the complexity of $\mathcal{G}_t$ to increase and Algorithm 1 to make more progress in each iteration. While not immediately evident, this technique can also fix the mistakes made by past iterates. For example, suppose $\phi_1$ places two points $x_1, x_2$ from different classes close to each other in the feature space. Then having a more complex $\widetilde{\mathcal{G}}_2$ can help recover from this mistake, as one can potentially find a $g \in \widetilde{\mathcal{G}}_2$ which separates these two points. In Section 5, we present empirical evidence showing that our new boosting algorithms have superior performance over existing additive and compositional boosting algorithms. Further empirical evidence corroborating the issues we identified with existing layer-wise training techniques can be found in Appendix J.1.

## 4 Excess Risk Bounds

In this section, we provide excess risk bounds for the models output by the generalized boosting framework. Our results depend on a weak learning condition on the hypothesis class $\mathcal{G}_t$ used in the $t$-th iteration of Algorithm 1. This condition is a way to quantify the relative strength of $\mathcal{G}_t$; roughly, it says that there always exists an element of $\mathcal{G}_t$ which makes an acute angle with the negative functional gradient at the current iterate. Such a condition ensures progress in each iteration of boosting.

**Definition 4.1.** Let $\beta \in (0, 1]$ and $\epsilon \ge 0$ be constants. $\mathcal{G}_{t+1}$ is said to satisfy the $(\beta, \epsilon)$-weak learning condition for a dataset $S$ if there exists a $g \in \mathcal{G}_{t+1}$ such that
$$\left\langle g,\ -\nabla_\phi \hat{R}_S(W_t, \phi_t) \right\rangle_{P_n^X} \ \ge\ \beta\, B(\mathcal{G}_{t+1})\, \big\| \nabla_\phi \hat{R}_S(W_t, \phi_t) \big\|_{P_n^X} - \epsilon,$$
where $B(\mathcal{G}_{t+1}) = \sup_{g \in \mathcal{G}_{t+1}} \|g\|_{P_n^X}$, and $P_n$ is the empirical distribution of $S$.

In traditional boosting, such conditions are typically referred to as the edge of a weak learner and play a crucial role in the convergence analysis. For example, Freund and Schapire [14] assume that for any set of weights over the training set $S$, there exists a classifier in the hypothesis class of weak classifiers which has better-than-random accuracy on the weighted samples. The following proposition shows that their condition is closely related to Definition 4.1.

**Proposition 4.1.** For binary classification, the weak learning condition of Freund and Schapire [14] satisfies the empirical weak learning condition in Definition 4.1, albeit in the label space.
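Definition 4.1 can be checked numerically for a candidate $g$ once the pointwise functional gradients are available. The sketch below is ours; it encodes the inequality as reconstructed above, with $g(x_i)$ and $\nabla_\phi \hat{R}_S(W_t, \phi_t)(x_i)$ stored as $n \times D$ arrays, and evaluates both sides under the empirical distribution $P_n^X$.

```python
import numpy as np

def satisfies_weak_learning(g_vals, grad_vals, B, beta, eps):
    # Check Definition 4.1 for one candidate g.
    # g_vals, grad_vals: (n, D) arrays holding g(x_i) and
    # nabla_phi \hat{R}_S(W_t, phi_t)(x_i) on the sample S.
    # B is sup_{g in G_{t+1}} ||g||_{P_n^X} for the class under consideration.
    lhs = np.mean(np.sum(g_vals * (-grad_vals), axis=1))        # <g, -grad>_{P_n^X}
    grad_norm = np.sqrt(np.mean(np.sum(grad_vals ** 2, axis=1)))  # ||grad||_{P_n^X}
    return lhs >= beta * B * grad_norm - eps
```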
For binary classification problems, it is well known that the weak learning condition of [14] is the weakest condition under which boosting is possible [15, 31]. This, together with the above proposition, suggests that our weak learning condition in Definition 4.1 cannot be weakened for binary classification problems.

To begin with, we derive excess risk bounds for the gradient greedy approach. Our analysis crucially relies on the observation that it can be viewed as performing inexact gradient descent on the population risk $R$. Several recent works have analyzed inexact gradient descent on convex objectives [2, 12, 33, 34]. However, the condition on the inexact gradient imposed by these works is different from ours and in many cases is stronger. For example, the condition of Balakrishnan et al. [2] translates, in our setting, to requiring the descent direction $g$ to be $\epsilon$-close to the negative functional gradient $-\nabla_\phi R(W, \phi)$ in the $L^2(P^X)$ norm, which is stronger than our weak learning condition. So the core of our analysis focuses on understanding inexact gradient descent with descent steps satisfying the weak learning condition in Definition 4.1.

In our analysis, we consider a sample-splitting variant of the algorithm, where in each iteration we use a fresh batch of samples. This is mainly done to simplify the analysis by avoiding complex statistical dependencies between the iterates of the algorithm. Let $\bar{n} = \lfloor n/T \rfloor$; we split the training dataset $S$ into $T$ subsets $\{S_t\}_{t=1}^{T}$ of size $\bar{n}$, where $S_t = \{(x_{t,i}, y_{t,i})\}_{i=1}^{\bar{n}}$. We work with the subset $S_t$ in the $t$-th iteration of Algorithm 1. We are now ready to state our main result on the excess risk bounds of the iterates of Algorithm 3. Our results depend on Rademacher complexity terms related to the hypothesis sets $\mathcal{W}, \mathcal{G}_t$:
$$\mathcal{R}(\mathcal{W}, \mathcal{G}_t) = \mathbb{E}\left[\sup_{W \in \mathcal{W},\, g \in \mathcal{G}_t} \frac{1}{\bar{n}} \sum_{i=1}^{\bar{n}} \sum_{k=1}^{K} \rho_{ik}\, [W g(x_{t,i})]_k \right], \qquad \mathcal{R}(\mathcal{G}_t) = \mathbb{E}\left[\sup_{g \in \mathcal{G}_t} \frac{1}{\bar{n}} \sum_{i=1}^{\bar{n}} \sum_{j=1}^{D} \rho_{ij}\, [g(x_{t,i})]_j \right],$$
where $[u]_k$ denotes the $k$-th entry of a vector $u$, and the expectation is taken w.r.t. the randomness of $S_t$ and the Rademacher random variables $\rho_{ij}$.

**Theorem 4.1 (Gradient Greedy).** Suppose the classification loss $\ell$ is $L$-Lipschitz and $M$-smooth w.r.t. its first argument. Let the hypothesis set of linear predictors $\mathcal{W}$ be such that any $W \in \mathcal{W}$ satisfies $\lambda_{\min}(WW^T) \ge \sigma_{\min}^2 > 0$ and $\lambda_{\max}(WW^T) \le \sigma_{\max}^2$. Moreover, suppose that for all $t$, $\mathcal{G}_t$ satisfies the $(\beta, \epsilon_t)$-weak learning condition of Definition 4.1 for any dataset $S_t$. Finally, suppose any $g \in \mathcal{G}_t$ is bounded, with $\sup_X \|g(X)\|_2 \le B$. Let the learning rates $\{\eta_t\}_{t \ge 1}$ be chosen as $\eta_t = c\, t^{-s}$, for some $s \in \left(\frac{\beta+1}{\beta+2}, 1\right)$ and positive constant $c$. If Algorithm 1 is run for $T$ iterations with Algorithm 3 as the update routine, then $(W_T, \phi_T)$, the $T$-th iterate output by the algorithm, satisfies the following risk bound for any $(W^*, \phi^*)$ and $\alpha \in (0, \beta(1-s))$, with probability at least $1 - \delta$ over datasets of size $n$:
$$R(W_T, \phi_T) \le R(W^*, \phi^*) + O\!\left(T^{-\alpha}\right) + O\!\left(\sum_{t=1}^{T} \eta_t \left( L\, \mathcal{R}(\mathcal{W}, \mathcal{G}_t) + L\, \mathcal{R}(\mathcal{G}_t) + \epsilon_t \right)\right).$$

**Proof Sketch.** We first show that Algorithm 3 can be viewed as performing inexact gradient descent on the population risk $R$. Specifically, we show that with high probability, the $t$-th iterate $g_t$ satisfies
$$\left\langle g_t,\ -\nabla_\phi R(W_{t-1}, \phi_{t-1}) \right\rangle_P \ \ge\ \beta B\, \big\|\nabla_\phi R(W_{t-1}, \phi_{t-1})\big\|_P - \epsilon_t - \zeta_t,$$
for some $\zeta_t > 0$. This follows from the weak learning condition satisfied by $\mathcal{G}_t$. Ignoring $\epsilon_t, \zeta_t$, the above inequality shows that $g_t$ makes an acute angle with the negative population functional gradient at $\phi_{t-1}$. Consequently, we would expect the population risk to decrease if we move along $g_t$. This is indeed the case, and the final step of the proof formalizes this intuition.

**Remarks.** We now briefly discuss the above result. See Appendix D for more discussion.
The reference classifier $(W^*, \phi^*)$ in the above bound can be any classifier, as long as $\|W^*\|_2 < \infty$ and $\|\phi^*\|_{P^X} < \infty$. In particular, if there exists a Bayes optimal classifier satisfying this condition, then the above theorem provides an excess risk bound w.r.t. the Bayes optimal classifier. The $T^{-\alpha}$ term in the bound corresponds to the optimization error, the $\eta_t\epsilon_t$ terms correspond to the approximation error, and the remaining terms correspond to the generalization error. As $T$ increases, the optimization error goes down, and as $n$ increases, the generalization error goes down. If there is no approximation error, that is, $\epsilon_t = 0$ for all $t$, then the excess risk goes to $0$ as $n, T \to \infty$ at an appropriate rate. If $\beta = 1$, then for an appropriate choice of step size the optimization error goes down as $O(T^{-1/3 + \gamma})$, for an arbitrarily small $\gamma > 0$. This rate is slower than the $O(T^{-1})$ rates for inexact gradient descent obtained by Devolder et al. [12] and Schmidt et al. [33]. However, unlike our work, these works assume that the level sets of the objective are bounded. Under the assumption that the level sets of the population risk are bounded, the optimization error in Theorem 4.1 can be improved to $O(T^{-1})$; however, such a condition need not hold in our setting. Note that the risk bounds are modular and only depend on the Rademacher complexity terms $\mathcal{R}(\mathcal{W}, \mathcal{G}_t), \mathcal{R}(\mathcal{G}_t)$, which capture the complexity of $\mathcal{G}_t$. To instantiate Theorem 4.1 for specific choices of $\mathcal{G}_t$, we need to bound these two complexity terms.

We now extend the analysis of Theorem 4.1 to the exact greedy approach.

**Corollary 4.1 (Exact Greedy).** Consider the setting of Theorem 4.1. Suppose Algorithm 1 is run with Algorithm 2 as the update routine. Then $(W_T, \phi_T)$, the $T$-th iterate output by the algorithm, satisfies the same risk bound as the gradient greedy algorithm in Theorem 4.1.

In the rest of the section, we instantiate Theorem 4.1 for specific choices of $\mathcal{G}_t$. We first consider the additive representation boosting algorithm.

**Corollary 4.2.** Consider the setting of Theorem 4.1 and the additive representation boosting algorithm, where $\mathcal{G}_t = \mathcal{G}$ for all $t$. Suppose $\mathcal{G}$ is the set of one-layer neural networks with sigmoid activation functions: $\mathcal{G} = \left\{\sigma(Cx) : C \in \mathbb{R}^{D \times d},\ \|C_{i,\cdot}\|_1 \le \Lambda\ \forall i\right\}$. Moreover, suppose the feature domain $\mathcal{X}$ is a subset of $[0, 1]^d$. Then the $T$-th iterate output by Algorithm 1, with Algorithm 2 or 3 as the update routine, satisfies the following risk bound for any $(W^*, \phi^*)$, with probability at least $1 - \delta$:
$$R(W_T, \phi_T) \le R(W^*, \phi^*) + O\!\left(T^{-\alpha} + K D \Lambda\, T^{1-s} \sqrt{\frac{\log D}{\bar{n}}} + \sum_{t=1}^{T}\eta_t\epsilon_t\right).$$

Next, we consider the layer-by-layer fitting technique of Bengio et al. [6].

**Corollary 4.3.** Consider the setting of Corollary 4.2 and the layer-by-layer training technique of Bengio et al. [6], where $\mathcal{G}_t = \{g \circ \phi_{t-1} : g \in \mathcal{G}\}$. Suppose $\mathcal{G}$ is the set of one-layer neural networks with sigmoid activation functions: $\mathcal{G} = \left\{\sigma(Cx) : C \in \mathbb{R}^{D \times D},\ \|C_{i,\cdot}\|_1 \le \Lambda\ \forall i\right\}$. Then the $T$-th iterate output by Algorithm 1, with Algorithm 2 or 3 as the update routine, satisfies the following risk bound for any $(W^*, \phi^*)$, with probability at least $1 - \delta$:
$$R(W_T, \phi_T) \le R(W^*, \phi^*) + O\!\left(T^{-\alpha} + K D \Lambda\, T^{2-2s} \sqrt{\frac{\log D}{\bar{n}}} + \sum_{t=1}^{T}\eta_t\epsilon_t\right).$$

Note that the generalization and optimization errors for both additive feature boosting and layer-by-layer fitting have a similar dependence on $T, n$. However, the latter tends to have a smaller approximation error ($\epsilon_t$), as it is able to build complex $\mathcal{G}_t$'s over time. So one would expect layer-by-layer fitting to output models with a better population risk, which our empirical results indeed verify.
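The Rademacher complexity terms appearing in these bounds can also be estimated numerically for simple transformer classes. The sketch below is ours; it restricts the supremum to a finite list of candidate transformers, so it only lower-bounds the complexity of the full class, and it uses Monte Carlo sampling over the Rademacher variables, following the definition of $\mathcal{R}(\mathcal{G}_t)$ used in Theorem 4.1.

```python
import numpy as np

def empirical_rademacher(candidates, X, num_draws=200, seed=0):
    # Monte Carlo estimate of E[ sup_g (1/n) sum_i sum_j rho_ij [g(x_i)]_j ]
    # for a *finite* list of candidate transformers, each mapping R^d -> R^D.
    rng = np.random.default_rng(seed)
    outputs = [np.stack([g(x) for x in X]) for g in candidates]   # each (n, D)
    n, D = outputs[0].shape
    estimates = []
    for _ in range(num_draws):
        rho = rng.choice([-1.0, 1.0], size=(n, D))                # Rademacher variables
        sup_val = max(np.mean(np.sum(rho * out, axis=1)) for out in outputs)
        estimates.append(sup_val)
    return float(np.mean(estimates))
```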
## 5 Experiments

In this section, we present experiments comparing the performance of various boosting techniques on both simulated and benchmark datasets.

**Baselines.** We compare our proposed boosting techniques with XGBoost, AdaBoost, additive representation boosting (discussed in Corollary 4.2), and the greedy layer-by-layer training technique of Bengio et al. [6] (Corollary 4.3). XGBoost uses decision trees as weak classifiers. For AdaBoost, we use one-hidden-layer neural networks as weak classifiers. We use two kinds of neural networks, depending on the dataset: for tabular datasets we use fully connected networks, and for image datasets we use convolutional networks (CNNs) whose convolution block is made up of Convolution, BatchNorm, and ReLU layers arranged sequentially. For additive representation boosting (AdditiveFeatureBoost from now on) and layer-by-layer fitting (StdCompBoost from now on), the weak feature transformer class $\mathcal{G}$ consists of one-layer neural network transformations. Similar to AdaBoost, we use two kinds of transformations: (a) fully connected transformations of the form $g(x) = \mathrm{ReLU}(Cx + d)$, and (b) convolutional transformations with Convolution, BatchNorm, and ReLU blocks arranged sequentially. Finally, we also compare against end-to-end training of ResNets.

**Proposed Techniques.** For DenseCompBoost, we use a slight variant of the $\mathcal{G}_t$ defined in Equation (3): $\mathcal{G}_t = \left\{h + g \circ \left(\sum_{i=0}^{t-1}\alpha_i\phi_i\right) : h \in \mathcal{H},\ g \in \mathcal{G},\ \alpha_i \in \mathbb{R}\right\}$, where $\mathcal{H}, \mathcal{G}$ are weak feature transformer classes. We use this variant because the dimensions of the input feature space and the representation space need not be the same, and as a consequence the $\mathcal{G}_t$ in Equation (3) cannot always be used. Similar to StdCompBoost, we consider two choices for $\mathcal{H}, \mathcal{G}$: one based on fully connected blocks and the other based on convolution blocks. For CmplxCompBoost, we again consider two choices for the weak transformer class $\widetilde{\mathcal{G}}_t$ in Equation (4): (a) $\mathrm{ReLU}(Cx + d)$ with $C \in \mathbb{R}^{D_t \times D_{t-1}}$, where $D_t = D_{t-1} + \Delta$ for some positive constant $\Delta$, and (b) convolution blocks with the number of output channels equal to the number of input channels plus a constant $\Delta$. This choice of feature transformers ensures that the complexity of $\mathcal{G}_t$ increases with $t$. We use exact greedy updates (Algorithm 2) for both of our proposed methods and set the learning rate $\eta_t$ to 1. We do not present experimental results for Algorithm 3, which we observed to have marginally worse performance than Algorithm 2.

### 5.1 Simulated Datasets

**Datasets.** In this section, we compare the techniques described above on simulated datasets. We generated three synthetic binary classification datasets in $\mathbb{R}^{32}$. Simulation 1 is a concentric ellipsoids dataset, where a point $x$ is classified based on $x^T A x$, for a randomly generated positive semidefinite matrix $A$. Simulations 2 and 3 are datasets whose classification boundaries are polynomials of degree 8 and 9, respectively. For each of these datasets, we generated $10^6$ samples for training and testing.

**Hyper-parameters.** We used hold-out validation to pick the best hyper-parameters for all the methods. We used 20% of the training data as validation data and picked the best parameters using grid search, based on validation accuracy. After picking the best parameters, we train on the entire training data and report performance on the test data. For all the greedy techniques based on neural networks, we used fully connected blocks and tuned the following parameters: weight decay, width of the weak feature transformers, and the number of boosting iterations $T$, which we upper bound by 15.
For CmplxCompBoost, we set $\Delta = D_0/5$. For end-to-end training, we tuned the weight decay, width of layers, and depth. We used SGD to optimize all of these techniques; the number of epochs and the step size schedule of SGD were chosen to ensure convergence. For XGBoost, we tuned the number of trees, the depth of each tree, and the learning rate. The exact values of the hyper-parameters tuned for each method can be found in Appendix J.

**Results.** Table 1 presents the results from our experiments. Both CmplxCompBoost and StdCompBoost largely outperform the additive boosting methods, with CmplxCompBoost being slightly better due to the increasing complexity of $\mathcal{G}_t$. Notably, DenseCompBoost performs significantly better than the rest and is able to bridge the gap between StdCompBoost and End-to-End. We attribute its success to its ability to recover from earlier mistakes: while StdCompBoost and CmplxCompBoost necessarily accumulate errors at each layer, DenseCompBoost is further connected to earlier layers, allowing it to undo its past mistakes.

Table 1: Test accuracy of various boosting techniques on synthetic datasets. Numbers in bold indicate the best performance among the greedy techniques.

| Technique | Simulation 1 | Simulation 2 | Simulation 3 |
| --- | --- | --- | --- |
| XGBoost (Trees) | 84.40 | 97.59 | 50.10 |
| AdaBoost (1 NN) | 67.90 | 93.73 | 72.64 |
| AdditiveFeatureBoost | 88.49 | 93.91 | 73.13 |
| StdCompBoost | 91.53 | 96.95 | 82.49 |
| DenseCompBoost | 93.55 | 98.35 | 95.70 |
| CmplxCompBoost | 91.97 | 97.22 | 82.52 |
| End-to-End | 93.88 | 98.35 | 99.09 |

Table 2: Test accuracy of various boosting techniques on benchmark datasets. We use convolution blocks for the first five datasets and fully connected blocks for the other datasets.

| Technique | SVHN | FashionMNIST | CIFAR10 | Convex | MNIST-rot-back-image | MNIST | Letter | CovType | Connect4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XGBoost (Trees) | 77.72 | 90.34 | 58.34 | 82.29 | 53.89 | 97.96 | 96.16 | 97.46 | 86.63 |
| AdaBoost (1 NN) | 82.88 | 88 | 72.78 | 86.17 | 50.02 | 98.27 | 92.08 | 90.95 | 86.39 |
| AdditiveFeatureBoost | 83.36 | 89.95 | 74.33 | 89.30 | 54.31 | 98.27 | 90.86 | 93.12 | 86.58 |
| StdCompBoost | 90.81 | 92.77 | 81.93 | 98.19 | 73.17 | 98.37 | 96.43 | 95.61 | 86.33 |
| DenseCompBoost | 91.03 | 93.17 | 82.31 | 98.6 | 73.1 | 98.34 | 96.96 | 96.28 | 86.85 |
| CmplxCompBoost | 91.25 | 93.18 | 82.43 | 98.52 | 74.32 | 98.34 | 96.66 | 95.92 | 86.49 |
| End-to-End | 94.82 | 93.49 | 86.88 | 98.81 | 82.69 | 98.95 | 97.67 | 96.86 | 87.37 |

### 5.2 Benchmark Datasets

**Datasets.** In this section, we compare the various techniques on the following image datasets: CIFAR10, MNIST, FashionMNIST [35], MNIST-rot-back-image [24], convex [24], and SVHN [28], and the following tabular datasets from the UCI repository [7]: letter recognition [17], forest cover type (covtype), and connect4. The convex dataset involves classifying shapes in images as either convex or non-convex.

**Hyper-parameters.** For the covtype dataset, which does not come with a test set, we randomly sample 20% of the original data and use it as the test set. We use the same hyper-parameter selection procedure as in Section 5.1 and tune the same set of hyper-parameters. We use convolution blocks for CIFAR10, SVHN, FashionMNIST, convex, and MNIST-rot-back-image, and fully connected blocks for the rest. While tuning the hyper-parameters for the compositional boosting techniques and end-to-end training, we limit the width of fully connected blocks to 4096 and the number of output channels in convolution blocks to 128. For AdaBoost and additive representation boosting, we set these limits to 16000 and 350, respectively. For CmplxCompBoost with convolution blocks, we set $\Delta = D_0/8$. We do not use data augmentation in our experiments.
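To make the CmplxCompBoost configuration above concrete, here is a small sketch (ours; fully connected variant only, with an illustrative base width) of a weak transformer whose output width grows by $\Delta$ at every round, so that $\widetilde{\mathcal{G}}_{t-1} \subset \widetilde{\mathcal{G}}_t$:

```python
import torch.nn as nn

def cmplx_weak_transformer(d_in, delta):
    # Fully connected weak transformer for CmplxCompBoost: ReLU(Cx + d) mapping
    # D_{t-1} -> D_t = D_{t-1} + delta features, so the class grows with each round.
    return nn.Sequential(nn.Linear(d_in, d_in + delta), nn.ReLU())

# Example: with an (illustrative) base width D_0 = 100 and delta = D_0 // 5 = 20,
# the block used in round t has output width 100 + 20 * t.
```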
**Results.** Table 2 presents the results from our experiments. It can be seen that on image classification tasks, additive boosting techniques have poor performance. Among the compositional boosting methods, StdCompBoost performs the worst. While DenseCompBoost performs comparably to CmplxCompBoost on image datasets, it is better on tabular data. We believe a hybrid of the DenseCompBoost and CmplxCompBoost algorithms could achieve better performance than either algorithm alone.

## 6 Conclusion

We proposed a generalized framework for boosting, which allows for more complex forms of aggregation of weak learners than traditional boosting. Our generalized framework allows us to derive learning algorithms that (a) have performance close to that of end-to-end trained DNNs, and (b) come with strong theoretical guarantees. Additive boosting algorithms do not satisfy property (a), while DNNs do not satisfy property (b). In particular, additive boosting algorithms, even with small neural networks as their weak classifiers, do not have the strong performance of end-to-end trained DNNs. Improving their performance requires the hypothesis space to increase in complexity while not increasing the sample complexity of each boosting step too greatly, which can be achieved by our generalized boosting framework. One particular instantiation of our framework is aggregation using function compositions. A number of existing greedy techniques for learning neural networks fall into our framework, and our analysis allowed us to delineate some of their key flaws and, consequently, to propose new techniques which improve upon them. We believe our work opens up a new line of inquiry for greedy learning of highly flexible models with rigorous theoretical guarantees, by leveraging the theory of boosting and generalized greedy algorithms in function spaces. We moreover believe our work has the potential to bridge the gap in performance between existing greedy layer-by-layer training techniques and end-to-end training of deep networks.

## Broader Impact

Deep learning has been tremendously successful over the past decade in many application areas such as computer vision, image recognition, speech recognition, and natural language processing. Despite this success, deep neural networks have largely remained a mystery. With millions of parameters, these models are black boxes to humans, making it harder to diagnose errors. This also makes it harder to adopt these models in critical applications such as healthcare, law, and finance. Consequently, it is crucial to come up with techniques that make neural networks transparent and easy to understand. We take a step towards this goal by drawing inspiration from classical boosting. Similar to classical boosting, our generalized boosting framework builds complex models greedily. But unlike classical boosting, it allows us to derive learning algorithms that have performance close to that of end-to-end trained DNNs. Moreover, models built using our framework are easy to understand and come with strong theoretical guarantees.

## Acknowledgement

We acknowledge the support of NSF via IIS-1909816 and OAC-1934584, ONR via N000141812861, and Amazon Web Services (AWS).

## References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pages 6155–6166, 2019.
[2] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.

[3] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[5] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can scale to ImageNet. arXiv preprint arXiv:1812.11446, 2018.

[6] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160, 2007.

[7] Catherine L. Blake and Christopher J. Merz. UCI repository of machine learning databases, 1998.

[8] Chang Chen, Zhiwei Xiong, Xinmei Tian, and Feng Wu. Deep boosting for image denoising. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.

[9] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

[10] Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1179–1187, Beijing, China, 2014. PMLR. URL http://proceedings.mlr.press/v32/cortesb14.html.

[11] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 874–883. JMLR.org, 2017.

[12] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[13] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

[14] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

[15] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332, 1996.

[16] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.

[17] Peter W. Frey and David J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161–182, 1991.

[18] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.

[19] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[20] Alexander Grubb and J. Andrew Bagnell. Generalized boosting algorithms for convex optimization. arXiv preprint arXiv:1105.2054, 2011.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep ResNet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964, 2017.

[23] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[24] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480, 2007.

[25] Sindy Löwe, Peter O'Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient-isolated learning of representations. In Advances in Neural Information Processing Systems, pages 3033–3045, 2019.

[26] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512–518, 2000.

[27] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

[28] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[29] Atsushi Nitanda and Taiji Suzuki. Functional gradient boosting based on residual network perception. arXiv preprint arXiv:1802.09031, 2018.

[30] Natalia Ponomareva, Thomas Colthurst, Gilbert Hendry, Salem Haykal, and Soroush Radpour. Compact multi-class boosted trees. In 2017 IEEE International Conference on Big Data (Big Data), pages 47–56. IEEE, 2017.

[31] Gunnar Rätsch and Manfred K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6(Dec):2131–2152, 2005.

[32] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[33] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466, 2011.

[34] Vladimir Nikolaevich Temlyakov. Greedy expansions in convex optimization. Proceedings of the Steklov Institute of Mathematics, 284(1):244–262, 2014.

[35] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.