# Precision-based Boosting

Mohammad Hossein Nikravan, Marjan Movahedan, Sandra Zilles
Department of Computer Science, University of Regina, Regina, SK, Canada
nikravam@uregina.ca, marjan.movahedan@gmail.com, zilles@cs.uregina.ca

**Abstract.** AdaBoost is a highly popular ensemble classification method for which many variants have been published. This paper proposes a generic refinement of all of these AdaBoost variants. Instead of assigning weights based on the total error of the base classifiers (as in AdaBoost), our method uses class-specific error rates. On instance x it assigns a higher weight to a classifier predicting label y on x, if that classifier is less likely to make a mistake when it predicts class y. Like AdaBoost, our method is guaranteed to boost weak learners into strong learners. An empirical study on AdaBoost and one of its multi-class versions, SAMME, demonstrates the superiority of our method on datasets with more than 1,000 instances as well as on datasets with more than three classes.

## Introduction

One of the most popular methods for ensemble classification is AdaBoost, introduced by Freund and Schapire (1997). The intuitive idea behind AdaBoost is: (i) train a simple base classifier $h_1$ and observe its performance on the training data; (ii) assign a higher weight to the training examples misclassified by $h_1$ and a lower weight to the correctly classified ones (so that the base classifier $h_2$ to be trained in the next iteration focuses more on those data points that $h_1$ misclassified); (iii) iterate this procedure $T$ times, to create base classifiers $h_1, \ldots, h_T$; and (iv) form a final classifier from a weighted majority vote of the base classifiers. If base classifier $h_t$ has a lower weighted error on the training examples than base classifier $h_{t'}$, then $h_t$ is assigned more weight in the final vote than $h_{t'}$. Moreover, when reweighting training examples after iteration $t$, weights are changed more substantially than after iteration $t'$, because the examples misclassified by the stronger classifier $h_t$ likely tend to be more difficult than many of those misclassified by the weaker classifier $h_{t'}$. The exact coefficients by which to reweight training examples between iterations and by which to weight base classifiers in the final vote were chosen so as to yield guarantees on the performance of AdaBoost.

In this paper, we follow an idea originally proposed by Aslam (2000) and refine the principle behind AdaBoost's weighting scheme. Consider a hypothetical situation in which several base classifiers are to be aggregated, based on their performance on a set of 100 training examples, of which 50 have class 1 and 50 have class 2. Classifier $h_t$ predicts correct labels for 49 of the 50 instances in class 1 and for 25 of the 50 instances in class 2, for a total error of 26%. Classifier $h_{t'}$ predicts correct labels for 40 of the 50 instances in class 1 and for 40 of the 50 instances in class 2; its total error is 20%. When aggregating all base classifiers to make a prediction on an unseen instance $x$, AdaBoost will assign more weight to $h_{t'}$, which appears more trustworthy given its lower error. The refined approach that we suggest, though, will first evaluate both $h_t$ and $h_{t'}$ on $x$ and then decide how to weight their votes. Suppose $h_t(x) = 2$ and $h_{t'}(x) = 1$. Our statistics on the training data suggest that, when $h_t$ predicts class 2, it is incorrect in only $1/26 = 3.8\%$ of all cases. By comparison, when $h_{t'}$ predicts class 1, it is incorrect in $10/50 = 20\%$ of all cases. So, the chance of being incorrect on this specific instance $x$ is more than 5 times higher when following $h_{t'}$ than when following $h_t$. Unlike what AdaBoost would do, the method we propose in this paper would assign a higher weight to the vote of $h_t$ than to the vote of $h_{t'}$ when making a prediction on the specific instance $x$.
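The arithmetic behind this comparison can be reproduced directly from the per-class counts. The following minimal Python sketch (variable names are chosen here purely for illustration) recomputes the total error of each hypothetical classifier as well as how often it is wrong when it predicts each class:

```python
# Hypothetical per-class counts from the example above: each classifier
# sees 50 instances of class 1 and 50 instances of class 2.
counts = {
    "h_t":       {"correct_on_1": 49, "correct_on_2": 25},   # total error 26%
    "h_t_prime": {"correct_on_1": 40, "correct_on_2": 40},   # total error 20%
}

for name, c in counts.items():
    wrong_on_1 = 50 - c["correct_on_1"]          # class-1 instances labelled as class 2
    wrong_on_2 = 50 - c["correct_on_2"]          # class-2 instances labelled as class 1
    total_error = (wrong_on_1 + wrong_on_2) / 100
    predicts_1 = c["correct_on_1"] + wrong_on_2  # how often the classifier says "class 1"
    predicts_2 = c["correct_on_2"] + wrong_on_1  # how often the classifier says "class 2"
    print(f"{name}: total error {total_error:.0%}, "
          f"wrong when predicting 1: {wrong_on_2 / predicts_1:.1%}, "
          f"wrong when predicting 2: {wrong_on_1 / predicts_2:.1%}")
```

Despite its higher total error, $h_t$ is the safer vote whenever it predicts class 2; this per-class view is exactly the signal our weighting scheme exploits.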
In short, our method does not simply assess the total error/accuracy of a base classifier, but its error/precision when it predicts class $y$, separately for each class $y$. We use the term class-specific precision to refer to the precision of a classifier on any one chosen class label. When handling data instance $x$, the weight our method assigns to a base classifier $h$ hence increases with $h$'s class-specific precision on the class $y = h(x)$ predicted by $h$ on $x$. We thus call our method Precision-based AdaBoost, or PrAdaBoost for short. The same idea can be applied to virtually every existing variation of AdaBoost; e.g., we propose and test PrSAMME, the precision-based version of the multi-class AdaBoost adaptation SAMME (Hastie et al. 2009).

Following a forward-stagewise additive model (Friedman, Hastie, and Tibshirani 2000), the choice of weight parameters in AdaBoost is optimal when only a single weight parameter can be assigned to every individual base classifier. In this forward-stagewise model, we simply introduce one more degree of freedom by incorporating class-specific precision, which leads to a different optimal solution. Our formal analysis of PrAdaBoost shows that, as in the case of AdaBoost, the PrAdaBoost algorithm trains and aggregates weak classifiers (which are only guaranteed to be better than random guessing) into arbitrarily strong (i.e., arbitrarily accurate) classifiers, when evaluated on the training data. Analogously to Freund and Schapire's (1996) result for AdaBoost, we obtain an upper bound on the training error that decreases exponentially quickly as a function of the number of iterations run by our algorithm.

In an empirical evaluation on 23 two-class UCI datasets, neither PrAdaBoost nor AdaBoost consistently outperforms its competitor, but our results strongly suggest that PrAdaBoost tends to beat AdaBoost on larger datasets as well as on imbalanced datasets. In the multi-class setting, on 18 UCI datasets, PrSAMME is overall the clear winner over SAMME. In sum, our evaluation suggests that, especially if the dataset has more than 1,000 data points, is imbalanced, or has more than three classes, precision-based boosting should be preferred over traditional error-based boosting.

## Related Work

Various methods for constructing ensemble-based systems have been developed (Rokach 2010). Adaptive boosting (Freund and Schapire 1997), bagging (Breiman 1996), and random forests (Breiman 2001) are some of the best-known ensemble-based classifiers. AdaBoost is among the most popular methods, not only because of its successes in practice, but also because of its solid theoretical foundation. The theory behind AdaBoost is based on the greedy minimization of an exponential loss function (Schapire and Freund 2012; Friedman, Hastie, and Tibshirani 2000; Breiman 1999; Schapire and Singer 1999).
Knowledge of this minimization procedure gives us insights into the convergence properties of the AdaBoost algorithm. In addition, it provides us with a means of modifying the loss function so as to obtain novel boosting-like methods for modified classification tasks (Schapire and Freund 2012). For example, Schapire and Singer (1999) used the Hamming loss to develop a generalized multi-class version of AdaBoost. Their method, AdaBoost.MH, minimizes the Hamming loss by decomposing the problem into several two-class classification problems. Approaches targeting imbalances in the data or in the misclassification costs led to the design of asymmetric learning methods (Nikolaou and Brown 2015), such as AdaCost (Fan et al. 1999), which reduces the cumulative misclassification cost by integrating a cost-adjustment function into the weight updating rule of AdaBoost, and of other variants of AdaBoost; see for instance (Wu and Nagahashi 2015) for further references and for a comparative study of various AdaBoost-like methods.

The crucial difference between these methods and ours is that ours is not targeted at imbalanced data, imbalanced misclassification costs, or any other kind of imbalance between the individual classes in the given data. Instead of targeting asymmetric learning settings, we improve AdaBoost in its original task by introducing more freedom in the choice of weight parameters to be selected in minimizing the exponential loss. The precision-based approach is applicable to all variations of AdaBoost: just as we test PrAdaBoost and PrSAMME, one could also implement PrAdaCost, PrAdaBoost.M2, etc.

The only existing method using an idea similar to ours is InfoBoost (Aslam 2000), which was never tested extensively in the literature. Like PrAdaBoost, it uses weight parameters based on class-specific precision, but it is built on an information-theoretic motivation. By contrast, our method employs the forward-stagewise additive model used to justify AdaBoost's weight parameters. In particular, we introduce another degree of freedom into this model, yielding a new optimal solution. We use this new optimal solution in our parameters, thus not only significantly improving AdaBoost but also beating InfoBoost by a large margin.

## AdaBoost Framework

Suppose a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of labeled examples is given, based on a binary target classifier. Here, each $x_i$ represents a training instance, and $y_i \in \{-1, +1\}$ is the label of $x_i$. The algorithmic framework used by AdaBoost (Freund and Schapire 1996) is displayed in Algorithm 1. On input of $S$, the boosting algorithm runs for $T$ iterations, in each iteration $t$ obtaining a new base classifier $h_t$ from a weak learning algorithm.

**Algorithm 1: AdaBoost Scheme (Freund and Schapire 1997)**

**Input:** a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, a base inducer (i.e., a weak learning algorithm), and a number of iterations $T \in \mathbb{N}$.

Initialize $D_1(x_i) = \frac{1}{n}$ for all $i \in \{1, \ldots, n\}$.

For $t = 1$ to $T$ do:
1. Call the base inducer with the distribution $D_t$ to obtain a hypothesis $h_t : X \to Y$.
2. Calculate the total error $\varepsilon_t$ of $h_t$ w.r.t. $D_t$, i.e., $\varepsilon_t = \sum_{i=1}^{n} D_t(x_i)\, I(h_t(x_i) \neq y_i)$. If $\varepsilon_t > 0.5$, then set $T$ to $t-1$ and abort the loop.
3. Set $\beta_t$ as a decreasing function of $\varepsilon_t$.
4. Redistribute weights as follows, for $1 \le i \le n$:
$$D_{t+1}(x_i) = \frac{D_t(x_i)}{Z_t} \cdot \begin{cases} e^{-\beta_t} & \text{if } h_t(x_i) = y_i \\ e^{\beta_t} & \text{otherwise} \end{cases}$$
where $Z_t$ is set such that $\sum_{i=1}^{n} D_{t+1}(x_i) = 1$.

**end**

**Output:** the final hypothesis
$$H(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \beta_t h_t(x) \ge 0 , \\ -1 & \text{otherwise.} \end{cases} \tag{1}$$
Initially, all training instances in $S$ have the same weight, but in each iteration $t$ the weights are updated; see line 4 of Algorithm 1. The parameter $\beta_t$ used in the weight adjustment is defined to be a decreasing function of the error $\varepsilon_t$. The latter refers to the cumulative weight of the training instances misclassified by the base classifier $h_t$, with respect to the distribution $D_t$; see line 2. (Throughout the paper, the notation $I(P) \in \{0,1\}$ refers to the binary indicator of the truth of condition $P$.) After $T$ iterations, the final aggregated classifier computes a weighted majority vote over all previously trained base classifiers, where the weight of classifier $h_t$ equals $\beta_t$, i.e., decreases as the error $\varepsilon_t$ of $h_t$ increases; see Eqn. (1). In AdaBoost (Freund and Schapire 1996), the crucial parameter $\beta_t$ is defined as $\beta_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$.

AdaBoost follows a forward-stagewise additive model (Friedman, Hastie, and Tibshirani 2000), where the objective is to find a function $H$ which minimizes the exponential loss $L(y, H) = e^{-yH(x)}$ over $S$, i.e., which minimizes

$$\sum_{i=1}^{n} L(y_i, H(x_i)) = \sum_{i=1}^{n} e^{-y_i H(x_i)} . \tag{2}$$

Here $H(x_i) \in \{-1, +1\}$ is the label $H$ predicts for the training instance $x_i$ and $y_i \in \{-1, +1\}$ is again the observed true label for $x_i$. Moreover, $H$ is considered to be an additive function of the following form, where $\beta_t \in \mathbb{R}$ are coefficients and $h_t$ are base classifiers for $t = 1, \ldots, T$:

$$H(x) = \sum_{t=1}^{T} \beta_t h_t(x) . \tag{3}$$

The greedy method implemented in AdaBoost approximates the optimal solution to (2) in an iterative way, starting with $H_0 = 0$, choosing $\beta_t$ and $h_t$ so as to minimize $\sum_{i=1}^{n} \exp(-y_i(H_{t-1}(x_i) + \beta_t h_t(x_i)))$, and then setting $H_t(x) = H_{t-1}(x) + \beta_t h_t(x)$ and $H = H_T$. Defining $\omega_{t,i} = \exp(-y_i H_{t-1}(x_i))$, the goal is thus to find

$$\operatorname*{argmin}_{\beta_t, h_t} \sum_{i=1}^{n} \omega_{t,i} \exp(-\beta_t y_i h_t(x_i)) \tag{4}$$

$$= \operatorname*{argmin}_{\beta_t, h_t} \Bigg( \sum_{i:\, y_i = h_t(x_i)} \omega_{t,i}\, e^{-\beta_t} + \sum_{i:\, y_i \neq h_t(x_i)} \omega_{t,i}\, e^{\beta_t} \Bigg) . \tag{5}$$

The optimal solution to this iterative procedure (thus approximating the minimizer of (2)) yields $\beta_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$, as used in AdaBoost, cf. (Friedman, Hastie, and Tibshirani 2000).
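As a concrete point of reference for the refinement introduced next, here is a minimal sketch of Algorithm 1 with decision stumps as the base inducer and the optimal coefficient $\beta_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$. It is an illustration under our own naming choices (the paper's experiments use Matlab stumps), assuming the labels form a NumPy array over $\{-1, +1\}$:

```python
import numpy as np

def train_stump(X, y, D):
    """Weighted decision stump: pick the (feature, threshold, polarity) with the
    smallest weighted error under the distribution D."""
    d = X.shape[1]
    best = (np.inf, 0, 0.0, 1)                      # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(X, stump):
    _, j, thr, pol = stump
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, T=50):
    """Algorithm 1 with decision stumps; y is a NumPy array with entries in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                         # uniform initial distribution D_1
    stumps, betas = [], []
    for _ in range(T):
        stump = train_stump(X, y, D)                # step 1: call the base inducer
        pred = stump_predict(X, stump)
        eps = D[pred != y].sum()                    # step 2: total weighted error eps_t
        if eps == 0 or eps > 0.5:                   # stop on a perfect or too-weak stump
            break
        beta = 0.5 * np.log((1 - eps) / eps)        # step 3: beta_t = 1/2 ln((1-eps_t)/eps_t)
        D = D * np.exp(-beta * y * pred)            # step 4: down-weight correct, up-weight wrong
        D = D / D.sum()                             # normalize by Z_t
        stumps.append(stump)
        betas.append(beta)

    def H(Xq):                                      # Eqn. (1): weighted majority vote
        votes = sum(b * stump_predict(Xq, s) for b, s in zip(betas, stumps))
        return np.where(votes >= 0, 1, -1)
    return H
```

A typical call would be `H = adaboost(X_train, y_train, T=50)` followed by `y_hat = H(X_test)`, with both inputs as NumPy arrays.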
## Precision-based Boosting

AdaBoost weights a classifier based on its error: a low error $\varepsilon_t$ of $h_t$ results in a high weight $\beta_t$ assigned to $h_t$. This is possible because the stagewise model shown above makes the coefficient $\beta_t$ dependent on the iteration number $t$. We refine this approach by adding one more degree of freedom to the optimization problem described above. In iteration $t$, we allow the stagewise model to select two values of $\beta_t$: one value $\beta_{t,1}$ to handle instances $x_i$ on which $h_t$ predicts $1$, and one value $\beta_{t,-1}$ to handle those on which $h_t$ predicts $-1$. Thus one can assign a higher weight to a classifier $h_t$ when it predicts class $y$, in case it generally has a high precision on class $y$. Intuitively, this creates an advantage over AdaBoost when the precision of base classifiers on the two classes is not equal. For this purpose, we extend Equation (4); the goal is to find $\beta_{t,1}$, $\beta_{t,-1}$, and $h_t$ that minimize

$$\sum_{i:\, y_i = h_t(x_i)} \omega_{t,i}\Big(e^{-\beta_{t,1}} I(h_t(x_i) = 1) + e^{-\beta_{t,-1}} I(h_t(x_i) = -1)\Big) \;+ \sum_{i:\, y_i \neq h_t(x_i)} \omega_{t,i}\Big(e^{\beta_{t,1}} I(h_t(x_i) = 1) + e^{\beta_{t,-1}} I(h_t(x_i) = -1)\Big) . \tag{6}$$

While (4) eventually yields a $\beta_t$-value depending on the total error $\varepsilon_t$ of $h_t$, our approach requires two error terms, namely $\varepsilon_{t,1}$ for the error of $h_t$ when it predicts class $1$, and $\varepsilon_{t,-1}$ for the error of $h_t$ when it predicts class $-1$. Specifically, the error $\varepsilon_{t,y}$ is defined as the cumulative weight of the training instances on which $h_t$ incorrectly predicts $y$, under the distribution $D_t$. Similarly, we use $r_{t,y}$ to refer to the cumulative weight of the training instances in class $y$.

**Definition 1.** For $y \in \{1, -1\}$ and $t \in \{1, \ldots, T\}$, let
$$\varepsilon_{t,y} = \sum_{i=1}^{n} D_t(x_i)\, I(y_i \neq h_t(x_i))\, I(h_t(x_i) = y) \quad\text{and}\quad r_{t,y} = \sum_{i=1}^{n} D_t(x_i)\, I(y_i = y) .$$

In what follows, $-y$ denotes the opposite class label, i.e., $-y = -1$ if $y = 1$ and $-y = 1$ if $y = -1$. The proof of Lemma 1 is omitted.

**Lemma 1.** If $\varepsilon_{t,y} \neq 0$ for all $t, y$, then (6) is minimized by
$$h_t = \operatorname*{argmin}_{h} \sum_{i=1}^{n} \omega_{t,i}\, I(y_i \neq h(x_i)) \quad\text{and}\quad \beta_{t,y} = \frac{1}{2}\ln\frac{r_{t,y} - \varepsilon_{t,-y}}{\varepsilon_{t,y}} .$$

We correspondingly adapt the AdaBoost framework to a new boosting algorithm called PrAdaBoost (short for Precision-based AdaBoost), shown in Algorithm 2. Intuitively, AdaBoost and PrAdaBoost differ in only one simple aspect. For a given training dataset $S$ and a set of base classifiers, AdaBoost adopts the following strategy: the higher the accuracy of a base classifier (statistically on $S$), the more one should trust its predictions. By comparison, PrAdaBoost's strategy is: the higher the class-specific precision of a base classifier $h$ for class $y$ (statistically on $S$), the more one should trust $h$ when it predicts class $y$.

**Algorithm 2: PrAdaBoost**

**Input:** a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, a base inducer (i.e., a weak learning algorithm), and a number of iterations $T \in \mathbb{N}$.

Initialize $D_1(x_i) = \frac{1}{n}$ for all $i \in \{1, \ldots, n\}$.

For $t = 1$ to $T$ do:
1. Call the base inducer with the distribution $D_t$ to obtain a hypothesis $h_t : X \to Y$.
2. If $\varepsilon_t > \min_y r_{t,y}$, then set $T$ to $t-1$ and abort the loop.
3. Set $\beta_{t,1} = \frac{1}{2}\ln\frac{r_{t,1} - \varepsilon_{t,-1}}{\varepsilon_{t,1}}$ and $\beta_{t,-1} = \frac{1}{2}\ln\frac{r_{t,-1} - \varepsilon_{t,1}}{\varepsilon_{t,-1}}$.
4. Redistribute weights as follows, for $1 \le i \le n$:
$$D_{t+1}(x_i) = \frac{D_t(x_i)}{Z_t} \cdot \begin{cases} e^{-\beta_{t,h_t(x_i)}} & \text{if } h_t(x_i) = y_i \\ e^{\beta_{t,h_t(x_i)}} & \text{if } h_t(x_i) \neq y_i \end{cases}$$
where $Z_t$ is defined such that $\sum_{i=1}^{n} D_{t+1}(x_i) = 1$.

**end**

**Output:** the final hypothesis
$$H(x) = \begin{cases} h_{t^*}(x) & \text{if } \mathcal{T}(x) \neq \emptyset, \text{ where } t^* = \min \mathcal{T}(x) , \\ 1 & \text{else if } \sum_{t=1}^{T} \beta_{t,h_t(x)} h_t(x) \ge 0 , \\ -1 & \text{otherwise,} \end{cases} \qquad \text{where } \mathcal{T}(x) = \{t \mid \beta_{t,h_t(x)} = \infty\} . \tag{7}$$

A minor difference in forming the final classifier is that AdaBoost has a perfect base classifier $h_t$ when $\beta_t = \infty$. In our case, $\beta_{t,y} = \infty$ does not imply $\varepsilon_t = 0$; one can only rely fully on $h_t$ when it predicts $y$. This is addressed by incorporating $\min \mathcal{T}(x)$ in the output of PrAdaBoost. Our formal derivation of $\beta_{t,y}$ above assumes that these values are finite, so that $\mathcal{T}(x) = \emptyset$; but to ensure proper logic in our algorithm, we introduce the case $\mathcal{T}(x) \neq \emptyset$.

Note that the abort condition in line 2, which ensures that the $\beta$ values are non-negative, is weaker than the corresponding condition used in AdaBoost, but it is irrelevant in the typical practical deployment of AdaBoost-type methods. It is given just for formal reasons, as the pseudocode allows any kind of base inducer. Formally, the loop is aborted when the total error of a base classifier exceeds the cumulative weight of one of the two classes under the current distribution. However, that abort condition is never met when using a standard decision stump learner, as such a learner would favour a constant classifier with error $\varepsilon_t = \min_y r_{t,y}$ (i.e., the trivial classifier always predicting the class with the highest cumulative weight) over any other decision stump with error $\varepsilon_t > \min_y r_{t,y}$. Likewise, line 2 of AdaBoost's pseudocode can be dropped when using decision stumps. PrAdaBoost's runtime complexity is equal to AdaBoost's.
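To make the contrast with Algorithm 1 concrete, a minimal sketch of Algorithm 2 follows, reusing the `train_stump` and `stump_predict` helpers from the AdaBoost sketch above. It assumes the setting of Lemma 1 (all class-specific errors are nonzero), so no $\beta$ value is infinite and the $\mathcal{T}(x)$ special case of the final hypothesis never triggers; only the error bookkeeping, the $\beta$ computation, and the final vote differ from the AdaBoost sketch:

```python
import numpy as np   # train_stump and stump_predict are the helpers from the sketch above

def pr_adaboost(X, y, T=50):
    """Algorithm 2 with decision stumps; y is a NumPy array with entries in {-1, +1}.
    Assumes eps_{t,y} > 0 for all t, y (Lemma 1), so all betas stay finite."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    stumps, betas = [], []                          # betas[t] = {1: beta_{t,1}, -1: beta_{t,-1}}
    for _ in range(T):
        stump = train_stump(X, y, D)
        pred = stump_predict(X, stump)
        # Definition 1: class-specific errors and cumulative class weights under D_t
        eps = {c: D[(pred == c) & (pred != y)].sum() for c in (1, -1)}
        r = {c: D[y == c].sum() for c in (1, -1)}
        if eps[1] + eps[-1] > min(r.values()):      # step 2: abort condition
            break
        # step 3: beta_{t,y} = 1/2 ln((r_{t,y} - eps_{t,-y}) / eps_{t,y})
        beta = {c: 0.5 * np.log((r[c] - eps[-c]) / eps[c]) for c in (1, -1)}
        b_i = np.where(pred == 1, beta[1], beta[-1])               # beta of the predicted class
        D = D * np.where(pred == y, np.exp(-b_i), np.exp(b_i))     # step 4
        D = D / D.sum()                                            # normalize by Z_t
        stumps.append(stump)
        betas.append(beta)

    def H(Xq):                                      # final vote with class-specific weights
        votes = 0.0
        for b, s in zip(betas, stumps):
            p = stump_predict(Xq, s)
            votes = votes + np.where(p == 1, b[1], b[-1]) * p
        return np.where(votes >= 0, 1, -1)
    return H
```

The only structural change is that each base classifier now carries two weights, and both the reweighting step and the final vote look up the weight belonging to the class the classifier actually predicts.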
## Formal Analysis

AdaBoost is known to boost a series of base classifiers $h_t$ with $\varepsilon_t < \frac{1}{2}$ into a classifier with arbitrarily small training error; known upper bounds on the training error decrease exponentially quickly as a function of $T$ (Freund and Schapire 1996). We will now establish a similar result for PrAdaBoost that gives, in some cases, much smaller bounds on the training error than the corresponding bounds for AdaBoost.

**Theorem 1.** W.r.t. the uniform distribution $D_1$, the error $\varepsilon_H$ of the hypothesis $H$ output by PrAdaBoost is bounded by
$$\varepsilon_H = \Pr_{x_i \sim D_1}[H(x_i) \neq y_i] \;\le\; \prod_{t=1}^{T}\Big(2\sqrt{\varepsilon_{t,1}(r_{t,1} - \varepsilon_{t,-1})} + 2\sqrt{\varepsilon_{t,-1}(r_{t,-1} - \varepsilon_{t,1})}\Big) .$$
If the base classifiers $h_t$ in Algorithm 2 satisfy $\varepsilon_t < 1/2$ for all $t \in \{1, \ldots, T\}$, this bound decreases exponentially quickly as a function of $T$.

Our proof of this theorem follows in spirit the corresponding standard proof for AdaBoost, cf. (Mohri, Rostamizadeh, and Talwalkar 2018).

*Proof.* Let $F(x) = \sum_{t=1}^{T} \beta_{t,h_t(x)} h_t(x)$. We first show, for all $i$, that $I(H(x_i) \neq y_i) \le \exp(-y_i F(x_i))$. It suffices to show that $H(x_i) \neq y_i$ yields $y_i F(x_i) \le 0$. If $\mathcal{T}(x_i)$ in Algorithm 2 is empty, then $F(x_i) = 0$ or $H(x_i) = \operatorname{sgn}(F(x_i))$; thus, if $H(x_i) \neq y_i$, we get $y_i F(x_i) \le 0$. If $\mathcal{T}(x_i) \neq \emptyset$, then $H(x_i) = y_i$ by Algorithm 2, so there is nothing to show.

Note that $D_{T+1}$ can be written in terms of $D_1$ as follows:
$$D_{T+1}(x_i) = D_1(x_i) \prod_{t=1}^{T} \frac{\exp(-y_i \beta_{t,h_t(x_i)} h_t(x_i))}{Z_t} = \frac{\exp\!\big(-y_i \sum_{t=1}^{T} \beta_{t,h_t(x_i)} h_t(x_i)\big)}{n \prod_{t=1}^{T} Z_t} = \frac{\exp(-y_i F(x_i))}{n \prod_{t=1}^{T} Z_t} .$$

Now we use $I(H(x_i) \neq y_i) \le \exp(-y_i F(x_i))$ to obtain
$$\varepsilon_H = \sum_{i=1}^{n} D_1(x_i)\, I(H(x_i) \neq y_i) \;\le\; \frac{1}{n} \sum_{i=1}^{n} \exp(-y_i F(x_i)) \;=\; \sum_{i=1}^{n} D_{T+1}(x_i) \prod_{t=1}^{T} Z_t \;=\; \prod_{t=1}^{T} Z_t .$$

Using line 4 of Algorithm 2, one can express $Z_t$ as
$$\begin{aligned}
Z_t &= \sum_{i=1}^{n} D_t(x_i) \exp\!\big(-y_i \beta_{t,h_t(x_i)} h_t(x_i)\big) \\
&= \sum_{i:\, y_i = h_t(x_i)} D_t(x_i)\, e^{-\beta_{t,y_i}} + \sum_{i:\, y_i \neq h_t(x_i)} D_t(x_i)\, e^{\beta_{t,-y_i}} \\
&= (r_{t,1} - \varepsilon_{t,-1})\, e^{-\beta_{t,1}} + (r_{t,-1} - \varepsilon_{t,1})\, e^{-\beta_{t,-1}} + \varepsilon_{t,1}\, e^{\beta_{t,1}} + \varepsilon_{t,-1}\, e^{\beta_{t,-1}} \\
&= (r_{t,1} - \varepsilon_{t,-1}) \sqrt{\tfrac{\varepsilon_{t,1}}{r_{t,1} - \varepsilon_{t,-1}}} + (r_{t,-1} - \varepsilon_{t,1}) \sqrt{\tfrac{\varepsilon_{t,-1}}{r_{t,-1} - \varepsilon_{t,1}}} + \varepsilon_{t,1} \sqrt{\tfrac{r_{t,1} - \varepsilon_{t,-1}}{\varepsilon_{t,1}}} + \varepsilon_{t,-1} \sqrt{\tfrac{r_{t,-1} - \varepsilon_{t,1}}{\varepsilon_{t,-1}}} \\
&= 2\sqrt{\varepsilon_{t,1}(r_{t,1} - \varepsilon_{t,-1})} + 2\sqrt{\varepsilon_{t,-1}(r_{t,-1} - \varepsilon_{t,1})} .
\end{aligned}$$
Thus one obtains the claimed upper bound $\prod_{t=1}^{T}\big(2\sqrt{\varepsilon_{t,1}(r_{t,1} - \varepsilon_{t,-1})} + 2\sqrt{\varepsilon_{t,-1}(r_{t,-1} - \varepsilon_{t,1})}\big)$.

To show that this bound drops exponentially quickly as a function of $T$, we use the known upper bound on the training error of AdaBoost, given by the formula
$$\prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1 - \varepsilon_t)} , \tag{8}$$
cf. (Freund and Schapire 1996; Mohri, Rostamizadeh, and Talwalkar 2018). This bound is known to drop exponentially quickly as a function of $T$ if $\varepsilon_t < 1/2$ for all $t$. We will see that $2\sqrt{\varepsilon_{t,1}(r_{t,1} - \varepsilon_{t,-1})} + 2\sqrt{\varepsilon_{t,-1}(r_{t,-1} - \varepsilon_{t,1})} \le 2\big(2\sqrt{\varepsilon_t(1 - \varepsilon_t)}\big)$, i.e., our bound is within a factor of 2 of the one known for AdaBoost and therefore also drops exponentially quickly as a function of $T$. To verify this, it suffices to observe that, for $y \in \{1, -1\}$,
$$\sqrt{\varepsilon_{t,y}(r_{t,y} - \varepsilon_{t,-y})} \le \sqrt{\varepsilon_{t,y}(r_{t,y} - \varepsilon_{t,-y}) + \varepsilon_t(r_{t,-y} - \varepsilon_{t,y})} \le \sqrt{\varepsilon_t(r_{t,y} - \varepsilon_{t,-y}) + \varepsilon_t(r_{t,-y} - \varepsilon_{t,y})} = \sqrt{\varepsilon_t(r_{t,y} + r_{t,-y} - \varepsilon_{t,-y} - \varepsilon_{t,y})} = \sqrt{\varepsilon_t(1 - \varepsilon_t)} . \qquad\square$$

Our proof uses a crude approximation to show that our upper bound on PrAdaBoost's training error is at most twice as large as that of AdaBoost. Very often, in fact, our bound on PrAdaBoost's training error (from Theorem 1) is smaller than the bound $\prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$ (Freund and Schapire 1996; Mohri, Rostamizadeh, and Talwalkar 2018) for AdaBoost, a claim that we also tested empirically (results not included here). Here we present a hypothetical situation in which a small difference between the bounds results in a large difference in the number of iterations after which the bounds can guarantee a certain target training accuracy.

Suppose the maximal error $\max_{t \le T} \varepsilon_t$ incurred by both algorithms over $T$ iterations is $0.48$ and the target error to be reached is $0.02$. Suppose further that the classes are never quite balanced throughout PrAdaBoost's iterations, e.g., $r_{t,1} \le 0.4$ and $r_{t,-1} \ge 0.6$, and that the class-specific errors (which sum up to $\varepsilon_t$) satisfy $\varepsilon_{t,1} \le 0.285$ and $\varepsilon_{t,-1} \le 0.195$ for all $t$. Under these conditions, the maximum possible value of $2\sqrt{\varepsilon_{t,1}(r_{t,1} - \varepsilon_{t,-1})} + 2\sqrt{\varepsilon_{t,-1}(r_{t,-1} - \varepsilon_{t,1})}$ is roughly $0.97910$ and occurs when $r_{t,1} = 0.4$, $r_{t,-1} = 0.6$, $\varepsilon_{t,1} = 0.285$, and $\varepsilon_{t,-1} = 0.195$. The maximum possible value of $2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$ is roughly $0.99919$, for $\varepsilon_t = 0.48$. Making a pessimistic estimate for both algorithms, by considering the maximum possible values to occur not just in one but in all iterations, PrAdaBoost's bound evaluates to roughly $0.97910^T$. This number is less than $0.02$ if and only if $T \ge 186$. By comparison, AdaBoost's bound $\prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$, which is based only on $\varepsilon_t$, evaluates to roughly $0.99919^T$ and is below $0.02$ only when $T \ge 4828$, which is almost 26 times the value obtained through PrAdaBoost's bound. Note that, for such upper-bound estimates on AdaBoost-type algorithms, it is standard practice to assume that the maximum possible error under the given constraints occurs in every iteration, cf. (Freund and Schapire 1996; Mohri, Rostamizadeh, and Talwalkar 2018).
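The figures in this hypothetical comparison are easy to verify. The short sketch below (ours, for illustration) evaluates the two per-iteration factors at the stated worst-case parameters and then determines, for the rounded factors quoted above, the smallest $T$ at which each bound falls below the 2% target:

```python
import math

# Worst-case per-iteration parameters of the hypothetical scenario above.
r1, rm1 = 0.4, 0.6            # r_{t,1}, r_{t,-1}
e1, em1 = 0.285, 0.195        # eps_{t,1}, eps_{t,-1}; total error eps_t = 0.48

pr_factor = 2 * math.sqrt(e1 * (r1 - em1)) + 2 * math.sqrt(em1 * (rm1 - e1))
ada_factor = 2 * math.sqrt(0.48 * (1 - 0.48))
print(round(pr_factor, 5), round(ada_factor, 5))   # 0.97911 0.9992

def iterations_to_reach(factor, target=0.02):
    """Smallest T with factor**T < target, i.e., T > ln(target) / ln(factor)."""
    return math.ceil(math.log(target) / math.log(factor))

# Using the rounded per-iteration factors quoted in the text:
print(iterations_to_reach(0.97910))                # 186
print(iterations_to_reach(0.99919))                # 4828
```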
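As a small illustration of Lemma 2, the sketch below computes class-specific weights for one base classifier on a toy weighted sample, using the $\beta_{t,y}$ expression as reconstructed above; the positive prefactor rescales all weights uniformly and therefore does not affect the final argmax vote. The toy data and all names are purely illustrative:

```python
import numpy as np

def prsamme_betas(D, pred, y, K):
    """Class-specific weights beta_{t,y} for one base classifier h_t, following the
    expression in Lemma 2 as reconstructed above (illustrative only)."""
    betas = {}
    for c in range(1, K + 1):
        eps_c = D[(pred == c) & (pred != y)].sum()      # error of h_t when it predicts class c
        eps_dual_c = D[(y == c) & (pred != y)].sum()    # dual class error of class c (Definition 2)
        r_c = D[y == c].sum()                           # cumulative weight of class c
        betas[c] = ((K - 1) ** 2 / K) * (np.log((r_c - eps_dual_c) / eps_c) + np.log(K - 1))
    return betas

# Toy weighted sample with K = 3 classes and a uniform distribution.
y    = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3])
pred = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 1])
D    = np.full(len(y), 1 / len(y))
print(prsamme_betas(D, pred, y, K=3))
# beta_1 > beta_2 > beta_3: the class on which h_t is most precise gets the largest weight.
```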
## Empirical Analysis

We evaluated (Pr)AdaBoost on 23 binary UCI datasets (Lichman 2013) and (Pr)SAMME on 18 multi-class UCI datasets, using decision stumps trained in Matlab as base classifiers. We performed 10-fold cross-validation on each dataset (except isolet, which has designated training and test portions), comparing two algorithms always on the same folds. All errors were calculated w.r.t. the uniform distribution over the input data. The datasets were preprocessed to remove all data points with missing attribute values.

### Binary Classification

We ran PrAdaBoost and AdaBoost for $T$ iterations, trying $T = 30$, $50$, and $100$ (without attempting to tune $T$). In each case, we report (i) test error averages from 10-fold cross-validation over iterations 1 through $T$, and (ii) test error at iteration $T$. We also tried excluding the first 9 iterations from the average in (i), so as not to penalize a method for very early erroneous predictions, but since this did not change the trends, the corresponding results are not reported here.

The results for $T = 50$ are given in Table 1, where statistically significant wins (using 2-tailed paired t-tests at the 95% confidence level) are shown in bold. Datasets are listed in decreasing order of size; those from the top down to german are the ones with 1,000 or more instances. Note that not many of the results for iteration 50 alone are statistically significant, as they are based on only 10 result pairs (one per fold), whereas for iterations 1–50 we compare 50 · 10 result pairs.

PrAdaBoost tends to outperform AdaBoost (i) on larger datasets but also (ii) on datasets with class imbalance. Concerning (i), note that the datasets containing more than 1,000 instances are the top 14 in Table 1; AdaBoost wins on only 4 of these 14 in terms of test error, while winning on 8 of the 9 small sets. This suggests that the training data in larger datasets become representative enough of the test data to give PrAdaBoost a crucial advantage over AdaBoost. As for (ii), note that PrAdaBoost takes class ratios into account explicitly via the $r_{1,y}$ values in iteration $t = 1$. When redistributing weights, class imbalance may decline, so it is hard to predict its effect for larger $t$. In our study, the datasets in which the size ratio between the larger and the smaller class is less than 1.5 are the following: krvskp, banknote, messidor, crx, cleveland, house, sonarall, and promoters. PrAdaBoost wins on only 2 out of these 8 datasets in terms of test error. By contrast, it wins in terms of test error on 9 out of the 15 datasets with class ratio greater than 1.5. This suggests that PrAdaBoost provides much more of an advantage over AdaBoost for imbalanced datasets than for balanced ones.

Results for $T = 30$ showed similar trends, so we do not display them here. For $T = 100$, we cannot make a meaningful comparison. While the errors incurred by AdaBoost keep getting slightly smaller, PrAdaBoost's errors appear to increase. The latter, though, seems to be due to numerical imprecision. PrAdaBoost adjusts the weights of individual training examples much more aggressively than AdaBoost, so that weights can become extremely small very quickly. In our experiments, this caused weights to be treated as zero when in fact they weren't, so that our test runs could not numerically perform the operations prescribed by PrAdaBoost. Efforts to resolve this problem, e.g., using smoothing or exact rational arithmetic, are left for future studies.

### Comparison to InfoBoost

Aslam (2000) proposed InfoBoost as a first AdaBoost-type method that takes class-specific precision into account in its weighting scheme. The parameters used in InfoBoost differ from those used in PrAdaBoost, and appear substantially inferior; see Table 2. In terms of test error, PrAdaBoost significantly outperforms InfoBoost on 21 of the 23 tested datasets, ties on one dataset, and loses to InfoBoost on only one dataset (promoters). Moreover, PrAdaBoost mostly wins by a large margin, and it does not seem as if InfoBoost can compete with the original AdaBoost either.
To show that this is not an issue of InfoBoost aggressively overfitting, we display training errors as well. Given InfoBoost's poor performance compared to PrAdaBoost over 50 iterations, we did not evaluate InfoBoost further.

| Name | AdaB (1–50) | PrAdaB (1–50) | AdaB (50 alone) | PrAdaB (50 alone) |
|---|---|---|---|---|
| bankfull | 10.50 | 10.33 | 10.21 | 10.03 |
| adult | 15.98 | 15.11 | 14.69 | 14.22 |
| creditcard | 18.06 | 18.09 | 18.06 | 18.10 |
| magic04 | 19.14 | 17.98 | 17.22 | 16.04 |
| htru2 | 2.27 | 2.20 | 2.27 | 2.16 |
| agaricus | 0.17 | 0.07 | 0 | 0 |
| spambase | 8.98 | 7.89 | 7.10 | 6.56 |
| krvskp | 7.73 | 7.15 | 5.60 | 5.44 |
| seismic | 6.58 | 6.79 | 6.57 | 6.59 |
| titanic | 22.22 | 22.29 | 22.17 | 22.17 |
| banknote | 2.88 | 3.52 | 0.43 | 2.26 |
| messidor | 35.03 | 33.45 | 32.23 | 32.40 |
| biodeg | 18.68 | 16.48 | 17.44 | 14.98 |
| german | 26.19 | 25.62 | 24.30 | 25.50 |
| breastc | 4.73 | 5.60 | 4.24 | 6.59 |
| crx | 13.45 | 14.63 | 13.16 | 15.62 |
| diabetes | 21.65 | 24.22 | 21.89 | 25.96 |
| ionosphere | 9.91 | 8.79 | 8.57 | 7.12 |
| cleveland | 17.30 | 17.86 | 17.13 | 17.81 |
| house | 3.38 | 5.68 | 3.46 | 8.58 |
| sonarall | 19.87 | 20.91 | 15.38 | 22.11 |
| promoters | 10.36 | 15.31 | 8.72 | 17.09 |
| hepatitis | 15.13 | 16.78 | 13.75 | 17.50 |

Table 1: Test errors (%) of AdaBoost and PrAdaBoost, averaged over iterations 1–50 and at iteration 50 alone.

| Name | InfoB (train) | PrAdaB (train) | InfoB (test) | PrAdaB (test) |
|---|---|---|---|---|
| bankfull | 18.34 | 10.26 | 18.38 | 10.33 |
| adult | 33.97 | 15.01 | 33.99 | 15.11 |
| creditcard | 34.85 | 18.00 | 34.87 | 18.09 |
| magic04 | 45.51 | 17.43 | 45.77 | 17.98 |
| htru2 | 8.04 | 2.15 | 8.10 | 2.20 |
| agaricus | 29.91 | 0.07 | 29.91 | 0.07 |
| spambase | 13.83 | 6.97 | 14.55 | 7.89 |
| krvskp | 17.87 | 6.93 | 18.02 | 7.15 |
| seismic | 13.27 | 6.59 | 13.86 | 6.79 |
| titanic | 32.75 | 22.27 | 32.90 | 22.29 |
| banknote | 2.93 | 2.81 | 3.62 | 3.52 |
| messidor | 42.62 | 26.92 | 45.52 | 33.45 |
| biodeg | 22.97 | 12.89 | 26.68 | 16.48 |
| german | 47.61 | 21.42 | 49.41 | 25.62 |
| breast | 4.27 | 2.78 | 6.85 | 5.60 |
| crx | 23.35 | 10.34 | 27.08 | 14.63 |
| diabetes | 24.23 | 14.85 | 29.85 | 24.22 |
| ionosphere | 5.64 | 4.00 | 12.30 | 8.79 |
| cleveland | 18.38 | 12.02 | 25.89 | 17.86 |
| house | 5.24 | 4.27 | 7.01 | 5.68 |
| sonarall | 7.08 | 5.19 | 25.19 | 20.91 |
| promoters | 2.19 | 1.52 | 14.23 | 15.31 |
| hepatitis | 9.34 | 1.01 | 25.30 | 16.78 |

Table 2: Training/test errors (%) of InfoBoost and PrAdaBoost, averaged over iterations 1 through 50.

### Multi-Class Classification

Table 3 compares PrSAMME and SAMME in a 10-fold cross-validation and for 100 iterations, with bold entries again referring to statistically significant wins at the 95% confidence level (paired t-tests). PrSAMME clearly outperforms SAMME (15 wins, 2 losses, 1 tie, when averaging over 100 iterations). Note that, in the few cases in which SAMME's errors are significantly less than PrSAMME's, their differences are still very small, whereas PrSAMME often beats SAMME by a large margin.

PrSAMME tends to outperform SAMME more strongly when the number of classes increases. 13 of the 18 tested datasets have more than 3 classes, and PrSAMME significantly outperforms SAMME on all of them. The datasets on which PrSAMME does not substantially beat SAMME (splice, wine, and iris) have only 3 classes each.

| Name | SAMME (1–100) | PrSAMME (1–100) | SAMME (100 alone) | PrSAMME (100 alone) |
|---|---|---|---|---|
| connect4 | 29.58 | 28.14 | 27.75 | 25.75 |
| sensorless | 59.64 | 45.64 | 48.26 | 43.65 |
| shuttle | 5.54 | 0.89 | 2.98 | 0.09 |
| letter | 73.06 | 66.30 | 58.18 | 56.39 |
| nursery | 19.96 | 14.96 | 23.40 | 14.27 |
| pendigit | 44.63 | 35.52 | 29.34 | 25.61 |
| isolet | 79.91 | 65.73 | 68.44 | 52.34 |
| satimage | 25.25 | 23.78 | 20.44 | 22.44 |
| splice | 6.80 | 7.18 | 5.33 | 6.30 |
| segment | 24.97 | 19.80 | 17.01 | 16.75 |
| yeast | 68.53 | 68.39 | 68.53 | 68.39 |
| vowel | 68.84 | 62.95 | 63.74 | 57.17 |
| vehicle | 41.30 | 40.31 | 39.37 | 37.12 |
| glass | 47.62 | 45.35 | 48.98 | 43.10 |
| seed | 8.66 | 6.94 | 8.57 | 7.14 |
| wine | 4.85 | 4.72 | 3.40 | 3.46 |
| iris | 5.95 | 7.13 | 4.67 | 8.00 |
| breastt | 40.24 | 37.26 | 31.00 | 38.55 |

Table 3: Test errors (%) of SAMME and PrSAMME, averaged over iterations 1–100 and at iteration 100 alone; datasets are in decreasing order of size.

### Stronger Base Classifiers

To see whether the trends in our results are specific to the use of decision stumps as base classifiers, we also tested both approaches using stronger base classifiers. Allowing trees with up to 5 splits as base classifiers, we tested the four largest two-class datasets and the four largest multi-class datasets: bankfull, adult, creditcard, magic04, connect4, sensorless, shuttle, and letter. For shuttle, SAMME and PrSAMME were on par; for the other seven datasets the comparison results were similar to those we reported for decision stumps, just with smaller error values throughout. In particular, the precision-based variant won significantly on six datasets and lost slightly only on creditcard.
This suggests that, at least for larger datasets, precision-based methods are preferable even when using stronger base classifiers. Since most of the literature uses AdaBoost with decision stumps, we did not extend our study on stronger base classifiers beyond the reported eight datasets.

## Conclusions

We demonstrated that the AdaBoost framework can substantially benefit from using class-specific precision in its weighting scheme. In particular, we formally derived a weighting scheme that is provably optimal under additive forward-stagewise modeling in a setting that generalizes the one on which AdaBoost is built. The resulting algorithm, PrAdaBoost, is guaranteed to convert a weak learner into a strong learner. It provably decreases its training error exponentially quickly, and the obtained error bound can often substantially improve on the known ones for AdaBoost. Similarly, virtually all known ensemble classification methods built on AdaBoost can be adapted to take class-based precision into account, as illustrated for SAMME. One interesting direction for future work would be to establish generalization error bounds for precision-based boosting.

Experiments on 23 two-class and 18 multi-class benchmark datasets showed PrAdaBoost and PrSAMME to be superior to AdaBoost and SAMME, respectively, in many cases. Typically, larger datasets as well as datasets with more classes support our conclusion more strongly. Class imbalance also tends to be more easily handled by the precision-based approach. These trends seem plausible: given multiple classes, the chances are higher that the class-specific precision of any base classifier varies among the classes, thus increasing the positive effect of the precision-based approach. Potential negative effects of class imbalance are possibly mitigated through the $r_{t,y}$ terms.

The fact that PrAdaBoost does not beat AdaBoost on the smaller datasets tested presumably has its reasons in the aggressiveness of PrAdaBoost's weight update technique. The latter may have adverse effects when the training data are not representative of the test data, which is more likely for small datasets. Regularization might help to mitigate PrAdaBoost's problems with overgeneralization on small datasets, and possibly also the numerical issues we encountered when running it for many iterations.

We do not claim that PrAdaBoost/PrSAMME are not outperformed by any existing boosting method, but that they tend to outperform their non-precision-based counterparts AdaBoost/SAMME on larger or multi-class datasets. In the same vein, we propose to consider replacing other existing AdaBoost-type methods by their corresponding precision-based variants when dealing with larger datasets.
In sum, our experiments in both the binary and the multi-class setting suggest that, when running an AdaBoost-type method on datasets with more than 1,000 instances or with more than 3 classes, the preferred choice should be the precision-based approach that we propose, as it is expected to improve predictive performance.

## Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), through the Discovery Grants and Canada Research Chairs programs.

## References

Aslam, J. A. 2000. Improving Algorithms for Boosting. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT), 200–207. Morgan Kaufmann.

Breiman, L. 1996. Bagging Predictors. Machine Learning 24(2): 123–140. doi:10.1007/BF00058655.

Breiman, L. 1999. Prediction Games and Arcing Algorithms. Neural Computation 11(7): 1493–1517.

Breiman, L. 2001. Random Forests. Machine Learning 45(1): 5–32. doi:10.1023/A:1010933404324.

Fan, W.; Stolfo, S. J.; Zhang, J.; and Chan, P. K. 1999. AdaCost: Misclassification Cost-Sensitive Boosting. In ICML, 97–105.

Freund, Y.; and Schapire, R. E. 1996. Experiments with a New Boosting Algorithm. In ICML, 148–156.

Freund, Y.; and Schapire, R. E. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1): 119–139.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive Logistic Regression: A Statistical View of Boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28: 337–407.

Hastie, T.; Rosset, S.; Zhu, J.; and Zou, H. 2009. Multi-class AdaBoost. Statistics and Its Interface 2: 349–360.

Lichman, M. 2013. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.

Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2018. Foundations of Machine Learning, 2nd ed. MIT Press.

Nikolaou, N.; and Brown, G. 2015. Calibrating AdaBoost for Asymmetric Learning. In Proceedings of the 12th International Workshop on Multiple Classifier Systems (MCS), 112–124.

Rokach, L. 2010. Ensemble-Based Classifiers. Artificial Intelligence Review 33(1-2): 1–39.

Schapire, R.; and Freund, Y. 2012. Boosting: Foundations and Algorithms. MIT Press.

Schapire, R. E.; and Singer, Y. 1999. Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning 37(3): 297–336.

Wu, S.; and Nagahashi, H. 2015. Analysis of Generalization Ability for Different AdaBoost Variants Based on Classification and Regression Trees. Journal of Electrical and Computer Engineering 2015: 835357:1–835357:17.