# Boosting with Abstention

Corinna Cortes, Google Research, New York, NY 10011, corinna@google.com
Giulia De Salvo, Courant Institute, New York, NY 10012, desalvo@cims.nyu.edu
Mehryar Mohri, Courant Institute and Google, New York, NY 10012, mohri@cims.nyu.edu

**Abstract.** We present a new boosting algorithm for the key scenario of binary classification with abstention, where the algorithm can abstain from predicting the label of a point at the price of a fixed cost. At each round, our algorithm selects a pair of functions: a base predictor and a base abstention function. We define convex upper bounds for the natural loss function associated with this problem, which we prove to be calibrated with respect to the Bayes solution. Our algorithm benefits from general margin-based learning guarantees, which we derive for ensembles of pairs of base predictor and abstention functions in terms of the Rademacher complexities of the corresponding function classes. We give convergence guarantees for our algorithm along with a linear-time weak-learning algorithm for abstention stumps. We also report the results of several experiments suggesting that our algorithm provides a significant improvement in practice over two confidence-based algorithms.

## 1 Introduction

Classification with abstention is a key learning scenario where the algorithm can abstain from making a prediction, at the price of incurring a fixed cost. This is the natural scenario in a variety of common and important applications. An example is spoken-dialog applications, where the system can redirect a call to an operator to avoid the cost of incorrectly assigning a category to a spoken utterance and misguiding the dialog manager. This requires the availability of an operator, which incurs a fixed and predefined price. Other examples arise in the design of a search engine or an information extraction system, where, rather than taking the risk of displaying an irrelevant document, the system can resort to the help of a more sophisticated, but more time-consuming, classifier. More generally, this learning scenario arises in a wide range of applications including health, bioinformatics, astronomical event detection, active learning, and many others, where abstention is an acceptable option with some cost. Classification with abstention is thus a highly relevant problem.

The standard approach for tackling this problem is via confidence-based abstention: a real-valued function $h$ is learned for the classification problem and the points $x$ for which its magnitude $|h(x)|$ is smaller than some threshold $\gamma$ are rejected. Bartlett and Wegkamp [1] gave a theoretical analysis of this approach based on consistency. They introduced a discontinuous loss function taking into account the cost for rejection, upper-bounded that loss by a convex and continuous Double Hinge Loss (DHL) surrogate, and derived an algorithm based on that convex surrogate loss. Their work inspired a series of follow-up papers that developed both the theory and practice behind confidence-based abstention [32, 15, 31]. Further related work can be found in Appendix A.

In this paper, we present a solution to the problem of classification with abstention that radically departs from the confidence-based approach. We introduce a general model where a pair $(h, r)$, consisting of a classifier $h$ and a rejection function $r$, is learned simultaneously.
Under this novel framework, we present a boosting-style algorithm with abstention, BA, that accurately learns both the classifier and the abstention function. Note that the terminology of boosting with abstention was used by Schapire and Singer [26] to refer to a scenario where a base classifier is allowed to abstain, but where the boosting algorithm itself has to commit to a prediction. This is therefore distinct from the scenario of classification with abstention studied here. Nevertheless, we will introduce and examine a confidence-based Two-Step Boosting algorithm, the TSB algorithm, which consists of first training AdaBoost and then searching for the best confidence-based abstention threshold.

Figure 1: The best predictor $h$ is defined by a threshold $\theta$: $h(x) = x + \theta$. For $c < \frac{1}{2}$, the region defined by $X \ge \eta$ should be rejected, but the corresponding abstention function $r$, defined by $r(x) = \eta - x$, cannot be expressed as $|h(x)| \le \gamma$ for any $\gamma > 0$.

The paper is organized as follows. Section 2 describes our general abstention model, which consists of learning a pair $(h, r)$ simultaneously, and compares it with confidence-based models. Section 3 presents a series of theoretical results for the problem of learning convex ensembles for classification with abstention, including the introduction of calibrated convex surrogate losses and general data-dependent learning guarantees. In Section 4, we use these learning bounds to design a regularized boosting algorithm. We further prove the convergence of the algorithm and present a linear-time weak-learning algorithm for a natural family of abstention stumps. Finally, in Section 5, we report several experimental results comparing the BA algorithm with the DHL and the TSB algorithms.

## 2 Preliminaries

In this section, we first introduce a general model for learning with abstention [7] and then compare it with confidence-based models.

### 2.1 General abstention model

We assume, as in standard supervised learning, that the training and test points are drawn i.i.d. according to some fixed but unknown distribution $D$ over $X \times \{-1, +1\}$. We consider the learning scenario of binary classification with abstention. Given an instance $x \in X$, the learner has the option of abstaining from making a prediction for $x$ at the price of incurring a non-negative loss $c(x)$, or otherwise making a prediction $h(x)$ using a predictor $h$ and incurring the standard zero-one loss $\mathbb{1}_{y h(x) \le 0}$, where $y$ is the true label. Since a random guess achieves an expected cost of at most $\frac{1}{2}$, rejection only makes sense for $c(x) < \frac{1}{2}$.

We will model the learner by a pair $(h, r)$, where the function $r\colon X \to \mathbb{R}$ determines the points $x \in X$ to be rejected according to $r(x) \le 0$, and where the hypothesis $h\colon X \to \mathbb{R}$ predicts labels for non-rejected points via its sign. Extending the loss function considered by Bartlett and Wegkamp [1], the abstention loss for a pair $(h, r)$ is defined as follows for any $(x, y) \in X \times \{-1, +1\}$:

$$L(h, r, x, y) = \mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c(x)\,\mathbb{1}_{r(x) \le 0}. \tag{1}$$

The abstention cost $c(x)$ is assumed to be known to the learner. In the following, we assume that $c$ is a constant function, but part of our analysis is applicable to the more general case. We denote by $H$ and $R$ two families of functions mapping $X$ to $\mathbb{R}$, and we assume that the labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is drawn i.i.d. from $D^m$.
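To make the abstention loss of Equation (1) concrete, the following is a minimal NumPy sketch (ours, not the authors' code; the function name and toy inputs are illustrative) that evaluates the average abstention loss of a pair $(h, r)$ over a labeled sample, i.e., the empirical loss defined formally below, using the convention that a point is rejected when $r(x) \le 0$.

```python
import numpy as np

def abstention_loss(h_scores, r_scores, y, c):
    """Average abstention loss of a pair (h, r) on a labeled sample.

    h_scores: values h(x_i) of the predictor
    r_scores: values r(x_i) of the abstention function; r(x_i) <= 0 means reject
    y:        labels in {-1, +1}
    c:        constant abstention cost, with 0 < c < 1/2
    """
    h_scores, r_scores, y = map(np.asarray, (h_scores, r_scores, y))
    rejected = r_scores <= 0
    errors = (y * h_scores <= 0) & ~rejected   # zero-one loss on accepted points only
    return float(np.mean(errors + c * rejected))

# Toy example: one correct accepted point, one misclassified accepted point,
# one rejected point; the loss is (0 + 1 + c) / 3.
print(abstention_loss([0.4, -0.2, 0.1], [0.5, 0.3, -0.1], [+1, +1, -1], c=0.3))
```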
The learning problem consists of determining a pair $(h, r) \in H \times R$ that admits a small expected abstention loss $R(h, r)$, defined as follows:

$$R(h, r) = \mathbb{E}_{(x,y) \sim D}\big[\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c\,\mathbb{1}_{r(x) \le 0}\big].$$

Similarly, we define the empirical loss of a pair $(h, r) \in H \times R$ over the sample $S$ by

$$\widehat{R}_S(h, r) = \mathbb{E}_{(x,y) \sim S}\big[\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c\,\mathbb{1}_{r(x) \le 0}\big],$$

where $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by $S$.

### 2.2 Confidence-based abstention model

Confidence-based models are a special case of the general model for learning with rejection presented in Section 2.1, corresponding to the pair $(h(x), r(x)) = (h(x), |h(x)| - \gamma)$, where $\gamma$ is a parameter that changes the threshold of rejection. This specific choice was based on consistency results shown in [1]. In particular, the Bayes solution $(h^*, r^*)$ of the learning problem, that is, when the distribution $D$ is known, is given by $h^*(x) = \eta(x) - \frac{1}{2}$ and $r^*(x) = |h^*(x)| - (\frac{1}{2} - c)$, where $\eta(x) = \mathbb{P}[Y = +1 \mid x]$ for any $x \in X$; note that this is not a unique solution. The form of $h^*(x)$ follows by a reasoning similar to that for the standard binary classification problem. It is straightforward to see that the optimal rejection function $r^*$ is non-positive, meaning a point is rejected, if and only if $\min\{\eta(x), 1 - \eta(x)\} \ge c$. Equivalently, $\min\{\eta(x), 1 - \eta(x)\} \ge c$ holds if and only if $|\eta(x) - \frac{1}{2}| \le \frac{1}{2} - c$, and, using the definition of $h^*$, we recover the optimal $r^*$.

In light of the Bayes solution, the specific choice of the abstention function $r$ is natural; however, requiring the abstention function to be defined as $r(x) = |h(x)| - \gamma$ for some $h \in H$ is in general too restrictive when predictors are selected out of a limited subset $H$ of all measurable functions over $X$. Consider the example shown in Figure 1, where $H$ is a family of linear functions. For this simple case, the optimal abstention region cannot be attained as a function of the best predictor $h$, while it can be achieved by allowing the learner to select a pair $(h, r)$. Thus, the general model for learning with abstention analyzed in Section 2.1 is both more flexible and more general.

## 3 Theoretical analysis

This section presents a theoretical analysis of the problem of learning convex ensembles for classification with abstention. We first introduce general convex surrogate functions for the abstention loss and prove a necessary and sufficient condition, based on their parameters, for them to be calibrated. Next, we define the ensemble family we consider and prove general data-dependent learning guarantees for it based on the Rademacher complexities of the base predictor and base rejector sets.

### 3.1 Convex surrogates

We introduce two types of convex surrogate functions for the abstention loss. Observe that the abstention loss $L(h, r, x, y)$ can be equivalently expressed as $L(h, r, x, y) = \max\big(\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{-r(x) < 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)$. In view of that, since $\max(f, g) = \frac{f + g + |g - f|}{2}$ for any $f, g \in \mathbb{R}$, the following inequalities hold for $a > 0$ and $b > 0$:

$$
L(h, r, x, y) = \max\big(\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{-r(x) < 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{\max(y h(x),\, -r(x)) \le 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{\frac{y h(x) - r(x)}{2} \le 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{a [y h(x) - r(x)] \le 0},\ c\,\mathbb{1}_{b\, r(x) \le 0}\big)
\le \max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big),
$$

where $u \mapsto \Phi_1(-u)$ and $u \mapsto \Phi_2(-u)$ are two non-increasing convex functions upper-bounding $u \mapsto \mathbb{1}_{u \le 0}$ over $\mathbb{R}$. Let $L_{\mathrm{MB}}$ be the convex surrogate defined by the last inequality above:

$$L_{\mathrm{MB}}(h, r, x, y) = \max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big).$$

Since $L_{\mathrm{MB}}$ is not differentiable everywhere, we further upper-bound it as follows:

$$\max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big) \le \Phi_1\big(a [r(x) - y h(x)]\big) + c\,\Phi_2\big(-b\, r(x)\big).$$

We let $L_{\mathrm{SB}}$ denote this convex surrogate:

$$L_{\mathrm{SB}}(h, r, x, y) = \Phi_1\big(a [r(x) - y h(x)]\big) + c\,\Phi_2\big(-b\, r(x)\big). \tag{4}$$
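As a small illustration of these definitions, here is a NumPy sketch (ours, not the paper's code) of the two surrogates for the exponential choice $\Phi_1(u) = \Phi_2(u) = \exp(u)$ used later in the paper; the default value of $b$ assumes the calibrated choice discussed just below, and all names are illustrative.

```python
import numpy as np

def exp_surrogates(h_scores, r_scores, y, c, a=1.0, b=None):
    """Pointwise max-based (L_MB) and sum-based (L_SB) surrogates for Phi1 = Phi2 = exp.

    If b is not given, it is set to 2*sqrt((1 - c)/c), the calibrated choice
    (with a = 1) discussed below; both surrogates upper-bound the abstention loss.
    """
    if b is None:
        b = 2.0 * np.sqrt((1.0 - c) / c)
    h_scores, r_scores, y = map(np.asarray, (h_scores, r_scores, y))
    phi1 = np.exp(a * (r_scores - y * h_scores))  # upper-bounds 1_{yh(x) - r(x) <= 0}
    phi2 = np.exp(-b * r_scores)                  # upper-bounds 1_{r(x) <= 0}
    return np.maximum(phi1, c * phi2), phi1 + c * phi2  # (L_MB, L_SB) per point
```

Averaging either output over a sample gives an empirical convex upper bound on the abstention loss computed in the previous sketch.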
Figure 2 shows the plots of the convex surrogates $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$, as well as that of the abstention loss.

Let $(h^*_L, r^*_L)$ denote the pair that attains the minimum of the expected loss $\mathbb{E}_{(x,y)}[L_{\mathrm{SB}}(h, r, x, y)]$ over all measurable functions, for $\Phi_1(u) = \Phi_2(u) = \exp(u)$. In Appendix F, we show that, with $\eta(x) = \mathbb{P}[Y = +1 \mid X = x]$, the pair

$$h^*_L(x) = \frac{1}{2a}\,\log\frac{\eta(x)}{1 - \eta(x)}, \qquad r^*_L(x) = \frac{1}{a + b}\,\log\frac{c\,b}{2a\,\sqrt{\eta(x)\,(1 - \eta(x))}}$$

makes $L_{\mathrm{SB}}$ a calibrated loss, meaning that the signs of the pair $(h^*_L, r^*_L)$ minimizing the expected surrogate loss match those of the Bayes solution $(h^*, r^*)$. More precisely, the following holds.

**Theorem 1 (Calibration of convex surrogate).** For $a > 0$ and $b > 0$, the infimum $\inf_{(h,r)} \mathbb{E}_{(x,y)}[L_{\mathrm{SB}}(h, r, x, y)]$ is attained at a pair $(h^*_L, r^*_L)$ such that $\operatorname{sign}(h^*) = \operatorname{sign}(h^*_L)$ and $\operatorname{sign}(r^*) = \operatorname{sign}(r^*_L)$ if and only if $b/a = 2\sqrt{\tfrac{1-c}{c}}$.

Figure 2: The left figure is a plot of the abstention loss. The middle figure is a plot of the surrogate function $L_{\mathrm{MB}}$, while the right figure is a plot of the surrogate loss $L_{\mathrm{SB}}$, both for $c = 0.45$.

The theorem shows that the classification and rejection solution obtained by minimizing the surrogate loss for that choice of $(a, b)$ coincides with the one obtained using the original loss. In the following, we make the explicit choice of $a = 1$ and $b = 2\sqrt{\tfrac{1-c}{c}}$ for the loss $L_{\mathrm{SB}}$ to be calibrated.

### 3.2 Learning guarantees for ensembles in classification with abstention

In the standard scenario of classification, it is often easy to come up with simple base classifiers that may abstain. As an example, a simple rule could classify a message as spam based on the presence of some word, as ham in the presence of some other word, and abstain in the absence of both, as in the boosting with abstention algorithm of Schapire and Singer [26]. Our objective is to learn ensembles of such base hypotheses to create accurate solutions for classification with abstention. Our ensemble functions are based on the framework described in Section 2.1. Let $H$ and $R$ be two families of functions mapping $X$ to $[-1, 1]$. The ensemble family $\mathcal{F}$ that we consider is then the convex hull of $H \times R$:

$$\mathcal{F} = \Big\{ \textstyle\sum_{t=1}^T \alpha_t\,(h_t, r_t) \,:\, T \ge 1,\ \alpha_t \ge 0,\ \sum_{t=1}^T \alpha_t = 1,\ h_t \in H,\ r_t \in R \Big\}.$$

Thus, $(h, r) \in \mathcal{F}$ abstains on input $x \in X$ when $r(x) \le 0$ and predicts the label $\operatorname{sign}(h(x))$ otherwise.

Let $u \mapsto \Phi_1(-u)$ and $u \mapsto \Phi_2(-u)$ be two strictly decreasing differentiable convex functions upper-bounding $u \mapsto \mathbb{1}_{u \le 0}$ over $\mathbb{R}$. For calibration constants $a^*, b^* > 0$ and cost $c > 0$, we assume that there exist $u$ and $v$ such that $\Phi_1(a^* u) < 1$ and $c\,\Phi_2(v) < 1$; otherwise the surrogate would not be useful. Let $\Phi_1^{-1}$ and $\Phi_2^{-1}$ be the inverse functions, which always exist since $\Phi_1$ and $\Phi_2$ are strictly monotone. We will use the following definitions: $C_{\Phi_1} = 2a^*\,\Phi_1'\big(\Phi_1^{-1}(1)\big)$ and $C_{\Phi_2} = 2c\,b^*\,\Phi_2'\big(\Phi_2^{-1}(1/c)\big)$. Observe that for $\Phi_1(u) = \Phi_2(u) = \exp(u)$, we simply have $C_{\Phi_1} = 2a^*$ and $C_{\Phi_2} = 2b^*$.

**Theorem 2.** Let $H$ and $R$ be two families of functions mapping $X$ to $\mathbb{R}$. Assume $N > 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in \mathcal{F}$:

$$R(h, r) \le \mathbb{E}_{(x,y) \sim S}\big[L_{\mathrm{MB}}(h, r, x, y)\big] + C_{\Phi_1}\,\mathfrak{R}_m(H) + (C_{\Phi_1} + C_{\Phi_2})\,\mathfrak{R}_m(R) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

The proof is given in Appendix C. The theorem gives effective learning guarantees for ensemble pairs $(h, r) \in \mathcal{F}$ when the base predictor and abstention functions admit favorable Rademacher complexities. In earlier work [7], we present a learning bound for a different type of surrogate loss, which can also be extended to hold for ensembles. Next, we derive margin-based guarantees in the case where $\Phi_1(u) = \Phi_2(u) = \exp(u)$.
For any $\rho > 0$, the margin losses associated with $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$ are denoted by $L^\rho_{\mathrm{MB}}$ and $L^\rho_{\mathrm{SB}}$ and defined, for all $(h, r) \in \mathcal{F}$ and $(x, y) \in X \times \{-1, +1\}$, by $L^\rho_{\mathrm{MB}}(h, r, x, y) = L_{\mathrm{MB}}(h/\rho, r/\rho, x, y)$ and $L^\rho_{\mathrm{SB}}(h, r, x, y) = L_{\mathrm{SB}}(h/\rho, r/\rho, x, y)$. Theorem 2 applied to this margin-based loss results in the following corollary.

**Corollary 3.** Assume $N > 1$ and fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of an i.i.d. sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in \mathcal{F}$:

$$R(h, r) \le \mathbb{E}_{(x,y) \sim S}\big[L^\rho_{\mathrm{MB}}(h, r, x, y)\big] + \frac{2a^*}{\rho}\,\mathfrak{R}_m(H) + \frac{2(a^* + b^*)}{\rho}\,\mathfrak{R}_m(R) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Figure 3: Pseudocode of the BA algorithm, for both the exponential loss with $\Phi_1(u) = \Phi_2(u) = \exp(u)$ and the logistic loss with $\Phi_1(u) = \Phi_2(u) = \log_2(1 + e^u)$. The parameters include the cost of rejection $c$ and the parameter $\beta$ determining the strength of the constraint used for the $L_1$ regularization. The definitions of the weighted error $\epsilon_{t,k}$ and of the expected rejections $\bar r_{k,1}$ and $\bar r_{k,2}$ are given in Equation (7). For other surrogate losses, the step size $\eta_t$ is found via a line search or other numerical methods by solving $\operatorname{argmin}_\eta F(\alpha_{t-1} + \eta\, e_k)$.

The bound of Corollary 3 applies similarly to $L^\rho_{\mathrm{SB}}$ since it is an upper bound on $L^\rho_{\mathrm{MB}}$. It can further be shown to hold uniformly for all $\rho \in (0, 1)$, at the price of an additional term in $O\big(\sqrt{\tfrac{\log\log_2(2/\rho)}{m}}\big)$, using standard techniques [16, 22] (see Appendix C).

## 4 Boosting algorithm

Here, we derive a boosting-style algorithm (the BA algorithm) for learning an ensemble with the option of abstention, for both losses $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$. Below, we describe the algorithm for $L_{\mathrm{SB}}$ and refer the reader to Appendix H for the version using the loss $L_{\mathrm{MB}}$.

### 4.1 Objective function

The BA algorithm solves a convex optimization problem that is based on Corollary 3 for the loss $L_{\mathrm{SB}}$. Since the last three terms of the right-hand side of the bound of the corollary do not depend on $\alpha$, this suggests selecting $\alpha$ as the solution of $\min_{\alpha \ge 0,\ \sum_t \alpha_t \le 1} \frac{1}{m}\sum_{i=1}^m L^\rho_{\mathrm{SB}}(h, r, x_i, y_i)$. Via a change of variable $\alpha \to \alpha/\rho$ that does not affect the optimization problem, we can equivalently search for $\min_{\alpha \ge 0} \frac{1}{m}\sum_{i=1}^m L_{\mathrm{SB}}(h, r, x_i, y_i)$ such that $\sum_{t=1}^T \alpha_t \le 1/\rho$. Introducing the Lagrange variable $\beta$ associated with the constraint $\sum_{t=1}^T \alpha_t \le 1/\rho$, the problem can be rewritten as $\min_{\alpha \ge 0} \frac{1}{m}\sum_{i=1}^m L_{\mathrm{SB}}(h, r, x_i, y_i) + \beta \sum_{t=1}^T \alpha_t$. Letting $\{(h_1, r_1), \ldots, (h_N, r_N)\}$ be the set of base function pairs for the classifier and rejection function, with $h = \sum_{j=1}^N \alpha_j h_j$ and $r = \sum_{j=1}^N \alpha_j r_j$, and using the calibrated choice $a = 1$, $b = 2\sqrt{(1-c)/c}$ with $\Phi$ denoting $\Phi_1 = \Phi_2$, the following is the objective function of our optimization problem:

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Big[\Phi\big(r(x_i) - y_i h(x_i)\big) + c\,\Phi\big(-b\,r(x_i)\big)\Big] + \beta \sum_{j=1}^N \alpha_j. \tag{6}$$

### 4.2 Projected coordinate descent

The problem $\min_{\alpha \ge 0} F(\alpha)$ is a convex optimization problem, which we solve via projected coordinate descent. Let $e_k$ be the $k$th unit vector in $\mathbb{R}^N$ and let $F'(\alpha, e_j)$ be the directional derivative of $F$ along the direction $e_j$ at $\alpha$. The algorithm consists of the following three steps. First, it determines the direction of maximal descent, $k = \operatorname{argmax}_{j \in [1, N]} |F'(\alpha_{t-1}, e_j)|$. Second, it calculates the best step $\eta$ along that direction that preserves the non-negativity of $\alpha$, $\eta = \operatorname{argmin}_{\alpha_{t-1} + \eta e_k \ge 0} F(\alpha_{t-1} + \eta\,e_k)$. Third, it updates $\alpha_{t-1}$ to $\alpha_t = \alpha_{t-1} + \eta\,e_k$.
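To illustrate these three steps, here is a simplified Python sketch of projected coordinate descent on the objective of Equation (6) for the exponential loss. It is our own schematic rendering, not the pseudocode of Figure 3: the direction rule is reduced to picking the most negative directional derivative, the closed-form step of Appendix D.2 is replaced by a bounded numeric line search, and all names (and the arbitrary step cap) are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ba_sketch(H, R, y, c, beta=0.0, T=100, step_cap=10.0):
    """Schematic projected coordinate descent on F (Eq. (6)) with Phi = exp, a = 1.

    H, R: (m, N) matrices with H[i, j] = h_j(x_i) and R[i, j] = r_j(x_i)
    y:    labels in {-1, +1}; c: abstention cost; beta: L1 regularization.
    Returns the non-negative mixture weights alpha.
    """
    m, N = H.shape
    b = 2.0 * np.sqrt((1.0 - c) / c)      # calibrated scaling from Section 3.1
    alpha = np.zeros(N)

    def F(a_vec):
        h, r = H @ a_vec, R @ a_vec
        return np.mean(np.exp(r - y * h) + c * np.exp(-b * r)) + beta * a_vec.sum()

    for _ in range(T):
        h, r = H @ alpha, R @ alpha
        w1 = np.exp(r - y * h) / m        # plays the role of D_t(i, 1), unnormalized
        w2 = np.exp(-b * r) / m           # plays the role of D_t(i, 2), unnormalized
        # Directional derivatives F'(alpha, e_j) for every coordinate j.
        grad = (R - y[:, None] * H).T @ w1 - c * b * (R.T @ w2) + beta
        k = int(np.argmin(grad))
        if grad[k] >= 0:                  # simplified rule: no coordinate increase
            break                         # decreases F any further; stop
        # Step along e_k, constrained so that alpha stays non-negative.
        e_k = np.eye(N)[k]
        res = minimize_scalar(lambda eta: F(alpha + eta * e_k),
                              bounds=(-alpha[k], step_cap), method="bounded")
        alpha[k] = max(0.0, alpha[k] + res.x)
    return alpha
```

The weights `w1` and `w2` in this sketch play the role of the distributions $D_t(i, 1)$ and $D_t(i, 2)$ introduced next, and the expression for `grad` matches the descent-direction formula given below up to normalization factors.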
The pseudocode of the BA algorithm is given in Figure 3. The step and the direction are based on $F'(\alpha_{t-1}, e_j)$. For any $t \in [1, T]$, define a distribution $D_t$ over the pairs $(i, n)$, with $n \in \{1, 2\}$, by

$$D_t(i, 1) = \frac{\Phi'\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\big)}{Z_t} \quad\text{and}\quad D_t(i, 2) = \frac{\Phi'\big(-b\,r_{t-1}(x_i)\big)}{Z_t},$$

where $Z_t$ is the normalization factor given by $Z_t = \sum_{i=1}^m \big[\Phi'\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\big) + \Phi'\big(-b\,r_{t-1}(x_i)\big)\big]$. In order to derive an explicit formulation of the descent direction based on the weighted error of the classification function $h_j$ and the expected value of the rejection function $r_j$, we use the distributions $D_{1,t}$ and $D_{2,t}$ defined by $D_t(i, 1)/\bar Z_{1,t}$ and $D_t(i, 2)/\bar Z_{2,t}$, where $\bar Z_{1,t} = \sum_{i=1}^m D_t(i, 1)$ and $\bar Z_{2,t} = \sum_{i=1}^m D_t(i, 2)$ are the normalization factors. Now, for any $j \in [1, N]$ and $t \in [1, T]$, we can define the weighted error $\epsilon_{t,j}$ and the expected values of the rejection function, $\bar r_{j,1}$ and $\bar r_{j,2}$, over the distributions $D_{1,t}$ and $D_{2,t}$ as follows:

$$\epsilon_{t,j} = \tfrac{1}{2}\Big[1 - \mathbb{E}_{i \sim D_{1,t}}[y_i h_j(x_i)]\Big], \qquad \bar r_{j,1} = \mathbb{E}_{i \sim D_{1,t}}[r_j(x_i)], \qquad \bar r_{j,2} = \mathbb{E}_{i \sim D_{2,t}}[r_j(x_i)]. \tag{7}$$

Using these definitions, we show (see Appendix D) that the descent direction is given by

$$\operatorname*{argmin}_{j \in [1, N]}\ 2\bar Z_{1,t}\,\epsilon_{t,j} + \bar Z_{1,t}\,\bar r_{j,1} - 2\sqrt{c(1-c)}\,\bar Z_{2,t}\,\bar r_{j,2}.$$

This expression shows that $\bar Z_{1,t}$ and $2\sqrt{c(1-c)}\,\bar Z_{2,t}$ re-scale the weighted error and the expected rejection. Thus, finding the best descent direction by minimizing this expression is equivalent to finding the best scaled trade-off between the misclassification error and the average rejection cost. The step size can in general be found via a line search or other numerical methods, but we have derived a closed-form solution of the step size for both the exponential and the logistic loss (see Appendix D.2). Further details of the derivation of projected coordinate descent on $F$ are also given in Appendix D.

Note that for $r_t \to 0^+$ in Equation (6), that is, when the rejection terms are dropped from the objective, we retrieve $L_1$-regularized AdaBoost. As for AdaBoost, we can define a weak learning assumption, which requires that the directional derivative along at least one base pair be non-zero. For $\beta = 0$, it does not hold when, for all $j$,

$$2\epsilon_{t,j} - 1 = -\bar r_{j,1} + \frac{2\sqrt{c(1-c)}\,\bar Z_{2,t}}{\bar Z_{1,t}}\,\bar r_{j,2},$$

which corresponds to a balance between the edge and the rejection costs for all $j$. Observe that, in the particular case where the rejection functions are zero, this coincides with the standard weak learning assumption for AdaBoost ($\epsilon_{t,j} = \frac{1}{2}$ for all $j$).

The following theorem provides the convergence of the projected coordinate descent algorithm for our objective function $F(\alpha)$. The proof is given in Appendix E.

**Theorem 4.** Assume that $\Phi$ is twice differentiable and that $\Phi''(u) > 0$ for all $u \in \mathbb{R}$. Then, the projected coordinate descent algorithm applied to $F$ converges to the solution $\alpha^*$ of the optimization problem $\min_{\alpha \ge 0} F(\alpha)$. If, additionally, $\Phi$ is strongly convex over the path of the iterates $\alpha_t$, then there exist $\tau > 0$ and $\gamma \in (0, 1)$ such that, for all $t > \tau$, $F(\alpha_{t+1}) - F(\alpha^*) \le \gamma\,\big[F(\alpha_t) - F(\alpha^*)\big]$.

Figure 4: Illustration of the abstention stumps on a variable $X$.

Specifically, this theorem holds for the exponential loss $\Phi(u) = \exp(u)$ and the logistic loss $\Phi(u) = \log_2(1 + e^u)$, since they are strongly convex over the compact set containing the $\alpha_t$'s.

### 4.3 Abstention stumps

We first define a family of base hypotheses, abstention stumps, that can be viewed as extensions of the standard boosting stumps to the setting of classification with abstention. An abstention stump $h_{\theta_1, \theta_2}$ over the feature $X$ is defined by two thresholds $\theta_1, \theta_2 \in \mathbb{R}$ with $\theta_1 \le \theta_2$. There are six different such stumps; Figure 4 illustrates two of them.
For the left stump, points with a variable value $X$ less than or equal to $\theta_1$ are labeled negatively, those with $X > \theta_2$ are labeled positively, and those with $X$ between $\theta_1$ and $\theta_2$ are rejected. In general, an abstention stump is defined by the pair $\big(h_{\theta_1,\theta_2}(X), r_{\theta_1,\theta_2}(X)\big)$ where, for Figure 4 (left), $h_{\theta_1,\theta_2}(X) = -\mathbb{1}_{X \le \theta_1} + \mathbb{1}_{X > \theta_2}$ and $r_{\theta_1,\theta_2}(X) = \mathbb{1}_{\theta_1 < X \le \theta_2}$. We use abstention stumps $\hat r$ of this form, together with a threshold $\gamma > 0$, to define a family of base predictor and base rejector pairs of the form $(h(x), \gamma - \hat r(x))$. Since $\alpha_t$ is non-negative, the value $\gamma$ is needed to correct for over-rejection by previously selected abstention stumps. The $\gamma$ can be automatically learned by adding to the set of base pairs the constant functions $(h_0, r_0) = (0, 1)$. An ensemble solution returned by the BA algorithm is therefore of the form $\big(\sum_t \alpha_t h_t(x),\ \sum_t \alpha_t (\gamma_t - \hat r_t(x))\big)$, where the $\alpha_t$'s are the weights assigned to each base pair.

Now, consider a sample of $m$ points sorted by the value of $X$, which we denote by $X_1 \le \cdots \le X_m$. For abstention stumps, the derivative of the objective $F$ can be further simplified (see Appendix G), so that the problem reduces to finding an abstention stump with minimal expected abstention loss $l(\theta_1, \theta_2)$, that is, a minimizer of

$$l(\theta_1, \theta_2) = \sum_{i=1}^m 2 D_t(i, 1)\big[\mathbb{1}_{y_i = +1}\mathbb{1}_{X_i \le \theta_1} + \mathbb{1}_{y_i = -1}\mathbb{1}_{X_i > \theta_2}\big] + \sum_{i=1}^m \big[2 D_t(i, 1) - c\,b\,D_t(i, 2)\big]\,\mathbb{1}_{\theta_1 < X_i \le \theta_2}.$$

Notice that, given $m$ points, at most $(m + 1)$ thresholds need to be considered for $\theta_1$ and $\theta_2$. Hence, a straightforward algorithm inspects all possible $O(m^2)$ pairs $(\theta_1, \theta_2)$ with $\theta_1 \le \theta_2$ in time $O(m^2)$. However, Lemma 5 below, together with further derivations in Appendix G, allows for an $O(m)$-time algorithm for finding optimal abstention stumps when the problem is solved without the constraint $\theta_1 \le \theta_2$. Note that, while we state the lemma for the abstention stump of Figure 4 (left), similar results hold for any of the six types of stumps.

**Lemma 5.** The optimization problem without the constraint $(\theta_1 < \theta_2)$ can be decomposed as follows:

$$\operatorname*{argmin}_{\theta_1, \theta_2} l(\theta_1, \theta_2) = \Big(\operatorname*{argmin}_{\theta_1} \sum_{i=1}^m \big[2 D_t(i, 1)\mathbb{1}_{y_i = +1}\mathbb{1}_{X_i \le \theta_1} + \big(2 D_t(i, 1) - c\,b\,D_t(i, 2)\big)\mathbb{1}_{X_i > \theta_1}\big], \tag{8}$$

$$\operatorname*{argmin}_{\theta_2} \sum_{i=1}^m \big[2 D_t(i, 1)\mathbb{1}_{y_i = -1}\mathbb{1}_{X_i > \theta_2} + \big(2 D_t(i, 1) - c\,b\,D_t(i, 2)\big)\mathbb{1}_{X_i \le \theta_2}\big]\Big). \tag{9}$$

Problems (8) and (9) can be solved in linear time, via a method similar to that used for finding the optimal threshold of a standard zero-one-loss boosting stump. When the condition $\theta_1 < \theta_2$ does not hold for the resulting pair, we can simply revert to finding the minimum of $l(\theta_1, \theta_2)$ in the naive way. In practice, we find that, most often, the optimal solution of Problems (8) and (9) satisfies $\theta_1 < \theta_2$.

## 5 Experiments

In this section, we present the results of experiments with our abstention stump BA algorithm based on $L_{\mathrm{SB}}$ on several datasets. We compare the BA algorithm with the DHL algorithm [1], as well as with the confidence-based boosting algorithm TSB. Both of these algorithms are described in further detail in Appendix B.

Figure 5: Average rejection loss on the test set as a function of the abstention cost $c$ for the TSB algorithm (in orange), the DHL algorithm (in red), and the BA algorithm (in blue) based on $L_{\mathrm{SB}}$. Top row: cod, pima, skin; bottom row: banknote, haberman, australian.

We tested the algorithms on six data sets from the UCI data repository: australian, cod, skin, banknote, haberman, and pima. For more information about the data sets, see Appendix I. For each data set, we implemented standard 5-fold cross-validation, where we randomly divided the data into training, validation, and test sets with ratio 3:1:1. Using a different random partition, we repeated the experiments five times.
For all three algorithms, the cost values ranged over $c \in \{0.05, 0.1, \ldots, 0.5\}$, while the threshold $\gamma$ ranged over $\gamma \in \{0.08, 0.16, \ldots, 0.96\}$. For the BA algorithm, the regularization parameter $\beta$ ranged over $\beta \in \{0, 0.05, \ldots, 0.95\}$. All experiments for BA were based on $T = 200$ boosting rounds. The DHL algorithm used polynomial kernels with degree $d \in \{1, 2, 3\}$ and was implemented in CVX [8]. For each cost $c$, the hyperparameter configuration was chosen to be the set of parameters that attained the smallest average rejection loss on the validation set. For that set of parameters, we report the results on the test set.

We first compared the confidence-based TSB algorithm with the BA and DHL algorithms (first row of Figure 5). The experiments show that, while TSB can sometimes perform better than DHL, in a number of cases its performance is dramatically worse as a function of $c$, and in all cases it is outperformed by BA. In Appendix J, we give the full set of results for the TSB algorithm. In view of that, our next series of results focuses on the BA and DHL algorithms, which are directly designed to optimize the rejection loss, on three other datasets (second row of Figure 5). Overall, the figures show that BA outperforms the state-of-the-art DHL algorithm for most values of $c$, indicating that BA yields a significant improvement in practice. We also successfully ran BA on the CIFAR-10 data set (boat and horse images), which contains 10,000 instances, and we believe that our algorithm can scale to much larger datasets. In contrast, training DHL on such larger samples did not terminate, as it is based on a costly QCQP. In Appendix J, we present tables that report the average and standard deviation of the abstention loss, as well as the fraction of rejected points and the classification error on non-rejected points.

## 6 Conclusion

We introduced a general framework for classification with abstention where the predictor and abstention functions are learned simultaneously. We gave a detailed study of ensemble learning within this framework, including: new surrogate loss functions proven to be calibrated, Rademacher complexity margin bounds for ensemble learning of the pair of predictor and abstention functions, a new boosting-style algorithm, the analysis of a natural family of base predictor and abstention functions, and the results of several experiments showing that the BA algorithm yields a significant improvement over the confidence-based algorithms DHL and TSB. Our algorithm can be further extended by considering more complex base pairs, such as more general ternary decision trees with rejection leaves. Moreover, our theory and algorithm can be generalized to the scenario of multi-class classification with abstention, a study that we have already initiated.

## Acknowledgments

This work was partly funded by NSF CCF-1535987 and IIS-1618662.

## References

[1] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.
[2] A. Bounsiar, E. Grall, and P. Beauseroy. Kernel based rejection method for supervised classification. In WASET, 2007.
[3] H. L. Capitaine and C. Frelicot. An optimum class-rejective decision rule and its evaluation. In ICPR.
[4] K. Chaudhuri and C. Zhang. Beyond disagreement-based agnostic active learning. In NIPS, 2014.
[5] C. Chow. An optimum character recognition system using decision function. IEEE Trans. Comput., 1957.
[6] C. Chow. On optimum recognition error and reject trade-off. IEEE Trans. Comput., 1970.
[7] C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In ALT, 2016.
[8] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0, Aug. 2012.
[9] B. Dubuisson and M. Masson. Statistical decision rule with incomplete knowledge about classes. In PR.
[10] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 2010.
[11] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.
[12] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.
[13] G. Fumera and F. Roli. Support vector machines with embedded reject option. In ICPR, 2002.
[14] G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classification reliability. In ICAPR, 2000.
[15] Y. Grandvalet, J. Keshet, A. Rakotomamonjy, and S. Canu. Support vector machines with a reject option. In NIPS, 2008.
[16] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
[17] T. Landgrebe, D. Tax, P. Paclik, and R. Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. PRL, 2005.
[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
[19] M. Littman, L. Li, and T. Walsh. Knows what it knows: A framework for self-aware learning. In ICML.
[20] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 1992.
[21] I. Melvin, J. Weston, C. S. Leslie, and W. S. Noble. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics, 2008.
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.
[24] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In ICML, 2005.
[25] C. Santos-Pereira and A. Pires. On optimal reject rules and ROC curves. PRL, 2005.
[26] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135-168, 2000.
[27] D. Tax and R. Duin. Growing a multi-class classifier with a reject option. Pattern Recognition Letters.
[28] F. Tortorella. An optimal reject rule for binary classifiers. In ICAPR, 2001.
[29] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In AISTATS, 2013.
[30] J. Wang, K. Trapeznikov, and V. Saligrama. An LP for sequential learning under budgets. In JMLR, 2014.
[31] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. JMLR, 2010.
[32] M. Yuan and M. Wegkamp. Support vector machines with a reject option. Bernoulli, 2011.
[33] C. Zhang and K. Chaudhuri. The extended Littlestone's dimension for learning with mistakes and abstentions. In COLT, 2016.