# Boosting with Abstention

Corinna Cortes, Google Research, New York, NY 10011, corinna@google.com
Giulia De Salvo, Courant Institute, New York, NY 10012, desalvo@cims.nyu.edu
Mehryar Mohri, Courant Institute and Google, New York, NY 10012, mohri@cims.nyu.edu

**Abstract.** We present a new boosting algorithm for the key scenario of binary classification with abstention, where the algorithm can abstain from predicting the label of a point at the price of a fixed cost. At each round, our algorithm selects a pair of functions: a base predictor and a base abstention function. We define convex upper bounds for the natural loss function associated with this problem, which we prove to be calibrated with respect to the Bayes solution. Our algorithm benefits from general margin-based learning guarantees, which we derive for ensembles of pairs of base predictor and abstention functions in terms of the Rademacher complexities of the corresponding function classes. We give convergence guarantees for our algorithm along with a linear-time weak-learning algorithm for abstention stumps. We also report the results of several experiments suggesting that our algorithm provides a significant improvement in practice over two confidence-based algorithms.

## 1 Introduction

Classification with abstention is a key learning scenario where the algorithm can abstain from making a prediction, at the price of incurring a fixed cost. This is the natural scenario in a variety of common and important applications. An example is spoken-dialog applications, where the system can redirect a call to an operator to avoid the cost of incorrectly assigning a category to a spoken utterance and misguiding the dialog manager. This requires the availability of an operator, which incurs a fixed and predefined price. Other examples arise in the design of a search engine or an information extraction system, where, rather than taking the risk of displaying an irrelevant document, the system can resort to the help of a more sophisticated, but more time-consuming, classifier. More generally, this learning scenario arises in a wide range of applications including health, bioinformatics, astronomical event detection, active learning, and many others, where abstention is an acceptable option with some cost. Classification with abstention is thus a highly relevant problem.

The standard approach for tackling this problem is via confidence-based abstention: a real-valued function $h$ is learned for the classification problem and the points $x$ for which its magnitude $|h(x)|$ is smaller than some threshold $\gamma$ are rejected. Bartlett and Wegkamp [1] gave a theoretical analysis of this approach based on consistency. They introduced a discontinuous loss function taking into account the cost for rejection, upper-bounded that loss by a convex and continuous Double Hinge Loss (DHL) surrogate, and derived an algorithm based on that convex surrogate loss. Their work inspired a series of follow-up papers that developed both the theory and practice behind confidence-based abstention [32, 15, 31]. Further related work can be found in Appendix A.

In this paper, we present a solution to the problem of classification with abstention that radically departs from the confidence-based approach. We introduce a general model where a pair $(h, r)$, consisting of a classifier $h$ and a rejection function $r$, is learned simultaneously.
Under this novel framework, we present a boosting-style algorithm with abstention, BA, that accurately learns both the classifier and the abstention function. Note that the terminology of boosting with abstention was used by Schapire and Singer [26] to refer to a scenario where a base classifier is allowed to abstain, but where the boosting algorithm itself has to commit to a prediction. This is therefore distinct from the scenario of classification with abstention studied here. Nevertheless, we will introduce and examine a confidence-based Two-Step Boosting algorithm, the TSB algorithm, which consists of first training AdaBoost and then searching for the best confidence-based abstention threshold.

Figure 1: The best predictor $h$ is defined by a threshold $\theta$: $h(x) = x + \theta$. For $c < \frac{1}{2}$, the region defined by $X \ge \eta$ should be rejected, but the corresponding abstention function $r$, defined by $r(x) = \eta - x$, cannot be expressed as $|h(x)| \le \gamma$ for any $\gamma > 0$.

The paper is organized as follows. Section 2 describes our general abstention model, which consists of learning a pair $(h, r)$ simultaneously, and compares it with confidence-based models. Section 3 presents a series of theoretical results for the problem of learning convex ensembles for classification with abstention, including the introduction of calibrated convex surrogate losses and general data-dependent learning guarantees. In Section 4, we use these learning bounds to design a regularized boosting algorithm. We further prove the convergence of the algorithm and present a linear-time weak-learning algorithm for a natural family of abstention stumps. Finally, in Section 5, we report several experimental results comparing the BA algorithm with the DHL and the TSB algorithms.

## 2 Preliminaries

In this section, we first introduce a general model for learning with abstention [7] and then compare it with confidence-based models.

### 2.1 General abstention model

We assume, as in standard supervised learning, that the training and test points are drawn i.i.d. according to some fixed but unknown distribution $D$ over $X \times \{-1, +1\}$. We consider the learning scenario of binary classification with abstention. Given an instance $x \in X$, the learner has the option of abstaining from making a prediction for $x$ at the price of incurring a non-negative loss $c(x)$, or otherwise making a prediction $h(x)$ using a predictor $h$ and incurring the standard zero-one loss $\mathbb{1}_{y h(x) \le 0}$, where $y$ is the true label. Since a random guess achieves an expected cost of at most $\frac{1}{2}$, rejection only makes sense for $c(x) < \frac{1}{2}$.

We will model the learner by a pair $(h, r)$, where the function $r\colon X \to \mathbb{R}$ determines the points $x \in X$ to be rejected according to $r(x) \le 0$, and where the hypothesis $h\colon X \to \mathbb{R}$ predicts labels for non-rejected points via its sign. Extending the loss function considered by Bartlett and Wegkamp [1], the abstention loss for a pair $(h, r)$ is defined as follows for any $(x, y) \in X \times \{-1, +1\}$:

$$L(h, r, x, y) = \mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c(x)\,\mathbb{1}_{r(x) \le 0}. \tag{1}$$

The abstention cost $c(x)$ is assumed to be known to the learner. In the following, we assume that $c$ is a constant function, but part of our analysis is applicable to the more general case. We denote by $H$ and $R$ two families of functions mapping $X$ to $\mathbb{R}$, and we assume that the labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is drawn i.i.d. from $D^m$.
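To make the abstention loss of Equation (1) concrete, the following is a minimal NumPy sketch (ours, not the authors' code; the function name and toy inputs are illustrative) that evaluates the average abstention loss of a pair $(h, r)$ over a labeled sample, i.e., the empirical loss defined formally below, using the convention that a point is rejected when $r(x) \le 0$.

```python
import numpy as np

def abstention_loss(h_scores, r_scores, y, c):
    """Average abstention loss of a pair (h, r) on a labeled sample.

    h_scores: values h(x_i) of the predictor
    r_scores: values r(x_i) of the abstention function; r(x_i) <= 0 means reject
    y:        labels in {-1, +1}
    c:        constant abstention cost, with 0 < c < 1/2
    """
    h_scores, r_scores, y = map(np.asarray, (h_scores, r_scores, y))
    rejected = r_scores <= 0
    errors = (y * h_scores <= 0) & ~rejected   # zero-one loss on accepted points only
    return float(np.mean(errors + c * rejected))

# Toy example: one correct accepted point, one misclassified accepted point,
# one rejected point; the loss is (0 + 1 + c) / 3.
print(abstention_loss([0.4, -0.2, 0.1], [0.5, 0.3, -0.1], [+1, +1, -1], c=0.3))
```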
The learning problem consists of determining a pair $(h, r) \in H \times R$ that admits a small expected abstention loss $R(h, r)$, defined as follows:

$$R(h, r) = \mathbb{E}_{(x,y) \sim D}\big[\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c\,\mathbb{1}_{r(x) \le 0}\big].$$

Similarly, we define the empirical loss of a pair $(h, r) \in H \times R$ over the sample $S$ by

$$\widehat{R}_S(h, r) = \mathbb{E}_{(x,y) \sim S}\big[\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{r(x) > 0} + c\,\mathbb{1}_{r(x) \le 0}\big],$$

where $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by $S$.

### 2.2 Confidence-based abstention model

Confidence-based models are a special case of the general model for learning with rejection presented in Section 2.1, corresponding to the pair $(h(x), r(x)) = (h(x), |h(x)| - \gamma)$, where $\gamma$ is a parameter that changes the threshold of rejection. This specific choice was based on consistency results shown in [1]. In particular, the Bayes solution $(h^*, r^*)$ of the learning problem, that is, when the distribution $D$ is known, is given by $h^*(x) = \eta(x) - \frac{1}{2}$ and $r^*(x) = |h^*(x)| - (\frac{1}{2} - c)$, where $\eta(x) = \mathbb{P}[Y = +1 \mid x]$ for any $x \in X$; note that this is not a unique solution. The form of $h^*(x)$ follows by a reasoning similar to that for the standard binary classification problem. It is straightforward to see that the optimal rejection function $r^*$ is non-positive, meaning a point is rejected, if and only if $\min\{\eta(x), 1 - \eta(x)\} \ge c$. Equivalently, $\min\{\eta(x), 1 - \eta(x)\} \ge c$ holds if and only if $|\eta(x) - \frac{1}{2}| \le \frac{1}{2} - c$, and, using the definition of $h^*$, we recover the optimal $r^*$.

In light of the Bayes solution, the specific choice of the abstention function $r$ is natural; however, requiring the abstention function to be defined as $r(x) = |h(x)| - \gamma$ for some $h \in H$ is in general too restrictive when predictors are selected out of a limited subset $H$ of all measurable functions over $X$. Consider the example shown in Figure 1, where $H$ is a family of linear functions. For this simple case, the optimal abstention region cannot be attained as a function of the best predictor $h$, while it can be achieved by allowing the learner to select a pair $(h, r)$. Thus, the general model for learning with abstention analyzed in Section 2.1 is both more flexible and more general.

## 3 Theoretical analysis

This section presents a theoretical analysis of the problem of learning convex ensembles for classification with abstention. We first introduce general convex surrogate functions for the abstention loss and prove a necessary and sufficient condition, based on their parameters, for them to be calibrated. Next, we define the ensemble family we consider and prove general data-dependent learning guarantees for it based on the Rademacher complexities of the base predictor and base rejector sets.

### 3.1 Convex surrogates

We introduce two types of convex surrogate functions for the abstention loss. Observe that the abstention loss $L(h, r, x, y)$ can be equivalently expressed as $L(h, r, x, y) = \max\big(\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{-r(x) < 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)$. In view of that, since $\max(f, g) = \frac{f + g + |g - f|}{2}$ for any $f, g \in \mathbb{R}$, the following inequalities hold for $a > 0$ and $b > 0$:

$$
L(h, r, x, y) = \max\big(\mathbb{1}_{y h(x) \le 0}\,\mathbb{1}_{-r(x) < 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{\max(y h(x),\, -r(x)) \le 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{\frac{y h(x) - r(x)}{2} \le 0},\ c\,\mathbb{1}_{r(x) \le 0}\big)
\le \max\big(\mathbb{1}_{a [y h(x) - r(x)] \le 0},\ c\,\mathbb{1}_{b\, r(x) \le 0}\big)
\le \max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big),
$$

where $u \mapsto \Phi_1(-u)$ and $u \mapsto \Phi_2(-u)$ are two non-increasing convex functions upper-bounding $u \mapsto \mathbb{1}_{u \le 0}$ over $\mathbb{R}$. Let $L_{\mathrm{MB}}$ be the convex surrogate defined by the last inequality above:

$$L_{\mathrm{MB}}(h, r, x, y) = \max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big).$$

Since $L_{\mathrm{MB}}$ is not differentiable everywhere, we further upper-bound it as follows:

$$\max\big(\Phi_1\big(a [r(x) - y h(x)]\big),\ c\,\Phi_2\big(-b\, r(x)\big)\big) \le \Phi_1\big(a [r(x) - y h(x)]\big) + c\,\Phi_2\big(-b\, r(x)\big).$$

We let $L_{\mathrm{SB}}$ denote this convex surrogate:

$$L_{\mathrm{SB}}(h, r, x, y) = \Phi_1\big(a [r(x) - y h(x)]\big) + c\,\Phi_2\big(-b\, r(x)\big). \tag{4}$$
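As a small illustration of these definitions, here is a NumPy sketch (ours, not the paper's code) of the two surrogates for the exponential choice $\Phi_1(u) = \Phi_2(u) = \exp(u)$ used later in the paper; the default value of $b$ assumes the calibrated choice discussed just below, and all names are illustrative.

```python
import numpy as np

def exp_surrogates(h_scores, r_scores, y, c, a=1.0, b=None):
    """Pointwise max-based (L_MB) and sum-based (L_SB) surrogates for Phi1 = Phi2 = exp.

    If b is not given, it is set to 2*sqrt((1 - c)/c), the calibrated choice
    (with a = 1) discussed below; both surrogates upper-bound the abstention loss.
    """
    if b is None:
        b = 2.0 * np.sqrt((1.0 - c) / c)
    h_scores, r_scores, y = map(np.asarray, (h_scores, r_scores, y))
    phi1 = np.exp(a * (r_scores - y * h_scores))  # upper-bounds 1_{yh(x) - r(x) <= 0}
    phi2 = np.exp(-b * r_scores)                  # upper-bounds 1_{r(x) <= 0}
    return np.maximum(phi1, c * phi2), phi1 + c * phi2  # (L_MB, L_SB) per point
```

Averaging either output over a sample gives an empirical convex upper bound on the abstention loss computed in the previous sketch.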
Figure 2 shows the plots of the convex surrogates $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$, as well as that of the abstention loss.

Let $(h^*_L, r^*_L)$ denote the pair that attains the minimum of the expected loss $\mathbb{E}_{(x,y)}[L_{\mathrm{SB}}(h, r, x, y)]$ over all measurable functions, for $\Phi_1(u) = \Phi_2(u) = \exp(u)$. In Appendix F, we show that, with $\eta(x) = \mathbb{P}[Y = +1 \mid X = x]$, the pair

$$h^*_L(x) = \frac{1}{2a}\,\log\frac{\eta(x)}{1 - \eta(x)}, \qquad r^*_L(x) = \frac{1}{a + b}\,\log\frac{c\,b}{2a\,\sqrt{\eta(x)\,(1 - \eta(x))}}$$

makes $L_{\mathrm{SB}}$ a calibrated loss, meaning that the signs of the pair $(h^*_L, r^*_L)$ minimizing the expected surrogate loss match those of the Bayes solution $(h^*, r^*)$. More precisely, the following holds.

**Theorem 1 (Calibration of convex surrogate).** For $a > 0$ and $b > 0$, the infimum $\inf_{(h,r)} \mathbb{E}_{(x,y)}[L_{\mathrm{SB}}(h, r, x, y)]$ is attained at a pair $(h^*_L, r^*_L)$ such that $\operatorname{sign}(h^*) = \operatorname{sign}(h^*_L)$ and $\operatorname{sign}(r^*) = \operatorname{sign}(r^*_L)$ if and only if $b/a = 2\sqrt{\tfrac{1-c}{c}}$.

Figure 2: The left figure is a plot of the abstention loss. The middle figure is a plot of the surrogate function $L_{\mathrm{MB}}$, while the right figure is a plot of the surrogate loss $L_{\mathrm{SB}}$, both for $c = 0.45$.

The theorem shows that the classification and rejection solution obtained by minimizing the surrogate loss for that choice of $(a, b)$ coincides with the one obtained using the original loss. In the following, we make the explicit choice of $a = 1$ and $b = 2\sqrt{\tfrac{1-c}{c}}$ for the loss $L_{\mathrm{SB}}$ to be calibrated.

### 3.2 Learning guarantees for ensembles in classification with abstention

In the standard scenario of classification, it is often easy to come up with simple base classifiers that may abstain. As an example, a simple rule could classify a message as spam based on the presence of some word, as ham in the presence of some other word, and abstain in the absence of both, as in the boosting with abstention algorithm of Schapire and Singer [26]. Our objective is to learn ensembles of such base hypotheses to create accurate solutions for classification with abstention. Our ensemble functions are based on the framework described in Section 2.1. Let $H$ and $R$ be two families of functions mapping $X$ to $[-1, 1]$. The ensemble family $\mathcal{F}$ that we consider is then the convex hull of $H \times R$:

$$\mathcal{F} = \Big\{ \textstyle\sum_{t=1}^T \alpha_t\,(h_t, r_t) \,:\, T \ge 1,\ \alpha_t \ge 0,\ \sum_{t=1}^T \alpha_t = 1,\ h_t \in H,\ r_t \in R \Big\}.$$

Thus, $(h, r) \in \mathcal{F}$ abstains on input $x \in X$ when $r(x) \le 0$ and predicts the label $\operatorname{sign}(h(x))$ otherwise.

Let $u \mapsto \Phi_1(-u)$ and $u \mapsto \Phi_2(-u)$ be two strictly decreasing differentiable convex functions upper-bounding $u \mapsto \mathbb{1}_{u \le 0}$ over $\mathbb{R}$. For calibration constants $a^*, b^* > 0$ and cost $c > 0$, we assume that there exist $u$ and $v$ such that $\Phi_1(a^* u) < 1$ and $c\,\Phi_2(v) < 1$; otherwise the surrogate would not be useful. Let $\Phi_1^{-1}$ and $\Phi_2^{-1}$ be the inverse functions, which always exist since $\Phi_1$ and $\Phi_2$ are strictly monotone. We will use the following definitions: $C_{\Phi_1} = 2a^*\,\Phi_1'\big(\Phi_1^{-1}(1)\big)$ and $C_{\Phi_2} = 2c\,b^*\,\Phi_2'\big(\Phi_2^{-1}(1/c)\big)$. Observe that for $\Phi_1(u) = \Phi_2(u) = \exp(u)$, we simply have $C_{\Phi_1} = 2a^*$ and $C_{\Phi_2} = 2b^*$.

**Theorem 2.** Let $H$ and $R$ be two families of functions mapping $X$ to $\mathbb{R}$. Assume $N > 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in \mathcal{F}$:

$$R(h, r) \le \mathbb{E}_{(x,y) \sim S}\big[L_{\mathrm{MB}}(h, r, x, y)\big] + C_{\Phi_1}\,\mathfrak{R}_m(H) + (C_{\Phi_1} + C_{\Phi_2})\,\mathfrak{R}_m(R) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

The proof is given in Appendix C. The theorem gives effective learning guarantees for ensemble pairs $(h, r) \in \mathcal{F}$ when the base predictor and abstention functions admit favorable Rademacher complexities. In earlier work [7], we present a learning bound for a different type of surrogate loss, which can also be extended to hold for ensembles. Next, we derive margin-based guarantees in the case where $\Phi_1(u) = \Phi_2(u) = \exp(u)$.
For any $\rho > 0$, the margin losses associated with $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$ are denoted by $L^\rho_{\mathrm{MB}}$ and $L^\rho_{\mathrm{SB}}$ and defined, for all $(h, r) \in \mathcal{F}$ and $(x, y) \in X \times \{-1, +1\}$, by $L^\rho_{\mathrm{MB}}(h, r, x, y) = L_{\mathrm{MB}}(h/\rho, r/\rho, x, y)$ and $L^\rho_{\mathrm{SB}}(h, r, x, y) = L_{\mathrm{SB}}(h/\rho, r/\rho, x, y)$. Theorem 2 applied to this margin-based loss results in the following corollary.

**Corollary 3.** Assume $N > 1$ and fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of an i.i.d. sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in \mathcal{F}$:

$$R(h, r) \le \mathbb{E}_{(x,y) \sim S}\big[L^\rho_{\mathrm{MB}}(h, r, x, y)\big] + \frac{2a^*}{\rho}\,\mathfrak{R}_m(H) + \frac{2(a^* + b^*)}{\rho}\,\mathfrak{R}_m(R) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Figure 3: Pseudocode of the BA algorithm, for both the exponential loss with $\Phi_1(u) = \Phi_2(u) = \exp(u)$ and the logistic loss with $\Phi_1(u) = \Phi_2(u) = \log_2(1 + e^u)$. The parameters include the cost of rejection $c$ and the parameter $\beta$ determining the strength of the constraint used for the $L_1$ regularization. The definitions of the weighted error $\epsilon_{t,k}$ and of the expected rejections $\bar r_{k,1}$ and $\bar r_{k,2}$ are given in Equation (7). For other surrogate losses, the step size $\eta_t$ is found via a line search or other numerical methods by solving $\operatorname{argmin}_\eta F(\alpha_{t-1} + \eta\, e_k)$.

The bound of Corollary 3 applies similarly to $L^\rho_{\mathrm{SB}}$ since it is an upper bound on $L^\rho_{\mathrm{MB}}$. It can further be shown to hold uniformly for all $\rho \in (0, 1)$, at the price of an additional term in $O\big(\sqrt{\tfrac{\log\log_2(2/\rho)}{m}}\big)$, using standard techniques [16, 22] (see Appendix C).

## 4 Boosting algorithm

Here, we derive a boosting-style algorithm (the BA algorithm) for learning an ensemble with the option of abstention, for both losses $L_{\mathrm{MB}}$ and $L_{\mathrm{SB}}$. Below, we describe the algorithm for $L_{\mathrm{SB}}$ and refer the reader to Appendix H for the version using the loss $L_{\mathrm{MB}}$.

### 4.1 Objective function

The BA algorithm solves a convex optimization problem that is based on Corollary 3 for the loss $L_{\mathrm{SB}}$. Since the last three terms of the right-hand side of the bound of the corollary do not depend on $\alpha$, this suggests selecting $\alpha$ as the solution of $\min_{\alpha \ge 0,\ \sum_t \alpha_t \le 1} \frac{1}{m}\sum_{i=1}^m L^\rho_{\mathrm{SB}}(h, r, x_i, y_i)$. Via a change of variable $\alpha \to \alpha/\rho$ that does not affect the optimization problem, we can equivalently search for $\min_{\alpha \ge 0} \frac{1}{m}\sum_{i=1}^m L_{\mathrm{SB}}(h, r, x_i, y_i)$ such that $\sum_{t=1}^T \alpha_t \le 1/\rho$. Introducing the Lagrange variable $\beta$ associated with the constraint $\sum_{t=1}^T \alpha_t \le 1/\rho$, the problem can be rewritten as $\min_{\alpha \ge 0} \frac{1}{m}\sum_{i=1}^m L_{\mathrm{SB}}(h, r, x_i, y_i) + \beta \sum_{t=1}^T \alpha_t$. Letting $\{(h_1, r_1), \ldots, (h_N, r_N)\}$ be the set of base function pairs for the classifier and rejection function, with $h = \sum_{j=1}^N \alpha_j h_j$ and $r = \sum_{j=1}^N \alpha_j r_j$, and using the calibrated choice $a = 1$, $b = 2\sqrt{(1-c)/c}$ with $\Phi$ denoting $\Phi_1 = \Phi_2$, the following is the objective function of our optimization problem:

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Big[\Phi\big(r(x_i) - y_i h(x_i)\big) + c\,\Phi\big(-b\,r(x_i)\big)\Big] + \beta \sum_{j=1}^N \alpha_j. \tag{6}$$

### 4.2 Projected coordinate descent

The problem $\min_{\alpha \ge 0} F(\alpha)$ is a convex optimization problem, which we solve via projected coordinate descent. Let $e_k$ be the $k$th unit vector in $\mathbb{R}^N$ and let $F'(\alpha, e_j)$ be the directional derivative of $F$ along the direction $e_j$ at $\alpha$. The algorithm consists of the following three steps. First, it determines the direction of maximal descent, $k = \operatorname{argmax}_{j \in [1, N]} |F'(\alpha_{t-1}, e_j)|$. Second, it calculates the best step $\eta$ along that direction that preserves the non-negativity of $\alpha$, $\eta = \operatorname{argmin}_{\alpha_{t-1} + \eta e_k \ge 0} F(\alpha_{t-1} + \eta\,e_k)$. Third, it updates $\alpha_{t-1}$ to $\alpha_t = \alpha_{t-1} + \eta\,e_k$.
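To illustrate these three steps, here is a simplified Python sketch of projected coordinate descent on the objective of Equation (6) for the exponential loss. It is our own schematic rendering, not the pseudocode of Figure 3: the direction rule is reduced to picking the most negative directional derivative, the closed-form step of Appendix D.2 is replaced by a bounded numeric line search, and all names (and the arbitrary step cap) are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ba_sketch(H, R, y, c, beta=0.0, T=100, step_cap=10.0):
    """Schematic projected coordinate descent on F (Eq. (6)) with Phi = exp, a = 1.

    H, R: (m, N) matrices with H[i, j] = h_j(x_i) and R[i, j] = r_j(x_i)
    y:    labels in {-1, +1}; c: abstention cost; beta: L1 regularization.
    Returns the non-negative mixture weights alpha.
    """
    m, N = H.shape
    b = 2.0 * np.sqrt((1.0 - c) / c)      # calibrated scaling from Section 3.1
    alpha = np.zeros(N)

    def F(a_vec):
        h, r = H @ a_vec, R @ a_vec
        return np.mean(np.exp(r - y * h) + c * np.exp(-b * r)) + beta * a_vec.sum()

    for _ in range(T):
        h, r = H @ alpha, R @ alpha
        w1 = np.exp(r - y * h) / m        # plays the role of D_t(i, 1), unnormalized
        w2 = np.exp(-b * r) / m           # plays the role of D_t(i, 2), unnormalized
        # Directional derivatives F'(alpha, e_j) for every coordinate j.
        grad = (R - y[:, None] * H).T @ w1 - c * b * (R.T @ w2) + beta
        k = int(np.argmin(grad))
        if grad[k] >= 0:                  # simplified rule: no coordinate increase
            break                         # decreases F any further; stop
        # Step along e_k, constrained so that alpha stays non-negative.
        e_k = np.eye(N)[k]
        res = minimize_scalar(lambda eta: F(alpha + eta * e_k),
                              bounds=(-alpha[k], step_cap), method="bounded")
        alpha[k] = max(0.0, alpha[k] + res.x)
    return alpha
```

The weights `w1` and `w2` in this sketch play the role of the distributions $D_t(i, 1)$ and $D_t(i, 2)$ introduced next, and the expression for `grad` matches the descent-direction formula given below up to normalization factors.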
The pseudocode of the BA algorithm is given in Figure 3. The step and the direction are based on $F'(\alpha_{t-1}, e_j)$. For any $t \in [1, T]$, define a distribution $D_t$ over the pairs $(i, n)$, with $n \in \{1, 2\}$, by

$$D_t(i, 1) = \frac{\Phi'\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\big)}{Z_t} \quad\text{and}\quad D_t(i, 2) = \frac{\Phi'\big(-b\,r_{t-1}(x_i)\big)}{Z_t},$$

where $Z_t$ is the normalization factor given by $Z_t = \sum_{i=1}^m \big[\Phi'\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\big) + \Phi'\big(-b\,r_{t-1}(x_i)\big)\big]$. In order to derive an explicit formulation of the descent direction based on the weighted error of the classification function $h_j$ and the expected value of the rejection function $r_j$, we use the distributions $D_{1,t}$ and $D_{2,t}$ defined by $D_t(i, 1)/\bar Z_{1,t}$ and $D_t(i, 2)/\bar Z_{2,t}$, where $\bar Z_{1,t} = \sum_{i=1}^m D_t(i, 1)$ and $\bar Z_{2,t} = \sum_{i=1}^m D_t(i, 2)$ are the normalization factors. Now, for any $j \in [1, N]$ and $t \in [1, T]$, we can define the weighted error $\epsilon_{t,j}$ and the expected values of the rejection function, $\bar r_{j,1}$ and $\bar r_{j,2}$, over the distributions $D_{1,t}$ and $D_{2,t}$ as follows:

$$\epsilon_{t,j} = \tfrac{1}{2}\Big[1 - \mathbb{E}_{i \sim D_{1,t}}[y_i h_j(x_i)]\Big], \qquad \bar r_{j,1} = \mathbb{E}_{i \sim D_{1,t}}[r_j(x_i)], \qquad \bar r_{j,2} = \mathbb{E}_{i \sim D_{2,t}}[r_j(x_i)]. \tag{7}$$

Using these definitions, we show (see Appendix D) that the descent direction is given by

$$\operatorname*{argmin}_{j \in [1, N]}\ 2\bar Z_{1,t}\,\epsilon_{t,j} + \bar Z_{1,t}\,\bar r_{j,1} - 2\sqrt{c(1-c)}\,\bar Z_{2,t}\,\bar r_{j,2}.$$

This expression shows that $\bar Z_{1,t}$ and $2\sqrt{c(1-c)}\,\bar Z_{2,t}$ re-scale the weighted error and the expected rejection. Thus, finding the best descent direction by minimizing this expression is equivalent to finding the best scaled trade-off between the misclassification error and the average rejection cost. The step size can in general be found via a line search or other numerical methods, but we have derived a closed-form solution of the step size for both the exponential and the logistic loss (see Appendix D.2). Further details of the derivation of projected coordinate descent on $F$ are also given in Appendix D.

Note that for $r_t \to 0^+$ in Equation (6), that is, when the rejection terms are dropped from the objective, we retrieve $L_1$-regularized AdaBoost. As for AdaBoost, we can define a weak learning assumption, which requires that the directional derivative along at least one base pair be non-zero. For $\beta = 0$, it does not hold when, for all $j$,

$$2\epsilon_{t,j} - 1 = -\bar r_{j,1} + \frac{2\sqrt{c(1-c)}\,\bar Z_{2,t}}{\bar Z_{1,t}}\,\bar r_{j,2},$$

which corresponds to a balance between the edge and the rejection costs for all $j$. Observe that, in the particular case where the rejection functions are zero, this coincides with the standard weak learning assumption for AdaBoost ($\epsilon_{t,j} = \frac{1}{2}$ for all $j$).

The following theorem provides the convergence of the projected coordinate descent algorithm for our objective function $F(\alpha)$. The proof is given in Appendix E.

**Theorem 4.** Assume that $\Phi$ is twice differentiable and that $\Phi''(u) > 0$ for all $u \in \mathbb{R}$. Then, the projected coordinate descent algorithm applied to $F$ converges to the solution $\alpha^*$ of the optimization problem $\min_{\alpha \ge 0} F(\alpha)$. If, additionally, $\Phi$ is strongly convex over the path of the iterates $\alpha_t$, then there exist $\tau > 0$ and $\gamma \in (0, 1)$ such that, for all $t > \tau$, $F(\alpha_{t+1}) - F(\alpha^*) \le \gamma\,\big[F(\alpha_t) - F(\alpha^*)\big]$.

Figure 4: Illustration of the abstention stumps on a variable $X$.

Specifically, this theorem holds for the exponential loss $\Phi(u) = \exp(u)$ and the logistic loss $\Phi(u) = \log_2(1 + e^u)$, since they are strongly convex over the compact set containing the $\alpha_t$'s.

### 4.3 Abstention stumps

We first define a family of base hypotheses, abstention stumps, that can be viewed as extensions of the standard boosting stumps to the setting of classification with abstention. An abstention stump $h_{\theta_1, \theta_2}$ over the feature $X$ is defined by two thresholds $\theta_1, \theta_2 \in \mathbb{R}$ with $\theta_1 \le \theta_2$. There are six different such stumps; Figure 4 illustrates two of them.
For the left stump, points with a variable value $X$ less than or equal to $\theta_1$ are labeled negatively, those with $X > \theta_2$ are labeled positively, and those with $X$ between $\theta_1$ and $\theta_2$ are rejected. In general, an abstention stump is defined by the pair $\big(h_{\theta_1,\theta_2}(X), r_{\theta_1,\theta_2}(X)\big)$ where, for Figure 4 (left), $h_{\theta_1,\theta_2}(X) = -\mathbb{1}_{X \le \theta_1} + \mathbb{1}_{X > \theta_2}$ and $r_{\theta_1,\theta_2}(X) = \mathbb{1}_{\theta_1 < X \le \theta_2}$. We use abstention stumps $\hat r$ of this form, together with a threshold $\gamma > 0$, to define a family of base predictor and base rejector pairs of the form $(h(x), \gamma - \hat r(x))$. Since $\alpha_t$ is non-negative, the value $\gamma$ is needed to correct for over-rejection by previously selected abstention stumps. The $\gamma$ can be automatically learned by adding to the set of base pairs the constant functions $(h_0, r_0) = (0, 1)$. An ensemble solution returned by the BA algorithm is therefore of the form $\big(\sum_t \alpha_t h_t(x),\ \sum_t \alpha_t (\gamma_t - \hat r_t(x))\big)$, where the $\alpha_t$'s are the weights assigned to each base pair.

Now, consider a sample of $m$ points sorted by the value of $X$, which we denote by $X_1 \le \cdots \le X_m$. For abstention stumps, the derivative of the objective $F$ can be further simplified (see Appendix G), so that the problem reduces to finding an abstention stump with minimal expected abstention loss $l(\theta_1, \theta_2)$, that is, a minimizer of

$$l(\theta_1, \theta_2) = \sum_{i=1}^m 2 D_t(i, 1)\big[\mathbb{1}_{y_i = +1}\mathbb{1}_{X_i \le \theta_1} + \mathbb{1}_{y_i = -1}\mathbb{1}_{X_i > \theta_2}\big] + \sum_{i=1}^m \big[2 D_t(i, 1) - c\,b\,D_t(i, 2)\big]\,\mathbb{1}_{\theta_1 < X_i \le \theta_2}.$$

Notice that, given $m$ points, at most $(m + 1)$ thresholds need to be considered for $\theta_1$ and $\theta_2$. Hence, a straightforward algorithm inspects all possible $O(m^2)$ pairs $(\theta_1, \theta_2)$ with $\theta_1 \le \theta_2$ in time $O(m^2)$. However, Lemma 5 below, together with further derivations in Appendix G, allows for an $O(m)$-time algorithm for finding optimal abstention stumps when the problem is solved without the constraint $\theta_1 \le \theta_2$. Note that, while we state the lemma for the abstention stump of Figure 4 (left), similar results hold for any of the six types of stumps.

**Lemma 5.** The optimization problem without the constraint $(\theta_1 < \theta_2)$ can be decomposed as follows:

$$\operatorname*{argmin}_{\theta_1, \theta_2} l(\theta_1, \theta_2) = \Big(\operatorname*{argmin}_{\theta_1} \sum_{i=1}^m \big[2 D_t(i, 1)\mathbb{1}_{y_i = +1}\mathbb{1}_{X_i \le \theta_1} + \big(2 D_t(i, 1) - c\,b\,D_t(i, 2)\big)\mathbb{1}_{X_i > \theta_1}\big], \tag{8}$$

$$\operatorname*{argmin}_{\theta_2} \sum_{i=1}^m \big[2 D_t(i, 1)\mathbb{1}_{y_i = -1}\mathbb{1}_{X_i > \theta_2} + \big(2 D_t(i, 1) - c\,b\,D_t(i, 2)\big)\mathbb{1}_{X_i \le \theta_2}\big]\Big). \tag{9}$$

Problems (8) and (9) can be solved in linear time, via a method similar to that used for finding the optimal threshold of a standard zero-one-loss boosting stump. When the condition $\theta_1 < \theta_2$ does not hold for the resulting pair, we can simply revert to finding the minimum of $l(\theta_1, \theta_2)$ in the naive way. In practice, we find that, most often, the optimal solution of Problems (8) and (9) satisfies $\theta_1 < \theta_2$.

## 5 Experiments

In this section, we present the results of experiments with our abstention stump BA algorithm based on $L_{\mathrm{SB}}$ on several datasets. We compare the BA algorithm with the DHL algorithm [1], as well as with the confidence-based boosting algorithm TSB. Both of these algorithms are described in further detail in Appendix B.

Figure 5: Average rejection loss on the test set as a function of the abstention cost $c$ for the TSB algorithm (in orange), the DHL algorithm (in red), and the BA algorithm (in blue) based on $L_{\mathrm{SB}}$. Top row: cod, pima, skin; bottom row: banknote, haberman, australian.

We tested the algorithms on six data sets from the UCI data repository: australian, cod, skin, banknote, haberman, and pima. For more information about the data sets, see Appendix I. For each data set, we implemented standard 5-fold cross-validation, where we randomly divided the data into training, validation, and test sets with ratio 3:1:1. Using a different random partition, we repeated the experiments five times.
For all three algorithms, the cost values ranged over $c \in \{0.05, 0.1, \ldots, 0.5\}$, while the threshold $\gamma$ ranged over $\gamma \in \{0.08, 0.16, \ldots, 0.96\}$. For the BA algorithm, the regularization parameter $\beta$ ranged over $\beta \in \{0, 0.05, \ldots, 0.95\}$. All experiments for BA were based on $T = 200$ boosting rounds. The DHL algorithm used polynomial kernels with degree $d \in \{1, 2, 3\}$ and was implemented in CVX [8]. For each cost $c$, the hyperparameter configuration was chosen to be the set of parameters that attained the smallest average rejection loss on the validation set. For that set of parameters, we report the results on the test set.

We first compared the confidence-based TSB algorithm with the BA and DHL algorithms (first row of Figure 5). The experiments show that, while TSB can sometimes perform better than DHL, in a number of cases its performance is dramatically worse as a function of $c$, and in all cases it is outperformed by BA. In Appendix J, we give the full set of results for the TSB algorithm. In view of that, our next series of results focuses on the BA and DHL algorithms, which are directly designed to optimize the rejection loss, on three other datasets (second row of Figure 5). Overall, the figures show that BA outperforms the state-of-the-art DHL algorithm for most values of $c$, indicating that BA yields a significant improvement in practice. We also successfully ran BA on the CIFAR-10 data set (boat and horse images), which contains 10,000 instances, and we believe that our algorithm can scale to much larger datasets. In contrast, training DHL on such larger samples did not terminate, as it is based on a costly QCQP. In Appendix J, we present tables that report the average and standard deviation of the abstention loss, as well as the fraction of rejected points and the classification error on non-rejected points.

## 6 Conclusion

We introduced a general framework for classification with abstention where the predictor and abstention functions are learned simultaneously. We gave a detailed study of ensemble learning within this framework, including: new surrogate loss functions proven to be calibrated, Rademacher complexity margin bounds for ensemble learning of the pair of predictor and abstention functions, a new boosting-style algorithm, the analysis of a natural family of base predictor and abstention functions, and the results of several experiments showing that the BA algorithm yields a significant improvement over the confidence-based algorithms DHL and TSB. Our algorithm can be further extended by considering more complex base pairs, such as more general ternary decision trees with rejection leaves. Moreover, our theory and algorithm can be generalized to the scenario of multi-class classification with abstention, a study that we have already initiated.

## Acknowledgments

This work was partly funded by NSF CCF-1535987 and IIS-1618662.

## References

[1] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.
[2] A. Bounsiar, E. Grall, and P. Beauseroy. Kernel based rejection method for supervised classification. In WASET, 2007.
[3] H. L. Capitaine and C. Frelicot. An optimum class-rejective decision rule and its evaluation. In ICPR.
[4] K. Chaudhuri and C. Zhang. Beyond disagreement-based agnostic active learning. In NIPS, 2014.
[5] C. Chow. An optimum character recognition system using decision function. IEEE Trans. Comput., 1957.
[6] C. Chow. On optimum recognition error and reject trade-off. IEEE Trans. Comput., 1970.
[7] C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In ALT, 2016.
[8] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0, Aug. 2012.
[9] B. Dubuisson and M. Masson. Statistical decision rule with incomplete knowledge about classes. In PR.
[10] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 2010.
[11] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.
[12] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.
[13] G. Fumera and F. Roli. Support vector machines with embedded reject option. In ICPR, 2002.
[14] G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classification reliability. In ICAPR, 2000.
[15] Y. Grandvalet, J. Keshet, A. Rakotomamonjy, and S. Canu. Support vector machines with a reject option. In NIPS, 2008.
[16] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
[17] T. Landgrebe, D. Tax, P. Paclik, and R. Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. PRL, 2005.
[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
[19] M. Littman, L. Li, and T. Walsh. Knows what it knows: A framework for self-aware learning. In ICML.
[20] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 1992.
[21] I. Melvin, J. Weston, C. S. Leslie, and W. S. Noble. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics, 2008.
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.
[24] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In ICML, 2005.
[25] C. Santos-Pereira and A. Pires. On optimal reject rules and ROC curves. PRL, 2005.
[26] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135-168, 2000.
[27] D. Tax and R. Duin. Growing a multi-class classifier with a reject option. Pattern Recognition Letters.
[28] F. Tortorella. An optimal reject rule for binary classifiers. In ICAPR, 2001.
[29] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In AISTATS, 2013.
[30] J. Wang, K. Trapeznikov, and V. Saligrama. An LP for sequential learning under budgets. In JMLR, 2014.
[31] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. JMLR, 2010.
[32] M. Yuan and M. Wegkamp. Support vector machines with a reject option. Bernoulli, 2011.
[33] C. Zhang and K. Chaudhuri. The extended Littlestone's dimension for learning with mistakes and abstentions. In COLT, 2016.