# auc_optimization_with_a_reject_option__daf8d561.pdf

The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

AUC Optimization with a Reject Option

Song-Qing Shen, Bin-Bin Yang, Wei Gao

National Key Laboratory for Novel Software Technology Nanjing University, Nanjing, 210023, China {shensq, yangbb, gaow}@lamda.nju.edu.cn

Making an erroneous decision may cause serious results in diverse mission-critical tasks such as medical diagnosis and bioinformatics. Previous work focuses on classiﬁcation with a reject option, i.e., abstain rather than classify an instance of low conﬁdence. Most mission-critical tasks are always accompanied with class imbalance and cost sensitivity, where AUC has been shown a preferable measure than accuracy in classiﬁcation. In this work, we propose the framework of AUC optimization with a reject option, and the basic idea is to withhold the decision of ranking a pair of positive and negative instances with a lower cost, rather than mis-ranking. We obtain the Bayes optimal solution for ranking, and learn the reject function and score function for ranking, simultaneously. An online algorithm has been developed for AUC optimization with a reject option, by considering the convex relaxation and plug-in rule. We verify, both theoretically and empirically, the effectiveness of the proposed algorithm.

Introduction

Making an erroneous decision may cause serious results in many mission-critical tasks such as medical diagnostic (Hamid et al. 2017), biometric veriﬁcation (Golfarelli, Maio, and Malton 1997), myoelectric pattern-recognition control (Robertson, Englehart, and Scheme 2018), and so on. For example, in an electronic nose system, researchers may be totally exposed to toxic or dangerous chemicals when the system makes an incorrect prediction (Hatami and Chira 2013), and for a mobile robot, an imprecise localization may bring about large deviation of navigation route (Marinho et al. 2018). The mission-critical studies have received much attention as machine learning techniques go to more real applications (Hatami and Chira 2013; Hamid et al. 2017; Robertson, Englehart, and Scheme 2018). A classical solution for mission-critical tasks is to consider the classiﬁcation with a reject option, also known as selective classiﬁcation (Golfarelli, Maio, and Malton 1997; Bounsiar, Grall, and Beauseroy 2006; Bartlett and Wegkamp 2008; Grandvalet et al. 2009; Hamid et al. 2017; Robertson,

This research is supported by NSFC(61921006, 61876078). Copyright c 2020, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

Englehart, and Scheme 2018; Marinho et al. 2018). The basic idea is to abstain an uncertain instance with a lower cost so as to avoid mis-classiﬁcation, and the cost is often more acceptable in practice. Relevant research could be traced back to Chow s rule (Chow 1970), where the Bayes decision is provided to tradeoff the optimal error versus reject rate. Most previous studies concern the reduction of classiﬁcation error, which is equivalent to the improvement of predictive accuracy of classiﬁers. Many mission-critical tasks are always accompanied with class imbalance and cost sensitivity, where the number of one class may overwhelm the others, and the minor class receives primary interest with higher cost. For example, in medical diagnosis, the number of healthy persons is much larger than that of patients, whereas the cost of erroneously diagnosing a patient as healthy may be much higher than that of diagnosing a healthy person as a patient. The Area Under the ROC Curve (or short for AUC) is preferable to accuracy as an evaluation measure in diverse tasks such as class-imbalance learning and cost-sensitive learning (Cortes and Mohri 2004; Gao et al. 2016; Liu et al. 2018), and various algorithm have been developed for AUC optimization (Freund et al. 2003; Zhao et al. 2011; Gao et al. 2016; Liu et al. 2018). However, it remains open for missioncritical tasks to take AUC into consideration. This work tries to study AUC as an evaluation measure for mission-critical tasks, and the main contributions are summarized as follows: We propose the framework of AUC optimization with a reject option for mission-critical tasks, and present the Bayes optimal solution. The optimal score function for ranking and the reject function are also provided based on the conditional probability of data distribution. From the Bayes optimal solution, we develop an online algorithm for AUC optimization with a reject option by using plug-in rule and convex surrogate loss, and learn the score function and the reject function simultaneously. Our algorithm is guaranteed theoretically with regret bounds. We ﬁnally present extensive empirical studies to verify the effectiveness of the proposed algorithm by comparing with state-of-the-art algorithm on AUC optimization and classiﬁcation with a reject option.

Related Work Classiﬁcation with a reject option has been a classical framework to deal with mission-critical tasks. The basic idea is to provide an option to reject an instance with a lower cost, rather than mis-classiﬁcation. Chow (1957; 1970) provided the Bayes optimal decision from the trade-off between error rate and rejection. Various SVMs-style variants have been developed for classiﬁcation with a reject option (Fumera and Roli 2002; Bounsiar, Grall, and Beauseroy 2006; Bartlett and Wegkamp 2008; Grandvalet et al. 2009). Cortes, De Salvo, and Mohri (2016) introduced a general framework of classiﬁcation with a reject option. Recent years have witnessed the reject option in deep neural networks (Geifman and El-Yaniv 2017). Tortorella (2005) and Pietraszek (2005) studied an optimal rejection rule based on the ROC curve. The studies on AUC could date back to the 1970 s in signal detection theory (Egan 1975). AUC has been an important performance measure for ranking (Metz 1978; Ferri, Hern andez-Orallo, and Flach 2011). Herschtal and Raskutti (2004) proposed the Rank Opt algorithm based on gradient descent method, and some variants of traditional algorithms have been developed for AUC optimization, such as boosting (Freund et al. 2003; Rudin and Schapire 2009) and SVM (Brefeld and Scheffer 2005; Joachims 2005). Zhao et al. (2011) proposed an online AUC optimization algorithm based on reservoir sampling. Gao et al. (2016) proposed an online AUC optimization algorithm using the consistent square loss. Ying, Wen, and Lyu (2016) further built on a saddle point formulation for a consistent square loss and proposed a stochastic algorithm with convergence analysis. Based on this saddle point formulation, Natole, Ying, and Lyu (2018) and Liu et al. (2018) proposed new stochastic algorithms with tighter regret bounds. The rest of the paper is organized as follows. We begin with the some preliminaries, and propose the framework of AUC optimization with a reject option. We then develop the AUCRO algorithm with theoretical guarantees. We ﬁnally conduct extensive experiments and conclude this work.

Preliminaries Let X Rm and Y = { 1, +1} be the instance and label space, respectively, and assume that D is an underlying (unknown) distribution over joint space X Y. Denote by

η(x) = Pr[y = +1|x]

the conditional probability of positive instances according to distribution D. We can also express distribution D with triplet (D+, D , p), where D+(x) = Pr[x|y = +1], D (x) = Pr[x|y = 1] and p = Pr[y = +1]. Notice that the distribution D is unknown in practice, and what we can observe is a training sample Sn of size n, i.e.,

Sn = {(x+ 1 , +1), (x+ 2 , +1), , (x+ n+, +1),

(x 1 , 1), (x 2 , 1), , (x n , 1)},

where each element is drawn independently and identically (i.i.d) from distribution D. Here, we denote by n+ and n

the cardinality of positive and negative instances in training sample Sn, respectively, and n = n + n+. Let f : X R be a score function. Given a sample Sn, the AUC w.r.t. function f is deﬁned as

AUC(f, Sn) = 1 1 n+n

j=1 ℓ(f, x+ i , x j ),

where the ranking loss ℓ(f, x+ i , x j ) is given by

ℓ(f, x+ i , x j ) = I[f(x+ i ) < f(x j )] + 1

2I[f(x+ i ) = f(x j )].

Here, I[ ] is the indicator function which returns 1 if the argument is true and 0 otherwise. Essentially, AUC is equivalent to the Wilcoxon-Mann-Whitney statistic (Yan et al. 2003).

AUC Optimization with a Reject Option For AUC optimization with a reject option, we introduce an augmented function r: R R {0, 1}, and make a reject option by setting

r(f(x+ i ), f(x j )) = 0.

In such case, there is no deﬁnitive ranking between x+ i and x j . We further introduce the loss function for AUC with a reject option as follows:

L(r, f, x+ i , x j ) = ℓ(f, x+ i , x j )r(f(x+ i ), f(x j ))

+ d(1 r(f(x+ i ), f(x j )))

= ℓ(f, x+ i , x j ) d r(f(x+ i ), f(x j )) + d

where d is the cost of a reject option. For convenience, we introduce a constant

κ = d/(1 d). (1)

Our goal is to optimize the empirical risk over the training data Sn deﬁned below:

R(f, Sn) = 1 n+n

j=1 L(r, f, x+ i , x j ).

Based on the Bayes optimal solution from Theorem 1 (to be shown in the next section), we consider the reject function as follows:

r f(x+ i ), f(x j ) = I

f(x+ i ) > 1

f(x+ i ) < κ

and the score function f is given by

f(x) = 1 1/η(x).

We restrict d (0, 1/2], which is similar to the scenario in classiﬁcation with a reject option (Herbei and Wegkamp 2006; Bartlett and Wegkamp 2008), and we reject pairwise instances for smaller d. The problem can be degenerated to the traditional AUC optimization without any reject option as for d = 1/2.

By plugging the reject function r( , ) into loss function L(r, f, x+ i , x j ), we have

L(r, f, x+ i , x j ) =

f(x+ i ) < κ

f(x+ i ) < 1

We can not directly optimize the loss function L since the class-conditional probability η(x) is unknown. For this problem, we consider plug-in rules (Herbei and Wegkamp 2006; Devroye, Gy orﬁ, and Lugosi 2013), where an empirical conditional probability η (x) is estimated to approximate the true conditional probability η (x). In this work, we focus on logistic regression for plugin rule to estimate η (x) as in the work of (Herbei and Wegkamp 2006), that is,

η (x) = ew x+b

1 + ew x+b .

Substituting into the score function f, we have

f (x) = e (w x+b). (2)

This follows that

d I ln κ + w (x+ i x j ) 0

(1 d)I ln κ + w (x+ i x j ) 0

For non-convex loss function in Eqn. (3), we exploit the surrogate loss functions as follows:

dψ ln κ + w (x+ i x j )

+(1 d)ψ ln κ + w (x+ i x j ) + λ

where λ is a regularization parameter and ψ is a convex surrogate loss function. We adopt the square loss in Eqn. (4) as in the works of (Gao and Zhou 2015; Gao et al. 2016), i.e.,

d 1 ln κ w (x+ i x j ) 2

(1 d) 1 + ln κ w (x+ i x j ) 2

Motivated from (Gao et al. 2016), we rewrite the loss function L(w) as a sum of losses for each individual training instance T t=1 Lt(w), where Lt(w) is equal to

I [yi = yt] d 1 ln κ ytw (xt xi) 2

2 |{i [t 1] : yiyt = 1}|

I [yi = yt] (1 d) 1 + ln κ ytw (xt xi) 2

2 |{i [t 1] : yiyt = 1}| .

Algorithm 1 The AUCRO Algorithm

Input: Training instances {(xt, yt)}T t=1, the regularization parameter λ > 0 and stepsizes {ηt}T t=1. Initialization: Set T + 0 = T 0 = 0, c+ 0 = c 0 = 0, w0 = 0 and Γ+ 0 = Γ 0 = [0]m m. 1: for t = 1, 2, . . . , T do 2: Receive a training instance (xt, yt); 3: if yt = 1 then 4: T + t = T + t 1 + 1 and T t = T t 1; 5: c+ t = c+ t 1 + 1 T + t (xt c+ t 1) and c t = c t 1;

6: Update Γ+ t using Eqn. (8); 7: Calculate the gradient Lt(wt 1); 8: else 9: T t = T t 1 + 1 and T + t = T + t 1; 10: c t = c t 1 + 1 T t (xt c t 1) and c+ t = c+ t 1;

11: Update Γ t using Eqn. (9); 12: Calculate the gradient Lt(wt 1); 13: end if 14: wt = wt 1 ηt Lt(wt 1) 15: end for

Notice that this is an unbiased estimation to L(w). We denote by T + t and T t the number of positive and negative instances in sequence St, respectively. We also set Lt(w) = 0 for the case T + t T t = 0. For yt = 1, we have the gradient

Lt(w) = λw + xtx t w αxt

αxi + (xix i xix t xtx i )w /T t , (5)

where α = 1 ln κ(2d 1). We further denote by

i:yi= 1 xi/T t ,

xix i c t c t /T t ,

the mean and covariance matrix of negative class, respectively. Then, Eqn. (5) can be simpliﬁed as

Lt(w) = λw + α(c t xt)

+ (c t xt)(c t xt) w + Γ t w. (6)

Similarly, we calculate the gradient for yt = 1 as follows:

Lt(w) = λw α(c+ t xt)

+ (c+ t xt)(c+ t xt) w + Γ+ t w, (7)

i:yi=1 xi/T + t ,

xix i c+ t c+ t /T + t .

Algorithm 1 presents the detailed description of AUC optimization with a Reject Option (AUCRO). Speciﬁcally, we update Γ+ t (Line 6) or Γ t (Line 11) for each iteration, respectively, as follows:

Γ+ t = Γ+ t 1 + c+ t 1 c+ t 1 c+ t c+ t

+ xtx t Γ+ t 1 c+ t 1 c+ t 1 /T + t , (8)

Γ t = Γ t 1 + c t 1 c t 1 c t c t

+ xtx t Γ t 1 c t 1 c t 1 /T t . (9)

We ﬁnally use the stochastic gradient descent to update the classiﬁer as follows:

wt = wt 1 ηt Lt(wt 1),

where ηt is the stepsize for each t-th iteration.

Main Theoretical Analysis This section presents theoretical analysis. We ﬁrst provide the Bayes optimal solution for optimal ranking, as well as the reject function. Deﬁne the expected risk with respect to distribution D as follows:

R(f, D) = Ex+ D+,x D L(r, f, x+, x ).

Theorem 1. Given distribution D, the Bayes optimal function f is deﬁned by

f (x) = 1 1/η (x),

that is, we have R(f, D) R(f , D) for any score function f : X R. Here, we assume η(x) (0, 1) without loss of generality. We also have the corresponding reject function, for each positive instance x+ and negative one x ,

r f(x+ i ), f(x j ) = I

f(x+ i ) > 1

f(x+ i ) < κ

where the κ is deﬁned by Eqn. (1).

Proof. Recall that our optimization objective is given by

arg min f Ex1 D+,x2 D L(r, f, x1, x2).

Without changing the optimal Bayes solution, the optimization objective can be transformed as

arg min f E (x1,y1),(x2,y2) D2 I [y1 > y2] L(r, f, x1, x2)

+ I [y2 > y1] L(r, f, x2, x1) .

It is easy to observe that the reject function r is symmetric, and we have

r f(x1), f(x2) = r f(x2), f(x1) = 0

if there is no deﬁnitive ranking between x1 and x2. We also denote by

r = r f(x1), f(x2) = r f(x2), f(x1) .

Table 1: Benchmark datasets dataset #instance #feature dataset #instance #feature diabetes 768 8 a9a 32,561 123 fourclass 862 2 w8a 49,749 300 german 1,000 24 connect-4 67,557 126 splice 3,175 60 acoustic 78,823 50 letter 20,000 16 covtype 581,012 54

This follows that

arg min f E (x1,y1),(x2,y2) D2 d(1 r)(a + b 2ab)

+ ra(1 b)ℓ(f, x1, x2) + rb(1 a)ℓ(f, x2, x1)

where a = η(x1) and b = η(x2). Minimize the above formula, and it is easy to obtain

r f(x1), f(x2) = r f(x2), f(x1) = 0

when κ (1 1/b)/(1 1/a) 1/κ. It is easy to ﬁnd the Bayes optimal solution f

f (x) = 1 1 η (x)

and the reject function r is given by

r f(x1), f(x2) = I f(x2)

f(x1) < κ .

We also have the Bayes risk

R = E x+ D+,x D min{η(x+), η(x ), dη(x+)

+ dη(x ) + (1 2d)η(x+)η(x )} η(x+)η(x ) .

This completes the proof.

Remark: f (x) = 1 1/η(x) is not the only Bayes optimal solution, whereas it is helpful for our later analysis. Notice that the reject option takes effect only for d (0, 1/2], that is why we restrict the range of d in the previous section. Without loss of generality, we assume η(x) (0, 1) for any x X in the above proof. In the rest of this work, we present the convergence analysis of the proposed AUCRO algorithm. We ﬁrst introduce a lemma for smooth functions from (Nesterov 2003)[Theorem 2.1.5] as follows: Lemma 2. Given a linear function space W Rm, if a function g: W R is β-smooth, then, for every w, w W, we have

g(w) g(w ) g(w ), w w + 1

Deﬁne w = arg minw

t Lt(w). The following theorem guarantees the convergence of Algorithm 1. Theorem 3. For xt 1 (t [T]), w D and TL = T t=1 Lt(w ), we have

t Lt(w ) 2θD2 + D

where θ = 4 + λ and ηt = 1/(θ +

θ2 + 2θTL /D2).

Table 2: Comparison of the testing AUC values (mean std.) on benchmark datasets

Compared methods AUCRO dataset FSAUC OPAUC OAMseq OAMgd d=0.5 d=0.48 d=0.46 d=0.44 d=0.42 diabetes .8143 .0235 .8335 .0319 .8035 .0243 .8305 .0232 .8318 .0232 .8549 .0232 .8768 .0233 .8960 .0231 .9129 .0220 fourclass .8318 .0311 .8279 .0239 .8295 .0303 .8317 .0303 .8303 .0269 .8534 .0264 .8769 .0254 .9004 .0224 .9248 .0197 german .7791 .0392 .7911 .0281 .7762 .0356 .7547 .0302 .7896 .0332 .8146 .0346 .8385 .0352 .8613 .0350 .8826 .0343 splice .9088 .0087 .9240 .0084 .8965 .0130 .8905 .0145 .9243 .0078 .9435 .0072 .9591 .0063 .9714 .0053 .9810 .0044 letter .8106 .0064 .8116 .0046 .7455 .0128 .7826 .0146 .8108 .0040 .8404 .0040 .8687 .0039 .8956 .0039 .9204 .0038 a9a .8977 .0037 .9002 .0034 .8637 .0044 .8507 .0114 .8994 .0034 .9206 .0033 .9390 .0031 .9548 .0028 .9680 .0024 w8a .9482 .0058 .9657 .0046 .8858 .0133 .9336 .0145 .9703 .0038 .9816 .0029 .9884 .0021 .9928 .0013 .9954 .0009 connect-4 .8589 .0029 .8585 .0026 .7351 .0124 .7986 .0135 .8578 .0027 .8831 .0027 .9058 .0026 .9259 .0024 .9435 .0022 acoustic .8017 .0038 .8018 .0026 .7765 .0046 .7661 .0121 .8010 .0030 .8370 .0033 .8636 .0033 .8844 .0030 .9006 .0031 covtype .8230 .0014 .8235 .0009 .6770 .0231 .7459 .0272 .8232 .0014 .8494 .0014 .8734 .0014 .8953 .0013 .9155 .0012

Proof. Since we have deﬁned Lt(w) = 0 for T + t T t = 0 above, this easy case will not be analyzed. Here we only consider the general case T + t T t = 0 in our proof. Recall aforementioned Lt(w) and it is easy to calculate the gradient

Lt(w) = λw t 1 i=1 I [yi = yt] |{i [t 1] : yiyt = 1}|

d 1 ln κ ytw (xt xi) yt(xt xi)+

(1 d) 1 + ln κ ytw (xt xi) yt(xt xi) .

For w, w W and xt 1,we have

Lt(w ) Lt(w ) θ w w ,

which implies that Lt is θ-smooth. Denote by

wt = arg min w Lt(w),

and this gives Lt(wt ) = 0 from the convex and differentiable loss Lt. Based on Lemma 2, we have

0 Lt(wt ) min c

Lt wt 1 c Lt(wt 1)

Lt(wt 1) c Lt(wt 1) 2+ 1

2θc2 Lt(wt 1) 2

= Lt(wt 1) 1

2θ Lt(wt 1) 2. (10)

From the convexity of function Lt 1, we have

Lt(wt 1) Lt(w ) Lt(wt 1), wt 1 w . (11)

Also, we have

wt w 2 = wt 1 ηt Lt(wt 1) w 2

= wt 1 w 2 2ηt Lt(wt 1), wt 1 w

+ η2 t Lt(wt 1) 2. (12)

By using Eqns. (10) and (11) in Eqn. (12), we have

(1 θηt)Lt(wt 1) Lt 1(w )

1 2ηt wt 1 w 2 1

2ηt wt w 2.

Summing over t = 1, , T and rearranging, we obtain

t=1 (1 θηt)Lt(wt 1)

t=1 Lt 1(w )

1 2η1 w0 w 2 1 2ηT w T w 2

t=1 ( 1 2ηt+1 1

2ηt ) wt w .

By setting ηt = η, we have

1 2η1 w0 w 2 1 2ηT w T w 2 1

From w0 = 0 and w D, we ﬁnally get

t=1 Lt(wt 1)

t=1 Lt(w ) 1 1 θη

2η + θηTL .

θ2 + 2θTL /D2)

and simple derivations, the theorem holds as desired.

Remark: This theorem presents an O(1/T) convergence rate for the AUCRO algorithm if the distribution is separable, i.e., L = 0, and an O(1/

T) convergence rate for general case. Besides, the convergence rate can be improved to O(log T/T) by exploring the strongly convexity of Lt(wt). Here we consider the standard and suboptimal regret analysis on exploiting the smoothness of Lt(wt) for simplicity.

Experiments We evaluate the performance of our method on ten benchmark datasets, as summarized in Table 11. We compare our method with state-of-the-art AUC optimization algorithms without a reject option. We then compare our method with conventional rejection algorithms that reject instances of low conﬁdence. We ﬁnally analyze the parameter inﬂuence. The features have been scaled to [ 1, 1] for all datasets. Multi-class datasets have been transformed into binary ones by randomly partitioning classes into two groups, where each group contains almost the same number of classes.

1https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/

Figure 1: The curves of reject-rate vs AUC on benchmark datasets

Comparisons with AUC Approches We compare our method with four state-of-the-art algorithms on AUC optimization as follows: FSAUC: An AUC optimization algorithm which builds on a saddle point formulation for the consistent square loss (Liu et al. 2018); OPAUC: An online AUC optimization algorithm which employ the consistent square loss. (Gao et al. 2016); OAMseq: An online AUC optimization algorithm based on reservoir sampling which updates the model w.r.t. the sequence of pairwise instances (Zhao et al. 2011); OAMgd: An online AUC optimization algorithm based on reservoir sampling which updates model with gradient descent approach (Zhao et al. 2011). Two trials of 5-fold cross-validation is executed on training sets to decide the learning rate ηt 2[ 12: 10] and the regularized parameter λ 2[ 10: 2] for our algorithm. For FSAUC, we tune the initial stepsize η1 2[ 10: 10] and the parameter R 10[ 1: 5], as recommended in (Liu et al. 2018). For OPAUC, stepsize ηt is decided within the range 2[ 12: 10] and the regularization parameter λ is decided within the range 2[ 10: 2] as recommended in (Gao et al. 2016). For OAMseq and OAMgd, the buffer sizes are ﬁxed to be 100 and the penalty parameter C is decided within 2[ 10: 10] as recommended in (Zhao et al. 2011).

The performance of compared methods are evaluated by ﬁve trials of 5-fold cross-validation, where AUC values are obtained by averaging over these 25 runs, as summarized in Table 2. Firstly, when reject cost d = 0.5, that is to say when our method does not reject any pairwise instances, our method achieve better AUC than OAMseq and OAMgd and the performance is comparable to FSAUC and OPAUC. Secondly, when reject cost d < 0.5, that is to say when our method begin to reject some uncertain pairwise instances, the AUC achieved goes up as reject cost d goes down. On the other hand, FSAUC, OPAUC, OAMseq and OAMgd have no ability to reject uncertain pairwise instances.

Comparisons with Rejection Approaches Conventional algorithms on classiﬁcation with a reject option abstain uncertain instances without prediction. Since our method is the ﬁrst to reject making a rank, to show the effectiveness of our method, we make comparisons with rejection approaches LR-R and SVM-BW in enhancing AUC:

LR-R: A rejection algorithm which is embedded with plug-in rule (logistic regression rule) to decide whether to reject an instance or not (Herbei and Wegkamp 2006);

SVM-BW: An SVM-based algorithm using a generalized hinge loss which allow the classiﬁer to reject instances according to the distance to the classiﬁcation hyperplane (Bartlett and Wegkamp 2008).

We evaluate the performance of AUCRO and conventional rejection methods on reject-rate v.s. AUC curve. This is inspired by (Nadeem, Zucker, and Hanczar 2009) that proposed reject-rate v.s. accuracy curve. In the comparisons of conventional rejection methods, the method getting higher accuracy under the same reject rate is better. More speciﬁcally, the more area under the curve, the better the method is. This way of comparison is similar to AUC comparison as AUC is the area under the ROC-AUC curve . So here we compare the area under reject-rate v.s. AUC curve to show the effectiveness of our method. For AUCRO, parameters are tuned in the same way as aforementioned. For LR-R, two trials of 5-fold crossvalidation is executed on training sets to decide the regularized parameter λ 2[ 10: 10]. For SVM-BW, we tune the regularization parameter λ 2[ 10: 10] and choose the linear kernel as kernel function. The performance of compared methods are evaluated by ﬁve trials of 5-fold cross-validation, where AUC values and corresponding reject rates are obtained by averaging over these 25 runs. The reject-rate v.s. AUC curves on benchmark datasets are drawn in Figure 1. It can be observed that when the same percentage of instances are rejected, our method achieves higher AUC and higher growth rate. When the reject rate is zero, though AUCRO is an online algorithm, it gets comparable AUC to the other two batch algorithms. Note that AUCs of the dataset w8a never grow up and change slowly in LR-R and SVM-BW. We think that w8a is a severely class-imbalanced dataset where instances in majority class is dozens of times more than in minority class such that conventional univariate rejection algorithms cannot improve AUC with rejecting instances. However, our method can process such severely imbalanced datasets well.

Figure 2: Inﬂuence of stepsize ηt

Figure 3: Inﬂuence of regularization parameter λ

Parameter Analysis We analyze the inﬂuence of parameters in this section. Since different reject cost d lead to curves of similar shape in the

Figure 4: Inﬂuence of iterations

analysis of ηt and λ, we only show the case d = 0.47 in the ﬁgure for simplicity. And due to page limit, we only present the results of several datasets for the study of each parameter, but the trends are similar on other datasets. Figure 2 shows that there is a relatively big range ηt [2 12, 2 4] where AUCRO achieves good results with d = 0.47. When ηt is set to values larger than 2 4, the performance drops off dramatically. Figure 3 shows that AUCRO is relatively not sensitive to the regularization parameter λ. It can be seen that when λ is set to a big value, AUCRO may get slightly higher AUC. That s because larger λ prefers a smaller w 2, accompanied with similar scores and higher reject rate. Figure 4 shows the inﬂuence of the number of iterations for AUCRO, FSAUC, OPAUC, OAMseq and OAMgd. When d = 0.5, AUCRO converges faster and more smoothly than FSAUC, OAMseq and OAMgd, and gets a comparable convergence rate to OPAUC. When d < 0.5, AUCRO converges even faster and there lies the difference that AUCRO achieves high AUC before the convergence due to the zero initialization of w which leads to high reject rates in the initial iterations.

This work introduces the framework for AUC optimization with a reject option. We present the Bayes rule for optimal ranking and interpret the design of the reject function. We then present an online algorithm for AUC optimization with a reject option based on plug-in rule and convex relaxation of surrogate loss. We verify the effectiveness of the proposed algorithm empirically and theoretically. An interesting work in the future is to consider different reject option into AUC optimization with tighter regret bound, and we could also apply reject option in the deep neural networks for ranking.

Bartlett, P. L., and Wegkamp, M. H. 2008. Classiﬁcation with a reject option using a hinge loss. Journal of Machine Learning Research 9:1823 1840. Bounsiar, A.; Grall, E.; and Beauseroy, P. 2006. A kernel based rejection method for supervised classiﬁcation. International Journal of Computational Intelligence 3(4):312 321.

Brefeld, U., and Scheffer, T. 2005. AUC maximizing support vector learning. In Proceedings of the ICML 2005 workshop on ROC Analysis in Machine Learning. Chow, C. K. 1957. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers EC-6(4):247 254. Chow, C. K. 1970. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16(1):41 46. Cortes, C., and Mohri, M. 2004. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 17, 313 320. Cortes, C.; De Salvo, G.; and Mohri, M. 2016. Learning with rejection. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, 67 82. Devroye, L.; Gy orﬁ, L.; and Lugosi, G. 2013. A Probabilistic Theory of Pattern Recognition. New York: Springer. Egan, J. P. 1975. Signal Detection Theory and ROCanalysis. New York: Academic press. Ferri, C.; Hern andez-Orallo, J.; and Flach, P. A. 2011. A coherent interpretation of AUC as a measure of aggregated classiﬁcation performance. In Proceedings of the 28th International Conference on Machine Learning, 657 664. Freund, Y.; Iyer, R.; Schapire, R. E.; and Singer, Y. 2003. An efﬁcient boosting algorithm for combining preferences. Journal of Machine Learning Research 4:933 969. Fumera, G., and Roli, F. 2002. Support vector machines with embedded reject option. In International Workshop on Support Vector Machines, 68 82. Springer. Gao, W., and Zhou, Z.-H. 2015. On the consistency of AUC pairwise optimization. In Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence, 939 945. Gao, W.; Wang, L.; Jin, R.; Zhu, S.; and Zhou, Z.-H. 2016. One-pass AUC optimization. Artiﬁcial Intelligence 236:1 29. Geifman, Y., and El-Yaniv, R. 2017. Selective classiﬁcation for deep neural networks. In Advances in Neural Information Processing Systems 30, 4878 4887. Golfarelli, M.; Maio, D.; and Malton, D. 1997. On the error-reject trade-off in biometric veriﬁcation systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7):786 796. Grandvalet, Y.; Rakotomamonjy, A.; Keshet, J.; and Canu, S. 2009. Support vector machines with a reject option. In Advances in Neural Information Processing Systems 22, 537 544. Hamid, K.; Asif, A.; Abbasi, W.; Sabih, D.; et al. 2017. Machine learning with abstention for automated liver disease diagnosis. In Proceedings of the 15th International Conference on Frontiers of Information Technology, 356 361. Hatami, N., and Chira, C. 2013. Classiﬁers with a reject option for early time-series classiﬁcation. In Proceedings of 2013 IEEE Symposium on Computational Intelligence and Ensemble Learning, 9 16.

Herbei, R., and Wegkamp, M. H. 2006. Classiﬁcation with reject option. Canadian Journal of Statistics 34(4):709 721. Herschtal, A., and Raskutti, B. 2004. Optimising area under the ROC curve using gradient descent. In Proceedings of the 21st International Conference on Machine Learning, 49. Joachims, T. 2005. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, 377 384. Liu, M.; Zhang, X.; Chen, Z.; Wang, X.; and Yang, T. 2018. Fast stochastic AUC maximization with o (1/n)-convergence rate. In Proceedings of the 35th International Conference on Machine Learning, 3195 3203. Marinho, L. B.; Filho, P. P. R.; Almeida, J. S.; Souza, J. W. M.; Junior, A. H. S.; and de Albuquerque, V. H. C. 2018. A novel mobile robot localization approach based on classiﬁcation with rejection option using computer vision. Computers & Electrical Engineering 68:26 43. Metz, C. E. 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8(4):283 298. Nadeem, M. S. A.; Zucker, J.-D.; and Hanczar, B. 2009. Accuracy-rejection curves (arcs) for comparing classiﬁcation methods with a reject option. In Machine Learning in Systems Biology, 65 81. Natole, Jr., M.; Ying, Y.; and Lyu, S. 2018. Stochastic proximal algorithms for AUC maximization. In Proceedings of the 35th International Conference on Machine Learning, 3710 3719. Nesterov, Y. 2003. Introductory Lectures on Convex Optimization: A Basic Course. Springer. Pietraszek, T. 2005. Optimizing abstaining classiﬁers using ROC analysis. In Proceedings of the 22nd International conference on Machine Learning, 665 672. Robertson, J. W.; Englehart, K. B.; and Scheme, E. J. 2018. Rejection of systemic and operator errors in a real-time myoelectric control task. In Proceedings of the 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 5640 5643. IEEE. Rudin, C., and Schapire, R. E. 2009. Margin-based ranking and an equivalence between adaboost and rankboost. Journal of Machine Learning Research 10:2193 2232. Tortorella, F. 2005. A ROC-based reject rule for dichotomizers. Pattern Recognition Letters 26(2):167 180. Yan, L.; Dodier, R. H.; Mozer, M.; and Wolniewicz, R. H. 2003. Optimizing classiﬁer performance via an approximation to the wilcoxon-mann-whitney statistic. In Proceedings of the 20th International Conference on Machine Learning, 848 855. Ying, Y.; Wen, L.; and Lyu, S. 2016. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems 29, 451 459. Zhao, P.; Hoi, S. C.; Jin, R.; and Yang, T. 2011. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning, 233 240.