# online_active_learning_of_reject_option_classifiers__cd6511d8.pdf The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Online Active Learning of Reject Option Classifiers Kulin Shah, Naresh Manwani Machine Learning Lab, KCIS, IIIT Hyderabad, India kulin.shah@students.iiit.ac.in, naresh.manwani@iiit.ac.in Active learning is an important technique to reduce the number of labeled examples in supervised learning. Active learning for binary classification has been well addressed in machine learning. However, active learning of the reject option classifier remains unaddressed. In this paper, we propose novel algorithms for active learning of reject option classifiers. We develop an active learning algorithm using double ramp loss function. We provide mistake bounds for this algorithm. We also propose a new loss function called double sigmoid loss function for reject option and corresponding active learning algorithm. We offer a convergence guarantee for this algorithm. We provide extensive experimental results to show the effectiveness of the proposed algorithms. The proposed algorithms efficiently reduce the number of label examples required. 1 Introduction In standard binary classification problems, algorithms return prediction on every example. For any misprediction, the algorithms incur a cost. Many real-life applications involve very high misclassification costs. Thus, for some confusing examples, not predicting anything may be less costly than any misclassification. The choice of not predicting anything for an example is called reject option in machine learning literature. Such classifiers are called reject option classifiers. Reject option classification is very useful in many applications. Consider a doctor diagnosing a patient based on the observed symptoms and preliminary diagnosis. If there is an ambiguity in observations and preliminary diagnosis, the doctor can hold the decision on the treatment. She can recommend to take advanced tests or consult a specialist to avoid the risk of misdiagnosing the patient. The holding response of the doctor is the same as to reject option for the specific patient (da Rocha Neto et al. 2011). On the other hand, the doctor s misprediction can cost huge money for further treatment or the life of a person. In another example, a banker can use the reject option while looking at the loan application of a customer (Rosowsky and Smith 2013). A banker may choose not to decide based on the information available because of high misclassification cost, Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. and asks for further recommendations or a credit bureau score from the stakeholders. Application of reject option classifiers include healthcare (Hanczar and Dougherty 2008; da Rocha Neto et al. 2011), text categorization (Fumera, Pillai, and Roli 2003), crowdsourcing (Li et al. 2017) etc. Let X Rd be the feature space and {+1, 1} be the label space. Examples of the form (x, y) are generated from an unknown fixed distribution on X {+1, 1}. A reject option classifier can be described with the help of a function f : X R and a rejection width parameter ρ R+ as below. hρ(f(x)) = 1.I{f(x)>ρ} 1.I{f(x)< ρ} 0.I{|f(x)| ρ} (1) The goal is to learn f(.) and ρ simultaneously. For a given example (x, y), the performance of reject option classifier hρ(f(.)) is measured using following loss function. Ld(yf(x), ρ) = I{yf(x) ρ} + d I{|f(x)| ρ} (2) where d (0, 0.5) is the cost of rejection. A reject option classifier is learnt by minimizing the risk (expectation of loss) under Ld. As Ld is not continuous, optimization of empirical risk under Ld is difficult. Bartlett and Wegkamp; Wegkamp and Yuan (2008; 2011) propose a convex surrogate of Ld called generalized hinge loss. They learn the reject option classifier using risk minimization algorithms based on generalized hinge loss. Grandvalet et al. (2008) propose another convex surrogate of Ld called double hinge loss and corresponding risk minimization approach for reject option classification. Manwani et al.; Shah and Manwani (2015; 2019) propose double ramp loss based approaches for reject option classification. Double ramp loss is a non-convex bounded loss function. All these approaches assume that we have plenty of labeled data available. In general, classifiers learned with a large amount of training data can give better generalization on testing data. However, in many real-life applications, it can be costly and difficult to get a large amount of labeled data. Thus, in many cases, it is desirable to ask the labels of the examples selectively. This motivates the idea of active learning. Active learning selects more informative examples and queries labels of those examples. Active learning of standard binary classifiers has been well-studied (Dasgupta, Kalai, and Monteleoni 2009; Bachrach, Fine, and Shamir 1999; Tong and Koller 2002). In El-Yaniv and Wiener (2012), authors reduce active learning for the usual binary classification problem to learning a reject option classifier to achieve faster convergence rates. However, active learning of reject option classifiers has remained an unaddressed problem. In this paper, we propose online active learning algorithms to reject option classification. Let us reconsider the example where the banker uses the reject option classifier for selecting the loan applications. Consider a loan application that satisfies the basic requirements. Thus, the banker is not clear about using the hold option. On the other hand, she is also not sure enough to approve the application. Such cases are instrumental in defining the separation rule between accepting the loan application and holding it for further investigation. This motivates us to think that one can use active learning to ask the labels of selective examples as described above while learning the reject option classifier. A broad class of active learning algorithms is inspired by the concept of a margin between the two categories. Thus, an example, which falls in the margin area of the current classifier, carries more information about the decision boundary. On the other hand, examples which are correctly classified with good margin or misclassified by a good margin, give less knowledge of the decision boundary. Margin examples can bring more changes to the existing classifier. Thus, querying the label of margin examples is more desirable than the other two kinds of examples. A reject option classifier can be viewed as two parallel surfaces with the rejection area in between. Thus, active learning of the reject option classifier becomes active learning of two surfaces in parallel with a shared objective. This shared objective is nothing but to minimize the sum of Ld losses over a sequence of examples. In (Manwani et al. 2015), the authors propose a risk minimization approach based on double ramp loss (Ldr) for learning the reject option classifier. In (Manwani et al. 2015), it is shown that at the optimality, the two surfaces can be represented using only those examples which are close to them. Examples that are far from the two surfaces do not participate in the representation of the surfaces. This motivates us to use double ramp loss for developing an active learning approach to reject option classifiers. Our Contributions We make the following contributions in this paper. We propose an active learning algorithm based on double ramp loss Ldr to learn a linear and non-linear classifier. We give bounds to the number of rejected examples and misclassification rates for un-rejected examples. We propose a smooth non-convex loss called double sigmoid loss (Lds) for reject option classification. We propose an active learning algorithm based on Lds to learn both linear and non-linear classifiers. We also give convergence guarantees for the proposed algorithm. We present extensive simulation results for both proposed active learning algorithms for linear as well as non-linear classification boundaries. Figure 1: Double Ramp Loss with ρ = 2 2 Proposed Approach: Active Learning Inspired by Double Ramp Loss Active learning algorithm does not ask the label in every trial. We denote the instance presented to algorithm at trial t by xt. Each xt X is associated with a unique label yt { 1, 1}. The algorithm calculates ft(xt) and outputs the decision using eq.(1). Based on ft(xt), the active learning algorithm decides whether to ask label or not. (Guillory, Chastain, and Bilmes 2009) shows that online active learning algorithms can be viewed as stochastic gradient descent on non-convex loss function therefore, we use a non-convex loss function Double ramp loss Ldr (Manwani et al. 2015) to derive our first active learning approach. Ldr is defined as follows. Ldr(yf(x), ρ) = d 1 yf(x) + ρ + 1 yf(x) + ρ +(1 d) 1 yf(x) ρ + 1 yf(x) ρ Here [a]+ = max(0, a) and d is the cost of rejection. Figure 1 shows the plot of double ramp loss for ρ = 2. We first consider developing active learning algorithm for linear classifiers (i.e. f(x) = w x). We use stochastic gradient descent (SGD) to derive double ramp loss based active learning algorithm. Parameters update equations using SGD are as follows. wt+1 = wt η wt Ldr(ytf(xt), ρt) wt + ηdytxt, ρt 1 ytf(xt) ρt + 1 wt + η(1 d)ytxt ρt 1 ytf(xt) ρt + 1 wt otherwise ρt+1 = ρt η ρt Ldr(ytf(xt), ρt) ρt ηd, ρt 1 ytf(xt) ρt + 1 ρt + η(1 d), ρt 1 ytf(xt) ρt + 1 ρt otherwise Where η is the step-size. We see that the parameters are updated only when |ft(xt)| [ρt 1, ρt + 1]. For the rest of the regions, the gradient of the loss Ldr is zero therefore, there won t be any update when an example xt is such that |ft(xt)| / [ρt 1, ρt + 1]. Thus, there is no need to query the label when |ft(xt)| / [ρt 1, ρt + 1]. We only query the labels when |ft(xt)| [ρt 1, ρt + 1]. Thus, we ask the label of the current example only if it falls in the linear region of the loss Ldr. This is the same way any margin based active learning approach updates the parameters. If the algorithm does not query the label yt, the parameters (w, ρ) are not updated. Thus, we define the query function Qt as follows. Qt = 1 if ρt 1 |f(xt)| ρt + 1 0 otherwise (3) The detailed algorithm is given in Algorithm 1. We call it DRAL (double ramp loss based active learning). DRAL can be easily extended for learning nonlinear classifiers using kernel trick and is described in the supplementary file. Algorithm 1 Double Ramp Loss Active Learning (DRAL) Input: d (0, 0.5), step size η Output: Weight vector w, Rejection width ρ Initialize: w1 = 0, ρ1 = 1 for t = 1, . . . , T do Sample xt S Set ft(xt) = wt xt if ρt 1 |ft(xt)| ρt + 1 then Set Qt = 1 Query the label yt of xt. if (ρt 1 ytft(xt) ρt + 1) then wt+1 = wt + ηdytxt. ρt+1 = ρt ηd else if ( ρt 1 ytft(xt) ρt + 1) then wt+1 = wt + η(1 d)ytxt ρt+1 = ρt + η(1 d) else wt+1 = wt ρt+1 = ρt Mistake Bounds for DRAL In this section, we derive the mistake bounds of DRAL. Before presenting the mistake bounds, we begin by presenting a lemma which would facilitate the following mistake bound proofs. Let ft(xt) = wt xt. We define the following.1 Ct = I{ρt ytft(xt) ρt+1} R1t = I{ρt 1 ytft(xt) ρt} R2t = I{ ρt ytft(xt) ρt+1} Mt = I{ ρt 1 ytft(xt) ρt} (4) Lemma 1. Let (x1, y1), . . . , (x T , y T ) be a sequence of input instances, where xt X and yt { 1, 1} for all t [T].2 Given Ct, R1t, R2t and Mt as defined in eq.(4) and α > 0, the following bound holds for any w. α2 w 2 + (1 αρ)2 + 2αη t=1 Ldr(ytf(xt), ρ) t=1 [ Ct + R1t ] 2αηd + 2η(Ldr(ytft(xt), ρt) d) η2d2( xt 2 + 1) + t=1 [ R2t + Mt ] 2αη(1 + d) +2η(Ldr(ytft(xt), ρt) d 1) η2(1 d)2( xt 2 + 1) 1I{A} takes value 1 when A is true and 0 otherwise. 2Here, [T] denotes the sequence 1, . . . , T. where f(xt) = w xt and ft(xt) = wt xt. The proof is given in the Supplementary file. Now, we will find the bounds on rejection rate and mis-classification rate. Theorem 2. Let (x1, y1), . . . , (x T , y T ) be a sequence of input instances, where xt X and yt { 1, 1} and xt R for all t [T]. Assume that there exists a f(x) = w x and ρ such that Ldr(ytf(xt), ρ) = 0 for all t [T]. 1. Number of examples rejected by DRAL (Algorithm 1) among those for which the label was asked in this sequence is upper bounded as follows. t:Qt=1 [R1t + R2t] α2 w 2 + (1 αρ)2 where α = max 1+η2d2(R2+1)+2ηd 2ηd , 1+η2(1 d)2(R2+1)+2η(1 d) . 2. Number of examples mis-classified by DRAL (Algorithm 1) among those for which the label was asked in this sequence is upper bounded as follows. t:Qt=1 Mt α2 w 2 + (1 αρ)2 where α = max ηd(R2+1)+2 2 , 1+η2(1 d)2(R2+1)+2η(1 d) The proof is given in the Supplementary file. The above theorem assumes that there exists f(x) = w x and ρ such that Ldr(ytf(xt), ρ) = 0 for all t [T]. This means that the data is linearly separable. In such a case, the number of mistakes made by the algorithm on unrejected examples as well as the number of rejected examples are upper bounded by a complexity term and are independent of T. Now, we derive the bounds when the assumption Ldr(ytf(xt), ρ) = 0, t [T] does not hold for any f(x) = w x and ρ. Theorem 3. Let (x1, y1), (x2, y2), . . . , (x T , y T ) be a sequence of input instances, where xt X and yt { 1, 1} and xt R for all t [T]. Then, for any given f(x) = w x and ρ, we observe the following. 1. Number of rejected examples by DRAL (Algorithm 1) among those for which the label was asked in this sequence is upper bounded as follows. t:Qt=1 [R1t + R2t] α2 w 2 + (1 αρ)2 + t=1 2ηαLdr(ytf(xt), ρ) where α = max 1+η2d2(R2+1)+2ηd 2ηd 1+η2(1 d)2(R2+1)+2η(1 d) 2. The number of misclassified examples by DRAL (Algorithm 1) is upper bounded as follows. t:Qt=1 Mt α2 w 2 + (1 αρ)2 + t=1 2ηαLdr(ytf(xt), ρ) where α = max 2 1+η2(1 d)2(R2+1)+2η(1 d) The proof is given in the Supplementary file. We see that when the data is not linearly separable, the number of mistakes made by the algorithm is upper bounded by the sum of complexity term and sum of the losses using a fixed classifier. 3 Active Learning Using Double Sigmoid Loss Function We observe that double ramp loss is not smooth. Moreover, Ldr is constant whenever yf(x) [ρ + 1, ) ( , ρ 1] [ ρ + 1, ρ 1]. Thus, when loss Ldr for an example x falls in any of these three regions, the gradient of the loss becomes zero. The zero gradient causes no update. Thus, there is no benefit of asking the labels when an example falls in one of these regions. However, we don t want to ignore these regions completely. To capture the information in these regions, we need to change the loss function in such a way that the gradient does not vanish completely in these regions. To ensure that, we propose a new loss function. Double Sigmoid Loss We propose a new loss function for reject option classification by combining two sigmoids as follows. We call it double sigmoid loss function Lds. Lds(yf(x), ρ) = 2dσ(yf(x) ρ) + 2(1 d)σ(yf(x) + ρ) where σ(a) = (1 + eγa) 1 is the sigmoid function (γ > 0). Figure 2 shows the double sigmoid loss function. Lds is a smooth non-convex surrogate of loss Ld (see eq.(2)). We also see that for the double sigmoid loss, the gradient in the regions yf(x) [ρ+1, ) ( , ρ 1] [ ρ+1, ρ 1] does not vanish unlike double ranp loss. Below we establish Figure 2: Double sigmoid loss with γ = 2. that the loss Lds is β-smooth.3 Lemma 4. Assuming x R, Double sigmoid loss Lds(yf(x), ρ) is β smooth with constant β = γ2 The proof is given in the supplementary file. Query Probability Function In the case of DRAL, we saw that the gradient of Ldr becomes nonzero only in the region yf(x) [ρ 1, ρ + 1]. So, we ask the labels only when examples fall in this region. However, in case of double sigmoid loss, the gradient does 3A function f is β-smooth if for all x, y Domain(f), f(x) f(y) β x y . not vanish. Thus, to perform active learning using Lds, we need to ask the labels selectively. We propose a query probability function to set the label query probability at trial t. The query probability function should carry the following properties. In the loss Ld (see eq.(2)), we see two transitions. One at yf(x) = ρ (transition between correct classification and rejection) and another at yf(x) = ρ (transition between rejection and misclassification). Any example falling closer to one of these transitions captures more information about the two transitions. We want the query probability function to be such that it gives higher probabilities near these transitions. Examples that are correctly classified with a good margin, examples misclassified with a considerable margin, and examples in the middle of the reject region do not carry much information. Such examples are also situated away from the transition regions. Thus, query probability should decrease as we move away from these decision boundaries. Therefore, we ask the label in these regions with less probability. Considering these desirable properties, we propose the following query probability function. pt = 4 σ(|ft(xt)| ρt) (1 σ(|ft(xt)| ρt)) (5) where ft(xt) = wt xt. Figure 3 shows the graph of the query probability function. We see that the probability function has two peaks. One peak is at yf(x) = ρ (transition between correct classification and rejection) and another at yf(x) = ρ (transition between rejection and misclassification). Figure 3: Query Probability Function Double Sigmoid Based Parameter Updates The parameter update equations using Lds is as follows. wt+1 = wt η wt Lds(ytf(xt), ρt) = wt 2ytαxt dσ(ytft(xt) ρt) (1 σ(ytft(xt) ρt)) + (1 d)σ(ytft(xt) + ρt) (1 σ(ytft(xt) + ρt)) (6) ρt+1 = ρt η ρt Lds(ytf(xt), ρt) = ρt + 2α dσ(ytft(xt) ρt) (1 σ(ytft(xt) ρt)) (1 d)σ(ytft(xt) + ρt) (1 σ(ytft(xt) + ρt)) (7) Now, we will explain the update equations for w and ρ. 1. When an example is correctly classified with good margin (i.e. ytft(xt) >> 0) then the active learning algorithm will update w by a small factor of ytxt and will reduce the rejection width (ρ) because for ytft(xt) >> 0, dσ(ytft(xt) ρt) (1 σ(ytft(xt) ρt)) > (1 d)σ(ytft(xt) + ρt) (1 σ(ytft(xt) + ρt)). 2. When an example is misclassified with good margin (i.e. ytft(xt) << 0) then the active learning algorithm will update w by a large factor of ytxt and will increase the rejection width (ρ) because for ytft(xt) << 0, dσ(ytft(xt) ρt) (1 σ(ytft(xt) ρt)) < (1 d)σ(ytft(xt) + ρt) (1 σ(ytft(xt) + ρt)). We use the acronym DSAL for double sigmoid based active learning. DSAL is described in Algorithm 2. Algorithm 2 Double Sigmoid Loss Active Learning (DSAL) Input: d (0, 0.5), step size η Output: Weight vector w, Rejection width ρ. Initialize: w1, ρ1 for t = 1, .., T do Sample xt Rd Set ft(xt) = wt xt Set pt = 4σ(|ft(xt)| ρt) (1 σ(|ft(xt)| ρt)) Randomly sample zt {0, 1} from Bernoulli(pt). if zt == 1 then Query the label yt of xt. Find wt+1 using eq.(6). Find ρt+1 using eq.(7). else wt+1 = wt. ρt+1 = ρt. Convergence of DSAL In the case of DRAL, the mistake bound analysis was possible as Ldr increases linearly in the regions where its gradient is nonzero. However, we don t see similar behavior in double sigmoid loss Lds. Thus, we are not able to carry out the same analysis here. Instead, we here show the convergence of DSAL to local minima. For which, we borrow the techniques from online non-convex optimization. In online non-convex optimization, it is challenging to converge towards a global minimizer. It is a common practice to state the convergence guarantee of an online non-convex optimization algorithm by showing it s convergence towards an ϵ-approximate stationary point. In our case, it means that for some t, Lds(ytft(xt), ρt) 2 ϵ. To prove the convergence of DSAL, we use the notion of local regret defined in (Hazan, Singh, and Zhang 2017) . Definition 5. The local regret for an online algorithm is t=1 Lds(ytft(xt), ρt) 2. where T is the total number of trials. (Defined in (Hazan, Singh, and Zhang 2017)) Thus, in each trial, we incur a regret, which is the squared norm of the gradient of the loss. When we reach a stationary point, the gradient will vanish and hence the norm. Note that the convergence here requires that the objective function should be β-smooth. In this case, Lds holds that property, as shown in Lemma 4. Thus, we can use the convergence approach proposed in (Hazan, Singh, and Zhang 2017).4 Theorem 6. If we choose η = 5 γ2 R2+1 , then using smooth- ness condition of Lds(yf(x), ρ), the local regret of DSAL algorithm is bounded as follows. 5 R2 + 1 (T + 1) The proof is given in the supplementary file. To prove that DSAL reaches ϵ stationary point in expectation over iterates, we use following result of (Hazan, Singh, and Zhang 2017). E t Unif[T ] Lds(yf(x), ρ) 2 R(T) Corollary 7. For DSAL algorithm, E t Unif[T ] Lds(yf(x), ρ) 2 4γ2 5 R2 + 1 1 + 1 Using theorem 6 and eq. (8), we can get the required result of the Corollary. In the Corollary, We see that upper bound on the expectation of the square of the gradient is inversely proportional to T; hence, decreases as the total number of trials T increases. It means that the probability of DSAL algorithm reaches to ϵ stationary point increases as T increases. 4 Experiments We show the effectiveness of the proposed active learning approaches on Gisette, Phishing and Guide datasets available on UCI ML repository (Lichman 2013). Experimental Setup We evaluate the performance of our approaches to learning linear classifiers. In all our simulations, we initialize step size by a small value, and after every trial, step size decreases by a small constant. Parameter α in the double sigmoid loss function is chosen to minimize the average risk and average fraction of queried labels (averaged over 100 runs). We need to show that the proposed active learning algorithms are effectively reducing the number of labeled examples required while achieving the same accuracy as online learning. Thus, we compare the active learning approaches with an online algorithm that updates the parameters using gradient descent on the double sigmoid loss at every trial. We call this online algorithm as DSOL (double sigmoid loss based online learning). 4Ldr does not have sufficient smoothness properties required in (Hazan, Singh, and Zhang 2017). Thus, we do not present these convergence results for DRAL. Figure 4: Comparison plots for Gisette dataset with linear Kernel function. Simulation Results We report the results for three different values of d {0.1, 0.25, 0.4}. The results provided here are based on 100 repetitions of a total number of trial (T) equal to 10000. For every value of d, we find the average of risk, the fraction of asked labels, fraction of misclassified examples, and fraction of rejected examples over 100 repetitions. We plotted the average of each quantity (e.g., risk, the fraction of asked labels, etc.) as a function of t [T]. Moreover, the standard deviation of the quantity is denoted by error bar in figures. Figure 4, 5 and 6 show experimental results for Gisette and Phishing and Guide datasets. We observe the following. Label Complexity Versus Risk: The first column in each figure shows how the risk goes down with the number of asked labels. For Gisette and Phishing datasets, given the number of queried labels, both DSAL and DRAL achieve lower risk compared to DSOL. For Guide dataset, DSAL always makes lower risk compared to DSOL for a given number of queried labels. For Gisette and Guide datasets, DSAL achieves lower risk compared to DRAL with the same number of label queries. For Phishing dataset, DSAL and DRAL perform comparably. Average Risk: The second column in all the figures shows how the average risk (average of Ld) goes down with the number of steps (t). In all the cases, we see that the risk increases with increasing the value of d. We understand that the average risk of DSAL is higher than DRAL for Gisette and Phishing datasets and all values of d. For Guide dataset, DSAL always achieves lower risk compared to DRAL. For Gisette and Guide datasets, DSAL achieves similar risk as DSOL. For Phishing dataset, DSOL performs marginally better than DSAL and DRAL. DRAL does better risk minimization compared to DSOL for Phishing dataset. For Guide dataset, DRAL performs comparable to DSOL as t becomes larger except for d = 0.1. Average Fraction of Asked Labels: Third column in all the figures show the fraction of labels asked for a given time step t. We observe that the fraction of asked labels decreases with increasing d. For Gisette and Phishing datasets, DSAL asks significantly less number of labels as DRAL. This happens because DRAL asks labels every time in a specific region and completely ignores other regions, but DSAL asks labels in every region with some probability. For Guide dataset, the fraction of labels asked to become the same for both DSAL and DRAL as t becomes larger. Average Fraction of Misclassified Examples: The fourth column of all the figures, shows how the average fraction of misclassified examples goes down with t. We observe that the misclassification rate goes up with increasing d. We see that DRAL achieves a minimum average misclassification rate in all the cases compared to DSOL and DSAL except for the Guide dataset with d = 0.1 value. For Gisette and Phishing datasets, DSAL achieves a comparable average misclassification rate compared to DSOL for all the cases. For Guide dataset, DSAL achieves a lower misclassification rate compared to DSOL except for d = 0.1. Average Fraction of Rejected Examples: The fifth column in each figure shows how the rejection rate goes down with steps t. We see that the average fraction of rejected examples is higher in DRAL than DSAL and DSOL. Also, the rejection rate decreases with increasing d. Thus, we see that the proposed active learning algorithms DRAL and DSAL effective reduce the number of labels required for learning the reject option classifier and perform better compared to online learning. 5 Conclusion In this paper, we have proposed novel active learning algorithms DRAL and DSAL. We presented mistake bounds for Figure 5: Comparison plots for Phishing dataset with linear Kernel function. Figure 6: Comparison plots for Guide dataset with polynomial kernel function. DRAL and convergence results for DSAL. We experimentally show that the proposed active learning algorithms reduce the number of labels required while maintaining a similar performance as online learning. References Bachrach, R.; Fine, S.; and Shamir, E. 1999. Query by committee, linear separation and random walks. In Proceedings of the 4th European Conference on Computational Learning Theory, Euro COLT 99, 34 49. Bartlett, P. L., and Wegkamp, M. H. 2008. Classification with a reject option using a hinge loss. Journal of Machine Learning Research. 9:1823 1840. da Rocha Neto, A. R.; Sousa, R.; de A. Barreto, G.; and Cardoso, J. S. 2011. Diagnostic of pathology on the vertebral column with embedded reject option. In Pattern Recognition and Image Analysis, 588 595. Dasgupta, S.; Kalai, A. T.; and Monteleoni, C. 2009. Analysis of perceptron-based active learning. J. Mach. Learn. Res. 10:281 299. El-Yaniv, R., and Wiener, Y. 2012. Active learning via perfect selective classification. J. Mach. Learn. Res. 13(1):255 279. Fumera, G.; Pillai, I.; and Roli, F. 2003. Classification with reject option in text categorisation systems. In 12th International Conference on Image Analysis and Processing, 2003.Proceedings., 582 587. Grandvalet, Y.; Rakotomamonjy, A.; Keshet, J.; and Canu, S. 2008. Support vector machines with a reject option. In Advances in Neural Information Processing Systems (NIPS), 537 544. Guillory, A.; Chastain, E.; and Bilmes, J. 2009. Active learning as non-convex optimization. In van Dyk, D., and Welling, M., eds., Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, 201 208. Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR. Hanczar, B., and Dougherty, E. R. 2008. Classification with reject option in gene expression data. Bioinformatics 24(17):1889 1895. Hazan, E.; Singh, K.; and Zhang, C. 2017. Efficient regret minimization in non-convex games. Co RR abs/1708.00075. Li, Q.; Vempaty, A.; Varshney, L.; and Varshney, P. 2017. Multi-object classification via crowdsourcing with a reject option. IEEE Transactions on Signal Processing 65(4):1068 1081. Lichman, M. 2013. UCI machine learning repository. Manwani, N.; Desai, K.; Sasidharan, S.; and Sundararajan, R. 2015. Double ramp loss based reject option classifier. In 19th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), 151 163. Rosowsky, Y. I., and Smith, R. E. 2013. Rejection based support vector machines for financial time series forecasting. In Proceedings of International Joint Conference on Neural Networks (IJCNN), 1161 1167. Shah, K., and Manwani, N. 2019. Sparse reject option classifier using successive linear programming. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). Tong, S., and Koller, D. 2002. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2:45 66. Wegkamp, M., and Yuan, M. 2011. Support vector machines with a reject option. Bernoulli 17(4):1368 1385.