# Classification with Rejection Based on Cost-sensitive Classification

Nontawat Charoenphakdee 1 2, Zhenghang Cui 1 2, Yivan Zhang 1 2, Masashi Sugiyama 2 1

1 The University of Tokyo, Tokyo, Japan. 2 RIKEN AIP, Tokyo, Japan. Correspondence to: Nontawat Charoenphakdee.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

The goal of classification with rejection is to avoid risky misclassification in error-critical applications such as medical diagnosis and product inspection. In this paper, based on the relationship between classification with rejection and cost-sensitive classification, we propose a novel method of classification with rejection by learning an ensemble of cost-sensitive classifiers, which satisfies all of the following properties: (i) it can avoid estimating class-posterior probabilities, resulting in improved classification accuracy; (ii) it allows a flexible choice of losses, including non-convex ones; (iii) it does not require complicated modifications when using different losses; (iv) it is applicable to both binary and multiclass cases; and (v) it is theoretically justifiable for any classification-calibrated loss. Experimental results demonstrate the usefulness of our proposed approach in clean-labeled, noisy-labeled, and positive-unlabeled classification.

1. Introduction

In ordinary classification, a classifier learned from training data is expected to accurately predict a label for every possible test input in the input space. However, when a particular test input is difficult to classify, forcing a classifier to always predict a label can lead to misclassification, causing serious trouble in risk-sensitive applications such as medical diagnosis, home robotics, and product inspection (Cortes et al., 2016a; Geifman & El-Yaniv, 2017; Ni et al., 2019). To cope with this problem, classification with rejection was proposed as a learning framework that allows a classifier to abstain from making a prediction (Chow, 1957; 1970; Bartlett & Wegkamp, 2008; El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017; Cortes et al., 2016a;b; Yuan & Wegkamp, 2010; Franc & Prusa, 2019), so that we can prevent misclassification in critical applications. A well-known framework for classification with rejection that has been studied extensively is called the cost-based framework (Chow, 1970; Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Cortes et al., 2016a;b; Franc & Prusa, 2019; Ni et al., 2019). In this setting, we set a pre-defined rejection cost to be less than the misclassification cost. As a result, a classifier trained in this framework prefers to reject rather than make a risky prediction. Two main approaches currently exist, as described below.

The first approach is called the confidence-based approach, where we train a classifier and then use an output of the classifier as a confidence score (Bartlett & Wegkamp, 2008; Grandvalet et al., 2009; Herbei & Wegkamp, 2006; Yuan & Wegkamp, 2010; Ramaswamy et al., 2018; Ni et al., 2019). In this approach, we manually set a confidence threshold and refrain from making a prediction if the confidence score of a test input is lower than the threshold.
Most confidence-based methods rely on a loss that can estimate class-posterior probabilities (Yuan & Wegkamp, 2010; Reid & Williamson, 2010; Ni et al., 2019), which can be difficult to estimate, especially when using deep neural networks (Guo et al., 2017). Although there are some exceptions that can avoid estimating class-posterior probabilities, most of them are only applicable to binary classification (Bartlett & Wegkamp, 2008; Grandvalet et al., 2009; Manwani et al., 2015).

The second approach is called the classifier-rejector approach, where we simultaneously train a classifier and a rejector (Cortes et al., 2016a;b; Ni et al., 2019). It is known that this approach has theoretical justification in the binary case only for the exponential and hinge-based losses (Cortes et al., 2016a;b). This is because the proof technique relies heavily on the functional form of the loss (Cortes et al., 2016a;b). In the multiclass case, Ni et al. (2019) argued that this approach is not suitable both theoretically and experimentally, since the multiclass extension of Cortes et al. (2016b) is not calibrated and the confidence-based softmax cross-entropy loss can outperform this approach in practice.

The goal of this paper is to develop an alternative approach to classification with rejection that achieves the following four design goals. First, it can avoid estimating class-posterior probabilities, since this often yields degradation of classification performance. Second, the choice of losses is flexible and does not require complicated modifications when using different losses, which allows a wider range of applications. Third, it is applicable to both binary and multiclass cases. Fourth, it can be theoretically justified. In this paper, we show that this goal can be achieved by bridging the theory of cost-sensitive classification (Elkan, 2001; Scott, 2012; Steinwart, 2007) and classification with rejection. The key observation that allows us to connect the two problems is the fact that one can mimic the Bayes optimal solution of classification with rejection by only knowing argmax_y p(y|x) and whether max_y p(y|x) > 1 − c, where c is the rejection cost. Based on this observation, we propose the cost-sensitive approach, whose calibration can be guaranteed for any classification-calibrated loss (Zhang, 2004; Bartlett et al., 2006). Classification-calibration is known to be a minimum requirement for a loss in ordinary classification (Bartlett et al., 2006). This suggests that the loss choices of our proposed approach are as flexible as those of ordinary classification. To emphasize the importance of having a flexible loss choice, we explore the usage of our approach for classification from positive and unlabeled data (PU-classification) (du Plessis et al., 2014; 2015; Kiryo et al., 2017) and classification from noisy labels (Angluin & Laird, 1988; Ghosh et al., 2015). Our experimental results show that the family of symmetric losses, which are losses that cannot estimate class-posterior probabilities (Charoenphakdee et al., 2019), can be advantageous in these settings. We also provide experimental results for clean-labeled classification with rejection to illustrate the effectiveness of the cost-sensitive approach.

2. Preliminaries

In this section, we introduce the problem setting of classification with rejection.
Then, we review cost-sensitive binary classification, which will be essential for deriving the proposed cost-sensitive approach for classification with rejection.

2.1. Classification with Rejection

Our problem setting follows the standard cost-based framework of classification with rejection (Chow, 1970; Cortes et al., 2016b; Ni et al., 2019). Let X be an input space and Y = {1, . . . , K} be an output space, where K denotes the number of classes. Note that we adopt the conventional notation Y = {−1, +1} when considering binary classification (Bartlett et al., 2006). In this problem, we are given training input-output pairs {(x_i, y_i)}_{i=1}^n drawn i.i.d. from an unknown probability distribution with density p(x, y). A classification rule for learning with rejection is f : X → {1, . . . , K, ®}, where ® denotes rejection. Let c ∈ (0, 0.5) be the rejection cost. Unlike ordinary classification, where the zero-one loss ℓ_{01}(f(x), y) = 1[f(x) ≠ y]¹ is the performance measure, we are interested in an extension of ℓ_{01} called the zero-one-c loss ℓ_{01c}, defined as follows (Ni et al., 2019):

$$\ell_{01c}(f(x), y) = \begin{cases} c & f(x) = ®,\\ \ell_{01}(f(x), y) & \text{otherwise}.\end{cases}$$

The goal is to find a classification rule f that minimizes the expected risk with respect to ℓ_{01c}, i.e.,

$$R_{\ell_{01c}}(f) = \mathbb{E}_{(x,y)\sim p(x,y)}\left[\ell_{01c}(f(x), y)\right]. \quad (1)$$

In classification with rejection, a classification rule f is allowed to refrain from making a prediction and will receive a fixed rejection loss c. In this paper, following most existing studies (Cortes et al., 2016a;b; Ramaswamy et al., 2018; Ni et al., 2019), we consider the case where c < 0.5. Intuitively, this case implies that it is strictly better to reject if a classifier has less than half a chance of being correct. Thus, the case where c < 0.5 is suitable if the goal is to avoid harmful misclassification. We refer the readers to Ramaswamy et al. (2018) for more discussion on the case where c ≥ 0.5 and how it is fundamentally different from the case where c < 0.5.

Next, let us define η(x) = [η_1(x), . . . , η_K(x)]^T, where η_y(x) = p(y|x) denotes the class-posterior probability of a class y. The optimal solution for classification with rejection f* = argmin_f R_{ℓ01c}(f), known as Chow's rule (Chow, 1970), can be expressed as follows:

Definition 1 (Chow's rule (Chow, 1970)). The optimal solution of multiclass classification with rejection f* = argmin_f R_{ℓ01c}(f) can be expressed as

$$f^*(x) = \begin{cases} ® & \max_y \eta_y(x) \le 1 - c,\\ \arg\max_y \eta_y(x) & \text{otherwise}.\end{cases}$$

Chow's rule suggests that classification with rejection is solved if we have knowledge of η(x). Therefore, one approach is to estimate η(x) from training examples. This method belongs to the family of confidence-based approaches, which have been extensively studied in both the binary (Yuan & Wegkamp, 2010) and multiclass cases (Ni et al., 2019). Figure 1 illustrates the confidence-based approach.

¹ 1[·] denotes the indicator function.

Figure 1. Illustration of the confidence-based approach. Figure (a) denotes a prediction function. The rejector in figure (b) has a rejection region that spreads from the decision boundary of the prediction function. The width of the rejection region depends on the choice of the rejection threshold parameter. Data points in purple are rejected.
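As a concrete illustration of Definition 1 and the risk in Eq. (1), the following minimal NumPy sketch (our own illustration, not code from the paper; the `REJECT` sentinel and function names are ours) applies Chow's rule to estimated class-posterior probabilities and evaluates the empirical zero-one-c risk:

```python
import numpy as np

REJECT = -1  # sentinel standing in for the rejection symbol (R)

def chow_rule(posteriors: np.ndarray, c: float) -> np.ndarray:
    """Chow's rule (Definition 1): reject when max_y eta_y(x) <= 1 - c.

    posteriors: array of shape (n, K) with rows eta(x) = [eta_1(x), ..., eta_K(x)].
    """
    confident = posteriors.max(axis=1) > 1.0 - c
    labels = posteriors.argmax(axis=1)
    return np.where(confident, labels, REJECT)

def zero_one_c_risk(preds: np.ndarray, labels: np.ndarray, c: float) -> float:
    """Empirical version of Eq. (1): cost c for rejection, 0-1 loss otherwise."""
    rejected = preds == REJECT
    losses = np.where(rejected, c, (preds != labels).astype(float))
    return float(losses.mean())

# Toy usage: three inputs, K = 3 classes, rejection cost c = 0.2.
eta = np.array([[0.9, 0.05, 0.05],   # max posterior > 0.8 -> predict class 0
                [0.5, 0.30, 0.20],   # max posterior <= 0.8 -> reject
                [0.1, 0.70, 0.20]])  # max posterior <= 0.8 -> reject
preds = chow_rule(eta, c=0.2)
print(preds, zero_one_c_risk(preds, np.array([0, 1, 1]), c=0.2))
```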
2.2. Cost-sensitive Binary Classification

Consider binary classification where y ∈ {−1, +1}. In ordinary classification, the false positive and false negative costs are treated equally. On the other hand, in cost-sensitive classification, the false positive and false negative costs are generally unequal (Elkan, 2001; Saerens et al., 2002; Scott, 2012). Without loss of generality, we define α ∈ (0, 1) to be the false positive cost and 1 − α to be the false negative cost. Then, the expected cost-sensitive risk can be expressed as

$$R^{\ell_{01}}_{\alpha}(f) = (1-\alpha)\,\pi\, \mathbb{E}_{x\sim p(x|y=+1)}\left[\ell_{01}(f(x), +1)\right] + \alpha\,(1-\pi)\, \mathbb{E}_{x\sim p(x|y=-1)}\left[\ell_{01}(f(x), -1)\right],$$

where π = p(y = +1) denotes the class prior. It is known that the Bayes optimal cost-sensitive binary classifier can be expressed as follows:

Definition 2 (Scott (2012)). The optimal cost-sensitive classifier f*_α = argmin_f R^{ℓ01}_α(f) can be expressed as

$$f^*_\alpha(x) = \begin{cases} +1 & p(y=+1|x) > \alpha,\\ -1 & \text{otherwise}.\end{cases}$$

Note that when α = 0.5, the Bayes optimal solution f*_{0.5}(x) coincides with that of ordinary binary classification. Moreover, when α is known, cost-sensitive binary classification is solved if we have access to the class-posterior probability p(y = +1|x).
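As a quick numeric illustration of Definition 2 (our own sketch, with hypothetical posterior values), thresholding the class-posterior probability at α recovers the Bayes optimal cost-sensitive classifier:

```python
import numpy as np

def f_alpha(p_pos: np.ndarray, alpha: float) -> np.ndarray:
    """Definition 2: predict +1 iff p(y=+1|x) > alpha, and -1 otherwise."""
    return np.where(p_pos > alpha, 1, -1)

p_pos = np.array([0.95, 0.60, 0.30, 0.10])   # hypothetical posteriors
print(f_alpha(p_pos, alpha=0.5))   # [ 1  1 -1 -1]: ordinary classification
print(f_alpha(p_pos, alpha=0.8))   # [ 1 -1 -1 -1]: high false-positive cost
```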
3. Cost-sensitive Approach

In this section, we propose a cost-sensitive approach for classification with rejection. We begin by describing our motivation and analyzing the behavior of the Bayes optimal solution of classification with rejection. Then, we show that this problem can be solved by simultaneously solving multiple cost-sensitive classification problems.

3.1. Motivation

As suggested by Chow's rule (Chow, 1970), classification with rejection can be solved by estimating the class-posterior probabilities. However, an important question arises: is class-posterior probability estimation indispensable for solving classification with rejection? This question is fundamentally motivated by Vapnik's principle (Vapnik, 1998), which suggests not to solve a more general problem as an intermediate step when solving a target problem if we are given a restricted amount of information. In our context, the general problem is class-posterior probability estimation. In fact, knowing class-posterior probabilities can also solve many other problems (Qin, 1998; Bickel et al., 2007; Sugiyama et al., 2012; Dembczynski et al., 2013; Koyejo et al., 2014). However, many such problems are also known to be solvable without estimating the class-posterior probabilities (Kanamori et al., 2012; Bao & Sugiyama, 2020). Note that class-posterior probability estimation can be unreliable when the model is misspecified (Begg & Lagakos, 1990; Heagerty & Kurland, 2001) or highly flexible (Guo et al., 2017; Hein et al., 2019). To find a more direct solution for classification with rejection, we seek a general approach that may not be able to estimate class-posterior probabilities, but whose optimal solution coincides with the optimal Chow's rule (Chow, 1970). Although the idea of directly solving classification with rejection without class-posterior estimation is itself not novel, most existing methods are only applicable to binary classification with rejection (Bartlett & Wegkamp, 2008; Grandvalet et al., 2009; Manwani et al., 2015; Cortes et al., 2016b;a), or focus on specific types of losses (Ramaswamy et al., 2018). For the multiclass case, Zhang et al. (2018) proposed to modify a loss by bending it to be steeper (for the hinge loss) and positively unbounded, but there exist hyperparameters to be tuned, such as the rejection threshold and the bending slope. Also, Mozannar & Sontag (2020) recently proposed a classifier-rejector approach that augments the model's prediction with a rejection class, but their loss choice is limited to a modified cross-entropy loss. More discussion on related work is provided in Appendix A.

3.2. A Closer Look at Chow's Rule

Here, we analyze the behavior of Chow's rule (Chow, 1970). We discuss the minimal knowledge required for a classification rule to mimic Chow's rule, which illustrates that the class-posterior probabilities need not be known. For simplicity, we begin by considering binary classification with rejection. In binary classification with rejection, Chow's rule in Definition 1 can be expressed as

$$f^*(x) = \begin{cases} +1 & p(y=+1|x) > 1-c,\\ ® & c \le p(y=+1|x) \le 1-c,\\ -1 & p(y=+1|x) < c. \end{cases} \quad (2)$$

To solve binary classification with rejection, there are only three conditions to verify: p(y = +1|x) > 1 − c, p(y = +1|x) < c, and p(y = +1|x) ∈ [c, 1 − c]. We can see that if we know p(y = +1|x) > 1 − c, we do not need to know the exact value of p(y = +1|x) to predict the label as positive. For example, if c = 0.2, knowing p(y = +1|x) > 0.8 is already sufficient to predict a label, i.e., knowing whether p(y = +1|x) > 0.88 or p(y = +1|x) > 0.96 does not change the decision of Chow's rule. Figure 2 illustrates this fact, which is the key intuition why it is possible to develop a method that avoids estimating the class-posterior probabilities for solving this problem.

Figure 2. Illustration of Chow's rule in binary classification with rejection and the unnecessity of knowing the class-posterior probability to solve this problem. If the rejection cost c = 0.2, as long as we know p(y = +1|x) > 0.8, knowing the exact value of the class-posterior probability does not change our final decision to predict a positive label.

3.3. Binary Classification with Rejection Based on Cost-sensitive Classification

Here, we show that binary classification with rejection can be solved by solving two cost-sensitive binary classification problems. The following proposition shows the relationship between the optimal solutions of cost-sensitive binary classification and that of binary classification with rejection.

Proposition 3. In binary classification with rejection, Chow's rule can be expressed as

$$f^*(x) = \begin{cases} +1 & f^*_{1-c}(x) = +1,\\ -1 & f^*_{c}(x) = -1,\\ ® & \text{otherwise}. \end{cases} \quad (3)$$

Proof. We assert that if we can verify the following two conditions:

$$p(y = +1|x) > 1 - c, \quad (4)$$
$$p(y = +1|x) > c, \quad (5)$$

then binary classification with rejection is solved. Based on Chow's rule (2), if Ineq. (4) holds, Ineq. (5) must also hold since c < 0.5; then we should predict a positive label. On the other hand, we should predict a negative label if Ineq. (5) does not hold. Next, if Ineq. (5) holds but Ineq. (4) does not hold, we should reject x. As a result, based on Definition 2, knowing f*_{1−c}(x) and f*_c(x) is sufficient to verify Ineqs. (4) and (5). This concludes the proof.

Proposition 3 suggests that by solving two binary cost-sensitive classification problems with α = c and α = 1 − c to obtain f*_c(x) and f*_{1−c}(x), binary classification with rejection can be solved.
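The equivalence in Proposition 3 is easy to check numerically. The sketch below (our own sanity check, not code from the paper) builds Chow's rule (2) and the rule (3) of Proposition 3 from the Bayes optimal cost-sensitive classifiers of Definition 2, and compares the two on a grid of posterior values:

```python
import numpy as np

REJECT = 0  # sentinel for the rejection option in the binary case

def f_alpha(p_pos, alpha):
    """Definition 2: Bayes optimal cost-sensitive classifier."""
    return np.where(p_pos > alpha, 1, -1)

def chow_binary(p_pos, c):
    """Chow's rule (2): +1 / reject / -1 by thresholding at 1 - c and c."""
    return np.where(p_pos > 1 - c, 1, np.where(p_pos < c, -1, REJECT))

def reject_via_cost_sensitive(p_pos, c):
    """Proposition 3: combine f*_{1-c} and f*_c into the rejection rule (3)."""
    hi, lo = f_alpha(p_pos, 1 - c), f_alpha(p_pos, c)
    return np.where(hi == 1, 1, np.where(lo == -1, -1, REJECT))

c = 0.2
p = np.linspace(0.001, 0.999, 999)  # grid of posterior values
assert np.array_equal(chow_binary(p, c), reject_via_cost_sensitive(p, c))
```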
3.4. Multiclass Extension

Here, we show that our result in Section 3.3 can be naturally extended to the multiclass case. More specifically, we show that multiclass classification with rejection can be solved by learning an ensemble of K binary cost-sensitive classifiers. Let us define the Bayes optimal solution of the one-versus-rest cost-sensitive binary classifier f^{*,y}_α, where y is the positive class and all y′ ∈ Y with y′ ≠ y are the negative classes:

$$f^{*,y}_\alpha(x) = \begin{cases} +1 & \eta_y(x) > \alpha,\\ -1 & \text{otherwise}.\end{cases}$$

Then, we obtain the following proposition (its proof can be found in Appendix B.1).

Proposition 4. Chow's rule in multiclass classification with rejection can be expressed as

$$f^*(x) = \begin{cases} ® & \max_y f^{*,y}_{1-c}(x) = -1,\\ \arg\max_y f^{*,y}_{1-c}(x) & \text{otherwise}.\end{cases}$$

Proposition 4 suggests that by learning cost-sensitive classifiers f^{*,y}_{1−c} for y ∈ Y, it is possible to obtain Chow's rule without estimating the class-posterior probabilities. Note that when c < 0.5, there exists at most one y′ ∈ Y such that f^{*,y′}_{1−c}(x) = +1. This is because it implies that η_{y′}(x) > 1 − c, which is larger than 0.5.

Table 1. Classification-calibrated binary surrogate losses and their properties, including convexity, symmetricity (i.e., whether φ(z) + φ(−z) is a constant), and the capability to estimate the class-posterior probability η_1(x) in binary classification. The column "confidence-based" indicates that a loss is applicable to the confidence-based approach and satisfies all conditions required to use previous work to derive its excess risk bound (Yuan & Wegkamp, 2010; Ni et al., 2019). In contrast, our cost-sensitive approach guarantees the existence of an excess risk bound as long as the loss φ is classification-calibrated (Bartlett et al., 2006).

| Loss name | φ(z) | Convex | Symmetric | Estimating η_1(x) | Confidence-based |
|---|---|---|---|---|---|
| Squared | (1 − z)² | ✓ | ✗ | ✓ | ✓ |
| Squared hinge | max(0, 1 − z)² | ✓ | ✗ | ✓ | ✓ |
| Exponential | exp(−z) | ✓ | ✗ | ✓ | ✓ |
| Logistic | log(1 + exp(−z)) | ✓ | ✗ | ✓ | ✓ |
| Hinge | max(0, 1 − z) | ✓ | ✗ | ✗ | ✗ |
| Savage | (1 + exp(2z))⁻² | ✗ | ✗ | ✓ | ✗ |
| Tangent | (2 arctan(z) − 1)² | ✗ | ✗ | ✓ | ✗ |
| Ramp | max(0, min(1, 0.5 − 0.5z)) | ✗ | ✓ | ✗ | ✗ |
| Sigmoid | (1 + exp(z))⁻¹ | ✗ | ✓ | ✗ | ✗ |

4. A Surrogate Loss for the Cost-sensitive Approach

In this section, we propose a surrogate loss for the cost-sensitive approach to classification with rejection. It is known that, given training data, directly minimizing the empirical risk with respect to ℓ_{01c} is computationally infeasible (Bartlett & Wegkamp, 2008; Ramaswamy et al., 2018). Therefore, many surrogate losses have been proposed to learn a classifier with rejection in practice (Yuan & Wegkamp, 2010; Cortes et al., 2016b; Ni et al., 2019). Here, we propose the cost-sensitive surrogate loss for classification with rejection. Let g(x) = [g_1(x), . . . , g_K(x)]^T, where g_y : X → R is the score function for a class y. Let φ : R → R be a binary margin surrogate loss. A margin loss is a class of loss functions for binary classification that takes only one real-valued argument (Bartlett et al., 2006; Reid & Williamson, 2010). Table 1 shows examples of binary margin surrogate losses. With a choice of φ, we can define our proposed cost-sensitive surrogate loss as follows.

Definition 5. Given a binary margin surrogate loss φ and a pre-defined rejection cost c, the cost-sensitive surrogate loss for classification with rejection is defined as

$$L^{c,\phi}_{\mathrm{CS}}(g; x, y) = c\,\phi\big(g_y(x)\big) + (1-c) \sum_{y' \ne y} \phi\big(-g_{y'}(x)\big).$$

Next, following the empirical risk minimization framework (Vapnik, 1998), a learning objective can be straightforwardly obtained as

$$\hat{R}_{L^{c,\phi}_{\mathrm{CS}}}(g) = \frac{1}{n}\sum_{i=1}^{n} L^{c,\phi}_{\mathrm{CS}}(g; x_i, y_i). \quad (6)$$

Note that regularization can also be applied in practice to avoid overfitting. Moreover, we want to emphasize that although it is theoretically suggested to learn an ensemble of classifiers to solve classification with rejection, in practice, by using linear-in-parameter models or neural networks with K-dimensional vectorial outputs, we can conveniently learn all K cost-sensitive binary classifiers together at once as g. After learning g by minimizing Eq. (6), we have to design how to reject x.
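A minimal PyTorch sketch of Definition 5 and the empirical risk in Eq. (6) may look as follows (our own illustration, not the paper's official code; the sigmoid loss is one of the choices used in the experiments, and all function names are ours):

```python
import torch

def sigmoid_loss(z: torch.Tensor) -> torch.Tensor:
    """phi(z) = (1 + exp(z))^{-1}; a non-convex, symmetric margin loss."""
    return torch.sigmoid(-z)

def cs_surrogate_risk(scores: torch.Tensor, labels: torch.Tensor, c: float,
                      phi=sigmoid_loss) -> torch.Tensor:
    """Empirical risk (6) of L^{c,phi}_{CS}: c * phi(g_y(x)) on the true class,
    (1 - c) * phi(-g_{y'}(x)) summed over the other classes."""
    n, K = scores.shape
    onehot = torch.nn.functional.one_hot(labels, K).float()
    pos = phi(scores) * onehot              # phi(g_y(x)) on the true class
    neg = phi(-scores) * (1.0 - onehot)     # phi(-g_{y'}(x)) elsewhere
    return (c * pos.sum(dim=1) + (1.0 - c) * neg.sum(dim=1)).mean()

# Toy usage: K = 3 classes, rejection cost c = 0.2.
g = torch.randn(8, 3, requires_grad=True)
y = torch.randint(0, 3, (8,))
loss = cs_surrogate_risk(g, y, c=0.2)
loss.backward()
```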
Following the optimal rejection rule in our Proposition 4, i.e., max_y f^{*,y}_{1−c}(x) = −1, we can directly obtain the following rejection rule:

$$\max_y\, g_y(x) \le 0. \quad (7)$$

Intuitively, Cond. (7) suggests rejecting x if all g_y(x) have low prediction confidence. One may interpret this type of rejection as distance rejection (Dubuisson & Masson, 1993), where the rejection is made when g is uncertain whether x belongs to any of the known classes. Note that this does not necessarily imply that x belongs to an unknown class; e.g., x may be located close to the decision boundary, causing none of the g_y(x) to be confident enough to predict a class y. Next, we also consider the following rejection rule:

$$\exists\, y, y' \ \text{s.t.}\ y \ne y' \ \text{and}\ g_y(x), g_{y'}(x) > 0. \quad (8)$$

Cond. (8) suggests rejecting x because there exists a prediction conflict between at least two binary classifiers, i.e., g_y(x) suggests predicting a class y, but g_{y'}(x) suggests predicting another class y′. Note that if we succeed in obtaining the optimal classifier g*, this condition can never be satisfied: recall from Section 3.4 that, for g*, at most one g*_y(x) can be greater than zero, since g*_y(x) > 0 implies η_y(x) > 1 − c > 0.5. Nevertheless, Cond. (8) may hold in practice due to empirical estimation. This rejection condition can be interpreted as ambiguity rejection (Dubuisson & Masson, 1993), where the rejection is made when g interprets x to be associated with more than one class. More discussion on the proposed rejection conditions is provided in Appendix D.

To sum up, we employ the following classification rule for the cost-sensitive approach:

$$f(x; g) = \begin{cases} ® & \text{Conds. (7) or (8) hold},\\ \arg\max_y g_y(x) & \text{otherwise}. \end{cases} \quad (9)$$

Figure 3 illustrates the cost-sensitive approach.

Figure 3. Illustration of the cost-sensitive approach. Figure (a) denotes a prediction function. Unlike the confidence-based approach (Figure 1), the prediction function is not designed to predict all data points in the space, and the rejection region does not spread from the decision boundary. The decision boundary is based on an ensemble of cost-sensitive classifiers for the blue, red, and green classes, respectively. The rejector in figure (b) is constructed based on the rejection rule in Cond. (7) by aggregating the prediction results of the cost-sensitive classifiers. Data points in purple are rejected.

It is worth mentioning that our rejection condition is different from that of Zhang et al. (2018). In their rejection rule, an input x is rejected if all binary classifiers' outputs are close to zero. In our case, Cond. (7) rejects x as long as all g_y(x) are negative; e.g., x is also rejected if all prediction outputs are much smaller than zero. Also, their method can predict a set of labels when at least two classifiers predict positively, which differs from our problem setting, where it is only allowed to predict one label or refrain from making a prediction.
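Given a trained g, the classification rule (9) reduces to a few tensor operations. The following sketch (ours, with −1 as an arbitrary rejection sentinel) implements Conds. (7) and (8):

```python
import torch

REJECT = -1  # sentinel for the rejection decision

def predict_with_rejection(scores: torch.Tensor) -> torch.Tensor:
    """Classification rule (9) on scores g(x) of shape (n, K)."""
    distance_reject = scores.max(dim=1).values <= 0   # Cond. (7)
    ambiguity_reject = (scores > 0).sum(dim=1) >= 2   # Cond. (8)
    labels = scores.argmax(dim=1)
    return torch.where(distance_reject | ambiguity_reject,
                       torch.full_like(labels, REJECT), labels)

g = torch.tensor([[ 0.7, -1.2, -0.5],   # one positive score -> predict class 0
                  [-0.3, -0.8, -0.1],   # all non-positive -> distance rejection
                  [ 0.4,  0.2, -0.9]])  # two positive scores -> ambiguity rejection
print(predict_with_rejection(g))        # tensor([ 0, -1, -1])
```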
5. Theoretical Analysis

In this section, we show that the classification rule f(x; g) in Eq. (9) can achieve Chow's rule, and we also provide excess risk bounds.

5.1. Calibration

We begin by introducing the well-known notion of a classification-calibrated loss in binary classification. Let us define the pointwise conditional surrogate risk for a fixed input x in binary classification, with η_1 denoting its class-posterior probability of the positive class:

$$C^\phi_{\eta_1}(v) = \eta_1\, \phi(v) + (1 - \eta_1)\, \phi(-v), \quad (10)$$

for v ∈ R. Next, a classification-calibrated loss can be defined as follows.

Definition 6 (Bartlett et al. (2006)). We say a loss φ is classification-calibrated if, for any η_1 ≠ 1/2,

$$\inf_{v\,:\, v(2\eta_1 - 1) \le 0} C^\phi_{\eta_1}(v) > \inf_{v} C^\phi_{\eta_1}(v).$$

Intuitively, classification-calibration ensures that minimizing a loss φ can give the Bayes optimal binary classifier sign(2η_1 − 1). It is known that a convex loss φ is classification-calibrated if and only if it is differentiable at 0 and φ′(0) < 0 (Bartlett et al., 2006). Analogously, in classification with rejection, calibration is also an important property that has been used to verify whether a surrogate loss is appropriate (Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Cortes et al., 2016b;a; Ni et al., 2019). Calibration guarantees that, by replacing the zero-one-c loss ℓ_{01c} with a surrogate loss L, the optimal Chow's rule can still be obtained by minimizing the surrogate risk. Verifying the calibration condition in classification with rejection has not been as well studied as in ordinary binary classification. We are only aware of the works by Yuan & Wegkamp (2010) and Ni et al. (2019), which provided a condition to verify calibration of general loss functions for the confidence-based approach. Nevertheless, their condition can only verify convex losses. Note that losses that are calibrated in our cost-sensitive approach may not be calibrated in the confidence-based approach, e.g., the sigmoid loss. See Table 1 for more details.

Now we are ready to show that the calibration condition of our proposed approach is equivalent to the classification-calibration condition of φ. Let us define the pointwise conditional surrogate risk W_L of an input x with class-posterior probability η(x) for the multiclass case:

$$W_L\big(g(x); \eta(x)\big) = \sum_{y \in Y} \eta_y(x)\, L(g; x, y). \quad (11)$$

By analyzing the classification rule with respect to the conditional risk minimizer, we obtain the following theorem (its proof can be found in Appendix B.2).

Theorem 7. Let g* be a conditional risk minimizer that minimizes the pointwise conditional surrogate risk: g*(x) = argmin_g W_{L^{c,φ}_{CS}}(g(x); η(x)). The surrogate loss L^{c,φ}_{CS} is calibrated for classification with rejection, that is, f(x; g*) = f*(x) for all x ∈ X, if and only if φ is classification-calibrated.

Theorem 7 suggests that the condition for verifying whether our cost-sensitive surrogate loss L^{c,φ}_{CS} is calibrated is equivalent to the condition of whether φ is classification-calibrated. As long as a binary surrogate loss φ is classification-calibrated, minimizing the surrogate risk w.r.t. L^{c,φ}_{CS} can lead to the optimal Chow's rule. As a result, Theorem 7 successfully borrows knowledge from the literature on binary classification to help prove calibration in classification with rejection for the cost-sensitive approach.
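Theorem 7 can also be checked numerically. The following brute-force sketch (our own sanity check, using the non-convex sigmoid loss and a coarse grid for K = 2) minimizes the pointwise conditional risk in Eq. (11) and confirms that the decision induced by rule (9) matches Chow's rule: prediction when η_1(x) is above 1 − c or below c, and rejection in between:

```python
import numpy as np

def phi(z):
    """Sigmoid loss phi(z) = (1 + exp(z))^{-1}."""
    return 1.0 / (1.0 + np.exp(z))

def conditional_risk(g, eta, c):
    """Pointwise conditional risk (11) of L^{c,phi}_{CS} at a fixed x."""
    risk = 0.0
    for y, eta_y in enumerate(eta):
        pos = phi(g[y])                                         # true class y
        neg = sum(phi(-g[k]) for k in range(len(eta)) if k != y)
        risk += eta_y * (c * pos + (1.0 - c) * neg)
    return risk

def minimize_on_grid(eta, c, grid=np.linspace(-3, 3, 61)):
    """Brute-force conditional risk minimizer g*(x) for K = 2."""
    best, best_risk = None, np.inf
    for g1 in grid:
        for g2 in grid:
            r = conditional_risk((g1, g2), eta, c)
            if r < best_risk:
                best, best_risk = (g1, g2), r
    return best

c = 0.2
for eta1 in (0.9, 0.5, 0.1):   # above 1 - c, inside [c, 1 - c], below c
    g = minimize_on_grid((eta1, 1.0 - eta1), c)
    decision = "reject" if max(g) <= 0 else f"class {int(np.argmax(g))}"
    print(eta1, np.round(g, 1), decision)  # class 0 / reject / class 1
```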
5.2. Excess Risk Bound

While calibration ensures that the optimal solution w.r.t. a surrogate loss agrees with the optimal Chow's rule, an excess risk bound provides a regret-bound relationship between the zero-one-c loss ℓ_{01c} and a surrogate loss L. Let R^{φ,i}_{1−c}(g_i) be the cost-sensitive binary surrogate risk for class i and R*_L be the minimum risk w.r.t. the loss L. We prove the following theorem, which is our main result for deriving the excess risk bound of the cost-sensitive approach for any classification-calibrated loss (its proof can be found in Appendix B.3).

Theorem 8. Consider a cost-sensitive surrogate loss L^{c,φ}_{CS}. Let f be the classification rule of the cost-sensitive approach with respect to the score function g, that is, f = f(x; g) for an input x. If a binary surrogate loss φ is classification-calibrated, the excess risk bound can be expressed as follows:

$$R_{\ell_{01c}}(f) - R^*_{\ell_{01c}} \le R_{L^{c,\ell_{01}}_{\mathrm{CS}}}(g) - R^*_{L^{c,\ell_{01}}_{\mathrm{CS}}} \quad (12)$$
$$\le \sum_{i=1}^{K} \psi^{-1}_{\phi,1-c}\Big(R^{\phi,i}_{1-c}(g_i) - R^{\phi,i,*}_{1-c}\Big), \quad (13)$$

where ψ_{φ,1−c} : R → R is a non-decreasing invertible function with ψ_{φ,1−c}(0) = 0.

Ineq. (12) suggests that the regret of the classification with rejection problem can be bounded by the regret of the cost-sensitive surrogate with respect to the zero-one loss ℓ_{01}. This inequality allows us to borrow existing findings in cost-sensitive classification to give excess risk bounds for classification with rejection. Next, Ineq. (13) suggests that R_{L^{c,ℓ01}_{CS}}(g) − R*_{L^{c,ℓ01}_{CS}} is bounded by the sum of an invertible function of the regrets of the cost-sensitive binary classification risks. The invertible function ψ_{φ,1−c} is well studied in the literature on cost-sensitive classification and is guaranteed to exist for any classification-calibrated loss (Steinwart, 2007; Scott, 2012). Closed-form expressions of ψ_{φ,1−c} for specific losses, such as the squared loss, and the derivation of ψ_{φ,1−c} can be found in Steinwart (2007) and Scott (2012). Since ψ_{φ,1−c} is non-decreasing and ψ_{φ,1−c}(0) = 0, the regret with respect to the zero-one-c loss also gets smaller and eventually becomes zero if the surrogate risk is successfully minimized.

As an example to demonstrate how to obtain an excess risk bound with our Theorem 8, we prove that the following excess risk bound holds for the hinge loss φ_hin, which is a loss that cannot estimate the class-posterior probabilities (Cortes & Vapnik, 1995) and whose optimal solution for the confidence-based approach cannot mimic Chow's rule. The bound can be straightforwardly derived based on our Theorem 8 and the known fact that ψ^{−1}_{φhin,1−c}(ϵ) = ϵ (Steinwart, 2007).

Corollary 9. Let us consider the hinge loss φ_hin(z) = max(0, 1 − z). The excess risk bound for the cost-sensitive surrogate L^{c,φhin}_{CS} can be given as follows:

$$R_{\ell_{01c}}(f) - R^*_{\ell_{01c}} \le R_{L^{c,\phi_{\mathrm{hin}}}_{\mathrm{CS}}}(g) - R^*_{L^{c,\phi_{\mathrm{hin}}}_{\mathrm{CS}}}.$$
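To spell out the derivation of Corollary 9 (this is our own reconstruction of the argument sketched above, not text from the paper), plug ψ^{−1}_{φhin,1−c}(ϵ) = ϵ into Ineq. (13):

$$R_{\ell_{01c}}(f) - R^*_{\ell_{01c}} \;\le\; \sum_{i=1}^{K}\Big(R^{\phi_{\mathrm{hin}},i}_{1-c}(g_i) - R^{\phi_{\mathrm{hin}},i,*}_{1-c}\Big) \;=\; R_{L^{c,\phi_{\mathrm{hin}}}_{\mathrm{CS}}}(g) - R^*_{L^{c,\phi_{\mathrm{hin}}}_{\mathrm{CS}}},$$

where the last equality holds because, by Definition 5, the risk of L^{c,φ}_{CS} is the sum of the K one-versus-rest cost-sensitive risks, and the minimization decouples over the individual score functions g_i.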
6. Experimental Results

In this section, we provide experimental results for classification with rejection. The evaluation metric is the test empirical zero-one-c risk over ten trials. We also report the rejection ratio, the accuracy on accepted data, and the full experimental results in table format in Appendix D. The rejection cost c was varied over {0.1, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40} for all settings. For noisy-labeled classification, we used uniform noise (Angluin & Laird, 1988; Ghosh et al., 2015), where a randomly selected 25% of the training labels were flipped.

6.1. Experiment Setup

Datasets and models: For binary classification, we used subjective-versus-objective classification (Subj), which is a text dataset (Pang & Lee, 2004). Moreover, we used Phishing and Spambase, which are tabular datasets, and Twonorm, which is a synthetic dataset drawn from different multivariate Gaussian distributions (Lichman et al., 2013). We also used the Gisette dataset, which poses the problem of separating the highly confusible digits 4 and 9 with noisy features (Guyon et al., 2005). Linear-in-input models were used for all binary datasets. For multiclass classification, we used Gas-Drift (Vergara et al., 2012) and Human Activity Recognition (HAR) (Anguita et al., 2013), which are tabular datasets. Multilayer perceptrons were used for both datasets. We also used image datasets, namely MNIST (LeCun, 1998), Kuzushiji-MNIST (KMNIST) (Clanuwat et al., 2018), and Fashion-MNIST (Xiao et al., 2017). Convolutional neural networks were used for all image datasets. The implementation was done using PyTorch (Paszke et al., 2019). More details on the datasets and implementation can be found in Appendix C.

Figure 4. Mean and standard error of the test empirical zero-one-c risk over ten trials with varying rejection costs (binary classification). Each column indicates the performance with respect to one dataset. (Top) Clean-labeled classification with rejection. (Middle) Noisy-labeled classification with rejection. (Bottom) PU-classification with rejection.

Methods: For the confidence-based approach, based on Ni et al. (2019), we used the softmax cross-entropy loss (SCE). For the classifier-rejector approach, we used the method proposed by Mozannar & Sontag (2020) (DEFER). We also used the method of Zhang et al. (2018) with the bent hinge loss (ANGLE). For our cost-sensitive approach, we used the hinge (CS-hinge) and sigmoid (CS-sigmoid) losses.

Hyperparameter tuning: We provided additional training data for SCE and ANGLE to tune their hyperparameters. For SCE, we also added temperature scaling (Guo et al., 2017) to improve the prediction confidence. For ANGLE, we chose the bending slope parameter according to Zhang et al. (2018) and tuned the rejection threshold. In PU-classification, it is difficult to tune hyperparameters for these methods; thus, we provided clean-labeled data for them only for hyperparameter tuning. Both the rejection threshold of ANGLE and the temperature parameter for SCE were chosen from a candidate set of twenty numbers spaced evenly on a log scale from 0 to 1 (inclusive) and nine integers from 2 to 10.
Since only SCE and ANGLE require additional data to tune hyperparameters, it is not straightforward to provide a fair comparison, because our methods and DEFER do not use validation data. Nevertheless, with less data, our methods are still competitive and can outperform the baselines in several settings.

6.2. Binary Classification with Rejection

Here, we compare the performance of all methods in clean-labeled, noisy-labeled, and positive-unlabeled classification with rejection. For PU-classification, we implemented all methods based on the empirical risk minimization framework proposed by Kiryo et al. (2017) (more detail can be found in Appendix C).

Figure 5. Mean and standard error of the test empirical zero-one-c risk over ten trials with varying rejection costs (multiclass classification). Each column indicates the performance with respect to one dataset. (Top) Clean-labeled classification with rejection. (Bottom) Noisy-labeled classification with rejection. For MNIST, Fashion-MNIST, and KMNIST, we found that ANGLE failed miserably, with a zero-one-c risk of more than 0.5, and it is thus excluded from the figure for readability.

Figure 4 shows the performance with varying rejection costs for all settings. In clean-labeled classification, it can be observed that CS-hinge and ANGLE are the most preferable in this setting. In noisy-labeled and PU-classification, CS-sigmoid outperformed the other methods in most cases. This illustrates the usefulness of having a flexible choice of loss functions. We also found that, for some methods, noise can degrade the performance to be worse than always rejecting. Moreover, we found that DEFER rejected data more often than the other methods, which may sometimes lead to worse performance. In PU-classification, SCE and ANGLE did not perform well even though clean-labeled data were used for hyperparameter tuning. This could be because a steep loss can suffer severely from the negative-risk problem (Kiryo et al., 2017), causing these methods to be ineffective in PU-classification.

6.3. Multiclass Classification with Rejection

Figure 5 illustrates the performance of all methods in the clean-labeled and noisy-labeled settings. It can be observed that SCE had almost the same performance for all rejection costs. Although temperature scaling was applied, it seems that SCE still suffered from overconfidence (Guo et al., 2017) and failed to reject the ambiguous data points. This could be because SCE has high accuracy on the validation set (more than 90%), and thus temperature scaling could not smooth the prediction confidence enough to reject the ambiguous data effectively.
Interestingly, DEFER did not suffer from such overconfidence, although it is also based on the cross-entropy loss, and it rejected data more often than the other methods. For ANGLE, we found that although it can perform competitively on Gas-Drift and HAR, it failed miserably on the image datasets. For the figure's readability, we report the performance of ANGLE in table format in Appendix D. In noisy-labeled classification, CS-sigmoid outperformed the other methods in most cases.

7. Conclusions

We have proposed a cost-sensitive approach to classification with rejection, in which any classification-calibrated loss can be applied with a theoretical guarantee. Our theory of excess risk bounds explicitly connects the classification with rejection problem to the cost-sensitive classification problem. Our experimental results using clean-labeled, noisy-labeled, and positive and unlabeled training data demonstrated the advantages of avoiding class-posterior probability estimation and of having a flexible choice of loss functions.

Acknowledgements

We would like to thank Han Bao, Takeshi Teshima, Chenri Ni, and Junya Honda for helpful discussion, and also the Supercomputing Division, Information Technology Center, The University of Tokyo, for providing us the Reedbush supercomputer system to conduct the experiments. NC was supported by a MEXT scholarship, the JST AIP Challenge, and the Google PhD Fellowship program. ZC was supported by the JST AIP Challenge. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.

References

Angluin, D. and Laird, P. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. A public domain dataset for human activity recognition using smartphones. In ESANN, volume 3, pp. 3, 2013.

Bao, H. and Sugiyama, M. Calibrated surrogate maximization of linear-fractional utility in binary classification. In AISTATS, 2020.

Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. JMLR, 9:1823–1840, 2008.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. JASA, 101(473):138–156, 2006.

Begg, M. D. and Lagakos, S. On the consequences of model misspecification in logistic regression. Environmental Health Perspectives, 87:69–75, 1990.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In ICML, pp. 81–88, 2007.

Charoenphakdee, N. and Sugiyama, M. Positive-unlabeled classification under class prior shift and asymmetric error. In SDM, pp. 271–279. SIAM, 2019.

Charoenphakdee, N., Lee, J., and Sugiyama, M. On symmetric losses for learning from corrupted labels. In ICML, 2019.

Chow, C. K. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6(4):247–254, 1957.

Chow, C. K. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718, 2018.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Cortes, C., DeSalvo, G., and Mohri, M. Boosting with abstention. In Advances in Neural Information Processing Systems, pp. 1660–1668, 2016a.
Cortes, C., DeSalvo, G., and Mohri, M. Learning with rejection. In ALT, pp. 67–82, 2016b.

Dembczynski, K., Jachnik, A., Kotlowski, W., Waegeman, W., and Hüllermeier, E. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In ICML, pp. 1130–1138, 2013.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems, pp. 703–711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394, 2015.

Dubuisson, B. and Masson, M. A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 26(1):155–165, 1993.

El-Yaniv, R. and Wiener, Y. On the foundations of noise-free selective classification. JMLR, 11(May):1605–1641, 2010.

Elkan, C. The foundations of cost-sensitive learning. In IJCAI, volume 17, pp. 973–978. Lawrence Erlbaum Associates Ltd, 2001.

Franc, V. and Prusa, D. On discriminative learning of prediction uncertainty. In ICML, pp. 1963–1971, 2019.

Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, pp. 4878–4887, 2017.

Ghosh, A., Manwani, N., and Sastry, P. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.

Grandvalet, Y., Rakotomamonjy, A., Keshet, J., and Canu, S. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, pp. 537–544, 2009.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML, pp. 1321–1330. JMLR.org, 2017.

Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pp. 545–552, 2005.

Heagerty, P. J. and Kurland, B. F. Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika, 88(4):973–985, 2001.

Hein, M., Andriushchenko, M., and Bitterwolf, J. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In CVPR, pp. 41–50, 2019.

Herbei, R. and Wegkamp, M. H. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. PMLR, 2015.

Kanamori, T., Suzuki, T., and Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pp. 1674–1684, 2017.

Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pp. 2744–2752, 2014.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Lichman, M. et al. UCI machine learning repository, 2013.

Manwani, N., Desai, K., Sasidharan, S., and Sundararajan, R. Double ramp loss based reject option classifier.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 151–163. Springer, 2015.

Mozannar, H. and Sontag, D. Consistent estimators for learning to defer to an expert. In ICML, 2020.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010.

Ni, C., Charoenphakdee, N., Honda, J., and Sugiyama, M. On the calibration of multiclass classification with rejection. In Advances in Neural Information Processing Systems, pp. 2582–2592, 2019.

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058, 2004.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.

Qin, J. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.

Ramaswamy, H. G., Tewari, A., Agarwal, S., et al. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.

Reid, M. D. and Williamson, R. C. Composite binary losses. JMLR, 11:2387–2422, 2010.

Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.

Scott, C. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

Steinwart, I. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

Vapnik, V. Statistical Learning Theory, volume 3. Wiley, New York, 1998.

Vergara, A., Vembu, S., Ayhan, T., Ryan, M. A., Homer, M. L., and Huerta, R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yuan, M. and Wegkamp, M. Classification methods with reject option based on convex risk minimization. JMLR, 11:111–130, 2010.

Zhang, C., Wang, W., and Qiao, X. On reject and refine options in multicategory classification. Journal of the American Statistical Association, 113(522):730–745, 2018.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pp. 56–85, 2004.