# optimal_strategies_for_reject_option_classifiers__6dc8794e.pdf

Journal of Machine Learning Research 24 (2023) 1-49 Submitted 1/21; Revised 12/21; Published 1/23

Optimal Strategies for Reject Option Classiﬁers

Vojtech Franc xfrancv@fel.cvut.cz Daniel Prusa prusa@fel.cvut.cz Vaclav Voracek vaclav.voracek@uni-tuebingen.de Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University in Prague, Czech Republic

Editor: Francis Bach

In classiﬁcation with a reject option, the classiﬁer is allowed in uncertain cases to abstain from prediction. The classical cost-based model of a reject option classiﬁer requires the rejection cost to be deﬁned explicitly. The alternative bounded-improvement model and the bounded-abstention model avoid the notion of the reject cost. The bounded-improvement model seeks a classiﬁer with a guaranteed selective risk and maximal cover. The boundedabstention model seeks a classiﬁer with guaranteed cover and minimal selective risk. We prove that despite their diﬀerent formulations the three rejection models lead to the same prediction strategy: the Bayes classiﬁer endowed with a randomized Bayes selection function. We deﬁne the notion of a proper uncertainty score as a scalar summary of the prediction uncertainty suﬃcient to construct the randomized Bayes selection function. We propose two algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classiﬁer. We prove that both algorithms provide Fisher consistent estimates of the proper uncertainty score and demonstrate their eﬃciency in diﬀerent prediction problems, including classiﬁcation, ordinal regression, and structured output classiﬁcation. Keywords: Reject option classiﬁcation, prediction uncertainty, selective classiﬁers

1. Introduction

In safety critical applications of classiﬁcation models, prediction errors may lead to serious losses. In such cases, estimating when the model makes an error can be as important as its average performance. These two objectives are taken into account in classiﬁcation with a reject option when the classiﬁer is allowed in uncertain cases to abstain from prediction. The cost-based model of a classiﬁcation strategy with the reject option was proposed by Chow in his pioneering work Chow (1970). The goal is to minimize the expected loss equal to the cost of misclassiﬁcation, when the classiﬁer predicts, and to the reject cost, when the classiﬁer abstains from prediction. An optimal strategy leads to the Bayes classiﬁer abstaining from prediction when the conditional expected risk exceeds the reject cost. The known form of the optimal strategy allows to construct the classiﬁer by plugging in an estimate of the class posterior probabilities to the formula for the conditional risk. Besides the plug-in rule, the reject option classiﬁers can be learned by empirical risk minimization based approaches like e.g. modiﬁcations of Support Vector Machines (Grandvalet et al., 2008), Boosting (Cortes et al., 2016), or Prototype-based classiﬁers (Villman et al., 2016) to name a few.

2023 Vojtech Franc and Daniel Prusa and Vaclav Voracek.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0048.html.

Franc, Prusa and Voracek

The cost-based model requires the reject cost to be deﬁned explicitly, which is diﬃcult in some applications, e.g., when the misclassiﬁcation costs have diﬀerent physical units than the reject cost. Alternative bounded-improvement and bounded-abstention models coined in Pietraszek (2005) avoid explicit deﬁnition of the reject cost. The rejection strategy is evaluated by two antagonistic quantities: i) a selective risk deﬁned as the expected misclassiﬁcation cost of accepted predictions and ii) a coverage which corresponds to the probability that the prediction is accepted. The optimal strategy for the bounded-improvement model maximizes the coverage under the condition that the selective risk does not exceed a target value. The optimal strategy of the bounded-abstention model is deﬁned symmetrically as the one which minimizes the selective risk under the condition that the coverage does not drop below a target value. In contrast to the cost-based model, it is unknown what are the optimal prediction strategies of the two models when the underlying statistical model is given. A solution has been proposed only for special instances of the tasks. Pietraszek (2005) proposed a method for constructing a speciﬁc type of reject option strategy for a pretrained two-class classiﬁer using the ROC analysis. El-Yaniv and Wiener (2010) proposed an algorithm learning the optimal strategy in the noise-free setting, i.e., when a perfect strategy with zero selective risk exists. Geifman and El-Yaniv (2017) shows how to equip a trained classiﬁer with a reject option provided an uncertainty measure is known and the task is to ﬁnd only a rejection threshold optimal under the bounded-improvement model. A large number of other works address the problem of uncertainty prediction, including recent papers related to deep learning like e.g. Lakshminarayanan et al. (2017); Jiang et al. (2018); Corbiere et al. (2019). They seek for an uncertainty score 1 deﬁned informally as a real-valued summary of an input observation that is predictive of the classiﬁcation error. These works do not formulate the problem to be solved explicitly as a rejection model. However, most of these works assess the performance of their uncertainty scores using evaluation metrics for the rejection models, namely, using the Risk-Coverage (RC) curve and the Area under the RC curve (Au RC). This article uniﬁes and extends existing formulations of an optimal reject option classiﬁer and proposes theoretically grounded algorithms to learn the rejection strategy for an arbitrary black-box classiﬁer from examples. The main contributions are as follows:

1. We derive necessary and suﬃcient conditions for an optimal prediction strategy of the bounded-improvement model when the underlying distribution is known. We show that an optimal solution is the Bayes classiﬁer endowed with a rejection strategy, which we call randomized Bayes selection function. The randomized Bayes selection function is constructed from the conditional expected risk and two parameters: a decision threshold and an acceptance probability. The strategy rejects prediction when the conditional risk is above the threshold, accepts prediction when it is below the threshold and randomizes with the acceptance probability otherwise. We provide an explicit relation between the decision threshold, the acceptance probability and the target risk.

2. We derive necessary and suﬃcient conditions for an optimal prediction strategy of the bounded-abstention model when the underlying distribution is known. We show that

1. Some works use term conﬁdence score which is inverse to the uncertainty score utilized in this paper.

Optimal Strategies for Reject Option Classifiers

the conditions are satisﬁed by the Bayes classiﬁer endowed with the randomized Bayes selection function. We provide an explicit relation between the decision threshold, the acceptance probability and the target coverage.

3. We deﬁne a notion of a proper uncertainty score as a function which preserves ordering of the inputs induced by the conditional expected risk. A proper uncertainty score is suﬃcient for construction of the randomized Bayes selection function. We propose two generic algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classiﬁer. The ﬁrst is based on regression of the classiﬁer loss. The second is based on minimization of a newly proposed loss function which we call SElective classiﬁer LEarning (SELE) loss. We show that SELE loss is a close approximation of the Au RC and, at the same time, amenable to optimization. We prove that both proposed algorithms provide Fisher consistent estimate of the proper uncertainty score. As a proof of concept, we apply the proposed algorithms to learn proper uncertainty scores for diﬀerent prediction problems including classiﬁcation, ordinal regression and structured output classiﬁcation. We demonstrate that the algorithm based on the SELE loss minimization learns uncertainty scores which consistently outperform common baselines and work on par with the state-of-the-art methods that are, unlike our algorithm, applicable only to particular prediction models.

Besides the algorithms proposed that are applicable to learning uncertainty score for an arbitrary classiﬁcation model, our contributions may have the following uses. Firstly, our analysis shows that despite their diﬀerent objectives, the cost-based, the boundedimprovement and the bounded-abstention rejection models are equivalent in the sense that they lead to the same prediction strategy. Secondly, the explicit characterization of optimal strategies provides a recipe to construct plug-in rules, which has been so far possible only for the cost-based model. That is, any method estimating the class posterior distribution can be turned into an algorithm learning the reject option classiﬁer that solves the boundedimprovement and the bounded-abstention model, respectively. Thirdly, there is a tight connection between the proposed bounded-abstention model and the RC curve. The RC curve represents the quality of all solutions of the bounded-abstention model that can be constructed from a pair of classiﬁer and uncertainty score. The Au RC is then an expected quality of the reject option classiﬁer constructed from the pair when the target coverage is selected uniformly at random. This connection sheds light on many published methods which do not explicitly deﬁne the target objective but use the RC curve and the Au RC as evaluation metrics. This article is an extension of our previous work published in Franc and Prusa (2019). The major extensions involve deriving optimality conditions for the bounded-abstention model, regression-based algorithm for learning the proper uncertainty score, analysis of the learning algorithms including the proof of Fisher consistency, describing the connection between the Au RC and the bounded-abstention model, and most of the experiments. The paper is organized as follows. Section 2 introduces the three rejection models and provides a characterization of their optimal solutions. Algorithms to learn a proper uncertainty score from examples are discussed in Section 3. Survey of related literature is given in Section 4. Experimental evaluation of the proposed learning algorithms is provided in Section 5. Section 6 concludes the paper. Proofs of all theorems are deferred to Appendix.

Franc, Prusa and Voracek

2. Reject Option Models and Their Optimal Strategies

Let X be a set of input observations and Y a ﬁnite set of labels. Let us assume that the inputs and labels are generated by a random process with p.d.f. p(x, y) deﬁned over X Y. The goal in the non-reject setting is to ﬁnd a classiﬁer h: X Y with a small expected risk

y Y p(x, y)ℓ(y, h(x)) dx ,

where ℓ: Y Y R+ is a loss penalizing the predictions. The expected risk can be reduced by abstaining from prediction in uncertain cases. To this end, we use a selective classiﬁer 2 (h, c) composed of a classiﬁer h: X Y and a selection function c: X [0, 1]. When applying the selective classiﬁer to input x X it outputs

(h, c)(x) = h(x) with probability c(x) , reject with probability 1 c(x) .

In the sequel, we introduce three models of an optimal selective classiﬁer: the cost-based, the bounded-improvement and the bounded-abstention model. We characterize the optimal strategies of the three models provided the underlying distribution p(x, y) is known.

2.1 Cost-based model

Besides the label loss ℓ: Y Y R+, let us deﬁne a reject loss ε R+ incurred when a classiﬁer rejects to predict. The selective classiﬁer (h, c) is then evaluated in terms of the expected risk

RB(h, c)= Z

y Y p(x, y) ℓ(y, h(x))c(x) + (1 c(x))ε dx . (1)

Problem 1 (Cost-based model) The optimal selective classiﬁer (h B, c B) is a solution to the minimization problem min h,c RB(h, c) , (2)

where we assume that both minimizers exist.

The well-known optimal strategy (h B, c B) solving Problem 1 reads

h B(x) argmin ˆy Y

y Y p(y | x) ℓ(y, ˆy) , (3)

1 if r (x) < ε , τ if r (x) = ε , 0 if r (x) > ε ,

2. The classiﬁer with a reject option is usually represented by a single function h : X Y {reject}. We use the decomposition h (x) = (h, c)(x), and the terminology selective classiﬁer from El-Yaniv and Wiener (2010) because we analyze h and c separately.

Optimal Strategies for Reject Option Classifiers

where r (x) = min ˆy Y

y Y p(y | x) ℓ(y, ˆy)

is the minimal conditional expected risk associated with input x, and τ is any real number from the interval [0, 1]. In the boundary case when r (x) = ε one can arbitrarily reject or return the best label h B(x) without aﬀecting the value of the risk RB(h, c). In turn, there always exists a deterministic optimal strategy solving the cost-based model. In the sequel, we denote the classiﬁer h B alone as Bayes classiﬁer. The cost-based model was coined in Chow (1970) who also provides an optimal strategy in the case p(x, y) is known. Chow (1970) considers only the 0/1-loss ℓ(y, y ) = [[y = y ]] 3. A generalization for an arbitrary loss and the optimal strategy of the generalized formulation can be found e.g. in Schlesinger and Hlav aˇc (2002); Santos-Pereira and Pires (2005).

2.2 Bounded-improvement model

One can characterize the selective classiﬁer by two antagonistic quantities: i) the coverage

X p(x) c(x) dx (5)

corresponding to the probability that the prediction is accepted and ii) the selective risk

y Y p(x, y) ℓ(y, h(x)) c(x) dx

deﬁned for non-zero φ(c) as the expected classiﬁcation loss on the accepted predictions.

Problem 2 (Bounded-improvement model) Given a target risk λ > 0, the optimal selective classiﬁer (h I, c I) is a solution to the problem

max h,c φ(c) s.t. RS(h, c) λ , (7)

where we assume that both maximizers exist.

Theorem 1 Let (h, c) be an optimal solution to (7). Then, (h B, c), where h B is the Bayes classiﬁer (3), is also optimal to (7).

According to Theorem 1 the Bayes classiﬁer h B is also optimal for the task (7) deﬁning the bounded-improvement model which is not surprising. Note however that the Bayes classiﬁer is not a unique solution to (7) because the predictions on the reject region Xc(x)=0 4

do not count to the selective risk and hence they can be arbitrary. Theorem 1 allows to solve the bounded-improvement task (7) in two consecutive steps: First, set h I to be the Bayes classiﬁer h B. Second, when h I is ﬁxed, the optimal selection function c I is obtained by solving the task (7) only with respect to c which boils down to

3. [[A]] is the Iverson bracket which evaluates to 1 if A is true and it is 0 otherwise. 4. For a function f : X R and a R { }, we deﬁne Xf(x) a ={x X |f(x) a}, Xf(x)<a = {x X | f(x) < a}, Xf(x)=a= {x X |f(x) = a}, Xf(x)>a = {x X |f(x) > a}, Xf(x) a= {x X |f(x) a}.

Franc, Prusa and Voracek

Problem 3 (Bounded-improvement model for known h(x)) Given a classiﬁer h(x), the optimal selection function c (x) is a solution to

max c [0,1]X φ(c) s.t. RS(h, c) λ . (8)

Note that Problem 3 is meaningful even if h is not the Bayes classiﬁer h B. In practice, we can seek for an optimal selection function c for any ﬁxed h which is usually our best approximation of h B learned from data. We will show that the key concept to characterize the optimal selection function of Problem 3 is the conditional expected risk of h 5 deﬁned as r(x) = X

y Y p(y | x) ℓ(y, h(x)) . (9)

Theorem 2 A selection function c : X [0, 1] is an optimal solution to Problem 3 if and only if it holds Z

Xr(x)<b p(x)c (x)dx = Z

Xr(x)<b p(x)dx, (10)

Xr(x)=b p(x)c (x)dx =

( ρ(Xr(x)<b)

b if b > 0 , R

Xr(x)=0 p(x)dx if b = 0 , (11)

Xr(x)>b p(x)c (x)dx = 0 , (12)

where r(x) = r(x) λ measures how much the conditional risk r(x) of the classiﬁer h(x) exceeds the target λ,

X p(x)r(x) dx (13)

is the expectation of r(x) restricted to inputs in X , and

b = sup {a | ρ(Xr(x) a) 0} 0 . (14)

Theorem 2 deﬁnes behaviour of an optimal selection function c (x) on a partition of the input space X into three regions Xr(x)<b, Xr(x)=b and Xr(x)>b. Informally, b can be interpreted as a threshold characterizing the largest preﬁx X X of inputs in X sorted by the conditional risk r(x) for which the expectation of r(x) does not exceed the target risk λ. In each region, the expected value of c (x) is constrained to a particular constant, the value of which depends on parameters of the problem. A particular selection function satisfying the optimality condition is given by the following theorem.

Theorem 3 Let r: X R be the conditional risk (9) of a classiﬁer h: X Y, γ = b + λ the rejection threshold given by the target risk λ and a constant b computed by (14). Then the selection function

1 if r(x) < γ , τ if r(x) = γ , 0 if r(x) > γ , (15)

5. Note that the conditional risk r is a function of the classiﬁer h. Since the classiﬁer h is eﬀectively ﬁxed throughout the paper, the dependency is not expressed explicitly to keep the notation simple.

Optimal Strategies for Reject Option Classifiers

where τ is the acceptance probability given by

( 1 if ρ(Xr(x)=γ) = 0 ,

ρ(Xr(x)<γ)) ρ(Xr(x)=γ)) if ρ(Xr(x)=γ) > 0 , (16)

satisﬁes the optimality condition of Theorem 2, and hence it is a solution to Problem 3.

The selection function (15) is deﬁned by the conditional risk r(x), the decision threshold γ and the acceptance probability τ. The prediction is always accepted when r(x) < γ and always rejected when r(x) > γ. In the boundary case, when r(x) = γ, the strategy randomizes and the prediction is accepted with probability τ. The decision threshold is given by γ = b + λ where λ is the target risk in the deﬁnition of Problem 3 and b is given by (14). Solving (14) is hard and it requires knowledge of p(x, y). When the probability mass of the set of boundary cases Xr(x)=γ is zero, which usually happens in case of continuous p(x), the acceptance probability is τ = 1 and the boundary cases are always accepted, i.e. no randomization is needed. The bounded-improvement model was coined in Pietraszek (2005) for two-classes and a speciﬁc class of rejection strategies. A generalized formulation which imposes no restriction on the number of classes and the class of decision strategies appeared e.g. in Geifman and El-Yaniv (2017). The necessary and suﬃcient conditions for an optimal selective classiﬁer were provided in Franc and Prusa (2019).

2.3 Bounded-abstention model

In this section, we introduce bounded-abstention model, the deﬁnition of which is symmetric to the deﬁnition of the bounded-improvement model.

Problem 4 (Bounded-abstention model) Given a target coverage ω > 0, the optimal selective classiﬁer (h A, c A) is a solution to the problem

min h,c RS(h, c) s.t. φ(c) ω , (17)

where we assume that both minimizers exist.

Theorem 4 Let (h, c) be an optimal solution to (17). Then, (h B, c), where h B is the optimal Bayes classiﬁer (3), is also optimal to (17).

Theorem 4 ensures that the Bayes classiﬁer h B is an optimal solution to (17) deﬁning the bounded-abstention model. Note that the solution is not unique as the predictions on Xc(x)=0 do not count to the selective risk, hence they can be arbitrary. After ﬁxing the classiﬁer h = h B the search for an optimal selection function leads to:

Problem 5 (Bounded-abstention model for known h(x)) Given a classiﬁer h(x) and a target coverage 0 < ω 1, the optimal selection function c (x) is a solution to the problem

min c [0,1]X RS(h, c) s.t. φ(c) ω , (18)

where we assume that the minimizer exists.

Franc, Prusa and Voracek

Theorem 5 A selection function c : X [0, 1] is an optimal solution to Problem 5 if and only if it holds

Xr(x)<β p(x)c (x)dx = Z

Xr(x)<β p(x)dx, (19)

Xr(x)=β p(x)c (x)dx = ω Z

Xr(x)<β p(x)dx, (20)

Xr(x)>β p(x)c (x)dx = 0 , (21)

Xr(x)<a p(x)dx ω

Theorem 5 deﬁnes necessary and suﬃcient conditions for an optimal solution to Problem 5. A particular selection function satisfying the optimality conditions is given by the following theorem.

Theorem 6 Let r: X R be the conditional risk (9) of a classiﬁer h: X Y, 1 ω > 0 be a target coverage and β be the constant computed by (22). Then the selection function

1 if r(x) < β , κ if r(x) = β , 0 if r(x) > β , (23)

where κ is the acceptance probability given by

Xr(x)=β p(x)dx = 0 ,

Xr(x)<β p(x)dx R

Xr(x)=β p(x)dx otherwise , (24)

satisﬁes the optimality condition of Theorem 6, and hence it is a solution of Problem 5.

The selection function (23) is determined by the conditional risk r(x), the decision threshold β and the acceptance probability κ. Both computations of the decision threshold β, deﬁned by (22), and the acceptance probability κ, deﬁned by (24), involve integration of p(x). When the probability mass of the set of boundary cases Xr(x)=β is zero, the acceptance probability is κ = 0 and the boundary cases are always rejected without any randomization. The bounded-abstention model was coined in Pietraszek (2005) for two-classes and a speciﬁc class of rejection strategies. In this article, we provide a generalized formulation with no restriction on the number of classes and the class of decision strategies, and we characterize the necessary and suﬃcient conditions for an optimal selective classiﬁer.

Optimal Strategies for Reject Option Classifiers

Model Parameter Deﬁnition

Cost-based reject cost ε min h,c RB(h, c)

Bounded-improvement target risk λ max h,c φ(c) s.t. RS(h, c) λ

Bounded-abstention target coverage ω min h,c RS(h, c) s.t. φ(c) ω

Table 1: Summary of three rejection models analyzed in the article. Each model has a single parameter shown in the middle column. The deﬁnition of the optimal selective classiﬁer is in the right column. The cost-based model deﬁnes the optimal selective classiﬁer using the expected risk RB(h, c) given by (1). The boundedimprovement and the bounded-abstention model are deﬁned in terms of the selective risk RS(h, c), equation (6), and the coverage φ(c), equation (5). Regardless the deﬁnition, the optimal solution is composed of the Bayes classiﬁer h B given by (3), and an instance of the randomized Bayes selection function c R given by (25).

2.4 Summary

We have shown that the three rejection models summarized in Table 1, namely, the costbased model (Problem 1), the bounded-improvement model (Problem 2) and the boundedabstention model (Problem 4), share the same class of optimal prediction strategies. An optimal selective classiﬁer (h, c) can be always constructed from the Bayes classiﬁer h = h B given by (3) and an optimal selection function

1 if r(x) < α , ν if r(x) = α , 0 if r(x) > α , (25)

where r(x) is the conditional expected risk of h given by (9), α R is a decision threshold and ν [0, 1] is an acceptance probability. We denote c R deﬁned by (25) as the randomized Bayes selection function. The randomized Bayes selection function c R is also an optimal solution of the rejection models deﬁned for an arbitrary (i.e. not necessarily the Bayes) classiﬁer h, that is, an optimal solution of Problem 3 and Problem 5. The constants (ν, α) are deﬁned for each rejection model diﬀerently and their value depends on parameters of the model (i.e. reject cost ε, target risk λ or target coverage ω), and the underlying distribution p(x, y). For example, in the case of the cost-based model, the acceptance threshold α equals to the reject cost ε and the acceptance probability ν can be arbitrary. In the case of the bounded-improvement and the bounded-abstention model, the constants (ν, α) are deﬁned implicitly via optimization problems and integral equations. In practice, (ν, α) can be tuned on data. For example, Geifman and El-Yaniv (2017) show how to ﬁnd α from a ﬁnite sample such that it is optimal for the bounded-improvement model in PAC sense.

Franc, Prusa and Voracek

The key component of the randomized Bayes selection function c R is ranking of the inputs X according to r(x). We introduce notion of a proper uncertainty score which is less informative than the conditional risk r(x), yet it is suﬃcient to construct c R.

Deﬁnition 1 Let h: X Y be a classiﬁer and r(x) = P y Y p(y | x) ℓ(y, h(x)) its conditional expected risk. We say that function s: X R is a proper uncertainty score of h iﬀ (x, x ) X X : r(x) < r(x ) s(x) < s(x ).

By deﬁnition, the proper uncertainty score s(x) preserves ordering of the inputs X induced by the conditional risk r(x). Therefore, replacing r(x) by s(x) in function (25), and changing the decision threshold α appropriately, leads to the same optimal selection function.

3. Learning uncertainty function

Assume that we want to construct a selective classiﬁer (h, c) solving any of the three rejection models described in Section 2. We have shown that regardless the rejection model, an optimal h is the Bayes classiﬁer h B given by (3) and an optimal c is the randomized Bayes selection function c R given by (25). In this section we consider the scenario where h: X Y is known, e.g., it has been trained from examples, and we want to endow it with c R. The key component of c R is a proper uncertainty score s: X R that satisﬁes Deﬁnition 1. In this section we address the problem of learning a proper uncertainty score from examples Tn = {(xi, yi) X Y | i = 1, . . . , n} assumed to be generated from n i.i.d. random variables with distribution p(x, y). Before describing the algorithms, in Section 3.1 we introduce the notion of Risk-Coverage (RC) curve and Area under Risk-Coverage curve (Au RC). In line with the literature, we use the Au RC as a metric to evaluate performance of the learned uncertainty scores. We also show a connection between the RC curve, the Au RC and the bounded-abstention model. In Section 3.2 we describe a plug-in conditional risk rule and point out that a frequently used Maximum Class Probability rule (MCP) is its special instance (Chow, 1970; Herbei and Wegkamp, 2006). In Section 3.3 we outline a learning approach based on a loss regression. In Section 3.4 we introduce a proxy of Au RC which we call a loss for SElective classiﬁer LEarning (SELE). We prove that both proposed methods learn the Fisher consistent estimator of the proper uncertainty score. The SELE loss and Theorem 9 proving its Fisher consistency were published in Franc and Prusa (2019). All other results presented in this section are novel.

3.1 Area under Risk Coverage curve

The majority of existing methods that learn selective classiﬁers output a classiﬁer h: X Y and a deterministic selection function c: X [0, 1] deﬁned as 6

c(x) = [[s(x) θ]] , (26)

where s: X R is an uncertainty score and θ R is a decision threshold. The performance of the pair (h, s) is evaluated by the RC curve obtained after computing the empirical

6. Note that the deterministic (26) and the randomized Bayes selection function c R coincide if the acceptance probability is ν = 1, which is a usual case when p(x) is continuous.

Optimal Strategies for Reject Option Classifiers

selective risk and the coverage for all settings of the threshold θ. That is, the computation is as follows. Let us order the examples Tn = {(xi, yi) X Y | i = 1, . . . , n} according to s(x) so that s(xπ(1)) s(xπ(2)) s(xπ(n)), where π: {1, . . . , n} {1, . . . , n} is a permutation deﬁning the order7. Let L(i, s) = Pi j=1 ℓ(yπj, h(xπj)) be a sum of losses incurred by the classiﬁer h(x) on the examples with uncertainty not higher than the i-th highest uncertainty on the examples Tn. The Risk-Coverage curve C = {(1

i L(i, s), i

n) | i = 1, . . . , n} is a set of two-dimensional points, where the pair (1

i L(i, s), i

n) corresponds to the empirical estimate of the selective risk RS(h, c) and the coverage φ(c) of a selective classiﬁer (h, c) with the deterministic selective function (26) and the decision threshold θ = s(xπi). The area under the RC curve C is then

Au RC(s, Tn) = 1

i L(i, s) = 1

j=1 ℓ(yπj, h(xπj)) . (27)

The value of Au RC(s, Tn) can be interpreted as an arithmetic mean of the empirical selective risks corresponding to the coverage spread evenly over the interval [0, 1] with step 1

n. There is a tight connection between the RC curve, Au RC and the bounded-abstention model. The RC curve C represents the quality of all admissible solutions of the boundedabstention model that can be constructed from the pair (h, s) when using the sample Tn for evaluation. The value of Au RC(s, Tn) is an estimate of the expected quality of the selective classiﬁer constructed from the pair (h, s) when the target coverage is selected uniformly at random.

3.2 Plug-in conditional risk rule

Prediction models, like, e.g., Logistic Regression or Neural Networks learned by crossentropy loss, use the training set Tn to learn an estimate ˆp(y | x) of the class posterior distribution p(y | x). The estimate is then used to construct a plug-in Bayes classiﬁer ˆh(x) argminˆy Y P y Y ˆp(y | x)ℓ(y, ˆy). Similarly, using ˆp(y | x) instead of p(y | x) in (9) gives the plug-in rule for the conditional risk of the classiﬁer h deﬁned as

y Y ˆp(y | x) ℓ(y, h(x)) .

Provided p(y | x) = ˆp(y | x), x X, y Y, the plug-in conditional risk ˆr(x) is by deﬁnition a proper uncertainty score and it can be used to construct the randomized Bayes selection function c R which is an optimal rejection strategy for all three rejection models.

Example 1 (Maximum Class Probability rule) In case of 0/1-loss ℓ(y, y ) = [[y = y ]] the plug-in Bayes classiﬁer decides based on the maximum posterior probability ˆh(x) argmaxy Y ˆp(y | x) and the plug-in conditional risk rule is

y Y ˆp(y | x) ℓ(y, ˆh(x)) = 1 max y Y ˆp(y | x) .

7. To break ties we use the index of the input in case the scores are the same.

Franc, Prusa and Voracek

3.3 Loss regression

A straightforward approach to learn the uncertainty score is to pose it as a regression problem. The regression function gets an input x X and outputs an estimate of the classiﬁcation loss ℓ(y, h(x)). Formally, given a hypothesis space F {s: X R}, classiﬁer h(x) and training set Tn, the loss regression score s: X R is a solution to mins F Freg(s) where

Freg(s) = 1

ℓ(yi, h(xi)) s(xi) 2 .

It is easy to show that the loss regression score is Fisher consistent estimate of the proper uncertainty score. This amounts to deﬁning the expectation Freg(x) with respect to i.i.d. generated training set Tn, i.e.,

Ereg(s) = ETn p(x,y)Freg(s)

i=1 p(xi, yi) 1

ℓ(yi, h(xi)) s(xi) 2 dx1 dxn (28)

y Y p(x, y) ℓ(y, h(x)) s(x) 2 dx ,

and showing that its minimizer is the conditional risk r(x) which is by deﬁnition a proper uncertainty score. This is ensured by the following theorem.

Theorem 7 The conditional risk r(x) deﬁned by (9) is an optimal solution to min s:X R Ereg(s).

3.4 Minimization of SELE loss

In this section, we deﬁne a computationally manageable proxy of Au RC, which we call SElective classiﬁer LEarning (SELE) loss. The SELE loss sele : Rn X n Yn R+ is deﬁned as 8

sele(s, Tn) = 1

j=1 ℓ(yi, h(xi))[[s(xi) s(xj)]] . (29)

In contrast to Au RC the computation of sele does not require sorting the examples, i.e., we eliminate the permutations that make the evaluation diﬃcult. sele is still hard to optimize directly due to the step function in its deﬁnition. After replacing the step function [[ ]] by a logistic function, we obtain its proxy ψsele : Rn X n Yn R+ deﬁned as

ψsele(s, Tn) = 1

j=1 ℓ(yi, h(xi)) log 1 + exp(s(xj) s(xi) . (30)

The function ψsele(s, Tn) is smooth and convex w.r.t. the argument s and therefore is amenable to optimization. Minimization of ψsele is the core of the proposed learning algorithm which works as follows.

8. We assume that the training set Tn has at least two examples, i.e. n 2.

Optimal Strategies for Reject Option Classifiers

Given a hypothesis space F {s: X R}, a classiﬁer h(x) and training set Tn, the SELE score s: X R is a solution to the problem mins F ψsele(s, Tn). We justify the proposed algorithm empirically in Section 5. The theoretical justiﬁcation is based on the following three arguments:

1. The value of sele is a close approximation to Au RC. In case the values of s(x) are diﬀerent for each input in Tn, i.e. s(xi) = s(xj), i = j, then we have

sele(s, Tn) Au RC(s, Tn) 2 sele(s, Tn) .

The ﬁrst inequality follows from

sele(s, Tn) = 1

1 n ˆL(i, s) 1

i ˆL(i, s) = Au RC(s, Tn) .

The second inequality is ensured by Theorem 8.

2. The uncertainty score estimator deﬁned as ˆs argmins [0,1]X sele(s, Tn) is Fisher consistent. Namely, Theorem 9 ensures that a population minimizer of sele is a proper uncertainty score.

3. The Fisher consistency is preserved even for the smooth proxy ψsele the minimization of which is in the core of the proposed algorithm. Namely, Theorem 11 ensures the population minimizer of ψsele is a proper uncertainty score.

Theorem 8 The inequality Au RC(s, Tn) < 2 sele(s, Tn) holds true for any s: X R and Tn = {(xi, yi) X Y | i = 1, . . . , n}. Moreover, for any ε > 0, there are n, s and Tn such that Au RC(s, Tn) (2 ε) sele(s, Tn).

To show the Fisher consistency of sele we deﬁne its expectation with respect to i.i.d. generated examples Tn, i.e.,

Esele(s) = Z

i=1 p(xi, yi) sele(s, Tn)dx1 dxn

j=1 r(xi) [[s(xi) s(xj)]] dx1 dxn

X p(xi)p(xj)r(xi)[[s(xi) s(xj)]]dxi dxj

X p(xi)p(xi)r(xi)[[s(xi) s(xi)]]dxi dxi

X p(x) r(x) Z

X p(z) [[s(x) s(z)]]dz dx + 1

X p2(x)r(x)dx dx .

Minimizers of Esele are characterized by the following theorem 9.

X R z =x s (z)=s (x) f(x,z)dz dx stands for R

X f(x,z)dz dx where X = {z X |z =x s (z)=s (x)}, etc.

Franc, Prusa and Voracek

Theorem 9 A function s : X R is an optimal solution to mins:X R Esele(s) iﬀ

z =x s (z)=s (x) max{r(x), r(z)}p(x)p(z)dz dx = 0 , and (32)

r(z)<r(x) s (z)>s (x) (r(x) r(z)) p(x)p(z)dz dx = 0 . (33)

The conditions (32) and (33) imply that the conditional expectations Ex,z p(x)[max{r(x), r(z)} | z = x s (x) = s (z)] and Ex,z p(x)[r(x) r(z) | r(z) < r(x) s (z) > s (x)] are both zero. If combined, it further implies that a subset of the input space X = {(x, z) X X | r(z) < r(x) s (z) > s (x)}, on which the order is violated, has probability measure zero. In other words, the optimal s (x) preserves the ordering induced by r(x) almost surely 10.

Corollary 10 Any function s : X R fulﬁlling

(x, x ) X X : x = x s(x) = s(x ), and (34)

(x, x ) X X : r(x) < r(x ) s(x) < s(x ) (35)

satisﬁes the optimality conditions of Theorem 9.

Note that (34) requires the minimizer of sele to assign a unique value to each input x X which is not necessary for the score to be proper. Hence, the minimizers of sele form a subset of all proper uncertainty scores. To show the Fisher consistency of the smooth proxy ψsele we deﬁne its expectation with respect to i.i.d. generated examples Tn, i.e.,

Eproxy(s) = Z

i=1 p(xi, yi) ψsele(s, Tn) dx1 dxn

X p(x) r(x) Z

X p(z) log 1 + exp(s(z) s(x)) dz dx

X p(x) r(x)dx . (36)

We omitted the derivation as it is similar to (31). The key property of the minimizers of Eproxy is stated in the following theorem.

Theorem 11 Let s : X R be an optimal solution to mins: X R Eproxy(s). Then, the condition (x, x ) X X : r(x) < r(x ) s (x) < s (x )

is satisﬁed almost surely.

10. This means that the condition can be violated at most on a subset of X with probability measure zero.

Optimal Strategies for Reject Option Classifiers

4. Related Works

The cost-based rejection model was proposed in Chow (1970) who also provides the optimal strategy in case the distribution p(x, y) is known, analyzes the error-reject trade-oﬀ, and proves the basic properties of the error-rate and the reject-rate, e.g., that both functions are monotone with respect to the reject cost. The original paper considers the risk with 0/1-loss only. The model with arbitrary classiﬁcation costs was analyzed, e.g., in Tortorella (2000); Santos-Pereira and Pires (2005); Schlesinger and Hlav aˇc (2002). The bounded-improvement model and the bounded-abstention model were coined in Pietraszek (2005). He proposes a method to construct a speciﬁc class of rejection strategies composed of two classiﬁers described by a single ROC curve. The task is to ﬁnd the optimal decision thresholds of the two classiﬁers, which is done numerically based on ROC analysis. The original formulation assumes two classes and the particular form of the rejection strategies. In this article, we consider a generalization of the bounded-improvement model, see Problem 2, and the bounded-abstention model, see Problem 4, which allow for an arbitrary number of classes (including structured output prediction) and do not impose any constraint on the rejection strategy. We have proved that necessary and suﬃcient conditions on an optimal strategy for both models in the case where p(x, y) is known. We showed that in both cases, a particular optimal solution is composed of the Bayes classiﬁer and the randomized Bayes selection function. There exist other formulations of rejection models for two-class classiﬁers. For example, in Hanczar and Dougherty (2008) the objective is to maximize coverage under the constraints that each class has an error rate below a speciﬁc threshold. Hence, it can be seen as a generalization of the bounded-improvement model. The objective of a rejection model proposed in Lei (2014) is to maximize the total coverage under the constraint that each class has coverage above a speciﬁc threshold. None of the two articles analyzes optimal strategies of the corresponding models. A common approach to construct selective classiﬁers for the cost-based model is based on the plug-in rule, which involves learning the class posterior distribution from examples and plugging the distribution into the formula deﬁning the Bayes-optimal strategy (3). In the case of 0/1-loss the plug-in rule rejects based on the maximal class posterior which is denoted as Maximal Class Probability (MCP) rule (see Example 1). The MCP rule is probably the most frequently used uncertainty score in the literature. The statistical consistency of the plug-in rejection rule is discussed in Herbei and Wegkamp (2006). Fumera et al. (2000) investigate how errors in estimation of the posterior distribution aﬀect eﬀectiveness of the plug-in rule, and they also try to improve its performance by using class-speciﬁc thresholds. Other methods to improve the plug-in rule by tuning multiple thresholds were proposed in Kummert et al. (2016); Fischer et al. (2016). In our work, we have derived the optimal strategies for the bounded-improvement model and the bounded-abstention model. Our results thus provide a recipe to construct the plug-in rules also for these two rejection models. There exist many modiﬁcations of standard prediction models to learn reject option classiﬁers for the cost-based model. For example, extensions of the Support Vector Machines to learn a reject option classiﬁer have been extensively studied in Grandvalet et al. (2008); Bartlett and Wegkamp (2008); Yuan and Wegkamp (2010). These works are limited to

Franc, Prusa and Voracek

two-class problems and 0/1-loss. Learning leads to minimization of a convex surrogate of the cost-based model s objective function. Under some conditions, the algorithms are statistically consistent. A boosting algorithm is proposed in Cortes et al. (2016) to learn a two-class classiﬁer with reject option. The algorithm minimizes a convex surrogate for the cost-based model and shows that the surrogate is calibrated with Bayes solution. Learning prototype-based classiﬁer with the rejection option has been addressed in Villman et al. (2016). All these methods require the reject cost to be ﬁxed at the time of learning, and hence changing the cost requires re-training. In contrast, we propose algorithms to learn the proper uncertainty score on top of a pre-trained classiﬁer so that the risk-coverage trade-oﬀ can be set by tuning the reject threshold without re-training. For many prediction models, it is easy to devise an ordinal uncertainty score from outputs of the learned (non-reject) classiﬁer. Such strategies are often heuristically based but work reasonably well in practice. For example, Le Cun et al. (1990) proposed a reject strategy for a Neural Network classiﬁer based on thresholding either the output of the maximally activated unit of the last layer or a diﬀerence between the maximal and runnerup output units. Other heuristically based strategies for neural networks were evaluated in Zaragoza and d Alche Buc (1998); Fisher et al. (2015). In case of Support Vector Machine classiﬁers (Vapnik, 1998) the trained linear score, proportional to the distance between the input and the decision hyperplane, is directly used as the uncertainty score (Fumera and Roli, 2002). We denote this approach as the margin score and use it as a baseline in our experiments. The learning of a selective classiﬁer optimal for the bounded-improvement model was discussed in El-Yaniv and Wiener (2010). Their method requires a noisy-free scenario, i.e. they maximize the coverage under the constraint that the selective risk is zero. They provide a characterization of the lower and upper bounds of the risk-coverage curves in the PAC setting. Geifman and El-Yaniv (2017) assume a selective classiﬁer based on thresholding an uncertainty score and show how to ﬁnd a decision threshold for the bounded-improvement model which is optimal in the PAC sense. They do not address the problem of learning the uncertainty score. We complement their work by showing that the threshold-based selective classiﬁer is an optimal solution when the uncertainty function is proper, and we propose algorithms to learn the proper uncertainty score from examples. Recent work addresses uncertainty prediction in the context of deep learning (Lakshminarayanan et al., 2017; Jiang et al., 2018; Corbiere et al., 2019). These works do not formulate the problem to be solved explicitly as a rejection model. However, they empirically evaluate their uncertainty scores in terms of the Risk-Coverage curve and the Area under the RC curve which we have shown to be connected with the bounded-abstention model (see Section 3.1). Lakshminarayanan et al. (2017) construct the MCP rule from a posterior distribution modeled as an ensemble of neural networks trained from multiple random initialization. They use adversarial examples to smooth out the posterior estimate. Jiang et al. (2018) propose a Trust Score as the ratio between the distance from the test sample to the samples of the nearest class with a diﬀerent label and the distance to the samples with the same labels as the predicted class. Corbiere et al. (2019) propose a True Class Probability (TCP) score as a measure of prediction conﬁdence. The TCP predicts the value of p(y | x) where y is the ground truth label. They learn an NN, the so-called Conf Net, by minimizing L2-loss between the Conf Net output and ˆp(yi | x) on training ex-

Optimal Strategies for Reject Option Classifiers

amples, where ˆp(y | x) is the soft-max distribution trained by standard cross-entropy loss. They show that the TCP empirically outperforms the MCP score and the Trust Score (Jiang et al., 2018) in terms of the Au RC metric. Both Jiang et al. (2018); Corbiere et al. (2019) consider the two-stage approach to learn the uncertainty score similarly to our paper. We show empirically that our proposed SELE score, in addition to having a theoretical backing and being applicable to a generic classiﬁcation problem, outperforms the state-of-the-art TCP score.

5. Experiments

In Section 3, we outlined two risk minimization based methods to learn the uncertainty score s(x) for a pre-trained predictor h(x), namely, the algorithm based on i) loss regression (Section 3.3) and ii) minimization of SELE loss (Section 3.4). We have shown that both methods are Fisher consistent, i.e., they are guaranteed to ﬁnd the proper score in the idealized setting when the distribution p(x, y) is known (estimation error is zero), the hypothesis space F contains the proper score (approximation error is zero) and the loss minimizer can be found exactly (optimization error is zero). In this section, we evaluate these methods experimentally on real data when all assumptions are presumably violated. We design the experiments so that the optimization and the estimation error are small by using a large number of training examples and linear rules, making the loss minimization a convex problem. We compare against the recently proposed True Class Probability (TCP) score (Corbiere et al., 2019) which is learned from examples like the proposed methods. Unlike the proposed methods, the TCP requires the prediction model h(x) to provide an estimate of the posterior p(y | x), hence it is not applicable to fully discriminative models like e.g. SVMs. We emphasize that the experiments are meant to be a proof of concept rather than an exhaustive comparison of all existing methods. On the other hand, we are not aware of any other generic method (i.e., being not connected to a particular prediction model) we could compare with. To demonstrate that the proposed methods are generic, we consider three diﬀerent categories of prediction problems: classiﬁcation, ordinal regression, and structured output classiﬁcation. For each prediction problem, we use several benchmark datasets and frequently used prediction models like Logistic Regression (LR), three variants of Support Vector Machines (SVMs), and Gradient Boosted Trees. For each prediction model, there exists an uncertainty score that is commonly used in practice, like, e.g., Maximal Class Probability (MCP) for logistic regression or distance to the decision hyper-plane (a.k.a. margin score) for SVMs. We use these uncertainty scores as additional baselines in our experiments.

5.1 Compared methods for uncertainty score learning

In this section, we describe three algorithms that use a training set Tn = {(xi, yi) X Y | i = 1, . . . , n} to learn an uncertainty score s(x) for a pre-trained classiﬁer h(x). We consider linear scores sθ(x) = θ, ψ(x) , where θ Rd are parameters to be learned and ψ: X Rd

is a ﬁxed mapping that will be deﬁned for each prediction model separately in the following sections. All evaluated methods are instances of a regularized risk minimization framework. In all cases, learning leads to an unconstrained minimization of a convex objective F(θ) = C

2 θ 2 + ˆR(θ, Tn), where C > 0 is a regularization constant and ˆR(θ, Tn) is an

Franc, Prusa and Voracek

empirical risk deﬁned by each method diﬀerently. The optimal value of C is selected from {0, 1, 10, 100, 1000} based on the minimal value of the Au RC evaluated on a validation set.

5.1.1 Regression score

The parameters θ Rd are learned by minimizing a convex function

FREG(θ) = C

ℓ(yi, h(xi)) sθ(xi) 2 .

Minimization of FREG(θ) is an instance of ridge regression that can be solved eﬃciently, e.g., by Singular Value Decomposition (SVD).

5.1.2 SELE score

Evaluation of the proposed SELE loss ψsele(s, Tn) as deﬁned by (30) requires O(n2) operations. To decrease the complexity, we approximate its value by splitting the examples into chunks and computing the average loss over the chunks. Namely, the parameters θ Rd

are learned by minimizing a convex function

FSELE(θ) = C

i=1 ψsele(s, T k n ) ,

where C > 0 is a regularization constant and T 1 n T 2 n T P n is a randomly generated partition of the training set Tn into P approximately equally sized batches. In all experiments, we used P = round(n/500), i.e., the chunks contain around 500 examples. We minimize FSELE(θ) by the Bundle Method for Risk Minimization (BMRM) algorithm (Teo et al., 2010) which is set to ﬁnd a solution whose objective is at most 1% oﬀthe optimum 11. The total computation time of the BMRM algorithm is in the order of units of minutes for all datasets using a contemporary PC.

5.1.3 True Class Probability score

The TCP (Corbiere et al., 2019) was originally designed for getting uncertainty score on top of a Convolution Neural Network (CNN) trained with cross-entropy loss. The setting we consider here can be seen as the original method applied to a single layer CNN. Namely, let ˆp(y | x) be an estimate of the posterior distribution in our experiments provided by the Logistic Regression (a.k.a. single layer NN). The parameters θ Rd are learned by minimizing a convex function

FTCP(θ) = C

i=1 (ˆp(yi | xi) sθ(xi))2 .

Minimization of FREG(θ) is an instance of the ridge regression which we solve by SVD.

11. We use (Fprimal Fdual)/Fprimal 0.01 as the stopping condition of the BMRM algorithm.

Optimal Strategies for Reject Option Classifiers

5.2 Benchmark problems

5.2.1 Classification

Given real-valued features x X Rd, the task is to predict a hidden state y Y = {1, . . . , Y } so that the expectation of 0/1-loss ℓ(y, y ) = 100 [[y = y ]] is as small as possible 12. We selected 11 classiﬁcation problems from the UCI repository (Dua and Taniskidou, 2017) and lib SVM datasets (Chang and Lin, 2011). The datasets are summarized in Table 2. We chose the datasets with a suﬃciently large number of examples relative to the number of features, as we need to learn both the classiﬁer and the uncertainty score and simultaneously keep the estimation error low. Each dataset was randomly split 5 times into 5 subsets, Trn1/Val1/Trn2/Val2/Tst, in ratio 30/10/30/10/20 (up to CODRNA with ratio 25/5/20/20/30 and COVTYPE with ratio 28/20/2/20/30). The subsets Trn1/Val1 were used for learning and tuning the best regularization constant of the classiﬁer h(x). The subsets Trn2/Val2 were used for learning and tuning the regularization constant of the uncertainty score s(x) as described in Section 5.1. All features were normalized to have zero mean and unit variance. The normalization coeﬃcients were estimated using only the Trn1 and Trn2 subsets, respectively. The Tst subset was used solely to evaluate the test performance. We used two prediction models: Logistic Regression (LR) (Hastie et al., 2009) and Support Vector Machines (SVM) (Vapnik, 1998).

Logistic Regression learns parameters θLR = ((wy, by) Rd R | y Y) of the posterior probabilities ˆpθ(y | x) exp( wy, x + by) by maximizing the regularized loglikelihood FLR(θ) = C

n Pn i=1 log ˆpθ(yi | xi) . The optimal C was selected from {1, 10, 100, 1000} based on the validation classiﬁcation error. After learning θLR we used the plug-in Bayes classiﬁer h(x) = argmaxy Y ˆpθLR(y | x). As a baseline uncertainty score we use the plug-in class conditional risk ˆr(x) = 1 ˆpθLR(h(x) | x). In accordance with the literature we refer to this baseline as the Maximal Class Probability (MCP) rule. As shown in Section 3.2, the MCP score is the proper uncertainty score provided the estimate ˆp(y | x) matches the true posterior p(y | x).

Support Vector Machines learn parameters θSVM = ((wy, by) Rd R | y Y) of the linear classiﬁer h(x) = argmaxy Y( wy, x + by) by minimizing FSVM(θ) = C

2 θ 2 + 1 n Pn i=1 maxy Y [[y = yi]]+ wy wyi, xi . The optimal C was selected in the same way as in case of the LR. As the baseline uncertainty measure we use s(x) = maxy Y wy, x +by. In the binary case |Y| = 2, the setting was θSVM = (w, b), h(x) = sgn( w, x + b), FSVM(θ) = C

n Pn i=1 max{0, 1 yi( w, xi + b)} and s(x) = | w, x + b|. In both cases, the value of s(x) is proportional to a distance between the input x and the decision boundary. We denote this baseline as the margin score. Given the pre-trained LR or SVM classiﬁer h(x), we apply the methods from Section 5.1 to learn the uncertainty score

sθ(x) = wh(x), x + by , (37)

where θ = ((wy, by) Rd R | y Y) are the parameters to be learned. It is seen that the rule (37) can be re-written as sθ(x) = θ, ψ(x) , where ψ: X Rd is an appropriately

12. Due to the factor 100 the reported errors correspond to the percentage of misclassiﬁed examples.

Franc, Prusa and Voracek

deﬁned feature map. The form of the score (37) can be justiﬁed by noting that its special instance is the margin score which is obtained after substituting θSVM for θ.

5.2.2 Ordinal regression

The task is to predict a hidden state from Y = {1, . . . , Y } based on real-valued features X Rd. Unlike the classiﬁcation problem, the hidden states Y are assumed to be ordered and the goal is to minimize the expectation of the Mean Absolute Error ℓ(y, y ) = |y y |. We selected 11 regression problems from UCI repository (Dua and Taniskidou, 2017). The datasets are summarized in Table 2. The real-valued hidden states were discretized into Y bins, which are constructed to get a uniform class prior. Each dataset was randomly split 5 times into 5 subsets, Trn1/Val1/Trn2/Val2/Tst, in ratio 30/10/30/10/20. We used the same normalization and evaluation protocol as described for the classiﬁcation benchmarks. As a prediction model, we used a variant of the Support Vector Machine algorithm developed for ordinal regression (Chu and Keerthi, 2005) 13.

Support Vector Ordinal Regression (SVOR) learns parameters θSVOR = (w Rd, (b1, . . . , b Y 1) RY 1) of the ordinal linear classiﬁer h(x) = 1 + PY 1 y=1 [[ x, w > by]] by minimizing FSVOR(θ) = C

n Pn i=1 Pyi 1 y=1 max(0, 1 w, xi +by)+PY 1 y=yi max(0, 1+ xi, w by) . The optimal C was selected from {1, 10, 100, 1000} based on the validation MAE. The ordinal classiﬁer can be thought of as a standard linear classiﬁer composed of parallel decision hyper-planes. Similarly to the standard SVM, we use s(x) = miny {1,...,Y 1} | x, w by| as a baseline uncertainty score. The value of s(x) is proportional to the distance of x to the closest hyper-plane hence we also denote it as the margin score. When learning the uncertainty from examples, we use the parametrization (37).

5.2.3 Structured Output Classification

Given an RGB image x X = {0, . . . , 255}W H 3 capturing a human face, the task is to predict a pixel positions of 68 landmarks y = (l1, . . . , l68) Y = ({1, . . . , W} {1, . . . , H})68

corresponding to contours of eyes, mouth, nose, etc. We use the 300-W dataset and the associated evaluation protocol which was created by the organizers of landmark detection challenge (Sagonas et al., 2016). The 300-W dataset contains 5,807 faces each annotated with 68 landmarks. The faces are split into 3,484 training, 1,161 validation and 1,162 test examples. The prediction accuracy is measured in terms of normalized average localization error ℓ(y, ˆy) = 100 iod(y) P68 i=1 li ˆli , where iod(y) is the inter-ocular distance computed from the ground-truth landmark positions y. As the structured classiﬁer h(x) we use the landmark detector from DLIB package (King, 2009). The detector predicts the landmark positions based on HOG descriptors (Dalal and Triggs, 2005) of the input image using an ensemble of regression trees that are trained by gradient boosting (Kazemi and Sullivan, 2014). The DLIB landmark detector has been widely used by developers due to its robustness and exceptional speed even on low-end hardware. The detector does not provide any measure of prediction uncertainty and it is unclear how to derive it from outputs of the regression trees. A commonly used uncertainty

13. (Chu and Keerthi, 2005) introduced two variants of SVOR algorithm. We use so called SVOR with implicit constraints which is designed for minimization of the MAE loss.

Optimal Strategies for Reject Option Classifiers

score for face recognition related prediction problems is a score function of a face detector. The face detector score is the output of a binary classiﬁer trained to distinguish a face from non-face images. The value of the score is high for well-looking prototypical faces and low for corrupted or diﬃcult faces. As a baseline, we use the score of the DLIB face detector which is a linear SVM classiﬁer on top of HOG descriptors extracted from the image. When learning the uncertainty score from examples, the score is a linear regressor sθ(x) = θ, ψ(x) on top of a feature vector ψ(x) R2,448 which is a concatenation of HOG descriptors extracted from the input facial image x along landmark positions predicted by the landmark detector h(x). The DLIB detector uses the same features to predict the landmark positions, hence the extra computational time required to evaluate the uncertainty score is neglectable.

Classiﬁcation problems dataset examples feat cls AVILA 20,867 10 12 CODRNA 331,152 8 2 COVTYPE 581,012 54 7 IJCNN 49,990 22 2 LETTER 20,000 16 26 MARKETING 45,211 51 2 PENDIGIT 10,992 16 10 PHISHING 11,055 68 2 SATTELITE 6,435 36 6 SENSORLESS 58,509 48 11 SHUTTLE 58,000 9 7

Ordinal regression problems dataset examples feat cls ABALONE 4,177 10 19 BANK 8,192 32 10 BIKESHARE 17,379 11 10 CALIFORNIA 20,640 8 10 CCPP 9568 4 10 CPU 8192 21 10 FACEBOOK 50,993 53 10 GPU 24,1600 14 10 METRO 48,204 30 10 MSD 499,671 90 41 SUPERCOND 21,263 81 10

Table 2: Summary of 11 classiﬁcation problems (left) and 11 ordinal regression problems (right) selected from UCI repository (Dua and Taniskidou, 2017) and lib SVM datasets (Chang and Lin, 2011). The table shows the total number of examples, the number of features and the number of classes.

5.3 Results

5.3.1 Classification problems

For both classiﬁcation models, LR and SVM, we recorded the test risk of the classiﬁer h(x) and the Au RC computed from h(x) and uncertainty score s(x) produced by the corresponding method under evaluation. In the case of LR, we compare the baseline MCP score (sec 5.2.1) and scores learned from examples including the state-of-the-art TCP score (sec 5.1.3) and the two proposed SELE score (sec 5.1.2) and Regression (REG) score (sec 5.1.1). In the case of SVM, we compare the baseline margin score (sec 5.2.1) against the proposed SELE and REG scores. Note that TCP score is not applicable for SVM classiﬁer as it does not provide an estimate of the posterior probability p(y | x). The results are summarized in Table 3.

Franc, Prusa and Voracek

For each dataset, we rank the compared methods according to the Au RC 14. Following the methodology of Demˇsar (2006) we summarize the performance of each method by its average rank and use the Friedman test and the post-hoc Nemenyi test to analyze signiﬁcance of the results.

We used the Friedman test to check whether the measured average ranks are significantly diﬀerent from the mean rank. The null hypothesis states that the compared scores are equivalent so that their average ranks should be equal. In both cases, the null hypothesis is rejected for p-value 0.05, i.e., performance of the compared methods is signiﬁcantly diﬀerent.

We used post-hoc Nemenyi test for pairwise comparison. For each pair of methods, it checks whether their average ranks are signiﬁcantly diﬀerent. In the case of LR, when we compare K = 4 methods using N = 11 datasets, the critical diﬀerence for p-value 0.10 is CD = 1.26. By comparing the average ranks, we conclude that SELE score performs signiﬁcantly better than MCP score and REG score. In the case of SVM, when we compare K = 3 methods using N = 11 dataset, the critical diﬀerence for p-value 0.10 is CD = 0.98. We conclude that SELE performs signiﬁcantly better than REG score and Margin score. The data is not suﬃcient to reach any conclusion about other pairwise comparisons. The result of the Nemenyi test is visualized in Figure 1.

We further computed the relative improvement gained by using the scores SELE, TCP, and REG, that are all learned from examples, with respect to the baseline scores derived from the classiﬁer output, i.e. MCP in the case of LR and Margin score in the case of SVM. The results are summarized in Figure 2. It is seen that the MCP uncertainty computed from the estimated p(y | x) constitutes a much stronger baseline than the Margin score of the fully discriminative SVM model. The relative improvement of scores learned on top of the LR is only moderate in contrast to the SVM classiﬁer where the improvements are more signiﬁcant and consistent. It is also seen that on the majority of datasest, the performance of the learned scores is similar taking into account the statistical error of the Au RC estimate. It is also worth mentioning that the results of SELE score have the lowest variance of the Au RC estimates as seen from the error bars in Figure 2.

LR+SELE LR+TCP LR+REG LR+MCP

SVM+SELE SVM+Margin SVM+REG a) Average ranks for scores on top LR. b) Average ranks for scores on top of SVM.

Figure 1: Comparison of all uncertainty scores against each other with the Nemenyi test. The test is computed separately for the LR classiﬁer (4 scores compared) and the SVM classiﬁer (3 scores compared). The ﬁgures show the average ranks for each score and the critical distance (CD). Groups of scores that are not signiﬁcantly diﬀerent at p-value 0.10 are connected.

14. The score with smallest Au RC is ranked 1, the second smallest 2 and so on.

Optimal Strategies for Reject Option Classifiers

LR+MCP LR+SELE LR+REG LR+TCP LR Au RC Au RC Au RC Au RC R@100 AVILA 27.18 0.55 25.79 0.44 26.62 0.74 26.85 0.78 43.71 0.42 CODRNA 0.88 0.05 0.65 0.03 0.82 0.06 0.78 0.04 4.81 0.08 COVTYPE 16.49 0.06 17.58 0.07 17.62 0.09 17.19 0.07 27.56 0.17 IJCNN 1.26 0.04 1.00 0.03 1.16 0.08 1.14 0.06 7.54 0.15 LETTER 7.43 0.40 6.42 0.34 7.44 0.59 6.71 0.42 23.32 0.60 MARKETING 2.60 0.31 1.88 0.11 1.97 0.12 1.90 0.11 9.88 0.29 PENDIGIT 0.69 0.04 1.55 0.19 1.97 0.55 1.47 0.39 5.29 0.40 PHISHING 0.76 0.10 0.75 0.10 0.91 0.31 0.85 0.25 6.29 0.44 SATTELITE 3.83 0.26 3.68 0.27 4.93 1.07 4.52 0.85 15.06 0.46 SENSORLESS 2.03 0.11 1.82 0.08 2.69 0.09 2.37 0.22 8.23 0.45 SHUTTLE 0.59 0.09 0.26 0.07 1.24 0.51 0.58 0.13 3.36 0.25 average rank 2.73 1.36 3.55 2.36 (a) Uncertainty scores on top of LR classiﬁer.

SVM+MARGIN SVM+SELE SVM+REG SVM Au RC Au RC Au RC R@100 AVILA 31.65 0.83 25.26 0.67 25.95 0.75 43.34 0.70 CODRNA 0.89 0.05 0.65 0.03 0.82 0.05 4.78 0.08 COVTYPE 25.71 0.81 17.79 0.21 17.77 0.14 27.41 0.11 IJCNN 1.40 0.04 1.01 0.04 1.18 0.08 7.56 0.16 LETTER 10.20 0.22 6.05 0.65 7.15 0.65 22.06 0.69 MARKETING 2.24 0.20 1.97 0.10 2.04 0.20 10.48 0.39 PENDIGIT 2.79 0.40 1.57 0.21 2.16 0.43 4.88 0.57 PHISHING 0.84 0.12 0.72 0.12 0.90 0.30 6.37 0.44 SATTELITE 4.75 0.60 3.82 0.27 5.44 0.68 15.36 0.37 SENSORLESS 3.68 0.20 1.56 0.08 2.46 0.29 6.92 0.17 SHUTTLE 1.31 0.47 0.24 0.07 0.55 0.15 2.02 0.15 average rank 2.82 1.09 2.09 (b) Uncertainty scores on top of SVM classiﬁer.

Table 3: Performance of the uncertainty scores on 11 classiﬁcation problems. The scores are constructed on top of the LR classiﬁer and the SVM classiﬁer measured in terms of Au RC. For each score we show the mean and the standard deviation of the test Au RC computed over 5 random splits. We compare the performance of scores learned from examples (SEL, REG, TCP) and the baseline scores derived from the classiﬁers output (MCP and Margin score). The last column shows the risk of the base (non-selective) classiﬁer. All the values correspond to percentage of misclassiﬁcation. The best results for each dataset are shown in bold. The last row shows the average rank.

Franc, Prusa and Voracek

(a) Improvement over LR+MCP score. (b) Improvement over SVM+Margin score.

improvement [%]

LR+SELE LR+REG LR+TCP

improvement [%]

SVM+SELE SVM+REG

Figure 2: Relative improvement gained by using the uncertainty scores (SELE, REG and TCP) that are learned from examples over the baseline scores (MCP for LR and Margin score for SVM) constructed from the classiﬁer output. The relative improvement is computed as 100 (Au RCbaseline Au RCmethod)/Au RCbaseline. We show the mean and the standard deviation (error bar) of the relative improvement computed over the random 5 splits.

5.3.2 Ordinal regression

In the case of SVOR classiﬁer, we compared the baseline Margin score (sec 5.2.2) and the scores learned from examples including SELE (sec 5.1.2) and REG score (sec 5.1.1). We used the same evaluation protocol as for the classiﬁcation task, however, instead of 0/1-loss, the errors were evaluated by the MAE loss. The results are summarized in Table 4. We again ranked the methods according to the Au RC and summarized their performance by the average rank:

We applied the Friedman test checking whether the measured average ranks are significantly diﬀerent from the mean rank. The null hypothesis, stating that the compared scores are equivalent, is rejected for p-value 0.05, i.e., performance of the compared methods is signiﬁcantly diﬀerent.

We used post-hoc Nemenyi test to check for each pair whether their average ranks are signiﬁcantly diﬀerent. Considering K = 3 compared methods using N = 11 dataset yields the critical diﬀerence for p-value 0.10 is CD = 0.98. We conclude that SELE and REG scores are signiﬁcantly better than the baseline Margin score. The data is not suﬃcient to reach any conclusion about the comparison of SELE and REG. The result of the Nemenyi test is visualized in Figure 3(a).

The relative improvement gained by using SELE and REG scores learned from examples w.r.t. baseline Margin score is shown in Figure 3(b). It is seen that the performance of the learned scores is similar and that they consistently outperform the baseline by a signiﬁcant margin.

Optimal Strategies for Reject Option Classifiers

SVOR+MARGIN SVOR+SELE SVOR+REG SVOR Au RC Au RC Au RC R@100 CALIFORNIA 0.98 0.03 0.82 0.02 0.84 0.02 1.18 0.01 ABALONE 1.48 0.10 1.19 0.09 1.21 0.05 1.54 0.02 BANK 1.07 0.04 0.99 0.04 0.98 0.03 1.50 0.03 CPU 0.41 0.01 0.36 0.02 0.36 0.02 0.64 0.03 BIKESHARE 1.60 0.07 1.25 0.01 1.27 0.01 1.70 0.03 CCPP 0.46 0.02 0.41 0.02 0.42 0.02 0.58 0.02 FACEBOOK 0.51 0.01 0.37 0.01 0.36 0.01 1.11 0.01 GPU 1.43 0.02 0.85 0.03 0.86 0.03 1.49 0.02 METRO 2.20 0.07 1.97 0.01 1.98 0.03 2.37 0.03 MSD 6.23 0.07 4.26 0.03 4.25 0.03 6.22 0.03 SUPERCONDUCT 0.98 0.02 0.75 0.01 0.77 0.01 1.07 0.01 average rank 3.00 1.27 1.73

Table 4: Performance of the uncertainty scores on 11 ordinal regression problems. The scores are constructed on top of the SVOR classiﬁer. For each score we show the mean and the standard deviation of the test Au RC computed over 5 random splits. We compare the performance of scores learned from examples (SEL, REG) and the baseline Margin score derived from the SVOR classiﬁer output. The last column shows the risk of the base (non-selective) classiﬁer. All the values correspond to the Mean Absolute Error (MAE). The best results for each dataset are shown in bold. The last row shows the average rank.

5.3.3 Structured Output Classification

We trained SELE score (sec 5.1.2) and REG score (sec 5.1.1) on top of the DLIB detector and compared them with the baseline which uses the DLIB face detector score (sec 5.2.3) as an uncertainty measure. The Risk-Coverage curves of the three methods and their corresponding Au RC are shown in Figure 4(a). Both the learned scores, SELE and REG, are signiﬁcantly better than the baseline face detector score. The SELE is slightly better than REG score. The largest diﬀerences between the three scores are seen for low values of coverage where SELE outperforms the other two methods. High selective risk for low values of coverage means that faces with very bad landmark predictions are assigned the lowest uncertainty scores. SELE score does not suﬀer from this problem. This can be seen in Figure 5 where we show examples of the 10 test faces with the lowest uncertainty and the highest uncertainty predicted by SELE.

Unlike the experiments in the previous section, the number of parameters to be learned (m = 2, 448) relative to the number of training examples (n = 3, 484) is much higher. To see whether the number of examples is suﬃcient, we trained SELE and REG scores on an increasingly bigger training set. Figure 4(b) shows the test Au RC as a function of the the number of training examples. It is seen that Au RC of the SELE is not yet saturated and

Franc, Prusa and Voracek

a) Average rank for scores on top of SVOR. b) Relative improvement over SVOR+Margin score.

SVOR+REG SVOR+Margin SVOR+SELE

SUPERCONDUCT

improvement [%]

SVOR+SELE SVOR+REG

Figure 3: Statistics derived from results obtained on 11 ordinal regression problems. Figure (a) shows pair-wise comparison of uncertainty scores with the Nemenyi test. The ﬁgure shows the average ranks for each score and the critical distance (CD). Groups of scores that are not signiﬁcantly diﬀerent at p-value 0.10 are connected. Figure (b) shows relative improvement gained by using the SELE and REG uncertainty scores learned from examples over the baseline Margin score.

would most likely converge to a signiﬁcantly lower value relative to the REG score provided 300-W dataset had more training examples.

6. Conclusions

The standard cost-based rejection model introduced by Chow (1970) requires an explicit deﬁnition of the rejection cost, which is diﬃcult in applications when the reject cost and the loss of the label have a diﬀerent nature or physical units. Pietraszek (2005) proposed a bounded-improvement model and a bounded-abstention model which avoid the problem by deﬁning an optimal prediction strategy in terms of coverage and selective risk. Our main result is a formal proof that despite their diﬀerent objectives, the three rejection models are equivalent in the sense that they lead to the same prediction strategy: the Bayes classiﬁer and the randomized Bayes selection function. Thanks to the common optimal solution, it is possible to convert the parameters of diﬀerent rejection models. For example, for any target risk deﬁning the bounded-improvement model, there exists a corresponding reject cost so that both models have the same optimal strategy. The explicit characterization of the optimal strategies provides a recipe for building plug-in rules to solve bounded-improvement and bounded-abstention models. Any method estimating the class posterior probabilities can be thus turned into an algorithm for learning the selective classiﬁer that solves the bounded-improvement and the bounded-abstention model. We have deﬁned a notion of a proper uncertainty score which is suﬃcient to construct the randomized Bayes selection function. We proposed two algorithms to learn a proper

Optimal Strategies for Reject Option Classifiers

(a) Risk-Coverage curve (b) Learning curve

0 0.5 1 coverage

selective risk

DLIB+SELE, Au RC=2.6 DLIB+REG, Au RC=2.8 DLIB+FACE-SCORE, Au RC=3.5

#training examples

DLIB+SELE DLIB+REG

Figure 4: Evaluation of uncertainty scores on top of DLIB landmark detector using the 300-W benchmark. Figure (a) shows the RC curve and Au RC of computed from the predictions of DLIB detector endowed with the three compared uncertainty scores: the proposed SELE and REG scores and the DLIB face detector score used as a baseline. Figure (b) shows the test Au RC for SELE and REG scores as a function of the number of training examples.

uncertainty score from examples for a given classiﬁer. We have shown that both algorithms provide a Fisher-consistent estimate of the proper uncertainty score. As a proof of concept, we evaluated the proposed algorithms on diﬀerent types of prediction problems. We have shown that the proposed algorithm based on minimization of the SELE loss outperforms existing approaches tailored for a particular prediction model and works on a par with the recently published state-of-the-art TCP score (Corbiere et al., 2019). Unlike the TCP score, which requires the classiﬁer to output the class posterior probabilities, the proposed algorithms are applicable to an arbitrary black-box classiﬁer.

We have drawn a connection between the proposed bounded-abstention model and the RC curve. Namely, the RC curve represents the quality of all admissible solutions of the bounded-abstention model that can be constructed from a pair of classiﬁer and uncertainty score. The Au RC is then the expected quality of the selective classiﬁer constructed from the pair when the target coverage is selected uniformly at random. This connection sheds light on many published methods which do not explicitly deﬁne the target objective but use the RC curve and the Au RC as evaluation metrics.

Finally, let us mention some topics for future work. First, the proposed algorithms consider a two-stage scenario in which the classiﬁer and the uncertainty score are learned separately from independent training sets. Although the scenario is useful in practice, an algorithm that learns the classiﬁer and the uncertainty score simultaneously from a single training set constitutes an interesting topic to be solved. Secondly, we have shown how to learn the proper uncertainty score, but have not discussed how to set up the decision threshold and the acceptance probability that are also needed to construct the selective

Franc, Prusa and Voracek

classiﬁer. It is straightforward to tune these parameters using empirical data and the RC curve. Analysis of the generalization error of this empirical approach is an open issue which has been solved only for the decision threshold of the bounded-improvement model by Geifman and El-Yaniv (2017). Thirdly, the empirical evaluation is limited to uncertainty scores linear in the parameters to be learned. Eﬃcient implementations of the algorithms applicable to non-linear models, like e.g. the neural networks, is an another topic left for future. Forth, we use Fisher consistency to justify the proposed SELE loss. Deriving eﬃcient ﬁnite sample guarantees constitutes another interesting open problem.

Acknowledgments

We are grateful to all reviewers for their insightful feedback. The authors acknowledge support for this work from the Czech Science Foundation project GACR GA19-21198S. We also acknowledge project No. CZ.02.1.01/0.0/0.0/16 019/0000765 that supported V. Franc in 0.4 FTE.

Optimal Strategies for Reject Option Classifiers

1: loss=0.02

6: loss=0.03

1153: loss=0.53

1158: loss=0.25

2: loss=0.02

7: loss=0.02

1154: loss=0.71

1159: loss=0.82

3: loss=0.02

8: loss=0.03

1155: loss=0.32

1160: loss=0.30

4: loss=0.02

9: loss=0.02

1156: loss=0.49

1161: loss=0.46

5: loss=0.03

10: loss=0.02

1157: loss=0.43

1162: loss=0.37

Figure 5: Figure shows examples of 10 test faces from 300-W database with the lowest and 10 faces with the highest value of the SELE uncertainty score. The ground-truth landmark positions (red) and the landmark positions predicted by DLIB detector (blue) are superimposed into the image. The image title shows the rank induced by the SELE score and the normalized localization error which is used as the classiﬁcation loss in this application.

Franc, Prusa and Voracek

Appendix A. Proofs of theorems from Section 2

A.1 Proof of Theorem 1

The Bayes classiﬁer reads

h B(x) argmin ˆy Y

y Y p(y | x) ℓ(y, ˆy) (3)

Problem 2 (Bounded-improvement model) Given a target risk λ > 0, the optimal selective classiﬁer (h I, c I) is a solution to the problem

max h,c φ(c) s.t. RS(h, c) λ , (7)

where we assume that both maximizers exist.

Theorem 1 Let (h, c) be an optimal solution to (7). Then, (h B, c), where h B is the Bayes classiﬁer (3), is also optimal to (7).

Proof It is suﬃcient to show that (h B, c) is feasible to (7), i.e., that RS(h B, c) λ. Then (h B, c) attains the same maximum objective value φ(c) as (h, c). Derive

RS(h B, c) = 1 φ(c)

y Y p(x, y) ℓ(y, h B(x)) c(x) dx

y Y p(y | x) ℓ(y, h B(x))

y Y p(y | x) ℓ(y, h(x))

y Y p(x, y) ℓ(y, h(x)) c(x) dx

= RS(h, c) λ.

A.2 Proof of Theorem 2

The presented proof of the theorem uses Lemmas 13 and 14, both derived based on Lemma 12 bellow.

Lemma 12 For a set X, let f : X R+15 and g : X R be measurable functions such that R

X f(x)dx > 0 and g(x) > 0 for all x X. Then it holds R

X g(x)f(x)dx > 0.

15. We use R, R+ and N+ to denote the set of real numbers, non-negative real numbers and positive integers, respectively.

Optimal Strategies for Reject Option Classifiers

Proof For n N+, deﬁne functions

fn(x) = f(x) if g(x) 1

n, 0 otherwise.

The sequence {fn} n=1 is monotone and converges to f. Using the monotone convergence theorem (Stein and Shakarchi, 2009), derive

lim n fn(x)dx = lim n

This means that there is a k N+ such that R

X fk(x)dx > 0, hence we conclude Z

g(x)f(x)dx Z

g(x)fk(x)dx Z

1 kfk(x)dx > 0.

Lemma 13 For a set X, let f : X R+ and g : X R be measurable functions such that R

X f(x)dx > 0 and g(x) > b for all x X and some b R. Then it holds R

X g(x)f(x)dx > b R

Proof By Lemma 12, we have Z

(g(x) b)f(x)dx > 0,

g(x)f(x)dx = Z

(g(x) b)f(x)dx + Z

bf(x)dx > b Z

Lemma 14 For a set X, let f : X R+ and g : X R be measurable functions such that R

X g(x)f(x)dx > 0 and g(x) < 1 for all x X. Then it holds R

X f(x)dx > R

X g(x)f(x)dx.

X g(x)f(x)dx > 0 implies R

X f(x)dx > 0. Since it holds x X : (1 g(x)) > 0, Lemma 12 yields

(1 g(x))f(x)dx = Z

g(x)f(x)dx,

X f(x)dx > R

X g(x)f(x)dx is obtained as a direct consequence.

Franc, Prusa and Voracek

Problem 3 (Bounded-improvement model for known h(x)) Given a classiﬁer h(x), the optimal selection function c (x) is a solution to

max c [0,1]X φ(c) s.t. RS(h, c) λ . (8)

Theorem 2 A selection function c : X [0, 1] is an optimal solution to Problem 3 if and only if it holds Z

Xr(x)<b p(x)c (x)dx = Z

Xr(x)<b p(x)dx, (10)

Xr(x)=b p(x)c (x)dx =

( ρ(Xr(x)<b)

b if b > 0 , R

Xr(x)=0 p(x)dx if b = 0 , (11)

Xr(x)>b p(x)c (x)dx = 0 , (12)

where r(x) = r(x) λ measures how much the conditional risk r(x) of the classiﬁer h(x) exceeds the target λ,

X p(x)r(x) dx (13)

is the expectation of r(x) restricted to inputs in X , and

b = sup {a | ρ(Xr(x) a) 0} 0 . (14)

Proof Observe that b 0, because ρ(Xr(x) 0) 0. Next, observe that Problem 3 can be rewritten into the form

max c [0,1]X

X p(x)c(x)dx s.t. Z

X p(x)c(x)r(x)dx 0 (38)

RS(h, c) λ =

y Y p(x, y) ℓ(y, h(x)) c(x) dx λφ(c)

X p(x)c(x)r(x) dx λ R

X p(x)c(x)r(x) dx

φ(c) . (40)

Let F(c) = φ(c) = R

X p(x)c(x)dx denote the objective function of (38).

Case 1 b > 0. Claim I Each c : X [0, 1] which fulﬁls (10), (11) and (12) is feasible to (38) and

bρ(Xr(x)<b). (41)

Proof of Claim I.

Optimal Strategies for Reject Option Classifiers

Equality (41) is simply obtained by summing LHS and RHS of (10), (11) and (12). To verify the constraint of (38), observe that, since r is a bounded function and Z

Xr(x)<b p(x)(c (x) 1)dx (10) = 0 , (42)

it holds that Z

Xr(x)<b p(x)(c (x) 1)r(x)dx = 0 , (43)

which implies Z

p(x)c (x)r(x)dx = Z

p(x)r(x)dx (13) = ρ(Xr(x)<b). (44)

If b < , then Z

p(x)c (x)r(x)dx (12) = Z

p(x)c (x)r(x)dx + Z

p(x)c (x)r(x)dx

p(x)r(x)dx + b Z

p(x)c (x)dx (45)

(11),(13),(44) = ρ(Xr(x)<b) ρ(Xr(x)<b) = 0. (46)

If b = , then Z

p(x)c (x)r(x)dx = Z

p(x)c (x)r(x)dx (44) = ρ(Xr(x)<b) 0.

Claim II Let c : X [0, 1] be a feasible solution to (38) that violates at least one of the constraints (10), (11) and (12). Then, F(c) < F(c ), where c : X [0, 1] is a conﬁdence function satisfying (10), (11), (12), and, without loss of generality,

x Xr(x)<b : c (x) = 1 . (47)

Proof of Claim II. Distinguish three cases.

Case 1.1 Condition (12) is violated (note that this is possible only if b < ), i.e. Z

p(x)c(x)dx > 0. (48)

Inequality (48) and Lemma 13 (applied to f(x) = p(x)c(x) and g(x) = r(x)) yield Z

p(x)c(x)r(x)dx > b Z

p(x)c(x)dx.

Franc, Prusa and Voracek

Therefore, we can write Z

p(x)c(x)r(x)dx = b Z

p(x)c(x)dx (49)

for a suitable b R+ such that b > b > 0. (50)

Based on the constraint of (38), derive Z

p(x)c(x)r(x)dx (49) = Z

p(x)c(x)r(x)dx + b Z

p(x)c(x)dx + b Z

(38) 0 (46) = Z

p(x)c (x)r(x)dx + b Z

p(x)c (x)dx. (51)

Let σ(x) = 1

br(x). Inequality (51) can be rearranged and upper bounded as Z

p(x)c(x)dx Z

p(x)c (x)dx + b

p(x)c(x)dx (52)

p(x)(c (x) c(x))σ(x)dx Z

p(x)c (x)dx Z

where the second inequality follows from x Xr(x)<b : σ(x) 1. From this we get Z

p(x)c(x)dx (52) Z

p(x)c (x)dx b

p(x)c(x)dx . (53)

Now, derive

p(x)c(x)dx + Z

p(x)c (x)dx b

(48),(50) < Z

p(x)c (x)dx = F(c ).

Case 1.2 Condition (12) holds, condition (11) is violated. If R

Xr(x)=b p(x)c(x)dx < ρ(Xr(x)<b)

b , then obviously F(c) < F(c ). Hence, assume

p(x)c(x)dx > ρ(Xr(x)<b)

Optimal Strategies for Reject Option Classifiers

Analogically to (51), derive Z

p(x)c(x)r(x)dx + b Z

p(x)c(x)dx (38) 0 (55)

p(x)c (x)r(x)dx + b Z

p(x)c (x)dx, (56)

p(x)c(x)σ(x)dx + Z

p(x)c(x)dx Z

p(x)c (x)σ(x)dx + Z

p(x)c (x)dx

(57) where σ(x) = 1

br(x) < 1 for all x Xr(x)<b. Denote and derive

p(x)c(x)dx Z

p(x)c (x)dx (11) = Z

p(x)c(x)dx + ρ(Xr(x)<b)

(54) > 0. (58)

Then, (57) can be rewritten as Z

p(x)(c (x) c(x))σ(x)dx (58) > 0. (59)

Inequality (59) and Lemma 14 (applied to g(x) = σ(x) < 1 and f(x) = p(x)(c (x) c(x)) (47) 0 over Xr(x)<b) yield Z

p(x)(c (x) c(x))dx > . (60)

Now, combine and rearrange (58) and (60) to obtain

p(x)c (x)dx + Z

p(x)c (x)dx (60) > (61)

p(x)c(x)dx + Z

p(x)c(x)dx = F(c). (62)

Case 1.3 Conditions (11) and (12) hold, condition (10) is violated, i.e. Z

p(x)c(x)dx < Z

p(x)dx. (63)

p(x)c (x)dx ρ(Xr(x)<b)

p(x)c(x)dx ρ(Xr(x)<b)

Franc, Prusa and Voracek

Case 2 b = 0. This occurs only if R

Xr(x)<0 p(x)r(x)dx = 0. The constraint of (38) implies

p(x)c(x)r(x)dx = 0 ,

p(x)c(x)dx = 0 ,

which conﬁrms condition (12). Finally, the obvious equations

max c:X [0,1]

p(x)c(x)dx = Z

p(x)dx , and

max c:X [0,1]

p(x)c(x)dx = Z

conﬁrm condition (10) and (11), respectively.

A.3 Proof of Theorem 3

Theorem 3 Let r: X R be the conditional risk (9) of a classiﬁer h: X Y, γ = b + λ the rejection threshold given by the target risk λ and a constant b computed by (14). Then the selection function

1 if r(x) < γ , τ if r(x) = γ , 0 if r(x) > γ , (15)

where τ is the acceptance probability given by

( 1 if ρ(Xr(x)=γ) = 0 ,

ρ(Xr(x)<γ)) ρ(Xr(x)=γ)) if ρ(Xr(x)=γ) > 0 , (16)

satisﬁes the optimality condition of Theorem 2, and hence it is a solution to Problem 3.

Proof The optimality conditions (10) and (12) given in Theorem 2 are equivalent to a probabilistic statement Px p(x)[c (x) = 0 r(x) < b] = 0 and Px p(x)[c (x) = 1 r(x) > b] = 0, respectively. Hence the two conditions are satisﬁed by a selection function which predicts, c (x) = 1, whenever r(x) < b and rejects, c (x) = 0, whenever r(x) > b. Or equivalently, using the identity r(x) = r(x) λ and a threshold γ = b + λ, by c (x) = 1 when r(x) < γ and c (x) = 0 when r(x) > γ. Finally, if we opt for a selection function that

Optimal Strategies for Reject Option Classifiers

is constant c (x) = τ inside the boundary region Xr(x)=b, then the condition (11) im-

plies τ = ρ(Xr(x)<γ)

b ρ0 if b > 0, where ρ0 = R

Xr(x)=b p(x) dx, and τ = 1 if b = 0. Using Xr(x)<b = Xr(x)<γ and b ρ0 = ρ(Xr(x)=b) = ρ(Xr(x)=γ), we derive (16).

A.4 Proof of Theorem 4

Theorem 4 Let (h, c) be an optimal solution to (17). Then, (h B, c), where h B is the optimal Bayes classiﬁer (3), is also optimal to (17).

Proof The theorem follows from the fact that RS(h B, c) RS(h, c) for any (h, c) feasible to (17). This inequality was already derived within the proof of Theorem 1.

A.5 Proof of Theorem 5

Problem 5 (Bounded-abstention model for known h(x)) Given a classiﬁer h(x) and a target coverage 0 < ω 1, the optimal selection function c (x) is a solution to the problem

min c [0,1]X RS(h, c) s.t. φ(c) ω , (18)

where we assume that the minimizer exists.

Theorem 5 A selection function c : X [0, 1] is an optimal solution to Problem 5 if and only if it holds

Xr(x)<β p(x)c (x)dx = Z

Xr(x)<β p(x)dx, (19)

Xr(x)=β p(x)c (x)dx = ω Z

Xr(x)<β p(x)dx, (20)

Xr(x)>β p(x)c (x)dx = 0 , (21)

Xr(x)<a p(x)dx ω

Proof By substituting the deﬁnitions of RS(h, c) and φ(c) to (18), we rewrite the problem into the form

min c [0,1]X

X p(x)c(x)r(x)dx

X p(x)c(x)dx s.t. Z

p(x)c(x)dx ω . (64)

Franc, Prusa and Voracek

X p(x)c(x)r(x)dx R

X p(x)c(x)dx denote the objective function of (64). Whenever c : X

[0, 1] fulﬁls (19), (20) and (21), it is feasible to (64) and

F(c ) = β + 1

p(x)r(x)dx β

p(x)dx , (65)

which is a value independent of c . We will prove the theorem by showing that any c : X [0, 1] feasible to (64) that violates at least one of conditions (19), (20), (21) is not an optimal solution. Three cases will be examined. Case 1 Condition (19) is violated, i.e., Z

p(x)c(x)dx < Z

p(x)dx. (66)

This means that there is a subset X X such that

x X : r(x) β (67)

p(x)c(x)dx = Z

p(x)c(x)dx (66) > 0 . (68)

Deﬁne c : X [0, 1] as follows.

1 if r(x) < β , 0 if x X , c(x) otherwise . (69)

c is feasible to (64) as φ(c ) = φ(c). Derive

φ(c) F(c) F(c ) = Z

p(x)c(x)r(x)dx Z

p(x)r(x)dx + Z

p(x)c(x)r(x)dx

(67),(68) Z

β p(x)(1 c(x))dx Z

p(x)(1 c(x))r(x)dx (71)

p(x)(1 c(x))(β r(x))dx > 0 (72)

where the inequality in (72) is obtained from Lemma 12 applied to f(x) = p(x)(1 c(x)), g(x) = β r(x), and the set Xr(x)<β. This shows that c is not an optimal solution. Case 2 Condition (19) is satisﬁed, condition (20) is violated and Z

Xr(x)=β p(x)c(x)dx < ω Z

Xr(x)<β p(x)dx > 0 . (73)

Optimal Strategies for Reject Option Classifiers

In this case, there is a c : X [0, 1] such that

c (x) = c(x) if r(x) < β , (74)

c (x) = 0 if r(x) > β , (75)

Xr(x)=β p(x)c (x)dx = Z

Xr(x)=β p(x)c(x)dx + Z

Xr(x)>β p(x)c(x)dx . (76)

If Lemma 13 is applied to f(x) = p(x)c(x), g(x) = r(x), and the set Xr(x)>β, we get

Xr(x)>β p(x)c(x)r(x)dx > β Z

Xr(x)>β p(x)c(x)dx . (77)

It also holds that φ(c ) = φ(c). Now, deriving

φ(c) F(c) F(c ) (76) = Z

p(x)c(x)r(x)dx β Z

p(x)c(x)dx (78)

p(x)c(x) β Z

p(x)x(x)dx = 0 (79)

shows that c is not an optimal solution. Case 3 φ(c) > ω, which occurs if Z

Xr(x)=β p(x)c(x)dx > ω Z

Xr(x)<β p(x)dx > 0 (80)

(implying that condition (20) is violated), or if condition (21) is violated. Observe that F(c) = F(α c) for any a R+. Let c = ω φ(c) c. Since φ(c ) = ω, the selection function c is feasible to (64). Because Z

Xr(x)<β p(x)c (x)dx = ω φ(c)

Xr(x)<β p(x)c(x)dx < Z

Xr(x)<β p(x)dx , (81)

c violates condition (19) and is therefore not an optimal solution (see Case 1). This implies that c is not an optimal solution too.

A.6 Proof of Theorem 6

Theorem 6 Let r: X R be the conditional risk (9) of a classiﬁer h: X Y, 1 ω > 0 be a target coverage and β be the constant computed by (22). Then the selection function

1 if r(x) < β , κ if r(x) = β , 0 if r(x) > β , (23)

Franc, Prusa and Voracek

where κ is the acceptance probability given by

Xr(x)=β p(x)dx = 0 ,

Xr(x)<β p(x)dx R

Xr(x)=β p(x)dx otherwise , (24)

satisﬁes the optimality condition of Theorem 6, and hence it is a solution of Problem 5.

Proof It is easy to see that c satisﬁes conditions (19) and (21). The validity of condition (20) is proved as follows. If R

Xr(x)=β p(x)dx = 0, then R

Xr(x)=β p(x)dx = ω, and

condition (20) is met. If R

Xr(x)=β p(x)dx > 0, we derive

Xr(x)=β p(x)c (x)dx = ω R

Xr(x)<β p(x)dx R

Xr(x)=β p(x)dx

Xr(x)=β p(x)dx = ω Z

Xr(x)<β p(x)dx . (82)

Appendix B. Proofs of theorems from Section 3

B.1 Proof of Theorem 7

The expectation of the squared loss deviation reads

Ereg(s) = Z

y Y p(x, y) ℓ(y, h(x)) s(x) 2 dx . (83)

Theorem 7 The conditional risk r(x) deﬁned by (9) is an optimal solution to min s:X R Ereg(s).

Proof We can rewrite Ereg(s) as

Ereg(s) = Z

y Y p(y | x) ℓ(y, h(x))2 2 ℓ(y, h(x)) s(x)+s(x)2 dx = Z

X p(x)f(s(x))dx.

Due to additivity, we can solve mins: X R Ereg(s) for each x X separately by setting derivative of f(s) to zero and solving for s which yields

f (s) = 2 X

y Y p(y | x)ℓ(y, h(x)) + 2 X

y Y p(y | x)s(x) = 0 s (x) = X

y Y p(y | x)ℓ(y, h(x)) .

Optimal Strategies for Reject Option Classifiers

B.2 Proof of Theorem 8

Lemma 15 For an integer n 1, let {ai}n i=1 and {bi}n i=1 be sequences of positive real numbers such that {ai

bi }n i=1 is a non-increasing sequence. For each non-decreasing sequence {ℓi}n i=1 of non-negative real numbers with a positive sum it holds that

Pn i=1 aiℓi Pn i=1 biℓi Pn i=1 ai Pn i=1 bi .

Proof We will show that the lemma statement is a direct consequence of the four-letter identity

i=1 Ci Di = X

j<k (Aj Dk Ak Dj) (Cj Bk Ck Bj)

(Steele, 2004). Setting Ai = ai, Bi = 1, Ci = ℓi, and Di = bi for all i = 1, . . . , n yields

i=1 biℓi = X

j<k (ajbk akbj) (ℓj ℓk) . (84)

The lemma assumptions ensure that for any j < k, ajbk akbj 0 and ℓj ℓk 0. Thus each term in the right-hand side of equality (84) is non-positive, which implies the required result.

For i = 1, . . . , n and a permutation π on {1, . . . , n}, let aπi = Pn j=i n

j and bπi = n i+1. We can write

sele(s, Tn) = 1

Au RC(s, Tn) = 1

i=1 aπiℓπi ,

where ℓπi = ℓ(yπi, h(xπi)). Without loss of generality, assume that ℓ1 ℓ2 . . . ℓn and πi = i for all i = 1, . . . , n. Let Hk = Pk i=1 1

i denote the k-th harmonic number. As shown in (Havil, 2003), it fulﬁls

ln(k) + γ + 1 2(k + 1) < Hk < ln(k) + γ + 1

where γ 0.5772156649 is the Euler Mascheroni constant. Moreover, let us deﬁne H0 = 0.

Lemma 16 Let n 2 be an integer. For each i = 1, . . . , n, let ai = Pn j=i n

j and bi = n i + 1. Then, ai

bi+1 holds for all i = 1, . . . , n 1.

Franc, Prusa and Voracek

Proof Since

bi+1 = n (Hn Hi 1)

n i + 1 n (Hn Hi)

it suﬃces to show that

(n i)(Hn Hi 1) (n i + 1)(Hn Hi)

and this is equivalent to showing that

(n i + 1)Hi (n i)Hi 1 Hn .

If we substitute Hi 1 = Hi 1

i , the inequality further reduces to

Now we derive

Hn Hi (85) < ln n

2n 1 2(i + 1) ln n

where the last inequality follows from the fact that ln(x) < x 1 for all x > 1. This conﬁrms inequality (86).

Lemma 17 Pn i=1(Hn Hi 1) = n.

Proof By induction on n. The lemma trivially holds for n = 1. Let n > 1. Then, using the induction hypothesis, we derive

i=1 (Hn Hi 1) = Hn Hn 1 +

n + Hn 1 Hi 1

i=1 (Hn 1 Hi 1) = n .

Theorem 8 The inequality Au RC(s, Tn) < 2 sele(s, Tn) holds true for any s: X R and Tn = {(xi, yi) X Y | i = 1, . . . , n}. Moreover, for any ε > 0, there are n, s and Tn such that Au RC(s, Tn) (2 ε) sele(s, Tn).

Optimal Strategies for Reject Option Classifiers

Proof The ﬁrst statement trivially holds if Pn i=1 ℓi = 0. If Pn i=1 ℓi > 0, we apply Lemmas 16, 15 and 17 to derive

Au RC(s, Tn)

sele(s, Tn) Pn i=1 ai Pn i=1 bi = Pn i=1 n (Hn Hi 1) Pn i=1(n i + 1) = n2

n 2 (n + 1) = 2n n + 1 < 2 .

Taking ℓi = 1 for all i = 1, . . . , n yields

Au RC(s, Tn)

sele(s, Tn) = Pn i=1 ai Pn i=1 bi = 2n n + 1 .

Since limn 2n n+1 = 2, we see that also the second statement is valid.

B.3 Proof of Theorem 9

Remark 18 For the sake of simplicity, for predicates ϕ1(x, z), . . . , ϕk(x, z) and a function f : X X R, we write

ϕ1(x,z) ... ϕk(x,z)

f(x, z)dz dx

to represent

f(x, z)[[ϕ1(x, z) . . . ϕk(x, z)]]dz dx .

Theorem 9 A function s : X R is an optimal solution to mins:X R Esele(s) iﬀ

z =x s (z)=s (x) max{r(x), r(z)}p(x)p(z)dz dx = 0 , and (32)

r(z)<r(x) s (z)>s (x) (r(x) r(z)) p(x)p(z)dz dx = 0 . (33)

Franc, Prusa and Voracek

Proof We ﬁrst present four equalities to be used later. We assume that s : X R is any measurable function. The validity of the equalities can be easily veriﬁed.

r(z)>r(x) s(z)<s(x)

p(z)dz dx = Z

r(x)<r(z) s(x)>s(z)

r(x)p(x)dx dz = Z

r(z)<r(x) s(z)>s(x)

r(z)p(z)dz dx ,

r(z)<r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx = 1

z =x s(z)=s(x)

max{r(x), r(z)}p(x)p(z)dz dx

z =x r(z)=r(x) s(z)=s(x)

max{r(x), r(z)}p(x)p(z)dz dx , (88)

r(z)=r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx = 1

r(x)p(x)p(z)dz dx 1

r(z)=r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx ,

r(z)=r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx = 2 Z

z>x r(z)=r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx + Z

z=x r(x)p(x)p(z)dz dx .

Since argmins:X R E(s) = argmins:X R (E(s) E(r)), it suﬃces to analyze minimizers of E(s) E(r) instead of E(s). Derive

E(s) E(r) = Z

r(x)p(x)p(z)dz dx Z

r(x)p(x)p(z)dz dx

r(z)<r(x) s(z) s(x)

r(x)p(x)p(z)dz dx Z

r(z) r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx

r(z)<r(x) s(z)>s(x)

r(x)p(x)p(z)dz dx Z

r(z)>r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx

r(z)<r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx Z

r(z)=r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx

= F1(s) + F2(s)

Optimal Strategies for Reject Option Classifiers

r(z)<r(x) s(z)>s(x)

r(x)p(x)p(z)dz dx Z

r(z)>r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx

r(z)<r(x) s(z)>s(x)

r(x)p(x)p(z)dz dx Z

r(z)<r(x) s(z)>s(x)

r(z)p(x)p(z)dz dx

r(z)<r(x) s(z)>s(x)

(r(x) r(z)) p(x)p(z)dz dx

r(z)<r(x) s(z)=s(x)

r(x)p(x)p(z)dz dx Z

r(z)=r(x) s(z)<s(x)

r(x)p(x)p(z)dz dx

(88,89,90) = 1 2

z =x s(z)=s(x)

max{r(x), r(z)}p(x)p(z)dz dx + 1

z=x r(x)p(x)p(z)dz dx

r(x)p(x)p(z)dz dx .

Observe that min s:X R F1(s) = 0,

min s:X R F2(s) = 1

z=x r(x)p(x)p(z)dz dx 1

r(x)p(x)p(z)dz dx ,

and both minima are attained by a scoring function s : X R if and only if conditions (32) and (33) hold for s . Moreover, note that the conditions can be fulﬁlled, e.g., by any s such that

( x, z X) (x = z s (x) = s (z) r(x) < r(z) s (x) < s (z)) .

B.4 Proof of Theorem 11

The expectation of ψsele reads

Eproxy(s)= n2 n

X p(x)r(x) Z

X p(z) log 1+exp(s(z) s(x)) dz dx+log(2)

X p(x)r(x)dx.

Franc, Prusa and Voracek

Theorem 11 Let s : X R be an optimal solution to mins: X R Eproxy(s). Then, the condition (x, x ) X X : r(x) < r(x ) s (x) < s (x )

is satisﬁed almost surely.

Proof For every a X, Eproxy(s) can be seen as a function of one variable s(a), where the others s(b), b X \ {a} are ﬁxed. Hence if s is a minimizer of Eproxy(s), then for every s(a), a X, the partial derivative w.r.t. s(a) must be zero, i.e.,

X p(x) r(x) Z

X p(z) log (1 + exp(s(z) s(x))) dz dx

= p(a)r(a) Z

X p(z) exp(s(z) s(a))

1 + exp(s(z) s(a))dz + p(a) Z

X p(x)r(x) exp(s(a) s(x)) 1 + exp(s(a) s(x))dx

r(a) p(x) + r(x) p(x) 1 + exp(s(a) s(x))dx p(a) Z

X p(x) r(x) dx

= f(r(a), s(a)) C .

It shows that f(r(a), s(a)) = C for any a X in order to guarantee that s is a minimizer of Eproxy(s). We prove by contradiction that the condition r(a) < r(b) s(a) < s(b) is satisﬁed up to a set {(a, b)|(a, b) X 2} of zero measure. Assume s is optimal and the condition is violated, i.e. r(a) < r(b) s(a) s(b) holds for a pair (a, b) X 2. Since s is optimal then f(r(a), s(a)) = C. Since r(b) > r(a) and s(b) s(a) then f(r(b), s(b)) > f(r(a), s(b)) because the function f(u, v) is strictly increasing in u and strictly decreasing in v. Combined, it implies that f(r(b), s(b)) > C which leads to a contradiction because an optimal s requires f(r(b), s(b)) = C.

P. L. Bartlett and M. H. Wegkamp. Classiﬁcation with a reject option using a hinge loss. Journal of Machine Learning Research, 9(59):1823 1840, 2008.

C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1 27:27, 2011. URL http: //www.csie.ntu.edu.tw/~cjlin/libsvm.

C. Chow. On optimum recognition error and reject tradeoﬀ. IEEE Transactions on Information Theory, 16(1):41 46, 1970.

W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pages 145 152, 2005.

C. Corbiere, N. Thome, A. Bar-Hen, M. Cord, and P. Perez. Addressing failure prediction by learning model conﬁdence. In Advances in Neural Information Processing Systems, volume 32, pages 2902 2913, 2019.

Optimal Strategies for Reject Option Classifiers

C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, volume 29, pages 1660 1668, 2016.

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of Conference on Computer Vision and Patter Recognition, volume 1, pages 886 893, 2005.

J. Demˇsar. Statistical comparisons of classiﬁers over multiple data sets. Journal of Machine Learning Research, 7(1):1 30, 2006.

D. Dua and E. Karra Taniskidou. UCI machine learning repository, 2017. URL http: //archive.ics.uci.edu/ml.

R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classiﬁcation. Journal of Machine Learning Research, 11(53):1605 1641, 2010.

L. Fischer, B. Hammer, and H. Wersing. Optimal local rejection for classiﬁers. Neurocomputing, 214:445 457, 2016.

L. Fisher, B. Hammer, and H. Wersing. Eﬃcient rejection strategies for prototype-based classiﬁcation. Neurocomputing, 169:334 342, 2015.

V. Franc and D. Prusa. On discriminative learning of prediction uncertainty. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1963 1971, 2019.

G. Fumera and F. Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines, Lecture Notes in Computer Science, volume 2388. Springer, 2002.

G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classiﬁcation reliability. In Advances in Pattern Recognition, pages 863 871, 2000.

Y. Geifman and R. El-Yaniv. Selective classiﬁcation for deep neural networks. In Advances in Neural Information Processing Systems 30, pages 4878 4887, 2017.

Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, volume 21, pages 537 544, 2008.

B. Hanczar and E. R. Dougherty. Classiﬁcation with reject option in gene expression data. Bioinformatics, 24:1889 1895, 2008.

T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2009.

J. Havil. Gamma: Exploring Euler s Constant. Princeton University Press, 2003.

R. Herbei and M. H. Wegkamp. Classiﬁcation with reject option. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 34(4):709 721, 2006.

Franc, Prusa and Voracek

H. Jiang, B. Kim, M. Y. Guan, and M. Gupta. To trust or not to trust a classiﬁer. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, page 5546 5557, 2018.

V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1867 1874, 2014.

D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755 1758, 2009.

J. Kummert, B. Paassen, J. Jensen, C. G opfert, and B. Hammer. Local reject option for deterministic multi-class SVM. In Artiﬁcial Neural Networks and Machine Learning ICANN, Lecture Notes in Computer Science, volume 9887. Springer, 2016.

B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, pages 6402 6413, 2017.

Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jakel. Handwritten digit recognition with a back-propagation networks. In Advances in Neural Information Processing Systems, volume 2, pages 396 404, 1990.

J. Lei. Classiﬁcation with conﬁdence. Biometrika, 101:755 769, 2014.

T. Pietraszek. Optimizing abstaining classiﬁers using ROC analysis. In Proceedings of the 22nd International Conference on Machine Learning, page 665 672, 2005.

C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-thewild challenge: database and results. Image and Vision Computing, 47:3 18, 2016.

C. M. Santos-Pereira and A. M. Pires. On optimal reject rules and roc curves. Pattern Recognition Letters, 26(7):943 952, 2005.

M. I. Schlesinger and V. Hlav aˇc. Ten lectures on statistical and structural pattern recognition. Kluwer Academic Publishers, 2002.

J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.

E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009.

C. H. Teo, S. V. N. Vishwanthan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(10):311 365, 2010.

F. Tortorella. An optimal reject rule for binary classiﬁers. In Advances in Pattern Recognition, Lecture Notes in Computer Science, volume 1876. Springer, 2000.

V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

Optimal Strategies for Reject Option Classifiers

T. Villman, M. Kaden, A. Bohnsack, J. M. Villman, T. Drogies, S. Saralajew, and B. Hammer. Self-adjusting reject options in prototype based classiﬁcation. In Advances in Intelligent Systems and Computing, volume 428. Springer, 2016.

M. Yuan and M. Wegkamp. Classiﬁcation methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(5):111 130, 2010.

H. Zaragoza and F. d Alche Buc. Conﬁdence measures for neural network classiﬁers. In 7th Conference on Information Processing and Management of Uncertainty in Knowledge Based Systems, 1998.