# optimal_strategies_for_reject_option_classifiers__6dc8794e.pdf Journal of Machine Learning Research 24 (2023) 1-49 Submitted 1/21; Revised 12/21; Published 1/23 Optimal Strategies for Reject Option Classifiers Vojtech Franc xfrancv@fel.cvut.cz Daniel Prusa prusa@fel.cvut.cz Vaclav Voracek vaclav.voracek@uni-tuebingen.de Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University in Prague, Czech Republic Editor: Francis Bach In classification with a reject option, the classifier is allowed in uncertain cases to abstain from prediction. The classical cost-based model of a reject option classifier requires the rejection cost to be defined explicitly. The alternative bounded-improvement model and the bounded-abstention model avoid the notion of the reject cost. The bounded-improvement model seeks a classifier with a guaranteed selective risk and maximal cover. The boundedabstention model seeks a classifier with guaranteed cover and minimal selective risk. We prove that despite their different formulations the three rejection models lead to the same prediction strategy: the Bayes classifier endowed with a randomized Bayes selection function. We define the notion of a proper uncertainty score as a scalar summary of the prediction uncertainty sufficient to construct the randomized Bayes selection function. We propose two algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classifier. We prove that both algorithms provide Fisher consistent estimates of the proper uncertainty score and demonstrate their efficiency in different prediction problems, including classification, ordinal regression, and structured output classification. Keywords: Reject option classification, prediction uncertainty, selective classifiers 1. Introduction In safety critical applications of classification models, prediction errors may lead to serious losses. In such cases, estimating when the model makes an error can be as important as its average performance. These two objectives are taken into account in classification with a reject option when the classifier is allowed in uncertain cases to abstain from prediction. The cost-based model of a classification strategy with the reject option was proposed by Chow in his pioneering work Chow (1970). The goal is to minimize the expected loss equal to the cost of misclassification, when the classifier predicts, and to the reject cost, when the classifier abstains from prediction. An optimal strategy leads to the Bayes classifier abstaining from prediction when the conditional expected risk exceeds the reject cost. The known form of the optimal strategy allows to construct the classifier by plugging in an estimate of the class posterior probabilities to the formula for the conditional risk. Besides the plug-in rule, the reject option classifiers can be learned by empirical risk minimization based approaches like e.g. modifications of Support Vector Machines (Grandvalet et al., 2008), Boosting (Cortes et al., 2016), or Prototype-based classifiers (Villman et al., 2016) to name a few. 2023 Vojtech Franc and Daniel Prusa and Vaclav Voracek. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0048.html. Franc, Prusa and Voracek The cost-based model requires the reject cost to be defined explicitly, which is difficult in some applications, e.g., when the misclassification costs have different physical units than the reject cost. Alternative bounded-improvement and bounded-abstention models coined in Pietraszek (2005) avoid explicit definition of the reject cost. The rejection strategy is evaluated by two antagonistic quantities: i) a selective risk defined as the expected misclassification cost of accepted predictions and ii) a coverage which corresponds to the probability that the prediction is accepted. The optimal strategy for the bounded-improvement model maximizes the coverage under the condition that the selective risk does not exceed a target value. The optimal strategy of the bounded-abstention model is defined symmetrically as the one which minimizes the selective risk under the condition that the coverage does not drop below a target value. In contrast to the cost-based model, it is unknown what are the optimal prediction strategies of the two models when the underlying statistical model is given. A solution has been proposed only for special instances of the tasks. Pietraszek (2005) proposed a method for constructing a specific type of reject option strategy for a pretrained two-class classifier using the ROC analysis. El-Yaniv and Wiener (2010) proposed an algorithm learning the optimal strategy in the noise-free setting, i.e., when a perfect strategy with zero selective risk exists. Geifman and El-Yaniv (2017) shows how to equip a trained classifier with a reject option provided an uncertainty measure is known and the task is to find only a rejection threshold optimal under the bounded-improvement model. A large number of other works address the problem of uncertainty prediction, including recent papers related to deep learning like e.g. Lakshminarayanan et al. (2017); Jiang et al. (2018); Corbiere et al. (2019). They seek for an uncertainty score 1 defined informally as a real-valued summary of an input observation that is predictive of the classification error. These works do not formulate the problem to be solved explicitly as a rejection model. However, most of these works assess the performance of their uncertainty scores using evaluation metrics for the rejection models, namely, using the Risk-Coverage (RC) curve and the Area under the RC curve (Au RC). This article unifies and extends existing formulations of an optimal reject option classifier and proposes theoretically grounded algorithms to learn the rejection strategy for an arbitrary black-box classifier from examples. The main contributions are as follows: 1. We derive necessary and sufficient conditions for an optimal prediction strategy of the bounded-improvement model when the underlying distribution is known. We show that an optimal solution is the Bayes classifier endowed with a rejection strategy, which we call randomized Bayes selection function. The randomized Bayes selection function is constructed from the conditional expected risk and two parameters: a decision threshold and an acceptance probability. The strategy rejects prediction when the conditional risk is above the threshold, accepts prediction when it is below the threshold and randomizes with the acceptance probability otherwise. We provide an explicit relation between the decision threshold, the acceptance probability and the target risk. 2. We derive necessary and sufficient conditions for an optimal prediction strategy of the bounded-abstention model when the underlying distribution is known. We show that 1. Some works use term confidence score which is inverse to the uncertainty score utilized in this paper. Optimal Strategies for Reject Option Classifiers the conditions are satisfied by the Bayes classifier endowed with the randomized Bayes selection function. We provide an explicit relation between the decision threshold, the acceptance probability and the target coverage. 3. We define a notion of a proper uncertainty score as a function which preserves ordering of the inputs induced by the conditional expected risk. A proper uncertainty score is sufficient for construction of the randomized Bayes selection function. We propose two generic algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classifier. The first is based on regression of the classifier loss. The second is based on minimization of a newly proposed loss function which we call SElective classifier LEarning (SELE) loss. We show that SELE loss is a close approximation of the Au RC and, at the same time, amenable to optimization. We prove that both proposed algorithms provide Fisher consistent estimate of the proper uncertainty score. As a proof of concept, we apply the proposed algorithms to learn proper uncertainty scores for different prediction problems including classification, ordinal regression and structured output classification. We demonstrate that the algorithm based on the SELE loss minimization learns uncertainty scores which consistently outperform common baselines and work on par with the state-of-the-art methods that are, unlike our algorithm, applicable only to particular prediction models. Besides the algorithms proposed that are applicable to learning uncertainty score for an arbitrary classification model, our contributions may have the following uses. Firstly, our analysis shows that despite their different objectives, the cost-based, the boundedimprovement and the bounded-abstention rejection models are equivalent in the sense that they lead to the same prediction strategy. Secondly, the explicit characterization of optimal strategies provides a recipe to construct plug-in rules, which has been so far possible only for the cost-based model. That is, any method estimating the class posterior distribution can be turned into an algorithm learning the reject option classifier that solves the boundedimprovement and the bounded-abstention model, respectively. Thirdly, there is a tight connection between the proposed bounded-abstention model and the RC curve. The RC curve represents the quality of all solutions of the bounded-abstention model that can be constructed from a pair of classifier and uncertainty score. The Au RC is then an expected quality of the reject option classifier constructed from the pair when the target coverage is selected uniformly at random. This connection sheds light on many published methods which do not explicitly define the target objective but use the RC curve and the Au RC as evaluation metrics. This article is an extension of our previous work published in Franc and Prusa (2019). The major extensions involve deriving optimality conditions for the bounded-abstention model, regression-based algorithm for learning the proper uncertainty score, analysis of the learning algorithms including the proof of Fisher consistency, describing the connection between the Au RC and the bounded-abstention model, and most of the experiments. The paper is organized as follows. Section 2 introduces the three rejection models and provides a characterization of their optimal solutions. Algorithms to learn a proper uncertainty score from examples are discussed in Section 3. Survey of related literature is given in Section 4. Experimental evaluation of the proposed learning algorithms is provided in Section 5. Section 6 concludes the paper. Proofs of all theorems are deferred to Appendix. Franc, Prusa and Voracek 2. Reject Option Models and Their Optimal Strategies Let X be a set of input observations and Y a finite set of labels. Let us assume that the inputs and labels are generated by a random process with p.d.f. p(x, y) defined over X Y. The goal in the non-reject setting is to find a classifier h: X Y with a small expected risk y Y p(x, y)ℓ(y, h(x)) dx , where ℓ: Y Y R+ is a loss penalizing the predictions. The expected risk can be reduced by abstaining from prediction in uncertain cases. To this end, we use a selective classifier 2 (h, c) composed of a classifier h: X Y and a selection function c: X [0, 1]. When applying the selective classifier to input x X it outputs (h, c)(x) = h(x) with probability c(x) , reject with probability 1 c(x) . In the sequel, we introduce three models of an optimal selective classifier: the cost-based, the bounded-improvement and the bounded-abstention model. We characterize the optimal strategies of the three models provided the underlying distribution p(x, y) is known. 2.1 Cost-based model Besides the label loss ℓ: Y Y R+, let us define a reject loss ε R+ incurred when a classifier rejects to predict. The selective classifier (h, c) is then evaluated in terms of the expected risk RB(h, c)= Z y Y p(x, y) ℓ(y, h(x))c(x) + (1 c(x))ε dx . (1) Problem 1 (Cost-based model) The optimal selective classifier (h B, c B) is a solution to the minimization problem min h,c RB(h, c) , (2) where we assume that both minimizers exist. The well-known optimal strategy (h B, c B) solving Problem 1 reads h B(x) argmin ˆy Y y Y p(y | x) ℓ(y, ˆy) , (3) 1 if r (x) < ε , τ if r (x) = ε , 0 if r (x) > ε , 2. The classifier with a reject option is usually represented by a single function h : X Y {reject}. We use the decomposition h (x) = (h, c)(x), and the terminology selective classifier from El-Yaniv and Wiener (2010) because we analyze h and c separately. Optimal Strategies for Reject Option Classifiers where r (x) = min ˆy Y y Y p(y | x) ℓ(y, ˆy) is the minimal conditional expected risk associated with input x, and τ is any real number from the interval [0, 1]. In the boundary case when r (x) = ε one can arbitrarily reject or return the best label h B(x) without affecting the value of the risk RB(h, c). In turn, there always exists a deterministic optimal strategy solving the cost-based model. In the sequel, we denote the classifier h B alone as Bayes classifier. The cost-based model was coined in Chow (1970) who also provides an optimal strategy in the case p(x, y) is known. Chow (1970) considers only the 0/1-loss ℓ(y, y ) = [[y = y ]] 3. A generalization for an arbitrary loss and the optimal strategy of the generalized formulation can be found e.g. in Schlesinger and Hlav aˇc (2002); Santos-Pereira and Pires (2005). 2.2 Bounded-improvement model One can characterize the selective classifier by two antagonistic quantities: i) the coverage X p(x) c(x) dx (5) corresponding to the probability that the prediction is accepted and ii) the selective risk y Y p(x, y) ℓ(y, h(x)) c(x) dx defined for non-zero φ(c) as the expected classification loss on the accepted predictions. Problem 2 (Bounded-improvement model) Given a target risk λ > 0, the optimal selective classifier (h I, c I) is a solution to the problem max h,c φ(c) s.t. RS(h, c) λ , (7) where we assume that both maximizers exist. Theorem 1 Let (h, c) be an optimal solution to (7). Then, (h B, c), where h B is the Bayes classifier (3), is also optimal to (7). According to Theorem 1 the Bayes classifier h B is also optimal for the task (7) defining the bounded-improvement model which is not surprising. Note however that the Bayes classifier is not a unique solution to (7) because the predictions on the reject region Xc(x)=0 4 do not count to the selective risk and hence they can be arbitrary. Theorem 1 allows to solve the bounded-improvement task (7) in two consecutive steps: First, set h I to be the Bayes classifier h B. Second, when h I is fixed, the optimal selection function c I is obtained by solving the task (7) only with respect to c which boils down to 3. [[A]] is the Iverson bracket which evaluates to 1 if A is true and it is 0 otherwise. 4. For a function f : X R and a R { }, we define Xf(x) a ={x X |f(x) a}, Xf(x)a = {x X |f(x) > a}, Xf(x) a= {x X |f(x) a}. Franc, Prusa and Voracek Problem 3 (Bounded-improvement model for known h(x)) Given a classifier h(x), the optimal selection function c (x) is a solution to max c [0,1]X φ(c) s.t. RS(h, c) λ . (8) Note that Problem 3 is meaningful even if h is not the Bayes classifier h B. In practice, we can seek for an optimal selection function c for any fixed h which is usually our best approximation of h B learned from data. We will show that the key concept to characterize the optimal selection function of Problem 3 is the conditional expected risk of h 5 defined as r(x) = X y Y p(y | x) ℓ(y, h(x)) . (9) Theorem 2 A selection function c : X [0, 1] is an optimal solution to Problem 3 if and only if it holds Z Xr(x) 0 , R Xr(x)=0 p(x)dx if b = 0 , (11) Xr(x)>b p(x)c (x)dx = 0 , (12) where r(x) = r(x) λ measures how much the conditional risk r(x) of the classifier h(x) exceeds the target λ, X p(x)r(x) dx (13) is the expectation of r(x) restricted to inputs in X , and b = sup {a | ρ(Xr(x) a) 0} 0 . (14) Theorem 2 defines behaviour of an optimal selection function c (x) on a partition of the input space X into three regions Xr(x)b. Informally, b can be interpreted as a threshold characterizing the largest prefix X X of inputs in X sorted by the conditional risk r(x) for which the expectation of r(x) does not exceed the target risk λ. In each region, the expected value of c (x) is constrained to a particular constant, the value of which depends on parameters of the problem. A particular selection function satisfying the optimality condition is given by the following theorem. Theorem 3 Let r: X R be the conditional risk (9) of a classifier h: X Y, γ = b + λ the rejection threshold given by the target risk λ and a constant b computed by (14). Then the selection function 1 if r(x) < γ , τ if r(x) = γ , 0 if r(x) > γ , (15) 5. Note that the conditional risk r is a function of the classifier h. Since the classifier h is effectively fixed throughout the paper, the dependency is not expressed explicitly to keep the notation simple. Optimal Strategies for Reject Option Classifiers where τ is the acceptance probability given by ( 1 if ρ(Xr(x)=γ) = 0 , ρ(Xr(x)<γ)) ρ(Xr(x)=γ)) if ρ(Xr(x)=γ) > 0 , (16) satisfies the optimality condition of Theorem 2, and hence it is a solution to Problem 3. The selection function (15) is defined by the conditional risk r(x), the decision threshold γ and the acceptance probability τ. The prediction is always accepted when r(x) < γ and always rejected when r(x) > γ. In the boundary case, when r(x) = γ, the strategy randomizes and the prediction is accepted with probability τ. The decision threshold is given by γ = b + λ where λ is the target risk in the definition of Problem 3 and b is given by (14). Solving (14) is hard and it requires knowledge of p(x, y). When the probability mass of the set of boundary cases Xr(x)=γ is zero, which usually happens in case of continuous p(x), the acceptance probability is τ = 1 and the boundary cases are always accepted, i.e. no randomization is needed. The bounded-improvement model was coined in Pietraszek (2005) for two-classes and a specific class of rejection strategies. A generalized formulation which imposes no restriction on the number of classes and the class of decision strategies appeared e.g. in Geifman and El-Yaniv (2017). The necessary and sufficient conditions for an optimal selective classifier were provided in Franc and Prusa (2019). 2.3 Bounded-abstention model In this section, we introduce bounded-abstention model, the definition of which is symmetric to the definition of the bounded-improvement model. Problem 4 (Bounded-abstention model) Given a target coverage ω > 0, the optimal selective classifier (h A, c A) is a solution to the problem min h,c RS(h, c) s.t. φ(c) ω , (17) where we assume that both minimizers exist. Theorem 4 Let (h, c) be an optimal solution to (17). Then, (h B, c), where h B is the optimal Bayes classifier (3), is also optimal to (17). Theorem 4 ensures that the Bayes classifier h B is an optimal solution to (17) defining the bounded-abstention model. Note that the solution is not unique as the predictions on Xc(x)=0 do not count to the selective risk, hence they can be arbitrary. After fixing the classifier h = h B the search for an optimal selection function leads to: Problem 5 (Bounded-abstention model for known h(x)) Given a classifier h(x) and a target coverage 0 < ω 1, the optimal selection function c (x) is a solution to the problem min c [0,1]X RS(h, c) s.t. φ(c) ω , (18) where we assume that the minimizer exists. Franc, Prusa and Voracek Theorem 5 A selection function c : X [0, 1] is an optimal solution to Problem 5 if and only if it holds Xr(x)<β p(x)c (x)dx = Z Xr(x)<β p(x)dx, (19) Xr(x)=β p(x)c (x)dx = ω Z Xr(x)<β p(x)dx, (20) Xr(x)>β p(x)c (x)dx = 0 , (21) Xr(x) 0 be a target coverage and β be the constant computed by (22). Then the selection function 1 if r(x) < β , κ if r(x) = β , 0 if r(x) > β , (23) where κ is the acceptance probability given by Xr(x)=β p(x)dx = 0 , Xr(x)<β p(x)dx R Xr(x)=β p(x)dx otherwise , (24) satisfies the optimality condition of Theorem 6, and hence it is a solution of Problem 5. The selection function (23) is determined by the conditional risk r(x), the decision threshold β and the acceptance probability κ. Both computations of the decision threshold β, defined by (22), and the acceptance probability κ, defined by (24), involve integration of p(x). When the probability mass of the set of boundary cases Xr(x)=β is zero, the acceptance probability is κ = 0 and the boundary cases are always rejected without any randomization. The bounded-abstention model was coined in Pietraszek (2005) for two-classes and a specific class of rejection strategies. In this article, we provide a generalized formulation with no restriction on the number of classes and the class of decision strategies, and we characterize the necessary and sufficient conditions for an optimal selective classifier. Optimal Strategies for Reject Option Classifiers Model Parameter Definition Cost-based reject cost ε min h,c RB(h, c) Bounded-improvement target risk λ max h,c φ(c) s.t. RS(h, c) λ Bounded-abstention target coverage ω min h,c RS(h, c) s.t. φ(c) ω Table 1: Summary of three rejection models analyzed in the article. Each model has a single parameter shown in the middle column. The definition of the optimal selective classifier is in the right column. The cost-based model defines the optimal selective classifier using the expected risk RB(h, c) given by (1). The boundedimprovement and the bounded-abstention model are defined in terms of the selective risk RS(h, c), equation (6), and the coverage φ(c), equation (5). Regardless the definition, the optimal solution is composed of the Bayes classifier h B given by (3), and an instance of the randomized Bayes selection function c R given by (25). 2.4 Summary We have shown that the three rejection models summarized in Table 1, namely, the costbased model (Problem 1), the bounded-improvement model (Problem 2) and the boundedabstention model (Problem 4), share the same class of optimal prediction strategies. An optimal selective classifier (h, c) can be always constructed from the Bayes classifier h = h B given by (3) and an optimal selection function 1 if r(x) < α , ν if r(x) = α , 0 if r(x) > α , (25) where r(x) is the conditional expected risk of h given by (9), α R is a decision threshold and ν [0, 1] is an acceptance probability. We denote c R defined by (25) as the randomized Bayes selection function. The randomized Bayes selection function c R is also an optimal solution of the rejection models defined for an arbitrary (i.e. not necessarily the Bayes) classifier h, that is, an optimal solution of Problem 3 and Problem 5. The constants (ν, α) are defined for each rejection model differently and their value depends on parameters of the model (i.e. reject cost ε, target risk λ or target coverage ω), and the underlying distribution p(x, y). For example, in the case of the cost-based model, the acceptance threshold α equals to the reject cost ε and the acceptance probability ν can be arbitrary. In the case of the bounded-improvement and the bounded-abstention model, the constants (ν, α) are defined implicitly via optimization problems and integral equations. In practice, (ν, α) can be tuned on data. For example, Geifman and El-Yaniv (2017) show how to find α from a finite sample such that it is optimal for the bounded-improvement model in PAC sense. Franc, Prusa and Voracek The key component of the randomized Bayes selection function c R is ranking of the inputs X according to r(x). We introduce notion of a proper uncertainty score which is less informative than the conditional risk r(x), yet it is sufficient to construct c R. Definition 1 Let h: X Y be a classifier and r(x) = P y Y p(y | x) ℓ(y, h(x)) its conditional expected risk. We say that function s: X R is a proper uncertainty score of h iff (x, x ) X X : r(x) < r(x ) s(x) < s(x ). By definition, the proper uncertainty score s(x) preserves ordering of the inputs X induced by the conditional risk r(x). Therefore, replacing r(x) by s(x) in function (25), and changing the decision threshold α appropriately, leads to the same optimal selection function. 3. Learning uncertainty function Assume that we want to construct a selective classifier (h, c) solving any of the three rejection models described in Section 2. We have shown that regardless the rejection model, an optimal h is the Bayes classifier h B given by (3) and an optimal c is the randomized Bayes selection function c R given by (25). In this section we consider the scenario where h: X Y is known, e.g., it has been trained from examples, and we want to endow it with c R. The key component of c R is a proper uncertainty score s: X R that satisfies Definition 1. In this section we address the problem of learning a proper uncertainty score from examples Tn = {(xi, yi) X Y | i = 1, . . . , n} assumed to be generated from n i.i.d. random variables with distribution p(x, y). Before describing the algorithms, in Section 3.1 we introduce the notion of Risk-Coverage (RC) curve and Area under Risk-Coverage curve (Au RC). In line with the literature, we use the Au RC as a metric to evaluate performance of the learned uncertainty scores. We also show a connection between the RC curve, the Au RC and the bounded-abstention model. In Section 3.2 we describe a plug-in conditional risk rule and point out that a frequently used Maximum Class Probability rule (MCP) is its special instance (Chow, 1970; Herbei and Wegkamp, 2006). In Section 3.3 we outline a learning approach based on a loss regression. In Section 3.4 we introduce a proxy of Au RC which we call a loss for SElective classifier LEarning (SELE). We prove that both proposed methods learn the Fisher consistent estimator of the proper uncertainty score. The SELE loss and Theorem 9 proving its Fisher consistency were published in Franc and Prusa (2019). All other results presented in this section are novel. 3.1 Area under Risk Coverage curve The majority of existing methods that learn selective classifiers output a classifier h: X Y and a deterministic selection function c: X [0, 1] defined as 6 c(x) = [[s(x) θ]] , (26) where s: X R is an uncertainty score and θ R is a decision threshold. The performance of the pair (h, s) is evaluated by the RC curve obtained after computing the empirical 6. Note that the deterministic (26) and the randomized Bayes selection function c R coincide if the acceptance probability is ν = 1, which is a usual case when p(x) is continuous. Optimal Strategies for Reject Option Classifiers selective risk and the coverage for all settings of the threshold θ. That is, the computation is as follows. Let us order the examples Tn = {(xi, yi) X Y | i = 1, . . . , n} according to s(x) so that s(xπ(1)) s(xπ(2)) s(xπ(n)), where π: {1, . . . , n} {1, . . . , n} is a permutation defining the order7. Let L(i, s) = Pi j=1 ℓ(yπj, h(xπj)) be a sum of losses incurred by the classifier h(x) on the examples with uncertainty not higher than the i-th highest uncertainty on the examples Tn. The Risk-Coverage curve C = {(1 i L(i, s), i n) | i = 1, . . . , n} is a set of two-dimensional points, where the pair (1 i L(i, s), i n) corresponds to the empirical estimate of the selective risk RS(h, c) and the coverage φ(c) of a selective classifier (h, c) with the deterministic selective function (26) and the decision threshold θ = s(xπi). The area under the RC curve C is then Au RC(s, Tn) = 1 i L(i, s) = 1 j=1 ℓ(yπj, h(xπj)) . (27) The value of Au RC(s, Tn) can be interpreted as an arithmetic mean of the empirical selective risks corresponding to the coverage spread evenly over the interval [0, 1] with step 1 n. There is a tight connection between the RC curve, Au RC and the bounded-abstention model. The RC curve C represents the quality of all admissible solutions of the boundedabstention model that can be constructed from the pair (h, s) when using the sample Tn for evaluation. The value of Au RC(s, Tn) is an estimate of the expected quality of the selective classifier constructed from the pair (h, s) when the target coverage is selected uniformly at random. 3.2 Plug-in conditional risk rule Prediction models, like, e.g., Logistic Regression or Neural Networks learned by crossentropy loss, use the training set Tn to learn an estimate ˆp(y | x) of the class posterior distribution p(y | x). The estimate is then used to construct a plug-in Bayes classifier ˆh(x) argminˆy Y P y Y ˆp(y | x)ℓ(y, ˆy). Similarly, using ˆp(y | x) instead of p(y | x) in (9) gives the plug-in rule for the conditional risk of the classifier h defined as y Y ˆp(y | x) ℓ(y, h(x)) . Provided p(y | x) = ˆp(y | x), x X, y Y, the plug-in conditional risk ˆr(x) is by definition a proper uncertainty score and it can be used to construct the randomized Bayes selection function c R which is an optimal rejection strategy for all three rejection models. Example 1 (Maximum Class Probability rule) In case of 0/1-loss ℓ(y, y ) = [[y = y ]] the plug-in Bayes classifier decides based on the maximum posterior probability ˆh(x) argmaxy Y ˆp(y | x) and the plug-in conditional risk rule is y Y ˆp(y | x) ℓ(y, ˆh(x)) = 1 max y Y ˆp(y | x) . 7. To break ties we use the index of the input in case the scores are the same. Franc, Prusa and Voracek 3.3 Loss regression A straightforward approach to learn the uncertainty score is to pose it as a regression problem. The regression function gets an input x X and outputs an estimate of the classification loss ℓ(y, h(x)). Formally, given a hypothesis space F {s: X R}, classifier h(x) and training set Tn, the loss regression score s: X R is a solution to mins F Freg(s) where Freg(s) = 1 ℓ(yi, h(xi)) s(xi) 2 . It is easy to show that the loss regression score is Fisher consistent estimate of the proper uncertainty score. This amounts to defining the expectation Freg(x) with respect to i.i.d. generated training set Tn, i.e., Ereg(s) = ETn p(x,y)Freg(s) i=1 p(xi, yi) 1 ℓ(yi, h(xi)) s(xi) 2 dx1 dxn (28) y Y p(x, y) ℓ(y, h(x)) s(x) 2 dx , and showing that its minimizer is the conditional risk r(x) which is by definition a proper uncertainty score. This is ensured by the following theorem. Theorem 7 The conditional risk r(x) defined by (9) is an optimal solution to min s:X R Ereg(s). 3.4 Minimization of SELE loss In this section, we define a computationally manageable proxy of Au RC, which we call SElective classifier LEarning (SELE) loss. The SELE loss sele : Rn X n Yn R+ is defined as 8 sele(s, Tn) = 1 j=1 ℓ(yi, h(xi))[[s(xi) s(xj)]] . (29) In contrast to Au RC the computation of sele does not require sorting the examples, i.e., we eliminate the permutations that make the evaluation difficult. sele is still hard to optimize directly due to the step function in its definition. After replacing the step function [[ ]] by a logistic function, we obtain its proxy ψsele : Rn X n Yn R+ defined as ψsele(s, Tn) = 1 j=1 ℓ(yi, h(xi)) log 1 + exp(s(xj) s(xi) . (30) The function ψsele(s, Tn) is smooth and convex w.r.t. the argument s and therefore is amenable to optimization. Minimization of ψsele is the core of the proposed learning algorithm which works as follows. 8. We assume that the training set Tn has at least two examples, i.e. n 2. Optimal Strategies for Reject Option Classifiers Given a hypothesis space F {s: X R}, a classifier h(x) and training set Tn, the SELE score s: X R is a solution to the problem mins F ψsele(s, Tn). We justify the proposed algorithm empirically in Section 5. The theoretical justification is based on the following three arguments: 1. The value of sele is a close approximation to Au RC. In case the values of s(x) are different for each input in Tn, i.e. s(xi) = s(xj), i = j, then we have sele(s, Tn) Au RC(s, Tn) 2 sele(s, Tn) . The first inequality follows from sele(s, Tn) = 1 1 n ˆL(i, s) 1 i ˆL(i, s) = Au RC(s, Tn) . The second inequality is ensured by Theorem 8. 2. The uncertainty score estimator defined as ˆs argmins [0,1]X sele(s, Tn) is Fisher consistent. Namely, Theorem 9 ensures that a population minimizer of sele is a proper uncertainty score. 3. The Fisher consistency is preserved even for the smooth proxy ψsele the minimization of which is in the core of the proposed algorithm. Namely, Theorem 11 ensures the population minimizer of ψsele is a proper uncertainty score. Theorem 8 The inequality Au RC(s, Tn) < 2 sele(s, Tn) holds true for any s: X R and Tn = {(xi, yi) X Y | i = 1, . . . , n}. Moreover, for any ε > 0, there are n, s and Tn such that Au RC(s, Tn) (2 ε) sele(s, Tn). To show the Fisher consistency of sele we define its expectation with respect to i.i.d. generated examples Tn, i.e., Esele(s) = Z i=1 p(xi, yi) sele(s, Tn)dx1 dxn j=1 r(xi) [[s(xi) s(xj)]] dx1 dxn X p(xi)p(xj)r(xi)[[s(xi) s(xj)]]dxi dxj X p(xi)p(xi)r(xi)[[s(xi) s(xi)]]dxi dxi X p(x) r(x) Z X p(z) [[s(x) s(z)]]dz dx + 1 X p2(x)r(x)dx dx . Minimizers of Esele are characterized by the following theorem 9. X R z =x s (z)=s (x) f(x,z)dz dx stands for R X f(x,z)dz dx where X = {z X |z =x s (z)=s (x)}, etc. Franc, Prusa and Voracek Theorem 9 A function s : X R is an optimal solution to mins:X R Esele(s) iff z =x s (z)=s (x) max{r(x), r(z)}p(x)p(z)dz dx = 0 , and (32) r(z)s (x) (r(x) r(z)) p(x)p(z)dz dx = 0 . (33) The conditions (32) and (33) imply that the conditional expectations Ex,z p(x)[max{r(x), r(z)} | z = x s (x) = s (z)] and Ex,z p(x)[r(x) r(z) | r(z) < r(x) s (z) > s (x)] are both zero. If combined, it further implies that a subset of the input space X = {(x, z) X X | r(z) < r(x) s (z) > s (x)}, on which the order is violated, has probability measure zero. In other words, the optimal s (x) preserves the ordering induced by r(x) almost surely 10. Corollary 10 Any function s : X R fulfilling (x, x ) X X : x = x s(x) = s(x ), and (34) (x, x ) X X : r(x) < r(x ) s(x) < s(x ) (35) satisfies the optimality conditions of Theorem 9. Note that (34) requires the minimizer of sele to assign a unique value to each input x X which is not necessary for the score to be proper. Hence, the minimizers of sele form a subset of all proper uncertainty scores. To show the Fisher consistency of the smooth proxy ψsele we define its expectation with respect to i.i.d. generated examples Tn, i.e., Eproxy(s) = Z i=1 p(xi, yi) ψsele(s, Tn) dx1 dxn X p(x) r(x) Z X p(z) log 1 + exp(s(z) s(x)) dz dx X p(x) r(x)dx . (36) We omitted the derivation as it is similar to (31). The key property of the minimizers of Eproxy is stated in the following theorem. Theorem 11 Let s : X R be an optimal solution to mins: X R Eproxy(s). Then, the condition (x, x ) X X : r(x) < r(x ) s (x) < s (x ) is satisfied almost surely. 10. This means that the condition can be violated at most on a subset of X with probability measure zero. Optimal Strategies for Reject Option Classifiers 4. Related Works The cost-based rejection model was proposed in Chow (1970) who also provides the optimal strategy in case the distribution p(x, y) is known, analyzes the error-reject trade-off, and proves the basic properties of the error-rate and the reject-rate, e.g., that both functions are monotone with respect to the reject cost. The original paper considers the risk with 0/1-loss only. The model with arbitrary classification costs was analyzed, e.g., in Tortorella (2000); Santos-Pereira and Pires (2005); Schlesinger and Hlav aˇc (2002). The bounded-improvement model and the bounded-abstention model were coined in Pietraszek (2005). He proposes a method to construct a specific class of rejection strategies composed of two classifiers described by a single ROC curve. The task is to find the optimal decision thresholds of the two classifiers, which is done numerically based on ROC analysis. The original formulation assumes two classes and the particular form of the rejection strategies. In this article, we consider a generalization of the bounded-improvement model, see Problem 2, and the bounded-abstention model, see Problem 4, which allow for an arbitrary number of classes (including structured output prediction) and do not impose any constraint on the rejection strategy. We have proved that necessary and sufficient conditions on an optimal strategy for both models in the case where p(x, y) is known. We showed that in both cases, a particular optimal solution is composed of the Bayes classifier and the randomized Bayes selection function. There exist other formulations of rejection models for two-class classifiers. For example, in Hanczar and Dougherty (2008) the objective is to maximize coverage under the constraints that each class has an error rate below a specific threshold. Hence, it can be seen as a generalization of the bounded-improvement model. The objective of a rejection model proposed in Lei (2014) is to maximize the total coverage under the constraint that each class has coverage above a specific threshold. None of the two articles analyzes optimal strategies of the corresponding models. A common approach to construct selective classifiers for the cost-based model is based on the plug-in rule, which involves learning the class posterior distribution from examples and plugging the distribution into the formula defining the Bayes-optimal strategy (3). In the case of 0/1-loss the plug-in rule rejects based on the maximal class posterior which is denoted as Maximal Class Probability (MCP) rule (see Example 1). The MCP rule is probably the most frequently used uncertainty score in the literature. The statistical consistency of the plug-in rejection rule is discussed in Herbei and Wegkamp (2006). Fumera et al. (2000) investigate how errors in estimation of the posterior distribution affect effectiveness of the plug-in rule, and they also try to improve its performance by using class-specific thresholds. Other methods to improve the plug-in rule by tuning multiple thresholds were proposed in Kummert et al. (2016); Fischer et al. (2016). In our work, we have derived the optimal strategies for the bounded-improvement model and the bounded-abstention model. Our results thus provide a recipe to construct the plug-in rules also for these two rejection models. There exist many modifications of standard prediction models to learn reject option classifiers for the cost-based model. For example, extensions of the Support Vector Machines to learn a reject option classifier have been extensively studied in Grandvalet et al. (2008); Bartlett and Wegkamp (2008); Yuan and Wegkamp (2010). These works are limited to Franc, Prusa and Voracek two-class problems and 0/1-loss. Learning leads to minimization of a convex surrogate of the cost-based model s objective function. Under some conditions, the algorithms are statistically consistent. A boosting algorithm is proposed in Cortes et al. (2016) to learn a two-class classifier with reject option. The algorithm minimizes a convex surrogate for the cost-based model and shows that the surrogate is calibrated with Bayes solution. Learning prototype-based classifier with the rejection option has been addressed in Villman et al. (2016). All these methods require the reject cost to be fixed at the time of learning, and hence changing the cost requires re-training. In contrast, we propose algorithms to learn the proper uncertainty score on top of a pre-trained classifier so that the risk-coverage trade-off can be set by tuning the reject threshold without re-training. For many prediction models, it is easy to devise an ordinal uncertainty score from outputs of the learned (non-reject) classifier. Such strategies are often heuristically based but work reasonably well in practice. For example, Le Cun et al. (1990) proposed a reject strategy for a Neural Network classifier based on thresholding either the output of the maximally activated unit of the last layer or a difference between the maximal and runnerup output units. Other heuristically based strategies for neural networks were evaluated in Zaragoza and d Alche Buc (1998); Fisher et al. (2015). In case of Support Vector Machine classifiers (Vapnik, 1998) the trained linear score, proportional to the distance between the input and the decision hyperplane, is directly used as the uncertainty score (Fumera and Roli, 2002). We denote this approach as the margin score and use it as a baseline in our experiments. The learning of a selective classifier optimal for the bounded-improvement model was discussed in El-Yaniv and Wiener (2010). Their method requires a noisy-free scenario, i.e. they maximize the coverage under the constraint that the selective risk is zero. They provide a characterization of the lower and upper bounds of the risk-coverage curves in the PAC setting. Geifman and El-Yaniv (2017) assume a selective classifier based on thresholding an uncertainty score and show how to find a decision threshold for the bounded-improvement model which is optimal in the PAC sense. They do not address the problem of learning the uncertainty score. We complement their work by showing that the threshold-based selective classifier is an optimal solution when the uncertainty function is proper, and we propose algorithms to learn the proper uncertainty score from examples. Recent work addresses uncertainty prediction in the context of deep learning (Lakshminarayanan et al., 2017; Jiang et al., 2018; Corbiere et al., 2019). These works do not formulate the problem to be solved explicitly as a rejection model. However, they empirically evaluate their uncertainty scores in terms of the Risk-Coverage curve and the Area under the RC curve which we have shown to be connected with the bounded-abstention model (see Section 3.1). Lakshminarayanan et al. (2017) construct the MCP rule from a posterior distribution modeled as an ensemble of neural networks trained from multiple random initialization. They use adversarial examples to smooth out the posterior estimate. Jiang et al. (2018) propose a Trust Score as the ratio between the distance from the test sample to the samples of the nearest class with a different label and the distance to the samples with the same labels as the predicted class. Corbiere et al. (2019) propose a True Class Probability (TCP) score as a measure of prediction confidence. The TCP predicts the value of p(y | x) where y is the ground truth label. They learn an NN, the so-called Conf Net, by minimizing L2-loss between the Conf Net output and ˆp(yi | x) on training ex- Optimal Strategies for Reject Option Classifiers amples, where ˆp(y | x) is the soft-max distribution trained by standard cross-entropy loss. They show that the TCP empirically outperforms the MCP score and the Trust Score (Jiang et al., 2018) in terms of the Au RC metric. Both Jiang et al. (2018); Corbiere et al. (2019) consider the two-stage approach to learn the uncertainty score similarly to our paper. We show empirically that our proposed SELE score, in addition to having a theoretical backing and being applicable to a generic classification problem, outperforms the state-of-the-art TCP score. 5. Experiments In Section 3, we outlined two risk minimization based methods to learn the uncertainty score s(x) for a pre-trained predictor h(x), namely, the algorithm based on i) loss regression (Section 3.3) and ii) minimization of SELE loss (Section 3.4). We have shown that both methods are Fisher consistent, i.e., they are guaranteed to find the proper score in the idealized setting when the distribution p(x, y) is known (estimation error is zero), the hypothesis space F contains the proper score (approximation error is zero) and the loss minimizer can be found exactly (optimization error is zero). In this section, we evaluate these methods experimentally on real data when all assumptions are presumably violated. We design the experiments so that the optimization and the estimation error are small by using a large number of training examples and linear rules, making the loss minimization a convex problem. We compare against the recently proposed True Class Probability (TCP) score (Corbiere et al., 2019) which is learned from examples like the proposed methods. Unlike the proposed methods, the TCP requires the prediction model h(x) to provide an estimate of the posterior p(y | x), hence it is not applicable to fully discriminative models like e.g. SVMs. We emphasize that the experiments are meant to be a proof of concept rather than an exhaustive comparison of all existing methods. On the other hand, we are not aware of any other generic method (i.e., being not connected to a particular prediction model) we could compare with. To demonstrate that the proposed methods are generic, we consider three different categories of prediction problems: classification, ordinal regression, and structured output classification. For each prediction problem, we use several benchmark datasets and frequently used prediction models like Logistic Regression (LR), three variants of Support Vector Machines (SVMs), and Gradient Boosted Trees. For each prediction model, there exists an uncertainty score that is commonly used in practice, like, e.g., Maximal Class Probability (MCP) for logistic regression or distance to the decision hyper-plane (a.k.a. margin score) for SVMs. We use these uncertainty scores as additional baselines in our experiments. 5.1 Compared methods for uncertainty score learning In this section, we describe three algorithms that use a training set Tn = {(xi, yi) X Y | i = 1, . . . , n} to learn an uncertainty score s(x) for a pre-trained classifier h(x). We consider linear scores sθ(x) = θ, ψ(x) , where θ Rd are parameters to be learned and ψ: X Rd is a fixed mapping that will be defined for each prediction model separately in the following sections. All evaluated methods are instances of a regularized risk minimization framework. In all cases, learning leads to an unconstrained minimization of a convex objective F(θ) = C 2 θ 2 + ˆR(θ, Tn), where C > 0 is a regularization constant and ˆR(θ, Tn) is an Franc, Prusa and Voracek empirical risk defined by each method differently. The optimal value of C is selected from {0, 1, 10, 100, 1000} based on the minimal value of the Au RC evaluated on a validation set. 5.1.1 Regression score The parameters θ Rd are learned by minimizing a convex function FREG(θ) = C ℓ(yi, h(xi)) sθ(xi) 2 . Minimization of FREG(θ) is an instance of ridge regression that can be solved efficiently, e.g., by Singular Value Decomposition (SVD). 5.1.2 SELE score Evaluation of the proposed SELE loss ψsele(s, Tn) as defined by (30) requires O(n2) operations. To decrease the complexity, we approximate its value by splitting the examples into chunks and computing the average loss over the chunks. Namely, the parameters θ Rd are learned by minimizing a convex function FSELE(θ) = C i=1 ψsele(s, T k n ) , where C > 0 is a regularization constant and T 1 n T 2 n T P n is a randomly generated partition of the training set Tn into P approximately equally sized batches. In all experiments, we used P = round(n/500), i.e., the chunks contain around 500 examples. We minimize FSELE(θ) by the Bundle Method for Risk Minimization (BMRM) algorithm (Teo et al., 2010) which is set to find a solution whose objective is at most 1% offthe optimum 11. The total computation time of the BMRM algorithm is in the order of units of minutes for all datasets using a contemporary PC. 5.1.3 True Class Probability score The TCP (Corbiere et al., 2019) was originally designed for getting uncertainty score on top of a Convolution Neural Network (CNN) trained with cross-entropy loss. The setting we consider here can be seen as the original method applied to a single layer CNN. Namely, let ˆp(y | x) be an estimate of the posterior distribution in our experiments provided by the Logistic Regression (a.k.a. single layer NN). The parameters θ Rd are learned by minimizing a convex function FTCP(θ) = C i=1 (ˆp(yi | xi) sθ(xi))2 . Minimization of FREG(θ) is an instance of the ridge regression which we solve by SVD. 11. We use (Fprimal Fdual)/Fprimal 0.01 as the stopping condition of the BMRM algorithm. Optimal Strategies for Reject Option Classifiers 5.2 Benchmark problems 5.2.1 Classification Given real-valued features x X Rd, the task is to predict a hidden state y Y = {1, . . . , Y } so that the expectation of 0/1-loss ℓ(y, y ) = 100 [[y = y ]] is as small as possible 12. We selected 11 classification problems from the UCI repository (Dua and Taniskidou, 2017) and lib SVM datasets (Chang and Lin, 2011). The datasets are summarized in Table 2. We chose the datasets with a sufficiently large number of examples relative to the number of features, as we need to learn both the classifier and the uncertainty score and simultaneously keep the estimation error low. Each dataset was randomly split 5 times into 5 subsets, Trn1/Val1/Trn2/Val2/Tst, in ratio 30/10/30/10/20 (up to CODRNA with ratio 25/5/20/20/30 and COVTYPE with ratio 28/20/2/20/30). The subsets Trn1/Val1 were used for learning and tuning the best regularization constant of the classifier h(x). The subsets Trn2/Val2 were used for learning and tuning the regularization constant of the uncertainty score s(x) as described in Section 5.1. All features were normalized to have zero mean and unit variance. The normalization coefficients were estimated using only the Trn1 and Trn2 subsets, respectively. The Tst subset was used solely to evaluate the test performance. We used two prediction models: Logistic Regression (LR) (Hastie et al., 2009) and Support Vector Machines (SVM) (Vapnik, 1998). Logistic Regression learns parameters θLR = ((wy, by) Rd R | y Y) of the posterior probabilities ˆpθ(y | x) exp( wy, x + by) by maximizing the regularized loglikelihood FLR(θ) = C n Pn i=1 log ˆpθ(yi | xi) . The optimal C was selected from {1, 10, 100, 1000} based on the validation classification error. After learning θLR we used the plug-in Bayes classifier h(x) = argmaxy Y ˆpθLR(y | x). As a baseline uncertainty score we use the plug-in class conditional risk ˆr(x) = 1 ˆpθLR(h(x) | x). In accordance with the literature we refer to this baseline as the Maximal Class Probability (MCP) rule. As shown in Section 3.2, the MCP score is the proper uncertainty score provided the estimate ˆp(y | x) matches the true posterior p(y | x). Support Vector Machines learn parameters θSVM = ((wy, by) Rd R | y Y) of the linear classifier h(x) = argmaxy Y( wy, x + by) by minimizing FSVM(θ) = C 2 θ 2 + 1 n Pn i=1 maxy Y [[y = yi]]+ wy wyi, xi . The optimal C was selected in the same way as in case of the LR. As the baseline uncertainty measure we use s(x) = maxy Y wy, x +by. In the binary case |Y| = 2, the setting was θSVM = (w, b), h(x) = sgn( w, x + b), FSVM(θ) = C n Pn i=1 max{0, 1 yi( w, xi + b)} and s(x) = | w, x + b|. In both cases, the value of s(x) is proportional to a distance between the input x and the decision boundary. We denote this baseline as the margin score. Given the pre-trained LR or SVM classifier h(x), we apply the methods from Section 5.1 to learn the uncertainty score sθ(x) = wh(x), x + by , (37) where θ = ((wy, by) Rd R | y Y) are the parameters to be learned. It is seen that the rule (37) can be re-written as sθ(x) = θ, ψ(x) , where ψ: X Rd is an appropriately 12. Due to the factor 100 the reported errors correspond to the percentage of misclassified examples. Franc, Prusa and Voracek defined feature map. The form of the score (37) can be justified by noting that its special instance is the margin score which is obtained after substituting θSVM for θ. 5.2.2 Ordinal regression The task is to predict a hidden state from Y = {1, . . . , Y } based on real-valued features X Rd. Unlike the classification problem, the hidden states Y are assumed to be ordered and the goal is to minimize the expectation of the Mean Absolute Error ℓ(y, y ) = |y y |. We selected 11 regression problems from UCI repository (Dua and Taniskidou, 2017). The datasets are summarized in Table 2. The real-valued hidden states were discretized into Y bins, which are constructed to get a uniform class prior. Each dataset was randomly split 5 times into 5 subsets, Trn1/Val1/Trn2/Val2/Tst, in ratio 30/10/30/10/20. We used the same normalization and evaluation protocol as described for the classification benchmarks. As a prediction model, we used a variant of the Support Vector Machine algorithm developed for ordinal regression (Chu and Keerthi, 2005) 13. Support Vector Ordinal Regression (SVOR) learns parameters θSVOR = (w Rd, (b1, . . . , b Y 1) RY 1) of the ordinal linear classifier h(x) = 1 + PY 1 y=1 [[ x, w > by]] by minimizing FSVOR(θ) = C n Pn i=1 Pyi 1 y=1 max(0, 1 w, xi +by)+PY 1 y=yi max(0, 1+ xi, w by) . The optimal C was selected from {1, 10, 100, 1000} based on the validation MAE. The ordinal classifier can be thought of as a standard linear classifier composed of parallel decision hyper-planes. Similarly to the standard SVM, we use s(x) = miny {1,...,Y 1} | x, w by| as a baseline uncertainty score. The value of s(x) is proportional to the distance of x to the closest hyper-plane hence we also denote it as the margin score. When learning the uncertainty from examples, we use the parametrization (37). 5.2.3 Structured Output Classification Given an RGB image x X = {0, . . . , 255}W H 3 capturing a human face, the task is to predict a pixel positions of 68 landmarks y = (l1, . . . , l68) Y = ({1, . . . , W} {1, . . . , H})68 corresponding to contours of eyes, mouth, nose, etc. We use the 300-W dataset and the associated evaluation protocol which was created by the organizers of landmark detection challenge (Sagonas et al., 2016). The 300-W dataset contains 5,807 faces each annotated with 68 landmarks. The faces are split into 3,484 training, 1,161 validation and 1,162 test examples. The prediction accuracy is measured in terms of normalized average localization error ℓ(y, ˆy) = 100 iod(y) P68 i=1 li ˆli , where iod(y) is the inter-ocular distance computed from the ground-truth landmark positions y. As the structured classifier h(x) we use the landmark detector from DLIB package (King, 2009). The detector predicts the landmark positions based on HOG descriptors (Dalal and Triggs, 2005) of the input image using an ensemble of regression trees that are trained by gradient boosting (Kazemi and Sullivan, 2014). The DLIB landmark detector has been widely used by developers due to its robustness and exceptional speed even on low-end hardware. The detector does not provide any measure of prediction uncertainty and it is unclear how to derive it from outputs of the regression trees. A commonly used uncertainty 13. (Chu and Keerthi, 2005) introduced two variants of SVOR algorithm. We use so called SVOR with implicit constraints which is designed for minimization of the MAE loss. Optimal Strategies for Reject Option Classifiers score for face recognition related prediction problems is a score function of a face detector. The face detector score is the output of a binary classifier trained to distinguish a face from non-face images. The value of the score is high for well-looking prototypical faces and low for corrupted or difficult faces. As a baseline, we use the score of the DLIB face detector which is a linear SVM classifier on top of HOG descriptors extracted from the image. When learning the uncertainty score from examples, the score is a linear regressor sθ(x) = θ, ψ(x) on top of a feature vector ψ(x) R2,448 which is a concatenation of HOG descriptors extracted from the input facial image x along landmark positions predicted by the landmark detector h(x). The DLIB detector uses the same features to predict the landmark positions, hence the extra computational time required to evaluate the uncertainty score is neglectable. Classification problems dataset examples feat cls AVILA 20,867 10 12 CODRNA 331,152 8 2 COVTYPE 581,012 54 7 IJCNN 49,990 22 2 LETTER 20,000 16 26 MARKETING 45,211 51 2 PENDIGIT 10,992 16 10 PHISHING 11,055 68 2 SATTELITE 6,435 36 6 SENSORLESS 58,509 48 11 SHUTTLE 58,000 9 7 Ordinal regression problems dataset examples feat cls ABALONE 4,177 10 19 BANK 8,192 32 10 BIKESHARE 17,379 11 10 CALIFORNIA 20,640 8 10 CCPP 9568 4 10 CPU 8192 21 10 FACEBOOK 50,993 53 10 GPU 24,1600 14 10 METRO 48,204 30 10 MSD 499,671 90 41 SUPERCOND 21,263 81 10 Table 2: Summary of 11 classification problems (left) and 11 ordinal regression problems (right) selected from UCI repository (Dua and Taniskidou, 2017) and lib SVM datasets (Chang and Lin, 2011). The table shows the total number of examples, the number of features and the number of classes. 5.3 Results 5.3.1 Classification problems For both classification models, LR and SVM, we recorded the test risk of the classifier h(x) and the Au RC computed from h(x) and uncertainty score s(x) produced by the corresponding method under evaluation. In the case of LR, we compare the baseline MCP score (sec 5.2.1) and scores learned from examples including the state-of-the-art TCP score (sec 5.1.3) and the two proposed SELE score (sec 5.1.2) and Regression (REG) score (sec 5.1.1). In the case of SVM, we compare the baseline margin score (sec 5.2.1) against the proposed SELE and REG scores. Note that TCP score is not applicable for SVM classifier as it does not provide an estimate of the posterior probability p(y | x). The results are summarized in Table 3. Franc, Prusa and Voracek For each dataset, we rank the compared methods according to the Au RC 14. Following the methodology of Demˇsar (2006) we summarize the performance of each method by its average rank and use the Friedman test and the post-hoc Nemenyi test to analyze significance of the results. We used the Friedman test to check whether the measured average ranks are significantly different from the mean rank. The null hypothesis states that the compared scores are equivalent so that their average ranks should be equal. In both cases, the null hypothesis is rejected for p-value 0.05, i.e., performance of the compared methods is significantly different. We used post-hoc Nemenyi test for pairwise comparison. For each pair of methods, it checks whether their average ranks are significantly different. In the case of LR, when we compare K = 4 methods using N = 11 datasets, the critical difference for p-value 0.10 is CD = 1.26. By comparing the average ranks, we conclude that SELE score performs significantly better than MCP score and REG score. In the case of SVM, when we compare K = 3 methods using N = 11 dataset, the critical difference for p-value 0.10 is CD = 0.98. We conclude that SELE performs significantly better than REG score and Margin score. The data is not sufficient to reach any conclusion about other pairwise comparisons. The result of the Nemenyi test is visualized in Figure 1. We further computed the relative improvement gained by using the scores SELE, TCP, and REG, that are all learned from examples, with respect to the baseline scores derived from the classifier output, i.e. MCP in the case of LR and Margin score in the case of SVM. The results are summarized in Figure 2. It is seen that the MCP uncertainty computed from the estimated p(y | x) constitutes a much stronger baseline than the Margin score of the fully discriminative SVM model. The relative improvement of scores learned on top of the LR is only moderate in contrast to the SVM classifier where the improvements are more significant and consistent. It is also seen that on the majority of datasest, the performance of the learned scores is similar taking into account the statistical error of the Au RC estimate. It is also worth mentioning that the results of SELE score have the lowest variance of the Au RC estimates as seen from the error bars in Figure 2. LR+SELE LR+TCP LR+REG LR+MCP SVM+SELE SVM+Margin SVM+REG a) Average ranks for scores on top LR. b) Average ranks for scores on top of SVM. Figure 1: Comparison of all uncertainty scores against each other with the Nemenyi test. The test is computed separately for the LR classifier (4 scores compared) and the SVM classifier (3 scores compared). The figures show the average ranks for each score and the critical distance (CD). Groups of scores that are not significantly different at p-value 0.10 are connected. 14. The score with smallest Au RC is ranked 1, the second smallest 2 and so on. Optimal Strategies for Reject Option Classifiers LR+MCP LR+SELE LR+REG LR+TCP LR Au RC Au RC Au RC Au RC R@100 AVILA 27.18 0.55 25.79 0.44 26.62 0.74 26.85 0.78 43.71 0.42 CODRNA 0.88 0.05 0.65 0.03 0.82 0.06 0.78 0.04 4.81 0.08 COVTYPE 16.49 0.06 17.58 0.07 17.62 0.09 17.19 0.07 27.56 0.17 IJCNN 1.26 0.04 1.00 0.03 1.16 0.08 1.14 0.06 7.54 0.15 LETTER 7.43 0.40 6.42 0.34 7.44 0.59 6.71 0.42 23.32 0.60 MARKETING 2.60 0.31 1.88 0.11 1.97 0.12 1.90 0.11 9.88 0.29 PENDIGIT 0.69 0.04 1.55 0.19 1.97 0.55 1.47 0.39 5.29 0.40 PHISHING 0.76 0.10 0.75 0.10 0.91 0.31 0.85 0.25 6.29 0.44 SATTELITE 3.83 0.26 3.68 0.27 4.93 1.07 4.52 0.85 15.06 0.46 SENSORLESS 2.03 0.11 1.82 0.08 2.69 0.09 2.37 0.22 8.23 0.45 SHUTTLE 0.59 0.09 0.26 0.07 1.24 0.51 0.58 0.13 3.36 0.25 average rank 2.73 1.36 3.55 2.36 (a) Uncertainty scores on top of LR classifier. SVM+MARGIN SVM+SELE SVM+REG SVM Au RC Au RC Au RC R@100 AVILA 31.65 0.83 25.26 0.67 25.95 0.75 43.34 0.70 CODRNA 0.89 0.05 0.65 0.03 0.82 0.05 4.78 0.08 COVTYPE 25.71 0.81 17.79 0.21 17.77 0.14 27.41 0.11 IJCNN 1.40 0.04 1.01 0.04 1.18 0.08 7.56 0.16 LETTER 10.20 0.22 6.05 0.65 7.15 0.65 22.06 0.69 MARKETING 2.24 0.20 1.97 0.10 2.04 0.20 10.48 0.39 PENDIGIT 2.79 0.40 1.57 0.21 2.16 0.43 4.88 0.57 PHISHING 0.84 0.12 0.72 0.12 0.90 0.30 6.37 0.44 SATTELITE 4.75 0.60 3.82 0.27 5.44 0.68 15.36 0.37 SENSORLESS 3.68 0.20 1.56 0.08 2.46 0.29 6.92 0.17 SHUTTLE 1.31 0.47 0.24 0.07 0.55 0.15 2.02 0.15 average rank 2.82 1.09 2.09 (b) Uncertainty scores on top of SVM classifier. Table 3: Performance of the uncertainty scores on 11 classification problems. The scores are constructed on top of the LR classifier and the SVM classifier measured in terms of Au RC. For each score we show the mean and the standard deviation of the test Au RC computed over 5 random splits. We compare the performance of scores learned from examples (SEL, REG, TCP) and the baseline scores derived from the classifiers output (MCP and Margin score). The last column shows the risk of the base (non-selective) classifier. All the values correspond to percentage of misclassification. The best results for each dataset are shown in bold. The last row shows the average rank. Franc, Prusa and Voracek (a) Improvement over LR+MCP score. (b) Improvement over SVM+Margin score. improvement [%] LR+SELE LR+REG LR+TCP improvement [%] SVM+SELE SVM+REG Figure 2: Relative improvement gained by using the uncertainty scores (SELE, REG and TCP) that are learned from examples over the baseline scores (MCP for LR and Margin score for SVM) constructed from the classifier output. The relative improvement is computed as 100 (Au RCbaseline Au RCmethod)/Au RCbaseline. We show the mean and the standard deviation (error bar) of the relative improvement computed over the random 5 splits. 5.3.2 Ordinal regression In the case of SVOR classifier, we compared the baseline Margin score (sec 5.2.2) and the scores learned from examples including SELE (sec 5.1.2) and REG score (sec 5.1.1). We used the same evaluation protocol as for the classification task, however, instead of 0/1-loss, the errors were evaluated by the MAE loss. The results are summarized in Table 4. We again ranked the methods according to the Au RC and summarized their performance by the average rank: We applied the Friedman test checking whether the measured average ranks are significantly different from the mean rank. The null hypothesis, stating that the compared scores are equivalent, is rejected for p-value 0.05, i.e., performance of the compared methods is significantly different. We used post-hoc Nemenyi test to check for each pair whether their average ranks are significantly different. Considering K = 3 compared methods using N = 11 dataset yields the critical difference for p-value 0.10 is CD = 0.98. We conclude that SELE and REG scores are significantly better than the baseline Margin score. The data is not sufficient to reach any conclusion about the comparison of SELE and REG. The result of the Nemenyi test is visualized in Figure 3(a). The relative improvement gained by using SELE and REG scores learned from examples w.r.t. baseline Margin score is shown in Figure 3(b). It is seen that the performance of the learned scores is similar and that they consistently outperform the baseline by a significant margin. Optimal Strategies for Reject Option Classifiers SVOR+MARGIN SVOR+SELE SVOR+REG SVOR Au RC Au RC Au RC R@100 CALIFORNIA 0.98 0.03 0.82 0.02 0.84 0.02 1.18 0.01 ABALONE 1.48 0.10 1.19 0.09 1.21 0.05 1.54 0.02 BANK 1.07 0.04 0.99 0.04 0.98 0.03 1.50 0.03 CPU 0.41 0.01 0.36 0.02 0.36 0.02 0.64 0.03 BIKESHARE 1.60 0.07 1.25 0.01 1.27 0.01 1.70 0.03 CCPP 0.46 0.02 0.41 0.02 0.42 0.02 0.58 0.02 FACEBOOK 0.51 0.01 0.37 0.01 0.36 0.01 1.11 0.01 GPU 1.43 0.02 0.85 0.03 0.86 0.03 1.49 0.02 METRO 2.20 0.07 1.97 0.01 1.98 0.03 2.37 0.03 MSD 6.23 0.07 4.26 0.03 4.25 0.03 6.22 0.03 SUPERCONDUCT 0.98 0.02 0.75 0.01 0.77 0.01 1.07 0.01 average rank 3.00 1.27 1.73 Table 4: Performance of the uncertainty scores on 11 ordinal regression problems. The scores are constructed on top of the SVOR classifier. For each score we show the mean and the standard deviation of the test Au RC computed over 5 random splits. We compare the performance of scores learned from examples (SEL, REG) and the baseline Margin score derived from the SVOR classifier output. The last column shows the risk of the base (non-selective) classifier. All the values correspond to the Mean Absolute Error (MAE). The best results for each dataset are shown in bold. The last row shows the average rank. 5.3.3 Structured Output Classification We trained SELE score (sec 5.1.2) and REG score (sec 5.1.1) on top of the DLIB detector and compared them with the baseline which uses the DLIB face detector score (sec 5.2.3) as an uncertainty measure. The Risk-Coverage curves of the three methods and their corresponding Au RC are shown in Figure 4(a). Both the learned scores, SELE and REG, are significantly better than the baseline face detector score. The SELE is slightly better than REG score. The largest differences between the three scores are seen for low values of coverage where SELE outperforms the other two methods. High selective risk for low values of coverage means that faces with very bad landmark predictions are assigned the lowest uncertainty scores. SELE score does not suffer from this problem. This can be seen in Figure 5 where we show examples of the 10 test faces with the lowest uncertainty and the highest uncertainty predicted by SELE. Unlike the experiments in the previous section, the number of parameters to be learned (m = 2, 448) relative to the number of training examples (n = 3, 484) is much higher. To see whether the number of examples is sufficient, we trained SELE and REG scores on an increasingly bigger training set. Figure 4(b) shows the test Au RC as a function of the the number of training examples. It is seen that Au RC of the SELE is not yet saturated and Franc, Prusa and Voracek a) Average rank for scores on top of SVOR. b) Relative improvement over SVOR+Margin score. SVOR+REG SVOR+Margin SVOR+SELE SUPERCONDUCT improvement [%] SVOR+SELE SVOR+REG Figure 3: Statistics derived from results obtained on 11 ordinal regression problems. Figure (a) shows pair-wise comparison of uncertainty scores with the Nemenyi test. The figure shows the average ranks for each score and the critical distance (CD). Groups of scores that are not significantly different at p-value 0.10 are connected. Figure (b) shows relative improvement gained by using the SELE and REG uncertainty scores learned from examples over the baseline Margin score. would most likely converge to a significantly lower value relative to the REG score provided 300-W dataset had more training examples. 6. Conclusions The standard cost-based rejection model introduced by Chow (1970) requires an explicit definition of the rejection cost, which is difficult in applications when the reject cost and the loss of the label have a different nature or physical units. Pietraszek (2005) proposed a bounded-improvement model and a bounded-abstention model which avoid the problem by defining an optimal prediction strategy in terms of coverage and selective risk. Our main result is a formal proof that despite their different objectives, the three rejection models are equivalent in the sense that they lead to the same prediction strategy: the Bayes classifier and the randomized Bayes selection function. Thanks to the common optimal solution, it is possible to convert the parameters of different rejection models. For example, for any target risk defining the bounded-improvement model, there exists a corresponding reject cost so that both models have the same optimal strategy. The explicit characterization of the optimal strategies provides a recipe for building plug-in rules to solve bounded-improvement and bounded-abstention models. Any method estimating the class posterior probabilities can be thus turned into an algorithm for learning the selective classifier that solves the bounded-improvement and the bounded-abstention model. We have defined a notion of a proper uncertainty score which is sufficient to construct the randomized Bayes selection function. We proposed two algorithms to learn a proper Optimal Strategies for Reject Option Classifiers (a) Risk-Coverage curve (b) Learning curve 0 0.5 1 coverage selective risk DLIB+SELE, Au RC=2.6 DLIB+REG, Au RC=2.8 DLIB+FACE-SCORE, Au RC=3.5 #training examples DLIB+SELE DLIB+REG Figure 4: Evaluation of uncertainty scores on top of DLIB landmark detector using the 300-W benchmark. Figure (a) shows the RC curve and Au RC of computed from the predictions of DLIB detector endowed with the three compared uncertainty scores: the proposed SELE and REG scores and the DLIB face detector score used as a baseline. Figure (b) shows the test Au RC for SELE and REG scores as a function of the number of training examples. uncertainty score from examples for a given classifier. We have shown that both algorithms provide a Fisher-consistent estimate of the proper uncertainty score. As a proof of concept, we evaluated the proposed algorithms on different types of prediction problems. We have shown that the proposed algorithm based on minimization of the SELE loss outperforms existing approaches tailored for a particular prediction model and works on a par with the recently published state-of-the-art TCP score (Corbiere et al., 2019). Unlike the TCP score, which requires the classifier to output the class posterior probabilities, the proposed algorithms are applicable to an arbitrary black-box classifier. We have drawn a connection between the proposed bounded-abstention model and the RC curve. Namely, the RC curve represents the quality of all admissible solutions of the bounded-abstention model that can be constructed from a pair of classifier and uncertainty score. The Au RC is then the expected quality of the selective classifier constructed from the pair when the target coverage is selected uniformly at random. This connection sheds light on many published methods which do not explicitly define the target objective but use the RC curve and the Au RC as evaluation metrics. Finally, let us mention some topics for future work. First, the proposed algorithms consider a two-stage scenario in which the classifier and the uncertainty score are learned separately from independent training sets. Although the scenario is useful in practice, an algorithm that learns the classifier and the uncertainty score simultaneously from a single training set constitutes an interesting topic to be solved. Secondly, we have shown how to learn the proper uncertainty score, but have not discussed how to set up the decision threshold and the acceptance probability that are also needed to construct the selective Franc, Prusa and Voracek classifier. It is straightforward to tune these parameters using empirical data and the RC curve. Analysis of the generalization error of this empirical approach is an open issue which has been solved only for the decision threshold of the bounded-improvement model by Geifman and El-Yaniv (2017). Thirdly, the empirical evaluation is limited to uncertainty scores linear in the parameters to be learned. Efficient implementations of the algorithms applicable to non-linear models, like e.g. the neural networks, is an another topic left for future. Forth, we use Fisher consistency to justify the proposed SELE loss. Deriving efficient finite sample guarantees constitutes another interesting open problem. Acknowledgments We are grateful to all reviewers for their insightful feedback. The authors acknowledge support for this work from the Czech Science Foundation project GACR GA19-21198S. We also acknowledge project No. CZ.02.1.01/0.0/0.0/16 019/0000765 that supported V. Franc in 0.4 FTE. Optimal Strategies for Reject Option Classifiers 1: loss=0.02 6: loss=0.03 1153: loss=0.53 1158: loss=0.25 2: loss=0.02 7: loss=0.02 1154: loss=0.71 1159: loss=0.82 3: loss=0.02 8: loss=0.03 1155: loss=0.32 1160: loss=0.30 4: loss=0.02 9: loss=0.02 1156: loss=0.49 1161: loss=0.46 5: loss=0.03 10: loss=0.02 1157: loss=0.43 1162: loss=0.37 Figure 5: Figure shows examples of 10 test faces from 300-W database with the lowest and 10 faces with the highest value of the SELE uncertainty score. The ground-truth landmark positions (red) and the landmark positions predicted by DLIB detector (blue) are superimposed into the image. The image title shows the rank induced by the SELE score and the normalized localization error which is used as the classification loss in this application. Franc, Prusa and Voracek Appendix A. Proofs of theorems from Section 2 A.1 Proof of Theorem 1 The Bayes classifier reads h B(x) argmin ˆy Y y Y p(y | x) ℓ(y, ˆy) (3) Problem 2 (Bounded-improvement model) Given a target risk λ > 0, the optimal selective classifier (h I, c I) is a solution to the problem max h,c φ(c) s.t. RS(h, c) λ , (7) where we assume that both maximizers exist. Theorem 1 Let (h, c) be an optimal solution to (7). Then, (h B, c), where h B is the Bayes classifier (3), is also optimal to (7). Proof It is sufficient to show that (h B, c) is feasible to (7), i.e., that RS(h B, c) λ. Then (h B, c) attains the same maximum objective value φ(c) as (h, c). Derive RS(h B, c) = 1 φ(c) y Y p(x, y) ℓ(y, h B(x)) c(x) dx y Y p(y | x) ℓ(y, h B(x)) y Y p(y | x) ℓ(y, h(x)) y Y p(x, y) ℓ(y, h(x)) c(x) dx = RS(h, c) λ. A.2 Proof of Theorem 2 The presented proof of the theorem uses Lemmas 13 and 14, both derived based on Lemma 12 bellow. Lemma 12 For a set X, let f : X R+15 and g : X R be measurable functions such that R X f(x)dx > 0 and g(x) > 0 for all x X. Then it holds R X g(x)f(x)dx > 0. 15. We use R, R+ and N+ to denote the set of real numbers, non-negative real numbers and positive integers, respectively. Optimal Strategies for Reject Option Classifiers Proof For n N+, define functions fn(x) = f(x) if g(x) 1 n, 0 otherwise. The sequence {fn} n=1 is monotone and converges to f. Using the monotone convergence theorem (Stein and Shakarchi, 2009), derive lim n fn(x)dx = lim n This means that there is a k N+ such that R X fk(x)dx > 0, hence we conclude Z g(x)f(x)dx Z g(x)fk(x)dx Z 1 kfk(x)dx > 0. Lemma 13 For a set X, let f : X R+ and g : X R be measurable functions such that R X f(x)dx > 0 and g(x) > b for all x X and some b R. Then it holds R X g(x)f(x)dx > b R Proof By Lemma 12, we have Z (g(x) b)f(x)dx > 0, g(x)f(x)dx = Z (g(x) b)f(x)dx + Z bf(x)dx > b Z Lemma 14 For a set X, let f : X R+ and g : X R be measurable functions such that R X g(x)f(x)dx > 0 and g(x) < 1 for all x X. Then it holds R X f(x)dx > R X g(x)f(x)dx. X g(x)f(x)dx > 0 implies R X f(x)dx > 0. Since it holds x X : (1 g(x)) > 0, Lemma 12 yields (1 g(x))f(x)dx = Z g(x)f(x)dx, X f(x)dx > R X g(x)f(x)dx is obtained as a direct consequence. Franc, Prusa and Voracek Problem 3 (Bounded-improvement model for known h(x)) Given a classifier h(x), the optimal selection function c (x) is a solution to max c [0,1]X φ(c) s.t. RS(h, c) λ . (8) Theorem 2 A selection function c : X [0, 1] is an optimal solution to Problem 3 if and only if it holds Z Xr(x) 0 , R Xr(x)=0 p(x)dx if b = 0 , (11) Xr(x)>b p(x)c (x)dx = 0 , (12) where r(x) = r(x) λ measures how much the conditional risk r(x) of the classifier h(x) exceeds the target λ, X p(x)r(x) dx (13) is the expectation of r(x) restricted to inputs in X , and b = sup {a | ρ(Xr(x) a) 0} 0 . (14) Proof Observe that b 0, because ρ(Xr(x) 0) 0. Next, observe that Problem 3 can be rewritten into the form max c [0,1]X X p(x)c(x)dx s.t. Z X p(x)c(x)r(x)dx 0 (38) RS(h, c) λ = y Y p(x, y) ℓ(y, h(x)) c(x) dx λφ(c) X p(x)c(x)r(x) dx λ R X p(x)c(x)r(x) dx φ(c) . (40) Let F(c) = φ(c) = R X p(x)c(x)dx denote the objective function of (38). Case 1 b > 0. Claim I Each c : X [0, 1] which fulfils (10), (11) and (12) is feasible to (38) and bρ(Xr(x) 0. (48) Inequality (48) and Lemma 13 (applied to f(x) = p(x)c(x) and g(x) = r(x)) yield Z p(x)c(x)r(x)dx > b Z p(x)c(x)dx. Franc, Prusa and Voracek Therefore, we can write Z p(x)c(x)r(x)dx = b Z p(x)c(x)dx (49) for a suitable b R+ such that b > b > 0. (50) Based on the constraint of (38), derive Z p(x)c(x)r(x)dx (49) = Z p(x)c(x)r(x)dx + b Z p(x)c(x)dx + b Z (38) 0 (46) = Z p(x)c (x)r(x)dx + b Z p(x)c (x)dx. (51) Let σ(x) = 1 br(x). Inequality (51) can be rearranged and upper bounded as Z p(x)c(x)dx Z p(x)c (x)dx + b p(x)c(x)dx (52) p(x)(c (x) c(x))σ(x)dx Z p(x)c (x)dx Z where the second inequality follows from x Xr(x) ρ(Xr(x) 0. (58) Then, (57) can be rewritten as Z p(x)(c (x) c(x))σ(x)dx (58) > 0. (59) Inequality (59) and Lemma 14 (applied to g(x) = σ(x) < 1 and f(x) = p(x)(c (x) c(x)) (47) 0 over Xr(x) . (60) Now, combine and rearrange (58) and (60) to obtain p(x)c (x)dx + Z p(x)c (x)dx (60) > (61) p(x)c(x)dx + Z p(x)c(x)dx = F(c). (62) Case 1.3 Conditions (11) and (12) hold, condition (10) is violated, i.e. Z p(x)c(x)dx < Z p(x)dx. (63) p(x)c (x)dx ρ(Xr(x) γ , (15) where τ is the acceptance probability given by ( 1 if ρ(Xr(x)=γ) = 0 , ρ(Xr(x)<γ)) ρ(Xr(x)=γ)) if ρ(Xr(x)=γ) > 0 , (16) satisfies the optimality condition of Theorem 2, and hence it is a solution to Problem 3. Proof The optimality conditions (10) and (12) given in Theorem 2 are equivalent to a probabilistic statement Px p(x)[c (x) = 0 r(x) < b] = 0 and Px p(x)[c (x) = 1 r(x) > b] = 0, respectively. Hence the two conditions are satisfied by a selection function which predicts, c (x) = 1, whenever r(x) < b and rejects, c (x) = 0, whenever r(x) > b. Or equivalently, using the identity r(x) = r(x) λ and a threshold γ = b + λ, by c (x) = 1 when r(x) < γ and c (x) = 0 when r(x) > γ. Finally, if we opt for a selection function that Optimal Strategies for Reject Option Classifiers is constant c (x) = τ inside the boundary region Xr(x)=b, then the condition (11) im- plies τ = ρ(Xr(x)<γ) b ρ0 if b > 0, where ρ0 = R Xr(x)=b p(x) dx, and τ = 1 if b = 0. Using Xr(x)β p(x)c (x)dx = 0 , (21) Xr(x) 0 . (68) Define c : X [0, 1] as follows. 1 if r(x) < β , 0 if x X , c(x) otherwise . (69) c is feasible to (64) as φ(c ) = φ(c). Derive φ(c) F(c) F(c ) = Z p(x)c(x)r(x)dx Z p(x)r(x)dx + Z p(x)c(x)r(x)dx (67),(68) Z β p(x)(1 c(x))dx Z p(x)(1 c(x))r(x)dx (71) p(x)(1 c(x))(β r(x))dx > 0 (72) where the inequality in (72) is obtained from Lemma 12 applied to f(x) = p(x)(1 c(x)), g(x) = β r(x), and the set Xr(x)<β. This shows that c is not an optimal solution. Case 2 Condition (19) is satisfied, condition (20) is violated and Z Xr(x)=β p(x)c(x)dx < ω Z Xr(x)<β p(x)dx > 0 . (73) Optimal Strategies for Reject Option Classifiers In this case, there is a c : X [0, 1] such that c (x) = c(x) if r(x) < β , (74) c (x) = 0 if r(x) > β , (75) Xr(x)=β p(x)c (x)dx = Z Xr(x)=β p(x)c(x)dx + Z Xr(x)>β p(x)c(x)dx . (76) If Lemma 13 is applied to f(x) = p(x)c(x), g(x) = r(x), and the set Xr(x)>β, we get Xr(x)>β p(x)c(x)r(x)dx > β Z Xr(x)>β p(x)c(x)dx . (77) It also holds that φ(c ) = φ(c). Now, deriving φ(c) F(c) F(c ) (76) = Z p(x)c(x)r(x)dx β Z p(x)c(x)dx (78) p(x)c(x) β Z p(x)x(x)dx = 0 (79) shows that c is not an optimal solution. Case 3 φ(c) > ω, which occurs if Z Xr(x)=β p(x)c(x)dx > ω Z Xr(x)<β p(x)dx > 0 (80) (implying that condition (20) is violated), or if condition (21) is violated. Observe that F(c) = F(α c) for any a R+. Let c = ω φ(c) c. Since φ(c ) = ω, the selection function c is feasible to (64). Because Z Xr(x)<β p(x)c (x)dx = ω φ(c) Xr(x)<β p(x)c(x)dx < Z Xr(x)<β p(x)dx , (81) c violates condition (19) and is therefore not an optimal solution (see Case 1). This implies that c is not an optimal solution too. A.6 Proof of Theorem 6 Theorem 6 Let r: X R be the conditional risk (9) of a classifier h: X Y, 1 ω > 0 be a target coverage and β be the constant computed by (22). Then the selection function 1 if r(x) < β , κ if r(x) = β , 0 if r(x) > β , (23) Franc, Prusa and Voracek where κ is the acceptance probability given by Xr(x)=β p(x)dx = 0 , Xr(x)<β p(x)dx R Xr(x)=β p(x)dx otherwise , (24) satisfies the optimality condition of Theorem 6, and hence it is a solution of Problem 5. Proof It is easy to see that c satisfies conditions (19) and (21). The validity of condition (20) is proved as follows. If R Xr(x)=β p(x)dx = 0, then R Xr(x)=β p(x)dx = ω, and condition (20) is met. If R Xr(x)=β p(x)dx > 0, we derive Xr(x)=β p(x)c (x)dx = ω R Xr(x)<β p(x)dx R Xr(x)=β p(x)dx Xr(x)=β p(x)dx = ω Z Xr(x)<β p(x)dx . (82) Appendix B. Proofs of theorems from Section 3 B.1 Proof of Theorem 7 The expectation of the squared loss deviation reads Ereg(s) = Z y Y p(x, y) ℓ(y, h(x)) s(x) 2 dx . (83) Theorem 7 The conditional risk r(x) defined by (9) is an optimal solution to min s:X R Ereg(s). Proof We can rewrite Ereg(s) as Ereg(s) = Z y Y p(y | x) ℓ(y, h(x))2 2 ℓ(y, h(x)) s(x)+s(x)2 dx = Z X p(x)f(s(x))dx. Due to additivity, we can solve mins: X R Ereg(s) for each x X separately by setting derivative of f(s) to zero and solving for s which yields f (s) = 2 X y Y p(y | x)ℓ(y, h(x)) + 2 X y Y p(y | x)s(x) = 0 s (x) = X y Y p(y | x)ℓ(y, h(x)) . Optimal Strategies for Reject Option Classifiers B.2 Proof of Theorem 8 Lemma 15 For an integer n 1, let {ai}n i=1 and {bi}n i=1 be sequences of positive real numbers such that {ai bi }n i=1 is a non-increasing sequence. For each non-decreasing sequence {ℓi}n i=1 of non-negative real numbers with a positive sum it holds that Pn i=1 aiℓi Pn i=1 biℓi Pn i=1 ai Pn i=1 bi . Proof We will show that the lemma statement is a direct consequence of the four-letter identity i=1 Ci Di = X j 1. This confirms inequality (86). Lemma 17 Pn i=1(Hn Hi 1) = n. Proof By induction on n. The lemma trivially holds for n = 1. Let n > 1. Then, using the induction hypothesis, we derive i=1 (Hn Hi 1) = Hn Hn 1 + n + Hn 1 Hi 1 i=1 (Hn 1 Hi 1) = n . Theorem 8 The inequality Au RC(s, Tn) < 2 sele(s, Tn) holds true for any s: X R and Tn = {(xi, yi) X Y | i = 1, . . . , n}. Moreover, for any ε > 0, there are n, s and Tn such that Au RC(s, Tn) (2 ε) sele(s, Tn). Optimal Strategies for Reject Option Classifiers Proof The first statement trivially holds if Pn i=1 ℓi = 0. If Pn i=1 ℓi > 0, we apply Lemmas 16, 15 and 17 to derive Au RC(s, Tn) sele(s, Tn) Pn i=1 ai Pn i=1 bi = Pn i=1 n (Hn Hi 1) Pn i=1(n i + 1) = n2 n 2 (n + 1) = 2n n + 1 < 2 . Taking ℓi = 1 for all i = 1, . . . , n yields Au RC(s, Tn) sele(s, Tn) = Pn i=1 ai Pn i=1 bi = 2n n + 1 . Since limn 2n n+1 = 2, we see that also the second statement is valid. B.3 Proof of Theorem 9 Remark 18 For the sake of simplicity, for predicates ϕ1(x, z), . . . , ϕk(x, z) and a function f : X X R, we write ϕ1(x,z) ... ϕk(x,z) f(x, z)dz dx to represent f(x, z)[[ϕ1(x, z) . . . ϕk(x, z)]]dz dx . Theorem 9 A function s : X R is an optimal solution to mins:X R Esele(s) iff z =x s (z)=s (x) max{r(x), r(z)}p(x)p(z)dz dx = 0 , and (32) r(z)s (x) (r(x) r(z)) p(x)p(z)dz dx = 0 . (33) Franc, Prusa and Voracek Proof We first present four equalities to be used later. We assume that s : X R is any measurable function. The validity of the equalities can be easily verified. r(z)>r(x) s(z)s(z) r(x)p(x)dx dz = Z r(z)s(x) r(z)p(z)dz dx , r(z)x r(z)=r(x) s(z)=s(x) r(x)p(x)p(z)dz dx + Z z=x r(x)p(x)p(z)dz dx . Since argmins:X R E(s) = argmins:X R (E(s) E(r)), it suffices to analyze minimizers of E(s) E(r) instead of E(s). Derive E(s) E(r) = Z r(x)p(x)p(z)dz dx Z r(x)p(x)p(z)dz dx r(z)s(x) r(x)p(x)p(z)dz dx Z r(z)>r(x) s(z)s(x) r(x)p(x)p(z)dz dx Z r(z)>r(x) s(z)s(x) r(x)p(x)p(z)dz dx Z r(z)s(x) r(z)p(x)p(z)dz dx r(z)s(x) (r(x) r(z)) p(x)p(z)dz dx r(z) r(a) and s(b) s(a) then f(r(b), s(b)) > f(r(a), s(b)) because the function f(u, v) is strictly increasing in u and strictly decreasing in v. Combined, it implies that f(r(b), s(b)) > C which leads to a contradiction because an optimal s requires f(r(b), s(b)) = C. P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(59):1823 1840, 2008. C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1 27:27, 2011. URL http: //www.csie.ntu.edu.tw/~cjlin/libsvm. C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41 46, 1970. W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pages 145 152, 2005. C. Corbiere, N. Thome, A. Bar-Hen, M. Cord, and P. Perez. Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, volume 32, pages 2902 2913, 2019. Optimal Strategies for Reject Option Classifiers C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, volume 29, pages 1660 1668, 2016. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of Conference on Computer Vision and Patter Recognition, volume 1, pages 886 893, 2005. J. Demˇsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1):1 30, 2006. D. Dua and E. Karra Taniskidou. UCI machine learning repository, 2017. URL http: //archive.ics.uci.edu/ml. R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(53):1605 1641, 2010. L. Fischer, B. Hammer, and H. Wersing. Optimal local rejection for classifiers. Neurocomputing, 214:445 457, 2016. L. Fisher, B. Hammer, and H. Wersing. Efficient rejection strategies for prototype-based classification. Neurocomputing, 169:334 342, 2015. V. Franc and D. Prusa. On discriminative learning of prediction uncertainty. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1963 1971, 2019. G. Fumera and F. Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines, Lecture Notes in Computer Science, volume 2388. Springer, 2002. G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classification reliability. In Advances in Pattern Recognition, pages 863 871, 2000. Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems 30, pages 4878 4887, 2017. Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, volume 21, pages 537 544, 2008. B. Hanczar and E. R. Dougherty. Classification with reject option in gene expression data. Bioinformatics, 24:1889 1895, 2008. T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2009. J. Havil. Gamma: Exploring Euler s Constant. Princeton University Press, 2003. R. Herbei and M. H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 34(4):709 721, 2006. Franc, Prusa and Voracek H. Jiang, B. Kim, M. Y. Guan, and M. Gupta. To trust or not to trust a classifier. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, page 5546 5557, 2018. V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1867 1874, 2014. D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755 1758, 2009. J. Kummert, B. Paassen, J. Jensen, C. G opfert, and B. Hammer. Local reject option for deterministic multi-class SVM. In Artificial Neural Networks and Machine Learning ICANN, Lecture Notes in Computer Science, volume 9887. Springer, 2016. B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, pages 6402 6413, 2017. Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jakel. Handwritten digit recognition with a back-propagation networks. In Advances in Neural Information Processing Systems, volume 2, pages 396 404, 1990. J. Lei. Classification with confidence. Biometrika, 101:755 769, 2014. T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In Proceedings of the 22nd International Conference on Machine Learning, page 665 672, 2005. C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-thewild challenge: database and results. Image and Vision Computing, 47:3 18, 2016. C. M. Santos-Pereira and A. M. Pires. On optimal reject rules and roc curves. Pattern Recognition Letters, 26(7):943 952, 2005. M. I. Schlesinger and V. Hlav aˇc. Ten lectures on statistical and structural pattern recognition. Kluwer Academic Publishers, 2002. J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004. E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009. C. H. Teo, S. V. N. Vishwanthan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(10):311 365, 2010. F. Tortorella. An optimal reject rule for binary classifiers. In Advances in Pattern Recognition, Lecture Notes in Computer Science, volume 1876. Springer, 2000. V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. Optimal Strategies for Reject Option Classifiers T. Villman, M. Kaden, A. Bohnsack, J. M. Villman, T. Drogies, S. Saralajew, and B. Hammer. Self-adjusting reject options in prototype based classification. In Advances in Intelligent Systems and Computing, volume 428. Springer, 2016. M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(5):111 130, 2010. H. Zaragoza and F. d Alche Buc. Confidence measures for neural network classifiers. In 7th Conference on Information Processing and Management of Uncertainty in Knowledge Based Systems, 1998.