# On Discriminative Learning of Prediction Uncertainty

Vojtech Franc¹, Daniel Prusa¹

¹Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague. Correspondence to: Vojtech Franc.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

**Abstract.** In classification with a reject option, the classifier is allowed to abstain from prediction in uncertain cases. The classical cost-based model of an optimal classifier with a reject option requires the cost of rejection to be defined explicitly. An alternative bounded-improvement model, which avoids the notion of a reject cost, seeks a classifier with a guaranteed selective risk and maximal coverage. We prove that both models share the same class of optimal strategies, and we provide an explicit relation between the reject cost and the target risk, the respective parameters of the two models. An optimal rejection strategy for both models is based on thresholding the conditional risk defined by the posterior probabilities, which are usually unavailable. We propose a discriminative algorithm that learns an uncertainty function which preserves the ordering of the input space induced by the conditional risk, and hence can be used to construct optimal rejection strategies.

## 1. Introduction

In classification with a reject option the classifier is allowed to abstain from prediction in uncertain cases. This setting is essential, e.g., in medical diagnosis, safety-critical systems and many other applications. The cost-based model of an optimal classification strategy with a reject option was proposed by Chow in his pioneering work (Chow, 1970). The goal is to minimize the expected loss, equal to the cost of misclassification if the classifier predicts and to the reject cost if the classifier abstains from prediction. The well-known Bayes-optimal strategy rejects whenever the conditional risk exceeds the reject cost. Computation of the conditional risk requires the class posterior probabilities, which are usually unavailable. A common solution is the plug-in rule obtained by replacing the posterior probabilities with their estimates. The quality of the plug-in rule heavily depends on the estimated probabilities (Fumera et al., 2000; Herbei & Wegkamp, 2006). Discriminative methods instead learn the reject-option classification strategy directly, without probability estimation, e.g., (Bartlett & Wegkamp, 2008; Ramaswamy & Agarwal, 2016). Many works assume that the uncertainty measure is known, e.g., obtained from responses of a trained neural network (LeCun et al., 1990), and that the task is only to find the rejection thresholds (Tortorella, 2000; Kummert et al., 2016; Fischer et al., 2016). Most existing methods, including this paper, account for aleatoric uncertainty only, while literature dealing with epistemic uncertainty is scarce (Wang et al., 2017).

The cost-based model requires the reject cost to be defined explicitly, which is difficult in some applications, e.g., medical diagnosis. An alternative bounded-improvement model proposed by (Pietraszek, 2005) avoids an explicit definition of the reject cost. The rejection strategy is evaluated by two antagonistic quantities: i) the selective risk, defined as the expected misclassification cost on the non-reject region, and ii) the coverage, corresponding to the probability mass of the non-reject region.
An optimal strategy for the bounded-improvement model maximizes the coverage under the condition that the selective risk does not exceed a specified target value. In contrast to the cost-based model, it has not been formally shown what the optimal class of strategies is when the underlying model is known. The solution has been proposed only for special instances of the task. (Pietraszek, 2005) proposed a method based on ROC analysis, applicable when a score proportional to the posterior probabilities is known and the task is only to find the optimal thresholds. (El-Yaniv & Wiener, 2010) proposed an algorithm learning the optimal strategy in the noise-free setting, i.e., when a perfect strategy with zero selective risk exists. (Geifman & El-Yaniv, 2017) shows how to equip a trained classifier with a reject option provided an uncertainty measure is known and the task is to find a rejection threshold optimal under the bounded-improvement model.

This paper describes three contributions:

1) We provide necessary and sufficient conditions for an optimal strategy of the bounded-improvement model when the underlying distribution is known. We show that an optimal rejection strategy can be constructed by thresholding the class conditional risk, which corresponds to the optimal strategy of the cost-based model. We show that, in contrast to the cost-based model, the rejection strategy cannot be arbitrary at the boundary cases. We provide an explicit relation between the parameters of the two rejection models.

2) We propose a novel loss which evaluates the quality of an uncertainty function associated with a given classifier. We prove that any minimizer of the expectation of the loss is a function which preserves the ordering of the input space induced by the true conditional risk. Hence, any minimizer of the loss can be used to construct a strategy optimal for the cost-based and/or the bounded-improvement model.

3) We propose a discriminative algorithm which, for a given classifier, learns an uncertainty function by minimizing a convex surrogate of the proposed loss. We show experimentally that the learned rejection strategies outperform plug-in rules constructed from a logistic-regression model and rules based on the distance to the SVM decision boundary.

To our knowledge, we are the first to propose a discriminative method learning an uncertainty function with a clear connection to the conditional risk, the key quantity for constructing optimal reject-option strategies. Our method is potentially most useful when modeling the posterior distribution is difficult.

Paper organization: Section 2 characterizes the optimal strategies of the bounded-improvement rejection model and their relation to the cost-based model. Section 3 proposes an algorithm learning an uncertainty function optimal for both models. Related works are discussed in Section 4, experiments are presented in Section 5, and a summary is given in Section 6. Proofs are deferred to the supplementary material.

## 2. Equivalence of Cost-based and Bounded-improvement Rejection Models

Let $\mathcal{X}$ be a set of input observations and $\mathcal{Y}$ a finite set of labels. Let us assume that inputs and labels are generated by a random process with p.d.f. $p(x, y)$ defined over $\mathcal{X} \times \mathcal{Y}$. The goal in the non-reject setting is to find a classifier $h \colon \mathcal{X} \to \mathcal{Y}$ with a small expected risk

$$R(h) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y)\, \ell(y, h(x))\, dx\,,$$

where $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss penalizing the predictions. The expected risk can be reduced by abstaining from prediction in uncertain cases.
To this end, we use a selective classifier¹ $(h, c)$ composed of a classifier $h \colon \mathcal{X} \to \mathcal{Y}$ and a selection function $c \colon \mathcal{X} \to [0, 1]$. When applied to an input $x \in \mathcal{X}$ it outputs

$$(h, c)(x) = \begin{cases} h(x) & \text{with probability } c(x)\,, \\ \text{reject} & \text{with probability } 1 - c(x)\,. \end{cases}$$

¹The classifier with a reject option is more frequently denoted by a single function $h \colon \mathcal{X} \to \mathcal{Y} \cup \{\text{reject}\}$. Instead, we use the decomposition $(h, c)(x)$ and the terminology selective classifier from (El-Yaniv & Wiener, 2010), as it simplifies notation.

In the sequel we introduce two models of an optimal selective classifier, the cost-based and the bounded-improvement model, and we characterize their optimal strategies provided the underlying model $p(x, y)$ is known. While the solution of the former model is well known, the latter is novel.

**Cost-based model** seeks a selective classifier $(h_B, c_B)$ which for a given reject cost $\varepsilon > 0$ minimizes

$$R_B(h, c) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \big[ \ell(y, h(x))\, c(x) + (1 - c(x))\, \varepsilon \big]\, dx\,.$$

The well-known optimal strategy (i.e., the Bayes classifier) reads

$$h_B(x) \in \mathop{\mathrm{argmin}}_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(y \mid x)\, \ell(y, \hat{y})\,, \tag{1}$$

$$c_B(x) = \begin{cases} 1 & \text{if } r(x) < \varepsilon\,, \\ \tau & \text{if } r(x) = \varepsilon\,, \\ 0 & \text{if } r(x) > \varepsilon\,, \end{cases} \tag{2}$$

where $r(x) = \min_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(y \mid x)\, \ell(y, \hat{y})$ is the minimal class conditional risk associated with the input $x$, and $\tau$ is any number from the interval $[0, 1]$. Hence there is an infinite number of optimal selection functions, parameterized by $\tau$, which all yield the same risk $R_B(h, c)$.

**Bounded-improvement model** characterizes the selective classifier by two antagonistic quantities: i) the coverage

$$\phi(c) = \int_{\mathcal{X}} p(x)\, c(x)\, dx\,,$$

corresponding to the probability mass of the non-reject region, and ii) the selective risk

$$R_S(h, c) = \frac{1}{\phi(c)} \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y)\, \ell(y, h(x))\, c(x)\, dx\,,$$

defined for non-zero $\phi(c)$, which is the expected classification loss on the non-reject region $\mathcal{X}_{c(x) > 0}$.² Given a target risk $\lambda > 0$, the bounded-improvement model defines the optimal selective classifier $(h_S, c_S)$ as a solution to the problem

$$\max_{h, c}\ \phi(c) \quad \text{s.t.} \quad R_S(h, c) \le \lambda\,, \tag{3}$$

where we assume that both maximizers exist.

²For a function $f \colon \mathcal{X} \to \mathbb{R}$ and $a \in \mathbb{R} \cup \{\infty\}$, we define $\mathcal{X}_{f(x) < a} = \{x \in \mathcal{X} \mid f(x) < a\}$, $\mathcal{X}_{f(x) > a} = \{x \in \mathcal{X} \mid f(x) > a\}$, $\mathcal{X}_{f(x) \le a} = \{x \in \mathcal{X} \mid f(x) \le a\}$, and analogously for the other relations.

**Theorem 1** Let $(h, c)$ be an optimal solution to (3). Then $(h_B, c)$, where $h_B$ is the optimal Bayes classifier (1), is also optimal for (3).

According to Theorem 1, the Bayes classifier $h_B$ is also optimal for the risk-coverage task (3), which is not surprising. Note that the Bayes classifier is not a unique solution to (3) because the predictions on the reject region $\mathcal{X}_{c(x) = 0}$ do not count towards the selective risk and hence can be arbitrary. Theorem 1 allows us to solve the bounded-improvement task (3) in two consecutive steps. First, set $h$ to the Bayes classifier $h_B$ or to its best approximation, e.g., learned by a discriminative method of choice. Second, with $h$ fixed, the optimal selection function $c$ is obtained by solving the task (3) only w.r.t. $c$, which boils down to

$$\max_{c \colon \mathcal{X} \to [0, 1]} \int_{\mathcal{X}} p(x)\, c(x)\, dx \quad \text{s.t.} \quad \int_{\mathcal{X}} p(x)\, c(x)\, \tilde{r}(x)\, dx \le 0\,, \tag{4}$$

where $\tilde{r}(x) = r(x) - \lambda$ measures how much the conditional risk of the prediction $h(x)$, defined as

$$r(x) = \sum_{y \in \mathcal{Y}} p(y \mid x)\, \ell(y, h(x))\,, \tag{5}$$

exceeds the target risk $\lambda$. It is seen that for $h = h_B$, solving (4) yields a $c$ which is a solution to the task (3), i.e., $c = c_S$. Indeed, using the definition of $R_S(h, c)$ we can rewrite the constraint of the task (3) as

$$\int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y \mid x)\, \ell(y, h(x))\, c(x)\, dx \le \lambda \int_{\mathcal{X}} p(x)\, c(x)\, dx\,,$$

which after a simple rearrangement becomes the constraint of the task (4), and the objectives of both tasks are the same.
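To make the optimal strategy (1)–(2) concrete, here is a minimal Python sketch of the cost-based selective classifier, assuming the posteriors $p(y \mid x)$ for one input are given as a vector (in practice they are unknown, which is the point of this paper). The function name and the toy numbers are ours, not from the paper.

```python
import numpy as np

def bayes_selective_classifier(posterior, loss, reject_cost):
    """Cost-based rule (1)-(2) for a single input x.

    posterior   : (K,) array, posterior[y] = p(y | x)
    loss        : (K, K) array, loss[y, y_hat] = ell(y, y_hat)
    reject_cost : the reject cost eps > 0
    Returns (prediction, accept_probability).
    """
    # conditional risk of every candidate prediction: sum_y p(y|x) ell(y, y_hat)
    cond_risk = posterior @ loss
    y_hat = int(np.argmin(cond_risk))     # Bayes prediction h_B(x), eq. (1)
    r = cond_risk[y_hat]                  # minimal conditional risk r(x)
    if r < reject_cost:
        return y_hat, 1.0                 # always predict
    if r > reject_cost:
        return y_hat, 0.0                 # always reject
    return y_hat, 0.5                     # boundary: any tau in [0, 1] is optimal

# toy example: 3 classes, 0/1-loss, reject cost 0.3
loss01 = 1.0 - np.eye(3)
print(bayes_selective_classifier(np.array([0.7, 0.2, 0.1]), loss01, 0.3))
```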
**Theorem 2** A selection function $c^* \colon \mathcal{X} \to [0, 1]$ is an optimal solution to (4) if and only if it holds that

$$\int_{\mathcal{X}_{\tilde{r}(x) < b}} p(x)\, \big(1 - c^*(x)\big)\, dx = 0\,, \tag{6}$$

$$\int_{\mathcal{X}_{\tilde{r}(x) = b}} p(x)\, c^*(x)\, dx = \begin{cases} -\dfrac{\rho(\mathcal{X}_{\tilde{r}(x) < b})}{b} & \text{if } b > 0\,, \\[4pt] \displaystyle\int_{\mathcal{X}_{\tilde{r}(x) = 0}} p(x)\, dx & \text{if } b = 0\,, \end{cases} \tag{7}$$

$$\int_{\mathcal{X}_{\tilde{r}(x) > b}} p(x)\, c^*(x)\, dx = 0\,, \tag{8}$$

where $\rho(\mathcal{X}') = \int_{\mathcal{X}'} p(x)\, \tilde{r}(x)\, dx$ is the expectation of $\tilde{r}(x)$ restricted to inputs in $\mathcal{X}'$, and

$$b = \sup\, \big\{ a \mid \rho(\mathcal{X}_{\tilde{r}(x) \le a}) \le 0 \big\} \ge 0\,. \tag{9}$$

Theorem 2 defines the behaviour of an optimal selection function on the partition of the input space $\mathcal{X}$ into the three regions $\mathcal{X}_{\tilde{r}(x) < b}$, $\mathcal{X}_{\tilde{r}(x) = b}$ and $\mathcal{X}_{\tilde{r}(x) > b}$. The condition (6) says that the conditional expectation $\mathbb{E}_{x \sim p(x)}[c^*(x) - 1 \mid \tilde{r}(x) < b]$ is zero, which is satisfied if for $\tilde{r}(x) < b$ we use $c^*(x) = 1$. Equivalently, by using the identity $\tilde{r}(x) = r(x) - \lambda$, we set $c^*(x) = 1$ whenever $r(x) < \gamma$ with the threshold $\gamma = b + \lambda$. Analogously, setting $c^*(x) = 0$ for $r(x) > \gamma$ satisfies the condition (8). Finally, if we opt for a constant selection function $c^*(x) = \tau$ in the boundary region $\mathcal{X}_{\tilde{r}(x) = b}$, then the condition (7) implies $\tau = -\rho(\mathcal{X}_{r(x) < \gamma}) / (b\, \rho_0)$ if $b > 0$, where $\rho_0 = \int_{\mathcal{X}_{\tilde{r}(x) = b}} p(x)\, dx$, and $\tau = 1$ if $b = 0$. Using $\mathcal{X}_{\tilde{r}(x) < b} = \mathcal{X}_{r(x) < \gamma}$ and $\mathcal{X}_{\tilde{r}(x) = b} = \mathcal{X}_{r(x) = \gamma}$, the acceptance probability becomes

$$\tau = \begin{cases} 1 & \text{if } b = 0\,, \\[2pt] -\dfrac{\rho(\mathcal{X}_{r(x) < \gamma})}{b\, \rho_0} & \text{if } b > 0\,. \end{cases} \tag{10}$$

**Corollary 1** Let $r \colon \mathcal{X} \to \mathbb{R}$ be the conditional risk (5), $\tau$ the acceptance probability given by (10) and $\gamma = b + \lambda$ the rejection threshold given by the target risk $\lambda$ and $b$ computed by (9). Then the selection function

$$c(x) = \begin{cases} 1 & \text{if } r(x) < \gamma\,, \\ \tau & \text{if } r(x) = \gamma\,, \\ 0 & \text{if } r(x) > \gamma\,, \end{cases} \tag{11}$$

satisfies the optimality conditions of Theorem 2.

It is seen that (11) coincides with the Bayes-optimal selection function (2) of the cost-based model. In the cost-based model, the threshold $\gamma$ corresponds to the reject cost $\varepsilon$, while the acceptance probability $\tau$ can be arbitrary without affecting the risk $R_B(h, c)$. In contrast, the optimal threshold $\gamma = b + \lambda$ of the bounded-improvement model is defined implicitly by the choice of the target risk $\lambda$. In addition, the acceptance probability $\tau$ for the boundary cases cannot be arbitrary in general. In practice, however, the impact of $\tau$ is negligible because $\rho_0$ is usually small or even zero, as can be expected for continuous $p(x)$, when the boundary region $\mathcal{X}_{r(x) = \gamma}$ has probability measure zero. On the other hand, setting the threshold $\gamma$ is crucial. Fortunately, (Geifman & El-Yaniv, 2017) shows how to find a $\gamma$ from a finite training set such that it is an optimal solution of the bounded-improvement model in the PAC sense. The remaining issue is thus the conditional risk $r(x)$, because its computation requires the posterior probabilities $p(y \mid x)$. A common solution is the plug-in rule obtained by replacing $p(y \mid x)$ in (5) with $\hat{p}(y \mid x)$ estimated from examples. In the next section, we propose an alternative approach which completely avoids the probability estimation.
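As an illustration of Corollary 1 and of (9), the sketch below estimates $b$ and $\gamma = b + \lambda$ from a finite sample by replacing $\rho$ with its empirical counterpart. It assumes the true conditional risks $r(x_i)$ are available for a held-out sample, which is exactly what is unknown in practice and what motivates the next section; the names and numbers are ours, and this is not the PAC procedure of (Geifman & El-Yaniv, 2017).

```python
import numpy as np

def empirical_threshold(cond_risks, target_risk):
    """Rough empirical estimate of gamma = b + lambda from Corollary 1.

    cond_risks  : (n,) array of conditional risks r(x_i) on held-out data
    target_risk : lambda, the bound on the selective risk
    """
    r_tilde = np.sort(cond_risks - target_risk)   # r~(x) = r(x) - lambda, ascending
    csum = np.cumsum(r_tilde)                     # n * empirical rho(X_{r~(x) <= a})
    feasible = np.nonzero(csum <= 0)[0]           # prefixes with non-positive rho
    b = r_tilde[feasible[-1]] if feasible.size else 0.0
    return max(b, 0.0) + target_risk              # gamma = b + lambda, b >= 0 by (9)

# toy usage with lambda = 0.05
rng = np.random.default_rng(0)
print(empirical_threshold(rng.uniform(0.0, 0.4, size=1000), 0.05))
```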
## 3. Discriminative Learning of Uncertainty

Assume a classifier $h \colon \mathcal{X} \to \mathcal{Y}$ has been fixed and we want to endow it with a selection function $c \colon \mathcal{X} \to [0, 1]$. We have shown that, regardless of the rejection model, the optimal selection function is based on thresholding the conditional risk $r(x)$ defined by (5). Hence, $r(x)$ can be replaced by any uncertainty function $s \colon \mathcal{X} \to \mathbb{R}$ which preserves the ordering of inputs induced by the conditional risk $r(x)$.³ In this section, we show how to learn such uncertainty functions from examples.

³A function $f \colon \mathcal{X} \to \mathbb{R}$ preserves the ordering induced by $g \colon \mathcal{X} \to \mathbb{R}$ iff $\forall (x, x') \in \mathcal{X} \times \mathcal{X} \colon g(x) < g(x') \Rightarrow f(x) < f(x')$.

### 3.1. Order Enforcing Loss Function

Let $\mathcal{T}_n = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y} \mid i = 1, \ldots, n\}$ be a set of inputs and labels generated from $n$ i.i.d. random variables with distribution $p(x, y)$. We define a loss function $L \colon \mathbb{R}^{\mathcal{X}} \times \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}_+$ as

$$L(s, \mathcal{T}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \ell(y_i, h(x_i))\, [[s(x_i) \le s(x_j)]]\,, \tag{12}$$

which has the following interpretation. Let us order the examples in $\mathcal{T}_n$ according to $s(x)$ so that $s(x_{\pi(1)}) \le s(x_{\pi(2)}) \le \cdots \le s(x_{\pi(n)})$, where $\pi \colon \{1, \ldots, n\} \to \{1, \ldots, n\}$ is the permutation defining the order. Let

$$\hat{R}(i, s) = \frac{1}{n} \sum_{j=1}^{i} \ell(y_{\pi(j)}, h(x_{\pi(j)}))\, [[s(x_{\pi(j)}) \le s(x_{\pi(i)})]]$$

be the empirical risk of the classifier $h(x)$ computed on the examples with uncertainty not higher than the uncertainty of the $i$-th example. Then, the loss $L(s, \mathcal{T}_n)$ can be seen as the area under the curve $C = \{(\hat{R}(i, s), \frac{i}{n}) \mid i = 1, \ldots, n\}$. The curve $C$ summarizes the performance of the selective classifier on $\mathcal{T}_n$ when it rejects to predict based on thresholding the uncertainty $s(x)$. It is intuitively clear that the area is minimized by an $s(x)$ which orders the examples as if they were ordered according to the loss $\ell(y_i, h(x_i))$. Hence, by minimizing the expectation of $L(s, \mathcal{T}_n)$ we should obtain a function preserving the ordering induced by the conditional risk.
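A direct evaluation of the empirical loss (12) is straightforward; the naive $O(n^2)$ pairwise sketch below (names are ours) also checks numerically that a score which ranks the misclassified examples last yields a smaller loss than the reversed ranking.

```python
import numpy as np

def order_enforcing_loss(pred_losses, scores):
    """Empirical loss (12): (1/n^2) sum_{i,j} ell(y_i, h(x_i)) [[s(x_i) <= s(x_j)]].

    pred_losses : (n,) array, pred_losses[i] = ell(y_i, h(x_i))
    scores      : (n,) array, scores[i] = s(x_i)
    """
    indicator = (scores[:, None] <= scores[None, :]).astype(float)
    n = len(scores)
    return float(pred_losses @ indicator.sum(axis=1)) / n**2

# a fixed classifier h misclassifies the last two examples (0/1-losses below)
losses = np.array([0.0, 0.0, 1.0, 1.0])
print(order_enforcing_loss(losses, np.array([0.1, 0.2, 0.8, 0.9])))  # risk-consistent order: 0.1875
print(order_enforcing_loss(losses, np.array([0.9, 0.8, 0.2, 0.1])))  # reversed order: 0.4375
```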
In the sequel we support this intuition by theorems. Let us define the expectation of the loss $L(s, \mathcal{T}_n)$ over i.i.d. samples $\mathcal{T}_n \sim p(x, y)$. Up to an additive constant independent of $s$ (the diagonal terms $i = j$), it equals

$$E(s) = \int_{\mathcal{X}} \int_{\mathcal{X}} p(x)\, p(z)\, r(x)\, [[s(x) \le s(z)]]\, dz\, dx = \int_{\mathcal{X}} p(x)\, r(x) \left( \int_{\mathcal{X}} p(z)\, [[s(x) \le s(z)]]\, dz \right) dx\,,$$

and its minimizers are described by the following theorem.⁴

**Theorem 3** A function $s^* \colon \mathcal{X} \to \mathbb{R}$ is an optimal solution to $\min_{s \colon \mathcal{X} \to \mathbb{R}} E(s)$ if and only if

$$\int_{\mathcal{X}} \int_{z \ne x \,\wedge\, s^*(z) = s^*(x)} \max\{r(x), r(z)\}\, p(x)\, p(z)\, dz\, dx = 0\,, \tag{13}$$

and

$$\int_{\mathcal{X}} \int_{r(z) < r(x) \,\wedge\, s^*(z) > s^*(x)} \big(r(x) - r(z)\big)\, p(x)\, p(z)\, dz\, dx = 0\,. \tag{14}$$

⁴The expression $\int_{\mathcal{X}} \int_{z \ne x \,\wedge\, s^*(z) = s^*(x)} f(x, z)\, dz\, dx$ stands for $\int_{\mathcal{X}} \int_{\mathcal{X}'} f(x, z)\, dz\, dx$ where $\mathcal{X}' = \{z \in \mathcal{X} \mid z \ne x \wedge s^*(z) = s^*(x)\}$, and similarly for the other constrained inner integrals.

The conditions (13) and (14) imply that the conditional expectations $\mathbb{E}_{x, z \sim p(x)}[\max\{r(x), r(z)\} \mid z \ne x \wedge s^*(x) = s^*(z)]$ and $\mathbb{E}_{x, z \sim p(x)}[r(x) - r(z) \mid r(z) < r(x) \wedge s^*(z) > s^*(x)]$ are both zero. Combined, this further implies that the subset of the input space $\bar{\mathcal{X}} = \{(x, z) \in \mathcal{X} \times \mathcal{X} \mid r(z) < r(x) \wedge s^*(z) > s^*(x)\}$, on which the ordering is violated, has probability measure 0. In other words, the optimal $s^*(x)$ preserves the ordering induced by $r(x)$ almost surely.

**Corollary 2** Any function $s \colon \mathcal{X} \to \mathbb{R}$ fulfilling

$$\forall (x, x') \in \mathcal{X} \times \mathcal{X} \colon x \ne x' \Rightarrow s(x) \ne s(x')\,, \quad \text{and} \tag{15}$$

$$\forall (x, x') \in \mathcal{X} \times \mathcal{X} \colon r(x) < r(x') \Rightarrow s(x) < s(x') \tag{16}$$

satisfies the optimality conditions of Theorem 3.

Corollary 2 suggests how to construct an optimal $s^*(x)$ if we know the conditional risk $r(x)$. More importantly, the following corollary states that the optimal $s^*(x)$ can be used instead of $r(x)$ to construct the optimal selection function.

**Corollary 3** Let $s^* \in \mathop{\mathrm{argmin}}_{s \colon \mathcal{X} \to \mathbb{R}} E(s)$. The selection function $c \colon \mathcal{X} \to [0, 1]$ defined by

$$c(x) = \begin{cases} 1 & \text{if } s^*(x) < \gamma\,, \\ \tau & \text{if } s^*(x) = \gamma\,, \\ 0 & \text{if } s^*(x) > \gamma\,, \end{cases} \tag{17}$$

where

$$\gamma = \sup\, \big\{ a \mid \rho(\mathcal{X}_{s^*(x) \le a}) \le 0 \big\}\,, \tag{18}$$

$$\tau = \begin{cases} 1 & \text{if } \rho(\mathcal{X}_{s^*(x) = \gamma}) = 0\,, \\[2pt] -\dfrac{\rho(\mathcal{X}_{s^*(x) < \gamma})}{\rho(\mathcal{X}_{s^*(x) = \gamma})} & \text{if } \rho(\mathcal{X}_{s^*(x) = \gamma}) > 0\,, \end{cases} \tag{19}$$

fulfills the conditions (6), (7) and (8) of Theorem 2, and therefore it is an optimal solution to (4).

### 3.2. Algorithm

The expectation $E(s)$ cannot be minimized directly because $p(x, y)$ is unknown, and the loss $L(s, \mathcal{T}_n)$ is hard to optimize due to the step function $[[0 \le t]]$ in its definition. We describe two standard techniques to minimize $E(s)$ approximately given a training set $\mathcal{T}_m = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y} \mid i = 1, \ldots, m\}$ drawn i.i.d. from $p(x, y)$. We assume that the uncertainty function $s_\theta \colon \mathcal{X} \to \mathbb{R}$ is known up to parameters $\theta \in \mathbb{R}^n$ which need to be learned from $\mathcal{T}_m$.

**Stochastic Gradient Approximation** Starting from an initial $\theta_0 \in \mathbb{R}^n$, new values are computed iteratively by $\theta_{t+1} = \theta_t - k_t\, g_t$, where $k_t$ is an appropriate learning rate and $g_t$ is a sub-gradient, evaluated at $\theta_t$, of the proxy-loss

$$\hat{L}(\theta, \mathcal{T}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \ell(y_i, h(x_i))\, \omega\big(s_\theta(x_j) - s_\theta(x_i)\big)\,, \tag{20}$$

where $\omega(t) = \max\{0, 1 + t\}$ is a convex upper bound of the step function $[[0 \le t]]$ used in the definition of the original loss $L(s_\theta, \mathcal{T}_n)$. The set $\mathcal{T}_n$ denotes a mini-batch selected randomly from $\mathcal{T}_m$. The true objective $E(s_\theta)$ is upper bounded by the expectation (w.r.t. the randomly generated $\mathcal{T}_n$) of $\hat{L}(\theta, \mathcal{T}_n)$, which we use as our proxy, and which can be optimized by SGD (Kushner & Yin, 2003) or similar algorithms.

**Regularized Empirical Risk Approximation** We approximate $E(s)$ by

$$F(\theta, \mathcal{T}_m) = \frac{C}{2} \|\theta\|^2 + \frac{1}{P} \sum_{k=1}^{P} \hat{L}(\theta, \mathcal{T}^k)\,, \tag{21}$$

where $\mathcal{T}^1 \cup \mathcal{T}^2 \cup \cdots \cup \mathcal{T}^P = \mathcal{T}_m$ is a randomly generated partition of the training set $\mathcal{T}_m$ into $P$ approximately equally sized batches, and $C > 0$ is a chosen regularization constant. The expectation (w.r.t. the randomly generated $\mathcal{T}_m$) of $F(\theta, \mathcal{T}_m)$ upper bounds $E(s)$. The objective $F(\theta, \mathcal{T}_m)$ is convex in $\theta$ for uncertainty functions $s_\theta(x) = \langle \theta, \psi(x) \rangle$ linear in $\theta$, and then any convex solver can be used.
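A sketch of the stochastic gradient variant, assuming an uncertainty function linear in $\theta$, $s_\theta(x) = \langle \theta, x \rangle$: it evaluates the convex proxy (20) on a mini-batch and performs one subgradient step $\theta_{t+1} = \theta_t - k_t\, g_t$. This is an illustration under our own choices of batch size and step size, not the authors' implementation.

```python
import numpy as np

def proxy_loss_and_grad(theta, X, pred_losses):
    """Proxy-loss (20) with s_theta(x) = <theta, x> and omega(t) = max(0, 1 + t),
    together with a subgradient w.r.t. theta."""
    s = X @ theta
    diff = s[None, :] - s[:, None]              # diff[i, j] = s(x_j) - s(x_i)
    active = (1.0 + diff) > 0                   # where the hinge omega is nonzero
    n = len(s)
    loss = (pred_losses[:, None] * np.maximum(0.0, 1.0 + diff)).sum() / n**2
    # subgradient of an active term ell_i * (1 + s_j - s_i) is ell_i * (x_j - x_i)
    w = pred_losses[:, None] * active           # (n, n) pair weights
    grad = (w.sum(axis=0) @ X - w.sum(axis=1) @ X) / n**2
    return loss, grad

# one subgradient step on a random mini-batch (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))                     # mini-batch of 32 inputs
ell = rng.integers(0, 2, size=32).astype(float)  # 0/1-losses of a fixed classifier h
theta = np.zeros(5)
loss, grad = proxy_loss_and_grad(theta, X, ell)
theta -= 0.1 * grad                              # k_t = 0.1
```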
## 4. Related Works

The cost-based rejection model was proposed in (Chow, 1970), who also provides the optimal strategy when the distribution is known, analyzes the error-reject trade-off, and proves basic properties of the error rate and the reject rate, e.g., that both functions are monotone w.r.t. the reject cost. The original paper considers the risk with the 0/1-loss only. The model with arbitrary prediction costs was analyzed, e.g., in (Tortorella, 2000; Schlesinger & Hlaváč, 2002).

The bounded-improvement model was proposed in (Pietraszek, 2005). He assumed that a classifier score proportional to the posterior probabilities is known and that the task is only to find the optimal thresholds, which is done numerically based on ROC analysis. The original formulation assumes two classes and the 0/1-loss. In our paper we consider a straightforward generalization of the bounded-improvement model, defined by the task (3), which allows an arbitrary number of classes and an arbitrary loss, and puts no constraint on the class of optimal strategies. We show how to construct the optimal strategy provided the underlying distribution $p(x, y)$ is known.

Learning a selective classifier optimal for the bounded-improvement model was discussed in (El-Yaniv & Wiener, 2010). Their method requires a noise-free scenario, i.e., there must exist a selective classifier with $R_S(h, c) = 0$. They also provide a characterization of the lower and upper bounds of the risk-coverage curves in the PAC setting. (Geifman & El-Yaniv, 2017) proposed a method to find a threshold optimal in the PAC sense provided the uncertainty function $s(x)$ is known, which in their work is constructed from the output of a trained neural network. We complement their work by a method learning the uncertainty function $s(x)$.

A common approach to constructing a rejection strategy is based on the plug-in conditional risk $\hat{r}(x)$ obtained by replacing $p(y \mid x)$ in (5) with its estimate $\hat{p}(y \mid x)$. (Fumera et al., 2000) shows that the plug-in Bayes-optimal strategy is not optimal if the estimates $\hat{p}(y \mid x)$ are affected by errors, in which case they propose to use class-related thresholds instead of a single global threshold. Other methods trying to improve the plug-in strategy by tuning multiple thresholds were proposed in (Kummert et al., 2016; Fischer et al., 2016). The statistical consistency of the plug-in reject rules is discussed in (Herbei & Wegkamp, 2006). We use the plug-in rule on top of a logistic-regression model (McCullagh & Nelder, 1989) as a baseline in our experiments.

Most prediction models come with at least an ordinal measure of uncertainty which can be used to construct reject strategies. E.g., (LeCun et al., 1990) proposed a reject strategy for a neural network classifier based on thresholding either i) the output of the maximally activated unit of the last layer or ii) the difference between the maximal and the runner-up output units. In the case of Support Vector Machine classifiers (Vapnik, 1998), the trained linear score, proportional to the distance between the input and the decision hyperplane, is often used as the uncertainty measure (Fumera & Roli, 2002). We consider this approach as a baseline in our experiments. Other solutions involve fitting a probabilistic model to the SVM outputs (Platt, 2000; Wu et al., 2004).

## 5. Experiments

We experimented with two methods to learn the classifier $h(x)$ and two baselines to compute the uncertainty measure: i) logistic regression, estimating $\hat{p}(y \mid x)$ so that the plug-in Bayes classifier and the conditional risk can be computed, and ii) SVMs, representing discriminative methods without a probabilistic output, in which case the classification score with the maximal response was used as the uncertainty measure. Then, on top of the LR and SVM classifiers, we learned the uncertainty measure by minimizing the proposed loss. We implemented both variants, based on the stochastic gradient approximation and on the regularized empirical risk minimization. The methods were evaluated on 10 classification problems with different sizes, numbers of features and numbers of classes. The goal was to minimize the classification error, hence we used the 0/1-loss $\ell(y, y') = [[y \ne y']]$.

| dataset    | classes | features | examples |
|------------|---------|----------|----------|
| AVILA      | 12      | 10       | 20,867   |
| COVTYPE    | 7       | 54       | 581,012  |
| CODRNA     | 2       | 8        | 331,152  |
| IJCNN      | 2       | 22       | 49,990   |
| LETTER     | 26      | 16       | 20,000   |
| PENDIGIT   | 10      | 16       | 10,992   |
| PHISHING   | 2       | 68       | 11,055   |
| SATTELITE  | 6       | 36       | 6,435    |
| SENSORLESS | 11      | 48       | 58,509   |
| SHUTTLE    | 7       | 9        | 58,000   |

Table 1. Summary of benchmark classification problems.

**Logistic regression (LR)** learns the parameters $\theta = ((w_y, b_y) \in \mathbb{R}^d \times \mathbb{R} \mid y \in \mathcal{Y})$ of the posterior probabilities $\hat{p}_\theta(y \mid x) \propto \exp(\langle w_y, x \rangle + b_y)$ by maximizing the regularized log-likelihood

$$L(\theta) = -\frac{C}{2} \|\theta\|^2 + \frac{1}{m} \sum_{i=1}^{m} \log \hat{p}_\theta(y_i \mid x_i)\,.$$

The optimal $C$ was selected from $\{1, 10, 100, 1000\}$ based on the validation classification error. After learning $\theta$, we used the plug-in Bayes classifier $h_\theta(x) = \mathop{\mathrm{argmax}}_{y \in \mathcal{Y}} \hat{p}_\theta(y \mid x)$ and the plug-in class conditional risk $\hat{r}(x) = 1 - \hat{p}_\theta(h(x) \mid x)$ as the baseline uncertainty measure. Note that both plug-in rules are Bayes-optimal for the 0/1-loss.

**Support Vector Machines (SVM)** learn the parameters $\theta = ((w_y, b_y) \in \mathbb{R}^d \times \mathbb{R} \mid y \in \mathcal{Y})$ of the linear classifier $h_\theta(x) = \mathop{\mathrm{argmax}}_{y \in \mathcal{Y}} \langle w_y, x \rangle + b_y$ by minimizing

$$F(\theta) = \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^{m} \max_{y \in \mathcal{Y}} \big( [[y \ne y_i]] + \langle w_y - w_{y_i}, x_i \rangle \big)\,.$$

As the baseline uncertainty measure we use the maximal score $s(x) = \max_{y \in \mathcal{Y}} \langle w_y, x \rangle + b_y$, proportional to the distance between the input $x$ and the decision hyperplane. For the binary case $|\mathcal{Y}| = 2$, we used $\theta = (w, b)$, $h_\theta(x) = \mathrm{sgn}(\langle w, x \rangle + b)$,

$$F(\theta) = \frac{1}{2} \|w\|^2 + \frac{C}{m} \sum_{i=1}^{m} \max\{0,\, 1 - y_i (\langle w, x_i \rangle + b)\}$$

and $s(x) = |\langle w, x \rangle + b|$. The optimal $C$ was selected in the same way as for LR.
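For concreteness, here is a small sketch of the two baseline uncertainty measures described above; the linear parameters `W`, `b` are placeholders, and negating the SVM score so that larger values mean higher uncertainty is our convention for a common ordering.

```python
import numpy as np

def lr_plugin_uncertainty(W, b, x):
    """Plug-in conditional risk for the 0/1-loss: r_hat(x) = 1 - p_hat(h(x) | x),
    where p_hat(y | x) is the softmax of the linear scores <w_y, x> + b_y."""
    scores = W @ x + b
    p = np.exp(scores - scores.max())           # numerically stable softmax
    p /= p.sum()
    return 1.0 - p.max()

def svm_maxscore_uncertainty(W, b, x):
    """Baseline SVM measure: the negated maximal linear score, so that
    larger values again mean higher uncertainty."""
    return -float(np.max(W @ x + b))

x = np.array([0.5, -1.0, 2.0])
W, bias = np.eye(3), np.zeros(3)                # placeholder trained parameters
print(lr_plugin_uncertainty(W, bias, x), svm_maxscore_uncertainty(W, bias, x))
```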
**Proposed method for uncertainty learning** We parameterized the uncertainty function as

$$s_\theta(x) = \langle w_{h(x)}, \psi(x) \rangle + b_{h(x)}\,, \tag{22}$$

where $\psi \colon \mathbb{R}^d \to \mathbb{R}^q$ are features extracted from the vectorial inputs $x \in \mathbb{R}^d$, and $W = ((w_y, b_y) \in \mathbb{R}^q \times \mathbb{R} \mid y \in \mathcal{Y})$ are unknown parameters. We experimented with three different feature maps: i) linear features $\psi(x) = x$, ii) quadratic features $\psi(x) = (x_i x_j \mid (i, j) \in \{1, \ldots, d\}^2,\ j \ge i)$, and iii) features extracted by a multi-layer perceptron trained from examples. For the linear and quadratic features, $\psi$ is fixed and the unknown parameters $\theta = W$ were learned by minimizing the convex objective function (21) using the Bundle Method for Risk Minimization (Teo et al., 2010). The number of batches was $P = 5$, and the regularization constant $C$ was selected from $\{1, 10, 100, 1000\}$ based on the proposed loss (12) evaluated on the validation set. Finally, we used an MLP with 1, 5 or 10 layers (the optimal number was selected based on the validation set), each having the same number of neurons as the input dimension $d$. The ReLU was used as the transfer function. The unknown parameters $\theta = (\psi, W)$ comprise the linear filters of the fully-connected layers $\psi$ and the parameters $W$ of the rule (22), put as the last layer. The parameters $\theta$ were learned by ADAM (Kingma & Ba, 2015) optimizing the convex loss function (20). To speed up convergence, we used batch normalization (Ioffe & Szegedy, 2015) placed after each fully-connected layer.

### 5.1. Datasets

We selected 10 classification problems from the UCI repository (Dua & Taniskidou, 2017) and the LIBSVM datasets (Chang & Lin, 2011). The datasets are summarized in Table 1. We chose datasets with a sufficiently large number of examples, as we need to learn both the classifier and the uncertainty, and with a moderate input dimension $d$, because we explicitly compute $q = d(d + 1)/2$ quadratic features. Each dataset was randomly split 5 times into 5 subsets, Trn1/Val1/Trn2/Val2/Tst, in the ratio 30/10/30/10/20 (except for COVTYPE, with ratio 28/20/2/20/30). The subsets Trn1/Val1 and Trn2/Val2 were used for learning and model selection of the classifier and of the uncertainty function, respectively. All features were normalized to have zero mean and unit variance. The normalization coefficients were estimated using only the Trn1 and Trn2 subsets, respectively. The Tst subset was used solely to compute the test metrics.

### 5.2. Evaluation Protocol

Selective classifiers are evaluated by the risk-coverage curve $\{(\hat{R}_S(i, s), \frac{i}{n}) \mid i = 1, \ldots, n\}$, where $\hat{R}_S(i, s)$ is an estimate of $R_S(h, c)$ computed on the test examples by

$$\hat{R}_S(i, s) = \frac{1}{i} \sum_{j=1}^{i} \ell(y_{\pi(j)}, h(x_{\pi(j)}))\,,$$

and $\pi(j)$ is the permutation obtained by sorting the examples according to $s(x)$ in ascending order. We describe the performance by the selective risk at coverage 90%, defined as $\mathrm{R@90} = \hat{R}_S(i, s)$ with $i = \mathrm{round}(0.9\, n)$, and by the Area Under the risk-coverage Curve, $\mathrm{AUC}(s) = \frac{1}{n} \sum_{i=1}^{n} \hat{R}_S(i, s)$. For each metric we report averages and standard deviations computed over the 5 random splits. In figures we show the mean curves.
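The evaluation protocol is simple to reproduce; the sketch below (names are ours) computes the empirical risk-coverage curve, AUC and R@90 from per-example losses and uncertainty scores.

```python
import numpy as np

def risk_coverage_metrics(losses, scores):
    """Empirical risk-coverage curve, AUC and R@90 of Section 5.2.

    losses : (n,) array, losses[i] = ell(y_i, h(x_i)) on test examples
    scores : (n,) array of uncertainties s(x_i); lower values are accepted first
    """
    order = np.argsort(scores, kind="stable")            # sort by uncertainty, ascending
    n = len(losses)
    rs = np.cumsum(losses[order]) / np.arange(1, n + 1)  # R_S-hat(i, s), i = 1..n
    auc = rs.mean()                                      # area under the curve
    r_at_90 = rs[int(round(0.9 * n)) - 1]                # selective risk at coverage 90%
    return rs, auc, r_at_90

rng = np.random.default_rng(1)
_, auc, r90 = risk_coverage_metrics(rng.integers(0, 2, 100).astype(float), rng.random(100))
print(auc, r90)
```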
### 5.3. Results

Figure 1 and Table 2 summarize the results for the two baseline selective classifiers, LR+plugin and SVM+maxscore, and for the selective classifiers which use an uncertainty function learned by the proposed method, LR+learn(X) and SVM+learn(X), where X stands for the feature map used. The table reports results only for the best features while the figure shows all models. We can conclude that:

1) R@100 is always higher than R@90, which shows that all selective classifiers, when accepting only 90% of the inputs, reduce the classification error of the non-rejecting classifiers.

2) The selective classifiers with uncertainty learned by the proposed method outperform the baselines in terms of both AUC and R@90, consistently on all datasets. In all but a few cases the improvement is significant.

3) The LR and SVM classifiers have a statistically similar error R@100 on 7 datasets, while on 3 datasets (letter, shuttle, sensorless) SVM performs better. However, on the same 3 datasets the baseline selective classifier SVM+maxscore is worse than the baseline LR+plugin in terms of R@90, probably due to the non-probabilistic output used as uncertainty by the SVMs. At the same time, when the learned uncertainty is used, the SVM-based selective classifier becomes the better one.

4) In 6 versus 4 cases the uncertainty functions on top of the learned MLP features outperform the hand-crafted quadratic features. While the linear features were never optimal, in most cases they perform better than the non-learned baselines.

## 6. Conclusions

We proved that the cost-based and the bounded-improvement rejection models share the same optimal strategies. In both cases an optimal reject strategy can be constructed by thresholding the conditional risk. We provided formulas relating the parameters of the two models. We proposed a new loss function whose minimizer is an uncertainty function which preserves the ordering induced by the conditional risk, and hence can be used to construct an optimal reject strategy for both models. We experimentally showed that logistic-regression and SVM-based selective classifiers with an uncertainty function learned by minimizing a convex surrogate of the proposed loss outperform commonly used reject rules. Open questions involve, e.g., i) an analysis of the statistical consistency of the proposed convex proxy-loss (20), and ii) the possibility to learn both the uncertainty and the prediction rule simultaneously, instead of using the data-demanding two-stage approach.

| classifier | selection function | AUC ×100 | R@90 ×100 | R@100 ×100 |
|------------|--------------------|------------|------------|------------|
| LR  | plugin   | 27.2 ± 0.6 | 40.9 ± 0.5 | 43.7 ± 0.4 |
| LR  | learn(M) | 17.3 ± 0.4 | 38.0 ± 0.4 |            |
| SVM | maxscore | 31.7 ± 0.8 | 41.0 ± 0.7 | 43.3 ± 0.7 |
| SVM | learn(M) | 16.9 ± 0.7 | 37.4 ± 0.8 |            |
| LR  | plugin   | 0.9 ± 0.0  | 2.0 ± 0.0  | 4.8 ± 0.1  |
| LR  | learn(M) | 0.4 ± 0.0  | 1.3 ± 0.1  |            |
| SVM | maxscore | 0.9 ± 0.1  | 2.0 ± 0.1  | 4.8 ± 0.1  |
| SVM | learn(M) | 0.4 ± 0.0  | 1.3 ± 0.1  |            |
| LR  | plugin   | 16.5 ± 0.1 | 25.1 ± 0.2 | 27.6 ± 0.2 |
| LR  | learn(M) | 10.0 ± 0.1 | 21.6 ± 0.2 |            |
| SVM | maxscore | 25.7 ± 0.8 | 26.8 ± 0.1 | 27.4 ± 0.1 |
| SVM | learn(M) | 9.8 ± 0.1  | 21.5 ± 0.1 |            |
| LR  | plugin   | 1.3 ± 0.0  | 3.8 ± 0.1  | 7.5 ± 0.1  |
| LR  | learn(M) | 0.3 ± 0.0  | 0.4 ± 0.1  |            |
| SVM | maxscore | 1.4 ± 0.0  | 3.7 ± 0.2  | 7.6 ± 0.2  |
| SVM | learn(M) | 0.4 ± 0.0  | 0.4 ± 0.1  |            |
| LR  | plugin   | 7.4 ± 0.4  | 18.3 ± 0.6 | 23.3 ± 0.6 |
| LR  | learn(Q) | 4.1 ± 0.1  | 15.4 ± 0.6 |            |
| SVM | maxscore | 10.2 ± 0.2 | 19.1 ± 0.4 | 22.1 ± 0.7 |
| SVM | learn(Q) | 3.9 ± 0.3  | 14.0 ± 0.8 |            |
| LR  | plugin   | 0.7 ± 0.0  | 1.9 ± 0.1  | 5.3 ± 0.4  |
| LR  | learn(Q) | 0.7 ± 0.1  | 0.8 ± 0.2  |            |
| SVM | maxscore | 2.8 ± 0.4  | 3.9 ± 0.4  | 4.9 ± 0.6  |
| SVM | learn(Q) | 0.8 ± 0.1  | 0.7 ± 0.2  |            |
| LR  | plugin   | 0.8 ± 0.1  | 2.9 ± 0.3  | 6.3 ± 0.4  |
| LR  | learn(Q) | 0.8 ± 0.3  | 1.8 ± 0.4  |            |
| SVM | maxscore | 0.8 ± 0.1  | 3.0 ± 0.3  | 6.4 ± 0.4  |
| SVM | learn(Q) | 0.7 ± 0.2  | 1.7 ± 0.3  |            |
| LR  | plugin   | 0.6 ± 0.1  | 1.0 ± 0.1  | 3.4 ± 0.2  |
| LR  | learn(M) | 0.1 ± 0.1  | 0.0 ± 0.0  |            |
| SVM | maxscore | 1.3 ± 0.5  | 1.5 ± 0.3  | 2.0 ± 0.1  |
| SVM | learn(M) | 0.1 ± 0.0  | 0.0 ± 0.0  |            |
| LR  | plugin   | 2.0 ± 0.1  | 4.9 ± 0.3  | 8.2 ± 0.4  |
| LR  | learn(M) | 0.4 ± 0.0  | 0.6 ± 0.2  |            |
| SVM | maxscore | 3.7 ± 0.2  | 6.9 ± 0.3  | 6.9 ± 0.2  |
| SVM | learn(M) | 0.4 ± 0.0  | 0.5 ± 0.1  |            |
| LR  | plugin   | 3.8 ± 0.3  | 11.1 ± 0.3 | 15.1 ± 0.5 |
| LR  | learn(Q) | 3.2 ± 0.3  | 9.8 ± 0.9  |            |
| SVM | maxscore | 4.8 ± 0.6  | 11.4 ± 0.5 | 15.4 ± 0.4 |
| SVM | learn(Q) | 3.3 ± 0.2  | 10.2 ± 0.6 |            |
Table 2. Area Under the risk-coverage Curve (AUC) and the selective risk at coverage 90% and 100%, shown for selective classifiers built from the non-rejection LR or SVM models. The baseline selection function for the LR model uses the plug-in conditional risk, and that for SVM the max-score. Below each baseline we report results for the selective classifiers with the learned confidence measure. We report results only for the model with the best features ((L)inear, (Q)uadratic or (M)ulti-layer perceptron) selected based on the validation set.

[Figure 1: ten pairs of risk-coverage plots, one pair per dataset (AVILA, CODRNA, COVTYPE, IJCNN, LETTER, PENDIGIT, PHISHING, SATTELITE, SENSORLESS, SHUTTLE).]

Figure 1. The risk-coverage curves for the different selective classifiers evaluated on the 10 datasets. The x-axis is the coverage and the y-axis is the selective risk, i.e., the misclassification error on the non-reject region. The classifier is either LR (left sub-figure) or SVM (right sub-figure). The curves for the baseline selection functions, shown in black, correspond to the plug-in conditional risk for LR and to the max-score for SVM. Colored curves correspond to the selection functions learned by the proposed algorithm on top of linear features (red), quadratic features (green) and features extracted by a multi-layer perceptron (blue).

## Acknowledgments

We thank the reviewers for their useful comments. The research was supported by the Czech Science Foundation project GACR GA19-21198S and by the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765 Research Center for Informatics.

## References

Bartlett, P. and Wegkamp, M. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.

Dua, D. and Taniskidou, E. K. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

El-Yaniv, R. and Wiener, Y. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 2010.

Fischer, L., Hammer, B., and Wersing, H. Optimal local rejection for classifiers. Neurocomputing, 214:445–457, 2016.

Fumera, G. and Roli, F. Support vector machines with embedded reject option. In First International Workshop on Pattern Recognition with Support Vector Machines, 2002.

Fumera, G., Roli, F., and Giacinto, G. Multiple reject thresholds for improving classification reliability. In Joint IAPR International Workshops on Advances in Pattern Recognition, pp. 863–871, 2000.

Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems 30, pp. 4878–4887, 2017.

Herbei, R. and Wegkamp, M. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kummert, J., Paassen, B., Jensen, J., Göpfert, C., and Hammer, B. Local reject option for deterministic multi-class SVM. In Artificial Neural Networks and Machine Learning - ICANN, 2016.

Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. Handwritten digit recognition with a back-propagation network. In Neural Information Processing Systems, 1990.

McCullagh, P. and Nelder, J. Generalized Linear Models. Chapman and Hall, 1989.

Pietraszek, T. Optimizing abstaining classifiers using ROC analysis. In International Conference on Machine Learning, 2005.

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A. et al. (eds.), Advances in Large Margin Classifiers. Cambridge, MA, 2000.

Ramaswamy, H. and Agarwal, S. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research, 17:1–45, 2016.

Schlesinger, M. I. and Hlaváč, V. Ten Lectures on Statistical and Structural Pattern Recognition. Kluwer Academic Publishers, 2002.

Teo, C. H., Vishwanathan, S. V. N., Smola, A., and Le, Q. V. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010. ISSN 1532-4435.

Tortorella, F. An optimal reject rule for binary classifiers. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 2000.

Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

Wang, W., Wang, A., Tamar, A., Chen, X., and Abbeel, P. Safer classification by synthesis. CoRR, abs/1711.08534, 2017. URL http://arxiv.org/abs/1711.08534.

Wu, T.-F., Lin, C.-J., and Weng, R. C. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.