
Published in Transactions on Machine Learning Research (10/2024)

Selective Classification Under Distribution Shifts

Hengyue Liang liang656@umn.edu Department of Electrical and Computer Engineering University of Minnesota

Le Peng peng0347@umn.edu Department of Computer Science and Engineering University of Minnesota

Ju Sun jusun@umn.edu Department of Computer Science and Engineering University of Minnesota

Reviewed on OpenReview: https://openreview.net/forum?id=dmxMGW6J7N

In selective classification (SC), a classifier abstains from making predictions that are likely to be wrong to avoid excessive errors. To deploy imperfect classifiers in high-stakes scenarios, whether the imperfection stems from intrinsic statistical noise in the data, robustness issues of the classifier, or other causes, SC appears to be an attractive and necessary path to follow. Despite decades of research in SC, most previous SC methods still focus on the ideal statistical setting only, i.e., the data distribution at deployment is the same as that of training, although practical data can come from the wild. To bridge this gap, in this paper, we propose an SC framework that takes into account distribution shifts, termed generalized selective classification, that covers label-shifted (or out-of-distribution) and covariate-shifted samples, in addition to typical in-distribution samples, the first of its kind in the SC literature. We focus on non-training-based confidence-score functions for generalized SC on deep learning (DL) classifiers, and propose two novel margin-based score functions. Through extensive analysis and experiments, we show that our proposed score functions are more effective and reliable than the existing ones for generalized SC on a variety of classification tasks and DL classifiers. The code is available at https://github.com/sun-umn/sc_with_distshift.

1 Introduction

In practice, classifiers almost never have perfect accuracy. Although modern classifiers powered by deep neural networks (DNNs) typically achieve higher accuracy than the classical ones, they are known to be unrobust: perturbations of inputs that are inconsequential to human decision making can easily alter DNN classifiers' predictions (Carlini et al., 2019; Croce et al., 2020; Hendrycks & Dietterich, 2018; Liang et al., 2023), and more generally, shifts in data distribution in deployment from that in training often cause systematic classification errors. These classification errors, regardless of their source, are rarely acceptable for high-stakes applications, such as disease diagnosis in healthcare.

To achieve minimal and controllable levels of classification error so that imperfect and unrobust classifiers can be deployed for high-stakes applications, a promising approach is selective classification (SC): samples that are likely to be misclassified are selected, excluded from prediction, and deferred to human decision makers, so that the classification performance on the remaining samples reaches the desired level (Chow, 1970; Franc et al., 2023a; Geifman & El-Yaniv, 2017). For example, by flagging uncertain patient cases that it tends to get wrong and passing them on to human doctors, an intelligent medical agent can make confident and correct diagnoses for the rest. This conservative classification framework not only saves doctors' efforts, but also avoids liability due to the agent's mistakes.


Consider a multiclass classification problem with input space $\mathcal{X} \subseteq \mathbb{R}^n$, label space $\mathcal{Y} = \{1, \dots, K\}$, and training distribution $\mathcal{D}_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$. For any classifier $f: \mathcal{X} \to \mathcal{Y}$, there are many potential causes of classification errors. In this paper, we focus on three types of errors that are commonly encountered in practice and are studied extensively, but mostly separately, in the literature.

Type A errors: errors made on in-distribution (In-D) samples, i.e., samples drawn from $\mathcal{D}_{X,Y}$. These are the classification errors discussed in typical statistical learning frameworks (Mohri et al., 2018).

Type B errors: errors made on label-shifted samples, i.e., samples whose groundtruth labels are not from $\mathcal{Y}$. Since $f$ assigns labels from $\mathcal{Y}$ only, it always errs on these samples.

Type C errors: errors made on covariate-shifted samples, i.e., samples drawn from a different input distribution $\mathcal{D}'_X$, where $\mathcal{D}'_X \neq \mathcal{D}_X$, but with groundtruth labels from $\mathcal{Y}$.

It is clear that in practical deployment of classifiers, samples can come from the wild, and hence Type A, Type B and Type C errors can coexist. In order to ensure the reliable deployment of classifiers in high-stakes applications, we must control the three types of errors jointly. Unfortunately, previous research falls short of a unified treatment of these errors. Classical SC (Chow, 1970) focuses on rejecting samples that cause In-D errors (Type A), whereas the current out-of-distribution (OOD) detection research (Yang et al., 2021; Park et al., 2023) focuses on detecting label-shifted samples (Type B). Although Hendrycks & Gimpel (2016); Granese et al. (2021); Xia & Bouganis (2022); Kim et al. (2023) have advocated the simultaneous detection of samples that cause Type A and Type B errors, their approaches still treat the problem as consisting of two separate tasks, reflected in their separate and independent performance evaluation on OOD detection and SC. Regarding the challenge posed by Type C errors, existing work (Hendrycks & Dietterich, 2018; Croce et al., 2020) focuses primarily on obtaining classifiers that are more robust to covariate shifts, not on rejecting potentially misclassified samples due to covariate shifts; the latter, to the best of our knowledge, has not yet been explicitly considered, not to mention joint rejection together with Type A and Type B errors.

In this paper, our goal is to close the gap and consider, for the first time, rejecting all three types of errors in a unified framework. For brevity, we use the umbrella term distribution shifts to cover both label shifts and covariate shifts, which are perhaps the most commonly seen types of distribution shifts, with the caveat that practical distribution shifts can also be induced by other sources. So, we call the unified framework considered in this paper selective classification under distribution shifts, or generalized selective classification. Another key desideratum is practicality. With the increasing popularity of foundation models and associated downstream few-shot learners (Brown et al., 2020; Radford et al., 2021; Yuan et al., 2021), accessing massive original training data becomes increasingly more difficult. Moreover, there are numerous high-stakes domains where training data are typically protected due to privacy concerns, such as healthcare and finance. These applied scenarios call for SC strategies that can work with given pretrained classifiers and do not require access to the training data, which will be our focus in this paper. Our contributions include:

We advocate a new SC framework, generalized selective classification, which rejects samples that could cause Type A, Type B and Type C errors jointly, to improve classification performance over the non-rejected samples. With careful review and reasoning, we argue that generalized SC covers and unifies the scope of the existing OOD detection and SC, if the goal is to achieve reliable classification on the selected samples. (Sections 2.3 and 2.4)

Focused on non-training-based (or post-hoc) SC settings, we identify a critical scale-sensitivity issue of several SC confidence scores based on softmax responses (Section 3.1), which are popularly used and reported to be the state-of-the-art (SOTA) methods in the existing SC literature (Geifman & El-Yaniv, 2017; Feng et al., 2023).

We propose two confidence scores based on the raw logits (vs. the normalized logits, i.e., softmax responses), inspired by the notion of margins (Section 3.2). Through careful analysis (Section 3.3) and extensive experiments (Section 4), we show that our margin-based confidence scores are more reliable for generalized SC on various dataset-classifier combinations, even under moderate distribution shifts.


2 Technical background and related work

2.1 Selective classification (SC)

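Consider a multiclass classification problem with input space $\mathcal{X} \subseteq \mathbb{R}^n$, label space $\mathcal{Y} = \{1, \dots, K\}$, and data distribution $\mathcal{D}_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$. A selective classifier $(f, g)$ consists of a predictor $f: \mathcal{X} \to \mathbb{R}^K$ and a selector $g: \mathcal{X} \to \{0, 1\}$ and works as follows:

$$(f, g)(x) \triangleq \begin{cases} f(x) & \text{if } g(x) = 1 \\ \text{abstain} & \text{if } g(x) = 0 \end{cases}, \tag{1}$$

for any input $x \in \mathcal{X}$. Typical selectors $g$ take the form:

$$g_{s,\gamma}(x) = \mathbb{1}[s(x) > \gamma], \tag{2}$$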

where s(x) is a confidence-score function, and γ is a tunable threshold for selection.

2.2 Prior work in SC

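For a given selective classifier $(f, g_{s,\gamma})$, its SC performance is often characterized by two quantities:

$$\text{(coverage)} \quad \phi_{s,\gamma} = \mathbb{E}_{\mathcal{D}_{X,Y}}[g_{s,\gamma}(x)] \quad \text{(the higher, the better)},$$

$$\text{(selection risk)} \quad R_{s,\gamma} = \frac{\mathbb{E}_{\mathcal{D}_{X,Y}}[\ell(f(x), y)\, g_{s,\gamma}(x)]}{\phi_{s,\gamma}} \quad \text{(the lower, the better)}. \tag{3}$$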
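As a rough illustration of Eq. (3), the empirical coverage and selection risk at a fixed threshold can be computed from per-sample scores and losses. The following is a minimal sketch only; the function name and the toy numbers are hypothetical and are not taken from the paper's released code.

```python
def coverage_and_selection_risk(scores, losses, gamma):
    """Empirical versions of Eq. (3) for the selector g(x) = 1[s(x) > gamma]."""
    # Losses of the samples that the selector accepts.
    accepted = [loss for s, loss in zip(scores, losses) if s > gamma]
    coverage = len(accepted) / len(scores)
    # Selection risk is undefined at zero coverage; return None in that case.
    risk = sum(accepted) / len(accepted) if accepted else None
    return coverage, risk


# Hypothetical confidence scores and 0/1 losses (1 = misclassified).
scores = [0.95, 0.90, 0.70, 0.55, 0.40]
losses = [0, 0, 1, 0, 1]
print(coverage_and_selection_risk(scores, losses, gamma=0.5))  # (0.8, 0.25)
```

Raising the threshold in this toy example lowers the coverage, which is the tradeoff discussed next.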

Because a high coverage typically comes with a high selection risk, there is always a need for a risk-coverage tradeoff in SC. Most of the existing work considers $\ell$ to be the standard 0/1 classification loss (Chow, 1970; El-Yaniv et al., 2010; Geifman et al., 2018), and we also follow this convention in this paper. A classical cost-based formulation is to optimize the risk-coverage (RC) tradeoff (Chow, 1970)

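$$\min_{f,\, g_{s,\gamma}} \; \mathbb{E}_{\mathcal{D}_{X,Y}}[\ell(f(x), y)\, g_{s,\gamma}(x)] + \varepsilon\, \mathbb{E}_{\mathcal{D}_{X,Y}}[1 - g_{s,\gamma}(x)] \;\equiv\; \min_{f,\, g_{s,\gamma}} \; R_{s,\gamma}\phi_{s,\gamma} - \varepsilon \phi_{s,\gamma}, \tag{4}$$

where $\varepsilon \in [0, 1]$ is the cost of making a rejection. The optimal selective classifier for this formulation is (Chow, 1970; Franc et al., 2023a):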

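$$f^* = \arg\min_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(y|x)\, \ell(\hat{y}, y), \quad \text{and} \quad g^* = \mathbb{1}\Big[\min_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(y|x)\, \ell(\hat{y}, y) < \varepsilon\Big], \tag{5}$$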

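where $f^*$ is the Bayes optimal classifier and depends on the posterior probabilities $p(y|x)$ for all $y \in \mathcal{Y}$, which are hard to obtain in practice. Moreover, solutions to two constrained formulations for the RC tradeoff,

$$\min_{f,\, g_{s,\gamma}} R_{s,\gamma} \;\; \text{s.t.} \;\; \phi_{s,\gamma} \ge \omega \quad \text{and} \quad \max_{f,\, g_{s,\gamma}} \phi_{s,\gamma} \;\; \text{s.t.} \;\; R_{s,\gamma} \le \lambda, \tag{6}$$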

also depend on the posterior probabilities (Pietraszek, 2005; Geifman & El-Yaniv, 2017; Franc et al., 2023a; El-Yaniv et al., 2010).

Training-based scores. Due to the intractability of true posterior probabilities in practice, many previous methods focus on learning effective confidence-score functions from training data. They require access to training data and learn parametric score functions, often under cost-based/constrained formulations and their variants for the RC tradeoff. This learning problem can be formulated together with (Chow, 1970; Pietraszek, 2005; Grandvalet et al., 2008; El-Yaniv et al., 2010; Cortes et al., 2016; Geifman & El-Yaniv, 2019; Liu et al., 2019; Huang et al., 2022; Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Geifman et al., 2018; Maddox et al., 2019; Dusenberry et al., 2020; Lei, 2014; Villmann et al., 2016; Corbière et al., 2019) or separately from training the classifier (Jiang et al., 2018; Fisch et al., 2022; Franc et al., 2023a). However, Feng et al. (2023) have recently shown that these training-based scores do not outperform the simple non-training-based scores described below.


Algorithm 1 Non-training-based selective classification

Require: A pretrained classifier $f$; a score function $s$; a small calibration dataset $Z_{\mathrm{cali}} \overset{\mathrm{iid}}{\sim} \mathcal{D}^{\mathrm{cali}}_{X,Y}$

1: For all $(x_i, y_i) \in Z_{\mathrm{cali}}$, compute $s(x_i)$ and $\ell(f(x_i), y_i)$

2: Determine a threshold $\gamma$ according to the coverage or selection-risk target

3: Deploy the selector $g_{s,\gamma}$ based on Eq. (2).
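The steps of Algorithm 1 can be sketched as follows for the case of a coverage target. This is only one possible reading: the helper names are hypothetical, the threshold rule (taking the first rejected calibration score as $\gamma$) is an assumption, and a selection-risk target would additionally need the calibration losses.

```python
import math


def threshold_for_coverage(cal_scores, target_coverage):
    """Return gamma such that about `target_coverage` of the calibration
    scores satisfy s(x) > gamma (exactly so when there are no ties)."""
    ranked = sorted(cal_scores, reverse=True)
    # Number of calibration samples to accept; the small epsilon guards
    # against floating-point round-off in the product.
    n_accept = math.floor(target_coverage * len(ranked) + 1e-9)
    if n_accept >= len(ranked):
        return float("-inf")  # accept everything
    # The (n_accept + 1)-th largest score is the first one to be rejected.
    return ranked[n_accept]


def selector(score, gamma):
    """Selector of Eq. (2): 1 means predict, 0 means abstain."""
    return int(score > gamma)


# Hypothetical calibration scores s(x_i).
cal_scores = [0.9, 0.2, 0.7, 0.5, 0.8]
gamma = threshold_for_coverage(cal_scores, target_coverage=0.6)
decisions = [selector(s, gamma) for s in cal_scores]
print(gamma, decisions)
```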

Manually designed (non-training-based) scores. This family works with any given classifier and does not assume access to the training set. This is particularly attractive when it comes to modern pretrained large DNN models, e.g., CLIP (Radford et al., 2021), Florence (Yuan et al., 2021), and GPTs (Brown et al., 2020), for which obtaining the original training data and performing retraining are prohibitively expensive, if not impossible, for typical users. Algorithm 1 shows a typical use case of SC with non-training-based scores. Different confidence scores have been proposed in the literature. For example, for support vector machines (SVMs), the confidence margin (the difference of the top two raw logits) has been used as a confidence score (Fumera & Roli, 2002; Franc et al., 2023a); see also Section 3.2. For DNN models, which are our focus, confidence scores are popularly defined over the softmax responses (SRs). Assume that $z \in \mathbb{R}^K$ contains the raw logits (RLs) and $\sigma$ is the softmax activation. The following three confidence-score functions

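$$\mathrm{SR}_{\max}(z) \triangleq \max_i \sigma(z_i), \quad \mathrm{SR}_{\mathrm{doctor}}(z) \triangleq 1 - 1/\|\sigma(z)\|_2^2, \quad \mathrm{SR}_{\mathrm{ent}}(z) \triangleq \sum_i \sigma(z_i) \log \sigma(z_i), \tag{7}$$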

are popularly used in recent work, e.g., Feng et al. (2023); Granese et al. (2021); Xia & Bouganis (2022). Although simple, SRmax can easily beat existing training-based methods (Feng et al., 2023). On the other hand, these SR-based score functions generally follow the plug-in principle by assuming that SRs approximate posterior probabilities well (Franc et al., 2023a). Unfortunately, this assumption often does not hold in practice, and bridging this approximation gap is a major challenge for confidence calibration (Guo et al., 2017; Nixon et al., 2019). However, Zhu et al. (2022) reveals that recent calibration methods may even degrade SC performance.
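For concreteness, the three SR-based scores in Eq. (7) can be computed from a vector of raw logits as in the sketch below. The function names are illustrative, and the code assumes that SRent is the negative entropy of the softmax output, so that larger values mean higher confidence for all three scores.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of raw logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sr_max(z):
    """Maximum softmax response."""
    return max(softmax(z))


def sr_doctor(z):
    """1 - 1 / ||softmax(z)||_2^2, as in Eq. (7)."""
    return 1.0 - 1.0 / sum(p * p for p in softmax(z))


def sr_ent(z):
    """Negative entropy; terms with p == 0 contribute 0 by convention."""
    return sum(p * math.log(p) for p in softmax(z) if p > 0.0)


uniform = [0.0, 0.0, 0.0, 0.0]   # least confident case for K = 4
peaked = [5.0, 0.0, 0.0, 0.0]    # one dominant logit
for name, score in [("SRmax", sr_max), ("SRdoctor", sr_doctor), ("SRent", sr_ent)]:
    print(name, score(uniform), score(peaked))
```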

2.3 SC under distribution shifts: generalized SC

In this paper, we consider SC under distribution shifts, or generalized selective classification. Shifts between training and deployment distributions are common in practice and can often cause performance drops in deployment (Quinonero-Candela et al., 2008; Rabanser et al., 2019; Koh et al., 2021), raising reliability concerns for high-stakes applications in the real world. In this paper, we use the term distribution shifts to cover both covariate and label shifts jointly, which are perhaps the most prevalent forms of distribution shifts (see the beginning of Section 1 for their definitions). Although the basic set-up for our generalized SC framework remains the same as that of Eqs. (1) and (2), we need to modify the definitions for selection risk and coverage in Eq. (3) to take into account potential distribution shifts:

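$$\text{(coverage)} \;\; \phi_{s,\gamma} = \mathbb{E}_{\mathcal{D}'_{X,Y'}}[g_{s,\gamma}(x)], \quad \text{and} \quad \text{(selection risk)} \;\; R_{s,\gamma} = \mathbb{E}_{\mathcal{D}'_{X,Y'}}[\ell(f(x), y)\, g_{s,\gamma}(x)]/\phi_{s,\gamma}, \tag{8}$$

where $\mathcal{D}_{X,Y}$ is the original data distribution and $\mathcal{D}'_{X,Y'}$ is the shifted distribution; $\mathcal{Y}'$ may not be the same as $\mathcal{Y}$ due to potential label shifts.¹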

Out-of-distribution (OOD) detection as a weak form of generalized SC. The goal of OOD detection is to detect and exclude OOD samples (Yang et al., 2021). An ideal OOD detector $G(x)$ should perfectly separate In-D and OOD samples:

$$G(x) = \begin{cases} 0 \ (\text{i.e., excluded}) & \text{if } x \sim \mathcal{D}^{\mathrm{OOD}}_X \\ 1 \ (\text{i.e., kept}) & \text{if } x \sim \mathcal{D}_X \end{cases}, \quad \text{which is often realized as} \quad G(x) = \mathbb{1}[s^{\mathrm{OOD}}(x) > \gamma]. \tag{9}$$

¹ We assume no outliers in generalized SC, i.e., no samples that do not follow any specific statistical patterns during deployment; such samples are assumed to be already detected and removed by separate data preprocessing steps. This allows us to properly define the coverage and selection risk.


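Algorithm 2 Typical OOD detection pipeline (e.g., Sun et al. (2021))

Require: An OOD score function $s^{\mathrm{OOD}}$; an In-D calibration dataset $X_{\mathrm{in}} \overset{\mathrm{iid}}{\sim} \mathcal{D}_X$ and an OOD calibration dataset $X_{\mathrm{ood}} \overset{\mathrm{iid}}{\sim} \mathcal{D}^{\mathrm{OOD}}_X$

1: For all $x_i \in X_{\mathrm{in}}$ and $x_j \in X_{\mathrm{ood}}$, compute $s^{\mathrm{OOD}}(x_i)$ and $s^{\mathrm{OOD}}(x_j)$.

2: Compute a threshold $\gamma^{\mathrm{OOD}}$ using Eq. (9) by problem-specific target requirements, e.g., a target TPR (true positive rate) value.

3: Deploy the OOD detector according to Eq. (9).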

Here, $s^{\mathrm{OOD}}(\cdot)$ is a confidence-score function indicating the likelihood that the input is an In-D sample, and $\gamma$ is again a tunable cutoff threshold. Although by the literal meaning of OOD both covariate and label shifts are covered by $\mathcal{D}^{\mathrm{OOD}}_X$, the literature on OOD detection focuses mainly on detecting label-shifted samples, i.e., covariate-shifted $\mathcal{D}^{\mathrm{OOD}}_X$ induced by label shifts (Liu et al., 2020; Sun et al., 2021; Wang et al., 2022; Sun et al., 2022). OOD detection is commonly motivated as an approach to achieving reliable predictions: under the assumption that $\mathcal{D}^{\mathrm{OOD}}_X$ is induced by label shifts only, any OOD samples will cause misclassification and hence should be excluded, which is clearly aligned with the goal of SC. Algorithm 2 shows the typical use case of OOD (label-shift) detectors, and its similarity to SC shown in Algorithm 1 is self-evident. However, OOD detection clearly aims for less than generalized SC in that: (1) even if the OOD detection is perfect, samples that are misclassified by imperfect classifiers, either as In-D samples or due to distribution shifts, are not rejected, and (2) because practical OOD detectors may fail to perfectly separate In-D and OOD samples, In-D samples that are flagged as OOD but correctly classified are still rejected, hurting the classification performance on the selected samples; see Appendix C for an illustrative example. Therefore, if we are to achieve reliable predictions by excluding samples that are likely to cause errors, we should directly follow the generalized SC instead of the OOD detection formulation.
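To make the contrast with Algorithm 1 concrete, the sketch below sets an OOD threshold from a target TPR on In-D calibration scores, as in Algorithm 2. It assumes the convention that In-D samples are the positives (so TPR is the fraction of In-D samples kept); the helper names and the toy scores are hypothetical. Note that the threshold never consults whether the classifier is correct on a sample, which is the gap to generalized SC described above.

```python
import math


def ood_threshold_at_tpr(in_scores, target_tpr):
    """Threshold that keeps at least `target_tpr` of the In-D calibration
    samples under the rule 1[s_ood(x) > gamma] (assuming no ties)."""
    ranked = sorted(in_scores, reverse=True)
    n_keep = math.ceil(target_tpr * len(ranked) - 1e-9)
    if n_keep >= len(ranked):
        return float("-inf")  # keep everything
    return ranked[n_keep]


def kept_fraction(scores, gamma):
    """Fraction of samples with score strictly above gamma."""
    return sum(s > gamma for s in scores) / len(scores)


in_scores = [5.0, 4.0, 3.0, 2.0, 1.0]   # hypothetical In-D scores
ood_scores = [0.5, 1.5, 3.5]            # hypothetical OOD scores
gamma = ood_threshold_at_tpr(in_scores, target_tpr=0.8)
tpr = kept_fraction(in_scores, gamma)   # In-D samples kept
fpr = kept_fraction(ood_scores, gamma)  # OOD samples wrongly kept
print(gamma, tpr, fpr)
```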

Figure 1: Visualization of the normalized AURC-α: the area in blue divided by the coverage value α.

Other related concepts. Besides OOD detection, OOD generalization focuses on correctly classifying In-D and covariate-shifted samples, without considering prediction confidence and selection to improve prediction reliability; open-set recognition (OSR) focuses on correctly classifying In-D samples, as well as flagging label-shifted samples; see Geng et al. (2020) for a comprehensive review. In contrast, generalized SC covers all In-D, label-shifted, and covariate-shifted samples, the widest coverage compared to these related concepts, and targets the most practical and pragmatic metric: classification performance on the selected samples.

Prior work on SC with distribution shifts. Although the existing literature on SC is rich (Zhang et al., 2023), research work that considers SC with potential distribution shifts is very recent and focuses only on label shifts: Xia & Bouganis (2022); Kim et al. (2023) perform In-D SC and OOD (label shift) detection together with a confidence score that combines an SC score and an OOD score, but they still evaluate the performance of In-D SC and OOD detection separately. Müller et al. (2023); Cattelan & Silva empirically show that existing OOD scores are not good enough for SC tasks with covariate/label-shifted samples; Cattelan & Silva proposes ways to refine these scores with the help of additional datasets to optimize performance. Franc et al. (2024) provides theoretical insights on SC with In-D and label-shifted samples. In contrast, we focus on identifying better confidence scores for generalized SC that covers both In-D and covariate/label-shifted samples and maximizes the utility of the classifier, and unify the evaluation protocol (see Section 2.4).

2.4 Evaluation of generalized SC

Since the goal of generalized SC is to identify and exclude misclassified samples, for performance evaluation at a fixed cutoff threshold $\gamma$, it is natural to report the coverage, i.e., the portion of samples accepted, and the corresponding selection risk, which reflects the accuracy (taken broadly) on accepted samples. It is clear from Eqs. (1) and (2) that for a given pair of classifier $f$ and confidence-score function $s$, the threshold $\gamma$ can be adjusted to achieve different risk-coverage (RC) tradeoffs. By continuously varying $\gamma$, we can plot a risk-coverage (RC) curve (El-Yaniv et al., 2010; Franc et al., 2023a) to profile the SC performance of $(f, s)$ throughout the entire coverage range $\phi_{s,\gamma} \in [0, 1]$; see Fig. 1 for an example. Generally, the lower the RC curve, the better the SC performance. To obtain a summarizing metric, it is natural to use the area under the RC curve (AURC) (El-Yaniv et al., 2010; Franc et al., 2023a). We note that the RC curve and the AURC are also the most widely used evaluation metrics for classical SC, which is not surprising, as the goal of classical SC aligns with that of generalized SC, although generalized SC also allows distribution shifts.

For typical high-stakes applications, such as medical diagnosis, low selection risks are often prioritized over high coverage levels. So, in addition to RC curves and AURC, we also report several partial AURCs to account for potentially different needs: the normalized AURC-α, where α specifies the coverage level, and we normalize the partial area-under-the-curve by the corresponding α so that different partial levels can be cross-compared; see Fig. 1 for illustration.

Note that RC curves, and hence the associated AURCs and normalized AURC-α also, depend on the $(f, s)$ pair. So, if the purpose is to compare different confidence-score functions, $f$ should be fixed. Feng et al. (2023) have recently pointed out the abuse of this crucial point in recent training-based SC methods. Thus, it is worth stressing that we always take and fix the pretrained classifiers $f$ when making the comparison between different score functions.
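The evaluation quantities above can be sketched on a finite test set as follows. The discretization is an assumption: the RC curve is traced by accepting samples from the most to the least confident, and the (partial) area is approximated by a Riemann sum, which may differ in detail from the released implementation. The function names are illustrative.

```python
import math


def rc_curve(scores, losses):
    """Risk-coverage points obtained by accepting samples from the most to
    the least confident one (ties are broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    coverages, risks, cum_loss = [], [], 0.0
    for k, i in enumerate(order, start=1):
        cum_loss += losses[i]
        coverages.append(k / len(scores))
        risks.append(cum_loss / k)
    return coverages, risks


def normalized_aurc(scores, losses, alpha=1.0):
    """Riemann-sum area under the RC curve over coverage (0, alpha],
    divided by alpha; alpha = 1 gives the usual AURC."""
    _, risks = rc_curve(scores, losses)
    k = max(1, math.floor(alpha * len(risks) + 1e-9))
    return sum(risks[:k]) / k


# Same classifier (same 0/1 losses), two hypothetical score functions.
losses = [0, 0, 1, 1]
good_scores = [0.9, 0.8, 0.7, 0.6]   # errors receive the lowest scores
bad_scores = [0.6, 0.7, 0.8, 0.9]    # errors receive the highest scores
print(normalized_aurc(good_scores, losses), normalized_aurc(bad_scores, losses))
print(normalized_aurc(good_scores, losses, alpha=0.5))
```

Keeping `losses` fixed while swapping the scores mirrors the requirement that $f$ be fixed when score functions are compared.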

2.5 A few words on implementing Algorithm 1 in practice

In the practical implementation of generalized SC for high-stakes applications following Algorithm 1, it is necessary to select a cutoff threshold $\gamma$ based on a calibration set to meet the target coverage, or more likely the target risk level. However, in this paper, we follow most existing work on SC and do not touch on issues such as how the calibration set should be constructed and how the threshold should be selected; we leave these for future work. Our evaluation here, again as in most existing SC work, is only about the potential of specific confidence-score functions for generalized SC, measured by the RC curve, AURC, and normalized AURC-α values, directly on test sets that consist of In-D, OOD, and covariate-shifted samples.

3 Our method: margins as confidence scores for generalized SC

Our goal is to design effective confidence-score functions for generalized SC. Again, our focus is on non-training-based scores that can work on any pretrained classifier $f$ without access to the training data.

3.1 Scale sensitivity of SR-based scores

As discussed in Section 2.2, most manually designed confidence scores focus on DNN models and are based on softmax responses (SRs), assuming that SRs closely approximate true posterior probabilities; closing such approximation gaps is the goal of confidence calibration. However, effective confidence calibration remains elusive (Guo et al., 2017; Nixon et al., 2019), and the performance of SR-based score functions is sensitive to the scale of raw logits and hence that of SRs, as explained below.

A quick numerical experiment. Consider a 4-component mixture-of-Gaussians distribution with means $w_1, w_2, w_3, w_4$ given by the four sign combinations of $[\pm\sqrt{2}/2, \pm\sqrt{2}/2]^\top$ (one per quadrant), equal variance $0.15\, I$, and equal weight $1/4$. If we treat each component of the mixture as a class and consider the resulting 4-class classification problem, it is easy to see that the optimal 4-class linear classifier is $f(x) = [w_1, w_2, w_3, w_4]^\top x$, with the decision rule $\arg\max_{j \in \{1,2,3,4\}} w_j^\top x$; see Fig. 2(a) for visualization of the data distribution and decision boundaries (i.e., the lines $x_1 = 0$ and $x_2 = 0$). Moreover, this $f(x)$ is also a Bayes optimal classifier, as well as the maximum a posteriori (MAP) classifier, for our particular problem here. Now, given any input $x$, we consider scaled raw logits $\lambda f(x)$ for different scale factors $\lambda = 0.1, 1, 2, 4$ and plot the resulting RC curves for SRmax, SRdoctor, and SRent, respectively; see Fig. 2(b)-(d). For reference, we also include the RC curves based on the true posterior probabilities (denoted as $s_{\mathrm{post}}$), which are available for our simple data model here. We can observe that for SR-based functions (SRmax, SRdoctor, and SRent), their RC curves, and hence the associated AURCs, vary as $\lambda$ changes, and these curves approach a common curve (RLconf-M, which we will explain below) as $\lambda$ becomes large.

Figure 2: (a) Data and classifier visualization. RC curves for (b) SRmax, (c) SRdoctor, and (d) SRent, calculated based on scaled (by factor 0.1, 1.0, 2.0, and 4.0, respectively) raw logits from the optimal 4-class linear classifier using data shown in (a). The RC curves for RLconf-M and $s_{\mathrm{post}}$ are also plotted for reference, where RLconf-M is one of our proposed confidence-score functions.

Why does this happen? The above observations are not incidental. To see why the curves change with respect to $\lambda$, note that for a given test set $\{x_i\}$ and a fixed classifier $f$, the RC curve for any score function $s$ is fully determined by the ordering of the scores $s(x_i)$ (Franc et al., 2023a). But this ordering is sensitive to the scale of the raw logits for all three SR-based score functions: SRmax, SRdoctor, and SRent. Take $s$ = SRmax as an example and consider any sample $x$ with its corresponding raw logits $z$ sorted in descending order (i.e., $z_{(1)} \ge z_{(2)} \ge \cdots$) without loss of generality. Then for any scale factor $\lambda > 0$ applied to $z$, we have the score

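$$\mathrm{SR}_{\max}(\lambda z) = \frac{e^{\lambda z_{(1)}}}{\sum_j e^{\lambda z_{(j)}}} = \frac{1}{\sum_j e^{\lambda (z_{(j)} - z_{(1)})}} = \exp\Big(-\log \sum_j e^{\lambda (z_{(j)} - z_{(1)})}\Big). \tag{10}$$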

This means that the score is determined by all the scaled logit gaps $\lambda(z_{(j)} - z_{(1)})$. Moreover, due to the inner exponential function, small gaps gain more emphasis as $\lambda$ increases, and all gaps receive increasingly more emphasis as $\lambda$ decreases. Such a shifted emphasis can easily change the order of scores for two data samples, depending on how differently their raw logits are distributed. Clearly, $e^{\lambda z_{(1)}}/\sum_j e^{\lambda z_{(j)}} = 1/K$ when $\lambda = 0$. We can also make similar arguments for SRdoctor and SRent. Next, for the common asymptotic curve as $\lambda \to \infty$, we can show the following (proof is deferred to Appendix B):

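Lemma 3.1. Consider the raw logits $z$, and without loss of generality assume that they are ordered in descending order without any ties, i.e., $z_{(1)} > z_{(2)} > \cdots$. We have that as $\lambda \to \infty$,

$$\mathrm{SR}_{\max}(\lambda z) \sim \exp\big(-e^{\lambda(z_{(2)} - z_{(1)})}\big), \quad \mathrm{SR}_{\mathrm{doctor}}(\lambda z) \sim 1 - \exp\big(2e^{\lambda(z_{(2)} - z_{(1)})}\big), \quad \mathrm{SR}_{\mathrm{ent}}(\lambda z) \sim -e^{\lambda(z_{(2)} - z_{(1)})},$$

where $\sim$ means asymptotic equivalence. In particular, all the asymptotic functions increase monotonically with respect to $z_{(1)} - z_{(2)}$.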

This implies that the asymptotic RC curve as $\lambda \to \infty$ for all three score functions is fully determined by the score function $z_{(1)} - z_{(2)}$!
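The large-$\lambda$ behavior of SRmax can be checked numerically. The sketch below compares SRmax of the scaled logits with the expression $\exp(-e^{\lambda(z_{(2)} - z_{(1)})})$, which depends on the logits only through the top-two gap; the logits and function names are hypothetical, and the check is an illustration rather than a substitute for the proof.

```python
import math


def sr_max_scaled(z, lam):
    """SRmax of the scaled logits lam * z, written via logit gaps as in Eq. (10)."""
    top = max(z)
    return 1.0 / sum(math.exp(lam * (v - top)) for v in z)


def sr_max_asymptote(z, lam):
    """Large-lam expression for SRmax that uses only the top-two logit gap."""
    z1, z2 = sorted(z, reverse=True)[:2]
    return math.exp(-math.exp(lam * (z2 - z1)))


z = [2.0, 1.2, 0.5, -1.0]   # hypothetical logits without ties
for lam in (1.0, 4.0, 10.0):
    ratio = sr_max_scaled(z, lam) / sr_max_asymptote(z, lam)
    print(lam, ratio)        # the ratio approaches 1 as lam grows
```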

Implications. The sensitivity of the RC curves, and hence of the performance, of these SR-based scores to the scale of raw logits is disturbing. It implies that one can simply change the overall scale of the raw logits, which does not alter the classification accuracy itself, to claim better or worse performance of an SR-based confidence-score function for selective classification, making the comparison of different SR-based scores shaky. Unfortunately, between the limiting cases $\lambda \to 0$ and $\lambda \to \infty$, there is no canonical scaling.
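A small constructed example shows how rescaling alone can swap the SRmax ranking of two samples. The two logit vectors below are hypothetical and chosen by hand: sample A has one close competitor, while sample B has three moderately close competitors.

```python
import math


def sr_max_scaled(z, lam):
    """SRmax of the scaled logits lam * z."""
    top = max(z)
    return 1.0 / sum(math.exp(lam * (v - top)) for v in z)


# Two hypothetical samples with K = 4 raw logits each.
z_a = [2.0, 1.5, -5.0, -5.0]   # top-two gap 0.5, other logits far away
z_b = [2.0, 1.2, 1.2, 1.2]     # top-two gap 0.8, three competitors

for lam in (1.0, 4.0):
    a, b = sr_max_scaled(z_a, lam), sr_max_scaled(z_b, lam)
    ranked_higher = "A" if a > b else "B"
    print(f"lambda={lam}: SRmax(A)={a:.4f}, SRmax(B)={b:.4f}, ranked higher: {ranked_higher}")
```

At $\lambda = 1$ sample A receives the higher score, whereas at $\lambda = 4$ sample B does, consistent with the top-two gap taking over as $\lambda$ grows. Since the RC curve depends only on the ordering of scores, such swaps are what move the curve.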

3.2 Our method: margin-based confidence scores

To avoid the scale sensitivity caused by the softmax nonlinearity, it is natural to consider designing score functions directly over the raw logits. To this end, we revisit ideas in support vector machines (SVMs).


Margins in SVMs. In linear SVMs for binary classification, the classifier takes the form $f(x) = \mathrm{sign}(w^\top x + b)$ and the confidence in classifying a sample $x$ can be assessed by its distance from the supporting hyperplane (Fumera & Roli, 2002; Franc et al., 2023a): $|w^\top x + b|/\|w\|_2$, which is called the geometric margin; see Appendix A for a detailed review. We can extend the idea to $K$-class linear SVMs. Following the popular joint multiclass SVM formulation (Crammer & Singer, 2001), we consider a linear classifier $f(x) = W^\top x + b$. Here, $W$ and $b$ induce $K$ hyperplanes, and we can define the signed distance of any sample $x$ to the $i$-th hyperplane as $(w_i^\top x + b_i)/\|w_i\|_2$ ($w_i$ denotes the $i$-th column of $W$ and $b_i$ the $i$-th element of $b$), generalizing the definition for the binary case. However, a single signed distance makes little sense for assessing the classification confidence in multiclass cases, given the typical argmax decision rule; e.g., the largest signed distance can be negative. Instead, comparing the distances to all decision hyperplanes seems more reasonable. Thus, we can consider the following geometric margin as a confidence-score function:

$$\frac{w_{y'}^\top x + b_{y'}}{\|w_{y'}\|_2} - \max_{j \in \{1,\dots,K\} \setminus y'} \frac{w_j^\top x + b_j}{\|w_j\|_2}, \tag{11}$$

where $y' \in \arg\max_{j \in \{1,\dots,K\}} (w_j^\top x + b_j)/\|w_j\|_2$. In other words, it is the difference between the top two signed distances of $x$ to all $K$ hyperplanes. Intuitively, the larger the geometric margin, the more confident the classifier is in classifying the sample following the largest signed distance: a clearer winner earns more trust. Although the interpretation is intuitive, the geometric margin is not popularly used in multiclass SVM formulations, likely due to its non-convexity. Instead, a popular proxy for the geometric margin is the convex confidence margin:

$$(w_{y'}^\top x + b_{y'}) - \max_{i \in \{1,\dots,K\} \setminus y'} (w_i^\top x + b_i), \tag{12}$$

with the decision rule $y' \in \arg\max_{j \in \{1,\dots,K\}} w_j^\top x + b_j$; see Appendix A. Despite its numerical convenience, the confidence margin loses geometric interpretability compared to the geometric margin, and it can be sensitive to the scaling of $w_j$. We study both margins in this paper.

Margins in DNNs. To extend the idea of margins to a DNN classifier $f_\theta(x)$ parameterized by $\theta$, we view all but the final linear layer as a feature extractor, denoted as $f^e_\theta$. So, for each sample $x$, the logit output takes the form $z = W^\top f^e_\theta(x) + b$, and thus the signed distance of the representation $f^e_\theta(x)$ to each decision hyperplane in the representation space is $d_j = (w_j^\top f^e_\theta(x) + b_j)/\|w_j\|_2$ for all $j \in \{1, \dots, K\}$. Assume sorted signed distances and logits, i.e., $d_{(1)} \ge d_{(2)} \ge \dots \ge d_{(K)}$ and $z_{(1)} \ge z_{(2)} \ge \dots \ge z_{(K)}$. The geometric margin and the confidence margin are defined as

$$\mathrm{RL}_{\text{geo-M}} \triangleq d_{(1)} - d_{(2)} \quad \text{and} \quad \mathrm{RL}_{\text{conf-M}} \triangleq z_{(1)} - z_{(2)}, \quad \text{respectively}. \tag{13}$$

Note that both RLgeo-M and RLconf-M are computed using the raw logits without softmax normalization; the $z_j$'s and $d_j$'s may not have the same ordering due to the scale of $\|w_j\|_2$. In fact, RLconf-M is applied in LeCun et al. (1989) to formulate an empirical rejection rule for a handwritten recognition system, although no detailed analysis or discussion is given on why it is effective. Despite the simplicity of these two notions of margins, we have not found prior work that considers them for SC except for LeCun et al. (1989).
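Both margins in Eq. (13) need only the final linear layer and the feature vector of a sample. The sketch below uses plain Python lists; the function name, the argument layout (one weight vector per class), and the toy numbers are assumptions made for illustration.

```python
import math


def margins(weight_cols, bias, feature):
    """Return (RLgeo-M, RLconf-M) of Eq. (13) for one feature vector.

    weight_cols[j] is w_j (the j-th column of W) and bias[j] is b_j.
    """
    logits = [sum(w_k * h_k for w_k, h_k in zip(w, feature)) + b
              for w, b in zip(weight_cols, bias)]
    # Signed distance to each hyperplane: logit divided by the l2 norm of w_j.
    dists = [z / math.sqrt(sum(w_k * w_k for w_k in w))
             for z, w in zip(logits, weight_cols)]
    z_sorted = sorted(logits, reverse=True)
    d_sorted = sorted(dists, reverse=True)
    return d_sorted[0] - d_sorted[1], z_sorted[0] - z_sorted[1]


# Hypothetical 2-class final layer whose weight vectors have different norms.
weight_cols = [[3.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
feature = [1.0, 2.5]
rl_geo_m, rl_conf_m = margins(weight_cols, bias, feature)
print(rl_geo_m, rl_conf_m)
```

In this toy case the logits are (3, 2.5) while the signed distances are (1, 2.5), so the two quantities are ordered differently across classes and the two margins differ.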

Scale-invariance property. An attractive property of margin-based score functions is that their SC performance is invariant w.r.t. the scale of raw logits. This is because changing the overall scale of the raw logits does not change the order of scores assigned by either the geometric or the confidence margin. In this regard, margin-based score functions are preferable to, and more reliable than, SR-based scores for SC. Another interesting point is that the limiting curve depicted in Fig. 2(b)-(d) is induced by the confidence margin, as is clear from Lemma 3.1 and the discussion following it.

3.3 Analysis of rejection patterns

We continue with the toy example in Section 3.1 to show another major difference between the SR-based and the margin-based score functions: they have different rejection patterns for given coverage levels. We will see that margin-based score functions induce favorable rejection patterns and can hence be used for reliable rejection even under moderate covariate shifts. For comparison, we also consider the maximum raw logit (denoted as RLmax) to show that a single logit in multiclass classification is not a sensible confidence score.


Figure 3: Further analysis of the numerical example in Section 3.1. Case 1, Case 2, and Case 3 (panel rows 1, 2, and 3) correspond to the original dataset in Section 3.1, the dataset after small perturbations, and the dataset after substantial perturbations, respectively. Here, panels (a-1) to (a-3) are the RC curves achieved by different selection scores; panels (b-1) to (b-3) are visualizations of the samples (one color per class), decision boundaries (dashed blue line), and the rejected samples (black crosses) at coverage 0.8 by RLgeo-M; panels (c-1) to (c-3) visualize the rejected samples (black crosses) at coverage 0.8 by SRmax; and panels (d-1) to (d-3) present the histograms of the robustness radii of the samples selected by all score functions.

Case 1: We use the same setup as in the numerical experiment in Section 3.1 (see also Fig. 2), and plot in Fig. 3 (a-1) the RC curves induced by the various confidence-score functions.² It is clear that RLgeo-M performs the best. To better understand the difference between RLgeo-M and other score functions, we study their rejection patterns: we visualize in Fig. 3 (b-1)&(c-1) the samples rejected at 0.8 coverage for RLgeo-M and SRmax, respectively; see the visualization of other score functions in Appendix D, whose rejection patterns are similar to that of SRmax. A distinctive feature of RLgeo-M is that it prioritizes rejecting samples closer to decision boundaries, whereas SR-based scores prioritize rejecting samples close to the origin. Conceptually, the former rejection pattern is favorable, as the goal of SC is exactly to reject uncertain samples on which the classifier's decisions can be shaky. More precisely, the difference in rejection patterns implies at least two things: (1) RLgeo-M could be advantageous when most classification errors occur near the decision boundaries; (2) RLgeo-M may be superior even when test samples have a moderate level of distribution shift with respect to training. For example, when the test set has a slightly different $\mathcal{D}_{X|Y}$ than the training set (see Cases 2 & 3 below), mistaken samples due to the shift tend to be close to the decision boundaries and thus can be successfully rejected. Fig. 3 (d-1) plots the histograms of the robustness radii (i.e., the $\ell_2$ distance of a sample to the closest decision boundary) of selected samples at 0.8 coverage, where the robustness radius quantitatively captures the extent of $\mathcal{D}_{X|Y}$ shift that SC can tolerate: while the samples selected using RLgeo-M uniformly have nonzero robustness radii, all other score functions lead to zero robustness radii for the worst samples, implying sensitivity to $\mathcal{D}_{X|Y}$ shifts.³

²For the classifier considered, RLgeo-M and RLconf-M have the same SC performance as $\|w\|_2 = 1$.
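For a linear classifier, the robustness radius has a closed form, which the following sketch computes. It assumes an affine classifier with logits $z_j = w_j^\top x + b_j$ and prediction by the largest logit; this is an assumption made for illustration (for a general DNN no such closed form is available), and the function name and numbers are hypothetical.

```python
import math


def robustness_radius(x, weights, biases):
    """l2 distance from x to the closest decision boundary of an affine classifier.

    The boundary between the predicted class c and another class j is the
    hyperplane {(w_c - w_j)^T x + (b_c - b_j) = 0}, so the distance to it is
    (z_c - z_j) / ||w_c - w_j||_2; the radius is the minimum over j != c.
    """
    logits = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    c = max(range(len(logits)), key=logits.__getitem__)
    radius = math.inf
    for j in range(len(logits)):
        if j == c:
            continue
        diff = [a - b for a, b in zip(weights[c], weights[j])]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # identical weight vectors define no boundary
        radius = min(radius, (logits[c] - logits[j]) / norm)
    return radius


# Two classes separated by the vertical axis (hypothetical example).
W = [[1.0, 0.0], [-1.0, 0.0]]
b = [0.0, 0.0]
print(robustness_radius([3.0, 4.0], W, b))
```

A radius of zero means the sample sits exactly on a decision boundary, so an arbitrarily small perturbation can change its predicted class.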

Case 2: We keep the same setup as Case 1, except that small perturbations are added to all samples. The perturbations are drawn from a uniform distribution within the interval $[-0.5, 0.5]$ on each dimension of $\mathbb{R}^2$; see Fig. 3 (b-2), where more samples of different classes are intermingled than before the perturbations are added. Although some misclassified samples have moved far into the bulks of other classes, most of them are still close to the decision boundaries. Therefore, RLgeo-M still outperforms the SR-based score functions, as shown in Fig. 3 (a-2).

Case 3: We continue to increase the magnitudes of perturbations, and Fig. 3 (b-3) illustrates the case where the perturbations are drawn from a uniform distribution within the interval $[-2, 2]$. Now that samples from different classes are well mixed in the 2D space, RLgeo-M is no longer superior when the coverage level is high, as shown in Fig. 3 (a-3). However, we argue that Case 3 is less concerning in practice: we probably will never consider deploying a classifier that does not work well at all before SC; see the risk achieved at coverage level 1. Instead of relying on an SC strategy, it is more urgent to improve the base classifier in this case.
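The perturbation used in Cases 2 and 3 can be sketched as follows. This is a minimal illustration of adding independent uniform noise to every coordinate; the helper name, the seed, and the sample points are hypothetical, and the actual toy dataset of Section 3.1 is not reproduced here.

```python
import random


def perturb(points, radius, rng):
    """Add independent noise, uniform on [-radius, radius], to every coordinate."""
    return [[coord + rng.uniform(-radius, radius) for coord in p] for p in points]


rng = random.Random(0)                  # fixed seed, for reproducibility only
points = [[0.0, 0.0], [1.0, -2.0]]      # hypothetical 2D samples
case2 = perturb(points, 0.5, rng)       # small perturbations, as in Case 2
case3 = perturb(points, 2.0, rng)       # substantial perturbations, as in Case 3
print(case2, case3)
```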

Summary: Using the above examples, we have shown that our proposed margin-based score functions are not sensitive to the scale of the raw logits. When the base classifier is reasonable in classifying in-distribution data samples (i.e., achieving low risks at full coverage), margin-based scores are expected to result in good SC performance, even when test samples have low or moderate distribution shifts, as we show empirically in Section 4 below.

4 Experiments

In this section, we experiment with various multiclass classification tasks and recent DNN classifiers to verify the effectiveness of our margin-based score functions for generalized SC.

4.1 Comparison with nontraining-based score functions using pretrained models

Setups We take different pretrained DNN models in various classification tasks and evaluate SC performance on test datasets composed of In-D and distribution-shifted samples jointly. Specifically, our evaluation tasks include (i) ImageNet (Russakovsky et al., 2015), the most widely used testbed for image classification, with a covariate-shifted version ImageNet-C (Hendrycks & Dietterich, 2018) composed of synthetic perturbations, and OpenImage-O (Wang et al., 2022) composed of natural images similar to ImageNet but with disjoint labels, i.e., label-shifted samples; (ii) iWildCam (Beery et al., 2020), whose test set provides two subsets of animal images taken at different geo-locations, where one shares the locations of the training set and serves as In-D, and the other comes from different locations and serves as a natural covariate-shifted version; (iii) Amazon (Ni et al., 2019), whose test set provides two subsets of review comments by different users, producing In-D and natural covariate-shifted test samples for a language sentiment classification task; and (iv) CIFAR-10 (Krizhevsky et al., 2009), a small image classification dataset commonly used in previous training-based SC works, together with CIFAR-10-C (perturbed CIFAR-10) and CIFAR-100 (with disjoint labels from CIFAR-10), popularly used covariate-shifted and label-shifted versions of CIFAR-10. Tables 1 and 2 summarize the pretrained models and datasets.

³The intuition on why our notions of margins work for Type B errors is different: there, since x assumes a label outside the known label set, we expect no clear winner in the raw logits.

Confidence-score functions for comparison In addition to SRmax, SRdoctor and SRent introduced in Eq. (7) and our proposed margin-based scores RLgeo-M and RLconf-M in Eq. (13), we also consider several recent post-hoc OOD detection scores⁴: (i) RLmax: the maximum raw logit (Hendrycks et al., 2019); (ii) Energy: the log-sum-exponential aggregation (i.e., a smooth approximation to the maximum raw logit) of the raw logits (Liu et al., 2020); (iii) KNN: a score composed of the distances from a test data point to the k nearest neighbors of the training set in the raw logit space (Sun et al., 2022); (iv) ViM: a score composed of the residual of a test sample from the principal components estimated in the feature space prior to the raw logits using training data (Wang et al., 2022); and (v) SIRC: a composite score of the softmax response and OOD detection scores (Xia & Bouganis, 2022). We note that KNN, ViM, and SIRC all contain hyperparameters that are determined by the training data. To minimize the gap with our nontraining-based setup, we randomly sample a small number of data points⁵ from the In-D test set to tune their hyperparameters, respectively. Also, note that KNN has an additional hyperparameter k that is independent of the statistics of the dataset. Empirically, we find that KNN's performance is very sensitive to the choice of k, the task, and the classifier. Therefore, in this paper, we use k = 2 (the empirical best) by default for KNN and provide an ablation analysis for KNN for each experiment in Appendix H.
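The scores that depend only on the raw logits of a single sample are simple to compute. The sketch below is illustrative rather than the reference implementation: it covers RLmax, Energy (as a log-sum-exp of the logits, with temperature 1 assumed), SRmax, and an entropy-based score, for which the sign convention (negative entropy, so that higher means more confident) is our assumption. KNN, ViM, and SIRC are omitted because they need additional data-dependent quantities.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum_j exp(z_j))."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def softmax(logits):
    lse = logsumexp(logits)
    return [math.exp(z - lse) for z in logits]


def rl_max(logits):
    """Maximum raw logit."""
    return max(logits)


def energy(logits):
    """Log-sum-exp of the raw logits: a smooth approximation to rl_max."""
    return logsumexp(logits)


def sr_max(logits):
    """Maximum softmax response."""
    return max(softmax(logits))


def sr_neg_entropy(logits):
    """Negative entropy of the softmax response (assumed sign convention)."""
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


ambiguous = [5.0, 5.0, 5.0]   # large logits, no clear winner
decisive = [2.0, 0.0, 0.0]    # smaller logits, clear winner
for z in (ambiguous, decisive):
    print(rl_max(z), energy(z), sr_max(z), sr_neg_entropy(z))
```

In this made-up example, RLmax and Energy assign the higher score to the ambiguous sample simply because its logits are larger overall, which illustrates why a single logit can be a poor confidence score.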

Table 1: Summary of the pretrained classifiers used for the various classification tasks

| Task | Model Name | Source | Note |
|---|---|---|---|
| ImageNet | EVA (Fang et al., 2023) | timm⁶ | Top-1 acc. 88.76% |
| ImageNet | ConvNext (Liu et al., 2022) | timm⁶ | Top-1 acc. 86.25% |
| ImageNet | VOLO (Yuan et al., 2022) | timm⁶ | Top-1 acc. 85.56% |
| ImageNet | ResNext (Xie et al., 2017) | timm⁶ | Top-1 acc. 85.54% |
| iWildCam | FLYP (Goyal et al., 2023) | Official source code⁷ | Ranked 1st on WILDS (Koh et al., 2021) |
| Amazon | LISA (Yao et al., 2022) | Official source code⁸ | Ranked 1st on WILDS |
| CIFAR & ImageNet | ScNet (Geifman & El-Yaniv, 2019) | PyTorch re-implementation⁹ | Training-based SC |

Table 2: Summary of In-D and distribution-shifted datasets used for our SC evaluation

| Task | In-D (split) | Classes - samples | Shift-Cov | Samples | Shift-Label | Samples |
|---|---|---|---|---|---|---|
| ImageNet | ILSVRC-2012 (val) | 1000 - 50,000 | ImageNet-C (severity 3), all types of corruptions | 50,000 × 19 | OpenImage-O | 17,256 |
| iWildCam | iWildCam (id_test) | 178 - 8154 | iWildCam (ood_test) | 42791 | N/A | N/A |
| Amazon | Amazon (id_test) | 5 - 46,950 | Amazon (test) | 100,050 | N/A | N/A |
| CIFAR | CIFAR-10 (val) | 10 - 10,000 | CIFAR-10-C (severity 3), all types of corruptions | 10,000 × 19 | CIFAR-100 | 10,000 |

Evaluation metrics We report both the RC curves and the AURC-α, where α ∈ {0.1, 0.5, 1}, as discussed in Section 2.4. Note that when plotting the RC curves, we omit SRdoctor because it almost overlaps with SRmax, which is also observed by Xia & Bouganis (2022).
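For readers who want to reproduce these metrics, the following sketch computes an empirical RC curve and a discretized AURC-α. It is a sketch under stated assumptions, not the definition from Section 2.4: we assume that samples are selected in order of decreasing score, that the selection risk is the error rate among the selected samples, and that AURC-α is the average selection risk over the empirical coverage levels not exceeding α. Function names and data are hypothetical.

```python
def rc_curve(scores, correct):
    """Empirical risk-coverage curve.

    Samples are selected in order of decreasing confidence score; for each
    number k of selected samples, the coverage is k / n and the selection
    risk is the error rate among the k selected samples.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    coverages, risks = [], []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        coverages.append(k / len(order))
        risks.append(errors / k)
    return coverages, risks


def aurc(scores, correct, alpha=1.0):
    """Average selection risk over coverage levels up to alpha (assumed form)."""
    coverages, risks = rc_curve(scores, correct)
    kept = [r for c, r in zip(coverages, risks) if c <= alpha + 1e-12]
    if not kept:
        raise ValueError("alpha is smaller than the lowest attainable coverage")
    return sum(kept) / len(kept)


scores = [0.9, 0.8, 0.7, 0.6]          # hypothetical confidence scores
correct = [True, True, False, True]    # whether each prediction is right
print(rc_curve(scores, correct))
print(aurc(scores, correct, alpha=0.5), aurc(scores, correct, alpha=1.0))
```

When label-shifted samples are part of the test set, one would presumably mark them as incorrect, since no prediction within the known label set can be right for them; that convention is an assumption here.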

Results on ImageNet We show in Fig. 4 the RC curves of the various score functions on the pretrained model EVA, for different combinations of subsets of test data, as summarized in Table 3. The most striking result is in Fig. 4(c), which collects the results for evaluation on a mixture of In-D and label-shifted samples: except for RLgeo-M, RLconf-M and KNN, the selection risks of other score functions do not follow a monotonically decreasing trend as coverage decreases. As coverage approaches zero, their selection risks spike up, almost to the risk level at full coverage (i.e., the error rate on the whole set). This is because the other score functions do not indicate prediction confidence well in this setting and hence fail to sufficiently separate right and wrong predictions: during rejection, both right and wrong predictions are rejected indiscriminately. On the other hand, RLgeo-M and RLconf-M are better than KNN in separating correct and wrong predictions when there are no label-shifted samples, as shown in Fig. 4 (a)&(b). As a result, RLgeo-M and RLconf-M have the best overall performance when In-D, covariate-shifted and label-shifted samples coexist, as shown in Fig. 4 (d). Also, see Table 3 for numerical confirmation of the above observations, where in all cases RLgeo-M and RLconf-M are the best or comparable to the best-performing among all score functions. We present the SC results of other ImageNet models in Appendix G; our margin-based score functions still stand as the best-performing among all.

⁴In OOD detection, scores are usually dependent on the training data. However, these post-hoc scores can also be applied as nontraining-based SC scores as in Algorithm 1, by replacing $\mathcal{D}_X$ and $\mathcal{D}^{\mathrm{OOD}}_X$ in Algorithm 2 with $\mathcal{D}^{\mathrm{cali}}_{X,Y}$.

⁵Five times the number of classes in each task from Table 2. We do not sample five points per class, as in practice the calibration set $\mathcal{D}^{\mathrm{cali}}_{X,Y}$ may be imbalanced.

⁶See Table 6 in Appendix E for the model card information to retrieve these timm models.

⁷https://github.com/locuslab/FLYP

⁸https://github.com/huaxiuyao/LISA.git

⁹https://github.com/gatheluck/pytorch-SelectiveNet

[Figure 4 panels: (a) In-D (ImageNet); (b) In-D + Shift (Cov); (c) In-D + Shift (Label); (d) In-D + Shift (both)]

Figure 4: RC curves of different confidence-score functions on the model EVA for ImageNet. (a)-(d) are RC curves evaluated using samples from (a) In-D samples only, (b) In-D and covariate-shifted samples only, (c) In-D and label-shifted samples only, and (d) all samples, respectively. We group the curves by whether they are originally proposed for SC setups (solid lines) or for OOD detection (dashed lines).

Table 3: Summary of AURC-α for Fig. 4 (ImageNet - EVA). The AURC numbers are on the $10^{-2}$ scale; the lower, the better. The score functions proposed for SC are highlighted in gray, and the rest are originally for OOD detection. The best AURC numbers for each coverage level are highlighted in bold, and the 2nd and 3rd best scores are underlined.

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | In-D + Shift (Cov), α=0.1 | In-D + Shift (Cov), α=0.5 | In-D + Shift (Cov), α=1 | In-D + Shift (Label), α=0.1 | In-D + Shift (Label), α=0.5 | In-D + Shift (Label), α=1 | In-D + Shift (both), α=0.1 | In-D + Shift (both), α=0.5 | In-D + Shift (both), α=1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RLconf-M | 0.16 | 0.53 | 2.39 | 0.24 | 0.96 | 4.77 | 1.04 | 3.34 | 11.7 | 0.34 | 1.20 | 5.43 |
| RLgeo-M | 0.27 | 0.59 | 2.43 | 0.37 | 1.02 | 4.78 | 1.20 | 3.35 | 11.6 | 0.48 | 1.26 | 5.43 |
| SIRC | 2.23 | 2.07 | 3.36 | 3.71 | 3.06 | 5.83 | 15.8 | 8.88 | 13.7 | 4.61 | 3.53 | 6.52 |
| SRmax | 3.20 | 2.36 | 3.38 | 4.52 | 3.66 | 5.93 | 13.1 | 7.52 | 12.6 | 5.21 | 3.75 | 6.56 |
| SRent | 4.28 | 3.13 | 4.04 | 6.24 | 4.66 | 7.00 | 16.0 | 9.19 | 13.4 | 7.04 | 5.10 | 7.61 |
| SRdoctor | 3.22 | 2.38 | 3.40 | 4.55 | 3.40 | 6.00 | 13.2 | 7.55 | 12.6 | 5.24 | 3.78 | 6.61 |
| RLmax | 5.53 | 4.05 | 4.57 | 8.48 | 6.04 | 7.64 | 21.1 | 11.9 | 14.9 | 9.53 | 6.59 | 8.33 |
| Energy | 8.13 | 6.60 | 6.90 | 12.8 | 10.3 | 11.1 | 27.3 | 16.6 | 18.1 | 14.1 | 11.0 | 11.8 |
| KNN | 0.99 | 2.27 | 4.58 | 1.22 | 2.89 | 6.78 | 1.18 | 3.23 | 10.8 | 1.24 | 2.98 | 7.16 |
| ViM | 5.48 | 7.11 | 8.31 | 5.31 | 8.05 | 10.4 | 5.83 | 7.89 | 13.4 | 5.35 | 8.12 | 10.7 |

Results on iWildCam & Amazon We report in Fig. 5 and Table 4 the SC performance of different score functions on iWildCam and Amazon. Similar to the ImageNet experiment above, scores designed for OOD detection (RLmax, Energy, KNN and ViM) do not have satisfactory performance in SC. By contrast, existing SR-based scores (SIRC, SRmax, SRent and SRdoctor) all demonstrate better SC potential than OOD score functions, and our margin-based score functions (RLconf-M and RLgeo-M) perform on par with the SR-based scores.

[Figure 5 panels: iWildCam - FLYP: (a) In-D, (b) In-D + Shift (Cov); Amazon - LISA: (c) In-D, (d) In-D + Shift (Cov)]

Figure 5: RC curves of different confidence-score functions on the model FLYP for iWildCam and the model LISA for Amazon. (a)&(c) are RC curves evaluated using In-D samples only and (b)&(d) are RC curves evaluated using both In-D and covariate-shifted samples.

Table 4: Summary of AURC-α for Fig. 5. The AURC numbers are on the $10^{-2}$ scale; the lower, the better. The score functions proposed for SC are highlighted in gray, and the rest are originally for OOD detection. The best AURC numbers for each coverage level are highlighted in bold, and the 2nd and 3rd best scores are underlined.

iWildCam - FLYP

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | In-D + Shift (Cov), α=0.1 | In-D + Shift (Cov), α=0.5 | In-D + Shift (Cov), α=1 |
|---|---|---|---|---|---|---|
| RLconf-M | 1.63 | 3.88 | 10.2 | 1.84 | 3.21 | 10.0 |
| RLgeo-M | 1.63 | 3.88 | 10.1 | 1.84 | 3.21 | 10.0 |
| SIRC | 1.45 | 3.72 | 9.84 | 1.38 | 3.5 | 9.94 |
| SRmax | 1.45 | 3.87 | 10.0 | 1.38 | 3.61 | 10.1 |
| SRent | 1.46 | 4.03 | 10.6 | 1.34 | 3.94 | 10.6 |
| SRdoctor | 1.45 | 3.87 | 10.1 | 1.38 | 3.62 | 10.1 |
| RLmax | 29.1 | 21.4 | 24.7 | 25.5 | 24.8 | 27.9 |
| Energy | 35.2 | 28.3 | 29.9 | 36.1 | 33.2 | 34.4 |
| KNN | 6.40 | 11.1 | 15.3 | 8.16 | 5.10 | 10.7 |
| ViM | 13.4 | 10.7 | 15.7 | 6.98 | 6.47 | 12.2 |

Amazon - LISA

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | In-D + Shift (Cov), α=0.1 | In-D + Shift (Cov), α=0.5 | In-D + Shift (Cov), α=1 |
|---|---|---|---|---|---|---|
| RLconf-M | 1.11 | 5.31 | 12.5 | 1.83 | 6.91 | 14.2 |
| RLgeo-M | 1.13 | 5.51 | 12.8 | 1.86 | 7.15 | 14.6 |
| SIRC | 1.14 | 5.09 | 12.2 | 1.88 | 6.66 | 13.9 |
| SRmax | 1.14 | 5.13 | 12.3 | 1.88 | 6.70 | 14.0 |
| SRent | 1.15 | 5.06 | 12.1 | 1.89 | 6.61 | 13.8 |
| SRdoctor | 1.14 | 5.13 | 12.2 | 1.88 | 6.70 | 13.9 |
| RLmax | 1.26 | 5.21 | 12.5 | 1.98 | 6.88 | 14.4 |
| Energy | 1.26 | 5.37 | 12.8 | 1.98 | 6.88 | 14.4 |
| KNN | 12.1 | 14.3 | 18.2 | 16.1 | 16.5 | 20.1 |
| ViM | 2.33 | 8.72 | 15.0 | 3.55 | 10.4 | 16.7 |

4.2 Comparison with a training-based confidence-score function

We also compare with a training-based method, ScNet (Geifman & El-Yaniv, 2019). ScNet consists of a selection network and a classifier that are structurally decoupled and trained together, allowing us to perform a faithful comparison of selection scores with a fixed classifier.¹⁰ As shown above, score functions designed for OOD detection perform poorly for generalized SC, so here we focus on comparing our margin-based and SR-based score functions with ScNet. We first train ScNet using the training sets of CIFAR-10 and ImageNet, respectively; see Appendix F for training details. After training, we fix both the classification and the selection heads and compute the scores and selection risks using the test setup shown in Table 2: (i) the ScNet selection score is taken directly from the selection head, and (ii) the margin-based and SR-based scores are computed using the classification head.

Results We show in Fig. 6 the RC curves achieved using ScNet, SR-based, and margin-based scores. For the CIFAR experiment shown in Fig. 6 (a)&(b), ScNet and RLconf-M perform comparably and are better than SRmax and SIRC, whereas for the ImageNet experiment in Fig. 6 (c)&(d), RLconf-M, RLgeo-M, SRmax and SIRC perform comparably and are better than ScNet.¹¹ Surprisingly, ScNet does not always lead to the best performance, even though it has access to training data. In contrast, our margin-based scores consistently exhibit good SC performance.

[Figure 6 panels: CIFAR - ScNet: (a) In-D, (b) In-D + Shift (Both); ImageNet - ScNet: (c) In-D, (d) In-D + Shift (Both)]

Figure 6: RC curves of different confidence-score functions on the model ScNet for CIFAR and ImageNet. (a)&(c) are RC curves evaluated using In-D samples only and (b)&(d) are RC curves evaluated using In-D samples together with both covariate-shifted and label-shifted samples.

¹⁰We do not consider training-based score functions such as Liu et al. (2019); Huang et al. (2022) due to the ambiguity in calculating their SR responses. During their training, a virtual class "abstention" is added and the softmax normalization is applied on all logits including that of the virtual class, so it is unfair either to simply drop the abstention logit during test for score calculation or to keep the abstention logit but modify the score calculation procedure. Retraining a classifier with the same settings but without the abstention logit is also unfair due to the requirement of a fixed classifier. Furthermore, Feng et al. (2023) report that the above selection methods (Liu et al., 2019; Huang et al., 2022) are not as effective as they claim.

¹¹Existing training-based SC works so far have only reported SC (In-D) performance on the CIFAR-10 dataset and have not experimented with ImageNet using the full training set. Our results on the CIFAR-10 dataset faithfully reproduce the result originally reported in Geifman & El-Yaniv (2019).

4.3 Summary of experimental results

From all the above experiments, we can conclude that (i) existing nontraining-based score functions for OOD detection do not perform well for generalized SC and do not help achieve reliable classification performance after rejecting low-confidence samples, and (ii) our proposed margin-based score functions RLgeo-M and RLconf-M consistently perform comparably to or better than existing SR-based scores on all DL models we have tested, especially in the low-risk regime, which is of particular interest for high-stakes problems. These results confirm the superiority of RLgeo-M and RLconf-M as effective confidence-score functions for SC even under moderate distribution shifts for risk-sensitive applications.

In most of our experiments, RLgeo-M and RLconf-M perform similarly; only in rare cases, e.g., Fig. 5 (a) and Fig. 6 (b), does RLconf-M slightly outperform RLgeo-M. However, we do not think this is sufficient to conclude that RLconf-M is better than RLgeo-M, or vice versa. Recalling how RLconf-M and RLgeo-M are defined in Eqs. (11) and (12) and their associated decision rules, we note that the current practice of training DL classifiers is in favor of RLconf-M.¹² Thus, understanding the difference in behavior of RLgeo-M and RLconf-M is likely to also involve investigation of the training process, which we will leave for future work.

5 Conclusion and discussion

In this paper, we have proposed generalized selective classification, a new selective classification (SC) framework that allows distribution shifts. This is motivated by the pressing need to achieve reliable classification for real-world, risk-sensitive applications where data can come from the wild in deployment. Generalized SC covers and unifies existing selective classification and out-of-distribution (OOD) detection, and we have proposed two margin-based score functions for generalized SC, RLgeo-M and RLconf-M, which are not based on training: they are compatible with any given pretrained classifier. Through our extensive analysis and experiments, we have shown the superiority of RLgeo-M and RLconf-M over numerous recently proposed nontraining-based score functions for SC and OOD detection. As the first work that touches on generalized SC, our paper can inspire several lines of future research, including at least: (i) to further improve the SC performance, one can try to align the training objective with our SC confidence-score functions here, i.e., promoting large margins; (ii) in this paper, we only consider the case where all classes are treated equally, while practical generalized SC might entail different rejection weights and costs for different classes, e.g., medical diagnosis of diseases with different levels of health implications; (iii) last but not least, finding better confidence-score functions. We hope that our small step here stimulates further research on generalized SC, bridging the widespread gaps between exploratory AI development and reliable AI deployment for practical high-stakes applications.

¹²The cross-entropy loss is the most commonly used training loss, and minimizing it can be viewed as approximately maximizing the confidence margin. To see this, without loss of generality, assume that the magnitudes of the raw logits are ordered as $z_1 > z_2 > \cdots > z_K$ and that the true label of the current sample is class 1. Then the cross-entropy loss for the current sample is

$$-\log \frac{e^{z_1}}{\sum_i e^{z_i}} = \log \sum_i e^{z_i - z_1} = \log\Big(1 + \sum_{i \ge 2} e^{z_i - z_1}\Big),$$

so $\min\, -\log\big(e^{z_1} / \sum_i e^{z_i}\big) \iff \min\, \log\big(1 + \sum_{i \ge 2} e^{z_i - z_1}\big)$, where the last minimization problem can be approximated by

$$\min\, e^{z_2 - z_1} \iff \min\, (z_2 - z_1) \iff \max\, (z_1 - z_2),$$

i.e., maximizing the confidence margin, when $e^{z_2 - z_1} \gg \sum_{i \ge 3} e^{z_i - z_1}$.
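The approximation sketched in footnote 12 can be checked numerically. The snippet below is only an illustration with made-up logits: it evaluates the cross-entropy loss in the form $\log(1 + \sum_{i \ge 2} e^{z_i - z_1})$ and compares it with the runner-up term $e^{z_2 - z_1}$, assuming that the true class is listed first and that the runner-up dominates the remaining terms.

```python
import math


def cross_entropy_true_class_first(logits):
    """Cross-entropy loss when class 1 (index 0) is the true class:
    log(1 + sum_{i >= 2} exp(z_i - z_1))."""
    z1 = logits[0]
    return math.log1p(sum(math.exp(z - z1) for z in logits[1:]))


def runner_up_term(logits):
    """exp(z_2 - z_1), assuming the logits are sorted in decreasing order."""
    return math.exp(logits[1] - logits[0])


small_margin = [6.0, 2.0, -3.0, -4.0]   # hypothetical logits, margin 4
large_margin = [7.0, 2.0, -3.0, -4.0]   # same sample with margin 5

for z in (small_margin, large_margin):
    print(cross_entropy_true_class_first(z), runner_up_term(z))
```

In this example the loss is close to the runner-up term, and enlarging the confidence margin $z_1 - z_2$ lowers both quantities, in line with the argument of the footnote.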

Acknowledgments

Liang H. and Sun J. are partially supported by NIH fund R01NS131314. Peng L. and Sun J. are partially supported by NIH fund R01CA287413. The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported in this article. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research is also part of AI-CLIMATE: AI Institute for Climate-Land Interactions, Mitigation, Adaptation, Tradeoffs and Economy, and is supported by USDA National Institute of Food and Agriculture (NIFA) and the National Science Foundation (NSF) National AI Research Institutes Competitive Award no. 2023-67021-39829.

References

Sara Beery, Elijah Cole, and Arvi Gjoka. The iwildcam 2020 competition dataset. arXiv preprint arXiv:2004.10340, 2020.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

Luís Felipe Prates Cattelan and Danilo Silva. On selective classification under distribution shift. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models.

C Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41-46, 1970.

Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick Pérez. Addressing failure prediction by learning model confidence. Advances in Neural Information Processing Systems, 32, 2019.

Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. Advances in Neural Information Processing Systems, 29, 2016.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec):265-292, 2001.

Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.

Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233-240, 2006.

Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. arXiv preprint arXiv:2209.09858, 2022.

Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable bayesian neural nets with rank-1 factors. In International conference on machine learning, pp. 2782-2792. PMLR, 2020.

Ran El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010.

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19358-19369, 2023.

Leo Feng, Mohamed Osama Ahmed, Hossein Hajimirsadeghi, and Amir H Abdi. Towards better selective classification. In The Eleventh International Conference on Learning Representations, 2023.

Adam Fisch, Tommi S Jaakkola, and Regina Barzilay. Calibrated selective classification. Transactions on Machine Learning Research, 2022.

Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1-49, 2023a.

Vojtech Franc, Daniel Prusa, and Jakub Paplham. Reject option models comprising out-of-distribution detection. arXiv preprint arXiv:2307.05199, 2023b.

Vojtech Franc, Jakub Paplham, and Daniel Prusa. Scod: From heuristics to theory. arXiv preprint arXiv:2403.16916, 2024.

Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines: First International Workshop, SVM 2002 Niagara Falls, Canada, August 10, 2002 Proceedings, pp. 68-82. Springer, 2002.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050-1059. PMLR, 2016.

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017.

Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning, pp. 2151-2159. PMLR, 2019.

Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018.

Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set recognition: A survey. IEEE transactions on pattern analysis and machine intelligence, 43(10):3614-3631, 2020.

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19338-19347, 2023.

Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. Advances in neural information processing systems, 21, 2008.

Federica Granese, Marco Romanelli, Daniele Gorla, Catuscia Palamidessi, and Pablo Piantanida. Doctor: A simple method for detecting misclassification errors. Advances in Neural Information Processing Systems, 34:5669-5681, 2021.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp. 1321-1330. PMLR, 2017.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2019.

Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: Bridging supervised and self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. To trust or not to trust a classifier. Advances in neural information processing systems, 31, 2018.

Jihyo Kim, Jiin Koo, and Sangheum Hwang. A unified benchmark for the unknown detection capability of deep neural networks. Expert Systems with Applications, 229:120461, 2023.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637-5664. PMLR, 2021.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.

Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. Advances in neural information processing systems, 2, 1989.

Jing Lei. Classification with confidence. Biometrika, 101(4):755-769, 2014.

Hengyue Liang, Buyun Liang, Le Peng, Ying Cui, Tim Mitchell, and Ju Sun. Optimization and optimizers for adversarial robustness. arXiv preprint arXiv:2303.13401, 2023.

Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.

Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464-21475, 2020.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976-11986, 2022.

Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory. Advances in Neural Information Processing Systems, 32, 2019.

Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. Advances in neural information processing systems, 32, 2019.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.

Jens Müller, Stefan T Radev, Robert Schmier, Felix Draxler, Carsten Rother, and Ullrich Köthe. Finding competence regions in domain generalization. arXiv preprint arXiv:2303.09989, 2023.

Jianmo Ni, Jiacheng Li, and Julian Mc Auley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR workshops, volume 2, 2019.

Jaewoo Park, Yoon Gyo Jung, and Andrew Beng Jin Teoh. Nearest neighbor guidance for out-of-distribution detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1686–1695, 2023.

Tadeusz Pietraszek. Optimizing abstaining classifiers using roc analysis. In Proceedings of the 22nd international conference on Machine learning, pp. 665–672, 2005.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.

Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.

Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015.

Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34:144–157, 2021.

Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pp. 20827–20840. PMLR, 2022.

Thomas Villmann, Marika Kaden, Andrea Bohnsack, J-M Villmann, T Drogies, Sascha Saralajew, and Barbara Hammer. Self-adjusting reject options in prototype based classification. In Advances in Self Organizing Maps and Learning Vector Quantization: Proceedings of the 11th International Workshop WSOM 2016, Houston, Texas, USA, January 6-8, 2016, pp. 269–279. Springer, 2016.

Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4921–4930, 2022.

Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Guoxuan Xia and Christos-Savvas Bouganis. Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian Conference on Computer Vision, pp. 1995–2012, 2022.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.

Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.


Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, et al. Openood: Benchmarking generalized out-of-distribution detection. Advances in Neural Information Processing Systems, 35:32598–32611, 2022.

Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp. 25407–25437. PMLR, 2022.

Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. Volo: Vision outlooker for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

Xu-Yao Zhang, Guo-Sen Xie, Xiuli Li, Tao Mei, and Cheng-Lin Liu. A survey on learning to reject. Proceedings of the IEEE, 111(2):185–215, 2023.

Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Rethinking confidence calibration for failure prediction. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV, pp. 518–536. Springer, 2022.

A Linear SVM and margins

We first consider binary classification. Assume a training set $\{(x_i, y_i)\}_{i \in [N]}$ (with $[N] \doteq \{1, \dots, N\}$), where $y_i \in \{+1, -1\}$. For notational simplicity, we assume that an extra $1$ has been appended to the original feature vectors, so that we only need to consider the homogeneous form of the predictor: $f(x) = w^\top x$. The basic idea of SVM is to maximize the worst signed geometric margin, which makes sense no matter whether the data are separable or not:

$$\max_{w} \min_{i \in [N]} \frac{y_i w^\top x_i}{\|w\|}. \tag{14}$$

Note that the problem is non-convex due to the fractional form $y_i w^\top x_i / \|w\|$. Moreover, $y_i w^\top x_i / \|w\|$ is invariant to the rescaling of $w$, which is bad for numerical computation (as this implies that there exist global solutions arbitrarily close to $0$ and $\infty$).

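If the training set is separable, i.e., there exists a $w$ such that $y_i w^\top x_i > 0$, $\forall i \in [N]$, there also exists a $w$ so that $\min_i y_i w^\top x_i = 1$ $\forall i \in [N]$ by a simple rescaling argument. Then Eq. (14) becomes

$$\max_{w} \min_{i \in [N]} \frac{y_i w^\top x_i}{\|w\|} \quad \text{s.t.} \quad \min_{i} y_i w^\top x_i = 1 \;\; \forall i \in [N] \tag{15}$$

$$\Longleftrightarrow \quad \max_{w} \frac{1}{\|w\|} \quad \text{s.t.} \quad \min_{i} y_i w^\top x_i = 1 \;\; \forall i \in [N] \tag{16}$$

$$\Longleftrightarrow \quad \min_{w} \|w\| \quad \text{s.t.} \quad y_i w^\top x_i \ge 1 \;\; \forall i \in [N], \tag{17}$$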

where Eq. (17) is our textbook hard-margin SVM (except for the squared norm often used in the objective). A problem with Eq. (17) is that the constraint set is infeasible for inseparable training data. To fix this issue, we can allow slight violations in the constraint and penalize these violations in the objective of Eq. (17), arriving at

$$\min_{w} \|w\|^2 + C \sum_{i \in [N]} \xi_i \quad \text{s.t.} \quad y_i w^\top x_i \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i \in [N], \tag{18}$$

which is our textbook soft-margin SVM.
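As a side remark that is not part of the original derivation, the slack variables in Eq. (18) can be eliminated, which gives the familiar hinge-loss form. The step below is a standard textbook reformulation, written with the same symbols as Eq. (18):

```latex
% For fixed w, the smallest feasible slack in Eq. (18) is
%   \xi_i = \max\{0,\; 1 - y_i w^\top x_i\},
% so Eq. (18) has the same minimizers in w as the unconstrained problem
\min_{w} \; \|w\|^2 + C \sum_{i \in [N]} \max\{0,\; 1 - y_i w^\top x_i\}.
```

In this form, a sample contributes to the penalty only when its confidence margin $y_i w^\top x_i$ falls below $1$.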


Now for multiclass classification, let us assume the data space $\mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \{1, \dots, K\}$ with $K \ge 3$. The classifier takes the form $f(x) = W^\top x$, where $W \in \mathbb{R}^{d \times K}$. We note that from binary SVM, people create the notion of confidence margin:

$$\mathrm{ConfMargin}(x_i, w) \doteq y_i w^\top x_i, \tag{19}$$

which for the binary case is simply the signed geometric margin rescaled by $\|w\|$. The standard multiclass decision rule is¹³

$$\arg\max_{j \in [K]} w_j^\top x, \tag{20}$$

where $w_j$ is the $j$-th column of $W$. To correctly classify all points, we need

$$\forall i \in [N], \; y_i = \arg\max_{j \in [K]} w_j^\top x_i \quad \Longleftrightarrow \quad \forall i \in [N], \; w_{y_i}^\top x_i > \max_{y \in \mathcal{Y} \setminus \{y_i\}} w_y^\top x_i. \tag{21}$$

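This motivates the multiclass hard-margin SVM, separability assumed:

$$\min_{W} \sum_{j \in [K]} \|w_j\|^2 \quad \text{s.t.} \quad w_{y_i}^\top x_i - \max_{y \in \mathcal{Y} \setminus \{y_i\}} w_y^\top x_i \ge 1, \;\; \forall i \in [N], \tag{22}$$

where the terms $w_{y_i}^\top x_i - \max_{y \in \mathcal{Y} \setminus \{y_i\}} w_y^\top x_i$ can be viewed as multiclass confidence margins, natural generalizations of confidence margins for the binary case. The corresponding soft-margin version is

$$\min_{W} \sum_{j \in [K]} \|w_j\|^2 + C \sum_{i \in [N]} \xi_i \quad \text{s.t.} \quad w_{y_i}^\top x_i - \max_{y \in \mathcal{Y} \setminus \{y_i\}} w_y^\top x_i \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i \in [N]. \tag{23}$$

Both hard- and soft-margin versions are convex and thus more convenient for numerical optimization.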

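On the other hand, if we strictly follow the geometric margin interpretation, it seems more natural to formulate multiclass SVM as follows. Consider the decision rule:

$$\arg\max_{j \in [K]} \frac{w_j^\top x}{\|w_j\|}, \tag{24}$$

which would classify all points correctly provided that there exists a $W \in \mathbb{R}^{d \times K}$ satisfying

$$\forall i \in [N], \quad \frac{w_{y_i}^\top x_i}{\|w_{y_i}\|} > \max_{y \in \mathcal{Y} \setminus \{y_i\}} \frac{w_y^\top x_i}{\|w_y\|}. \tag{25}$$

This motivates an optimization problem on the worst geometric margins:

$$\max_{W} \min_{i \in [N]} \left[ \frac{w_{y_i}^\top x_i}{\|w_{y_i}\|} - \max_{y \in \mathcal{Y} \setminus \{y_i\}} \frac{w_y^\top x_i}{\|w_y\|} \right]. \tag{26}$$

However, this problem is non-convex and thus not popularly adopted.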
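To make the difference between the two multiclass margins concrete, the following is a minimal sketch in plain Python. The function names `confidence_margin` and `geometric_margin`, the list-of-columns representation `W_cols`, and the toy numbers are illustrative assumptions, not taken from the released code. The first function evaluates the margin term of Eq. (22), and the second evaluates the bracketed term of Eq. (26).

```python
import math

def confidence_margin(W_cols, x, y):
    """Multiclass confidence margin: w_y^T x - max_{j != y} w_j^T x."""
    logits = [sum(wk * xk for wk, xk in zip(w, x)) for w in W_cols]
    rival = max(l for j, l in enumerate(logits) if j != y)
    return logits[y] - rival

def geometric_margin(W_cols, x, y):
    """Same comparison after dividing each logit by the norm of its column w_j."""
    scaled = []
    for w in W_cols:
        norm = math.sqrt(sum(wk * wk for wk in w))
        scaled.append(sum(wk * xk for wk, xk in zip(w, x)) / norm)
    rival = max(s for j, s in enumerate(scaled) if j != y)
    return scaled[y] - rival

# Toy example (hypothetical): K = 3 classes, d = 2; each entry is one column w_j.
W_cols = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 1.5]
conf = confidence_margin(W_cols, x, y=0)
geo = geometric_margin(W_cols, x, y=0)

# Rescaling all columns by 10 changes the confidence margin but not the geometric one.
W_scaled = [[10.0 * v for v in w] for w in W_cols]
conf_scaled = confidence_margin(W_scaled, x, y=0)
geo_scaled = geometric_margin(W_scaled, x, y=0)
```

In this toy case the two margins have opposite signs for class 0, so the decision rules in Eq. (20) and Eq. (24) disagree on the same point.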

B Asymptotic behaviors of SRmax, SRdoctor, and SRent

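Recall from mathematical analysis that two functions $f(x)$ and $g(x)$ are asymptotically equivalent as $x \to \infty$, written as $f(x) \sim g(x)$ as $x \to \infty$, if and only if $f(x) = g(x)(1 + o(1))$ as $x \to \infty$, where $o(\cdot)$ is the standard small-o notation. Note that $f(x) \sim g(x) \Longleftrightarrow g(x) \sim f(x)$.

Lemma B.1. Consider the raw logits $z$, and without loss of generality assume that they are ordered in descending order without any ties, i.e., $z^{(1)} > z^{(2)} > \cdots$. We have that as $\lambda \to \infty$,

$$\mathrm{SR}_{\max}(\lambda z) \sim \exp\left(-e^{\lambda (z^{(2)} - z^{(1)})}\right), \quad \mathrm{SR}_{\mathrm{doctor}}(\lambda z) \sim 1 - \exp\left(2 e^{\lambda (z^{(2)} - z^{(1)})}\right), \quad \mathrm{SR}_{\mathrm{ent}}(\lambda z) \sim -e^{\lambda (z^{(2)} - z^{(1)})}.$$

Moreover, all of the asymptotic functions are monotonically increasing with respect to $z^{(1)} - z^{(2)}$.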

¹³ The decision rule for the binary case is $\arg\max_{y \in \{+1, -1\}} y\, w^\top x$. Therefore, we do not need to worry about the scaling of $w$.


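Proof. First, for $\mathrm{SR}_{\max}$, we have

$$\log \mathrm{SR}_{\max}(\lambda z) = \log \frac{e^{\lambda z^{(1)}}}{\sum_i e^{\lambda z^{(i)}}} = -\log \sum_i e^{\lambda (z^{(i)} - z^{(1)})} \sim -\log\left(1 + e^{\lambda (z^{(2)} - z^{(1)})}\right) \tag{27}$$

as $\lambda \to \infty$, because $\sum_{i \ge 3} e^{\lambda (z^{(i)} - z^{(1)})} / \left(1 + e^{\lambda (z^{(2)} - z^{(1)})}\right) \to 0$ as $\lambda \to \infty$. Moreover, as $\lambda \to \infty$,

$$e^{\lambda (z^{(2)} - z^{(1)})} \to 0 \implies \log\left(1 + e^{\lambda (z^{(2)} - z^{(1)})}\right) \sim e^{\lambda (z^{(2)} - z^{(1)})}, \tag{28}$$

as $\log(1 + x) \sim x$ when $x \to 0$. So we conclude that

$$\mathrm{SR}_{\max}(\lambda z) \sim \exp\left(-e^{\lambda (z^{(2)} - z^{(1)})}\right) \quad \text{as } \lambda \to \infty. \tag{29}$$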

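Now consider $\mathrm{SR}_{\mathrm{doctor}}$. Applying a similar argument as above, we have

$$
\begin{aligned}
\log \|\sigma(\lambda z)\|_2^2 &= \log \frac{\sum_i e^{2 \lambda z^{(i)}}}{\left(\sum_j e^{\lambda z^{(j)}}\right)^2} = \log \frac{\sum_i e^{2 \lambda (z^{(i)} - z^{(1)})}}{\left(\sum_j e^{\lambda (z^{(j)} - z^{(1)})}\right)^2} \\
&= -2 \log \sum_j e^{\lambda (z^{(j)} - z^{(1)})} + \log \sum_i e^{2 \lambda (z^{(i)} - z^{(1)})} && \text{(30)} \\
&\sim -2 \log\left(1 + e^{\lambda (z^{(2)} - z^{(1)})}\right) + \log\left(1 + e^{2 \lambda (z^{(2)} - z^{(1)})}\right) && \text{(31)} \\
&\sim -2 e^{\lambda (z^{(2)} - z^{(1)})} + e^{2 \lambda (z^{(2)} - z^{(1)})} && \text{(32)} \\
&\sim -2 e^{\lambda (z^{(2)} - z^{(1)})} && \text{(33)}
\end{aligned}
$$

as $\lambda \to \infty$, where Eq. (33) holds as $e^{2 \lambda (z^{(2)} - z^{(1)})}$ is lower order than $2 e^{\lambda (z^{(2)} - z^{(1)})}$ when $z^{(2)} - z^{(1)} < 0$ so that $e^{\lambda (z^{(2)} - z^{(1)})} < 1$. Therefore, as $\lambda \to \infty$,

$$\mathrm{SR}_{\mathrm{doctor}}(\lambda z) = 1 - \frac{1}{\|\sigma(\lambda z)\|_2^2} \sim 1 - \exp\left(2 e^{\lambda (z^{(2)} - z^{(1)})}\right). \tag{34}$$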

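Finally, for $\mathrm{SR}_{\mathrm{ent}}$, we have that when $\lambda \to \infty$,

$$
\begin{aligned}
\mathrm{SR}_{\mathrm{ent}}(\lambda z) &= \sum_i \frac{e^{\lambda z^{(i)}}}{\sum_j e^{\lambda z^{(j)}}} \log \frac{e^{\lambda z^{(i)}}}{\sum_j e^{\lambda z^{(j)}}} = \sum_i \frac{e^{\lambda (z^{(i)} - z^{(1)})}}{\sum_j e^{\lambda (z^{(j)} - z^{(1)})}} \log \frac{e^{\lambda (z^{(i)} - z^{(1)})}}{\sum_j e^{\lambda (z^{(j)} - z^{(1)})}} && \text{(35)} \\
&= \frac{1}{\sum_j e^{\lambda (z^{(j)} - z^{(1)})}} \sum_i e^{\lambda (z^{(i)} - z^{(1)})} \left[ \lambda (z^{(i)} - z^{(1)}) - \log \sum_j e^{\lambda (z^{(j)} - z^{(1)})} \right] && \text{(36)} \\
&\sim \frac{1}{\sum_j e^{\lambda (z^{(j)} - z^{(1)})}} \sum_i \left[ e^{\lambda (z^{(i)} - z^{(1)})} \lambda (z^{(i)} - z^{(1)}) \right] && \text{(37)} \\
&\sim \frac{1}{\sum_j e^{\lambda (z^{(j)} - z^{(1)})}} \left[ e^{\lambda (z^{(2)} - z^{(1)})} \lambda (z^{(2)} - z^{(1)}) \right], && \text{(38)}
\end{aligned}
$$

where Eq. (37) holds as $\log \sum_j e^{\lambda (z^{(j)} - z^{(1)})} / \left(\lambda (z^{(i)} - z^{(1)})\right) \in o(1)$ when $\lambda \to \infty$, and Eq. (38) holds because $\sum_{i \ge 3} e^{\lambda (z^{(i)} - z^{(1)})} \lambda (z^{(i)} - z^{(1)}) / \left(e^{\lambda (z^{(2)} - z^{(1)})} \lambda (z^{(2)} - z^{(1)})\right) = \sum_{i \ge 3} e^{\lambda (z^{(i)} - z^{(2)})} (z^{(i)} - z^{(1)}) / (z^{(2)} - z^{(1)}) \in o(1)$ as $\lambda \to \infty$. Continuing the above argument, we further have that as $\lambda \to \infty$,

$$\log\left(-\mathrm{SR}_{\mathrm{ent}}(\lambda z)\right) \sim -\log \sum_j e^{\lambda (z^{(j)} - z^{(1)})} + \lambda (z^{(2)} - z^{(1)}) + \log\left(-\lambda (z^{(2)} - z^{(1)})\right). \tag{39}$$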


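Let us write $x \doteq -\lambda (z^{(2)} - z^{(1)})$. The last two terms in Eq. (39) can be re-written as $-x + \log(x)$. Since $\lim_{x \to \infty} \log(x) / x = 0$, we thus have $-x + \log(x) = -x(1 + o(1))$ as $x \to \infty$, and hence $-x + \log(x) \sim -x$ by the definition of the asymptotic equivalence. Therefore, we have:

$$\log\left(-\mathrm{SR}_{\mathrm{ent}}(\lambda z)\right) \sim -\log\left(1 + e^{\lambda (z^{(2)} - z^{(1)})}\right) + \lambda (z^{(2)} - z^{(1)}) \tag{40}$$

$$\sim -e^{\lambda (z^{(2)} - z^{(1)})} + \lambda (z^{(2)} - z^{(1)}) \sim \lambda (z^{(2)} - z^{(1)}). \tag{41}$$

So we conclude that

$$\mathrm{SR}_{\mathrm{ent}}(\lambda z) \sim -\exp\left(\lambda (z^{(2)} - z^{(1)})\right) \quad \text{as } \lambda \to \infty, \tag{42}$$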

completing the proof.
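The asymptotic statements can be probed numerically. The following is a minimal sketch in plain Python, under two assumptions read off the proof rather than restated from the main text: SRdoctor is taken as $1 - 1/\|\sigma(z)\|_2^2$ (as in Eq. (34)) and SRent as the negative entropy of $\sigma(z)$ (as in Eq. (35)). The helper names and the toy logits are hypothetical. For SRmax and SRdoctor the sketch forms the ratio to the asymptotic expression; for SRent it compares on the logarithmic scale, which is what Eqs. (39)-(41) establish.

```python
import math

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sr_max(z):
    return max(softmax(z))

def sr_doctor(z):
    # Assumed form, read off Eq. (34): 1 - 1 / ||softmax(z)||_2^2.
    return 1.0 - 1.0 / sum(p * p for p in softmax(z))

def sr_ent(z):
    # Assumed form, read off Eq. (35): negative entropy of softmax(z).
    return sum(p * math.log(p) for p in softmax(z) if p > 0.0)

def scaled(z, lam):
    return [lam * v for v in z]

z = [2.0, 1.0, 0.0]                     # sorted logits: z^(1) > z^(2) > z^(3)
gap = z[1] - z[0]                       # z^(2) - z^(1), negative by construction

lam = 10.0
eps = math.exp(lam * gap)
ratio_max = sr_max(scaled(z, lam)) / math.exp(-eps)
ratio_doctor = sr_doctor(scaled(z, lam)) / (-math.expm1(2.0 * eps))

# log(-SRent(lam * z)) divided by lam * (z^(2) - z^(1)), for growing lam.
log_ratios_ent = [
    math.log(-sr_ent(scaled(z, lam_k))) / (lam_k * gap)
    for lam_k in (10.0, 100.0, 500.0)
]
```

With these toy logits, the first two ratios are already close to 1 at $\lambda = 10$, while the log-scale ratio for SRent approaches 1 more slowly as $\lambda$ grows.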

C Evaluation metrics for OOD detection vs. evaluation metrics for generalized SC

Table 5: Evaluation of $s_1$ and $s_2$ using popular OOD metrics. The better numbers are highlighted in bold.

| OOD metric | $s_1$ | $s_2$ |
|---|---|---|
| AUROC (↑) | 0.765 | **0.944** |
| AUPR (↑) | 0.987 | **0.997** |
| FPR@TPR=0.95 (↓) | 0.816 | **0.279** |

The commonly used evaluation metrics for OOD detection do not reflect the classification performance (Franc et al., 2023b). Here we provide a quantitative supporting example, in comparison with the RC curve for generalized SC.

OOD (mostly label-shift) detection as formulated in Eq. (9) can be viewed as a binary classification problem: selected and rejected samples form the two classes. So pioneering work on OOD detection, such as Hendrycks & Gimpel (2016), proposes to evaluate OOD detection in a manner similar to that of binary classification, e.g., using the Area Under the Receiver Operating Characteristic curve (AUROC) (Davis & Goadrich, 2006) and the Area Under the Precision-Recall curve (AUPR) (Saito & Rehmsmeier, 2015) to measure the separability of In-D and OOD samples.¹⁴ However, two important aspects are missing in OOD detection, and hence also in its performance evaluation, if we are to focus on the performance on the accepted samples:

1. Pretrained classifiers do not always make wrong predictions on label-shifted samples, and hence these OOD samples should not be blindly rejected;
2. In-D samples that might have been correctly classified can be rejected due to poor separation of In-D and OOD samples, leading to worse classification performance on the selected part.
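To see why the OOD metrics cannot capture these two aspects, it helps to look at what they take as input. The sketch below is a plain-Python illustration under stated assumptions: In-D samples are treated as the positive class, a higher score means "more In-D", and the function names and toy scores are hypothetical. Neither function receives any information about whether the classifier's prediction on a sample is correct.

```python
import math

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """FPR at the highest threshold that still accepts at least `tpr` of the positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))        # number of positives that must be accepted
    threshold = ranked[k - 1]
    accepted_neg = sum(1 for n in neg_scores if n >= threshold)
    return accepted_neg / len(neg_scores)

# Toy scores (hypothetical): five In-D samples and three OOD samples.
in_d = [0.9, 0.8, 0.7, 0.6, 0.3]
ood = [0.5, 0.4, 0.2]
toy_auroc = auroc(in_d, ood)
toy_fpr = fpr_at_tpr(in_d, ood, tpr=0.95)
```

The quadratic-time AUROC is chosen for readability; a rank-based implementation would be preferable at ImageNet scale.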

To demonstrate our points quantitatively, we take the pretrained model EVA¹⁵ from timm (Wightman, 2019) that achieves > 88% top-1 accuracy on the ImageNet validation set. We then mix the ImageNet validation set (In-D samples) with ImageNet-O (OOD samples, label shifted) (Hendrycks & Dietterich, 2018), and evaluate two score functions $s_1$ and $s_2$¹⁶ using both the generalized SC formulation (via RC curves) and OOD detection (via AUROC and AUPR).

According to Table 5, $s_2$ is considered superior to $s_1$ by all metrics for OOD detection. Correspondingly, from Fig. 7(a) and (b), we observe that the scores of the label-shifted samples (green) and those of the In-D samples (blue and orange) are more separated by $s_2$ than by $s_1$. However, we can also quickly notice one issue: In-D samples are not completely separated from OOD samples, so a threshold intended to reject label-shifted samples will inevitably reject a portion of In-D samples at the same time, even though a large portion of In-D samples have been correctly classified (blue); moreover, In-D samples that can be correctly classified (blue) are less separated from those misclassified ones (orange) by $s_2$ than by $s_1$. This problem cannot be revealed by the OOD metrics in Table 5, but is captured by the RC curves in Fig. 7(c), where the selection risk of $s_2$ (blue) increases as more OOD samples are rejected (TPR from 0.95 to 0.1, as indicated by the vertical dashed lines). In contrast, the more samples rejected by $s_1$ (smaller coverage), the lower the selection risk, implying that $s_1$ serves SC better.

¹⁴ A single-point metric, False Positive Rate (FPR) at 0.95 True Positive Rate (TPR), is also popularly used as a companion (Liang et al., 2017; Wang et al., 2022; Liu et al., 2020; Djurisic et al., 2022; Sun et al., 2022; Yang et al., 2022).
¹⁵ See Appendix E for model card information. This model is also used in the experiments of Section 4.
¹⁶ $s_1$ is our proposed RLconf-M and $s_2$ is ViM.

(a) $s_1$ score distributions (b) $s_2$ score distributions (c) RC curves

Figure 7: Score distributions of $s_1$ and $s_2$ (a)-(b) and their RC curves (c). In (a) and (b), In-D samples that are correctly classified by EVA are shown in blue, while In-D samples that are incorrectly classified are shown in orange; OOD samples (label-shifted) are shown in green. The vertical dashed lines in (a)-(c) correspond to different True-Positive-Rate cutoffs in the AUROC metric in OOD detection.
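For contrast with the OOD metrics, the RC-curve evaluation used in this appendix can be sketched as follows. This is a minimal plain-Python illustration, not the released implementation: the function names `rc_curve` and `aurc` and the toy data are hypothetical, and the AURC here is one simple discretization (the mean selection risk over operating points with coverage at most α), assumed for illustration. Unlike AUROC, the inputs include a per-sample flag indicating whether the classifier's prediction counts as correct.

```python
def rc_curve(scores, correct):
    """Return (coverage, selection risk) pairs, accepting samples from most to least confident."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, errors = [], 0
    for n_accepted, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((n_accepted / len(scores), errors / n_accepted))
    return points

def aurc(points, alpha=1.0):
    """Mean selection risk over all operating points with coverage <= alpha."""
    kept = [risk for cov, risk in points if cov <= alpha]
    return sum(kept) / len(kept)

# Toy data (hypothetical): confidence scores and whether each prediction is counted as correct.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
correct = [True, True, False, True, False, False]
points = rc_curve(scores, correct)
aurc_full = aurc(points, alpha=1.0)
aurc_half = aurc(points, alpha=0.5)
```

A score function that ranks misclassified samples below correctly classified ones yields low risk at low coverage, regardless of how well it separates In-D from OOD samples.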

D Rejection patterns of different score functions

We plot in Fig. 8 the heatmap of the score values for each score function. During SC, samples located in the darker areas (with low score values) will be rejected before those located in the brighter areas (with high score values).

(Panels: RLgeo-M, SRmax, SRent, SRdoctor, RLmax)

Figure 8: Heatmaps of rejection patterns (distribution of scores). Note that because we rescale the scores for good visualization, the colors are not cross-comparable between different score functions.

E Timm model cards

Table 6: Names of model cards in the timm library used to retrieve the models for ImageNet

| Dataset | Model name | Model card name | Top-1 Acc. (%) |
|---|---|---|---|
| ImageNet | EVA (ViT) | eva_giant_patch14_224.clip_ft_in1k | 88.76 |
| ImageNet | ConvNext | convnextv2_base.fcmae_ft_in22k_in1k | 86.25 |
| ImageNet | VOLO | volo_d4_224.sail_in1k | 85.56 |
| ImageNet | ResNext | seresnextaa101d_32x8d.sw_in12k_ft_in1k | 85.94 |

Table 6 shows the names of the model cards used to retrieve the pretrained models for ImageNet from the timm library. Our considerations for choosing these models are as follows: (i) the models should cover a wide range of recent and popular architectures, and (ii) they should achieve high top-1 accuracy to represent recent advances of image classification.
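As an illustration of how the model cards in Table 6 can be used, the sketch below maps the short model names to their card names and loads a model through `timm.create_model`. The dictionary `MODEL_CARDS` and the helper `load_pretrained` are hypothetical names introduced here; calling the helper assumes that timm is installed and that the pretrained weights can be downloaded.

```python
# Short model name -> timm model card name, copied from Table 6.
MODEL_CARDS = {
    "EVA": "eva_giant_patch14_224.clip_ft_in1k",
    "ConvNext": "convnextv2_base.fcmae_ft_in22k_in1k",
    "VOLO": "volo_d4_224.sail_in1k",
    "ResNext": "seresnextaa101d_32x8d.sw_in12k_ft_in1k",
}

def load_pretrained(name):
    """Fetch a pretrained ImageNet classifier by its short name and set it to eval mode."""
    import timm  # imported lazily so the mapping is usable without timm installed
    model = timm.create_model(MODEL_CARDS[name], pretrained=True)
    return model.eval()
```

For example, `load_pretrained("EVA")` would return the EVA classifier in inference mode.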

F Training details for ScNet

We use the unofficial PyTorch implementation¹⁷ of the original SelectiveNet (Geifman & El-Yaniv, 2019) due to the out-of-date Keras environment of the original repository¹⁸. The PyTorch implementation follows the training method proposed in Geifman & El-Yaniv (2019) and faithfully reproduces the results of the CIFAR-10 experiment reported in the original paper. We add the ImageNet experiment on top of the PyTorch code, as it is not included in the original code or the paper. Table 7 summarizes the key hyperparameters used to produce the results reported in this paper.

Table 7: Key hyperparameters for the ScNet training used in this paper

| Dataset | Model architecture | Dropout prob. | Target coverage | Batch size | Total epochs | Lr (base) | Scheduler |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | VGG | 0.3 | 0.7 | 128 | 300 | 0.1 | StepLR |
| ImageNet-1k | resnet34 | N/A | 0.7 | 768 | 250 | 0.1 | CosineAnnealingLR |

G Additional ImageNet experiments

We report in Fig. 9 the RC curves of different score functions on models ConvNext, ResNext, and VOLO for ImageNet, and summarize their AURC statistics in Table 8.

H Ablation experiments for the KNN score

We show in Fig. 10 the SC performance of the KNN score on models EVA, ConvNext, ResNext, and VOLO, respectively, on ImageNet with all In-D and distribution-shifted samples. We can observe that (i) the SC performance of KNN is sensitive to the choice of the hyperparameter $k$, and (ii) our selection $k = 2$ achieves the best SC performance for the KNN score on our ImageNet task.
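To indicate where the hyperparameter $k$ enters, the following is a minimal plain-Python sketch of a k-nearest-neighbor confidence score. It assumes the common construction in which features are L2-normalized and the score is the negative Euclidean distance to the $k$-th nearest feature in a bank of training features; this construction, the function names, and the toy features are assumptions made for illustration and may differ from the configuration used in the experiments.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def knn_score(feature, bank, k):
    """Negative distance to the k-th nearest bank feature (higher means more confident)."""
    q = l2_normalize(feature)
    dists = sorted(math.dist(q, l2_normalize(b)) for b in bank)
    return -dists[k - 1]

# Toy feature bank (hypothetical) with three 2-D training features.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
score_k1 = knn_score(query, bank, k=1)
score_k2 = knn_score(query, bank, k=2)
score_k3 = knn_score(query, bank, k=3)
```

Even in this toy case the score changes substantially with $k$, which is consistent with the sensitivity to $k$ observed in Fig. 10.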

¹⁷ https://github.com/gatheluck/pytorch-SelectiveNet
¹⁸ https://github.com/anonygit32/SelectiveNet

(Figure 9 panels. Rows: ImageNet - ConvNext, ImageNet - ResNext, ImageNet - VOLO. Columns: In-D (ImageNet), In-D + Shift (Cov), In-D + Shift (Label), In-D + Shift (both).)

Figure 9: RC curves of different confidence-score functions on models ConvNext, ResNext, and VOLO from timm for ImageNet. The four columns are RC curves evaluated using samples from In-D only, In-D and covariate-shifted only, In-D and label-shifted only, and all, respectively. We group the curves by whether they are originally proposed for SC (solid lines) or for OOD detection (dashed lines).

(Figure 10 panels: EVA, ConvNext, ResNext, VOLO.)

Figure 10: RC curves achieved by the KNN score with different $k$ on ImageNet


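Table 8: Summary of AURC-α for Fig. 9. The AURC numbers are on the $10^{-2}$ scale; the lower, the better. The score functions proposed for SC are highlighted in gray, and the rest are originally for OOD detection. The best AURC numbers for each coverage level are highlighted in bold, and the 2nd and 3rd best scores are underlined.

ImageNet - ConvNext

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | Cov, α=0.1 | Cov, α=0.5 | Cov, α=1 | Label, α=0.1 | Label, α=0.5 | Label, α=1 | Both, α=0.1 | Both, α=0.5 | Both, α=1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RLconf-M | 0.10 | 0.53 | 3.02 | 0.26 | 1.76 | 8.20 | 0.58 | 2.51 | 11.8 | 0.34 | 1.99 | 8.88 |
| RLgeo-M | 0.15 | 0.59 | 3.10 | 0.31 | 1.75 | 8.14 | 0.75 | 2.54 | 11.8 | 0.38 | 1.97 | 8.81 |
| SIRC | 1.96 | 1.70 | 3.59 | 3.44 | 3.23 | 8.60 | 5.94 | 4.03 | 11.5 | 3.76 | 3.46 | 9.18 |
| SRmax | 2.26 | 1.86 | 3.66 | 3.73 | 3.40 | 8.70 | 5.86 | 4.05 | 11.4 | 4.04 | 3.62 | 9.26 |
| SRent | 2.77 | 2.44 | 4.19 | 4.78 | 4.33 | 9.54 | 6.83 | 4.85 | 11.6 | 5.13 | 4.56 | 10.1 |
| SRdoctor | 2.26 | 1.86 | 3.67 | 3.73 | 3.41 | 8.74 | 5.86 | 4.06 | 11.3 | 4.04 | 3.63 | 9.29 |
| RLmax | 5.43 | 4.77 | 5.81 | 9.05 | 7.89 | 11.6 | 10.5 | 7.73 | 13.2 | 9.45 | 8.13 | 12.1 |
| Energy | 6.66 | 6.70 | 7.54 | 10.9 | 10.7 | 13.9 | 11.9 | 9.78 | 14.6 | 11.3 | 10.9 | 14.3 |
| KNN | 1.01 | 2.37 | 5.72 | 1.29 | 4.54 | 10.6 | 1.11 | 3.66 | 12.0 | 1.31 | 4.59 | 11.0 |
| ViM | 15.1 | 9.84 | 9.49 | 16.2 | 11.9 | 14.3 | 14.1 | 9.57 | 14.5 | 16.2 | 11.9 | 14.7 |

ImageNet - ResNext

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | Cov, α=0.1 | Cov, α=0.5 | Cov, α=1 | Label, α=0.1 | Label, α=0.5 | Label, α=1 | Both, α=0.1 | Both, α=0.5 | Both, α=1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RLconf-M | 0.12 | 0.59 | 3.17 | 0.29 | 2.15 | 9.38 | 0.59 | 3.22 | 12.8 | 0.38 | 2.50 | 10.2 |
| RLgeo-M | 0.17 | 0.60 | 3.18 | 0.34 | 2.14 | 9.33 | 0.65 | 3.16 | 12.7 | 0.43 | 2.49 | 10.1 |
| SIRC | 1.71 | 1.91 | 3.94 | 3.96 | 4.18 | 9.99 | 7.77 | 5.88 | 13.1 | 4.47 | 4.57 | 10.7 |
| SRmax | 2.28 | 2.26 | 4.11 | 4.88 | 4.69 | 10.3 | 7.44 | 5.88 | 12.9 | 5.36 | 5.06 | 11.0 |
| SRent | 3.38 | 3.42 | 5.37 | 6.92 | 6.94 | 12.2 | 9.46 | 7.70 | 13.9 | 7.47 | 7.36 | 12.8 |
| SRdoctor | 2.29 | 2.28 | 4.17 | 4.92 | 4.75 | 10.4 | 7.47 | 5.92 | 12.8 | 5.39 | 5.12 | 11.1 |
| RLmax | 1.57 | 2.34 | 4.79 | 2.98 | 4.82 | 10.9 | 2.37 | 3.83 | 11.9 | 3.06 | 5.00 | 11.4 |
| Energy | 3.08 | 3.90 | 6.17 | 5.13 | 7.20 | 12.7 | 3.68 | 5.34 | 13.2 | 5.19 | 7.37 | 13.2 |
| KNN | 3.23 | 4.84 | 7.61 | 4.12 | 7.65 | 13.6 | 3.40 | 5.85 | 13.5 | 4.14 | 7.77 | 14.0 |
| ViM | 4.68 | 6.13 | 7.79 | 6.18 | 8.81 | 13.6 | 5.09 | 6.82 | 13.6 | 6.23 | 8.92 | 14.1 |

ImageNet - VOLO

| Score | In-D, α=0.1 | In-D, α=0.5 | In-D, α=1 | Cov, α=0.1 | Cov, α=0.5 | Cov, α=1 | Label, α=0.1 | Label, α=0.5 | Label, α=1 | Both, α=0.1 | Both, α=0.5 | Both, α=1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RLconf-M | 0.31 | 0.79 | 3.44 | 0.46 | 2.24 | 9.72 | 1.30 | 3.79 | 13.3 | 0.68 | 2.67 | 10.6 |
| RLgeo-M | 0.37 | 0.81 | 3.46 | 0.50 | 2.23 | 9.73 | 0.94 | 3.56 | 13.1 | 0.66 | 2.64 | 10.6 |
| SIRC | 1.27 | 1.44 | 3.74 | 1.35 | 2.82 | 9.56 | 2.68 | 3.97 | 12.9 | 1.90 | 3.37 | 10.5 |
| SRmax | 1.31 | 1.42 | 3.72 | 1.33 | 2.82 | 9.59 | 2.54 | 3.78 | 12.7 | 1.86 | 3.36 | 10.5 |
| SRent | 1.47 | 1.59 | 3.83 | 1.58 | 3.13 | 9.72 | 2.71 | 3.87 | 12.4 | 2.13 | 3.69 | 10.6 |
| SRdoctor | 1.31 | 1.42 | 3.71 | 1.33 | 2.82 | 9.55 | 2.54 | 3.78 | 12.7 | 1.86 | 3.36 | 10.4 |
| RLmax | 4.92 | 4.51 | 6.18 | 6.32 | 7.13 | 12.5 | 6.37 | 6.82 | 13.8 | 7.07 | 7.84 | 13.4 |
| Energy | 5.21 | 4.99 | 6.84 | 6.88 | 8.24 | 13.5 | 6.70 | 7.37 | 14.3 | 7.62 | 8.95 | 14.4 |
| KNN | 2.18 | 3.29 | 6.23 | 2.10 | 5.03 | 11.7 | 2.27 | 4.85 | 13.7 | 2.15 | 5.26 | 12.3 |
| ViM | 9.38 | 10.7 | 11.9 | 9.04 | 12.0 | 16.5 | 10.4 | 13.5 | 21.1 | 9.22 | 12.4 | 17.3 |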