Published as a conference paper at ICLR 2024

LEARNING TO REJECT MEETS LONG-TAIL LEARNING

Harikrishna Narasimhan, Aditya Krishna Menon, Wittawat Jitkrittum, Neha Gupta, Sanjiv Kumar
Google Research
{hnarasimhan, adityakmenon, wittawat, nehagup, sanjivk}@google.com

ABSTRACT

Learning to reject (L2R) is a classical problem where one seeks a classifier capable of abstaining on low-confidence samples. Most prior work on L2R has focused on minimizing the standard misclassification error. However, in many real-world applications, the label distribution is highly imbalanced, necessitating alternate evaluation metrics such as the balanced error or the worst-group error that enforce equitable performance across both the head and tail classes. In this paper, we establish that traditional L2R methods can be grossly sub-optimal for such metrics, and show that this is due to an intricate dependence in the objective between the label costs and the rejector. We then derive the form of the Bayes-optimal classifier and rejector for the balanced error, propose a novel plug-in approach to mimic this solution, and extend our results to general evaluation metrics. Through experiments on benchmark image classification tasks, we show that our approach yields better trade-offs in both the balanced and worst-group error compared to L2R baselines.

1 INTRODUCTION

Multi-class classification is typically framed as learning a classifier that can provide accurate predictions on any test sample. However, in many real-world problems, it may be preferable to allow the classifier to abstain from making a prediction on low-confidence samples. Learning to reject (L2R) is the problem of learning such a classifier capable of abstaining on certain samples, at the expense of incurring a fixed cost (Bartlett et al., 2006; El-Yaniv & Wiener, 2010; Cortes et al., 2016; Gangrade et al., 2021; Cortes et al., 2023). Common L2R approaches include confidence-based thresholding or Chow's rule (Chow, 1970), and loss-based strategies which pose the problem as an instance of cost-sensitive learning (Cortes et al., 2016; Ramaswamy et al., 2018; Ni et al., 2019; Mozannar & Sontag, 2020; Charoenphakdee et al., 2021; Verma & Nalisnick, 2022; Mao et al., 2023a).

Most prior studies on L2R have focused on the goal of minimizing the standard misclassification error. However, the misclassification error may not be well-suited to evaluate the performance of a classifier when the label distribution is highly imbalanced or long-tailed (Buda et al., 2017). In such applications, one often employs evaluation metrics such as the balanced error (Brodersen et al., 2010) or the worst-group error (Sagawa et al., 2020), which enforce equitable performance across different label groups (e.g., head and tail). This raises a natural, but surprisingly under-explored question: what is an appropriate L2R mechanism for such generalised evaluation measures?

In this paper, we present a formal study of the problem of L2R with general evaluation metrics. In the standard classification setup, one typically minimizes a given evaluation measure by formulating an equivalent cost-sensitive problem (Menon et al., 2013; Koyejo et al., 2014). For example, with the balanced error metric, one may re-weight the per-class errors by the inverse class priors. Similarly, many commonly used classification metrics can be reduced to minimizing a cost-sensitive error with fixed label costs (Narasimhan et al., 2015b).
However, applying these reduction techniques to the L2R setup is not straightforward. In particular, even with a metric as simple as the balanced classification error, the resulting label costs are no longer constants, and have an intricate dependence on the rejection mechanism. We characterize the optimal solution for this non-trivial formulation, propose novel plug-in approaches to mimic the solution, and demonstrate their efficacy in experiments. In summary, we make the following contributions:

Table 1: Summary of Bayes-optimal classifier $h^*$ and rejector $r^*$ for different metrics. We assume a cost c for abstention. With the standard 0-1 error (row 1) and the conditional 0-1 error (row 2), the optimal rejector is a thresholding of the true class-probability of the predicted label. With the balanced error (row 3), the rejector takes a fundamentally different form, and involves all class-probabilities, with per-class coefficients $u, v \in \mathbb{R}^K$ that are distribution-dependent. With a general evaluation metric such as the worst-group error (row 4), we have a stochastic combination of multiple classifier-rejector pairs.

| L2R Risk | $h^*(x)$ | $r^*(x)$ | Reference |
|---|---|---|---|
| Standard (1) | $\arg\max_y \eta_y(x)$ | $1\left(\max_y \eta_y(x) < 1 - c\right)$ | Chow (1970) |
| Conditional (22) | $\arg\max_y \eta_y(x)$ | $1\left(\max_y \eta_y(x) < 1 - \bar{c}\right)$ | Lemma 8 (App. C) |
| Balanced (6) | $\arg\max_y u_y\, \eta_y(x)$ | $1\left(\max_y u_y\, \eta_y(x) < \sum_j v_j\, \eta_j(x) - c\right)$ | Theorem 1 |
| Generalized (14) | Stochastic combination of (h, r) pairs of the form in row 3 | | Theorem 3 |

[Figure 1: four panels plotting the balanced error, conditional head error, conditional tail error, and the tail-head error gap against the proportion of rejections, comparing Chow's rule, Plug-in [Balanced] and Plug-in [Worst].]

Figure 1: Comparison of balanced error as a function of proportion of samples rejected on the ImageNet-LT dataset. We compare Chow's rule with the proposed plug-in methods. We also include plots of the individual head and tail errors, and the differences between them. We designate classes with 20 or fewer samples in the training set as tail classes, and the remaining as head classes. The base classifier uses a ResNet-50 architecture. Chow's rule is seen to be substantially better on the head group compared to the tail group. In contrast, the proposed plug-in rules yield significantly lower balanced (worst-group) error. They are also seen to yield a lower gap between the tail and head errors.

(i) We first derive the Bayes-optimal classifier and rejector for the balanced error metric, and show that they have a fundamentally different form from classical baselines such as Chow's rule (Chow, 1970), which reject samples with low prediction confidence (§3, Table 1, Theorem 1).

(ii) We show that vanilla modifications to Chow's rule, such as changing the loss used in the base model training, are also theoretically sub-optimal for the balanced error. As an alternative, we propose a novel plug-in approach that seeks to mimic the Bayes-optimal solution without making any changes to the base model training (§4).

(iii) We then extend our Bayes-optimal characterization to any evaluation metric that can be expressed as a general function of the per-class errors (Theorem 3), and as a concrete example, showcase how to extend our plug-in approach to minimize the worst-group error (§5).
(iv) Through experiments on benchmark long-tailed image classification datasets, we demonstrate that our plug-in approach yields significantly better trade-offs than Chow's rule, and is competitive with variants of Chow that require changes to the training process (§6).

2 LEARNING TO REJECT AND LONG-TAIL CLASSIFICATION

We begin with some background material. Let X be an instance space, $Y = [L] \doteq \{1, 2, \ldots, L\}$ be the label space, and P be the underlying distribution over $X \times Y$. Furthermore, let $\eta_y(x) = P(y \mid x)$ denote the conditional distribution over labels. Given a training set $S = \{(x_n, y_n)\}_{n \in [N]}$ comprising N i.i.d. draws from P, multi-class classification canonically seeks a classifier $h \colon X \to Y$ with low misclassification error $R_{0\text{-}1}(h) = P(y \neq h(x))$. We may parameterise $h(x) = \arg\max_{y \in Y} f_y(x)$ for a scorer $f \colon X \to \mathbb{R}^L$, which is typically trained by minimizing an empirical surrogate risk on S.

2.1 LEARNING TO REJECT

We are interested in settings where the classifier is provided the option of abstaining on select samples. To this end, we consider the goal of learning a classifier $h \colon X \to Y$ and a rejector $r \colon X \to \{0, 1\}$. For a given instance x, we abstain whenever $r(x) = 1$, and predict $h(x)$ otherwise.

Prior literature has explored minimizing the standard misclassification error with a constant penalty for abstaining on a sample (Cortes et al., 2016):

$$R^{\text{rej}}_{0\text{-}1}(h, r) = P(y \neq h(x), r(x) = 0) + c \cdot P(r(x) = 1). \quad (1)$$

Intuitively, when r chooses not to abstain, we incur the usual misclassification error; otherwise, we incur the cost c > 0. Under mild distributional assumptions, this objective can be equivalently seen as constraining the proportion of abstentions to be within a budget, a formulation often referred to as selective classification (Geifman & El-Yaniv, 2017; Gangrade et al., 2021). More precisely, via a Lagrangian analysis (following, e.g., the Neyman-Pearson lemma (Neyman & Pearson, 1933)), for any such abstention budget, there exists an equivalent cost c in Eq. (1); see Appendix B for details.

Bayes-optimal L2R. The Bayes-optimal solution for Eq. (1) admits an intuitive form:

$$h^*(x) = \arg\max_{y \in [L]} \eta_y(x); \qquad r^*(x) = 1\left(\max_{y \in [L]} \eta_y(x) < 1 - c\right). \quad (2)$$

The classifier abstains whenever the maximum conditional probability falls below a certain threshold, and otherwise predicts the class with the maximum class probability.

Post-hoc L2R. A classical baseline for L2R, often dubbed Chow's rule (Chow, 1970), seeks to mimic the rejector in Eq. (2). In practice, one may implement Chow's rule by training a scorer $f \colon X \to \mathbb{R}^L$ to minimize the softmax cross-entropy (CE) loss:

$$\mathbb{E}_{(x,y) \sim P}\left[\ell_{\text{ce}}(y, f(x))\right] = \mathbb{E}_{(x,y) \sim P}\left[\log \sum_{y' \in [L]} e^{f_{y'}(x)} - f_y(x)\right]. \quad (3)$$

One then uses the learned scorer f to compute estimates $p_y(x) \propto \exp(f_y(x))$ of the class probabilities $\eta_y(x)$, and constructs the rejector by thresholding the maximum probability $\max_y p_y(x)$ at $1 - c$.

Loss-based L2R. As an alternative to Chow's rule, which constructs a rejector post-hoc given an existing classifier, the literature has also explored the design of loss functions that jointly learn both the classifier and rejector to optimize the L2R risk (Cortes et al., 2016; Geifman & El-Yaniv, 2017; Ramaswamy et al., 2018; Ni et al., 2019; Mozannar & Sontag, 2020; Charoenphakdee et al., 2021). Of these, Geifman & El-Yaniv minimize a variant of the L2R risk in Eq. (1) where the classification error is conditioned on non-rejected samples.
As shown in Appendix C, the optimal rejector for this conditional risk has the same form as Eq. (2), but uses a different (distribution-dependent) threshold.

2.2 LONG-TAIL CLASSIFICATION

In many real-world applications, the label distribution is highly class-imbalanced (He & Garcia, 2009), or long-tailed (Buda et al., 2017). In such settings, one may wish to evaluate performance separately on minority and majority classes. For example, a common partitioning of the classes is into head and tail classes, the primary setting of interest in this paper. More generally, let $G_1, \ldots, G_K \subseteq Y$ denote a partitioning of the labels into K disjoint groups, with $G_i \cap G_j = \emptyset$ for $i \neq j$ and $\bigcup_{k \in [K]} G_k = Y$. Let $\pi_k = P(y \in G_k)$ denote the proportion of samples belonging to group $G_k$. The misclassification error for label group $G_k$ can then be measured as $P(y \neq h(x) \mid y \in G_k)$.

A key metric of interest in this setting is the balanced error, which computes the average error across label groups (Brodersen et al., 2010; Menon et al., 2013; 2021b):

$$R_{\text{bal}}(h) = \frac{1}{K} \sum_{k \in [K]} P(y \neq h(x) \mid y \in G_k) = \frac{1}{K} \sum_{k \in [K]} \frac{1}{\pi_k} P(y \neq h(x), y \in G_k). \quad (4)$$

In practice, one may minimize the balanced error by using a balanced version of the cross-entropy loss, implemented, e.g., through adjustments to the model logits (Menon et al., 2021a). The literature also explores a plethora of other techniques to perform well on the balanced error, including data modifications (Yin et al., 2019; Zhang et al., 2019), loss modifications (Cui et al., 2019; Cao et al., 2019; Tan et al., 2020; Jamal et al., 2020; Ren et al., 2020; Wu et al., 2020a; Deng et al., 2021; Kini et al., 2021; Wang et al., 2021), and prediction modifications (Kang et al., 2020; Zhang et al., 2021).

Another evaluation metric of interest is the worst-group error, which is typically minimized through distributionally robust optimization (DRO) (Sagawa et al., 2020; Menon et al., 2021b):

$$R_{\text{wst}}(h) = \max_{k \in [K]} P(y \neq h(x) \mid y \in G_k) = \max_{k \in [K]} \frac{1}{\pi_k} P(y \neq h(x), y \in G_k). \quad (5)$$

Beyond the balanced (worst-group) errors, there are many other metrics popular in the class-imbalance literature, such as the F-measure or the G-mean, which can be generally expressed as functions of the confusion matrix (Menon et al., 2013; Narasimhan et al., 2015b). When learning a stand-alone classifier, minimizing these can be reduced to a cost-sensitive learning problem with constant label costs (Narasimhan et al., 2019). We see next that such a simple reduction is not possible when jointly learning a classifier and rejector in an L2R setup, even with a metric as simple as the balanced error.

3 LEARNING TO REJECT WITH BALANCED ERROR

In this paper, we provide a formal framework for L2R with general evaluation metrics, beginning in this section with the balanced error metric.

3.1 DOES CHOW'S RULE SUFFICE FOR BALANCED ERROR?

The classical Chow's rejection rule described in Eq. (2) can be grossly sub-optimal for the balanced error, and can in fact have an adverse impact on the tail classes. Figure 1(d) provides an illustration of the latter on the long-tailed ImageNet-LT dataset (Liu et al., 2019), where we group all classes with fewer than 20 samples in the training set as the tail classes, and show that Chow's rule is substantially better on the head classes compared to the tail. This suggests that one cannot naïvely apply existing L2R methods in settings involving evaluation measures beyond the standard accuracy.
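To make this discussion concrete, the following is a minimal NumPy sketch of how Chow's rule (Eq. (2)) and the per-group conditional errors behind Figure 1 can be computed from a model's probability estimates. The function and variable names are ours, for illustration only; this is not the paper's released code.

```python
import numpy as np

def chow_rule(probs, c):
    """Chow's rule (Eq. 2): predict the argmax class, and abstain whenever
    the estimated maximum class-probability falls below 1 - c.

    probs: (N, L) class-probability estimates from a CE-trained model.
    c: abstention cost in (0, 1).
    """
    preds = probs.argmax(axis=1)
    reject = probs.max(axis=1) < 1.0 - c
    return preds, reject

def per_group_errors(y, preds, reject, group_of_class):
    """Conditional per-group errors P(y != h(x) | r(x) = 0, y in G_k).
    Their average is the balanced error; their max the worst-group error.
    Assumes each group retains at least one accepted sample."""
    groups = group_of_class[y]
    errs = []
    for k in range(group_of_class.max() + 1):
        keep = (groups == k) & ~reject  # accepted samples from group k
        errs.append(np.mean(preds[keep] != y[keep]))
    return np.array(errs)
```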
Jones et al. (2021) similarly show, in a setting where the groups are defined by spurious attributes, that Chow's rule unfairly favors the majority groups. They propose mitigating this by modifying Chow's rule to use a base model trained with DRO. Follow-up work explores the use of a sufficiency constraint in the base model training to alleviate group disparities (Lee et al., 2021). However, a formal understanding of L2R with the balanced error remains elusive; we aim to resolve this in the sequel.

3.2 FORMULATION WITH REJECTOR-DEPENDENT COSTS

We now formally describe the problem of L2R with the balanced error. For a base classifier h and a rejector r, we define the error that the classifier makes on non-abstained samples from each group by $P(y \neq h(x) \mid r(x) = 0, y \in G_k)$, where we condition on samples from group $G_k$ which the rejector does not reject (Jones et al., 2021). We consider the goal of minimizing the average per-group error, with a constraint on the total proportion of abstentions. One may equivalently impose a rejection penalty of c > 0:

$$R^{\text{rej}}_{\text{bal}}(h, r) = \frac{1}{K} \sum_{k \in [K]} P(y \neq h(x) \mid r(x) = 0, y \in G_k) + c \cdot P(r(x) = 1). \quad (6)$$

We will refer to this objective as the balanced L2R risk, which we re-write as:

$$R^{\text{rej}}_{\text{bal}}(h, r) = \frac{1}{K} \sum_{k \in [K]} \frac{1}{\bar{\pi}_k(r)} P(y \neq h(x), r(x) = 0, y \in G_k) + c \cdot P(r(x) = 1),$$

where $\bar{\pi}_k(r) = P(r(x) = 0, y \in G_k)$ is the proportion of samples from group $G_k$ that were not rejected by r. Similar to the balanced error in Eq. (4), each group is weighted by the inverse group prior; however, the coefficients here are not constants and are intricately tied to the rejector r. In fact, the above risk has a non-decomposable dependence on r: both the numerator and the denominator depend on r, making the minimization problem non-trivial and challenging. This is in contrast to prior work on cost-sensitive rejection (deferral), where the costs on the misclassification errors are fixed and have no dependence on r (Santos-Pereira & Pires, 2005; Alves et al., 2024).

3.3 BAYES-OPTIMAL CLASSIFIER AND REJECTOR

As a first step towards learning a classifier and rejector that minimizes the balanced L2R risk in Eq. (6), we seek to understand the form of the Bayes-optimal solution for the objective. To this end, we re-write the risk into an equivalent form where the denominators do not explicitly depend on the rejector. Specifically, we decouple the dependence of the denominators on the rejector by introducing auxiliary variables $\alpha_1, \ldots, \alpha_K$, and reformulate the minimization over h and r as an equivalent constrained optimization problem:

$$\min_{h,\, r,\, \alpha \in (0, K)^K} \;\; \sum_{k \in [K]} \frac{1}{\alpha_k} P(y \neq h(x), r(x) = 0, y \in G_k) + c \cdot P(r(x) = 1) \quad (7)$$
$$\text{s.t.} \quad K \cdot P(r(x) = 0, y \in G_k) = \alpha_k, \quad \forall k \in [K].$$

This then allows us to provide a characterization of the Bayes-optimal solution, by computing the Lagrangian for Eq. (7) with multipliers $\mu_1, \ldots, \mu_K \in \mathbb{R}$ for the K equality constraints.

Assumption 1. For $x \sim P$, assume that $[\eta_1(x), \ldots, \eta_L(x)]$ is a continuous random vector.

Assumption 2. The minimizer $(h^*, r^*)$ of the balanced L2R risk in Eq. (6) is such that $P(r^*(x) = 0, y \in G_k) \geq \kappa,\ \forall k$, for some $\kappa > 0$.

Theorem 1. Under Assumptions 1 and 2, there exist group-specific parameters $\alpha^* \in (0, K)^K$ and $\mu^* \in \mathbb{R}^K$ such that the optimal classifier and rejector for Eq. (6) take the form:

$$h^*(x) = \arg\max_{y \in [L]} \frac{1}{\alpha^*_{[y]}} \eta_y(x); \qquad r^*(x) = 1\left(\max_{y \in [L]} \frac{1}{\alpha^*_{[y]}} \eta_y(x) < \sum_{y' \in [L]} \left(\frac{1}{\alpha^*_{[y']}} - \mu^*_{[y']}\right) \eta_{y'}(x) - c\right), \quad (8)$$

where [y] is the index of the group class y belongs to; moreover, $\alpha^*_k = K \cdot P(r^*(x) = 0, y \in G_k)$.
Assumption 1 is fairly common in the analysis of Bayes-optimal solutions for evaluation metrics (Narasimhan et al., 2015b; Yan et al., 2018; Yang et al., 2020). This together with Assumption 2 allows us to pose Eq. (7) as an equivalent (unconstrained) Lagrangian minimization problem. The optimal rejector uses the term $\max_{y \in [L]} \frac{1}{\alpha^*_{[y]}} \eta_y(x)$ as a measure of classifier confidence, and checks if it falls below a sample-dependent threshold. The multipliers $\mu^*_k$ in the threshold can be seen as per-group adjustments that ensure that the acceptance rate for group k is proportional to $\alpha^*_k$. In fact, in the special case where we have binary labels, each in a separate group, the rejector takes a much simpler form, with a constant threshold for each class (details in Appendix D.4).

Remark 1 (Sub-optimality of L2R baselines). Unlike the classical Chow's rule in Eq. (2) or its conditional variant in Eq. (23), which apply a constant threshold to the maximum conditional probability $\max_y \eta_y(x)$, the optimal rejector in Theorem 1 takes a fundamentally different form, involving all class-probabilities. In fact, it can be seen as applying a sample-dependent threshold to re-weighted probabilities. This suggests that Chow's rule is sub-optimal for the balanced L2R risk. This also suggests that approaches which optimize a surrogate loss for the L2R risk (Cortes et al., 2016; Geifman & El-Yaniv, 2017; Ramaswamy et al., 2018; Ni et al., 2019; Mozannar & Sontag, 2020; Charoenphakdee et al., 2021) are likewise sub-optimal for the balanced L2R risk, as they are also designed to learn the same thresholded rejector in Eq. (2) that Chow's rule seeks to mimic.

4 ALGORITHMS FOR BALANCED L2R

We next present an algorithm for learning a classifier and rejector that is optimal for the balanced L2R risk in Eq. (6). We begin by exploring simple modifications to Chow's rule.

4.1 CHOW'S RULE WITH MODIFIED BASE LOSS

One obvious variation to Chow's rule is to modify the base model training to optimize a balanced version of the CE loss in Eq. (3):

$$f^{\text{bal}} = \arg\min_f \; \sum_{k \in [K]} \frac{1}{\pi_k} \mathbb{E}\left[1(y \in G_k)\, \ell_{\text{ce}}(y, f(x))\right]. \quad (9)$$

One may then implement Chow's rule via the class probability estimates $p^{\text{bal}}_y(x) \propto \exp(f^{\text{bal}}_y(x))$:

$$h^{\text{bal}}(x) \in \arg\max_{y \in [L]} p^{\text{bal}}_y(x); \qquad r^{\text{bal}}(x) = 1\left(\max_{y \in [L]} p^{\text{bal}}_y(x) < 1 - c\right). \quad (10)$$

Although intuitive, this simple modification to Chow's rule does not yield the optimal classifier and rejector for the balanced L2R risk.

Lemma 2. The variant of Chow's rule in Eq. (10), where the base model is trained with the balanced cross-entropy loss, results in a classifier and rejector of the form:

$$h^{\text{bal}}(x) = \arg\max_{y \in [L]} \frac{1}{\pi_{[y]}} \eta_y(x); \qquad r^{\text{bal}}(x) = 1\left(\max_{y \in [L]} \frac{1}{\pi_{[y]}} \eta_y(x) < \left(\sum_{y' \in [L]} \frac{1}{\pi_{[y']}} \eta_{y'}(x)\right) (1 - c)\right), \quad (11)$$

where [y] is the index of the group to which class y belongs.

There are two key differences between the above rejector and the optimal rejector in Theorem 1. First, in the above rejector, the abstention cost c is part of a multiplicative term, whereas it is part of an additive term in the rejector in Theorem 1. Second, the per-class weights in the above rejector depend on the fixed group priors $\pi_k$, whereas those in Theorem 1 depend more intricately on the underlying distribution through the parameters $\alpha^*_k$ (which depend on the optimal rejector) and multipliers $\mu^*_k$. More generally, any re-weighting of the base loss with constant coefficients will result in a rejector with a multiplicative penalty term different from the form in Theorem 1 (see Appendix A.2).
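For reference, the balanced CE objective in Eq. (9) amounts to re-weighting each sample's CE term by the inverse prior of its label's group. Below is a minimal PyTorch sketch under that reading; the helper name and tensor layout are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def balanced_ce_loss(logits, y, group_of_class, group_priors):
    """Balanced CE loss (Eq. 9): weight each sample's CE term by 1 / pi_{[y]}.

    logits: (N, L) scorer outputs f(x); y: (N,) integer labels.
    group_of_class: (L,) long tensor mapping each class to its group in [K].
    group_priors: (K,) tensor of group priors pi_k = P(y in G_k).
    """
    per_sample = F.cross_entropy(logits, y, reduction="none")
    weights = 1.0 / group_priors[group_of_class[y]]
    return (weights * per_sample).mean()
```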
4.2 PLUG-IN APPROACH WITH NO CHANGES TO THE BASE LOSS

As an alternative to modifying the base loss in Chow's rule, we propose a new plug-in approach that seeks to directly mimic the optimal classifier and rejector for the balanced L2R risk. Like standard Chow's rule, we assume access to only a class probability estimator $p_y(x)$ trained to minimize the softmax CE loss in Eq. (3). We then wish to approximate the Bayes-optimal classifier and rejector from Theorem 1 by plugging in estimates $p_y(x)$ of the class probabilities:

$$\hat{h}(x) = \arg\max_{y \in [L]} \frac{1}{\hat{\alpha}_{[y]}} p_y(x); \quad (12)$$
$$\hat{r}(x) = 1\left(\max_{y \in [L]} \frac{1}{\hat{\alpha}_{[y]}} p_y(x) < \sum_{y' \in [L]} \left(\frac{1}{\hat{\alpha}_{[y']}} - \hat{\mu}_{[y']}\right) p_{y'}(x) - c\right), \quad (13)$$

for choices of $\hat{\alpha} \in (0, K)^K$ and $\hat{\mu} \in \mathbb{R}^K$. One may carry out an exhaustive search to pick $\hat{\alpha}$ and $\hat{\mu}$ so that the resulting classifier-rejector pair yields the lowest balanced error on a held-out sample. However, searching over 2K parameters can quickly become infeasible even when K is small.

To make the search more efficient, we exploit the additional constraint in Theorem 1 that requires $\hat{\alpha}_k$ to match the rejector's coverage for group $G_k$, i.e., $\hat{\alpha}_k = K \cdot P(y \in G_k, \hat{r}(x) = 0)$. Specifically, we propose applying a grid search over only the multipliers $\hat{\mu}$, while using a power-iteration-like heuristic to choose $\hat{\alpha}$. Starting with initial values $\alpha^{(0)} \in (0, K)^K$, we propose alternating between constructing the classifier-rejector pair and updating the group priors using the rejector's coverage:

- Construct $(h^{(m)}, r^{(m)})$ using Eqs. (12)-(13) with $\hat{\alpha} = \alpha^{(m)}$ and the fixed multipliers $\hat{\mu}$;
- Set $\alpha^{(m+1)}_k = K \cdot P(y \in G_k, r^{(m)}(x) = 0)$.

One may repeat these steps until the iterates $\alpha^{(t)}$ converge to a stationary point, and return the corresponding classifier and rejector. For a moderate number of groups K, one may apply the above procedure for different values of multipliers $\hat{\mu} \in \mathbb{R}^K$, and pick the one for which the returned classifier-rejector pair yields the lowest balanced error on a validation set. The details are outlined in Algorithm 1, where $\Lambda$ is the set of multiplier values we wish to search over, and $\beta_k = \frac{1}{K}$ for the balanced error. In fact, one can further prune the search space by re-parameterizing the rejection criterion in Eq. (13) to search over only $K - 1$ parameters (see Appendix E.1 for details). For problems with two groups (e.g., head and tail), this means that we only have a single parameter to tune. For a larger number of groups, we prescribe applying an alternate Lagrangian-style min-max optimization to tune the multiplier parameters. We elaborate on this procedure in Appendix E.2.
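As a concrete rendering of the alternating scheme above, here is a minimal sketch of the inner loop for a fixed multiplier vector $\hat{\mu}$, assuming held-out probability estimates in NumPy. The variable names are ours, and the clipping guard is an implementation detail we add for numerical safety.

```python
import numpy as np

def plugin_inner_loop(probs, y, group_of_class, mu, c, n_iters=20):
    """Alternate between building (h, r) via Eqs. (12)-(13) for the current
    alpha, and resetting alpha_k to K * P(y in G_k, r(x) = 0).

    probs: (N, L) class-probability estimates p_y(x) from a CE-trained model.
    mu: (K,) fixed multipliers; c: rejection cost.
    """
    K = group_of_class.max() + 1
    alpha = np.full(K, 1.0)  # initial coverage parameters alpha^(0)
    for _ in range(n_iters):
        w = 1.0 / alpha[group_of_class]       # per-class weights 1 / alpha_[y]
        scores = probs * w                    # (1 / alpha_[y]) * p_y(x)
        preds = scores.argmax(axis=1)         # classifier of Eq. (12)
        thresh = probs @ (w - mu[group_of_class]) - c
        reject = scores.max(axis=1) < thresh  # rejector of Eq. (13)
        groups = group_of_class[y]
        # Coverage update: alpha_k = K * P(y in G_k, r(x) = 0).
        alpha = np.array([K * np.mean((groups == k) & ~reject)
                          for k in range(K)]).clip(1e-6, None)
    return preds, reject
```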
Algorithm 1 Cost-sensitive Plug-in (CS-plug-in)
1: Input: Rejection cost c, group weights $\beta \in \Delta_K$, pre-trained model $p \colon X \to \Delta_L$, sample S
2: Parameters: Multiplier set $\Lambda \subset \mathbb{R}^K$, iterations M, initial prior $\alpha^{(0)} \in (0, 1)^K$, initial $(\hat{h}, \hat{r})$
3: For $\mu$ in $\Lambda$:
4:   For m = 0 to M − 1:
5:     Construct $(h^{(m+1)}, r^{(m+1)})$ using Eqs. (12)-(13) with $\hat{\alpha}_k = \alpha^{(m)}_k / \beta_k$ and $\hat{\mu} = \mu$
6:     $\alpha^{(m+1)}_k = \frac{1}{|S|} \sum_{(x,y) \in S} 1(y \in G_k, r^{(m+1)}(x) = 0)$, $\forall k \in [K]$
7:   $(\hat{h}, \hat{r}) = (h^{(M)}, r^{(M)})$ if $\sum_k \beta_k\, \hat{e}_k(h^{(M)}, r^{(M)}) < \sum_k \beta_k\, \hat{e}_k(\hat{h}, \hat{r})$
8: Return: $(\hat{h}, \hat{r})$

Algorithm 2 Worst-group Plug-in
1: Input: Rejection cost c, pre-trained model $p \colon X \to \Delta_L$, sample $S_{\text{val}}$ divided into $S_1, S_2$
2: Parameters: Initial group weights $\beta^{(0)} \in \Delta_K$, iterations T, step-size $\xi$
3: For t = 0 to T:
4:   $(h^{(t)}, r^{(t)}) = \text{CS-plug-in}(c, \beta^{(t)}, p, S_1)$
5:   $\beta^{(t+1)}_k = \frac{1}{Z}\, \beta^{(t)}_k \exp(\xi\, \hat{e}_k(h^{(t)}, r^{(t)}))$, $\forall k \in [K]$, where $Z = \sum_j \beta^{(t)}_j \exp(\xi\, \hat{e}_j(h^{(t)}, r^{(t)}))$
6: Return: $(h^{(T)}, r^{(T)})$

Figure 2: Plug-in algorithms for the cost-sensitive error $\sum_k \beta_k\, \hat{e}_k(h, r)$ (Algorithm 1) and the worst-group error $\max_k \hat{e}_k(h, r)$ (Algorithm 2), where $\hat{e}_k(h, r) = \frac{\sum_{(x,y) \in S_2} 1(y \neq h(x),\, y \in G_k,\, r(x) = 0)}{\sum_{(x,y) \in S_2} 1(y \in G_k,\, r(x) = 0)}$. Algorithm 1 can be applied to minimize the balanced L2R risk in Eq. (6) by setting $\beta_k = 1/K$, $\forall k \in [K]$.

5 EXTENSION TO GENERAL EVALUATION METRICS

We next extend our results to evaluation metrics more general than the balanced error. To this end, we denote the conditional error of classifier h and rejector r on group $G_k$ by:

$$e_k(h, r) = P(y \neq h(x) \mid r(x) = 0, y \in G_k).$$

We then consider the problem of minimizing a general function $\psi \colon [0,1]^K \to \mathbb{R}$ of the per-group errors $e_1(h, r), \ldots, e_K(h, r)$, with a cost of c > 0 for abstention:

$$R^{\text{rej}}_{\text{gen}}(h, r) = \psi(e_1(h, r), \ldots, e_K(h, r)) + c \cdot P(r(x) = 1). \quad (14)$$

This generalized L2R risk includes the balanced error from Eq. (6) as a special case when $\psi(z) = \frac{1}{K}\sum_k z_k$. It also captures more general evaluation metrics defined by a non-linear function $\psi$, such as the worst-group error with $\psi(z) = \max_{k \in [K]} z_k$ (Sagawa et al., 2020), and the G-mean metric with $\psi(z) = 1 - \left(\prod_{k=1}^{K}(1 - z_k)\right)^{1/K}$ (Menon et al., 2013; Narasimhan et al., 2015b).

5.1 BAYES-OPTIMAL CLASSIFIER AND REJECTOR

As with prior work on optimizing general evaluation metrics (Narasimhan et al., 2015a; Yang et al., 2020), the generalized L2R risk is minimized by a classifier-rejector mechanism that is stochastic.

Theorem 3. For any non-linear $\psi$, there exist coefficients $u^{(1)}, v^{(1)}, \ldots, u^{(K+1)}, v^{(K+1)} \in \mathbb{R}^K$, and a distribution $\nu \in \Delta_{K+1}$, such that the generalized L2R risk in Eq. (14) is minimized by a stochastic mechanism that randomizes over K + 1 classifier-rejector pairs: for any x, it predicts using $(h^{(i)}, r^{(i)})$ with probability $\nu_i$, where

$$h^{(i)}(x) \in \arg\max_y u^{(i)}_y\, \eta_y(x), \quad i \in [K+1];$$
$$r^{(i)}(x) = 1\left(\max_y u^{(i)}_y\, \eta_y(x) < \sum_{y'} v^{(i)}_{y'}\, \eta_{y'}(x) - c\right), \quad i \in [K+1].$$

Stochastic classifiers are widely used in the design of learning algorithms for complex (non-linear) evaluation metrics (Narasimhan et al., 2015b; Cotter et al., 2019b; Narasimhan et al., 2019; Yang et al., 2020). In applications where the use of stochastic classifiers is infeasible, one may use techniques offered in Cotter et al. (2019a) to convert them into a similar-performing deterministic classifier.
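As a concrete rendering of the pseudocode in Figure 2, the sketch below implements the outer loop of Algorithm 2, assuming a `cs_plugin` routine implementing Algorithm 1 (e.g., built on the inner loop sketched in §4.2) and a `group_errs` routine returning the empirical per-group errors $\hat{e}_k$ on a held-out split; both names are our own placeholders.

```python
import numpy as np

def worst_group_plugin(cs_plugin, group_errs, c, K, T=50, xi=1.0):
    """Algorithm 2: exponentiated-gradient updates on group weights beta,
    each iteration solving the weighted cost-sensitive problem (Algorithm 1).

    cs_plugin(c, beta) -> (h, r): cost-sensitive plug-in solver.
    group_errs(h, r) -> (K,) empirical errors e_k(h, r) on a held-out split.
    """
    beta = np.full(K, 1.0 / K)  # start from uniform group weights
    for _ in range(T):
        h, r = cs_plugin(c, beta)
        errs = group_errs(h, r)
        beta = beta * np.exp(xi * errs)  # up-weight high-error groups
        beta /= beta.sum()               # renormalize onto the simplex
    return h, r  # final pair, as used in the deterministic comparison
```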
| Method | CIFAR-100 Balanced | CIFAR-100 Worst | ImageNet Balanced | ImageNet Worst | iNaturalist Balanced | iNaturalist Worst |
|---|---|---|---|---|---|---|
| Chow | 0.509 ± 0.002 | 0.883 ± 0.007 | 0.409 ± 0.003 | 0.685 ± 0.007 | 0.131 ± 0.004 | 0.162 ± 0.004 |
| CSS | 0.483 ± 0.002 | 0.785 ± 0.003 | 0.526 ± 0.002 | 0.784 ± 0.003 | 0.200 ± 0.005 | 0.246 ± 0.006 |
| Chow [BCE] | 0.359 ± 0.017 | 0.570 ± 0.030 | 0.213 ± 0.001 | 0.274 ± 0.003 | 0.111 ± 0.003 | 0.114 ± 0.003 |
| Chow [DRO] | 0.325 ± 0.004 | 0.333 ± 0.003 | 0.197 ± 0.001 | 0.211 ± 0.004 | 0.110 ± 0.001 | 0.117 ± 0.002 |
| Plug-in [Balanced] | 0.292 ± 0.006 | 0.416 ± 0.015 | 0.205 ± 0.001 | 0.248 ± 0.001 | 0.108 ± 0.003 | 0.118 ± 0.004 |
| Plug-in [Worst] | 0.287 ± 0.008 | 0.321 ± 0.018 | 0.198 ± 0.001 | 0.213 ± 0.004 | 0.108 ± 0.001 | 0.113 ± 0.002 |

Table 2: Area under the risk-coverage curve (AURC) with the balanced and worst-group error. Chow and CSS are L2R baselines. We evaluate two variants of Chow (BCE, DRO) with a modified base loss (§4.1), and plug-in methods (Balanced, Worst) that require no changes to the base model training (§4.2, §5).

5.2 MINIMIZING WORST-GROUP ERROR

One can minimize the risk in Eq. (14) by exploiting the form of the function $\psi$. For example, if $\psi$ is convex, one can reduce it to a sequence of minimization problems of the form

$$\sum_k \beta^{(t)}_k\, e_k(h, r) + c \cdot P(r(x) = 1),$$

for coefficients $\beta^{(1)}, \beta^{(2)}, \ldots \in \mathbb{R}^K_+$, each of which can be solved by adapting the plug-in method described in §4.2. Prior work has shown how this can be done in the standard classification setup (without rejection) by applying gradient-based optimization to $\psi$ (Narasimhan et al., 2019; 2022a; Yang et al., 2020). As a concrete example, we showcase how a similar reduction can be done for the worst-group L2R risk with $\psi(z_1, \ldots, z_K) = \max_{k \in [K]} z_k$. The details are in Algorithm 2. The algorithm maintains group-specific weights $\beta^{(t)} \in \Delta_K$, where $\Delta_K$ is the probability simplex over K coordinates. At each iteration, it updates $\beta^{(t)}$ using exponentiated gradient descent, up-weighting groups with high empirical error $\hat{e}_k(h, r)$. It then invokes the plug-in approach in Algorithm 1 to minimize the resulting weighted error $\sum_k \beta^{(t)}_k\, \hat{e}_k(h, r)$. Below, we provide convergence guarantees for the algorithm in terms of the optimality gap in the weighted error minimization step (line 4).

Theorem 4. Let $R^{\text{rej}}_{\text{wst}}(h, r)$ denote the generalized L2R risk in Eq. (14) with $\psi(z_1, \ldots, z_K) = \max_k z_k$. Let $\epsilon^{\text{gen}}_t = \max_{k \in [K]} |e_k(h^{(t)}, r^{(t)}) - \hat{e}_k(h^{(t)}, r^{(t)})|$ denote the estimation error in iteration t of Algorithm 2, and let $\epsilon^{\text{cs}}_t = \left[\sum_k \beta^{(t)}_k e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right] - \inf_{h,r} \left[\sum_k \beta^{(t)}_k e_k(h, r) + c\, P(r(x) = 1)\right]$ denote the excess cost-sensitive risk for the classifier-rejector pair $(h^{(t)}, r^{(t)})$ returned in line 4. Then a stochastic mechanism that picks one among $\{(h^{(1)}, r^{(1)}), \ldots, (h^{(T)}, r^{(T)})\}$ with equal probability satisfies:

$$\max_{k \in [K]} \mathbb{E}_t\left[e_k(h^{(t)}, r^{(t)})\right] + c \cdot \mathbb{E}_t\left[P(r^{(t)}(x) = 1)\right] \;\leq\; \inf_{h,r} R^{\text{rej}}_{\text{wst}}(h, r) + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}},$$

where t is drawn uniformly from $\{1, \ldots, T\}$, $\bar{\epsilon}^{\text{cs}} = \frac{1}{T}\sum_{t \in [T]} \epsilon^{\text{cs}}_t$ and $\bar{\epsilon}^{\text{gen}} = \frac{1}{T}\sum_{t \in [T]} \epsilon^{\text{gen}}_t$.

To ensure a fair comparison against other deterministic baselines, when implementing Algorithm 2, we do not employ a stochastic solution, but instead use the final classifier-rejector pair $(h^{(T)}, r^{(T)})$.

6 EXPERIMENTAL RESULTS

We present experiments on long-tailed image classification tasks to showcase that the proposed plug-in approaches for the balanced error (§4) and the worst-group error (§5) yield significantly better trade-offs than Chow's rule, despite using the same base model, and are competitive with variants of Chow's rule which require re-training the base model with a modified loss.

Datasets. We replicate the long-tail experimental setup from Menon et al. (2021a).
We use long-tailed versions of CIFAR-100 (Krizhevsky, 2009), ImageNet (Deng et al., 2009) and iNaturalist (Van Horn et al., 2018). For CIFAR-100, we downsample the examples per label following the Exp profile of Cao et al. (2019), and for ImageNet, we use the long-tail version provided by Liu et al. (2019). The train, test and validation splits have the same label distributions (see Appendix F for details). We train a ResNet-32 model for CIFAR-100, and a ResNet-50 model for ImageNet and iNaturalist.

[Figure 3: six panels plotting (top row) the balanced error for Chow, CSS, Chow [BCE] and Plug-in [Balanced], and (bottom row) the worst-group error for Chow, CSS, Chow [DRO] and Plug-in [Worst], against the proportion of rejections on (a) CIFAR-100, (b) ImageNet and (c) iNaturalist.]

Figure 3: Balanced and worst-group errors as functions of proportion of rejections. Lower is better.

Evaluation metrics. For each dataset, we divide the classes into two groups, namely head and tail. We designate all classes with 20 or fewer samples in the training set as tail classes (Liu et al., 2019), and the remaining as the head. We evaluate both the balanced error and the worst-group error across the head and tail groups. We train classifiers and rejectors with different rejection costs c, and plot these metrics as a function of the proportion of rejections. We summarize the performance across different rejection rates by computing the areas under the respective curves (averaged over 5 trials); this summary metric is referred to as the area under the risk-coverage curve (AURC) (Moon et al., 2020).

Comparisons. We compare against two representative L2R baselines that optimize the standard 0-1 error: (i) Chow's rule, which trains a base model by optimizing the CE loss, and thresholds the estimated max probability (Eq. (2)); and (ii) the cost-sensitive softmax (CSS) loss of Mozannar & Sontag (2020), which jointly trains a classifier and rejector using a surrogate loss (with the ResNet model predicting L + 1 logits, and the (L + 1)-th denoting the abstain option). We evaluate our plug-in methods for the balanced error (Alg. 1) and the worst-case error (Alg. 2), both of which use the same base model as Chow's rule. We also evaluate variants of Chow's rule (§4.1), where the base model is trained with either a balanced cross-entropy (BCE) loss implemented through logit adjustments (Menon et al., 2021a), or the DRO loss prescribed by Jones et al. (2021).

Results. We present plots of the balanced and worst-group errors as a function of the rejection rate (for varying values of c) in Figure 3. We summarize the areas under these curves in Table 2. On all three datasets, the proposed plug-in methods are substantially better than both Chow's rule and the CSS baseline. The plug-in methods are also comparable to or better than the variants of Chow's rule that use a balanced or DRO base loss. As shown in Figures 4-5 in Appendix F.5, Chow's rule is unsurprisingly the best on the standard 0-1 error. On the other hand, Figures 6-7 show that our plug-in methods yield a lower gap between the tail and head errors compared to the L2R baselines.
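For completeness, the AURC summary in Table 2 can be reproduced by sweeping the rejection cost c, recording the achieved (rejection rate, error) pairs, and integrating the curve. Below is a minimal sketch using a hand-rolled trapezoidal rule, with our own helper name and the assumption that the curve is sampled on a finite grid of costs.

```python
import numpy as np

def aurc(reject_rates, errors):
    """Area under the risk-coverage curve: integrate the error (balanced or
    worst-group) over the observed proportions of rejections."""
    r = np.asarray(reject_rates, dtype=float)
    e = np.asarray(errors, dtype=float)
    order = np.argsort(r)
    r, e = r[order], e[order]
    # Trapezoidal rule over the sorted (rejection rate, error) pairs.
    return float(np.sum(0.5 * (e[1:] + e[:-1]) * np.diff(r)))
```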
Finally, in Appendix F.6, we show on CIFAR-100 that the heuristic we employ in our plug-in approach to pick the group coverage α (§4.2) performs comparably to an exhaustive grid search over α.

7 CONCLUSION

We have shown that standard L2R can be sub-optimal for general evaluation metrics, and proposed simple plug-in methods that yield significantly better trade-offs for both the balanced and worst-group error. Our methods require only a pre-trained model optimized with the CE loss, and are competitive with complex loss modification approaches. In the future, it is of interest to explore joint training of the classifier and rejector using surrogate losses (Cao et al., 2022), and extensions to hierarchical classification (Wu et al., 2020b) and pairwise ranking (Shen et al., 2020; Mao et al., 2023b).

REFERENCES

Jean V Alves, Diogo Leitão, Sérgio Jesus, Marco OP Sampaio, Javier Liébana, Pedro Saleiro, Mário AT Figueiredo, and Pedro Bizarro. Cost-sensitive learning to defer to multiple experts with workload constraints. arXiv preprint arXiv:2403.06906, 2024.

Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pp. 3121–3124, 2010. doi: 10.1109/ICPR.2010.764.

Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381 [cs, stat], October 2017.

Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, 2019.

Yuzhou Cao, Tianchi Cai, Lei Feng, Lihong Gu, Jinjie Gu, Bo An, Gang Niu, and Masashi Sugiyama. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. Advances in Neural Information Processing Systems, 35:521–534, 2022.

Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1507–1517. PMLR, 18–24 Jul 2021.

C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970. doi: 10.1109/TIT.1970.1054406.

Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In ALT, 2016.

Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, 2023.

Andrew Cotter, Maya Gupta, and Harikrishna Narasimhan. On making stochastic classifiers deterministic. Advances in Neural Information Processing Systems, 32, 2019a.

Andrew Cotter, Heinrich Jiang, Maya Gupta, Serena Wang, Taman Narayan, Seungil You, and Karthik Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 20(172):1–59, 2019b. URL http://jmlr.org/papers/v20/18-616.html.
Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Zongyong Deng, Hao Liu, Yaoxing Wang, Chenyang Wang, Zekuan Yu, and Xuehong Sun. PML: Progressive margin loss for long-tailed age classification. CoRR, abs/2103.02140, 2021. URL https://arxiv.org/abs/2103.02140.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(53):1605–1641, 2010. URL http://jmlr.org/papers/v11/el-yaniv10a.html.

Aditya Gangrade, Anil Kag, and Venkatesh Saligrama. Selective classification via one-sided prediction. In Arindam Banerjee and Kenji Fukumizu (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 2179–2187. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/gangrade21a.html.

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 4885–4894, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, and Percy Liang. Selective classification can magnify disparities across groups. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=N0M_4BkQ05i.

Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In Eighth International Conference on Learning Representations (ICLR), 2020.

Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, and Christos Thrampoulidis. Label-imbalanced and group-sensitive classification under overparameterization. CoRR, abs/2103.01550, 2021. URL https://arxiv.org/abs/2103.01550.

Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, pp. 2744–2752, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Joshua K Lee, Yuheng Bu, Deepta Rajan, Prasanna Sattigeri, Rameswar Panda, Subhro Das, and Gregory W Wornell. Fair selective classification via sufficiency. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 6076–6086. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/lee21b.html.

Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014, Proceedings, Part II, pp. 225–239. Springer, 2014.
Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=GIlsH0T4b2.

Anqi Mao, Mehryar Mohri, and Yutao Zhong. Ranking with abstention. arXiv preprint arXiv:2307.02035, 2023b.

Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, and Sanjiv Kumar. A statistical perspective on distillation. In International Conference on Machine Learning, pp. 7632–7642. PMLR, 2021a.

Aditya Krishna Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 603–611, 2013.

Aditya Krishna Menon, Ankit Singh Rawat, and Sanjiv Kumar. Overparameterisation and worst-case generalisation: friend or foe? In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021b. URL https://openreview.net/forum?id=jphnJNOwe36.

Jooyoung Moon, Jihyo Kim, Younghak Shin, and Sangheum Hwang. Confidence-aware learning for deep neural networks. In International Conference on Machine Learning, pp. 7034–7044. PMLR, 2020.

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7076–7087. PMLR, 13–18 Jul 2020.

Harikrishna Narasimhan. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pp. 1646–1654, 2018.

Harikrishna Narasimhan and Aditya K Menon. Training over-parameterized models with non-decomposable objectives. Advances in Neural Information Processing Systems, 34:18165–18181, 2021.

Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, pp. 2398–2407, 2015a.

Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2398–2407, Lille, France, 07–09 Jul 2015b. PMLR.

Harikrishna Narasimhan, Andrew Cotter, and Maya Gupta. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pp. 10746–10757, 2019.

Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya K Menon, Ankit Rawat, and Sanjiv Kumar. Post-hoc estimators for learning to defer to an expert. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 29292–29304. Curran Associates, Inc., 2022a. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/bc8f76d9caadd48f77025b1c889d2e2d-Paper-Conference.pdf.
Harikrishna Narasimhan, Harish G Ramaswamy, Shiv Kumar Tavker, Drona Khurana, Praneeth Netrapalli, and Shivani Agarwal. Consistent multiclass algorithms for complex metrics and constraints. arXiv preprint arXiv:2210.09695, 2022b.

Jerzy Neyman and Egon Sharpe Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 2582–2592, 2019.

Harish G. Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018. doi: 10.1214/17-EJS1388.

Jiawei Ren, Cunjun Yu, Shunan Sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Balanced meta-softmax for long-tailed visual recognition. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 4175–4186. Curran Associates, Inc., 2020.

S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.

Carla M. Santos-Pereira and Ana M. Pires. On optimal reject rules and ROC curves. Pattern Recognition Letters, 26(7):943–952, 2005. ISSN 0167-8655. doi: https://doi.org/10.1016/j.patrec.2004.09.042. URL https://www.sciencedirect.com/science/article/pii/S0167865504002892.

Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Song-Qing Shen, Bin-Bin Yang, and Wei Gao. AUC optimization with a reject option. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5684–5691, 2020.

J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, and J. Yan. Equalization loss for long-tailed object recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11659–11668, 2020.

S. K. Tavker, H. G. Ramaswamy, and H. Narasimhan. Consistent plug-in classifiers for complex objectives and constraints. In Advances in Neural Information Processing Systems, 2020.

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018.

Rajeev Verma and Eric Nalisnick. Calibrated learning to defer with one-vs-all classifiers. arXiv preprint arXiv:2202.03673, 2022.

Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, and Dahua Lin. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
Serena Wang, Harikrishna Narasimhan, Yichen Zhou, Sara Hooker, Michal Lukasik, and Aditya Krishna Menon. Robust distillation for worst-class performance: On the interplay between teacher and student objectives. In Uncertainty in Artificial Intelligence, pp. 2237–2247. PMLR, 2023.

Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision – ECCV 2020, pp. 162–178, Cham, 2020a. Springer International Publishing. ISBN 978-3-030-58548-8.

Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, and Nuno Vasconcelos. Solving long-tailed recognition with deep realistic taxonomic classifier. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII, pp. 171–189. Springer, 2020b.

Bowei Yan, Sanmi Koyejo, Kai Zhong, and Pradeep Ravikumar. Binary classification with karmic, threshold-quasi-concave metrics. In International Conference on Machine Learning, pp. 5531–5540. PMLR, 2018.

Forest Yang, Moustapha Cisse, and Sanmi Koyejo. Fairness with overlapping groups. In NeurIPS, 2020.

Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Junjie Zhang, Lingqiao Liu, Peng Wang, and Chunhua Shen. To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions, 2019.

Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. Distribution alignment: A unified framework for long-tail visual recognition. In CVPR, 2021.

A PROOFS FOR RESULTS IN MAIN TEXT

A.1 PROOF OF THEOREM 1

The proof follows as a corollary of Theorem 12 in Appendix D.3 by setting $\beta_k = \frac{1}{K},\ \forall k$, and scaling the resulting parameters $\alpha^*_k$ by K. Note that Theorem 12 considers a minimization problem that allows for stochastic classifier-rejector mechanisms. However, since the optimal solution is a deterministic classifier-rejector pair, it still remains the optimal solution when the minimization is over only deterministic classifier-rejector pairs.

A.2 PROOF OF LEMMA 2

We prove a more general version of Lemma 2. Consider a variation of Chow's rule where we modify the base model training to optimize a weighted version of the CE loss in Eq. (3):

$$\sum_{k \in [K]} \beta_k\, \mathbb{E}\left[1(y \in G_k)\, \ell_{\text{ce}}(y, f(x))\right], \quad (15)$$

for some $\beta \in \mathbb{R}^K_+$. One may then use the class probability estimates $p^{\text{wt}}_y(x) \propto \exp(f^{\text{wt}}_y(x))$ from the learned scorer $f^{\text{wt}}$ to implement Chow's rule:

$$h^{\text{wt}}(x) \in \arg\max_{y \in [L]} p^{\text{wt}}_y(x); \qquad r^{\text{wt}}(x) = 1\left(\max_{y \in [L]} p^{\text{wt}}_y(x) < 1 - c\right). \quad (16)$$

Lemma 5. The variant of Chow's rule in Eq. (16), where the base model is trained with the weighted cross-entropy loss in Eq. (15), results in a classifier and rejector of the form:

$$h^{\text{wt}}(x) = \arg\max_{y \in [L]} \beta_{[y]}\, \eta_y(x); \qquad r^{\text{wt}}(x) = 1\left(\max_{y \in [L]} \beta_{[y]}\, \eta_y(x) < \left(\sum_{y' \in [L]} \beta_{[y']}\, \eta_{y'}(x)\right) (1 - c)\right), \quad (17)$$

where [y] is the index of the group to which class y belongs.

Proof. We re-write the weighted CE loss in Eq. (15) as:

$$\mathbb{E}\left[\beta_{[y]}\, \ell_{\text{ce}}(y, f(x))\right].$$

Since the softmax CE loss is a strictly proper-composite loss, we have that the minimizer $p^{\text{wt}}_y(x) \propto \exp(f^{\text{wt}}_y(x))$ of this objective takes the form:

$$p^{\text{wt}}_y(x) = \frac{1}{Z(x)}\, \beta_{[y]}\, \eta_y(x), \quad \text{where } Z(x) = \sum_{y'} \beta_{[y']}\, \eta_{y'}(x).$$
When applying Chow's rule to these probabilities, we get:

$$r^{\text{wt}}(x) = 1\left(\max_y p^{\text{wt}}_y(x) < 1 - c\right),$$

which is the same as:

$$r^{\text{wt}}(x) = 1\left(\max_y \beta_{[y]}\, \eta_y(x) < (1 - c) \sum_{y'} \beta_{[y']}\, \eta_{y'}(x)\right),$$

as desired.

Proof of Lemma 2. The proof follows from Lemma 5 by setting $\beta_k = \frac{1}{\pi_k}$.

A.3 PROOF OF THEOREM 3

See Appendix D.5.

A.4 PROOF OF THEOREM 4

The proof follows from an application of the standard convergence guarantees for the exponentiated gradient descent algorithm (Shalev-Shwartz et al., 2012). One can further bound the estimation errors $\epsilon^{\text{gen}}_t$ in each iteration of Algorithm 2 using standard generalization analysis (Narasimhan et al., 2015a; Tavker et al., 2020; Wang et al., 2023), noting that the classifiers and rejectors $(h^{(t)}, r^{(t)})$ are plug-in estimators chosen from a finite-capacity class (namely that of post-hoc linear adjustments to a fixed model p).

Proof. Recall that the worst-group L2R risk is given by:

$$R^{\text{rej}}_{\text{wst}}(h, r) = \max_k e_k(h, r) + c \cdot P(r(x) = 1).$$

This risk can be equivalently re-written as:

$$R^{\text{rej}}_{\text{wst}}(h, r) = \max_{\beta \in \Delta_K} \sum_k \beta_k\, e_k(h, r) + c \cdot P(r(x) = 1).$$

We apply standard guarantees for the exponentiated gradient ascent algorithm (Shalev-Shwartz et al., 2012), treating $\hat{e}_1(h^{(t)}, r^{(t)}), \ldots, \hat{e}_K(h^{(t)}, r^{(t)})$ as the losses incurred by K experts. We then have that the iterates $(h^{(1)}, r^{(1)}), \ldots, (h^{(T)}, r^{(T)})$ in Algorithm 2 satisfy:

$$\max_{\beta \in \Delta_K} \frac{1}{T} \sum_{t=1}^T \sum_k \beta_k\, \hat{e}_k(h^{(t)}, r^{(t)}) \;\leq\; \frac{1}{T} \sum_{t=1}^T \sum_k \beta^{(t)}_k\, \hat{e}_k(h^{(t)}, r^{(t)}) + 2\sqrt{\frac{\log K}{T}}.$$

Bounding by the estimation error in each iteration, we get:

$$\max_{\beta \in \Delta_K} \frac{1}{T} \sum_{t=1}^T \sum_k \beta_k \left(e_k(h^{(t)}, r^{(t)}) - \epsilon^{\text{gen}}_t\right) \;\leq\; \frac{1}{T} \sum_{t=1}^T \sum_k \beta^{(t)}_k \left(e_k(h^{(t)}, r^{(t)}) + \epsilon^{\text{gen}}_t\right) + 2\sqrt{\frac{\log K}{T}},$$

which is the same as:

$$\max_{\beta \in \Delta_K} \frac{1}{T} \sum_{t=1}^T \sum_k \beta_k\, e_k(h^{(t)}, r^{(t)}) \;\leq\; \frac{1}{T} \sum_{t=1}^T \sum_k \beta^{(t)}_k\, e_k(h^{(t)}, r^{(t)}) + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}.$$

Adding $c \cdot \frac{1}{T}\sum_{t=1}^T P(r^{(t)}(x) = 1)$ to both sides, we have:

$$\max_{\beta \in \Delta_K} \frac{1}{T} \sum_{t=1}^T \left[\sum_k \beta_k\, e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right] \;\leq\; \frac{1}{T} \sum_{t=1}^T \left[\sum_k \beta^{(t)}_k\, e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right] + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}$$
$$\leq \frac{1}{T} \sum_{t=1}^T \left[\inf_{h,r} \left\{\sum_k \beta^{(t)}_k\, e_k(h, r) + c\, P(r(x) = 1)\right\} + \epsilon^{\text{cs}}_t\right] + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}} \quad (18)$$
$$\leq \inf_{h,r} \frac{1}{T} \sum_{t=1}^T \left[\sum_k \beta^{(t)}_k\, e_k(h, r) + c\, P(r(x) = 1)\right] + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}$$
$$= \inf_{h,r} \left[\sum_k \bar{\beta}_k\, e_k(h, r) + c\, P(r(x) = 1)\right] + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}} \quad (19)$$
$$\leq \inf_{h,r} \max_{\beta \in \Delta_K} \left[\sum_k \beta_k\, e_k(h, r) + c\, P(r(x) = 1)\right] + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}$$
$$= \inf_{h,r} \left[\max_k e_k(h, r) + c\, P(r(x) = 1)\right] + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}$$
$$= \inf_{h,r} R^{\text{rej}}_{\text{wst}}(h, r) + \bar{\epsilon}^{\text{cs}} + 2\bar{\epsilon}^{\text{gen}} + 2\sqrt{\frac{\log K}{T}}, \quad (20)$$

where in Eq. (18) we apply the bound on the excess cost-sensitive risk in line 4 of the algorithm, and in Eq. (19) we denote $\bar{\beta}_k = \frac{1}{T}\sum_{t=1}^T \beta^{(t)}_k$. We can re-write the LHS as:

$$\max_{\beta \in \Delta_K} \frac{1}{T} \sum_{t=1}^T \left[\sum_k \beta_k\, e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right] = \max_{\beta \in \Delta_K} \sum_k \beta_k\, \mathbb{E}_{t \sim U(1,\ldots,T)}\left[e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right]$$
$$= \max_{k \in [K]} \mathbb{E}_{t \sim U(1,\ldots,T)}\left[e_k(h^{(t)}, r^{(t)}) + c\, P(r^{(t)}(x) = 1)\right] = \max_{k \in [K]} \mathbb{E}_{t}\left[e_k(h^{(t)}, r^{(t)})\right] + c\, \mathbb{E}_{t}\left[P(r^{(t)}(x) = 1)\right].$$

Substituting back into Eq. (20) completes the proof.

B EQUIVALENCE BETWEEN SELECTIVE CLASSIFICATION AND L2R

Selective classification seeks to learn a classifier and rejector that minimizes the misclassification error with a constraint on the proportion of abstentions:

$$\min_{h,r}\; P(y \neq h(x), r(x) = 0) \quad \text{s.t.} \quad P(r(x) = 1) \leq b, \quad (21)$$

where b > 0 is an abstention budget.

Theorem 6. Suppose $P(y \mid x)$ and $P(x)$ are continuous in x. For any abstention budget b > 0 in the constrained problem in Eq. (21), there exists a cost $c \geq 0$ such that the minimizer of the L2R risk below is also a minimizer for Eq. (21):

$$P(y \neq h(x), r(x) = 0) + c \cdot P(r(x) = 1).$$

The Lagrangian for the constrained problem in Eq. (21) is given by:

$$L(h, r, \lambda) = P(y \neq h(x), r(x) = 0) + \lambda\, (P(r(x) = 1) - b),$$

where $\lambda \geq 0$ is the Lagrange multiplier. We will find it useful to state the following lemma:

Lemma 7.
Let $(h^*_\lambda, r^*_\lambda)$ be the minimizer of $L(h, r, \lambda)$ for Lagrange multiplier $\lambda \geq 0$. Then:

$$P(y \neq h^*_\lambda(x), r^*_\lambda(x) = 0) \;\leq\; P(y \neq h(x), r(x) = 0),$$

for all (h, r) such that $P(r(x) = 1) \leq P(r^*_\lambda(x) = 1)$.

Proof. Since $(h^*_\lambda, r^*_\lambda)$ minimizes the Lagrangian, for any (h, r), $L(h^*_\lambda, r^*_\lambda, \lambda) \leq L(h, r, \lambda)$, i.e.,

$$P(y \neq h^*_\lambda(x), r^*_\lambda(x) = 0) \;\leq\; P(y \neq h(x), r(x) = 0) + \lambda\, (P(r(x) = 1) - P(r^*_\lambda(x) = 1)).$$

Since $\lambda \geq 0$, for any (h, r) such that $P(r(x) = 1) \leq P(r^*_\lambda(x) = 1)$,

$$P(y \neq h^*_\lambda(x), r^*_\lambda(x) = 0) \;\leq\; P(y \neq h(x), r(x) = 0),$$

as desired.

We are now ready to prove Theorem 6.

Proof of Theorem 6. For a fixed $\lambda \geq 0$, the following is a minimizer of the Lagrangian $L(h, r, \lambda)$:

$$h^*_\lambda(x) \in \arg\max_{y \in [L]} P(y \mid x); \qquad r^*_\lambda(x) = 1\left(\max_{y \in [L]} P(y \mid x) < 1 - \lambda\right).$$

The abstention rate for $r^*_\lambda$ can then be written as:

$$P(r^*_\lambda(x) = 1) = \int_{\max_{y \in [L]} P(y \mid x) < 1 - \lambda} P(x)\, dx.$$

Since $P(y \mid x)$ and $P(x)$ are continuous in x, we can always find a $\lambda \geq 0$ for which $P(r^*_\lambda(x) = 1) = b$. Applying Lemma 7 with this choice of $\lambda$, we then have that

$$P(y \neq h^*_\lambda(x), r^*_\lambda(x) = 0) \;\leq\; P(y \neq h(x), r(x) = 0),$$

for all (h, r) such that $P(r(x) = 1) \leq b$. Setting $c = \lambda$ completes the proof.

C CONDITIONAL LEARNING TO REJECT

Geifman & El-Yaniv (2017) minimize a variant of the L2R risk in Eq. (1), where the misclassification error is conditioned on samples that are not rejected:

$$R^{\text{rej}}_{\text{cond}}(h, r) = P(y \neq h(x) \mid r(x) = 0) + c \cdot P(r(x) = 1). \quad (22)$$

We show below that the optimal classifier and rejector for this conditional L2R risk continue to have the same form as Eq. (2), but use a different (distribution-dependent) rejection threshold:

Lemma 8. Under Assumption 1, the Bayes-optimal solution to Eq. (22) is given by:

$$h^*(x) = \arg\max_{y \in [L]} \eta_y(x); \qquad r^*(x) = 1\left(\max_{y \in [L]} \eta_y(x) < 1 - \bar{c}\right), \quad (23)$$

for some distribution-dependent threshold $\bar{c} \in [0, 1]$.

Proof. The conditional L2R risk in Eq. (22) can be expressed as:

$$R^{\text{rej}}_{\text{cond}}(h, r) = \frac{P(y \neq h(x), r(x) = 0)}{1 - P(r(x) = 1)} + c \cdot P(r(x) = 1).$$

Minimizing this risk can be posed as an equivalent constrained optimization problem:

$$\min_{h,\, r,\, \alpha \in (0, 1]} \frac{1}{\alpha}\, P(y \neq h(x), r(x) = 0) + c \cdot P(r(x) = 1) \quad \text{s.t.} \quad P(r(x) = 1) = 1 - \alpha.$$

In particular, there exists an $\alpha^* \in (0, 1]$ such that the minimizer of the following constrained problem also minimizes Eq. (22):

$$\min_{h, r} \frac{1}{\alpha^*}\, P(y \neq h(x), r(x) = 0) + c \cdot P(r(x) = 1) \quad \text{s.t.} \quad P(r(x) = 1) = 1 - \alpha^*. \quad (24)$$

Using arguments similar to the Neyman-Pearson lemma (Neyman & Pearson, 1933), one can show that, under Assumption 1, there exists a Lagrange multiplier $\mu^* \in \mathbb{R}$ such that the minimizer of the following problem is also a minimizer of Eq. (24):

$$\min_{h, r} \frac{1}{\alpha^*}\, P(y \neq h(x), r(x) = 0) + (c + \mu^*)\, P(r(x) = 1),$$

or equivalently:

$$\min_{h, r}\; \mathbb{E}_x\left[\frac{1}{\alpha^*}\left(1 - \eta_{h(x)}(x)\right) 1(r(x) = 0) + (c + \mu^*)\, 1(r(x) = 1)\right].$$

The following are then minimizers of the above objective:

$$h^*(x) \in \arg\max_{y \in [L]} \eta_y(x) \quad \text{and} \quad r^*(x) = 1\left(\max_{y \in [L]} \eta_y(x) < 1 - \alpha^*(c + \mu^*)\right).$$

Setting $\bar{c} = \alpha^*(c + \mu^*)$ completes the proof.

D BAYES-OPTIMAL CLASSIFIERS AND REJECTORS

We derive the Bayes-optimal classifier and rejector for the balanced L2R risk (6) and the generalized L2R risk (14). We will first consider a general formulation where we allow stochastic classifier-rejector mechanisms, and then specialize to the case of deterministic classifiers and rejectors.

D.1 CONFUSION STATISTICS

We begin with some useful definitions. Suppose $Y = [n] = \{1, \ldots, n\}$. We define the confusion matrix for a classifier-rejector pair (h, r) by:

$$C_{ij}(h, r) = \mathbb{E}_{(x,y) \sim P}\left[1(y = i, h(x) = j, r(x) = 0)\right].$$

We allow for stochastic mechanisms F that are defined by a distribution over classifier-rejector pairs.
D BAYES-OPTIMAL CLASSIFIERS AND REJECTORS

We derive the Bayes-optimal classifier and rejector for the balanced L2R risk (6) and the generalized L2R risk (14). We first consider a general formulation in which we allow stochastic classifier-rejector mechanisms, and then specialize to the case of deterministic classifiers and rejectors.

D.1 CONFUSION STATISTICS

We begin with some useful definitions. Suppose $\mathcal{Y} = [L] = \{1, \ldots, L\}$. We define the confusion matrix for a classifier-rejector pair $(h, r)$ by:

$C_{ij}(h, r) = \mathbb{E}_{(x, y) \sim P}\big[ 1(y = i, h(x) = j, r(x) = 0) \big]$.

We allow for stochastic mechanisms $F$ that are defined by a distribution over classifier-rejector pairs. We adopt the convention in Narasimhan et al. (2015a) and define the confusion matrix for a stochastic mechanism $F$ by:

$C_{ij}(F) = \mathbb{E}_{(h, r) \sim F}\big[ C_{ij}(h, r) \big]$.

We will also find it useful to define the space of all confusion matrices realizable by distributions over classifier-rejector pairs:

$\mathcal{C} = \big\{ C(F) \in [0, 1]^{L \times L} \,:\, F \text{ is a distribution over } (h, r) \big\}$.

It is easy to show that the set $\mathcal{C}$ is convex.

Lemma 9. $\mathcal{C}$ is a convex set.

Proof. For any $C^{(1)}, C^{(2)} \in \mathcal{C}$ and $\gamma \in [0, 1]$, we will show $\gamma C^{(1)} + (1 - \gamma) C^{(2)} \in \mathcal{C}$. Clearly, there exist stochastic mechanisms $F_1, F_2$ such that $C^{(1)} = C(F_1)$ and $C^{(2)} = C(F_2)$. Since $F' = \gamma F_1 + (1 - \gamma) F_2$ represents a valid stochastic mechanism, $C(F') = \gamma C^{(1)} + (1 - \gamma) C^{(2)} \in \mathcal{C}$.

D.2 OPTIMAL SOLUTIONS TO LINEAR MINIMIZATION OVER $\mathcal{C}$

We first state a couple of helper lemmas that characterize the optimal solutions to linear minimization problems over the set $\mathcal{C}$. Suppose we wish to minimize the following objective over stochastic mechanisms $F$:

$\min_F \sum_{i,j} W_{ij} \, C_{ij}(F)$,

for coefficients $W \in \mathbb{R}^{L \times L}$. One may equivalently formulate this as a linear minimization over $\mathcal{C}$:

$\min_{C \in \mathcal{C}} \sum_{i,j} W_{ij} \, C_{ij}$. (25)

Lemma 10. Suppose Assumption 1 holds. Let $W \in \mathbb{R}^{L \times L}$ be a coefficient matrix such that no two columns are identical. Then Eq. (25) admits a unique solution. Furthermore, the minimizing confusion matrix is achieved by the following deterministic classifier-rejector pair:

$h^*(x) \in \arg\min_{y \in [L]} \sum_i W_{i,y} \, \eta_i(x); \quad r^*(x) = 1 \iff \min_{y \in [L]} \sum_i W_{i,y} \, \eta_i(x) > 0$.

Proof. The first part (the uniqueness claim) directly follows from Lemma 24 in Narasimhan et al. (2022b). For the second part, we re-write Eq. (25) as a minimization problem over stochastic mechanisms $F$:

$\min_{C \in \mathcal{C}} \sum_{i,j} W_{ij} C_{ij} = \min_F \sum_{i,j} W_{ij} C_{ij}(F) = \min_F \mathbb{E}_{(h,r) \sim F}\Big[ \sum_{i,j} W_{ij} C_{ij}(h, r) \Big]$
$= \min_{(h,r)} \sum_{i,j} W_{ij} C_{ij}(h, r) = \min_{(h,r)} \mathbb{E}_x\Big[ \sum_{i,j} W_{ij} \, P(y = i \mid x) \, 1(h(x) = j, r(x) = 0) \Big]$
$= \min_{(h,r)} \mathbb{E}_x\Big[ \sum_{i,j} W_{ij} \, \eta_i(x) \, 1(h(x) = j, r(x) = 0) \Big]$, (26)

where the third step follows from the fact that for any function $\phi$, the expectation $\mathbb{E}_{(h,r) \sim F}[\phi(h, r)]$ is minimized by a point mass on $\arg\min_{(h,r)} \phi(h, r)$. The minimizer of Eq. (26) minimizes the term within the expectation pointwise: for any given $x$,

$(h^*(x), r^*(x)) \in \arg\min_{y \in [L], \, z \in \{0,1\}} \sum_i W_{i,y} \, \eta_i(x) \, 1(z = 0)$,

which takes the form in the statement of the lemma: predicting class $y$ incurs cost $\sum_i W_{i,y} \eta_i(x)$ while rejecting incurs cost 0, so it is optimal to reject exactly when the smallest prediction cost is positive.
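To make the Lemma 10 rule concrete, the following sketch (ours, with class-probability estimates from a model standing in for $\eta(x)$, and a hypothetical function name) computes the deterministic classifier-rejector pair for a given coefficient matrix $W$ with no two identical columns.

```python
import numpy as np

def linear_risk_rule(probs: np.ndarray, W: np.ndarray):
    """Deterministic minimizer of the linear confusion-matrix risk (Lemma 10).

    probs: (N, L) class-probability estimates; W: (L, L) coefficient matrix.
    For each x, predict the column y minimizing sum_i W[i, y] * eta_i(x),
    and reject iff even that minimum cost is positive (rejecting costs 0
    in this linear objective).
    """
    scores = probs @ W                  # scores[n, y] = sum_i eta_i(x_n) W[i, y]
    preds = scores.argmin(axis=1)       # h*(x)
    reject = (scores.min(axis=1) > 0).astype(int)   # r*(x)
    return preds, reject
```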
For our next helper lemma, we consider a constrained minimization problem over stochastic mechanisms $F$:

$\min_F \sum_{i,j} W_{ij} \, C_{ij}(F)$ s.t. $\sum_{i,j} w^{(k)}_i \, C_{ij}(F) = b_k, \; k \in [K]$,

for coefficients $W \in \mathbb{R}^{L \times L}$ and $w^{(1)}, \ldots, w^{(K)} \in \mathbb{R}^L$. Once again, we may formulate this as an equivalent constrained linear minimization over $\mathcal{C}$:

$\min_{C \in \mathcal{C}} \sum_{i,j} W_{ij} \, C_{ij}$ s.t. $\sum_{i,j} w^{(k)}_i \, C_{ij} = b_k, \; k \in [K]$. (27)

Lemma 11. Suppose Assumption 1 holds. Let $W \in \mathbb{R}^{L \times L}$ be a coefficient matrix such that no two columns are identical. Let $w^{(1)}, \ldots, w^{(K)} \in \mathbb{R}^L$ be constraint coefficients such that Eq. (27) admits a feasible solution. Then there exist multipliers $\mu^* \in \mathbb{R}^K$ such that the minimizer of Eq. (27) coincides with the minimizer of the following unconstrained linear minimization over $\mathcal{C}$:

$\min_{C \in \mathcal{C}} \sum_{i,j} \Big( W_{ij} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i \Big) C_{ij}$. (28)

Proof. We first write the Lagrangian for Eq. (27):

$L(C, \mu) = \sum_{i,j} W_{ij} C_{ij} - \sum_{k \in [K]} \mu_k \Big( \sum_{i,j} w^{(k)}_i C_{ij} - b_k \Big)$, (29)

where $\mu_1, \ldots, \mu_K \in \mathbb{R}$ are the multipliers associated with the $K$ constraints. Since the objective and constraints in Eq. (27) are linear in $C$, and $\mathcal{C}$ is convex from Lemma 9, Eq. (27) is a convex minimization problem. Furthermore, since it has at least one feasible solution, Slater's condition is satisfied, and strong duality holds. As a result, there exists an optimal primal solution $C^*$ that satisfies the constraints in Eq. (27), and an optimal dual solution $\mu^*$, such that:

$L(C^*, \mu^*) = \min_{C \in \mathcal{C}} L(C, \mu^*) = \max_{\mu \in \mathbb{R}^K} L(C^*, \mu) = \sum_{i,j} W_{ij} C^*_{ij}$. (30)

Furthermore, we have that:

$\min_{C \in \mathcal{C}} L(C, \mu^*) = \min_{C \in \mathcal{C}} \Big[ \sum_{i,j} \Big( W_{ij} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i \Big) C_{ij} + \omega \Big] = \min_{C \in \mathcal{C}} \Big[ \sum_{i,j} \bar{L}_{ij} C_{ij} + \omega \Big]$,

where $\bar{L}_{ij} = W_{ij} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i$ and $\omega = \sum_{k \in [K]} \mu^*_k b_k$ is independent of $C$. Observe that the coefficient matrix $\bar{L}$ on the right-hand side has no two columns identical. To see this, note that each column of $\bar{L}$ is of the form $\bar{L}_{:,j} = W_{:,j} - \sum_{k \in [K]} \mu^*_k w^{(k)}$, where $w^{(k)} = [w^{(k)}_1, \ldots, w^{(k)}_L]^\top$. Since $W$ has no two columns identical, i.e., $W_{:,j} \neq W_{:,j'}$ for $j \neq j'$, and the subtracted term is the same for every column, we have $\bar{L}_{:,j} \neq \bar{L}_{:,j'}$.

Using Assumption 1 and the above observation, we can apply the first part of Lemma 10 to show that $\min_{C \in \mathcal{C}} L(C, \mu^*)$ has a unique solution. Since $L(C^*, \mu^*) = \min_{C \in \mathcal{C}} L(C, \mu^*)$ from Eq. (30), the optimal primal solution $C^*$ is the unique minimizer of $\min_{C \in \mathcal{C}} L(C, \mu^*)$. In other words, $C^*$ is the unique minimizer of the unconstrained linear problem in Eq. (28).

D.3 BAYES-OPTIMAL CLASSIFIER AND REJECTOR FOR COST-SENSITIVE L2R

In this section, we derive the Bayes-optimal stochastic mechanism that minimizes the following cost-sensitive objective, with a penalty of $c > 0$ for abstention. Specifically, define:

$R^{\mathrm{rej}}_{\mathrm{cs}}(h, r) = \sum_{k \in [K]} \beta_k \, P\big(y \neq h(x) \mid r(x) = 0, y \in G_k\big) + c \cdot P(r(x) = 1)$, (31)

where $\beta \in \Delta_K$ are the costs associated with the $K$ groups. One can re-write Eq. (31) in terms of the confusion matrix $C(h, r)$ as follows:

$R^{\mathrm{rej}}_{\mathrm{cs}}(h, r) = \sum_{k \in [K]} \beta_k \, \dfrac{P(y \neq h(x), r(x) = 0, y \in G_k)}{P(r(x) = 0, y \in G_k)} + c \cdot P(r(x) = 1)$
$= \sum_{k \in [K]} \beta_k \, \dfrac{\sum_{i \in G_k} \sum_{j \neq i} C_{ij}(h, r)}{\sum_{i \in G_k} \sum_{j \in [L]} C_{ij}(h, r)} + c \Big( 1 - \sum_{i,j} C_{ij}(h, r) \Big)$.

For a stochastic mechanism $F$, we can similarly define

$R^{\mathrm{rej}}_{\mathrm{cs}}(F) = \sum_{k \in [K]} \beta_k \, \dfrac{\sum_{i \in G_k} \sum_{j \neq i} C_{ij}(F)}{\sum_{i \in G_k} \sum_{j \in [L]} C_{ij}(F)} + c \Big( 1 - \sum_{i,j} C_{ij}(F) \Big)$. (32)

We derive an optimal stochastic classifier that minimizes Eq. (32):

Theorem 12. Under Assumptions 1 and 2, there exist group-specific parameters $\alpha^* \in (0, 1)^K$ and $\mu^* \in \mathbb{R}^K$ such that the following deterministic classifier-rejector pair is an optimal solution to Eq. (32):

$h^*(x) = \arg\max_{y \in [L]} \dfrac{\beta_{[y]}}{\alpha^*_{[y]}} \, \eta_y(x); \quad r^*(x) = 1 \iff \max_{y \in [L]} \dfrac{\beta_{[y]}}{\alpha^*_{[y]}} \, \eta_y(x) < \sum_{y' \in [L]} \Big( \dfrac{\beta_{[y']}}{\alpha^*_{[y']}} - \mu^*_{[y']} \Big) \eta_{y'}(x) - c$, (33)

where $[y]$ is the index of the group that class $y$ belongs to. Furthermore, $\alpha^*_k = P(r^*(x) = 0, y \in G_k)$.

Proof. From the boundedness condition in Assumption 2, we know that there exists $\alpha^* \in (0, 1)^K$ such that the minimizer of the following problem also minimizes Eq. (32):

$\min_F \sum_{k \in [K]} \dfrac{\beta_k}{\alpha^*_k} \sum_{i \in G_k} \sum_{j \neq i} C_{ij}(F) + c \Big( 1 - \sum_{i,j} C_{ij}(F) \Big)$ (34)
s.t. $\sum_{i \in G_k} \sum_{j \in [L]} C_{ij}(F) = \alpha^*_k, \; k \in [K]$.

One may equivalently formulate a constrained linear minimization problem over $\mathcal{C}$:

$\min_{C \in \mathcal{C}} \sum_{k \in [K]} \dfrac{\beta_k}{\alpha^*_k} \sum_{i \in G_k} \sum_{j \neq i} C_{ij} + c \Big( 1 - \sum_{i,j} C_{ij} \Big)$ (35)
s.t. $\sum_{i \in G_k} \sum_{j \in [L]} C_{ij} = \alpha^*_k, \; k \in [K]$.

One can now directly apply the result in Lemma 11 with $W_{ij} = \sum_{k \in [K]} \frac{\beta_k}{\alpha^*_k} \, 1(i \in G_k, j \neq i) - c$ and $w^{(k)}_i = 1(i \in G_k)$, $i \in [L]$, noting that no two columns of $W$ are identical. Consequently, there exist multipliers $\mu^* \in \mathbb{R}^K$ such that the minimizer of the following unconstrained linear minimization over $\mathcal{C}$ also minimizes Eq. (35):

$\min_{C \in \mathcal{C}} \sum_{i,j} \Big( W_{ij} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i \Big) C_{ij}$.

Applying Lemma 10, we have that the above problem admits a unique minimizer $C^*$, which is achieved by the following deterministic classifier-rejector pair:

$h^*(x) \in \arg\min_{y \in [L]} \sum_i \Big( W_{i,y} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i \Big) \eta_i(x); \quad r^*(x) = 1 \iff \min_{y \in [L]} \sum_i \Big( W_{i,y} - \sum_{k \in [K]} \mu^*_k w^{(k)}_i \Big) \eta_i(x) > 0$.

Substituting for $W$ and $w^{(k)}$ then gives us:

$h^*(x) \in \arg\min_{y \in [L]} \sum_{k \in [K]} \sum_{i \in G_k} \Big( \dfrac{\beta_k}{\alpha^*_k} \, 1(y \neq i) - \mu^*_k \Big) \eta_i(x) - c; \quad r^*(x) = 1 \iff \min_{y \in [L]} \sum_{k \in [K]} \sum_{i \in G_k} \Big( \dfrac{\beta_k}{\alpha^*_k} \, 1(y \neq i) - \mu^*_k \Big) \eta_i(x) > c$.

The optimal classifier $h^*$ then simplifies to:

$h^*(x) \in \arg\max_{y \in [L]} \dfrac{\beta_{[y]}}{\alpha^*_{[y]}} \, \eta_y(x)$,

since each class $y$ belongs to exactly one group, and only the term $-\frac{\beta_{[y]}}{\alpha^*_{[y]}} \eta_y(x)$ in the objective depends on $y$; here $[y]$ is the index of the group that class $y$ belongs to. Similarly, the optimal rejector $r^*$ simplifies to

$r^*(x) = 1 \iff \sum_{y' \in [L]} \Big( \dfrac{\beta_{[y']}}{\alpha^*_{[y']}} - \mu^*_{[y']} \Big) \eta_{y'}(x) - \max_{y \in [L]} \dfrac{\beta_{[y]}}{\alpha^*_{[y]}} \, \eta_y(x) > c$,

which is the same as:

$r^*(x) = 1 \iff \max_{y \in [L]} \dfrac{\beta_{[y]}}{\alpha^*_{[y]}} \, \eta_y(x) < \sum_{y' \in [L]} \Big( \dfrac{\beta_{[y']}}{\alpha^*_{[y']}} - \mu^*_{[y']} \Big) \eta_{y'}(x) - c$,

as desired.
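For intuition, a minimal plug-in sketch of the Theorem 12 rule follows (ours, not the paper's code). The group parameters here are assumptions: in the theorem they are distribution-dependent quantities that must be estimated or tuned.

```python
import numpy as np

def cost_sensitive_l2r_rule(probs, group_of, beta, alpha, mu, c):
    """Plug-in version of the Theorem 12 classifier-rejector pair, Eq. (33).

    probs: (N, L) class-probability estimates; group_of: (L,) array mapping
    each class y to its group index [y]; beta, alpha, mu: (K,) group costs,
    coverage parameters alpha* and multipliers mu*; c: rejection cost.
    """
    u = beta[group_of] / alpha[group_of]   # u_y = beta_[y] / alpha*_[y]
    v = u - mu[group_of]                   # v_y = beta_[y]/alpha*_[y] - mu*_[y]
    scores = probs * u                     # u_y * eta_y(x)
    preds = scores.argmax(axis=1)          # h*(x)
    reject = (scores.max(axis=1) < probs @ v - c).astype(int)   # r*(x)
    return preds, reject
```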
D.4 SPECIAL CASE OF BINARY GROUPS WITH BINARY LABELS

In the special case where we have binary labels, each in a separate group, the rejector in Theorem 12 takes a much simpler form, with a constant threshold for each class.

Corollary 13. Suppose $\mathcal{Y} = \mathcal{G} = \{0, 1\}$. Under Assumption 1, the optimal classifier and rejector for Eq. (32) simplify to applying class-specific thresholds $\tau^*_0, \tau^*_1 \in \mathbb{R}$ and $\gamma^* \in (0, 1)$:

$h^*(x) = 1\big(\eta_1(x) > \gamma^*\big); \quad r^*(x) = 1 \iff \eta_{h^*(x)}(x) < \tau^*_{h^*(x)}$.

Proof. When $\mathcal{Y} = \mathcal{G} = \{0, 1\}$, we have from Theorem 12 that the optimal classifier takes the form:

$h^*(x) = \arg\max\Big\{ \dfrac{\beta_0}{\alpha^*_0} \eta_0(x), \; \dfrac{\beta_1}{\alpha^*_1} \eta_1(x) \Big\} = 1\Big( \dfrac{\beta_0}{\alpha^*_0} \eta_0(x) < \dfrac{\beta_1}{\alpha^*_1} \eta_1(x) \Big) = 1\big(\eta_1(x) > \gamma^*\big)$,

where $\gamma^* = \dfrac{\beta_0 / \alpha^*_0}{\beta_0 / \alpha^*_0 + \beta_1 / \alpha^*_1}$, as desired; the last equality uses the fact that $\eta_0(x) + \eta_1(x) = 1$. Similarly, the optimal rejector is given by:

$r^*(x) = 1 \iff \dfrac{\beta_{h^*(x)}}{\alpha^*_{h^*(x)}} \eta_{h^*(x)}(x) < \Big( \dfrac{\beta_0}{\alpha^*_0} - \mu^*_0 \Big) \eta_0(x) + \Big( \dfrac{\beta_1}{\alpha^*_1} - \mu^*_1 \Big) \eta_1(x) - c$. (36)

Consider the case when $h^*(x) = 1$. We can then use the fact that $\eta_0(x) = 1 - \eta_1(x)$ to re-write the condition in Eq. (36) as:

$\dfrac{\beta_1}{\alpha^*_1} \eta_1(x) < \Big( \dfrac{\beta_0}{\alpha^*_0} - \mu^*_0 \Big) \big(1 - \eta_1(x)\big) + \Big( \dfrac{\beta_1}{\alpha^*_1} - \mu^*_1 \Big) \eta_1(x) - c$,

which simplifies to:

$\eta_1(x) < \tau^*_1 \triangleq \dfrac{\beta_0 / \alpha^*_0 - \mu^*_0 - c}{\beta_0 / \alpha^*_0 + \mu^*_1 - \mu^*_0}$.

Similarly, when $h^*(x) = 0$, the condition in Eq. (36) simplifies to:

$\eta_0(x) < \tau^*_0 \triangleq \dfrac{\beta_1 / \alpha^*_1 - \mu^*_1 - c}{\beta_1 / \alpha^*_1 + \mu^*_0 - \mu^*_1}$,

which completes the proof.
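A short sketch of the Corollary 13 rule follows (our illustration; the three thresholds are assumptions standing in for the distribution-dependent quantities $\gamma^*, \tau^*_0, \tau^*_1$).

```python
import numpy as np

def binary_l2r_rule(eta1, gamma, tau0, tau1):
    """Thresholded rule from Corollary 13 (binary labels, binary groups).

    eta1: (N,) estimates of P(y = 1 | x); gamma: classification threshold
    gamma*; tau0, tau1: class-specific rejection thresholds tau*_0, tau*_1.
    """
    preds = (eta1 > gamma).astype(int)                   # h*(x)
    eta_pred = np.where(preds == 1, eta1, 1.0 - eta1)    # eta_{h*(x)}(x)
    tau = np.where(preds == 1, tau1, tau0)               # tau*_{h*(x)}
    reject = (eta_pred < tau).astype(int)                # r*(x)
    return preds, reject
```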
D.5 BAYES-OPTIMAL CLASSIFIER AND REJECTOR FOR GENERALIZED L2R

We next derive the Bayes-optimal classifier and rejector for the generalized L2R risk in Eq. (14), which we restate below:

$R^{\mathrm{rej}}_{\mathrm{gen}}(h, r) = \psi\big(e_1(h, r), \ldots, e_K(h, r)\big) + c \cdot P(r(x) = 1)$, (37)

where $\psi : [0, 1]^K \to \mathbb{R}_+$ is a general function, and encompasses several common metrics (Lipton et al., 2014; Menon et al., 2013; Narasimhan et al., 2015a). One may equivalently write Eq. (37) in terms of the confusion matrix $C(h, r)$:

$R^{\mathrm{rej}}_{\mathrm{gen}}(h, r) = \psi\Big( \dfrac{\sum_{i \in G_1} \sum_{j \neq i} C_{ij}(h, r)}{\sum_{i \in G_1} \sum_{j \in [L]} C_{ij}(h, r)}, \ldots, \dfrac{\sum_{i \in G_K} \sum_{j \neq i} C_{ij}(h, r)}{\sum_{i \in G_K} \sum_{j \in [L]} C_{ij}(h, r)} \Big) + c \Big( 1 - \sum_{i,j} C_{ij}(h, r) \Big)$.

Allowing for stochastic mechanisms $F$ defined by a distribution over classifier-rejector pairs, we would like to minimize:

$\min_F \psi\Big( \dfrac{\sum_{i \in G_1} \sum_{j \neq i} C_{ij}(F)}{\sum_{i \in G_1} \sum_{j \in [L]} C_{ij}(F)}, \ldots, \dfrac{\sum_{i \in G_K} \sum_{j \neq i} C_{ij}(F)}{\sum_{i \in G_K} \sum_{j \in [L]} C_{ij}(F)} \Big) + c \Big( 1 - \sum_{i,j} C_{ij}(F) \Big)$,

or equivalently:

$\min_F \psi\Big( \dfrac{\phi_1(C(F))}{\pi_1(C(F))}, \ldots, \dfrac{\phi_K(C(F))}{\pi_K(C(F))} \Big) + c \Big( 1 - \sum_k \pi_k(C(F)) \Big)$, (38)

where $\phi_k(C) = \sum_{i \in G_k} \sum_{j \neq i} C_{ij}$ and $\pi_k(C) = \sum_{i \in G_k} \sum_{j \in [L]} C_{ij}$.

Theorem 7 (restated). For any non-linear $\psi$, there exist coefficients $u^{(i)}, v^{(i)} \in \mathbb{R}^K$, $i \in [K+1]$, and a parameter $\nu \in \Delta_{K+1}$, such that the generalized L2R risk in Eq. (14) is minimized by a stochastic mechanism that, for any $x$, predicts using $(h^{(i)}, r^{(i)})$ with probability $\nu_i$:

$h^{(i)}(x) \in \arg\max_y u^{(i)}_y \eta_y(x), \; i \in [K+1]; \quad r^{(i)}(x) = 1 \iff \max_y u^{(i)}_y \eta_y(x) < \sum_{y'} v^{(i)}_{y'} \eta_{y'}(x) - c, \; i \in [K+1]$.

Proof. One may reformulate Eq. (38) as an optimization over the space $\mathcal{C}$ of all confusion statistics realizable through distributions over classifier-rejector pairs, or equivalently over a smaller space of transformed confusion matrices:

$\mathcal{C}' = \big\{ (\phi, \pi) \,:\, \phi = (\phi_1(C), \ldots, \phi_K(C)), \; \pi = (\pi_1(C), \ldots, \pi_K(C)), \; C \in \mathcal{C} \big\}$.

We can then solve Eq. (38) by solving:

$\min_{(\phi, \pi) \in \mathcal{C}'} \psi\Big( \dfrac{\phi_1}{\pi_1}, \ldots, \dfrac{\phi_K}{\pi_K} \Big) + c \Big( 1 - \sum_k \pi_k \Big)$. (39)

We know from Lemma 9 that $\mathcal{C}$ is convex; consequently, it is easy to see that $\mathcal{C}'$, a linear image of $\mathcal{C}$, is also convex. Therefore any point $(\phi, \pi) \in \mathcal{C}'$ can be expressed as a convex combination of extreme points of $\mathcal{C}'$. In fact, following Proposition 10 in Narasimhan et al. (2022b), any $(\phi, \pi) \in \mathcal{C}'$ can be expressed as a convex combination of $K+1$ extreme points $(\phi^{(1)}, \pi^{(1)}), \ldots, (\phi^{(K+1)}, \pi^{(K+1)})$, each of which is a unique linear minimizer over $\mathcal{C}'$:

$(\phi^{(i)}, \pi^{(i)}) \in \arg\min_{(\phi, \pi) \in \mathcal{C}'} \sum_k w^{(i)}_k \phi_k + \sum_k q^{(i)}_k \pi_k$,

for coefficients $w^{(i)}, q^{(i)} \in \mathbb{R}^K$, $i \in [K+1]$. Therefore this is also true for the minimizer $(\phi^*, \pi^*) \in \mathcal{C}'$ of Eq. (39).

All that remains is to show that these $K+1$ linear minimizers are realized by $K+1$ classifier-rejector pairs of the form in the theorem statement. To show this, we note that for coefficients $w^{(i)}, q^{(i)} \in \mathbb{R}^K$, minimizing the linear objective over $\mathcal{C}'$:

$\min_{(\phi, \pi) \in \mathcal{C}'} \sum_k w^{(i)}_k \phi_k + \sum_k q^{(i)}_k \pi_k$

is equivalent to minimizing the same objective over the boundary points of $\mathcal{C}'$, and equivalently, over deterministic classifier-rejector pairs:

$\min_{(h, r)} \sum_k w^{(i)}_k \phi_k(C(h, r)) + \sum_k q^{(i)}_k \pi_k(C(h, r))$. (40)

Expanding the above objective, we get:

$\sum_k w^{(i)}_k \, \mathbb{E}\big[1(y \neq h(x), r(x) = 0, y \in G_k)\big] + \sum_k q^{(i)}_k \, \mathbb{E}\big[1(r(x) = 0, y \in G_k)\big]$,

which is the same as:

$\mathbb{E}_x\Big[ \Big( \sum_k w^{(i)}_k \sum_{y \in G_k} \eta_y(x) \, 1(y \neq h(x)) + \sum_k q^{(i)}_k \sum_{y \in G_k} \eta_y(x) \Big) 1(r(x) = 0) \Big]$
$= \mathbb{E}_x\Big[ \Big( \sum_y w^{(i)}_{[y]} \eta_y(x) - w^{(i)}_{[h(x)]} \eta_{h(x)}(x) + \sum_y q^{(i)}_{[y]} \eta_y(x) \Big) 1(r(x) = 0) \Big]$.

The minimizer of Eq. (40) then takes the form:

$h^{(i)}(x) \in \arg\max_{y \in [L]} w^{(i)}_{[y]} \eta_y(x); \quad r^{(i)}(x) = 1 \iff \max_{y \in [L]} w^{(i)}_{[y]} \eta_y(x) < \sum_{y \in [L]} \big( w^{(i)}_{[y]} + q^{(i)}_{[y]} \big) \eta_y(x)$.

Setting $u^{(i)}_y = w^{(i)}_{[y]}$ and $v^{(i)}_y = w^{(i)}_{[y]} + q^{(i)}_{[y]} + c$, the above is equivalent to:

$h^{(i)}(x) \in \arg\max_{y \in [L]} u^{(i)}_y \eta_y(x); \quad r^{(i)}(x) = 1 \iff \max_{y \in [L]} u^{(i)}_y \eta_y(x) < \sum_{y \in [L]} v^{(i)}_y \eta_y(x) - c$.

It thus follows that the minimizer $(\phi^*, \pi^*) = (\phi(C(F^*)), \pi(C(F^*)))$ of Eq. (39) is realized by a stochastic mechanism $F^*$ that randomizes over the $K+1$ classifier-rejector pairs $(h^{(1)}, r^{(1)}), \ldots, (h^{(K+1)}, r^{(K+1)})$.

Algorithm 3 Lagrangian-based Plug-in for Balanced Error
1: Input: Rejection cost $c$, pre-trained $p : \mathcal{X} \to \Delta_L$, sample $S$
2: Parameters: Iterations $M, T$; step-sizes $\xi_\mu, \xi_\alpha$; initial $\alpha^{(0)} \in (0, K)^K$; lower bound $\kappa > 0$
3: For $m = 0$ to $M$:
4:   $h^{(m)}(x) = \arg\max_{y \in [L]} u^{(m)}_y p_y(x)$, where $u^{(m)}_y = 1 / \alpha^{(m)}_{[y]}$
5:   Initialize $\mu^{(m,0)} \in \mathbb{R}^K$
6:   For $t = 0$ to $T - 1$:
7:     $\mu^{(m,t+1)}_k = \mu^{(m,t)}_k + \xi_\mu \Big( \alpha^{(m)}_k - \frac{K}{|S|} \sum_{(x,y) \in S} 1\big( r^{(m,t)}(x) = 0, y \in G_k \big) \Big)$, $k \in [K]$
8:     $r^{(m,t+1)}(x) = 1 \iff \max_{y \in [L]} u^{(m)}_y p_y(x) < \sum_{y' \in [L]} v^{(m)}_{y'} p_{y'}(x) - c$,
9:       where $v^{(m)}_y = 1 / \alpha^{(m)}_{[y]} - \mu^{(m,t+1)}_{[y]}$
10:  End For
11:  $\alpha^{(m+1)}_k = \alpha^{(m)}_k - \xi_\alpha \Big( \mu^{(m,T)}_k - \frac{1}{(\alpha^{(m)}_k)^2} \, \hat{e}_k(h^{(m)}, r^{(m,T)}) \Big)$
12:  $\alpha^{(m+1)}_k = \mathrm{proj}_{[\kappa, K - \kappa]}\big( \alpha^{(m+1)}_k \big)$, where $\mathrm{proj}_{[a,b]}(z) = \max\{\min\{z, b\}, a\}$
13: End For
14: Return: $(h^{(M)}, r^{(M,T)})$

E PLUG-IN APPROACH: FURTHER DETAILS

We provide further details about the plug-in approach in §4.

E.1 RE-PARAMETERIZING THE REJECTOR FOR THE BALANCED ERROR

One can further prune the search space over multipliers $\hat{\mu}$ by re-parameterizing the rejection criterion in Eq. (13) to search over only $K - 1$ parameters. To this end, using the fact that $\sum_{y \in [L]} p_y(x) = 1$, we re-write the criterion as:

$\max_{y \in [L]} \dfrac{1}{\hat{\alpha}_{[y]}} p_y(x) < \sum_{y' \in [L]} \dfrac{1}{\hat{\alpha}_{[y']}} p_{y'}(x) - \sum_{y' \in [L]} (\hat{\mu}_{[y']} - \hat{\mu}_K) \, p_{y'}(x) - \hat{\mu}_K \sum_{y' \in [L]} p_{y'}(x) - c$,

which is the same as:

$\max_{y \in [L]} \dfrac{1}{\hat{\alpha}_{[y]}} p_y(x) < \sum_{y' \in [L]} \dfrac{1}{\hat{\alpha}_{[y']}} p_{y'}(x) - \sum_{k \in [K-1]} (\hat{\mu}_k - \hat{\mu}_K) \sum_{y' \in G_k} p_{y'}(x) - \hat{\mu}_K - c$.

This can be equivalently re-parameterized as:

$\max_{y \in [L]} \dfrac{1}{\hat{\alpha}_{[y]}} p_y(x) < \sum_{y' \in [L]} \dfrac{1}{\hat{\alpha}_{[y']}} p_{y'}(x) - \sum_{k \in [K-1]} \hat{\lambda}_k \sum_{y' \in G_k} p_{y'}(x) - \hat{\tau}$,

where $\hat{\lambda}_k = \hat{\mu}_k - \hat{\mu}_K$ for $k \in [K-1]$, and $\hat{\tau} = \hat{\mu}_K + c$.

Note that in practice, one is often prescribed a target rejection rate, and picks the rejection cost $c$ to achieve this rejection rate. For fixed $\hat{\lambda}_1, \ldots, \hat{\lambda}_{K-1}$, one may equivalently pick the threshold $\hat{\tau}$ (instead of the cost $c$) to reach the desired rejection rate. So given a target rejection rate, all one needs to do is apply a brute-force search to tune the remaining $K - 1$ unknowns to minimize the balanced error on a validation set, with the threshold $\hat{\tau}$ set to a percentile that matches the desired rejection rate. When $K = 2$ (head and tail), i.e., $\mathcal{G} = \{0, 1\}$, this means we only have a single parameter $\hat{\lambda}_0 = \hat{\mu}_0 - \hat{\mu}_1$ to tune, which can be done efficiently using a simple line search, as sketched below.
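The following sketch (ours) illustrates this line search for $K = 2$; the candidate grid for $\hat{\lambda}_0$ and the helper's name are assumptions, and $\hat{\tau}$ is set by a percentile so that each candidate matches the target rejection rate.

```python
import numpy as np

def tune_lambda0(probs, labels, group_of, alpha_hat, target_rate,
                 lambda_grid=np.linspace(-10.0, 10.0, 81)):
    """Line search over lambda_0 = mu_0 - mu_1 for K = 2 groups (App. E.1).

    probs: (N, L) model probabilities on a validation sample; labels: (N,)
    classes; group_of: (L,) class-to-group map; alpha_hat: (2,) group
    coverage parameters. Returns the candidate with the lowest balanced
    error among non-rejected samples.
    """
    u = 1.0 / alpha_hat[group_of]              # u_y = 1 / alpha_hat_[y]
    weighted = probs * u
    preds = weighted.argmax(axis=1)            # classifier from Eq. (13)
    lhs = weighted.max(axis=1)                 # max_y u_y p_y(x)
    base = probs @ u                           # sum_y u_y p_y(x)
    g0 = probs[:, group_of == 0].sum(axis=1)   # sum_{y in G_0} p_y(x)
    sample_group = group_of[labels]

    best = (np.inf, None, None)
    for lam in lambda_grid:
        margin = lhs - (base - lam * g0)       # reject iff margin < -tau
        tau = -np.quantile(margin, target_rate)  # match target rejection rate
        reject = margin < -tau
        errs = []
        for k in (0, 1):
            mask = (~reject) & (sample_group == k)
            if mask.sum() > 0:
                errs.append(np.mean(preds[mask] != labels[mask]))
        bal_err = float(np.mean(errs))
        if bal_err < best[0]:
            best = (bal_err, float(lam), float(tau))
    return best  # (balanced error, lambda_0, tau)
```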
E.2 LAGRANGIAN-BASED PLUG-IN APPROACH

As an alternative to the plug-in approach in Algorithm 1, we explore a Lagrangian-based approach that handles the constraints more intricately. We begin by introducing multipliers $\mu \in \mathbb{R}^K$ for the $K$ constraints in the optimization problem in Eq. (7), and writing out the Lagrangian:

$L(h, r, \alpha; \mu) = \sum_{k \in [K]} \dfrac{1}{\alpha_k} P(y \neq h(x), r(x) = 0, y \in G_k) + c \cdot P(r(x) = 1) - \sum_{k \in [K]} \mu_k \big( K \cdot P(r(x) = 0, y \in G_k) - \alpha_k \big)$,

and formulate an equivalent saddle-point optimization problem:

$\min_{\alpha, h, r} \max_{\mu \in \mathbb{R}^K} L(h, r, \alpha; \mu)$. (41)

For a fixed $\alpha$, we can solve the min-max problem over $h$, $r$ and $\mu$ by alternating between gradient-ascent updates on the multipliers $\mu$:

$\mu^{(t+1)}_k = \mu^{(t)}_k + \xi_\mu \big( \alpha_k - K \cdot P(r(x) = 0, y \in G_k) \big), \; k \in [K]$,

for step size $\xi_\mu > 0$, and a full minimization over $h$ and $r$. The procedure outlined in Algorithm 3 then solves the saddle-point problem in Eq. (41) by performing gradient descent on $\alpha$, combined with the gradient-ascent and full-minimization procedure described above over $h$, $r$ and $\mu$.

It is worth noting that this Lagrangian procedure is made possible by viewing Eq. (7) as an optimization problem over the space of realizable confusion statistics $\mathcal{C}$ (Narasimhan, 2018; Narasimhan et al., 2022b), which turns it into a continuous constrained optimization problem amenable to Lagrangian-style updates.
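A minimal NumPy sketch of these Algorithm 3 updates follows (ours, not the paper's code). The step sizes, iteration counts, initial $\alpha$ and lower bound $\kappa$ are illustrative assumptions.

```python
import numpy as np

def lagrangian_plugin(probs, labels, group_of, c, M=10, T=25,
                      xi_mu=0.1, xi_alpha=0.1, kappa=0.05):
    """Sketch of Algorithm 3 run on a sample S = (probs, labels).

    probs: (N, L) outputs of the pre-trained model p(x); labels: (N,) class
    labels; group_of: (L,) class-to-group map; c: rejection cost.
    """
    K = int(group_of.max()) + 1
    sample_group = group_of[labels]
    alpha = np.full(K, 1.0)                       # alpha^(0) in (0, K)^K
    for m in range(M):
        u = 1.0 / alpha[group_of]                 # u_y = 1/alpha_[y] (line 4)
        preds = (probs * u).argmax(axis=1)        # h^(m)
        mu = np.zeros(K)                          # mu^(m,0)          (line 5)
        for t in range(T):
            v = u - mu[group_of]                  # v_y = 1/alpha_[y] - mu_[y]
            reject = (probs * u).max(axis=1) < probs @ v - c   # r^(m,t+1)
            cover = np.array([np.mean(~reject & (sample_group == k))
                              for k in range(K)])
            mu += xi_mu * (alpha - K * cover)     # ascent on mu      (line 7)
        # empirical e_k = P(y != h(x), r(x) = 0, y in G_k)
        err = np.array([np.mean((preds != labels) & ~reject
                                & (sample_group == k)) for k in range(K)])
        alpha = alpha - xi_alpha * (mu - err / alpha**2)   # descent (line 11)
        alpha = np.clip(alpha, kappa, K - kappa)           # project (line 12)
    return preds, reject                          # (h^(M), r^(M,T)) on S
```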
Figure 4: Balanced L2R: Standard 0-1 error as a function of the proportion of rejections, on (a) CIFAR-100, (b) ImageNet and (c) iNaturalist, comparing Chow, CSS, Chow [BCE] and Plug-in [Balanced].

Figure 5: Worst-group L2R: Standard 0-1 error as a function of the proportion of rejections, on (a) CIFAR-100, (b) ImageNet and (c) iNaturalist, comparing Chow, CSS, Chow [DRO] and Plug-in [Worst].

Figure 6: Balanced L2R: Difference between tail and head errors as a function of the proportion of rejections, on (a) CIFAR-100, (b) ImageNet and (c) iNaturalist, comparing Chow, CSS, Chow [BCE] and Plug-in [Balanced].

Figure 7: Worst-group L2R: Difference between tail and head errors as a function of the proportion of rejections, on (a) CIFAR-100, (b) ImageNet and (c) iNaturalist, comparing Chow, CSS, Chow [DRO] and Plug-in [Worst].

F ADDITIONAL EXPERIMENTAL RESULTS

We discuss the data splits, hyper-parameter choices and further implementation details for our experiments. We also include additional experimental plots. All results are averaged over 5 trials. We report 95% confidence intervals in Table 2.

F.1 HYPER-PARAMETER CHOICES

We apply the same data pre-processing as Menon et al. (2021a). We trained all models using SGD with a momentum of 0.9 and a weight decay of $10^{-4}$. We summarise the hyper-parameter choices in Table 3 below.

Table 3: Dataset details and hyper-parameter choices.

Dataset          | #classes | n_train   | n_test | Tail prop. | Model           | Base LR | Schedule | Epochs | Batch size
CIFAR-100-LT     | 100      | 50,000    | 10,000 | 0.03       | CIFAR ResNet-32 | 0.4     | anneal   | 256    | 128
ImageNet-LT      | 1000     | 1,281,167 | 50,000 | 0.02       | ResNet-50       | 0.4     | cosine   | 200    | 512
iNaturalist 2018 | 8143     | 437,513   | 24,426 | 0.12       | ResNet-50       | 0.4     | cosine   | 200    | 1024

For CIFAR-100, we apply a warm-up with a linear learning rate for 15 steps until we reach the base learning rate, and apply a learning rate decay of 0.1 at the 96th, 192nd and 224th epochs. For ImageNet and iNaturalist, we apply a warm-up with a linear learning rate for 5 steps until we reach the base learning rate, and apply a learning rate decay of 0.1 at the 45th, 100th and 150th epochs.

F.2 DATASET SPLITS

The train, test and validation samples all follow the same long-tailed label distribution. Each dataset comes with an original test set with equal proportions of all labels. We re-weight the samples in the original test set to replicate the same label proportions as the training set. Furthermore, we hold out 20% of the original test set as a validation sample, and use the remainder as the test sample.
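One concrete way to realize the split described in F.2 is sketched below (ours; the paper describes re-weighting, and subsampling so that label proportions match the training distribution is one assumption-laden way to instantiate it).

```python
import numpy as np

def longtail_split(test_y, train_label_freq, val_frac=0.2, seed=0):
    """Subsample a balanced test set so that its label proportions match the
    long-tailed training distribution, then hold out a fraction as validation.

    test_y: (N,) labels of the original (balanced) test set;
    train_label_freq: (L,) empirical label frequencies from the training set.
    """
    rng = np.random.default_rng(seed)
    # keep each class y with probability proportional to its training frequency
    keep_prob = train_label_freq / train_label_freq.max()
    idx = []
    for y, p in enumerate(keep_prob):
        cls_idx = np.where(test_y == y)[0]
        n_keep = int(round(p * len(cls_idx)))
        idx.extend(rng.choice(cls_idx, size=n_keep, replace=False))
    idx = rng.permutation(np.array(idx))
    n_val = int(val_frac * len(idx))
    return idx[:n_val], idx[n_val:]    # (validation indices, test indices)
```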
F.3 IMPLEMENTATION DETAILS

For the proposed plug-in method for the balanced error (Algorithm 1), we set the number of iterations $M$ to 10, and tune the single parameter $\hat{\lambda}_0 = \hat{\mu}_0 - \hat{\mu}_1$ (see Appendix E.1) over three values $\{1, 6, 11\}$. For the proposed plug-in method for the worst-group error (Algorithm 2), we set the number of outer iterations $T$ to 25 and the step-size $\xi$ to 1; for the inner call to Algorithm 1, we set $M$ to 10 and tune the parameter $\hat{\lambda}_0$ over $\{1, 6, 11\}$.

We implement the cost-sensitive softmax (CSS) loss of Mozannar & Sontag (2020), described in Equation 3 of their paper. We set the cost vector $c(1), \ldots, c(L+1)$ in this loss to $c(i) = 1(y \neq i)$, $i \in [L]$, and $c(L+1)$ to the rejection cost $c$.

For Chow [Balanced], there exists a plethora of options in the long-tail literature for training the base model (see §2.2 for a brief list of representative approaches). We pick the logit-adjusted cross-entropy loss of Menon et al. (2021a) as a representative approach. For Chow [DRO], as prescribed in Jones et al. (2021), we use the group DRO approach proposed in Sagawa et al. (2020), but modify the inner weighted minimization step to use a logit-adjusted loss (Narasimhan & Menon, 2021; Wang et al., 2023). This modification was necessary to adapt the approach of Sagawa et al. (2020) to multi-class classification with a large number of classes. Their method requires tuning a step-size for DRO and a regularization parameter $C$. We pick these parameters from the sets $\{0.0005, 0.001\}$ and $\{1.0, 2.0\}$ respectively, by minimizing the worst-group error on the validation sample.

F.4 TRADE-OFF PLOTS: FURTHER DETAILS

To generate the trade-off plots in Figures 3-7, we train rejectors with different rejection rates for each method. For this, one may vary the cost parameter $c$, or equivalently, vary the rejection threshold in the case of Chow's rule, and the threshold $\hat{\tau}$ in the proposed plug-in method (see Appendix E.1). For a target rejection rate, this can be done with a simple percentile computation. For the CSS baseline, we re-train a model for each value of $c$ in $\{0.0, 0.1, 0.5, 0.75, 0.85, 0.91, 0.95, 0.97, 0.99\}$. Unfortunately, we find that with the ImageNet and iNaturalist datasets, the resulting models do not exceed rejection rates of 40% and 75% respectively, despite our using the entire range of $c \in [0, 1]$.

F.5 ADDITIONAL PLOTS

We present additional plots of the standard 0-1 error as a function of the rejection rate for different methods in Figures 4-5. We also present plots of the difference between the tail and head errors as a function of the rejection rate in Figures 6-7.

F.6 COMPARISON TO EXHAUSTIVE SEARCH OVER $\alpha$

We repeated the CIFAR-100 experiments in §6, choosing the multiplier parameter from a finer grid of values: $\{0.25, 0.5, 0.75, \ldots, 11\}$. We additionally replaced our heuristic procedure in §4.2 with an exhaustive grid search over the group prior $\alpha$, choosing the head-to-tail ratio from the range $\{0, 0.01, 0.02, \ldots, 1.0\}$ (see the sketch after the table below). We pick the parameter combination that yields the lowest balanced error on the held-out validation sample.

We show below the balanced error on the test and validation samples. Although the grid search provides gains on the validation set, on the test set, our heuristic approach to picking the multipliers and group priors yields metrics that are only slightly worse than grid search. Moreover, the differences are within standard errors.

                                  Test              Validation
Chow                              0.509 ± 0.002     0.498 ± 0.011
Plug-in [Balanced]                0.291 ± 0.008     0.282 ± 0.007
Plug-in [Balanced, grid search]   0.284 ± 0.007     0.264 ± 0.008
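A sketch of this exhaustive search follows (ours). The helper `evaluate_balanced_error` is hypothetical: it stands in for building the plug-in classifier-rejector for a given $(\alpha, \hat{\lambda}_0)$ at the target rejection rate and returning its validation balanced error; the parameterization of the head-to-tail ratio via $\alpha = (\text{ratio}, 1)$ is also an assumption.

```python
import numpy as np
from itertools import product

def exhaustive_search(probs_val, labels_val, group_of, target_rate,
                      evaluate_balanced_error):
    """Grid search over the multiplier and the head-to-tail ratio (App. F.6),
    picking the pair with the lowest validation balanced error."""
    lambda_grid = np.arange(0.25, 11.01, 0.25)   # {0.25, 0.5, ..., 11}
    ratio_grid = np.arange(0.0, 1.001, 0.01)     # {0, 0.01, ..., 1.0}
    best = (np.inf, None, None)
    for lam, ratio in product(lambda_grid, ratio_grid):
        alpha = np.array([max(ratio, 1e-3), 1.0])  # alpha_head : alpha_tail
        err = evaluate_balanced_error(probs_val, labels_val, group_of,
                                      alpha, lam, target_rate)
        if err < best[0]:
            best = (err, float(lam), float(ratio))
    return best   # (validation balanced error, lambda_0, head-to-tail ratio)
```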