# Training Over-parameterized Models with Non-decomposable Objectives

Harikrishna Narasimhan, Google Research, Mountain View. hnarasimhan@google.com
Aditya Krishna Menon, Google Research, New York. adityakmenon@google.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Many modern machine learning applications come with complex and nuanced design goals such as minimizing the worst-case error, satisfying a given precision or recall target, or enforcing group-fairness constraints. Popular techniques for optimizing such non-decomposable objectives reduce the problem into a sequence of cost-sensitive learning tasks, each of which is then solved by re-weighting the training loss with example-specific costs. We point out that the standard approach of re-weighting the loss to incorporate label costs can produce unsatisfactory results when used to train over-parameterized models. As a remedy, we propose new cost-sensitive losses that extend the classical idea of logit adjustment to handle more general cost matrices. Our losses are calibrated, and can be further improved with distilled labels from a teacher model. Through experiments on benchmark image datasets, we showcase the effectiveness of our approach in training ResNet models with common robust and constrained optimization objectives.

## 1 Introduction

The misclassification error is the canonical performance measure in most treatments of classification problems [92]. While appealing in its simplicity, practical machine learning applications often come with more complex and nuanced design goals. For example, these may include minimizing the worst-case error across all classes [64], satisfying a given precision or recall target [106], enforcing minimal coverage for minority classes [31], or imposing group-fairness constraints [105]. Unlike the misclassification error, such objectives are non-decomposable, i.e., they cannot be expressed as a sum of losses over individual samples. This poses a non-trivial optimization challenge, which is typically addressed by reducing the problem into a sequence of cost-sensitive learning tasks [12, 25, 4, 16, 71]. Such reduction-based approaches have been successfully employed in many open-source libraries [3, 2, 1] to provide drop-in replacements for standard loss functions.

In this paper, we point out that the standard approach of re-weighting the training loss to incorporate label costs can produce unsatisfactory results when used with high-capacity models, which are often trained to memorize the training data. Such over-parameterized models are frequently encountered in the use of modern neural networks, and have been the subject of considerable recent study [107, 73, 66]. As a remedy, we provide new calibrated losses for cost-sensitive learning that are better equipped at training over-parameterized models to optimize non-decomposable metrics, and demonstrate their effectiveness on benchmark image classification tasks. Our main contributions are as follows:

(i) we illustrate the pitfalls of using loss re-weighting in over-parameterized settings, particularly with diagonal class weights (Section 3);

(ii) we propose new logit-adjusted losses for cost-sensitive learning for both diagonal and non-diagonal gain matrices, and show that they are calibrated (Section 4);

(iii) we demonstrate that our losses provide significant gains over loss re-weighting in training ResNet models to optimize worst-case recall and to enforce coverage constraints (Section 5);
(iv) we show that the proposed approach compares favorably to post-hoc correction strategies, and can be further improved by distilling scores from a teacher model (Section 6).

### 1.1 Related Work

We cover prior work on class-imbalanced learning, cost-sensitive learning, and complex metrics.

**Learning under class imbalance.** A common setting where metrics beyond the misclassification error have been studied is the problem of class imbalance [49, 11, 33]. Here, the label distribution P(y) is skewed, and one seeks to ensure that performance on rare classes does not overly suffer. To this end, common metrics include the average of per-class recalls (also known as the balanced accuracy) [7, 62], the quadratic mean of per-class accuracies [51], and the F-score [54]. A panoply of different techniques have been explored for this problem, spurred in part by recent interest in the specific context of neural networks (where the problem is termed long-tail learning) [91, 9], with the focus largely on optimizing balanced accuracy. An exhaustive survey is beyond the scope of this paper (see, e.g., [33, 39]), but we may roughly identify three main strands of work: data modification [48, 11, 93, 103, 108], loss modification [107, 18, 10, 88, 37, 79, 97, 63, 21, 45, 94], and prediction modification [28, 43, 76, 60, 13, 40, 110]. A related recent thread seeks to improve performance on rare subgroups [82, 83, 86], as a means to ensure model fairness [24].

Amongst loss modification techniques, two strategies are of particular relevance. The first is re-weighting techniques [100, 65, 18]. As has been noted [55, 96, 61] (and as we shall subsequently verify), these may have limited effect on training samples that are perfectly separable under a given model class (as is the case with over-parameterized models). The second is margin techniques, which enforce a class-dependent margin in the loss [10, 88, 79, 63, 94]. This seeks to ensure that rare classes are separated with greater confidence, to account for the higher uncertainty in their decision boundary.

**Cost-sensitive learning.** Cost-sensitive learning is a classical extension of standard multiclass classification, wherein errors on different classes incur different costs [26, 23, 85]. Strategies for this problem largely mirror those for class imbalance, which can be seen as imposing a particular set of costs dependent on the label frequencies [58]. In particular, loss modification techniques based on asymmetric weights [57, 98, 104, 5, 19, 22, 85, 112] and margins [61, 44, 36, 45] have been explored, with the latter proving useful on separable problems. Amongst these, Lin et al. [57] handle a general cost matrix for multiclass problems, but require the class scores to sum to 0, a constraint that might be difficult to impose with neural networks. Khan et al. [44] propose a multiclass loss similar to the logit adjustment idea used in this paper, but only handle a specific type of cost matrix (e.g., in their setup, the default cost matrix is a matrix of all 1s). The standard multiclass loss of Crammer and Singer [17] can also be extended to handle a general cost matrix [90], but unfortunately is not calibrated [78].

**Complex metrics.** There has been much work on extending the class imbalance literature to handle more complex metrics and to impose data-dependent constraints.
These methods can again be divided into loss modifications [38, 74, 41, 69, 42, 31, 25, 4, 16, 71, 27, 59, 50] and post-hoc prediction modifications [102, 68, 46, 70, 72, 32, 20, 67, 101, 89], and differ in how they decompose the problem into simpler cost-sensitive formulations (see [71] for a unified treatment of common reduction strategies). Most of these papers experiment with simple models, with the exception of Sanyal et al. [84], who use re-weighting strategies to train shallow DNNs, Kumar et al. [50], who train CNNs on binary-labeled problems, and Eban et al. [25], who train Inception Nets to optimize an AUC-based ranking metric. There have also been other attempts at training neural networks with ranking metrics [87, 35, 8, 77]. In contrast, our focus is on handling a general family of evaluation metrics popular in multiclass classification problems.

## 2 Non-decomposable Objectives

**Notations.** Let $[m] = \{1, \ldots, m\}$. Let $\Delta_m$ be the $(m-1)$-dimensional simplex with $m$ coordinates. Let $\mathbf{1}_m \in \mathbb{R}^m$ denote an all-1s vector of size $m$, and $\mathrm{diag}(u_1, \ldots, u_m)$ denote an $m \times m$ diagonal matrix with diagonal elements $u_1, \ldots, u_m$. For vectors $a, b \in \mathbb{R}^m$, $a/b$ denotes element-wise division. Let $\mathrm{softmax}_i(s) = \frac{\exp(s_i)}{\sum_{j=1}^m \exp(s_j)}$ denote a softmax transformation of scores $s \in \mathbb{R}^m$.

Consider a multiclass problem with an instance space $\mathcal{X} \subseteq \mathbb{R}^d$ and a label space $\mathcal{Y} = [m]$. Let $D$ denote the underlying data distribution over $\mathcal{X} \times [m]$, $D_{\mathcal{X}}$ denote the marginal distribution over the instances $\mathcal{X}$, and $\pi_i = \mathbb{P}(y = i)$ denote the class priors. We will use $p_i(x) = \mathbb{P}(y = i \mid x)$ to denote the conditional-class probability for instance $x$. Our goal is to learn a classifier $h : \mathcal{X} \to [m]$, and we will measure its performance in terms of its confusion matrix $C[h]$, where
$$C_{ij}[h] = \mathbb{E}_{(x,y)\sim D}\left[\mathbf{1}(y = i, h(x) = j)\right].$$

**Complex learning problems.** In the standard setup, one is often interested in maximizing the overall classification accuracy $\mathrm{acc}[h] = \sum_i C_{ii}[h]$. However, in many practical settings, one may care about other metrics such as the recall for class $i$, $\mathrm{rec}_i[h] = \frac{C_{ii}[h]}{\pi_i}$, the precision on class $i$, $\mathrm{prec}_i[h] = \frac{C_{ii}[h]}{\sum_j C_{ji}[h]}$, or the proportion of predictions made on class $i$, i.e. the coverage, $\mathrm{cov}_i[h] = \sum_j C_{ji}[h]$. Below are common examples of real-world design goals based on these metrics.

**Example 1** (Maximizing worst-case recall). In class-imbalanced settings, where the classifier tends to generalize poorly on the rare classes, we may wish to change the training objective to directly maximize the minimum recall across all classes [12]:
$$\max_h \; \min_{i \in [m]} \; \mathrm{rec}_i[h]. \tag{1}$$

**Example 2** (Constraining per-class coverage). Another consequence of label imbalance could be that the proportion of predictions that the classifier makes for the tail classes is lower than the actual prevalence of that class. The following learning problem seeks to maximize the average recall while explicitly constraining the classifier's coverage for class $j$ to be at least 95% of its prior $\pi_j$ [16, 71]:
$$\max_h \; \frac{1}{m}\sum_{i \in [m]} \mathrm{rec}_i[h] \quad \text{s.t.} \quad \sum_i C_{ij}[h] \geq 0.95\,\pi_j, \;\; \forall j \in [m]. \tag{2}$$
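Since the metrics above are all simple functionals of the confusion matrix, they are straightforward to estimate from a sample of predictions. The following minimal NumPy sketch is our own illustration (not code from the paper's release); the array names and toy data are made up.

```python
import numpy as np

def empirical_confusion(y_true, y_pred, m):
    """C[i, j] = fraction of examples with true class i predicted as class j."""
    C = np.zeros((m, m))
    np.add.at(C, (y_true, y_pred), 1.0)
    return C / len(y_true)

def metrics_from_confusion(C, priors):
    """Per-class recall, precision, and coverage derived from a confusion matrix."""
    recall = np.diag(C) / priors                                 # rec_i = C_ii / pi_i
    precision = np.diag(C) / np.maximum(C.sum(axis=0), 1e-12)    # prec_i = C_ii / sum_j C_ji
    coverage = C.sum(axis=0)                                     # cov_i = sum_j C_ji
    return recall, precision, coverage

# Toy usage: worst-case recall (Example 1) and the coverage constraints of (2).
m = 3
y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 0])
priors = np.bincount(y_true, minlength=m) / len(y_true)
C = empirical_confusion(y_true, y_pred, m)
rec, prec, cov = metrics_from_confusion(C, priors)
print("min recall:", rec.min(), "coverage constraints met:", np.all(cov >= 0.95 * priors))
```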
Other examples include constraints on recall and precision (see Table 1), the F-measure, and the area under the ROC and PR curves, as well as fairness constraints like equal opportunity, which are computed on group-specific versions of the confusion matrix [25, 71, 15, 32, 4].

**Reduction to cost-sensitive learning.** All the problems described above are non-decomposable, in the sense that they cannot be written as the minimization of a sum of errors on individual examples, and hence cannot be tackled using common off-the-shelf learning algorithms. The dominant approach for solving such problems is to formulate a sequence of cost-sensitive objectives, which do take the form of an average of training errors [70, 12, 4, 16]. For example, the robust learning problem in (1) can be equivalently re-written as the following saddle-point optimization problem:
$$\max_h \; \min_{\lambda \in \Delta_m} \; \sum_{i \in [m]} \lambda_i\, \mathrm{rec}_i[h] \;=\; \max_h \; \min_{\lambda \in \Delta_m} \; \sum_{i \in [m]} \frac{\lambda_i}{\pi_i}\, C_{ii}[h], \tag{3}$$
where $\lambda_i$ is a multiplier for class $i$. One can then find a saddle-point for (3) by jointly minimizing the weighted objective over $\lambda \in \Delta_m$ (using e.g. exponentiated-gradient updates) and maximizing the objective over $h$. Notice that for a fixed $\lambda$, the optimization over $h$ is a cost-sensitive (or a gain-weighted) learning problem:
$$\max_h \; \sum_{i, j \in [m]} G_{ij}\, C_{ij}[h], \tag{4}$$
where $G_{ii} = \frac{\lambda_i}{\pi_i}$ and $G_{ij} = 0, \forall i \ne j$, are the rewards associated with predicting class $j$ when the true class is $i$. We will refer to $G \in \mathbb{R}^{m \times m}$ as the gain matrix, which in this case is diagonal. In practice, the cost-sensitive learning problem in (4) is usually replaced with multiple steps of gradient descent, interleaved with the updates on $\lambda$. Algorithm 1 provides an outline of this procedure.

**Algorithm 1** Reductions-based algorithm for maximizing worst-case recall (1)
1. **Inputs:** Training set $S$, validation set $S^{\mathrm{val}}$, step-size $\omega \in \mathbb{R}_+$, class priors $\pi$, CS-loss $\ell^G$
2. **Initialize:** Classifier $h^0$, multipliers $\lambda^0 \in \Delta_m$
3. **For** $t = 0$ **to** $T - 1$:
   - $\lambda^{t+1}_i \propto \lambda^t_i \exp\!\big(-\omega\, \widehat{C}_{ii}[h^t]/\pi_i\big), \; \forall i$, where $\widehat{C}_{ij}[h] = \frac{1}{|S^{\mathrm{val}}|}\sum_{(x,y)\in S^{\mathrm{val}}} \mathbf{1}(y = i, h(x) = j)$
   - $G = \mathrm{diag}\big(\lambda^{t+1}_1/\pi_1, \ldots, \lambda^{t+1}_m/\pi_m\big)$
   - Cost-sensitive learning (CSL): $s^{t+1} \in \mathrm{argmin}_s \sum_{(x,y)\in S} \ell^G(y, s(x))$  // replaced by a few steps of SGD
   - $h^{t+1}(x) \in \mathrm{argmax}_{i \in [m]}\, s^{t+1}_i(x), \; \forall x$
4. **Return** $h^T$

Similar to (1), the constrained learning problem in (2) can be written as an equivalent (Lagrangian) min-max problem with Lagrange multipliers $\lambda \in \mathbb{R}^m_+$:
$$\max_h \; \min_{\lambda \in \mathbb{R}^m_+} \; \frac{1}{m}\sum_{i \in [m]} \mathrm{rec}_i[h] \;+\; \sum_{j \in [m]} \lambda_j \Big(\sum_i C_{ij}[h] - 0.95\,\pi_j\Big).$$
One can find a saddle-point for this problem by jointly minimizing the Lagrangian over $\lambda \in \mathbb{R}^m_+$ (using e.g. gradient updates) and maximizing it over $h$, with the maximization over $h$ taking the form of a cost-sensitive learning problem. In this case, the gain matrix $G$ is non-diagonal and is given by $G_{ii} = \frac{1}{m\pi_i} + \lambda_i, \forall i$, and $G_{ij} = \lambda_j, \forall i \ne j$. See Table 1 for the form of the gain matrix for other common constraints. In Appendix A, we provide an outline of this procedure (Algorithm 2), and also elaborate on how this reduction-based approach can be extended to optimize other common evaluation metrics such as the F-measure [25, 71].

| # | Problem | Gain matrix $G$ | Losses |
|---|---------|-----------------|--------|
| 1 | $\max_h \min_y \mathrm{rec}_y[h]$ | $\mathrm{diag}(\lambda/\pi)$ | LA |
| 2 | $\max_h \mathrm{acc}[h]$ s.t. $\mathrm{rec}_y[h] \geq \alpha, \forall y$ | $\mathrm{diag}(\mathbf{1}_m + \lambda/\pi)$ | LA |
| 3 | $\max_h \mathrm{acc}[h]$ s.t. $\mathrm{prec}_y[h] \geq \alpha, \forall y$ | $\mathrm{diag}(\mathbf{1}_m + \lambda) - \alpha\,\mathbf{1}_m\lambda^\top$ | hyb, SMS |
| 4 | $\max_h \frac{1}{m}\sum_y \mathrm{rec}_y[h]$ s.t. $\mathrm{cov}_y[h] \geq 0.95\,\pi_y, \forall y$ | $\mathrm{diag}(\mathbf{1}_m/\pi) + \mathbf{1}_m\lambda^\top$ | hyb, SMS |
| 5 | $\max_h \frac{1}{m}\sum_y \mathrm{rec}_y[h]$ s.t. $\mathrm{bcov}_y[h] \geq 0.95\cdot\frac{1}{m}, \forall y$ | $\mathrm{diag}(\mathbf{1}_m/\pi) + (\mathbf{1}_m/\pi)\lambda^\top$ | hyb |

Table 1: Examples of complex learning problems, the associated gain matrices $G$, and the proposed losses that are applicable to the setting. We use $\alpha$ to denote the minimum allowed recall/precision, and $\mathrm{bcov}_y[h] = \frac{1}{m}\sum_{i=1}^m \frac{1}{\pi_i} C_{iy}[h]$ is the balanced coverage metric we use in our experiments. Similar reductions to cost-sensitive learning can also be accomplished with other common evaluation metrics such as the F-measure, G-mean, AUC-ROC, AUC-PR [25, 71], and fairness criteria such as equal opportunity and equalized odds [4, 89]. See Appendix A for details.
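To make the reduction concrete, below is a schematic NumPy sketch of the outer loop in the spirit of Algorithm 1 (our own illustrative rendering, not the authors' released code). The callables `csl_step` and `val_recalls`, the round count, and the step size are placeholders standing in for a few SGD steps on a cost-sensitive loss and for a validation-set recall estimate.

```python
import numpy as np

def reduction_for_worst_case_recall(csl_step, val_recalls, priors,
                                    num_rounds=100, step_size=0.5):
    """Schematic outer loop for maximizing worst-case recall.

    csl_step(G)        -> model: a few SGD steps on a cost-sensitive loss with gain matrix G.
    val_recalls(model) -> array of per-class recalls estimated on the validation set.
    """
    m = len(priors)
    lam = np.ones(m) / m                      # multipliers on the simplex
    model = csl_step(np.diag(lam / priors))   # initial cost-sensitive step
    for _ in range(num_rounds):
        rec = val_recalls(model)
        lam = lam * np.exp(-step_size * rec)  # exponentiated-gradient step: shrink weight
        lam = lam / lam.sum()                 # on classes whose recall is already high
        G = np.diag(lam / priors)             # diagonal gain matrix G_ii = lam_i / pi_i
        model = csl_step(G)                   # re-solve the cost-sensitive problem
    return model
```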
**Cost-sensitive losses.** A standard approach for solving the cost-sensitive learning problem in (4) is to use a surrogate loss function $\ell : [m] \times \mathbb{R}^m \to \mathbb{R}_+$ that takes a label $y$ and an $m$-dimensional score $u \in \mathbb{R}^m$, and outputs a real value $\ell(y, u)$. One would then minimize the expected loss $L(s) = \mathbb{E}_{x,y}[\ell(y, s(x))]$ over a class of scoring functions $s : \mathcal{X} \to \mathbb{R}^m$ that map each instance to an $m$-dimensional score. The final classifier $h$ can then be obtained from the learned scoring function $s$ by taking an argmax of its predicted scores, i.e. by constructing $h(x) \in \mathrm{argmax}_{i \in [m]} s_i(x)$. In practice, we are provided a finite sample $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from $D$, and will seek to minimize the average loss on $S$, given by $\widehat{L}(s) = \frac{1}{|S|}\sum_{(x,y)\in S} \ell(y, s(x))$. We will also assume access to a held-out validation sample $S^{\mathrm{val}} = \{(x_1, y_1), \ldots, (x_{n_{\mathrm{val}}}, y_{n_{\mathrm{val}}})\}$.

One common sanity check for a good loss function is to confirm that it is classification calibrated (or Fisher consistent) for the learning problem of interest [6]. Loosely speaking, a surrogate loss is classification calibrated if the minimizers of the conditional risk $\mathbb{E}_{y|x}[\ell(y, s(x))]$ over all scoring functions $s : \mathcal{X} \to \mathbb{R}^m$ coincide with the Bayes-optimal prediction for $x$. For the cost-sensitive objective in (4), the Bayes-optimal classifier is given below [53].

**Proposition 1.** The optimal classifier for (4) for a general gain matrix $G \in \mathbb{R}^{m \times m}$ is of the form:
$$h^*(x) \in \mathrm{argmax}_{y \in [m]} \sum_{i=1}^m G_{iy}\, p_i(x) = \mathrm{argmax}_{y \in [m]} \big(G^\top p(x)\big)_y.$$

All proofs can be found in Appendix B.

[Figure 1: Training ResNet-56 on a subset of CIFAR-10 with an unweighted, re-weighted (cf. (5)) and logit-adjusted (cf. (7)) cross-entropy loss. All models achieve near-zero training error.]

## 3 The Perils of Over-parameterization

We now point out the problems in applying the algorithms described previously to models that are over-parameterized, i.e., which contain enough parameters to perfectly fit the labels in the training set. We will focus particularly on the loss used for the cost-sensitive learning steps in Algorithms 1 and 2.

**Limitations of loss re-weighting.** One of the most common loss functions for a diagonal gain matrix $G$ is a simple re-scaling of the standard cross-entropy loss with the diagonal weights:
$$\ell_{\mathrm{wt}}(y, s) = -G_{yy} \log\left(\frac{\exp(s_y)}{\sum_j \exp(s_j)}\right). \tag{5}$$
The following is a natural extension of this re-weighted loss to a general gain matrix $G$, used, for example, in constrained optimization libraries [3], and also for mitigating label noise [75]:
$$\ell_{\mathrm{wt}}(y, s) = -\sum_{i \in [m]} G_{y,i} \log\left(\frac{\exp(s_i)}{\sum_j \exp(s_j)}\right). \tag{6}$$
While this weighted loss is calibrated for $G$ (see, e.g., [75]), it is often inadequate when training an over-parameterized model on a finite sample. This is evident when $G$ is a diagonal matrix: such a model will usually memorize the training labels and achieve zero training loss for every class, irrespective of what we choose for the outer weighting. In such separable settings, it is unclear how the outer weighting will impact the model's out-of-sample performance on different classes.
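As a concrete point of reference for the comparisons below, here is a minimal NumPy sketch of the re-weighted losses (5) and (6); this is our own illustration of the formulas, not code from the paper, and the example gain matrix is hypothetical.

```python
import numpy as np

def log_softmax(s):
    s = s - s.max(axis=-1, keepdims=True)          # numerically stable log-softmax
    return s - np.log(np.exp(s).sum(axis=-1, keepdims=True))

def reweighted_loss_diag(s, y, G_diag):
    """Eq. (5): cross-entropy scaled by the diagonal gain G_yy of the true class."""
    return -G_diag[y] * log_softmax(s)[np.arange(len(y)), y]

def reweighted_loss_general(s, y, G):
    """Eq. (6): weighted sum of log-softmax terms, with weights from row y of G."""
    return -(G[y] * log_softmax(s)).sum(axis=-1)

# Example: a batch of 2 examples, 3 classes, and a diagonal gain matrix G.
s = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
y = np.array([0, 2])
G = np.diag([2.0, 1.0, 1.0])
print(reweighted_loss_diag(s, y, np.diag(G)), reweighted_loss_general(s, y, G))
```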
For clarity, we provide a simple illustration in Figure 1, where we use a re-weighted loss to train ResNet-56 models on 10000 images from CIFAR-10 with different diagonal gain matrices $G$. We assign a weight of $\omega$ to the first class "airplane", and a weight of 1 to all other classes (i.e. set $G_{1,1} = \omega$ and $G_{y,y} = 1, \forall y \ne 1$), and plot the normalized weighted accuracy $\sum_y G_{yy} C_{yy}[h] / \sum_y G_{yy}$ on the test set as we vary $\omega$. Compared to an unweighted cross-entropy loss, the re-weighted loss does not produce a significant change to the test metric, whereas the logit-adjusted loss (which we will discuss in Section 4) yields substantially better values.

Probing closer into the test accuracy for each class, we see that the re-weighted loss yields slightly better accuracies for class 1, but is significantly worse off on the other classes, suggesting that re-weighting has the effect of excessively focusing on the class with the higher weight at the cost of the other classes. Sagawa et al. [83] make a similar observation when using importance weights to improve worst-group generalization, while Cui et al. [18] and Cao et al. [10] observe that up-weighting the minority classes can lead to unstable optimization.

When $G$ is not diagonal, the minimum training loss may not be zero. Nonetheless, the following proposition sheds some light on the scores learned by a model that achieves minimum training loss.

**Proposition 2.** Suppose the instances $x$ in $S$ are all unique. Let $\widehat{L}_{\mathrm{wt}}(s) = \frac{1}{|S|}\sum_{(x,y)\in S} \ell_{\mathrm{wt}}(y, s(x))$ denote the average loss on training sample $S$ with $\ell_{\mathrm{wt}}$ defined as in (6) for a gain matrix $G$. Then a scoring function $\widehat{s}$ that achieves the minimum value of $\widehat{L}_{\mathrm{wt}}(s)$ over all $s : \mathcal{X} \to \mathbb{R}^m$ is of the form:
$$\mathrm{softmax}_i(\widehat{s}(x)) = \frac{G_{y,i}}{\sum_j G_{y,j}}, \quad \forall (x, y) \in S.$$

Notice that the model output is invariant to re-scaling of the rows in $G$, i.e., one can multiply each row of the gain matrix $G$ by a different scalar, and the values memorized by the model for each training example will remain unchanged. On the other hand, re-scaling a column of $G$ does substantially change the score learned for each example. While this does not tell us how the model would behave on unseen new examples, our experience has been that loss re-weighting is usually effective when used to control the amount of out-of-sample predictions that a model makes for a certain class (by suitably scaling the column for that class in $G$), but does not work well when used to emphasize greater accuracy on a particular class. Consequently, we find that loss re-weighting usually fares better with metrics like coverage, which depend only on the model predictions and not on the true labels, than with metrics like recall, which depend on both the predictions and the labels.
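A quick numerical illustration of Proposition 2 (ours, not from the paper): minimizing the weighted loss (6) at a single training point drives the softmax of the scores to the corresponding row of $G$ normalized to sum to one, and re-scaling that row leaves the memorized scores unchanged. The row values below are made up.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def fit_scores_to_weighted_loss(row, steps=2000, lr=1.0):
    """Gradient descent on the per-example weighted loss (6), -sum_i row_i log softmax_i(s),
    for a single training point whose gain-matrix row is `row`."""
    s = np.zeros(len(row))
    for _ in range(steps):
        grad = row.sum() * softmax(s) - row   # gradient of the weighted cross-entropy
        s -= (lr / row.sum()) * grad          # scale-invariant step size
    return softmax(s)

row = np.array([2.0, 0.5, 0.5])               # hypothetical gain-matrix row for label y
q1 = fit_scores_to_weighted_loss(row)
q2 = fit_scores_to_weighted_loss(10.0 * row)  # re-scale the row by 10
print(np.round(q1, 3))                        # ~ [0.667, 0.167, 0.167] = row / row.sum()
print(np.allclose(q1, q2))                    # True: row re-scaling leaves the scores unchanged
```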
**Existing remedies for better generalization.** The literature offers some remedies to improve model generalization with non-standard objectives. Sagawa et al. [83] improve performance on rare subgroups by regularizing the losses based on the size of each group, but their solution applies to a specific learning problem. Other remedies such as smarter re-weighting strategies [56, 18] or deferred re-weighting schedules [10] have proven to work well in specialized imbalanced settings. The work that most closely relates to our setting is Cotter et al. [14], who propose a simple modification to the algorithms discussed in Section 2, with the use of two datasets: a training sample $S$, and a held-out validation sample $S^{\mathrm{val}}$. They suggest using the validation sample to perform updates on the multipliers and in turn the gain matrix $G$, and the training sample to solve the resulting cost-sensitive learning problem in (4). The latter would typically involve optimizing the empirical risk $\frac{1}{|S|}\sum_{(x,y)\in S}\ell^G(y, s(x))$ on the training set, for some cost-sensitive loss $\ell^G$. The intuition here is that, even when a model achieves very low training error, the estimate of $G$ will accurately reflect the model's performance on held-out examples. While this modification improves our estimate of $G$, it only provides a partial solution for over-parameterized settings, where the model would still struggle to generalize well if $\ell^G$ happens to be the simple re-weighted loss in (6).

## 4 Cost-sensitive Losses Based on Logit Adjustment

We present new cost-sensitive losses that seek to avoid the problems mentioned above. We build on recent work on logit adjustment, shown previously to be effective in long-tail settings [79, 63, 94].

**Diagonal gain matrix.** When the gain matrix $G$ is diagonal, Proposition 1 tells us that the Bayes-optimal classifier for the resulting weighted accuracy metric is of the form $h^*(x) \in \mathrm{argmax}_{y\in[m]} G_{yy}\, p_y(x)$, where $p_y(x) = \mathbb{P}(y \mid x)$. Intuitively, we would like the learned scoring function $s(x)$ to mimic the Bayes-optimal scores $G_{yy}\, p_y(x)$ for each $x$. In particular, we would like the maximizer of $s_y(x)$ over labels $y$ to match the Bayes-optimal label for $x$. One way to facilitate this is to adjust the scores $s_y(x)$ based on the diagonal weights $G_{yy}$, and to compute a loss on the adjusted scores. Specifically, we shift $s_y(x)$ to $\bar{s}_y(x) = s_y(x) - \log(G_{yy})$, and optimize the shifted scores so that their softmax transformation matches the class probabilities $p(x)$. This would then encourage the original scorer $s$ to be monotonic in the Bayes-optimal scores:
$$\frac{\exp(\bar{s}_y(x))}{\sum_j \exp(\bar{s}_j(x))} = p_y(x) \;\Longleftrightarrow\; \bar{s}_y(x) \propto \ln(p_y(x)) \;\Longleftrightarrow\; s_y(x) \propto \ln(G_{yy}\, p_y(x)).$$
In practice, $p(x)$ is not directly available to us, and so we employ the following logit-adjusted cross-entropy loss that implements this idea with labels $y$ drawn according to $p(x)$:
$$\ell_{\mathrm{LA}}(y, s) = -\log\left(\frac{\exp(s_y - \log(G_{yy}))}{\sum_j \exp(s_j - \log(G_{jj}))}\right). \tag{7}$$
This loss is a simple generalization of the one analyzed in Ren et al. [79], Menon et al. [63], and Wang et al. [94], in which the logits are adjusted based on the class priors to optimize the balanced error rate. Such approaches have historical precedent in the class imbalance literature [76, 111, 13].

**Proposition 3.** The logit-adjusted loss $\ell_{\mathrm{LA}}$ is calibrated for a diagonal gain matrix $G$.

Unlike the weighted loss in (5), changing the diagonal entries of $G$ in (7) changes the operating point of the learned scoring function. In separable settings, this means that the diagonal weights determine the operating point at which the model achieves zero training error, and tend to push the separator towards classes that have higher weights, thus helping improve model generalization. An alternate explanation posed by Menon et al. [63] re-writes the loss in a pairwise form:
$$\ell_{\mathrm{LA}}(y, s) = \log\left(1 + \sum_{j \ne y} e^{\delta_{yj}}\, e^{-(s_y - s_j)}\right), \tag{8}$$
where $\delta_{yj} = \log(G_{yy}/G_{jj})$ can be seen as the relative margin between class $y$ and class $j$. This tells us that the loss encourages a larger separation between classes that have different weights. In Appendix C, we elaborate on this margin interpretation and point out that this loss can in fact be seen as a soft approximation to the more traditional margin-based loss of Crammer and Singer [17].

**General gain matrix.** When the gain matrix $G$ is non-diagonal, a simple logit adjustment no longer works. In this case, we propose a hybrid approach that combines logit adjustment with an outer weighting. To this end, we prescribe factorizing the gain matrix into a product $G = MD$ for some diagonal matrix $D \in \mathbb{R}^{m \times m}$, and $M = GD^{-1}$. The proposed loss then adjusts the logits to account for the diagonal entries of $D$ and applies an outer weighting to account for $M$:
$$\ell_{\mathrm{hyb}}(y, s) = -\sum_{i \in [m]} M_{y,i} \log\left(\frac{\exp(s_i - \log(D_{ii}))}{\sum_j \exp(s_j - \log(D_{jj}))}\right). \tag{9}$$
In practice, $D$ can be chosen to reflect the relative importance of the classes and include information such as the class priors, which cannot be effectively incorporated as part of the outer weighting. One simple choice could be $D = \mathrm{diag}(1/\pi_1, \ldots, 1/\pi_m)$. Another intuitive choice could be $D = \mathrm{diag}(G_{11}, \ldots, G_{mm})$, so that the residual matrix $M = GD^{-1}$ that serves as the outer weighting has 1s in all its diagonal entries, and thus equal weights on the diagonal loss terms.
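To make these constructions concrete, the sketch below implements (7) and (9) in NumPy on a batch of logits. It is our own illustration of the formulas (separate from the paper's released code), and the example priors, multipliers, and gain matrix are hypothetical.

```python
import numpy as np

def log_softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    return s - np.log(np.exp(s).sum(axis=-1, keepdims=True))

def logit_adjusted_loss(s, y, G_diag):
    """Eq. (7): cross-entropy on logits shifted by -log(G_yy), for a diagonal gain matrix."""
    adjusted = s - np.log(G_diag)                  # shift every logit by -log(G_ii)
    return -log_softmax(adjusted)[np.arange(len(y)), y]

def hybrid_loss(s, y, G, D_diag):
    """Eq. (9): logit adjustment by D plus an outer weighting by M = G D^{-1}."""
    M = G / D_diag                                 # right-multiplying by D^{-1} scales columns
    adjusted_log_probs = log_softmax(s - np.log(D_diag))
    return -(M[y] * adjusted_log_probs).sum(axis=-1)

# Hypothetical example: 3 classes, non-diagonal gain matrix from a coverage constraint.
priors = np.array([0.7, 0.2, 0.1])
lam = np.array([0.0, 0.1, 0.3])
G = np.diag(1.0 / priors) + np.outer(np.ones(3), lam)   # G = diag(1/pi) + 1 lambda^T
s = np.array([[1.0, 0.2, -0.5]]); y = np.array([1])
print(hybrid_loss(s, y, G, np.diag(G)))                 # variant "B": D = diag(G_11, ..., G_mm)
print(logit_adjusted_loss(s, y, 1.0 / priors))          # diagonal case, e.g. balanced error
```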
**Proposition 4.** For any diagonal matrix $D \in \mathbb{R}^{m \times m}$ with $D_{yy} > 0, \forall y$, and $M = GD^{-1}$, the hybrid loss $\ell_{\mathrm{hyb}}$ is calibrated for $G$.

In some special cases, we may be able to avoid the outer weighting: e.g., when the gain matrix is the sum of a diagonal matrix and a matrix with equal rows, i.e. $G = \mathrm{diag}(\alpha) + \mathbf{1}\beta^\top$ (cf. Table 1, rows 3 & 4), the Bayes-optimal classifier from Proposition 1 is of the form $h^*(x) = \mathrm{argmax}_{y\in[m]} \alpha_y\, p_y(x) + \beta_y$. The additive terms $\beta_y$ in the optimal classifier can then be encoded in the loss as a shift to the softmax output, giving us the following softmax-shifted loss: for a constant $C > 0$,
$$\ell_{\mathrm{SMS}}(y, s) = -\log\left(C\, \frac{\exp(s_y - \log(\alpha_y))}{\sum_j \exp(s_j - \log(\alpha_j))} \;-\; \frac{\beta_y}{\alpha_y}\right). \tag{10}$$

**Proposition 5.** $\ell_{\mathrm{SMS}}$ is calibrated for the gain matrix $G = \mathrm{diag}(\alpha) + \mathbf{1}\beta^\top$ when $C = 1 + \sum_j \beta_j/\alpha_j$.

See Appendix D for a practical variant of this loss that avoids a negative value within the log.

## 5 Experimental Comparison with Loss Re-weighting

We now showcase that the proposed losses provide substantial gains over loss re-weighting through experiments on three benchmark datasets: CIFAR-10, CIFAR-100 [47], and Tiny ImageNet [52, 80] (a subset of the ImageNet dataset with 200 classes). Similar to [18, 10, 63], we use long-tail versions of these datasets by downsampling examples with an exponential decay in the per-class sizes. We set the imbalance ratio $\frac{\max_i \mathbb{P}(y=i)}{\min_i \mathbb{P}(y=i)}$ to 100 for CIFAR-10 and CIFAR-100, and to 83 for Tiny ImageNet (the slightly smaller ratio is to ensure that the smallest class is of a reasonable size). In each case, we use a balanced validation sample of 5000 held-out images, and a balanced test set of the same size. We trained ResNet-56 models on CIFAR-10 and CIFAR-100, and ResNet-18 models on Tiny ImageNet, using SGD with momentum. We provide details about our hyper-parameter choices in Appendix E. On CIFAR-10 and CIFAR-100, the ResNet-56 architectures have sufficient parameters to perfectly fit the training labels, as evident from the training set metrics reported in Appendix E.1.¹

¹ Code will be made available at: https://github.com/google-research/google-research/tree/master/non_decomp

**Cost-sensitive learning and baselines.** We consider two learning objectives: maximizing worst-case recall, and maximizing average recall subject to coverage constraints. As outlined in Algorithms 1 and 2, we employ the two-dataset approach of Cotter et al. [14] (discussed in Section 3) to solve these problems, with the validation set used for updates on the gain matrix $G$, and the training set used for the cost-sensitive learning (CSL) step. Since our goal here is to demonstrate that the proposed losses fare better at solving the CSL step than loss re-weighting, we use a large validation sample to get good estimates of the gain matrix $G$. For the CIFAR datasets, we perform 32 SGD steps on the cost-sensitive loss for every update on $G$, and for Tiny ImageNet, we perform 100 SGD steps for every update on $G$.
In each case, we compare the results with empirical risk minimization (ERM) with the standard cross-entropy loss, and, as representative methods that seek to maximize the average per-class recall (i.e. balanced accuracy), we include the balanced logit-adjusted loss of Menon et al. [63] in which the adjustments are based on the class priors alone (Balanced), the equalized loss of Tan et al. [88] (Equalized), and the adaptive margin loss of Cao et al. [10] (Adaptive).

**Maximizing worst-case recall.** In our first task, we maximize the worst-case recall over all 10 classes on CIFAR-10 (cf. (1)). Given the larger number of classes, this can be a very restrictive goal for CIFAR-100 and Tiny ImageNet, leading to poor overall performance. So for these datasets, we consider a simpler goal, where we measure the average recall on the bottom 10% of the smallest-sized classes (the tail recall), and the average recall on the top 90% of the classes (the head recall), and seek to maximize the minimum of the head and tail recalls:
$$\max_h \; \min\left\{\frac{1}{|H|}\sum_{y \in H} \mathrm{rec}_y[h], \;\; \frac{1}{|T|}\sum_{y \in T} \mathrm{rec}_y[h]\right\},$$
where $H \subset [m]$ and $T \subset [m]$ denote the set of head and tail labels respectively. The gain matrix $G$ for these problems is diagonal.

| Method | CIFAR-10-LT Avg Rec | CIFAR-10-LT Min Rec | CIFAR-100-LT Avg Rec | CIFAR-100-LT Min HT Rec | TinyImgNet-LT Avg Rec | TinyImgNet-LT Min HT Rec |
|---|---|---|---|---|---|---|
| ERM | 0.750 | 0.563 | 0.441 | 0.082 | 0.321 | 0.009 |
| Balanced [63] | 0.790 | 0.664 | 0.478 | 0.210 | 0.353 | 0.086 |
| Equalized [88] | 0.754 | 0.541 | 0.461 | 0.147 | 0.325 | 0.033 |
| Adaptive [10] | 0.756 | 0.584 | 0.450 | 0.116 | 0.323 | 0.005 |
| CSL [Re-weighted] | 0.739 | 0.619 | 0.424 | 0.089 | 0.300 | 0.094 |
| CSL [Logit-adjusted] | 0.786 | 0.731 | 0.462 | 0.407 | 0.335 | 0.295 |

Table 2: Results of maximizing worst-case recall on CIFAR-10 and the minimum of the head and tail recalls on CIFAR-100 and Tiny ImageNet. The last two rows contain results from Algorithm 1 with different losses in the CSL step. All metrics are evaluated on the test set and averaged over 5 independent trials. The highest entry in each column is shaded. The proposed logit-adjusted loss yields significantly better worst-case recall than all the baselines.

In Table 2, we present results of maximizing the worst-case recall metrics using Algorithm 1, with both the re-weighted loss in (5) and the proposed logit-adjusted loss for a diagonal $G$ in (7) used in the CSL step. Compared to loss re-weighting, the prescribed loss provides huge gains in the worst-case recalls. In fact, on the CIFAR datasets, loss re-weighting fares significantly worse than the simple logit adjustment with class priors prescribed by Menon et al. [63], which performs the best on the average recall metric that it seeks to maximize.
Our approach improves substantially over this baseline on the worst-case recall, at the cost of a moderate decrease in the average recall.

**Constraining coverage.** The next task we consider for CIFAR-10 seeks to ensure that, when evaluated on a balanced dataset, the model makes the same proportion of predictions for each class. This leads us to the optimization problem shown in row 5 of Table 1, where we wish to maximize the average recall while constraining the model's balanced coverage on each class, $\mathrm{bcov}_y[h] = \frac{1}{m}\sum_{i=1}^m \frac{1}{\pi_i} C_{iy}[h]$, to be at least 95% of $\frac{1}{m}$, where $m$ is the number of classes. Because the validation and test samples are already balanced, the model's coverage on these datasets is the same as its balanced coverage. For CIFAR-100 and Tiny ImageNet, we consider the simpler goal of maximizing the average recall over all classes, with constraints requiring the model's average coverage over the head labels and its average coverage over the tail labels to both be at least 95% of $\frac{1}{m}$:
$$\max_h \; \frac{1}{m}\sum_{y \in [m]} \mathrm{rec}_y[h] \quad \text{s.t.} \quad \frac{1}{|H|}\sum_{y \in H} \mathrm{bcov}_y[h] \geq 0.95\cdot\frac{1}{m}, \qquad \frac{1}{|T|}\sum_{y \in T} \mathrm{bcov}_y[h] \geq 0.95\cdot\frac{1}{m}.$$
The gain matrix $G$ for these problems (as shown in Table 1) is non-diagonal, and does not have a special structure that we can exploit. We will therefore use the hybrid loss function that we provide in (9) for a general $G$, and try out both variants suggested: with the diagonal matrix $D = \mathrm{diag}(1/\pi_1, \ldots, 1/\pi_m)$ (variant "A"), and with $D = \mathrm{diag}(G_{11}, \ldots, G_{mm})$ (variant "B").

| Method | CIFAR-10-LT Avg Rec | CIFAR-10-LT Min Cov (tgt: 0.095) | CIFAR-100-LT Avg Rec | CIFAR-100-LT Min HT Cov (tgt: 0.010) | TinyImgNet-LT Avg Rec | TinyImgNet-LT Min HT Cov (tgt: 0.005) |
|---|---|---|---|---|---|---|
| ERM | 0.750 | 0.058 | 0.441 | 0.001 | 0.321 | 0.000 |
| Balanced [63] | 0.790 | 0.076 | 0.478 | 0.005 | 0.353 | 0.001 |
| Equalized [88] | 0.754 | 0.056 | 0.461 | 0.003 | 0.325 | 0.000 |
| Adaptive [10] | 0.756 | 0.060 | 0.450 | 0.002 | 0.323 | 0.000 |
| CSL [Re-weighted] | 0.750 | 0.087 | 0.411 | 0.010 | 0.295 | 0.002 |
| CSL [Hybrid A] | 0.750 | 0.088 | 0.461 | 0.010 | 0.347 | 0.005 |
| CSL [Hybrid B] | 0.755 | 0.087 | 0.464 | 0.010 | 0.353 | 0.004 |

Table 3: Results of maximizing average recall, while constraining the coverage for each class to be at least 95% of $\frac{1}{m}$ on CIFAR-10, and constraining the head and tail coverages to be at least 95% of $\frac{1}{m}$ on CIFAR-100 and Tiny ImageNet. The models are evaluated on a balanced test set and hence we set the coverage target to $0.95\cdot\frac{1}{m}$. The last three rows contain results from Algorithm 2 (in the appendix) with different losses in the CSL step. All metrics and targets are rounded off to 3 decimal places, and averaged over 5 independent trials. The highest entry in each column and those comparable to it (within 0.001) are shaded. The maximum recall among rows where the coverage is closest to the target (or within 0.001 of the closest entry) is underlined. The proposed hybrid losses yield coverage values close to the target, while achieving a higher average recall than the re-weighted loss.

In Table 3, we present the results of applying Algorithm 2 (in the appendix) to this task, with both the re-weighted loss and the proposed losses used in the CSL step. In this case, loss re-weighting is able to match the coverage target as closely as the proposed losses on CIFAR-10 and CIFAR-100, but fares poorly on the average recall on two of the three datasets. The proposed losses are often considerably better on this metric, while yielding coverage values that are closest to the target. Between the two hybrid variants, there isn't a clear winner, although variant "B" seems to have a slight edge. The reason we are not able to match the target on CIFAR-10 as closely as we do on the other two datasets is the larger number of constraints (10) that need to be satisfied (we only have 2 constraints to satisfy for CIFAR-100 and Tiny ImageNet).
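As an illustration of how the non-diagonal gain matrix and the two factorizations enter the training loop, the hedged sketch below builds $G = \mathrm{diag}(\mathbf{1}_m/\pi) + (\mathbf{1}_m/\pi)\lambda^\top$ for the balanced-coverage constraint (Table 1, row 5) and derives the variant "A" and "B" choices of $D$. The prior and multiplier values are made up for the example; this is our own rendering, not the paper's code.

```python
import numpy as np

# Hypothetical long-tailed priors over m = 4 classes and Lagrange multipliers
# (non-negative, in practice updated by gradient ascent on the coverage constraints).
priors = np.array([0.55, 0.25, 0.15, 0.05])
lam = np.array([0.0, 0.0, 0.2, 0.6])         # illustrative values only

# Gain matrix for the balanced-coverage problem (Table 1, row 5):
# G_ij = (1 / pi_i) * (1[i == j] + lam_j).
G = np.diag(1.0 / priors) + np.outer(1.0 / priors, lam)

# Variant A: D carries the inverse class priors; M = G D^{-1} is the outer weighting.
D_A = 1.0 / priors
M_A = G / D_A                                # dividing column j by D_jj gives G D^{-1}

# Variant B: D copies the diagonal of G, so M has an all-ones diagonal.
D_B = np.diag(G)
M_B = G / D_B
assert np.allclose(np.diag(M_B), 1.0)

# Either (M, D) pair can now be plugged into the hybrid loss of Eq. (9).
print(np.round(M_A, 3), np.round(M_B, 3), sep="\n")
```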
## 6 Improvements with Post-shifting and Distillation

Having demonstrated that the proposed losses can provide significant gains over loss re-weighting, we explore ways to further improve their performance.

**Does post-shifting provide further gains?** Post-hoc correction strategies have generally been shown to be very effective in optimizing evaluation metrics [68, 46, 101, 89, 63], often matching the performance of more direct methods that modify the training loss. They are implemented in two steps: (i) train a base scoring model $s : \mathcal{X} \to \mathbb{R}^m$ using ERM; (ii) construct a classifier that estimates the Bayes-optimal label for a given $x$ by applying a gain matrix $G \in \mathbb{R}^{m \times m}$ to the predicted probabilities:
$$h(x) \in \mathrm{argmax}_{y \in [m]} \sum_{i=1}^m G_{iy}\, \sigma_i(x), \quad \text{where } \sigma(x) = \mathrm{softmax}(s(x)).$$
The coefficients $G$ are usually chosen to maximize the given evaluation metric on a held-out validation set, using either a simple grid search (when the number of classes is small) or more sophisticated optimization tools [70, 89].

| Method | C10-LT Per-class Recall (Avg) | C10-LT Per-class Recall (Min) | C100-LT Per-class Recall (Avg) | C100-LT Per-class Recall (Min) | C100-LT Head-Tail Recall (Avg) | C100-LT Head-Tail Recall (Min) | TIN-LT Head-Tail Recall (Avg) | TIN-LT Head-Tail Recall (Min) |
|---|---|---|---|---|---|---|---|---|
| ERM | 0.750 | 0.563 | 0.434 | 0.000 | 0.441 | 0.082 | 0.321 | 0.009 |
| ERM + PS | 0.766 | 0.711 | 0.279 | 0.103 | 0.459 | 0.454 | 0.319 | 0.303 |
| CSL [Logit-adjusted] | 0.786 | 0.731 | 0.389 | 0.107 | 0.462 | 0.407 | 0.335 | 0.295 |
| CSL [Logit-adjusted] + PS | 0.766 | 0.730 | 0.298 | 0.068 | 0.456 | 0.447 | 0.328 | 0.315 |

Table 4: Improvements with post-shifting: results (on the test set) of maximizing the minimum recall over all classes (columns 1-2) and over just the head and tail classes (columns 3-4). The results are averaged over 5 independent trials. The highest entry in each column is shaded. The proposed approach, or its variant (CSL [Logit-adjusted] + PS) where the trained model is modified with a post-hoc adjustment, compares favorably to ERM + PS. Our proposal often yields better average recall.

Table 4 shows the results of post-shifting the ERM-trained model for the task of maximizing worst-case recall (see Appendix E for details of how we fit the post-shift coefficients). While post-shifting provides huge improvements over ERM, our proposed approach of employing an iterative training procedure (Algorithm 1) with a modified logit-adjusted loss for the CSL step often has a slight edge over the simpler post-shifting approach on the worst-case recall. Moreover, our proposal often yields a higher average recall, e.g., on the more difficult problem of maximizing the worst-case recall over all 100 classes on CIFAR-100-LT. Interestingly, applying the post-hoc adjustment procedure to a model trained using our iterative approach (CSL [Logit-adjusted] + PS) does not always yield further gains. For example, when maximizing the worst-case recall over all 100 classes on CIFAR-100-LT, the resulting classifier from this hybrid approach overfits to the validation sample and is worse off on the test sample.
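The post-shift step itself is only a re-weighting of predicted probabilities at inference time. The hedged sketch below is our own illustration: the simple coordinate-wise grid search is a stand-in for the fitting procedure described in the paper's Appendix E, and it assumes every class appears in the validation set.

```python
import numpy as np

def post_shift_predict(probs, G):
    """Post-hoc correction: h(x) = argmax_y sum_i G_iy * sigma_i(x)."""
    return np.argmax(probs @ G, axis=-1)

def fit_diagonal_post_shift(val_probs, val_labels, grid=np.linspace(0.5, 5.0, 10)):
    """Toy coordinate-wise grid search for a diagonal G that maximizes worst-case
    recall on the validation set (a stand-in for the paper's fitting procedure)."""
    m = val_probs.shape[1]
    w = np.ones(m)

    def min_recall(w_vec):
        preds = post_shift_predict(val_probs, np.diag(w_vec))
        return min((preds[val_labels == i] == i).mean() for i in range(m))

    for i in range(m):                        # one pass of coordinate-wise search
        scores = []
        for g in grid:
            w_try = w.copy()
            w_try[i] = g
            scores.append(min_recall(w_try))
        w[i] = grid[int(np.argmax(scores))]
    return np.diag(w)
```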
**Improvements with distillation.** Another technique that has proven effective in boosting the performance of neural networks is knowledge distillation, wherein soft predictions from a teacher model $p^t : \mathcal{X} \to \Delta_m$ are used as labels to train a student model [34, 81, 29, 99, 109]. All the losses discussed so far can be easily applied to a distillation setup, where the training labels are replaced with an expectation over the teacher's label distribution: $\mathbb{E}_{x \sim D_{\mathcal{X}}}\big[\sum_y p^t_y(x)\, \ell(y, s(x))\big]$. In applying these losses for cost-sensitive learning, it is important that the class priors $\pi_y$ used to construct the gain matrices (see Table 1) be replaced with the teacher's prior distribution $\pi^t_y = \mathbb{E}_x[p^t_y(x)]$. This is particularly important when the teacher is trained on a dataset with a different prior distribution, or its outputs are re-calibrated to yield soft predictions. We make use of the teacher's labels only in the training sample, and retain the original one-hot labels in the validation sample.

We re-run the worst-case recall experiments on CIFAR-10 and CIFAR-100 from Section 5 with distillation, and compare a distilled version of our approach with distilled ERM, a distilled version of the class-prior-adjusted loss of Menon et al. [63] (which performed the best in Table 2 among all methods that sought to maximize average recall), and a distilled version of loss re-weighting. In each case, we employ self-distillation and use the same ResNet-56 architecture for both the teacher and the student.

| Method | CIFAR-10-LT Avg Rec | CIFAR-10-LT Min Rec | CIFAR-100-LT Avg Rec | CIFAR-100-LT Min HT Rec |
|---|---|---|---|---|
| Distilled ERM | 0.766 | 0.600 | 0.456 | 0.057 |
| Distilled Balanced [63] | 0.812 | 0.709 | 0.508 | 0.230 |
| Distilled CSL [Re-weighted] | 0.771 | 0.735 | 0.481 | 0.478 |
| Distilled CSL [Logit-adjusted] | 0.777 | 0.737 | 0.475 | 0.473 |

Table 5: Improvements with self-distillation: performance of the student model (on the test set) in maximizing worst-case recall on CIFAR-10 and the minimum of the head and tail recalls on CIFAR-100. Both the teacher and student use a ResNet-56 architecture. The results are averaged over 5 independent trials. The highest entry in each column is shaded.

As shown in Table 5, distillation provides improvements across the board. Interestingly, loss re-weighting is quite competitive in this case, and has a slight edge on CIFAR-100. We attribute this surprising improvement in performance to the teacher prior distribution $\pi^t_y$ being significantly more balanced than the original class priors, and as a result, the costs $G_{yy}$ in the optimization of the worst-case recall in Algorithm 1 being less drastically different across classes.

## 7 Conclusions

We have proposed new cost-sensitive losses based on logit adjustment for training over-parameterized models with non-decomposable metrics. Our losses serve as replacements for, and provide significant gains over, the standard loss re-weighting strategies used as a part of the iterative reduction-based approach to optimize such metrics. We show that our proposal compares favorably to the common post-shifting baseline, and can be further improved with distillation. While we considered two representative tasks in our experiments (maximizing worst-case recall and constraining coverage), the reductions-based approach that we employ and the losses that we propose apply to general evaluation metrics such as the F-measure, the AUC-PR, and common fairness goals [25, 4, 67].

A limitation of our losses is that while they are calibrated, these guarantees only hold in the large-sample limit. In the future, we wish to probe further into why our losses improve generalization even in finite-sample settings, and to develop a more formal understanding of their margin properties. We would also like to further investigate why post-hoc adjustment approaches work well in practice. More broadly, when applied to objectives with fairness constraints, the techniques presented here could be used to improve performance of complex neural networks on under-represented samples. Carefully studying their empirical behaviour in such settings, to ensure they do not introduce unforeseen additional biases, is an important direction for future study.

## References

[1] Fairlearn Library. URL https://fairlearn.org/.
[2] Global Objectives Library. URL https://git.dst.etit.tu-chemnitz.de/external/tf-models/-/tree/master/research/global_objectives.
[3] TensorFlow Constrained Optimization Library. URL https://github.com/google-research/tensorflow_constrained_optimization.
[4] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69, 2018.
[5] Francis R. Bach, David Heckerman, and Eric Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7(63):1713-1741, 2006. URL http://jmlr.org/papers/v7/bach06a.html.
[6] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[7] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121-3124, 2010. doi: 10.1109/ICPR.2010.764.
[8] Sebastian Bruch, Masrour Zoghi, Mike Bendersky, and Marc Najork.
Revisiting approximate metric optimization in the age of deep neural networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 19), pages 1241 1244, 2019. [9] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. ar Xiv:1710.05381 [cs, stat], October 2017. [10] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbal- anced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, 2019. [11] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (JAIR), 16:321 357, 2002. [12] Robert S Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems, pages 4705 4714, 2017. [13] Guillem Collell, Drazen Prelec, and Kaustubh R. Patil. Reviving threshold-moving: a simple plug-in bagging ensemble for binary and multiclass imbalanced data. Co RR, abs/1606.08698, 2016. [14] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. In International Conference on Machine Learning, pages 1397 1405. PMLR, 2019. [15] Andrew Cotter, Maya Gupta, and Harikrishna Narasimhan. On making stochastic classifiers deterministic. In Advances in Neural Information Processing Systems, 2019. [16] Andrew Cotter, Heinrich Jiang, Serena Wang, Taman Narayan, Seungil You, Karthik Sridharan, and Maya R. Gupta. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research (JMLR), 20 (172):1 59, 2019. [17] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel- based vector machines. Journal of machine learning research, 2(Dec):265 292, 2001. [18] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019. [19] M.A. Davenport, R.G. Baraniuk, and C.D. Scott. Controlling false alarms with support vector machines. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 5, pages V V, 2006. doi: 10.1109/ICASSP.2006.1661344. [20] Krzysztof Dembczy nski, Wojciech Kotłowski, Oluwasanmi Koyejo, and Nagarajan Natarajan. Consistency analysis for binary classification revisited. In International Conference on Machine Learning, pages 961 969. PMLR, 2017. [21] Zongyong Deng, Hao Liu, Yaoxing Wang, Chenyang Wang, Zekuan Yu, and Xuehong Sun. PML: progressive margin loss for long-tailed age classification. Co RR, abs/2103.02140, 2021. URL https://arxiv.org/abs/2103.02140. [22] Jacek P. Dmochowski, Paul Sajda, and Lucas C. Parra. Maximum likelihood in cost-sensitive learning: Model specification, approximations, and upper bounds. Journal of Machine Learning Research, 11(108):3313 3332, 2010. [23] Pedro Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 99, page 155 164, New York, NY, USA, 1999. Association for Computing Machinery. ISBN 1581131437. doi: 10.1145/312129.312220. 
URL https: //doi.org/10.1145/312129.312220. [24] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference (ITCS), pages 214 226, 2012. [25] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832 840. PMLR, 2017. [26] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI 01, page 973 978, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608125. [27] Rizal Fathony and Zico Kolter. AP-perf: Incorporating generic performance metrics in differentiable learning. In International Conference on Artificial Intelligence and Statistics, pages 4130 4140. PMLR, 2020. [28] Tom Fawcett and Foster Provost. Combining data mining and machine learning for effective user profiling. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 8 13. AAAI Press, 1996. [29] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandku- mar. Born again neural networks. In International Conference on Machine Learning, pages 1607 1616. PMLR, 2018. [30] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estima- tion. Journal of the American statistical Association, 102(477):359 378, 2007. [31] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415 2423, 2016. [32] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315 3323, 2016. [33] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263 1284, 2009. [34] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ar Xiv:1503.02531, 2015. [35] Chen Huang, Shuangfei Zhai, Walter Talbott, Miguel Bautista Martin, Shih-Yu Sun, Carlos Guestrin, and Josh Susskind. Addressing the loss-metric mismatch with adaptive loss alignment. In International Conference on Machine Learning, pages 2891 2900. PMLR, 2019. [36] Arya Iranmehr, Hamed Masnadi-Shirazi, and Nuno Vasconcelos. Cost-sensitive support vector machines. Neurocomputing, 343:50 64, 2019. [37] Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [38] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd international conference on Machine learning, pages 377 384. ACM, 2005. [39] Justin Johnson and Taghi Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6:27, 03 2019. [40] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In Eighth International Conference on Learning Representations (ICLR), 2020. [41] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. 
Online and stochastic gradient methods for non-decomposable loss functions. Advances in Neural Information Processing Systems, 2014. [42] Purushottam Kar, Shuai Li, Harikrishna Narasimhan, Sanjay Chawla, and Fabrizio Sebastiani. Online optimization methods for the quantification problem. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1625 1634, 2016. [43] Grigoris Karakoulas and John Shawe-Taylor. Optimizing classifiers for imbalanced training sets. In Proceedings of the 11th International Conference on Neural Information Processing Systems, NIPS 98, page 253 259, Cambridge, MA, USA, 1998. MIT Press. [44] Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573 3587, 2018. doi: 10. 1109/TNNLS.2017.2732482. [45] Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, and Christos Thrampoulidis. Label-imbalanced and group-sensitive classification under overparameterization. Co RR, abs/2103.01550, 2021. URL https://arxiv.org/abs/2103.01550. [46] Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, pages 2744 2752, 2014. [47] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [48] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on Machine Learning (ICML), 1997. [49] Miroslav Kubat, Robert Holte, and Stan Matwin. Learning when negative examples abound. In Maarten van Someren and Gerhard Widmer, editors, Proceedings of the European Conference on Machine Learning (ECML), volume 1224 of Lecture Notes in Computer Science, pages 146 153. Springer Berlin Heidelberg, 1997. ISBN 978-3-540-62858-3. [50] Abhishek Kumar, Harikrishna Narasimhan, and Andrew Cotter. Implicit rate-constrained optimization of non-decomposable objectives. In International Conference on Machine Learning (ICML), 2021. [51] Steve Lawrence, Ian Burns, Andrew D. Back, Ah Chung Tsoi, and C. Lee Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, page 299 313, Berlin, Heidelberg, 1998. Springer-Verlag. ISBN 3540653112. [52] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 2015. [53] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. [54] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 94, page 3 12, Berlin, Heidelberg, 1994. Springer-Verlag. ISBN 038719889X. [55] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz S. Kandola. The perceptron algorithm with uneven margins. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML 02, page 379 386, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1558608737. 
[56] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980 2988, 2017. [57] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Mach. Learn., 46(1 3):191 202, March 2002. ISSN 0885-6125. [58] Charles Ling and Victor Sheng. Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning, 01 2010. [59] Junru Luo, Hong Qiao, and Bo Zhang. A minimax probability machine for non-decomposable performance measures. ar Xiv preprint ar Xiv:2103.00396, 2021. [60] Marcus A. Maloof. Learning when data sets are imbalanced and when costs are unequal and unknown. In ICML 2003 Workshop on Learning from Imbalanced Datasets, 2003. [61] Hamed Masnadi-Shirazi and Nuno Vasconcelos. Risk minimization, probability elicitation, and cost-sensitive SVMs. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 10, page 759 766, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077. [62] Aditya Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statisti- cal consistency of algorithms for binary classification under class imbalance. In International Conference on Machine Learning, pages 603 611, 2013. [63] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations (ICLR), 2021. [64] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, 2019. [65] Katharina Morik, Peter Brockhausen, and Thorsten Joachims. Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pages 268 277, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2. [66] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020. [67] Harikrishna Narasimhan. Learning with complex loss functions and constraints. In Interna- tional Conference on Artificial Intelligence and Statistics, pages 1646 1654, 2018. [68] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493 1501, 2014. [69] Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. In International Conference on Machine Learning, pages 199 208. PMLR, 2015. [70] Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consis- tent multiclass algorithms for complex performance measures. In ICML, pages 2398 2407, 2015. [71] Harikrishna Narasimhan, Andrew Cotter, and Maya Gupta. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pages 10746 10757, 2019. [72] Nagarajan Natarajan, Oluwasanmi Koyejo, Pradeep Ravikumar, and Inderjit Dhillon. Optimal classification with multivariate losses. In International Conference on Machine Learning, pages 1530 1538. PMLR, 2016. 
[73] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann Le Cun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. [74] Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing f-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, pages 2123 2131, 2014. [75] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944 1952, 2017. [76] Foster Provost. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI-2000 Workshop on Imbalanced Data Sets, 2000. [77] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of area under precision-recall curve for deep learning with provable convergence. Co RR, abs/2104.08736, 2021. URL https://arxiv.org/abs/2104.08736. [78] Harish G Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397 441, 2016. [79] Jiawei Ren, Cunjun Yu, shunan sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, and hongsheng Li. Balanced meta-softmax for long-tailed visual recognition. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4175 4186. Curran Associates, Inc., 2020. [80] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei Fei. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211 252, 2015. [81] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gülcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In ICLR (Poster), 2016. [82] S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang. An investigation of why overparameteri- zation exacerbates spurious correlations. In International Conference on Machine Learning (ICML), 2020. [83] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2019. [84] Amartya Sanyal, Pawan Kumar, Purushottam Kar, Sanjay Chawla, and Fabrizio Sebastiani. Optimizing non-decomposable measures with deep networks. Machine Learning, 107(8): 1597 1620, 2018. [85] Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6(none): 958 992, 2012. doi: 10.1214/12-EJS699. URL https://doi.org/10.1214/12-EJS699. [86] N. Sohoni, J. Dunnmon, G. Angus, A. Gu, and C. Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In Conference on Neural Information Processing Systems (Neur IPS), 2020. [87] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In International Conference on Machine Learning, pages 2169 2177. PMLR, 2016. [88] J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, and J. Yan. Equalization loss for long- tailed object recognition. 
[89] S. K. Tavker, H. G. Ramaswamy, and H. Narasimhan. Consistent plug-in classifiers for complex objectives and constraints. In Advances in Neural Information Processing Systems, 2020.
[90] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun, and Yoram Singer. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(9), 2005.
[91] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[92] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[93] B. C. Wallace, K. Small, C. E. Brodley, and T. A. Trikalinos. Class imbalance, redux. In Proc. ICDM, 2011.
[94] Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, and Dahua Lin. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[95] Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17:1–52, 2016.
[96] Shan-Hung Wu, Keng-Pei Lin, Chung-Min Chen, and Ming-Syan Chen. Asymmetric support vector machines: Low false-positive learning under the user tolerance. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 749–757, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605581934.
[97] Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 162–178, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58548-8.
[98] Xiaoyun Wu and Rohini Srihari. New ν-support vector machines and their sequential minimal optimization. In AAAI, pages 824–831, 2003.
[99] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
[100] Yu Xie and Charles F. Manski. The logit model and response-based samples. Sociological Methods & Research, 17(3):283–302, 1989.
[101] Bowei Yan, Sanmi Koyejo, Kai Zhong, and Pradeep Ravikumar. Binary classification with karmic, threshold-quasi-concave metrics. In International Conference on Machine Learning, pages 5531–5540. PMLR, 2018.
[102] Nan Ye, Kian Ming Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the 29th International Conference on Machine Learning, pages 289–296. Omnipress, 2012.
[103] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[104] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442, 2003. doi: 10.1109/ICDM.2003.1250950.
[105] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962–970, 2017.
[106] Ao Zhang, Nan Li, Jian Pu, Jun Wang, Junchi Yan, and Hongyuan Zha. Tau-FPL: Tolerance-constrained learning in linear time. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 4398–4405. AAAI Press, 2018.
[107] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[108] Junjie Zhang, Lingqiao Liu, Peng Wang, and Chunhua Shen. To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions, 2019.
[109] Shaoyu Zhang, Chen Chen, Xiyuan Hu, and Silong Peng. Balanced knowledge distillation for long-tailed learning. CoRR, abs/2104.10510, 2021. URL https://arxiv.org/abs/2104.10510.
[110] Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. Distribution alignment: A unified framework for long-tail visual recognition. In CVPR, 2021.
[111] Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1), 2006.
[112] Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232–257, 2010.