Cardinality-Aware Set Prediction and Top-k Classification

Corinna Cortes, Google Research, New York, NY 10011, corinna@google.com
Anqi Mao, Courant Institute, New York, NY 10012, aqmao@cims.nyu.edu
Christopher Mohri, Stanford University, Stanford, CA 94305, xmohri@stanford.edu
Mehryar Mohri, Google Research & CIMS, New York, NY 10011, mohri@google.com
Yutao Zhong, Courant Institute, New York, NY 10012, yutao@cims.nyu.edu

We present a detailed study of cardinality-aware top-k classification, a novel approach that aims to learn an accurate top-k set predictor while maintaining a low cardinality. We introduce a new target loss function tailored to this setting that accounts for both the classification error and the cardinality of the set predicted. To optimize this loss function, we propose two families of surrogate losses: cost-sensitive comp-sum losses and cost-sensitive constrained losses. Minimizing these loss functions leads to new cardinality-aware algorithms that we describe in detail in the case of both top-k and threshold-based classifiers. We establish H-consistency bounds for our cardinality-aware surrogate loss functions, thereby providing a strong theoretical foundation for our algorithms. We report the results of extensive experiments on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets demonstrating the effectiveness and benefits of our cardinality-aware algorithms.

1 Introduction

Top-k classification consists of predicting the k most likely classes for a given input, as opposed to solely predicting the single most likely class. Several compelling reasons support the adoption of this framework. First, it enhances accuracy by allowing the model to consider the top k predictions, accommodating uncertainty and providing a more comprehensive prediction.
This is particularly valuable in scenarios where multiple correct answers exist, such as image tagging, where a top-k classifier can identify multiple relevant objects in an image. Second, top-k classification is applicable in ranking and recommendation tasks, such as suggesting the top k most relevant products in e-commerce based on user queries. The confidence scores associated with the top k predictions also serve as a means to estimate the model's uncertainty, which is crucial in applications requiring insight into the model's confidence level.

The predictions of a top-k classifier are also useful in several natural settings. For example, ensemble learning can benefit from top-k predictions, since predictions from multiple models can be combined, contributing to improved overall performance through a more robust and diverse set of predictions. In addition, top-k predictions can serve as input for downstream tasks like natural language generation or dialogue systems, enhancing the performance of these tasks by providing a broader range of potential candidates. Finally, the interpretability of the model's decision-making process is enhanced by examining the top k predicted classes, allowing users to gain insights into the rationale behind the model's predictions.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

The appropriate k for the task at hand may be determined by the application itself, as with a recommender system always expecting a fixed set size to be returned. For other applications, it may be natural to let the cardinality of the returned set vary with the model's confidence or other properties of the task. Designing effective algorithms with learning guarantees for this setting is our main goal. In this paper, we introduce the problem of cardinality-aware set prediction, which is to learn an accurate set predictor while maintaining a low cardinality.
The core idea is that an effective algorithm should dynamically adjust the cardinality of its prediction sets based on input instances. For top-k classifiers, this means selecting a larger k for difficult inputs to ensure high accuracy, while opting for a smaller k for simpler inputs to maintain low cardinality. Similarly, for threshold-based classifiers, a lower threshold can be used for difficult inputs to minimize the risk of misclassification, whereas a higher threshold can be applied to simpler inputs to reduce cardinality.

To tackle this problem, we introduce a novel target loss function that captures both the classification error and the cardinality of a prediction set. Minimizing this target loss function directly is an instance-dependent cost-sensitive learning problem, which is intractable for most hypothesis sets. Instead, we derive two families of general surrogate loss functions that benefit from smoothness properties and favorable optimization solutions.

To provide theoretical guarantees for our cardinality-aware top-k approach, we first study consistency properties of surrogate loss functions for the general top-k problem with a fixed k. Unlike standard classification, the consistency of surrogate loss functions for the top-k problem has been relatively unexplored. A crucial property in this context is the asymptotic notion of Bayes-consistency, which has been extensively studied in standard binary and multi-class classification [Zhang, 2004a, Bartlett et al., 2006, Zhang, 2004b, Bartlett and Wegkamp, 2008]. While Bayes-consistency has been explored for various top-k surrogate losses [Lapin et al., 2015, 2016, 2018, Yang and Koyejo, 2020, Thilagar et al., 2022], some face limitations. Non-convex "hinge-like" surrogates [Yang and Koyejo, 2020], surrogates inspired by ranking [Usunier et al., 2009], and polyhedral surrogates [Thilagar et al., 2022] cannot lead to effective algorithms, as they cannot be efficiently computed and optimized.
Negative results also indicate that several convex "hinge-like" surrogates [Lapin et al., 2015, 2016, 2018] fail to achieve Bayes-consistency [Yang and Koyejo, 2020]. On the positive side, it has been shown that the logistic loss (or cross-entropy loss used with the softmax activation) is a Bayes-consistent loss for top-k classification [Lapin et al., 2015, Yang and Koyejo, 2020].

We show that, remarkably, several widely used families of surrogate losses for standard multi-class classification admit H-consistency bounds [Awasthi, Mao, Mohri, and Zhong, 2022a,b, Mao, Mohri, and Zhong, 2023f,b] with respect to the top-k loss. These are strong non-asymptotic consistency guarantees that are specific to the actual hypothesis set H adopted, and therefore also imply asymptotic Bayes-consistency. We establish this property for the broad family of comp-sum losses [Mao, Mohri, and Zhong, 2023f], comprised of the composition of a non-decreasing and non-negative function with a sum of exponentials. This includes the logistic loss, the sum-exponential loss, the mean absolute error loss, and the generalized cross-entropy loss. Additionally, we extend these results to constrained losses, a family originally introduced for multi-class SVM [Lee et al., 2004], which includes the constrained exponential, hinge, squared hinge, and ρ-margin losses. The H-consistency guarantees provide a strong foundation for principled algorithms in top-k classification, obtained by directly minimizing these surrogate loss functions.

We then leverage these results to derive strong guarantees for the two families of cardinality-aware surrogate losses: cost-sensitive comp-sum and cost-sensitive constrained losses. Both families are obtained by augmenting their top-k counterparts [Lapin et al., 2015, 2016, Berrada et al., 2018, Reddi et al., 2019, Yang and Koyejo, 2020, Thilagar et al., 2022] with instance-dependent cost terms.
We establish strong H-consistency bounds, implying Bayes-consistency, for both families relative to the cardinality-aware target loss. Our H-consistency bounds for the top-k problem are further beneficial here, since the cardinality-aware problem can consist of selecting from a family of top-k classifiers, each of which we now know how to learn effectively.

The rest of the paper is organized as follows. In Section 2, we formally introduce the cardinality-aware set prediction problem along with our new families of surrogate loss functions. Section 3 instantiates our algorithms in the case of both top-k classifiers and threshold-based classifiers, and Section 4 presents strong theoretical guarantees. In Section 5, as well as in Appendix J and Appendix K, we present experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets, demonstrating the effectiveness of our algorithms.

2 Cardinality-aware set prediction

In this section, we introduce cardinality-aware set prediction, where the goal is to devise algorithms that dynamically adjust the prediction set's size based on the input instance, to both achieve high accuracy and maintain a low average cardinality. Specifically, for top-k classifiers, our objective is to determine a suitable cardinality k for each input x, with higher values of k for instances that are more difficult to classify. To address this problem, we first define a cardinality-aware loss function that accounts for both the classification error and the cardinality of the set predicted (Section 2.1). However, minimizing this loss function directly is computationally intractable for non-trivial hypothesis sets. Thus, to optimize it, we introduce two families of surrogate losses: cost-sensitive comp-sum losses (Section 2.2) and cost-sensitive constrained losses (Section 2.3). We will later show that these loss functions benefit from favorable guarantees in terms of H-consistency (Section 4.3).
2.1 Cardinality-aware problem formulation and loss function

The learning setup for cardinality-aware set prediction is as follows.

Problem setup. We denote by X the input space and by Y = [n] = {1, ..., n} the label space. Let {g_k : k ∈ K} denote a collection of given set predictors, induced by a parameterized set predictor g_k : X → 2^Y, where K ⊂ R is a set of indices. This could be a subset of the family of top-k classifiers induced by some classifier h, or a family of threshold-based classifiers based on some scoring function s : X × Y → R. In that case, g_k(x) comprises the set of labels y with a score s(x, y) exceeding the threshold τ_k defining g_k. This formulation covers as special cases standard conformal prediction set predictors [Shafer and Vovk, 2008], as well as set predictors defined as confidence sets described in [Denis and Hebiri, 2017].

We will denote by |g_k(x)| the cardinality of the set g_k(x) predicted by g_k for the input x. To simplify the discussion, we will assume that |g_k(x)| is an increasing function of k, for any x. For a family of top-k classifiers or threshold-based classifiers, this simply means that they are sorted in increasing order of k or in decreasing order of the threshold values. To account for the cost associated with cardinality, we introduce a non-negative and increasing function cost : R+ → R+, where cost(|g_k(x)|) represents the cost associated with the cardinality |g_k(x)|. Common choices for cost include cost(|g_k(x)|) = |g_k(x)|, or a logarithmic function cost(|g_k(x)|) = log(|g_k(x)|) as in our experiments (see Section 5), which moderates the magnitude of the cost relative to the binary classification loss. Our analysis is general and requires no assumption about cost. Our goal is to learn to assign to each input instance x the most appropriate index k ∈ K, to both achieve high accuracy and maintain a low average cardinality.

Cardinality-aware loss function. As in the ordinary multi-class classification problem, we consider a family R of scoring functions r : X × K → R.
For any x, r(x, k) denotes the score assigned to the label (or index) k ∈ K, given x ∈ X. The index predicted is r(x) = argmax_{k∈K} r(x, k), with ties broken in favor of the largest index. To account for both classification accuracy and cardinality cost, we define the cardinality-aware loss function for a scoring function r and input-output pair (x, y) ∈ X × Y as a linearized combination of these two criteria:

ℓ(r, x, y) = 1_{y ∉ g_{r(x)}(x)} + λ cost(|g_{r(x)}(x)|),  (1)

where the first term is the standard loss for a top-k prediction, taking value one when the correct label y is not included in the predicted set and zero otherwise, and λ > 0 is a hyperparameter that governs the balance between prioritizing accuracy and limiting cardinality. The learning problem then consists of using a labeled training sample (x_1, y_1), ..., (x_m, y_m) drawn i.i.d. from some (unknown) distribution D to select r ∈ R with a small expected cardinality-aware loss E_{(x,y)∼D}[ℓ(r, x, y)].

The loss function (1) can be equivalently expressed in terms of an instance-dependent cost function c : X × K × Y → R+:

ℓ(r, x, y) = c(x, r(x), y),  (2)

where c(x, k, y) = 1_{y ∉ g_k(x)} + λ cost(|g_k(x)|). Minimizing (2) is an instance-dependent cost-sensitive learning problem. However, directly minimizing this target loss is intractable. To optimize it, we introduce two families of surrogate losses in the next sections: cost-sensitive comp-sum losses and cost-sensitive constrained losses. Note that, throughout this paper, ℓ denotes target (or true) losses on which performance is measured, while surrogate losses introduced for ease of optimization are identified by subscripts, such as ℓ_comp.

2.2 Cost-sensitive comp-sum surrogate losses

Our surrogate cost-sensitive comp-sum (c-comp) losses are defined as follows: for all (r, x, y) ∈ R × X × Y,

ℓ_{c-comp}(r, x, y) = Σ_{k∈K} (1 − c(x, k, y)) ℓ_comp(r, x, k),

where the comp-sum loss ℓ_comp is defined as in [Mao, Mohri, and Zhong, 2023f].
That is, for any r in a hypothesis set R and (x, y) ∈ X × Y, ℓ_comp(r, x, y) = Φ(Σ_{y'≠y} e^{r(x,y') − r(x,y)}), where Φ : R+ → R+ is non-decreasing. See Section 4.2 for more details. For example, when the logistic loss is used, we obtain the cost-sensitive logistic loss:

ℓ_{c-log}(r, x, y) = Σ_{k∈K} (1 − c(x, k, y)) ℓ_log(r, x, k) = Σ_{k∈K} (1 − c(x, k, y)) log(Σ_{k'∈K} e^{r(x,k') − r(x,k)}).

The negative log-term −log(Σ_{k'∈K} e^{r(x,k') − r(x,k)}) increases with the score r(x, k). Thus, minimizing the loss rewards higher scores r(x, k) in proportion to the weight (1 − c(x, k, y)), which is larger for indices k whose prediction g_k(x) incurs a lower cost.

2.3 Cost-sensitive constrained surrogate losses

Constrained losses are defined as a summation of a function Φ applied to the scores, subject to a constraint, as in [Lee et al., 2004]. For any r ∈ R and (x, y) ∈ X × Y, they are expressed as ℓ_cstnd(r, x, y) = Σ_{y'≠y} Φ(−r(x, y')), with the constraint Σ_{y∈Y} r(x, y) = 0, where Φ : R → R+ is non-increasing. See Section 4.2 for a detailed discussion. Inspired by these constrained losses, we introduce a new family of surrogate losses, cost-sensitive constrained (c-cstnd) losses, defined for all (r, x, y) ∈ R × X × Y by

ℓ_{c-cstnd}(r, x, y) = Σ_{k∈K} c(x, k, y) Φ(−r(x, k)), with the constraint Σ_{k∈K} r(x, k) = 0,

where Φ : R → R+ is non-increasing. For example, for Φ(t) = e^{−t}, we obtain the cost-sensitive constrained exponential loss: ℓ^cstnd_{c-exp}(r, x, y) = Σ_{k∈K} c(x, k, y) e^{r(x,k)}, with the constraint Σ_{k∈K} r(x, k) = 0.

3 Cardinality-aware algorithms

Minimizing the cost-sensitive surrogate loss functions described in the previous section directly leads to novel cardinality-aware algorithms. In this section, we briefly detail the instantiation of our algorithms in the specific cases of top-k classifiers (our main focus) and threshold-based classifiers.

Top-k classifiers. Here, the collection of set predictors is a subset of the top-k classifiers, defined by g_k(x) = {h_1(x), ..., h_k(x)}, where h_1(x), ..., h_k(x) are the top-k labels induced by a classifier h.
The cardinality in this case coincides with the index: |g_k(x)| = k, for any x ∈ X. The cost is defined as c(x, k, y) = 1_{y ∉ {h_1(x),...,h_k(x)}} + λ cost(k), where cost(k) can be chosen to be k or log(k). Thus, our cardinality-aware algorithms for top-k classification can be described as follows. At training time, we assume access to a sample {(x_i, y_i)}_{i=1}^m and to the costs incurred by each top-k set, {c(x_i, k, y_i)}_{i=1}^m for k ∈ K, a pre-fixed subset. The goal is to minimize the target cardinality-aware loss Σ_{i=1}^m ℓ(r, x_i, y_i) = Σ_{i=1}^m c(x_i, r(x_i), y_i) over a hypothesis set R. Our algorithm consists of minimizing a surrogate loss such as the cost-sensitive logistic loss:

r̂ = argmin_{r∈R} Σ_{i=1}^m Σ_{k∈K} (1 − c(x_i, k, y_i)) log(Σ_{k'∈K} e^{r(x_i,k') − r(x_i,k)}).

At inference time, we use the top-r̂(x) set {h_1(x), ..., h_{r̂(x)}(x)} for prediction, with accuracy 1_{y ∈ {h_1(x),...,h_{r̂(x)}(x)}} and cardinality r̂(x) for that instance. In Section 5, we compare the accuracy-versus-cardinality curves of our cardinality-aware algorithms, obtained by varying λ, with those of top-k classifiers, demonstrating the effectiveness of our algorithms. Which λ to select for a given application will depend on the desired accuracy. Note that the performance of the algorithm in [Denis and Hebiri, 2017] in this setting is theoretically the same as that of top-k classifiers: it is designed to maximize accuracy within a constrained cardinality k, and it always reaches its maximal accuracy at the boundary, once the cardinality is constrained to some k ∈ K.

Threshold-based classifiers. Here, the set predictor is defined via a set of thresholds τ_k: g_k(x) = {y ∈ Y : s(x, y) > τ_k}. When the set is empty, we simply return argmax_{y∈Y} s(x, y) by default. The description of the costs and other components of the algorithms is similar to that of top-k classifiers.
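The top-k instantiation can be sketched end to end: compute the costs c(x, k, y) from a base classifier's ranked labels and, at inference, pick k̂ = argmax_{k∈K} r(x, k) and return the top-k̂ set. The sketch below is our own illustration, not the paper's code; the function names and the toy selector scores are hypothetical, and cost(k) = log k is used.

```python
import math

def topk_cost(ranked_labels, y, k, lam):
    """Instance-dependent cost c(x, k, y) = 1{y not in top-k} + lam * log(k),
    where the top-k set is {h_1(x), ..., h_k(x)}."""
    miss = 0.0 if y in ranked_labels[:k] else 1.0
    return miss + lam * math.log(k)

def predict_set(ranked_labels, r_scores, ks):
    """Inference: k_hat = argmax_{k in K} r(x, k), with ties broken in favor of
    the largest index, then predict the top-k_hat labels."""
    # Scanning K from largest to smallest makes max() keep the largest tied index.
    k_hat = max(sorted(ks, reverse=True), key=lambda k: r_scores[k])
    return ranked_labels[:k_hat]

# Toy example: base classifier ranks 4 labels; K = {1, 2, 4}; true label is 2.
ranked = [1, 2, 3, 0]
costs = {k: topk_cost(ranked, y=2, k=k, lam=0.1) for k in (1, 2, 4)}
pred = predict_set(ranked, r_scores={1: 0.2, 2: 0.9, 4: 0.1}, ks=[1, 2, 4])
```

Here the top-1 set misses the true label, so c(x, 1, y) = 1, while larger k trade a zero misclassification term against a growing λ log k penalty; the selector then picks k̂ = 2 and predicts the top-2 set.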
A special case of threshold-based classifiers is conformal prediction [Shafer and Vovk, 2008], a general framework that provides provably valid confidence sets for a black-box scoring function. Split conformal prediction guarantees that P(Y_{m+1} ∈ C_{s,α}(X_{m+1})) ≥ 1 − α for a scoring function s : X × Y → R, where C_{s,α}(X_{m+1}) = {y : s(X_{m+1}, y) ≥ q̂_α} and q̂_α is the ⌊α(m+1)⌋/m empirical quantile of the scores s(X_i, Y_i) over a held-out set {(X_i, Y_i)}_{i=1}^m drawn i.i.d. from some distribution D (or merely exchangeably). Note, however, that the framework does not supply an effective guarantee on the size of the sets C_{s,α}(X_{m+1}). In Appendix K, we present in detail a series of early experiments for our algorithm used with threshold-based classifiers and include more discussion. Our experiments suggest that, when the training sample is sufficiently large, our algorithm can outperform conformal prediction.

4 Theoretical guarantees

Here, we present theoretical guarantees for our cardinality-aware algorithms. Our analysis builds on the theory of top-k algorithms, and we start by providing stronger results than previously known for top-k surrogates.

4.1 Preliminaries

We denote by D a distribution over X × Y and write p(x, y) = D(Y = y | X = x) for the conditional probability of Y = y given X = x, and use p(x) = (p(x,1), ..., p(x,n)) to denote the corresponding conditional probability vector. We denote by ℓ : H_all × X × Y → R a loss function defined for the family of all measurable functions H_all. Given a hypothesis set H ⊆ H_all, the conditional error of a hypothesis h and the best-in-class conditional error are defined as follows:

C_ℓ(h, x) = E_{y|x}[ℓ(h, x, y)] = Σ_{y∈Y} p(x, y) ℓ(h, x, y) and C*_ℓ(H, x) = inf_{h∈H} C_ℓ(h, x).

Accordingly, the generalization error of a hypothesis h and the best-in-class generalization error are defined by:

E_ℓ(h) = E_{(x,y)∼D}[ℓ(h, x, y)] = E_x[C_ℓ(h, x)] and E*_ℓ(H) = inf_{h∈H} E_ℓ(h) = inf_{h∈H} E_x[C_ℓ(h, x)].
Given a score vector (h(x,1), ..., h(x,n)) generated by a hypothesis h, we sort its components in decreasing order and write h_k(x) to denote the k-th label, that is, h(x, h_1(x)) ≥ h(x, h_2(x)) ≥ ... ≥ h(x, h_n(x)). Similarly, for a given conditional probability vector p(x) = (p(x,1), ..., p(x,n)), we write p_k(x) to denote the k-th label in decreasing order of conditional probability, that is, p(x, p_1(x)) ≥ p(x, p_2(x)) ≥ ... ≥ p(x, p_n(x)). In the event of a tie for the k-th highest score or conditional probability, the label h_k(x) or p_k(x) is selected based on the highest index under the natural ordering of labels.

The target generalization error for top-k classification is given by the top-k loss, denoted by ℓ_k and defined, for any hypothesis h and (x, y) ∈ X × Y, by ℓ_k(h, x, y) = 1_{y ∉ {h_1(x),...,h_k(x)}}. The loss takes value one when the correct label y is not included in the top-k predictions made by the hypothesis h, and zero otherwise. In the special case where k = 1, this is precisely the familiar zero-one classification loss. Like the zero-one loss, optimizing the top-k loss is NP-hard for common hypothesis sets. Therefore, alternative surrogate losses are typically used to design learning algorithms. A crucial property of these surrogate losses is Bayes-consistency. This requires that, asymptotically, nearly minimizing the surrogate loss over the family of all measurable functions leads to the near minimization of the top-k loss over the same family [Steinwart, 2007].

Definition 4.1. A surrogate loss ℓ is said to be Bayes-consistent with respect to the top-k loss ℓ_k if, for all sequences of hypotheses {h_n}_{n∈N} ⊆ H_all and any distribution, lim_{n→+∞} [E_ℓ(h_n) − E*_ℓ(H_all)] = 0 implies lim_{n→+∞} [E_{ℓ_k}(h_n) − E*_{ℓ_k}(H_all)] = 0.

Bayes-consistency is an asymptotic guarantee and applies only to the family of all measurable functions.
Recently, Awasthi, Mao, Mohri, and Zhong [2022a,b] (see also [Awasthi et al., 2021a,b, 2023a,b, Mao et al., 2023c,d,e,a, 2024c,b,a,e,h,i,d,f,g, Mohri et al., 2024]) proposed a stronger consistency guarantee, referred to as H-consistency bounds. These are upper bounds on the target estimation error in terms of the surrogate estimation error that are non-asymptotic and hypothesis set-specific.

Definition 4.2. Given a hypothesis set H, a surrogate loss ℓ is said to admit an H-consistency bound with respect to the top-k loss ℓ_k if, for some non-decreasing function f, the following inequality holds for all h ∈ H and for any distribution: f(E_{ℓ_k}(h) − E*_{ℓ_k}(H)) ≤ E_ℓ(h) − E*_ℓ(H).

We refer to E_{ℓ_k}(h) − E*_{ℓ_k}(H) as the target estimation error and to E_ℓ(h) − E*_ℓ(H) as the surrogate estimation error. These bounds imply Bayes-consistency when H = H_all, by taking the limit. A key quantity appearing in H-consistency bounds is the minimizability gap, which measures the difference between the best-in-class generalization error and the expectation of the best-in-class conditional error, defined for a given hypothesis set H and a loss function ℓ by M_ℓ(H) = E*_ℓ(H) − E_x[C*_ℓ(H, x)]. As shown by Mao, Mohri, and Zhong [2023f], the minimizability gap is non-negative and upper bounded by the approximation error A_ℓ(H) = E*_ℓ(H) − E*_ℓ(H_all): 0 ≤ M_ℓ(H) ≤ A_ℓ(H). When H = H_all or, more generally, when A_ℓ(H) = 0, the minimizability gap vanishes. However, in general, it is non-zero and provides a finer measure than the approximation error. Thus, H-consistency bounds provide a stronger guarantee than excess error bounds.

4.2 Theoretical guarantees for top-k surrogate losses

We study the surrogate loss families of comp-sum losses and constrained losses in multi-class classification, which have been shown in the past to benefit from H-consistency bounds with respect to the zero-one classification loss, that is, ℓ_k with k = 1 [Awasthi et al., 2022b, Mao et al., 2023f] (see also [Zheng et al., 2023, Mao et al., 2023b]).
We extend these results to top-k classification and prove H-consistency bounds for these loss functions with respect to ℓ_k, for any 1 ≤ k ≤ n. Another commonly used family of surrogate losses in multi-class classification is that of max losses, which are defined through a convex function, such as the hinge loss, applied to the margin [Crammer and Singer, 2001, Awasthi et al., 2022b]. However, as shown in [Awasthi et al., 2022b], no non-trivial H-consistency guarantee holds for max losses with respect to ℓ_k, even when k = 1.

We first characterize the best-in-class conditional error and the conditional regret of the top-k loss, which will be used in the analysis of H-consistency bounds. We denote by S_[k] = {X ⊆ S : |X| = k} the set of all k-subsets of a set S. We will study any hypothesis set that is regular.

Definition 4.3. Let A(n, k) be the set of ordered k-tuples with distinct elements in [n]. We say that a hypothesis set H is regular for top-k classification if the top-k predictions generated by the hypothesis set cover all possible outcomes: for all x ∈ X, {(h_1(x), ..., h_k(x)) : h ∈ H} = A(n, k).

Common hypothesis sets, such as linear models, neural networks, or the family of all measurable functions, are all regular for top-k classification.

Lemma 4.4. Assume that H is regular. Then, for any h ∈ H and x ∈ X, the best-in-class conditional error and the conditional regret of the top-k loss can be expressed as follows:

C*_{ℓ_k}(H, x) = 1 − Σ_{i=1}^{k} p(x, p_i(x)) and ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} [p(x, p_i(x)) − p(x, h_i(x))].

The proof is included in Appendix A. For k = 1, the result coincides with the known identities for standard multi-class classification with regular hypothesis sets [Awasthi et al., 2022b, Lemma 3]. As in [Awasthi et al., 2022b, Mao et al., 2023f], in the following sections we will consider hypothesis sets that are symmetric and complete. This includes the class of linear models and neural networks typically used in practice, as well as the family of all measurable functions.
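The quantities in Lemma 4.4 can be checked numerically on a toy conditional distribution. The sketch below is our own illustration (function and variable names are hypothetical): it computes the best-in-class conditional error 1 − Σ_{i≤k} p(x, p_i(x)) and the conditional regret of an arbitrary ranking, which must sum to that ranking's own conditional top-k error.

```python
def topk_conditional_quantities(p, h_ranking, k):
    """Lemma 4.4 for a regular hypothesis set: the best-in-class conditional
    error is 1 - (sum of the k largest conditional probabilities), and the
    conditional regret of h is sum_i [p(x, p_i(x)) - p(x, h_i(x))]."""
    p_sorted = sorted(p, reverse=True)
    best = 1.0 - sum(p_sorted[:k])
    regret = sum(p_sorted[i] - p[h_ranking[i]] for i in range(k))
    return best, regret

# Toy conditional distribution over 3 labels and a suboptimal ranking
# that places label 2 first instead of label 0.
p = [0.5, 0.3, 0.2]
best, regret = topk_conditional_quantities(p, h_ranking=[2, 1, 0], k=2)
```

On this example, the best top-2 set {0, 1} errs with probability 0.2, the ranking [2, 1] errs with probability 0.5, and the regret 0.3 accounts exactly for the difference.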
We say that a hypothesis set H is symmetric if it is independent of the ordering of the labels; that is, for all y ∈ Y, the scoring function x ↦ h(x, y) belongs to some real-valued family of functions F. We say that a hypothesis set is complete if, for all (x, y) ∈ X × Y, the set of scores h(x, y) can span the real numbers: {h(x, y) : h ∈ H} = R. Note that any symmetric and complete hypothesis set is regular for top-k classification.

Next, we analyze the broad family of comp-sum losses, which includes the commonly used logistic loss (or cross-entropy loss used with the softmax activation) as a special case. Comp-sum losses are defined as the composition of a function Φ with a sum of exponentials, as in [Mao et al., 2023f]. For any h ∈ H and (x, y) ∈ X × Y, they are expressed as

ℓ_comp(h, x, y) = Φ(Σ_{y'≠y} e^{h(x,y') − h(x,y)}),

where Φ : R+ → R+ is non-decreasing. When Φ is chosen as t ↦ log(1 + t), t ↦ t, t ↦ 1 − 1/(1 + t), or t ↦ (1/q)[1 − (1/(1 + t))^q] with q ∈ (0, 1), ℓ_comp(h, x, y) coincides, respectively, with: the most commonly used (multinomial) logistic loss ℓ_log(h, x, y) = log(Σ_{y'∈Y} e^{h(x,y') − h(x,y)}) [Verhulst, 1838, 1845, Berkson, 1944, 1951]; the sum-exponential loss ℓ_exp(h, x, y) = Σ_{y'≠y} e^{h(x,y') − h(x,y)} [Weston and Watkins, 1998, Awasthi et al., 2022b], widely used in multi-class boosting [Saberian and Vasconcelos, 2011, Mukherjee and Schapire, 2013, Kuznetsov et al., 2014]; the mean absolute error loss ℓ_mae(h, x, y) = 1 − [Σ_{y'∈Y} e^{h(x,y') − h(x,y)}]^{−1}, known to be robust to label noise when training neural networks [Ghosh et al., 2017]; and the generalized cross-entropy loss ℓ_gce(h, x, y) = (1/q)[1 − (Σ_{y'∈Y} e^{h(x,y') − h(x,y)})^{−q}] with q ∈ (0, 1), a generalization of the logistic loss and mean absolute error loss for learning deep neural networks with noisy labels [Zhang and Sabuncu, 2018]. We specifically study these loss functions and show that they benefit from H-consistency bounds with respect to the top-k loss.

Theorem 4.5. Assume that H is symmetric and complete.
Then, for any 1 ≤ k ≤ n, the following H-consistency bound holds for the comp-sum loss:

E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k ψ^{−1}(E_{ℓ_comp}(h) − E*_{ℓ_comp}(H) + M_{ℓ_comp}(H)).

In the special case where A_{ℓ_comp}(H) = 0, for any 1 ≤ k ≤ n, the following upper bound holds:

E_{ℓ_k}(h) − E*_{ℓ_k}(H) ≤ k ψ^{−1}(E_{ℓ_comp}(h) − E*_{ℓ_comp}(H)),

where ψ(t) = ((1 − t)/2) log(1 − t) + ((1 + t)/2) log(1 + t), t ∈ [0, 1], when ℓ_comp is ℓ_log; ψ(t) = 1 − √(1 − t²), t ∈ [0, 1], when ℓ_comp is ℓ_exp; ψ(t) = t/n when ℓ_comp is ℓ_mae; and ψ(t) = (1/(q n^q))[(((1 + t)^{1/(1−q)} + (1 − t)^{1/(1−q)})/2)^{1−q} − 1] for q ∈ (0, 1), t ∈ [0, 1], when ℓ_comp is ℓ_gce.

The proof is included in Appendix B. The second part follows from the fact that when A_{ℓ_comp}(H) = 0, the minimizability gap M_{ℓ_comp}(H) vanishes. By taking the limit on both sides, Theorem 4.5 implies the H-consistency and Bayes-consistency of comp-sum losses with respect to the top-k loss. It further shows that, when the estimation error of ℓ_comp is reduced to ϵ > 0, the estimation error of ℓ_k is upper bounded by k ψ^{−1}(ϵ), which, for sufficiently small ϵ, is approximately k√(2ϵ) for ℓ_log and ℓ_exp, knϵ for ℓ_mae, and k√(2 n^q ϵ) for ℓ_gce. Note that, unlike for the other loss functions, the bound for the mean absolute error loss is linear. The downside of this more favorable linear rate is the dependency on the number of classes and the fact that the mean absolute error loss is harder to optimize [Zhang and Sabuncu, 2018]. The bound for the generalized cross-entropy loss depends on both the number of classes n and the parameter q.

In the proof, we used the fact that the conditional regret of the top-k loss is a sum of k differences between two probabilities. We then upper bounded each difference with the conditional regret of the comp-sum loss, using a hypothesis based on the two probabilities. The final bound is derived by summing these differences. In Appendix G, we detail the technical challenges and the novelty of the analysis.
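As an illustration of the comp-sum family analyzed above, all four losses arise from the same template by swapping Φ. The following minimal sketch is our own (the function names are illustrative), evaluating each loss on the same score vector.

```python
import math

def comp_sum_loss(phi, scores, y):
    """Comp-sum template: Phi applied to t = sum_{y' != y} exp(h(x,y') - h(x,y))."""
    t = sum(math.exp(s - scores[y]) for j, s in enumerate(scores) if j != y)
    return phi(t)

phi_log = lambda t: math.log1p(t)                       # logistic loss
phi_exp = lambda t: t                                   # sum-exponential loss
phi_mae = lambda t: 1.0 - 1.0 / (1.0 + t)               # mean absolute error loss
phi_gce = lambda t, q=0.5: (1.0 - (1.0 + t) ** -q) / q  # generalized cross-entropy

# With two labels and equal scores, t = 1 for the correct label, so the
# logistic loss equals log 2 and the mean absolute error loss equals 1/2.
scores = [0.0, 0.0]
```

This makes concrete why the family is called "comp-sum": the sum of exponentials 1 + t is exactly the inverse of the softmax probability of the correct label, and Φ determines how that quantity is penalized.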
The key quantities in our H-consistency bounds are the minimizability gaps, which can be upper bounded by the approximation error or by more refined terms, depending on the magnitude of the parameter space, as discussed by Mao et al. [2023f]. As pointed out by these authors, these quantities, along with the functional form of the bounds, can help compare different comp-sum loss functions. In Appendix C, we further discuss the important role of minimizability gaps under the realizability assumption, and the connection with some negative results of Yang and Koyejo [2020].

Constrained losses are defined as a summation of a function Φ applied to the scores, subject to a constraint, as in [Lee et al., 2004]. For any h ∈ H and (x, y) ∈ X × Y, they are expressed as ℓ_cstnd(h, x, y) = Σ_{y'≠y} Φ(−h(x, y')), with the constraint Σ_{y∈Y} h(x, y) = 0, where Φ : R → R+ is non-increasing. In Appendix E, we study this family of loss functions and show that several of its members benefit from H-consistency bounds with respect to the top-k loss. In Appendix H, we provide finite-sample generalization bounds for the top-k loss (Theorems H.1 and H.2).

4.3 Theoretical guarantees for cardinality-aware surrogate losses

The strong theoretical results of the previous sections establish the effectiveness of comp-sum and constrained losses as surrogates for the target top-k loss for common hypothesis sets used in practice. Building on this foundation, we extend our analysis to their cost-sensitive variants in the study of cardinality-aware set prediction of Section 2. We derive H-consistency bounds for these loss functions, thereby also establishing their Bayes-consistency. To do so, we characterize the conditional regret of the target cardinality-aware loss function in Lemma I.1, which can be found in Appendix I. For this analysis, we will assume, without loss of generality, that the cost c(x, k, y) takes values in [0, 1] for any (x, k, y) ∈ X × K × Y, which can be achieved by normalizing the cost function.
We will use ℓ_{c-log}, ℓ_{c-exp}, ℓ_{c-gce}, and ℓ_{c-mae} to denote the cost-sensitive counterparts of ℓ_log, ℓ_exp, ℓ_gce, and ℓ_mae, respectively. Next, we show that these cost-sensitive surrogate loss functions benefit from H-consistency bounds with respect to the target loss ℓ given in (1).

Theorem 4.6. Assume that R is symmetric and complete. Then, the following bound holds for the cost-sensitive comp-sum loss: for all r ∈ R and for any distribution,

E_ℓ(r) − E*_ℓ(R) + M_ℓ(R) ≤ γ(E_{ℓ_{c-comp}}(r) − E*_{ℓ_{c-comp}}(R) + M_{ℓ_{c-comp}}(R)).

When R = R_all, the following holds:

E_ℓ(r) − E*_ℓ(R_all) ≤ γ(E_{ℓ_{c-comp}}(r) − E*_{ℓ_{c-comp}}(R_all)),

where γ(t) = √t when ℓ_{c-comp} is either ℓ_{c-log} or ℓ_{c-exp}; γ(t) = 2√(|K|^q t) when ℓ_{c-comp} is ℓ_{c-gce}; and γ(t) = |K| t when ℓ_{c-comp} is ℓ_{c-mae}.

The proof is included in Appendix I.1. The second part follows from the fact that when R = R_all, all the minimizability gaps vanish. In particular, Theorem 4.6 implies the Bayes-consistency of cost-sensitive comp-sum losses. The bounds for the cost-sensitive generalized cross-entropy and mean absolute error losses depend on the number of set predictors, making them less favorable when |K| is large. As pointed out earlier, while the cost-sensitive mean absolute error loss admits a linear rate, it is difficult to optimize even in standard classification, as reported by Zhang and Sabuncu [2018]. In the proof, we represented the comp-sum loss as a function of the softmax and introduced a softmax-dependent function S_µ to upper bound the conditional regret of the target cardinality-aware loss function by that of the cost-sensitive comp-sum loss. This technique is novel and differs from the approach used in the standard scenario (Section 4.2).

We will use ℓ^cstnd_{c-exp}, ℓ_{c-sq-hinge}, ℓ_{c-hinge}, and ℓ_{c-ρ} to denote the cost-sensitive counterparts of ℓ^cstnd_exp, ℓ_sq-hinge, ℓ_hinge, and ℓ_ρ, respectively. Next, we show that these cost-sensitive surrogate losses benefit from H-consistency bounds with respect to the target loss ℓ given in (1).

Theorem 4.7.
Assume that R is symmetric and complete. Then, the following bound holds for the cost-sensitive constrained loss: for all r ∈ R and for any distribution,
E_ℓ(r) − E*_ℓ(R) + M_ℓ(R) ≤ γ(E_{ℓ^c_cstnd}(r) − E*_{ℓ^c_cstnd}(R) + M_{ℓ^c_cstnd}(R)).
When R = R_all, the following holds:
E_ℓ(r) − E*_ℓ(R_all) ≤ γ(E_{ℓ^c_cstnd}(r) − E*_{ℓ^c_cstnd}(R_all)),
where γ(t) = 2√t when ℓ^c_cstnd is ℓ^c_cstnd-exp or ℓ^c_sq-hinge; γ(t) = t when ℓ^c_cstnd is ℓ^c_hinge or ℓ^c_ρ.

The proof is included in Appendix I.2. The second part follows from the fact that, when R = R_all, all the minimizability gaps vanish. In particular, Theorem 4.7 implies the Bayes-consistency of cost-sensitive constrained losses. Note that, while the constrained hinge loss and ρ-margin loss have a more favorable linear rate in the bound, their optimization may be more challenging compared to other smooth loss functions.

Figure 1 (panels: CIFAR-100, ImageNet, CIFAR-10, SVHN): Accuracy versus average cardinality plots obtained by varying λ for our cardinality-aware algorithm and top-k classifiers across four datasets, with predictor set K = {1,2,4,8} and cardinality cost cost(k) = log k. Our cardinality-aware algorithm consistently achieves higher accuracy for any fixed average cardinality across all datasets.

Figure 2 (panels: CIFAR-100, ImageNet, CIFAR-10, SVHN): Comparison of cardinality costs cost(k) = log k and cost(k) = k, with predictor set K = {1,2,4,8}.
The accuracy versus average cardinality plots for our cardinality-aware algorithm are similar, suggesting that the choice of cardinality cost has minimal impact on performance.

5 Experiments

Here, we report empirical results for our cardinality-aware algorithm and show that it consistently outperforms top-k classifiers on the benchmark datasets CIFAR-10, CIFAR-100 [Krizhevsky, 2009], SVHN [Netzer et al., 2011] and ImageNet [Deng et al., 2009]. We used the outputs of the second-to-last layer of ResNet [He et al., 2016] as features for the CIFAR-10, CIFAR-100 and SVHN datasets. For the ImageNet dataset, we used the CLIP model [Radford et al., 2021] to extract features. We adopted a linear model, trained using the multinomial logistic loss, for the classifier h on the extracted features from the datasets. We used a two-hidden-layer feedforward neural network with ReLU activation functions [Nair and Hinton, 2010] for the cardinality selector r. Both the classifier h and the cardinality selector r were trained using the Adam optimizer [Kingma and Ba, 2014], with a learning rate of 1×10^{−3}, a batch size of 128, and a weight decay of 1×10^{−5}.

Figure 1 compares the accuracy versus cardinality curves of the cardinality-aware algorithm with those of the top-k classifiers induced by h on the various datasets. The accuracy of a top-k classifier is measured by E_{(x,y)∈S}[1 − ℓ_k(h,x,y)], that is, the fraction of the sample in which the top-k predictions include the true label. It naturally grows as the cardinality k increases, as shown in Figure 1. The accuracy of the cardinality-aware algorithm is measured by E_{(x,y)∈S}[1_{y∈{h_1(x),…,h_{r(x)}(x)}}], that is, the fraction of the sample in which the predictions selected by the model r include the true label, and the corresponding cardinality is measured by E_{(x,y)∈S}[r(x)], that is, the average size of the selected prediction sets.
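The evaluation quantities in this paragraph, together with the training cost c(x,k,y) = ℓ_k(h,x,y) + λ log(k) described next, can be sketched in a few lines of NumPy. This is an illustration under our own naming conventions (`probs` stands for the classifier's scores; the empirical maximum is used as a stand-in for the maximum over X × K × Y), not the paper's reference implementation:

```python
import numpy as np

def topk_accuracy(probs, labels, k):
    """Empirical E[1 - l_k(h, x, y)]: fraction of samples whose true label
    lies among the k highest-scored classes."""
    order = np.argsort(-probs, axis=1)
    return float(np.mean([labels[i] in order[i, :k] for i in range(len(labels))]))

def cardinality_aware_metrics(probs, labels, r):
    """Accuracy E[1_{y in top-r(x)}] and average cardinality E[r(x)] when
    sample i is assigned the per-input cardinality r[i]."""
    order = np.argsort(-probs, axis=1)
    hits = [labels[i] in order[i, : r[i]] for i in range(len(labels))]
    return float(np.mean(hits)), float(np.mean(r))

def cardinality_costs(probs, labels, ks, lam):
    """Cost c(x, k, y) = l_k(h, x, y) + lam * log(k), normalized to [0, 1]
    by the empirical maximum over the sample."""
    order = np.argsort(-probs, axis=1)
    costs = np.zeros((len(labels), len(ks)))
    for i in range(len(labels)):
        for j, k in enumerate(ks):
            topk_err = 0.0 if labels[i] in order[i, :k] else 1.0
            costs[i, j] = topk_err + lam * np.log(k)
    m = costs.max()
    return costs / m if m > 0 else costs
```

For instance, with K = {1, 2, 4, 8}, sweeping `lam` and recomputing `cardinality_aware_metrics` traces out an accuracy-versus-cardinality curve of the kind shown in Figure 1.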
The cardinality selector r was trained by minimizing the cost-sensitive logistic loss ℓ^c_log with the cost c(x,k,y) defined as ℓ_k(h,x,y) + λ log(k) and normalized to [0,1] through division by its maximum value over X × K × Y. We allow for top-k experts with k ∈ K = {1,2,4,8} and vary λ. Starting from high values of λ, as λ decreases in Figure 1, our cardinality-aware algorithm yields solutions with higher average cardinality and increased accuracy, since λ controls the trade-off between cardinality and accuracy. The plots end on the right at λ = 0.01. Figure 1 shows that the cardinality-aware algorithm is superior across the CIFAR-100, ImageNet, CIFAR-10 and SVHN datasets. For a given average cardinality, the cardinality-aware algorithm always achieves higher accuracy than a top-k classifier. In other words, to achieve the same level of accuracy, the prediction sets of the cardinality-aware algorithm can be significantly smaller than those of the corresponding top-k classifier. In particular, on the CIFAR-100, CIFAR-10 and SVHN datasets, the cardinality-aware algorithm achieves the same accuracy (98%) as the top-k classifier while using only roughly half the cardinality on average. On the ImageNet dataset, it achieves the same accuracy (95%) as the top-k classifier with only two-thirds of the cardinality. This illustrates the effectiveness of our cardinality-aware algorithm.

Figure 2 presents the comparison of cost(k) = k and cost(k) = log k in the same setting (for each dataset, the orange curve in Figure 2 coincides with the orange curve in Figure 1). The

Figure 3 (panels: CIFAR-100, CIFAR-10): Cardinality distribution for top-k experts with K = {1,2,4,8} on the CIFAR-10 and CIFAR-100 datasets, analyzed under two λ values. For each dataset, increasing λ reduces the number of samples with the highest cardinality (k = 8) and increases those with lower cardinalities, as higher λ amplifies the influence of cardinality in the cost function.
Across datasets, the distributions vary for the same λ due to differing task complexities.

Figure 4 (CIFAR-10; true labels of the images shown: cat, frog, deer, airplane): Illustration of hard and easy images on the CIFAR-10 dataset as judged by human annotators, for top-k experts K = {1,2,4,8}. Hard images are those correctly predicted by our algorithm with a cardinality of 8 but misclassified with a cardinality of 4. Easy images are correctly predicted with a cardinality of 1.

comparison suggests that the choice between the linear and logarithmic cardinality costs has negligible impact on our algorithm's performance, highlighting its robustness in this regard. We also empirically demonstrate that our algorithm dynamically adjusts the cardinality of prediction sets based on input complexity, selecting larger sets for more challenging inputs to ensure high accuracy and smaller sets for simpler inputs to keep the cardinality low, as illustrated in Figure 3 and Figure 4. We present additional experimental results with different choices of the set K in Figure 5 and Figure 6 in Appendix J. Our cardinality-aware algorithm consistently outperforms top-k classifiers across all configurations.

6 Conclusion

We introduced a new cardinality-aware set prediction framework, for which we proposed two families of surrogate losses with strong H-consistency guarantees: cost-sensitive comp-sum losses and cost-sensitive constrained losses. This leads to principled and practical cardinality-aware algorithms for top-k classification, which we showed empirically to be very effective. Additionally, we established a theoretical foundation for top-k classification with fixed cardinality k by proving that several common surrogate loss functions, including comp-sum losses and constrained losses in standard classification, admit H-consistency bounds with respect to the top-k loss.
This provides a theoretical justification for the use of these loss functions in top-k classification and opens new avenues for further research in this area.

References

P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, pages 9804-9815, 2021a.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. A finer calibration analysis for adversarial robustness. arXiv preprint arXiv:2105.01550, 2021b.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pages 1117-1174, 2022a.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Multi-class H-consistency bounds. In Advances in Neural Information Processing Systems, pages 782-795, 2022b.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pages 10077-10094, 2023a.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. DC-programming for neural network optimizations. Journal of Global Optimization, 2023b.

P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357-365, 1944.

J. Berkson. Why I prefer logits to probits. Biometrics, 7(4):327-339, 1951.

L. Berrada, A. Zisserman, and M. P. Kumar. Smooth loss functions for deep top-k classification. In International Conference on Learning Representations, 2018.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines.
Journal of Machine Learning Research, 2(Dec):265-292, 2001.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

C. Denis and M. Hebiri. Confidence sets with expected sizes for multiclass classification. Journal of Machine Learning Research, 18(102):1-28, 2017.

A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501-2509, 2014.

M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. In Advances in Neural Information Processing Systems, 2015.

M. Lapin, M. Hein, and B. Schiele. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1468-1477, 2016.

M. Lapin, M. Hein, and B. Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis & Machine Intelligence, 40(07):1533-1554, 2018.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.

P. Long and R. Servedio.
Consistency versus realizable H-consistency for multiclass classification. In International Conference on Machine Learning, pages 801-809, 2013.

A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023a.

A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023b.

A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for pairwise misranking loss surrogates. In International Conference on Machine Learning, 2023c.

A. Mao, M. Mohri, and Y. Zhong. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023d.

A. Mao, M. Mohri, and Y. Zhong. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023e.

A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023f.

A. Mao, M. Mohri, and Y. Zhong. Principled approaches for learning to defer with multiple experts. In International Symposium on Artificial Intelligence and Mathematics, 2024a.

A. Mao, M. Mohri, and Y. Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, 2024b.

A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for score-based multi-class abstention. In International Conference on Artificial Intelligence and Statistics, 2024c.

A. Mao, M. Mohri, and Y. Zhong. Enhanced H-consistency bounds. arXiv preprint arXiv:2407.13722, 2024d.

A. Mao, M. Mohri, and Y. Zhong. H-consistency guarantees for regression. In International Conference on Machine Learning, pages 34712-34737, 2024e.

A. Mao, M. Mohri, and Y. Zhong. Multi-label learning with stronger consistency guarantees.
In Advances in Neural Information Processing Systems, 2024f.

A. Mao, M. Mohri, and Y. Zhong. Realizable H-consistent and Bayes-consistent loss functions for learning to defer. In Advances in Neural Information Processing Systems, 2024g.

A. Mao, M. Mohri, and Y. Zhong. Regression with multi-expert deferral. In International Conference on Machine Learning, pages 34738-34759, 2024h.

A. Mao, M. Mohri, and Y. Zhong. A universal growth rate for learning with smooth surrogate losses. In Advances in Neural Information Processing Systems, 2024i.

C. Mohri, D. Andor, E. Choi, M. Collins, A. Mao, and Y. Zhong. Learning to reject with a fixed predictor: Application to decontextualization. In International Conference on Learning Representations, 2024.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 2013.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems, 2011.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.

S. J. Reddi, S. Kale, F. Yu, D. Holtmann-Rice, J. Chen, and S. Kumar. Stochastic negative mining for learning with large output spaces. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1940-1949, 2019.

M. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In Advances in Neural Information Processing Systems, 24, 2011.

G. Shafer and V.
Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.

I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, 2007.

A. Thilagar, R. Frongillo, J. J. Finocchiaro, and E. Goodwill. Consistent polyhedral surrogates for top-k classification and variants. In International Conference on Machine Learning, pages 21329-21359, 2022.

N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In International Conference on Machine Learning, pages 1057-1064, 2009.

P. F. Verhulst. Notice sur la loi que la population suit dans son accroissement. Correspondance mathématique et physique, 10:113-121, 1838.

P. F. Verhulst. Recherches mathématiques sur la loi d'accroissement de la population. Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1-42, 1845.

J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Citeseer, 1998.

F. Yang and S. Koyejo. On the consistency of top-k surrogate losses. In International Conference on Machine Learning, pages 10727-10735, 2020.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-85, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225-1251, 2004b.

Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, 2018.

C. Zheng, G. Wu, F. Bao, Y. Cao, C. Li, and J. Zhu. Revisiting discriminative vs. generative classifiers: Theory and implications. In International Conference on Machine Learning, 2023.
Contents of Appendix

A Proof of Lemma 4.4
B Proofs of H-consistency bounds for comp-sum losses
C Minimizability gaps and realizability
D Proofs of realizable H-consistency for comp-sum losses
E H-consistency bounds for constrained losses
F Proofs of H-consistency bounds for constrained losses
G Technical challenges and novelty in Section 4.2
H Generalization bounds
I Proofs of H-consistency bounds for cost-sensitive losses
I.1 Proof of Theorem 4.6
I.2 Proof of Theorem 4.7
J Additional experimental results: top-k classifiers
K Additional experimental results: threshold-based classifiers
L Future work

A Proof of Lemma 4.4

Lemma 4.4. Assume that H is regular. Then, for any h ∈ H and x ∈ X, the best-in-class conditional error and the conditional regret of the top-k loss can be expressed as follows:
C*_{ℓ_k}(H, x) = 1 − Σ_{i=1}^{k} p(x, p_i(x)),
ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} [p(x, p_i(x)) − p(x, h_i(x))].

Proof. By definition, for any h ∈ H and x ∈ X, the conditional error of the top-k loss can be written as
C_{ℓ_k}(h, x) = Σ_{y∈Y} p(x, y) 1_{y∉{h_1(x),…,h_k(x)}} = 1 − Σ_{i=1}^{k} p(x, h_i(x)).
By definition of the labels p_i(x), which are the k most likely labels, C_{ℓ_k}(h, x) is minimized for h_i(x) = p_i(x), i ∈ [k]. Since H is regular, this choice is realizable for some h ∈ H. Thus, we have
C*_{ℓ_k}(H, x) = inf_{h∈H} C_{ℓ_k}(h, x) = 1 − Σ_{i=1}^{k} p(x, p_i(x)).
Furthermore, the calibration gap can be expressed as
ΔC_{ℓ_k,H}(h, x) = C_{ℓ_k}(h, x) − C*_{ℓ_k}(H, x) = Σ_{i=1}^{k} (p(x, p_i(x)) − p(x, h_i(x))),
which completes the proof.

B Proofs of H-consistency bounds for comp-sum losses

Theorem 4.5. Assume that H is symmetric and complete.
Then, for any 1 ≤ k ≤ n, the following H-consistency bound holds for the comp-sum loss:
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k ψ⁻¹(E_{ℓ_comp}(h) − E*_{ℓ_comp}(H) + M_{ℓ_comp}(H)).
In the special case where A*_{ℓ_comp}(H) = 0, for any 1 ≤ k ≤ n, the following upper bound holds:
E_{ℓ_k}(h) − E*_{ℓ_k}(H) ≤ k ψ⁻¹(E_{ℓ_comp}(h) − E*_{ℓ_comp}(H)),
where ψ(t) = ((1 − t)/2) log(1 − t) + ((1 + t)/2) log(1 + t), t ∈ [0, 1], when ℓ_comp is ℓ_log; ψ(t) = 1 − √(1 − t²), t ∈ [0, 1], when ℓ_comp is ℓ_exp; ψ(t) = t/n when ℓ_comp is ℓ_mae; and ψ(t) = (1/(q n^q)) [(((1 + t)^{1/(1−q)} + (1 − t)^{1/(1−q)})/2)^{1−q} − 1], for all q ∈ (0, 1) and t ∈ [0, 1], when ℓ_comp is ℓ_gce.

Proof. Case I: ℓ_comp = ℓ_log. For the logistic loss ℓ_log, the conditional regret can be written as
ΔC_{ℓ_log,H}(h, x) = Σ_{y=1}^{n} p(x, y) ℓ_log(h, x, y) − inf_{h∈H} Σ_{y=1}^{n} p(x, y) ℓ_log(h, x, y)
≥ Σ_{y=1}^{n} p(x, y) ℓ_log(h, x, y) − inf_{µ∈R} Σ_{y=1}^{n} p(x, y) ℓ_log(h_{µ,i}, x, y),
where, for any i ∈ [k],
h_{µ,i}(x, y) = h(x, y) for y ∉ {p_i(x), h_i(x)};  h_{µ,i}(x, h_i(x)) = log(e^{h(x,p_i(x))} + µ);  h_{µ,i}(x, p_i(x)) = log(e^{h(x,h_i(x))} − µ).
Note that such a choice of h_{µ,i} leaves the softmax denominator unchanged and leads to the following equality:
Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_log(h, x, y) = Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_log(h_{µ,i}, x, y).
Therefore, for any i ∈ [k], the conditional regret of the logistic loss can be lower bounded as
ΔC_{ℓ_log,H}(h, x) ≥ −p(x, h_i(x)) log(e^{h(x,h_i(x))}/Σ_{y∈Y} e^{h(x,y)}) − p(x, p_i(x)) log(e^{h(x,p_i(x))}/Σ_{y∈Y} e^{h(x,y)})
+ sup_{µ∈R} [p(x, h_i(x)) log((e^{h(x,p_i(x))} + µ)/Σ_{y∈Y} e^{h(x,y)}) + p(x, p_i(x)) log((e^{h(x,h_i(x))} − µ)/Σ_{y∈Y} e^{h(x,y)})]
= sup_{µ∈R} [p(x, h_i(x)) log((e^{h(x,p_i(x))} + µ)/e^{h(x,h_i(x))}) + p(x, p_i(x)) log((e^{h(x,h_i(x))} − µ)/e^{h(x,p_i(x))})].
By concavity, differentiating with respect to µ, we obtain that the supremum is achieved by
µ* = (p(x, h_i(x)) e^{h(x,h_i(x))} − p(x, p_i(x)) e^{h(x,p_i(x))})/(p(x, h_i(x)) + p(x, p_i(x))).
Plugging in µ*, we obtain
ΔC_{ℓ_log,H}(h, x) ≥ p(x, h_i(x)) log([p(x, h_i(x))/(p(x, h_i(x)) + p(x, p_i(x)))] · [(e^{h(x,h_i(x))} + e^{h(x,p_i(x))})/e^{h(x,h_i(x))}])
+ p(x, p_i(x)) log([p(x, p_i(x))/(p(x, h_i(x)) + p(x, p_i(x)))] · [(e^{h(x,h_i(x))} + e^{h(x,p_i(x))})/e^{h(x,p_i(x))}])
≥ p(x, h_i(x)) log(2p(x, h_i(x))/(p(x, h_i(x)) + p(x, p_i(x)))) + p(x, p_i(x)) log(2p(x, p_i(x))/(p(x, h_i(x)) + p(x, p_i(x)))).
(The minimum is achieved when h(x, h_i(x)) = h(x, p_i(x)).) Let S_i = p(x, p_i(x)) + p(x, h_i(x)) and Δ_i = p(x, p_i(x)) − p(x, h_i(x)). We have
ΔC_{ℓ_log,H}(h, x) ≥ ((S_i − Δ_i)/2) log((S_i − Δ_i)/S_i) + ((S_i + Δ_i)/2) log((S_i + Δ_i)/S_i)
≥ ((1 − Δ_i)/2) log(1 − Δ_i) + ((1 + Δ_i)/2) log(1 + Δ_i)  (the minimum is achieved when S_i = 1)
= ψ(p(x, p_i(x)) − p(x, h_i(x))),
where ψ(t) = ((1 − t)/2) log(1 − t) + ((1 + t)/2) log(1 + t), t ∈ [0, 1]. Therefore, the conditional regret of the top-k loss can be upper bounded as follows:
ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} (p(x, p_i(x)) − p(x, h_i(x))) ≤ k ψ⁻¹(ΔC_{ℓ_log,H}(h, x)).
By the concavity of ψ⁻¹, taking expectations on both sides of the preceding inequality, we obtain
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k ψ⁻¹(E_{ℓ_log}(h) − E*_{ℓ_log}(H) + M_{ℓ_log}(H)).
The second part follows from the fact that when A*_{ℓ_log}(H) = 0, the minimizability gap M_{ℓ_log}(H) vanishes.

Case II: ℓ_comp = ℓ_exp. For the sum exponential loss ℓ_exp, the conditional regret can be written as
ΔC_{ℓ_exp,H}(h, x) = Σ_{y=1}^{n} p(x, y) ℓ_exp(h, x, y) − inf_{h∈H} Σ_{y=1}^{n} p(x, y) ℓ_exp(h, x, y)
≥ Σ_{y=1}^{n} p(x, y) ℓ_exp(h, x, y) − inf_{µ∈R} Σ_{y=1}^{n} p(x, y) ℓ_exp(h_{µ,i}, x, y),
where, for any i ∈ [k],
h_{µ,i}(x, y) = h(x, y) for y ∉ {p_i(x), h_i(x)};  h_{µ,i}(x, h_i(x)) = log(e^{h(x,p_i(x))} + µ);  h_{µ,i}(x, p_i(x)) = log(e^{h(x,h_i(x))} − µ).
Note that such a choice of h_{µ,i} leads to the following equality:
Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_exp(h, x, y) = Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_exp(h_{µ,i}, x, y).
Therefore, for any i ∈ [k], the conditional regret of the sum exponential loss can be lower bounded as
ΔC_{ℓ_exp,H}(h, x) ≥ sup_{µ∈R} Σ_{y'∈Y} e^{h(x,y')} [p(x, h_i(x)) e^{−h(x,h_i(x))} + p(x, p_i(x)) e^{−h(x,p_i(x))} − p(x, h_i(x))/(e^{h(x,p_i(x))} + µ) − p(x, p_i(x))/(e^{h(x,h_i(x))} − µ)].
By concavity, differentiating with respect to µ, we obtain that the supremum is achieved by
µ* = (e^{h(x,h_i(x))} √(p(x, h_i(x))) − e^{h(x,p_i(x))} √(p(x, p_i(x))))/(√(p(x, h_i(x))) + √(p(x, p_i(x)))).
Plugging in µ*, we obtain
ΔC_{ℓ_exp,H}(h, x) ≥ Σ_{y'∈Y} e^{h(x,y')} [p(x, h_i(x))/e^{h(x,h_i(x))} + p(x, p_i(x))/e^{h(x,p_i(x))} − (√(p(x, h_i(x))) + √(p(x, p_i(x))))²/(e^{h(x,p_i(x))} + e^{h(x,h_i(x))})]
≥ [1 + e^{h(x,p_i(x))−h(x,h_i(x))}] p(x, h_i(x)) + [1 + e^{h(x,h_i(x))−h(x,p_i(x))}] p(x, p_i(x)) − (√(p(x, h_i(x))) + √(p(x, p_i(x))))²  (using Σ_{y'∈Y} e^{h(x,y')} ≥ e^{h(x,p_i(x))} + e^{h(x,h_i(x))})
≥ 2p(x, h_i(x)) + 2p(x, p_i(x)) − (√(p(x, h_i(x))) + √(p(x, p_i(x))))²  (the minimum is attained when e^{h(x,p_i(x))−h(x,h_i(x))} = 1).
Let S_i = p(x, p_i(x)) + p(x, h_i(x)) and Δ_i = p(x, p_i(x)) − p(x, h_i(x)). We have
ΔC_{ℓ_exp,H}(h, x) ≥ S_i − √(S_i² − Δ_i²) ≥ 1 − √(1 − Δ_i²)  (the minimum is achieved when S_i = 1)
= ψ(p(x, p_i(x)) − p(x, h_i(x))),
where ψ(t) = 1 − √(1 − t²), t ∈ [0, 1]. Therefore, the conditional regret of the top-k loss can be upper bounded as follows:
ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} (p(x, p_i(x)) − p(x, h_i(x))) ≤ k ψ⁻¹(ΔC_{ℓ_exp,H}(h, x)).
By the concavity of ψ⁻¹, taking expectations on both sides of the preceding inequality, we obtain
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k ψ⁻¹(E_{ℓ_exp}(h) − E*_{ℓ_exp}(H) + M_{ℓ_exp}(H)).
The second part follows from the fact that when A*_{ℓ_exp}(H) = 0, the minimizability gap M_{ℓ_exp}(H) vanishes.

Case III: ℓ_comp = ℓ_mae. For the mean absolute error loss ℓ_mae, the conditional regret can be written as
ΔC_{ℓ_mae,H}(h, x) = Σ_{y=1}^{n} p(x, y) ℓ_mae(h, x, y) − inf_{h∈H} Σ_{y=1}^{n} p(x, y) ℓ_mae(h, x, y)
≥ Σ_{y=1}^{n} p(x, y) ℓ_mae(h, x, y) − inf_{µ∈R} Σ_{y=1}^{n} p(x, y) ℓ_mae(h_{µ,i}, x, y),
where, for any i ∈ [k],
h_{µ,i}(x, y) = h(x, y) for y ∉ {p_i(x), h_i(x)};  h_{µ,i}(x, h_i(x)) = log(e^{h(x,p_i(x))} + µ);  h_{µ,i}(x, p_i(x)) = log(e^{h(x,h_i(x))} − µ).
Note that such a choice of h_{µ,i} leads to the following equality:
Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_mae(h, x, y) = Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_mae(h_{µ,i}, x, y).
Therefore, for any i ∈ [k], the conditional regret of the mean absolute error loss can be lower bounded as
ΔC_{ℓ_mae,H}(h, x) ≥ p(x, h_i(x))(1 − e^{h(x,h_i(x))}/Σ_{y'∈Y} e^{h(x,y')}) + p(x, p_i(x))(1 − e^{h(x,p_i(x))}/Σ_{y'∈Y} e^{h(x,y')})
+ sup_{µ∈R} [−p(x, p_i(x))(1 − (e^{h(x,h_i(x))} − µ)/Σ_{y'∈Y} e^{h(x,y')}) − p(x, h_i(x))(1 − (e^{h(x,p_i(x))} + µ)/Σ_{y'∈Y} e^{h(x,y')})].
Since the objective is affine in µ, the supremum is achieved at the boundary value µ* = −e^{h(x,p_i(x))}. Plugging in µ*, we obtain
ΔC_{ℓ_mae,H}(h, x) ≥ (p(x, p_i(x)) − p(x, h_i(x))) · e^{h(x,h_i(x))}/Σ_{y'∈Y} e^{h(x,y')}
≥ (p(x, p_i(x)) − p(x, h_i(x)))/n  (since e^{h(x,h_i(x))}/Σ_{y'∈Y} e^{h(x,y')} ≥ 1/n).
Therefore, the conditional regret of the top-k loss can be upper bounded as follows:
ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} (p(x, p_i(x)) − p(x, h_i(x))) ≤ kn ΔC_{ℓ_mae,H}(h, x).
Taking expectations on both sides of the preceding inequality, we obtain
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ kn (E_{ℓ_mae}(h) − E*_{ℓ_mae}(H) + M_{ℓ_mae}(H)).
The second part follows from the fact that when A*_{ℓ_mae}(H) = 0, the minimizability gap M_{ℓ_mae}(H) vanishes.

Case IV: ℓ_comp = ℓ_gce. For the generalized cross-entropy loss ℓ_gce, the conditional regret can be written as
ΔC_{ℓ_gce,H}(h, x) = Σ_{y=1}^{n} p(x, y) ℓ_gce(h, x, y) − inf_{h∈H} Σ_{y=1}^{n} p(x, y) ℓ_gce(h, x, y)
≥ Σ_{y=1}^{n} p(x, y) ℓ_gce(h, x, y) − inf_{µ∈R} Σ_{y=1}^{n} p(x, y) ℓ_gce(h_{µ,i}, x, y),
where, for any i ∈ [k],
h_{µ,i}(x, y) = h(x, y) for y ∉ {p_i(x), h_i(x)};  h_{µ,i}(x, h_i(x)) = log(e^{h(x,p_i(x))} + µ);  h_{µ,i}(x, p_i(x)) = log(e^{h(x,h_i(x))} − µ).
Note that such a choice of h_{µ,i} leads to the following equality:
Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_gce(h, x, y) = Σ_{y∉{h_i(x),p_i(x)}} p(x, y) ℓ_gce(h_{µ,i}, x, y).
Therefore, for any i ∈ [k], the conditional regret of the generalized cross-entropy loss can be lower bounded as
q ΔC_{ℓ_gce,H}(h, x) ≥ p(x, h_i(x))(1 − [e^{h(x,h_i(x))}/Σ_{y'∈Y} e^{h(x,y')}]^q) + p(x, p_i(x))(1 − [e^{h(x,p_i(x))}/Σ_{y'∈Y} e^{h(x,y')}]^q)
+ sup_{µ∈R} [−p(x, h_i(x))(1 − [(e^{h(x,p_i(x))} + µ)/Σ_{y'∈Y} e^{h(x,y')}]^q) − p(x, p_i(x))(1 − [(e^{h(x,h_i(x))} − µ)/Σ_{y'∈Y} e^{h(x,y')}]^q)].
By concavity, differentiating with respect to µ, we obtain that the supremum is achieved by
µ* = (e^{h(x,h_i(x))} p(x, p_i(x))^{1/(q−1)} − e^{h(x,p_i(x))} p(x, h_i(x))^{1/(q−1)})/(p(x, h_i(x))^{1/(q−1)} + p(x, p_i(x))^{1/(q−1)}).
Plugging in µ*, we obtain
q ΔC_{ℓ_gce,H}(h, x) ≥ [(e^{h(x,h_i(x))} + e^{h(x,p_i(x))})/Σ_{y'∈Y} e^{h(x,y')}]^q (p(x, h_i(x))^{1/(1−q)} + p(x, p_i(x))^{1/(1−q)})^{1−q}
− p(x, h_i(x)) [e^{h(x,h_i(x))}/Σ_{y'∈Y} e^{h(x,y')}]^q − p(x, p_i(x)) [e^{h(x,p_i(x))}/Σ_{y'∈Y} e^{h(x,y')}]^q
≥ (1/n^q) [2^q (p(x, h_i(x))^{1/(1−q)} + p(x, p_i(x))^{1/(1−q)})^{1−q} − p(x, h_i(x)) − p(x, p_i(x))]
(the minimum is attained when e^{h(x,p_i(x))} = e^{h(x,h_i(x))}, using (e^{h(x,p_i(x))}/Σ_{y'∈Y} e^{h(x,y')})^q ≥ 1/n^q).
Let S_i = p(x, p_i(x)) + p(x, h_i(x)) and Δ_i = p(x, p_i(x)) − p(x, h_i(x)). We have
ΔC_{ℓ_gce,H}(h, x) ≥ (1/(q n^q)) [(((S_i + Δ_i)^{1/(1−q)} + (S_i − Δ_i)^{1/(1−q)})/2)^{1−q} − S_i]
≥ (1/(q n^q)) [(((1 + Δ_i)^{1/(1−q)} + (1 − Δ_i)^{1/(1−q)})/2)^{1−q} − 1]  (the minimum is achieved when S_i = 1)
= ψ(p(x, p_i(x)) − p(x, h_i(x))),
where ψ(t) = (1/(q n^q)) [(((1 + t)^{1/(1−q)} + (1 − t)^{1/(1−q)})/2)^{1−q} − 1], t ∈ [0, 1]. Therefore, the conditional regret of the top-k loss can be upper bounded as follows:
ΔC_{ℓ_k,H}(h, x) = Σ_{i=1}^{k} (p(x, p_i(x)) − p(x, h_i(x))) ≤ k ψ⁻¹(ΔC_{ℓ_gce,H}(h, x)).
By the concavity of ψ⁻¹, taking expectations on both sides of the preceding inequality, we obtain
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k ψ⁻¹(E_{ℓ_gce}(h) − E*_{ℓ_gce}(H) + M_{ℓ_gce}(H)).
The second part follows from the fact that when A*_{ℓ_gce}(H) = 0, the minimizability gap M_{ℓ_gce}(H) vanishes.

C Minimizability gaps and realizability

The key quantities in our H-consistency bounds are the minimizability gaps, which can be upper bounded by the approximation error, or by more refined terms, depending on the magnitude of the parameter space, as discussed by Mao et al. [2023f]. As pointed out by these authors, these quantities, along with the functional form of the bounds, can help compare different comp-sum loss functions. Here, we further discuss the important role of minimizability gaps under the realizability assumption, and the connection with some negative results of Yang and Koyejo [2020].

Definition C.1 (top-k-H-realizability).
A distribution D over X × Y is top-k-H-realizable if there exists a hypothesis h ∈ H such that P_{(x,y)∼D}(h(x, y) > h(x, h_{k+1}(x))) = 1. This extends the H-realizability definition from standard (top-1) classification [Long and Servedio, 2013] to top-k classification for any k ≥ 1.

Definition C.2. We say that a hypothesis set H is closed under scaling if it is a cone, that is, for all h ∈ H and β ∈ R+, βh ∈ H.

Definition C.3. We say that a surrogate loss ℓ is realizable H-consistent with respect to ℓ_k if, for all k ∈ [1, n], any sequence of hypotheses {h_n}_{n∈N} ⊂ H and any top-k-H-realizable distribution, lim_{n→+∞} E_ℓ(h_n) − E*_ℓ(H) = 0 implies lim_{n→+∞} E_{ℓ_k}(h_n) − E*_{ℓ_k}(H) = 0.

When H is closed under scaling, for k = 1 and all comp-sum loss functions ℓ = ℓ_log, ℓ_exp, ℓ_gce and ℓ_mae, it can be shown that E*_ℓ(H) = M_ℓ(H) = 0 for any H-realizable distribution. For example, for ℓ = ℓ_log, by the Lebesgue dominated convergence theorem, we have
M_{ℓ_log}(H) ≤ E*_{ℓ_log}(H) ≤ lim_{β→+∞} E_{ℓ_log}(βh*) = lim_{β→+∞} E[log(1 + Σ_{y'≠y} e^{β(h*(x,y') − h*(x,y))})] = 0,
where h* satisfies P_{(x,y)∼D}(h*(x, y) > h*(x, h_2(x))) = 1. Therefore, Theorem 4.5 implies that all these loss functions are realizable H-consistent with respect to ℓ_{0−1} (ℓ_k for k = 1) when H is closed under scaling.

Theorem C.4. Assume that H is closed under scaling. Then, ℓ_log, ℓ_exp, ℓ_gce and ℓ_mae are realizable H-consistent with respect to ℓ_{0−1}.

The formal proof is presented in Appendix D. However, for k > 1, since under the realizability assumption h(x, y) is only guaranteed to be larger than h(x, h_{k+1}(x)) and can be smaller than h(x, h_1(x)), there may exist an H-realizable distribution D such that M_{ℓ_log}(H) > 0. This explains the inconsistency of the logistic loss on top-k separable data with linear predictors, when k = 2 and n > 2, as shown in [Yang and Koyejo, 2020]. More generally, the exact same example as in [Yang and Koyejo, 2020, Proposition 5.1] can be used to show that none of the comp-sum losses ℓ_log, ℓ_exp, ℓ_gce and ℓ_mae is realizable H-consistent with respect to ℓ_k.
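The scaling argument above, by which E_{ℓ_log}(βh*) → 0 on a realizable distribution as β → +∞, is easy to check numerically. A small sketch (the score values are made up for illustration):

```python
import math

def logistic_loss(scores, y):
    """l_log(h, x, y) = log(1 + sum over y' != y of exp(h(x, y') - h(x, y)))."""
    return math.log(1.0 + sum(math.exp(s - scores[y])
                              for i, s in enumerate(scores) if i != y))

# On a realizable example (the true label has the strictly highest score),
# scaling the scores by beta drives the logistic loss to 0, so the
# best-in-class error and the minimizability gap both vanish in the limit.
scores, y = [2.0, 1.0, 0.0], 0
losses = [logistic_loss([beta * s for s in scores], y) for beta in (1, 10, 100)]
```

If instead the true label were only guaranteed to beat the (k+1)-th score for some k > 1, the exponent h(x,y') − h(x,y) could stay positive for some y', and the loss would not vanish, which is the failure mode discussed above.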
Nevertheless, as previously shown, when the hypothesis set H adopted is sufficiently rich that M_ℓ(H) = 0, or even A*_ℓ(H) = 0, these losses are guaranteed to be H-consistent. This is typically the case in practice when using deep neural networks.

D Proofs of realizable H-consistency for comp-sum losses

Theorem C.4. Assume that H is closed under scaling. Then, ℓ_log, ℓ_exp, ℓ_gce and ℓ_mae are realizable H-consistent with respect to ℓ_{0−1}.

Proof. Since the distribution is realizable, there exists a hypothesis h* ∈ H such that P_{(x,y)∼D}(h*(x, y) > h*(x, h_2(x))) = 1. Therefore, for the logistic loss, by the Lebesgue dominated convergence theorem,
M_{ℓ_log}(H) ≤ E*_{ℓ_log}(H) ≤ lim_{β→+∞} E_{ℓ_log}(βh*) = lim_{β→+∞} E[log(1 + Σ_{y'≠y} e^{β(h*(x,y') − h*(x,y))})] = 0.
For the sum exponential loss, by the Lebesgue dominated convergence theorem,
M_{ℓ_exp}(H) ≤ E*_{ℓ_exp}(H) ≤ lim_{β→+∞} E_{ℓ_exp}(βh*) = lim_{β→+∞} E[Σ_{y'≠y} e^{β(h*(x,y') − h*(x,y))}] = 0.
For the generalized cross-entropy loss, by the Lebesgue dominated convergence theorem,
M_{ℓ_gce}(H) ≤ E*_{ℓ_gce}(H) ≤ lim_{β→+∞} E_{ℓ_gce}(βh*) = lim_{β→+∞} E[(1/q)(1 − (Σ_{y'∈Y} e^{β(h*(x,y') − h*(x,y))})^{−q})] = 0.
For the mean absolute error loss, by the Lebesgue dominated convergence theorem,
M_{ℓ_mae}(H) ≤ E*_{ℓ_mae}(H) ≤ lim_{β→+∞} E_{ℓ_mae}(βh*) = lim_{β→+∞} E[1 − (Σ_{y'∈Y} e^{β(h*(x,y') − h*(x,y))})^{−1}] = 0.
Therefore, by Theorem 4.5, the proof is complete.

E H-consistency bounds for constrained losses

Constrained losses are defined as a summation of a function Φ applied to the scores, subject to a constraint, as in [Lee et al., 2004, Awasthi et al., 2022b]. For any h ∈ H and (x, y) ∈ X × Y, they are expressed as
ℓ_cstnd(h, x, y) = Σ_{y'≠y} Φ(−h(x, y')), subject to the constraint Σ_{y∈Y} h(x, y) = 0,
where Φ: R → R+ is non-increasing.
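The general constrained-loss form above can be sketched in a few lines of NumPy. This is a minimal illustration with our own helper names; mean-shifting the scores is used here as one simple way to enforce the sum-to-zero constraint:

```python
import numpy as np

def constrained_loss(scores, y, phi):
    """l_cstnd(h, x, y) = sum over y' != y of phi(-h(x, y')), with the
    scores shifted so that sum_y h(x, y) = 0."""
    scores = scores - scores.mean()          # enforce the sum-to-zero constraint
    mask = np.arange(len(scores)) != y       # all labels y' != y
    return float(np.sum(phi(-scores[mask])))

# Standard non-increasing choices of Phi (instantiations for illustration):
phi_exp = lambda t: np.exp(-t)                               # Phi(t) = e^{-t}
phi_sq_hinge = lambda t: np.maximum(0.0, 1.0 - t) ** 2       # squared hinge
phi_hinge = lambda t: np.maximum(0.0, 1.0 - t)               # hinge
phi_rho = lambda t, rho=1.0: np.minimum(np.maximum(0.0, 1.0 - t / rho), 1.0)
```

Note that `phi(-scores[mask])` evaluates Φ at −h(x,y'), so for example `phi_hinge` yields Σ_{y'≠y} max{0, 1 + h(x,y')}, and a correct label with the highest score incurs a smaller loss than an incorrect one.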
When Φ is chosen as t ↦ e^{−t}, t ↦ max{0, 1 − t}², t ↦ max{0, 1 − t} and t ↦ min{max{0, 1 − t/ρ}, 1}, ρ > 0, ℓ_cstnd(h, x, y) is referred to as the constrained exponential loss ℓ^cstnd_exp(h, x, y) = Σ_{y'≠y} e^{h(x,y')}, the constrained squared hinge loss ℓ_sq-hinge(h, x, y) = Σ_{y'≠y} max{0, 1 + h(x, y')}², the constrained hinge loss ℓ_hinge(h, x, y) = Σ_{y'≠y} max{0, 1 + h(x, y')}, and the constrained ρ-margin loss ℓ_ρ(h, x, y) = Σ_{y'≠y} min{max{0, 1 + h(x, y')/ρ}, 1}, respectively [Awasthi et al., 2022b]. We now study these loss functions and show that they benefit from H-consistency bounds with respect to the top-k loss.

Theorem E.1. Assume that H is symmetric and complete. Then, for any 1 ≤ k ≤ n, the following H-consistency bound holds for the constrained loss:
E_{ℓ_k}(h) − E*_{ℓ_k}(H) + M_{ℓ_k}(H) ≤ k γ(E_{ℓ_cstnd}(h) − E*_{ℓ_cstnd}(H) + M_{ℓ_cstnd}(H)).
In the special case where A*_{ℓ_cstnd}(H) = 0, for any 1 ≤ k ≤ n, the following bound holds:
E_{ℓ_k}(h) − E*_{ℓ_k}(H) ≤ k γ(E_{ℓ_cstnd}(h) − E*_{ℓ_cstnd}(H)),
where γ(t) = 2√t when ℓ_cstnd is either ℓ^cstnd_exp or ℓ_sq-hinge; γ(t) = t when ℓ_cstnd is either ℓ_hinge or ℓ_ρ.

The proof is included in Appendix F. The second part follows from the fact that when the hypothesis set H is sufficiently rich that A*_{ℓ_cstnd}(H) = 0, we have M_{ℓ_cstnd}(H) = 0. Therefore, the constrained loss is H-consistent and Bayes-consistent with respect to ℓ_k. If the surrogate estimation error E_{ℓ_cstnd}(h) − E*_{ℓ_cstnd}(H) is ϵ, then the target estimation error satisfies E_{ℓ_k}(h) − E*_{ℓ_k}(H) ≤ k γ(ϵ). Note that the constrained exponential loss and the constrained squared hinge loss both admit a square-root H-consistency bound, while the bounds for the constrained hinge loss and ρ-margin loss are both linear.

F Proofs of H-consistency bounds for constrained losses

The conditional error for the constrained loss can be expressed as follows:
C_{ℓ_cstnd}(h, x) = Σ_{y=1}^{n} p(x, y) ℓ_cstnd(h, x, y) = Σ_{y=1}^{n} p(x, y) Σ_{y'≠y} Φ(−h(x, y')) = Σ_{y∈Y} (1 − p(x, y)) Φ(−h(x, y)).

Theorem E.1. Assume that H is symmetric and complete.
Then, for any 1 k n, the following H-consistency bound holds for the constrained loss: Eℓk(h) E ℓk(H) + Mℓk(H) kγ(E ℓcstnd(h) E ℓcstnd(H) + M ℓcstnd(H)). In the special case where A ℓcstnd(H) = 0, for any 1 k n, the following bound holds: Eℓk(h) E ℓk(H) kγ(E ℓcstnd(h) E ℓcstnd(H)), where γ(t) = 2 t when ℓcstnd is either ℓcstnd exp or ℓsq hinge; γ(t) = t when ℓcstnd is either ℓhinge or ℓρ. Proof. Case I: ℓcstnd = ℓcstnd exp . For the constrained exponential loss ℓcstnd exp , the conditional regret can be written as C ℓcstnd exp ,H(h,x) = n y=1 p(x,y) ℓcstnd exp (h,x,y) inf h H n y=1 p(x,y) ℓcstnd exp (h,x,y) n y=1 p(x,y) ℓcstnd exp (h,x,y) inf µ R n y=1 p(x,y) ℓcstnd exp (hµ,i,x,y), where for any i [k], hµ,i(x,y) = h(x,y), y {pi(x),hi(x)} h(x,pi(x)) + µ y = hi(x) h(x,hi(x)) µ y = pi(x). Note that such a choice of hµ,i leads to the following equality holds: y {hi(x),pi(x)} p(x,y) ℓcstnd exp (h,x,y) = y {hi(x),pi(x)} p(x,y) ℓcstnd exp (hµ,i,x,y). Let q(x,pi(x)) = 1 p(x,pi(x)) and q(x,hi(x)) = 1 p(x,hi(x)). Therefore, for any i [k], the conditional regret of constrained exponential loss can be lower bounded as C ℓcstnd exp ,H(h,x) inf h H sup µ R {q(x,pi(x))(eh(x,pi(x)) eh(x,hi(x)) µ) + q(x,hi(x))(eh(x,hi(x)) eh(x,pi(x))+µ)} q(x,hi(x))) 2 (differentiating with respect to µ, h to optimize) = q(x,hi(x)) q(x,pi(x)) q(x,pi(x)) + 4(q(x,hi(x)) q(x,pi(x)))2 (0 q(x,y) 1) 4(p(x,pi(x)) p(x,hi(x)))2. Therefore, by Lemma 4.4, the conditional regret of the top-k loss can be upper bounded as follows: Cℓk,H(h,x) = k i=1 (p(x,pi(x)) p(x,hi(x))) 2k( C ℓcstnd exp ,H(h,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓk(h) E ℓk(H) + Mℓk(H) 2k(E ℓcstnd exp (h) E ℓcstnd exp (H) + M ℓcstnd exp (H)) The second part follows from the fact that when A ℓcstnd exp (H) = 0, we have M ℓcstnd exp (H) = 0. Case II: ℓcstnd = ℓsq hinge. 
For the constrained squared hinge loss ℓsq hinge, the conditional regret can be written as C ℓsq hinge,H(h,x) = n y=1 p(x,y) ℓsq hinge(h,x,y) inf h H n y=1 p(x,y) ℓsq hinge(h,x,y) n y=1 p(x,y) ℓsq hinge(h,x,y) inf µ R n y=1 p(x,y) ℓsq hinge(hµ,i,x,y), where for any i [k], hµ,i(x,y) = h(x,y), y {pi(x),hi(x)} h(x,pi(x)) + µ y = hi(x) h(x,hi(x)) µ y = pi(x). Note that such a choice of hµ,i leads to the following equality holds: y {hi(x),pi(x)} p(x,y) ℓsq hinge(h,x,y) = y {hi(x),pi(x)} p(x,y) ℓsq hinge(hµ,i,x,y). Let q(x,pi(x)) = 1 p(x,pi(x)) and q(x,hi(x)) = 1 p(x,hi(x)). Therefore, for any i [k], the conditional regret of the constrained squared hinge loss can be lower bounded as C ℓsq hinge,H(h,x) inf h H sup µ R {q(x,pi(x))(max{0,1 + h(x,pi(x))}2 max{0,1 + h(x,hi(x)) µ}2) + q(x,hi(x))(max{0,1 + h(x,hi(x))}2 max{0,1 + h(x,pi(x)) + µ}2)} 4(q(x,pi(x)) q(x,hi(x)))2 (differentiating with respect to µ, h to optimize) 4(p(x,pi(x)) p(x,hi(x)))2 Therefore, by Lemma 4.4, the conditional regret of the top-k loss can be upper bounded as follows: Cℓk,H(h,x) = k i=1 (p(x,pi(x)) p(x,hi(x))) 2k( C ℓsq hinge,H(h,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓk(h) E ℓk(H) + Mℓk(H) 2k(E ℓsq hinge(h) E ℓsq hinge(H) + M ℓsq hinge(H)) The second part follows from the fact that when the hypothesis set H is sufficiently rich such that A ℓsq hinge(H) = 0, we have M ℓsq hinge(H) = 0. Case III: ℓcstnd = ℓhinge. For the constrained hinge loss ℓhinge, the conditional regret can be written as C ℓhinge,H(h,x) = n y=1 p(x,y) ℓhinge(h,x,y) inf h H n y=1 p(x,y) ℓhinge(h,x,y) n y=1 p(x,y) ℓhinge(h,x,y) inf µ R n y=1 p(x,y) ℓhinge(hµ,i,x,y), where for any i [k], hµ,i(x,y) = h(x,y), y {pi(x),hi(x)} h(x,pi(x)) + µ y = hi(x) h(x,hi(x)) µ y = pi(x). Note that such a choice of hµ,i leads to the following equality holds: y {hi(x),pi(x)} p(x,y) ℓhinge(h,x,y) = y {hi(x),pi(x)} p(x,y) ℓhinge(hµ,i,x,y). Let q(x,pi(x)) = 1 p(x,pi(x)) and q(x,hi(x)) = 1 p(x,hi(x)). 
Therefore, for any i [k], the conditional regret of the constrained hinge loss can be lower-bounded as C ℓhinge,H(h,x) inf h H sup µ R {q(x,pi(x))(max{0,1 + h(x,pi(x))} max{0,1 + h(x,hi(x)) µ}) + q(x,hi(x))(max{0,1 + h(x,hi(x))} max{0,1 + h(x,pi(x)) + µ})} q(x,hi(x)) q(x,pi(x)) (differentiating with respect to µ, h to optimize) = p(x,pi(x)) p(x,hi(x)) Therefore, by Lemma 4.4, the conditional regret of the top-k loss can be upper bounded as follows: Cℓk,H(h,x) = k i=1 (p(x,pi(x)) p(x,hi(x))) k C ℓhinge,H(h,x). By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓk(h) E ℓk(H) + Mℓk(H) k(E ℓhinge(h) E ℓhinge(H) + M ℓhinge(H)). The second part follows from the fact that when the hypothesis set H is sufficiently rich such that A ℓhinge(H) = 0, we have M ℓhinge(H) = 0. Case IV: ℓcstnd = ℓρ. For the constrained ρ-margin loss ℓρ, the conditional regret can be written as C ℓρ,H(h,x) = n y=1 p(x,y) ℓρ(h,x,y) inf h H n y=1 p(x,y) ℓρ(h,x,y) n y=1 p(x,y) ℓρ(h,x,y) inf µ R n y=1 p(x,y) ℓρ(hµ,i,x,y), where for any i [k], hµ,i(x,y) = h(x,y), y {pi(x),hi(x)} h(x,pi(x)) + µ y = hi(x) h(x,hi(x)) µ y = pi(x). Note that such a choice of hµ,i leads to the following equality holds: y {hi(x),pi(x)} p(x,y) ℓρ(h,x,y) = y {hi(x),pi(x)} p(x,y) ℓρ(hµ,i,x,y). Let q(x,pi(x)) = 1 p(x,pi(x)) and q(x,hi(x)) = 1 p(x,hi(x)). Therefore, for any i [k], the conditional regret of the constrained ρ-margin loss can be lower-bounded as C ℓρ,H(h,x) inf h H sup µ R {q(x,pi(x))(min{max{0,1 + h(x,pi(x)) ρ },1} min{max{0,1 + h(x,hi(x)) µ + q(x,hi(x))(min{max{0,1 + h(x,hi(x)) ρ },1} min{max{0,1 + h(x,pi(x)) + µ q(x,hi(x)) q(x,pi(x)) (differentiating with respect to µ, h to optimize) = p(x,pi(x)) p(x,hi(x)) Therefore, by Lemma 4.4, the conditional regret of the top-k loss can be upper bounded as follows: Cℓk,H(h,x) = k i=1 (p(x,pi(x)) p(x,hi(x))) k C ℓρ,H(h,x). 
By concavity, taking expectations on both sides of the preceding inequality, we obtain
$\mathcal E_{\ell_k}(h) - \mathcal E^*_{\ell_k}(\mathcal H) + \mathcal M_{\ell_k}(\mathcal H) \le k\big(\mathcal E_{\ell_\rho}(h) - \mathcal E^*_{\ell_\rho}(\mathcal H) + \mathcal M_{\ell_\rho}(\mathcal H)\big).$
The second part follows from the fact that when the hypothesis set $\mathcal H$ is sufficiently rich that $\mathcal A_{\ell_\rho}(\mathcal H) = 0$, we have $\mathcal M_{\ell_\rho}(\mathcal H) = 0$.

G Technical challenges and novelty in Section 4.2

The technical challenges and novelty of the proofs in Section 4.2 lie in the following three aspects: (1) Conditional regret of the top-k loss: This involves a comprehensive analysis of the conditional regret associated with the top-$k$ loss, which is significantly more complex than that of the zero-one loss in the standard setting. The conditional regret of the top-$k$ loss incorporates both the top-$k$ conditional probabilities $p_i(x)$, for $i = 1,\ldots,k$, and the top-$k$ scores $\mathsf h_i(x)$, for $i = 1,\ldots,k$, as characterized in Lemma 4.4. (2) Relating to the conditional regret of the surrogate loss: To establish H-consistency bounds, it is necessary to upper bound the conditional regret of the top-$k$ loss by that of the surrogate loss. This task is particularly challenging in the top-$k$ setting due to the intricate nature of the top-$k$ loss's conditional regret. A pivotal observation is that the conditional regret of the top-$k$ loss can be expressed as the sum of the $k$ terms $\big(p(x,\mathsf p_i(x)) - p(x,\mathsf h_i(x))\big)$ for $i = 1,\ldots,k$. Each term $\big(p(x,\mathsf p_i(x)) - p(x,\mathsf h_i(x))\big)$ exhibits structural similarities to the conditional regret of the zero-one loss, $\big(p(x,\mathsf p_1(x)) - p(x,\mathsf h_1(x))\big)$. Consequently, we introduce a series of auxiliary hypotheses $h_{\mu,i}$, each dependent on $\mathsf h_i(x)$ and $\mathsf p_i(x)$ for $i \in [k]$. This approach transforms the challenge of upper bounding the conditional regret of the top-$k$ loss into $k$ subproblems, each focusing on upper bounding the term $\big(p(x,\mathsf p_i(x)) - p(x,\mathsf h_i(x))\big)$ by the conditional regret of the surrogate loss.
(3) Upper bounding each term $\big(p(x,\mathsf p_i(x)) - p(x,\mathsf h_i(x))\big)$: Following the approach in prior work [Mao et al., 2023f] for top-1 classification, for the proof for comp-sum losses (Theorem 4.5) we define $h_{\mu,i}(x,y)$ as: $h_{\mu,i}(x,y) = h(x,y)$ if $y \notin \{\mathsf p_i(x), \mathsf h_i(x)\}$; $h_{\mu,i}(x,y) = \log\big(e^{h(x,\mathsf p_i(x))} + \mu\big)$ if $y = \mathsf h_i(x)$; and $h_{\mu,i}(x,y) = \log\big(e^{h(x,\mathsf h_i(x))} - \mu\big)$ if $y = \mathsf p_i(x)$. The subsequent proof is then straightforward. However, for the proof for constrained losses (Theorem E.1), we adopt a different formulation of $h_{\mu,i}(x,y)$, leveraging the constraint that the scores sum to zero and the specific structure of constrained losses. The hypothesis is defined as: $h_{\mu,i}(x,y) = h(x,y)$ if $y \notin \{\mathsf p_i(x), \mathsf h_i(x)\}$; $h_{\mu,i}(x,y) = h(x,\mathsf p_i(x)) + \mu$ if $y = \mathsf h_i(x)$; and $h_{\mu,i}(x,y) = h(x,\mathsf h_i(x)) - \mu$ if $y = \mathsf p_i(x)$. The remainder of the proof then specifically addresses the peculiarities of constrained losses, which diverges significantly from the previous work. In summary, aspects (1) and (2) are novel and represent significant advancements that have not been explored previously. For aspect (3), the proof for comp-sum losses closely follows the approach in [Mao et al., 2023f], and appears straightforward thanks to the innovative ideas presented in aspects (1) and (2). However, the proof for constrained losses deviates significantly from the previous work, particularly in terms of the new auxiliary hypothesis formulation and the specific constrained losses examined. We would like to further emphasize that these results are significant and useful. They demonstrate that comp-sum losses, which include the cross-entropy loss commonly used in top-1 classification, as well as constrained losses, are H-consistent in top-$k$ classification for any $k$. Notably, the cross-entropy loss is the only Bayes-consistent smooth surrogate loss for top-$k$ classification identified to date. Furthermore, the Bayes-consistency of loss functions within the constrained loss family is a novel exploration in the context of top-$k$ classification.
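To make the role of the comp-sum auxiliary hypothesis concrete: one can check that the transformation above preserves the softmax normalizer, since the new scores at the two swapped labels satisfy e^{h_mu(x, h_i(x))} + e^{h_mu(x, p_i(x))} = e^{h(x, p_i(x))} + e^{h(x, h_i(x))}, so the logistic loss of every label outside {p_i(x), h_i(x)} is left untouched. The sketch below, with hypothetical scores and indices of our own choosing, verifies this numerically:

```python
import numpy as np

def log_loss(h, y):
    """Logistic (cross-entropy) loss -log softmax(h)[y]."""
    z = h - h.max()
    return -(z[y] - np.log(np.exp(z).sum()))

# Hypothetical scores over 4 labels; i_p and i_h play the roles of
# p_i(x) and h_i(x), and mu is an admissible shift (e^{h[i_h]} - mu > 0).
h = np.array([1.0, 0.2, -0.5, 0.8])
i_p, i_h, mu = 0, 3, 0.6

h_mu = h.copy()
h_mu[i_h] = np.log(np.exp(h[i_p]) + mu)   # h_mu(x, h_i(x))
h_mu[i_p] = np.log(np.exp(h[i_h]) - mu)   # h_mu(x, p_i(x))

# The softmax normalizer is preserved ...
assert np.isclose(np.exp(h).sum(), np.exp(h_mu).sum())
# ... so the logistic loss of every other label is unchanged.
for y in range(4):
    if y not in (i_p, i_h):
        assert np.isclose(log_loss(h, y), log_loss(h_mu, y))
```

This is exactly the property exploited in the proofs: only the two terms indexed by $\mathsf p_i(x)$ and $\mathsf h_i(x)$ move as $\mu$ varies, reducing the analysis to a one-dimensional optimization over $\mu$.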
These findings are pivotal as they highlight two broad families of smooth loss functions that are Bayes-consistent in top-$k$ classification. Additionally, they reveal that these families, including the cross-entropy loss, benefit from stronger, non-asymptotic and hypothesis set-specific guarantees: H-consistency bounds in top-$k$ classification.

H Generalization bounds

Given a finite sample $S = ((x_1,y_1),\ldots,(x_m,y_m))$ drawn from $\mathcal D^m$, let $\hat h_S$ be the minimizer of the empirical loss within $\mathcal H$ with respect to the top-$k$ surrogate loss $\ell$: $\hat h_S = \operatorname{argmin}_{h\in\mathcal H}\hat{\mathcal E}_{\ell,S}(h) = \operatorname{argmin}_{h\in\mathcal H}\frac1m\sum_{i=1}^m\ell(h,x_i,y_i)$. Next, we show that we can use H-consistency bounds for $\ell$ to derive generalization bounds for the top-$k$ loss, by upper bounding the surrogate estimation error $\mathcal E_{\ell}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H)$ in terms of the complexity (e.g., the Rademacher complexity) of the family of functions associated with $\ell$ and $\mathcal H$: $\mathcal H^{\ell} = \{(x,y)\mapsto\ell(h,x,y)\colon h\in\mathcal H\}$. Let $\mathfrak R^{\ell}_m(\mathcal H)$ be the Rademacher complexity of $\mathcal H^{\ell}$ and $B^{\ell}$ an upper bound on the surrogate loss $\ell$. Then, we obtain the following generalization bounds for the top-$k$ loss.

Theorem H.1 (Generalization bound with comp-sum losses). Assume that $\mathcal H$ is symmetric and complete. Then, for any $1 \le k \le n$, the following top-$k$ generalization bound holds for $\hat h_S$: for any $\delta > 0$, with probability at least $1-\delta$ over the draw of an i.i.d. sample $S$ of size $m$:
$\mathcal E_{\ell_k}(\hat h_S) - \mathcal E^*_{\ell_k}(\mathcal H) + \mathcal M_{\ell_k}(\mathcal H) \le k\,\psi^{-1}\Big(4\mathfrak R^{\ell}_m(\mathcal H) + 2B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}} + \mathcal M_{\ell}(\mathcal H)\Big),$
where $\psi(t) = \frac{1-t}{2}\log(1-t) + \frac{1+t}{2}\log(1+t)$, $t\in[0,1]$, when $\ell$ is $\ell_{\log}$; $\psi(t) = 1 - \sqrt{1-t^2}$, $t\in[0,1]$, when $\ell$ is $\ell_{\exp}$; $\psi(t) = t/n$ when $\ell$ is $\ell_{\mathrm{mae}}$; and $\psi(t) = \frac{1}{q\,n^{q}}\Big[\Big(\frac{(1+t)^{\frac{1}{1-q}} + (1-t)^{\frac{1}{1-q}}}{2}\Big)^{1-q} - 1\Big]$, for all $q\in(0,1)$, $t\in[0,1]$, when $\ell$ is $\ell_{\mathrm{gce}}$.

Proof. By the standard Rademacher complexity bounds [Mohri et al., 2018], for any $\delta > 0$, with probability at least $1-\delta$, the following holds for all $h\in\mathcal H$:
$\big|\mathcal E_{\ell}(h) - \hat{\mathcal E}_{\ell,S}(h)\big| \le 2\mathfrak R^{\ell}_m(\mathcal H) + B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}.$
Fix $\epsilon > 0$. By the definition of the infimum, there exists $h^*\in\mathcal H$ such that $\mathcal E_{\ell}(h^*) \le \mathcal E^*_{\ell}(\mathcal H) + \epsilon$.
By definition of $\hat h_S$, we have
$\mathcal E_{\ell}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) = \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) \le \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(h^*) - \mathcal E^*_{\ell}(\mathcal H) \le \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(h^*) - \mathcal E_{\ell}(h^*) + \epsilon \le 2\Big[2\mathfrak R^{\ell}_m(\mathcal H) + B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}\Big] + \epsilon.$
Since the inequality holds for all $\epsilon > 0$, it implies:
$\mathcal E_{\ell}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) \le 4\mathfrak R^{\ell}_m(\mathcal H) + 2B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}.$
Plugging this inequality into the bounds of Theorem 4.5 completes the proof.

Theorem H.2 (Generalization bound with constrained losses). Assume that $\mathcal H$ is symmetric and complete. Then, for any $1 \le k \le n$, the following top-$k$ generalization bound holds for $\hat h_S$: for any $\delta > 0$, with probability at least $1-\delta$ over the draw of an i.i.d. sample $S$ of size $m$:
$\mathcal E_{\ell_k}(\hat h_S) - \mathcal E^*_{\ell_k}(\mathcal H) + \mathcal M_{\ell_k}(\mathcal H) \le k\,\gamma\Big(4\mathfrak R^{\ell}_m(\mathcal H) + 2B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}} + \mathcal M_{\ell}(\mathcal H)\Big),$
where $\gamma(t) = 2\sqrt t$ when $\ell$ is either $\ell^{\mathrm{cstnd}}_{\exp}$ or $\ell_{\mathrm{sq\text{-}hinge}}$; $\gamma(t) = t$ when $\ell$ is either $\ell_{\mathrm{hinge}}$ or $\ell_{\rho}$.

Proof. By the standard Rademacher complexity bounds [Mohri et al., 2018], for any $\delta > 0$, with probability at least $1-\delta$, the following holds for all $h\in\mathcal H$:
$\big|\mathcal E_{\ell}(h) - \hat{\mathcal E}_{\ell,S}(h)\big| \le 2\mathfrak R^{\ell}_m(\mathcal H) + B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}.$
Fix $\epsilon > 0$. By the definition of the infimum, there exists $h^*\in\mathcal H$ such that $\mathcal E_{\ell}(h^*) \le \mathcal E^*_{\ell}(\mathcal H) + \epsilon$. By definition of $\hat h_S$, we have
$\mathcal E_{\ell}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) = \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) \le \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(h^*) - \mathcal E^*_{\ell}(\mathcal H) \le \mathcal E_{\ell}(\hat h_S) - \hat{\mathcal E}_{\ell,S}(\hat h_S) + \hat{\mathcal E}_{\ell,S}(h^*) - \mathcal E_{\ell}(h^*) + \epsilon \le 2\Big[2\mathfrak R^{\ell}_m(\mathcal H) + B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}\Big] + \epsilon.$
Since the inequality holds for all $\epsilon > 0$, it implies:
$\mathcal E_{\ell}(\hat h_S) - \mathcal E^*_{\ell}(\mathcal H) \le 4\mathfrak R^{\ell}_m(\mathcal H) + 2B^{\ell}\sqrt{\tfrac{\log(2/\delta)}{2m}}.$
Plugging this inequality into the bounds of Theorem E.1 completes the proof.

To the best of our knowledge, Theorems H.1 and H.2 provide the first finite-sample guarantees for the estimation error of the empirical minimizer of comp-sum losses and constrained losses, with respect to the top-$k$ loss, for any $1 \le k \le n$. The proofs use our H-consistency bounds with respect to the top-$k$ loss, as well as standard Rademacher complexity guarantees.
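As a concrete illustration of Theorem H.1 for the logistic loss, the right-hand side can be evaluated numerically by inverting the increasing function ψ with bisection. The numerical values of R_m^ℓ(H), B^ℓ, k, and m below are hypothetical placeholders, not quantities from our experiments:

```python
import math

def psi_log(t):
    """psi for the logistic loss: (1-t)/2 log(1-t) + (1+t)/2 log(1+t), t in [0,1]."""
    a = 0.0 if t >= 1.0 else (1.0 - t) / 2.0 * math.log(1.0 - t)
    return a + (1.0 + t) / 2.0 * math.log(1.0 + t)

def psi_inv(eps, tol=1e-12):
    """Invert the increasing function psi on [0, 1] by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if psi_log(mid) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical values of the quantities appearing in Theorem H.1.
R_m, B, delta, m, k, M = 0.005, math.log(10), 0.05, 1_000_000, 2, 0.0
eps = 4 * R_m + 2 * B * math.sqrt(math.log(2 / delta) / (2 * m)) + M
bound = k * psi_inv(eps)   # bound on the top-k estimation error (plus gap terms)
print(eps, bound)
```

Note that ψ ranges over [0, log 2] for the logistic loss, so the inversion is only meaningful when the argument falls in that interval; larger arguments make the bound vacuous.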
I Proofs of H-consistency bounds for cost-sensitive losses

We first characterize the best-in-class conditional error and the conditional regret of the target cardinality-aware loss function (2), which will be used in the analysis of the H-consistency bounds.

Lemma I.1. Assume that $\mathcal R$ is symmetric and complete. Then, for any $r\in\mathcal R$ and $x\in\mathcal X$, the best-in-class conditional error and the conditional regret of the target cardinality-aware loss function can be expressed as follows:
$\mathcal C^*_{\ell}(\mathcal R,x) = \min_{k\in\mathcal K}\sum_{y\in\mathcal Y} p(x,y)\,c(x,k,y),$
$\Delta\mathcal C_{\ell,\mathcal R}(r,x) = \sum_{y\in\mathcal Y} p(x,y)\,c(x,r(x),y) - \min_{k\in\mathcal K}\sum_{y\in\mathcal Y} p(x,y)\,c(x,k,y).$

Proof. By definition, for any $r\in\mathcal R$ and $x\in\mathcal X$, the conditional error of the target cardinality-aware loss function can be written as $\mathcal C_{\ell}(r,x) = \sum_{y\in\mathcal Y} p(x,y)\,c(x,r(x),y)$. Since $\mathcal R$ is symmetric and complete, we have $\mathcal C^*_{\ell}(\mathcal R,x) = \inf_{r\in\mathcal R}\sum_{y\in\mathcal Y} p(x,y)\,c(x,r(x),y) = \min_{k\in\mathcal K}\sum_{y\in\mathcal Y} p(x,y)\,c(x,k,y)$. Furthermore, the conditional regret can be expressed as $\Delta\mathcal C_{\ell,\mathcal R}(r,x) = \mathcal C_{\ell}(r,x) - \mathcal C^*_{\ell}(\mathcal R,x) = \sum_{y\in\mathcal Y} p(x,y)\,c(x,r(x),y) - \min_{k\in\mathcal K}\sum_{y\in\mathcal Y} p(x,y)\,c(x,k,y)$, which completes the proof.

I.1 Proof of Theorem 4.6

For convenience, we let $\bar c(x,k,y) = 1 - c(x,k,y)$, $q(x,k) = \sum_{y\in\mathcal Y} p(x,y)\,\bar c(x,k,y)\in[0,1]$ and $S(x,k) = \frac{e^{r(x,k)}}{\sum_{k'\in\mathcal K} e^{r(x,k')}}$. We also let $k_{\min}(x) = \operatorname{argmin}_{k\in\mathcal K}\big(1 - q(x,k)\big) = \operatorname{argmin}_{k\in\mathcal K}\sum_{y\in\mathcal Y} p(x,y)\,c(x,k,y)$.

Theorem 4.6. Assume that $\mathcal R$ is symmetric and complete. Then, the following bound holds for the cost-sensitive comp-sum loss: for all $r\in\mathcal R$ and for any distribution,
$\mathcal E_{\ell}(r) - \mathcal E^*_{\ell}(\mathcal R) + \mathcal M_{\ell}(\mathcal R) \le \gamma\big(\mathcal E_{\ell^c_{\mathrm{comp}}}(r) - \mathcal E^*_{\ell^c_{\mathrm{comp}}}(\mathcal R) + \mathcal M_{\ell^c_{\mathrm{comp}}}(\mathcal R)\big);$
When $\mathcal R = \mathcal R_{\mathrm{all}}$, the following holds:
$\mathcal E_{\ell}(r) - \mathcal E^*_{\ell}(\mathcal R_{\mathrm{all}}) \le \gamma\big(\mathcal E_{\ell^c_{\mathrm{comp}}}(r) - \mathcal E^*_{\ell^c_{\mathrm{comp}}}(\mathcal R_{\mathrm{all}})\big),$
where $\gamma(t) = 2\sqrt t$ when $\ell^c_{\mathrm{comp}}$ is either $\ell^c_{\log}$ or $\ell^c_{\exp}$; $\gamma(t) = 2\sqrt{|\mathcal K|^{q}\,t}$ when $\ell^c_{\mathrm{comp}}$ is $\ell^c_{\mathrm{gce}}$; and $\gamma(t) = |\mathcal K|\,t$ when $\ell^c_{\mathrm{comp}}$ is $\ell^c_{\mathrm{mae}}$.

Proof. Case I: $\ell^c_{\mathrm{comp}} = \ell^c_{\log}$. For the cost-sensitive logistic loss $\ell^c_{\log}$, the conditional error can be written as
$\mathcal C_{\ell^c_{\log}}(r,x) = -\sum_{y\in\mathcal Y} p(x,y)\sum_{k\in\mathcal K}\bar c(x,k,y)\log\Big(\frac{e^{r(x,k)}}{\sum_{k'\in\mathcal K} e^{r(x,k')}}\Big) = -\sum_{k\in\mathcal K} q(x,k)\log S(x,k).$
The conditional regret can be written as C ℓc log,R(r,x) = k K log(S(x,k))q(x,k) inf r R( k K log(S(x,k))q(x,k)) k K log(S(x,k))q(x,k) inf µ [ S(x,kmin(x)),S(x,r(x))]( k K log(Sµ(x,k))q(x,k)), where for any x X and k K, Sµ(x,k) = S(x,y), y {kmin(x),r(x)} S(x,kmin(x)) + µ y = r(x) S(x,r(x)) µ y = kmin(x). Note that such a choice of Sµ leads to the following equality holds: k {r(x),kmin(x)} log(S(x,k))q(x,k) = k {r(x),kmin(x)} log(Sµ(x,k))q(x,k). Therefore, the conditional regret of cost-sensitive logistic loss can be lower bounded as C ℓc log,H(h,x) sup µ [ S(x,kmin(x)),S(x,r(x))] {q(x,kmin(x))[ log(S(x,kmin(x))) + log(S(x,r(x)) µ)] + q(x,r(x))[ log(S(x,r(x))) + log(S(x,kmin(x)) + µ)]}. By the concavity of the function, differentiate with respect to µ, we obtain that the supremum is achieved by µ = q(x,r(x))S(x,r(x)) q(x,kmin(x))S(x,kmin(x)) q(x,kmin(x))+q(x,r(x)) . Plug in µ , we obtain C ℓc log,H(h,x) q(x,kmin(x))log (S(x,r(x)) + S(x,kmin(x)))q(x,kmin(x)) S(x,kmin(x))(q(x,kmin(x)) + q(x,r(x))) + q(x,r(x))log (S(x,r(x)) + S(x,kmin(x)))q(x,r(x)) S(x,r(x))(q(x,kmin(x)) + q(x,r(x))) q(x,kmin(x))log 2q(x,kmin(x)) q(x,kmin(x)) + q(x,r(x)) + q(x,r(x))log 2q(x,r(x)) q(x,kmin(x)) + q(x,r(x)) (minimum is achieved when S(x,r(x)) = S(x,kmin(x))) (q(x,r(x)) q(x,kmin(x)))2 2(q(x,r(x)) + q(x,kmin(x))) (alog 2a a+b + blog 2b 2(a+b), a,b [0,1] [Mohri et al., 2018, Proposition E.7]) (q(x,r(x)) q(x,kmin(x)))2 4 . (0 q(x,r(x)) + q(x,kmin(x)) 2) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,kmin(x)) q(x,r(x)) 2( C ℓc log,R(r,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) 2(E ℓc log(r) E ℓc log(R) + M ℓc log(R)) The second part follows from the fact that M ℓc log(Rall) = 0. Case II: ℓc comp = ℓc exp. 
For the cost-sensitive sum exponential loss ℓc exp, the conditional error can be written as C ℓc exp(r,x) = y Y p(x,y) k K c(x,k,y) k k er(x,k ) r(x,k) = k K ( 1 S(x,k) 1)q(x,k). The conditional regret can be written as C ℓc exp,R(r,x) = k K ( 1 S(x,k) 1)q(x,k) inf r R( k K ( 1 S(x,k) 1)q(x,k)) k K ( 1 S(x,k) 1)q(x,k) inf µ [ S(x,kmin(x)),S(x,r(x))]( k K ( 1 Sµ(x,k) 1)q(x,k)), where for any x X and k K, Sµ(x,k) = S(x,y), y {kmin(x),r(x)} S(x,kmin(x)) + µ y = r(x) S(x,r(x)) µ y = kmin(x). Note that such a choice of Sµ leads to the following equality holds: k {r(x),kmin(x)} ( 1 S(x,k) 1)q(x,k) = k {r(x),kmin(x)} ( 1 Sµ(x,k) 1)q(x,k). Therefore, the conditional regret of cost-sensitive sum exponential loss can be lower bounded as C ℓc exp,H(h,x) sup µ [ S(x,kmin(x)),S(x,r(x))] {q(x,kmin(x))[ 1 S(x,kmin(x)) 1 S(x,r(x)) µ] + q(x,r(x))[ 1 S(x,r(x)) 1 S(x,kmin(x)) + µ]}. By the concavity of the function, differentiate with respect to µ, we obtain that the supremum is achieved by µ = q(x,r(x))S(x,r(x)) q(x,kmin(x))S(x,kmin(x)) q(x,kmin(x))+ q(x,r(x)) . Plug in µ , we obtain C ℓc exp,H(h,x) q(x,kmin(x)) S(x,kmin(x)) + q(x,r(x))) S(x,r(x))) ( q(x,kmin(x)) + q(x,r(x)))) 2 S(x,kmin(x)) + S(x,r(x))) q(x,kmin(x)) q(x,r(x)))) 2 (minimum is achieved when S(x,r(x)) = S(x,kmin(x)) = 1 (q(x,r(x))) q(x,kmin(x)))2 q(x,r(x))) + q(x,kmin(x))) 2 (q(x,r(x))) q(x,kmin(x)))2 b 2, a,b [0,1],a + b 2) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,kmin(x)) q(x,r(x)) 2( C ℓc exp,R(r,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) 2(E ℓc exp(r) E ℓc exp(R) + M ℓc exp(R)) The second part follows from the fact that M ℓc exp(Rall) = 0. Case III: ℓc comp = ℓc gce. 
For the cost-sensitive generalized cross-entropy loss ℓc gce, the conditional error can be written as C ℓc gce(r,x) = y Y p(x,y) k K q (1 ( er(x,k) k K er(x,k ) ) q k K (1 S(x,k)q)q(x,k). The conditional regret can be written as C ℓc gce,R(r,x) = 1 q k K (1 S(x,k)q)q(x,k) inf r R(1 q k K (1 S(x,k)q)q(x,k)) q k K (1 S(x,k)q)q(x,k) inf µ [ S(x,kmin(x)),S(x,r(x))](1 q k K (1 Sµ(x,k)q)q(x,k)), where for any x X and k K, Sµ(x,k) = S(x,y), y {kmin(x),r(x)} S(x,kmin(x)) + µ y = r(x) S(x,r(x)) µ y = kmin(x). Note that such a choice of Sµ leads to the following equality holds: k {r(x),kmin(x)} 1 q k K (1 S(x,k)q)q(x,k) = k {r(x),kmin(x)} 1 q k K (1 Sµ(x,k)q)q(x,k). Therefore, the conditional regret of cost-sensitive generalized cross-entropy loss can be lower bounded as C ℓc gce,H(h,x) = 1 q sup µ [ S(x,kmin(x)),S(x,r(x))] {q(x,kmin(x))[ S(x,kmin(x))q + (S(x,r(x)) µ)q] + q(x,r(x))[ S(x,r(x))q + (S(x,kmin(x)) + µ)q]}. By the concavity of the function, differentiate with respect to µ, we obtain that the supremum is achieved by µ = q(x,r(x)) 1 1 q S(x,r(x)) q(x,kmin(x)) 1 1 q S(x,kmin(x)) q(x,kmin(x)) 1 1 q +q(x,r(x)) 1 1 q . Plug in µ , we obtain C ℓc gce,H(h,x) q (S(x,r(x)) + S(x,kmin(x)))q(q(x,kmin(x)) 1 1 q + q(x,r(x)) 1 1 q ) 1 q q q(x,kmin(x))S(x,kmin(x))q 1 q q(x,r(x))S(x,r(x))q 1 q K q [2q(q(x,kmin(x)) 1 1 q + q(x,r(x)) 1 1 q ) 1 q q(x,kmin(x)) q(x,r(x))] (minimum is achieved when S(x,r(x)) = S(x,kmin(x)) = 1 K ) (q(x,r(x)) q(x,kmin(x)))2 (( a 1 1 q +b 1 1 q 2 ) 4(a b)2, a,b [0,1], 0 a + b 1) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,kmin(x)) q(x,r(x)) 2 K q 2 ( C ℓc gce,R(r,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) 2 K q 2 (E ℓc gce(r) E ℓc gce(R) + M ℓc gce(R)) The second part follows from the fact that M ℓc gce(Rall) = 0. Case IV: ℓc comp = ℓc mae. 
For the cost-sensitive mean absolute error loss ℓc mae, the conditional error can be written as C ℓc mae(r,x) = y Y p(x,y) k K c(x,k,y)(1 ( er(x,k) k K er(x,k ) )) = k K (1 S(x,k))q(x,k). The conditional regret can be written as C ℓc mae,R(r,x) = k K (1 S(x,k))q(x,k) inf r R( k K (1 S(x,k))q(x,k)) k K (1 S(x,k))q(x,k) inf µ [ S(x,kmin(x)),S(x,r(x))]( k K (1 Sµ(x,k))q(x,k)), where for any x X and k K, Sµ(x,k) = S(x,y), y {kmin(x),r(x)} S(x,kmin(x)) + µ y = r(x) S(x,r(x)) µ y = kmin(x). Note that such a choice of Sµ leads to the following equality holds: k K (1 S(x,k))q(x,k) = k K (1 Sµ(x,k))q(x,k). Therefore, the conditional regret of cost-sensitive mean absolute error can be lower bounded as C ℓc mae,H(h,x) sup µ [ S(x,kmin(x)),S(x,r(x))] {q(x,kmin(x))[ S(x,kmin(x)) + S(x,r(x)) µ] + q(x,r(x))[ S(x,r(x)) + S(x,kmin(x)) + µ]}. By the concavity of the function, differentiate with respect to µ, we obtain that the supremum is achieved by µ = S(x,kmin(x)). Plug in µ , we obtain C ℓc mae,H(h,x) q(x,kmin(x))S(x,r(x)) q(x,r(x))S(x,r(x)) K (q(x,kmin(x)) q(x,r(x))). (minimum is achieved when S(x,r(x)) = 1 K ) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,kmin(x)) q(x,r(x)) K ( C ℓc mae,R(r,x)). By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) K (E ℓc mae(r) E ℓc mae(R) + M ℓc mae(R)). The second part follows from the fact that M ℓc mae(Rall) = 0. I.2 Proof of Theorem 4.7 The conditional error for the cost-sensitive constrained loss can be expressed as follows: C ℓc cstnd(r,x) = y Y p(x,y) ℓc cstnd(r,x,y) = y Y p(x,y) k K c(x,k,y)Φ( r(x,k)) = k K q(x,k)Φ( r(x,k)), where q(x,k) = y Y p(x,y)c(x,k,y) [0,1]. Let kmin(x) = argmink K q(x,k). 
We denote by Φexp t e t the exponential loss function, Φsq hinge t max{0,1 t}2 the squared hinge loss function, Φhinge t max{0,1 t} the hinge loss function, and Φρ t min{max{0,1 t/ρ},1}, ρ > 0 the ρ-margin loss function. Theorem 4.7. Assume that R is symmetric and complete. Then, the following bound holds for the cost-sensitive constrained loss: for all r R and for any distribution, Eℓ(r) E ℓ(R) + Mℓ(R) γ(E ℓc cstnd(r) E ℓc cstnd(R) + M ℓc cstnd(R)); When R = Rall, the following holds: Eℓ(r) E ℓ(Rall) γ(E ℓc cstnd(r) E ℓc cstnd(Rall)), where t when ℓc cstnd is ℓcstnd c exp or ℓc sq hinge; γ(t) = t when ℓc cstnd is ℓc hinge or ℓc ρ. Proof. Case I: ℓ= ℓcstnd c exp. For the cost-sensitive constrained exponential loss ℓcstnd c exp, the conditional regret can be written as C ℓcstnd c exp,R(r,x) = k K q(x,k)Φexp( r(x,k)) inf r R k K q(x,k)Φexp( r(x,k)) k K q(x,k)Φexp( r(x,k)) inf µ R k K q(x,k)Φexp( rµ(x,k)), where for any k K, rµ(x,k) = r(x,y), y {kmin(x),r(x)} r(x,kmin(x)) + µ y = r(x) r(x,r(x)) µ y = kmin(x). Note that such a choice of rµ leads to the following equality holds: k {r(x),kmin(x)} q(x,k)Φexp( r(x,k)) = k {r(x),kmin(x)} k K q(x,k)Φexp( rµ(x,k)). Therefore, the conditional regret of cost-sensitive constrained exponential loss can be lower bounded as C ℓcstnd c exp,R(r,x) inf r Rsup µ R { q(x,kmin(x))(er(x,kmin(x)) er(x,r(x)) µ) + q(x,r(x))(er(x,r(x)) er(x,kmin(x))+µ)} q(x,kmin(x)) q(x,r(x))) 2 (differentiating with respect to µ, r to optimize) = q(x,r(x)) q(x,kmin(x)) q(x,kmin(x)) + 4( q(x,r(x)) q(x,kmin(x)))2. (0 q(x,k) 1) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,r(x)) q(x,kmin(x)) 2( C ℓcstnd c exp,R(r,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) 2(E ℓcstnd c exp(r) E ℓcstnd c exp(R) + M ℓcstnd c exp(R)) The second part follows from the fact that M ℓcstnd c exp(Rall) = 0. 
Case II: ℓ= ℓc sq hinge. For the cost-sensitive constrained squared hinge loss ℓc sq hinge, the conditional regret can be written as C ℓc sq hinge,R(r,x) = k K q(x,k)Φsq hinge( r(x,k)) inf r R k K q(x,k)Φsq hinge( r(x,k)) k K q(x,k)Φsq hinge( r(x,k)) inf µ R k K q(x,k)Φsq hinge( rµ(x,k)), where for any k K, r(x,y), y {kmin(x),r(x)} r(x,kmin(x)) + µ y = r(x) r(x,r(x)) µ y = kmin(x). Note that such a choice of rµ leads to the following equality holds: k {r(x),kmin(x)} q(x,k)Φsq hinge( r(x,k)) = k {r(x),kmin(x)} k K q(x,k)Φsq hinge( rµ(x,k)). Therefore, the conditional regret of cost-sensitive constrained squared hinge loss can be lower bounded as C ℓc sq hinge,R(r,x) inf r Rsup µ R { q(x,kmin(x))(max{0,1 + r(x,kmin(x))}2 max{0,1 + r(x,r(x)) µ}2) + q(x,r(x))(max{0,1 + r(x,r(x))}2 max{0,1 + r(x,kmin(x)) + µ}2)} 4( q(x,kmin(x)) q(x,r(x)))2. (differentiating with respect to µ, r to optimize) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,r(x)) q(x,kmin(x)) 2( C ℓc sq hinge,R(r,x)) By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) 2(E ℓc sq hinge(r) E ℓc sq hinge(R) + M ℓc sq hinge(R)) The second part follows from the fact that M ℓc sq hinge(Rall) = 0. Case III: ℓ= ℓc hinge. For the cost-sensitive constrained hinge loss ℓc hinge, the conditional regret can be written as C ℓc hinge,R(r,x) = k K q(x,k)Φhinge( r(x,k)) inf r R k K q(x,k)Φhinge( r(x,k)) k K q(x,k)Φhinge( r(x,k)) inf µ R k K q(x,k)Φhinge( rµ(x,k)), where for any k K, r(x,y), y {kmin(x),r(x)} r(x,kmin(x)) + µ y = r(x) r(x,r(x)) µ y = kmin(x). Note that such a choice of rµ leads to the following equality holds: k {r(x),kmin(x)} q(x,k)Φhinge( r(x,k)) = k {r(x),kmin(x)} k K q(x,k)Φhinge( rµ(x,k)). 
Therefore, the conditional regret of cost-sensitive constrained hinge loss can be lower bounded as C ℓc hinge,R(r,x) inf r Rsup µ R {q(x,kmin(x))(max{0,1 + r(x,kmin(x))} max{0,1 + r(x,r(x)) µ}) + q(x,r(x))(max{0,1 + r(x,r(x))} max{0,1 + r(x,kmin(x)) + µ})} q(x,r(x)) q(x,kmin(x)). (differentiating with respect to µ, r to optimize) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,r(x)) q(x,kmin(x)) C ℓc hinge,R(r,x). By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) E ℓc hinge(r) E ℓc hinge(R) + M ℓc hinge(R). The second part follows from the fact that M ℓc hinge(Rall) = 0. Case IV: ℓ= ℓc ρ. For the cost-sensitive constrained ρ-margin loss ℓc ρ, the conditional regret can be written as C ℓc ρ,R(r,x) = k K q(x,k)Φρ( r(x,k)) inf r R k K q(x,k)Φρ( r(x,k)) k K q(x,k)Φρ( r(x,k)) inf µ R k K q(x,k)Φρ( rµ(x,k)), where for any k K, r(x,y), y {kmin(x),r(x)} r(x,kmin(x)) + µ y = r(x) r(x,r(x)) µ y = kmin(x). Note that such a choice of rµ leads to the following equality holds: k {r(x),kmin(x)} q(x,k)Φρ( r(x,k)) = k {r(x),kmin(x)} k K q(x,k)Φρ( rµ(x,k)). Therefore, the conditional regret of cost-sensitive constrained ρ-margin loss can be lower bounded as C ℓc ρ,R(r,x) inf r Rsup µ R { q(x,kmin(x))(min{max{0,1 + r(x,kmin(x)) ρ },1} min{max{0,1 + r(x,r(x)) µ + q(x,r(x))(min{max{0,1 + r(x,r(x)) ρ },1} min{max{0,1 + r(x,kmin(x)) + µ q(x,r(x)) q(x,kmin(x)). (differentiating with respect to µ, r to optimize) Therefore, by Lemma I.1, the conditional regret of the target cardinality aware loss function can be upper bounded as follows: Cℓ,H(r,x) = q(x,r(x)) q(x,kmin(x)) C ℓc ρ,R(r,x). By the concavity, taking expectations on both sides of the preceding equation, we obtain Eℓ(r) E ℓ(R) + Mℓ(R) E ℓc ρ(r) E ℓc ρ(R) + M ℓc ρ(R). The second part follows from the fact that M ℓc ρ(Rall) = 0. 
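To illustrate the mechanism behind the comp-sum case of these proofs, the sketch below uses hypothetical values of q(x,·), which the analysis only assumes to lie in [0,1], and checks that the minimizer of the conditional error of the cost-sensitive logistic loss, which satisfies S(x,k) ∝ q(x,k) by Gibbs' inequality, places its largest score on argmax_k q(x,k), i.e. on the Bayes-optimal cardinality k_min(x) of Appendix I.1:

```python
import numpy as np

# Hypothetical example: 4 candidate cardinalities, e.g. K = {1, 2, 4, 8}.
# q(x, k) = sum_y p(x, y) (1 - c(x, k, y)), posited here as values in [0, 1].
q = np.array([0.35, 0.62, 0.71, 0.58])

def conditional_error_log(r, q):
    """C_{l^c_log}(r, x) = -sum_k q(x, k) log S(x, k), with S = softmax(r(x, .))."""
    z = r - r.max()
    log_S = z - np.log(np.exp(z).sum())
    return -(q * log_S).sum()

# The minimizer satisfies S(x, k) = q(x, k) / sum_k' q(x, k'), so any scores of
# the form log q (up to an additive shift) are optimal, and the top-scored
# cardinality is argmax_k q(x, k) = k_min(x).
r_opt = np.log(q / q.sum())
assert conditional_error_log(r_opt, q) < conditional_error_log(np.zeros(4), q)
assert int(np.argmax(r_opt)) == int(np.argmax(q))
```

This is a numerical restatement, under the stated assumptions, of why the cost-sensitive logistic surrogate is consistent for the target cardinality-aware loss: its pointwise minimizer recovers the minimizer of the conditional error characterized in Lemma I.1.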
Figure 5: Accuracy versus cardinality on CIFAR-100, ImageNet, CIFAR-10, and SVHN for cost(|g_k(x)|) = log k, comparing top-k classifiers with the cardinality-aware algorithm for λ ∈ {0.01, 0.05, 0.1}. Each curve of the cardinality-aware algorithm is for a fixed value of λ and the points on the curve are obtained by varying the number of experts.

Figure 6: Accuracy versus cardinality on CIFAR-100, ImageNet, CIFAR-10, and SVHN for cost(|g_k(x)|) = log k and cost(|g_k(x)|) = k, with λ = 0.05. The points on each curve of the cardinality-aware algorithm are obtained by varying the number of experts.

J Additional experimental results: top-k classifiers

Here, we report additional experimental results with different choices of the set K and of cost(|g_k(x)|) on the benchmark datasets CIFAR-10, CIFAR-100 [Krizhevsky, 2009], SVHN [Netzer et al., 2011], and ImageNet [Deng et al., 2009], and show that our cardinality-aware algorithm consistently outperforms top-k classifiers across all configurations.
In Figure 5 and Figure 6, we began with the set K = {1} for the loss function and then progressively expanded it by adding choices of larger cardinality, each of which doubles the largest value currently in K. The largest set K for the CIFAR-100 and ImageNet datasets is {1,2,4,8,16,32,64}, whereas for the CIFAR-10 and SVHN datasets, it is {1,2,4,8}. As the set K expands, both the average cardinality and the accuracy increase.

Figure 7: Accuracy versus cardinality on an artificial dataset for different training sample sizes m, comparing conformal prediction with the cardinality-aware algorithm for m = 50,000 and m = 500,000; the point corresponding to r* is also shown.

Figure 5 shows that the accuracy versus cardinality curve of the cardinality-aware algorithm lies above that of top-k classifiers for various values of λ. Figure 6 compares cost(|g_k(x)|) = k with cost(|g_k(x)|) = log k for λ = 0.05. These results demonstrate that different values of λ and different choices of cost(|g_k(x)|) lead to essentially the same curve, which confirms the effectiveness and benefits of our algorithm.

K Additional experimental results: threshold-based classifiers

We first characterize the Bayes predictor r* in this setting. We say that the scenario is deterministic if, for all x ∈ X, there exists some true label y ∈ Y such that p(x,y) = 1; otherwise, we say that the scenario is stochastic. To simplify the discussion, we assume that |g_k(x)| is an increasing function of k, for any x.

Lemma K.1. Consider the deterministic scenario. Assume that λ cost(|g_k(x)|) ≤ 1 for all k and x ∈ X. Then, the Bayes predictor r* for the cardinality-aware loss function ℓ satisfies r*(x) = argmin_{k : y ∈ g_k(x)} k, that is, the smallest k such that the true label y is in g_k(x).

Proof. By the assumption, for k < r*(x), we can write c(x, r*(x), y) = λ cost(|g_{r*(x)}(x)|) ≤ 1 ≤ 1_{y ∉ g_k(x)} + λ cost(|g_k(x)|) = c(x, k, y). Furthermore, since |g_k(x)| is an increasing function of k, we have c(x, r*(x), y) = λ cost(|g_{r*(x)}(x)|) ≤ λ cost(|g_{k'}(x)|) = c(x, k', y) for k' > r*(x).

Lemma K.2.
Consider the stochastic scenario. The Bayes predictor r* for the cardinality-aware loss function ℓ satisfies:
r*(x) = argmin_{k ∈ K} [λ·cost(g_k(x)) − Σ_{y ∈ g_k(x)} p(x, y)].

Proof. The conditional error can be written as follows:
C_ℓ(r, x) = Σ_{y ∈ Y} p(x, y) c(x, r(x), y)
= Σ_{y ∈ Y} p(x, y) (1_{y ∉ g_{r(x)}(x)} + λ·cost(g_{r(x)}(x)))
= Σ_{y ∈ Y} p(x, y) 1_{y ∉ g_{r(x)}(x)} + λ·cost(g_{r(x)}(x))
= 1 − Σ_{y ∈ g_{r(x)}(x)} p(x, y) + λ·cost(g_{r(x)}(x)).
Thus, the Bayes predictor can be characterized as
r*(x) = argmin_{k ∈ K} [λ·cost(g_k(x)) − Σ_{y ∈ g_k(x)} p(x, y)]. ∎

Figure 8: Accuracy versus cardinality on the CIFAR-100, ImageNet, CIFAR-10, and SVHN datasets (conformal prediction versus the cardinality-aware algorithm; the optimal point r* is also indicated).

It is clear that Lemma K.2 implies Lemma K.1 when there exists some true label y ∈ Y such that p(x, y) = 1 and λ·cost(g_k(x)) ≤ 1.

We first consider an artificial dataset containing 10 classes, each modeled by a Gaussian distribution in a 100-dimensional space. As in Section 5, we plot the accuracy versus cardinality curve of the cardinality-aware algorithm by varying λ, where the set predictors used are threshold-based classifiers, and compare it with that of conformal prediction. In Figure 7, we also indicate the point corresponding to r*. The problem is close to being realizable, since we can train a predictor that performs almost as well as r* on the test set. Thus, the minimizability gaps vanish, and our H-consistency bounds (Theorems 4.6 and 4.7) then suggest that, with sufficient training data, we can get close to the optimal solution and therefore outperform conformal prediction. For some tasks, however, the problem is hard, and it appears that a very large training sample would be needed.
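Under the characterization of Lemma K.2, the Bayes-optimal cardinality can be computed directly from estimated conditional class probabilities. The following Python sketch illustrates this for nested top-k sets g_k(x) (the classes with the k highest probabilities); the function name and inputs are our own illustrative assumptions, not the paper's implementation.

```python
import math


def bayes_cardinality(probs, K, lam, cost=math.log):
    """Illustrative Bayes predictor of Lemma K.2 for a single input x.

    probs: estimated class probabilities p(. | x).
    K: candidate cardinalities, e.g. [1, 2, 4, 8].
    lam: trade-off parameter lambda; cost: cost function, e.g. cost(k) = log k.
    Returns the k in K minimizing lam * cost(k) - sum_{y in g_k(x)} p(x, y),
    where g_k(x) is the top-k set of classes ranked by probability.
    """
    ranked = sorted(probs, reverse=True)  # probability mass, best class first

    def objective(k):
        # cost of the set, minus the probability mass it captures
        return lam * cost(k) - sum(ranked[:k])

    return min(K, key=objective)


# A peaked distribution warrants a small set; a flat one, a larger set.
print(bayes_cardinality([0.9, 0.05, 0.03, 0.02], K=[1, 2, 4], lam=0.5))   # 1
print(bayes_cardinality([0.25, 0.25, 0.25, 0.25], K=[1, 2, 4], lam=0.5))  # 4
```

In the deterministic scenario of Lemma K.1, the true class has probability 1 and this objective reduces to picking the smallest k whose set contains it, matching the earlier characterization.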
Figure 7 demonstrates that, on the artificial dataset with training sample size m = 50,000, the performance of our cardinality-aware algorithm is only slightly better than that of conformal prediction. If we increase the training sample size to m = 500,000, the curve of our algorithm moves much closer to the optimal point and significantly outperforms conformal prediction. Additionally, for a weaker scoring function, a smaller training sample suffices in many cases, and our cardinality-aware algorithm can outperform conformal prediction on real datasets, as shown in Figure 8.

L Future work

While our framework of cardinality-aware set prediction is very general, being applicable to any collection of set predictors (Section 2), leads to novel cardinality-aware algorithms (Section 3), benefits from theoretical guarantees given sufficient training data (Section 4), and demonstrates effectiveness and empirical advantages in top-k classification (Section 5), the learning problem can be challenging for certain tasks, often requiring a very large training sample (as shown in Appendix K). This underscores the need for a more detailed investigation to enhance our algorithms in these scenarios.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: See lines 81-86 in Section 1.

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: See Appendix L.

Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper.
The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: See Section 4 and Appendices A, B, C, D, E, F, H, and I.

Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: See Section 3, Section 5, Appendix J, and Appendix K.

Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility.
In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: See Section 5, Appendix J, and Appendix K.

Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6.
Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: See Section 5, Appendix J, and Appendix K.

Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: See Figure 2, Figure 5, and Figure 6.

Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: For each model training, we use an Nvidia A100 GPU.

Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: The authors have reviewed the NeurIPS Code of Ethics.

Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10.
Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Our work is theoretical in nature and we do not anticipate any immediate negative societal impact.

Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: See Section 5, Appendix J, and Appendix K.

Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset.
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.