# multilabel_learning_with_stronger_consistency_guarantees__70e69098.pdf

Multi-Label Learning with Stronger Consistency Guarantees

Anqi Mao Courant Institute New York, NY 10012 aqmao@cims.nyu.edu

Mehryar Mohri Google Research & CIMS New York, NY 10011 mohri@google.com

Yutao Zhong Courant Institute New York, NY 10012 yutao@cims.nyu.edu

We present a detailed study of surrogate losses and algorithms for multi-label learning, supported by H-consistency bounds. We ﬁrst show that, for the simplest form of multi-label loss (the popular Hamming loss), the well-known consistent binary relevance surrogate suffers from a sub-optimal dependency on the number of labels in terms of H-consistency bounds, when using smooth losses such as logistic losses. Furthermore, this loss function fails to account for label correlations. To address these drawbacks, we introduce a novel surrogate loss, multi-label logistic loss, that accounts for label correlations and beneﬁts from label-independent Hconsistency bounds. We then broaden our analysis to cover a more extensive family of multi-label losses, including all common ones and a new extension deﬁned based on linear-fractional functions with respect to the confusion matrix. We also extend our multi-label logistic losses to more comprehensive multi-label comp-sum losses, adapting comp-sum losses from standard classiﬁcation to the multi-label learning. We prove that this family of surrogate losses beneﬁts from H-consistency bounds, and thus Bayes-consistency, across any general multi-label loss. Our work thus proposes a uniﬁed surrogate loss framework beneﬁting from strong consistency guarantees for any multi-label loss, signiﬁcantly expanding upon previous work which only established Bayes-consistency and for speciﬁc loss functions. Additionally, we adapt constrained losses from standard classiﬁcation to multi-label constrained losses in a similar way, which also beneﬁt from Hconsistency bounds and thus Bayes-consistency for any multi-label loss. We further describe efﬁcient gradient computation algorithms for minimizing the multi-label logistic loss.

1 Introduction

Supervised learning methods often assign a single label to each instance. However, real-world data exhibits a more complex structure, with objects belonging to multiple categories simultaneously. Consider a video about sports training, which could be categorized as both health and athletics, or a culinary blog post tagged with cooking and nutrition . As a result, multi-label learning [Mc Callum, 1999, Schapire and Singer, 2000] has become increasingly important, leading to the development of various interesting and effective approaches, predominantly experimental in nature, in recent years [Elisseeff and Weston, 2001, Deng et al., 2011, Petterson and Caetano, 2011, Kapoor et al., 2012].

Although there is a rich literature on multi-label learning (see [Zhang and Zhou, 2013] and [Bogatinovski et al., 2022] for detailed surveys), only a few studies focus on the theoretical analysis of multi-label learning, particularly the study of the Bayes-consistency of surrogate losses [Zhang, 2004a,b, Bartlett et al., 2006, Tewari and Bartlett, 2007, Steinwart, 2007].

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Gao and Zhou [2011] initiated the study of Bayes-consistency in multi-label learning with respect to Hamming loss and (partial) ranking loss. They provided negative results for ranking loss, demonstrating that no convex and differentiable pairwise surrogate loss is Bayes-consistent for that multi-label loss. They also showed that the binary relevance method, which learns an independent binary classiﬁer for each of the l labels, is Bayes-consistent with respect to the Hamming loss. Dembczynski et al. [2011] further demonstrated that under the assumption of conditionally independent labels, the binary relevance method is also Bayes-consistent with respect to the Fβ measure loss. However, they noted that it can perform arbitrarily poorly when this assumption does not hold. Dembczynski et al. [2012] provided a positive result for the (partial) ranking loss by showing that the simpler univariate variants of smooth surrogate losses are Bayes-consistent with respect to it. Additionally, Zhang et al. [2020] proposed a family of Bayes-consistent surrogate losses for the Fβ measure by reducing the Fβ learning problem to a set of binary class probability estimation problems. This approach was motivated by the consistent output coding scheme in [Ramaswamy et al., 2014] for general multiclass problems. Other works have studied generalization bounds in multi-label learning [Yu et al., 2014, Wydmuch et al., 2018, Wu and Zhu, 2020, Wu et al., 2021, 2023, Busa-Fekete et al., 2022].

Another related topic is the characterization of the Bayes classiﬁer and corresponding Bayes-consistent plug-in algorithm in multi-label learning. This includes the characterization of the Bayes classiﬁer for subset 0/1 loss and Hamming loss in [Cheng et al., 2010] and the characterization of the Bayes classiﬁer for F1 measure in [Dembczynski et al., 2011]. Dembczynski et al. [2013], Waegeman et al. [2014] further extended the results in [Dembczynski et al., 2011] by designing a Bayes-consistent plug-in algorithm for the Fβ measure. Koyejo et al. [2015] characterized the Bayes classiﬁer for general linear fractional losses with respect to the confusion matrix and designed the corresponding plug-in algorithms in the empirical utility maximization (EUM) framework. In this framework, the measures are directly deﬁned as functions of the population, in contrast to a loss function that is deﬁned as a function over a single instance in the decision theoretic analysis (DTA) framework [Ye et al., 2012]. Menon et al. [2019] studied the Bayes-consistency of various reduction methods with respect to Precision@κ and Recall@κ in multi-label learning. However, all these publications only established Bayes-consistency for speciﬁc loss functions. Can we derive a uniﬁed surrogate loss framework that is Bayes-consistent for any multi-label loss?

Furthermore, as Awasthi, Mao, Mohri, and Zhong [2022a,b] pointed out, Bayes-consistency is an asymptotic guarantee and does not provide convergence guarantees. It also applies only to the family of all measurable functions unlike the restricted hypothesis sets typically used in practice. Instead, they proposed a stronger guarantee known as H-consistency bounds, which are both non-asymptotic and account for the hypothesis set while implying Bayes-consistency. These guarantees provide upper bounds on the target estimation error in terms of the surrogate estimation error. Can we leverage this state-of-the-art consistency guarantee when designing surrogate loss functions for multi-label learning?

Moreover, one of the main concerns in multi-label learning is label correlations (see [Dembczy nski et al., 2012]). For the simplest form of multi-label loss, the popular Hamming loss, the existing Bayes-consistent binary relevance surrogate fails to account for label correlations. Can we design consistent loss functions that effectively account for label correlations as well?

Our Contributions. This paper directly addresses these key questions in multi-label learning. We present a detailed study of surrogate losses and algorithms for multi-label learning, supported by H-consistency bounds.

In Section 3, we ﬁrst show that for the simplest form of multi-label loss, the popular Hamming loss, the well-known consistent binary relevance surrogate, when using smooth losses such as logistic losses, suffers from a sub-optimal dependency on the number of labels in terms of H-consistency bounds. Furthermore, this loss function fails to account for label correlations.

To address these drawbacks, we introduce a novel surrogate loss, multi-label logistic loss, that accounts for label correlations and beneﬁts from label-independent H-consistency bounds (Section 4). We then broaden our analysis to cover a more extensive family of multi-label losses, including all common ones and a new extension deﬁned based on linear-fractional functions with respect to the confusion matrix (Section 5).

In Section 6, we also extend our multi-label logistic losses to more comprehensive multi-label comp-sum losses, adapting comp-sum losses from standard classiﬁcation to the multi-label learning.

We prove that this family of surrogate losses beneﬁts from H-consistency bounds, and thus Bayesconsistency, across any general multi-label loss. Our work thus proposes a uniﬁed surrogate loss framework that is Bayes-consistent for any multi-label loss, signiﬁcantly expanding upon previous work which only established consistency for speciﬁc loss functions.

Additionally, we adapt constrained losses from standard classiﬁcation to multi-label constrained losses in a similar way, which also beneﬁt from H-consistency bounds and thus Bayes-consistency for any multi-label loss (Section 7). We further describe efﬁcient gradient computation algorithms for minimizing the multi-label logistic loss (Section 8).

2 Preliminaries

Multi-label learning. We consider the standard multi-label learning setting. Let X be the input space and Y = {+1, 1}l the set of all possible labels or tags, where l is a ﬁnite number. For example, X can be a set of images, and Y can be a set of l pre-given tags (such as ﬂowers , shoes , or books ) that can be associated with each image in the image tagging problem. Let n = Y . For any instance x X and its associated label y = (y1,...,yl) Y, if yi = +1, we say that label i is relevant to x. Otherwise, it is not relevant. Let [l] = {1,...,l}. Given a sample S drawn i.i.d. according to some distribution D over X Y, the goal of multi-label learning is to learn a hypothesis h X [l] R to minimize the generalization error deﬁned by a multi-label loss function L Hall X Y R, RL(h) = E (x,y) D[L(h,x,y)], (1)

where Hall is the family of all measurable hypotheses. For convenience, we abusively denote the scoring vector by h(x) = (h(x,1),...,h(x,l)). Given a hypothesis set H Hall, we denote by R L(H) = infh H R(h) the best-in-class error. We refer to the difference RL(h) R L(H) as the estimation error, which is termed the excess error when H = Hall. Let sign t 1t 0 1t<0 be the sign function, and let t X R+ be a threshold function. The target loss function L can be typically given by a function L mapping from Y Y to real numbers: L(h,x,y) = L(h(x),y), (2) where h(x) = [h1(x),...,hl(x)] Y is the prediction for the input x X and hi(x) = sign(h(x,i)) for any i [l]. There are many multi-label loss functions, such as Hamming loss, (partial) ranking loss, F1 and the more general Fβ measure loss, subset 0/1 loss, precision@κ, recall@κ, etc. [Zhang and Zhou, 2013]. Among these, several loss functions are deﬁned based on the prediction of the hypothesis h(x), while others are based on the scoring vector h(x). We will speciﬁcally consider the ﬁrst type of multi-label loss in the form given in (2), which is based on some distance between the prediction and the true label. This includes all the loss functions previously mentioned (see Section 5 for a list of several common multi-label losses in this family) but excludes the (partial) ranking loss, which is deﬁned based on pairwise scores. For convenience, we may alternatively refer to L or its induced L as the multi-label loss. Without loss of generality, we assume that L [0,1], which can be achieved through normalization. We also denote by Lmax = maxy ,y L(y ,y). Our analysis is general and adapts to any multi-label loss L. Note that we adhere to the decision-theoretic analysis (DTA) framework, in which a loss function is deﬁned over a single instance, and the measure is the expected loss, also known as the generalization error (the expectation of the loss function over samples). Another popular framework is empirical utility maximization (EUM), where measures are deﬁned directly as functions of the population, that is, as a function of the expectation over samples.

Surrogate risk minimization and consistency. Minimizing the multi-label loss L directly is computationally hard for most hypothesis sets because it is discrete and non-convex. A common method involves minimizing a smooth surrogate loss function L Hall X Y R, which is the main focus of this paper. Minimizing a surrogate loss directly leads to an algorithm for multi-label learning. A desirable guarantee for the surrogate loss in multi-label learning is Bayes-consistency [Zhang, 2004a,b, Bartlett et al., 2006, Tewari and Bartlett, 2007, Steinwart, 2007, Gao and Zhou, 2011]. That is, minimizing the surrogate loss over the family of all measurable functions leads to the minimization of the multi-label loss over the same family: Deﬁnition 2.1. A surrogate loss L is said to be Bayes-consistent with respect to a multi-label loss L if the following holds for any distribution and all given sequences of hypotheses {hn}n N Hall:

( lim n + R L(hn) R L(Hall) = 0) Ô ( lim n + RL(hn) R L(Hall) = 0). (3)

As pointed out by Awasthi, Mao, Mohri, and Zhong [2022a,b] (see also [Long and Servedio, 2013, Zhang and Agarwal, 2020, Awasthi et al., 2021a,b, Mao et al., 2023d,c,a,b,e, Awasthi et al., 2023, 2024, Mao et al., 2024a,b,c,h,g,e,h,d,f, Mohri et al., 2024, Cortes et al., 2024]), Bayes-consistency is an asymptotic guarantee that cannot provide any guarantee for approximate minimizers; it also applies only to the family of all measurable functions and does not consider the hypothesis sets typically used in practice. Instead, they propose a stronger guarantee known as H-consistency bounds, which are both non-asymptotic and dependent on the hypothesis set, and imply Bayes-consistency when H = Hall. These guarantees provide upper bounds on the target estimation error in terms of the surrogate estimation error. In the multi-label learning scenario, they can be formulated as follows:

Deﬁnition 2.2. A surrogate loss L is said to admit an H-consistency bound with respect to a multilabel loss L if the following condition holds for any distribution and for all hypotheses h H, given a concave function Γ R+ R+ with Γ(0) = 0:

RL(h) R L(H) + ML(H) Γ(R L(h) R L(H) + M L(H)). (4)

The quantities M L(H) appearing in the bounds are called minimizability gaps, which measure the difference between the best-in-class error and the expected best pointwise error for a loss function L and a hypothesis set H:

M L(H) = R L(H) E x[ inf h H( E y x[ L(h,x,y)])] 0. (5)

These are inherent quantities depending on the distribution and hypothesis set, which we cannot hope to minimize. Since Γ is concave and Γ(0) = 0, Γ is sub-additive and an H-consistency bound (4) implies that: RL(h) R L(H) + ML(H) Γ(R L(h) R L(H)) + Γ(M L(H)). Therefore, when the surrogate estimation error (R L(h) R L(H)) is minimized to ϵ, the target estimation error (RL(h) R L(H)) is upper bounded by Γ(ϵ) + Γ(M L(H)). The minimizability gaps vanish when H = Hall or in more general realizable cases, such as when R L(H) = R L(Hall) [Steinwart, 2007, Awasthi, Mao, Mohri, and Zhong, 2022b, Mao, Mohri, and Zhong, 2023f]. In these cases, H-consistency bounds imply the H-consistency of a surrogate loss L with respect to a multi-label loss L: R L(h) R L(H) ϵ Ô RL(h) R L(H) Γ(ϵ), for any ϵ 0. The minimizability gap M L(H)

is upper bounded by the approximate error A L(H) = R L(H) Ex[infh Hall(Ey x[ L(h,x,,y)])] and is generally a ﬁner quantity [Mao et al., 2023f]. Thus, H-consistency bounds are more informative, more favorable, and stronger than excess error bounds, and they imply these bounds when H = Hall.

Next, we will study surrogate loss functions and algorithms for multi-label learning, supported by H-consistency bounds, the state-of-the-art consistency guarantee for surrogate risk minimization.

3 Existing consistent surrogates for the Hamming loss

In the section, we consider the simplest form of multi-label loss, the Hamming loss, deﬁned as:

(h,x,y) H X Y, Lham(h,x,y) = Lham(h(x),y), where Lham(y ,y) = l i=1 1yi y i. (6)

The existing Bayes-consistent surrogate loss function is to transform the multi-label learning into l independent binary classiﬁcation tasks [Gao and Zhou, 2011], deﬁned as for all (h,x,y) H X Y,

Lbr(h,x,y) = l i=1 Φ(yih(x,i)), (7)

where Φ R R+ is a binary margin-based loss function, such as the logistic loss u log(1 + e u). The algorithm that minimizes this surrogate loss is known as binary relevance [Zhang and Zhou, 2013], which learns an independent binary classiﬁer for each of the l labels. Gao and Zhou [2011, Theorem 15] shows that Lbr is Bayes-consistent with respect to Lham if Φ is Bayes-consistent with respect to ℓ0 1 (f,x,y) 1y sign(f(x)), the binary zero-one loss. Here, we prove a stronger result that Lbr admits an H-consistency bound with respect to Lham with a functional form lΓ(

l) if Φ admits an H-consistency bounds with respect to ℓ0 1 with a functional form Γ( ). Let F be a hypothesis set consisting of functions mapping from X to R.

Theorem 3.1. Let H = Fl. Assume that the following F-consistency bound holds in the binary classiﬁcation, for some concave function Γ R R+: f F, Rℓ0 1(f) R ℓ0 1(F) + Mℓ0 1(F) Γ(RΦ(f) R Φ(F) + MΦ(F)). (8) Then, the following H-consistency bound holds in the multi-label learning: for all h H,

RLham(h) R Lham(H) + MLham(H) lΓ

R Lbr(h) R Lbr(H) + M Lbr(H)

The proof is included in Appendix A. We say that a hypothesis set F is complete if {f(x) f F} = R for all x X. This notion of completeness is broadly applicable and holds for commonly used hypothesis sets in practice, including linear hypotheses, multi-layer feed-forward neural networks, and all measurable functions. For such complete hypothesis sets F and with smooth functions Φ like the logistic loss function, Γ admits a square root dependency in the binary classiﬁcation [Awasthi et al., 2022a, Mao et al., 2024h]. Thus, by Theorem 3.1, we obtain the following result. Corollary 3.2. Let H = Fl. Assume that F is complete and Φ(u) = log(1+e u). Then, the following H-consistency bound holds in the multi-label learning: for all h H,

RLham(h) R Lham(H) + MLham(H) l 1 2 (R Lbr(h) R Lbr(H) + M Lbr(H))

Since t t 1 2 is sub-additive, the right-hand side of the H-consistency bound in Corollary 3.2 can be

further upper bounded by l 1 2 (R Lbr(h) R Lbr(H))

1 2 + l 1 2 (M Lbr(H))

1 2 . This implies that when the

estimation error of the surrogate loss Lbr is reduced to ϵ, the corresponding estimation error of the

Hamming loss is upper bounded by l 1 2 ϵ 1 2 + l 1 2 (M Lbr(H))

1 2 MLham(H). In the nearly realizable cases where minimizability gaps are negligible, this upper bound approximates to

RLham(h) R Lham(H) l 1 2 ϵ 1 2 . (11) Therefore, as the number of labels l increases, the bound becomes less favorable. Furthermore, the loss function Lbr clearly fails to account for the inherent correlations among labels. For instance, coffee and mug are more likely to co-occur than coffee and umbrella . Additionally, Lbr is only Bayes-consistent with respect to the Hamming loss and cannot yield risk-minimizing predictions for other multi-label losses such as subset 0/1 loss or Fβ-measure loss [Dembczy nski et al., 2012]. To address these drawbacks, we will introduce a new surrogate loss in the next section.

4 Multi-label logistic loss

In this section, we deﬁne a new surrogate loss for Hamming loss in multi-label learning that accounts for label correlations and beneﬁts from label-independent H-consistency bounds. This loss function can be viewed as a generalization of the (multinomial) logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951], used in standard classiﬁcation, to multi-label learning. Thus, we will refer to it as multi-label logistic loss. It is deﬁned as follows: for all (h,x,y) H X Y,

Llog(h,x,y) = y Y (1 Lham(y ,y))log y Y e l i=1(y i y i)h(x,i) . (12)

This formulation can be interpreted as a weighted logistic loss, where (1 Lham( ,y)) serves as a weight vector. Additionally, the term l i=1 (y i y i)h(x,i) represents the difference in the scores between the label y and any other label y , where these scores account for the correlations among the labels yi within the logarithmic function.

The next result shows that the multi-label logistic loss beneﬁts from a favorable H-consistency bound with respect to Lham, without dependency on the number of labels l. We assume that H = Fl and F is complete, conditions that typically hold in practice. Theorem 4.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning: for all h H,

RLham(h) R Lham(H) + MLham(H) 2(R Llog(h) R Llog(H) + M Llog(H))

Since t t 1 2 is sub-additive, the right-hand side of the H-consistency bound in Theorem 4.1 can be

further upper bounded by 2(R Llog(h) R Llog(H))

1 2 + 2(M Llog(H))

1 2 . This implies that when the

estimation error of the surrogate loss Llog is reduced up to ϵ, the corresponding estimation error of

the Hamming loss is upper bounded by 2ϵ 1 2 + 2(M Llog(H))

1 2 MLham(H). In the nearly realizable cases where minimizability gaps are negligible, this upper bound approximates to

RLham(h) R Lham(H) 2ϵ 1 2 . (14)

Therefore, the bound is independent of the number of labels l. This contrasts with the bound for Lbr shown in (11), where a label-dependent factor ℓ 1 2 replaces the constant factor 2, making it signiﬁcantly less favorable.

The proof of Theorem 4.1 is included in Appendix B.2. We ﬁrst present a general tool (Theorem B.1) in Appendix B.1, which shows that to derive H-consistency bounds in multi-label learning with a concave function Γ, it is only necessary to upper bound the conditional regret of the target multi-label loss by that of the surrogate loss with the same Γ. This generalizes [Awasthi, Mao, Mohri, and Zhong, 2022b, Theorem 2] in standard multi-class classiﬁcation to multi-label learning. Next, we characterize the conditional regret of the target multi-label loss, such as Hamming loss, in Lemma B.2 found in Appendix B.1, under the given assumption. By using Lemma B.2, we upper bound the conditional regret of Lham by that of the surrogate loss Llog with a concave function Γ(t) = 2

When H = Hall, minimizability gaps M Llog(H) and MLham(H) vanish, Theorem 4.1 implies excess error bound and Bayes-consistency of multi-label logistic loss with respect to the Hamming loss. Corollary 4.2. The following excess error bound holds in the multi-label learning: for all h Hall,

RLham(h) R Lham(Hall) 2(R Llog(h) R Llog(Hall))

Moreover, Llog is Bayes-consistent with respect to Lham.

Approaches like binary relevance surrogate loss treat each label independently. This overlooks crucial inherent information encoded in the relationships between labels. Our new form of surrogate losses explicitly captures these label correlations. Both the binary relevance surrogate loss and our new surrogate loss are Bayes-consistent, meaning that minimizing them over the family of all measurable functions approximates the Bayes-optimal solution. However, our correlation-aware surrogate losses can converge faster, which is reﬂected in their more favorable H-consistency bounds, independent of the number of labels.

It is known that Lbr is only Bayes-consistent with respect to the Hamming loss and Precision@κ, and can be arbitrarily bad for other multi-label losses such as Fβ-measure loss [Dembczynski et al., 2011]. Instead, we will show in the next section that our surrogate loss Llog adapts to and is Bayes-consistent with respect to an extensive family of multi-label losses, including the Fβ measure loss.

5 Extension: general multi-label losses

In this section, we broaden our analysis to cover a more extensive family of multi-label losses, including all common ones and a new extension deﬁned based on linear-fractional functions with respect to the confusion matrix. Note that several loss functions are deﬁned over the space {0,1}l, rather than {+1, 1}l. To accommodate this difference, any pair y,y Y = {+1, 1}l can be projected onto {0,1}l by letting y = y+1

2 and y = y +1

2 , where 1 Rl is the vector with all elements equal to 1. Several common multi-label losses are deﬁned as follows.

Hamming loss: L(y ,y) = l i=1 1yi y i.

Fβ-measure loss: L(y ,y) = 1 (1+β2)y y β2 y 1+ y 1 .

Subset 0/1 loss: L(y ,y) = maxi [l] 1y i yi.

Jaccard distance: L(y ,y) = 1 y y y 1+ y 1 y y

Precision@κ: L(y ,y) = 1 1

κ i T(y ) 1yi=1 subject to y Yκ, where Yκ = {y Y y 1 = κ} and T(y ) = {i [l] y i = 1}.

Recall@κ: L(y ,y) = 1 1 y 1 i T(y ) 1yi=1 subject to y Yκ, where Yκ = {y Y y 1 = κ} and T(y ) = {i [l] y i = 1}.

More generally, we can deﬁne a multi-label loss based on true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) , which can be written explicitly as follows:

TP = y y TN = y 1 y y

FP = y 1 y y, FN = l + y y y 1 y 1 Similar to [Koyejo et al., 2014, 2015], we now deﬁne a general family of multi-label losses as linear-fractional functions in terms of these four quantities:

L(y ,y) = a0 + a11TP + a10FP + a01FN + a00TN

b0 + b11TP + b10FP + b01FN + b00TN . (16)

It can be shown that the aforementioned Hamming loss, Fβ-measure loss, Jaccard distance, precision and recall all belong to this family. Note that the previous deﬁnitions in [Koyejo et al., 2014, 2015] are within the empirical utility maximization (EUM) framework [Ye et al., 2012], where the measures are directly deﬁned as functions of the population. We generalize their deﬁnition to the decision theoretic analysis (DTA) framework, in terms of loss functions deﬁned over y and y .

Moreover, we can consider extending multi-label losses (16) to non-linear fractional functions of these four quantities, or more generally, to any other forms, as long as they are deﬁned over the space Y Y.

Another important family of multi-label losses is the tree distance loss, used in cases of hierarchical classes. In many practical applications, the class labels exist within a predeﬁned hierarchy. For example, in the image tagging problem, class labels might include broad categories such as animals or vehicles , which further subdivide into more speciﬁc classes like mammals and birds for animals, or cars and trucks for vehicles. Each of these subcategories can be divided even further, showcasing a clear hierarchical structure.

Tree distance: Let T = (Y,E,W) be a tree over the label space Y, with edge set E and positive, ﬁnite edge lengths speciﬁed by W. Suppose r Y is designated as the root node. Then, LT (y ,y) = the shortest path length in T between y and y .

Despite the widespread use of hierarchical classes in practice, to our knowledge, no Bayes-consistent surrogate has been proposed for the tree distance loss in multi-label learning. Next, we will show that our multi-label logistic loss can accommodate all these different loss functions, including the tree distance loss. For any general multi-label loss L, we deﬁne the multi-label logistic loss as follows:

(h,x,y) H X Y, Llog(h,x,y) = y Y (1 L(y ,y))log y Y e l i=1(y i y i)h(x,i) . (17)

Here, L can be chosen as all the multi-label losses mentioned above. Next, we will show that Llog beneﬁts from H-consistency bounds and Bayes consistency with respect to any of these loss functions. Theorem 5.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) 2(R Llog(h) R Llog(H) + M Llog(H))

The proof of Theorem 5.1 is basically the same as that of Theorem 4.1, modulo replacing the Hamming loss Lham with a general multi-label loss L. We include it in Appendix B.3 for completeness. When H = Hall, minimizability gaps M Llog(H) and ML(H) vanish, Theorem 4.1 implies excess error bound and Bayes-consistency of multi-label logistic loss with respect to any multi-label loss. Corollary 5.2. The following excess error bound holds in the multi-label learning: for all h Hall,

RL(h) R L(Hall) 2(R Llog(h) R Llog(Hall))

Moreover, Llog is Bayes-consistent with respect to L.

Corollary 5.2 is remarkable, as it demonstrates that a uniﬁed surrogate loss, Llog, is Bayes-consistent for any multi-label loss, signiﬁcantly expanding upon previous work which only established consistency for speciﬁc loss functions. Furthermore, Theorem 5.1 provides a stronger guarantee than Bayes-consistency, which is both non-asymptotic and speciﬁc to the hypothesis set used.

Minimizing the multi-label logistic loss directly leads to the effective algorithm in multi-label learning. We further discuss the efﬁciency and practicality of this algorithm in Section 8, where we describe efﬁcient gradient computation.

6 Extension: multi-label comp-sum losses

In this section, we further extend our multi-label logistic losses to more comprehensive multilabel comp-sum losses, adapting comp-sum losses [Mao, Mohri, and Zhong, 2023f] from standard classiﬁcation to the multi-label learning. As shown by Mao, Mohri, and Zhong [2023f], a comp-sum loss is deﬁned via a composition of the function Ψ and a sum, and includes the logistic loss (Ψ(u) = log(u)) [Verhulst, 1838, 1845, Berkson, 1944, 1951], the sum-exponential loss (Ψ(u) = u 1) [Weston and Watkins, 1998, Awasthi et al., 2022b], the generalized cross-entropy loss (Ψ(u) = 1 q(1 1

uq ),q (0,1)) [Zhang and Sabuncu, 2018], and the mean absolute error loss (Ψ(u) = 1 1

u) [Ghosh et al., 2017] as special cases.

Given any multi-label loss L, we will deﬁne our novel multi-label comp-sum losses as follows:

(h,x,y) H X Y, Lcomp(h,x,y) = y Y (1 L(y ,y))Ψ y Y e l i=1(y i y i)h(x,i) . (20)

This formulation can be interpreted as a weighted comp-sum loss, where (1 L( ,y)) serves as a weight vector. Additionally, this formulation accounts for label correlations among the yis within the function Ψ. Next, we prove that this family of surrogate losses beneﬁts from H-consistency bounds, and thus Bayes-consistency, across any general multi-label loss. Theorem 6.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) Γ(R Lcomp(h) R Lcomp(H) + M Lcomp(H)), (21)

where Γ(t) = 2

t when Ψ(u) = log(u) or u 1; Γ(t) = 2

nqt when Ψ(u) = 1

uq ),q (0,1); and Γ(t) = nt when Ψ(u) = 1 1

u. Corollary 6.2. The following excess error bound holds in the multi-label learning:

h Hall, RL(h) R L(Hall) Γ(R Lcomp(h) R Lcomp(Hall)), (22)

where Γ(t) = 2

t when Ψ(u) = log(u) or u 1; Γ(t) = 2

nqt when Ψ(u) = 1

uq ),q (0,1);

and Γ(t) = nt when Ψ(u) = 1 1

u. Moreover, Lcomp with these choices of Ψ are Bayes-consistent with respect to L.

The proof of Theorem 6.1 is included in Appendix B.4. Similar to the proof of Theorem 5.1, we make use of Theorem B.1 and Lemma B.2 in Appendix B.1. However, upper bounding the conditional regret of L by that of the surrogate loss Lcomp for different choices of Ψ requires a distinct analysis depending on the speciﬁc form of the function Ψ, leading to various concave functions Γ. Our proof is inspired by the proof of H-consistency bounds for comp-sum losses in [Mao et al., 2023f] through the introduction of a parameter µ and optimization. However, the novelty lies in the adaptation of µ with a quantity s tailored to multi-label loss functions instead of the score vector h itself.

Note that, as with Ψ(u) = log(u) shown in Section 5, for Ψ(u) = u 1, the bounds are also independent of the number of labels and are favorable. However, for other choices of Ψ, the bounds exhibit a worse dependency on n, which can be exponential with respect to l.

7 Extension: multi-label constrained losses

In this section, we introduce another novel family of surrogate losses, adapting constrained losses [Lee et al., 2004, Awasthi et al., 2022b] from standard classiﬁcation to multi-label constrained losses

in a similar way. Given any general multi-label loss L, we deﬁne multi-label constrained losses as:

(h,x,y) H X Y, Lcstnd(h,x,y) = y Y

L(y ,y)Φ( l i=1 y ih(x,i)). (23)

where y Y l i=1 yih(x,i) = 0. Next, we show that Lcstnd also beneﬁt from H-consistency bounds and thus Bayes-consistency for any multi-label loss. Theorem 7.1. Let H = Fl. Assume that F is complete Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) Γ(R Lcstnd(h) R Lcstnd(H) + M Lcstnd(H)), (24)

where Γ(t) = 2 Lmaxt when Φ(u) = e u; Γ(t) = 2

t when Φ(u) = max{0,1 u}2; and Γ(t) = t when Φ(u) = max{0,1 u} or Φ(u) = min{max{0,1 u/ρ},1}, ρ > 0.

Corollary 7.2. The following excess error bound holds in the multi-label learning:

h Hall, RL(h) R L(Hall) Γ(R Lcstnd(h) R Lcstnd(Hall)), (25)

where Γ(t) = 2 Lmaxt when Φ(u) = e u; Γ(t) = 2

t when Φ(u) = max{0,1 u}2; and Γ(t) = t when Φ(u) = max{0,1 u} or Φ(u) = min{max{0,1 u/ρ},1}, ρ > 0. Moreover, Lcstnd with these choices of Φ are Bayes-consistent with respect to L.

The proof of Theorem 7.1 is included in Appendix B.5. As with the proof of Theorem 6.1, we use Theorem B.1 and Lemma B.2 from Appendix B.1, and aim to upper bound the conditional regret of L by that of the surrogate losses Lcomp using various concave functions Γ. However, the difference lies in our introduction and optimization of a parameter µ tailored to a quantity z that is speciﬁc to the form of the multi-label constrained loss.

These results show that in cases where minimizability gaps vanish, reducing the estimation error of Lcstnd to ϵ results in the estimation error of target multi-label loss L being upper bounded by either ϵ or ϵ, modulo a constant that is independent of the number of labels.

8 Efﬁcient Gradient Computation

In this section, we demonstrate the efﬁcient computation of the gradient for the multi-label logistic loss Llog at any point (xj,yj). This loss function is therefore both theoretically grounded in Hconsistency bounds and computationally efﬁcient. Consider the labeled pair (xj,yj) and a hypothesis h in H. The expression for Llog(h,xj,yj) can be reformulated as follows:

Llog(h,xj,yj) = y Y (1 L(y ,yj))log y Y e l i=1(y i y i)h(xj,i)

= y Y (1 L(y ,yj)) l i=1 y ih(xj,i) + y Y (1 L(y ,yj))log y Y e l i=1 yih(xj,i) .

Let L1(j) = y Y(1 L(y,yj)), which is independent of h and can be pre-computed. It can also be invariant with respect to j and is a ﬁxed constant for many loss functions such as the Hamming loss.

For many commonly used loss functions, the terms involving sums over all possible label combinations can be simpliﬁed analytically. To illustrate this, we provide explicit formulae for the Hamming loss and Fβ measure loss functions:

Hamming loss: y Y(1 Lham(y,yj)) = (l 1)2l + y Y(l Lham(y,yj)) = (l 1)2l + y Y l i=1 1yi=yj i = (l 1)2l + l i=1 y Y 1yi=yj i = (l 1)2l + 2l 1l = 2l 1(2 l).

Fβ measure: y Y(1 LFβ(y,yj)) = y Y (1+β2)y yj

β2 y 1+ yj 1 = n k=0 y Gk (1+β2)y yj

β2k+ yj 1 =

n k=0 (1+β2) l i=1 yj i (n 1 k 1) β2k+ yj 1 = l k=0 (1+β2) yj 1(l 1 k 1) β2k+ yj 1 , where Gk = {y Y y 1 = k}.

A similar analysis applies to many other loss functions, thus, these terms do not affect the tractability of our algorithms. Next, we will consider the hypothesis set of linear functions H = {x w Ψ(x,i) w Rd}, where Ψ is a feature mapping from X [l] to Rd. Using the shorthand w for h, we can rewrite Llog at (xj,yj) as follows:

Llog(w,xj,yj) = w y Y (1 L(y ,yj))( l i=1 y iΨ(xj,i)) + L1(j)log(Zw,j), (26)

where Zw,j = y Y ew ( l i=1 yiΨ(xj,i)). Then, we can compute the gradient of Llog at any w Rd:

Llog(w) = y Y (1 L(y ,yj))( l i=1 y iΨ(xj,i)) + L1(j) y Y

ew ( l i=1 yiΨ(xj,i))

Zw,j ( l i=1 yiΨ(xj,i))

= y Y (1 L(y ,yj))( l i=1 y iΨ(xj,i)) + L1(j) E y qw[( l i=1 yiΨ(xj,i))], (27)

where qw is a distribution over Y with probability mass function qw(y) = ew ( l i=1 yiΨ(xj ,i))

Zw,j . By rearranging the terms in (27), we obtain the following result.

Lemma 8.1. The gradient of Llog at any w Rd can be expressed as follows:

Llog(w) = l i=1 Ψ(xj,i)L2(i,j) + L1(j) l i=1 Ψ(xj,i)Qw(i) (28)

where L2(i,j) = y Y(1 L(y,yj))yi, L1(j) = y Y(1 L(y,yj)), Qw(i) = y Y qw(y)yi,

qw(y) = ew ( l i=1 yiΨ(xj ,i))

Zw,j , and Zw,j = y Y ew ( l i=1 yiΨ(xj,i)). The overall time complexity for gradient computation is O(l).

Here, the evaluation of L2(i,j), i [l] and L1(j) can be computed once and for all, before any gradient computation. For evaluation of Qw(i), note that it can be equivalently written as follows:

Qw(i) = y Y

y Y ew Ψ(xj,y) yi, with Ψ(xj,y) = l i=1 yiΨ(xj,i), (29)

where Ψ(xj,y) admits a Markovian property of order 1 [Manning and Schutze, 1999, Cortes et al., 2016]. Thus, as shown by Cortes et al. [2016, 2018], Qw(i) can be evaluated efﬁciently by running two single-source shortest-distance algorithms over the (+, ) semiring on an appropriate weighted ﬁnite automaton (WFA). More speciﬁcally, in our case, the WFA can be described as follows: there are (l + 1) vertices labeled 0,...,l. There are two transitions from k to (k + 1) labeled with +1 and 1. The weight of the transition with label +1 is exp(+w Ψ(xj,k)), and exp( w Ψ(xj,k)) for the other. 0 is the initial state, and l the ﬁnal state. The overall time complexity of computing all quantities Qw(i), i [l], is O(l).

9 Conclusion

We presented a comprehensive analysis of surrogate losses for multi-label learning, establishing strong consistency guarantees. We introduced a novel multi-label logistic loss that addresses the shortcomings of existing methods and enjoys label-independent consistency bounds. Our proposed family of multi-label comp-sum losses offers a uniﬁed framework with strong consistency guarantees for any general multi-label loss, signiﬁcantly expanding upon previous work. Additionally, we presented efﬁcient algorithms for their gradient computation. While empirical validation is left for future work, our theoretical results demonstrate the potential of these new surrogate losses to advance multi-label learning. This uniﬁed framework shows promise for broader applications and paves the way for future research in multi-label learning and related areas.

P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. Calibration and consistency of adversarial surrogate losses. Advances in Neural Information Processing Systems, pages 9804 9815, 2021a.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. A ﬁner calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, 2022a.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Multi-class H-consistency bounds. In Advances in neural information processing systems, pages 782 795, 2022b.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artiﬁcial Intelligence and Statistics, pages 10077 10094, 2023.

P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. DC-programming for neural network optimizations. Journal of Global Optimization, 2024.

P. L. Bartlett, M. I. Jordan, and J. D. Mc Auliffe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357 -365, 1944.

J. Berkson. Why I prefer logits to probits. Biometrics, 7(4):327 -339, 1951.

J. Bogatinovski, L. Todorovski, S. Džeroski, and D. Kocev. Comprehensive comparative study of multi-label classiﬁcation methods. Expert Systems with Applications, 203:117215, 2022.

R. Busa-Fekete, H. Choi, K. Dembczynski, C. Gentile, H. Reeve, and B. Szorenyi. Regret bounds for multilabel classiﬁcation in sparse label regimes. In Advances in Neural Information Processing Systems, pages 5404 5416, 2022.

W. Cheng, E. Hüllermeier, and K. J. Dembczynski. Bayes optimal multilabel classiﬁcation via probabilistic classiﬁer chains. In international conference on machine learning, pages 279 286, 2010.

C. Cortes, V. Kuznetsov, M. Mohri, and S. Yang. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, 2016.

C. Cortes, V. Kuznetsov, M. Mohri, D. Storcheus, and S. Yang. Efﬁcient gradient computation for structured output learning with rational and tropical losses. In Advances in Neural Information Processing Systems, 2018.

C. Cortes, A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Cardinality-aware set prediction and top-k classiﬁcation. In Advances in neural information processing systems, 2024.

K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. An exact algorithm for f-measure maximization. In Advances in neural information processing systems, 2011.

K. Dembczynski, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate loss minimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1347 1354, 2012.

K. Dembczy nski, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classiﬁcation. Machine Learning, 88:5 45, 2012.

K. Dembczynski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. Hüllermeier. Optimizing the f-measure in multi-label classiﬁcation: Plug-in rule approach versus structured loss minimization. In International conference on machine learning, pages 1130 1138, 2013.

J. Deng, S. Satheesh, A. Berg, and F. Li. Fast and balanced: Efﬁcient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems, 2011.

A. Elisseeff and J. Weston. A kernel method for multi-labelled classiﬁcation. In Advances in neural information processing systems, 2001.

W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. In Conference on learning theory, pages 341 358, 2011.

A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artiﬁcial intelligence, 2017.

A. Kapoor, R. Viswanathan, and P. Jain. Multilabel classiﬁcation using bayesian compressed sensing. In Advances in neural information processing systems, 2012.

O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent binary classiﬁcation with generalized performance metrics. In Advances in neural information processing systems, 2014.

O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent multilabel classiﬁcation. In Advances in Neural Information Processing Systems, 2015.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classiﬁcation of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004.

P. Long and R. Servedio. Consistency versus realizable H-consistency for multiclass classiﬁcation. In International Conference on Machine Learning, pages 801 809, 2013.

C. Manning and H. Schutze. Foundations of statistical natural language processing. MIT press, 1999.

A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Two-stage learning to defer with multiple experts. In Advances in neural information processing systems, 2023a.

A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023b.

A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023c.

A. Mao, M. Mohri, and Y. Zhong. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023d.

A. Mao, M. Mohri, and Y. Zhong. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023e.

A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, pages 23803 23828, 2023f.

A. Mao, M. Mohri, and Y. Zhong. Principled approaches for learning to defer with multiple experts. In International Symposium on Artiﬁcial Intelligence and Mathematics, 2024a.

A. Mao, M. Mohri, and Y. Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. In Algorithmic Learning Theory, 2024b.

A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for scorebased multi-class abstention. In International Conference on Artiﬁcial Intelligence and Statistics, 2024c.

A. Mao, M. Mohri, and Y. Zhong. Enhanced H-consistency bounds. ar Xiv preprint ar Xiv:2407.13722, 2024d.

A. Mao, M. Mohri, and Y. Zhong. H-consistency guarantees for regression. In International Conference on Machine Learning, pages 34712 34737, 2024e.

A. Mao, M. Mohri, and Y. Zhong. Realizable H-consistent and Bayes-consistent loss functions for learning to defer. In Advances in neural information processing systems, 2024f.

A. Mao, M. Mohri, and Y. Zhong. Regression with multi-expert deferral. In International Conference on Machine Learning, pages 34738 34759, 2024g.

A. Mao, M. Mohri, and Y. Zhong. A universal growth rate for learning with smooth surrogate losses. In Advances in neural information processing systems, 2024h.

A. K. Mc Callum. Multi-label text classiﬁcation with a mixture model trained by em. In AAAI 99 workshop on text learning, 1999.

A. K. Menon, A. S. Rawat, S. Reddi, and S. Kumar. Multilabel reductions: what is my loss optimising? In Advances in Neural Information Processing Systems, 2019.

C. Mohri, D. Andor, E. Choi, M. Collins, A. Mao, and Y. Zhong. Learning to reject with a ﬁxed predictor: Application to decontextualization. In International Conference on Learning Representations, 2024.

J. Petterson and T. Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, 2011.

H. G. Ramaswamy, B. S. Babu, S. Agarwal, and R. C. Williamson. On the consistency of output code based learning algorithms for multiclass learning problems. In Conference on Learning Theory, pages 885 902, 2014.

R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine learning, 39:135 168, 2000.

I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007.

A. Tewari and P. L. Bartlett. On the consistency of multiclass classiﬁcation methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007.

P. F. Verhulst. Notice sur la loi que la population suit dans son accroissement. Correspondance mathématique et physique, 10:113 -121, 1838.

P. F. Verhulst. Recherches mathématiques sur la loi d accroissement de la population. Nouveaux Mémoires de l Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1 -42, 1845.

W. Waegeman, K. Dembczy nski, A. Jachnik, W. Cheng, and E. Hüllermeier. On the bayes-optimality of f-measure maximizers. Journal of Machine Learning Research, 15:3333 3388, 2014.

J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Citeseer, 1998.

G. Wu and J. Zhu. Multi-label classiﬁcation: do hamming loss and subset accuracy really conﬂict with each other? In Advances in Neural Information Processing Systems, pages 3130 3140, 2020.

G. Wu, C. Li, K. Xu, and J. Zhu. Rethinking and reweighting the univariate losses for multi-label ranking: Consistency and generalization. In Advances in Neural Information Processing Systems, pages 14332 14344, 2021.

G. Wu, C. Li, and Y. Yin. Towards understanding generalization of macro-auc in multi-label learning. In International Conference on Machine Learning, pages 37540 37570, 2023.

M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, and K. Dembczynski. A no-regret generalization of hierarchical softmax to extreme multi-label classiﬁcation. In Advances in neural information processing systems, 2018.

N. Ye, K. M. Chai, W. S. Lee, and H. L. Chieu. Optimizing f-measures: A tale of two approaches. In International Conference on Machine Learning, pages 289 296, 2012.

H.-F. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In International conference on machine learning, pages 593 601, 2014.

M. Zhang and S. Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. In Advances in Neural Information Processing Systems, 2020.

M. Zhang, H. G. Ramaswamy, and S. Agarwal. Convex calibrated surrogates for the multi-label f-measure. In International Conference on Machine Learning, pages 11246 11255, 2020.

M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering, 26(8):1819 1837, 2013.

T. Zhang. Statistical behavior and consistency of classiﬁcation methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classiﬁcation methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b.

Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, 2018.

Contents of Appendix

A Proof of H-consistency bounds for existing surrogate losses (Theorem 3.1) 16

B Proofs of H-consistency bounds for new surrogate losses 17

B.1 Auxiliary deﬁnitions and results (Theorem B.1 and Lemma B.2) . . . . . . . . . . . . 17

B.2 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

B.3 Proof of Theorem 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

B.4 Proof of Theorem 6.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

B.5 Proof of Theorem 7.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

C Future work 23

A Proof of H-consistency bounds for existing surrogate losses (Theorem 3.1)

Theorem 3.1. Let H = Fl. Assume that the following F-consistency bound holds in the binary classiﬁcation, for some concave function Γ R R+:

f F, Rℓ0 1(f) R ℓ0 1(F) + Mℓ0 1(F) Γ(RΦ(f) R Φ(F) + MΦ(F)). (8)

Then, the following H-consistency bound holds in the multi-label learning: for all h H,

RLham(h) R Lham(H) + MLham(H) lΓ

R Lbr(h) R Lbr(H) + M Lbr(H)

Proof. Let p(y x) = P(Y = y X = x) be the conditional probability of Y = y given X = x. Given a multi-label surrogate loss L and a hypothesis set H, we denote the conditional error by C L(h,x) = Ey x[ L(h,x,y)], the best-in-class conditional error by C L(H,x) = infh H C L(h,x), and the conditional regret by C L,H(h,x) = C L(h,x) C L(H,x). We can express the conditional error of the hamming loss and the surrogate loss Lbr as follows:

CLham(h,x) = y Y p(y x) l i=1 1yi h(x,i)

y yi=+1 p(y x)11 sign(h(x,i)) + y yi= 1 p(y x)1 1 sign(h(x,i))

C Lbr(h,x) = y Y p(y x) l i=1 Φ(yih(x,i))

y yi=+1 p(y x)Φ(h(x,i)) + y yi= 1 p(y x)Φ( h(x,i))

Let qi(+1 x) = y yi=+1 p(y x) and qi( 1 x) = y yi= 1 p(y x). Let fi = h( ,i) F, for all i [l]. Then, it is clear that the conditional regrets of Lham and Lbr can be expressed as the corresponding conditional regrets of ℓ0 1 and Φ under this new introduced new distribution:

CLham,H(h,x) = l i=1 Cℓ0 1,F(fi,x), C Lbr,H(h,x) = l i=1 CΦ,F(fi,x). (30)

Since we have Cℓ0 1,F(fi,x) Γ( CΦ,F(fi,x)) under the assumption, we obtain

CLham,H(h,x) = l i=1 Cℓ0 1,F(fi,x) l i=1 Γ( CΦ,F(fi,x))

l i=1 CΦ,F(fi,x)) (concavity of Γ)

l C Lbr,H(h,x)).

By taking the expectation on both sides and using the Jensen s inequality, we have

RLham(h) R Lham(H) + MLham(H) = E x[ CLham,H(h,x)]

l C Lbr,H(h,x))]

Ex[ C Lbr,H(h,x)]

l (concavity of Γ)

R Lbr(h) R Lbr(H) + M Lbr(H)

This completes the proof.

B Proofs of H-consistency bounds for new surrogate losses

B.1 Auxiliary deﬁnitions and results (Theorem B.1 and Lemma B.2)

Before proceeding with the proof, we ﬁrst introduce some notation and deﬁnitions. Given a multilabel surrogate loss L and a hypothesis set H, we denote the conditional error by C L(h,x) = Ey x[ L(h,x,y)], the best-in-class conditional error by C L(H,x) = infh H C L(h,x), and the conditional regret by C L,H(h,x) = C L(h,x) C L(H,x). We then present a general theorem, which shows that to derive H-consistency bounds in multi-label learning with a concave function Γ, it is only necessary to upper bound the conditional regret of the target multi-label loss by that of the surrogate loss with the same Γ.

Theorem B.1. Let L be a multi-label loss and L be a surrogate loss. Given a concave function Γ R+ R+. If the following condition holds for all h H and x X:

CL,H(h,x) Γ( C L,H(h,x)), (31)

then, for any distribution and for all hypotheses h H,

RL(h) R L(H) + ML(H) Γ(R L(h) R L(H) + M L(H)). (32)

Proof. By the deﬁnitions, the expectation of the conditional regrets for L and L can be expressed as:

E x[ CL,H(h,x)] = RL(h) R L(H) + ML(H)

E x[ C L,H(h,x)] = R L(h) R L(H) + M L(H).

Thus, by taking the expectation on both sides of (31) and using Jensen s inequality, we have

RL(h) R L(H) + ML(H) = E x[ CL,H(h,x)]

E x[Γ( C L,H(h,x))] (Eq. (31))

Γ(E x[ C L,H(h,x)]) (concavity of Γ)

= Γ(R L(h) R L(H) + M L(H)).

This completes the proof.

To derive H-consistency bounds using Theorem B.1, we will characterize the conditional regret of a multi-label loss L. For simplicity, we ﬁrst introduce some notation. For any x X, let y(x) = argminy Y Ey x[L(y ,y)] Y, where we choose the label with the lowest index under the natural ordering of labels as the tie-breaking strategy. To simplify the notation further, we will drop the dependency on x. Speciﬁcally, we use y to denote y(x) and h to denote h(x). Additionally, we deﬁne ch = Ey x[L(h,y)], cy = Ey x[L(y,y)] and cy = Ey x[L(y ,y)], y Y.

Lemma B.2. Let H = Fl. Assume that F is complete. Then, the conditional regret of a multi-label loss L can be expressed as follows: CL,H(h,x) = ch cy.

Proof. By deﬁnition, the conditional error of L can be expressed as follows:

CL(h,x) = E y x[L(h,x,y)] = E y x[L(h(x),y)] = ch. (33)

Since H = Fl and F is complete, for any x X, {h(x) h H} = Y. Then, the best-in-class conditional error of L can be expressed as follows:

C L(H,x) = inf h H CL(h,x) = inf h H E y x[L(h(x),y)] = E y x[L(y(x),y)] = cy. (34)

Therefore, CL,H(h,x) = CL(h,x) C L(H,x) = ch cy.

Next, by using Lemma B.2, we will upper bound the conditional regret of the target multi-label loss L by that of the surrogate loss L with a concave function Γ.

B.2 Proof of Theorem 4.1

Theorem 4.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning: for all h H,

RLham(h) R Lham(H) + MLham(H) 2(R Llog(h) R Llog(H) + M Llog(H))

Proof. We will use the following notation adapted to the Hamming loss: ch = Ey x[Lham(h,y)], cy = Ey x[L(y,y)] and cy = Ey x[Lham(y ,y)], y Y. We will denote by s(h,x,y ) =

e l i=1 y ih(x,i)

y Y e l i=1 y i h(x,i) and simplify notation by using sy , thereby dropping the dependency on h and x.

It is clear that sy [0,1]. Then, the conditional error of Llog can be expressed as follows:

C Llog(h,x) = E y x

y Y (1 Lham(y ,y))log y Y e l i=1(y i y i)h(x,i)

= y Y (1 cy )log(sy )

For any h y, we deﬁne sµ as follows: set sµ y = sy for all y y and y h; deﬁne sµ h = sy µ; and let sµ y = sh + µ. Note that sµ can be realized by some h H due to the completeness assumption. Then, we have

C Llog,H(h,x) y Y (1 cy )log(sy ) inf µ R y Y (1 cy )log(sµ y )

= sup µ R {(1 ch)[log(sy µ) log(sh)] + (1 cy)[log(sh + µ) log(sy)]}

= (1 cy)log (sh + sy)(1 cy)

sy(2 ch cy) + (1 ch)log (sh + sy)(1 ch)

sh(2 ch cy)

(supremum is attained when µ = (1 ch)sh+(1 cy)sy

(1 cy)log 2(1 cy) (2 ch cy) + (1 ch)log 2(1 ch) (2 ch cy) (minimum is attained when sh = sy since ch cy and sh sy)

2(2 ch cy) (alog 2a

a+b + blog 2b

2(a+b), a,b [0,1])

Therefore, by Lemma B.2, CLham,H(h,x) 2( C Llog,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

B.3 Proof of Theorem 5.1

Theorem 5.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) 2(R Llog(h) R Llog(H) + M Llog(H))

Proof. The proof is basically the same as that of Theorem 4.1, modulo replacing the Hamming loss Lham with a general multi-label loss L. We adopt the following notation: ch = Ey x[L(h,y)], cy =

Ey x[L(y,y)] and cy = Ey x[L(y ,y)], y Y. We will denote by s(h,x,y ) = e l i=1 y ih(x,i)

y Y e l i=1 y i h(x,i) and simplify notation by using sy , thereby dropping the dependency on h and x. It is clear that sy [0,1]. Then, the conditional error of Llog can be expressed as follows:

C Llog(h,x) = E y x

y Y (1 L(y ,y))log y Y e l i=1(y i y i)h(x,i)

= y Y (1 cy )log(sy )

For any h y, we deﬁne sµ as follows: set sµ y = sy for all y y and y h; deﬁne sµ h = sy µ; and let sµ y = sh + µ. Note that sµ can be realized by some h H under the assumption. Then, we have

C Llog,H(h,x) y Y (1 cy )log(sy ) inf µ R y Y (1 cy )log(sµ y )

= sup µ R {(1 ch)[log(sy µ) log(sh)] + (1 cy)[log(sh + µ) log(sy)]}

= (1 cy)log (sh + sy)(1 cy)

sy(2 ch cy) + (1 ch)log (sh + sy)(1 ch)

sh(2 ch cy) (supremum is attained when µ = (1 ch)sh+(1 cy)sy

(1 cy)log 2(1 cy) (2 ch cy) + (1 ch)log 2(1 ch) (2 ch cy) (minimum is attained when sh = sy)

2(2 ch cy) (alog 2a

a+b + blog 2b

2(a+b), a,b [0,1])

Therefore, by Lemma B.2, CL,H(h,x) 2( C Llog,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

B.4 Proof of Theorem 6.1

Theorem 6.1. Let H = Fl. Assume that F is complete. Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) Γ(R Lcomp(h) R Lcomp(H) + M Lcomp(H)), (21)

where Γ(t) = 2

t when Ψ(u) = log(u) or u 1; Γ(t) = 2

nqt when Ψ(u) = 1

uq ),q (0,1); and Γ(t) = nt when Ψ(u) = 1 1

Proof. Recall that we adopt the following notation: ch = Ey x[L(h,y)], cy = Ey x[L(y,y)] and

cy = Ey x[L(y ,y)], y Y. We will denote by s(h,x,y ) = e l i=1 y ih(x,i)

y Y e l i=1 y i h(x,i) and simplify

notation by using sy , thereby dropping the dependency on h and x. It is clear that sy [0,1]. Next, we will analyze case by case.

The case where Φ(u) = log(u): See the proof of Theorem 5.1.

The case where Φ(u) = u 1: The conditional error of Lcomp can be expressed as follows:

C Lcomp(h,x)

y Y (1 L(y ,y)) y Y e l i=1(y i y i)h(x,i) 1

= y Y (1 cy )( 1

For any h y, we deﬁne sµ as follows: set sµ y = sy for all y y and y h; deﬁne sµ h = sy µ; and let sµ y = sh + µ. Note that sµ can be realized by some h H under the assumption. Then, we have

C Lcomp,H(h,x)

y Y (1 cy )( 1

sy 1) inf µ R y Y (1 cy )( 1

= sup µ R {(1 ch)[ 1

sh 1 sy µ] + (1 cy)[ 1

sy 1 sh + µ]}

sy 2 ch cy + 2(1 ch) 1 2 (1 cy) 1 2

sh + sy (supremum is attained when µ = 1 chsh+

1 cy+ 1 ch )

((1 ch) 1 2 (1 cy) 1 2 ) 2 (minimum is attained when sh = sy = 1

((1 ch) 1 2 + (1 cy) 1 2 ) 2

Therefore, by Lemma B.2, CL,H(h,x) 2( C Lcomp,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

The case where Φ(u) = 1

uq ),q (0,1): The conditional error of Lcomp can be expressed as:

C Lcomp(h,x) = 1

q y Y (1 cy )(1 (sy )q).

For any h y, we deﬁne sµ as follows: set sµ y = sy for all y y and y h; deﬁne sµ h = sy µ; and let sµ y = sh + µ. Note that sµ can be realized by some h H under the assumption. Then, we have

C Lcomp,H(h,x)

q y Y (1 cy )(1 sy) inf µ R 1 q y Y (1 cy )(1 (sµ y )q)

q sup µ R {(1 ch)[ sh + (sy µ)q] + (1 cy)[ (sy)q + (sh + µ)q]}

q (sh + sy)q((1 cy) 1 1 q + (1 ch) 1 1 q ) 1 q 1

q (1 cy)sq y 1

q (1 ch)sq h

(supremum is attained when µ = (1 ch) 1 1 q sh+(1 cy) 1 1 q sy

(1 cy) 1 1 q +(1 ch) 1 1 q )

1 qnq [2q((1 cy) 1 1 q + (1 ch) 1 1 q ) 1 q (1 cy) (1 ch)]

(minimum is attained when sh = sy = 1

4nq . (( a 1 1 q +b 1 1 q 2 )

4(a b)2, a,b [0,1], 0 a + b 1)

Therefore, by Lemma B.2, CL,H(h,x) 2n q 2 ( C Lcomp,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

The case where Φ(u) = (1 1

u): The conditional error of Lcomp can be expressed as:

C Lcomp(h,x) = y Y (1 cy )(1 (sy )q).

For any h y, we deﬁne sµ as follows: set sµ y = sy for all y y and y h; deﬁne sµ h = sy µ; and let sµ y = sh + µ. Note that sµ can be realized by some h H under the assumption. Then, we have

C Lcomp,H(h,x)

y Y (1 cy )(1 sy) inf µ R y Y (1 cy )(1 sµ y )

= sup µ R {(1 ch)[ sh + sy µ] + (1 cy)[ sy + sh + µ]}

= sh(ch cy) (supremum is attained when µ = sy)

n(ch cy). (minimum is attained when sh = 1

Therefore, by Lemma B.2, CL,H(h,x) n C Lcomp,H(h,x). By Theorem B.1, we complete the proof.

B.5 Proof of Theorem 7.1

Theorem 7.1. Let H = Fl. Assume that F is complete Then, the following H-consistency bound holds in the multi-label learning:

h H, RL(h) R L(H) + ML(H) Γ(R Lcstnd(h) R Lcstnd(H) + M Lcstnd(H)), (24)

where Γ(t) = 2 Lmaxt when Φ(u) = e u; Γ(t) = 2

t when Φ(u) = max{0,1 u}2; and Γ(t) = t when Φ(u) = max{0,1 u} or Φ(u) = min{max{0,1 u/ρ},1}, ρ > 0.

Proof. Recall that we adopt the following notation: ch = Ey x[L(h,y)], cy = Ey x[L(y,y)] and cy = Ey x[L(y ,y)], y Y. We will also denote by z(h,x,y ) = l i=1 y ih(x,i) and simplify notation by using zy , thereby dropping the dependency on h and x. It is clear that the constraint can be expressed as y Y zy = 0. Next, we will analyze case by case.

The case where Φ(u) = e u: The conditional error of Lcstnd can be expressed as follows:

C Lcstnd(h,x) = E y x

L(y ,y)e l i=1 y ih(x,i) = y Y cy ezy .

For any h y, we deﬁne zµ as follows: set zµ y = zy for all y y and y h; deﬁne zµ h = zy µ; and let zµ y = zh + µ. Note that zµ can be realized by some h H under the assumption. Then, we have

C Lcomp,H(h,x) y Y cy ezy inf µ R y Y cy ezµ y

= sup µ R {cy(ezy ezh+µ) + ch(ezh ezy µ)}

cyezy) 2 (supremum is attained when µ = 1

2 log cyezy

= ( ch cy cy + ch )

2 (minimum is attained when zh = zy = 0)

1 4Lmax (ch cy)2.

Therefore, by Lemma B.2, CL,H(h,x) 2(Lmax)

1 2 ( C Lcstnd,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

The case where Φ(u) = max{0,1 u}2: The conditional error of Lcstnd can be expressed as follows:

C Lcstnd(h,x) = y Y cy max{0,1 + zy }2.

For any h y, we deﬁne zµ as follows: set zµ y = zy for all y y and y h; deﬁne zµ h = zy µ; and let zµ y = zh + µ. Note that zµ can be realized by some h H under the assumption. Then, we have

C Lcstnd,H(h,x)

y Y cy max{0,1 + zy }2 inf µ R y Y cy max{0,1 + zµ y } 2

= sup µ R {cy(max{0,1 + zy}2 max{0,1 + zh + µ}2) + ch(max{0,1 + zh}2 max{0,1 + zy µ}2)}

(1 + zh)2(cy ch)2 (differentiating with respect to µ to optimize)

(ch cy)2. (minimum is attained when zh = 0)

Therefore, by Lemma B.2, CL,H(h,x) ( C Lcstnd,H(h,x))

1 2 . By Theorem B.1, we complete the proof.

The case where Φ(u) = max{0,1 u}: The conditional error of Lcstnd can be expressed as:

C Lcstnd(h,x) = y Y cy max{0,1 + zy }.

For any h y, we deﬁne zµ as follows: set zµ y = zy for all y y and y h; deﬁne zµ h = zy µ; and let zµ y = zh + µ. Note that zµ can be realized by some h H under the assumption. Then, we have

C Lcstnd,H(h,x)

y Y cy max{0,1 + zy } inf µ R y Y cy max{0,1 + zµ y }

= sup µ R {cy(max{0,1 + zy} max{0,1 + zh + µ}) + ch(max{0,1 + zh}2 max{0,1 + zy µ}2)}

(1 + zh)(cy ch) (differentiating with respect to µ to optimize) (ch cy). (minimum is attained when zh = 0)

Therefore, by Lemma B.2, CL,H(h,x) C Lcstnd,H(h,x). By Theorem B.1, we complete the proof.

The case where Φ(u) = min{max{0,1 u/ρ},1},ρ > 0: The conditional error of Lcstnd can be expressed as:

C Lcstnd(h,x) = y Y cy min{max{0,1 + zy /ρ},1}.

For any h y, we deﬁne zµ as follows: set zµ y = zy for all y y and y h; deﬁne zµ h = zy µ; and let zµ y = zh + µ. Note that zµ can be realized by some h H under the assumption. Then, we have

C Lcstnd,H(h,x)

y Y cy min{max{0,1 + zy /ρ},1} inf µ R y Y cy min{max{0,1 + zµ y /ρ},1}

= sup µ R {cy(min{max{0,1 + zy/ρ},1} min{max{0,1 + (zh + µ)/ρ},1})

+ ch(min{max{0,1 + zh/ρ},1} min{max{0,1 + (zy µ)/ρ}},1)}

(cy ch). (differentiating with respect to µ to optimize)

Therefore, by Lemma B.2, CL,H(h,x) C Lcstnd,H(h,x). By Theorem B.1, we complete the proof.

C Future work

While our work introduces a uniﬁed surrogate loss framework that is Bayes-consistent across any multi-label loss, thereby broadening the scope beyond previous approaches that established consistency only for particular loss functions, there remains an exciting opportunity for empirical comparison with surrogate losses tailored to speciﬁc loss functions a direction we leave for future work. Furthermore, reﬁning surrogate losses to theoretically enhance performance for speciﬁc target losses presents another promising avenue for research.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reﬂect the paper s contributions and scope?

Answer: [Yes]

Justiﬁcation: See the section Our Contributions in the introduction.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reﬂect how much the results can be expected to generalize to other settings. It is ﬁne to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justiﬁcation: See Appendix C.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-speciﬁcation, asymptotic approximations only holding locally). The authors should reﬂect on how these assumptions might be violated in practice and what the implications would be. The authors should reﬂect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reﬂect on the factors that inﬂuence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efﬁciency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be speciﬁcally instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justiﬁcation: See Section 3, Section 4, Section 5, Section 6, Section 7, Appendix A, Appendix B. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justiﬁcation: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or veriﬁable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might sufﬁce, or if the contribution is a speciﬁc model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufﬁcient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [NA] Justiﬁcation: The paper does not include experiments requiring code. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justiﬁcation: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Signiﬁcance

Question: Does the paper report error bars suitably and correctly deﬁned or other appropriate information about the statistical signiﬁcance of the experiments? Answer: [NA] Justiﬁcation: The paper does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, conﬁdence intervals, or statistical signiﬁcance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean.

It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not veriﬁed. For asymmetric distributions, the authors should be careful not to show in tables or ﬁgures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding ﬁgures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufﬁcient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justiﬁcation: The paper does not include experiments.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justiﬁcation: The authors have reviewed the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justiﬁcation: Our work is theoretical in nature and we do not anticipate any immediate negative societal impact.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake proﬁles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact speciﬁc groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efﬁciency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justiﬁcation: The paper poses no such risks.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety ﬁlters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justiﬁcation: The paper does not use existing assets.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justiﬁcation: The paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip ﬁle. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] . Justiﬁcation: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is ﬁne, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justiﬁcation: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary signiﬁcantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.