# On the Learnability of Multilabel Ranking

Vinod Raman (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; vkraman@umich.edu)
Unique Subedi (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; subedi@umich.edu)
Ambuj Tewari (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; tewaria@umich.edu)

Equal contributions.

Abstract

Multilabel ranking is a central task in machine learning. However, the most fundamental question of learnability in a multilabel ranking setting with relevance-score feedback remains unanswered. In this work, we characterize the learnability of multilabel ranking problems in both the batch and online settings for a large family of ranking losses. Along the way, we give two equivalence classes of ranking losses based on learnability that capture most losses used in practice.

1 Introduction

Multilabel ranking is a supervised learning problem where a learner is presented with an instance $x \in \mathcal{X}$ and is required to output a ranking of $K$ different labels in decreasing order of relevance to $x$. This is in contrast with multilabel classification, where given an instance $x \in \mathcal{X}$, the learner is tasked with predicting a subset of the $K$ labels without any explicit ordering. Multilabel ranking is a canonical learning problem with a wide range of applications to text categorization, genetics, medical imaging, social networks, and visual object recognition [Joachims, 2005, Schapire and Singer, 2000, McCallum, 1999, Clare and King, 2001, Baltruschat et al., 2019, Wang and Sukthankar, 2013, Bucak et al., 2009, Yang et al., 2016]. Recent years have seen a surge in the development of multilabel ranking methods with strong practical and theoretical guarantees [Schapire and Singer, 2000, Dembczynski et al., 2012, Gong et al., 2013, Bucak et al., 2009, Jung and Tewari, 2018, Gao and Zhou, 2011, Koyejo et al., 2015, Zhang and Zhou, 2013, Korba et al., 2018]. A related line of work has studied consistency for the convex surrogates of natural ranking losses [Duchi et al., 2010, Buffoni et al., 2011, Gao and Zhou, 2011, Ravikumar et al., 2011, Calauzenes et al., 2012, Dembczynski et al., 2012]. Despite this vast literature on multilabel ranking, the fundamental question of when a multilabel ranking problem is learnable remains unanswered.

Understanding when a hypothesis class is learnable is a fundamental question in statistical learning theory. For binary classification, the finiteness of the Vapnik-Chervonenkis (VC) dimension is both sufficient and necessary for Probably Approximately Correct (PAC) learning [Vapnik and Chervonenkis, 1974, Valiant, 1984]. Likewise, the finiteness of the Daniely-Shalev-Shwartz (DS) dimension characterizes multiclass PAC learnability [Daniely and Shalev-Shwartz, 2014, Brukhim et al., 2022]. In the online setting, the Littlestone dimension [Littlestone, 1987] characterizes the online learnability of a binary hypothesis class, and the multiclass Littlestone dimension [Daniely et al., 2011] characterizes online multiclass learnability.

Unlike classification, a distinguishing property of multilabel ranking is the mismatch between the predictions the learner makes and the feedback it receives. In particular, a learner is required to produce a permutation that ranks the relevance of the labels but only receives a relevance-score vector as feedback. This feedback model is standard in multilabel ranking since obtaining full permutation feedback is generally costly [Liu et al., 2009].
As a result, unlike the 0-1 loss in classification, there is no canonical loss function in ranking. Together, these two issues create barriers for existing techniques used to prove learnability, such as the agnostic-to-realizable reductions from Hopkins et al. [2022] and Raman et al. [2023], to readily extend to ranking.

In this paper, we characterize the batch and online learnability of a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ under relevance-score feedback, where $\mathcal{S}_K$ is the set of all permutations over $[K] = \{1, \dots, K\}$. In doing so, we make the following contributions.

- We show that a ranking hypothesis class $\mathcal{H}$ embeds $K^2$ different binary hypothesis classes $\mathcal{H}_i^j$ for $i, j \in [K]$, where hypotheses in $\mathcal{H}_i^j$ answer whether label $i$ should be ranked in the top $j$. Our main result relates the learnability of $\mathcal{H}$ to the learnability of the $\mathcal{H}_i^j$'s.
- We define two families of ranking loss functions that capture most, if not all, ranking losses used in practice. We show that these families are actually equivalence classes: the same characterization of batch and online learnability holds for every loss in the family.
- By relating the learnability of $\mathcal{H}$ to the learnability of the binary hypothesis classes $\mathcal{H}_i^j$, we show that existing combinatorial dimensions, like the VC and Littlestone dimensions, continue to characterize learnability in the multilabel ranking setting. This allows us to prove that linear ranking hypothesis classes are learnable in the batch setting.

A unifying theme throughout the paper is our ability to constructively convert a learning algorithm $\mathcal{A}$ for $\mathcal{H}$ into a learning algorithm $\mathcal{A}_i^j$ for each $i, j \in [K]$ and vice versa. To do so, our proof techniques involve adapting the agnostic-to-realizable reductions for batch and online classification, proposed by Hopkins et al. [2022] and Raman et al. [2023] respectively, to ranking.

2 Preliminaries and Notation

Let $\mathcal{X}$ denote the instance space, $\mathcal{S}_K$ the set of permutations over labels $[K] := \{1, \dots, K\}$, and $\mathcal{Y} = \{0, 1, \dots, B\}^K$ the target space for some $K, B \in \mathbb{N}$. We highlight that the set of labels $[K]$ is fixed beforehand and does not depend on the instance $x \in \mathcal{X}$. This is to be contrasted with subset ranking, where the set of labels can change depending on the instance $x \in \mathcal{X}$. We refer to an element $y \in \mathcal{Y}$ as a relevance-score vector that indicates the relevance of each of the $K$ labels, where $B$ indicates the highest relevance and $0$ the lowest. Throughout the paper, we treat a permutation $\sigma \in \mathcal{S}_K$ as a vector in $\{1, \dots, K\}^K$ that induces a ranking of the $K$ labels in decreasing order of relevance. Accordingly, for an index $i \in [K]$, we let $\sigma_i \in [K]$ denote the rank of label $i$. Likewise, given an index $i \in [K]$, we let $y_i$ denote the relevance of label $i$. In addition, it will be useful to define a mapping from $\mathcal{S}_K$ to $\{0, 1\}^K$. In particular, we define $\text{BinRel}(\cdot, \cdot) : \mathcal{S}_K \times [K] \to \{0, 1\}^K$ as an operator that, given a permutation (ranking) $\sigma \in \mathcal{S}_K$ and a threshold $p \in [K]$, outputs a bit string $b \in \{0, 1\}^K$ s.t. $b_i = \mathbb{1}\{\sigma_i \le p\}$.

Ranking Equivalences. Our construction of ranking loss families in Section 3 requires different notions of equivalence between permutations (rankings) in $\mathcal{S}_K$. To that end, we say that $\sigma = \hat{\sigma}$ if and only if for all $i \in [K]$, $\sigma_i = \hat{\sigma}_i$. On the other hand, we say $\sigma \stackrel{p}{=} \hat{\sigma}$ if and only if $\{i : \sigma_i \le p\} = \{i : \hat{\sigma}_i \le p\}$. That is, two rankings are $p$-equivalent if the sets of labels they rank in the top-$p$ are equal. Finally, we say $\sigma \stackrel{[p]}{=} \hat{\sigma}$ if and only if for all $j \in [p]$, $\{i : \sigma_i \le j\} = \{i : \hat{\sigma}_i \le j\}$.
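To make these definitions concrete, here is a minimal Python sketch of the BinRel operator and the two equivalence relations. The function names and the 0-indexed labels are our own conventions, not the paper's.

```python
import numpy as np

def bin_rel(sigma, p):
    """BinRel(sigma, p): bit string b with b_i = 1 iff label i is ranked in the
    top p. Here sigma[i] is the rank of label i (1 = most relevant)."""
    return (np.asarray(sigma) <= p).astype(int)

def p_equivalent(sigma, sigma_hat, p):
    """sigma =_p sigma_hat: the *sets* of labels ranked in the top p agree."""
    return set(np.flatnonzero(np.asarray(sigma) <= p)) == \
           set(np.flatnonzero(np.asarray(sigma_hat) <= p))

def bracket_p_equivalent(sigma, sigma_hat, p):
    """sigma =_[p] sigma_hat: the top-j sets agree for every j <= p, i.e. both
    the set and the order of the top-p labels agree."""
    return all(p_equivalent(sigma, sigma_hat, j) for j in range(1, p + 1))

# Example with K = 4: both rankings put labels {0, 1} in the top 2, but in
# different orders, so they are 2-equivalent yet not [2]-equivalent.
sigma, sigma_hat = [1, 2, 3, 4], [2, 1, 4, 3]
assert p_equivalent(sigma, sigma_hat, 2)
assert not bracket_p_equivalent(sigma, sigma_hat, 2)
```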
That is, two rankings are $[p]$-equivalent if not only the set but also the order of the labels they rank in the top-$p$ are equal.

Ranking Hypothesis. A ranking hypothesis $h \in \mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ maps instances in $\mathcal{X}$ to a ranking (permutation) in $\mathcal{S}_K$. Given an instance $x \in \mathcal{X}$, one can think of $h(x)$ as $h$'s ranking of the $K$ different labels in decreasing order of relevance. For any ranking hypothesis $h$, we let $h_i : \mathcal{X} \to [K]$ denote its restriction to the $i$-th coordinate of the output. Accordingly, for an instance $x \in \mathcal{X}$, $h_i(x)$ gives the rank that $h$ assigns to label $i$. Given a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ and any $i, j \in [K]$, we define its binary threshold-restricted hypothesis class $\mathcal{H}_i^j := \{h_i^j : h \in \mathcal{H}\}$, where $h_i^j(x) := \mathbb{1}\{h_i(x) \le j\}$. We can think of hypotheses in $\mathcal{H}_i^j$ as providing binary responses to queries of the form: "for instance $x$, should label $i$ be ranked in the top $j$?" These threshold-restricted classes are central to our characterization of learnability in both the batch and online learning settings.

Batch Learnability. In the batch setting, we are interested in characterizing the learnability of a ranking hypothesis class $\mathcal{H}$ under a model similar to the classical PAC model [Valiant, 1984].

Definition 1 (Agnostic Ranking PAC Learnability). A ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. a loss $\ell : \mathcal{S}_K \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ if there exists a function $m : (0, 1)^2 \times \mathbb{N} \to \mathbb{N}$ and a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{S}_K^{\mathcal{X}}$ with the following property: for every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$, running algorithm $\mathcal{A}$ on $n \ge m(\epsilon, \delta, K)$ i.i.d. samples from $\mathcal{D}$ outputs a predictor $g = \mathcal{A}(S)$ such that with probability at least $1 - \delta$ over $S \sim \mathcal{D}^n$,

$$\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le \inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)] + \epsilon.$$

If $\mathcal{D}$ is restricted to the class of distributions such that $\inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)] = 0$, then we say we are in the realizable setting. Note that unlike in classification, realizability in the multilabel ranking setting is loss dependent.

Online Learnability. In the online setting, an adversary plays a sequential game with the learner over $T$ rounds. In each round $t \in [T]$, the adversary selects a labeled instance $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ and reveals $x_t$ to the learner. The learner makes a (potentially randomized) prediction $\hat{\sigma}_t \in \mathcal{S}_K$. Finally, the adversary reveals the true relevance-score vector $y_t$, and the learner suffers the loss $\ell(\hat{\sigma}_t, y_t)$, where $\ell$ is some pre-specified ranking loss function. Given a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$, the goal of the learner is to output predictions $\hat{\sigma}_t$ such that its cumulative loss is close to the best possible cumulative loss over hypotheses in $\mathcal{H}$. A hypothesis class is online learnable if there exists an algorithm such that for any sequence of labeled examples $(x_1, y_1), \dots, (x_T, y_T)$, the difference in cumulative loss between its predictions and the predictions of the best fixed hypothesis in $\mathcal{H}$ is small.

Definition 2 (Agnostic Online Ranking Learnability). A ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. a loss $\ell$ if there exists a (potentially randomized) algorithm $\mathcal{A}$ such that, for any adaptively chosen sequence of labeled examples $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, the algorithm outputs $\mathcal{A}(x_t) \in \mathcal{S}_K$ at every iteration $t \in [T]$ such that its expected regret,

$$R_{\mathcal{A}}(T, K) := \mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{A}(x_t), y_t)\right] - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(h(x_t), y_t),$$

is a non-decreasing, sub-linear function of $T$. Here, the expectation is taken with respect to the randomness of the algorithm $\mathcal{A}$.

If it is further guaranteed that there exists a hypothesis $h^\star \in \mathcal{H}$ such that $\sum_{t=1}^{T} \ell(h^\star(x_t), y_t) = 0$, then we say we are in the realizable setting. Again, realizability is loss dependent.
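The threshold-restricted classes $\mathcal{H}_i^j$ are simple to realize in code. Below is a minimal sketch (our own helper, using the rank-vector convention above) of how a single binary hypothesis $h_i^j$ is derived from a ranking hypothesis $h$.

```python
def threshold_restriction(h, i, j):
    """h_i^j from Section 2: given a ranking hypothesis h, where h(x)[i] is the
    rank of label i, return the binary hypothesis answering "is label i ranked
    in the top j?" via the indicator 1{h_i(x) <= j}."""
    return lambda x: int(h(x)[i] <= j)

# Usage: a constant ranking hypothesis over K = 3 labels (label 1 ranked first).
h = lambda x: [2, 1, 3]
h_top1 = threshold_restriction(h, 1, 1)   # "is label 1 in the top 1?"
assert h_top1("any instance") == 1
```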
3 Ranking Loss Families

In statistical learning theory, we often characterize learnability with respect to a loss function. Unlike the 0-1 loss in classification, there is no canonical loss function in multilabel ranking. Accordingly, we define two general families of ranking loss functions in this section and later characterize learnability with respect to all losses in these families. In Appendix A, we show that many of the ranking metrics used in practice (e.g., Pairwise Rank Loss, Discounted Cumulative Gain, Reciprocal Rank, Average Precision, Precision@p) fall into one of these two families.

At a high level, we can classify ranking losses into two main groups: (A) losses that care about both the order and the magnitude of the relevance scores within the top-$p$ ranked labels, and (B) losses that only care about the magnitude of the relevance scores within the top-$p$ ranked labels. Our goal is to define a loss family for each of groups A and B. To do so, we start by identifying a canonical ranking loss that lies in each group. For group A, the normalized sum loss@p,

$$\ell^{@p}_{\text{sum}}(\sigma, y) = \sum_{i=1}^{K} \min(\sigma_i, p+1)\, y_i - Z^p_y,$$

captures both the order and the magnitude of the relevance scores only for the top-$p$ ranked labels. Here, $Z^p_y$ is an appropriately chosen normalization factor that depends only on $p$ and $y$ such that $\min_{\sigma \in \mathcal{S}_K} \ell^{@p}_{\text{sum}}(\sigma, y) = 0$. For group B, the normalized precision loss@p,

$$\ell^{@p}_{\text{prec}}(\sigma, y) = Z^p_y - \sum_{i=1}^{K} \mathbb{1}\{\sigma_i \le p\}\, y_i,$$

cares only about the magnitude of the relevance scores in the top-$p$ ranked labels. Again, $Z^p_y$ is an appropriately chosen normalization constant that depends only on $p$ and $y$ such that the minimum loss is 0. The form of $\ell^{@p}_{\text{prec}}$ differs from that of $\ell^{@p}_{\text{sum}}$ because $\sum_{i=1}^{K} \mathbb{1}\{\sigma_i \le p\}\, y_i$ is a gain whereas $\sum_{i=1}^{K} \min(\sigma_i, p+1)\, y_i$ is a loss.

Next, we build loss families around $\ell^{@p}_{\text{sum}}$ and $\ell^{@p}_{\text{prec}}$. For $\ell^{@p}_{\text{sum}}$, consider the family

$$\mathcal{L}(\ell^{@p}_{\text{sum}}) = \left\{\ell \in \mathbb{R}_{\ge 0}^{\mathcal{S}_K \times \mathcal{Y}} : \ell = 0 \text{ if and only if } \ell^{@p}_{\text{sum}} = 0\right\} \cap \left\{\ell : \sigma \stackrel{[p]}{=} \hat{\sigma} \implies \ell(\sigma, y) = \ell(\hat{\sigma}, y)\right\}.$$

By definition, $\mathcal{L}(\ell^{@p}_{\text{sum}})$ contains those ranking losses that are (1) zero-matched with $\ell^{@p}_{\text{sum}}$ and (2) unchanged for any two predicted rankings (permutations) that are $[p]$-equivalent. The second constraint ensures that losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ depend only on the order and the set of labels that $\sigma$ ranks in the top-$p$. Likewise, we can construct a similar loss family around $\ell^{@p}_{\text{prec}}$:

$$\mathcal{L}(\ell^{@p}_{\text{prec}}) = \left\{\ell \in \mathbb{R}_{\ge 0}^{\mathcal{S}_K \times \mathcal{Y}} : \ell = 0 \text{ if and only if } \ell^{@p}_{\text{prec}} = 0\right\} \cap \left\{\ell : \sigma \stackrel{p}{=} \hat{\sigma} \implies \ell(\sigma, y) = \ell(\hat{\sigma}, y)\right\}.$$

The set $\mathcal{L}(\ell^{@p}_{\text{prec}})$ contains those ranking losses that are (1) zero-matched with $\ell^{@p}_{\text{prec}}$ and (2) unchanged for any two predicted rankings that are $p$-equivalent. The second constraint ensures that losses in $\mathcal{L}(\ell^{@p}_{\text{prec}})$ depend only on the set of labels that $\sigma$ ranks in the top-$p$. A major contribution of this paper is showing that both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$ are actually equivalence classes: the same characterization of learnability holds for every loss in the family.

4 Batch Multilabel Ranking

In this section, we characterize the agnostic PAC learnability of hypothesis classes $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ with respect to both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. Our main results, stated below as two theorems, relate the learnability of $\mathcal{H}$ to the learnability of the threshold-restricted classes $\mathcal{H}_i^j$.

Theorem 4.1. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ if and only if for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss.

Theorem 4.2. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{prec}})$ if and only if for all $i \in [K]$, $\mathcal{H}_i^p$ is agnostic PAC learnable w.r.t. the 0-1 loss.
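Since the exact form of the normalization term $Z^p_y$ is stated only implicitly (it depends on $p$ and $y$ and makes the best ranking incur zero loss), the sketch below instantiates one natural choice: the optimum of the raw sum (resp. of the top-$p$ gain) over permutations, which by the rearrangement inequality is attained by ranking labels in decreasing order of relevance. Treat this as an illustration of the two canonical losses, not the paper's exact constants.

```python
import numpy as np

def sum_loss_at_p(sigma, y, p):
    """Sum loss@p (one reconstruction): raw capped-rank-weighted sum minus its
    minimum over permutations, so an optimal ranking incurs zero loss."""
    sigma, y = np.asarray(sigma), np.asarray(y)
    raw = np.minimum(sigma, p + 1) @ y
    # The raw sum is minimized by ranking labels in decreasing relevance order.
    best_ranks = np.empty_like(y)
    best_ranks[np.argsort(-y, kind="stable")] = np.arange(1, len(y) + 1)
    return raw - np.minimum(best_ranks, p + 1) @ y

def prec_loss_at_p(sigma, y, p):
    """Precision loss@p: shortfall of the top-p relevance mass relative to the
    best achievable top-p relevance mass."""
    sigma, y = np.asarray(sigma), np.asarray(y)
    return np.sort(y)[-p:].sum() - y[sigma <= p].sum()

y = np.array([3, 0, 2, 1])        # relevance scores with B = 3, K = 4
sigma = np.array([1, 4, 2, 3])    # optimal ranking: labels 0, 2, 3, 1
print(sum_loss_at_p(sigma, y, 2), prec_loss_at_p(sigma, y, 2))  # 0 0
```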
Since the VC dimension characterizes the learnability of binary hypothesis classes under the 0-1 loss, an important corollary of Theorems 4.1 and 4.2 is that finiteness of the $\text{VC}(\mathcal{H}_i^j)$'s, for the appropriate $(i, j) \in [K] \times [p]$, is necessary and sufficient for agnostic ranking PAC learnability. Later on, we use this fact to prove that linear ranking hypothesis classes are agnostic ranking PAC learnable.

We start with the proof of Theorem 4.1, which follows in three steps. First, we show that if for all $(i, j) \in [K] \times [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss, then Empirical Risk Minimization (ERM) is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$. Next, we show that if $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. any loss $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Our proof of the latter uses the realizable-to-agnostic conversion from Hopkins et al. [2022]. Finally, we prove the necessity direction: if $\mathcal{H}$ is agnostic PAC learnable w.r.t. an arbitrary $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, then for all $(i, j) \in [K] \times [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss. The proof of the necessity direction also uses the realizable-to-agnostic conversion from Hopkins et al. [2022]. The proof of Theorem 4.2 follows exactly the same way as that of Theorem 4.1 with some minor changes. Thus, we focus only on the proof of Theorem 4.1 in this section and defer all discussion of Theorem 4.2 to Appendix C.3.

We begin with Lemma 4.3, which asserts that if $\mathcal{H}_i^j$ is agnostic PAC learnable for all $(i, j) \in [K] \times [p]$, then ERM is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$.

Lemma 4.3. If for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss, then ERM is an agnostic PAC learner for $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ w.r.t. $\ell^{@p}_{\text{sum}}$.

The proof of Lemma 4.3 exploits the nice structure of $\ell^{@p}_{\text{sum}}$ by upper bounding the empirical Rademacher complexity of the loss class $\ell^{@p}_{\text{sum}} \circ \mathcal{H} = \{(x, y) \mapsto \ell^{@p}_{\text{sum}}(h(x), y) : h \in \mathcal{H}\}$ and showing that it vanishes as the sample size $n$ becomes large. Then, standard uniform convergence arguments outlined in Proposition C.1 imply that ERM is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$. The full proof is in Appendix C. Since arbitrary losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ may not have nice analytical forms, Lemma 4.4 relates the learnability of an arbitrary loss $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ to the learnability of $\ell^{@p}_{\text{sum}}$.

Lemma 4.4. If $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. any $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$.

Proof (of Lemma 4.4). Fix $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Let $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$ and $b = \max_{\sigma, y} \ell(\sigma, y)$. We need to show that if $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell$. We will do so in two steps. First, we will show that if $\mathcal{A}$ is an agnostic PAC learner for $\ell^{@p}_{\text{sum}}$, then $\mathcal{A}$ is also a realizable PAC learner for $\ell$. Next, we will show how to convert a realizable PAC learner for $\ell$ into an agnostic PAC learner for $\ell$ in a black-box fashion. The composition of these two pieces yields an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$.

Realizable PAC learnability of $\mathcal{H}$ w.r.t. $\ell$. If $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then there exists a learning algorithm $\mathcal{A}$ with sample complexity $m(\epsilon, \delta, K)$ s.t. for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with probability $1 - \delta$ over a sample $S \sim \mathcal{D}^n$ of size $n \ge m(\epsilon, \delta, K)$, the output predictor $g = \mathcal{A}(S)$ achieves

$$\mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(g(x), y)] \le \inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(h(x), y)] + \epsilon.$$

In the realizable setting, we are further guaranteed that there exists a hypothesis $h^\star \in \mathcal{H}$ s.t. $\mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] = 0$. Since $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, this also implies that $\mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(h^\star(x), y)] = 0$. Therefore, under realizability and the fact that $\ell \le b\, \ell^{@p}_{\text{sum}}$, we have $\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le b\epsilon$.
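Lemma 4.3 says that ERM suffices once the threshold restrictions are learnable. As a point of reference, here is the ERM rule itself in a minimal form, over a finite class for illustration; the lemma's uniform-convergence argument is what extends the guarantee beyond finite classes.

```python
def erm(hypotheses, sample, loss):
    """Empirical Risk Minimization over a (finite, for illustration) ranking
    class: return the hypothesis with the smallest empirical loss on the sample."""
    return min(hypotheses, key=lambda h: sum(loss(h(x), y) for x, y in sample))

# Usage with the sum loss@p sketched above (assumed in scope):
# g = erm(H, S, lambda sigma, y: sum_loss_at_p(sigma, y, p=2))
```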
This completes the first part of the proof, as we have shown that $\mathcal{A}$ is also a realizable PAC learner for $\mathcal{H}$ w.r.t. $\ell$ with sample complexity $m(\frac{\epsilon}{b}, \delta, K)$.

Realizable-to-agnostic conversion. Now, we show how to convert the realizable PAC learner $\mathcal{A}$ for $\ell$ into an agnostic PAC learner for $\ell$ in a black-box fashion. For this step, we extend the agnostic-to-realizable reduction proposed by Hopkins et al. [2022] to the ranking setting by accommodating the mismatch between the range space of $\mathcal{H}$ and the label space $\mathcal{Y}$. In particular, we show that Algorithm 1 below converts a realizable PAC learner for $\ell$ into an agnostic PAC learner for $\ell$. Note that although the input $\mathcal{A}$ is a realizable learner, the distribution $\mathcal{D}$ may not be realizable.

Algorithm 1: Agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$
Input: Realizable PAC learner $\mathcal{A}$ for $\mathcal{H}$; unlabeled and labeled samples $S_U \sim \mathcal{D}_{\mathcal{X}}^n$ and $S_L \sim \mathcal{D}^m$
1. For each $h \in \mathcal{H}|_{S_U}$, construct a dataset $S_U^h = \{(x_1, y_1), \dots, (x_n, y_n)\}$ s.t. $y_i \sim \text{Unif}\{\text{BinRel}(h(x_i), 1), \dots, \text{BinRel}(h(x_i), p)\}$
2. Run $\mathcal{A}$ over all datasets to get $C(S_U) := \{\mathcal{A}(S_U^h) : h \in \mathcal{H}|_{S_U}\}$
3. Return $\hat{g} \in C(S_U)$ with the lowest empirical error over $S_L$ w.r.t. $\ell$

Let $h^\star = \arg\min_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)]$ denote the optimal predictor in $\mathcal{H}$ w.r.t. $\mathcal{D}$. Consider the sample $S_U^{h^\star}$ and let $g = \mathcal{A}(S_U^{h^\star})$. We can think of $g$ as the output of $\mathcal{A}$ run over an i.i.d. sample $S$ drawn from $\mathcal{D}^\star$, a joint distribution over $\mathcal{X} \times \mathcal{Y}$ defined procedurally by first sampling $x \sim \mathcal{D}_{\mathcal{X}}$, then independently sampling $j \sim \text{Unif}([p])$, and finally outputting the labeled sample $(x, \text{BinRel}(h^\star(x), j))$. Note that $\mathcal{D}^\star$ is indeed a realizable distribution (realized by $h^\star$) w.r.t. both $\ell$ and $\ell^{@p}_{\text{sum}}$. Recall that $m_{\mathcal{A}}(\frac{\epsilon}{b}, \delta, K)$ is the sample complexity of $\mathcal{A}$ as a realizable PAC learner for $\ell$. Since $\mathcal{A}$ is a realizable learner for $\mathcal{H}$ w.r.t. $\ell$, we have that for $n \ge m_{\mathcal{A}}(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K)$, with probability at least $1 - \frac{\delta}{2}$,

$$\mathbb{E}_{\mathcal{D}^\star}[\ell(g(x), y)] \le \frac{\epsilon a}{2bp}.$$

Next, by Lemma E.1, we have

$$\ell(g(x), y) \le \ell(h^\star(x), y) + \frac{bp}{a}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))]$$

pointwise. Taking expectations on both sides of the inequality gives

$$\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{bp}{a}\, \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{\epsilon}{2}.$$

The last inequality follows from the definition of $\mathcal{D}^\star$, namely $\mathbb{E}_{\mathcal{D}^\star}[\ell(g(x), y)] = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))]$. This shows that $C(S_U)$ contains a hypothesis $g$ that generalizes well with respect to $\mathcal{D}$.

Now we want to show that the predictor $\hat{g}$ returned in step 3 also generalizes well. Crucially, observe that $C(S_U)$ is a finite hypothesis class with cardinality at most $2^{nK}$. By standard Chernoff and union bounds, with probability at least $1 - \frac{\delta}{2}$, the empirical risk of every hypothesis in $C(S_U)$ on a sample of size $\frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta}$ is at most $\frac{\epsilon}{4}$ away from its true error. So, if $m = |S_L| \ge \frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta}$, then with probability at least $1 - \frac{\delta}{2}$,

$$\frac{1}{m}\sum_{(x, y) \in S_L} \ell(g(x), y) \le \mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] + \frac{\epsilon}{4} \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{3\epsilon}{4}.$$

Since $\hat{g}$ is the ERM on $S_L$ over $C(S_U)$, its empirical risk can be at most $\mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{3\epsilon}{4}$. Given that the population risk of $\hat{g}$ can be at most $\frac{\epsilon}{4}$ away from its empirical risk, we have that $\mathbb{E}_{\mathcal{D}}[\ell(\hat{g}(x), y)] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \epsilon$. Applying union bounds, the entire process succeeds with probability $1 - \delta$. We can upper bound the sample complexity of Algorithm 1, denoted $n(\epsilon, \delta, K)$, as

$$n(\epsilon, \delta, K) \le m_{\mathcal{A}}\left(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K\right) + \frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta} = O\left(\frac{K}{\epsilon^2}\, m_{\mathcal{A}}\left(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K\right) + \frac{1}{\epsilon^2}\log\frac{1}{\delta}\right),$$

where we use $|C(S_U)| \le 2^{K\, m_{\mathcal{A}}(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K)}$. This shows that Algorithm 1 is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$.

Finally, Lemma 4.5 gives the necessity direction of Theorem 4.1.

Lemma 4.5. If a hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, then $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss for all $(i, j) \in [K] \times [p]$.
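A compact Python rendering of Algorithm 1 may help fix ideas. It assumes the `bin_rel` helper from Section 2 and takes the projection $\mathcal{H}|_{S_U}$ as an explicit finite list; both are simplifications of the formal construction, so treat this as a sketch rather than the paper's implementation.

```python
import random

def algorithm1(A, H_proj, S_U, S_L, loss, p):
    """Sketch of Algorithm 1: convert a realizable PAC learner A for H into an
    agnostic learner. H_proj plays the role of H|_{S_U}, one representative
    ranking hypothesis per behavior on S_U (hypothetical input)."""
    candidates = []
    for h in H_proj:
        # Step 1: relabel S_U using h's ranking and a random threshold j ~ Unif([p]).
        dataset = [(x, bin_rel(h(x), random.randint(1, p))) for x in S_U]
        # Step 2: run the realizable learner on each self-labeled dataset.
        candidates.append(A(dataset))
    # Step 3: return the candidate C(S_U) member with lowest empirical loss on S_L.
    return min(candidates, key=lambda g: sum(loss(g(x), y) for x, y in S_L))
```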
Like the sufficiency proofs, the proof of Lemma 4.5 is constructive. Given an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$, we construct an agnostic PAC learner for $\mathcal{H}_i^j$ w.r.t. the 0-1 loss using a slight modification of Algorithm 1. We defer the full proof to Appendix C since the analysis is similar to that of Algorithm 1. Together, Lemmas 4.3, 4.4, and 4.5 imply Theorem 4.1.

We conclude this section by giving a concrete application of our characterization. Consider the class of ranking hypotheses $\mathcal{H} = \{x \mapsto \text{argsort}(Wx) : W \in \mathbb{R}^{K \times d}\}$ that compute rankings by sorting, in descending order, the scores obtained from a linear function of the input features. Lemma 4.6, whose proof is in Appendix B, bounds the VC dimension of $\mathcal{H}_i^j$ for arbitrary $i, j \in [K]$.

Lemma 4.6. Let $\mathcal{H} = \{x \mapsto \text{argsort}(Wx) : W \in \mathbb{R}^{K \times d}\}$ be a linear ranking hypothesis class. Then for all $i, j \in [K]$, $\text{VC}(\mathcal{H}_i^j) = \tilde{O}(Kd)$, where $\tilde{O}$ hides logarithmic factors of $d$ and $K$.

Combining Lemma 4.6 with Theorems 4.1 and 4.2 shows that linear ranking hypothesis classes are agnostic ranking PAC learnable w.r.t. all losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. More generally, in Appendix B we give a dimension-based sufficient condition under which generic score-based ranking hypothesis classes are agnostic ranking PAC learnable.

5 Online Multilabel Ranking

We now move to the online setting and characterize the online learnability of hypothesis classes $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ with respect to both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. As in the batch setting, our characterization relates the learnability of $\mathcal{H}$ to the learnability of the threshold-restricted classes $\mathcal{H}_i^j$.

Theorem 5.1. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ if and only if for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic online learnable w.r.t. the 0-1 loss.

Theorem 5.2. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{prec}})$ if and only if for all $i \in [K]$, $\mathcal{H}_i^p$ is agnostic online learnable w.r.t. the 0-1 loss.

Since the Littlestone dimension characterizes the online learnability of binary hypothesis classes under the 0-1 loss, an important corollary of Theorems 5.1 and 5.2 is that finiteness of $\text{Ldim}(\mathcal{H}_i^j)$, for the appropriate $(i, j) \in [K] \times [p]$, is necessary and sufficient for agnostic online ranking learnability.

We now begin the proof of Theorem 5.1. Since the proof of Theorem 5.2 follows a similar trajectory, we defer all discussion of Theorem 5.2 to Appendix D.2. Unlike Theorem 4.1 in the batch setting, we prove the sufficiency and necessity directions of Theorem 5.1 directly. We chose this direct path because, unlike in the batch setting, sequential Rademacher analysis does not yield a constructive algorithm [Rakhlin et al., 2015]. Our proofs, on the other hand, are constructive and use the celebrated Randomized Exponential Weights Algorithm (REWA) [Cesa-Bianchi and Lugosi, 2006]. Moreover, a key ingredient of our proof is the realizable-to-agnostic conversion from Raman et al. [2023].

Proof (of sufficiency in Theorem 5.1). Fix $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Let $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$ and $M = \max_{\sigma, y} \ell(\sigma, y)$. Given online learners for the $\mathcal{H}_i^j$'s w.r.t. the 0-1 loss, our goal is to construct an online learner $Q$ for $\mathcal{H}$ w.r.t. $\ell$ that enjoys sub-linear regret in $T$. Our strategy will be to construct a set of experts $\mathcal{E}$ using the online learners for the $\mathcal{H}_i^j$'s and run REWA using $\mathcal{E}$ and an appropriately scaled version of $\ell$. Our proof borrows ideas from the realizable-to-agnostic online conversion from Raman et al. [2023], and so we use the same notation whenever possible.
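For concreteness, a hypothesis from the linear ranking class of Lemma 4.6 can be sketched as follows; the rank-vector convention matches Section 2, and the stable tie-breaking rule is our own choice.

```python
import numpy as np

def linear_ranker(W):
    """A hypothesis from H = {x -> argsort(Wx)}: compute scores s = Wx and rank
    labels in decreasing score order, returning sigma with sigma[i] = rank of label i."""
    def h(x):
        scores = W @ x
        order = np.argsort(-scores, kind="stable")    # labels by decreasing score
        sigma = np.empty(len(scores), dtype=int)
        sigma[order] = np.arange(1, len(scores) + 1)  # rank of each label
        return sigma
    return h

# Usage with K = 3 labels and d = 2 features: label 0 gets the highest score.
h = linear_ranker(np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
print(h(np.array([2.0, 1.0])))  # [1 3 2]
```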
Let $(x_1, y_1), \dots, (x_T, y_T) \in (\mathcal{X} \times \mathcal{Y})^T$ denote the stream of points to be observed by the online learner. We will assume an oblivious adversary, so the stream is fixed before the game starts. A standard reduction (Chapter 4 in Cesa-Bianchi and Lugosi [2006]) allows us to convert oblivious regret bounds to adaptive regret bounds. Since $\mathcal{H}_i^j \subseteq \{0, 1\}^{\mathcal{X}}$ is online learnable w.r.t. the 0-1 loss, we are guaranteed the existence of realizable online learners $\mathcal{A}_i^j$ for each $(i, j) \in [K] \times [p]$.

Constructing Experts. For any bitstring $b \in \{0, 1\}^T$, let $\phi : \{t \in [T] : b_t = 1\} \to \mathcal{S}_K$ denote a function mapping time points where $b_t = 1$ to rankings (permutations). Let $\Phi_b = \mathcal{S}_K^{\{t \in [T] : b_t = 1\}}$ denote the set of all such functions $\phi$. For every $h \in \mathcal{H}$, there exists a $\phi_b^h \in \Phi_b$ such that for all $t \in \{t : b_t = 1\}$, $\phi_b^h(t) = h(x_t)$. Let $|b| = |\{t \in [T] : b_t = 1\}|$. For every $b \in \{0, 1\}^T$ and $\phi \in \Phi_b$, we define an expert $E_{b,\phi}$. Expert $E_{b,\phi}$, formally presented in Algorithm 2, uses the $\mathcal{A}_i^j$'s to make predictions in each round. However, $E_{b,\phi}$ updates the $\mathcal{A}_i^j$'s only on those rounds where $b_t = 1$, using $\phi$ to compute a labeled instance. For every $b \in \{0, 1\}^T$, let $\mathcal{E}_b = \bigcup_{\phi \in \Phi_b} \{E_{b,\phi}\}$ denote the set of all experts parameterized by functions $\phi \in \Phi_b$. If $b$ is the all-zeros bitstring, then $\mathcal{E}_b$ would be empty. Therefore, we actually define $\mathcal{E}_b = \{E_0\} \cup \bigcup_{\phi \in \Phi_b} \{E_{b,\phi}\}$, where $E_0$ is the expert that never updates the $\mathcal{A}_i^j$'s and only uses them for predictions in all $t \in [T]$. Note that $1 \le |\mathcal{E}_b| \le (K!)^{|b|} \le K^{K|b|}$.

Using these experts, Algorithm 3 presents our agnostic online learner $Q$ for $\mathcal{H}$ w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. We now show that $Q$ enjoys sub-linear regret. We highlight that there are three sources of randomness in the online learner $Q$: the randomness in sampling $B$, the internal randomness of the $\mathcal{A}_i^j$'s, and the internal randomness of $\mathcal{P}$. One may think of internal randomness as arising from the sampling step involved in randomized predictions. Let $A$ be the random variable associated with the joint internal randomness of the $\mathcal{A}_i^j$'s for all $(i, j) \in [K] \times [p]$. Similarly, let $P$ denote the random variable associated with the internal randomness of $\mathcal{P}$. We begin by using the guarantee of REWA.

Algorithm 2: Expert($b$, $\phi$)
Input: Independent copies of realizable online learners $\mathcal{A}_i^j$ for each $(i, j) \in [K] \times [p]$
1. for $t = 1, \dots, T$ do
2.   Receive example $x_t$
3.   Define a binary vote matrix $V_t \in \{0, 1\}^{K \times p}$ such that $V_t[i, j] = \mathcal{A}_i^j(x_t)$
4.   Predict $\hat{\sigma}_t \in \arg\min_{\sigma \in \mathcal{S}_K} \langle \sigma, V_t \mathbb{1}_p \rangle$
5.   if $b_t = 1$ then
6.     Let $\sigma = \phi(t)$ and, for all $(i, j) \in [K] \times [p]$, update $\mathcal{A}_i^j$ by passing it $(x_t, \mathbb{1}\{\sigma_i \le j\})$

Algorithm 3: Agnostic Online Learner $Q$ for $\mathcal{H}$ w.r.t. $\ell$
Input: Parameter $0 < \beta < 1$
1. Let $B \in \{0, 1\}^T$ s.t. $B_t \sim \text{Bernoulli}(T^{\beta - 1})$ i.i.d.
2. Construct the set of experts $\mathcal{E}_B = \{E_0\} \cup \bigcup_{\phi \in \Phi_B} \{E_{B,\phi}\}$ according to Algorithm 2
3. Run REWA $\mathcal{P}$ using $\mathcal{E}_B$ and the scaled loss function $\frac{\ell}{M}$ over the stream $(x_1, y_1), \dots, (x_T, y_T)$

REWA Guarantee. Using Theorem 21.11 in Shalev-Shwartz and Ben-David [2014] and the fact that $B$, $A$, and $P$ are mutually independent, REWA guarantees almost surely that

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{P}(x_t), y_t) \,\middle|\, B, A\right] \le \inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t) + M\sqrt{2T\ln(|\mathcal{E}_B|)}.$$

Taking an outer expectation gives

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{P}(x_t), y_t)\right] \le \mathbb{E}\left[\inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right].$$

Noting that $Q(x_t) = \mathcal{P}(x_t)$, we obtain

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(Q(x_t), y_t)\right] \le \mathbb{E}\left[\inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right] \le \mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right].$$

In the last step, we used the fact that for all $b \in \{0, 1\}^T$ and $h \in \mathcal{H}$, $E_{b,\phi_b^h} \in \mathcal{E}_b$. Here, $h^\star = \arg\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(h(x_t), y_t)$ is the optimal hypothesis in hindsight. First, note that $\ln(|\mathcal{E}_B|) \le K|B|\ln(K)$. Using Jensen's inequality gives $\mathbb{E}[\sqrt{2T\ln(|\mathcal{E}_B|)}] \le \sqrt{2T^{1+\beta}K\ln K}$. Thus,

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(Q(x_t), y_t)\right] \le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right]}_{\text{(I)}} + M\sqrt{2T^{1+\beta}K\ln K}. \quad (1)$$

Upper bounding (I). It now suffices to upper bound $\mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right]$.
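The prediction step of Algorithm 2 (step 4) has a simple closed form: by the rearrangement inequality, $\langle \sigma, V_t \mathbb{1}_p \rangle$ is minimized by giving the smallest ranks to the labels with the most votes. A minimal sketch of that step (the function name is ours):

```python
import numpy as np

def expert_prediction(votes):
    """Step 4 of Algorithm 2: given the K x p binary vote matrix V_t, where row i
    collects label i's votes across thresholds, return the ranking minimizing
    <sigma, V_t 1_p> by ranking labels in decreasing total-vote order."""
    totals = votes.sum(axis=1)                  # V_t 1_p: total votes per label
    order = np.argsort(-totals, kind="stable")  # labels by decreasing vote count
    sigma = np.empty(len(totals), dtype=int)
    sigma[order] = np.arange(1, len(totals) + 1)
    return sigma

# Usage: K = 3 labels, p = 2 binary learners per label.
V = np.array([[1, 1], [0, 1], [0, 0]])
print(expert_prediction(V))  # [1 2 3]: label 0 received the most votes
```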
Recall that Lemma E.1 gives, pointwise,

$$\ell(E_{B,\phi_B^{h^\star}}(x_t), y_t) \le \ell(h^\star(x_t), y_t) + \frac{pM}{a}\, \mathbb{E}_{j \sim \text{Unif}([p])}\left[\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j))\right], \quad (2)$$

where $M = \max_{\sigma, y} \ell(\sigma, y)$ and $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$. Note that, by the definition of the constant $M$, we further get

$$\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) \le M\, \mathbb{1}\{\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) > 0\} = M\, \mathbb{1}\{\ell^{@p}_{\text{sum}}(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) > 0\},$$

where the equality follows from the fact that $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$.

In order to upper bound the indicator above, we need to introduce some more notation. Given the realizable online learner $\mathcal{A}_i^m$ for $(i, m) \in [K] \times [p]$, an instance $x \in \mathcal{X}$, and an ordered finite sequence of labeled examples $L \in (\mathcal{X} \times \{0, 1\})^*$, let $\mathcal{A}_i^m(x \mid L)$ be the random variable denoting the prediction of $\mathcal{A}_i^m$ on the instance $x$ after running and updating on $L$. For any $b \in \{0, 1\}^T$, $h \in \mathcal{H}$, and $t \in [T]$, let $L_b^h$ denote the sequence of labeled instances that expert $E_{b,\phi_b^h}$ passes to the binary learners on the rounds $\{s < t : b_s = 1\}$, so that $\mathcal{A}_i^j(x_t \mid L_B^{h^\star})$ is the prediction of $\mathcal{A}_i^j$ on $x_t$ after updating on the examples generated from $h^\star$.