# On the Learnability of Multilabel Ranking

Vinod Raman (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; vkraman@umich.edu)
Unique Subedi (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; subedi@umich.edu)
Ambuj Tewari (Department of Statistics, University of Michigan, Ann Arbor, MI 48104; tewaria@umich.edu)

Equal contributions.

Abstract

Multilabel ranking is a central task in machine learning. However, the most fundamental question of learnability in a multilabel ranking setting with relevance-score feedback remains unanswered. In this work, we characterize the learnability of multilabel ranking problems in both the batch and online settings for a large family of ranking losses. Along the way, we give two equivalence classes of ranking losses based on learnability that capture most losses used in practice.

1 Introduction

Multilabel ranking is a supervised learning problem where a learner is presented with an instance $x \in \mathcal{X}$ and is required to output a ranking of $K$ different labels in decreasing order of relevance to $x$. This is in contrast with multilabel classification, where given an instance $x \in \mathcal{X}$, the learner is tasked with predicting a subset of the $K$ labels without any explicit ordering. Multilabel ranking is a canonical learning problem with a wide range of applications to text categorization, genetics, medical imaging, social networks, and visual object recognition [Joachims, 2005, Schapire and Singer, 2000, McCallum, 1999, Clare and King, 2001, Baltruschat et al., 2019, Wang and Sukthankar, 2013, Bucak et al., 2009, Yang et al., 2016]. Recent years have seen a surge in the development of multilabel ranking methods with strong practical and theoretical guarantees [Schapire and Singer, 2000, Dembczynski et al., 2012, Gong et al., 2013, Bucak et al., 2009, Jung and Tewari, 2018, Gao and Zhou, 2011, Koyejo et al., 2015, Zhang and Zhou, 2013, Korba et al., 2018]. A related line of work has studied consistency for the convex surrogates of natural ranking losses [Duchi et al., 2010, Buffoni et al., 2011, Gao and Zhou, 2011, Ravikumar et al., 2011, Calauzenes et al., 2012, Dembczynski et al., 2012]. Despite this vast literature on multilabel ranking, the fundamental question of when a multilabel ranking problem is learnable remains unanswered.

Understanding when a hypothesis class is learnable is a fundamental question in statistical learning theory. For binary classification, the finiteness of the Vapnik-Chervonenkis (VC) dimension is both sufficient and necessary for Probably Approximately Correct (PAC) learning [Vapnik and Chervonenkis, 1974, Valiant, 1984]. Likewise, the finiteness of the Daniely-Shalev-Shwartz (DS) dimension characterizes multiclass PAC learnability [Daniely and Shalev-Shwartz, 2014, Brukhim et al., 2022]. In the online setting, the Littlestone dimension [Littlestone, 1987] characterizes the online learnability of a binary hypothesis class, and the multiclass Littlestone dimension [Daniely et al., 2011] characterizes online multiclass learnability.

Unlike classification, a distinguishing property of multilabel ranking is the mismatch between the predictions the learner makes and the feedback it receives. In particular, a learner is required to produce a permutation that ranks the relevance of the labels but only receives a relevance-score vector as feedback. This feedback model is standard in multilabel ranking since obtaining full permutation feedback is generally costly [Liu et al., 2009].
As a result, unlike the 0-1 loss in classification, there is no canonical loss function in ranking. Together, these two issues create barriers for existing techniques used to prove learnability, such as the agnostic-to-realizable reductions from Hopkins et al. [2022] and Raman et al. [2023], to readily extend to ranking.

In this paper, we characterize the batch and online learnability of a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ under relevance-score feedback, where $\mathcal{S}_K$ is the set of all permutations over $[K] = \{1, \dots, K\}$. In doing so, we make the following contributions.

- We show that a ranking hypothesis class $\mathcal{H}$ embeds $K^2$ different binary hypothesis classes $\mathcal{H}_i^j$ for $i, j \in [K]$, where hypotheses in $\mathcal{H}_i^j$ answer whether label $i$ should be ranked in the top $j$. Our main result relates the learnability of $\mathcal{H}$ to the learnability of the $\mathcal{H}_i^j$'s.
- We define two families of ranking loss functions that capture most, if not all, ranking losses used in practice. We show that these families are actually equivalence classes: the same characterization of batch and online learnability holds for every loss in the family.
- By relating the learnability of $\mathcal{H}$ to the learnability of the binary hypothesis classes $\mathcal{H}_i^j$, we show that existing combinatorial dimensions, like the VC and Littlestone dimensions, continue to characterize learnability in the multilabel ranking setting. This allows us to prove that linear ranking hypothesis classes are learnable in the batch setting.

A unifying theme throughout the paper is our ability to constructively convert a learning algorithm $\mathcal{A}$ for $\mathcal{H}$ into a learning algorithm $\mathcal{A}_i^j$ for each $i, j \in [K]$ and vice versa. To do so, our proof techniques involve adapting the agnostic-to-realizable reductions for batch and online classification, proposed by Hopkins et al. [2022] and Raman et al. [2023] respectively, to ranking.

2 Preliminaries and Notation

Let $\mathcal{X}$ denote the instance space, $\mathcal{S}_K$ the set of permutations over labels $[K] := \{1, \dots, K\}$, and $\mathcal{Y} = \{0, 1, \dots, B\}^K$ the target space for some $K, B \in \mathbb{N}$. We highlight that the set of labels $[K]$ is fixed beforehand and does not depend on the instance $x \in \mathcal{X}$. This is to be contrasted with subset ranking, where the set of labels can change depending on the instance $x \in \mathcal{X}$. We refer to an element $y \in \mathcal{Y}$ as a relevance-score vector that indicates the relevance of each of the $K$ labels, where $B$ indicates the highest relevance and $0$ the lowest. Throughout the paper, we treat a permutation $\sigma \in \mathcal{S}_K$ as a vector in $\{1, \dots, K\}^K$ that induces a ranking of the $K$ labels in decreasing order of relevance. Accordingly, for an index $i \in [K]$, we let $\sigma_i \in [K]$ denote the rank of label $i$. Likewise, given an index $i \in [K]$, we let $y_i$ denote the relevance of label $i$. In addition, it will be useful to define a mapping from $\mathcal{S}_K$ to $\{0, 1\}^K$. In particular, we define $\text{BinRel}(\cdot, \cdot) : \mathcal{S}_K \times [K] \to \{0, 1\}^K$ as an operator that, given a permutation (ranking) $\sigma \in \mathcal{S}_K$ and a threshold $p \in [K]$, outputs a bit string $b \in \{0, 1\}^K$ s.t. $b_i = \mathbb{1}\{\sigma_i \le p\}$.

Ranking Equivalences. Our construction of ranking loss families in Section 3 requires different notions of equivalence between permutations (rankings) in $\mathcal{S}_K$. To that end, we say that $\sigma = \hat{\sigma}$ if and only if for all $i \in [K]$, $\sigma_i = \hat{\sigma}_i$. On the other hand, we say $\sigma \stackrel{p}{=} \hat{\sigma}$ if and only if $\{i : \sigma_i \le p\} = \{i : \hat{\sigma}_i \le p\}$. That is, two rankings are $p$-equivalent if the sets of labels they rank in the top-$p$ are equal. Finally, we say $\sigma \stackrel{[p]}{=} \hat{\sigma}$ if and only if for all $j \in [p]$, $\{i : \sigma_i \le j\} = \{i : \hat{\sigma}_i \le j\}$.
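To make these definitions concrete, here is a minimal Python sketch of the BinRel operator and the two equivalence relations. The function names and the 0-indexed labels are our own conventions, not the paper's.

```python
import numpy as np

def bin_rel(sigma, p):
    """BinRel(sigma, p): bit string b with b_i = 1 iff label i is ranked in the
    top p. Here sigma[i] is the rank of label i (1 = most relevant)."""
    return (np.asarray(sigma) <= p).astype(int)

def p_equivalent(sigma, sigma_hat, p):
    """sigma =_p sigma_hat: the *sets* of labels ranked in the top p agree."""
    return set(np.flatnonzero(np.asarray(sigma) <= p)) == \
           set(np.flatnonzero(np.asarray(sigma_hat) <= p))

def bracket_p_equivalent(sigma, sigma_hat, p):
    """sigma =_[p] sigma_hat: the top-j sets agree for every j <= p, i.e. both
    the set and the order of the top-p labels agree."""
    return all(p_equivalent(sigma, sigma_hat, j) for j in range(1, p + 1))

# Example with K = 4: both rankings put labels {0, 1} in the top 2, but in
# different orders, so they are 2-equivalent yet not [2]-equivalent.
sigma, sigma_hat = [1, 2, 3, 4], [2, 1, 4, 3]
assert p_equivalent(sigma, sigma_hat, 2)
assert not bracket_p_equivalent(sigma, sigma_hat, 2)
```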
That is, two rankings are $[p]$-equivalent if not only the set but also the order of the labels they rank in the top-$p$ are equal.

Ranking Hypothesis. A ranking hypothesis $h \in \mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ maps instances in $\mathcal{X}$ to a ranking (permutation) in $\mathcal{S}_K$. Given an instance $x \in \mathcal{X}$, one can think of $h(x)$ as $h$'s ranking of the $K$ different labels in decreasing order of relevance. For any ranking hypothesis $h$, we let $h_i : \mathcal{X} \to [K]$ denote its restriction to the $i$-th coordinate of the output. Accordingly, for an instance $x \in \mathcal{X}$, $h_i(x)$ gives the rank that $h$ assigns to label $i$. Given a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ and any $i, j \in [K]$, we define its binary threshold-restricted hypothesis class $\mathcal{H}_i^j := \{h_i^j : h \in \mathcal{H}\}$, where $h_i^j(x) := \mathbb{1}\{h_i(x) \le j\}$. We can think of hypotheses in $\mathcal{H}_i^j$ as providing binary responses to queries of the form: "for instance $x$, should label $i$ be ranked in the top $j$?" These threshold-restricted classes are central to our characterization of learnability in both the batch and online learning settings.

Batch Learnability. In the batch setting, we are interested in characterizing the learnability of a ranking hypothesis class $\mathcal{H}$ under a model similar to the classical PAC model [Valiant, 1984].

Definition 1 (Agnostic Ranking PAC Learnability). A ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. a loss $\ell : \mathcal{S}_K \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ if there exists a function $m : (0, 1)^2 \times \mathbb{N} \to \mathbb{N}$ and a learning algorithm $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{S}_K^{\mathcal{X}}$ with the following property: for every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$, running algorithm $\mathcal{A}$ on $n \ge m(\epsilon, \delta, K)$ i.i.d. samples from $\mathcal{D}$ outputs a predictor $g = \mathcal{A}(S)$ such that with probability at least $1 - \delta$ over $S \sim \mathcal{D}^n$,

$$\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le \inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)] + \epsilon.$$

If $\mathcal{D}$ is restricted to the class of distributions such that $\inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)] = 0$, then we say we are in the realizable setting. Note that unlike in classification, realizability in the multilabel ranking setting is loss dependent.

Online Learnability. In the online setting, an adversary plays a sequential game with the learner over $T$ rounds. In each round $t \in [T]$, the adversary selects a labeled instance $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$ and reveals $x_t$ to the learner. The learner makes a (potentially randomized) prediction $\hat{\sigma}_t \in \mathcal{S}_K$. Finally, the adversary reveals the true relevance-score vector $y_t$, and the learner suffers the loss $\ell(\hat{\sigma}_t, y_t)$, where $\ell$ is some pre-specified ranking loss function. Given a ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$, the goal of the learner is to output predictions $\hat{\sigma}_t$ such that its cumulative loss is close to the best possible cumulative loss over hypotheses in $\mathcal{H}$. A hypothesis class is online learnable if there exists an algorithm such that for any sequence of labeled examples $(x_1, y_1), \dots, (x_T, y_T)$, the difference in cumulative loss between its predictions and the predictions of the best fixed hypothesis in $\mathcal{H}$ is small.

Definition 2 (Agnostic Online Ranking Learnability). A ranking hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. a loss $\ell$ if there exists a (potentially randomized) algorithm $\mathcal{A}$ such that, for any adaptively chosen sequence of labeled examples $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, the algorithm outputs $\mathcal{A}(x_t) \in \mathcal{S}_K$ at every iteration $t \in [T]$ such that its expected regret,

$$R_{\mathcal{A}}(T, K) := \mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{A}(x_t), y_t)\right] - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(h(x_t), y_t),$$

is a non-decreasing, sub-linear function of $T$. Here, the expectation is taken with respect to the randomness of the algorithm $\mathcal{A}$.

If it is further guaranteed that there exists a hypothesis $h^\star \in \mathcal{H}$ such that $\sum_{t=1}^{T} \ell(h^\star(x_t), y_t) = 0$, then we say we are in the realizable setting. Again, realizability is loss dependent.
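The threshold-restricted classes $\mathcal{H}_i^j$ are simple to realize in code. Below is a minimal sketch (our own helper, using the rank-vector convention above) of how a single binary hypothesis $h_i^j$ is derived from a ranking hypothesis $h$.

```python
def threshold_restriction(h, i, j):
    """h_i^j from Section 2: given a ranking hypothesis h, where h(x)[i] is the
    rank of label i, return the binary hypothesis answering "is label i ranked
    in the top j?" via the indicator 1{h_i(x) <= j}."""
    return lambda x: int(h(x)[i] <= j)

# Usage: a constant ranking hypothesis over K = 3 labels (label 1 ranked first).
h = lambda x: [2, 1, 3]
h_top1 = threshold_restriction(h, 1, 1)   # "is label 1 in the top 1?"
assert h_top1("any instance") == 1
```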
3 Ranking Loss Families

In statistical learning theory, we often characterize learnability with respect to a loss function. Unlike the 0-1 loss in classification, there is no canonical loss function in multilabel ranking. Accordingly, we define two general families of ranking loss functions in this section and later characterize learnability with respect to all losses in these families. In Appendix A, we show that many of the ranking metrics used in practice (e.g., Pairwise Rank Loss, Discounted Cumulative Gain, Reciprocal Rank, Average Precision, Precision@p) fall into one of these two families.

At a high level, we can classify ranking losses into two main groups: (A) losses that care about both the order and the magnitude of the relevance scores within the top-$p$ ranked labels, and (B) losses that only care about the magnitude of the relevance scores within the top-$p$ ranked labels. Our goal is to define a loss family for each of groups A and B. To do so, we start by identifying a canonical ranking loss that lies in each group. For group A, the normalized sum loss@p,

$$\ell^{@p}_{\text{sum}}(\sigma, y) = \sum_{i=1}^{K} \min(\sigma_i, p+1)\, y_i - Z^p_y,$$

captures both the order and the magnitude of the relevance scores only for the top-$p$ ranked labels. Here, $Z^p_y$ is an appropriately chosen normalization factor that depends only on $p$ and $y$ such that $\min_{\sigma \in \mathcal{S}_K} \ell^{@p}_{\text{sum}}(\sigma, y) = 0$. For group B, the normalized precision loss@p,

$$\ell^{@p}_{\text{prec}}(\sigma, y) = Z^p_y - \sum_{i=1}^{K} \mathbb{1}\{\sigma_i \le p\}\, y_i,$$

cares only about the magnitude of the relevance scores in the top-$p$ ranked labels. Again, $Z^p_y$ is an appropriately chosen normalization constant that depends only on $p$ and $y$ such that the minimum loss is 0. The form of $\ell^{@p}_{\text{prec}}$ differs from that of $\ell^{@p}_{\text{sum}}$ because $\sum_{i=1}^{K} \mathbb{1}\{\sigma_i \le p\}\, y_i$ is a gain whereas $\sum_{i=1}^{K} \min(\sigma_i, p+1)\, y_i$ is a loss.

Next, we build loss families around $\ell^{@p}_{\text{sum}}$ and $\ell^{@p}_{\text{prec}}$. For $\ell^{@p}_{\text{sum}}$, consider the family

$$\mathcal{L}(\ell^{@p}_{\text{sum}}) = \left\{\ell \in \mathbb{R}_{\ge 0}^{\mathcal{S}_K \times \mathcal{Y}} : \ell = 0 \text{ if and only if } \ell^{@p}_{\text{sum}} = 0\right\} \cap \left\{\ell : \sigma \stackrel{[p]}{=} \hat{\sigma} \implies \ell(\sigma, y) = \ell(\hat{\sigma}, y)\right\}.$$

By definition, $\mathcal{L}(\ell^{@p}_{\text{sum}})$ contains those ranking losses that are (1) zero-matched with $\ell^{@p}_{\text{sum}}$ and (2) unchanged for any two predicted rankings (permutations) that are $[p]$-equivalent. The second constraint ensures that losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ depend only on the order and the set of labels that $\sigma$ ranks in the top-$p$. Likewise, we can construct a similar loss family around $\ell^{@p}_{\text{prec}}$:

$$\mathcal{L}(\ell^{@p}_{\text{prec}}) = \left\{\ell \in \mathbb{R}_{\ge 0}^{\mathcal{S}_K \times \mathcal{Y}} : \ell = 0 \text{ if and only if } \ell^{@p}_{\text{prec}} = 0\right\} \cap \left\{\ell : \sigma \stackrel{p}{=} \hat{\sigma} \implies \ell(\sigma, y) = \ell(\hat{\sigma}, y)\right\}.$$

The set $\mathcal{L}(\ell^{@p}_{\text{prec}})$ contains those ranking losses that are (1) zero-matched with $\ell^{@p}_{\text{prec}}$ and (2) unchanged for any two predicted rankings that are $p$-equivalent. The second constraint ensures that losses in $\mathcal{L}(\ell^{@p}_{\text{prec}})$ depend only on the set of labels that $\sigma$ ranks in the top-$p$. A major contribution of this paper is showing that both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$ are actually equivalence classes: the same characterization of learnability holds for every loss in the family.

4 Batch Multilabel Ranking

In this section, we characterize the agnostic PAC learnability of hypothesis classes $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ with respect to both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. Our main results, stated below as two theorems, relate the learnability of $\mathcal{H}$ to the learnability of the threshold-restricted classes $\mathcal{H}_i^j$.

Theorem 4.1. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ if and only if for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss.

Theorem 4.2. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{prec}})$ if and only if for all $i \in [K]$, $\mathcal{H}_i^p$ is agnostic PAC learnable w.r.t. the 0-1 loss.
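Since the exact form of the normalization term $Z^p_y$ is stated only implicitly (it depends on $p$ and $y$ and makes the best ranking incur zero loss), the sketch below instantiates one natural choice: the optimum of the raw sum (resp. of the top-$p$ gain) over permutations, which by the rearrangement inequality is attained by ranking labels in decreasing order of relevance. Treat this as an illustration of the two canonical losses, not the paper's exact constants.

```python
import numpy as np

def sum_loss_at_p(sigma, y, p):
    """Sum loss@p (one reconstruction): raw capped-rank-weighted sum minus its
    minimum over permutations, so an optimal ranking incurs zero loss."""
    sigma, y = np.asarray(sigma), np.asarray(y)
    raw = np.minimum(sigma, p + 1) @ y
    # The raw sum is minimized by ranking labels in decreasing relevance order.
    best_ranks = np.empty_like(y)
    best_ranks[np.argsort(-y, kind="stable")] = np.arange(1, len(y) + 1)
    return raw - np.minimum(best_ranks, p + 1) @ y

def prec_loss_at_p(sigma, y, p):
    """Precision loss@p: shortfall of the top-p relevance mass relative to the
    best achievable top-p relevance mass."""
    sigma, y = np.asarray(sigma), np.asarray(y)
    return np.sort(y)[-p:].sum() - y[sigma <= p].sum()

y = np.array([3, 0, 2, 1])        # relevance scores with B = 3, K = 4
sigma = np.array([1, 4, 2, 3])    # optimal ranking: labels 0, 2, 3, 1
print(sum_loss_at_p(sigma, y, 2), prec_loss_at_p(sigma, y, 2))  # 0 0
```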
Since the VC dimension characterizes the learnability of binary hypothesis classes under the 0-1 loss, an important corollary of Theorems 4.1 and 4.2 is that finiteness of the $\text{VC}(\mathcal{H}_i^j)$'s, for the appropriate $(i, j) \in [K] \times [p]$, is necessary and sufficient for agnostic ranking PAC learnability. Later on, we use this fact to prove that linear ranking hypothesis classes are agnostic ranking PAC learnable.

We start with the proof of Theorem 4.1, which follows in three steps. First, we show that if for all $(i, j) \in [K] \times [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss, then Empirical Risk Minimization (ERM) is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$. Next, we show that if $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. any loss $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Our proof of the latter uses the realizable-to-agnostic conversion from Hopkins et al. [2022]. Finally, we prove the necessity direction: if $\mathcal{H}$ is agnostic PAC learnable w.r.t. an arbitrary $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, then for all $(i, j) \in [K] \times [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss. The proof of the necessity direction also uses the realizable-to-agnostic conversion from Hopkins et al. [2022]. The proof of Theorem 4.2 follows exactly the same way as that of Theorem 4.1 with some minor changes. Thus, we focus only on the proof of Theorem 4.1 in this section and defer all discussion of Theorem 4.2 to Appendix C.3.

We begin with Lemma 4.3, which asserts that if $\mathcal{H}_i^j$ is agnostic PAC learnable for all $(i, j) \in [K] \times [p]$, then ERM is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$.

Lemma 4.3. If for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss, then ERM is an agnostic PAC learner for $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ w.r.t. $\ell^{@p}_{\text{sum}}$.

The proof of Lemma 4.3 exploits the nice structure of $\ell^{@p}_{\text{sum}}$ by upper bounding the empirical Rademacher complexity of the loss class $\ell^{@p}_{\text{sum}} \circ \mathcal{H} = \{(x, y) \mapsto \ell^{@p}_{\text{sum}}(h(x), y) : h \in \mathcal{H}\}$ and showing that it vanishes as the sample size $n$ becomes large. Then, standard uniform convergence arguments outlined in Proposition C.1 imply that ERM is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell^{@p}_{\text{sum}}$. The full proof is in Appendix C. Since arbitrary losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ may not have nice analytical forms, Lemma 4.4 relates the learnability of an arbitrary loss $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ to the learnability of $\ell^{@p}_{\text{sum}}$.

Lemma 4.4. If $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. any $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$.

Proof (of Lemma 4.4). Fix $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Let $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$ and $b = \max_{\sigma, y} \ell(\sigma, y)$. We need to show that if $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell$. We will do so in two steps. First, we will show that if $\mathcal{A}$ is an agnostic PAC learner for $\ell^{@p}_{\text{sum}}$, then $\mathcal{A}$ is also a realizable PAC learner for $\ell$. Next, we will show how to convert a realizable PAC learner for $\ell$ into an agnostic PAC learner for $\ell$ in a black-box fashion. The composition of these two pieces yields an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$.

Realizable PAC learnability of $\mathcal{H}$ w.r.t. $\ell$. If $\mathcal{H}$ is agnostic PAC learnable w.r.t. $\ell^{@p}_{\text{sum}}$, then there exists a learning algorithm $\mathcal{A}$ with sample complexity $m(\epsilon, \delta, K)$ s.t. for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with probability $1 - \delta$ over a sample $S \sim \mathcal{D}^n$ of size $n \ge m(\epsilon, \delta, K)$, the output predictor $g = \mathcal{A}(S)$ achieves

$$\mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(g(x), y)] \le \inf_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(h(x), y)] + \epsilon.$$

In the realizable setting, we are further guaranteed that there exists a hypothesis $h^\star \in \mathcal{H}$ s.t. $\mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] = 0$. Since $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, this also implies that $\mathbb{E}_{\mathcal{D}}[\ell^{@p}_{\text{sum}}(h^\star(x), y)] = 0$. Therefore, under realizability and the fact that $\ell \le b\, \ell^{@p}_{\text{sum}}$, we have $\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le b\epsilon$.
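Lemma 4.3 says that ERM suffices once the threshold restrictions are learnable. As a point of reference, here is the ERM rule itself in a minimal form, over a finite class for illustration; the lemma's uniform-convergence argument is what extends the guarantee beyond finite classes.

```python
def erm(hypotheses, sample, loss):
    """Empirical Risk Minimization over a (finite, for illustration) ranking
    class: return the hypothesis with the smallest empirical loss on the sample."""
    return min(hypotheses, key=lambda h: sum(loss(h(x), y) for x, y in sample))

# Usage with the sum loss@p sketched above (assumed in scope):
# g = erm(H, S, lambda sigma, y: sum_loss_at_p(sigma, y, p=2))
```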
This completes the first part of the proof, as we have shown that $\mathcal{A}$ is also a realizable PAC learner for $\mathcal{H}$ w.r.t. $\ell$ with sample complexity $m(\frac{\epsilon}{b}, \delta, K)$.

Realizable-to-agnostic conversion. Now, we show how to convert the realizable PAC learner $\mathcal{A}$ for $\ell$ into an agnostic PAC learner for $\ell$ in a black-box fashion. For this step, we extend the agnostic-to-realizable reduction proposed by Hopkins et al. [2022] to the ranking setting by accommodating the mismatch between the range space of $\mathcal{H}$ and the label space $\mathcal{Y}$. In particular, we show that Algorithm 1 below converts a realizable PAC learner for $\ell$ into an agnostic PAC learner for $\ell$. Note that although the input $\mathcal{A}$ is a realizable learner, the distribution $\mathcal{D}$ may not be realizable.

Algorithm 1: Agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$
Input: Realizable PAC learner $\mathcal{A}$ for $\mathcal{H}$; unlabeled and labeled samples $S_U \sim \mathcal{D}_{\mathcal{X}}^n$ and $S_L \sim \mathcal{D}^m$
1. For each $h \in \mathcal{H}|_{S_U}$, construct a dataset $S_U^h = \{(x_1, y_1), \dots, (x_n, y_n)\}$ s.t. $y_i \sim \text{Unif}\{\text{BinRel}(h(x_i), 1), \dots, \text{BinRel}(h(x_i), p)\}$
2. Run $\mathcal{A}$ over all datasets to get $C(S_U) := \{\mathcal{A}(S_U^h) : h \in \mathcal{H}|_{S_U}\}$
3. Return $\hat{g} \in C(S_U)$ with the lowest empirical error over $S_L$ w.r.t. $\ell$

Let $h^\star = \arg\min_{h \in \mathcal{H}} \mathbb{E}_{\mathcal{D}}[\ell(h(x), y)]$ denote the optimal predictor in $\mathcal{H}$ w.r.t. $\mathcal{D}$. Consider the sample $S_U^{h^\star}$ and let $g = \mathcal{A}(S_U^{h^\star})$. We can think of $g$ as the output of $\mathcal{A}$ run over an i.i.d. sample $S$ drawn from $\mathcal{D}^\star$, a joint distribution over $\mathcal{X} \times \mathcal{Y}$ defined procedurally by first sampling $x \sim \mathcal{D}_{\mathcal{X}}$, then independently sampling $j \sim \text{Unif}([p])$, and finally outputting the labeled sample $(x, \text{BinRel}(h^\star(x), j))$. Note that $\mathcal{D}^\star$ is indeed a realizable distribution (realized by $h^\star$) w.r.t. both $\ell$ and $\ell^{@p}_{\text{sum}}$. Recall that $m_{\mathcal{A}}(\frac{\epsilon}{b}, \delta, K)$ is the sample complexity of $\mathcal{A}$ as a realizable PAC learner for $\ell$. Since $\mathcal{A}$ is a realizable learner for $\mathcal{H}$ w.r.t. $\ell$, we have that for $n \ge m_{\mathcal{A}}(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K)$, with probability at least $1 - \frac{\delta}{2}$,

$$\mathbb{E}_{\mathcal{D}^\star}[\ell(g(x), y)] \le \frac{\epsilon a}{2bp}.$$

Next, by Lemma E.1, we have

$$\ell(g(x), y) \le \ell(h^\star(x), y) + \frac{bp}{a}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))]$$

pointwise. Taking expectations on both sides of the inequality gives

$$\mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{bp}{a}\, \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{\epsilon}{2}.$$

The last inequality follows from the definition of $\mathcal{D}^\star$, namely $\mathbb{E}_{\mathcal{D}^\star}[\ell(g(x), y)] = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \mathbb{E}_{j \sim \text{Unif}([p])}[\ell(g(x), \text{BinRel}(h^\star(x), j))]$. This shows that $C(S_U)$ contains a hypothesis $g$ that generalizes well with respect to $\mathcal{D}$.

Now we want to show that the predictor $\hat{g}$ returned in step 3 also generalizes well. Crucially, observe that $C(S_U)$ is a finite hypothesis class with cardinality at most $2^{nK}$. By standard Chernoff and union bounds, with probability at least $1 - \frac{\delta}{2}$, the empirical risk of every hypothesis in $C(S_U)$ on a sample of size $\frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta}$ is at most $\frac{\epsilon}{4}$ away from its true error. So, if $m = |S_L| \ge \frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta}$, then with probability at least $1 - \frac{\delta}{2}$,

$$\frac{1}{m}\sum_{(x, y) \in S_L} \ell(g(x), y) \le \mathbb{E}_{\mathcal{D}}[\ell(g(x), y)] + \frac{\epsilon}{4} \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{3\epsilon}{4}.$$

Since $\hat{g}$ is the ERM on $S_L$ over $C(S_U)$, its empirical risk can be at most $\mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \frac{3\epsilon}{4}$. Given that the population risk of $\hat{g}$ can be at most $\frac{\epsilon}{4}$ away from its empirical risk, we have that $\mathbb{E}_{\mathcal{D}}[\ell(\hat{g}(x), y)] \le \mathbb{E}_{\mathcal{D}}[\ell(h^\star(x), y)] + \epsilon$. Applying union bounds, the entire process succeeds with probability $1 - \delta$. We can upper bound the sample complexity of Algorithm 1, denoted $n(\epsilon, \delta, K)$, as

$$n(\epsilon, \delta, K) \le m_{\mathcal{A}}\left(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K\right) + \frac{8}{\epsilon^2}\log\frac{4|C(S_U)|}{\delta} = O\left(\frac{K}{\epsilon^2}\, m_{\mathcal{A}}\left(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K\right) + \frac{1}{\epsilon^2}\log\frac{1}{\delta}\right),$$

where we use $|C(S_U)| \le 2^{K\, m_{\mathcal{A}}(\frac{\epsilon a}{2b^2 p}, \frac{\delta}{2}, K)}$. This shows that Algorithm 1 is an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$.

Finally, Lemma 4.5 gives the necessity direction of Theorem 4.1.

Lemma 4.5. If a hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic PAC learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$, then $\mathcal{H}_i^j$ is agnostic PAC learnable w.r.t. the 0-1 loss for all $(i, j) \in [K] \times [p]$.
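A compact Python rendering of Algorithm 1 may help fix ideas. It assumes the `bin_rel` helper from Section 2 and takes the projection $\mathcal{H}|_{S_U}$ as an explicit finite list; both are simplifications of the formal construction, so treat this as a sketch rather than the paper's implementation.

```python
import random

def algorithm1(A, H_proj, S_U, S_L, loss, p):
    """Sketch of Algorithm 1: convert a realizable PAC learner A for H into an
    agnostic learner. H_proj plays the role of H|_{S_U}, one representative
    ranking hypothesis per behavior on S_U (hypothetical input)."""
    candidates = []
    for h in H_proj:
        # Step 1: relabel S_U using h's ranking and a random threshold j ~ Unif([p]).
        dataset = [(x, bin_rel(h(x), random.randint(1, p))) for x in S_U]
        # Step 2: run the realizable learner on each self-labeled dataset.
        candidates.append(A(dataset))
    # Step 3: return the candidate C(S_U) member with lowest empirical loss on S_L.
    return min(candidates, key=lambda g: sum(loss(g(x), y) for x, y in S_L))
```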
Like the sufficiency proofs, the proof of Lemma 4.5 is constructive. Given an agnostic PAC learner for $\mathcal{H}$ w.r.t. $\ell$, we construct an agnostic PAC learner for $\mathcal{H}_i^j$ w.r.t. the 0-1 loss using a slight modification of Algorithm 1. We defer the full proof to Appendix C since the analysis is similar to that of Algorithm 1. Together, Lemmas 4.3, 4.4, and 4.5 imply Theorem 4.1.

We conclude this section by giving a concrete application of our characterization. Consider the class of ranking hypotheses $\mathcal{H} = \{x \mapsto \text{argsort}(Wx) : W \in \mathbb{R}^{K \times d}\}$ that compute rankings by sorting, in descending order, the scores obtained from a linear function of the input features. Lemma 4.6, whose proof is in Appendix B, bounds the VC dimension of $\mathcal{H}_i^j$ for arbitrary $i, j \in [K]$.

Lemma 4.6. Let $\mathcal{H} = \{x \mapsto \text{argsort}(Wx) : W \in \mathbb{R}^{K \times d}\}$ be a linear ranking hypothesis class. Then for all $i, j \in [K]$, $\text{VC}(\mathcal{H}_i^j) = \tilde{O}(Kd)$, where $\tilde{O}$ hides logarithmic factors of $d$ and $K$.

Combining Lemma 4.6 with Theorems 4.1 and 4.2 shows that linear ranking hypothesis classes are agnostic ranking PAC learnable w.r.t. all losses in $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. More generally, in Appendix B we give a dimension-based sufficient condition under which generic score-based ranking hypothesis classes are agnostic ranking PAC learnable.

5 Online Multilabel Ranking

We now move to the online setting and characterize the online learnability of hypothesis classes $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ with respect to both $\mathcal{L}(\ell^{@p}_{\text{sum}})$ and $\mathcal{L}(\ell^{@p}_{\text{prec}})$. As in the batch setting, our characterization relates the learnability of $\mathcal{H}$ to the learnability of the threshold-restricted classes $\mathcal{H}_i^j$.

Theorem 5.1. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$ if and only if for all $i \in [K]$ and $j \in [p]$, $\mathcal{H}_i^j$ is agnostic online learnable w.r.t. the 0-1 loss.

Theorem 5.2. A hypothesis class $\mathcal{H} \subseteq \mathcal{S}_K^{\mathcal{X}}$ is agnostic online learnable w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{prec}})$ if and only if for all $i \in [K]$, $\mathcal{H}_i^p$ is agnostic online learnable w.r.t. the 0-1 loss.

Since the Littlestone dimension characterizes the online learnability of binary hypothesis classes under the 0-1 loss, an important corollary of Theorems 5.1 and 5.2 is that finiteness of $\text{Ldim}(\mathcal{H}_i^j)$, for the appropriate $(i, j) \in [K] \times [p]$, is necessary and sufficient for agnostic online ranking learnability.

We now begin the proof of Theorem 5.1. Since the proof of Theorem 5.2 follows a similar trajectory, we defer all discussion of Theorem 5.2 to Appendix D.2. Unlike Theorem 4.1 in the batch setting, we prove the sufficiency and necessity directions of Theorem 5.1 directly. We chose this direct path because, unlike in the batch setting, sequential Rademacher analysis does not yield a constructive algorithm [Rakhlin et al., 2015]. Our proofs, on the other hand, are constructive and use the celebrated Randomized Exponential Weights Algorithm (REWA) [Cesa-Bianchi and Lugosi, 2006]. Moreover, a key ingredient of our proof is the realizable-to-agnostic conversion from Raman et al. [2023].

Proof (of sufficiency in Theorem 5.1). Fix $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. Let $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$ and $M = \max_{\sigma, y} \ell(\sigma, y)$. Given online learners for the $\mathcal{H}_i^j$'s w.r.t. the 0-1 loss, our goal is to construct an online learner $Q$ for $\mathcal{H}$ w.r.t. $\ell$ that enjoys sub-linear regret in $T$. Our strategy will be to construct a set of experts $\mathcal{E}$ using the online learners for the $\mathcal{H}_i^j$'s and run REWA using $\mathcal{E}$ and an appropriately scaled version of $\ell$. Our proof borrows ideas from the realizable-to-agnostic online conversion from Raman et al. [2023], and so we use the same notation whenever possible.
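For concreteness, a hypothesis from the linear ranking class of Lemma 4.6 can be sketched as follows; the rank-vector convention matches Section 2, and the stable tie-breaking rule is our own choice.

```python
import numpy as np

def linear_ranker(W):
    """A hypothesis from H = {x -> argsort(Wx)}: compute scores s = Wx and rank
    labels in decreasing score order, returning sigma with sigma[i] = rank of label i."""
    def h(x):
        scores = W @ x
        order = np.argsort(-scores, kind="stable")    # labels by decreasing score
        sigma = np.empty(len(scores), dtype=int)
        sigma[order] = np.arange(1, len(scores) + 1)  # rank of each label
        return sigma
    return h

# Usage with K = 3 labels and d = 2 features: label 0 gets the highest score.
h = linear_ranker(np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
print(h(np.array([2.0, 1.0])))  # [1 3 2]
```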
Let $(x_1, y_1), \dots, (x_T, y_T) \in (\mathcal{X} \times \mathcal{Y})^T$ denote the stream of points to be observed by the online learner. We will assume an oblivious adversary, so the stream is fixed before the game starts. A standard reduction (Chapter 4 in Cesa-Bianchi and Lugosi [2006]) allows us to convert oblivious regret bounds to adaptive regret bounds. Since $\mathcal{H}_i^j \subseteq \{0, 1\}^{\mathcal{X}}$ is online learnable w.r.t. the 0-1 loss, we are guaranteed the existence of realizable online learners $\mathcal{A}_i^j$ for each $(i, j) \in [K] \times [p]$.

Constructing Experts. For any bitstring $b \in \{0, 1\}^T$, let $\phi : \{t \in [T] : b_t = 1\} \to \mathcal{S}_K$ denote a function mapping time points where $b_t = 1$ to rankings (permutations). Let $\Phi_b = \mathcal{S}_K^{\{t \in [T] : b_t = 1\}}$ denote the set of all such functions $\phi$. For every $h \in \mathcal{H}$, there exists a $\phi_b^h \in \Phi_b$ such that for all $t \in \{t : b_t = 1\}$, $\phi_b^h(t) = h(x_t)$. Let $|b| = |\{t \in [T] : b_t = 1\}|$. For every $b \in \{0, 1\}^T$ and $\phi \in \Phi_b$, we define an expert $E_{b,\phi}$. Expert $E_{b,\phi}$, formally presented in Algorithm 2, uses the $\mathcal{A}_i^j$'s to make predictions in each round. However, $E_{b,\phi}$ updates the $\mathcal{A}_i^j$'s only on those rounds where $b_t = 1$, using $\phi$ to compute a labeled instance. For every $b \in \{0, 1\}^T$, let $\mathcal{E}_b = \bigcup_{\phi \in \Phi_b} \{E_{b,\phi}\}$ denote the set of all experts parameterized by functions $\phi \in \Phi_b$. If $b$ is the all-zeros bitstring, then $\mathcal{E}_b$ would be empty. Therefore, we actually define $\mathcal{E}_b = \{E_0\} \cup \bigcup_{\phi \in \Phi_b} \{E_{b,\phi}\}$, where $E_0$ is the expert that never updates the $\mathcal{A}_i^j$'s and only uses them for predictions in all $t \in [T]$. Note that $1 \le |\mathcal{E}_b| \le (K!)^{|b|} \le K^{K|b|}$.

Using these experts, Algorithm 3 presents our agnostic online learner $Q$ for $\mathcal{H}$ w.r.t. $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$. We now show that $Q$ enjoys sub-linear regret. We highlight that there are three sources of randomness in the online learner $Q$: the randomness in sampling $B$, the internal randomness of the $\mathcal{A}_i^j$'s, and the internal randomness of $\mathcal{P}$. One may think of internal randomness as arising from the sampling step involved in randomized predictions. Let $A$ be the random variable associated with the joint internal randomness of the $\mathcal{A}_i^j$'s for all $(i, j) \in [K] \times [p]$. Similarly, let $P$ denote the random variable associated with the internal randomness of $\mathcal{P}$. We begin by using the guarantee of REWA.

Algorithm 2: Expert($b$, $\phi$)
Input: Independent copies of realizable online learners $\mathcal{A}_i^j$ for each $(i, j) \in [K] \times [p]$
1. for $t = 1, \dots, T$ do
2.   Receive example $x_t$
3.   Define a binary vote matrix $V_t \in \{0, 1\}^{K \times p}$ such that $V_t[i, j] = \mathcal{A}_i^j(x_t)$
4.   Predict $\hat{\sigma}_t \in \arg\min_{\sigma \in \mathcal{S}_K} \langle \sigma, V_t \mathbb{1}_p \rangle$
5.   if $b_t = 1$ then
6.     Let $\sigma = \phi(t)$ and, for all $(i, j) \in [K] \times [p]$, update $\mathcal{A}_i^j$ by passing it $(x_t, \mathbb{1}\{\sigma_i \le j\})$

Algorithm 3: Agnostic Online Learner $Q$ for $\mathcal{H}$ w.r.t. $\ell$
Input: Parameter $0 < \beta < 1$
1. Let $B \in \{0, 1\}^T$ s.t. $B_t \sim \text{Bernoulli}(T^{\beta - 1})$ i.i.d.
2. Construct the set of experts $\mathcal{E}_B = \{E_0\} \cup \bigcup_{\phi \in \Phi_B} \{E_{B,\phi}\}$ according to Algorithm 2
3. Run REWA $\mathcal{P}$ using $\mathcal{E}_B$ and the scaled loss function $\frac{\ell}{M}$ over the stream $(x_1, y_1), \dots, (x_T, y_T)$

REWA Guarantee. Using Theorem 21.11 in Shalev-Shwartz and Ben-David [2014] and the fact that $B$, $A$, and $P$ are mutually independent, REWA guarantees almost surely that

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{P}(x_t), y_t) \,\middle|\, B, A\right] \le \inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t) + M\sqrt{2T\ln(|\mathcal{E}_B|)}.$$

Taking an outer expectation gives

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\mathcal{P}(x_t), y_t)\right] \le \mathbb{E}\left[\inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right].$$

Noting that $Q(x_t) = \mathcal{P}(x_t)$, we obtain

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(Q(x_t), y_t)\right] \le \mathbb{E}\left[\inf_{E \in \mathcal{E}_B} \sum_{t=1}^{T} \ell(E(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right] \le \mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right] + M\,\mathbb{E}\left[\sqrt{2T\ln(|\mathcal{E}_B|)}\right].$$

In the last step, we used the fact that for all $b \in \{0, 1\}^T$ and $h \in \mathcal{H}$, $E_{b,\phi_b^h} \in \mathcal{E}_b$. Here, $h^\star = \arg\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell(h(x_t), y_t)$ is the optimal hypothesis in hindsight. First, note that $\ln(|\mathcal{E}_B|) \le K|B|\ln(K)$. Using Jensen's inequality gives $\mathbb{E}[\sqrt{2T\ln(|\mathcal{E}_B|)}] \le \sqrt{2T^{1+\beta}K\ln K}$. Thus,

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(Q(x_t), y_t)\right] \le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right]}_{\text{(I)}} + M\sqrt{2T^{1+\beta}K\ln K}. \quad (1)$$

Upper bounding (I). It now suffices to upper bound $\mathbb{E}\left[\sum_{t=1}^{T} \ell(E_{B,\phi_B^{h^\star}}(x_t), y_t)\right]$.
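The prediction step of Algorithm 2 (step 4) has a simple closed form: by the rearrangement inequality, $\langle \sigma, V_t \mathbb{1}_p \rangle$ is minimized by giving the smallest ranks to the labels with the most votes. A minimal sketch of that step (the function name is ours):

```python
import numpy as np

def expert_prediction(votes):
    """Step 4 of Algorithm 2: given the K x p binary vote matrix V_t, where row i
    collects label i's votes across thresholds, return the ranking minimizing
    <sigma, V_t 1_p> by ranking labels in decreasing total-vote order."""
    totals = votes.sum(axis=1)                  # V_t 1_p: total votes per label
    order = np.argsort(-totals, kind="stable")  # labels by decreasing vote count
    sigma = np.empty(len(totals), dtype=int)
    sigma[order] = np.arange(1, len(totals) + 1)
    return sigma

# Usage: K = 3 labels, p = 2 binary learners per label.
V = np.array([[1, 1], [0, 1], [0, 0]])
print(expert_prediction(V))  # [1 2 3]: label 0 received the most votes
```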
Recall that Lemma E.1 gives, pointwise,

$$\ell(E_{B,\phi_B^{h^\star}}(x_t), y_t) \le \ell(h^\star(x_t), y_t) + \frac{pM}{a}\, \mathbb{E}_{j \sim \text{Unif}([p])}\left[\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j))\right], \quad (2)$$

where $M = \max_{\sigma, y} \ell(\sigma, y)$ and $a = \min_{\sigma, y}\{\ell(\sigma, y) \mid \ell(\sigma, y) \neq 0\}$. Note that, by the definition of the constant $M$, we further get

$$\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) \le M\, \mathbb{1}\{\ell(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) > 0\} = M\, \mathbb{1}\{\ell^{@p}_{\text{sum}}(E_{B,\phi_B^{h^\star}}(x_t), \text{BinRel}(h^\star(x_t), j)) > 0\},$$

where the equality follows from the fact that $\ell \in \mathcal{L}(\ell^{@p}_{\text{sum}})$.

In order to upper bound the indicator above, we need to introduce some more notation. Given the realizable online learner $\mathcal{A}_i^m$ for $(i, m) \in [K] \times [p]$, an instance $x \in \mathcal{X}$, and an ordered finite sequence of labeled examples $L \in (\mathcal{X} \times \{0, 1\})^*$, let $\mathcal{A}_i^m(x \mid L)$ be the random variable denoting the prediction of $\mathcal{A}_i^m$ on the instance $x$ after running and updating on $L$. For any $b \in \{0, 1\}^T$, $h \in \mathcal{H}$, and $t \in [T]$, let $L_b^h$ denote the sequence of labeled instances that expert $E_{b,\phi_b^h}$ passes to the binary learners on the rounds $\{s < t : b_s = 1\}$, so that $\mathcal{A}_i^j(x_t \mid L_B^{h^\star})$ is the prediction of $\mathcal{A}_i^j$ on $x_t$ after updating on the examples generated from $h^\star$.