Beyond Bandit Feedback in Online Multiclass Classification

Dirk van der Hoeven (dirk@dirkvanderhoeven.com), Dept. of Computer Science, Università degli Studi di Milano, Italy
Federico Fusco (fuscof@diag.uniroma1.it), Dept. of Computer, Control and Management Engineering, Sapienza Università di Roma, Italy
Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it), DSRC & Dept. of Computer Science, Università degli Studi di Milano, Italy

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We study the problem of online multiclass classification in a setting where the learner's feedback is determined by an arbitrary directed graph. While including bandit feedback as a special case, feedback graphs allow a much richer set of applications, including filtering and label efficient classification. We introduce GAPPLETRON, the first online multiclass algorithm that works with arbitrary feedback graphs. For this new algorithm, we prove surrogate regret bounds that hold, both in expectation and with high probability, for a large class of surrogate losses. Our bounds are of order $B\sqrt{\rho KT}$, where $B$ is the diameter of the prediction space, $K$ is the number of classes, $T$ is the time horizon, and $\rho$ is the domination number (a graph-theoretic parameter affecting the amount of exploration). In the full information case, we show that GAPPLETRON achieves a constant surrogate regret of order $B^2K$. We also prove a general lower bound of order $\max\{B^2K, \sqrt{T}\}$, showing that our upper bounds are not significantly improvable. Experiments on synthetic data show that for various feedback graphs our algorithm is competitive against known baselines.

1 Introduction

In online multiclass classification a learner interacts with an unknown environment in a sequence of rounds. At each round $t$, the learner observes a feature vector $x_t \in \mathbb{R}^d$ and outputs a prediction $\hat{y}_t$ for the label $y_t \in \{1, \dots, K\}$ associated with $x_t$. If $\hat{y}_t \neq y_t$, then the learner is charged with a mistake. Kakade et al. (2008) introduced the bandit version of online multiclass classification, where the only feedback received by the learner after each prediction is the loss $\mathbb{1}[\hat{y}_t \neq y_t]$. Hence, if a mistake is made at time $t$ (and $K > 2$), the learner cannot uniquely identify the true label $y_t$ based on the feedback information.

Although bandits are a canonical example of partial feedback, they fail to capture a number of important practical scenarios of online classification. Consider for example spam filtering, where an online learner is to classify emails as spam or non-spam based on their content. Whenever the learner classifies an email as legitimate, the recipient gets to see it, and can inform the learner whether the email was correctly classified or not. However, when the email is classified as spam, the learner does not get any feedback because the email is not checked by the recipient. Another example is label efficient multiclass classification. Here, instead of making a prediction, the learner can ask a human expert for the true label. At the steps when predictions are made, however, the learner does not receive any feedback information (not even their own loss). A further example is disease prevention: if we predict an outburst of disease in a certain area, we can preemptively stop it by vaccinating the local population. This intervention would prevent us from observing whether our prediction was correct, but would still allow us to observe an outburst occurring in a different area.
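To make the interaction protocol concrete, the following is a minimal sketch of one round of online multiclass classification under bandit feedback. This is our own illustration, not code from the paper; the `UniformLearner` placeholder and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 10, 1000

class UniformLearner:
    """Placeholder learner: predicts uniformly at random over the K labels."""
    def predict(self, x):
        return int(rng.integers(K))

    def update(self, x, y_hat, mistake):
        pass  # a real learner would use the bandit feedback to update weights

learner, mistakes = UniformLearner(), 0
for t in range(T):
    x_t = rng.normal(size=d)             # environment reveals the feature vector
    y_t = int(rng.integers(K))           # true label, never shown to the learner
    y_hat = learner.predict(x_t)         # learner commits to a prediction
    mistake = int(y_hat != y_t)          # zero-one loss
    learner.update(x_t, y_hat, mistake)  # bandit feedback: only 1[y_hat != y_t]
    mistakes += mistake
print(f"mistakes: {mistakes}/{T}")       # about (K-1)/K of the rounds here
```

The point of the sketch is the last argument of `update`: on a mistake with K > 2, the value 1 does not identify which of the K - 1 wrong labels was the true one.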
Unlike bandits, the amount of feedback obtained by the learner in these examples depends on the predicted class, and can vary from full information to no feedback at all. This scenario has been previously considered in the framework of online learning with feedback graphs (Mannor and Shamir, 2011; Alon et al., 2015, 2017). A feedback graph is a directed graph $G = (V, E)$ where each node in $V$ receives at least one edge from some node in $V$ (possibly from itself). The nodes in $V$ correspond to actions, and a directed edge $(a, b) \in E$ indicates that by playing action $a$ the learner observes the loss of action $b$. This generalizes the well-known online learning settings of experts (where $G$ is the complete graph, including self-loops) and bandits (where $G$ has only self-loops). Note that it is easy to model spam filtering and label efficient prediction using feedback graphs. For spam filtering, $G$ contains only two actions $s$ and $n$ (corresponding, respectively, to the learner's predictions for spam and non-spam), and the edge set is $E = \{(n, n), (n, s)\}$. For label efficient multiclass prediction, $G$ contains a node for each class, plus an extra node corresponding to issuing a label request.

It is important to observe that all previous analyses of feedback graphs only apply to the abstract setting of prediction with experts, where any dependence of the loss on feature vectors is ignored. This hampers the application of those results to online multiclass classification. In this work we build on previous results on online learning and classification with bandit feedback to design and analyze the first algorithm for online multiclass classification with arbitrary feedback graphs. In doing so, we also improve the analyses of the previously studied special cases (full information and bandit feedback) of multiclass classification.

In the online multiclass classification setting, the goal is to bound the number of mistakes made by the learner. The mistake bounds take the following form:
$$\sum_{t=1}^{T} \mathbb{1}[\hat{y}_t \neq y_t] = \sum_{t=1}^{T} \ell_t(U) + R_T, \qquad (1)$$
where $\ell_t$ is a surrogate loss, $U \in \mathcal{W} \subseteq \mathbb{R}^{K \times d}$ is the matrix of reference predictors, and $R_T$ is called the surrogate regret. In this work we provide two types of bounds on the surrogate regret: bounds that hold in expectation and bounds that hold with high probability. Note that equation (1) could also be written as $\sum_{t=1}^{T} \big(\mathbb{1}[\hat{y}_t \neq y_t] - \ell_t(U)\big) = R_T$. However, we prefer the former since $R_T$ is not a proper regret: because the zero-one loss is non-convex we compare it with a surrogate loss.

Our results build on recent work by Van der Hoeven (2020), who showed that one can exploit the gap between the surrogate loss and the zero-one loss to derive improved surrogate regret bounds in the full information and bandit settings of online multiclass classification. We modify the GAPTRON algorithm (Van der Hoeven, 2020) to make it applicable to the feedback graph setting. In the analysis of the resulting algorithm, called GAPPLETRON (named after the apple tasting feedback model, which is the original name of the spam filtering graph), we use several new insights to show that it has $O(B\sqrt{\rho KT})$ surrogate regret in expectation and $O\big(\sqrt{\rho KT\,(B^2 + \ln(1/\delta))}\big)$ surrogate regret with probability at least $1 - \delta$ for any feedback graph with domination number $\rho$ (the cardinality of the smallest dominating set), and for any $U$ with $\|\mathrm{vec}(U)\| \le B$ for some norm $\|\cdot\|$ (if $\|\cdot\|$ is the Euclidean norm, then $\|\mathrm{vec}(U)\|$ is the Frobenius norm of $U$). For example, in both spam filtering and label efficient classification we have $\rho = 1$.
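The graphs above are small enough to write down explicitly. The sketch below (our own illustration; the dictionary encoding and function names are ours) encodes a feedback graph by its out-neighbourhoods and checks the $\rho = 1$ claim for spam filtering and label efficient prediction.

```python
# A feedback graph as a map from each action to its out-neighbourhood:
# playing action a reveals the losses of all actions in graph[a].
K = 4

experts = {a: set(range(K)) for a in range(K)}   # full information
bandit  = {a: {a} for a in range(K)}             # self-loops only

# Spam filtering: action 0 = "spam" (blind), action 1 = "non-spam"
# (the recipient's reaction reveals the loss of both actions).
spam_filtering = {0: set(), 1: {0, 1}}

# Label efficient prediction: K blind class predictions plus an extra
# "label request" action K revealing everything (we give the request
# node a self-loop so that every node has an incoming edge).
label_efficient = {a: set() for a in range(K)}
label_efficient[K] = set(range(K + 1))

def is_dominating(graph, S):
    """S is a dominating set if every node lies in the out-neighbourhood
    of some node of S."""
    covered = set().union(*(graph[s] for s in S)) if S else set()
    return covered >= set(graph)

print(is_dominating(spam_filtering, {1}))    # True: rho = 1
print(is_dominating(label_efficient, {K}))   # True: rho = 1
print(is_dominating(bandit, {0}))            # False: the bandit graph
                                             # needs all K self-loops
```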
So in the label efficient setting, where each label request counts as a mistake, with high probability GAPPLETRON makes at most order of $B\sqrt{KT}$ mistakes while requesting at most order of $B\sqrt{KT}$ labels. Note that we are not aware of previously known high-probability bounds on the surrogate regret. Furthermore, whereas the results of Van der Hoeven (2020) only hold for a limited number of surrogate loss functions, our results hold for the larger class of regular surrogate losses.

Interestingly, with feedback graphs the surrogate regret for online multiclass classification has, in general, a better dependence on $T$ than the regret for online learning. Indeed, Alon et al. (2015) show that the best possible online learning regret is $\Omega(T^{2/3})$ for certain feedback graphs that are called weakly observable (e.g., the graphs for label efficient classification). In contrast, we prove a $O(T^{1/2})$ upper bound on the surrogate regret for any feedback graph, including weakly observable ones. Our results cannot be significantly improved in general: we prove a $\Omega(B^2K + \sqrt{T})$ lower bound on the surrogate regret. Due to the new insights required by their proofs, we believe the high-probability bounds and the lower bounds are our strongest technical contributions.

Table 1: Overview of the surrogate regret bounds in the separable and non-separable case. The upper bounds hold with high probability, while the lower bounds apply to any randomized prediction algorithm. All bounds are novel except for the lower bound in the full information separable case (Beygelzimer et al., 2019, Theorem 11).

|                             | Partial            | Full    |
|-----------------------------|--------------------|---------|
| Upper bound (non-separable) | $B\sqrt{\rho KT}$  | $KB^2$  |
| Upper bound (separable)     | $B\sqrt{\rho T}$   | $B^2$   |
| Lower bound (non-separable) | $KB^2$             | $KB^2$  |

[Figure 1 omitted: Error rate in non-separable synthetic bandit experiments showcasing GAPPLETRON against known baselines (Banditron, Gappletron, PNewtron, SOBAdiag). The points are the means and the whiskers are minimum and maximum error rate over ten repetitions (details in Section 6).]

We provide several other new results. In the separable case, when there exists a $U$ for which $\sum_{t=1}^{T}\ell_t(U) = 0$, GAPPLETRON has $O(B\sqrt{\rho T})$ surrogate regret in expectation. Even though $O(B^2K)$ mistake bounds are possible in the separable setting (Beygelzimer et al., 2019), ours is the first algorithm that has satisfactory surrogate regret in the non-separable case and has an improved surrogate regret in the separable case. Note that although BANDITRON (Kakade et al., 2008) also makes $O(B\sqrt{KT})$ mistakes in the separable case, it suffers $O(K^{1/3}(BT)^{2/3})$ surrogate regret in the non-separable case. Our results for the separable case in the full information setting improve results of Van der Hoeven (2020) by a factor of $K$: GAPPLETRON suffers $O(B^2)$ surrogate regret both in expectation and with high probability, thus matching the bounds of the classical PERCEPTRON algorithm (Rosenblatt, 1958; Novikov, 1962); see Table 1 for a summary of our theoretical results. Finally, we also evaluated the performance of GAPPLETRON in several experiments, showing that GAPPLETRON is competitive against known baselines in the full information, bandit, and multiclass spam filtering settings, the latter being the setting in which predicting a certain class provides full information feedback and all other predictions do not provide any information (see Figure 1 for an experimental result in the bandit setting).
Additional related work. The full information and bandit versions of the online multiclass classification setting have been extensively studied. Here we provide the most relevant references and refer the reader to Van der Hoeven (2020) for a more extensive literature review. Algorithms for the full information setting include: the PERCEPTRON, its multiclass versions (Rosenblatt, 1958; Crammer and Singer, 2003; Fink et al., 2006) and many variations thereof, second-order algorithms such as AROW (Crammer et al., 2009) and the second-order PERCEPTRON (Cesa-Bianchi et al., 2005), and various algorithms for online logistic regression; see Foster et al. (2018) and references therein. In the bandit setting, we mention the algorithms BANDITRON (Kakade et al., 2008), NEWTRON (Hazan and Kale, 2011), SOBA (Beygelzimer et al., 2017), and OBAMA (Foster et al., 2018).

Online learning with feedback graphs has been investigated both in the adversarial and stochastic regimes. In the adversarial setting, variants where the graph changes over time and is partially known or stochastic have been studied by Cohen et al. (2016); Kocák et al. (2016). Regret bounds that scale with the loss of the best action have been obtained by Lykouris et al. (2018). Other variants include sleeping experts (Cortes et al., 2019), switching experts (Arora et al., 2019), and adaptive adversaries (Feng and Loh, 2018). Some works use feedback graphs to bound the regret in auctions (Cesa-Bianchi et al., 2017; Feng et al., 2018; Han et al., 2020). In the stochastic setting, regret bounds for Thompson sampling and UCB have been analyzed by Tossou et al. (2017); Liu et al. (2018); Lykouris et al. (2020). Finally, feedback graphs can also be viewed as a special case of the partial monitoring framework for sequential decisions; see (Lattimore and Szepesvári, 2020) for an introduction to the area.

Helmbold et al. (2000) introduced online filtering as "apple tasting". However, their analysis applies to a restricted version of online learning in which instances $x_t$ belong to a finite domain, and the labels $y_t$ are such that $y_t = f(x_t)$ for all $t$ and for some fixed $f$ in a known class of functions. Practical applications of online learning to spam filtering have been investigated by Cesa-Bianchi et al. (2003); Sculley (2008).

Notation. Let $\mathbf{1}$ and $\mathbf{0}$ denote, respectively, the all-one and all-zero vectors, and let $e_k$ be the basis vector in direction $k$. Let $[K] = \{1, \dots, K\}$ and let $\mathbb{R}_+$ be the non-negative real numbers. We use $\langle g, w \rangle$ to denote the inner product between vectors $g, w \in \mathbb{R}^d$. The rows of a matrix $W \in \mathbb{R}^{K \times d}$ are denoted by $W^1, \dots, W^K$. Whenever possible, we use the same symbol $W$ to denote both a $K \times d$ matrix and the column vector $\mathrm{vec}(W) = (W^1, \dots, W^K)$ in $\mathbb{R}^{Kd}$. We use $\|x\|_2$ to denote the Euclidean norm of a vector $x$ and $\|x\|$ to denote an arbitrary norm. The Kronecker product between matrices $W$ and $U$ is denoted by $W \otimes U$. We assume $W \in \mathcal{W}$ for some convex $\mathcal{W} \subseteq \mathbb{R}^{K \times d}$. This is equivalent to saying that $\mathrm{vec}(W)$ belongs to a convex subset of $\mathbb{R}^{Kd}$, for example a $p$-norm ball. As in previous works, we assume instance-label pairs $(x_t, y_t)$ are generated by an adversary who is oblivious to the algorithm's internal randomization. Finally, for any round $t$, $\mathbb{P}_t[\cdot]$ and $\mathbb{E}_t[\cdot]$ denote the conditional probability and expectation, given the randomized predictions $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{t-1}$ and the corresponding feedback.

A feedback graph is any directed graph $G = (V, E)$, with edges $E$ and nodes $V$, such that for any $y \in V$ there exists some $y' \in V$ such that $(y', y) \in E$, where we allow $y' = y$.
In online multiclass classification, $V = [K]$ and $E$ specifies which predictions observe which outcomes. Let $\mathrm{out}(\hat{y}) = \{y \in V : (\hat{y}, y) \in E\}$ be the out-neighbourhood of $\hat{y}$. If the learner predicts $\hat{y}_t$ at time $t$, then the feedback received by the learner is the set of pairs $(y, \mathbb{1}[y = y_t])$ for all $y \in \mathrm{out}(\hat{y}_t)$. Due to the structure of the zero-one loss, if a node has $K - 1$ outgoing edges, we always add the missing edge to $E$ as this does not change the information available to the learner. We say that an outcome $y$ is revealing if predicting that outcome provides the learner with full information feedback, i.e., $\mathrm{out}(y) = [K]$, and we denote the set of revealing outcomes by $Q$. For example, in label efficient classification, querying the true label $y_t$ corresponds to playing a revealing outcome. We say that a set of nodes $S$ is a dominating set if for each $y \in V$ there is a node $y' \in S$ such that $y \in \mathrm{out}(y')$. The number of nodes in a smallest dominating set is called the domination number, and we denote it by $\rho$. Note that GAPPLETRON is run using the minimum dominating set $S$, which is known to be hard to recover in general. However, if the algorithm is fed with any other dominating set $S'$ of bigger cardinality $\rho'$, our results continue to hold with $\rho$ replaced by $\rho'$ (recall that a dominating set of size at most $(\ln \rho + 2)\rho$ can be efficiently found via a greedy approximation algorithm).

Regular surrogate losses. Fix a convex domain $\mathcal{W}$. Let $\ell : \mathcal{W} \times \mathbb{R}^d \times [K] \to \mathbb{R}_+$ be any function convex on $\mathcal{W}$ such that, for all $W \in \mathcal{W}$, $x \in \mathbb{R}^d$, and $y \in [K] \setminus \{y'\}$ (with $y' = \arg\max_k \langle W^k, x \rangle$) we have
$$\frac{K-1}{K}\,\ell(W, x, y) + \frac{1}{K}\,\ell(W, x, y') \;\ge\; 1. \qquad (2)$$
Then $\ell_t = \ell(\cdot, x_t, y_t)$ is a regular surrogate loss if
$$\|\nabla \ell_t(W)\|^2 \;\le\; 2L\,\ell_t(W) \qquad \text{for all } W \in \mathcal{W} \qquad (3)$$
for some norm $\|\cdot\|$. When $\|\cdot\|$ is the Euclidean norm, the condition on the gradient is satisfied by all $L$-smooth surrogate loss functions (see, for example, (Zhou, 2018, Lemma 4)). Examples of regular surrogate losses are the smooth hinge loss (Rennie and Srebro, 2005) and the logistic loss with base $K$, defined by $\ell_t(W_t) = -\log_K q(W_t, x_t, y_t)$, where $q$ is the softmax function. Even though the hinge loss is not a regular surrogate loss, in Appendix A we show that a particular version of the hinge loss satisfies all the relevant properties of regular surrogate losses. Also, note that in the feedback graph setting, this particular version of the hinge loss we use is random whenever the learner's predictions are randomized.

2 Gappletron

Algorithm 1: GAPPLETRON
Input: set of revealing actions $Q \subseteq [K]$, minimum dominating set $S$, OCO algorithm $\mathcal{A}$ with domain $\mathcal{W} \subseteq \mathbb{R}^{K \times d}$, $\gamma \ge 0$, and gap map $a : \mathbb{R}^{K \times d} \times \mathbb{R}^d \to [0, 1]$
1: for $t = 1, \dots, T$ do
2:   Obtain $x_t$
3:   Let $y'_t = \arg\max_k \langle W_t^k, x_t \rangle$  // max-margin prediction
4:   if $y'_t \in Q$ then
5:     Set $\gamma_t = 0$
6:   else
7:     Set $\gamma_t = \min\Big\{\frac{1}{2},\; \gamma \big/ \sqrt{|\{s \le t : y'_s \notin Q\}|}\Big\}$  // exploration rate
8:   Set $\zeta_t = \mathbb{1}[\gamma_t \le a(W_t, x_t)]$
9:   $p_t = \big(1 - \zeta_t a(W_t, x_t) - (1 - \zeta_t)\gamma_t\big)\,e_{y'_t} + \zeta_t a(W_t, x_t)\,\frac{1}{K}\mathbf{1} + (1 - \zeta_t)\,\frac{\gamma_t}{\rho}\,\mathbf{1}_S$
10:  Predict with label $\hat{y}_t \sim p_t$
11:  Compute $v_t = \mathbb{1}[y_t \in \mathrm{out}(\hat{y}_t)] \,/\, \mathbb{P}_t(y_t \in \mathrm{out}(\hat{y}_t))$  // $y_t$ is observed only when $y_t \in \mathrm{out}(\hat{y}_t)$
12:  Set $\hat{\ell}_t(W_t) = v_t \ell_t(W_t)$  // compute loss estimates
13:  Send $\hat{\ell}_t$ to $\mathcal{A}$ and get $W_{t+1}$ in return

In this section we introduce GAPPLETRON, whose pseudocode is presented in Algorithm 1. As input, the algorithm takes information about the graph $G$ in the form of a minimum dominating set $S$ and a (possibly empty) set of revealing actions $Q$. GAPPLETRON maintains a parameter matrix $W_t \in \mathcal{W} \subseteq \mathbb{R}^{K \times d}$ and uses some full information Online Convex Optimization (OCO) algorithm $\mathcal{A}$ to update the vector form of $W_t$.
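To make lines 3-10 of Algorithm 1 concrete, here is a minimal sketch (our own illustration; function and variable names are ours) of how the prediction distribution $p_t$ is assembled and sampled. The gap map value and exploration rate are passed in as numbers; in the algorithm they would be $a(W_t, x_t) = \ell(W_t, x_t, y'_t)$ and the schedule of line 7.

```python
import numpy as np

rng = np.random.default_rng(0)

def gappletron_predict(W, x, a_val, gamma_t, Q, S):
    """One prediction step of Algorithm 1 (lines 3-10).

    W: K x d weight matrix; x: feature vector; a_val: gap map value,
    assumed in [0, 1]; gamma_t: exploration rate, at most 1/2;
    Q: set of revealing actions; S: dominating set (list of labels).
    """
    K = W.shape[0]
    y_prime = int(np.argmax(W @ x))      # max-margin prediction (line 3)
    if y_prime in Q:                     # revealing action: no extra exploration
        gamma_t = 0.0
    zeta = 1.0 if gamma_t <= a_val else 0.0           # line 8
    p = np.zeros(K)
    p[y_prime] = 1.0 - zeta * a_val - (1.0 - zeta) * gamma_t
    p += zeta * a_val / K                # uniform exploration over [K]
    for s in S:                          # exploration over the dominating set
        p[s] += (1.0 - zeta) * gamma_t / len(S)
    assert abs(p.sum() - 1.0) < 1e-12 and (p >= 0).all()
    y_hat = int(rng.choice(K, p=p))      # sampled prediction (line 10)
    return y_prime, y_hat, p
```

The design is a three-way mixture: most mass sits on the max-margin label, a gap-proportional slice is spread uniformly over all classes, and the remaining $\gamma_t/\rho$ slice over the dominating set guarantees that, outside revealing rounds, every label's feedback is observed often enough for the importance weights of line 11 to stay controlled.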
Our results hold whenever $\mathcal{A}$ satisfies the condition that $\sum_{t=1}^{T}\big(\hat{\ell}_t(W_t) - \hat{\ell}_t(U)\big)$ be at most of order $h(U)\sqrt{\sum_{t=1}^{T} \|\nabla\hat{\ell}_t(W_t)\|^2}$, where $\hat{\ell}_t$ are the estimated losses computed at line 12 of Algorithm 1 and $h : \mathcal{W} \to \mathbb{R}_+$ is any upper bound on the norm of $U \in \mathcal{W}$. Since practically any OCO algorithm can be tuned to have such a guarantee (see Orabona and Pál, 2018), this is a mild requirement. Whereas GAPTRON is only able to use Online Gradient Descent (OGD) with a fixed learning rate, GAPPLETRON allows for more flexibility, which in turn may lead to different guarantees on the surrogate regret. For example, if the learner runs an OCO algorithm with a good dynamic regret bound (Zinkevich, 2003), then GAPPLETRON enjoys a good dynamic surrogate regret bound. Furthermore, the guarantee of $\mathcal{A}$ allows us to derive stronger results in the separable setting while maintaining a similar guarantee as GAPTRON in the non-separable setting, which is not possible when using OGD with a fixed learning rate.

Additional inputs to GAPPLETRON are $\gamma > 0$, which is used to control the exploration rate of the algorithm in the partial information setting, and the gap map $a$, whose role we explain below. The predictions of Algorithm 1 are sampled from $p_t$, defined in line 9, where $e_{y'_t}$ is the basis vector in the direction of the margin-based linear prediction $y'_t = \arg\max_k \langle W_t^k, x_t \rangle$. The gap map $a : \mathbb{R}^{K \times d} \times \mathbb{R}^d \to [0, 1]$ controls the mixture between $e_{y'_t}$ and the uniform exploration term $\frac{1}{K}\mathbf{1}$. For brevity, we sometimes write $a_t$ instead of $a(W_t, x_t)$. The single most important property of GAPPLETRON is presented in the following lemma.

Lemma 1. Fix any feedback graph $G$ and suppose that, for all $t$, $\ell_t$ is a regular surrogate loss with respect to $\ell$. Then GAPPLETRON, run on $G$ with $a$ such that $a(W_t, x_t) = \ell(W_t, x_t, y'_t)$, satisfies
$$\sum_{y \in [K]} p_t(y)\,\mathbb{1}[y \neq y_t] \;\le\; \frac{K-1}{K}\,\ell_t(W_t) + \gamma_t.$$

Proof. First, observe that $\sum_{y \in [K]} p_t(y)\,\mathbb{1}[y \neq y_t] \le (1 - a_t)\,\mathbb{1}[y'_t \neq y_t] + a_t\,\frac{K-1}{K} + \gamma_t$, since $\zeta_t$, $(1 - \zeta_t)$, and the cost of exploration are at most 1. To conclude the proof we claim that the first two terms in the right-hand side are upper bounded by $\frac{K-1}{K}\,\ell_t(W_t)$. We show that by considering two cases. In the first case $y'_t = y_t$, and the inequality simply follows by substituting $a_t = \ell(W_t, x_t, y'_t) = \ell_t(W_t)$. In the second case $y'_t \neq y_t$, and we have that
$$(1 - a_t)\,\mathbb{1}[y'_t \neq y_t] + a_t\,\frac{K-1}{K} \;=\; 1 - \frac{\ell(W_t, x_t, y'_t)}{K} \;\le\; \frac{K-1}{K}\,\ell(W_t, x_t, y_t) \;=\; \frac{K-1}{K}\,\ell_t(W_t),$$
where the inequality is due to equation (2) in the definition of regular surrogate losses.

Although the GAPTRON algorithm uses similar predictions, it is not clear how to choose $a$ such that a property similar to the one described in Lemma 1 holds. Rather, Van der Hoeven (2020) derives a different gap map for the hinge loss, the smooth hinge loss, and the logistic loss, and analyses the surrogate regret separately for each loss. With Lemma 1 in hand, we simplify the analysis and at the same time also generalize the results of Van der Hoeven (2020) to other surrogate losses. Furthermore, Lemma 1 also allows us to derive surrogate regret bounds that hold with high probability.

What Lemma 1 states is that with regular surrogate losses and $a(W_t, x_t) = \ell(W_t, x_t, y'_t)$ the expected zero-one loss of GAPPLETRON can be upper bounded by $\frac{K-1}{K}\,\ell_t(W_t)$ plus the cost of exploration. While at first this may seem of little interest, note that we want to bound the zero-one loss in terms of $\ell_t$ rather than $\frac{K-1}{K}\,\ell_t$. Compared to standard algorithms, this gains us a $\frac{1}{K}\,\ell_t(W_t)$ term in each round, which we can use to derive our results.
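The case analysis above rests entirely on condition (2). As a quick illustration, the following sketch (our own; it assumes the standard Crammer-Singer multiclass hinge $\ell(W, x, y) = \max_{k \neq y}\,[1 - \langle W^y - W^k, x \rangle]_+$, which appears to satisfy (2) and, as noted above, fails only the gradient condition (3)) checks the inequality on random instances.

```python
import numpy as np

rng = np.random.default_rng(1)

def multiclass_hinge(W, x, y):
    """Crammer-Singer multiclass hinge: max_{k != y} [1 - <W_y - W_k, x>]_+ ."""
    margins = W @ x
    others = np.delete(margins, y)
    return max(0.0, 1.0 - (margins[y] - others.max()))

# Monte Carlo check of condition (2):
# (K-1)/K * l(W,x,y) + 1/K * l(W,x,y') >= 1 for every y != y',
# where y' is the argmax label.
for _ in range(10000):
    K, d = int(rng.integers(2, 8)), int(rng.integers(2, 10))
    W, x = rng.normal(size=(K, d)), rng.normal(size=d)
    y_prime = int(np.argmax(W @ x))
    for y in range(K):
        if y == y_prime:
            continue
        lhs = ((K - 1) / K) * multiclass_hinge(W, x, y) \
              + (1 / K) * multiclass_hinge(W, x, y_prime)
        assert lhs >= 1 - 1e-9, (lhs, K)
print("condition (2) held on all sampled instances")
```

The per-round $\frac{1}{K}\,\ell_t(W_t)$ slack that Lemma 1 buys is then traded against the regret of the OCO subroutine.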
To see how, observe that GAPPLETRON uses an OCO algorithm $\mathcal{A}$ to update $\mathrm{vec}(W_t)$ on each round. Suppose that, for some $h : \mathcal{W} \to \mathbb{R}_+$ and $U \in \mathcal{W}$, algorithm $\mathcal{A}$ satisfies
$$\sum_{t=1}^{T}\big(\hat{\ell}_t(W_t) - \hat{\ell}_t(U)\big) \;\le\; h(U)\sqrt{\sum_{t=1}^{T} \|\hat{g}_t\|^2}, \qquad (4)$$
where $\hat{g}_t = v_t \nabla\ell_t(W_t)$. For simplicity, assume for now that we are in the full information setting (i.e., $v_t = 1$ for all $t$). Since $\ell_t$ is a regular surrogate loss, we can use $\|\nabla\ell_t(W)\|^2 \le 2L\,\ell_t(W)$ and $\sqrt{ab} = \frac{1}{2}\inf_{\eta > 0}\{a/\eta + \eta b\}$ to show that
$$-\frac{1}{K}\sum_{t=1}^{T}\ell_t(W_t) + h(U)\sqrt{2L\sum_{t=1}^{T}\ell_t(W_t)} \;\le\; \frac{KLh(U)^2}{2}.$$
This means that in the full information setting the surrogate regret of GAPPLETRON is independent of the number of rounds. In the partial information setting some additional steps are required, but the idea remains essentially the same. We formalize the aforementioned ideas in the following lemma, whose proof is deferred to Appendix B.

Lemma 2. Fix any feedback graph $G$ and suppose that, for all $t$, $\ell_t$ is a regular surrogate loss with respect to $\ell$. If $\mathcal{A}$ satisfies equation (4) then, for any realization of the randomized predictions $\hat{y}_1, \dots, \hat{y}_T$, GAPPLETRON, run on $G$ with gap map $a$ such that $a(W_t, x_t) = \ell(W_t, x_t, y'_t)$, satisfies
$$\sum_{t=1}^{T}\sum_{y \in [K]} p_t(y)\,\mathbb{1}[y \neq y_t] \;\le\; \sum_{t=1}^{T}\hat{\ell}_t(U) + \frac{h(U)^2}{2\eta} + \sum_{t=1}^{T}\Big(\gamma_t + \frac{K-1}{K}\,\ell_t(W_t) - v_t\,\ell_t(W_t) + \eta v_t^2 L\,\ell_t(W_t)\Big)$$
for all $U \in \mathcal{W}$ and all $\eta > 0$.

3 Bounds that Hold in Expectation

In this section we present bounds on the surrogate regret that hold in expectation. For brevity we use $M_T = \sum_{t=1}^{T} \mathbb{1}[\hat{y}_t \neq y_t]$. We now state a simplified version of Theorem 4, whose full statement and proof can be found in Appendix C.

Theorem 1. Let $G$ be any feedback graph with domination number $\rho$ and revealing action set $Q$. Suppose that, for all $t$, $\ell_t$ is a regular surrogate loss with respect to $\ell$. If $\mathcal{A}$ satisfies equation (4) then GAPPLETRON, run on $G$ and $\mathcal{A}$ with gap map $a$ such that $a(W_t, x_t) = \ell(W_t, x_t, y'_t)$, satisfies, for all $U \in \mathcal{W}$,
$$\mathbb{E}[R_T] = O\Bigg(\mathbb{E}\Bigg[\max\Bigg\{\frac{K^2 L h(U)^2}{\max\{1, |Q|\}},\; h(U)\sqrt{\rho K L\,\big|\{t : y'_t \notin Q\}\big|}\Bigg\}\Bigg]\Bigg).$$
Furthermore, for all $U \in \mathcal{W}$ such that $\sum_{t=1}^{T}\ell_t(U) = 0$, GAPPLETRON satisfies
$$\mathbb{E}[M_T] = O\Bigg(\mathbb{E}\Bigg[\max\Bigg\{h(U)\sqrt{\rho L\,\big|\{t : y'_t \notin Q\}\big|}\,,\; \frac{K L h(U)^2}{\max\{1, |Q|\}}\Bigg\}\Bigg]\Bigg).$$

In the full information setting we clearly have that $Q = [K]$. Hence, using OGD as $\mathcal{A}$ with an appropriate learning rate, the second statement in Theorem 1 reduces to $\mathbb{E}[M_T] \le 4L\|U\|_2^2 - \frac{1}{K}\sum_{t=1}^{T}\mathbb{E}[\ell_t(W_t)]$, which improves the results of GAPTRON in the separable case by at least a factor $K$. Interestingly, compared to standard bounds for the separable case, such as the PERCEPTRON bound, there is a negative term which seems to further lower the cost of learning how to separate the data. Similarly, in the partial information setting, the bound for the separable case in Theorem 1 has a reduced dependency on $K$ compared to the non-separable case, obtaining similar improvements over GAPTRON as in the full information setting.

For the non-separable case, Theorem 1 generalizes GAPTRON in two directions. The most prominent direction is the extension to feedback graphs, where our analysis reveals a surprising phenomenon: Theorem 1 in fact shows that the surrogate regret in the label efficient setting (and in any setting where $\rho < K$) is actually smaller than in the bandit setting, where $\rho = K$. Intuitively, this is due to the fact that our algorithm only updates when $y_t$ is known. In the bandit setting, we need to explore all labels to find $y_t$, while in label efficient classification we can just play whichever action is the revealing action, and find $y_t$. This implies that exploration in label efficient classification is easier than in the bandit setting. Note that in the bandit setting, playing $\hat{y}_t \neq y_t$ also provides the learner with information.
Perhaps by using this information effectively, one is able to improve our surrogate regret bounds, but as of yet it is not clear how to use knowledge of the wrong label. The second extension is that the bounds in Theorem 1 hold for all regular surrogate loss functions with the same gap map defined by the surrogate loss, rather than only for a limited number of loss functions and ad-hoc gap maps, as was the case with GAPTRON.

4 Bounds that Hold with High Probability

We now present bounds on the surrogate regret that hold with high probability. After proving a general surrogate regret bound, we derive a corresponding bound, with improved guarantees, for the full information setting. The bound for the partial information setting can be found in Theorem 5 in Appendix D, which implies Theorem 2 below. Let the maximum loss over all rounds be $\ell_{\max} = \max_{t, W \in \mathcal{W}} \ell_t(W)$.

Theorem 2. With probability at least $1 - \delta$, GAPPLETRON satisfies, for all $U \in \mathcal{W}$,
$$R_T = O\Big(\sqrt{\big(L h(U)^2 + \ell_{\max}\ln(1/\delta)\big)\,K\rho T}\Big).$$
Furthermore, for all $U \in \mathcal{W}$ such that $\sum_{t=1}^{T}\ell_t(U) = 0$, with probability at least $1 - \delta$ GAPPLETRON satisfies
$$M_T = O\Big(\sqrt{\big(L h(U)^2 + K\ell_{\max}\ln(1/\delta)\big)\,\rho T}\Big).$$

Theorem 2 shows that Algorithm 1 has $O(h(U)\sqrt{\rho KT})$ surrogate regret in the worst case, with high probability. As far as the authors are aware, this is the first high-probability surrogate regret bound for a margin-based classifier in the partial information setting. Similarly to the bounds in expectation, the worst-case surrogate regret is the largest in the bandit setting ($\rho = K$) and the smallest in label efficient classification ($\rho = 1$). Unlike the bounds in expectation, where the surrogate regret was at least a factor $K$ smaller in the separable case, the improvement in Theorem 2 is less apparent, but the surrogate regret still has a better dependence on $K$ in the separable case. In particular, all the terms with $h(U)$ have a better dependence on $K$.

In the full information setting the dependence on $\ell_{\max}$ can be removed. This cannot be achieved in the partial information setting, due to the necessity of estimating the surrogate loss. If $\mathcal{W}$ has a bounded radius $B$ and $\ell_t$ has gradients bounded by $G$, then $\ell_{\max} \le 1 + BG$ by convexity. The bound for the full information setting can be found in Theorem 3. In the separable case of the full information setting, the bound does not depend on $K$, which is not the case for Theorem 2 due to the need to control the surrogate loss estimates.

Theorem 3. Under the conditions of Lemma 2, with probability at least $1 - \delta$, GAPPLETRON satisfies
$$M_T \;\le\; \sum_{t=1}^{T}\ell_t(U) + KLh(U)^2 + 3K\ln(1/\delta).$$
Furthermore, for all $U \in \mathcal{W}$ such that $\sum_{t=1}^{T}\ell_t(U) = 0$, with probability at least $1 - \delta$, GAPPLETRON satisfies
$$M_T \;\le\; 4Lh(U)^2 + 3\ln(1/\delta).$$

We provide a proof sketch of the full information version of Theorem 3. The proof for the partial information setting is essentially the same, with some extra steps to control the estimates of the surrogate losses. Let $z_t = \mathbb{1}[\hat{y}_t \neq y_t] - \sum_{y \in [K]} p_t(y)\,\mathbb{1}[y \neq y_t]$. The proof relies on (Beygelzimer et al., 2011, Theorem 1), see Lemma 4 in this paper, which, when translated to our setting, states that with probability at least $1 - \delta$, for $\eta \in [0, 1]$,
$$\sum_{t=1}^{T} z_t \;\le\; \eta\sum_{t=1}^{T}\mathbb{E}_t\big[z_t^2\big] + \frac{\ln(1/\delta)}{\eta}.$$
Since the variance is bounded by the second moment, $\mathbb{E}_t[z_t^2] \le \mathbb{E}_t\big[\mathbb{1}[\hat{y}_t \neq y_t]\big] \le \frac{K-1}{K}\,\ell_t(W_t)$, where the last inequality is due to Lemma 1. By applying Lemma 2 (with $v_t = 1$ and $\gamma_t = 0$ in the full information setting), we find that, for some $\eta \in [0, 1]$,
$$M_T \;\le\; \sum_{t=1}^{T}\ell_t(U) + \frac{h(U)^2}{2\eta} + \Big(\eta L + \eta\,\frac{K-1}{K} - \frac{1}{K}\Big)\sum_{t=1}^{T}\ell_t(W_t) + \frac{\ln(1/\delta)}{\eta}$$
with probability at least $1 - \delta$. After choosing an appropriate $\eta$ of order $1/K$, the coefficient of $\sum_{t=1}^{T}\ell_t(W_t)$ becomes nonpositive, which gives us a $O\big(K(Lh(U)^2 + \ln(1/\delta))\big)$ surrogate regret bound with high probability.
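All guarantees in Sections 3 and 4 are stated relative to a base OCO algorithm $\mathcal{A}$ satisfying equation (4). For concreteness, here is a minimal sketch (our own, not the paper's code) of one standard choice: online gradient descent with an adaptive, scale-free step size, which matches the tuning used in the experiments of Section 6.

```python
import numpy as np

class AdaptiveOGD:
    """OGD with step size eta_t = (eps + sum_j ||g_j||^2)^(-1/2).

    Self-tuned OGD of this kind has regret against a comparator U of order
    h(U) * sqrt(sum_t ||g_t||^2) (see, e.g., Orabona and Pal, 2018), which
    is the guarantee that equation (4) asks of the algorithm A.
    """
    def __init__(self, dim, eps=1e-8):
        self.w = np.zeros(dim)        # vec(W_1) = 0
        self.sq_grad_sum = eps

    def step(self, grad):
        """Receive the gradient of the (estimated) loss at w; return new w."""
        self.sq_grad_sum += float(grad @ grad)
        eta = 1.0 / np.sqrt(self.sq_grad_sum)
        self.w = self.w - eta * grad  # no projection, as in Section 6
        return self.w
```

Within Algorithm 1, `step` would be called at line 13 with the gradient of the estimated loss, $\hat{g}_t = v_t \nabla\ell_t(W_t)$, flattened to a vector of length $Kd$.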
5 Lower Bounds

Corollary 1 below shows that the bound of Theorem 1 cannot be significantly improved.

Corollary 1. Let $\mathcal{A}$ be a possibly randomized algorithm for the online multiclass classification setting with feedback graphs. Then, for any $B = \Omega(1)$, the surrogate regret of $\mathcal{A}$ with respect to the smooth hinge loss must satisfy
$$\mathbb{E}[M_T] = \min_{U \in \mathcal{W}}\sum_{t=1}^{T}\ell_t(U) + \Omega\big(KB^2 + \sqrt{T}\big),$$
where $K$ is the number of classes, the feature vectors $x_t$ satisfy $\|x_t\|_2 = \Theta(1)$ for all $t$, and $\mathcal{W} = \{W : \|W\| \le B\}$.

Corollary 1 is implied by Theorems 6 and 7 in Appendix E. The proof of Theorem 6 builds on the lower bound of Daniely et al. (2015) for strongly-adaptive regret. The feedback graph considered in the proof is filtering with two classes: a blind class (no outgoing edges) and a revealing class. In the proof, we show that the algorithm either explores too much, in which case the lower bound trivially holds, or the algorithm explores too little, in which case the environment can trick the algorithm into playing the wrong action by exploiting the blind class.

6 Experiments

We empirically evaluated the performance of GAPPLETRON on synthetic data in the bandit, multiclass filtering, and full information settings. Similarly to the SynSep and SynNonSep datasets described in (Kakade et al., 2008), we generated synthetic datasets with $d \in \{80, 120, 160\}$, $K \in \{6, 9, 12\}$, and the label noise rate in $\{0, 0.05, 0.1\}$. Due to space constraints, we only report part of the experiments for the bandit setting in the main text; see Figure 2. In the bandit setting we used worst-case tuning for the algorithms with the parameters suggested by theory, or set all parameters to 1, except for $T$. Initially we only used theoretical tuning for all algorithms, but we found that two algorithms we compared with did not have satisfactory results. A more detailed description of the results, including how we generated data and tuned the algorithms, can be found in Appendix F.

[Figure 2 omitted: Results of the synthetic experiments for the bandit setting. The plot shows the best results of algorithms (Banditron, Gap Hin, Gap Log, Gap Sm H, PNewtron, SOBAdiag) with parameters suggested by theory, or tuned with all parameters set to 1, except for T. The rows indicate different values for K and the columns different values for d. Whiskers show the minimum and the maximum error rate over ten repetitions.]

In the bandit setting, we compared GAPPLETRON with the following baselines: PNewtron (the diagonal version of Newtron, by Hazan and Kale (2011)), SOBAdiag (the diagonal version of SOBA, by Beygelzimer et al. (2017)), and the importance weighted version of Banditron (Kakade et al., 2008); this version differs from the one described by Kakade et al. (2008) in that we replaced $\tilde{U}_t$ in their Algorithm 1 with the gradient of the importance weighted hinge loss. We opted to use the diagonal versions of Newtron and SOBA for computational reasons. We chose the importance weighted version of Banditron because the standard version did not produce satisfactory results. We used three surrogate losses for GAPPLETRON: the logistic loss $\ell_t(W_t) = -\log_K q(W_t, x_t, y_t)$, where $q$ is the softmax, the hinge loss defined in (5), and the smooth hinge loss (Rennie and Srebro, 2005), denoted by Gap Log, Gap Hin, and Gap Sm H respectively. The OCO algorithm used with all losses is Online Gradient Descent, with learning rate $\eta_t = \big(10^{-8} + \sum_{j=1}^{t}\|\nabla\hat{\ell}_j(W_j)\|_2^2\big)^{-1/2}$ and no projections. As shown in Figure 2, on average all versions of GAPPLETRON outperform the baselines in the bandit setting. Gap Hin appears to be more unstable than the other versions of GAPPLETRON.
We suspect this is due to the fact that Gap Hin explores less than its counterparts. In multiclass spam filtering (Figure 8 in Appendix F), we see that Gap Log makes more mistakes than its counterparts for $K > 6$. We suspect this is due to the fact that with the logistic loss, the gap map is never zero, which implies that Gap Log picks an outcome uniformly at random more often than Gap Hin and Gap Sm H, while not gaining any information. Due to this behaviour Gap Log makes more mistakes than necessary, which we also observe in the full information setting. In the bandit setting, the additional exploration leads to additional stability for Gap Log, as indicated by the small range of performance of Gap Log. In all cases, increasing the exploration rate increased the stability of GAPPLETRON, which is very much in agreement with Theorem 5. Additionally, in Appendix F we also compare GAPPLETRON with GAPTRON, and in these experiments GAPPLETRON makes fewer mistakes than GAPTRON.

7 Future work

There are several intriguing research directions left open to pursue. While our lower bound holds for general feedback graphs, it is not clear whether our bounds are tight for the bandit setting. Either providing a lower bound or an improved algorithm for the bandit setting thus remains open. Our results show that it is possible to obtain improved bounds for the separable case while maintaining satisfactory results for the non-separable case. However, as Beygelzimer et al. (2019) show, it is possible to obtain even better guarantees in the separable case of the bandit setting. An algorithm guaranteeing $O(K\|U\|^2)$ mistakes in the separable case and $O(K\sqrt{T})$ surrogate regret in the non-separable case, without prior knowledge of the separability, would therefore be an interesting contribution.

Acknowledgments and Disclosure of Funding

Nicolò Cesa-Bianchi, Federico Fusco and Dirk van der Hoeven gratefully acknowledge partial support by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR). Nicolò Cesa-Bianchi was also supported by the EU Horizon 2020 ICT-48 research and innovation action under grant agreement 951847, project ELISE. Federico Fusco was also supported by the ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets".

References

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, volume 40, pages 23–35.
Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826.
Arora, R., Marinov, T. V., and Mohri, M. (2019). Bandits with feedback graphs and switching costs. In Advances in Neural Information Processing Systems, pages 10397–10407.
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, pages 19–26.
Beygelzimer, A., Orabona, F., and Zhang, C. (2017). Efficient online bandit multiclass learning with Õ(√T) regret. In International Conference on Machine Learning, pages 488–497.
Beygelzimer, A., Pal, D., Szorenyi, B., Thiruvenkatachari, D., Wei, C.-Y., and Zhang, C. (2019).
Bandit multiclass linear classification: Efficient algorithms for the separable case. In International Conference on Machine Learning, pages 624–633.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2003). Margin-based algorithms for information filtering. In Advances in Neural Information Processing Systems, pages 487–494.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2005). A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668.
Cesa-Bianchi, N., Gaillard, P., Gentile, C., and Gerchinovitz, S. (2017). Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, pages 465–481.
Cesa-Bianchi, N., Gentile, C., Zaniboni, L., and Warmuth, M. (2006). Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205–1230.
Cohen, A., Hazan, T., and Koren, T. (2016). Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819.
Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Yang, S. (2019). Online learning with sleeping experts and feedback graphs. In International Conference on Machine Learning, pages 1370–1378.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585.
Crammer, K., Kulesza, A., and Dredze, M. (2009). Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems, pages 414–422.
Crammer, K. and Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.
Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411.
Feng, Z. and Loh, P.-L. (2018). Online learning with graph-structured feedback against adaptive adversaries. In IEEE International Symposium on Information Theory, pages 931–935.
Feng, Z., Podimata, C., and Syrgkanis, V. (2018). Learning to bid without knowing your value. In ACM Conference on Economics and Computation, pages 505–522.
Fink, M., Shalev-Shwartz, S., Singer, Y., and Ullman, S. (2006). Online multiclass learning by interclass hypothesis sharing. In International Conference on Machine Learning, pages 313–320.
Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. (2018). Logistic regression: The importance of being improper. In Conference On Learning Theory, pages 167–208.
Han, Y., Zhou, Z., Flores, A., Ordentlich, E., and Weissman, T. (2020). Learning to bid optimally and efficiently in adversarial first-price auctions. arXiv preprint arXiv:2007.04568.
Hazan, E. and Kale, S. (2011). Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems, pages 891–899.
Helmbold, D. P., Littlestone, N., and Long, P. M. (2000). Apple tasting. Information and Computation, 161(2):85–139.
Van der Hoeven, D. (2020). Exploiting the surrogate gap in online multiclass classification. Advances in Neural Information Processing Systems, 33.
Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. (2008). Efficient bandit algorithms for online multiclass prediction. In International Conference on Machine Learning, pages 440–447.
Kocák, T., Neu, G., and Valko, M. (2016). Online learning with noisy side observations. In Artificial Intelligence and Statistics, pages 1186–1194.
Lattimore, T. and Szepesvári, C. (2020).
Bandit algorithms. Cambridge University Press.
Liu, F., Zheng, Z., and Shroff, N. B. (2018). Analysis of Thompson sampling for graphical bandits without the graphs. In Conference on Uncertainty in Artificial Intelligence, pages 13–22.
Lykouris, T., Sridharan, K., and Tardos, É. (2018). Small-loss bounds for online learning with partial information. In Conference on Learning Theory, pages 979–986.
Lykouris, T., Tardos, E., and Wali, D. (2020). Feedback graph regret bounds for Thompson sampling and UCB. In Algorithmic Learning Theory, pages 592–614.
Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.
Novikov, A. (1962). On the convergence proofs for perceptrons. In Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pages 615–622.
Orabona, F. and Pál, D. (2018). Scale-free online learning. Theoretical Computer Science, 716:50–69.
Rennie, J. D. and Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, volume 1, pages 180–186.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
Sculley, D. (2008). Advances in online learning-based spam filtering. PhD thesis, Tufts University.
Tossou, A., Dimitrakakis, C., and Dubhashi, D. (2017). Thompson sampling for stochastic bandits with graph feedback. In AAAI Conference on Artificial Intelligence, volume 31.
Zhou, X. (2018). On the Fenchel duality between strong convexity and Lipschitz continuous gradient. arXiv preprint arXiv:1803.06573.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Our contributions are primarily theoretical and we clearly state the assumptions under which our theoretical results hold.
   (c) Did you discuss any potential negative societal impacts of your work? [No] Our contribution is primarily theoretical. Therefore, this work does not present any foreseeable societal consequence.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Although some proofs have been deferred to the appendix due to space constraints.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] In our reporting of the experiments we include the best and worst performance of each algorithm.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix 4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [No]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] The data is synthetic.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]