# Nonstochastic Bandits with Composite Anonymous Feedback

Journal of Machine Learning Research 23 (2022) 1-24. Submitted 12/21; Published 8/22.

Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it), Dept. of Computer Science and DSRC, Università degli Studi di Milano, Italy
Tommaso Cesari (tommaso.cesari@tse-fr.eu), Institut de Mathématiques de Toulouse (IMT), Paul Sabatier University (UT3), Toulouse, France; Toulouse School of Economics (TSE), Toulouse, France
Roberto Colomboni (roberto.colomboni@unimi.it), Istituto Italiano di Tecnologia (IIT), Genova, Italy; Università degli Studi di Milano, Italy
Claudio Gentile (cgentile@google.com), Google Research, New York, USA
Yishay Mansour (mansour@tau.ac.il), Tel Aviv University, Tel Aviv, Israel; Google Research, Tel Aviv, Israel

Editor: Alexandre Proutière

Abstract

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually. Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: we bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is the maximum delay, $K$ is the number of arms, and $T$ is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting.

Keywords: Multi-armed bandits, non-stochastic losses, composite losses, delayed feedback, online learning.

A preliminary version of this paper appeared in the Proceedings of the 31st Conference on Learning Theory (COLT), PMLR 75:750-773, 2018. (c) 2022 Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Claudio Gentile, Yishay Mansour. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1443.html.

1. Introduction

Multiarmed bandits, originally proposed for managing clinical trials, are now routinely applied to a variety of other tasks, including computational advertising, e-commerce, and beyond. Typical examples of e-commerce applications include content recommendation systems, like the recommendation of products to visitors of merchant websites and social media platforms. A common pattern in these applications is that the response elicited in a user by the recommendation system is typically not instantaneous, and might occur some time in the future, well after the recommendation was issued. This delay, which might depend on several unknown factors, implies that the reward obtained by the recommender at time t can actually be seen as the combined effect of many previous recommendations to that user.
The more specific scenario of bandits with delayed rewards has been investigated in the literature under the assumption that the contributions of past recommendations to the combined reward are individually discernible; see, e.g., (Neu et al., 2010; Joulani et al., 2013; Cesa-Bianchi et al., 2016; Vernade et al., 2017). Pike-Burke et al. (2018) revisited the problem of bandits with delayed feedback under the more realistic assumption that only the combined reward is available to the system, while the individual reward components remain unknown. This model captures a much broader range of practical settings where bandits are successfully deployed. Consider for example an advertising campaign which is spread across several channels simultaneously (e.g., radio, TV, web, social media). A well-known problem faced by the campaign manager is to disentangle the contribution of the individual ads deployed in each channel from the overall change in sales. Pike-Burke et al. (2018) formalized this harder delayed setting in a bandit framework with stochastic rewards, where they introduced the notion of delayed anonymous feedback to emphasize the fact that the reward received at any point in time is the sum of rewards of an unknown subset of past selected actions. More specifically, choosing action $I_t \in [K]$ at time $t$ generates a stochastic reward $Y_t(I_t) \in [0,1]$ and a stochastic delay $\tau_t \in \{0, 1, \dots\}$, where $\big(Y_t(i), \tau_t\big)_{i \in [K],\, t \in \mathbb{N}}$ is a family of independent random variables such that $Y_1(i), Y_2(i), \dots$ have a common distribution $\nu_Y(i)$ (for all arms $i \in [K]$) and $\tau_1, \tau_2, \dots$ have a common distribution $\nu_\tau$ with expectation $\mu_\tau$. The delayed anonymous feedback assumption entails that the reward observed at time $t$ by the algorithm is the sum of the $t$ components $Y_s(I_s)\,\mathbb{I}\{\tau_s = t - s\}$ for $s \in \{1, \dots, t\}$. The main result in (Pike-Burke et al., 2018) is that, when the expected delay $\mu_\tau$ is known, the regret is at most of order $K\big((\ln T)/\Delta + \mu_\tau\big)$, where $\Delta$ is the suboptimality gap. This bound is of the same order as the corresponding bound for the setting where the feedback is stochastically delayed, but not anonymous (Joulani et al., 2013), and cannot be improved in general.

In this work, we study a bandit setting similar to delayed anonymous feedback, but with two important differences. First, we work in a nonstochastic bandit setting, where rewards (or losses, in our case) are generated by some unspecified deterministic mechanism. Second, we relax the assumption that the loss of an action is charged to the player at a single instant in the future. More precisely, we assume that the loss for choosing an action at time $t$ is adversarially spread over at most $d+1$ consecutive time steps $t, t+1, \dots, t+d$. Hence, the loss observed by the player at time $t$ is a composite loss, that is, the sum of the $(d+1)$-many loss components $\ell^{(0)}_t(I_t), \ell^{(1)}_{t-1}(I_{t-1}), \dots, \ell^{(d)}_{t-d}(I_{t-d})$, where $\ell^{(s)}_{t-s}(I_{t-s})$ denotes the $s$-th loss component generated by the selection of action $I_{t-s}$ at time $t-s$. Note that in the special case when $\ell^{(s)}_t(i) = 0$ for all $s \neq d_t$ and $\ell^{(d_t)}_t(i) = \ell_t(i)$, we recover the model of nonstochastic bandits with delays $d_1, d_2, \dots \le d$ (which, in particular, reduces to standard nonstochastic bandits when $d = 0$). Our setting, which we call composite anonymous feedback, can accommodate scenarios where actions have a lasting effect which combines additively over time. Online businesses provide several use cases for this setting.
For instance, an impression that results in an immediate clickthrough, later followed by a conversion; a user who interacts with a recommended item (such as media content) multiple times over several days; or the free credit assigned to a user of a gambling platform, which might not be spent all at once.

Our main contribution is a general reduction technique (Composite Loss Wrapper, or CoLoWrapper, Algorithm 1) turning a base nonstochastic bandit algorithm into one operating within the composite anonymous feedback setting. We then show that the regret of CoLoWrapper can be upper bounded in terms of the stability and the regret of the base algorithm (Theorem 4). Choosing as a base algorithm Follow the Regularized Leader (FTRL) with Tsallis entropy, Theorem 4 gives an upper bound of order $\sqrt{(d+1)KT}$ on the regret of nonstochastic bandits with composite anonymous feedback (Corollary 7), where $d \ge 0$ is a known upper bound on the delay, $K$ is the number of actions, and $T$ is the time horizon. This result relies on a nontrivial stability analysis of FTRL with Tsallis entropy that could be of independent interest (Theorem 5). Finally, we show the optimality of the $\sqrt{(d+1)KT}$ rate by proving a matching lower bound (up to a logarithmic factor, Theorem 10). In particular, this shows that, in the nonstochastic case with delay $d$, anonymous feedback is strictly harder than nonanonymous feedback, whose minimax regret was characterized by Cesa-Bianchi et al. (2016) as $\sqrt{(d+K)T}$. See the table below for a summary of results for nonstochastic $K$-armed bandits (all rates are optimal ignoring logarithmic factors).

| no delay | delayed feedback | anonymous composite feedback |
|---|---|---|
| $\sqrt{KT}$ (Auer et al., 2002) | $\sqrt{(d+K)T}$ (Cesa-Bianchi et al., 2016) | $\sqrt{(d+1)KT}$ (this paper) |

We now give an idea of the proof techniques. Similarly to (Pike-Burke et al., 2018), we play the same action for a block of at least $2d+1$ time steps, hence the feedback we get in the last $d+1$ steps of the block contains only loss components pertaining to that same action, so that we can estimate in those steps the true loss of that action. Unfortunately, although the original losses are in $[0,1]$, the composite losses can be as large as $d+1$ (a composite loss sums $d+1$ loss components, and each component can be as large as $1$). This causes a corresponding scaling in the regret, compromising optimality. However, we observe that the total composite loss relative to the same action over any $d+1$ consecutive steps can be at most $2d+1$ (Lemma 2). Hence, we can normalize the total composite loss relative to the same action over $d+1$ consecutive steps, simply dividing it by $2d+1$, obtaining an average loss in the range $[0,1]$. This idea leads to the right dependence on $d$ in the regret. The last problem is how to avoid suffering a big regret in the first $d$ steps of each block, where the composite losses mix loss components belonging to more than one action. We solve this issue by borrowing an idea of Dekel et al. (2014c): we build blocks with random endpoints, so that their length is (always) at least $2d+1$ and (on average) not much bigger. This random positioning and length of the blocks is the key to preventing the oblivious adversary from causing a large regret in the first half of each block. Moreover, as we prove in Theorem 4, if the distribution over actions maintained by the base algorithm is stable (Definition 3), then the algorithm is not significantly affected by the uncertainty in the positioning of the blocks. Extending our results to the case where $d$ is unknown, Wang et al. (2021) show a regret bound of order $T^{2/3}$. When $d$ is known, however, their analysis does not guarantee our faster $\sqrt{(d+1)KT}$ rate.
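To make the randomized blocking scheme described above concrete, the following small simulation (our own illustration, not from the paper) draws the Boolean sequence $B$ i.i.d. with bias $\beta = 1/(2d+1)$, marks update rounds as in Definition 1 of Section 3, and empirically checks that every block has length at least $2d+1$ while the average block length is larger only by a constant factor.

```python
import numpy as np

def simulate_blocks(d, T, rng):
    """Simulate the Boolean sequence B with bias 1/(2d+1) and return the lengths
    of the blocks closed by the update rounds of Definition 1 (a sketch of ours)."""
    beta = 1.0 / (2 * d + 1)
    b = rng.random(T + 1) < beta             # b[1], ..., b[T]; index 0 is unused
    lengths, start = [], 1
    for t in range(1, T + 1):
        # update round: t >= 2d+1, b_t = 1, and the previous 2d values are all 0
        if t >= 2 * d + 1 and b[t] and not b[t - 2 * d:t].any():
            lengths.append(t - start + 1)    # the block [start, t] ends here
            start = t + 1
    return np.array(lengths)

rng = np.random.default_rng(0)
d = 2
lengths = simulate_blocks(d, T=200_000, rng=rng)
print(lengths.min(), round(lengths.mean(), 2))   # the minimum is always >= 2d+1 = 5
```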
Further related work. Online learning with delayed feedback was studied in the full information (non-bandit) setting by Weinberger and Ordentlich (2002); Mesterharm (2005); Langford et al. (2009); Joulani et al. (2013); Quanrud and Khashabi (2015); Khashabi et al. (2016); Joulani et al. (2016); Garrabrant et al. (2016); Mann et al. (2019); see also (Shamir and Szlak, 2017) for an interesting variant. The bandit setting with delay was investigated in (Neu et al., 2010; Joulani et al., 2013; Mandel et al., 2015; Cesa-Bianchi et al., 2016; Vernade et al., 2017; Pike-Burke et al., 2018; Li et al., 2019; Thune et al., 2019; Zimmert and Seldin, 2020; Vernade et al., 2020; Ito et al., 2020; Gael et al., 2020; Agrawal and Tulabandula, 2020). Our delayed composite loss function generalizes the composite loss function setting of Dekel et al. (2014a) (see the discussion at the end of Section 2 for details) and is also related to the notion of loss functions with memory. This latter setting has been investigated, e.g., by Arora et al. (2012), who showed how to turn an online algorithm with a regret guarantee of $O(T^q)$ into one attaining $O(T^{1/(2-q)})$-policy regret, also adopting a blocking scheme. A more recent paper in this direction is (Anava et al., 2015), where the authors considered a more general loss framework than ours, though with the benefit of counterfactual feedback, in that the algorithm is aware of the loss it would have incurred had it played any sequence of $d$ decisions in the previous $d$ rounds, thereby making their results incomparable to ours.

2. Preliminaries

We denote the set of positive integers by $\mathbb{N}$ and the set of integers by $\mathbb{Z}$. For all $n \in \mathbb{N}$ we denote the set $\{1, \dots, n\}$ of the first $n$ integers by $[n]$. We will use the handy convention that, if $(c_t)_{t \in \mathbb{Z}} \subset \mathbb{R}$ and $m, n \in \mathbb{Z}$ are such that $m > n$, then $\sum_{t=m}^{n} c_t = 0$ and $\prod_{t=m}^{n} c_t = 1$. For any $x \in \mathbb{R}$, we denote its positive part $\max\{x, 0\}$ by $x^+$.

We start by considering a nonstochastic multiarmed bandit problem on $K$ actions with oblivious losses in which the loss $\ell_t(i) \in [0,1]$ at time $t$ of an action $i \in [K]$ is defined by the sum $\sum_{s=0}^{d} \ell^{(s)}_t(i)$ of $(d+1)$-many components $\ell^{(s)}_t(i) \ge 0$ for $s \in \{0, \dots, d\}$. Let $I_t$ denote the action chosen by the player at the beginning of round $t$. If $I_t = i$, then the player incurs loss $\ell^{(0)}_t(i)$ at time $t$, loss $\ell^{(1)}_t(i)$ at time $t+1$, and so on until time $t+d$. Yet, what the player observes at time $t$ is only the combined loss incurred at time $t$, which is the sum $\ell^{(0)}_t(I_t) + \ell^{(1)}_{t-1}(I_{t-1}) + \cdots + \ell^{(d)}_{t-d}(I_{t-d})$ of the past $d+1$ loss contributions, where $\ell^{(s)}_t(i) = 0$ for all $i$ and $s$ when $t \le 0$. Then, we define the $d$-delayed composite loss at time $t$ of a sequence of $d+1$ actions $i_{t-d}, \dots, i_t \in [K]$ as
$$\ell^{\circ}_t(i_{t-d}, \dots, i_t) := \sum_{s=0}^{d} \ell^{(s)}_{t-s}(i_{t-s})\,. \tag{1}$$
With this notation, the $d$-delayed composite anonymous feedback assumption states that what the player observes at the end of each round $t$ is only the composite loss $\ell^{\circ}_t(I_{t-d}, \dots, I_t)$. The goal of the algorithm is to bound its regret $R_T$ against the best fixed action in hindsight,
$$R_T := \mathbb{E}\left[\sum_{t=1}^{T} \ell^{\circ}_t(I_{t-d}, \dots, I_t)\right] - \min_{i \in [K]} \sum_{t=1}^{T} \ell^{\circ}_t(i, \dots, i)\,.$$
We define the regret in terms of the composite losses $\ell^{\circ}_t$ rather than the true losses $\ell_t$ because, in our model, $\ell^{\circ}_t$ is what the algorithm pays overall on round $t$.
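The feedback model just defined is easy to simulate. The sketch below is our own illustration (the class name and interface are not from the paper): it stores adversarially chosen loss components $\ell^{(s)}_t(i)$ with $\sum_s \ell^{(s)}_t(i) \le 1$ and, at every round, reveals only the composite loss (1).

```python
import numpy as np

class CompositeFeedbackEnv:
    """Minimal simulator of d-delayed composite anonymous feedback (a sketch).

    loss_components[t, s, i] is the component l_t^{(s)}(i); the components are
    assumed to satisfy sum_s loss_components[t, s, i] <= 1 for all t and i."""

    def __init__(self, loss_components, d):
        self.L = loss_components          # shape (T, d + 1, K)
        self.d = d
        self.history = []                 # actions I_1, I_2, ...

    def play(self, arm):
        self.history.append(arm)
        t = len(self.history) - 1         # 0-based index of the current round
        obs = 0.0
        for s in range(self.d + 1):       # component released s steps after the pull
            if t - s >= 0:
                obs += self.L[t - s, s, self.history[t - s]]
        return obs                        # composite loss observed at this round

# toy instance: K = 3 arms, horizon T = 10, maximum delay d = 2
rng = np.random.default_rng(1)
T, d, K = 10, 2, 3
raw = rng.random((T, d + 1, K))
raw /= raw.sum(axis=1, keepdims=True)     # normalize so that sum_s l_t^{(s)}(i) = 1
env = CompositeFeedbackEnv(raw, d)
print([round(env.play(int(rng.integers(K))), 3) for _ in range(T)])
```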
It is easy to see that a bound on $R_T$ implies a bound on the more standard notion of regret $\mathbb{E}\big[\sum_{t=1}^{T} \ell_t(I_t)\big] - \min_{k \in [K]} \sum_{t=1}^{T} \ell_t(k)$, up to an additive term of at most $O(d)$.

Our setting generalizes the composite loss function setting of Dekel et al. (2014a). Specifically, the linear composite loss function therein can be seen as a special case of the composite loss (1) once we remove the superscripts $s$ from the loss function components. In fact, in the linear case, the feedback in (Dekel et al., 2014a) allows one to easily reconstruct each individual loss component in a recursive manner. This is clearly impossible in our more involved scenario, where the new loss components that are observed in round $t$ need not have occurred in past rounds.

3. The CoLoWrapper Algorithm

Our Composite Loss Wrapper algorithm (Algorithm 1) takes as input a standard $K$-armed bandit algorithm $A$ and a Boolean sequence $B$. The base algorithm $A$ operates on standard (noncomposite) losses with values in $[0,1]$, producing probability distributions $q_1, q_2, \dots$ over the action set $[K]$. The wrapper calls the base algorithm $A$ only in a subset of rounds determined by the Boolean sequence $B$, which we call update rounds.

Definition 1 (Update round) We say that $t \in \mathbb{N}$ is an update round with respect to a Boolean sequence $B = (b_t)_{t \in \mathbb{N}} \in \{0,1\}^{\mathbb{N}}$ if $t \ge 2d+1$ and $b_t \prod_{s=1}^{2d} (1 - b_{t-s}) = 1$.

Note that if $d > 0$, the condition is equivalent to $b_t = 1$ and $b_{t-1} = \dots = b_{t-2d} = 0$. If $d = 0$, by our convention, the condition is equivalent to $b_t = 1$. To help understand our algorithm, we will also define two other types of rounds. We say that $t$ is a draw round if $t = 1$ or the previous round $t-1$ was an update round. If $t$ is not a draw round, we say that it is a stay round. Note that, if $d = 0$, both draw and stay rounds can be update rounds, while if $d \ge 1$, only stay rounds can be update rounds.

The CoLoWrapper algorithm proceeds in blocks of (random) length of at least $2d+1$ rounds, in each of which it constantly plays the same action (Figure 1). Blocks in Algorithm 1 are counted by the variable $n_t$. Each block $n_t$ consists of a draw round followed by ($2d$ or more) stay rounds, the last round of the block being also an update round. During a draw round $t$, CoLoWrapper uses its current distribution $p_t$ to draw and play an action $I_t$. During stay rounds, it keeps playing the action that was drawn during the latest draw round. After playing the action $I_t$ for the current round $t$, if $t$ is an update round, CoLoWrapper asks the base algorithm $A$ to make an update of its base distribution $q_{n_t} \to q_{n_t+1}$ as if $A$ played action $I_t$ and observed as the loss of $I_t$ the quantity $\frac{1}{2d+1} \sum_{s=t-d}^{t} \ell^{\circ}_s(I_{s-d}, \dots, I_s)$. Then, the block ends, and the distribution of CoLoWrapper at the beginning of the next block $n_{t+1} = n_t + 1$ is $p_{t+1} = q_{n_t+1}$.

Figure 1: Sequence of rounds the algorithm goes through when $d = 2$. The top line contains the values of the Boolean sequence $B = (B_t)_{t \in \mathbb{N}}$. The bottom line shows the corresponding types of rounds: each block begins with a (D)raw round, followed by a variable number of (S)tay rounds, the last of which is also an (U)pdate round. Since a round $t$ is an update round only if $B_t = 1$ and $B_s = 0$ for the $2d$ previous rounds $s$, the length of each block is at least $2d+1$.

Algorithm 1: CoLoWrapper (Composite Loss Wrapper)
input: base $K$-armed bandit algorithm $A$ and Boolean sequence $B$
initialize: let $n_0 = 0$ and $q_1$ be the initial distribution over $[K]$ of $A$
1: for round $t = 1, 2, \dots$ do
2:   if either $t = 1$ or $t-1$ was an update round (w.r.t. $B$) then
3:     let $n_t = n_{t-1} + 1$, $p_t = q_{n_t}$, and draw $I_t \sim p_t$   // draw
4:   else
5:     let $n_t = n_{t-1}$, $p_t = p_{t-1}$, and $I_t = I_{t-1}$   // stay
6:   play $I_t$ and observe loss $\ell^{\circ}_t(I_{t-d}, \dots, I_t)$
7:   if $t$ is an update round (w.r.t. $B$) then   // update
8:     feed $A$ with arm $I_t$ and loss $\frac{1}{2d+1} \sum_{s=t-d}^{t} \ell^{\circ}_s(I_{s-d}, \dots, I_s)$
9:     use the update rule $q_{n_t} \to q_{n_t+1}$ of $A$ to obtain a new base distribution $q_{n_t+1}$
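For concreteness, here is a minimal Python rendering of Algorithm 1 (a sketch of ours, not the authors' code). The base algorithm is assumed to expose two hypothetical methods, get_distribution() and update(arm, loss), and env.play(arm) is assumed to return the composite anonymous feedback of the current round (for instance, the CompositeFeedbackEnv sketch of Section 2).

```python
import numpy as np

def colo_wrapper(base, env, d, T, rng):
    """Sketch of Algorithm 1 (CoLoWrapper). `base` is any K-armed bandit
    algorithm exposing get_distribution() and update(arm, loss); this
    interface is an assumption made for illustration only."""
    beta = 1.0 / (2 * d + 1)
    b = rng.random(T + 1) < beta                 # Boolean sequence B (index 0 unused)
    observed, arm, prev_update = [], None, 0
    for t in range(1, T + 1):
        if t == 1 or prev_update == t - 1:       # draw round
            p = base.get_distribution()
            arm = int(rng.choice(len(p), p=p))
        loss = env.play(arm)                     # composite anonymous feedback
        observed.append(loss)
        # update round: t >= 2d+1, b_t = 1, previous 2d values all 0
        if t >= 2 * d + 1 and b[t] and not b[t - 2 * d:t].any():
            avg = sum(observed[t - d - 1:t]) / (2 * d + 1)   # in [0, 1] by Lemma 2
            base.update(arm, avg)                # feed A the block-averaged loss
            prev_update = t                      # the block ends here
    return observed
```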
Note that if $t$ is an update round, the quantity $\frac{1}{2d+1}\sum_{s=t-d}^{t} \ell^{\circ}_s(I_{s-d}, \dots, I_s)$ that is fed back to $A$ relates only to the current action $I_t$, because blocks contain at least $2d+1$ rounds and the same action is played in all of them. The following lemma shows that this quantity is indeed in $[0,1]$, so that it is a legitimate feedback to pass to the base algorithm $A$.

Lemma 2 For all $t \ge 2d+1$ and $i \in [K]$,
$$\sum_{\tau=t-d}^{t} \ell^{\circ}_\tau(i, \dots, i) \le 2d+1\,.$$

Proof For all $t \ge 2d+1$ and $i \in [K]$,
$$\sum_{\tau=t-d}^{t} \ell^{\circ}_\tau(i, \dots, i) = \sum_{\tau=t-d}^{t} \sum_{s=0}^{d} \ell^{(s)}_{\tau-s}(i) = \sum_{s=0}^{d} \sum_{\rho=t-d-s}^{t-s} \ell^{(s)}_\rho(i) \le \sum_{s=0}^{d} \sum_{\rho=t-2d}^{t} \ell^{(s)}_\rho(i) = \sum_{\rho=t-2d}^{t} \sum_{s=0}^{d} \ell^{(s)}_\rho(i) = \sum_{\rho=t-2d}^{t} \ell_\rho(i) \le t - (t-2d) + 1 = 2d+1\,.$$

As a final remark, we point out that, albeit the algorithm is parameterized with an entire sequence $B$, at each time $t$ it does not require the knowledge of the sequence at future times $t+1, t+2, \dots$. This implies in particular that these Boolean values could be produced and fed to CoLoWrapper in an online fashion.

4. Upper Bound

We begin by formalizing the notion of stability (of the base algorithm), in terms of which we express the performance of the CoLoWrapper algorithm.

Definition 3 (ξ-stability) Let $\xi > 0$, let $A$ be a $K$-armed bandit algorithm, and let $(q_n)_{n \in \mathbb{N}}$ be the (random) sequence of probability distributions over the actions $[K]$ produced by $A$ during a run over rounds $\{1, 2, \dots\}$. We say that $A$ is $\xi$-stable if, for any round $n$, we have
$$\mathbb{E}\left[\sum_{i \in [K]} \big(q_{n+1}(i) - q_n(i)\big)^+\right] \le \xi\,.$$

In the previous definition, note that since $\sum_{i \in [K]} q_{n+1}(i) = 1 = \sum_{i \in [K]} q_n(i)$, then
$$\|q_{n+1} - q_n\|_1 = \sum_{i \in [K]} \big(q_{n+1}(i) - q_n(i)\big)^+ + \sum_{i \in [K]} \big(q_n(i) - q_{n+1}(i)\big)^+ = 2 \sum_{i \in [K]} \big(q_{n+1}(i) - q_n(i)\big)^+\,.$$
Therefore, the $\xi$-stability of an algorithm is equivalent to controlling the expected $\ell_1$-distance between any two consecutive probability distributions produced by the algorithm. We stick to the positive-part definition as this is the quantity that naturally appears in the analysis. We can now state our main result of this section.

Theorem 4 If we run CoLoWrapper with a $\xi$-stable base $K$-armed bandit algorithm $A$ and an i.i.d. sequence $B = (B_t)_{t \in \mathbb{N}}$ of Bernoulli random variables with bias $\beta \in (0,1)$ (independent of the randomization of $A$), then, for any time horizon $T \ge 2d+1$, the regret $R_T$ satisfies
$$R_T \le 2d + \frac{2d+1}{d+1}\left(3d + 2d\,\beta(1-\beta)^{2d}\,\xi\,T + \frac{1}{\beta(1-\beta)^{2d}}\,\bar{R}_{\lfloor T/(2d+1)\rfloor}\right),$$
where $\bar{R}_{\lfloor T/(2d+1)\rfloor}$ is the worst-case regret after $\lfloor T/(2d+1)\rfloor$ rounds of $A$ (for an adversarial setting with $[0,1]$-valued losses).

Proof Fix an arbitrary horizon $T \ge 2d+1$ and an arm $i^\star \in [K]$. Let, for all $t \ge 2d+1$ (see footnote 1), $c_t := \mathbb{E}\big[\ell^{\circ}_t(I_{t-d}, \dots, I_t) - \ell^{\circ}_t(i^\star, \dots, i^\star)\big]$, $a := 2d+1$, and $b := T$. Applying the elementary identity in Lemma 11 (Appendix A) in step ( ) below, we obtain t=1 E ℓt(It) + t=2d+1 ct 2d + = 2d + 1 d + 1 t=a d (t a + d + 1) ct + (d + 1) t=b+1 (b + d + 1 t) ct t=a d (t a + d + 1) ct 1 d + 1 t=b+1 (b + d + 1 t) ct ( ) = 2d + 1 d + 1 t=τ d ct 1 d + 1 t=a d (t a + d + 1)ct 1 d + 1 t=b+1 (b + d + 1 t)ct 2d + 1 d + 1 t=τ d ct + 1 d + 1 t=a d (t a + d + 1)ℓ t (i , . . . , i ) t=b+1 (b + d + 1 t)ℓ t (i , . . .
, i ) =: ( ) Now, applying Lemma 2 in steps ( ) below, we get ( ) ( ) 2d + 22d + 1 d + 1 d + 1 d + 1 = 2d + 22d + 1 d + 1 d + 1 d + 1 t=τ d E ℓ t (It d, . . . , It) ℓ t (Iτ 2d, . . . , Iτ 2d) ℓ t (Iτ 2d, . . . , Iτ 2d) ℓ t (i , . . . , i ) # ( ) 2d + 32d + 1 d + 1 d + 1 d + 1 t=τ d E ℓ t (It d, . . . , It) ℓ t (Iτ 2d, . . . , Iτ 2d) ℓ t (Iτ 2d, . . . , Iτ 2d) ℓ t (i , . . . , i ) # =: 2d + 32d + 1 d + 1 d + (I) + 2d + 1 d + 1 (II) . 1. Here, we refer to the infinite sequence of losses (ℓ(s) t )s {0,...,d},t N (and the respective composite losses (ℓ t )t N). If the problem is formalized only with a finite sequence (ℓ(s) t )s {0,...,d},t [T ], it is sufficient to define an arbitrary sequence (ℓ(s) t )s {0,...,1},t>T with ℓ(s) t 0 and Pd s=0 ℓ(s) t 1 (and the corresponding composite losses (ℓ t )t>T ) and proceed as we do. This trick is needed to invoke Lemma 2, for which it is handy to sum d rounds into the future. Nonstochastic Bandits with Composite Anonymous Feedback We upper bound the two terms (I) and (II) separately. First, let U be the (random) set of update rounds U := τ [T] : τ is an update round (w.r.t. B) . For the first term (I), we have (I) = 1 d + 1 s=0 ℓ(s) t s(i) pt s(i) pτ 2d(i) # i [K] E max t {τ d,...,τ},s {0,...,d} pt s(i) pτ 2d(i) τ X t=τ d ℓ t (i, . . . , i) i [K] E max t {τ d,...,τ},s {0,...,d} pt s(i) pτ 2d(i) ( ) = 2d + 1 σ=τ 2d {σ U} ) pτ(i) pτ 2d(i) + # σ=τ 2d E h I{σ U} pτ(i) pτ 2d(i) +i =: ( ) where ( ) follows by Lemma 2 together with the fact that the max in the previous line is always nonnegative (to see this, simply observe that picking t = τ d and s = d within the max makes pt s(i) = pτ 2d(i)), and ( ) follows by the facts that the max in the previous line is always greater than or equal to zero, it can be strictly positive only if there is an update in a round σ with τ 2d σ τ 1, and there can be at most a single update in 2d + 1 consecutive time steps. Now, using the facts that pτ = qnτ and that nτ = nτ 2d + 1 on the event {σ U}, we get ( ) = 2d + 1 σ=τ 2d E I{σ U} qnτ 2d+1(i) qnτ 2d(i) + n N E h I{σ U}I{nτ 2d = n} qn+1(i) qn(i) +i ( ) = 2d + 1 n N E I{σ U}I{nτ 2d = n} E qn+1(i) qn(i) + n N E I{σ U}I{nτ 2d = n} σ=τ 2d P[σ U] 2d + 1 d + 1 2dβ(1 β)2dξT where ( ) follows by the independence of the Bernoulli sequence B and the randomization of base algorithm A; ( ) by the ξ-stability of A; and the last inequality by the fact that the probability that a time step σ is an update round is 0 if σ 2d and β(1 β)2d otherwise. Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour For the second term (II), set for brevity rτ : [K] R , i 7 1 2d + 1 ℓ t (i, . . . , i) ℓ t (i , . . . , i ) for all τ 2d + 1. We first note that τ U rτ(Iτ 2d) τ=2d+1 rτ(Iτ 2d)Bτ s=1 (1 Bτ s) = β(1 β)2d T X τ=2d+1 E rτ(Iτ 2d) where the first identity is a consequence of the fact that if τ U, then the action Iτ played by Algorithm 1 at round τ coincides with the actions played in the previous 2d rounds, i.e., Iτ = = Iτ 2d, while the last equality follows by the independence of Iτ 2d and the vector of Bernoulli random variables (Bτ, . . . , Bτ 2d). Thus, τ=2d+1 rτ(Iτ 2d) = 1 β(1 β)2d E = 1 β(1 β)2d E t=τ d ℓ t (Iτ, . . . , Iτ) X t=τ d ℓ t (i , . . . , i ) Now define for any τ 2d + 1, ℓτ : [K] [0, 1] , i 7 1 2d + 1 t=τ d ℓ t (i, . . . , i) . (Note that ℓτ(i) [0, 1] for all i [K] by Lemma 2.) Leveraging again the fact that τ U implies Iτ = = Iτ 2d, yields, for each τ U, t=τ d ℓ t (It d, . . . , It) = 1 2d + 1 t=τ d ℓ t (Iτ, . . . , Iτ) = ℓτ(Iτ) . 
This shows that the loss Algorithm 1 feeds the base algorithm $A$ (in line 8) at each update round $\tau$ is a bandit feedback (for $A$) for the arm $I_\tau$ with respect to the $[0,1]$-valued loss $\bar\ell_\tau$ defined above. Therefore, by Equation (2) and the regret guarantees of the base algorithm, we obtain
$$(\mathrm{II}) = \frac{1}{\beta(1-\beta)^{2d}}\,\mathbb{E}\left[\sum_{\tau \in U} \big(\bar\ell_\tau(I_\tau) - \bar\ell_\tau(i^\star)\big)\right] \le \frac{1}{\beta(1-\beta)^{2d}}\,\bar{R}_{\lfloor T/(2d+1)\rfloor}\,,$$
where in the last inequality we also used the monotonicity of the worst-case regret $\tau \mapsto \bar{R}_\tau$ and the fact that $|U| \le \big\lceil \frac{T-(2d+1)+1}{2d+1} \big\rceil \le \big\lfloor \frac{T}{2d+1} \big\rfloor$. In conclusion, combining the above bounds on (I) and (II) with the initial decomposition of $R_T$ yields
$$R_T \le 2d + \frac{2d+1}{d+1}\left(3d + 2d\,\beta(1-\beta)^{2d}\,\xi\,T + \frac{1}{\beta(1-\beta)^{2d}}\,\bar{R}_{\lfloor T/(2d+1)\rfloor}\right),$$
hence concluding the proof.

We can derive corollaries for various algorithms using Theorem 4. Consider for instance as a base algorithm $A$ the well-known Exp3 algorithm of Auer et al. (2002). It can be easily shown that Exp3 is $\eta$-stable, where $\eta$ is its learning rate (see Lemma 14 in Appendix C). Combining this fact with its worst-case regret bound $\bar{R}_N \le \sqrt{2KN\ln K}$, we obtain that the horizon-$T$ regret of the CoLoWrapper algorithm run with $A = \mathrm{Exp3}(\eta)$ (with a suitably tuned learning rate $\eta$) and an i.i.d. sequence $B$ of Bernoulli random variables with bias $\beta = \frac{1}{2d+1}$ (independent of the randomization of $A$) satisfies $R_T = O\big(\sqrt{(d+1)KT\log K}\big)$. Using Follow the Regularized Leader (FTRL) with $\frac12$-Tsallis entropy (Algorithm 2), we can remove the $\log K$ term in the above bound (see footnote 2).

Footnote 2: The argmin in which Algorithm 2 picks each distribution $q_n$ is always a singleton if $\eta > 0$; this is a consequence of Lemma 12 in Appendix B. For the sake of convenience, we also allow the learning rate $\eta = 0$, corresponding to a (non-regularized) Follow The Leader algorithm. In this case, multiple minimizers could exist but, for the sake of our results, ties can be broken in any (measurable) way.

Algorithm 2: Follow The Regularized Leader (FTRL) with (1/2)-Tsallis entropy
input: learning rate $\eta \ge 0$
initialize: $\hat{L}_0 = 0$
1: for round $n = 1, 2, \dots$ do
2:   play action $J_n$ drawn according to $q_n \in \operatorname{argmin}_{q \in \Delta_K} \Big(\sum_{i \in [K]} \hat{L}_{n-1}(i)\, q(i) - 2\eta \sum_{i \in [K]} \sqrt{q(i)}\Big)$, where $\Delta_K$ is the probability simplex in $\mathbb{R}^K$
3:   observe loss $\ell_n(J_n)$ and update $\hat{L}_n = \hat{L}_{n-1} + \hat\ell_n$, where $\hat\ell_n(i) = \frac{\ell_n(i)}{q_n(i)}\,\mathbb{I}\{J_n = i\}$ for all $i \in [K]$

However, we still need to prove the stability of Algorithm 2. This is established by the following result, whose (non-trivial) proof is given in Appendix B.

Theorem 5 Algorithm 2 run with any learning rate $\eta > 0$ is $\xi$-stable, with $\xi \le \frac{2(1 + \ln K)}{\eta}$.

The worst-case regret guarantees of FTRL with $\frac12$-Tsallis entropy are as follows (see footnote 3).

Theorem 6 (Abernethy et al. 2015) For each $N \in \mathbb{N}$, the worst-case regret $\bar{R}_N$ after $N$ rounds of Algorithm 2 run with learning rate $\eta = \sqrt{N/2}$ (in an adversarial setting with $[0,1]$-valued losses) satisfies $\bar{R}_N \le 2\sqrt{2KN}$.

Footnote 3: The analysis of Abernethy et al. (2015) is presented for a more general class of algorithms. A straightforward application of their Corollary 3.2 shows the validity of Theorem 6 for Algorithm 2.

Combining Theorems 4 to 6, we obtain the following regret bound for composite losses.

Corollary 7 For any time horizon $T \in \mathbb{N}$, if we run CoLoWrapper using, as $A$, Algorithm 2 with learning rate $\eta = \sqrt{\frac12 \lfloor T/(2d+1) \rfloor}$ and, as $B$, an i.i.d. sequence of Bernoulli random variables with bias $\beta = \frac{1}{2d+1}$ (independent of the randomization of $A$), then its regret satisfies $R_T \le c\,\sqrt{(d+1)KT}$, where $c = 2\sqrt{2}$ if $d = 0$, and $c = 28$ otherwise.

Proof If $d = 0$, then $\beta = 1$ and Algorithm 1 reduces to the base algorithm. The result in this case is therefore implied immediately by Theorem 6. For the second part, we can assume without loss of generality that $T \ge 2d+1$, so that $\eta > 0$. Then, plugging $\xi \le \frac{2(1+\ln K)}{\eta}$ (Theorem 5) and $\bar{R}_{\lfloor T/(2d+1)\rfloor} \le 2\sqrt{2K\lfloor T/(2d+1)\rfloor}$ (Theorem 6) into Theorem 4 gives the result.
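As an implementation note (a sketch of ours, not the authors' code): by Lemma 12 in Appendix B, the distribution played by Algorithm 2 has the closed form $q_n(i) = \eta^2/\big(\hat{L}_{n-1}(i) - \lambda\big)^2$, where $\lambda < \min_i \hat{L}_{n-1}(i)$ is the unique value making the entries sum to one, so it can be computed by a simple bisection. The helper names below are our own choices, and the learning rate follows the tuning of Theorem 6.

```python
import numpy as np

def tsallis_distribution(L_hat, eta, iters=100):
    """Solve q(i) = eta^2 / (L_hat(i) - lam)^2 with sum_i q(i) = 1 (Lemma 12)
    by bisection over lam < min_i L_hat(i)."""
    lo = L_hat.min() - eta * np.sqrt(len(L_hat))   # here the sum is <= 1
    hi = L_hat.min() - eta                         # here the sum is >= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if (eta ** 2 / (L_hat - lam) ** 2).sum() > 1.0:
            hi = lam
        else:
            lo = lam
    q = eta ** 2 / (L_hat - lam) ** 2
    return q / q.sum()                             # guard against numerical drift

def ftrl_tsallis(K, N, loss_fn, rng):
    """Sketch of Algorithm 2: FTRL with 1/2-Tsallis entropy and
    importance-weighted loss estimates; loss_fn(n, arm) in [0, 1] is assumed."""
    eta = np.sqrt(N / 2)                           # tuning as in Theorem 6
    L_hat = np.zeros(K)
    for n in range(1, N + 1):
        q = tsallis_distribution(L_hat, eta)
        J = int(rng.choice(K, p=q))
        L_hat[J] += loss_fn(n, J) / q[J]           # bandit estimate of the loss vector
    return L_hat
```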
5. Lower Bound

In this section we derive a lower bound for bandits with composite anonymous feedback. We do so through a reduction from the setting of linear bandits (in the probability simplex) to our setting. This reduction allows us to upper bound the regret of a linear bandit algorithm in terms of (a suitably scaled version of) the regret of an algorithm in our setting. Since the reduction applies to any instance of a linear bandit problem, we can use a known lower bound for the linear bandit setting to derive a corresponding lower bound for our composite setting.

Let $\Delta_K$ be the probability simplex in $\mathbb{R}^K$. At each round $t$, an algorithm $A$ for linear bandit optimization chooses an action $q_t \in \Delta_K$ and suffers loss $\ell_t^{\top} q_t$, where $\ell_t \in [0,1]^K$ is some unknown loss vector. The feedback observed by the algorithm at the end of round $t$ is the scalar $\ell_t^{\top} q_t$. The regret suffered by algorithm $A$ playing actions $q_1, \dots, q_T$ is
$$\sum_{t=1}^{T} \ell_t^{\top} q_t - \min_{q \in \Delta_K} \sum_{t=1}^{T} \ell_t^{\top} q = \sum_{t=1}^{T} \ell_t^{\top} q_t - \min_{i = 1, \dots, K} \sum_{t=1}^{T} \ell_t^{\top} e_i\,, \tag{3}$$
where $e_1, \dots, e_K$ are the elements of the canonical basis of $\mathbb{R}^K$ and we used the fact that a linear function on the simplex is minimized at one of the corners. Let $R^{\mathrm{lin}}_T(A, \Delta_K)$ denote the worst-case regret (over the oblivious choice of $\ell_1, \dots, \ell_T$) of algorithm $A$. Similarly, let $R_T(A_d, K, d)$ be the worst-case regret (over the oblivious choice of the loss components $\ell^{(s)}_t(i)$ for all $t$, $s$, and $i$) of an algorithm $A_d$ for nonstochastic $K$-armed bandits with $d$-delayed composite anonymous feedback. For the sake of clarity, we assume below that the time horizon $T$ is a multiple of $d+1$. If this is not the case, we can straightforwardly stop at the highest multiple of $d+1$ (smaller than $T$), up to paying an additive $O(d+1)$ regret. Our reduction shows the following.

Lemma 8 For any algorithm $A_d$ for $K$-armed bandits with $d$-delayed composite anonymous feedback, there exists an algorithm $A$ for linear bandits in $\Delta_K$ such that $R_T(A_d, K, d) \ge (d+1)\, R^{\mathrm{lin}}_{T/(d+1)}(A, \Delta_K)$.

Our reduction, described in detail in the proof of the above lemma (see Appendix D), essentially builds the probability vectors $q_t$ played by $A$ based on the empirical distribution of the actions played by $A_d$ during blocks of size $d+1$. Now, an additional lemma is needed (whose proof is also given in Appendix D).

Lemma 9 The regret of any algorithm $A$ for linear bandits on $\Delta_K$ satisfies $R^{\mathrm{lin}}_T(A, \Delta_K) = \widetilde{\Omega}\big(\sqrt{KT}\big)$.

In the previous lemma, as well as in the following result, the $\widetilde{\Omega}$ notation is only hiding a $\log T$ factor in the denominator. Using the two lemmas above we can finally prove the lower bound.

Theorem 10 For any algorithm $A_d$ for $K$-armed bandits with $d$-delayed composite anonymous feedback, $R_T(A_d, K, d) = \widetilde{\Omega}\big(\sqrt{(d+1)KT}\big)$.

Proof Fix an algorithm $A_d$. Using the reduction of Lemma 8 gives an algorithm $A$ such that
$$R_T(A_d, K, d) \ge (d+1)\, R^{\mathrm{lin}}_{T/(d+1)}(A, \Delta_K) = \widetilde{\Omega}\big(\sqrt{(d+1)KT}\big)\,,$$
where we used Lemma 9 with horizon $T/(d+1)$ to prove the $\widetilde{\Omega}$-equality.

Although the loss sequence used to prove the lower bound for linear bandits in the simplex is stochastic i.i.d., the loss sequence achieving the lower bound in our delayed setting is not independent, due to the deterministic loss transformation in the proof of Lemma 8 (which is defined independently of the algorithm, thus preserving the oblivious nature of the adversary).
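The reduction behind Lemma 8 (spelled out in Appendix D) can also be checked numerically. The sketch below (ours, for illustration) builds the composite loss components from a linear-bandit instance and verifies that, at every round $t$ that is a multiple of $d+1$, the observed composite loss equals $(d+1)\,\ell_{t/(d+1)}^{\top} q_t$, where $q_t$ is the empirical distribution of the last $d+1$ actions; at all other rounds the construction makes the composite loss zero.

```python
import numpy as np

def check_reduction(linear_losses, actions, d, K):
    """Numerically verify the loss construction used in the proof of Lemma 8:
    l_t^{(s)}(i) = <l_{ceil(t/(d+1))}, e_i> if (t + s) % (d + 1) == 0, else 0."""
    T = len(actions)                                    # horizon, a multiple of d + 1
    comp = np.zeros(T + 1)                              # comp[t] = composite loss at round t
    for t in range(1, T + 1):                           # round at which feedback is observed
        for s in range(d + 1):
            pull = t - s                                # this component comes from the pull at time t - s
            if pull >= 1 and (pull + s) % (d + 1) == 0:  # nonzero only when t is a multiple of d + 1
                m = -(-pull // (d + 1))                 # ceil(pull / (d + 1))
                comp[t] += linear_losses[m - 1][actions[pull - 1]]
    for t in range(d + 1, T + 1, d + 1):
        q = np.bincount(actions[t - d - 1:t], minlength=K) / (d + 1)
        assert np.isclose(comp[t], (d + 1) * linear_losses[t // (d + 1) - 1] @ q)

rng = np.random.default_rng(2)
d, K, T = 3, 4, 40
check_reduction(rng.random((T // (d + 1), K)), rng.integers(K, size=T), d, K)
print("reduction check passed")
```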
6. Conclusions

We have investigated the setting of $d$-delayed composite anonymous feedback as applied to nonstochastic bandits. Composite anonymous feedback lends itself to formalizing scenarios where the actions performed by the online decision-maker produce long-lasting effects that combine additively over time. A general reduction technique was introduced that enables the conversion of a stable algorithm working in a standard bandit framework into one working in the composite feedback framework. Applying our reduction to the FTRL algorithm with Tsallis entropy, we obtain an upper bound on the regret of order $\sqrt{(d+1)KT}$. We then showed the optimality of this rate (up to a logarithmic factor) by relying on a lower bound for bandit linear optimization in the probability simplex.

Acknowledgments

NCB and RC were partially supported by the MIUR PRIN grant "Algorithms, Games, and Digital Markets" (ALGADIMAR) and the EU Horizon 2020 ICT-48 research and innovation action under grant agreement 951847, project ELISE (European Learning and Intelligent Systems Excellence). The work of TC has benefited from the AI Interdisciplinary Institute ANITI, which is funded by the French "Investing for the Future PIA3" program under the Grant agreement ANR-19-P3IA-0004. TC also acknowledges the support of the project BOLD from the French national research agency (ANR), and that of IBM. YM was supported in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), the Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University. Finally, we would like to thank Mengxiao Zhang and Chen-Yu Wei for finding a mistake in the proof of the main result in the preliminary version of this work (Cesa-Bianchi et al., 2018). See Appendix E for more details on the corrections we made.

Figure 2: On the right-hand side of the equation in Lemma 11, we are summing each row of the squares in the picture. On the left-hand side, we are summing the columns. Each column $s$ contains the same constant $c_s$ in each component.

Appendix A. An Accountant's Lemma

The next elementary lemma can be proved straightforwardly by swapping the order of the sums (see Figure 2).

Lemma 11 If $(c_t)_{t \in \mathbb{Z}} \subset \mathbb{R}$ and $a, b \in \mathbb{Z}$ are such that $a \le b$ and $d \ge 0$, then
$$\sum_{t=a-d}^{a-1} (t - a + d + 1)\, c_t + (d+1) \sum_{t=a}^{b} c_t + \sum_{t=b+1}^{b+d} (b + d + 1 - t)\, c_t = \sum_{\tau=a}^{b+d} \;\sum_{t=\tau-d}^{\tau} c_t\,.$$

Appendix B. Stability of FTRL with Tsallis entropy

In this section, we prove a key stability property of FTRL with Tsallis entropy that could be of independent interest. A general technique (Shalev-Shwartz, 2012, Lemma 2.10) to do so for the FTRL family of algorithms is to show that the regularizers are $\mu$-strongly convex with respect to the desired norm (in our case, the $\ell_1$-norm). To the best of our knowledge, the existing results in this direction (e.g., Reem et al. 2019, Section 7.3) lead to a (probably loose) upper bound on $1/\mu$ of order $K$, which in turn yields a suboptimal dependence on $K$ in the stability of FTRL with Tsallis entropy. To obtain the correct dependence on $K$ in the regret in Theorem 4, a stability of lower order in $K$ is required instead. If one wanted to follow this path, it would therefore be necessary to prove a tighter upper bound on $1/\mu$, which seems non-trivial.
Instead, we take a different route skipping this middle step and controlling the stability of FTRL with Tsallis entropy directly. Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour We begin by showing that, when η > 0, there exists a unique solution of the optimization problem that defines qn in Algorithm 2 and provide a formula for it in terms of the corresponding Lagrange multiplier λ. An analogous result was stated in (Zimmert and Seldin, 2021, Section 3.3) for the related algorithm Tsallis-INF. Lemma 12 Let w RK and η > 0. Then !λw,η < min i [K] w(i) , X η2 w(i) λw,η 2 = 1 . Furthermore, defining for all i [K], qopt w,η(i) := η2 w(i) λw,η 2 , we have that qopt w,η is the unique global minimizer of the function: K R , q 7 X i [K] w(i)q(i) 2η X Proof Define the two auxiliary functions fw,η : [0, )K R , q 7 X i [K] w(i)q(i) 2η X ϕ: [0, )K R , q 7 X Note that fw,η is strictly convex and continuous on [0, )K and differentiable on (0, )K. Thus, by the Lagrange multiplier theorem, a point q (0, )K is the unique global minimizer of fw,η on the simplex K if and only if ϕ(q) = 1 and λ R, fw,η(q) = λ ϕ(q) (4) A direct verification shows that a pair (q, λ) (0, )K R satisfies condition (4) if and only if λ < min i [K] w(i) , X η2 w(i) λ 2 = 1 , and i [K] , q(i) = η2 w(i) λ 2 (5) Note that, letting m := mini [K] w(i), the function g: ( , m) R , λ 7 X η2 w(i) λ 2 is continuous, strictly increasing, and it satisfies lim λ g(λ) = 0 and lim λ m g(λ) = + Nonstochastic Bandits with Composite Anonymous Feedback hence, there exists a unique λw,η ( , m) such that g(λw,η) = 1. Therefore, qopt w,η is the unique global minimizer of fw,η on the simplex K. This lemma controls the variation of the Lagrange multipliers corresponding to two points that vary by quantity δ only in a single coordinate. Lemma 13 Let w RK such that w(1) w(K) and η > 0. Then, for all δ > 0 and j [K], λw+δej,η λw,η 1 where λw+δej,η and λw,η are defined as in Lemma 12 and e1, . . . , e K is the canonical basis of RK. Proof Define X := u(1), . . . , u(K), λ RK R | λ < mini [K] u(i) and ψ: X R , u(1), . . . , u(K), λ 7 X η2 u(i) λ 2 Λ: RK R , u 7 λu,η where λu,η is defined as in Lemma 12. By the implicit function theorem, we have that Λ is infinitely differentiable and for all u RK, DjΛ(u) = Djψ u(1), . . . , u(K), λu,η DK+1ψ u(1), . . . , u(K), λu,η (u(j) λu,η)3 2 P i [K] η2 (u(i) λu,η)3 = 1 1 + P i [K],i =j u(j) λu,η where we denoted the partial derivative with respect to the j-th coordinate by Dj. Now, by the fundamental theorem of calculus, λw+δej,η λw,η = Λ(w + δej) Λ(w) = δ Z 1 0 DjΛ(w + sδej) ds 1 + P i [K],i j 1 w(j)+sδ λw+sδej,η w(i) λw+sδej,η 3 + P i [K],i j+1 w(j)+sδ λw+sδej,η w(i) λw+sδej,η 1 + P i [K],i j 1 w(i) λw+sδej,η w(i) λw+sδej,η We can finally prove the stability of FTRL with 1 2-Tsallis entropy. Proof of Theorem 5 Consider an arbitrary sequence of losses (ℓm)m N [0, 1]. Fix any η > 0. For each w RK, let λw := λw,η, where λw,η is defined as in Lemma 12. Let F0 be the trivial σ-algebra (containing only the sample space and the empty set) and for all n N, Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour Fn := σ(J1, . . . Jn). Note that, for each n N, we have that b Ln 1 is Fn 1-measurable and, as a consequence of Lemma 12, for all i [K], qn(i) = η2 b Ln 1(i) λb Ln 1 2 . Let e1, . . . , e K be the canonical basis of RK and fix any n N. Define for each j [K], bℓn,j := ℓn(j) qn(j)ej and note that bℓn = bℓn,Jn. To make the notation more compact, we also let En[ ] := E[ | Fn 1]. 
Then qn+1(i) qn(i) + η2 b Ln(i) λb Ln 2 η2 b Ln 1(i) λb Ln 1 2 η2 b Ln 1(i) + bℓn,Jn(i) λb Ln 1+bℓn,Jn 2 η2 b Ln 1(i) λb Ln 1 2 I{Jn = j} X η2 b Ln 1(i) + bℓn,j(i) λb Ln 1+bℓn,j 2 η2 b Ln 1(i) λb Ln 1 2 j [K] En I{Jn = j} X η2 b Ln 1(i) + bℓn,j(i) λb Ln 1+bℓn,j 2 η2 b Ln 1(i) λb Ln 1 2 j [K] qn(j) X η2 b Ln 1(i) + bℓn,j(i) λb Ln 1+bℓn,j 2 η2 b Ln 1(i) λb Ln 1 2 Now, note that, for each j [K]: λb Ln 1+bℓn,j λb Ln 1 (because, as we show in the proof of Lemma 13, the directional derivatives of u 7 λu are positive). For each i [K]\{j}, η2 b Ln 1(i) λb Ln 1 η2 b Ln 1(i) λb Ln 1+bℓn,j = η2 b Ln 1(i)+bℓn,j(i) λb Ln 1+bℓn,j (by the previous point). η2 b Ln 1(j) λb Ln 1 η2 b Ln 1(j)+bℓn,j(j) λb Ln 1+bℓn,j (by the previous point and the fact that P i [K] η2 b Ln 1(i) λb Ln 1 = 1 = P i [K] η2 b Ln 1(i)+bℓn,j(i) λb Ln 1+bℓn,j ). It follows that j [K] qn(j) X η2 b Ln 1(i) λb Ln 1+bℓn,j 2 η2 b Ln 1(i) λb Ln 1 2 j [K] qn(j) X b Ln 1(i) λb Ln 1 2 b Ln 1(i) λb Ln 1+bℓn,j 2 b Ln 1(i) λb Ln 1+bℓn,j 2 b Ln 1(i) λb Ln 1 2 Nonstochastic Bandits with Composite Anonymous Feedback j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) X b Ln 1(i) λb Ln 1 + b Ln 1(i) λb Ln 1+bℓn,j b Ln 1(i) λb Ln 1+bℓn,j 2 b Ln 1(i) λb Ln 1 2 j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) X 1 b Ln 1(i) λb Ln 1+bℓn,j 3 j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) 1 b Ln 1(i) λb Ln 1+bℓn,j 2 j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) 1 b Ln 1(i) + bℓn,j(i) λb Ln 1+bℓn,j 2 j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) 1 b Ln 1(i) + bℓn,j(i) λb Ln 1+bℓn,j 2 j [K] qn(j)(λb Ln 1+bℓn,j λb Ln 1) = ( ) . Now, let σ be a random permutation of [K] such that b Ln 1 σ(1) b Ln 1 σ(K) . Then, using Lemma 13, we have j [K] qn σ(j) (λb Ln 1+bℓn,σ(j) λb Ln 1) j [K] qn σ(j) λb Ln 1+ ℓn(σ(j)) qn(σ(j))eσ(j) λb Ln 1 j [K] qn σ(j) 1 j ℓn(σ(j)) qn(σ(j)) 2 1 j 21 + ln K In conclusion: qn+1(i) qn(i) + | Fn 1 It follows that: qn+1(i) qn(i) + qn+1(i) qn(i) + | Fn 1 Being n and (ℓm)m N arbitrary, the result follows. Appendix C. Stability of Exp3 In this section, we prove the stability of Exp3. Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour Lemma 14 Exp3 with learning rate η is ξ-stable with ξ = η. Proof Consider an arbitrary sequence of losses (ℓn)n N [0, 1]. In this case, stability holds pointwise (for all realizations of the actions J1, J2, . . . played by Exp3 on the sequence of losses (ℓn)n N) rather that in expectation. From (Cesa-Bianchi et al., 2016, Lemma 1) we have, for any round n N and all arms i [K], qn+1(i) qn(i) η qn+1(i) j=1 qn(j)bℓn(j) where bℓn(j) = ℓn(j)I{Jn=j} qn(j) , qn(j) = wn(j)/ PK k=1 wn(k), and if n = 1, wn(k) = 1 while, if n 2, wn(k) is defined inductively by wn(k) = qn 1(k)e ηbℓn 1(k). Hence we can write, for any n N, i : qn+1(i)>qn(i) qn+1(i) qn(i) X i : qn+1(i)>qn(i) η qn+1(i) j=1 qn(j)bℓn(j) i : qn+1(i)>qn(i) η qn+1(i)ℓn(Jn) η X i : qn+1(i)>qn(i) qn+1(i) η . Being (ℓn)n N arbitrary, the result follows. Appendix D. Lower Bound (missing proofs) We begin this section by showing a reduction mapping each algorithm for bandits with composite anonymous feedback to one for linear bandits with a better or equal regret. Proof of Lemma 8 Fix an instance ℓ1, . . . , ℓT/(d+1) of a linear bandit problem and use it to construct an instance of the d-delayed bandit setting with loss components ℓ(s) t (i) = ( ℓ t/(d+1) ei if t + s = 0 (mod d + 1), 0 otherwise, where e1, . . . , e K are the elements of the canonical basis of RK. These components define the following composite loss incurred by any algorithm Ad playing actions I1, I2, . . . ℓ t (It d, . . . 
, It) = s=0 ℓ(s) t s(It s) = ( (d + 1) ℓ t/(d+1) qt if t = 0 (mod d + 1), 0 otherwise, where qt is defined from It d, . . . , It [K] as follows qt(j) = 1 d + 1 s=t d I{Is = j} j [K]. (6) Nonstochastic Bandits with Composite Anonymous Feedback Note that qt(i) is the fraction of times action i was played by Ad in the last d + 1 rounds. Given the algorithm Ad, we define the algorithm A for playing linear bandits on the loss sequence ℓ1, . . . , ℓT/(d+1) as follows. If t = 0 (mod d + 1), then A skips the round. On the other hand, when t = 0 (mod d + 1), A performs action qt defined in (6), observes the loss ℓ t/(d+1) qt, and returns to Ad the composite loss ℓ t (It d, . . . , It). Essentially, Ad observes a nonzero composite loss only every d time steps, when t = 0 (mod d + 1). When this happens, the composite loss of Ad is d ℓ t/(d+1) qt, which is d + 1 times the loss of A. Now it is enough to note that, using (3), min k=1,...,K t=1 ℓ t (k, . . . , k) = min k=1,...,K(d + 1) s=1 ℓ s ek = min q K(d + 1) s=1 ℓ s q . This concludes the proof. We remark that our lower bound construction relies crucially on the power of the adversary to plan the assignment of delays: losses of order d are revealed only on T/d time steps, leading to a multiplicative dependence on d. This stacking effect is not possible in settings like the ones studied in Pike-Burke et al. (2018), where delays are drawn i.i.d. over rounds and are, therefore, independently spread across time steps. We now prove a lower bound for linear bandits. Proof of Lemma 9 The statement is essentially proven in (Shamir, 2015, Theorem 5), where the author shows a Ω p K/T lower bound on the error of bandit linear optimization in the probability simplex.4 As explained in (Shamir, 2015, Section 1.1), (cumulative) regret lower bounds for linear bandits can be obtained by multiplying the lower bounds on bandit linear optimization error by T. A possible issue is that the proof in (Shamir, 2015, Theorem 5) uses unbounded Gaussian losses. However, in (Shamir, 2015, Appendix B) it is shown how lower bounds for Gaussian losses can be converted into lower bounds for losses in [ 1, 1] at the cost of a 1 ln T factor in the regret. Finally, note that our setting requires losses in [0, 1], but this is not an issue either because we are in a linear setting, and thus we can add the (1, . . . , 1) constant vector to all loss vectors without affecting the regret. Appendix E. Issues in the Preliminary Version of the Paper This paper is an extended and improved version of a preliminary paper appeared as (Cesa Bianchi et al., 2018). There are two main issues in the preliminary version. Firstly, in Cesa-Bianchi et al. (2018, last line of Eq. (11)), we find the inequality i:pt s(i)>pt d+1(i) (pt s(i) pt d+1(i)) but, whenever an update occurred at round t d + 2, the same term is summed Θ(d) times (rather than 1), leading to a bound of order Θ(dξ) rather than the claimed Θ(ξ), and a consequently looser upper bound for the performance of the wrapper. 4. It is worth stressing that the lower bound in (Shamir, 2015) is based on stochastic i.i.d. generation of losses, hence it does not violate our assumption about the obliviousness of the adversary. Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour Secondly, in Cesa-Bianchi et al. (2018, first line of Eq. (10)), we find the inequality t U,t 2d 2 k t q 1 q(2d 1) T X t=2d 2 E[ k t ] which would follow from the provided discussion on P V2d 1 s=1 (t s / U) if k t were nonnegative, but this is not necessarily the case. 
The present version proposes a different wrapper and patches both things up, as described below. When analyzing the term corresponding to the first issue in (Cesa-Bianchi et al., 2018), we take a different route: we begin with a change of variables and never upper bound the losses $\ell^{(s)}_{t-s}$ by $1$, relying instead on Lemma 2 to obtain the correct dependence on $d$. This can be seen in the upper bound of (I) in the proof of Theorem 4. For the second issue, the problem with the original wrapper in (Cesa-Bianchi et al., 2018) is to disentangle $k_t$ from the random variable $\mathbb{I}\{t \text{ is an update round}\}$. There is, however, an easier way to get around this roadblock, taking a slightly different route and relying on a different definition of draw, stay, and update rounds (à la Dekel et al. (2014b); see Definition 1 and the subsequent discussion). This greatly simplifies the analysis, as can be seen in the upper bound of (II) in the proof of Theorem 4.

References

J. D. Abernethy, C. Lee, and A. Tewari. Fighting bandits with a new kind of smoothness. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

P. Agrawal and T. Tulabandula. Learning by repetition: Stochastic multi-armed bandits under priming effect. In Conference on Uncertainty in Artificial Intelligence, pages 470-479. PMLR, 2020.

O. Anava, E. Hazan, and S. Mannor. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pages 784-792, 2015.

R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proc. 29th ICML, 2012.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605-622, 2016.

N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Nonstochastic bandits with composite anonymous feedback. In Conference On Learning Theory, pages 750-773. PMLR, 2018.

O. Dekel, J. Ding, T. Koren, and Y. Peres. Online learning with composite loss functions. In Conference on Learning Theory, pages 1214-1231, 2014a.

O. Dekel, E. Hazan, and T. Koren. The blinded bandit: Learning with adaptive feedback. Advances in Neural Information Processing Systems, 27:1610-1618, 2014b.

O. Dekel, E. Hazan, and T. Koren. The blinded bandit: Learning with adaptive feedback. In Advances in Neural Information Processing Systems, pages 1610-1618, 2014c.

M. A. Gael, C. Vernade, A. Carpentier, and M. Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pages 3348-3356. PMLR, 2020.

S. Garrabrant, N. Soares, and J. Taylor. Asymptotic convergence in online learning with unbounded delays. arXiv preprint arXiv:1604.05280, 2016.

S. Ito, D. Hatano, H. Sumita, K. Takemura, T. Fukunaga, N. Kakimura, and K.-I. Kawarabayashi. Delay and cooperation in nonstochastic linear bandits. Advances in Neural Information Processing Systems, 33:4872-4883, 2020.

P. Joulani, A. György, and C. Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453-1461, 2013.

P. Joulani, A. György, and C. Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms.
In AAAI, volume 16, pages 1744 1750, 2016. D. Khashabi, K. Quanrud, and A. Taghvaei. Adversarial delays in online strongly-convex optimization. ar Xiv preprint ar Xiv:1605.06201, 2016. J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. Advances in Neural Information Processing Systems, 22:2331 2339, 2009. B. Li, T. Chen, and G. B. Giannakis. Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 993 1002. PMLR, 2019. T. Mandel, Y.-E. Liu, E. Brunskill, and Z. Popovic. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In AAAI, pages 2849 2856, 2015. T. A. Mann, S. Gowal, A. Gyorgy, H. Hu, R. Jiang, B. Lakshminarayanan, and P. Srinivasan. Learning from delayed outcomes via proxies with applications to recommender systems. In International Conference on Machine Learning, pages 4324 4332. PMLR, 2019. C. Mesterharm. On-line learning with delayed label feedback. In Algorithmic Learning Theory, pages 399 413. Springer, 2005. G. Neu, A. Antos, A. Gy orgy, and C. Szepesv ari. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pages 1804 1812. Curran Associates, Inc., 2010. Cesa-Bianchi, Cesari, Colomboni, Gentile, Mansour C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder. Bandits with delayed, aggregated anonymous feedback. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4105 4113. PMLR, 10 15 Jul 2018. K. Quanrud and D. Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems, pages 1270 1278, 2015. D. Reem, S. Reich, and A. D. Pierro. Re-examination of bregman functions and new properties of their divergences. Optimization, 68(1):279 348, 2019. S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends R in Machine Learning, 4(2):107 194, 2012. O. Shamir. On the complexity of bandit linear optimization. In Conference on Learning Theory, pages 1523 1551, 2015. O. Shamir and L. Szlak. Online learning with local permutations and delayed feedback. In Proc. 34th ICML, 2017. T. S. Thune, N. Cesa-Bianchi, and Y. Seldin. Nonstochastic multiarmed bandits with unrestricted delays. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 6541 6550, 2019. C. Vernade, O. Capp e, and V. Perchet. Stochastic bandit models for delayed conversions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI, 2017. C. Vernade, A. Carpentier, T. Lattimore, G. Zappella, B. Ermis, and M. Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712 9721. PMLR, 2020. S. Wang, H. Wang, and L. Huang. Adaptive algorithms for multi-armed bandit with composite and anonymous feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10210 10217, 2021. M. J. Weinberger and E. Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959 1976, 2002. J. Zimmert and Y. Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pages 3285 3294. PMLR, 2020. J. Zimmert and Y. Seldin. 
Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1-49, 2021.