# Delay and Cooperation in Nonstochastic Linear Bandits

Shinji Ito (NEC Corporation, i-shinji@nec.com), Daisuke Hatano (RIKEN AIP, daisuke.hatano@riken.jp), Hanna Sumita (Tokyo Institute of Technology, sumita@c.titech.ac.jp), Kei Takemura (NEC Corporation, kei_takemura@nec.com), Takuro Fukunaga (Chuo University, RIKEN AIP, JST PRESTO, fukunaga.07s@g.chuo-u.ac.jp), Naonori Kakimura (Keio University, kakimura@math.keio.ac.jp), Ken-ichi Kawarabayashi (National Institute of Informatics, k-keniti@nii.ac.jp)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed $d$ rounds after the choice of an action, and propose an algorithm for which the expected regret is $O(\sqrt{m(m+d)T})$, ignoring logarithmic factors in $m$ and $T$, where $m$ and $T$ denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer regret of $\Omega(\sqrt{m(m+d)T})$ in the worst case. To develop the algorithm, we introduce a technique we refer to as distribution truncation, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [18] and Bar-On and Mansour [12], and extend their results to the linear bandit setting.

## 1 Introduction

Bandit linear optimization (nonstochastic linear bandits) models various sequential decision-making problems under partial-information conditions and has a wide range of applications, including combinatorial bandits [17] and adaptive routing problems [11]. In this model, a player is given a set $A \subseteq \mathbb{R}^m$ of actions, each of which corresponds to an $m$-dimensional feature vector, and chooses an action $a_t \in A$ in each round $t \in [T]$. Just after choosing action $a_t$, the player gets feedback $\ell_t^\top a_t$, the loss for the chosen action, where $\ell_t \in \mathbb{R}^m$ is a loss vector. The goal of the player is to minimize the cumulative loss $\sum_{t=1}^T \ell_t^\top a_t$. A number of studies have proposed algorithms for this model, achieving sublinear regret bounds of $O(m\sqrt{T})$ [17; 22; 31].

However, in many applications, we cannot always get feedback right after choosing actions, as noted, e.g., in [5]. For example, in the application of advertisement optimization [3; 6; 33], it takes a certain amount of time until the advertisements begin to affect consumers' behavior. Previous studies are not directly applicable to such situations.

Table 1: Regret bounds for nonstochastic bandit problems with/without delay

| | Multi-armed bandit ($K$ arms) | Linear bandit ($m$: dim. of action set) |
| --- | --- | --- |
| Without delay | $O(\sqrt{KT\log K})$ [10]; $\Omega(\sqrt{KT})$ [10] | $O(m\sqrt{T\log m})$ [15; 17; 22]; $\Omega(m\sqrt{T})$ |
| With delay ($d$-round delay) | $O(\sqrt{dKT})$ [35; 36]; $O(\sqrt{(d+K)T\log K})$ [18]; $O(\sqrt{(d\log K+K)T})$ [46]; $\Omega(\sqrt{(d\log K+K)T})$ [18] | $O(m\sqrt{dT})$ [19] ([19] applies to more general models); $O(\sqrt{m(d+m)T})$ [Theorem 1]; $\Omega(\sqrt{m(d+m)T})$ [Theorem 2] |
For multi-armed bandit problems, algorithms that work well even in delayed-feedback settings have been proposed, as shown in Table 1, but they are not available for the more general linear bandit setting. Vernade et al. [43] provided algorithms for stochastic linear bandits with delayed feedback; these work well under the assumption of a time-invariant generative model for the loss, but come without regret bounds for nonstochastic settings.

This paper introduces bandit linear optimization with delayed feedback, which includes the multi-armed bandit problem [18], and proposes a minimax-optimal algorithm for the model. In this model, there is an additional parameter $d \ge 1$ representing the number of rounds of feedback delay. In contrast to the standard bandit feedback model, in which a player observes $\ell_t^\top a_t$ at the end of the $t$-th round, in our model the feedback $\ell_t^\top a_t$ can be observed only at the end of the $(t+d)$-th round, i.e., $d$ rounds later.

**Our contribution.** The main contribution of this paper is to construct a nearly optimal algorithm for online linear optimization with delayed bandit feedback. More precisely, for online linear optimization with an $m$-dimensional action set and $d$-round delay, Algorithm 1, described in Section 4, enjoys the following regret bound:

**Theorem 1.** For arbitrary loss sequences $(\ell_t)_{t=1}^T$, the regret of Algorithm 1 is bounded as
$$\mathbb{E}[R_T] \le \max\Bigl\{\sqrt{8m(d+em)T\log T} + 3,\; C\,m\,(d\sqrt{m}+m)\log^3(dmT)\Bigr\},$$
where $\mathbb{E}[\cdot]$ means the expectation taken w.r.t. the internal randomization of the algorithm and $C > 0$ is a global constant.

We note that this paper considers the oblivious adversarial model, i.e., $\ell_t$ is assumed not to depend on the output of the algorithm. Our results can easily be generalized to an adaptive adversarial model, i.e., a model in which $\ell_t$ may depend on $\{a_j\}_{j=1}^{t-1}$.

## 3 Problem settings

### 3.1 Nonstochastic linear bandits with delayed feedback

In each round $t \in [T]$, the player chooses an action $a_t$ from the action set $A \subseteq \mathbb{R}^m$. In each round $t > d$, the environment reveals the loss $\ell_{t-d}^\top a_{t-d} \in \mathbb{R}$. Without loss of generality, we assume $d \le T-1$. The loss vector $\ell_t \in \mathbb{R}^m$ is assumed to satisfy $|\ell_t^\top a| \le 1$ for all $a \in A$. In this paper, we assume that the sequence $(\ell_t)_{t=1}^T$ of loss vectors is an arbitrary non-adaptive sequence, i.e., we do not assume any generative model for $\ell_t$, but each $\ell_t$ is assumed not to depend on the output of the algorithm. Player performance is measured by means of the regret $R_T$, defined as $R_T = \sum_{t=1}^T \ell_t^\top a_t - \min_{a\in A}\sum_{t=1}^T \ell_t^\top a$.

### 3.2 Cooperative nonstochastic linear bandits

The model for cooperative bandits in this paper is based on the problem setting considered in [12; 18]. There is a communication graph $G = (V, E)$, an undirected graph each vertex of which corresponds to an agent that plays a common linear bandit problem. In each round $t = 1, 2, \ldots, T$, each agent $v \in V$ chooses an action $a_t(v) \in A$ from a common action set $A \subseteq \mathbb{R}^m$, and then observes the loss $\ell_t(v)^\top a_t(v) \in [-1, 1]$ for the chosen action, where we assume that $\mathbb{E}[\ell_t(v)] = \ell_t$ for each agent $v$. (In previous studies [18; 12], all players share a common loss vector $\ell_t$, i.e., the loss vectors $\ell_t(v)$ are equal to $\ell_t$ for all $v$; this case is a special case of our problem setting.) At the end of each round, each agent $v$ sends a message $m_t(v)$ to its neighbors $u \in N(v) = \{u \in V \mid \{u, v\} \in E\}$. The message $m_t(v)$ consists of the chosen action, the observed loss, and the distribution $q_t^v$ from which the action is chosen, i.e., $a_t(v) \sim q_t^v$:
$$m_t(v) = \bigl\langle t,\; v,\; a_t(v),\; \ell_t(v)^\top a_t(v),\; q_t^v \bigr\rangle. \qquad (4)$$
Note that each agent chooses an action independently, i.e., the action $a_t(v)$ independently follows $q_t^v$ for each agent $v$. The goal is to construct an algorithm for which the regret $R_T(v) = \sum_{t=1}^T \ell_t^\top a_t(v) - \min_{a\in A}\sum_{t=1}^T \ell_t^\top a$ of each agent $v$ is as small as possible.
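To make the interaction protocol of Section 3.1 concrete, the following is a minimal simulation sketch of the single-player delayed-feedback model. It is not taken from the paper; the `player` interface with `select` and `update` methods is a hypothetical stand-in for any bandit algorithm.

```python
import numpy as np

def run_delayed_bandit(player, losses, actions, d):
    """Simulate the delayed-feedback protocol: the loss of the action chosen
    in round t is revealed only at the end of round t + d.

    losses:  array of shape (T, m), the oblivious loss vectors ell_t
    actions: array of shape (|A|, m), a finite action set A
    d:       feedback delay in rounds
    player:  object with select() -> action index and update(a, loss_value)
             (hypothetical interface, not part of the paper)
    """
    T = len(losses)
    chosen, total_loss = [], 0.0
    for t in range(T):
        i = player.select()                      # choose a_t in A
        chosen.append(actions[i])
        total_loss += losses[t] @ actions[i]     # incurred now, revealed later
        if t >= d:                               # end of round t: reveal round t - d
            a_old = chosen[t - d]
            player.update(a_old, losses[t - d] @ a_old)
    best = min(losses.sum(axis=0) @ a for a in actions)
    return total_loss - best                     # regret R_T against the best fixed action
```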
## 4 Algorithms and Regret Upper Bounds

### 4.1 Preliminary

In this subsection, we introduce a technique that we refer to as distribution truncation, which plays a central role in bounding the regret. We denote the convex hull of $A$ by $B$. Given a distribution $p$ over $B$, define $\mu(p) \in \mathbb{R}^m$ and $S(p) \in \mathrm{Sym}(m)$ by $\mu(p) = \mathbb{E}_{x\sim p}[x]$ and $S(p) = \mathbb{E}_{x\sim p}[xx^\top]$. For any vector $x \in \mathbb{R}^m$ and positive semidefinite matrix $A \in \mathrm{Sym}(m)$, denote $\|x\|_A = \sqrt{x^\top A x}$. Given a distribution $p$ over $B$, define a truncated distribution $p'$ by
$$p'(x) = \frac{p(x)\,\mathbf{1}\bigl\{\|x\|_{S(p)^{-1}}^2 \le m\gamma^2\bigr\}}{\mathrm{Prob}_{y\sim p}\bigl[\|y\|_{S(p)^{-1}}^2 \le m\gamma^2\bigr]}, \qquad (5)$$
where $\gamma \ge 4\log(10mT)$ is a parameter that we refer to as the truncation level. From the definition of $p'$, any sample $b$ chosen from a truncated distribution $p'$ is bounded in terms of the norm $\|\cdot\|_{S(p)^{-1}}$, as $\|b\|_{S(p)^{-1}}^2 \le m\gamma^2$. This property ensures that the estimated loss vector (defined in (10)), constructed from a sample from a truncated distribution, has a bounded norm as in (11), thanks to which the action distributions do not change drastically per round, as will be shown in Lemmas 5 and 6.

**Properties of log-concave distributions.** If a probability distribution has a density function $p : B \to \mathbb{R}_{\ge 0}$ such that $\log(p(x))$ is a concave function, we call it a log-concave distribution. We use the following concentration property of log-concave distributions:

**Lemma 1.** If $x$ follows a log-concave distribution $p$ over $\mathbb{R}^m$ satisfying $S(p) \preceq I$, we have
$$\mathrm{Prob}\bigl[\|x\|_2^2 \ge m\alpha^2\bigr] \le m\exp(1-\alpha) \qquad (6)$$
for an arbitrary $\alpha \ge 0$.

Missing proofs of lemmas are provided in the appendix. From Lemma 1, we can show that $p$ and $p'$ defined as in (5) are close if $p$ is a log-concave distribution, in the following sense:

**Lemma 2.** Suppose that $p$ is a log-concave distribution over $B$. For any function $f : B \to [-1, 1]$ and $\gamma \ge 4\log(10mT)$, we have
$$\bigl|\mathbb{E}_{x\sim p}[f(x)] - \mathbb{E}_{x\sim p'}[f(x)]\bigr| \le \frac{1}{T} \quad \text{and} \quad \frac{T}{T+1}\,S(p) \preceq S(p') \preceq \frac{T+1}{T}\,S(p). \qquad (7)$$

**Lemma 3.** If $y$ follows a one-dimensional log-concave distribution such that $\mathbb{E}[y^2] \le s^2 \le 1/100$, we have
$$\mathbb{E}[g(y)] \le s^2 + 30\exp\Bigl(-\frac{1}{2s^2}\Bigr), \quad \text{where } g(y) = \exp(y) - y - 1. \qquad (8)$$

### 4.2 Algorithm for linear bandits with delayed feedback

In our algorithm, we update a distribution $p_t$ over $B := \mathrm{conv}(A)$ by the multiplicative weight update method (MWU) as follows:
$$w_t(x) := \exp\Bigl(-\eta\sum_{j=1}^{t-d-1}\hat{\ell}_j^\top x\Bigr), \qquad p_t(x) = \frac{w_t(x)}{\int_{y\in B} w_t(y)\,dy}, \qquad (9)$$
where $\eta > 0$ is a parameter referred to as the learning rate, and $\hat{\ell}_t$ is as defined below. In each round, we pick $b_t \in B$ from the truncated distribution $p'_t$ of $p_t$. We can get a sample from $p'_t$ by iteratively sampling $b \sim p_t$ until $\|b\|_{S(p_t)^{-1}}^2 \le m\gamma^2$. There is a computationally efficient way to sample $b \sim p_t$ under mild assumptions, since $p_t$ is a log-concave distribution. In fact, we can use the techniques developed in [34] to get samples from $p_t$ in polynomial time, given a membership oracle for $B$. A membership oracle for $B$ can be constructed from a linear optimization oracle for $A$, as stated, e.g., in [39]. After getting $b_t$, we choose $a_t \in A$ so that $\mathbb{E}[a_t \mid b_t] = b_t$. We can compute such an $a_t$ efficiently, given a linear optimization oracle for $A$. Indeed, as shown in Corollary 14.1g in [39], given $b \in B = \mathrm{conv}(A)$ we can compute $\lambda_1, \ldots, \lambda_{m+1} \ge 0$ and $c_1, \ldots, c_{m+1} \in A$ such that $\sum_{i=1}^{m+1}\lambda_i = 1$ and $\sum_{i=1}^{m+1}\lambda_i c_i = b$ by solving linear optimization problems over $A$ polynomially many times. By setting $a = c_i$ with probability $\lambda_i$, we obtain $\mathbb{E}[a \mid b] = b$. The algorithm then plays $a_t$ in the $t$-th round, and the feedback $\ell_t^\top a_t$ will be observed at the end of the $(t+d)$-th round. We define $\hat{\ell}_t$ by
$$\hat{\ell}_t = \bigl(\ell_t^\top a_t\bigr)\,S(p'_t)^{-1} b_t. \qquad (10)$$
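The truncation step and the estimator (10) are simple to realize once a sampler for $p_t$ and an estimate of $S(p'_t)$ are available. The sketch below is illustrative only: the helpers `sample_p`, `S_inv`, and `S_trunc_inv` are hypothetical inputs, and the empirical second-moment estimate stands in for the sampling-based approximation of $S(p'_t)$ discussed just below; it is not the paper's implementation.

```python
import numpy as np

def truncated_sample(sample_p, S_inv, m, gamma, max_tries=100000):
    """Rejection sampling from the truncated distribution p' of (5):
    resample b ~ p until ||b||^2_{S(p)^{-1}} <= m * gamma^2."""
    for _ in range(max_tries):
        b = sample_p()
        if b @ S_inv @ b <= m * gamma ** 2:
            return b
    raise RuntimeError("acceptance region has too little mass; check gamma")

def empirical_second_moment(samples):
    """Empirical estimate of S(p') = E[x x^T] from i.i.d. samples (rows)."""
    X = np.asarray(samples)
    return X.T @ X / len(X)

def loss_estimator(observed_loss, b, S_trunc_inv):
    """The estimator (10): hat_ell_t = (ell_t^T a_t) * S(p'_t)^{-1} b_t."""
    return observed_loss * (S_trunc_inv @ b)
```

For instance, one could draw a batch of truncated samples, use `empirical_second_moment` of that batch as a stand-in for $S(p'_t)$, and call `loss_estimator` once the delayed feedback $\ell_t^\top a_t$ arrives.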
We note that $S(p'_t)$ is invertible, which can be concluded from the assumption that $A$ is not contained in any proper linear subspace. Under this assumption, indeed, $B = \mathrm{conv}(A)$ is a full-dimensional convex set with positive Lebesgue measure. It follows from this fact and Lemma 1 that the domain of $p'_t$ is full-dimensional as well. Thus, the distribution $p'_t$ has a density function taking positive values over a full-dimensional set, which implies that the matrix $S(p'_t)$ is invertible. A similar argument can be found, e.g., in [27] (between Eqs. (4) and (5)), and is implicitly used in [16] as well. Further, we can compute $S(p'_t)$ efficiently. In fact, since $p'_t$ is a log-concave distribution, for any $\varepsilon > 0$ we can calculate an $\varepsilon$-approximation of $S(p'_t)$ with high probability using $(d/\varepsilon)^{O(1)}$ samples generated from $p'_t$, as can be seen from Corollary 2.7 of [34].

The vector $\hat{\ell}_t$ defined by (10) is an unbiased estimator of $\ell_t$ and is bounded in terms of the norm $\|\cdot\|_{S(p_t)}^2$, i.e., we have
$$\mathbb{E}[\hat{\ell}_t] = \ell_t \quad \text{and} \quad \|\hat{\ell}_t\|_{S(p_t)}^2 \le 4m\gamma^2. \qquad (11)$$
In fact, we have $\mathbb{E}[\hat{\ell}_t] = \mathbb{E}\bigl[(\ell_t^\top a_t)\,S(p'_t)^{-1}b_t\bigr] = \mathbb{E}\bigl[(\ell_t^\top b_t)\,S(p'_t)^{-1}b_t\bigr] = S(p'_t)^{-1}\,\mathbb{E}[b_t b_t^\top]\,\ell_t = \ell_t$, where we use $\mathbb{E}[a_t \mid b_t] = b_t$ and $\mathbb{E}[b_t b_t^\top] = S(p'_t)$, which means the first part of (11) holds. To show the second part of (11), we use $\|b_t\|_{S(p_t)^{-1}}^2 \le m\gamma^2$, which is ensured by the fact that $b_t$ is sampled from the truncated distribution $p'_t$. It follows that
$$\|\hat{\ell}_t\|_{S(p_t)}^2 = (\ell_t^\top a_t)^2\,\|S(p'_t)^{-1}b_t\|_{S(p_t)}^2 \le 2\,\|S(p'_t)^{-1}b_t\|_{S(p'_t)}^2 = 2\,\|b_t\|_{S(p'_t)^{-1}}^2 \le 4\,\|b_t\|_{S(p_t)^{-1}}^2 \le 4m\gamma^2,$$
where we use $|\ell_t^\top a_t| \le 1$ and the second part of (7) in the first and second inequalities. The inequality in (11) for $\hat{\ell}_t$ will be used in the analysis of MWU. The procedure is summarized in Algorithm 1.

**Algorithm 1** An algorithm for online linear optimization with delayed bandit feedback

- Require: action set $A$, parameters $T$ and $d \le T-1$.
- 1: Set $\gamma = 4\log(10mT)$ and $\eta = \min\Bigl\{\sqrt{\tfrac{m\log T}{2(d+em)T}},\ \tfrac{1}{100\gamma^2(d\sqrt{m}+m)}\Bigr\}$.
- 2: Define $w_1 : B \to \mathbb{R}_{>0}$ by $w_1(x) = 1$ for all $x \in B$.
- 3: for $t = 1, 2, \ldots, T$ do
- 4: Let $p_t$ be the distribution whose density function is proportional to $w_t$, as in (9).
- 5: Pick $b_t \sim p'_t$, e.g., by iteratively sampling $b \sim p_t$ until $\|b\|_{S(p_t)^{-1}}^2 \le m\gamma^2$ holds, and play $a_t \in A$ chosen so that $\mathbb{E}[a_t \mid b_t] = b_t$.
- 6: If $t > d$, get feedback $\ell_{t-d}^\top a_{t-d}$, construct an unbiased estimator of $\ell_{t-d}$ as $\hat{\ell}_{t-d} = (\ell_{t-d}^\top a_{t-d})\,S(p'_{t-d})^{-1}b_{t-d}$, and update $w_t$ by $w_{t+1}(x) = w_t(x)\exp(-\eta\,\hat{\ell}_{t-d}^\top x)$.
- 7: If $t \le d$, let $w_{t+1} = w_t$.
- 8: end for

Let us next show that Algorithm 1 enjoys the regret bound given in Theorem 1. Since we have $\mathbb{E}[a_t \mid p'_t] = \mathbb{E}[b_t \mid p'_t] = \mu(p'_t)$, the regret can be bounded as
$$\mathbb{E}[R_T] = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p'_t) - a^*\bigr)\Bigr] = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p'_t) - \mu(p_t)\bigr)\Bigr] + \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - a^*\bigr)\Bigr] \le 1 + \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - a^*\bigr)\Bigr],$$
where $a^* \in \operatorname{argmin}_{a\in A}\sum_{t=1}^T \ell_t^\top a$ and the last inequality follows from the first part of (7) and the assumption that $\ell_t^\top a \in [-1, 1]$ for all $a \in A$. From this inequality and a standard analysis for continuous multiplicative weight update methods [7; 22; 23], we obtain the following regret bound:

**Lemma 4.** If $p_t$ is defined by (9) with $\hat{\ell}_t$ such that $\mathbb{E}[\hat{\ell}_t] = \ell_t$, the regret of $(a_t)_{t=1}^T$ is bounded as
$$\mathbb{E}[R_T] \le \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - \mu(p_{t+d})\bigr)\Bigr] + \frac{1}{\eta}\,\mathbb{E}\Bigl[\sum_{t=1}^T \mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] + \frac{m\log T}{\eta} + 3, \qquad (13)$$
where $g : \mathbb{R} \to \mathbb{R}$ is defined in (8).

In the following, we give bounds for the right-hand side of (13), separately for the terms $\ell_t^\top(\mu(p_t) - \mu(p_{t+d}))$ and $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$. The first can be bounded via the following lemma:

**Lemma 5.** Suppose that $\ell \in \mathbb{R}^m$ satisfies $|\ell^\top a| \le 1$ for all $a \in A$ and that $\eta \le 1/(48\gamma^2 m)$. Then, if $p_t$ is defined by (9) with $\hat{\ell}_t$ satisfying (11), it holds for all $t \in [T]$ that $\bigl|\mathbb{E}\bigl[\ell^\top(\mu(p_t) - \mu(p_{t+1}))\bigr]\bigr| \le 2\eta$.

This lemma can be shown using (11) and Lemma 3. Finally, to bound the term $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$ in Lemma 4, we use the following lemma:

**Lemma 6.** Assume $\eta \le \tfrac{1}{100\gamma^2(d+1)\sqrt{m}}$. If $p_t$ is defined by (9) with $\hat{\ell}_t$ satisfying (11), then for all $t$ we have
$$S(p_{t+1}) \preceq \Bigl(1 + \frac{1}{d+1}\Bigr)\,S(p_t).$$

This lemma follows from (11) and Lemma 1, by induction on $t$.
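The following is a schematic NumPy sketch of the main loop of Algorithm 1 under a strong simplification that is not in the paper: the continuous distribution $p_t$ over $B$ is replaced by a discrete MWU distribution over a finite action set $A$ itself, so that $b_t = a_t$ and the continuous sampling and Carathéodory decomposition steps are skipped. It is meant only to illustrate the interplay of the delayed update, the truncation, and the estimator (10), not to reproduce the paper's guarantees.

```python
import numpy as np

def delayed_linear_bandit_mwu(A, losses, d, eta, gamma):
    """Schematic version of Algorithm 1 with p_t restricted to a discrete
    distribution over the rows of A. A simplification for illustration only."""
    T = len(losses)
    K, m = A.shape
    logw = np.zeros(K)                      # log-weights of MWU, cf. (9)
    history = []                            # (a_t, S(p'_t)^{-1}) per round
    player_loss = 0.0
    for t in range(T):
        p = np.exp(logw - logw.max())
        p /= p.sum()
        S = A.T @ (p[:, None] * A)          # S(p_t) = E[x x^T]
        S_inv = np.linalg.pinv(S)
        # truncation: keep only actions with ||a||^2_{S^{-1}} <= m * gamma^2
        ok = np.einsum('ij,jk,ik->i', A, S_inv, A) <= m * gamma ** 2
        q = p * ok
        q = q / q.sum() if q.sum() > 0 else p   # fall back if everything is cut
        S_trunc_inv = np.linalg.pinv(A.T @ (q[:, None] * A))
        i = np.random.choice(K, p=q)        # play a_t (= b_t in this sketch)
        history.append((A[i], S_trunc_inv))
        player_loss += losses[t] @ A[i]
        if t >= d:                          # delayed feedback for round t - d
            b_old, S_old_inv = history[t - d]
            obs = losses[t - d] @ b_old
            ell_hat = obs * (S_old_inv @ b_old)     # estimator (10)
            logw -= eta * (A @ ell_hat)             # MWU update, step 6
    best = (losses.sum(axis=0) @ A.T).min()
    return player_loss - best               # realized regret
```

A faithful implementation would instead sample from the log-concave density (9) over $B$, e.g., with the methods of [34], and decompose $b_t$ into a convex combination of actions as described in Section 4.2.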
We are now ready to prove Theorem 1.

**Proof of Theorem 1.** Combining Lemmas 4 and 5, we have
$$\mathbb{E}[R_T] \le 2\eta d T + \frac{1}{\eta}\,\mathbb{E}\Bigl[\sum_{t=1}^T \mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] + \frac{m\log T}{\eta} + 3. \qquad (14)$$
Let us bound $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$ using Lemma 3 and (11). We have
$$\eta^2\,\|\hat{\ell}_t\|_{S(p_{t+d})}^2 \le \Bigl(1+\frac{1}{d+1}\Bigr)^{d}\eta^2\,\|\hat{\ell}_t\|_{S(p_t)}^2 \le 4e\,\eta^2 m\gamma^2 \le \frac{1}{100}, \qquad (15)$$
where the first inequality follows from Lemma 6, the second inequality follows from (11), and the last inequality comes from the assumption $\eta \le \tfrac{1}{100\gamma^2(d\sqrt{m}+m)}$. From this inequality and the fact that $\hat{\ell}_t^\top x$ follows a log-concave distribution when $x \sim p_{t+d}$, we have
$$\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr] \le \eta^2\,\|\hat{\ell}_t\|_{S(p_{t+d})}^2 + 30\exp\Bigl(-\frac{1}{2\eta^2\|\hat{\ell}_t\|_{S(p_{t+d})}^2}\Bigr) \le \Bigl(1+\frac{1}{d+1}\Bigr)^{d}\eta^2\,\|\hat{\ell}_t\|_{S(p_t)}^2 + 30\exp(-50),$$
where the first and second inequalities follow from Lemma 3 and (15), respectively. From the definition (10) of $\hat{\ell}_t$, we have
$$\mathbb{E}\bigl[\|\hat{\ell}_t\|_{S(p_t)}^2\bigr] = \mathbb{E}\bigl[(\ell_t^\top a_t)^2\,\|S(p'_t)^{-1}b_t\|_{S(p_t)}^2\bigr] \le \Bigl(1+\frac{1}{T}\Bigr)\mathbb{E}\bigl[b_t^\top S(p'_t)^{-1}b_t\bigr] = \Bigl(1+\frac{1}{T}\Bigr)m \le \Bigl(1+\frac{1}{d+1}\Bigr)m,$$
where the first inequality follows from $|\ell_t^\top a_t| \le 1$ and the second part of (7), and the last inequality uses $d \le T-1$. Combining the above two inequalities, we obtain
$$\mathbb{E}\Bigl[\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] \le 2\eta^2\Bigl(1+\frac{1}{d+1}\Bigr)^{d+1}m \le 2e\,\eta^2 m.$$
From this and (14), we have
$$\mathbb{E}[R_T] \le 2\eta(d+em)T + \frac{m\log T}{\eta} + 3.$$
By substituting the parameter setting in Step 1 of Algorithm 1, we obtain the inequality of Theorem 1.

### 4.3 Algorithm for cooperative nonstochastic linear bandits

For the cooperative bandit problem, we consider a center-based algorithm [12]. A major difference between our algorithm and the one by Bar-On and Mansour [12] is that ours applies to linear bandit settings, while theirs focuses on multi-armed bandit settings. Similarly to [12], we first choose center agents $C \subseteq V$ and a corresponding partition $\{V_c\}_{c\in C}$ of the agents with the following properties:

**Theorem 5** (Theorem 12 in [12]). Given an undirected graph $G = (V, E)$ and a parameter $m \ge 2$, one can find center agents $C \subseteq V$ and a partition $\{V_c\}_{c\in C}$ of $V$ such that the following hold for all $c \in C$: (i) $\{c\} \cup N(c) \subseteq V_c$, (ii) the subgraph of $G$ induced by $V_c$ is connected, and (iii) for all $v \in V_c$, it holds that
$$\min\{|N(v)|, m\} \le \min\{|N(c)|, m\}\cdot\exp\bigl(d_c(v)\bigr),$$
where $d_c(v)$ denotes the distance from $c$ to $v$ in the subgraph of $G$ induced by $V_c$.

In the center-based algorithm, each center agent $c \in C$ updates its distribution by means of MWU, and the other agents $v \in V_c$ imitate the center agent $c$ on the basis of the messages $m_t$ in (4). More precisely, for each $c \in C$ we construct a breadth-first search tree $T_c$ of the subgraph induced by $V_c$ with root node $c$, and each agent $v \in V_c \setminus \{c\}$ uses the distribution $q_t^v$ defined by $q_t^v = q_{t-1}^{\mathrm{pa}(v)}$, where $\mathrm{pa}(v)$ denotes the parent node of $v$ in $T_c$. Consequently, for each $v \in V_c \setminus \{c\}$, the action distribution $q_t^v$ satisfies $q_t^v = q_{t-d_c(v)}^c$ for $t > d_c(v)$.

We define the distribution $q_t^c$ of each center agent $c \in C$ as follows. Define a distribution $p_t^c$ over $B = \mathrm{conv}(A)$ by MWU as
$$w_t^c(x) := \exp\Bigl(-\eta(c)\sum_{j=1}^{t-1}\hat{\ell}_j(c)^\top x\Bigr), \qquad p_t^c(x) = \frac{w_t^c(x)}{\int_{y\in B} w_t^c(y)\,dy}, \qquad (17)$$
where we set the truncation level $\gamma$ and the learning rate $\eta(c)$ as $\gamma = 4\log(10mT)$ and $\eta(c) = \min\Bigl\{\sqrt{\tfrac{m\log T}{T\,(1+\log m+m/\min\{|N(c)|,m\})}},\ \tfrac{1}{100\gamma^2 m}\Bigr\}$. Pick $b_t(c)$ from the truncated distribution $(p_t^c)'$ defined by (5), and pick $a_t(c)$ so that $\mathbb{E}[a_t(c) \mid b_t(c)] = b_t(c)$. The distribution $q_t^c$ is defined to be the distribution that $a_t(c)$ follows. The estimated loss $\hat{\ell}_t(c)$ is defined on the basis of the messages $m_t$ in (4) from the neighbors $N(c)$ of $c$. Each center agent $c \in C$ computes $\hat{\ell}_t(c)$ using $a_t(v)$ and $\ell_t(v)^\top a_t(v)$ for $v \in N(c)$ as follows:
$$\hat{\ell}_t(c) = \frac{1}{|N(c)|}\sum_{v\in N(c)} \bigl(\ell_t(v)^\top a_t(v)\bigr)\,S\bigl((p_t^c)'\bigr)^{-1} b'_t(v),$$
where $b'_t(v)$ is chosen from the posterior probability distribution of $b_t(v)$ given $a_t(v)$, i.e., $\mathrm{Prob}[b'_t(v) = b] \propto \mathrm{Prob}[a_t(v) \mid b_t(v) = b]\,(p_t^c)'(b)$. Combining Lemmas 3, 4, 5, 6 and Theorem 5, we obtain the regret bounds in Theorem 3. The proof of Theorem 3 is given in Section B in the appendix.
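As an illustration of how a center agent could turn its neighbors' messages into the averaged loss estimate described above, here is a small sketch; it simplifies the paper's construction by reusing each reported action $a_t(v)$ directly instead of drawing $b'_t(v)$ from the posterior, an assumption made purely for brevity.

```python
import numpy as np

def center_loss_estimator(messages, S_trunc_inv):
    """Combine neighbor reports into an estimate of ell_t at a center agent,
    in the spirit of Section 4.3. Each message is a pair
    (action a_t(v), observed loss ell_t(v)^T a_t(v)), cf. (4).
    The posterior draw b'_t(v) of the paper is replaced here by the
    reported action itself (a simplification for illustration)."""
    n = len(messages)
    estimate = np.zeros(S_trunc_inv.shape[0])
    for a_v, loss_v in messages:
        estimate += loss_v * (S_trunc_inv @ np.asarray(a_v))
    return estimate / n
```

Averaging over the $|N(c)|$ neighbors reduces the variance of the estimate at the center, which is consistent with the fact that the learning rate $\eta(c)$ above increases with $\min\{|N(c)|, m\}$.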
## 5 Lower Bounds

In this section, we briefly describe how we can prove the regret lower bounds in Theorems 2 and 4. Inspired by the proof of the lower bound for the multi-armed bandit problem with delayed feedback given in [18], we prove lower bounds by combining existing bounds for different settings, such as linear bandits without delay [21; 26] and full-information online optimization with delay [44].

To prove Theorem 2, we combine lower bounds of $\Omega(m\sqrt{T})$ and of $\Omega(\sqrt{mdT})$. We note that the first one comes from a lower bound for linear bandits without delay, shown, e.g., in [21]. The second one is a lower bound for online linear optimization with delayed feedback. As shown in [44], if an online optimization problem without delay admits a regret lower bound of $\Omega(L(T))$, a counterpart with $d$-round delayed feedback has an $\Omega(d\,L(T/d))$ lower bound. Since there is a lower bound of $\Omega(\sqrt{mT})$ for full-information online linear optimization (see, e.g., Theorem 3.2 in [24]), we have $\Omega(d\sqrt{m\,T/d}) = \Omega(\sqrt{mdT})$. The only remaining issue is that both regret bounds, $\Omega(m\sqrt{T})$ and $\Omega(\sqrt{mdT})$, must come from the same instance. To this end, we construct a problem instance over $A = \{-1, 1\}^m$ such that $\|\ell_t\|_1 \le 1$. By considering a mixed distribution of loss vectors, we obtain a lower bound of $\Omega(\sqrt{m(d+m)T})$. A complete proof is provided in Section C in the appendix.

Theorem 4 for cooperative linear bandits can, again, be obtained by combining lower bounds of $\Omega(\sqrt{mT})$ and of $\Omega(m\sqrt{T/|V|})$. The first one comes from a lower bound for full-information online linear optimization, as cooperative bandits are at least as hard as full-information online optimization. The second can come from the $\Omega(m\sqrt{T})$ lower bound for single-player linear bandits [21]. Since we can regard cooperative bandits as a harder version of single-player bandits with $T\,|V|$ rounds, the sum of regrets over all agents is at least $\Omega(m\sqrt{T\,|V|})$, which implies that the regret per agent is $\Omega(m\sqrt{T/|V|})$. A complete proof of Theorem 4, showing how we construct a problem instance that yields both regret bounds $\Omega(\sqrt{mT})$ and $\Omega(m\sqrt{T/|V|})$, is given in Section D in the appendix.

## 6 Conclusion

In this paper, we considered online linear optimization with $d$-round-delayed bandit feedback, where $d$ is a given parameter fixed over all rounds. We provided an algorithm that achieves nearly tight regret bounds, and extended this result to the cooperative bandit setting. An important direction for future work is to extend the model to deal with unknown and round-dependent delays, as in [46]. We believe that an adaptive way of tuning parameters such as $\eta$ and $\gamma$ in our algorithms would work well for this more general setting. Another future direction is to improve the practical computational cost. The proposed algorithms in this paper rely on continuous relaxation and sampling from log-concave distributions, which incurs a large computational cost in practice, though they run in polynomial time.

## Broader Impact

The authors believe that this paper presents neither ethical nor societal issues, as this is a theoretical work.

## Acknowledgments and Disclosure of Funding

This work was supported by JST, ERATO, Grant Number JPMJER1201, Japan. SI was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan. TF was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan. NK and KK were supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.

## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
[2] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, pages 3-11, 1999.

[3] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.

[4] J. D. Abernethy, E. Hazan, and A. Rakhlin. An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2008.

[5] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.

[6] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17-24, 2009.

[7] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

[8] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 217-226, 2009.

[9] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[11] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45-53, 2004.

[12] Y. Bar-On and Y. Mansour. Individual regret in cooperative nonstochastic multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3110-3120, 2019.

[13] R. Baraglia, P. Dazzi, M. Mordacchini, and L. Ricci. A peer-to-peer recommender system for self-emerging user communities based on gossip overlays. Journal of Computer and System Sciences, 79(2):291-308, 2013.

[14] I. Bistritz and A. Leshem. Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222-7232, 2018.

[15] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.

[16] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72-85. ACM, 2017.

[17] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

[18] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605-622, 2016.

[19] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Nonstochastic bandits with composite anonymous feedback. In Conference on Learning Theory, pages 750-773, 2018.

[20] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.

[21] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345-352, 2008.

[22] E. Hazan and Z. Karnin. Volumetric spanners: an efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062-4095, 2016.
[23] E. Hazan, W. Hu, Y. Li, et al. Online improper learning with an approximation oracle. In Advances in Neural Information Processing Systems, pages 5652-5660, 2018.

[24] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[25] E. Hillel, Z. S. Karnin, T. Koren, R. Lempel, and O. Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854-862, 2013.

[26] S. Ito. Submodular function minimization with noisy evaluation oracle. In Advances in Neural Information Processing Systems, pages 12080-12090, 2019.

[27] S. Ito, D. Hatano, H. Sumita, K. Takemura, T. Fukunaga, N. Kakimura, and K.-I. Kawarabayashi. Oracle-efficient algorithms for online linear optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 10590-10599, 2019.

[28] S. Ito, S. Hirahara, T. Soma, and Y. Yoshida. Tight first- and second-order regret bounds for adversarial linear bandits. In Advances in Neural Information Processing Systems, 2020, to appear.

[29] P. Joulani, A. Gyorgy, and C. Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453-1461, 2013.

[30] P. Landgren, V. Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243-248. IEEE, 2016.

[31] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, Revision 1699, 2019.

[32] B. Li, T. Chen, and G. B. Giannakis. Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 993-1002, 2019.

[33] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.

[34] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307-358, 2007.

[35] G. Neu, A. Antos, A. György, and C. Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804-1812, 2010.

[36] G. Neu, A. Gyorgy, C. Szepesvari, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676-691, 2014.

[37] C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pages 4105-4113, 2018.

[38] A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1-35, 2019.

[39] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.

[40] Y. Seldin, P. Bartlett, K. Crammer, and Y. Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In International Conference on Machine Learning, pages 280-287, 2014.

[41] B. Szorenyi, R. Busa-Fekete, I. Hegedüs, R. Ormandi, M. Jelasity, and B. Kégl. Gossip-based distributed stochastic bandit algorithms. In 30th International Conference on Machine Learning, volume 28, pages 19-27, 2013.

[42] T. S. Thune, N. Cesa-Bianchi, and Y. Seldin. Nonstochastic multiarmed bandits with unrestricted delays. In Advances in Neural Information Processing Systems, pages 6538-6547, 2019.
[43] C. Vernade, A. Carpentier, T. Lattimore, G. Zappella, B. Ermis, and M. Brueckner. Linear bandits with stochastic delayed feedback. arXiv preprint arXiv:1807.02089, 2018.

[44] M. J. Weinberger and E. Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959-1976, 2002.

[45] Z. Zhou, R. Xu, and J. Blanchet. Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems, pages 5198-5209, 2019.

[46] J. Zimmert and Y. Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. arXiv preprint arXiv:1910.06054, 2019.