# Delay and Cooperation in Nonstochastic Linear Bandits

Shinji Ito (NEC Corporation, i-shinji@nec.com), Daisuke Hatano (RIKEN AIP, daisuke.hatano@riken.jp), Hanna Sumita (Tokyo Institute of Technology, sumita@c.titech.ac.jp), Kei Takemura (NEC Corporation, kei_takemura@nec.com), Takuro Fukunaga (Chuo University, RIKEN AIP, JST PRESTO, fukunaga.07s@g.chuo-u.ac.jp), Naonori Kakimura (Keio University, kakimura@math.keio.ac.jp), Ken-ichi Kawarabayashi (National Institute of Informatics, k-keniti@nii.ac.jp)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed $d$ rounds after the choice of an action, and propose an algorithm for which the expected regret is $O(\sqrt{m(m+d)T})$, ignoring logarithmic factors in $m$ and $T$, where $m$ and $T$ denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer regret of $\Omega(\sqrt{m(m+d)T})$ in the worst case. To develop the algorithm, we introduce a technique we refer to as distribution truncation, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [18] and Bar-On and Mansour [12], and extend their results to the linear bandit setting.

## 1 Introduction

Bandit linear optimization (nonstochastic linear bandits) models various sequential decision-making problems under partial-information conditions and has a wide range of applications, including combinatorial bandits [17] and adaptive routing problems [11]. In this model, a player is given a set $A \subseteq \mathbb{R}^m$ of actions, each of which corresponds to an $m$-dimensional feature vector, and chooses an action $a_t \in A$ in each round $t \in [T]$. Just after choosing action $a_t$, the player gets feedback $\ell_t^\top a_t$, the loss for the chosen action, where $\ell_t \in \mathbb{R}^m$ is a loss vector. The goal of the player is to minimize the cumulative loss $\sum_{t=1}^T \ell_t^\top a_t$. A number of studies have proposed algorithms for this model, achieving sublinear regret bounds of $O(m\sqrt{T})$ [17; 22; 31].

However, in many applications, we cannot always get feedback right after choosing actions, as noted, e.g., in [5]. For example, in the application of advertisement optimization [3; 6; 33], it takes a certain amount of time until the advertisements begin to affect consumers' behavior. Previous studies are not directly applicable to such situations.

Table 1: Regret bounds for nonstochastic bandit problems with/without delay

| | Multi-armed bandit ($K$ arms) | Linear bandit ($m$: dim. of action set) |
| --- | --- | --- |
| Without delay | $O(\sqrt{KT\log K})$ [10]; $\Omega(\sqrt{KT})$ [10] | $O(m\sqrt{T\log m})$ [15; 17; 22]; $\Omega(m\sqrt{T})$ |
| With delay ($d$-round delay) | $O(\sqrt{dKT})$ [35; 36]; $O(\sqrt{(d+K)T\log K})$ [18]; $O(\sqrt{(d\log K+K)T})$ [46]; $\Omega(\sqrt{(d\log K+K)T})$ [18] | $O(m\sqrt{dT})$ [19] ([19] applies to more general models); $O(\sqrt{m(d+m)T})$ [Theorem 1]; $\Omega(\sqrt{m(d+m)T})$ [Theorem 2] |
For multi-armed bandit problems, algorithms that work well even in delayed-feedback settings have been proposed, as shown in Table 1, but they are not available for the more general linear bandit setting. Vernade et al. [43] provided algorithms for stochastic linear bandits with delayed feedback; these work well under the assumption of a time-invariant generative model for the loss, but come without regret bounds for nonstochastic settings.

This paper introduces bandit linear optimization with delayed feedback, which includes the multi-armed bandit problem [18], and proposes a minimax-optimal algorithm for the model. In this model, there is an additional parameter $d \ge 1$ representing the number of rounds of feedback delay. In contrast to the standard bandit feedback model, in which a player observes $\ell_t^\top a_t$ at the end of the $t$-th round, in our model the feedback $\ell_t^\top a_t$ can be observed only at the end of the $(t+d)$-th round, i.e., $d$ rounds later.

**Our contribution.** The main contribution of this paper is to construct a nearly optimal algorithm for online linear optimization with delayed bandit feedback. More precisely, for online linear optimization with an $m$-dimensional action set and $d$-round delay, Algorithm 1, described in Section 4, enjoys the following regret bound:

**Theorem 1.** For arbitrary loss sequences $(\ell_t)_{t=1}^T$, the regret of Algorithm 1 is bounded as
$$\mathbb{E}[R_T] \le \max\Bigl\{\sqrt{8m(d+em)T\log T} + 3,\; C\,m\,(d\sqrt{m}+m)\log^3(dmT)\Bigr\},$$
where $\mathbb{E}[\cdot]$ means the expectation taken w.r.t. the internal randomization of the algorithm and $C > 0$ is a global constant.

We note that this paper considers the oblivious adversarial model, i.e., $\ell_t$ is assumed not to depend on the output of the algorithm. Our results can easily be generalized to an adaptive adversarial model, i.e., a model in which $\ell_t$ may depend on $\{a_j\}_{j=1}^{t-1}$.

## 3 Problem settings

### 3.1 Nonstochastic linear bandits with delayed feedback

In each round $t \in [T]$, the player chooses an action $a_t$ from the action set $A \subseteq \mathbb{R}^m$. In each round $t > d$, the environment reveals the loss $\ell_{t-d}^\top a_{t-d} \in \mathbb{R}$. Without loss of generality, we assume $d \le T-1$. The loss vector $\ell_t \in \mathbb{R}^m$ is assumed to satisfy $|\ell_t^\top a| \le 1$ for all $a \in A$. In this paper, we assume that the sequence $(\ell_t)_{t=1}^T$ of loss vectors is an arbitrary non-adaptive sequence, i.e., we do not assume any generative model for $\ell_t$, but each $\ell_t$ is assumed not to depend on the output of the algorithm. Player performance is measured by means of the regret $R_T$, defined as $R_T = \sum_{t=1}^T \ell_t^\top a_t - \min_{a\in A}\sum_{t=1}^T \ell_t^\top a$.

### 3.2 Cooperative nonstochastic linear bandits

The model for cooperative bandits in this paper is based on the problem setting considered in [12; 18]. There is a communication graph $G = (V, E)$, an undirected graph each vertex of which corresponds to an agent that plays a common linear bandit problem. In each round $t = 1, 2, \ldots, T$, each agent $v \in V$ chooses an action $a_t(v) \in A$ from a common action set $A \subseteq \mathbb{R}^m$, and then observes the loss $\ell_t(v)^\top a_t(v) \in [-1, 1]$ for the chosen action, where we assume that $\mathbb{E}[\ell_t(v)] = \ell_t$ for each agent $v$. (In previous studies [18; 12], all players share a common loss vector $\ell_t$, i.e., the loss vectors $\ell_t(v)$ are equal to $\ell_t$ for all $v$; this case is a special case of our problem setting.) At the end of each round, each agent $v$ sends a message $m_t(v)$ to its neighbors $u \in N(v) = \{u \in V \mid \{u, v\} \in E\}$. The message $m_t(v)$ consists of the chosen action, the observed loss, and the distribution $q_t^v$ from which the action is chosen, i.e., $a_t(v) \sim q_t^v$:
$$m_t(v) = \bigl\langle t,\; v,\; a_t(v),\; \ell_t(v)^\top a_t(v),\; q_t^v \bigr\rangle. \qquad (4)$$
Note that each agent chooses an action independently, i.e., the action $a_t(v)$ independently follows $q_t^v$ for each agent $v$. The goal is to construct an algorithm for which the regret $R_T(v) = \sum_{t=1}^T \ell_t^\top a_t(v) - \min_{a\in A}\sum_{t=1}^T \ell_t^\top a$ of each agent $v$ is as small as possible.
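To make the interaction protocol of Section 3.1 concrete, the following is a minimal simulation sketch of the single-player delayed-feedback model. It is not taken from the paper; the `player` interface with `select` and `update` methods is a hypothetical stand-in for any bandit algorithm.

```python
import numpy as np

def run_delayed_bandit(player, losses, actions, d):
    """Simulate the delayed-feedback protocol: the loss of the action chosen
    in round t is revealed only at the end of round t + d.

    losses:  array of shape (T, m), the oblivious loss vectors ell_t
    actions: array of shape (|A|, m), a finite action set A
    d:       feedback delay in rounds
    player:  object with select() -> action index and update(a, loss_value)
             (hypothetical interface, not part of the paper)
    """
    T = len(losses)
    chosen, total_loss = [], 0.0
    for t in range(T):
        i = player.select()                      # choose a_t in A
        chosen.append(actions[i])
        total_loss += losses[t] @ actions[i]     # incurred now, revealed later
        if t >= d:                               # end of round t: reveal round t - d
            a_old = chosen[t - d]
            player.update(a_old, losses[t - d] @ a_old)
    best = min(losses.sum(axis=0) @ a for a in actions)
    return total_loss - best                     # regret R_T against the best fixed action
```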
## 4 Algorithms and Regret Upper Bounds

### 4.1 Preliminary

In this subsection, we introduce a technique that we refer to as distribution truncation, which plays a central role in bounding the regret. We denote the convex hull of $A$ by $B$. Given a distribution $p$ over $B$, define $\mu(p) \in \mathbb{R}^m$ and $S(p) \in \mathrm{Sym}(m)$ by $\mu(p) = \mathbb{E}_{x\sim p}[x]$ and $S(p) = \mathbb{E}_{x\sim p}[xx^\top]$. For any vector $x \in \mathbb{R}^m$ and positive semidefinite matrix $A \in \mathrm{Sym}(m)$, denote $\|x\|_A = \sqrt{x^\top A x}$. Given a distribution $p$ over $B$, define a truncated distribution $p'$ by
$$p'(x) = \frac{p(x)\,\mathbf{1}\bigl\{\|x\|_{S(p)^{-1}}^2 \le m\gamma^2\bigr\}}{\mathrm{Prob}_{y\sim p}\bigl[\|y\|_{S(p)^{-1}}^2 \le m\gamma^2\bigr]}, \qquad (5)$$
where $\gamma \ge 4\log(10mT)$ is a parameter that we refer to as the truncation level. From the definition of $p'$, any sample $b$ chosen from a truncated distribution $p'$ is bounded in terms of the norm $\|\cdot\|_{S(p)^{-1}}$, as $\|b\|_{S(p)^{-1}}^2 \le m\gamma^2$. This property ensures that the estimated loss vector (defined in (10)), constructed from a sample from a truncated distribution, has a bounded norm as in (11), thanks to which the action distributions do not change drastically per round, as will be shown in Lemmas 5 and 6.

**Properties of log-concave distributions.** If a probability distribution has a density function $p : B \to \mathbb{R}_{\ge 0}$ such that $\log(p(x))$ is a concave function, we call it a log-concave distribution. We use the following concentration property of log-concave distributions:

**Lemma 1.** If $x$ follows a log-concave distribution $p$ over $\mathbb{R}^m$ satisfying $S(p) \preceq I$, we have
$$\mathrm{Prob}\bigl[\|x\|_2^2 \ge m\alpha^2\bigr] \le m\exp(1-\alpha) \qquad (6)$$
for an arbitrary $\alpha \ge 0$.

Missing proofs of lemmas are provided in the appendix. From Lemma 1, we can show that $p$ and $p'$ defined as in (5) are close if $p$ is a log-concave distribution, in the following sense:

**Lemma 2.** Suppose that $p$ is a log-concave distribution over $B$. For any function $f : B \to [-1, 1]$ and $\gamma \ge 4\log(10mT)$, we have
$$\bigl|\mathbb{E}_{x\sim p}[f(x)] - \mathbb{E}_{x\sim p'}[f(x)]\bigr| \le \frac{1}{T} \quad \text{and} \quad \frac{T}{T+1}\,S(p) \preceq S(p') \preceq \frac{T+1}{T}\,S(p). \qquad (7)$$

**Lemma 3.** If $y$ follows a one-dimensional log-concave distribution such that $\mathbb{E}[y^2] \le s^2 \le 1/100$, we have
$$\mathbb{E}[g(y)] \le s^2 + 30\exp\Bigl(-\frac{1}{2s^2}\Bigr), \quad \text{where } g(y) = \exp(y) - y - 1. \qquad (8)$$

### 4.2 Algorithm for linear bandits with delayed feedback

In our algorithm, we update a distribution $p_t$ over $B := \mathrm{conv}(A)$ by the multiplicative weight update method (MWU) as follows:
$$w_t(x) := \exp\Bigl(-\eta\sum_{j=1}^{t-d-1}\hat{\ell}_j^\top x\Bigr), \qquad p_t(x) = \frac{w_t(x)}{\int_{y\in B} w_t(y)\,dy}, \qquad (9)$$
where $\eta > 0$ is a parameter referred to as the learning rate, and $\hat{\ell}_t$ is as defined below. In each round, we pick $b_t \in B$ from the truncated distribution $p'_t$ of $p_t$. We can get a sample from $p'_t$ by iteratively sampling $b \sim p_t$ until $\|b\|_{S(p_t)^{-1}}^2 \le m\gamma^2$. There is a computationally efficient way to sample $b \sim p_t$ under mild assumptions, since $p_t$ is a log-concave distribution. In fact, we can use the techniques developed in [34] to get samples from $p_t$ in polynomial time, given a membership oracle for $B$. A membership oracle for $B$ can be constructed from a linear optimization oracle for $A$, as stated, e.g., in [39]. After getting $b_t$, we choose $a_t \in A$ so that $\mathbb{E}[a_t \mid b_t] = b_t$. We can compute such an $a_t$ efficiently, given a linear optimization oracle for $A$. Indeed, as shown in Corollary 14.1g in [39], given $b \in B = \mathrm{conv}(A)$ we can compute $\lambda_1, \ldots, \lambda_{m+1} \ge 0$ and $c_1, \ldots, c_{m+1} \in A$ such that $\sum_{i=1}^{m+1}\lambda_i = 1$ and $\sum_{i=1}^{m+1}\lambda_i c_i = b$ by solving linear optimization problems over $A$ polynomially many times. By setting $a = c_i$ with probability $\lambda_i$, we obtain $\mathbb{E}[a \mid b] = b$. The algorithm then plays $a_t$ in the $t$-th round, and the feedback $\ell_t^\top a_t$ will be observed at the end of the $(t+d)$-th round. We define $\hat{\ell}_t$ by
$$\hat{\ell}_t = \bigl(\ell_t^\top a_t\bigr)\,S(p'_t)^{-1} b_t. \qquad (10)$$
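The truncation step and the estimator (10) are simple to realize once a sampler for $p_t$ and an estimate of $S(p'_t)$ are available. The sketch below is illustrative only: the helpers `sample_p`, `S_inv`, and `S_trunc_inv` are hypothetical inputs, and the empirical second-moment estimate stands in for the sampling-based approximation of $S(p'_t)$ discussed just below; it is not the paper's implementation.

```python
import numpy as np

def truncated_sample(sample_p, S_inv, m, gamma, max_tries=100000):
    """Rejection sampling from the truncated distribution p' of (5):
    resample b ~ p until ||b||^2_{S(p)^{-1}} <= m * gamma^2."""
    for _ in range(max_tries):
        b = sample_p()
        if b @ S_inv @ b <= m * gamma ** 2:
            return b
    raise RuntimeError("acceptance region has too little mass; check gamma")

def empirical_second_moment(samples):
    """Empirical estimate of S(p') = E[x x^T] from i.i.d. samples (rows)."""
    X = np.asarray(samples)
    return X.T @ X / len(X)

def loss_estimator(observed_loss, b, S_trunc_inv):
    """The estimator (10): hat_ell_t = (ell_t^T a_t) * S(p'_t)^{-1} b_t."""
    return observed_loss * (S_trunc_inv @ b)
```

For instance, one could draw a batch of truncated samples, use `empirical_second_moment` of that batch as a stand-in for $S(p'_t)$, and call `loss_estimator` once the delayed feedback $\ell_t^\top a_t$ arrives.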
We note that $S(p'_t)$ is invertible, which can be concluded from the assumption that $A$ is not contained in any proper linear subspace. Under this assumption, indeed, $B = \mathrm{conv}(A)$ is a full-dimensional convex set with positive Lebesgue measure. It follows from this fact and Lemma 1 that the domain of $p'_t$ is full-dimensional as well. Thus, the distribution $p'_t$ has a density function taking positive values over a full-dimensional set, which implies that the matrix $S(p'_t)$ is invertible. A similar argument can be found, e.g., in [27] (between Eqs. (4) and (5)), and is implicitly used in [16] as well. Further, we can compute $S(p'_t)$ efficiently. In fact, since $p'_t$ is a log-concave distribution, for any $\varepsilon > 0$ we can calculate an $\varepsilon$-approximation of $S(p'_t)$ with high probability using $(d/\varepsilon)^{O(1)}$ samples generated from $p'_t$, as can be seen from Corollary 2.7 of [34].

The vector $\hat{\ell}_t$ defined by (10) is an unbiased estimator of $\ell_t$ and is bounded in terms of the norm $\|\cdot\|_{S(p_t)}^2$, i.e., we have
$$\mathbb{E}[\hat{\ell}_t] = \ell_t \quad \text{and} \quad \|\hat{\ell}_t\|_{S(p_t)}^2 \le 4m\gamma^2. \qquad (11)$$
In fact, we have $\mathbb{E}[\hat{\ell}_t] = \mathbb{E}\bigl[(\ell_t^\top a_t)\,S(p'_t)^{-1}b_t\bigr] = \mathbb{E}\bigl[(\ell_t^\top b_t)\,S(p'_t)^{-1}b_t\bigr] = S(p'_t)^{-1}\,\mathbb{E}[b_t b_t^\top]\,\ell_t = \ell_t$, where we use $\mathbb{E}[a_t \mid b_t] = b_t$ and $\mathbb{E}[b_t b_t^\top] = S(p'_t)$, which means the first part of (11) holds. To show the second part of (11), we use $\|b_t\|_{S(p_t)^{-1}}^2 \le m\gamma^2$, which is ensured by the fact that $b_t$ is sampled from the truncated distribution $p'_t$. It follows that
$$\|\hat{\ell}_t\|_{S(p_t)}^2 = (\ell_t^\top a_t)^2\,\|S(p'_t)^{-1}b_t\|_{S(p_t)}^2 \le 2\,\|S(p'_t)^{-1}b_t\|_{S(p'_t)}^2 = 2\,\|b_t\|_{S(p'_t)^{-1}}^2 \le 4\,\|b_t\|_{S(p_t)^{-1}}^2 \le 4m\gamma^2,$$
where we use $|\ell_t^\top a_t| \le 1$ and the second part of (7) in the first and second inequalities. The inequality in (11) for $\hat{\ell}_t$ will be used in the analysis of MWU. The procedure is summarized in Algorithm 1.

**Algorithm 1** An algorithm for online linear optimization with delayed bandit feedback

- Require: action set $A$, parameters $T$ and $d \le T-1$.
- 1: Set $\gamma = 4\log(10mT)$ and $\eta = \min\Bigl\{\sqrt{\tfrac{m\log T}{2(d+em)T}},\ \tfrac{1}{100\gamma^2(d\sqrt{m}+m)}\Bigr\}$.
- 2: Define $w_1 : B \to \mathbb{R}_{>0}$ by $w_1(x) = 1$ for all $x \in B$.
- 3: for $t = 1, 2, \ldots, T$ do
- 4: Let $p_t$ be the distribution whose density function is proportional to $w_t$, as in (9).
- 5: Pick $b_t \sim p'_t$, e.g., by iteratively sampling $b \sim p_t$ until $\|b\|_{S(p_t)^{-1}}^2 \le m\gamma^2$ holds, and play $a_t \in A$ chosen so that $\mathbb{E}[a_t \mid b_t] = b_t$.
- 6: If $t > d$, get feedback $\ell_{t-d}^\top a_{t-d}$, construct an unbiased estimator of $\ell_{t-d}$ as $\hat{\ell}_{t-d} = (\ell_{t-d}^\top a_{t-d})\,S(p'_{t-d})^{-1}b_{t-d}$, and update $w_t$ by $w_{t+1}(x) = w_t(x)\exp(-\eta\,\hat{\ell}_{t-d}^\top x)$.
- 7: If $t \le d$, let $w_{t+1} = w_t$.
- 8: end for

Let us next show that Algorithm 1 enjoys the regret bound given in Theorem 1. Since we have $\mathbb{E}[a_t \mid p'_t] = \mathbb{E}[b_t \mid p'_t] = \mu(p'_t)$, the regret can be bounded as
$$\mathbb{E}[R_T] = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p'_t) - a^*\bigr)\Bigr] = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p'_t) - \mu(p_t)\bigr)\Bigr] + \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - a^*\bigr)\Bigr] \le 1 + \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - a^*\bigr)\Bigr],$$
where $a^* \in \operatorname{argmin}_{a\in A}\sum_{t=1}^T \ell_t^\top a$ and the last inequality follows from the first part of (7) and the assumption that $\ell_t^\top a \in [-1, 1]$ for all $a \in A$. From this inequality and a standard analysis for continuous multiplicative weight update methods [7; 22; 23], we obtain the following regret bound:

**Lemma 4.** If $p_t$ is defined by (9) with $\hat{\ell}_t$ such that $\mathbb{E}[\hat{\ell}_t] = \ell_t$, the regret of $(a_t)_{t=1}^T$ is bounded as
$$\mathbb{E}[R_T] \le \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t^\top\bigl(\mu(p_t) - \mu(p_{t+d})\bigr)\Bigr] + \frac{1}{\eta}\,\mathbb{E}\Bigl[\sum_{t=1}^T \mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] + \frac{m\log T}{\eta} + 3, \qquad (13)$$
where $g : \mathbb{R} \to \mathbb{R}$ is defined in (8).

In the following, we give bounds for the right-hand side of (13), separately for the terms $\ell_t^\top(\mu(p_t) - \mu(p_{t+d}))$ and $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$. The first can be bounded via the following lemma:

**Lemma 5.** Suppose that $\ell \in \mathbb{R}^m$ satisfies $|\ell^\top a| \le 1$ for all $a \in A$ and that $\eta \le 1/(48\gamma^2 m)$. Then, if $p_t$ is defined by (9) with $\hat{\ell}_t$ satisfying (11), it holds for all $t \in [T]$ that $\bigl|\mathbb{E}\bigl[\ell^\top(\mu(p_t) - \mu(p_{t+1}))\bigr]\bigr| \le 2\eta$.

This lemma can be shown using (11) and Lemma 3. Finally, to bound the term $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$ in Lemma 4, we use the following lemma:

**Lemma 6.** Assume $\eta \le \tfrac{1}{100\gamma^2(d+1)\sqrt{m}}$. If $p_t$ is defined by (9) with $\hat{\ell}_t$ satisfying (11), then for all $t$ we have
$$S(p_{t+1}) \preceq \Bigl(1 + \frac{1}{d+1}\Bigr)\,S(p_t).$$

This lemma follows from (11) and Lemma 1, by induction on $t$.
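The following is a schematic NumPy sketch of the main loop of Algorithm 1 under a strong simplification that is not in the paper: the continuous distribution $p_t$ over $B$ is replaced by a discrete MWU distribution over a finite action set $A$ itself, so that $b_t = a_t$ and the continuous sampling and Carathéodory decomposition steps are skipped. It is meant only to illustrate the interplay of the delayed update, the truncation, and the estimator (10), not to reproduce the paper's guarantees.

```python
import numpy as np

def delayed_linear_bandit_mwu(A, losses, d, eta, gamma):
    """Schematic version of Algorithm 1 with p_t restricted to a discrete
    distribution over the rows of A. A simplification for illustration only."""
    T = len(losses)
    K, m = A.shape
    logw = np.zeros(K)                      # log-weights of MWU, cf. (9)
    history = []                            # (a_t, S(p'_t)^{-1}) per round
    player_loss = 0.0
    for t in range(T):
        p = np.exp(logw - logw.max())
        p /= p.sum()
        S = A.T @ (p[:, None] * A)          # S(p_t) = E[x x^T]
        S_inv = np.linalg.pinv(S)
        # truncation: keep only actions with ||a||^2_{S^{-1}} <= m * gamma^2
        ok = np.einsum('ij,jk,ik->i', A, S_inv, A) <= m * gamma ** 2
        q = p * ok
        q = q / q.sum() if q.sum() > 0 else p   # fall back if everything is cut
        S_trunc_inv = np.linalg.pinv(A.T @ (q[:, None] * A))
        i = np.random.choice(K, p=q)        # play a_t (= b_t in this sketch)
        history.append((A[i], S_trunc_inv))
        player_loss += losses[t] @ A[i]
        if t >= d:                          # delayed feedback for round t - d
            b_old, S_old_inv = history[t - d]
            obs = losses[t - d] @ b_old
            ell_hat = obs * (S_old_inv @ b_old)     # estimator (10)
            logw -= eta * (A @ ell_hat)             # MWU update, step 6
    best = (losses.sum(axis=0) @ A.T).min()
    return player_loss - best               # realized regret
```

A faithful implementation would instead sample from the log-concave density (9) over $B$, e.g., with the methods of [34], and decompose $b_t$ into a convex combination of actions as described in Section 4.2.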
We are now ready to prove Theorem 1.

**Proof of Theorem 1.** Combining Lemmas 4 and 5, we have
$$\mathbb{E}[R_T] \le 2\eta d T + \frac{1}{\eta}\,\mathbb{E}\Bigl[\sum_{t=1}^T \mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] + \frac{m\log T}{\eta} + 3. \qquad (14)$$
Let us bound $\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]$ using Lemma 3 and (11). We have
$$\eta^2\,\|\hat{\ell}_t\|_{S(p_{t+d})}^2 \le \Bigl(1+\frac{1}{d+1}\Bigr)^{d}\eta^2\,\|\hat{\ell}_t\|_{S(p_t)}^2 \le 4e\,\eta^2 m\gamma^2 \le \frac{1}{100}, \qquad (15)$$
where the first inequality follows from Lemma 6, the second inequality follows from (11), and the last inequality comes from the assumption $\eta \le \tfrac{1}{100\gamma^2(d\sqrt{m}+m)}$. From this inequality and the fact that $\hat{\ell}_t^\top x$ follows a log-concave distribution when $x \sim p_{t+d}$, we have
$$\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr] \le \eta^2\,\|\hat{\ell}_t\|_{S(p_{t+d})}^2 + 30\exp\Bigl(-\frac{1}{2\eta^2\|\hat{\ell}_t\|_{S(p_{t+d})}^2}\Bigr) \le \Bigl(1+\frac{1}{d+1}\Bigr)^{d}\eta^2\,\|\hat{\ell}_t\|_{S(p_t)}^2 + 30\exp(-50),$$
where the first and second inequalities follow from Lemma 3 and (15), respectively. From the definition (10) of $\hat{\ell}_t$, we have
$$\mathbb{E}\bigl[\|\hat{\ell}_t\|_{S(p_t)}^2\bigr] = \mathbb{E}\bigl[(\ell_t^\top a_t)^2\,\|S(p'_t)^{-1}b_t\|_{S(p_t)}^2\bigr] \le \Bigl(1+\frac{1}{T}\Bigr)\mathbb{E}\bigl[b_t^\top S(p'_t)^{-1}b_t\bigr] = \Bigl(1+\frac{1}{T}\Bigr)m \le \Bigl(1+\frac{1}{d+1}\Bigr)m,$$
where the first inequality follows from $|\ell_t^\top a_t| \le 1$ and the second part of (7), and the last inequality uses $d \le T-1$. Combining the above two inequalities, we obtain
$$\mathbb{E}\Bigl[\mathbb{E}_{x\sim p_{t+d}}\bigl[g\bigl(\eta\,\hat{\ell}_t^\top(\mu(p_{t+d}) - x)\bigr)\bigr]\Bigr] \le 2\eta^2\Bigl(1+\frac{1}{d+1}\Bigr)^{d+1}m \le 2e\,\eta^2 m.$$
From this and (14), we have
$$\mathbb{E}[R_T] \le 2\eta(d+em)T + \frac{m\log T}{\eta} + 3.$$
By substituting the parameter setting in Step 1 of Algorithm 1, we obtain the inequality of Theorem 1.

### 4.3 Algorithm for cooperative nonstochastic linear bandits

For the cooperative bandit problem, we consider a center-based algorithm [12]. A major difference between our algorithm and the one by Bar-On and Mansour [12] is that ours applies to linear bandit settings, while theirs focuses on multi-armed bandit settings. Similarly to [12], we first choose center agents $C \subseteq V$ and a corresponding partition $\{V_c\}_{c\in C}$ of the agents with the following properties:

**Theorem 5** (Theorem 12 in [12]). Given an undirected graph $G = (V, E)$ and a parameter $m \ge 2$, one can find center agents $C \subseteq V$ and a partition $\{V_c\}_{c\in C}$ of $V$ such that the following hold for all $c \in C$: (i) $\{c\} \cup N(c) \subseteq V_c$, (ii) the subgraph of $G$ induced by $V_c$ is connected, and (iii) for all $v \in V_c$, it holds that
$$\min\{|N(v)|, m\} \le \min\{|N(c)|, m\}\cdot\exp\bigl(d_c(v)\bigr),$$
where $d_c(v)$ denotes the distance from $c$ to $v$ in the subgraph of $G$ induced by $V_c$.

In the center-based algorithm, each center agent $c \in C$ updates its distribution by means of MWU, and the other agents $v \in V_c$ imitate the center agent $c$ on the basis of the messages $m_t$ in (4). More precisely, for each $c \in C$ we construct a breadth-first search tree $T_c$ of the subgraph induced by $V_c$ with root node $c$, and each agent $v \in V_c \setminus \{c\}$ uses the distribution $q_t^v$ defined by $q_t^v = q_{t-1}^{\mathrm{pa}(v)}$, where $\mathrm{pa}(v)$ denotes the parent node of $v$ in $T_c$. Consequently, for each $v \in V_c \setminus \{c\}$, the action distribution $q_t^v$ satisfies $q_t^v = q_{t-d_c(v)}^c$ for $t > d_c(v)$.

We define the distribution $q_t^c$ of each center agent $c \in C$ as follows. Define a distribution $p_t^c$ over $B = \mathrm{conv}(A)$ by MWU as
$$w_t^c(x) := \exp\Bigl(-\eta(c)\sum_{j=1}^{t-1}\hat{\ell}_j(c)^\top x\Bigr), \qquad p_t^c(x) = \frac{w_t^c(x)}{\int_{y\in B} w_t^c(y)\,dy}, \qquad (17)$$
where we set the truncation level $\gamma$ and the learning rate $\eta(c)$ as $\gamma = 4\log(10mT)$ and $\eta(c) = \min\Bigl\{\sqrt{\tfrac{m\log T}{T\,(1+\log m+m/\min\{|N(c)|,m\})}},\ \tfrac{1}{100\gamma^2 m}\Bigr\}$. Pick $b_t(c)$ from the truncated distribution $(p_t^c)'$ defined by (5), and pick $a_t(c)$ so that $\mathbb{E}[a_t(c) \mid b_t(c)] = b_t(c)$. The distribution $q_t^c$ is defined to be the distribution that $a_t(c)$ follows. The estimated loss $\hat{\ell}_t(c)$ is defined on the basis of the messages $m_t$ in (4) from the neighbors $N(c)$ of $c$. Each center agent $c \in C$ computes $\hat{\ell}_t(c)$ using $a_t(v)$ and $\ell_t(v)^\top a_t(v)$ for $v \in N(c)$ as follows:
$$\hat{\ell}_t(c) = \frac{1}{|N(c)|}\sum_{v\in N(c)} \bigl(\ell_t(v)^\top a_t(v)\bigr)\,S\bigl((p_t^c)'\bigr)^{-1} b'_t(v),$$
where $b'_t(v)$ is chosen from the posterior probability distribution of $b_t(v)$ given $a_t(v)$, i.e., $\mathrm{Prob}[b'_t(v) = b] \propto \mathrm{Prob}[a_t(v) \mid b_t(v) = b]\,(p_t^c)'(b)$. Combining Lemmas 3, 4, 5, 6 and Theorem 5, we obtain the regret bounds in Theorem 3. The proof of Theorem 3 is given in Section B in the appendix.
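As an illustration of how a center agent could turn its neighbors' messages into the averaged loss estimate described above, here is a small sketch; it simplifies the paper's construction by reusing each reported action $a_t(v)$ directly instead of drawing $b'_t(v)$ from the posterior, an assumption made purely for brevity.

```python
import numpy as np

def center_loss_estimator(messages, S_trunc_inv):
    """Combine neighbor reports into an estimate of ell_t at a center agent,
    in the spirit of Section 4.3. Each message is a pair
    (action a_t(v), observed loss ell_t(v)^T a_t(v)), cf. (4).
    The posterior draw b'_t(v) of the paper is replaced here by the
    reported action itself (a simplification for illustration)."""
    n = len(messages)
    estimate = np.zeros(S_trunc_inv.shape[0])
    for a_v, loss_v in messages:
        estimate += loss_v * (S_trunc_inv @ np.asarray(a_v))
    return estimate / n
```

Averaging over the $|N(c)|$ neighbors reduces the variance of the estimate at the center, which is consistent with the fact that the learning rate $\eta(c)$ above increases with $\min\{|N(c)|, m\}$.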
## 5 Lower Bounds

In this section, we briefly describe how we can prove the regret lower bounds in Theorems 2 and 4. Inspired by the proof of the lower bound for the multi-armed bandit problem with delayed feedback given in [18], we prove lower bounds by combining existing bounds for different settings, such as linear bandits without delay [21; 26] and full-information online optimization with delay [44].

To prove Theorem 2, we combine lower bounds of $\Omega(m\sqrt{T})$ and of $\Omega(\sqrt{mdT})$. We note that the first one comes from a lower bound for linear bandits without delay, shown, e.g., in [21]. The second one is a lower bound for online linear optimization with delayed feedback. As shown in [44], if an online optimization problem without delay admits a regret lower bound of $\Omega(L(T))$, a counterpart with $d$-round delayed feedback has an $\Omega(d\,L(T/d))$ lower bound. Since there is a lower bound of $\Omega(\sqrt{mT})$ for full-information online linear optimization (see, e.g., Theorem 3.2 in [24]), we have $\Omega(d\sqrt{m\,T/d}) = \Omega(\sqrt{mdT})$. The only remaining issue is that both regret bounds, $\Omega(m\sqrt{T})$ and $\Omega(\sqrt{mdT})$, must come from the same instance. To this end, we construct a problem instance over $A = \{-1, 1\}^m$ such that $\|\ell_t\|_1 \le 1$. By considering a mixed distribution of loss vectors, we obtain a lower bound of $\Omega(\sqrt{m(d+m)T})$. A complete proof is provided in Section C in the appendix.

Theorem 4 for cooperative linear bandits can, again, be obtained by combining lower bounds of $\Omega(\sqrt{mT})$ and of $\Omega(m\sqrt{T/|V|})$. The first one comes from a lower bound for full-information online linear optimization, as cooperative bandits are at least as hard as full-information online optimization. The second can come from the $\Omega(m\sqrt{T})$ lower bound for single-player linear bandits [21]. Since we can regard cooperative bandits as a harder version of single-player bandits with $T\,|V|$ rounds, the sum of regrets over all agents is at least $\Omega(m\sqrt{T\,|V|})$, which implies that the regret per agent is $\Omega(m\sqrt{T/|V|})$. A complete proof of Theorem 4, showing how we construct a problem instance that yields both regret bounds $\Omega(\sqrt{mT})$ and $\Omega(m\sqrt{T/|V|})$, is given in Section D in the appendix.

## 6 Conclusion

In this paper, we considered online linear optimization with $d$-round-delayed bandit feedback, where $d$ is a given parameter fixed over all rounds. We provided an algorithm that achieves nearly tight regret bounds, and extended this result to the cooperative bandit setting. An important direction for future work is to extend the model to deal with unknown and round-dependent delays, as in [46]. We believe that an adaptive way of tuning parameters such as $\eta$ and $\gamma$ in our algorithms would work well for this more general setting. Another future direction is to improve the practical computational cost. The proposed algorithms in this paper rely on continuous relaxation and sampling from log-concave distributions, which incurs a large computational cost in practice, though they run in polynomial time.

## Broader Impact

The authors believe that this paper presents neither ethical nor societal issues, as this is a theoretical work.

## Acknowledgments and Disclosure of Funding

This work was supported by JST, ERATO, Grant Number JPMJER1201, Japan. SI was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan. TF was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan. NK and KK were supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.

## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
[2] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, pages 3-11, 1999.

[3] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.

[4] J. D. Abernethy, E. Hazan, and A. Rakhlin. An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2008.

[5] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.

[6] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17-24, 2009.

[7] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

[8] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 217-226, 2009.

[9] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[11] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45-53, 2004.

[12] Y. Bar-On and Y. Mansour. Individual regret in cooperative nonstochastic multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3110-3120, 2019.

[13] R. Baraglia, P. Dazzi, M. Mordacchini, and L. Ricci. A peer-to-peer recommender system for self-emerging user communities based on gossip overlays. Journal of Computer and System Sciences, 79(2):291-308, 2013.

[14] I. Bistritz and A. Leshem. Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222-7232, 2018.

[15] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.

[16] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72-85. ACM, 2017.

[17] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

[18] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605-622, 2016.

[19] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Nonstochastic bandits with composite anonymous feedback. In Conference on Learning Theory, pages 750-773, 2018.

[20] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.

[21] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345-352, 2008.

[22] E. Hazan and Z. Karnin. Volumetric spanners: an efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062-4095, 2016.
[23] E. Hazan, W. Hu, Y. Li, et al. Online improper learning with an approximation oracle. In Advances in Neural Information Processing Systems, pages 5652-5660, 2018.

[24] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[25] E. Hillel, Z. S. Karnin, T. Koren, R. Lempel, and O. Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854-862, 2013.

[26] S. Ito. Submodular function minimization with noisy evaluation oracle. In Advances in Neural Information Processing Systems, pages 12080-12090, 2019.

[27] S. Ito, D. Hatano, H. Sumita, K. Takemura, T. Fukunaga, N. Kakimura, and K.-I. Kawarabayashi. Oracle-efficient algorithms for online linear optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 10590-10599, 2019.

[28] S. Ito, S. Hirahara, T. Soma, and Y. Yoshida. Tight first- and second-order regret bounds for adversarial linear bandits. In Advances in Neural Information Processing Systems, 2020, to appear.

[29] P. Joulani, A. Gyorgy, and C. Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453-1461, 2013.

[30] P. Landgren, V. Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243-248. IEEE, 2016.

[31] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, Revision 1699, 2019.

[32] B. Li, T. Chen, and G. B. Giannakis. Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 993-1002, 2019.

[33] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.

[34] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307-358, 2007.

[35] G. Neu, A. Antos, A. György, and C. Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804-1812, 2010.

[36] G. Neu, A. Gyorgy, C. Szepesvari, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676-691, 2014.

[37] C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pages 4105-4113, 2018.

[38] A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1-35, 2019.

[39] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.

[40] Y. Seldin, P. Bartlett, K. Crammer, and Y. Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In International Conference on Machine Learning, pages 280-287, 2014.

[41] B. Szorenyi, R. Busa-Fekete, I. Hegedüs, R. Ormandi, M. Jelasity, and B. Kégl. Gossip-based distributed stochastic bandit algorithms. In 30th International Conference on Machine Learning, volume 28, pages 19-27, 2013.

[42] T. S. Thune, N. Cesa-Bianchi, and Y. Seldin. Nonstochastic multiarmed bandits with unrestricted delays. In Advances in Neural Information Processing Systems, pages 6538-6547, 2019.
[43] C. Vernade, A. Carpentier, T. Lattimore, G. Zappella, B. Ermis, and M. Brueckner. Linear bandits with stochastic delayed feedback. arXiv preprint arXiv:1807.02089, 2018.

[44] M. J. Weinberger and E. Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959-1976, 2002.

[45] Z. Zhou, R. Xu, and J. Blanchet. Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems, pages 5198-5209, 2019.

[46] J. Zimmert and Y. Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. arXiv preprint arXiv:1910.06054, 2019.