# Multi-Agent Multi-Armed Bandits with Limited Communication

Journal of Machine Learning Research 23 (2022) 1-24. Submitted 2/21; Revised 5/22; Published 7/22.

Mridul Agarwal (agarw180@purdue.edu), Vaneet Aggarwal (vaneet@purdue.edu), Kamyar Azizzadenesheli (kamyar@purdue.edu), Purdue University.

Editor: Aurelien Garivier

This work was supported in part by the U.S. National Science Foundation under grant CCF-2149588. ©2022 Mridul Agarwal, Vaneet Aggarwal, Kamyar Azizzadenesheli. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0138.html.

Abstract

We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$-armed bandit problem for $K > N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration - Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of $O(\sqrt{(K/N + N)T})$, communicates for $O(\log T)$ steps, and broadcasts $O(\log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$ and diameter $D$ to propose LCC-UCB-GRAPH, which enjoys a regret bound of $O(D\sqrt{(K/N + K_G)DT})$. Finally, we empirically show that the LCC-UCB and the LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node.

1. Introduction

We consider a setup where $N$ agents, connected over a network, interact with a multi-armed bandit (MAB) environment (Lattimore and Szepesvári, 2020). The agents aim to collaborate with other agents in the network to minimize their regret. The agents also aim to reduce the number of messages and the size of the messages communicated with others.

Consider the case of an e-commerce company serving its users by recommending items from its vast catalog through multiple servers for quick response times. It attempts to learn the user preferences using a MAB algorithm. If each of the multiple servers runs its own algorithm, it wastes the large amount of data that the other servers collect. Conversely, if the servers communicate after every recommendation, the communication complexity among the servers becomes high.

As observed from the example above, communicating after each time step is not favorable because of the increased communication cost. If $N$ agents communicate after every round to reduce the regret for $T$ time steps, their total regret is lower bounded by the regret of a super-agent solving the MAB problem with $NT$ time steps. This bounds the total regret as at least $\Omega(\sqrt{NKT})$, or a per-agent regret of $\Omega(\sqrt{KT/N})$. In contrast, if the $N$ agents interact with the MAB problem independently, without any information exchange with other agents, the individual regret is upper bounded by $O(\sqrt{KT})$. We aim to find an algorithm that can obtain the regret bound of the super-agent setup, i.e., $O(\sqrt{KT/N})$, though with limited communication between the agents. We provide an algorithm, Limited Communication Collaboration - UCB (LCC-UCB), to minimize the regret.
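To make the target concrete, the super-agent benchmark is just the standard minimax bound applied to $NT$ total pulls:

$$\text{total regret} \geq \Omega\big(\sqrt{K \cdot NT}\big) \quad\Longrightarrow\quad \text{per-agent regret} \geq \tfrac{1}{N}\,\Omega\big(\sqrt{KNT}\big) = \Omega\big(\sqrt{KT/N}\big),$$

whereas $N$ isolated agents each pay the single-agent rate of order $\sqrt{KT}$, a factor of $\sqrt{N}$ more in aggregate.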
LCC-UCB divides the arms among multiple agents, such that each agent interacts with the MAB instance but plays arms only from a subset of all the arms. The algorithm proceeds in epochs that double in duration; the agents use the UCB algorithm to find the best arm in their smaller MAB problem and communicate at the end of each epoch. On receiving the messages from other agents, each agent updates its set of arms and restarts its algorithm. We prove the regret of LCC-UCB is upper bounded by $O(\sqrt{(K/N + N - 1)T})$. For $N = 1$, the regret of the LCC-UCB algorithm reduces to the standard regret bound of $O(\sqrt{KT})$.

We also consider a general setup where the network of agents may not be completely connected, and the agents may not be able to broadcast knowledge to all the other agents at once. For such a case, we propose the LCC-UCB-GRAPH algorithm, which subdivides epochs into sub-epochs of equal length. The agents restart their UCB algorithm in each sub-epoch with the new information available from their neighbors. We show that the regret bound of this modified algorithm with divided epochs changes to $O(D\sqrt{(K/N + K_G)DT})$, where $K_G$ is the maximum degree of the nodes in the graph. Also, the increased communication complexity of this algorithm is bounded by $O(K_G D \log T)$ message exchanges per node. The key novelty in both algorithms is that the gap between the recommended arms and the optimal arm shrinks with epochs.

Finally, we simulate and compare our algorithms with other communication protocols. We show that the algorithm behaves close to the communication strategy where the agents share their knowledge at each time step. For the LCC-UCB-GRAPH algorithm, we consider sparse graphs with more than 100 nodes. We observe that the LCC-UCB-GRAPH algorithm performs better (in terms of median cumulative regret per agent, over 30 independent runs) than the communication strategy where the agents share local data with all their neighbors at every time step. Further, the LCC-UCB and the LCC-UCB-GRAPH algorithms also outperform the DEMAB algorithm (Wang et al., 2020), where agents communicate for only $O(N\log(NK))$ rounds.

2. Related Works

The optimal action selection problem dates back to Thompson (1933), and since then many algorithms have been proposed and studied to solve the MAB problem, ranging from index-based policies (Gittins, 1979) and Optimism in the Face of Uncertainty based UCB algorithms (Auer, 2002; Auer and Ortner, 2010; Audibert et al., 2009) to the Thompson Sampling algorithm (Agrawal and Goyal, 2013). All these algorithms achieve an upper bound on regret of $O(\sqrt{KT})$ and match the lower bound of $\Omega(\sqrt{KT})$ up to logarithmic factors. Since then, various generalizations and extensions have been proposed to solve various online learning problems using a bandit framework (Abbasi-Yadkori et al., 2011; Li et al., 2010; Lattimore et al., 2018; Yang et al., 2020). However, all these works consider a single agent interacting with the environment.

Over the last decade, there has been a thrust in studying distributed agents solving an instance of the MAB problem. Kanade et al. (2012) consider a model where $N$ agents talk to a central controller at every round. However, they considered the problem of reducing the communication cost for each agent connected in a star topology with a controller as the central node, unlike our setup, where we allow any topology, including ones without a central node/agent.
Hillel et al. (2013) consider the problem of reducing communication cost for stochastic bandits in a setup where every agent can communicate with every other agent. Their work also bounds the total communication rounds by $O(\log_2 T)$ using an action-elimination-based algorithm. However, their agents communicate the estimates of arm rewards for all the $K$ arms in each message, whereas we bound the number of bits required in each message by $O(\log_2 K)$. Shahrampour et al. (2017) consider a setup where multiple agents collectively select an arm at a time step and observe different rewards sampled from different distributions for each agent.

Other works consider a setup where the agents talk to only one of the other nodes in a network at any given time step (gossiping-style algorithms) (Landgren et al., 2016; Martínez-Rubio et al., 2019; Wang et al., 2020). However, they allow their agents to communicate at every time step, which is a different setup, and they do not optimize a regret-communication trade-off. Further, they also send estimates of arm rewards in each message. Sankararaman et al. (2019) and Chawla et al. (2020) also consider gossip-style algorithms. Similar to us, these works divide the time horizon into epochs of variable length. Their strategies also divide the arms among the agents, and the agents unicast the knowledge of the best arm they have using $O(\log K)$ bits in each epoch. However, because of gossip-style communication protocols, an agent becomes aware of the best arm only after it has already incurred $O(1/\Delta_2^2)$ regret, which translates to a problem-independent bound of $O(T^{2/3})$. We note that we use the same number of communication rounds as these papers while achieving a better regret bound of $O(T^{1/2})$. Further, we can convert the proposed broadcast-based communication of our work to a unicast-based strategy by sending a message to one neighbor per time step over $N$ time steps.

Wang et al. (2019) and Dubey and Pentland (2020a,b) consider the problem of distributed linear bandits. They considered a fully connected network to reduce the communication messages and reduce the average regret for $N$ agents. In contrast, we aim to find bounds on the regret of each of the $N$ agents for $K$-armed stochastic bandits. Wang et al. (2019) propose the DEMAB algorithm for a distributed bandit setup where all the nodes communicate with a central node. The setup assumes knowledge of the time horizon to cleverly obtain a bound on the number of communication messages independent of time. The DEMAB algorithm is based on action elimination and also proceeds in epochs with duration growing exponentially, after an initial period of length $T/(NK)$ in which every agent eliminates arms independently. In each epoch, the algorithm generates new estimates of arm rewards, discarding the old samples. This results in large constants, $O(2^{14})$, in the regret term. The regret bound of the proposed LCC-UCB algorithm exceeds the regret of DEMAB only for $\log_2 T > 2^{14}/144$. Additionally, the DEMAB algorithm requires a central coordinating node, which may not always be available. Lastly, for an unknown time horizon, its number of messages increases back to $O(\log T)$, the same as ours.

The proposed algorithm, LCC-UCB, obtains $O(\sqrt{(K/N + N)T})$ regret for each agent with messages of size $O(\log K)$ and a total of $O(\log T)$ messages, thus achieving a regret of $O(\sqrt{T})$. Additionally, the proposed LCC-UCB-GRAPH algorithm works well on sparse graphs with a large number of agents, with a communication complexity of $O(D\log_2 T)$.
A summary of recent algorithms is provided in Table 1.

| Algorithm | Regret | Bits per message | Rounds | Comments |
|---|---|---|---|---|
| DEMAB (Wang et al., 2020) | $O(\sqrt{\frac{K}{N}T})$ | $O(\log T)$ | $O(N\log NK)$ | known $T$, client-server messages |
| GosInE (Chawla et al., 2020) | $O((\frac{K}{N} + 2)^{1/3}T^{2/3})$ | $O(\log K)$ | $O(\log T)$ | unicast messages, general graphs |
| LCC-UCB (this work) | $O(\sqrt{(\frac{K}{N} + N)T})$ | $O(\log K)$ | $O(\log T)$ | broadcast messages, completely connected graphs |
| LCC-UCB-GRAPH (this work) | $O(\sqrt{(\frac{K}{N} + K_G)D^3T})$ | $O(\log K)$ | $O(D\log T)$ | broadcast messages, general graphs |

Table 1: Summary of the baseline algorithms and the algorithms presented in this work.

3. Problem Formulation

We consider a completely connected network of $N$ agents, indexed as $n \in [N] = \{1, 2, \ldots, N\}$. We also consider $N$ independent instances of the same stochastic $K$-armed bandit. Each agent $n \in [N]$ interacts with a fixed bandit instance $m \in [N]$ over $T$ time steps. For simplicity, we assume $m = n$. Let $\{X_{i,m,t}\}_{i\in[K], m\in[N], t\in[T]}$ be a sequence of random variables defined on $(\Omega, \mathcal{F})$, where $\Omega = [0,1]^{KNT}$ and $\mathcal{F} = \mathcal{B}(\Omega)$. For each $i$, we assume that the $X_{i,m,t}$ are identically distributed and are independent across all $N$ instances and $T$ time steps. Let $\mu_i$ denote the expected value of the random variables $X_{i,m,t}$ for all $m$, $t$, and $i$.

At time $t$, agent $n \in [N]$ selects arm $i_n(t)$ using a deterministic policy and observes $r_{n,t} = X_{i_n(t),n,t}$. A super-agent policy would ideally have all $N$ agents communicating all information at each time step to select arms $i_n(t)$ that are $\mathcal{F}_{t-1} = \{i_1(1), r_{1,1}, \ldots, i_N(1), r_{N,1}, \ldots, i_1(t-1), r_{1,t-1}, \ldots, i_N(t-1), r_{N,t-1}\}$-measurable. However, the agents do not communicate at every time step and communicate only arm indices; hence the arm $i_n(t)$ selected by agent $n$ is $\mathcal{F}_{n,t-1}$-measurable, where $\mathcal{F}_{n,t-1}$ is the local information of arms played and their rewards available to agent $n$, together with the indices of the arms communicated by the other agents. $\mathcal{F}_{n,t-1}$ will be defined precisely in Section 5.

For our analysis, we assume that $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_K$. However, the ordering is unknown to the agents. We also define the gap between the expected reward of the best arm and the expected reward of arm $i$ as $\Delta_i := \mu_1 - \mu_i$. For our analysis we assume $0 \leq \mu_i \leq 1$ for all $i \in [K]$.

For our system model, we assume that $N \leq K$, as observed in many practical setups. For example, an e-commerce website will have many more products listed than the number of servers deployed. Since our agents are completely connected, all agents can communicate with each other (we later relax this assumption in Section 6). This implies that whenever an agent broadcasts a message, all the other $N - 1$ agents receive the message. Further, we assume that each agent only communicates the index of the best arm it knows. This requires $\log(K)$ bits for every message, and since there are $N - 1$ other agents to send the message to, the total number of bits required by any agent is $(N-1)\log(K)$ bits in every communication round. We assume that each agent $n \in [N]$ interacts with the bandit environment an equal number of times in an epoch. Further, the synchronization time at the end of each epoch is assumed to be negligible.

An agent $n$ aims to minimize its cumulative regret over time $T$, $R_n(T)$, defined as:

$$R_n(T) = T\mu_1 - \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{K}\mu_i \mathbb{1}\{i_n(t) = i\}\right]. \quad (1)$$

Note that minimizing the regret $R_n(T)$ for all agents $n \in [N]$ also minimizes the total cumulative regret over the agents as well.

4. LCC-UCB Algorithm

We design our algorithm LCC-UCB based on the fact that the regret of UCB algorithms (Auer, 2002; Bubeck et al., 2011; Lattimore and Szepesvári, 2020) scales as $O(\sqrt{KT})$. We reduce the per-step regret by distributing the $K$ arms among the $N$ agents in epochs of growing length. An agent $n$ chooses to interact with a potentially smaller set of arms $S_n$, where $S_n = \{((n-1)\lceil K/N\rceil \bmod K) + 1, \ldots, ((n\lceil K/N\rceil - 1) \bmod K) + 1\}$.
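For concreteness, here is a minimal sketch of this round-robin assignment; the helper name and harness are ours, not from the paper's code:

```python
from math import ceil

def initial_arm_set(n: int, N: int, K: int) -> list[int]:
    """Arms (1-indexed) initially assigned to agent n, following
    S_n = {((n-1)*ceil(K/N) mod K) + 1, ..., ((n*ceil(K/N) - 1) mod K) + 1}.
    Every arm is covered and each agent starts with ceil(K/N) arms."""
    width = ceil(K / N)
    return [((n - 1) * width + k) % K + 1 for k in range(width)]

# Example: K = 7 arms split among N = 3 agents; agent 3 wraps around.
print(initial_arm_set(1, 3, 7))  # [1, 2, 3]
print(initial_arm_set(2, 3, 7))  # [4, 5, 6]
print(initial_arm_set(3, 3, 7))  # [7, 1, 2]
```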
For the first epoch, i.e., $j = 0$, each agent starts with possibly sub-optimal arms, even the worst possible arms. As the algorithm proceeds, at the end of each epoch $j - 1$, agents broadcast the arm most played by the UCB algorithm during that epoch to all the other agents. Each agent $n \in [N]$ receives $R_{n,j}$, a set of arm recommendations from the other $N - 1$ agents. The agent then runs the UCB algorithm (Bubeck et al., 2011) over the arms in the augmented set $A_{n,j} = S_n \cup R_{n,j}$. At the end of any epoch, the agent purges any old recommendations it has and starts again with the new recommendations received after the epoch. This ensures that the number of arms with any agent does not exceed $K' := \lceil K/N\rceil + N - 1$. This approach helps to bound the regret of any agent $n \in [N]$ by $O(\sqrt{(\lceil K/N\rceil + N - 1)T})$.

The LCC-UCB algorithm running at an agent $n \in [N]$ is described in Algorithm 1. The algorithm at agent $n$ receives the set of initial arms $S_n$, the indices of the other agents, and the total horizon $T$. At each epoch $j$, agent $n$ maintains a set $R_{n,j}$ of the arms received from the remaining $[N]\setminus\{n\}$ agents. For the first epoch, $R_{n,0} = \emptyset$, as the agent has not yet heard anything from the remaining agents, and the augmented set equals the initial set of arms, $A_{n,0} = S_n$. As the algorithm proceeds, it runs the UCB algorithm (Auer et al., 2002; Bubeck et al., 2011), described in Algorithm 2, on the arms in the augmented set $A_{n,j}$ for an epoch duration of $K'(K'+1)2^j$. If at time $t$ the remaining time is not sufficient to run a complete epoch of duration $T_j$, it just runs the UCB algorithm for the remaining horizon $T - t$.

Algorithm 1 LCC-UCB($n$, $S_n$, $[N]\setminus\{n\}$, $T$)
1: $s = 0$, $j = 0$, $K' = |S_n| + N - 1$
2: $R_{n,0} = \emptyset$
3: while $s < T$ do
4:   Set augmented set $A_{n,j} = S_n \cup R_{n,j}$
5:   $i^* = \mathrm{UCB}(n, A_{n,j}, \min(T - s,\ K'(K'+1)2^j))$
6:   $s = s + \min(T - s,\ K'(K'+1)2^j)$
7:   $j = j + 1$
8:   Send $i^*$ to the other $[N]\setminus\{n\}$ agents
9:   Receive the most played arms of the other $[N]\setminus\{n\}$ agents as $R_{n,j}$
10: end while

Algorithm 2 UCB($n$, $A$, $T_j$)
1: Initialize $N_i(0) = 0$, $\hat\mu_i = 0$ for all $i \in A$
2: for $t_j = 1, \ldots, T_j$ do
3:   Obtain reward $r_{n,t}$ by playing arm $i_n(t)$, where $i_n(t) = \arg\max_{i\in A}\left(\hat\mu_i + \sqrt{\frac{2\log t_j}{N_i(t_j-1)}}\right)$
4:   $N_i(t_j) = N_i(t_j - 1) + \mathbb{1}\{i_t = i\}$ for all $i \in A$
5:   Update $\hat\mu_{i_t} = \frac{\hat\mu_{i_t}N_{i_t}(t_j-1) + r_{n,t}}{N_{i_t}(t_j)}$
6: end for
7: Return $i^* = \arg\max_{i\in A} N_i(T_j)$
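The following is a compact, runnable sketch of Algorithms 1 and 2 on a simulated Bernoulli bandit. The simulation harness and function names are ours; it is a sketch of the scheme, not the authors' implementation:

```python
import numpy as np

def ucb_epoch(rng, mu, arms, horizon):
    """Algorithm 2 sketch: run UCB over `arms` for `horizon` steps on a
    Bernoulli bandit with mean vector `mu`; return the most played arm
    and the regret incurred during the epoch."""
    arms = sorted(arms)
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    regret = 0.0
    for t in range(1, horizon + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            a = untried[0]  # play every arm once before using UCB indices
        else:
            a = max(arms, key=lambda i: means[i] + np.sqrt(2 * np.log(t) / counts[i]))
        r = rng.binomial(1, mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        regret += mu.max() - mu[a]
    return max(arms, key=lambda i: counts[i]), regret

def lcc_ucb(rng, mu, N, T):
    """Algorithm 1 sketch, simulating all N agents of a broadcast network."""
    K = len(mu)
    width = int(np.ceil(K / N))
    S = [{(n * width + k) % K for k in range(width)} for n in range(N)]  # 0-indexed S_n
    Kp = width + N - 1                   # K' = ceil(K/N) + N - 1
    recs = [set() for _ in range(N)]     # R_{n,j}: latest recommendations only
    regret = np.zeros(N)
    s, j = 0, 0
    while s < T:
        Tj = min(T - s, Kp * (Kp + 1) * 2 ** j)
        best = []
        for n in range(N):
            i_star, r = ucb_epoch(rng, mu, S[n] | recs[n], Tj)
            best.append(i_star)
            regret[n] += r
        # Broadcast: every agent receives the others' most played arms and
        # purges the previous epoch's recommendations.
        recs = [{best[m] for m in range(N) if m != n} for n in range(N)]
        s += Tj
        j += 1
    return regret

rng = np.random.default_rng(0)
mu = rng.uniform(size=50)
print(lcc_ucb(rng, mu, N=5, T=20000))  # per-agent cumulative regret
```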
5. Main Result

We now state the main result bounding the regret and the number of communication rounds for the proposed LCC-UCB algorithm.

Theorem 1 The regret of any agent $n$ following the LCC-UCB algorithm is bounded as
$$R_n(T) = O\left(\sqrt{K'T\log T}\right), \quad (2)$$
where $K' = \lceil K/N\rceil + N - 1$.

To prove Theorem 1, we first state the necessary lemmas required for the construction of the proof. We start by making two important observations for our analysis. First, communication does not happen at every time step. Second, any agent $n$ only receives the index of the most played arm from the other agents at the end of every epoch, and does not receive any additional information about the arms played and the rewards obtained by the other agents in the epoch.

Thus, following the LCC-UCB algorithm, in epoch $j \geq 1$, an agent uses a policy which is $\mathcal{F}_{n,t-1} = \{i^*_1, \ldots, i^*_{n-1}, i^*_{n+1}, \ldots, i^*_N, i_n(s), r_{n,s}, \ldots, i_n(t-1), r_{n,t-1}\}$-measurable, where $s$ is the first time step of epoch $j$ and $\{i^*_1, \ldots, i^*_{n-1}, i^*_{n+1}, \ldots, i^*_N\}$ are the most played arms of the other agents, which agent $n$ receives at the end of the previous epoch at time step $s - 1$.

For our setup, we assumed that the reward samples obtained by playing an arm are independent across time. This allows us to use Hoeffding's bound, stated in the following lemma.

Lemma 2 (Hoeffding, 1994, Hoeffding's bound) If $X_1, X_2, \ldots, X_n$ are $n$ independent random variables such that $X_i \in [0,1]$ for all $i = 1, \ldots, n$, then
$$\Pr\left(\frac{X_1 + \cdots + X_n}{n} - \mathbb{E}\left[\frac{X_1 + \cdots + X_n}{n}\right] \geq \epsilon\right) \leq \exp(-2n\epsilon^2), \quad (3)$$
and
$$\Pr\left(\mathbb{E}\left[\frac{X_1 + \cdots + X_n}{n}\right] - \frac{X_1 + \cdots + X_n}{n} \geq \epsilon\right) \leq \exp(-2n\epsilon^2). \quad (4)$$

Note that the LCC-UCB algorithm bounds the regret when agent 1 recommends an arm $i^*$ which is close to the best arm ($i = 1$) from its augmented set $A_{1,j}$ at every epoch, and then, in the following epoch, every other agent $n$ minimizes the regret with respect to its augmented set $A_{n,j+1}$, which now contains the arm $i^*$. Since each agent runs the UCB algorithm (Algorithm 2), which returns the most played arm of each epoch, we want to analyse the properties of the most played arm. We now state and prove the lemma that the arm most played by the UCB algorithm is "good", i.e., $\mu_i \geq \mu_1 - \epsilon_j$, with high probability, for a suitable $\epsilon_j$.

Lemma 3 For any epoch $j$ such that $T_j \geq K'(K'+1)$, the instance of the UCB algorithm (Algorithm 2) running at agent 1 returns an arm $i^*_j$ that satisfies $\mu_{i^*_j} \geq \mu_1 - \epsilon_j$ with probability at least
$$1 - K'\left(\frac{T_j}{K'} - 1\right)^{-2}, \quad (5)$$
where $\epsilon_j := \sqrt{8K'\log T/T_j}$.

Proof We first note that the augmented set at agent 1 contains the best arm 1, as arm index $((n-1)\lceil K/N\rceil \bmod K) + 1 \in S_n$ for $n = 1$. From the Algorithm 2 instance that ran in epoch $j$, $N_i(T_j)$ is the number of times arm $i \in A_{1,j}$ is played in epoch $j$. We now prove that the arm $i^*_j = \arg\max_{i\in A_{1,j}} N_i(T_j)$ is at most $\epsilon_j$ away from the true optimal arm 1.

For time step $t_j$ in epoch $j$, we construct the event where arm $i$ is selected after its total plays $N_i(t_j - 1)$ have exceeded some number $l_i$:
$$G_{t_j}(i) = \{\{i_t = i\} \cap \{N_i(t_j - 1) \geq l_i\}\},$$
for $l_i = 1 + \frac{8\log T}{\Delta_i^2}$ and $t_j \geq K' + 1$, as each arm is played at least once.

We first bound the probability of the event $G_{t_j}(i)$ using the probability measure induced on the observed samples as a result of the policy's interaction with the $K$-armed bandit instance 1. Note that the most played arms returned by agents $2 \leq n \leq N$ in epoch $j - 1$ lie in the set $A_{1,j}$, i.e., $\{i^*_{2,j-1}, \ldots, i^*_{N,j-1}\} \subseteq A_{1,j}$. Conditioned on $\mathcal{F}_{1,t-1} = \{i^*_{2,j-1}, \ldots, i^*_{N,j-1}, i_n(s), r_{n,s}, \ldots, i_n(t-1), r_{n,t-1}\}$, we bound $\Pr(G_{t_j}(i) \mid \mathcal{F}_{1,t-1})$ as follows (below, "$\cdot$" abbreviates the conditioning on $\{N_i(t_j-1)\geq l_i\}$ and $\mathcal{F}_{1,t-1}$):

$\Pr(G_{t_j}(i)\mid\mathcal{F}_{1,t-1}) = \Pr(\{i_t = i\} \cap \{N_i(t_j-1)\geq l_i\} \mid \mathcal{F}_{1,t-1})$ (6)

$\leq \Pr(\{i_t = i\} \mid \{N_i(t_j-1)\geq l_i\}, \mathcal{F}_{1,t-1})$ (7)

$\leq \Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{N_i(t_j)}} \geq \hat\mu_{i'} + \sqrt{\tfrac{2\log t_j}{N_{i'}(t_j)}}\ \forall\, i' \in A_{1,j} \,\Big|\, \cdot\right)$ (8)

$\leq \Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{N_i(t_j)}} \geq \hat\mu_1 + \sqrt{\tfrac{2\log t_j}{N_1(t_j)}} \,\Big|\, \cdot\right)$ (9)

$\leq \Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{N_i(t_j)}} \geq \min_{1\leq N_1(t_j)\leq t_j}\left(\hat\mu_1 + \sqrt{\tfrac{2\log t_j}{N_1(t_j)}}\right) \,\Big|\, \cdot\right)$ (10)

$\leq \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{N_i(t_j)}} \geq \hat\mu_1 + \sqrt{\tfrac{2\log t_j}{N_1(t_j)}} \,\Big|\, \cdot\right)$ (11)

$\leq \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{l_i}} \geq \hat\mu_1 + \sqrt{\tfrac{2\log t_j}{N_1(t_j)}} \,\Big|\, \cdot\right)$ (12)

$\leq \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_1 + \sqrt{\tfrac{2\log t_j}{N_1(t_j)}} < \mu_1 \,\Big|\, \cdot\right) + \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_i + \sqrt{\tfrac{2\log t_j}{l_i}} \geq \mu_1 \,\Big|\, \cdot\right)$ (13)

$\leq \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_1 - \mu_1 < -\sqrt{\tfrac{2\log t_j}{N_1(t_j)}} \,\Big|\, \cdot\right) + \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_i - \mu_i \geq \tfrac{\Delta_i}{2} \,\Big|\, \cdot\right)$ (14)

$\leq \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_1 - \mu_1 < -\sqrt{\tfrac{2\log t_j}{N_1(t_j)}}\right) + \sum_{N_1(t_j)=1}^{t_j}\Pr\left(\hat\mu_i - \mu_i \geq \tfrac{\Delta_i}{2} \,\Big|\, \{N_i(t_j-1)\geq l_i\}\right)$ (15)

$\leq \sum_{N_1(t_j)=1}^{t_j}\exp\left(-2N_1(t_j)\tfrac{2\log t_j}{N_1(t_j)}\right) + \sum_{N_1(t_j)=1}^{t_j}\exp\left(-2l_i\tfrac{\Delta_i^2}{4}\right)$ (16)

$\leq \sum_{N_1(t_j)=1}^{t_j}\left(\tfrac{1}{t_j^4} + \tfrac{1}{t_j^4}\right) = \tfrac{2}{t_j^3}$ (17)

Equation (8) follows from the fact that the UCB algorithm selects arm $i$ only if the upper confidence bound of arm $i$ is the highest among all the arms in the set $A_{1,j}$.
Note that the true best arm, arm 1, lies in the set $A_{1,j}$, which gives Equation (9). Equation (11) follows by taking a union bound over all possible values of $N_1(t_j)$. Equation (12) follows by replacing $N_i(t_j)$ with the lower bound obtained from the conditioning. Equation (14) follows because the events in Equation (14) contain the events in Equation (13), using $\mu_1 = \mu_i + \Delta_i$ and $\sqrt{2\log t_j/l_i} \leq \Delta_i/2$. The first term in Equation (15) holds because the confidence interval for arm 1 is independent of the number of samples of arm $i$. Equation (16) follows from Hoeffding's concentration bound (Lemma 2). Using the law of total probability, we get $\Pr(G_{t_j}(i)) \leq 2t_j^{-3}$.

We can now use the probability of the event $G_{t_j}(i)$ to bound the probability that the number of plays of an arm exceeds $l_i$, using a union bound. Specifically, we have:

$\Pr(N_i(T_j) \geq l_i) \leq \sum_{t_j = l_i}^{T_j}\Pr(G_{t_j}(i))$ (18)

$\leq \sum_{t_j=l_i}^{T_j} 2t_j^{-3}$ (19)

$\leq \int_{l_i - 1}^{\infty} 2t^{-3}\,dt$ (20)

$= \frac{1}{(l_i - 1)^2}$ (21)

Now, for an arm $i$ such that $\Delta_i > \sqrt{8K'\log T/T_j} =: \epsilon_j$, we have
$$l_i = 1 + \frac{8\log T}{\Delta_i^2} < 1 + \frac{T_j}{K'}.$$
Since $T_j/K' \geq l_i - 1$ for such arms, repeating Equations (18)-(21) with $l_i$ replaced by $T_j/K'$ shows that $N_i(T_j) < T_j/K'$ with probability at least $1 - (T_j/K' - 1)^{-2}$.

Now, let $B_{1,j} = \{i \in A_{1,j} \mid \mu_i < \mu_1 - \epsilon_j\}$ be the set of bad arms in the augmented set of agent 1 in epoch $j$. Then for any arm $i \in B_{1,j}$, we have $N_i(T_j) < T_j/K'$ with probability at least $1 - (T_j/K' - 1)^{-2}$. Thus,

$\Pr(N_i(T_j) \geq T_j/K' \text{ for any } i \in B_{1,j}) = \Pr\left(\bigcup_{i\in B_{1,j}}\{N_i(T_j) \geq T_j/K'\}\right)$ (25)

$\leq \sum_{i\in B_{1,j}}\Pr(N_i(T_j) \geq T_j/K')$ (26)

$\leq \sum_{i\in B_{1,j}}(T_j/K' - 1)^{-2}$ (27)

$\leq \sum_{i\in A_{1,j}}(T_j/K' - 1)^{-2}$ (28)

$= K'(T_j/K' - 1)^{-2}$ (29)

Thus, the probability that $N_i(T_j) < T_j/K'$ for all $i \in B_{1,j}$ is at least $1 - K'(T_j/K' - 1)^{-2}$.

After bounding, with high probability, the number of plays of arms $i$ with $\mu_i < \mu_1 - \epsilon_j$, we show that the most played arm $i^*_j$ has expected reward $\mu_{i^*_j} \geq \mu_1 - \epsilon_j$. We have:

$\max_{i\in A_{1,j}\setminus B_{1,j}} N_i(T_j) \geq \frac{1}{|A_{1,j}\setminus B_{1,j}|}\sum_{i\in A_{1,j}\setminus B_{1,j}} N_i(T_j)$ (30)

$= \frac{1}{|A_{1,j}\setminus B_{1,j}|}\left(T_j - \sum_{i\in B_{1,j}} N_i(T_j)\right) > \frac{1}{|A_{1,j}\setminus B_{1,j}|}\left(T_j - |B_{1,j}|\frac{T_j}{K'}\right) = \frac{1}{|A_{1,j}\setminus B_{1,j}|}\cdot\frac{K' - |B_{1,j}|}{K'}\,T_j \geq \frac{1}{|A_{1,j}\setminus B_{1,j}|}\cdot\frac{|A_{1,j}\setminus B_{1,j}|}{K'}\,T_j = \frac{T_j}{K'}$

This proves that the most played arm in $A_{1,j}$, $i^*_j$, is at most $\epsilon_j$ away from the optimal arm 1, with the probability given in Equation (5).

After showing that agent 1 returns a good arm after each epoch, we now show that the regret of all the other agents is bounded in the following epoch $j + 1$. Lemma 4 bounds the regret of an agent $n$ running the UCB algorithm (Algorithm 2) during an epoch $j$. We then sum over all the epochs to obtain the total regret of the algorithm. We focus our analysis on one agent $n$; the analysis of the remaining agents follows identically.

Lemma 4 (UCB regret bound) The regret of any agent $n$ running the UCB algorithm described in Algorithm 2 for an epoch $j \geq 2$ with $T_j$ time steps is upper bounded by
$$R(T_j) \leq 4\sqrt{2K'T_j\log T} + \frac{16K'^3}{T_j} + 2K'. \quad (36)$$

Proof We first consider the case of an agent $n \neq 1$. The agent receives recommendations from all the other $N - 1$ agents, including agent 1, and hence its augmented set contains the arm $i^*$ recommended by agent 1. To analyze the regret, we first define some events. The first event denotes the case where agent 1, at the end of epoch $j - 1$, recommends an arm $i^*$ such that $\mu_{i^*} \geq \mu_1 - \epsilon_{j-1}$; we denote this event by $G_1$. Further, $N_i(T_j)$ is the number of times the agent plays arm $i \in A_{n,j}$ in epoch $j$. We note that when the event $G_1$ occurs, $\Delta_{i^*} \leq \epsilon_{j-1}$. We assume that $i^*$ satisfies $\mu_{i^*} = \max_{i\in A_{n,j}}\mu_i$. In case this assumption is not valid, we redefine $i^*$ as $i^* = \arg\max_{i\in A_{n,j}}\mu_i$, and we still have $\mu_1 - \mu_{i^*} \leq \epsilon_{j-1}$. Also, for simplicity of notation, we define $\Delta_{i^*,i} = \mu_{i^*} - \mu_i$.
Then, using the regret decomposition lemma (Lattimore and Szepesvári, 2020, Lemma 4.5), the regret of the UCB algorithm for epoch $j$ is upper bounded as:

$R(T_j) \leq \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_i N_i(T_j)]$ (37)

$= \sum_{i\in A_{n,j}}\mathbb{E}[(\mu_1 - \mu_i)N_i(T_j)]$ (38)

$= \sum_{i\in A_{n,j}}\mathbb{E}[(\mu_1 - \mu_{i^*} + \mu_{i^*} - \mu_i)N_i(T_j)]$ (39)

$= \sum_{i\in A_{n,j}}\mathbb{E}[(\Delta_{i^*} + \Delta_{i^*,i})N_i(T_j)]$ (40)

$= \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*}N_i(T_j)] + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (41)

$= \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*}N_i(T_j)\mid G_1]\Pr(G_1) + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*}N_i(T_j)\mid G_1^c]\Pr(G_1^c) + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (42)

$\leq \epsilon_{j-1}\sum_{i\in A_{n,j}}\mathbb{E}[N_i(T_j)\mid G_1]\Pr(G_1) + \sum_{i\in A_{n,j}}\mathbb{E}[N_i(T_j)\mid G_1^c]\Pr(G_1^c) + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (43)

$\leq \epsilon_{j-1}\,\mathbb{E}\Big[\sum_{i\in A_{n,j}}N_i(T_j)\,\Big|\,G_1\Big] + \Pr(G_1^c)\,\mathbb{E}\Big[\sum_{i\in A_{n,j}} N_i(T_j)\,\Big|\,G_1^c\Big] + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (44)

$\leq \epsilon_{j-1}T_j + K'\left(\frac{T_{j-1}}{K'} - 1\right)^{-2}T_j + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (45)

$\leq \sqrt{\frac{8K'\log T}{T_{j-1}}}\,T_j + \frac{16K'^3}{T_j} + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (46)

$\leq 2\sqrt{2K'T_j\log T} + \frac{16K'^3}{T_j} + \sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)]$ (47)

Equation (43) uses $\Delta_{i^*} \leq \epsilon_{j-1}$ on $G_1$ and $\Delta_{i^*} \leq 1$ on $G_1^c$; Equation (45) uses Lemma 3 for epoch $j-1$ and $\sum_i N_i(T_j) = T_j$; Equation (46) uses $T_{j-1} \geq K'(K'+1)/2$ to bound the second term.

We now focus on the last term. We define the event where the UCB algorithm plays arm $i$ after the number of plays of arm $i$ has crossed $l_i$:
$$G_{n,i}(t_j) = \{\{i_t = i\} \cap \{N_i(t_j-1) \geq l_i\}\}, \quad \text{where } l_i = 1 + \frac{8\log T}{\Delta_{i^*,i}^2}. \quad (48)$$
Again, as in Lemma 3, we use Hoeffding's concentration bound to upper bound the probability of the event $G_{n,i}(t_j)$ by $2t_j^{-3}$. Then we can bound the last term in Equation (47) as:

$\sum_{i\in A_{n,j}}\mathbb{E}[\Delta_{i^*,i}N_i(T_j)] \leq \sum_{i\in A_{n,j}}\left(\Delta_{i^*,i}\,l_i + \sum_{t_j=l_i}^{T_j}\Pr(G_{n,i}(t_j))\right)$ (49)

$\leq \sum_{i\in A_{n,j}}\Delta_{i^*,i}\left(1 + \frac{8\log T}{\Delta_{i^*,i}^2}\right) + \sum_{i\in A_{n,j}}\sum_{t_j=1}^{T_j}2t_j^{-2}$ (50)

$\leq \sqrt{8K'T_j\log T} + K' + \frac{K'\pi^2}{3}$ (51)

Replacing this value in Equation (47), we get the required result for $n \neq 1$. Further, note that for $n = 1$, the true optimal arm 1 is always present in $A_{1,j}$ for all $j \geq 1$, and the same argument applies with $i^* = 1$.

We are now ready to prove Theorem 1. We first note that for epoch $j = 0$, no agents have yet communicated, and hence the regret of any agent is trivially bounded by $T_0 = K'(K'+1)$. For the later epochs, we sum the regret incurred in each epoch using Lemma 4. To do so, we first bound the total number of epochs. Let the total number of epochs be $J$; then, noting that the total number of time steps is $T$, we get:
$$\sum_{j=0}^{J-1}K'(K'+1)2^j < 2T \implies 2^J - 1 < \frac{2T}{K'(K'+1)} \implies J \leq \log_2\left(\frac{2T}{K'(K'+1)} + 1\right).$$

After bounding the regret $R(T_j)$ in each epoch and bounding the total number of epochs, we can bound the total regret as:

$R_n(T) \leq \sum_{j=0}^{J-1}R(T_j)$ (53)

$\leq \sum_{j=1}^{J-1}\left(4\sqrt{2K'T_j\log T} + \frac{16K'^3}{T_j}\right) + 3K'\log_2(2T+1) + K'(K'+1)$ (54)

$\leq 4\sqrt{2K'\log T}\sum_{j=1}^{J-1}\sqrt{T_j} + 16K' + 3K'\log_2(2T+1) + K'(K'+1)$ (55)

$= 4\sqrt{2K'\log T}\sqrt{K'(K'+1)}\sum_{j=1}^{J-1}2^{j/2} + 16K' + 3K'\log_2(2T+1) + K'(K'+1)$ (56)

$\leq \frac{4\sqrt{2K'\log T}}{\sqrt{2}-1}\sqrt{K'(K'+1)2^J} + 16K' + 3K'\log_2(2T+1) + K'(K'+1)$ (58)

$\leq 12\sqrt{2K'T\log T} + 16K' + 3K'\log_2(2T+1) + K'(K'+1)$, (60)

where Equation (58) follows from summing the geometric progression, and Equation (60) uses $K'(K'+1)2^J \leq 2T + K'(K'+1) \leq 4T$. This proves Theorem 1, as the leading term dominates.

Theorem 5 For the LCC-UCB algorithm, the total number of bits exchanged by an agent is bounded by $O(N\log(K)\log(T))$.

Proof An agent sends or receives only arm indices, each of which requires $\log_2(K)$ bits. In each epoch, the agent communicates with $N - 1$ agents, sending and receiving $2(N-1)\log_2(K)$ bits. Finally, there are $\log_2(T)$ epochs. This bounds the total number of bits as $O(N\log(K)\log(T))$.

We note that the algorithm proposed by Sankararaman et al. (2019) also divides the time horizon into epochs, with the $K$ arms divided among the $N$ agents. However, they consider the first few epochs to be of fixed length, where agents only explore to find the best arm among their own arms. Our algorithm runs UCB from the very first epoch. Also, the length of the first epoch is $O(1)$ with respect to $T$ in the LCC-UCB algorithm, which limits the regret incurred before any communication happens. These novel changes allow for a significantly improved regret bound as compared to the state of the art with limited communication rounds.
6. Extension to general network structures

So far we assumed that all the nodes are connected to each and every other node. However, this might not always be true. We now assume a general structure where a graph $G = (V, E)$, with the agents as vertices and the connections as edges, represents the network. We assume that the graph representing the network is sparsely connected with a small diameter and degree, for example Erdős–Rényi graphs (Chung and Lu, 2001). We assume that the maximum degree of $G$ is $K_G$ and the diameter of $G$ is $D$. We assume that each agent knows the diameter $D$ ($< N$) of the graph and the maximum degree $K_G$ of the nodes. Further, we assume that each node is aware of its neighbors in order to communicate with them. We do not require the complete graph structure in order to proceed with the proposed algorithm.

For this setup, we assume that an agent or node can communicate only with its neighbors. Under this assumption, it may take multiple epochs for the knowledge of the best arm to reach an agent that did not have the best arm to begin with. Further, the number of epochs for which an agent does not hear from the agent that has the best arm is bounded by the diameter $D$. Also, the maximum size of $A_{n,j}$ is now upper bounded by $\lceil K/N\rceil + K_G$ instead of $\lceil K/N\rceil + N - 1$.

We first start with a direct extension of the result in Theorem 1, and, by understanding the issues in the direct extension, we will propose an algorithm that improves the results for general networks. The following result gives a corollary of Theorem 1 for general graphs.

Corollary 6 For a graph $G = (V, E)$ with agents as nodes $V$, the LCC-UCB algorithm results in a regret bound of
$$R_n(T) \leq O\left(2^DK'^2 + \sqrt{2^DK'T\log T}\right),$$
where $D$ is the diameter of the graph $G$, $K' = \lceil K/N\rceil + K_G$, and $K_G$ is the maximum degree of any node in the graph $G$.

Proof An agent $n \neq 1$ receives arm recommendations only from its neighboring nodes, which reduces $K'$ from $\lceil K/N\rceil + N - 1$ to $\lceil K/N\rceil + K_G$. However, this also implies that agent $n \neq 1$ does not obtain information about a good arm from agent 1 directly. Note that applying Lemma 3 to the UCB algorithm run by an agent $n \neq 1$ shows that the agent recommends an arm $i^*_n$ such that $\mu_{i^*_n} \geq \mu_{\tilde i} - \epsilon_j$, where $\tilde i = \arg\max_{i\in A_{n,j}}\mu_i$. This implies that the agent (or node) located farthest from agent 1 (1) receives knowledge about a good arm only after $D$ epochs for the very first time, and (2) the best arm in the received set, $\tilde i = \arg\max_{i\in R_{n,j}}\mu_i$, only satisfies $\mu_{\tilde i} \geq \mu_1 - \sum_{j'=1}^{D}\epsilon_{j-j'}$. This results in an additional constant regret during the first $D$ epochs of
$$\sum_{j=0}^{D-1}(K'+1)K'2^j = (K'+1)K'(2^D - 1). \quad (62)$$
Further, the gap incurred from receiving a stale recommendation in each epoch $j \geq D$ scales as:

$\sum_{j'=1}^{D}\epsilon_{j-j'}T_j = \sum_{j'=1}^{D}\sqrt{\frac{8K'\log T}{T_{j-j'}}}\,T_j$ (63)

$= \sqrt{8K'T_j\log T}\,\sum_{j'=1}^{D}2^{j'/2}$ (64)

$\leq \sqrt{8K'T_j\log T}\cdot\frac{\sqrt{2}}{\sqrt{2}-1}\,2^{D/2}$ (65)

$= O\left(\sqrt{2^DK'T_j\log T}\right)$ (66)

Summing over all epochs yields the stated bound.

Remark 7 Note that for $D = 1$ and $K_G = N - 1$, i.e., a completely connected graph, the result of Theorem 1 is recovered.

To avoid the exponential blow-up of $2^D$ in the regret, one could first consider a strategy where an agent forwards the messages from one neighbor to all its other neighbors. However, this increases the message size from $O(K_G\log_2 K)$ bits to $O(N\log_2 K)$ bits. Further, additional complexity is needed to prevent repeated propagation of messages.
In order to avoid the potential exponential increase in regret, or the increase in the message size and communication complexity, we propose a modification of the LCC-UCB algorithm, the LCC-UCB-GRAPH algorithm, described in Algorithm 3.

Algorithm 3 LCC-UCB-GRAPH($S_n$, $G$, $T_0$, $T$)
1: $t = 0$, $j = 0$
2: $R_{n,1,0} = \emptyset$
3: while $t < T$ do
4:   $d = 1$
5:   for $d \leq D$ do
6:     Set augmented set $A_{n,d,j} = S_n \cup R_{n,d,j}$
7:     $i^* = \mathrm{UCB}(A_{n,d,j}, \min(T - t,\ K'(K'+1)2^j))$
8:     $t = t + K'(K'+1)2^j$
9:     Send $i^*$ to neighbors
10:    Receive the most played arms of the neighbors as $R_{n,d,j}$
11:    $d = d + 1$
12:  end for
13:  $j = j + 1$
14: end while

The LCC-UCB-GRAPH algorithm further divides an epoch $j$ into $D$ sub-epochs, indexed by $d$. The duration of each sub-epoch in epoch $j$ is $T_j = K'(K'+1)2^j$. The LCC-UCB-GRAPH algorithm restarts the UCB algorithm in each sub-epoch (Lines 6-12). Additionally, the agents now communicate after every sub-epoch, but only with their neighbors. This gives a maximum augmented-set size of $K' = \lceil K/N\rceil + K_G$.

Note that results from sub-epoch $d$ of epoch $j$ are propagated throughout the graph by the time sub-epoch $d$ starts in epoch $j + 1$. Hence, for $\epsilon_j := \sqrt{8K'\log T/T_j}$, this approach allows the network to propagate arms with $\Delta_i \leq D\epsilon_{j-1}$ instead of $\sum_{j'=j-D}^{j}\epsilon_{j'}$. Based on this modification, we can bound the regret of the LCC-UCB-GRAPH algorithm and the number of bits it requires for communication.

Theorem 8 Let $G = (V, E)$ be the graph representing the network structure of the agents $n \in [N]$, let $D$ be the diameter of $G$, and let $K_G$ be the maximum degree of the vertices of $G$. Then the regret of any agent $n$ following the LCC-UCB-GRAPH algorithm is bounded as
$$R_n(T) = O\left(D\sqrt{DK'T\log T}\right), \quad (67)$$
where $K' = \lceil K/N\rceil + K_G$.

Proof Note that at the beginning of a sub-epoch $d$ in epoch $j$, the information from the farthest node, $D$ edges away, is also received for sub-epoch $d$ of epoch $j - 1$. This is because exactly $D$ communication rounds happen between the sub-epoch, epoch pair $(d, j-1)$ and $(d, j)$. Further, each of the $D$ intermediate nodes drifts from the optimal arm found in sub-epoch, epoch $(d, j-1)$ by at most $\epsilon_{j-1}$. This suggests that, instead of receiving an arm with $\Delta_i \leq \epsilon_{j-1}$, the node actually receives an arm $\tilde i = \arg\max_{i\in A_{n,d,j}}\mu_i$ with $\Delta_{\tilde i} \leq D\epsilon_{j-1}$. Hence, extending Lemma 4 with $D$ hops, the regret $R(d, j)$ in each sub-epoch $d$ of epoch $j$ is upper bounded as
$$R(d, j) \leq 2(2D+1)\sqrt{2K'T_j\log T} + \frac{16DK'^3}{T_j} + 2K'. \quad (68)$$
In Equation (68), the extra factors of $D$ come from the fact that each of the agents along the $D$ hops recommends an arm $i_d$ such that $\mu_{i_d} \geq \mu_{i_{d-1}} - \epsilon_j$ for all $d \geq 1$, with $i_0 = 1$ the true best arm. Note that the duration of any sub-epoch $d$ is $K'(K'+1)2^j$, which depends only on the epoch index $j$; hence the regret $R(d, j)$ is only a function of the epoch count $j$. The total regret of agent $n$, which is the sum of the regrets over all sub-epochs in every epoch, can now be bounded as:

$R_n(T) = \sum_{j=0}^{J-1}\sum_{d=1}^{D}R(d,j) \leq \sum_{j=1}^{J-1}\sum_{d=1}^{D}\left(2(2D+1)\sqrt{2K'T_j\log T} + \frac{16DK'^3}{T_j} + 2K'\right) + \sum_{d=1}^{D}K'(K'+1)$ (69)

$= 2(2D+1)D\sqrt{2K'\log T}\sum_{j=1}^{J-1}\sqrt{T_j} + 16D^2K' + 3K'D\log_2(2T+1) + DK'(K'+1)$

$\leq 12(2D+1)\sqrt{K'DT\log T} + 16D^2K' + 3K'D\log_2(2T+1) + DK'(K'+1).$

The key novelty of the LCC-UCB-GRAPH algorithm is to let the $D$ sub-epochs $1 \leq d \leq D$ collect the messages from the entire graph. The equal length of the sub-epochs avoids the exponential blow-up in the regret, while the exponentially growing length of each epoch $j$ still keeps the total number of messages logarithmic in $T$.
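A compact sketch of Algorithm 3's sub-epoch loop follows, reusing `ucb_epoch` from the LCC-UCB sketch in Section 4; the adjacency representation and harness are ours, not from the paper's code:

```python
import numpy as np
# Reuses ucb_epoch(rng, mu, arms, horizon) from the LCC-UCB sketch in Section 4.

def lcc_ucb_graph(rng, mu, adj, T, D):
    """Algorithm 3 sketch; `adj[n]` lists the neighbors of agent n.
    Each epoch j is split into D equal sub-epochs of length K'(K'+1)2^j,
    and agents exchange their most played arm with neighbors after each one."""
    N, K = len(adj), len(mu)
    width = int(np.ceil(K / N))
    S = [{(n * width + k) % K for k in range(width)} for n in range(N)]
    Kp = width + max(len(nb) for nb in adj)   # K' = ceil(K/N) + K_G
    recs = [set() for _ in range(N)]          # R_{n,d,j}
    regret = np.zeros(N)
    t, j = 0, 0
    while t < T:
        for d in range(D):                    # D sub-epochs of equal length
            Tj = min(T - t, Kp * (Kp + 1) * 2 ** j)
            if Tj == 0:
                break
            best = []
            for n in range(N):
                i_star, r = ucb_epoch(rng, mu, S[n] | recs[n], Tj)
                best.append(i_star)
                regret[n] += r
            # Neighbor-only exchange; older recommendations are purged.
            recs = [{best[m] for m in adj[n]} for n in range(N)]
            t += Tj
        j += 1
    return regret

rng = np.random.default_rng(1)
mu = rng.uniform(size=30)
adj = [[(n - 1) % 6, (n + 1) % 6] for n in range(6)]  # 6-cycle, diameter D = 3
print(lcc_ucb_graph(rng, mu, adj, T=20000, D=3))
```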
Theorem 9 For the LCC-UCB-GRAPH algorithm, the total number of bits exchanged by an agent is bounded by $O(K_GD\log(K)\log(T))$.

Proof An agent sends or receives only arm indices, each of which requires $\log_2(K)$ bits. The agent communicates at the end of every sub-epoch of every epoch. In each communication, the agent talks to at most $K_G$ neighbors, sending and receiving $2K_G\log_2(K)$ bits. Finally, there are $D$ sub-epochs in each of the $\log_2(T)$ epochs. This bounds the total number of bits as $O(DK_G\log(K)\log(T))$.

The results of Theorem 8 and Theorem 9 suggest that it is possible to remove the exponential dependence of the regret on the diameter $D$ of the graph $G$ at the expense of $D$ times more communication rounds. Further, since each communication involves only an exchange of arm indices, the cost of communication remains low ($O(K_G\log_2 K)$ bits), which suits power-constrained devices such as sensor networks.

Optimizing Regret using Knowledge of Graph

We now show that additional knowledge of the complete graph structure can be utilized to reduce the regret bound of the proposed algorithm. As a motivating example, consider $L$ cliques of size $N/L$ connected in a line. An example for $L = 3$ and $N = 15$ is shown in Figure 1. Here, we can run the LCC-UCB algorithm on the individual cliques to obtain a per-agent regret of $O\left(\sqrt{\frac{KT}{N/L} + \frac{NT}{L}}\right)$. In contrast, running the LCC-UCB-GRAPH algorithm on the entire graph incurs a regret of $O\left(L\sqrt{L\left(\frac{KT}{N} + \frac{NT}{L}\right)}\right)$. Hence, working with the cliques separately improves the regret by a factor of $L$ in this case.

Figure 1: A topology where 3 cliques, each of size 5, are connected via bridges.

The example above can be extended to partitioning the graph into sub-graphs. Let the graph $G$ be divided into $C$ sub-graphs $G_1, \ldots, G_C$ such that each node is contained in exactly one of the sub-graphs, and the links in the sub-graphs are a subset of the links in the graph. Let $D_{G_c}$ be the diameter of sub-graph $G_c$, $N_{G_c}$ the number of nodes in $G_c$, and $K_{G_c}$ the maximum degree of any node in $G_c$. Then, for a given choice of the sub-graphs, the per-agent regret can be bounded (applying Theorem 8 to each sub-graph) as
$$\max_{c\in[C]}\ O\left(D_{G_c}\sqrt{\left(\frac{K}{N_{G_c}} + K_{G_c}\right)D_{G_c}T}\right). \quad (70)$$
Finding the best possible split into sub-graphs that optimizes the above metric can be performed for the given graph. This split can yield improved results in some graphs, as illustrated in the example above, where we obtained an improvement of $O(L)$.
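A minimal sketch of scoring candidate partitions by the bound above; the brute-force scoring helper and the numeric setup are ours, under the assumption that Equation (70) is the metric being optimized:

```python
from math import sqrt

def partition_score(subgraphs, K, T):
    """Score a candidate partition by the dominant per-agent term of
    Equation (70): each sub-graph G_c with (diameter D_c, size N_c,
    max degree Kg_c) contributes D_c * sqrt((K/N_c + Kg_c) * D_c * T)."""
    return max(D * sqrt((K / Nc + Kg) * D * T) for (D, Nc, Kg) in subgraphs)

# Figure 1 example: three 5-cliques in a line (per clique D=1, N_c=5, Kg=4)
# versus treating the whole graph as one piece (D=3, N=15, Kg=6).
K, T = 100, 10**5
print(partition_score([(1, 5, 4)] * 3, K, T))  # split into cliques: ~1.5e3
print(partition_score([(3, 15, 6)], K, T))     # whole graph: ~5.9e3
```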
7. Evaluations

We consider various problem setups to evaluate our algorithms. We compare the proposed algorithms, LCC-UCB and LCC-UCB-GRAPH, with a no-communication strategy and a full-communication strategy. The details of the no-communication and the full-communication setups are:

No-Communication Setup: We consider the case where each agent $n \in [N]$ works in isolation. The initial set $S_n$ of every agent contains all $K$ arms. Each agent uses the standard UCB algorithm to interact with the environment and reduce its regret. Hence, each agent incurs a regret of $O(\sqrt{KT})$. Since the agents do not communicate, they cannot reduce their regret further.

Full-Communication Setup: We consider the case where each agent $n \in [N]$ can talk only to its neighbors and observes the arms played and the rewards obtained by the neighbors at every time step. The initial set $S_n$ of every agent again contains all the arms. Each agent uses the standard UCB algorithm to interact with the environment. For a fully connected graph, the regret of any agent $n$ scales as $O(\sqrt{KT/N})$. However, for a general graph, the regret of agent $n$ scales as $O(\sqrt{KT/N_n})$, where $N_n$ is the number of neighbors of agent $n$. This approach requires significantly more communication, $O(T)$ rounds, compared to the $O(\log T)$ rounds of the algorithms proposed in this paper. In contrast, the proposed LCC-UCB-GRAPH algorithm reduces the regret to $O(D\sqrt{DKT/N})$ by effectively propagating the knowledge about the best arm throughout the network.

We also compare with the DEMAB algorithm, proposed by Wang et al. (2020), which requires only $O(N\log(NK))$ communication rounds for known time horizons.

We first present the comparison results for Algorithm 1. We consider a horizon of $T = 10^5$ steps. We study the behaviour of the algorithm by varying the number of agents $N$ and the number of arms $K$. We choose three pairs $(N, K)$: $(10, 100)$, $(20, 100)$, and $(10, 200)$. We present the results in Figure 2 for 30 independent runs, with expected rewards drawn from the uniform $U(0, 1)$ distribution. We plot the median of the cumulative regret incurred by a single agent at each time step, along with the 95% confidence intervals.

We first note that the regret of the DEMAB algorithm is even larger than that of the no-communication strategy. The high regret of the DEMAB algorithm is expected because the algorithm purges the observations collected after each epoch. Further, the agents do not share the knowledge of the best arm and continue to re-divide the remaining arms to quickly eliminate the bad arms, and hence not all agents are able to exploit the best arm. This results in the high regret of the algorithm. To show the scale between the remaining communication strategies, we also plot the regret curves without the DEMAB algorithm in Figure 3.

Figure 2: Per-step cumulative regret for a single agent under various communication strategies (LCC-UCB, Full-Communication, No-Communication, DEMAB) over $10^5$ time steps. Panels: (a) $(N, K) = (10, 100)$; (b) $(N, K) = (20, 100)$; (c) $(N, K) = (10, 200)$.

Figure 3: Per-step cumulative regret for a single agent under various communication strategies (excluding the DEMAB algorithm, to show the regret growth of the other algorithms). Panels: (a) $(N, K) = (10, 100)$; (b) $(N, K) = (20, 100)$; (c) $(N, K) = (10, 200)$.

Figure 4: Per-step cumulative regret for a single agent, in a sparse graph, under various communication strategies (LCC-UCB-GRAPH, LCC-UCB (relay), Full-Communication, No-Communication, DEMAB) over $5\times10^4$ time steps. Panels: (a) $(N, K) = (100, 250)$; (b) $(N, K) = (150, 250)$; (c) $(N, K) = (100, 500)$.

The start of an epoch $j$ can be observed as the jumps in the cumulative regret. We observe that the initial epochs incur the largest regret despite their durations $T_j$ being small. This is because the agents are not yet aware of the best arm and may be exploring from possibly the worst arms.
Also, the regret grows very slowly in the later phase, because most agents send the same arm index (the optimal arm), so the effective regret in the later rounds increases only as $O(\sqrt{(1 + \lceil K/N\rceil)T_j})$ instead of the upper bound of $O(\sqrt{(N - 1 + \lceil K/N\rceil)T_j})$. We note that for a small number of agents $N$ compared to the number of arms $K$, i.e., $(N, K) = (10, 100)$ and $(N, K) = (10, 200)$, the algorithm performs closer to the optimal case where the agents can communicate with each other at every step, as observed from Fig. 3a and Fig. 3c. This is because of the reduced overhead of re-sampling the new arms received from all the agents.

We now evaluate the proposed LCC-UCB-GRAPH algorithm on sparse graphs. We specifically consider Erdős–Rényi graphs $G(N, p)$, where the $N \geq 100$ vertices are a swarm of $N$ agents and $p = 10/N \geq \ln N/N$ is the edge selection probability. This gives an expected total number of edges in the graph of about $5N$. We consider only connected graphs (if the sampled graph is not connected, we sample another graph). Once initiated, the graph does not change structure over the subsequent time steps. This setup is typical of IoT devices communicating only with their neighbors (Avner and Mannor, 2016; Sankararaman et al., 2019). We again consider three cases of $(N, K)$: $(100, 250)$, $(150, 250)$, and $(100, 500)$. We present the results in Figure 4 for 30 independent runs. Along with the expected rewards of the arms, the graph structure is also different for each run. We plot the median of the cumulative regret incurred by a single agent at each time step, along with the 95% confidence intervals.

We note that for $K = 250$, the performance is similar for $N = 100$ (Fig. 4a) and $N = 150$ (Fig. 4b). This is expected for the no-communication strategy, as the number of arms is the same. For the LCC-UCB-GRAPH algorithm, this makes sense because the maximum degree $K_G$ of the graph is higher than the number of arms allocated to every agent, $\lceil K/N\rceil$. For the full-communication strategy, this happens because the expected degree of each agent is the same for both graphs: each agent can access data only from its neighbors, and that remains the same. On doubling $K$ from 250 to 500, we observe that the regret increases at a lower rate for LCC-UCB-GRAPH than for the other two strategies. This is again attributed to the fact that $K_G$ dominates the $\lceil K/N\rceil$ term in the regret. We note that the performance of the DEMAB algorithm is still sub-par compared to the other three strategies. Note that the LCC-UCB-GRAPH algorithm accumulates extremely low regret because of the reduced number of arms per agent (around 5), and the degree of any node is also very low, as we considered sparse $G(N, p)$ graphs with $p = 10/N$.

As expected, we note that the proposed strategy performs better than the no-communication strategy. Further, we note that the proposed strategy even outperforms the strategy where communication happens after every time step, lagging behind only in the initial time steps. This is because, in the always-communicating setup, an agent only shares its knowledge with its neighbors and thus is not able to fully utilize the graph of $N$ agents. In the initial time steps, the LCC-UCB-GRAPH algorithm performs pure exploration, and hence incurs regret.
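The sparse topologies described above can be sampled as follows; this is a minimal sketch assuming the networkx library, with the resampling loop and helper name being ours:

```python
import networkx as nx

def sample_connected_er(N: int, seed: int = 0) -> nx.Graph:
    """Sample a connected Erdős–Rényi graph G(N, p) with p = 10/N,
    resampling (as in the experiments above) until it is connected."""
    p = 10 / N
    while True:
        G = nx.gnp_random_graph(N, p, seed=seed)
        if nx.is_connected(G):
            return G
        seed += 1

G = sample_connected_er(100)
print(G.number_of_edges(), "edges;",          # expected ~5N = 500
      "diameter", nx.diameter(G), ";",
      "max degree", max(d for _, d in G.degree()))
```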
We also compare the performance of the LCC-UCB-GRAPH algorithm against a modified LCC-UCB algorithm that relays messages from other nodes. This modification allows every agent to receive recommendations from all the other agents after every epoch. However, the performance of the LCC-UCB-GRAPH algorithm is significantly better than that of the relay-based LCC-UCB algorithm, which justifies the sub-epoch based modification used in LCC-UCB-GRAPH. The relay-based LCC-UCB algorithm wastes a significant portion of the time before generating good recommendations and hence incurs a large regret. The better performance of the LCC-UCB-GRAPH algorithm arises because, after each epoch, an agent only receives arm updates from its neighbors, and hence the $\sqrt{K/N + K_G}$ term in the regret remains very small.

8. Conclusion

We considered the problem of reducing the communication rounds between $N$ agents while minimizing the regret of the agents interacting with an instance of a multi-armed bandit problem with $K$ arms over a time horizon $T$. We proposed two algorithms: LCC-UCB for fully connected networks, and LCC-UCB-GRAPH for sparse networks with maximum degree $K_G$ and diameter $D$. We analyzed the algorithms and obtained regret bounds of $O(\sqrt{T(N + K/N)})$ and $O(D\sqrt{D(K/N + K_G)T})$ for the LCC-UCB and LCC-UCB-GRAPH algorithms, respectively. We found that the algorithms perform well empirically, with the LCC-UCB-GRAPH algorithm outperforming the every-time-step communication strategy in which an agent shares knowledge only with its neighbors. Further, both the LCC-UCB and the LCC-UCB-GRAPH algorithms beat the existing state-of-the-art results. Additionally, the low bit complexity of communication in both algorithms makes them a suitable choice for power-constrained devices.

We conjecture that when the $N$ agents are connected by a graph of diameter $D = o(N^{1/3})$, and the agents can only communicate the index of the best known arm, the regret of any agent is lower bounded by $\Omega(D\sqrt{D(KT/N + NT)})$. However, proving the conjecture remains an open problem. As future work, the setting where the agents interact with non-identical bandit instances can be considered. Moreover, considering limited communication setups for federated reinforcement learning (Agarwal et al., 2021) is also an interesting direction for future work.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.

Mridul Agarwal, Bhargav Ganguly, and Vaneet Aggarwal. Communication efficient parallel reinforcement learning. In Uncertainty in Artificial Intelligence, pages 247-256. PMLR, 2021.

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107, 2013.

Jean-Yves Audibert, Sébastien Bubeck, et al. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pages 1-122, 2009.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.

Orly Avner and Shie Mannor. Multi-user lax communications: A multi-armed bandit approach. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1-9. IEEE, 2016.
Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832-1852, 2011.

Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In International Conference on Artificial Intelligence and Statistics, pages 3471-3481. PMLR, 2020.

Fan Chung and Linyuan Lu. The diameter of sparse random graphs. Advances in Applied Mathematics, 26(4):257-279, 2001.

Abhimanyu Dubey and Alex Pentland. Kernel methods for cooperative contextual bandits. In International Conference on Machine Learning, 2020a.

Abhimanyu Dubey and Alex Sandy Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33, 2020b.

John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148-164, 1979.

Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854-862, 2013.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409-426. Springer, 1994.

Varun Kanade, Zhenming Liu, and Bozidar Radunovic. Distributed non-stochastic experts. In Advances in Neural Information Processing Systems, pages 260-268, 2012.

Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 167-172. IEEE, 2016.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945-3954, 2018.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.

David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In Advances in Neural Information Processing Systems, pages 4529-4540, 2019.

Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1-35, 2019.

Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2786-2790. IEEE, 2017.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 4120-4129. PMLR, 2020.

Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2019.
Jiaqi Yang, Wei Hu, Jason D Lee, and Simon Shaolei Du. Impact of representation learning in linear bandits. In International Conference on Learning Representations, 2020.