Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits

Ronshee Chawla 1, Daniel Vial 1 2, Sanjay Shakkottai 1, R. Srikant 2

1 Chandra Family Department of Electrical and Computer Engineering, University of Texas, Austin, TX, USA. 2 Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL, USA. Correspondence to: Ronshee Chawla.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract
The study of collaborative multi-agent bandits has attracted significant attention recently. In light of this, we initiate the study of a new collaborative setting, consisting of N agents such that each agent is learning one of M stochastic multi-armed bandits, with the goal of minimizing the group cumulative regret. We develop decentralized algorithms which facilitate collaboration between the agents under two scenarios. We characterize the performance of these algorithms by deriving per-agent cumulative regret and group regret upper bounds. We also prove lower bounds for the group regret in this setting, which demonstrate the near-optimal behavior of the proposed algorithms.

1. Introduction
The multi-armed bandit (MAB) problem is a paradigm for sequential decision-making under uncertainty, which involves allocating a resource to an action in order to obtain a reward. MABs address the tradeoff between exploration and exploitation while making sequential decisions. Owing to their utility in large-scale distributed systems (such as information retrieval (Yue & Joachims, 2009), advertising (Chakrabarti et al., 2008), etc.), multi-agent versions of the classical MAB have been studied extensively in the last few years. In multi-agent MABs, there are multiple agents learning a bandit and communicating over a network. The goal is to design communication strategies which allow efficient exploration of arms across agents, so that they can perform better than single-agent MAB algorithms. There exist many versions of multi-agent MABs in the literature (see Section 1.2 for an overview).

We propose a new collaborative setting where each of the N agents is learning one of M stochastic MABs (where each of the bandits has K arms and M < N) to minimize the group cumulative regret, i.e., the sum of the individual cumulative regrets of all the agents. We assume that every bandit m ∈ [M] is learnt by N_m = Θ(N/M) agents. At time t ∈ ℕ, for all m ∈ [M], any agent learning the mth bandit plays one of the K arms and receives a stochastic reward, independent of everything else (including the other agents learning the mth bandit pulling the same arm). The network among the agents is described by an N × N gossip matrix G (fixed and unknown to the agents), where the nth row of G (n ∈ [N]) is a probability mass function over the set [N]. We investigate our setting under two scenarios: (a) context unaware, in which no agent is aware of the other agents learning the same bandit, and (b) partially context aware, in which every agent is aware of r − 1 other agents learning the same bandit. In the context unaware scenario, in addition to the arm pulls conducted at each time, each agent can choose to pull information from another agent who is chosen at random based on the gossip matrix G.
In the partially context aware scenario, every agent can also choose to exchange messages with the r − 1 agents that they know are learning the same bandit, in addition to the arm pulls and the information pulls. Agents behave in a decentralized manner, i.e., the arm pulls, information pulls and messages exchanged depend only on their past actions, obtained rewards and the messages received from other agents.

Problem Motivation: Our setting finds its utility in applications involving multiple agents with possibly differing contexts. While agents with the same context have the same objective, they may not be aware of who else shares the same context. Analyzing the proposed setting is an attempt at figuring out how agents should best utilize recommendations from others. The quality of the recommendations received from other agents depends on whether those agents share the same context (and hence have the same objective). For example, consider a social recommendation engine like Yelp, where the N agents correspond to the users, each choosing one from among K pizzerias (arms) in a particular location; this can be modeled as a K-armed MAB. Suppose that each pizzeria serves multiple pizza styles such as Neapolitan, Chicago deep dish, New York style, Detroit style, etc., where each pizza style corresponds to a bandit. Each user has a preference for one of the pizza styles and is looking for the corresponding best pizzeria. For instance, pizzeria 1 (arm 1) might excel in Chicago deep dish, while pizzeria 2 might be best for Detroit style. If a user (agent) does not know who else has similar tastes (i.e., shares the same context), the user would not be able to determine whether a review recommendation is useful, especially if the review only states that a pizzeria is excellent but does not describe the type of pizza that the reviewer consumed. What we have here is a setting where an arm (pizzeria) is being recommended by some user, but for a specific user who is perusing the reviews, it is not clear if that recommendation is actually useful (e.g., the recommendation is actually for a Detroit style pizza, but the user is interested in Chicago deep dish). This example corresponds to the context unaware scenario in our model. Moreover, if a user knows another user or a group of users who share the same pizza style preference, they can exchange pizzeria recommendations while still scrolling through the reviews of all the pizzerias by themselves, which corresponds to the partially context aware scenario in our model.

1.1. Key Contributions
Algorithms: For the context unaware scenario, we modify and subsequently analyze the GosInE algorithm (Algorithm 1), which (Chawla et al., 2020) proposed for the case of a single bandit (M = 1). For the partially context aware scenario, we utilize the insights obtained from the analysis of Algorithm 1 and propose a new algorithm (Algorithm 3). Algorithm 1 proceeds in phases, such that agents play from a subset of the K arms within a phase, and recommend the arm ID of the best arm during information pulls at the end of a phase. The received arm recommendations are used to update the active sets before the next phase begins; a small sketch of the resulting phase schedule and communication budget is given below.
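Concretely, with the phase schedule A_j = ⌈j^β⌉ defined later in Section 3.2, one information pull happens at each phase end, so the number of communication rounds up to a horizon T grows only like T^{1/β}. The following minimal Python sketch (the value of β here is an arbitrary illustrative choice, not one prescribed by the paper) counts these rounds.

```python
import math

def phase_ends(T, beta):
    """Phase j spans times 1 + A_{j-1}, ..., A_j with A_j = ceil(j**beta)
    (the schedule of Section 3.2); return the phase-end times that are <= T."""
    ends, j = [], 1
    while math.ceil(j ** beta) <= T:
        ends.append(math.ceil(j ** beta))
        j += 1
    return ends

T, beta = 100_000, 2.5
rounds = phase_ends(T, beta)
# One arm recommendation is pulled at each phase end, so the number of
# communication rounds is roughly T**(1/beta), i.e., sub-linear in T.
print(len(rounds), round(T ** (1 / beta)))   # -> 100 100
```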
In the partially context aware scenario, the agents additionally (a) share the arm recommendation with the r − 1 agents known to be learning the same bandit, (b) determine the M most recent unique arm recommendations among their set of r agents, and (c) distribute these recent recommendations among themselves to update their respective active sets.

Gossip despite multiple bandits: Our main analytical contribution is to show that, under a mild assumption (Assumption 3.1), agents running Algorithm 1 in the context unaware scenario and Algorithm 3 in the partially context aware scenario are able to identify the best arm of the bandit they are learning, and are able to spread it quickly to the other agents learning the same bandit. Even though the outcome is the same as in the setting where agents collaboratively learn a single bandit (Chawla et al., 2020; Newton et al., 2022), the spreading of the best arms of the M bandits is considerably more involved, because the M spreading processes are intertwined and evolve simultaneously (since every agent interacts with agents learning other bandits). Consequently, agents cannot trust the arm recommendations received from agents learning other bandits. Thus, unlike (Chawla et al., 2020; Newton et al., 2022), the spreading process of each of the M best arms is not bounded by a standard rumor spreading process, which requires a more involved analysis (detailed in Section 3.5).

Upper bounds: We show that the expected cumulative regret of an individual agent running Algorithms 1 and 3 scales as O((1/Δ_m)(MK/N + M) log T) (Theorem 3.2 and Corollary 3.3) and O((1/Δ_m)(MK/N + M/r) log T) (Theorem 4.1 and Corollary 4.2), respectively, for large T, where Δ_m is the minimum arm gap of the mth bandit. It is evident that when M ≪ min{K, N}, agents learning the mth bandit, for all m ∈ [M], experience regret reduction compared to the case of no collaborations, because of the distributed exploration of the K arms among the N_m = Θ(N/M) agents. Furthermore, the expected group regret incurred by agents running Algorithms 1 and 3 scales as O(Σ_{m∈[M]} ((K + N)/Δ_m) log T) (Corollary 3.4) and O(Σ_{m∈[M]} ((K + N/r)/Δ_m) log T) (Corollary 4.3), respectively.

Lower bounds: We show in Theorem 5.1 that the expected group regret of our model scales as Ω(Σ_{m∈[M]} Σ_{k≠k*(m)} (1/Δ_{m,k}) log T) for large T, where k*(m) is the best arm and Δ_{m,k} is the arm gap of the kth arm of the mth bandit. This demonstrates that the first terms in our group regret upper bounds are near-optimal; we also show that the second terms (which scale as N log T) are unavoidable in general. See Section 5 for details.

1.2. Related Work
Our work falls broadly under the category of cooperative MABs. To the best of our knowledge, the setting considered in this work has not been studied previously. Our work is closest to the line of work in (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2022; Vial et al., 2021; 2022), which involves multiple agents learning the same K-armed MAB and communicating a sub-linear number of times (over the entire time horizon) through bit-limited, pairwise gossip-style communications to minimize individual cumulative regret. (Vial et al., 2021; 2022) consider a multi-agent system with honest and malicious agents, where the honest agents are learning the same K-armed MAB. Their algorithms can be used in our setting, with an agent learning some bandit treating agents learning other bandits as malicious. In our setting, an agent running those algorithms incurs regret scaling as O((1/Δ_m)(MK/N + N(1 − 1/M)) log T) after T time steps.
This regret scaling is problematic when K/N = Θ(1), as there is no benefit of collaboration: the regret scales as O((K/Δ_m) log T), which is akin to learning a K-armed MAB without communications. In our work, we show that an agent using the GosInE Algorithm of (Chawla et al., 2020) with a slight modification achieves lower regret in the context unaware scenario, which is further reduced in the partially context aware scenario. Due to space constraints, we refer the reader to Appendix A for other related work.

2. Problem Setup
We consider a multi-agent multi-armed bandit (MAB) setting consisting of N agents, where each agent attempts to learn one of M stochastic MABs (where M < N), each containing K arms. Formally, for each m ∈ [M], let I_m ⊆ [N] denote the set of agents learning the mth bandit. We assume that every bandit m ∈ [M] is learnt by N_m agents, such that Σ_{m∈[M]} N_m = N and c_1 N/M ≤ N_m ≤ c_2 N/M for all m ∈ [M], where M/N < c_1 ≤ c_2 ≤ M/2 are known absolute constants. For every bandit m ∈ [M], the K arms have unknown mean rewards, denoted by {µ_{m,k}}_{k∈[K]}, where µ_{m,k} ∈ [0, 1] for all k ∈ [K]. Let k*(m) denote the best arm of the mth bandit, i.e., k*(m) = argmax_{k∈[K]} µ_{m,k}, and assume that µ_{m,k*(m)} > µ_{m,k} for all k ≠ k*(m). We define B to be the set of M best arms, i.e., B = {k*(m)}_{m∈[M]}, and B(−m) = B \ {k*(m)}. Let Δ_{m,k} := µ_{m,k*(m)} − µ_{m,k} denote the arm gaps for all k ≠ k*(m) and m ∈ [M]. The assumption on the arm means implies that Δ_{m,k} ∈ (0, 1] for all k ≠ k*(m). Let Δ_m denote the minimum arm gap of the mth bandit, i.e., Δ_m = min_{k∈[K]\{k*(m)}} Δ_{m,k}. For all m ∈ [M], each agent i ∈ I_m at time t ∈ [T] pulls an arm I_t^{(i)} ∈ [K] and receives a reward X_t^{(i)}(I_t^{(i)}), where X_t^{(i)}(k) = µ_{m,k} + δ_t^{(i)} for each k ∈ [K] and δ_t^{(i)} is 1-subgaussian noise (independent of everything else).

The network between the agents is represented by an N × N gossip matrix G (fixed and unknown to the agents), where each row G(n, ·) denotes a probability distribution, for all n ∈ [N]. In this work, we consider agents connected by a complete graph, i.e., for all n ∈ [N], G(n, i) = (N − 1)^{-1} for all i ≠ n. The problem setup is investigated under two scenarios: (i) Context Unaware - no agent i ∈ [N] knows which other agents are learning the same bandit. (ii) Partially Context Aware - for all bandits m ∈ [M], each agent i learning the mth bandit knows r − 1 other agents (where 1 < r ≤ min_{m∈[M]} N_m) who are also learning the mth bandit, such that N_m is an integral multiple of r for all m ∈ [M] (since N = Σ_{m∈[M]} N_m, N is then an integral multiple of r as well).

In the context unaware scenario, after pulling an arm, agents can choose to receive a message from another agent through an information pull. In particular, if an agent n ∈ [N] decides to pull information, it does so by contacting another agent i ∈ [N] according to the probability distribution G(n, ·), independently of everything else. The agent i who is contacted is then allowed to communicate log₂ K bits during this information pull. In the partially context aware scenario, in addition to the information pulls allowed in the context unaware scenario, each agent learning the mth bandit can also exchange messages with the r − 1 other agents who they know are also learning the mth bandit. As a result, agents in this scenario are allowed to communicate r log₂ K bits during information pulls.
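To make the interaction model concrete, the following minimal Python sketch (with hypothetical sizes N, M, K, and Gaussian noise as one concrete 1-subgaussian choice) sets up the agent-to-bandit assignment, the complete-graph gossip matrix G, a single arm pull, and a single information pull drawn from G(i, ·). All names and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 12, 3, 8                              # illustrative sizes (M < N, N divisible by M here)
bandit_of = np.repeat(np.arange(M), N // M)     # c(i): bandit learnt by agent i, N_m = N/M agents each
mu = rng.uniform(0.0, 1.0, size=(M, K))         # unknown mean rewards mu_{m,k} in [0, 1]

# Complete-graph gossip matrix: G(n, i) = 1/(N-1) for i != n, 0 on the diagonal.
G = np.full((N, N), 1.0 / (N - 1))
np.fill_diagonal(G, 0.0)

def pull_arm(i, k):
    """Agent i plays arm k of its own bandit and gets a 1-subgaussian reward
    (standard Gaussian noise is used here as a concrete choice)."""
    m = bandit_of[i]
    return mu[m, k] + rng.normal(0.0, 1.0)

def information_pull(i):
    """Agent i contacts a random agent drawn from row G(i, .); the contacted agent
    may be learning a different bandit, which is exactly the difficulty studied here."""
    return rng.choice(N, p=G[i])

reward = pull_arm(0, 3)
contact = information_pull(0)
print(bandit_of[0], round(reward, 3), contact, bandit_of[contact])
```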
Agents operate in a decentralized fashion, i.e., all the decisions that an agent makes can depend only on their past actions, rewards and the messages received from other agents during information pulls. Moreover, decisions made by agents during the information pulling slots (i.e., what to communicate if asked for information) are allowed to depend on the arm pulls in those slots. Under both scenarios, we would like to leverage collaboration between the agents to minimize the expected group cumulative regret, i.e., the sum of the individual cumulative regrets of all the agents. Mathematically, let I_t^{(i)} denote the arm pulled by agent i ∈ [N] at time t ∈ [T] and let c(i) denote the index of the (unknown) bandit that agent i is trying to learn, i.e., if i ∈ I_m, then c(i) = m. Then, the cumulative regret of an agent i ∈ [N] after playing for T time steps is R_T^{(i)} := Σ_{t=1}^T (µ_{c(i),k*(c(i))} − µ_{c(i),I_t^{(i)}}), and the expected group cumulative regret is E[Reg(T)] (the expectation is with respect to the rewards, the communications and the algorithm), where Reg(T) = Σ_{i=1}^N R_T^{(i)}.

3. Context Unaware Algorithm
For the context unaware scenario, we consider the GosInE Algorithm of (Chawla et al., 2020) with a slight modification (Algorithm 1). Subsequently, we demonstrate that under a mild assumption (Assumption 3.1), agents running Algorithm 1 incur less regret compared to the case when they learn their bandit without collaboration, despite being unaware of the other agents learning the same bandit. Furthermore, we show that this regret is near-optimal by stating a lower bound in Section 5.

3.1. Key Algorithmic Principles
The GosInE Algorithm of (Chawla et al., 2020) has the following key components:
Phases - The algorithm proceeds in phases j ∈ ℕ, such that during a phase, agents play from a set S_j^{(i)} ⊆ [K], also known as the active set. At the end of a phase, agents exchange arm recommendations through pairwise gossip communication and update their active sets.
Active Sets - The active set of any agent i ∈ [N] is a combination of two sets: (i) the time-invariant sticky set, denoted by Ŝ^{(i)} ⊆ [K], and (ii) the non-sticky set. As the name suggests, the sticky set satisfies Ŝ^{(i)} ⊆ S_j^{(i)} for all j ∈ ℕ, i.e., it is always present in the active set. The non-sticky set contains two arms at all times, which are updated across phases through arm recommendations received from other agents.
Arm Recommendations - At the end of phase j, every agent contacts another agent at random based on the network's gossip matrix (which in this work corresponds to a complete graph), and the contacted agent sends the best arm in its active set. After receiving the recommendation, agents update their active set by adding the recommended arm to their sticky set and discarding the worst performing arm out of the two non-sticky arms. We provide an efficient modification to this update rule in Algorithm 1, and give the details of what it means to be the best and the worst performing arm in the next sub-section.

3.2. Algorithm Description
We provide a detailed description of the GosInE Algorithm with an efficient modification to the active set updates at the end of a phase, with the pseudo-code in Algorithm 1.
Initialization: The algorithm is initialized with the following inputs - (i) the standard exploration parameter of the UCB Algorithm, denoted by α > 0, (ii) the parameter β > 1 which controls the length of the phases, where phase j runs over the (discrete) time instants 1 + A_{j−1}, …, A_j with A_j := ⌈j^β⌉, and (iii) a sticky set Ŝ^{(i)} of cardinality S for each agent i. Note that the phases grow longer as the algorithm progresses (since A_j − A_{j−1} = Θ(j^{β−1}) and β > 1). Details regarding the size of the sticky set are provided in the remarks in Section 3.3. For the first phase (j = 1), we initialize the active set of each agent to be its sticky set, i.e., S_1^{(i)} = Ŝ^{(i)}.

UCB within a phase: We denote by T_k^{(i)}(t) the number of times agent i has played arm k up to and including time t, and by µ̂_k^{(i)}(t) the empirical mean reward among those plays, i.e., µ̂_k^{(i)}(t) = (1/T_k^{(i)}(t)) Σ_{s≤t: I_s^{(i)}=k} X_s^{(i)}(I_s^{(i)}) (with µ̂_k^{(i)}(t) = 0 if T_k^{(i)}(t) = 0). In phase j, every agent i plays the UCB Algorithm on its active set S_j^{(i)}, i.e., for t ∈ {1 + A_{j−1}, …, A_j}, the chosen arm is
I_t^{(i)} = argmax_{k∈S_j^{(i)}} [ µ̂_k^{(i)}(t−1) + √(α log T / T_k^{(i)}(t−1)) ].

Arm recommendation at the end of a phase: The arm recommendation received by agent i at time t = A_j is denoted by O_j^{(i)} ∈ [K]. During the information pull request at time t = A_j, every agent shares the ID of the most played arm in its active set during phase j, which is what we refer to as the best performing arm. Similarly, the worst performing arm in the active set during a phase refers to the non-sticky arm that was played the least number of times in that phase. The intuition behind sharing the most played arm during a phase is that for large time horizons, UCB chooses to play the best arm more than any other arm (Bubeck et al., 2011). Therefore, if the true best arm is present in the active set, it will be recommended at the end of a phase as the algorithm progresses and the phases grow longer.

Active set update for the next phase: We update the active set in a more efficient manner (Newton et al., 2022) compared to the GosInE Algorithm of (Chawla et al., 2020). Specifically, GosInE uses the update S_{j+1}^{(i)} = Ŝ^{(i)} ∪ {U_j^{(i)}} ∪ {O_j^{(i)}}, where U_j^{(i)} is the most played non-sticky arm in phase j, i.e., U_j^{(i)} = argmax_{k∈S_j^{(i)}\Ŝ^{(i)}} (T_k^{(i)}(A_j) − T_k^{(i)}(A_{j−1})). In contrast, we use the update S_{j+1}^{(i)} = Ŝ^{(i)} ∪ {Ô_j^{(i)}} ∪ {O_j^{(i)}}, where Ô_j^{(i)} = argmax_{k∈S_j^{(i)}} (T_k^{(i)}(A_j) − T_k^{(i)}(A_{j−1})) is the most played arm among all active arms (not just the non-sticky ones). As observed in (Newton et al., 2022), the latter update ensures that once the best arm spreads, the active set becomes Ŝ^{(i)} ∪ {k*} thereafter, where k* is the true best arm. This reduces the cardinality of the active set by up to two arms if k* ∈ Ŝ^{(i)}, subsequently reducing the regret in the single bandit case. As our analysis will show, a similar regret reduction is possible in our setting.

Algorithm 1 (at agent i)
Input: UCB parameter α > 0, phase parameter β > 1, sticky set Ŝ^{(i)} with |Ŝ^{(i)}| = S ≤ K − 2
Initialize A_j = ⌈j^β⌉, j ← 1, S_1^{(i)} ← Ŝ^{(i)}
for t ∈ ℕ do
  Play I_t^{(i)} = argmax_{k∈S_j^{(i)}} [ µ̂_k^{(i)}(t−1) + √(α log T / T_k^{(i)}(t−1)) ]
  if t == A_j then
    O_j^{(i)} ← GetRec(i, j) (Algorithm 2)
    Ô_j^{(i)} ← argmax_{k∈S_j^{(i)}} (T_k^{(i)}(A_j) − T_k^{(i)}(A_{j−1}))
    S_{j+1}^{(i)} ← Ŝ^{(i)} ∪ {Ô_j^{(i)}} ∪ {O_j^{(i)}}
    j ← j + 1
  end if
end for
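The following is a minimal, single-agent Python sketch of the phase mechanics just described: UCB over the active set within a phase, the most played arm Ô_j at the phase end, and the active set update Ŝ^{(i)} ∪ {Ô_j^{(i)}} ∪ {O_j^{(i)}}. The gossip network is abstracted by a get_rec callback (a stand-in introduced here for illustration only), Gaussian noise is used as a concrete 1-subgaussian reward model, and the parameter values are arbitrary; this is a sketch of the mechanics under those assumptions, not the authors' implementation.

```python
import math
import numpy as np

def run_algorithm1_agent(mu_m, sticky_set, get_rec, T, alpha=11.0, beta=2.5, rng=None):
    """Single-agent sketch of Algorithm 1.
    mu_m: mean rewards of the bandit this agent is learning (length K);
    sticky_set: the agent's sticky arms (always kept active);
    get_rec(j): arm ID recommended by a randomly contacted agent at the end of phase j
                (the gossip step is abstracted away)."""
    rng = rng or np.random.default_rng()
    K = len(mu_m)
    pulls = np.zeros(K)              # T_k(t)
    means = np.zeros(K)              # empirical means (0 if never pulled)
    active = set(sticky_set)         # S_j, initialised to the sticky set
    j, next_phase_end = 1, math.ceil(1 ** beta)
    pulls_this_phase = np.zeros(K)

    for t in range(1, T + 1):
        def ucb_index(k):
            if pulls[k] == 0:
                return float("inf")  # unplayed active arms are tried first
            return means[k] + math.sqrt(alpha * math.log(T) / pulls[k])

        arm = max(active, key=ucb_index)
        reward = mu_m[arm] + rng.normal()              # 1-subgaussian reward
        pulls[arm] += 1
        pulls_this_phase[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]

        if t == next_phase_end:                        # end of phase j
            recommended = get_rec(j)                                        # O_j
            most_played = max(active, key=lambda k: pulls_this_phase[k])    # \hat{O}_j
            active = set(sticky_set) | {most_played, recommended}           # S_{j+1}
            j += 1
            next_phase_end = math.ceil(j ** beta)
            pulls_this_phase = np.zeros(K)
    return pulls

# Toy usage: a lone agent whose "network" always recommends arm 0.
mu_m = np.array([0.9, 0.5, 0.4, 0.3, 0.2, 0.1])
pulls = run_algorithm1_agent(mu_m, sticky_set=[3, 4], get_rec=lambda j: 0, T=20_000)
print(pulls)   # arm 0 should dominate once it enters (and stays in) the active set
```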
3.3. Assumption on the Sticky Set
It is possible that during the initialization of Algorithm 1, none of the agents learning a particular bandit have the best arm k*(m) in their sticky sets, i.e., k*(m) ∉ Ŝ^{(i)} for all i ∈ I_m, for some m ∈ [M]. This would result in all agents learning that bandit incurring linear regret. In order to avoid this situation, we follow (Chawla et al., 2020; Vial et al., 2021; 2022; Newton et al., 2022; Sankararaman et al., 2019) and make a mild assumption:

Assumption 3.1. For all m ∈ [M], there exists an agent i*_m ∈ I_m such that k*(m) ∈ Ŝ^{(i*_m)}.

Algorithm 2 Arm Recommendation
Input: Agent i ∈ [N], phase j ∈ ℕ
function GetRec(i, j)
  n ∼ G(i, ·) (sample a neighbor)
  return Ô_j^{(n)} (most played arm by agent n in phase j)
end function

1. In fact, if S is of order (MK/N) log(M/γ) for some γ ∈ (0, 1) and we construct Ŝ^{(i)} for each i by sampling S arms independently and uniformly at random from the K arms, then Assumption 3.1 holds with probability at least 1 − γ. We prove this claim as Proposition D.1 in Appendix D. This choice of S implies that for every bandit m ∈ [M], the K arms are (roughly) equally distributed across the set of N_m = Θ(N/M) agents I_m with high probability, ensuring that every arm remains active for some agent learning that bandit.
2. We can alternatively define the set of arms to be those present in the sticky sets of the agents, to avoid Assumption 3.1 altogether. In such a scenario, agents learning a particular bandit will learn and spread the best arm among the arms in their sticky sets.

3.4. Regret Guarantee
Theorem 3.2 characterizes the performance of Algorithm 1.

Theorem 3.2. Consider a system of N ≥ 2 agents connected by a complete graph (for each i ∈ [N], G(i, n) = (N − 1)^{-1} for all n ≠ i) and learning one of M ≥ 2 bandits with K ≥ 2 arms, satisfying Assumption 3.1. Let the UCB parameter α > 10 and the phase parameter β > 2. Then, the regret incurred by an agent i ∈ I_m running Algorithm 1, for each m ∈ [M], after T time steps is bounded by
E[R_T^{(i)}] ≤ (τ*)^β + (K + g)π²/3 + g_spr + Σ_{k ∈ {Ŝ^{(i)} ∪ B(−m)} \ {k*(m)}} (4α/Δ_{m,k}) log T,
where τ* = 2 max{2, max_{m∈[M]} τ*_m}, τ*_m = inf{ j ∈ ℕ : A_j − A_{j−1} ≥ (S + 2)(1 + (4α/Δ_m²) log A_j) }, g is a constant independent of T of order N(2β + 1)2^{β(α/2−3)}(S + 1), g_spr scales as O(M^{β+1} (log(N/M))² (log log(N/M))^β), and O(·) hides only absolute constants.

1. Scaling of τ* - Proposition C.4 in Appendix C implies that τ* = O((S/Δ²)^{1/(β−2)}), where Δ = min_{m∈[M]} Δ_m. This is expected, because it takes longer for the best arm to be identified and spread for bandits with smaller gaps.
2. Regret scaling - Essentially, Theorem 3.2 says that the regret of any agent i ∈ I_m, for all m ∈ [M], scales as O(((S + M)/Δ_m) log T) for large T.
3. Regret guarantee for arbitrary values of arm means - The result in Theorem 3.2 easily extends to arm means not restricted to [0, 1], as assumed above. The only modification is the following: the terms K + g, g_spr and (τ*)^β are multiplied by Δ̃_m := max_{k∈[K]\{k*(m)}} Δ_{m,k}. It is noteworthy that this modification only affects the second-order terms of the regret.
4. Benefit of collaboration - We have the following corollary when the K arms are equally distributed across all the agents learning the same bandit, i.e., when S = Θ(MK/N).
Corollary 3.3. When S = Θ(MK/N), the regret incurred by an agent i ∈ I_m, for each m ∈ [M], after T time steps scales as O((1/Δ_m)(MK/N + M) log T).
Corollary 3.3 implies that when S = Θ(MK/N) and M ≪ min{K, N}, agents experience regret reduction compared to the case of no collaborations.
5. Group Regret - Corollary 3.4 quantifies the performance of Algorithm 1 in terms of the group regret.
Corollary 3.4. For all bandits m ∈ [M], suppose {Ŝ^{(i)}}_{i∈I_m} is a partition of the set of K arms, i.e., Ŝ^{(i_1)} ∩ Ŝ^{(i_2)} = ∅ for i_1 ≠ i_2 ∈ I_m and ∪_{i∈I_m} Ŝ^{(i)} = [K]. Then the group regret Reg(T) of the system playing Algorithm 1 satisfies
E[Reg(T)] ≤ Σ_{m∈[M]} [ Σ_{k∈[K]\{k*(m)}} (4α/Δ_{m,k}) log T + N_m Σ_{k∈B(−m)} (4α/Δ_{m,k}) log T ] + N(τ*)^β + N(K + g)π²/3 + N g_spr.
Essentially, Corollary 3.4 implies that the expected group regret scales as O(Σ_{m∈[M]} ((K + N)/Δ_m) log T) for large T.

3.5. Proof Sketch (Theorem 3.2)
The proof of Theorem 3.2 is detailed in Appendix C; we highlight the key ideas here. Akin to the scenario of learning a single bandit in (Chawla et al., 2020), we first show the existence of a random phase τ such that, from phase τ onwards, all the agents contain the best arm of the bandit they are learning in their respective active sets and recommend it during information pulls (Claim 3). This allows us to decompose the expected cumulative regret incurred by an agent into two parts: the regret up to phase τ and the regret after phase τ. The regret after phase τ is the regret incurred by playing the UCB algorithm on the sticky set and the M best arms, because agents recommend their respective best arms during information pulls post phase τ. The technical challenge lies in bounding the expected cumulative regret up to phase τ. In particular, we prove a generalization of the result in (Chawla et al., 2020): irrespective of the bandit an agent is learning, the probability of the agent not recommending its best arm, and thus dropping it from its active set at the end of a phase, is small and decreases as the phases progress, so that this does not happen infinitely often (Lemma C.2). This is a consequence of agents playing the UCB Algorithm on their active sets during a phase and the fact that UCB chooses to play the best arm more often than any other arm for large time horizons (Bubeck et al., 2011). This implies the existence of a random phase (denoted by τ_stab), after which agents aware of their best arm (i.e., agents with the relevant best arm in their active set) will always recommend it moving forward.
Our analysis differs significantly from that of (Chawla et al., 2020) post phase τ_stab, where we characterize the time τ_spr,m required for the best arm of the mth bandit to spread via recommendations to the agents I_m learning that bandit. By definition of τ_stab, we know that if an agent i_1 ∈ I_m contacts another agent i_2 ∈ I_m who is aware of the true best arm k*(m) after phase τ_stab, then i_1 becomes aware of the true best arm as well. This arm spreading process therefore resembles a rumor process, where one agent (i*_m in Assumption 3.1) initially knows the rumor (i.e., the best arm), and any agent who contacts someone aware of the rumor becomes informed itself. However, our rumor process is considerably more complicated than the process for the single bandit case in (Chawla et al., 2020). This is because we have M intertwined rumor spreading processes evolving simultaneously: every agent can interact with agents learning a different bandit, and post phase τ_stab, agents recommend whichever of the M rumors (i.e., best arms) is relevant to them. Hence, none of these M rumor spreading processes is a standard rumor spreading process (unlike (Chawla et al., 2020)), so analyzing them directly is infeasible. To tackle this issue, we disentangle the M intertwined processes via a coupling argument.
In particular, we define M independent noisy rumor processes and show via coupling that the spreading times of these processes upper bound the times of the actual arm spreading processes. The mth noisy rumor process, denoted by {R̃_{m,j}}_{j≥0}, unfolds on the subgraph of agents I_m learning the mth bandit and tracks the set R̃_{m,j} of agents aware of the mth rumor at phase j. Initially, only i*_m is aware, i.e., R̃_{m,0} = {i*_m} by Assumption 3.1. In each phase j ∈ ℕ, each agent i ∈ I_m contacts another agent ag ∈ I_m uniformly at random. If ag ∈ R̃_{m,j−1} (ag is aware of the rumor), then i ∈ R̃_{m,j} (i becomes aware as well) subject to Bernoulli(η) noise, where η = (N_m − 1)/(N − 1). Therefore, i becomes aware with probability |R̃_{m,j−1} ∩ I_m| η / N_m ≤ |R̃_{m,j−1} ∩ I_m| / (N − 1). Observe that the right-hand side of this inequality equals the probability with which agent i becomes aware of the best arm in the real process (since in the real process, i contacts ag ∈ [N] \ {i} uniformly at random), which allows us to upper bound the spreading time via coupling as discussed above. Thereafter, we further couple the noisy processes to a noiseless one as in (Vial et al., 2022), and then use an existing bound for the noiseless setting (Chierichetti et al., 2010).

4. Partially Context Aware Algorithm
From Corollary 3.3, it can be noticed that Algorithm 1 incurs an additional regret term scaling as O((M/Δ_m) log T) after T time steps. This is because agents playing Algorithm 1 have the best arm in their active sets and recommend it during information pulls only after a random phase τ. Hence, in the absence of knowledge about which other agents are learning the same bandit, an agent playing Algorithm 1 cannot determine this information with certainty, due to the random nature of τ. One can think of ways in which agents can stop communicating with the agents not learning the same bandit, for example, the blocking approaches considered in (Vial et al., 2021; 2022). These works considered agents learning a single bandit collaboratively in the presence of adversarial agents. However, such blocking strategies incur worse regret: under the assumptions of Corollary 3.3, both (Vial et al., 2021; 2022) incur regret of O((1/Δ_m)(MK/N + N(1 − 1/M)) log T) for large T. This regret scaling is problematic when K/N = Θ(1), as there is no benefit of collaboration: the regret scales as O((K/Δ_m) log T), which is akin to the single-agent regret.
Given that agents in our setting are collaborative, in the sense that they divide the exploration of sub-optimal arms in addition to identifying their best arm, Algorithm 1 has a structure to it: post phase τ, there are only M rumors (corresponding to the M best arms) flowing through the network. If each agent is aware of r − 1 other agents learning the same bandit, they can distribute the exploration of the received arm recommendations as follows: after receiving the arm recommendations, each group of r agents learning the same bandit can select the M most recent unique arm recommendations from all the recommendations they have received so far and divide them (almost) equally among themselves. The intuition is that post phase τ, given that there are only M rumors flowing through the network, we know from the coupon collector problem that after a (finite) random number of phases post phase τ, the M most recent unique arm recommendations will be the M best arms, and it will stay that way from then on. This reduces the additional regret due to arm recommendations by a factor of r; a small sketch of this division step is given below.
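The following small Python sketch illustrates the division step, in the spirit of the DivideRec subroutine (Algorithm 4) described next. It assumes, as in the paper, that all r agents hold an identical sorted array of the M most recent unique recommendations, and it uses 0-indexed positions rather than the 1-indexed formula of Algorithm 4; the function and variable names are illustrative, not the paper's.

```python
import math

def divide_recommendations(group, agent, sorted_recs, r):
    """`group` is the sorted list of the r agent IDs {i} ∪ f(i); `sorted_recs` is the
    (identical for all group members) sorted list of the M most recent unique arm
    recommendations. Each agent deterministically takes its own chunk of about M/r
    arm IDs, so together the group covers all M arms."""
    M = len(sorted_recs)
    chunk = math.ceil(M / r)
    pos = group.index(agent)                         # pos(i): this agent's rank in the group
    positions = [(pos * chunk + offset) % M for offset in range(chunk)]
    return [sorted_recs[p] for p in positions]

group = [2, 7, 11]                  # r = 3 agents known to share one bandit
recs = [1, 4, 5, 9, 12, 17]         # M = 6 most recent unique recommendations, sorted
for agent in group:
    print(agent, divide_recommendations(group, agent, recs, r=len(group)))
# -> agent 2 gets [1, 4], agent 7 gets [5, 9], agent 11 gets [12, 17]
```

Because every group member computes the same sorted array and only reads its own slice, the division requires no coordination beyond the r log₂ K bits of recommendations already exchanged.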
Algorithm 3 (at agent i)
Input: UCB parameter α > 0, phase parameter β > 1, sticky set Ŝ^{(i)} with |Ŝ^{(i)}| = S ≤ K − 2 − ⌈M/r⌉, the set f(i) of agents known to be learning the same bandit
Initialize A_j = ⌈j^β⌉, j ← 1, S_1^{(i)} ← Ŝ^{(i)}, rec(ag) = {} for all ag ∈ {i} ∪ f(i)
for t ∈ ℕ do
  Play I_t^{(i)} = argmax_{k∈S_j^{(i)}} [ µ̂_k^{(i)}(t−1) + √(α log T / T_k^{(i)}(t−1)) ]
  if t == A_j then
    O_j^{(i)} ← GetRec(i, j) (Algorithm 2)
    Ô_j^{(i)} ← argmax_{k∈S_j^{(i)}} (T_k^{(i)}(A_j) − T_k^{(i)}(A_{j−1}))
    rec(i) ← rec(i) ∪ {(j, O_j^{(i)})}
    Obtain O_j^{(ag)} from all ag ∈ f(i) and maintain the sets of arm recommendations: rec(ag) ← rec(ag) ∪ {(j, O_j^{(ag)})} for all ag ∈ f(i)
    uniquerec(i, f(i), j) ← the M most recent unique arm recommendations in ∪_{ag ∈ {i} ∪ f(i)} rec(ag)
    if uniquerec(i, f(i), j) ≠ uniquerec(i, f(i), j − 1) then
      sortrec(i, f(i), j) ← elements of uniquerec(i, f(i), j) in ascending order
      S̃_j^{(i)} ← DivideRec(i, j, f(i), sortrec(i, f(i), j)) (Algorithm 4)
      S_{j+1}^{(i)} ← Ŝ^{(i)} ∪ {Ô_j^{(i)}} ∪ {O_j^{(i)}} ∪ S̃_j^{(i)}
    else
      S_{j+1}^{(i)} ← S_j^{(i)}
    end if
    j ← j + 1
  end if
end for

We use the intuition of the previous paragraph and propose Algorithm 3, which builds upon Algorithm 1 and uses the following extra input: for each m ∈ [M] and i ∈ I_m, let f(i) satisfy (i) f(i) ⊆ I_m, (ii) |f(i)| = r − 1, and (iii) if n ∈ f(i), then i ∈ f(n). Thus, f(i) consists of r − 1 other agents learning the same bandit who are known to agent i. Algorithm 3 divides the M most recent unique arm recommendations among the r agents in {i} ∪ f(i) using the subroutine DivideRec, described in Algorithm 4. DivideRec is best understood for the case when M/r is a (positive) integer. For example, if M = 6 and r = 3, each agent in the set {i} ∪ f(i) gets 2 arm IDs from sortrec(i, f(i), ·). If i is the second smallest element in the sorted version of {i} ∪ f(i) (pos(i) = 2), it gets the third and the fourth entries of sortrec(i, f(i), ·). It is important to note that the array sortrec(i, f(i), ·) is identical for all ag ∈ {i} ∪ f(i). This observation is crucial in dividing the M most recent unique recommendations among the r agents in {i} ∪ f(i) without violating the constraint on the number of communication bits per agent.

1. Constructing S̃_j^{(·)} when |uniquerec(·, f(·), j)| < M - when not enough phases have elapsed, so that there are fewer than M most recent unique arm recommendations, we construct S̃_j^{(·)} with ⌈|uniquerec(·, f(·), j)| / r⌉ elements.
2. Satisfying the communication bit constraint of r log₂ K bits - Each agent i ∈ I_m uses log₂ K bits to receive an arm recommendation via an information pull, and uses log₂ K bits per agent to obtain the arm recommendations received by the r − 1 agents in f(i).

Algorithm 4 Dividing the M most recent unique arm recommendations
Input: Agent i ∈ [N], phase j ∈ ℕ, sets f(i) and sortrec(i, f(i), j)
function DivideRec(i, j, f(i), sortrec(i, f(i), j))
  ags ← elements of {i} ∪ f(i) in ascending order
  pos(i) ← position of i in the array ags
  S̃_j^{(i)} ← elements of sortrec(i, f(i), j) in positions { ((pos(i) − 1)⌈M/r⌉ mod M) + 1, …, ((pos(i)⌈M/r⌉ − 1) mod M) + 1 }
  return S̃_j^{(i)}
end function

4.1. Regret Guarantee
Theorem 4.1 characterizes the performance of Algorithm 3. Here, for any bandit m ∈ [M], k_{m,1}, k_{m,2}, …, k_{m,K} denote the order statistics of the arm means, i.e., k_{m,1} = k*(m) and µ_{m,k_{m,1}} > µ_{m,k_{m,2}} ≥ … ≥ µ_{m,k_{m,K}}.
Theorem 4.1.
Under the assumptions of Theorem 3.2, the regret incurred by an agent i ∈ I_m running Algorithm 3, for each m ∈ [M], after T time steps is bounded by
E[R_T^{(i)}] ≤ Σ_{k ∈ Ŝ^{(i)} \ {k*(m)}} (4α/Δ_{m,k}) log T + Σ_{l=2}^{⌈M/r⌉+2} (4α/Δ_{m,k_{m,l}}) log T + (j*)^β + (K + ĝ)π²/3 + ĝ_spr + ĝ_rec,
where j* = 3 max{2, max_{m∈[M]} j*_m}, j*_m = inf{ j ∈ ℕ : A_j − A_{j−1} ≥ (S + 2 + ⌈M/r⌉)(1 + (4α/Δ_m²) log A_j) }, ĝ is the analogue of the constant g in Theorem 3.2 (with 2^β replaced by 3^β and S + 1 by S + 1 + ⌈M/r⌉), ĝ_rec is a constant independent of T involving Γ(β + 1), where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt for z > 0, ĝ_spr scales as O(M^{β+1} (log(N/M))² (log log(N/M))^β), and O(·) hides only absolute constants.

1. Scaling of j* - j* scales just like τ* in Theorem 3.2, except with S replaced by S + ⌈M/r⌉. Proposition E.5 in Appendix E formalizes this scaling.
2. Regret scaling - Essentially, Theorem 4.1 says that the regret of any agent i ∈ I_m, for all m ∈ [M], scales as O(((S + M/r)/Δ_m) log T) for large T.
3. Benefit of collaboration - We have the following corollary when the K arms are equally distributed across all the agents learning the same bandit, i.e., when S = Θ(MK/N).
Corollary 4.2. When S = Θ(MK/N), the regret incurred by an agent i ∈ I_m, for each m ∈ [M], after T time steps scales as O((1/Δ_m)(MK/N + M/r) log T).
It is clear from Theorem 4.1 and Corollary 4.2 that agents with knowledge of r − 1 other agents learning the same bandit incur less regret, compared to the context unaware scenario where agents are not aware of the other agents learning the same bandit.
4. Group Regret - Corollary 4.3 quantifies the performance of Algorithm 3 in terms of the group regret.
Corollary 4.3. For all bandits m ∈ [M], suppose {Ŝ^{(i)}}_{i∈I_m} is a partition of the set of K arms, i.e., Ŝ^{(i_1)} ∩ Ŝ^{(i_2)} = ∅ for i_1 ≠ i_2 ∈ I_m and ∪_{i∈I_m} Ŝ^{(i)} = [K]. Then the group regret Reg(T) of the system playing Algorithm 3 satisfies
E[Reg(T)] ≤ Σ_{m∈[M]} [ Σ_{k∈[K]\{k*(m)}} (4α/Δ_{m,k}) log T + (N_m/r) Σ_{k∈B(−m)} (4α/Δ_{m,k}) log T ] + N(j*)^β + N(K + ĝ)π²/3 + N ĝ_spr + N ĝ_rec.
Essentially, Corollary 4.3 implies that agents running Algorithm 3 incur expected group regret scaling as O(Σ_{m∈[M]} ((K + N/r)/Δ_m) log T) for large T.

4.2. Proof Sketch (Theorem 4.1)
Similar to the regret analysis of Algorithm 1, we first show the existence of (finite) random phases: (i) τ_stab, such that if agents have the best arm in their active sets, it will be their most played arm and will be recommended henceforth during information pulls, and (ii) an additional τ_spr phases post phase τ_stab, after which all the agents have their best arms in their active sets. The proof that the random phases τ_stab and τ_spr are finite proceeds identically to the proof of Theorem 3.2. Moreover, for Algorithm 3, we also show the existence of additional (finite) random phases post phase τ_stab + τ_spr, denoted by τ_rec, after which the set of M most recent unique arm recommendations equals the set of M best arms from then onwards. As a consequence of the active set updates in Algorithm 3, the active sets of agents remain unchanged in all subsequent phases and freeze, as in the GosInE Algorithm of (Chawla et al., 2020), which helps improve the regret by distributing the exploration of the M best arms across the r − 1 other agents learning the same bandit.
Remark: This freezing does not happen in Algorithm 1, where the active sets remain time-varying post phase τ, and the randomness of τ prevents agents from gathering information about others learning the same bandit with certainty.

5. Lower Bounds
We state lower bounds for Gaussian noise with unit variance. As is standard, we restrict to uniformly efficient policies, i.e., those that ensure small regret on any problem instance.
In our case, instances are defined by the arm means µ = {µ_{m,k}}_{m∈[M],k∈[K]} of the M bandits and the partition I = {I_m}_{m∈[M]} of the N agents across the M bandits. Policies are called uniformly efficient if E[Reg_{µ,I}(T)] = o(T^γ) for any γ ∈ (0, 1) and any instance (µ, I).

Figure 1. (K, M, N, r) are (20, 5, 25, 5) and (30, 6, 36, 6) respectively. Arm means are in [0, 1) and the UCB parameter α = 15. (Plots of group cumulative regret; see Section 6.)

Theorem 5.1. For any uniformly efficient policy,
lim inf_{T→∞} E[Reg_{µ,I}(T)] / log T ≥ Σ_{m∈[M]} Σ_{k∈[K]\{k*(m)}} 2/Δ_{m,k}.

Theorem 5.1 is proved in Appendix F by reducing our model to the setting of (Réda et al., 2022). The result shows that the first terms in Corollaries 3.4 and 4.3 are optimal up to constant factors. Hence, for large T, the suboptimality of our upper bounds is due to the second terms, which grow linearly in N and logarithmically in T. Under a further assumption on µ, we can show that these dependencies are unavoidable. Again, see Appendix F for a proof.

Theorem 5.2. Let (µ, I) be any instance satisfying µ_{m,k*(m)} = µ_{m+1,k*(m)} and k*(m) ≠ k*(m + 1) for each m ∈ [M − 1]. (Such instances µ ∈ [0, 1]^{M×K} exist; for example, µ_{m,k} = ((m − 1) + 1(m = k))/M, where 1(·) is the indicator function.) Then for any uniformly efficient policy in the context unaware scenario,
lim inf_{T→∞} E[Reg_{µ,I}(T)] / log T ≥ Σ_{m=1}^{M−1} |I_m| Δ_m.

6. Numerical Results
We evaluate Algorithm 1 in the context unaware setting and Algorithm 3 in the partially context aware setting, and verify their insights through synthetic simulations. The algorithms are compared against the following benchmarks: (a) no communication, corresponding to all the agents running the UCB-α algorithm on the K-armed MAB from (Auer et al., 2002) without any communication, (b) fully aware, corresponding to agents playing the GosInE algorithm from (Chawla et al., 2020), but where each agent communicates only with the other agents learning the same bandit, and (c) full communication, where for each bandit, all agents play the UCB-α algorithm on the K-armed MAB from (Auer et al., 2002), but with the entire history of all arms pulled and the corresponding rewards obtained by all the agents. We show the group cumulative regret of Algorithms 1 and 3 over 30 random runs with 95% confidence intervals.

Figure 2. (K, M, N, r) are (20, 5, 25, 5) and (30, 6, 36, 6) respectively. Arm means are in [2, 4) and the UCB parameter α = 30. (Plots of group cumulative regret; see Section 6.)

The K arm means for each of the M bandits are generated uniformly at random from [0, 1) in Figure 1 and from [2, 4) in Figure 2. We assume an equal number (N/M) of agents learning each bandit, and set β = 3 and the size of the sticky set S = MK/N in these simulations. The UCB parameter α is set to 15 in Figure 1 and 30 in Figure 2. From our simulations, it is evident that Algorithms 1 and 3 incur lower regret than the case where agents do not communicate, despite the limited communication among the agents and the interaction with agents learning other bandits. Furthermore, our simulations also demonstrate that, for each bandit, when an agent knows other agents learning the same bandit (the partially context aware scenario), it incurs less regret compared to the context unaware scenario.

Acknowledgements
This work was supported by NSF Grants 2207547, 1934986, 2106801, 2019844, 2112471, and 2107037; ONR Grant N00014-19-1-2566; the Machine Learning Lab (MLL) at UT Austin; and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References
Anandkumar, A., Michael, N., Tang, A.
K., and Swami, A. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731 745, 2011. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002. Avner, O. and Mannor, S. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 66 81. Springer, 2014. Bistritz, I. and Leshem, A. Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, volume 31, 2018. Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits Boursier, E. and Perchet, V. Sic-mmab: Synchronisation involves communication in multiplayer multi-armed bandits. Advances in Neural Information Processing Systems, 32, 2019. Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832 1852, 2011. Buccapatnam, S., Tan, J., and Zhang, L. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 2605 2613. IEEE, 2015. Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. Advances in neural information processing systems, 21, 2008. Chakraborty, M., Chua, K. Y. P., Das, S., and Juba, B. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pp. 164 170, 2017. Chawla, R., Sankararaman, A., Ganesh, A., and Shakkottai, S. The gossiping insert-eliminate algorithm for multiagent bandits. In International Conference on Artificial Intelligence and Statistics, pp. 3471 3481. PMLR, 2020. Chawla, R., Sankararaman, A., and Shakkottai, S. Multiagent low-dimensional linear bandits. IEEE Transactions on Automatic Control, 2022. Chierichetti, F., Lattanzi, S., and Panconesi, A. Almost tight bounds for rumour spreading with conductance. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 399 408, 2010. Dakdouk, H., F eraud, R., Laroche, R., Varsier, N., and Maill e, P. Collaborative exploration and exploitation in massively multi-player bandits. 2021. Dubey, A. and Pentland, A. t. S. Differentially-private federated linear bandits. In Advances in Neural Information Processing Systems, volume 33, pp. 6003 6014, 2020. Dubey, A. et al. Kernel methods for cooperative multiagent contextual bandits. In International Conference on Machine Learning, pp. 2740 2750. PMLR, 2020. Hillel, E., Karnin, Z. S., Koren, T., Lempel, R., and Somekh, O. Distributed exploration in multi-armed bandits. Advances in Neural Information Processing Systems, 26, 2013. Kalathil, D., Nayyar, N., and Jain, R. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331 2345, 2014. Kolla, R. K., Jagannathan, K., and Gopalan, A. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking, 26(4): 1782 1795, 2018. Korda, N., Szorenyi, B., and Li, S. Distributed clustering of linear bandits in peer to peer networks. In International conference on machine learning, pp. 1301 1309. PMLR, 2016. Lalitha, A. and Goldsmith, A. Bayesian algorithms for decentralized stochastic bandits. IEEE Journal on Selected Areas in Information Theory, 2(2):564 583, 2021. Landgren, P., Srivastava, V., and Leonard, N. E. 
Distributed cooperative decision-making in multiarmed bandits: Frequentist and bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 167 172. IEEE, 2016. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Liu, K. and Zhao, Q. Distributed learning in multi-armed bandit with multiple players. IEEE transactions on signal processing, 58(11):5667 5681, 2010. Liu, L. T., Mania, H., and Jordan, M. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics, pp. 1618 1628. PMLR, 2020. Madhushani, U. and Leonard, N. When to call your neighbor? strategic communication in cooperative stochastic bandits. ar Xiv preprint ar Xiv:2110.04396, 2021. Mansour, Y., Slivkins, A., and Wu, Z. S. Competing bandits: Learning under competition. ar Xiv preprint ar Xiv:1702.08533, 2017. Mart ınez-Rubio, D., Kanade, V., and Rebeschini, P. Decentralized cooperative stochastic bandits. In Advances in Neural Information Processing Systems, volume 32, 2019. Newton, C., Ganesh, A., and Reeve, H. Asymptotic optimality for decentralised bandits. ACM SIGMETRICS Performance Evaluation Review, 49(2):51 53, 2022. R eda, C., Vakili, S., and Kaufmann, E. Near-optimal collaborative learning in bandits. In Advances in Neural Information Processing Systems, 2022. Rosenski, J., Shamir, O., and Szlak, L. Multi-player bandits a musical chairs approach. In International Conference on Machine Learning, pp. 155 163. PMLR, 2016. Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits Sankararaman, A., Ganesh, A., and Shakkottai, S. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1 35, 2019. Shahrampour, S., Rakhlin, A., and Jadbabaie, A. Multiarmed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2786 2790. IEEE, 2017. Szorenyi, B., Busa-Fekete, R., Hegedus, I., Orm andi, R., Jelasity, M., and K egl, B. Gossip-based distributed stochastic bandit algorithms. In International Conference on Machine Learning, pp. 19 27. PMLR, 2013. Tekin, C. and Van Der Schaar, M. Distributed online learning via cooperative contextual bandits. IEEE transactions on signal processing, 63(14):3700 3714, 2015. Vial, D., Shakkottai, S., and Srikant, R. Robust multiagent multi-armed bandits. In Proceedings of the Twentysecond International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pp. 161 170, 2021. Vial, D., Shakkottai, S., and Srikant, R. Robust multi-agent bandits over undirected graphs. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6 (3):1 57, 2022. Wang, P.-A., Proutiere, A., Ariu, K., Jedra, Y., and Russo, A. Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pp. 4120 4129. PMLR, 2020. Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201 1208, 2009. Zhu, Z., Zhu, J., Liu, J., and Liu, Y. Federated bandit: A gossiping approach. Proc. ACM Meas. Anal. Comput. Syst., 5(1), 2021. Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits A. 
Related Work Revisited Apart from the works mentioned in Section 1.2, there exist several other works in the space of collaborative multi-agent MABs with different models of communication among agents, some of which are shared here. Agents exchange information in (Buccapatnam et al., 2015; Chakraborty et al., 2017) via broadcasting instead of pairwise gossip style communications. There is more frequent communication between the agents in (Kolla et al., 2018; Lalitha & Goldsmith, 2021; Mart ınez-Rubio et al., 2019). In (Madhushani & Leonard, 2021) the number of communications is inversely proportional to the minimum arm gap so could be large. Arm mean estimates are exchanged instead of arm indices in (Landgren et al., 2016). (Wang et al., 2020) employs a leader-follower mechanism, wherein the leader explores the arms and estimates their mean rewards, while the followers play the arm with the highest estimated arm mean based on the samples collected by the leader. Collaborative multi-agent bandits have also been studied in the contextual setting (Dubey et al., 2020; Tekin & Van Der Schaar, 2015) and the linear reward model setting (Dubey & Pentland, 2020; Chawla et al., 2022; Korda et al., 2016), which are significantly different from the setting considered in our work. Some of the works in this space also focus on minimizing the simple regret (Hillel et al., 2013; Szorenyi et al., 2013), instead of minimizing the cumulative regret. There also exist works such as (Bistritz & Leshem, 2018; Shahrampour et al., 2017; Zhu et al., 2021; R eda et al., 2022) which assume different bandits across agents. While this list of works is by no means exhaustive, it represents a broad class of settings studied in the collaborative multi-agent bandits thus far. Another long line of work in multi-agent MABs comprises of collision-based models (Anandkumar et al., 2011; Avner & Mannor, 2014; Bistritz & Leshem, 2018; Boursier & Perchet, 2019; Dakdouk et al., 2021; Kalathil et al., 2014; Liu & Zhao, 2010; Liu et al., 2020; Mansour et al., 2017; Rosenski et al., 2016), where if multiple agents play the same arm, they receive a small or no reward. However, our work assumes that if multiple agents learning the same bandit play the same arm, they receive independent rewards and thus, is different than the collision-based models. B. Notations Revisited Recall that for every bandit m [M] and arm k [K], the (unknown) mean rewards are denoted by µm,k, k (m) denotes the best arm, m,k := µm,k (m) µm,k denotes the arm gap, m := mink =k (m) m,k is the minimum arm gap, Im [N] denotes the set of agents learning the mth bandit and c(.) : [N] [M] is a function mapping the set of agents to the set of bandits, i.e., if i Im, c(i) = m. Let B denote the set of all the M best arms {k (m)}m [M] and B( m) = B\{k (m)} denote the set of best arms excluding the best arm of the mth bandit. For every agent i [N] and phase j N, b O(i) j S(i) j denotes the most played arm by agent i in phase j (where S(i) j is the set of active arms for agent i in phase j), and subsequently recommended at the end of the phase if requested by another agent through information pull. Each phase j lasts from t {1 + Aj 1, , Aj} and A 1(t) = sup{j N : Aj t}. (1) For Aj = jβ , A 1(t) = t 1 β . C. Proof of Theorem 3.2 We extend the proof ideas from (Chawla et al., 2020) and (Vial et al., 2022) to prove our results. Fix an agent i [N] and phase j N. 
Let χ(i) j be a boolean random variable associated with the event n k (c(i)) S(i) j , b O(i) j = k (c(i)) o , i.e., χ(i) j = 1 k (c(i)) S(i) j , b O(i) j = k (c(i)) , (2) indicating whether agent i does not recommend the best arm of the bandit it plays at the end of the phase j, if it is present in its active set S(i) j . We also extend the definitions of random times defined in (Chawla et al., 2020), which will capture the Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits key aspects in the system dynamics, as well as highlight the differences from the single bandit case, as follows: τ (i) stab = inf{j τ : l j, χ(i) l = 0}, τstab = max i [N] τ (i) stab, τ (i) spr = inf{j τstab : k (c(i)) S(i) j } τstab, τspr,m = max i Im τ (i) spr, τspr = max m [M] τspr,m, τ = τstab + τspr. τ (i) stab is the earliest phase, following which if agent i has their best arm in its active set, will play it most number of times and recommend it during information pulls requested by other agents. Furthermore, starting from phase τ, all the agents contain their best arms in their respective active sets, such that it will also be their most played arm and hence, will recommend it at the end of any phase j τ. Mathematically, it implies the following: for all i [N], k (c(i)) S(i) j , b O(i) j = k (c(i)) j τ. (3) Claim (3) can be shown by induction as follows: it is evident from the definition of τ (i) spr that k (c(i)) S(i) τstab+τ (i) spr. Furthermore, b O(i) j = k (c(i)) for any phase j τstab if k (c(i)) S(i) j . Therefore, b O(i) τstab+τ (i) spr = k (c(i)) the update step of the Algorithm 1 ensures that k (c(i)) S(i) τstab+τ (i) spr+1, and hence claim (3) follows. We now highlight the similarities and differences of the random times defined in this work with respect to the random times defined in (Chawla et al., 2020). τstab is defined exactly the same as in (Chawla et al., 2020), because we will demonstrate in Lemma C.3 that the bound on the tail probability of the random variable τ (i) stab is independent of the bandit played by an agent. However, in contrast to (Chawla et al., 2020), given that the set of agents are learning M different bandits, the spread of M best arms is non-trivial (compared to spreading a single best arm), because agents learning different bandits are communicating with each other, hence we have M intertwined spreading processes occuring simultaneously. Consequently, we simply cannot bound the spreading time τspr,m for each of the M best arms by the spreading time of a standard rumor spreading process, unlike (Chawla et al., 2020). This necessitates coupling the M intertwined rumor spreading processes happening in our model to a set of M independent fictitious noisy rumor spreading process happening on the subgraph of the agents for each bandit. C.1. Intermediate Results We begin by providing a roadmap for the proof of Theorem 3.2 and then proving the intermediate results needed to complete it. Claim (3) states that all the agents starting from phase τ contain the best arm of the bandit they are learning in their respective active sets and recommend it during information pulls. This allows us to decompose the expected cumulative regret incurred by an agent into two parts: the regret up to phase τ and the regret after phase τ. We first show that the expected cumulative regret up to phase τ is bounded by a constant that depends only on the system parameters (number of agents, number of bandits and their respective arm gaps) and independent of the time horizon T (Proposition C.1). 
It follows from the observation that the probability of an agent not recommending their best arm and thus dropping it from their active set at the end of a phase is small and decreases as the phases progress, such that it doesn t happen infinitely often (Lemma C.2). This implies the existence of a random phase (defined by τstab), after which agents always recommend and never drop their respective best arms. Post phase τstab, we characterize the time taken by the best arms of each of the M bandits to spread across their respective agents (denoted by τspr,m). Unlike (Chawla et al., 2020), we cannot bound the spreading time {τspr,m}m [M] of each of the M best arms through a standard rumor spreading process. This is because each agent communicates with either other agents learning the same bandit or the agents learning different bandits. For each m [M], τspr,m is bounded by multiple layers of stochastic domination, described in the order in which they are applied below: first, reducing the system to the case when only one agent learning the mth bandit is aware of their best arm k (m) Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits second, lower bounding the spreading time of k (m) through a fictitious noisy rumor spreading process unfolding on the subgraph of the agents learning the mth bandit (described in the proof sketch of Theorem 3.2) third, coupling the fictitious noisy rumor spreading process in the subgraph of the agents learning the mth bandit in the previous layer with a fictitious noiseless process (described in Appendix C.2) The last two layers of stochastic domination are absent in (Chawla et al., 2020), as all the agents are playing the same bandit. The preceding discussion characterizes the regret up to phase τ. We subsequently show that the expected cumulative regret incurred by an agent after phase τ is bounded by the regret due to the arms in their sticky set and the regret due to the other M 1 best arms (Proposition C.1). This is a consequence of all the agents recommending their respective best arms after phase τ (Claim (3)) and every agent communicating with agents learning other bandits. Proposition C.1. The expected cumulative regret of any agent i Im for all m [M] after playing for T time steps is bounded by: E[R(i) T ] E[Aτ] + π2 k { b S(i) B( m)}\{k (m)} 4α m,k log T. Proof. Fix a bandit m [M] and an agent i Im. By regret decomposition principle, we know that t=1 1(I(i) t = k), k [K] m,k1(I(i) t = k), t=Aτ +1 1(I(i) t = k), where I(i) t is the arm played by agent i at time t and the last step follows from the fact that m,k (0, 1). Taking expectation on both sides, we get E[R(i) T ] E[Aτ] + X k [K] m,k E t=Aτ +1 1(I(i) t = k) The first term E[Aτ] is bounded in Proposition C.3. We now bound the second term. Let X(i) k,t = 1 t > Aτ, I(i) t = k, T (i) k (t 1) 4α 2 m,k log T and Y (i) k,t = 1 t > Aτ, I(i) t = k, T (i) k (t 1) > 4α 2 m,k log T . Then, the inner sum in the second term can be re-written as follows: t=Aτ +1 1(I(i) t = k) = t=1 (X(i) k,t + Y (i) k,t ). (5) We first bound the sum PT t=1 E[Y (i) k,t ]. Notice that t=1 E[Y (i) k,t ] = t > Aτ, I(i) t = k, T (i) k (t 1) > 4α 2 m,k log T I(i) t = k, T (i) k (t 1) > 4α 2 m,k log T Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits where we substitute the classical estimate of the probability that UCB plays a sub-optimal arm using Chernoff-Hoeffding bound for 1-subgaussian rewards and α > 10 in the last line. For an arm k [K], we bound the sum PT t=1 E[X(i) k,t] and complete the proof. 
Taking expectations of X^{(i)}_{k,t}, we get

Σ_{t=1}^{T} E[X^{(i)}_{k,t}] = Σ_{t=1}^{T} P(t > A_τ, I^{(i)}_t = k, T^{(i)}_k(t−1) ≤ 4α Δ_{m,k}^{−2} log T).   (7)

For any arm k ∈ [K], three cases are possible:

- If k ∈ Ŝ^{(i)}, i.e., if arm k is one of the sticky sub-optimal arms, the sum in equation (7) is bounded above by 4α Δ_{m,k}^{−2} log T. This is because the event I^{(i)}_t = k cannot occur more than 4α Δ_{m,k}^{−2} log T times; otherwise it would contradict the event T^{(i)}_k(t−1) ≤ 4α Δ_{m,k}^{−2} log T.

- If arm k is a non-sticky arm in the active set, it has to be either the best arm k^*(m) or one of the other best arms from the set B(−m). This follows from Claim (3): starting from phase τ, agents recommend their respective best arms during information pulls. Since every agent communicates with agents learning other bandits, each of the arms in B(−m) could be present in agent i's active set after phase τ. Therefore, for all k ∈ B(−m), the sum in equation (7) is bounded by 4α Δ_{m,k}^{−2} log T, by the same argument as for a sticky sub-optimal arm.

- If arm k is neither a sticky sub-optimal arm nor one of the other best arms from the set B(−m), the event I^{(i)}_t = k cannot happen, because arm k cannot be present in the active set after phase τ by Claim (3). Thus, the sum in equation (7) is equal to zero.

From the above observations, we conclude that for all k ∈ {Ŝ^{(i)} ∪ B(−m)} \ {k^*(m)},

Σ_{t=1}^{T} E[X^{(i)}_{k,t}] ≤ 4α Δ_{m,k}^{−2} log T.   (8)

The proof of Proposition C.1 is completed by substituting equations (6) and (8) into the expectation of equation (5), and then substituting the resulting bound into equation (4).

In order to obtain an upper bound on E[A_τ], we first show that the probability of the error event that the best arm is not recommended during the information pull at the end of phase j (indicated by χ^{(i)}_j in equation (2)) decreases as the phases progress. This result is stated as a lemma below.

Lemma C.2. For any agent i ∈ I_m, for all m ∈ [M], and any phase j ∈ ℕ such that A_j − A_{j−1} ≥ (S + 2)(1 + 4α Δ_m^{−2} log A_j), we have E[χ^{(i)}_j] ≤ 2 A_{j−1}^{1−α/2}.

Proof. The proof of this lemma is identical to the proof of Lemma 6 in (Chawla et al., 2020), except that it is stated here in a more general form (in (Chawla et al., 2020), S := |Ŝ^{(i)}| = ⌈K/N⌉) and uses 1-subgaussian rewards instead of Bernoulli rewards.

Proposition C.3. The regret up to phase τ is bounded by

E[A_τ] ≤ (τ′)^β + N(2β + 1) 2^β c_{α,β} + Σ_{m ∈ [M]} E[A_{2τ_spr,m}],

where τ′ is defined in Theorem 3.2 and c_{α,β} is a positive constant depending only on α and β.

Proof. The random variable τ has support in ℕ. For ℕ-valued random variables, we know that

E[A_τ] = Σ_{t ≥ 1} P(A_τ ≥ t)
 (a)= Σ_{t ≥ 1} P(τ ≥ A^{−1}(t))
 = Σ_{t ≥ 1} P(τ_stab + τ_spr ≥ A^{−1}(t))
 ≤ Σ_{t ≥ 1} P(τ_stab ≥ A^{−1}(t)/2) + Σ_{t ≥ 1} P(τ_spr ≥ A^{−1}(t)/2)
 ≤ Σ_{t ≥ 1} P(τ_stab ≥ A^{−1}(t)/2) + Σ_{m ∈ [M]} Σ_{t ≥ 1} P(τ_spr,m ≥ A^{−1}(t)/2)
 ≤ A_{τ′} + Σ_{t ≥ A_{τ′}+1} P(τ_stab ≥ A^{−1}(t)/2) + Σ_{m ∈ [M]} E[A_{2τ_spr,m}],   (9)

where we use the definition of A^{−1}(·) from equation (1) in step (a). Unlike (Chawla et al., 2020), we cannot bound the spreading time τ_spr,m of each of the M best arms through a standard rumor spreading process. Instead, we bound E[A_{2τ_spr,m}] for all m ∈ [M] in Appendix C.3 by stochastically dominating the random variable τ_spr,m by a fictitious noisy rumor spreading process, which is described in Section 3.5 and whose domination property is proved in Proposition C.5. Here, we focus on bounding the sum Σ_{t ≥ A_{τ′}+1} P(τ_stab ≥ A^{−1}(t)/2). We will calculate P(τ_stab ≥ x) for a fixed x ≥ τ′/2 using Lemma C.2 and then bound this sum.
We have

P(τ_stab ≥ x) (a)≤ Σ_{i=1}^{N} P(τ^{(i)}_stab ≥ x) (b)≤ Σ_{i=1}^{N} Σ_{l ≥ x} P(χ^{(i)}_l = 1) (c)≤ 2N Σ_{l ≥ x} A_{l−1}^{1−α/2},

where we used the definitions of τ_stab and τ^{(i)}_stab in steps (a) and (b) respectively, and Lemma C.2 in step (c), whose condition holds for any phase j ≥ τ′/2 by the definition of τ′. Therefore,

Σ_{t ≥ A_{τ′}+1} P(τ_stab ≥ A^{−1}(t)/2) ≤ Σ_{l ≥ τ′/2} (A_{2l} − A_{2(l−1)}) P(τ_stab ≥ l) ≤ N(2β + 1) 2^β c_{α,β},

where the first step re-writes the range of summation, and the remaining steps use A_j = ⌈j^β⌉, the property of the ceiling function x ≤ ⌈x⌉ < x + 1, the facts that l − 1 ≥ l/2 for all l ≥ 2 and τ′ ≥ 4, and finally the assumption that α > 10 and β > 2, which makes the resulting series summable; c_{α,β} denotes the corresponding constant, depending only on α and β. Substituting the above bound into equation (9) completes the proof of Proposition C.3.

Proposition C.4. τ′_m defined in Theorem 3.2 is bounded by 2 + 1/β + ((1/β + 8α/Δ_m²)(S + 2))^{1/(β−2)}.

Proof. From Theorem 3.2,

τ′_m = inf{j ∈ ℕ : A_j − A_{j−1} ≥ (S + 2)(1 + 4α Δ_m^{−2} log A_j)}.

For A_j = ⌈j^β⌉ and j ≥ 2 + 1/β + ((1/β + 8α/Δ_m²)(S + 2))^{1/(β−2)},

1 + 4α Δ_m^{−2} log A_j = 1 + 4α Δ_m^{−2} log ⌈j^β⌉ (a)≤ 1 + 4α Δ_m^{−2} log(j^β + 1) ≤ 1 + 4α Δ_m^{−2} log(2 j^β) = 1 + 4α Δ_m^{−2} log 2 + 4αβ Δ_m^{−2} log j ≤ (1 + 8αβ Δ_m^{−2}) log j,

where we use x ≤ ⌈x⌉ < x + 1 for any x ∈ ℝ in step (a) and j ≥ 2 + 1/β + ((1/β + 8α/Δ_m²)(S + 2))^{1/(β−2)} > 2 in the last step. Therefore,

(S + 2)(1 + 4α Δ_m^{−2} log A_j) ≤ (S + 2)(1 + 8αβ Δ_m^{−2}) log j ≤ (S + 2)(1 + 8αβ Δ_m^{−2})(j − 1) ≤ β(j − 1)(j − 2)^{β−2} ≤ β(j − 1)^{β−1} − 1,

by the assumption on j and log j ≤ j − 1. Furthermore,

A_j − A_{j−1} = ⌈j^β⌉ − ⌈(j−1)^β⌉ ≥ j^β − (j−1)^β − 1 = β b_j^{β−1} − 1 for some b_j ∈ (j−1, j) ≥ β(j−1)^{β−1} − 1,

by Lagrange's Mean Value Theorem. Thus, we have shown

1 + 4α Δ_m^{−2} log A_j ≤ (β(j−1)^{β−1} − 1)/(S + 2) ≤ (A_j − A_{j−1})/(S + 2).

This completes the proof of Proposition C.4.

C.2. Coupling the Noisy Rumor Spreading Process with the Noiseless Process

Proposition C.5. The random variable τ_spr,m is stochastically dominated by τ̃_spr,m for all m ∈ [M].

Proof. This follows from the construction of the fictitious noisy rumor spreading process in Section 3.5.

Let G_m denote the gossip matrix of the subgraph of the agents learning the m-th bandit. In our work, G_m corresponds to a complete graph of size N_m, i.e., G_m(i, n) = (N_m − 1)^{−1} for all i ∈ I_m and n ∈ I_m \ {i}. Similar to (Vial et al., 2022), we begin by defining the variables pertinent to the fictitious noisy rumor spreading process, and then couple the fictitious noisy rumor spreading process with a fictitious noiseless process unfolding on G_m. The fictitious noiseless process is defined in the same way as the fictitious noisy process in Section 3.5, but with η = 1.

Definition C.6. For each i ∈ I_m, let {Ỹ^{(i)}_j}_{j=1}^{∞} be i.i.d. Bernoulli(η) random variables (with η = (N_m − 1)/(N − 1)) and {H̃^{(i)}_j}_{j=1}^{∞} be i.i.d. random variables chosen uniformly at random from I_m \ {i}. We define R̃_{m,j} as follows: R̃_{m,0} = {i_m} (from Assumption 3.1), and

R̃_{m,j} = R̃_{m,j−1} ∪ {i ∈ I_m \ R̃_{m,j−1} : Ỹ^{(i)}_j = 1, H̃^{(i)}_j ∈ R̃_{m,j−1}} for all j ∈ ℕ.

Then, we can define τ̃_spr,m = inf{j ∈ ℕ : R̃_{m,j} = I_m}.

We now couple the fictitious noisy rumor spreading process introduced in the proof sketch with the fictitious noiseless process to obtain a bound on E[A_{2τ̃_spr,m}], which also bounds E[A_{2τ_spr,m}] by Proposition C.5. We first describe the fictitious noiseless process, restated from (Vial et al., 2022) for the sake of completeness. Fix a bandit m ∈ [M]. Let {H^{(i)}_j}_{j=1}^{∞} be i.i.d. Uniform(N_hon(i)) random variables for each i ∈ I_m; note that in our setting, N_hon(i) = I_m \ {i}. Let R_{m,0} = {i_m},

R_{m,j} = R_{m,j−1} ∪ {i ∈ I_m \ R_{m,j−1} : H^{(i)}_j ∈ R_{m,j−1}} for all j ∈ ℕ,

and let τ′_spr,m = inf{j ∈ ℕ : R_{m,j} = I_m}.
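Before turning to the coupling, the following sketch (ours, with arbitrary values N = 60 and M = 3) simulates the processes above on a complete subgraph of size N_m = N/M; setting η = (N_m − 1)/(N − 1) gives the noisy process of Definition C.6, and η = 1 gives the noiseless process, so one routine covers both.

import random

def spread_time(Nm, eta, seed):
    """One run of the fictitious rumor spreading process of Definition C.6 on a complete
    graph of size Nm: an uninformed agent i becomes informed in round j only if
    Y_j^(i) ~ Bernoulli(eta) equals 1 AND its uniform contact H_j^(i) is already informed.
    eta = 1 recovers the noiseless process."""
    rng = random.Random(seed)
    informed = {0}                                   # R_{m,0} = {i_m}
    j = 0
    while len(informed) < Nm:
        j += 1
        snapshot = set(informed)
        for i in range(Nm):
            if i in snapshot:
                continue
            if rng.random() < eta:                   # Y_j^(i) = 1
                contact = rng.choice([n for n in range(Nm) if n != i])   # H_j^(i) ~ Unif(I_m \ {i})
                if contact in snapshot:
                    informed.add(i)
    return j                                         # noisy or noiseless spreading time

N, M = 60, 3
Nm = N // M
eta = (Nm - 1) / (N - 1)
noisy = [spread_time(Nm, eta, s) for s in range(200)]
noiseless = [spread_time(Nm, 1.0, s + 10_000) for s in range(200)]
print("mean noisy spreading time:", sum(noisy) / len(noisy))
print("mean noiseless spreading time:", sum(noiseless) / len(noiseless))

Empirically, the noisy process is slower by roughly a factor of 1/η, which is the dependence made quantitative in Proposition C.8 below.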
Next, we define the variables pertinent to the coupling between the fictitious noisy and the fictitious noiseless processes. Let σ_0 = 0 and, for each l ∈ ℕ,

σ_l = inf{j > σ_{l−1} : min_{i ∈ I_m} Σ_{s=σ_{l−1}+1}^{j} Ỹ^{(i)}_s ≥ 1}.

Furthermore, for each i ∈ I_m and l ∈ ℕ, let Z^{(i)}_l = min{j ∈ {σ_{l−1}+1, …, σ_l} : Ỹ^{(i)}_j = 1}. Note that each of the sets defining {Z^{(i)}_l}_{i ∈ I_m, l ∈ ℕ} is non-empty, and since Z^{(i)}_l is a deterministic function of {Ỹ^{(i)}_j}_{j=1}^{∞} (which is independent of {H̃^{(i)}_j}_{j=1}^{∞}), the sequence {H̃^{(i)}_{Z^{(i)}_l}}_{l=1}^{∞} is also i.i.d. Uniform(I_m \ {i}) for each i ∈ I_m. Thus, we can set

H̃^{(i)}_j = H^{(i)}_l if j = Z^{(i)}_l for some l ∈ ℕ, and draw H̃^{(i)}_j ∼ Uniform(I_m \ {i}) independently otherwise,

without changing the distribution of {R̃_{m,j}}_{j=0}^{∞}. This results in a coupling under which the fictitious noiseless process dominates the fictitious noisy process, as follows.

Proposition C.7 (Claim 5 in (Vial et al., 2022)). For the coupling described above, R_{m,j} ⊆ R̃_{m,σ_j} for all j ∈ ℕ.

Proposition C.7 allows us to relate the rumor spreading times of the fictitious noisy and the fictitious noiseless processes, namely τ̃_spr,m and τ′_spr,m. In order to do so, we restate the following result from (Vial et al., 2022) in the context of our setting.

Proposition C.8 (Claim 6 in (Vial et al., 2022)). For any j ≥ 3 and ι > 1, P(τ̃_spr,m > ι j log j / η) is at most P(τ′_spr,m > j) plus a correction term that vanishes polynomially in j (see Claim 6 of (Vial et al., 2022) for its exact form).

We now state the result bounding E[A_{2τ̃_spr,m}] with A_j = ⌈j^β⌉.

Proposition C.9. Under the conditions of Theorem 3.2, E[A_{2τ̃_spr,m}] scales as O((M (log(N/M))² log log(N/M))^β), where O(·) hides only absolute constants.

We refer the interested reader to Appendix D.2 of (Vial et al., 2022) for the details of the proofs of Propositions C.8 and C.9. The results in Appendix D.2 of (Vial et al., 2022) (Claims 7 and 8 in particular) are stated for d-regular graphs with conductance φ. We point out that in our setting, for each bandit m ∈ [M], G_m is the gossip matrix of a complete graph of size N_m. Since a complete graph of size N_m is a d-regular graph with d = N_m − 1, its conductance is φ = N_m/(2(N_m − 1)). Proposition C.9 follows by substituting these values of d and φ, along with η, into the results of Appendix D.2 of (Vial et al., 2022), and using the assumption that N_m = Θ(N/M).

C.3. Completing the proof of Theorem 3.2

Propositions C.5 and C.9 imply that E[A_{2τ_spr,m}] also scales as O((M (log(N/M))² log log(N/M))^β). Combining this observation with Propositions C.3 and C.1 completes the proof of Theorem 3.2.

D. Random Initialization of Sticky Sets

Proposition D.1. If S = ⌈MK/γ⌉ for some γ ∈ (0, 1) and we construct Ŝ^{(i)} for each agent i ∈ [N] by sampling S arms independently and uniformly at random from the K arms, then Assumption 3.1 holds with probability at least 1 − γ.

Proof. Fix a bandit m ∈ [M] and let i ∈ I_m. Then,

P(k^*(m) ∉ Ŝ^{(i)}) = (1 − 1/K)^S.

Let Ê_m denote the event that k^*(m) ∉ Ŝ^{(i)} for all i ∈ I_m. Then,

P(Ê_m) = P(∩_{i ∈ I_m} {k^*(m) ∉ Ŝ^{(i)}}) = Π_{i ∈ I_m} P(k^*(m) ∉ Ŝ^{(i)}) ≤ (1 − 1/K)^S ≤ e^{−S/K} ≤ K/S.

In order for Assumption 3.1 to fail, there must be at least one bandit for which no agent learning that bandit has the best arm in its sticky set. Therefore, Assumption 3.1 fails with probability at most Σ_{m ∈ [M]} P(Ê_m) ≤ MK/S. Setting S = ⌈MK/γ⌉ completes the proof.
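A quick Monte Carlo check of Proposition D.1 (ours, not part of the proof): we sample the sticky sets with replacement, following the "independently and uniformly at random" reading, identify k^*(m) with arm index m purely for convenience, and estimate the probability that Assumption 3.1 fails.

import math
import random

def assumption_holds(N, M, K, S, rng):
    """Sample sticky sets by drawing S arms independently and uniformly at random (with
    replacement) for each agent, and check Assumption 3.1: for every bandit m, at least one
    agent learning m holds k*(m) (identified with arm index m here)."""
    bandit = [i % M for i in range(N)]
    covered = [False] * M
    for i in range(N):
        sticky = {rng.randrange(K) for _ in range(S)}
        if bandit[i] in sticky:
            covered[bandit[i]] = True
    return all(covered)

N, M, K, gamma = 30, 3, 10, 0.2
S = math.ceil(M * K / gamma)                         # S = ceil(MK / gamma), as in Proposition D.1
rng = random.Random(0)
failures = sum(not assumption_holds(N, M, K, S, rng) for _ in range(2000))
print(f"empirical failure rate: {failures / 2000:.4f} (Proposition D.1 guarantees at most {gamma})")

With S = ⌈MK/γ⌉ the guarantee is quite conservative: in this example S exceeds K, so each sticky set almost surely contains every arm and the empirical failure rate is far below γ.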
E. Proof of Theorem 4.1

We need additional notation to prove Theorem 4.1, mainly because of the more involved active set updates at the end of a phase. Let {same(z)}_{z ∈ [N/r]} be a partition of the set of all agents [N] into N/r sets such that: (i) |same(z)| = r for all z ∈ [N/r], and (ii) for any i, i′ ∈ same(z) with i ≠ i′, c(i) = c(i′), and if i ∈ f(i′) then i′ ∈ f(i). In words, for each z, same(z) consists of agents learning the same bandit, each of whom is aware of the others in same(z); i.e., for some i ∈ [N], same(z) = {i} ∪ f(i). Similar to the proof of Theorem 3.2, we define some random times, which will help us prove Theorem 4.1:

τ^{(i)}_stab = inf{j ≥ j′ : χ^{(i)}_l = 0 for all l ≥ j},   τ_stab = max_{i ∈ [N]} τ^{(i)}_stab,
τ^{(i)}_spr = inf{j ≥ τ_stab : k^*(c(i)) ∈ S^{(i)}_j} − τ_stab,   τ_spr,m = max_{i ∈ I_m} τ^{(i)}_spr,   τ_spr = max_{m ∈ [M]} τ_spr,m,
τ_rec,z = inf{j ≥ τ_stab + τ_spr : uniquerec(same(z), j) = B} − (τ_stab + τ_spr),   τ_rec = max_{z ∈ [N/r]} τ_rec,z,
τ = τ_stab + τ_spr + τ_rec.

For each group z ∈ [N/r] of agents learning the same bandit and aware of this fact, τ_rec,z denotes the number of phases it takes after phase τ_stab + τ_spr until the M most recent unique arm recommendations among all the agents in same(z) form the set of M best arms. The active set updates in Algorithm 3 ensure that the active sets of the agents freeze after phase τ (as in (Chawla et al., 2020)), unlike Algorithm 1, where the active sets keep changing over time even after the agents have identified their respective best arms and recommend them during information pulls. We now prove this claim.

Proposition E.1. For any agent i ∈ I_m playing Algorithm 3, the following statements hold for all j > τ: (i) k^*(m) ∈ S^{(i)}_j, (ii) S^{(i)}_j = S^{(i)}_τ, (iii) {S̃^{(i)}_τ ∪ {O^{(i)}_τ}} ⊆ B.

Proof. Statement (i) follows exactly from Claim (3). To prove (ii), notice that for any group of agents same(z) and any phase j ≥ τ_stab + τ_spr + τ_rec,z, uniquerec(same(z), j) = B. This follows from Claim (3): after phase τ_stab + τ_spr, agents recommend their respective best arms during information pulls, so eventually the M most recent unique arm recommendations of the agents in same(z) (denoted by uniquerec(same(z), j − 1)) equal the set of M best arms. Furthermore, the active set updates in Algorithm 3 ensure that if uniquerec(same(z), j) = uniquerec(same(z), j − 1), the active sets of all the agents in same(z) remain unchanged. Since τ ≥ τ_stab + τ_spr + τ_rec,z, claim (ii) holds. Claim (iii) follows from the observations that for any group of agents same(z) and any phase j ≥ τ_stab + τ_spr + τ_rec,z, uniquerec(same(z), j) = B, that S̃^{(i)}_j ⊆ uniquerec(same(z), j) by construction, and that O^{(i)}_j ∈ B after any phase j ≥ τ_stab + τ_spr.

E.1. Intermediate Results

Most of the intermediate results for Algorithm 3 are similar to the corresponding results for Algorithm 1 and are proved in a similar way. The technical novelty in proving Theorem 4.1 lies in proving that E[τ_rec] is finite. Before decomposing the regret into the parts up to phase τ and after phase τ, we introduce some notation. For any bandit m ∈ [M], let k_{m,1}, k_{m,2}, …, k_{m,K} denote the IDs of the arms sorted in decreasing order of their means, i.e., k_{m,1} = k^*(m) and μ_{m,k_{m,1}} > μ_{m,k_{m,2}} ≥ … ≥ μ_{m,k_{m,K}}.

Proposition E.2. The expected cumulative regret of any agent i ∈ I_m, for all m ∈ [M], after playing for T time steps is bounded by

E[R^{(i)}_T] ≤ E[A_τ] + Kπ²/3 + Σ_{k ∈ {Ŝ^{(i)} ∪ {k_{m,l} : 2 ≤ l ≤ M/r + 2}} \ {k^*(m)}} (4α/Δ_{m,k}) log T.

Proof. Fix a bandit m ∈ [M] and an agent i ∈ I_m. By the regret decomposition principle,

R^{(i)}_T = Σ_{k ∈ [K]} Δ_{m,k} Σ_{t=1}^{T} 𝟙(I^{(i)}_t = k) ≤ A_τ + Σ_{k ∈ [K]} Δ_{m,k} Σ_{t=A_τ+1}^{T} 𝟙(I^{(i)}_t = k),

where I^{(i)}_t is the arm played by agent i at time t and the last step follows from the fact that Δ_{m,k} ∈ (0, 1). Taking expectations on both sides, we get

E[R^{(i)}_T] ≤ E[A_τ] + Σ_{k ∈ [K]} Δ_{m,k} E[Σ_{t=A_τ+1}^{T} 𝟙(I^{(i)}_t = k)].   (10)

The first term E[A_τ] is bounded in Proposition E.4. We now bound the second term.
Let X^{(i)}_{k,t} = 𝟙{t > A_τ, I^{(i)}_t = k, T^{(i)}_k(t−1) ≤ 4α Δ_{m,k}^{−2} log T} and Y^{(i)}_{k,t} = 𝟙{t > A_τ, I^{(i)}_t = k, T^{(i)}_k(t−1) > 4α Δ_{m,k}^{−2} log T}. Then, the inner sum in the second term can be re-written as

Σ_{t=A_τ+1}^{T} 𝟙(I^{(i)}_t = k) = Σ_{t=1}^{T} (X^{(i)}_{k,t} + Y^{(i)}_{k,t}).   (11)

We first bound the sum Σ_{t=1}^{T} E[Y^{(i)}_{k,t}]. Notice that

Σ_{t=1}^{T} E[Y^{(i)}_{k,t}] = Σ_{t=1}^{T} P(t > A_τ, I^{(i)}_t = k, T^{(i)}_k(t−1) > 4α Δ_{m,k}^{−2} log T) ≤ Σ_{t=1}^{T} P(I^{(i)}_t = k, T^{(i)}_k(t−1) > 4α Δ_{m,k}^{−2} log T) ≤ π²/3,   (12)

where, in the last step, we substitute the classical estimate of the probability that UCB plays a sub-optimal arm, obtained from the Chernoff-Hoeffding bound for 1-subgaussian rewards together with α > 10. For an arm k ∈ [K], we now bound the sum Σ_{t=1}^{T} E[X^{(i)}_{k,t}] and complete the proof. We have

Σ_{t=1}^{T} X^{(i)}_{k,t} = Σ_{t=1}^{T} 𝟙{t > A_τ, I^{(i)}_t = k, T^{(i)}_k(t−1) ≤ 4α Δ_{m,k}^{−2} log T}.   (13)

For any arm k ∈ [K], two cases are possible:

- If k ∈ S^{(i)}_τ and arm k is one of the sub-optimal arms, the sum in equation (13) is bounded above by 4α Δ_{m,k}^{−2} log T. This is because the event I^{(i)}_t = k cannot occur more than 4α Δ_{m,k}^{−2} log T times; otherwise it would contradict the event T^{(i)}_k(t−1) ≤ 4α Δ_{m,k}^{−2} log T.

- If arm k ∉ S^{(i)}_τ, the event I^{(i)}_t = k cannot happen. Thus, the sum in equation (13) is equal to zero.

From the above observations, we conclude that for all k ∈ S^{(i)}_τ \ {k^*(m)},

Σ_{t=1}^{T} X^{(i)}_{k,t} ≤ 4α Δ_{m,k}^{−2} log T.   (14)

Consequently,

Σ_{k ∈ S^{(i)}_τ \ {k^*(m)}} Σ_{t=1}^{T} X^{(i)}_{k,t} ≤ Σ_{k ∈ Ŝ^{(i)} \ {k^*(m)}} 4α Δ_{m,k}^{−2} log T + Σ_{l=2}^{M/r + 2} 4α Δ_{m,k_{m,l}}^{−2} log T,

because S^{(i)}_τ = Ŝ^{(i)} ∪ {Ô^{(i)}_τ} ∪ {O^{(i)}_τ} ∪ S̃^{(i)}_τ, Ô^{(i)}_τ = k^*(m), and we have no control over which arms from the set B are present in S̃^{(i)}_τ ∪ {O^{(i)}_τ} (Proposition E.1); we therefore assume the worst case and bound the regret of the arms in that set by the regret of the first M/r + 2 arms in increasing order of their arm gaps (equivalently, in decreasing order of their arm means). Taking expectations on both sides of the above inequality, we get

Σ_{k ∈ S^{(i)}_τ \ {k^*(m)}} Σ_{t=1}^{T} E[X^{(i)}_{k,t}] ≤ Σ_{k ∈ Ŝ^{(i)} \ {k^*(m)}} 4α Δ_{m,k}^{−2} log T + Σ_{l=2}^{M/r + 2} 4α Δ_{m,k_{m,l}}^{−2} log T.   (15)

The proof of Proposition E.2 is completed by substituting equations (12) and (15) into the expectation of equation (11), and then substituting the resulting bound into equation (10).

In order to obtain an upper bound on E[A_τ], we first show that the probability of the error event that the best arm is not recommended during the information pull at the end of phase j (indicated by χ^{(i)}_j in equation (2)) decreases as the phases progress. This result is stated as a lemma below.

Lemma E.3. For any agent i ∈ I_m, for all m ∈ [M], and any phase j ∈ ℕ such that A_j − A_{j−1} ≥ (S + M/r + 2)(1 + 4α Δ_m^{−2} log A_j), we have E[χ^{(i)}_j] ≤ 2 A_{j−1}^{1−α/2}.

Proof. The proof of this lemma is identical to the proofs of Lemma 6 in (Chawla et al., 2020) and Lemma C.2, except that |S^{(i)}_j| ≤ S + M/r + 2.

Proposition E.4. The regret up to phase τ is bounded by

E[A_τ] ≤ (j′)^β + N(3β + 1) 3^β c_{α,β} + Σ_{m ∈ [M]} E[A_{3τ_spr,m}] + Σ_{z ∈ [N/r]} E[A_{3τ_rec,z}],

where j′ is defined in Theorem 4.1 and c_{α,β} is a positive constant depending only on α and β.

Proof. The random variable τ has support in ℕ. For ℕ-valued random variables, we know that

E[A_τ] = Σ_{t ≥ 1} P(A_τ ≥ t)
 (a)= Σ_{t ≥ 1} P(τ ≥ A^{−1}(t))
 = Σ_{t ≥ 1} P(τ_stab + τ_spr + τ_rec ≥ A^{−1}(t))
 ≤ Σ_{t ≥ 1} P(τ_stab ≥ A^{−1}(t)/3) + Σ_{t ≥ 1} P(τ_spr ≥ A^{−1}(t)/3) + Σ_{t ≥ 1} P(τ_rec ≥ A^{−1}(t)/3)
 ≤ Σ_{t ≥ 1} P(τ_stab ≥ A^{−1}(t)/3) + Σ_{m ∈ [M]} Σ_{t ≥ 1} P(τ_spr,m ≥ A^{−1}(t)/3) + Σ_{z ∈ [N/r]} Σ_{t ≥ 1} P(τ_rec,z ≥ A^{−1}(t)/3)
 ≤ A_{j′} + Σ_{t ≥ A_{j′}+1} P(τ_stab ≥ A^{−1}(t)/3) + Σ_{m ∈ [M]} E[A_{3τ_spr,m}] + Σ_{z ∈ [N/r]} E[A_{3τ_rec,z}],   (16)

where we use the definition of A^{−1}(·) from equation (1) in step (a). E[A_{3τ_spr,m}] is bounded for all m ∈ [M] in the same way as for Algorithm 1; the bound is stated in Proposition E.6.
We bound E[A_{3τ_rec,z}] for each z ∈ [N/r] in Proposition E.7. The process for bounding the sum Σ_{t ≥ A_{j′}+1} P(τ_stab ≥ A^{−1}(t)/3) is identical to that for bounding the sum Σ_{t ≥ A_{τ′}+1} P(τ_stab ≥ A^{−1}(t)/2) for Algorithm 1 in Proposition C.3.

Proposition E.5. j′_m defined in Theorem 4.1 is bounded by 2 + 1/β + ((1/β + 8α/Δ_m²)(S + 2 + M/r))^{1/(β−2)}.

Proof. The proof of Proposition E.5 is identical to the proof of Proposition C.4, except that S + 2 is replaced by S + 2 + M/r.

Proposition E.6. Under the conditions of Theorem 4.1, E[A_{3τ_spr,m}] scales as O((M (log(N/M))² log log(N/M))^β), where O(·) hides only absolute constants.

Proof. The proof is identical to the proof of the bound on E[A_{2τ_spr,m}] for Algorithm 1.

Proposition E.7. For each z ∈ [N/r], E[A_{3τ_rec,z}] is bounded by

E[A_{3τ_rec,z}] ≤ (3M)^β + 1 + 2 · 3^β Γ(β + 1) ((M − 1)(1 − c₁/M)^{−r}(c₁ r/M)^{−(β+1)} + (1 − c₁/N)^{−r}(c₁ r/N)^{−(β+1)}),

where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt for α > 0 denotes the Gamma function.

Proof. Fix z ∈ [N/r]. From the definition of same(z), we know that there exists an i ∈ I_m such that same(z) = {i} ∪ f(i) for some m ∈ [M]. Let E_{m′} denote the event that an agent i ∈ [N] does not contact an agent from I_{m′} during an information pull at the end of a phase. Then, from the choice of the gossip matrix G in Theorem 4.1,

P(E_{m′}) = 1 − (N_{m′} − 1)/(N − 1) if c(i) = m′, and P(E_{m′}) = 1 − N_{m′}/(N − 1) otherwise.

Furthermore,

P(E_{m′}) ≤ 1 − c₁/N if c(i) = m′, and P(E_{m′}) ≤ 1 − c₁/M otherwise,   (17)

by the assumption on N_{m′} for all m′ ∈ [M]. We also know that in τ_rec,z phases, the agents in same(z) receive a total of r τ_rec,z arm recommendations, which for each agent i ∈ same(z) are i.i.d. Uniform([N] \ {i}) and independent across all i ∈ same(z). By the definition of τ_rec,z, {τ_rec,z > x} is the event that the agents in same(z) do not have their M most recent unique arm recommendations equal to the set of M best arms B within x phases. Therefore,

P(τ_rec,z > x) ≤ P(∪_{m′ ∈ [M]} {E_{m′} occurs in all rx pulls}) ≤ Σ_{m′ ∈ [M]} P(E_{m′} occurs in all rx pulls) ≤ (M − 1)(1 − c₁/M)^{rx} + (1 − c₁/N)^{rx},   (18)

where the last two steps follow from equation (17) and the fact that the information pulls are independent across agents and phases. We are now ready to bound E[A_{3τ_rec,z}]. Given that A_{3τ_rec,z} is an ℕ-valued random variable, we have

E[A_{3τ_rec,z}] = Σ_{t ≥ 1} P(A_{3τ_rec,z} ≥ t)
 ≤ A_{3M} + Σ_{t ≥ A_{3M}+1} P(A_{3τ_rec,z} ≥ t)
 (a)≤ A_{3M} + Σ_{j ≥ 1} P(A_{3τ_rec,z} ≥ A_{3(M+j−1)}) (A_{3(M+j)} − A_{3(M+j−1)})
 (b)≤ A_{3M} + Σ_{j ≥ 1} P(τ_rec,z ≥ M + j − 1) A_{3(M+j)}
 (c)≤ A_{3M} + (M − 1) Σ_{j ≥ 1} (1 − c₁/M)^{r(M+j−1)} A_{3(M+j)} + Σ_{j ≥ 1} (1 − c₁/N)^{r(M+j−1)} A_{3(M+j)},   (19)

where step (a) follows from the fact that for a random variable X, P(X ≥ x) is non-increasing in x, step (b) follows from the definition of A^{−1}(·) in equation (1), and we substitute equation (18) in step (c). We bound each of the two sums in equation (19) to complete the proof. For any ε ∈ (0, 1) and A_j = ⌈j^β⌉,

Σ_{j ≥ 1} (1 − ε)^{r(M+j−1)} A_{3(M+j)} = Σ_{j ≥ 1} (1 − ε)^{r(M+j−1)} ⌈(3(M + j))^β⌉
 (a)≤ 2 · 3^β Σ_{j ≥ 1} (1 − ε)^{r(M+j−1)} (M + j)^β
 = 2 · 3^β Σ_{l ≥ M+1} (1 − ε)^{r(l−1)} l^β
 ≤ (2 · 3^β / (1 − ε)^r) Σ_{l ≥ M+1} e^{−εrl} l^β
 ≤ (2 · 3^β / (1 − ε)^r) ∫_0^∞ e^{−εry} y^β dy
 (b)= (2 · 3^β / ((1 − ε)^r (εr)^{β+1})) ∫_0^∞ u^β e^{−u} du
 = 2 · 3^β Γ(β + 1) / ((1 − ε)^r (εr)^{β+1}),

where we use ⌈x⌉ ≤ 2x for all x ≥ 1 in step (a) and perform the change of variables u = εry in step (b). Substituting this bound into equation (19), with ε = c₁/M in the first sum and ε = c₁/N in the second, and using A_{3M} ≤ (3M)^β + 1, we get

E[A_{3τ_rec,z}] ≤ (3M)^β + 1 + 2 · 3^β Γ(β + 1) ((M − 1)(1 − c₁/M)^{−r}(c₁ r/M)^{−(β+1)} + (1 − c₁/N)^{−r}(c₁ r/N)^{−(β+1)}),

which completes the proof of Proposition E.7.

E.2. Completing the proof of Theorem 4.1

The proof of Theorem 4.1 follows from Propositions E.2, E.4, E.6 and E.7.
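To build intuition for the recommendation-collection time τ_rec,z bounded in Proposition E.7, the following sketch (ours, illustrative only) models a group same(z) of r agents after phase τ_stab + τ_spr: in every phase, each group member contacts a uniformly random agent, and a contact with an agent learning bandit m′ is assumed to return the recommendation k^*(m′) (identified with the index m′ here). For simplicity, it tracks coverage of all M best arms rather than the exact "M most recent unique recommendations" bookkeeping of Algorithm 3.

import random

def tau_rec(N=60, M=3, r=4, seed=0):
    """Toy model of tau_rec,z: each of the r agents in same(z) contacts one agent drawn
    uniformly from [N] \ {i} per phase; a contact landing on an agent learning bandit m'
    contributes the recommendation k*(m'). We count phases until the group's received
    recommendations cover all M best arms."""
    rng = random.Random(seed)
    bandit = [n * M // N for n in range(N)]          # block assignment: N/M agents per bandit
    group = list(range(r))                            # same(z): r agents, all learning bandit 0
    received = set()
    phase = 0
    while len(received) < M:
        phase += 1
        for i in group:
            peer = rng.choice([n for n in range(N) if n != i])
            received.add(bandit[peer])               # recommendation k*(c(peer))
    return phase

samples = [tau_rec(seed=s) for s in range(500)]
print("empirical mean of tau_rec,z:", sum(samples) / len(samples))

The tail of this stopping time is essentially geometric in the number of pulls, in line with the (1 − c₁/M)^{rx} and (1 − c₁/N)^{rx} terms in equation (18).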
F. Proofs of Lower Bounds

F.1. Proof of Theorem 5.1

We derive the lower bound for our model by adapting the lower bound for the setting considered in (Réda et al., 2022). (Réda et al., 2022) considers the following problem: there are N agents and K arms. When agent n ∈ [N] pulls arm k ∈ [K], it observes a noisy reward with mean μ_{k,n}. However, regret is measured with respect to a mixed reward with mean μ̃_{k,n} = Σ_{i=1}^N w_{i,n} μ_{k,i}, where {w_{i,n}}_{i=1}^N are (known) nonnegative weights with Σ_{i=1}^N w_{i,n} = 1. So the optimal arm for agent n is k^*_n = arg max_{k ∈ [K]} μ̃_{k,n} and the relevant arm gaps are Δ̃_{k,n} = μ̃_{k^*_n,n} − μ̃_{k,n}. Unlike our work, (Réda et al., 2022) does not place any constraints on the lengths of the messages exchanged in the communication rounds.

Lower bound: Theorem 3 of (Réda et al., 2022) bounds the group regret Reg(T) (i.e., the regret summed across agents) as follows: for uniformly efficient algorithms, assuming Gaussian rewards with unit variance,

lim inf_{T→∞} Reg(T)/log T ≥ min_{x ∈ X} f(x),   (20)

where

X = {x ∈ R^{K×N}_+ : Σ_{i : k^*_i ≠ k} w²_{i,n}/x_{k,i} ≤ (Δ̃_{k,n})²/2 for all n ∈ [N], k ∈ [K]},   f(x) = Σ_{k ∈ [K]} Σ_{n : k^*_n ≠ k} x_{k,n} Δ̃_{k,n} for x ∈ X.   (21)

Applying this to our setting and proving Theorem 5.1. Notation: In our setting, c(n) ∈ [M] denotes the bandit that agent n ∈ [N] is learning, and c^{−1}(m) = {n ∈ [N] : c(n) = m} denotes the set of agents learning bandit m ∈ [M]. Notice that for any agent n ∈ [N], c^{−1}(c(n)) is the set of agents learning the same bandit as agent n (and n ∈ c^{−1}(c(n))).

Reduction to the setting of (Réda et al., 2022): For each agent n ∈ [N], we choose μ_{k,n} = μ_{c(n),k} and w_{i,n} = 𝟙(i ∈ c^{−1}(c(n)))/|c^{−1}(c(n))| for the arm means and weights. Then, defining μ̃_{k,n} as above, a simple calculation shows μ̃_{k,n} = μ_{c(n),k} = μ_{k,n}, so Δ̃_{k,n} = Δ_{c(n),k}. Note in particular that μ̃_{k,n} = μ_{k,n} means that the observed and mixed rewards coincide, as in our model, and thus k^*_n = k^*(c(n)). Since the agents are aware of the weights in the setting of (Réda et al., 2022), in our model this corresponds to the case where every agent knows all the other agents learning the same bandit. Additionally, in (Réda et al., 2022), there are no constraints on the lengths of the messages exchanged during the communication rounds. Since their model imposes no such constraints, our model is a special case of the model in (Réda et al., 2022), and thus their lower bound is applicable to our setting.

Applying their lower bound: In our special case, for any agent n and arm k ≠ k^*_n, we have

Σ_{i : k^*_i ≠ k} w²_{i,n}/x_{k,i} = (1/|c^{−1}(c(n))|²) Σ_{i ∈ c^{−1}(c(n)) : k^*_i ≠ k} 1/x_{k,i} = (1/|c^{−1}(c(n))|²) Σ_{i ∈ c^{−1}(c(n))} 1/x_{k,i},   (22)

where the first equality uses our choice of w_{i,n} and the second holds since k^*_i = k^*_n ≠ k for each i ∈ c^{−1}(c(n)). Therefore, the constraint for such (n, k) pairs simplifies to

(1/|c^{−1}(c(n))|²) Σ_{i ∈ c^{−1}(c(n))} 1/x_{k,i} ≤ (Δ̃_{k,n})²/2 = Δ²_{c(n),k}/2,

which can be rearranged to obtain

2/(|c^{−1}(c(n))| Δ²_{c(n),k}) ≤ |c^{−1}(c(n))| / Σ_{i ∈ c^{−1}(c(n))} (1/x_{k,i}) = HM({x_{k,i}}_{i ∈ c^{−1}(c(n))}),

where HM({y_j}_j) is the harmonic mean of {y_j}_j ⊂ R_+. On the other hand, when k = k^*_n, we have w_{i,n} = 0 for all i ∉ c^{−1}(c(n)) and k^*_i = k^*_n = k for all i ∈ c^{−1}(c(n)), so the summation in (22) is zero and the corresponding constraint is satisfied for any x ∈ R^{K×N}_+. Combining these observations, the constraint set in our special case simplifies to

X = {x ∈ R^{K×N}_+ : 2/(|c^{−1}(c(n))| Δ²_{c(n),k}) ≤ HM({x_{k,i}}_{i ∈ c^{−1}(c(n))}) for all n ∈ [N], k ≠ k^*(c(n))}.

Now notice that these inequality constraints depend only on the bandit c(n), not on the individual agent n.
Therefore, we can further simplify the constraint set to

X = {x ∈ R^{K×N}_+ : 2/(|c^{−1}(m)| Δ²_{m,k}) ≤ HM({x_{k,n}}_{n ∈ c^{−1}(m)}) for all m ∈ [M], k ≠ k^*(m)}.

Next, consider the objective function in our special case. We rewrite it as

f(x) = Σ_{k ∈ [K]} Σ_{m ∈ [M]} Σ_{n=1}^{N} x_{k,n} Δ̃_{k,n} 𝟙(k^*_n ≠ k, c(n) = m) = Σ_{k ∈ [K]} Σ_{m ∈ [M]} Σ_{n=1}^{N} x_{k,n} Δ_{m,k} 𝟙(k^*(m) ≠ k, c(n) = m)   (23)
 = Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} Δ_{m,k} Σ_{n ∈ c^{−1}(m)} x_{k,n} = Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} Δ_{m,k} |c^{−1}(m)| AM({x_{k,n}}_{n ∈ c^{−1}(m)}),   (24)

where AM({y_j}_j) denotes the arithmetic mean. Then, for any x ∈ X, we have

f(x) ≥ Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} Δ_{m,k} |c^{−1}(m)| HM({x_{k,n}}_{n ∈ c^{−1}(m)}) ≥ Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} 2/Δ_{m,k},

where the first inequality is AM({y_j}_j) ≥ HM({y_j}_j) for any {y_j}_j ⊂ R_+ and the second holds since x ∈ X. On the other hand, if we set x^*_{k,n} = 2/(|c^{−1}(c(n))| Δ²_{c(n),k}), we clearly have AM({x^*_{k,n}}_{n ∈ c^{−1}(m)}) = HM({x^*_{k,n}}_{n ∈ c^{−1}(m)}) = 2/(|c^{−1}(m)| Δ²_{m,k}) for all m ∈ [M] and k ≠ k^*(m). Combined with the above, this shows that x^* ∈ X and

f(x^*) = Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} 2/Δ_{m,k} ≤ f(x) for all x ∈ X,

so x^* is optimal. Hence, using the lower bound from (Réda et al., 2022), we get

lim inf_{T→∞} Reg(T)/log T ≥ Σ_{m ∈ [M]} Σ_{k ≠ k^*(m)} 2/Δ_{m,k}.

F.2. Proof of Theorem 5.2

The bulk of the proof involves showing that, for any uniformly efficient policy,

lim inf_{T→∞} E[R^{(i)}_T]/log T ≥ 2/Δ̄_m for every m ∈ [M − 1] and i ∈ I_m.   (25)

Assuming we can prove (25), the first lower bound in the theorem statement follows easily, since

E[Reg(T)] = Σ_{i=1}^{N} E[R^{(i)}_T] ≥ Σ_{m=1}^{M−1} Σ_{i ∈ I_m} E[R^{(i)}_T],

and the second follows from the first and the fact that, by the assumptions in Section 2,

Σ_{m=1}^{M−1} |I_m| (2/Δ̄_m) ≥ 2 Σ_{m=1}^{M−1} |I_m| = 2(N − |I_M|) ≥ 2N(1 − c₂/M) = Ω(N).

Hence, for the remainder of the proof, we fix m ∈ [M − 1] and i ∈ I_m and prove (25). Toward this end, we first reduce the real system to a fictitious system with a larger set of uniformly efficient policies. In the fictitious system, agent i initially (i.e., at time t = 0) observes the full sequence of rewards of all other agents, i.e., X := (X^{(n)}_t(k) : t ∈ [T], k ∈ [K], n ∈ [N] \ {i}), and thereafter does not communicate with other agents. Note that any policy in the real system can also be implemented in the fictitious system, since any information communicated by other agents is a function of X. Hence, it suffices to prove (25) for any uniformly efficient policy in the fictitious system. To do so, we modify a standard lower bound approach comprising Section 4.6, Lemma 15.1, and Theorem 16.2 of (Lattimore & Szepesvári, 2020). For brevity, we focus on the modifications needed in our setting and refer the reader to the associated references for details.

To begin, we modify the probability measure construction from Section 4.6. First, we define a measurable space (Ω_T, F_T), where Ω_T = ([K] × R)^T × R^{T×K×(N−1)}, F_T = B(Ω_T), and B is the Borel σ-algebra. Next, let π = (π_t)_{t=1}^{T} be a uniformly efficient policy in the fictitious system. More precisely, each π_t is a mapping from the history of arm pulls and rewards of agent i (i.e., I^{(i)}_1, X^{(i)}_1(I^{(i)}_1), …, I^{(i)}_{t−1}, X^{(i)}_{t−1}(I^{(i)}_{t−1})), along with the rewards of the other agents (i.e., X), to the distribution of the action I^{(i)}_t. Further, let p_ν be the density of a Gaussian random variable with mean ν and unit variance. Finally, let ρ denote the counting measure and λ any measure on (R, B(R)) with respect to which all of the reward distributions {p_{μ_{m,k}}}_{(m,k) ∈ [M]×[K]} are absolutely continuous.
Then, we define the probability measure

P_{μ,I,π}(B) = ∫_B p_{μ,I,π}(ω) d((ρ × λ)^{⊗T} × λ^{⊗(T K (N−1))})(ω) for all B ∈ F_T,   (26)

where, for any ((j_t, x_t)_{t ∈ [T]}, (x_{t,k,n})_{t ∈ [T], k ∈ [K], n ∈ [N]\{i}}) ∈ Ω_T, p_{μ,I,π} is the density given by

p_{μ,I,π}((j_t, x_t)_{t ∈ [T]}, (x_{t,k,n})_{t ∈ [T], k ∈ [K], n ∈ [N]\{i}})   (27)
 = Π_{t ∈ [T]} π_t(j_t | (j_s, x_s)_{s ∈ [t−1]}, (x_{t,k,n})_{t ∈ [T], k ∈ [K], n ∈ [N]\{i}}) p_{μ_{m,j_t}}(x_t) × Π_{t ∈ [T], k ∈ [K], n ∈ [N]\{i}} p_{μ_{c(n),k}}(x_{t,k,n}).   (28)

Next, we adapt the divergence decomposition from Lemma 15.1 to our setting. More precisely, we show that it holds for certain pairs of instances (μ, I) and (μ′, I′). Specifically, let μ′ = μ and I′ = {I′_j}_{j ∈ [M]}, where

I′_j = I_j \ {i} if j = m,   I′_j = I_j ∪ {i} if j = m + 1,   I′_j = I_j otherwise.

In words, the instance (μ′, I′) is identical to (μ, I), except that i plays the (m+1)-th bandit in the former and the m-th in the latter. For simplicity, we write P_I = P_{μ,I,π} and P_{I′} = P_{μ′,I′,π} for the probability measures (26), and E_I and E_{I′} for the associated expectations. We claim that

D(P_I, P_{I′}) = Σ_{k=1}^{K} E_I[T^{(i)}_k(T)] D(p_{μ_{m,k}}, p_{μ_{m+1,k}}),   (29)

where D denotes the KL divergence. The proof of (29) follows essentially the same logic as that of Lemma 15.1. The key difference is that our density (27) includes the product term Π_{t ∈ [T], k ∈ [K], n ∈ [N]\{i}} p_{μ_{c(n),k}}(x_{t,k,n}), which arises from the reward sequences X. However, since the agents n ≠ i have the same reward distributions p_{μ_{c(n),k}} under both instances (μ, I) and (μ′, I′), these product terms cancel in the proof, which yields (29).

Finally, we prove (25) using an approach similar to Theorem 16.2. We begin by upper bounding the summands on the right side of (29). For k = k^*(m) (the optimal arm for i in the instance I), we have μ_{m,k^*(m)} = μ_{m+1,k^*(m)} by assumption, which implies D(p_{μ_{m,k^*(m)}}, p_{μ_{m+1,k^*(m)}}) = 0. For k ≠ k^*(m), we have

D(p_{μ_{m,k}}, p_{μ_{m+1,k}}) = (μ_{m,k} − μ_{m+1,k})²/2 ≤ |μ_{m,k} − μ_{m+1,k}|/2 ≤ Δ̄_m Δ_{m,k}/2,

where we used the fact that p_ν is Gaussian with mean ν and unit variance, the assumption μ_{m,k} ∈ [0, 1], and the definition of Δ̄_m, respectively. Combining these arguments and using (29), we thus obtain

D(P_I, P_{I′}) ≤ (Δ̄_m/2) Σ_{k ∈ [K]\{k^*(m)}} E_I[T^{(i)}_k(T)] Δ_{m,k} = E_I[R^{(i)}_T] Δ̄_m/2.   (30)

Next, we lower bound D(P_I, P_{I′}). Let A = {T^{(i)}_{k^*(m+1)}(T) > T/2} be the event that i plays the best arm of the (m+1)-th bandit at least T/2 times. Then, by the assumption k^*(m) ≠ k^*(m+1), we know that E_I[R^{(i)}_T] ≥ (T Δ_{m,k^*(m+1)}/2) P_I(A) ≥ (T Δ_m/2) P_I(A), and by the definition of I′ (where i plays the (m+1)-th bandit), we similarly have E_{I′}[R^{(i)}_T] ≥ (T Δ_{m+1}/2) P_{I′}(A^C). Combining these inequalities and using Theorem 14.2 of (Lattimore & Szepesvári, 2020), we get

E_I[R^{(i)}_T] + E_{I′}[R^{(i)}_T] ≥ (T min{Δ_m, Δ_{m+1}}/2)(P_I(A) + P_{I′}(A^C)) ≥ (T min{Δ_m, Δ_{m+1}}/4) exp(−D(P_I, P_{I′})).   (31)

Therefore, combining (30) and (31) and letting T → ∞, we obtain

lim inf_{T→∞} E_I[R^{(i)}_T]/log T ≥ (2/Δ̄_m) lim inf_{T→∞} D(P_I, P_{I′})/log T ≥ (2/Δ̄_m) lim inf_{T→∞} (1 − log(4(E_I[R^{(i)}_T] + E_{I′}[R^{(i)}_T])/min{Δ_m, Δ_{m+1}})/log T) = 2/Δ̄_m,

where the equality holds by the uniformly efficient assumption (which states that the group regret on any problem instance, and hence the individual regrets E_I[R^{(i)}_T] and E_{I′}[R^{(i)}_T], are o(T^γ) for arbitrarily small γ > 0).