# Multi-Agent Learning with Heterogeneous Linear Contextual Bandits

Anh Do, Johns Hopkins University, ado8@jhu.edu
Thanh Nguyen-Tang, Johns Hopkins University, nguyent@cs.jhu.edu
Raman Arora, Johns Hopkins University, arora@cs.jhu.edu

As trained intelligent systems become increasingly pervasive, multi-agent learning has emerged as a popular framework for studying complex interactions between autonomous agents. Yet, a formal understanding of how and when learners in heterogeneous environments benefit from sharing their respective experiences is still in its infancy. In this paper, we seek answers to these questions in the context of linear contextual bandits. We present a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, which we refer to as H-LINUCB, wherein agents cooperatively minimize the group regret under the coordination of a central server. In the setting where the level of heterogeneity or dissimilarity across the environments is known to the agents, we show that H-LINUCB is provably optimal in regimes where the tasks are highly similar or highly dissimilar.

1 Introduction

Heterogeneous multi-agent systems enable agents to work together and coordinate their actions to solve complex problems. These systems are inherently scalable, as they can distribute the computational load across multiple agents. This scalability allows the system to handle large and sophisticated tasks beyond the capabilities of a single agent. Despite their potential, multi-agent systems pose the following fundamental challenges.

Statistical challenge. Each agent's reward distribution may vary, meaning that different agents receive different rewards for the same action. This heterogeneity in reward distributions introduces complexity and makes coordination among agents more difficult. Furthermore, Wang et al. [2021] point out that an ineffective use of shared data could have a significant negative impact on the overall performance. In particular, sharing experiences amongst agents may hinder the system's performance if the tasks are too dissimilar [Rosenstein, 2005, Brunskill and Li, 2013].

Computational complexity. Coordinating the decisions of numerous agents and performing complex computations pose challenges in terms of computational resources, time constraints, and algorithmic scalability.

Communication cost. Efficient communication is also a fundamental challenge for a large-scale multi-agent system. As the number of agents in a system increases, the complexity of interactions between agents grows exponentially. Managing the interaction of numerous agents and making decisions in a timely manner becomes increasingly difficult.

Several works address these challenges for heterogeneous multi-agent systems, including federated linear bandits [Huang et al., 2021, Li and Wang, 2022], clustering bandits [Gentile et al., 2014, Ghosh et al., 2023], and multi-task linear bandits [Hu et al., 2021, Yang et al., 2021]. However, these works either rely on some special structure of the parameter [Li and Wang, 2022], e.g., a low-rank structure [Hu et al., 2021], or make different assumptions, e.g., stochastic contexts, finite decision sets [Huang et al., 2021, Ghosh et al., 2023], etc.
In this work, we provide a general notion of heterogeneous multi-agent linear contextual bandits ($\varepsilon$-MALCB) and give analytical results under different regimes of heterogeneity. Specifically, we study a model that consists of $M$ agents. Each agent $i \in [M]$ plays a $d$-dimensional linear contextual bandit, parametrized by $\theta_i$ (see Section 3 for more details), for $T$ rounds. We capture the heterogeneity by a dissimilarity parameter $\varepsilon \ge 0$ such that $\|\theta_i - \theta_j\|_2 \le \varepsilon$ for all $i, j \in [M]$. Notably, we do not assume any special structure on the linear parameters, and we allow the decision set to be infinite and possibly chosen adversarially.

Motivating application. Consider a personalized recommendation system for online advertisements [Li et al., 2010, Bouneffouf et al., 2020, Ghosh et al., 2023]. Here, the platform needs to adapt to user preferences and maximize total user clicks based on user-click feedback. Each ad can be represented as a context vector, encoding information such as publisher, topic, ad content, etc. The inner product of the ad's context vector and the user-preference vector represents their alignment: a higher value indicates greater relevance of the ad. Furthermore, the recommended ads must be personalized to accommodate differences in user preferences. One naive approach would involve solving a separate linear contextual bandit problem for each user. However, we can pose the following question: Can we enhance the system's performance by pooling data from other users? If so, up to what level of user heterogeneity can we achieve that?

Contributions. We make the following contributions in this paper. First, we formulate the heterogeneous multi-agent linear contextual bandit problem as the $\varepsilon$-MALCB problem, building on the classic notion of heterogeneity in multi-armed bandits (MABs) of Wang et al. [2021]. Our notion of heterogeneity is natural and captures many settings in real-world applications. Second, when the level of dissimilarity is known, we propose a distributed algorithm, namely H-LINUCB, which achieves a regret of $\tilde O\big(d\sqrt{MT} + \min\{\varepsilon d MT,\ dM\sqrt{T}\}\big)$ under the coordination of a central server. We discuss in detail how to handle the dissimilarity and introduce a criterion for stopping collaboration when the level of dissimilarity is high. We show that under the regime of low dissimilarity, we can still achieve a regret of $\tilde O(d\sqrt{MT})$, which is the same regret rate as if $M$ agents collaborated to solve the same task. In this regime, H-LINUCB outperforms independent learners by a factor of $\sqrt{M}$, improving on their $\tilde O(dM\sqrt{T}) = \tilde O(\sqrt{M}\cdot d\sqrt{MT})$ group regret. This is significant when we have a large number of agents. Third, we complement the upper bound with a lower bound of $\Omega\big(d\sqrt{MT} + \min\{\varepsilon MT,\ dM\sqrt{T}\}\big)$. This suggests that our theoretical guarantees are tight in settings where tasks are highly similar or highly dissimilar. Finally, we validate our theoretical results with numerical simulations on synthetic data. When the level of dissimilarity is small, H-LINUCB outperforms independent learning. When the level of dissimilarity is high, our simulations show that blindly using shared data can lead to linear regret, emphasizing the importance of the criterion we propose for when to stop the collaboration.

2 Related work

The classic linear bandit problem has a rich literature in both theory and application; see, for example, [Abbasi-Yadkori et al., 2011, Li et al., 2010, Chu et al., 2011, Chatterji et al., 2020, Rusmevichientong and Tsitsiklis, 2010, Bouneffouf et al., 2020, Mahadik et al., 2020], to name a few.
With a surge in distributed computing, multi-agent systems have shown their potential and gained more attention in recent years. A large body of work is dedicated to the homogeneous setting, in which multiple collaborating agents solve a global linear bandit problem [Wang et al., 2020, Dubey and Pentland, 2020, Moradipari et al., 2022, Mitra et al., 2022, Martínez-Rubio et al., 2019, Chawla et al., 2022]. The problem of multi-agent linear bandits in heterogeneous environments, on the other hand, has received limited attention. Soare et al. [2014] were amongst the first to study heterogeneous linear bandits; however, the focus in that work is not on group regret, and the authors only consider the setting where tasks are similar. More recently, Huang et al. [2021] proposed an algorithm with a novel multi-agent G-optimal design. They assume that the heterogeneity comes from the contexts associated with each agent, but agents can still collaborate since they share the same arm parameters. Li and Wang [2022] consider an extension where each agent's parameter has two components: a shared global component and an individual local component. This formalization requires agents to work on their respective tasks (the local component) but still allows agents to collaborate on the common task (the global component). Wang et al. [2021] study heterogeneity in the Bernoulli MAB problem and provide guarantees both when the level of heterogeneity is known and when it is unknown.

A related line of work studies heterogeneous linear bandits through clustering [Gentile et al., 2014, Li et al., 2016, 2019, Korda et al., 2016, Ghosh et al., 2023]. These works give guarantees based on the clustering structure of the different linear bandit problems: agents belonging to the same cluster will likely achieve the highest collaboration gain. We do not make any assumption about the clusterability of the different bandit problems we may encounter.

Yet another approach focuses on multi-task linear bandits, wherein we solve multiple different but closely related linear bandit tasks [Yang et al., 2021, 2022, Hu et al., 2021, Cella et al., 2023, Du et al., 2023]. In particular, these works rely on the assumption that all tasks share a common $k$-dimensional representation, where $k \ll d$. Pooling data from different bandits then helps learn a good representation and reduces the statistical burden of learning, from a linear bandit problem in $d$ dimensions to a $k$-dimensional one. We do not consider multi-task learning here.

Our formulation of heterogeneous contextual linear bandits is similar to that of the misspecified and corrupted bandit settings (see Remark 3.1 for more details) [Ghosh et al., 2017, Lattimore and Csaba, 2020, Takemura et al., 2021, Foster et al., 2020, He et al., 2022]. It is then natural to ask whether we can apply techniques from that part of the literature to deal with the dissimilarity between different bandits in a heterogeneous setting. However, there is a fundamental difference in how the two problems manifest themselves. While misspecification may be typically unavoidable in many settings, in a heterogeneous bandit setting an agent can always choose to rely solely on its own data if it finds that the data from other agents are too dissimilar.

3 Preliminaries

Multi-Agent Linear Contextual Bandits. We consider a multi-agent learning setting with $M$ agents. At each round $t \in [T]$, each agent $m \in [M]$ picks an action (context)^1 $x_{m,t} \in \mathcal{D}_{m,t}$, where $\mathcal{D}_{m,t} \subseteq \mathbb{R}^d$ is a given decision set.
The agent $m$ receives reward $y_{m,t} = x_{m,t}^\top \theta_m + \eta_{m,t}$, where $\theta_m \in \mathbb{R}^d$ is an unknown but fixed parameter and $\eta_{m,t}$ is sub-Gaussian noise. Let $\mathcal{F}_t$ denote the filtration, i.e., the $\sigma$-algebra, induced by $\sigma\big(\{x_{m,k}\}_{m\in[M],\,k\le t+1},\ \{\eta_{m,k}\}_{m\in[M],\,k\le t}\big)$.

Regret. Our goal is to design algorithms for multi-agent linear contextual bandits that achieve a small group regret, defined as
$$R(M, T) = \sum_{m=1}^{M}\sum_{t=1}^{T}\Big(\max_{x \in \mathcal{D}_{m,t}} \langle x, \theta_m\rangle - \langle x_{m,t}, \theta_m\rangle\Big).$$

Assumption 3.1. Without loss of generality, we assume that:
1. Bounded parameters: $\|\theta_m\|_2 \le 1$ and $\|x\|_2 \le 1$ for all $x \in \mathcal{D}_{m,t}$, $m \in [M]$, $t \in [T]$.
2. Sub-Gaussian noise: $\eta_{m,t}$ is a conditionally zero-mean, 1-sub-Gaussian random variable with respect to $\mathcal{F}_{t-1}$.

We note that the assumptions above are standard in the linear bandits literature [Abbasi-Yadkori et al., 2011, Hu et al., 2021, Huang et al., 2021]. Further, it is straightforward to allow $\|\theta_m\| \le B$, for some constant $B$, by appropriately scaling the rewards. We make no additional assumptions on the context. The decision set could be infinite and given to each agent possibly adversarially.

^1 Throughout the paper, we use the terms action and context interchangeably.

Definition 3.1. ($\varepsilon$-MALCB Problem) A multi-agent linear contextual bandit problem is said to be an $\varepsilon$-MALCB problem instance if, for any two agents $i, j \in [M]$, $\|\theta_i - \theta_j\|_2 \le \varepsilon$, for an $\varepsilon \ge 0$. We call $\varepsilon$ the dissimilarity parameter.

Definition 3.2. (Homogeneous setting) A multi-agent linear contextual bandit problem is homogeneous if it is an $\varepsilon$-MALCB with $\varepsilon = 0$, i.e., $\theta_i = \theta_j$ for all $i, j \in [M]$.

Given the bound on the parameters, we have that $\|\theta_i - \theta_j\|_2 \le 2$ for any $i, j \in [M]$. Therefore, it suffices to consider only the case where $\varepsilon \in [0, 2]$.

Remark 3.1. (Misspecified structure) Under Assumption 3.1, for any two agents $i, j$ we have that $\theta_i^\top x - \varepsilon \le \theta_j^\top x \le \theta_i^\top x + \varepsilon$. Then, $\mathbb{E}[y_{j,x}] = \theta_i^\top x + \Delta(x)$ for some $\Delta(x) \in [-\varepsilon, \varepsilon]$. This represents a misspecified structure wherein agent $i$ uses the reward $y_{j,x}$ received by agent $j \ne i$.

Remark 3.2. (Recovering the $\varepsilon$-MPMAB of Wang et al. [2021]) We note that the $\varepsilon$-MPMAB is a special case of $\varepsilon$-MALCB. Define the mean rewards of the $K$ arms for agent $m$ as $\theta_m = [\mu^m_1, \ldots, \mu^m_K]^\top$. Then, the reward of agent $m$ for arm $k$ at round $t$ is $y^m_{t,k} = \theta_m^\top e_k + \eta_t$. The decision set $\mathcal{D} = \{e_1, \ldots, e_K\}$ consists of the standard basis vectors; it is a fixed set of arms, given to all agents at each round. The dissimilarity parameter $\varepsilon$ is defined via $\|\theta_i - \theta_j\|_\infty \le \varepsilon$ for all $i, j \in [M]$. Nonetheless, the results in Wang et al. [2021] are not directly comparable to ours since the dissimilarity parameter $\varepsilon$ hides inside the size of the set of subpar arms $|I_\varepsilon|$.^2 Furthermore, Wang et al. [2021] give guarantees in a full-communication setting, in which each agent has full access to the past data of all other agents at every round.

Remark 3.3. There are other formulations that also capture heterogeneity in multi-agent linear bandits. Huang et al. [2021] consider a multi-agent linear bandit setting with a fixed-size decision set containing $K$ actions, $\{\theta_a\}_{a=1}^{K}$, which is unknown to the agents. Each agent $i$ is associated with $K$ different contexts $\{x_{i,a}\}_{a=1}^{K}$. At each round, each agent $i$ picks an action $a \in [K]$ and receives reward $r_{i,a} = x_{i,a}^\top\theta_a + \eta_{i,a}$. Since $x_{i,a}$ can vary across agents, this captures the heterogeneity across agents. It also allows for collaboration across agents since they share the same decision set. Li and Wang [2022] assume that each agent's parameter has a special structure consisting of a shared global component $\theta^{(g)}$ and a unique local component $\theta^{(l)}_i$, so that the reward of agent $i$ for action $x$ can be written as $r_{i,x} = x^\top\big(\theta^{(g)} + \theta^{(l)}_i\big) + \eta_i$. We do not assume any such special structure on the linear parameters.
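To make the setup concrete, here is a minimal Python sketch (our illustration, not the authors' code) of an $\varepsilon$-MALCB instance together with the per-round pseudo-regret; the construction $\theta_m = c\,u + \frac{\varepsilon}{2}v_m$ mirrors the synthetic setup used later in Section 5, and all function names are hypothetical.

```python
import numpy as np

def make_malcb_instance(M, d, eps, rng):
    """Sample linear parameters {theta_m} satisfying Definition 3.1.

    theta_m = c*u + (eps/2)*v_m with unit vectors u, v_m, so that
    ||theta_m||_2 <= 1 and ||theta_i - theta_j||_2 <= eps.
    """
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    c = max(0.0, 1.0 - eps)          # any c in [0, 1 - eps] keeps theta_m in the unit ball
    thetas = []
    for _ in range(M):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        thetas.append(c * u + 0.5 * eps * v)
    return np.stack(thetas)           # shape (M, d)

def reward(thetas, m, x, rng):
    """Observed reward y_{m,t} = x^T theta_m + eta with standard normal noise."""
    return float(x @ thetas[m]) + rng.standard_normal()

def instant_regret(thetas, m, decision_set, x_chosen):
    """Per-round pseudo-regret: max_{x in D_{m,t}} <x, theta_m> - <x_chosen, theta_m>."""
    best = max(float(x @ thetas[m]) for x in decision_set)
    return best - float(x_chosen @ thetas[m])
```

For example, with `rng = np.random.default_rng(0)`, `make_malcb_instance(60, 30, 0.05, rng)` produces an instance in the intermediate dissimilarity regime used in the simulations.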
The notion of $\varepsilon$-MALCB is a worst-case notion. One can imagine an $\varepsilon$-MALCB instance with $M$ agents such that $M - 1$ agents have identical linear parameters, i.e., $\theta_i = \theta_j$ for all $i, j \in [M-1]$, while the parameter of the last agent satisfies $\|\theta_M - \theta_{M-1}\|_2 = \varepsilon$. For this type of instance, for large $M$, we can simply use DISLINUCB of Wang et al. [2020] and achieve nearly optimal regret. The clustering bandits framework could also be used for this $\varepsilon$-MALCB instance since it presents a strong cluster structure. Even though our formulation takes a pessimistic approach, it can handle the case where the bandit problems are largely unclustered.

Our goal is to design a system whose performance is no worse than running $M$ independent bandit algorithms, one per user (zero collaboration). The system should also be adaptive to the heterogeneity in the problem instance, i.e., it should automatically leverage any structure in the problem parameters $\{\theta_m\}_{m=1}^{M}$ to collaboratively solve all bandit problems at a faster rate. To benchmark the performance of such a system, we consider the following baseline.

Independent Learners (IND-OFUL). We establish a baseline algorithm in which each agent independently runs an optimal linear contextual bandit algorithm (OFUL, [Abbasi-Yadkori et al., 2011]) without any communication. Each agent incurs $\tilde O(d\sqrt{T})$ regret, for an $\tilde O(dM\sqrt{T})$ group regret.

Notation. We denote the weighted norm of a vector $x$ w.r.t. a matrix $A$ (Mahalanobis norm) as $\|x\|_A = \sqrt{x^\top A x}$. We write $A \succeq B$ iff $A - B$ is a positive semi-definite matrix. We use $\tilde O(\cdot)$ to hide polylogarithmic factors in standard Big-O notation.

^2 The ROBUSTAGG($\varepsilon$) algorithm achieves $\tilde O\big(\sqrt{|I_\varepsilon| M T} + M\sqrt{(K - |I_\varepsilon| - 1)T} + M|I_\varepsilon|\big)$ regret when $\varepsilon$ is known. The set of subpar arms $I_\varepsilon$ is defined in Wang et al. [2021, Section 3.2].

4 Main Results

In this section, we present H-LINUCB, a UCB-style algorithm for the $\varepsilon$-MALCB problem, and give guarantees for the case when the dissimilarity $\varepsilon$ is known to the agents. In Section 4.2, we present a lower bound and discuss the implications of our results in different regimes of dissimilarity. We defer the detailed proofs to the Appendix.
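Before turning to the pseudocode of H-LINUCB, here is a minimal sketch of the IND-OFUL baseline described above, which is also the independent-learning mode H-LINUCB falls back to after round $\tau$. This is our illustration, not the authors' implementation: it assumes a finite candidate decision set so the optimistic maximization reduces to enumerating UCB scores, and the confidence radius follows the $t \ge \tau$ case of Lemma 4.1; the class name and default constants are illustrative.

```python
import numpy as np

class OFULAgent:
    """Single-agent OFUL baseline; IND-OFUL runs one independent copy per agent."""

    def __init__(self, d, lam=1.0, delta=0.01):
        self.d, self.lam, self.delta = d, lam, delta
        self.V = lam * np.eye(d)     # regularized Gram matrix
        self.b = np.zeros(d)         # sum of x * y
        self.t = 0

    def _beta(self):
        # Confidence radius sqrt(lam) + sqrt(d * log((1 + t/(lam*d)) / delta)).
        return np.sqrt(self.lam) + np.sqrt(
            self.d * np.log((1 + self.t / (self.lam * self.d)) / self.delta))

    def act(self, decision_set):
        """Optimistic action: maximize <x, theta_hat> + beta * ||x||_{V^{-1}}."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        beta = self._beta()
        ucb = [x @ theta_hat + beta * np.sqrt(x @ V_inv @ x) for x in decision_set]
        return decision_set[int(np.argmax(ucb))]

    def update(self, x, y):
        self.V += np.outer(x, x)
        self.b += x * y
        self.t += 1
```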
Algorithm 1 H-LINUCB
Input: dissimilarity parameter $\varepsilon$, number of agents $M$, number of rounds $T$, regularization parameter $\lambda$, dimension $d$, confidence parameters $\beta_t$ for $t \in [T]$, weight threshold parameter $\alpha$, collaboration budget $\tau$, synchronization threshold $D$
1: $t_{\mathrm{syn}} \leftarrow 1$  // $t_{\mathrm{syn}}$ is the index of the last synchronized round
2: $V_{\mathrm{syn}} \leftarrow 0$, $b_{\mathrm{syn}} \leftarrow 0$  // $V_{\mathrm{syn}}, b_{\mathrm{syn}}$ store the relevant statistics of all agents after synchronization
3: $V_{\mathrm{epoch},m} \leftarrow 0$, $b_{\mathrm{epoch},m} \leftarrow 0$, $\forall m \in [M]$  // $V_{\mathrm{epoch},m}, b_{\mathrm{epoch},m}$ store the relevant statistics of agent $m$ in the current epoch
4: for $t = 1, \ldots, T$ do
5:   for agent $m = 1, \ldots, M$ do
6:     if $t = \tau$ then
7:       $V_{\mathrm{syn}} \leftarrow 0$, $b_{\mathrm{syn}} \leftarrow 0$
8:       $V_{\mathrm{epoch},m} \leftarrow 0$, $b_{\mathrm{epoch},m} \leftarrow 0$
9:     end if
10:    $V_{m,t} \leftarrow \lambda I + V_{\mathrm{syn}} + V_{\mathrm{epoch},m}$
11:    $\hat\theta_{m,t} \leftarrow V_{m,t}^{-1}(b_{\mathrm{syn}} + b_{\mathrm{epoch},m})$
12:    Construct the confidence ellipsoid $\mathcal{C}_{m,t} = \{\theta \in \mathbb{R}^d : \|\hat\theta_{m,t} - \theta\|_{V_{m,t}} \le \beta_t\}$
13:    $(x_{m,t}, \tilde\theta_{m,t}) = \arg\max_{(x,\theta) \in \mathcal{D}_{m,t} \times \mathcal{C}_{m,t}} \langle\theta, x\rangle$
14:    Play $x_{m,t}$ and get reward $y_{m,t}$
15:    $w_{m,t} \leftarrow \min\{1, \alpha/\|x_{m,t}\|_{V_{m,t}^{-1}}\}$ if $t < \tau$, and $w_{m,t} \leftarrow 1$ otherwise
16:    $V_{\mathrm{epoch},m} \leftarrow V_{\mathrm{epoch},m} + w_{m,t}x_{m,t}x_{m,t}^\top$
17:    $b_{\mathrm{epoch},m} \leftarrow b_{\mathrm{epoch},m} + w_{m,t}x_{m,t}y_{m,t}$
18:    if $(t - t_{\mathrm{syn}})\log\big(\det(V_{m,t} + w_{m,t}x_{m,t}x_{m,t}^\top)/\det(\lambda I + V_{\mathrm{syn}})\big) > D$ and $t < \tau$ then
19:      Send a synchronization signal to the server to start a communication round
20:    end if
21:    if a communication round is started then
22:      Agent $i$ sends $V_{\mathrm{epoch},i}, b_{\mathrm{epoch},i}$ to the server, $\forall i \in [M]$
23:      Server computes $V_{\mathrm{syn}} \leftarrow V_{\mathrm{syn}} + V_{\mathrm{epoch},i}$, $b_{\mathrm{syn}} \leftarrow b_{\mathrm{syn}} + b_{\mathrm{epoch},i}$, $\forall i \in [M]$
24:      Server sends $V_{\mathrm{syn}}, b_{\mathrm{syn}}$ back to all agents
25:      $V_{\mathrm{epoch},i} \leftarrow 0$; $b_{\mathrm{epoch},i} \leftarrow 0$; $\forall i \in [M]$  // reset $V_{\mathrm{epoch},i}, b_{\mathrm{epoch},i}$ for the new epoch
26:      $t_{\mathrm{syn}} \leftarrow t$
27:    end if
28:  end for
29: end for

4.1 H-LINUCB Algorithm

H-LINUCB is a distributed UCB-style algorithm (see Algorithm 1 for pseudocode) in which agents work cooperatively under the coordination of a central server. H-LINUCB has two learning phases: the collaboration phase (rounds $t \in \{1, \ldots, \tau-1\}$) and the independent learning phase (rounds $t \in \{\tau, \ldots, T\}$), where $\tau \le T$ is the collaboration budget. Intuitively, our two-phase learning framework ensures that the agents stop collaboration after $\tau$ rounds lest they incur linear regret in bandit environments with large dissimilarity. Naturally, then, the parameter $\tau$ should depend on the dissimilarity parameter $\varepsilon$. We give an optimal choice of $\tau$ in Theorem 4.2.

At each round $t < \tau$ (the collaboration phase), each agent's data is weighted to adapt to the dissimilarity across different agents (Line 15). Then, each agent uses the weighted data to construct its confidence ellipsoid (Line 12) and makes a decision following the optimism principle (Line 13). When a certain condition is met (Line 18), data is pooled and synchronized across the agents. Starting from round $\tau$, all collaboration ceases; each agent enters the independent learning mode and runs an independent copy of the OFUL algorithm [Abbasi-Yadkori et al., 2011] locally for the last $T - \tau + 1$ rounds.

We note that H-LINUCB builds upon DISLINUCB of Wang et al. [2020, Protocol 8] with the following modifications:
- We scale each agent's data using the weight $\min\{1, \alpha/\|x_{m,t}\|_{V_{m,t}^{-1}}\}$, which we adopt from He et al. [2022], to handle the dissimilarity across different agents (Line 15).
- We only allow collaboration until round $\tau$ (Line 18). The value of $\tau$ depends on the dissimilarity parameter, which we assume is given.
- We reset the variables $V_{\mathrm{syn}}, b_{\mathrm{syn}}, V_{\mathrm{epoch},m}, b_{\mathrm{epoch},m}$ at round $\tau$ (Lines 6-9), at which point each agent switches to the independent learning mode.

Here, an epoch refers to the time period between two consecutive synchronization rounds.
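To illustrate how Lines 10-18 fit together, below is a minimal Python sketch of a single agent round (our illustration, not the authors' implementation). It assumes a finite candidate decision set; the `agent` container, its `pull` method, and all names are hypothetical placeholders.

```python
import numpy as np

def hlinucb_agent_round(agent, decision_set, beta_t, alpha, lam, t, tau, t_syn, D):
    """One round of Algorithm 1 for a single agent (Lines 10-18).

    `agent` holds V_syn, b_syn (statistics shared at the last synchronization) and
    V_epoch, b_epoch (its own statistics in the current epoch), plus pull(x) -> reward.
    Returns the chosen action, its weight, and whether synchronization is requested.
    """
    d = agent.V_syn.shape[0]
    V = lam * np.eye(d) + agent.V_syn + agent.V_epoch            # Line 10
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ (agent.b_syn + agent.b_epoch)            # Line 11

    # Lines 12-13: optimism over the confidence ellipsoid reduces to a UCB rule.
    scores = [x @ theta_hat + beta_t * np.sqrt(x @ V_inv @ x) for x in decision_set]
    x = decision_set[int(np.argmax(scores))]
    y = agent.pull(x)                                            # Line 14

    # Line 15: truncated inverse-bonus weight during collaboration, plain OFUL after tau.
    bonus = np.sqrt(x @ V_inv @ x)
    w = min(1.0, alpha / bonus) if t < tau else 1.0

    agent.V_epoch += w * np.outer(x, x)                          # Line 16
    agent.b_epoch += w * x * y                                   # Line 17

    # Line 18: ask the server to synchronize once enough new information has accumulated.
    new_logdet = np.linalg.slogdet(V + w * np.outer(x, x))[1]
    old_logdet = np.linalg.slogdet(lam * np.eye(d) + agent.V_syn)[1]
    request_sync = (t - t_syn) * (new_logdet - old_logdet) > D and t < tau
    return x, w, request_sync
```

On a synchronization (Lines 21-27), the server would add every agent's $(V_{\mathrm{epoch}}, b_{\mathrm{epoch}})$ into $(V_{\mathrm{syn}}, b_{\mathrm{syn}})$, broadcast the sums, reset the epoch statistics, and set $t_{\mathrm{syn}}$ to the current round.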
Each agent uses all of the data available to it at each round to construct the confidence ellipsoid $\mathcal{C}_{m,t}$ using the result in Lemma 4.1. Given the confidence ellipsoid, the agent chooses the action optimistically: $(x_{m,t}, \tilde\theta_{m,t}) = \arg\max_{(x,\theta) \in \mathcal{D}_{m,t} \times \mathcal{C}_{m,t}} \langle\theta, x\rangle$. During the collaboration phase, if the variation in the volume of the ellipsoid exceeds a certain synchronization threshold $D$, the synchronization condition is triggered (Lines 18-20). Subsequently, the central server carries out the synchronization procedure to update $V_{\mathrm{syn}}, b_{\mathrm{syn}}$ across all participating agents (Lines 21-27). The optimal value of $D$ depends on the number of agents $M$, the dimension $d$, and the collaboration budget $\tau$.

The weight $\min\{1, \alpha/\|x_{m,t}\|_{V_{m,t}^{-1}}\}$ is a truncation of the inverse exploration bonus, where $\alpha > 0$ is a threshold parameter that shall be optimized later. When $x_{m,t}$ has not been explored much, the exploration bonus $\|x_{m,t}\|_{V_{m,t}^{-1}}$ is large (low confidence); hence, the algorithm puts a small weight on it to avoid a large regret due to stochastic noise and misspecification. When $\|x_{m,t}\|_{V_{m,t}^{-1}}$ is small (high confidence), H-LINUCB puts a large weight on it, which can be as large as one [He et al., 2022].^3

We note that using this weighting could have a significant negative impact on performance if we are not careful. Several recent studies show that one can achieve a regret of $\tilde O(d\sqrt{T} + \varepsilon\sqrt{d}\,T)$ and $\tilde O(d\sqrt{T} + \varepsilon d T)$ for misspecified and corrupted linear bandits, respectively [Lattimore and Csaba, 2020, Takemura et al., 2021, Foster et al., 2020, He et al., 2022].^4 However, a direct application of an algorithm designed for misspecified/corrupted linear bandits to our setting can lead to linear regret when $\varepsilon = \Theta(1)$. This is significantly worse than naive independent learning, which always achieves sub-linear $\tilde O(dM\sqrt{T})$ group regret.

We emphasize that the condition $t < \tau$ in Line 18 is crucial for H-LINUCB to avoid linear regret in the regime of large $\varepsilon$. For example, for $\varepsilon = \Theta(1)$ and $\tau = T$, Algorithm 1 incurs $\tilde O(dMT)$ regret, which is linear in $MT$. Furthermore, Theorem 4.3 indicates that there exists an instance of $\varepsilon$-MALCB on which any algorithm incurs at least $\Omega(dM\sqrt{T})$ regret. Then, each agent playing OFUL independently would be enough to achieve an optimal $\tilde O(dM\sqrt{T})$ regret. This suggests that we get a tighter upper bound if we cease collaboration; we discuss the stopping criterion and the choice of $\tau$ in Theorem 4.2.

Communication protocol. We use a star-shaped communication network in which the $M$ agents interact with a central server [Wang et al., 2020, Dubey and Pentland, 2020]. Each agent communicates with the server by uploading and downloading its data but does not communicate directly with other agents. Communication is triggered only if some agent has accumulated enough new data since the last synchronization round. Finally, we assume no latency or error in the communication between the central server and the agents.

^3 The technique of using the exploration bonus to control misspecification is also used in Zhang et al. [2023].
^4 Takemura et al. [2021], Foster et al. [2020] give guarantees for when the misspecification level is unknown. The CW-OFUL algorithm of He et al. [2022] has $\tilde O(d\sqrt{T} + dC)$ regret, where $C$ is the total amount of corruption. Setting $C = \varepsilon T$, where $\varepsilon$ is the level of misspecification at each round, gives $\tilde O(d\sqrt{T} + \varepsilon d T)$ regret.

Remark 4.1. Wang et al. [2020] show that DISLINUCB can rely on old data and produce a nearly optimal policy without much communication, only incurring logarithmic factors in the final regret.
H-LINUCB has the same communication cost of $O(M^{1.5}d^3)$ as DISLINUCB, which does not depend on the horizon $T$.

When an agent uses data from other agents, due to the dissimilarity, we need to enlarge the confidence bound to make sure the true linear parameter $\theta_m$ lies in the constructed ellipsoid with high probability. The next result shows how to construct such a confidence ellipsoid.

Lemma 4.1. (Confidence Ellipsoid) With probability at least $1 - M\delta_1 - M\delta_2$, for each agent $m \in [M]$, $\theta_m$ lies in the confidence set $\mathcal{C}_{m,t} = \{\theta \in \mathbb{R}^d : \|\hat\theta_{m,t} - \theta\|_{V_{m,t}} \le \beta_t\}$, where
$$\beta_t = \sqrt{\lambda} + \alpha\varepsilon M t + \sqrt{d\log\tfrac{1 + Mt/(\lambda d)}{\delta_1}} \ \ \text{for } t < \tau, \qquad \beta_t = \sqrt{\lambda} + \sqrt{d\log\tfrac{1 + t/(\lambda d)}{\delta_2}} \ \ \text{for } t \ge \tau.$$

Note that the result above provides two separate confidence bounds for $\theta_m$: one for the period before round $\tau$ and one for the period after round $\tau$. Before round $\tau$, agent $m$ uses all the data from the other agents to construct $\mathcal{C}_{m,t}$. The proof proceeds by first bounding
$$\|\hat\theta_{m,t} - \theta_m\|_{V_{m,t}(\lambda)} \le \underbrace{I_1}_{\text{regularization term}} + \underbrace{\Big\|\textstyle\sum_{i \ne m}\sum_{k} w_{i,k}x_{i,k}x_{i,k}^\top(\theta_i - \theta_m)\Big\|_{V^{-1}_{m,t}(\lambda)}}_{I_2:\ \text{dissimilarity term}} + \underbrace{I_3}_{\text{noise term}},$$
where the inner sum in $I_2$ runs over the other agents' data used by agent $m$. Here, $I_1$ is a bounded regularization term and $I_3$ is a noise term that can be bounded by the self-normalization lemma (Lemma C.4). Finally, the dissimilarity term $I_2$ is bounded from above by $\alpha\varepsilon Mt$ using the definition of dissimilarity, $\|\theta_i - \theta_m\| \le \varepsilon$, an argument similar to He et al. [2022], and the choice of the weight $w_{m,t} = \min\{1, \alpha/\|x_{m,t}\|_{V_{m,t}^{-1}}\}$. For the phase after round $\tau$, we use the same argument as Abbasi-Yadkori et al. [2011] to construct the confidence bound.

Next, we present the group regret upper bound of Algorithm 1 up to round $\tau$.

Theorem 4.1. Given Assumption 3.1, $T \ge 1$, any $\tau \le T$, and $\delta_1 > 0$, set $\lambda = 1$ and $\alpha = \frac{\sqrt{d}}{\varepsilon M\tau}$ in the upper bound $\beta_t$ on the confidence interval according to Lemma 4.1 for all $t \in [T]$, and set the synchronization threshold $D = \tau\log(M\tau)/(dM)$. Then, with probability at least $1 - M\delta_1$, the group regret of Algorithm 1 up to round $\tau$ is bounded as
$$R_{\mathrm{collab}}(M, \tau) = O\big(d\sqrt{M\tau}\,\xi_\tau^{2} + \varepsilon d M\tau\,\xi_\tau^{1.5}\big), \quad \text{where } \xi_t = \log\tfrac{1 + Mt/(\lambda d)}{\delta_1}.$$

Theorem 4.1 shows that Algorithm 1 incurs $\tilde O(d\sqrt{M\tau})$ regret in the first term (the same order as a single agent playing for $M\tau$ rounds) plus a penalty for using the data from other agents on the order of $\tilde O(\varepsilon d M\tau)$. The regret of $\tilde O(d\sqrt{M\tau})$ is unavoidable in any regime of $\varepsilon$, and this rate is known to be optimal in the homogeneous multi-agent case. Since we use the technique from He et al. [2022] to handle the dissimilarity, the corruption amount at each round is $\varepsilon$, which amounts to a total corruption of $\varepsilon M\tau$ and an additional regret of order $\varepsilon d M\tau$ when the central server allows collaboration up to round $\tau$. To the best of our knowledge, we are not aware of any UCB-based algorithm for misspecified linear bandits in the setting of infinitely many arms. We expect that employing a misspecified linear bandit algorithm would achieve a regret of $\tilde O(d\sqrt{MT} + \varepsilon\sqrt{d}\,MT)$, which is tighter by a factor of $\sqrt{d}$ in the dissimilarity term. It is worth noting that the CW-OFUL algorithm of He et al. [2022] is designed for handling corruption; whether it can achieve $\tilde O(d\sqrt{T} + \varepsilon\sqrt{d}\,T)$ in a misspecified setting remains an open question.

We now present our main result giving an upper bound on the group regret of H-LINUCB.

Theorem 4.2. Given Assumption 3.1 and $T \ge 1$, let $\lambda = 1$, $\alpha = \frac{\sqrt{d}}{\varepsilon M\tau}$, and $\delta_1 = \delta_2 = \frac{1}{M^2T}$ in the upper bound $\beta_t$ on the confidence interval (see Lemma 4.1) for all $t \in [T]$. Let $\tau = \min\big(\frac{1}{2\varepsilon^2}, T\big)$, and let the synchronization threshold be $D = \tau\log(M\tau)/(dM)$. Then, the expected group regret of Algorithm 1 is bounded as
$$\mathbb{E}[R(M, T)] \le \Big(320\, d\sqrt{MT} + 2\min\big\{\varepsilon d MT,\ dM\sqrt{T}\big\}\Big)\log^2(MT).$$
Here, $\tau$ is the last round at which the central server allows communication; after that, all agents switch to independent learning.
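As a quick reference, the parameter choices in Theorem 4.2 and Lemma 4.1 can be collected as in the sketch below. This is our illustration only: the integer rounding of $\tau$ and the special-casing of $\varepsilon = 0$ are our additions, and the constants are not tuned.

```python
import numpy as np

def hlinucb_hyperparams(d, M, T, eps, lam=1.0):
    """Parameter choices suggested by Theorem 4.2 (a sketch, not a tuned recipe).

    tau   : collaboration budget min(1/(2*eps^2), T)
    alpha : weight threshold sqrt(d) / (eps * M * tau)
    D     : synchronization threshold tau * log(M * tau) / (d * M)
    delta : failure probability 1 / (M^2 * T) used inside beta_t
    """
    tau = T if eps == 0 else min(int(np.ceil(1.0 / (2.0 * eps ** 2))), T)
    alpha = np.inf if eps == 0 else np.sqrt(d) / (eps * M * tau)
    D = tau * np.log(M * tau) / (d * M)
    delta = 1.0 / (M ** 2 * T)

    def beta(t):
        # Lemma 4.1: inflated radius during collaboration, OFUL-style radius afterwards.
        if t < tau:
            dissim = 0.0 if eps == 0 else alpha * eps * M * t
            return (np.sqrt(lam) + dissim
                    + np.sqrt(d * np.log((1.0 + M * t / (lam * d)) / delta)))
        return np.sqrt(lam) + np.sqrt(d * np.log((1.0 + t / (lam * d)) / delta))

    return tau, alpha, D, delta, beta
```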
By choosing $\tau = \min\big(\frac{1}{2\varepsilon^2}, T\big)$, agents fully cooperate in the regime $\varepsilon \in [0, \frac{1}{\sqrt{2T}}]$ (where $T \le \frac{1}{2\varepsilon^2}$), and $\tau$ gradually decreases as $\varepsilon$ increases from $\frac{1}{\sqrt{2T}}$ to $+\infty$. This is important for avoiding linear regret: when $\varepsilon$ is large, most of the regret comes from the $\varepsilon d MT$ term, which dominates the $d\sqrt{MT}$ term. The condition in Line 6 also discards all of the synchronized data. In the extreme case when $\varepsilon > 1$, no collaboration happens at all, since the condition $t < \tau$ in Line 18 fails at every round; in other words, H-LINUCB behaves like IND-OFUL. In the other extreme case, when $\varepsilon = 0$, all agents solve identical linear bandits. The weight in Line 15 always evaluates to 1 for all $t \in [T]$, since $\alpha = \frac{\sqrt{d}}{\varepsilon MT} \to +\infty$, and $t < \frac{1}{2\varepsilon^2} = +\infty$ at every round, so collaboration is never cut short by the reset condition in Line 6. Therefore, H-LINUCB behaves exactly like DISLINUCB, achieving a regret of $\tilde O(d\sqrt{MT})$, which is optimal up to logarithmic factors. Theorem 4.2 thus shows that the upper bound of H-LINUCB is never worse than that of IND-OFUL, for any $\varepsilon$.

Remark 4.2. Ghosh et al. [2023] also propose a personalized algorithm (PMLB) for heterogeneous multi-agent linear bandits; however, our problem setting is fundamentally different from that of Ghosh et al. [2023]. They consider a finite action set and impose a strong distributional assumption on how contexts are generated, i.e., the stochastic context $x_{i,t}$ for each action $i$ at each round $t$ is zero-mean and has a positive-definite covariance matrix. In stark contrast, we consider the adversarial setting where the context set is adversarially generated at each round (and thus, the associated action set can be infinite and arbitrary). This renders the algorithm and guarantees of Ghosh et al. [2023] inapplicable in our setting and, thus, requires a completely different treatment. The assumptions of Ghosh et al. [2023] are crucial for obtaining their $O(T^{1/4})$ bound (for $\varepsilon = 0$). We note that such a bound is not information-theoretically possible in our adversarial setting; the minimax lower bound in such settings is $\Omega(\sqrt{T})$ (Theorem 4.3).

4.2 Lower bound

In this section, we present a lower bound for the $\varepsilon$-MALCB problem. We denote by $R_{A,I}(M, T)$ the regret of algorithm $A$ on a problem instance $I$ with $M$ agents run for $T$ rounds.

Theorem 4.3. Let $\mathcal{I}(\varepsilon)$ denote the class of $\varepsilon$-MALCB problem instances that satisfy Assumption 3.1. Then, for any $d, M, T \in \mathbb{Z}^+$ with $d \le \sqrt{48T}$ and any $\varepsilon \ge 0$, we have
$$\inf_{A}\ \sup_{I \in \mathcal{I}(\varepsilon)} R_{A,I}(M, T) = \Omega\Big(d\sqrt{MT} + \min\big\{\varepsilon MT,\ dM\sqrt{T}\big\}\Big).$$

The first term is a straightforward observation that solving an $\varepsilon$-MALCB is at least as hard as solving a single linear bandit for $MT$ rounds, or $M$ agents solving identical bandits for $T$ rounds. The second term says that we pay an additional regret of $\varepsilon MT$ for a small $\varepsilon \in [0, \frac{d}{\sqrt{T}}]$, and $\Omega(dM\sqrt{T})$ for a large $\varepsilon \ge \frac{d}{\sqrt{T}}$. We note that $\Omega(dM\sqrt{T})$ is also the lower bound for IND-OFUL, since each agent incurs a regret of at least $\Omega(d\sqrt{T})$. We believe that the analysis of the lower bound could be tightened by using arguments from the misspecified bandits literature, yielding a dissimilarity term of $\Omega(\varepsilon\sqrt{d}\,MT)$.

The lower bound suggests that our upper bound is tight up to logarithmic factors in the following extreme regimes: (i) $\varepsilon \in [0, \frac{1}{\sqrt{MT}}]$, where $R(M, T) = \Theta(d\sqrt{MT})$; and (ii) $\varepsilon \in [\frac{d}{\sqrt{T}}, +\infty)$, where $R(M, T) = \Theta(dM\sqrt{T})$. In regime (i), all agents solve tasks that are similar to one another, yielding the highest collaborative gain. In regime (ii), tasks are highly dissimilar, so H-LINUCB turns off the collaboration and lets agents solve their own tasks individually.
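To make the regime boundaries tangible, the following snippet evaluates them for the simulation setting of Section 5 ($M = 60$, $d = 30$, $T = 10000$); this is illustrative arithmetic only, not part of the analysis.

```python
import numpy as np

# Regime boundaries from Section 4.2, evaluated for the simulation setting of Section 5.
M, d, T = 60, 30, 10000
low_boundary = 1 / np.sqrt(M * T)   # ~0.0013: below this, R = Theta(d * sqrt(M*T))
high_boundary = d / np.sqrt(T)      # 0.3: above this, R = Theta(d * M * sqrt(T))

def regime(eps):
    if eps <= low_boundary:
        return "(i) highly similar: full collaboration pays off"
    if eps < high_boundary:
        return "(ii) intermediate: collaborate for tau = min(1/(2 eps^2), T) rounds"
    return "(iii) highly dissimilar: independent learning is already optimal"

for eps in [0.0, 1e-3, 0.05, 0.5]:
    print(f"eps = {eps:g}: {regime(eps)}")
```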
Figure 1: Simulation on synthetic data with M = 60, d = 30, T = 10000.

Finally, in the regime $\varepsilon \in (\frac{1}{\sqrt{MT}}, \frac{d}{\sqrt{T}})$, our results illustrate the interpolation between the two extremes. In this regime, our upper bound presents a gap of $d$ in the dissimilarity term, $\varepsilon dMT$, compared to $\varepsilon MT$ in the lower bound. The key idea in the proof of Theorem 4.3 is an information-theoretic lower bound, Lemma C.1, which we extend from (single-agent) linear bandits to heterogeneous multi-agent linear bandits. We give the lower bound without any constraints on communication; hence, it is also a lower bound for the H-LINUCB algorithm.

5 Numerical Simulations

In this section, we provide numerical simulations to support our theory. Our goal is to address the following question: how does H-LINUCB perform in the three different regimes of dissimilarity, (i) $\varepsilon \in [0, \frac{1}{\sqrt{MT}}]$, (ii) $\varepsilon \in (\frac{1}{\sqrt{MT}}, \frac{d}{\sqrt{T}})$, and (iii) $\varepsilon \in [\frac{d}{\sqrt{T}}, +\infty)$?

We compare the performance of H-LINUCB with that of the following two algorithms: (a) Independent Learners (IND-OFUL), wherein each agent independently runs the OFUL algorithm of Abbasi-Yadkori et al. [2011] and there is no communication between agents (zero collaboration), and (b) DISLINUCB, for which we use the implementation of Wang et al. [2020] without any modification.

Simulation setup. We generate the $\varepsilon$-MALCB problem for $M = 60$, $d = 30$, $T = 10000$ via the following procedure. We first choose a value of $\varepsilon$ in each of the three dissimilarity regimes. Then we create the linear parameters $\{\theta_m\}_{m=1}^{M}$ as follows. Let $u, \{v_m\}_{m=1}^{M}$ be random vectors with unit norm. We set $\theta_m = c\,u + \frac{\varepsilon}{2}v_m$, where $c$ is a constant in the range $[0, 1 - \varepsilon]$. This guarantees $\|\theta_m\| \le 1$ and $\|\theta_i - \theta_j\| \le \varepsilon$ for any two agents $i, j$. At each round, for each agent, we create a new decision set of size 50; each action is drawn at random and normalized to unit norm. The random noise is sampled from the standard normal distribution, $\eta \sim \mathcal{N}(0, 1)$. We run each experiment 10 times, then report the group regret averaged over the runs and the confidence intervals in Figure 1. Our code is available at: https://github.com/anhddo/hlinUCB.

Results and discussion. In regime (i), where the level of dissimilarity is small, plots (a) and (b) show that H-LINUCB retains a regret comparable with DISLINUCB. In regime (ii), plots (c) and (d) illustrate the interpolation between the two extreme regimes. In regime (iii), plots (e) and (f), DISLINUCB incurs linear regret, while H-LINUCB matches the rate of IND-OFUL. This illustrates that collaboration brings no benefit when the dissimilarity is high.

6 Conclusions

In this paper, we studied the heterogeneous multi-agent linear contextual bandit problem. We formulated the problem under the notion of $\varepsilon$-MALCB, and provided upper and lower bounds when $\varepsilon$ is known. We showed that our results are provably optimal in the regimes where tasks are highly similar or highly dissimilar. Finally, we validated our theoretical results with numerical simulations on synthetic data. A natural avenue for future work would be to close the gap in the regime $\varepsilon \in (\frac{1}{\sqrt{MT}}, \frac{d}{\sqrt{T}})$. Another research direction pertains to designing an adaptive algorithm for when $\varepsilon$ is unknown. Such an algorithm would be practical and flexible enough to apply to a wide range of heterogeneous multi-agent bandit problems. We are also interested in extending this work to more challenging settings such as reinforcement learning.
Acknowledgements This research was supported, in part, by DARPA GARD award HR00112020004, NSF CAREER award IIS-1943251, funding from the Institute for Assured Autonomy (IAA) at JHU, and the Spring 22 workshop on Learning and Games at the Simons Institute for the Theory of Computing. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. 2, 3, 4, 6, 7, 9, 14, 22, 23 Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1 8. IEEE, 2020. 2 Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Uncertainty in Artificial Intelligence, page 122. Citeseer, 2013. 1 Leonardo Cella, Karim Lounici, Grégoire Pacreau, and Massimiliano Pontil. Multi-task representation learning with stochastic linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 4822 4847. PMLR, 2023. 3 Niladri Chatterji, Vidya Muthukumar, and Peter Bartlett. Osom: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 1844 1854. PMLR, 2020. 2 Ronshee Chawla, Abishek Sankararaman, and Sanjay Shakkottai. Multi-agent low-dimensional linear bandits. IEEE Transactions on Automatic Control, 2022. 2 Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208 214. JMLR Workshop and Conference Proceedings, 2011. 2 Yihan Du, Longbo Huang, and Wen Sun. Multi-task representation learning for pure exploration in linear bandits. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 8511 8564. PMLR, 23 29 Jul 2023. 3 Abhimanyu Dubey and Alex Sandy'Pentland. Differentially-private federated linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6003 6014. Curran Associates, Inc., 2020. 2, 6, 14 Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478 11489, 2020. 3, 6 Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757 765. PMLR, 2014. 1, 3 Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. 3 Avishek Ghosh, Abishek Sankararaman, and Kannan Ramchandran. Multi-agent heterogeneous stochastic linear bandits. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19 23, 2022, Proceedings, Part IV, pages 300 316. Springer, 2023. 1, 2, 3, 8 Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. 
3, 6, 7, 15 Jiachen Hu, Xiaoyu Chen, Chi Jin, Lihong Li, and Liwei Wang. Near-optimal representation learning for linear bandits and linear rl. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4349 4358. PMLR, 18 24 Jul 2021. 1, 3 Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057 27068, 2021. 1, 2, 3, 4 Nathan Korda, Balazs Szorenyi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In International conference on machine learning, pages 1301 1309. PMLR, 2016. 3 Tor Lattimore and Szepesvári Csaba. Bandit algorithms. Cambridge University Press, 2020. 3, 6, 22, 23 Chuanhao Li and Hongning Wang. Asynchronous upper confidence bound algorithms for federated linear bandits. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 6529 6553. PMLR, 28 30 Mar 2022. 1, 3, 4 Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th international conference on World wide web - WWW 10, 2010. doi: 10.1145/1772690.1772758. 2 Shuai Li, Claudio Gentile, and Alexandros Karatzoglou. Graph clustering bandits for recommendation. ar Xiv preprint ar Xiv:1605.00596, 2016. 3 Shuai Li, Wei Chen, Shuai Li, and Kwong-Sak Leung. Improved algorithm on online clustering of bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 19, page 2923 2929. AAAI Press, 2019. ISBN 9780999241141. 3 Kanak Mahadik, Qingyun Wu, Shuai Li, and Amit Sabne. Fast distributed bandits for online recommendation systems. In Proceedings of the 34th ACM international conference on supercomputing, pages 1 13, 2020. 2 David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. 2 Aritra Mitra, Arman Adibi, George J. Pappas, and Hamed Hassani. Collaborative linear bandits with adversarial agents: Near-optimal regret bounds. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. 2 Ahmadreza Moradipari, Mohammad Ghavamzadeh, and Mahnoosh Alizadeh. Collaborative multiagent stochastic linear bandits. In 2022 American Control Conference (ACC), pages 2761 2766. IEEE, 2022. 2 Michael T. Rosenstein. To transfer or not to transfer. In NIPS 2005, 2005. 1 Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395 411, 2010. 2 Marta Soare, Ouais Alsharif, Alessandro Lazaric, and Joelle Pineau. Multi-task linear bandits. In NIPS2014 workshop on transfer and multi-task learning: theory meets practice, 2014. 2 Kei Takemura, Shinji Ito, Daisuke Hatano, Hanna Sumita, Takuro Fukunaga, Naonori Kakimura, and Ken-ichi Kawarabayashi. A parameter-free algorithm for misspecified linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 3367 3375. PMLR, 2021. 
3, 6 Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020. 2, 6, 7, 9, 14, 15 Zhi Wang, Chicheng Zhang, Manish Kumar Singh, Laurel Riek, and Kamalika Chaudhuri. Multitask bandit learning through heterogeneous feedback aggregation. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1531 1539. PMLR, 13 15 Apr 2021. 1, 2, 3, 4 Jiaqi Yang, Wei Hu, Jason D. Lee, and Simon Shaolei Du. Impact of representation learning in linear bandits. In International Conference on Learning Representations, 2021. 1, 3 Jiaqi Yang, Qi Lei, Jason D Lee, and Simon S Du. Nearly minimax algorithms for linear bandits with shared representation. ar Xiv preprint ar Xiv:2203.15664, 2022. 3 Weitong Zhang, Jiafan He, Zhiyuan Fan, and Quanquan Gu. On the interplay between misspecification and sub-optimality gap in linear contextual bandits. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 41111 41132. PMLR, 23 29 Jul 2023. 6 A Proof of Upper Bound A.1 Proof of Lemma 4.1 Proof. We prove that the true parameter of each agent lies in the Confidence Ellipsoid with high probability. In particular, we prove such results for two periods, for round t < τ and t τ. We denote δ1, δ2 as the probabilities such that true linear parameters lie outside the confidence ellipsoid for the period of t < τ, and t τ, respectively. For the round t < τ, let ts be the last round that synchronization occurs, where ts < t. Then each agent s statistics can be divided into two parts, its own data and the pooled data from other agents since round ts. Specifically, for an agent m, at round t, we have the following for Vm,t, bm,t, Vm,t = λI + k=1 wm,kxm,kx m,k + X k=1 wi,kxi,kx i,k, k=1 wm,kxm,kym,k + X k=1 wi,kxi,kyi,k. We then have the following for ˆθm,t, ˆθm,t = V 1 m,tbm,t k=1 wm,kxm,kym,k + X k=1 wi,kxi,kyi,k k=1 wm,kxm,k x m,kθm + ηm,k + X k=1 wi,kxi,k x i,kθi + ηi,k k=1 wm,kxm,kx m,k + X k=1 wi,kxi,kx i,k k=1 wi,kxi,kx i,k (θi θm) k=1 wm,kxm,kηm,k + X k=1 wi,kxi,kηi,k Notice that the summation in the first term is the covariance matrix Vm,t but missing the regularization term λI; thus, the first term can be simplified as θm λV 1 m,tθm. Applying triangle inequality, we have the following for ˆθm,t θm Vm,t , ˆθm,t θm Vm,t λ θm V 1 m,t | {z } I1:Regularization term k=1 wi,kxi,kx i,k(θi θm) V 1 m,t | {z } I2:Dissimilarity term k=1 wm,kxm,kηm,k + X k=1 wi,kxi,kηi,k V 1 m,t | {z } I3:Noise term For the regularization term I1, it is upper bounded by, I1 = λ θm V 1 m,t λ λmin V 1/2 m,t θm 2 For the dissimilarity term I2, we have, k=1 wi,kxi,kx i,k(θi θm) wi,kxi,kx i,k(θi θm) V 1 m,t k=1 wi,k x i,k(θi θm) xi,k V 1 m,t k=1 xi,k θi θm α where we use triangle inequality in the first inequality, the second inequality holds since wm,t α/ xm,t V 1 m,t by the definition of wm,t, the last inequality holds due to the bounded parameters assumption and the ε-MALCB definition (see Assumption 3.1 and Definition 3.1). To bound term I3, we denote vector xm,t = wm,txm,t, and the random noise ηm,t = wm,tηm,t. We have wm,t 1, therefore, ηm,t is 1-sub Gaussian. 
Furthermore, we rewrite the covariance matrix as Vm,t = λI + Pt s=1 xm,s x m,s + P i =m Pts k=1 xi,k x i,k. We then have the following for the noise term I3, k=1 xm,k ηm,k + X k=1 xi,k ηi,k v u u t2 log det (Vm,t)1/2 det(V0) 1/2 d log 1 + Mt/(λd) where we apply Lemma C.4 in the first inequality. Now, putting these three terms together, we have the following Confidence Ellipsoid bound for round t < τ, ˆθm,t θm Vm,t d log 1 + Mt/(λd) We now turn our attention to the round t τ. Since agents switch to independent learning, we follow the same argument for confidence ellipsoid as in Theorem 2 of Abbasi-Yadkori et al. [2011], and get d log 1 + t/(λd) Finally, we complete the proof by taking a union bound over all agents. A.2 Proof of Theorem 4.1 Before proving Theorem 4.1, we give an upper bound on the pseudo-regret, rm,t, in the next proposition. The proof follows standard arguments for linear bandits; most of the arguments can be extracted from Wang et al. [2020], Dubey and Pentland [2020]. Proposition A.1. The pseudo-regret rt obtained by any agent m at round t, is upper bounded as rm,t 2βt min 1, xm,t V 1 m,t Proof. Recall that an agent makes decisions optimistically: (xm,t, θm,t) = arg max (x,θ) Dm,t Cm,t x θ. Let x m,t denote the optimal action at round t of agent m, i.e., x m,t = arg max x Dm,t x θm. We then have the following for the pseudo-regret rm,t = x m,t θm xm,t θm xm,t θm,t xm,t θm (Since (xm,t, θm,t) is optimistic) = xm,t ( θm,t θm) = V 1/2 m,t xm,t, V 1/2 m,t ( θm,t θm) (Vm,t 0) xm,t V 1 m,t θm,t θm Vm,t (Cauchy-Schwarz s inequality) xm,t V 1 m,t ˆθm,t θm,t Vm,t + ˆθm,t θm Vm,t (Triangle inequality) 2βt xm,t V 1 m,t (Since θm, θm,t Ct) 2βt min 1, xm,t V 1 m,t where the last inequality is due to the fact that βt 1 and that suboptimality is no more than 2. We now proceed to prove Theorem 4.1. The proof technique follows the analysis of group regret in Wang et al. [2020] and applies the arguments from Theorem 4.2 of He et al. [2022] to handle the dissimilarity. Theorem A.1 (Theorem 4.1 restated). For a given T, and τ T, if we set D = τ log(Mτ)/(d M), and βt, t [τ] according to Lemma 4.1, then, given Assumption 3.1, with probability at least 1 Mδ1, the group regret of Algorithm 1 up to round τ, is upper bounded as λd Mτ(ξτ)1.5 + αε α + εd Mτξτ d(Mτξτ)1.5 + d where ξt = log 1+Mt/(λd) . Furthermore, if we choose λ = 1, α = d εMτ , then the group regret is upper bounded as Mτξ2 τ + εd Mτξ1.5 τ . We refer to this group regret as the collaboration regret, and denote it as Rcollab(M, T). Proof. Let P be the total number of synchronization rounds, then, we can index the synchronization matrix Vsyn for P epoch as {Vsyn,p}P p=1, and define Vp = λI + Vsyn,p. Observe that det(V0) = det(λI) = λd and det(VP ) (trace(VP )/d)d (λ + Mτ/d)d. Therefore, log det(VP ) det(V0) d log 1 + Mτ By telescoping, we have that log det(VP ) det(V0) = PP p=1 log det(Vp) det(Vp 1). Therefore, we have at most R = d log(1 + Mτ λd ) epochs such that log det(Vp) det(Vp 1) 1; otherwise, it violates the condition in Equation (1). WLOG, we use logarithm base 2 for the determinant ratio. This implies that for all but R epochs, 1 det(Vp) det(Vp 1) 2, (2) We call the epochs satisfying Equation (2) as good epochs. We imagine a single agent playing M(τ 1) actions x1,1, x2,1, . . . , x M 1,τ 1, x M,τ 1 sequentially. Let Wm,t = λI + PM i=1 Pt 1 s=1 wi,sxi,sx i,s +Pm 1 j=1 wj,txj,tx j,tbe the covariance matrix of this imaginary agent when it gets to xm,t. 
If xm,t belongs to a good epoch, say the j-th epoch, we have the following: 1 xm,t V 1 m,t xm,t W 1 m,t det(Vj) det(Vj 1) where the first inequality is due to the fact that Wm,t Vm,t, and the second inequality follows from Lemma C.2. Now, applying the Proposition A.1, we bound the pseudo-regret rm,t of these good epochs as follows: rm,t 2βt min 1, xm,t V 1 m,t 1, xm,t W 1 m,t 2βt min 1, xm,t W 1 m,t where we use Lemma C.2 in the second inequality, and Equation (3) in the third inequality. Let Rgood(M, τ) be the group regret of these good epochs up to round τ. Suppose Pgood contains all the good epochs, and Bp contains all the pairs (m, t) belonging to epoch p. We have Rgood(M, τ) = X (m,t) Bp rm,t 2βτ 1 min 1, xm,t W 1 m,t (m,t) Bp wm,t=1 2 2βτ 1 min 1, xm,t W 1 m,t | {z } I1:Rounds of good epoch with wm,t=1 (m,t) Bp wm,t<1 2 2βτ 1 min 1, xm,t W 1 m,t | {z } I2:Rounds of good epoch with wm,t<1 where in the last equality we split the summation to two cases: wm,t = 1, and wm,t < 1. For term I1, we consider all the pair (m, t) in good epochs up to round τ such that wm,t = 1; we assume that we have a total of such K pairs, and these pairs can be sequentially listed as {( m1, t1), ( m2, t2), , ( m K, t K)}. With this notation, for i K, we construct the auxiliary covariance matrix A mi, ti = λI + Pi 1 k=1 x mk, tkx mk, tk. Thus, we have W mi, ti A mi, ti, which implies x mi, ti W 1 mi, ti x mi, ti A 1 mi, ti. We now have the following for term I1: (m,t) Bp wm,t=1 2 2βτ 1 min 1, xm,t W 1 m,t 2βτ 1 min 1, x mk, tk W 1 mk, tk 2βτ 1 min 1, x mk, tk A 1 mk, tk i=1 min 1, x mk, tk 2 A 1 mk, tk 2Mτ log det(VP ) d Mτ log 1 + Mτ where in the second inequality we use Cauchy Schwarz s inequality and the fact that K Mτ, and use Lemma C.3 in the second inequality. For term I2, since we consider the case such that wm,t < 1, thus, by the definition of wm,t we have 1 αwm,t xm,t V 1 m,t = 1. Therefore, we have, (m,t) Bp wm,t<1 min 1, βτ xm,t W 1 m,t (m,t) Bp wm,t<1 min 1, βτ wm,t xm,t V 1 m,t α xm,t W 1 m,t (m,t) Bp wm,t<1 min 1, βτ wm,t xm,t 2 W 1 m,t α where the last inequality follows from Equation (3). Similarly to what we have done with I1, we consider all the pair (m, t) in good epochs up to round τ such that wm,t < 1; we assume that we have a total of such K pairs, and these pairs can be sequentially listed as {( m1, t1), ( m2, t2), , ( m K, t K)}. Furthermore, we define xm,t = wm,txm,t, and construct the auxiliary covariance matrix, W mi, ti = λI + k=1 w mk, tkx mk, tkx mk, tk = λI + x mk, tk x mk, tk. Thus, we have W mi, ti W mi, ti, which implies x mi, ti W 1 mi, ti x mi, ti W 1 mi, ti. k=1 min 1, βτ 1w mk, tk/α x mk, tk 2 W 1 mk, tk k=1 min 1 + βτ 1/α, (1 + βτ 1/α) w mk, tkx mk, tk 2 W 1 mk, tk 4(1 + βτ 1/α) k=1 min 1, x mk, tk 2 W 1 mk, tk 4(1 + βτ 1/α)d log 1 + Mτ 4(1 + βτ 1/α)dξτ, where we use Lemma C.3 in the fourth inequality. Putting the bounds for I1 and I2 together, we have Rgood(M, τ) = I1 + I2 4 d Mτξτ + 4 1 + βτ 1 d Mτξτ + dξτ + βτ 1 Plugging βτ 1 = λ + αεMτ + q d log( 1+Mτ/λ δ1 ), the regret of the good epochs is bounded as Rgood(M, τ) 4 λd Mτξτ + αε α + εd Mτξτ + (dξτ)1.5 Now, we focus on the epochs that are not good, which do not satisfy Equation (2). Let p be one such epoch, and let t0 be the first round and n be the length, respectively, of the epoch p. Recall that at the beginning of each epoch, agents covariance matrices are synchronized, i.e., Vm,t0 = λI + Vsyn, m [M]. We can then bound the regret of the (bad) epoch p as follows. 
Rbad(M, p) 2βτ 1 t=t0 min 1, xm,t V 1 m,t t=t0 min 1, xm,t 2 V 1 m,t n log det(Vm,t0+n) det(Vm,t0) , where we use Cauchy-Schwarz s inequality and Lemma C.3 in the second inequality. Lets say the synchronization is triggered on round t0 + n + 1, i.e., for one of the agents we have that (n + 1) log det(Vm,t0+n+1) det(Vm,t0) > D. Since n log det(Vm,t0+n) det(Vm,t0) < D for all m [M], for this bad epoch (i.e., from round t0 to round t0 + n), we can bound the group regret as Rbad(M, p) 2βτ 1M As we argued earlier, since det Vp (λ + MT d )d, the bad epochs are rare, i.e., there are at most R = d log(1 + Mτ λd ) (otherwise, if det(Vp)/ det(Vp 1) > 2 for more than R rounds, then the condition of Equation (1) is violated). Setting D = τ log(Mτ)/(d M) and plugging βτ 1 into the bound above, Rbad(M, τ) 2Rβτ 1M λ + αεMτ + p λd Mτ(ξτ)1.5 + αε d(Mτξτ)1.5 + d We finish the proof by putting the regret of the good and the bad epochs together, R(M, τ) = Rgood(M, τ) + Rbad(M, τ) λd Mτ(ξτ)1.5 + αε α + εd Mτξτ d(Mτξτ)1.5 + d A.3 Proof of Theorem 4.2 Proof. Setting δ1 = δ2 = 1/(M 2T), then the group regret caused by failure event, which does not satisfy Lemma 4.1, is at most MT (Mδ1 + Mδ2) = O(1). Thus, we mainly consider the group regret when Lemma 4.1 holds. From round τ onward, agents switch to independent learning mode, this is the group regret from round τ to the end, we denote it as Rind(M, T). Furthermore, for round t τ, we have Vm,t = λI + Pt 1 s=τ xm,sx m,s as the gram matrix which only contains agent s data. Applying Proposition A.1, we have the following t=τ 2βt min 1, xm,t V 1 m,t t=τ min 1, xm,t 2 V 1 m,t d log 1 + T/(λd) 2d log 1 + T where we use Cauchy-Schwarz s inequality in the first inequality, and Lemma 4.1 in the second inequality. Combining with the regret from the beginning up to round τ from Theorem 4.1, we have R(M, T) = Rcollab(M, T) + Rind(M, T) Mτ + εd Mτ + d M T τ ξ2 T (4) We notice that Equation (4) has the second term that is linear in term of Mτ. To avoid linear regret, the choice of τ needs to adapt to the dissimilarity level ε. For the last two terms, by Cauchy-Schwarz s inequality, we have εd Mτ + d M 2(ε2τ 2 + T τ)), this term can be optimized by setting τ = 1 2ε2 . We need to restrict τ = min( 1 2ε 2 , T) since τ can not exceed T. In other words, all agents can collaborate up to round min 1 Next, we consider the following two cases: Case 1: For 1 2ε 2 T, we have that d M T, and εd Mτ = d M T. Thus, εd Mτ + d M Case 2: For 1 2ε 2 > T, we have τ = T. This implies that εd Mτ + d M T τ = εd MT. We also have εd MT < 1 Notice that, in both cases, εd Mτ + d M T τ is evaluated as small as εd MT and always be upper bounded by 2d M T. Therefore, in Equation (4), the summation of the last two terms of the group regret is upper bounded by 2 min{εd MT, d M By the choice of δ1, δ2, λ, and assume that d 1, we can upper bound ξT 4 log(MT). Therefore, we have the following for the expected group regret E [R(M, T)] 20 MT + 2 min{εd MT, d M MT + 2 min{εd MT, d M T} log2(MT) B Proof of Lower Bound B.1 Proof of Theorem 4.3 Before proving the theorem, we give a formal definition of ε-MALCB problem for p-norm, Definition B.1. Given an instance of multi-agent linear bandits, it belongs to the class of p-norm ε-MALCB, which we denote as Ip(ε), if for all i, j [M], θi θj p ε. To prove Theorem 4.3, we use the results of the two following lemmas on max-norm ε-MALCB instances. Lemma B.1. Let I (ε) be the class of max-norm ε-MALCB instances that satisfy the Assumption 3.1. 
For ε 0, we have the following, inf A sup I I (ε) RA,I(M, T) = Ω(d Lemma B.2. Let I (ε) be the class of max-norm ε-MALCB instances that satisfy the Assumption 3.1. Assume d 2T, d2 48 T. For ε 0, we have the following, inf A sup I I (ε) RA,I(M, T) = Ω min n εMT Proof of Theorem 4.3. Combining the results of Lemma B.1 and Lemma B.2, we have the following results for the class of max-norm ε-MALCB, for ε 0, inf A sup I I (ε) RA,I(M, T) = Ω d MT + min n εMT In addition, x ε implies x 2 ε d, therefore, I (ε) I2(ε d). In other words, for ε 0, inf A sup I I2(ε) RA,I(M, T) inf A sup RA,I(M, T) = Ω d MT + min n εMT, d M Proof of Lemma B.1. In this lemma, we prove that solving any ε-MALCB instance for T rounds is at least as hard as solving a (single-agent) linear bandits for MT rounds. We prove the lemma by contradiction, which is based on linear bandits lower bound of Lemma C.1. We have that I (ε ) I (ε), for 0 ε ε. This implies, for ε 0, inf A sup I I (ε) RA,I(M, T) inf A sup I I (0) RA,I(M, T). We completes the proof by proving inf A sup I I (0) RA,I(M, T) = Ω d Now, we assume there exists an algorithm A which achieves sup I I (0) RA,I(M, T) < d We observe that I (0) is the class of multi-agent solving exactly the same linear bandits problem. We simulate the algorithm B on a single agent linear bandits for MT rounds by the protocol of M agents solving an identical linear bandits for T rounds. Therefore, if we have RA,I(M, T) < d 3 then we also achieve RB,I(MT) < d 3 , which contradicts Lemma C.1. Thus, we have sup I I (0) RA,I(M, T) d 3 , which completes the proof. Proof of Lemma B.2. We extend Lemma C.1 to multi-agent linear contextual bandit by constructing the following max-norm ε-MALCB instance. Max-norm ε-MALCB instance. The set of linear parameter of M agents {θm}M m=1 belong to a d-dimensional hypercube θm { ε}d, where ε [0, 1 d]. Let D = x Rd : x 2 1 be the action set given to agents at every round. The reward when agent m picks action x is defined as rm,x = θ mx + ηm,x, where the noise samples from a standard normal distribution, ηm,x N(0, 1). We first verify if the instance satisfies the Assumption 3.1. It satisfies the context assumption since we also restrict the context vector to lie in the unit ball. It also satisfies the θm 1 assumption because we restrict ε [0, 1 d]. Finally, ηm,x N(0, 1) is 1-sub Gaussian distribution. This instance belongs to I (2ε) since θi θj 2ε for all i, j [M]. Now, we proceed to prove Lemma B.2. Let {θm}M m=1 be a set of parameters of the max-norm ε-MALCB instance. For brevity, we omit the commas in the subscripts when it is clear from the context, e.g., θmi = θm,i or xtmi = xt,m,i, t [T], m [M], i [d]. Given m [M], i [d], we define the stopping time τmi = T min{t : Pt s=1 x2 smi t/d}, and the function Umi(z) = Pτmi t=1( 1 d xtmi z)2. We then have following result for Umi(1): t=1 x2 tmi 4T Let x m be the optimal action of agent m. We then have the following for the group regret RA(M, T) = E{θm}M m=1 i=1 (x miθmi xtmiθmi) = ε E{θm}M m=1 d xtmi sign(θmi) # d 2 E{θm}M m=1 d xtmi sign(θmi) 2# i=1 E{θm}M m=1 d xtmi sign(θmi) 2# where we use the fact that optimal action x m = [ θm1 θm ] , which is a unit vector and has the same direction with θm, the first inequality holds due to xtm 2 1. Now, let {θ m}M m=1 be another set of linear parameters that are different at only one coordinate of the linear parameter of only one agent compared to {θm}M m=1. Specifically, fix g [M], i [d], we have θg,i = θ g,i; otherwise, θk,j = θ k,j, for (k, j) = (g, i), k [M], j [d]. 
To simplify the notion, we define Φ, Φ as Md-dimensional vectors, where Φ, Φ { ε}Md. That is, Φ, Φ represent {θm}M m=1, {θ m}M m=1, respectively. We let P and P be the law of Ugi(z) w.r.t. the interaction of the multi-agent linear bandits induced by Φ, Φ , respectively. We then get EΦ [Ugi(1)] EΦ [Ugi(1)] 4T 1 2DKL (P||P ) EΦ [Ugi(1)] ε d + 2 v u u t E EΦ [Ugi(1)] ε EΦ [Ugi(1)] 4 where the first inequality holds due to Lemma C.5 and Equation (5), the second inequality follows the stopping time arguments from Lattimore and Csaba [2020], the last inequality holds due to the assumption that d 2T. Then, EΦ [Ugi(1)] + EΦ [Ugi( 1)] EΦ [Ugi(1) + Ugi( 1)] 4 where the last inequality holds for ε [0, 1 4 d T ], this satisfies the requirement ε [0, 1 the instance construction due to the assumption that d2 48 T. Furthermore, we denote Φ mi as a (Md 1)-dimensional vector, which is obtained by excluding Φmi from vector Φ. Let RΦ(M, T) be the regret w.r.t. Φ, and applying randomization hammer, we have the following, Φ { ε}Md RΦ(M, T) ε Φ { ε}Md EΦ [Umi (sign (εmi))] Φ mi { ε}Md 1 Φmi { ε} EΦ [Umi (sign (εmi))] Φ mi { ε}Md 1 d = 2Md 2εMT Therefore, we conclude that there exists an instance with the parameter set of {θm}M m=1 such that R(M, T) εMT d 4 , for ε [0, 1 4 d T ]. Notice that this proof holds for I (2ε) class. Scaling down by 2, we have the following for I (ε) class, RI (ε)(M, T) = RI (2 ε 2 )(M, T) εMT d 8 , for ε Observe that εMT d 8 is a strictly increasing function w.r.t. ε. It is at most d M 3 , when ε = 1 2 In other words, RI (ε)(M, T) min( εMT 3 ) for ε [0, 1 2 Recall that inf A sup I I (ε) R(M, T) inf A sup I I (ε ) R(M, T), for any ε ε 0. Thus, we conclude, ε 0, the following holds, inf A sup I I (ε) RA,I(M, T) min(εMT which completes the proof. C Supporting Lemmas Lemma C.1. ([Lattimore and Csaba, 2020, Theorem 24.2]). Assume d 2n and let D = x Rd : x 2 1 . Then there exists a parameter vector θ Rd with θ 2 2 = d2/(48n) such that Rn(D, θ) d n/(16 3). Lemma C.2. ([Abbasi-Yadkori et al., 2011, Lemma 12]). Let A, B and C be positive semi-definite matrices such that A = B + C. Then, we have that x Ax x Bx det(A) Lemma C.3. ([Abbasi-Yadkori et al., 2011, Lemma 11]). Let {Xt} t=1 be a sequence in Rd, V a d d positive definite matrix and define Vt = V + Pt s=1 Xs X s . Then, we have that t=1 Xt 2 V 1 t 1 . Further, if Xt 2 L for all t, then t=1 min n 1, Xt 2 V 1 t 1 o 2 log det Vn log det V 2 d log trace(V ) + n L2 /d log det V Lemma C.4. (Self-Normalized Bound for Vector-Valued Martingales, [Abbasi-Yadkori et al., 2011, Theorem 1]). Let {Ft} t=0 be a filtration. Let {ηt} t=1 be a real-valued stochastic process such that ηt is Ft-measurable and ηt is conditionally R-sub-Gaussian for some R 0 i.e. λ R E eληt | Ft 1 exp λ2R2 Let {Xt} t=1 be an Rd-valued stochastic process such that Xt is Ft 1-measurable. Assume that V is a d d positive definite matrix. For any t 0, define Vt = V + Pt s=1 Xs X s . Then, for any δ > 0, with probability at least 1 δ, for all t 0, det Vt 1/2 det(V ) 1/2 Lemma C.5. (Pinsker s inequality [Lattimore and Csaba, 2020, Equation 14.12]) For measures P and Q on the same probability space (Ω, F), we have the following, sup A F P(A) Q(A) 1 2DKL(P||Q).