Near-Optimal Collaborative Learning in Bandits

Clémence Réda, Université Paris Cité, Inserm, NeuroDiderot, F-75019 Paris, France. clemence.reda@inria.fr
Sattar Vakili, MediaTek Research, Cambourne Business Park, CB23 6DW, United Kingdom. sattar.vakili@mtkresearch.com
Emilie Kaufmann, Université de Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France. emilie.kaufmann@univ-lille.fr

Abstract

This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization [30]. In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.

1 Introduction

Collaborative learning is a general machine learning paradigm in which a group of agents collectively train a learning algorithm. Some recent works have investigated how agents can efficiently perform sequential decision making in a collaborative context [36]. In particular, Shi et al. propose an interesting setting to tackle collaborative bandit learning when some level of personalization is required [30]. Personalization leads to the twist that each agent should play the best arm in a mixed model, obtained as a combination of her local model with the local models of the other agents.

In this work, we introduce a more general model retaining this idea, which we call the weighted collaborative bandit model. In this model, there are $M$ agents and a finite number $K$ of arms. When agent $m$ samples arm $k$ at time $t$, she gets to observe a local reward $X_{k,m}(t)$, which is drawn from a 1-sub-Gaussian distribution of mean $\mu_{k,m}$ (a random variable $X$ is said to be $\sigma^2$-sub-Gaussian if, for any $\lambda \in \mathbb{R}$, $\ln(\mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}]) \leq \sigma^2\lambda^2/2$), independently from past observations and from other agents' observations. However, this agent does not necessarily seek to maximize her local reward, but rather some notion of mixed reward, related to the utility of that arm for other agents. More specifically, we assume that agents share a known weight matrix $W = (w_{n,m})_{n,m} \in [0,1]^{M \times M}$, such that $\sum_{n \in [M]} w_{n,m} = 1$ for all $m \in [M]$, where $[n] = \{1, 2, \dots, n\}$. The mixed reward at time $t$ for agent $m$ and arm $k$ is defined as a weighted average of the local rewards across all agents, $\tilde{X}_{k,m}(t) = \sum_{n=1}^{M} w_{n,m} X_{k,n}(t)$, and its expectation, called the expected mixed reward, is $\tilde{\mu}_{k,m} = \sum_{n=1}^{M} w_{n,m}\mu_{k,n}$. We denote by $k^\star_m = \arg\max_{k \in [K]} \tilde{\mu}_{k,m}$ the arm with largest expected mixed reward for agent $m$, assumed unique.
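To fix ideas, the following minimal NumPy sketch instantiates these quantities on synthetic data; all names are ours and the instance is arbitrary, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 2                                   # arms, agents
mu = rng.uniform(0, 1, size=(K, M))           # local means mu_{k,m}
W = rng.uniform(0, 1, size=(M, M))
W /= W.sum(axis=0, keepdims=True)             # each column m sums to 1, as required

mu_mixed = mu @ W                             # expected mixed rewards: sum_n w_{n,m} mu_{k,n}
k_star = mu_mixed.argmax(axis=0)              # optimal arm k*_m of each agent

# one round in which every agent samples every arm (Gaussian, hence 1-sub-Gaussian, noise)
X = mu + rng.normal(size=(K, M))              # local rewards X_{k,m}(t)
X_mixed = X @ W                               # mixed rewards tilde X_{k,m}(t)
```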
Besides the degenerate case $w_{n,m} = \mathbb{1}(n = m)$, in which each agent solves her own bandit problem in isolation, the agents need to communicate, i.e., to share information about their local rewards with other agents, so that everyone can estimate their expected mixed rewards. A strategy for an agent is defined as follows: at each time $t$, each agent $m$ samples an arm $\pi_m(t)$, based on the available information, and then observes a noisy local reward from this arm. She also has the option to communicate information (e.g., empirical means of past local observations) to a central controller (or server), which will broadcast this information to all other agents. Just like in any multi-armed bandit model, several objectives may be considered, either related to maximizing (mixed) rewards or, equivalently, minimizing regret, or to identifying the best arms, while maintaining a reasonably small communication cost.

The weighted collaborative bandit model encompasses different frameworks previously studied in the literature. Notably, the paper [30] studies a special case in which, given a level of personalization $\alpha \in [0,1]$, the mixed reward is an interpolation between $\mu_{k,m}$ and the average of local rewards $\frac{1}{M}\sum_{n=1}^{M}\mu_{k,n}$, which amounts to choosing $w_{m,m} = \alpha + \frac{1-\alpha}{M}$ and $w_{n,m} = \frac{1-\alpha}{M}$ for $n \neq m$. The authors consider the objective of minimizing the regret while keeping the number of communication rounds small, under the name federated multi-armed bandit with personalization (their setting, as well as ours, neglects some challenges typically addressed in centralized federated learning, such as privacy issues or dealing with communication interruptions, which is why we prefer naming our framework "collaborative learning"). In this paper, we mainly focus on the counterpart pure exploration problem, in which agents should collaboratively identify their own optimal arm in terms of expected mixed reward, with high confidence, and using as few exploration rounds as possible. This extends the well-studied fixed-confidence best arm identification problem [12, 14, 20] to the weighted collaborative bandit setting.

Another related setting is collaborative pure exploration [18, 32, 6, 35], which considers $M$ agents solving the same best arm identification problem. Most of these papers propose algorithms to solve this problem, while [32] and [21] also prove lower bounds on the sample complexity for fixed-confidence and fixed-budget best arm identification. Unlike our framework, [6] consider asynchronous agents, which can only sample at some times, whereas [35, 36] consider agents which can only communicate with some of the other agents. The goal of collaborative pure exploration is to reduce the sample complexity at the cost of some communication rounds. Our model recovers the synchronous setting when considering $\mu_{k,m} = \mu_{k,m'}$ for all arms $k$ and agents $m, m'$, and $W = \mathrm{Id}_M$. Besides collaborative learning, the work of [34] considers a similar weighted model but in a different, contextual bandit setting, in which a central controller chooses in each round an arm for the unique agent (corresponding to a sub-population) that arrives, and aims at minimizing regret. Finally, the paper [29] considers a pure exploration task in which the value of an arm is the weighted average of its utility for $M$ distinct populations (agents). In this setting $w_{n,m} = \alpha_n$, so that all agents have a common best arm, but the proposed algorithms do not aim at a low communication cost. Collaborative learning in our general weighted model is also interesting beyond these examples. In particular, the work of [30] mentioned possible applications to recommendation systems, for which one may want to go beyond uniformly personalized learning.
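For concreteness, the weight matrix of the federated setting with personalization [30] described above takes a simple form; the following is a small sketch (the helper name is ours).

```python
import numpy as np

def personalization_weights(M, alpha):
    """Weights of federated bandits with personalization [30]:
    w_{m,m} = alpha + (1 - alpha)/M and w_{n,m} = (1 - alpha)/M for n != m."""
    return (1 - alpha) / M * np.ones((M, M)) + alpha * np.eye(M)

W = personalization_weights(M=4, alpha=0.3)
assert np.allclose(W.sum(axis=0), 1.0)   # every column sums to one, as required
```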
The personalization part means that we favour the local rewards of agent $m$ over the other agents' observations in the identification of the optimal arm $k^\star_m$. But the introduction of a general weight matrix $W$ in our framework allows any agent $m$ to consider any linear combination of the other agents' observations, and hence different degrees of personalization across agents.

Such a setting could also be appropriate for adaptive clinical trials on $K$ therapies, run by $M$ teams who have access to different sub-populations of patients. In this context, each sub-population is typically aiming at finding their (local) best treatment. However, solving their best arm identification problem in isolation may have a large sample complexity. If one is willing to assume that we have a weight matrix $W$ for which the best mixed arm of each agent coincides with its local best arm, i.e. $k^\star_m = \arg\max_{k \in [K]} \mu_{k,m}$ for any agent $m$, then solving best arm identification in the weighted collaborative bandit could have a much smaller sample complexity due to the sharing of information, while allowing the different clinical centers to communicate only once in a while. A first possibility to build such a weight matrix is to rely on a clustering of the different sub-populations, so that $\mu_{k,m}$ is supposed to be close to $\mu_{k,m'}$ (but not necessarily equal) for all $k$ and all agents $m, m'$ in the same cluster. Denoting by $C_m$ the cluster to which agent $m$ belongs, by setting $w_{m,n} = \mathbb{1}(C_m = C_n)/|C_m|$ we would have the mixed mean of each agent be very close to their local means. Another possibility is to rely on a similarity function $S$ between sub-populations and to define, for all $n, m$, $w_{m,n}$ to be proportional to $S(m,n)$. This similarity function could be obtained prior to the learning phase by computing the similarities between sub-population biomarkers, for instance.
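Both constructions of $W$ sketched above (clustering-based and similarity-based) are easy to implement; the following is a minimal illustration (function names are ours), where columns are normalized so that $\sum_n w_{n,m} = 1$ holds.

```python
import numpy as np

def cluster_weights(clusters):
    """w_{m,n} = 1(C_m = C_n) / |C_m|: agents in the same cluster share their
    observations uniformly, observations of other agents are ignored."""
    clusters = np.asarray(clusters)
    same = (clusters[:, None] == clusters[None, :]).astype(float)
    return same / same.sum(axis=0, keepdims=True)   # column sums are the cluster sizes

def similarity_weights(S):
    """w_{m,n} proportional to a similarity S(m, n), with columns normalized."""
    S = np.asarray(S, dtype=float)
    return S / S.sum(axis=0, keepdims=True)

W = cluster_weights([0, 0, 1, 1, 1])   # two clusters of sizes 2 and 3
```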
Related work. Collaborative bandit learning has recently sparked wide interest in the multi-armed bandit literature. While some works do not deal with personalization [11, 31, 26], others have for instance studied the integration of arm features in a modified setting with personalization [19]. An interesting kernelized collaborative pure exploration problem was recently studied by [10]. In their model, both agents and arms are described by feature vectors, and there is a known kernel encoding the similarity between the mean rewards of each (agent, arm) pair. This independent work follows a similar approach to ours and also proposes a near-optimal phased elimination algorithm inspired by a lower bound, but the models and related lower bounds are significantly different.

In collaborative bandits, the need for a small communication cost indeed makes algorithms based on phased elimination appealing. In such algorithms, agent(s) maintain a set of active arms that are candidates for being optimal, and potentially eliminate arms from this set at the end of each sampling phase. Adaptivity to the observed rewards (and, in our case, communication) is only needed between sampling phases, which are typically long. This type of structure has been used in various bandit settings, both for regret minimization and pure exploration objectives [2, 18, 5, 13, 30, 3]. In some of these algorithms, including the one in [30] which motivates this paper, the number of samples gathered from an arm which is active in some phase $r$ is fixed in advance. We believe that going beyond such a deterministic sampling scheme is crucial to achieve optimal performance with phased algorithms. In order to achieve (near-)optimality, other phased algorithms rely on computing an oracle allocation from the optimization problem associated with a lower bound on the sample complexity [13, 10]. In these works, based on this allocation, a total number of samples to collect in the current phase is computed, which depends on the identity of the surviving arms. The distribution of samples across arms for the current round is proportional to the oracle allocation, and is obtained through a rounding procedure. Compared to these works, our algorithm will show three distinctive features: an allocation inspired by a relaxation of the lower bound; an allocation that does not only depend on the identity of the surviving arms; and an alternative to the rounding procedure.

Contributions. The authors of [30] exhibit an algorithm using phased elimination for federated bandit learning with personalization, and prove a logarithmic regret bound whose dependency on the parameters of the problem is conjectured to be sub-optimal. They also propose a heuristic improvement, based on a more adaptive exploration within each phase. In this work, we take a step further in identifying the problem-dependent complexity of bandit learning in our novel weighted collaborative bandit model, which includes the setting of [30] as a special case, both from the pure exploration and the regret perspective. We propose new information-theoretic lower bounds, and a recipe to design algorithms (nearly) matching those for pure exploration. Our main algorithmic contribution is for weighted collaborative best arm identification. Our phased elimination-based algorithm achieves minimal exploration cost up to some logarithmic multiplicative factors, while using a constant amount of communication rounds. It relies on a novel data-dependent sampling scheme, which renders its analysis trickier. The structure of the algorithm can easily be extended to Top-N identification, that is, where each agent has to identify her own $N$ best arms (instead of her best arm, for $N = 1$) with respect to mixed rewards. We further compare our novel regret lower bound to what was conjectured in [30] for the particular case of federated learning with personalization.

Notation. For both best arm identification and regret minimization, our complexity terms feature the (mixed) gaps of each agent $m$ and arm $k$, defined by
$$\Delta_{k,m} = \begin{cases} \tilde{\mu}_{k^\star_m,m} - \tilde{\mu}_{k,m} & \text{if } k \neq k^\star_m, \\ \min_{k' \neq k^\star_m} \Delta_{k',m} & \text{otherwise.} \end{cases}$$
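These gaps can be transcribed directly from the mixed means; a small sketch (the helper name is ours, and the mixed means are computed as in the model section above):

```python
import numpy as np

def mixed_gaps(mu, W):
    """Gaps Delta_{k,m}: gap to the best mixed mean for k != k*_m,
    and the smallest such gap for k = k*_m."""
    mu_mixed = mu @ W
    K, M = mu_mixed.shape
    k_star = mu_mixed.argmax(axis=0)
    gaps = mu_mixed[k_star, np.arange(M)][None, :] - mu_mixed
    for m in range(M):
        gaps[k_star[m], m] = np.delete(gaps[:, m], k_star[m]).min()
    return gaps
```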
2 Collaborative Best-Arm Identification

The goal of collaborative best-arm identification is that each agent $m$ identifies her optimal arm $k^\star_m$ by sampling the arms as little as possible and with few communication rounds. Formally, a collaborative Best Arm Identification (BAI) algorithm consists of a sampling rule $\pi_m$ for each agent $m$ such that, at time $t$, either arm $\pi_m(t) \in [K]$ is sampled by $m$, or $\pi_m(t) = 0$; in that case, instead of picking an arm at time $t$, we allow the agent to remain idle and not select an arm (note that, in a regret setting, an idle agent may exploit its empirical best arm). Similarly to the communication model studied in [32, 30], communication only happens at the end of local sampling rounds for all agents, when all agents are idle at the same time.

Besides the sampling rule, the BAI algorithm uses a stopping rule $\tau$ which determines when exploration is over for all agents. The end of exploration is decided by the central server. Then, at time $\tau$, each agent outputs a guess for her optimal arm with respect to mixed rewards, denoted by $\hat{k}_m$. Our goal is to construct a $\delta$-correct strategy $\mathcal{A} = (\pi, \tau, \hat{k})$, which satisfies, for any model $\mu \in \mathbb{R}^{K \times M}$,
$$\mathbb{P}^{\mathcal{A}}_{\mu}\left(\forall m \in [M],\ \hat{k}_m = k^\star_m\right) \geq 1 - \delta,$$
while achieving a small exploration cost (e.g., with high probability or in expectation)
$$\mathrm{Exp}_{\mu}(\mathcal{A}) = \sum_{m=1}^{M}\sum_{k=1}^{K} N_{k,m}(\tau), \quad \text{where } N_{k,m}(t) = \sum_{s=1}^{t}\mathbb{1}(\pi_m(s) = k)$$
is the number of selections of arm $k$ by agent $m$ up to time $t$, and a small communication cost, defined as $\mathrm{Com}_{\mu}(\mathcal{A}) = \sum_{t=1}^{\tau}\mathbb{1}(I_t)$, where $I_t$ is the event that some information is shared between agents at round $t$. In our setting, we put constraints neither on the type of information that is exchanged in each communication round (which can be interesting when considering privacy issues [11, 36]) nor on the lengths of the messages. Each communication round has a unit cost. In a communication round, all agents send messages to the central server (e.g., estimates of their local means) and the server can send back arbitrary quantities or instructions (e.g., how many times each arm should be sampled in the next exploration phase, and when to communicate next). Moreover, contrary to the works of [18, 32] on collaborative learning, we do not look at strategies explicitly minimizing the number of communication rounds. Instead, our approach consists in proving a lower bound on the smallest possible exploration cost of a $\delta$-correct algorithm which would communicate at every round, and then finding an algorithm whose exploration cost matches this lower bound, while suffering a reasonable communication cost.

3 Lower Bound

We prove the following lower bound on the exploration cost of an algorithm in which all agents communicate their latest observation to the central server as soon as it is received. It holds for Gaussian rewards with variance $\sigma^2 = 1$, meaning that the reward from an arm $k$ observed at time $t$ by agent $m$ is $X_{k,m}(t) = \mu_{k,m} + \varepsilon_t$, where $\varepsilon_t \sim \mathcal{N}(0,1)$. We further assume that the weight matrix $W$ satisfies $w_{m,m} > 0$ for any agent $m \in [M]$.

Theorem 1. Let $\mu$ be a fixed matrix of means in $\mathbb{R}^{K \times M}$. For any $\delta \in (0, 1/2]$, let $\mathcal{A}$ be a $\delta$-correct algorithm under which each agent communicates each reward to the central server after it is observed, and let us denote, for any $k \in [K]$, $m \in [M]$, $\tau_{k,m} = \mathbb{E}^{\mathcal{A}}_{\mu}[N_{k,m}(\tau)]$, where $\tau$ is the stopping time. For any $m \in [M]$ and $k \neq k^\star_m$, it holds that
$$\sum_{n} w^2_{n,m}\left(\frac{1}{\tau_{k,n}} + \frac{1}{\tau_{k^\star_m,n}}\right) \leq \frac{(\Delta_{k,m})^2}{2\log(1/(2.4\delta))},$$
and therefore $\mathbb{E}_{\mu}[\mathrm{Exp}_{\mu}(\mathcal{A})] \geq T^\star_W(\mu)\log\left(\frac{1}{2.4\delta}\right)$, where
$$T^\star_W(\mu) = \min_{t \in (\mathbb{R}^+)^{K \times M}} \sum_{(k,m) \in [K]\times[M]} t_{k,m} \quad \text{s.t.} \quad \forall m,\ \forall k \neq k^\star_m, \quad \sum_{n \in [M]} w^2_{n,m}\left(\frac{1}{t_{k,n}} + \frac{1}{t_{k^\star_m,n}}\right) \leq \frac{(\Delta_{k,m})^2}{2}.$$

The proof, given in Appendix A.1, uses standard change-of-distribution arguments, together with classical results from constrained optimization. Note that, for $M = 1$, we recover the complexity of best arm identification in a Gaussian bandit model [15].

Computing the complexity term. The optimization problem which defines $T^\star_W(\mu)$ belongs to the family of disciplined convex optimization problems, and can be numerically solved using available solvers, such as CVXPY [1, 9]. We now illustrate, on a small example, the possible reduction in exploration cost that can be obtained by solving weighted collaborative best arm identification instead of $M$ parallel best arm identification problems. We consider $K = M = 2$ and a similarity $S(1,2) = 0.9$ between the two agents, which yields the normalized weight matrix $\frac{1}{1.9}\begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$. Considering the matrix of expected rewards $\mu = \begin{pmatrix} 0.9 & 0.8 \\ 0.1 & 0.5 \end{pmatrix}$, for which arm 1 is the local best arm for both agents and is also their best mixed arm, we obtain $T^\star_W(\mu) \approx 28$. However, if each agent solves its own best arm identification problem in isolation (which amounts to using $W = \mathrm{Id}_2$), the resulting exploration cost scales with $T^\star_{\mathrm{Id}_2}(\mu) \approx 101 > 3\,T^\star_W(\mu)$.
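As an illustration of how this complexity term can be evaluated numerically, here is a minimal CVXPY sketch of the optimization problem defining $T^\star_W(\mu)$ (function and variable names are ours, not taken from the paper's code); on the two-agent example above it returns values close to the reported 28 and 101.

```python
import numpy as np
import cvxpy as cp

def T_star(mu, W):
    """Value of T*_W(mu) from Theorem 1 (sketch). mu: (K, M) local means;
    W: (M, M) weight matrix whose column m sums to 1."""
    K, M = mu.shape
    mu_mixed = mu @ W                               # mixed means
    k_star = mu_mixed.argmax(axis=0)                # best mixed arm of each agent
    gaps = mu_mixed[k_star, np.arange(M)][None, :] - mu_mixed
    t = cp.Variable((K, M), pos=True)
    constraints = []
    for m in range(M):
        for k in range(K):
            if k == k_star[m]:
                continue
            lhs = sum(W[n, m] ** 2 * (cp.inv_pos(t[k, n]) + cp.inv_pos(t[k_star[m], n]))
                      for n in range(M))
            constraints.append(lhs <= gaps[k, m] ** 2 / 2)
    problem = cp.Problem(cp.Minimize(cp.sum(t)), constraints)
    problem.solve()
    return problem.value

W = np.array([[1.0, 0.9], [0.9, 1.0]]) / 1.9
mu = np.array([[0.9, 0.8], [0.1, 0.5]])
print(T_star(mu, W))            # approximately 28
print(T_star(mu, np.eye(2)))    # approximately 101
```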
From lower bounds to algorithms. In single-agent pure exploration tasks, lower bounds are usually guidelines to design optimal algorithms, as they allow us to recover an oracle allocation (i.e., the $\arg\min$ over $t \in (\mathbb{R}^+)^{K \times M}$ in the definition of $T^\star_W(\mu)$) which algorithms can try to achieve by using some tracking [15, 10, 29]. Yet, these approaches may be computationally expensive, as they solve the optimization problem featured in the lower bound in every round. In the next section, we will propose an alternative approach for our collaborative setting, which exploits the knowledge of the lower bound within a phased elimination algorithm. This is crucial to maintain a small communication cost, and it also permits to reduce the computational complexity compared to a pure tracking approach. Our algorithm will rely on a relaxed complexity term $\tilde{T}_W(\mu)$, which is within constant factors of $T^\star_W(\mu)$, as proved in Appendix C.1.

Lemma 1. Introducing the quantity
$$\tilde{T}_W(\mu) = \min_{t \in (\mathbb{R}^+)^{K \times M}} \sum_{(k,m)\in[K]\times[M]} t_{k,m} \quad \text{s.t.} \quad \forall m,\ \forall k \in [K], \quad \sum_{n \in [M]} \frac{w^2_{n,m}}{t_{k,n}} \leq \frac{(\Delta_{k,m})^2}{2},$$
it holds that $\tilde{T}_W(\mu) \leq T^\star_W(\mu) \leq 2\,\tilde{T}_W(\mu)$.

Compared to $T^\star_W(\mu)$, a nice feature of $\tilde{T}_W(\mu)$ is that its constraint set does not depend on the knowledge of $(k^\star_m)_{m \in [M]}$, which will allow us to design algorithms that do not suffer too much from bad empirical guesses for $k^\star_m$ in early phases. We further remark that the computation of $\tilde{T}_W(\mu)$ and that of its associated oracle allocation (see Definition 1) are slightly easier than for $T^\star_W(\mu)$. Indeed, the optimization problem which defines $\tilde{T}_W(\mu)$ can be decoupled across arms. Computing for every arm $k \in [K]$ the vector
$$\tau^k = \arg\min_{\tau^k \in (\mathbb{R}^+)^M} \sum_{m \in [M]} \tau^k_m \quad \text{s.t.} \quad \forall m, \quad \sum_{n \in [M]} \frac{w^2_{n,m}}{\tau^k_n} \leq \frac{(\Delta_{k,m})^2}{2},$$
we obtain the $\arg\min$ in $\tilde{T}_W(\mu)$ by setting $(t_{k,m})_{k,m} = (\tau^k_m)_{k,m}$. The computation of $\tilde{T}_W(\mu)$ can therefore be done by solving $K$ disciplined convex optimization problems (e.g., with CVXPY [1, 9]) involving $M$ variables each, instead of one optimization problem with $K \times M$ variables.

Definition 1. For any $\Delta \in (\mathbb{R}^+)^{K \times M}$, the oracle $P(\Delta)$ is
$$\arg\min_{(\tau_{k,m})_{k,m} \in (\mathbb{R}^+)^{K \times M}} \sum_{k,m} \tau_{k,m} \quad \text{s.t.} \quad \forall m \in [M],\ \forall k \in [K], \quad \sum_{n \in [M]} \frac{w^2_{n,m}}{\tau_{k,n}} \leq \frac{(\Delta_{k,m})^2}{2}.$$
With this notation, observe that $\tilde{T}_W(\mu) = \sum_{k,m} \tau_{k,m}$, where $(\tau_{k,m})_{k,m} \in P(\Delta)$ for $\Delta = (\Delta_{k,m})_{k,m}$ the matrix of gaps. The following lemma will be useful to compare values from different oracle problems. The full lemma, along with its proof, is available in Appendix C.1.

Lemma 2. Consider $\Delta, \Delta' \in (\mathbb{R}^+)^{K \times M}$ such that $\tau \in P(\Delta)$ and $\tau' \in P(\Delta')$. Moreover, assume that there is a positive constant $\beta$ such that, for all $k, m$, $\Delta'_{k,m} \geq \beta\,\Delta_{k,m}$. Then $\frac{1}{\beta^2}\sum_{k,m}\tau_{k,m} \geq \sum_{k,m}\tau'_{k,m}$.
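The decoupled computation described above is straightforward to write down; here is a sketch (with the constraint normalization of Lemma 1, and names that are ours), solving one small problem per arm.

```python
import numpy as np
import cvxpy as cp

def relaxed_oracle(delta, W):
    """Oracle allocation P(Delta) of Definition 1, computed arm by arm.
    delta: (K, M) matrix of (proxy) gaps; W: (M, M) weight matrix."""
    K, M = delta.shape
    tau = np.zeros((K, M))
    for k in range(K):                       # one small problem (M variables) per arm
        tk = cp.Variable(M, pos=True)
        constraints = [sum(W[n, m] ** 2 * cp.inv_pos(tk[n]) for n in range(M))
                       <= delta[k, m] ** 2 / 2 for m in range(M)]
        cp.Problem(cp.Minimize(cp.sum(tk)), constraints).solve()
        tau[k] = tk.value
    return tau
```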
4 A Near-Optimal Algorithm For Best Arm Identification

We now introduce an algorithm for collaborative best-arm identification, called W-CPE-BAI for Weighted Collaborative Phased Elimination, stated as Algorithm 1. To present an analysis of this algorithm, we assume that, for any $(k,m) \in [K]\times[M]$, $\mu_{k,m} \in [0,1]$, and that the local rewards $(X_{k,m}(t))_{k,m,t}$ are 1-sub-Gaussian. W-CPE-BAI proceeds in phases, indexed by $r$. In phase $r$, we let $B_m(r)$ be the set of active arms for agent $m$, and $B(r) = \bigcup_m B_m(r)$ be the set of arms that are active for at least one agent.

The algorithm maintains proxies for the gaps, $(\Delta_{k,m}(r))_{k \in [K], m \in [M]}$, which are halved at the end of each phase for arms that remain active. At the beginning of each phase, the oracle allocation $t(r)$, with respect to the proxy gaps, is computed, as well as the number of new samples $d_{k,m}(r)$ that agent $m$ should get from arm $k$ in phase $r$. $d_{k,m}(r)$ is defined such that the total number of selections of arm $k$ by agent $m$ becomes close to (a quantity slightly larger than) $t_{k,m}(r)\log(1/\delta)$. We observe that any arm $k \notin B(r)$ will not get any new samples in phase $r$, as the proxy gaps $(\Delta_{k,n}(r))_n$ are identical to those in the previous phase; therefore $t_{k,n}(r) = t_{k,n}(r-1)$. In contrast to prior works, where the allocation in each round only depends on the identity of the surviving arms and the round index [13, 10], in W-CPE-BAI it also depends on when the arms have been eliminated (which conditions the value of their frozen proxy gaps). After each agent $m$ has sampled arm $k$ $d_{k,m}(r)$ times, the agents send their local means $\hat{\mu}_{k,m}(r)$ to the central server, which computes the mixed mean estimates $\hat{\tilde{\mu}}_{k,n}(r) = \sum_{m=1}^{M} w_{m,n}\hat{\mu}_{k,m}(r)$. The active sets $(B_m(r))_{m \in [M]}$ of all agents are then updated by removing arms whose mixed means are too small. As in several prior works [22, 30], we rely on confidence intervals to perform these eliminations. However, constructing confidence intervals on the mixed means, which are linear combinations of the local means, under our adaptive sampling rule is more challenging than when the number of samples from an active arm in phase $r$ is fixed in advance, which is the case for instance in the algorithm in [30]. The width of our confidence intervals scales with the following quantity.

Definition 2. For any $k, m$ and round $r \geq 0$, we define
$$\Omega_{k,m}(r) = \sqrt{\beta_{\delta}(n_{k,\cdot}(r))\sum_{n=1}^{M}\frac{w^2_{n,m}}{n_{k,n}(r)}},$$
where $n_{k,m}(r)$ is the number of times arm $k$ was selected by agent $m$ by the end of phase $r$ (included), and $N \mapsto \beta_{\delta}(N)$ is a threshold function defined for any $N \in (\mathbb{R}^+)^M$.

Leveraging some recent time-uniform concentration inequalities [23], we exhibit below a choice of threshold that yields valid confidence intervals on the mixed means (by "projecting" confidence intervals that can be obtained on the local means, see Proposition 24 in [23]). The fact that the confidence interval depends on the random number of past draws (and not just on the index of the round) leads to some non-trivial complications in the analysis, with the introduction of the quantities $(d_{k,m})_{k,m}$. The proof is given in Appendix C.2, where we also provide an explicit expression of the function $g_M$.
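For concreteness, the confidence width of Definition 2 can be evaluated as follows once a threshold is chosen. The sketch below uses, as a stand-in for the non-explicit function $g_M$ of [23], the approximation $g_M(x) \approx \log(1/x) + M\log\log(1/x)$ stated in Lemma 3 below; this substitution is an assumption made for illustration only, and all names are ours.

```python
import numpy as np

def g_M_approx(x, M):
    """Stand-in for the non-explicit g_M of [23]: log(1/x) + M log log(1/x)."""
    return np.log(1 / x) + M * np.log(np.log(1 / x))

def beta_delta(counts, delta, K):
    """Threshold beta_delta(N) of Lemma 3, with g_M replaced by its approximation.
    counts: vector N of per-agent selection counts of one arm."""
    M = len(counts)
    return 2 * (g_M_approx(delta / (K * M), M)
                + 2 * np.sum(np.log(4 + np.log(counts))))

def omega(counts_k, w_col, delta, K):
    """Width Omega_{k,m}(r) of Definition 2 for one arm k and one agent m.
    counts_k: n_{k,.}(r) over all agents; w_col: column (w_{n,m})_n of W."""
    return np.sqrt(beta_delta(counts_k, delta, K) * np.sum(w_col ** 2 / counts_k))
```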
Algorithm 1: Weighted Collaborative Phased Elimination for Best Arm Identification (W-CPE-BAI)

Input: $\delta \in (0,1)$, $M$ agents, $K$ arms, weight matrix $W$
Initialize $r \leftarrow 0$; for all $k, m$: $\Delta_{k,m}(0) \leftarrow 1$, $n_{k,m}(0) \leftarrow 1$; for all $m$: $B_m(0) \leftarrow [K]$
Draw each arm $k$ by each agent $m$ once
repeat
  # Central server
  $B(r) \leftarrow \bigcup_{m \in [M]} B_m(r)$
  Compute $t(r) \leftarrow P\big((\Delta_{k,m}(r))_{k,m}\big)$
  For any $k \in [K]$, compute $(d_{k,m}(r))_{m \in [M]} \leftarrow \arg\min_{d \in \mathbb{N}^M} \sum_m d_m$ s.t. $\forall m \in [M]$, $n_{k,m}(r-1) + d_m \geq \beta_{\delta}(n_{k,\cdot}(r-1) + d)\, t_{k,m}(r)$
  Send to each agent $m$ the values $(d_{k,m}(r))_{k \in [K]}$ and $d_{\max} = \max_{n \in [M]} \sum_{k \in [K]} d_{k,n}(r)$
  # Agent m
  Sample each arm $k \in B(r)$ $d_{k,m}(r)$ times, so that $n_{k,m}(r) = n_{k,m}(r-1) + d_{k,m}(r)$
  Remain idle for $d_{\max} - \sum_{k \in [K]} d_{k,m}(r)$ rounds
  Send to the server the empirical means $\hat{\mu}_{k,m}(r) = \sum_{s \leq n_{k,m}(r)} X_{k,m}(s)/n_{k,m}(r)$ for any $k \in [K]$
  # Central server
  Compute the empirical mixed means $(\hat{\tilde{\mu}}_{k,m}(r))_{k,m}$ based on $(\hat{\mu}_{k,m}(r))_{k,m}$ and $W$
  // Update the set of candidate best arms for each agent
  for $m = 1$ to $M$ do
    $B_m(r+1) \leftarrow \{k \in B_m(r) : \hat{\tilde{\mu}}_{k,m}(r) + \Omega_{k,m}(r) \geq \max_{j \in B_m(r)}(\hat{\tilde{\mu}}_{j,m}(r) - \Omega_{j,m}(r))\}$
  end for
  // Update the gap proxies
  For all $k, m$: $\Delta_{k,m}(r+1) \leftarrow \Delta_{k,m}(r)\,(1/2)^{\mathbb{1}(k \in B_m(r+1),\ |B_m(r+1)| > 1)}$
  $r \leftarrow r+1$
until $\forall m \in [M]$, $|B_m(r)| = 1$
Output: $\{k \in B_m(r),\ m \in [M]\}$

Lemma 3. Let us define $\beta_{\delta}(N) = 2\big(g_M\big(\frac{\delta}{KM}\big) + 2\sum_{m=1}^{M}\ln(4 + \ln(N_m))\big)$ for any $N \in (\mathbb{N}^\star)^M$, where $g_M$ is some non-explicit function, defined in [23], that satisfies $g_M(\delta) \approx \log\left(\frac{1}{\delta}\right) + M\log\log\left(\frac{1}{\delta}\right)$. Then the good event
$$\mathcal{E} = \left\{\forall r \in \mathbb{N},\ \forall m, k,\ \big|\hat{\tilde{\mu}}_{k,m}(r) - \tilde{\mu}_{k,m}\big| \leq \Omega_{k,m}(r)\right\}$$
holds with probability larger than $1-\delta$.

From this lemma, it easily follows that W-CPE-BAI is $\delta$-correct for the above choice of threshold function, as, for any agent $m$, the good arm $k^\star_m$ can never be eliminated from $B_m(r)$ at round $r$. The fact that the sample complexity of W-CPE-BAI scales with $\tilde{T}_W(\mu)$ on the good event $\mathcal{E}$ comes from the interplay between the expression of $\Omega_{k,m}(r)$ (which, up to the threshold function, is exactly one of the constraints featured in the lower bound) and the definition of the allocation $t(r)$, which leads to the following crucial result in our analysis.

Lemma 4. On $\mathcal{E}$, for all $k, m$ and $r \geq 0$, $\Omega_{k,m}(r) \leq \Delta_{k,m}(r)$.

Proof. For any round $r$ and arm $k$, by Algorithm 1 and the definition of the oracle allocation $t_{k,\cdot}(r)$, for any agent $m$,
$$\sqrt{\sum_n w^2_{n,m}\frac{\beta_{\delta}(n_{k,\cdot}(r-1) + d_{k,\cdot}(r))}{n_{k,n}(r-1) + d_{k,n}(r)}} \leq \begin{cases} \sqrt{\sum_n \frac{w^2_{n,m}}{t_{k,n}(r)}} \leq \Delta_{k,m}(r) & \text{if } k \in B(r), \\ \sqrt{\sum_n \frac{w^2_{n,m}}{t_{k,n}(r_k)}} \leq \Delta_{k,m}(r_k) = \Delta_{k,m}(r) & \text{otherwise,} \end{cases}$$
where, when $k \notin B(r)$, $r_k = \sup\{r' \geq 0 : k \in B(r')\}$, and we use the fact that $d_{k,m}(r) = 0$ when $k \notin B(r)$.
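The quantities $(d_{k,m}(r))_m$ computed by the central server in Algorithm 1 solve a small integer program in which the threshold itself depends on the solution. Since $\beta_{\delta}$ is nondecreasing in the counts, one simple way to obtain the componentwise-smallest feasible $d$ is a monotone fixed-point iteration; the sketch below is an illustration under that observation, not necessarily the paper's implementation.

```python
import numpy as np

def compute_d(n_prev, t_target, beta):
    """Smallest integer vector d >= 0 such that, for every agent m,
    n_prev[m] + d[m] >= beta(n_prev + d) * t_target[m].
    beta: callable returning the threshold beta_delta(.) for a count vector."""
    d = np.zeros_like(n_prev)
    while True:
        need = np.ceil(beta(n_prev + d) * t_target - n_prev).astype(int)
        new_d = np.maximum(0, need)
        if np.array_equal(new_d, d):     # fixed point: d is feasible and minimal
            return d
        d = new_d
```

In Algorithm 1, this would be called once per arm $k$, with `n_prev` $= n_{k,\cdot}(r-1)$, `t_target` $= t_{k,\cdot}(r)$, and the threshold of Lemma 3 as `beta`.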
We did not put much emphasis on the way communications are performed between the agents and the central server, as several choices are possible. The important part is that the server receives all values of the local means $(\hat{\mu}_{k,m}(r))_{k,m}$ at the end of phase $r$. Our suggestion is that the central server maintains the sets $(B_m(r))_{m \in [M]}$, calls the oracle, and sends to all agents their values of $(d_{k,m}(r+1))_{k,m}$ at the end of each phase $r$. In any case, the number of communication rounds in our definition will be equal to the number of phases used by W-CPE-BAI. All in all, we prove the following.

Theorem 2. With probability $1-\delta$, W-CPE-BAI outputs the optimal arm for each agent with an exploration cost at most
$$32\,\tilde{T}_W(\mu)\log_2(8/\Delta_{\min})\log\left(\frac{1}{\delta}\right) + o_{\delta}\left(\log\left(\frac{1}{\delta}\right)\right)$$
and at most $\log_2(8/\Delta_{\min})$ communication rounds, where $\Delta_{\min} = \min_{k \in [K], m \in [M]} \Delta_{k,m}$.

Proof sketch. The detailed proof is given in Appendix B, where we also provide an explicit upper bound on the exploration cost. We let $R$ denote the (random) number of phases used by the algorithm before stopping. On the good event $\mathcal{E}$, we can prove that the algorithm never eliminates $k^\star_m$, therefore $R = \max_m \max_{k \neq k^\star_m} R_{k,m}$, where $R_{k,m}$ is the last phase in which $k \in B_m(r)$. Using Lemma 4, we can easily establish that $R_{k,m} \leq r_{k,m} = \min\{r \geq 0 : 4 \cdot 2^{-r} < \Delta_{k,m}\}$, which satisfies $r_{k,m} \leq \log_2(8/\Delta_{k,m})$. This yields $R \leq \log_2(8/\Delta_{\min})$, and further permits to prove that the proxy gaps can be lower bounded by a constant multiple of the true gaps, for all $r \leq R$, $k \in [K]$, $m \in [M]$ (see Corollary 2 in Appendix B). Using the monotonicity properties of the oracle that are stated in Lemma 2, we can then establish that the allocation $t(r)$ computed from the proxy gaps in the algorithm satisfies
$$\forall r \leq R, \quad \sum_{k \in [K], m \in [M]} t_{k,m}(r) \leq 32\,\tilde{T}_W(\mu). \quad (1)$$
To upper bound the exploration cost, the next step is to relate $n_{k,m}(R)$ to the oracle allocations. To do so, we observe that if $R'_{k,m}$ is the last round before $R$ such that $d_{k,m}(r) \neq 0$ (i.e., the last round in which arm $k$ is actually sampled by agent $m$; then $n_{k,m}(R) = n_{k,m}(R'_{k,m})$), we have by definition of the $(d_{k,m}(r))_{k,m,r}$ that
$$n_{k,m}(R'_{k,m}) \leq t_{k,m}(R'_{k,m})\,\beta_{\delta}(n_{k,\cdot}(R'_{k,m})) + 1 \leq t_{k,m}(R'_{k,m})\,\beta_{\delta}(n_{k,\cdot}(R)) + 1,$$
see Lemma 12 in Appendix B. We can then upper bound $\tau = \sum_{k,m} n_{k,m}(R)$ as follows:
$$\tau \leq \sum_{k,m} t_{k,m}(R'_{k,m})\,\beta_{\delta}(n_{k,\cdot}(R)) + KM \leq \sum_{k,m}\sum_{r \leq R} t_{k,m}(r)\,\bar{\beta}(\tau) + KM, \quad \text{and} \quad \tau \leq R \cdot 32\,\tilde{T}_W(\mu)\,\bar{\beta}(\tau) + KM, \quad (2)$$
where we use (1) and introduce $\bar{\beta}(\tau) = 2\big(g_M\big(\frac{\delta}{KM}\big) + 2M\ln(4 + \ln(\tau))\big)$. The end of the proof consists in using the known upper bound on $R$, and finding an upper bound for the largest $\tau$ satisfying the inequality in (2).

Discussion. Theorem 2 proves that W-CPE-BAI matches the exploration lower bound of Theorem 1 in the regime where $\delta$ is small, up to multiplicative constants, including a logarithmic term in $1/\Delta_{\min}$. It achieves this using only $\log_2(8/\Delta_{\min})$ communication rounds. We note that a similar extra multiplicative logarithmic factor is present in the analysis of near-optimal phased algorithms in other contexts [13, 10]. Such a quantity appears as an upper bound on the number of phases, and may be a price to pay for the phased structure.

On the communication cost. We argue that the communication cost of W-CPE-BAI is actually of the same order of magnitude as that featured in some related work. In [30], which is the closest setting to our framework, the equivalent number of communication rounds $p$ needed to solve the regret minimization problem is upper bounded by $O\big(2\log_2(8/(M\Delta_{\min}))\big)$. In the setting of collaborative learning where $M$ agents face the same set of arm distributions and $W = \mathrm{Id}$, [18] prove in their Theorem 4.1 that an improvement by a multiplicative factor $1/M$ on the exploration cost of a traditional best arm identification algorithm can be reached by using at most $\log_2(1/\Delta_{\min})$ communication rounds, where $\Delta_{\min}$ is the gap between the best and second best arms.

Experimental validation. We propose in Appendix E an empirical evaluation of W-CPE-BAI for the weight matrix $w_{m,n} = \alpha\mathbb{1}(n = m) + \frac{1-\alpha}{M}$, which corresponds to the setting studied by [30]. In this particular case, we propose as a baseline a counterpart of the regret algorithm of [30], which we call PF-UCB-BAI. Our experiments on a synthetic instance show that W-CPE-BAI and PF-UCB-BAI have similar performances in terms of exploration cost, and that W-CPE-BAI becomes better when the level of personalization $\alpha$ is smaller than 0.5. Moreover, W-CPE-BAI uses fewer rounds of communication than PF-UCB-BAI for all values of $\alpha$. Finally, the near-optimality of W-CPE-BAI is empirically observed when compared to an oracle algorithm which has access to the true gaps. We refer the reader to Appendix E for further details on the optimization libraries that were used.

Remark 1. The analysis and the structure of Algorithm 1 have the potential to be extended to other pure exploration problems, with similar guarantees. In Appendix D, we illustrate this claim by tackling Top-N identification.

5 Regret Lower Bound

In contrast to the BAI setting, [30] considered the objective of minimizing the regret
$$R_{\mu}(T) = \mathbb{E}\left[\sum_{m=1}^{M}\sum_{t=1}^{T}\left(\tilde{\mu}_{k^\star_m,m} - \tilde{\mu}_{\pi_m(t),m}\right)\right].$$
They provided a conjecture on the lower bound on the regret in personalized federated learning [see 30, Conjecture 1]. As mentioned in the introduction, their reward model is a special case of ours with weights $w_{m,m} = \alpha + \frac{1-\alpha}{M}$ and $w_{n,m} = \frac{1-\alpha}{M}$ for $n \neq m$. In this section, we prove a regret lower bound with general weights, which proves in particular Conjecture 1 in [30]. We prove the following result for an algorithm that selects in each round exactly one arm for each agent, where all agents communicate after each round. This lower bound holds for Gaussian arms with variance $\sigma^2 = 1$.

Theorem 3. Any uniformly efficient algorithm (i.e., any algorithm satisfying $R_{\mu}(T) = o(T^{\gamma})$ for any $\gamma \in (0,1)$ and any possible instance $\mu$) in which all agents communicate after each round satisfies
$$\liminf_{T \to \infty}\frac{R_{\mu}(T)}{\log(T)} \geq C_W(\mu), \quad \text{where} \quad C_W(\mu) = \min_{c = (c_{k,m})_{k,m : k^\star_m \neq k}}\ \sum_{k=1}^{K}\sum_{m : k^\star_m \neq k} c_{k,m}\Delta_{k,m}$$
$$\text{s.t.} \quad \forall k \in [K],\ \forall m \in [M], \quad \sum_{n : k^\star_n \neq k}\frac{w^2_{n,m}}{c_{k,n}} \leq (\Delta_{k,m})^2.$$
We recall that, for any agent $m$, $\Delta_{k^\star_m,m} = \min_{k \neq k^\star_m}\Delta_{k,m}$.

Theorem 3 may be viewed as an extension of the lower bound of [25] to the collaborative setting. Its proof, given in Appendix A.2, uses classical ingredients for regret lower bounds in (single-agent) structured bandit models [7, 17]. The generic approaches proposed in [7, 8] for optimal regret minimization in structured bandits could therefore be useful. However, to turn them into a reasonable algorithm for the collaborative setting, we would need to preserve the phased elimination structure. Analogously to the relaxation in the BAI setting, we can also define a relaxed complexity term $\tilde{C}_W(\mu)$, which does not require the knowledge of the best arms, and is within constant factors of $C_W(\mu)$ by the following Lemma 5 (whose proof is given in Appendix C.3).

Lemma 5. Introducing the quantity
$$\tilde{C}_W(\mu) = \min_{c \in (\mathbb{R}^+)^{K \times M}}\ \sum_{k=1}^{K}\sum_{m=1}^{M} c_{k,m}\Delta_{k,m} \quad \text{s.t.} \quad \forall k \in [K],\ \forall m \in [M], \quad \sum_{n=1}^{M}\frac{w^2_{n,m}}{c_{k,n}} \leq (\Delta_{k,m})^2,$$
it holds that $\tilde{C}_W(\mu) \leq C_W(\mu) \leq 4\,\tilde{C}_W(\mu)$.

The constrained optimization problems governing the regret lower bound are subtly different from those of the BAI setting. In regret minimization, the unknown gaps $(\Delta_{k,m})_{k,m}$ appear both in the constraints and in the objective. In contrast, in BAI, $(\Delta_{k,m})_{k,m}$ only appear in the constraints. Due to this difference, designing a similar algorithm for regret minimization leads to an extra multiplicative $O(1/\Delta_{\min})$ factor in the upper bound. A very similar issue exists in a different structured bandit problem, bandits on graphs with side observations, where arms are connected through a graph with unweighted edges, and the agent receives the rewards associated with the selected arm and its neighbors at a given round. In [4, Problem P1], the authors describe a constrained linear optimization problem which governs the regret lower bound, and face the same issue of a scaling in $O(1/\Delta_{\min})$. Dealing with this problem would be an interesting subsequent work.
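As with the best arm identification complexity, the regret complexity term can be evaluated numerically with a disciplined convex solver; here is a sketch of the problem defining $C_W(\mu)$ as reconstructed in Theorem 3 above (names and code structure are ours, and the unused variables $c_{k^\star_m,m}$ are simply ignored in the objective).

```python
import numpy as np
import cvxpy as cp

def regret_complexity(mu, W):
    """Sketch of the optimization problem defining C_W(mu) in Theorem 3."""
    K, M = mu.shape
    mu_mixed = mu @ W
    k_star = mu_mixed.argmax(axis=0)
    gaps = mu_mixed[k_star, np.arange(M)][None, :] - mu_mixed
    for m in range(M):                               # Delta_{k*_m,m} = min_{k != k*_m} Delta_{k,m}
        gaps[k_star[m], m] = np.delete(gaps[:, m], k_star[m]).min()
    c = cp.Variable((K, M), pos=True)
    objective = sum(c[k, m] * gaps[k, m]
                    for k in range(K) for m in range(M) if k != k_star[m])
    constraints = []
    for k in range(K):
        contributors = [n for n in range(M) if k_star[n] != k]
        if not contributors:                         # arm k is optimal for every agent
            continue
        for m in range(M):
            lhs = sum(W[n, m] ** 2 * cp.inv_pos(c[k, n]) for n in contributors)
            constraints.append(lhs <= gaps[k, m] ** 2)
    problem = cp.Problem(cp.Minimize(objective), constraints)
    problem.solve()
    return problem.value
```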
6 Conclusion

This paper introduced a general framework for collaborative learning in multi-armed bandits. Our work presents two novel lower bounds: one on the exploration cost for pure exploration, and another on the cumulative regret. The latter allows us to prove a prior conjecture on the topic. Moreover, we propose a phased elimination algorithm for fixed-confidence best arm identification. This algorithm tracks the optimal allocation from the pure exploration lower bound through a novel approach, solving a relaxed optimization problem linked to the lower bound. The exploration cost of this algorithm matches the lower bound up to logarithmic factors. This strategy can be extended to other pure exploration problems, such as Top-N identification, as shown in Appendix D.

As mentioned in the introduction, our collaborative setting was motivated by the design of collaborative adaptive clinical trials for personalized drug recommendations, where several patient sub-populations (for instance, representing several subtypes of cancer) are considered and sequentially treated. However, in practice, especially when dealing with patient data, disclosing the mean response values to the central server should be handled with care to preserve the anonymity of the patients. A possible solution to overcome this problem would be to carefully combine our algorithm with a data privacy-preserving method, for instance by adding some noise to the data [11]. Our code and run traces are available in an open-source repository: https://github.com/clreda/near-optimal-federated.

Acknowledgments and Disclosure of Funding

Clémence Réda was supported by the French Ministry of Higher Education and Research [ENS.X19RDTME-SACLAY19-22]. The authors acknowledge the support of the French National Research Agency under the project [ANR-19-CE23-0026-04] (BOLD).

References

[1] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42-60, 2018.
[2] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.
[3] Etienne Boursier, Emilie Kaufmann, Abbas Mehrabian, and Vianney Perchet. A practical algorithm for multiplayer bandits when arm means vary among players. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[4] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, pages 289-300, 2014.
[5] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101-110. PMLR, 2017.
[6] Yu-Zhen Janice Chen, Stephen Pasteris, Mohammad Hajiesmaili, John Lui, Don Towsley, et al. Cooperative stochastic bandits with asynchronous agents and constrained feedback. Advances in Neural Information Processing Systems, 34, 2021.
[7] R. Combes, S. Magureanu, and A. Proutière. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[8] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443-2452. PMLR, 2020.
[9] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
[10] Yihan Du, Wei Chen, Yuko Yuroki, and Longbo Huang. Collaborative pure exploration in kernel bandit. arXiv preprint arXiv:2110.15771, 2021.
[11] Abhimanyu Dubey and Alex 'Sandy' Pentland. Differentially-private federated linear bandits. In Advances in Neural Information Processing Systems, volume 33, pages 6003-6014, 2020.
[12] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.
[13] Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems, 32:10667-10677, 2019.
[14] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In NIPS - Twenty-Sixth Annual Conference on Neural Information Processing Systems, 2012.
[15] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998-1027. PMLR, 2016.
[16] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 2018.
[17] T.L. Graves and T.L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715-743, 1997.
[18] Eshcar Hillel, Zohar Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, pages 854-862, 2013.
[19] Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34, 2021.
[20] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory, 2014.
[21] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160-171. IEEE, 2020.
[22] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228-251. PMLR, 2013.
[23] Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22(246):1-44, 2021.
[24] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865-891. PMLR, 2021.
[25] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
[26] Aritra Mitra, Hamed Hassani, and George Pappas. Exploiting heterogeneity in robust federated best-arm identification. CoRR, abs/2109.05700, 2021.
[27] MOSEK ApS. MOSEK Optimizer API for Python 9.3.13, 2022.
[28] Clémence Réda, Andrea Tirinzoni, and Rémy Degenne. Dealing with misspecification in fixed-confidence linear top-m identification. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[29] Yoan Russac, Christina Katsimerou, Dennis Bohle, Olivier Cappé, Aurélien Garivier, and Wouter M. Koolen. A/B/n testing with control in the presence of subpopulations. Advances in Neural Information Processing Systems, 34, 2021.
[30] Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917-2925. PMLR, 2021.
[31] Chengshuai Shi, Haifeng Xu, Wei Xiong, and Cong Shen. (Almost) free incentivized exploration from decentralized learning agents. Advances in Neural Information Processing Systems, 34, 2021.
[32] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126-146. IEEE, 2019.
[33] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261-272, 2020.
[34] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 529-538, 2016.
[35] Jingxuan Zhu, Ethan Mulle, Christopher Salomon Smith, and Ji Liu. Decentralized multi-armed bandit can outperform classic upper confidence bound. CoRR, abs/2111.10933, 2021.
[36] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, pages 3-4, 2021.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] A general framework which encompasses several prior works is described in Section 1; a lower bound on the sample complexity for a pure exploration problem is derived in Section 3, whereas a lower bound on the regret is shown for the regret minimization counterpart of the problem in Section 5. An algorithm for best arm identification, using a novel approach to achieve near-optimality, is described and analyzed in Section 4.
(b) Did you describe the limitations of your work? [Yes] We explained that the weight matrix has to be known in advance (Section 1), although there is an application case where it can be computed. Moreover, we explained in the conclusion of our work the issue faced when applying the same technique to regret minimization (Section 6).
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We remarked in our discussion (Section 6) that our collaborative algorithm might not be adapted to private, sensitive data, but might still be interesting for dealing with collaborative adaptive clinical trials between trusting teams.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] The assumptions are mentioned both in the statements of the results (Sections 3, 4 and 5) and in the proofs (Appendix).
(b) Did you include complete proofs of all theoretical results? [Yes] Proofs in full are included in the Appendix.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?
[Yes] The code and run traces are available in an open-source repository: https://github.com/clreda/near-optimal-federated.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] This information is reported in Appendix E.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Tables in Appendix E report the average values with their standard deviation.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] This information is provided in Appendix E (paragraph "Numerical considerations").
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]