# Best Arm Identification in Multi-Agent Multi-Armed Bandits

Filippo Vannella (1, 2), Alexandre Proutiere (1), Jaeseong Jeong (2)

(1) KTH Royal Institute of Technology, Stockholm, Sweden. (2) Ericsson, Stockholm, Sweden. Correspondence to: Filippo Vannella. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We investigate the problem of best arm identification in Multi-Agent Multi-Armed Bandits (MAMABs), where the rewards are defined through a factor graph. The objective is to find an optimal global action with a prescribed level of confidence and minimal sample complexity. We derive a tight instance-specific lower bound on the sample complexity and characterize the corresponding optimal sampling strategy. Unfortunately, this bound is obtained by solving a combinatorial optimization problem with a number of variables and constraints growing exponentially with the number of agents. We leverage Mean Field (MF) techniques to obtain, in a computationally efficient manner, an approximation of the lower bound. The approximation scales at most as ρK^d (where ρ, K, and d denote the number of factors in the graph, the number of possible actions per agent, and the maximal degree of the factor graph). We devise MF-TaS (Mean-Field Track-and-Stop), an algorithm whose sample complexity provably matches our approximated lower bound. We illustrate the performance of MF-TaS numerically using both synthetic and real-world experiments (e.g., to solve the antenna tilt optimization problem in radio communication networks).

1. Introduction

Best arm identification with fixed confidence in stochastic bandits (Lai & Robbins, 1985) refers to the problem of finding the arm with the highest expected reward with a prescribed level of certainty, while minimizing the number of samples or arm draws, i.e., the sample complexity. When the arm rewards are unrelated, the sample complexity typically needs to scale linearly with the number of arms, and the task quickly becomes intractable as this number grows large.

To reduce the sample complexity, the learner may leverage any underlying structure tying the expected rewards of the various arms together. Most efforts in this direction have focused on linear structures (Degenne et al., 2020; Fiez et al., 2019; Jedra & Proutiere, 2020; Karnin, 2016; Soare et al., 2014; Tao et al., 2018); see (Wang et al., 2021) and references therein. Exploiting the underlying structure is critical when the number of arms becomes extremely large, as in combinatorial bandit problems (Chen et al., 2014; Jourdan et al., 2021). To solve these problems, the learner faces a statistical efficiency issue (she has to control the sample complexity), but also needs to account for inherent computational limits (she will typically have to solve combinatorial optimization problems over the set of possible arms).

We investigate the best arm identification problem in stochastic Multi-Agent Multi-Armed Bandits (MAMABs), referred to as M-BAI for short. M-BAI is a particular instance of the best arm identification problem in combinatorial bandits, where (i) a global arm or action is defined by the actions individually selected by the various agents, and (ii) the reward function is defined through a factor graph. This reward structure arises naturally in networks where agents interact with their neighbors in the graph and need to coordinate toward a common goal.
The paper was actually motivated by the problem of learning to coordinate base-stations in radio communication networks (see 7.2). Contributions. For the M-BAI problem, we present a statistically and computationally efficient algorithm. The algorithm has provable performance guarantees (in the form of sample complexity upper bounds), and can be applied in large-scale MAMABs. More precisely, our contributions are as follows. 1) Sample complexity lower bound. We derive an instancespecific lower bound on the sample complexity satisfied by any algorithm. The bound is defined through an optimization problem, whose solution provides an optimal sampling strategy. Unfortunately, because of the factored reward structure, this optimization problem contains an exponential number of variables and constraints, and is hence difficult to exploit in practice. 2) Mean-Field-based approximation of the lower bound optimization problem. We propose a tight approximation of the lower bound optimization problem obtained by com- Best Arm Identification in Multi-Agent Multi-Armed Bandits bining a Mean Field (MF) technique, to reduce the number of variables, and Factored Constraint Reduction (FCR), a procedure inspired by methods in probabilistic graphical models, to reduce the number of constraints. We show that the resulting optimization problem is equivalent to a convex program that can be solved efficiently. The resulting approximated lower bound scales at most as ρKd, where ρ is the number of factors (or groups), K is the number of actions per agent and d is the maximal degree of the factor graph. This scaling illustrates the gains one may achieve by exploiting the factor graph structure (without leveraging the structure, the sample complexity would well scale as KN, where N is the number of agents). 3) The MF-Ta S algorithm. We devise MF-Ta S (Mean-Field Track-and-Stop), an algorithm whose sample complexity provably matches our approximated lower bound. The algorithm, based on the popular Track and Stop (Ta S) algorithm, selects actions suggested by the solution of the approximated lower bound optimization problem and decides to stop gathering data according to the result of a classical Generalized Likelihood Ratio Test (GLRT). 4) Synthetic and real-world experiments. First, we test the performance of MF-Ta S numerically using synthetic experiments. We then apply the algorithm to solve the problem of learning to coordinate the antenna tilts at various base stations in a radio communication network. In both sets of experiments, we show that MF-Ta S can solve very large problems with millions of global actions in a sample and computationally efficient manner. 2. Related Work Most existing works on multi-agent bandit models focus on the so-called multi-player bandits (Besson & Kaufmann, 2018; Rosenski et al., 2016; Shi et al., 2021; Wang et al., 2020). There, agents have access to the same set of actions and interact through collisions (if two agents select the same action, no reward is collected by both agents). Our setting is different and assumes that the global reward is a sum over groups of local rewards which depend on the actions selected by related agents. A few papers investigate MAMABs with the same factored reward structure as ours (Bargiacchi et al., 2018; 2022; Stranders et al., 2012; Verstraeten et al., 2020). While (Bargiacchi et al., 2018; Stranders et al., 2012; Verstraeten et al., 2020) focus on regret minimization, our paper studies best arm identification. 
The closest related work is (Bargiacchi et al., 2022). There, the goal is to identify a global arm or action that is ε-optimal and with error probability bounded by δ. Targeting ε-optimal arms greatly simplifies the problem and the analysis, and removes the need for an adaptive stopping rule (the number of samples is fixed a-priori). In turn, the al- gorithm proposed in (Bargiacchi et al., 2022) does not adapt to the hardness of the problem. This hardness is captured through min, the gap between the best and the second-best arm. Our algorithm is learning this gap and adapting its sampling strategy accordingly. Another closely related line of work is best arm identification in combinatorial bandits with semi-bandit feedback (Chen et al., 2014; Du et al., 2021; Jourdan et al., 2021; Wagenmaker et al., 2020), which encompasses the MAMABs setting as a particular case (see App. G). These works do not explicitly consider multi-agent problems and focus on devising computationally and statistically efficient algorithms. The closest work here is (Jourdan et al., 2021), which leverages a game interpretation of the lower bound optimization problem to devise asymptotically optimal meta-algorithms using online optimization methods. Their method requires the existence of a best-response oracle , which is computationally inefficient for factored rewards models with combinatorial action spaces. In contrast, our algorithm leverages an MF approximation to reduce the number of variables and constraints in the lower bound optimization problem and leads to a more efficient algorithm. A few related works investigate regret minimization in combinatorial semi-bandit feedback settings (Cuvelier et al., 2021a;b; Wagenmaker et al., 2020). The authors of (Cuvelier et al., 2021b) derive a lower bound on the regret that has a similar structure and presents the same challenges, as the one we derive for best arm identification: the bound is obtained by solving an optimization problem with an exponentially large number of variables and constraints (see App. F for details). By smartly rewriting the optimization problem, the authors of (Cuvelier et al., 2021b) manage to devise an asymptotically optimal algorithm. Unfortunately, the algorithm relies on Assumption 6 in (Cuvelier et al., 2021b) which, in general, does not hold in our setting (Wainwright & Jordan, 2008). It is hence impossible to follow a similar approach in the MAMAB setting. 3. Problem Setting Model and reward structure. We consider the generic MAMAB model with factored structure introduced in (Bargiacchi et al., 2018). The model is defined by the tuple S, A, r , where: 1. S = [N] {1, . . . , N} is a set of N agents, 2. A = i [N]Ai is a set of global actions, which is the Cartesian product over i of the set Ai of actions available to the agent i. We assume w.l.o.g. that |Ai| = K, for all i [N], and define A |A| = KN, 3. r is the reward function mapping the global action to the collected reward. Best Arm Identification in Multi-Agent Multi-Armed Bandits We now describe the reward collected when a global action a A is selected. We assume that there are ρ groups of possibly overlapping subsets of agents (Se)e [ρ], with Se S, and |Se| = Ne. Each group generates rewards. The local reward generated by group e depends on group actions ae (ai)i Se Ae i Se Ai only. More precisely, each time ae is selected, the collected local rewards are i.i.d. copies of a random variable re(ae) N(θe(ae), 1). Rewards collected in various groups are independent. 
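To make this reward structure concrete, the following is a minimal sketch of such a factored MAMAB (the global reward, defined next, is simply the sum of the local group rewards). The class name and the randomly drawn means are illustrative assumptions, not part of the paper's setup; the example graph matches the one of Fig. 1 below.

```python
import numpy as np

class FactoredMAMAB:
    """Toy factored MAMAB: rho groups over N agents, K actions per agent.
    Each group e has one mean theta_e(a_e) per group action; local rewards
    are independent N(theta_e(a_e), 1) samples (semi-bandit feedback)."""

    def __init__(self, groups, K, rng=None):
        self.groups = groups          # list of agent index tuples, one per factor
        self.K = K
        self.rng = rng if rng is not None else np.random.default_rng(0)
        # one mean per group action: a (K, ..., K) table with |S_e| axes per group
        self.theta = [self.rng.uniform(0.0, 1.0, size=(K,) * len(Se)) for Se in groups]

    def local_means(self, a):
        """theta_e(a_e) for every group e, given the global action a."""
        return [self.theta[e][tuple(a[i] for i in Se)] for e, Se in enumerate(self.groups)]

    def sample(self, a):
        """One noisy local reward per group (unit-variance Gaussian noise)."""
        means = np.array(self.local_means(a))
        return means + self.rng.standard_normal(len(means))

    def mean_reward(self, a):
        """Expected global reward theta(a) = sum_e theta_e(a_e)."""
        return float(sum(self.local_means(a)))

# The 4-agent, 4-factor graph of Fig. 1 (0-indexed agents), with K = 2 actions per agent.
env = FactoredMAMAB(groups=[(0, 1), (1, 3), (0, 2), (2, 3)], K=2)
a = (0, 1, 1, 0)
print(env.sample(a), env.mean_reward(a))
```

An M-BAI algorithm only ever observes the output of sample(a), i.e., one local reward per group, as formalized below.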
The global reward for action a is then r(a) = P e [ρ] re(ae), a random variable with expectation θ(a) = P e [ρ] θe(ae). The number of possible group actions in group e is Ae |Ae| = KNe, and we define A P e [ρ] Ae. Factor graph representation. The reward structure can be represented as a factor graph (Wainwright & Jordan, 2008). Factor graphs are bipartite graphs with two types of node: N action nodes, one for each agent (represented by circles), and ρ factor nodes, one for each group (represented by squares). An edge between a factor re and an agent i exists if the action ai selected by the agent i is an input of re, i.e., i Se. Fig. 1 shows an example of a factor graph with N = 4 agents and ρ = 4 factors in which the reward is factored additively as r(a) = r1(a1, a2) + r2(a2, a4) + r3(a1, a3) + r4(a3, a4). In this example, when each agent has K actions available, we have A = K4 and A = 4K2. r1 r2 r3 r4 Figure 1. Example of a factor graph. Sequential decision process. In M-BAI, the decision maker sequentially selects global actions based on the history of previous observations and receives a set of samples of the local rewards associated to the various groups. Specifically, in each round t 1, the decision maker selects a global action at = (at,1, . . . , at,N) and observes the local rewards rt = (rt,1, . . . , rt,ρ) from each group. The global action at+1 is selected based on the history of observations Ht = (as, rs)s [t]. This type of interaction is known as semibandit feedback (see App. G for details). Best Arm Identification. In this work, we study the problem of M-BAI in the fixed confidence setting. The goal is to devise an algorithm that returns, using as few rounds as possible, the best global action with a fixed confidence level. This action is defined as a θ arg maxa A θ(a). Throughout the paper, we assume that a θ is unique. Furthermore, when there is no ambiguity, we use a and a θ interchangeably. In this setting, an algorithm is defined through a sampling rule, a stopping rule, and a recommendation rule, described as follows: (i) Sampling rule: it specifies the global action selected in each round. The sampling rule is defined as a sequence of actions (at)t 1, where at A may depend on past observations. Formally, at is Ft 1-measurable, where Ft is the σ-algebra generated by Ht, the history up to time t. (ii) Stopping rule: it controls the end of the data acquisition phase and is defined as a stopping time τ with respect to the filtration (Ft)t 1. (iii) Recommendation rule: in round τ, after the data acquisition phase ends, it returns an estimated best global action ˆaτ A. We denote by Pθ the probability measure of the observations generated under the parameter θ, and by Eθ the respective expectation. With these definitions, the objective is to devise a δ-PAC algorithm, as defined below, with minimal expected sample complexity Eθ[τ]. Definition 3.1. Let δ (0, 1). An algorithm is δ-PAC if θ, Pθ(a θ = ˆaτ) δ and Pθ (τ < ) = 1. 4. Sample Complexity Lower Bound We present a lower bound on the sample complexity satisfied by any δ-PAC algorithm. The lower bound is obtained using classical change-of-measure arguments introduced in (Kaufmann et al., 2016; Lai & Robbins, 1985). To state the lower bound, we introduce the following notations. Notations. Define the marginal polytope as: w R A : w Λ, e [ρ], ae Ae, we,ae = P b A:be=ae wb, where Λ = {w RA + : P a A wa = 1} is the (A 1)- dimensional simplex. The set Λ contains group allocations w = ( we)e [ρ], where we = ( we,ae)ae Ae. 
In other words, Λ contains group probability measures w which satisfy a consistency constraint w.r.t. the global allocations w Λ. Such constraints encode the condition that there exists a global allocation w having marginal group allocations w. Let kl(x, y) be the Kullback Leibler divergence between two Bernoulli distributions of mean x and y, and (a) = θ(a ) θ(a) be the sub-optimality gap for a suboptimal action a = a θ. Best Arm Identification in Multi-Agent Multi-Armed Bandits Theorem 4.1. The sample complexity of any δ-PAC algorithm satisfies θ, Eθ[τ] 2T θ kl(1 δ, δ), where T θ = inf w Λ max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae The proof of the above theorem is given for completeness in App. A.1. T θ is referred to as characteristic time and represents the hardness of the M-BAI problem for a MAMAB instance θ. Furthermore, an allocation w Λ attaining T θ is optimal: an algorithm relying on a sampling strategy realizing w such that for all e [ρ] and ae Ae, we,ae = Eθ[Nτ,e,ae]/Eθ[τ] (where Nt,e,ae = P s [t] 1{as,e=ae} is the number of times the group action ae is selected up to time t) would yield the lowest possible sample complexity. 5. Lower Bound Approximation Solving the lower bound optimization problem (1) is hard for the following reasons: (i) exponential variable space: global allocations w lie in Λ RA +; (ii) exponentially large number of constraints: T θ is defined as a maximum over a set of KN 1 actions (a = a ), which in turn corresponds to solving a problem with an equal number of constraints (see (4)). To circumvent these issues, we propose an approximation of the lower bound optimization problem, which allows to reduce the number of variables and constraints. We will then leverage this approximation in the design of an efficient M-BAI algorithm. Mean-field variable reduction. In order to reduce the variable space, we consider an MF approximation (Wainwright & Jordan, 2008), which restricts the allocations w Λ to the set factored distributions. To define this set, let vi = (vi,ai)ai Ai denote the local allocation of agent i. vi,ai is the proportion of time agent i selects ai Ai, and hence vi Λi = {vi RK + : P ai Ai vi,ai = 1}. The set of factored distributions is then defined as: w Λ : i [N], vi Λi, wa = Q i [N] vi,ai We also denote by ΛMF the corresponding marginal MF polytope. The lower bound approximation is essentially obtained replacing Λ by ΛMF in (1). Then, the corresponding MF characteristic time is: T MF θ = inf w ΛMF max a =a e [ρ]:ae =a e w 1 e,ae + w 1 e,a e 2 min , (2) where min = mina =a (a). The following lemma proves that T MF θ is an upper bound of T θ and provides a gap-dependent scaling. Lemma 5.1. θ, T θ T MF θ 2 A 2 min , and we have: T MF θ = inf (vi)i [N] max a =a i Se v 1 i,ai+ Q i Se v 1 i,a i 2 min . (3) The proof can be found in App. A.2. Note that the dimension of the variables involved in (2) is KN, whereas solving (1) would involve KN-dimensional variables w Λ. The lemma also provides a worst-case scaling of T MF θ : it scales at most with the sum of the group action sets size A = P e [ρ] KNe. Note that there are trivial factor graphs for which T θ scales as A/ 2 min (this is the case when each agent is involved in a single factor). Notice that the program in (2) is seemingly non-convex. In fact, it is generally known that MF approximations lead to non-convex programs due to the geometry of ΛMF (Wainwright & Jordan, 2008). Despite the non-convexity, we show, in App. 
B, that (2) can be reformulated as a Geometric Program (GP), which can be reduced to a (non-linear) convex program by a change of variables. Factored constraint reduction. After the variable reduction step, the main challenge in solving (2) is the maxa =a . We can rewrite (2) in epigraph form (Boyd & Vandenberghe, 2004) as: inf w ΛMF,z R z s.t. z X e [ρ] f we e (ae), a = a , (4) where f we e (ae) = ( w 1 e,ae + w 1 e,a e)/ 2 min1{ae =a e}. In the following, we will omit the superscript we for clarity. Hence, the maxa =a operator in (2) is equivalent to considering a set of KN 1 non-linear constraints: C = n z P e [ρ] fe(ae), a = a o . The objective is to obtain a compact representation of C that avoids an explicit enumeration of the exponentially-many actions. To this aim, we adapt a popular method used in factored Markov Decision Processes (Guestrin et al., 2001; 2003). This method, which we refer to as Factored Constraint Reduction (FCR), reduces a constraint set described by an exponential number of constraints with a factored structure to a provably equivalent set with a reduced number of constraints. The method is inspired by the Variable Elimination (VE) procedure in graphical models (Dechter, 1999). Specifically, FCR considers constraints of the type: e [ρ] pe(ae), a A, where pe( ) is a factor function mapping group actions ae to real values, and z is a real variable, and construct an equivalent set of constraints K. We present the pseudo-code of FCR in Alg. 1 and describe its steps below. Best Arm Identification in Multi-Agent Multi-Armed Bandits Algorithm 1 FCR Input: Elimination order O, factors F Initialize K = for i = 1, . . . , N do l O(i) Fl {p F : l SC(p)} K K n upl a SC(pl) P p Fl up a SC(p), a SC(pl), al o F F {pl} \ Fl end for K K {z up O(N)} Return K FCR takes as input an initial set of factors F = {pe}e [ρ], and an ordered elimination set O. For a factor p F, we define its scope SC(p) [N] as the set of agents involved in p. We also associate a real variable up a SC(p) to each factor p F. After initializing the output constraint set as K = , the algorithm proceeds in an iterative manner. At each iteration i = 1, . . . , N, we set l = O(i) (the ith element of O), and define Fl = {p F : l SC(p)}. We then introduce a new factor pl having scope SC(pl) = p Fl{SC(p)} \ {l}, and we associate the variable upl a SC(pl) to pl. We include in K a new set of constraints upl a SC(pl) X p Fl up a SC(p), a SC(pl), al. We further include the new factor variable pl in the set of factors F and remove all factors in Fl from it, i.e., F = F {pl} \ Fl. At l = O(N), we introduce the constraint z up O(N) into K, where p O(N) is the last generated factor and has empty scope. As shown in App. B, the set of constraints in K, constructed through FCR, are equivalent to the ones in C, i.e., an assignment of variables satisfies the constraints in C if and only if it satisfies the constraints in K. Furthermore, the number of constraints in K set scales as O(NKAO), where AO = maxi [N] |SC(p O(i))| is the size of the maximum scope induced by the chosen order of elimination O (see App. E for details). 6. The MF-Ta S Algorithm Our algorithm, MF-Ta S, identifies and tracks the sampling allocation solving the approximated lower bound optimization problem (3). Such problem (3) depends on the unknown parameter θ through the optimal action a θ and min. 
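Returning for a moment to the constraint reduction above: the sketch below is one illustrative reading of the FCR loop of Alg. 1, in which the variables u^p are represented abstractly as keys indexed by partial assignments. It only enumerates the generated constraints and is not the paper's implementation.

```python
from itertools import product

def fcr_constraints(scopes, K, order):
    """Sketch of FCR (Alg. 1). `scopes[e]` is the tuple of agents in factor e.
    Returns constraints as (lhs_key, [rhs_keys]), meaning u[lhs_key] >= sum of u[rhs_key],
    together with the name of the final empty-scope factor (to be bounded by z)."""
    factors = [("p%d" % e, tuple(sorted(S))) for e, S in enumerate(scopes)]
    constraints = []
    for step, l in enumerate(order):
        elim = [f for f in factors if l in f[1]]                      # F_l
        new_scope = tuple(sorted({i for _, S in elim for i in S} - {l}))
        new_name = "q%d" % step                                       # new factor p_l
        for a in product(range(K), repeat=len(new_scope)):            # for all a_{SC(p_l)}
            assign = dict(zip(new_scope, a))
            for a_l in range(K):                                      # for all a_l
                assign[l] = a_l
                rhs = [(name, tuple(assign[i] for i in S)) for name, S in elim]
                constraints.append(((new_name, a), rhs))
        factors = [f for f in factors if f not in elim] + [(new_name, new_scope)]
    return constraints, factors[-1][0]

# Ring factor graph with N = 8 agents and K = 3 actions, eliminated in reverse order:
cons, last = fcr_constraints([(i, (i + 1) % 8) for i in range(8)], K=3,
                             order=list(range(7, -1, -1)))
print(len(cons), "constraints vs.", 3 ** 8 - 1)   # 174 vs. 6560
```

On this ring, FCR produces 174 constraints in place of the K^N − 1 = 6560 constraints of C; in MF-TaS, the same reduction is applied with p_e set to the functions f_e of (4).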
The algorithm hence consists in (i) estimating the unknown parameter θ, (ii) plugging this estimator in (3) to compute the corresponding optimal sampling rule, and (iii) tracking this sampling rule and stopping when enough information has been gathered. We present MF-Ta S in Alg. 2 and detail its step in the remainder of this section. Algorithm 2 MF-Ta S Input: Confidence δ, exploration set A0 Initialize e [ρ], N0,e = 0, ˆθ0,e = 0, U0,e = Ae, i [N], N0,i = 0, t = 1 while t < T MF ˆθt β(δ, t) do if e : Ut,e = then bt,e arg minae Ut,e Nt,e,ae at at A0 : bt,e = at,e else at,i arg max ai Ai tvt,i,ai Nt,i,ai at (at,i)i [N] end if t t + 1 Update (Nt,i)i [N], (Nt,e, ˆθt,e, Ut,e)e [ρ], (vt,i)i [N] Solve (3) with ˆθt plugged in end while Return ˆaτ arg maxa A P e [ρ] ˆθt,e(ae) 6.1. Parameter estimation In principle, the estimation of θ would require evaluating an exponentially large number of components, i.e., θ(a), a A. Instead, by leveraging the factored reward structure, we can focus on estimating group parameters (θe)e [ρ] R A. We define the estimate at time t, group e, and action ae as: ˆθt,e,ae = 1 Nt,e,ae s [t] rs,e1{as,e=ae}. We then define ˆθt,a = P e [ρ] ˆθt,e,ae, for all a A. 6.2. Sampling Rule The sampling rule is inspired by the D-tracking rule in (Garivier & Kaufmann, 2016) and alternates between forced exploration and tracking steps as described below. Forced Exploration. During forced exploration steps, we select arms to ensure convergence of the group parameter estimates (ˆθt,e)e [ρ] as t . Define a set of exploratory global actions A0 A, which is chosen in such a way that it covers all possible group actions, i.e., A0 is such that e [ρ], ae Ae, b A0 : be = ae (see App. C for an algorithm to select A0 efficiently). Further define the set of under-explored actions at group e and time t as: Ut,e = n ae Ae : Nt,e,ae < p t/|A0| o . At time t, if there is a group e such that Ut,e = , the algorithm executes a forced exploration step. In such steps, we first compute the most under-explored group arm at,e = arg minae Ut,e Nt,e,ae (with ties breaking arbitrarily), and then select an action from the exploratory action set bt A0 such that at,e = bt,e (with ties breaking arbitrarily). Best Arm Identification in Multi-Agent Multi-Armed Bandits Tracking. In the tracking phase, at time t, we solve the optimization problem (3) with the estimated parameter ˆθt, and derive the optimal estimated allocations (vt,i)i [N], where vt,i = (vt,i,ai)ai Ai. Then, the action selected by agent i at a tracking time step t is: at,i = arg max ai Ai tvt,i,ai Nt,i,ai. The global action is simply selected as at = (at,i)i [N]. Note that tracking single-agent allocations vt,i,ai rather than the global allocations wt,a = Q i [N] vt,i,ai allows us to reduce the search space for the tracking action from a combinatorial set of actions a A to local sets ai Ai, for all agents i [N]. 6.3. Stopping Rule For the stopping rule, we use the classical GLRT as in previous works (Garivier & Kaufmann, 2016; Wang et al., 2021). Specifically, the test consists in comparing T MF ˆθt to an exploration threshold β(δ, t) as τ = inf n t 1 : t T MF ˆθt β(δ, t) o . (5) The conditions that the exploration threshold must satisfy are the same as Sec. 3.2 of (Wang et al., 2021) (see App. D for details). An exploration threshold satisfying these conditions is presented in (Kaufmann & Koolen, 2021). Unless otherwise mentioned, we will use such a threshold. 6.4. 
Decision Rule The decision rule selects the best empirical action: ˆaτ = arg max a A Note that, in principle, computing ˆaτ would require a max operation over an exponential number of actions a A. However, due to the factored structure, we can implement the decision rule efficiently through VE (Dechter, 1999), an important sub-routine presented in Alg. 3 and detailed in the remainder of this section. Algorithm 3 VE Input: Elimination order O, factors R for i = 1, . . . , N do l = O(i) Rl = {re R : l SC(re)} pl(ae\l) = maxal Al P re Rl re(al, ae\l) R R {pl(ae\l)} \ Rl end for Return P Variable Elimination. Similarly to FCR, VE follows an elimination order O, where O(i) is the ith variable to be eliminated. The algorithm takes as input a set of factorized reward functions R = {re}e [ρ]. The algorithm proceeds iteratively for i = 1, . . . , N, by eliminating variable l = O(i) in each round. At round l, all the factors in R containing variable l in their scopes are collected in the set Rl. Subsequently, the (marginal) best response is computed as pl(ae\l) = maxal Al P re Rl re(al, ae\l), where ae\l corresponds to the action ae corresponds to the action ae with the l-th component removed. The set of factors is then updated as R R {p(ae\l)}\Rl. At this point, every factor containing l in its scope is eliminated. At the next iteration, the algorithm selects the next variable to be eliminated until i = N. Finally, it returns the optimal value P p R p O(N). Note that VE, applied to R = {ˆθτ,e}e [ρ] returns the highest global estimated reward ˆθτ,ˆaτ . A backward pass of the VE algorithm allows to recover the optimal arm ˆaτ. The time and memory complexity of VE is O(NKAO) (see App. E). 6.5. Sample complexity guarantees We establish that the MF-Ta S algorithm achieves a sample complexity, matching the approximated lower bound T MF θ kl(1 δ, δ) asymptotically (as δ 0). The proof is given in App. D. Theorem 6.1. MF-Ta S is δ-PAC, and its sample complexity satisfies, θ, Pθ lim sup δ 0 δ) T MF θ = 1, and lim sup δ 0 Eθ[τ] log( 1 δ) T MF θ . 7. Experiments In this section, we assess the performance of MF-Ta S. We propose two sets of experiments through numerical experiments. We apply MF-Ta S to synthetic MAMABs with different levels of complexity in 7.1, and to a timely industrial use-case from the radio communication domain: antenna tilt optimization, in 7.2. Additional experiments are reported in App. J, and the code is available at this link. 7.1. Synthetic Experiments Problem instance. We consider a ring factor graph, depicted in Fig. 2, in which the reward is described as r(a) = P i [N 1] ri(ai, ai+1) + r N(a1, a N), for ai [K]. The local expected rewards are selected at random as θi(ai, ai+i) U(0, M), for all i [N] and for some M > 0. We propose two sets of synthetic experiments to test different levels of complexity: (i) we vary the number of agents N {3, 5, 10} while keeping a fixed K = 10 and δ = 0.01, and (ii) we vary the confidence δ {10 i}i [5], while keeping K = 10 and N = 3 fixed. Best Arm Identification in Multi-Agent Multi-Armed Bandits Table 1. Experimental results for the synthetic experiments with varying N (with fixed K = 10, δ = 0.01). 
| N | A | Ã | T^MF_θ · log(1/δ) | (4Ã/Δ_min^2) · log(1/δ) | Oracle | MF-TaS | Random | T^MF_θ [s] | T*_θ [s] |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 10^3 | 300 | 158.3 | 209.0 | 223 | 299.6 ± 107.5 | 644.7 ± 140.2 | 0.09 ± 0.03 | 6.87 ± 1.12 |
| 5 | 10^5 | 500 | 270.5 | 358.0 | 385 | 356.8 ± 150.0 | 708.3 ± 132.9 | 0.54 ± 0.13 | 2375.82 ± 5.32 |
| 10 | 10^10 | 1000 | 305.9 | 405.0 | 533 | 411.1 ± 193.5 | 800.4 ± 177.9 | 0.81 ± 0.31 | > 10800 (3 h) |

Implementation details. We execute our experiments for Nsim = 100 runs. Following previous works (Kaufmann & Koolen, 2021; Wang et al., 2021), the exploration threshold is selected as β(δ, t) = log((log(t) + 1)/δ). The elimination order for both VE and FCR is chosen as O = {N, N−1, . . . , 1}. We implement the solver for the lower bound optimization problems using CVXPY (Diamond & Boyd, 2016) with the MOSEK solver. The experiments run on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 processor. We use this setup in all of our experiments.

Figure 2. Ring factor graph used in the synthetic experiments.

Results. The results are presented in Tab. 1 for the experiments with varying N, and in Tab. 2 for the experiments with varying δ, with the sample complexity box-plots reported in Fig. 3. The sample complexity corresponds to the stopping time averaged over the various runs. The performance of MF-TaS is compared to that of an oracle algorithm aware of the problem parameters θ (obtained by replacing θ̂_t by θ in MF-TaS), and to a random strategy selecting actions uniformly at random. The computational complexity is the average running time (in seconds) to solve one instance of the lower bound optimization problem for T*_θ and T^MF_θ.

Table 2. Experimental results for the synthetic experiments with varying δ (with fixed K = 10, N = 3).

| δ | T^MF_θ · log(1/δ) | Oracle | MF-TaS | Random |
|---|---|---|---|---|
| 10^-1 | 79.1 | 154 | 223.5 ± 82.4 | 635.5 ± 135.5 |
| 10^-2 | 158.3 | 223 | 317.5 ± 133.2 | 649.4 ± 142.6 |
| 10^-3 | 237.4 | 303 | 397.6 ± 129.0 | 668.2 ± 131.5 |
| 10^-4 | 316.5 | 384 | 493.9 ± 153.8 | 695.5 ± 141.4 |
| 10^-5 | 395.7 | 464 | 535.4 ± 154.2 | 793.4 ± 190.8 |

We note that MF-TaS exhibits a sample complexity close to the proposed MF approximation of the lower bound T^MF_θ · log(1/δ) (they differ by a small multiplicative constant) for all values of N. We also observe that, as expected, MF-TaS outperforms the random sampling strategy and is competitive with the oracle strategy. In terms of computational complexity, the average running time to solve T^MF_θ is significantly lower than that to solve T*_θ, which quickly becomes intractable even for a small number of agents.

Figure 3. Sample complexity boxplots for the experiments with varying δ (dashed line represents the oracle performance).

7.2. Antenna Tilt Optimization

Next, we test MF-TaS on the antenna tilt optimization problem. The task consists in controlling the vertical antenna tilt at different network base stations to optimize the network throughput. In the following, we detail the network model and our simulation setup, and present our experimental results. Additional details are presented in App. I.

Network Model. We consider a sectorized mobile network consisting of a set of sectors S = [N]. The set of sectors corresponds to the set of agents in our M-BAI model. Since each sector is associated with a unique antenna, we will use the terms sector and antenna interchangeably. We assume that each sector i ∈ S serves (on the downlink) a fixed set of User Equipments (UEs) U_i (each UE is associated with a unique antenna, the one from which it receives the strongest signal). The set of UEs in the network is U = ∪_{i∈S} U_i.
Factor graph. We model the observed reward in the network as a factor graph with N = |S| agent nodes and ρ = |S| factor nodes. Each sector is associated with a unique factor, which models the rewards observed in that sector. We build the factor graph based on the interference pattern of the antennas, i.e., antennas that can interfere with each other are connected to common factors. Fig. 4 shows an example of such a graph on a network with |S| = 15.

Table 3. Experimental results for the antenna tilt optimization experiments.

| N | A | Ã | T^MF_θ · log(1/δ) | Oracle | MF-TaS | Random | T^MF_θ [s] | T*_θ [s] |
|---|---|---|---|---|---|---|---|---|
| 6 | 2.4 · 10^2 | 324 | 129.2 | 203 | 286.4 | 341.1 | 0.93 ± 0.27 | 4.34 ± 0.92 |
| 12 | 5.9 · 10^4 | 1124 | 472.1 | 524 | 623.5 | 813.9 | 1.32 ± 0.62 | 2778.53 ± 5.32 |
| 15 | 1.4 · 10^7 | 1799 | 568.9 | 729 | 913.7 | 1262.1 | 3.24 ± 0.91 | > 10800 (3 h) |

Figure 4. Network factor graph.

Actions. The action a_{t,i} represents the antenna tilt of sector i ∈ S at time t. For simplicity, it is chosen from a discrete set of K tilts, i.e., a_{t,i} ∈ {α_1, . . . , α_K}. The antenna tilt for a group of sectors e is denoted by a_e.

Rewards. Rewards are based on the throughput of the UEs in sector i, which depends on the actions of a group of agents a_e: r_e(a_e) = Σ_{u∈U_i} T_{i,u}(a_e), where T_{i,u} is the throughput of a UE u associated with sector i. Hence, the global reward for a tilt configuration a ∈ A is r(a) = Σ_{e∈[ρ]} r_e(a_e). The throughput T_{i,u} depends on the channel conditions (or fading) between the base station antenna and the user. These conditions rapidly evolve over time around their mean. The fadings between (antenna, user) pairs are typically stochastically independent across users and antennas (Tse & Viswanath, 2009). More precisely, since the sets (U_i)_{i∈[N]} form a partition (they do not overlap), the random variables r_e(a_e) are independent across groups, and we assume that they can be modeled as independent Gaussian random variables.

Simulator. We run our experiments in a proprietary mobile network simulator in an urban environment. The simulation parameters used in our experiments are reported in Tab. 4. Based on the user positions and network parameters, the simulator computes the path loss in the network environment using a BEZT propagation model (Rappaport, 2001) and returns the throughput for each sector by conducting user association and resource allocation in a full-buffer traffic demand scenario. Given a user configuration, the goal is to identify the best global tilt configuration in the network, i.e., the one which maximizes the overall network throughput.

Table 4. Simulator parameters.

| Parameter | Symbol | Value |
|---|---|---|
| Number of sectors | \|S\| | {6, 12, 15} |
| Number of UEs | \|U\| | 1000 |
| Antenna tilt values | A_i | {3°, 6°, 12°} |
| Carrier frequency | f | 1800 MHz |
| Antenna height | h | 32 m |
| Network size | M | 2 km² |

Results. We test our algorithm under the same experimental conditions as in 7.1. The sample complexity and the computational complexity of MF-TaS are presented in Tab. 3. The results are in line with the experimental findings of the previous section. However, due to the higher degree of the factor graph, the MF-TaS running time is higher.

8. Conclusions

In this paper, we investigated the M-BAI problem: we derived a sample complexity lower bound, proposed a Mean Field approximation of it, and devised MF-TaS, an algorithm achieving this limit in a computationally efficient manner. MF-TaS is statistically and computationally efficient on both synthetic examples and the antenna tilt optimization problem.
The algorithm runs fast and identifies the best global action using a limited number of samples, even for scenarios with a very large number of actions. Interesting future research directions include (i) the analysis of the sample complexity lower bound and its Mean Field approximation depending on the factor graph topology, (ii) extending the analysis towards tighter lower bound approximations, (iii) proposing efficient distributed implementations of MF-Ta S with a specific focus on its communication complexity, and (iv) investigating representation learning problems in M-BAI where the underlying factor graph is initially unknown and needs to be learned. Acknowledgement This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Best Arm Identification in Multi-Agent Multi-Armed Bandits Bargiacchi, E., Verstraeten, T., Roijers, D., Now e, A., and van Hasselt, H. Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In Proc. of ICML, 2018. Bargiacchi, E., Verstraeten, T., Roijers, D., Now e, A., and van Hasselt, H. Multi-agent RMax for multi-agent multiarmed bandits. In Proc. of Adaptive and Learning Agents Worksh., 2022. Berge, C. Topological spaces. Proceedings of the Edinburgh Mathematical Society, 1963. Besson, L. and Kaufmann, E. Multi-Player Bandits Revisited. In Proc. of ALT, 2018. Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. Boyd, S. P., Kim, S.-J., Vandenberghe, L., and Hassibi, A. A tutorial on geometric programming. Optimization and Engineering, 2007. Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Proc. of Neur IPS, 2014. Cuvelier, T., Combes, R., and Gourdin, E. Statistically efficient, polynomial-time algorithms for combinatorial semi-bandits. Proc. ACM Meas. Anal. Comput. Syst., 2021a. Cuvelier, T., Combes, R., and Gourdin, E. Asymptotically optimal strategies for combinatorial semi-bandits in polynomial time. In Proc. of ALT, 2021b. Dechter, R. Bucket elimination: A unifying framework for reasoning. Artif. Intell., 1999. Degenne, R., M enard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In Proc. of ICML, 2020. Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. JMLR, 2016. Du, Y., Kuroki, Y., and Chen, W. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proc. of AAAI, 2021. Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. In Proc. of Neur IPS, 2019. Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Proc. of COLT, 2016. Guestrin, C., Koller, D., and Parr, R. Max-norm projections for factored MDPs. In Proc. of IJCAI, 2001. Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. Efficient solution algorithms for factored mdps. J. Artif. Int. Res., 2003. Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. In Proc. of Neur IPS, 2020. Jourdan, M., Mutn y, M., Kirschner, J., and Krause, A. Efficient pure exploration for combinatorial bandits with semi-bandit feedback. In Proc. of ALT, 2021. Karnin, Z. S. Verification based solution for structured mab problems. In Advances in Neural Information Processing Systems, 2016. Kaufmann, E. and Koolen, W. M. 
Mixture martingales revisited with applications to sequential tests and confidence intervals. JMLR, 2021. Kaufmann, E., Capp e, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. JMLR, 2016. Kiefer, J. and Wolfowitz, J. The equivalence of two extremum problems. Canad. Journ. of Math., 1960. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 1985. Lawler, E. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 1972. Nilsson, D. An efficient algorithm for finding the m most probable configurationsin probabilistic expert systems. Statistics and Computing, 1998. Rappaport, T. Wireless Communications: Principles and Practice. Prentice Hall PTR, 2001. Rosenski, J., Shamir, O., and Szlak, L. Multi-player bandits a musical chairs approach. In Proc. of ICML, 2016. Shi, C., Xiong, W., Shen, C., and Yang, J. Heterogeneous multi-player multi-armed bandits: Closing the gap and generalization. In Proc. of Neur IPS, 2021. Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Proc. of Neur IPS, 2014. Stranders, R., Fave, F. M. D., Rogers, A., and Jennings, N. R. DCOPs and bandits: exploration and exploitation in decentralised coordination. In Proc. of AAMAS, 2012. Tao, C., Blanco, S., and Zhou, Y. Best arm identification in linear bandits with linear dimension dependency. In Proc. of ICML, 2018. Best Arm Identification in Multi-Agent Multi-Armed Bandits Tse, D. N. C. and Viswanath, P. Fundamentals of wireless communication. IEEE Trans. Inf. Theory, 2009. Verstraeten, T., Bargiacchi, E., Libin, P. J. K., Helsen, J., Roijers, D. M., and Now e, A. Multi-agent thompson sampling for bandit applications with sparse neighbourhood structures. Scientific Reports, 2020. Wagenmaker, A., Katz-Samuels, J., and Jamieson, K. Experimental design for regret minimization in linear bandits, 2020. Wainwright, M. and Jordan, M. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., 2008. Wang, P.-A., Proutiere, A., Ariu, K., Jedra, Y., and Russo, A. Optimal algorithms for multiplayer multi-armed bandits. In Proc. of AISTATS, 2020. Wang, P.-A., Tzeng, R.-C., and Proutiere, A. Fast pure exploration via frank-wolfe. In Proc. of Neur IPS, 2021. Best Arm Identification in Multi-Agent Multi-Armed Bandits A. Lower Bound Proofs In this appendix, we prove Theorem 4.1 (in App. A.1) and Lemma 5.1 (in App. A.2). Finally, we state a result bounding the approximation ratio T MF θ /T θ in App. A.3. A.1. Proof of Theorem 4.1 Proof. The proof leverages the classical change-of-measure argument (Lai & Robbins, 1985) and the transportation lemma from Lemma 19 in (Kaufmann et al., 2016) to accommodate the MAMAB setting. We divide the proof into 5 steps. The first four steps are standard, and consist in relating the log-likelihood ratio of the observations under two models to the expected sample complexity. They are given for completeness. The last step is specific to MAMABs and deals with optimizing the resulting lower bounds. 1) The log-likelihood ratio. Let M = {θ RA : (θe(ae))e [ρ],ae A : a A, θ(a) = P e [ρ] θe(ae), a θ is unique} denote the set of possible parameters describing a MAMAB. Let θ, µ M. For any (global) action a A denote by f θ a the density (w.r.t. the Lebesgue measure) of the reward distribution of action a. 
For any (group) action ae Ae, denote by νθe ae = N(θe(ae), 1) the distribution of the corresponding reward, and by f θe ae its density. For the parameters θ, µ, the log-likelihood ratio of the observations under a M-BAI algorithm up to round T can be written as Lθ,µ T (a1,[ρ], r1,[ρ], . . . , a T,[ρ], r T,[ρ]) = log f θ(a1,[ρ], r1,[ρ], . . . , a T,[ρ], r T,[ρ]) f µ(a1,[ρ], r1,[ρ], . . . , a T,[ρ], r T,[ρ]) f θe(rt,e|at,e) f µe(rt,e|at,e) e [ρ] log f θe(rt,e|at,e) f µe(rt,e|at,e) ae Ae 1{at,e=ae} log f θe ae (rt,e) f µe ae (rt,e) 2) The change-of-measure argument. Define the set of confusing parameters as B(θ) = µ M : a µ = a θ , and the event E = {ˆaτ = a θ} . Under any δ-PAC algorithm, it holds that θ M, Pθ(E) 1 δ, and µ B(θ), Pµ(E) δ. Therefore, by applying Lemma 19 in (Kaufmann et al., 2016), we obtain µ B(θ), E Lθ,µ τ kl (1 δ, δ) . (7) 3) The expected log-likelihood ratio. We apply (6) to T = τ, the stopping time. Taking the expectation of (6), we get Eθ Lθ,µ τ = Eθ t=1 1{τ>t 1} X ae Ae 1{at,e=ae} log f θe ae (rt,e) f µe ae (rt,e) ae Ae 1{τ>t 1}1{at,e=ae} log f θe ae (rt,e) f µe ae (rt,e) ae Ae 1{τ>t 1}1{at,e=ae}Eθ f θe ae (rt,e) f µe ae (rt,e) ae Ae Eθ [Nτ,e,ae] KL(νθe ae, νµe ae ), Best Arm Identification in Multi-Agent Multi-Armed Bandits where KL(ν, ν ) is the KL divergence for distributions ν and ν . Now, since the distributions are Gaussians, we get: Eθ Lθ,µ τ = X ae Ae Eθ [Nτ,e,ae] (θe(ae) µe(ae))2 4) Optimizing the lower bound. By combining (7) and (8), we obtain ae Ae Eθ [Nτ,e,ae] (θe(ae) µe(ae))2 2 kl(1 δ, δ). (9) Dividing both sides of (9) by Eθ[τ], defining the group allocations we,ae = Eθ [Nτ,e,ae] /Eθ[τ] for all e [ρ], ae Ae, and optimizing over the set of possible allocations w Λ, we get: Eθ[τ] 2kl(1 δ, δ) sup w Λ infµ B(θ) P ae Ae we,ae (θe(ae) µe(ae))2 . (10) 5) Solving the optimization problem. Consider the optimization problem at the denominator of (10): ae Ae we,ae (θe(ae) µe(ae))2 Note that the set of confusing parameters defined before can be expressed as: B(θ) = µ M : a µ = a θ a =a θ {µ M : µ(a θ) µ(a)} (µe)e [ρ] : X e [ρ] µe(a e) X e [ρ] µe(ae) Then, using this decomposition, (11) can be rewritten as: min a =a inf (µe)e [ρ] ae Ae we,ae (θe(ae) µe(ae))2 subject to X e [ρ] (µe(a e) µe(ae)) 0. (12) Now, (12) is a convex program and it can be easily verified that Slater s conditions hold (Boyd & Vandenberghe, 2004). The Lagrangian associated to (12) is: L((µe)e [ρ], λ) = X ae Ae we,ae (θe(ae) µe(ae))2 e [ρ] (µe(a e) µe(ae)) 1{ae =a e} where λ R is the Lagrange multiplier associated with the inequality constraint. The optimality conditions impose: we,ae(µe(ae) θe(ae)) + λ1{ae =a e} = 0, for e [ρ] we,a e(µe(a e) θe(a e)) λ1{ae =a e} = 0, for e [ρ] e [ρ] (µe(a e) µe(ae)) 1{ae =a e} = 0 It can be directly verified that the solution of the set of equations in (13) is attained at: µe(ae) = θe(ae) λ we,ae , µe(a e) = θe(a e) + λ we,a e , λ = e [ρ](θe(a e) θe(ae)) P e [ρ] 1 we,ae + 1 we,a e 1{ae =a e} > 0. Hence, by substitution of (µe(ae))e [ρ],ae Ae into the objective of (12), and by (10), we get the result. Best Arm Identification in Multi-Agent Multi-Armed Bandits A.2. Proof of Lemma 5.1 Proof. We first show that θ M, T θ T MF θ . This simply follows by noting that 1/ (a) 1/ min, and ΛMF Λ, and hence ΛMF Λ. Combining these facts, we can write: T θ = inf w Λ max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae inf w Λ max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae inf w ΛMF max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae 2 min = T MF θ Next, we show that (2) and (3) are equivalent. 
First, recall the expression of the MF marginal polytope: w R A : w ΛMF, e [ρ], ae Ae, we,ae = X b A:be=ae wb, Note that, for any w ΛMF, we must have that: b A:be=ae wb = X i [N] vi,bi = Y i Se vi,ai X j [N]\Se wj,bj = Y i Se vi,ai. (14) Hence, by substituting (14) into (2), and noticing that T MF θ involves only local allocation variables (vi)i [N], we have that: T MF θ = inf w ΛMF max a =a w 1 e,a e + w 1 e,ae 2 min = inf (vi Λi)i [N] max a =a i Se v 1 i,ai + Q i Se v 1 i,a i Note that, if (v i )i [N] is an optimal solution to T MF θ , we can express such solution it in terms of the optimal group allocations as in (14), i.e., w = Q i Se v i,ai. Finally, we show that, θ M, T MF θ 4 A 2 min . The bound is proved using techniques from combinatorial bandits with semi-bandit feedback. This class of bandit problems encompasses MAMAB as a particular case in (we provide clarification for this connection in App. G). Specifically, the proof relies on two known results from (Wagenmaker et al., 2020) that we report below. Lemma A.1 is a reformulation of the MAMAB sample complexity lower bound in the combinatorial semi-bandit feedback setting, while Lemma A.2 is an adaptation of the celebrated Kiefer-Wolfowitz Equivalence Theorem (Kiefer & Wolfowitz, 1960) for the case of semi-bandit feedback problems. Lemma A.1 ((Wagenmaker et al., 2020), Theorem 6). For any θ M, we have that T θ = inf w Λ max a =a ϕ(a ) ϕ(a) 2 Asemi(w) 1 ( θ (ϕ(a ) ϕ(a))2 , (15) θ = (θe(ae))e [ρ],ae Ae ϕ(a) {0, 1} A, is the binary vector containing ϕ(a) = [ϕe(be)]e [ρ],be Ae such that ϕe(be) = 1{ae=be}, Asemi(w) = diag(P a A waϕ(a)ϕ(a) ), where diag(X) is the operator which sets all elements in a matrix X not on the diagonal to 0 Best Arm Identification in Multi-Agent Multi-Armed Bandits Lemma A.2 ((Wagenmaker et al., 2020), Proposition 9). infw Λ maxa A ϕ(a) 2 Asemi(w) 1 = A. By Lemma A.1, we can equivalently characterize the MF characteristic time as: T MF θ = inf w ΛMF max a =a ϕ(a ) ϕ(a) 2 Asemi(w) 1 ( θ (ϕ(a ) ϕ(a))2 . To verify this, notice that θ (ϕ(a ) ϕ(a)) = P e [ρ](θe(a e) θe(ae)). Hence, to show the equivalence, it is sufficient to show that ϕ(a ) ϕ(a) 2 Asemi(w) 1 = X e [ρ]:ae =a e 1/ we,ae + 1/ we,a e. By definition of (ϕ(a))a A, and w Λ, we have that ϕ(a ) ϕ(a) 2 Asemi(w) 1 = ϕ(a ) ϕ(a) 2 M( w) 1, where M( w) R A A is the diagonal matrix containing the vector w on its diagonal. Hence, we get: ϕ(a ) ϕ(a) 2 Asemi(w) 1 = (ϕ(a ) ϕ(a)) a A waϕ(a)ϕ(a) !! 1 (ϕ(a ) ϕ(a)) = (ϕ(a ) ϕ(a)) M( w) 1(ϕ(a ) ϕ(a)) e [ρ]:ae =a e 1/ we,ae + 1/ we,a e. Then, we can upper bound the characteristic time as: T MF θ = T MF θ,semi = 1 2 min inf w ΛMF max a =a ϕ(a ) ϕ(a) 2 Asemi(w) 1 (i) 4 2 min inf w ΛMF max a A ϕ(a) 2 Asemi(w) 1 (ii) = 4 A 2 min In the above steps, (i) follows by an application of the triangular inequality. Note that, in general, showing (ii) is a non-trivial step due to the non-convexity of the MF approximation. By Lemma A.2, we have that: inf w Λ max a A ϕ(a) 2 Asemi(w) 1 = A, and it can be directly verified that the optimal value is attained at the point w = (1/A, . . . , 1/A) Λ. Now, since ΛMF Λ, in order to show (ii), it suffices to verify that w ΛMF. Specifically, we need to show that there exists a set of local allocations (v i )i [N], where v i Λi, for all i [N], such that w a = Q i [N] v i,ai, for all a A. This is easily verified by the vector of marginal allocations v i = (1/K, . . . , 1/K). A.3. Quantifying the approximation ratio Lemma A.3. For any θ, we have that 1 T MF θ T θ A/ ρ. Proof. 
The lower bound T MF θ T θ 1 is obvious from Lemma 5.1. For the upper bound, we have (i) 2 A 2 min (ii) 2 A 2 min ρ 2 min 2ρ = A ρ, where (i) follows from Lemma 5.1, and (ii) follows directly from an application of Lemma 2 of (Soare et al., 2014). Best Arm Identification in Multi-Agent Multi-Armed Bandits B. Properties of T MF θ In this appendix, we provide important properties of T MF θ . We first establish, in App. B.1, that the optimization leading to T MF θ is equivalent to a non-linear convex program, which ensures that it can be computed efficiently. We then show, in App. B.2, the continuity of functions involved in the definition of T MF θ . These continuity arguments will be needed in the sample complexity analysis for our algorithm. B.1. An equivalent convex program We establish in Proposition B.1 below that the optimization problem defining the MF characteristic time T MF θ (2) is equivalent to a convex program. Note that this is a non-trivial result and does not hold in general. Indeed, it is known that in general, MF approximations lead to non-convex optimization problems (see e.g., (Wainwright & Jordan, 2008)) in App. D. Proposition B.1. The optimization problem (2) is equivalent to a (non-linear) convex program. The proof relies on the application of Lemma B.2 given below. Specifically, in Lemma B.2, we show that (2) can be reformulated as a GP in standard form. The proof of the Lemma relies on the fact that relaxing the constraints that are not in standard GP form does not affect optimality, while leading to a GP in standard form. Then, in Proposition B.1, we show that the GP (3) can be reformulated as a non-linear convex optimization program by a change of variables. Lemma B.2. The optimization problem (2) is equivalent to a geometric program. Proof. Recall that, for a positive variable x RI >0, a GP in standard form is described as (Boyd et al., 2007): infx f0(x) subject to fi(x) 1, i = 1, . . . , m hi(x) = 1, i = 1, . . . , p (16) where f0, . . . , fm are posynomials and h1, . . . , hp are monomials. Recall that for x RI >0, α RI, and γ > 0, a monomial is a function of the type h(x) = γ Q i [I] xαi i and a posynomial is a positive sum of monomials: f(x) = P i [I] xβi,j i , for some β RI J and ξ RJ >0. By Lemma 5.1, and using an epigraph representation (Boyd & Vandenberghe, 2004), we can rewrite the optimization problem (2) describing T MF θ , for a positive variable z > 0 and vi RK >0, for all i [N], as: inf z,(vi)i [N] z (17a) subject to 1 2 min e [ρ]:ae =a e i Se v 1 i,aiz 1 + Y i Se v 1 i,a i z 1 ! 1, a = a (17b) ai Ai vi,ai = 1, i [N]. (17c) Note that, at this stage, (17) cannot be described as a GP in standard form. Indeed, although the objective (17a) is a monomial and the set of inequality constraints (17b) are posynomials, the set of constraints (17c) are posynomial equality constraints, which do not comply with standard GP requirements. The problem (17) is in fact generally known as a signomial program, and is hard to solve in general (Boyd et al., 2007). The rest of the proof will be aimed at showing that (17) can be transformed into an equivalent GP. In order to show that T MF θ can be actually casted as a GP, we can apply the method described in (Boyd et al., 2007) to relax the set of constraints (17c). Define the relaxed GP as: Best Arm Identification in Multi-Agent Multi-Armed Bandits inf z,(vi)i [N] z subject to 1 2 min e [ρ]:ae =a e i Se v 1 i,aiz 1 + Y i Se v 1 i,a i z 1 ! ai Ai vi,ai 1, i [N]. As described in (Boyd et al., 2007) (Sec. 
7.4), we have that (18) and (17) are equivalent if and only if we have: for all i [N], there exists a k [N], such that: (a) the variable vi,ak does not appear in any of the monomial equality constraints; (b) the objective and the inequality constraint functions are all monotone decreasing in vi,ak, i.e., if we increase vi,ak (holding all other variables constant), the inequality constraint functions decrease, or remain constant; (c) the posynomial function in the equality constraints are monotone strictly increasing in vi,ak, i.e., if we increase vi,ak (holding all other variables constant), the posynomial function increases. In our setting, these assumptions hold naturally: (a) is satisfied since the original optimization problem does not involve any monomial equality constraints; (b) holds since the functions on the LHS of (17b) are monotone decreasing in vi,ak, a A; (c) holds since the function on the LHS of (17c) are monotone strictly increasing in vi,ak, i [N]. Proof of Proposition B.1. By Lemma B.2, the optimization problem leading to T MF θ is equivalent to (18). Let yz = log(z), yvi,ai = log(vi,ai), for all i [N], ai Ai. By applying this change of variables, we can rewrite (18) as: inf yz R>0,(yvi RK >0)i [N] eyz subject to 1 2 min e:ae =a e exp i Se yvi,aiyz i Se yvi,a i yz ai Ai eyvi,ai 1, i [N]. By the monotonicity of the logarithm function, we have that (19) is also equivalent to: inf yz R>0,(yvi RK >0)i [N] yz subject to log e:ae =a e exp i Se yvi,aiyz i Se yvi,a i yz ai Ai exp yi,wai ! It follows directly from the convexity of the Log-Sum-Exp function that (20) is a convex program. Best Arm Identification in Multi-Agent Multi-Armed Bandits B.2. Continuity arguments This section presents continuity arguments on functions related to the optimization problem (2). Define, for θ M and w ΛMF, the function ψ(θ, w) = min a =a 2 min P e [ρ]:ae =a e w 1 e,ae + w 1 e,a e Further define ψ (θ) = sup w ΛMF ψ(θ, w), and w = arg max w ψ(θ, w). In the following, we shall assume w.l.o.g. that w is unique, i.e., the set of optimal allocations is a singleton: C (θ) = { w ΛMF : ψ(θ, w) = (T MF θ ) 1} = { w }. This may be proved using similar techniques as those of Theorem 5 in (Garivier & Kaufmann, 2016). Note that if this was not the case, one may reason in terms of the objective function as, e.g., in (Jedra & Proutiere, 2020) for linear bandits. Lemma B.3. The function ψ(θ, w) is continuous in both w and θ, and ψ (θ) is continuous in θ. Furthermore there exists a w ΛMF such that w arg max w Λ ψ(θ, w). Proof. The proof of Lemma B.3 follows from standard continuity arguments and is similar to that of the related results in (Jedra & Proutiere, 2020). First, we prove that ψ(θ, w) is continuous in both θ and w. Let (θt, wt)t 1 be a sequence taking values in M ΛMF, and converging to (θ, w). Recall the definition of the set of confusing parameters with respect to θ M, B(θ) = {µ M : a = a θ : µ(a) > µ(a θ)} . f(θ, µ, w) = X ae Ae we,ae (θe(ae) µe(ae))2 Since (θt, wt) converges to (θ, w), and a θ is unique, there exists ε > 0 and a t1 1 such that t t1, (θ, w ) (θt, wt) < ε and such that B(θ) = B(θt). Further note that, since f(θ, µ, w) is a polynomial in θ, µ, w, is continuous. As a consequence, there exists t2 1 : t t2, such that for all µ M, we have |f(θt, µ, wt) f(θ, µ, w)| εf(θ, µ, w). Thus, for all t max{t1, t2}, we get |ψ(θ, w) ψ(θt, wt)| = min µ B(θ) f(θ, µ, w) min µ B(θ) f(θt, µ, wt) ε |f(θ, µ, w)| ε|ψ(θ, w)|. 
The continuity of ψ (θ) and existence of a solution w follows from Berge s maximum theorem (Berge, 1963). Best Arm Identification in Multi-Agent Multi-Armed Bandits C. Sampling Rule Analysis In this appendix, we present various results on the sampling rule of MF-Ta S. We divide the analysis for the forced exploration (in App. C.1) and tracking (in App. C.2). Recall that, the forced exploration relies on the set A0, as described in 6.2. This set may be difficult to construct due to the combinatorial nature of the action set A. To address this issue, we provide, in App. C.3, a simple and efficient algorithm to build A0. C.1. Forced Exploration In this section, we state and prove Lemma C.1. It shows that each group arm is sampled sufficiently often. This, in turn, ensures that the estimators of the group means ˆθt,e converge to θe, for all groups e [ρ], and hence that ˆθt θ a.s.. Lemma C.1. The sampling rule of MF-Ta S ensures that exists a finite t > 0, such that t t and e [ρ], ae Ae, we have that Proof. Recall the expression for the set of under-explored actions at group e and time t: ae Ae : Nt,e,ae < The proof is inspired by that of Lemma 5 in (Jedra & Proutiere, 2020). The main idea is to show that, if at some time t0 + 1 the condition e [ρ], Ut0+1,e = is violated, then the number of rounds needed to satisfy the condition again cannot exceed |A0| rounds. This follows by the definition of A0 and of the forced exploration rule. By construction, we have that inf{t 1 : e [ρ], Ut,e = } T0 |A0|. Now, if there exists t0 T0 such that e [ρ], Ut0,e = and e [ρ] such that Ut0+1,e = , we may define t1 = inf{t > t0 : e [ρ], Ut,e = }. Observe that for all t0 t t1, and for all e [ρ], we have, Nt,e,ae Nt0,e,ae p Furthermore, if t1 t0 + |A0| + 1 then we have, e [ρ], Nt1,e,ae Nt0+|A0|+1,e,ae Nt0,e,ae + 1 p t0/|A0| + 1. However, we have: t0 |A0| + 1 t0 + |A0| + 1 Therefore if t0 1 4|A0| then t1 t0 + |A0| + 1. In other words, we have shown that for all t 1 4|A0| + |A0| + 1, we have for all e [ρ], for all ae Ae Best Arm Identification in Multi-Agent Multi-Armed Bandits C.2. Tracking In this section we state and prove Lemma C.2 and Lemma C.3. Lemma C.2 shows that MF-Ta S can correctly track a changing sequence that concentrates. Lemma C.3 (and Corollary C.4) shows that tracking local allocations ensures that global (and group) allocations are correctly tracked. Finally, in Proposition C.5, we show that MF-Ta S ensures that the empirical allocations converge to the optimal allocation. Lemma C.2. Define, for all i [N], the sequence (vt,i)t 1, where vt,i = (vt,i,ai)ai Ai, taking values in Λi used in the tracking rule of MF-Ta S. Let, for all i [N], v i Λi. Then, for all ε > 0, and for all t0, there exists tε > t0 such that sup t t0 max i [N],ai Ai vt,i,ai v i,ai ε sup t tε max i [N],ai Ai Proof. The proof is quite similar to the one of Lemma 7 in (Garivier & Kaufmann, 2016) for D-tracking. We adapt the proof to allow our sampling rule to track local allocations for each agent, as opposed to tracking arm allocations as in (Garivier & Kaufmann, 2016). Also, there is an asymmetry in the forced exploration that slightly complicates the analysis. For all i [N], ai Ai, define Xt,i,ai = Nt,i,ai tv i,ai, and note that i [N], we have: ai Ai Xt,i,ai = X ai Ai (Nt,i,ai tv i,ai) = t t ai Ai v i,ai Furthermore, we have that maxi [N],ai Ai |Xt,i,ai| N(K 1) maxi [N],ai Ai Xt,i,ai. 
To see this, note that for every i [N], ai Ai, we have that Xt,i,ai maxk [N] maxak Ak Xt,k,ak, and that, by (21), we have Xt,i,ai = X j =i Xt,j,ai X j =i max k [N] Xt,k,ai = (K 1) max k [N] Xt,k,ai. The remaining part of the proof will aim at determining an upper bound on maxi [N],ai Ai Xt,i,ai, for t large enough. Now, let t 0 t0 be such that t t 0, p t/|A0| 2tε and 1/t ε. We will show that for t t 0, and for all i [N], {at+1,i = ai} {Xt,i,ai 2tε}. (22) We analyze two (mutually exclusive) cases: (i) forced exploration and (ii) tracking. (i) if at t t 0, e [ρ] : Nt,e,ae t/|A0|, MF-Ta S is in a forced exploration step. This, in turn, implies that i Se such that Nt,i,ai t/|A0|. Hence we have that: t/|A0| tv i,ai t/|A0| 2tε, where the last inequality follows by definition of t 0. (ii) If MF-Ta S is in a tracking step at t t 0, for all i [N], ai Ai, we have, Nt,i,ai twt,i,ai = min bi Ai (Nt,i,bi twt,i,bi) Xt,i,ai + (w i,ai wt,i,ai) = min bi Ai Xt,i,bi + t(w i,bi wt,i,bi) . Now, for all t t0, we have that Xt,i,ai + (w i,ai vt,i,ai) minbi Ai(Xt,i,bi + tε) tε, where we used that, i [N], minbi Ai Xt,i,bi 0 by (21). Now, by assumption, we have that for all t t 0 t0, |v i,ai vt,i,ai| ε, and we can conclude that (22) holds. Note that, for all i [N], we have Xt+1,i,ai = Xt,i,ai + 1{at+1,i=ai} v i,ai. Hence, for t t 0, we have Xt+1,i,ai Xt,i,ai + 1{Xt,i,ai 2tε} v i,ai. Best Arm Identification in Multi-Agent Multi-Armed Bandits We now prove by induction that for all t t 0, we have Xt,i,ai max{Xt 0,i,ai, 2tε + 1}. (23) The base case t = t 0 automatically holds. Now, assume that, at t t 0, (23) holds. If Xt,i,ai 2tε, we have Xt+1,i,ai 2tε + 1 v i,ai 2tε + 1 max{Xt 0,i,ai, 2tε + 1} max{Xt 0,i,ai, 2(t + 1)ε + 1}. On the other hand, if Xt,i,ai > 2tε, we have Xt+1,i,ai max{Xt 0,i,ai, 2tε + 1} v i,ai max{Xt 0,i,ai, 2(t + 1)ε + 1}. This concludes the induction step. Now, for t t 0, using that Xt 0,i,ai t 0 and 1/t ε, we have max i [N],ai Ai (K 1) max{2ε + 1/t, t 0/t} (K 1) max{3ε, t 0/t}. Hence, we can conclude that exists t1(ε) t 0 such that for all t t1(ε), max i [N],ai Ai Lemma C.3. Let w(1), w(2) ΛMF, and let v(1), v(2) be the corresponding local allocations. For all ε 0, we have that max i [N],ai Ai |v(1) i,ai v(2) i,ai| ε max a A |w(1) a w(2) a | Nε. Proof. We have that ε max i [N],ai Ai |v(1) i,ai v(2) i,ai| 1 i [N] max ai Ai |v(1) i,ai v(2) i,ai| 1 i [N] |v(1) i,ai v(2) i,ai|, (24) where the first inequality follows by hypothesis and the second and third by inequalities on the max. Next, we show by induction that, a A, X i [N] |v(1) i,ai v(2) i,ai| i [N] v(1) i,ai Y i [N] v(2) i,ai Let n [N], and define W (1) n = Q i [n] v(1) i,ai and W (2) n = Q i [n] v(2) i,ai. The base case n = 1 obviously holds true since. Suppose that |W (1) n W (2) n | P i [n] |v(1) i,ai v(2) i,ai|, for any n [N 1]. Then we have: |W (1) n+1 W (2) n+1| |v(1) n+1,an+1 v(2) n+1,an+1||W (1) n | + |W (1) n W (2) n ||v(2) n+1,an+1| X i [n+1] |v(1) i,ai v(2) i,ai|, where the inequalities follow by applying a triangular inequality and by the fact that |vi,ai| 1, i [N], ai Ai. This concludes the induction step. Hence, by (24), we get: i [N] |v(1) i,ai v(2) i,ai| 1 i [N] v(1) i,ai Y i [N] v(2) i,ai N max a A |w(1) a w(2) a |. By following a similar approach, we can prove the following corollary. Corollary C.4. Let w(1), w(2) ΛMF and let v(1), v(2) be the corresponding local allocations. For all ε > 0, we have that max i [N],ai Ai |v(1) i,ai v(2) i,ai| ε max e [ρ],ae Ae | w(1) e,ae w(2) e,ae| Nε. 
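To make the sampling rule analyzed in this appendix concrete, the sketch below shows one round of a D-tracking-style rule in the spirit of MF-TaS: forced exploration when some group action is under-explored, and local-allocation tracking otherwise. The data layout, the function name, and the rule used to pick an arm from A0 (simple cycling) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mf_tas_next_action(t, counts_group, counts_local, v_target, A0):
    """One round of a D-tracking-style sampling rule (illustrative sketch).

    counts_group[e][a_e]: times group action a_e of group e was observed (N_{t,e,a_e}).
    counts_local[i][a_i]: times agent i played local action a_i (N_{t,i,a_i}).
    v_target[i][a_i]: current plug-in optimal local allocation for agent i.
    A0: list of global actions (tuples) covering every group action at least once.
    """
    # Forced exploration: some group action is under-explored,
    # i.e., N_{t,e,a_e} < sqrt(t / |A0|) for some group e.
    threshold = np.sqrt(t / len(A0))
    if any((np.asarray(c) < threshold).any() for c in counts_group):
        return A0[t % len(A0)]  # assumption: cycle through the covering set A0
    # Tracking: each agent plays the local action whose count lags most
    # behind its target allocation t * v_target[i][a_i].
    return tuple(int(np.argmin(np.asarray(counts_local[i]) - t * np.asarray(v_target[i])))
                 for i in range(len(counts_local)))

# Toy usage: 2 agents, K = 2 local actions, one group containing both agents.
a = mf_tas_next_action(t=10,
                       counts_group=[np.array([[3, 2], [2, 3]])],
                       counts_local=[np.array([5, 5]), np.array([5, 5])],
                       v_target=[np.array([0.7, 0.3]), np.array([0.4, 0.6])],
                       A0=[(0, 0), (0, 1), (1, 0), (1, 1)])
print(a)
```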
Best Arm Identification in Multi-Agent Multi-Armed Bandits Proposition C.5. Let w = arg max w ΛMF ψ(θ, w), and let (v i )i [N] be the corresponding optimal local allocations. The MF-Ta S sampling rule satisfies i [N], ai Ai P lim t Nt,i,ai Proof. Let E = n ˆθt t θ o , and note that P(E) = 1. In fact, by Lemma C.1, there exists a finite t0 1 such that for all t t0, we have mine [ρ],ae Ae Nt,e,ae q |A0| . Hence, by the law of large numbers (each group action will be played infinite times), we have that ˆθt,e t θe a.s., and hence ˆθt t θ a.s.. Note that under the event E, by continuity of w (Lemma B.3), we have that there exists t0(ε) 1 such that: sup t t0(ε) max e [ρ],ae Ae | w e,ae wt,e,ae| Nε 3(K 1). Furthermore, by Corollary C.4, we have that supt t0(ε) maxi [N],ai Ai |v i,ai vt,i,ai| ε 3(K 1). Hence, using Lemma C.2, there exists tε t0(ε) such that for all t tε, sup t tε max i [N],ai Ai C.3. An algorithm for selecting A0 We present in Alg. 4, the pseudocode of a simple procedure for selecting A0. It takes as input the set of global actions A and the set of group actions Ae for all e [ρ]. Let Ie,ae = P b A0 1{ae=be} be the counter of group actions ae Ae in A0, and define Ie = [Ie,ae]ae Ae NAe. To describe the algorithm, we assume that A is an ordered set, and denote by A(i) is the i-th global action. First, the algorithm initializes A0 , and Ie = 0 NAe, e [ρ]. Then, the algorithm iterates over groups e [ρ] and groups actions be Ae, and iteratively includes arms in A into the set A0 which are never observed in previous iterates. By construction, Alg. 4, ensures that every group arm is observed at least once in A0. Algorithm 4 BUILD A0 Input: Global actions A, group actions (Ae)e [ρ] Initialize: A0 , Ie = 0 NAe, e [ρ], for e [ρ] do i 1 while minae Ae Ie,ae = 0 do a A(i) for be Ae do if be = ae and Ie,be = 0 then A0 A0 {a} Ie,be Ie,be + 1 end if end for i i + 1 end while end for Return A0 Note that Alg. 4 may be improved with more precise search strategies, at the cost of increased computational complexity. For example, the algorithm could include in A0, at each step, global arms a A which maximizes the number of non-observed group actions corresponding to the global actions in A0. Best Arm Identification in Multi-Agent Multi-Armed Bandits Example 1 (A0 action choice). Consider the example in Fig. 1 with K = 2 local actions and N = 4 agents (i.e., A = 16 actions). In such setting, we can select A0 = {a0000, a0110, a1001, a1111}. Running Alg. 4 on this instance instead produces the set A0 = {a0000, a0001, a0110, a0111, a1000, a1010, a1100}. D. Asymptotic Sample Complexity Upper Bound In this section, we present the proof of Theorem 6.1. The proofs of the results are quite similar to those of the related results in (Garivier & Kaufmann, 2016). We divide the proof for the guarantee in probability (a.s.) and in expectation. In the remainder of this section, we will make use of the following technical lemma. Lemma D.1 (Lemma 18 in (Garivier & Kaufmann, 2016)). For every α [1, e/2], and for any two constants c1, c2 > 0, + log log c2 is such that c1x log(c2xα). D.1. Almost Sure Upper Bound Proof. Define the event E = i [N], ai Ai, Nt,i,ai t w i,ai, ˆθt t θ . Observe that P(E) = 1. This follows directly from Lemma C.1 and Proposition C.5. Note that, under E, we also have that e [ρ], ae Ae, Nt,e,ae t t w e,ae. Let wt = (Nt,e,ae/t)e [ρ],ae Ae. As described in 6.3, the conditions that the stopping threshold β(δ, t) must satisfy are the same as in Sec. 
3.2 of (Wang et al., 2021): t 1, tψ( wt, ˆθt) β(δ, t) P[a θ = ˆat] δ, (25a) c1, c2 > 0 : t c1, β(δ, t) log(c2t/δ). (25b) Note that a threshold satisfying these properties exists (see e.g., (Kaufmann & Koolen, 2021)). In the following, we assume these conditions hold. Let ε > 0. Under the event E, by continuity of ψ (Lemma B.3), we have that there exists t1 such that for all t t1, ψ ˆθt, wt (1 ε)ψ (θ, w ). This implies that for all t t0 we have: τ = inf n t N>0 : tψ(ˆθt, wt) > β(δ, t) o t1 inf n t N>0 : t(1 ε)ψ (θ, w ) > β(δ, t) o t1 inf t N>0 : t T MF θ β(δ, t) c1 t1 inf t N>0 : t log(c2t)T MF θ (1 ε)δ where we used the fact that ψ(θ, w ) = (T MF θ ) 1 and the assumption on the stopping threshold (25b). Now, by applying Lemma D.1, for α = 1, c1 = (1 ε)δ T MF θ we get: τ c1 + t1 + T MF θ (1 ε)δ log T MF θ c2e (1 ε)δ + log log T MF θ c2 (1 ε)δ Hence, for all δ (0, 1), we have that: lim sup δ 0 τ log(1/δ) T MF θ Best Arm Identification in Multi-Agent Multi-Armed Bandits D.2. Expected Upper Bound Proof. Let ε > 0. By Lemma B.3, we have that w is continuous in θ, and hence there exists ξ(ε, θ) such that Iε = a A[θ(a) ξ, θ(a) + ξ], is such that for all θ Iε, we have max e [ρ],ae Ae | w e,ae(θ ) w e,ae(θ)| ε. Note that, when ˆθt Iε, we have ˆat = a θ. Define, for T N, h(T) = T 1/4, and let the good event be defined as: n ˆθt Iε o . We will now show that: (i) P[Ec T ] BT exp( CT 1/8), for some constants B, C (which depend on θ and ε). (ii) There exists a Tε such that for all T Tε, it holds, under the event ET , that T, max i [N],ai Ai We shall prove (i) first. By a union bound on Ec T , we have t=h(T ) P ˆθt / Iε = h P ˆθt,a θ(a) ξ + P ˆθt,a θa + ξ i . Now, let T be such that h(T) |A0|(|A0|+1) (|A0| 1) . Then, for t h(T), we have e [ρ], ae Ae, Nt,e,ae t/|A0| by Lemma C.1. Applying a union bound and Chernoff s inequality, we get: P ˆθt,a θ(a) ξ = P e [ρ] θe(ae) ξ e [ρ] P ˆθt,e,ae θe,ae ξ/ρ e [ρ] P ˆθt,e,ae θe(ae) ξ/ρ, Nt,e,ae t/|A0| P ˆθs,e,ae θe(ae) ξ/ρ t/|A0| exp skl(ˆθs,e,ae θe(ae) ξ/ρ t/|A0|kl(θe(ae) ξ/ρ, θe(ae)) 1 exp ( kl(θe(ae) ξ/ρ, θe(ae))) . Similarly, we can prove that P ˆθt,a θ(a) + ξ X t/|A0|kl(θe(ae) + ξ/ρ, θe(ae)) 1 exp ( kl(θe(ae) + ξ/ρ, θe(ae))) . Best Arm Identification in Multi-Agent Multi-Armed Bandits Hence, by letting exp ( kl(θe(ae) + ξ/ρ, θe(ae))/|A0|) 1 exp ( kl(θe(ae) + ξ/ρ, θe(ae))) + exp ( kl(θe(ae) ξ/ρ, θe(ae))/|A0|) 1 exp ( kl(θe(ae) ξ/ρ, θe(ae))) , e kl(θe(ae) + ξ/, θe(ae)) X e kl(θe(ae) + ξ/, θe(ae)) we have P(Ec T ) PT h(T ) B exp( C t) BT exp( C p h(T)) BT exp( CT 1/8). Note that (ii) follows directly from the definition of Iε and Lemma C.1. For T Tε, define the constant C ε (θ) = inf θ : θ θ ξ(ε) w : w w 3(K 1)ε Now, on the event ET (ε), it holds that for every and for all t h(T), we have that ˆat = a θ and ψ(ˆθt, wt) C ε (θ). Let T Tε, on ET (ε), we have T 1{tψ(ˆθt, wt) β(δ,t)} T 1{t C ε (θ) β(δ,T )} T + β(δ, T) Let us introduce T0(δ) = inf n T N : T + β(δ,T ) C ε (θ) T o . For every T max{T0(δ), Tε}, we have that ET (ε) {τ T}. Hence, we get P(τ > T) P(ET (ε)) BT exp( CT 1/8), and E[τ] T0(δ) + Tε + T =1 BT exp( CT 1/8). We now provide an upper bound on T0(δ). For ξ > 0, let C(ξ) = inf{T N : T T T/(1 + ξ)} Using the upper bound (25b) on the threshold β(δ, T), we have T ε 0 (δ) c1 + C(ξ) + inf T N : ln c2T C ε (θ) T 1 + ξ Using Prop. 8 in (Kaufmann & Koolen, 2021), it follows that T0(δ) c1 + C(ξ) + (1 + ξ) ln (1 + ξ)c2 ln (1 + ξ)c2 2 ln (1 + ξ)c2 Hence for any ξ > 0 and ε > 0, it holds that lim supδ 0 Eθ[τ] ln(1/δ) (1+ξ) C ε (θ). 
Letting $\varepsilon \to 0$ and $\xi \to 0$, and by continuity of $\psi$, we get $\lim_{\varepsilon \to 0} C^{*}_{\varepsilon}(\theta) = (T^{\mathrm{MF}}_{\theta})^{-1}$, which implies
$$\limsup_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \;\leq\; T^{\mathrm{MF}}_{\theta}.$$

E. Examples and results on VE and FCR
In this section, we present examples of the application of VE (see Alg. 3) and FCR (see Alg. 1) on a factor graph, in order to clarify their use. We also report results that ensure the correctness and bound the complexity of these methods.

E.1. Variable Elimination
We illustrate the use of VE to compute an optimal joint arm $a^{*}_{\theta}$.

Figure 5. Factor graph from Example 2, with agents $1, \dots, 4$ and factors $r_1, \dots, r_4$.

Example 2. Consider the factor graph in Fig. 5 with $N = \rho = 4$. The average reward is described as
$$\theta(a) = \theta_1(a_1, a_2) + \theta_2(a_2, a_4) + \theta_3(a_1, a_3) + \theta_4(a_3, a_4).$$
The key idea in VE is that, rather than summing all reward functions and then performing the maximization, we fix an ordering of the variables and maximize over one variable at a time, according to the predefined ordering. For example, fix the ordering $O = (a_4, a_3, a_2, a_1)$. Starting from $a_4$, we get
$$\max_{a \in \mathcal{A}} \theta(a) = \max_{a_1, a_2, a_3} \Big[\theta_1(a_1, a_2) + \theta_3(a_1, a_3) + \max_{a_4}\big(\theta_2(a_2, a_4) + \theta_4(a_3, a_4)\big)\Big].$$
Agent 4 can summarize the value that it brings to the system, when varying $(a_2, a_3)$, using a new function $p_4(a_2, a_3) = \max_{a_4} \theta_2(a_2, a_4) + \theta_4(a_3, a_4)$. Note that $p_4$ represents the best response of agent 4 conditioned on the actions played by agents 2 and 3. We also denote by $a^{*}_4(a_2, a_3) = \arg\max_{a_4} \theta_2(a_2, a_4) + \theta_4(a_3, a_4)$ the best action of agent 4 conditioned on the actions of agents 2 and 3. Hence, we get
$$\max_{a \in \mathcal{A}} \theta(a) = \max_{a_1, a_2, a_3} \theta_1(a_1, a_2) + \theta_3(a_1, a_3) + p_4(a_2, a_3).$$
Next, we do the same for agent 3: we denote $p_3(a_1, a_2) = \max_{a_3} \theta_3(a_1, a_3) + p_4(a_2, a_3)$ and $a^{*}_3(a_1, a_2) = \arg\max_{a_3} \theta_3(a_1, a_3) + p_4(a_2, a_3)$, and we reduce the problem to
$$\max_{a \in \mathcal{A}} \theta(a) = \max_{a_1, a_2} \theta_1(a_1, a_2) + p_3(a_1, a_2).$$
Next, agent 2 computes her best response $p_2(a_1) = \max_{a_2} \theta_1(a_1, a_2) + p_3(a_1, a_2)$, with $a^{*}_2(a_1) = \arg\max_{a_2} \theta_1(a_1, a_2) + p_3(a_1, a_2)$. Agent 1 can then simply select the action maximizing $p_2$, i.e., $p_1 = \max_{a_1} p_2(a_1)$. We can recover the best joint action $a^{*} = (a^{*}_1, a^{*}_2, a^{*}_3, a^{*}_4)$ by performing the entire process in reverse order: $a^{*}_1 = \arg\max_{a_1} p_2(a_1)$, $a^{*}_2 = \arg\max_{a_2} \theta_1(a^{*}_1, a_2) + p_3(a^{*}_1, a_2)$, $a^{*}_3 = \arg\max_{a_3} \theta_3(a^{*}_1, a_3) + p_4(a^{*}_2, a_3)$, and $a^{*}_4 = \arg\max_{a_4} \theta_2(a^{*}_2, a_4) + \theta_4(a^{*}_3, a_4)$.

Complexity of VE. VE is guaranteed to return the optimal global arm in $O(N K^{A_O + 1})$ operations (Dechter, 1999), where $A_O = \max_{i \in [N]} |\mathrm{SC}(p_{O(i)})|$ is the size of the largest factor generated when using elimination order $O$. The complexity of VE thus depends on the elimination order $O$: it is linear in $N$ and exponential in the maximal size of the scope of the best-response functions introduced in the elimination process.

Remark E.1. Note that the complexity is linear in $N$ for any elimination order $O$. However, finding the optimal ordering, i.e., the one minimizing $A_O$, is an NP-hard problem (Dechter, 1999). This issue has been addressed successfully for a large variety of graph structures in the graphical model community, where there exists a variety of good heuristics for the VE ordering problem (Wainwright & Jordan, 2008). In addition, there are approximate (and more efficient) alternatives to VE (e.g., the max-plus algorithm (Wainwright & Jordan, 2008)), but using those methods forfeits the exactness guarantees of VE.
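The elimination steps of Example 2 can be reproduced with a short numerical sketch. The group mean tables below are randomly generated placeholders (in the bandit problem the $\theta_e$ are unknown), and the elimination follows the order $O = (a_4, a_3, a_2, a_1)$; the result is checked against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2  # actions per agent

# Placeholder group mean tables for the factor graph of Example 2:
# theta(a) = th1(a1,a2) + th2(a2,a4) + th3(a1,a3) + th4(a3,a4)
th1, th2, th3, th4 = (rng.standard_normal((K, K)) for _ in range(4))

# Eliminate agent 4: best-response value p4(a2,a3) and best action a4*(a2,a3)
p4 = np.empty((K, K)); a4_star = np.empty((K, K), dtype=int)
for a2, a3 in itertools.product(range(K), repeat=2):
    vals = [th2[a2, a4] + th4[a3, a4] for a4 in range(K)]
    p4[a2, a3], a4_star[a2, a3] = max(vals), int(np.argmax(vals))

# Eliminate agent 3: p3(a1,a2)
p3 = np.empty((K, K)); a3_star = np.empty((K, K), dtype=int)
for a1, a2 in itertools.product(range(K), repeat=2):
    vals = [th3[a1, a3] + p4[a2, a3] for a3 in range(K)]
    p3[a1, a2], a3_star[a1, a2] = max(vals), int(np.argmax(vals))

# Eliminate agent 2: p2(a1)
p2 = np.empty(K); a2_star = np.empty(K, dtype=int)
for a1 in range(K):
    vals = [th1[a1, a2] + p3[a1, a2] for a2 in range(K)]
    p2[a1], a2_star[a1] = max(vals), int(np.argmax(vals))

# Agent 1, then recover the optimal joint action in reverse order
a1 = int(np.argmax(p2)); a2 = int(a2_star[a1]); a3 = int(a3_star[a1, a2]); a4 = int(a4_star[a2, a3])
print("VE optimum:", (a1, a2, a3, a4), "value:", p2[a1])

# Sanity check against exhaustive enumeration over the K^N global actions
theta = lambda a: th1[a[0], a[1]] + th2[a[1], a[3]] + th3[a[0], a[2]] + th4[a[2], a[3]]
best = max(itertools.product(range(K), repeat=4), key=theta)
assert np.isclose(p2[a1], theta(best))
print("Brute force:", best)
```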
E.2. Factored Constraint Reduction
We provide an example of the application of FCR to reduce a combinatorial number of constraints. We consider the set of constraints
$$z \;\geq\; \frac{2}{\Delta_{\min}^{2}} \sum_{e \in [\rho]:\, a_e \neq a^{*}_e} \underbrace{\big(w_{e,a_e}^{-1} + w_{e,a^{*}_e}^{-1}\big)}_{f_e(a_e)}, \qquad \forall a \in \mathcal{A}.$$
These constraints can be equivalently rewritten as $z \geq \max_{a \in \mathcal{A}} \sum_{e \in [\rho]} f_e(a_e)$ (up to the gap-dependent factor, which we omit in the remainder of this example to lighten notation).

Example 3. For the graph in Ex. 2 we have:
$$z \;\geq\; \max_{a_1, a_2, a_3, a_4}\; f_1(a_1, a_2) + f_2(a_2, a_4) + f_3(a_1, a_3) + f_4(a_3, a_4).$$
We introduce a set of variables $(u^{f_e}_{a_e})_{e \in [\rho], a_e \in \mathcal{A}_e}$ and the equality constraints $u^{f_e}_{a_e} = w^{-1}_{e,a_e}$, for all $e \in [\rho]$, $a_e \in \mathcal{A}_e$. Note that we can then rewrite $f_e(a_e) = u^{f_e}_{a_e} + u^{f_e}_{a^{*}_e}$. Fix the elimination ordering $O = (4, 3, 2, 1)$ and let $F = \{f_1, f_2, f_3, f_4\}$. At each step $i$, we introduce a new function $p_l$ into $F$ by eliminating the variable $l = O(i)$. For $i = 1$, we have $O(1) = 4$ and $F_{O(1)} = \{f_2(a_2, a_4), f_4(a_3, a_4)\}$, and we associate to the new function $p_4$ the variables $u^{p_4}_{a_2, a_3}$, for all $a_2, a_3$. We introduce the set of constraints:
$$u^{p_4}_{a_2, a_3} \;\geq\; u^{f_2}_{a_2, a_4} + u^{f_2}_{a^{*}_2, a^{*}_4} + u^{f_4}_{a_3, a_4} + u^{f_4}_{a^{*}_3, a^{*}_4}, \qquad \forall a_2, a_3, a_4,$$
and we include these in the constraint set $\mathcal{K}$. We further exclude the functions $f_2$ and $f_4$ from the set $F$, while including $p_4(a_2, a_3)$. Subsequently, we consider $O(2) = 3$. Then $F_{O(2)} = \{p_4(a_2, a_3), f_3(a_3, a_1)\}$. We introduce the new constraints:
$$u^{p_3}_{a_1, a_2} \;\geq\; u^{p_4}_{a_2, a_3} + u^{p_4}_{a^{*}_2, a^{*}_3} + u^{f_3}_{a_3, a_1} + u^{f_3}_{a^{*}_3, a^{*}_1}, \qquad \forall a_1, a_2, a_3,$$
and we add them to the constraint set $\mathcal{K}$. We proceed to eliminate $p_4$ and $f_3$ from $F$ and include $p_3$. We then move to $O(3) = 2$ and define $F_{O(3)} = \{f_1(a_1, a_2), p_3(a_1, a_2)\}$. The constraints introduced at this step are:
$$u^{p_2}_{a_1} \;\geq\; u^{p_3}_{a_1, a_2} + u^{p_3}_{a^{*}_1, a^{*}_2} + u^{f_1}_{a_1, a_2} + u^{f_1}_{a^{*}_1, a^{*}_2}, \qquad \forall a_1, a_2,$$
and, similarly to the previous steps, we add these constraints to $\mathcal{K}$ and eliminate $p_3$ and $f_1$ from $F$, while including $p_2$. The last step, at $O(4) = 1$, consists of simply including in $\mathcal{K}$ the constraints $u^{p_1} \geq u^{p_2}_{a_1}$, for all $a_1 \in \mathcal{A}_1$. Finally, we add to $\mathcal{K}$ the constraint $z \geq u^{p_1}$, and output $\mathcal{K}$.

Correctness and Complexity of FCR. The number of constraints in the set $\mathcal{K}$ scales as $O(N K^{A_O})$, where $A_O = \max_{i \in [N]} |\mathrm{SC}(p_{O(i)})|$ is the size of the maximal scope induced by the chosen elimination order $O$. Note that FCR also introduces $O(N K^{A_O})$ new variables in the optimization problem. Hence, similarly to VE, the number of constraints and variables needed to represent an exponentially large constraint set depends linearly on $N$ and exponentially only on the width of the induced graph, i.e., it is $O(N \exp(A_O))$. Furthermore, the following lemma, adapted from Theorem 4.4 in (Guestrin et al., 2003), establishes the correctness of the FCR algorithm.

Lemma E.2 ((Guestrin et al., 2003), Theorem 4.4). Let $\mathcal{K} = \mathrm{FCR}(\mathcal{C})$. Then $\mathcal{C}$ and $\mathcal{K}$ are equivalent, that is, an assignment $(z, w)$ is feasible for $\mathcal{C}$ if and only if there exists $p$ such that $(p, z, w)$ is feasible for $\mathcal{K}$.
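To illustrate why this reduction loses nothing, the following sketch checks numerically, on the graph of Example 3 and under hypothetical allocation values (all names are ours), that the nested best-response functions $p_4, p_3, p_2$ reproduce the single maximum over all $K^N$ global actions, which is the mechanism behind the equivalence of Lemma E.2.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K = 2
a_star = (0, 0, 0, 0)  # hypothetical best arm

# Placeholder group allocations and the induced terms
# f_e(a_e) = w_{e,a_e}^{-1} + w_{e,a_e*}^{-1} (gap factor omitted, as in the example).
w1, w2, w3, w4 = (rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(4))
f1 = 1.0 / w1 + 1.0 / w1[a_star[0], a_star[1]]
f2 = 1.0 / w2 + 1.0 / w2[a_star[1], a_star[3]]
f3 = 1.0 / w3 + 1.0 / w3[a_star[0], a_star[2]]
f4 = 1.0 / w4 + 1.0 / w4[a_star[2], a_star[3]]

# FCR-style elimination (order 4, 3, 2, 1): each p-function is the tightest
# value compatible with the inequality constraints introduced at that step.
p4 = np.array([[max(f2[a2, a4] + f4[a3, a4] for a4 in range(K))
                for a3 in range(K)] for a2 in range(K)])
p3 = np.array([[max(f3[a1, a3] + p4[a2, a3] for a3 in range(K))
                for a2 in range(K)] for a1 in range(K)])
p2 = np.array([max(f1[a1, a2] + p3[a1, a2] for a2 in range(K)) for a1 in range(K)])
z = p2.max()  # a single constraint z >= p1 replaces the K^N original constraints

# The reduced chain reproduces the maximum over all K^N global actions.
brute = max(f1[a[0], a[1]] + f2[a[1], a[3]] + f3[a[0], a[2]] + f4[a[2], a[3]]
            for a in itertools.product(range(K), repeat=4))
assert np.isclose(z, brute)
print("reduced max:", z, "| brute-force max over", K ** 4, "actions:", brute)
```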
E.3. m-BEST algorithm
In this section, we discuss an algorithm to find the $m$ best global arms. As explained in App. H, a set of tighter approximations can be built by considering an ordering of the first $m$ smallest gaps, and hence requires computing the $m + 1$ global arms with the highest expected rewards. The m-BEST algorithm of Lawler and Nilsson (Lawler, 1972; Nilsson, 1998), briefly described in the remainder of this section, serves this purpose. The procedure was originally devised to compute the $m$ most probable configurations in graphical models. The main idea is the following: at each step, m-BEST finds the best solution to a reformulation of the original problem that excludes the solutions already discovered. Specifically, at each iteration $j < m$, the algorithm runs VE while excluding the first $j$ most probable configurations.

Lawler's algorithm (Lawler, 1972) starts by computing the best global action $a^{(1)}$ by applying VE (with elimination order $O$) over the combinatorial action space $\mathcal{A}$. To determine the second best action $a^{(2)}$, the algorithm searches over the set $\mathcal{A}^{(2)} = \mathcal{A} \setminus \{a^{(1)}\}$ by applying VE $N$ times. More generally, at iteration $j$, the algorithm finds the $j$-th best global action $a^{(j)}$ by running VE over the set $\mathcal{A}^{(j)} = \mathcal{A} \setminus \bigcup_{k \in [j-1]} \{a^{(k)}\}$. This procedure provably identifies the $m$ best global actions with complexity $O(m N^{2} K^{A_O + 1})$. By leveraging similar ideas and using a junction tree representation of the graph, Nilsson (Nilsson, 1998) improves over this procedure, leading to an m-best algorithm with complexity $O(m N K^{A_O + 1})$.

F. Regret Minimization in MAMABs
In this section, we consider the MAMAB model of Section 3 in the regret minimization setting. We provide a lower bound on the regret and an approximation of this lower bound, using techniques similar to those used in the BAI setting. In regret minimization, the goal is to devise an algorithm $\pi$ minimizing the regret up to time $T \geq 1$, defined as
$$R^{\pi}(T) = \mathbb{E}\Big[\sum_{t=1}^{T} \theta(a^{*}_{\theta}) - \theta(a_t)\Big] = \sum_{a \neq a^{*}_{\theta}} \big(\theta(a^{*}_{\theta}) - \theta(a)\big)\, \mathbb{E}[N_{T,a}],$$
where $a_t \in \mathcal{A}$ is the action selected by the algorithm $\pi$ at time $t$.

F.1. Regret lower bound
We now give an instance-specific lower bound on the regret in the MAMAB setting. The regret lower bound is stated for the class of uniformly good algorithms, according to the following definition.

Definition F.1. An algorithm $\pi$ is uniformly good if, for all $\theta$, $R^{\pi}(T) = o(T^{\alpha})$ for all $\alpha > 0$.

The following theorem gives a lower bound on the regret of any uniformly good algorithm. It is a direct consequence of the analogous lower bound in the combinatorial semi-bandit feedback setting, given, e.g., in Theorem 1 in (Cuvelier et al., 2021b) or Theorem 12 in (Wagenmaker et al., 2020).

Theorem F.2. Let $\pi$ be a uniformly good algorithm. Then, for all $\theta$,
$$\liminf_{T \to \infty} \frac{R^{\pi}(T)}{\log(T)} \;\geq\; c^{*}_{\theta}, \qquad (26)$$
where $c^{*}_{\theta}$ is the value of the optimization problem:
$$\begin{aligned} \min_{w \in \mathbb{R}^{\mathcal{A}}_{\geq 0},\, (w_{e,a_e}) \geq 0} \quad & \sum_{e \in [\rho]} \sum_{a_e \in \mathcal{A}_e} w_{e,a_e}\,\big(\theta_e(a^{*}_e) - \theta_e(a_e)\big) \\ \text{subject to} \quad & \sum_{e \in [\rho]:\, a_e \neq a^{*}_e} w^{-1}_{e,a_e} \;\leq\; \Delta(a)^2/2, \qquad \forall a \in \mathcal{A} \setminus \{a^{*}_{\theta}\}, \qquad (27) \\ & w_{e,a_e} = \sum_{b \in \mathcal{A}:\, b_e = a_e} w_b, \qquad \forall e \in [\rho],\, a_e \in \mathcal{A}_e. \end{aligned}$$

Note that the structure of the regret lower bound is similar to that of the M-BAI lower bound. The challenges are also similar: the optimization problem (27) has a combinatorial number of variables and constraints.
In MAMAB, the vector ϕ has a block structure: it can be decomposed as ϕ(a) = [ϕe(be)]e [ρ],be Ae, where ϕe(be) {0, 1}Ae is a group vector ϕe(be) = 1{ae=be}, i.e., containing 1 in correspondence of the activated group action ae. Further define θ = [θe(ae)]e [ρ],ae Ae R A, i.e., θ is the vector containing the local mean parameters. At round t 1, a global action at A is selected by the learner, and she observes: rt,e,ae = θe(ae), e [ρ] : ϕe(at,e) = 1. In other words, in the semi-bandit feedback setting, a (joint) action a A is selected and the learner observes a vector of rewards [re(ae)]e [ρ],ae a, where re(ae) = θe(ae) ϕe(ae) + ηe, where ηe N(0, 1) is i.i.d. Gaussian Noise. Note that the feature vectors satisfy ϕ(a) 0 = ρ, a A and ϕe(ae) 0 = 1, e [ρ], ae Ae. In order to further clarify the connection to the semi-bandit feedback, we provide a concrete example below. Example 4. Consider the factor graph in Fig. 6 with N = 3 agents, ρ = 2 groups, and K = 2 actions. The reward can be written as r(a1, a2, a3) = r1(a1, a2) + r2(a2, a3). Let ai {0, 1}, for all i [N]. The average reward can be expressed by the vector θ = [θ1(a1, a2)](a1,a2) {0,1}2, [θ2(a2, a3)](a2,a3) {0,1}2 R8, where [θ1(a1, a2)](a1,a2) {0,1}2 = [θ1(0, 0), θ1(0, 1), θ1(1, 0), θ1(1, 1)] [θ2(a2, a3)](a2,a3) {0,1}2 = [θ2(0, 0), θ2(0, 1), θ2(1, 0), θ2(1, 1)]. For example, selecting action a = (0, 0, 0), corresponds to the feature vector ϕ(a) = (1, 0, 0, 0, 1, 0, 0, 0), while selecting action b = (0, 1, 0) corresponds to the feature vector ϕ(b) = (0, 1, 0, 0, 0, 0, 1, 0). Figure 6. Factor graph from Example 4. Best Arm Identification in Multi-Agent Multi-Armed Bandits H. Tighter Approximations H.1. Tighter constraints reduction In this section, we propose tighter constraint approximations by leveraging an ordering of the arms and of the sub-optimality gaps. For m [KN], let a(m) be the mth best arm and, for m [KN 1], let m = θ(a θ) θ(a(m+1)) the mth minimal non-zero gap (with ties breaking arbitrarily). Lemma H.1. Let m [KN 1], and T MF θ (m) be the solution to the following optimization problem: inf w ΛMF,z R z subject to z 1 e [ρ]:a(j+1) e =a e w 1 e,a e + w 1 e,a(j+1) e , j [m] e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ j [m]{a(j+1)}. Then, T MF θ (m + 1) T MF θ (m), m [KN 2], and T θ T MF θ (m), m [KN 1]. Note that T MF θ (1) = T MF θ , and for m > 1, T MF θ (m) provides provably tighter approximations. This approximation gain comes at the cost of increased computational complexity. Indeed, to solve T MF θ (m), one needs to compute the first m + 1 best arms and the m minimal gaps. To solve this task, there exist algorithm having complexity O((m + 1)NKAO+1) (see Sec. E.3). This result shows that for increasingly larger values of m, the approximation gets tighter, but the computational complexity increases. Hence, the approximations T MF θ (m) allow for an interplay between sample complexity and computational complexity when varying m. Sec. J.1 provides numerical results on this trade-off. There exists an interesting trade-off between sample complexity and computational complexity. The smallest sample complexity achievable is the true lower bound constant, i.e. T θ , which is the solution to an optimization problem with O(KN) variables and constraints. The approximation T MF θ has generally higher complexity, but its computational complexity is characterized by O(m NKAO), where AO is typically much smaller than N. As explained above, for m = 1 T MF θ (m) reduces to the MF approximation T MF θ . 
For illustration purposes we also report the sample complexity 2 A/ 2 min which is achieved by the random allocation w = (1/A)a A Λ. Proof of Lemma H.1. First, we shall prove T MF θ (m + 1) T MF θ (m), m [KN 1], by induction. The base case, for m = 1, is T MF θ (2) T MF θ (1). Note that T MF θ (1) can be written as inf w ΛMF,z R z (29) subject to z 1 e [ρ]:a(2) e =a e w 1 e,a e + w 1 e,a(2) e (30) e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ {a(1), a(2)}, (31) while T MF θ (2) is defined as inf w ΛMF,z R z (32) subject to z 1 e [ρ]:a(2) e =a e w 1 e,a e + w 1 e,a(2) e (33) e [ρ]:a(3) e =a e w 1 e,a e + w 1 e,a(3) e (34) e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ {a(1), a(2), a(3)}. (35) Best Arm Identification in Multi-Agent Multi-Armed Bandits The constraints (30) and (33) are identical. The constraints (34)-(35) can be simply written as e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ {a(1), a(2)}, which corresponds to (31) with the exception of the term 1 2 1 in place of 1 2 2 . As 1 2, we naturally conclude that T MF θ (2) T MF θ (1). Now, assume that T MF θ (m + 1) T MF θ (m) holds for m 1, to complete the induction we need to show that T MF θ (m + 1) T MF θ (m). By following a similar approach to the base case, we can show that the only difference in the optimization problems defining T MF θ (m) and T MF θ (m + 1) is in the last set of constraints: for T MF θ (m) these constraints are e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ j [m]{a(j+1)}, while for T MF θ (m + 1) they can be written as e [ρ]:ae =a e w 1 e,a e + w 1 e,ae, a A \ j [m]{a(j+1)}. As we have m m+1, we can conclude that T MF θ (m + 1) T MF θ (m). Now, to complete the proof, we show that T θ T MF θ (m), m [KN 1]. In light of the fact that T MF θ (m+1) T MF θ (m), it is sufficient to prove that T θ T MF θ (KN 1). It is easy to check that T MF θ (KN 1) can be written as inf w ΛMF max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae (a)2 . (36) Eq. (36) is essentially the same as T θ in (1), with the difference that the allocations variables are in ΛMF. As ΛMF Λ we conclude that T θ T MF θ (KN 1). H.2. Tighter variables reduction: Structured Mean Field In this section we present tighter variable reduction schemes that consider a different factorization factorization w.r.t. allocation distributions named Structured Mean Field (SMF). Let G be a set of subsets of [N] and consider the following factorization wa = Y g G wγg g,ag, a A, where ag denotes the sub-vector of actions corresponding to indices g G, and γg > 0. Let ΛG denote the set of distributions satisfying this factorization, and ΛG the corresponding marginal polytope. The approximation is essentially obtained replacing Λ by ΛG in (4.1). Note that if G = {{1}, ..., {N}}, and αg = 1, g G, we recover the MF approximation T MF θ . In general tighter (or even exact approximations) are possible. Lemma H.2. For any valid factorization G, we have that T θ T G θ , where T G θ = inf w ΛG max a =a e [ρ]:ae =a e w 1 e,a e + w 1 e,ae P e [ρ] θe(a e) θe(ae) Best Arm Identification in Multi-Agent Multi-Armed Bandits I. Details on antenna tilt optimization experiments In this section, we present details on the antenna tilt optimization experiments. Throughput. The throughput Ti,u, is formally defined in terms of the Signal-to-Interference-plus-Noise Ratio (SINR), a metric that measures the quality of a signal in the presence of interference and noise. 
Let $a_e$ denote the group vector containing the tilts of cell $i$ and of its neighboring cells $\mathcal{N}_i$. Specifically, the SINR of a UE $u \in U$ connected to cell $i \in \mathcal{C}$ is defined as:
$$\mathrm{SINR}_{i,u}(a_e) = \frac{P_i\, G_{i,u}(a_i)\, L_{i,u}(a_i)}{\sum_{k \in \mathcal{N}_i} P_k\, G_{k,u}(a_k)\, L_{k,u}(a_k) + \sigma},$$
where $P_i$, $G_{i,u}$, and $L_{i,u}$ are the transmitter antenna power, the gain of the transmitter antenna, and the path loss for UE $u$ connected to cell $i$, respectively. The gain is influenced by antenna parameters such as tilt and azimuth, and the path loss accounts for the transmission medium and obstacles (e.g., buildings, atmospheric conditions, vegetation, etc.). The throughput $T_{i,u}$ experienced by UE $u$ connected to cell $i$ is then expressed as a function of the SINR and the available bandwidth:
$$T_{i,u} = \omega_B\, n^{R}_{i,u} \log_2\!\big(1 + \mathrm{SINR}_{i,u}\big),$$
where $n^{R}_{i,u}$ is the number of Physical Resource Blocks (PRBs) allocated to UE $u$ in cell $i$ and $\omega_B$ is the bandwidth per PRB (180 kHz). We use the average throughput of a cell in our group reward definition, i.e., $r_i(a_e) = \frac{1}{|U_i|} \sum_{u \in U_i} T_{i,u}(a_e)$; the global reward is then obtained by summing the group rewards over the groups.

On the noise independence assumption. In our experiments, each group $e \in [\rho]$ corresponds to a sector: more precisely, it consists of an antenna $i \in [N]$ serving the users $u \in U_i$ connected to this sector, and of the set of antennas that can interfere with the transmissions of antenna $i$. Recall that the group reward is defined as $r_e(a_e) = \sum_{u \in U_i} T_{i,u}(a_e)$, where $a_e$ represents the tilts of the antennas in group $e$. The throughput $T_{i,u}(a_e)$ is the rate at which user $u$ can decode transmissions from antenna $i$. This rate depends on the random channel conditions (also known as fading) between each antenna in the group and user $u$. Now, the fadings between pairs of (antenna, user) are typically stochastically independent across users and antennas (Tse & Viswanath, 2009). Since the sets $(U_i)_{i \in [N]}$ form a partition, they do not overlap, and the random variables $r_e(a_e)$ are indeed independent across groups. They can be modeled as independent Gaussian realizations in the sum-throughput over groups. For details, refer, e.g., to (Tse & Viswanath, 2009).

Additional details. The set of UEs in the network is $U = \bigcup_{i} U_i$, as presented in Sec. 7.2. The number of UEs connected to cell $i$ is affected by tilt variations, since we assume that UEs connect to the cell from which they receive the maximum Reference Signal Received Power (RSRP). In particular, given a tilt configuration $a$, the UEs in cell $i$ are defined as
$$U_i = \Big\{ u \in U : \arg\max_{k \in [N]} P_k\, G_{k,u}\, L_{k,u} = i \Big\}.$$
There exist other methods to determine the relations between antennas, relying on automated procedures, domain knowledge, and heuristics. For example, they may be based on the geographic distance between cells, on Automatic Neighbor Relations (ANR) as defined in the 3GPP standards, on network planning tools for coverage prediction, or on cell handover logs (Rappaport, 2001). In addition, domain knowledge can be used to refine the graph topology by pruning or adding edges based on key features of a city or on knowledge about the terrain (if there is a natural obstacle, for example). Analyzing the influence of the graph structure is not in the scope of this paper and is left as future work.
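As an illustration of the formulas above, the following sketch computes the SINR and the per-UE throughput; all numerical values, units, and function names are placeholders and do not come from the simulator used in the paper.

```python
import numpy as np

def sinr(p_tx, gain, path_loss, interferers, noise_power):
    """SINR of a UE attached to cell i (linear scale, not dB).

    p_tx, gain, path_loss: serving-cell transmit power, antenna gain and path
    loss towards the UE.  interferers: (power, gain, path_loss) triples for
    the neighbouring cells in the group.
    """
    signal = p_tx * gain * path_loss
    interference = sum(p * g * l for p, g, l in interferers)
    return signal / (interference + noise_power)

def throughput(n_prb, sinr_value, prb_bandwidth_hz=180e3):
    """Shannon-type throughput: T = omega_B * n_PRB * log2(1 + SINR)."""
    return prb_bandwidth_hz * n_prb * np.log2(1.0 + sinr_value)

# Toy numbers, purely illustrative.
s = sinr(p_tx=40.0, gain=15.0, path_loss=1e-9,
         interferers=[(40.0, 12.0, 2e-10), (40.0, 10.0, 1e-10)],
         noise_power=1e-12)
print("SINR:", s, "| throughput (Mbit/s):", throughput(n_prb=10, sinr_value=s) / 1e6)
```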
J. Additional Experiments
J.1. Tighter bounds experiments
In this section, we propose a set of experiments to test the effect of the tighter lower bound approximations $T^{\mathrm{MF}}_{\theta}(m)$ presented in App. H. We consider different instances of the line and ring graphs with varying $N \in \{3, 4, 5, 6, 7\}$, $K \in \{2, 3, 4, 5\}$, and $m \in \{1, \dots, 10\}$. The group means $(\theta_e)_{e \in [\rho]}$ are generated uniformly at random. The results, averaged over $N_{\mathrm{sim}} = 5$ independent runs, are reported in Fig. 7 in terms of the mean and standard deviation of the normalized $T^{\mathrm{MF}}_{\theta}(m)$.

Figure 7. Normalized $T^{\mathrm{MF}}_{\theta}(m)$ for varying $m$, $K$, and $N$; panels (a)-(c): line graph with $K = 3, 4, 5$; panels (d)-(f): ring graph with $K = 3, 4, 5$.

As expected, $T^{\mathrm{MF}}_{\theta}(m)$ is a monotonically decreasing function of $m$ for all considered graphs. The approximation for $m = 1$ corresponds to the MF approximation $T^{\mathrm{MF}}_{\theta}$. Hence, the curves in Fig. 7 represent the improvement over the MF approximation (recall that $T^{\mathrm{MF}}_{\theta} \geq T^{\mathrm{MF}}_{\theta}(m)$ for any $m$). We observe that, for increasing values of $N$, the improvement over $T^{\mathrm{MF}}_{\theta}$ gets larger, while the effect of larger $K$ is less significant.

J.2. Quantifying the approximation
In this section, we provide numerical results on the approximation ratio $T^{\mathrm{MF}}_{\theta} / T^{*}_{\theta}$. We consider different instances of line and ring graphs with varying $N$ and $K \in \{2, 3, 4, 5\}$. The group means $(\theta_e)_{e \in [\rho]}$ are generated uniformly at random, as described in Sec. 7.1. The results, averaged over $N_{\mathrm{sim}} = 5$ independent runs, are reported in terms of the mean and standard deviation of the ratio $T^{\mathrm{MF}}_{\theta} / T^{*}_{\theta}$, together with its upper bound $4A/\rho$ (see Lem. A.3). The results are shown in Fig. 8. It can be observed that, as conjectured in App. A.3, the upper bound on the approximation ratio is loose on these instances, and the actual value of $T^{\mathrm{MF}}_{\theta} / T^{*}_{\theta}$ is generally significantly smaller than the bound.

Figure 8. Approximation ratio $T^{\mathrm{MF}}_{\theta} / T^{*}_{\theta}$ vs. its upper bound $4A/\rho$: (a) line graph, (b) ring graph.
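For completeness, the sketch below shows one way to generate synthetic line and ring instances of the kind used in these experiments (pairwise groups between consecutive agents, group means drawn uniformly at random). The group structure and value ranges are assumptions made for illustration, not the exact generator used in the paper.

```python
import numpy as np

def make_instance(N, K, topology="line", seed=0):
    """Generate a synthetic MAMAB instance: a line or ring factor graph over
    N agents with K actions per agent and uniformly random group means."""
    rng = np.random.default_rng(seed)
    groups = [(i, i + 1) for i in range(N - 1)]   # consecutive-agent pairs
    if topology == "ring":
        groups.append((N - 1, 0))                 # close the cycle
    # theta[e][a_i, a_j] is the mean reward of group e for a pair of local actions
    theta = [rng.uniform(0.0, 1.0, size=(K, K)) for _ in groups]
    return groups, theta

groups, theta = make_instance(N=5, K=3, topology="ring")
print(len(groups), "groups; first group:", groups[0])
```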