# multiagent_best_arm_identification_with_private_communications__41a1b39c.pdf

Multi-Agent Best Arm Identification with Private Communications

Alexandre Rio 1 Merwan Barlier 1 Igor Colin 1 Marta Soare 2

We address multi-agent best arm identification with privacy guarantees. In this setting, agents collaborate by communicating to find the optimal arm. To avoid leaking sensitive data through messages, we consider two notions of privacy withholding different kinds of information: differential privacy and (ϵ, η)-privacy. For each privacy definition, we propose an algorithm based on a two-level successive elimination scheme. We provide theoretical guarantees for the privacy level, accuracy and sample complexity of our algorithms. Experiments on various settings support our theoretical findings.

1. Introduction

The process of gathering data to identify the best option in a stochastic multi-armed bandit is known as pure exploration or best arm identification (BAI). The goal for the learner is to suggest the most valuable option given the data collected through sequential interaction with the system. In the specific fixed-confidence setting, we seek to achieve this goal using as few interactions as possible, while guaranteeing that the actual best option is returned with fixed, high probability. Motivating examples include automatic parameter tuning of telecommunication antennas and pharmaceutical drug selection. Antenna performance depends on a large number of parameters, making it difficult and time-consuming for engineers to manually explore the space of configurations and find the one that will optimize the user experience. Selecting the best drug among multiple candidates is also a complex and costly process. Re-framing these problems as best arm identification in multi-armed bandits and relying on automatic strategies can be very effective.

Such problems usually involve multiple agents that share and interact with the same environment, generating as many

1Huawei Noah s Ark Lab, Paris, France 2Universit e d Orl eans, Universit e Grenoble Alpes, CNRS LIG, France. Correspondence to: Alexandre Rio <alexandre.rio2@huawei.com>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

data streams. Clinical trials involve many participants to select the most appropriate drug for the majority of the population. Antennas are generally distributed over networks spanning a specific region. Moreover, in many situations, data is distributed by nature, and cannot be consolidated and processed on a single server. It is therefore interesting for the companies to model such problems as multi-agent systems and try to exploit the data collected by each agent (i.e., locally) to learn the optimal decision at the system level (i.e., globally). We therefore refer to this problem as multi-agent best arm identification. Collaboration among agents is made effective as each agent broadcasts messages, reflecting the feedback they receive from the system, to a central coordinator involved in the decision process.

Large amounts of data can be handled during such learning processes, and some of them are potentially sensitive (e.g., personal data or highly valued industrial information). Hence, it is critical to design multi-agent BAI methods that ensure the privacy of the data identified as sensitive. In many applications, one may want to protect the rewards used as inputs of the learning algorithm, especially since they reflect the preferences of the agents. For instance, confidential medical data may be at risk during a drug selection study. In this situation, the well-known differential privacy (DP) (Dwork, 2006) framework ensures good guarantees against privacy leakage. Some other cases may require protecting the algorithm s output, that is the global optimal action. In our motivating examples, both the optimal antenna configuration and the best candidate drug are indeed of great value, as the product of the expertise of the engineers. Protecting it from competing firms is therefore of high importance. These concerns call for another notion of privacy, like the lesser known (ϵ, η)-privacy from F eraud et al. (2019).

However, if it may be possible to protect the data held by the different participants (the central coordinator and the agents), the communications may still be easily intercepted by an adversary. Moreover, they cannot always be encrypted, especially in the multi-agent setting where the computational overhead would increase significantly and where the exchange of keys is not obvious. Consequently, whereas communications from agents are a crucial building block of multi-agent algorithms, they are often a weak link in terms of privacy. Indeed, if these messages carry enough information, they can leak sensitive data. Designing

Multi-Agent Best Arm Identification with Private Communications

multi-agent solutions that protect sensitive information from adversaries observing the communications is thus crucial.

We refer to this problem as multi-agent best arm identification with private communications. In this paper, we investigate solutions to this problem, considering two scenarios based on the type of information that we want to protect: 1) the rewards, i.e. the input data, and 2) the estimated best arm, i.e., the output data. This leads us to question the most appropriate privacy notion to use in each scenario. In the first scenario, we propose DP-MASE, a multi-agent version of DP Successive Elimination (Sajed & Sheffet (2019)). DP-MASE is the first attempt to protect the input data in the multi-agent setting with sensitive communications, using differential privacy. To address the second scenario, we rely on the notion of (ϵ, η)-privacy, introduced in F eraud et al. (2019), and propose CORRUPTED ELIMINATION that improves over their method by decoupling privacy and local BAI accuracy, leading to better sample complexity and providing flexibility to adjust these two essential aspects of the problem. We provide a theoretical analysis for the privacy and performance of both algorithms. In particular, we show that multi-agent collaboration leads to greater accuracy than independent agents, and derive associated sample complexity bounds. Experiments on toy environments support this theoretical findings.

Structure of the paper After presenting related works (Sect. 2), we state the setting for multi-agent BAI with private communications (Sect. 3). Then, we design algorithms based on successive elimination at both local and global levels. In the first scenario where we want to protect the reward, we use differential privacy and propose DP-MASE (Sect. 4). In the second scenario, to protect the output, we rely on (ϵ, η)-privacy and propose CORRUPTED ELIMINATION (Sect. 5). Experiments are provided in (Sect. 6) to support our theoretical claims.

2. Related Work

In multi-armed bandits (MAB), the reference theoretical framework for sequential decision-making (Thompson (1933), Robbins (1952)), the problem of identifying the option (called arm) with the highest mean is known as Pure Exploration or Best Arm Identification (BAI). Unlike the problem of regret minimization, BAI assumes that we incur no cost in exploring suboptimal options and are just interested in returning the best arm for further exploitation: we thus do not face the well-known exploration-exploitation trade-off. BAI has been thoroughly studied in two different settings: fixed-budget (Audibert & Bubeck (2010)) and fixed-confidence (Even-Dar et al. (2006)). In the former, we aim at returning the best arm in a given amount of time while minimizing the probability of failure, whereas in the

latter we want to output the best arm as quickly as possible with a fixed probability of failure. In this paper, we focus on the fixed confidence setting, for which Even-Dar et al. (2006) formulate and analyze two standard BAI algorithms: Successive Elimination and Median Elimination. The theoretical analysis is conducted in the Probably Approximately Correct (PAC) framework, where we accept to only find an ϵ-optimal arm with high probability. More advanced analysis and algorithms have been discussed, for instance in Jamieson & Nowak (2014).

Recently, much attention has been paid to multi-agent multiarmed bandits, where multiple agents collaborate to solve the same MAB instance. First, this approach can allow to take advantage of multiple computing nodes to speed up the learning process (Hillel et al. (2013), Sz or enyi et al. (2013)). More importantly, it accurately models real-world problems where multiple agents combine their efforts to solve a given learning task. A traditional example is the device cloud, where a possibly wide range of remote smart devices, each producing and owning their data, are solicited for the learning of a common model. However, Madhushani & Leonard (2020) point out the limits of natural extensions of traditional bandit algorithms like UCB to the multi-agent setting. A significant part of the research has been directed toward networked agents. Sankararaman et al. (2019) address the regret minimization problem by initially sharing arms among the agents and propagating the best arm using gossip protocols, much like Chawla et al. (2020) and Agarwal et al. (2021). Kumar Kolla et al. (2018) design UCB-like algorithms where statistics are aggregated over graph neighborhoods; a work recently extended to BAI by Jha et al. (2022). Other works consider a slightly different setting, named multi-bandit, where the agents have access to different bandit instances, with different arms (Gabillon et al. (2011), Scarlett et al. (2019)) or different preferences (Shi et al. (2021), R eda et al. (2022)).

In the multi-agent setting, privacy is more than ever a concern, as sensitive user data may be leaked through the messages sent by the participants. Building private algorithms is therefore a major issue. While the very definition of privacy may vary according to the context, there are two main families of techniques to design such algorithms in the MAB framework: cryptography ((Ciucanu et al., 2022; Garcelon et al., 2022)), and differential privacy (DP) ((Dubey & Pentland, 2020; Li & Song, 2022)). Encrypting the data usually has the advantage of being less harmful to the accuracy of the algorithm at the cost of additional computations, while differential privacy, consisting of adding noise in the data, requires little computation but may reduce accuracy. Differential privacy is more widely used in multi-agent settings. Tossou & Dimitrakakis (2016) and Ren et al. (2020) discuss the practical design of differentially-private MAB algorithms for the regret minimization problem, using additional

Multi-Agent Best Arm Identification with Private Communications

Laplace noise. They also provide theoretical regret bounds that exhibit the cost of this mechanism compared to equivalent non-private algorithms. Dubey & Pentland (2020) and Li & Song (2022) use DP mechanisms to address privacy preservation in the specific multi-agent setting, where communications between agents add an additional layer of risk. Privacy guarantees for pure exploration have been less investigated. Sajed & Sheffet (2019), for instance, propose a differentially private version of Successive Elimination, first described in Dwork et al. (2009), to eliminate suboptimal arms while guaranteeing privacy. (F eraud et al., 2019), on their part, address privacy in multi-agent best arm identification with a different approach. In contrast to differential privacy, the goal is no longer to protect the reward information that serves as input to the learning algorithm, but rather to protect its output, i.e., the identified optimal arm. Arms are eliminated both locally and globally with a voting system while having weak (not accurate) local agents ensures that an adversary cannot infer individual agent preferences by simply observing communications.

3. Multi-Agent BAI with Private Communications

3.1. Multi-Agent Bandits

We model our problem using the stochastic multi-armed bandit framework, stated in Definition 3.1.

Definition 3.1. (Stochastic Multi-Armed Bandit) A stochastic multi-armed bandit (MAB) with K arms, or Karmed bandit, is a set of K stochastic real distributions (R1, ..., RK), bounded in [0, 1], with means E[Rk] = µk. When a user interacts with the MAB, the user pulls arm k K = {1, ..., K} and receives a reward r Rk.

We consider a multi-agent setting where a set N = {1, ..., N} of N agents collaborates to identify the best arm in a common K-armed bandit instance. When an agent n N interacts with the system at time t N and pulls an arm k K, it receives a stochastic reward rn k(t) drawn from some unknown distribution Rk with mean µk. Without loss of generality, we assume µ1 ... µK. The objective for the system is to return an ϵ-optimal arm, i.e., an arm k such that µk µ1 ϵ with high probability.

Assumption 3.2. (Active agent) At any round t, only one agent nt N, the active agent, can interact with the system.

Assumption 3.3. (Fixed agent distribution) The active agent at time t is sampled from a fixed distribution PN on the agents N, given by the environment.

Once active, an agent pulls every arm in his local active set (that is, the set of arms that the agent still considers to be potentially optimal). Over the course of the algorithm, arm k s mean is estimated through the following quantity:

ˆµn k(t) = 1 tn Pt τ=1 rn k(τ)1 [nτ = n] ,

where tn is the number of rounds agent n has been active up to time t, such that tn := tn(t) = Pt τ=1 1 [nτ = n]. Note that Assumption 3.2 simply models asynchronous interactions with the system, which corresponds to several real-world applications (e.g., telecommunication networks).

We consider a centralized setting so that agents collaborate by sending messages to a central coordinator. This is equivalent to considering a decentralized setting where agents are connected with a complete graph. The messages contain the indices of the arms they have eliminated locally. The messages sent by agent n up to time t can be summarized as a vector (λn k(t))k [K] where λn k(t) = 1 if agent n has already eliminated arm k at time t, and λn k(t) = 0 otherwise.

Assumption 3.4. (Communication between agents) Each time the active agent nt eliminates an arm locally, he sends its index k to the central coordinator.

Notation Let Mn(t) = (λn k(t))k [K] denote the messages sent by agent n N up to time t, and M(t) = (M1(t), ..., MN(t)). Let also Mn denote the set of all messages sent by agent n N and M = (M1, ..., MN).

3.2. Multi-Agent BAI with Private Communications

In the context of multi-agent BAI, we are concerned with withholding two types of data one may want to protect : 1) the input data, namely the rewards, and 2) the output of the BAI algorithm, namely the identity of the optimal option. Moreover, we assume that the privacy of these data may be compromised through the messages sent by the agents. We thus propose privacy notions that aim at protecting the privacy of either the inputs or the output of multi-agent BAI algorithms from adversaries observing the communications.

3.2.1. PROTECTING THE INPUT WITH DIFFERENTIAL PRIVACY

The first situation is typically handled using differential privacy. Given two input datasets that only differ from one data point referred to as neighboring datasets, differential privacy ensures that the distribution of the output of the learning algorithm (often called a mechanism, i.e., a randomized function of the data) does not change significantly. Definition 3.5 formalizes this notion in the general case.

Definition 3.5. (ϵ-differential privacy) For any ϵ > 0, the mechanism Q is ϵ-differentially private if for any pair of neighboring datasets(D,D ) and any event S in Q s range:

P (Q(D) S) eϵP (Q(D ) S) .

To obtain a ϵ-DP mechanism from a query (i.e., a function of the data), we usually add an ad hoc noise with a well-tuned

Multi-Agent Best Arm Identification with Private Communications

scale. The scale usually depends on the global sensitivity of the query, which is the maximum amount by which the query value can change between evaluations on two neighboring datasets. For instance, given a real-valued query f of sensitivity S(f), we obtain a ϵ-DP mechanism Q by adding a centered Laplace r.v. with scale S(f)

ϵ to f. This procedure is called the Laplace mechanism.

A BAI algorithm learns the best arm based on the sequence of rewards collected through interactions with the system: these rewards may reveal sensitive information as they reflect the preferences of the agents. Since we consider that an adversary may observe the communicated messages (and not the best arm itself), we seek differential privacy with respect to the mechanism that outputs the set of messages sent by one agent. We then guarantee that the presence of one particular reward in the input dataset does not affect the message distribution significantly.

We give in Definition 3.6 a statement of DP adapted to our particular setting. Let D1:t be the set of all the rewards received by the N agents up to time t. D1:t and D 1:t are considered neighbors if they only differ by one reward. We consider the mechanism Qn that takes as input a dataset D1:t and outputs Mn.

Definition 3.6. (ϵ-differential privacy for communications) For any ϵ > 0 and any agent n N, the mechanism Qn is ϵ-differentially private if for any time t, any set of messages Mn(t) {0, 1}[K], and any pair of neighboring datasets D1:t, D 1:t, the following holds:

P (Qn(D1:t) = Mn(t)) eϵP (Qn(D 1:t) = Mn(t)) .

3.2.2. PROTECTING THE OUTPUT WITH (ϵ, η)-PRIVACY

Some applications may require protecting the algorithm s output, i.e., the global optimal action. This information can indeed be of high value, for instance when it is the product of the engineers expertise in industrial applications. Protecting it from competing firms is therefore of critical importance. However, in multi-agent systems, it is difficult to prevent adversaries from observing the messages sent by the different participants. If those messages carry enough information about the optimal action, their identity can leak even if it is well protected at the system level.

We are interested in designing methods intended to protect against this eventuality, which is formalized through the notion of (ϵ, η)-privacy, first introduced in F eraud et al. (2019). (ϵ, η)-privacy, stated in Definition 3.7, ensures that an adversary observing the messages sent by one agent cannot infer an ϵ-best arm with high confidence. Limiting the awareness of the adversary to the messages from a single agent is realistic in many applications, as we can imagine installing a bug on one particular device. However, this could be easily extended to the setting where the adversary may watch up

to c agents, given a loss in the privacy guarantees.

Definition 3.7. ((ϵ, η)-privacy) An algorithm A is considered (ϵ, η)-private for finding an ϵ-optimal arm if, for any agent n N, an adversary that knows A and has access to the set of messages Mn cannot infer what arm is ϵ-optimal for agent n with probability higher than 1 η.

In particular, if 1 η 1/K, the adversary cannot make better than a random guess.

3.3. Successive Elimination in Multi-Agent MAB

In multi-agent algorithms, messages are signals that reflect the feedback an agent receives through his interactions with the environment. Sharing the signals with the other agents enables collaboration and allows better decision-making at the global level, compared to the situation where each agent learns independently. In multi-agent bandits, one could imagine a variety of such signals. However, in many applications, the amount of information contained in the messages is constrained by the capacity of the communication channels. In this work, we consider simple messages with integer values, fitting situations where communication is heavily constrained.

BAI algorithms based on successive arm elimination are particularly suited to this setting. These methods work by maintaining an active set of arms that is progressively reduced as the agent eliminates suboptimal arms. An arm is eliminated when it is estimated suboptimal with enough confidence, based on the rewards collected during interactions with the system. The algorithm stops when the active set is reduced to a single arm, which is the estimated best arm. As an important example, SUCCESSIVE ELIMINATION from Even-Dar et al. (2006) guarantees to output an ϵ-optimal arm with confidence 1 δ in approximately K

It is quite straightforward to use successive elimination schemes within multi-agent systems and enable collaboration through integer-valued messages and a voting process. At a high level, each agent n independently runs a local BAI algorithm, managing his own local set of arms Kn, while a central coordinator separately maintains a global set of arms K. Every time an agent eliminates an arm locally, he sends its index to the central coordinator, hence voting against this arm. If the number of votes λk against a specific arm exceeds some predefined threshold M, it is removed from the global active set as well as from all local active sets. This multi-agent successive elimination (MASE) scheme at both local and global levels is described in Algorithm 1. In the following, we propose methods based on this multi-agent successive elimination scheme that ensures differential privacy and (ϵ, η)-privacy.

Multi-Agent Best Arm Identification with Private Communications

Algorithm 1 Multi-Agent Successive Elimination(MASE)

1: Input: ϵ > 0, δ [0, 1], voting threshold M [[1, N]] 2: Output: Estimated best arm in K 3: Active sets K=Kn =[K]; global votes (λk = 0)k [K]; local epochs en =0; local round rn =0 4: while |K| > 1 do 5: Active agent n is sampled: n PN(t) 6: n interacts with system and updates mean estimates 7: tn := tn + 1 8: if condition for local elimination is met by k then 9: Remove k from local active set: Kn := Kn \ {k} 10: λk := λk + 1 11: en := en + 1 12: if λk > M then 13: Remove k from global active set: K := K\{k} 14: Remove k from all local active sets: Kn := Kn \ {k} for all n N

Sample Complexity of MASE. We expect the sample complexity, computed as the total number of overall interactions with the system, to scale up with number of agents. Indeed, in the asynchronous case (Assumption 3.2), more agents means more timesteps before a given agent has interacted enough with the system to output the best arm with high confidence. 1

For a lower bound, we consider the extreme, non-private case, where agents share all their raw reward samples with the central coordinator (the set of messages that can be sent is then larger than what is described in Section 3.1). Up to negligible communication delays, the problem becomes equivalent to the single-agent setting. It is then relevant to compare our algorithms with single-agent SUCCESSIVE ELIMINATION, for which lower bounds exist (see, for instance, Even-Dar et al. (2002)), although such approach does not offer any privacy.

4. Multi-Agent Successive Elimination with Differentially Private Communications

In this section, we introduce DP-MASE, for Differentially Private Multi-Agent Successive Elimination. While learning the best option in the multi-armed bandit, DP-MASE guarantees that the communications are differentially private with respect to the rewards. In essence, DP-MASE follows the multi-agent successive elimination scheme described in Algorithm 1, and uses DP Successive Elimination from

1Formally, the per-agent counts (t1, ..., t N) follow a multinomial with parameters (t, (p1, ..., p N)), where pi = PN (i). Moreover, the algorithm stops at step t if max M t1, ..., t N > T(η), where max M denotes the M-th largest value. For a given T, the probability P (max M (T1, ..., TN) > T(η)) decreases with N, which necessarily links the sample complexity to N.

Sajed & Sheffet (2019) as a local BAI routine.

4.1. Differentially Private Successive Elimination

In Successive Elimination (Even-Dar et al. (2006)), we consider the uncertainty over the value of each arm using a confidence interval of decreasing radius around the empirical mean. More specifically, at step t, we want the true mean of arm k to lie within the following confidence interval, for a well-chosen α(t) R, with high probability:

µk [ˆµk(t) α(t); ˆµk(t) + α(t)] .

If we set α(t) = q

log(c Kt2/β)

t for some c > 0, we know

this holds with high probability 1 2β c Kt2 due to Hoeffding s inequality. An arm k is eliminated as soon as its upper bound is less than the lower bound of the current best arm estimate, i.e.:

max l K ˆµl(t) ˆµk(t) > 2α(t) . (1)

From then on, we refer to (1) as suboptimality evaluation. This procedure returns an ϵ-optimal best arm with probability at least 1 β, in at most O K ϵ2 log 1

Making Successive Elimination differentially private is not straightforward. Naively, one could simply release noisy values for the means instead of raw values (for instance using the Laplace mechanism). However, this approach raises practical challenges. Indeed, a reward sample from arm k is involved in every subsequent empirical mean ˆµk, making the privacy cost scale linearly with T, the total number of steps. To circumvent this, Sajed & Sheffet (2019) only perform suboptimality evaluation after a given number of steps (i.e. an epoch). Empirical means are then reset and samples are discarded. Therefore, each reward sample is only consumed in a single mean estimation. This approach also allows controlling global sensitivity of the queries and the scale of the Laplace noise appropriately.

Thanks to parallel composition, Sajed & Sheffet (2019) prove that DP Successive Elimination is ϵ-differentially private. It also effectively returns the best arm as the epoch length R(e) grows properly over time, each time allowing for more accurate mean estimates. Indeed, R(e) is such that every arm with a suboptimality gap greater than 2 e is eliminated at epoch e with high probability.

4.2. A DP Algorithm for Multi-Agent BAI

Our objective is now to propose a multi-agent version of DP Successive Elimination, based on the MASE scheme. We obtain DP-MASE, a new algorithm for multi-agent BAI with private communications, depicted in Algorithm 2.

Multi-Agent Best Arm Identification with Private Communications

Algorithm 2 DP-MASE

1: Input: ϵ, δ, β, M = log δ

log β 2: Output: Estimated best arm in K 3: Active agents N(t) = N; global and local active sets K = Kn = [K]; global votes (λk = 0)k [K]; local epochs en =0; local round rn =0; global time t = 0 4: while |K| > 1 do 5: Active agent n is sampled: n PN(t) 6: Sample all arms in Kn and update mean estimates 7: rn := rn + 1 8: if rn = R(en) then 9: k(en) = Local Elimination (ϵ, en, (ˆµn k(en))k, β) 10: Kn = Kn \ k(en) 11: λk := λk + 1 for any k k(en) 12: if |Kn| = 1 then 13: N(t) := N(t) \ {n} {Remove agent if done} 14: else 15: en := en + 1 16: Reset (ˆµn k(en) = 0)k, rn = 0 17: Compute next epoch length R(en) 18: Kn := Kn T K {Remove the latest globally eliminated arms from local active set} 19: for k k(en) do 20: K := K \ {k} if λk > M 21: t := t + 1

In DP-MASE, each agent independently runs a DP Successive Elimination instance on the shared multi-armed bandits. To deal with the asynchronous interactions, every agent keeps track of his current epoch en and current round rn [[0, R(en)]] within this epoch. At the end of his epoch, an agent performs local elimination as in Sajed & Sheffet (2019), returning a set k(en) of eliminated arms. If it is not empty, arms in k(en) are immediately removed from the local active set Kn, and their global voting counts λk s are updated (lines 9-11). Empirical means are then reset for the next epoch whose length is also computed using the same formula as in Sajed & Sheffet (2019):

32 log 8|Kn|en2/β

2 2en ; 8 log 4|Kn|en2/β

Moreover, Kn is synchronized with global active set K, meaning that arms that have been globally eliminated during epoch en are now also removed from Kn (lines 15-18). Eventually, if votes against arms in k(en) have reached the voting threshold, they are removed from global active set K (lines 19-20).

We show that DP-MASE is ϵ-differentially private for communications, in the sense of Definition 3.6, and also derive guarantees over its accuracy for finding an optimal arm. Detailed proofs are provided in the appendix.

Proposition 4.1. Algorithm 2, DP-MASE, is ϵdifferentially private for communications.

Proposition 4.2. (Failure Probability of DP-MASE) With voting threshold M = log δ

log β , DP-MASE fails to return the best arm with probability at most δ.

Eventually, we derive a problem-dependent sample complexity bound for DP-MASE that holds with high probability. Proposition 4.3. (Sample Complexity of DP-MASE) With probability at least (1 δ) (1 I1 p (T T(β), 1 + T(β)))M, DP-MASE has sample complexity:

O 1 p T(β) + 1 4p 2 log 1

T(β) = log(K/β) + log log(1/ 2) 1 2 2 + 1 ϵ 2

Above, Ip (a, b) is the regularized incomplete beta function evaluated at p with parameters (a, b), and p is the probability of the M-th most likely agent.

In Proposition 4.3, T(β) is the local sample complexity, that is the number of steps a single-agent running DP Successive Elimination must execute to output the best arm with confidence 1 β. T(β) naturally increases with the confidence and the strength of the privacy guarantees. It also depends negatively on the smallest suboptimality gap 2, which quantifies the hardness of the BAI problem. The global sample complexity of DP-MASE obviously scales with the local sample complexity. However, it also depends on the agent distribution PN , and particularly on the inverse probability of the M-th most likely player: through the global elimination process, the algorithm roughly waits for the M-th most frequent player to output his best arm estimate.

DP-MASE is of practical interest when rewards may carry sensitive information and when it is impossible to completely secure the communication channels. However, we could be concerned about withholding other information, such as the optimal option. In this situation, we consider that DP is not the most adapted notion. In the next section, we therefore propose another method based on multi-agent successive elimination that better fits this objective.

5. Best Arm Privacy in Multi-Agent BAI

Here, we are no longer concerned with preserving the input data, but rather want to protect the learned optimal option. We believe that the most suitable privacy notion for this problem is (ϵ, η)-privacy, first introduced in F eraud et al. (2019). In this section, we first discuss existing solutions and later introduce a new, more efficient multi-agent BAI algorithm that guarantees (ϵ, η)-privacy.

Multi-Agent Best Arm Identification with Private Communications

5.1. Multi-Agent Successive Elimination with (ϵ, η)-Private Communications

To address this setting, F eraud et al. (2019) propose a (ϵ, η)- private algorithm based on multi-agent successive elimination. To reach (ϵ, η)-privacy, each local arm selection subroutine is run with precision ϵ and confidence 1 η. Such low local confidence implies that observing a single agent does not bring any useful information, making the algorithm (ϵ, η)-private, while the consensus required to eliminate an arm globally ensures a low failure probability. More precisely, setting the global elimination threshold as M = log δ

log η ensures that an ϵ-best arm is returned with a probability higher than 1 δ.

Moreover, upper bounds on the sample complexity are derived that directly depend on the local subroutine complexity and the agent probability distribution PN . In particular, for any (1 η)-confident BAI subroutine with sample complexity T(η), the global sample complexity scales with 1 p T(η), where p is the probability of the M-th most likely agent, similarly to DP-MASE.

5.2. Improved Privacy with Message Corruption

The privacy mechanism in F eraud et al. (2019) inherently links the local confidence level with the privacy level. Indeed, the algorithm is private in the sense of Definition 3.7 only because each independent participant is inaccurate in his best arm identification, returning an ϵ-optimal arm with confidence as low as 1 η.

This is likely to harm sample complexity: to guarantee high global accuracy, the voting threshold M must be high to compensate for the low local BAI confidence. In this section, we provide a method that decouples global privacy and local accuracy, providing more flexibility to adjust these two essential aspects of the problem and leading to better sample complexity.

Our method to guarantee a high privacy level without sacrificing local BAI confidence comes from the observation that the adversary is only able to derive critical information from sent messages and has no access to the internal state of the observed agent. Thus, instead of guaranteeing privacy by having poor local best arm identification, we limit the information that could be extracted from the sent messages by corrupting them with a random binary noise. More formally, instead of sending Mn, agent n sends the corrupted messages Mξ n = (λn 1 νn 1 , ..., λn k νn K) , where ξ > 0 is the corruption probability and (νn k )k [K] is the corruption mask such that νn k B(1 ξ) for any k. We can then apply Successive Elimination. Algorithm 3 describes the whole procedure. We now establish the privacy guarantees and the failure probability of Algorithm 3. Complete proofs are provided in appendix.

Algorithm 3 CORRUPTED ELIMINATION

1: Input: η, δ, ξ, M = log δ

log η 2: Output: Estimated best arm in K 3: Active agents N(t) = N; global and local active sets K = Kn = [K]; local failure probability ηξ = 1 1 η

(1 ξ)K 1 , Kn := K for all n N; global votes (λk = 0)k [K], global time t = 0 4: while |K| > 1 do 5: Active agent n is sampled: n PN(t) 6: Sample all arms in Kn and update mean estimates 7: k, , k G = Local Elimination (n, ξ) {Identify arms to eliminate locally} 8: Kn = Kn \ k {Remove eliminated arms from local active set} 9: λk := λk + 1 for any k k G(en) {Vote for global elimination of eliminated arms if corresponding message not corrupted} 10: if |Kn| = 1 then 11: N(t) := N(t) \ {n} 12: Kn := Kn T K {Remove the latest globally eliminated arms from local active set} 13: for k k do 14: K := K \ {k} if λk > M 15: t := t + 1

Proposition 5.1. (Privacy of CORRUPTED ELIMINATION) Running local BAI subroutines with confidence 1 η , an adversary knowing A and observing Mξ n cannot infer the best arm of agent n with probability higher than (1 η ) (1 ξ)K 1. Therefore, to maintain an apparent privacy level η, we need to set the confidence level of the BAI subroutine as:

ηξ = max 0, 1 1 η (1 ξ)K 1

Proposition 5.2. (Failure probability of CORRUPTED ELIMINATION) The failure probability of CORRUPTED ELIMINATION is at most:

min α N (1 Iξ (α + 1, M)) δα , where δα = δ ηξ α .

Proposition 5.1 tells us that we can run the local subroutines with better accuracy than F eraud et al. (2019) while keeping the same privacy guarantees. With Proposition 5.2, failure probability is less than without communication noise. Indeed, since some messages are corrupted, more independent agents need to vote against the best arm to remove it globally. In Proposition 5.3, we propose an approximation of the sample complexity of CORRUPTED ELIMINATION.

Proposition 5.3. (Sample complexity of CORRUPTED ELIMINATION) If the agent distribution PN is uniform, then with probability α, the number of

Multi-Agent Best Arm Identification with Private Communications

rounds needed by CORRUPTED ELIMINATION to output an ϵ-optimal arm is approximately:

T = N Q 1 T(ηξ), 1 α1/M ,

where Q 1 is the inverse of the regularized gamma function, and T(ηξ) is the number of local samples needed for the BAI subroutine to find the optimal arm with confidence 1 ηξ.

As we observe in our experiments (see, e.g., appendix and figures therein), there is a positive correlation between the global sample complexity and the local failure probability ηξ (corresponding exactly to the privacy level when ξ = 0), even with uniform sampling. Proposition 5.3 provides a first theoretical justification for these empirical findings. Indeed, this observation was not reflected in the initial bound from F eraud et al. (2019), as increasing BAI confidence 1 ηξ means more local samples T(ηξ) are needed (see for instance the bound for Successive Elimination). In contrast, if the approximation provided in Proposition 5.3 is not enough to conclude that the sample complexity increases with ηξ (because of the first parameter in Q 1), it shows at least an ambiguity. Indeed, a smaller ηξ means a smaller voting threshold M = log δ

log ηξ , which is likely to decrease T via the second argument in Q 1. An extended discussion is provided in the appendix.

6. Experiments

We assess the performance of our methods on different stochastic environments suggested in F eraud et al. (2019). In these problems, there are K = 10 arms with means µ1 . . . µ10. We display our results on Problem 1, where rewards are drawn from Bernoulli distributions, with µ1 = 0.7, µ2 = 0.5, µ3 = 0.3, and µk = 0.1 for k = 4, . . . , 10. Agent distribution PN is uniform. Further experiments, including other problems and agent distributions, are discussed in the appendix. We run experiments with N ranging from 64 to 1024 agents to evaluate how performance is affected by the scale of the problem. For both methods, we fix the global failure probability δ to 5%. Each data point is an average value over 10 runs, and 95% confidence intervals are shown for every plot.

Evaluation and Baselines. We evaluate the performance of our algorithms for different values of the privacy parameters (ϵ for DP-MASE, η and ξ for CORRUPTED ELIMINATION). In addition, we compare them with two baselines, corresponding to the extreme cases with full and no communication.

CENTRAL: the agents send all their raw reward samples with the central coordinator at each round, which is equivalent (up to negligible communication delays)

to a single-agent SUCCESSIVE ELIMINATION. This approach does not offer any privacy since data is fully shared;

INDEPENDENT: every agent runs an independent SUCCESSIVE ELIMINATION algorithm, without sharing any information with the central coordinator. This approach is fully private as there is no communication.

They allow to illustrate both the gain in performance achieved through multi-agent collaboration and the challenges of distributing data across multiple nodes. To measure performance, we report the sample complexity, which is the total number of rounds (corresponding to a single iteration of the while loop in Algorithms 1, 2 and 3) needed to output the best arm with confidence 1 δ.

6.1. Impact of Multi-Agent Collaboration

Figure 1 shows the performance of our methods against the CENTRAL and INDEPENDENT baselines. We observe that the BAI problem is solved much faster if agents can freely communicate their raw reward samples, which amounts to the single-agent case. This is expected, as the problem is inevitably harder when the data is distributed to multiple locations. In particular, the sample complexity must scale with the number of agents, as already discussed in Section 3.3. We indeed observe a growth in sample complexity with the number of agents for both DP-MASE and CORRUPTED ELIMINATION. However, they scale better than the naive INDEPENDENT baseline, which shows that the communication between agents and the collaborative elimination procedure help to be more efficient and attenuate the growth in sample complexity. Figure 4 in appendix shows the same results without the baselines for a better distinction between the different curves.

6.2. Multi-Agent Successive Elimination with Differentially Private Communications

Our experiments exhibit the influence of the privacy level ϵ on the performance of DP-MASE (Figure 1, left). Unsurprisingly, for a fixed global confidence 1 δ, the sample complexity gets worse with stronger privacy guarantees, i.e., smaller ϵ s. Longer epochs and larger gap thresholds for local elimination are indeed needed to compensate for the noisier mean estimates. We notice a significant increase in sample complexity for small values of ϵ (less than 0.1), while it is relatively stable for larger values.

6.3. Multi-Agent Successive Elimination with (ϵ, η)-Private Communications

We compare the sample complexity of CORRUPTED ELIMINATION against the method proposed by F eraud et al. (2019) (corresponding to ξ = 0).

Multi-Agent Best Arm Identification with Private Communications

Figure 1. Sample complexity of DP-MASE (left) for ϵ = 0.05, 0.1, 0.25 and CORRUPTED ELIMINATION (right) for ξ = 0, 0.05, 0.1 on Problem 1.

The apparent privacy level η is set to 0.9. In accordance with Proposition 5.1, for each value of ξ, we compute ηξ = 1 1 η (1 ξ)K 1 , and then run local BAI subroutines with confidence 1 ηξ, which guarantees the same apparent privacy level and hence a fair comparison.

We observe that corrupting messages actually decreases the sample complexity of CORRUPTED ELIMINATION, and that it tends to further decrease with higher corruption probability ξ (Figure 1, right). In particular, we improve over the method proposed by F eraud et al. (2019). Our intuition is that a smaller ηξ is likely to speed up the global elimination process, and hence the algorithm, by decreasing the voting threshold M. This was already suggested by our theoretical analysis in Proposition 5.3. We refer to this as the threshold effect, and empirical evidence therefore points toward its prevalence over the increase in the number of local samples. Similarly to DP-MASE, we also find that the sample complexity increases as the apparent privacy level η grows, as shown in the appendix. We interpret it as a direct consequence of the threshold effect.

7. Discussion

In this work, we have proposed two methods to address multi-agent best arm identification, where multiple participants work together to identify with fixed confidence the best option in a common multi-armed bandit. Based on the same successive elimination scheme, each method guarantees the privacy of a type of data against adversaries observing the communications. DP-MASE ensures that the communications are differentially private with respect to the rewards, while CORRUPTED ELIMINATION prevents the adversaries from inferring the best arm. We provide theoretical analysis on privacy, sample complexity, and failure probability, and experiments on toy problems give insights into the behavior of our methods. In particular, they show that

CORRUPTED ELIMINATION is faster than a similar method without message corruption, for comparable privacy levels.

Direct comparison between our two methods is impossible, as they rely on distinct privacy definitions. However, they are suitable to address complementary privacy concerns in multi-agent best arm identification. Also, since our primary goal was to investigate and address different privacy needs within the relatively new setting of collaborative BAI, our methods could be refined to achieve better stopping time. Designing more efficient approaches thus is a natural direction for future work. Relaxing the assumption regarding the nature of communications, and allowing agents to send more informative messages (like continuous signals), is also worth investigating and certainly more challenging in terms of privacy. From a broader perspective, we believe our approach can be applied to the BAI with a fixed budget setting, to linear bandits, or to more advanced multi-agent scenarios, e.g., considering agents with heterogeneous arm sets.

Acknowledgements

M.S. has been partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003) and INODE project funded by EU Horizon 2020 research and innovation programme under GA No 863410.

Multi-Agent Best Arm Identification with Private Communications

Agarwal, M., Aggarwal, V., and Azizzadenesheli, K. Multi Agent Multi-Armed Bandits with Limited Communication. Journal of Machine Learning Research, 23:1 24, 2021. URL https://www.jmlr.org/papers/ volume23/21-138/21-138.pdf.

Audibert, J.-Y. and Bubeck, S. Best Arm Identification in Multi-Armed Bandits. In Proceedings of COLT, 2010. URL https://hal-enpc.archivesouvertes.fr/hal-00654404.

Chawla, R., Sankararaman, A., Ganesh, A., and Shakkottai, S. The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits. In Proceedings of AISTATS, 2020. URL http://proceedings.mlr.press/ v108/chawla20a/chawla20a.pdf.

Ciucanu, R., Lafourcade, P., Marcadet, G., and Soare, M. SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits. Journal of Artificial Intelligence Research, 73:737 765, 2022. URL https://www.jair.org/index.php/ jair/article/download/13163/26774/.

Das Gupta, A. Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer, 2011. URL https://link.springer.com/book/10.1007/ 978-1-4419-9634-3.

Dubey, A. and Pentland, A. Private and Byzantine-Proof Cooperative Decision-Making. In Proceedings of AAMAS, 2020. URL https://www.ifaamas.org/ Proceedings/aamas2020/pdfs/p357.pdf.

Dwork, C. Differential Privacy. In Proceedings of ICALP, 2006. URL https://www.microsoft.com/enus/research/publication/differentialprivacy/.

Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci., 9(3 4):211 407, 2014. URL https://doi.org/ 10.1561/0400000042.

Dwork, C., Naor, M., Reingold, O., Rothblum, G. N., and Vadhan, S. On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results. In Proceedings of STOC, 2009. URL https: //doi.org/10.1145/1536414.1536467.

Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and markov decision processes. In Proceedings of COLT, 2002. URL https:// doi.org/10.1007/3-540-45435-7 18.

Even-Dar, E., Mannor, S., and Mansour, Y. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 2006. URL https://jmlr.org/papers/ volume7/evendar06a/evendar06a.pdf.

F eraud, R., Alami, R., and Laroche, R. Decentralized Exploration in Multi-Armed Bandits. In Proceedings of ICML, 2019. URL http://proceedings.mlr.press/ v97/feraud19a/feraud19a.pdf.

Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-Bandit Best Arm Identification. In Proceedings of Neur IPS, 2011. URL https://proceedings.neurips.cc/ paper/2011/file/ c4851e8e264415c4094e4e85b0baa7cc Paper.pdf.

Garcelon, E., Pirotta, M., and Perchet, V. Encrypted Linear Contextual Bandit. In Proceedings of AISTATS, 2022. URL https://proceedings.mlr.press/v151/ garcelon22a.html.

Hillel, E., Karnin, Z., Koren, T., Lempel, R., and Somekh, O. Distributed Exploration in Multi Armed Bandits. In Proceedings of Neur IPS, 2013. URL https://proceedings.neurips.cc/ paper/2013/file/ 598b3e71ec378bd83e0a727608b5db01Paper.pdf.

Jamieson, K. and Nowak, R. Best-arm Identification Algorithms for Multi-Armed Bandits in the Fixed Confidence Setting. In Proceedings of CISS, 2014. URL https: //ieeexplore.ieee.org/document/6814096.

Jha, A., Mohamed, N., and Jagannathan, K. Collaborative Best Arm Identification in Multi-armed Bandits. In Proceedings of COMSNETS, 2022. URL https: //ieeexplore.ieee.org/document/9668527.

Kumar Kolla, R., Jagannathan, K., and Gopalan, A. Collaborative Learning of Stochastic Bandits Over a Social Network. Transactions on Networking, 26(4):1782 1795, 2018. URL https://www.ee.iitm.ac.in/ krishnaj/Publications files/papers/ 08418308.pdf.

Li, T. and Song, L. Privacy-Preserving Communication Efficient Federated Multi-Armed Bandits. IEEE Journal on Selected Areas in Communications, 40(3):773 787, 2022. URL https://doi.org/10.1109/ JSAC.2022.3142374.

Multi-Agent Best Arm Identification with Private Communications

Madhushani, U. and Leonard, N. It Doesn t Get Better and Here s Why: A Fundamental Drawback in Natural Extensions of UCB to Multi-agent Bandits. In Proceedings of I Can t Believe It s Not Better! Neur IPS 2020 workshop, 2020. URL https://openreview.net/ forum?id=e K034ng O05Y.

Mc Sherry, F. D. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 2009. URL https: //doi.org/10.1145/1559845.1559850.

Ren, W., Zhou, X., Liu, J., and Shroff, N. B. Multi-Armed Bandits with Local Differential Privacy. Co RR, 2020. URL https://arxiv.org/abs/2007.03121.

Robbins, H. E. Some Aspects of the Sequential Design of Experiments. Bulletin of American Mathematical Society, 58(5):527 535, 1952. URL https://www.ams.org/journals/bull/ 1952-58-05/S0002-9904-1952-09620-8/ S0002-9904-1952-09620-8.pdf.

R eda, C., Vakili, S., and Kaufmann, E. Near-Optimal Collaborative Learning in Bandits. In Proceedings of Neur IPS, 2022. URL https://arxiv.org/abs/ 2206.00121.

Sajed, T. and Sheffet, O. An Optimal Private Stochastic-MAB Algorithm based on Optimal Private Stopping Rule. In Proceedings of ICML, 2019. URL https://proceedings.mlr.press/ v97/sajed19a.html.

Sankararaman, A., Ganesh, A., and Shakkottai, S. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1 35, 2019. URL https://doi.org/ 10.1145/3366701.

Scarlett, J., Bogunovic, I., and Cevher, V. Overlapping Multi-Bandit Best Arm Identification. In Proceedings of ISIT, 2019. URL http://ilijabogunovic.com/pdf/ scarlett2019multibandit ISIT2019.pdf.

Shi, C., Xu, H., Xiong, W., and Shen, C. (Almost) Free Incentivized Exploration from Decentralized Learning Agents. In Proceedings of Neur IPS, 2021. URL https://proceedings.neurips.cc/ paper/2021/file/ 054ab897023645cd7ad69525c46992a0Paper.pdf.

Sz or enyi, B., Busa-Fekete, R., Heged us, I., Orm andi, R., Jelasity, M., and K egl, B. Gossip-based Distributed

Stochastic Bandit Algorithms. In Proceedings of ICML, 2013. URL http://proceedings.mlr.press/ v28/szorenyi13.pdf.

Thompson, W. R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285 294, 1933. URL http://www.jstor.org/stable/2332286.

Tossou, A. and Dimitrakakis, C. Algorithms for Differentially Private Multi-Armed Bandits. In Proceedings of AAAI, 2016. URL https://aaai.org/papers/ 212-algorithms-for-differentiallyprivate-multi-armed-bandits/.

Multi-Agent Best Arm Identification with Private Communications

A.1. Multi-Agent Successive Elimination with Differentially Private Communications

Proposition 4.1. Algorithm 2, DP-MASE, is ϵ-differentially private for communications.

Proof. Let D(n, en) be the sequence of all rewards collected by agent n at epoch en. Fix global step t, and denote T n the set of agent n s local epochs occurring up to time t. Therefore D1:t = S

n N,en T n D(n, en).

Let also Qen be the mechanism that outputs the indices of the arms eliminated by agent n at epoch en. Note that an agent n has knowledge of the latest globally eliminated arms only at the end of his current epoch, so that the reward samples in D(n, en) has no effect on the local elimination process for any other agent m. Thus, Qen operates only on D(n, en). In other words, each reward sample only plays a role for a single agent, at a single epoch.

At epoch en, for every arm k in the active set Kn, the empirical mean ˆµk(en) is computed using R(en) samples. The sensitivity of ˆµk(en) is therefore 1/R(en). Moreover, each noisy mean is computed only once before suboptimality evaluations. Therefore, the local elimination mechanism which uses Laplace noise with scale 1 ϵR(en) is ϵ-DP. Qen is thus ϵ-DP.

Consider the composed mechanism Q defined as:

Q(D1:t) = Qen(D(n, en))

n N,en T n .

Thanks to the parallel composition theorem (see Theorem 4 in Mc Sherry (2009)), Q is ϵ-DP. However, Qn can be seen as a

deterministic function of Qen

n,en, i.e. Qn = g Qen

. So Qn is also ϵ-DP by the post-processing property (see

Proposition 2.1 in Dwork & Roth (2014)).

Proposition 4.2. (Failure Probability of DP-MASE) With voting threshold M = log δ

log β , DP-MASE fails to return the best arm with probability at most δ.

Proof. Sajed & Sheffet (2019) show that using DP Successive Elimination, the optimal arm k is never eliminated with probability at least 1 β. The probability of eliminating k is therefore at most β.

To eliminate k globally, M = log δ

log β independent agents need to eliminate k . Since the agents interact independently

with the system, k is then globally eliminated with probability at most βM. But, βM = β log δ

log δ log β = δ.

Proposition 4.3. (Sample Complexity of DP-MASE) With probability at least (1 δ) (1 I1 p (T T(β), 1 + T(β)))M, DP-MASE has sample complexity:

O 1 p T(β) + 1 4p 2 log 1

T(β) = log(K/β) + log log(1/ 2) 1 2 2 + 1 ϵ 2

Above, Ip (a, b) is the regularized incomplete beta function evaluated at p with parameters (a, b), and p is the probability of the M-th most likely agent.

To prove Proposition 4.3, we first prove the following lemma, which is an updated version of the sample complexity bound from F eraud et al. (2019).

Lemma A.1. (Sample complexity of F eraud et al. (2019)) Using any BAI subroutine, with probability at least (1 δ) (1 I1 p (T T(η), 1 + T(η)))M, the multi-agent BAI algorithm proposed in F eraud et al. (2019) has sample complexity

O 1 p T(η) + 1 4p 2 log 1

Multi-Agent Best Arm Identification with Private Communications

where Ip (a, b) is the regularized incomplete beta function evaluated at p with parameters (a, b), p is the probability of the M-th most likely agent, and T(η) is the number of local samples needed for the BAI subroutine to find the optimal arm with confidence 1 η.

Proof. We follow a similar line of reasoning to F eraud et al. (2019), for the derivation of the sample complexity bound.

Let T denote the number of rounds before the algorithm stops (i.e. its sample complexity) and Tn denote the number of rounds agent n has been active. PN is the probability distribution over agents, so that PN (x = n) is the probability that the active player x is n at any time. We have Tn B (T, PN (x = n)), and thus EPN [Tn] = PN (x = n)T.

Let Bδ,η be the set of agents with the M := log δ

log η highest Tn. The algorithm does not stop if the following event occurs: E1 = { n Bδ,η, Tn < T(η)}.

Applying Hoeffding s inequality, we have:

P (Tn PN (x = n)T ϵ) exp 2ϵ2/T . (2)

When E1 occurs, all agents in Bδ,η have enough samples to output the optimal arm with confidence 1 η, i.e. n Bδ,η, Tn T(η). Thus, with probability at most δ:

T(η) PN (x = n)T Tn PN (x = n)T

This holds for every agent n Bδ,η, and in particular for n such that PN (x = n ) = minn Bδ,η PN (x = n) := pδ,η.

Therefore, the total number of samples T is such that:

T T(η) 0 , (4)

which is a second order polynomial equation in

T. Solving this equation in T, we have with probability at least 1 δ:

T 1 4p2 δ,η

δ + 4pδ,ηT(η)

Developing and re-ordering the terms, we have:

T 1 pδ,η T(η) + 1 4p2 δ,η

δ + 8 log 1

In particular:

T(η) + 1 4pδ,η log 1

Therefore, when E1 does not occur, that is when all M most frequent agents have enough samples to output best arm with confidence η, with probability at least 1 δ:

T(η) + 1 4pδ,η log 1

Let now NM be the set of the M most likely agents and define n = arg minn NM PN (x = n) and p = argminn NM PN (x = n) as the index and the probability of the M-th most likely agent, respectively.

We now consider the following event: E2 = {n Bδ,η}. Under E1, by definition of Bδ,η, E2 is equivalent to the event {Tn < T(η)}.

Multi-Agent Best Arm Identification with Private Communications

But: P (E2) = I1 p (T T(η), 1 + T(η)) . (9)

In order to have exactly pδ,η = p , an equivalent formulation of E2 should hold for all agents whose sampling probability PN (x = n) is greater than p . This is why pδ,η = p with probability (1 I1 p (T T(η), 1 + T(η)))M.

Now, we prove Proposition 4.3 using Lemma A.1.

Proof. From Sajed & Sheffet (2019) (Lemma 4.2), the sample complexity of DP Successive Elimination is:

T(β) = log(K/β) + log log(1/ 2) 1

2 2 + 1 ϵ 2

Using the same global elimination mechanism, we can see DP-MASE as the multi-agent algorithm from F eraud et al. (2019) with DP Successive Elimination as a BAI subroutine. We can therefore apply Lemma A.1 with local sample complexity T(β) and conclude that the sample complexity of DP-MASE is:

p T(β) + 1 4p 2 log 1

with probability at least (1 δ) (1 I1 p (T T(β), 1 + T(β)))M.

A.2. Multi-Agent Successive Elimination with (ϵ, η)-Private Communications

Proposition 5.1. (Privacy of CORRUPTED ELIMINATION) Running local BAI subroutines with confidence 1 η , an adversary knowing A and observing Mξ n cannot infer the best arm of agent n with probability higher than (1 η ) (1 ξ)K 1. Therefore, to maintain an apparent privacy level η, we need to set the confidence level of the BAI subroutine as:

ηξ = max 0, 1 1 η (1 ξ)K 1

Proof. Let us recall the definition of Mξ n:

Mn = (λn 1, ..., λn K) , (10)

Mξ n = (λn 1 νn 1 , ..., λn k νn K) , (11)

where, for any arm k, νn k is the corruption mask and is a random variable with distribution B(1 ξ).

An adversary can guess the global optimal arm from messages Mn if these two events hold:

Event A: the arm selection subroutine of agent n actually finds the optimal arm;

Conditionally to A, event B: no message associated to sub-optimal arms is corrupted.

By assumption, P(A) 1 ηξ. Moreover, B holds when all νn k are ones for k = 2, ..., K. Since νn k B(1 ξ), we have P(B) = (1 ξ)K 1. Therefore, observing messages Mn, an external observer can only infer the best arm with probability (1 ηξ) (1 ξ)K 1.

To maintain an apparent privacy level η, we then need to set ηξ > 0 such as:

1 η = (1 ηξ) (1 ξ)K 1 . (12)

Multi-Agent Best Arm Identification with Private Communications

ηξ = max 0, 1 1 η (1 ξ)K 1

Proposition 5.2. (Failure probability of CORRUPTED ELIMINATION) The failure probability of CORRUPTED ELIMINATION is at most:

min α N (1 Iξ (α + 1, M)) δα , where δα = δ ηξ α .

Proof. Assuming it terminates, CORRUPTED ELIMINATION fails, i.e. eliminates the best arm, if there are at least M = log δ

log η votes against it. Let Ne and Nv denote respectively the number of local independent eliminations of the best arm and the number of votes actually sent against the best arm. Nv follows a binomial distribution with Ne trials:

Nv B (Ne, 1 ξ) . (14)

Given Ne independent eliminations, the probability that Nv exceeds the voting threshold M is:

(1 ξ)mξNe m . (15)

In order to overcome the randomness of Ne, we want α N such that if Ne M + α, Nv M with high probability. As Ne M + α, obviously, the probability of a random variable X following B (M + α, 1 ξ) being greater than M is lower than if it followed B (Ne, 1 ξ). Thus:

(1 ξ)mξM+α m (16)

(1 ξ)mξM+α m (17)

= 1 Iξ (α + 1, M) , (18)

where Ip (a, b) is the regularized incomplete beta function evaluated at p with parameters (a, b).

Now since the local BAI routines are run with confidence ηξ, the probability that at least M +α independent agents eliminate the best arm is at most ηM+α ξ .

Eventually, for α N, CORRUPTED ELIMINATION eliminates the best arm if the two following events happen:

M + α independent agents eliminate the best arm locally, which happens with probability at most ηM+α ξ ;

These M + α local eliminations imply at least M votes, which happens with probability 1 Iξ (α + 1, M).

By independence of the corruption mechanism, the failure probability of the algorithm is at most:

min α N (1 Iξ (α + 1, M)) ηM+α ξ . (19)

Noticing that:

ηM+α ξ = η log δ

log η +α ξ δ ηα ξ = η

log δ log η +α ξ , (20)

and setting δα = δ ηξα concludes the proof.

Multi-Agent Best Arm Identification with Private Communications

Proposition 5.3. (Sample complexity of CORRUPTED ELIMINATION) If the agent distribution PN is uniform, then with probability α, the number of rounds needed by CORRUPTED ELIMINATION to output an ϵ-optimal arm is approximately:

T = N Q 1 T(ηξ), 1 α1/M ,

where Q 1 is the inverse of the regularized gamma function, and T(ηξ) is the number of local samples needed for the BAI subroutine to find the optimal arm with confidence 1 ηξ.

Proof. For all n N, let Tn denote the number of rounds agent n has been active, and T denote the total number of rounds before the algorithm stops. Then (T1, ..., TN) follows a multinomial distribution:

(T1, ..., TN) Mult(T, {p1, ..., p N}) , (21)

where pn = PN (x = n). For simplicity, we consider a uniform distribution, i.e. pn = 1/N for any n N, in the sequel of this proof.

Assuming the local BAI subroutines accurately return the best arm, the algorithm terminates when the M largest T1, ..., TN exceed threshold T(ηξ). We thus look for a bound on the probability of this event.

We can make the components of the vector (T1, ..., TN) independent by poissonizing the multinomial, using the following theorem (Das Gupta, 2011):

Theorem A.2. Let N P(λ) and suppose (X1, ..., Xk) Multi(p1, ..., pk). Then, marginally, X1, ..., XK are independent Poisson random variables, with Xi Xi P(piλ).

Instead of considering T, the number of trials of the multinomial, as fixed, we assume that it follows a Poisson distribution, i.e., T P(T) for some T N. Thus, we can now think of T1, ..., TN as mutually independent random variables such that Tn P

T/N for all n N.

The probability that one count Tn is greater than T(ηξ) can then be expressed in terms of the cumulative distribution function of the Poisson distribution:

P(Tn T(ηξ)) = 1 Q T(ηξ), T/N , (22)

where Q(s, x) is the regularized gamma function evaluated in (s, x).

Consequently, by independence of the T1, ..., TN, the probability that M different agents have more samples than T(ηξ) at time t is, for some α [0, 1]: (1 Q T(ηξ), T/N M = α .

Now, to estimate the sample complexity T, we may just compute T given α, M and T(ηξ). To do this, we notice that:

Q(T(ηξ), T/N) = 1 α1/M .

There is no close form for the inverse of Q that would lead to a close form for T. However, we can numerically compute the inverse of Q in its second argument. Thus, the total number of samples would be given by:

T N Q 1 T(ηξ), 1 α1/M .

Confusing T with its expectation T, we get an approximation of the sample complexity of CORRUPTED ELIMINATION.

We can analyze this approximation in the following way. When ηξ increases, M = log δ

log ηξ increases, and therefore α1/M

increases (as 0 < α < 1). Empirically, we also observe that Q 1 is a decreasing function in its second argument. Therefore, although the final effect is unclear due to the presence of T(ηξ) as a first argument, the sample complexity possibly increases with ηξ via its second argument.

Multi-Agent Best Arm Identification with Private Communications

B. Algorithms

In this section, for self-completeness, we recall the main algorithms of the paper and introduce the elimination subroutines that have been omitted because of space constraints.

B.1. Multi-Agent Successive Elimination with Differentially Private Communications

We recall the algorithm for DP-MASE and introduce in Algorithm 4 the elimination subroutine used at line 7 of Algorithm 2.

Algorithm 2 DP-MASE

1: Input: ϵ, δ, β, M = log δ

log β 2: Output: Estimated best arm in K 3: Active agents N(t) = N; global and local active sets K=Kn =[K]; global votes (λk = 0)k [K]; local epochs en =0; local round rn =0; global time t = 0 4: while |K| > 1 do 5: Active agent n is sampled: n PN(t) 6: Sample all arms in Kn and update mean estimates 7: rn := rn + 1 8: if rn = R(en) then 9: k(en) = Local Elimination (ϵ, en, (ˆµn k(en))k, β) {Identify arms to eliminate locally} 10: Kn = Kn \ k(en) {Remove eliminated arms from local active set} 11: λk := λk + 1 for any k k(en) {Vote against eliminated arms for global elimination} 12: if |Kn| = 1 then 13: N(t) := N(t) \ {n} {Remove agent if done} 14: else 15: en := en + 1 16: Reset (ˆµn k(en) = 0)k, rn = 0 17: Compute next epoch length R(en) 18: Synchronize Kn := Kn T K {Remove the latest globally eliminated arms from local active set} 19: for k k(en) do 20: K := K \ {k} if λk > M {Remove arms from global active set if the voting threshold is met} 21: t := t + 1

Algorithm 4 DP-MASE Local Elimination

1: Input: Privacy level ϵ, current local epoch en, estimated means (ˆµn k(en))Kn, local accuracy β 2: Output: A set of arms k(en) to eliminate locally at agent n. 3: k(en) = {}

4: Set h(en) = q

log(8|Kn(en)|en2/β)

5: Set c(en) = log(4|Kn(en)|en2/β)

ϵR(en) 6: Perturb means: µn k = ˆµn k(en) + Lap (1/ϵR(en)) 7: for all k Kn do 8: if maxl Kn µn l µn k > 2 (h(en) + c(en)) then 9: k(en) := k(en) {k} {Add k to the set of arms to eliminate if elimination condition is met}

B.2. Multi-Agent Successive Elimination with (ϵ, η)-Private Communications

We recall the algorithm for CORRUPTED ELIMINATION.

Multi-Agent Best Arm Identification with Private Communications

Algorithm 3 CORRUPTED ELIMINATION

1: Input: η, δ, ξ, M = log δ

log η 2: Output: Estimated best arm in K 3: Active agents N(t) = N; global and local active sets K = Kn = [K]; local failure probability ηξ = 1 1 η

(1 ξ)K 1 , Kn := K for all n N; global votes (λk = 0)k [K], global time t = 0 4: while |K| > 1 do 5: Active agent n is sampled: n PN(t) 6: Sample all arms in Kn and update mean estimates 7: k, , k G = Local Elimination (n, ξ) {Identify arms to eliminate locally} 8: Kn = Kn \ k {Remove eliminated arms from local active set} 9: λk := λk + 1 for any k k G(en) {Vote for global elim. of eliminated arms if corresponding message not corrupted} 10: if |Kn| = 1 then 11: N(t) := N(t) \ {n} 12: Synchronize Kn := Kn T K {Remove the latest globally eliminated arms from local active set} 13: for k k do 14: K := K \ {k} if λk > M {Remove arms from global active set if the voting threshold is met} 15: t := t + 1

We also introduce in Algorithm 4 the elimination subroutine, Local Elimination, used at line 5 of Algorithm 3. Local Elimination returns two arm sets: k, containing the arms to eliminate locally, and k G, containing the arms to vote against for global elimination, i.e., those whose corresponding messages have not been corrupted.

Algorithm 4 CORRUPTED ELIMINATION Local Elimination

1: Input: Active agent n, corruption probability ξ 2: Output: A set of arms k to eliminate locally, and a set of arms k G to vote against for global elimination 3: k = {}, k G = {} 4: for all k Kn do

5: if maxl Kn ˆµn l (t) ˆµn k(t) 2 q

log(Ktn2/ηξ)

6: k := k S{k} {Add k to the set of arms to eliminate if elimination condition is met} 7: Draw νn k B(1 ξ) 8: if νn k = 1 then 9: k G := k G {k} {If message is not corrupted, add k to the set of arms to vote against for global elim.}

Multi-Agent Best Arm Identification with Private Communications

C. Additional Experiments

In this section, we provide additional experimental results.

C.1. Experimental Setting

We consider the following stochastic environments suggested in related work (F eraud et al., 2019):

Problem 1: K = 10 arms with Bernoulli distributions and means µ1 = 0.7, µ2 = 0.5, µ3 = 0.3 and µk = 0.1 for k = 4, ..., 10:

Problem 2: K = 10 arms with Gaussian distributions and means µ1 = 11, µ2 = 10.8, µk = 10.4 for k = 3, ..., 10.

C.2. Sample Complexity and Privacy Level

Figure 2. Sample complexity of F eraud et al. (2019) (ξ = 0) on Problem 1 and Problem 2 as a function of η

For CORRUPTED ELIMINATION, Figure 2 shows a positive correlation between the sample complexity and η, which is both the local BAI failure probability and the privacy level when ξ = 0. While this behavior is not reflected in the initial sample complexity bound stated in F eraud et al. (2019), Proposition 5.3 provides a first theoretical justification.

Figure 3. Sample complexity of DP-MASE (left) for ϵ = 0.05, 0.1, 0.25 and CORRUPTED ELIMINATION (right) for ξ = 0, 0.05, 0.1 on Problem 2.

Experimental results on Problem 2, shown in Figure 3, lead to the same conclusions as results on Problem 1. Figure 4 shows the same results as in Figures 1 and 3 but without the baselines to distinguish the different curves more clearly.

Multi-Agent Best Arm Identification with Private Communications

Figure 4. Sample complexity of DP-MASE (left) for ϵ = 0.05, 0.1, 0.25 and CORRUPTED ELIMINATION (right) for ξ = 0, 0.05, 0.1 on Problem 1 (top) and Problem 2 (without baselines).

C.3. Sample Complexity and Local Accuracy

In the figures above, we displayed experimental sample complexity for DP-MASE for a fixed local accuracy (β = 0.5). We found it also interesting to compare sample complexity over different local accuracy levels.

In Figure 5, we plot the sample complexity of DP-MASE for β = 0.3, 0.5, 0.8, for a fixed privacy level ϵ = 0.5. We observe that the sample complexity increases with local accuracy. This is in contradiction with what we observe for CORRUPTED ELIMINATION. Indeed, in the case ξ = 0 (corresponding to the algorithm from F eraud et al. (2019)), we already noticed that the sample complexity increases with η, which is both the local probability of failure and the global privacy level. In the case ξ > 0, ξ is positively correlated with the local accuracy 1 ηξ for a given apparent privacy level η. Higher corruption probability means we can run more confident local BAI without compromising the global privacy of the best arm. But sample complexity decreases with ξ, as shown empirically in Figure 1. Therefore, the sample complexity of CORRUPTED ELIMINATION decreases with local accuracy 1 ηξ.

To explain these two opposite behaviors, we once again stress the implications of a higher local accuracy in terms of sample complexity. More confident local agents means that 1) more local samples are needed to estimate the best arm and 2) less independent eliminations are needed to reach a given global accuracy, hence a smaller voting threshold M. The former effect will tend to increase the sample complexity, while the latter (which we have called the threshold effect) will tend to decrease the sample complexity. In Section 5, based on experiments in Figure 1, we already argued that the threshold effect is predominant in the case of CORRUPTED ELIMINATION. Why would the local sample complexity matter more for DP-MASE? DP-MASE works in epochs, and with uniform agent sampling, it is likely that all agents will reach the end of a given epoch and proceed to an elimination at about the same time. In CORRUPTED ELIMINATION, each agent may vote for the global elimination of an arm at any time not only at the end of an epoch, so that it is more likely to gather M votes against an arm early in the process. In terms of speeding up the elimination of suboptimal arms, the global voting process will thus be more important in CORRUPTED ELIMINATION than in DP-MASE.

Multi-Agent Best Arm Identification with Private Communications

Figure 5. Sample complexity of DP-MASE for β = 0.3, 0.5, 0.8 on Problem 1 (left) and Problem 2 (right).

Figure 6. Sample complexity of DP-MASE (left) and CORRUPTED ELIMINATION (right) on Problem 1 for uniform and non-uniform agent distributions.

C.4. Sample Complexity and Agent Distribution

In previous experiments, we assumed PN was uniform. However, Proposition 4.3 suggests that PN affects the sample complexity in the Multi-Agent Successive Elimination scheme. Indeed, through the global elimination process, the algorithm roughly waits for the M-th most frequent player to output his best arm estimate. Therefore, we expect DP-MASE and CORRUPTED ELIMINATION to be faster for unbalanced distributions, without losing accuracy, and indeed observe this behavior empirically.

To simulate an unbalanced agent distribution, we choose the following:

P γ N (n) (n + α) γ ,

for some α 0 and γ (0, 1). Thus, the probability of sampling agent n decreases with n.

We illustrate the impact of PN on the sample complexity for both DP-MASE and CORRUPTED ELIMINATION, comparing the uniform distribution to P γ N with γ = 0.8. We choose ϵ = 0.1 and β = 0.5 for DP-MASE, and set ξ = 0.05 for CORRUPTED ELIMINATION. As expected, both methods are much faster when the agent distribution is unbalanced.