# Doubly Adversarial Federated Bandits

Jialin Yi¹, Milan Vojnović¹

¹Department of Statistics, London School of Economics and Political Science, London, United Kingdom. Correspondence to: Jialin Yi.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call *doubly adversarial*. In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessitates communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback or to bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question posed in (Cesa-Bianchi et al., 2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets.

## 1. Introduction

There is a rising trend of research on federated learning, which coordinates multiple heterogeneous agents to collectively train a learning algorithm while keeping the raw data decentralized (Kairouz et al., 2021). We consider the federated learning variant of a multi-armed bandit problem, which is one of the most fundamental sequential decision making problems. In standard multi-armed bandit problems, a learning agent needs to balance the trade-off between exploring various arms in order to learn how rewarding they are and selecting high-rewarding arms. In federated bandit problems, multiple heterogeneous agents collaborate with each other to maximize their cumulative rewards. The challenge here is to design decentralized collaborative algorithms that find a globally best arm for all agents while keeping their raw data decentralized.

Finding a globally best arm with raw arm or loss sequences stored in a distributed system has ubiquitous applications in many systems built with a network of learning agents. One application is in recommender systems, where different recommendation app clients (i.e. agents) in a communication network collaborate with each other to find news articles (i.e. arms) that are popular among all users within a specific region, which can be helpful for solving the cold-start problem (Li et al., 2010; Yi et al., 2021). In this setting, the system avoids the exchange of users' browsing history (i.e. arm or loss sequences) between different clients for better privacy protection. Another motivation is in international collaborative drug discovery research, where different countries (i.e. agents) cooperate with each other to find a drug (i.e. arm) that is uniformly effective for all patients across the world (Varatharajah & Berry, 2022).
To protect the privacy of the patients involved in the research, the exact treatment history of specific patients (i.e. arm or loss sequences) should not be shared during the collaboration.

Federated bandit problems focus on identifying a globally best arm (pure exploration) or maximizing the cumulative group reward (regret minimization) in the face of heterogeneous feedback from different agents for the same arm, and have gained much attention in recent years (Dubey & Pentland, 2020; Zhu et al., 2021; Huang et al., 2021; Shi et al., 2021; Réda et al., 2022). In prior work, the heterogeneous feedback received by different agents is modeled as samples from some unknown but fixed distributions. Though this formulation of heterogeneous feedback allows elegant statistical analysis of the regret, it may not be adequate for dynamic (non-stationary) environments. For example, consider the task of finding popular news articles within a region, mentioned above. The popularity of news articles on different topics can be time-varying, e.g. news on football may become most popular during the FIFA World Cup but less popular afterwards.

In contrast with prior work, we introduce a new non-stochastic federated multi-armed bandit problem in which the heterogeneous feedback received by different agents is chosen by an oblivious adversary. We consider a federated bandit problem with $K$ arms and $N$ agents. The agents can share their information via a communication network. At each time step, each agent chooses one arm, receives the feedback, and exchanges information with its neighbors in the network. The problem is doubly adversarial, i.e. the losses are determined by an oblivious adversary which specifies the loss of each arm not only for each time step but also for each agent. As a result, agents that choose the same arm at the same time step may observe different losses. The goal is to find the globally best arm in hindsight, whose cumulative loss averaged over all agents is lowest, without exchanging raw information consisting of arm identity or loss value sequences among agents. As standard in online learning problems, we focus on regret minimization over an arbitrary time horizon.

### 1.1. Related work

The doubly adversarial federated bandit problem is related to several lines of research, namely federated bandits, multi-agent cooperative adversarial bandits, and distributed optimization. Here we briefly discuss these related works.

**Federated bandits** Solving bandit problems in the federated learning setting has gained attention in recent years. Dubey & Pentland (2020) and Huang et al. (2021) considered the linear contextual bandit problem and extended the LinUCB algorithm (Li et al., 2010) to the federated learning setting. Zhu et al. (2021) and Shi et al. (2021) studied a federated multi-armed bandit problem where the losses observed by different agents are i.i.d. samples from some common unknown distribution. Réda et al. (2022) considered the problem of identifying a globally best arm for multi-armed bandit problems in a centralized federated learning setting. All these works focus on the stochastic setting, i.e. the reward or loss of an arm is sampled from some unknown but fixed distribution. Our work considers the non-stochastic setting, i.e. losses are chosen by an oblivious adversary, which is a more appropriate assumption for non-stationary environments.
**Multi-agent cooperative adversarial bandits** Cesa-Bianchi et al. (2016), Bar-On & Mansour (2019), and Yi & Vojnović (2022) studied the adversarial case where agents receive the same loss for the same action chosen at the same time step; their algorithms require the agents to exchange their raw data with neighbors. Cesa-Bianchi et al. (2020) discussed the cooperative online learning setting where the agents have access to full-information feedback and the communication is asynchronous. In these works, agents that choose the same action at the same time step receive the same reward or loss value, and agents aggregate messages received from their neighbors. Our work relaxes this assumption by allowing agents to receive different losses even for the same action in a time step. Besides, we propose a new algorithm that uses a different aggregation of messages than the aforementioned papers, based on the distributed dual averaging method (Nesterov, 2009; Xiao, 2009; Duchi et al., 2011).

**Distributed optimization** Duchi et al. (2011) proposed the dual averaging algorithm for distributed convex optimization via a gossip communication mechanism. Subsequently, Hosseini et al. (2013) extended this algorithm to the online optimization setting. Scaman et al. (2019) found optimal distributed algorithms for distributed convex optimization and a lower bound which applies to strongly convex functions. The doubly adversarial federated bandit problem with full-information feedback is a special case of distributed online linear optimization. Our work complements these existing studies by providing a lower bound for distributed online linear optimization problems. Moreover, our work proposes a near-optimal algorithm for the more challenging bandit feedback setting.

### 1.2. Organization of the paper and our contributions

We first formally formulate the doubly adversarial federated bandit problem and the federated bandit algorithms we study in Section 2. Then, in Section 3, we provide two regret lower bounds for any federated bandit algorithm under the full-information and bandit feedback settings, respectively. In Section 4, we present a federated bandit algorithm adapted from the celebrated Exp3 algorithm for the bandit feedback setting, together with its regret upper bound. Finally, we show the results of our numerical experiments in Section 5.

Our contributions can be summarized as follows:

(i) We introduce a new federated bandit setting, doubly adversarial federated bandits, in which no stochastic assumptions are made for the heterogeneous losses received by the agents. This adversarial setting complements prior work that focuses on the stochastic setting.

(ii) For both the full-information and bandit feedback settings, we provide regret lower bounds for any federated bandit algorithm. The regret lower bound for the full-information setting also applies to distributed online linear optimization problems and, to the best of our knowledge, is the first lower bound result for this problem.

(iii) For the bandit feedback setting, we propose a new near-optimal federated bandit Exp3 algorithm (FEDEXP3) with a sub-linear regret upper bound. Our FEDEXP3 algorithm resolves an open question posed in (Cesa-Bianchi et al., 2016): it is possible to achieve a sub-linear regret for each agent simultaneously without exchanging the action or loss sequences, or the action selection distributions, among agents.

## 2. Problem setting
Consider a communication network defined by an undirected graph $G = (V, E)$, where $V$ is the set of $N$ agents and $(u, v) \in E$ if agent $u$ and agent $v$ can directly exchange messages. We assume that $G$ is simple, i.e. it contains no self-loops or multiple edges. The agents in the communication network collaboratively aim to solve a non-stochastic multi-armed bandit problem. In this problem, there is a fixed set $A$ of $K$ arms and a fixed time horizon $T$. Each instance of the problem is parameterized by a tensor $L = (\ell^v_t(i)) \in [0, 1]^{T \times N \times K}$, where $\ell^v_t(i)$ is the loss associated with agent $v \in V$ if it chooses arm $i \in A$ at time step $t$. At each time step $t$, each agent $v$ chooses an action $a^v_t = i$, observes the feedback $I^v_t$, and incurs a loss defined as the average of the losses of arm $i$ over all agents, i.e.,

$$\ell_t(i) = \frac{1}{N} \sum_{v \in V} \ell^v_t(i). \tag{1}$$

At the end of each time step, each agent $v \in V$ can communicate with its neighbors $N(v) = \{u \in V : (u, v) \in E\}$.

We assume a non-stochastic setting, i.e. the loss tensor $L$ is determined by an oblivious adversary. In this setting, the adversary has knowledge of the description of the algorithm run by the agents, but the losses in $L$ do not depend on the specific arms selected by the agents.

The performance of each agent $v \in V$ is measured by its regret, defined as the difference between the expected cumulative loss incurred and the cumulative loss of a globally best fixed arm in hindsight, i.e.

$$R^v_T(\pi, L) = \mathbb{E}\left[\sum_{t=1}^T \ell_t(a^v_t)\right] - \min_{i \in A} \sum_{t=1}^T \ell_t(i),$$

where the expectation is taken over the actions of all agents under algorithm $\pi$ on instance $L$. We abbreviate $R^v_T(\pi, L)$ as $R^v_T$ when the algorithm $\pi$ and instance $L$ are clear from the context. We aim to characterize $\max_L R^v_T(\pi, L)$ for each agent $v \in V$ under two feedback settings: full-information feedback, $I^v_t = \ell^v_t$, and bandit feedback, $I^v_t = \ell^v_t(a^v_t)$.

Let $F^v_t$ be the sequence of agent $v$'s actions and feedback up to time step $t$, i.e., $F^v_t = \bigcup_{s=1}^t \{a^v_s, I^v_s\}$. For a graph $G = (V, E)$, we denote by $d(u, v)$ the number of edges of a shortest path connecting nodes $u$ and $v$ in $V$, with $d(v, v) = 0$. We focus on the case when $\pi$ is a federated bandit algorithm, in which each agent $v \in V$ can only communicate with its neighbors within a time step.

**Definition 2.1** (federated bandit algorithm). A federated bandit algorithm $\pi$ is a multi-agent learning algorithm such that for each round $t$ and each agent $v \in V$, the action selection distribution $p^v_t$ only depends on $\bigcup_{u \in V} F^u_{t - d(u,v) - 1}$.

From Definition 2.1, the communication between any two agents $u$ and $v$ in $G$ comes with a delay equal to $d(u, v) + 1$. Here we give some examples of $\pi$ in different settings: when $|V| = 1$ and $I^v_t = \ell^v_t$, $\pi$ is an online learning algorithm for learning with expert advice (Cesa-Bianchi et al., 1997); when $|V| = 1$ and $I^v_t = \ell^v_t(a^v_t)$, $\pi$ is a sequential algorithm for a multi-armed bandit problem (Auer et al., 2002); when $I^v_t$ contains the gradient $\nabla f(x^v_t)$ of some convex function $f(x)$, $\pi$ belongs to the class of black-box procedures for distributed convex optimization over a simplex (Scaman et al., 2019); and when $G$ is a star graph, $\pi$ is a centralized federated bandit algorithm as discussed in (Réda et al., 2022).
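To make the setting concrete, here is a minimal NumPy sketch of a doubly adversarial instance and the per-agent regret defined above. The problem sizes, the random loss tensor, and all variable names are our own illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's experimental settings).
T, N, K = 100, 4, 5

# Loss tensor L[t, v, i] in [0, 1]^{T x N x K}, fixed in advance by an
# oblivious adversary; here it is simply random for illustration.
L = rng.uniform(size=(T, N, K))

# Network loss of arm i at time t: the average over agents, Eq. (1).
network_loss = L.mean(axis=1)                       # shape (T, K)

# Globally best arm in hindsight and its cumulative network loss.
best_arm = network_loss.sum(axis=0).argmin()
best_cum_loss = network_loss.sum(axis=0)[best_arm]

def regret(actions):
    """Regret of one agent given its chosen arms a_t; the regret is
    measured against the network loss, not the agent's private loss."""
    incurred = network_loss[np.arange(T), actions].sum()
    return incurred - best_cum_loss

# Example: an agent that picks arms uniformly at random.
actions = rng.integers(K, size=T)
print(best_arm, regret(actions))
```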
## 3. Lower bounds

In this section, we show two lower bounds on the cumulative regret of any federated bandit algorithm $\pi$ in which all agents exchange their messages through the communication network $G = (V, E)$, for the full-information and bandit feedback settings. Both lower bounds highlight how the cumulative regret of a federated algorithm $\pi$ is related to the minimum time it takes for all agents in $G$ to reach an agreement on a globally best arm. Reaching an agreement about a globally best arm in hindsight amounts to each agent $v$ finding $i^* \in \arg\min_{i \in A} \sum_{t=1}^T \ell_t(i)$ by exchanging its private information about $\{\sum_{t=1}^T \ell^v_t(i) : i \in A\}$ with its neighbors. This is known as a distributed consensus averaging problem (Boyd et al., 2004).

Let $d_v = |N(v)|$, $d_{\max} = \max_{v \in V} d_v$ and $d_{\min} = \min_{v \in V} d_v$. The dynamics of a consensus averaging procedure is usually characterized by the spectrum of the Laplacian matrix $M$ of graph $G$, defined as

$$M_{u,v} = \begin{cases} d_u & \text{if } u = v \\ -1 & \text{if } u \neq v \text{ and } (u, v) \in E \\ 0 & \text{otherwise.} \end{cases}$$

Let $\lambda_1(M) \geq \cdots \geq \lambda_N(M) = 0$ be the eigenvalues of the Laplacian matrix $M$. The second smallest eigenvalue $\lambda_{N-1}(M)$ is the algebraic connectivity, which approximates the sparsest cut of graph $G$ (Arora et al., 2009). In the following two theorems, we show that for any federated bandit algorithm $\pi$, there always exists a problem instance and an agent whose worst-case cumulative regret is $\Omega\big(\lambda_{N-1}(M)^{-1/4} \sqrt{T}\big)$.

**Theorem 3.1** (Full-information feedback). For any federated bandit algorithm $\pi$, there exists a graph $G = (V, E)$ with Laplacian matrix $M$ and a full-information feedback instance $L \in [0, 1]^{T \times N \times K}$ such that for some $v_1 \in V$,

$$R^{v_1}_T(\pi, L) = \Omega\left(\sqrt[4]{\frac{1 + d_{\max}}{\lambda_{N-1}(M)}}\, \sqrt{T \log K}\right).$$

The proof, in Appendix A.1, relies on the existence of a graph containing two clusters of agents, $A$ and $B$, with distance $d(A, B) = \min_{u \in A, v \in B} d(u, v) = \Omega\big(\sqrt{(d_{\max} + 1)/\lambda_{N-1}(M)}\big)$. Then, we consider an instance where only agents in cluster $A$ receive non-zero losses. Based on a reduction argument, the cumulative regrets of agents in cluster $B$ are the same (up to a constant factor) as the cumulative regret in a single-agent adversarial problem with feedback delay $d(A, B)$ (see Lemma A.4 in Appendix A.1). Hence, one can show that the cumulative regret of agents in cluster $B$ is $\Omega\big(\sqrt{d(A, B)\, T \log K}\big)$.

Note that the doubly adversarial federated bandit problem with full-information feedback is a special case of distributed online linear optimization, with the decision set being a $(K-1)$-dimensional simplex. Hence, Theorem 3.1 immediately implies a regret lower bound for the distributed online linear optimization problem. To the best of our knowledge, this is the first non-trivial lower bound that relates the hardness of distributed online linear optimization to the algebraic connectivity of the communication network.

Leveraging the lower bound for the full-information setting, we show a lower bound for the bandit feedback setting.

**Theorem 3.2** (Bandit feedback). For any federated bandit algorithm $\pi$, there exists a graph $G = (V, E)$ with Laplacian matrix $M$ and a bandit feedback instance $L \in [0, 1]^{T \times N \times K}$ such that for some $v_1 \in V$,

$$R^{v_1}_T(\pi, L) = \Omega\left(\max\left\{\min\left\{T, \sqrt{\frac{KT}{1 + d_{v_1}}}\right\},\ \sqrt[4]{\frac{1 + d_{\max}}{\lambda_{N-1}(M)}}\, \sqrt{T \log K}\right\}\right).$$

The proof is provided in Appendix A.2. The lower bound contains two parts. The first part, derived from the information-theoretic argument in (Shamir, 2014), captures the effect of bandit feedback. The second part is inherited from Theorem 3.1, using the fact that the regret of an agent in the bandit feedback setting cannot be smaller than its regret in the full-information setting.
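Both lower bounds are stated in terms of the algebraic connectivity $\lambda_{N-1}(M)$. Below is a small sketch, assuming NumPy and NetworkX, of how one could compute this quantity and the hardness factor appearing in Theorems 3.1 and 3.2 for a given communication graph; the graph choice and all names are illustrative.

```python
import numpy as np
import networkx as nx

# Build a small communication graph and its Laplacian matrix M.
G = nx.random_geometric_graph(n=36, radius=0.5, seed=1)
M = nx.laplacian_matrix(G).toarray().astype(float)

# Eigenvalues of M in decreasing order: lambda_1 >= ... >= lambda_N = 0.
eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]
algebraic_connectivity = eigvals[-2]          # lambda_{N-1}(M)

d_max = max(dict(G.degree()).values())

# The graph-dependent factor appearing in the regret lower bounds.
hardness = ((1 + d_max) / algebraic_connectivity) ** 0.25
print(algebraic_connectivity, hardness)
```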
## 4. FEDEXP3: a federated regret-minimization algorithm

Inspired by the fact that the cumulative regret is related to the time needed to reach consensus about a globally best arm, we introduce a new federated bandit algorithm based on the gossip communication mechanism, called FEDEXP3. The details of FEDEXP3 are described in Algorithm 1. We shall also show that FEDEXP3 has a sub-linear cumulative regret upper bound which holds for all agents simultaneously.

The FEDEXP3 algorithm is adapted from the Exp3 algorithm. Each agent $v$ maintains an estimator $z^v_t \in \mathbb{R}^K$ of the cumulative losses of all arms and a tentative action selection distribution $x^v_t \in [0, 1]^K$. At the beginning of each time step $t$, each agent follows the action selection distribution $x^v_t$ with probability $1 - \gamma_t$, or performs uniform random exploration with probability $\gamma_t$. Once the action $a^v_t$ is sampled, the agent observes the associated loss $\ell^v_t(a^v_t)$ and then computes an importance-weighted loss estimator $g^v_t \in \mathbb{R}^K$ using the sampling probability $p^v_t(a^v_t)$. Before $g^v_t$ is integrated into the cumulative loss estimator $z^v_{t+1}$, the agent communicates with its neighbors to average its cumulative loss estimator $z^v_t$. The communication step is characterized by a gossip matrix, a doubly stochastic matrix $W \in [0, 1]^{N \times N}$ satisfying

$$\sum_{v \in V} W_{u,v} = \sum_{u \in V} W_{u,v} = 1 \quad \text{and} \quad W_{u,v} \geq 0, \text{ with } W_{u,v} = 0 \text{ when } u \neq v \text{ and } (u, v) \notin E.$$

**Algorithm 1: FEDEXP3**

- **Input:** learning rates $\{\eta_t > 0\}$, exploration ratios $\{\gamma_t > 0\}$, and a gossip matrix $W \in [0, 1]^{N \times N}$.
- **Initialize:** $z^v_1(i) = 0$ and $x^v_1(i) = 1/K$ for all $i \in A$ and $v \in V$.
- **For** each time step $t = 1, 2, \ldots, T$ and each agent $v \in V$:
  1. Compute the action distribution $p^v_t(i) = (1 - \gamma_t)\, x^v_t(i) + \gamma_t / K$.
  2. Choose the action $a^v_t \sim p^v_t$.
  3. Compute the loss estimator $g^v_t(i) = \ell^v_t(i)\, \mathbb{I}\{a^v_t = i\} / p^v_t(i)$.
  4. Update the gossiped cumulative loss estimator $z^v_{t+1} = \sum_{u \in V} W_{u,v}\, z^u_t + g^v_t$.
  5. Update the tentative action selection distribution $x^v_{t+1}(i) = \dfrac{\exp\{-\eta_t z^v_{t+1}(i)\}}{\sum_{j \in A} \exp\{-\eta_t z^v_{t+1}(j)\}}$.

This gossip communication step facilitates the agents reaching a consensus on the estimators of the cumulative losses of all arms, and hence allows the agents to identify a globally best arm in hindsight. We present below an upper bound on the cumulative regret of each agent running FEDEXP3.

**Theorem 4.1.** Assume that the network runs Algorithm 1 with

$$\gamma_t = \sqrt[3]{\frac{(C_W + 1)\, K^2 \log K}{2t}} \quad \text{and} \quad \eta_t = \sqrt[3]{\frac{(\log K)^2}{(C_W + 1)\, K^2 t^2}},$$

where $C_W = \min\{2 \log T + \log N, N\} / (1 - \sigma_2(W)) + 3$ and $\sigma_2(W)$ is the second largest singular value of $W$. Then, the expected regret of each agent $v \in V$ is bounded as

$$R^v_T = O\left(\sqrt[3]{C_W K^2 \log K}\; T^{2/3}\right).$$

**Proof sketch.** Let $f_t$ and $\bar z_t$ be the average instantaneous loss estimator and the average cumulative loss estimator,

$$f_t = \frac{1}{N} \sum_{v \in V} g^v_t \quad \text{and} \quad \bar z_t = \frac{1}{N} \sum_{v \in V} z^v_t,$$

and let $y_t$ be the action distribution that minimizes the regularized average cumulative loss estimator,

$$y_t(i) = \frac{\exp\{-\eta_{t-1} \bar z_t(i)\}}{\sum_{j \in A} \exp\{-\eta_{t-1} \bar z_t(j)\}}.$$

The cumulative regret can be bounded by the sum of three terms,

$$R^v_T \lesssim \underbrace{\mathbb{E}\left[\sum_{t=1}^T \left(\langle f_t, y_t \rangle - f_t(i^*)\right)\right]}_{\text{FTRL}} + \underbrace{\sum_{t=1}^T \eta_{t-1}\, \mathbb{E}\left[\|f_t\| \, \|z^v_t - \bar z_t\|\right]}_{\text{CONSENSUS}} + \underbrace{\sum_{t=1}^T \gamma_t}_{\text{EXPLORATION}}$$

where $i^* \in \arg\min_{i \in A} \sum_{t=1}^T \ell_t(i)$ is a globally best arm in hindsight. The FTRL term is a typical regret term from the classic analysis of the Follow-The-Regularized-Leader algorithm (Lattimore & Szepesvári, 2020). The CONSENSUS term measures the cumulative approximation error generated during the consensus-reaching process, which can be bounded using the convergence analysis of distributed averaging algorithms based on doubly stochastic matrices (Duchi et al., 2011; Hosseini et al., 2013). The EXPLORATION term can be bounded by specifying the time-decaying exploration ratio $\gamma_t$. The full proof of Theorem 4.1 is provided in Appendix A.4.
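For concreteness, the following is a compact Python simulation sketch of Algorithm 1. The max-degree gossip matrix anticipates the construction discussed in the next section; the $t^{-1/3}$ schedules for $\gamma_t$ and $\eta_t$ only mimic the order of the tuning in Theorem 4.1 and are not the exact constants; all function and variable names are ours, not the paper's.

```python
import numpy as np
import networkx as nx

def max_degree_gossip(G):
    """Doubly stochastic gossip matrix W = I - (D - A) / (2 (1 + d_max))."""
    A = nx.to_numpy_array(G)
    D = np.diag(A.sum(axis=1))
    d_max = A.sum(axis=1).max()
    return np.eye(len(G)) - (D - A) / (2.0 * (1.0 + d_max))

def fedexp3(L, G, eta, gamma, rng):
    """Run a sketch of Algorithm 1 on a loss tensor L[t, v, i]."""
    T, N, K = L.shape
    W = max_degree_gossip(G)
    z = np.zeros((N, K))                 # cumulative loss estimators z_t^v
    x = np.full((N, K), 1.0 / K)         # tentative action distributions x_t^v
    actions = np.zeros((T, N), dtype=int)
    for t in range(T):
        p = (1 - gamma[t]) * x + gamma[t] / K              # exploration mixing
        a = np.array([rng.choice(K, p=p[v]) for v in range(N)])
        g = np.zeros((N, K))                               # importance weighting
        g[np.arange(N), a] = L[t, np.arange(N), a] / p[np.arange(N), a]
        z = W.T @ z + g                                    # gossip + local update
        z_shift = z - z.min(axis=1, keepdims=True)         # for numerical stability
        x = np.exp(-eta[t] * z_shift)
        x /= x.sum(axis=1, keepdims=True)                  # exponential weights
        actions[t] = a
    return actions

# Minimal usage example with arbitrary losses and simple t^{-1/3} schedules.
rng = np.random.default_rng(0)
T, K = 500, 10
G = nx.convert_node_labels_to_integers(nx.grid_2d_graph(3, 3))
N = len(G)
L = rng.uniform(size=(T, N, K))
t = np.arange(1, T + 1)
gamma = np.minimum(1.0, (K**2 * np.log(K) / t) ** (1 / 3))
eta = (np.log(K) ** 2 / (K**2 * t**2)) ** (1 / 3)
actions = fedexp3(L, G, eta, gamma, rng)
print(actions[-1])
```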
The FEDEXP3 algorithm is also a valid algorithm for the multi-agent adversarial bandit problem (Cesa-Bianchi et al., 2016), which is a special case of the doubly adversarial federated bandit problem with $\ell^v_t(i) = \ell_t(i)$ for all $v \in V$. In the distributed consensus process of FEDEXP3, each agent $v \in V$ only communicates cumulative loss estimator values $z^v_t$, and not the actual loss values $\ell^v_t(a^v_t)$ or the selection distribution $p^v_t$. FEDEXP3 can therefore guarantee a sub-linear regret without the exchange of sequences of selected arm identities or loss sequences among agents, which resolves an open question raised in (Cesa-Bianchi et al., 2016).

**Choice of the gossip matrix** The gossip matrix $W$ can be constructed using the max-degree trick in (Duchi et al., 2011), i.e.,

$$W = I - \frac{D - A}{2(1 + d_{\max})},$$

where $D = \mathrm{diag}(d_1, \ldots, d_N)$ and $A$ is the adjacency matrix of $G$. This construction of $W$ requires that all agents know the maximum degree $d_{\max}$, which can easily be computed in a distributed system by nodes exchanging messages and updating their states using the maximum reduce operator.

Another choice of $W$ comes from the effort to minimize the cumulative regret. The leading factor $1 / \sqrt[3]{1 - \sigma_2(W)}$ in the regret upper bound of FEDEXP3 can be minimized by choosing a gossip matrix $W$ with the smallest $\sigma_2(W)$. Suppose that the agents have knowledge of the topology of the communication graph $G$; then the agents can choose the gossip matrix that minimizes their regret by solving the following convex optimization problem (a code sketch follows at the end of this section):

$$\begin{array}{ll} \text{minimize} & \|W - (1/N) \mathbf{1}\mathbf{1}^T\|_2 \\ \text{subject to} & W \geq 0,\quad W\mathbf{1} = \mathbf{1},\quad W = W^T, \\ & W_{ij} = 0 \ \text{ for } (i, j) \notin E,\ i \neq j, \end{array}$$

which has an equivalent semi-definite programming formulation, as noted in (Boyd et al., 2004).

**Gap between upper and lower bounds** The regret upper bound of the FEDEXP3 algorithm in Theorem 4.1 grows sub-linearly in the number of arms $K$ and the time horizon $T$. There is only a small polynomial gap between the regret upper bound and the lower bound in Theorem 3.2 with respect to these two parameters. The regret upper bound also depends on the second largest singular value $\sigma_2(W)$ of $W$. The related term in the lower bound in Theorem 3.2 is the second smallest eigenvalue $\lambda_{N-1}(M)$ of the Laplacian matrix $M$. To compare these two terms, we point out that when the gossip matrix is constructed using the max-degree method, as discussed in Corollary 1 in (Duchi et al., 2011),

$$1 - \sigma_2(W) \geq \frac{\lambda_{N-1}(M)}{2(1 + d_{\max})},$$

so the factor $1/\sqrt[3]{1 - \sigma_2(W)}$ in the upper bound is $O\big(\sqrt[3]{(d_{\max} + 1)/\lambda_{N-1}(M)}\big)$. With respect to the factor $\sqrt[4]{(d_{\max} + 1)/\lambda_{N-1}(M)}$ in Theorem 3.2, there is only a small polynomial gap between the regret upper bound and the lower bound. We note that a similar dependence on $\sigma_2(W)$ is present in the analysis of distributed optimization algorithms (Duchi et al., 2011; Hosseini et al., 2013).
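The regret-minimizing gossip matrix above can be computed with an off-the-shelf convex solver. The sketch below uses cvxpy (our choice of tool; the paper does not specify one) to solve that optimization problem on a small cycle graph and compares $\sigma_2(W)$ with the max-degree construction; names and the test graph are illustrative.

```python
import numpy as np
import networkx as nx
import cvxpy as cp

def fastest_mixing_gossip(G):
    """Minimize ||W - (1/N) 1 1^T||_2 over symmetric doubly stochastic
    matrices supported on the edges of G (Boyd et al., 2004)."""
    N = len(G)
    A = nx.to_numpy_array(G)
    W = cp.Variable((N, N), symmetric=True)
    ones = np.ones(N)
    constraints = [W >= 0, W @ ones == ones]
    # Entries that correspond to neither an edge nor a self-loop are zero.
    constraints += [W[i, j] == 0
                    for i in range(N) for j in range(N)
                    if i != j and A[i, j] == 0]
    objective = cp.Minimize(cp.sigma_max(W - np.ones((N, N)) / N))
    cp.Problem(objective, constraints).solve()
    return W.value

def sigma2(W):
    """Second largest singular value of a gossip matrix."""
    return np.linalg.svd(W, compute_uv=False)[1]

G = nx.cycle_graph(8)                              # every node has degree 2
d_max = 2
W_maxdeg = np.eye(8) - (np.diag([d_max] * 8) - nx.to_numpy_array(G)) / (2 * (1 + d_max))
W_opt = fastest_mixing_gossip(G)
print(sigma2(W_maxdeg), sigma2(W_opt))             # optimized sigma_2 is no larger
```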
## 5. Numerical experiments

We present experimental results for the FEDEXP3 algorithm ($W$ constructed by the max-degree method) using both synthetic and real-world datasets. We aim to validate our theoretical analysis and demonstrate the effectiveness of FEDEXP3 in finding a globally best arm in non-stationary environments. All experiments are performed with 10 independent runs. The code for producing our experimental results is available online in an anonymous GitHub repository: https://github.com/jialinyi94/doubly-stochasticfederataed-bandit.

### 5.1. Synthetic datasets

We validate our theoretical analysis of the FEDEXP3 algorithm on synthetic datasets. The objective is two-fold. First, we demonstrate that the cumulative regret of FEDEXP3 grows sub-linearly with time. Second, we examine the dependence of the regret on the second largest singular value of the gossip matrix.

Figure 1. (Top) The average cumulative regret, i.e. $\sum_{v \in V} R^v_T / N$, versus $T$, for FEDEXP3 and Exp3 on the grid graph. (Bottom) The average cumulative regret versus $(1 - \sigma_2(W))^{-1/3}$ for FEDEXP3 on different networks at $T = 3000$.

A motivation for finding a globally best arm in recommender systems is to provide recommendations for those users whose feedback is sparse. Accordingly, we construct a federated bandit setting in which a subset of agents is activated at each time step and only activated agents may receive non-zero losses. Specifically, we set $T = 3{,}000$ with $N = 36$ and $K = 20$. At each time step $t$, a subset $U_t$ of $N/2$ agents is selected from $V$ with replacement. For each activated agent in $U_t$, the loss of arm $i$ is sampled independently from a Bernoulli distribution with mean $\mu_i = (i - 1)/(K - 1)$. All non-activated agents receive a loss of 0 for any arm they choose at time step $t$.

We evaluate the performance of FEDEXP3 on different networks: a complete graph, a $\sqrt{N} \times \sqrt{N}$ grid network, and random geometric graphs. The random geometric graph RGG($d$) is constructed by placing each node uniformly at random in $[0, 1]^2$ and connecting any two nodes whose distance is less than or equal to $d$ (Penrose, 2003). Random geometric graphs are commonly used for modeling spatial networks. In our experiments, we set $d \in \{0.3, \ldots, 0.9\}$.

The results in Figure 1 confirm that the cumulative regret of the FEDEXP3 algorithm grows sub-linearly with respect to time and suggest that the cumulative regret of FEDEXP3 is proportional to $(1 - \sigma_2(W))^{-1/3}$. This is compatible with the regret upper bound in Theorem 4.1.

### 5.2. MovieLens dataset: recommending popular movie genres

We compare FEDEXP3 with a UCB-based federated bandit algorithm in a movie recommendation scenario using a real-world dataset. In movie recommendation settings, users' preferences over different genres of movies can change over time. In such non-stationary environments, we demonstrate that a significant performance improvement can be achieved by FEDEXP3 against the Gossip UCB algorithm (which we refer to as GUCB) proposed in (Zhu et al., 2021), which is designed for stationary settings.

We evaluate the two algorithms using the MovieLens Latest-full dataset, which contains 58,000 movies, classified into 20 genres, with 27,000,000 ratings (rating scores in $\{0.5, 1, \ldots, 5\}$) from 280,000 users. Among all the users, there are 3,364 users who rated at least one movie in every genre. We select these users as our agents, i.e. $N = 3{,}364$, and the 20 genres as the arms to be recommended, i.e. $K = 20$.

We create a federated bandit environment for movie recommendation based on this dataset. Let $m_v(i)$ be the number of ratings that agent $v$ has for genre $i$. We set the horizon $T = \max_{v \in V} \max_{i \in A} m_v(i) = 12{,}800$. To reflect the changes in agents' preferences over genres as time evolves, we sort the ratings in increasing order of their Unix timestamps and construct the loss tensor in the following way. Let $r^v_j(i)$ be the $j$-th rating of agent $v$ on genre $i$; the loss of recommending a movie of genre $i$ to agent $v$ at time step $t$ is defined as

$$\ell^v_t(i) = \frac{5.5 - r^v_j(i)}{5.5} \quad \text{for } t \in \left( (j - 1) \left\lfloor \frac{T}{m_v(i)} \right\rfloor,\ j \left\lfloor \frac{T}{m_v(i)} \right\rfloor \right].$$

The performance of FEDEXP3 and GUCB is shown in Figure 2. The results demonstrate that FEDEXP3 outperforms GUCB by a significant margin for different communication networks.

Figure 2. The average cumulative regret versus horizon time for FEDEXP3 and GUCB in the movie recommendation setting with the communication networks: (top) complete graph, (middle) the grid network, and (bottom) RGG(0.5).
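A minimal sketch of the loss construction of Section 5.2, under our reading of the block mapping above: each of an agent's chronologically sorted ratings for a genre is held constant on a block of $\lfloor T / m_v(i) \rfloor$ time steps. The helper name and the example ratings are hypothetical, not from the paper's code.

```python
import numpy as np

def genre_loss_schedule(ratings, T):
    """Map one agent's chronologically sorted ratings for one genre to a
    loss sequence: loss = (5.5 - rating) / 5.5, held constant on blocks
    of length floor(T / m). Steps beyond m * floor(T / m) stay at 0 here."""
    m = len(ratings)
    block = T // m
    losses = np.zeros(T)
    for j, r in enumerate(ratings, start=1):
        losses[(j - 1) * block : j * block] = (5.5 - r) / 5.5
    return losses

# Hypothetical agent whose taste for a genre degrades over time, so the
# genre's loss increases across the horizon.
ratings_sorted_by_time = [5.0, 4.5, 3.0, 1.0]
print(genre_loss_schedule(ratings_sorted_by_time, T=12))
```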
## 6. Conclusion and future research

We studied doubly adversarial federated bandits, a new adversarial (non-stochastic) setting for federated bandits, which complements prior studies of stochastic federated bandits. First, we derived regret lower bounds for any federated bandit algorithm when the agents have access to full-information or bandit feedback. These regret lower bounds relate the hardness of the problem to the algebraic connectivity of the network through which the agents communicate. Then we proposed the FEDEXP3 algorithm, a federated version of the Exp3 algorithm. We showed that there is only a small polynomial gap between the regret upper bound of FEDEXP3 and the lower bound. Numerical experiments using both synthetic and real-world datasets demonstrated that FEDEXP3 can outperform a state-of-the-art stochastic federated bandit algorithm by a significant margin in non-stationary environments.

We point out some interesting avenues for future research on doubly adversarial federated bandits. The first is to close the gap between the regret upper bound of the FEDEXP3 algorithm and the lower bounds shown in this paper. The second is to extend the doubly adversarial assumption to federated linear bandit problems, where the doubly adversarial assumption could replace the stochastic assumption on the noise in the linear model.

## References

Arora, S., Rao, S., and Vazirani, U. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM, 56(2):1–37, 2009.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
Bar-On, Y. and Mansour, Y. Individual regret in cooperative nonstochastic multi-armed bandits. Advances in Neural Information Processing Systems, 32, 2019.
Boyd, S., Diaconis, P., and Xiao, L. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667–689, 2004.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pp. 605–622. PMLR, 2016.
Cesa-Bianchi, N., Cesari, T., and Monteleoni, C. Cooperative online learning: Keeping your neighbors updated. In Algorithmic Learning Theory, pp. 234–250. PMLR, 2020.
Dubey, A. and Pentland, A. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003–6014, 2020.
Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
Hagberg, A., Swart, P., and Schult, D. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL), Los Alamos, NM, United States, 2008.
Harris, C. R., Millman, K. J., Van Der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods: 306. Springer, Berlin Heidelberg, softcover reprint of hardcover 1st ed. 1993 edition, December 2010.
Hosseini, S., Chapman, A., and Mesbahi, M. Online distributed optimization via dual averaging. In 52nd IEEE Conference on Decision and Control, pp. 1484–1489. IEEE, 2013.
Huang, R., Wu, W., Yang, J., and Shen, C. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057–27068, 2021.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.
Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.
Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
Penrose, M. Random Geometric Graphs, volume 5. OUP Oxford, 2003.
Réda, C., Vakili, S., and Kaufmann, E. Near-optimal collaborative learning in bandits. arXiv preprint arXiv:2206.00121, 2022.
Scaman, K., Bach, F., Bubeck, S., Lee, Y., and Massoulié, L. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1–31, 2019.
Shamir, O. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems, 27, 2014.
Shi, C., Shen, C., and Yang, J. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pp. 2917–2925. PMLR, 2021.
Varatharajah, Y. and Berry, B. A contextual-bandit-based approach for informed decision-making in clinical trials. Life, 12(8):1277, 2022.
Xiao, L. Dual averaging method for regularized stochastic learning and online optimization. Advances in Neural Information Processing Systems, 22, 2009.
Yi, J. and Vojnović, M. On regret-optimal cooperative nonstochastic multi-armed bandits. arXiv preprint arXiv:2211.17154, 2022.
Yi, J., Wu, F., Wu, C., Liu, R., Sun, G., and Xie, X. Efficient-FedRec: Efficient federated learning framework for privacy-preserving news recommendation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2814–2824, 2021.
Zhu, Z., Zhu, J., Liu, J., and Liu, Y. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pp. 3–4, 2021.

## A. Appendix

**Notation** For a vector $x$, we use $x(i)$ to denote the $i$-th coordinate of $x$. We define $F_t = \bigcup_{v \in V} F^v_t$, where $F^v_t$ is the sequence of agent $v$'s actions and feedback up to time step $t$, i.e., $F^v_t = \bigcup_{s=1}^t \{a^v_s, I^v_s\}$.

### A.1. Proof of Theorem 3.1

We first define a new class of cluster-based distributed online learning procedures, referred to as cluster-based federated algorithms, in which delay only occurs when communication is between different clusters.
The regret lower bound for federated bandit algorithms is no smaller than the regret lower bound for cluster-based federated algorithms, as shown in Lemma A.2. We then show in Lemma A.3 that there exists a special graph containing two clusters of agents $A$ and $B$ with distance $d(A, B) = \min_{u \in A, v \in B} d(u, v) = \Omega\big(\sqrt{(d_{\max} + 1)/\lambda_{N-1}(M)}\big)$. Then, we consider an instance where only agents in cluster $A$ receive non-zero losses. Based on a reduction argument, the cumulative regrets of agents in cluster $B$ are the same (up to a constant factor) as the cumulative regrets in a single-agent adversarial setting with feedback delay $d(A, B)$ (see Lemma A.4). Hence, one can show that the cumulative regret of agents in cluster $B$ is $\Omega\big(\sqrt{d(A, B)\, T \log K}\big)$.

We denote by $d(U, U')$ the smallest distance between any two nodes in $U, U' \subseteq V$, i.e. $d(U, U') = \min_{u \in U, u' \in U'} d(u, u')$, where $d(u, u')$ is the length of a shortest path connecting $u$ and $u'$.

**Definition A.1** (Cluster-based federated algorithms). A cluster-based federated algorithm is a multi-agent learning algorithm defined by a partition $\bigcup_r U_r = V$ of the graph, where each $U_r$ is called a cluster. In a cluster-based federated algorithm, at each round $t$, the action selection probability $p^v_t$ of agent $v \in U_r$ depends on the history information up to round $t - d(U_r, U_{r'}) - 1$ of all agents $u \in U_{r'}$.

Note that when all agents are in the same cluster $V$, the centralized federated algorithm in (Réda et al., 2022) is an instance of a cluster-based federated algorithm.

**Lemma A.2** (Monotonicity). Let $\Pi$ and $\Pi'$ be the sets of all cluster-based federated algorithms with partitions $\bigcup_r U_r = V$ and $\bigcup_s U'_s = V$, respectively. Suppose that for any cluster $U'_s$ of $\Pi'$ there exists a cluster $U_r$ of $\Pi$ such that $U'_s \subseteq U_r$. Then $\Pi' \subseteq \Pi$ and

$$\min_{\pi \in \Pi} R^v_T(\pi, L) \leq \min_{\pi' \in \Pi'} R^v_T(\pi', L)$$

for any $L \in [0, 1]^{T \times N \times K}$ and any $v \in V$.

Proof. It suffices to show $\Pi' \subseteq \Pi$. Consider a cluster-based federated algorithm $\pi' \in \Pi'$. For any agent $v \in V$, let $U'_s$ be the cluster of $v$ in $\Pi'$. By the definition of a cluster-based procedure, agent $v$'s action selection distribution $p^v_t$ depends on the history information up to round $t - d(U'_s, U'_h) - 1$ of all agents $u \in U'_h$. By the assumption, there exist two subsets $U_{r_1}, U_{r_2} \subseteq V$ such that $U'_s \subseteq U_{r_1}$ and $U'_h \subseteq U_{r_2}$. Hence $d(U_{r_1}, U_{r_2}) \leq d(U'_s, U'_h)$, from which it follows that $t - d(U'_s, U'_h) - 1 \leq t - d(U_{r_1}, U_{r_2}) - 1$. Hence $\pi' \in \Pi$, which completes the proof.

**Lemma A.3.** There exists a graph $G = (V, E)$ with $N$ nodes and a matrix $M \in \mathcal{M}_G$, together with two subsets of nodes $I_0, I_1 \subseteq V$ of size $|I_0| = |I_1| \geq N/4$, such that $d(I_0, I_1) \geq \Delta$, where $d(I_0, I_1)$ is the shortest-path distance in $G$ between the two sets and $\Delta = \Omega\big(\sqrt{(1 + d_{\max})/\lambda_{N-1}(M)}\big)$.

Proof. From Lemma 24 in (Scaman et al., 2019), there exists a graph $G = (V, E)$ with $N$ nodes and a matrix $M \in \mathcal{M}_G$, together with two subsets of nodes $I_0, I_1 \subseteq V$ of size $|I_0| = |I_1| \geq N/4$, such that $d(I_0, I_1) = \Omega\big(\sqrt{\lambda_1(M)/\lambda_{N-1}(M)}\big)$. We show that $\lambda_1(M) \geq 1 + d_{\max}$. To see this, note that $\lambda_1(M)$ is the maximum of the Rayleigh quotient, $\max_{x \neq 0} \frac{x^T M x}{x^T x}$. By the definition of the Laplacian matrix,

$$x^T M x = \sum_{(v,u) \in E} (x_v - x_u)^2.$$

Let $v^*$ be a vertex whose degree is $d_{\max}$ and set

$$x_u = \begin{cases} \sqrt{\dfrac{d_{\max}}{1 + d_{\max}}} & \text{if } u = v^* \\[2mm] -\sqrt{\dfrac{1}{d_{\max}(1 + d_{\max})}} & \text{if } u \text{ is adjacent to } v^* \\[2mm] 0 & \text{otherwise.} \end{cases}$$

Then

$$\sum_{(v,u) \in E} (x_v - x_u)^2 \geq d_{\max} \left( \sqrt{\frac{d_{\max}}{1 + d_{\max}}} + \sqrt{\frac{1}{d_{\max}(1 + d_{\max})}} \right)^2 = d_{\max} \cdot \frac{(d_{\max} + 1)^2}{d_{\max}(1 + d_{\max})} = 1 + d_{\max}$$

and

$$\sum_{u \in V} x_u^2 = \frac{d_{\max}}{1 + d_{\max}} + d_{\max} \cdot \frac{1}{d_{\max}(1 + d_{\max})} = 1.$$

Hence, $\lambda_1(M) \geq 1 + d_{\max}$. Let $I_0, I_1$ be two subsets of nodes satisfying $d(I_0, I_1) \geq \Delta$ and $|I_0| = |I_1| \geq N/4$. The number of rounds needed to communicate between any node in $I_0$ and any node in $I_1$ is at least $\Delta$.
**Lemma A.4.** Let $v_0 \in I_0$ and $v_1 \in V \setminus I_0$, and consider a cluster-based federated algorithm with clusters $I_0$ and $V \setminus I_0$. Then, any distributed online learning algorithm $\sigma$ for the full-information feedback setting has an expected regret

$$R^{v_1}_T \geq \frac{1 - o(1)}{4} \sqrt{\frac{(\Delta + 1)\, T \log K}{2}}.$$

Proof. Consider an online learning with expert advice problem with the action set $A$ over $B$ rounds (Cesa-Bianchi et al., 1997). Let $\ell^*_1, \ldots, \ell^*_B$ be an arbitrary sequence of losses and $p^*_b$ be the action selection distribution at round $b$. We show that $\sigma$ can be used to design an algorithm for this online learning with expert advice problem, adapting the reduction in (Cesa-Bianchi et al., 2016). Consider the loss sequences $\{\ell^v_t\}_{t=1}^T$ for each $v \in V$ with $T = (\Delta + 1)B$ such that

$$\ell^v_t = \begin{cases} \ell^*_{\lceil t/(\Delta+1) \rceil} & \text{if } v \in I_0 \\ 0 & \text{otherwise.} \end{cases}$$

Let $p^v_t$ be the action selection distribution of agent $v \in V$ running the algorithm $\sigma$. Define the algorithm for the online learning with expert advice problem as follows:

$$p^*_b = \frac{1}{\Delta + 1} \sum_{s=1}^{\Delta+1} p^{v_1}_{(\Delta+1)(b-1)+s},$$

where $p^v_t = (1/K, \ldots, 1/K)$ for all $t \leq 1$ and $v \in V$. Note that $p^*_b$ is defined by $p^{v_1}_{(\Delta+1)(b-1)+1}, \ldots, p^{v_1}_{(\Delta+1)b}$. These are in turn determined by $\ell^{v_0}_1, \ldots, \ell^{v_0}_{(\Delta+1)(b-1)}$, by the definition of the cluster-based communication protocol. Also note that $\lceil t/(\Delta+1) \rceil \leq b - 1$ for $t \leq (\Delta+1)(b-1)$; hence $p^*_b$ is determined by $\ell^*_1, \ldots, \ell^*_{b-1}$. Therefore $p^*_1, \ldots, p^*_B$ are generated by a legitimate algorithm for the online learning with expert advice problem.

Note that the cumulative loss of agent $v_1$ satisfies

$$\sum_{t=1}^T \langle p^{v_1}_t, \ell_t \rangle = \frac{1}{N} \sum_{t=1}^T \left( \sum_{v \in I_0} \langle p^{v_1}_t, \ell^v_t \rangle + \sum_{v \notin I_0} \langle p^{v_1}_t, \ell^v_t \rangle \right) = \frac{|I_0|}{N} \sum_{t=1}^T \langle p^{v_1}_t, \ell^*_{\lceil t/(\Delta+1) \rceil} \rangle = \frac{|I_0| (\Delta+1)}{N} \sum_{b=1}^B \langle p^*_b, \ell^*_b \rangle, \tag{5}$$

where the second equality comes from the definition of $\ell^v_t$ and the third equality comes from the definition of $p^*_b$. Also note that

$$\min_{i \in A} \sum_{t=1}^T \ell_t(i) = \frac{|I_0|}{N} \min_{i \in A} \sum_{t=1}^T \ell^*_{\lceil t/(\Delta+1) \rceil}(i) = \frac{|I_0| (\Delta+1)}{N} \min_{i \in A} \sum_{b=1}^B \ell^*_b(i). \tag{6}$$

From (5) and (6), it follows that

$$\sum_{t=1}^T \langle p^{v_1}_t, \ell_t \rangle - \min_{i \in A} \sum_{t=1}^T \ell_t(i) = \frac{|I_0| (\Delta+1)}{N} \left( \sum_{b=1}^B \langle p^*_b, \ell^*_b \rangle - \min_{i \in A} \sum_{b=1}^B \ell^*_b(i) \right). \tag{7}$$

There exists a sequence of losses $\ell^*_1, \ldots, \ell^*_B$ such that for any algorithm for the online learning with expert advice problem, the expected regret satisfies (Cesa-Bianchi & Lugosi, 2006, Theorem 3.7)

$$\mathbb{E}\left[ \sum_{b=1}^B \langle p^*_b, \ell^*_b \rangle - \min_{i \in A} \sum_{b=1}^B \ell^*_b(i) \right] \geq (1 - o(1)) \sqrt{\frac{B \log K}{2}}.$$

Hence, using $|I_0| \geq N/4$ and $B = T/(\Delta+1)$, we have

$$\mathbb{E}\left[ \sum_{t=1}^T \langle p^{v_1}_t, \ell_t \rangle - \min_{i \in A} \sum_{t=1}^T \ell_t(i) \right] \geq \frac{1 - o(1)}{4} \sqrt{\frac{(\Delta + 1)\, T \log K}{2}}.$$

Theorem 3.1 then follows by combining Lemmas A.2, A.3, and A.4 with $\Delta = \Omega\big(\sqrt{(1 + d_{\max})/\lambda_{N-1}(M)}\big)$.

### A.2. Proof of Theorem 3.2

The lower bound contains two parts. The first part is derived using the information-theoretic argument in (Shamir, 2014) and captures the effect of bandit feedback. The second part is inherited from the full-information feedback lower bound in Theorem 3.1, using the fact that the regret of an agent in the bandit feedback setting cannot be smaller than its regret in the full-information setting.

Consider the class of centralized federated algorithms in which all agents are in the same cluster $V$, denoted $\Pi_C$. Note that by Lemma A.2, for the class $\Pi_G$ of federated bandit algorithms on $G$, we have $\Pi_G \subseteq \Pi_C$ and

$$\min_{\pi' \in \Pi_C} R^v_T(\pi', L) \leq \min_{\pi \in \Pi_G} R^v_T(\pi, L)$$
Also note that any federated bandit algorithm for bandit feedback setting is also a federated bandit algorithm for fullinformation setting, from which it follows min L Rv T (π, L) max KT/(1 + |N(v)|)} , Ω 1 + dmax λN 1(M) K/(1 + |N(v)|), 4 s 1 + dmax λN 1(M) A.3. Auxiliary lemmas Here we present some auxiliary lemmas which are used in the proof of Theorem 4.1. Recall that ˆℓt and zt are the average instant loss estimator and average cumulative loss, v V gv t and zt = 1 and yt is action distribution to minimize the regularized average cumulative loss yt(i) = exp{ ηt 1 zt(i)} P j A exp{ ηt 1 zt(j)}. Lemma A.5. For each time step t = 1, . . . , T, zt+1 = zt + ft max{ gv t , ft } K Doubly Adversarial Federated Bandits u:(u,v) E Wu,vzu t + 1 v V zv t + 1 where the second equality comes from Line 7 in Algorithm 1 and the third equality comes from the double-stochasticity of W. Noting that pv t (i) γ/K for all v V, i A and t {1, . . . , T}, it follows that gv t = ℓv t (av t ) pv t (av t ) K γt and ft 1 Lemma A.6. For any v V and t 1, it holds that E [gv t | Ft 1] = ℓv t and E [ft | Ft 1] = ℓt E [ ft ] K and E ft 2 K2 Proof. Note that pv t is determined by Ft 1, hence E [gv t (i) | Ft 1] = ℓv t (i) pv t (i)E [I {av t = i} | Ft 1] = ℓv t (i) pv t (i)pv t (i) = ℓv t (i) E [ gv t ] = E ℓv t (av t ) pv t (av t ) = E E ℓv t (av t ) pv t (av t ) | Ft 1 i A pv t (i) ℓv t (i) pv t (i) i A ℓv t (i) K where the last inequality comes from ℓv t (i) 1. Since ft(i) = 1 v V gv t , it follows that E [ft(i) | Ft 1] = 1 v V ℓv t (i) = ℓt(i) and E [ ft ] 1 v V E [ gv t ] K which comes from Jensen s inequality. Notice that E gv t 2 = E ℓv t (av t )2 pv t (av t )2 = E E ℓv t (av t )2 pv t (av t )2 | Ft 1 i A pv t (i) ℓv t (i)2 where the last inequality comes from pv t (i) γt/K. Again, from Jensen s inequality, it follows v V E gv t 2 K2 Doubly Adversarial Federated Bandits Before presenting the next lemma, we recall the definition of strongly-convex functions and Fenchel duality. A function ϕ is said to be α-strongly convex function on a convex set X if ϕ(x ) ϕ(x) + ϕ(x), x x + 1 for all x , x X, for some α 0. Let ϕ denote the Fenchel conjugate of ϕ, i.e., ϕ (y) = max x X { x, y ϕ(x)} with the projection, ϕ (y) = arg max x X { x, y ϕ(x)}. Lemma A.7. Let ψ the normalized negative entropy function (Lattimore & Szepesv ari, 2020) on PK 1 = {x [0, 1]K : PK i=1 x(i) = 1} , i=1 x(i) (log(x(i)) 1) . For all t = 1, . . . , T, it holds that xv t = argmin x PK 1 { x, zv t + ψηt(x)} = ψ ηt 1( zv t ) with X = PK 1 and yt = argmin x PK 1 { x, zt + ψηt 1(x)} = ψ ηt 1( zt). Furthermore, it holds xv t yt ηt 1 zv t zt . Proof. We prove for yt = argmin x PK 1 { x, zt + ψηt 1(x)} whose argument also applies to xv t . Notice it suffices to consider the minimization problem min x PK 1 ηt 1 PK k=1 x(i) zt(i) + Pk i=1 x(i) log(x(i)) subject to PK k=1 x(i) = 1. It suffices to consider the Lagrangian, k=1 x(i) zt(i) i=1 x(i) log(x(i)) λ Consider the first-order conditions for all i = 1, . . . , K L x(i) = ηt 1 zt(i) log(x(i)) 1 λ = 0 which gives x(i) = exp{ ηt 1 zt(i)}/ exp{1 + λ} for all i = 1, . . . , K. Plugging into the constraint PK k=1 x(i) = 1 together with the definition of Fenchel duality (Hiriart-Urruty & Lemarechal, 2010) completes the proof for yt. Note that the normalized negative entropy ψ(x) is 1-strongly convex, ψ(x ) ψ(x) + ψ(x), x x + 1 Multiplying 1/ηt 1 both sides of the inequality yields that ψηt 1(x) is 1/ηt 1-strongly convex. 
By Theorem 4.2.1 in (Hiriart-Urruty & Lemaréchal, 2010), we have that $\nabla \psi^*_{\eta_{t-1}}(z)$ is $\eta_{t-1}$-Lipschitz. It follows that

$$\|x^v_t - y_t\| = \|\nabla \psi^*_{\eta_{t-1}}(-z^v_t) - \nabla \psi^*_{\eta_{t-1}}(-\bar z_t)\| \leq \eta_{t-1} \|\bar z_t - z^v_t\|.$$

We state an upper bound on the network disagreement on the cumulative loss estimators from (Duchi et al., 2011) and (Hosseini et al., 2013).

**Lemma A.8.** For any $v \in V$ and $t = 1, 2, \ldots, T$,

$$\mathbb{E}\left[\|z^v_t - \bar z_t\|\right] \leq \left( \frac{\min\{2 \log T + \log N, N\}}{1 - \sigma_2(W)} + 3 \right) \frac{K}{\gamma_t} = \frac{C_W K}{\gamma_t},$$

where $\sigma_2(W)$ is the second largest singular value of $W$.

Proof. From Lemma A.5, it follows that $\|g^v_t\| \leq K/\gamma_t$. Since $\{\gamma_t\}$ is non-increasing, setting $L = K/\gamma_T$ in Eq. (29) in (Duchi et al., 2011) and Lemma 6 in (Hosseini et al., 2013) completes the proof.

### A.4. Proof of Theorem 4.1

Let $i^* \in \arg\min_{i \in A} \sum_{t=1}^T \ell_t(i)$. Note that $p^v_t$ is determined by $F_{t-1}$ and $\mathbb{E}[f_t \mid F_{t-1}] = \ell_t$ from Lemma A.6. It follows that for each agent $v \in V$,

$$R^v_T = \mathbb{E}\left[\sum_{t=1}^T \langle p^v_t, \ell_t \rangle\right] - \sum_{t=1}^T \ell_t(i^*) = \mathbb{E}\left[\sum_{t=1}^T \langle p^v_t, f_t \rangle - \sum_{t=1}^T f_t(i^*)\right].$$

By the definition of $p^v_t = (1 - \gamma_t) x^v_t + \gamma_t x^v_1$ (recall that $x^v_1$ is the uniform distribution), it follows that

$$R^v_T = \mathbb{E}\left[\sum_{t=1}^T (1 - \gamma_t) \left(\langle f_t, x^v_t \rangle - f_t(i^*)\right)\right] + \mathbb{E}\left[\sum_{t=1}^T \gamma_t \left(\langle f_t, x^v_1 \rangle - f_t(i^*)\right)\right] \leq \underbrace{\mathbb{E}\left[\sum_{t=1}^T \left(\langle f_t, y_t \rangle - f_t(i^*)\right)\right]}_{(\mathrm{I})} + \underbrace{\mathbb{E}\left[\sum_{t=1}^T \langle f_t, x^v_t - y_t \rangle\right]}_{(\mathrm{II})} + \sum_{t=1}^T \gamma_t,$$

where the inequality uses $\gamma_t > 0$ and $\ell_t = \frac{1}{N} \sum_{v \in V} \ell^v_t \in [0, 1]^K$.

From Lemma A.5, it follows that $\bar z_t = \sum_{s=1}^{t-1} f_s$. Hence, it follows from Lemma A.7 that

$$y_t = \operatorname*{argmin}_{x \in \mathcal{P}_{K-1}} \left\{ \left\langle \sum_{s=1}^{t-1} f_s, x \right\rangle + \frac{1}{\eta_{t-1}} \psi(x) \right\},$$

i.e., $y_t$ is the Follow-The-Regularized-Leader iterate on the loss sequence $\{f_s\}$. From Lemma 3 in (Duchi et al., 2011) and Corollary 28.8 in (Lattimore & Szepesvári, 2020), we have

$$(\mathrm{I}) \leq \sum_{t=1}^T \eta_{t-1} \mathbb{E}\left[\|f_t\|^2\right] + \frac{\log K}{\eta_T}, \tag{8}$$

which uses that $\{\eta_t\}$ is a non-increasing sequence. Note that $\|x^v_t - y_t\| \leq \eta_{t-1} \|z^v_t - \bar z_t\|$ by Lemma A.7, which yields

$$(\mathrm{II}) \leq \sum_{t=1}^T \eta_{t-1} \mathbb{E}\left[\|f_t\| \, \|z^v_t - \bar z_t\|\right]. \tag{9}$$

Plugging (8) and (9) into the decomposition above yields

$$R^v_T \leq \sum_{t=1}^T \eta_{t-1} \mathbb{E}\left[\|f_t\|^2\right] + \frac{\log K}{\eta_T} + \sum_{t=1}^T \eta_{t-1} \mathbb{E}\left[\|f_t\| \, \|z^v_t - \bar z_t\|\right] + \sum_{t=1}^T \gamma_t \leq \sum_{t=1}^T \eta_{t-1} \frac{K^2}{\gamma_t} + \frac{\log K}{\eta_T} + \sum_{t=1}^T \eta_{t-1} \frac{C_W K^2}{\gamma_t} + \sum_{t=1}^T \gamma_t,$$

where the second inequality uses Lemmas A.6 and A.8. With the choices

$$\gamma_t = \sqrt[3]{\frac{(C_W + 1)\, K^2 \log K}{2t}} \quad \text{and} \quad \eta_t = \sqrt[3]{\frac{(\log K)^2}{(C_W + 1)\, K^2 t^2}},$$

each of the four terms is $O\big(\sqrt[3]{C_W K^2 \log K}\; T^{2/3}\big)$, and therefore, for every $v \in V$,

$$R^v_T = O\left(\sqrt[3]{C_W K^2 \log K}\; T^{2/3}\right).$$

### A.5. Numerical experiments

All experiments were run on a desktop with an AMD Ryzen 5 2600 six-core processor and 16 GB of memory. Each experiment took less than 6 hours to finish. The code is written in Python and uses the NumPy package (Harris et al., 2020) and the NetworkX package (Hagberg et al., 2008) for numerical calculations and graph operations. Both the NumPy package and the NetworkX package are distributed under the BSD license.