# Federated Linear Contextual Bandits

Ruiquan Huang (The Pennsylvania State University, rzh5514@psu.edu) · Weiqiang Wu (Facebook, weiqiang.wwu@gmail.com) · Jing Yang (The Pennsylvania State University, yangjing@psu.edu) · Cong Shen (University of Virginia, cong@virginia.edu)

This paper presents a novel federated linear contextual bandits model, where individual clients face different $K$-armed stochastic bandits coupled through common global parameters. By leveraging the geometric structure of the linear rewards, a collaborative algorithm called Fed-PE is proposed to cope with the heterogeneity across clients without exchanging local feature vectors or raw data. Fed-PE relies on a novel multi-client G-optimal design, and achieves near-optimal regret for both the disjoint and shared parameter cases with logarithmic communication costs. In addition, a new concept called collinearly-dependent policies is introduced, based on which a tight minimax regret lower bound for the disjoint parameter case is derived. Experiments demonstrate the effectiveness of the proposed algorithms on both synthetic and real-world datasets.

## 1 Introduction

Federated learning (FL) (McMahan et al., 2017) is an emerging distributed machine learning (ML) paradigm in which a massive number of clients collaboratively learn a shared prediction model while keeping all the training data on local devices. Compared with standard centralized machine learning, FL has the following characteristics (Kairouz et al., 2021):

- **Heterogeneous local datasets.** The local datasets, which are often generated at edge devices, are likely drawn from non-independent and identically distributed (non-IID) distributions.
- **Communication efficiency.** The communication cost scales with the number of clients and is one of the primary bottlenecks of FL. It is critical to minimize the communication cost while maintaining learning accuracy.
- **Privacy.** FL protects local data privacy by sharing only model updates instead of the raw data.

While the main focus of state-of-the-art FL is on the supervised learning setting, a few researchers have recently begun to extend FL to the multi-armed bandits (MAB) framework (Lai and Robbins, 1985; Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012; Agrawal and Goyal, 2012, 2013a). In the canonical MAB setting, a player chooses one arm from a set of arms to play at each time slot. An arm, if played, offers a reward drawn from a distribution that is unknown to the player. Using all previous observations, the player must decide which arm to pull each time in order to maximize the cumulative reward. MAB thus represents an online learning model that naturally captures the intrinsic exploration-exploitation tradeoff in many sequential decision-making problems.

Extending FL to the MAB framework is naturally motivated by a corpus of applications, such as recommender systems, clinical trials, and cognitive radio. In those applications, the sequential decision making involves multiple clients and is distributed by nature. While classical MAB models assume immediate access to the sequentially generated data at the learning agent, under the new realm of FL, local datasets can be stored and analyzed at the clients, thus reducing the communication load and potentially protecting data privacy.
Despite the potential benefits of FL, sequential decision making and bandit feedback bring new challenges to the design of FL algorithms in the MAB setting. Different from the supervised learning setting, where static datasets are collected beforehand, under the MAB setting data is generated sequentially as decisions are made, actions are taken, and observations are collected. Maximizing the cumulative reward and minimizing the corresponding learning regret thus requires sophisticated coordination of the clients' actions. The heterogeneous reward distributions across clients make the coordination process even more convoluted and challenging. Besides, the data privacy and communication efficiency requirements pose significant challenges for efficient information exchange and aggregation between the local clients and the central server.

In this work, we attempt to address those challenges in a federated linear contextual bandits framework. This particular problem is motivated by the following exemplary applications.

**Personalized content recommendation.** For content (arm) recommendation in web services, user engagement (reward) depends on the profile of a user (context). The central server may deploy a recommender system on each user's local device (client) in order to personalize recommendations without knowing the personal profile or behavior of the user.

**Personalized online education.** In order to maximize students' performance (reward) in online learning, the education platform (central server) needs to personalize teaching methods (arms) based on the characteristics of individual students (context). With the online learning software installed on local devices (clients), it is desirable to personalize the learning experience without allowing the platform to access students' characteristics or scores.

In those examples, the reward of pulling the same arm at different clients follows different distributions depending on the context, as in contextual bandits (Auer, 2003; Langford and Zhang, 2008). We note that a conventional contextual bandit is defined with respect to a single player, where the time-varying context can be interpreted as different incoming user profiles. In contrast, we consider a multi-client model, where each client is associated with a fixed user profile. The variation of contexts is captured over clients as opposed to over time. Although the set of clients remains fixed throughout the learning process, the reward of pulling the same arm still varies across clients. Such a model naturally takes data heterogeneity into consideration. Besides, we adopt a linear reward model, which has been widely studied in contextual bandits (Li et al., 2010; Agrawal and Goyal, 2013b).

**Main contributions.** Our main contributions are summarized as follows.

First, we propose a new federated linear contextual bandits model that takes diverse user preferences and data heterogeneity into consideration. Such a model naturally bridges local stochastic bandits with linear contextual bandits, and is well poised to capture the tradeoff between communication efficiency and learning performance in the federated bandits setting.

Second, we design a novel algorithm named Fed-PE and further develop its variants to solve the federated linear contextual bandits problem. Under Fed-PE, clients only upload their local estimates of the global parameters without sharing their local feature vectors or raw observations.
This not only keeps the personal information private, but also reduces the upload cost. We explicitly show that Fed-PE and its variants achieve near-optimal regret for both the disjoint and shared parameter cases with logarithmic communication costs.

Third, we generalize the G-optimal design from the single-player setting (Lattimore and Szepesvári, 2020) to the multi-client setting. We develop a block coordinate ascent algorithm that solves the generalized G-optimal design efficiently with convergence guarantees. Such a multi-client G-optimal design plays a vital role in Fed-PE, and may find broad applications in related multi-agent setups.

Finally, we introduce a novel concept called collinearly-dependent policies and show that the celebrated LinUCB type of policies (Li et al., 2010), Thompson sampling based policies with Gaussian priors (Agrawal and Goyal, 2013b), and least-squares estimation based policies, such as Fed-PE, all belong to this category. By utilizing the properties of collinearly-dependent policies, we are able to characterize a tight minimax regret lower bound in the disjoint parameter setting. We believe that this concept may be of independent interest for the study of bandits with linear rewards.

Table 1: Performance comparison.

| Model | Algorithm | Regret | Communication cost |
| --- | --- | --- | --- |
| Linear | DELB | $O(d\sqrt{MT\log(T)})$ | $O((Md + d\log\log d)\log T)$ |
| Linear contextual (shared parameter) | FedUCB¹ | $O(d\sqrt{MT}\log T)$ | $O(Md^2\log T)$ |
| | Fed-PE (this work) | $O(\sqrt{dMT\log(KMT)})$ | $O(M(d^2 + dK)\log T)$ |
| | Lower bound | $\Omega(\sqrt{dMT})$ | — |
| Linear contextual (disjoint parameter) | Centralized² | $O(\sqrt{dKMT\log^3(KMT)})$ | $O(Md^2KT)$ |
| | Fed-PE (this work) | $O(\sqrt{dKMT\log(KMT)})$ | $O(Md^2K\log T)$ |
| | Lower bound (this work) | $\Omega(\sqrt{dKMT})$ | — |

$M$: number of clients; $K$: number of arms; $T$: time horizon; $d$: ambient dimension of the feature vectors.

¹ Results adapted from Dubey and Pentland (2020) by letting the privacy budget $1/\epsilon$ go to 0.
² Results adapted from the single-player linear contextual bandits studied in Dimakopoulou et al. (2017) by assuming instantaneous information exchange and sequential decision making at the central server.

**Notations.** Throughout this paper, we use $\|x\|_V$ to denote $\sqrt{x^\top V x}$. The range of a matrix $A$, denoted by $\mathrm{range}(A)$, is the subspace spanned by the column vectors of $A$. We use $A^\dagger$ and $\mathrm{Det}(A)$ to denote the pseudo-inverse and pseudo-determinant of a square matrix $A$, respectively. The specific definitions can be found in Appendix B of the supplementary material.
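To make this notation concrete, the following minimal NumPy sketch (the matrix $V$ is an illustrative toy example, not from the paper) computes the weighted norm, the pseudo-inverse, and the pseudo-determinant of a rank-deficient positive semidefinite matrix.

```python
import numpy as np

def weighted_norm(x: np.ndarray, V: np.ndarray) -> float:
    """||x||_V = sqrt(x^T V x) for a positive semidefinite V."""
    return float(np.sqrt(x @ V @ x))

# A rank-2 PSD matrix in R^{3x3}: the pseudo-inverse inverts only the
# nonzero eigenvalues, and the pseudo-determinant multiplies them.
V = np.diag([2.0, 3.0, 0.0])
V_dagger = np.linalg.pinv(V)                      # pseudo-inverse A^\dagger
eigvals = np.linalg.eigvalsh(V)
pdet = float(np.prod(eigvals[eigvals > 1e-12]))   # pseudo-determinant = 6.0

x = np.array([1.0, 1.0, 0.0])
print(weighted_norm(x, V_dagger), pdet)           # ~0.913, 6.0
```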
## 2 Related Works

**Collaborative and distributed bandits.** Our model is closely related to collaborative and distributed bandits when action collision is not considered. Landgren et al. (2016, 2018) and Martínez-Rubio et al. (2019) study distributed bandits in which multiple agents face the same MAB instance and collaboratively share their estimates over a fixed communication graph, enabling consensus-based distributed estimation of the mean reward of each arm. Szorenyi et al. (2013) considers a similar setup in which, in each round, an agent is able to communicate with a few random peers. Korda et al. (2016) considers the case where clients in different unknown clusters face independent bandit problems, and every agent can communicate with only one other agent per round. The communication and coordination among the clients in those works are fundamentally different from ours. Wang et al. (2020) investigates communication-efficient distributed linear bandits, where the agents can communicate with a server by sending and receiving packets. It proposes two algorithms, namely DELB and DisLinUCB, for fixed and time-varying action sets, respectively. The fixed action set setting is similar to our setup, except that it assumes all agents face the same bandit model, which does not take data heterogeneity into consideration.

**Federated bandits.** A few recent works have touched upon the concept of federated bandits. With heterogeneous reward distributions at local clients, Shi and Shen (2021) and Shi et al. (2021) investigate efficient client-server communication and coordination protocols for federated MAB without and with personalization, respectively. Agarwal et al. (2020) studies regression-based contextual bandits as an example of the federated residual learning framework, where the reward of a client depends on both a global model and a local model. Li et al. (2020) and Zhu et al. (2021) focus on differential privacy based local data protection in federated bandits. While the linear contextual bandit model considered in Dubey and Pentland (2020) is similar to this work, it focuses on federated differential privacy and proposes a LinUCB-based FedUCB algorithm, which incurs a higher regret than our result for the shared parameter case. A regret and communication cost comparison between Fed-PE and other baseline algorithms is provided in Table 1.

## 3 Problem Formulation

**Clients and local bandits model.** We consider a federated linear contextual bandits setting with $M$ clients pulling the same set of $K$ items (arms), denoted as $[K] := \{1, 2, \ldots, K\}$. At each time $t$, each client $i \in [M]$ pulls an arm $a_{i,t} \in [K]$ based on locally available information. The incurred reward $y_{i,t}$ is given by $y_{i,t} = r_{i,a_{i,t}} + \eta_{i,t}$, where $\eta_{i,t}$ is a random noise and $r_{i,a_{i,t}}$ is the unknown expected reward of pulling arm $a_{i,t}$. We note that without additional assumptions or interaction among the clients, each local model is a standard single-player stochastic MAB, for which classic algorithms such as UCB (Auer and Ortner, 2010) and Thompson sampling (Agrawal and Goyal, 2012) are known to achieve order-optimal regret.

**Linear reward structure with global parameters.** In order to capture the inherent correlation between the rewards of pulling the same arm at different clients, we assume $r_{i,a}$ has a linear structure, i.e., $r_{i,a} = x_{i,a}^\top \theta_a$, where $x_{i,a} \in \mathbb{R}^d$ is the feature vector associated with client $i$ and arm $a$, and $\theta_a \in \mathbb{R}^d$ is a fixed but unknown parameter vector for each $a \in [K]$. Here we use $x^\top$ to denote the transpose of a vector $x$. The same arm $a$ may have different reward distributions at different clients, due to the potentially varying $x_{i,a}$ across clients. Such a linear model naturally captures the heterogeneous data distributions at the clients, yet admits possible collaboration among clients thanks to the common parameters $\{\theta_a\}_{a\in[K]}$. When $\theta_a$ varies across arms $a$, we call it the disjoint parameter case; when $\theta_a$ is known to be constant across the arms, it is the shared parameter case. We investigate both cases in Sections 4 and 5, respectively.
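As a concrete illustration of this reward model, the following sketch instantiates the disjoint parameter case in NumPy. The dimensions, random features, and unit-norm normalization are hypothetical toy choices for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 5, 4, 3                      # clients, arms, feature dimension

theta = rng.normal(size=(K, d))        # global parameters {theta_a}
x = rng.normal(size=(M, K, d))         # local feature vectors x_{i,a}
x /= np.linalg.norm(x, axis=-1, keepdims=True)  # normalize so ||x_{i,a}|| = 1

def pull(i: int, a: int) -> float:
    """Client i pulls arm a: y = x_{i,a}^T theta_a + 1-subgaussian noise."""
    return float(x[i, a] @ theta[a] + rng.normal())
```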
**Communication model.** We assume there exists a central server in the system and, similar to FL, the clients can communicate with the server periodically with zero latency. Specifically, the clients can send local model updates to the central server, which then aggregates and broadcasts the updated global model to the clients. (We will specify these components later.) Note that, just as in FL, communication is one of the major bottlenecks, and the algorithm has to be conscious of its usage. Similar to Wang et al. (2020), we define the communication cost of an algorithm as the number of scalars (integers or real numbers) communicated between the server and the clients. We also assume that the clients and the server are fully synchronized (McMahan et al., 2017).

**Data privacy concerns.** Similar to Dubey and Pentland (2020), our contextual bandit problem involves two sets of information that are desirable to keep private to client $i$: the feature vectors $\{x_{i,a}\}_{a\in[K]}$ and the observed rewards $\{y_{i,t}\}_{t\in[T]}$. Different from the differential privacy mechanism adopted in Dubey and Pentland (2020), in this work we aim to communicate only estimates of the global model parameters $\{\theta_a\}_a$ between the clients and the server. This is consistent with the FL framework, where only model updates are communicated instead of the raw data.

**Assumption 1** We make the following assumptions throughout the paper:
1) Bounded parameters: for any $i \in [M]$ and $a \in [K]$, we have $\|\theta_a\|_2 \le s$ and $0 < \ell \le \|x_{i,a}\|_2 \le L$.
2) Independent 1-subgaussian noise: $\eta_{i,t}$ is a 1-subgaussian random variable sampled independently at each time for each client, with $\mathbb{E}[\eta_{i,t}] = 0$ and $\mathbb{E}[\exp(\lambda\eta_{i,t})] \le \exp(\lambda^2/2)$ for any $\lambda > 0$.

Assumption 1.1 is standard in the bandit literature and ensures that the maximum regret at any step is bounded. We emphasize that our work does not assume any knowledge of the suboptimality gaps, nor do we assume the existence of a unique optimal arm at each client. Our objective is to minimize the expected cumulative regret over all clients, defined as

$$\mathbb{E}[R(T)] = \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{M}\left(x_{i,a_i^*}^\top\theta_{a_i^*} - x_{i,a_{i,t}}^\top\theta_{a_{i,t}}\right)\right], \tag{1}$$

where $a_i^* \in [K]$ is an optimal arm for client $i$, i.e., for any $b \ne a_i^*$, $x_{i,a_i^*}^\top\theta_{a_i^*} - x_{i,b}^\top\theta_b \ge 0$.
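Continuing the toy setup above, the expected rewards, the per-client optimal arms, and the cumulative pseudo-regret of a sequence of actions can be computed directly. This is only an illustrative sketch of the regret definition (1), not part of any algorithm.

```python
# Expected rewards r_{i,a} = x_{i,a}^T theta_a, shape (M, K).
r = np.einsum('mkd,kd->mk', x, theta)
a_star = r.argmax(axis=1)              # an optimal arm a_i^* for each client

def cumulative_regret(actions: np.ndarray) -> float:
    """Pseudo-regret of an (M, T) integer array of arms pulled over T steps."""
    best = r[np.arange(M), a_star][:, None]          # (M, 1) optimal rewards
    chosen = np.take_along_axis(r, actions, axis=1)  # (M, T) realized means
    return float((best - chosen).sum())
```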
## 4 Federated Linear Contextual Bandits: Disjoint Parameter Case

### 4.1 Challenges

Solving the federated linear contextual bandits model poses several new challenges. The first stems from the constraint that only locally estimated parameters $\{\theta_a\}_{a\in[K]}$ are uploaded to the central server. While this is not an issue for stochastic MAB, where the $\{\theta_a\}_{a\in[K]}$ are scalars (Shi and Shen, 2021; Shi et al., 2021), it significantly complicates the aggregation of the local estimates into a global model in our setup. Under the linear reward structure, the locally received rewards $\{y_{i,t}\}_{t\in[T]}$ only contain the projection of $\theta_a$ along the direction of $x_{i,a}$, while the portion of $\theta_a$ lying outside $\mathrm{range}(x_{i,a})$ is not captured in $\{y_{i,t}\}_{t\in[T]}$. Thus, the locally estimated $\theta_a$, denoted as $\hat\theta_{i,a}$, cannot provide any information about $\theta_a$ beyond $\mathrm{range}(x_{i,a})$. Since the $\{x_{i,a}\}_{i\in[M]}$ differ for the same arm $a$, the local estimates $\{\hat\theta_{i,a}\}_{i\in[M]}$ essentially lie in different subspaces. The central server thus needs to take this geometric structure into account when aggregating $\{\hat\theta_{i,a}\}$ to construct the global estimate of $\theta_a$.

The geometric structure of the local rewards also complicates the coordination of the local clients' actions. Intuitively, in order to help client $i$ accurately estimate the expected reward of pulling arm $a$, it suffices to obtain an accurate projection of $\theta_a$ onto $\mathrm{range}(x_{i,a})$; any part of $\theta_a$ lying outside this subspace is irrelevant. Therefore, if two clients $i$ and $j$ have $x_{i,a}$ and $x_{j,a}$ orthogonal to each other, exchanging the local estimates $\hat\theta_{i,a}$ and $\hat\theta_{j,a}$ does not help either client improve her own local estimate. On the other hand, if $x_{i,a}$ and $x_{j,a}$ are completely aligned, $\hat\theta_{i,a}$ and $\hat\theta_{j,a}$ can be aggregated directly to improve the local estimation accuracy of both. With $M$ possible subspaces spanned by $\{x_{i,a}\}_i$, it is highly likely that different clients receive different amounts of relevant information (i.e., information lying in $\mathrm{range}(x_{i,a})$) through the information exchange facilitated by the central server. Therefore, in order to reduce the overall regret, it is necessary to coordinate the actions of the clients in a sophisticated fashion.

Third, since exact knowledge of the local feature vectors $\{x_{i,a}\}$ is kept from the central server, the server may not have an accurate estimate of the uncertainty level of the local estimates at each client, or of how much the coordination would help individual clients. This makes efficient and effective coordination even more challenging.

### 4.2 Federated Phased Elimination (Fed-PE) Algorithm

To address the aforementioned challenges, we propose the Federated Phased Elimination (Fed-PE) algorithm. Fed-PE works in phases, where the length of phase $p$ is $f^p + K$. It contains a client-side subroutine (Algorithm 1) and a server-side subroutine (Algorithm 2). Throughout the paper we use the superscript $p$ to indicate phase $p$ unless explicitly stated otherwise. We use $\mathcal{A}^p_i \subseteq [K]$ to denote the subset of active arms at client $i$ in phase $p$, $\mathcal{A}^p := \cup_{i=1}^M \mathcal{A}^p_i$, $\mathcal{R}^p_a := \{i : a \in \mathcal{A}^p_i\}$, and define $\mathcal{T}^p_{i,a}$ as the set of time indices at which client $i$ pulls arm $a$ during the collaborative exploration step in phase $p$. The algorithm works as follows.

**Algorithm 1** Fed-PE: client $i$

Input: $T$, $M$, $K$, $\alpha$, $\{f^p\}$
1. **Initialization:** Pull each arm $a \in [K]$ once and receive reward $y_{i,a}$; $\hat\theta^0_{i,a} \leftarrow y_{i,a}\, x_{i,a}/\|x_{i,a}\|^2$; send $\{\hat\theta^0_{i,a}\}_a$ to the server; $\mathcal{A}^0_i \leftarrow [K]$; $p \leftarrow 1$.
2. **while** not reaching the time horizon $T$ **do**
3. &emsp;Receive $\{(\hat\theta^p_a, V^p_a)\}_{a\in\mathcal{A}^{p-1}}$ from the server. ▷ *Arm elimination*
4. &emsp;**for** $a \in \mathcal{A}^{p-1}_i$ **do**
$$\hat r^p_{i,a} \leftarrow x_{i,a}^\top \hat\theta^p_a, \qquad u^p_{i,a} \leftarrow \alpha\,\|x_{i,a}\|_{(V^p_a)^\dagger}/\ell. \tag{2}$$
5. &emsp;**end for**
6. &emsp;$\hat a^p_i \leftarrow \arg\max_{a\in\mathcal{A}^{p-1}_i} \hat r^p_{i,a}$;&ensp;$\mathcal{A}^p_i \leftarrow \big\{a \in \mathcal{A}^{p-1}_i \,\big|\, \hat r^p_{i,a} + u^p_{i,a} \ge \hat r^p_{i,\hat a^p_i} - u^p_{i,\hat a^p_i}\big\}$.
7. &emsp;Send $\mathcal{A}^p_i$ to the central server. ▷ *Active arm set updating*
8. &emsp;Receive $f^p_{i,a}$ for all $a \in \mathcal{A}^p_i$.
9. &emsp;**for** $a \in \mathcal{A}^p_i$ **do** ▷ *Collaborative exploration*
10. &emsp;&emsp;Pull arm $a$ for $f^p_{i,a}$ times and receive rewards $\{y_{i,t}\}_{t\in\mathcal{T}^p_{i,a}}$;
$$\hat\theta^p_{i,a} \leftarrow \frac{\big(\tfrac{1}{f^p_{i,a}}\sum_{t\in\mathcal{T}^p_{i,a}} y_{i,t}\big)\, x_{i,a}}{\|x_{i,a}\|^2}. \tag{3}$$
11. &emsp;**end for**
12. &emsp;Send $\{\hat\theta^p_{i,a}\}_{a\in\mathcal{A}^p_i}$ to the server; pull $\hat a^p_i$ until the phase length equals $f^p + K$.
13. &emsp;$p \leftarrow p + 1$.
14. **end while**

In the initialization phase, each client $i$ pulls every arm $a \in [K]$ once and receives a reward $y_{i,a}$, based on which it obtains an estimate of the projection of $\theta_a$. These estimates are sent to the server to construct a preliminary global estimate of each $\theta_a$. The global estimates $\{\hat\theta^1_a\}_a$ and the potential matrices $\{V^1_a\}_a$ are then broadcast to all clients, after which phase $p = 1$ begins. Note that after receiving $\{\hat\theta^0_{i,a}\}_{i,a}$, the central server keeps a unit vector $e_{i,a} = \hat\theta^0_{i,a}/\|\hat\theta^0_{i,a}\|$ for each $i \in [M]$ and $a \in [K]$. Since $\hat\theta^0_{i,a}$ is a scaled version of $x_{i,a}$, $e_{i,a}$ lies in $\mathrm{range}(x_{i,a})$ and will be utilized to coordinate the arm pulling process (coined collaborative exploration), as elaborated in Section 4.3.

At the beginning of phase $p$, after receiving the broadcast $\{\hat\theta^p_a\}_a$ and $\{V^p_a\}_a$ from the server, each client $i$ utilizes the $(\hat\theta^p_a, V^p_a)$ pair to estimate the expected reward $r_{i,a}$ and obtain the corresponding confidence level according to Eqn. (2) for each $a \in \mathcal{A}^{p-1}_i$. Based on the constructed confidence intervals, client $i$ then eliminates some arms in $\mathcal{A}^{p-1}_i$ and obtains $\mathcal{A}^p_i$. Next, each client $i$ sends the newly constructed active arm set $\mathcal{A}^p_i$ to the server.
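A sketch of this elimination step, reusing the toy arrays `x` and `numpy as np` from the earlier snippets (the default values of `alpha` and `ell` are illustrative placeholders, not the paper's tuned constants):

```python
def eliminate(i, active, theta_hat, V, alpha=1.0, ell=0.5):
    """One arm-elimination step at client i, following Eqn. (2) and line 6
    of Algorithm 1: keep an arm iff its upper confidence bound reaches the
    lower confidence bound of the empirically best active arm."""
    score, width = {}, {}
    for a in active:
        score[a] = float(x[i, a] @ theta_hat[a])                 # \hat r_{i,a}
        width[a] = alpha * np.sqrt(
            x[i, a] @ np.linalg.pinv(V[a]) @ x[i, a]) / ell      # u_{i,a}
    best = max(active, key=score.get)
    return [a for a in active
            if score[a] + width[a] >= score[best] - width[best]]
```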
The server then decides $f^p_{i,a}$, the number of times client $i$ pulls arm $a$ during the collaborative exploration step in phase $p$, for each $i \in [M]$ and $a \in \mathcal{A}^p_i$. The specific mechanism for deciding $f^p_{i,a}$ is elaborated in Section 4.3. After the collaborative exploration step, client $i$ performs least-squares estimation (LSE) for each arm $a \in \mathcal{A}^p_i$ based on the local observations collected in the current phase according to Eqn. (3), and then sends the estimates to the server for global aggregation. Note that although $\hat\theta^p_{i,a}$ lies in $\mathrm{range}(x_{i,a})$, the exact value of $x_{i,a}$ is not revealed to the server, thus preserving privacy to a certain extent.

**Algorithm 2** Fed-PE: central server

Input: $T$, $M$, $K$, $\alpha$, $\{f^p\}$
1. **Initialization:** Receive $\{\hat\theta^0_{i,a}\}_{i,a}$; $e_{i,a} \leftarrow \hat\theta^0_{i,a}/\|\hat\theta^0_{i,a}\|$ for all $i\in[M]$, $a\in[K]$; $V^1_a \leftarrow \sum_{i\in[M]}\hat\theta^0_{i,a}(\hat\theta^0_{i,a})^\top$ and $\hat\theta^1_a \leftarrow (V^1_a)^\dagger\sum_{i\in[M]}\hat\theta^0_{i,a}$ for all $a\in[K]$; broadcast $\{\hat\theta^1_a, V^1_a\}_{a\in[K]}$; $p \leftarrow 1$.
2. **while** not reaching the time horizon $T$ **do**
3. &emsp;Receive $\{\mathcal{A}^p_i\}_{i\in[M]}$; set $\mathcal{A}^p \leftarrow \cup_{i=1}^M\mathcal{A}^p_i$ and $\mathcal{R}^p_a \leftarrow \{i : a\in\mathcal{A}^p_i\}$.
4. &emsp;Solve the multi-client G-optimal design in (6), and obtain the solution $\pi^p = \{\pi^p_{i,a}\}_{i\in[M],\,a\in\mathcal{A}^p_i}$.
5. &emsp;For every client $i$, send $\{f^p_{i,a} := \lceil\pi^p_{i,a} f^p\rceil\}_{a\in\mathcal{A}^p_i}$.
6. &emsp;Receive $\{(a, \hat\theta^p_{i,a})\}_{a\in\mathcal{A}^p_i}$ from each client $i$.
7. &emsp;**for** $a \in \mathcal{A}^p$ **do** ▷ *Global aggregation*
$$V^{p+1}_a \leftarrow \sum_{i\in\mathcal{R}^p_a} f^p_{i,a}\,\hat\theta^p_{i,a}(\hat\theta^p_{i,a})^\top, \qquad \hat\theta^{p+1}_a \leftarrow (V^{p+1}_a)^\dagger \sum_{i\in\mathcal{R}^p_a} f^p_{i,a}\,\hat\theta^p_{i,a}. \tag{4}$$
8. &emsp;**end for**
9. &emsp;Broadcast $\{(\hat\theta^{p+1}_a, V^{p+1}_a)\}_{a\in\mathcal{A}^p}$ to all clients.
10. &emsp;$p \leftarrow p + 1$.
11. **end while**
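A sketch of the global aggregation step in (4) for a single arm, assuming the server has collected the weighted rank-one local estimates (the dictionary-based data layout and function name are our illustrative choices):

```python
def aggregate_arm(theta_local: dict, f: dict):
    """Server-side aggregation (4) for one arm a: pool the rank-one local
    estimates into a potential matrix V and invert it only on the explored
    subspace via the pseudo-inverse.

    theta_local[i]: local estimate of client i for this arm (shape (d,));
    f[i]: number of exploratory pulls f_{i,a} assigned to client i."""
    d = next(iter(theta_local.values())).shape[0]
    V = np.zeros((d, d))
    s = np.zeros(d)
    for i, th in theta_local.items():
        V += f[i] * np.outer(th, th)
        s += f[i] * th
    return np.linalg.pinv(V) @ s, V   # (global estimate, potential matrix)
```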
### 4.3 Multi-client G-optimal Design

In this subsection, we elaborate on the core design of Fed-PE: the collaborative exploration step. There are three main design objectives we aim to achieve. 1) As explained in Section 4.1, one of the main challenges in our federated linear contextual bandits setting is that each client may benefit differently from the information exchange through the central server. To minimize the overall regret, for each arm $a \in \mathcal{A}^p$, it is desirable to ensure that, after the global aggregation following the collaborative exploration in phase $p$, the uncertainty in $\hat r^{p+1}_{i,a}$ is balanced across clients. 2) For each client $i$, in order to eliminate suboptimal arms efficiently, it is also important to guarantee that the uncertainty in $\hat r^{p+1}_{i,a}$ is balanced across the arms in $\mathcal{A}^p_i$. 3) Finally, in order to ensure synchronized model updating, we aim to have each client perform the same number of arm pulls in each phase.

Motivated by these objectives, we propose a multi-client G-optimal design to coordinate the exploration of all clients. Specifically, we define $\pi^p_i : \mathcal{A}^p_i \to [0,1]$ as a distribution on the active arm set $\mathcal{A}^p_i$ for each $i$, and denote $\pi^p := (\pi^p_1, \ldots, \pi^p_M)$, viewed as a vector in $\mathbb{R}^{\sum_{i\in[M]}|\mathcal{A}^p_i|}$. Let $\bar e_{i,a} := x_{i,a}/\|x_{i,a}\|$. We note that $\bar e_{i,a}$ equals either $e_{i,a}$ or $-e_{i,a}$. Then, we define a feasible set $\mathcal{C}^p \subseteq \mathbb{R}^{\sum_{i\in[M]}|\mathcal{A}^p_i|}$ as follows:

$$\mathcal{C}^p := \Big\{\pi^p \;:\; \pi^p_{i,a} \ge 0,\ \forall i\in[M],\, a\in\mathcal{A}^p_i;\ \ \textstyle\sum_{a\in\mathcal{A}^p_i}\pi^p_{i,a} = 1,\ \forall i\in[M];\ \ \mathrm{rank}\big(\{\pi^p_{i,a}\bar e_{i,a}\}_{i\in\mathcal{R}^p_a}\big) = \mathrm{rank}\big(\{\bar e_{i,a}\}_{i\in\mathcal{R}^p_a}\big),\ \forall a\in\mathcal{A}^p\Big\}. \tag{5}$$

We can verify that $\mathcal{C}^p$ is a convex set. The first two conditions ensure that $\{\pi^p_{i,a}\}_a$ forms a valid distribution for each client $i$. We name the last condition the rank-preserving condition. We note that the subspace spanned by the vectors on the LHS of the rank-preserving condition is always a subset of that spanned by those on the RHS; once the rank is preserved, the two subspaces coincide. This condition thus ensures that every dimension of $\theta_a$ lying in $\mathrm{range}(\{x_{i,a}\}_{i\in\mathcal{R}^p_a})$ is explored under the collaborative exploration. Any violation of the rank-preserving condition would lead to missing information along the unexplored dimensions, which must be prevented in order to reduce the uncertainty level regarding arm $a$ at every client $i \in \mathcal{R}^p_a$. We then formulate the multi-client G-optimal design problem as follows:

$$\underset{\pi^p \in \mathcal{C}^p}{\text{minimize}} \quad G(\pi^p) = \sum_{i=1}^{M} \max_{a\in\mathcal{A}^p_i} \; \bar e_{i,a}^\top \Big(\sum_{j\in\mathcal{R}^p_a} \pi^p_{j,a}\,\bar e_{j,a}\bar e_{j,a}^\top\Big)^{\dagger} \bar e_{i,a}. \tag{6}$$

We note that $\bar e_{i,a}^\top\big(\sum_{j\in\mathcal{R}^p_a}\pi^p_{j,a}\bar e_{j,a}\bar e_{j,a}^\top\big)^\dagger \bar e_{i,a}$ can be interpreted as an approximate measure of the uncertainty level along the direction $x_{i,a}$ when the arms are explored locally according to the distributions $\{\pi^p_i\}_i$. Thus, the objective function is an approximate measure of the total uncertainty in the least explored arms at each of the clients, and by solving (6), the three design objectives above can be met. We point out that although the server does not know $\bar e_{i,a}$, the objective function remains unchanged when $\bar e_{i,a}$ is replaced by $e_{i,a}$; the server can therefore simply use $e_{i,a}$ to solve (6). After solving (6) and obtaining $\{\pi^p_{i,a}\}$, the server sets $f^p_{i,a} := \lceil\pi^p_{i,a} f^p\rceil$ for $a\in\mathcal{A}^p_i$ and sends these counts to client $i$. Note that after taking the ceiling function, $\sum_{a\in\mathcal{A}^p_i} f^p_{i,a}$ may be greater than $f^p$. To ensure synchronized updating, each client $i$ keeps pulling the estimated best arm $\hat a^p_i$ until the phase length equals $f^p + K$.

We note that the multi-client G-optimal design formulated in (6) is related to the G-optimal design for the single-player linear bandits problem discussed in Lattimore and Szepesvári (2020), and to the DELB algorithm for distributed linear bandits in Wang et al. (2020). In those cases, however, the player(s) face a single bandit problem, so the objective is simply to obtain a distribution over a so-called core set of arms that minimizes the maximum uncertainty across the arms. In contrast, due to the multiple clients involved in the federated bandits setting and the heterogeneous reward distributions, we are essentially solving $M$ coupled G-optimal design problems, one associated with each client. This coupling fundamentally changes the nature of the problem, leading to a very different characterization and numerical approach. In Appendix C.1 of the supplementary material, we analyze an equivalent formulation of the multi-client G-optimal design; this equivalence essentially generalizes the classical equivalence between G-optimal and D-optimal design in Lattimore and Szepesvári (2020) to the coupled design case. While the original G-optimal design problem can be approximately solved with the Frank-Wolfe algorithm under an appropriate initialization (Todd, 2016), solving the multi-client G-optimal design problem is numerically non-trivial. In Appendix C.2, we propose a block coordinate ascent algorithm that solves the equivalent form of (6) efficiently with guaranteed convergence.
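For concreteness, the objective in (6) can be evaluated as follows. This is only an evaluation sketch (the nested-dictionary data layout is our choice); it does not implement the block coordinate ascent solver from Appendix C.2.

```python
def g_objective(pi, e_bar, active, R):
    """Evaluate G(pi) from (6).

    pi[i][a]: exploration weight pi_{i,a}; e_bar[i][a]: unit vector along
    x_{i,a}; active[i]: active arm set of client i; R[a]: clients with
    arm a active."""
    total = 0.0
    for i, arms in active.items():
        worst = 0.0
        for a in arms:
            # Design matrix for arm a, pooled over all clients exploring it.
            W = sum(pi[j][a] * np.outer(e_bar[j][a], e_bar[j][a])
                    for j in R[a])
            worst = max(worst, float(
                e_bar[i][a] @ np.linalg.pinv(W) @ e_bar[i][a]))
        total += worst
    return total
```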
### 4.4 Theoretical Analysis of Fed-PE

We now characterize the performance of the Fed-PE algorithm.

**Theorem 1** Under Assumption 1, with probability at least $1-\delta$, the cumulative regret under Fed-PE scales in $O\big(\sqrt{dKMT\,(\log(K(\log T)/\delta) + \min\{d, \log M\})}\big)$, and the communication cost scales in $O(Md^2K\log T)$.

The complete version of Theorem 1 and its proof can be found in Appendix D of the supplementary material. Regarding the communication cost: in each phase $p$, client $i$ uploads at most $K$ local estimates of dimension $d$, and downloads at most $K$ global estimates and potential matrices of dimension $d$ and $d^2$, respectively. Thus, the upload cost is $O(MdK\log T)$ and the download cost is $O(Md^2K\log T)$.

**Remark 1** When we set $\delta = O(\sqrt{dK/(MT)})$, the overall regret scales in $O(\sqrt{dKMT\log(MKT)})$, and the per-client regret scales in $O(\sqrt{dKT\log(MKT)/M})$. While the minimax lower bound for a standard stochastic MAB scales in $\Omega(\sqrt{KT})$, the collaborative learning induced by Fed-PE leads to a $\sqrt{d/M}$-fold reduction of the per-client regret. We also note that for single-player linear contextual bandits with disjoint parameters, the best known upper bound scales in $O(\sqrt{dKMT})$ (Dimakopoulou et al., 2017) over $MT$ arm pulls, which indicates that the regret of Fed-PE is close to that of the state-of-the-art centralized algorithms at a communication cost of $O(\log T)$.

### 4.5 Enhanced Fed-PE

The original Fed-PE algorithm requires an exponentially increasing $f^p$ in order to achieve the regret upper bound in Theorem 1. This is because in each phase $p$ we only utilize the rewards collected in phase $p-1$ to estimate $\theta_a$. While this simplifies the analysis, the measurements collected in earlier phases cannot be utilized. To overcome this limitation, we propose an Enhanced Fed-PE algorithm that leverages all historical information. Enhanced Fed-PE achieves different tradeoffs between communication cost and regret performance by adjusting $f^p$. A detailed description and analysis of Enhanced Fed-PE for different choices of $f^p$ can be found in Appendix E of the supplementary material.

### 4.6 Lower Bound

To derive a tight lower bound, we focus on a set of representative policies defined as follows.

**Definition 1 (Collinearly-dependent policy)** Two clients $i$ and $j$ are called collinear if there exist an arm $a \in [K]$ and a subset $S \subseteq [M]$ such that the following conditions are satisfied: 1) $x_{i,a} \notin \mathrm{span}(\{x_{m,a} \mid m \in S\})$; and 2) $x_{i,a} \in \mathrm{span}(\{x_{m,a} \mid m \in S\} \cup \{x_{j,a}\})$. For any two clients $i$ and $j$ that are not collinear, if the action of client $i$ is independent of the action of client $j$ under a policy $\pi$, then the policy is called a collinearly-dependent policy.

We note that the definition of collinearly-dependent policies is quite natural. Intuitively, for two clients that are not collinear, their local observations on any arm $a$ cannot be utilized to improve each other's knowledge of their own local models. As a result, they should not affect each other's decision-making processes. As shown in the supplementary material, we can verify that the celebrated ridge regression based LinUCB type of policies (Li et al., 2010), Thompson sampling based policies with Gaussian priors (Agrawal and Goyal, 2013b), and least-squares estimation based policies, including Fed-PE, all fall in this category.
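The two span conditions in Definition 1 can be checked numerically with rank computations. A sketch for a fixed arm and candidate subset (the helper name and tolerance are ours):

```python
def is_collinear_via(i, j, a, S, x, tol=1e-9):
    """Check Definition 1 for a fixed arm a and subset S: x_{i,a} lies
    outside span{x_{m,a} : m in S} (condition 1) but inside the span once
    x_{j,a} is added (condition 2). x has shape (M, K, d)."""
    d = x.shape[-1]
    base = np.stack([x[m, a] for m in S]) if S else np.zeros((0, d))
    rank = lambda A: np.linalg.matrix_rank(A, tol=tol)
    outside = rank(np.vstack([base, x[i, a]])) > rank(base)
    aug = np.vstack([base, x[j, a]])
    inside = rank(np.vstack([aug, x[i, a]])) == rank(aug)
    return outside and inside
```

Clients $i$ and $j$ are collinear if this predicate holds for some arm $a$ and some subset $S \subseteq [M]$.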
**Theorem 2** For any collinearly-dependent policy, there exists an instance of the federated linear contextual bandits problem such that the regret is lower bounded as $R(T) = \Omega(\sqrt{dKMT})$.

**Remark 2** Theorem 2 essentially shows that, even if raw data transmission and instantaneous communication are allowed and other collinearly-dependent policies are adopted, the order of the regret summarized in Theorem 1 cannot be improved by much, i.e., Fed-PE is order-optimal up to a $\sqrt{\log(KMT)}$ factor.

The proof of Theorem 2 relies on the construction of a special instance of the federated linear contextual bandits problem in which the clients can be divided into $d$ groups. Clients within each group face the same local $K$-armed stochastic bandits model, while clients from two distinct groups are not collinear. Analyzing the regret within each individual group, we can show that it is lower bounded by $\Omega(\sqrt{KMT/d})$. Then, by utilizing the property of collinearly-dependent policies, we can show that the overall regret is lower bounded by $\Omega(\sqrt{dKMT})$ for this scenario. More discussion of collinearly-dependent policies and the complete proof of Theorem 2 can be found in Appendix F of the supplementary material.

## 5 Federated Linear Contextual Bandits: Shared Parameter Case

The Fed-PE algorithm can be slightly modified for the shared parameter case, where $\theta_a = \theta$ for all $a \in [K]$. While the client-side operation stays the same, the global aggregation step at the server side in (4) is changed by letting the potential matrix be $V^p = \big(\sum_{a\in[K]}(V^p_a)^\dagger\big)^\dagger$ and the global estimator be $\hat\theta^p = (V^p)^\dagger\sum_{i\in[M]}\sum_{a\in\mathcal{A}^{p-1}_i} f^{p-1}_{i,a}\,\hat\theta^{p-1}_{i,a}$. Below we present the main result for this case and leave the detailed algorithm description and regret analysis to the supplementary material.

**Theorem 3** Under Assumption 1, with probability at least $1-\delta$, the regret of the adapted Fed-PE for the shared parameter case is upper bounded by $O\big(\sqrt{dMT\,(\log((\log T)/\delta) + \min\{d, \log MK\})}\big)$, and the communication cost scales in $O((KdM + d^2M)\log T)$.

**Remark 3** By setting $\delta = O(\sqrt{1/(MT)})$, we can show that the overall regret scales in $O(\sqrt{dMT\log(MTK)})$. This bound improves the regret bound for the case without a differential privacy guarantee in Dubey and Pentland (2020) by a factor of $\log T$, although our settings are slightly different. If all clients face the same linear bandits model in this shared parameter setting, the regret in the federated setting over horizon $T$ can be no better than that of a single linear bandit over horizon $MT$. Since the latter is lower bounded by $\Omega(\sqrt{dMT})$ (Chu et al., 2011), the minimax regret for the shared parameter setting is lower bounded by $\Omega(\sqrt{dMT})$ as well. Thus, the modified Fed-PE is near-optimal for this case. In terms of communication cost, the upload cost stays the same as in the disjoint parameter case, while the broadcast cost is reduced by a factor of $K$, since only one potential matrix needs to be broadcast in the shared parameter case.
## 6 Experiments

In this section we report experimental results on both synthetic and real-world datasets to evaluate Fed-PE and the proposed enhancement. Additional experimental details and results can be found in the supplementary material. We consider four algorithms: Fed-PE, Enhanced Fed-PE, local UCB without communication, and a modified Fed-PE with full information exchange after the collaborative exploration in each phase (coined "Collaborative" in Figure 1). For all experiments, we set $T = 2^{17}$, $f^p = 2^p$ for $p \in \{1, 2, \ldots, 16\}$, and run 10 trials. For Fed-PE and its variants, we choose $\delta = 0.1$; other values of $\delta$ may further improve the regret.

**Synthetic dataset.** We first set $M = 100$, $K = 10$, and $d = 3$, and set $\{\theta_a\}$ as vectors from the canonical basis of $\mathbb{R}^3$. The feature vectors $x_{i,a}$ are generated randomly while ensuring that the suboptimality reward gaps lie in $[0.2, 0.4]$ and that $\ell = 0.5$, $L = 1$. The per-client cumulative regret as a function of $T$ is plotted in Figure 1(a). We see that Enhanced Fed-PE outperforms Fed-PE while being slightly worse than "Collaborative". This indicates that keeping the feature vectors $x_{i,a}$ private to the clients does not significantly impact the learning performance. All Fed-PE related algorithms outperform local UCB when $T$ is sufficiently large, demonstrating the effectiveness of communication in improving local learning. We also set $K = 10$, $d = 4$, and vary the number of clients $M$. The performance of Enhanced Fed-PE is plotted in Figure 1(b). We note that the per-client regret monotonically decreases as $M$ increases, corroborating the theoretical results.

**MovieLens dataset.** We then use the MovieLens-100K dataset (Harper and Konstan, 2015) to evaluate the performance. Motivated by Bogunovic et al. (2021), we first complete the rating matrix $R = [r_{i,a}] \in \mathbb{R}^{943\times 1682}$ through collaborative filtering (Morabia, 2019), and then use non-negative matrix factorization with 3 latent factors to obtain $R = WH$, where $W \in \mathbb{R}^{943\times 3}$ and $H \in \mathbb{R}^{3\times 1682}$. We let $x_{i,a}$ be the $i$th row vector of $W$. We apply the $k$-means algorithm to the column vectors of $H$ to produce $K = 30$ groups (arms), and let $\theta_a$ be the center of the $a$th group. Finally, we randomly choose $M = 100$ users' feature vectors. We observe that $0.4 \le \|x_{i,a}\|_2 \le 0.8$ and that the suboptimality gaps lie in $[0.01, 0.8]$. The regret performance of the algorithms is plotted in Figure 1(c). The curves show similar characteristics as in Figure 1(a). These results demonstrate the effectiveness of collaborative learning in the federated bandits setting. (A sketch of this preprocessing pipeline is given below.)

Figure 1: Pseudo-regret over $T$; the shaded area indicates the standard deviation. (a) Synthetic regret. (b) Regret with varying $M$. (c) MovieLens regret.
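A sketch of the MovieLens preprocessing pipeline using scikit-learn: the factorization and clustering hyperparameters follow the text, while the completed rating matrix is replaced here by a random placeholder (the real pipeline would load the collaborative-filtering output), and the initialization and seeds are our illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
R = rng.uniform(1.0, 5.0, size=(943, 1682))  # placeholder for completed ratings

# Non-negative matrix factorization with 3 latent factors: R ~ W H.
nmf = NMF(n_components=3, init='nndsvda', max_iter=500)
W = nmf.fit_transform(R)        # user features, shape (943, 3): x_i = W[i]
H = nmf.components_             # movie factors, shape (3, 1682)

# Cluster the 1682 movie factor vectors (columns of H) into K = 30 arms;
# theta_a is the center of group a.
km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(H.T)
theta = km.cluster_centers_     # shape (30, 3)

# Randomly sample M = 100 users as the clients.
clients = rng.choice(943, size=100, replace=False)
x_users = W[clients]
```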
## 7 Discussion and Conclusion

In this work, we have considered a novel federated linear contextual bandits model, which naturally connects local stochastic MAB models with linear contextual bandits through common global parameters. While each client can only observe a projection of each global parameter onto its own subspace, Fed-PE utilizes the geometric structure of the local estimates to reconstruct the global parameters and guide efficient collaborative exploration. Theoretical analysis indicates that Fed-PE achieves near-optimal regret for both the disjoint and shared parameter cases with a communication cost of order $O(\log T)$.

An interesting open question is whether we can further reduce the communication cost without degrading the regret performance. In particular, we note that the original single-player G-optimal design admits a sparse solution whose support is of size $d(d+1)/2$. Our numerical results indicate that such sparse solutions exist for the multi-client G-optimal design as well. Utilizing the sparsity of the solution may reduce the communication cost significantly. A theoretical characterization of the existence of such sparse solutions is our next step. Another possible direction is to incorporate a differential privacy mechanism into the Fed-PE framework. Although the local feature vectors $\{x_{i,a}\}$ are kept private under Fed-PE, the local estimate $\hat\theta_{i,a}$ lies in $\mathrm{range}(x_{i,a})$, thus revealing the direction of $x_{i,a}$ to the central server. We aim to add a certain perturbation to $\hat\theta_{i,a}$ in order to obfuscate the direction information without significantly affecting the regret performance.

**Acknowledgments and Disclosure of Funding**

The work of RH and JY was supported by the US National Science Foundation under Grants CNS-1956276, CNS-2003131, CNS-2114542, and ECCS-2030026. CS acknowledges funding support from the US National Science Foundation under Grants ECCS-2029978, ECCS-2033671, and CNS-2002902. WW's work was done before he joined Facebook.

## References

Agarwal, A., Langford, J., and Wei, C.-Y. (2020). Federated residual learning. arXiv preprint arXiv:2003.12880.

Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory.

Agrawal, S. and Goyal, N. (2013a). Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107.

Agrawal, S. and Goyal, N. (2013b). Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 1220-1228.

Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256.

Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1):55-65.

Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. (2021). Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991-999. PMLR.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122.

Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214. PMLR.

Dimakopoulou, M., Zhou, Z., Athey, S., and Imbens, G. (2017). Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.

Dubey, A. and Pentland, A. (2020). Differentially-private federated linear bandits. In Advances in Neural Information Processing Systems, volume 33.

Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4).

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D'Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1).

Korda, N., Szorenyi, B., and Li, S. (2016). Distributed clustering of linear bandits in peer to peer networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1301-1309. PMLR.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.

Landgren, P., Srivastava, V., and Leonard, N. E. (2016). On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243-248. IEEE.

Landgren, P., Srivastava, V., and Leonard, N. E. (2018). Social imitation in cooperative multiarmed bandits: Partition-based algorithms with strictly local information. In 2018 IEEE Conference on Decision and Control (CDC), pages 5239-5244. IEEE.
Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817-824.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670.

Li, T., Song, L., and Fragouli, C. (2020). Federated recommendation system via differential privacy. In IEEE International Symposium on Information Theory (ISIT), pages 2592-2597.

Martínez-Rubio, D., Kanade, V., and Rebeschini, P. (2019). Decentralized cooperative stochastic bandits. In Advances in Neural Information Processing Systems, pages 4529-4540.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. (2017). Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273-1282.

Morabia, K. (2019). Collaborative-Filtering. https://github.com/kevalmorabia97/Collaborative-Filtering.

Shi, C. and Shen, C. (2021). Federated multi-armed bandits. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI).

Shi, C., Shen, C., and Yang, J. (2021). Federated multi-armed bandits with personalization. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS).

Szorenyi, B., Busa-Fekete, R., Hegedus, I., Ormandi, R., Jelasity, M., and Kegl, B. (2013). Gossip-based distributed stochastic bandit algorithms. In Proceedings of the 30th International Conference on Machine Learning, pages 19-27.

Todd, M. J. (2016). Minimum-Volume Ellipsoids. Society for Industrial and Applied Mathematics, Philadelphia, PA.

Wang, Y., Hu, J., Chen, X., and Wang, L. (2020). Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations.

Zhu, Z., Zhu, J., Liu, J., and Liu, Y. (2021). Federated bandit: A gossiping approach. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 5(1):1-29.