# Federated Linear Bandits with Finite Adversarial Actions

Li Fan (University of Virginia) lf2by@virginia.edu · Ruida Zhou (Texas A&M University) ruida@tamu.edu · Chao Tian (Texas A&M University) chao.tian@tamu.edu · Cong Shen (University of Virginia) cong@virginia.edu

We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of the SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and is thus order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(dM^2 \log d \log T)$ and $O(\sqrt{d^3M^3} \log d)$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O}\big(\sqrt{d\sum_{t=1}^{T}\sigma_t^2}\big)$ can be achieved, with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + dC_p)$ can be achieved, with $C_p$ being the total corruption budget. Experimental results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.

1 Introduction

In the canonical formulation of contextual bandits, a single player repeatedly makes arm-pulling decisions based on contextual information, with the goal of maximizing the long-term reward. With the emerging federated learning paradigm (McMahan et al., 2017), where multiple clients and a server jointly learn a global model, each client locally updating the model with its own data and the server only aggregating the local models periodically, researchers have started exploring contextual bandits algorithms in the federated learning setting (Dubey and Pentland, 2020; Huang et al., 2021; Li and Wang, 2022a,b). This federated contextual bandits framework broadens the applicability of contextual bandits to practical scenarios such as recommender systems, clinical trials, and cognitive radio. In these applications, although the goal is still to maximize the cumulative reward for the overall system, decision-making and observations are naturally distributed at the participating clients.

Several intrinsic challenges arise with the federated contextual bandit formulation. First, besides regret, we should also take into account the communication cost, which is usually the system bottleneck. To reduce the communication cost while maintaining the same regret guarantee, the clients should transmit the necessary information to the server only when the local information has accumulated to the extent that it would affect the decision-making. Compared with centralized contextual bandits, which incur a linearly growing communication cost, algorithms for federated contextual bandits attempt to achieve a comparable regret with sub-linear communication cost.
Second, most existing studies on federated contextual bandits focus on the synchronous communication scenario (Huang et al., 2021; Li and Wang, 2022b), in which all participating clients first upload local information and then download updated global information from the server in each communication round. This stringent communication requirement is often not met in practice. A recent work of Li and Wang (2022a) studies the asynchronous federated linear bandit problem. However, communications for different clients are not independent in their approach because the upload from one client may trigger the server to perform downloads for all clients. To address this issue, He et al. (2022a) proposes Fed Lin UCB, which enables independent synchronizations between clients and the server. Third, the majority of prior studies on federated linear bandits focused on the infinite-arm setting (Li and Wang, 2022b,a; He et al., 2022a) (see Section 2 for a detailed literature review). From a methodology point of view, these papers largely build on the OFUL principle (Abbasi-Yadkori et al., 2011). One notable exception is Huang et al. (2021), which studies synchronous communication with fixed contexts and proposes the Fed-PE algorithm based on the phased elimination G-optimal design (Lattimore and Szepesvári, 2020). To the best of our knowledge, no prior result exists for federated linear bandits with finite arms and time-evolving adversarial contexts, which is the focus of our work. Table 1: Comparison of this paper with related works System Action Algorithm Regret Communication Single-player infinite arm OFUL (Abbasi-Yadkori et al., 2011) d T log T N/A Single-player finite fixed arm PE + G-optimal (Lattimore and Szepesvári, 2020) O( d T log T) N/A Single-player finite adversarial arm Sup Lin UCB (Chu et al., 2011) O( p d T log3 T) N/A Federated (Async) infinite arm Fed Lin UCB (He et al., 2022a) O(d T log T) O(d M 2 log T) Federated (Async) infinite arm Async-Lin UCB (Li and Wang, 2022a) O(d T log T) O(d M 2 log T) Federated (Sync) infinite arm Dis Lin UCB (Wang et al., 2019) O(d T log2 T) O(d M 3/2) Federated (Sync) finite fixed arm Fed-PE (Huang et al., 2021) O( d T log T) O(d2MK log T) Federated (Async) finite adversarial arm Fed Sup Lin UCB (This work) O( p d T log3 T) O(d M 2 log d log T) Federated (Sync) finite adversarial arm Fed Sup Lin UCB (This work) O( p d T log3 T) O(d3/2M 3/2 log(d)) d: the dimension of the unknown parameter, M: the number of clients, K: the number of finite actions, T : the total arm pulls from all clients. Main contributions. Our main contributions are summarized as follows. We develop a general federated bandits framework, termed Fed Sup Lin UCB, for solving the problem of federated linear contextual bandits with finite adversarial actions. Fed Sup Lin UCB extends Sup Lin UCB (Chu et al., 2011; Ruan et al., 2021) and OFUL (Abbasi-Yadkori et al., 2011), two important principles in (single-player, centralized) linear bandits, to the federated bandits setting with a carefully designed layered successive screening. We instantiate Fed Sup Lin UCB with both asynchronous and synchronous client activities. For the former setting, we propose Async-Fed Sup Lin UCB where communication is triggered only when the cumulative local information impacts the exploration uncertainty to a certain extent. 
We prove that Async-FedSupLinUCB achieves $\tilde{O}(\sqrt{dT})$ regret with $O(dM^2 \log d \log T)$ communication cost, which not only reduces the regret by a factor of $\sqrt{d}$ compared with previous results on asynchronous federated linear bandits with infinite arms, but also matches the minimax lower bound up to polylog terms, indicating that Async-FedSupLinUCB achieves order-optimal regret.

For synchronous communications, we propose Sync-FedSupLinUCB, which has a refined communication design where only certain layers are communicated, as opposed to the complete information. Sync-FedSupLinUCB achieves order-optimal regret of $\tilde{O}(\sqrt{dT})$ with horizon-independent communication cost $O(\sqrt{d^3M^3} \log d)$. Compared with the best previous result (Huang et al., 2021), which achieves the same order-optimal regret but only for fixed actions, we show that it is the finiteness of the action sets that fundamentally determines the regret behavior in the federated linear bandits setting.

We further develop two extensions of FedSupLinUCB: (1) Variance-adaptive FedSupLinUCB, for which a total regret of $\tilde{O}\big(\sqrt{d\sum_{t=1}^{T}\sigma_t^2}\big)$ is achieved, where $\sigma_t^2$ is the noise variance at round $t$; and (2) Adversarial-corruption FedSupLinUCB, for which a total regret of $\tilde{O}(\sqrt{dT} + dC_p)$ is achieved, where $C_p$ is the total corruption budget.

2 Related Works

The linear bandit model, as a generalization of finite-armed bandits with linear contextual information, has been extensively studied. The setting of infinite arm sets solved by LinUCB was analyzed in (Dani et al., 2008; Abbasi-Yadkori et al., 2011); it achieves regret $\tilde{O}(d\sqrt{T})$ with an appropriate confidence width (Abbasi-Yadkori et al., 2011) and matches the lower bound (Dani et al., 2008) up to logarithmic factors. In contrast, algorithms like SupLinRel (Auer, 2002) and SupLinUCB (Chu et al., 2011) achieve $\tilde{O}(\sqrt{dT})$ in the setting of finite time-varying adversarial arm sets with $K \ll 2^d$, with a matching lower bound $\Omega(\sqrt{dT})$ (Chu et al., 2011). The SupLinUCB algorithm was later optimized to match the lower bound up to iterated logarithmic factors in Li et al. (2019). As a special case of the finite-arm setting, if the arm set is time-invariant, an elimination-based algorithm via G-optimal design (Lattimore and Szepesvári, 2020) can be applied to achieve similarly optimal performance.

The federated linear bandits problem has been studied under the settings of infinite arm sets (Dubey and Pentland, 2020; Li et al., 2020; Li and Wang, 2022a) and time-invariant finite arm sets (Huang et al., 2021), while the time-varying finite-arm setting has not been well explored. A finite time-varying arm set has many meaningful practical applications, such as recommender systems (Li et al., 2010; Chu et al., 2011), and the distributed (federated) nature of these applications naturally gives rise to the federated linear bandits problem with finite time-varying arms. This paper fills the gap by generalizing the SupLinUCB algorithm to the federated setting. We study both the asynchronous setting (Li and Wang, 2022a; He et al., 2022a), where clients are active on their own and full participation is not required, and the synchronous setting (Shi et al., 2021; Dubey and Pentland, 2020), where all clients make decisions at each round and a communication round requires all clients to upload new information to the server and download the updated information. We design algorithms that reduce the communication cost while maintaining optimal regret.
Technically, the communication cost is associated with algorithmic adaptivity, since less adaptivity requires fewer updates and thus fewer communication rounds. The algorithmic adaptivity of linear bandits algorithms has been studied in the single-player setting (Han et al., 2020; Ruan et al., 2021) and was also considered in the federated setting (Wang et al., 2019; Huang et al., 2021; Salgia and Zhao, 2023).

3 System Model and Preliminaries

3.1 Problem Formulation

We consider a federated linear contextual bandits model with $K$ finite but possibly time-varying arms. The model consists of $M$ clients and one server in a star-shaped communication network. The clients jointly solve a linear bandit problem by collecting local information and communicating with the central server through the star-shaped network in a federated manner, with no direct communication among clients. The only function of the server is to aggregate received client information and to send updated information back to the clients; it cannot directly play the bandits game.

Specifically, a subset of clients $I_t \subseteq [M]$ is active at round $t$. Each active client $i \in I_t$ receives $K$ arms (actions to take) associated with contexts $\{x^i_{t,a}\}_{a\in[K]} \subset \mathbb{R}^d$ with $\|x^i_{t,a}\|_2 \le 1$. Here we adopt the oblivious adversarial setting, where all contexts are chosen beforehand and do not depend on previous game observations. Client $i$ then pulls an arm $a^i_t \in [K]$ based on the information collected locally as well as that previously communicated from the server. A reward $r^i_{t,a^i_t} = \theta^\top x^i_{t,a^i_t} + \epsilon_t$ is revealed privately to client $i$, where $\theta \in \mathbb{R}^d$ is an unknown weight vector with $\|\theta\|_2 \le 1$ and $\epsilon_t$ is an independent 1-sub-Gaussian noise. At the end of round $t$, depending on the communication protocol, client $i$ may exchange the collected local information with the server so that it can update the global information. We aim to design algorithms that guide the clients' decision-making and overall communication behaviors.

We analyze two patterns of client activity: 1) Synchronous, where all $M$ clients are active at each round; and 2) Asynchronous, where one client is active at each round. For the latter case, we further assume that client activity is independent of data and history. Denote by $T_i$ the number of times client $i$ is active. In the former case, $T_i = T_j$ for all $i, j \in [M]$, while in the latter case the $T_i$ may differ among clients. We define $T = \sum_{i=1}^{M} T_i$ as the total number of arm pulls from all clients.

The performance is measured under two metrics, total regret and communication cost, which concern decision-making effectiveness and communication efficiency, respectively. Denote by $P^i_T = \{t \in [T] \mid i \in I_t\}$ the set of time indices at which client $i$ is active, with $|P^i_T| = T_i$. The total regret is defined as
$$R_T = \sum_{i=1}^{M} \sum_{t \in P^i_T} \left( r^i_{t,a^{i,*}_t} - r^i_{t,a^i_t} \right), \quad \text{where } a^{i,*}_t = \arg\max_{a \in [K]} \theta^\top x^i_{t,a}.$$
We define the communication cost as the total number of communication rounds between the clients and the server.

3.2 Preliminaries

Information encoding. In the linear bandits setting (federated or not), the information a client acquires is usually encoded by the gram matrix and the action-reward vector. Specifically, when the client has observed $n$ action-reward pairs $\{(x_t, r_t)\}_{t=1}^n$, the information is encoded by the matrix $A_n = \sum_{t=1}^{n} x_t x_t^\top$ and the vector $b_n = \sum_{t=1}^{n} r_t x_t$. Denote by $\mathrm{Encoder}(\cdot)$ this encoding function, i.e., $(A_n, b_n) \leftarrow \mathrm{Encoder}(\{(x_t, r_t)\}_{t=1}^n)$.
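As a concrete illustration, the encoding and the associated ridge estimate can be sketched in a few lines of Python. This is a minimal sketch; `encoder` and `estimate` are our names, not the paper's, and the identity initialization mirrors the algorithms' $A \leftarrow I_d$ so the matrix stays invertible.

```python
import numpy as np

def encoder(pairs, d):
    """Encode observed (context, reward) pairs into the gram matrix
    A = I_d + sum_t x_t x_t^T and the action-reward vector b = sum_t r_t x_t."""
    A, b = np.eye(d), np.zeros(d)
    for x, r in pairs:
        A += np.outer(x, x)
        b += r * x
    return A, b

def estimate(A, b, x, alpha):
    """Ridge estimate theta_hat = A^{-1} b; returns the reward estimate
    x^T theta_hat and the confidence width alpha * sqrt(x^T A^{-1} x)."""
    theta_hat = np.linalg.solve(A, b)
    width = alpha * np.sqrt(x @ np.linalg.solve(A, x))
    return x @ theta_hat, width
```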
Communication criterion. Communication in our proposed framework is data-dependent, in the same spirit as the doubling trick introduced in Abbasi-Yadkori et al. (2011) to reduce the computational complexity of single-player linear bandits. The key idea is that communication is triggered only when the cumulative local information, represented by the determinant of the gram matrix $A_n$, affects the exploration uncertainty to a great extent, at which point the client needs to communicate with the server. Detailed communication protocols will be presented with each algorithm design.

Synchronization procedure. Denote by Sync() a routine in which $n$ clients (client 1, ..., client $n$) first communicate their local gram matrices and action-reward vectors to the server, and the server then aggregates the matrices (and vectors) into one gram matrix (and one action-reward vector) and transmits them back to the $n$ clients. Specifically, each client $i$ holds newly observed local information $(\Delta A^i, \Delta b^i)$, which is the difference between the client's current information $(A^i, b^i)$ and the information after the last synchronization. In other words, $(\Delta A^i, \Delta b^i)$ is the information that has not yet been communicated to the server. The server, after receiving the local information $\{(\Delta A^i, \Delta b^i)\}_{i=1}^n$, updates the server-side information $(A^{\mathrm{ser}}, b^{\mathrm{ser}})$ by $A^{\mathrm{ser}} \leftarrow A^{\mathrm{ser}} + \sum_{i=1}^{n} \Delta A^i$, $b^{\mathrm{ser}} \leftarrow b^{\mathrm{ser}} + \sum_{i=1}^{n} \Delta b^i$ and sends the result back to each of the $n$ clients. Each client $i$ then updates its local information by $A^i \leftarrow A^{\mathrm{ser}}$, $b^i \leftarrow b^{\mathrm{ser}}$. The procedure is formally presented in Algorithm 1.

Algorithm 1 Sync($s$, server, client 1, ..., client $n$)
1: for $i = 1, 2, \ldots, n$ do  ▷ Client-side local information upload
2:   Client $i$ sends its new local layer-$s$ information $(\Delta A^i_s, \Delta b^i_s)$ to the server
3: end for
4: Update the server's layer-$s$ information:  ▷ Server-side information aggregation and distribution
   $A^{\mathrm{ser}}_s \leftarrow A^{\mathrm{ser}}_s + \sum_{i=1}^{n} \Delta A^i_s$, $b^{\mathrm{ser}}_s \leftarrow b^{\mathrm{ser}}_s + \sum_{i=1}^{n} \Delta b^i_s$
5: Send the server information $(A^{\mathrm{ser}}_s, b^{\mathrm{ser}}_s)$ back to all clients
6: for $i = 1, 2, \ldots, n$ do
7:   $A^i_s \leftarrow A^{\mathrm{ser}}_s$, $b^i_s \leftarrow b^{\mathrm{ser}}_s$, $\Delta A^i_s \leftarrow 0$, $\Delta b^i_s \leftarrow 0$  ▷ Client $i$ updates its local information
8: end for
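A minimal executable sketch of this routine is below, assuming a simple per-layer container for the information pairs; the `Node` class and the `dA`/`db` names for $(\Delta A, \Delta b)$ are ours, not the paper's.

```python
import numpy as np

class Node:
    """Per-layer information holder for a client or the server (illustrative)."""
    def __init__(self, d, S):
        self.A = [np.eye(d) for _ in range(S + 1)]            # gram matrices A_s
        self.b = [np.zeros(d) for _ in range(S + 1)]          # action-reward vectors b_s
        self.dA = [np.zeros((d, d)) for _ in range(S + 1)]    # un-communicated Delta A_s
        self.db = [np.zeros(d) for _ in range(S + 1)]         # un-communicated Delta b_s

def sync(s, server, clients):
    """Algorithm 1: aggregate the clients' new layer-s information at the
    server, then broadcast the updated pair back and reset the deltas."""
    for c in clients:                       # client-side upload
        server.A[s] += c.dA[s]
        server.b[s] += c.db[s]
    for c in clients:                       # server-side distribution
        c.A[s] = server.A[s].copy()
        c.b[s] = server.b[s].copy()
        c.dA[s] = np.zeros_like(c.dA[s])
        c.db[s] = np.zeros_like(c.db[s])
```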
4 The FedSupLinUCB Framework

In this section, we present a general framework of federated bandits for linear bandits with finite oblivious adversarial actions. Two instances (asynchronous and synchronous) of this general framework will be discussed in subsequent sections.

Building block: SupLinUCB. As the name suggests, the proposed FedSupLinUCB framework is built upon the principle of SupLinUCB (Chu et al., 2011; Ruan et al., 2021). The information $(A, b)$ is useful in the sense that the reward corresponding to an action $x$ can be estimated within the confidence interval $|x^\top \hat\theta - x^\top \theta| \le \alpha \|x\|_{A^{-1}}$, where $\hat\theta = A^{-1} b$. It is shown in Abbasi-Yadkori et al. (2011) that in linear bandits (even with an infinite number of actions), with $\alpha = O(\sqrt{d})$ the true reward is within the confidence interval with high probability. Moreover, if the rewards in the action-reward vector $b$ are mutually independent, $\alpha$ can be reduced to $O(1)$. The former choice of $\alpha$ naturally guarantees $\tilde{O}(d\sqrt{T})$ regret; however, to achieve regret $\tilde{O}(\sqrt{dT})$, it is critical to keep $\alpha = O(1)$. This is fulfilled by the SupLinUCB algorithm (Chu et al., 2011), recently improved by Ruan et al. (2021). The key intuition is to successively refine an action set that contains the optimal action, where the estimation precision of the sets is geometrically strengthened. Specifically, the algorithm maintains $(S+1)$ layers of information pairs $\{(A_s, b_s)\}_{s=0}^{S}$, and the rewards in the action-reward vectors are mutually independent, except for layer 0. The confidence radius for each layer $s$ is $w_s = 2^{-s} d^{1.5}/\sqrt{T}$.

Algorithm 2 S-LUCB
1: Initialization: $S = \lceil \log d \rceil$, $w_0 = d^{1.5}/\sqrt{T}$, $w_s = 2^{-s} w_0$ for $s \in [1:S]$
2: $\alpha_0 = 1 + \sqrt{d \ln(2M^2T/\delta)}$, $\alpha_s = 1 + \sqrt{2\ln(2KMT\ln d/\delta)}$ for $s \in [1:S]$
3: Input: client $i$ (with local information $A^i, b^i, \Delta A^i, \Delta b^i$), context set $\{x^i_{t,1}, \ldots, x^i_{t,K}\}$
4: $A^i_{t,s} \leftarrow A^i_s + \Delta A^i_s$, $b^i_{t,s} \leftarrow b^i_s + \Delta b^i_s$; or $A^i_{t,s} \leftarrow A^i_s$, $b^i_{t,s} \leftarrow b^i_s$ for lazy update
5: $\hat\theta_s \leftarrow (A^i_{t,s})^{-1} b^i_{t,s}$, $\hat r^i_{t,s,a} = \hat\theta_s^\top x^i_{t,a}$, $w^i_{t,s,a} = \alpha_s \|x^i_{t,a}\|_{(A^i_{t,s})^{-1}}$ for $s \in [0:S]$, $a \in [K]$
6: $s \leftarrow 0$; $\mathcal{A}_0 \leftarrow \{a \in [K] \mid \hat r^i_{t,0,a} + w^i_{t,0,a} \ge \max_{a' \in [K]} (\hat r^i_{t,0,a'} - w^i_{t,0,a'})\}$  ▷ Initial screening
7: repeat  ▷ Layered successive screening
8:   if $s = S$ then
9:     Choose action $a^i_t$ arbitrarily from $\mathcal{A}_S$
10:  else if $w^i_{t,s,a} \le w_s$ for all $a \in \mathcal{A}_s$ then
11:    $\mathcal{A}_{s+1} \leftarrow \{a \in \mathcal{A}_s \mid \hat r^i_{t,s,a} \ge \max_{a' \in \mathcal{A}_s} (\hat r^i_{t,s,a'}) - 2 w_s\}$; $s \leftarrow s + 1$
12:  else
13:    $a^i_t \leftarrow \arg\max_{\{a \in \mathcal{A}_s,\, w^i_{t,s,a} > w_s\}} w^i_{t,s,a}$
14:  end if
15: until action $a^i_t$ is found
16: Take action $a^i_t$ and receive reward $r^i_{t,a^i_t}$
17: $\Delta A^i_s \leftarrow \Delta A^i_s + x^i_{t,a^i_t} (x^i_{t,a^i_t})^\top$, $\Delta b^i_s \leftarrow \Delta b^i_s + r^i_{t,a^i_t} x^i_{t,a^i_t}$  ▷ Update local information
18: Return layer index $s$

FedSupLinUCB. S-LUCB, presented in Algorithm 2, combines the principles of SupLinUCB and OFUL (Abbasi-Yadkori et al., 2011) and is the core subroutine of FedSupLinUCB. We maintain $S = \lceil \log d \rceil$ information layers; the estimation accuracy starts from $d^{1.5}/\sqrt{T}$ at layer 0 and halves as the layer index increases, so it takes $\Theta(\log d)$ layers to reach the sufficient accuracy of $\sqrt{d/T}$ and achieve the minimax-optimal regret. When client $i$ is active, the input parameters $(A^i, b^i)$ contain the information received from the server at the last communication round, and $(\Delta A^i, \Delta b^i)$ is the new local information collected since then. $\{x^i_{t,1}, \ldots, x^i_{t,K}\}$ is the set of contexts observed in this round. Client $i$ can estimate the unknown parameter $\theta$ either with all available information or with a lazy update; this choice depends on the communication protocol and will be elaborated later. During the decision-making process, client $i$ first performs arm elimination at layer 0 to help bootstrap the accuracy parameters. It then proceeds to the layered successive screening in the same manner as SupLinUCB, sequentially eliminating suboptimal arms based on their empirical means and confidence widths. After taking action $a^i_t$ and receiving the corresponding reward $r^i_{t,a^i_t}$, client $i$ updates its local information $(\Delta A^i_s, \Delta b^i_s)$ by aggregating the context into the layer $s$ in which the action was taken, before returning the layer index $s$.
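The layered screening logic of Algorithm 2 can be sketched in Python as follows, reusing the illustrative `Node` container from above. This is a sketch under the paper's parameter choices, not the authors' implementation.

```python
import numpy as np

def s_lucb(client, contexts, T, K, M, delta, lazy=True):
    """Sketch of S-LUCB (Algorithm 2): returns the chosen action and the
    layer index whose local information the caller should update."""
    d = contexts.shape[1]
    S = int(np.ceil(np.log2(d)))
    w = [2.0 ** (-s) * d ** 1.5 / np.sqrt(T) for s in range(S + 1)]    # w_s
    alpha = [1 + np.sqrt(d * np.log(2 * M ** 2 * T / delta))]          # alpha_0
    alpha += [1 + np.sqrt(2 * np.log(2 * K * M * T * np.log(d) / delta))] * S

    r_hat = np.zeros((S + 1, K))
    width = np.zeros((S + 1, K))
    for s in range(S + 1):
        A = client.A[s] if lazy else client.A[s] + client.dA[s]        # Line 4
        b = client.b[s] if lazy else client.b[s] + client.db[s]
        theta = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        for a in range(K):
            r_hat[s, a] = theta @ contexts[a]
            width[s, a] = alpha[s] * np.sqrt(contexts[a] @ A_inv @ contexts[a])

    # Line 6: initial screening at layer 0.
    thresh = max(r_hat[0, a] - width[0, a] for a in range(K))
    cand = [a for a in range(K) if r_hat[0, a] + width[0, a] >= thresh]
    s = 0
    while True:                                    # layered successive screening
        if s == S:
            action = cand[0]                       # any action in A_S
            break
        if all(width[s, a] <= w[s] for a in cand):
            best = max(r_hat[s, a] for a in cand)
            cand = [a for a in cand if r_hat[s, a] >= best - 2 * w[s]]
            s += 1                                 # refine the set, go one layer down
        else:
            wide = [a for a in cand if width[s, a] > w[s]]
            action = max(wide, key=lambda a: width[s, a])
            break
    return action, s

# After pulling `action` with context x and observing reward r, the caller
# updates layer s: client.dA[s] += np.outer(x, x); client.db[s] += r * x
```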
5 Asynchronous FedSupLinUCB

In the asynchronous setting, only one client is active in each round. Global synchronization and coordination are not required, and all inactive clients are idle.

5.1 Algorithm

We first initialize the information for all clients and the server (gram matrix and action-reward vector) in each layer $s \in [0:S]$. We assume only one client $i_t$ is active at round $t$. This is without loss of generality: if multiple clients are active, we can queue them up and activate them in turn. More discussion of this equivalence can be found in He et al. (2022a); Li and Wang (2022a). The active client chooses an action, receives a reward, updates the local information matrices of layer $s$ with a lazy update according to S-LUCB, and decides whether communication with the server is needed by the criterion in Line 7 of Algorithm 3. If communication is triggered, client $i_t$ is synchronized with the server by Algorithm 1.

Algorithm 3 Async-FedSupLinUCB
1: Initialization: $T$, $C$, $S = \lceil \log d \rceil$
2: $\{A^{\mathrm{ser}}_s \leftarrow I_d, b^{\mathrm{ser}}_s \leftarrow 0 \mid s \in [0:S]\}$  ▷ Server initialization
3: $\{A^i_s \leftarrow I_d, \Delta A^i_s, b^i_s, \Delta b^i_s \leftarrow 0 \mid s \in [0:S], i \in [M]\}$  ▷ Client initialization
4: for $t = 1, 2, \ldots, T$ do
5:   Client $i_t = i$ is active and observes $K$ contexts $\{x^i_{t,1}, x^i_{t,2}, \ldots, x^i_{t,K}\}$
6:   $s \leftarrow$ S-LUCB(client $i$, $\{x^i_{t,1}, x^i_{t,2}, \ldots, x^i_{t,K}\}$) with lazy update
7:   if $\frac{\det(A^i_s + \Delta A^i_s)}{\det(A^i_s)} > 1 + C$ then
8:     Sync($s'$, server, client $i$) for each $s' \in [0:S]$
9:   end if
10: end for

5.2 Performance Analysis

Theorem 5.1. For any $0 < \delta < 1$, if we run Algorithm 3 with $C = 1/M^2$, then with probability at least $1 - \delta$, the regret of the algorithm is bounded as $R_T \le \tilde{O}\big(\sqrt{d \sum_{i=1}^{M} T_i}\big)$. Moreover, the corresponding communication cost is bounded by $O(dM^2 \log d \log T)$.

Remark 1. The minimax lower bound of the expected regret for linear contextual bandits with $K$ adversarial actions is $\Omega(\sqrt{dT})$, given in Chu et al. (2011). Theorem 5.1 indicates that Async-FedSupLinUCB achieves order-optimal regret (up to polylog terms) with $O(dM^2 \log d \log T)$ communication cost. To the best of our knowledge, this is the first algorithm that achieves (near) optimal regret in federated linear bandits with finite adversarial actions.

Remark 2. Without any communication, each client would execute SupLinUCB (Chu et al., 2011) locally for $T_i$ rounds and achieve regret of order $\tilde{O}(\sqrt{dT_i})$. The total regret of the $M$ clients is therefore upper bounded by $R_T \le \sum_{i=1}^{M} \sqrt{dT_i}\,\mathrm{polylog}(T) \le \sqrt{dM\sum_{i=1}^{M} T_i}\,\mathrm{polylog}(T)$, where the last inequality becomes an equality when $T_i = T_j$ for all $i, j \in [M]$. Compared with running $M$ independent SupLinUCB instances locally, Async-FedSupLinUCB yields an average per-client gain of $1/\sqrt{M}$, demonstrating that communication in the federated system speeds up local linear bandits decision-making at the clients.

Remark 3. Most previous federated linear bandits consider the infinite-action setting, based on the LinUCB principle (Abbasi-Yadkori et al., 2011). Async-FedSupLinUCB considers a finite adversarial action setting and achieves a $\sqrt{d}$ reduction in the regret bound. Fed-PE, proposed in Huang et al. (2021), also considers the finite-action setting, but its action sets are fixed. We generalize that formulation and consider a more challenging scenario in which the finite action set can be chosen adversarially. The regret order is the same as Fed-PE (ignoring the polylog term), indicating that it is finite actions, as opposed to fixed actions, that fundamentally lead to the $\sqrt{d}$ regret improvement in the federated linear bandits setting.
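The determinant-ratio criterion in Line 7 of Algorithm 3 can be checked as follows; this is a sketch, and testing in log space via `slogdet` is our numerical-stability choice, not specified in the paper.

```python
import numpy as np

def should_sync(A, dA, C):
    """Trigger of Algorithm 3, Line 7: communicate iff
    det(A + dA) / det(A) > 1 + C, evaluated in log space."""
    new = np.linalg.slogdet(A + dA)[1]
    old = np.linalg.slogdet(A)[1]
    return new - old > np.log(1 + C)
```

With the paper's choice $C = 1/M^2$, a larger client population makes the trigger more sensitive, which is what drives the $M^2$ factor in the communication bound.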
Communication cost analysis of Async-FedSupLinUCB. We sketch the proof of the communication cost bound in Theorem 5.1 below, deferring the detailed proofs of the regret and communication cost to Appendix C. We first study the communication cost triggered by a given layer $s$. Denote by $A^{\mathrm{ser}}_{t,s}$ the gram matrix at the server, aggregated from the gram matrices uploaded by all clients up to round $t$. Define $T_{n,s} = \min\{t \in [T] \mid \det(A^{\mathrm{ser}}_{t,s}) \ge 2^n\}$ for each $n \ge 0$, and divide the rounds into epochs $\{T_{n,s}, T_{n,s}+1, \ldots, \min(T_{n+1,s}-1, T)\}$ for each $n \ge 0$. The number of communications triggered by layer $s$ within any epoch can be upper bounded by $2(M + 1/C)$ (see Lemma C.1), and the number of non-empty epochs is at most $d \log(1 + T/d)$ by Lemma A.1. Since there are $S = \lceil \log d \rceil$ layers and synchronization of all layers is performed once communication is triggered by any layer (Line 8 in Algorithm 3), the total communication cost is upper bounded by $O(d(M + 1/C) \log d \log T)$. Plugging in $C = 1/M^2$ proves the result. We note that although choosing a larger $C$ would trigger fewer communications, the final choice of $C = 1/M^2$ takes into consideration both the regret and the communication cost, i.e., it achieves a small communication cost while maintaining order-optimal regret.

6 Synchronous FedSupLinUCB

In the synchronous setting, all clients are active and make decisions at each round. Though this can be viewed as a special case of the asynchronous scenario (clients active and pulling arms in a round-robin manner), the information update is broadcast to all clients. In other words, the key difference from the asynchronous scenario, besides all clients being active at each round, is that when one client meets the communication criterion, all clients upload their local information to the server and download the updated matrices. This leads to a higher cost per communication round, but in the synchronous scenario, knowing that all clients participate allows the communicated information to be well utilized by the other clients. This is in sharp contrast to the asynchronous setting, where if many other clients were active in the current round, uploading local information to them might be wasted. To mitigate the total communication cost, we use a more refined communication criterion that enables horizon-independent communication cost.

6.1 The Algorithm

The Sync-FedSupLinUCB algorithm lets each client make decisions via the S-LUCB subroutine. Note that decision-making is based on all available local information, instead of the lazy update used in Async-FedSupLinUCB. The communication criterion involves the number of rounds since the last communication, which forces communication to prevent local data from becoming obsolete. A layer may trigger the criterion either because the local client has gathered enough new data or because it has not communicated with the server for too long. We collect these layers in the set CommLayers and synchronize all clients with the server.

6.2 Performance Analysis

Theorem 6.1. For any $0 < \delta < 1$, if we run Algorithm 4 with $D = \frac{T_c \log T_c}{d^2 M}$, then with probability at least $1 - \delta$, the regret of the algorithm is bounded as $R_T \le \tilde{O}(\sqrt{dMT_c})$, where $T_c$ is the number of per-client arm pulls. Moreover, the corresponding communication cost is bounded by $O(\sqrt{d^3M^3} \log d)$.

Remark 4. Theorem 6.1 demonstrates that Sync-FedSupLinUCB also achieves the minimax regret lower bound, while the communication cost is independent of $T_c$, which is particularly beneficial for large $T_c$. Note that the total number of arm pulls in the synchronous scenario is $T = MT_c$, while in the asynchronous setting we have $T = \sum_{i=1}^{M} T_i$.

Communication cost analysis of Sync-FedSupLinUCB. We sketch the proof of the communication cost bound in Theorem 6.1 below, deferring the detailed proofs of the regret and communication cost to Appendix D.

Algorithm 4 Sync-FedSupLinUCB
1: Initialization: $T_c$, $D$, $S = \lceil \log d \rceil$, $t^s_{\mathrm{last}} \leftarrow 0$ for $s \in [0:S]$, CommLayers $\leftarrow \emptyset$
2: $\{A^{\mathrm{ser}}_s \leftarrow I_d, b^{\mathrm{ser}}_s \leftarrow 0 \mid s \in [0:S]\}$  ▷ Server initialization
3: $\{A^i_s \leftarrow I_d, \Delta A^i_s, b^i_s, \Delta b^i_s \leftarrow 0 \mid s \in [0:S], i \in [M]\}$  ▷ Client initialization
4: for $t = 1, 2, \ldots, T_c$ do
5:   for $i = 1, 2, \ldots, M$ do
6:     Client $i$ is active and observes $K$ contexts $\{x^i_{t,1}, x^i_{t,2}, \ldots, x^i_{t,K}\}$
7:     $s \leftarrow$ S-LUCB(client $i$, $\{x^i_{t,1}, x^i_{t,2}, \ldots, x^i_{t,K}\}$)
8:     if $(t - t^s_{\mathrm{last}}) \log\frac{\det(A^i_s + \Delta A^i_s)}{\det(A^i_s)} > D$ then
9:       Add $s$ to CommLayers
10:    end if
11:  end for
12:  for $s \in$ CommLayers do
13:    Sync($s$, server, clients $[M]$); $t^s_{\mathrm{last}} \leftarrow t$; CommLayers $\leftarrow \emptyset$
14:  end for
15: end for
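The refined criterion in Line 8 of Algorithm 4 additionally weights the information growth by the number of rounds since layer $s$ last synchronized; a minimal sketch:

```python
import numpy as np

def sync_trigger(t, t_last, A, dA, D):
    """Algorithm 4, Line 8: trigger when (t - t_last) times the
    log-determinant growth of layer s exceeds D, so a layer also
    synchronizes once its local data has been stale for too long."""
    growth = np.linalg.slogdet(A + dA)[1] - np.linalg.slogdet(A)[1]
    return (t - t_last) * growth > D
```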
We call a chunk of consecutive rounds without communication of layer-$s$ information (except the last round) an epoch. Information in layer $s$ is collected locally by each client and synchronized at the end of the epoch, after which the next epoch starts. Denote by $A^{\mathrm{all}}_{p,s}$ the synchronized gram matrix at the end of the $p$-th epoch. For any value $\beta > 0$, there are at most $T_c/\beta$ epochs containing more than $\beta$ rounds, by the pigeonhole principle. If the $p$-th epoch contains fewer than $\beta$ rounds, then $\log\frac{\det(A^{\mathrm{all}}_{p,s})}{\det(A^{\mathrm{all}}_{p-1,s})} \ge D/\beta$ by the communication criterion, and by the fact that $\sum_{p=1}^{P} \log\frac{\det(A^{\mathrm{all}}_{p,s})}{\det(A^{\mathrm{all}}_{p-1,s})} \le R_s = O(d \log T_c)$ (see Equation (6)), the number of epochs containing fewer than $\beta$ rounds is at most $O(\frac{R_s}{D/\beta})$. Noting that $D = \frac{T_c \log T_c}{d^2 M}$, the total number of epochs for layer $s$ is at most $O(\sqrt{d^3M})$ by taking $\beta = \sqrt{T_c D / R_s}$. Since each synchronization involves all $M$ clients, the total communication cost is thus upper bounded by $O(SM\sqrt{d^3M}) = O(\sqrt{d^3M^3} \log d)$.

7 Extensions of FedSupLinUCB

In this section, we extend the FedSupLinUCB algorithm to address two distinct settings in federated systems: scenarios characterized by heterogeneous variances, and those affected by adversarial corruptions.

7.1 Federated Heteroscedastic Linear Bandits

We have so far focused on federated linear bandits with 1-sub-Gaussian reward noise. In this section, we adapt Async-FedSupLinUCB to the case where the reward noises have heterogeneous variances, extending the heteroscedastic linear bandits studied in Zhou et al. (2021); Zhou and Gu (2022) to the asynchronous federated setting, where one client is active at a time. Specifically, the reward noises $\{\epsilon_t\}_{t \in [T]}$ are independent with $|\epsilon_t| \le R$, $\mathbb{E}[\epsilon_t] = 0$, and $\mathbb{E}[\epsilon_t^2] \le \sigma_t^2$, where $\sigma_t$ is known to the active client. We propose a variance-adaptive Async-FedSupLinUCB and analyze its regret and communication cost in the theorem below, with the algorithm and proof details in Appendix E due to space constraints. The regret is significantly smaller than that of Async-FedSupLinUCB when the variances $\{\sigma_t^2\}$ are small.

Theorem 7.1. For any $0 < \delta < 1$, if we run the variance-adaptive Async-FedSupLinUCB algorithm in Appendix E with $C = 1/M^2$, then with probability at least $1 - \delta$, the regret is bounded as $R_T \le \tilde{O}\big(\sqrt{d \sum_{t=1}^{T} \sigma_t^2}\big)$, and the communication cost is bounded by $O(dM^2 \log^2 T)$.
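The full variance-adaptive algorithm is in Appendix E; at its core is a variance-weighted information update in the spirit of the weighted ridge regression of Zhou et al. (2021). The following is a hypothetical sketch of that update (our construction under the stated assumptions, not necessarily the paper's exact one):

```python
import numpy as np

def weighted_update(client, s, x, r, sigma, sigma_min=1e-6):
    """Weight each observation by its inverse variance, so low-noise rounds
    contribute more information (assumed form; details are in Appendix E)."""
    w = 1.0 / max(sigma, sigma_min) ** 2
    client.dA[s] += w * np.outer(x, x)
    client.db[s] += w * r * x
```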
7.2 Federated Linear Bandits with Corruption

We further explore asynchronous federated linear bandits with adversarial corruption, where an adversary inserts a corruption $c_t$ into the reward $r_t$ of the active client at round $t$. The total corruption is bounded by $\sum_{t=1}^{T} |c_t| \le C_p$. We incorporate the idea of linear bandits with adversarial corruption studied in He et al. (2022b) into the proposed FedSupLinUCB framework and propose the Robust Async-FedSupLinUCB algorithm, with details in Appendix F. Robust Async-FedSupLinUCB achieves the optimal minimax regret (matching the lower bound in He et al. (2022b)) while incurring a low communication cost.

Theorem 7.2. For any $0 < \delta < 1$, if we run the Robust Async-FedSupLinUCB algorithm in Appendix F with $C = 1/M^2$, then with probability at least $1 - \delta$, the regret is bounded as $R_T \le \tilde{O}(\sqrt{dT} + dC_p)$, and the communication cost is bounded by $O(dM^2 \log d \log T)$.

8 Experiments

We have experimentally evaluated FedSupLinUCB in the asynchronous and synchronous settings on both synthetic and real-world datasets, and we report the results in this section.

8.1 Experiment Results Using Synthetic Dataset

We simulate the federated linear bandits environment specified in Section 3, with $T = 40000$, $M = 20$, $d = 25$, and $K = 20$ actions. Contexts are sampled uniformly at random from the $\ell_2$ unit sphere, and the reward is $r_{t,a} = \theta^\top x_{t,a} + \epsilon_t$, where $\epsilon_t$ is Gaussian with zero mean and $\sigma = 0.01$. Note that while all $M$ clients participate in each round in the synchronous scenario, only one client is active per round in the asynchronous case. In the plots, the x-axis denotes the number of arm pulls, which flattens the actions in the synchronous setting.

Figure 1: Experimental results with the synthetic dataset. (a) Regret: arrival patterns. (b) Communication: arrival patterns. (c) Regret: client numbers. (d) Regret vs. communication.

Arrival pattern. We first investigate the impact of different arrival patterns (the sequence in which clients are activated): (1) Random, which randomly allocates $T/M$ arm pulls in $[T]$ to each client; (2) Round-robin, i.e., $[1, 2, 3, \ldots, M, 1, 2, 3, \ldots, M, \ldots]$; (3) Click-leave, i.e., $[1, 1, \ldots, 2, 2, \ldots, \ldots, M, M, \ldots]$. The regret and communication cost of these three arrival patterns in the synthetic experiment are reported in Figures 1(a) and 1(b), respectively. Although the upper-bound analysis in our proofs is for the worst-case instance, the numerical results show that different arrival patterns yield diverse regret performances. Round-robin and random patterns are more challenging, since both local bandit learning and each client's policy updates happen relatively slowly. The click-leave pattern, which is closest to the centralized setting, achieves the best regret. In addition, compared with Async-FedSupLinUCB, Sync-FedSupLinUCB achieves better cumulative regret at a higher communication cost.

Number of clients. The per-client cumulative regret as a function of $T_c = T/M$ for different numbers of clients is plotted in Figure 1(c). In comparison to the SupLinUCB baseline, the FedSupLinUCB algorithms achieve better regret via communication between the clients and the server. The experiment shows that FedSupLinUCB significantly reduces the per-client regret compared with SupLinUCB and achieves better regret as $M$ increases, in both asynchronous and synchronous settings.

Trade-off between regret and communication. We evaluate the trade-off between communication and regret by running FedSupLinUCB with different communication threshold values $C$ and $D$ in the asynchronous and synchronous settings, respectively. The results are reported in Figure 1(d), where each scattered dot represents the communication cost and cumulative regret that FedSupLinUCB achieves with a given threshold value at round $T = 40000$. We observe a clear trade-off between regret and communication. More importantly, Sync-FedSupLinUCB achieves a better trade-off than Async-FedSupLinUCB.
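For reference, the synthetic environment of this subsection can be reproduced with a few lines of NumPy. This is a sketch under the stated settings; the seed and the interpretation of the noise parameter as a standard deviation are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, d, K, sigma = 40_000, 20, 25, 20, 0.01     # settings from Section 8.1

theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                   # unknown parameter, ||theta||_2 = 1

def sample_contexts():
    """K contexts drawn uniformly from the l2 unit sphere."""
    X = rng.normal(size=(K, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def reward(x):
    """r = theta^T x + eps with Gaussian noise."""
    return theta @ x + rng.normal(scale=sigma)
```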
8.2 Experiment Results Using Real-world Dataset

We further investigate how efficiently the federated linear bandits algorithms perform in a more realistic and difficult environment, carrying out experiments on the real-world recommendation dataset MovieLens 20M (Harper and Konstan, 2015). Following the steps in Li and Wang (2022b), we first filter the data by keeping users with more than 2500 movie ratings and treating ratings greater than 3 as positive, ending up with $N = 37$ users and 121,934 total movie-rating interactions. We then follow the process described in Cesa-Bianchi et al. (2013) to generate the context sets, using TF-IDF features with $d = 25$ and arm sets of size $K = 20$. We plot the per-client normalized rewards of the FedSupLinUCB algorithms for different client numbers $M$ in the synchronous and asynchronous cases, where the per-client cumulative rewards are normalized by those of a random strategy. From Figures 2(a) and 2(b), we see that in both the synchronous and asynchronous experiments, FedSupLinUCB attains better rewards than SupLinUCB, and the advantage becomes more significant as the number of users increases.

Figure 2: Experimental results with the real-world MovieLens-20M dataset. (a) Async-FedSupLinUCB. (b) Sync-FedSupLinUCB.

9 Conclusion

We studied federated linear bandits with finite adversarial actions, a model that had not been investigated before. We proposed FedSupLinUCB, which extends the SupLinUCB and OFUL principles to the federated setting in both asynchronous and synchronous scenarios, and analyzed its regret and communication cost in each. The theoretical results prove that FedSupLinUCB approaches the minimax regret lower bound (up to polylog terms) while incurring only sublinear communication cost, suggesting that it is the finiteness of the action sets that fundamentally determines the regret behavior in the federated linear bandits setting. Furthermore, we extended the algorithm design to the variance-adaptive and adversarial-corruption scenarios.

Acknowledgments and Disclosure of Funding

The work of LF and CS was supported in part by the U.S. National Science Foundation (NSF) under grants 2143559, 2029978, and 2132700.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422.

Cesa-Bianchi, N., Gentile, C., and Zappella, G. (2013). A gang of bandits. Advances in Neural Information Processing Systems, 26.

Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214. JMLR Workshop and Conference Proceedings.

Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355-366.

Dubey, A. and Pentland, A. (2020). Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003-6014.

Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. (2020). Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321.

Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):1-19.

He, J., Wang, T., Min, Y., and Gu, Q. (2022a). A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. arXiv preprint arXiv:2207.03106.
He, J., Zhou, D., Zhang, T., and Gu, Q. (2022b). Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. Advances in Neural Information Processing Systems.

Huang, R., Wu, W., Yang, J., and Shen, C. (2021). Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057-27068.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Li, C. and Wang, H. (2022a). Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529-6553. PMLR.

Li, C. and Wang, H. (2022b). Communication efficient federated learning for generalized linear bandits. arXiv preprint arXiv:2202.01087.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670.

Li, T., Song, L., and Fragouli, C. (2020). Federated recommendation system via differential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2592-2597. IEEE.

Li, Y., Wang, Y., and Zhou, Y. (2019). Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173-2174. PMLR.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, pages 1273-1282, Fort Lauderdale, FL, USA.

Ruan, Y., Yang, J., and Zhou, Y. (2021). Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74-87.

Salgia, S. and Zhao, Q. (2023). Distributed linear bandits under communication constraints. In International Conference on Machine Learning, pages 29845-29875. PMLR.

Shi, C., Shen, C., and Yang, J. (2021). Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917-2925. PMLR.

Wang, Y., Hu, J., Chen, X., and Wang, L. (2019). Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations.

Zhou, D. and Gu, Q. (2022). Computationally efficient horizon-free reinforcement learning for linear mixture MDPs. arXiv preprint arXiv:2205.11507.

Zhou, D., Gu, Q., and Szepesvari, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532-4576. PMLR.