# FEDERATED NEURAL BANDITS

Published as a conference paper at ICLR 2023

Zhongxiang Dai, Yao Shu, Arun Verma, Flint Xiaofeng Fan & Bryan Kian Hsiang Low
Department of Computer Science, National University of Singapore
{daizhongxiang,shuyao,arun,xiaofeng,lowkh}@comp.nus.edu.sg

Patrick Jaillet
Department of Electrical Engineering and Computer Science, MIT
jaillet@mit.edu

ABSTRACT

Recent works on neural contextual bandits have achieved compelling performances due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. So, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: UCB^a allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while UCB^b uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes UCB^a initially for accelerated exploration and relies more on UCB^b later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.

1 INTRODUCTION

The stochastic multi-armed bandit is a prominent method for sequential decision-making problems due to its principled ability to handle the exploration-exploitation trade-off (Auer, 2002; Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020). In particular, the stochastic contextual bandit problem has received enormous attention due to its widespread real-world applications such as recommender systems (Li et al., 2010a), advertising (Li et al., 2010b), and healthcare (Greenewald et al., 2017). In each iteration of a stochastic contextual bandit problem, an agent receives a context (i.e., a d-dimensional feature vector) for each of the K arms, selects one of the K contexts/arms, and observes the corresponding reward. The goal of the agent is to sequentially pull the arms in order to maximize the cumulative reward (or equivalently, minimize the cumulative regret) in T iterations. To minimize the cumulative regret, linear contextual bandit algorithms assume that the rewards can be modeled as a linear function of the input contexts (Dani et al., 2008) and select the arms via classic methods such as upper confidence bound (UCB) (Auer, 2002) or Thompson sampling (TS) (Thompson, 1933), consequently yielding the Linear UCB (Abbasi-Yadkori et al., 2011) and Linear TS (Agrawal & Goyal, 2013) algorithms.
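As a point of reference for the algorithms discussed in the rest of the paper, the sketch below shows a minimal single-agent Linear UCB loop. It is only an illustration under assumed choices (the synthetic linear reward, the noise level, and the exploration constant `beta` are placeholders, not values prescribed by any of the cited works).

```python
import numpy as np

# Minimal single-agent Linear UCB sketch: in each iteration the agent receives
# K d-dimensional contexts, picks the one maximizing a UCB score, and observes
# a noisy linear reward. beta and the synthetic reward model are illustrative.
rng = np.random.default_rng(0)
d, K, T, lam, beta = 10, 4, 2000, 1.0, 1.0
theta_true = rng.standard_normal(d) / np.sqrt(d)   # unknown reward parameter

V, b = lam * np.eye(d), np.zeros(d)                # ridge-regression statistics
regret = 0.0
for t in range(T):
    contexts = rng.standard_normal((K, d))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
    theta_hat = np.linalg.solve(V, b)              # exploitation: reward estimate
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum('kd,de,ke->k', contexts, V_inv, contexts))
    ucb = contexts @ theta_hat + beta * widths     # exploration bonus
    k = int(np.argmax(ucb))
    reward = contexts[k] @ theta_true + 0.01 * rng.standard_normal()
    V += np.outer(contexts[k], contexts[k])        # update sufficient statistics
    b += reward * contexts[k]
    regret += (contexts @ theta_true).max() - contexts[k] @ theta_true
print(f"cumulative regret after {T} rounds: {regret:.2f}")
```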
The potentially restrictive assumption of a linear model was later relaxed by kernelized contextual bandit algorithms (Chowdhury & Gopalan, 2017; Valko et al., 2013), which assume that the reward function belongs to a reproducing kernel Hilbert space (RKHS) and hence model the reward function using kernel ridge regression or Gaussian process (GP) regression. However, this assumption may still be restrictive (Zhou et al., 2020) and the kernelized model may fall short when the reward function is very complex and difficult to model. To this end, neural networks (NNs), which excel at modeling complex real-world functions, have been adopted to model the reward function in contextual bandits, thereby leading to neural contextual bandit algorithms such as Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021). Due to their ability to use the highly expressive NNs for better reward prediction (i.e., exploitation), Neural UCB and Neural TS have been shown to outperform both linear and kernelized contextual bandit algorithms in practice. Moreover, the cumulative regrets of Neural UCB and Neural TS have been analyzed by leveraging the theory of the neural tangent kernel (NTK) (Jacot et al., 2018), hence making these algorithms both provably efficient and practically effective. We give a comprehensive review of the related works on neural bandits in App. A.

The contextual bandit algorithms discussed above are only applicable to problems with a single agent. However, many modern applications of contextual bandits involve multiple agents who (a) collaborate with each other for better performances and yet (b) are unwilling to share their raw observations (i.e., the contexts and rewards). For example, companies may collaborate to improve their contextual bandit-based recommendation algorithms without sharing their sensitive user data (Huang et al., 2021b), while hospitals deploying contextual bandits for personalized treatment may collaborate to improve their treatment strategies without sharing their sensitive patient information (Dai et al., 2020). These applications naturally fall under the setting of federated learning (FL) (Kairouz et al., 2019; Li et al., 2021), which facilitates collaborative learning of supervised learning models (e.g., NNs) without sharing the raw data. In this regard, a number of federated contextual bandit algorithms have been developed to allow bandit agents to collaborate in the federated setting (Shi & Shen, 2021). We present a thorough discussion of the related works on federated contextual bandits in App. A. Notably, Wang et al. (2020) have adopted the Linear UCB policy and developed a mechanism to allow every agent to additionally use the observations from the other agents to accelerate exploration, while only requiring the agents to exchange some sufficient statistics instead of their raw observations. However, these previous works have only relied on either linear (Dubey & Pentland, 2020; Huang et al., 2021b) or kernelized (Dai et al., 2020; 2021) methods which, as discussed above, may lack the expressive power to model complex real-world reward functions (Zhou et al., 2020). Therefore, this naturally brings up the need to use NNs for better exploitation (i.e., reward prediction) in federated contextual bandits, thereby motivating the need for a federated neural contextual bandit algorithm.
To develop a federated neural contextual bandit algorithm, an important technical challenge is how to leverage the federated setting to simultaneously (a) accelerate exploration by allowing every agent to additionally use the observations from the other agents without requiring the exchange of raw observations (in a similar way to that of Wang et al. (2020)), and (b) improve exploitation by further enhancing the quality of the NN for reward prediction through the federated setting (i.e., without requiring centralized training using the observations from all agents). In this work, we provide a theoretically grounded solution to tackle this challenge by deploying a weighted combination of two upper confidence bounds (UCBs). The first UCB, denoted by UCB^a, incorporates the neural tangent features (i.e., the random features embedding of the NTK) into the Linear UCB-based mechanism adopted by Wang et al. (2020), which achieves the first goal of accelerating exploration. The second UCB, denoted by UCB^b, adopts an aggregated NN whose parameters are the average of the parameters of the NNs trained by all agents using their local observations for better reward prediction (i.e., better exploitation in the second goal). Hence, UCB^b improves the quality of the NN for reward prediction in a similar way to the most classic FL method of federated averaging (FedAvg) for supervised learning (McMahan et al., 2017). Notably, our choice of the weight between the two UCBs, which naturally arises during our theoretical analysis, has an interesting practical interpretation (Sec. 3.3): More weight is given to UCB^a in earlier iterations, which allows us to use the observations from the other agents to accelerate the exploration in the early stage; more weight is assigned to UCB^b only in later iterations after every agent has collected enough local observations to train its NN for accurate reward prediction (i.e., reliable exploitation). Of note, our novel design of the weight (Sec. 3.3) is crucial for our theoretical analysis and may be of broader interest for future works on related topics.

This paper introduces the first federated neural contextual bandit algorithm, which we call federated neural-UCB (FN-UCB) (Sec. 3). We derive an upper bound on its total cumulative regret from all N agents: $R_T = \tilde{\mathcal{O}}(\tilde{d}\sqrt{TN} + \tilde{d}_{\max} N\sqrt{T})$ (where $\tilde{\mathcal{O}}$ ignores all logarithmic factors), in which $\tilde{d}$ is the effective dimension of the contexts from all N agents and $\tilde{d}_{\max}$ represents the maximum among the N individual effective dimensions of the contexts from the N agents (Sec. 2). The communication complexity (i.e., total number of communication rounds in T iterations) of FN-UCB can be upper-bounded by $\mathcal{C}_T = \tilde{\mathcal{O}}(\tilde{d}\sqrt{N})$. Finally, we use both synthetic and real-world contextual bandit experiments to explore the interesting insights about our FN-UCB and demonstrate its effective practical performance (Sec. 5).

2 BACKGROUND AND PROBLEM SETTING

Let [k] denote the set {1, 2, . . . , k} for a positive integer k, $\mathbf{0}_k$ represent a k-dimensional vector of 0's, and $\mathbf{0}_{k \times k}$ denote an all-zero matrix of dimension k × k. Our setting involves N agents with the same reward function h defined on a domain $\mathcal{X} \subset \mathbb{R}^d$. We consider centralized and synchronous communication: The communication is coordinated by a central server and every agent exchanges information with the central server during a communication round.
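To make the centralized, synchronous communication model concrete, the following minimal mock-up shows how N agents interact with a central server while keeping raw observations local. All names here (`Agent`, `CentralServer`, `sync_period`, the toy linear reward, etc.) are illustrative placeholders rather than identifiers or design choices from the paper; FN-UCB's actual selection rule and synchronization criterion are given in Algos. 1 and 2 below.

```python
import numpy as np

# Mock-up of the centralized, synchronous federated bandit protocol: each agent
# picks one of K contexts per iteration, keeps raw observations local, and
# periodically exchanges aggregated statistics with a central server.
rng = np.random.default_rng(0)
d, K, N, T, sync_period = 10, 4, 3, 50, 10
true_param = rng.standard_normal(d)          # stand-in reward function h(x) = <a, x>

class Agent:
    def __init__(self):
        self.stats = np.zeros(d)             # toy "sufficient statistic" kept locally

    def select_arm(self, contexts):
        return rng.integers(K)               # placeholder; FN-UCB uses the two UCBs (Sec. 3)

    def local_update(self, x, y):
        self.stats += y * x                  # never the raw (x, y) pairs themselves

class CentralServer:
    def aggregate(self, packages):
        return np.mean(packages, axis=0)     # e.g., average/sum of agents' statistics

agents, server = [Agent() for _ in range(N)], CentralServer()
for t in range(T):
    for agent in agents:
        contexts = rng.standard_normal((K, d))
        x = contexts[agent.select_arm(contexts)]
        y = true_param @ x + 0.01 * rng.standard_normal()
        agent.local_update(x, y)
    if (t + 1) % sync_period == 0:           # a communication round (here: fixed schedule)
        broadcast = server.aggregate([a.stats for a in agents])
        for agent in agents:
            agent.stats = broadcast.copy()   # agents continue from the aggregated statistics
```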
In each iteration $t \in [T]$, every agent $i \in [N]$ receives a set $\mathcal{X}_{t,i} = \{x^k_{t,i}\}_{k\in[K]}$ of K context vectors and selects one of them $x_{t,i} \in \mathcal{X}_{t,i}$ to be queried, to observe a noisy output $y_{t,i} = h(x_{t,i}) + \epsilon$ where $\epsilon$ is an R-sub-Gaussian noise. We will analyze the total cumulative regret from all N agents in T iterations: $R_T \triangleq \sum_{i=1}^{N}\sum_{t=1}^{T} r_{t,i}$ where $r_{t,i} \triangleq h(x^*_{t,i}) - h(x_{t,i})$ and $x^*_{t,i} \triangleq \arg\max_{x\in\mathcal{X}_{t,i}} h(x)$.

Let $f(x; \theta)$ denote the output of a fully connected NN for input x with parameters $\theta$ (of dimension $p_0$) and $g(x; \theta)$ denote the corresponding (column) gradient vector. We focus on NNs with ReLU activations, and use $L \geq 2$ and m to denote its depth and width (of every layer), respectively. We follow the initialization technique from Zhang et al. (2021); Zhou et al. (2020) to initialize the NN parameters $\theta_0 \sim \text{init}(\cdot)$. Of note, as a common ground for collaboration, we let all N agents share the same initial parameters $\theta_0$ when training their NNs and computing their neural tangent features $g(x; \theta_0)/\sqrt{m}$ (i.e., the random features embedding of the NTK (Zhang et al., 2021)). Also, let H denote the $(TKN) \times (TKN)$-dimensional NTK matrix on the set of all received TKN contexts (Zhang et al., 2021; Zhou et al., 2020). Similarly, let $H_i$ denote the $(TK) \times (TK)$-dimensional NTK matrix on the set of TK contexts received by agent i. We defer the details on the definitions of H and the $H_i$'s, the NN $f(x; \theta)$, and the initialization scheme $\theta_0 \sim \text{init}(\cdot)$ to App. B due to limited space.

Next, let $\mathbf{h} \triangleq [h(x^k_{t,i})]_{t\in[T],i\in[N],k\in[K]}$ denote the (TKN)-dimensional column vector of reward function values at all received contexts and B be an absolute constant s.t. $\sqrt{2\,\mathbf{h}^\top H^{-1}\mathbf{h}} \leq B$. This is related to the commonly adopted assumption in kernelized bandits that h lies in the RKHS $\mathcal{H}$ induced by the NTK (Chowdhury & Gopalan, 2017; Srinivas et al., 2010) (or, equivalently, that the RKHS norm $\|h\|_{\mathcal{H}}$ of h is upper-bounded by a constant) because $\sqrt{\mathbf{h}^\top H^{-1}\mathbf{h}} \leq \|h\|_{\mathcal{H}}$ (Zhou et al., 2020). Following the works of Zhang et al. (2021); Zhou et al. (2020), we define the effective dimension of H as $\tilde{d} \triangleq \frac{\log\det(I + H/\lambda)}{\log(1 + TKN/\lambda)}$ with regularization parameter $\lambda > 0$. Similarly, we define the effective dimension for agent i as $\tilde{d}_i \triangleq \frac{\log\det(I + H_i/\lambda)}{\log(1 + TK/\lambda)}$ and also define $\tilde{d}_{\max} \triangleq \max_{i\in[N]} \tilde{d}_i$. Note that the effective dimension is related to the maximum information gain $\gamma$, which is a commonly adopted notion in kernelized bandits (Zhang et al., 2021): $\tilde{d} \leq 2\gamma_{TKN}/\log(1 + TKN/\lambda)$ and $\tilde{d}_i \leq 2\gamma_{TK}/\log(1 + TK/\lambda), \forall i \in [N]$.

Consistent with the works on neural contextual bandits (Zhang et al., 2021; Zhou et al., 2020), our only assumption on the reward function h is its boundedness: $|h(x)| \leq 1, \forall x \in \mathcal{X}$. We also make the following assumptions for our theoretical analysis, all of which are mild and easily achievable, as discussed in Zhang et al. (2021); Zhou et al. (2020):

Assumption 1. There exists $\lambda_0 > 0$ s.t. $H \succeq \lambda_0 I$ and $H_i \succeq \lambda_0 I, \forall i \in [N]$. Also, all contexts satisfy $\|x\|_2 = 1$ and $[x]_j = [x]_{j+d/2}, \forall x \in \mathcal{X}_{t,i}, t \in [T], i \in [N]$.
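The neural tangent features $g(x;\theta_0)/\sqrt{m}$ used throughout the paper are simply the gradient of the network output at the shared initialization, rescaled by $\sqrt{m}$. The sketch below (assuming PyTorch; the tiny network architecture and helper names are illustrative choices, not the authors' code) shows one way to compute them.

```python
import torch
import torch.nn as nn

# Minimal sketch: compute neural tangent features g(x; theta_0)/sqrt(m) for a
# small fully connected ReLU network at a shared random initialization theta_0.
torch.manual_seed(0)
d, m = 10, 20  # input dimension and layer width (illustrative values)

class SmallNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(d, m)
        self.out = nn.Linear(m, 1)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

net = SmallNN()  # its freshly initialized parameters play the role of theta_0

def neural_tangent_features(x):
    """Return g(x; theta_0)/sqrt(m) as a flat vector of length p0."""
    net.zero_grad()
    net(x).sum().backward()           # fills p.grad with d f(x; theta_0) / d theta
    grads = [p.grad.reshape(-1) for p in net.parameters()]
    return torch.cat(grads) / (m ** 0.5)

x = torch.randn(d)
x = x / x.norm()                      # contexts are assumed to have unit norm
phi = neural_tangent_features(x)
print(phi.shape)                      # p0-dimensional feature vector
```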
3 FEDERATED NEURAL-UPPER CONFIDENCE BOUND (FN-UCB)

Our FN-UCB algorithm is described in Algo. 1 (the agents' part) and Algo. 2 (the central server's part).

Algorithm 1 FN-UCB (Agent i)
1: inputs: $\lambda = 1 + 2/T$, $\theta_0 \sim \text{init}(\cdot)$, $W_{\text{sync}} = \mathbf{0}_{p_0 \times p_0}$, $W_{\text{new},i} = \mathbf{0}_{p_0 \times p_0}$, $B_{\text{sync}} = \mathbf{0}_{p_0}$, $B_{\text{new},i} = \mathbf{0}_{p_0}$, $\alpha = 0$, $V^{\text{local}}_{t,i} = \lambda I$, $V^{-1}_{\text{sync,NN}} = \lambda^{-1} I$, $V_{\text{last}} = \lambda I$, $t_{\text{last}} = 0$, $\theta_{\text{sync,NN}} = \theta_0$
2: for $t = 1, 2, \ldots, T$ do
3:   Receive a set $\mathcal{X}_{t,i} = \{x^k_{t,i}\}_{k\in[K]}$ of K contexts
4:   Compute $V_{t,i} = \lambda I + W_{\text{sync}} + W_{\text{new},i}$ and $\theta_{t,i} = V^{-1}_{t,i}(B_{\text{sync}} + B_{\text{new},i})$
5:   Compute $\text{UCB}^a_{t,i}(x) \triangleq \langle g(x;\theta_0)/\sqrt{m},\, \theta_{t,i}\rangle + \nu_{TKN}\sqrt{\lambda}\,\|g(x;\theta_0)/\sqrt{m}\|_{V^{-1}_{t,i}}$
6:   If $\alpha \neq 0$, compute $\text{UCB}^b_{t,i}(x) \triangleq f(x;\theta_{\text{sync,NN}}) + \nu_{TK}\sqrt{\lambda}\, N^{-1}\sum_{j=1}^{N}\|g(x;\theta_0)/\sqrt{m}\|_{(V^{\text{local}}_{j})^{-1}}$
7:   Select $x_{t,i} = \arg\max_{x\in\mathcal{X}_{t,i}} (1-\alpha)\,\text{UCB}^a_{t,i}(x) + \alpha\,\text{UCB}^b_{t,i}(x)$
8:   Query $x_{t,i}$ to observe $y_{t,i}$
9:   Update $W_{\text{new},i} \leftarrow W_{\text{new},i} + g(x_{t,i};\theta_0)\,g(x_{t,i};\theta_0)^\top/m$ and $B_{\text{new},i} \leftarrow B_{\text{new},i} + y_{t,i}\, g(x_{t,i};\theta_0)/\sqrt{m}$
10:  Update $V^{\text{local}}_{t,i} = V^{\text{local}}_{t-1,i} + g(x_{t,i};\theta_0)\,g(x_{t,i};\theta_0)^\top/m$
11:  if $(t - t_{\text{last}})\,\log\big(\det(\lambda I + W_{\text{sync}} + W_{\text{new},i})/\det(V_{\text{last}})\big) > D$ then
12:    Send a synchronization signal to the central server to start a communication round
13:  if a communication round is started then
14:    Train an NN with gradient descent using all of agent i's local observations $\mathcal{D}_{t,i} = \{(x_{\tau,i}, y_{\tau,i})\}_{\tau\in[t]}$ based on initial parameters $\theta_0$, learning rate $\eta$, number $J$ of iterations, and Equation 1 as the loss function, to obtain $\theta^i_t$
15:    Compute $\alpha_{t,i} = \tilde{\sigma}^{\text{local}}_{t,i,\min}/\tilde{\sigma}^{\text{local}}_{t,i,\max}$ (Sec. 3.3)
16:    Send $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}$ to the central server
17:    Receive $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_{i})^{-1}\}_{i\in[N]}\}$ from the central server
18:    Set $V_{\text{last}} = W_{\text{sync}} + \lambda I$, $t_{\text{last}} = t$, $W_{\text{new},i} = \mathbf{0}_{p_0\times p_0}$, $B_{\text{new},i} = \mathbf{0}_{p_0}$

Algorithm 2 Central Server
1: if a synchronization signal is received from any agent then
2:   Send a signal to all agents to start a communication round
3:   Receive $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}_{i\in[N]}$
4:   Compute $\theta_{\text{sync,NN}} = N^{-1}\sum_{i=1}^{N}\theta^i_t$ and $\alpha = \min_{i\in[N]}\alpha_{t,i}$; let $(V^{\text{local}}_{i})^{-1} = (V^{\text{local}}_{t,i})^{-1}, \forall i\in[N]$
5:   Update $W_{\text{sync}} \leftarrow W_{\text{sync}} + \sum_{i=1}^{N} W_{\text{new},i}$ and $B_{\text{sync}} \leftarrow B_{\text{sync}} + \sum_{i=1}^{N} B_{\text{new},i}$
6:   Broadcast $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_{i})^{-1}\}_{i\in[N]}\}$ to all agents

3.1 OVERVIEW OF FN-UCB ALGORITHM

Before the beginning of the algorithm, we sample the initial parameters $\theta_0$ and share them with all agents (Sec. 2). In each iteration $t\in[T]$, every agent $i\in[N]$ receives a set $\mathcal{X}_{t,i} = \{x^k_{t,i}\}_{k\in[K]}$ of K contexts (line 3 of Algo. 1) and then uses a weighted combination of $\text{UCB}^a_{t,i}$ and $\text{UCB}^b_{t,i}$ to select a context $x_{t,i}\in\mathcal{X}_{t,i}$ to be queried (lines 4-7 of Algo. 1). Next, each agent i observes a noisy output $y_{t,i}$ (line 8 of Algo. 1) and then updates its local information (lines 9-10 of Algo. 1). After that, every agent checks if it has collected enough information since the last communication round (i.e., checks the criterion in line 11 of Algo. 1); if so, it sends a synchronization signal to the central server (line 12 of Algo. 1), which then tells all agents to start a communication round (line 2 of Algo. 2). During a communication round, every agent i uses its current history of local observations to train an NN (line 14 of Algo. 1) and sends its updated local information to the central server (line 16 of Algo. 1); the central server then aggregates this information from all agents (lines 4-5 of Algo. 2) and broadcasts the aggregated information back to all agents (line 6 of Algo. 2) to start the next iteration. We refer to those iterations between two communication rounds as an epoch (the first (last) epoch is between a communication round and the beginning (end) of the FN-UCB algorithm). So, our FN-UCB algorithm consists of a number of epochs which are separated by communication rounds.

Note that every agent i only needs to train an NN in every communication round, i.e., only after the change in the log determinant of the covariance matrix of any agent exceeds a threshold D (line 11 of Algo. 1). This has the additional benefit of reducing the computational cost due to the training of NNs. Interestingly, this is in a similar spirit as the adaptive batch size scheme of Gu et al. (2021), which only retrains the NN in Neural UCB after the change in the determinant of the covariance matrix exceeds a threshold and is shown to only slightly degrade the performance of Neural UCB.
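As a concrete illustration of the selection step (lines 4-7 of Algo. 1), the sketch below computes the two UCBs from the synchronized and newly collected local statistics and picks the arm maximizing their weighted combination. It is a simplified stand-in rather than the authors' implementation: it assumes precomputed neural tangent features and an aggregated-NN prediction are supplied as inputs, and the variable names only mirror the paper's notation.

```python
import numpy as np

def select_arm(features, f_sync, W_sync, W_new, B_sync, B_new,
               V_local_inv_list, alpha, nu_TKN, nu_TK, lam=1.0):
    """One FN-UCB selection step (sketch of lines 4-7 of Algo. 1).

    features:         (K, p0) array, row k = g(x_k; theta_0)/sqrt(m)
    f_sync:           (K,) array, f(x_k; theta_sync_NN) from the aggregated NN
    W_sync, W_new:    (p0, p0) synchronized / new local covariance statistics
    B_sync, B_new:    (p0,) synchronized / new local reward-weighted features
    V_local_inv_list: list of N (p0, p0) inverses of the agents' local matrices
    """
    p0 = features.shape[1]
    V = lam * np.eye(p0) + W_sync + W_new                          # line 4
    theta = np.linalg.solve(V, B_sync + B_new)                     # line 4
    V_inv = np.linalg.inv(V)
    ucb_a = features @ theta + nu_TKN * np.sqrt(lam) * np.sqrt(
        np.einsum('kp,pq,kq->k', features, V_inv, features))       # line 5

    if alpha != 0:                                                  # line 6
        widths = np.mean([np.sqrt(np.einsum('kp,pq,kq->k', features, Vi, features))
                          for Vi in V_local_inv_list], axis=0)
        ucb_b = f_sync + nu_TK * np.sqrt(lam) * widths
    else:
        ucb_b = np.zeros_like(ucb_a)

    return int(np.argmax((1 - alpha) * ucb_a + alpha * ucb_b))     # line 7

# Toy usage with random statistics (purely illustrative values):
K, p0, N = 4, 8, 3
rng = np.random.default_rng(0)
feats = rng.standard_normal((K, p0)) / np.sqrt(p0)
k = select_arm(feats, rng.standard_normal(K), np.eye(p0), np.zeros((p0, p0)),
               rng.standard_normal(p0), np.zeros(p0),
               [np.eye(p0)] * N, alpha=0.5, nu_TKN=0.1, nu_TK=0.1)
```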
3.2 THE TWO UPPER CONFIDENCE BOUNDS (UCBS)

Firstly, $\text{UCB}^a_{t,i}$ can be interpreted as the Linear UCB policy (Abbasi-Yadkori et al., 2011) using the neural tangent features $g(x;\theta_0)/\sqrt{m}$ as the input features. In iteration t, let p denote the index of the current epoch. Then, computing $\text{UCB}^a_{t,i}$ (line 5 of Algo. 1) makes use of two types of information. The first type of information, which uses the observations from all N agents before epoch p, enters $\text{UCB}^a_{t,i}$ via $W_{\text{sync}}$ and $B_{\text{sync}}$ (line 4 of Algo. 1). Specifically, as can be seen from line 5 of Algo. 2, $W_{\text{sync}}$ and $B_{\text{sync}}$ are computed by the central server by summing up the $W_{\text{new},i}$'s and $B_{\text{new},i}$'s from all agents (i.e., by aggregating the information from all agents), where $W_{\text{new},i}$ and $B_{\text{new},i}$ are computed using the local observations of agent i (line 9 of Algo. 1). The second type of information used by $\text{UCB}^a_{t,i}$ (via $W_{\text{new},i}$ and $B_{\text{new},i}$ utilized in line 4 of Algo. 1) exploits the newly collected local observations of agent i in epoch p. As a result, $\text{UCB}^a_{t,i}$ allows us to utilize the observations from all agents via the federated setting for accelerated exploration without requiring the agents to share their raw observations. $\text{UCB}^a_{t,i}$ is computed with the parameter $\nu_{TKN} \triangleq B + R\,[2(\log(3/\delta) + 1) + \tilde{d}\log(1 + TKN/\lambda)]^{1/2}$ where $\delta\in(0,1)$.

Secondly, $\text{UCB}^b_{t,i}$ leverages the federated setting to improve the quality of the NN for reward prediction (to achieve better exploitation) in a similar way to FedAvg, i.e., by averaging the parameters of the NNs trained by all agents using their local observations (McMahan et al., 2017). Specifically, when a communication round is started, every agent $i\in[N]$ uses its local observations $\mathcal{D}_{t,i} \triangleq \{(x_{\tau,i}, y_{\tau,i})\}_{\tau\in[t]}$ to train an NN (line 14 of Algo. 1). It uses the initial parameters $\theta_0$ (i.e., shared among all agents (Sec. 2)) and trains the NN using gradient descent with learning rate $\eta$ for J training iterations (see Theorem 1 for the choices of $\eta$ and J) to minimize the following loss function:

$\mathcal{L}_{t,i}(\theta) \triangleq 0.5\,\textstyle\sum_{\tau=1}^{t}\big(f(x_{\tau,i};\theta) - y_{\tau,i}\big)^2 + 0.5\, m\lambda\, \|\theta - \theta_0\|_2^2 .$   (1)

The resulting NN parameters $\theta^i_t$'s from all N agents are sent to the central server (line 16 of Algo. 1), which averages them (line 4 of Algo. 2) and broadcasts the aggregated $\theta_{\text{sync,NN}} \triangleq N^{-1}\sum_{i=1}^{N}\theta^i_t$ back to all agents to be used in the next epoch. In addition, to compute the second term of $\text{UCB}^b_{t,i}$, every agent needs to compute the matrix $V^{\text{local}}_{t,i}$ using its local inputs (line 10 of Algo. 1) and send its inverse to the central server (line 16 of Algo. 1) during a communication round; after that, the central server broadcasts these matrices $\{(V^{\text{local}}_{i})^{-1}\}_{i\in[N]}$ received from all agents back to all agents to be used in the second term of $\text{UCB}^b_{t,i}$ (line 6 of Algo. 1). Refer to Sec. 4.2 for a detailed explanation of the validity of $\text{UCB}^b_{t,i}$ as a high-probability upper bound on h (up to additive error terms). $\text{UCB}^b_{t,i}$ is computed with the parameter $\nu_{TK} \triangleq B + R\,[2(\log(3N/\delta) + 1) + \tilde{d}_{\max}\log(1 + TK/\lambda)]^{1/2}$.
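The local training step (line 14 of Algo. 1) minimizes the regularized squared loss in Equation 1, starting from the shared initialization $\theta_0$, and the server then averages the returned parameters elementwise. Below is a minimal PyTorch sketch of both steps under assumed hyperparameters (the learning rate and number of steps are illustrative, not the values prescribed by Theorem 1), with the width m passed in explicitly.

```python
import copy
import torch

def train_local_nn(net_init, X, y, m, lam=1.0, lr=1e-3, num_steps=200):
    """Sketch of line 14 of Algo. 1: gradient descent on Equation (1).

    net_init: the shared initial network (parameters theta_0).
    X, y:     agent i's local observations so far, shapes (t, d) and (t,).
    Returns a trained copy of the network (parameters theta_t^i).
    """
    net = copy.deepcopy(net_init)                       # start from theta_0
    theta0 = [p.detach().clone() for p in net_init.parameters()]
    opt = torch.optim.SGD(net.parameters(), lr=lr)      # plain gradient descent

    for _ in range(num_steps):
        opt.zero_grad()
        pred = net(X).squeeze(-1)
        sq_loss = 0.5 * ((pred - y) ** 2).sum()
        reg = 0.5 * m * lam * sum(((p - p0) ** 2).sum()
                                  for p, p0 in zip(net.parameters(), theta0))
        (sq_loss + reg).backward()                      # Equation (1)
        opt.step()
    return net

def average_parameters(nets):
    """Server-side aggregation (line 4 of Algo. 2): elementwise parameter average."""
    avg = copy.deepcopy(nets[0])
    with torch.no_grad():
        for params in zip(avg.parameters(), *[n.parameters() for n in nets]):
            params[0].copy_(torch.stack(list(params[1:])).mean(dim=0))
    return avg
```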
3.3 WEIGHT BETWEEN THE TWO UCBS

Our choice of the weight $\alpha$ between the two UCBs, which naturally arises during our theoretical analysis (Sec. 4), has an interesting interpretation in terms of the relative strengths of the two UCBs and the exploration-exploitation trade-off. Specifically, $\tilde{\sigma}^{\text{local}}_{t,i}(x) \triangleq \sqrt{\lambda}\,\|g(x;\theta_0)/\sqrt{m}\|_{(V^{\text{local}}_{t,i})^{-1}}$ intuitively represents our uncertainty about the reward at x after conditioning on the local observations of agent i up to iteration t (Kassraie & Krause, 2022). (Formally, $\tilde{\sigma}^{\text{local}}_{t,i}(x)$ is the Gaussian process posterior standard deviation at x conditioned on the local observations of agent i till iteration t, computed using the kernel $\tilde{k}(x, x') = g(x;\theta_0)^\top g(x';\theta_0)/m$.) Next, $\tilde{\sigma}^{\text{local}}_{t,i,\min} \triangleq \min_{x\in\mathcal{X}} \tilde{\sigma}^{\text{local}}_{t,i}(x)$ and $\tilde{\sigma}^{\text{local}}_{t,i,\max} \triangleq \max_{x\in\mathcal{X}} \tilde{\sigma}^{\text{local}}_{t,i}(x)$ represent our smallest and largest uncertainties across the entire domain, respectively. Then, we choose $\alpha \triangleq \min_{i\in[N]} \alpha_{t,i}$ (line 4 of Algo. 2) where $\alpha_{t,i} \triangleq \tilde{\sigma}^{\text{local}}_{t,i,\min}/\tilde{\sigma}^{\text{local}}_{t,i,\max}$ (line 15 of Algo. 1). In other words, $\alpha_{t,i}$ is the ratio between the smallest and largest uncertainty across the entire domain for agent i, and $\alpha$ is the smallest such ratio $\alpha_{t,i}$ among all agents.

Therefore, $\alpha$ is expected to be generally increasing with the number of iterations/epochs: $\tilde{\sigma}^{\text{local}}_{t,i,\min}$ is already small after the first few iterations since the uncertainty at the queried contexts is very small; on the other hand, $\tilde{\sigma}^{\text{local}}_{t,i,\max}$ is expected to be very large in early iterations and become smaller in later iterations only after a large number of contexts has been queried to sufficiently reduce the overall uncertainty in the entire domain. This implies that we give more weight to $\text{UCB}^a_{t,i}$ in earlier iterations and assign more weight to $\text{UCB}^b_{t,i}$ in later iterations. This, interestingly, turns out to have an intriguing practical interpretation: Relying more on $\text{UCB}^a_{t,i}$ in earlier iterations is reasonable because $\text{UCB}^a_{t,i}$ is able to utilize the observations from all agents to accelerate exploration in the early stage (Sec. 3.2); it is also sensible to give more emphasis to $\text{UCB}^b_{t,i}$ only in later iterations because the NN trained by every agent is only able to accurately model the reward function (for reliable exploitation) after it has collected enough observations to train its NN. In our practical implementation (Sec. 5), we will use the analysis here as an inspiration to design an increasing sequence of $\alpha$.
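The weight $\alpha_{t,i}$ can be computed directly from the NTK-based posterior standard deviation over a finite candidate set. The sketch below assumes the empirical kernel $\tilde{k}(x, x') = g(x;\theta_0)^\top g(x';\theta_0)/m$ and a finite discretization of the domain; it is only an illustration of the ratio $\tilde{\sigma}_{\min}/\tilde{\sigma}_{\max}$, not the authors' code.

```python
import numpy as np

def alpha_weight(local_feats, domain_feats, lam=1.0):
    """Sketch of line 15 of Algo. 1: alpha_{t,i} = sigma_min / sigma_max.

    local_feats:  (t, p0) neural tangent features of agent i's queried contexts
    domain_feats: (n, p0) features of a finite set discretizing the domain X
    """
    p0 = local_feats.shape[1]
    V = lam * np.eye(p0) + local_feats.T @ local_feats        # V^local_{t,i}
    V_inv = np.linalg.inv(V)
    # sigma(x) = sqrt(lam) * || phi(x) ||_{V^{-1}} for every candidate x
    sigmas = np.sqrt(lam * np.einsum('np,pq,nq->n', domain_feats, V_inv, domain_feats))
    return sigmas.min() / sigmas.max()

# Toy usage: the ratio (and hence alpha) grows as the agent queries more of the domain.
rng = np.random.default_rng(0)
domain = rng.standard_normal((200, 16)) / 4.0
for t in (5, 50, 500):
    queried = domain[rng.integers(0, 200, size=t)]
    print(t, round(alpha_weight(queried, domain), 3))
```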
3.4 COMMUNICATION COST

To achieve better communication efficiency, we propose here a variant of our main FN-UCB algorithm called FN-UCB (Less Comm.), which differs from FN-UCB (Algos. 1 and 2) in two aspects. Firstly, the central server averages the matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i\in[N]}$ received from all agents to produce a single matrix $V^{-1}_{\text{sync,NN}} = N^{-1}\sum_{i=1}^{N}(V^{\text{local}}_{t,i})^{-1}$ and hence only broadcasts the single matrix $V^{-1}_{\text{sync,NN}}$ instead of all N received matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i\in[N]}$ to all agents (see line 6 of Algo. 2). Secondly, the $\text{UCB}^b_{t,i}$ of every agent i (line 6 of Algo. 1) is modified to use the matrix $V^{-1}_{\text{sync,NN}}$: $\text{UCB}^b_{t,i}(x) \triangleq f(x;\theta_{\text{sync,NN}}) + \nu_{TK}\sqrt{\lambda}\,\|g(x;\theta_0)/\sqrt{m}\|_{V^{-1}_{\text{sync,NN}}}$.

To further reduce the communication cost of both FN-UCB and FN-UCB (Less Comm.), especially when the NN is large (i.e., its total number $p_0$ of parameters is large), we can follow the practice of previous works (Zhang et al., 2021; Zhou et al., 2020) to diagonalize the $p_0 \times p_0$ matrices, i.e., by only keeping the diagonal elements of the matrices. Specifically, we can diagonalize $W_{\text{new},i}$ (line 9 of Algo. 1) and $V^{\text{local}}_{t,i}$ (line 10 of Algo. 1), and let the central server aggregate only the diagonal elements of the corresponding matrices to obtain $W_{\text{sync}}$ and $V^{-1}_{\text{sync,NN}}$. This reduces both the communication and computational costs.

As a result, during a communication round, the parameters that an agent sends to the central server include $\{W_{\text{new},i}, B_{\text{new},i}, \theta^i_t, \alpha_{t,i}, (V^{\text{local}}_{t,i})^{-1}\}$ (line 16 of Algo. 1), which constitute $p_0 + p_0 + p_0 + 1 + p_0 = \mathcal{O}(p_0)$ parameters and are the same for FN-UCB and FN-UCB (Less Comm.). The parameters that the central server broadcasts to the agents include $\{W_{\text{sync}}, B_{\text{sync}}, \theta_{\text{sync,NN}}, \alpha, \{(V^{\text{local}}_{t,i})^{-1}\}_{i\in[N]}\}$ for FN-UCB (line 6 of Algo. 2), which amount to $p_0 + p_0 + p_0 + 1 + N p_0 = \mathcal{O}(N p_0)$ parameters. Meanwhile, FN-UCB (Less Comm.) only needs to broadcast $\mathcal{O}(p_0)$ parameters because the N matrices $\{(V^{\text{local}}_{t,i})^{-1}\}_{i\in[N]}$ are now replaced by the single matrix $V^{-1}_{\text{sync,NN}}$. Therefore, the total number of exchanged parameters by FN-UCB (Less Comm.) is $\mathcal{O}(p_0)$, which is comparable to the number of exchanged parameters in standard FL for supervised learning (e.g., FedAvg) where the parameters (or gradients) of the NN are exchanged (McMahan et al., 2017). We will also analyze the total number of required communication rounds of FN-UCB, as well as of FN-UCB (Less Comm.), in Sec. 4.1.

As we will discuss in Sec. 4.1, the variant FN-UCB (Less Comm.) has a looser regret upper bound than our main FN-UCB algorithm (Algos. 1 and 2). However, in practice, FN-UCB (Less Comm.) is recommended over FN-UCB because it achieves a very similar empirical performance as FN-UCB (which we have verified in Sec. 5.1) and yet incurs less communication cost.

4 THEORETICAL ANALYSIS

4.1 THEORETICAL RESULTS

Regret Upper Bound. For simplicity, we analyze the regret of a simpler version of our algorithm where we only choose the weight $\alpha$ using the method described in Sec. 3.3 in the first iteration after every communication round (i.e., the first iteration of every epoch) and set $\alpha = 0$ in all other iterations. Note that when communication occurs after each iteration (i.e., when D is sufficiently small), this version coincides with our original FN-UCB described in Algos. 1 and 2 (Sec. 3). The regret upper bound of FN-UCB is given by the following result (proof in Appendix C):

Theorem 1. Let $\delta\in(0,1)$, $\lambda = 1 + 2/T$, and $D = \mathcal{O}(T/(N\tilde{d}))$. Suppose that the NN width m grows polynomially: $m \geq \text{poly}(\lambda, T, K, N, L, \log(1/\delta), 1/\lambda_0)$. For the gradient descent training (line 14 of Algo. 1), let $\eta = C_4(m\lambda + mTL)^{-1}$ for some constant $C_4 > 0$ and $J = \tilde{\mathcal{O}}(TL/(\lambda C_4))$. Then, with probability of at least $1 - \delta$, $R_T = \tilde{\mathcal{O}}\big(\tilde{d}\sqrt{TN} + \tilde{d}_{\max}\, N\sqrt{T}\big)$.

Refer to Appendix C.1 for the detailed conditions on the NN width m as well as the learning rate $\eta$ and number J of iterations for the gradient descent training (line 14 of Algo. 1). Intuitively, the effective dimension $\tilde{d}$ measures the actual underlying dimension of the set of all TKN contexts for all agents (Zhang et al., 2021), and $\tilde{d}_{\max} \triangleq \max_{i\in[N]}\tilde{d}_i$ is the maximum among the underlying dimensions of the set of TK contexts for each of the N agents. Zhang et al. (2021) showed that if all contexts lie in a $d^*$-dimensional subspace of the RKHS induced by the NTK, then the effective dimension of these contexts can be upper-bounded by the constant $d^*$. The first term $\tilde{d}\sqrt{TN}$ in the regret upper bound (Theorem 1) arises due to $\text{UCB}^a_{t,i}$ and reflects the benefit of the federated setting.
In particular, this term matches the regret upper bound of standard Neural UCB (Zhou et al., 2020) running for TN iterations and so, the average regret ed p Published as a conference paper at ICLR 2023 across all agents decreases with a larger number N of agents. The second term edmax N T results from UCBb t,i which involves two major components of our algorithm: the use of NNs for reward prediction and the aggregation of the NN parameters. Although not reflecting the benefit of a larger N in the regret bound, both components are important to our algorithm. Firstly, the use of NNs for reward prediction is a crucial component in neural contextual bandits in order to exploit the strong representation power of NNs. This is similar in spirit to the works on neural contextual bandits (Zhang et al., 2021; Zhou et al., 2020) in which the use of NNs for reward prediction does not improve the regret upper bound (compared with using the linear prediction given by the first term of UCBa t,i) and yet significantly improves the practical performance. Secondly, the aggregation of the NN parameters is also important for the performance of our FN-UCB since it allows us to exploit the federated setting in a similar way to FL for supervised learning which has been repeatedly shown to improve the performance (Kairouz et al., 2019). We have also empirically verified (Sec. 5.1) that both components (i.e., the use of NNs for reward prediction and the aggregation of NN parameters) are important to the practical performance of our algorithm. The work of Huang et al. (2021a) has leveraged the NTK to analyze the convergence of Fed Avg for supervised learning (Mc Mahan et al., 2017) which also averages the NN parameters in a similar way to our algorithm. Note that their convergence results also do not improve with a larger number N of agents but in fact become worse with a larger N. Of note, in the single-agent setting where N = 1, we have that ed = edmax (Sec. 2). Therefore, our regret upper bound from Theorem 1 reduces to RT = e O(ed T), which, interestingly, matches the regret upper bounds of standard neural bandit algorithms including Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021). We also prove (App. C.7) that FN-UCB (Less Comm.), which is a variant of our FN-UCB with a better communication efficiency (Sec. 3.4), enjoys a regret upper bound of RT = e O(ed TN + edmax N TN), whose second term is worse than that of FN-UCB (Theorem 1) by a factor of N. In addition, we have also analyzed our general algorithm which does not set α = 0 in any iteration (results and analysis in Appendix F), which requires an additional assumption and only introduces an additional multiplicative constant to the regret bound. Communication Complexity. The following result (proof in App. D) gives a theoretical guarantee on the communication complexity of FN-UCB, including its variant FN-UCB (Less Comm.): Theorem 2. With the same parameters as Theorem 1, if the NN width m satisfies m poly(T, K, N, L, log(1/δ)), then with probability of at least 1 δ, the total number of communication rounds for FN-UCB satisfies CT = e O(ed The specific condition on m required by Theorem 2 corresponds to condition 1 listed in App. C.1 (see App. D for details) which is a subset of the conditions required by Theorem 1. 
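For concreteness, the single-agent reduction mentioned above can be written out explicitly. This is only a worked restatement of Theorem 1's bound under N = 1, not an additional result:

```latex
% With a single agent, the two effective dimensions coincide, so Theorem 1 collapses
% to the standard neural-bandit rate of Neural UCB and Neural TS.
R_T = \tilde{\mathcal{O}}\big(\tilde{d}\sqrt{TN} + \tilde{d}_{\max}\, N\sqrt{T}\big)
\;\xrightarrow{\;N = 1,\ \tilde{d} = \tilde{d}_{\max}\;}\;
R_T = \tilde{\mathcal{O}}\big(\tilde{d}\sqrt{T} + \tilde{d}\sqrt{T}\big)
    = \tilde{\mathcal{O}}\big(\tilde{d}\sqrt{T}\big).
```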
Following the same discussion on the effective dimension ed presented above, if all contexts lie in a d -dimensional subspace of the RKHS induced by the NTK, then ed can be upper-bounded by the constant d , consequently leading to a communication complexity of CT = e O( 4.2 PROOF SKETCH We give a brief sketch of our regret analysis for Theorem 1 (detailed proof in Appendix C). To begin with, we need to prove that both UCBa t,i and UCBb t,i are valid high-probability upper bounds on the reward function h (App. C.3) given that the conditions on m, η, and J in App. C.1 are satisfied. Since UCBa t,i can be viewed as Linear UCB using the neural tangent features g(x; θ0)/ m as the input features (Sec. 3), its validity as a high-probability upper bound on h can be proven following similar steps as that of standard linear and kernelized bandits (Chowdhury & Gopalan, 2017) (see Lemma 3 in App. C.3). Next, to prove that UCBb t,i is also a high-probability upper bound on h (up to additive error terms), let θlocal t,i (V local t,i ) 1(Pt τ=1 yτ,ig(xτ,i; θ0)/ m) which is defined in the same way as θt,i (line 4 of Algo. 1) except that θlocal t,i only uses the local observations of agent i. Firstly, we show that f(x; θsync,NN) (i.e., the NN prediction using the aggregated parameters) is close to N 1 PN i=1 g(x; θ0)/ m, θlocal t,i which is the linear prediction using θlocal t,i averaged over all agents. This is achieved by showing that the linear approximation of the NN at θ0 is close to both terms. Secondly, we show that the absolute difference between the linear prediction g(x; θ0)/ m, θlocal t,i of agent i and the reward function h(x) can be upper-bounded by νT K λ||g(x; θ0)/ m||(V local t,i ) 1. This can be done following similar steps as the proof for UCBa t,i mentioned above. Thirdly, using the Published as a conference paper at ICLR 2023 averaged linear prediction N 1 PN i=1 g(x; θ0)/ m, θlocal t,i as an intermediate term, the difference between f(x; θsync,NN) and h(x) can be upper-bounded. This implies the validity of UCBb t,i as a high-probability upper bound on h (up to additive error terms which are small given the conditions on m, η, and J presented in App. C.1), as formalized by Lemma 4 in App. C.3. Next, following similar footsteps as the analysis in Wang et al. (2020), we separate all epochs into good epochs (intuitively, those epochs during which the amount of newly collected information from all agents is not too large) and bad epochs (details in App. C.2), and then separately upper-bound the regrets incurred in these two types of epochs. For good epochs (App. C.4), we are able to derive a tight upper bound on the regret rt,i = h(x t,i) h(xt,i) in each iteration t by making use of the fact that the change of information in a good epoch is bounded, and consequently obtain a tight upper bound on the total regrets in all good epochs. For bad epochs (App. C.5), we make use of the result from App. C.2 which guarantees that the total number of bad epochs can be upper-bounded. As a result, with an appropriate choice of D = O(T/(N ed)), the growth rate of the total regret incurred in bad epochs is smaller than that in good epochs. Lastly, the final regret upper bound follows from adding up the total regrets from good and bad epochs (App. C.6). 5 EXPERIMENTS All figures in this section plot the average cumulative regret across all N agents up to an iteration, which allows us to inspect the benefit that the federated setting brings to an agent (on average). 
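Concretely, the quantity plotted in all figures can be computed from the per-agent instantaneous regrets as follows (a small sketch; the input array `inst_regret` holding the values $r_{t,i}$ is an assumed quantity, not something provided by the paper's code):

```python
import numpy as np

def average_cumulative_regret(inst_regret):
    """inst_regret: (T, N) array with entries r_{t,i} = h(x*_{t,i}) - h(x_{t,i}).

    Returns a length-T curve: the cumulative regret up to each iteration,
    averaged over the N agents (the quantity plotted in the figures).
    """
    return inst_regret.cumsum(axis=0).mean(axis=1)

# e.g., average_cumulative_regret(np.random.rand(5000, 10)) gives a curve of length 5000
```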
In all presented results, unless specified otherwise (by specifying a value of D), a communication round happens after each iteration. All curves stand for the mean and standard error from 3 independent runs. Some experimental details and results are deferred to App. E due to space limitation. 5.1 SYNTHETIC EXPERIMENTS We firstly use synthetic experiments to illustrate some interesting insights about our algorithm. Similar to that of Zhou et al. (2020), we adopt the synthetic functions of h(x) = cos(3 a, x ) and h(x) = 10( a, x )2 which are referred to as the cosine and square functions, respectively. We add a Gaussian observation noise with a standard deviation of 0.01. The parameter a is a 10dimensional vector randomly sampled from the unit hypersphere. In each iteration, every agent receives K = 4 contexts (arms) which are randomly sampled from the unit hypersphere. For fair comparisons, for all methods (including our FN-UCB, Neural UCB, and Neural TS), we use the same set of parameters of λ = νT KN = νT K = 0.1 and use an NN with 1 hidden layer and a width of m = 20. As suggested by our theoretical analysis (Sec. 3.3), we select an increasing sequence of α which is linearly increasing (to 1) in the first 700 iterations, and let α = 1 afterwards. To begin with, we compare our main FN-UCB algorithm and its variant FN-UCB (Less Comm.) (Sec. 3.4). The results (Figs. 3a and 3b in App. E) show that their empirical performances are very similar. So, for practical deployment, we recommend the use of FN-UCB (Less Comm.) as it is more communication-efficient and achieves a similar performance. Accordingly, we will use the variant FN-UCB (Less Comm.) in all our subsequent experiments and refer to it as FN-UCB for simplicity. Fig. 1 presents the results. Figs. 1a and 1b show that our FN-UCB with N = 1 agent performs comparably with Neural UCB and Neural TS, and that the federation of a larger number N of agents consistently improves the performance of our FN-UCB. Note that the federation of N = 2 agents can already provide significant improvements over non-federated algorithms. Fig. 1c gives an illustration of the importance of different components in our FN-UCB. The red curve is obtained by removing UCBb t,i (i.e., letting α = 0) and the green curve corresponds to removing UCBa t,i. The red curve shows that relying solely on UCBa t,i leads to significantly larger regrets in the long run due to its inability to utilize NNs to model the reward functions. On the other hand, the green curve incurs larger regrets than the red curve initially; however, after more observations are collected (i.e., after the NNs are trained with enough data to accurately model the reward function), it quickly learns to achieve much smaller regrets. These results provide empirical justifications for our discussion on the weight between the two UCBs (Sec. 3.3): It is reasonable to use an increasing sequence of α such that more weight is given to UCBa t,i initially and then to UCBb t,i later. The yellow curve is obtained by removing the step of aggregating (i.e., averaging) the NN parameters (in line 4 of Algo. 2), i.e., when calculating UCBb t,i (line 6 of Algo. 1), we use θi t to replace θsync,NN. 
The results show that the aggregation of the NN parameters significantly improves the performance of FN-UCB (i.e., the blue curve has much smaller regrets than the yellow one) and is hence an indispensable part of our FN-UCB. Lastly, Fig. 1d shows that more frequent communications (i.e., smaller values of D, which make it easier to initiate a communication round; see line 11 of Algo. 1) lead to smaller regrets.

Figure 1: Cumulative regret with varying numbers of agents for the (a) cosine function and (b) square function. (c) Illustration of the importance of different components of our FN-UCB algorithm (cosine function). (d) Performances with different values of D (cosine function). The average numbers of communication rounds are 348.0, 380.0, and 456.7 for D = 5, 4, and 2.5, respectively.

Figure 2: Results (m = 20) for (a) shuttle and (b) magic telescope. (c) Results for shuttle with diagonal approximation (m = 50). (d) Results for shuttle with different values of D. The average numbers of communication rounds are 3850.7, 4442.7, and 4906.3 for D = 0.05, 0.03, and 0.01, respectively.

5.2 REAL-WORLD EXPERIMENTS

We adopt the shuttle and magic telescope datasets from the UCI machine learning repository (Dua & Graff, 2017) and construct the experiments following a widely used protocol in previous works (Li et al., 2010a; Zhang et al., 2021; Zhou et al., 2020). A K-class classification problem can be converted into a K-armed contextual bandit problem. In each iteration, an input x is randomly drawn from the dataset and is then used to construct K context feature vectors $x^1 = [x; \mathbf{0}_d; \ldots; \mathbf{0}_d], x^2 = [\mathbf{0}_d; x; \ldots; \mathbf{0}_d], \ldots, x^K = [\mathbf{0}_d; \ldots; \mathbf{0}_d; x]$, which correspond to the K classes. The reward is 1 if the arm with the correct class is pulled, and 0 otherwise. For fair comparisons, we use the same set of parameters of $\lambda = 10$, $\nu_{TKN} = 0.1$, and $\nu_{TK} = 0.01$ for all methods. Figs. 2a and 2b present the results for the two datasets (1 hidden layer, m = 20) and show that our FN-UCB with N = 2 agents consistently outperforms standard Neural UCB and Neural TS, and its performance also improves with the federation of more agents. Fig. 2c shows the results for shuttle when diagonal approximation (Sec.
3.4) is applied to the NNs (1 hidden layer, m = 50); the corresponding results are consistent with those in Fig. 2a.4 Moreover, the regrets in Fig. 2c are in general smaller than those in Fig. 2a. This may suggest that in practice, a wider NN with diagonal approximation may be preferable to a narrower NN without diagonal approximation since it not only improves the performance but also reduces the computational and communication costs (Sec. 3.4). Fig. 2d plots the regrets of shuttle (with diagonal approximation) for different values of D and shows that more frequent communications lead to better performances and are hence consistent with that in Fig. 1d. For completeness, we also compare their performance with that of linear and kernelized contextual bandit algorithms (for the experiments in both Secs. 5.1 and 5.2), and the results (Fig. 4, App. E) show that they are outperformed by neural contextual bandit algorithms. 6 CONCLUSION This paper describes the first federated neural contextual bandit algorithm called FN-UCB. We use a weighted combination of two UCBs and the choice of this weight required by our theoretical analysis has an interesting interpretation emphasizing accelerated exploration initially and accurate prediction of the aggregated NN later. We derive upper bounds on the regret and communication complexity of FN-UCB, and verify its effectiveness using empirical experiments. Our algorithm is not equipped with privacy guarantees, which may be a potential limitation and will be tackled in future work. 4Since diagonalization increases the scale of the first term in UCBa t,i, we use a heuristic to rescale the values of this term for all contexts such that the max and min values (among all contexts) are 0 and 1 after rescaling. Published as a conference paper at ICLR 2023 REPRODUCIBILITY STATEMENT We have included the necessary details to ensure the reproducibility of our theoretical and empirical results. For our theoretical results, we have stated all our assumptions in Sec. 2, added a proof sketch in Sec. 4.2, and included the complete proofs in App. C and App. D. Our detailed experimental settings have been described in Sec. 5.1, Sec. 5.2, and App. E. Our code has been submitted as supplementary material. ACKNOWLEDGMENTS This research/project is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund Pre Positioning (IAF-PP) (Award A19E4a0101). Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NIPS, 2011. Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proc. ICML, pp. 127 135, 2013. Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3(Nov):397 422, 2002. Yikun Ban and Jingrui He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. ar Xiv:2107.07438, 2021. Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. Ee-net: Exploitation-exploration neural networks in contextual bandits. In Proc. ICLR, 2022. Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pp. 844 853, 2017. Radu Ciucanu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. SAMBA: A generic framework for secure federated multi-armed bandits. JAIR, 73:737 765, 2022. 
Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated Bayesian optimization via Thompson sampling. In Proc. Neur IPS, 2020. Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Differentially private federated Bayesian optimization with distributed exploration. In Proc. Neur IPS, 2021. Zhongxiang Dai, Yao Shu, Bryan Kian Hsiang Low, and Patrick Jaillet. Sample-then-optimize batch neural Thompson sampling. In Proc. Neur IPS, 2022. Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proc. COLT, pp. 355 366, 2008. Ilker Demirel, Yigit Yildirim, and Cem Tekin. Federated multi-armed bandits under byzantine attacks. ar Xiv:2205.04134, 2022. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Abhimanyu Dubey and Alex Pentland. Differentially-private federated linear bandits. In Proc. Neur IPS, pp. 6003 6014, 2020. Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang Low. Fault-tolerant federated reinforcement learning with theoretical guarantee. In Proc. Neur IPS, 2021. Published as a conference paper at ICLR 2023 Kristjan Greenewald, Ambuj Tewari, Susan Murphy, and Predag Klasnja. Action centered contextual bandits. In Proc. NIPS, 2017. Quanquan Gu, Amin Karbasi, Khashayar Khosravi, Vahab Mirrokni, and Dongruo Zhou. Batched neural bandits. ar Xiv:2102.13028, 2021. Stephanie Holly, Thomas Hiessl, Safoura Rezapour Lakani, Daniel Schall, Clemens Heitzinger, and Jana Kemnitz. Evaluation of hyperparameter-optimization approaches in an industrial federated learning system. ar Xiv:2110.08202, 2021. Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In Proc. ICML, pp. 4423 4434, 2021a. Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In Proc. Neur IPS, 2021b. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proc. Neur IPS, 2018. Ali Jadbabaie, Haochuan Li, Jian Qian, and Yi Tian. Byzantine-robust federated linear bandits. ar Xiv:2204.01155, 2022. Yiling Jia, Weitong Zhang, Dongruo Zhou, Quanquan Gu, and Hongning Wang. Learning neural contextual bandits through perturbed rewards. In Proc. ICLR, 2021. Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. ar Xiv:1912.04977, 2019. Parnian Kassraie and Andreas Krause. Neural contextual bandits without regret. In Proc. AISTATS, 2022. Parnian Kassraie, Andreas Krause, and Ilija Bogunovic. Graph neural network bandits. In Proc. Neur IPS, 2022. Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weightsharing. ar Xiv:2106.04502, 2021. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Chuanhao Li and Hongning Wang. Asynchronous upper confidence bound algorithms for federated linear bandits. In Proc. AISTATS, 2022a. Chuanhao Li and Hongning Wang. Communication efficient federated learning for generalized linear bandits. In Proc. Neur IPS, 2022b. Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. 
Communication efficient distributed learning for kernelized contextual bandits. In Proc. Neur IPS, 2022. Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pp. 661 670, 2010a. Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. A survey on federated learning systems: vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021. Tan Li and Linqi Song. Privacy-preserving communication-efficient federated multi-armed bandits. IEEE Journal on Selected Areas in Communications, 2022. Tan Li, Linqi Song, and Christina Fragouli. Federated recommendation system via differential privacy. In Proc. IEEE ISIT, pp. 2592 2597, 2020. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. ar Xiv:1908.07873, 2014. Published as a conference paper at ICLR 2023 Wei Li, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. Exploitation and exploration in a performance based contextual advertising system. In Proc. KDD, pp. 27 36, 2010b. Michal Lisicki, Arash Afkanpour, and Graham W Taylor. An empirical study of neural kernel bandits. In Neur IPS Workshop on Bayesian Deep Learning, 2021. H Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, 2017. Ofir Nabati, Tom Zahavy, and Shie Mannor. Online limited memory neural-linear bandits with likelihood matching. In Proc. ICML, 2021. Thanh Nguyen-Tang, Sunil Gupta, A Tuan Nguyen, and Svetha Venkatesh. Offline neural contextual bandits: Pessimism, optimization and generalization. In Proc. ICLR, 2022. Sudeep Salgia, Sattar Vakili, and Qing Zhao. Provably and practically efficient neural contextual bandits. ar Xiv:2206.00099, 2022. Chengshuai Shi and Cong Shen. Federated multi-armed bandits. In Proc. AAAI, 2021. Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In Proc. AISTATS, pp. 2917 2925, 2021. Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pp. 1015 1022, 2010. William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. Sattar Vakili, Michael Bromberg, Jezabel Garcia, Da-shan Shiu, and Alberto Bernacchia. Uniform generalization bounds for overparameterized neural networks. ar Xiv:2109.06099, 2021. Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. In Proc. UAI, 2013. Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In Proc. ICLR, 2020. Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. ar Xiv:2012.01780, 2020. Zirui Yan, Quan Xiao, Tianyi Chen, and Ali Tajer. Federated multi-armed bandit via uncoordinated exploration. In Proc. ICASSP, pp. 5248 5252, 2022. Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural Thompson sampling. In Proc. ICLR, 2021. Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In Proc. ICML, pp. 11492 11502, 2020. 
Yi Zhou, Parikshit Ram, Theodoros Salonidis, Nathalie Baracaldo, Horst Samulowitz, and Heiko Ludwig. Flora: Single-shot hyper-parameter optimization for federated learning. ar Xiv:2112.08524, 2021. Yinglun Zhu, Dongruo Zhou, Ruoxi Jiang, Quanquan Gu, Rebecca Willett, and Robert Nowak. Pure exploration in kernel and neural bandits. In Proc. Neur IPS, volume 34, pp. 11618 11630, 2021a. Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. Proc. ACM Meas. Anal. Comput. Syst., 5(1):1 29, 2021b. Hankz Hankui Zhuo, Wenfeng Feng, Yufeng Lin, Qian Xu, and Qiang Yang. Federated deep reinforcement learning. ar Xiv:1901.08277, 2019. Published as a conference paper at ICLR 2023 A RELATED WORKS Federated Bandits. Federated learning (FL) has received enormous attention in recent years (Kairouz et al., 2019; Li et al., 2021; 2014; Mc Mahan et al., 2017). A number of recent works have extended the classic K-armed bandits (i.e., the arms are not associated with feature vectors) to the federated setting. Li & Song (2022) and Li et al. (2020) focused on incorporating privacy guarantees into federated K-armed bandits in both centralized and decentralized settings. Shi & Shen (2021) proposed a setting where the goal is to minimize the regret of a global bandit whose reward of an arm is the average of the rewards of the corresponding arm from all agents, which was later extended by adding personalization such that every agent aims to maximize a weighted combination between the global and local rewards (Shi et al., 2021). Subsequent works on federated K-armed bandits have focused on other important aspects such as decentralized communication via the gossip algorithm (Zhu et al., 2021b), the security aspect via cryptographic techniques (Ciucanu et al., 2022), uncoordinated exploration (Yan et al., 2022), and robustness against Byzantine attacks (Demirel et al., 2022). Regarding federated linear contextual bandits, Wang et al. (2020) proposed a distributed linear contextual bandit algorithm which allows every agent to use the observations from the other agents by only exchanging the sufficient statistics to calculate the Linear UCB policy. Subsequently, Dubey & Pentland (2020) extended the method from Wang et al. (2020) to consider differential privacy and decentralized communication, Huang et al. (2021b) considered a setting where every agent is associated with a unique context vector, Li & Wang (2022a) focused on asynchronous communication, and Jadbabaie et al. (2022) considered the robustness against Byzantine attacks. Federated kernelized/GP bandits (also named federated Bayesian optimization) have been explored by Dai et al. (2020; 2021), which focused on the practical problem of hyperparameter tuning in the federated setting. The recent works of Li et al. (2022); Li & Wang (2022b) have, respectively, focused on deriving communication-efficient algorithms for federated kernelized and generalized linear bandits. In addition to federated bandits, other similar sequential decision-making problems have also been extended to the federated setting, such as federated reinforcement learning (Fan et al., 2021; Zhuo et al., 2019) and federated hyperparameter tuning (Holly et al., 2021; Khodak et al., 2021; Zhou et al., 2021). Neural Bandits. Since the pioneering works of Zhou et al. (2020) and Zhang et al. (2021) which, respectively, introduced Neural UCB and Neural TS, a number of recent works have focused on different aspects of neural contextual bandits. Xu et al. 
(2020) reduced the computational cost of Neural UCB by using an NN as a feature extractor and applying Linear UCB only to the last layer of the learned NN, Kassraie & Krause (2022) analyzed the maximum information gain of the NTK and hence derived no-regret algorithms, Gu et al. (2021) focused on the batch setting in which the policy is only updated at a small number of time steps, Nabati et al. (2021) aimed to reduce the memory requirement of Neural UCB, Lisicki et al. (2021) performed an empirical investigation of neural bandit algorithms to verify their practical effectiveness, Ban et al. (2022) adopted a separate NN for exploration in neural contextual bandits, Ban & He (2021) applied the convolutional NTK, Jia et al. (2021) used perturbed rewards to train the NN to remove the need for explicit exploration, Nguyen-Tang et al. (2022) incorporated offline policy learning into neural contextual bandits, Zhu et al. (2021a) studied pure exploration in kernel and neural bandits, Kassraie et al. (2022) applied graph NNs in neural bandits to handle graph-structured data, Salgia et al. (2022) extended neural bandits beyond the Re LU activation to consider smoother activation functions, and Dai et al. (2022) introduced a scalable batch Neural TS algorithm through sample-then-optimize optimization. B MORE BACKGROUND In this section, we give more details on some of the technical background mentioned in Sec. 2. The details in this section all follow the works of Zhang et al. (2021); Zhou et al. (2020), and we present them here for completeness. Published as a conference paper at ICLR 2023 Definition of the NN f(x; θ). Let W1 Rm d, Wl Rm m, l = 2, . . . , L 1, and WL Rm 1, then the NN f(x; θ) is defined as f1 = W1x, fl = Wl Re LU(fl 1), l = 2 . . . , L, f(x; θ) = mf L, in which Re LU(z) = max(z, 0) denotes the rectified linear unit (Re LU) activation function and is applied to each element of fl 1. With this definition of the NN, θ denotes the collection of all parameters of the NN: θ = (vec(W1), . . . , vec(WL)) Rp0. Details of the Initialization Scheme θ0 init( ). To obtain the initial parameters θ0, for each l = 1, . . . , L 1, let Wl = W 0 0 W where each entry of W is independently sampled from N(0, 4/m), and let WL = (w , w ) where each entry of w is independently sampled from N(0, 2/m). This initialization scheme is the same as that used by the works of Zhang et al. (2021); Zhou et al. (2020). Definitions of the NTK Matrices H and Hi s. To simplify the exposition here, we use {xj}j=1,...,T KN to denote the set of all contexts from all iterations, all arms and all agents: {xk t,i}t [T ],k [K],i [N]. We can then define e H(1) i,j = Σ(1) i,j = xi, xj , A(l) i,j = Σ(l) i,i Σ(l) i,j Σ(l) i,j Σ(l) j,j Σ(l+1) i,j = 2E(u,v) N(0,A(l) i,j) max(u, 0) max(v, 0), e H(l+1) i,j = 2 e H(l) i,j E(u,v) N(0,A(l) i,j)1(u > 0)1(v > 0) + Σ(l+1) i,j . With these definitions, the NTK matrix is defined as H = ( e H(L) + Σ(L))/2. Similarly, Hi can be obtained in the same way by only using all contexts from agent i in the definitions above, i.e., now we use {xj}j=1,...,T K to denote {xk t,i}t [T ],k [K] and plug these TK contexts into the definitions above to obtain Hi. C PROOF OF REGRET UPPER BOUND (THEOREM 1) We use p to index different epochs and denote by P the total number of epochs. We use tp to denote the first iteration of epoch p, and use Ep to represent the length (i.e., number of iterations) of epoch p. Throughout our theoretical analysis, we will denote different error probabilities as δ1, . . . 
C PROOF OF REGRET UPPER BOUND (THEOREM 1)

We use $p$ to index different epochs and denote by $P$ the total number of epochs. We use $t_p$ to denote the first iteration of epoch $p$, and use $E_p$ to represent the length (i.e., number of iterations) of epoch $p$. Throughout our theoretical analysis, we will denote different error probabilities as $\delta_1, \ldots, \delta_6$, which we will combine via a union bound at the end of the proof to ensure that our final regret upper bound holds with probability of at least $1 - \delta$.

C.1 CONDITIONS ON THE WIDTH m OF THE NEURAL NETWORKS

We list here the detailed conditions on the width $m$ of the NN that are needed by our theoretical analysis. These include two types of conditions: some of them (conditions 1-4) are required for our regret upper bound to hold (i.e., they are used during the proof to derive the regret upper bound), whereas the others (conditions 5-6) are used after the final regret upper bound is derived to ensure that the final regret upper bound is small (see App. C.6). When presenting our detailed proofs starting from the next subsection, we will refer to each of these conditions whenever they are used by the corresponding lemmas. Different lemmas may use different leading constants in their required condition (i.e., lower bound) on $m$, but here we use the same constant $C > 0$ for all lower bounds for simplicity, which can be considered as simply taking the maximum among all these different leading constants for different lemmas.
1. $m \geq C T^6 K^6 N^6 L^6 \log(TKNL/\delta)$,
2. $m \geq C T^4 K^4 N^4 L^6 \log(T^2 K^2 N^2 L/\delta) / \lambda_0^4$,
3. $m \geq C \sqrt{\lambda}\, L^{-3/2} [\log(TKNL^2/\delta)]^{3/2}$,
4. $m (\log m)^{-3} \geq C T L^{12} \lambda^{-1} + C T^7 \lambda^{-8} L^{18} (\lambda + LT)^6 + C L^{21} T^7 \lambda^{-7} (1 + \sqrt{T/\lambda})^6$,
5. $m (\log m)^{-3} \geq C T^{10} N^6 \lambda^{-4} L^{18}$,
6. $m (\log m)^{-3} \geq C T^{16} N^6 L^{24} \lambda^{-10} (1 + \sqrt{T/\lambda})^6$.
Some of these conditions can be combined, but we leave them as separate conditions to make it easier to refer to the corresponding place in the proof where a particular condition is needed. Furthermore, to achieve a small upper bound on the cumulative regret, we also need to place some conditions on the learning rate $\eta$ and the number of iterations $J$ for the gradient descent training (line 14 of Algo. 1). Specifically, we need to choose the learning rate as $\eta = C_4 (m\lambda + mTL)^{-1}$ (2), in which $C_4 > 0$ is an absolute constant such that $C_4 \leq 1 + TL$, and choose $J = \widetilde{O}\big(TL/(\lambda C_4)\big)$ (3).

C.2 DEFINITION OF GOOD AND BAD EPOCHS

Denote the matrix $V_{\mathrm{last}}$ (see line 18 of Algo. 1) after epoch $p$ as $V_p$. As a result, the matrix $V_P$ is calculated using all selected inputs from all agents: $V_P = \sum_{t=1}^{T} \sum_{i=1}^{N} g(x_{t,i}; \theta_0) g(x_{t,i}; \theta_0)^{\top} / m + \lambda I$. Define $V_0 \triangleq \lambda I$. Imagine that we have a hypothetical agent which chooses all $TN$ queries $\{x_{t,i}\}_{t \in [T], i \in [N]}$ sequentially in a round-robin fashion (i.e., the hypothetical agent chooses $x_{1,1}, x_{1,2}, \ldots, x_{2,1}, x_{2,2}, \ldots, x_{T,N}$), and denote the corresponding hypothetical covariance matrix as $\widetilde{V}_{t,i} = \sum_{\tau=1}^{t-1} \sum_{j=1}^{N} g(x_{\tau,j}; \theta_0) g(x_{\tau,j}; \theta_0)^{\top} / m + \sum_{j=1}^{i} g(x_{t,j}; \theta_0) g(x_{t,j}; \theta_0)^{\top} / m + \lambda I$. We represent the indices of this hypothetical agent by $t' \in [TN]$ to distinguish it from our original multi-agent setting. Define $J_{TN} \triangleq [g(x_{t'}; \theta_0)]_{t' \in [TN]}$, which is a $p_0 \times (TN)$ matrix, and define $K_{TN} \triangleq J_{TN}^{\top} J_{TN} / m$, which is a $(TN) \times (TN)$ matrix. According to these definitions, we have that

Lemma 1 (Lemma B.7 of Zhang et al. (2021)). Let $\delta_1 \in (0, 1)$. If $m \geq C T^6 N^6 K^6 L^6 \log(TNKL/\delta_1)$, we have with probability of at least $1 - \delta_1$ that $\log\det(I + \lambda^{-1} K_{t'}) \leq \log\det(I + \lambda^{-1} H) + 1$, $\forall t' \in [TN]$.

The condition on $m$ given in Lemma 1 corresponds to condition 1 listed in App. C.1, except that $\delta_1$ is used here instead of $\delta$ in App. C.1. Lemma 1 allows us to derive the following equation, which we will use (at the end of this section) to justify that the total number of "bad" epochs is not too large.
p=0 log det Vp+1 det Vp = log det VP (a) = log det JT NJ T N/m + λI = log det λ λ 1JT NJ T N/m + I (b) = log λp0det λ 1JT NJ T N/m + I = log det λ 1JT NJ T N/m + I (c) = log det λ 1J T NJT N/m + I = log det λ 1KT N + I (d) log det λ 1H + I + 1 (e) = ed log(1 + TKN/λ) + 1 R . Published as a conference paper at ICLR 2023 Step (a) is because VP = JT NJ T N/m + λI according to our definition of JT N above. Step (b) follows from our definition of V0 = λI above, as well as some standard properties of matrix determinant. Step (c) follows because: det(AA + I) = det(A A + I). Step (d) has made use of Lemma 1 above, which suggests that equation 4 holds with probability of at least 1 δ1. Step (e) follows from the definition of ed log det(I+H/λ) log(1+T KN/λ) (Sec. 2). In the last step, we have defined R ed log(1+TKN/λ)+1. We further define R R , in which denotes the ceiling operator. Now we define all epochs p s which satisfy the following condition as "good epochs": 1 det Vp det Vp 1 e, (5) and define all other epochs as "bad epochs". The first inequality trivially holds for all epochs according to the way in which the matrices are constructed. It is easy to verify that the second inequality holds for at least R epochs (with probability of at least 1 δ1). This is because if the second inequality is violated for more than R epochs (i.e., if log det Vp det Vp 1 > 1 for more than R epochs), then PP 1 p=0 log det Vp+1 det Vp > R, which violates equation 4. This suggests that there are no more than R bad epochs (with probability of at least 1 δ1). From here onwards, we will denote the set of good epochs by Egood and the set of bad epochs by Ebad. C.3 VALIDITY OF THE UPPER CONFIDENCE BOUND In this section, we prove that the upper confidence bound used in our algorithm, (1 αt)UCBa t,i(x) + αt UCBb t,i(x) (used in line 7 of Algo. 1), is a valid high-probability upper bound on the reward function h. We will achieve this by separately proving that UCBa t,i and UCBb t,i are valid highprobability upper bounds on h in the next two sections. Note that for both UCBs, unlike Neural UCB (Zhou et al., 2020) and Neural TS (Zhang et al., 2021) which use θt (the parameters of trained NNs) to calculate the exploration term (the second terms of UCBa t,i and UCBb t,i), we instead use θ0. This is consistent with Kassraie & Krause (2022) who have shown that the use of θ0 gives accurate uncertainty estimation. C.3.1 VALIDITY OF UCBa t,i AS A HIGH-PROBABILITY UPPER BOUND ON h: To begin with, we will need the following lemma from Zhang et al. (2021). Lemma 2 (Lemma B.3 of Zhang et al. (2021)). Let δ2 (0, 1). There exists a constant C > 0 such that if m CT 4K4N 4L6 log(T 2K2N 2L/δ2)/λ4 0, then with probability of 1 δ2 over random initializations of θ0, there exists a θ Rp0 such that h(x) = g(x; θ0), θ θ0 , m θ θ0 2 2h H 1h B, x Xt,i, t [T], i [N]. (6) The condition on m required by Lemma 2 corresponds to condition 2 listed in App. C.1, except that δ2 is used here instead of δ as in App. C.1. The following lemma formally guarantees the validity of UCBa t,i as a high-probability upper-bound on h. Lemma 3. Let δ3 (0, 1) and νT KN = B + R q 2(log(1/δ3) + 1) + ed log(1 + TKN/λ). We have with probability of at least 1 δ1 δ2 δ3 for all t [T], i [N], that |h(x) g(x; θ0)/ m, θt,i | νT KN λ g(x; θ0)/ m V 1 t,i , x Xt,i Proof. Lemma 3 can be proved by following similar steps as the proof of Lemma 4.3 in the work of Zhang et al. (2021). Specifically, the proof of Lemma 3 requires Lemmas B.3, B.6 and B.7 of Zhang et al. 
(2021) (after being adapted for our setting). The adapted versions of Lemmas B.3 and B.7 of Zhang et al. (2021) have been presented in our Lemma 2 and Lemma 1, respectively. Of note, Lemma 1 and Lemma 2 require some conditions on the width m of the NN, which have been listed as conditions 1 and 2 in Appendix C.1. Lastly, Lemma B.6 of Zhang et al. (2021), which makes use Published as a conference paper at ICLR 2023 of Theorem 1 of Chowdhury & Gopalan (2017), can be directly applied in our setting and introduces an error probability of δ3 (which appears in the expression of νT KN). As a result, Lemma 3 holds with probability of at least 1 δ1 δ2 δ3, in which the error probabilities come from Lemma 1 (δ1), Lemma 2 (δ2) and the application of Lemma B.6 of Zhang et al. (2021) (δ3). C.3.2 VALIDITY OF UCBb t,i AS A HIGH-PROBABILITY UPPER BOUND ON h: Note that UCBb t,i is updated only in every communication round. We denote the set of iterations after which UCBb t,i is updated (i.e., the last iteration in every epoch) as T 1 {tp 1}p=2,...,P 1, which immediately implies that T 1 [T] and hence |T 1| T. Lemma 4. Let δ4, δ5 (0, 1), and νT K = B + R q 2(log(N/δ4) + 1) + edmax log(1 + TK/λ) Suppose the width m of the NN satisfies m C λL 3/2[log(TKNL2/δ5)]3/2 for some constant C > 0, as well as condition 4 in Appendix C.1. Suppose the learning rate η and number of iterations J of the gradient descent training satisfy the conditions in equation 2 and equation 3 (App. C.1), respectively. We have with probability of at least 1 δ4 δ5 for all t T 1, i [N], that |h(x) f(x; θsync,NN)| νT K g(x; θ0)/ m (V local i ) 1 + εlinear(m, T), x Xt,i. Proof. Note that the condition on m listed in the lemma, m C λL 3/2[log(TKNL2/δ5)]3/2, corresponds to condition 3 listed in App. C.1 except that δ5 is used here instead of δ. Therefore, the validity of Lemma 4 requires conditions 3 and 4 on m (App. C.1) to be satisfied. For ease of exposition, we separate our proof into three steps. Step 1: NN Output f(x; θsync,NN) Is Close to (Averaged) Linear Prediction Based on Lemma C.2 of Zhang et al. (2021), if the conditions on m listed in Lemma 4 is satisfied, then for any eθ such that eθ θ0 2 2 p t/(mλ), there exists a constant C1 > 0 such that we have with probability of at least 1 δ5 over random initializations θ0 that |f(x; eθ) g(x; θ0), eθ θ0 | C1t2/3m 1/6λ 2/3L3p C1T 2/3m 1/6λ 2/3L3p εlinear,1(m, T), which holds x Xt,i, t [T], i [N]. Also note that according to Lemma C.1 of Zhang et al. (2021), if conditions 3 and 4 on m listed in App. C.1, as well as the condition on η equation 2, are satisfied, then we have with probability of at least 1 δ5 over random initializations θ0 that θi t θ0 2 2 p t/mλ, i [N]. An immediate implication is that the aggregated NN parameters θsync,NN = 1 N PN i=1 θi t also satisfies: θsync,NN θ0 2 = i=1 θi t θ0 θi t θ0 2 2 p This implies that equation 7 holds for θsync,NN with probability of at least 1 2δ5: |f(x; θsync,NN) g(x; θ0), θsync,NN θ0 | εlinear,1(m, T). (8) Next, note that the θi t is obtained by training only using agent i s local observations (line 14 of Algo. 1). Define θlocal t,i = (V local t,i ) 1(Pt τ=1 yτ,ig(xτ,i; θ0)/ m). Note that θlocal t,i is calculated in the same way as θt,i (line 4 of Algo. 1), except that its calculation only involves agent i s local observations. Next, making use of Lemmas C.1 and C.4 of Zhang et al. (2021), we can follow similar Published as a conference paper at ICLR 2023 steps as equation C.3 of Zhang et al. (2021) (in Appendix C.2 of Zhang et al. 
(2021)) to show that there exists constants C2 > 0 and C3 > 0 such that we have x Xt,i, t [T], i [N] that | g(x; θ0), θi t θ0 g(x; θ0)/ m, θlocal t,i | C2(1 ηmλ)Jp t L/λ + C3m 1/6p log m L4t5/3λ 5/3(1 + p C2(1 ηmλ)Jp TL/λ + C3m 1/6p log m L4T 5/3λ 5/3(1 + p εη,J + εlinear,2(m, T). We refer to g(x; θ0)/ m, θlocal t,i as the linear prediction because it is the prediction of the linear model with the neural tangent features g(x; θ0)/ m as the input features, conditioned on the local observations of agent i. Note that similar to equation 8 which also relies on Lemma C.1 of Zhang et al. (2021), equation 9 also requires conditions 3 and 4 on m, as well as the condition on η, in App. C.1 to be satisfied. equation 9 holds with probability of at least 1 2δ5, where the error probabilities come from the use of Lemmas C.1 and C.4 of Zhang et al. (2021). Next, we can bound the difference between f(x; θsync,NN) (i.e., the prediction of the NN with the aggregated parameters) and the averaged linear predictions of all agents calculated using their local observations: |f(x; θsync,NN) 1 i=1 g(x; θ0)/ m, θlocal t,i | |f(x; θsync,NN) 1 i=1 g(x; θ0), θi t θ0 | i=1 g(x; θ0), θi t θ0 1 i=1 g(x; θ0)/ m, θlocal t,i | |f(x; θsync,NN) g(x; θ0), θsync,NN θ0 | i=1 | g(x; θ0), θi t θ0 g(x; θ0)/ m, θlocal t,i | εlinear,1(m, T) + 1 i=1 (εη,J + εlinear,2(m, T)) εlinear,1(m, T) + εη,J + εlinear,2(m, T) εlinear(m, T). In the second inequality, we plugged in the definition of θsync,NN = 1 N PN i=1 θi t. In the third inequality, we have made use of equation 8 and equation 9; equation 10 holds with probability of at least 1 4δ5, where the error probabilities come from equation 8 (2δ5) and equation 9 (2δ5), respectively. Now we replace δ5 by δ5/4, which ensures that equation 10 holds with probability of at least 1 δ5. This will only introduce a factor of 4 within the log of condition 3 on m (App. C.1), which is ignored since it can be absorbed by the constant C. Step 2: Linear Prediction Is Close to the Reward Function h(x) In the proof in this section, we will also need a "local" variant of the confidence bound of Lemma 3, i.e., the confidence bound of Lemma 3 calculated only using the local observations of an agent i: Lemma 5 (Zhang et al. (2021)). We have with probability of at least 1 δ4 for all t T 1 [T], i [N], that |h(x) g(x; θ0)/ m, θlocal t,i | νT K λ g(x; θ0)/ m (V local t,i ) 1 , x Xt,i. Proof. Similar to the proof of Lemma 3 (App. C.3.1), the proof of Lemma 5 also requires of Lemmas B.3, B.6 and B.7 from Zhang et al. (2021), which, in this case, can be directly applied to our setting (except that we need an additional union bound over all N agents). The implication of the additional union bound on the error probabilities is taken care of by the additional term of N within the log in the expression of νT K (Lemma 4), and in conditions 1 and 2 on m (see App. C.1, and also Lemmas 2 and 1). The required lower bounds on m by the local variants of Lemmas B.3 and B.7 (required in Published as a conference paper at ICLR 2023 the proof here) are smaller than those given in Lemmas 2 and 1 and hence do not need to appear in the conditions in Appendix C.1. By letting the sum of the three error probabilities (resulting from the applications of Lemmas B.3, B.6 and B.7 of Zhang et al. (2021)) be δ4, we can ensure that Lemma 5 holds with probability of at least 1 δ4. For simplicity, we let the error probability for Lemma B.6 be δ4, which leads to the cleaner expression of νT K in Lemma 4. 
This means that the error probabilities for Lemmas B.3 and B.7 are very small, which can be accounted for by simply increasing the value of the absolute constant C in conditions 1 and 2 on m (App. C.1) and hence does not affect our main theoretical analysis. Step 3: Combining Results from Step 1 and Step 2 Next, we are ready to prove the validity of UCBb t,i by using the averaged linear prediction 1 N PN i=1 g(x; θ0)/ m, θlocal t,i as an intermediate term: |f(x;θsync,NN) h(x)| = |f(x; θsync,NN) 1 i=1 g(x; θ0)/ m, θlocal t,i + 1 i=1 g(x; θ0)/ m, θlocal t,i h(x)| i=1 | g(x; θ0)/ m, θlocal t,i h(x)| + εlinear(m, T) λ g(x; θ0)/ m (V local t,i ) 1 + εlinear(m, T) λ g(x; θ0)/ m (V local i ) 1 + εlinear(m, T) The second inequality has made use of equation 10, and the third inequality follows from Lemma 5. In the last inequality, we have made the substitution of (V local i ) 1 = (V local t,i ) 1. This is because in the proof of Lemma 4 here, we only consider the iterations of t T 1 {tp 1}p=2,...,P 1, i.e., the last iteration of every epoch. As a result, this ensures that (V local i ) 1 = (V local t,i ) 1 because every time the central server obtains (V local i ) 1 through (V local i ) 1 = (V local t,i ) 1, i [N] (line 4 of Algo. 2), we have that the current iteration t is the last iteration of the previous epoch. As a results, equation 11 holds with probability of at least 1 δ4 δ5, in which the error probabilities come from equation 10 (δ5) and Lemma 5 (δ4). In other words, Lemma 4 (i.e., the validity of UCBb t,i) holds with probability of at least 1 δ4 δ5. C.4 REGRET UPPER BOUND FOR GOOD EPOCHS In this section, we derive an upper bound on the total regrets incurred in all good epochs Egood (defined in App. C.2). C.4.1 AUXILIARY INEQUALITY We firstly derive an auxiliary result which will be used in the proofs later. Published as a conference paper at ICLR 2023 For agent i and iteration t in a good epoch p Egood, we have that λ g(x; θ0)/ m V 1 t,i = q λg(x; θ0) V 1 t,i g(x; θ0)/m λg(x; θ0) e V 1 t,i g(x; θ0)/m dete Vt,i λg(x; θ0) e V 1 t,i g(x; θ0)/m det Vp eλg(x; θ0) e V 1 t,i g(x; θ0)/m eλ g(x; θ0)/ m e V 1 t,i . Recall that V t,i (line 4 of Algo. 1) is used by agent i in iteration t to select xt,i (via UCBa t,i), and that the matrix e Vt,i is defined for the hypothetical agent which sequentially chooses all TN queries {xt,i}t [T ],i [N] in a round-robin fashion (App. C.2). The first inequality in equation 12 above follows from Lemma 12 of Abbasi-Yadkori et al. (2011). The second inequality is because Vp contains more information than e Vt,i (since Vp is calculated using all the inputs selected after epoch p), and Vp 1 contains less information than V t,i (because compared with Vp 1, V t,i additionally contains the local inputs selected by agent i in the current epoch p). In the last inequality, we have made use of the definition of good epochs, i.e., (det Vp)/(det Vp 1) e (App. C.2). C.4.2 UPPER BOUND ON THE INSTANTANEOUS REGRET rt,i Here we assume that both UCBa t,i and UCBb t,i hold (hence we ignore the error probabilities here), which we have proved in App. C.3. 
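As a concrete reference point for the quantities whose validity was just established, the following is a minimal NumPy sketch of the two UCBs and of the weighted score of line 7 of Algo. 1, computed from the NTK features $g(x; \theta_0)/\sqrt{m}$. The helper names, the flat-feature representation, and the omission of the small $\varepsilon_{\mathrm{linear}}(m, T)$ correction are simplifying assumptions of the sketch, not the released implementation.

```python
import numpy as np

def width(phi, V_inv, lam):
    """Exploration width sqrt(lam) * ||phi||_{V^{-1}}, with phi = g(x; theta_0)/sqrt(m)."""
    return np.sqrt(lam * phi @ V_inv @ phi)

def weighted_score(phi, theta_hat, V_inv, f_sync, V_local_invs,
                   nu_TKN, nu_TK, lam, alpha):
    """(1 - alpha) * UCB^a + alpha * UCB^b for a single context.
    UCB^a uses the statistics aggregated over all agents (theta_hat, V_inv);
    UCB^b uses the prediction f_sync of the NN with aggregated parameters and
    averages the exploration widths computed from the agents' local matrices."""
    ucb_a = phi @ theta_hat + nu_TKN * width(phi, V_inv, lam)
    ucb_b = f_sync + nu_TK * np.mean([width(phi, Vi, lam) for Vi in V_local_invs])
    return (1.0 - alpha) * ucb_a + alpha * ucb_b
```

At each iteration, the selected arm is simply the one whose context maximizes this score.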
We now derive an upper bound on the instantaneous regret rt,i = h(x t,i) h(xt,i) for agent i and iteration t in a good epoch p Egood: rt,i = h(x t,i) h(xt,i) = αh(x t,i) + (1 α)h(x t,i) h(xt,i) αUCBb t,i(x t,i) + αεlinear(m, T) + (1 α)UCBa t,i(x t,i) h(xt,i) αUCBb t,i(xt,i) + (1 α)UCBa t,i(xt,i) + αεlinear(m, T) h(xt,i) = α UCBb t,i(xt,i) h(xt,i) + (1 α) UCBa t,i(xt,i) h(xt,i) + αεlinear(m, T) α 2νT K 1 N λ g(xt,i; θ0)/ m (V local j ) 1 + εlinear(m, T) + (1 α) 2νT KN λ g(xt,i; θ0)/ m + αεlinear(m, T) α 2νT K 1 N λ g(xt,i; θ0)/ m (V local j ) 1 + εlinear(m, T) + (1 α) 2νT KN eλ g(xt,i; θ0)/ m e V 1 t,i + αεlinear(m, T) = α2νT K 1 N λ g(xt,i; θ0)/ m (V local j ) 1 + (1 α)2νT KN eλ g(xt,i; θ0)/ m e V 1 t,i + 2αεlinear(m, T) (1 α)2νT KN eeσt,i(xt,i) + α2νT K 1 N j=1 eσlocal tp 1,j(xt,i) + 2αεlinear(m, T). The first inequality makes use of Lemma 3 (i.e., the validity of UCBa t,i) and Lemma 4 (i.e., the validity of UCBb t,i). The second inequality follows from the way in which xt,i is selected Published as a conference paper at ICLR 2023 (line 7 of Algo. 1): xt,i = arg maxx Xt,i(1 α)UCBa t,i(x) + αUCBb t,i(x). The third inequality again makes use of Lemma 3 and Lemma 4, as well as the expressions of UCBa t,i and UCBb t,i. In the fourth inequality, we have made used of the auxiliary inequality of equation 12 we derived in the last section. Recall that we have discussed that (V local i ) 1 = (V local t,i ) 1 for all t = tp 1 at the end of App. C.3.2. Therefore, in the last step, we have defined eσlocal tp 1,j(xt,i) λ g(xt,i; θ0)/ m (V local tp 1,j) 1 = λ g(xt,i; θ0)/ m (V local j ) 1 which repre- sents the GP posterior standard deviation (using the kernel of ek(x, x ) = g(x; θ0) g(x ; θ0)/m) conditioned on all agent j s local observations before iteration tp. Note that eσlocal tp 1,j(xt,i) is the same as the one defined in Sec. 3.3 of the main text where we explain the weight between the two UCBs. Similarly, we have also defined eσt,i(xt,i) λ g(xt,i; θ0)/ m e V 1 t,i , which represents the GP posterior standard deviation conditioned on the observations of the hypothetical agent before xt,i is selected (App. C.2). Next, we will separately derive upper bounds on the summation (across all good epochs and all agents) of the first and second terms of the upper bound from equation 13. C.4.3 UPPER BOUND ON THE SUM OF THE FIRST TERM OF EQUATION 13 Here, similar to Kassraie & Krause (2022), we denote as κ0 an upper bound on the value of the NTK function for any input: g(x; θ0)/ m, g(x; θ0)/ m κ0, x Xt,i, t [T], i [N]. As a result, we can use it to show that both eσt,i(x) and eσlocal tp 1,j(x) can be upper-bounded: eσt,i(x) κ0 and eσlocal tp 1,j(x) κ0. To show this, following the notations of Appendix C.2, we denote e Vt,i = Jt,i J t,i + λI where Jt,i = h g(xτ,j; θ0) τ [t 1],j [N] , g(xt,j; θ0) i which is a p0 [(t 1)N + i] matrix. Then we have eσ2 t,i(x) = λ g(x; θ0)/ m 2 e V 1 t,i = λg(x; θ0) (Jt,i J t,i + λI) 1g(x; θ0)/m = λg(x; θ0) 1 λJt,i I + J t,i 1 λJt,i 1J t,i 1 λ = g(x; θ0) g(x; θ0)/m (g(x; θ0) / m)Jt,i λI + J t,i Jt,i 1J t,i(g(x; θ0)/ m) = g(x; θ0) g(x; θ0)/m (g(x; θ0) / m)Jt,i 2 λI+J t,i Jt,i 1 g(x; θ0) g(x; θ0)/m κ0 where we used the matrix inversion lemma in the third equality. Using similar derivations also allows us to show that (eσlocal tp 1,j(x))2 κ0. Therefore, we have that eσt,i(x) κ0 and eσlocal tp 1,j(x) κ0. 
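The bound $\tilde{\sigma}^2_{t,i}(x) \leq \kappa_0$ derived above via the matrix inversion lemma can also be checked numerically. Below is a small self-contained sanity check in which random vectors stand in for the NTK features $g(x; \theta_0)/\sqrt{m}$; it is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p0, n, lam = 64, 20, 1.0
J = rng.standard_normal((p0, n))            # columns play the role of past features
phi = rng.standard_normal(p0)               # feature of a new context
V = J @ J.T + lam * np.eye(p0)              # covariance matrix J J^T + lam * I
var = lam * phi @ np.linalg.solve(V, phi)   # sigma_tilde^2 = lam * ||phi||^2_{V^{-1}}
kappa0 = phi @ phi                          # <phi, phi>, i.e., the kappa_0 bound
assert var <= kappa0 + 1e-9                 # sigma_tilde^2(x) <= kappa_0
```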
Published as a conference paper at ICLR 2023 Denoting the set of iterations from all good epochs as T good, we can derive an upper bound the first term of equation 13, summed across all agents i [N] and all iteration in good epochs T good: t T good (1 α)2νT KN eeσt,i(xt,i) (a) 2 eνT KN t=1 eσt,i(xt,i) (b) = 2 eνT KN t=1 min{eσt,i(xt,i), κ0} (c) 2 eνT KN t=1 min{ κ0eσt,i(xt,i), κ0} 2 eνT KN κ0 t=1 min{eσt,i(xt,i), 1} (d) 2 eνT KN κ0 t=1 min{eσ2 t,i(xt,i), 1} (e) 2 eνT KN κ0 p TN[2λ log det(λ 1KT N + I)] 2eνT KN κ0 p TNλ[log det(λ 1H + I) + 1] = 2 eνT KN κ0 TNλ h ed log(1 + TNK/λ) + 1 i Step (a) follows from α 1, t 1 and summing across all iterations [T] instead of only those iterations T good in good epochs. Step (b) follows because eσt,i(x) κ0 as discussed above. In step (c), we have assumed that κ0 1; however, if κ0 < 1, the proof still goes through since we can directly upper-bound min{eσt,i(xt,i), κ0} by min{eσt,i(xt,i), 1}, after which the only modification we need to make to the equation above is to remove the dependency on multiplicative term of κ0. Step (d) results from the Cauchy Schwarz inequality. Step (e) can be derived following the proof of Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)). Step (f) follows from Lemma 1 and hence holds with probability of at least 1 δ1. The last equality simply plugs in the definition of the effective dimension ed (Sec. 2). C.4.4 UPPER BOUND ON THE SUM OF THE SECOND TERM OF EQUATION 13 In this subsection, we derive an upper bound on the sum of the second term in equation equation 13 across all good epochs and all agents. For the proof here, we need a "local" version of Lemma 1, i.e., a version of Lemma 1 which only makes use of the contexts of an agent i. Define Kt,i as the local counterpart to Kt (from Lemma 1), i.e., Kt,i is the t t matrix calculated using only agent j s local contexts up to (and including) iteration t. Specifically, define Jt,i [g(xτ,i; θ0)]τ [t] which is a p0 t matrix, then Kt,i is defined as Kt,i J t,i Jt,i/m, which is a t t matrix. Also recall that in the main text, we have defined Hi as the local counterpart of H for agent i (Sec. 2). The next lemma gives our desired local version of Lemma 1. Lemma 6 (Lemma B.7 of Zhang et al. (2021)). If m CT 6K6L6 log(TNKL/δ6), we have with probability of at least 1 δ6 that log det(I + λ 1Kt,i) log det(I + λ 1Hi) + 1, for all t [T], i [N]. We needed to take a union bound over all N agents, which explains the factor of N within the log in the lower bound on m given in Lemma 6. Note that the required lower bound on m by Lemma 6 is smaller than that of Lemma 1 (by a factor of N 6), therefore, the condition on m in Lemma 6 is ignored in the conditions listed in App. C.1. Published as a conference paper at ICLR 2023 Of note, throughout the entire epoch p, eσlocal tp 1,j(xt,i) is calculated conditioned on all the local observations of agent j before iteration tp. Denote by T (p) the iteration indices in epoch p: T (p) = {tp, . . . , tp + Ep 1}. In the proof in this section, as we have discussed in the first paragraph of Sec. 4.1, we analyze a simpler variant of our algorithm where we only set α > 0 in the first iteration after a communication round, i.e., α > 0, t {tp}p [P ] and α = 0, t [T] \ {tp}p [P ]. 
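For reference, the weight used at $t = t_p$ can be sketched as follows: at the last iteration of the previous epoch, each agent $i$ computes $\alpha_{t_p-1,i} = \tilde{\sigma}^{\mathrm{local}}_{\min}/\tilde{\sigma}^{\mathrm{local}}_{\max}$ and the central server keeps the minimum over agents, as recalled in the derivation below. The NumPy sketch uses a finite set of representative contexts in place of the full domain $\mathcal{X}$, which is a simplifying assumption.

```python
import numpy as np

def agent_alpha(candidate_phis, V_local_inv, lam):
    """alpha_{t,i} = sigma_min / sigma_max, where sigma(x) = sqrt(lam) * ||phi(x)||_{(V^local_i)^{-1}}
    and the min/max are taken over a finite set of representative contexts."""
    widths = [np.sqrt(lam * phi @ V_local_inv @ phi) for phi in candidate_phis]
    return min(widths) / max(widths)

def server_alpha(agent_alphas):
    """The server keeps the most conservative weight: alpha = min_i alpha_{t,i}."""
    return min(agent_alphas)
```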
Now we are ready to derive an upper bound on the second term in equation 13, summed over all agents and all good epochs: t T (p) α2νT K 1 N j=1 eσlocal tp 1,j(xt,i) (a) 2νT K 1 N j=1 eσlocal tp 1,j(xt,i) (b) 2νT K 1 N p [P ] αeσlocal tp 1,j(xtp,i) (c) 2νT K 1 N p [P ] eσlocal tp 1,j(xtp,j) (d) 2νT K 1 N t=1 eσlocal t 1,j(xt,j) The inequality in step (a) results from summing across all epoch p [P] instead of only good epochs p Egood. Step (b) follows since αt = 0, t [T] \ {tp}p [P ] as we discussed above, therefore, for every epoch p, we only need to keep the first term of t = tp in the summation of t T (p). To understand step (c), recall that in the main text (Sec. 3.3), we have defined: eσlocal t,i,min minx X eσlocal t,i (x) and eσlocal t,i,max maxx X eσlocal t,i (x), i [N]. Next, note that our algorithm selects α by: α = mini [N] αt,i (line 4 of Algo. 2) where αt,i = eσlocal t,i,min/eσlocal t,i,max (line 15 of Algo. 1) and t = tp 1 since αt,i is calculated only in the last iteration of every epoch. As a result, we have that α = min i [N] αtp 1,i = min i [N] eσlocal tp 1,i,min eσlocal tp 1,i,max eσlocal tp 1,j,min eσlocal tp 1,j,max eσlocal tp 1,j(xtp,j) eσlocal tp 1,j(xtp,i) , which tells us that αeσlocal tp 1,j(xtp,i) eσlocal tp 1,j(xtp,j) and hence leads to step (c). Step (d) results from summing across all iterations [T] instead of only the first iteration of every epoch. Next, we can derive an upper bound on the inner summation over t = 1, . . . , T from equation 16: t=1 eσlocal t 1,j(xt,j) (a) κ0 t=1 min{eσlocal t 1,j(xt,j), 1} t=1 min{ eσlocal t 1,j(xt,j) 2 , 1} T[2λ log det(λ 1KT,j + I)] Tλ[log det(λ 1Hj + I) + 1] Tλ h edj log(1 + TK/λ)) + 1 i . Step (a) is obtained in the same way as steps (b) and (c) in equation 15 (App. C.4.3), i.e., we have made use of eσlocal tp 1,j(x) κ0 and assumed that κ0 1. Again note that if κ0 < 1, then the proof still goes through since eσlocal t 1,j(xt,j) min{eσlocal t 1,j(xt,j), κ0} min{eσlocal t 1,j(xt,j), 1}, after which the only modification we need to make to the equation above is to remove the dependency on multiplicative term of κ0. Step (b) makes use of the Cauchy Schwarz inequality. Step (c), Published as a conference paper at ICLR 2023 similar to step (e) of equation 15, is derived following the proof of Lemma 4.8 of Zhang et al. (2021) (in Appendix B.7 of Zhang et al. (2021)). Step (d) follows from Lemma 6 and hence holds with probability of at least 1 δ6. In the last equality, we have simply plugged in the definition of edj (Sec. 2). Now we can plug equation 17 into equation 16 to obtain t T (p) α2νT K 1 N j=1 eσlocal tp 1,j(xt,i) Tλ h edj log(1 + TK/λ)) + 1 i Tλ h edj log(1 + TK/λ)) + 1 i . C.4.5 PUTTING THINGS TOGETHER Finally, recall that our derived upper bound on rt,i in equation 13 contains three terms (the third term is simply an error term), and now we can make use of our derived upper bound on the first term (App. C.4.3) and the second term (App. C.4.4), summed over all agents and all good epochs, to obtain an upper bound on the total regrets incurred in all good epochs: t T good rt,i 2 eνT KN κ0 TNλ h ed log(1 + TNK/λ) + 1 i + Tλ h edj log(1 + TK/λ)) + 1 i + TNεlinear(m, T) T edmax + TNεlinear(m, T) TN + edmax N T + TNεlinear(m, T) . In the second last equality, we have used νT KN = e O( p ed) and νT K = e O( q Published as a conference paper at ICLR 2023 C.5 REGRET UPPER BOUND FOR BAD EPOCHS In this section, we derive an upper bound on the total regrets from all bad epochs. 
To begin with, we firstly derive an upper bound on the total regrets of any bad epoch p denoted as R[p]: t=tp+1 rt,i UCBa t,i(x t,i) h(xt,i) i UCBa t,i(xt,i) h(xt,i) i t=tp+1 2νT KN λ g(xt,i; θ0)/ m 4 + 2νT KN κ0 λ g(xt,i; θ0)/ m V 1 t,i , 1} 4 + 2νT KN p t=tp min{ g(xt,i; θ0)/ m V 1 t,i , 1} Step (a) follows from simply upper-bounding the regrets of the first and last iteration within this epoch by 2. Step (b) makes use of the validity of UCBa t,i (Lemma 3). Step (c) follows because α = 0, t [T] \ {tp}p [P ] (i.e., we set α = 0 except for the first iteration of all epochs), which implies that after the first iteration of an epoch, xt,i is selected by only maximizing UCBa t,i (line 7 of Algo. 1). Step (d) again uses Lemma 3, as well as the expression of UCBa t,i. Step (e) is obtained in the same way as steps (b) and (c) in equation 15 (App. C.4.3). Specifically, since g(x; θ0), g(x; θ0) κ0 (App. C.4.3), therefore, λ g(xt,i; θ0)/ m V 1 t,i κ0, which can be proved by following the same steps as equation 14. As a result, if we assume that κ 1, then λ g(xt,i; θ0)/ m λ g(xt,i; θ0)/ m V 1 t,i , κ0} κ0 min{ λ g(xt,i; θ0)/ m V 1 t,i , 1}; in the other case where κ0 < 1, then λ g(xt,i; θ0)/ m V 1 t,i = min{ λ g(xt,i; θ0)/ m V 1 t,i , κ0} λ g(xt,i; θ0)/ m V 1 t,i , 1}. Here we have assumed κ0 1 for simplicity, since when κ0 < 1, the equation above still holds except that we can remove the dependency on κ0. Step (f) follows because λ = 1 + 2/T > 1. Next, we derive an upper bound on the inner summation in equation 20. t=tp min{ g(xt,i; θ0)/ m V 1 t,i , 1} v u u t(Ep 1) t=tp min{ g(xt,i; θ0)/ m 2 V 1 t,i , 1} (Ep 1)2 log det V tp+Ep 2,i 2((tp + Ep 2) tlast) log det Vtp+Ep 2,i Published as a conference paper at ICLR 2023 Step (a) follows from the Cauchy Schwarz inequality. Step (b) makes use of Lemma 11 of Abbasi Yadkori et al. (2011). In step (c), we used the notations of tlast = tp 1, V tp,i = Vlast (this is because in the first iteration tp of an epoch, Wnew,i = 0p0 p0 and hence V tp,i = Vlast = Wsync + λI), and Vtp+Ep 2,i = V tp+Ep 2,i + g(xt,i; θ0)g(xt,i; θ0) /m, and also used det V tp+Ep 2,i det Vtp+Ep 2,i. To understand step (d), note that the term in step (c): ((tp + Ep 2) tlast) log det Vtp+Ep 2,i det Vlast is exactly the criterion we use to check whether to start a communication round in iteration t = tp+Ep 2 (line 11 of Algo. 1). Since t = tp+Ep 2 is not the last iteration in this epoch (i.e., we did not start a communication round after checking this criterion in iteration t = tp + Ep 2), therefore, this criterion is not satisfied in iteration t = tp + Ep 2, i.e., ((tp + Ep 2) tlast) log det Vtp+Ep 2,i det Vlast D, which explains step (d). Next, we can plug equation 21 into equation 20 to obtain: 4 + 2νT KN p 2D = 4 + 2νT KN p 2κ0λD N, (22) which gives an upper bound on the total regret from any bad epoch. Now recall that as we have discussed in App. C.2, there are no more than R bad epochs (with probability of at least 1 δ1). Therefore, the total regret of all bad epochs can be upper-bounded by: Rbad T R 4 + 2νT KN p ed log(1 + TKN/λ) + 1 4 + 2νT KN p = e O (ed)3/2 In the second last equality, we have used νT KN = e O( p ed). By choosing D = O( T N e d) (line 1 of Algo. 1), we can further express the above upper bound on the total regrets from all bad epochs as: Rbad T = O s N ed (ed)3/2N C.6 FINAL REGRET UPPER BOUND Here we derive an upper bound on the total cumulative regret by adding up the regrets resulting from all good epochs (App. C.4) and all bad epochs (App. 
C.5): RT = Rgood T + Rbad T TN + edmax N T + TNεlinear(m, T) + ed TN + edmax N T + TNεlinear(m, T) . This regret upper bound holds with probability of at least 1 δ1 δ2 δ3 δ4 δ5 δ6. We let δ3 = δ4 = δ/3, which leads to the expressions of νT KN and νT K given in the main paper (Sec. 3). We let δ1 = δ2 = δ5 = δ6 = δ/12, and this will only introduce an additional factor of log 12 in the first three conditions on m in App. C.1 which can be absorbed by the constant C. Next, the last term from the upper bound in equation 25 can be further written as: TNεlinear(m, T) = TN εlinear,1(m, T) + εlinear,2(m, T) + εη,J = TNC1T 2/3m 1/6λ 2/3L3p log m + TNC3m 1/6p log m L4T 5/3λ 5/3(1 + p + TNC2(1 ηmλ)Jp Published as a conference paper at ICLR 2023 It can be easily verified that as long as m(log m) 3 36C6 1T 10N 6λ 4L18 and m(log m) 3 36C6 3T 16N 6L24λ 10(1 + p T/λ)6 (which are ensured by conditions 5 and 6 on m in App. C.1), then the first and second terms in equation 26 can both be upper-bounded by 1/3. Moreover, if the conditions on η and J presented in App. C.1 are satisfied, i.e., if we choose the learning rate as η = C4(mλ + m TL) 1 in which C4 > 0 is an absolute constant such that C4 1 + TL, and choose J = 1 C4 λ log 1 3C2N λ T 3L = e O TL/(λC4) , then the third term in equation 26 can also be upper-bounded by 1/3. As a result, the last term from the upper bound in equation 25 can be upper-bounded by 1, and hence the regret upper bound becomes: RT = e O ed TN + edmax N Worst-Case Regret Upper Bound in Terms of the Maximum Information Gain γ. Next, we perform some further analysis of the final regret upper bound derived above, which allows us to inspect the order of growth of our regret upper bound in the worst-case scenario (i.e., without assuming that the effective dimensions are upper-bounded by constants). We have defined in Sec. 2 that ed 2γT KN/ log(1 + TKN/λ), edi 2γT K/ log(1 + TK/λ), i [N] and edmax = maxi [N] ed. As a result, in our derivations in equation 19 and equation 23, we can replace ed log(1 + TKN/λ) by 2γT KN and replace edj log(1 + TK/λ) by 2γT K, after which the regret upper bound becomes RT = e O γT KN The growth rate of the maximum information gain of NTK has been characterized by previous works: γT = e O(T d 1 d ) (Kassraie & Krause, 2022; Vakili et al., 2021). This implies that our regret upper bound can be further expressed as d (TN) 3d 2 2d N = e O K C.7 REGRET UPPER BOUND FOR FN-UCB (LE S S CO M M.) Here we explain how the proof above can be modified to derive a regret upper bound FN-UCB (Less Comm.). To begin with, note that in terms of the regret analysis, the only difference between FN-UCB (Less Comm.) and FN-UCB is that UCBb t,i of every agent i is now modified to be: UCBb t,i(x) = f(x; θsync,NN) + νT K λ g(x; θ0)/ m V 1 sync,NN, in which the matrix V 1 sync,NN is obtained by: V 1 sync,NN = 1 N PN i=1(V local t,i ) 1. Note that every time the matrix V 1 sync,NN is calculated, we have that t = tp 1. Firstly, we prove that the modified UCBb t,i is also a valid high-probability upper bound on the reward function f. To achieve this, all we need to do is to add a few steps to equation 11 in Step 3 of the Published as a conference paper at ICLR 2023 proof of. 
Specifically, we can further analyze equation 11 by: |f(x;θsync,NN) h(x)| λ g(x; θ0)/ m (V local i ) 1 + εlinear(m, T) λg(x; θ0) (V local i ) 1g(x; θ0)/m + εlinear(m, T) i=1 λg(x; θ0) (V local i ) 1g(x; θ0)/m + εlinear(m, T) v u u u tλg(x; θ0) i=1 (V local i ) 1 g(x; θ0)/m + εlinear(m, T) λg(x; θ0) V 1 sync,NN g(x; θ0)/m + εlinear(m, T) λ g(x; θ0)/ m V 1 sync,NN + εlinear(m, T). The first inequality directly follows from equation 11, and the second inequality results from the concavity of the square root function. In the second last equality, we have plugged in the definition of V 1 sync,NN = 1 N PN i=1(V local t,i ) 1. As a result, Lemma 4 which guarantees the validity of UCBb t,i can be modified to be: |h(x) f(x; θsync,NN)| νT K λ g(x; θ0)/ m V 1 sync,NN + εlinear(m, T), x Xt,i. (30) Secondly, we will need the following auxiliary inequality for agent i and iteration t in a good epoch p Egood: λ g(xt,i; θ0)/ m V 1 sync,NN = q λg(xt,i; θ0) V 1 sync,NNg(xt,i; θ0)/m v u u tλg(xt,i; θ0) 1 j=1 (V local j ) 1 g(xt,i; θ0)/m j=1 λg(xt,i; θ0) (V local j ) 1g(xt,i; θ0)/m λg(xt,i; θ0) (V local j ) 1g(xt,i; θ0)/m λ g(xt,i; θ0)/ m (V local j ) 1 . The first inequality is because Thirdly, we need to modify the proof of the regret upper bound for good epochs (App. C.4). Specifically, we can derive an upper bound on the instantaneous regret rt,i = h(x t,i) h(xt,i) for Published as a conference paper at ICLR 2023 agent i and iteration t in a good epoch p Egood (in a similar way to equation 13): rt,i = h(x t,i) h(xt,i) λ g(xt,i; θ0)/ m V 1 sync,NN + εlinear(m, T) + (1 α) 2νT KN λ g(xt,i; θ0)/ m + αεlinear(m, T) λ g(xt,i; θ0)/ m (V local j ) 1 + εlinear(m, T) + (1 α) 2νT KN eλ g(xt,i; θ0)/ m e V 1 t,i + αεlinear(m, T) λ g(xt,i; θ0)/ m (V local j ) 1 + (1 α)2νT KN eλ g(xt,i; θ0)/ m e V 1 t,i + 2αεlinear(m, T) (1 α)2νT KN eeσt,i(xt,i) + α2νT K 1 j=1 eσlocal tp 1,j(xt,i) + 2αεlinear(m, T). In the first inequality, we have made use of equation 30 which ensures the validity of the modified UCBb t,i as a high probability upper bound on h. The second inequality follows from equation 31. In the last equality, we have defined eσlocal tp 1,j(xt,i) in the same way as equation 13. The steps regarding the term involving (1 α) are the same as those from equation 13. As a result, the only change we have made to instantaneous regret upper bound from equation 13 is that in the second term, we have replaced 1 N . Further propagating this change through the proof for the regret upper bound for all good epochs (App. C.4.4 and App. C.4.5), we have that: t T good rt,i = e O ed TN + edmax N 3/2 T + TNεlinear(m, T) . (33) Lastly, also note that the regret upper bound for the bad epochs (i.e., the proof in App. C.5) remains unchanged. Therefore, the final regret upper bound for FN-UCB (Less Comm.) is RT = Rgood T + Rbad T TN + edmax N 3/2 T + TNεlinear(m, T) + ed TN + edmax N 3/2 T + TNεlinear(m, T) TN + edmax N 3/2 D PROOF OF UPPER BOUND ON COMMUNICATION COMPLEXITY (THEOREM 2) In this section, we derive an upper bound on the communication complexity (i.e., the total number of communication rounds) of our FN-UCB algorithm (including its variant FN-UCB (Less Comm.)). Define ζ p DT/R. An immediate implication is that there can be at most T/ζ epochs whose length is larger than ζ. Next, we try to derive an upper bound on the number of epochs whose length is smaller than ζ. Note that if an epoch p contains less than ζ iterations, then because of our criterion to start a communication round (line 10 of Algo. 
1), we have that log det Vp det Vp 1 > D ζ . Also recall that equation equation 4 (Appendix C.2) tells us that: p=0 log det Vp+1 det Vp R R, (35) Published as a conference paper at ICLR 2023 with probability of at least 1 δ1 1 δ. Therefore, there can be at most R D/ζ = Rζ D such epochs whose length is smaller than ζ. As a result, the total number of epochs can be upper-bounded by: Recall that R = e O(ed) (App. C.2). Therefore, with probability of at least 1 δ1 1 δ, the total number of epochs can be upper-bounded by e O( q Since we have chosen D = e O( T N e d) (line 1 of Algo. 1), therefore, the total number of epochs can be upper-bounded by e O( r T e d T N e d ) = e O(ed N). Now we can further make use of the relationship between ed and γT KN: ed 2γT KN/ log(1 + TKN/λ), which allows us to show that the worst-case commu- nication complexity is upper-bounded by: e O(ed N) = e O γT KN N = e O (TKN) d 1 2d ), which is still sub-linear in T even in the worst case. The proof here, and hence Theorem 2, makes use of Lemma 1. Therefore, we only need condition 1 on m listed in App. C.1 to hold, and do not require any condition on η and J. E MORE EXPERIMENTAL DETAILS Our code can be found at: https://github.com/daizhongxiang/ Federated-Neural-Bandits. Some of the experimental details (e.g., the number of layers and the width m of the NN used in every experiment) are already described in the main text (Sec. 5). Following the works of Zhang et al. (2021); Zhou et al. (2020), when training the NN (line 14 of Algo. 1) for agent i, we use the NN parameters resulting from the last gradient descent training of agent i (instead of θ0) as the initial parameters, in order to accelerate the training procedure. Every time we train an NN, we use stochastic gradient descent to train the NN for 30 iterations with a learning rate of 0.01. To save computational cost, we stop training the NNs after 2000 iterations, i.e., after 2000 iterations, all NN parameters are no longer updated. Also to reduce the computational cost, when checking the criterion in line 11 of Algo. 1, we diagonalize (i.e., only keep the diagonal elements of) the two matrices for which we need to calculate the determinant. Our experiments are run on a server with 96 CPUs, an NVIDIA A100 GPU with a memory of 40GB, a RAM of 256GB, running the Ubuntu system. The shuttle dataset is publicly available at https://archive.ics.uci.edu/ml/ datasets/Statlog+(Shuttle) and contains no personally identifiable information or offensive content. It includes 58000 instances, has an input dimension of d = 9 and contains K = 7 classes/arms. As a result, according to the way in which the contexts are constructed (Sec. 5.2), every context feature vector has a dimension of 9 7 = 63. The magic telescope dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/magic+gamma+ telescope and contains no personally identifiable information or offensive content. The dataset contains 19020 instances, has an input dimension of d = 10 and K = 2 classes/arms. As a result, every context feature vector has a dimension of 10 2 = 20. When comparing with Linear-UCB, Linear TS, Kernelized UCB and Kernelized TS, we follow the work of Zhang et al. (2021) to set λ = 1 and perform a grid search within ν {1, 0.1, 0.01}. The results showing comparisons with these algorithms, for both the synthetic experiments (Sec. 5.1) and real-world experiments (Sec. 5.2), are presented in Fig. 4. 
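The diagonalization shortcut mentioned above (keeping only the diagonal elements of the two matrices whose determinants are needed when checking the criterion in line 11 of Algo. 1) can be sketched as follows. The function name and the non-diagonalized branch via slogdet are illustrative choices, not the released code.

```python
import numpy as np

def should_communicate(V_new, V_last, t, t_last, D, diagonalize=True):
    """Trigger a communication round if (t - t_last) * log(det V_new / det V_last) > D.
    With diagonalize=True, only the diagonal entries are kept, so each
    log-determinant reduces to a sum of logs of (positive) diagonal entries."""
    if diagonalize:
        log_ratio = np.sum(np.log(np.diag(V_new))) - np.sum(np.log(np.diag(V_last)))
    else:
        log_ratio = np.linalg.slogdet(V_new)[1] - np.linalg.slogdet(V_last)[1]
    return (t - t_last) * log_ratio > D
```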
The figures show that both linear and kernelized contextual bandit algorithms are outperformed by neural contextual bandit algorithms, which is consistent with the observations from Zhang et al. (2021); Zhou et al. (2020).

Figure 3: Cumulative regret of FN-UCB and FN-UCB (Less Comm.) for the (a) cosine and (b) square functions. Their performances are very similar for both functions.

Figure 4: Cumulative regrets for the (a) cosine, (b) square, (c) shuttle (with diagonalization), and (d) magic telescope experiments, with additional comparisons with Linear UCB, Linear TS, Kernel UCB and Kernel TS.

Figure 5: Comparison between the performances without (yellow) and with (red) diagonalization, using m = 20 with the shuttle dataset. The results show that using an NN with the same width m = 20, diagonalization indeed deteriorates the performances.

Figure 6: The scaling of the final average cumulative regret after 5000 iterations (averaged across all N agents) in terms of the number N of agents, using the cosine and square experiments. The results correspond to Fig. 1a and Fig. 1b, respectively.

We have additionally evaluated the empirical impact of the technique of diagonalization of the matrices (Sec. 3.4), using the shuttle dataset and a fixed width of m = 20 for the NN. The results (Fig. 5) show that for the same width of the NN, the technique of diagonalization indeed results in worse performances. However, also note that diagonalization allows us to afford a larger value of m in a computationally feasible way, which can lead to better performances than using a smaller m without diagonalization. This is corroborated by our empirical results in Figs. 2a and 2c, because the regrets in Fig. 2c (m = 50, with diagonalization) are in general smaller than the regrets in Fig. 2a (m = 20, without diagonalization), and the computational cost of Fig. 2c (244.9 seconds) is smaller than that of Fig. 2a (361.8 seconds).
Furthermore, using m = 50 without diagonalization would incur a significantly larger computational cost (3134.3 seconds). These results demonstrate the practical usefulness of diagonalization. We have also visualized the empirical scaling of the final average cumulative regret (after 5000 iterations) in terms of the number N of agents, using the cosine and square experiments. The results (Fig. 6) demonstrate that the average cumulative regret (averaged across all N agents) is indeed decreasing as the number N of agents increases. F EXTENDED ANALYSIS FOR THE GENERAL ALGORITHM Recall that it has been mentioned at the beginning of Sec. 4.1 that our main regret analysis (Theorem 1) has focused on a simpler version of our FN-UCB algorithm, in which we only choose the value of α using the method described in Sec. 3.3 in the first iteration of every epoch and set α = 0 in the other iterations. Here, we show how our regret analysis can be extended to derive a regret upper bound Published as a conference paper at ICLR 2023 for the general FN-UCB algorithm, in which we choose α using the method described in Sec. 3.3 in every iteration, i.e., we do not set α = 0 in any iteration. To achieve this, we need an additional assumption of an upper bound on the amount of new information collected by every agent i in every epoch p. Specifically, we assume that det V local tp+Ep 2,i det V local tp 1,i D, i [N], p [P] (37) for a constant D 1. This can in fact be viewed as an additional property of the sequence of contexts for each agent. Intuitively, if the contexts for each agent are received in such an order that similar contexts also arrive in similar iterations, then the constant D is likely to be small. This can be seen as a "stationarity" property of the sequence of contexts, which is reasonable in many practical scenarios. For example, in a healthcare application, the patients arriving within the same time period are likely to have similar characteristics due to factors such as the local transmission of a seasonal flu. In addition, another scenario where D is likely to be small is when every agent has some previously observed offline contexts before running our algorithm. If these offline contexts have a good coverage of the space of contexts, then conditioned on these offline contexts, the newly collected information by every agent in every epoch is highly likely to be small. With this additional assumption, the most important step in the proof that we need to modify is the proof in Appendix C.4.4, in which we proved an upper bound on the sum of the second term of equation 13. To begin with, t = tp, . . . , tp + Ep 1, we have that eσlocal tp 1,j(xt,i) (a) = λ g(xt,i; θ0)/ m (V local tp 1,j) 1 λg(xt,i; θ0) (V local tp 1,j) 1g(xt,i; θ0)/m v u u tλg(xt,i; θ0) (V local t 1,j) 1g(xt,i; θ0)/m det V local t 1,j det V local tp 1,j v u u tλg(xt,i; θ0) (V local t 1,j) 1g(xt,i; θ0)/m det V local tp+Ep 1 1,j det V local tp 1,j λg(xt,i; θ0) (V local t 1,j) 1g(xt,i; θ0)/m D Deσlocal t 1,j(xt,i). Step (a) has made use of the definition of eσlocal tp 1,j(xt,i) (see the paragraph below equation 13), step (b) results from Lemma 12 of Abbasi-Yadkori et al. (2011), step (c) follows because V local tp+Ep 1 1,j contains more information than V local t 1,j t = tp, . . . , tp + Ep 1, step (d) follows from equation 37, and step (e) has again made use of the definition of eσlocal tp 1,j(xt,i). 
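Step (b) above, like the first inequality of equation 12, instantiates Lemma 12 of Abbasi-Yadkori et al. (2011) in the form $\|x\|^2_{B^{-1}} \leq (\det A / \det B)\,\|x\|^2_{A^{-1}}$ for $A \succeq B \succ 0$. A quick numerical sanity check of this form, with random matrices standing in for the covariance matrices, is given below purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
B = rng.standard_normal((d, d))
B = B @ B.T + np.eye(d)                         # B positive definite (less information)
C = rng.standard_normal((d, 3))
A = B + C @ C.T                                 # A >= B (more information)
x = rng.standard_normal(d)
lhs = x @ np.linalg.solve(B, x)                 # ||x||^2_{B^{-1}}
rhs = np.linalg.det(A) / np.linalg.det(B) * (x @ np.linalg.solve(A, x))
assert lhs <= rhs + 1e-8                        # Lemma 12, in the form used above
```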
Published as a conference paper at ICLR 2023 Using equation 38, we can modify the proof in equation 16 (Appendix C.4.4): t T (p) α2νT K 1 N j=1 eσlocal tp 1,j(xt,i) 2νT K 1 N j=1 eσlocal tp 1,j(xt,i) t=tp αeσlocal tp 1,j(xt,i) (a) 2νT K 1 N t=tp eσlocal tp 1,j(xt,j) (b) 2νT K 1 N Deσlocal t 1,j(xt,j) t=1 eσlocal t 1,j(xt,j). Step (a) follows from the same reasoning as step (c) of equation 16, step (b) has made use of equation 38, and all other steps follow the same corresponding steps of equation 16. As a result, by comparing the modified equation 39 with the original equation 16, the only modification to the result in equation 16 is the additional multiplicative term of D. Therefore, after propagating this modification to all the analysis in Appendix C.4.4, we have that a multiplicative term of D will also be introduced into equation 18. Subsequently, the upper bound on the total regrets from all good epochs (i.e., equation 19) will be correspondingly modified to be: Rgood T = e O ed T + TNεlinear(m, T) . (40) Next, we also need to modify the proof of the upper bound on the total regrets from all bad epochs (Appendix C.5). Following the roadmap of Appendix C.5, we start by upper-bounding the total regrets from a particular bad epoch p: t=tp rt,i = t=tp [αh(x t,i) + (1 α)h(x t,i) h(xt,i)] h αUCBb t,i(x t,i) + αεlinear(m, T) + (1 α)UCBa t,i(x t,i) h(xt,i) i h αUCBb t,i(xt,i) + αεlinear(m, T) + (1 α)UCBa t,i(xt,i) h(xt,i) i h α(UCBb t,i(xt,i) h(xt,i)) + (1 α)(UCBa t,i(xt,i) h(xt,i)) + αεlinear(m, T) i UCBa t,i(xt,i) h(xt,i) i h α(UCBb t,i(xt,i) h(xt,i)) + αεlinear(m, T) i Published as a conference paper at ICLR 2023 Step (a) follows from Lemma 3 (i.e., the validity of UCBa t,i) and Lemma 4 (i.e., the validity of UCBb t,i). Step (b) results from the way in which xt,i is selected (line 7 of Algo. 1): xt,i = arg maxx Xt,i(1 α)UCBa t,i(x) + αUCBb t,i(x). For step (c), the term A is obtained by upperbounding the regrets of the first and last iteration within this epoch by 2 and using the fact that α 1. Next, we can separately analyze the terms A and B in equation 41. Firstly, note that the term A is the same as step (c) of equation 20, therefore, we can follow the same steps of analyses in App. C.5 (i.e., equation 20, equation 21, equation 22, equation 23 and equation 24) to show that after summing the term A across all bad epochs, we get an upper bound of O ed TN . Secondly, for the term B, we can in fact follow similar steps of analysis in equation 13 to show that every term inside the square bracket of the term B is upper-bounded by the last two terms in equation 13. That is, α(UCBb t,i(xt,i) h(xt,i)) + αεlinear(m, T) α2νT K 1 N j=1 eσlocal tp 1,j(xt,i) + 2αεlinear(m, T). (42) As a result, we can follow the same steps of analysis in App. C.4.4 (after making the modification using equation 39; note that the analysis in App. C.4.4 is applicable to both good and bad epochs) to show that after summing over all bad epochs, the term B can be upper-bounded by e O T + TNεlinear(m, T) . Next, combining the upper bounds on both A and B (after summing across all bad epochs), we have that for the general algorithm, the total regrets from all bad epochs can be upper bounded by Rbad T = e O ed T + TNεlinear(m, T) , (43) which is in fact the same as the upper bound on the total regrets from all good epochs which we have derived in equation 40. Finally, following the same analysis in App. 
C.6, we can show that the final regret upper bound for the general algorithm, in which we do not set α = 0 in any iteration, is RT = e O ed Note that compared to our regret upper bound from Theorem 1, the regret upper bound for the general algorithm (i.e., when we choose the value of α using the method from Sec. 3.3 in every iteration) only includes an additional multiplicative term of D in the second term. Of note, when communication indeed occurs after each iteration (i.e., Ep = 1 for every epoch p), we have that D = 1 because det V local tp+Ep 2,i det V local tp 1,i = 1 (equation 37). In this case, the version of our algorithm analyzed in Theorem 1 becomes the same as our general algorithm (Sec. 4.1), and interestingly, the regret upper bound of our general algorithm (equation 44) also becomes the same as Theorem 1 because D = 1.