# communicationefficient_federated_nonlinear_bandit_optimization__9101f261.pdf Published as a conference paper at ICLR 2024 COMMUNICATION-EFFICIENT FEDERATED NONLINEAR BANDIT OPTIMIZATION Chuanhao Li Chong Liu Yu-Xiang Wang Yale University University of Chicago University of California, Santa Barbara chuanhao.li.cl2637@yale.edu chongl@uchicago.edu yuxiangw@cs.ucsb.edu Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GOUCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm. 1 INTRODUCTION Federated optimization is a machine learning method that enables collaborative model estimation over decentralized dataset without data sharing (Mc Mahan et al., 2017; Kairouz et al., 2019). It allows the creation of a shared global model with personal data remaining in local sites instead of being transferred to a central location, and thus reduces the risks of personal data breaches. While the main focus of the state-of-the-art federated optimization is on the offline setting, where the objective is to obtain a good model estimation based on fixed dataset (Li et al., 2019a; Mitra et al., 2021), several recent research efforts have been made to extend federated optimization to the online setting, i.e., federated bandit optimization (Wang et al., 2020; Li & Wang, 2022b; Li et al., 2022a). Compared with its offline counterpart, federated bandit optimization is characterized by its online interactions with the environment, which continuously provides new data points to the clients over time. The objective of the clients is to collaboratively minimize cumulative regret, which measures how fast they can find the optimal decision, as well as the quality of decisions made during the trialand-error learning process. This new paradigm greatly improves sample efficiency, as the clients not only collaborate on model estimation, but also actively select informative data points to evaluate in a coordinated manner. Moreover, compared with the standard Bayesian optimization approach (Shahriari et al., 2015), the improved data protection of federated bandit optimization makes it a better choice for applications involving sensitive data, such as recommender systems (Li et al., 2010), clinical trials (Villar et al., 2015) and sequential portfolio selection (Shen et al., 2015). For example, medical data such as disease symptoms and medical reports are very sensitive and private, and are typically stored in isolated medical centers and hospitals (Yang et al., 2019). Federated bandit optimization offers a principled way for different medical institutions to jointly solve optimization problems for smart healthcare applications, while ensuring privacy and communication efficiency. Equal Contribution Published as a conference paper at ICLR 2024 Figure 1: Illustration of Fed-GO-UCB algorithm, which consists of two phases: in Phase I, all clients do uniform exploration for a total number of T0 time steps, and then use the collected data to jointly construct an model ˆw0 via iterative distributed optimization; in Phase II, each client constructs a local confidence set for the unknown non-linear function f using gradients w.r.t. the shared model ˆw0, based on which they do optimistic exploration. Synchronization of their local statistics happens when a communication event is triggered, which enables coordinated exploration among the clients. However, despite these potential benefits and the compelling theoretical guarantees, prior works in this direction are limited to very restrictive function classes, e.g., linear (Wang et al., 2020; Li & Wang, 2022a; He et al., 2022), generalized linear (Li & Wang, 2022b), and non-parametric function class with bounded RKHS norm (Li et al., 2022a; 2023; Du et al., 2021), which limits their potential in practical scenarios that typically require more powerful tools in nonlinear modeling, e.g. neural networks. The main challenges in bridging this gap come from two aspects. First, different from offline federated optimization, federated bandit optimization needs to efficiently explore the decision space by actively picking data points to evaluate. This requires a careful construction of confidence sets for the unknown optimal model parameter (Abbasi-yadkori et al., 2011), which is challenging for generic nonlinear functions. Second, for clients to collaboratively estimate confidence sets, occasional communications are required to aggregate their local learning parameters as new data points are collected over time. Prior works consider simple function classes (Wang et al., 2020; Li & Wang, 2022a), so efficient communication can be realized by directly aggregating local sufficient statistics for the closed-form model estimation. However, generic non-linear function places a much higher burden on the communication cost, as iterative optimization procedure is required. To address these challenges, we propose the Federated Global Optimization with Upper Confidence Bound (Fed-GO-UCB) algorithm, as illustrated in Figure 1. Specifically, Fed-GO-UCB has two phases: Phase I does uniform exploration to sufficiently explore the unknown function; and Phase II does optimistic exploration to let N clients jointly optimize the function. All clients separately choose which points to evaluate, and only share statistics summarizing their local data points with the central server. Details of Fed-GO-UCB are presented in Section 4.1. Technical novelties. Our core technique to address the aforementioned challenges is a novel confidence sets construction that works for generic nonlinear functions, and more importantly, can be updated communication efficiently during federated bandit optimization. Our construction is motivated by Liu & Wang (2023), with non-trivial extensions tailored to federated bandit setting. Specifically, the statistics used for function approximation in Liu & Wang (2023) is computed based on a single sequence of continuously updated models, but in federated bandits, each client has a different sequence of locally updated models. Direct aggregation of such local statistics does not necessarily lead to valid confidence sets. Instead, we propose a new approximation procedure, such that all statistics are computed based on the same fixed model shared by all clients, which is denoted by ˆw0. With improved analysis, we show that valid confidence sets can still be constructed. More importantly, this allows direct aggregation of statistics from different clients, so that communication strategies proposed in linear settings can be utilized to reduce frequency of communications. Over the entire horizon, only the estimation of ˆw0 at the end of Phase I requires iterative optimization. To further control the communication cost it incurs, we show that ˆw0 only needs to be an O(1/ Published as a conference paper at ICLR 2024 approximation to the empirical risk minimizer that is required in the analysis of Liu & Wang (2023), and adopt a distributed implementation of Gradient Langevin Dynamics (GLD) for its optimization. Contributions. Our main contributions can be summarized as follows. To the best of our knowledge, this is the first federated bandit algorithm for generic non-linear function optimization with provable communication efficiency, making it deployment-efficient. Under realizable assumption and some other mild conditions, we prove that cumulative regret of Fed-GO-UCB is O( NT) and its communication cost is O(N 1.5 T). Our empirical evaluations show Fed-GO-UCB outperforms existing federated bandit algorithms, which demonstrates the effectiveness of generic non-linear function optimization, 2 RELATED WORK Centralized global optimization Most work on global optimization studies the centralized setting where all data points are available on a single machine. Its applications include hyperparameter tuning for deep neural networks (Hazan et al., 2018; Kandasamy et al., 2020) and materials design (Nakamura et al., 2017; Frazier & Wang, 2016). The most popular approach to this problem is Bayesian optimization (BO) (Shahriari et al., 2015; Frazier, 2018), which is closely related to bandit problems (Li et al., 2019b; Foster & Rakhlin, 2020). BO typically assumes the unknown objective function is drawn from some Gaussian Processes (GP). The learner sequentially choose points to evaluate and then improve its estimation via posterior update. Classical BO algorithms include GPUCB (Srinivas et al., 2010), GP-EI (Jones et al., 1998), and GP-PI (Kushner, 1964). To improve heterogeneous modeling of the objective function and mitigate over-exploration, Trust region BO (Eriksson et al., 2019) that uses multiple local optimization runs is proposed. In this line of research, the closes work to ours is Liu & Wang (2023), which also considers global approximation of generic nonlinear functions, though it s not suitable for federated setting as discussed in Section 1. Federated bandit optimization Another closely related line of research is federated/distributed bandits, where multiple agents collaborate in pure exploration (Hillel et al., 2013; Tao et al., 2019; Du et al., 2021), or regret minimization (Wang et al., 2020; Li & Wang, 2022a;b). However, most of these works make linear model assumptions, and thus the clients can efficiently collaborate by transferring the O(d2 x) sufficient statistics for closed-form model estimation, where dx is input data dimension. The closest works to ours are Wang et al. (2020); Dubey & Pentland (2020); Li & Wang (2022a;b), which uses event-triggered communication strategies to obtain sub-linear communication cost over time, i.e., communication only happens when sufficient amount of new data has been collected. There is also recent work by Dai et al. (2022) that studies federated bandits with neural function approximation, but it still relies on GP with a Neural Tangent Kernel in their analysis, which is intrinsically linear. More importantly, this analysis assumes the width of the neural network is much larger than the number of samples, while our results do not require such over-parameterization. 3 PRELIMINARIES We consider the problem of finding a global maximum solution to an unknown non-linear black-box function f, i.e., x = arg max x X f(x). Different from previous works, we consider a decentralized system of 1) N clients that selects data points to evaluate, and 2) a central server that coordinates the communication among the clients. The clients cannot directly communicate with each other, but only with the central server, i.e., a star-shaped communication network as shown in Figure 1. In each round, N clients interact with the unknown function f in a round-robin manner, for a total number of T rounds, so the total number of interactions is NT. Let [N] denote the integer set {1, 2, ..., N}. Specifically, at round l [T], each client i [N] selects a point xt from the set X, and has a zeroth-order noisy function observation: yt = f(xt) + ηt R, (1) where the subscript t := N(l 1) + i indicates this is the t-th interaction between the system and the function f, i.e., the t-th time function f is evaluated at a selected point xt, and ηt is independent, zero-mean, σ-sub-Gaussian noise, for t [NT]. Published as a conference paper at ICLR 2024 We adopt the classical definition of cumulative regret to evaluate the algorithm performance. It is defined as RNT = PNT t=1 f(x ) f(xt) for NT interactions. Following Wang et al. (2020), we define the communication cost CNT as the total amount of real numbers being transferred across the system during the NT interactions with function f. Let U denote uniform distribution. W.l.o.g., let X [0, 1]dx and Y R denote the domain and range of unknown function f. We are working with a parametric function class F := {fw : X Y|w W} to approximate f where the parametric function class is controlled by the parameter space W. For a parametric function fw(x), let fw(x) denote the gradient taken w.r.t. x and fx(w) denote the gradient taken w.r.t. w. We use it [N] as the index of the client that evaluates point xt at time step t. We denote the sequence of time steps corresponding to data points that have been evaluated by client i up to time t as Dl t,i = {1 s t : is = i} for t = 1, 2, . . . , NT. In addition, we denote the sequence of time steps corresponding to the data points that have been used to update client i s local model as Dt,i, which include both data points collected locally and those received from the server. If client i never receives any communication from the server, Dt,i = Dl t,i; otherwise, Dl t,i Dt,i [t]. Moreover, when each new data point evaluated by any client in the system is readily communicated to all the other clients, we recover the centralized setting, i.e., Dt,i = [t], i, t. For completeness, we list all notations in Appendix A. Here we list all assumptions that we use throughout this paper. The first two assumptions are pretty standard and the last assumption comes from previous works. Assumption 1 (Realizabilty). There exists w W such that the unknown objective function f = fw . Also, assume W [0, 1]dw. This is w.l.o.g. for any compact W. Assumption 2 (Bounded, differentiable and smooth function approximation). There exist constants F, Cg, Ch > 0, s.t. |fx(w)| F, fx(w) 2 Cg, and 2fx(w) op Ch, x X, w W. Assumption 3 (Geometric conditions on the loss function (Liu & Wang, 2023; Xu et al., 2018)). L(w) = Ex U(fx(w) fx(w ))2 satisfies (τ, γ)-growth condition or local µ-strong convexity at w , i.e., w W, 2 w w 2 2, τ 2 w w γ 2 o L(w) L(w ), for constants µ, τ > 0, µ < dw, 0 < γ < 2. L(w) satisfies dissipative assumption, i.e., L(w) w Ci w 2 2 Cj for some constants Ci, Cj > 0, and a c-local self-concordance assumption at w , i.e., (1 c)2 2L(w ) 2L(w ) (1 c) 2 2L(w ), w s.t. w w 2L(w) c. Assumption 2 implies there exists a constant ζ > 0, such that 2L(w ) op ζ. For example, it suffices to take ζ = 2C2 g. Assumption 3 is made on the expected loss function L w.r.t. parameter w rather than function f, and is strictly weaker than strong convexity as it only requires strong convexity in the small neighboring region around w . For w far away from w , L(w) can be highly non-convex since only growth condition needs to be satisfied. The dissipative assumption is typical for stochastic differential equation and diffusion approximation (Raginsky et al., 2017; Zhang et al., 2017). In our paper, it is only needed for the convergence analysis of Algorithm 2, and may be relaxed by adopting other non-convex optimization methods with global convergence guarantee. 4 METHODOLOGY 4.1 FED-GO-UCB ALGORITHM In this paper, we develop an algorithm that allows Bayesian optimization style active queries to work for general supervised learning-based function approximation under federated optimization scheme. Phase I In Step 1 of Phase I, the algorithm evaluates the unknown function f at uniformly sampled points for a total number of T0 times, where T0 is chosen to be large enough such that function f can be sufficiently explored. By the end of time step T0, each client i [N] has collected a local dataset {(xs, ys)}s Dl T0,i (by definition i [N]Dl T0,i = [T0]). In Step 2, we call the distributed regression oracle, denoted as Oracle, to jointly learn model parameter ˆw0, by optimizing equation 2 below. min w W ˆLT0(w) := 1 i=1 ˆLT0,i(w), (2) Published as a conference paper at ICLR 2024 Algorithm 1 Fed-GO-UCB Input: Phase I length T0, Time horizon (Phase II length) NT, Oracle, number of iterations n to execute Oracle, communication threshold γ, regularization weight λ. Phase I (Uniform exploration) 1: for t = 1, 2, . . . , T0 do client it [N] evaluates point xt U(X), and observe yt 2: Execute Oracle (e.g., Algorithm 2) over N clients to obtain ˆw0 Phase II (Optimistic exploration) 1: Initialize ˆw T0,i = ˆw0, ΣT0,i = λI, b T0,i = 0, DT0,i = , for each client i [N] 2: for t = 1, . . . , NT do 3: Client it evaluates point xt = arg maxx X maxw Ballt 1,it fx(w) and observe yt 4: Client it updates Σt,it = Σt 1,it + fxt( ˆw0) fxt( ˆw0) , bt,it = bt 1,it + fxt( ˆw0)( fxt( ˆw0) ˆw0 + yt fxt), and Σt,it = Σt 1,it + fxt( ˆw0) fxt( ˆw0) , bt,it = bt 1,it + fxt( ˆw0) fxt( ˆw0) ˆw0 + yt fxt 5: Client it updates ˆwt,it by equation 5 and Ballt,it by equation 4; sets index set Dt,it = Dt 1,it {t} # Check whether global synchronization is triggered 6: if |Dt,it| |Dtlast,it| log det(Σt,it) det(Σt,it Σt,it) > γ then 7: All clients upload { Σt,i, bt,i}, and reset Σt,i = 0, bt,i = 0 for i = 1, . . . , N 8: Server aggregates Σt,g = Σtlast,g + PN i=1 Σt,i, bt,g = btlast,g + PN i=1 bt,i 9: All clients download {Σt,g, bt,g}, and update Σt,i = Σt,g, bt,i = bt,g for i = 1, . . . , N 10: Set tlast = t Output: ˆx U({x1, ..., x T }). where ˆLT0,i(w) = P s Dl T0,i ys fw(x s ) 2 is the squared error loss on client i s local dataset. It is worth noting that, executing Oracle to compute the exact minimizer ˆw 0 := arg minw W ˆLT0(w) as in the prior work by Liu & Wang (2023) is unreasonable in our case, since it requires infinite number of iterations, which leads to infinite communication cost. Instead, we need to relax the requirement by allowing for an approximation error ϵ, such that Oracle only need to output ˆw0 that satisfies |ˆLT0( ˆw0) ˆLT0( ˆw 0)| ϵ. (3) Oracle can be any distributed non-convex optimization method with global convergence guarantee, i.e., we can upper bound the number of iterations required, denoted as n, to attain ϵ, for some ϵ 0. Remark 4. In this paper we adopt Gradient Langevin Dynamics (GLD) based methods to optimize equation 2, which are known to have global convergence guarantee to the minimizer of non-convex objectives under smooth and dissipative assumption (Assumption 2 and Assumption 3). GLD based methods work by introducing a properly scaled isotropic Gaussian noise to the gradient descent update at each iteration, which allows them to escape local minima. Specifically, we use a distributed implementation of GLD, which is given in Algorithm 2. It requires n = O( dx ϵ )) iterations to attain ϵ approximation (with step size set to τ1 ϵ), where ν = O(e O(dx)) denotes the spectral gap of the discrete-time Markov chain generated by GLD (Theorem 3.3 of Xu et al. (2018)). Algorithm 2 Distributed-GLD-Update 1: Input: total iterations n; step size τ1 > 0; inverse temperature parameter τ2 > 0; w(0) = 0 2: for k = 0, 1, . . . , n 1 do 3: Server sends w(k) to all clients, and receives local gradients ˆLT0,i(w(k)) for i [N] back 4: Server aggregates local gradients ˆLT0(w(k)) = 1 T0 PN i=1 ˆLT0,i(w(k)) 5: Server randomly draws zk N(0, Id d) 6: Server computes update w(k+1) = w(k) τ1 ˆLT0(w(k)) + p 2τ1/τ2zk 7: Output: ˆw0 = w(K) Phase II At the beginning of Phase II, the estimator ˆw0 obtained in Phase I is sent to each client, which will be used to construct the confidence sets about the unknown parameter w . Specifically, Published as a conference paper at ICLR 2024 at each time step t = 1, 2, . . . , NT, the confidence set constructed by client it is a ball defined as Ballt,it := {w : w ˆwt,it 2 Σt,it βt,it}, (4) such that with the choice of βt,it in Lemma 8, w Ballt,it, t with probability at least 1 δ. The center of this ball is defined as ˆwt,it := Σ 1 t,itbt,it + λΣ 1 t,it ˆw0, (5) where the matrix Σt,it := λI + X s Dt,it fxs( ˆw0) fxs( ˆw0) , (6) and vector bt,it := P s Dt,it fxs( ˆw0) fxs( ˆw0) ˆw0 + ys fxs( ˆw0) . Note fxs( ˆw0) is the gradient of our parametric function taken w.r.t parameter ˆw0, rather than the gradient of the unknown objective function. The statistics Σt,i, bt,i for all client i and time t are constructed using gradients w.r.t. the same model ˆw0. This is essential in federated bandit optimization, as it allows the clients to jointly construct the confidence set by directly aggregating their local updates, denoted by Σt,i, bt,i. In comparison, although the statistics used to construct the confidence sets in Liu & Wang (2023) are computed based on different models and lead to tigher results, they impede collaboration among clients and cannot be directly used in our case. In Phase II, exploration is conducted following Optimism in the Face of Uncertainty (OFU) principle, i.e., at time step t, client it selects point xt X to evaluate via joint optimization over x X and w Ballt 1,it, as shown in line 3. The newly obtained data point (xt, yt) will then be used to update client it s confidence set as shown in line 4-5. In order to ensure communication efficiency during the collaborative global optimization across N clients, an event-triggered communication protocol is adopted, as shown in line 6. Intuitively, this event measures the amount of new information collected by client it since last global synchronization. If the value of LHS exceeds threshold γ, another global synchronization will be triggered, such that the confidence sets of all N clients will be synchronized as shown in line 7-9. In Section 5, we will show that with proper choice of γ, the total number of synchronizations can be reduced to O( N), without degrading the performance. 4.2 THEORETICAL RESULTS Theorem 5 (Cumulative regret and communication cost of Fed-GO-UCB). Suppose Assumption 1, 2, and 3 hold. Let C denote a universal constant and Cλ denote a constant that is independent to T. Under the condition that T Cd2 w F 4ι2 N max µγ/(2 γ) τ 2/(2 γ) , ζ µc2 2, where ι is the logarithmic term depending on T0, Ch, and 2/δ. Algorithm 1 with parameters T0 = NT, γ = dw F 4T µ2N , and ϵ 8 C NT , has cumulative regret Rphase I + phase II = O NTd3 w F 4/µ2 + Nd4 w F 4/µ2 , with probability at least 1 δ, and communication cost Cphase I + phase II = O(N 1.5 Tdwdx exp (dx) + N 1.5d2 wµ/F 2). Theorem 5 shows that our proposed Fed-GO-UCB algorithm matches the regret upper bound of its centralized counterpart, GO-UCB algorithm by Liu & Wang (2023), while only requiring communication cost that is sub-linear in T. We should note that the O( T) dependence in communication cost is due to the iterative optimization procedure to compute ˆw0 at the end of Phase I, which also exists for prior works in federated bandits that requires iterative optimization (Li & Wang, 2022b). 5 PROOF OVERVIEW In this section, to highlight our technical contributions, we provide a proof sketch of the theoretical results about cumulative regret and communication cost that are presented in Theorem 5. All auxiliary lemmas are given in Appendix C and complete proofs are presented in Appendix D. Published as a conference paper at ICLR 2024 5.1 PHASE I. UNIFORM EXPLORATION & DISTRIBUTED REGRESSION ORACLE Cumulative Regret and Communication Cost in Phase I Recall from Section 4.1 that, N clients conduct uniform exploration in Phase I, which constitutes a total number of T0 interactions with the environment. By Assumption 2, we know that the instantaneous regret has a unform upper bound rt := f(x ) f(xt) 2F, so for a total number of T0 time steps, the cumulative regret incurred in Phase I, denoted as Rphase I, can be upper bounded by Rphase I = PT0 t=1 rt 2T0F. Choice of T0 value will be discussed in Section 5.2, as it controls the quality of ˆw0, which further affects the constructed confidence sets used for optimistic exploration. Moreover, the only communication cost in Phase I is incurred when executing Oracle, i.e., the distributed regression oracle, for n iterations. In each iteration, N clients need to upload their local gradients to the server, and then receive the updated global model back (both with dimension dw). Therefore, the communication cost incurred during Phase I is Cphase I = 2n Ndw. Distributed Regression Oracle Guarantee At the end of Phase I, we obtain an estimate ˆw0 by optimizing equation 2 using Oracle. As we mentioned in Section 4.2, ˆw0 will be used to construct the confidence sets, and thus to prepare for our analysis of the cumulative regret in Phase II, we establish the following regression oracle lemmas. Lemma 6. There is an absolute constant C , such that after time step T0 in Phase I of Algorithm 1 and under the condition that approximation error ϵ 1/(C T0), with probability at least 1 δ/2, regression oracle estimated ˆw0 satisfies Ex U[(fx( ˆw0) fx(w ))2] C dw F 2ι/T0, where ι is the logarithmic term depending on T0, Ch, 2/δ. Lemma 6 is adapted from Lemma 5.1 of Liu & Wang (2023) to account for the additional approximation error from the distributed regression oracle. Specifically, instead of proving the risk bound for the exact minimizer ˆw 0, we consider ˆw0 that satisfies |ˆLT0( ˆw0) ˆLT0( ˆw 0)| ϵ for some constant ϵ. As discussed in Section 4.1, this relaxation is essential for the communication efficiency in federated bandit optimization. And Lemma 6 shows that, by ensuring ϵ 1/(C T0), we can obtain the same regression oracle guarantee as in the centralized setting (Liu & Wang, 2023). As discussed in Remark 4, this condition can be satisfied with n = O( C dx T0 ν log(C T0)) number of iterations. With Lemma 6, we can establish Lemma 7 below using the same arguments as Liu & Wang (2023). Lemma 7 (Regression oracle guarantee (Theorem 5.2 of Liu & Wang (2023))). Under Assumption 1, 2, and 3, and by setting ϵ 1/(C T0), there exists an absolute constant C such that after time step T0 in Phase I of Algorithm 1, where T0 satisfies T0 Cdw F 2ι max µγ/(2 γ) τ 2/(2 γ) , ζ µc2 , with probability at least 1 δ/2, regression oracle estimated ˆw0 satisfies ˆw0 w 2 2 Cdw F 2ι 5.2 PHASE II. CONFIDENCE SET CONSTRUCTION & OPTIMISTIC EXPLORATION Confidence Set Construction In Phase II of Algorithm 1, each client i selects its next point to evaluate based on OFU principle, which requires construction of the confidence set in equation 4. The following lemma specifies the proper choice of βt,it, such that Ballt,it contains true parameter w for all t [T] with high probability. Lemma 8. Under Assumption 1, 2, & 3 and by setting βt,it = Θ dwσ2 + dw F 2/µ + d3 w F 4/µ2 , T0 = NT, and λ = Cλ NT, then ˆwt,it w 2 Σt,it βt,it, with probability at least 1 δ, t [NT] in Phase II of Algorithm 1. Cumulative Regret in Phase II Thanks to the confidence sets established in Lemma 8, Algorithm 1 can utlize a communication protocol similar to the ones designed for federated linear bandits (Wang et al., 2020; Dubey & Pentland, 2020) during Phase II, while providing much diverse modeling choices. Therefore, our analysis of the communication regret and communication cost incurred in Phase II follows a similar procedure as its linear counterparts. Denote the total number of global synchronizations (total number of times the event in line 6 of Algorithm 1 is true) over time horizon T as P [0, NP]. Then we use tp for p [P] to denote the time step when the p-th synchronization happens (define t0 = 0), and refer to the sequence of time steps in-between two consecutive synchronizations as an epoch, i.e., the p-th epoch is [tp 1 + 1, tp]. Published as a conference paper at ICLR 2024 Similar to Wang et al. (2020); Dubey & Pentland (2020); Li & Wang (2022b), for the cumulative regret analysis in Phase II, we decompose P epochs into good and bad epochs, and then analyze them separately. Specifically, consider an imaginary centralized agent that has immediate access to each data point in the learning system, and we let this centralized agent executes the same model update rule and arm selection rule as in line 3-5 of Algorithm 1. Then the covariance matrix maintained by this agent can be defined as Σt = Pt s=1 fxs( ˆw0) fxs( ˆw0) for t [NT]. Then the p-th epoch is called a good epoch if ln det(Σtp) det(Σtp 1) 1, otherwise it is a bad epoch. Note that based on Lemma 9, we have ln(det(ΣNT )/ det(λI)) dw log (1 + NT C2 g dwλ ) := R. Since ln( det(Σt1) det(λI) ) + ln( det(Σt2) det(Σt1)) + + ln( det(ΣNT ) det(Σt B ) ) = ln( det(ΣNT ) det(λI) ) R, and due to the pigeonhole principle, there can be at most R bad epochs. Then with standard optimistic argument (Abbasi-yadkori et al., 2011), we can show that the cumulative regret incurred in good epochs Rgood = O( d3 w F 4 NT µ2 ), which matches the regret of centralized algorithm by (Liu & Wang, 2023). Moreover, by design of the event-triggered communication in line 6 of Algorithm 1, we can show that the cumulative regret incurred in any bad epoch p can be bounded by Ptp t=tp 1+1 rt N p 16βNT γ + 8β2 NT C2 h/C2 λ, where βNT = O( d3 w F 4 µ2 ) according to Lemma 8. Since there can be at most R bad epochs, the cumulative regret incurred in bad epochs Rbad = O(Ndw q µ2 γ + Ndw d3 w F 4 µ2 ). By setting communication threshold γ = dw F 4T µ2N , we have Rbad = O( d3 w F 4 NT µ2 + d4 w F 4N µ2 ). Combining cumulative regret incurred during both good and bad epochs, we have Rphase II = Rgood + Rbad = O NTd3 w F 4/µ2 + Nd4 w F 4/µ2 . Communication Cost in Phase II Consider some α > 0. By pigeon-hole principle, there can be at most NT α epochs with length (total number of time steps) longer than α. Then we consider some epoch with less than α time steps, similarly, we denote the first time step of this epoch as ts and the last as te, i.e., te ts < α. Since the users appear in a round-robin manner, the number of interactions for any user i [N] satisfies |Dte,i| |Dts,i| < α N . Due to the event-triggered in line 6 of Algorithm 1, we have log det(Σte) det(Σts) > γN α . Using the pigeonhole principle again, we know that the number of epochs with less than α time steps is at most Rα γN . Therefore, the total number of synchronizations P NT γN , and the RHS can be minimized by choosing α = N p so that P 2 p TR/γ. With γ = dw F 4T µ2N , P 2 q N log(1+NT C2g/(dwλ))µ2 Nµ2/F 4). At each global synchronization, Algorithm 1 incurs 2N(d2 w + dw) communication cost to update the statistics. Therefore, Cphase II = P 2N(d2 w + dw) = (N 1.5d2 wµ/F 2). 6 EXPERIMENTS In order to evaluate Fed-GO-UCB s empirical performance and validate our theoretical results in Theorem 5, we conducted experiments on both synthetic and real-world datasets. Due to the space limit, here we only discuss the experiment setup and results on synthetic dataset. More discussions about the experiment setup and results on real-world datasets are presented in Appendix E. For synthetic dataset, we consider two test functions, f1(x) = P4 i=1 αi exp( P6 j=1 Aij(xj Pij)2) (see values of α, A, P in appendix) and f2(x) = 0.1 P8 i=1 cos(5πxi) P8 i=1 x2 i . The decision set X is finite (with |X| = 50), and is generated by uniformly sampling from [0, 1]6 and [ 1, 1]8, respectively. We choose F to be a neural network with two linear layers, i.e., the model ˆf(x) = W2 σ(W1x + c1) + c2, where the parameters W1 R25,dx, c1 R25, W2 R25, c2 R, and σ(z) = 1/(1+exp( z)). Results (averaged over 10 runs) are reported in Figure 2. We included Dis Lin UCB (Wang et al., 2020), Fed-GLB-UCB (Li & Wang, 2022b), Approx Dis Kernel UCB (Li et al., 2022a), One-GO-UCB, and N-GO-UCB (Liu & Wang, 2023) as baselines. One-GO-UCB learns one model for all clients by immediately synchronizing every data point, and N-GO-UCB learns one model for each client with no communication. From Figure, 2, we can see that Fed-GOUCB and One-GO-UCB have much smaller regret than other baseline algorithms, demonstrating Published as a conference paper at ICLR 2024 Fed-GLB-UCB Dis Lin UCB Approx Dis Kernel UCB One-GO-UCB N-GO-UCB One-Lin UCB Fed-GO-UCB Hartmann function f1 (dx = 6, dw = 201, T = 100, N = 20) 0 500 1000 1500 2000 Cumulative Regret 0 500 1000 1500 2000 Communication Cost Cosine8 function f2 (dx = 8, dw = 301, T = 100, N = 20) 0 500 1000 1500 2000 Cumulative Regret 0 500 1000 1500 2000 Communication Cost Figure 2: Comparison of cumulative regret and communication cost on synthetic test functions. the superiority of neural network approximation for the global non-linear optimization, though inevitably this comes at the expense of higher communication cost due to transferring statistics of size d2 w 104 (compared with d2 x 102 for the baselines). Nevertheless, we can see that the communication cost of Fed-GO-UCB is significantly lower than One-GO-UCB and it grows at a sub-linear rate over time, which conforms with our theoretical results in Theorem 5. 7 CONCLUSIONS AND FUTURE WORK Despite the potential of federated optimization in high-impact applications, such as, clinical trial optimization, hyperparameter tuning, and drug discovery, there is a gap between current theoretical studies and practical usages, i.e.,, federated optimization is often needed in online tasks, like nextword prediction on keyboard apps, but most existing works formulate it as an offline problem. To bridge this gap, some recent works propose to study federated bandit optimization problem, but their objective functions are limited to simplistic classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which limits their potential in real-world applications. In this paper, we propose the first federated bandit optimization method, named Fed-GO-UCB, that works with generic non-linear objective functions. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve O( NT) cumulative regret and O(N 1.5 T + N 1.5) communication cost where T is time horizon and N is number of clients. Our technical analysis builds upon Xu et al. (2018); Liu & Wang (2023) and the main novelties lie in the distributed regression oracle and individual confidence set construction, which makes collaborative exploration under federated setting possible. In addition, extensive empirical evaluations are performed to validate the effectiveness of our algorithm, especially its ability in approximating nonlinear functions. One interesting future direction is to generalize our work to heterogeneous clients, i.e., each client i [N] has a different reward function fi, that can be assumed to be close to each other as in Dubey & Pentland (2021), or have shared components as in Li & Wang (2022a). This allows us to better model the complex situations in reality, especially when personalized decisions are preferred. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS The work was partially supported by NSF Awards #2007117 and #2003257. The work was partially done when Chong Liu was at UCSB and when Chuanhao Li was at University of Virginia. We thank ICLR reviewers and the area chair for their valuable input that led to improvements to the paper. Yasin Abbasi-yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011. Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms, 2021. Zhongxiang Dai, Yao Shu, Arun Verma, Flint Xiaofeng Fan, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated neural bandit. ar Xiv preprint ar Xiv:2205.14309, 2022. Yihan Du, Wei Chen, Yuko Yuroki, and Longbo Huang. Collaborative pure exploration in kernel bandit. ar Xiv preprint ar Xiv:2110.15771, 2021. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Abhimanyu Dubey and Alex Pentland. Provably efficient cooperative multi-agent reinforcement learning with function approximation. ar Xiv preprint ar Xiv:2103.04972, 2021. Abhimanyu Dubey and Alex Sandy Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33, 2020. David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek. Scalable global optimization via local bayesian optimization. In Advances in neural information processing systems 32, 2019. Sarah Filippi, Olivier Cappe, Aur elien Garivier, and Csaba Szepesv ari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23, 2010. Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, 2020. Peter I Frazier. A tutorial on bayesian optimization. ar Xiv preprint ar Xiv:1807.02811, 2018. Peter I Frazier and Jialei Wang. Bayesian optimization for materials design. In Information science for materials discovery and design, pp. 45 75. Springer, 2016. Jemin George and Prudhvi Gurram. Distributed stochastic gradient descent with event-triggered communication. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7169 7178, 2020. Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: a spectral approach. In International Conference on Learning Representations, 2018. Jiafan He, Tianhao Wang, Yifei Min, and Quanquan Gu. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. ar Xiv preprint ar Xiv:2207.03106, 2022. Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. Advances in Neural Information Processing Systems, 26, 2013. Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34, 2021. Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455 492, 1998. Published as a conference paper at ICLR 2024 Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R. Collins, Jeff Schneider, Barnabas Poczos, and Eric P. Xing. Tuning hyperparameters without grad students: Scalable and robust bayesian optimisation with dragonfly. Journal of Machine Learning Research, 21(81):1 27, 2020. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 2020. Solmaz S Kia, Jorge Cort es, and Sonia Mart ınez. Distributed convex optimization via continuoustime coordination algorithms with discrete-time communication. Automatica, 55:254 264, 2015. Nathan Korda, Balazs Szorenyi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In International Conference on Machine Learning, 2016. Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 8(1):97 106, 1964. Chuanhao Li and Hongning Wang. Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, 2022a. Chuanhao Li and Hongning Wang. Communication efficient federated learning for generalized linear bandits. Advances in Neural Information Processing Systems, 35, 2022b. Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Communication efficient distributed learning for kernelized contextual bandits. Advances in Neural Information Processing Systems, 35, 2022a. Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Learning kernelized contextual bandits in a distributed and asynchronous environment. In The Eleventh International Conference on Learning Representations, 2022b. Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Learning kernelized contextual bandits in a distributed and asynchronous environment. In International Conference on Learning Representations, 2023. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. ar Xiv preprint ar Xiv:1907.02189, 2019a. Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Annual Conference on Learning Theory, 2019b. Chong Liu and Yu-Xiang Wang. Global optimization with parametric function approximation. In International Conference on Machine Learning, 2023. Kanak Mahadik, Qingyun Wu, Shuai Li, and Amit Sabne. Fast distributed bandits for online recommendation systems. In Proceedings of the 34th ACM international conference on supercomputing, pp. 1 13, 2020. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017. Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Achieving linear convergence in federated learning under objective and systems heterogeneity. ar Xiv preprint ar Xiv:2102.07053, 2021. Published as a conference paper at ICLR 2024 Nathan Nakamura, Jason Seepaul, Joseph B Kadane, and B Reeja-Jayan. Design for lowtemperature microwave-assisted crystallization of ceramic thin films. Applied Stochastic Models in Business and Industry, 33(3):314 321, 2017. Reese Pathak and Martin J Wainwright. Fedsplit: an algorithmic framework for fast federated optimization. Advances in Neural Information Processing Systems, 33:7057 7066, 2020. Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674 1703. PMLR, 2017. Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. Annals of Statistics, 48(4):1875 1897, 2020. Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1): 148 175, 2015. Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-fourth international joint conference on artificial intelligence, 2015. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, 2010. Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In Annual Symposium on Foundations of Computer Science, 2019. Sof ıa S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015. Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Nearoptimal regret with efficient communication. In International Conference on Learning Representations, 2020. Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018. Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2):1 19, 2019. Xinlei Yi, Lisha Yao, Tao Yang, Jemin George, and Karl H Johansson. Distributed optimization for second-order multi-agent systems with dynamic event-triggered communication. In 2018 IEEE Conference on Decision and Control (CDC), pp. 3397 3402. IEEE, 2018. Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pp. 1980 2022. PMLR, 2017. Published as a conference paper at ICLR 2024 A NOTATION TABLE Table 1: Symbols and notations. Symbol Definition Description A op operator norm of matrix A Ballt equation 4 parameter uncertainty region at round t βt given in Lemma 8 parameter uncertainty region radius at round t C, ζ constants dx domain dimension dw parameter dimension δ failure probability ε covering number discretization distance η σ-sub-Gaussian observation noise fw(x) objective function at x parameterized by w fx(w) objective function at w parameterized by x fx(w) 1st order derivative w.r.t. w parameterized by x 2fx(w) 2nd order derivative w.r.t. w parameterized by x F function range constant bound ι, ι , ι logarithmic terms L(w) E[(fx(w) fx(w ))2] expected loss function λ regularization parameter µ strong convexity parameter [T] {1, 2, ..., T} integer set of size T N number of agents Oracle regression oracle n number of iterations to execute Oracle ϵ approximation error guaranteed of Oracle γ communication threshold rt fw (x ) fw (xt) instantaneous regret at round t RT PT t=1 rt cumulative regret after round T Σt equation 6 covariance matrix at round t T0 time horizon in Phase I T time horizon in Phase II U uniform distribution w w W function parameter w w W true parameter ˆw0 oracle-estimated parameter after Phase I ˆwt,i equation 5 updated parameter of client i at round t W W [0, 1]dw parameter space x x X data point x optimal data point x p (Pd i=1 |xi|p)1/p ℓp norm x A x Ax distance defined by square matrix A X X Rdx function domain Y Y = [ F, F] function range Published as a conference paper at ICLR 2024 B ADDITIONAL DISCUSSIONS B.1 MORE DETAILS ON RELATED WORKS More details on distributed/federated bandits For distributed bandits (Korda et al., 2016; Mahadik et al., 2020; Wang et al., 2020; Dubey & Pentland, 2020; Huang et al., 2021), designing an efficient communication strategy is the main focus. Existing algorithms mainly differ in the relations of learning problems solved by the clients (i.e., identical vs., clustered) and the type of communication network (i.e., peer-to-peer (P2P) vs., star-shaped). Korda et al. (2016) studied two problem settings with a P2P communication network: 1) all the clients solve a common linear bandit problem, and 2) the problems are clustered. Mahadik et al. (2020) later proposed an improved algorithm on the second problem setting studied by Korda et al. (2016). However, both works only tried to reduce per-round communication, and thus the communication cost is still linear over time. Two follow-up studies considered the setting where all clients solve a common linear bandit problem with time-varying arm set and interact with the environment in a round-robin fashion (Wang et al., 2020; Dubey & Pentland, 2020). Similar to our work, they also used event-triggered communications to obtain a sub-linear communication cost over time. In particular, Wang et al. (2020) considered a star-shaped network and proposed a synchronous communication protocol for all clients to exchange their sufficient statistics via the central server. Dubey & Pentland (2020) extended this synchronous protocol to differentially private Lin UCB algorithms under both star-shaped and P2P network. Huang et al. (2021) considered a similar setting but with a fixed arm set and thus proposed a phase-based elimination algorithm. Later, in order to improve robustness against possible stragglers in the system, Li & Wang (2022a); He et al. (2022) proposed asynchronous communication protocols for the federated linear bandit problems. All the works mentioned above consider linear models, where efficient communication is enabled via closed-form model update. The only existing work that studies nonlinear models with iterative optimization is by Li & Wang (2022b), who employed a combination of online and offline regression, with online regression adjusting each client s local model using its newly collected data, and offline (distributed) regression that conducts iterative gradient aggregation over all clients for joint model estimation during global synchronizations. Offline federated learning Another related line of research is federated learning or decentralized machine learning that considers offline supervised learning scenarios (Kairouz et al., 2019). Fed Avg (Mc Mahan et al., 2017) has been the most popular algorithm for offline federated learning. However, despite its popularity, several works (Li et al., 2019a; Karimireddy et al., 2020; Mitra et al., 2021) identified that Fed Avg suffers from a client-drift problem when the clients data are non-IID (which is an important signature of our case), i.e., local iterates in each client drift towards their local minimum. This leads to a sub-optimal convergence rate of Fed Avg, i.e., a sub-linear convergence rate for strongly convex and smooth losses, though a linear convergence rate is expected in this case. To address this, Pathak & Wainwright (2020) proposed an operator splitting procedure to guarantee linear convergence to a neighborhood of the global minimum. Later, Mitra et al. (2021) introduced variance reduction techniques to guarantee exact linear convergence to the global minimum. However, due to the fundamental difference in the learning objectives, they are not suitable for our federated optimization problem: their focus is to collaboratively learn a good point estimate over a fixed dataset, i.e., convergence to the minimizer with fewer iterations, while federated bandit learning requires collaborative confidence set estimation for efficient regret reduction. This is also reflected by the difference in the design of communication triggering events. For decentralized supervised machine learning, triggering event measuring the change in the learned parameters suffices (Kia et al., 2015; Yi et al., 2018; George & Gurram, 2020), while for federated bandit learning, triggering event needs to measure change in the volume of the confidence set, i.e., uncertainty in the problem space (Wang et al., 2020; Li & Wang, 2022a). B.2 SUMMARY OF EXISTING FEDERATED BANDIT ALGORITHMS Here we provide a summary of existing works in federated bandit optimization in Table 2, which mainly differs in their modeling assumptions, type of communication protocol (synchronous vs asynchronous), as well as theoretical guarantees on regret and communication cost (in terms of N and T). Published as a conference paper at ICLR 2024 Table 2: Summary of existing federated bandit algorithms. Related Works Modeling Assumption Protocol Regret Communication Wang et al. (2020) linear synchronous Li & Wang (2022a) linear asynchronous He et al. (2022) linear asynchronous Li & Wang (2022b) generalized linear synchronous T Li et al. (2022a) kernel synchronous Li et al. (2022b) kernel asynchronous Dai et al. (2022) kernel (NTK) synchronous N Ours nonconvex synchronous C AUXILIARY LEMMAS Lemma 9 (Lemma C.5 of Liu & Wang (2023)). Set Σt,it as in equation 6 and suppose Assumptions 1, 2, & 3 hold. Then ln det Σt 1,it 1 + |Dt,it|C2 g dwλ Lemma 10 (Lemma C.6 of Liu & Wang (2023)). Set Σt,it as in equation 6 and suppose Assumptions 1, 2, & 3 hold. Then X s Dt,it fxs( ˆw0) Σ 1 s 1,it fxs( ˆw0) 2 ln(Σt 1,it Lemma 11 (Self-normalized bound for vector-valued martingales (Abbasi-yadkori et al., 2011; Agarwal et al., 2021)). Let {ηi} i=1 be a real-valued stochastic process with corresponding filtration {Fi} i=1 such that ηi is Fi measurable, E[ηi|Fi 1] = 0, and ηi is conditionally σ-sub-Gaussian with σ R+. Let {Xi} i=1 be a stochastic process with Xi H (some Hilbert space) and Xi being Ft measurable. Assume that a linear operator Σ : H H is positive definite, i.e., x Σx > 0 for any x H. For any t, define the linear operator Σt = Σ0 + Pt i=1 Xi X i (here xx denotes outer-product in H). With probability at least 1 δ, we have for all t 1: σ2 ln det(Σt) det(Σ0) 1 Lemma 12 (Risk Bounds for ϵ-approximation of ERM Estimator). Given a dataset {xs, ys}t s=1 where ys is generated from equation 1, and f is the underlying true function. Let ˆft be an ERM estimator taking values in F where F is a finite set and F {f : [0, 1]d [ F, F]} for some F 1. ft F denotes its ϵ-approximation. Then with probability at least 1 δ, f satisfies that E ( ft f0)2 1 + α 1 α inf f F E[(f f0)2] + F 2 ln(|F|) ln(2) tα + 2 ln(2/δ) for all α (0, 1] and ϵ 0. Proof of Lemma 12. Define the risk and empirical risk function as R(f) := EX,Y [(f(X) Y )2], s=1 (f(xs) ys)2. By definition, f = E[Y |X = x] = arg min f F R(f). Published as a conference paper at ICLR 2024 Denote the excess risk and the empirical excess risk as E(f) := R(f) R(f ), ˆE(f) := ˆR(f) ˆR(f ). Following the same steps in Nowak [2007], we have (1 α)E(f) ˆE(f) + c(f) ln 2 + ln 1/δ αt = ˆR(f) ˆR(f ) + c(f) ln 2 + ln 1/δ where penalties c(f) are positive numbers assigned to each f F that satisfies P f F 2 c(f) 1. We set it to c(f) = F 2 ln(|F|) according to Lemma 4 of Schmidt-Hieber (2020). Now recall that, we have denoted ˆft as the ERM estimator, i.e., ˆft := arg minf F ˆR(f), and ft as its ϵ-approximation, such that ˆR( ft) ˆR( ˆft) ϵ. Therefore, we have (1 α)E( ft) ˆE( fn) + F 2 ln(|F|) ln 2 + ln 1/δ ˆE( ˆft) + F 2 ln(|F|) ln 2 + ln 1/δ ˆE(f t ) + F 2 ln(|F|) ln 2 + ln 1/δ Due to Craig-Bernstein inequality, we have ˆE(f t ) E(f t ) + αE(f t ) + ln 1/δ Combining the two inequalities above, we have E( ft) 1 + α 1 αE(f t ) + 1 1 α F 2 ln(|F|) ln 2 + 2 ln 1/δ αt + 1 1 αϵ. D COMPLETE PROOFS D.1 PROOF OF LEMMA 6 Proof of Lemma 6. The regression oracle lemma establishes on Lemma 12 which works only for finite function class. In order to work with our continuous parameter class W, we need ε-covering number argument. First, let w, W denote the ϵ-approximation of ERM parameter and finite parameter class after applying covering number argument on W. By Lemma 12, we find that with probability at least 1 δ/2, Ex U[(fx( w) fx(w ))2] 1 + α 1 α inf w W {w } Ex U[(fx(w) fx(w ))2] + F 2 ln(|W|) ln(2) + ϵ 1 α + 2 ln(4/δ) 1 α F 2 ln(| W|) ln(2) T0α + ϵ 1 α + 2 ln(4/δ) where the second inequality is due to Assumption 1. Our parameter class W [0, 1]dw, so ln(| W|) = ln(1/ϵdw) = dw ln(1/ε) and the new upper bound is that with probability at least 1 δ/2, Ex U[(fx( w) fx(w ))2] C dw F 2 ln(1/ε) T0 + ln(2/δ) Published as a conference paper at ICLR 2024 where C is a universal constant obtained by choosing α = 1/2. Note w is the parameter in W after discretization, not our target parameter w0 W. By (a + b)2 2a2 + 2b2, Ex U[(fx( ˆw0) fx(w ))2] 2Ex U[(fx( ˆw0) fx( w))2] + 2Ex U[(fx( w) fx(w ))2] 2ε2C2 h + 2C dw F 2 ln(1/ε) T0 + ln(2/δ) where the second line applies Lemma 12, discretization error ε, and Assumption 2. By choosing ε = 1/(2 p T0C2 h), and ϵ = 1/(2C T0) we get T0 + C dw F 2 ln(T0C2 h) T0 + 2C ln(2/δ) T0 C dw F 2 ln(T0C2 h) + ln(2/δ) T0 where we can take C = 2C (assuming 2 < C dw F 2 log(T0C2 h)). The proof completes by defining ι as the logarithmic term depending on T0, Ch, 2/δ. D.2 CONFIDENCE ANALYSIS Lemma 13 (Restatement of Lemma 8). Set ˆwt,it, Σt,it as in eq. equation 6, equation 5. Set βt,it as βt,it = Θ dwσ2 + dw F 2 µ + d3 w F 4 Suppose Assumptions 1, 2, & 3 hold and choose T0 = NT. Then t [NT] in Phase II of Algorithm 1, ˆwt,it w 2 Σt,it βt,it, with probability at least 1 δ. Proof. The proof has two steps. First we obtain the closed form solution of ˆwt,it. Next we prove the upper bound of ˆwt,it w 2 Σt,it matches our choice of βt,it. Step 1: Closed form solution of ˆwt,it. Let denote fxs( ˆw0) in this proof. Recall ˆwt,it is estimated by solving the following optimization problem: ˆwt,it = Σ 1 t,itbt,it + λΣ 1 t,it ˆw0 = arg min w λ 2 w ˆw0 2 2 + 1 s Dt(it) ((w ˆw0) + fxs( ˆw0) ys)2 The optimal criterion for the objective function is 0 = λ( ˆwt,it ˆw0) + X s Dt(it) (( ˆwt,it ˆw0) + fxs( ˆw0) ys) . Rearrange the equation and we have λ( ˆwt,it ˆw0) + X s Dt(it) ( ˆwt,it ˆw0) = X s Dt(it) (ys fxs( ˆw0)) , λ( ˆwt,it ˆw0) + X s Dt(it) ( ˆwt,it ˆw0) = X s Dt(it) (ys fxs(w ) + fxs(w ) fxs( ˆw0)) , λ( ˆwt,it ˆw0) + X s Dt(it) ˆw t,it = X s Dt(it) ( ˆw 0 + ηs + fxs(w ) fxs( ˆw0)) , s Dt(it) ( ˆw 0 + ηs + fxs(w ) fxs( ˆw0)) , ˆwt,itΣt,it = λ ˆw0 + X s Dt(it) ( ˆw 0 + ηs + fxs(w ) fxs( ˆw0)) , Published as a conference paper at ICLR 2024 where the second line is by removing and adding back fxs(w ), the third line is due to definition of observation noise η and the last line is by our choice of Σt,it (eq. equation 6). Now we have the closed form solution of ˆwt,it: ˆwt,it = λΣ 1 t,it ˆw0 + Σ 1 t,it X s Dt(it) ( ˆw 0 + ηs + fxs(w ) fxs( ˆw0)) , where denotes fxs( ˆw0). Then ˆwt,it w can be written as ˆwt,it w = Σ 1 t,it s Dt(it) ( ˆw0 + ηs + fxs(w ) fxs( ˆw0)) + λΣ 1 t,it ˆw0 Σ 1 t,itΣt,itw s Dt(it) ( ˆw0 + ηs + fxs(w ) fxs( ˆw0)) + λΣ 1 t,it( ˆw0 w ) s Dt(it) ( ( ˆw0 w ) + ηs + fxs(w ) fxs( ˆw0)) + λΣ 1 t,it( ˆw0 w ) 2 ˆw0 w 2 2fxs( w) s Dt(it) ηs + λΣ 1 t,it( ˆw0 w ), (8) where the second line is again by our choice of Σt and the last equation is by the second order Taylor s theorem of fxs(w ) at ˆw0 where w lies between w and ˆw0. Step 2: Upper bound of ˆwt,it w 2 Σt,it. Multiply both sides of eq. equation 8 by Σ 1 2 t,it and we have 1 2 t,it( ˆwt,it w ) 1 s Dt(it) fxs( ˆw0) ˆw0 w 2 2fxs( w) s Dt(it) fxs( ˆw0)ηs 2 t,it ( ˆw0 w ). Take square of both sides and by inequality (a + b + c)2 4a2 + 4b2 + 4c2 we obtain ˆwt,it w 2 Σt,it 4 s Dt(it) fxs( ˆw0)ηs + 4λ2 ˆw0 w 2 Σ 1 t,it s Dt(it) fxs( ˆw0) ˆw0 w 2 2fxs( w) The remaining job is to bound three terms in eq. equation 9 individually. The first term of eq. equation 9 can be bounded as s Dt(it) fxs( ˆw0)ηs 4σ2 log det(Σt,it) det(Σ0) 1 1 + i C2 g dwλ Published as a conference paper at ICLR 2024 where the second inequality is due to self-normalized bound for vector-valued martingales (Lemma 11 in Appendix C) and Lemma 7, the second inequality is by Lemma 9 and our choice of δi = 3δ/(π2i2), and the last inequality is by defining ι as the logarithmic term depending on i, dw, Cg, 1/λ, 2/δ (with probability > 1 δ/2). The choice of δi guarantees the total failure probability over t rounds is no larger than δ/2. Using Lemma 7, the second term in eq. equation 9 is bounded as 4λ2 ˆw0 w 2 Σ 1 t,it 4λCdw F 2ι Again using Lemma 7 and Assumption 3, the third term of eq. equation 9 can be bounded as s Dt(it) fxs( ˆw0) ˆw0 w 2 2fxs( w) s Dt(it) fxs( ˆw0) = C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) = C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) s Dt(it) fxs ( ˆw0) Rearrange the summation and we can write C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) s Dt(it) fxs ( ˆw0) = C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) Σ 1 t,it fxs ( ˆw0) C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) Σ 1 t,it fxs ( ˆw0) Σ 1 t,it = C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) Σ 1 t,it s Dt(it) fxs ( ˆw0) Σ 1 t,it = C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) Σ 1 t,it C2C2 hd2 w F 4ι2 s Dt(it) fxs( ˆw0) 2 Σ 1 t,it C2C2 hd3 w F 4tι ι2 where the second last inequality is due to Cauchy-Schwarz inequality and the last inequality is by Lemma 10. Published as a conference paper at ICLR 2024 Finally, put three bounds together and we have ˆwt,it w 2 Σ 1 t,it 4dwσ2ι + 4λCdw F 2ι µT0 + C2C2 hd3 w F 4tι ι2 O dwσ2ι + dw F 2ι µ + d3 w F 4tι ι2 where the last inequality is by our choices of λ = Cλ NT. Therefore, our choice of βt,it = Θ dwσ2 + dw F 2 µ + d3 w F 4 guarantees that w is always contained in Ballt with probability 1 δ. D.3 CUMULATIVE REGRET AND COMMUNICATION COST IN PHASE II Cumulative Regret in Phase II Thanks to the confidence set established in Lemma 8, Phase II of Algorithm 1 can operate in a similar way as existing works in federated linear bandits (Wang et al., 2020; Dubey & Pentland, 2020), while allowing for a much wider choices of models. The main difference is that, the regret of their work depends on the matrix constructed using context vectors xs for s = 1, 2, . . . for the selected points, while ours rely on the matrix constructed using gradients w.r.t. the shared model ˆw0. In the following paragraphs, we first establish the relation between instantaneous regret rt and matrix Σt 1,it, and then analyze the regret of Algorithm 1. Though, compared with Liu & Wang (2023), we are using a different way of constructing the confidence ellipsoid, which is given in Lemma 8. Their Lemma 5.4, which is given below, still holds, because it only requires Ballt 1,it to be a valid confidence set, and that Assumption 2 holds. Lemma 14 (Instantaneous regret bound [Lemma 5.4 of Liu & Wang (2023)). Under the same condition as Lemma 8,in Phase II of Algorithm 1, for all t [NT], we have βt 1,it fxt( ˆw0) Σ 1 t 1,it + 2βt,it Ch/λ, with probability at least 1 δ. Denote the total number of global synchronizations (total number of times the event in line 6 of Algorithm 1 is true) over time horizon T as P [0, NP]. Then we use tp for p [P] to denote the time step when the p-th synchronization happens (define t0 = 0), and refer to the sequence of time steps in-between two consecutive synchronizations as an epoch, i.e., the p-th epoch is [tp 1 + 1, tp]. Similar to Wang et al. (2020); Dubey & Pentland (2020); Li & Wang (2022b), for the cumulative regret analysis in Phase II, we decompose P epochs into good and bad epochs, and then analyze them separately. Specifically, consider an imaginary centralized agent that has immediate access to each data point in the learning system, and we let this centralized agent executes the same model update rule and arm selection rule as in line 3-5 of Algorithm 1. Then the covariance matrix maintained by this agent can be defined as Σt = Pt s=1 fxs( ˆw0) fxs( ˆw0) for t [NT]. The p-th epoch is called a good epoch if ln det(Σtp) otherwise it is a bad epoch. Note that based on Lemma 9, we have ln(det(ΣNT )/ det(λI)) dw log (1 + NT C2 g dwλ ) := R. Since ln( det(Σt1) det(λI) )+ln( det(Σt2) det(Σt1))+ +ln( det(ΣNT ) det(Σt B ) ) = ln( det(ΣNT ) det(λI) ) R, and due to the pigeonhole principle, there can be at most R bad epochs. Published as a conference paper at ICLR 2024 Now consider some good epoch p. By definition, we have det(Σt 1) det(Σt 1,it) det(Σtp) det(Σtp 1) e, for any t [tp 1 + 1, tp]. Therefore, if the instantaneous regret rt is incurred during a good epoch, we have βt 1,it fxt( ˆw0) Σ 1 t 1,it + 2βt 1,it Ch βt 1,it fxt( ˆw0) Σ 1 t 1 fxt( ˆw0) Σ 1 t 1,it/ fxt( ˆw0) Σ 1 t 1 + 2βt 1,it Ch βt 1,it fxt( ˆw0) Σ 1 t 1 det(Σt 1) det(Σt 1,it) + 2βt 1,it Ch βt 1,it fxt( ˆw0) Σ 1 t 1 + 2βt 1,it Ch λ where the first inequality is due to Lemma 14, and the last inequality is due to the definition of good epoch, i.e., det(Σt 1) det(Σt 1,it) det(Σtp) det(Σtp 1) e. This suggests the instantaneous regret rt incurred in a good epoch is at most e times of that incurred by the imaginary centralized agent that runs GO-UCB algorithm of Liu & Wang (2023). Therefore, the cumulative regret incurred in good epochs of Phase II, denoted as Rgood is p=1 1{ln det(Σtp) 16eβNT dw ln(1 + NTC2g dwλ ) + 8β2 NT C2 h NT λ2 where the last inequality is due to (a+b)2 2a2+2b2, Lemma 9, and Lemma 10. Note that βNT = O( d3 w F 4 µ2 ) according to Lemma 8. By setting λ = Cλ NT, we have Rgood = O( d3 w F 4 Consider some bad epoch p, we can upper bound the cumulative regret incurred by all N clients in this epoch p as t=tp 1+1 rt = t Dtp,i\Dtp 1,i rt t Dtp,i\Dtp 1,i βt 1,it fxt( ˆw0) Σ 1 t 1,it + 2βt 1,it Ch (|Dtp,i| |Dtp 1,i|)8βNT X t Dtp,i\Dtp 1,i fxt( ˆw0) 2 Σ 1 t 1,it + 8β2 NT C2 h/C2 λ 16βNT γ + 8β2 NT C2 h/C2 λ where the first inequality is due to Lemma 14, the second is due to Cauchy-Schwartz inequality and (a + b)2 2a2 + 2b2, and the last is due to Lemma 10 and event-trigger with threshold γ in line 6 of Algorithm 1. Since there can be at most R = dw log (1 + NT C2 g dwλ ) bad epochs, the cumulative regret incurred in bad epochs of Phase II, denoted as Rbad is O(Ndw q µ2 γ + Ndw d3 w F 4 µ2 ). By setting commu- nication threshold γ = dw F 4T µ2N , we have Rbad = O( d3 w F 4 NT µ2 + d4 w F 4N µ2 ). Combining cumulative regret incurred during both good and bad epochs, we have Rphase II = Rgood + Rbad = O d3 w F 4 NT + d4 w F 4 Communication Cost in Phase II Consider some α > 0. By pigeon-hole principle, there can be at most NT α epochs with length (total number of time steps) longer than α. Then consider some epoch with less than α time steps. We denote the first time step of this epoch as ts and Published as a conference paper at ICLR 2024 the last as te, i.e., te ts < α. Since the users appear in a round-robin manner, the number of interactions for any user i [N] satisfies |Dte,i| |Dts,i| < α N . Due to the event-triggered in line 6 of Algorithm 1, we have log det(Σte) det(Σts) > γN α . Using the pigeonhole principle again, we know that the number of epochs with less than α time steps is at most Rα γN . Therefore, the total number of synchronizations P NT γN , and the RHS can be minimized by choosing α = N p so that P 2 p TR/γ. With γ = dw F 4T µ2N , P 2 q N log(1+NT C2g/(dwλ))µ2 Nµ2/F 4). At each global synchronization, Algorithm 1 incurs 2N(d2 w + dw) communication cost to update the statistics. Therefore, Cphase II = P 2N(d2 w + dw) = (N 1.5d2 wµ/F 2). E EXPERIMENT SETUP & ADDITIONAL RESULTS Synthetic dataset experiment setup Here we provide more details about the experiment setup on synthetic dataset in Section 6. Specifically, we compared all the algorithms on the following two synthetic functions i=1 αi exp( j=1 Aij(xj Pij)2), f2(x) = 0.1 i=1 cos(5πxi) Both are popular synthetic functions for Bayesian optimization benchmarking1. The 6-dimensional function f1 is called Hartmann function, where α = [1.0, 1.2, 3.0, 3.2] , A = 10, 3, 17, 3.5, 1.7, 8 0.05, 10, 17, 0.1, 8, 14 3, 3.5, 1.7, 10, 17, 8 17, 8, 0.05, 10, 0.1, 14 1312, 1696, 5569, 124, 8283, 5886 2329, 4135, 8307, 3736, 1004, 9991 2348, 1451, 3522, 2883, 3047, 6650 4047, 8828, 8732, 5743, 1091, 381 And the 8-dimensional function f2 is a cosine mixture test function, which is named Cosine8. To be compatible with the discrete candidate set setting assumed in prior works (Wang et al., 2020; Li & Wang, 2022b; Li et al., 2022a), we generate the decision set X for the optimization of f1 by uniformly sampling 50 data points from [0, 1]6, and similarly for the optimization of f2, 50 data points from [ 1, 1]8. Following our problem formulation in Section 3, at each time step t [T] (we set T = 100), each client i [N] (we set N = 20) picks a data point xt,i from the candidate set X, and then observes reward yt,i generated by function f1, f2 as mentioned above. Note that the values of both functions are negated, so by maximizing reward, the algorithms are trying to find data point that minimizes the function values. We should also note that the communication cost presented in the experiment results are defined as the total number of scalars transferred in the system (Wang et al., 2020), instead of number of time communication happens (Li & Wang, 2022a). Real-world dataset experiment setup & results To further evaluate Fed-GO-UCB s performance in a more challenging and practical scenario, we performed experiments using real-world datasets: Magic Telescope and Shuttle from the UCI Machine Learning Repository (Dua & Graff, 2017). We pre-processed these two datasets following the steps in prior works (Filippi et al., 2010), by partitioning the dataset in to 20 clusters, and using the centroid of each cluster as feature vector for the arm and its averaged response as mean reward. Then we simulated the federated bandit learning problem introduced in Section 3 with T = 100 and N = 100. From Figure 3, we can see that Fed-GO-UCB outperforms the baselines, with relatively low communication cost. 1We chose them from the test functions available in Bo Torch package. See https://botorch.org/ api/test_functions.html for more details. Published as a conference paper at ICLR 2024 Fed-GLB-UCB Dis Lin UCB Approx Dis Kernel UCB One-GO-UCB N-GO-UCB One-Lin UCB Fed-GO-UCB Magic Telescope (dx = 10, dw = 201, T = 100, N = 100) 0 2000 4000 6000 8000 10000 Cumulative Regret 0 2000 4000 6000 8000 10000 Communication Cost Shuttle (dx = 10, dw = 201, T = 100, N = 100) 0 2000 4000 6000 8000 10000 Cumulative Regret 0 2000 4000 6000 8000 10000 Communication Cost Figure 3: Comparison of cumulative regret and communication cost on real-world datasets.