# Federated Online and Bandit Convex Optimization

Kumar Kshitij Patel¹, Lingxiao Wang¹, Aadirupa Saha², Nati Srebro¹

¹TTIC. ²Apple. This work was done while AS was visiting TTIC. Correspondence to: Kumar Kshitij Patel. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We aim to minimize the average regret on M machines working in parallel over T rounds with R intermittent communications. Assuming the underlying cost functions are convex and can be generated adaptively, our results show that collaboration is not beneficial when the machines have access to first-order gradient information at the queried points. This is in contrast to the case of stochastic functions, where each machine samples the cost functions from a fixed distribution. Furthermore, we delve into the more challenging setting of federated online optimization with bandit (zeroth-order) feedback, where the machines can only access values of the cost functions at the queried points. The key finding here is identifying the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. We further illustrate our findings through federated adversarial linear bandits by developing novel distributed single- and two-point feedback algorithms. Our work is the first attempt towards a systematic understanding of federated online optimization with limited feedback, and it attains tight regret bounds in the intermittent communication setting for both first- and zeroth-order feedback. Our results thus bridge the gap between stochastic and adaptive settings in federated online optimization.

## 1. Introduction

We consider the following distributed regret minimization problem on M machines over a horizon of length T:

$$\sum_{m \in [M],\, t \in [T]} f_t^m(x_t^m) \;-\; \min_{x^\star \in B_2(B)} \sum_{m \in [M],\, t \in [T]} f_t^m(x^\star), \tag{1}$$

where $f_t^m$ is an arbitrary convex cost function observed by machine $m$ at time $t$, $x_t^m$ is the model the machine plays based on the available history, and the comparator $x^\star$ is shared across the machines. This formulation is a natural extension of classical federated optimization (McMahan et al., 2016; Kairouz et al., 2019). It captures many real-world applications, such as mobile keyboard prediction (Hard et al., 2018; Chen et al., 2019; Hartmann, 2021), self-driving vehicles (Elbir et al., 2020; Nguyen et al., 2022), and recommendation systems (Shi et al., 2021; Liang et al., 2021; Khan et al., 2021). These applications involve sequential decision-making across multiple machines where data is generated in real time and might not be stored. This regret minimization problem can be solved in a federated manner, i.e., by storing and analyzing the data locally at each machine while only communicating the models intermittently, which reduces the communication load and mitigates privacy concerns. However, sequential decision-making and potentially adversarial cost functions bring new challenges to this problem. In particular, as opposed to usual federated optimization, where we want to come up with one good final model, the goal here is to develop a sequence of instantaneously good models. This is a natural requirement in many applications, such as a voice assistant (Hao, 2020; Google, 2023), where the service needs to improve as it is used.
These challenges require designing new methods that benefit from collaboration while attaining small regret. Furthermore, we want to solve problem (1) in the intermittent communication (IC) setting (Woodworth et al., 2018; 2021), where the machines work in parallel and are allowed to communicate R times, with K time steps between consecutive communication rounds. The IC setting has been widely studied over the past decade (Zinkevich et al., 2010; Cotter et al., 2011; Dekel et al., 2012; Zhang et al., 2013; 2015; Shamir et al., 2014; Stich, 2018; Dieuleveut & Patel, 2019; Woodworth et al., 2020a; Bullins et al., 2021; Patel et al., 2022), and it captures the expensive nature of communication in collaborative learning, such as in cross-device federated learning (McMahan et al., 2016; Kairouz et al., 2019).

Limitations of Recent Attempts. Although many recent attempts (Wang et al., 2020; Dubey & Pentland, 2020; Huang et al., 2021; Li & Wang, 2022; He et al., 2022; Gauthier et al., 2022; Gogineni et al., 2022; Dai & Meng, 2022; Kuh, 2021; Mitra et al., 2021) have been made towards tackling problem (1), most existing works (Wang et al., 2020; Dubey & Pentland, 2020; Huang et al., 2021; Li & Wang, 2022) study problem (1) in the stochastic setting, where the $\{f_t^m\}$'s are sampled from distributions specified in advance. However, real-world applications may have distribution shifts, un-modeled perturbations, or even an adversarial sequence of cost functions, all of which violate the fixed-distribution assumption and thus break the above problem setups. To alleviate this issue, in this paper we extend our understanding of distributed online and bandit convex optimization, i.e., problem (1), to adaptive adversaries that could potentially generate a worst-case sequence of cost functions. Although some recent works have underlined the importance of the adaptive setting (Gauthier et al., 2022; Gogineni et al., 2022; Dai & Meng, 2022; Kuh, 2021; Mitra et al., 2021; He et al., 2022), our understanding of the optimal regret guarantees for problem (1) with different classes of algorithms is still lacking.

Our Contributions. Our work is the first attempt towards a systematic understanding of federated online optimization with limited feedback. It attains tight regret bounds (in at least some regimes) in the intermittent communication setting for both first- and zeroth-order feedback, and it bridges the gap between the stochastic and adaptive settings in federated online optimization. The following summarizes the main contributions of our work:

Federated Online Optimization with First-Order (Gradient) Feedback. We first show in Section 3 that, under usual assumptions, there is no benefit of collaboration if all the machines have access to the gradients of their cost functions, a.k.a. first-order feedback. Specifically, in this setting, running online gradient descent on each device without any communication is min-max optimal for problem (1) (c.f., Theorems 3.1, 3.2). This motivates us to study weaker forms of feedback/oracle access where collaboration can be provably useful.

Federated Online Optimization with Zeroth-Order (Bandit) Feedback. We then study the problem of federated adversarial linear bandits in Section 4, which is an important application of problem (1) and has not been studied in the federated learning literature. We propose a novel federated projected online stochastic gradient descent algorithm to solve this problem (c.f., Algorithm 2).
Our algorithm only requires a single point of bandit feedback for every function. We show that when the problem dimension $d$ is large, our algorithm can outperform the non-collaborative baseline, where each agent solves its problem independently. Additionally, when $d \gtrsim MT/R$, we prove that the proposed algorithm achieves $O(d/\sqrt{MT})$ average regret, where M is the number of agents and R is the number of communication rounds (c.f., Theorem 4.1). These results suggest that one can benefit from collaboration in the more challenging setting of bandit feedback when the problem dimension is large.

Bandit Feedback Beyond Linearity. We next consider the general (non-linear) problem (1) with bandit feedback in Section 5, and study a natural variant of FEDAVG equipped with a stochastic gradient estimator using two-point feedback (Shamir, 2017). We show that collaboration reduces the variance of the stochastic gradient estimator and is thus beneficial for problems of high enough dimension (c.f., Theorems 5.1 and 5.2). We prove a linear speedup in the number of machines for high-dimensional problems, which mimics the stochastic setting with first-order gradient information (Woodworth et al., 2021; Woodworth & Srebro, 2021).

Tighter Rates with Two-Point Bandit Feedback. Additionally, we show (c.f., Corollary 5.3) that one can achieve better regret for federated adversarial linear bandits using two-point feedback, indicating that multi-point feedback can be beneficial for federated adversarial linear bandits.

Characterizing the General Class of Problems. Finally, we characterize the full space of related min-max problems in distributed online optimization, thus connecting the adversarial and stochastic versions of problem (1) (see Appendix A). This underlines how we understand only a small part of the space of problems in federated online optimization, despite more than a decade of work. It also identifies problems at the intersection of sequential and collaborative decision-making for future research.

## 2. Problem Setting

This section introduces the definitions and assumptions used in our analysis. We formalize our adversary class and algorithm class, and we specify the notion of min-max regret.

Notation. We denote the horizon by $T = KR$. The symbols $\lesssim$, $\gtrsim$, $\asymp$ denote inequalities up to numerical constants. We denote the average function by $f_t(\cdot) := \sum_{m\in[M]} f_t^m(\cdot)/M$ for all $t\in[T]$. We use $1_A$ to denote the indicator function of the event A. $B_2(B) \subseteq \mathbb{R}^d$ denotes the $L_2$ ball of radius B centered at zero. We shorten the indexing in the summation $\sum_{t\in[T],m\in[M]}$ to $\sum_{t,m}$ wherever the usage is clear.

Function Classes. We consider several common function classes (Saha & Tewari, 2011; Hazan, 2016) in this paper:

1. $\mathcal{F}_{G,B}$, the class of convex, differentiable, and G-Lipschitz functions, i.e., for all $x, y \in \mathbb{R}^d$, $|f(x) - f(y)| \le G\|x - y\|_2$, with bounded optima $x^\star \in B_2(B)$.
2. $\mathcal{F}_{H,B}$, the class of convex, differentiable, and H-smooth functions, i.e., for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\|_2 \le H\|x - y\|_2$, with bounded optima $x^\star \in B_2(B)$. We also define $\mathcal{F}_{G,H,B} := \mathcal{F}_{G,B} \cap \mathcal{F}_{H,B}$, i.e., the class of Lipschitz and smooth functions.
3. $\mathcal{F}^{lin}_{G,B} \subseteq \mathcal{F}_{G,B}$, which includes linear cost functions with gradients bounded by G. Note that this includes the cost functions of federated adversarial linear bandits, discussed in Section 4.

Adversary Model. Note that in the most general setting, each machine encounters arbitrary functions from a class $\mathcal{F}$ at each time step. Our algorithmic results are for this general model, which is usually referred to as an adaptive adversary.
Specifically, we allow the adversary's functions to depend on the past sequence of models played by the machines but not on the randomness used by the machines, i.e., $\{f_t^m\}_{m\in[M]}$ for all $t$ can depend on $\left(\{x_i^m\}_{m\in[M],\,i\in[t-1]}, A\right)$, where A is the algorithm being used by the machines. We also consider a weaker stochastic adversary model to explain some baseline results. More specifically, this adversary cannot adapt to the sequence of models used by each machine but must fix a distribution in advance for each machine, i.e., for each $m\in[M]$ there is $\mathcal{D}_m \in \Delta(\mathcal{F})$ such that at each time $t\in[T]$, $f_t^m \sim \mathcal{D}_m$. An example of this less challenging model is distributed stochastic optimization, where $f_t^m(\cdot) := f(\cdot\,; z_t^m)$ for $f(\cdot\,;\cdot) \in \mathcal{F}$ and $z_t^m$ a data point sampled from the distribution $\mathcal{D}_m$. For stochastic adversaries, throughout we denote

$$F_m(x) := \mathbb{E}_{f\sim\mathcal{D}_m}[f(x)], \qquad F(x) := \frac{1}{M}\sum_{m\in[M]} F_m(x),$$

where with a slight abuse of notation we suppress $z \sim \mathcal{D}_m$. The distinction between different adversary models is discussed further in Appendix A.

Oracle Model. We consider three kinds of access to the cost functions in this paper. Each machine $m\in[M]$ at every time step $t\in[T]$ has access to one of the following:

1. The gradient of $f_t^m$ at a single point, a.k.a. first-order feedback.
2. The function value of $f_t^m$ at a single point, a.k.a. one-point bandit feedback.
3. The function values of $f_t^m$ at two different points, a.k.a. two-point bandit feedback.

Remark 2.1. Note that we always look at the regret at the queried points. Thus, in the two-point feedback setting, if machine $m$ queries points $x_t^{m,1}$ and $x_t^{m,2}$ at time $t$, it incurs the cost $f_t^m(x_t^{m,1}) + f_t^m(x_t^{m,2})$ (c.f., Theorem 5.1).

Algorithm Class. We assume the algorithm on each machine can depend on all the history it has seen. Since we are in the IC setting, at time $t$ on machine $m$, the algorithm A's output can only depend on $\left(f_1^{1:M}, \ldots, f_{\tau(t)}^{1:M}, f_{\tau(t)+1}^m, \ldots, f_{t-1}^m\right)$, where $\tau(t)$ is the last time step smaller than or equal to $t$ at which communication happened. We assume $T = KR$, so that $\tau(t) := K\lfloor t/K \rfloor$. In other words, the algorithm's output on a machine can depend on all the information (gradients or function values) it has seen on that machine, or on other machines' information communicated to it. We denote this class of algorithms by $\mathcal{A}_{IC}$ and add superscripts 1, 0, (0, 2) to denote first-order, one-point bandit, and two-point bandit feedback. Thus, we consider three algorithm classes $\mathcal{A}^1_{IC}$, $\mathcal{A}^0_{IC}$, $\mathcal{A}^{0,2}_{IC}$ in this paper. A short sketch of this indexing appears after this section's assumptions.

Finally, we consider two more assumptions, controlling how similar the cost functions look across machines and the average cost at the comparator (Srebro et al., 2010):

Assumption 1 (Bounded First-Order Heterogeneity). For all $t\in[T]$ and $x\in\mathbb{R}^d$,
$$\frac{1}{M}\sum_{m\in[M]}\|\nabla f_t^m(x) - \nabla f_t(x)\|_2^2 \le \zeta^2 \le 4G^2.$$

Remark 2.2 (Bounded First-Order Heterogeneity in the Stochastic Setting). In the stochastic setting, Woodworth et al. (2020b) consider a related but more relaxed assumption, i.e., for all $x\in\mathbb{R}^d$,
$$\frac{1}{M}\sum_{m\in[M]}\|\nabla F_m(x) - \nabla F(x)\|_2^2 \le \zeta^2 \le 4G^2.$$
This assumption does not require the gradients at each time step to be point-wise close; it only requires them to be close on average over the draws from the machines' distributions.

Assumption 2 (Bounded Optimality at Optima). For all $x^\star \in \arg\min_{x\in B_2(B)}\sum_{t\in[T]} f_t(x)$,
$$\frac{1}{T}\sum_{t\in[T]} f_t(x^\star) \le F^\star.$$
For non-negative functions in $\mathcal{F}_{H,B}$, this implies $\frac{1}{T}\sum_{t\in[T]}\|\nabla f_t(x^\star)\|_2^2 \lesssim HF^\star$ (c.f., Lemma 4.1 in Srebro et al. (2010)).

Remark 2.3 (Bounded Optimality in the Stochastic Setting). When the functions are generated stochastically, denote $F^\star := \min_{x\in B_2(B)} F(x)$ as the minimum realizable function value. Then, assuming each $\mathcal{D}_m$ is supported on functions in $\mathcal{F}_{H,B}$, we have for all $t\in[T]$,
$$\mathbb{E}_{\{f_t^m\sim\mathcal{D}_m\}_{m\in[M]}}\left[\frac{1}{M}\sum_{m\in[M]} f_t^m(x^\star)\right] \le F^\star$$
(c.f., Lemma 3 in Woodworth & Srebro (2021)).
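As promised above, here is a minimal Python sketch of the intermittent-communication indexing; `tau` matches the definition $\tau(t) = K\lfloor t/K\rfloor$, and the values of K and R are illustrative only.

```python
# A minimal sketch of the intermittent communication (IC) indexing with T = K*R.
# tau(t) is the most recent communication step at or before t; between
# communications, a machine only sees its own feedback.
def tau(t: int, K: int) -> int:
    return K * (t // K)

K, R = 4, 3
for t in range(K * R):
    communicate = (t + 1) % K == 0  # machines synchronize after this step
    print(f"t={t:2d}  shared history up to tau(t)={tau(t, K):2d}  communicate={communicate}")
```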
Min-Max Regret. We now define our problem class. We use $\mathcal{P}_{M,K,R}(\mathcal{F}) := \mathcal{F}^{MKR}$ to denote all selections of MKR functions from a class $\mathcal{F}$. We use the arguments $\zeta$, $F^\star$ to further restrict this to selections that satisfy Assumptions 1 and 2, respectively. In this paper, we consider four problem classes: $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$, $\mathcal{P}_{M,K,R}(\mathcal{F}_{H,B},\zeta,F^\star)$, $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,H,B},\zeta,F^\star)$, and $\mathcal{P}_{M,K,R}(\mathcal{F}^{lin}_{G,B},\zeta)$. We are interested in characterizing the min-max regret for these classes. In particular, for a problem class $\mathcal{P}$ and algorithm class $\mathcal{A}$, we want to characterize, up to numerical constants, the quantity

$$\mathcal{R}(\mathcal{P}) := \min_{A\in\mathcal{A}}\,\max_{P\in\mathcal{P}}\ \mathbb{E}_A\left[\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x^\star)\right], \tag{2}$$

where A is a randomized algorithm producing the models $x_t^m$. We add superscripts 0, (0, 2), 1 to $\mathcal{R}$ to denote the oracle access of the min-player's algorithm. The second player, i.e., the adversary selecting the functions, does not benefit from randomization in problem (2), which is why we ignore it here¹. It is also important to note that the max-player does not have access to the randomness of the min-player (c.f., problem (P1) in Appendix A). Finally, note again that the max-player can adapt to the sequence of models played by A, which is why we call this regret minimization against an adaptive adversary. We want to characterize the min-max value of this game, i.e., $\mathcal{R}(\mathcal{P})$, in this paper. Most existing work (Khaled et al., 2020; Woodworth et al., 2020b; Patel et al., 2022; Wang et al., 2022) in federated learning has instead focused on characterizing the min-max regret for the following simpler problem²:

$$\mathcal{R}_{stoc.}(\mathcal{P}) := \min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}_{m\in[M],t\in[T]}}\left[\frac{1}{MT}\sum_{t,m} f_t^m(x_t^m)\right] - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t,m}\mathbb{E}_{f_t^m\sim\mathcal{D}_m}\left[f_t^m(x^\star)\right]$$
$$= \min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}}\left[\frac{1}{MT}\sum_{t,m} F_m(x_t^m)\right] - \min_{x^\star\in B_2(B)} F(x^\star), \tag{3}$$

where $F_m(\cdot) := \mathbb{E}_{f\sim\mathcal{D}_m}[f(\cdot)]$. This problem has a stochastic adversary. The min-max complexity of this easier problem is understood in the homogeneous setting, where the machines have the same distribution (Woodworth et al., 2021; Woodworth, 2021). However, it is not yet fully understood in the heterogeneous setting, where the distributions are allowed to differ across machines (Woodworth et al., 2020b; Wang et al., 2022; Glasgow et al., 2022). In general, note that for any problem class $\mathcal{P}$, $\mathcal{R}_{stoc.}(\mathcal{P}) \le \mathcal{R}(\mathcal{P})$, and we are interested in understanding the larger quantity $\mathcal{R}(\mathcal{P})$. We further discuss some related distributed problems and their potential applications in Appendix A.

¹It makes sense to make the randomization of the second player explicit when comparing to a weaker comparator, c.f., problem (P3) in Appendix A.

²These works aim to upper bound the function sub-optimality at a single point, but most of their analyses actually provide regret guarantees that are converted to function sub-optimality bounds at the averaged iterate by applying Jensen's inequality.

## 3. Collaboration Does Not Help with First-Order Feedback

In this section, we explain why collaboration does not improve the min-max regret $\mathcal{R}(\mathcal{P})$ for the adaptive problem when using first-order oracles, even though it does improve $\mathcal{R}_{stoc.}(\mathcal{P})$ for the stochastic problem. We first consider the class of algorithms in $\mathcal{A}^1_{IC}$, i.e., algorithms that have access to one gradient per cost function on each machine at each time step. This class is very strong in the serial setting, i.e., when M = 1.
For instance, online gradient descent (OGD) (Zinkevich et al., 2010; Hazan, 2016) attains the min-max regret for both the problem classes $\mathcal{P}_{1,K,R}(\mathcal{F}_{G,B})$ and $\mathcal{P}_{1,K,R}(\mathcal{F}_{H,B},F^\star)$ (Woodworth & Srebro, 2021). This raises the question of whether a distributed version of OGD can also be shown to be min-max optimal when M > 1. The answer is, unfortunately, no! To state this more formally, we first introduce a trivial non-collaborative algorithm in the distributed setting: we run online gradient descent independently on each machine without any communication (see Algorithm 1). We prove the following result for this algorithm (which lies in the class $\mathcal{A}^1_{IC}$), essentially showing that the non-collaborative baseline is min-max optimal.

Algorithm 1: Non-collaborative OGD ($\eta$)
1: Initialize $x_0^m = 0$ on all machines $m\in[M]$
2: for $t \in \{0, \ldots, KR-1\}$ do
3: &nbsp;&nbsp; for $m \in [M]$ in parallel do
4: &nbsp;&nbsp;&nbsp;&nbsp; Play model $x_t^m$ and see function $f_t^m(\cdot)$
5: &nbsp;&nbsp;&nbsp;&nbsp; Incur loss $f_t^m(x_t^m)$
6: &nbsp;&nbsp;&nbsp;&nbsp; Compute the gradient at point $x_t^m$, i.e., $\nabla f_t^m(x_t^m)$
7: &nbsp;&nbsp;&nbsp;&nbsp; $x_{t+1}^m \leftarrow x_t^m - \eta\,\nabla f_t^m(x_t^m)$

Theorem 3.1 (Optimal Bounds with Non-collaborative OGD for Lipschitz Functions). Algorithm 1 incurs an average regret of $O(GB/\sqrt{T})$ for the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$. This regret is optimal for $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$ among algorithms in $\mathcal{A}^1_{IC}$, where $0 \le \zeta \le 2G$.

Proof. We first prove the upper bound on the average regret of non-collaborative OGD and then show that it is optimal, i.e., equals $\mathcal{R}^1(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta))$. Note that the following bound is always true for any stream of functions and any sequence of models; we are just changing the comparator:

$$\frac{1}{M}\sum_{m\in[M]}\left[\sum_{t\in[KR]} f_t^m(x_t^m) - \min_{x^{m,\star}\in B_2(B)}\sum_{t\in[KR]} f_t^m(x^{m,\star})\right] \ge \frac{1}{M}\sum_{t\in[KR],m\in[M]} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\sum_{t\in[KR]} f_t(x^\star).$$

This means we can upper bound $\mathcal{R}^1(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta))$ by running online gradient descent (OGD) independently on each machine without collaboration. In other words, for functions from $\mathcal{F}_{G,B}$, we have:

$$\mathcal{R}^1\left(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)\right) \le \mathcal{R}^1\left(\mathcal{P}_{1,K,R}(\mathcal{F}_{G,B})\right) \lesssim \frac{GB}{\sqrt{T}}. \tag{4}$$

The min-max rate for a single machine follows from classical results for vanilla OGD (c.f., Theorem 3.1 in Hazan (2016)). Now we prove that this average regret is optimal by noting the following:

$$\mathcal{R}^1\left(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)\right) \ge \min_{A\in\mathcal{A}^1_{IC}}\,\max_{\{f_t\in\mathcal{F}\}_{t\in[T]}}\ \mathbb{E}_A\left[\frac{1}{MT}\sum_{m,t} f_t(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{T}\sum_t f_t(x^\star)\right] \tag{5}$$
$$\ge \min_{A\in\mathcal{A}^1_{IC}}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{m,t} f_t(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{T}\sum_t f_t(x^\star)\right] \gtrsim \frac{GB}{\sqrt{T}}, \tag{6}$$

where $\mathcal{F}$ denotes $\mathcal{F}_{G,B}$ in the first inequality. This inequality holds because we choose a weaker adversary that picks functions that do not vary across the machines but may change over time (which implies that $\zeta = 0$). The second inequality holds because we choose an even weaker adversary that fixes the distribution $\mathcal{D}$ for picking these functions in advance. Furthermore, the last inequality follows from choosing $\mathcal{D}$ as the usual lower bound construction for serial first-order online convex optimization (Hazan, 2016). We include the construction in Appendix B. Combining equations (4) and (6) proves the claim, showing that non-collaborative OGD attains the min-max regret.

As the lower bound instance is linear, it is easy to show that the above theorem also holds for the linear functions $\mathcal{F}^{lin}_{G,B}$. We can use the same proof strategy to prove a similar result for smooth functions. The upper bound in the serial setting follows from a classical work on optimistic rates (c.f., Theorem 3 in Srebro et al. (2010)), while the lower bound follows from a more recent paper (c.f., Theorem 4 in Woodworth & Srebro (2021) and Appendix B).

Theorem 3.2 (Optimal Bounds with Non-collaborative OGD for Smooth Functions). Algorithm 1 incurs the optimal regret of $\Theta\!\left(\frac{HB^2}{T} + \frac{\sqrt{HF^\star}B}{\sqrt{T}}\right)$ for the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}_{H,B},\zeta,F^\star)$.
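Both theorems concern Algorithm 1 above. For concreteness, here is a minimal sketch of the non-collaborative baseline in Python; the function name and the gradient-callback interface are ours, for illustration only.

```python
import numpy as np

def noncollaborative_ogd(grads, M, T, d, eta):
    """Algorithm 1: run OGD independently on each machine, with no communication.

    grads[m][t] is a callable returning the gradient of f_t^m at a point."""
    x = np.zeros((M, d))              # x_0^m = 0 on every machine
    played = []
    for t in range(T):
        played.append(x.copy())       # each machine plays x_t^m and then sees f_t^m
        for m in range(M):            # in parallel across machines
            x[m] -= eta * grads[m][t](x[m])
    return played
```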
Implications of Theorems 3.1 and 3.2: The above theorems imply that, in the worst case, there is no benefit of collaboration if the machines already have access to gradient information! This is counter-intuitive at first, because several works have shown that in the stochastic setting, $\mathcal{R}_{stoc.}(\mathcal{P})$ indeed improves with collaboration in several regimes (Woodworth et al., 2020b; Koloskova et al., 2020). How do we reconcile these results? Note that while proving Theorem 3.1, we crucially rely on equation (5), i.e., lower bounding by the min-max regret against an adversary that can put the same function on each machine at every time step. Let's compare it against the expression for the stochastic min-max regret. One key difference is that in equation (5), $x^\star$ is allowed to depend on the random draws of $f_t$ from $\mathcal{D}$, which makes the comparator smaller and regret minimization harder. In particular, for any problem class $\mathcal{P}$, we have

$$\mathcal{R}^1_{stoc.}(\mathcal{P}) \le \min_{A\in\mathcal{A}^1_{IC}}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{m,t} f_t(x_t^m)\right] - \min_{x^\star\in B_2(B)}\ \mathbb{E}_{\{f_t\sim\mathcal{D}\}}\left[\frac{1}{T}\sum_t f_t(x^\star)\right]. \tag{7}$$

This is the key reason why there is a benefit of collaboration in stochastic federated optimization, i.e., the problem captured by $\mathcal{R}_{stoc.}(\mathcal{P})$: the lower bound in equation (5) does not apply to that problem. We discuss this issue in more detail in Appendix A while mapping out the space of federated online optimization problems, where we explicitly show (c.f., Figure 1 in Appendix A) that there is no contradiction between the theorems in this section and most existing results, which do show a benefit of collaboration with first-order oracles (Woodworth et al., 2020a;b).

## 4. Collaboration Helps with Bandit Feedback

In this section, we move our attention to the more challenging setting of using bandit (zeroth-order) feedback at each machine. We begin by studying one important application of problem (1), i.e., federated adversarial linear bandits. We then investigate the more general problem of federated bandit convex optimization with two-point feedback in the next section.

Algorithm 2: FEDPOSGD ($\eta$, $\delta$) with one-point bandit feedback
1: Initialize $x_0^m = 0$ on all machines $m\in[M]$
2: for $t \in \{0, \ldots, KR-1\}$ do
3: &nbsp;&nbsp; for $m \in [M]$ in parallel do
4: &nbsp;&nbsp;&nbsp;&nbsp; $w_t^m = \mathrm{Proj}(x_t^m)$
5: &nbsp;&nbsp;&nbsp;&nbsp; Sample $u_t^m \sim \mathrm{Unif}(S^{d-1})$, i.e., a random unit vector
6: &nbsp;&nbsp;&nbsp;&nbsp; Query function $f_t^m$ at point $w_t^{m,1} = w_t^m + \delta u_t^m$
7: &nbsp;&nbsp;&nbsp;&nbsp; Incur loss $f_t^m(w_t^m + \delta u_t^m)$
8: &nbsp;&nbsp;&nbsp;&nbsp; Compute the stochastic gradient at point $w_t^m$ as $g_t^m = d\,f_t^m(w_t^m + \delta u_t^m)\,u_t^m/\delta$
9: &nbsp;&nbsp;&nbsp;&nbsp; if $(t+1) \bmod K = 0$ then
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Communicate to server: $(x_t^m - \eta g_t^m)$
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; On server: $\bar x_{t+1} \leftarrow \frac{1}{M}\sum_{m\in[M]}(x_t^m - \eta g_t^m)$
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Communicate to machine: $x_{t+1}^m \leftarrow \bar x_{t+1}$
13: &nbsp;&nbsp;&nbsp;&nbsp; else
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t+1}^m \leftarrow x_t^m - \eta g_t^m$

Federated Adversarial Linear Bandits. One important application of problem (1) is federated linear bandits, which has received increasing attention in recent years. However, most existing works (Wang et al., 2020; Huang et al., 2021; Li & Wang, 2022; He et al., 2022) do not consider the more challenging adaptive adversaries, leaving it unclear whether collaboration can be beneficial in this scenario. Therefore, we propose to study federated adversarial linear bandits, a natural extension of single-agent adversarial linear bandits (Bubeck et al., 2012) to the federated optimization setting. Specifically, at each time $t\in[T]$, each agent $m\in[M]$ chooses an action $x_t^m \in \mathbb{R}^d$ while the environment simultaneously picks $\beta_t^m \in B_2(G) \subseteq \mathbb{R}^d$. Agent $m$ then suffers the loss $f_t^m(x_t^m) = \langle\beta_t^m, x_t^m\rangle$. Our goal is to output a sequence of models $\{x_t^m\}_{m\in[M],t\in[T]}$ that minimizes the following expected regret:

$$\mathbb{E}\left[\sum_{m,t}\langle\beta_t^m, x_t^m\rangle - \min_{x^\star\in B_2(B)}\sum_{m,t}\langle\beta_t^m, x^\star\rangle\right],$$

where the expectation is taken w.r.t. the randomness of the model selection.
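To make the interaction protocol concrete, here is a minimal sketch of one round for one agent; the adversary here is random purely for illustration (the paper's adversary may pick $\beta_t^m$ adaptively), and all names are ours.

```python
import numpy as np

def linear_bandit_round(x, beta):
    """One round for one agent: play x, the environment picks beta in B_2(G),
    and the agent observes only the scalar loss <beta, x> (bandit feedback)."""
    return float(beta @ x)

rng = np.random.default_rng(1)
d, G = 5, 1.0
x = np.zeros(d)                      # the agent's action
beta = rng.standard_normal(d)
beta *= G / np.linalg.norm(beta)     # illustrative adversary choice in B_2(G)
loss = linear_bandit_round(x, beta)  # only this value is revealed to the agent
```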
To solve this federated adversarial linear bandit problem, we propose a novel projected online stochastic gradient descent algorithm with one-point feedback, which we call FEDPOSGD, illustrated in Algorithm 2. At the core of Algorithm 2 is the gradient estimator $g_t^m$ constructed using one-point feedback (see line 8). This approach is motivated by Flaxman et al. (2004), with the key distinction that our gradient estimator is calculated at the projected point $w_t^m = \mathrm{Proj}(x_t^m) = \arg\min_{\|w\|_2\le B}\|w - x_t^m\|_2$ (see line 4). For the linear cost function $f_t^m(x) = \langle\beta_t^m, x\rangle$, we can show that $g_t^m$ is an unbiased gradient estimator, $\mathbb{E}_{u_t^m}[g_t^m] = \nabla f_t^m(x)$, with variance (Hazan, 2016)

$$\mathbb{E}_{u_t^m}\left[\|g_t^m - \nabla f_t^m(x)\|_2^2\right] \le \left(\frac{d\,\|\beta_t^m\|_2\,(\|x\|_2 + \delta)}{\delta}\right)^2.$$

Therefore, the projection step in Algorithm 2 ensures that the variance of our gradient estimator is bounded. However, the extra projection step can make it difficult to benefit from collaboration when we have gradient estimators from multiple agents. To address this issue, we propose to perform the gradient update in the unprojected space, i.e., $x_t^m - \eta g_t^m$ (see line 14), instead of the projected space, i.e., $\mathrm{Proj}(w_t^m - \eta g_t^m)$, which is motivated by lazy mirror descent based methods (Nesterov, 2009; Bubeck et al., 2015; Yuan et al., 2021). We obtain the following guarantee for Algorithm 2 for federated adversarial linear bandits.

Theorem 4.1 (Regret Guarantee of Algorithm 2 for Federated Adversarial Linear Bandits). For the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}^{lin}_{G,B},\zeta)$, with $\delta = B$ and an appropriately chosen step size $\eta$ (specified in the proof in Appendix C), the queried points $\{w_t^{m,1}\}_{t,m=1}^{T,M}$ of Algorithm 2 satisfy

$$\frac{1}{MKR}\sum_{t\in[KR],m\in[M]} \mathbb{E}\left[f_t^m(w_t^{m,1}) - f_t^m(x^\star)\right] \lesssim \frac{GBd}{\sqrt{MKR}} + \frac{GB}{\sqrt{KR}} + 1_{K>1}\left(\frac{GB\sqrt{d}}{\sqrt{R}} + \frac{\sqrt{\zeta G}\,B}{\sqrt{R}}\right),$$

where $x^\star \in \arg\min_{x\in B_2(B)}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the choice of function queries.

Implication of Theorem 4.1: When degenerated to single-agent adversarial linear bandits, FEDPOSGD with K = 1, M = 1 achieves $O(GBd/\sqrt{T})$ average regret, which matches the optimal average regret (Dani et al., 2007; Hazan, 2016) using one-point feedback with an unconstrained action space. When K > 1 and M > 1, we would like to compare our results to the non-collaborative baseline, as we did in Section 3. More specifically, the non-collaborative baseline³ obtains an average regret of $O(GBd/\sqrt{KR})$. If $d \gtrsim K$, FEDPOSGD outperforms the non-collaborative baseline. Furthermore, if $d \gtrsim KM$, then the average regret of FEDPOSGD is dominated by $O(GBd/\sqrt{MKR})$. As a result, FEDPOSGD achieves a linear speed-up in the number of machines compared to the non-collaborative baseline. Although a smaller $\zeta$ leads to a smaller regret bound, we cannot truly benefit from a small $\zeta$: the $\zeta$-dependent term $\sqrt{\zeta G}\,B/\sqrt{R}$ in the average regret bound is dominated by the other terms when the problem dimension $d$ is large.

³For the non-collaborative baseline, we can run SCRIBLE (Hazan, 2016) independently on each machine, or run Algorithm 1 using the gradient estimator proposed by Flaxman et al. (2004) with an extra projection step, i.e., $x_{t+1}^m \leftarrow \mathrm{Proj}(x_{t+1}^m)$.
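To make the estimator and the lazy update concrete, here is a minimal sketch of one local FEDPOSGD step in Python; the function names are ours, and the snippet is illustrative rather than the paper's implementation.

```python
import numpy as np

def proj(x, B):
    """Euclidean projection onto B_2(B), the L2 ball of radius B."""
    n = np.linalg.norm(x)
    return x if n <= B else (B / n) * x

def fedposgd_local_step(f, x, eta, delta, B, rng):
    """One local step of FEDPOSGD (Algorithm 2) on a single machine."""
    d = x.shape[0]
    w = proj(x, B)                          # line 4: query around the projected point
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                  # line 5: u ~ Unif(S^{d-1})
    g = (d / delta) * f(w + delta * u) * u  # line 8: one-point gradient estimate
    return x - eta * g                      # line 14: lazy update in the unprojected space
```

At communication rounds (lines 9-12), the server would average the locally updated iterates $x_t^m - \eta g_t^m$ across machines; between rounds, each machine just repeats this local step.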
Limitations of Algorithm 2. Based on the average regret, we can see that: (1) the proposed one-point feedback algorithm, FEDPOSGD, requires an extra projection step; (2) the average regret bound of FEDPOSGD increases linearly with the problem dimension $d$; and (3) the average regret of FEDPOSGD cannot benefit from small heterogeneity in the regime where FEDPOSGD outperforms the non-collaborative baseline. To address these issues, we propose to use a two-point feedback algorithm in the next section.

## 5. Better Rates with Two-Point Bandit Feedback

In this section, we study distributed bandit convex optimization with two-point feedback, i.e., at each time step, the machines can query the value (but not the full gradient) of their cost functions at two different points. We show an improved regret guarantee for general Lipschitz and smooth functions and then specify the improvements for federated adversarial linear bandits. The two-point feedback structure is useful for single-agent bandit convex optimization, as it can help attain the optimal horizon dependence in the regret (Duchi et al., 2015; Shamir, 2017) using simple algorithms. We consider a general convex cost function rather than the linear cost functions discussed in the last section. We analyze the online variant of the FEDAVG or LOCAL-SGD algorithm, which is commonly used in the stochastic setting. We refer to this algorithm as FEDOSGD and describe it in Algorithm 3.

Algorithm 3: FEDOSGD ($\eta$, $\delta$) with two-point bandit feedback
1: Initialize $x_0^m = 0$ on all machines $m\in[M]$
2: for $t \in \{0, \ldots, KR-1\}$ do
3: &nbsp;&nbsp; for $m \in [M]$ in parallel do
4: &nbsp;&nbsp;&nbsp;&nbsp; Sample $u_t^m \sim \mathrm{Unif}(S^{d-1})$, i.e., a random unit vector
5: &nbsp;&nbsp;&nbsp;&nbsp; Query function $f_t^m$ at points $(x_t^{m,1}, x_t^{m,2}) := (x_t^m + \delta u_t^m,\, x_t^m - \delta u_t^m)$
6: &nbsp;&nbsp;&nbsp;&nbsp; Incur loss $f_t^m(x_t^m + \delta u_t^m) + f_t^m(x_t^m - \delta u_t^m)$
7: &nbsp;&nbsp;&nbsp;&nbsp; Compute the stochastic gradient at point $x_t^m$ as $g_t^m = \frac{d\,\left(f_t^m(x_t^m + \delta u_t^m) - f_t^m(x_t^m - \delta u_t^m)\right)}{2\delta}\,u_t^m$
8: &nbsp;&nbsp;&nbsp;&nbsp; if $(t+1) \bmod K = 0$ then
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Communicate to server: $(x_t^m - \eta g_t^m)$
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; On server: $\bar x_{t+1} \leftarrow \frac{1}{M}\sum_{m\in[M]}(x_t^m - \eta g_t^m)$
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Communicate to machine: $x_{t+1}^m \leftarrow \bar x_{t+1}$
12: &nbsp;&nbsp;&nbsp;&nbsp; else
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t+1}^m \leftarrow x_t^m - \eta g_t^m$

The key idea in FEDOSGD is the gradient estimator constructed with two-point feedback (see line 7), originally proposed by Shamir (2017) and based on a similar estimator by Duchi et al. (2015). For a smoothed version of the function $f_t^m$, i.e., $\hat f_t^m(x) := \mathbb{E}_{u_t^m}[f_t^m(x + \delta u_t^m)]$, the gradient estimator $g_t^m$ is unbiased, $\mathbb{E}_{u_t^m}[g_t^m] = \nabla\hat f_t^m(x)$, with variance (c.f., Lemmas 3 and 5 in Shamir (2017))

$$\mathbb{E}_{u_t^m}\left[\|g_t^m - \nabla\hat f_t^m(x)\|_2^2\right] \lesssim dG^2,$$

where $G$ is the Lipschitz parameter of $f_t^m$. Equipped with this gradient estimator (sketched below), we can prove the following guarantee for $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$ using FEDOSGD.
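The following is a minimal sketch of the two-point estimator from line 7 of Algorithm 3; the function name is ours, and the snippet is illustrative only.

```python
import numpy as np

def two_point_gradient(f, x, delta, rng):
    """Two-point estimator from line 7 of Algorithm 3, following Shamir (2017):
    unbiased for the smoothed function f_hat(x) = E_u[f(x + delta*u)],
    with variance on the order of d * G^2 for G-Lipschitz f."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # u ~ Unif(S^{d-1})
    return d * (f(x + delta * u) - f(x - delta * u)) / (2 * delta) * u
```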
Theorem 5.1 (Better Bounds with Two-Point Feedback for Lipschitz Functions). For the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$, with appropriately chosen $\eta$ and $\delta$ (specified in Appendix D), the queried points $\{x_t^{m,j}\}_{t,m,j=1}^{T,M,2}$ of Algorithm 3 satisfy:

$$\frac{1}{2MKR}\sum_{t\in[KR],m\in[M],j\in[2]} \mathbb{E}\left[f_t^m(x_t^{m,j}) - f_t^m(x^\star)\right] \lesssim \frac{GB\sqrt{d}}{\sqrt{MKR}} + \frac{GB}{\sqrt{KR}} + 1_{K>1}\left(\frac{GBd^{1/4}}{\sqrt{R}} + \frac{\sqrt{\zeta G}\,B}{\sqrt{R}}\right),$$

where $x^\star \in \arg\min_{x\in B_2(B)}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the choice of function queries.

Implication of Theorem 5.1: When K = 1, the above average regret reduces to the first two terms, which are known to be tight for two-point bandit feedback (Duchi et al., 2015; Hazan, 2016) (see Appendix D), making FEDOSGD optimal. When K > 1, we compare to the non-collaborative baseline, as in Section 3. The non-collaborative baseline⁴ attains an average regret of $O(GB\sqrt{d}/\sqrt{KR})$. Thus, as long as $d \gtrsim K^2$, FEDOSGD performs better than the non-collaborative baseline. Furthermore, if $d \gtrsim K^2M^2$, then the average regret of FEDOSGD is dominated by $O(GB\sqrt{d}/\sqrt{MKR})$. Therefore, FEDOSGD achieves a linear speed-up in the number of machines compared to the non-collaborative baseline. Unfortunately, the bound does not improve with smaller $\zeta$.

⁴For the non-collaborative baseline, we run Algorithm 1 using the gradient estimator proposed by Shamir (2017).

Note that the Lipschitzness assumption is crucial for the two-point gradient estimator in Algorithm 3 to control its variance. While there are gradient estimators that do not require Lipschitzness or bounded gradients (Flaxman et al., 2004), they impose stronger assumptions, such as bounded function values, or necessitate extra projection steps (see the gradient estimator in Algorithm 2). To avoid making these assumptions, we instead focus on problems in $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,H,B},\zeta,F^\star)$ rather than problems in $\mathcal{P}_{M,K,R}(\mathcal{F}_{H,B},\zeta,F^\star)$.

Theorem 5.2 (Better Bounds with Two-Point Feedback for Smooth Functions). Consider the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,H,B},\zeta,F^\star)$. If we choose appropriate $\eta, \delta$ (c.f., Lemma E.1 in Appendix E), the queried points $\{x_t^{m,j}\}_{t,m,j=1}^{T,M,2}$ of Algorithm 3 satisfy:

$$\frac{1}{2MKR}\sum_{t,m,j} \mathbb{E}\left[f_t^m(x_t^{m,j}) - f_t^m(x^\star)\right] \lesssim \frac{HB^2}{MKR} + \frac{GB\sqrt{d}}{\sqrt{MKR}} + \min\left\{\frac{GB}{\sqrt{KR}},\ 1_{K>1}\left(\frac{H^{1/3}B^{4/3}G^{2/3}d^{1/3}}{K^{1/3}R^{2/3}} + \frac{H^{1/3}B^{4/3}\zeta^{2/3}}{R^{2/3}} + \frac{\sqrt{\zeta G}\,Bd^{1/4}}{K^{1/4}\sqrt{R}}\right)\right\},$$

where $x^\star \in \arg\min_{x\in B_2(B)}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the choice of function queries. Furthermore, we can obtain the same average regret as in Theorem 5.1 with the corresponding step size.

The above result is a bit technical, so to interpret it, we consider the simpler class $\mathcal{F}^{lin}_{G,B}$ of linear functions with bounded gradients. Linear functions are the smoothest Lipschitz functions, as their smoothness constant is H = 0. We get the following guarantee (see Appendix G.1):

Corollary 5.3 (Better Bounds with Two-Point Feedback for Linear Functions). Consider the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}^{lin}_{G,B},\zeta,F^\star)$. If we choose the same $\eta$ and $\delta$ as in Theorem 5.2, the queried points $\{x_t^{m,j}\}_{t,m,j=1}^{T,M,2}$ of Algorithm 3 satisfy:

$$\frac{1}{2MKR}\sum_{t,m,j} \mathbb{E}\left[f_t^m(x_t^{m,j}) - f_t^m(x^\star)\right] \lesssim \frac{GB\sqrt{d}}{\sqrt{MKR}} + 1_{K>1}\,\frac{\sqrt{\zeta G}\,Bd^{1/4}}{K^{1/4}\sqrt{R}},$$

where $x^\star \in \arg\min_{x\in B_2(B)}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the choice of function queries.

Implications of Theorem 5.2 and Corollary 5.3: Unlike general Lipschitz functions, the $\zeta$-dependent terms in the average regret bound for linear functions vanish when $\zeta = 0$, and the upper bound is smaller for smaller $\zeta$. In fact, when K = 1 or $\zeta = 0$, the upper bound reduces to $O(GB\sqrt{d}/\sqrt{MKR})$, which is optimal (c.f., Appendix B). These results show that FEDOSGD can benefit from small heterogeneity. More generally, when $K \lesssim \max\left\{1, \frac{\zeta^2 d}{G^2}, \frac{G^2 d}{\zeta^2 M^2}\right\}$, FEDOSGD again achieves the optimal average regret of $O(GB\sqrt{d}/\sqrt{MKR})$. Recall that in this setting, the non-collaborative baseline⁵ obtains an average regret of $O(GB\sqrt{d}/\sqrt{KR})$. Thus, the benefit of collaboration through FEDOSGD again appears in high-dimensional problems in $\mathcal{P}_{M,K,R}(\mathcal{F}^{lin}_{G,B},\zeta,F^\star)$, similar to what we discussed for $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$.

⁵The same baseline as for problems in $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)$.

Single vs. Two-Point Feedback.
For federated adversarial linear bandits, we can directly apply Algorithm 3, i.e., FEDOSGD, and achieve the average regret bound of Corollary 5.3:

$$\lesssim \frac{GB\sqrt{d}}{\sqrt{MKR}} + 1_{K>1}\,\frac{\sqrt{\zeta G}\,Bd^{1/4}}{K^{1/4}\sqrt{R}}.$$

Recall that we can also use the one-point feedback based Algorithm 2, i.e., FEDPOSGD, to get the following average regret bound for federated adversarial linear bandits:

$$\lesssim \frac{GBd}{\sqrt{MKR}} + 1_{K>1}\left(\frac{GB\sqrt{d}}{\sqrt{R}} + \frac{\sqrt{\zeta G}\,B}{\sqrt{R}}\right).$$

According to these results, FEDOSGD significantly improves the average regret bound of FEDPOSGD in terms of the dependence on $d$ and $\zeta$. In addition, FEDOSGD does not require the extra projection step and, unlike FEDPOSGD, can benefit from small heterogeneity. These results also imply that multi-point feedback can be beneficial in federated adversarial linear bandits.

## 6. Conclusion and Future Work

In this paper, we show that, in the setting of distributed bandit convex optimization against an adaptive adversary, the benefit of collaboration is very similar to the stochastic setting with first-order gradient information: collaboration is useful when (i) there is stochasticity in the problem and (ii) the variance of the gradient estimator is high (Woodworth et al., 2021) and reduces with collaboration. There are several open questions and future research directions:

1. Does collaboration provably not help for the smaller class $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,H,B},\zeta,F^\star)$ using algorithms with first-order information?
2. Is the final term tight in Theorems 5.1 and 5.2? We do not know any lower bounds in the intermittent communication setting against an adaptive adversary.
3. When K is large but R is a fixed constant, the average regret of the non-collaborative baseline goes to zero, but our upper bounds for FEDOSGD do not. It is unclear if our analysis is loose or if we need another algorithm.
4. What structures in the problem can we further exploit to reduce communication (c.f., federated stochastic linear bandits (Huang et al., 2021))?

Acknowledgments

We thank the anonymous reviewers who helped us improve the writing of the paper. We would also like to thank Ayush Sekhari, John Duchi, and Ohad Shamir for several useful discussions. This research was partly supported by the NSF-BSF award 1718970, the NSF TRIPOD IDEAL award, and the NSF-Simons funded Collaboration on the Theoretical Foundations of Deep Learning.

References

Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Bubeck, S., et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Bullins, B., Patel, K., Shamir, O., Srebro, N., and Woodworth, B. E. A stochastic Newton algorithm for distributed convex optimization. Advances in Neural Information Processing Systems, 34, 2021.

Chen, M., Mathews, R., Ouyang, T., and Beaufays, F. Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635, 2019.

Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. Advances in Neural Information Processing Systems, 24, 2011.

Dai, S. and Meng, F. Addressing modern and practical challenges in machine learning: A survey of online federated and transfer learning. arXiv preprint arXiv:2202.03070, 2022.

Dani, V., Kakade, S. M., and Hayes, T. The price of bandit information for online optimization.
Advances in Neural Information Processing Systems, 20, 2007.

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.

Dieuleveut, A. and Patel, K. K. Communication trade-offs for local-SGD with large step size. Advances in Neural Information Processing Systems, 32, 2019.

Dubey, A. and Pentland, A. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003-6014, 2020.

Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788-2806, 2015.

Elbir, A. M., Soner, B., and Coleri, S. Federated learning in vehicular networks. arXiv preprint arXiv:2006.01412, 2020.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004.

Gauthier, F., Gogineni, V. C., Werner, S., Huang, Y.-F., and Kuh, A. Resource-aware asynchronous online federated learning for nonlinear regression. In ICC 2022 - IEEE International Conference on Communications, pp. 2828-2833. IEEE, 2022.

Glasgow, M. R., Yuan, H., and Ma, T. Sharp bounds for federated averaging (local SGD) and continuous perspective. In International Conference on Artificial Intelligence and Statistics, pp. 9050-9090. PMLR, 2022.

Gogineni, V. C., Werner, S., Huang, Y.-F., and Kuh, A. Communication-efficient online federated learning framework for nonlinear regression. In ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5228-5232. IEEE, 2022.

Google. Your voice & audio data stays private while Google Assistant improves, 2023. URL https://support.google.com/assistant/answer/10176224?hl=en.

Hao, K. How Apple personalizes Siri without hoovering up your data. Technology Review, 2020.

Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Hartmann, F. Predicting text selections with federated learning, Nov 2021. URL https://ai.googleblog.com/2021/11/predicting-text-selections-with.html.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

He, J., Wang, T., Min, Y., and Gu, Q. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. In Advances in Neural Information Processing Systems, 2022.

Huang, R., Wu, W., Yang, J., and Shen, C. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057-27068, 2021.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519-4529. PMLR, 2020.

Khan, F. K., Flanagan, A., Tan, K. E., Alamgir, Z., and Ammad-Ud-Din, M. A payload optimization method for federated recommender systems. In Fifteenth ACM Conference on Recommender Systems, pp. 432-442, 2021.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp. 5381-5393. PMLR, 2020.

Kuh, A. Real time kernel learning for sensor networks using principles of federated learning. In 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 2089-2093. IEEE, 2021.

Li, C. and Wang, H. Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 6529-6553. PMLR, 2022.

Liang, F., Pan, W., and Ming, Z. FedRec++: Lossless federated recommendation with explicit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 4224-4231, 2021.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

Mitra, A., Hassani, H., and Pappas, G. J. Online federated learning. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 4083-4090. IEEE, 2021.

Nemirovski, A. Efficient methods in convex programming. Lecture notes, 1994.

Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

Nguyen, A., Do, T., Tran, M., Nguyen, B. X., Duong, C., Phan, T., Tjiputra, E., and Tran, Q. D. Deep federated learning for autonomous driving. In 2022 IEEE Intelligent Vehicles Symposium (IV), pp. 1824-1830. IEEE, 2022.

Patel, K. K., Wang, L., Woodworth, B., Bullins, B., and Srebro, N. Towards optimal communication complexity in distributed non-convex optimization. In Advances in Neural Information Processing Systems, 2022.

Saha, A. and Tewari, A. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 636-642. JMLR Workshop and Conference Proceedings, 2011.

Shamir, O. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703-1713, 2017.

Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000-1008. PMLR, 2014.

Shi, C., Shen, C., and Yang, J. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pp. 2917-2925. PMLR, 2021.

Srebro, N., Sridharan, K., and Tewari, A. Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896, 2010.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Wang, J., Das, R., Joshi, G., Kale, S., Xu, Z., and Zhang, T. On the unreasonable effectiveness of federated averaging with heterogeneous data. arXiv preprint arXiv:2206.04723, 2022.

Wang, Y., Hu, J., Chen, X., and Wang, L. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJxZnR4YvB.

Woodworth, B. The minimax complexity of distributed optimization. arXiv preprint arXiv:2109.00534, 2021.

Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., McMahan, B., Shamir, O., and Srebro, N.
Is local SGD better than minibatch SGD? In International Conference on Machine Learning, pp. 10334-10343. PMLR, 2020a.

Woodworth, B. E. and Srebro, N. An even more optimal stochastic optimization algorithm: minibatching and interpolation learning. Advances in Neural Information Processing Systems, 34:7333-7345, 2021.

Woodworth, B. E., Wang, J., Smith, A., McMahan, B., and Srebro, N. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. Advances in Neural Information Processing Systems, 31, 2018.

Woodworth, B. E., Patel, K. K., and Srebro, N. Minibatch vs local SGD for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281-6292, 2020b.

Woodworth, B. E., Bullins, B., Shamir, O., and Srebro, N. The min-max complexity of distributed stochastic convex optimization with intermittent communication. In Conference on Learning Theory, pp. 4386-4437. PMLR, 2021.

Yuan, H., Zaheer, M., and Reddi, S. Federated composite optimization. In International Conference on Machine Learning, pp. 12253-12266. PMLR, 2021.

Zhang, Y., Duchi, J., Jordan, M. I., and Wainwright, M. J. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. Advances in Neural Information Processing Systems, 26, 2013.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299-3340, 2015.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 2010.

## A. Comparison of Related Problems

Figure 1. Summary of the problem space of federated online optimization. An arrow from parent to child denotes that the child min-max problem is easier, i.e., has a lower min-max value than the parent problem. Note that to show that there is no benefit of collaboration for first-order algorithms for problem (P2), we use the lower bound construction for problem (P5). The figure clarifies why this does not contradict the benefit of collaboration for problems (P9) and (P10), as they lie on a different path from the parent (P2).

First, we recall that the algorithm on each machine can depend on all the history it has seen. Since we are in the IC setting, at time $t$ on machine $m$, the algorithm A's output can only depend on $\left(f_1^{1:M}, \ldots, f_{\tau(t)}^{1:M}, f_{\tau(t)+1}^m, \ldots, f_{t-1}^m\right)$, where $\tau(t)$ is the last time step smaller than or equal to $t$ at which communication happened. We also assume that the algorithms can be randomized, i.e., at each time $t$ on machine $m$, the output can also depend on a randomly drawn seed $z_t^m$. With that said, the hardest adversarial problem we can hope to solve, for functions coming from some problem class $\mathcal{P} \subseteq \mathcal{F}^{MT}$, is

$$\min_{A\in\mathcal{A}}\,\max_{P\in\mathcal{P}}\left[\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x^\star)\right]. \tag{P1}$$

For this problem, note that the max-player knows both the algorithm and the randomization of the min-player. Thus, the min-player does not gain from randomization at all. A simpler problem is the following, where the max-player knows the algorithm but not the random seeds:

$$\min_{A\in\mathcal{A}}\,\max_{P\in\mathcal{P}}\ \mathbb{E}_A\left[\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t\in[T],m\in[M]} f_t^m(x^\star)\right]. \tag{P2}$$

Problem (P2) is usually considered when talking about randomized optimization algorithms.
Note that randomization on the second player does not increase the min-max regret, so we can equivalently state (P2) with a randomized max-player as follows:

$$\min_{A\in\mathcal{A}}\,\max_{P\in\Delta(\mathcal{P})}\ \mathbb{E}_{A,P}\left[\frac{1}{MT}\sum_{t,m} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t,m} f_t^m(x^\star)\right]. \tag{P2}$$

However, making the randomization of the max-player explicit makes it easier to state the following easier version of problem (P2), with a weaker comparator:

$$\min_{A\in\mathcal{A}}\,\max_{P\in\Delta(\mathcal{P})}\ \mathbb{E}_A\left[\frac{1}{MT}\sum_{t,m} f_t^m(x_t^m)\right] - \min_{x^\star\in B_2(B)}\ \mathbb{E}_P\left[\frac{1}{MT}\sum_{t,m} f_t^m(x^\star)\right]. \tag{P3}$$

The comparator in problem (P3) does not know the randomization of the max-player. This form of regret is common in the multi-armed bandit literature and is often called "pseudo-regret". In this paper, we upper bound (P2), thus also upper bounding (P3), which might be more relevant for the federated adversarial linear bandits problem. Note that we have only talked about the adaptive setting so far. We can also relax problem (P2) by weakening the adversary. One way to do this is by requiring the functions to be the same across the machines, which leads to the following problem:

$$\min_{A\in\mathcal{A}}\,\max_{\{f_t\in\mathcal{F}\}_{t\in[T]}}\ \mathbb{E}_A\left[\frac{1}{MT}\sum_{t,m} f_t(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{T}\sum_t f_t(x^\star)\right]. \tag{P4}$$

Note that if the functions are shared across the machines, then $\zeta = 0$ in Assumption 1. Depending on the algorithm class, this problem may or may not be equivalent to a fully serial problem, as we showed in this paper. We can further simplify this problem by making the adversary stochastic:

$$\min_{A\in\mathcal{A}}\,\max_{\mathcal{D}\in\Delta(\mathcal{F})}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{t,m} f_t(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{T}\sum_t f_t(x^\star)\right]. \tag{P5}$$

We can also simplify problem (P5) to have a weaker comparator, or, equivalently, consider problem (P3) with a stochastic adversary:

$$\min_{A\in\mathcal{A}}\,\max_{\mathcal{D}\in\Delta(\mathcal{F})}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{t,m}\mathbb{E}_{f_t\sim\mathcal{D}}\left[f_t(x_t^m)\right]\right] - \min_{x^\star\in B_2(B)}\ \mathbb{E}_{f\sim\mathcal{D}}\left[f(x^\star)\right]. \tag{P6}$$

Recalling the definitions of $F_m$ and $F$, this can be re-written as

$$\min_{A\in\mathcal{A}}\,\max_{\mathcal{D}\in\Delta(\mathcal{F})}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{t,m} F(x_t^m)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P6}$$

Now let us relax (P2) directly to have stochastic adversaries, i.e., fixed distributions on each machine. Note that this requires appropriately changing the problem classes' assumptions, as discussed in Remarks 2.2 and 2.3. To simplify the discussion, we assume that the problem class has no additional assumptions and is just a selection of MKR functions from some class $\mathcal{F}$. With this simplification in mind, we can relax (P2) by picking the functions at machine $m$ at time $t$ from a distribution $\mathcal{D}_m$. This leads to the following problem:

$$\min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}_{m\in[M]}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}_{m\in[M],t\in[T]}}\left[\frac{1}{MT}\sum_{t,m} f_t^m(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{MT}\sum_{t,m} f_t^m(x^\star)\right]. \tag{P7}$$

If we weaken the comparator for this problem and recall the definitions $\{F_m := \mathbb{E}_{f\sim\mathcal{D}_m}[f]\}_{m\in[M]}$ and $F := \frac{1}{M}\sum_{m\in[M]} F_m$, we get the following problem:

$$\min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}_{m\in[M]}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}}\left[\frac{1}{MT}\sum_{t,m} F_m(x_t^m)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P8}$$

We note that problem (P8) is the regret minimization version of the usual heterogeneous federated optimization problem (McMahan et al., 2016; Woodworth et al., 2020b). To make the final connection to the usual federated optimization literature, we note that the problem becomes easier if the algorithm can look at all the functions before deciding which model to choose. In other words, it is harder to minimize regret in the online setting than to come up with one final retrospective model.
This means we can simplify problem (P8) to the following problem, where A outputs $\hat x$ after looking at all the functions:

$$\min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}_{m\in[M]}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}}\left[\frac{1}{MT}\sum_{t,m} F_m(\hat x)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P9}$$

With some re-writing of the notation, this reduces to the usual heterogeneous federated optimization problem (McMahan et al., 2016; Woodworth et al., 2020b):

$$\min_{A\in\mathcal{A}}\,\max_{\{\mathcal{D}_m\in\Delta(\mathcal{F})\}_{m\in[M]}}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}_m\}}\left[F(\hat x)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P9}$$

Assuming $\mathcal{D}_m = \mathcal{D}$ for all $m\in[M]$, this reduces to the usual homogeneous federated optimization problem (Woodworth et al., 2020a):

$$\min_{A\in\mathcal{A}}\,\max_{\mathcal{D}\in\Delta(\mathcal{F})}\ \mathbb{E}_{A,\{f_t^m\sim\mathcal{D}\}}\left[F(\hat x)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P10}$$

Note that we can get a similar relaxation of problem (P6) by converting regret minimization into finding one final good solution. The problem looks as follows:

$$\min_{A\in\mathcal{A}}\,\max_{\mathcal{D}\in\Delta(\mathcal{F})}\ \mathbb{E}_{A,\{f_t\sim\mathcal{D}\}_{t\in[T]}}\left[F(\hat x)\right] - \min_{x^\star\in B_2(B)} F(x^\star). \tag{P11}$$

The key difference between (P10) and (P11) is that $\hat x$ depends on MT vs. T samples, respectively, in the two cases. This means (P10) is simpler than (P11). This concludes the discussion; we summarize the comparisons between the different min-max problems in Figure 1. Next, we discuss two relevant lower bounds that follow from this understanding of the problem hierarchy.

## B. Relevant Lower Bound Proofs

We would like to understand lower bounds for problem (P2) from the previous section, i.e., this paper's main quantity of interest, for the relevant problem and algorithm classes. Note that to lower bound (P2), we can lower bound any of its children problems in Figure 1. We first restate a well-known OCO sample complexity lower bound to show that

$$\mathcal{R}^1(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B},\zeta)) = \mathcal{R}^1(\mathcal{P}_{M,K,R}(\mathcal{F}^{lin}_{G,B},\zeta)) = \Theta\left(\frac{GB}{\sqrt{T}}\right). \tag{8}$$

To see this, we recall that (P2) ≥ (P4) ≥ (P5), and then note that for the adversary in (P5), $\zeta = 0$ by design, as all the machines see the same function. Thus, to prove a lower bound, it is sufficient to specify a distribution $\mathcal{D} \in \Delta(\mathcal{F}^{lin}_{G,B}) \subseteq \Delta(\mathcal{F}_{G,B})$ such that for any sequence of models $\{x_t^m\}_{m\in[M],t\in[T]}$,

$$\mathbb{E}_{\{f_t\}_{t\in[T]}}\left[\frac{1}{MT}\sum_{m,t} f_t(x_t^m) - \min_{x^\star\in B_2(B)}\frac{1}{T}\sum_t f_t(x^\star)\right] \gtrsim \frac{GB}{\sqrt{T}}.$$

One such easy construction is choosing $f_t(x) = \langle v_t, x\rangle$, where $v_t \sim \frac{G}{\sqrt{d}}\,\mathrm{Unif}(\{+1,-1\}^d)$. This ensures that the first term on the l.h.s. of the above inequality is zero after taking the expectation, no matter which models are picked, while the comparator term is $-\Omega(GB/\sqrt{T})$ in expectation (c.f., Theorem 3.2 in Hazan (2016)), which gives the lower bound in equation (8). Note that the construction is linear and lies in the class $\mathcal{F}_{G,B}$, and the resulting lower bound is order-optimal and does not depend on the form of the algorithm. Furthermore, to get equality in equation (8), we recall the non-collaborative OGD upper bound (c.f., Theorem 3.1 in Hazan (2016)).

Next, we turn our attention to the problem class of smooth functions $\mathcal{P}_{M,K,R}(\mathcal{F}_{H,B},\zeta,F^\star)$, where we show that

$$\mathcal{R}^1(\mathcal{P}_{M,K,R}(\mathcal{F}_{H,B},\zeta,F^\star)) = \Theta\left(\frac{HB^2}{T} + \frac{\sqrt{HF^\star}B}{\sqrt{T}}\right). \tag{9}$$

We use the same lower-bounding strategy as above, but instead lower bound (P10), noting that (P2) ≥ (P4) ≥ (P5) ≥ (P6) ≥ (P10). This already makes $\zeta = 0$, since our attack is coordinated. To lower bound (P10), we use the construction and distribution from the proof of Theorem 4 by Woodworth & Srebro (2021), which is a sample complexity lower bound that only depends on T, i.e., the number of samples observed from $\mathcal{D}$. For the upper bound, we use Theorem 3 by Srebro et al. (2010), proved in the same setting. Even though the upper bounds are achieved by first-order algorithms, the lower bounds are order-independent, i.e., sample complexity lower bounds.
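The construction behind equation (8) is easy to simulate numerically. Below is a minimal sketch (our own, for illustration) checking that the comparator term indeed scales like $-GB\sqrt{T}$, so that the per-round regret of any algorithm is at least of order $GB/\sqrt{T}$.

```python
import numpy as np

# Lower-bound construction of Appendix B: f_t(x) = <v_t, x> with
# v_t ~ (G/sqrt(d)) * Unif({+1,-1}^d). Any algorithm has zero expected loss,
# while the best fixed comparator gains about G*B*sqrt(T) in hindsight.
rng = np.random.default_rng(0)
G, B, d, T = 1.0, 1.0, 10, 10_000
v = (G / np.sqrt(d)) * rng.choice([-1.0, 1.0], size=(T, d))
s = v.sum(axis=0)
comparator_value = -B * np.linalg.norm(s)      # min over ||x||_2 <= B of <sum_t v_t, x>
print(comparator_value, -G * B * np.sqrt(T))   # same order: regret >= ~ G*B*sqrt(T)
```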
Finally, we would like to prove a lower bound for the problem class $\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B})$ with two-point bandit feedback. In particular, we want to show that

$$\mathcal{R}^{0,2}(\mathcal{P}_{M,K,R}(\mathcal{F}_{G,B}),\zeta) \gtrsim \frac{GB\sqrt{d}}{\sqrt{MKR}} + \frac{GB}{\sqrt{KR}}. \tag{10}$$

To prove this, we use the reduction (P2) ≥ (P4) ≥ (P5) ≥ (P6). Then we note that for problem (P6), $\zeta = 0$, and using two-point feedback we get in total 2MKR function-value accesses to $\mathcal{D}$. We can directly apply the lower bound in Proposition 2 of Duchi et al. (2015) to problem (P6), with 2M points of feedback per round and KR iterations. Combined with the order-independent lower bound proved previously, this proves the required result.

## C. Proof of Theorem 4.1

In this section, we provide the proof of Theorem 4.1. We first introduce some notation that will be used in our analysis. Let

$$d(x, y) := \frac{\|x\|_2^2}{2} - \frac{\|\hat y\|_2^2}{2} - \langle y, x - \hat y\rangle,$$

where $\|x\|_2 \le B$ and $\hat y$ is the projection of $y$ onto the $\ell_2$-norm ball of radius B. We claim that

$$d(x, y) \ge \frac{1}{2}\|x - \hat y\|_2^2. \tag{11}$$

This is due to the following: if $\|y\|_2 \le B$, then $\hat y = y$ and (11) clearly holds. If $\|y\|_2 > B$, we have

$$d(x, y) - \frac{1}{2}\|x - \hat y\|_2^2 = \langle x - \hat y, \hat y - y\rangle \ge \langle \hat y - \hat y, \hat y - y\rangle = 0,$$

where the inequality is due to the fact that $\hat y - y = (1 - \|y\|_2/B)\,\hat y$ lies in the opposite direction of $\hat y$, so $x = \hat y$ minimizes the inner product over $\|x\|_2 \le B$.

Now, we are ready to prove the regret bound of Algorithm 2.

Proof. Define the following notation:

$$\bar x_t = \frac{1}{M}\sum_{m=1}^M x_t^m, \qquad \bar w_t = \mathrm{Proj}(\bar x_t), \qquad w_t^m = \mathrm{Proj}(x_t^m).$$

Using the update rule of Algorithm 2,

$$d(x^\star, \bar x_{t+1}) = \frac{1}{2}\|x^\star\|_2^2 - \frac{1}{2}\|\bar w_{t+1}\|_2^2 - \langle \bar x_{t+1}, x^\star - \bar w_{t+1}\rangle = \frac{1}{2}\|x^\star\|_2^2 - \frac{1}{2}\|\bar w_{t+1}\|_2^2 - \left\langle \bar x_t - \frac{\eta}{M}\sum_{m=1}^M g_t^m,\; x^\star - \bar w_{t+1}\right\rangle$$
$$= \underbrace{\frac{1}{2}\|x^\star\|_2^2 - \frac{1}{2}\|\bar w_{t+1}\|_2^2 - \langle \bar x_t, x^\star - \bar w_{t+1}\rangle}_{I_1} + \underbrace{\frac{\eta}{M}\sum_{m=1}^M \langle g_t^m, x^\star - \bar w_{t+1}\rangle}_{I_2}. \tag{12}$$

For the term $I_1$, we have

$$I_1 = \frac{1}{2}\|x^\star\|_2^2 - \frac{1}{2}\|\bar w_t\|_2^2 - \langle \bar x_t, x^\star - \bar w_t\rangle - \left(\frac{1}{2}\|\bar w_{t+1}\|_2^2 - \frac{1}{2}\|\bar w_t\|_2^2 - \langle \bar x_t, \bar w_{t+1} - \bar w_t\rangle\right) = d(x^\star, \bar x_t) - d(\bar w_{t+1}, \bar x_t) \le d(x^\star, \bar x_t) - \frac{1}{2}\|\bar w_{t+1} - \bar w_t\|_2^2,$$

where the last inequality is due to (11). For the term $I_2$, we split

$$I_2 = \underbrace{\frac{\eta}{M}\sum_{m}\langle g_t^m - \nabla f_t^m(w_t^m),\, x^\star - \bar w_{t+1}\rangle}_{I_{21}} + \underbrace{\frac{\eta}{M}\sum_{m}\langle \nabla f_t^m(w_t^m),\, x^\star - \bar w_{t+1}\rangle}_{I_{22}}.$$

For the term $I_{21}$, since $\mathbb{E}_t[g_t^m] = \nabla f_t^m(w_t^m)$ and $\bar w_t, x^\star$ are measurable at time $t$,

$$\mathbb{E}_t[I_{21}] = \frac{\eta}{M}\,\mathbb{E}_t\left[\left\langle \sum_m\left(g_t^m - \nabla f_t^m(w_t^m)\right),\; \bar w_t - \bar w_{t+1}\right\rangle\right] \le \eta\,\mathbb{E}_t\left[\left\|\frac{1}{M}\sum_m\left(g_t^m - \nabla f_t^m(w_t^m)\right)\right\|_2 \|\bar w_{t+1} - \bar w_t\|_2\right] \le \frac{\eta\sigma}{\sqrt{M}}\,\mathbb{E}_t\|\bar w_{t+1} - \bar w_t\|_2.$$

For the term $I_{22}$, writing $x^\star - \bar w_{t+1} = (x^\star - w_t^m) + (w_t^m - \bar w_t) + (\bar w_t - \bar w_{t+1})$ and using convexity and Young's inequality,

$$I_{22} \le -\frac{\eta}{M}\sum_m \left(f_t^m(w_t^m) - f_t^m(x^\star)\right) + \frac{\eta}{M}\sum_m \|\nabla f_t^m(w_t^m)\|_2\,\|\bar w_t - w_t^m\|_2 + \frac{1}{4}\|\bar w_{t+1} - \bar w_t\|_2^2 + \eta^2\left\|\frac{1}{M}\sum_m \nabla f_t^m(w_t^m)\right\|_2^2.$$

Combining these bounds with (12), and using $-\frac{1}{4}a^2 + \frac{\eta\sigma}{\sqrt{M}}a \le \frac{\eta^2\sigma^2}{M}$, we have

$$\frac{\eta}{M}\sum_m \mathbb{E}\left[f_t^m(w_t^m) - f_t^m(x^\star)\right] \le \mathbb{E}\,d(x^\star, \bar x_t) - \mathbb{E}\,d(x^\star, \bar x_{t+1}) + \frac{\eta^2\sigma^2}{M} + \frac{\eta G}{M}\sum_m \mathbb{E}\left[\|\bar w_t - w_t^m\|_2\right] + \eta^2 G^2.$$

In addition, since projections are non-expansive,

$$\frac{1}{M}\sum_m \mathbb{E}\left[\|\bar w_t - w_t^m\|_2\right] \le \frac{1}{M}\sum_m \mathbb{E}\left[\|\bar x_t - x_t^m\|_2\right] \le 2\eta(\sigma + \zeta)K,$$

where the last inequality is due to the functions being linear and follows similar steps to the proof of Lemma 8 in Woodworth et al. (2020b). Thus, we can obtain (the indicator function comes from the fact that if K = 1, there is no consensus error)

$$\frac{1}{M}\sum_{m\in[M]} \mathbb{E}\left[f_t^m(w_t^m) - f_t^m(x^\star)\right] \le \frac{1}{\eta}\left(\mathbb{E}\,d(x^\star, \bar x_t) - \mathbb{E}\,d(x^\star, \bar x_{t+1})\right) + \eta\left(G^2 + \frac{\sigma^2}{M}\right) + 1_{K>1}\cdot 2\eta G(\sigma + \zeta)K.$$

Since $\mathbb{E}\,d(x^\star, \bar x_T) \ge \frac{1}{2}\mathbb{E}\|x^\star - \bar w_T\|_2^2 \ge 0$ and $d(x^\star, \bar x_0) = \frac{1}{2}\|x^\star\|_2^2 \le \frac{B^2}{2}$, summing the above inequality over $t$, we get

$$\frac{1}{M}\sum_{t\in[KR],m\in[M]} \mathbb{E}\left[f_t^m(w_t^m) - f_t^m(x^\star)\right] \lesssim \frac{B^2}{\eta} + \eta\left(G^2 + \frac{\sigma^2}{M} + 1_{K>1}\,G(\sigma + \zeta)K\right)T.$$

If we choose $\eta = \min\left\{\frac{B}{G\sqrt{T}},\ \frac{B\sqrt{M}}{\sigma\sqrt{T}},\ \frac{B}{1_{K>1}\sqrt{\sigma GKT}},\ \frac{B}{1_{K>1}\sqrt{\zeta GKT}}\right\}$, then

$$\frac{1}{MKR}\sum_{t\in[KR],m\in[M]} \mathbb{E}\left[f_t^m(w_t^m) - f_t^m(x^\star)\right] \lesssim \frac{GB}{\sqrt{KR}} + \frac{\sigma B}{\sqrt{MKR}} + 1_{K>1}\,\frac{(\sqrt{\sigma G} + \sqrt{\zeta G})B}{\sqrt{R}}.$$

To get the regret at the queried points, we just need to notice that the functions are linear, so the smoothed function satisfies $\hat f_t^m = f_t^m$ and $\mathbb{E}\,f_t^m(w_t^m) = \mathbb{E}\,f_t^m(w_t^m + \delta u_t^m)$, where the expectation is over $u_t^m$.
D. Proof of Theorem 5.1

In this section and the next one, we consider access to a first-order stochastic oracle as an intermediate step before considering the zeroth-order oracle. Specifically, each machine has access to a stochastic gradient $g_t^m$ of $f_t^m$ at the point $x_t^m$ that is unbiased and has bounded variance, i.e., for all $x \in \mathcal{X}$,

$$\mathbb{E}\big[g_t^m(x_t^m)\,\big|\,x_t^m\big] = \nabla f_t^m(x_t^m) \quad \text{and} \quad \mathbb{E}\big[\|g_t^m(x_t^m) - \nabla f_t^m(x_t^m)\|_2^2\,\big|\,x_t^m\big] \leq \sigma^2.$$

In Algorithm 3, we constructed a particular stochastic gradient estimator at $x_t^m$ with $\sigma^2 = G^2 d$. We can define the corresponding problem class $\mathcal{P}^{1,\sigma}_{M,K,R}(\mathcal{F}_{G,B}, \zeta)$ where the agents can access a stochastic first-order oracle. We have the following lemma about this problem class.

Lemma D.1. Consider the problem class $\mathcal{P}^{1,\sigma}_{M,K,R}(\mathcal{F}_{G,B}, \zeta)$. If we choose

$$\eta = \frac{B}{\sqrt{KR}}\,\min\Big\{\frac{1}{G},\ \frac{\sqrt{M}}{\sigma},\ \frac{1}{1_{K>1}\sqrt{(\sigma + G)GK}}\Big\},$$

then the models $\{x_t^m\}_{t,m=1}^{T,M}$ of Algorithm 3 satisfy the following guarantee:

$$\frac{1}{MKR}\sum_{t\in[KR],\, m\in[M]} \mathbb{E}\big[f_t^m(x_t^m) - f_t^m(x^\star)\big] \leq c\left(\frac{GB}{\sqrt{KR}} + \frac{\sigma B}{\sqrt{MKR}} + 1_{K>1}\,\frac{(\sqrt{\sigma G} + G)B}{\sqrt{R}}\right),$$

where $x^\star \in \arg\min_{x\in\mathbb{R}^d}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the stochastic gradients.

Remark D.2. Note that when $K = 1$, the upper bound in Lemma D.1 reduces to the first two terms, both of which are known to be optimal due to lower bounds in the stochastic setting, i.e., against a stochastic online adversary (Nemirovski, 1994; Hazan, 2016).

We now use this lemma to obtain guarantees with bandit two-point feedback oracles for the same function class. We recall that one can obtain a stochastic gradient for a smoothed version $\hat{f}$ of a Lipschitz function $f$ at any point $x \in \mathcal{X}$ using two function-value calls to $f$ around the point $x$ (Shamir, 2017; Duchi et al., 2015). With this lemma, we can prove Theorem 5.1.

Proof of Theorem 5.1. First, we consider the smoothed functions $\hat{f}_t^m(x) := \mathbb{E}_{u \sim \mathrm{Unif}(S^{d-1})}\big[f_t^m(x + \delta u)\big]$ for some $\delta > 0$, where $S^{d-1}$ denotes the Euclidean unit sphere. Based on the gradient estimator proposed by Shamir (2017) (which can be implemented with two-point bandit feedback) and Lemma D.1, we can get the following regret guarantee (noting that $\sigma \leq c_1\sqrt{d}\,G$ for a numerical constant $c_1$, c.f., (Shamir, 2017)):

$$\frac{1}{MKR}\sum_{t\in[KR],\, m\in[M]} \mathbb{E}\Big[\hat{f}_t^m(\hat{x}_t^m) - \hat{f}_t^m(x^\star)\Big] \leq c\left(\frac{GB}{\sqrt{KR}} + \frac{GB\sqrt{d}}{\sqrt{MKR}} + 1_{K>1}\,\frac{GBd^{1/4}}{\sqrt{R}}\right),$$

where the expectation is w.r.t. the stochasticity of the gradient estimator. To transform this into a regret guarantee for $f$, we need to account for two things:

1. The difference between the smoothed function $\hat{f}$ and the original function $f$. This is easy to handle because the two functions are pointwise close, i.e., $\sup_{x\in\mathcal{X}}|f(x) - \hat{f}(x)| \leq G\delta$.

2. The difference between the points $\hat{x}_t^m$ at which the stochastic gradient is computed for $\hat{f}_t^m$ and the actual points $x_t^{m,1}$ and $x_t^{m,2}$ on which we incur regret while making zeroth-order queries to $f_t^m$. This is also easy to handle because, by the definition of the estimator, $x_t^{m,1}, x_t^{m,2} \in \mathcal{B}_\delta(\hat{x}_t^m)$, where $\mathcal{B}_\delta(x)$ is the $\ell_2$-ball of radius $\delta$ around $x$.

In light of these two observations, the average regret on the original functions differs from that on the smoothed functions by at most a factor of $2G\delta$, i.e.,

$$\frac{1}{2MKR}\sum_{t\in[KR],\, m\in[M],\, j\in[2]} \mathbb{E}\big[f_t^m(x_t^{m,j}) - f_t^m(x^\star)\big] \leq c\left(\frac{GB}{\sqrt{KR}} + \frac{GB\sqrt{d}}{\sqrt{MKR}} + 1_{K>1}\,\frac{GBd^{1/4}}{\sqrt{R}}\right) + 2G\delta \leq c'\left(\frac{GB}{\sqrt{KR}} + \frac{GB\sqrt{d}}{\sqrt{MKR}} + 1_{K>1}\,\frac{GBd^{1/4}}{\sqrt{R}}\right),$$

where the last inequality is due to the choice of $\delta$ such that $\delta \leq \frac{Bd^{1/4}}{\sqrt{KR}}$.
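As a companion sketch (again illustrative, with hypothetical names, not the paper's code), the two-point estimator of Shamir (2017) used in this proof can be written as follows; both queries land at distance $\delta$ from the base point, which is exactly what item 2 above relies on:

```python
import numpy as np

def two_point_gradient_estimate(f_value, x, delta, rng):
    """Two-point estimator g = (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    with u ~ Unif(S^{d-1}). It is unbiased for the gradient of the delta-smoothed
    function f_hat. For G-Lipschitz f, |f(x+delta*u) - f(x-delta*u)| <= 2*G*delta,
    so ||g||_2 <= d*G always; a sharper argument gives variance O(d * G^2),
    i.e., sigma <= c1 * sqrt(d) * G."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    queries = (x + delta * u, x - delta * u)  # the two played points x^{m,1}, x^{m,2}
    return (d / (2.0 * delta)) * (f_value(queries[0]) - f_value(queries[1])) * u
```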
E. Proof of Theorem 5.2

Similar to before, we start by looking at $\mathcal{P}^{1,\sigma}_{M,K,R}(\mathcal{F}_{G,H,B}, \zeta, F^\star)$. We have the following lemma.

Lemma E.1. Consider the problem class $\mathcal{P}^{1,\sigma}_{M,K,R}(\mathcal{F}_{G,H,B}, \zeta, F^\star)$. The models $\{x_t^m\}_{t,m=1}^{T,M}$ of Algorithm 3 with an appropriate $\eta$ (specified in the proof) satisfy the following regret guarantee:

$$\frac{1}{MKR}\sum_{t\in[KR],\, m\in[M]} \mathbb{E}\big[f_t^m(x_t^m) - f_t^m(x^\star)\big] \leq c\left(\frac{HB^2}{KR} + \sqrt{\frac{HB^2F^\star}{KR}} + \frac{\sigma B}{\sqrt{MKR}} + \min\left\{\frac{GB}{\sqrt{KR}} + \frac{\sqrt{G\sigma}\,B}{K^{1/4}\sqrt{R}} + \frac{\sqrt{G\zeta}\,B}{\sqrt{R}},\ \ \frac{H^{1/3}B^{4/3}\sigma^{2/3}}{K^{1/3}R^{2/3}} + \frac{H^{1/3}B^{4/3}\zeta^{2/3}}{R^{2/3}} + \frac{\sqrt{\zeta\sigma}\,B}{K^{1/4}\sqrt{R}} + \frac{\zeta B}{\sqrt{R}}\right\}\right),$$

where $x^\star \in \arg\min_{x\in\mathbb{R}^d}\sum_{t\in[KR]} f_t(x)$, and the expectation is w.r.t. the stochastic gradients. The models also satisfy the guarantee of Lemma D.1 with the same step size.

Proof of Theorem 5.2. Given Lemma E.1, it is now straightforward to prove Theorem 5.2 similarly to the proof of Theorem 5.1, by replacing $\sigma^2$ with $G^2d$ and choosing $\delta$ small enough that $G\delta$ is dominated by the R.H.S. of the theorem statement.

F. Proof of Lemma D.1

In this section, we prove Lemma D.1.

Proof of Lemma D.1. Consider any time step $t \in [KR]$ and define the ghost iterate $\bar{x}_t = \frac{1}{M}\sum_{m\in[M]} x_t^m$ (which might not actually be computed). If $K = 1$, the machines calculate their stochastic gradients at the same point, $\bar{x}_t$. Then, using the update rule of Algorithm 3, we can get the following:

$$\mathbb{E}_t\big[\|\bar{x}_{t+1} - x^\star\|_2^2\big] = \mathbb{E}_t\Big[\Big\|\bar{x}_t - \frac{\eta_t}{M}\sum_{m\in[M]}\nabla f_t^m(x_t^m) - x^\star + \frac{\eta_t}{M}\sum_{m=1}^M\big(\nabla f_t^m(x_t^m) - g_t^m(x_t^m)\big)\Big\|_2^2\Big]$$

$$\leq \|\bar{x}_t - x^\star\|_2^2 + \frac{\eta_t^2}{M^2}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 - \frac{2\eta_t}{M}\sum_{m\in[M]}\big\langle \bar{x}_t - x^\star,\ \nabla f_t^m(x_t^m)\big\rangle + \frac{\eta_t^2\sigma^2}{M}$$

$$= \|\bar{x}_t - x^\star\|_2^2 + \frac{\eta_t^2}{M^2}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 - \frac{2\eta_t}{M}\sum_{m\in[M]}\big\langle x_t^m - x^\star,\ \nabla f_t^m(x_t^m)\big\rangle - \frac{2\eta_t}{M}\sum_{m\in[M]}\big\langle \bar{x}_t - x_t^m,\ \nabla f_t^m(x_t^m)\big\rangle + \frac{\eta_t^2\sigma^2}{M}$$

$$\leq \|\bar{x}_t - x^\star\|_2^2 + \frac{\eta_t^2}{M^2}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 - \frac{2\eta_t}{M}\sum_{m\in[M]}\big(f_t^m(x_t^m) - f_t^m(x^\star)\big) - \frac{2\eta_t}{M}\sum_{m\in[M]}\big\langle \bar{x}_t - x_t^m,\ \nabla f_t^m(x_t^m)\big\rangle + \frac{\eta_t^2\sigma^2}{M},$$

where $\mathbb{E}_t$ is the expectation conditioned on the filtration at time $t$, under which the $x_t^m$'s are measurable (the cross term with the gradient noise vanishes by unbiasedness), and the last inequality is due to the convexity of each function. Re-arranging this leads to

$$\frac{1}{M}\sum_{m\in[M]}\big(f_t^m(x_t^m) - f_t^m(x^\star)\big) \leq \frac{1}{2\eta_t}\Big(\|\bar{x}_t - x^\star\|_2^2 - \mathbb{E}_t\big[\|\bar{x}_{t+1} - x^\star\|_2^2\big]\Big) + \frac{\eta_t}{2M^2}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 + \frac{1}{M}\sum_{m\in[M]}\big\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m)\big\rangle + \frac{\eta_t\sigma^2}{2M}$$

$$\leq \frac{1}{2\eta_t}\Big(\|\bar{x}_t - x^\star\|_2^2 - \mathbb{E}_t\big[\|\bar{x}_{t+1} - x^\star\|_2^2\big]\Big) + \frac{\eta_t G^2}{2} + \frac{\eta_t\sigma^2}{2M} + \frac{G}{M}\sum_{m\in[M]}\mathbb{E}\big[\|x_t^m - \bar{x}_t\|_2\big]. \quad (13)$$

The last inequality comes from each function's $G$-Lipschitzness. For the last term in (13), we can upper bound it similarly to Lemma 8 in Woodworth et al. (2020b) to get

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[\|x_t^m - \bar{x}_t\|_2\big] \leq 2(\sigma + G)K\eta. \quad (14)$$

Plugging (14) into (13), choosing a constant step size $\eta$, and taking full expectations, we get

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[f_t^m(x_t^m) - f_t^m(x^\star)\big] \leq \frac{1}{2\eta}\Big(\mathbb{E}\|\bar{x}_t - x^\star\|_2^2 - \mathbb{E}\|\bar{x}_{t+1} - x^\star\|_2^2\Big) + \frac{\eta G^2}{2} + \frac{\eta\sigma^2}{2M} + 1_{K>1}\,2G(\sigma + G)K\eta.$$

Summing this over time $t \in [T]$ (with $T = KR$), we get

$$\frac{1}{M}\sum_{m\in[M],\, t\in[T]}\mathbb{E}\big[f_t^m(x_t^m) - f_t^m(x^\star)\big] \leq \frac{\|\bar{x}_0 - x^\star\|_2^2}{\eta} + \eta\Big(G^2 + \frac{\sigma^2}{M} + 1_{K>1}\,\sigma GK + 1_{K>1}\,\zeta GK\Big)T \leq \frac{B^2}{\eta} + \eta\Big(G^2 + \frac{\sigma^2}{M} + 1_{K>1}\,\sigma GK + 1_{K>1}\,G^2K\Big)T,$$

up to numerical constants, where the $\zeta$-term comes from tracking the heterogeneity in the consensus bound separately, and the second inequality uses $\zeta \leq 2G$ for $G$-Lipschitz functions. Finally, choosing

$$\eta = \frac{B}{\sqrt{T}}\,\min\Big\{\frac{1}{G},\ \frac{\sqrt{M}}{\sigma},\ \frac{1}{1_{K>1}\sqrt{(\sigma + G)GK}}\Big\},$$

we can obtain

$$\frac{1}{M}\sum_{m\in[M],\, t\in[T]}\mathbb{E}\big[f_t^m(x_t^m) - f_t^m(x^\star)\big] \leq c\Big(GB\sqrt{T} + \frac{\sigma B\sqrt{T}}{\sqrt{M}} + 1_{K>1}\,\big(\sqrt{\sigma G} + G\big)B\sqrt{KT}\Big).$$

Dividing by $KR$ finishes the proof.
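To make the intermittent-communication template concrete, the following is a minimal sketch of the local-update-then-average loop analyzed in Lemmas D.1 and E.1. It is a simplification under our assumptions, not the paper's exact Algorithm 3; `grad_oracle` stands for any unbiased stochastic gradient oracle with variance at most $\sigma^2$:

```python
import numpy as np

def local_online_sgd(grad_oracle, M, K, R, d, eta, rng):
    """R communication rounds; K local steps per round on each of M machines.
    Machines run SGD on their own cost sequences and average their iterates
    (the ghost iterate bar{x}_t) only at communication times.
    Returns the played iterates, one (M, d) array per time step."""
    x = np.zeros((M, d))
    played = []
    for r in range(R):
        for k in range(K):
            t = r * K + k
            played.append(x.copy())
            for m in range(M):
                g = grad_oracle(m, t, x[m], rng)  # E[g] = grad f_t^m(x_t^m)
                x[m] -= eta * g
        x[:] = x.mean(axis=0)  # intermittent averaging: consensus error resets
    return played
```

When K = 1, the averaging happens after every step, so all machines hold the ghost iterate exactly, which is why the $1_{K>1}$ consensus terms vanish in that case.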
G. Proof of Lemma E.1

In this section, we provide the proof of Lemma E.1.

Proof of Lemma E.1. Consider any time step $t \in [KR]$ and define the ghost iterate $\bar{x}_t = \frac{1}{M}\sum_{m\in[M]} x_t^m$ (which might not actually be computed). Then, following exactly the same steps as in the proof of Lemma D.1 (expanding the update rule of Algorithm 3, using the unbiasedness of the stochastic gradients, and applying the convexity of each function), we get

$$\mathbb{E}_t\big[\|\bar{x}_{t+1} - x^\star\|_2^2\big] \leq \|\bar{x}_t - x^\star\|_2^2 + \frac{\eta_t^2}{M^2}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 - \frac{2\eta_t}{M}\sum_{m\in[M]}\big(f_t^m(x_t^m) - f_t^m(x^\star)\big) - \frac{2\eta_t}{M}\sum_{m\in[M]}\big\langle \bar{x}_t - x_t^m,\ \nabla f_t^m(x_t^m)\big\rangle + \frac{\eta_t^2\sigma^2}{M}.$$

Re-arranging this and taking expectations leads to

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big(f_t^m(x_t^m) - f_t^m(x^\star)\big) \leq \frac{1}{2\eta_t}\Big(\mathbb{E}\|\bar{x}_t - x^\star\|_2^2 - \mathbb{E}\|\bar{x}_{t+1} - x^\star\|_2^2\Big) + \underbrace{\frac{\eta_t}{2M^2}\,\mathbb{E}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2}_{\text{blue term}} + \underbrace{\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m)\big\rangle}_{\text{red term}} + \frac{\eta_t\sigma^2}{2M}. \quad (16)$$

Bounding the blue term. We consider two different ways to bound this term. First, note that, as in Lemma D.1, we can simply use

$$\frac{\eta_t}{2M^2}\,\mathbb{E}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 \leq \frac{\eta_t G^2}{2}. \quad (17)$$

However, since we also have smoothness, we can instead use the self-bounding property (c.f., Lemma 4.1 (Srebro et al., 2010)) to get

$$\frac{\eta_t}{2M^2}\,\mathbb{E}\Big\|\sum_{m\in[M]}\nabla f_t^m(x_t^m)\Big\|_2^2 \leq \frac{\eta_t}{2M}\sum_{m\in[M]}\mathbb{E}\|\nabla f_t^m(x_t^m)\|_2^2 \leq \frac{2\eta_t H}{M}\sum_{m\in[M]}\mathbb{E}\big(f_t^m(x_t^m) - f_t^m(x^\star)\big) + \frac{2\eta_t H}{M}\sum_{m\in[M]}\mathbb{E}\,f_t^m(x^\star). \quad (18)$$

Bounding the red term. We will bound this term in three different ways. First, as in Lemma D.1, we can bound it after taking expectations and controlling the consensus term as in Lemma 8 of Woodworth et al. (2020b):

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m)\rangle\big] \leq \frac{G}{M}\sum_{m\in[M]}\mathbb{E}\big[\|x_t^m - \bar{x}_t\|_2\big] \leq 2G(\sigma + G)\sum_{t' = \tau(t)}^{t}\eta_{t'}, \quad (19)$$

where $\tau(t)$ maps $t$ to the last time step at which communication happened. Alternatively, we can use smoothness as follows, assuming $\eta_t \leq 1/(2H)$ and writing $\bar{f}_t = \frac{1}{M}\sum_{m\in[M]} f_t^m$:

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m)\rangle\big] = \frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m) - \nabla \bar{f}_t(\bar{x}_t)\rangle\big]$$

(the subtracted term contributes nothing because $\sum_{m}(x_t^m - \bar{x}_t) = 0$), and hence, by Cauchy–Schwarz,

$$\leq \sqrt{\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\|x_t^m - \bar{x}_t\|_2^2}\cdot\sqrt{\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\|\nabla f_t^m(x_t^m) - \nabla\bar{f}_t(\bar{x}_t)\|_2^2} \leq \sqrt{\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\|x_t^m - \bar{x}_t\|_2^2}\cdot\sqrt{\frac{2H^2}{M}\sum_{m\in[M]}\mathbb{E}\|x_t^m - \bar{x}_t\|_2^2 + 2\zeta^2}$$

$$\leq 2\eta_t^2 H\big(\sigma^2 K + \zeta^2 K^2\big) + 2\eta_t\zeta\big(\sigma\sqrt{K} + \zeta K\big), \quad (20)$$

where we used Lemma 8 from Woodworth et al. (2020b) in the last inequality. We can also use the Lipschitzness and smoothness assumptions together, with a constant step size $\eta \leq 1/(2H)$, to obtain

$$\frac{1}{M}\sum_{m\in[M]}\mathbb{E}\big[\langle x_t^m - \bar{x}_t,\ \nabla f_t^m(x_t^m)\rangle\big] \leq \frac{G}{M}\sum_{m\in[M]}\mathbb{E}\big[\|x_t^m - \bar{x}_t\|_2\big] \leq 2G\eta\big(\sigma\sqrt{K} + \zeta K\big). \quad (21)$$

Combining everything. Using a constant step size $\eta$ and summing (16) over time, we can combine the upper bounds on the red and blue terms in different ways. If we plug in (17) and (19), we recover the guarantee of Lemma D.1; this is not surprising because $\mathcal{F}_{G,H,B} \subseteq \mathcal{F}_{G,B}$. Combining the upper bounds in all other combinations, assuming $\eta \leq \frac{1}{2H}$, we can show the following upper bound:

$$\mathrm{Reg}(M,K,R) \leq c\left(\frac{HB^2}{KR} + \sqrt{\frac{HB^2F^\star}{KR}} + \frac{\sigma B}{\sqrt{MKR}} + \min\left\{\frac{GB}{\sqrt{KR}} + \frac{\sqrt{G\sigma}\,B}{K^{1/4}\sqrt{R}} + \frac{\sqrt{G\zeta}\,B}{\sqrt{R}},\ \ \frac{H^{1/3}B^{4/3}\sigma^{2/3}}{K^{1/3}R^{2/3}} + \frac{H^{1/3}B^{4/3}\zeta^{2/3}}{R^{2/3}} + \frac{\sqrt{\zeta\sigma}\,B}{K^{1/4}\sqrt{R}} + \frac{\zeta B}{\sqrt{R}}\right\}\right),$$

where we used the step size

$$\eta = \min\left\{\frac{1}{2H},\ \frac{B}{G\sqrt{KR}},\ \frac{B}{\sqrt{HF^\star KR}},\ \frac{B^{2/3}}{H^{1/3}\sigma^{2/3}K^{2/3}R^{1/3}},\ \frac{B^{2/3}}{H^{1/3}\zeta^{2/3}KR^{1/3}},\ \frac{B}{K^{3/4}\sqrt{\zeta\sigma R}},\ \frac{B}{K^{3/4}\sqrt{G\sigma R}},\ \frac{B}{K\sqrt{\zeta GR}}\right\}.$$

This finishes the proof.

G.1. Modifying the Proof for Federated Adversarial Linear Bandits

To prove the guarantee for federated adversarial linear bandits, we first note that the self-bounding property can no longer be used, as the functions are not non-negative. Thus, we proceed with the lemma's proof with the following changes. First, we do not prove the additional upper bound (18) for the blue term. Second, while upper bounding the red term in (20), we set $H = 0$ and use this single bound for the red term. After making these changes, combining all the terms, and tuning the learning rate, we recover the correct lemma for federated adversarial linear bandits.
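Since (18) is the only place where non-negativity enters the argument (and hence the reason G.1 must drop it for linear bandits), it may help to record why the self-bounding property needs $f \geq 0$. Below is a one-line derivation with the constant $2H$; Srebro et al. (2010) state it with the looser constant $4H$:

```latex
% Self-bounding property: if f is H-smooth and f >= 0 everywhere, then
% ||grad f(x)||_2^2 <= 2 H f(x).
\[
  0 \;\le\; f\Big(x - \tfrac{1}{H}\nabla f(x)\Big)
  \;\le\; f(x) + \Big\langle \nabla f(x),\, -\tfrac{1}{H}\nabla f(x)\Big\rangle
          + \frac{H}{2}\,\Big\|\tfrac{1}{H}\nabla f(x)\Big\|_2^2
  \;=\; f(x) - \frac{1}{2H}\,\|\nabla f(x)\|_2^2,
\]
\[
  \text{so } \|\nabla f(x)\|_2^2 \;\le\; 2H\,f(x).
  \quad \text{Without } f \ge 0 \text{ (e.g., a linear } f\text{), the first
  inequality fails, which is why G.1 drops (18).}
\]
```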