Published as a conference paper at ICLR 2023

FEDEXP: SPEEDING UP FEDERATED AVERAGING VIA EXTRAPOLATION

Divyansh Jhunjhunwala1, Shiqiang Wang2, Gauri Joshi1
1Carnegie Mellon University, 2IBM Research
{djhunjhu, gaurij}@andrew.cmu.edu, wangshiq@us.ibm.com

ABSTRACT

Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees when combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a gradient-descent-style server update by treating client updates as pseudo-gradients and using a server step size. While the use of a server step size has been shown to provide performance improvement theoretically, the practical benefit of the server step size has not been seen in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension of the extrapolation mechanism that is used to speed up POCS. Our theoretical analysis later also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets.

1 INTRODUCTION

Federated Learning (FL) has emerged as a key distributed learning paradigm in which a central server orchestrates the training of a machine learning model across a network of devices. FL is based on the fundamental premise that data never leaves a client's device, as clients only communicate model updates with the server. Federated Averaging, or FedAvg, first introduced by McMahan et al. (2017), remains the most popular algorithm in this setting due to the simplicity of its implementation, its stateless nature (i.e., clients do not maintain local parameters during training), and its ability to incorporate privacy-preserving protocols such as secure aggregation (Bonawitz et al., 2016; Kadhe et al., 2020).

Slowdown Due to Heterogeneity. One of the most persistent problems in FedAvg is the slowdown in model convergence due to data heterogeneity across clients. Clients usually perform multiple steps of gradient descent on their heterogeneous objectives before communicating with the server in FedAvg, which leads to what is colloquially known as client drift error (Karimireddy et al., 2019). The effect of heterogeneity is further exacerbated by the constraint that only a fraction of the total number of clients may be available for training in every round (Kairouz et al., 2021). Various techniques have been proposed to combat this slowdown, among the most popular being variance reduction techniques such as Karimireddy et al. (2019); Mishchenko et al. (2022); Mitra et al. (2021), but these either make clients stateful, add extra computation or communication requirements, or have privacy limitations.

Server Step Size.
Recent work has sought to deal with this slowdown by using two separate step sizes in FedAvg: a client step size used by the clients to minimize their local objectives, and a server step size used by the server to update the global model by treating the client updates as pseudo-gradients (Karimireddy et al., 2019; Reddi et al., 2021). To achieve the fastest convergence rate, these works propose keeping the client step size as $\mathcal{O}(1/\tau\sqrt{T})$ and the server step size as $\mathcal{O}(\sqrt{\tau M})$, where $T$ is the number of communication rounds, $\tau$ is the number of local steps and $M$ is the number of clients. Using a small client step size mitigates client drift, and a large server step size prevents global slowdown. While this idea may be asymptotically optimal, it is not always effective in practical non-asymptotic and communication-limited settings (Charles & Konečný, 2020).

[Figure 1: Test accuracy (%) achieved by different server and client step sizes on the EMNIST dataset (Cohen et al., 2017) after 50 rounds; the heatmap sweeps client step sizes $\eta_l \in \{0.01, 0.03, 0.1, 0.3, 1\}$ against server step sizes $\eta_g \in \{0.1, 0.3, 1, 3, 10\}$. Details of the experimental setup are in Section 6 and Appendix D.]

In practice, a small client step size severely slows down convergence in the initial rounds and cannot be fully compensated for by a large server step size (see Figure 1). Also, if local objectives differ significantly, then it may be beneficial to use smaller values of the server step size (Malinovsky et al., 2022). Therefore, we seek to answer the following question: For a moderate client step size, can we adapt the server step size according to the local progress made by the clients and the heterogeneity of their objectives? In general, it is challenging to answer this question because it is difficult to obtain knowledge of the heterogeneity between the local objectives and to appropriately use it to adapt the server step size.

Our Contributions. In this paper, we take a novel approach to address the question posed above. We begin by considering the case where the models are overparameterized, i.e., the number of model parameters is larger than the total number of data points across all clients. This is often true for modern deep neural network models (Zhang et al., 2017; Jacot et al., 2018) and the small datasets collected by edge clients in the FL setting. In this overparameterized regime, the global minimizer becomes a common minimizer of all local objectives, even though they may be arbitrarily heterogeneous. Using this fact, we obtain a novel connection between FedAvg and the Projection Onto Convex Sets (POCS) algorithm, which is used to find a point in the intersection of some convex sets. Based on this connection, we find an interesting analogy between the server step size and the extrapolation parameter that is used to speed up POCS (Pierra, 1984). We propose new extensions to the extrapolated POCS algorithm to support inexact and noisy projections as in FedAvg. In particular, we derive a time-varying bound on the progress made by clients towards the global minimum and show how this bound can be used to adaptively estimate a good server step size in each round. The result is our proposed algorithm FedExP, a method to adaptively determine the server step size in each round of FL based on the pseudo-gradients in that round.
Although motivated by the overparameterized regime, our proposed FedExP algorithm performs well (both theoretically and empirically) in the general case, where the model can be either overparameterized or underparameterized. For this general case, we derive convergence upper bounds for both convex and non-convex objectives. Some highlights of our work are as follows.

• We reveal a novel connection between FedAvg and the POCS algorithm for finding a point in the intersection of convex sets.
• The proposed FedExP algorithm is simple to implement with virtually no additional communication, computation, or storage required at clients or the server. It is well suited for both cross-device and cross-silo FL, and is compatible with partial client participation.
• Experimental results show that FedExP converges 1.4-2× faster than FedAvg and most competing baselines on standard FL tasks.

Related Work. Popular algorithms for adaptively tuning the step size when training neural networks include Adagrad (Duchi et al., 2011) and its variants RMSProp (Tieleman et al., 2012) and Adadelta (Zeiler, 2012). These algorithms consider the notion of coordinate-wise adaptivity and adapt the step size separately for each dimension of the parameter vector based on the magnitude of the accumulated gradients. While these algorithms can be extended to the federated setting using the concept of pseudo-gradients, as done by Reddi et al. (2021), these extensions are agnostic to the inherent data heterogeneity across clients, which is central to FL. On the contrary, FedExP is explicitly designed for FL settings and uses a client-centric notion of adaptivity that utilizes the heterogeneity of client updates in each round. The work closest to ours is Johnson et al. (2020), which proposes a method to adapt the step size for large-batch training by estimating the gradient diversity (Yin et al., 2018) of a minibatch. This result has been improved in a recent work by Horváth et al. (2022). However, both Johnson et al. (2020) and Horváth et al. (2022) focus on the centralized setting. In FedExP, we use a similar concept, but within a federated environment, where it comes with a stronger theoretical motivation since client data are inherently diverse. We defer a more detailed discussion of other adaptive step size methods and related work to Appendix A.

2 PROBLEM FORMULATION AND PRELIMINARIES

As in most standard federated learning frameworks, we consider the problem of optimizing the model parameters $w \in \mathbb{R}^d$ to minimize the global objective function $F(w)$ defined as follows:

$$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{M}\sum_{i=1}^{M} F_i(w), \quad (1)$$

where $F_i(w) := \frac{1}{|\mathcal{D}_i|}\sum_{\delta_i \in \mathcal{D}_i} \ell(w, \delta_i)$ is the empirical risk objective computed on the local dataset $\mathcal{D}_i$ at the $i$-th client. Here, $\ell(\cdot,\cdot)$ is a loss function and $\delta_i$ represents a data sample from the empirical local data distribution $\mathcal{D}_i$. The total number of clients in the FL system is denoted by $M$. Without loss of generality, we assume that all $M$ client objectives are given equal weight in the global objective function defined in (1). Our algorithm and analysis can be directly extended to the case where client objectives are unequally weighted, e.g., proportional to the local dataset sizes $|\mathcal{D}_i|$.

FedAvg. We focus on solving (1) using FedAvg (McMahan et al., 2017; Kairouz et al., 2021). At round $t$ of FedAvg, the server sends the current global model $w^{(t)}$ to all clients.
Upon receiving the global model, clients perform $\tau$ steps of local stochastic gradient descent (SGD) to compute their updates $\{\Delta_i^{(t)}\}_{i=1}^{M}$ for round $t$ as follows.

Perform local SGD: $w_i^{(t,k+1)} = w_i^{(t,k)} - \eta_l \nabla F_i(w_i^{(t,k)}, \xi_i^{(t,k)}), \quad k \in \{0, 1, \ldots, \tau-1\}$ (2)
Compute local difference: $\Delta_i^{(t)} = w^{(t)} - w_i^{(t,\tau)}$ (3)

where $w_i^{(t,0)} = w^{(t)}$ for all $i \in [M]$, $\eta_l$ is the client step size, and $\nabla F_i(w_i^{(t,k)}, \xi_i^{(t,k)})$ represents a stochastic gradient computed on the minibatch $\xi_i^{(t,k)}$ sampled randomly from $\mathcal{D}_i$.

Server Optimization in FedAvg. In vanilla FedAvg (McMahan et al., 2017), the global model would simply be updated as the average of the client local models, that is, $w^{(t+1)} = \frac{1}{M}\sum_{i=1}^{M} w_i^{(t,\tau)}$. To improve over this, recent work (Reddi et al., 2021; Hsu et al., 2019) has focused on optimizing the server aggregation process by treating the client updates $\Delta_i^{(t)}$ as pseudo-gradients and multiplying by a server step size when aggregating them as follows.

Generalized FedAvg global update: $w^{(t+1)} = w^{(t)} - \eta_g \Delta^{(t)}$ (4)

where $\Delta^{(t)} = \frac{1}{M}\sum_{i=1}^{M} \Delta_i^{(t)}$ is the aggregated client update in round $t$ and $\eta_g$ acts as the server step size. Note that setting $\eta_g = 1$ recovers the vanilla FedAvg update. While the importance of the server step size has been theoretically well established in these works, we find that its practical relevance has not been explored. In this work, we take a step towards bridging this gap between theory and practice by adaptively tuning the value of $\eta_g$ in every round.

3 PROPOSED ALGORITHM: FEDEXP

Before discussing our proposed algorithm, we first highlight a useful and novel connection between FedAvg and the POCS algorithm used to find a point in the intersection of some convex sets.

3.1 MOTIVATION FOR EXTRAPOLATION

Connection Between FedAvg and POCS in the Overparameterized Convex Regime. Consider the case where the local objectives of the clients $\{F_i(w)\}_{i=1}^{M}$ are convex. In this case, we know that the set of minimizers of $F_i(w)$, given by $S_i^* = \{w : w \in \arg\min F_i(w)\}$, is a convex set for all $i \in [M]$. Now let us assume that we are in the overparameterized regime where $d$ is sufficiently larger than the total number of data points across clients. In this regime, the model can fit all the training data at clients simultaneously and hence be a minimizer of all local objectives. Thus we assume that the global minimum satisfies $w^* \in S_i^*, \forall i \in [M]$. Our original problem in (1) can then be reformulated as finding a point in the intersection of the convex sets $\{S_i^*\}_{i=1}^{M}$, since $w^* \in S_i^*, \forall i \in [M]$. One of the most popular algorithms to do so is the Projection Onto Convex Sets (POCS) algorithm (Gurin et al., 1967). In POCS, at every iteration the current model is updated as follows¹:

Generalized POCS update: $w_{\text{POCS}}^{(t+1)} = w_{\text{POCS}}^{(t)} + \lambda\left(\frac{1}{M}\sum_{i=1}^{M} P_i(w_{\text{POCS}}^{(t)}) - w_{\text{POCS}}^{(t)}\right)$ (5)

where $P_i(w_{\text{POCS}}^{(t)})$ is the projection of $w_{\text{POCS}}^{(t)}$ onto the set $S_i^*$ and $\lambda$ is known as the relaxation coefficient (Combettes, 1997).

Extrapolation in POCS. Combettes (1997) notes that POCS has primarily been used with $\lambda = 1$, with studies failing to demonstrate a systematic benefit of $\lambda < 1$ or $\lambda > 1$ (Mandel, 1984). This prompted Combettes (1997) to study an adaptive method of setting $\lambda$, first introduced by Pierra (1984), as follows:

$$\lambda^{(t)} = \frac{\sum_{i=1}^{M} \|P_i(w^{(t)}) - w^{(t)}\|^2}{M\left\|\frac{1}{M}\sum_{i=1}^{M} P_i(w^{(t)}) - w^{(t)}\right\|^2}.$$

Pierra (1984) refers to the POCS algorithm with this adaptive $\lambda^{(t)}$ as the Extrapolated Parallel Projection Method (EPPM). This is referred to as extrapolation since we always have $\lambda^{(t)} \geq 1$ by Jensen's inequality.
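To make the extrapolation coefficient concrete, here is a minimal numerical sketch (ours, not from the paper) that computes $\lambda^{(t)}$ for two lines in $\mathbb{R}^2$, for which exact projections are available in closed form; the function names are illustrative.

```python
import numpy as np

def project_onto_line(w, a, b):
    """Project w onto the line {x : <a, x> = b} (a closed convex set)."""
    return w - ((a @ w - b) / (a @ a)) * a

# Two convex sets: S1 = {3*w1 + w2 = 3} and S2 = {w1 + w2 = 3}; they meet at [0, 3].
constraints = [(np.array([3.0, 1.0]), 3.0), (np.array([1.0, 1.0]), 3.0)]
w = np.array([0.5, 0.5])  # current iterate

# Displacements towards each set's projection.
disps = [project_onto_line(w, a, b) - w for a, b in constraints]
avg_disp = np.mean(disps, axis=0)

# EPPM extrapolation coefficient: >= 1 by Jensen's inequality.
M = len(disps)
lam = sum(np.linalg.norm(d) ** 2 for d in disps) / (M * np.linalg.norm(avg_disp) ** 2)
w_next = w + lam * avg_disp
print(f"lambda = {lam:.3f}")  # > 1 whenever the per-set displacements disagree
```

Whenever the individual displacements point in different directions, their average shrinks and $\lambda^{(t)}$ rises above 1; this is exactly the effect that FedExP transplants to the server step size.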
The intuition behind EPPM lies in showing that the update with the proposed $\lambda^{(t)}$ always satisfies $\|w_{\text{POCS}}^{(t+1)} - w^*\|^2 < \|w_{\text{POCS}}^{(t)} - w^*\|^2$, thereby achieving asymptotic convergence. Experimental results in Pierra (1984) and Combettes (1997) show that EPPM can give an order-wise speedup over POCS, motivating us to study this algorithm in the FL context.

3.2 INCORPORATING EXTRAPOLATION IN FL

Note that to implement POCS we do not need to explicitly know the sets $\{S_i^*\}_{i=1}^{M}$; we only need to know how to compute a projection onto these sets. From this point of view, we see that FedAvg proceeds similarly to POCS. In each round, clients receive $w^{(t)}$ from the server and run multiple SGD steps to compute an approximate projection $w_i^{(t,\tau)}$ of $w^{(t)}$ onto their solution sets $S_i^*$. These approximate projections are then aggregated at the server to update the global model. In this case, the relaxation coefficient $\lambda$ plays exactly the same role as the server step size $\eta_g$ in FedAvg. Inspired by this observation and the idea of extrapolation in POCS, we seek to understand whether a similar idea can be applied to tune the server step size $\eta_g$ in FedAvg. Note that the EPPM algorithm makes use of exact projections to prove convergence, which are not available to us in FL settings. This is further complicated by the fact that the client updates are noisy due to the stochasticity in sampling minibatches. We find that in order to use an EPPM-like step size, the use of exact projections can be relaxed to the following condition, which bounds the distance of the local models from the global minimum:

Approximate projection condition in FL: $\frac{1}{M}\sum_{i=1}^{M} \|w_i^{(t,\tau)} - w^*\|^2 \leq \|w^{(t)} - w^*\|^2$ (6)

where $w^{(t)}$ and $\{w_i^{(t,\tau)}\}_{i=1}^{M}$ are the global and local client models, respectively, at round $t$, and $w^*$ is a global minimum. Intuitively, this condition says that after the local updates, the local models are on average closer to the optimum $w^*$ than the model $w^{(t)}$ at the beginning of that round. We first show that condition (6) holds in the overparameterized convex regime under some conditions. The full proofs of the lemmas and theorems in this paper are included in Appendix C.

Lemma 1. Let $F_i(w)$ be convex and $L$-smooth for all $i \in [M]$, and let $w^*$ be a common minimizer of all $F_i(w)$. Assuming clients run full-batch gradient descent to minimize their local objectives with $\eta_l \leq 1/L$, then (6) holds for all $t$ and $\tau \geq 1$.

In the case with stochastic gradient noise or when the model is underparameterized, although (6) may not hold in general, we expect it to be satisfied at least during the initial phase of training, when $\|w^{(t)} - w^*\|^2$ is large and clients make common progress towards a minimum.

¹We refer here to a parallel implementation of POCS. This is also known as the Parallel Projection Method (PPM) and the Simultaneous Iterative Reconstruction Technique (SIRT) in some literature (Combettes, 1997).

Algorithm 1 Proposed Algorithm: FedExP
1: Input: $w^{(0)}$, number of rounds $T$, local iteration steps $\tau$, parameters $\eta_l$, $\epsilon$
2: For $t = 0, \ldots, T-1$ communication rounds do:
3:   Global server does:
4:     Send $w^{(t)}$ to all clients
5:   Clients $i \in [M]$ in parallel do:
6:     Set $w_i^{(t,0)} \leftarrow w^{(t)}$
7:     For $k = 0, \ldots, \tau-1$ local iterations do:
8:       Update $w_i^{(t,k+1)} \leftarrow w_i^{(t,k)} - \eta_l \nabla F_i(w_i^{(t,k)}, \xi_i^{(t,k)})$
9:     Send $\Delta_i^{(t)} \leftarrow w^{(t)} - w_i^{(t,\tau)}$ to the server
10:  Global server does:
11:    Compute $\Delta^{(t)} \leftarrow \frac{1}{M}\sum_{i=1}^{M}\Delta_i^{(t)}$ and $\eta_g^{(t)} \leftarrow \max\left\{1, \frac{\sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2}{2M(\|\Delta^{(t)}\|^2 + \epsilon)}\right\}$
12:    Update global model with $w^{(t+1)} \leftarrow w^{(t)} - \eta_g^{(t)}\Delta^{(t)}$
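The server-side computation in Algorithm 1 is only a few lines in practice. Below is a minimal sketch of one FedExP aggregation step (our illustration, not the authors' reference code); `client_deltas` is assumed to already hold the pseudo-gradients $\Delta_i^{(t)}$ received from the clients, and the default value of `eps` is illustrative rather than prescribed.

```python
import numpy as np

def fedexp_server_step(w, client_deltas, eps=1e-3):
    """One global update of FedExP (Algorithm 1, lines 10-12)."""
    M = len(client_deltas)
    delta_avg = np.mean(client_deltas, axis=0)                   # Delta^(t)
    sum_sq = sum(np.linalg.norm(d) ** 2 for d in client_deltas)  # sum_i ||Delta_i^(t)||^2
    # Extrapolated server step size, floored at 1 (vanilla FedAvg).
    eta_g = max(1.0, sum_sq / (2 * M * (np.linalg.norm(delta_avg) ** 2 + eps)))
    return w - eta_g * delta_avg, eta_g
```

With secure aggregation, the server would instead receive the aggregates $\sum_i \Delta_i^{(t)}$ and $\sum_i \|\Delta_i^{(t)}\|^2$ rather than the individual client updates, which suffices to compute the same step.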
Given that (6) holds, we now consider the generalized FedAvg update with a server step size $\eta_g^{(t)}$ in round $t$. Our goal is to find the value of $\eta_g^{(t)}$ that minimizes the distance of $w^{(t+1)}$ to $w^*$:

$$\|w^{(t+1)} - w^*\|^2 = \|w^{(t)} - w^*\|^2 + (\eta_g^{(t)})^2\|\Delta^{(t)}\|^2 - 2\eta_g^{(t)}\langle w^{(t)} - w^*, \Delta^{(t)}\rangle. \quad (7)$$

Setting the derivative of the RHS of (7) with respect to $\eta_g^{(t)}$ to zero, we have

$$(\eta_g^{(t)})_{\text{opt}} = \frac{\langle w^{(t)} - w^*, \Delta^{(t)}\rangle}{\|\Delta^{(t)}\|^2} = \frac{\sum_{i=1}^{M}\langle w^{(t)} - w^*, \Delta_i^{(t)}\rangle}{M\|\Delta^{(t)}\|^2} \geq \frac{\sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2}{2M\|\Delta^{(t)}\|^2}, \quad (8)$$

where the last inequality follows from $\langle a, b\rangle = \frac{1}{2}[\|a\|^2 + \|b\|^2 - \|a - b\|^2]$, the definition of $\Delta_i^{(t)}$ in (3), and (6). Note that depending on the values of $\{\Delta_i^{(t)}\}_{i=1}^{M}$, we may have $(\eta_g^{(t)})_{\text{opt}} \geq 1$. Thus, we see that (6) acts as a suitable replacement for projection to justify the use of extrapolation in FL settings.

3.3 PROPOSED ALGORITHM

Motivated by our findings above, we propose the following server step size for the generalized FedAvg update at each round:

$$(\eta_g^{(t)})_{\text{FedExP}} = \max\left\{1, \frac{\sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2}{2M(\|\Delta^{(t)}\|^2 + \epsilon)}\right\}. \quad (9)$$

We term our algorithm Federated Extrapolated Averaging, or FedExP, in reference to the original EPPM algorithm that inspired this work. Note that our proposed step size satisfies $1 \leq (\eta_g^{(t)})_{\text{FedExP}} \leq (\eta_g^{(t)})_{\text{opt}}$ when (6) holds and $(\eta_g^{(t)})_{\text{opt}} \geq 1$, which can be seen by comparing (8) and (9). Since (7) depends quadratically on $\eta_g^{(t)}$, we can show that in this case $\|w^{(t)} - (\eta_g^{(t)})_{\text{FedExP}}\Delta^{(t)} - w^*\|^2 \leq \|w^{(t)} - \Delta^{(t)} - w^*\|^2$, implying that we end up at least as close to the optimum as with the FedAvg update. In the rest of the paper, we denote $(\eta_g^{(t)})_{\text{FedExP}}$ by $\eta_g^{(t)}$ when the context is clear.

Importance of Adding a Small Constant to the Denominator. In the case where (6) does not hold, using the lower bound established in (8) can cause the proposed step size to blow up. This is especially true towards the end of training, where we can have $\|\Delta^{(t)}\|^2 \approx 0$ but $\|\Delta_i^{(t)}\|^2 \neq 0$. Thus we propose to add a small positive constant $\epsilon$ to the denominator in (9) to prevent this blow-up. For a large enough $\epsilon$ our algorithm reduces to FedAvg, and therefore tuning $\epsilon$ can be a useful tool to interpolate between vanilla averaging and extrapolation. Similar techniques exist in adaptive algorithms such as Adam (Kingma & Ba, 2015) and Adagrad (Duchi et al., 2011) to improve stability.

Compatibility with Partial Client Participation and Secure Aggregation. Note that FedExP can easily be extended to support partial participation of clients by calculating $\eta_g^{(t)}$ using only the updates of participating clients, i.e., the averaging and division in (9) are only over the clients that participate in the round. Furthermore, since the server only needs to estimate the average of the pseudo-gradient norms, $\eta_g^{(t)}$ can be computed with secure aggregation, similar to computing $\Delta^{(t)}$.

Connection with Gradient Diversity. We see that our lower bound on $(\eta_g^{(t)})_{\text{opt}}$ naturally depends on the similarity of the client updates to each other. In the case where $\tau = 1$ and clients run full-batch gradient descent, our lower bound (8) reduces to $\frac{\sum_{i=1}^{M}\|\nabla F_i(w^{(t)})\|^2}{2M\|\nabla F(w^{(t)})\|^2}$, which is used as a measure of data heterogeneity in many FL works (Wang et al., 2020; Haddadpour & Mahdavi, 2019). Our lower bound suggests using larger step sizes as this gradient diversity increases, which can be a useful tool to speed up training in heterogeneous settings. This is an orthogonal approach to existing optimization methods for tackling heterogeneity such as Karimireddy et al. (2020b); Li et al. (2020); Acar et al. (2021), which propose adding regularization terms or control variates to the local client objectives to limit the impact of heterogeneity.
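As a quick illustration of the gradient diversity connection, the following sketch (ours, for intuition only) evaluates the $\tau = 1$ lower bound from (8) for two synthetic client gradients and shows that it grows as the gradients become more dissimilar.

```python
import numpy as np

def grad_diversity_bound(grads):
    """Lower bound on the optimal server step size for tau = 1, full-batch GD:
    sum_i ||g_i||^2 / (2M ||mean_i g_i||^2)."""
    M = len(grads)
    mean_g = np.mean(grads, axis=0)
    return sum(np.linalg.norm(g) ** 2 for g in grads) / (2 * M * np.linalg.norm(mean_g) ** 2)

g = np.array([1.0, 0.0])
print(grad_diversity_bound([g, g]))                     # identical gradients  -> 0.5
print(grad_diversity_bound([g, np.array([0.0, 1.0])]))  # orthogonal gradients -> 1.0
```

As client gradients increasingly cancel each other out, the denominator shrinks and the bound grows without limit, which is precisely why the $\max\{1, \cdot\}$ floor and the constant $\epsilon$ in (9) are needed.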
4 CONVERGENCE ANALYSIS

Our analysis so far has focused on the overparameterized convex regime to motivate our algorithm. In this section, we discuss the convergence of our algorithm in the presence of underparameterization and non-convexity. We would like to emphasize that (6) is not needed to show convergence of FedExP; it is only needed to motivate why FedExP might be beneficial. To show general convergence, we only require that $\eta_l$ be sufficiently small, together with the standard assumptions stated below.

Challenge in incorporating stochastic noise and partial participation. Our current analysis focuses on the case where clients compute full-batch gradients in every step with full participation. This is primarily due to the difficulty in decoupling the effect of stochastic and sampling noise on $\eta_g^{(t)}$ and the pseudo-gradients $\{\Delta_i^{(t)}\}_{i=1}^{M}$. To be more specific, if we use $\xi^{(t)}$ to denote the randomness at round $t$, then $\mathbb{E}_{\xi^{(t)}}[\eta_g^{(t)}\Delta^{(t)}] \neq \mathbb{E}_{\xi^{(t)}}[\eta_g^{(t)}]\,\mathbb{E}_{\xi^{(t)}}[\Delta^{(t)}]$, which significantly complicates the proof. This is purely a theoretical limitation. Empirically, our results in Section 6 show that FedExP performs well with both SGD and partial client participation.

Assumption 1. ($L$-smoothness) The local objective $F_i(w)$ is differentiable and $L$-smooth for all $i \in [M]$, i.e., $\|\nabla F_i(w) - \nabla F_i(w')\| \leq L\|w - w'\|, \forall w, w' \in \mathbb{R}^d$.

Assumption 2. (Bounded data heterogeneity at optimum) The norm of the client gradients at the global optimum $w^*$ is bounded as follows: $\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w^*)\|^2 \leq \sigma_*^2$.

Theorem 1. ($F_i$ are convex) Under Assumptions 1 and 2, and assuming clients compute full-batch gradients with full participation and $\eta_l \leq \frac{1}{6\tau L}$, the iterates $\{w^{(t)}\}$ generated by FedExP satisfy

$$F(\bar{w}^{(T)}) - F^* \leq \underbrace{\mathcal{O}\left(\frac{\|w^{(0)} - w^*\|^2}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\right)}_{T_1 := \text{initialization error}} + \underbrace{\mathcal{O}\left(\eta_l^2\tau(\tau-1)L\sigma_*^2\right)}_{T_2 := \text{client drift error}} + \underbrace{\mathcal{O}\left(\eta_l\tau\sigma_*^2\right)}_{T_3 := \text{noise at optimum}}$$

where $\eta_g^{(t)}$ is the FedExP server step size at round $t$ and $\bar{w}^{(T)} = \frac{\sum_{t=0}^{T-1}\eta_g^{(t)}w^{(t)}}{\sum_{t=0}^{T-1}\eta_g^{(t)}}$.

For the non-convex case, we need the data heterogeneity to be bounded everywhere, as follows.

Assumption 3. (Bounded global gradient variance) There exists a constant $\sigma_g^2 > 0$ such that the global gradient variance is bounded as $\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w) - \nabla F(w)\|^2 \leq \sigma_g^2, \forall w \in \mathbb{R}^d$.

Theorem 2. ($F_i$ are non-convex) Under Assumptions 1 and 3, and assuming clients compute full-batch gradients with full participation and $\eta_l \leq \frac{1}{6\tau L}$, the iterates $\{w^{(t)}\}$ generated by FedExP satisfy

$$\min_{t\in\{0,\ldots,T-1\}}\|\nabla F(w^{(t)})\|^2 \leq \underbrace{\mathcal{O}\left(\frac{F(w^{(0)}) - F^*}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\right)}_{T_1 := \text{initialization error}} + \underbrace{\mathcal{O}\left(\eta_l^2 L^2\tau(\tau-1)\sigma_g^2\right)}_{T_2 := \text{client drift error}} + \underbrace{\mathcal{O}\left(\eta_l L\tau\sigma_g^2\right)}_{T_3 := \text{global variance}}$$

where $\eta_g^{(t)}$ is the FedExP server step size at round $t$.

Discussion. In the convex case, the error of FedAvg can be bounded by $\mathcal{O}\left(\|w^{(0)} - w^*\|^2/(\eta_l\tau T)\right) + \mathcal{O}\left(\eta_l^2\tau(\tau-1)L\sigma_*^2\right)$ (Khaled et al., 2020), and in the non-convex case by $\mathcal{O}\left((F(w^{(0)}) - F^*)/(\eta_l\tau T)\right) + \mathcal{O}\left(\eta_l^2 L^2\tau(\tau-1)\sigma_g^2\right)$ (Wang et al., 2020). A careful inspection reveals that the impact of $T_1$ on the convergence of FedExP is different from FedAvg (the effect of $T_2$ is the same). We see that since $\sum_{t=0}^{T-1}\eta_g^{(t)} \geq T$, FedExP reduces $T_1$ faster than FedAvg. However, this comes at the price of an increased error floor due to $T_3$. Thus, the larger step sizes in FedExP help us reach the vicinity of an optimum faster, but can ultimately saturate at a higher error floor due to noise around the optimum. Note that the impact of the error floor can be controlled by setting the client step size $\eta_l$ appropriately. Moreover, in the overparameterized convex regime where $\sigma_*^2 = 0$, the effects of $T_2$ and $T_3$ vanish, and thus FedExP clearly outperforms FedAvg. This aligns well with our initial motivation of using extrapolation in the overparameterized regime.
5 FURTHER INSIGHTS INTO FEDEXP

In this section, we discuss some further insights into the training dynamics of FedExP and how we leverage these insights to improve its performance.

FedExP monotonically decreases $\|w^{(t)} - w^*\|^2$ but not necessarily $F(w^{(t)}) - F(w^*)$. Recall that our original motivation for the FedExP step size was to minimize the distance to the optimum, given by $\|w^{(t+1)} - w^*\|^2$, when (6) holds. Doing so satisfies $\|w^{(t+1)} - w^*\|^2 \leq \|w^{(t)} - w^*\|^2$ but does not necessarily satisfy $F(w^{(t+1)}) \leq F(w^{(t)})$. To better illustrate this phenomenon, we consider the following toy example in $\mathbb{R}^2$. We consider a setup with two clients, where the objective at each client is given as follows:

$$F_1(w) = (3w_1 + w_2 - 3)^2; \quad F_2(w) = (w_1 + w_2 - 3)^2. \quad (12)$$

We denote the sets of minimizers of $F_1(w)$ and $F_2(w)$ by $S_1^* = \{w : 3w_1 + w_2 = 3\}$ and $S_2^* = \{w : w_1 + w_2 = 3\}$, respectively. Note that $S_1^*$ and $S_2^*$ intersect at the point $w^* = [0, 3]$, making it a global minimum. To minimize their local objectives, we assume clients run gradient descent with $\tau \to \infty$ in every round².

[Figure 2: Training characteristics of FedAvg and FedExP for the 2-D toy problem in Section 5, showing the trajectory of the iterates, the global objective versus training rounds, and the distance to the optimum $\|w^{(t)} - w^*\|^2$ versus training rounds, for FedAvg (last iterate), FedExP (last iterate), and FedExP (average of last two iterates). The last iterate of FedExP has an oscillating behavior in $F(w)$ but monotonically decreases $\|w^{(t)} - w^*\|^2$; the average of the last two iterates lies in a lower loss region than the last iterate.]

Figure 2 shows the trajectory of the iterates generated by FedExP and FedAvg. We see that while $\|w^{(t)} - w^*\|^2$ decreases monotonically for FedExP, $F(w^{(t)})$ does not and in fact has an oscillating nature, as we discuss below.

Understanding oscillations in $F(w^{(t)})$. We see that the oscillations in $F(w^{(t)})$ are caused by the FedExP iterates trying to minimize their distance from the solution sets $S_1^*$ and $S_2^*$ simultaneously. The initialization point $w^{(0)}$ is closer to $S_1^*$ than $S_2^*$, which causes the FedExP iterate at round 1 to move towards $S_2^*$, then back towards $S_1^*$, and so on. To understand why this happens, consider the case where $\Delta_1^{(t)} = 0, \Delta_2^{(t)} \neq 0$. In this case, we have $\eta_g^{(t)} = 2$ and therefore $w^{(t+1)} = w^{(t)} - 2\Delta^{(t)} = w_2^{(t,\tau)}$, which indicates that FedExP is now trying to minimize $\|\Delta_2^{(t+1)}\|^2$. This gives us the intuition that the FedExP update in round $t$ is trying to minimize the objectives of the clients that have $\|\Delta_i^{(t)}\|^2 \gg 0$. While this leads to a temporary increase in the global loss $F(w^{(t)})$ in some rounds, as shown in Figure 2, it is beneficial in the long run as it leads to a faster decrease in the distance to the global optimum $w^*$.

²The local models will be exact projections of the global model onto the solution sets $\{S_i^*\}_{i=1}^{2}$. In this case, the lower bound in (8) can be improved by a factor of 2, and therefore we use $\eta_g^{(t)} = (\|\Delta_1^{(t)}\|^2 + \|\Delta_2^{(t)}\|^2)/(2\|\Delta^{(t)}\|^2)$ for this experiment (see Appendix C.4 and Appendix C.4.1 for the proof).
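The oscillation above is easy to reproduce. The following self-contained sketch (ours) runs the toy problem with exact projections and the factor-of-2-improved step size from footnote 2; the printed distance $\|w^{(t)} - w^*\|^2$ decreases monotonically while the global objective oscillates.

```python
import numpy as np

def project(w, a, b):
    # Exact projection onto {x : <a, x> = b}; with tau -> inf, local GD reaches this point.
    return w - ((a @ w - b) / (a @ a)) * a

clients = [(np.array([3.0, 1.0]), 3.0),   # S1*: 3*w1 + w2 = 3
           (np.array([1.0, 1.0]), 3.0)]   # S2*: w1 + w2 = 3
w, w_star = np.array([0.5, 0.5]), np.array([0.0, 3.0])

for t in range(15):
    deltas = [w - project(w, a, b) for a, b in clients]
    d_avg = np.mean(deltas, axis=0)
    # Factor-of-2-improved step size from footnote 2 (exact projections, M = 2).
    eta_g = sum(np.linalg.norm(d) ** 2 for d in deltas) / (2 * np.linalg.norm(d_avg) ** 2 + 1e-12)
    w = w - eta_g * d_avg
    F = 0.5 * ((clients[0][0] @ w - 3) ** 2 + (clients[1][0] @ w - 3) ** 2)
    print(t, round(float(F), 4), round(float(np.linalg.norm(w - w_star) ** 2), 4))
```

For instance, the very first step moves from $w^{(0)} = (0.5, 0.5)$ past both lines: the distance to $w^*$ drops from 6.5 to about 4.98 even though the global objective rises from about 3.13 to 3.46, exactly the behavior visible in Figure 2.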
[Figure 3: Experimental results on a synthetic linear regression experiment (mean squared error versus training rounds) and a range of realistic FL tasks (test accuracy versus training rounds), comparing FedExP (ours), FedAvg, SCAFFOLD, and FedAdagrad. FedExP consistently gives faster convergence compared to the baselines while adding no extra computation, communication, or storage at clients or the server.]

Averaging the last two iterates in FedExP. Given the oscillating behavior of the iterates of FedExP, we find that measuring progress on $F(w)$ using the last iterate can be misleading. Motivated by this finding, we propose to set the final model as the average of the last two iterates of FedExP. While the last iterate oscillates between regions that minimize the losses $F_1(w)$ and $F_2(w)$ respectively, the average of the last two iterates is more stable and proceeds along a globally low loss region. Interestingly, we find that the benefits of averaging the iterates of FedExP also extend to training neural networks with multiple clients in practical FL scenarios (see Appendix D.1). In practice, the number of iterates to average over could also be a hyperparameter for FedExP, but we find that averaging the last two iterates works well, and we use this for our other experiments.

6 EXPERIMENTS

We evaluate the performance of FedExP on synthetic and real FL tasks. For our synthetic experiment, we consider a distributed overparameterized linear regression problem. This experiment aligns most closely with our theory and allows us to carefully examine the performance of FedExP when (6) holds. For realistic FL tasks, we consider image classification on the following datasets: i) EMNIST (Cohen et al., 2017), ii) CIFAR-10 (Krizhevsky et al., 2009), iii) CIFAR-100 (Krizhevsky et al., 2009), iv) CINIC-10 (Darlow et al., 2018). In all experiments, we compare against the following baselines: i) FedAvg, ii) SCAFFOLD (Karimireddy et al., 2020b), and iii) FedAdagrad (Reddi et al., 2021), which is a federated version of the popular Adagrad algorithm. To the best of our knowledge, we are not aware of any other baselines that adaptively tune the server step size in FL.

Experimental Setup. For the synthetic experiment, we consider a setup with 20 clients, 30 samples at each client, and a model of size 1000, making this an overparameterized problem. The data at each client are generated following a procedure similar to the synthetic dataset in Li et al. (2020). We use the federated version of EMNIST available at Caldas et al. (2019), which is naturally partitioned into 3400 clients. For CIFAR-10/100 we artificially partition the data into 100 clients, and for CINIC-10 we partition the data into 200 clients. In both cases, we follow a Dirichlet distribution with α = 0.3 for the partitioning to model heterogeneity among client data (Hsu et al., 2019), as sketched below. For EMNIST we use the same CNN architecture used in Reddi et al. (2021). For CIFAR-10, CIFAR-100 and CINIC-10 we use a ResNet-18 model (He et al., 2016). For our baselines, we find the best performing ηg and ηl by grid-search tuning. For FedExP we optimize ε and ηl by grid search.
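The label-skew partitioning just described follows Hsu et al. (2019). A minimal sketch of the scheme (our illustration; the function name is ours) is:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.3, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label proportions.
    Smaller alpha gives more heterogeneous (non-IID) client datasets."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Example: partition 50000 CIFAR-10-style labels across 100 clients.
labels = np.random.default_rng(1).integers(0, 10, size=50000)
parts = dirichlet_partition(labels, num_clients=100, alpha=0.3)
```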
We fix the number of participating clients to 20, the minibatch size to 50, and the number of local updates to 20 for all experiments. In Appendix D, we provide additional details and results, including the best performing hyperparameters, a comparison with FedProx (Li et al., 2020), and results for more rounds.

Table 1: Average number of rounds needed to reach the target accuracy for FedExP and the baselines. FedExP provides a consistent speedup over all baselines.

Dataset    Target Acc.  FedExP  FedAvg         SCAFFOLD        FedAdagrad
EMNIST     84%          186     328 (1.76×)    232 (1.24×)     277 (1.48×)
CIFAR-10   72%          267     434 (1.62×)    429 (1.61×)     419 (1.56×)
CIFAR-100  40%          242     500 (2.06×)    >500 (>2.06×)   494 (2.04×)
CINIC-10   58%          318     450 (1.42×)    470 (1.48×)     444 (1.40×)

FedExP comprehensively outperforms FedAvg and baselines. Our experimental results in Figure 3 demonstrate that FedExP clearly outperforms FedAvg and the competing baselines, even when they use the best performing ηg and ηl found by grid search. Moreover, FedExP does not require additional communication or storage at clients or the server, unlike SCAFFOLD and FedAdagrad. The order-wise improvement in the convex linear regression experiment confirms our theoretical motivation for FedExP outlined in Section 3.2. In this case, since (6) is satisfied, we know that the FedExP iterates are always moving towards the optimum. For realistic FL tasks, we see a consistent speedup of 1.4-2× over FedAvg (Table 1). This verifies that FedExP also provides performance improvement in more general settings with realistic datasets and models. Plots showing $\eta_g^{(t)}$ can be found in Appendix D.5. The key takeaway from our experiments is that adapting the server step size allows FedExP to take much larger steps in some (but not all) rounds compared to the constant optimally-tuned step size of our baselines, leading to a large speedup.

Comparison with FedAdagrad. As discussed in Section 1, FedAdagrad and FedExP use different notions of adaptivity: FedAdagrad uses coordinate-wise adaptivity, while FedExP uses client-based adaptivity. We believe that the latter is more meaningful for FL settings, as seen in our experiments. In many experiments, especially image classification tasks like CIFAR, the gradients produced are dense with relatively little variance in coordinate-wise gradient magnitudes (Reddi et al., 2021; Zhang et al., 2020). In such cases, FedAdagrad is unable to leverage any coordinate-level information and gives almost the same performance as FedAvg.

Comparison with SCAFFOLD. We see that FedExP outperforms SCAFFOLD in all experiments, showing that adaptively tuning the server step size is sufficient to achieve speedup in FL settings. Furthermore, SCAFFOLD even fails to outperform FedAvg on the more difficult CIFAR and CINIC datasets. Several other papers have reported similar findings, including Reddi et al. (2021); Karimireddy et al. (2020a); Yu et al. (2022). Several reasons have been postulated for this behavior, including the staleness of control variates (Reddi et al., 2021) and the difficulty in characterizing client drift in non-convex scenarios (Yu et al., 2022). Thus, while theoretically attractive, simply using variance reduction techniques such as SCAFFOLD may not provide any speedup in practice.

[Figure 4: Test accuracy (%) versus training rounds on CIFAR-10 for FedExP, SCAFFOLD-ExP, FedAvg, and SCAFFOLD. Adding extrapolation to SCAFFOLD yields greater speedup.]

Adding extrapolation to SCAFFOLD.
We note that SCAFFOLD only modifies the local SGD procedure at the clients and keeps the global aggregation at the server unchanged. Therefore, it is easy to modify the SCAFFOLD algorithm to use extrapolation when updating the global model at the server (algorithm details in Appendix E). Figure 4 shows the result of our proposed extrapolated SCAFFOLD on the CIFAR-10 dataset. Interestingly, we observe that while SCAFFOLD alone fails to outperform FedAvg, the extrapolated version of SCAFFOLD achieves the best performance among all algorithms. This result highlights the importance of carefully tuning the server step size to achieve the best performance for variance-reduction algorithms. It is also possible to add extrapolation to algorithms with server momentum (Appendix F).

7 CONCLUSION

In this paper, we have proposed FedExP, a novel extension of FedAvg that adaptively determines the server step size used in every round of global aggregation in FL. Our algorithm is based on the key observation that FedAvg can be seen as an approximate variant of the POCS algorithm, especially for overparameterized convex objectives. This has inspired us to leverage the idea of extrapolation that is used to speed up POCS in a federated setting, resulting in FedExP. We have also discussed several theoretical and empirical perspectives on FedExP. In particular, we have explained some design choices in FedExP and how it can be used in practical scenarios with partial client participation and secure aggregation. We have also shown the convergence of FedExP for possibly underparameterized models and non-convex objectives. Our experimental results have shown that FedExP consistently outperforms baseline algorithms with virtually no additional computation or communication at clients or the server. We have also shown that the idea of extrapolation can be combined with other techniques, such as the variance-reduction method in SCAFFOLD, for greater speedup. Future work will study the convergence analysis of FedExP with stochastic gradient noise and the incorporation of extrapolation into a wider range of algorithms used in FL.

ACKNOWLEDGMENTS

This work was supported in part by NSF grants CCF-2045694 and CNS-2112471, ONR grant N00014-23-1-2149, and the CMU David Barakat and La Verne Owen-Barakat Fellowship.

REFERENCES

Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32, 2019.

Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1-3, 1966.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019.

Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.

K. A. Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. In NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
Stephen Boyd and Jon Dattorro. Alternating projections, 2003. https://web.stanford.edu/class/ee392o/alt_proj.pdf.

Oleg Burdakov, Yu-Hong Dai, and Na Huang. Stabilized Barzilai-Borwein method. Journal of Computational Mathematics, 37(6):916-936, 2019.

Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. In Workshop on Federated Learning for Data Privacy and Confidentiality, 2019.

Zachary Charles and Jakub Konečný. On the outsized importance of learning rates in local update methods. arXiv preprint arXiv:2007.00878, 2020.

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921-2926. IEEE, 2017.

Patrick L Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6(4):493-506, 1997.

Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.

Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Local SGD optimizes overparameterized neural networks in polynomial time. In International Conference on Artificial Intelligence and Statistics, pp. 6840-6861. PMLR, 2022.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

AA Goldstein. Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1):14-22, 1977.

Leonid Georgievich Gurin, Boris Teodorovich Polyak, and È V Raik. The method of projections for finding the common point of convex sets. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 7(6):1211-1228, 1967.

Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

Elad Hazan and Sham Kakade. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Samuel Horváth, Konstantin Mishchenko, and Peter Richtárik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022.

Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pp. 4423-4434. PMLR, 2021.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Tyler Johnson, Pulkit Agrawal, Haijie Gu, and Carlos Guestrin. AdaScale SGD: A user-friendly algorithm for distributed training. In International Conference on Machine Learning, pp. 4911-4920. PMLR, 2020.

Swanand Kadhe, Nived Rajaraman, O Ozan Koyluoglu, and Kannan Ramchandran. FastSecAgg: Scalable secure aggregation for privacy-preserving federated learning.
arXiv preprint arXiv:2009.11248, 2020.

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1-210, 2021.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 3252-3261. PMLR, 2019.

Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020b.

Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519-4529. PMLR, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. URL http://arxiv.org/abs/1412.6980.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429-450, 2020.

Nicolas Loizou, Sharan Vaswani, Issam Hadj Laradji, and Simon Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pp. 1306-1314. PMLR, 2021.

Grigory Malinovsky, Konstantin Mishchenko, and Peter Richtárik. Server-side stepsizes and sampling without replacement provably help in federated optimization. arXiv preprint arXiv:2201.11066, 2022.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of PMLR, pp. 6702-6712, 2020.

Jan Mandel. Convergence of the cyclical relaxation method for linear inequalities. Mathematical Programming, 30(2):218-228, 1984.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 15750-15769. PMLR, 2022.

Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606-14619, 2021.

Guy Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28(1):96-115, 1984.

Boris Teodorovich Polyak. Minimization of unsmooth functionals.
USSR Computational Mathematics and Mathematical Physics, 9(3):14-29, 1969. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(69)90061-5.

Marcos Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321-326, 1993.

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.

Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.

Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611-7623, 2020.

Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pp. 1998-2007. PMLR, 2018.

Yaodong Yu, Alexander Wei, Sai Praneeth Karimireddy, Yi Ma, and Michael I Jordan. TCT: Convexifying federated learning using bootstrapped neural tangent kernels. arXiv preprint arXiv:2207.06343, 2022.

Kai Yue, Richeng Jin, Ryan Pilgrim, Chau-Wai Wong, Dror Baron, and Huaiyu Dai. Neural tangent kernel empowered federated learning. In International Conference on Machine Learning, pp. 25783-25803. PMLR, 2022.

Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383-15393, 2020.

APPENDIX CONTENTS

A  Additional Related Work
B  Table of Notation and Schematic
   B.1  Table of Notation
   B.2  Schematic of Client-Server Communication in FedExP
C  Proofs
   C.1  Proof of Lemma 1
   C.2  Convergence Analysis for Convex Objectives
   C.3  Convergence Analysis for Non-Convex Objectives
   C.4  Exact Projection with Gradient Descent for Linear Regression
D  Additional Experiments and Setup Details
   D.1  Impact of Averaging Iterates for Neural Networks
   D.2  Dataset Details
   D.3  Hyperparameter Details
   D.4  Sensitivity of FedExP to ε
   D.5  Additional Results
E  Combining Extrapolation with SCAFFOLD
F  Combining Extrapolation with Server Momentum
A ADDITIONAL RELATED WORK

In this section, we provide further discussion of some additional related work that complements our discussion in Section 1.

Adaptive Step Size in Gradient Descent. Here we briefly discuss methods for tuning the step size in gradient descent and the challenges in applying them to the FL setting. Early methods to tune the step size in gradient descent were based on line search (or backtracking) strategies (Armijo, 1966; Goldstein, 1977). However, these strategies need to repeatedly compute the function value or gradient within an iteration, making them computationally expensive. Another popular class of adaptive step sizes is based on the Polyak step size (Polyak, 1969; Hazan & Kakade, 2019; Loizou et al., 2021); a sketch is given at the end of this appendix. Similar to FedExP, the Polyak step size is derived from trying to minimize the distance to the optimum for convex functions. However, it is not clear how this can be extended to the federated setting, where we only have access to pseudo-gradients. Also, the Polyak step size requires knowledge of the function value at the optimum, which is hard to estimate. Another related class of step sizes is the Barzilai-Borwein step size (Barzilai & Borwein, 1988). However, to the best of our knowledge, these are known to provably work only for quadratic functions (Raydan, 1993; Burdakov et al., 2019). A recent work (Malitsky & Mishchenko, 2020) alleviates some of the concerns associated with these classical methods by setting the step size as an approximation of the inverse local Lipschitz constant; however, it is again not clear how this intuition can be applied to the federated setting. An orthogonal line of work has focused on methods that adapt to the geometry of the data using gradient information from previous iterations, the most popular among them being Adagrad (Duchi et al., 2011) and its extensions RMSProp (Tieleman et al., 2012) and Adadelta (Zeiler, 2012). There exist federated counterparts of these algorithms, namely FedAdagrad; however, as we show in our experiments, these methods can fail to even outperform FedAvg on standard FL tasks.

Overparameterization in FL. Inspired by the success of analyzing deep neural networks in the neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), recent work has looked at studying the convergence of overparameterized neural networks in the FL setting. Huang et al. (2021) and Deng et al. (2022) show that for a sufficiently wide neural network and proper step size conditions, FedAvg will converge to a globally optimal solution even in the presence of data heterogeneity. We note that these works are primarily concerned with convergence analysis, whereas our focus is on developing a practical algorithm, inspired by characteristics of the overparameterized regime, for speeding up FL training. Another recent line of work has looked at utilizing NTK-style Jacobian features for learning an FL model in just a few rounds of communication (Yu et al., 2022; Yue et al., 2022). While interesting, these approaches are orthogonal to our current work.
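For reference, the Polyak step size mentioned above chooses, for a convex $f$ with known optimal value $f^*$, the step $\gamma_t = (f(x_t) - f^*)/\|\nabla f(x_t)\|^2$. A minimal centralized sketch (ours; note that it assumes $f^*$ is known, which is exactly what is hard to obtain in FL):

```python
import numpy as np

def polyak_gd(grad_f, f, f_star, x, iters=100):
    """Gradient descent with the Polyak step size (f_star must be known)."""
    for _ in range(iters):
        g = grad_f(x)
        gamma = (f(x) - f_star) / (np.linalg.norm(g) ** 2 + 1e-12)
        x = x - gamma * g
    return x

# Example: f(x) = ||x||^2 with f* = 0.
x = polyak_gd(lambda x: 2 * x, lambda x: x @ x, 0.0, np.array([3.0, -4.0]))
```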
B TABLE OF NOTATION AND SCHEMATIC

B.1 TABLE OF NOTATION

Table 2: Summary of notation used in the paper

Symbol          Description
||·||           L2 norm
M               Number of clients
ℓ(·,·)          Loss function
D_i             Dataset at the i-th client
F_i(w)          Local objective at the i-th client
F(w)            Global objective at the server
η_l             Client step size
η_g             Server step size
w^(t)           Global model at round t
η_g^(t)         FedExP server step size at round t
w_i^(t,k)       Local model at the i-th client at round t and local iteration k
τ               Number of local SGD steps
Δ_i^(t)         Update of the i-th client at round t
Δ^(t)           Average of client updates at round t
S_i^*           Set of minimizers of F_i(w)
T               Number of communication rounds
ε               Small constant added to the denominator of the FedExP step size
w^*             Global minimum
F^*             Minimum value of the global objective
L               Smoothness constant used in Assumption 1
σ_*^2           Upper bound on the variance of client gradients at the optimum (Assumption 2)
σ_g^2           Upper bound on the variance of client gradients (Assumption 3)

B.2 SCHEMATIC OF CLIENT-SERVER COMMUNICATION IN FEDEXP

At each round $t$, the server first sends the global model $w^{(t)}$ to all clients. Clients perform local optimization on $w^{(t)}$ to compute their local models $w_i^{(t,\tau)}$ and send back their update $\Delta_i^{(t)} = w^{(t)} - w_i^{(t,\tau)}$ and the norm of the update $\|\Delta_i^{(t)}\|^2$ to the server. This procedure is illustrated in Figure 5.

[Figure 5: Schematic of client-server communication in FedExP: 1) the server sends the global model $w^{(t)}$ to clients $1, \ldots, M$; 2) clients perform local training; 3) clients send back $\Delta_i^{(t)} = w^{(t)} - w_i^{(t,\tau)}$.]

C PROOFS

We first state some preliminary lemmas that will be used throughout the proofs.

Lemma 2. (Jensen's inequality) For any $a_i \in \mathbb{R}^d$, $i \in \{1, 2, \ldots, M\}$:

$$\left\|\frac{1}{M}\sum_{i=1}^{M} a_i\right\|^2 \leq \frac{1}{M}\sum_{i=1}^{M}\|a_i\|^2, \quad (13)$$

$$\left\|\sum_{i=1}^{M} a_i\right\|^2 \leq M\sum_{i=1}^{M}\|a_i\|^2. \quad (14)$$

We also note the following known result related to the Bregman divergence.

Lemma 3. (Khaled et al., 2020) If $F$ is smooth and convex, then

$$\|\nabla F(w) - \nabla F(w')\|^2 \leq 2L\left(F(w) - F(w') - \langle\nabla F(w'), w - w'\rangle\right). \quad (15)$$

Lemma 4. (Co-coercivity of a convex smooth function) If $F$ is $L$-smooth and convex, then

$$\langle\nabla F(w) - \nabla F(w'), w - w'\rangle \geq \frac{1}{L}\|\nabla F(w) - \nabla F(w')\|^2. \quad (16)$$

A direct consequence of this lemma is

$$\langle\nabla F(w), w - w^*\rangle \geq \frac{1}{L}\|\nabla F(w)\|^2 \quad (17)$$

where $w^*$ is a minimizer of $F(w)$.

C.1 PROOF OF LEMMA 1

Let $F_i(w)$ be the local objective at a client and $w^*$ be the global minimum. From the overparameterization assumption, we know that $w^*$ is also a minimizer of $F_i(w)$. We have

$$\|w_i^{(t,k+1)} - w^*\|^2 = \|w_i^{(t,k)} - \eta_l\nabla F_i(w_i^{(t,k)}) - w^*\|^2 \quad (18)$$
$$= \|w_i^{(t,k)} - w^*\|^2 - 2\eta_l\langle\nabla F_i(w_i^{(t,k)}), w_i^{(t,k)} - w^*\rangle + \eta_l^2\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (19)$$
$$\leq \|w_i^{(t,k)} - w^*\|^2 - \frac{2\eta_l}{L}\|\nabla F_i(w_i^{(t,k)})\|^2 + \eta_l^2\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (20)$$
$$\leq \|w_i^{(t,k)} - w^*\|^2 - \frac{\eta_l}{L}\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (21)$$

where (20) follows from (17) and (21) follows from $\eta_l \leq \frac{1}{L}$. Summing the above inequality from $k = 0$ to $\tau - 1$ and telescoping (recall $w_i^{(t,0)} = w^{(t)}$), we have

$$\|w_i^{(t,\tau)} - w^*\|^2 \leq \|w^{(t)} - w^*\|^2 - \frac{\eta_l}{L}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2. \quad (22)$$

Thus we have

$$\frac{1}{M}\sum_{i=1}^{M}\|w_i^{(t,\tau)} - w^*\|^2 \leq \|w^{(t)} - w^*\|^2 - \frac{\eta_l}{ML}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (23)$$
$$\leq \|w^{(t)} - w^*\|^2. \quad (24)$$

This completes the proof of this lemma.
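Lemma 1, and hence condition (6), can also be checked numerically. The following sketch (ours) builds an overparameterized least-squares problem in which a common minimizer $w^*$ exists by construction, runs local full-batch gradient descent with $\eta_l \leq 1/L$, and verifies (6); the quadratic objective $F_i(w) = \frac{1}{2}\|A_i w - b_i\|^2$ is our choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5                      # d >> n per client: overparameterized regime
w_star = rng.normal(size=d)
clients = []
for _ in range(2):
    A = rng.normal(size=(n, d))
    clients.append((A, A @ w_star))  # each client's data is consistent with w_star

w = rng.normal(size=d)            # global model w^(t)
locals_ = []
for A, b in clients:
    L = np.linalg.eigvalsh(A.T @ A).max()  # smoothness of F_i(w) = 0.5*||A w - b||^2
    wi = w.copy()
    for _ in range(20):                    # tau = 20 local GD steps with eta_l = 1/L
        wi -= (1.0 / L) * (A.T @ (A @ wi - b))
    locals_.append(wi)

lhs = np.mean([np.linalg.norm(wi - w_star) ** 2 for wi in locals_])
rhs = np.linalg.norm(w - w_star) ** 2
assert lhs <= rhs                          # condition (6) holds, as Lemma 1 predicts
```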
C.2 CONVERGENCE ANALYSIS FOR CONVEX OBJECTIVES

Our proof technique is inspired by Khaled et al. (2020), with some key differences. The biggest difference is the incorporation of the adaptive FedExP server step sizes, which Khaled et al. (2020) do not account for. Another difference is that we provide convergence guarantees in terms of the number of communication rounds $T$, while Khaled et al. (2020) focus on the total number of iterations $T' = T\tau$. We highlight below the specific steps where we made adjustments to the analysis of Khaled et al. (2020).

We begin by modifying Khaled et al. (2020, Lemma 11 and Lemma 13) to bound the client drift in every round instead of every iteration.

Lemma 5. (Bounding client aggregate gradients)

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2 \leq \frac{3L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w_i^{(t,k)} - w^{(t)}\|^2 + 6\tau L(F(w^{(t)}) - F(w^*)) + 3\tau\sigma_*^2. \quad (25)$$

Proof of Lemma 5:

$$\|\nabla F_i(w_i^{(t,k)})\|^2 = \|\nabla F_i(w_i^{(t,k)}) - \nabla F_i(w^{(t)}) + \nabla F_i(w^{(t)}) - \nabla F_i(w^*) + \nabla F_i(w^*)\|^2 \quad (26)$$
$$\leq 3\|\nabla F_i(w_i^{(t,k)}) - \nabla F_i(w^{(t)})\|^2 + 3\|\nabla F_i(w^{(t)}) - \nabla F_i(w^*)\|^2 + 3\|\nabla F_i(w^*)\|^2. \quad (27)$$

Averaging over $i$ and summing over $k$, we get

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2 \leq \frac{3L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w_i^{(t,k)} - w^{(t)}\|^2 + 6\tau L(F(w^{(t)}) - F^*) + 3\tau\sigma_*^2. \quad (28)$$

The first term in (28) follows from the $L$-smoothness of $F_i(w)$, the second term follows from Lemma 3, and the third term follows from the bounded noise at the optimum (Assumption 2).

Lemma 6. (Bounding client drift)

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2 \leq 12\eta_l^2\tau^2(\tau-1)L(F(w^{(t)}) - F(w^*)) + 6\eta_l^2\tau^2(\tau-1)\sigma_*^2. \quad (29)$$

Proof of Lemma 6:

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2 = \frac{\eta_l^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\sum_{l=0}^{k-1}\nabla F_i(w_i^{(t,l)})\right\|^2 \quad (30)$$
$$\leq \frac{\eta_l^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}k\sum_{l=0}^{k-1}\|\nabla F_i(w_i^{(t,l)})\|^2 \quad (31)$$
$$\leq \eta_l^2\tau(\tau-1)\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (32)$$
$$\leq 3\eta_l^2\tau(\tau-1)L^2\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2 + 6\eta_l^2\tau^2(\tau-1)L(F(w^{(t)}) - F(w^*)) + 3\eta_l^2\tau^2(\tau-1)\sigma_*^2 \quad (33)$$
$$\leq \frac{1}{2}\cdot\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2 + 6\eta_l^2\tau^2(\tau-1)L(F(w^{(t)}) - F(w^*)) + 3\eta_l^2\tau^2(\tau-1)\sigma_*^2 \quad (34)$$

where (31) uses Jensen's inequality, (33) uses Lemma 5, and (34) uses $\eta_l \leq \frac{1}{6\tau L}$. Rearranging terms, we have

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2 \leq 12\eta_l^2\tau^2(\tau-1)L(F(w^{(t)}) - F(w^*)) + 6\eta_l^2\tau^2(\tau-1)\sigma_*^2. \quad (35)$$

Proof of Theorem 1: We define the following auxiliary variable that will be used in the proof.

Aggregate client gradient: $h_i^{(t)} = \sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)})$. (36)

We also define $h^{(t)} = \frac{1}{M}\sum_{i=1}^{M}h_i^{(t)}$. Recall that the update of the global model can be written as $w^{(t+1)} = w^{(t)} - \eta_g^{(t)}\eta_l h^{(t)}$. We have

$$\|w^{(t+1)} - w^*\|^2 = \|w^{(t)} - \eta_g^{(t)}\eta_l h^{(t)} - w^*\|^2 \quad (37)$$
$$= \|w^{(t)} - w^*\|^2 - 2\eta_g^{(t)}\eta_l\langle w^{(t)} - w^*, h^{(t)}\rangle + (\eta_g^{(t)})^2\eta_l^2\|h^{(t)}\|^2 \quad (38)$$
$$\leq \|w^{(t)} - w^*\|^2 - 2\eta_g^{(t)}\eta_l\langle w^{(t)} - w^*, h^{(t)}\rangle + \eta_g^{(t)}\eta_l^2\frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2 \quad (39)$$

where (39) follows from $\eta_g^{(t)} \leq \frac{\sum_{i=1}^{M}\|h_i^{(t)}\|^2}{M\|h^{(t)}\|^2}$. Inequality (39) is a key step in our proof and the differentiating factor of our approach from Khaled et al. (2020). Following a technique similar to Khaled et al. (2020) to bound $(\eta_g^{(t)})^2\eta_l^2\|h^{(t)}\|^2$ would end up requiring the condition $\eta_l \leq 1/(8L\eta_g^{(t)})$, which cannot be satisfied in our setup due to the adaptive choice of $\eta_g^{(t)}$. Therefore we first upper bound $(\eta_g^{(t)})^2\eta_l^2\|h^{(t)}\|^2$ by $\eta_g^{(t)}\eta_l^2\frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2$ and focus on further bounding this quantity in the rest of the proof, which does not require the aforementioned condition. Note that this comes at the expense of the additional $T_3$ error seen in our final convergence bound in Theorem 1. We rewrite (39) as

$$\|w^{(t+1)} - w^*\|^2 \leq \|w^{(t)} - w^*\|^2 - 2\eta_g^{(t)}\eta_l\underbrace{\langle w^{(t)} - w^*, h^{(t)}\rangle}_{T_1} + \eta_g^{(t)}\eta_l^2\underbrace{\frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2}_{T_2}. \quad (40)$$

Bounding $T_2$:

$$T_2 = \frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2 = \frac{1}{M}\sum_{i=1}^{M}\left\|\sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)})\right\|^2 \quad (42)$$
$$\leq \frac{\tau}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|\nabla F_i(w_i^{(t,k)})\|^2 \quad (43)$$
$$\leq \frac{3\tau L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w_i^{(t,k)} - w^{(t)}\|^2 + 6\tau^2 L(F(w^{(t)}) - F^*) + 3\tau^2\sigma_*^2 \quad (44)$$

where (43) follows from Jensen's inequality and (44) follows from Lemma 5.

Bounding $T_1$:

$$T_1 = \frac{1}{M}\sum_{i=1}^{M}\langle w^{(t)} - w^*, h_i^{(t)}\rangle \quad (45)$$
$$= \frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\rangle. \quad (46)$$

We have

$$\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\rangle = \langle w^{(t)} - w_i^{(t,k)}, \nabla F_i(w_i^{(t,k)})\rangle + \langle w_i^{(t,k)} - w^*, \nabla F_i(w_i^{(t,k)})\rangle. \quad (47)$$

From the $L$-smoothness of $F_i$ we have

$$\langle w^{(t)} - w_i^{(t,k)}, \nabla F_i(w_i^{(t,k)})\rangle \geq F_i(w^{(t)}) - F_i(w_i^{(t,k)}) - \frac{L}{2}\|w^{(t)} - w_i^{(t,k)}\|^2. \quad (48)$$

From the convexity of $F_i$ we have

$$\langle w_i^{(t,k)} - w^*, \nabla F_i(w_i^{(t,k)})\rangle \geq F_i(w_i^{(t,k)}) - F_i(w^*). \quad (49)$$

Therefore, adding the above inequalities we have

$$\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\rangle \geq F_i(w^{(t)}) - F_i(w^*) - \frac{L}{2}\|w^{(t)} - w_i^{(t,k)}\|^2. \quad (50)$$

Substituting (50) in (46) we have

$$T_1 \geq \tau(F(w^{(t)}) - F(w^*)) - \frac{L}{2M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w^{(t)} - w_i^{(t,k)}\|^2. \quad (51)$$

Here we would like to note that the bound for $T_1$ is our contribution and is needed in our proof due to the relaxation in (39). The bound for $T_2$ follows a similar technique to Khaled et al. (2020, Lemma 12).
Substituting the bounds for $T_1$ and $T_2$ in (40), we have

$$\|w^{(t+1)} - w^*\|^2 \le \|w^{(t)} - w^*\|^2 - 2\eta_g^{(t)}\eta_l\tau(1 - 3\eta_l\tau L)\big(F(w^{(t)}) - F(w^*)\big) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + \big(3\eta_g^{(t)}\eta_l^2\tau L^2 + \eta_g^{(t)}\eta_l L\big)\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w_i^{(t,k)} - w^{(t)}\|^2$$
$$\le \|w^{(t)} - w^*\|^2 - \eta_g^{(t)}\eta_l\tau\big(F(w^{(t)}) - F(w^*)\big) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 2\eta_g^{(t)}\eta_l L\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\|w_i^{(t,k)} - w^{(t)}\|^2 \quad (52)$$
$$\le \|w^{(t)} - w^*\|^2 - \eta_g^{(t)}\eta_l\tau\big(F(w^{(t)}) - F(w^*)\big) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 24\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L^2\big(F(w^{(t)}) - F(w^*)\big) + 12\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L\sigma_*^2 \quad (53)$$
$$\le \|w^{(t)} - w^*\|^2 - \frac{\eta_g^{(t)}\eta_l\tau}{3}\big(F(w^{(t)}) - F(w^*)\big) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 12\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L\sigma_*^2, \quad (54)$$

where both (52) and (54) use $\eta_l \le \frac{1}{6\tau L}$, and (53) uses Lemma 6. Rearranging terms and averaging over all rounds, we have

$$\frac{\sum_{t=0}^{T-1}\eta_g^{(t)}\big(F(w^{(t)}) - F(w^*)\big)}{\sum_{t=0}^{T-1}\eta_g^{(t)}} \le \frac{3\|w^{(0)} - w^*\|^2}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}} + 9\eta_l\tau\sigma_*^2 + 36\eta_l^2\tau(\tau-1)L\sigma_*^2. \quad (55)$$

This implies

$$F(\bar w^{(T)}) - F(w^*) \le \mathcal{O}\bigg(\frac{\|w^{(0)} - w^*\|^2}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\bigg) + \mathcal{O}\big(\eta_l^2\tau(\tau-1)L\sigma_*^2\big) + \mathcal{O}\big(\eta_l\tau\sigma_*^2\big), \quad (56)$$

where $\bar w^{(T)} = \sum_{t=0}^{T-1}\eta_g^{(t)}w^{(t)} \big/ \sum_{t=0}^{T-1}\eta_g^{(t)}$. This completes the proof.

C.3 CONVERGENCE ANALYSIS FOR NON-CONVEX OBJECTIVES

Our proof technique is inspired by Wang et al. (2020), and we use one of their intermediate results to bound client drift in the non-convex setting, as described below. We highlight the specific steps where we adjust the analysis of Wang et al. (2020). We begin by defining the following auxiliary variables that will be used in the proof.

Normalized gradient:

$$h_i^{(t)} = \frac{1}{\tau}\sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)}). \quad (57)$$

We also define $\bar h^{(t)} = \frac{1}{M}\sum_{i=1}^{M}h_i^{(t)}$.

Lemma 7. (Bounding client drift in the non-convex setting)

$$\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2 \le \frac{1}{8}\|\nabla F(w^{(t)})\|^2 + 5\eta_l^2 L^2\tau(\tau-1)\sigma_g^2. \quad (58)$$

Proof of Lemma 7: Let $D = 4\eta_l^2 L^2\tau(\tau-1)$. We have the following bound from equation (87) in Wang et al. (2020):

$$\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2 \le \frac{D}{1-D}\|\nabla F(w^{(t)})\|^2 + \frac{D\sigma_g^2}{1-D}. \quad (59)$$

From $\eta_l \le \frac{1}{6\tau L}$ we have $D \le \frac{1}{9}$, which implies $\frac{1}{1-D} \le \frac{9}{8}$ and $\frac{D}{1-D} \le \frac{1}{8}$. Therefore we have

$$\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2 \le \frac{1}{8}\|\nabla F(w^{(t)})\|^2 + \frac{9D}{8}\sigma_g^2 \quad (60)$$
$$\le \frac{1}{8}\|\nabla F(w^{(t)})\|^2 + 5\eta_l^2 L^2\tau(\tau-1)\sigma_g^2. \quad (61)$$

Proof of Theorem 2: The update of the global model can be written as

$$w^{(t+1)} = w^{(t)} - \eta_g^{(t)}\eta_l\tau\bar h^{(t)}. \quad (62)$$

Now, using the Lipschitz-smoothness assumption, we have

$$F(w^{(t+1)}) \le F(w^{(t)}) - \eta_g^{(t)}\eta_l\tau\big\langle\nabla F(w^{(t)}), \bar h^{(t)}\big\rangle + \frac{(\eta_g^{(t)})^2\eta_l^2\tau^2 L}{2}\|\bar h^{(t)}\|^2 \quad (63)$$
$$\le F(w^{(t)}) - \eta_g^{(t)}\eta_l\tau\big\langle\nabla F(w^{(t)}), \bar h^{(t)}\big\rangle + \frac{\eta_g^{(t)}\eta_l^2\tau^2 L}{2M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2, \quad (64)$$

where (64) uses $\eta_g^{(t)} \le \sum_{i=1}^{M}\|h_i^{(t)}\|^2/(M\|\bar h^{(t)}\|^2)$. As in the convex case, inequality (64) is a key step in our proof and the differentiating factor of our approach from Wang et al. (2020). Following a similar technique as Wang et al. (2020) to bound $(\eta_g^{(t)})^2\eta_l^2\tau^2 L\|\bar h^{(t)}\|^2/2$ would need the condition $\eta_l \le 1/(2L\tau\eta_g^{(t)})$, which cannot be satisfied in our setup due to the adaptive choice of $\eta_g^{(t)}$. Therefore we first upper bound $(\eta_g^{(t)})^2\eta_l^2\tau^2 L\|\bar h^{(t)}\|^2/2$ by $\eta_g^{(t)}\eta_l^2\tau^2 L\frac{1}{2M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2$ and focus on further bounding this quantity in the rest of the proof, which does not require the aforementioned condition. Note that this comes at the expense of the additional $T_3$ error term seen in our final convergence bound in Theorem 2. Therefore we have

$$F(w^{(t+1)}) \le F(w^{(t)}) - \eta_g^{(t)}\eta_l\tau\underbrace{\big\langle\nabla F(w^{(t)}), \bar h^{(t)}\big\rangle}_{T_1} + \frac{\eta_g^{(t)}\eta_l^2\tau^2 L}{2}\underbrace{\frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2}_{T_2}. \quad (65)$$

Bounding $T_1$:

$$T_1 = \frac{1}{2}\|\nabla F(w^{(t)})\|^2 + \frac{1}{2}\|\bar h^{(t)}\|^2 - \frac{1}{2}\|\nabla F(w^{(t)}) - \bar h^{(t)}\|^2 \quad (66)$$
$$\ge \frac{1}{2}\|\nabla F(w^{(t)})\|^2 - \frac{1}{2}\|\nabla F(w^{(t)}) - \bar h^{(t)}\|^2 \quad (67)$$
$$\ge \frac{1}{2}\|\nabla F(w^{(t)})\|^2 - \frac{1}{2M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2, \quad (68)$$

where (66) uses the identity $\langle a, b\rangle = \frac{1}{2}\big(\|a\|^2 + \|b\|^2 - \|a-b\|^2\big)$, (67) drops the non-negative term $\frac{1}{2}\|\bar h^{(t)}\|^2$, and (68) uses Jensen's inequality and the definition of the global objective function $F$.

Bounding $T_2$:

$$T_2 = \frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2 \quad (69)$$
$$= \frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)} - \nabla F_i(w^{(t)}) + \nabla F_i(w^{(t)}) - \nabla F(w^{(t)}) + \nabla F(w^{(t)})\|^2 \quad (70)$$
$$\le \frac{3}{M}\sum_{i=1}^{M}\Big(\|h_i^{(t)} - \nabla F_i(w^{(t)})\|^2 + \|\nabla F_i(w^{(t)}) - \nabla F(w^{(t)})\|^2\Big) + 3\|\nabla F(w^{(t)})\|^2 \quad (71)$$
$$\le \frac{3}{M}\sum_{i=1}^{M}\|h_i^{(t)} - \nabla F_i(w^{(t)})\|^2 + 3\sigma_g^2 + 3\|\nabla F(w^{(t)})\|^2, \quad (72)$$

where (71) uses Jensen's inequality and (72) uses the bounded data heterogeneity assumption.
Here we would like to note that the bound for $T_2$ is our contribution and is needed in our proof due to the relaxation in (64). The bound for $T_1$ follows a similar technique as in Wang et al. (2020). Substituting the $T_1$ and $T_2$ bounds into (65), we have

$$F(w^{(t+1)}) \le F(w^{(t)}) - \frac{\eta_g^{(t)}\eta_l\tau}{2}\|\nabla F(w^{(t)})\|^2 + \frac{\eta_g^{(t)}\eta_l\tau}{2M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2 + \frac{\eta_g^{(t)}\eta_l^2\tau^2 L}{2}\bigg(3\sigma_g^2 + 3\|\nabla F(w^{(t)})\|^2 + \frac{3}{M}\sum_{i=1}^{M}\|h_i^{(t)} - \nabla F_i(w^{(t)})\|^2\bigg) \quad (73)$$
$$\le F(w^{(t)}) - \frac{\eta_g^{(t)}\eta_l\tau}{4}\|\nabla F(w^{(t)})\|^2 + \eta_g^{(t)}\eta_l\tau\frac{1}{M}\sum_{i=1}^{M}\|\nabla F_i(w^{(t)}) - h_i^{(t)}\|^2 + 3\eta_g^{(t)}\eta_l^2\tau^2 L\sigma_g^2 \quad (74)$$
$$\le F(w^{(t)}) - \frac{\eta_g^{(t)}\eta_l\tau}{8}\|\nabla F(w^{(t)})\|^2 + 3\eta_g^{(t)}\eta_l^2\tau^2 L\sigma_g^2 + 5\eta_g^{(t)}\eta_l^3 L^2\tau^2(\tau-1)\sigma_g^2, \quad (75)$$

where (74) uses $\eta_l \le \frac{1}{6\tau L}$ and (75) uses Lemma 7. Thus, rearranging terms and averaging over all rounds, we have

$$\frac{\sum_{t=0}^{T-1}\eta_g^{(t)}\|\nabla F(w^{(t)})\|^2}{\sum_{t=0}^{T-1}\eta_g^{(t)}} \le \frac{8\big(F(w^{(0)}) - F^*\big)}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}} + 40\eta_l^2 L^2\tau(\tau-1)\sigma_g^2 + 24\eta_l L\tau\sigma_g^2. \quad (76)$$

This implies

$$\min_{0\le t\le T-1}\|\nabla F(w^{(t)})\|^2 \le \mathcal{O}\bigg(\frac{F(w^{(0)}) - F^*}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\bigg) + \mathcal{O}\big(\eta_l^2 L^2\tau(\tau-1)\sigma_g^2\big) + \mathcal{O}\big(\eta_l L\tau\sigma_g^2\big). \quad (77)$$

This completes the proof.

C.4 EXACT PROJECTION WITH GRADIENT DESCENT FOR LINEAR REGRESSION

Let $F(w) = \|Aw - b\|^2$, where $A$ is an $(n\times d)$ matrix and $b$ is an $n$-dimensional vector. We assume that $d \ge n$ here and that $A$ has rank $n$. The singular value decomposition (SVD) of $A$ can be written as

$$A = U\Sigma V^\top = U\,[\Sigma_1 \;\; 0]\begin{bmatrix}V_1^\top \\ V_2^\top\end{bmatrix} = U\Sigma_1 V_1^\top, \quad (78)$$

where $U$ is an $(n\times n)$ orthogonal matrix, $\Sigma = [\Sigma_1 \;\; 0]$ is an $(n\times d)$ matrix with $\Sigma_1$ an $(n\times n)$ diagonal matrix, $V_1$ is a $(d\times n)$ matrix with orthonormal columns, and $V_2$ is a $(d\times(d-n))$ matrix with orthonormal columns. Here $V_1$ is a basis for the row space of $A$, while $V_2$ is a basis for the null space of $A$. We first prove the following lemmas about the set of minimizers of $F(w)$ and the projection onto this set.

Lemma 8. The set of minimizers of $F(w)$ is given by

$$S^* = \{V_2V_2^\top w + V_1\Sigma_1^{-1}U^\top b \mid w \in \mathbb{R}^d\}. \quad (79)$$

Proof. Let $w' = V_2V_2^\top x + V_1\Sigma_1^{-1}U^\top b$ for some $x \in \mathbb{R}^d$. We have

$$Aw' = U\Sigma_1V_1^\top\big(V_2V_2^\top x + V_1\Sigma_1^{-1}U^\top b\big) \quad (80)$$
$$= b, \quad (81)$$

where the last line uses $V_1^\top V_2 = 0$, $V_1^\top V_1 = I$ and $UU^\top = I$. This implies $\|Aw' - b\|^2 = 0$; thus any $w'$ in $S^*$ is a minimizer of $F(w)$. Now let $w'$ be a minimizer of $F(w)$, implying $Aw' = U\Sigma_1V_1^\top w' = b$. We have

$$w' = V_2V_2^\top w' + V_1V_1^\top w' \quad (82)$$
$$= V_2V_2^\top w' + V_1\Sigma_1^{-1}U^\top b, \quad (83)$$

where (82) uses $V_1V_1^\top + V_2V_2^\top = I$ and (83) uses $U\Sigma_1V_1^\top w' = b$. Thus any minimizer of $F(w)$ must lie in $S^*$. Combining the above statements, we have

$$w' \text{ is a minimizer of } F(w) \iff w' \in S^*, \quad (84)$$

which completes the proof.

Lemma 9. The projection of any $w \in \mathbb{R}^d$ onto $S^*$ is given by

$$P_{S^*}(w) = \arg\min_{w' \in S^*}\|w' - w\|^2 = V_2V_2^\top w + V_1\Sigma_1^{-1}U^\top b. \quad (85)$$

Proof. When $w \in S^*$, it is easy to see that this holds. Therefore we consider the case where $w \notin S^*$. Let $x = V_2V_2^\top w + V_1\Sigma_1^{-1}U^\top b$ and suppose, for contradiction, that $P_{S^*}(w) = V_2V_2^\top w_0 + V_1\Sigma_1^{-1}U^\top b$ for some $w_0$ with $V_2V_2^\top w_0 \ne V_2V_2^\top w$. Then

$$\|w - V_2V_2^\top w_0 - V_1\Sigma_1^{-1}U^\top b\|^2 \quad (86)$$
$$= \|V_2V_2^\top(w - w_0) + V_1V_1^\top w - V_1\Sigma_1^{-1}U^\top b\|^2 \qquad (V_1V_1^\top + V_2V_2^\top = I) \quad (87)$$
$$= \|V_2V_2^\top(w - w_0)\|^2 + \|V_1V_1^\top w - V_1\Sigma_1^{-1}U^\top b\|^2 \quad (88)$$
$$= \|V_2V_2^\top(w - w_0)\|^2 + \|w - x\|^2 \quad (89)$$
$$> \|w - x\|^2, \quad (90)$$

leading to a contradiction. The cross term in (88) is zero since $V_1^\top V_2 = 0$. Equation (89) follows from the definition of $x$.
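The closed form in Lemma 9 is easy to sanity-check numerically. The following NumPy sketch (our own illustration, not from the paper's code) builds a random overparameterized instance, computes $P_{S^*}(w)$ from the SVD, verifies that it lies in $S^*$, and previews the next lemma by checking that gradient descent started from $w$ converges to the same point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 200                          # overparameterized: d > n
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
w = rng.normal(size=d)                  # arbitrary starting point

# Compact SVD: A = U diag(s) V1^T, with V1 spanning the row space of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V1 = Vt.T
# P_{S*}(w) = V2 V2^T w + V1 Sigma_1^{-1} U^T b, using V2 V2^T = I - V1 V1^T
proj = w - V1 @ (V1.T @ w) + V1 @ ((U.T @ b) / s)
assert np.allclose(A @ proj, b)         # proj is a minimizer of ||Aw - b||^2

# Gradient descent from w with eta_l <= 1/lambda_max (cf. Lemma 10 below)
eta_l = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
w_t = w.copy()
for _ in range(2000):
    w_t -= eta_l * (A.T @ (A @ w_t) - A.T @ b)   # update (91)
assert np.allclose(w_t, proj)           # GD recovers the projection P_{S*}(w)
```

This is the sense in which, for overparameterized linear regression, the local models behave as projections of the global model onto the local solution sets.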
We now show that running gradient descent on $F(w)$ starting from $w$ with a sufficiently small step size converges to $P_{S^*}(w)$.

Lemma 10. Let $w^{(0)}, w^{(1)}, \dots$ be the iterates generated by running gradient descent on $F(w)$ with $w^{(0)} = w$ and learning rate $\eta_l \le 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $A^\top A$. Then $\lim_{T\to\infty} w^{(T)} = P_{S^*}(w)$.

Proof. By the gradient descent update we have

$$w^{(t+1)} = w^{(t)} - \eta_l\big(A^\top A w^{(t)} - A^\top b\big) \quad (91)$$
$$= \big(I - \eta_l A^\top A\big)w^{(t)} + \eta_l A^\top b. \quad (92)$$

Unrolling the recursion, we get

$$w^{(T)} = \big(I - \eta_l A^\top A\big)^T w^{(0)} + \eta_l\sum_{t=0}^{T-1}\big(I - \eta_l A^\top A\big)^t A^\top b \quad (93)$$
$$= V\big(I - \eta_l\Sigma^\top\Sigma\big)^T V^\top w^{(0)} + \eta_l\sum_{t=0}^{T-1} V\big(I - \eta_l\Sigma^\top\Sigma\big)^t\Sigma^\top U^\top b \quad (94)$$
$$= \Big(V_1\big(I - \eta_l\Sigma_1^2\big)^T V_1^\top + V_2V_2^\top\Big)w^{(0)} + \eta_l V_1\bigg(\sum_{t=0}^{T-1}\big(I - \eta_l\Sigma_1^2\big)^t\bigg)\Sigma_1 U^\top b. \quad (95)$$

In the limit $T\to\infty$ and with $\eta_l \le 1/\lambda_{\max}$, we have

$$\lim_{T\to\infty}\big(I - \eta_l\Sigma_1^2\big)^T = 0 \quad \text{and} \quad \lim_{T\to\infty}\sum_{t=0}^{T-1}\big(I - \eta_l\Sigma_1^2\big)^t = \frac{1}{\eta_l}\Sigma_1^{-2}. \quad (96)$$

Therefore,

$$\lim_{T\to\infty} w^{(T)} = V_2V_2^\top w^{(0)} + V_1\Sigma_1^{-1}U^\top b \quad (97)$$
$$= P_{S^*}(w^{(0)}) \quad (98)$$
$$= P_{S^*}(w). \quad (99)$$

C.4.1 IMPROVING THE LOWER BOUND IN (8) IN THE CASE OF EXACT PROJECTIONS

Let $S_i^*$ be convex and let $w^* \in S_i^*$ for all $i \in [M]$. We assume that $w_i^{(t,\tau)} = P_{S_i^*}(w^{(t)})$ for all $i \in [M]$, i.e., the local models are an exact projection of $w^{(t)}$ onto their respective solution sets. From (8) we have

$$(\eta_g^{(t)})_{\mathrm{opt}} = \frac{\big\langle w^{(t)} - w^*, \bar\Delta^{(t)}\big\rangle}{\|\bar\Delta^{(t)}\|^2} = \frac{\sum_{i=1}^{M}\big\langle w^{(t)} - w^*, \Delta_i^{(t)}\big\rangle}{M\|\bar\Delta^{(t)}\|^2}. \quad (100)$$

We can lower bound $\big\langle w^{(t)} - w^*, \Delta_i^{(t)}\big\rangle$ as follows:

$$\big\langle w^{(t)} - w^*, \Delta_i^{(t)}\big\rangle = \big\langle w^{(t)} - w_i^{(t,\tau)} + w_i^{(t,\tau)} - w^*, w^{(t)} - w_i^{(t,\tau)}\big\rangle \quad (101)$$
$$= \|w^{(t)} - w_i^{(t,\tau)}\|^2 + \big\langle w_i^{(t,\tau)} - w^*, w^{(t)} - w_i^{(t,\tau)}\big\rangle \quad (102)$$
$$\ge \|w^{(t)} - w_i^{(t,\tau)}\|^2 \quad (103)$$
$$= \|\Delta_i^{(t)}\|^2, \quad (104)$$

where (103) uses the fact that $\big\langle w_i^{(t,\tau)} - w^*, w^{(t)} - w_i^{(t,\tau)}\big\rangle \ge 0$, following the properties of projection (Boyd & Dattorro, 2003). Thus we have

$$(\eta_g^{(t)})_{\mathrm{opt}} \ge \frac{\sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2}{M\|\bar\Delta^{(t)}\|^2}. \quad (106)$$

Note the improvement by a factor of 2 in this lower bound compared to (8).

D ADDITIONAL EXPERIMENTS AND SETUP DETAILS

Our code is available at the following link: https://github.com/Divyansh03/FedExP.

D.1 IMPACT OF AVERAGING ITERATES FOR NEURAL NETWORKS

As discussed in Section 5, we find that setting the final FedExP model to the average of the last two iterates also improves performance when training neural networks in practical FL scenarios. To demonstrate this, we consider an experiment on the CIFAR-10 dataset with 10 clients, where the data at each client is distributed using a Dirichlet distribution with $\alpha = 0.3$. We set the number of local steps to $\tau = 20$ and train a CNN model having the same architecture as outlined in McMahan et al. (2017) with full client participation. Figure 6 shows the training accuracy of the last iterate and of the average of the last two iterates for FedAvg and FedExP. We see that the last iterate of FedExP has an oscillating behavior that can hide improvements in training accuracy. On the other hand, the average of the last two iterates of FedExP produces a more stable training curve and shows a considerable improvement in the final accuracy. Note, however, that this improvement only shows for FedExP; averaging iterates does not make a significant difference for FedAvg.

Figure 6: Benefit of averaging the last two iterates for FedExP in training a CNN model on CIFAR-10 (left: training accuracy over 100 training rounds for the last iterate and the average of the last two iterates of FedAvg and FedExP; right: server step size $\eta_g^{(t)}$). Note that averaging does not make a significant difference for FedAvg.

D.2 DATASET DETAILS

Here we provide more details about the datasets used in Section 6.

Synthetic Linear Regression. In this case, we assume that the local objective of each client is given by $F_i(w) = \|A_iw - b_i\|^2$, where $A_i \in \mathbb{R}^{30\times1000}$, $b_i \in \mathbb{R}^{30}$ and $w \in \mathbb{R}^{1000}$. We set the number of clients to $M = 20$. Note that since $d \ge \sum_{i=1}^{M} n_i$, this is an overparameterized convex problem. To generate $A_i$ and $b_i$, we follow a similar process as Li et al. (2020). We have $(A_i)_{j:} \sim \mathcal{N}(m_i, I_d)$ and $(b_i)_j = w_i^\top (A_i)_{j:}$, where $m_i \sim \mathcal{N}(u_i, 1)$, $w_i \sim \mathcal{N}(y_i, 1)$, $u_i \sim \mathcal{N}(0, 0.1)$ and $y_i \sim \mathcal{N}(0, 0.1)$.
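For concreteness, the following NumPy sketch shows one way to instantiate this generation process (our own reading of the description above; we assume $\mathcal{N}(0, 0.1)$ denotes variance 0.1 and that the scalar means $m_i$ and $y_i$ are broadcast across coordinates, and the released code may differ in such details).

```python
import numpy as np

def make_synthetic_clients(M=20, n=30, d=1000, seed=0):
    """Generate M overparameterized linear-regression clients
    (a sketch of the generation process described above)."""
    rng = np.random.default_rng(seed)
    clients = []
    for _ in range(M):
        u = rng.normal(0.0, np.sqrt(0.1))    # u_i ~ N(0, 0.1)
        y = rng.normal(0.0, np.sqrt(0.1))    # y_i ~ N(0, 0.1)
        m = rng.normal(u, 1.0)               # m_i ~ N(u_i, 1)
        w_i = rng.normal(y, 1.0, size=d)     # client model w_i ~ N(y_i, 1)
        A = rng.normal(m, 1.0, size=(n, d))  # rows (A_i)_{j:} ~ N(m_i, I_d)
        b = A @ w_i                          # (b_i)_j = w_i^T (A_i)_{j:}
        clients.append((A, b))
    return clients
```

Since $d = 1000$ exceeds the total number of data points $Mn = 600$, a common minimizer of all local objectives exists, matching the overparameterized regime assumed in the analysis.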
EMNIST. EMNIST is an image classification task consisting of handwritten characters associated with 62 labels. The federated EMNIST dataset available at Caldas et al. (2019) is naturally partitioned into 3400 clients based on the identities of the character authors. The numbers of training and test samples are 671,585 and 77,483 respectively.

CIFAR-10/100. CIFAR-10 is a natural image dataset consisting of 60,000 32x32 images divided into 10 classes. CIFAR-100 uses a finer labeling of the CIFAR images to divide them into 100 classes, making it a harder dataset for image classification. In both cases, the numbers of training and test examples are 50,000 and 10,000 respectively. To simulate a federated setting, we artificially partition the training data into 100 clients following the procedure outlined in Hsu et al. (2019).

CINIC-10. CINIC-10 is a natural image dataset that can be used as a direct replacement for CIFAR-10 in machine learning tasks. It is intended to be a harder dataset than CIFAR-10 while being easier than CIFAR-100. The numbers of training and test examples are both 90,000. We partition the training data into 200 clients in this case, following a similar procedure as for CIFAR.

D.3 HYPERPARAMETER DETAILS

For our baselines, we find the best-performing $\eta_g$ and $\eta_l$ by grid-search tuning. For FedExP, we search over $\epsilon$ and $\eta_l$. This is done by running the algorithms for 50 rounds and finding the parameters that achieve the highest training accuracy averaged over the last 10 rounds. We provide details of the grid used for each experiment below.

Grid for Synthetic. For FedAvg and SCAFFOLD, the grid for $\eta_g$ is $\{10^0, 10^{0.5}, 10^1, 10^{1.5}, 10^2\}$. For FedAdagrad, the grid for $\eta_g$ is $\{10^{-1}, 10^{-0.5}, 10^0, 10^{0.5}, 10^1\}$. For FedExP, we keep $\epsilon = 0$ in this experiment, as (6) is satisfied in this case. The grid for $\eta_l$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^0\}$ for all algorithms.

Grid for Neural Network Experiments. For FedAvg and SCAFFOLD, the grid for $\eta_g$ is $\{10^{-1}, 10^{-0.5}, 10^0, 10^{0.5}, 10^1\}$. For FedAdagrad, the grid for $\eta_g$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^0\}$. For FedExP, the grid for $\epsilon$ is $\{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$. The grid for $\eta_l$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^0\}$ for all algorithms. We use lower values of $\eta_g$ in the grid for FedAdagrad based on observations from Reddi et al. (2021), which show that FedAdagrad performs better with smaller values of the server step size. We provide the best-performing hyperparameters below.

Table 3: Base-10 logarithm of the best combination of $\epsilon$ and $\eta_l$ for FedExP, and of $\eta_g$ and $\eta_l$ for the baselines. For the synthetic dataset we keep $\epsilon = 0$ for FedExP.

            FedExP        FedAvg        SCAFFOLD      FedAdagrad
Dataset     ϵ      ηl     ηg     ηl     ηg     ηl     ηg     ηl
Synthetic   *      -1     1      -1     1      -1     1      -1
EMNIST      -1     -0.5   0      -0.5   0      -0.5   -0.5   -0.5
CIFAR-10    -3     -2     0      -2     0      -2     -1     -2
CIFAR-100   -3     -2     0      -2     0      -2     -1     -2
CINIC-10    -3     -2     0      -2     0      -2     -1     -2

Other hyperparameters are kept the same for all algorithms. In particular, we apply a weight decay of 0.0001 for all algorithms and decay $\eta_l$ by a factor of 0.998 in every round. We also use gradient clipping to improve the stability of the algorithms, as done in previous works (Acar et al., 2021). In all experiments, we fix the number of participating clients to 20, the minibatch size to 50 (for the synthetic dataset this reduces to full-batch gradient descent), and the number of local updates $\tau$ to 20.
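The tuning protocol above is a plain grid search; a minimal sketch is given below. The `run_fl` helper, which would train a given algorithm for 50 rounds and return per-round training accuracies, is a hypothetical placeholder for the actual training loop.

```python
import itertools
import numpy as np

ETA_L_GRID = [10.0 ** e for e in (-2, -1.5, -1, -0.5, 0)]
EPS_GRID   = [10.0 ** e for e in (-3, -2.5, -2, -1.5, -1)]

def tune_fedexp(run_fl, rounds=50, window=10):
    """Pick (eps, eta_l) maximizing training accuracy averaged
    over the last `window` of `rounds` rounds (a sketch)."""
    best, best_score = None, -np.inf
    for eps, eta_l in itertools.product(EPS_GRID, ETA_L_GRID):
        accs = run_fl(algorithm="fedexp", eps=eps, eta_l=eta_l, rounds=rounds)
        score = float(np.mean(accs[-window:]))
        if score > best_score:
            best, best_score = (eps, eta_l), score
    return best
```

The same loop applies to the baselines, with `eps` replaced by the server step size $\eta_g$ drawn from the corresponding grid.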
D.4 SENSITIVITY OF FEDEXP TO ϵ

To evaluate the sensitivity of FedExP to $\epsilon$, we compute the training accuracy of FedExP after 500 rounds for varying $\epsilon$ on different tasks. For each task, we fix $\eta_l$ to the value used in our experiments in Section 6 and only vary $\epsilon$. The results are summarized below.

Table 4: Training accuracy (%) obtained by FedExP with different choices of $\epsilon$ after 500 rounds of training on various tasks. The value of $\eta_l$ is fixed for each task ($10^{-0.5}$ for EMNIST and $10^{-2}$ for the others). Results are averaged over the last 10 rounds.

Dataset     ϵ=10^-3   ϵ=10^-2.5   ϵ=10^-2   ϵ=10^-1.5   ϵ=10^-1
EMNIST      85.40     86.26       85.73     85.49       84.90
CIFAR-10    84.79     77.82       77.63     77.66       77.64
CIFAR-100   59.01     44.76       44.21     44.37       44.40
CINIC-10    66.31     60.93       61.05     60.47       60.96

We see that the sensitivity to $\epsilon$ is similar to that of the adaptivity parameter $\tau$ added to the denominator of FedAdam and FedAdagrad (Reddi et al., 2021) to prevent the step size from blowing up. Keeping $\epsilon$ too large reduces the adaptivity of the method and makes its behavior similar to FedAvg. At the same time, keeping $\epsilon$ too small may not always be beneficial either, as seen in the case of EMNIST. In practice, we find that a grid search for $\epsilon$ in the range $\{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ usually suffices to yield a good value of $\epsilon$. A general rule of thumb is to start with $\epsilon = 10^{-3}$ and increase $\epsilon$ until the performance drops.

D.5 ADDITIONAL RESULTS

In this section, we provide additional results obtained from our experiments.

Synthetic Linear Regression. Note that for the synthetic linear regression experiments there is no test data. Also note that there is no randomness in this experiment, since clients compute full-batch gradients with full participation. We provide the plot of $\eta_g^{(t)}$ for FedExP in Figure 7. We see that FedExP takes much larger steps in some (but not all) rounds compared to the constant optimum step size taken by our baselines, leading to a large speedup. Recall that we also let $\epsilon = 0$ in this experiment (since it aligns with our theory), which also explains the larger values of $\eta_g^{(t)}$ taken by FedExP in this case.

Figure 7: Global learning rates $\eta_g^{(t)}$ for synthetic data with linear regression (FedExP vs. FedAvg, SCAFFOLD and FedAdagrad over 200 training rounds). Results are from a single instance of the experiment.

EMNIST. For EMNIST, we observe that SCAFFOLD gives slightly better training loss than FedExP towards the end of training. As described in Section 6, extrapolation can be combined with the variance reduction in SCAFFOLD (the resulting algorithm is referred to as SCAFFOLD-ExP) to further improve performance. This gives the best result in this case, as shown in Figure 8.

Figure 8: Additional results for the EMNIST dataset (training loss, training accuracy and server step size $\eta_g^{(t)}$ over 500 rounds). Mean and standard deviation from experiments with 20 different random seeds. The shaded areas show the standard deviation.

CIFAR-10, CIFAR-100 and CINIC-10. From Figure 3 and Figures 9-11, we see that FedExP comprehensively outperforms the baselines in these cases, achieving almost 10-20% higher accuracy than the closest baseline by the end of training. The margin of improvement is greatest for CIFAR-100, which can be considered the toughest dataset in our experiments. This points to the practical utility of FedExP even in challenging FL scenarios.

Figure 9: Additional results for the CIFAR-10 dataset (training loss, training accuracy and server step size $\eta_g^{(t)}$ over 500 rounds). Mean and standard deviation from experiments with 5 different random seeds. The shaded areas show the standard deviation.
Figure 10: Additional results for the CIFAR-100 dataset (training loss, training accuracy and server step size $\eta_g^{(t)}$ over 500 rounds). Mean and standard deviation from experiments with 5 different random seeds. The shaded areas show the standard deviation.

Figure 11: Additional results for the CINIC-10 dataset (training loss, training accuracy and server step size $\eta_g^{(t)}$ over 500 rounds). Mean and standard deviation from experiments with 5 different random seeds. The shaded areas show the standard deviation.

Long-Term Behavior of Algorithms and Comparison with FedProx. To evaluate the long-term behavior of the different algorithms, we ran the experiments for 2000 rounds. Here, we also consider an additional algorithm, FedProx, for comparison. For a fair comparison, we tuned the $\mu$ parameter of FedProx for each dataset by doing a grid search over the range $\{10^{-3}, 10^{-2}, 10^{-1}, 1\}$, as done in the original FedProx paper (Li et al., 2020). The results for EMNIST, CIFAR-10, CIFAR-100 and CINIC-10 in Figures 12-14 and Table 5 are from experiments with 3 different random seeds. Except for the synthetic dataset, the plots show mean and standard deviation values across all the random seeds, and also over a moving average window of size 20.

Figure 12: Training loss results of FedExP, FedAvg, SCAFFOLD, FedAdagrad and FedProx on the Synthetic, EMNIST, CIFAR-10, CIFAR-100 and CINIC-10 datasets for 2000 rounds (mean squared error for the synthetic dataset, training loss for the rest).

Figure 13: Training accuracy results of FedExP, FedAvg, SCAFFOLD, FedAdagrad and FedProx on the EMNIST, CIFAR-10, CIFAR-100 and CINIC-10 datasets for 2000 rounds.

Figure 14: Test accuracy results of FedExP, FedAvg, SCAFFOLD, FedAdagrad and FedProx on the EMNIST, CIFAR-10, CIFAR-100 and CINIC-10 datasets for 2000 rounds.

Table 5: Test accuracy (%) obtained by FedExP and baselines after 2000 rounds of training on various tasks. Results are averaged across 3 random seeds and the last 20 rounds.

Dataset     FedExP        FedAvg        SCAFFOLD      FedAdagrad    FedProx
EMNIST      86.96 ± 0.58  85.78 ± 0.35  86.22 ± 0.35  85.53 ± 1.04  85.77 ± 0.39
CIFAR-10    82.94 ± 0.42  80.10 ± 0.56  82.02 ± 0.30  80.21 ± 0.60  80.16 ± 0.59
CIFAR-100   54.65 ± 0.49  49.63 ± 0.37  49.40 ± 0.38  49.64 ± 0.39  49.47 ± 0.31
CINIC-10    66.45 ± 1.28  64.87 ± 0.44  64.61 ± 0.49  64.87 ± 0.44  64.52 ± 0.45

We see that FedExP continues to outperform the baselines, including FedProx, in the long-term behavior as well.
E COMBINING EXTRAPOLATION WITH SCAFFOLD

As described in Section 6, the extrapolation step can be added to the SCAFFOLD algorithm in a similar way as in FedExP. The detailed steps of the resulting SCAFFOLD-ExP algorithm are shown in Algorithm 2.

Algorithm 2 SCAFFOLD-ExP
1: Input: $w^{(0)}$, control variates $c^{(0)}$, $c_i^{(0)}$, $\forall i \in [M]$, number of rounds $T$, local iteration steps $\tau$, parameters $\eta_l$, $\epsilon$
2: For $t = 0, \dots, T-1$ communication rounds do:
3:   Global server does:
4:     Send $w^{(t)}, c^{(t)}$ to all clients
5:   Clients $i \in [M]$ in parallel do:
6:     Set $w_i^{(t,0)} \leftarrow w^{(t)}$
7:     For $k = 0, \dots, \tau-1$ local iterations do:
8:       Update $w_i^{(t,k+1)} \leftarrow w_i^{(t,k)} - \eta_l\big(\nabla F_i(w_i^{(t,k)}, \xi_i^{(t,k)}) - c_i^{(t)} + c^{(t)}\big)$
9:     Compute $\Delta_i^{(t)} \leftarrow w^{(t)} - w_i^{(t,\tau)}$ and $\Psi_i^{(t)} \leftarrow c^{(t)} - \frac{1}{\tau\eta_l}\Delta_i^{(t)}$
10:    Send $\Delta_i^{(t)}$ and $\Psi_i^{(t)}$ to the server
11:    Update local control variate $c_i^{(t+1)} \leftarrow c_i^{(t)} - \Psi_i^{(t)}$
12:  Global server does:
13:    Compute $\bar\Delta^{(t)} \leftarrow \frac{1}{M}\sum_{i=1}^{M}\Delta_i^{(t)}$ and $\eta_g^{(t)} \leftarrow \max\Big\{1, \frac{\sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2}{2M(\|\bar\Delta^{(t)}\|^2 + \epsilon)}\Big\}$
14:    Update global model with $w^{(t+1)} \leftarrow w^{(t)} - \eta_g^{(t)}\bar\Delta^{(t)}$
15:    Compute $\bar\Psi^{(t)} \leftarrow \frac{1}{M}\sum_{i=1}^{M}\Psi_i^{(t)}$
16:    Update global control variate with $c^{(t+1)} \leftarrow c^{(t)} - \bar\Psi^{(t)}$

F COMBINING EXTRAPOLATION WITH SERVER MOMENTUM

We begin by recalling some notation from our work. The vector $w^{(t)}$ is the global model at round $t$ and $\bar\Delta^{(t)}$ is the average of the client updates at round $t$. The server momentum update at round $t$ can be written as $v^{(t)} = \bar\Delta^{(t)} + \beta v^{(t-1)}$ (with $v^{(-1)} = 0$), and the global model update can be written as $w^{(t+1)} = w^{(t)} - \eta_g^{(t)}v^{(t)}$. Our goal is now to find the $\eta_g^{(t)}$ that minimizes $\|w^{(t+1)} - w^*\|^2$. We have

$$\|w^{(t+1)} - w^*\|^2 = \|w^{(t)} - w^*\|^2 + (\eta_g^{(t)})^2\|v^{(t)}\|^2 - 2\eta_g^{(t)}\big\langle w^{(t)} - w^*, v^{(t)}\big\rangle. \quad (107)$$

Setting the derivative of the RHS of (107) with respect to $\eta_g^{(t)}$ to zero, we have

$$(\eta_g^{(t)})_{\mathrm{opt}} = \frac{\big\langle w^{(t)} - w^*, v^{(t)}\big\rangle}{\|v^{(t)}\|^2}. \quad (108)$$

Our goal now is to find a lower bound on $\big\langle w^{(t)} - w^*, v^{(t)}\big\rangle$. We have the following lemma.

Lemma 11. Assume that $\big\langle w^{(t)} - w^*, \bar\Delta^{(t)}\big\rangle \ge m^{(t)} = \sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2/M$ (see Appendix C.4.1) for all $t \ge 0$, and $\eta_g^{(r)} \le \big(m^{(r)} + \sum_{k=0}^{r-1}(\beta/2)^{r-k}m^{(k)}\big)\big/\big(2\|v^{(r)}\|^2\big)$ for all $r \le t - 1$. Then

$$\big\langle w^{(t)} - w^*, v^{(t)}\big\rangle \ge m^{(t)} + \sum_{k=0}^{t-1}(\beta/2)^{t-k}m^{(k)}, \quad (109)$$

which implies

$$(\eta_g^{(t)})_{\mathrm{opt}} \ge \frac{m^{(t)} + \sum_{k=0}^{t-1}(\beta/2)^{t-k}m^{(k)}}{2\|v^{(t)}\|^2}. \quad (110)$$

Proof. We proceed via a proof by induction. The statement clearly holds at $t = 0$, since $\langle w^{(0)} - w^*, v^{(0)}\rangle = \langle w^{(0)} - w^*, \bar\Delta^{(0)}\rangle \ge m^{(0)}$. Now, assuming the lemma holds at $t - 1$, we have

$$\big\langle w^{(t)} - w^*, v^{(t)}\big\rangle = \big\langle w^{(t)} - w^*, \bar\Delta^{(t)}\big\rangle + \beta\big\langle w^{(t)} - w^*, v^{(t-1)}\big\rangle \quad (111)$$
$$= \big\langle w^{(t)} - w^*, \bar\Delta^{(t)}\big\rangle + \beta\big\langle w^{(t-1)} - \eta_g^{(t-1)}v^{(t-1)} - w^*, v^{(t-1)}\big\rangle \quad (112)$$
$$\ge m^{(t)} + \beta\Big(\big\langle w^{(t-1)} - w^*, v^{(t-1)}\big\rangle - \eta_g^{(t-1)}\|v^{(t-1)}\|^2\Big) \quad (113)$$
$$\ge m^{(t)} + \sum_{k=0}^{t-1}(\beta/2)^{t-k}m^{(k)}, \quad (114)$$

where the last line follows from $\big\langle w^{(t-1)} - w^*, v^{(t-1)}\big\rangle \ge m^{(t-1)} + \sum_{k=0}^{t-2}(\beta/2)^{t-1-k}m^{(k)}$ and $\eta_g^{(t-1)} \le \big(m^{(t-1)} + \sum_{k=0}^{t-2}(\beta/2)^{t-1-k}m^{(k)}\big)\big/\big(2\|v^{(t-1)}\|^2\big)$.

Thus we propose to use the following server step size when using server momentum:

$$\eta_g^{(t)} = \frac{m^{(t)} + \sum_{k=0}^{t-1}(\beta/2)^{t-k}m^{(k)}}{2\big(\|v^{(t)}\|^2 + \epsilon\big)}, \quad (115)$$

where $m^{(t)} = \sum_{i=1}^{M}\|\Delta_i^{(t)}\|^2/M$. Note that we also add a small constant $\epsilon$ to the denominator to prevent the step size from blowing up, as done for FedExP. We call server momentum with this step size FedExP-M.
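A minimal server-side sketch of FedExP-M is given below (our own illustration, not the paper's released implementation). The numerator of (115) is maintained incrementally via the recursion $s_t = m_t + (\beta/2)s_{t-1}$, which unrolls exactly to $m^{(t)} + \sum_{k=0}^{t-1}(\beta/2)^{t-k}m^{(k)}$.

```python
import numpy as np

class FedExpMServer:
    """Server state for FedExP-M: momentum buffer v^{(t)} and the
    running numerator s_t of the step size in (115) (a sketch)."""

    def __init__(self, w0, beta=0.9, eps=1e-3):
        self.w = np.asarray(w0, dtype=float)
        self.v = np.zeros_like(self.w)        # v^{(-1)} = 0
        self.s = 0.0                          # s_{-1} = 0
        self.beta, self.eps = beta, eps

    def step(self, client_deltas):
        deltas = np.stack(client_deltas)               # shape (M, d)
        delta_bar = deltas.mean(axis=0)
        m_t = float((deltas ** 2).sum(axis=1).mean())  # m_t = sum_i ||Delta_i||^2 / M
        self.v = delta_bar + self.beta * self.v        # momentum update
        self.s = m_t + 0.5 * self.beta * self.s        # s_t = m_t + (beta/2) s_{t-1}
        eta_g = self.s / (2.0 * (float(self.v @ self.v) + self.eps))  # eq. (115)
        self.w = self.w - eta_g * self.v
        return eta_g
```

With $\beta = 0$, the momentum buffer reduces to $\bar\Delta^{(t)}$ and the step size reduces to the FedExP rule without the clipping at 1.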
We compare the performance of FedExP-M with FedAdam and FedAvg-M (FedAvg with server momentum) on the CIFAR-10 and CIFAR-100 datasets, as shown in Figures 15-17, where the mean and standard deviation values are computed over 3 random seeds and a moving average window of size 20. The experimental setup is the same as described in Section 6. The hyperparameters $\eta_l, \epsilon$ for FedExP-M and $\eta_l, \eta_g$ for FedAdam and FedAvg-M were tuned following a similar process as described in Appendix D.3, and the resulting values are in Table 6.

Table 6: Base-10 logarithm of the best combination of $\epsilon$ and $\eta_l$ for FedExP-M, and of $\eta_g$ and $\eta_l$ for FedAdam and FedAvg-M.

            FedExP-M      FedAdam       FedAvg-M
Dataset     ϵ      ηl     ηg     ηl     ηg     ηl
CIFAR-10    -3     -2     -2     -2     0      -2
CIFAR-100   -3     -2     -2     -2     0      -2

Figure 15: Training loss results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Figure 16: Training accuracy results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Figure 17: Test accuracy results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Our results show that server momentum can be successfully combined with extrapolation to achieve the best speedup among all baselines. The behavior of FedAdam and FedAvg-M is quite similar in these experiments, which can be attributed to the dense nature of the gradients in image classification, as discussed in Section 6. We note that this is only a preliminary result, and future work will study the effect of combining server momentum and extrapolation more rigorously.