# Locally Adaptive Federated Learning

Published in Transactions on Machine Learning Research (10/2024)

Sohom Mukherjee (sohom.mukherjee@uni-wuerzburg.de), Julius-Maximilians-Universität Würzburg; Nicolas Loizou (nloizou@jhu.edu), Johns Hopkins University; Sebastian U. Stich (stich@cispa.de), CISPA Helmholtz Center for Information Security

Reviewed on OpenReview: https://openreview.net/forum?id=auLdS1iuKW

Abstract. Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to respect the global geometry of the function, which can yield slow convergence. In this work, we propose locally adaptive federated learning algorithms that leverage the local geometric information of each client function. We show that such locally adaptive methods with uncoordinated stepsizes across all clients can be particularly efficient in interpolated (overparameterized) settings, and analyze their convergence in the presence of heterogeneous data for convex and strongly convex settings. We validate our theoretical claims by performing illustrative experiments for both i.i.d. and non-i.i.d. cases. Our proposed algorithms match the optimization performance of tuned FedAvg in the convex setting, outperform FedAvg as well as state-of-the-art adaptive federated algorithms like FedAMS in non-convex experiments, and come with superior generalization performance.

1 Introduction

Federated Learning (FL) (Kairouz et al., 2021) has become popular as a collaborative learning paradigm where multiple clients jointly train a machine learning model without sharing their local data.
Despite the recent success of FL, state-of-the-art federated optimization methods like FedAvg (McMahan et al., 2017) still face various challenges in practical scenarios, such as not being able to adapt according to the training dynamics. FedAvg, using vanilla SGD updates with constant stepsizes, may be unsuitable for heavy-tailed stochastic gradient noise distributions, which arise frequently when training large-scale models such as ViT (Dosovitskiy et al., 2021). Such settings benefit from adaptive stepsizes, which use some optimization statistics (e.g., loss history, gradient norm). In the centralized setting, adaptive methods such as Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011) have succeeded in obtaining superior empirical performance over SGD for various machine learning tasks. However, extending adaptive methods to the federated setting remains a challenging task, and the majority of the recently proposed adaptive federated methods, such as FedAdam (Reddi et al., 2021) and FedAMS (Wang et al., 2022a), consider only server-side adaptivity, i.e., essentially adaptivity only in the aggregation step. Some methods like Local-AMSGrad (Chen et al., 2020) and Local-AdaAlter (Xie et al., 2019) do consider local (client-side) adaptivity, but they perform some form of stepsize aggregation in the communication round, thereby using the same stepsize on all clients. (Work done at the CISPA Helmholtz Center for Information Security.) Using the same stepsize for all clients, which needs to respect the geometry of the global function, can yield sub-optimal convergence. To harness the full power of adaptivity for federated optimization, we argue that it makes sense to use fully locally adaptive stepsizes on each client to capture the local geometric information of each objective function (Wang et al., 2021), thereby leading to faster convergence.
However, such a change is non-trivial, as federated optimization with uncoordinated stepsizes on different clients might not converge. The analysis of locally adaptive methods necessitates developing new proof techniques, extending the existing error-feedback framework (Stich, 2018) for federated optimization algorithms (which originally works for equal stepsizes) to fully uncoordinated local (client) stepsizes. In this work, we provide affirmative answers to the following open questions: (a) Can local adaptivity be useful for federated optimization (i.e., yield faster convergence)? (b) Can we design such a locally adaptive federated optimization algorithm that provably converges? To answer (a), we present a concrete case in Example 1, along with an illustration in Figure 1, showing that locally adaptive stepsizes can substantially speed up convergence. For designing a fully locally adaptive method for federated optimization, we need an adaptive stepsize that would be optimal for each client function. Inspired by the Polyak stepsize (Polyak, 1987), which is designed for gradient descent on convex functions, Loizou et al. (2021) recently proposed the stochastic Polyak stepsize (SPS) for SGD. SPS comes with strong convergence guarantees, and needs less tuning compared to other adaptive methods like Adam and AdaGrad. We propose the FedSPS algorithm by incorporating the SPS stepsize in the local client updates. We obtain exact convergence of our locally adaptive FedSPS when the interpolation condition is satisfied (the overparameterized case common in deep learning problems), and convergence to a neighbourhood in the general case. Recalling that Li et al.
(2020b) showed FedAvg needs decaying stepsizes to converge under heterogeneity, we extend our method to a decreasing stepsize version, FedDecSPS (following ideas from DecSPS (Orvieto et al., 2022)), which provides exact convergence in practice for the general non-interpolating setting, without the aforementioned small stepsize assumption. Finally, we experimentally observe that the optimization performance of FedSPS is always on par with or better than that of tuned FedAvg and FedAMS, and FedDecSPS is particularly efficient in non-interpolating settings.

Contributions. We summarize our contributions as follows:

- We show that local adaptivity can lead to substantially faster convergence. We design the first fully locally adaptive method for federated learning, called FedSPS, and prove sublinear and linear convergence to the optimum for the convex and strongly convex cases, respectively, under interpolation (Theorem 3). This is in contrast to existing adaptive federated methods such as FedAdam and FedAMS, both of which employ adaptivity only for server aggregation.
- For real-world FL scenarios (such as when the interpolation condition is not satisfied due to client heterogeneity), we propose a practically motivated algorithm, FedDecSPS, that enjoys local adaptivity and exact convergence also in the non-interpolating regime, due to decreasing stepsizes.
- We empirically verify our theoretical claims by performing relevant illustrative experiments, showing that our method requires less tuning compared to state-of-the-art algorithms which need extensive grid search. We also obtain competitive performance (both optimization and generalization) of the proposed FedSPS and FedDecSPS compared to tuned FedAvg and FedAMS for the convex and non-convex cases, in i.i.d. as well as non-i.i.d. settings.

1.1 Additional Related Work

Adaptive gradient methods and SPS.
Recently, adaptive stepsize methods that use some optimization statistics have become popular for deep learning applications. Such methods, including Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011), work well in practice and are also theoretically convergent under different levels of smoothness (Rodomanov et al., 2024). A further adaptive method with sound theoretical guarantees is the Polyak stepsize (Polyak, 1987), which has recently been extended to the stochastic setting by Loizou et al. (2021) and termed the stochastic Polyak stepsize (SPS). Extensions of SPS have been proposed for solving structured non-convex problems (Gower et al., 2021a) and in the update rule of stochastic mirror descent (D'Orazio et al., 2021). Further follow-up works have come up with various ways to overcome the limitations of vanilla SPS, such as when optimal stochastic loss values are not known (Orvieto et al., 2022; Gower et al., 2022), or when the interpolation condition does not hold (Orvieto et al., 2022; Gower et al., 2021b; Jiang & Stich, 2024), as well as a proximal variant for tackling regularization terms (Schaipp et al., 2023).

Adaptive federated optimization. Reddi et al. (2021) provide a general framework for adaptive federated optimization (FedOpt), including particular instances such as FedAdam and FedYogi, obtained by using the corresponding centralized adaptive methods as the server optimizer. Several works followed up on the idea of server-side adaptivity, some recent ones being CD-Adam (Wang et al., 2022b) and FedAMS (Wang et al., 2022a). Fully locally adaptive stepsizes on the client side have not been explored before, except in one concurrent work, Kim et al. (2023). Their proposed method is based on an estimator for the inverse local Lipschitz constant from (Malitsky & Mishchenko, 2019), and analyses only the non-convex setting under a strong bounded-gradient assumption.
Our work is also related to tackling the heterogeneity issue in federated learning, wherein clients differ from each other with respect to local datasets, or compute capabilities, or both. FedNova (Wang et al., 2020) proposes a solution to this problem by using locally adjusted stepsizes, but suffers from a speed-accuracy trade-off. FedLin (Mitra et al., 2021) and CLIMB (Shen et al., 2022) further improve this line of work. However, none of these existing methods leverage the geometry of the local loss landscape to tackle client heterogeneity. Another line of work studies federated optimization with local solvers (Li et al., 2020a; Jiang et al., 2024a;b).

2 Problem setup

In this work, we consider the following sum-structured (cross-silo) federated optimization problem

$$f^\star := \min_{x \in \mathbb{R}^d} \Big[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big], \qquad (1)$$

where the components $f_i \colon \mathbb{R}^d \to \mathbb{R}$ are distributed among $n$ local clients and are given in stochastic form $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i}[F_i(x, \xi)]$, where $\mathcal{D}_i$ denotes the distribution of $\xi$ over parameter space $\Omega_i$ on client $i \in [n]$. Standard empirical risk minimization is an important special case of this problem, when each $\mathcal{D}_i$ is supported on a finite number $m_i$ of elements $\{\xi_1^i, \ldots, \xi_{m_i}^i\}$. Then $f_i$ can be rewritten as $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} F_i(x, \xi_j^i)$. We do not make any restrictive assumptions on the data distributions $\mathcal{D}_i$, so our analysis covers the case of heterogeneous (non-i.i.d.) data where $\mathcal{D}_i \neq \mathcal{D}_j$ for $i \neq j$, and the local minima $x_i^\star := \arg\min_{x \in \mathbb{R}^d} f_i(x)$ can be different from the global minimizer of (1).

3 Locally Adaptive Federated Optimization

In the following, we provide background on federated optimization and the (stochastic) Polyak stepsize. This is followed by an example to motivate how local adaptivity with (stochastic) Polyak stepsizes can help to improve convergence, especially in the interpolation regime. Finally, we outline our proposed method FedSPS to solve (1).

3.1 Background and Motivation

Federated averaging.
A common approach to solving (1) in the distributed setting is FedAvg (McMahan et al., 2017), also known as Local SGD (Stich, 2018). Here the clients perform local SGD steps in each iteration, and communicate with a central server every $\tau$ iterations: their iterates are averaged on the server and sent back to all clients. FedAvg corresponds to the special case of Algorithm 1 with constant stepsizes $\gamma_t^i \equiv \gamma_0$ (Line 4).

PS and SPS. Considering the centralized setting ($n = 1$) of finite-sum optimization on a single worker, $\min_{x \in \mathbb{R}^d} \big[ f_1(x) := \frac{1}{m} \sum_{j=1}^{m} F_1(x, \xi_j^1) \big]$, we introduce the PS as well as the SPS below.

Deterministic Polyak stepsize. The convergence analysis of Gradient Descent (GD) for a convex function $f_1(x)$ involves the inequality $\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 - 2\gamma_t \left( f_1(x_t) - f_1(x^\star) \right) + \gamma_t^2 \|\nabla f_1(x_t)\|^2$, the right-hand side of which is minimized by the PS $\gamma_t = \frac{f_1(x_t) - f_1^\star}{\|\nabla f_1(x_t)\|^2}$.

Stochastic Polyak stepsize. We use the notion of SPS_max from the original paper (Loizou et al., 2021). The SPS_max stepsize for SGD (with a single stochastic sample) is given by $\gamma_t = \min\Big\{ \frac{F_1(x_t, \xi_j^1) - F_1^\star}{c \, \|\nabla F_1(x_t, \xi_j^1)\|^2}, \gamma_b \Big\}$, where $F_1^\star := \inf_{\xi \sim \mathcal{D}_1, x \in \mathbb{R}^d} F_1(x, \xi)$, $\gamma_b > 0$ is an upper bound on the stepsize that controls the size of the neighbourhood ($\gamma_b$ trades off adaptivity for accuracy), and $c > 0$ is a constant scaling factor. Instead of using the optimal function values of each stochastic function as in the original SPS paper (Loizou et al., 2021), we use a lower bound on the function values, $\ell_1^\star \leq F_1^\star$, which is easier to obtain for many practical tasks, as shown in (Orvieto et al., 2022). It should also be noted that using the lower bound on function values leads to the same convergence result as the original SPS using optimal function values, as shown in Theorem 2 of Orvieto et al. (2022).
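To make the SPS_max rule concrete, here is a minimal numerical sketch of one step with this stepsize. The code is ours, not from the paper: the helper name `sps_max_stepsize` and the toy sampled loss $F(x, \xi) = \tfrac{1}{2}(x - \xi)^2$ (with lower bound $\ell^\star = 0$) are illustrative assumptions.

```python
import numpy as np

def sps_max_stepsize(loss, grad, lower_bound=0.0, c=0.5, gamma_b=10.0):
    """SPS_max: gamma = min((F(x, xi) - l*) / (c * ||grad F(x, xi)||^2), gamma_b)."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:
        return gamma_b  # stationary point of the sampled loss: fall back to the cap
    return min((loss - lower_bound) / (c * g2), gamma_b)

# one step on the sampled quadratic F(x, xi) = 0.5 * (x - xi)^2 with xi = 1
x, xi = 3.0, 1.0
loss = 0.5 * (x - xi) ** 2            # = 2.0
grad = np.array([x - xi])             # = [2.0]
gamma = sps_max_stepsize(loss, grad, lower_bound=0.0, c=0.5, gamma_b=10.0)
x = x - gamma * grad[0]               # with c = 0.5 this lands exactly on xi
```

On a one-dimensional quadratic the uncapped Polyak step with $c = 0.5$ jumps directly to the minimizer of the sampled function, which is why adaptivity removes the need to know the curvature in advance.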
Figure 1: Illustration for Example 1, showing that local adaptivity can improve convergence. We run SGD with constant, global SPS, and locally adaptive SPS stepsizes (with $c = 0.5$, $\gamma_b = 1.0$) for the functions $f_1(x) = x^2$, $f_2(x) = \frac{1}{2}x^2$, where stochastic noise was simulated by adding Gaussian noise with mean 0 and standard deviation 10 to the gradients.

Example 1 (Local adaptivity using Polyak stepsizes can improve convergence). For a parameter $a > 0$, consider the finite-sum optimization problem $\min_{x \in \mathbb{R}} \big[ f(x) := \frac{1}{2} \sum_{i=1}^{2} f_i(x) \big]$, with $f_1(x) = \frac{a}{2}x^2$, $f_2(x) = \frac{1}{2}x^2$, in the interpolation regime. If we solve this problem using mini-batch GD, $x_{t+1} = x_t - \frac{\gamma}{2}\left(\nabla f_1(x_t) + \nabla f_2(x_t)\right)$, we are required to choose a stepsize $\gamma \leq 2/L$, where $L = \frac{1+a}{2}$, to enable convergence, and therefore $\Omega(a \log \frac{1}{\epsilon})$ steps are needed. However, if we solve the same problem using locally adaptive distributed GD of the form $x_{t+1} = x_t - \frac{1}{2}\left(\gamma_1 \nabla f_1(x_t) + \gamma_2 \nabla f_2(x_t)\right)$, then the complexity can be near-constant. Concretely, for any stepsizes $\gamma_i \in [\frac{1}{2}\gamma_i^\star, \gamma_i^\star]$, with $\gamma_1^\star = \frac{1}{a}$, $\gamma_2^\star = 1$ (which also includes the Polyak stepsizes corresponding to the functions $f_1$ and $f_2$), the iteration complexity is $O(\log \frac{1}{\epsilon})$, which can be arbitrarily better than $\Omega(a \log \frac{1}{\epsilon})$ as $a \to \infty$. Note that the observation made in this example also extends to the stochastic case of SGD with SPS, as illustrated in Figure 1.

3.2 Proposed Method

Motivated by the previous example on the benefit of local adaptivity, we now design such a locally adaptive federated optimization algorithm with provable convergence guarantees. As stated before, we need an adaptive stepsize that is optimal for each client function, and we choose the SPS stepsize for this purpose. In the following, we describe our proposed method FedSPS.

FedSPS.
We propose a fully locally (i.e., client-side) adaptive federated optimization algorithm, FedSPS (Algorithm 1), with asynchronous stepsizes, i.e., the stepsizes differ across the clients, and also across the local steps of a particular client. The FedSPS stepsize for a client $i$ and local iteration $t$ is given by

$$\gamma_t^i = \min\left\{ \frac{F_i(x_t^i, \xi_t^i) - \ell_i^\star}{c \, \|\nabla F_i(x_t^i, \xi_t^i)\|^2}, \, \gamma_b \right\}, \qquad (2)$$

Algorithm 1 FedSPS: Federated averaging with fully locally adaptive stepsizes.

Input: $x_0^i = x_0$, for all $i \in [n]$
1: for $t = 0, 1, \ldots, T-1$ do
2:&nbsp;&nbsp; for each client $i = 1, \ldots, n$ in parallel do
3:&nbsp;&nbsp;&nbsp;&nbsp; sample $\xi_t^i$, compute $g_t^i := \nabla F_i(x_t^i, \xi_t^i)$
4:&nbsp;&nbsp;&nbsp;&nbsp; FedSPS: $\gamma_t^i = \min\big\{ \frac{F_i(x_t^i, \xi_t^i) - \ell_i^\star}{c \|g_t^i\|^2}, \gamma_b \big\}$ ▷ local stochastic Polyak stepsize
5:&nbsp;&nbsp;&nbsp;&nbsp; if $t + 1$ is a multiple of $\tau$ then
6:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t+1}^i = \frac{1}{n} \sum_{j=1}^{n} \big( x_t^j - \gamma_t^j g_t^j \big)$ ▷ communication round
7:&nbsp;&nbsp;&nbsp;&nbsp; else
8:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t+1}^i = x_t^i - \gamma_t^i g_t^i$ ▷ local step
9:&nbsp;&nbsp;&nbsp;&nbsp; end if
10:&nbsp;&nbsp; end for
11: end for

where $c, \gamma_b > 0$ are constants as explained before, $\xi_t^i$ is the sample at time $t$ on worker $i$, $F_i(x_t^i, \xi_t^i)$ is the stochastic loss, $g_t^i := \nabla F_i(x_t^i, \xi_t^i)$ is the stochastic gradient, and $\ell_i^\star \leq F_i^\star = \inf_{\xi^i \sim \mathcal{D}_i, x \in \mathbb{R}^d} F_i(x, \xi^i)$ is a lower bound on the minima of all functions on worker $i$. Since the loss functions are non-negative for most practical machine learning tasks, we can use $\ell_i^\star = 0$, as discussed before, for running our algorithms. We analyse FedSPS in the strongly convex and convex settings and prove convergence guarantees (Theorem 3). We would like to stress that $\gamma_b$ and $c$ are free hyperparameters (in the sense that they do not depend theoretically on any problem-dependent parameters), and we demonstrate that they require minimal tuning through a sensitivity analysis in Section 6. This is an advantage over the learning rate parameter of FedAvg and FedAMS, which theoretically depends on $L$ and requires extensive grid search in practice (Section F.4). The notation used throughout can be extended to the mini-batch setting, as described in Appendix C.1.
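The mechanics of Algorithm 1 can be sketched in a few lines. This is our own minimal, deterministic (full-batch, noise-free) toy simulation under the interpolation condition, not the paper's implementation: clients hold scalar quadratics $f_i(x) = \tfrac{1}{2} a_i (x - b)^2$ with a shared minimizer $b$ and $\ell_i^\star = 0$, and we use $c = 0.5$ for illustration (the theory asks for $c \geq 2\tau^2$, but the paper's experiments find small $c$ works well). The function name and parameters are our assumptions.

```python
import numpy as np

def fedsps(a, b, x0, tau=5, rounds=20, c=0.5, gamma_b=1.0):
    """Sketch of FedSPS: each client i holds f_i(x) = 0.5 * a[i] * (x - b)^2,
    runs tau local steps with its own SPS stepsize, then the server averages."""
    x = np.full(len(a), float(x0))
    for _ in range(rounds):
        for _ in range(tau):
            loss = 0.5 * a * (x - b) ** 2        # per-client loss, l*_i = 0
            grad = a * (x - b)
            g2 = grad ** 2
            # local stochastic Polyak stepsize, capped at gamma_b
            polyak = np.divide(loss, c * g2, out=np.full_like(x, np.inf), where=g2 > 0)
            x = x - np.minimum(polyak, gamma_b) * grad
        x[:] = x.mean()                          # communication round: average iterates
    return float(x[0])

# three clients with very different local curvatures a_i
x_final = fedsps(a=np.array([100.0, 1.0, 0.1]), b=2.0, x0=10.0)
```

Despite the curvatures spanning three orders of magnitude, each client's uncapped Polyak step equals $1/a_i$, so no shared learning rate has to be tuned: the iterates converge to the common minimizer $b = 2$. (For simplicity this sketch averages after the $\tau$-th local step rather than fusing it with the update as in Line 6.)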
Remark 2 (Alternative design choices). Note that there are various alternative design choices for incorporating SPS into FedAvg. We tried some variants, such as FedSPS-Normalized, using client- and server-side correction to account for the solution bias (Wang et al., 2021) due to asynchronous stepsizes (Appendix D.1). We also introduce FedSPS-Global in Appendix D.2, which uses aggregation of stepsizes in communication rounds, similar to (Chen et al., 2020; Xie et al., 2019). However, none of these other design choices provided any practical advantages over our proposed FedSPS.

4 Convergence analysis of FedSPS

4.1 Assumptions on the objective function and noise

Assumption 1 ($L$-smoothness). Each function $F_i(x, \xi) \colon \mathbb{R}^d \times \Omega_i \to \mathbb{R}$, $i \in [n]$, is differentiable for each $\xi \in \operatorname{supp}(\mathcal{D}_i)$, and there exists a constant $L \geq 0$ such that for each $x, y \in \mathbb{R}^d$, $\xi \in \operatorname{supp}(\mathcal{D}_i)$:

$$\|\nabla F_i(y, \xi) - \nabla F_i(x, \xi)\| \leq L \|x - y\|. \qquad (3)$$

Note that Assumption 1 implies $L$-smoothness of each $f_i(x)$ and of $f(x)$. The assumption of each $F_i(x, \xi)$ being smooth is often used in the federated and decentralized optimization literature (see, e.g., (Koloskova et al., 2020, Assumption 1a), or (Koloskova et al., 2022, Assumption 3)).

Assumption 2 ($\mu$-convexity). There exists a constant $\mu \geq 0$ such that for each $i \in [n]$, $\xi \in \operatorname{supp}(\mathcal{D}_i)$, and for all $x, y \in \mathbb{R}^d$:

$$F_i(y, \xi) \geq F_i(x, \xi) + \langle \nabla F_i(x, \xi), y - x \rangle + \frac{\mu}{2} \|y - x\|^2. \qquad (4)$$

For some of our results, we assume $\mu$-strong convexity for a parameter $\mu > 0$, or convexity (when $\mu = 0$). Furthermore, we assume (as mentioned in the introduction) access to stochastic functions $F_i(x, \xi)$ on each client $i$, with $\mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x, \xi) = f_i(x)$ and $\mathbb{E}_{\xi \sim \mathcal{D}_i} \nabla F_i(x, \xi) = \nabla f_i(x)$.

Finite optimal objective difference. For each $i \in [n]$ we denote $f_i^\star := \inf_{x \in \mathbb{R}^d} f_i(x)$. Recall that we defined $F_i^\star := \inf_{\xi \sim \mathcal{D}_i, x \in \mathbb{R}^d} F_i(x, \xi)$, and need knowledge of lower bounds $\ell_i^\star \leq F_i^\star$ for our algorithm.
We define the quantity

$$\sigma_f^2 := \frac{1}{n} \sum_{i=1}^{n} \left( f_i(x^\star) - \ell_i^\star \right) = f^\star - \frac{1}{n} \sum_{i=1}^{n} \ell_i^\star, \qquad (5)$$

which will appear in our complexity estimates, and thus we implicitly assume that $\sigma_f^2 < \infty$ (finite optimal objective difference). Moreover, $\sigma_f^2$ also acts as our measure of heterogeneity between clients. This is in line with previous works on federated optimization in the non-i.i.d. setting, such as Li et al. (2020b), who used $\Gamma := f^\star - \mathbb{E}_i f_i^\star$ as the heterogeneity measure. We can relate $\sigma_f^2$ to the more standard measures of function heterogeneity $\zeta^2 = \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|_2^2$ and gradient variance $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi^i} \|\nabla F_i(x, \xi^i) - \nabla f_i(x)\|^2$ in the federated literature (Koloskova et al., 2022; Wang et al., 2022a), as shown in the following proposition (proof in Appendix C.2). For the case of convex functions, it suffices (Koloskova et al., 2020) to compare with $\zeta_\star^2 = \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x^\star)\|_2^2$ and $\sigma_\star^2 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi^i} \|\nabla F_i(x^\star, \xi^i) - \nabla f_i(x^\star)\|^2$, evaluated at the global optimum $x^\star = \arg\min_{x \in \mathbb{R}^d} f(x)$. We can observe that finite $\sigma_f^2$ is actually a stronger assumption than bounded noise at the optimum ($\zeta_\star, \sigma_\star$), but weaker than uniformly bounded noise ($\zeta, \sigma$).

Proposition 1 (Comparison of heterogeneity measures). Using the definitions of $\sigma_f^2$, $\zeta_\star^2$, and $\sigma_\star^2$ above, we have: (a) $\zeta_\star^2 \leq 2L\sigma_f^2$, and (b) $\sigma_\star^2 \leq 2L\sigma_f^2$.

4.2 Convergence of fully locally adaptive FedSPS

In this section we provide the convergence guarantees of FedSPS for sums of convex (or strongly convex) functions. We do not place any restriction on $\gamma_b$, and thus denote this as the fully locally adaptive setting, which is of most interest to us. The primary theoretical challenge was to extend the error-feedback framework (which originally works for equal stepsizes) (Mania et al., 2017; Stich & Karimireddy, 2020) to fully uncoordinated local stepsizes, and we do this for the first time in our work. All proofs are provided in Appendix B.

Theorem 3 (Convergence of FedSPS).
Assume that Assumptions 1 and 2 hold and $c \geq 2\tau^2$; then after $T$ iterations ($T/\tau$ communication rounds) of FedSPS (Algorithm 1) it holds that

(a) Convex case:

$$\frac{1}{nT} \sum_{t=0}^{T-1} \sum_{i=1}^{n} \mathbb{E}\big[ f_i(x_t^i) - \ell_i^\star \big] \leq \frac{2}{T\alpha} \|x_0 - x^\star\|^2 + \frac{4\gamma_b \sigma_f^2}{\alpha}, \qquad (6)$$

where $\alpha := \min\big\{ \frac{1}{2cL}, \gamma_b \big\}$. If $\mu > 0$, and $c \geq 4\tau^2$, we have

(b) Strongly convex case:

$$\mathbb{E}\|\bar{x}_T - x^\star\|^2 \leq A (1 - \mu\alpha)^T \|x_0 - x^\star\|^2 + \frac{2\gamma_b \sigma_f^2}{\alpha\mu}, \qquad (7)$$

where $A = \frac{1}{\mu\alpha}$, and $\bar{x}_t := \frac{1}{n} \sum_{i=1}^{n} x_t^i$.

The convergence criterion of the first result (6) is non-standard, as it involves the average over all iterates $x_t^i$ on the left-hand side, and not the average $\bar{x}_t$ more commonly used. However, note that at every $\tau$-th iteration these quantities are the same, and thus our result implies convergence of $\frac{1}{Tn} \sum_{t=0}^{(T-1)/\tau} \sum_{i=1}^{n} \mathbb{E}[f_i(\bar{x}_{t\tau}) - \ell_i^\star]$. Moreover, in the interpolation case all $\ell_i^\star = f^\star$.

Remark 4 (Minimal need for hyperparameter tuning). The parameter $\tau$ is a user-selected input parameter determining the number of local steps, and $\gamma_b$ trades off adaptivity (potentially faster convergence for large $\gamma_b$) and accuracy (higher for small $\gamma_b$). Moreover, as $c$ only depends on the input parameter $\tau$ and not on properties of the function (e.g., $L$ or $\mu$), it is also a free parameter. The algorithm provably converges (up to the indicated accuracy) for any choice of these parameters. The lower bounds $\ell_i^\star$ can be set to zero for many machine learning problems, as discussed before. Therefore, we effectively reduce the need for hyperparameter tuning compared to previous methods like FedAMS, whose convergence depends on problem-dependent parameters.

Comparison with SPS (Loizou et al., 2021). We now compare our results to Loizou et al. (2021), who studied SPS for a single worker ($n = 1$). First, we note that in the strongly convex case, we almost recover their Theorem 3.1. The only differences are that they have $A = 1$ and allow weaker bounds on the parameter $c$ ($c \geq 1/2$, vs. our $c \geq 4\tau^2$), but we match the other constants.
In the convex case, we again recover (Loizou et al., 2021, Theorem 3.4) up to constants and the stronger condition on $c$ (vs. $c > 1$ in their case).

(Special case I) Exact convergence of FedSPS in the interpolation regime: We highlight the linear convergence of FedSPS in the interpolation case ($\sigma_f = 0$) in the following corollary.

Corollary 5 (Linear convergence of FedSPS under interpolation). Assume interpolation, $\sigma_f^2 = 0$, and let the assumptions of Theorem 3 be satisfied with $\mu > 0$. Then

$$\mathbb{E}\|\bar{x}_T - x^\star\|^2 \leq A (1 - \mu\alpha)^T \|x_0 - x^\star\|^2. \qquad (8)$$

(Special case II) Exact convergence of FedSPS in the small stepsize regime: Theorem 3 shows convergence of FedSPS to a neighborhood of the solution. Decreasing $\gamma_b$ (below $\frac{1}{2cL}$) can improve the accuracy, but the error term remains at least $\Omega(\sigma_f^2)$ even as $\gamma_b \to 0$. This issue is also present in the original work on SPS_max (Loizou et al., 2021, Corollary 3.3). However, we remark that when the stepsize upper bound $\gamma_b$ is chosen extremely small (not allowing for adaptivity), then FedSPS becomes identical to constant-stepsize FedAvg. This is not reflected in Theorem 3, which cannot recover the exact convergence known for FedAvg. We address this in the next theorem, proving exact convergence of small-stepsize FedSPS (equivalent to an analysis of FedAvg under the $\sigma_f^2$ assumption).

Theorem 6 (Convergence of small-stepsize FedSPS). Assume that Assumptions 1 and 2 hold and $\gamma_b \leq \min\big\{ \frac{1}{2cL}, \frac{1}{20L\tau} \big\}$; then after $T$ iterations of FedSPS (Algorithm 1) it holds that

(a) Convex case:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[f(\bar{x}_t) - f^\star] = O\left( \frac{1}{T\gamma_b} \|x_0 - x^\star\|^2 + \gamma_b L \sigma_f^2 + \gamma_b^2 L \tau^2 \sigma_f^2 \right),$$

and when $\mu > 0$,

(b) Strongly convex case:

$$\mathbb{E}\|\bar{x}_T - x^\star\|^2 = O\left( \frac{(1 - \mu\gamma_b)^T}{\mu\gamma_b} \|x_0 - x^\star\|^2 + \frac{\gamma_b L \sigma_f^2}{\mu} + \frac{\gamma_b^2 L \tau^2 \sigma_f^2}{\mu} \right).$$

This theorem shows that by choosing an appropriately small $\gamma_b$, any target accuracy $\epsilon > 0$ can be obtained. We are only aware of Li et al. (2020b) studying FedAvg under assumptions similar to ours ($\Gamma := f^\star - \mathbb{E}_i f_i^\star$ measuring heterogeneity).
However, their analysis additionally requires bounded stochastic gradients, and their convergence rates are weaker (e.g., not recovering linear convergence under interpolation when $\sigma_f^2 = 0$).

5 Decreasing FedSPS for exact convergence

In the previous section, we proved that FedSPS converges in the interpolation setting irrespective of the value of the stepsize parameter $\gamma_b$. However, many practical federated learning scenarios, such as those involving heterogeneous clients, constitute the non-interpolating setting ($\sigma_f > 0$). Here, we need to choose a small value of $\gamma_b$ to ensure convergence, trading off adaptivity for exact convergence. Recall that Li et al. (2020b) proved that choosing a decaying stepsize is necessary for convergence of FedAvg under heterogeneity. In this section, we draw inspiration from the decreasing SPS stepsize DecSPS (Orvieto et al., 2022) to develop FedDecSPS, which achieves exact convergence in practical non-interpolating scenarios without compromising adaptivity.

Figure 2: Sensitivity analysis of FedSPS to hyperparameters for convex logistic regression on the MNIST dataset (i.i.d.) without client sampling. (a) Comparing the effect of varying $\gamma_b$ on FedSPS and varying $\gamma$ on FedAvg convergence: FedAvg is more sensitive to changes in $\gamma$, while FedSPS is insensitive to changes in $\gamma_b$. (b) Effect of varying $\gamma_b$ on FedSPS stepsize adaptivity: adaptivity is lost if $\gamma_b$ is chosen too small. (c) Small $c$ works well in practice ($\tau = 5$). (d) Optimal $c$ versus $\tau$, showing that there is no dependence.

FedDecSPS. In order to obtain exact convergence to arbitrary accuracy (without the small stepsize assumption) in the heterogeneous setting with $\sigma_f > 0$, we propose a heuristic decreasing-stepsize version of FedSPS, called FedDecSPS. The FedDecSPS stepsize for client $i$ and local iteration $t$ is given by

$$\gamma_t^i = \frac{1}{c_t} \min\left\{ \frac{F_i(x_t^i, \xi_t^i) - \ell_i^\star}{\|\nabla F_i(x_t^i, \xi_t^i)\|^2}, \, c_{t-1} \gamma_{t-1}^i \right\},$$

where $(c_t)_{t=0}^{\infty}$ is any non-decreasing positive sequence of real numbers. We also set $c_{-1} = c_0$ and $\gamma_{-1}^i = \gamma_b$. Experiments involving heterogeneous clients in Section 6 demonstrate the practical convergence benefits of FedDecSPS in non-interpolating settings.

Figure 3: Comparison for convex logistic regression. (a) MNIST dataset (i.i.d. without client sampling). (b) w8a dataset (i.i.d. with client sampling). (c) MNIST dataset (non-i.i.d. with client sampling). (d) Average stepsize across all clients for FedSPS and FedDecSPS corresponding to (c). The performance of FedSPS matches that of FedAvg with the best tuned local learning rate in the i.i.d. cases, and outperforms it in the non-i.i.d. case.

6 Experiments

Experimental setup. For all federated training experiments we use 500 communication rounds (the number of communication rounds being $T/\tau$ in our notation), 5 local steps on each client ($\tau = 5$, unless otherwise specified for some ablation experiments), and a batch size of 20 ($|B| = 20$).
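As a quick reference before the comparisons, the FedDecSPS stepsize recursion from Section 5 can be sketched on a single client. This is our own minimal illustration (the helper name, the toy quadratic $f(x) = \tfrac{1}{2}x^2$, and $\ell^\star = 0$ are assumptions), using the schedule $c_t = c_0\sqrt{t+1}$ employed in our experiments:

```python
import numpy as np

def feddecsps_stepsize(loss, grad, t, c_seq, prev_gamma, lower_bound=0.0):
    """FedDecSPS on one client:
    gamma_t = (1/c_t) * min((F - l*) / ||grad F||^2, c_{t-1} * gamma_{t-1}),
    with (c_t) a non-decreasing positive sequence and c_{-1} := c_0."""
    c_t = c_seq(t)
    c_prev = c_seq(t - 1) if t > 0 else c_seq(0)
    g2 = float(np.dot(grad, grad))
    polyak = (loss - lower_bound) / g2 if g2 > 0 else float("inf")
    return min(polyak, c_prev * prev_gamma) / c_t

c0 = 0.5
c_seq = lambda t: c0 * np.sqrt(t + 1)   # c_t = c_0 * sqrt(t + 1)
x, gamma, gammas = 5.0, 1.0, []          # gamma_{-1} = gamma_b = 1.0
for t in range(50):                      # full-batch steps on f(x) = 0.5 * x^2, l* = 0
    loss, grad = 0.5 * x * x, np.array([x])
    gamma = feddecsps_stepsize(loss, grad, t, c_seq, gamma)
    gammas.append(gamma)
    x -= gamma * float(grad[0])
```

Because $c_t$ is non-decreasing, the recursion guarantees $\gamma_t \leq (c_{t-1}/c_t)\,\gamma_{t-1} \leq \gamma_{t-1}$, so the stepsizes never increase, while each step still adapts to the current Polyak ratio.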
We perform experiments in the i.i.d. as well as non-i.i.d. settings. Results are reported for both settings, without client sampling (10 clients) and with client sampling (10 clients sampled uniformly at random from 100 clients with participation fraction 0.1, and data split among all 100 clients), i.e., $n = 10$ active clients throughout. The i.i.d. experiments involve randomly shuffling the data and splitting it equally between clients. For non-i.i.d. experiments with the MNIST dataset (LeCun et al., 2010), we assign every client samples from exactly two classes of the dataset, the splits being non-overlapping and balanced, with each client having the same number of samples (Li et al., 2020b). For non-i.i.d. experiments with the CIFAR-10/CIFAR-100 datasets, we use the Dirichlet distribution over classes, following the proposal of Hsu et al. (2019). The degree of heterogeneity is controlled by $\alpha$, the concentration parameter of the Dirichlet distribution, and we report results for $\alpha = 1$, $\alpha = 0.1$, and $\alpha = 0.01$. Our code is based on publicly available repositories for SPS and FedAMS¹, and will be made available upon acceptance.

FedSPS. The implementation follows Algorithm 1. Since all our experimental settings involve non-negative loss functions, we use the lower bound $\ell_i^\star = 0$ (Orvieto et al., 2022) throughout. In the following, we perform an empirical sensitivity analysis for the free hyperparameters $\gamma_b$ and $c$, concluding that our method is indeed insensitive to changes in these parameters. We start by benchmarking our method on an initial convex experiment: classification on the MNIST (i.i.d.) dataset with a logistic regression model, without client sampling. In Figure 2(a), we compare the effect of varying $\gamma_b \in \{1, 5, 10\}$ on FedSPS, and varying $\gamma \in \{0.1, 0.01\}$ on FedAvg.
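As an aside on the data setup: the Dirichlet-based non-i.i.d. split (Hsu et al., 2019) described above can be sketched as follows. This is our own minimal illustration (function name and toy labels are assumptions), not the code used for the reported results; for each class, client proportions are drawn from a symmetric Dirichlet, so smaller $\alpha$ yields a more skewed, more heterogeneous split.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices among clients: for every class, draw client
    proportions from Dirichlet(alpha, ..., alpha) and slice the class
    accordingly (smaller alpha => more non-i.i.d.)."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idcs = np.where(labels == cls)[0]
        rng.shuffle(idcs)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idcs)).astype(int)  # cumulative cut points
        for cid, chunk in enumerate(np.split(idcs, cuts)):
            parts[cid].extend(chunk.tolist())
    return parts

labels = np.repeat(np.arange(10), 100)   # toy stand-in for 10-class labels
parts = dirichlet_partition(labels, n_clients=10, alpha=0.1)
```

The partition is disjoint and covers every sample; with $\alpha = 0.1$ most clients end up dominated by a handful of classes, mimicking the heterogeneity levels studied in Table 1.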
We find that FedAvg is not robust to the stepsize choice, converging well for stepsize 0.1 but very slowly for stepsize 0.01. On the contrary, all instances of FedSPS converge to a neighbourhood of the optimum, the size of the neighbourhood being proportional to $\gamma_b$, as suggested by the theory. We now fix $\gamma_b = 1$ and perform an ablation study to understand the effect of the SPS scaling parameter $c$ on convergence in Figure 2(c). For $\tau = 5$ local steps, we vary $c$ from 0.01 to 40 (i.e., up to the order of $\tau^2$). Unlike what is predicted by our theory, we empirically observe that small $c$ works better, and larger $c$ leads to slower convergence. Moreover, all values of $c \in \{0.01, 0.1, 0.5, 1.0\}$ show similarly good convergence, implying that our method is robust to this hyperparameter and needs no tuning. We provide additional plots for $\tau \in \{10, 20, 50, 100\}$ local steps in Appendix F.2 to confirm that this observation holds across all values of $\tau$, and plot the optimal value of $c$ versus $\tau$ for each case in Figure 2(d). Based on these experiments, we fix $c = 0.5$ for all further experiments.

FedDecSPS. We evaluate the performance of FedDecSPS with $c_t = c_0\sqrt{t+1}$. Similar to the sensitivity analysis of FedSPS towards $c$ (Figure 2), we performed ablation studies for a fixed value of $\gamma_b$ and varying $c_0$ as well as $\tau$. The observation is the same as in the previous case: the optimal value of $c_0$ does not scale with $\tau$ as suggested by theory, and we fix $c_0 = 0.5$ for all experiments. Similarly, we fix $\gamma_b = 1$, following the same observations as before. We compare the convergence of FedSPS and FedDecSPS for the case of heterogeneous data on clients (i.e., $\sigma_f > 0$) in Figure 3(c) and (d), as well as Figure 5. We observe that our practically motivated FedDecSPS performs better in such non-interpolating settings, as expected.

FedAvg and FedAMS.
We compare the performance of our methods FedSPS and FedDecSPS against the FedAvg baseline and the state-of-the-art adaptive federated algorithm FedAMS (Wang et al., 2022a). FedAvg and FedAMS need extensive tuning via grid search over the client learning rate η_l, the server learning rate η, the max stabilization factor ϵ, and β_1, β_2. We refer readers to Appendix F.4 for details on the grid search performed and the optimal set of hyperparameters.

Convex comparison. For the convex setting of logistic regression on the MNIST dataset (i.i.d.), without client sampling, we now compare FedSPS with FedAvg and FedAMS in Figure 3(a). We see that the convergence of FedSPS matches that of the best tuned FedAvg. Note that while the best tuned FedAMS slightly outperforms our method, it requires considerable tuning, as depicted by the large margin between the best and worst learning rate performances. For additional convex experiments in the more practical setting with client sampling, we consider binary classification on LIBSVM (Chang & Lin, 2011) datasets (w8a, mushrooms, ijcnn, phishing, a9a) with a logistic regression model in the i.i.d. setting. We report the performance on w8a in Figure 3(b), where FedSPS again converges similarly to tuned FedAvg, and better than FedAMS. We defer the rest of the LIBSVM dataset plots to Appendix F. In the non-i.i.d. case we compare our proposed FedSPS and FedDecSPS to the FedAvg baseline, the adaptive federated methods FedAMS and FedADAM, as well as another state-of-the-art federated method, MIME (Karimireddy et al., 2021). In this setting FedDecSPS does better than FedSPS, and our methods outperform the best tuned FedAvg, FedADAM, and MIME.

Non-convex comparison.
For non-convex experiments, we consider multi-class classification of the MNIST dataset using the LeNet architecture (LeCun et al., 1998), of the CIFAR-10 dataset using the ResNet18 architecture (He et al., 2016), and of the CIFAR-100 dataset using the ResNet50 architecture, in the i.i.d. as well as non-i.i.d. settings with client sampling (our implementations build on the public SPS code, https://github.com/IssamLaradji/sps, and the FedAMS code, https://github.com/jinghuichen/FedCAMS).

Table 1: Non-convex experiments with the CIFAR-10/CIFAR-100 datasets and ResNet18/ResNet50 architectures for different levels of heterogeneity. FedSPS outperforms other non-adaptive, locally adaptive, and globally adaptive methods in the homogeneous setting, and FedDecSPS performs best in the non-iid settings. We report the mean and standard deviation of test accuracy (%) over runs with three random seeds.

| Non-iidness (Dirichlet α) | Optimizer | CIFAR-10, ResNet18 | CIFAR-100, ResNet50 |
|---|---|---|---|
| α = 1 | FedAvg | 86.71 ± 0.41 | 56.34 ± 0.52 |
| α = 1 | FedAMS | 88.66 ± 0.87 | 64.32 ± 0.91 |
| α = 1 | FedADAM | 87.79 ± 0.76 | 63.91 ± 0.76 |
| α = 1 | MIME | 87.53 ± 0.41 | 63.78 ± 0.45 |
| α = 1 | Local-AdaAlter | 86.98 ± 0.23 | 60.34 ± 0.27 |
| α = 1 | Local-AMSGrad | 86.89 ± 0.92 | 61.02 ± 0.89 |
| α = 1 | FedSPS | 89.92 ± 0.13 | 65.24 ± 0.19 |
| α = 1 | FedDecSPS | 89.91 ± 0.15 | 65.11 ± 0.23 |
| α = 0.1 | FedAvg | 82.68 ± 0.51 | 47.91 ± 0.56 |
| α = 0.1 | FedAMS | 84.98 ± 0.89 | 62.24 ± 0.72 |
| α = 0.1 | FedADAM | 83.62 ± 0.68 | 61.67 ± 0.67 |
| α = 0.1 | MIME | 83.41 ± 0.45 | 61.39 ± 0.34 |
| α = 0.1 | Local-AdaAlter | 82.49 ± 0.27 | 57.72 ± 0.23 |
| α = 0.1 | Local-AMSGrad | 82.39 ± 0.91 | 58.81 ± 0.84 |
| α = 0.1 | FedSPS | 84.74 ± 0.16 | 62.91 ± 0.16 |
| α = 0.1 | FedDecSPS | 85.78 ± 0.14 | 63.05 ± 0.18 |
| α = 0.01 | FedAvg | 27.71 ± 0.39 | 23.22 ± 0.45 |
| α = 0.01 | FedAMS | 31.94 ± 0.88 | 30.61 ± 0.85 |
| α = 0.01 | FedADAM | 30.15 ± 0.71 | 28.96 ± 0.73 |
| α = 0.01 | MIME | 30.35 ± 0.51 | 27.98 ± 0.50 |
| α = 0.01 | Local-AdaAlter | 28.87 ± 0.32 | 26.23 ± 0.30 |
| α = 0.01 | Local-AMSGrad | 29.18 ± 0.84 | 26.21 ± 0.76 |
| α = 0.01 | FedSPS | 31.15 ± 0.19 | 30.66 ± 0.23 |
| α = 0.01 | FedDecSPS | 32.57 ± 0.16 | 31.29 ± 0.14 |

[Figure 4(a): MNIST (i.i.d.), training loss and test accuracy over 500 communication rounds for FedAMS, FedAvg, and FedSPS.]
[Figure 4(b): MNIST (non-i.i.d.), training loss and test accuracy over 500 communication rounds for FedAMS, FedAvg, FedSPS, and FedDecSPS.]

Figure 4: Non-convex MNIST experiments with client sampling. (a) Non-convex case of LeNet on the MNIST dataset (i.i.d.). (b) Non-convex case of LeNet on the MNIST dataset (non-i.i.d.). The first column shows training loss, the second column test accuracy. The convergence of FedSPS is very close to that of FedAvg with the best possible tuned local learning rate. Moreover, FedSPS converges better than FedAMS in the non-convex MNIST case (both i.i.d. and non-i.i.d.), and also offers superior generalization performance.

[Figure 5: training loss (log scale) and test accuracy over 500 communication rounds for FedSPS, FedAMS, and FedAvg.]

Figure 5: Non-convex CIFAR-10 (i.i.d.) experiments with client sampling and the ResNet18 architecture. The first column shows training loss, the second column test accuracy. FedSPS converges better than FedAvg and FedAMS, and also offers superior generalization performance.

We report results in Figures 4 and 5 and Table 1. For the upper bound on the stepsizes, we use the smoothing technique for the rest of the experiments, as suggested by Loizou et al. (2021), to avoid sudden fluctuations in the stepsize. For a client i ∈ [n] and iteration t, the adaptive iteration-dependent upper bound is given by γ^i_{b,t} = 2^{|B|/m_i} γ^i_{b,t−1}, where |B| is the batch size, m_i is the number of data examples on that client, and we fix γ_{b,0} = 1. In Figure 4 (MNIST), we find that the convergence of FedSPS is very close to that of FedAvg with the best possible tuned local learning rate, while outperforming FedAMS. In Figure 5 (CIFAR-10), FedSPS outperforms tuned FedAvg and FedAMS in terms of both training loss and test accuracy. We report the results in the non-i.i.d.
setting of CIFAR-10/CIFAR-100 for different levels of heterogeneity in Table 1, and find that FedDecSPS performs the best in heterogeneous settings, closely followed by FedSPS.

7 Conclusion

In this paper, we show that locally adaptive federated optimization can lead to faster convergence by harnessing the geometric information of the local objective functions. This is especially beneficial in the interpolating setting, which arises commonly for overparameterized deep learning problems. We propose a locally adaptive federated optimization algorithm, FedSPS, by incorporating the stochastic Polyak stepsize in the local steps, and prove sublinear and linear convergence to a neighbourhood for the convex and strongly convex cases, respectively. We further extend our method to a decreasing-stepsize version, FedDecSPS, that enables exact convergence even in practical non-interpolating FL settings without compromising adaptivity. We perform relevant illustrative experiments to show that our proposed method is relatively insensitive to the hyperparameters involved, thereby requiring less tuning than other state-of-the-art federated algorithms. Moreover, our methods perform as well as or better than tuned FedAvg and FedAMS in convex as well as non-convex experiments, in both i.i.d. and non-i.i.d. settings.

Acknowledgments

This work was supported by the Helmholtz Association's Initiative and Networking Fund on the HAICORE@FZJ partition. Sebastian Stich thanks Google for partial financial support through a Google Research Scholar Award 2023. Nicolas Loizou acknowledges support from CISCO Research.

References

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, pp. 119–128, 2020.
Ryan D'Orazio, Nicolas Loizou, Issam Laradji, and Ioannis Mitliagkas. Stochastic mirror descent: Convergence analysis and adaptive variants via the mirror stochastic Polyak stepsize. arXiv preprint arXiv:2110.15412, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Robert Gower, Othmane Sebbouh, and Nicolas Loizou. SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pp. 1315–1323. PMLR, 2021a.

Robert M Gower, Aaron Defazio, and Michael Rabbat. Stochastic Polyak stepsize with a moving target. arXiv preprint arXiv:2106.11851, 2021b.

Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa. Cutting some slack for SGD with adaptive Polyak stepsizes. arXiv preprint arXiv:2202.12328, 2022.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Xiaowen Jiang and Sebastian U Stich. Adaptive SGD with Polyak stepsize and line-search: Robust convergence and variance reduction. Advances in Neural Information Processing Systems, 36, 2024.
Xiaowen Jiang, Anton Rodomanov, and Sebastian U Stich. Federated optimization with doubly regularized drift correction. arXiv preprint arXiv:2404.08447, 2024a.

Xiaowen Jiang, Anton Rodomanov, and Sebastian U Stich. Stabilized proximal-point methods for federated optimization. arXiv preprint arXiv:2407.07084, 2024b.

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Breaking the centralized barrier for cross-device federated learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 28663–28676. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f0e6be4ce76ccfa73c5a540d992d0756-Paper.pdf.

Junhyung Lyle Kim, Mohammad Taha Toghani, César A Uribe, and Anastasios Kyrillidis. Adaptive federated learning with auto-tuned clients. arXiv preprint arXiv:2306.11201, 2023.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U. Stich. A unified theory of decentralized SGD with changing topology and local updates. In 37th International Conference on Machine Learning (ICML). PMLR, 2020.

Anastasiia Koloskova, Sebastian U Stich, and Martin Jaggi. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. Advances in Neural Information Processing Systems, 35:17202–17215, 2022.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

Yann LeCun, Corinna Cortes, and Chris Burges. MNIST handwritten digit database, 2010.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020a.

Xiang Li, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Communication-efficient local decentralized SGD methods. arXiv preprint arXiv:1910.09126, 2019.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-iid data. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=HJxNAnVtDS.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR, 2019.

Nicolas Loizou, Sharan Vaswani, Issam Hadj Laradji, and Simon Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR, 2021.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529, 2019.

Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.

Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani.
Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.

Antonio Orvieto, Simon Lacoste-Julien, and Nicolas Loizou. Dynamics of SGD with stochastic Polyak stepsizes: Truly adaptive variants and convergence to exact solution. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=lUyAaz-iA4u.

Boris T Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1:32, 1987.

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=LkFG3lB13U5.

Anton Rodomanov, Xiaowen Jiang, and Sebastian Stich. Universality of AdaGrad stepsizes for stochastic optimization: Inexact oracle, acceleration and variance reduction. arXiv preprint arXiv:2406.06398, 2024.

Fabian Schaipp, Robert M Gower, and Michael Ulbrich. A stochastic proximal Polyak step size. arXiv preprint arXiv:2301.04935, 2023.

Zebang Shen, Juan Cervino, Hamed Hassani, and Alejandro Ribeiro. An agnostic approach to federated learning with class imbalance. In International Conference on Learning Representations, 2022.

Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Sebastian U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019. URL https://arxiv.org/abs/1907.04232.

Sebastian U. Stich and Sai P. Karimireddy.
The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. Journal of Machine Learning Research (JMLR), 2020.

Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.

Jianyu Wang, Zheng Xu, Zachary Garrett, Zachary Charles, Luyang Liu, and Gauri Joshi. Local adaptivity in federated learning: Convergence and consistency. arXiv preprint arXiv:2106.02305, 2021.

Yujia Wang, Lu Lin, and Jinghui Chen. Communication-efficient adaptive federated learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 22802–22838. PMLR, 17–23 Jul 2022a. URL https://proceedings.mlr.press/v162/wang22o.html.

Yujia Wang, Lu Lin, and Jinghui Chen. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 6292–6320. PMLR, 2022b.

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, and Haibin Lin. Local AdaAlter: Communication-efficient stochastic gradient descent with adaptive learning rates. arXiv preprint arXiv:1911.09030, 2019.

The Appendix is organized as follows. We begin by introducing some general definitions and inequalities used throughout the proofs in Section A. Proofs for the convergence analysis of FedSPS are provided in Section B, including the convex and strongly convex cases. Section C provides some additional theoretical details for FedSPS. Alternative design choices for FedSPS, namely FedSPS-Normalized and FedSPS-Global, are described in Section D.
We provide some theoretical results for FedSPS in the non-convex setting in Section E. Finally, additional experiments for FedSPS and FedSPS-Global in convex settings are provided in Section F.

Contents of Appendix

A Technical preliminaries
  A.1 General definitions
  A.2 General inequalities
B Convergence analysis of FedSPS
  B.1 FedSPS inequalities
  B.2 Proof for convex case of FedSPS
    B.2.1 Difference Lemmas
    B.2.2 Proof of Theorem 3 (a) (general case valid for all stepsizes)
    B.2.3 Proof of Theorem 6 (a) (special case with small stepsizes)
  B.3 Proof for strongly convex case of FedSPS
    B.3.1 Proof of Theorem 3 (b) (general case valid for all stepsizes)
    B.3.2 Proof of Theorem 6 (b) (special case with small stepsizes)
C Additional details for FedSPS
  C.1 Extending FedSPS to the mini-batch setting
  C.2 Comparison of heterogeneity measures
    C.2.1 Proof of Proposition 1
D Alternative design choices for FedSPS
  D.1 FedSPS-Normalized
    D.1.1 Algorithm
    D.1.2 FedSPS-Normalized experimental details
  D.2 FedSPS-Global
    D.2.1 Algorithm
    D.2.2 Choice of aggregation formula for FedSPS-Global
    D.2.3 FedSPS-Global experimental details
E Extension of FedSPS to non-convex setting
  E.1 Convergence analysis for non-convex functions
    E.1.1 Proof of Theorem 18
F Additional experiments
  F.1 Visualization of FedSPS stepsize statistics
  F.2 Effect of varying c on convergence
  F.3 Additional convex experiments
  F.4 Hyperparameter tuning for FedAvg and FedAMS

A Technical preliminaries

Let us present some basic definitions and inequalities used in the proofs throughout the appendix.

A.1 General definitions

Definition 1 (Convexity). The function f : ℝ^d → ℝ is convex if for all x, y ∈ ℝ^d
    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ .  (11)

Definition 2 (L-smooth). The function f : ℝ^d → ℝ is L-smooth if there exists a constant L > 0 such that for all x, y ∈ ℝ^d
    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖ ,  (12)
or equivalently (Nesterov, 2003, Lemma 1.2.3)
    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2) ‖x − y‖² .  (13)

Lemma 7 (Li & Orabona (2019), Lemma 4). For an L-smooth function f : ℝ^d → ℝ with an optimum x⋆ ∈ ℝ^d, we have for any x ∈ ℝ^d
    ‖∇f(x)‖² ≤ 2L (f(x) − f(x⋆)) .  (14)

Lemma 8 (Orvieto et al. (2022), Lemma B.2).
For an L-smooth and µ-strongly convex function f : ℝ^d → ℝ, the following bounds hold:
    1/(2cL) ≤ (f(x) − f⋆)/(c ‖∇f(x)‖²) ≤ (f(x) − ℓ⋆)/(c ‖∇f(x)‖²)  and  (f(x) − f⋆)/(c ‖∇f(x)‖²) ≤ 1/(2cµ) ,  (15)
where f⋆ = inf_x f(x), and ℓ⋆ is a lower bound ℓ⋆ ≤ f⋆.

A.2 General inequalities

Lemma 9. For an arbitrary set of n vectors {a_i}^n_{i=1}, a_i ∈ ℝ^d,
    ‖ Σ^n_{i=1} a_i ‖² ≤ n Σ^n_{i=1} ‖a_i‖² .  (16)

Lemma 10. For an arbitrary set of n vectors {a_i}^n_{i=1}, a_i ∈ ℝ^d, with average ā = (1/n) Σ^n_{i=1} a_i,
    Σ^n_{i=1} ‖a_i − ā‖² ≤ Σ^n_{i=1} ‖a_i‖² .  (17)

Lemma 11. For any two vectors a, b ∈ ℝ^d,
    2 ⟨a, b⟩ ≤ γ ‖a‖² + γ⁻¹ ‖b‖² , for all γ > 0 .  (18)

Lemma 12. For any two vectors a, b ∈ ℝ^d,
    ‖a + b‖² ≤ (1 + λ) ‖a‖² + (1 + λ⁻¹) ‖b‖² , for all λ > 0 .  (19)

Lemma 13. For any random variable X,
    E ‖X − E[X]‖² ≤ E ‖X‖² .  (20)

B Convergence analysis of FedSPS

B.1 FedSPS inequalities

Lemma 14 (Upper and lower bounds on FedSPS stepsizes). Under the assumption that each F_i is L-smooth (Assumption 1), the stepsizes of FedSPS satisfy for all rounds t ∈ [T − 1]
    α ≤ γ^i_t ≤ γ_b ,  (21)
where α := min{1/(2cL), γ_b}.

Proof. This lemma is easily obtained by plugging the definition of the FedSPS stepsizes into (15).

Lemma 15. The FedSPS stepsizes γ^i_t satisfy the following fundamental inequality:
    ‖γ^i_t g^i_t‖² ≤ (γ^i_t / c) [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] .  (22)

Proof. We observe that it holds from the definition of the FedSPS stepsizes:
    ‖γ^i_t g^i_t‖² = γ^i_t · min{ (F_i(x^i_t, ξ^i_t) − ℓ⋆_i) / (c ‖∇F_i(x^i_t, ξ^i_t)‖²) , γ_b } · ‖∇F_i(x^i_t, ξ^i_t)‖² ≤ (γ^i_t / c) [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] .

B.2 Proof for convex case of FedSPS

Proof outline and notation. Before we start with the proof, let us recall the notation: below we frequently use g^i_t to denote the stochastic gradient sampled by worker i at iteration t, so g^i_t = ∇F_i(x^i_t, ξ^i_t), and the stepsize is γ^i_t = min{ (F_i(x^i_t, ξ^i_t) − ℓ⋆_i) / (c ‖∇F_i(x^i_t, ξ^i_t)‖²), γ_b }. We define the (possibly virtual) average iterate x̄_t := (1/n) Σ^n_{i=1} x^i_t. We follow the proof template for FedAvg (Local SGD) of Koloskova et al. (2020) and Stich (2019), deriving a difference lemma to bound R_t := (1/n) Σ^n_{i=1} ‖x̄_t − x^i_t‖², and plugging it into the decrease of ‖x̄_{t+1} − x⋆‖².
Our proof introduces differences at various points, such as in the difference lemma (25) and the decrease (29), to incorporate local adaptivity using properties of the FedSPS stepsizes (Lemma 14 and Lemma 15) and the finite optimal objective difference assumption (σ²_f < ∞).

B.2.1 Difference Lemmas

Lemma 16 (Bound on difference R_t for convex case with any stepsizes). For c ≥ 2τ², the variance of the iterates between the clients, R_t := (1/n) Σ^n_{i=1} ‖x̄_t − x^i_t‖², is bounded as
    R_t ≤ (1/(2nτ)) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j [F_i(x^i_j, ξ^i_j) − ℓ⋆_i] ,  (23)
where (t − 1) − k(t) denotes the index of the last communication round (k(t) ≤ τ − 1).

Proof. We use the property that x̄_t = x^i_t for every t that is a multiple of τ. Therefore, there exists a k(t) ≤ τ − 1 such that R_{(t−1)−k(t)} = 0. So we have
    R_t = (1/n) Σ^n_{i=1} ‖ Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j g^i_j − v ‖² ≤ (1/n) Σ^n_{i=1} ‖ Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j g^i_j ‖² ,
where v := (1/n) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j g^i_j. The inequality follows by the fact that v is the mean of the terms in the sum. With the inequality ‖Σ^τ_{i=1} a_i‖² ≤ τ Σ^τ_{i=1} ‖a_i‖² for vectors a_i ∈ ℝ^d, and the property of the Polyak stepsize, we deduce:
    R_t ≤ (τ/n) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} ‖γ^i_j g^i_j‖²  (24)
        ≤ (τ/(cn)) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j [F_i(x^i_j, ξ^i_j) − ℓ⋆_i]  (25)
        ≤ (1/(2nτ)) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j [F_i(x^i_j, ξ^i_j) − ℓ⋆_i] ,
where we used the assumption c ≥ 2τ² to obtain the last inequality.

Lemma 17 (Bound on difference R_t for convex case with small stepsizes). For γ_b ≤ 1/(20Lτ), the variance of the iterates between the clients, R_t := (1/n) Σ^n_{i=1} ‖x̄_t − x^i_t‖², is bounded as
    R_t ≤ 64Lγ²_b τ Σ^{t−1}_{j=(t−1)−k(t)} E[f(x̄_j) − f⋆] + 64Lγ²_b τ Σ^{t−1}_{j=(t−1)−k(t)} σ²_f ,  (26)
where (t − 1) − k(t) denotes the index of the last communication round (k(t) ≤ τ − 1).

Proof. Again we derive a bound on the variance of the iterates between the clients. We use the property that x̄_t = x^i_t for every t that is a multiple of τ. Define k(t) ≤ τ − 1 as above.
We will now prove that it holds
    R_t ≤ (32γ²_b τ / n) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} ‖∇F_i(x̄_j, ξ^i_j)‖²  (27)
        ≤ (64γ²_b Lτ / n) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} [F_i(x̄_j, ξ^i_j) − ℓ⋆_i] ,
where the second step uses smoothness as in (12), and consequently, after taking expectation and using γ_b ≤ 1/(20Lτ):
    R_t ≤ 64Lγ²_b τ Σ^{t−1}_{j=(t−1)−k(t)} E[f(x̄_j) − f⋆] + 64Lγ²_b τ Σ^{t−1}_{j=(t−1)−k(t)} σ²_f .
Note that if t is a multiple of τ, then R_t = 0 and there is nothing to prove. Otherwise note that
    R_{t+1} = (1/n) Σ^n_{i=1} ‖ (x̄_t − x^i_t) + γ^i_t ∇F_i(x^i_t, ξ^i_t) − v ‖² ,
where v := (1/n) Σ^n_{i=1} γ^i_t ∇F_i(x^i_t, ξ^i_t) denotes the average of the client updates. With the inequality ‖a + b‖² ≤ (1 + τ⁻¹) ‖a‖² + 2τ ‖b‖² for τ ≥ 1, a, b ∈ ℝ^d, we continue:
    R_{t+1} ≤ (1 + 1/τ) R_t + (2τ/n) Σ^n_{i=1} ‖γ^i_t ∇F_i(x^i_t, ξ^i_t)‖²
           ≤ (1 + 1/τ) R_t + (2τγ²_b/n) Σ^n_{i=1} ‖∇F_i(x^i_t, ξ^i_t)‖²
           ≤ (1 + 1/τ) R_t + (4τγ²_b/n) Σ^n_{i=1} [ ‖∇F_i(x^i_t, ξ^i_t) − ∇F_i(x̄_t, ξ^i_t)‖² + ‖∇F_i(x̄_t, ξ^i_t)‖² ]
           ≤ (1 + 1/τ) R_t + (4τγ²_b/n) Σ^n_{i=1} [ L² ‖x̄_t − x^i_t‖² + ‖∇F_i(x̄_t, ξ^i_t)‖² ]
           ≤ (1 + 2/τ) R_t + (4τγ²_b/n) Σ^n_{i=1} ‖∇F_i(x̄_t, ξ^i_t)‖² ,
where we used γ_b ≤ 1/(10Lτ) in the last step. Equation (27) now follows by unrolling, and noting that (1 + 2/τ)^j ≤ 8 for all 0 ≤ j ≤ τ.

B.2.2 Proof of Theorem 3 (a) (general case valid for all stepsizes)

Distance to optimality. We now proceed by using the definition of the virtual average iterate, x̄_{t+1} = (1/n) Σ^n_{i=1} (x^i_t − γ^i_t g^i_t):
    ‖x̄_{t+1} − x⋆‖² = ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x̄_t − x⋆, γ^i_t g^i_t⟩ + ‖ (1/n) Σ^n_{i=1} γ^i_t g^i_t ‖²
        = ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x^i_t − x⋆, γ^i_t g^i_t⟩ + ‖ (1/n) Σ^n_{i=1} γ^i_t g^i_t ‖² − (2/n) Σ^n_{i=1} ⟨x̄_t − x^i_t, γ^i_t g^i_t⟩
        ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x^i_t − x⋆, γ^i_t g^i_t⟩ + (1/n) Σ^n_{i=1} ‖γ^i_t g^i_t‖² + (1/n) Σ^n_{i=1} [ ‖x̄_t − x^i_t‖² + ‖γ^i_t g^i_t‖² ]  (using (18))
        = ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x^i_t − x⋆, γ^i_t g^i_t⟩ + (2/n) Σ^n_{i=1} ‖γ^i_t g^i_t‖² + R_t .  (28)

Upper bound (valid for arbitrary stepsizes). We now proceed in a similar fashion as outlined in the proof of (Loizou et al., 2021, Theorem 3.4).
Using the property (22) of the FedSPS stepsize we get:
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x^i_t − x⋆, γ^i_t g^i_t⟩ + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + R_t  (29)
        ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + R_t  (using (11))
        = ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i + ℓ⋆_i − F_i(x⋆, ξ^i_t)] + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + R_t
        = ‖x̄_t − x⋆‖² − 2(1 − 1/c) (1/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + (2/n) Σ^n_{i=1} γ^i_t [F_i(x⋆, ξ^i_t) − ℓ⋆_i] + R_t .  (30)
We use the assumption c > 2 and simplify further:²
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (1/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + 2γ_b σ²_t + R_t ,
where we defined σ²_t := (1/n) Σ^n_{i=1} [F_i(x⋆, ξ^i_t) − ℓ⋆_i]. After rearranging:
    (1/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≤ ‖x̄_t − x⋆‖² − ‖x̄_{t+1} − x⋆‖² + 2γ_b σ²_t + R_t .
We now plug in the bound on R_t calculated above in Equation (23):
    (1/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≤ ‖x̄_t − x⋆‖² − ‖x̄_{t+1} − x⋆‖² + 2γ_b σ²_t + (1/(2nτ)) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−k(t)} γ^i_j [F_i(x^i_j, ξ^i_j) − ℓ⋆_i]
        ≤ ‖x̄_t − x⋆‖² − ‖x̄_{t+1} − x⋆‖² + 2γ_b σ²_t + (1/(2nτ)) Σ^n_{i=1} Σ^{t−1}_{j=(t−1)−(τ−1)} γ^i_j [F_i(x^i_j, ξ^i_j) − ℓ⋆_i] .
We now sum this inequality from t = 0 to T − 1 and divide by T, noting that each component in the last term appears at most τ times in the sum:
    (1/(nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≤ (1/T) Σ^{T−1}_{t=0} ( ‖x̄_t − x⋆‖² − ‖x̄_{t+1} − x⋆‖² ) + 2γ_b (1/T) Σ^{T−1}_{t=0} σ²_t + (1/(2nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] .
We can now move the last term to the left:
    (1/(2nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≤ (1/T) ‖x̄_0 − x⋆‖² + 2γ_b (1/T) Σ^{T−1}_{t=0} σ²_t .
It remains to note that γ^i_t ≥ α := min{1/(2cL), γ_b}, therefore
    (1/(nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≥ (α/(nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] .
To summarize:
    (1/(nT)) Σ^{T−1}_{t=0} Σ^n_{i=1} [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] ≤ (2/(Tα)) ‖x̄_0 − x⋆‖² + (4γ_b/α) (1/T) Σ^{T−1}_{t=0} σ²_t .  (31)
We now take full expectation to get the claimed result.

² The attentive reader will observe that any c > 1/2 would be sufficient to show convergence of the function suboptimality (note that Loizou et al. (2021) used c ≥ 1/2, but only showed convergence of the iterate distance to optimality).

B.2.3 Proof of Theorem 6 (a) (special case with small stepsizes)

We now give an additional bound that tightens the result when the parameter γ_b is chosen to be a small value, smaller than 1/(2cL).
As a consequence of this assumption (see Equation (15)), it holds that γ^i_t = γ_b for all t and i ∈ [n]. The proof now follows a similar template as above, but we can now directly utilize the imposed upper bound on γ_b (vs. the bound on c used previously).

Upper bound (valid for small stepsizes). Using the definition of the virtual average iterate, x̄_{t+1} = (1/n) Σ^n_{i=1} (x^i_t − γ^i_t g^i_t):
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} ⟨x̄_t − x⋆, γ^i_t g^i_t⟩ + (1/n) Σ^n_{i=1} ‖γ^i_t g^i_t‖² .  (32)
We now observe
    ⟨x̄_t − x⋆, γ^i_t g^i_t⟩ = γ^i_t ⟨x̄_t − x^i_t, g^i_t⟩ + γ^i_t ⟨x^i_t − x⋆, g^i_t⟩
        ≥ γ^i_t [ F_i(x̄_t, ξ^i_t) − F_i(x^i_t, ξ^i_t) − (L/2) ‖x̄_t − x^i_t‖² ] + γ^i_t ⟨x^i_t − x⋆, g^i_t⟩  (33)
        ≥ γ^i_t [ F_i(x̄_t, ξ^i_t) − F_i(x^i_t, ξ^i_t) − (L/2) ‖x̄_t − x^i_t‖² + F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t) ]
        = γ^i_t [ F_i(x̄_t, ξ^i_t) − F_i(x⋆, ξ^i_t) ] − γ^i_t (L/2) ‖x̄_t − x^i_t‖² ,  (34)
where (33) uses smoothness (13) and the next step uses convexity (11). Therefore, by (32) and (34),
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [F_i(x̄_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + (L/n) Σ^n_{i=1} γ^i_t ‖x̄_t − x^i_t‖² + (1/n) Σ^n_{i=1} ‖γ^i_t g^i_t‖² .
We now use the observation that γ^i_t = γ_b (by the assumption that γ_b is small, we could have made this simplification also earlier), and continue:
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2γ_b/n) Σ^n_{i=1} [F_i(x̄_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + γ_b L R_t + (γ²_b/n) Σ^n_{i=1} ‖g^i_t‖² .
Using
    ‖g^i_t‖² ≤ 2 ‖∇F_i(x̄_t, ξ^i_t) − ∇F_i(x^i_t, ξ^i_t)‖² + 2 ‖∇F_i(x̄_t, ξ^i_t)‖² ≤ 2L² ‖x̄_t − x^i_t‖² + 2 ‖∇F_i(x̄_t, ξ^i_t)‖²
(by (19) and (12)) and the assumption γ_b ≤ 1/(10L), we obtain
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2γ_b/n) Σ^n_{i=1} [F_i(x̄_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + 2γ_b L R_t + (2γ²_b/n) Σ^n_{i=1} ‖∇F_i(x̄_t, ξ^i_t)‖²
        ≤ ‖x̄_t − x⋆‖² − (2γ_b/n) Σ^n_{i=1} [F_i(x̄_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + 2γ_b L R_t + (4Lγ²_b/n) Σ^n_{i=1} [F_i(x̄_t, ξ^i_t) − ℓ⋆_i] .
To simplify, let us take expectation over the randomness in iteration t:
    E ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − 2γ_b [f(x̄_t) − f⋆] + 2γ_b L E[R_t] + 4Lγ²_b [f(x̄_t) − f⋆ + σ²_f] .
Because f(x̄_t) − f⋆ ≥ 0 and γ_b ≤ 1/(10L), we can further simplify:
    E ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − γ_b [f(x̄_t) − f⋆] + 2γ_b L E[R_t] + 4Lγ²_b σ²_f .  (36)
After rearranging and taking expectation:
    γ_b E[f(x̄_t) − f⋆] ≤ E ‖x̄_t − x⋆‖² − E ‖x̄_{t+1} − x⋆‖² + 2γ_b L E[R_t] + 4Lγ²_b σ²_f .
We now plug in the bound on R_t from Equation (26):
    γ_b E[f(x̄_t) − f⋆] ≤ E ‖x̄_t − x⋆‖² − E ‖x̄_{t+1} − x⋆‖² + 128L²γ³_b τ Σ^{t−1}_{j=(t−1)−k(t)} E[f(x̄_j) − f⋆] + 4(γ²_b + 32τ²γ³_b) L σ²_f ,
and after summing over t = 0 to T − 1 and dividing by T (each E[f(x̄_j) − f⋆] appears at most τ times in the sum, and 128L²γ²_b τ² ≤ 2/3 since γ_b ≤ 1/(20Lτ)):
    (1/T) Σ^{T−1}_{t=0} γ_b E[f(x̄_t) − f⋆] ≤ (1/T) Σ^{T−1}_{t=0} ( E ‖x̄_t − x⋆‖² − E ‖x̄_{t+1} − x⋆‖² ) + (2γ_b/3) (1/T) Σ^{T−1}_{t=0} E[f(x̄_t) − f⋆] + 4(γ²_b + 32τ²γ³_b) L σ²_f ,
and consequently:
    (1/T) Σ^{T−1}_{t=0} E[f(x̄_t) − f⋆] ≤ (3/(Tγ_b)) E ‖x̄_0 − x⋆‖² + 12(γ_b + 32τ²γ²_b) L σ²_f .  (37)

B.3 Proof for strongly convex case of FedSPS

Here we outline the proof in the strongly convex case. As the proof closely follows the arguments from the convex case (just with the difference of applying µ-strong convexity whenever convexity was used), we just highlight the main differences for the reader.

B.3.1 Proof of Theorem 3 (b) (general case valid for all stepsizes)

In the respective proof in the convex case, convexity was used in Equation (29). If we use strong convexity (4) instead, we obtain
    ‖x̄_{t+1} − x⋆‖² ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [ F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t) + (µ/2) ‖x^i_t − x⋆‖² ] + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + R_t
        ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [ F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t) + (µ/2) ‖x̄_t − x⋆‖² ] + (2/n) Σ^n_{i=1} γ^i_t µ ‖x̄_t − x^i_t‖² + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + R_t
        ≤ ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [ F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t) + (µ/2) ‖x̄_t − x⋆‖² ] + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + 2R_t ,
where the second inequality used ‖x̄_t − x⋆‖² ≤ 2 ‖x̄_t − x^i_t‖² + 2 ‖x^i_t − x⋆‖², and the last inequality used γ^i_t µ ≤ 1/(2c) ≤ 1/2. By applying the lower bound α on the stepsize, we see that convergence is governed by a linearly decreasing term:
    ‖x̄_{t+1} − x⋆‖² ≤ (1 − µα) ‖x̄_t − x⋆‖² − (2/n) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − F_i(x⋆, ξ^i_t)] + (2/(cn)) Σ^n_{i=1} γ^i_t [F_i(x^i_t, ξ^i_t) − ℓ⋆_i] + 2R_t ,
while the remaining terms are the same (up to a small change in the constant in front of R_t) as in the convex case. The increased constant can be handled by imposing a slightly stronger condition on c (here we used c ≥ 4τ², compared to c ≥ 2τ² in the convex case).
The rest of the proof follows by the same steps, with one additional departure: when summing up the inequalities over the iterations $t = 0$ to $T-1$, we need to use an appropriate weighting to benefit from the linear decrease. This technique is standard in the analysis; it is illustrated in Stich (2019), carried out without a residual $R_t$ in (Stich & Karimireddy, 2020, Proof of Theorem 16), and with a residual $R_t$ in a distributed setting in (Koloskova et al., 2020, Proof of Theorem 2, in particular Lemma 15).

B.3.2 Proof of Theorem 6 (b) (special case with small stepsizes)

Convexity was used in Equation (33). If we use $\mu$-strong convexity instead, we obtain:
$$\big\langle \bar{x}_t - x^\star,\ \gamma_t^i g_t^i\big\rangle \overset{(4)}{\ge} \gamma_t^i\Big[F_i(\bar{x}_t,\xi_t^i) - F_i(x_t^i,\xi_t^i) - \frac{L}{2}\|\bar{x}_t - x_t^i\|^2 + F_i(x_t^i,\xi_t^i) - F_i(x^\star,\xi_t^i) + \frac{\mu}{2}\|x_t^i - x^\star\|^2\Big]$$
$$\ge \gamma_t^i\big[F_i(\bar{x}_t,\xi_t^i) - F_i(x^\star,\xi_t^i)\big] - \gamma_t^i L\|\bar{x}_t - x_t^i\|^2 + \frac{\gamma_t^i\mu}{4}\|\bar{x}_t - x^\star\|^2, \tag{38}$$
where the last inequality used $\|\bar{x}_t - x^\star\|^2 \le 2\|\bar{x}_t - x_t^i\|^2 + 2\|x_t^i - x^\star\|^2$ (and $\mu \le L$). The last term is essential to get a linear decrease in the first term. For illustration, Equation (36) now changes to
$$\mathbb{E}\,\|\bar{x}_{t+1} - x^\star\|^2 \le \Big(1 - \frac{\gamma_b\mu}{2}\Big)\|\bar{x}_t - x^\star\|^2 - \gamma_b\big[f(\bar{x}_t) - f^\star\big] + 4\gamma_b L\,\mathbb{E}[R_t] + 4L\gamma_b^2\sigma_f^2$$
(the constant in front of $\mathbb{E}[R_t]$ increased by a factor of two because of the estimate in (38), and can be controlled by a slightly stronger condition on $\gamma_b$). From there, the proof is very standard and follows exactly the template outlined in e.g. (Stich & Karimireddy, 2020, Proof of Theorem 16) or (Koloskova et al., 2020, Proof of Theorem 2, in particular Lemma 15).

C Additional details for FedSPS

C.1 Extending FedSPS to the mini-batch setting

Note that for the sake of simplicity in the notation of the proposed method and algorithms, we used a single stochastic sample $\xi_t^i$. However, this can be trivially extended to the mini-batch setting. For a batch $\mathcal{B}_t^i$, the stochastic gradient becomes $\frac{1}{|\mathcal{B}_t^i|}\sum_{\xi\in\mathcal{B}_t^i}\nabla F_i(x_t^i,\xi)$, and the stochastic loss becomes $\frac{1}{|\mathcal{B}_t^i|}\sum_{\xi\in\mathcal{B}_t^i} F_i(x_t^i,\xi)$. We use this mini-batch setting for our practical experiments.
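As a concrete illustration of this mini-batch variant, the following self-contained Python sketch (our own toy example, not the paper's code; the least-squares setup and all names such as `sps_stepsize` are assumptions) computes the mini-batch SPS stepsize $\gamma = \min\{(F_{\mathcal{B}} - \ell^\star)/(c\,\|g_{\mathcal{B}}\|^2),\ \gamma_b\}$ and runs it on an interpolated least-squares problem, where $\ell^\star = 0$ holds exactly:

```python
import numpy as np

def sps_stepsize(loss, grad, ell_star=0.0, c=0.5, gamma_b=1.0, eps=1e-12):
    """Stochastic Polyak stepsize: min((F - ell*) / (c ||g||^2), gamma_b)."""
    return min((loss - ell_star) / (c * np.dot(grad, grad) + eps), gamma_b)

def minibatch_loss_grad(x, A, b):
    """Least-squares loss and gradient over a mini-batch (rows of A, entries of b)."""
    r = A @ x - b
    return 0.5 * np.mean(r ** 2), A.T @ r / len(b)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
x_true = rng.standard_normal(3)
b = A @ x_true                 # interpolated problem: every batch loss can reach 0
x = np.zeros(3)
for _ in range(300):
    idx = rng.choice(8, size=4, replace=False)      # sample a mini-batch B_t
    loss, g = minibatch_loss_grad(x, A[idx], b[idx])
    x -= sps_stepsize(loss, g) * g                  # SPS update with cap gamma_b
```

Under interpolation the iterate approaches `x_true` without any stepsize tuning, which is the behaviour the convex analysis above predicts.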
C.2 Comparison of heterogeneity measures

C.2.1 Proof of Proposition 1

Proof. We look at the two cases separately as follows:

Function heterogeneity. We recall the definitions $\zeta^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f(x^\star)\|_2^2$ and $\sigma_f^2 = \frac{1}{n}\sum_{i=1}^n \big(f_i(x^\star) - \ell_i^\star\big)$. The global optimal point is denoted by $x^\star$, and the local optima by $x_i^\star$. Note that $\nabla f(x^\star) = 0$ and $\nabla f_i(x_i^\star) = 0$, and we make use of these facts in our proof. Assuming $L$-smoothness of each $f_i(x)$, we can apply Lemma 7 to each of them with $x = x^\star$ to obtain
$$\|\nabla f_i(x^\star) - \nabla f_i(x_i^\star)\|^2 \le 2L\,\big(f_i(x^\star) - f_i^\star\big), \quad \forall i \in [n]. \tag{39}$$
Using this result we can bound $\zeta^2$ from above as follows (noting that $\nabla f(x^\star) = 0$, $\nabla f_i(x_i^\star) = 0$):
$$\zeta^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f(x^\star)\|_2^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_i^\star)\|^2 \le \frac{1}{n}\sum_{i=1}^n 2L\,\big(f_i(x^\star) - f_i^\star\big) \le \frac{1}{n}\sum_{i=1}^n 2L\,\big(f_i(x^\star) - \ell_i^\star\big) = 2L\sigma_f^2.$$

Gradient variance. We recall the definitions $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\|\nabla F_i(x^\star,\xi_i) - \nabla f_i(x^\star)\|^2$ and $\sigma_f^2 = \frac{1}{n}\sum_{i=1}^n \big(f_i(x^\star) - \ell_i^\star\big)$. The global optimal point is denoted by $x^\star$, the local optimum on client $i$ by $x_i^\star$, and the optima of the sample functions on client $i$ by $x_{\xi_i}^\star$. Note that $\nabla f_i(x_i^\star) = 0$ and $\nabla F_i(x_{\xi_i}^\star, \xi_i) = 0$, and we make use of these facts in our proof. Assuming $L$-smoothness of each $F_i(x,\xi_i)$, we can apply Lemma 7 to each sample function on any client $i$, with $x = x^\star$, to obtain
$$\|\nabla F_i(x^\star,\xi_i) - \nabla F_i(x_{\xi_i}^\star,\xi_i)\|^2 \le 2L\,\big(F_i(x^\star,\xi_i) - F_i(x_{\xi_i}^\star,\xi_i)\big), \quad \forall i \in [n],\ \xi_i \sim \mathcal{D}_i. \tag{40}$$
Using this result we can bound $\sigma^2$ from above as follows:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\|\nabla F_i(x^\star,\xi_i) - \nabla f_i(x^\star)\|^2 \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\|\nabla F_i(x^\star,\xi_i)\|^2 = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\|\nabla F_i(x^\star,\xi_i) - \nabla F_i(x_{\xi_i}^\star,\xi_i)\|^2$$
$$\le \frac{2L}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\big[F_i(x^\star,\xi_i) - F_i(x_{\xi_i}^\star,\xi_i)\big] \le \frac{2L}{n}\sum_{i=1}^n \mathbb{E}_{\xi_i}\big[F_i(x^\star,\xi_i) - \ell_i^\star\big] = \frac{2L}{n}\sum_{i=1}^n \big(f_i(x^\star) - \ell_i^\star\big) = 2L\sigma_f^2.$$

D Alternative design choices for FedSPS

D.1 FedSPS-Normalized

Drawing ideas from Wang et al.
(2021), which employs client- and server-side normalization to account for any solution bias due to asynchronous local stepsizes (especially for heterogeneous clients), we design FedSPS-Normalized.

Algorithm 2 FedSPS-Normalized: Fully locally adaptive FedSPS with normalization to account for heterogeneity, as suggested in Wang et al. (2021).
Input: $x_0^i = x_0$, $\forall i \in [n]$
1: for $t = 0, 1, \ldots, T-1$ do
2:   for each client $i = 1, \ldots, n$ in parallel do
3:     sample $\xi_t^i$, compute $g_t^i := \nabla F_i(x_t^i, \xi_t^i)$
4:     FedSPS-Normalized: $\gamma_t^i = \min\Big\{\frac{F_i(x_t^i,\xi_t^i) - \ell_i^\star}{c\,\|g_t^i\|^2},\ \gamma_b\Big\}$   (local stochastic Polyak stepsize)
5:     if $t+1$ is a multiple of $\tau$ then
6:       $x_{t+1}^i = \frac{1}{n}\sum_{j=1}^n \big(x_t^j - \gamma_t^j g_t^j\big)$, rescaled using the normalization factors $\nu_j$   (aggregation incorporating normalization)
7:       $\nu_i = 0$
8:     else
9:       $x_{t+1}^i = x_t^i - \gamma_t^i g_t^i$   (local step)
10:      $\nu_i = \nu_i + \gamma_t^i$   (client normalization factor)
11:    end if
12:   end for
13: end for

In the following we provide a detailed description of the algorithm, as well as a comparison to our main proposed method FedSPS.

D.1.1 Algorithm

FedSPS-Normalized. In FedSPS-Normalized (Algorithm 2), we calculate the client correction factor $\nu_i$ for each client $i \in [n]$ as the sum of the local stepsizes (since the last aggregation round). During the server aggregation step, the pseudo-gradients are then normalized using the client and server normalization factors, to remove any bias due to local heterogeneity.

D.1.2 FedSPS-Normalized experimental details

The implementation follows Algorithm 2. All remarks we made about the choice of the scaling parameter $c$, the upper bound on the stepsize $\gamma_b$, and $\ell_i^\star$ in the previous section on FedSPS remain the same here. Figure 6 shows a comparison of FedSPS and FedSPS-Normalized for the convex case of logistic regression on the MNIST dataset. For the homogeneous case (a), normalization has no effect; this is as expected, since the client- and server-side correction factors are equal and balance out.
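The normalization step can be sketched as follows. This is a minimal FedNova-style reading of the aggregation in Algorithm 2 and is our own assumption (the helper `normalized_aggregate` and the exact server rescaling are not spelled out above, and Wang et al. (2021) admit several variants): each client's pseudo-gradient $\Delta_i = \sum_k \gamma_k^i g_k^i$ is divided by its accumulated stepsize $\nu_i = \sum_k \gamma_k^i$, and the averaged direction is rescaled by the mean $\nu$ so the effective stepsize stays comparable to FedAvg.

```python
import numpy as np

def normalized_aggregate(x_prev, deltas, nus):
    """Normalize each client pseudo-gradient delta_i by its factor nu_i,
    average the normalized directions, and rescale by the mean nu."""
    directions = [delta / nu for delta, nu in zip(deltas, nus)]
    return x_prev - np.mean(nus) * np.mean(directions, axis=0)

# With identical accumulated stepsizes on all clients the correction cancels,
# and we recover plain FedAvg averaging of the local updates.
x0 = np.zeros(2)
deltas = [np.array([0.2, 0.0]), np.array([0.0, 0.4])]  # sum of gamma * g per client
nus = [0.1, 0.1]                                       # equal accumulated stepsizes
x1 = normalized_aggregate(x0, deltas, nus)
```

The equal-stepsize case above also explains the empirical observation below that normalization barely changes FedSPS: when the stepsizes have low inter-client variance, the correction factors nearly cancel.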
One might expect to observe some changes in the heterogeneous setting (b), but here FedSPS-Normalized actually performs very slightly worse than FedSPS. We conclude that normalization does not lead to any significant performance gain, and can intuitively attribute this to the fact that the FedSPS stepsizes have low inter-client variance (Figure 7).

D.2 FedSPS-Global

For the sake of comparison with existing locally adaptive FL algorithms such as Local-AMSGrad (Chen et al., 2020) and Local-AdaAlter (Xie et al., 2019) (both of which perform some form of stepsize aggregation on the server), we introduce a synchronous-stepsize version of our algorithm called FedSPS-Global.

D.2.1 Algorithm

FedSPS-Global. In FedSPS-Global (Algorithm 3), the stepsize is constant across all clients, and also across the local steps of a particular client, for a particular round $t$. There can be several choices for the aggregation formula for the global stepsize; we use a simple average of the local SPS stepsizes, as given below:
$$\gamma = \frac{1}{n}\sum_{i=1}^n \min\left\{\frac{F_i(x_t^i,\xi_t^i) - \ell_i^\star}{c\,\|\nabla F_i(x_t^i,\xi_t^i)\|^2},\ \gamma_b\right\}. \tag{41}$$

Figure 6: Comparison of FedSPS and FedSPS-Normalized for the convex case of logistic regression on the MNIST dataset ((a) i.i.d. and (b) non-i.i.d.; training loss versus communication rounds). For the homogeneous case (a) normalization has no effect, while for the heterogeneous case (b) normalization slightly deteriorates performance.
Algorithm 3 FedSPS-Global: FedSPS with global stepsize aggregation
Input: $\gamma = \gamma_0$, $x_0^i = x_0$, $\forall i \in [n]$
1: for $t = 0, 1, \ldots, T-1$ do
2:   for each client $i = 1, \ldots, n$ in parallel do
3:     sample $\xi_t^i$, and compute the stochastic gradient $g_t^i := \nabla F_i(x_t^i, \xi_t^i)$
4:     if $t+1$ is a multiple of $\tau$ then
5:       $x_{t+1}^i = \frac{1}{n}\sum_{j=1}^n \big(x_t^j - \gamma g_t^j\big)$   (communication round)
6:       FedSPS-Global: $\gamma = \frac{1}{n}\sum_{j=1}^n \min\Big\{\frac{F_j(x_t^j,\xi_t^j) - \ell_j^\star}{c\,\|g_t^j\|^2},\ \gamma_b\Big\}$   (global stepsize)
7:     else
8:       $x_{t+1}^i = x_t^i - \gamma g_t^i$   (local step)
9:     end if
10:  end for
11: end for

We provide a theoretical justification of this choice of aggregation formula, and compare it with other possible choices, in Appendix D.2.2. As is apparent from (41), we need to use stale quantities from the last round to calculate the stepsize for the current round. This is justified by the empirical observation that stepsizes do not vary much between consecutive rounds, and is a reasonable assumption that has also been made in previous work on adaptive distributed optimization (Xie et al., 2019). Due to the staleness in the update of the stepsize, we need to provide a starting stepsize $\gamma_0$ to the algorithm; this stepsize is used on all clients until the first communication round happens. FedSPS-Global can be thought of more as a heuristic method using cheap stepsize computations, offering similar empirical performance to FedSPS.

D.2.2 Choice of aggregation formula for FedSPS-Global

We can have various choices for aggregating the local stepsizes to calculate the global stepsize. The first obvious choice is to perform a simple average of the local stepsizes across all clients, given by the aggregation formula
$$\gamma^{(a)} = \frac{1}{n}\sum_{i=1}^n \min\left\{\frac{f_i(x_t^i) - \ell_i^\star}{c\,\|g_t^i\|^2},\ \gamma_b\right\},$$
which is used in our proposed method (Algorithm 3). Two other plausible choices of aggregation formula are
$$\gamma^{(b)} = \min\left\{\frac{\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]}{c\,\frac{1}{n}\sum_{i=1}^n \|g_t^i\|^2},\ \gamma_b\right\}, \qquad \gamma^{(c)} = \min\left\{\frac{\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]}{c\,\big\|\frac{1}{n}\sum_{i=1}^n g_t^i\big\|^2},\ \gamma_b\right\}.$$
Among these choices, $\gamma^{(c)}$ represents the correct SPS stepsize if we follow the original definition and replace batches with clients in the distributed setup. In the following proposition we show theoretically that $\gamma^{(b)} \le \min\{\gamma^{(a)}, \gamma^{(c)}\}$. Experimentally, we find the simple averaging of local stepsizes, i.e., $\gamma^{(a)}$, to work best, followed closely by $\gamma^{(b)}$, while the correct SPS stepsize $\gamma^{(c)}$ explodes in practice. Therefore, we choose $\gamma^{(a)}$ as our aggregation formula, which also has some added advantages for the associated proofs. We believe that a reason behind the good performance of FedSPS-Global is the low inter-client and intra-client variance of the stepsizes explained in Section 6; this is also why simple averaging of the local stepsizes works.

Proposition 2 (Global SPS stepsize aggregation formula). Using the definitions of the three aggregation formulae for synchronous SPS stepsizes $\gamma^{(a)}$, $\gamma^{(b)}$, and $\gamma^{(c)}$, as defined above in Section D.2.2, we have the following inequalities:
1. $\gamma^{(c)} \ge \gamma^{(b)}$,
2. $\gamma^{(a)} \ge \gamma^{(b)}$,
combining which we get $\gamma^{(b)} \le \min\{\gamma^{(a)}, \gamma^{(c)}\}$.

Proof. We look at the two cases separately as follows:

1. From Lemma 10 it is easy to observe that $\big\|\frac{1}{n}\sum_{i=1}^n g_t^i\big\|^2 \le \frac{1}{n}\sum_{i=1}^n \|g_t^i\|^2$, which can be rearranged as
$$\frac{\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]}{c\,\big\|\frac{1}{n}\sum_{i=1}^n g_t^i\big\|^2} \ge \frac{\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]}{c\,\frac{1}{n}\sum_{i=1}^n \|g_t^i\|^2}.$$
The required statement follows trivially from the above inequality.

2. From Chebyshev's inequality we have
$$\frac{1}{n}\sum_{i=1}^n \frac{f_i(x_t^i) - \ell_i^\star}{\|g_t^i\|^2} \ge \left(\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]\right)\left(\frac{1}{n}\sum_{i=1}^n \frac{1}{\|g_t^i\|^2}\right).$$
From the AM-HM inequality we obtain
$$\frac{1}{n}\sum_{i=1}^n \frac{1}{\|g_t^i\|^2} \ge \frac{n}{\sum_{i=1}^n \|g_t^i\|^2}.$$
Plugging this into the above, we get
$$\frac{1}{n}\sum_{i=1}^n \frac{f_i(x_t^i) - \ell_i^\star}{\|g_t^i\|^2} \ge \frac{\frac{1}{n}\sum_{i=1}^n \big[f_i(x_t^i) - \ell_i^\star\big]}{\frac{1}{n}\sum_{i=1}^n \|g_t^i\|^2}.$$
The required statement follows trivially from the above inequality.

D.2.3 FedSPS-Global experimental details

The implementation follows Algorithm 3.³ All remarks we made about the choice of the scaling parameter $c$, the upper bound on the stepsize $\gamma_b$, and $\ell_i^\star$ in the previous section on FedSPS remain the same here.
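A quick numeric sanity check of the ordering in Proposition 2 can be sketched as follows (toy random data; all variable names are ours, and `gamma_max` plays the role of the cap $\gamma_b$). We check only the $\gamma^{(c)} \ge \gamma^{(b)}$ direction, which follows from $\|\frac{1}{n}\sum_i g_i\|^2 \le \frac{1}{n}\sum_i \|g_i\|^2$ and holds for any draw; the $\gamma^{(a)} \ge \gamma^{(b)}$ direction relies on the Chebyshev ordering argument above, so arbitrary random data need not satisfy its premise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, gamma_max = 8, 5, 0.5, 10.0
gaps = rng.uniform(0.1, 2.0, size=n)   # f_i(x_t^i) - l_i*  (nonnegative loss gaps)
g = rng.standard_normal((n, d))        # per-client gradients g_t^i
sq = np.sum(g ** 2, axis=1)            # ||g_t^i||^2

# (a) average of the capped local SPS stepsizes
gamma_a = np.mean(np.minimum(gaps / (c * sq), gamma_max))
# (b) ratio of the averaged gap to the averaged squared gradient norm
gamma_b = min(np.mean(gaps) / (c * np.mean(sq)), gamma_max)
# (c) "correct" SPS: averaged gap over squared norm of the averaged gradient
gamma_c = min(np.mean(gaps) / (c * np.sum(np.mean(g, axis=0) ** 2)), gamma_max)
```

Note how the small denominator of $\gamma^{(c)}$ (gradients cancel in the average) is exactly what makes it explode in practice.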
As mentioned before, this method needs an extra hyperparameter $\gamma_0$: the stepsize used on all clients until the first communication round. Empirically, we found that setting $\gamma_0 = \gamma_0^0$, i.e., the local SPS stepsize of client 0 at iteration 0, works quite well in practice. This is explained by the fact that the SPS stepsizes have low inter-client and intra-client variance (Figure 7). Experimentally, we find that FedSPS-Global converges almost identically to the locally adaptive FedSPS (Figure 4 and Figure 9).

³Note that for any experiment in the Appendix that involves FedSPS-Global, we refer to FedSPS (from the main paper) as FedSPS-Local in the legends of plots, for the sake of clarity between the two approaches.

E Extension of FedSPS to the non-convex setting

We outline here some theoretical results on extending FedSPS to the non-convex setting. Our non-convex results have the limitation of requiring small stepsizes, but we do not need the additional strong assumption of bounded stochastic gradients used in previous work on adaptive federated optimization.

E.1 Convergence analysis for non-convex functions

We now discuss the convergence of FedSPS on non-convex functions. Unfortunately, it is necessary to impose additional assumptions in this section. This is mainly because the Polyak stepsize was designed for convex objectives, and additional assumptions are needed in the non-convex setting.

Assumption 3 (Bounded variance). Each function $f_i : \mathbb{R}^d \to \mathbb{R}$, $i \in [n]$, has a stochastic gradient with bounded local variance, that is, for all $x \in \mathbb{R}^d$,
$$\mathbb{E}_{\xi\sim\mathcal{D}_i}\,\|\nabla F_i(x,\xi) - \nabla f_i(x)\|_2^2 \le \sigma^2. \tag{42}$$
Moreover, the global variance of the loss function on each client is bounded, that is, for all $x \in \mathbb{R}^d$,
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|_2^2 \le \zeta^2. \tag{43}$$
This assumption is frequently used in the analysis of FedAvg (Koloskova et al., 2020), as well as of adaptive federated methods (Reddi et al., 2021; Wang et al., 2022a).
The local variance captures the randomness in the stochastic gradients of the clients, while the global variance measures the heterogeneity between clients. Note that $\zeta = 0$ corresponds to the i.i.d. setting. We further note that Equation (43) corresponds to the variance assumption in Loizou et al. (2021) (their Equation (8) with $\rho = 1$, $\delta = \zeta^2$), which they also required for their analysis in the non-convex setting. The following theorem applies to FedSPS and FedAvg with small stepsizes:

Theorem 18 (Convergence of FedSPS for non-convex case). Under Assumptions 1 and 3, if $\gamma_b < \min\left\{\frac{1}{2cL}, \frac{1}{25L\tau}\right\}$, then after $T$ steps, the iterates of FedSPS and FedAvg satisfy
$$\Phi_T = \mathcal{O}\left(\frac{F_0}{\gamma_b T} + \frac{\gamma_b L\sigma^2}{n} + \gamma_b^2 L^2\tau\big(\sigma^2 + \tau\zeta^2\big)\right), \tag{44}$$
where $\Phi_T := \min_{0\le t\le T-1} \mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2$ and $F_0 := f(\bar{x}_0) - f^\star$.

The proof of this theorem again relies on the assumption that the stepsize is very small, but otherwise follows the template of earlier work (Li et al., 2019; Koloskova et al., 2020) and precisely recovers their result. For the sake of completeness we still provide a proof in the Appendix, but do not claim novelty here. The theorem states that when using a constant stepsize, the algorithms reach a neighborhood of the solution, where the neighborhood depends on the local variance $\sigma$ and the global variance $\zeta$; for the i.i.d. case of $\zeta = 0$, the neighborhood is smaller and depends less on the number of local steps $\tau$. By choosing an appropriately small $\gamma_b$, any arbitrary target accuracy $\epsilon$ can be reached. A limitation of our result is that it only applies to the small-stepsize regime, where the adaptivity is governed by the stepsize bound $\gamma_b$. However, comparing with other theoretical work on adaptive federated learning algorithms, we observe that related work has similar (or stronger) limitations: e.g., both Wang et al. (2022a) (FedAMS) and Reddi et al. (2021) (FedAdam) require a uniform bound on the stochastic gradients (which we do not need), and also require effective stepsizes smaller than $\frac{1}{\tau L}$, similar to our case.
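To make Assumption 3 concrete, the following toy sketch (our own construction; the quadratic clients and all names are assumptions, not the paper's setup) estimates the local variance $\sigma^2$ of Equation (42) by resampling stochastic gradients, and computes the heterogeneity $\zeta^2$ of Equation (43) exactly at one point:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, samples = 4, 3, 2000

# Toy client objectives f_i(x) = 0.5 ||x - m_i||^2 with additive gradient noise.
m = rng.standard_normal((n, d))
noise_std = 0.3

def stoch_grad(i, x):
    """Stochastic gradient of client i: exact gradient plus Gaussian noise."""
    return (x - m[i]) + noise_std * rng.standard_normal(d)

x = rng.standard_normal(d)
full_grads = x - m                     # exact gradients grad f_i(x), one per row
avg_grad = full_grads.mean(axis=0)     # grad f(x)

# sigma^2 estimate: per-client variance of stochastic gradients, Eq. (42).
sigma2_hat = np.mean([
    np.mean([np.sum((stoch_grad(i, x) - full_grads[i]) ** 2) for _ in range(samples)])
    for i in range(n)
])
# zeta^2 at x: spread of the exact client gradients around the average, Eq. (43).
zeta2 = np.mean(np.sum((full_grads - avg_grad) ** 2, axis=1))
```

Here the true local variance is $d \cdot 0.3^2 = 0.27$, which the estimate recovers, and `zeta2` would be zero exactly when all clients share the same minimizer, matching the remark that $\zeta = 0$ corresponds to the i.i.d. setting.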
E.1.1 Proof of Theorem 18

Small stepsize. We start with the observation that it must hold $\gamma_t^i = \gamma_b$ for all iterations $t$ and clients $i$. This is due to our strong assumption on the small stepsizes.

Bound on $R_t$. Similarly as in the convex case, we need a bound on $\mathbb{E}[R_t]$. We note that when deriving the bound in Equation (27) we did not use any convexity assumption, so this bound also holds for arbitrary smooth functions:
$$R_t \le \frac{32\gamma_b^2\tau}{n}\sum_{i=1}^n \sum_{j=k(t)}^{t-1} \|\nabla F_i(\bar{x}_j,\xi_j^i)\|^2,$$
and consequently (after taking expectation), combined with Assumption 3:
$$\mathbb{E}\,R_t \le 32\gamma_b^2\tau \sum_{j=k(t)}^{t-1} \big[\mathbb{E}\,\|\nabla f(\bar{x}_j)\|^2 + \sigma^2/\tau + \zeta^2\big] \le 32\gamma_b^2\tau \sum_{j=k(t)}^{t-1} \mathbb{E}\,\|\nabla f(\bar{x}_j)\|^2 + 32\gamma_b^2\tau^2\big[\sigma^2/\tau + \zeta^2\big], \tag{45}$$
where the last estimate used $\gamma_b \le \frac{1}{25\tau L}$.

Decrease. We study the (virtual) average $\bar{x}_t$. By smoothness and the definition of $\bar{x}_{t+1}$ we have:
$$f(\bar{x}_{t+1}) \le f(\bar{x}_t) - \gamma_b\Big\langle \nabla f(\bar{x}_t),\ \frac{1}{n}\sum_{i=1}^n g_t^i\Big\rangle + \frac{\gamma_b^2 L}{2}\Big\|\frac{1}{n}\sum_{i=1}^n g_t^i\Big\|^2. \tag{46}$$
The scalar product can be estimated as follows:
$$-\Big\langle \nabla f(\bar{x}_t),\ \frac{1}{n}\sum_{i=1}^n g_t^i\Big\rangle = -\frac{1}{n}\sum_{i=1}^n \Big(\big\langle \nabla f(\bar{x}_t),\ \nabla F_i(\bar{x}_t,\xi_t^i)\big\rangle + \big\langle \nabla f(\bar{x}_t),\ \nabla F_i(x_t^i,\xi_t^i) - \nabla F_i(\bar{x}_t,\xi_t^i)\big\rangle\Big)$$
$$\le -\frac{1}{n}\sum_{i=1}^n \big\langle \nabla f(\bar{x}_t),\ \nabla F_i(\bar{x}_t,\xi_t^i)\big\rangle + \frac{1}{2}\|\nabla f(\bar{x}_t)\|^2 + \frac{1}{2n}\sum_{i=1}^n \|\nabla F_i(x_t^i,\xi_t^i) - \nabla F_i(\bar{x}_t,\xi_t^i)\|^2,$$
and after taking expectation:
$$-\mathbb{E}\Big\langle \nabla f(\bar{x}_t),\ \frac{1}{n}\sum_{i=1}^n g_t^i\Big\rangle \le -\frac{1}{2}\|\nabla f(\bar{x}_t)\|^2 + \frac{L^2}{2}R_t.$$
We now also use smoothness to estimate the last term in (46):
$$\mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^n g_t^i\Big\|^2 \le 2\,\mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^n \big(\nabla F_i(x_t^i,\xi_t^i) - \nabla f_i(x_t^i)\big)\Big\|^2 + 2\Big\|\frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t^i)\Big\|^2 \le \frac{2\sigma^2}{n} + 4\|\nabla f(\bar{x}_t)\|^2 + 4\Big\|\frac{1}{n}\sum_{i=1}^n \big(\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_t)\big)\Big\|^2 \le \frac{2\sigma^2}{n} + 4\|\nabla f(\bar{x}_t)\|^2 + 4L^2 R_t.$$
By combining these estimates, and using $\gamma_b \le \frac{1}{10L}$, we arrive at:
$$\mathbb{E}\,f(\bar{x}_{t+1}) \le f(\bar{x}_t) - \frac{\gamma_b}{2}\|\nabla f(\bar{x}_t)\|^2 + \frac{\gamma_b L^2}{2}R_t + \frac{\gamma_b^2 L}{2}\Big(\frac{2\sigma^2}{n} + 4\|\nabla f(\bar{x}_t)\|^2 + 4L^2 R_t\Big) \le f(\bar{x}_t) - \frac{\gamma_b}{4}\|\nabla f(\bar{x}_t)\|^2 + 2\gamma_b L^2 R_t + \frac{\gamma_b^2 L\sigma^2}{n}.$$

(a) Training loss and stepsizes for FedSPS (i.i.d.).
(b) Training loss, test accuracy and stepsizes for FedSPS (non-i.i.d.).

Figure 7: Visualization of FedSPS stepsize statistics for the convex case of logistic regression on the MNIST dataset (i.i.d. and non-i.i.d.). The left column shows the training loss, the middle column the test accuracy, and the right column the stepsize plots. "Intra-client" refers to selecting any one client and plotting the mean and standard deviation of stepsizes for that client across $\tau$ local steps, for a particular communication round. "Inter-client" refers to plotting the mean and standard deviation of stepsizes across all clients, for a particular communication round. We extend (a) beyond the best and worst stepsizes and report the performance for all stepsizes in Table 2.

We now re-arrange and plug in the estimate of $R_t$ from (45):
$$\frac{\gamma_b}{4}\,\mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2 \le \mathbb{E}\big[f(\bar{x}_t) - f(\bar{x}_{t+1})\big] + \frac{\gamma_b^2 L\sigma^2}{2n} + 64\gamma_b^3\tau L^2 \sum_{j=k(t)}^{t-1}\mathbb{E}\,\|\nabla f(\bar{x}_j)\|^2 + 64\gamma_b^3\tau^2 L^2\big[\sigma^2/\tau + \zeta^2\big],$$
and after summing from $t = 0$ to $T-1$ and dividing by $T$:
$$\frac{\gamma_b}{4}\cdot\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2 \le \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[f(\bar{x}_t) - f(\bar{x}_{t+1})\big] + \frac{\gamma_b^2 L\sigma^2}{2n} + 64\gamma_b^3\tau^2 L^2\big[\sigma^2/\tau + \zeta^2\big] + 64\gamma_b^3\tau^2 L^2\cdot\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2,$$
implying (after absorbing the gradient-norm sum into the left-hand side, using $\gamma_b \le \frac{1}{25\tau L}$, and adjusting absolute constants) that
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2 = \mathcal{O}\left(\frac{f(\bar{x}_0) - f^\star}{T\gamma_b} + \frac{\gamma_b L\sigma^2}{n} + \gamma_b^2\tau^2 L^2\big(\sigma^2/\tau + \zeta^2\big)\right).$$

F Additional experiments

F.1 Visualization of FedSPS stepsize statistics

Intuitively, it may seem at first glance that using fully locally adaptive stepsizes can lead to poor convergence due to different stepsizes on each client. However, as we have already seen in our theory, as well as verified

(a) τ = 10. (b) τ = 20.
(c) τ = 50. (d) τ = 100.

Figure 8: Additional experiment on the effect of varying $c$ on the convergence of FedSPS for different values of $\tau$ (each panel shows training loss versus rounds for $c$ ranging from 0.01 to 1000). In practice, there is no square-law relation between $c$ and $\tau$, and any small value of $c \le 0.5$ works well.

Table 2: Test accuracy for different values of $\eta_l$ for the convex case of logistic regression on the MNIST dataset (i.i.d.), extending the results of Figure 7 (a) to multiple values of local stepsizes. The best and worst stepsize values are highlighted in bold. We see that there is a substantial gap between the best and the other stepsizes.

η_l      FedAvg   FedAMS
0.0001   76.38    80.75
0.001    78.27    81.52
0.01     79.97    83.05
0.1      91.81    91.23
1        79.32    82.85

in experiments, this is not the case: our fully locally adaptive FedSPS indeed converges. In this remark, we try to shed further light on the convergence of FedSPS by looking at the stepsize statistics across clients. Figure 7 plots the intra-client and inter-client stepsizes for the i.i.d. and non-i.i.d. experiments. "Intra-client" visualizes the variance in stepsizes for a particular client across different local steps; "inter-client" visualizes the variance in stepsizes across all clients. We can see that both of these variances are small, explaining the good convergence behaviour of FedSPS.

F.2 Effect of varying c on convergence

We provide additional experiments on the effect of varying the FedSPS scale parameter $c$ on convergence, for different numbers of local steps $\tau \in \{10, 20, 50, 100\}$ (convex case of logistic regression on the i.i.d. MNIST dataset without client sampling) in Figure 8. As before, we observe that smaller $c$ results in better convergence, and any $c \in \{0.01, 0.1, 0.5\}$ works well.
Moreover, the effect of varying $c$ on convergence is the same for all $\tau$, clarifying that in practice there is no square-law relation between $c$ and $\tau$, unlike what the theory suggests.

F.3 Additional convex experiments

Additional convex experiments for the rest of the LIBSVM datasets (i.i.d.) are shown in Figure 9. The first column shows the training loss and the second column the test accuracy. The third column shows the FedSPS stepsize statistics as described in Section F.1. We can make similar observations as in the main paper: the proposed FedSPS methods (without tuning) converge as well as, or slightly better than, FedAvg and FedAdam with the best tuned hyperparameters. The stepsize plots again highlight the low inter-client and intra-client stepsize variance.

F.4 Hyperparameter tuning for FedAvg and FedAMS

Here we provide more details on the hyperparameters needed by FedAvg and FedAMS for each dataset and model. We perform a grid search over the client learning rate $\eta_l \in \{0.0001, 0.001, 0.01, 0.1, 1.0\}$ and the server learning rate $\eta \in \{0.001, 0.01, 0.1, 1.0\}$. Moreover, for FedAMS we also choose $\beta_1 = 0.9$, $\beta_2 = 0.99$, and the max stabilization factor $\epsilon \in \{10^{-8}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. The grid search leads to the optimal hyperparameters presented in Table 3.

Table 3: Hyperparameters for FedAvg and FedAMS.

Dataset   Architecture          η_l    η    ε
LIBSVM    Logistic Regression   1.0    1    0.01
MNIST     Logistic Regression   0.1    1    0.001
MNIST     LeNet

(a) mushrooms train. (b) mushrooms test. (c) mushrooms stepsize. (d) ijcnn train.
(e) ijcnn test. (f) ijcnn stepsize. (g) phishing train. (h) phishing test. (i) phishing stepsize. (j) a9a train. (k) a9a test. (l) a9a stepsize.

Figure 9: Additional convex experiments for the LIBSVM datasets (i.i.d.). As mentioned in Section D.2.3, FedSPS (from the main paper) is referred to as FedSPS-Local in the legends here, to distinguish it clearly from FedSPS-Global.