Published as a conference paper at ICLR 2025

PROBLEM-PARAMETER-FREE FEDERATED LEARNING

Wenjing Yan1, Kai Zhang2, Xiaolu Wang3, Xuanyu Cao4
1The Chinese University of Hong Kong, 2The Hong Kong University of Science and Technology, 3East China Normal University, 4Washington State University
{wjyan}@ie.cuhk.edu.hk, {kzhangbn}@connect.ust.hk, {xiaoluwang}@sei.ecnu.edu.cn, {xuanyu.cao}@wsu.edu

ABSTRACT

Federated learning (FL) has garnered significant attention from academia and industry in recent years due to its advantages in data privacy, scalability, and communication efficiency. However, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rates. This manual tuning process is challenging in federated settings due to data heterogeneity and the limited accessibility of local datasets. Consequently, the reliance on problem-specific parameters hinders the widespread adoption of FL and potentially compromises its performance in dynamic or diverse environments. To address this issue, we introduce PAdaMFed, a novel algorithm for nonconvex FL that carefully combines adaptive stepsize and momentum techniques. PAdaMFed offers two key advantages: 1) it operates autonomously without relying on problem-specific parameters; and 2) it manages data heterogeneity and partial participation without requiring heterogeneity bounds.
Beyond these benefits, PAdaMFed provides strong theoretical guarantees: 1) it achieves state-of-the-art convergence rates with a sample complexity of $O(\epsilon^{-4})$ and a communication complexity of $O(\epsilon^{-3})$ to obtain an accuracy of $\|\nabla f(\theta)\| \le \epsilon$, even using constant learning rates; 2) these complexities improve to the best-known $O(\epsilon^{-3})$ for sampling and $O(\epsilon^{-2})$ for communication when incorporating variance reduction; and 3) it exhibits linear speedup with respect to the number of local update steps and the number of participating clients at each global round. These attributes make PAdaMFed highly scalable and adaptable for various real-world FL applications. Extensive empirical evidence validates the efficacy of our approach.

1 INTRODUCTION

Federated learning (FL) has emerged as a promising paradigm for machine learning, allowing multiple clients to collaboratively train a model without sharing raw data. Since its introduction by McMahan et al. (2017), FL has garnered substantial attention from both academia and industry. Major conferences such as NeurIPS, ICML, and ICLR have witnessed a proliferation of FL-related research addressing critical challenges, including communication efficiency (Chen et al., 2021; Sattler et al., 2019), privacy preservation (Wei et al., 2020; Mothukuri et al., 2021), heterogeneity management (Li et al., 2020; Karimireddy et al., 2020b; Wang et al., 2020), and partial and asynchronous participation (Wang et al., 2024; Xu et al., 2023). Despite significant advancements, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rates. This tuning process typically requires extensive computational resources and problem-specific knowledge, such as smoothness parameters, heterogeneity bounds, stochastic gradient variances, and initial optimality gaps.
For instance, MIME (Karimireddy et al., 2020a) relies on smoothness constants and data heterogeneity bounds for stepsize determination, FedProx (Li et al., 2020) requires careful adjustment of a proximal term based on data heterogeneity, and FedDyn (Acar et al., 2021) demands tuning of a regularization parameter contingent on problem characteristics. Furthermore, algorithms such as FedADT (Gao et al., 2023) and FedAMS (Chen et al., 2020) necessitate problem-specific parameters to establish minimum communication rounds, which must exceed thresholds associated with smoothness constants and other problem characteristics. This reliance on problem-specific parameters poses several critical challenges. First, it impedes the widespread adoption of FL by complicating deployment and requiring expertise for hyperparameter tuning (Mostafa, 2019; Deng et al., 2020). Second, it potentially compromises the performance of FL in dynamic or diverse environments where data distributions may evolve (Reddi et al., 2020; Koloskova et al., 2020). Last, accurately estimating these parameters is often infeasible due to the distributed nature of data and the inherent privacy constraints of FL (Konečný et al., 2016). Recent research has attempted to address this issue through adaptive stepsize methods. FedOpt (Reddi et al., 2020) incorporates adaptive optimization techniques such as AdaGrad, Adam, and Yogi into FL, demonstrating improved convergence compared with FedAvg (McMahan et al., 2017). FedNova (Wang et al., 2020) introduces a normalization technique that effectively mitigates the objective inconsistency caused by partial client participation and data heterogeneity. FedBN (Li et al., 2021) employs local batch normalization to alleviate feature shift before averaging models, outperforming both classical FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020) on non-IID data.
While these approaches show promise, they still require careful tuning of global learning rates. Momentum is a technique that mitigates data heterogeneity and accelerates gradient descent by maintaining a velocity vector of gradient history. Recent studies have investigated combining adaptive methods with momentum to leverage both advantages. Hsu et al. (2019) proposed FedAvgM, which incorporates a server-side momentum term into the FedAvg algorithm, demonstrating enhanced convergence rates and robustness against data heterogeneity. Wang et al. (2019) developed SlowMo, a momentum-based method that employs two nested loops to facilitate faster convergence in distributed optimization. FedAMS (Chen et al., 2020) utilizes adaptive moment methods on both the server and client sides to address data heterogeneity. MIME (Karimireddy et al., 2020a) combines client and server momentum to enhance convergence. Wu et al. (2023) introduced FAFED, a momentum-based variance reduction scheme integrated with an adaptive matrix, achieving the best-known sample and communication complexity when utilizing diminishing stepsizes. Nevertheless, these methods still necessitate fine-tuning multiple hyperparameters, limiting their practical applicability. A very recent contribution, FedSPS (Sohom Mukherjee, 2024), claims to be the first fully locally adaptive method for FL with minimal hyperparameter tuning. While promising, this approach relies on stringent assumptions of bounded gradients and bounded data heterogeneity. Moreover, it fails to converge to optima with constant stepsizes and requires adjusting a maximum stepsize threshold based on the smoothness parameter, maintaining a degree of hyperparameter dependence. Therefore, there is a critical need for more robust and adaptive FL algorithms capable of operating effectively across diverse scenarios without relying on problem-specific parameters.
1.1 MAIN CONTRIBUTIONS

This paper addresses these critical limitations of FL by proposing a novel approach that eliminates the need for problem-specific hyperparameter tuning while effectively handling arbitrarily heterogeneous data. Our method is based on a careful combination of adaptive stepsize and momentum techniques. The adaptive stepsize mechanism dynamically adjusts local learning rates against client-update heterogeneity, while the momentum component provides stability under partial participation and accelerates convergence in the nonconvex landscape. To the best of our knowledge, this is the first algorithm to achieve such parameter-agnostic adaptation in nonconvex FL with a state-of-the-art convergence guarantee. Our main contributions are summarized as follows.

1) We introduce PAdaMFed, a Problem-parameter-Agnostic algorithm for nonconvex FL based on adaptive stepsizes and client-side Momentum. PAdaMFed offers several significant advantages:

Independence from problem-specific parameters: PAdaMFed operates autonomously without relying on any problem-specific parameters such as smoothness constants or stochastic gradient variances. All stepsizes in our approach are explicitly determined by system-defined constants, including the numbers of participating clients, local updates, and communication rounds.

Robustness to arbitrarily heterogeneous data: PAdaMFed inherently manages data heterogeneity without requiring any heterogeneity bounds among clients, while accommodating partial client participation. This feature enhances its scalability and adaptability in real-world scenarios where client data can be highly diverse and unpredictable, and full participation may not always be feasible due to resource constraints or device availability.
2) We provide a rigorous theoretical analysis of PAdaMFed, demonstrating its state-of-the-art performance: PAdaMFed achieves a sample complexity of $O(\epsilon^{-4})$ and a communication complexity of $O(\epsilon^{-3})$ to obtain an accuracy of $\|\nabla f(\theta)\| \le \epsilon$ for nonconvex FL problems, even using constant learning rates. These complexities are further improved to the best-known sample complexity of $O(\epsilon^{-3})$ and communication complexity of $O(\epsilon^{-2})$ when incorporating variance reduction. PAdaMFed exhibits linear speedup with respect to the numbers of local update steps and participating clients in each global round. Notably, these theoretical results are obtained under minimal assumptions, requiring only L-smoothness of the loss functions and unbiased stochastic gradients with bounded within-client variance. This represents a significant advancement over existing FL algorithms, which typically necessitate constraints such as data heterogeneity bounds (Li et al., 2021; Wu et al., 2023) or diminishing stepsizes (Wu et al., 2023; Sohom Mukherjee, 2024), or fail to achieve the best-known convergence rates (Liang et al., 2019; Alghunaim, 2024).

3) We conduct empirical evaluations to validate our theoretical findings and the efficacy of our algorithms. Our methods are compared against several established baselines, including FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and SCAFFOLD-M (Cheng et al., 2024). Extensive numerical evidence demonstrates the superiority of our approaches in both runtime efficiency and stepsize robustness.

2 PROBLEM SETUP

We consider an FL system where $N$ clients collaboratively train a common learning model $\theta \in \mathbb{R}^d$ under the coordination of a parameter server. Let $\xi_i$ represent a random sample of client $i$ drawn from its local data distribution $\mathcal{D}_i$. The loss function associated with client $i$ is given by $f_i(\theta) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F(\theta; \xi_i)]$, where $F(\theta; \xi_i)$ is the stochastic loss of client $i$ over sample $\xi_i$.
The objective of the FL system is to minimize the global loss function across all clients:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \frac{1}{N}\sum_{i=1}^{N} f_i(\theta), \quad \text{where } f_i(\theta) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F(\theta; \xi_i)].$$

In a federated setting, clients collaboratively train a global model, but the raw data of each client is never shared with the server or other clients. Denote by $\|\cdot\|$ the ℓ2 norm. We make the following standard assumptions.

Assumption 1 (Sample-Wise Smoothness). Each sample-wise loss function $F(\theta; \xi_i)$ is L-smooth, i.e., $\|\nabla F(\theta; \xi_i) - \nabla F(\delta; \xi_i)\| \le L\|\theta - \delta\|$ for all $\theta, \delta \in \mathbb{R}^d$, $i \in \{1, \ldots, N\}$, and $\xi_i \sim \mathcal{D}_i$.

The sample-wise smoothness in Assumption 1 implies the following standard smoothness condition.

Assumption 2 (Standard Smoothness). There exists $L > 0$ such that each loss function $f_i$ is L-smooth, i.e., $\|\nabla f_i(\theta) - \nabla f_i(\delta)\| \le L\|\theta - \delta\|$ for all $\theta, \delta \in \mathbb{R}^d$ and $i \in \{1, \ldots, N\}$.

We emphasize that our original PAdaMFed algorithm requires only the standard smoothness of Assumption 2. The slightly more stringent Assumption 1 is needed only when using variance reduction to further accelerate convergence.

Assumption 3 (Stochastic Gradient). There exists $\sigma \ge 0$ such that for any $\theta \in \mathbb{R}^d$ and $i \in \{1, \ldots, N\}$, $\mathbb{E}_{\xi_i}[\nabla F(\theta; \xi_i)] = \nabla f_i(\theta)$ and $\mathbb{E}_{\xi_i}\|\nabla F(\theta; \xi_i) - \nabla f_i(\theta)\|^2 \le \sigma^2$, where $\xi_i \sim \mathcal{D}_i$.

Assumption 3 ensures that the stochastic gradient $\nabla F(\theta; \xi_i)$ is unbiased and has bounded within-client variance, which is standard in stochastic optimization. We consider nonconvex FL problems with data heterogeneity among clients, where the local data distributions satisfy $\mathcal{D}_i \ne \mathcal{D}_j$ for $i \ne j$. When addressing heterogeneous data, most existing approaches, such as SCAFFOLD (Karimireddy et al., 2020b), FedProx (Li et al., 2020), FedAMS (Chen et al., 2020), and MIME (Karimireddy et al., 2020a), require an upper bound on gradient dissimilarity, i.e., there exist constants $B, \sigma_h^2 > 0$ such that

$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(\theta)\|^2 \le B\|\nabla f(\theta)\|^2 + \sigma_h^2 \quad \text{for all } \theta \in \mathbb{R}^d. \tag{1}$$
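To make the setup concrete, the following sketch instantiates the global objective above on a toy quadratic family; the data model and all names here are our illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy instance of the FL objective: client i holds samples xi with
# F(theta; xi) = 0.5 * ||theta - xi||^2, so f_i(theta) = E[F(theta; xi)]
# and f(theta) = (1/N) * sum_i f_i(theta).
rng = np.random.default_rng(0)
N, d = 5, 3
# Heterogeneous clients: each local distribution has a different mean.
client_data = [rng.normal(loc=i, size=(100, d)) for i in range(N)]

def local_loss(theta, data):
    # Empirical estimate of f_i(theta)
    return 0.5 * np.mean(np.sum((theta - data) ** 2, axis=1))

def global_loss(theta):
    # f(theta) = (1/N) * sum_i f_i(theta); evaluating it directly requires all
    # raw data, which a real FL server never sees -- illustration only.
    return float(np.mean([local_loss(theta, D) for D in client_data]))

assert global_loss(np.zeros(d)) > 0.0
```

In actual FL, the server can only query clients for model updates, never for `client_data`; this is precisely why the problem parameters discussed below are hard to estimate.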
This assumption simplifies the mathematical analysis of those FL approaches and underpins their algorithmic guarantees. However, it may not hold in scenarios where data across clients exhibit significant and unpredictable variations, thus compromising the robustness of FL. Additionally, existing FL algorithms typically rely on problem-specific parameters to determine their stepsizes, including the smoothness constant $L$, the gradient variance $\sigma^2$, and the heterogeneity bounds $B$ and $\sigma_h^2$. The smoothness constant, which characterizes the Lipschitz continuity of gradients, is generally a global property requiring knowledge of the entire dataset. Similarly, quantifying data heterogeneity across clients necessitates a comprehensive understanding of the differences between local data distributions. However, in FL settings where raw data sharing is prohibited and only model updates are exchanged, obtaining precise measurements of these parameters is computationally prohibitive and may compromise FL's privacy guarantees. In the subsequent section, we present an algorithm that is independent of problem-specific parameters and capable of handling arbitrarily heterogeneous data, thereby eliminating the requirement of the heterogeneity bound (1).
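The gradient-dissimilarity bound (1) can be made concrete on a toy quadratic family, where the constants are computable in closed form, but only with access to every client's data. A hedged sketch (our construction, not from the paper):

```python
import numpy as np

# For quadratics f_i(theta) = 0.5 * ||theta - m_i||^2, bound (1) holds with
# B = 1 and sigma_h^2 = (1/N) * sum_i ||m_i - m_bar||^2 (bias-variance
# decomposition). Estimating sigma_h^2, however, requires every client's
# local optimum m_i -- exactly the information FL privacy forbids collecting.
rng = np.random.default_rng(1)
N, d = 4, 2
means = rng.normal(size=(N, d))   # client optima m_i (source of heterogeneity)
m_bar = means.mean(axis=0)

def check_bound(theta):
    grads = theta - means                     # grad f_i(theta) = theta - m_i
    lhs = np.mean(np.sum(grads ** 2, axis=1))
    global_grad = theta - m_bar               # grad f(theta)
    sigma_h2 = np.mean(np.sum((means - m_bar) ** 2, axis=1))
    rhs = 1.0 * np.sum(global_grad ** 2) + sigma_h2   # B = 1
    return lhs <= rhs + 1e-9

assert all(check_bound(rng.normal(size=d)) for _ in range(10))
```

For this family the bound is tight (it holds with equality), which also shows that no smaller $\sigma_h^2$ works without knowing all the $m_i$.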
3 ALGORITHM DEVELOPMENT

Algorithm 1 PAdaMFed: A Problem-Parameter-Agnostic Algorithm for Nonconvex FL

1: Require: initial model $\theta^0$; control variates $c_i^{-1} = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta^0; \xi_i^{-1,k})$ for all $i$ and $c^{-1} = \frac{1}{N}\sum_i c_i^{-1}$; momentum $g^{-1} = c^{-1}$; global learning rate $\gamma$; local learning rate $\eta$; momentum parameter $\beta$
2: for $t = 0, \ldots, T-1$ do
3:   Central server: uniformly sample clients $S_t \subseteq \{1, \ldots, N\}$ with $|S_t| = S$
4:   for each client $i \in S_t$ in parallel do
5:     Initialize local model $\theta_i^{t,0} = \theta^t$
6:     for $k = 0, \ldots, K-1$ do
7:       Compute $g_i^{t,k} = \beta\big(\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1} + c^{t-1}\big) + (1-\beta)\, g^{t-1}$
8:       Update local model $\theta_i^{t,k+1} = \theta_i^{t,k} - \eta\, g_i^{t,k} / \|g_i^{t,k}\|$
9:     end for
10:    Update control variate $c_i^t = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ (set $c_i^t = c_i^{t-1}$ for $i \notin S_t$)
11:    Upload $\theta_i^{t,K}$ and $c_i^t$ to the central server
12:  end for
    Central server:
13:  Aggregate local updates $\bar{g}^t = \frac{1}{\eta SK}\sum_{i \in S_t}\big(\theta^t - \theta_i^{t,K}\big)$
14:  Update global model $\theta^{t+1} = \theta^t - \gamma\, \bar{g}^t$
15:  Aggregate control variate $c^t = c^{t-1} + \frac{1}{N}\sum_{i \in S_t}\big(c_i^t - c_i^{t-1}\big)$
16:  Aggregate momentum $g^t = \beta\big(\frac{1}{S}\sum_{i \in S_t}(c_i^t - c_i^{t-1}) + c^{t-1}\big) + (1-\beta)\, g^{t-1}$
17:  Download $\theta^{t+1}$ and $\beta c^t + (1-\beta) g^t$ to all clients
18: end for

In this section, we propose PAdaMFed, a problem-parameter-agnostic algorithm for nonconvex FL based on adaptive stepsizes and client-side momentum. PAdaMFed is designed to operate independently of any problem-specific parameters, handle arbitrarily heterogeneous data, and accommodate partial client participation.

3.1 ALGORITHM DEVELOPMENT OF PADAMFED

Our approach builds upon the well-established SCAFFOLD algorithm (Karimireddy et al., 2020b), which was designed to address client drift in FL, where local models significantly deviate from the global model due to partial participation and data heterogeneity. The core concept of SCAFFOLD is the utilization of control variates to correct the drift between client updates and the global model.
Specifically, the server maintains a global control variate, denoted by $c^t$, representing the average model-update direction, while each client maintains a local control variate, denoted by $c_i^t$ for all $i \in \{1, \ldots, N\}$, to track its individual update direction. Client updates are then adjusted using the difference between the local and global control variates. SCAFFOLD demonstrates faster and more stable convergence than the seminal FedAvg algorithm (McMahan et al., 2017). In this paper, we extend the original SCAFFOLD framework by incorporating client-side momentum and local adaptive stepsizes, as outlined in Algorithm 1. Specifically, in Step 7 of Algorithm 1, the local descent direction of client $i$ at global round $t$ and local step $k$ is computed as $g_i^{t,k} = \beta\big(\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1} + c^{t-1}\big) + (1-\beta)\, g^{t-1}$. In this expression, the term $\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ is the stochastic gradient at the current local model $\theta_i^{t,k}$ with sample $\xi_i^{t,k}$. The term $c^{t-1} - c_i^{t-1}$ corrects for the difference between the global and local control variates, helping to mitigate client drift. The term $g^{t-1}$ is the current global momentum, which stabilizes and accelerates convergence across clients. In Step 8 of Algorithm 1, the local model of each client $i$ is updated by $\theta_i^{t,k+1} = \theta_i^{t,k} - \eta\, g_i^{t,k} / \|g_i^{t,k}\|$. Here, an adaptive stepsize $\eta / \|g_i^{t,k}\|$ is obtained by normalizing the descent direction $g_i^{t,k}$. This normalization guarantees that the progress of every client has a uniform magnitude, preventing any individual client from having a disproportionate impact on the global model update. It also simplifies quantifying the distance between consecutive models in our theoretical analysis, since $\|\theta_i^{t,k+1} - \theta_i^{t,k}\| = \eta\, \|g_i^{t,k}\| / \|g_i^{t,k}\| = \eta$ for all $i, k, t$.
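Steps 7 and 8 just described can be sketched in a few lines; this is a minimal illustration under our own naming conventions (`c_local`, `c_global`, `g_prev`), not the authors' implementation.

```python
import numpy as np

# One PAdaMFed local step (Steps 7-8 of Algorithm 1), given a stochastic
# gradient oracle `stoch_grad` for the current client.
def padamfed_local_step(theta, stoch_grad, c_local, c_global, g_prev, beta, eta):
    # Step 7: momentum direction with control-variate drift correction
    g = beta * (stoch_grad(theta) - c_local + c_global) + (1.0 - beta) * g_prev
    # Step 8: normalized (adaptive-stepsize) update; every local step moves the
    # model by exactly eta in Euclidean norm, regardless of gradient scale
    return theta - eta * g / np.linalg.norm(g), g

# Usage on a toy quadratic f(theta) = 0.5 * ||theta||^2 (gradient = theta):
theta0 = np.array([3.0, -4.0])
zeros = np.zeros(2)
theta1, g = padamfed_local_step(theta0, lambda th: th, zeros, zeros, zeros,
                                beta=0.9, eta=0.1)
# The step length equals eta, as stated in the text.
assert np.isclose(np.linalg.norm(theta1 - theta0), 0.1)
```

The assertion checks the property emphasized above: the per-step displacement is exactly $\eta$, independent of the gradient magnitude.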
Additionally, since $c_i^t = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ for all $i$, the momentum update in Step 16 of Algorithm 1 can be expressed as:

$$g^t = \beta\left(\frac{1}{S}\sum_{i \in S_t}\left(\frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1}\right) + c^{t-1}\right) + (1-\beta)\, g^{t-1}. \tag{2}$$

This equation accumulates the descent directions across clients and iterations, so that $g^t = \frac{1}{SK}\sum_{i \in S_t}\sum_{k=0}^{K-1} g_i^{t,k}$. With $g^t$, the optimization trajectory at each client is smoothed by the descent directions of the other clients, enhancing the robustness of the optimization process against variability in local updates caused by data heterogeneity. Notably, our PAdaMFed algorithm maintains the communication workload of SCAFFOLD for both uplink and downlink.

3.2 ACCELERATING PADAMFED WITH VARIANCE REDUCTION

Variance reduction is an effective technique for accelerating convergence and enhancing the stability of FL, particularly when dealing with heterogeneous data and limited client participation. In this subsection, we enhance PAdaMFed by integrating a variance-reduction component into each client's descent direction, resulting in our PAdaMFed-VR algorithm. The detailed procedures of PAdaMFed-VR are provided in Appendix B. PAdaMFed-VR differs from PAdaMFed primarily in its computation of the local gradient. Specifically, Step 7 of Algorithm 1 is replaced with:

$$g_i^{t,k} = \nabla F(\theta_i^{t,k}; \xi_i^{t,k}) + \beta\big(c^{t-1} - c_i^{t-1}\big) + (1-\beta)\big(g^{t-1} - \nabla F(\theta^{t-1}; \xi_i^{t,k})\big),$$

where $\nabla F(\theta^{t-1}; \xi_i^{t,k})$ is the variance-reduction component. Our variance-reduction design follows the principle of STORM (Cutkosky & Orabona, 2019) to utilize samples more efficiently. For each local update, the sample $\xi_i^{t,k}$ is used twice: 1) to compute the gradient at the current local model $\theta_i^{t,k}$; and 2) to evaluate the gradient at the previous global model $\theta^{t-1}$. This dual usage of each sample mitigates the influence of within-client gradient noise, enabling more accurate estimation of gradient directions.
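The STORM-style dual use of each sample can be sketched as follows; `grad_at` is a closure over one fixed sample, and all variable names are our illustrative assumptions rather than the paper's code.

```python
import numpy as np

# PAdaMFed-VR local direction: the SAME sample is evaluated at both the
# current local model and the previous global model, so shared sampling
# noise cancels in the correction term (STORM principle).
def padamfed_vr_step(theta_local, theta_prev_global, grad_at, c_local, c_global,
                     g_prev, beta, eta):
    g = (grad_at(theta_local)
         + beta * (c_global - c_local)
         + (1.0 - beta) * (g_prev - grad_at(theta_prev_global)))
    # Same normalized update as in PAdaMFed
    return theta_local - eta * g / np.linalg.norm(g), g

# Usage: one sample xi, sample-wise loss F(theta; xi) = 0.5 * ||theta - xi||^2,
# so grad_at evaluates the stochastic gradient of that single sample.
xi = np.array([1.0, 2.0])
grad_at = lambda th: th - xi
theta = np.array([5.0, 5.0])
theta_new, g = padamfed_vr_step(theta, theta, grad_at, np.zeros(2), np.zeros(2),
                                np.zeros(2), beta=0.9, eta=0.05)
assert np.isclose(np.linalg.norm(theta_new - theta), 0.05)
```

Passing a single closure for both gradient evaluations is what enforces the "same sample used twice" property described above.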
4 THEORETICAL RESULTS AND COMPARISONS WITH PRIOR WORK

4.1 THEORETICAL RESULTS

Theorem 1. Suppose that Assumptions 2 and 3 hold. Let the local and global learning rates of PAdaMFed be $\eta = \frac{1}{K\sqrt{T}}$ and $\gamma = \frac{(SK)^{1/4}}{T^{3/4}}$, respectively, let the momentum parameter be $\beta = \sqrt{\frac{SK}{T}}$, and let $\{\theta^t\}_{t \ge 0}$ be the iterates generated by Algorithm 1. Then, for all $T \ge 1$,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\| \le O\left(\frac{\Delta}{(SKT)^{1/4}} + \frac{(L+\sigma)(SK)^{1/4}}{T^{3/4}}\right),$$

where $\Delta := f(\theta^0) - \min_\theta f(\theta)$.

Theorem 2. Suppose that Assumptions 1 and 3 hold. Let the local and global learning rates of PAdaMFed-VR be $\eta = \frac{1}{KT}$ and $\gamma = \frac{(SK)^{1/3}}{T^{2/3}}$, respectively, let the momentum parameter be $\beta = \frac{(SK)^{1/3}}{T^{2/3}}$, and let $\{\theta^t\}_{t \ge 0}$ be the iterates generated by Algorithm 2. Then, for all $T \ge 1$,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\| \le O\left(\frac{\Delta}{(SKT)^{1/3}} + \frac{(L+\sigma)(SK)^{1/3}}{T^{2/3}}\right).$$

Remark 1. According to Theorem 1, PAdaMFed converges to an ϵ-stationary point¹ in expectation within $O\big(\frac{1}{SK\epsilon^4}\big)$ communication rounds. This rate improves to $O\big(\frac{1}{SK\epsilon^3}\big)$ when incorporating variance reduction, as shown in Theorem 2. Furthermore, both algorithms exhibit linear speedup with respect to the number of participating clients $S$ and the number of local update steps $K$.

Remark 2. In PAdaMFed, setting $SK = O(T^{1/3})$ yields a sample complexity² of $O(\epsilon^{-4})$ with communication complexity $O(\epsilon^{-3})$. Similarly, by setting $SK = O(\sqrt{T})$, PAdaMFed-VR achieves the best-known sample complexity of $O(\epsilon^{-3})$ with a communication complexity of $O(\epsilon^{-2})$ to find an ϵ-stationary point (Wu et al., 2023).

Remark 3. In traditional FL, selecting theoretically optimal stepsizes requires knowledge of problem-specific parameters, which are often unavailable. Consequently, in real-world FL scenarios, stepsizes must be tuned empirically, a process that is labor-intensive, time-consuming, and sometimes impractical. In our PAdaMFed and PAdaMFed-VR, the stepsizes are explicitly determined by system-defined constants (the numbers of participating clients $S$, local update steps $K$, and communication rounds $T$), without requiring any problem-specific parameters.
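Because the stepsizes in Theorems 1 and 2 depend only on the system constants $S$, $K$, and $T$, they can be computed by a trivial helper. The schedules below follow our reading of the theorem statements as reconstructed from the extraction, so the exact exponents should be checked against the original paper.

```python
# Parameter-free stepsize schedules of PAdaMFed / PAdaMFed-VR
# (exponents as reconstructed from Theorems 1-2; no problem parameters needed).
def padamfed_schedule(S, K, T, variance_reduced=False):
    if variance_reduced:                         # Theorem 2 (PAdaMFed-VR)
        eta = 1.0 / (K * T)
        gamma = (S * K) ** (1 / 3) / T ** (2 / 3)
        beta = (S * K) ** (1 / 3) / T ** (2 / 3)
    else:                                        # Theorem 1 (PAdaMFed)
        eta = 1.0 / (K * T ** 0.5)
        gamma = (S * K) ** (1 / 4) / T ** (3 / 4)
        beta = ((S * K) / T) ** 0.5
    return eta, gamma, beta

# Example: S = 10 participating clients, K = 5 local steps, T = 400 rounds.
eta, gamma, beta = padamfed_schedule(S=10, K=5, T=400)
assert 0 < beta <= 1 and eta > 0 and gamma > 0
```

Note the contrast with the baselines of Section 4.3, whose stepsizes require $L$, $\Delta$, or $\sigma^2$ as inputs.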
This hyperparameter-independent nature simplifies implementation, enhances robustness, and facilitates the deployment of our algorithms across diverse FL applications.

4.2 PROOF SKETCH

Our theoretical proof starts from the L-smoothness of the loss function $f(\theta)$, which yields the following inequality:

$$f(\theta^{t+1}) \le f(\theta^t) + 2\gamma\|\nabla f(\theta^t) - g^t\| - \gamma\|\nabla f(\theta^t)\| + \frac{\gamma}{SK}\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\| + \frac{\gamma^2 L}{2}.$$

¹A point θ is said to be ϵ-stationary if $\|\nabla f(\theta)\| \le \epsilon$. Note that for any ϵ-stationarity guarantee stated in terms of $\|\nabla f(\theta)\|^2$, one can derive the corresponding guarantee for $\|\nabla f(\theta)\|$ from the relationship $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\| \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\mathbb{E}\|\nabla f(\theta^t)\|^2} \le \sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\|^2}$, where the first and second inequalities use Jensen's inequality, since the square-root function is concave. Therefore, our results can be aligned with the metric $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\|^2$ by taking the square root on both sides of those convergence bounds.

²The total number of stochastic gradient samples across clients, local updates, and communication rounds.

To establish an upper bound for $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\|$, we must quantify two key terms: $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t) - g^t\|$ and $\frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{SK}\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$. The momentum $g^t$ is a recursive variable that accumulates values from previous rounds. By plugging the expression of $g^t$ (provided in (2)) into $\nabla f(\theta^t) - g^t$ and introducing the auxiliary term $\nabla f(\theta^{t-1})$, we can recursively express $\|\nabla f(\theta^t) - g^t\|$ through its predecessor $\|\nabla f(\theta^{t-1}) - g^{t-1}\|$, scaled by a contraction coefficient $(1-\beta)$. This substitution also introduces additional terms associated with stochastic gradients and control variates. Through meticulous control of all intermediate terms, we prove that $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t) - g^t\|$ is upper bounded by $O((SKT)^{-1/4})$ for PAdaMFed and by $O((SKT)^{-1/3})$ for PAdaMFed-VR. The term $\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$ represents the gradient dissimilarity across clients. While the heterogeneity bound (1) could readily control this term, our objective is to eliminate dependence on such bounds.
Instead, we relax this term to $\mathbb{E}\|\nabla f_i(\theta^{t-1}) - c_i^{t-1}\|$, along with other controllable terms, by substituting the expressions of $g_i^{t,k}$ and $g^t$. Returning to our treatment of $\mathbb{E}\|\nabla f(\theta^t) - g^t\|$, we exploit the recursive property of the control variate $c_i^{t-1}$ to bound $\mathbb{E}\|\nabla f_i(\theta^{t-1}) - c_i^{t-1}\|$. This, in turn, allows us to establish a bound for $\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$. Combining the above steps leads to our analytical results. For comprehensive proofs, please refer to Appendix A for Theorem 1 and Appendix B for Theorem 2.

Intuition on the Algorithmic Features: The efficacy of PAdaMFed stems from the synergistic integration of three indispensable components: local gradient normalization, client-side momentum, and control variates. 1) Gradient normalization serves as an adaptive learning-rate scheme, automatically adjusting stepsizes based on the local optimization landscape: it allows larger steps in regions with small gradients (where more aggressive exploration is beneficial) and smaller steps in steep regions (where careful progress is needed). 2) Client-side momentum accelerates convergence while maintaining stability. First, it helps overcome local irregularities in the loss landscape by accumulating gradients over clients and iterations. Second, it accelerates progress along directions of consistent gradient agreement. 3) Control variates align local updates with the global objective, reducing the variance of gradient estimates and ensuring more consistent updates across heterogeneous client data. Collectively, these techniques ensure that the stepsizes are independent of problem-specific parameters such as smoothness constants and data heterogeneity bounds, simplifying the tuning process and enhancing robustness across diverse FL environments. Notably, our algorithms also eliminate the requirement of data heterogeneity bounds, further broadening their applicability.
4.3 COMPARISONS WITH PRIOR WORK

We compare PAdaMFed and PAdaMFed-VR with several representative algorithms for solving FL problems with heterogeneous data, as listed in Table 1.

Comparisons of PAdaMFed with Prior FL Algorithms: The SCAFFOLD algorithm (Karimireddy et al., 2020b) requires the smoothness parameter $L$ for stepsize tuning. Moreover, its communication complexity of $O\big((\frac{N}{S})^{2/3}\frac{L}{K\epsilon^4}\big)$ is suboptimal. MIME (Karimireddy et al., 2020a) improves this complexity to $O\big(\frac{1}{SK\epsilon^4}\big)$ by incorporating server-level momentum. Nevertheless, MIME requires large-batch local gradients per round, and its learning rates depend on multiple problem parameters, including the initial optimality gap $\Delta$ and the heterogeneity bound $\sigma_h^2$, which are challenging to estimate. FedSPS (Sohom Mukherjee, 2024) incorporates stochastic Polyak stepsizes into local client updates, achieving a communication complexity of $O\big(\frac{1}{NK\epsilon^4}\big)$ in full-participation scenarios. However, its analysis relies on the assumption of bounded data heterogeneity. Moreover, its stepsize tuning requires knowledge of the lower bounds of all loss functions, i.e., $\ell_i^*$ for all $i$, in addition to the smoothness parameter $L$, leading to significant problem-parameter dependence. SCAFFOLD-M (Cheng et al., 2024) employs client-side momentum similar to PAdaMFed, removing the need for bounded data heterogeneity. However, SCAFFOLD-M's stepsizes depend on several problem-specific parameters, including the smoothness parameter $L$, the initial optimality gap $\Delta$, and the stochastic gradient variance $\sigma^2$, resulting in laborious stepsize tuning. In contrast, PAdaMFed is completely independent of problem-specific parameters while achieving state-of-the-art communication complexity.

Comparisons of PAdaMFed-VR with Prior Variance-Reduced FL Algorithms: FAFED (Wu et al., 2023) employs momentum-based variance reduction with an adaptive matrix, achieving the best-known $O(\epsilon^{-3})$ sample complexity and $O(\epsilon^{-2})$ communication complexity through the use of diminishing stepsizes. However, FAFED requires stringent assumptions of bounded gradients and bounded data heterogeneity. Moreover, its learning rates are subject to several complex constraints and rely on problem-parameter-based tuning. SCAFFOLD-M-VR (Cheng et al., 2024) is a variance-reduced variant of SCAFFOLD-M and, similarly, requires careful stepsize tuning based on multiple problem-dependent parameters. Nevertheless, it fails to achieve the best-known complexity established in the literature.

Table 1: Comparisons of algorithms for solving FL problems with heterogeneous data. (Shorthand: Add. Assump. = additional assumptions aside from Assumptions 1–3; BDH = bounded data heterogeneity defined in (1); BG = bounded gradient, i.e., $\|\nabla f_i(\theta)\| \le G$ for all $i, \theta$; BHD = bounded Hessian dissimilarity, i.e., $\|\nabla^2 f_i(\theta) - \nabla^2 f(\theta)\| \le \delta$ for all $i, \theta$. Entries marked $\cdots$ are constants of the cited works that did not survive extraction.)

| Algorithm | Add. Assump. | Stepsize Restrictions | Stepsize-Related Problem-Parameters | Communication |
|---|---|---|---|---|
| SCAFFOLD (Karimireddy et al., 2020b) | — | $\gamma = \sqrt{S}$, $\eta \le \frac{1}{24\gamma KL}$ | $L$ | $O\big((\frac{N}{S})^{2/3}\frac{1}{K\epsilon^4}\big)$ |
| MIME (Karimireddy et al., 2020a) | BDH, BHD | $\eta = \sqrt{\frac{S\Delta}{L\tilde{G}TK^2}}$, $\tilde{G} = \sigma_h^2 + \frac{\sigma^2}{K}$ | $L$, $\Delta$, $\sigma^2$, $\sigma_h^2$ | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| FedSPS (Sohom Mukherjee, 2024) | BDH | $\eta_i^{t,k} = \min\big\{\frac{F(\theta_i^{t,k};\xi_i^{t,k}) - \ell_i^*}{c\|\nabla F(\theta_i^{t,k};\xi_i^{t,k})\|^2}, \eta_b\big\}$, $\eta_b \le \min\big\{\frac{1}{2cL}, \frac{1}{25LK}\big\}$¹ | $L$, $\ell_i^*\ \forall i$ | $O\big(\frac{1}{NK\epsilon^4}\big)$ |
| SCAFFOLD-M (Cheng et al., 2024) | — | $\beta = \min\{1, \cdots\}$, $\eta KL \le \min\{1, \cdots\}$ | $L$, $\Delta$, $\sigma^2$, $G_0$² | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| PAdaMFed (this paper) | — | $\beta = \sqrt{\frac{SK}{T}}$, $\gamma = \frac{(SK)^{1/4}}{T^{3/4}}$, $\eta = \frac{1}{K\sqrt{T}}$ | None | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| FAFED (Wu et al., 2023) | BDH, BG | $\eta_t \le \frac{1}{N^{2/3}Lt^{1/3}}$, $\beta_t \propto \eta_t^2 L^2$ | $L$ | $O\big(\frac{1}{SK\epsilon^3}\big)$ |
| SCAFFOLD-M-VR (Cheng et al., 2024) | — | $\beta = \min\{\frac{S}{N}, \cdots\}$, $\gamma L = \min\{1, \cdots\}$ | $L$, $\Delta$, $\sigma^2$ | $O\big(\frac{1}{S\epsilon^3}\big)$ |
| PAdaMFed-VR (this paper) | — | $\beta = \frac{(SK)^{1/3}}{T^{2/3}}$, $\gamma = \frac{(SK)^{1/3}}{T^{2/3}}$, $\eta = \frac{1}{KT}$ | None | $O\big(\frac{1}{SK\epsilon^3}\big)$ |

¹ $\ell_i^* \le \inf_{\xi_i \sim \mathcal{D}_i,\, \theta} F(\theta; \xi_i)$ for any $i$, and $c$ is a constant balancing adaptivity and accuracy.
² $G_0 := \frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(\theta^0)\|^2$.
5 NUMERICAL EXPERIMENTS

In this section, we present experiments on two real-world datasets: EMNIST (Cohen et al., 2017) and CIFAR-10 (Li et al., 2017). We evaluate the proposed algorithms against several baselines, including FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and SCAFFOLD-M (Cheng et al., 2024). Additionally, experiments are conducted under both i.i.d. and non-i.i.d. data distributions. Due to space limitations, the detailed experimental setup and additional simulation results are provided in Appendix C.

[Figure 1: Test accuracy versus the number of communication rounds on the EMNIST dataset ((a) i.i.d.; (b) non-i.i.d.).]

[Figure 2: Test accuracy versus learning rate on the EMNIST dataset ((a) i.i.d.; (b) non-i.i.d.).]

Figure 1 illustrates the test accuracy of the algorithms versus the number of communication rounds on the EMNIST dataset, with subfigure 1a showing i.i.d. data and subfigure 1b showing non-i.i.d. data. The stepsizes of our algorithms, PAdaMFed and PAdaMFed-VR, are set according to the theoretical guidance of Theorem 1 and Theorem 2, respectively. For fair comparison, the hyperparameters of the other algorithms in Figure 1 are optimized through grid search to achieve their best performance. The results demonstrate that our proposed methods significantly outperform all baseline algorithms (FedAvg, SCAFFOLD, and SCAFFOLD-M) in both convergence speed and test accuracy. Notably, although SCAFFOLD-M employs a similar momentum technique, it converges more slowly than PAdaMFed and achieves lower accuracy, validating the efficacy of our adaptive stepsize design. Building upon these advantages, the incorporation of variance reduction further enhances our methods' superiority through more efficient sample utilization. Moreover, the results on non-i.i.d. data in subfigure 1b show even greater performance margins than the i.i.d. case, highlighting the advantages of our algorithms.
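The paper's exact data-partitioning protocol appears in its Appendix C; as a hedged, generic illustration of how non-i.i.d. client splits like those above are commonly generated, the following sketch uses Dirichlet label partitioning with hypothetical parameters of our own choosing.

```python
import numpy as np

# Dirichlet label partitioning: each class's samples are divided among clients
# according to class proportions drawn from Dirichlet(alpha); smaller alpha
# yields more skewed (more heterogeneous) local label distributions.
def dirichlet_partition(labels, n_clients, alpha, rng):
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cl, part in enumerate(np.split(idx, cuts)):
            client_idx[cl].extend(part.tolist())
    return client_idx

# Usage: 1000 synthetic labels over 10 classes, split across 5 clients.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.3, rng=rng)
assert sum(len(p) for p in parts) == 1000
```

With `alpha=0.3` the splits are strongly skewed, which is the regime where heterogeneity-robust methods are expected to show the largest margins.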
Figure 2 compares the test accuracy of the algorithms versus the learning rate on the EMNIST dataset. All algorithms were evaluated over 400 communication rounds to ensure a fair comparison. We observe that our algorithm demonstrates superior robustness to stepsize selection, maintaining stable performance across a significantly wider range of learning rates than the baseline methods. Specifically, it achieves test accuracy exceeding 0.8 across the stepsize range $[3 \times 10^{-3}, 10^{-1}]$ for i.i.d. data distributions (subfigure 2a) and above 0.7 for the same stepsize range under non-i.i.d. conditions (subfigure 2b). In contrast, the baseline algorithms exhibit substantially narrower regions of stable performance, empirically validating our method's enhanced stepsize robustness.

6 CONCLUSIONS

This paper proposed a novel training approach, PAdaMFed, for nonconvex federated learning (FL) that eliminates dependence on problem-specific parameters and generalizes automatically across diverse FL environments. PAdaMFed also removes the need for data heterogeneity bounds and accommodates partial client participation, further broadening its applicability. We provided a rigorous theoretical analysis of PAdaMFed, demonstrating its state-of-the-art convergence under minimal assumptions, namely the L-smoothness of loss functions and unbiased stochastic gradients with bounded within-client variance. Furthermore, we enhanced this convergence rate by incorporating variance reduction, achieving the best-known $O(\epsilon^{-3})$ sample complexity and $O(\epsilon^{-2})$ communication complexity. Extensive numerical experiments demonstrated that our algorithms outperform representative existing FL approaches in both runtime efficiency and stepsize robustness.

REFERENCES

Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization.
arXiv preprint arXiv:2111.04263, 2021.
Sulaiman A Alghunaim. Local exact-diffusion for decentralized optimization and learning. IEEE Transactions on Automatic Control, 2024.
Mingzhe Chen, Nir Shlezinger, H Vincent Poor, Yonina C Eldar, and Shuguang Cui. Communication-efficient federated learning. Proceedings of the National Academy of Sciences, 118(17):e2024789118, 2021.
Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, pp. 119–128, 2020.
Ziheng Cheng, Xinmeng Huang, and Kun Yuan. Momentum benefits non-IID federated learning simply and provably. In The Twelfth International Conference on Learning Representations, 2024.
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.
Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
Huimin Gao, Qingtao Wu, Xuhui Zhao, Junlong Zhu, and Mingchuan Zhang. FedADT: An adaptive method based on derivative term for federated learning. Sensors, 23(13):6034, 2023.
Geoffrey Hinton. Neural networks for machine learning, lecture 6a, 2012. URL http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Coursera.
Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. PMLR, 2020.
Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. CIFAR10-DVS: An event-stream dataset for object classification. Frontiers in Neuroscience, 11:309, 2017.
Jiaxiang Li, Xuxing Chen, Shiqian Ma, and Mingyi Hong. Problem-parameter-free decentralized nonconvex stochastic optimization. arXiv preprint arXiv:2402.08821, 2024.
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-IID features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021.
Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng.
Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
Hesham Mostafa. Robust federated learning through representation matching and adaptive hyperparameters. arXiv preprint arXiv:1912.13075, 2019.
Viraaji Mothukuri, Reza M Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Generation Computer Systems, 115:619–640, 2021.
Sohom Mukherjee, Nicolas Loizou, and Sebastian U. Stich. Locally adaptive federated learning. arXiv preprint arXiv:2307.06306, 2024.
Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.
Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019.
Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
Xiaolu Wang, Yuchang Sun, Hoi-To Wai, and Jun Zhang. Dual-delayed asynchronous SGD for arbitrarily heterogeneous data. arXiv preprint arXiv:2405.16966, 2024.
Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor.
Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
Xidong Wu, Feihu Huang, Zhengmian Hu, and Heng Huang. Faster adaptive federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 10379–10387, 2023.
Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. Computer Science Review, 50:100595, 2023.
Jing Xu, Sen Wang, Liwei Wang, and Andrew Chi-Chih Yao. FedCM: Federated learning with client-level momentum. arXiv preprint arXiv:2106.10874, 2021.

A RELATED WORKS

Adaptive Stepsize Methods: Adaptive stepsize methods have gained significant attention in the optimization literature due to their ability to automatically adjust learning rates based on the geometry of the objective function. These methods, such as Adam (Kingma & Ba, 2014), AdaGrad (Duchi et al., 2011), and RMSProp (Hinton, 2012), have demonstrated remarkable success in various machine learning tasks, particularly in handling sparse gradients and non-stationary objectives. In the context of FL, adaptive stepsizes offer several advantages. First, they eliminate the need for manual tuning of learning rates, which is especially beneficial in federated settings where global knowledge of the objective function's properties is limited. Second, adaptive methods can potentially mitigate the impact of data heterogeneity by allowing different update rates for different model parameters, effectively accounting for varying gradient statistics across clients. FedOpt (Reddi et al., 2020) introduced adaptive optimization techniques such as AdaGrad, Adam, and Yogi into FL, demonstrating improved convergence properties compared to the traditional FedAvg algorithm. Wang et al.
(2020) proposed FedNova, which normalizes and scales local updates to mitigate objective inconsistency caused by partial client participation and data heterogeneity. FedBN (Li et al., 2021) employed local batch normalization to alleviate feature shift before averaging models, outperforming both classical FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020) on non-IID data. While these approaches demonstrated the advantages of adaptive methods for easing parameter tuning, none provides theoretical guarantees, and careful tuning of global learning rates remains essential.

Momentum: Momentum, on the other hand, is a technique that accelerates gradient descent by accumulating a velocity vector along directions of persistent reduction in the objective across iterations. In nonconvex optimization, momentum has been shown to help escape saddle points more efficiently and potentially reach better local optima (Cutkosky & Orabona, 2019). The incorporation of momentum in FL algorithms offers several benefits. It can smooth out the impact of heterogeneous updates from different clients, potentially leading to more stable and faster convergence. Moreover, momentum can help overcome the challenges posed by partial client participation by maintaining a consistent optimization trajectory even when client participation varies across rounds. Hsu et al. (2019) proposed FedAvgM, which adds a server-side momentum term to the FedAvg algorithm, demonstrating improved convergence rates and robustness to data heterogeneity. Wang et al. (2019) developed SlowMo, a momentum-based method using two nested loops to achieve faster convergence in distributed optimization. Both FedCM (Xu et al., 2021) and Cheng et al. (2024) investigated the integration of client-side momentum into FedAvg to effectively tackle client heterogeneity and partial participation in FL.
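The server-side momentum idea behind FedAvgM can be sketched in a few lines: the server treats the averaged client delta as a pseudo-gradient and passes it through a momentum buffer before updating the global model. The following toy example is a minimal sketch of this mechanism, not the exact algorithm of Hsu et al. (2019); the least-squares clients, data shapes, and stepsizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four clients share a feature matrix but hold heterogeneous targets b_i,
# so their local least-squares objectives disagree on the minimizer.
A = rng.normal(size=(20, 3))
clients = [A @ rng.normal(size=3) + 0.01 * rng.normal(size=20) for _ in range(4)]

def local_sgd(theta, b, lr=0.05, steps=5):
    """A few local gradient steps on one client's objective; returns the delta."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - b) / len(b)
    return w - theta  # pseudo-gradient sent to the server

theta = np.zeros(3)
velocity = np.zeros(3)
beta = 0.9  # server-side momentum coefficient
for _ in range(200):
    delta = np.mean([local_sgd(theta, b) for b in clients], axis=0)
    velocity = beta * velocity + delta  # accumulate a velocity vector
    theta = theta + velocity            # momentum-smoothed global update
```

Because the momentum buffer averages pseudo-gradients across rounds, round-to-round fluctuations caused by the heterogeneous client targets are smoothed before they reach the global model.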
Combination of Adaptive Stepsizes and Momentum: Recent works have explored combining adaptive methods and momentum in FL. FedAMS (Chen et al., 2020) implements a local AMSGrad scheme for FL, demonstrating fast convergence with low communication cost. Mime (Karimireddy et al., 2020a) combines control variates with server-level momentum at every local update to mimic centralized methods running on IID data, outperforming centralized methods but requiring a large-batch local gradient per round for each client. Wu et al. (2023) introduced FAFED, a momentum-based variance reduction scheme integrated with an adaptive matrix, attaining the best-known sample and communication complexity when using diminishing stepsizes. While these methods demonstrate improved performance, they still require careful tuning of global learning rates. A recent contribution, FedSPS (Mukherjee et al., 2024), claims to be the first fully locally adaptive method for FL with minimal hyperparameter tuning. While promising, this approach relies on stringent assumptions of bounded data heterogeneity and gradients. Moreover, it fails to converge to optima with constant stepsizes and requires adjusting a maximum stepsize threshold based on the smoothness parameter, thus retaining some hyperparameter dependence. The limitations of existing approaches underscore the critical need for more robust, adaptive FL algorithms capable of operating effectively across diverse scenarios without extensive parameter tuning. This paper makes a significant advancement by proposing a novel algorithm, PAdaMFed, that completely eliminates dependency on problem-specific parameters. All stepsizes in our approach are explicitly determined by the number of participating clients, local updates, and communication rounds. To the best of our knowledge, this is the first algorithm that achieves such problem-parameter independence in FL.
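As a minimal illustration of the combination surveyed above, the snippet below implements an Adam-style update, the canonical coupling of a momentum buffer with a per-coordinate adaptive stepsize. The badly scaled quadratic and all constants are illustrative assumptions, unrelated to any of the cited FL methods.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style step: momentum (m) combined with adaptive scaling (v)."""
    m = b1 * m + (1 - b1) * grad       # first moment: momentum buffer
    v = b2 * v + (1 - b2) * grad ** 2  # second moment: adaptive stepsize
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Badly scaled quadratic f(x) = 0.5 * (x1^2 + 100 * x2^2): adaptive scaling
# equalizes per-coordinate progress despite the 100x curvature gap.
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = np.array([1.0, 100.0]) * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because the gradient and its second-moment estimate scale together, both coordinates make comparable progress without per-problem tuning, which is precisely the behavior that motivates adaptive stepsizes in FL.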
Moreover, our algorithm inherently manages data heterogeneity and partial client participation without requiring any heterogeneity bound among clients, which is also nontrivial. Data heterogeneity has been extensively studied in FL. However, existing algorithms either depend on bounded data heterogeneity (e.g., FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), FedProx (Li et al., 2020), and Mime (Karimireddy et al., 2020a)) or fall short of achieving state-of-the-art convergence rates (e.g., VRL-SGD (Liang et al., 2019) and LED (Alghunaim, 2024)). Recently, Cheng et al. (2024) demonstrated that momentum can eliminate the data heterogeneity constraint in the FedAvg and SCAFFOLD algorithms while achieving state-of-the-art convergence results. Wang et al. (2024) introduced DuDe-ASGD for asynchronous FL, which can effectively handle arbitrarily heterogeneous data by leveraging stale stochastic gradients. However, their algorithms require carefully designed stepsizes based on hyperparameters. In contrast, our algorithm accommodates arbitrary data heterogeneity while achieving complete problem-parameter independence. A notable concurrent work by Li et al. (2024) also explores problem-parameter-free algorithms in the context of decentralized nonconvex optimization. While both studies target problem-parameter-free optimization, our work addresses the unique challenges inherent to federated learning settings. The fundamental distinction lies in our treatment of client drift, a critical phenomenon arising from multiple local updates and data heterogeneity in federated environments.
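To make the control-variate mechanism concrete, here is a deliberately tiny sketch in the spirit of SCAFFOLD (not the authors' algorithm): each client corrects its local gradient with a control variate estimating the gap between its own gradient and the global one, so that long local epochs on highly heterogeneous objectives still drive the global model to the global optimum. The two quadratic objectives and all constants are illustrative assumptions.

```python
import numpy as np

# Two clients with extremely heterogeneous quadratics f_i(w) = 0.5 * (w - m_i)^2;
# the global objective is minimized at w = 0, far from either local minimizer.
means = np.array([-10.0, 10.0])
lr, K, rounds = 0.1, 20, 50

w = 5.0
c = np.zeros(2)   # client control variates, c_i ~ grad f_i - grad f
c_bar = c.mean()  # server control variate
for _ in range(rounds):
    new_w, new_c = [], []
    for i in range(2):
        wi = w
        for _ in range(K):
            gi = wi - means[i]              # local gradient grad f_i(wi)
            wi -= lr * (gi - c[i] + c_bar)  # drift-corrected local step
        new_c.append(c[i] - c_bar + (w - wi) / (K * lr))  # control-variate update
        new_w.append(wi)
    w = float(np.mean(new_w))
    c = np.array(new_c)
    c_bar = c.mean()
```

Despite twenty local steps per round on objectives whose minimizers sit at -10 and +10, the corrected iterates converge to the global optimum, with each control variate learning its client's gradient gap at the optimum (here +10 and -10).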
Our key technical contributions beyond Li et al. (2024) are twofold: 1) the development of a novel framework integrating control variates with momentum to mitigate client drift, requiring sophisticated theoretical analysis due to their complex interactions; 2) the successful elimination of explicit heterogeneity bounds, which were previously considered essential in the federated learning literature.

B THEORETICAL ANALYSIS OF PADAMFED

Our analysis is based on the following useful lemmas.

Lemma 1. For any $t$, we have
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\|^2 \le \frac{1}{3}\eta^2K^2 \quad \text{and} \quad \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\| \le \frac{1}{2}\eta K.$$

Proof. From the update rule of the local model, for any $i$, $k$, and $t$, we have $\theta_i^{t,k+1}-\theta_i^{t,k} = -\eta\, g_i^{t,k}/\big\|g_i^{t,k}\big\|$, so that $\big\|\theta_i^{t,k+1}-\theta_i^{t,k}\big\| = \eta$. Hence,
$$\big\|\theta_i^{t,k}-\theta^t\big\|^2 = \Big\|\sum_{j=0}^{k-1}\big(\theta_i^{t,j+1}-\theta_i^{t,j}\big)\Big\|^2 \le \Big(\sum_{j=0}^{k-1}\big\|\theta_i^{t,j+1}-\theta_i^{t,j}\big\|\Big)^2 \le \eta^2 k^2.$$
Summing the above inequality over $i$ and $k$ yields
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\|^2 \le \frac{\eta^2}{6K}(K-1)K(2K-1) \le \frac{1}{3}\eta^2K^2.$$
Similarly, we have
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\| \le \frac{\eta}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1} k = \frac{\eta(K-1)}{2} \le \frac{1}{2}\eta K.$$
These inequalities in Lemma 1 are frequently used in our analysis.

Lemma 2. Given vectors $\omega_1,\ldots,\omega_N\in\mathbb{R}^d$ and $\bar{\omega}=\frac{1}{N}\sum_{i=1}^N \omega_i$, if we sample $\mathcal{S}\subseteq\{1,\ldots,N\}$ uniformly at random such that $|\mathcal{S}|=S$, then it holds that
$$\mathbb{E}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}}\omega_i-\bar{\omega}\Big\|^2 = \frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2.$$

Proof. Letting $\mathbb{1}\{i\in\mathcal{S}\}$ be the indicator of the event $i\in\mathcal{S}$, we prove this lemma by direct calculation as follows:
$$\mathbb{E}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}}\omega_i-\bar{\omega}\Big\|^2 = \frac{1}{S^2}\,\mathbb{E}\Big[\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2\,\mathbb{1}\{i\in\mathcal{S}\} + 2\sum_{i<j}\langle\omega_i-\bar{\omega},\,\omega_j-\bar{\omega}\rangle\,\mathbb{1}\{i\in\mathcal{S}\}\,\mathbb{1}\{j\in\mathcal{S}\}\Big].$$
Using $\mathbb{E}[\mathbb{1}\{i\in\mathcal{S}\}]=\frac{S}{N}$, $\mathbb{E}[\mathbb{1}\{i\in\mathcal{S}\}\mathbb{1}\{j\in\mathcal{S}\}]=\frac{S(S-1)}{N(N-1)}$ for $i\neq j$, and $2\sum_{i<j}\langle\omega_i-\bar{\omega},\omega_j-\bar{\omega}\rangle = \big\|\sum_{i=1}^{N}(\omega_i-\bar{\omega})\big\|^2-\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2 = -\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2$, the right-hand side equals
$$\frac{1}{S^2}\Big(\frac{S}{N}-\frac{S(S-1)}{N(N-1)}\Big)\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2 = \frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2.$$
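The without-replacement sampling variance of partial client participation, stated as $\frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2$ in Lemma 2 above, can be checked exactly by enumerating all size-$S$ subsets for a small $N$. The instance below uses random vectors with $N=6$, $S=3$, chosen purely for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, S, d = 6, 3, 4
omega = rng.normal(size=(N, d))
omega_bar = omega.mean(axis=0)

# Exact expectation over all C(N, S) equally likely subsets.
subsets = list(itertools.combinations(range(N), S))
exact = np.mean([np.sum((omega[list(s)].mean(axis=0) - omega_bar) ** 2)
                 for s in subsets])

# Closed form: (N-S)/(S(N-1)) * (1/N) * sum_i ||omega_i - omega_bar||^2.
closed = (N - S) / (S * (N - 1)) * np.sum((omega - omega_bar) ** 2) / N
```

The two quantities agree to machine precision; note that the factor $N-S$ vanishes when $S=N$ (full participation), correctly recovering zero sampling variance.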