Published as a conference paper at ICLR 2025

PROBLEM-PARAMETER-FREE FEDERATED LEARNING

Wenjing Yan1, Kai Zhang2, Xiaolu Wang3, Xuanyu Cao4
1The Chinese University of Hong Kong, 2The Hong Kong University of Science and Technology, 3East China Normal University, 4Washington State University
{wjyan}@ie.cuhk.edu.hk, {kzhangbn}@connect.ust.hk, {xiaoluwang}@sei.ecnu.edu.cn, {xuanyu.cao}@wsu.edu

ABSTRACT

Federated learning (FL) has garnered significant attention from academia and industry in recent years due to its advantages in data privacy, scalability, and communication efficiency. However, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rates. This manual tuning process is challenging in federated settings due to data heterogeneity and the limited accessibility of local datasets. Consequently, the reliance on problem-specific parameters hinders the widespread adoption of FL and potentially compromises its performance in dynamic or diverse environments. To address this issue, we introduce PAdaMFed, a novel algorithm for nonconvex FL that carefully combines adaptive stepsize and momentum techniques. PAdaMFed offers two key advantages: 1) it operates autonomously without relying on problem-specific parameters; and 2) it manages data heterogeneity and partial participation without requiring heterogeneity bounds.
Beyond these benefits, PAdaMFed provides strong theoretical guarantees: 1) it achieves state-of-the-art convergence rates with a sample complexity of $O(\epsilon^{-4})$ and a communication complexity of $O(\epsilon^{-3})$ to obtain an accuracy of $\|\nabla f(\theta)\| \le \epsilon$, even using constant learning rates; 2) these complexities improve to the best-known $O(\epsilon^{-3})$ for sampling and $O(\epsilon^{-2})$ for communication when incorporating variance reduction; and 3) it exhibits linear speedup with respect to the number of local update steps and the number of participating clients at each global round. These attributes make PAdaMFed highly scalable and adaptable for various real-world FL applications. Extensive empirical evidence validates the efficacy of our approach.

1 INTRODUCTION

Federated learning (FL) has emerged as a promising paradigm for machine learning, allowing multiple clients to collaboratively train a model without sharing raw data. Since its introduction by McMahan et al. (2017), FL has garnered substantial attention from both academia and industry. Major conferences such as NeurIPS, ICML, and ICLR have witnessed a proliferation of FL-related research addressing critical challenges, including communication efficiency (Chen et al., 2021; Sattler et al., 2019), privacy preservation (Wei et al., 2020; Mothukuri et al., 2021), heterogeneity management (Li et al., 2020; Karimireddy et al., 2020b; Wang et al., 2020), and partial and asynchronous participation (Wang et al., 2024; Xu et al., 2023). Despite significant advancements, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rates. This tuning process typically requires extensive computational resources and problem-specific knowledge, such as smoothness parameters, heterogeneity bounds, stochastic gradient variances, and initial optimality gaps.
For instance, MIME (Karimireddy et al., 2020a) relies on smoothness constants and data heterogeneity bounds for stepsize determination, FedProx (Li et al., 2020) requires careful adjustment of a proximal term based on data heterogeneity, and FedDyn (Acar et al., 2021) demands tuning of a regularization parameter contingent on problem characteristics. Furthermore, algorithms such as FedADT (Gao et al., 2023) and FedAMS (Chen et al., 2020) necessitate problem-specific parameters to establish minimum communication rounds, which must exceed thresholds associated with smoothness constants and other problem characteristics. This reliance on problem-specific parameters poses several critical challenges. First, it impedes the widespread adoption of FL by complicating deployment and requiring expertise for hyperparameter tuning (Mostafa, 2019; Deng et al., 2020). Second, it potentially compromises the performance of FL in dynamic or diverse environments where data distributions may evolve (Reddi et al., 2020; Koloskova et al., 2020). Last, accurately estimating these parameters is often infeasible due to the distributed nature of data and the inherent privacy constraints of FL (Konečný et al., 2016). Recent research has attempted to address this issue through adaptive stepsize methods. FedOpt (Reddi et al., 2020) incorporates adaptive optimization techniques such as AdaGrad, Adam, and Yogi into FL, demonstrating improved convergence compared with FedAvg (McMahan et al., 2017). FedNova (Wang et al., 2020) introduces a normalization technique that effectively mitigates the objective inconsistency caused by partial client participation and data heterogeneity. FedBN (Li et al., 2021) employs local batch normalization to alleviate feature shift before averaging models, outperforming both classical FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020) on non-IID data.
While these approaches show promise, they still require careful tuning of global learning rates. Momentum is a technique that mitigates data heterogeneity and accelerates gradient descent by maintaining a velocity vector of gradient history. Recent studies have investigated combining adaptive methods with momentum to leverage both advantages. Hsu et al. (2019) proposed FedAvgM, which incorporates a server-side momentum term into the FedAvg algorithm, demonstrating enhanced convergence rates and robustness against data heterogeneity. Wang et al. (2019) developed SlowMo, a momentum-based method that employs two nested loops to facilitate faster convergence in distributed optimization. FedAMS (Chen et al., 2020) utilizes adaptive moment methods on both the server and client sides to address data heterogeneity. MIME (Karimireddy et al., 2020a) combines client and server momentum to enhance convergence. Wu et al. (2023) introduced FAFED, a momentum-based variance reduction scheme integrated with an adaptive matrix, achieving the best-known sample and communication complexity when utilizing diminishing stepsizes. Nevertheless, these methods still necessitate fine-tuning multiple hyperparameters, limiting their practical applicability. A very recent contribution, FedSPS (Sohom Mukherjee, 2024), claims to be the first fully locally adaptive method for FL with minimal hyperparameter tuning. While promising, this approach relies on stringent assumptions of bounded gradients and bounded data heterogeneity. Moreover, it fails to converge to optima with constant stepsizes and requires adjusting a maximum stepsize threshold based on the smoothness parameter, maintaining a degree of hyperparameter dependence. Therefore, there is a critical need for more robust and adaptive FL algorithms capable of operating effectively across diverse scenarios without relying on problem-specific parameters.
1.1 MAIN CONTRIBUTIONS

This paper addresses these critical limitations of FL by proposing a novel approach that eliminates the need for problem-specific hyperparameter tuning while effectively handling arbitrarily heterogeneous data. Our method is based on a careful combination of adaptive stepsize and momentum techniques. The adaptive stepsize mechanism dynamically adjusts local learning rates against client-update heterogeneity, while the momentum component provides stability under partial participation and accelerates convergence in the nonconvex landscape. To the best of our knowledge, this is the first algorithm to achieve such parameter-agnostic adaptation in nonconvex FL with a state-of-the-art convergence guarantee. Our main contributions are summarized as follows.

1) We introduce PAdaMFed, a Problem-parameter-Agnostic algorithm for nonconvex FL based on adaptive stepsizes and client-side Momentum. PAdaMFed offers several significant advantages:

Independence from problem-specific parameters: PAdaMFed operates autonomously without relying on any problem-specific parameters such as smoothness constants or stochastic gradient variances. All stepsizes in our approach are explicitly determined by system-defined constants, including the numbers of participating clients, local updates, and communication rounds.

Robustness to arbitrarily heterogeneous data: PAdaMFed inherently manages data heterogeneity without requiring any heterogeneity bounds among clients, while accommodating partial client participation. This feature enhances its scalability and adaptability in real-world scenarios where client data can be highly diverse and unpredictable, and full participation may not always be feasible due to resource constraints or device availability.
2) We provide a rigorous theoretical analysis of PAdaMFed, demonstrating its state-of-the-art performance: PAdaMFed achieves a sample complexity of $O(\epsilon^{-4})$ and a communication complexity of $O(\epsilon^{-3})$ to obtain an accuracy of $\|\nabla f(\theta)\| \le \epsilon$ for nonconvex FL problems, even using constant learning rates. These complexities are further improved to the best-known sample complexity of $O(\epsilon^{-3})$ and communication complexity of $O(\epsilon^{-2})$ when incorporating variance reduction. PAdaMFed exhibits linear speedup with respect to the numbers of local update steps and participating clients in each global round. Notably, these theoretical results are obtained under minimal assumptions, requiring only L-smoothness of the loss functions and unbiased stochastic gradients with bounded within-client variance. This represents a significant advancement over existing FL algorithms, which typically necessitate constraints such as data heterogeneity bounds (Li et al., 2021; Wu et al., 2023) or diminishing stepsizes (Wu et al., 2023; Sohom Mukherjee, 2024), or fail to achieve the best-known convergence rates (Liang et al., 2019; Alghunaim, 2024).

3) We conduct empirical evaluations to validate our theoretical findings and the efficacy of our algorithms. Our methods are compared against several established baselines, including FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and SCAFFOLD-M (Cheng et al., 2024). Extensive numerical evidence demonstrates the superiority of our approaches in both runtime efficiency and stepsize robustness.

2 PROBLEM SETUP

We consider an FL system where $N$ clients collaboratively train a common learning model $\theta \in \mathbb{R}^d$ under the coordination of a parameter server. Let $\xi_i$ represent a random sample of client $i$ drawn from its local data distribution $\mathcal{D}_i$. The loss function associated with client $i$ is given by $f_i(\theta) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F(\theta; \xi_i)]$, where $F(\theta; \xi_i)$ is the stochastic loss of client $i$ over sample $\xi_i$.
The objective of the FL system is to minimize the global loss function across all clients:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \frac{1}{N}\sum_{i=1}^{N} f_i(\theta), \quad \text{where } f_i(\theta) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F(\theta; \xi_i)].$$

In a federated setting, clients collaboratively train a global model, but the raw data of each client is never shared with the server or other clients. Denote by $\|\cdot\|$ the ℓ2 norm. We make the following standard assumptions.

Assumption 1 (Sample-Wise Smoothness). Each sample-wise loss function $F(\theta; \xi_i)$ is L-smooth, i.e., $\|\nabla F(\theta; \xi_i) - \nabla F(\delta; \xi_i)\| \le L\|\theta - \delta\|$ for all $\theta, \delta \in \mathbb{R}^d$, $i \in \{1, \ldots, N\}$, and $\xi_i \sim \mathcal{D}_i$.

The sample-wise smoothness in Assumption 1 implies the following standard smoothness condition.

Assumption 2 (Standard Smoothness). There exists $L > 0$ such that each loss function $f_i$ is L-smooth, i.e., $\|\nabla f_i(\theta) - \nabla f_i(\delta)\| \le L\|\theta - \delta\|$ for all $\theta, \delta \in \mathbb{R}^d$ and $i \in \{1, \ldots, N\}$.

We emphasize that our original PAdaMFed algorithm requires only the standard smoothness of Assumption 2. The slightly more stringent Assumption 1 is needed only when using variance reduction to further accelerate convergence.

Assumption 3 (Stochastic Gradient). There exists $\sigma \ge 0$ such that for any $\theta \in \mathbb{R}^d$ and $i \in \{1, \ldots, N\}$, $\mathbb{E}_{\xi_i}[\nabla F(\theta; \xi_i)] = \nabla f_i(\theta)$ and $\mathbb{E}_{\xi_i}\|\nabla F(\theta; \xi_i) - \nabla f_i(\theta)\|^2 \le \sigma^2$, where $\xi_i \sim \mathcal{D}_i$.

Assumption 3 ensures that the stochastic gradient $\nabla F(\theta; \xi_i)$ is unbiased and has bounded within-client variance, which is standard in stochastic optimization. We consider nonconvex FL problems with data heterogeneity among clients, where the local data distributions satisfy $\mathcal{D}_i \ne \mathcal{D}_j$ for $i \ne j$. When addressing heterogeneous data, most existing approaches, such as SCAFFOLD (Karimireddy et al., 2020b), FedProx (Li et al., 2020), FedAMS (Chen et al., 2020), and MIME (Karimireddy et al., 2020a), require an upper bound on gradient dissimilarity, i.e., there exist constants $B, \sigma_h^2 > 0$ such that

$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(\theta)\|^2 \le B\|\nabla f(\theta)\|^2 + \sigma_h^2 \quad \text{for all } \theta \in \mathbb{R}^d. \tag{1}$$
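To make the setup concrete, the following sketch instantiates the global objective above on a toy quadratic family; the data model and all names here are our illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy instance of the FL objective: client i holds samples xi with
# F(theta; xi) = 0.5 * ||theta - xi||^2, so f_i(theta) = E[F(theta; xi)]
# and f(theta) = (1/N) * sum_i f_i(theta).
rng = np.random.default_rng(0)
N, d = 5, 3
# Heterogeneous clients: each local distribution has a different mean.
client_data = [rng.normal(loc=i, size=(100, d)) for i in range(N)]

def local_loss(theta, data):
    # Empirical estimate of f_i(theta)
    return 0.5 * np.mean(np.sum((theta - data) ** 2, axis=1))

def global_loss(theta):
    # f(theta) = (1/N) * sum_i f_i(theta); evaluating it directly requires all
    # raw data, which a real FL server never sees -- illustration only.
    return float(np.mean([local_loss(theta, D) for D in client_data]))

assert global_loss(np.zeros(d)) > 0.0
```

In actual FL, the server can only query clients for model updates, never for `client_data`; this is precisely why the problem parameters discussed below are hard to estimate.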
This assumption simplifies the mathematical analysis of those FL approaches and underpins their algorithmic guarantees. However, it may not hold in scenarios where data across clients exhibit significant and unpredictable variations, thus compromising the robustness of FL. Additionally, existing FL algorithms typically rely on problem-specific parameters to determine their stepsizes, including the smoothness constant $L$, the gradient variance $\sigma^2$, and the heterogeneity bounds $B$ and $\sigma_h^2$. The smoothness constant, which characterizes the Lipschitz continuity of gradients, is generally a global property requiring knowledge of the entire dataset. Similarly, quantifying data heterogeneity across clients necessitates a comprehensive understanding of the differences between local data distributions. However, in FL settings where raw data sharing is prohibited and only model updates are exchanged, obtaining precise measurements of these parameters is computationally prohibitive and may compromise FL's privacy guarantees. In the subsequent section, we present an algorithm that is independent of problem-specific parameters and capable of handling arbitrarily heterogeneous data, thereby eliminating the requirement of the heterogeneity bound (1).
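The gradient-dissimilarity bound (1) can be made concrete on a toy quadratic family, where the constants are computable in closed form, but only with access to every client's data. A hedged sketch (our construction, not from the paper):

```python
import numpy as np

# For quadratics f_i(theta) = 0.5 * ||theta - m_i||^2, bound (1) holds with
# B = 1 and sigma_h^2 = (1/N) * sum_i ||m_i - m_bar||^2 (bias-variance
# decomposition). Estimating sigma_h^2, however, requires every client's
# local optimum m_i -- exactly the information FL privacy forbids collecting.
rng = np.random.default_rng(1)
N, d = 4, 2
means = rng.normal(size=(N, d))   # client optima m_i (source of heterogeneity)
m_bar = means.mean(axis=0)

def check_bound(theta):
    grads = theta - means                     # grad f_i(theta) = theta - m_i
    lhs = np.mean(np.sum(grads ** 2, axis=1))
    global_grad = theta - m_bar               # grad f(theta)
    sigma_h2 = np.mean(np.sum((means - m_bar) ** 2, axis=1))
    rhs = 1.0 * np.sum(global_grad ** 2) + sigma_h2   # B = 1
    return lhs <= rhs + 1e-9

assert all(check_bound(rng.normal(size=d)) for _ in range(10))
```

For this family the bound is tight (it holds with equality), which also shows that no smaller $\sigma_h^2$ works without knowing all the $m_i$.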
3 ALGORITHM DEVELOPMENT

Algorithm 1 PAdaMFed: A Problem-Parameter-Agnostic Algorithm for Nonconvex FL

1: Require: initial model $\theta^0$; control variates $c_i^{-1} = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta^0; \xi_i^{-1,k})$ for all $i$ and $c^{-1} = \frac{1}{N}\sum_i c_i^{-1}$; momentum $g^{-1} = c^{-1}$; global learning rate $\gamma$; local learning rate $\eta$; momentum parameter $\beta$
2: for $t = 0, \ldots, T-1$ do
3:   Central server: uniformly sample clients $S_t \subseteq \{1, \ldots, N\}$ with $|S_t| = S$
4:   for each client $i \in S_t$ in parallel do
5:     Initialize local model $\theta_i^{t,0} = \theta^t$
6:     for $k = 0, \ldots, K-1$ do
7:       Compute $g_i^{t,k} = \beta\big(\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1} + c^{t-1}\big) + (1-\beta)\, g^{t-1}$
8:       Update local model $\theta_i^{t,k+1} = \theta_i^{t,k} - \eta\, g_i^{t,k} / \|g_i^{t,k}\|$
9:     end for
10:    Update control variate $c_i^t = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ (set $c_i^t = c_i^{t-1}$ for $i \notin S_t$)
11:    Upload $\theta_i^{t,K}$ and $c_i^t$ to the central server
12:  end for
    Central server:
13:  Aggregate local updates $\bar{g}^t = \frac{1}{\eta SK}\sum_{i \in S_t}\big(\theta^t - \theta_i^{t,K}\big)$
14:  Update global model $\theta^{t+1} = \theta^t - \gamma\, \bar{g}^t$
15:  Aggregate control variate $c^t = c^{t-1} + \frac{1}{N}\sum_{i \in S_t}\big(c_i^t - c_i^{t-1}\big)$
16:  Aggregate momentum $g^t = \beta\big(\frac{1}{S}\sum_{i \in S_t}(c_i^t - c_i^{t-1}) + c^{t-1}\big) + (1-\beta)\, g^{t-1}$
17:  Download $\theta^{t+1}$ and $\beta c^t + (1-\beta) g^t$ to all clients
18: end for

In this section, we propose PAdaMFed, a problem-parameter-agnostic algorithm for nonconvex FL based on adaptive stepsizes and client-side momentum. PAdaMFed is designed to operate independently of any problem-specific parameters, handle arbitrarily heterogeneous data, and accommodate partial client participation.

3.1 ALGORITHM DEVELOPMENT OF PADAMFED

Our approach builds upon the well-established SCAFFOLD algorithm (Karimireddy et al., 2020b), which was designed to address client drift in FL, where local models significantly deviate from the global model due to partial participation and data heterogeneity. The core concept of SCAFFOLD is the utilization of control variates to correct the drift between client updates and the global model.
Specifically, the server maintains a global control variate, denoted by $c^t$, representing the average model-update direction, while each client maintains a local control variate, denoted by $c_i^t$ for all $i \in \{1, \ldots, N\}$, to track its individual update direction. Client updates are then adjusted using the difference between the local and global control variates. SCAFFOLD demonstrates faster and more stable convergence than the seminal FedAvg algorithm (McMahan et al., 2017). In this paper, we extend the original SCAFFOLD framework by incorporating client-side momentum and local adaptive stepsizes, as outlined in Algorithm 1. Specifically, in Step 7 of Algorithm 1, the local descent direction of client $i$ at global round $t$ and local step $k$ is computed as $g_i^{t,k} = \beta\big(\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1} + c^{t-1}\big) + (1-\beta)\, g^{t-1}$. In this expression, the term $\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ is the stochastic gradient at the current local model $\theta_i^{t,k}$ with sample $\xi_i^{t,k}$. The term $c^{t-1} - c_i^{t-1}$ corrects for the difference between the global and local control variates, helping to mitigate client drift. The term $g^{t-1}$ is the current global momentum, which stabilizes and accelerates convergence across clients. In Step 8 of Algorithm 1, the local model of each client $i$ is updated by $\theta_i^{t,k+1} = \theta_i^{t,k} - \eta\, g_i^{t,k} / \|g_i^{t,k}\|$. Here, an adaptive stepsize $\eta / \|g_i^{t,k}\|$ is obtained by normalizing the descent direction $g_i^{t,k}$. This normalization guarantees that the progress of every client has a uniform magnitude, preventing any individual client from having a disproportionate impact on the global model update. It also simplifies quantifying the distance between consecutive models in our theoretical analysis, since $\|\theta_i^{t,k+1} - \theta_i^{t,k}\| = \eta\, \|g_i^{t,k}\| / \|g_i^{t,k}\| = \eta$ for all $i, k, t$.
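Steps 7 and 8 just described can be sketched in a few lines; this is a minimal illustration under our own naming conventions (`c_local`, `c_global`, `g_prev`), not the authors' implementation.

```python
import numpy as np

# One PAdaMFed local step (Steps 7-8 of Algorithm 1), given a stochastic
# gradient oracle `stoch_grad` for the current client.
def padamfed_local_step(theta, stoch_grad, c_local, c_global, g_prev, beta, eta):
    # Step 7: momentum direction with control-variate drift correction
    g = beta * (stoch_grad(theta) - c_local + c_global) + (1.0 - beta) * g_prev
    # Step 8: normalized (adaptive-stepsize) update; every local step moves the
    # model by exactly eta in Euclidean norm, regardless of gradient scale
    return theta - eta * g / np.linalg.norm(g), g

# Usage on a toy quadratic f(theta) = 0.5 * ||theta||^2 (gradient = theta):
theta0 = np.array([3.0, -4.0])
zeros = np.zeros(2)
theta1, g = padamfed_local_step(theta0, lambda th: th, zeros, zeros, zeros,
                                beta=0.9, eta=0.1)
# The step length equals eta, as stated in the text.
assert np.isclose(np.linalg.norm(theta1 - theta0), 0.1)
```

The assertion checks the property emphasized above: the per-step displacement is exactly $\eta$, independent of the gradient magnitude.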
Additionally, since $c_i^t = \frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k})$ for all $i$, the momentum update in Step 16 of Algorithm 1 can be expressed as:

$$g^t = \beta\left(\frac{1}{S}\sum_{i \in S_t}\left(\frac{1}{K}\sum_{k=0}^{K-1}\nabla F(\theta_i^{t,k}; \xi_i^{t,k}) - c_i^{t-1}\right) + c^{t-1}\right) + (1-\beta)\, g^{t-1}. \tag{2}$$

This equation accumulates the descent directions across clients and iterations, so that $g^t = \frac{1}{SK}\sum_{i \in S_t}\sum_{k=0}^{K-1} g_i^{t,k}$. With $g^t$, the optimization trajectory at each client is smoothed by the descent directions of the other clients, enhancing the robustness of the optimization process against variability in local updates caused by data heterogeneity. Notably, our PAdaMFed algorithm maintains the communication workload of SCAFFOLD for both uplink and downlink.

3.2 ACCELERATING PADAMFED WITH VARIANCE REDUCTION

Variance reduction is an effective technique for accelerating convergence and enhancing the stability of FL, particularly when dealing with heterogeneous data and limited client participation. In this subsection, we enhance PAdaMFed by integrating a variance-reduction component into each client's descent direction, resulting in our PAdaMFed-VR algorithm. The detailed procedures of PAdaMFed-VR are provided in Appendix B. PAdaMFed-VR differs from PAdaMFed primarily in its computation of the local gradient. Specifically, Step 7 of Algorithm 1 is replaced with:

$$g_i^{t,k} = \nabla F(\theta_i^{t,k}; \xi_i^{t,k}) + \beta\big(c^{t-1} - c_i^{t-1}\big) + (1-\beta)\big(g^{t-1} - \nabla F(\theta^{t-1}; \xi_i^{t,k})\big),$$

where $\nabla F(\theta^{t-1}; \xi_i^{t,k})$ is the variance-reduction component. Our variance-reduction design follows the principle of STORM (Cutkosky & Orabona, 2019) to utilize samples more efficiently. For each local update, the sample $\xi_i^{t,k}$ is used twice: 1) to compute the gradient at the current local model $\theta_i^{t,k}$; and 2) to evaluate the gradient at the previous global model $\theta^{t-1}$. This dual usage of each sample mitigates the influence of within-client gradient noise, enabling more accurate estimation of gradient directions.
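The STORM-style dual use of each sample can be sketched as follows; `grad_at` is a closure over one fixed sample, and all variable names are our illustrative assumptions rather than the paper's code.

```python
import numpy as np

# PAdaMFed-VR local direction: the SAME sample is evaluated at both the
# current local model and the previous global model, so shared sampling
# noise cancels in the correction term (STORM principle).
def padamfed_vr_step(theta_local, theta_prev_global, grad_at, c_local, c_global,
                     g_prev, beta, eta):
    g = (grad_at(theta_local)
         + beta * (c_global - c_local)
         + (1.0 - beta) * (g_prev - grad_at(theta_prev_global)))
    # Same normalized update as in PAdaMFed
    return theta_local - eta * g / np.linalg.norm(g), g

# Usage: one sample xi, sample-wise loss F(theta; xi) = 0.5 * ||theta - xi||^2,
# so grad_at evaluates the stochastic gradient of that single sample.
xi = np.array([1.0, 2.0])
grad_at = lambda th: th - xi
theta = np.array([5.0, 5.0])
theta_new, g = padamfed_vr_step(theta, theta, grad_at, np.zeros(2), np.zeros(2),
                                np.zeros(2), beta=0.9, eta=0.05)
assert np.isclose(np.linalg.norm(theta_new - theta), 0.05)
```

Passing a single closure for both gradient evaluations is what enforces the "same sample used twice" property described above.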
4 THEORETICAL RESULTS AND COMPARISONS WITH PRIOR WORK

4.1 THEORETICAL RESULTS

Theorem 1. Suppose that Assumptions 2 and 3 hold. Let the local and global learning rates of PAdaMFed be $\eta = \frac{1}{K\sqrt{T}}$ and $\gamma = \frac{(SK)^{1/4}}{T^{3/4}}$, respectively, let the momentum parameter be $\beta = \sqrt{\frac{SK}{T}}$, and let $\{\theta^t\}_{t \ge 0}$ be the iterates generated by Algorithm 1. Then, for all $T \ge 1$,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\| \le O\left(\frac{\Delta}{(SKT)^{1/4}} + \frac{(L+\sigma)(SK)^{1/4}}{T^{3/4}}\right),$$

where $\Delta := f(\theta^0) - \min_\theta f(\theta)$.

Theorem 2. Suppose that Assumptions 1 and 3 hold. Let the local and global learning rates of PAdaMFed-VR be $\eta = \frac{1}{KT}$ and $\gamma = \frac{(SK)^{1/3}}{T^{2/3}}$, respectively, let the momentum parameter be $\beta = \frac{(SK)^{1/3}}{T^{2/3}}$, and let $\{\theta^t\}_{t \ge 0}$ be the iterates generated by Algorithm 2. Then, for all $T \ge 1$,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\| \le O\left(\frac{\Delta}{(SKT)^{1/3}} + \frac{(L+\sigma)(SK)^{1/3}}{T^{2/3}}\right).$$

Remark 1. According to Theorem 1, PAdaMFed converges to an ϵ-stationary point¹ in expectation within $O\big(\frac{1}{SK\epsilon^4}\big)$ communication rounds. This rate improves to $O\big(\frac{1}{SK\epsilon^3}\big)$ when incorporating variance reduction, as shown in Theorem 2. Furthermore, both algorithms exhibit linear speedup with respect to the number of participating clients $S$ and the number of local update steps $K$.

Remark 2. In PAdaMFed, setting $SK = O(T^{1/3})$ yields a sample complexity² of $O(\epsilon^{-4})$ with communication complexity $O(\epsilon^{-3})$. Similarly, by setting $SK = O(\sqrt{T})$, PAdaMFed-VR achieves the best-known sample complexity of $O(\epsilon^{-3})$ with a communication complexity of $O(\epsilon^{-2})$ to find an ϵ-stationary point (Wu et al., 2023).

Remark 3. In traditional FL, selecting theoretically optimal stepsizes requires knowledge of problem-specific parameters, which are often unavailable. Consequently, in real-world FL scenarios, stepsizes must be tuned empirically, a process that is labor-intensive, time-consuming, and sometimes impractical. In our PAdaMFed and PAdaMFed-VR, the stepsizes are explicitly determined by system-defined constants (the numbers of participating clients $S$, local update steps $K$, and communication rounds $T$), without requiring any problem-specific parameters.
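Because the stepsizes in Theorems 1 and 2 depend only on the system constants $S$, $K$, and $T$, they can be computed by a trivial helper. The schedules below follow our reading of the theorem statements as reconstructed from the extraction, so the exact exponents should be checked against the original paper.

```python
# Parameter-free stepsize schedules of PAdaMFed / PAdaMFed-VR
# (exponents as reconstructed from Theorems 1-2; no problem parameters needed).
def padamfed_schedule(S, K, T, variance_reduced=False):
    if variance_reduced:                         # Theorem 2 (PAdaMFed-VR)
        eta = 1.0 / (K * T)
        gamma = (S * K) ** (1 / 3) / T ** (2 / 3)
        beta = (S * K) ** (1 / 3) / T ** (2 / 3)
    else:                                        # Theorem 1 (PAdaMFed)
        eta = 1.0 / (K * T ** 0.5)
        gamma = (S * K) ** (1 / 4) / T ** (3 / 4)
        beta = ((S * K) / T) ** 0.5
    return eta, gamma, beta

# Example: S = 10 participating clients, K = 5 local steps, T = 400 rounds.
eta, gamma, beta = padamfed_schedule(S=10, K=5, T=400)
assert 0 < beta <= 1 and eta > 0 and gamma > 0
```

Note the contrast with the baselines of Section 4.3, whose stepsizes require $L$, $\Delta$, or $\sigma^2$ as inputs.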
This hyperparameter-independent nature simplifies implementation, enhances robustness, and facilitates the deployment of our algorithms across diverse FL applications.

4.2 PROOF SKETCH

Our theoretical proof starts from the L-smoothness of the loss function $f(\theta)$, which yields the following inequality:

$$f(\theta^{t+1}) \le f(\theta^t) + 2\gamma\|\nabla f(\theta^t) - g^t\| - \gamma\|\nabla f(\theta^t)\| + \frac{\gamma}{SK}\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\| + \frac{\gamma^2 L}{2}.$$

¹A point θ is said to be ϵ-stationary if $\|\nabla f(\theta)\| \le \epsilon$. Note that for any ϵ-stationarity guarantee stated in terms of $\|\nabla f(\theta)\|^2$, one can derive the corresponding guarantee for $\|\nabla f(\theta)\|$ from the relationship $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\| \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\mathbb{E}\|\nabla f(\theta^t)\|^2} \le \sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\|^2}$, where the first and second inequalities use Jensen's inequality, since the square-root function is concave. Therefore, our results can be aligned with the metric $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(\theta^t)\|^2$ by taking the square root on both sides of those convergence bounds.

²The total number of stochastic gradient samples across clients, local updates, and communication rounds.

To establish an upper bound for $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t)\|$, we must quantify two key terms: $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t) - g^t\|$ and $\frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{SK}\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$. The momentum $g^t$ is a recursive variable that accumulates values from previous rounds. By plugging the expression of $g^t$ (provided in (2)) into $\nabla f(\theta^t) - g^t$ and introducing the auxiliary term $\nabla f(\theta^{t-1})$, we can recursively express $\|\nabla f(\theta^t) - g^t\|$ through its predecessor $\|\nabla f(\theta^{t-1}) - g^{t-1}\|$, scaled by a contraction coefficient $(1-\beta)$. This substitution also introduces additional terms associated with stochastic gradients and control variates. Through meticulous control of all intermediate terms, we prove that $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta^t) - g^t\|$ is upper bounded by $O((SKT)^{-1/4})$ for PAdaMFed and by $O((SKT)^{-1/3})$ for PAdaMFed-VR. The term $\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$ represents the gradient dissimilarity across clients. While the heterogeneity bound (1) could readily control this term, our objective is to eliminate dependence on such bounds.
Instead, we relax this term to $\mathbb{E}\|\nabla f_i(\theta^{t-1}) - c_i^{t-1}\|$, along with other controllable terms, by substituting the expressions of $g_i^{t,k}$ and $g^t$. Returning to our treatment of $\mathbb{E}\|\nabla f(\theta^t) - g^t\|$, we exploit the recursive property of the control variate $c_i^{t-1}$ to bound $\mathbb{E}\|\nabla f_i(\theta^{t-1}) - c_i^{t-1}\|$. This, in turn, allows us to establish a bound for $\mathbb{E}\big[\sum_{i \in S_t, k}\|g_i^{t,k} - g^t\|\big]$. Combining the above steps leads to our analytical results. For comprehensive proofs, please refer to Appendix A for Theorem 1 and Appendix B for Theorem 2.

Intuition on the Algorithmic Features: The efficacy of PAdaMFed stems from the synergistic integration of three indispensable components: local gradient normalization, client-side momentum, and control variates. 1) Gradient normalization serves as an adaptive learning-rate scheme, automatically adjusting stepsizes based on the local optimization landscape: it allows larger steps in regions with small gradients (where more aggressive exploration is beneficial) and smaller steps in steep regions (where careful progress is needed). 2) Client-side momentum accelerates convergence while maintaining stability. First, it helps overcome local irregularities in the loss landscape by accumulating gradients over clients and iterations. Second, it accelerates progress along directions of consistent gradient agreement. 3) Control variates align local updates with the global objective, reducing the variance of gradient estimates and ensuring more consistent updates across heterogeneous client data. Collectively, these techniques ensure that the stepsizes are independent of problem-specific parameters such as smoothness constants and data heterogeneity bounds, simplifying the tuning process and enhancing robustness across diverse FL environments. Notably, our algorithms also eliminate the requirement of data heterogeneity bounds, further broadening their applicability.
4.3 COMPARISONS WITH PRIOR WORK

We compare PAdaMFed and PAdaMFed-VR with several representative algorithms for solving FL problems with heterogeneous data, as listed in Table 1.

Comparisons of PAdaMFed with Prior FL Algorithms: The SCAFFOLD algorithm (Karimireddy et al., 2020b) requires the smoothness parameter $L$ for stepsize tuning. Moreover, its communication complexity of $O\big((\frac{N}{S})^{2/3}\frac{L}{K\epsilon^4}\big)$ is suboptimal. MIME (Karimireddy et al., 2020a) improves this complexity to $O\big(\frac{1}{SK\epsilon^4}\big)$ by incorporating server-level momentum. Nevertheless, MIME requires large-batch local gradients per round, and its learning rates depend on multiple problem parameters, including the initial optimality gap $\Delta$ and the heterogeneity bound $\sigma_h^2$, which are challenging to estimate. FedSPS (Sohom Mukherjee, 2024) incorporates stochastic Polyak stepsizes into local client updates, achieving a communication complexity of $O\big(\frac{1}{NK\epsilon^4}\big)$ in full-participation scenarios. However, its analysis relies on the assumption of bounded data heterogeneity. Moreover, its stepsize tuning requires knowledge of the lower bounds of all loss functions, i.e., $\ell_i^*$ for all $i$, in addition to the smoothness parameter $L$, leading to significant problem-parameter dependence. SCAFFOLD-M (Cheng et al., 2024) employs client-side momentum similar to PAdaMFed, removing the need for bounded data heterogeneity. However, SCAFFOLD-M's stepsizes depend on several problem-specific parameters, including the smoothness parameter $L$, the initial optimality gap $\Delta$, and the stochastic gradient variance $\sigma^2$, resulting in laborious stepsize tuning. In contrast, PAdaMFed is completely independent of problem-specific parameters while achieving state-of-the-art communication complexity.

Comparisons of PAdaMFed-VR with Prior Variance-Reduced FL Algorithms: FAFED (Wu et al., 2023) employs momentum-based variance reduction with an adaptive matrix, achieving the best-known $O(\epsilon^{-3})$ sample complexity and $O(\epsilon^{-2})$ communication complexity through the use of diminishing stepsizes. However, FAFED requires stringent assumptions of bounded gradients and bounded data heterogeneity. Moreover, its learning rates are subject to several complex constraints and rely on problem-parameter-based tuning. SCAFFOLD-M-VR (Cheng et al., 2024) is a variance-reduced variant of SCAFFOLD-M and, similarly, requires careful stepsize tuning based on multiple problem-dependent parameters. Nevertheless, it fails to achieve the best-known complexity established in the literature.

Table 1: Comparisons of algorithms for solving FL problems with heterogeneous data. (Shorthand: Add. Assump. = additional assumptions aside from Assumptions 1–3; BDH = bounded data heterogeneity defined in (1); BG = bounded gradient, i.e., $\|\nabla f_i(\theta)\| \le G$ for all $i, \theta$; BHD = bounded Hessian dissimilarity, i.e., $\|\nabla^2 f_i(\theta) - \nabla^2 f(\theta)\| \le \delta$ for all $i, \theta$. Entries marked $\cdots$ are constants of the cited works that did not survive extraction.)

| Algorithm | Add. Assump. | Stepsize Restrictions | Stepsize-Related Problem-Parameters | Communication |
|---|---|---|---|---|
| SCAFFOLD (Karimireddy et al., 2020b) | — | $\gamma = \sqrt{S}$, $\eta \le \frac{1}{24\gamma KL}$ | $L$ | $O\big((\frac{N}{S})^{2/3}\frac{1}{K\epsilon^4}\big)$ |
| MIME (Karimireddy et al., 2020a) | BDH, BHD | $\eta = \sqrt{\frac{S\Delta}{L\tilde{G}TK^2}}$, $\tilde{G} = \sigma_h^2 + \frac{\sigma^2}{K}$ | $L$, $\Delta$, $\sigma^2$, $\sigma_h^2$ | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| FedSPS (Sohom Mukherjee, 2024) | BDH | $\eta_i^{t,k} = \min\big\{\frac{F(\theta_i^{t,k};\xi_i^{t,k}) - \ell_i^*}{c\|\nabla F(\theta_i^{t,k};\xi_i^{t,k})\|^2}, \eta_b\big\}$, $\eta_b \le \min\big\{\frac{1}{2cL}, \frac{1}{25LK}\big\}$¹ | $L$, $\ell_i^*\ \forall i$ | $O\big(\frac{1}{NK\epsilon^4}\big)$ |
| SCAFFOLD-M (Cheng et al., 2024) | — | $\beta = \min\{1, \cdots\}$, $\eta KL \le \min\{1, \cdots\}$ | $L$, $\Delta$, $\sigma^2$, $G_0$² | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| PAdaMFed (this paper) | — | $\beta = \sqrt{\frac{SK}{T}}$, $\gamma = \frac{(SK)^{1/4}}{T^{3/4}}$, $\eta = \frac{1}{K\sqrt{T}}$ | None | $O\big(\frac{1}{SK\epsilon^4}\big)$ |
| FAFED (Wu et al., 2023) | BDH, BG | $\eta_t \le \frac{1}{N^{2/3}Lt^{1/3}}$, $\beta_t \propto \eta_t^2 L^2$ | $L$ | $O\big(\frac{1}{SK\epsilon^3}\big)$ |
| SCAFFOLD-M-VR (Cheng et al., 2024) | — | $\beta = \min\{\frac{S}{N}, \cdots\}$, $\gamma L = \min\{1, \cdots\}$ | $L$, $\Delta$, $\sigma^2$ | $O\big(\frac{1}{S\epsilon^3}\big)$ |
| PAdaMFed-VR (this paper) | — | $\beta = \frac{(SK)^{1/3}}{T^{2/3}}$, $\gamma = \frac{(SK)^{1/3}}{T^{2/3}}$, $\eta = \frac{1}{KT}$ | None | $O\big(\frac{1}{SK\epsilon^3}\big)$ |

¹ $\ell_i^* \le \inf_{\xi_i \sim \mathcal{D}_i,\, \theta} F(\theta; \xi_i)$ for any $i$, and $c$ is a constant balancing adaptivity and accuracy.
² $G_0 := \frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(\theta^0)\|^2$.
5 NUMERICAL EXPERIMENTS

In this section, we present experiments on two real-world datasets: EMNIST (Cohen et al., 2017) and CIFAR-10 (Li et al., 2017). We evaluate the proposed algorithms against several baselines, including FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and SCAFFOLD-M (Cheng et al., 2024). Additionally, experiments are conducted under both i.i.d. and non-i.i.d. data distributions. Due to space limitations, the detailed experimental setup and additional simulation results are provided in Appendix C.

[Figure 1: Test accuracy versus the number of communication rounds on the EMNIST dataset ((a) i.i.d.; (b) non-i.i.d.).]

[Figure 2: Test accuracy versus learning rate on the EMNIST dataset ((a) i.i.d.; (b) non-i.i.d.).]

Figure 1 illustrates the test accuracy of the algorithms versus the number of communication rounds on the EMNIST dataset, with subfigure 1a showing i.i.d. data and subfigure 1b showing non-i.i.d. data. The stepsizes of our algorithms, PAdaMFed and PAdaMFed-VR, are set according to the theoretical guidance of Theorem 1 and Theorem 2, respectively. For fair comparison, the hyperparameters of the other algorithms in Figure 1 are optimized through grid search to achieve their best performance. The results demonstrate that our proposed methods significantly outperform all baseline algorithms (FedAvg, SCAFFOLD, and SCAFFOLD-M) in both convergence speed and test accuracy. Notably, although SCAFFOLD-M employs a similar momentum technique, it converges more slowly than PAdaMFed and achieves lower accuracy, validating the efficacy of our adaptive stepsize design. Building upon these advantages, the incorporation of variance reduction further enhances our methods' superiority through more efficient sample utilization. Moreover, the results on non-i.i.d. data in subfigure 1b show even greater performance margins than the i.i.d. case, highlighting the advantages of our algorithms.
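The paper's exact data-partitioning protocol appears in its Appendix C; as a hedged, generic illustration of how non-i.i.d. client splits like those above are commonly generated, the following sketch uses Dirichlet label partitioning with hypothetical parameters of our own choosing.

```python
import numpy as np

# Dirichlet label partitioning: each class's samples are divided among clients
# according to class proportions drawn from Dirichlet(alpha); smaller alpha
# yields more skewed (more heterogeneous) local label distributions.
def dirichlet_partition(labels, n_clients, alpha, rng):
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cl, part in enumerate(np.split(idx, cuts)):
            client_idx[cl].extend(part.tolist())
    return client_idx

# Usage: 1000 synthetic labels over 10 classes, split across 5 clients.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.3, rng=rng)
assert sum(len(p) for p in parts) == 1000
```

With `alpha=0.3` the splits are strongly skewed, which is the regime where heterogeneity-robust methods are expected to show the largest margins.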
Figure 2 compares the test accuracy of the algorithms versus the learning rate on the EMNIST dataset. All algorithms were evaluated over 400 communication rounds to ensure a fair comparison. We observe that our algorithm demonstrates superior robustness to stepsize selection, maintaining stable performance across a significantly wider range of learning rates than the baseline methods. Specifically, it achieves test accuracy exceeding 0.8 across the stepsize range $[3 \times 10^{-3}, 10^{-1}]$ for i.i.d. data distributions (subfigure 2a) and above 0.7 for the same stepsize range under non-i.i.d. conditions (subfigure 2b). In contrast, the baseline algorithms exhibit substantially narrower regions of stable performance, empirically validating our method's enhanced stepsize robustness.

6 CONCLUSIONS

This paper proposed a novel training approach, PAdaMFed, for nonconvex federated learning (FL) that eliminates dependence on problem-specific parameters and generalizes automatically across diverse FL environments. PAdaMFed also removes the need for data heterogeneity bounds and accommodates partial client participation, further broadening its applicability. We provided a rigorous theoretical analysis of PAdaMFed, demonstrating its state-of-the-art convergence under minimal assumptions, namely the L-smoothness of loss functions and unbiased stochastic gradients with bounded within-client variance. Furthermore, we enhanced this convergence rate by incorporating variance reduction, achieving the best-known $O(\epsilon^{-3})$ sample complexity and $O(\epsilon^{-2})$ communication complexity. Extensive numerical experiments demonstrated that our algorithms outperform representative existing FL approaches in both runtime efficiency and stepsize robustness.

REFERENCES

Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization.
arXiv preprint arXiv:2111.04263, 2021.
Sulaiman A Alghunaim. Local exact-diffusion for decentralized optimization and learning. IEEE Transactions on Automatic Control, 2024.
Mingzhe Chen, Nir Shlezinger, H Vincent Poor, Yonina C Eldar, and Shuguang Cui. Communication-efficient federated learning. Proceedings of the National Academy of Sciences, 118(17):e2024789118, 2021.
Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, pp. 119–128, 2020.
Ziheng Cheng, Xinmeng Huang, and Kun Yuan. Momentum benefits non-IID federated learning simply and provably. In The Twelfth International Conference on Learning Representations, 2024.
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.
Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
Huimin Gao, Qingtao Wu, Xuhui Zhao, Junlong Zhu, and Mingchuan Zhang. FedADT: An adaptive method based on derivative term for federated learning. Sensors, 23(13):6034, 2023.
Geoffrey Hinton. Neural networks for machine learning, lecture 6a, 2012. URL http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Coursera.
Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. PMLR, 2020.
Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. CIFAR10-DVS: An event-stream dataset for object classification. Frontiers in Neuroscience, 11:309, 2017.
Jiaxiang Li, Xuxing Chen, Shiqian Ma, and Mingyi Hong. Problem-parameter-free decentralized nonconvex stochastic optimization. arXiv preprint arXiv:2402.08821, 2024.
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-IID features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021.
Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng.
Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
Hesham Mostafa. Robust federated learning through representation matching and adaptive hyperparameters. arXiv preprint arXiv:1912.13075, 2019.
Viraaji Mothukuri, Reza M Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Generation Computer Systems, 115:619–640, 2021.
Sohom Mukherjee, Nicolas Loizou, and Sebastian U. Stich. Locally adaptive federated learning. arXiv preprint arXiv:2307.06306, 2024.
Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3400–3413, 2019.
Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019.
Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
Xiaolu Wang, Yuchang Sun, Hoi-To Wai, and Jun Zhang. Dual-delayed asynchronous SGD for arbitrarily heterogeneous data. arXiv preprint arXiv:2405.16966, 2024.
Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor.
Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
Xidong Wu, Feihu Huang, Zhengmian Hu, and Heng Huang. Faster adaptive federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 10379–10387, 2023.
Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. Computer Science Review, 50:100595, 2023.
Jing Xu, Sen Wang, Liwei Wang, and Andrew Chi-Chih Yao. FedCM: Federated learning with client-level momentum. arXiv preprint arXiv:2106.10874, 2021.

A RELATED WORKS

Adaptive Stepsize Methods: Adaptive stepsize methods have gained significant attention in the optimization literature due to their ability to automatically adjust learning rates based on the geometry of the objective function. These methods, such as Adam (Kingma & Ba, 2014), AdaGrad (Duchi et al., 2011), and RMSProp (Hinton, 2012), have demonstrated remarkable success in various machine learning tasks, particularly in handling sparse gradients and non-stationary objectives. In the context of FL, adaptive stepsizes offer several advantages. First, they eliminate the need for manual tuning of learning rates, which is especially beneficial in federated settings where global knowledge of the objective function's properties is limited. Second, adaptive methods can potentially mitigate the impact of data heterogeneity by allowing different update rates for different model parameters, effectively accounting for varying gradient statistics across clients. FedOpt (Reddi et al., 2020) introduced adaptive optimization techniques such as AdaGrad, Adam, and Yogi into FL, demonstrating improved convergence properties compared to the traditional FedAvg algorithm. Wang et al.
(2020) proposed FedNova, which normalizes and scales local updates to mitigate objective inconsistency caused by partial client participation and data heterogeneity. FedBN (Li et al., 2021) employed local batch normalization to alleviate feature shift before averaging models, outperforming both classical FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020) on non-IID data. While these approaches demonstrated the advantages of adaptive methods for easing parameter tuning, none provides theoretical guarantees, and careful tuning of global learning rates remains essential.

Momentum: Momentum, on the other hand, is a technique that accelerates gradient descent by accumulating a velocity vector along directions of persistent reduction in the objective across iterations. In nonconvex optimization, momentum has been shown to help escape saddle points more efficiently and potentially reach better local optima (Cutkosky & Orabona, 2019). The incorporation of momentum in FL algorithms offers several benefits. It can smooth out the impact of heterogeneous updates from different clients, potentially leading to more stable and faster convergence. Moreover, momentum can help overcome the challenges posed by partial client participation by maintaining a consistent optimization trajectory even when client participation varies across rounds. Hsu et al. (2019) proposed FedAvgM, which adds a server-side momentum term to the FedAvg algorithm, demonstrating improved convergence rates and robustness to data heterogeneity. Wang et al. (2019) developed SlowMo, a momentum-based method using two nested loops to achieve faster convergence in distributed optimization. Both FedCM (Xu et al., 2021) and Cheng et al. (2024) investigated the integration of client-side momentum into FedAvg to effectively tackle client heterogeneity and partial participation in FL.
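The server-side momentum idea behind FedAvgM can be sketched in a few lines: the server treats the averaged client delta as a pseudo-gradient and passes it through a momentum buffer before updating the global model. The following toy example is a minimal sketch of this mechanism, not the exact algorithm of Hsu et al. (2019); the least-squares clients, data shapes, and stepsizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four clients share a feature matrix but hold heterogeneous targets b_i,
# so their local least-squares objectives disagree on the minimizer.
A = rng.normal(size=(20, 3))
clients = [A @ rng.normal(size=3) + 0.01 * rng.normal(size=20) for _ in range(4)]

def local_sgd(theta, b, lr=0.05, steps=5):
    """A few local gradient steps on one client's objective; returns the delta."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - b) / len(b)
    return w - theta  # pseudo-gradient sent to the server

theta = np.zeros(3)
velocity = np.zeros(3)
beta = 0.9  # server-side momentum coefficient
for _ in range(200):
    delta = np.mean([local_sgd(theta, b) for b in clients], axis=0)
    velocity = beta * velocity + delta  # accumulate a velocity vector
    theta = theta + velocity            # momentum-smoothed global update
```

Because the momentum buffer averages pseudo-gradients across rounds, round-to-round fluctuations caused by the heterogeneous client targets are smoothed before they reach the global model.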
Combination of Adaptive Stepsizes and Momentum: Recent works have explored combining adaptive methods and momentum in FL. FedAMS (Chen et al., 2020) implements a local AMSGrad scheme for FL, demonstrating fast convergence with low communication cost. Mime (Karimireddy et al., 2020a) combines control variates with server-level momentum at every local update to mimic centralized methods running on IID data, outperforming centralized methods but requiring a large-batch local gradient per round for each client. Wu et al. (2023) introduced FAFED, a momentum-based variance reduction scheme integrated with an adaptive matrix, attaining the best-known sample and communication complexity when using diminishing stepsizes. While these methods demonstrate improved performance, they still require careful tuning of global learning rates. A recent contribution, FedSPS (Mukherjee et al., 2024), claims to be the first fully locally adaptive method for FL with minimal hyperparameter tuning. While promising, this approach relies on stringent assumptions of bounded data heterogeneity and gradients. Moreover, it fails to converge to optima with constant stepsizes and requires adjusting a maximum stepsize threshold based on the smoothness parameter, thus retaining some hyperparameter dependence. The limitations of existing approaches underscore the critical need for more robust, adaptive FL algorithms capable of operating effectively across diverse scenarios without extensive parameter tuning. This paper makes a significant advancement by proposing a novel algorithm, PAdaMFed, that completely eliminates dependency on problem-specific parameters. All stepsizes in our approach are explicitly determined by the number of participating clients, local updates, and communication rounds. To the best of our knowledge, this is the first algorithm that achieves such problem-parameter independence in FL.
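As a minimal illustration of the combination surveyed above, the snippet below implements an Adam-style update, the canonical coupling of a momentum buffer with a per-coordinate adaptive stepsize. The badly scaled quadratic and all constants are illustrative assumptions, unrelated to any of the cited FL methods.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style step: momentum (m) combined with adaptive scaling (v)."""
    m = b1 * m + (1 - b1) * grad       # first moment: momentum buffer
    v = b2 * v + (1 - b2) * grad ** 2  # second moment: adaptive stepsize
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Badly scaled quadratic f(x) = 0.5 * (x1^2 + 100 * x2^2): adaptive scaling
# equalizes per-coordinate progress despite the 100x curvature gap.
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = np.array([1.0, 100.0]) * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because the gradient and its second-moment estimate scale together, both coordinates make comparable progress without per-problem tuning, which is precisely the behavior that motivates adaptive stepsizes in FL.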
Moreover, our algorithm inherently manages data heterogeneity and partial client participation without requiring any heterogeneity bound among clients, which is also nontrivial. Data heterogeneity has been extensively studied in FL. However, existing algorithms either depend on bounded data heterogeneity (e.g., FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), FedProx (Li et al., 2020), and Mime (Karimireddy et al., 2020a)) or fall short of achieving state-of-the-art convergence rates (e.g., VRL-SGD (Liang et al., 2019) and LED (Alghunaim, 2024)). Recently, Cheng et al. (2024) demonstrated that momentum can eliminate the data heterogeneity constraint in the FedAvg and SCAFFOLD algorithms while achieving state-of-the-art convergence results. Wang et al. (2024) introduced DuDe-ASGD for asynchronous FL, which can effectively handle arbitrarily heterogeneous data by leveraging stale stochastic gradients. However, their algorithms require carefully designed stepsizes based on hyperparameters. In contrast, our algorithm accommodates arbitrary data heterogeneity while achieving complete problem-parameter independence. A notable concurrent work by Li et al. (2024) also explores problem-parameter-free algorithms in the context of decentralized nonconvex optimization. While both studies target problem-parameter-free optimization, our work addresses the unique challenges inherent to federated learning settings. The fundamental distinction lies in our treatment of client drift, a critical phenomenon arising from multiple local updates and data heterogeneity in federated environments.
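To make the control-variate mechanism concrete, here is a deliberately tiny sketch in the spirit of SCAFFOLD (not the authors' algorithm): each client corrects its local gradient with a control variate estimating the gap between its own gradient and the global one, so that long local epochs on highly heterogeneous objectives still drive the global model to the global optimum. The two quadratic objectives and all constants are illustrative assumptions.

```python
import numpy as np

# Two clients with extremely heterogeneous quadratics f_i(w) = 0.5 * (w - m_i)^2;
# the global objective is minimized at w = 0, far from either local minimizer.
means = np.array([-10.0, 10.0])
lr, K, rounds = 0.1, 20, 50

w = 5.0
c = np.zeros(2)   # client control variates, c_i ~ grad f_i - grad f
c_bar = c.mean()  # server control variate
for _ in range(rounds):
    new_w, new_c = [], []
    for i in range(2):
        wi = w
        for _ in range(K):
            gi = wi - means[i]              # local gradient grad f_i(wi)
            wi -= lr * (gi - c[i] + c_bar)  # drift-corrected local step
        new_c.append(c[i] - c_bar + (w - wi) / (K * lr))  # control-variate update
        new_w.append(wi)
    w = float(np.mean(new_w))
    c = np.array(new_c)
    c_bar = c.mean()
```

Despite twenty local steps per round on objectives whose minimizers sit at -10 and +10, the corrected iterates converge to the global optimum, with each control variate learning its client's gradient gap at the optimum (here +10 and -10).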
Our key technical contributions beyond Li et al. (2024) are twofold: 1) the development of a novel framework integrating control variates with momentum to mitigate client drift, requiring sophisticated theoretical analysis due to their complex interactions; 2) the successful elimination of explicit heterogeneity bounds, which were previously considered essential in the federated learning literature.

B THEORETICAL ANALYSIS OF PADAMFED

Our analysis is based on the following useful lemmas.

Lemma 1. For any $t$, we have
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\|^2 \le \frac{1}{3}\eta^2K^2 \quad \text{and} \quad \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\| \le \frac{1}{2}\eta K.$$

Proof. From the update rule of the local model, for any $i$, $k$, and $t$, we have $\theta_i^{t,k+1}-\theta_i^{t,k} = -\eta\, g_i^{t,k}/\big\|g_i^{t,k}\big\|$, so that $\big\|\theta_i^{t,k+1}-\theta_i^{t,k}\big\| = \eta$. Hence,
$$\big\|\theta_i^{t,k}-\theta^t\big\|^2 = \Big\|\sum_{j=0}^{k-1}\big(\theta_i^{t,j+1}-\theta_i^{t,j}\big)\Big\|^2 \le \Big(\sum_{j=0}^{k-1}\big\|\theta_i^{t,j+1}-\theta_i^{t,j}\big\|\Big)^2 \le \eta^2 k^2.$$
Summing the above inequality over $i$ and $k$ yields
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\|^2 \le \frac{\eta^2}{6K}(K-1)K(2K-1) \le \frac{1}{3}\eta^2K^2.$$
Similarly, we have
$$\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1}\big\|\theta_i^{t,k}-\theta^t\big\| \le \frac{\eta}{NK}\sum_{i=1}^{N}\sum_{k=0}^{K-1} k = \frac{\eta(K-1)}{2} \le \frac{1}{2}\eta K.$$
These inequalities in Lemma 1 are frequently used in our analysis.

Lemma 2. Given vectors $\omega_1,\ldots,\omega_N\in\mathbb{R}^d$ and $\bar{\omega}=\frac{1}{N}\sum_{i=1}^N \omega_i$, if we sample $\mathcal{S}\subseteq\{1,\ldots,N\}$ uniformly at random such that $|\mathcal{S}|=S$, then it holds that
$$\mathbb{E}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}}\omega_i-\bar{\omega}\Big\|^2 = \frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2.$$

Proof. Letting $\mathbb{1}\{i\in\mathcal{S}\}$ be the indicator of the event $i\in\mathcal{S}$, we prove this lemma by direct calculation as follows:
$$\mathbb{E}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}}\omega_i-\bar{\omega}\Big\|^2 = \frac{1}{S^2}\,\mathbb{E}\Big[\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2\,\mathbb{1}\{i\in\mathcal{S}\} + 2\sum_{i<j}\langle\omega_i-\bar{\omega},\,\omega_j-\bar{\omega}\rangle\,\mathbb{1}\{i\in\mathcal{S}\}\,\mathbb{1}\{j\in\mathcal{S}\}\Big].$$
Using $\mathbb{E}[\mathbb{1}\{i\in\mathcal{S}\}]=\frac{S}{N}$, $\mathbb{E}[\mathbb{1}\{i\in\mathcal{S}\}\mathbb{1}\{j\in\mathcal{S}\}]=\frac{S(S-1)}{N(N-1)}$ for $i\neq j$, and $2\sum_{i<j}\langle\omega_i-\bar{\omega},\omega_j-\bar{\omega}\rangle = \big\|\sum_{i=1}^{N}(\omega_i-\bar{\omega})\big\|^2-\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2 = -\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2$, the right-hand side equals
$$\frac{1}{S^2}\Big(\frac{S}{N}-\frac{S(S-1)}{N(N-1)}\Big)\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2 = \frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2.$$
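The without-replacement sampling variance of partial client participation, stated as $\frac{N-S}{S(N-1)}\cdot\frac{1}{N}\sum_{i=1}^{N}\|\omega_i-\bar{\omega}\|^2$ in Lemma 2 above, can be checked exactly by enumerating all size-$S$ subsets for a small $N$. The instance below uses random vectors with $N=6$, $S=3$, chosen purely for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, S, d = 6, 3, 4
omega = rng.normal(size=(N, d))
omega_bar = omega.mean(axis=0)

# Exact expectation over all C(N, S) equally likely subsets.
subsets = list(itertools.combinations(range(N), S))
exact = np.mean([np.sum((omega[list(s)].mean(axis=0) - omega_bar) ** 2)
                 for s in subsets])

# Closed form: (N-S)/(S(N-1)) * (1/N) * sum_i ||omega_i - omega_bar||^2.
closed = (N - S) / (S * (N - 1)) * np.sum((omega - omega_bar) ** 2) / N
```

The two quantities agree to machine precision; note that the factor $N-S$ vanishes when $S=N$ (full participation), correctly recovering zero sampling variance.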