# fadas_towards_federated_adaptive_asynchronous_optimization__0aa744f1.pdf FADAS: Towards Federated Adaptive Asynchronous Optimization Yujia Wang 1 Shiqiang Wang 2 Songtao Lu 2 Jinghui Chen 1 Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. While the SGD-based FL algorithms have demonstrated considerable success in the past, there is a growing trend towards adopting adaptive federated optimization methods, particularly for training large-scale models. However, the conventional synchronous aggregation design poses a significant challenge to the practical deployment of those adaptive federated optimization methods, particularly in the presence of straggler clients. To fill this research gap, this paper introduces federated adaptive asynchronous optimization, named FADAS, a novel method that incorporates asynchronous updates into adaptive federated optimization with provable guarantees. To further enhance the efficiency and resilience of our proposed method in scenarios with significant asynchronous delays, we also extend FADAS with a delay-adaptive learning adjustment strategy. We rigorously establish the convergence rate of the proposed algorithms and empirical results demonstrate the superior performance of FADAS over other asynchronous FL baselines. 1. Introduction In recent years, federated learning (FL) (Mc Mahan et al., 2017) has drawn increasing attention as an efficient privacypreserving distributed machine learning paradigm. An FL framework consists of a central server and numerous clients, where clients collaboratively train a global model without sharing their private data. FL entails each client conducting multiple local iterations, while the central server periodically 1College of Information Sciences and Technology, Pennsylvania State University, State College, PA, USA 2IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. Correspondence to: Yujia Wang , Shiqiang Wang , Songtao Lu , Jinghui Chen . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). aggregates these local updates into the global model. Following the original design of the Fed Avg algorithm (Mc Mahan et al., 2017), a large number of stochastic gradient descent (SGD)-based FL methods have emerged, aiming to improve the performance or efficiency of Fed Avg (Karimireddy et al., 2020; Acar et al., 2021; Wang et al., 2020b). In addition to the successes of SGD-based algorithms in enhancing the efficiency of FL, the adoption of adaptive optimization techniques is becoming increasingly prevalent in FL. Adaptive optimization techniques such as Adam (Kingma & Ba, 2015) and Adam W (Loshchilov & Hutter, 2017) have proven their advantages over SGD in effectively training or fine-tuning large-scale models like BERT (Devlin et al., 2018), Vi T (Dosovitskiy et al., 2021), and Llama (Touvron et al., 2023). This progress has encouraged the incorporation of adaptive optimization into the FL settings, taking advantage of their ability to navigate update directions and dynamically adjust learning rates. For example, Fed Adam (Reddi et al., 2021) and Fed AMS (Wang et al., 2022b) employ global adaptive optimization after the server aggregates local model updates. 
Moreover, strategies such as FedLALR (Sun et al., 2023a), FedLADA (Sun et al., 2023b), and FAFED (Wu et al., 2023) replace SGD with the Adam optimizer for the local training phase, exemplifying the utility of local adaptive optimization in FL. However, existing methods in adaptive FL still rely on traditional synchronous aggregation approaches, where the server must wait for all participating clients to complete their local training before the global update. This reliance presents a significant challenge to the practical implementation of adaptive FL methodologies, as the server is required to wait for slower clients, which may have limited computation or communication capabilities. While asynchronous FL strategies such as FedBuff (Nguyen et al., 2022) and FedAsync (Xie et al., 2019) have been investigated to improve scalability and to study the impact of client delays on the convergence of SGD-based FL algorithms, the specific implications of asynchronous delays on nonlinear adaptive gradient operations are not completely understood. This motivates us to explore the following question: Can we develop an asynchronous method for adaptive federated optimization (with provable guarantees) that enhances training efficiency and is resilient to asynchronous delays?
In this paper, we propose FADAS, Federated ADaptive ASynchronous optimization, to address this challenge. FADAS introduces asynchronous updates within the adaptive federated optimization framework and integrates a delay-adaptive mechanism for adjusting the learning rate in response to burst delays. We summarize our contributions as follows:
- We propose FADAS, a novel adaptive federated optimization method that extends traditional adaptive federated optimization to support asynchronous client updates. We prove that FADAS achieves a convergence rate of $\mathcal{O}\big(\frac{1}{\sqrt{TM}} + \frac{\tau_{\max}\tau_{\mathrm{avg}}}{T}\big)$ w.r.t. the number of global communication rounds $T$ and the number of accumulated updates $M$, with bounded worst-case delay, denoted by $\tau_{\max}$, and the average of the maximum delay over all rounds, denoted by $\tau_{\mathrm{avg}}$.
- To further reduce the dependency on the worst-case delay term $\tau_{\max}$ in the convergence rate, we extend FADAS with a delay-adaptive learning rate adjustment strategy. Our theoretical results demonstrate that the inclusion of a delay-adaptive learning rate effectively diminishes the dependency on $\tau_{\max}$ in the convergence rate.
- We conduct experiments across various asynchronous delay settings in both vision and language modeling tasks. Our results indicate that the proposed FADAS, with or without the delay-adaptive learning rate, outperforms other asynchronous FL baselines. In particular, the delay-adaptive FADAS demonstrates significant advantages in scenarios with large worst-case delays. Moreover, our experimental results on simulating the wall-clock training time underscore the efficiency of our proposed FADAS approach.
2. Related Work
Federated learning. FL, as introduced by McMahan et al. (2017), has become a pivotal framework for collaboratively training machine learning models on edge devices while keeping local data private.
Following the initial Fed Avg algorithm, several works studied the theoretical analysis and empirical performance of it (Lin et al., 2018; Stich, 2018; Li et al., 2019a; Karimireddy et al., 2020; Wang & Joshi, 2021; Yang et al., 2021), and a range of works aim to improve Fed Avg from different perspectives, such as reducing the impact of data heterogeneity (Karimireddy et al., 2020; Acar et al., 2021; Wang et al., 2020b), saving the communication overhead (Reisizadeh et al., 2020; Jhunjhunwala et al., 2021), and adjusting the parameter aggregation procedure (Tan et al., 2022; Wang & Ji, 2023). Adaptive FL optimizations and adaptive updates. Besides traditional SGD-based methods, there is a line of works focusing on adaptive updates in FL. A local adaptive FL method with momentum-based variance-reduced gradient was used in FAFED (Wu et al., 2023). Li et al. (2023) proposed a framework for local adaptive gradient methods in Fed DA. Fed LALR (Sun et al., 2023a) uses local adaptive optimization in FL with local historical gradients and periodically synchronized learning rates. Fed LADA (Sun et al., 2023b) is an efficient local adaptive FL method with a locally amended technique. Jin et al. (2022) developed novel adaptive FL optimization methods from the perspective of dynamics of ordinary differential equations. Moreover, Reddi et al. (2021) introduced Fed Adagrad, Fed Adam and Fed Yogi, and Wang et al. (2022b) proposed Fed AMS for global adaptive FL optimizations. Several works of global adaptive learning rate (Jhunjhunwala et al., 2023) and adaptation in aggregation weights (Tan et al., 2022; Wang & Ji, 2023) are also related to adaptive learning rate adjustment. Asynchronous SGD and asynchronous FL. There have been extensive studies over the years about asynchronous optimization techniques, including asynchronous SGD and its various adaptations. For example, Hogwild (Niu et al., 2011) includes an applicable lock-free, coordinate-wise asynchronous method and has been widely used in multithread computation. A body of works focuses on the theoretical analysis and explorations of asynchronous SGD (Mania et al., 2017; Nguyen et al., 2018; Stich et al., 2021; Leblond et al., 2018; Glasgow & Wootters, 2022) and discusses the gradient delay in the convergence rate (Avdiukhin & Kasiviswanathan, 2021; Mishchenko et al., 2022; Koloskova et al., 2022; Wu et al., 2022). Within federated learning, innovative asynchronous aggregation algorithms like Fed Async (Xie et al., 2019) allow the server to update the global model once a client finishes local training, and Fed Buff (Nguyen et al., 2022) introduces a buffered aggregation approach. There are also many works focusing on algorithms based on Fed Buff with theoretical and/or empirical analysis (Toghani & Uribe, 2022; Ortega & Jafarkhani, 2023; Wang et al., 2023), and other aspects of asynchronous FL (Chen et al., 2020b; Yang et al., 2022; Bornstein et al., 2023). Although adaptive FL and asynchronous FL have achieved the success of training large machine learning models with desirable numerical performance, the exploration of asynchronous updates in the context of adaptive FL has not been well-studied yet. In this paper, we start with the asynchronous update framework in adaptive FL and further integrate delay-adaptive learning rate scheduling into it. 3. Preliminaries Federated learning. 
A general FL framework considers a distributed optimization problem across $N$ clients:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N}\sum_{i=1}^{N} F_i(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F_i(x; \xi_i)], \qquad (1)$$
where $x \in \mathbb{R}^d$ is the model parameter with $d$ dimensions, $F_i(x)$ is the loss function corresponding to client $i$, and $\mathcal{D}_i$ is the local data distribution on client $i$. The objective in Eq. (1) can be interpreted as setting $p_i = \frac{1}{N}$ for all clients in another commonly used objective function in FL, i.e., $f(x) = \sum_{i=1}^{N} p_i \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F_i(x; \xi_i)]$ with $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. FedAvg (McMahan et al., 2017) is a typical synchronous FL algorithm to solve Eq. (1), where in the $t$-th global round, each participating client $i$ performs local SGD updates as follows:
$$x^i_{t,k+1} = x^i_{t,k} - \eta_l \nabla F_i(x^i_{t,k}; \xi) \quad \text{and} \quad x^i_{t,0} = x_t, \qquad (2)$$
where $\eta_l$ is the local learning rate. After several local steps (e.g., $K$ steps of local training), the server performs a global averaging step after receiving all the updates from the assigned clients in $\mathcal{S}_t$, i.e., $x_{t+1} = \frac{1}{|\mathcal{S}_t|}\sum_{i \in \mathcal{S}_t} x^i_{t,K}$.
Adaptive optimization and its application to FL. Several adaptive optimizers have been proposed to improve the convergence of SGD, such as Adagrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), Adam (Kingma & Ba, 2015) and its variant AMSGrad (Reddi et al., 2018). In general machine learning optimization, Adam effectively inherits the benefits of both momentum and RMSProp optimizers, leading to better empirical performance in practical applications. Reddi et al. (2021) first introduced adaptive federated optimization, which applies the adaptive optimizers during the global aggregation steps in FL. FedAMSGrad (Tong et al., 2020) and FedAMS (Wang et al., 2022b) further adjust the effective global learning rate in adaptive FL. Specifically, FedAdam and FedAMS view the difference of local updates $\Delta^{\mathrm{sync}}_t = \frac{1}{|\mathcal{S}_t|}\sum_{i \in \mathcal{S}_t} \Delta^{i,\mathrm{sync}}_t = \frac{1}{|\mathcal{S}_t|}\sum_{i \in \mathcal{S}_t}(x^i_{t,K} - x_t)$ as a pseudo-gradient, and apply the Adam or AMSGrad optimizer when updating the global model $x_{t+1}$ using $\Delta^{\mathrm{sync}}_t$, i.e.,
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\Delta^{\mathrm{sync}}_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\Delta^{\mathrm{sync}}_t \odot \Delta^{\mathrm{sync}}_t, \quad x_{t+1} = x_t + \eta\,\frac{m_t}{\sqrt{v_t} + \epsilon} \;\; \text{(FedAdam)},$$
$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \quad x_{t+1} = x_t + \eta\,\frac{m_t}{\sqrt{\hat{v}_t} + \epsilon} \;\; \text{(FedAMS)},$$
where $\odot$ denotes the element-wise product of two vectors, and for vectors $x, y \in \mathbb{R}^d$, $\sqrt{x}$, $x/y$, and $\max(x, y)$ denote the element-wise square root, division, and maximum operations on the vectors.
Asynchronous updates in FL. In asynchronous FL, each client trains the model asynchronously and sends its update to the server once it finishes several steps of local training. FedBuff (Nguyen et al., 2022) improves the global update steps with the concept of a buffer, building on the initial FedAsync baseline (Xie et al., 2019). FedBuff requires the framework to maintain a given number (referred to as the concurrency $M_c$) of clients that are actively performing local training. At the $t$-th global round, after client $i$ finishes local training, it sends its local update $\Delta^i_t = x^i_{t-\tau,K} - x_{t-\tau}$ to the server, where $t - \tau$ is the global round at which client $i$ started local training and $0 \le \tau \le t$. The server accumulates the model update $\Delta^i_t$ into the global update direction, $\Delta_t \leftarrow \Delta_t + \Delta^i_t$, and sends the latest global model to a randomly selected idle client. When the number of accumulated updates reaches the given buffer size $M$, the server updates the global model with the average $\Delta_t / M$.
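To make the buffered aggregation just described concrete, the toy sketch below mimics the accumulate-then-average server loop with concurrency $M_c$ and buffer size $M$. The quadratic local objective, the constant names (`SERVER_LR`, `CONCURRENCY_MC`, `BUFFER_M`), and the random choice of which active client finishes first are illustrative assumptions rather than part of FedBuff's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 10
CONCURRENCY_MC = 4          # clients training at any given time
BUFFER_M = 2                # updates accumulated before one global step
SERVER_LR, LOCAL_LR, LOCAL_STEPS = 1.0, 0.1, 5

x_global = np.zeros(DIM)                    # global model x_t
delta_acc, m = np.zeros(DIM), 0             # buffer Delta_t and its counter

def local_training(x_start):
    """Run K local SGD steps on a toy quadratic and return the model delta."""
    x = x_start.copy()
    target = rng.normal(size=DIM)           # stands in for the client's local data
    for _ in range(LOCAL_STEPS):
        grad = x - target                   # gradient of 0.5 * ||x - target||^2
        x -= LOCAL_LR * grad
    return x - x_start                      # Delta_i = x_{t-tau,K} - x_{t-tau}

# Each active client remembers the global model it started from.
active = [x_global.copy() for _ in range(CONCURRENCY_MC)]

for _ in range(20):
    # Pretend an arbitrary active client finishes first (asynchrony).
    finished = rng.integers(len(active))
    delta_acc += local_training(active[finished])
    m += 1
    # Hand the *current* global model to a newly sampled client.
    active[finished] = x_global.copy()

    if m == BUFFER_M:                       # buffer full: take a global step
        x_global = x_global + SERVER_LR * (delta_acc / BUFFER_M)
        delta_acc, m = np.zeros(DIM), 0
```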
Meanwhile, clients who have not finished their local training will continue their training based on the previously received global model, and are not affected by the global model updates on the server. During the training, the framework always maintains a fixed number (Mc) of clients who are conducting local training. This is achieved by having the server randomly sample an idle client for training each time a client completes its local training and sends its update to the server. Discussion about synchronous and asynchronous methods. Synchronous FL typically offers consistency and stability, i.e., all client updates are based on the same global model, and this consistency may lead to a more stable and predictable learning process. However, when there exist one or a few clients that are much slower than the majority of clients, which often happens in large-scale systems, synchronous FL can be inefficient since every client needs to wait for the slowest client before progressing with the next round of training. Asynchronous FL is more efficient when clients have system heterogeneity such as diverse computational capabilities or communication bandwidth. In FL, if the delay among clients is relatively uniform, synchronous FL tends to be more stable and efficient. Overall, the choice between synchronous and asynchronous FL hinges on specific needs and system characteristics. Synchronous FL is ideal in homogeneous systems, while asynchronous FL is advantageous in heterogeneous systems with potential straggler clients. 4. Proposed Method: FADAS Although adaptive FL methods achieve promising convergence and generalization performance theoretically and empirically, the existing adaptive FL methods are restricted to synchronous settings, as the server needs to wait for all the assigned clients to finish their local updates for aggregation and then update the global model. However, those synchronous adaptive FL algorithms are susceptible to the presence of stragglers, where slower clients with insufficient computation or communication speed impede the progress of the global update. FADAS: Towards Federated Adaptive Asynchronous Optimization To improve the efficiency and resiliency of adaptive FL in the presence of stragglers, we introduce FADAS, a Federated ADaptive ASynchronous optimization method. Similar to Fed Adam and Fed AMS, the proposed FADAS algorithm takes the model update difference from clients as a pseudo-gradient and it updates the global model following an Adam-like update scheme. Algorithm 1 summarizes the details. FADAS keeps the local asynchronous training scheme as Fed Buff and maintains the concept of concurrency and buffer size for flexible control of the number of active clients and the frequency of global model update. In FADAS, after the server aggregates to obtain model update difference t, it finds an adaptive update direction, whose components are computed based on the AMSGrad optimizer (Reddi et al., 2018) as follows: mt = β1mt 1 + (1 β1) t, vt = β2vt 1 + (1 β2) t t, bvt = max(bvt 1, vt). (3) In general, FADAS enables clients to conduct local training in their own pace, and the server aggregates the asynchronous updates for global adaptive updates. It improves the training efficiency and scalability of over synchronous adaptive FL while inheriting the advantage of adaptive optimizer of reducing oscillations and stabilizing the optimization process. 
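As a concrete reading of Eq. (3), the sketch below applies one FADAS global step to the averaged pseudo-gradient $\Delta_t$; since $\Delta_t$ is a model-update difference rather than a gradient, the adaptive step is added to $x_t$. The function name, the state dictionary, and the default hyperparameter values are placeholders, not the paper's tuned settings.

```python
import numpy as np

def fadas_server_step(x, delta_avg, state, eta=1e-2, beta1=0.9, beta2=0.99, eps=1e-8):
    """One global FADAS update: treat the averaged client delta as a
    pseudo-gradient and apply an AMSGrad-style step as in Eq. (3)."""
    m = beta1 * state["m"] + (1.0 - beta1) * delta_avg                # first moment
    v = beta2 * state["v"] + (1.0 - beta2) * delta_avg * delta_avg    # second moment
    v_hat = np.maximum(state["v_hat"], v)                             # max-stabilized v
    x_new = x + eta * m / (np.sqrt(v_hat) + eps)                      # global model step
    state.update(m=m, v=v, v_hat=v_hat)
    return x_new

# usage sketch: called once per global round, after the buffer of M updates is averaged
d = 10
state = {"m": np.zeros(d), "v": np.zeros(d), "v_hat": np.zeros(d)}
x = np.zeros(d)
delta_avg = np.random.default_rng(0).normal(size=d)   # stands in for Delta_t / M
x = fadas_server_step(x, delta_avg, state)
```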
Although FADAS applies asynchronous local training to adaptive FL, the global adaptive optimizer adjusts the global update direction based only on the local updates, without considering the impact of asynchronous delay. Intuitively, a large asynchronous delay from a client means that its model update is made based on an outdated global model. This may negatively affect convergence, and we verify this intuition later in the theoretical analysis. This inspires us to apply a delay-adaptive learning rate adjustment to improve the resiliency of FADAS to stragglers with large delays. Specifically, we let the server track the delay of every received model update and adopt a delay-adaptive learning rate. We highlight the delay-adaptive steps in Algorithm 1; these steps are executed with almost no extra overhead.
Algorithm 1 FADAS (with delay adaptation)
Input: local learning rate $\eta_l$, global learning rate $\eta$, adaptive optimization parameters $\beta_1, \beta_2, \epsilon$, server concurrency $M_c$, buffer size $M$, delay threshold $\tau_c$
1: Initialize model $x_1$, set $\Delta_1 = 0$, $m_0 = 0$, $v_0 = 0$, $m = 0$, and sample a set $\mathcal{M}_0$ of $M_c$ active clients to run local SGD updates.
2: repeat
3:   if receive client update then
4:     Server accumulates the update from client $i$: $\Delta_t \leftarrow \Delta_t + \Delta^i_t$ and sets $m \leftarrow m + 1$
5:     Sample another client $j$ from the available clients
6:     Send the current model $x_t$ to client $j$, and run local SGD updates on client $j$
7:   end if
8:   if $m = M$ then
9:     $\Delta_t \leftarrow \Delta_t / M$
10:    Update $m_t, v_t, \hat{v}_t$ by (3)
11:    if delay-adaptive then
12:      Set $\eta_t$ to be delay-adaptive based on Eq. (4)
13:    else
14:      $\eta_t = \eta$
15:    end if
16:    Update the global model $x_{t+1} = x_t + \eta_t \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}$
17:    Set $m \leftarrow 0$, $\Delta_{t+1} \leftarrow 0$, $t \leftarrow t + 1$
18:  end if
19: until convergence
Delay tracking. In general, the server manages the delay record for each client through straightforward timestamping. For example, the server records the global round $t'$ at which it broadcasts the current global model $x_{t'}$ to client $i$, and the client conducts local training with $x_{t'}$. When the server receives $\Delta^i_t$ from client $i$ at round $t \ge t'$, the gradient delay for $\Delta^i_t$, which is $\tau^i_t = t - t'$, is updated and recorded on the server.
Delay-adaptive learning rate. Assume that for each global update round $t$, clients in the set $\mathcal{M}_t$ (with $|\mathcal{M}_t| = M$) send updates to the server. The received model updates at global round $t$ have a maximum delay $\tau^{\max}_t$ defined as $\tau^{\max}_t := \max\{\tau^i_t, i \in \mathcal{M}_t\}$. Given a delay threshold $\tau_c$, we define a delay-adaptive learning rate as:
$$\eta_t = \begin{cases} \eta & \text{if } \tau^{\max}_t \le \tau_c, \\ \min\{\eta,\ 1/\tau^{\max}_t\} & \text{if } \tau^{\max}_t > \tau_c. \end{cases} \qquad (4)$$
Intuitively, this design implies that we need to turn the learning rate down for a model update $\Delta_t$ with a larger current-step delay. Specifically, if the current-step maximum delay $\tau^{\max}_t$ is larger than a given threshold $\tau_c$, we scale down the learning rate for this step in proportion to $1/\tau^{\max}_t$ (also capped by the constant learning rate $\eta$) to avoid the high-latency update worsening convergence.
Comparison with FedAsync (Xie et al., 2019). FedAsync also studies delay-adaptive weighted averaging during global model updates. In FedAsync, after the server receives a local model $x_{\mathrm{new}}$, it updates $x_t$ based on $x_t = (1 - \alpha_t)x_{t-1} + \alpha_t x_{\mathrm{new}}$, and FedAsync includes a hinge strategy for $\alpha_t$ which is similar to our delay-adaptive strategy in Eq. (4). However, unlike FedAsync, where the server updates the global model immediately upon receiving a new update from a client, FADAS updates the global model less frequently.
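The delay bookkeeping and the rule in Eq. (4) translate almost line-for-line into code; the helper names (`dispatch_round`, `on_broadcast`, `on_receive`) and the numbers in the example are illustrative only.

```python
def delay_adaptive_lr(eta, tau_max_t, tau_c):
    """Eq. (4): keep the base global learning rate eta while the round's maximum
    delay stays at or below the threshold tau_c; otherwise scale down as
    1/tau_max_t, still capped by eta."""
    if tau_max_t <= tau_c:
        return eta
    return min(eta, 1.0 / tau_max_t)

# Minimal server-side delay tracking via timestamping:
# dispatch_round[i] is the round t' at which client i last received the model.
dispatch_round = {}

def on_broadcast(client_id, current_round):
    dispatch_round[client_id] = current_round

def on_receive(client_id, current_round):
    return current_round - dispatch_round[client_id]   # tau_i_t = t - t'

# Example: a client got the model at round 3 and its update arrives at round 12,
# so its delay is 9; with eta = 0.5 and tau_c = 4 the round's rate drops to 1/9.
on_broadcast("client_3", 3)
tau = on_receive("client_3", 12)                       # 9
print(delay_adaptive_lr(eta=0.5, tau_max_t=tau, tau_c=4))   # 0.111... = 1/9
```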
In FADAS, the server accumulates FADAS: Towards Federated Adaptive Asynchronous Optimization M local updates before a global update. Moreover, the convergence analysis in Fed Async did not consider their delay adaptation procedure, while we provide a convergence analysis incorporating the effect of delay adaptation in the next section. 5. Theoretical Analysis In this section, we delve into the theoretical analysis of our proposed FADAS algorithm. We first introduce some common assumptions required for the analysis. Subsequently, we present the analysis in two parts: one focusing on FADAS without delay adaptation, as discussed in Section 5.1, and the other on the delay-adaptive FADAS in Section 5.2. Assumption 5.1 (Smoothness). Each objective function on the i-th worker Fi(x) is L-smooth, i.e., x, y Rd, Fi(x) Fi(y) L x y . Assumption 5.2 (Bounded Variance). Each stochastic gradient is unbiased and has a bounded local variance, i.e., for all x, i [N], we have E Fi(x; ξ) Fi(x) 2 σ2, and the loss function on each worker has a global variance bound, 1 N PN i=1 Fi(x) f(x) 2 σ2 g. Assumption 5.1 and 5.2 are standard assumptions in federated non-convex optimization literature (Li et al., 2019b; Yang et al., 2021; Reddi et al., 2021; Wang et al., 2022b; Wang & Ji, 2023). The global variance upper bound of σ2 g in Assumption 5.2 measures the data heterogeneity across clients, and a global variance of σ2 g = 0 indicates a uniform data distribution across clients. Assumption 5.3 (Bounded Gradient). Each loss function on the i-th worker Fi(x) has G-bounded stochastic gradient on ℓ2 norm, i.e., for all ξ, we have Fi(x; ξ) G. Assumption 5.3 is necessary for adaptive gradient algorithms for both general (Kingma & Ba, 2015; Chen et al., 2020a), distributed (Wang et al., 2022a) and federated adaptive optimization (Reddi et al., 2021; Wang et al., 2022b; Sun et al., 2023b). This is because the effective global learning rate for adaptive gradient methods is η bvt+ϵ, and we need a lower bound for η bvt+ϵ to guarantee that the effective learning rate does not vanish to zero. Assumption 5.4 (Bounded Delay of Gradient Computation). Let τ i t represent the delay for global round t and client i which is applied in Algorithm 1. The delay τ i t is the difference between the current global round t and the global round at which client i started to compute the gradient. We assume that the maximum gradient delay (worst-case delay) is bounded, i.e., τmax = maxt [T ],i [N]{τ i t} < . Assumption 5.4 is common in analyzing asynchronous and anarchic FL algorithms which incorporate the gradient de- lays into their algorithm design (Koloskova et al., 2022; Yang et al., 2021; Nguyen et al., 2022; Toghani & Uribe, 2022; Wang et al., 2023). Assumption 5.5 (Uniform Arrivals of Gradient Computation). Let the set Mt (with size M) include clients that transmit their local updates to the server in global round t. We assume that the clients update arrivals are uniformly distributed, i.e., from a theoretical perspective, the M clients in Mt are randomly sampled without replacement from all clients [N] according to a uniform distribution1. Assumption 5.5 is also discussed in Anarchic FL (Yang et al., 2022), which has been utilized to analyze the AFA-CD algorithm proposed therein. 5.1. Convergence Rate of FADAS For expository convenience, in the following, we provide the theoretical convergence analysis of FADAS under the case of β1 = 0. The theoretical analysis and the proof for the general case of 0 β1 < 1 are provided in Appendix A. 
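For ease of reference in the analysis that follows, Assumptions 5.1-5.4 can be collected in compact form (this is a restatement in the notation above, not a strengthening):

```latex
\begin{align*}
&\text{(5.1 Smoothness)}        && \|\nabla F_i(x) - \nabla F_i(y)\| \le L\|x - y\|,
    \quad \forall x, y \in \mathbb{R}^d,\ i \in [N],\\
&\text{(5.2 Bounded variance)}  && \mathbb{E}\|\nabla F_i(x;\xi) - \nabla F_i(x)\|^2 \le \sigma^2,
    \qquad \frac{1}{N}\sum_{i=1}^{N}\|\nabla F_i(x) - \nabla f(x)\|^2 \le \sigma_g^2,\\
&\text{(5.3 Bounded gradient)}  && \|\nabla F_i(x;\xi)\| \le G,
    \quad \forall x,\ \xi,\ i \in [N],\\
&\text{(5.4 Bounded delay)}     && \tau_{\max} = \max_{t \in [T],\, i \in [N]} \tau^i_t < \infty .
\end{align*}
```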
We define the average of the maximum delay over time as τavg = 1 T PT t=1 τ max t = 1 T PT t=1 maxi [N]{τ i t} which is useful in our analysis. Theorem 5.6. Under Assumptions 5.1 5.5, let T represent the total number of global rounds, K be the number of local SGD training steps and M be the number of the accumulated updates (buffer size) in each round. If the learning rate η and ηl satisfies ηηl min n ϵ2M(N 1) 180CGN(N M)τmax KL, CGN(M 1)τ 3 max KL o , ηl ϵ 360CGτmax KL, then the global iterates {xt}T t=1 of Algorithm 1 satisfy t=1 E[ f(xt) 2] 4CG ηηl KT F + 20CGη2 l KL2(σ2 + 6Kσ2 g) ϵ + 8CGη2η2 l KL2τavgτmax Mϵ3 + 12CGηηl L N 1 [15η2 l K2L2(σ2 + 6Kσ2 g) + 3Kσ2 g] , where F = f(x1) f , f = minx f(x) > and CG = ηl KG + ϵ. Corollary 5.7. If we choose the global learning rate η = Θ( M) and ηl = Θ T K(σ2+Kσ2g) in Theorem 5.6, then for sufficiently large T, the global iterates {xt}T t=1 1This assumption is only used for theoretical analysis. Our experiments that show the advantage of FADAS empirically do not rely on this assumption. FADAS: Towards Federated Adaptive Asynchronous Optimization of Algorithm 1 satisfy t=1 E[ f(xt) 2] O M + Fτmaxτavg Remark 5.8. Corollary 5.7 suggests that given sufficiently large T and relatively small worst-case delay τmax, the proposed FADAS (without delay-adaptive learning rate) achieves a convergence rate of O 1 T M w.r.t. T and M. Comparison to asynchronous FL methods. Compared with the analysis for Fed Buff in Nguyen et al. (2022) and Toghani & Uribe (2022), our analysis for FADAS obtains a relaxed dependency on the worst-case gradient delay τmax, and FADAS achieves a slightly better rate on nondominant term than O 1 T + τ 2 max T obtained in Toghani & Uribe (2022). Moreover, Wang et al. (2023) also studied the convergence for Fed Buff with relaxed requirements for τmax, and our FADAS achieves a similar convergence of O 1 T M + τmaxτavg T as in Wang et al. (2023). It is worthwhile to mention that recently CA2FL (Wang et al., 2023) improves the convergence of asynchronous FL under heterogeneous data distributions, while the improvement is obtained by using the cached variable on the server for global update calibration. Note that when τmax in Eq. (6) is large, particularly in cases where τmax M , then τmaxτavg T becomes the dominant term in the convergence rate. This implies that a large worstcase delay τmax may lead to a worse convergence rate. In the next subsection, we demonstrate that the delay-adaptive learning rate strategy can relieve this problem and enhance FADAS with better resilience to large worst-case delays. 5.2. Convergence Rate of Delay-adaptive FADAS In the following, we provide the convergence analysis for delay-adaptive FADAS with β1 = 0. To get started, we first define the median of the maximum delay over all communication rounds [T]: τmedian = median{τ max 1 , τ max 2 , ..., τ max T }. (7) The definition of τmedian implies that the number of global update rounds that have a maximum delay greater than τmedian is less than half of the total number of global updates T. With this definition, we present the following theorem characterizing the convergence rate of delay-adaptive FADAS. Theorem 5.9. Under Assumptions 5.1 5.5, let T be the total number of global rounds, K be the number of local SGD training steps and M be the number of the buffer size in each round. 
If the learning rate η and ηl satisfies ηηl min n ϵ2M(N 1) 60CGN(N M)τmax KL, CGN(M 1)τ 3 max KL o , ηl ϵ 360CGτmax KL and η M τc , then the global iterates {xt}T t=1 of Algorithm 1 satisfy 1 PT t=1 ηt t=1 ηt E[ f(xt) 2] 4CG ηηl KT F + 20CGη2 l KL2(σ2 + 6Kσ2 g) ϵ + 8CGη3η2 l KL2Tτavg Mϵ3 PT t=1 ηt σ2 + 8CGη2η2 l KL2Tτavg Mϵ3 PT t=1 ηt N 1 [15η2 l K2L2(σ2 + 6Kσ2 g) + 3Kσ2 g] + 4CGηηl L N 1 [15η2 l K2L2(σ2 + 6Kσ2 g) + 3Kσ2 g] , where F = f(x1) f , f = minx f(x) > and CG = ηl KG + ϵ. Corollary 5.10. If we pick τc = τmedian, the global learning rate η = Θ( M/τc) and ηl = Θ τc T K(σ2+Kσ2g) , then for sufficiently large T, the global iterates {xt}T t=1 of Algorithm 1 satisfy 1 PT t=1 ηt t=1 ηt E[ f(xt) 2] O T + F(τ 2 c + τcτavg) Remark 5.11. Corollary 5.10 suggests that with sufficiently large T, delay-adaptive FADAS also achieves a convergence rate of O 1 T M w.r.t. T and M. Remark 5.12. Compared to the convergence rate in Corollary 5.7, the convergence rate in Corollary 5.10 does not rely on the (possibly large) worst-case delay τmax. In cases where τc = τmedian τavg τmax, Corollary 5.10 relaxes the requirement from τmax to τmedian for achieving the desired convergence rate. Since τmedian describes the median of τ max t = maxi [N]{τ i t} in each round t, the convergence rate in Corollary 5.10 is less sensitive to stragglers who may cause a large worst-case delay in the system. 6. Experiments We explore the performance of our proposed FADAS algorithm through experiments on vision and language tasks, using the CIFAR-10/100 (Krizhevsky et al., 2009) datasets with Res Net-18 model (He et al., 2016) for vision tasks, and applying the pre-trained BERT base model (Devlin et al., 2018) for fine-tuning several datasets from FADAS: Towards Federated Adaptive Asynchronous Optimization the GLUE benchmark dataset (Wang et al., 2018) for language tasks. We compare our proposed FADAS algorithm against asynchronous FL baselines, such as Fed Buff (without differential privacy) (Nguyen et al., 2022) and Fed Async (Xie et al., 2019), a synchronous SGD-based FL baseline Fed Avg (Mc Mahan et al., 2017), and a synchronous adaptive FL baseline Fed AMS (Wang et al., 2022b). We summarize some crucial implementation details in the following, and we leave some additional results and experiment details to Appendix D. Our code can be found at https://github.com/yujiaw98/FADAS. Overview of vision tasks implementation. We set up a total of 100 clients for the mild delay scenario, in which the concurrency Mc = 20 and the buffer size M = 10 by default. We also set up a total of 50 clients for the large worst-case delay scenario, with Mc = 25 and M = 5 correspondingly. For both settings, we partition the data on clients based on the Dirichlet distribution following Wang et al. (2020a;b), and the parameter α used in Dirichlet sampling determines the degree of data heterogeneity. We apply two levels of data heterogeneity with α = 0.1 and α = 0.3. Each client conducts two local epochs of training, and the mini-batch size is 50 for each client. The local optimizer for all methods is SGD with weight decay 10 4, and we grid search the global and local learning rates individually for each method. Overview of language tasks implementation. Considering the total number of data samples in the language classification datasets, we set up a total of 10 clients, partition the data on clients based on the labels, and we apply a heterogeneity level of α = 0.6. Each client conducts one local epoch and the mini-batch size is 32 for each client. 
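As one common implementation of the Dirichlet-based split cited above (Wang et al., 2020a;b), the sketch below divides each class's samples across clients according to a Dirichlet($\alpha$) draw; it is a generic recipe rather than the exact partitioning script used in these experiments, and the fake labels in the usage line are placeholders.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients so that each class's samples are
    divided according to a Dirichlet(alpha) draw over clients; smaller alpha
    gives more skewed (heterogeneous) client label distributions."""
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(num_clients))   # class-c share per client
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# usage sketch: 100 clients, alpha = 0.1 (more heterogeneous) or alpha = 0.3
fake_labels = np.random.default_rng(1).integers(0, 10, size=50_000)
splits = dirichlet_partition(fake_labels, num_clients=100, alpha=0.1)
```

Smaller $\alpha$ (e.g., 0.1) concentrates most of a class on a few clients, matching the more heterogeneous setting described above.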
The local optimizer for all methods is SGD with weight decay 10 4, and we grid search the global and local learning rates individually for each method. We set the concurrency Mc = 5 and buffer size M = 3 by default. We employ the widely-used low-rank adaptation method, Lo RA (Hu et al., 2021), as a parameter-efficient fine-tuning strategy for our language classification tasks. This involves freezing the original pre-trained weight matrix W0 Rd k and fine-tuning W through low-rank decomposition, where W = W0 + αLo RA W = W0 + αLo RABA, B Rd r, and A Rr k, and we adopt r = 1 and αLo RA = 8 in our experiments. Overview of delay simulation. In our experiments, we simulate the asynchronous environment as follows. Initially, we partition clients into three categories, including Small, Medium, and Large delay, at the start of training and tag them with a label reflective of their delay magnitude. This partitioning was executed via a Dirichlet sampling process controlled by the parameter γ. A smaller γ value corresponds to a higher proportion of clients experiencing large delays. Unless otherwise specified in subsequent experiments, we set γ = 1. To mimic actual wall-clock running times within each delay category, we apply uniform sampling at each round for each client. We adopt the following uniform distributions to simulate wall-clock running time for both the large worst-case delay and mild delay settings as shown in Table 1. 6.1. Results on Vision Tasks Large worst-case delay. Under this setting, we simulate the wall-clock running time by letting a small proportion of clients have more significant delays than other clients. Tables 2 and 3 show the overall performance of training the Res Net-18 model on CIFAR-10 and CIFAR-100, respectively. The results show that FADAS, especially with a delay-adaptive learning rate, offers significant advantages in terms of test accuracy. Compared to Fed Async and Fed Buff, both FADAS methods achieve higher accuracy, and FADAS with delay-adaptive learning rates is shown to be more stable during the learning process with lower standard derivation. In these experiments, we conduct a total of T = 500 global communication rounds, and the maximum delay τmax = 127, which even more than a quarter of the total number of global communication rounds. Notably, as seen in Tables 2 and 3, Fed Async shows severely fluctuating in test accuracy, suggesting that it may be less reliable in situations with large worst-case delays. Table 1. Overview for wall-clock delay simulation (in units of 10 seconds). Delay Small Medium Large Large worst-case U(1, 2) U(3, 5) U(50, 80) Mild U(1, 2) U(3, 5) U(5, 8) Mild delay. Under this setting, we simulate the wall-clock running time for clients by assuming that all clients can finish their local training within a comparable duration (see Table 1). Tables 4 and 5 show the overall performance of training the Res Net-18 model on CIFAR-10 and CIFAR-100 under mild delay. The results highlight that both FADAS and its delay-adaptive variant achieve superior test accuracy than Fed Async and Fed Buff. 6.2. Results on Language Tasks The performance for fine-tuning the BERT base model on three GLUE benchmark datasets, RTE, MRPC, and SST-2, under mild delay conditions are shown in Table 6, which illustrates that FADAS and its delay-adaptive counterpart consistently outperform the results of Fed Async and Fed Buff across the three datasets. 
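The wall-clock delay simulation can be sketched as follows. Only the per-category uniform ranges come from Table 1 (shown here for the large worst-case setting); the exact mapping from $\gamma$ to the category proportions is not spelled out above, so the Dirichlet($\gamma$) draw below is a placeholder assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-category wall-clock time ranges from Table 1 (in units of 10 seconds),
# large worst-case delay setting.
DELAY_RANGES = {"small": (1, 2), "medium": (3, 5), "large": (50, 80)}

def assign_delay_categories(num_clients, gamma=1.0):
    """Assign each client a fixed delay category at the start of training.
    Drawing the three category proportions from Dirichlet(gamma) is one
    plausible reading of the gamma-controlled split, not the exact procedure."""
    proportions = rng.dirichlet(alpha=[gamma] * len(DELAY_RANGES))
    categories = list(DELAY_RANGES)
    return rng.choice(categories, size=num_clients, p=proportions)

def sample_round_time(category):
    """Per-round wall-clock time: uniform within the category's range (Table 1)."""
    low, high = DELAY_RANGES[category]
    return rng.uniform(low, high)

# usage sketch: 50 clients, gamma = 1 as in the default setting
cats = assign_delay_categories(50, gamma=1.0)
times = [sample_round_time(c) for c in cats]
```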
Fed Async achieves good performance in SST-2 but is less satisfactory in RTE and MRPC, and Fed Buff presents an overall lower accu- FADAS: Towards Federated Adaptive Asynchronous Optimization Table 2. The test accuracy on training Res Net-18 model on CIFAR10 dataset with two data heterogeneity levels in a large worst-case delay scenario for 500 communication rounds. We report the average accuracy and standard derivation over the last 5 rounds, and we abbreviate delay-adaptive FADAS to FADASda in this and subsequent tables. Dir(0.1) Dir (0.3) Method Acc. & std. Acc. & std. Fed Async 50.92 5.03 75.3 6.18 Fed Buff 38.68 8.16 51.32 4.43 FADAS 72.0 0.94 73.27 1.37 FADASda 73.96 3.54 79.68 2.14 Table 3. The test accuracy on training Res Net-18 model on CIFAR100 dataset with two data heterogeneity levels in a large worst-case delay scenario for 500 communication rounds. Dir(0.1) Dir (0.3) Method Acc. & std. Acc. & std. Fed Async 46.51 4.76 38.55 7.36 Fed Buff 13.04 5.5 18.63 5.13 FADAS 47.84 0.59 53.64 0.52 FADASda 50.31 1.0 57.18 0.31 racy with larger standard derivation compared with FADAS. The delay-adaptive FADAS shows parity with the standard FADAS algorithm under mild delays. Moreover, FADAS achieves significant accuracy improvements on RTE and MRPC datasets against the SGD-based asynchronous FL baselines, further demonstrating the intuition of developing the FADAS method. Running time speedup. Table 7 demonstrates the efficiency of FADAS and its delay-adaptive variant by comparing their performance with two synchronous FL methods in reaching the target validation accuracy across different dataset. Notably, FADAS consistently outperforms Fed Avg and Fed AMS in terms of wall-clock running time, requiring significantly fewer time units to reach the desired accuracy levels. In vision classification tasks such as CIFAR10 and CIFAR-100, the standard FADAS shows a significant reduction in training time, achieving 8 speedup than Fed Avg and more than 2.5 speedup than Fed AMS. The delay-adaptive FADAS shows similar results as the standard version. For language classification tasks, FADAS also improves the training time compared with Fed AMS and Fed Avg. These results highlight the scalability and efficiency of FADAS, especially when considering the computational constraints in practical FL environments. 6.3. Ablation studies Sensitivity of delay adaptive learning rates. Figure 1 (a) exhibits the ablation study for different delay threshold τc for the delay-adaptive FADAS under the scenario of large Table 4. The test accuracy on training Res Net-18 model on CIFAR10 dataset with two data heterogeneity levels under mild delay scenario. Dir(0.1) Dir (0.3) Method Acc. & std. Acc. & std. Fed Async 42.48 4.93 71.76 3.85 Fed Buff 72.15 2.71 79.82 3.25 FADAS 77.68 2.32 82.93 0.81 FADASda 78.93 0.83 83.91 0.54 Table 5. The test accuracy on training Res Net-18 model on CIFAR100 dataset with two data heterogeneity levels under mild delay scenario. Dir(0.1) Dir (0.3) Method Acc. & std. Acc. & std. Fed Async 45.26 7.04 53.41 8.94 Fed Buff 53.70 1.13 56.26 1.64 FADAS 57.37 0.47 61.22 0.31 FADASda 57.21 0.45 60.34 0.42 worst-case delays. Following Eq. (4), τc provides a threshold so that we reduce the learning rate if there exists a client with extremely large delay. The experiment compares the accuracy of three thresholds τc = 1, 4, 8, 10, and τc = 4 shows very similar test accuracy as τc = 10. 
The result in Figure 1 (a) shows that using $\tau_c = 8$ obtains a slightly better result than using $\tau_c = 1$, $\tau_c = 4$, and $\tau_c = 10$. Interestingly, in this large worst-case delay setting we observe the average of the maximum delay $\tau_{\mathrm{avg}} = 10.89$, the median of the maximum delay $\tau_{\mathrm{median}} = 6.0$, and the maximum delay during training $\tau_{\max} = 127$, which shows $\tau_{\mathrm{median}} < \tau_{\mathrm{avg}} \ll \tau_{\max}$, confirming the practicality of our analysis as discussed in Remark 5.12. Together with the theoretical and experimental results, we find that the optimal choice of $\tau_c$ may depend on the actual delay during training.
Ablation for concurrency Mc and buffer size M. Figure 1 (b) presents the test accuracy of both the standard and delay-adaptive FADAS for different concurrency levels $M_c$, given the same buffer size $M = 5$. The delay-adaptive FADAS achieves higher accuracy than FADAS when the concurrency is $M_c = 15$ and $M_c = 25$, and the delay-adaptive version has worse accuracy at the larger concurrency $M_c = 35$.
Table 6. The test accuracy of parameter-efficient fine-tuning of the BERT base model on three datasets from the GLUE benchmark with heterogeneous data partitioning and mild delay.
Method      RTE (Acc. ± std.)   MRPC (Acc. ± std.)   SST-2 (Acc. ± std.)
FedAsync    49.46 ± 2.66        69.71 ± 1.02         90.02 ± 0.79
FedBuff     61.61 ± 4.90        76.80 ± 6.05         78.37 ± 4.86
FADAS       64.26 ± 2.30        83.33 ± 1.20         90.76 ± 0.26
FADASda     65.10 ± 2.40        83.09 ± 1.71         90.05 ± 1.80
Table 7. Training/fine-tuning time simulation (in units of 10 seconds) to reach the target test accuracy on the server under mild delay scenarios. For each dataset, the concurrency Mc is fixed for a fair comparison.
Dataset     Target Acc.   FedAvg   FedAMS   FADAS   FADASda
CIFAR-10    75%           2257.7   648.7    228.0   237.5
CIFAR-100   50%           1806.3   546.9    209.8   209.8
RTE         63%           921.9    412.4    376.2   436.9
MRPC        80%           1018.1   424.0    368.3   370.1
SST-2       90%           -        495.2    73.8    57.2
Figure 1 (c) presents an ablation study on the buffer size for our proposed FADAS algorithm. It compares the performance of buffer sizes $M = 3, 5, 10$ with their delay-adaptive counterparts over total client updates, i.e., the number of times the server receives updates from clients. It shows that with the same number of client trips, increasing the buffer size $M$ tends to achieve higher accuracy. This is also due to the design of the concurrency-buffer size framework, as increasing the buffer size moves closer to traditional synchronous FL algorithms, i.e., clients are more likely to be up-to-date with the server. We also provide the comparison w.r.t. the global communication round in Figure 1 (d).
Figure 1. Several ablation studies based on training the ResNet-18 model on CIFAR-10 data under the large worst-case delay setting. All panels report test accuracy: (a) ablation on τc vs. rounds; (b) ablation on concurrency (Mc = 15, 25, 35, with and without delay adaptation) vs. rounds; (c) ablation on buffer size (M = 3, 5, 10, with and without delay adaptation) vs. client updates; (d) the same buffer-size ablation vs. rounds; (e) run time (in units of 10 seconds) for FADAS; (f) run time for FADASda.
Fig-
ure 1 (d) shows that as the buffer size M increases, i.e., the number of clients contributing to one step of global update increases, the test accuracy also increases. Moreover, we simulate the running time (similar to the setting for Table 7) for different buffer sizes M to investigate the time efficiency for adopting different buffer sizes. Figure 1 (e) and (f) show the run time for FADAS and delay-adaptive FADAS. They reveal that a smaller buffer size (M = 3) may have less training time to achieve a target accuracy, e.g., 70%. These results demonstrate that using smaller buffer sizes may yield higher accuracy in the early stage of training. In conjunction with the results shown in Figure 1 (c) and (d), we think there is a trade-off between the time of reaching some initial target accuracy (that is slightly lower than the final accuracy) and the final accuracy with regard to the buffer size. A larger buffer size M may yield improved final accuracy at convergence, but it also means that the server needs to wait for slower clients and there are less frequent updates of the global model, so the training speed at initial rounds can be slower. 7. Conclusion In this paper, we propose FADAS, a novel asynchronous FL method that addresses the challenges of asynchronous updates in adaptive federated optimization. Based on the standard FADAS, we further integrate delay-adaptive learning rates to enhance the resiliency to stragglers with large delays. We theoretically establish the convergence rate for both standard and delay-adaptive FADAS under nonconvex stochastic settings. Our theoretical analysis indicates that the delay-adaptive algorithm substantially reduces the impact of severe worst-case delays on the convergence rate. Empirical evaluations across multiple tasks affirm that FADAS outperforms existing asynchronous FL methods and offers improved training efficiency compared to synchronous adaptive FL methods. Acknowledgments We thank the anonymous reviewers for their helpful comments. This work is partially supported by the National Science Foundation under Grant No. 2348541. The views FADAS: Towards Federated Adaptive Asynchronous Optimization and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies. Impact Statement This paper will make long-lasting contributions to the field of asynchronous federated learning and adaptive optimization. The focus of this work is on the technical advancement and optimization development of federated learning algorithms, and while there are numerous potential societal impacts of machine learning at large, this research does not necessitate a specific discussion on the societal consequences. Acar, D. A. E., Zhao, Y., Matas, R., Mattina, M., Whatmough, P., and Saligrama, V. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=B7v4QMR6Z9w. Avdiukhin, D. and Kasiviswanathan, S. Federated learning under arbitrary communication patterns. In Proceedings of the 38th International Conference on Machine Learning, pp. 425 435, 2021. Bornstein, M., Rabbani, T., Wang, E. Z., Bedi, A., and Huang, F. SWIFT: Rapid decentralized federated learning via wait-free model communication. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=jh1n Cir1R3d. Chen, J., Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. 
Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2020a. Chen, T., Jin, X., Sun, Y., and Yin, W. Vafl: a method of vertical asynchronous federated learning. ar Xiv preprint ar Xiv:2007.06081, 2020b. Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. ar Xiv preprint ar Xiv:1808.02941, 2018. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011. Glasgow, M. R. and Wootters, M. Asynchronous distributed optimization with stochastic delays. In International Conference on Artificial Intelligence and Statistics, pp. 9247 9279. PMLR, 2022. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. ar Xiv preprint ar Xiv:2106.09685, 2021. Jhunjhunwala, D., Gadhikar, A., Joshi, G., and Eldar, Y. C. Adaptive quantization of model updates for communication-efficient federated learning. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3110 3114. IEEE, 2021. Jhunjhunwala, D., Wang, S., and Joshi, G. Fed Ex P: Speeding up federated averaging via extrapolation. ar Xiv preprint ar Xiv:2301.09604, 2023. Jin, J., Ren, J., Zhou, Y., Lyu, L., Liu, J., and Dou, D. Accelerated federated learning with decoupled adaptive optimization. In International Conference on Machine Learning, pp. 10298 10322. PMLR, 2022. Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132 5143. PMLR, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015. Koloskova, A., Stich, S. U., and Jaggi, M. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=4_o CZg BIVI. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. FADAS: Towards Federated Adaptive Asynchronous Optimization Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19(81):1 68, 2018. Li, J., Huang, F., and Huang, H. Fedda: Faster framework of local adaptive gradient methods via restarted dual averaging. ar Xiv preprint ar Xiv:2302.06103, 2023. 
Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of Fed Avg on non-iid data. ar Xiv preprint ar Xiv:1907.02189, 2019a. Li, X., Yang, W., Wang, S., and Zhang, Z. Communicationefficient local decentralized SGD methods. ar Xiv preprint ar Xiv:1910.09126, 2019b. Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don t use large mini-batches, use local sgd. ar Xiv preprint ar Xiv:1808.07217, 2018. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., and Jordan, M. I. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202 2229, 2017. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017. Mishchenko, K., Bach, F., Even, M., and Woodworth, B. E. Asynchronous SGD beats minibatch SGD under arbitrary delays. Advances in Neural Information Processing Systems, 35:420 433, 2022. Nguyen, J., Malik, K., Zhan, H., Yousefpour, A., Rabbat, M., Malek, M., and Huba, D. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pp. 3581 3607. PMLR, 2022. Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P., Scheinberg, K., and Takac, M. SGD and hogwild! Convergence without the bounded gradients assumption. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3750 3758. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/ nguyen18c.html. Niu, F., Recht, B., Re, C., and Wright, S. J. Hogwild! a lockfree approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS 11, pp. 693 701, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993. Ortega, T. and Jafarkhani, H. Asynchronous federated learning with bidirectional quantized communications and buffered aggregation. ar Xiv preprint ar Xiv:2308.00263, 2023. Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Koneˇcn y, J., Kumar, S., and Mc Mahan, H. B. Adaptive federated optimization. In International Conference on Learning Representations, 2021. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021 2031. PMLR, 2020. Stich, S., Mohtashami, A., and Jaggi, M. Critical parameters for scalable distributed learning with large batches and asynchronous updates. In International Conference on Artificial Intelligence and Statistics, pp. 4042 4050. PMLR, 2021. Stich, S. U. Local SGD converges fast and communicates little. ar Xiv preprint ar Xiv:1805.09767, 2018. Sun, H., Shen, L., Chen, S., Sun, J., Li, J., Sun, G., and Tao, D. Fedlalr: Client-specific adaptive learning rates achieve linear speedup for non-iid data. ar Xiv preprint ar Xiv:2309.09719, 2023a. Sun, Y., Shen, L., Sun, H., Ding, L., and Tao, D. 
Efficient federated learning via local adaptive amended optimizer with linear speedup. ar Xiv preprint ar Xiv:2308.00522, 2023b. Tan, L., Zhang, X., Zhou, Y., Che, X., Hu, M., Chen, X., and Wu, D. Adafed: Optimizing participation-aware federated learning with adaptive aggregation weights. IEEE Transactions on Network Science and Engineering, 9(4): 2708 2720, 2022. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26 31, 2012. Toghani, M. T. and Uribe, C. A. Unbounded gradients in federated learning with buffered asynchronous aggregation. In 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1 8. IEEE, 2022. FADAS: Towards Federated Adaptive Asynchronous Optimization Tong, Q., Liang, G., and Bi, J. Effective federated adaptive gradient methods with non-iid decentralized data. ar Xiv preprint ar Xiv:2009.06557, 2020. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and finetuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. ar Xiv preprint ar Xiv:1804.07461, 2018. Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum? id=Bkluql SFDS. Wang, J. and Joshi, G. Cooperative sgd: A unified framework for the design and analysis of local-update SGD algorithms. The Journal of Machine Learning Research, 22(1):9709 9758, 2021. Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. ar Xiv preprint ar Xiv:2007.07481, 2020b. Wang, S. and Ji, M. A lightweight method for tackling unknown participation probabilities in federated averaging. ar Xiv preprint ar Xiv:2306.03401, 2023. Wang, Y., Lin, L., and Chen, J. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 6292 6320. PMLR, 2022a. Wang, Y., Lin, L., and Chen, J. Communication-efficient adaptive federated learning. In Proceedings of the 39th International Conference on Machine Learning, pp. 22802 22838. PMLR, 2022b. Wang, Y., Cao, Y., Wu, J., Chen, R., and Chen, J. Tackling the data heterogeneity in asynchronous federated learning with cached update calibration. In Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities, 2023. Wu, X., Magnusson, S., Feyzmahdavian, H. R., and Johansson, M. Delay-adaptive step-sizes for asynchronous learning. In International Conference on Machine Learning, pp. 24093 24113. PMLR, 2022. Wu, X., Huang, F., Hu, Z., and Huang, H. Faster adaptive federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 10379 10387, 2023. Xie, C., Koyejo, S., and Gupta, I. Asynchronous federated optimization. ar Xiv preprint ar Xiv:1903.03934, 2019. Yang, H., Fang, M., and Liu, J. Achieving linear speedup with partial worker participation in non-IID federated learning. In International Conference on Learning Representations, 2021. 
Yang, H., Zhang, X., Khanduri, P., and Liu, J. Anarchic federated learning. In International Conference on Machine Learning, pp. 25331 25363. PMLR, 2022. FADAS: Towards Federated Adaptive Asynchronous Optimization A. Convergence analysis for adaptive asynchronous FL Proof of Theorem 5.6. Here we directly start with general β1 0 cases. Following several previous works studied centralized and federated adaptive methods (Chen et al., 2018; Wang et al., 2022b), we adopt an auxiliary Lyapunov sequence zt, and assume x0 = x1, then for each t 1, we have zt = xt + β1 1 β1 (xt xt 1) = 1 1 β1 xt β1 1 β1 xt 1. (10) For the difference between zt+1 and zt, we have zt+1 zt = 1 1 β1 (xt+1 xt) β1 1 β1 (xt xt 1) = 1 1 β1 η mt bvt + ϵ β1 1 β1 η mt 1 p = 1 1 β1 η 1 bvt + ϵ[β1mt 1 + (1 β1) t] β1 1 β1 η mt 1 p = η t bvt + ϵ β1 1 β1 η bvt + ϵ η p where t = ηl i Mt PK 1 k=0 gi t τ i t ,k = ηl i Mt PK 1 k=0 Fi(xi t τ i t ,k; ξ) and Mt be the set that include client send the local updates to the server at global round t. From Assumption 5.1, f is L-smooth, taking the total expectation over all previous round, 0, 1, ..., t 1 on the auxiliary sequence zt, E[f(zt+1) f(zt)] = E[f(zt+1)] f(zt)] E[ f(zt), zt+1 zt ] + L 2 E[ zt+1 zt 2] = E f(xt), η t bvt + ϵ E f(zt), β1 1 β1 η 1 bvt + ϵ 1 p 2 E t bvt + ϵ β1 1 β1 1 bvt + ϵ 1 p + E ( f(zt) f(xt)), η t bvt + ϵ Bounding I1 Denote a sequence t = ηl i [N] PK 1 k=0 gi t τ i t ,k = ηl i [N] PK 1 k=0 Fi(xi t τ i t ,k; ξ), where ξ Di. For I1, there is I1 = ηE f(xt), t bvt + ϵ = ηE f(xt), t bvt + ϵ = ηE f(xt) bvt + ϵ, t + ηl K f(xt) ηl K f(xt) = ηηl KE f(xt) ( bvt + ϵ)1/2 2 + ηE f(xt) bvt + ϵ, t + ηl K f(xt) FADAS: Towards Federated Adaptive Asynchronous Optimization = ηηl KE f(xt) ( bvt + ϵ)1/2 2 + ηE f(xt) bvt + ϵ, ηl k=0 Fi(xi t τ i t ,k; ξi) + ηl K i [N] Fi(xt) , (13) where the second equality holds due to the characteristic of uniform arrivals (see Assumption 5.4), thus E( t) = t. The last inequality holds by the definition of t and the fact of the objective function f(x) = 1 N PN i=1 Fi(x). By the fact of a, b = 1 2[ a 2 + b 2 a b 2], for second term in (13), we have ηE f(xt) bvt + ϵ, ηl k=0 gi t τ i t ,k + ηl K i [N] Fi(xt) = ηE ηl K ( bvt + ϵ)1/2 f(xt), ηl K ( bvt + ϵ)1/2 1 NK k=0 (gi t τ i t ,k Fi(xt)) = ηE ηl K ( bvt + ϵ)1/2 f(xt), ηl K ( bvt + ϵ)1/2 1 NK k=0 ( Fi(xi t τ i t ,k) Fi(xt)) 2 E f(xt) ( bvt + ϵ)1/2 2 + ηηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) ηηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 Fi(xi t τ i t ,k) where the second equality holds by E[gi t τ i t ,k] = E[ Fi(xi t τ i t ,k)]. Then for the second term in Eq. (14) , we have ηηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) ηηl 2N 2KϵE X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) k=0 E[ Fi(xt) Fi(xi t τ i t ,k) 2] E[ Fi(xt) Fi(xt τ i t ) 2] + E[ Fi(xt τ i t ) Fi(xi t τ i t ,k) 2] L2E[ xt xt τ i t 2] + L2E[ xt τ i t xi t τ i t ,k 2] , (15) where the second inequality holds by ai, Pn i=1 ai 2 n Pn i=1 ai 2, and the last inequality holds by Assumption 5.1. For the second term in Eq. (15), following by Lemma C.5, there is E[ xt τ i t xi t τ i t ,k 2] = E m=0 ηlgi t τ i t ,m 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l E[ f(xt τ i t ) 2]. (16) For the first term in Eq. 
(15), since by ai, Pn i=1 ai 2 n Pn i=1 ai 2, there is E[ xt xt τ i t 2] = E s=t τ i t (xs+1 xs) s=t τ i t E[ xs+1 xs 2] τ i t s=t τ i t E η ms bvs + ϵ FADAS: Towards Federated Adaptive Asynchronous Optimization then by decomposing stochastic noise, E[ xt xt τ i t 2] s=t τ i t E[Es ms 2]] = η2τ i t ϵ2 s=t τ i t E Es u=1 βs u 1 1 M k=0 ηl[gj u τ j u,k Fj(xj u τ j u,k) + Fj(xj u τ j u,k)] 2η2τ i t ϵ2 s=t τ i t (1 β1) u=1 βs u 1 Kη2 l M σ2 + 2η2τ i t ϵ2 s=t τ i t (1 β1) u=1 βs u 1 η2 l M 2 E X k=0 Fj(xj u τ j u,k) 2η2(τ i t)2 ϵ2 Kη2 l M σ2 + 2η2τ i t ϵ2 s=t τ i t (1 β1) u=1 βs u 1 η2 l M 2 E X k=0 Fj(xj u τ j u,k) where the first inequality holds by decomposing the momentum ms, i.e., ms = (1 β1) Ps u=1 βs u 1 u = (1 β1) Ps u=1 βs u 1 1 M P j Mu PK 1 k=0 ηlgj u τ j u,k. The second inequality holds by a + b 2 2 a 2 + 2 b 2 and the fact of E[ ]] = E[ ], and the third inequality holds by (1 β1) Ps u=1 βs u 1 1. Following Lemma C.2, 1 CG x x bvt+ϵ 1 ϵ x and CG = ηl KG + ϵ, plugging Eq. (14), Eq. (15) and Eq. (16) to (13), we have E[I1] ηηl K 2CG E[ f(xt) 2] ηηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2η3 l η3K2L2 i=1 (τ i t)2 + 2η3 l η3KL2 s=t τ i t (1 β1) u=1 βs u 1 E X k=0 Fj(xj u τ j u,k) Bounding I2 I2 = E f(zt), β1 1 β1 η 1 bvt + ϵ 1 p = ηE f(zt) f(xt) + f(xt), β1 1 β1 1 bvt + ϵ 1 p ηE f(xt) β1 1 β1 1 bvt + ϵ 1 p + η2LE β1 1 β1 1 bvt + ϵ 1 p β1 1 β1 ηlηKG2E 1 bvt + ϵ 1 p + β2 1 (1 β1)2ϵη2 l η2K2G2LE 1 bvt + ϵ 1 p where the first inequality holds by a, b a b and L-smoothness of f, i.e., f(zt) f(xt) L zt xt , and by the definition of zt, there is zt xt = β1 1 β1 mt 1 bvt 1+ϵ. The second inequality holds by Lemma C.2. Bounding I3 2 E t bvt + ϵ β1 1 β1 1 bvt + ϵ 1 p FADAS: Towards Federated Adaptive Asynchronous Optimization η2LE t bvt + ϵ 2 + η2LE β1 1 β1 1 bvt + ϵ 1 p η2LE t bvt + ϵ 2 + η2L β2 1 (1 β1)2 η2 l K2G2E 1 bvt + ϵ 1 p ϵ2 E[ t 2] + η2L β2 1 (1 β1)2 η2 l K2G2E 1 bvt + ϵ 1 p where the first inequality follows by Cauchy-Schwarz inequality, i.e., ai, Pn i=1 ai 2 n Pn i=1 ai 2, and the second one holds by Lemma C.2. Bounding I4 I4 = E ( f(zt) f(xt)), η t bvt + ϵ E f(zt) f(xt) η t bvt + ϵ LE zt xt η t bvt + ϵ 2 E β1 1 β1 2 E t bvt + ϵ 2ϵ2 β2 1 (1 β1)2 E[ mt 2] + η2L 2ϵ2 E[ t 2], (22) where the second inequality holds by Assumption 5.1 (the L-smoothness of f), and the third inequality holds by the definition of zt and the inequality a b 1 Merging pieces. Therefore, by merging pieces together, we have E[f(zt+1) f(zt)] = E[I1 + I2 + I3 + I4] 2CG E[ f(xt) 2] ηηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2η3η3 l K2L2 i=1 (τ i t)2 + 2η3η3 l KL2 s=t τ i t (1 β1) u=1 βs u 1 E X k=0 Fj(xj u τ j u,k) + β1 1 β1 ηηl KG2E 1 bvt + ϵ 1 p + β2 1 (1 β1)2ϵη2η2 l K2G2LE 1 bvt + ϵ 1 p ϵ2 E[ t 2] + β2 1 (1 β1)2 η2η2 l K2G2LE 1 bvt + ϵ 1 p 2ϵ2 β2 1 (1 β1)2 E[ mt 2] + η2L 2ϵ2 E[ t 2]. (23) Denote a few sequences: Gt = P j Mt PK 1 k=0 Fj(xj t τ j t ,k) and Vt = 1 bvt+ϵ 1 bvt 1+ϵ, then re-write and organize the above inequality, we have E[f(zt+1) f(zt)] 2CG E[ f(xt) 2] ηηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) FADAS: Towards Federated Adaptive Asynchronous Optimization 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2η3η3 l K2L2 i=1 (τ i t)2 + 2η3η3 l KL2 s=t τ i t (1 β1) u=1 βs u 1 E[ Gu 2] + β1 1 β1 ηηl KG2E[ Vt 1] + β2 1 (1 β1)2ϵη2η2 l K2G2LE[ Vt 1] + β2 1 (1 β1)2 η2η2 l K2G2LE[ Vt 2] 2ϵ2 E[ t 2] + η2L 2ϵ2 β2 1 (1 β1)2 E[ mt 2]. 
(24) Summing over t = 1 to T, we have E[f(z T +1) f(z1)] t=1 E[ f(xt) 2] + ηηl KL2 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2η3η3 l K2L2 t=1 (τ i t)2 + 2η3η3 l KL2 s=t τ i t (1 β1) u=1 βs u 1 E[ Gu 2] + β1 1 β1 ηηl KG2 + β2 1 (1 β1)2ϵη2η2 l K2G2L T X t=1 E[ Vt 1] + β2 1 (1 β1)2 η2η2 l K2G2L t=1 E[ Vt 2] t=1 E[ t 2] + η2L 2ϵ2 β2 1 (1 β1)2 t=1 E[ mt 2] k=0 Fi(xi t τ i t ,k) we have the following for term A0, A0 = 2η3η3 l K2L2 t=1 (τ i t)2 2η3η3 l K2L2 Mϵ3 σ2τavgτmax T. (26) By Lemma C.6, we have the following for term A1, A1 =2η3η3 l KL2 s=t τ i t (1 β1) u=1 βs u 1 E[ Gu 2] = 2η3η3 l KL2 s=t τ i t (1 β1) 5K3L2η2 l (σ2 + 6Kσ2 g) + (30K4L2η2 l + K2) 1 j=1 E[ f(xu τ j u) 2] + K2σ2 g k=0 Fj(xj u τ j u,k) = 2η3η3 l KL2 s=t τ i t (1 β1) N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] FADAS: Towards Federated Adaptive Asynchronous Optimization + 2η3η3 l KL2 s=t τ i t (1 β1) N 1 (30K4L2η2 l + K2) 1 j=1 E[ f(xu τ j u) 2] + 2η3η3 l KL2 s=t τ i t (1 β1) k=0 Fj(xj u τ j u,k) | {z } A4 = A2 + A3 + A4. (27) For term A2, then re-organizing it we have A2 2η3η3 l KL2 t=1 (τ i t)2 3M(N M) N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] Tτavgτmax. (28) For term A3, we have A3 2η3η3 l KL2 s=t τ i t (1 β1) N 1 (30K4L2η2 l + K2) 1 j=1 E[ f(xu τ j u) 2] M 2ϵ3 τ 2 max N 1 (30K4L2η2 l + K2) 1 j=1 E[ f(xt τ j t ) 2] M 2ϵ3 τ 3 max 6M(N M) N 1 (30K4L2η2 l + K2) t=1 E[ f(xt) 2], (29) where the first inequality in Eq. (29) holds due to: 1) τ i t τmax and 2) for a positive sequence at, PT t=1 Pt 1 s=t τ i t (1 β1) Ps u=1 βs u 1 au τmax(1 β1) PT t=1 Pt u=1 βt u 1 au τmax PT t=1 at. In details, s=t τ i t (1 β1) u=1 βs u 1 au s=t τ i t (1 β1)(βs 1 1 a1 + βs 2 1 a2 + + β0 1as) t=1 (1 β1) t 1 X s=t τ i t βs 1 1 a1 + s=t τ i t βs 2 1 a2 + + s=t τ i t β0 1as u=1 βt u 1 au t=1 at. (30) The second inequality in Eq. (29) hold by the fact of PT t=1 1 N PN j=1 E[ f(xt τ j t ) 2] τmax PT t=1 E[ f(xt) 2]. Similar, for term A4, we have A4 = 2η3η3 l KL2 s=t τ i t (1 β1) u=1 βs u 1 M(M 1) k=0 Fj(xj t τ j t ,k) FADAS: Towards Federated Adaptive Asynchronous Optimization M 2ϵ3 τ 2 max 2M(M 1) k=0 Fj(xj t τ j t ,k) With the term of A0 to A4, by Lemma C.3 and Lemma C.4, we have the following for Eq. 
(25), E[f(z T +1) f(z1)] t=1 E[ f(xt) 2] + ηηl KL2 5Kη2 l T(σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2η3η3 l K2L2τavgτmax T Mϵ3 σ2 + η3η3 l KL2 M 2ϵ3 τ 3 max 6M(N M) N 1 (30K4L2η2 l + K2) t=1 E[ f(xt) 2] + η3η3 l KL2 M 2ϵ3 6M(N M) N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] Tτavgτmax + β1 1 β1 ηηl KG2 + β2 1 (1 β1)2ϵη2η2 l K2G2L d ϵ + β2 1 (1 β1)2 η2η2 l K2G2L d 2ϵ2 β2 1 (1 β1)2 2η2 l K M σ2 + 2η2 l (N M) NM(N 1) 15NK3L2η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) i=1 E[ f(xt τ i t ) 2] + 3NK2σ2 g + 2η2 l (M 1) NM(N 1)E k=0 Fi(xi t τ i t ,k) + 2η3η3 l KL2 M 2ϵ3 τ 2 max M(M 1) k=0 Fj(xj t τ j t ,k) ηηl 2KN 2CG k=0 Fi(xi t τ i t ,k) E[f(z T +1) f(z1)] t=1 E[ f(xt) 2] + ηηl KL2 5Kη2 l T(σ2 + 6Kσ2 g) + 30K2η2 l τmax t=1 E[ f(xt) 2] + 2η3η3 l K2L2τavgτmax T Mϵ3 σ2 + η3η3 l KL2 M 2ϵ3 τ 3 max 6M(N M) N 1 (30K4L2η2 l + K2) t=1 E[ f(xt) 2] + η3η3 l KL2 M 2ϵ3 6M(N M) N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] Tτavgτmax + β1 1 β1 ηηl KG2 + β2 1 (1 β1)2ϵη2η2 l K2G2L d ϵ + β2 1 (1 β1)2 η2η2 l K2G2L d ϵ2 β2 1 (1 β1)2 Kη2 l M σ2 + η2 l (N M) NM(N 1) 15NK3L2η2 l (σ2 + 6Kσ2 g) + 3NK2σ2 g ϵ2 β2 1 (1 β1)2 η2 l (N M) M(N 1) (90K4L2η2 l + 3K2)Nτmax t=1 E[ f(xt) 2] ϵ2 β2 1 (1 β1)2 η2 l (M 1) NM(N 1) + 2η3η3 l KL2 M 2ϵ3 τ 2 max M(M 1) N(N 1) ηηl 2KN 2CG k=0 Fi(xi t τ i t ,k) FADAS: Towards Federated Adaptive Asynchronous Optimization If the learning rates satisfy ηl 1 8KL and ηηl ϵ2M(N 1) 4CGN(M 1)KL 3 + β2 1 (1 β1)2 ηl ϵ 360CGτmax KL, ηηl ϵ2M(N 1) 60CGN(N M)KLτmax 3 + β2 1 (1 β1)2 CGN(M 1)τ 3max KL , (34) then we have 3η2L ϵ2 β2 1 (1 β1)2 η2 l (M 1) NM(N 1) + 2η3η3 l KL2 M 2ϵ3 τ 2 max M(M 1) N(N 1) ηηl 2KN 2CG 0 ϵ 30K2η2 l τmax + 3η2L ϵ2 β2 1 (1 β1)2 η2 l (N M) M(N 1) (90K4L2η2 l + 3K2)Nτmax + η3η3 l KL2 M 2ϵ3 τ 3 max 6M(N M) N 1 (30K4L2η2 l + K2) ηηl K Thus Eq. (33) becomes PT t=1 E[ f(xt) 2] T 4CG ηηl KT [f(z1) E[f(z T +1)]] + 20CGϵ 1L2Kη2 l (σ2 + 6Kσ2 g) + 8CGη2η2 l KL2τavgτmax + 24CGη2η2 l L2τavgτmax Mϵ3 N M N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] + β1 1 β1 G2 + β2 1 (1 β1)2ϵηηl KG2L 4CGd Tϵ + β2 1 (1 β1)2 ηηl KG2L4CGd ϵ2 β2 1 (1 β1)2 M σ2 + ηl(N M) M(N 1) [15K2L2η2 l (σ2 + 6Kσ2 g) + 3Kσ2 g] . With CG = ηl KG + ϵ, Eq. (36) becomes PT t=1 E[ f(xt) 2] 4(ηl KG + ϵ) ηηl KT [f(z1) E[f(zt+1)]] + 20(ηl KG + ϵ)ϵ 1L2Kη2 l (σ2 + 6Kσ2 g) + 8(ηl KG + ϵ)η2η2 l KL2τavgτmax Mϵ3 σ2 + 24(ηl KG + ϵ)η2η2 l L2τavgτmax Mϵ3 N M N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] + β1 1 β1 G2 + β2 1 (1 β1)2ϵηηl KG2L 4(ηl KG + ϵ)d Tϵ + β2 1 (1 β1)2 ηηl KG2L4(ηl KG + ϵ)d + 4(ηl KG + ϵ) 3ηL ϵ2 β2 1 (1 β1)2 M σ2 + ηl(N M) M(N 1) [15K2L2η2 l (σ2 + 6Kσ2 g) + 3Kσ2 g] . (37) For β1 = 0, with the definition of F = f(x1) minx f(x), we have the following bound PT t=1 E[ f(xt) 2] 4(ηl KG + ϵ) ηηl KT [f(z1) E[f(zt+1)]] + 20(ηl KG + ϵ)ϵ 1L2Kη2 l (σ2 + 6Kσ2 g) + 8(ηl KG + ϵ)η2η2 l KL2τavgτmax Mϵ3 σ2 + 24(ηl KG + ϵ)η2η2 l KL2τavgτmax Mϵ3 N M FADAS: Towards Federated Adaptive Asynchronous Optimization [5η2 l K2L2(σ2 + 6Kσ2 g) + Kσ2 g] + 12(ηl KG + ϵ)ηL M σ2 + ηl(N M) M(N 1) [15η2 l K2L2(σ2 + 6Kσ2 g) + 3Kσ2 g] . (38) This concludes the proof. Proof of Corollary 5.7. From Eq. (38), we have the following bound PT t=1 E[ f(xt) 2] = O (ηl KG + ϵ) ηηl KT F + (ηl KG + ϵ)η2 l KL2(σ2 + Kσ2 g) ϵ + (ηl KG + ϵ)η2η2 l KL2τavgτmax Mϵ3 σ2 + (ηl KG + ϵ)η2η2 l L2τavgτmax Mϵ3 N M N 1 [K3L2η2 l (σ2 + Kσ2 g) + K2σ2 g] + (ηl KG + ϵ)ηηl L + (ηl KG + ϵ)ηηl KL N 1 [η2 l KL2(σ2 + Kσ2 g)] . (39) Reorganizing Eq. 
(39), particularly merging the stochastic variance and the global variance, we get PT t=1 E[ f(xt) 2] = O (ηl KG + ϵ) ηηl KT F + (ηl KG + ϵ)η2 l KL2 ϵ (σ2 + Kσ2 g) + (ηl KG + ϵ)η2η2 l KL2 τavgτmaxσ2 + N M N 1 τavgτmax Kσ2 g + (ηl KG + ϵ)η2η4 l K3L4τavgτmax Mϵ3 N M N 1 (σ2 + Kσ2 g) + (ηl KG + ϵ)ηηl L + (ηl KG + ϵ)ηη3 l K2L3 N 1 (σ2 + Kσ2 g) . (40) By choosing η = Θ( M) and ηl = Θ T K(σ2+Kσ2g)L , which implies ηηl = Θ T K(σ2+Kσ2g)L , and ηl KG = T (σ2+Kσ2g)L , PT t=1 E[ f(xt) 2] F(σ2 + Kσ2g)L T(σ2 + Kσ2g)L + ϵ FL T(σ2 + Kσ2g)L + ϵ FLτavgτmax N 1 FLτavgτmax T(σ2 + Kσ2g)L + ϵ We again generalize terms with smaller T dependency orders, then we have PT t=1 E[ f(xt) 2] T + FLτavgτmax N 1 FLτavgτmax FADAS: Towards Federated Adaptive Asynchronous Optimization M + Fτmaxτavg This concludes the proof for Corollary 5.7. B. Convergence analysis for delay adaptive asynchronous FL Proof of Theorem 5.9. For the proof of delay adaptive, for proof convenience, we conduct analysis under the case that β1 = 0. From Assumption 5.1, f is L-smooth, then taking conditional expectation at time t on the auxiliary sequence xt, we have E[f(xt+1) f(xt)] = E[f(xt+1) f(xt)] E[ f(xt), xt+1 xt ] + L 2 E[ xt+1 xt 2] = E f(xt), ηt t bvt + ϵ 2 E t bvt + ϵ Bounding I1 We have I1 = ηt E f(xt), t bvt + ϵ = ηt E f(xt), t bvt + ϵ = ηt E f(xt) bvt + ϵ, t + ηl K f(xt) ηl K f(xt) = ηtηl KE f(xt) ( bvt + ϵ)1/2 2 + ηt E f(xt) bvt + ϵ, t + ηl K f(xt) = ηtηl KE f(xt) ( bvt + ϵ)1/2 2 + ηt E f(xt) bvt + ϵ, 1 k=0 ηlgi t τ i t ,k + ηl K i [N] Fi(xt) , (44) where t = 1 i [N] PK 1 k=0 ηlgi t τ i t ,k. For the inner product term in (44), by the fact of a, b = 1 2[ a 2 + b 2 a b 2], we have ηt E f(xt) bvt + ϵ, 1 k=0 ηlgi t τ i t ,k + ηl K i [N] Fi(xt) = ηt E ηl K ( bvt + ϵ)1/2 f(xt), ηl K ( bvt + ϵ)1/2 1 NK k=0 (gi t τ i t ,k Fi(xt)) = ηt E ηl K ( bvt + ϵ)1/2 f(xt), ηl K ( bvt + ϵ)1/2 1 NK k=0 ( Fi(xi t τ i t ,k) Fi(xt)) 2 E f(xt) ( bvt + ϵ)1/2 2 + ηtηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) FADAS: Towards Federated Adaptive Asynchronous Optimization ηtηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 Fi(xi t τ i t ,k) where second equation holds by E[gi t τ i t ,k] = E[ F(xi t τ i t ,k)], for the second term in Eq. (45), we have ηtηl 2N 2K E 1 ( bvt + ϵ)1/2 X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) ηtηl 2N 2KϵE X k=0 ( Fi(xi t τ i t ,k) Fi(xt)) k=0 E[ Fi(xt) Fi(xi t τ i t ,k) 2] E[ Fi(xt) Fi(xt τ i t ) 2] + E[ Fi(xt τ i t ) Fi(xi t τ i t ,k) 2] L2E[ xt xt τ i t 2] + L2E[ xt τ i t xi t τ i t ,k 2] . (46) where the first second inequality holds by ai, Pn i=1 ai 2 n Pn i=1 ai 2, and the last inequality holds by Assumption 5.1. For the second term in Eq. (46), following by Lemma C.5, there is E[ xt τ i t xi t τ i t ,k 2] = E m=0 ηlgi t τ i t ,m 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l E[ f(xt τ i t ) 2]. (47) For the first term in Eq. (46), we have E[ xt xt τ i t 2] = E s=t τ i t (xs+1 xs) s=t τ i t ηs s bvs + ϵ s=t τ i t ηs s s=t τ i t ηs 1 M s=t τ i t ηs 1 M k=0 ηlgj s τ j s ,k then by decomposing stochastic noise, we have E[ xt xt τ i t 2] s=t τ i t ηs 1 M k=0 ηl[gj s τ j s ,k Fj(xj s τ j s ,k) + Fj(xj s τ j s ,k)] s=t τ i t ηs 1 M k=0 ηl[gj s τ j s ,k Fj(xj s τ j s ,k)] FADAS: Towards Federated Adaptive Asynchronous Optimization s=t τ i t ηs 1 M k=0 ηl Fj(xj s τ j s ,k) s=t τ i t η2 s Kη2 l M σ2 + 2τ i t ϵ2 s=t τ i t η2 s η2 l M 2 E X k=0 Fj(xj s τ j s ,k) ϵ2 τ i tη2 Kη2 l M σ2 + 2τ i t ϵ2 s=t τ i t η2 s η2 l M 2 E X k=0 Fj(xj s τ j s ,k) where the second inequality holds by a + b 2 2 a 2 + 2 b 2. 
The second inequality holds by Assumption 5.2, i.e., the zero-mean and the independency of stochastic noise. The last inequality in Eq. (49) holds due to the following: with adaptive learning rates ( η if τ max t τc, min{η, 1 τ max t } if τ max t > τc, (50) thus we have ηs η in the last inequality in Eq. (49). Then for I1, following Lemma C.2 1 CG x x bvt+ϵ 1 ϵ x and CG = ηl KG + ϵ, we have 2CG E[ f(xt) 2] ηtηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2ηtη2η3 l K2L2 i=1 τ i tσ2 + 2ηtη3 l KL2 s=t τ i t η2 s E X k=0 Fj(xj s τ j s ,k) Bounding I2 I2 = η2 t L 2 E t bvt + ϵ 2 η2 t L 2ϵ2 E[ t 2], (52) where the first inequality follows by Cauchy-Schwarz inequality. Merging pieces. Therefore, by merging pieces together, we have E[f(xt+1) f(xt)] = I1 + I2 2CG E[ f(xt) 2] ηtηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2ηtη2η3 l K2L2 i=1 τ i tσ2 + 2ηtη3 l KL2 s=t τ i t η2 s E X k=0 Fj(xj s τ j s ,k) 2 + η2 t L 2ϵ2 E[ t 2]. (53) Denote a sequences Gs = P j Ms PK 1 k=0 Fj(xj t τ j s ,k), then re-write and organize the above inequality, we have E[f(xt+1) f(xt)] FADAS: Towards Federated Adaptive Asynchronous Optimization 2CG E[ f(xt) 2] ηtηl 2KCG E 1 N k=0 Fi(xi t τ i t ,k) 5Kη2 l (σ2 + 6Kσ2 g) + 30K2η2 l 1 N i=1 E[ f(xt τ i t ) 2] + 2ηtη2η3 l K2L2 i=1 τ i tσ2 + 2ηtη3 l KL2 s=t τ i t η2 s E[ Gs 2] + η2 t L 2ϵ2 E[ t 2]. (54) Summing over t = 1 to T, we have E[f(x T +1) f(x1)] t=1 ηt E[ f(xt) 2] + ηl KL2 5Kη2 l (σ2 + 6Kσ2 g) t=1 ηt + 30K2η2 l 1 N i=1 ηt E[ f(xt τ i t ) 2] + 2η2η3 l K2L2 i=1 τ i tηt + 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s E[ Gs 2] + L t=1 η2 t E[ t 2] t=1 ηt E 1 N k=0 Fi(xi t τ i t ,k) t=1 ηt E[ f(xt) 2] + ηl KL2 5Kη2 l (σ2 + 6Kσ2 g) t=1 ηt + 30K2η2 l 1 N i=1 ηt E[ f(xt τ i t ) 2] + 2η3η3 l K2L2 Mϵ3 σ2Tτavg + 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s E[ Gs 2] t=1 η2 t E[ t 2] t=1 ηt E 1 N k=0 Fi(xi t τ i t ,k) where the second inequality holds by ηt η. We have the following for term A1, s=t τ i t η2 s E[ Gs 2] = 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s 3M(N M) 5K3L2η2 l (σ2 + 6Kσ2 g) + (30K4L2η2 l + K2)E[ f(xs τ j s ) 2] + K2σ2 g + 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s M(M 1) k=0 Fj(xj s τ j s ,k) t=1 ηtτ i t s=t τ i t η2 s 6M(N M) 5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g + η3 l KL M 2ϵ3 1 N t=1 ηtτ i t s=t τ i t η2 s 6M(N M) N 1 (30K4L2η2 l + K2)E[ f(xs τ j s ) 2] FADAS: Towards Federated Adaptive Asynchronous Optimization + 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s M(M 1) k=0 Fj(xj s τ j s ,k) | {z } A4 = A2 + A3 + A4. (56) For term A2, note that with η M τc , we have the adaptive learning rates ( η if τ max t τc, min{η, 1 τ max t } if τ max t > τc, (57) which implies that ηt η and ηt min{ 1 τ max t , M τc }. Moreover, recall that τ max t = maxi [N]{τ i t}, for each i, we have M τ i t max(τ max t ,τc) M. by the fact of ηtτ i t N PN i=1 PT t=1 τ i t Tτavg and ηs η, we have A2 η3 l KL2 t=1 ηtτ i t τ i tη2 6M(N M) 5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g M τ i tη2 N M 5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g 5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g Tτavg. (58) For term A3, since ηs η, and consider τ i t τmax and PT t=1 Pt 1 s=t τ i t as τmax PT t=1 at, we have A3 η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s 6M(N M) N 1 (30K4L2η2 l + K2)E[ f(xs τ j s ) 2] η2η3 l KL M 2ϵ3 6M(N M) N 1 (30K4L2η2 l + K2)τ 3 max t=1 ηt E[ f(xt) 2]. 
(59) For term A4, similar to the proof of non-delay adaptive FADAS, by ηs η, τ i t τmax and PT t=1 Pt 1 s=t τ i t as τmax PT t=1 at, there is A4 = 2η3 l KL2 t=1 ηtτ i t s=t τ i t η2 s M(M 1) k=0 Fj(xj s τ j s ,k) 2η2η3 l KL2 M 2ϵ3 M(M 1) N(N 1) τmax s=t τ i t E k=0 Fj(xj s τ j s ,k) 2η2η3 l KL2 M 2ϵ3 M(M 1) N(N 1) τ 2 max k=0 Fj(xj t τ j t ,k) By Lemma C.3 and Lemma C.4, we have the following for Eq. (55), E[f(x T +1) f(x1)] t=1 ηt E[ f(xt) 2] + ηl KL2 5Kη2 l (σ2 + 6Kσ2 g) t=1 ηt + 30K2η2 l 1 N i=1 ηt E[ f(xt τ i t ) 2] + 2η3η3 l K2L2 Mϵ3 Tτavgσ2 + η2η3 l KL2 N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g]Tτavg FADAS: Towards Federated Adaptive Asynchronous Optimization + η2η3 l KL2 M 2ϵ3 6M(N M) N 1 (30K4L2η2 l + K2)τ 3 max t=1 ηt E[ f(xt) 2] 2Kη2 l M σ2 + 2η2 l (N M) NM(N 1) 15NK3L2η2 l (σ2 + 6Kσ2 g) + 3NK2σ2 g + 2η2 l (N M) M(N 1) (90NK4L2η2 l + 3NK2) 1 i=1 E[ f(xt τ i t ) 2] ϵ2 η2 l (M 1) NM(N 1) + 2η2η3 l KL2 M 2ϵ3 M(M 1) N(N 1) τ 2 max ηl 2KN 2CG k=0 Fi(xi t τ i t ,k) by the relationship of PT t=1 1 N PN j=1 E[ f(xt τ i t ) 2] τmax PT t=1 E[ f(xt) 2]. If the learning rates satisfy ηl 1 8KL and ηηl ϵ2M(N 1) 4CGN(M 1)KL, ηηl ηl ϵ 360CGτmax KL, ηηl ϵ2M(N 1) 60CGN(N M)KLτmax , CGN(M 1)τ 3max KL . (62) Then we have ϵ2 η2 l (M 1) NM(N 1) + 2η2η3 l KL2 M 2ϵ3 M(M 1) N(N 1) τ 2 max ηl 2KN 2CG 0 ϵ 30K2η2 l τmax + ηL ϵ2 η2 l (N M) M(N 1) (90K4L2η2 l + 3K2)Nτmax + η2η3 l KL2 M 2ϵ3 6M(N M) N 1 (30K4L2η2 l + K2)τ 3 max ηl K Thus Eq. (61) becomes t=1 ηt E[ f(xt) 2] ηl K [f(x1) E[f(x T +1)]] + 4CGL2 ϵ 5Kη2 l (σ2 + 6Kσ2 g) + 8CGη3η2 l KL2 Mϵ3 Tτavgσ2 + 24CGη2η2 l L2 N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g]Tτavg M σ2 + ηl(N M) NM(N 1)[15NK2L2η2 l (σ2 + 6Kσ2 g) + 3NKσ2 g] , (64) divided by the learning rates, PT t=1 ηt E[ f(xt) 2] PT t=1 ηt 4CG ηl K PT t=1 ηt [f(x1) E[f(x T +1)]] + 20CGϵ 1L2Kη2 l (σ2 + 6Kσ2 g) + 8CGη3η2 l KL2 Mϵ3 Tτavg PT t=1 ηt σ2 + 24CGη2η2 l L2 N 1 [5K3L2η2 l (σ2 + 6Kσ2 g) + K2σ2 g] Tτavg PT t=1 ηt FADAS: Towards Federated Adaptive Asynchronous Optimization M σ2 + ηl(N M) M(N 1) [15K2TL2η2 l (σ2 + 6Kσ2 g) + 3Kσ2 g] . (65) This concludes the proof. Proof of Corollary 5.10. Since the delay adaptive learning rate satisfy, ηt η, and when τc = τmedian, there is PT t=1 ηt P t:τt τc η T η 2 (since there are at least half of the iterations with the delay smaller than τc). Recalling that CG = ηl KG+ϵ, then PT t=1 ηt E[ f(xt) 2] = O (ηl KG + ϵ) ηηl KT F + (ηl KG + ϵ)η2 l KL2(σ2 + Kσ2 g) ϵ + (ηl KG + ϵ)η2η2 l KL2τavg Mϵ3 σ2 + (ηl KG + ϵ)ηη2 l KL2τavg N 1 [K2L2η2 l (σ2 + Kσ2 g) + Kσ2 g] + (ηl KG + ϵ)ηηl L + (ηl KG + ϵ)ηηl KL N 1 [η2 l KL2(σ2 + Kσ2 g)] . (66) Reorganizing Eq. (66), particularly merging the stochastic variance and the global variance, then we have PT t=1 ηt E[ f(xt) 2] = O (ηl KG + ϵ) ηηl KT F + (ηl KG + ϵ)η2 l KL2 ϵ (σ2 + Kσ2 g) + (ηl KG + ϵ)ηη2 l KL2 N 1 τavg Kσ2 g + (ηl KG + ϵ)ηη4 l K3L4τavg N 1 (σ2 + Kσ2 g) + (ηl KG + ϵ)ηηl L + (ηl KG + ϵ)ηη3 l K2L3 N 1 (σ2 + Kσ2 g) . (67) By choosing η = M/τc and ηl = min 1 T K(σ2+Kσ2g)L , which implies ηηl = min T K(σ2+Kσ2g)L , PT t=1 ηt E[ f(xt) 2] T(σ2 + Kσ2g)L + ϵ q F(σ2 + Kσ2g)L T(σ2 + Kσ2g)L + ϵ FLτ 2 c Tϵ T(σ2 + Kσ2g)L + ϵ FLτavg N 1 FLτcτavg T(σ2 + Kσ2g)L + ϵ T(σ2 + Kσ2g)L + ϵ q F(σ2 + Kσ2g)L T(σ2 + Kσ2g)L + ϵ FLτ 2 c Tϵ T(σ2 + Kσ2g)L + ϵ FLτavg Tϵ3 + FLτcτavg T(σ2 + Kσ2g)L + ϵ FADAS: Towards Federated Adaptive Asynchronous Optimization We again generalize terms with smaller T dependency orders, then we have PT t=1 ηt E[ f(xt) 2] TM + Fτ 2 c T + Fτavg T + Fτcτavg reorganizing and then obtain the rate of convergence in Eq. (9). C. 
Supporting Lemmas Lemma C.1 (Lemma for momentum term in the update rule). The first order momentum terms mt in Algorithm 1 hold the following relationship w.r.t. model difference t: t=1 E[ mt 2] t=1 E[ t 2]. (70) Proof. By the updating rule, we have E[ mt 2] = E (1 β1) u=1 βt u 1 u (1 β1)2 d X u=1 βt u 1 i u (1 β1)2 d X u=1 βt u 1 ( i u)2 u=1 βt u 1 E[ u 2]. (71) Summing over t = 1, ..., T yields t=1 E[ mt 2] = (1 β1) u=1 βt u 1 E[ u 2] t=u βt u 1 E[ u 2] 1 1 β1 E[ u 2] u=1 E[ u 2]. (72) This concludes the proof. Lemma C.2. Under Assumptions 5.3, we have f(x) G, t ηl KG, mt ηl KG, vt η2 l K2G2 and bvt η2 l K2G2. Proof. Since f has G-bounded stochastic gradients, for any x and ξ, there is f(x, ξ) G, thus it implies f(x) = Eξ f(x, ξ) Eξ f(x, ξ) G. FADAS: Towards Federated Adaptive Asynchronous Optimization For each model difference i t on client i, i t satisfies, i t = xi t,K xt = ηl k=0 gi t,k, for the global model difference t, Thus we can obtain the bound for momentum mt and variance vt, mt = (1 β1) s=1 βt s 1 s ηl KG, vt = (1 β2) s=1 βt s 2 2 s By the updating rule of bvt, there exists a j [t] such that bvt = vj . Then bvt η2 l K2G2. (73) This concludes the proof. Lemma C.3. For the variance difference sequence Vt = 1 bvt+ϵ 1 bvt 1+ϵ, we have t=1 Vt 2 2 d Proof. The proof of Lemma C.3 is exactly the same as the proof of Lemma C.2 in Wang et al. (2022b). Lemma C.4. Recall the sequence t = 1 M P i Mt i t τ i t = ηl i Mt PK 1 k=0 gi t τ i t ,k = i Mt PK 1 k=0 Fi(xi t τ i t ,k; ξ) and Mt be the set that include client send the local updates to the server at global round t. The global model difference t satisfies E[ t 2] = E 1 M i Mt i t τ i t 2Kη2 l M σ2 + 2η2 l (N M) NM(N 1) 15NK3L2η2 l (σ2 + 6Kσ2 g) + (90NK4L2η2 l + 3K2) i=1 E[ f(xt τ i t ) 2] + 3NK2σ2 g + 2η2 l (M 1) NM(N 1)E k=0 Fi(xi t τ i t ,k) Proof. The proof of Lemma C.3 is similar to the proof of Lemma C.6 in (Wang et al., 2022b). Lemma C.5. (This lemma follows from Lemma 3 in Fed Adam (Reddi et al., 2021). For local learning rate which satisfying ηl 1 8KL, the local model difference after k ( k {0, 1, ..., K 1}) steps local updates satisfies i=1 E[ xi t,k xt 2] 5Kη2 l (σ2 l + 6Kσ2 g) + 30K2η2 l E[ f(xt) 2]. (75) Proof. The proof of Lemma C.5 is similar to the proof of Lemma 3 in Reddi et al. (2021). FADAS: Towards Federated Adaptive Asynchronous Optimization Lemma C.6. If assuming that the clients participation distributions are simulated as independently uniform distribution, then the sequence Gs = P j Ms PK 1 k=0 Fj(xj s τ j s ,k) has the following upper bound, k=0 Fj(xj s τ j s ,k) N 1 15K3L2η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) 1 j=1 E[ f(xs τ j s ) 2] N 1 K2σ2 g + M(M 1) k=0 Fj(xj s τ j s ,k) Proof. We begin with the proof similar to the partial participation with sampling without replacement, k=0 Fj(xj s τ j s ,k) k=0 Fj(xj s τ j s ,k) k=0 Fj(xj s τ j s ,k) 15NK3η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) j=1 E[ f(xs τ j s ) 2] + 3NK2σ2 g k=0 Fj(xj s τ j s ,k) 15NK3η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) j=1 E[ f(xs τ j s ) 2] + 3NK2σ2 g k=0 Fj(xj s τ j s ,k) 15NK3η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) j=1 E[ f(xs τ j s ) 2] + 3NK2σ2 g k=0 Fj(xj s τ j s ,k) 15K3L2η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) 1 j=1 E[ f(xs τ j s ) 2] + 3K2σ2 g k=0 Fj(xj s τ j s ,k) 15K3L2η2 l (σ2 + 6Kσ2 g) + (90K4L2η2 l + 3K2) 1 j=1 E[ f(xs τ j s ) 2] N 1 K2σ2 g + M(M 1) k=0 Fj(xj s τ j s ,k) FADAS: Towards Federated Adaptive Asynchronous Optimization D. 
Additional Experiments

Tables 8 and 9 present results computed over experiments with 3 different random seeds. They compare the test accuracy of various asynchronous federated learning methods when training a ResNet-18 model on the CIFAR-10 and CIFAR-100 datasets under heterogeneous data distributions. We observe that delay-adaptive FADAS (abbreviated as FADASda) consistently outperforms the other methods. The consistency of FADAS under large worst-case delay settings indicates its reliability and its potential for practical federated learning deployments with diverse and asynchronous model updates.

D.1. Additional Results

Table 8. Test accuracy of the ResNet-18 model on the CIFAR-10 dataset with two data heterogeneity levels in a large worst-case delay scenario, after 500 communication rounds. We abbreviate delay-adaptive FADAS as FADASda in this and subsequent tables. We conduct experiments with three seeds and report the average accuracy and standard deviation.

Method      Dir(0.1) Acc. ± std.    Dir(0.3) Acc. ± std.
FedAsync    50.29 ± 6.86            59.75 ± 13.40
FedBuff     44.92 ± 5.26            46.94 ± 0.99
FADAS       70.57 ± 2.04            75.97 ± 2.64
FADASda     72.64 ± 1.00            80.26 ± 0.68

Table 9. Test accuracy of the ResNet-18 model on the CIFAR-100 dataset with two data heterogeneity levels in a large worst-case delay scenario, after 500 communication rounds. We conduct experiments with three seeds and report the average accuracy and standard deviation.

Method      Dir(0.1) Acc. ± std.    Dir(0.3) Acc. ± std.
FedAsync    46.25 ± 4.33            43.22 ± 10.75
FedBuff     15.97 ± 2.44            28.58 ± 4.74
FADAS       47.85 ± 0.69            52.80 ± 1.15
FADASda     51.55 ± 1.03            56.01 ± 0.95

D.2. Implementation Details

Details of applying the adaptive learning rate. During our experiments, we found that choosing a relatively small global learning rate η yields better results for adaptive FL methods (hyper-parameter details are given below). To scale the learning rate down for model updates with larger delays, we directly scale the learning rate of such a step to η/τ_t^max, as shown in (77):

η_t = η                        if τ_t^max ≤ τ_c,
η_t = min{η, η/τ_t^max}        if τ_t^max > τ_c.        (77)

A minimal illustrative sketch of this rule, together with the global adaptive server update analyzed in Appendices A and B, is given after Table 11.

Hyper-parameter Settings. We conduct detailed hyper-parameter searches to find the best hyper-parameters for each baseline. For SGD-based methods, we grid-search the local learning rate ηl over {0.001, 0.003, 0.01, 0.03, 0.1} and fix the global learning rate η = 1. For adaptive methods, we grid-search the local learning rate ηl over {0.003, 0.01, 0.03, 0.1} and the global learning rate η over {0.0001, 0.0003, 0.001, 0.003}. For the global adaptive optimizer, we set β1 = 0.9, β2 = 0.99, and ε = 10^-8. Table 10 summarizes the hyper-parameter details in our experiments.

Table 10. Hyper-parameter details for vision tasks.
CIFAR-10 (mild delay)
Models & Dir(α)          FedAsync (ηl, η)   FedBuff (ηl, η)   FADAS (ηl, η)   FADASda (ηl, η)
ResNet-18 & Dir(0.1)     0.003, 1           0.03, 1           0.1, 0.0003     0.1, 0.001
ResNet-18 & Dir(0.3)     0.01, 1            0.03, 1           0.1, 0.0003     0.1, 0.001

CIFAR-100 (mild delay)
Models & Dir(α)          FedAsync (ηl, η)   FedBuff (ηl, η)   FADAS (ηl, η)   FADASda (ηl, η)
ResNet-18 & Dir(0.1)     0.01, 1            0.03, 1           0.1, 0.0003     0.1, 0.001
ResNet-18 & Dir(0.3)     0.01, 1            0.03, 1           0.1, 0.0003     0.1, 0.001

CIFAR-10 (large worst-case delay)
Models & Dir(α)          FedAsync (ηl, η)   FedBuff (ηl, η)   FADAS (ηl, η)   FADASda (ηl, η)
ResNet-18 & Dir(0.1)     0.003, 1           0.03, 1           0.1, 0.0001     0.1, 0.001
ResNet-18 & Dir(0.3)     0.003, 1           0.03, 1           0.1, 0.0001     0.1, 0.001

CIFAR-100 (large worst-case delay)
Models & Dir(α)          FedAsync (ηl, η)   FedBuff (ηl, η)   FADAS (ηl, η)   FADASda (ηl, η)
ResNet-18 & Dir(0.1)     0.003, 1           0.03, 1           0.1, 0.0001     0.1, 0.001
ResNet-18 & Dir(0.3)     0.001, 1           0.03, 1           0.1, 0.0001     0.1, 0.001

Table 11. Hyper-parameter details for language tasks.
Datasets     FedAsync (ηl, η)   FedBuff (ηl, η)   FADAS (ηl, η)   FADASda (ηl, η)
RTE          0.01, 1            0.01, 1           0.01, 0.005     0.01, 0.01
MRPC         0.001, 1           0.01, 1           0.01, 0.001     0.01, 0.002
SST-2        0.001, 1           0.001, 1          0.1, 0.0005     0.1, 0.001
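To make the update rule treated in Appendices A and B and the delay-adaptive scaling of Eq. (77) easier to follow, the sketch below spells out one global round in NumPy: the server averages the M buffered (and possibly stale) local model differences, maintains AMSGrad-style first and second moments as indicated by Lemma C.2, and, for the delay-adaptive variant, scales the global learning rate by the round's worst delay. This is a minimal illustration, not the authors' released code: the function and variable names (fadas_server_step, delay_adaptive_eta, buffered_deltas) are ours, the sign convention assumes the local differences already point in the descent direction, and the authoritative specification is Algorithm 1 in the main text. The defaults β1 = 0.9, β2 = 0.99, ε = 10^-8 follow the hyper-parameter settings above, and τ_c can be set to the median observed delay as in Corollary 5.10.

```python
import numpy as np

def delay_adaptive_eta(eta, tau_max_t, tau_c):
    # Eq. (77): keep eta when the round's worst delay is small,
    # otherwise scale it down by that delay.
    return eta if tau_max_t <= tau_c else min(eta, eta / tau_max_t)

def fadas_server_step(x, m, v, v_hat, buffered_deltas, buffered_delays,
                      eta=0.001, beta1=0.9, beta2=0.99, eps=1e-8, tau_c=None):
    """One global round of a FADAS-style server update (illustrative sketch).

    buffered_deltas: list of M local model differences, possibly computed on
                     stale copies of the global model due to asynchrony.
    buffered_delays: the corresponding staleness values tau_t^i.
    """
    delta = np.mean(buffered_deltas, axis=0)        # Delta_t: average of the M buffered updates
    m = beta1 * m + (1 - beta1) * delta             # first moment
    v = beta2 * v + (1 - beta2) * delta ** 2        # second moment
    v_hat = np.maximum(v_hat, v)                    # AMSGrad-style maximum, cf. Lemma C.2
    if tau_c is None:                               # plain FADAS: constant global learning rate
        eta_t = eta
    else:                                           # delay-adaptive FADAS (FADASda), Eq. (77)
        eta_t = delay_adaptive_eta(eta, max(buffered_delays), tau_c)
    x = x + eta_t * m / (np.sqrt(v_hat) + eps)      # global adaptive step
    return x, m, v, v_hat

# Toy usage: a 10-dimensional model with M = 4 buffered updates and delays 0, 1, 2, 7.
d = 10
x = np.zeros(d); m = np.zeros(d); v = np.zeros(d); v_hat = np.zeros(d)
deltas = [0.01 * np.random.randn(d) for _ in range(4)]
x, m, v, v_hat = fadas_server_step(x, m, v, v_hat, deltas, [0, 1, 2, 7], tau_c=3)
```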