Efficient Distributed Optimization under Heavy-Tailed Noise

Su Hyeong Lee 1, Manzil Zaheer 2, Tian Li 3

Abstract

Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization and novel clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with local updates and potentially unbounded gradient variance. Among its variants, we propose a memory- and communication-efficient instantiation (named Bi2Clip) that performs coordinate-wise clipping from both above and below at both the inner and outer optimizers. Bi2Clip brings about the benefits of adaptive optimization (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including Bi2Clip, demonstrates superior performance on various tasks and models compared with state-of-the-art methods, while being more efficient.

1. Introduction

The training of deep learning models, including large language models (LLMs), has become increasingly resource-intensive, driven by expansive datasets and models with billions of parameters (Rosa et al., 2022; Liu et al., 2024b; Sriram et al., 2022; Dehghani et al., 2023).
As the computational demands escalate, distributed learning has emerged as the default approach, enabling the parallel activation of training processes across multiple compute nodes such as GPUs or datacenters. (1 Department of Statistics, University of Chicago; 2 Meta; 3 Department of Computer Science, University of Chicago. Correspondence to: Su Hyeong Lee. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).) However, this paradigm introduces a new bottleneck of communication overhead, especially as progress in compute power has outpaced that of network infrastructure (Wu et al., 2023; DeepSeek-AI, 2024). To mitigate these communication challenges, one promising strategy is the utilization of local updates. By allowing each compute node to perform multiple gradient updates locally before aggregation, the frequency and volume of inter-node communication can be significantly reduced (Smith et al., 2018; Stich, 2018; McMahan et al., 2017; Lee et al., 2024; Liu et al., 2024a; Jaghouar et al., 2024). For instance, the recent DiLoCo algorithm for training LLMs in datacenter environments can apply several hundred local gradient updates prior to aggregation to reduce communication costs (Douillard et al., 2024). This approach naturally formulates a nested optimization problem, where inner optimization occurs within each compute node, and outer optimization is orchestrated by the coordinating node(s). However, training attention-based models like LLMs introduces an additional challenge due to the properties of their stochastic gradient distributions. Empirical and theoretical investigations have consistently demonstrated that the gradient noise in these models follows a heavy-tailed distribution (Ahn et al., 2024; Nguyen et al., 2019; Simsekli et al., 2019; 2020; Kunstner et al., 2024; Gorbunov et al., 2020).
This heavy-tailed behavior, characterized by high or infinite variance and potentially very large deviations, poses significant challenges to the stability and convergence of existing optimization algorithms (Zhang et al., 2020b; Lee et al., 2024). Addressing these challenges necessitates the development of novel optimization strategies and a more principled understanding of their theoretical underpinnings. In this work, we propose TailOPT, an efficient and theoretically principled nested training framework designed to address the challenges posed by heavy-tailed gradient noise in distributed training with local updates. TailOPT introduces several key strategies, including clipping mechanisms (such as coordinate-wise or L2 clipping) and adaptivity, applied at both inner and outer optimizers, to mitigate the adverse effects of heavy-tailed noise. We note that the preconditioning step in adaptive optimizers (e.g., Streeter & McMahan, 2010) may be viewed as a form of soft clipping. We analyze the convergence of TailOPT while incorporating such adaptive methods, allowing for heavy-tailed noise with unbounded variance. Among the various instantiations of the TailOPT framework, we highlight Bi2Clip, a particularly scalable method that applies coordinate-wise clipping to gradients during inner iterations, and to model parameter updates at outer communication rounds, enforcing thresholding from both above and below on a per-coordinate basis. Our empirical and theoretical results demonstrate that TailOPT is strongly effective in mollifying heavy-tailed noise, enhancing the stability and convergence of the training dynamics across several language benchmarks as well as synthetic data. Our contributions may be summarized as follows.

- We introduce TailOPT, a general distributed training framework for large-scale models under communication-efficient local updates and heavy-tailed gradient distributions.
- Among its instantiations, we highlight Bi2Clip, which adjusts to gradient geometry similarly to adaptive optimizers (e.g., Adam (Kingma & Ba, 2015)) while avoiding additional memory and communication overhead for maintaining or transmitting preconditioners.
- We provide convergence guarantees for a class of TailOPT algorithms that leverage adaptive optimizers and various clipping strategies, effectively addressing heavy-tailed noise with potentially infinite gradient variance. This is achieved using a nested optimization framework, where the inner optimizer employs clipping operations to mitigate heavy-tailed gradient noise, while the outer optimizer utilizes either fully adaptive or efficient approximations of adaptive updates to guide the optimization process.
- We validate the practicality and effectiveness of TailOPT through extensive experiments on synthetic and real-world datasets in large-scale settings. Our experiments demonstrate that TailOPT produces several algorithmic instantiations that consistently outperform state-of-the-art baselines while being more efficient.

2. Related Works

We cite the most related work in this section, and provide an extended literature review in Appendix A.

Heavy-Tailed Gradient Noise. Training transformers and LLMs is complicated by heavy-tailed stochastic gradient distributions with very large variance, often theoretically and empirically modeled as Lévy α-stable processes (Ahn et al., 2024; Nguyen et al., 2019; Simsekli et al., 2019; 2020; Gorbunov et al., 2020; Kunstner et al., 2024; Chezhegov et al., 2024a). In such scenarios, vanilla SGD-based optimization methods have been shown to destabilize during training in both centralized as well as distributed settings (Gorbunov et al., 2020; Zhang et al., 2020b; Lee et al., 2024).
Recent advancements have explored centralized adaptive optimization techniques and robust gradient aggregation methods to mitigate the adverse effects of heavy-tailed noise, including gradient clipping (Simsekli et al., 2019; Juditsky et al., 2019a; Gorbunov et al., 2024a; Sadiev et al., 2023; Cutkosky & Mehta, 2021; Nguyen et al., 2023a) or adaptive clipping strategies (Chezhegov et al., 2024a). However, the complexities of handling heavy-tailed noise in nested distributed optimization environments often prevent these algorithms and their convergence bounds from extending to scenarios with multiple nodes training in parallel. Additionally, algorithms utilizing adaptive updates generally require preconditioner maintenance that incurs substantial memory costs. To our knowledge, developing an efficient distributed algorithm that provably converges under heavy-tailed stochastic gradient noise has remained an open challenge. For example, although DiLoCo (Douillard et al., 2024; Jaghouar et al., 2024; Liu et al., 2024a) is a recent algorithmic development with local updates for communication efficiency that demonstrates competitive empirical performance, it noticeably lacks theoretical guarantees and incurs non-trivial overheads when using adaptive optimizers. Our method addresses these gaps by introducing a nested and principled optimization framework, where a particular instantiation (Bi2Clip) brings about the benefits of adaptivity without the added overhead of maintaining preconditioners, and also outperforms DiLoCo empirically (Section 6).

Clipping for Stabilizing Training Dynamics.
Due to its success in stabilizing model updates, gradient clipping is a common technique that has been extensively studied empirically (Gehring et al., 2017; Merity et al., 2018; Peters et al., 2018; Mikolov, 2012) and theoretically (Gorbunov et al., 2020; Zhang et al., 2020b; Chezhegov et al., 2024a; Cutkosky & Mehta, 2021; Menon et al., 2020; Zhang et al., 2020a; Chen et al., 2020; Koloskova et al., 2023; Das et al., 2024). The majority of results study the centralized setting (e.g., Gorbunov et al., 2024a; Puchkin et al., 2024; Zhang & Cutkosky, 2022; Liu et al., 2023; Nguyen et al., 2023b; Parletta et al., 2024; Li & Liu, 2022), as moving to the distributed setting with local updates for communication efficiency introduces significant added analytical challenges, such as multiple inner optimizer updates prior to outer optimizer synchronization. Additionally, it was shown that using a constant clipping threshold can induce gradient bias, preventing the algorithm from ever converging (Chen et al., 2020; Koloskova et al., 2023). Therefore, some works have attempted to circumvent this issue by debiasing via error feedback (Khirirat et al., 2023; Zhang et al., 2024). Other works in distributed optimization have imposed strong distributional stochastic gradient structures in the analysis. For instance, Qian et al. (2021) assume a well-behaved angular dependence between the stochastic and deterministic gradients throughout training, and Liu et al. (2022) assume symmetric gradient noise, almost surely bounded stochastic gradients, as well as homogeneous data. Our proposed clipping mechanism, realized as an instantiation of TailOPT (i.e., BiClip), fundamentally differs from prior approaches by integrating per-coordinate clipping from both above and below.
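To make the constant-threshold bias concrete, here is a toy numeric sketch (values are illustrative and not from the paper): clipping every sample of a skewed gradient distribution at a fixed threshold shifts its mean, which is exactly the bias that prevents convergence without a relaxing schedule.

```python
def clip_scalar(g, c):
    """Clip a scalar gradient to [-c, c] (the 1-D analogue of L2 clipping)."""
    return max(-c, min(c, g))

# Skewed toy gradient samples: the true mean is 2.0, but most of the mass
# sits at 0.5 while a rare spike at 8.0 carries much of the signal.
grads = [0.5, 0.5, 0.5, 0.5, 8.0]
true_mean = sum(grads) / len(grads)  # 2.0

# A constant threshold c = 2 shrinks only the spike, so the expected clipped
# gradient is 0.8 instead of 2.0: a persistent bias. A growing threshold
# schedule (as in Section 5) lets this bias vanish over time.
clipped_mean = sum(clip_scalar(g, 2.0) for g in grads) / len(grads)  # 0.8
```

The heavier the upper tail, the larger this gap, which is why debiasing or threshold schedules become essential under heavy-tailed noise.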
The inner optimization steps employ clipping operations to adapt to the gradient geometry, complemented by the outer optimizers, which enhance rare signals through adaptivity or adaptive approximations. In addition, in the analysis of TailOPT (Section 5), we do not impose any conditions on the noise or data distributions except for a finite noise α-moment for some α ∈ (1, 2). Our algorithm and analysis also accommodate local updates and allow for potentially unbounded stochastic gradient variance. We provide an extended review of distributed algorithms under heavy-tailed noise in Appendix A.

3. Problem Formulation

In distributed optimization, the global objective is constructed by taking a weighted average over the local node objectives Fi(x) for model parameters x ∈ R^d and node i. In scenarios where data sizes at each node are unbalanced or sampling probabilities vary, the objective becomes

F(x) = Σ_i p_i F_i(x),  (1)

where p_i is proportional to the local data size of node i. Here, F_i(x) is defined as E_{ξ∼D_i}[F_i(x, ξ)], where F_i(x, ξ) = F_i(x) + ⟨ξ, x⟩ represents the stochastic local objective, and D_i is the noise distribution of node i. This term comes from integrating the gradient noise model ∇F_i(x_i^t, ξ_i^t) = ∇F_i(x_i^t) + ξ_i^t, where x_i^t, ξ_i^t are the parameter weights and stochastic gradient noise of node i at timestep t. In our formulation and theoretical analysis (Section 5), we allow for both independent and identically distributed (IID) data across N nodes, as commonly observed in datacenter environments, and more challenging non-IID data distributions. We now present the assumptions used in the convergence analysis.

Assumption 1 (L-smoothness). For all x, y ∈ X and i ∈ [N], the local objectives satisfy F_i(x) ≤ F_i(y) + ⟨x − y, ∇F_i(y)⟩ + L_i‖x − y‖²/2.

Assumption 2 (Bounded α-moment). For all nodes i ∈ [N] with noise distribution D_i, there exist α_i ∈ (1, 2) and B_i > 0 such that E[‖ξ_i‖^{α_i}] < B_i^{α_i}.

Assumption 2 expresses that the noise distribution can be heavy-tailed.
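As a concrete illustration of Assumption 2 (our own example, not taken from the paper), a Pareto-type noise density shows how the α-moment can be finite while the variance is infinite:

```latex
% Pareto-type tail with index a \in (1, 2): density p(\xi) = a\,\xi^{-(1+a)} for \xi \ge 1.
% Its \beta-th moment is finite exactly when \beta < a:
\mathbb{E}\!\left[\xi^{\beta}\right]
  = \int_{1}^{\infty} \xi^{\beta}\, a\, \xi^{-(1+a)}\, d\xi
  = a \int_{1}^{\infty} \xi^{\beta - a - 1}\, d\xi
  = \frac{a}{a - \beta}, \qquad \beta < a,
% and the integral diverges for \beta \ge a. Hence for a \in (1, 2) the variance
% (\beta = 2) is infinite, while Assumption 2 holds for any \alpha_i \in (1, a).
```

Distributions of this type are precisely the regime where SGD-style analyses relying on bounded variance break down.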
In particular, we note that the variance of the noise can be infinite (α_i < 2), a setting in which distributed SGD was shown to fail to converge, both empirically and theoretically (Yang et al., 2022; Lee et al., 2024). This condition on the α_i is essentially the weakest possible, in that sending α_i → 1+ recovers the integrability condition on the noise, the minimal assumption necessary to form expectations. Furthermore, we note that E‖ξ‖^α < ∞ implies E‖ξ‖^β < ∞ for all β < α. Therefore, we let α := min_{i∈[N]} α_i ∈ (1, 2) in the subsequent analysis for notational convenience. We note that this is strictly weaker than a conventional heavy-tailed assumption on the stochastic gradients, which is commonly given (e.g., Zhang et al., 2020b) as E[‖∇F_i(x_i^t, ξ_i^t)‖^{α_i}] < B_i^{α_i}, which implies that ∇F_i(x_i^t) is bounded. By contrast, this cannot be implied by Assumption 2. We also note that some works in the literature define heavy-tailed distributions with bounded variance when establishing algorithm convergence bounds (e.g., Gorbunov et al., 2020; Parletta et al., 2024; Li & Liu, 2022; Das et al., 2024), which differs from our definition. We carry out our convergence proofs in the more general infinite-variance setting, which naturally implies convergence under bounded stochastic gradients or variance.

4. TailOPT: An Efficient Heavy-Tailed Optimization Framework

In this section, we begin by motivating the heavy-tailed optimization framework (TailOPT), a scalable training framework for heavy-tailed noise. SGD is a strong candidate given its simplicity and efficiency, but it has been shown to diverge under heavy-tailed noise in both centralized (Zhang et al., 2020b) and distributed settings (Lee et al., 2024). Therefore, modifications are necessary to stabilize the noisy updates. Gradient clipping is a widely adopted technique to mitigate the impact of large gradients (Menon et al., 2020; Zhang et al., 2020a; Chen et al., 2020; Koloskova et al., 2023; Yang et al., 2022).
Typically, the clipping operator rescales the gradient uniformly to ensure its L2 norm remains below a predefined threshold. This procedure is mathematically equivalent to applying a dynamically adjusted, lower learning rate when large stochastic gradients are encountered. Therefore, we first include and analyze the usage of L2 clipping (L2Clip) in TailOPT to stabilize heavy-tailed noisy updates. More specifically, we use L2 clipping on the gradients prior to standard gradient descent updates on each node, while a global model weight projection strategy is utilized on the outer optimizer after synchronizing all the collected updates. For additional clarity, the precise pseudocode (Algorithm 2) and analysis are given in Appendix C.1.

Interpolating Adaptivity: BiClip. However, previous works on L2 clipping of gradients or model updates (e.g., Yang et al., 2022) do not adapt to gradient geometry, as they proportionally and uniformly downscale each gradient coordinate. Therefore, smaller signals become even more difficult to detect and propagate. Adaptive optimizers have consistently demonstrated superior performance for training modern architectures (Zhang et al., 2020b; Reddi et al., 2021; Lee et al., 2024). Key among adaptive methods such as Adam (Kingma & Ba, 2015) and Adagrad (Duchi et al., 2011; Streeter & McMahan, 2010) is the use of preconditioning, where preconditioners estimated from historical gradients effectively result in per-coordinate learning rates. This process amplifies rare yet important gradient coordinates, while scaling down uninformative gradients, speeding up convergence. The trade-off, however, lies in the increased systems requirements to compute and maintain preconditioners.
For instance, deploying Adam can triple the memory demand to host model parameters during minibatch backpropagation compared to vanilla SGD, due to the maintenance of exponentially decaying moving averages of the first and second moments. To take advantage of adaptivity without incurring additional memory or communication overhead, we propose a new clipping mechanism, BiClip, that performs coordinate-wise clipping from both above and below. BiClip is motivated by an interpolation between clipped-SGD and adaptive methods. It relies on two clipping thresholds (scalars) to dynamically rescale gradients in a per-coordinate fashion, while eliminating the overhead of preconditioner maintenance. For a model parameter x ∈ R^m, parameter coordinate j ∈ [m], lower clipping threshold d, and upper clipping threshold u (0 ≤ d ≤ u), BiClip is defined as

BiClip(u, d, x)_j := sign(x_j)·[d·χ(|x_j| ≤ d)] + sign(x_j)·[u·χ(|x_j| ≥ u) + |x_j|·χ(d < |x_j| < u)],  (2)

where χ is the indicator function. (For clarity in notation, we define 0/0 := 0.) BiClip draws on the intuition of adaptive methods by selectively amplifying small gradient coordinates (|x_j| ≤ d) while clipping down large ones (|x_j| ≥ u). In contrast to typical adaptive optimizers, BiClip does not require preconditioner maintenance, with significantly reduced optimizer requirements identical to SGD. While our focus is on the distributed setting, which aligns with practical applications, we note that BiClip itself can also be beneficial for centralized training (empirically validated in Section 6.4). In the next paragraph, we discuss our general TailOPT framework (Algorithm 1), where one instance of TailOPT applies BiClip as both the inner and outer optimizers.

TailOPT. In the TailOPT framework (Algorithm 1), the inner optimization strategy, denoted as TailClip, refers to either BiClip or L2Clip.
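A direct per-coordinate sketch of the BiClip operator in Eq. (2), including the sign(0) = 0 convention implied by the 0/0 := 0 footnote (a minimal illustration, not the paper's implementation):

```python
def biclip(u, d, x):
    """BiClip(u, d, x): coordinate-wise clipping from above (u) and below (d), Eq. (2)."""
    assert 0 <= d <= u
    out = []
    for xj in x:
        sign = 0.0 if xj == 0 else (1.0 if xj > 0 else -1.0)  # sign(0) := 0
        mag = abs(xj)
        if mag <= d:
            out.append(sign * d)   # amplify small coordinates up to d
        elif mag >= u:
            out.append(sign * u)   # cap large coordinates at u
        else:
            out.append(xj)         # leave mid-range coordinates untouched
    return out

# Small coordinates are pushed up to d = 0.5, large ones capped at u = 5.0.
g = biclip(5.0, 0.5, [0.1, -0.2, 2.0, 7.0, -9.0, 0.0])
# -> [0.5, -0.5, 2.0, 5.0, -5.0, 0.0]
```

Setting d = 0 recovers standard coordinate-wise upper clipping, and letting u grow with d = 0 recovers plain SGD, which is the interpolation the text describes.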
In Line 10, the outer optimization strategy can be either an adaptive or a non-adaptive method that incorporates clipping, adaptivity, or momentum on top of the Δ^t by treating them as pseudogradients. We present multiple instantiations of TailOPT along with their convergence bounds under heavy-tailed noise in Section 5, as well as in Appendix C.

Algorithm 1 Heavy-Tailed Optimization (TailOPT)
Require: Initial model x^1, learning rate schedule η_t, clipping schedules u_t ≥ d_t ≥ 0, synchronization timestep z ∈ Z_{>0}
1: for t = 1, ..., T do
2:   for each node i ∈ [N] in parallel do
3:     x^t_{i,0} ← x^t
4:     for each local step k ∈ [z] do
5:       Draw gradient g^t_{i,k} = ∇F_i(x^t_{i,k}, ξ^t_{i,k})
6:       x^t_{i,k+1} ← x^t_{i,k} − η_t · TailClip(u_t, d_t, g^t_{i,k})
7:     end for
8:   end for
9:   Δ^t = (1/N) Σ_{i∈[N]} x^t_{i,z} − x^t
10:  x^{t+1} = OuterOptimizer(x^t, Δ^t)
11: end for

Among those, we propose and highlight one efficient method that achieves superior empirical performance, which utilizes the BiClip(·) operator (Eq. (2)) in both the inner and outer optimizers, called Bi2Clip. The exact pseudocode is presented in Algorithm 4 (Appendix C.3). Intuitively, Bi2Clip mitigates the effects of heavy-tailed noise across the inner as well as outer optimizers, while mimicking adaptive updates to amplify rare gradient signals. In Section 6, we empirically demonstrate that Bi2Clip outperforms state-of-the-art baselines without transferring or maintaining preconditioners in the distributed setting. For clarity, throughout the paper, we list the outer optimizer followed by the inner optimizer when referencing algorithms. For example, Adam-BiClip instantiates Adam as the outer optimizer and BiClip as the inner optimizer. Similarly, RMSProp-TailClip refers to RMSProp as the outer optimizer and TailClip (either L2Clip or BiClip) as the inner optimizer. Finally, Bi2Clip refers to the algorithm with BiClip as both inner and outer optimizers.
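The nested loop of Algorithm 1 can be sketched in a single process. The following toy simulation uses BiClip as TailClip and simple averaging as the outer optimizer (i.e., Avg-BiClip); the quadratic node objectives, the Cauchy-like noise model, and all constants are illustrative choices, not the paper's experimental setup.

```python
import random

def biclip(u, d, g):
    """Coordinate-wise clipping from above (u) and below (d); sign(0) := 0."""
    return [0.0 if v == 0 else (1.0 if v > 0 else -1.0) * min(max(abs(v), d), u)
            for v in g]

def stochastic_grad(x, target, rng):
    """Gradient of 0.5 * (x - target)^2 plus heavy-tailed (Cauchy-like) noise."""
    noise = rng.gauss(0.0, 1.0) / max(abs(rng.gauss(0.0, 1.0)), 1e-3)
    return [x[0] - target + noise]

def tailopt_round(x, targets, eta, u, d, z, rng):
    """One outer round: z clipped local steps per node, then average the deltas."""
    deltas = []
    for target in targets:                    # each node i "in parallel" (Line 2)
        xi = list(x)
        for _ in range(z):                    # z local steps (Lines 4-7)
            g = biclip(u, d, stochastic_grad(xi, target, rng))
            xi = [w - eta * c for w, c in zip(xi, g)]
        deltas.append([a - b for a, b in zip(xi, x)])
    # Outer step (Lines 9-10) with plain averaging: x <- x + mean_i Delta_i.
    return [w + sum(dl[j] for dl in deltas) / len(deltas)
            for j, w in enumerate(x)]

rng = random.Random(0)
x = [10.0]                                    # initialized far from the optimum
for _ in range(200):
    x = tailopt_round(x, targets=[-1.0, 1.0], eta=0.1, u=1.0, d=0.0, z=4, rng=rng)
# The average of the two node optima is 0, so x should settle into a neighborhood
# of 0 despite the heavy-tailed noise.
```

Swapping the averaging step for an adaptive update (or a second BiClip on Δ^t) yields the other instantiations named in the text.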
5. Convergence of the TailOPT Framework

Due to space constraints, we present convergence results for only a subset of TailOPT instantiations in the main text. For a comprehensive analysis, Appendices C.1 and C.2 provide detailed convergence bounds for Avg-L2Clip, and Appendices C.3-C.6 include additional convergence analyses and precise pseudocode for various (adaptive) algorithms of the TailOPT framework incorporating Adagrad, RMSProp, or Adam. Additionally, we note that the convergence of Bi2Clip subsumes that of Avg-BiClip. While clipping offers the benefit of stabilization, it introduces a non-zero bias on the stochastic gradients, so that they are no longer unbiased estimators of the true gradient. Theorems 1 and 2 demonstrate that with appropriately chosen schedules for the (increasing) upper clipping threshold u_t and the (decreasing) learning rate η_t and lower clipping threshold d_t, convergence of Algorithm 1 is nevertheless attainable. Up to O(d) factors, the presented convergence bounds hold for both gradient-wise clipping as well as coordinate-wise clipping. Generalization to layer-wise clipping with varying thresholds specific to each layer or model weight tensor slice is straightforward. We carry out our analysis for possibly non-convex problems where the model weights x^t ∈ X are contained within a sufficiently large, compact set X ⊂ R^d. In such settings, finding the global minimum is known to be NP-hard, and the standard convergence metric is the stabilization of the minimum gradient. We then obtain the following theorems, where the pseudocode for each algorithm instantiation is detailed in Appendix C.

Theorem 1. Let Assumptions 1-2 hold. Instantiating the outer optimizer in Algorithm 1 with RMSProp gives Algorithm 6 (RMSProp-TailClip). Let the clipping and learning rate schedules satisfy η_t = Θ(t^ω), η_ℓ^t = Θ(t^ν), d_t = Θ(t^γ), and u_t = Θ(t^ζ). Impose the rate schedule conditions ζ > 0, γ < −1/2, and ω + ζ + 1/2 > 0.
Then, we have that

min_{t∈[T]} E‖∇F(x^t)‖² ≤ Σ_{i=1}^{6} Ψ_i,

where the Ψ_i are upper bounded by

Ψ_1 ≤ O(T^{ω+ζ−1/2}), Ψ_2 ≤ O(T^{ω+2ν+3ζ+1/2}), Ψ_3 ≤ O(T^{4ζ+3ν+1/2}),
Ψ_4 ≤ O(T^{2ν+2ζ+1/2}), Ψ_5 ≤ O(T^{ν+γ+ζ+1/2}), Ψ_6 ≤ O(T^{ν+(2−α)ζ+1/2}),

which guarantees convergence via an inversely proportional power-law decay with respect to T. Here, the exponential moving average parameter of the second pseudogradient moment is fixed within the range β̃_2 ∈ [0, 1). In particular, the proof of this result immediately implies the following summarizing corollary.

Corollary 1. Algorithm 6 (RMSProp-TailClip) converges under heavy-tailed stochastic gradient noise. The convergence rate is attained for ζ = 1/(2α) and ν = −(α+1)/(2α), stabilizing the minimum expected squared gradient norm at rate O(T^{(1−α)/(2α)}).

The full proofs of all results in this section are given in Appendix C and hold for both convex and non-convex functions. This achieves convergence in the presence of heavy-tailed noise with local updates, where the rate depends on the α-moment condition, consistent with prior convergence results in the heavy-tailed setting (e.g., Zhang et al. (2020b)). As α → 1+, we see that convergence is nullified. We also attain the identical convergence rate for an alternate instantiation (Adagrad-TailClip), with the exact algorithm given in Algorithm 5 and the convergence result in Theorem 6 of the appendix. When deploying distributed optimization, adaptive optimizers such as Adam can considerably increase the memory requirements on each compute node due to preconditioner storage, which matches the model parameter tensor size. This issue is magnified when one deploys adaptive optimizers at both local compute nodes and outer coordinating nodes, as preconditioners may be transmitted from outer to inner optimizers to maximize performance, incurring additional communication overhead (Wang et al., 2021b; Sun et al., 2023). This naturally motivates us to apply BiClip (Eq.
(2)) at both inner and outer optimizers, resulting in Bi2Clip, retaining the benefits of adaptivity with minimal overhead. In general, all instantiations of our framework do not require transmitting preconditioners or extra gradient statistics. Convergence results for Bi2Clip are given below.

Theorem 2. Let the learning rate and clipping schedules satisfy η_t = Θ(t^ω), η_ℓ^t = Θ(t^ν), d_t = Θ(t^γ), u_t = Θ(t^ζ), d̃_t = Θ(t^γ̃), and ũ_t = Θ(t^ζ̃). For Bi2Clip (Algorithm 4), we have that the minimum gradient satisfies

min_{t∈[T]} E[‖∇F(x^{t−1})‖²] ≤ Σ_{i=1}^{7} Ψ_i,

where the Ψ_i are given by

Ψ_1 = O(T^{ω−ν−1}), Ψ_2 = O(T^{ω+2ζ̃−ν}), Ψ_3 = O(T^{γ}), Ψ_4 = O(T^{γ̃−ν}),
Ψ_5 = O(T^{(α−1)ν+(1−α)ζ̃}), Ψ_6 = O(T^{(1−α)ζ}), Ψ_7 = O(T^{ν+ζ}).

To attain convergence, we impose ζ, ζ̃ > 0 > γ, γ̃ and ω, ν ≤ 0, as well as −1 < ω + ν. Then, for γ, γ̃ small enough and ν = −α/(4α−2), ζ = 1/(4α−2), ω = −1/2, Bi2Clip converges with rate O(T^{−σ}), where σ = (α−1)/(4α−2), in the limit ζ̃ → 0+. This gives the following corollary.

Corollary 2. Algorithm 4 (Bi2Clip) converges with respect to heavy-tailed stochastic gradient noise (α > 1). For instance, if the moment is further constrained by α ≥ 1.5, the algorithm converges with a rate of at least O(T^{−σ}) for σ = 1/8.

Similar to RMSProp-TailClip, the results here hold for both convex and non-convex functions as long as the assumptions are satisfied. The convergence rate given in Corollary 2 represents a lower bound on the maximal achievable rate, obtained by a particular selection of hyperparameters. Interestingly, our empirical results demonstrate that Bi2Clip outperforms other methods, suggesting that the current convergence bounds could be further refined.

Discussions. To ensure convergence and mitigate bias in the derived bound, it is necessary for the upper clipping threshold u_t → ∞ and the lower clipping threshold d_t → 0 as t → ∞, consistent with established counterexamples that occur due to unmitigated clipping bias (Koloskova et al., 2023; Chen et al., 2020).
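The rate exponent σ = (α − 1)/(4α − 2) from Theorem 2 can be sanity-checked numerically (a trivial sketch of the stated formula, nothing more):

```python
def bi2clip_rate_exponent(alpha):
    """sigma = (alpha - 1) / (4*alpha - 2): Bi2Clip's rate exponent from Theorem 2."""
    assert alpha > 1, "the heavy-tail moment condition requires alpha > 1"
    return (alpha - 1) / (4 * alpha - 2)

sigma_15 = bi2clip_rate_exponent(1.5)  # = 0.5 / 4.0 = 0.125, i.e., 1/8 (Corollary 2)
# The rate improves monotonically as the tails lighten (alpha -> 2 gives 1/6),
# and vanishes as alpha -> 1+, consistent with the discussion around Corollary 1.
```

This matches the α ≥ 1.5 instance singled out in Corollary 2.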
In cases where stochastic gradients are sampled from large-variance distributions, this necessitates a continual warm-up phase that is continuously relaxed, akin to learning rate warm-up schemes that conclude after a finite period (Kosson et al., 2024). The clipping schedules prescribed by Theorems 1 and 2 grow polynomially with respect to t, which indexes the realization of model weights throughout training. This effectively deactivates gradient clipping after an initial warm-up phase that is shaped by the noise distribution's tail behavior and the clipping thresholds. This may help to explain why learning rate warm-ups are observed to significantly improve training (Kalra & Barkeshli, 2024; Ma & Yarats, 2021) in the presence of heavy-tailed stochastic gradients. Finally, as the maximal bounded moment condition α approaches the integrability threshold (α = 1), or as γ nears 0−, the convergence bound weakens. Despite this, in our experiments, we set ν = ζ = γ = 0 (i.e., fixing learning rates and clipping thresholds), which yielded strong empirical performance. Intuitively, this setup corresponds to a continual amplification of informative coordinates and attenuation of uninformative coordinates.

Other Instantiations and Extensions. As noted previously, we extend our analysis to support an Adagrad-based outer optimizer (Algorithm 5) and provide a convergence guarantee under heavy-tailed noise, detailed in Theorem 6 in Appendix C. In Appendix C.6, we further generalize our framework by incorporating momentum into the first-order stochastic pseudogradient statistics, resulting in an Adam instantiation of the outer optimizer. While we establish that the expected minimum gradient is asymptotically bounded even under restarting (Theorem 8), proving formal convergence to 0 remains an open challenge. The difficulty arises from the moving average applied to the first moment, which retains all historical gradient information and complicates the analytical proof structure.
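The polynomial Θ(t^exponent) schedules used throughout Section 5 can be sketched directly; the base constants below are arbitrary illustrative choices, not tuned values from the paper.

```python
def poly_schedule(base, exponent):
    """Return t -> base * t**exponent, a Theta(t^exponent) schedule as in Theorems 1-2."""
    return lambda t: base * t ** exponent

u_t = poly_schedule(1.0, 0.2)     # upper threshold grows (zeta > 0): clipping relaxes
d_t = poly_schedule(0.1, -0.3)    # lower threshold decays toward 0 (gamma < 0)
eta_t = poly_schedule(0.5, -0.5)  # decaying learning rate (omega < 0)

# Early in training, clipping is tight (a warm-up phase); later, u_t is large
# and d_t is near 0, so BiClip approaches the identity and the clipping bias
# vanishes, which is the mechanism behind the convergence guarantees.
```

Setting all exponents to 0 recovers the fixed-threshold configuration used in the experiments.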
We also extend convergence results for certain instantiations to allow for node drops or failures at each round (Appendix C.2). Our bound further highlights that the dominating terms are influenced by the upper clipping threshold u_t, which we tune empirically in Section 6 by sweeping over a hyperparameter grid defined in Appendix D.7. For this result, we extremize the noise tails such that there exists α > 1 for which the α-moment is finite, under which u_t stabilizes the gradient dynamics.

6. Experiments

We assess the performance of various TailOPT instantiations across a range of empirical tasks, benchmarking them against state-of-the-art algorithms from the literature. Our experiments include synthetic tasks with heavy-tailed noise injection and real-world benchmarks, including GLUE (Wang et al., 2019) for natural language understanding, WMT (Foundation, 2019) for machine translation, Qwen2.5 (Yang et al., 2025) for question answering, and ViT (Dosovitskiy et al., 2021) for image classification. We present a brief summary of each experimental setup in the corresponding subsections. Extended details of the experimental setup, dataset descriptions, and extensive hyperparameter tuning procedures (including the best hyperparameters for each method and dataset) are provided in Appendix D. Our code is publicly available at github.com/sulee3/HeavyTails.

6.1. Convex Models

[Figure 1: four panels plotting log(‖w∗ − ŵ‖) against log(outer node steps) on synthetic data for Avg-SGD, Avg-L2Clip, Avg-BiClip, Adam-BiClip, and Adam2, under (a) No Noise, (b) Light-Tailed Gaussian (scale 1), (c) Light-Tailed Gaussian (scale 3), and (d) Heavy-Tailed t-Distribution noise.]

Figure 1: Impact of heavy-tailed noise on synthetic datasets. Without gradient noise, Avg-SGD achieves good performance (c.f., (a)).
As the noise tails grow heavier (from (a) to (d)), the performance of Avg-SGD deteriorates considerably. By contrast, both clipping mechanisms and adaptive updates demonstrate improved performance in learning the ground truth w∗, and effectively mitigate the adverse effects of heavy-tailed noise. BiClip as the inner optimizer outperforms other options (Adam, L2Clip, and SGD) with heavy-tailed noise (see (d)). See Appendix D.1 for full results.

Table 1: Test results on the GLUE benchmark. Metric descriptions are given in Appendix D.3, and the full results are reported as Table 3 in the appendix. Entries marked with 0.0 indicate failure to learn. Top first, second, and third best-performing algorithms are highlighted. For Adam2, preconditioners are transmitted between the inner and outer optimizers, whereas DiLoCo requires maintaining preconditioners on the inner optimizers, both of which incur more significant communication or memory overhead than Bi2Clip. Our experiments show that Bi2Clip achieves the best performance with the smallest overhead.
| Algorithm | MNLI | QNLI | QQP (Acc/F1) | RTE | SST-2 | MRPC (Acc/F1) | CoLA | STS-B (S/P) | Average |
|---|---|---|---|---|---|---|---|---|---|
| Avg-SGD (McMahan et al., 2017) | 81.13 | 83.21 | 78.71/78.69 | 57.40 | 90.94 | 67.30/80.52 | 0.0 | 26.76/28.20 | 61.17 |
| Avg-L2Clip (Yang et al., 2022) | 81.82 | 85.68 | 80.00/79.82 | 54.51 | 91.97 | 68.38/81.22 | 0.0 | 41.27/40.96 | 64.15 |
| Avg-Adagrad | 84.70 | 88.79 | 87.09/83.34 | 64.26 | 93.34 | 71.56/82.63 | 27.72 | 81.93/81.26 | 76.97 |
| Avg-Adam | 84.97 | 89.47 | 87.66/84.09 | 64.62 | 93.80 | 81.86/87.74 | 41.41 | 86.21/86.55 | 80.76 |
| Avg-BiClip | 85.08 | 89.45 | 87.83/84.12 | 66.06 | 94.03 | 71.32/82.45 | 41.40 | 84.08/84.48 | 79.12 |
| Adagrad-SGD (Reddi et al., 2021) | 82.40 | 86.61 | 82.51/77.68 | 71.48 | 92.08 | 85.53/89.52 | 47.80 | 40.37/42.24 | 72.69 |
| Adagrad-BiClip | 85.54 | 90.02 | 88.60/85.05 | 73.36 | 93.23 | 85.78/89.86 | 48.87 | 84.03/85.90 | 82.75 |
| RMSProp-SGD (Reddi et al., 2021) | 84.20 | 88.46 | 87.12/83.30 | 72.56 | 91.85 | 85.50/89.17 | 52.39 | 45.72/41.80 | 74.73 |
| RMSProp-BiClip | 85.56 | 89.82 | 88.50/84.44 | 70.75 | 93.69 | 84.80/88.92 | 50.99 | 87.65/87.79 | 82.99 |
| Adam-SGD (Reddi et al., 2021) | 82.93 | 86.98 | 85.99/80.87 | 66.78 | 90.71 | 87.01/90.09 | 49.93 | 44.48/41.26 | 73.37 |
| Adam-L2Clip | 82.54 | 86.69 | 85.88/80.72 | 59.92 | 89.67 | 85.29/89.90 | 48.54 | 69.19/67.16 | 76.86 |
| Adam-BiClip | 84.26 | 89.20 | 88.64/84.74 | 69.67 | 92.43 | 86.52/90.09 | 56.12 | 82.83/79.71 | 82.20 |
| Adam2 (Wang et al., 2021b) | 85.11 | 88.87 | 89.04/85.51 | 71.48 | 92.66 | 87.50/91.03 | 52.70 | 84.47/83.82 | 82.93 |
| DiLoCo (Douillard et al., 2024) | 85.68 | 89.87 | 88.78/85.19 | 67.87 | 91.89 | 87.99/91.20 | 54.77 | 85.93/84.76 | 83.08 |
| Bi2Clip | 85.06 | 89.73 | 84.93/83.97 | 76.53 | 93.80 | 89.21/92.44 | 60.08 | 87.07/86.89 | 84.52 |

We designed our convex, synthetic dataset setup to explicitly control and inject heavy-tailed noise, enabling a focused study of its effects. In language tasks, the frequencies of words or tokens typically follow a heavy-tailed distribution, where a small subset of tokens occurs with high frequency, while the majority appear infrequently yet carry significant contextual information. To mirror this phenomenon, emulating a similar setup in Li et al.
(2022), we partitioned the input feature space into common and rare features. Specifically, we set the first p = 10% of the features (or tokens) from data X as common features, with each feature activated according to a Bernoulli distribution Bern(0.9). The remaining 90% of the features are configured as rare, each sampled from Bern(0.1). The weight vector w* is drawn from a standard multivariate normal distribution, w* ~ N(0, I_m), and the labels are generated as y = Xw* + ξ_noise. A neural network with model weight ŵ is then trained to learn the ground truth w*. A comprehensive explanation of the dataset construction and experimental setup is provided in Appendix D.1. We inject noise ξ_noise sampled from a heavy-tailed distribution, which implies that the induced stochastic gradients must be heavy-tailed under the MSE loss. In Figure 1, we sample from the Gaussian and Student t distributions for the non-heavy-tailed and heavy-tailed ξ_noise, respectively. By default, we multiply the noise by scale 1 unless otherwise specified. We observe that while SGD demonstrates strong performance in non-noisy settings, its effectiveness diminishes as the noise tails become heavier, a scenario where adaptive methods and Bi Clip excel. Similarly, L2Clip shows some ability to mitigate heavy-tailed noise but exhibits a comparable decline in performance under heavy-tailed conditions.

6.2. Transformer Encoders

To evaluate the effectiveness of our approach, we finetune RoBERTa (Liu et al., 2019) on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), a widely used suite of natural language understanding tasks. The GLUE benchmark includes diverse tasks such as sentiment analysis, sentence similarity, textual entailment, and natural language inference, providing a comprehensive evaluation of model performance across multiple linguistic phenomena.
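For reference, the synthetic construction of Section 6.1 above can be sketched in a few lines of NumPy. This is an illustrative reading of the setup only: the sample count, dimensionality, learning rate, and the Student-t degrees of freedom are placeholder choices, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 100          # number of samples and features (illustrative sizes)
p = int(0.1 * m)          # first 10% of features are "common"

# Common features fire with probability 0.9; rare features with probability 0.1.
X = np.zeros((n, m))
X[:, :p] = rng.binomial(1, 0.9, size=(n, p))
X[:, p:] = rng.binomial(1, 0.1, size=(n, m - p))

# Ground-truth weights and labels with injected heavy-tailed label noise.
w_star = rng.standard_normal(m)        # w* ~ N(0, I_m)
xi = rng.standard_t(df=1.5, size=n)    # heavy tails: infinite variance for df < 2
y = X @ w_star + xi                    # under MSE loss, gradients inherit the tails

# One SGD step on the MSE loss shows how the noise enters the gradient directly.
w_hat = np.zeros(m)
grad = 2 * X.T @ (X @ w_hat - y) / n
w_hat -= 0.01 * grad
```

Because the label noise enters the MSE gradient linearly, any heavy-tailed choice of ξ_noise makes the stochastic gradients heavy-tailed as well, which is the property the experiments exploit.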
We follow standard finetuning protocols for RoBERTa, leveraging pretrained weights and optimizing task-specific objectives for each dataset in GLUE. Our results demonstrate that Bi Clip (Bi2Clip) attains performance competitive with or superior to Adam (Adam2), while only requiring SGD-like memory and compute. Table 1 presents the performance of the algorithms of interest on the GLUE benchmark. Our results show that L2Clip enhances performance on real-world data. Adaptive methods further improve upon these results, consistently outperforming L2Clip (e.g., convergence curves in Figure 2). Notably, the newly proposed clipping method in Tail OPT, Bi Clip, demonstrates superior performance compared to L2Clip and, in some cases, even surpasses Adam at test time (cf. comparing Bi2Clip and Adam2), highlighting its potential as an efficient and effective optimizer in real-world applications. Additionally, instantiations of Tail OPT achieving 80% average accuracy generally employ adaptive or adaptivity-approximating optimizers across all nodes. In particular, adaptivity in the inner optimizer appears crucial for performance, as SGD-based methods perform considerably worse (around 75%). By contrast, both Bi Clip and Adam reach 80% even when combined with a simple averaging outer optimizer strategy. We include full results of more baselines in Appendix D.3.

Figure 2: Convergence curves on the QNLI dataset (panels: (a) Effects of L2 clipping, (b) Effects of Adaptivity, (c) Effects of Bi Clip; test accuracy vs. outer node steps). In (a), we see that L2Clip (one option of Tail Clip) can help to improve performance under different outer optimizers. Panel (b) demonstrates that adaptivity also helps to mitigate the negative effects of heavy-tailed noise. In all three plots (a)-(c), L2Clip performs worse than adaptive methods, but the coordinate-wise Bi Clip optimizer performs comparably or even better than adaptive optimization frameworks, manifesting Adam-like performance. We note that the Adam2 baseline, which applies Adam in both inner and outer optimization, requires transmitting preconditioners of the same size as the model weights to inner optimizers, resulting in substantial communication and memory overhead. By contrast, Bi2Clip removes the need to maintain preconditioners, sidestepping this bottleneck entirely.

6.3. Generative Models

We also evaluate Tail OPT on machine translation tasks utilizing the WMT datasets, a widely used benchmark for translation research (Foundation, 2019). Specifically, we finetune the T5 (Raffel et al., 2020) generative model on the TED Talks and News Commentary parallel training datasets. The TED Talks dataset, originally sourced from IWSLT 2017 (Cettolo et al., 2017), comprises multilingual translations of TED Talk transcripts, while the News Commentary dataset includes parallel text from news articles across various languages. We report both BLEU and METEOR scores across several variants of source and target language translations in Table 2. An expanded table with a more extensive evaluation is provided in Table 4 in the appendix.

Table 2: Evaluation results on machine translation benchmarks. Metrics are formatted as BLEU/METEOR for various language pairs across the TED and News Commentary datasets. The final column reports the average score across all metrics for each algorithm. See Table 4 for expanded results.

Algorithm | TED (en-de) | News Comm (en-fr) | Avg
Avg-SGD | 28.02/58.52 | 30.07/54.13 | 42.68
Avg-L2Clip | 28.99/58.94 | 31.02/56.73 | 43.92
Adam2 | 28.06/58.05 | 30.97/55.85 | 43.23
Bi2Clip | 29.41/59.18 | 31.79/57.69 | 44.51

Discussions.
For language reasoning benchmarks, the performance differences across algorithmic instantiations are particularly pronounced. While L2 clipping is a common stabilization strategy, it exhibits limited effectiveness. In contrast, coordinate-wise Bi Clip demonstrates significantly better stability and performance. Moreover, frameworks aiming to utilize or mimic adaptivity in both the inner and outer optimizers generally achieve superior results, surpassing 80% average performance across all benchmarks. Notably, performance is highly sensitive to the choice of inner optimizer, with SGD and L2 clipping yielding the lowest results. For machine translation, however, the performance variance across different optimizer strategies can be relatively smaller when optimal hyperparameters are selected. We observe that Bi2Clip can outperform even Adam2 in some settings. While its design aims to emulate adaptivity under heavy-tailed noise, Bi Clip exhibits characteristics that interpolate between non-adaptive and adaptive methods, capturing benefits from both without necessarily fully belonging to either paradigm (Figure 5 in the appendix). Further, Bi Clip retains the same memory and compute overheads as standard SGD, making it a highly resource-efficient adaptive approximation.

In addition, federated learning is a special distributed learning paradigm designed to train machine learning models jointly without the transmission of raw data (McMahan et al., 2017; Li et al., 2020a; Wang et al., 2021a). Tail OPT can also be applied to federated learning with limited client participation, especially when the local data shards induce heavy-tailed stochastic gradients. See Appendix D.5 for experiments on the effects of Tail OPT in this setting.

6.4. Bi Clip for Centralized Learning

While we focus on modern distributed settings, the Bi Clip optimizer itself can be readily used in centralized learning with similar benefits, achieving Adam-like performance without maintaining any gradient statistics. We empirically demonstrate the convergence and test performance of Bi Clip compared with SGD and Adam on both text and image datasets.

Figure 3: Effects of Bi Clip in centralized training (panels: test accuracy vs. train steps (10^3) for Qwen2.5 on SQuAD and for ViT on CIFAR100; curves for SGD, Adam, and Bi Clip). We see that Bi Clip (relying only on two clipping thresholds) achieves the same accuracies and convergence speed as Adam without maintaining any preconditioners or their compressed versions.

We finetune Qwen2.5 (Yang et al., 2025) on the SQuAD dataset (Rajpurkar et al., 2016), and finetune a ViT-base model (Dosovitskiy et al., 2021) (pretrained on ImageNet) on CIFAR100 (Krizhevsky et al., 2009). We use the same hyperparameter tuning protocol as in other experiments, discussed in Appendix D.7. In Figure 3, we see that both Bi Clip and Adam outperform SGD on the two tasks in terms of final accuracies and/or convergence speed, whereas Bi Clip only requires the same memory and compute as SGD, which is significantly more efficient than Adam.

7. Conclusion

In this work, we have introduced Tail OPT, a framework for efficient heavy-tailed optimization. We have proposed the Bi Clip optimizer based on coordinate-wise clipping from above and below, which utilizes nearly identical memory and compute resources to vanilla SGD yet manifests Adam-like performance. We have established convergence guarantees for Tail OPT under potentially unbounded variance and provided a thorough empirical evaluation on real-world as well as synthetic datasets.
Our experiments indicate that Bi Clip stabilizes training under heavy-tailed noise and achieves the benefits of efficient adaptive optimization, exceeding state-of-the-art performance. Future work could explore automatic selection of u_t and d_t based on initial statistics or bespoke estimators, which could provide more practical solutions. Alternatively, allowing the clipping thresholds to vary depending on coordinate partition subsets (e.g., across tensor slices) may further enhance performance. An extended conclusion with possible future directions is provided in Appendix B.

Acknowledgements

We thank the anonymous reviewers for their feedback. We are grateful to Xiyuan Yang for valuable discussions and contributions to Section 6.4 of this paper.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. As our work primarily focuses on optimization algorithms, we do not believe that there are many potential societal consequences of our work.

References

Ahn, K., Cheng, X., Song, M., Yun, C., Jadbabaie, A., and Sra, S. Linear attention is (maybe) all you need (to understand transformer optimization). International Conference on Learning Representations, 2024.

Anil, R., Gupta, V., Koren, T., and Singer, Y. Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 32, 2019.

Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konecny, J., McMahan, H. B., Smith, V., and Talwalkar, A. LEAF: A benchmark for federated settings. arXiv, 2018.

Cettolo, M., Federico, M., Bentivogli, L., Niehues, J., Stuker, S., Sudoh, K., Yoshino, K., and Federmann, C. Overview of the IWSLT 2017 evaluation campaign. Proceedings of the 14th International Conference on Spoken Language Translation, 2017.

Chen, X., Wu, Z. S., and Hong, M. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 2020.
Chezhegov, S., Klyukin, Y., Semenov, A., Beznosikov, A., Gasnikov, A., Horváth, S., Takáč, M., and Gorbunov, E. Gradient clipping improves AdaGrad when the noise is heavy-tailed. arXiv, 2024a.

Chezhegov, S., Klyukin, Y., Semenov, A., Beznosikov, A., Gasnikov, A., Horváth, S., Takáč, M., and Gorbunov, E. Gradient clipping improves AdaGrad when the noise is heavy-tailed. arXiv, 2024b.

Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 2021.

Das, A., Nagaraj, D. M., Pal, S., Suggala, A., and Varshney, P. Near-optimal streaming heavy-tailed statistical estimation with clipped SGD. Advances in Neural Information Processing Systems, 2024.

Davis, D., Drusvyatskiy, D., Xiao, L., and Zhang, J. From low probability to high confidence in stochastic convex optimization. Journal of Machine Learning Research, 22:1–38, 2021.

DeepSeek-AI. DeepSeek-V3 technical report. arXiv, 2024.

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Ruiz, C. R., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., Steenkiste, S. V., Elsayed, G. F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J., Collier, M., Gritsenko, A. A., Birodkar, V., Vasconcelos, C. N., Tay, Y., Mensink, T., Kolesnikov, A., Pavetic, F., Tran, D., Kipf, T., Lucic, M., Zhai, X., Keysers, D., Harmsen, J. J., and Houlsby, N. Scaling vision transformers to 22 billion parameters. International Conference on Machine Learning, 2023.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale.
International Conference on Learning Representations, 2021.

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. DiLoCo: Distributed low-communication training of language models. ICML Workshop on Advancing Neural Network Training, 2024.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Foundation, W. Shared task: Machine translation of news. Association for Computational Linguistics Conference on Machine Translation, 2019.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. International Conference on Machine Learning, 2017.

Gorbunov, E., Danilova, M., and Gasnikov, A. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 2020.

Gorbunov, E., Danilova, M., Dobre, D., Dvurechensky, P., Gasnikov, A., and Gidel, G. Clipped stochastic methods for variational inequalities with heavy-tailed noise. Advances in Neural Information Processing Systems, 2022.

Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., and Gasnikov, A. High-probability complexity bounds for non-smooth stochastic convex optimization with heavy-tailed noise. Journal of Optimization Theory and Applications, 2024a.

Gorbunov, E., Sadiev, A., Danilova, M., Horváth, S., Gidel, G., Dvurechensky, P., Gasnikov, A., and Richtárik, P. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise. International Conference on Machine Learning, 2024b.

Jaghouar, S., Ong, J. M., and Hagemann, J. OpenDiLoCo: An open-source framework for globally distributed low-communication training. arXiv, 2024.

Juditsky, A., Nazin, A., Nemirovsky, A., and Tsybakov, A.
Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80:1607–1627, 2019a.

Juditsky, A., Nazin, A., Nemirovsky, A., and Tsybakov, A. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 2019b.

Kalra, D. S. and Barkeshli, M. Why warmup the learning rate? Underlying mechanisms and improvements. Advances in Neural Information Processing Systems, 2024.

Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S., Stich, S. U., and Suresh, A. T. Breaking the centralized barrier for cross-device federated learning. Advances in Neural Information Processing Systems, 2021.

Khirirat, S., Gorbunov, E., Horváth, S., Islamov, R., Karray, F., and Richtárik, P. Clip21: Error feedback for gradient clipping. arXiv, 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Koloskova, A., Hendrikx, H., and Stich, S. U. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. International Conference on Learning Representations, 2023.

Kosson, A., Messmer, B., and Jaggi, M. Analyzing and reducing the need for learning rate warmup in GPT training. Advances in Neural Information Processing Systems, 2024.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kunstner, F., Yadav, R., Milligan, A., Schmidt, M., and Bietti, A. Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models. arXiv, 2024.

Lee, S. H., Sharma, S., Zaheer, M., and Li, T. Efficient adaptive federated optimization. ICML Workshop on Advancing Neural Network Training, 2024.

Li, S. and Liu, Y. High probability guarantees for nonconvex stochastic gradient descent with heavy tails. International Conference on Machine Learning, 2022.

Li, T., Sahu, A.
K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a.

Li, T., Zaheer, M., Reddi, S. J., and Smith, V. Private adaptive optimization with side information. International Conference on Machine Learning, 2022.

Li, T., Zaheer, M., Liu, Z., Reddi, S., McMahan, B., and Smith, V. Differentially private adaptive optimization with delayed preconditioners. International Conference on Learning Representations, 2023.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. International Conference on Learning Representations, 2020b.

Liu, B., Chhaparia, R., Douillard, A., Kale, S., Rusu, A. A., Shen, J., Szlam, A., and Ranzato, M. Asynchronous local-SGD training for language modeling. ICML Workshop on Advancing Neural Network Training, 2024a.

Liu, M., Zhuang, Z., Lei, Y., and Liao, C. A communication-efficient distributed gradient clipping algorithm for training deep neural networks. Advances in Neural Information Processing Systems, 2022.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, 2019.

Liu, Z., Zhang, J., and Zhou, Z. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. Proceedings of the Thirty Sixth Conference on Learning Theory, 195:2266–2290, 2023.

Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., and Chandra, V. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. International Conference on Machine Learning, 2024b.

Ma, J. and Yarats, D. On the adequacy of untuned warmup for adaptive optimization. Association for the Advancement of Artificial Intelligence, 2021.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A.
Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.

Menon, A. K., Rawat, A. S., Reddi, S. J., and Kumar, S. Can gradient clipping mitigate label noise? International Conference on Learning Representations, 2020.

Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. International Conference on Learning Representations, 2018.

Mikolov, T. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.

Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. High probability convergence of clipped-SGD under heavy-tailed noise. arXiv, 2023a.

Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. Advances in Neural Information Processing Systems, 2023b.

Nguyen, T. H., Simsekli, U., Gurbuzbalaban, M., and Richard, G. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. Advances in Neural Information Processing Systems, 2019.

Parletta, D. A., Paudice, A., Pontil, M., and Salzo, S. High probability bounds for stochastic subgradient schemes with heavy tailed noise. SIAM Journal on Mathematics of Data Science, 6:953–977, 2024.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. International Conference on Learning Representations, 2018.

Puchkin, N., Gorbunov, E., Kutuzov, N., and Gasnikov, A. Breaking the heavy-tailed noise barrier in stochastic optimization problems. AISTATS, 2024.

Qian, J., Wu, Y., Zhuang, B., Wang, S., and Xiao, J. Understanding gradient clipping in incremental gradient methods. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konecny, J., Kumar, S., and McMahan, B. Adaptive federated optimization. International Conference on Learning Representations, 2021.

Rosa, G. M., Bonifacio, L., Jeronymo, V., Abonizio, H., Lotufo, R., and Nogueira, R. Billions of parameters are worth more than in-domain training data: A case study in the legal case entailment task. arXiv, 2022.

Sadiev, A., Danilova, M., Gorbunov, E., Horváth, S., Gidel, G., Dvurechensky, P., Gasnikov, A., and Richtárik, P. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. International Conference on Machine Learning, 2023.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. A tail-index analysis of stochastic gradient noise in deep neural networks. International Conference on Machine Learning, 2019.

Simsekli, U., Zhu, L., Teh, Y. W., and Gurbuzbalaban, M. Fractional underdamped Langevin dynamics: Retargeting SGD with momentum under heavy-tailed gradient noise. International Conference on Machine Learning, 2020.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18(230):1–49, 2018.

Sriram, A., Das, A., Wood, B. M., Goyal, S., and Zitnick, L. Towards training billion parameter graph neural networks for atomic simulations. International Conference on Learning Representations, 2022.

Stich, S. U.
Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Streeter, M. and McMahan, B. Less regret via online conditioning. arXiv, 2010.

Sun, C. and Chen, B. Distributed stochastic strongly convex optimization under heavy-tailed noises. 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM), 2024.

Sun, Y., Shen, L., Sun, H., Ding, L., and Tao, D. Efficient federated learning via local adaptive amended optimizer with linear speedup. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14453–14464, 2023.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. International Conference on Learning Representations, 2019.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 2020.

Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021a.

Wang, J., Xu, Z., Garrett, Z., Charles, Z., Liu, L., and Joshi, G. Local adaptivity in federated learning: Convergence and consistency. arXiv preprint arXiv:2106.02305, 2021b.

Wang, Y., Lin, L., and Chen, J. Communication-efficient adaptive federated learning. International Conference on Machine Learning, 2022.

Wu, C., Zhang, H., Ju, L., Huang, J., Xiao, Y., Huan, Z., Li, S., Meng, F., Liang, L., Zhang, X., et al. Rethinking memory and communication cost for efficient large language model training. arXiv preprint arXiv:2310.06003, 2023.

Xie, C., Koyejo, O., Gupta, I., and Lin, H. Local AdaAlter: Communication-efficient stochastic gradient descent with adaptive learning rates.
OPT2020: 12th Annual Workshop on Optimization for Machine Learning, 2020.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Yang, H., Qiu, P., and Liu, J. Taming fat-tailed (heavier-tailed with potentially infinite variance) noise in federated learning. Advances in Neural Information Processing Systems, 2022.

Yu, C., Tang, H., Renggli, C., Kassing, S., Singla, A., Alistarh, D., Zhang, C., and Liu, J. Distributed learning over unreliable networks. In International Conference on Machine Learning, pp. 7202–7212. PMLR, 2019.

Yu, S., Jakovetic, D., and Kar, S. Smoothed gradient clipping and error feedback for decentralized optimization under symmetric heavy-tailed noise. arXiv, 2024.

Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 2020a.

Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. Advances in Neural Information Processing Systems, 2022.

Zhang, J., Karimireddy, S., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 2020b.

Zhang, X., Bu, Z., Wu, Z. S., and Hong, M. Differentially private SGD without clipping bias: An error-feedback approach. International Conference on Learning Representations, 2024.

Contents of the Appendix

A Additional Related Works
B Future Directions and Possible Extensions
C Convergence of Tail OPT
  C.1 Convergence of Avg-L2Clip
  C.2 Dynamics of Avg-L2Clip under Failing Compute Nodes
  C.3 Convergence of Bi2Clip
  C.4 Convergence of Adagrad-Tail Clip
  C.5 Convergence of RMSProp-Tail Clip
  C.6 Convergence of Adam-Tail Clip
D Experiment Setup & Full Results
  D.1 Convex Models
  D.2 Synthetic Data Experiments Discussion
  D.3 Transformer Encoders
  D.4 Generative Models
  D.5 Performance under Non-IID Data
  D.6 Gradient Distributions
  D.7 Hyperparameter Sweep Grids
  D.8 Optimal Hyperparameters

A. Additional Related Works

Clipping for Stabilizing Training Dynamics. Due to its success in stabilizing model updates, gradient clipping has been extensively studied empirically (Gehring et al., 2017; Merity et al., 2018; Peters et al., 2018; Mikolov, 2012) and theoretically (Chezhegov et al., 2024a; Zhang et al., 2020b; Menon et al., 2020; Zhang et al., 2020a; Chen et al., 2020; Koloskova et al., 2023; Gorbunov et al., 2020; Cutkosky & Mehta, 2021).
The majority of results study the centralized setting (e.g., Liu et al., 2023; Zhang & Cutkosky, 2022; Parletta et al., 2024; Li & Liu, 2022; Puchkin et al., 2024; Nguyen et al., 2023b; Gorbunov et al., 2024a). Additionally, it was shown that using a constant clipping threshold can induce gradient bias, preventing the algorithm from ever converging (Koloskova et al., 2023; Chen et al., 2020). Therefore, some works have attempted to circumvent this issue by debiasing via error feedback (Khirirat et al., 2023; Zhang et al., 2024). Other works in distributed optimization have imposed strong distributional structures on the stochastic gradients in the analysis. For instance, Qian et al. (2021) assume a well-behaved angular dependence between the stochastic and deterministic gradients throughout training, and Liu et al. (2022) assume symmetric gradient noise, almost surely bounded stochastic gradients, as well as homogeneous data. In contrast with these works, we do not impose any conditions on the noise or data distributions except for bounded noise α-moments for α ∈ (1, 2). This also sharpens the sensitivity of our bounds to gradient distributions, as α may be selected as the minimal (or close to infimum) value such that the α-moment is bounded. There are some recent results studying the dynamics of heavy-tailed clipped-SGD in the distributed setting, without local updates (Sun & Chen, 2024; Gorbunov et al., 2024b; Yu et al., 2024). In particular, Sun & Chen (2024) study the convergence of distributed clipped-SGD for strongly convex objectives in the absence of a central server, where smaller nodes communicate with their neighbors according to a strongly connected graph. By contrast, Yu et al. (2024) propose smooth-clipping the difference between a local gradient estimator and the local stochastic gradient, which is shown to converge under only an integrability condition (finite first moment) for strongly convex objectives when assuming symmetric noise distributions.
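To make the contrast between these clipping styles concrete, standard L2 clipping and the coordinate-wise two-sided clipping of Bi Clip can be sketched as follows. This is a minimal NumPy illustration of our reading of the paper's verbal description; the threshold names c, d, and u are ours, and the exact formal definitions may differ.

```python
import numpy as np

def l2_clip(g, c):
    """Standard L2 clipping: rescale g so that its norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def bi_clip(g, d, u):
    """Coordinate-wise clipping from both above and below (our reading of
    Bi Clip): each coordinate's magnitude is forced into [d, u] while its
    sign is preserved. Clipping from below amplifies tiny coordinates,
    loosely mimicking the effect of an adaptive preconditioner."""
    return np.sign(g) * np.clip(np.abs(g), d, u)

g = np.array([5.0, -0.001, 0.2])
print(l2_clip(g, 1.0))        # whole vector rescaled to norm 1
print(bi_clip(g, 0.01, 1.0))  # [1.0, -0.01, 0.2]
```

Note the qualitative difference: L2 clipping shrinks every coordinate by the same factor, whereas the two-sided coordinate-wise rule caps the large coordinate and lifts the tiny one, without maintaining any per-coordinate statistics.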
The work by Yang et al. (2022) studies L2 clipping with local updates (compared against in our paper as the Avg-L2Clip baseline in Table 1). Our proposed clipping mechanism, Bi Clip, fundamentally differs from these approaches by incorporating coordinate-wise clipping from both above and below, bringing about the benefits of adaptive optimizers. An added advantage of Tail OPT is significant communication efficiency, as we do not transmit preconditioners between the inner and outer optimizers under iterative local updates. Our analysis covers both convex and non-convex functions without additional assumptions on the noise distribution except for heavy-tailedness with potentially unbounded variance. It also holds for a variety of adaptive optimizers and different clipping methods.

Convergence Bounds under Heavy-Tailed Gradient Noise. There are two primary types of convergence bounds: in-probability bounds (Juditsky et al., 2019b; Davis et al., 2021; Gorbunov et al., 2022; 2020; Sadiev et al., 2023; Cutkosky & Mehta, 2021; Gorbunov et al., 2024a;b) and in-expectation bounds (Wang et al., 2022; Li et al., 2020b; Reddi et al., 2021; Xie et al., 2020; Wang et al., 2020; Karimireddy et al., 2021; Li et al., 2023; 2022). In-probability bounds provide an upper limit on the number of timesteps required to achieve model parameters x such that P{M(x) ≤ ε} ≥ 1 − δ for a given evaluation metric M(x) (e.g., min_{t=1,...,T} ‖∇F(x_t)‖). As δ → 0+, the required communication complexity or number of timesteps may diverge. The key challenge is to mitigate this divergence as effectively as possible through novel algorithm designs or refined mathematical analysis, such as by deriving a polylogarithmic dependence on δ rather than a more severe inverse power-law dependence. A recent work (Sadiev et al., 2023) provides the first high-probability results under unbounded variance for clipped-SGD applied to star-convex or quasi-convex objectives in a distributed setting without local updates.
Their analysis reveals an inverse logarithmic dependence on the confidence level. By contrast, in-expectation bounds complement in-probability bounds by ensuring that convergence to an optimal point is guaranteed in expectation, without a confidence level. However, the majority of such analyses assume a bounded noise variance, typically denoted by an upper bound G (Gorbunov et al., 2020; Parletta et al., 2024; Li & Liu, 2022). Due to this dependence, some works (e.g., those studying high-probability results (Davis et al., 2021; Gorbunov et al., 2020; 2024a)) argue that in-expectation bounds are insensitive to the underlying distributional structures of the stochastic gradients, these structures being compressed or approximated away by G. Relaxing this assumption is particularly challenging because unbounded noise adds significant uncertainty to controlling model updates. Furthermore, works such as (Lee et al., 2024) have demonstrated that under stochastic gradient descent, unbounded noise is instantaneously transmitted to the model parameters in both centralized and distributed settings, leading to instability and divergence in expectation. Such results elucidate the additional difficulties induced by efforts to remove the bounded gradient condition. In this paper, we develop a more efficient and general Tail OPT framework, and study the dynamics of Tail OPT under heavy-tailed stochastic gradient distributions. Specifically, we provide in-expectation convergence guarantees under infinite variance and local updates for potentially non-convex functions, offering new bounds that are more sensitive to the distributional structures of minibatch noise.

B. Future Directions and Possible Extensions

Efficient and effective tuning of the clipping thresholds d_t and u_t in Bi Clip remains an open avenue for research.
One potential approach is to segment the thresholds into coordinate subsets (e.g., row-wise or column-wise), similar to the memory-efficient partitioning strategies employed in approximate optimizers such as SM3 (Anil et al., 2019). Alternatively, autonomous selection of u_t and d_t based on initial statistics or bespoke estimators could provide practical solutions. Our experiments indicate that coordinate-wise Bi Clip, rather than standard L2 clipping, achieves the benefits of adaptive optimization without incurring any additional memory overhead compared to SGD. Notably, methods like Adam at least double memory usage, whereas Bi Clip maintains parity with non-adaptive methods. This suggests that uniformly amplifying small updates can contribute to optimization efficiency. Furthermore, layer-wise Bi Clip can be readily generalized, with proofs extending naturally. Another intriguing direction for future research is the integration of Adam on top of Bi Clip to enhance optimization stability in either centralized or distributed training. Notably, when employing the Adam optimizer, some studies apply L2 clipping to gradients prior to plugging in Adam updates to improve the stability of the optimization dynamics (Chezhegov et al., 2024b). A natural extension of this approach is to substitute Bi Clip for L2 clipping before passing updates to the Adam adaptive optimizer. This modification could not only enhance stability but also potentially reduce dependence on the hyperparameters of Adam, offering a more robust optimization framework.

C. Convergence of Tail OPT

In this section, we rigorously analyze the convergence of Tail OPT under heavy-tailed noise, beginning with the case of Avg-L2Clip to enhance readability before progressively advancing to more sophisticated Tail OPT variants incorporating Bi Clip and other adaptive outer optimizers. We first establish the convergence proof for Avg-L2Clip in Appendix C.1, which serves as the basis for subsequent analyses.
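To make the Appendix B directions concrete, the following NumPy sketch composes coordinate-wise Bi Clip with a plain Adam update. The helper names and hyperparameter values (`biclip`, `BiClipAdam`, u = 1.0, d = 1e-3) are illustrative assumptions on our part, not the paper's implementation:

```python
import numpy as np

def biclip(u, d, g):
    """Coordinate-wise Bi Clip sketch: clamp each |g_j| into [d, u], keeping the
    sign (exact zeros stay zero). Entries above u are clipped down for stability;
    entries below d are amplified up, mimicking Adam-style normalization."""
    return np.sign(g) * np.clip(np.abs(g), d, u)

class BiClipAdam:
    """Plain Adam whose incoming gradients first pass through Bi Clip, replacing
    the usual clip-then-Adam recipe. Hyperparameters are illustrative."""
    def __init__(self, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, u=1.0, d=1e-3):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.u, self.d = u, d
        self.m = self.v = None
        self.t = 0

    def step(self, x, g):
        g = biclip(self.u, self.d, g)   # Bi Clip in place of L2 clipping
        if self.m is None:
            self.m, self.v = np.zeros_like(x), np.zeros_like(x)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        mhat = self.m / (1 - self.b1 ** self.t)
        vhat = self.v / (1 - self.b2 ** self.t)
        return x - self.lr * mhat / (np.sqrt(vhat) + self.eps)
```

On a one-dimensional quadratic this variant drives the iterate toward the minimum; near the optimum the lower threshold d keeps small updates alive before Adam's normalization, which is the amplification effect discussed in Appendix B.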
The proof for Avg-L2Clip studies a virtual history of model weights synthesized by the inner optimizers, which is inaccessible in real-world settings except when the model updates are communicated to the outer optimizer. However, by analyzing the virtual history, we are able to attain convergence of a moving average of accessible model weights to the optimum, which can be materialized in practice. In Appendix C.2, we extend this proof to settings with partial participation and failing compute nodes, examining the resulting dynamics under heavy-tailed noise. In Appendix C.3, we further generalize the analysis to the Bi2Clip instantiation, where Bi Clip is applied to both the inner and outer optimizers. Notably, Bi2Clip encompasses Avg-Bi Clip as a special case under specific hyperparameter choices, which in turn subsumes Avg-L2Clip. Finally, in Appendices C.4, C.5, and C.6, we investigate the convergence properties of Tail OPT when the outer optimizer is instantiated with Adagrad, RMSProp, and Adam, respectively.

C.1. Convergence of Avg-L2Clip

We aim to model contemporary, large-scale neural network training across multiple powerful compute nodes (datacenters or GPU clusters), in which data is typically preprocessed IID to optimize for training. However, for fullest generality, we conduct our theoretical analysis in the more challenging non-IID setting. Our setup is identical to Section 3, with some added notation. We denote by x∗ the global optimum of F(x), with minimum value F∗ = F(x∗); additionally, we let x_i∗ be the global optimum of F_i(x) = E_ξ[F_i(x, ξ)], with minimum value F_i∗ = F_i(x_i∗).
For model weight and stochastic gradient averages, we use the following notation:

x̄_t = Σ_{i=1}^N p_i x_{i,0}^t,  ḡ_t = Σ_{i=1}^N p_i Clip(c_t, ∇F_i(x_{i,0}^t, ξ_{i,0}^t)),  where Clip(c, y) := min(1, c/‖y‖) · y.

The use of the notation x_{i,0}^t instead of x_i^t carefully reflects the flow of the proof, which studies a virtual synchronization of the model weights synthesized by the inner optimizer at each time t ∈ [T] (see Algorithm 2). In other words, we first analyze the virtual average x̄_t, which is not materially realized except at outer optimizer synchronization steps, before modifying the proof to procure a moving average of weights that depends solely on those communicated to the outer optimizer, and which can therefore be obtained in practice. We now present some assumptions used in the convergence analysis for this section. We take the model weight projection domain to be X = B(0, B) ⊂ R^d, where B(0, B) is the closed ball centered at the origin with radius B. Clearly, B > 0 needs to be large enough that x∗, x_i∗ ∈ X for convergence. However, we note that the convergence analysis holds for X any large enough compact, convex set.

Assumption 3 (µ-strong convexity). For all x, y ∈ X and i ∈ [N], F_i(x) satisfies F_i(x) ≥ F_i(y) + ⟨x − y, ∇F_i(y)⟩ + µ_i ‖x − y‖²/2.

Gradient clipping is a widely adopted technique to stabilize model updates by mitigating the impact of large gradients (Menon et al., 2020; Zhang et al., 2020a; Chen et al., 2020; Koloskova et al., 2023). The Clip(·) operator rescales the gradient uniformly to ensure its norm remains below a predefined threshold. This procedure is mathematically equivalent to applying a dynamically adjusted, lower learning rate when large stochastic gradients are encountered. Another related technique is projection, which operates in the model weight space rather than the gradient space, effectively stabilizing the model parameters themselves instead of acting on the updates.
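A direct implementation of the Clip(·) operator reads as follows (a minimal sketch; the function name `l2_clip` is ours):

```python
import numpy as np

def l2_clip(c: float, y: np.ndarray) -> np.ndarray:
    """Clip(c, y) := min(1, c / ||y||_2) * y: rescale y onto the L2 ball of
    radius c when it is too long, and leave it untouched otherwise."""
    norm = float(np.linalg.norm(y))
    if norm <= c:
        return y
    return (c / norm) * y
```

Because the operator only rescales by a scalar, the clipped gradient preserves the direction of the original stochastic gradient, a fact used repeatedly in the proofs below.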
These observations motivate Algorithm 2, which may be interpreted as dynamically modulating the learning rates as well as backtracking toward the model origin 0 when heavy-tailed stochastic gradient updates are realized.

Algorithm 2 Avg-L2Clip
Require: Initial model x_1, learning rate schedule η_t, clipping schedule c_t, synchronization timestep z ∈ Z_{>0}, projection domain X
1: for t = 1, . . . , T do
2:   for each node i ∈ [N] do
3:     Draw minibatch gradient g_{i,0}^t = ∇F_i(x_{i,0}^t, ξ_{i,0}^t)
4:     x_{i,0}^{t+1} ← x_{i,0}^t − η_t Clip(c_t, g_{i,0}^t)
5:   end for
6:   if t − 1 ∈ z·Z_{≥0}:
7:     x_{i,0}^{t+1} ← Proj_X(Σ_{i∈[N]} p_i x_{i,0}^{t+1}), for i ∈ [N]

Theorem 3 demonstrates that distributed Avg-L2Clip converges in expectation under heavy-tailed noise, despite potential clipping-induced bias. We also offer the first proof demonstrating convergence under an extension of these results to accommodate failing nodes for additional utility in Appendix C.2. To proceed with the analysis, we first provide a proposition.

Proposition 1. If F_i(x) is µ-strongly convex (or L-smooth), then so is F_i(x, ξ) for the identical µ (or L).

While clipping offers the benefit of stabilization, it introduces complexities that complicate the convergence analysis. In particular, clipping induces a non-zero bias on the stochastic gradients, rendering them no longer unbiased estimators of the true gradient. Furthermore, unlike previous analyses, our work also considers scenarios involving distributions with infinite variance, where the clipping bias is exacerbated by the presence of heavy tails. Despite these challenges, Theorem 3 demonstrates that with appropriately chosen (increasing) clipping and (decreasing) learning rate schedules, convergence of Algorithm 2 is nevertheless attainable in expectation.

Theorem 3. Let Assumptions 1-3 hold, and let the clipping threshold in Avg-L2Clip (Algorithm 2) satisfy c_t = c η_t^{−γ} for c > 0 and 1/2 > γ > 0. Decay the learning rate with schedule η_t = r/(t + 1) for r > 2/µ, where µ = min_{k∈[N]} µ_k and L = max_{k∈[N]} L_k.
Then, we have for x̄_T := Σ_{t=1}^T 2t E[x_t] / (T(T + 1)) that

F(x̄_T) − F(x∗) ≤ Ψ_1 + Ψ_2 + Ψ_3 + Ψ_4,

where
Ψ_1 = rc²T^{2γ+1} / ((2γ + 1)T(T + 1)),
Ψ_2 = (M^α + B^α)² c^{2−2α} (T^{(2−2α)γ+2} + 1) / ((µ − 2/r)((2 − 2α)γ + 1)T(T + 1)),
Ψ_3 = c^{2−α} r z u (M^α + B^α) L T^{(2−α)γ+1} / ((µ − 2/r)((2 − α)γ + 1)T(T + 1)),
Ψ_4 = r²c²z²u²L²(T^{2γ} + 1) / (2γ(µ − 2/r)T(T + 1)).

Here, we have used the notation M² := max_{i∈[N], x∈X̄} (2L²/µ)(F_i(x) − F_i(x_i∗)), α = min_{k∈[N]} α_k, B = max_{k∈[N]} B_k, and u = (z + 1)/2, where X̄ is a compact domain constructed by a uniformly closed extension of X with L2 distance Σ_{t=1}^z rc t^{γ−1}.

Proof. Let us bound the distance between the averaged model weights x̄_t and the global optimum x∗. Assume that t ∈ z·Z. We consider the following function
f(s) = ‖x∗ − Proj_X(x̄_t − η_t ḡ_t) + s(−x̄_t + η_t ḡ_t + Proj_X(x̄_t − η_t ḡ_t))‖²,
for which f′(0) = 2⟨x∗ − Proj_X(x̄_t − η_t ḡ_t), −x̄_t + η_t ḡ_t + Proj_X(x̄_t − η_t ḡ_t)⟩. Now, consider the function
g(s) = ‖(1 − s) Proj_X(x̄_t − η_t ḡ_t) + s Proj_X(x∗) − (x̄_t − η_t ḡ_t)‖.
By the projective property, g(s) ≥ ‖Proj_X(x̄_t − η_t ḡ_t) − (x̄_t − η_t ḡ_t)‖ holds for s ∈ [0, 1] via convexity of X. Additionally, g(s)² attains its minimum at s = 0. Therefore, we have that dg(s)²/ds|_{s=0} ≥ 0, due to g(s)² being quadratic with respect to s. Noting that f′(0) = dg(s)²/ds|_{s=0} (as Proj_X(x∗) = x∗), we have that f(s) is monotonically increasing for s ≥ 0, again due to properties of a quadratic. Then, f(1) ≥ f(0) gives that
‖Proj_X(x̄_t − η_t ḡ_t) − x∗‖² ≤ ‖x̄_t − η_t ḡ_t − x∗‖².
Therefore, we may conclude
‖x̄_{t+1} − x∗‖² = ‖Σ_{i=1}^N p_i Proj_X(x̄_t − η_t ḡ_t) − x∗‖² = ‖Proj_X(x̄_t − η_t ḡ_t) − x∗‖² ≤ ‖x̄_t − η_t ḡ_t − x∗‖²
= ‖x̄_t − x∗‖² − 2η_t⟨x̄_t − x∗, ḡ_t⟩ + η_t²‖ḡ_t‖²
= ‖x̄_t − x∗‖² − 2η_t⟨x̄_t − x∗, ḡ_t − ∇F(x̄_t)⟩ (A1) − 2η_t⟨x̄_t − x∗, ∇F(x̄_t)⟩ (A2) + η_t²‖ḡ_t‖² (A3).
Note that the final inequality LHS ≤ RHS also holds for t ∉ z·Z. In bounding A2, we aim to derive a term that decays ‖x̄_t − x∗‖² by inducing a coefficient (1 − c′η_t)‖x̄_t − x∗‖² for some c′ > 0 to be determined. By µ-strong convexity of F(x),
F(x∗) − F(x̄_t) ≥ ⟨x∗ − x̄_t, ∇F(x̄_t)⟩ + (µ/2)‖x̄_t − x∗‖²,
so that −(F(x̄_t) − F(x∗)) − (µ/2)‖x̄_t − x∗‖² ≥ ⟨x∗ − x̄_t, ∇F(x̄_t)⟩.
Efficient Distributed Optimization under Heavy-Tailed Noise To bound A1, we consider conditional expectations 2ηt xt x , Et[gt] F(xt) 2ηt xt x Et[gt] F(xt) , where Et[ ] conditions on all realizations up to time t. Unraveling definitions gives Et[gt] F(xt) = X i [N] pi(Et[Clip(ct, Fi(xt i,0, ξt i,0))] Fi(xt i,0) + Fi(xt i,0) Fi(xt)) i [N] pi Et[Clip(ct, Fi(xt i,0, ξt i,0)) Fi(xt i,0, ξt i,0))] + X i [N] pi Fi(xt i,0) Fi(xt) i [N] pi Et[ Clip(ct, Fi(xt i,0, ξt i,0)) Fi(xt i,0, ξt i,0)) ] | {z } A4 i [N] pi L xt i,0 xt , where the second line used Jensen and triangle inequality, and the third line used L-smoothness as well as Jensen. Now, we note that clipping biases the expectation in A4, and we seek to ease out a measure of the clipping bias. For this purpose, we quantify the α-moment of the stochastic gradient: Fi(x) + ξt i,0 2 α 2α 1 Et Fi(x) α + Et ξt i,0 α 2α 1 ( Fi(x) α + Bα i ) . Here, we have used the notation Bi < for readability, but strictly speaking this is not identical to the Bi given in Assumption 2 as α := mini [N] αi. Finally, the projection in each outer optimizer synchronization step ensures that the xt i,0 remain in a compact set X. Therefore, to bound gradients, we use L-smoothness and µ-strong convexity of Fi(x) as follows: Fi(x) 2 L2 x x i 2 , where x i is the optimum of Fi(x). Then, convexity gives that Fi(x) Fi(x i ) + µ from which we conclude Fi(x) 2 2L2 µ (Fi(x) Fi(x i )) M 2 := max k [N],x X 2L2 µ (Fi(x) Fi(x i )). (4) Piecewise continuity of Fi(x) is clear due to the existence of Fi(x). Therefore, Et Fi(xt i,0) + ξt i,0 α (M α + Bα) Now, note that if Fi(xt i,0, ξt i,0) ct, clipping has no effect in A4. Thus, we focus on the case Fi(xt i,0, ξt i,0) > ct. Additionally, clipping only downscales each stochastic gradient by a scalar, which preserves direction. 
Therefore, A4 = Et Clip(ct, Fi(xt i,0, ξt i,0)) Fi(xt i,0, ξt i,0)) χ Fi(xt i,0, ξt i,0) > ct Et Fi(xt i,0, ξt i,0) χ Fi(xt i,0, ξt i,0) > ct Et Fi(xt i,0, ξt i,0) α Fi(xt i,0, ξt i,0)) 1 α χ Fi(xt i,0, ξt i,0) > ct (M α + Bα)c1 α t . Putting these inequalities together, we obtain as an intermediary step for a > 0: A1 2ηt xt x ((M α + Bα)c1 α t + X i [N] pi L xt i,0 xt ) µaηt xt x 2 + ηt µa((M α + Bα)c1 α t + L X i [N] pi xt i,0 xt )2. Thus, our next step is to ease out xt i,0 xt = O(ηt). For this purpose, our intuition is that the drift in model weights from local updates are bounded by the update size, as well as by taking a maximum of z local steps after global synchronization. Efficient Distributed Optimization under Heavy-Tailed Noise Therefore, we naturally consider the timestep ts(t) of the latest synchronization round up to t, and observe that if the random variable X := xt i,0 xts, then Ek[X] = xt xts. Noting that the variance of X is no greater than its second moment, we proceed as follows via telescoping: Ek[ xt i,0 xt 2] = i=1 pi xt i,0 xt 2 = Ek[ X Ek[X] 2] i=1 pi xt i,0 xts 2 t=ts+1 ( xk t + xk t ) xts i=1 pi(t ts 1)2 max t [ts,t] η2 t Clip(c t, Fi(xt i,0, ξt i,0)) 2 i=1 piz2η2 tsc2 t = z2η2 tsc2 t z2u2η2 t c2 t. The final inequality was obtained by noting that ηt 0+ monotonically from above and that ct ct 1. The above holds for all t Z 0, as if t is a synchronization step, Ek xt i,0 xt 2 = 0. The final inequality used that the monotonic near-harmonic decay of ηt allows ηts uηt for u = (z + 1)/2. Finally, by Cauchy-Schwartz, i=1 pi xt xt i,0 i=1 pi xt xt i,0 2 ! from which we conclude A1 µaηt xt x 2 + ηt µa((M α + Bα)c1 α t + ηtctzu L)2 (7) It now remains to bound A3, which can be done straightforwardly via Jensen: A3 = η2 t gt 2 η2 t i=1 pi Clip(ct, Fi xt i,0, ξt i,0 ) 2 η2 t c2 t. 
Collecting all inequalities gathered thus far gives the simple form Et[ xt+1 x 2] (1 (1 a)µηt) xt x 2 2ηt (F(xt) F(x )) + η2 t c2 t + ηt µa((M α + Bα)c1 α t + ηtctzu L)2, which under tower law of expectations is amenable to telescoping. Intuitively, we want to control the learning rate and form a quadratically decaying average on the LHS, which by Jensen and convexity will give a desired near-optimal point. The rest is a matter of carefully easing out a rate schedule that enables averaging, which also converges. Rearranging gives E[F(xt)] F(x ) (η 1 t (1 a)µ) 2 E[ xt x 2] 1 2ηt E[ xt+1 x 2] + ηtc2 t 2 + 1 2µa((M α + Bα)2c2 2α t + 2(M α + Bα)c2 α t ηtzu L + η2 t c2 tz2u2L2). (8) Letting ηt = r/(t + 1), a = 1 2/(rµ) for r > 2/µ, we have t E[F(xt)] t F(x ) t(t 1) 2 E[ xt x 2] (t + 1)t 2 E[ xt+1 x 2] + tηtc2 t 2 + t 2µa((M α + Bα)2c2 2α t + 2(M α + Bα)c2 α t ηtzu L + η2 t c2 tz2u2L2) (9) Efficient Distributed Optimization under Heavy-Tailed Noise Setting ct = ctγ for 1/2 > γ > 0, c > 0 gives after telescoping PT t=1 2t E[F(xt)] T(T + 1) F(x ) rc2 PT t=1 t2γ T(T + 1) + (M α + Bα)2c2 2α PT t=1 t(2 2α)γ+1 (µ 2/r)T(T + 1) + c2 αrzu(M α + Bα)L PT t=1 t(2 α)γ (µ 2/r)T(T + 1) + r2c2z2u2L2 PT t=1 t2γ 1 (µ 2/r)T(T + 1) . Standard integral bounds give PT t=1 2t E[F(xt)] T(T + 1) F(x ) rc2T 2γ+1 (2γ + 1)T(T + 1) + (M α + Bα)2c2 2α(T (2 2α)γ+2 + 1) (µ 2/r)((2 2α)γ + 1)T(T + 1) + c2 αrzu(M α + Bα)LT (2 α)γ+1 (µ 2/r)((2 α)γ + 1)T(T + 1) + r2c2z2u2L2(T 2γ + 1) 2γ(µ 2/r)T(T + 1). Finally, note that by Jensen and convexity, the left hand side is lower bounded by 0 F( x T ) F(x ) PT t=1 2t E[F(xt)] T(T + 1) F(x ) where x T := PT t=1 2t E[xt]/T(T + 1) is a quadratically decaying average. This concludes the proof. It is straightforward to extend to the case in which the learning rate is scheduled to decay in each outer optimizer synchronization step instead of at each local step, by letting ηt = r/( t/z + 1) in equation (8). 
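As a sanity check on the schedules in Theorem 3 (a decaying η_t with a growing c_t), Algorithm 2 can be exercised end-to-end on a toy problem. The quadratic local objectives, Student-t noise with 1.5 degrees of freedom (infinite variance, but bounded α-moments for α < 1.5), and all constants below are our illustrative choices rather than the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_clip(c, y):
    """Clip(c, y) := min(1, c / ||y||) * y."""
    n = np.linalg.norm(y)
    return y if n <= c else (c / n) * y

def avg_l2clip(T=200, N=4, z=5, r=2.5, c=1.0, gamma=0.25, B=10.0, d=3):
    """Toy run of Algorithm 2 on local quadratics F_i(x) = ||x - a_i||^2 / 2
    with heavy-tailed Student-t gradient noise. The targets a_i are hypothetical."""
    a = rng.normal(size=(N, d))              # node optima; global optimum is a.mean(0)
    x = [np.zeros(d) for _ in range(N)]      # per-node model copies
    for t in range(1, T + 1):
        eta = r / (t + 1)                    # decreasing learning rate schedule
        ct = c * eta ** (-gamma)             # increasing clipping schedule
        for i in range(N):
            g = (x[i] - a[i]) + rng.standard_t(1.5, size=d)
            x[i] = x[i] - eta * l2_clip(ct, g)
        if (t - 1) % z == 0:                 # outer synchronization + projection
            avg = l2_clip(B, sum(x) / N)     # projection onto B(0, B) is an L2 rescale
            x = [avg.copy() for _ in range(N)]
    return sum(x) / N, a.mean(axis=0)
```

With projection radius B = 10, every synchronized average stays inside B(0, B), while the growing clipping threshold keeps the heavy-tailed noise from destabilizing the local steps.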
The value of the moment α has a significant impact on the convergence behavior. When α is close to 1, convergence becomes substantially slower due to the heavy-tailed nature of the induced stochastic gradients and the increased variance they introduce. Conversely, when α approaches 2, the variance is more controlled, leading to faster convergence rates. Importantly, our results demonstrate that even in the presence of infinite variance (i.e., 1 < α < 2), convergence can still be achieved, showcasing the robustness of the clipping approach under extreme heavy-tailed conditions. The averages x̄_t are virtual constructs used for the theoretical analysis of Algorithm 2, which are not accumulated during the execution phase. That is, these quantities are only available at the outer optimizer synchronization steps t − 1 ∈ z·Z_{≥0}, and are not collected otherwise (as models are not saved at every local timestep prior to synchronization). As a result, the application of Avg-L2Clip creates a virtual history on the compute node models, where the aggregation of ephemeral model weights can theoretically induce convergence. However, in practice, this conflicts with the use of local epochs for communication efficiency, necessitating adjustments to the convergence theorem. This leads to the development of Corollary 3.

Corollary 3. Let the conditions of Theorem 3 hold. Then, we have that

E[F(Σ_{t∈Z} (t − 1) x_t / Σ_{t∈Z} (t − 1))] − F(x∗) ≤ ((T + 1)z / (T − z)) (Ψ_1 + Ψ_2 + Ψ_3 + Ψ_4),

where the Ψ_i are defined as in the statement of Theorem 3 and Z is the set of all outer optimizer synchronization steps.

Proof. We may start with equation (9), where we use the same notation as in the proof of Theorem 3. Recall that 0 ≤ F(x) − F(x∗) for all x. Therefore, we have for Z = {1, z + 1, . . . , z⌊T/z⌋ + 1} if T ∉ z·Z, and Z = {1, z + 1, . . . , z(T/z − 1) + 1} otherwise, that

Σ_{t∈Z} t (E[F(x_t)] − F(x∗)) ≤ Σ_{t∈Z} [ (t(t − 1)/2) E[‖x̄_t − x∗‖²] − ((t + 1)t/2) E[‖x̄_{t+1} − x∗‖²] + t η_t c_t²/2 ] + Σ_{t∈Z} (t/(2µa)) [ (M^α + B^α)² c_t^{2−2α} + 2(M^α + B^α) c_t^{2−α} η_t z u L + η_t² c_t² z²u²L² ].
Noting that Σ_{t∈Z} (t − 1)(E[F(x_t)] − F(x∗)) ≤ Σ_{t∈Z} t (E[F(x_t)] − F(x∗)), and that Σ_{t∈Z} (t − 1) = (z/2)⌊T/z⌋(⌊T/z⌋ + 1) ≥ T(T − z)/(2z), we conclude by Jensen and convexity that

E[F(Σ_{t∈Z} (t − 1) x_t / Σ_{t∈Z} (t − 1))] − F(x∗) ≤ ((T + 1)z / (T − z)) (Ψ_1 + Ψ_2 + Ψ_3 + Ψ_4).

As before, extension to the case where the learning rate decays at each outer optimizer synchronization step is straightforward. Therefore, the asymptotic convergence rate is identical to that given in Theorem 3. In particular, we immediately deduce the following corollary.

Corollary 4. Let the conditions of Theorem 3 hold. Then, Avg-L2Clip converges under heavy-tailed noise with rate O(T^{−(2α−2)γ}) for γ ∈ (0, 1/2). That is, the algorithm recovers a point x̃_T which is materialized during training such that E[F(x̃_T)] − F(x∗) ≤ O(T^{−(2α−2)γ}).

Proof. The maximal rate of convergence is attained due to the dominating term Ψ_2. We note that in the limit α → 1+, which revokes the integrability of the heavy-tailed noise, convergence is nullified.

C.2. Dynamics of Avg-L2Clip under Failing Compute Nodes

Node failures can occur both in datacenter training (Yu et al., 2019) and in federated learning (Li et al., 2020a). Consequently, it is crucial to conduct a theoretical performance analysis of Avg-L2Clip in environments with failing compute nodes or partial participation. In this setting, we modify Line 2 of Avg-L2Clip to sample a subset of participating nodes S ⊆ [N], rather than selecting S = [N]. Additionally, normalized averaging is performed across only the participating compute nodes in Line 7. However, we assume all compute nodes remain active, as described in Algorithm 3 below. We refer to this algorithm as Sludge Clip to emphasize its impracticality, despite being functionally equivalent to a variant of Avg-L2Clip aggregating over a subset of nodes. By analyzing Sludge Clip, we are able to establish convergence of Avg-L2Clip when several datacenters or compute nodes fail to partake in training.
Algorithm 3 Sludge Clip
Require: Initial model x_1, learning rate schedule η_t, clipping schedule c_t, synchronization timestep z ∈ Z_{>0}, projection domain X
1: for t = 1, . . . , T do
2:   Sample participating compute nodes S ⊆ [N] according to p_i
3:   for each node i ∈ [N] do
4:     Draw minibatch gradient g_{i,0}^t = ∇F_i(x_{i,0}^t, ξ_{i,0}^t)
5:     x_{i,0}^{t+1} ← x_{i,0}^t − η_t Clip(c_t, g_{i,0}^t)
6:   end for
7:   if t − 1 ∈ z·Z_{≥0}:
8:     x_{i,0}^{t+1} ← Proj_X((Σ_{i∈S} p_i)^{−1} Σ_{i∈S} p_i x_{i,0}^{t+1}), for i ∈ [N]

Theorem 4. Let the clipping threshold in Sludge Clip (Algorithm 3) satisfy c_t = c η_t^{−γ} for c > 0 and 1/2 > γ > 0. Decay the learning rate with schedule η_t = r/(t + 1) for r > 2/µ. If the sampling scheme preserves the global objective², that is,

E_S[(Σ_{i∈S} p_i)^{−1} Σ_{i∈S} p_i F_i(x)] = Σ_{i∈[N]} p_i F_i(x) = F(x),

then we have, for Z the set of synchronization steps up to T, that

E[F(x̃_T)] − F(x∗) := E[F(Σ_{t∈Z} (t − 1) x_t / Σ_{t∈Z} (t − 1))] − F(x∗) ≤ z · O(T^{−ω}),

where now ω satisfies ω = min{1 − 2γ, 1 − (2 − 2α)γ, 1 − (2 − α)γ, 2 − 2γ, 2γ(α − 1)}. If the subsampling scheme fails to preserve the global objective (e.g., by sampling only a strict subset of available nodes repeatedly), then Algorithm 3 asymptotes toward biased minimizer points within an increasing region determined by the clipping threshold, E[F(x̃_T)] − F(x∗) ≤ O(T^{2γ}).

²For example, p_i = 1/N satisfies this condition. That is, given any selection of p_i and F_i(x), we may rescale the local objectives F_i(x) such that p_i = 1/N by controlling the influence of each local gradient update.

We note that convergence is not clearly guaranteed when subsampling procedures violate the global objective in expectation. Specifically, we evaluate the algorithm's output relative to x∗, the global optimum of the true objective F(x). However, when subsampling alters the objective, the algorithm no longer optimizes for F(x), thereby clearly undermining convergence toward x∗.
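The normalized aggregation over participating nodes in Algorithm 3 (Line 8) can be sketched directly; the model list and weights in the example are hypothetical:

```python
import numpy as np

def subsampled_average(models, p, S):
    """Normalized aggregation over participating nodes S (Line 8 of Algorithm 3):
    (sum_{i in S} p_i)^(-1) * sum_{i in S} p_i * x_i."""
    weight = sum(p[i] for i in S)
    return sum(p[i] * models[i] for i in S) / weight
```

As footnote 2 notes, uniform weights p_i = 1/N satisfy the objective-preservation condition of Theorem 4 when S is sampled accordingly.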
We then measure the propensity of the algorithm output to x , the global optimum of the true objective F(x) which is no longer the objective of the subsampled algorithm. Proof. We first analyze the case in which the subsampling strategy preserves the correct global objective, which allows for convergence to x . Recall that Sludge Clip-SGD was constructed to allow the analysis for non-synchronization steps to be analogous to full-participation Avg-L2Clip. Therefore, we focus on outer optimizer synchronization steps while incorporating the elements of the previous analysis for Theorem 3. We now use the following notation for subsampled averages of participating compute node devices: i S pixt i,0 P i S pi , gt = i S pi Clip(ct, Fi xt i,0, ξt i,0 ) P For added clarity, we denote gt as gt to indicate that normalized averages are taken over all inner compute nodes, and not solely participating nodes as in gt. Then for t + 1 a synchronization step, we have that xt+1 x 2 xt x ηt gt 2 = xt + ( xt xt) x ηt gt + (ηtgt ηtgt) 2 = xt x 2 + 2 xt x , xt xt ηt gt + (ηtgt ηtgt) | {z } B1 xt x 2 2ηt xt x , gt F(xt) | {z } A1 2ηt xt x , F(xt) | {z } A2 +2 xt x , xt xt | {z } B2 +2ηt xt x , gt gt | {z } B3 + xt xt ηt gt 2 | {z } B4 In this form, the Ai terms are therefore shared with the previous analysis, and A2 may be bounded by µ-strong convexity as before. This gives that A2 µηt xt x 2 2ηt (F(xt) F(x )) . A1 is once again bounded under conditional expectations Et[ ] by equation (7), though with a different value of a > 0 than in the previous proof, A1 µa ηt xt x 2 + ηt µa ((M α + Bα)c1 α t + ηtctzu L)2. (7) Now, as B2 is eliminated under expectations under subsampling, we focus on the remaining terms. It is clear that we must bound and gt gt to proceed. Intuitively, this is controlled by normalized averages and model drift across participating nodes. 
Therefore, we consider the nearest or most recent synchronization timestep ts(t) as before and rearrange to incorporate elements of our previous analysis. Assuming interchangeability between the integrals ES (integrating over the randomness of node subsampling) and Et (integrating over randomness of ξt i,0), Et [ES[ gt] gt] = i S pi (Clip(ct, Fi xt i,0, ξt i,0 ) Fi(xt)) i S pi (Clip(ct, Fi xt i,0, ξt i,0 ) Fi(xt, ξt i,0)) Et [gt F(xt)] i S pi Et[ Clip(ct, Fi xt i,0, ξt i,0 ) Fi(xt, ξt i,0) ] + Et[ gt F(xt) ] 2(M α + Bα)c1 α t Efficient Distributed Optimization under Heavy-Tailed Noise where to obtain the final line we used Jensen and an analogous reasoning as in equation (5). Therefore, we have for b > 0 that B3 bηt xt x 2 + 4ηt(M α + Bα)2c2(1 α) t . It now remains to bound B4, which can be done straightforwardly: B4 2 xt xt 2 + 2η2 t gt 2 4z2u2η2 t c2 t + 2η2 t c2 t. Collecting all inequalities gathered under the tower law of expectation, we have E[ xt+1 x 2] (1 ((1 a)µ + b)ηt)E[ xt x 2] 2ηt E [F(xt) F(x )] µa((M α + Bα)c1 α t + ηtctzu L)2 + 4z2u2η2 t c2 t + 2η2 t c2 t + 4ηt(M α + Bα)2c2(1 α) t . Recall the learning rate schedule ηt = r/(t + 1), while setting a , b such that r((1 a )µ + b) = 2. Then, we have for Z the set of all synchronization steps, t+1 Z t(E[F(xt)] F(x )) X 2 E[ xt x 2] (t + 1)t 2 E[ xt+1 x 2] t+1 Z 2(M α + Bα)2tc2(1 α) t 1 2µa((M α + Bα)c1 α t + ηtctzu L)2 | {z } Ψ2+Ψ3+Ψ4 t+1 Z tηtc2 t(2z2u2 + 1) For t+1 / Z, we use the standard telescoping sum in equation (9) while noting that xt+1 = xt+1 due to the synchronization step. We do not repeat mechanical calculation steps here to not obscure the intuitions behind the proof, and instead indicate asympototically equivalent terms to Ψi under 1/(T 2 + T) averaging on the right hand side. It remains to bound the residual term B5 under the averaging step, which gives B5 T(T + 1) O(t2γ(1 α)), which concludes the proof for the first case. 
In the setting in which the subsampling procedure fails to preserve the global objective, we bound xt xt as follows: P k/ [S] p k P i/ [S] pixt i,0 P k/ [S] p k P i [S] pi pi xt i,0 xts + X i/ [S] pi xt i,0 xts 2zuηtct, due to triangle inequality and Jensen. That is, by the synchronization step, we have xk ts = xts, k [N] via to full available node activation in Sludge Clip. This gives xt i,0 xts = t =ts+1 ( xk t + xk t ) xts t =ts+1 xk t xk t 1 zuηtct as in equation (6). Similarly, we have by Jensen and convexity of the norm that Therefore, we obtain for b1, b2 > 0 B2 b1ηt xt x 2 + 1 b1ηt xt xt 2 b1ηt xt x 2 + 2z2u2c2 tηt b1 , B3 b2ηt xt x 2 + 4ηtc2 t. Efficient Distributed Optimization under Heavy-Tailed Noise Following analogous calculations as in the case where the subsampling does not violate the global objective, we arrive at a new residual term B6 T(T + 1) O(t2γ), which controls the expansion of the bias due to the incorrect sampling strategy. C.3. Convergence of Bi2Clip In this section, we analyze the convergence of Bi2Clip under heavy-tailed noise. By employing Bi Clip at both the inner and outer optimizers, Bi2Clip is an efficient algorithm realized by Tail OPT that brings about benefits of adaptivity without maintaining gradient (or model update) statistics. Unlike Adam2, which applies Adam at both the inner and outer optimizers, Bi2Clip achieves comparable or even superior empirical performance while requiring no additional memory or computational overhead (Table 1). This highlights its efficiency and practicality, particularly in resource-constrained settings. We begin with the pseudocode for Bi2Clip in Algorithm 4. Algorithm 4 Bi2Clip Require: Initial model x1, learning rate schedule ηt, clipping schedules ut, dt, eut, edt Synchronization timestep z Z>0 1: for t = 1, . . . 
, T do
2:   for each node i ∈ [N] in parallel do
3:     x_{i,0}^t ← x_{t−1}
4:     for each local step k ∈ [z] do
5:       Draw minibatch gradient g_{i,k}^t = ∇F_i(x_{i,k}^t, ξ_{i,k}^t)
6:       x_{i,k+1}^t ← x_{i,k}^t − η_t^ℓ Bi Clip(u_t, d_t, g_{i,k}^t)
7:     end for
8:   end for
9:   ∆_t = (1/N) Σ_{i∈[N]} x_{i,z}^t − x_{t−1}, m̃_t ← ∆_t
10:  x_t = x_{t−1} + η_t Bi Clip(ũ_t, d̃_t, m̃_t)
11: end for

We carry out the analysis over a sufficiently large, compact domain X. Let ∇F(x) be the deterministic gradient, obtained by integrating over ∇F(x, ξ), the stochastic gradient with a heavy-tailed distribution. The existence of ∇F(x) implies that it is continuous, which gives boundedness via the extreme value theorem. Therefore, from now onward, we formally assume ∇F(x) is coordinate-wise bounded by G in absolute value. We have the following theorem.

Theorem 5. Let Assumptions 1-2 hold, and let the learning rate and clipping schedules satisfy η_t = Θ(t^ω), η_t^ℓ = Θ(t^ν), d_t = Θ(t^γ), u_t = Θ(t^ζ), d̃_t = Θ(t^γ̃), and ũ_t = Θ(t^ζ̃). Imposing ζ, ζ̃ > 0 > γ, γ̃ and ω, ν ≤ 0, as well as −1 < ω + ν, for Bi2Clip (Algorithm 4) we have that

min_{t∈[T]} E[‖∇F(x_{t−1})‖²] ≤ Ψ_1 + Ψ_2 + Ψ_3 + Ψ_4 + Ψ_5 + Ψ_6 + Ψ_7,

where the Ψ_i are given by
Ψ_1 = O(T^{−ω−ν−1}), Ψ_2 = O(T^{ω+2ζ̃−ν}), Ψ_3 = O(T^{γ̃−ν}), Ψ_4 = O(T^γ), Ψ_5 = O(T^{(α−1)ν+(1−α)ζ̃}), Ψ_6 = O(T^{(1−α)ζ}), Ψ_7 = O(T^{ν+ζ}).

Proof. We provide the proof for L2-wise Bi Clip(·) for illustrative purposes and notational convenience. The extension to coordinate-wise Bi Clip(·) is straightforward, as described in the comments following the proof of Theorem 6 (Remark 2). For completeness and readability, we formally provide the definition of L2-wise Bi Clip(·) as

Bi Clip(u_t, d_t, x) = (d_t/‖x‖) x · χ(‖x‖ ≤ d_t) + (u_t/‖x‖) x · χ(‖x‖ ≥ u_t) + x · χ(d_t < ‖x‖ < u_t).

Here, χ is the indicator function, and u_t ≥ d_t ≥ 0 are the clipping thresholds. By default, we take a/0 := 0 for a ∈ R.
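A direct NumPy transcription of the L2-wise Bi Clip(·) definition above (a sketch; `biclip_l2` is our name, and the a/0 := 0 convention is handled by leaving the zero vector fixed):

```python
import numpy as np

def biclip_l2(u: float, d: float, x: np.ndarray) -> np.ndarray:
    """L2-wise Bi Clip: rescale ||x||_2 into [d, u], preserving direction."""
    n = float(np.linalg.norm(x))
    if n == 0.0:
        return x             # the a/0 := 0 convention: the zero vector is fixed
    if n <= d:
        return (d / n) * x   # clip from below (amplify small updates)
    if n >= u:
        return (u / n) * x   # clip from above (stabilize large updates)
    return x
```

Setting d = 0 and u = c recovers the one-sided Clip(·) operator used by Avg-L2Clip.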
Now, we begin by noting that due to L-smoothness, we have where Et[ ] takes expectation up to xt 1 that Et[F(xt)] F(xt 1) F(xt 1), Et[xt xt 1] + L 2 Et[ xt xt 1 2] ηt D F(xt 1), Et[Bi Clip(eut, edt, t)] E +Lη2 t 2 Et Bi Clip(eut, edt, t) 2 . Now, we expand to obtain the following form F(xt 1), Et[Bi Clip(eut, edt, t) t] ηt ℓ X ν [K] 1 pi Et[ Fi(xt i,ν)] Kηt ℓ F(xt 1) = D F(xt 1), Et[Bi Clip(eut, edt, t) + t] E F(xt 1), ηt ℓ X ν [K] 1 pi Et[ Fi(xt i,ν)] Et[ t] F(xt 1), ηt ℓ X ν [K] 1 pi Et[ Fi(xt i,ν)] Kηt ℓ F(xt 1) Kηt ℓ F(xt 1) 2. Using the convexity of compositions (via α 1) and Jensen, we deduce Et[ t α] = Et[ ηt ℓ X ν [K] 1 pi Bi Clip(ut, dt, Fi(xt i,ν, ξt i,ν)) α] (ηt ℓ)αKαEt ν [K] 1 pi Bi Clip(ut, dt, Fi(xt i,ν, ξt i,ν)) (ηt ℓ)αKα 1 X ν [K] 1 pi Et[ Bi Clip(ut, dt, Fi(xt i,ν, ξt i,ν)) α] (ηt ℓ)αKα 1 X ν [K] 1 pi(dα t + Et[ Fi(xt i,ν, ξt i,ν) α]) (ηt ℓ)αKα 1 X ν [K] 1 pidα t + (ηt ℓ)αKα 1 X ν [K] 1 pi Et[ Fi(xt i,ν, ξt i,ν) α] Note that the term C can be bounded as C (ηt ℓ)αKα 1 X ν [K] 1 pi2αEt " Fi(xt i,v) α (ηt ℓ)αKα 1 X ν [K] 1 pi2α 1(M α + Bα) = (ηt ℓ)αKα 1 X ν [K] 1 2α 1(M α + Bα), where M := maxx X,i [N] Fi(x) and Bα := maxi [N], ν [K] 1 Et[ ξt i,v α] supi [N](Bi)αi. We note that this results holds also under distribution shift for the stochastic noise ξt i, where t [T] and i [N], as long as the α-moment remains universally bounded. Therefore, we conclude Et[ t α] (ηt ℓ)αKα 1 X ν [K] 1 dα t + (ηt ℓ)αKα 12α 1 X ν [K] 1 (M α + Bα) =: (ηt ℓ)α f M. Efficient Distributed Optimization under Heavy-Tailed Noise This gives by the Cauchy-Schwartz inequality that B1 F(xt 1) Et[Bi Clip(eut, edt, t)] + t G Et[χ( t edt) edt + χ (eut t ) t α t 1 α] G h P( t edt) edt + P (eut t ) (ηt ℓ)αeu1 α t f M i . Now, B2 may be bounded as follows: ν [K] 1 pi Et[ Fi(xt i,ν)] + Et[ t] ν [K] 1 pi Fi(xt i,ν, ξt i,ν) + t] ν [K] 1 pi Fi(xt i,ν, ξt i,ν) + t where we used convexity, Jensen, and that the stochastic gradient noise is unbiased. 
Unraveling the definition of the pseudogradient $\Delta^t$ gives
\[
\mathbb{E}_t\Big[\Big\| \eta^t_\ell \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu}) - \eta^t_\ell \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu})) \Big\|\Big]
\]
\[
\le \eta^t_\ell \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \mathbb{E}_t\big[ \big\| \nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu}) - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu})) \big\| \big]
\]
\[
\le \eta^t_\ell \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i \big[ d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu})\| \le d_t) + \mathbb{P}(\|\nabla F_i(x^t_{i,\nu}, \xi^t_{i,\nu})\| \ge u_t)\, u_t^{1-\alpha}\, 2^{\alpha-1}(M^\alpha + B^\alpha) \big]
\]
\[
\le \eta^t_\ell K \big[ d_t + u_t^{1-\alpha}\, 2^{\alpha-1}(M^\alpha + B^\alpha) \big].
\]
Additionally, $B_3$ may be bounded via $L$-smoothness and telescoping:
\[
\Big\| \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \nabla F_i(x^t_{i,\nu}) - K \nabla F(x^{t-1}) \Big\|
= \Big\| \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \nabla F_i(x^t_{i,\nu}) - \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, \nabla F_i(x^t_{i,0}) \Big\|
\]
\[
\le \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, L\, \|x^t_{i,\nu} - x^t_{i,0}\|
\le \sum_{i\in[N]}\sum_{\nu\in[K]-1} p_i\, L \sum_{r=1}^{\nu} \|x^t_{i,r} - x^t_{i,r-1}\|
\le \eta^t_\ell\, L K^2 u_t,
\]
so that $B_3 \le \eta^t_\ell\, G \cdot \eta^t_\ell L K^2 u_t = (\eta^t_\ell)^2 G L K^2 u_t$. Collecting all inequalities gathered thus far, we have
\[
\mathbb{E}_t[F(x^t)] \le F(x^{t-1}) + \frac{L\eta_t^2 \tilde u_t^2}{2} - K\eta^t_\ell \eta_t\, \|\nabla F(x^{t-1})\|^2 + G\eta_t\, \tilde d_t + G\eta_t (\eta^t_\ell)^\alpha \tilde u_t^{1-\alpha} \widetilde M
\]
\[
+ G\eta^t_\ell \eta_t K \big( d_t + u_t^{1-\alpha}\, 2^{\alpha-1}(M^\alpha + B^\alpha) \big) + \eta_t (\eta^t_\ell)^2 G L K^2 u_t.
\]
Telescoping under the law of iterated expectations gives
\[
\sum_{t\in[T]} K\eta^t_\ell \eta_t\, \mathbb{E}[\|\nabla F(x^{t-1})\|^2]
\le F(x^0) - \mathbb{E}[F(x^T)] + \sum_{t\in[T]} \Big[ \frac{L\eta_t^2\tilde u_t^2}{2} + G\eta_t\, \tilde d_t + G\eta_t(\eta^t_\ell)^\alpha \tilde u_t^{1-\alpha}\widetilde M \Big]
\]
\[
+ \sum_{t\in[T]} G\eta^t_\ell \eta_t K\big( d_t + u_t^{1-\alpha}2^{\alpha-1}(M^\alpha+B^\alpha) \big) + \sum_{t\in[T]} \eta_t(\eta^t_\ell)^2 GLK^2 u_t.
\]
Now, we move to the asymptotic regime. Let $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, $d_t = \Theta(t^\gamma)$, $u_t = \Theta(t^\zeta)$, $\tilde d_t = \Theta(t^{\tilde\gamma})$, and $\tilde u_t = \Theta(t^{\tilde\zeta})$. This gives after routine calculations that
\[
\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^{t-1})\|^2] \le O\big( T^{-\omega-\nu-1} + T^{\omega+2\tilde\zeta-\nu} + T^{\tilde\gamma-\nu} + T^{(\alpha-1)\nu+(1-\alpha)\tilde\zeta} + T^{\gamma} + T^{(1-\alpha)\zeta} + T^{\nu+\zeta} \big).
\]
To attain convergence of the RHS, it is clear that we must impose $\zeta, \tilde\zeta > 0 > \gamma, \tilde\gamma$ for $\omega, \nu \le 0$. Additionally, we have further constrained $-1 < \omega + \nu$, which ensures that the LHS diverges at a scale faster than logarithmic, validating the asymptotic regime and concluding the proof. To obtain the rate of convergence, we may let
\[
r_1 = \omega + \nu + 1, \quad r_2 = \nu - \omega - 2\tilde\zeta, \quad r_3 = \nu - \tilde\gamma, \quad r_4 = (\alpha-1)(\tilde\zeta - \nu), \quad r_5 = -\gamma, \quad r_6 = (\alpha-1)\zeta, \quad r_7 = -\nu - \zeta, \tag{10}
\]
for $\sigma = \min\{r_1, \dots, r_7\}$ and $\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^{t-1})\|^2] = O(T^{-\sigma})$. Equalizing $r_6$ and $r_7$ gives $\zeta = -\nu/\alpha$, as $\alpha > 1$, $\nu < 0$, $\zeta > 0$. Similarly, equalizing $r_1$ and $r_2$ gives $\omega = -(1 + 2\tilde\zeta)/2$, yielding $r_1 = r_2 = 1/2 + \nu - \tilde\zeta$.
By letting $\tilde\zeta \to 0^+$ and equalizing with $r_6 = r_7 = -(\alpha-1)\nu/\alpha$,
\[
\frac{1}{2} + \nu - \tilde\zeta\,\Big|_{\tilde\zeta \to 0^+} = -\frac{(\alpha-1)\nu}{\alpha}
\implies \nu = -\frac{\alpha}{4\alpha-2}, \quad \zeta = \frac{1}{4\alpha-2}, \quad \omega = -\frac{1}{2}.
\]
For suitably small $\gamma, \tilde\gamma$, we therefore have
\[
\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^{t-1})\|^2] = O\big(T^{-(\alpha-1)/(4\alpha-2)}\big),
\]
concluding the proof.

Remark 1. We note that setting $\tilde d_t = 0$, $\tilde u_t = \infty$, and $\eta_t = 1$ at the outer optimizer recovers a special case of Bi2Clip, i.e., Avg-BiClip. Similarly, for specific hyperparameter choices, Bi2Clip collapses into BiClip-SGD, with upper and lower thresholding applied by the outer optimizer only to accumulated model updates from the inner compute nodes.

Now, in the following subsections, we further analyze the convergence behavior of TailOPT under additional adaptive optimizer instantiations. The Adagrad instantiation (Algorithm 5) collects pseudogradients and sums their squares, effectively implementing a form of implicit clipping. However, it aggressively decays coordinate-wise learning rates, which can limit performance. To address this, we introduce RMSProp-TailClip (Algorithm 6), which relaxes the preconditioning by employing an exponentially decaying moving average of the second moment. In both cases, we prove that the minimum expected gradient converges to 0. Additionally, by incorporating a moving average of the first pseudogradient moment as a form of momentum, we derive Algorithm 7. For this variant, we show that the expected minimal gradient does not diverge even under restarting of the algorithm, which in practice translates to the update of any single step not diverging in expectation. As in the main paper, TailClip refers to either BiClip or L2Clip, and we provide our proofs for BiClip for added generality over L2Clip.

C.4. Convergence of Adagrad-TailClip

We begin by providing the pseudocode of Adagrad-TailClip (Algorithm 5). Then, we have the following result.

Algorithm 5 Adagrad-TailClip
Require: Initial model $x^1$, learning rate schedule $\eta_t$, clipping schedules $u_t, d_t$, synchronization timestep $z \in \mathbb{Z}_{>0}$, adaptivity parameter $\tau > 0$
1: for t = 1, . .
. , T$ do
2:   for each node $i \in [N]$ in parallel do
3:       $x^t_{i,0} \leftarrow x^{t-1}$
4:       for each local step $k \in [z]$ do
5:           Draw minibatch gradient $g^t_{i,k} = \nabla F_i(x^t_{i,k}, \xi^t_{i,k})$
6:           $x^t_{i,k+1} \leftarrow x^t_{i,k} - \eta^t_\ell\, \mathrm{TailClip}(u_t, d_t, g^t_{i,k})$
7:       end for
8:   end for
9:   $\Delta_t = \frac{1}{N}\sum_{i\in[N]} x^t_{i,z} - x^{t-1}$, $\tilde m_t \leftarrow \Delta_t$
10:  $\tilde v_t = \tilde v_{t-1} + \Delta_t^2$
11:  $x^t = x^{t-1} + \eta_t\, \frac{\tilde m_t}{\sqrt{\tilde v_t} + \tau}$
12: end for

Theorem 6. Let the clipping and learning rate schedules satisfy $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, $d_t = \Theta(t^\gamma)$, and $u_t = \Theta(t^\zeta)$, with $\zeta > 0$, $\gamma < -1/2$, and $\omega - \zeta + 1/2 > 0$. Then, we have that
\[
\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2] \le \Psi_1 + \Psi_2 + \Psi_3 + \Psi_4 + \Psi_5 + \Psi_6,
\]
where the $\Psi_i$ are upper bounded by
\[
\Psi_1 \le O(T^{-(\omega-\zeta+\frac{1}{2})}), \quad \Psi_2 \le O(T^{\omega+2\nu+3\zeta+\frac{1}{2}}), \quad \Psi_3 \le O(T^{4\zeta+3\nu+\frac{1}{2}}),
\]
\[
\Psi_4 \le O(T^{2\nu+2\zeta+\frac{1}{2}}), \quad \Psi_5 \le O(T^{\nu+\gamma+\zeta+\frac{1}{2}}), \quad \Psi_6 \le O(T^{\nu+(2-\alpha)\zeta+\frac{1}{2}}),
\]
which guarantees convergence via an inversely proportional power law decay with respect to $T$ once the exponents above are negative, as ensured in the proof. The maximal convergence rate is given by $O(T^{(1-\alpha)/2\alpha})$.

Proof. We analyze the convergence of the global objective, where model weights are updated in a distributed fashion via local BiClip under heavy-tailed noise. By $L$-smoothness, we have
\[
F(x^t) \le F(x^{t-1}) + \langle \nabla F(x^{t-1}), x^t - x^{t-1} \rangle + \tfrac{L}{2}\|x^t - x^{t-1}\|^2
= F(x^{t-1}) + \eta_t \Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde v_t} + \tau} \Big\rangle + \frac{L\eta_t^2}{2}\Big\|\frac{\Delta_t}{\sqrt{\tilde v_t} + \tau}\Big\|^2,
\]
which we further decompose via noting that
\[
\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde v_t} + \tau} \Big\rangle
= \Big\langle \nabla F(x^{t-1}), \Delta_t \Big( \frac{1}{\sqrt{\tilde v_t} + \tau} - \frac{1}{\sqrt{\tilde v_{t-1}} + \tau} \Big) \Big\rangle
+ \underbrace{\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle}_{B_1}
\]
\[
\le \Big\langle |\nabla F(x^{t-1})|, \frac{|\Delta_t|^3}{(\sqrt{\tilde v_t} + \tau)(\sqrt{\tilde v_{t-1}} + \tau)(\sqrt{\tilde v_{t-1}} + \sqrt{\tilde v_t})} \Big\rangle + B_1,
\]
where we used $\tilde v_t - \tilde v_{t-1} = \Delta_t^2$ coordinate-wise. To bound $B_1$, we extract a negative gradient norm:
\[
B_1 = \underbrace{\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t + K\eta^t_\ell \nabla F(x^{t-1})}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle}_{B_2}
- K\eta^t_\ell \Big\langle \nabla F(x^{t-1}), \frac{\nabla F(x^{t-1})}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle,
\]
where $B_2$ decomposes further into
\[
B_2 = \Big\langle \nabla F(x^{t-1}), \frac{\Delta_t + \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, \nabla F_i(x^t_{i,v})}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle
+ \Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, (\nabla F_i(x^t_{i,0}) - \nabla F_i(x^t_{i,v}))}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle,
\]
using $\sum_{v\in[K]-1} \nabla F_i(x^t_{i,0}) = K\nabla F_i(x^{t-1})$ and $\sum_i p_i \nabla F_i = \nabla F$. Here, we use the convention $[K]{-}1 = \{0, \dots, K-1\}$, and that summations over null index sets are zero (e.g., $\sum_{j=K}^{K-1}[\cdot] = 0$).
Now, recall
\[
\Delta_t := \sum_{i\in[N]} p_i \Delta^t_i = \sum_{i\in[N]} p_i (x^t_{i,K} - x^t_{i,0}) = -\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, \hat g^t_{i,v}
= -\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v}),
\]
which implies $B_2 = C_1 + C_2$ for
\[
C_1 = \Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell \big( \nabla F_i(x^t_{i,v}) - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v}) \big)}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle,
\]
\[
C_2 = \Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell \big( \nabla F_i(x^t_{i,0}) - \nabla F_i(x^t_{i,v}) \big)}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle.
\]
Letting $\mathbb{E}_t[\cdot]$ condition on all stochasticity up to global step $t$, we have by unbiasedness that $\mathbb{E}_t[C_1]$ is equal to
\[
\Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, \mathbb{E}_t\big[ \nabla F_i(x^t_{i,v}) + \xi^t_{i,v} - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v}) \big]}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle.
\]
For $D_1 := \mathbb{E}_t[\nabla F_i(x^t_{i,v}) + \xi^t_{i,v} - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v})]$, we have by convexity and Jensen that
\[
\|D_1\| \le \mathbb{E}_t\big[\| \nabla F_i(x^t_{i,v}) + \xi^t_{i,v} - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v}) \|\big]
\]
\[
\le d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\| \le d_t)
+ \underbrace{\mathbb{E}_t\big[ \|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\|\, \chi(\|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\| \ge u_t) \big]}_{D_2}.
\]
Continuity of $\nabla F_i(x)$ is clear via smoothness. This gives that
\[
\mathbb{E}_t\big[ \|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\|^\alpha \chi(\|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\| \ge u_t) \big]
\le \mathbb{E}_t[\|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v}\|^\alpha]
\le 2^{\alpha-1}\big( \|\nabla F_i(x^t_{i,v})\|^\alpha + \mathbb{E}_t[\|\xi^t_{i,v}\|^\alpha] \big) = 2^{\alpha-1}(M^\alpha + B^\alpha),
\]
where now $M := \max_{x\in\mathcal{X},\, i\in[N]} \|\nabla F_i(x)\|$. Thus, we may bound $D_2$ via reduction to the $\alpha$-moment:
\[
D_2 \le 2^{\alpha-1}(M^\alpha + B^\alpha)\, \mathbb{E}_t\big[ \|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\|^{1-\alpha} \chi(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t) \big]
\le 2^{\alpha-1}(M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}\big( \|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t \big).
\]
Collecting inequalities gives
\[
\|D_1\| \le d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \le d_t) + 2^{\alpha-1}(M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t),
\]
\[
\mathbb{E}_t[C_1] \le \frac{Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \le d_t)
+ \frac{2^{\alpha-1} Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, (M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t).
\]
To bound $C_2$, we note that via $L$-smoothness, we have
\[
\Big\| \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell \big( \nabla F_i(x^t_{i,0}) - \nabla F_i(x^t_{i,v}) \big) \Big\|
\le \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, L\, \|x^t_{i,0} - x^t_{i,v}\|
\le \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, L \sum_{r\in[v]} \|x^t_{i,r} - x^t_{i,r-1}\|,
\]
so that $C_2 \le \frac{GLK^2 d}{2\tau} (\eta^t_\ell)^2 u_t$. Noting that $\|\Delta_t\| \le \eta^t_\ell u_t K$, we thus obtain
\[
\mathbb{E}_t[F(x^t)] \le F(x^{t-1}) + \frac{\eta_t^2 (\eta^t_\ell)^2 u_t^2 K^2 L}{2\tau^2} + \frac{\eta_t\, Gd\, K^3 u_t^3 (\eta^t_\ell)^3}{\tau^3}
- K\eta_t\eta^t_\ell\, \mathbb{E}_t\Big[ \frac{\|\nabla F(x^{t-1})\|^2}{\sqrt{\tilde v_{t-1}} + \tau} \Big] + \frac{GLK^2 d}{2\tau}\, \eta_t (\eta^t_\ell)^2 u_t
\]
\[
+ \frac{\eta_t\, Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \le d_t)
+ \frac{2^{\alpha-1}\eta_t\, Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, (M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t).
\]
Taking expectations on both sides and telescoping gives, via the tower law of expectation,
\[
\underbrace{\sum_{t\in[T]} K\eta_t\eta^t_\ell\, \mathbb{E}\Big[ \frac{\|\nabla F(x^{t-1})\|^2}{\sqrt{\tilde v_{t-1}} + \tau} \Big]}_{E_1}
\le \underbrace{\mathbb{E}[F(x^0) - F(x^T)]}_{E_2}
+ \underbrace{\sum_{t\in[T]} \frac{\eta_t^2 (\eta^t_\ell)^2 u_t^2 K^2 L}{2\tau^2}}_{E_3}
+ \underbrace{\sum_{t\in[T]} \frac{\eta_t\, Gd\, K^3 u_t^3 (\eta^t_\ell)^3}{\tau^3}}_{E_4}
+ \underbrace{\sum_{t\in[T]} \frac{GLK^2 d}{2\tau}\, \eta_t (\eta^t_\ell)^2 u_t}_{E_5}
\]
\[
+ \underbrace{\sum_{t\in[T]} \frac{\eta_t\, Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \le d_t)}_{E_6}
+ \underbrace{\sum_{t\in[T]} \frac{2^{\alpha-1}\eta_t\, Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, (M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t)}_{E_7},
\]
where we have enumerated each term from $E_1$ to $E_7$ for clarity. To simplify notation, we now move to the asymptotic regime. Letting $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, $d_t = \Theta(t^\gamma)$, and $u_t = \Theta(t^\zeta)$, we have via standard integral bounds that
\[
E_1 \ge \Omega\big( T^{\omega+\nu+1} \cdot T^{-\zeta-\nu-\frac{1}{2}}\, \min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2] \big) = \Omega\big( T^{\omega-\zeta+\frac{1}{2}}\, \min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2] \big),
\]
\[
E_2 \le \max_{x\in\mathcal{X}} F(x) - \min_{y\in\mathcal{X}} F(y) = O(1), \quad E_3 \le O(T^{2\omega+2\nu+2\zeta+1}), \quad E_4 \le O(T^{\omega+3\zeta+3\nu+1}),
\]
\[
E_5 \le O(T^{\omega+2\nu+\zeta+1}), \quad E_6 \le O(T^{\omega+\nu+\gamma+1}), \quad E_7 \le O(T^{\omega+\nu+(1-\alpha)\zeta+1}),
\]
where any $E_i$ residues of $O(1)$ for $i \ge 2$ have been incorporated into the upper bound for $E_2$. We note that the bound may be sharpened, as the probabilistic terms must necessarily decay if $d_t \to 0$, $u_t \to \infty$, which further diminishes $E_6, E_7$. Now, to attain convergence of the minimal gradient, we impose the conditions
\[
\Lambda_1: \zeta > 0 \text{ and } \gamma < 0, \quad \Lambda_2: \omega - \zeta + \tfrac{1}{2} > 0, \quad \Lambda_3: \omega + 2\nu + 3\zeta + \tfrac{1}{2} < 0, \quad \Lambda_4: 4\zeta + 3\nu + \tfrac{1}{2} < 0,
\]
\[
\Lambda_5: 2\nu + 2\zeta + \tfrac{1}{2} < 0, \quad \Lambda_6: \nu + \gamma + \zeta + \tfrac{1}{2} < 0, \quad \Lambda_7: \nu + (2-\alpha)\zeta + \tfrac{1}{2} < 0.
\]
We note that each condition $\Lambda_i$, $i \ge 2$, comes from $E_i/E_1 \to 0$ as $T \to \infty$, as any residual terms are subsumed by $O(1)$, which decays relative to $E_1$ via $\Lambda_2$. To derive the convergence rate, we equalize the bottlenecks by letting the decay rates associated with $\Lambda_2, \Lambda_3, \Lambda_7$ all equal $\sigma$. This gives a system of equations:
\[
\omega - \zeta + \tfrac{1}{2} = \sigma, \qquad -\big(\omega + 2\nu + 3\zeta + \tfrac{1}{2}\big) = \sigma, \qquad -\big(\nu + (2-\alpha)\zeta + \tfrac{1}{2}\big) = \sigma,
\]
which gives $\zeta = \frac{1}{2\alpha}$, $\nu = -\sigma - \frac{1}{\alpha}$, and $\omega = \sigma + \frac{1}{2\alpha} - \frac{1}{2}$. Letting $\sigma = (\alpha-1)/2\alpha$ gives the slackness of $\Lambda_4$, $\Lambda_5$, and $\Lambda_6$ for $\gamma < -1/2$. Therefore, the minimum expected gradient norm squared stabilizes at rate $O(T^{-\sigma})$.

Remark 2. In the case of coordinate-wise clipping, all major adjustments, up to a scaling factor of $d$, are made in the terms bounding $\mathbb{E}[C_1]$. In this case, the proof proceeds as follows.
Defining $|\cdot|$ to act coordinate-wise, $\mathbb{E}_t[C_1]$ is now upper bounded by
\[
\Big\langle |\nabla F(x^{t-1})|, \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, \big| \mathbb{E}_t[\nabla F_i(x^t_{i,v}) + \xi^t_{i,v} - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v})] \big|}{\sqrt{\tilde v_{t-1}} + \tau} \Big\rangle.
\]
Therefore, by Jensen,
\[
\mathbb{E}_t[C_1] \le \frac{\eta^t_\ell\, G}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1}\sum_{j\in[d]} p_i\, \mathbb{E}_t\big[ \underbrace{|\nabla F_i(x^t_{i,v}) + \xi^t_{i,v} - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v})|_j}_{D_{1,j}} \big].
\]
We note that $\mathbb{E}_t[D_{1,j}]$ can be upper bounded by $D_{2,j} + D_{3,j}$, where
\[
D_{2,j} = \mathbb{E}_t\big[ D_{1,j}\, \chi(|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j \le d_t) \big] \le d_t\, \mathbb{P}\big( |\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j \le d_t \big),
\]
\[
D_{3,j} = \mathbb{E}_t\big[ |\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j\, \chi(|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j \ge u_t) \big].
\]
It follows that
\[
D_{3,j} \le \mathbb{E}_t\big[ |\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j^{\alpha}\, |\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j^{1-\alpha}\, \chi(|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j \ge u_t) \big]
\le 2^{\alpha-1}(M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}\big( |\nabla F_i(x^t_{i,v}; \xi^t_{i,v})|_j \ge u_t \big).
\]
Note that we used coordinate-wise bounded alpha moments for some $\alpha \in (1,2)$, $\mathbb{E}[|\xi_i|_j^\alpha] \le B^\alpha_{i,j}$. We therefore define $M$ and $B$ to be
\[
M := \max_{x\in\mathcal{X},\, i\in[N],\, j\in[d]} |\nabla F_i(x)|_j \quad \text{and} \quad B := \max_{i\in[N],\, j\in[d]} B_{i,j}.
\]
Comparing terms gives the identical asymptotic order of convergence to $L_2$ clipping in Theorem 6.

C.5. Convergence of RMSProp-TailClip

For Algorithm 6, we have the following convergence bound.

Theorem 7. For clipping and learning rate schedules satisfying $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, $d_t = \Theta(t^\gamma)$, and $u_t = \Theta(t^\zeta)$, let the conditions listed in Theorem 6 hold. Then, local BiClip with outer optimizer RMSProp stabilizes the expected minimum gradient, $\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2] \to 0^+$, with maximal rate $O(T^{(1-\alpha)/2\alpha})$. Here, the exponential moving average parameter of the second pseudogradient moment is fixed within the range $\tilde\beta_2 \in [0,1)$.

Algorithm 6 RMSProp-TailClip
Require: Initial model $x^1$, learning rate schedule $\eta_t$, clipping schedules $u_t, d_t$, synchronization timestep $z \in \mathbb{Z}_{>0}$, adaptivity/EMA parameters $\tau > 0$, $\tilde\beta_2 \in [0,1)$
1: for t = 1, . . .
, T$ do
2:   for each node $i \in [N]$ in parallel do
3:       $x^t_{i,0} \leftarrow x^{t-1}$
4:       for each local step $k \in [z]$ do
5:           Draw minibatch gradient $g^t_{i,k} = \nabla F_i(x^t_{i,k}, \xi^t_{i,k})$
6:           $x^t_{i,k+1} \leftarrow x^t_{i,k} - \eta^t_\ell\, \mathrm{TailClip}(u_t, d_t, g^t_{i,k})$
7:       end for
8:   end for
9:   $\Delta_t = \frac{1}{N}\sum_{i\in[N]} x^t_{i,z} - x^{t-1}$, $\tilde m_t \leftarrow \Delta_t$
10:  $\tilde v_t = \tilde\beta_2 \tilde v_{t-1} + (1 - \tilde\beta_2) \Delta_t^2$
11:  $x^t = x^{t-1} + \eta_t\, \frac{\tilde m_t}{\sqrt{\tilde v_t} + \tau}$
12: end for

Proof. The proof for outer optimizer RMSProp builds on the prior proof for BiClip with outer optimizer Adagrad. We skip repeated details for clarity of exposition, and concisely present only the main steps and ideas central to the proof. $L$-smoothness gives, as before,
\[
F(x^t) \le F(x^{t-1}) + \langle \nabla F(x^{t-1}), x^t - x^{t-1} \rangle + \tfrac{L}{2}\|x^t - x^{t-1}\|^2
= F(x^{t-1}) + \eta_t \Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde v_t} + \tau} \Big\rangle + \frac{L\eta_t^2}{2}\Big\|\frac{\Delta_t}{\sqrt{\tilde v_t} + \tau}\Big\|^2.
\]
We note the decomposition
\[
\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde v_t} + \tau} \Big\rangle
= \Big\langle \nabla F(x^{t-1}), \Delta_t \Big( \frac{1}{\sqrt{\tilde v_t} + \tau} - \frac{1}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big) \Big\rangle
+ \Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle.
\]
To form an upper bound, we use that
\[
\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle
= \underbrace{\Big\langle \nabla F(x^{t-1}), \frac{\Delta_t + K\eta^t_\ell \nabla F(x^{t-1})}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle}_{C_0}
- K\eta^t_\ell \Big\langle \nabla F(x^{t-1}), \frac{\nabla F(x^{t-1})}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle,
\]
where $C_0 = C_1 + C_2$ for
\[
C_1 = \Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell \big( \nabla F_i(x^t_{i,v}) - \mathrm{BiClip}(u_t, d_t, \nabla F_i(x^t_{i,v}) + \xi^t_{i,v}) \big)}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle,
\]
\[
C_2 = \Big\langle \nabla F(x^{t-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell \big( \nabla F_i(x^t_{i,0}) - \nabla F_i(x^t_{i,v}) \big)}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big\rangle.
\]
By the tower law and conditioning on stochastic realizations up to $t-1$, we have as before
\[
\mathbb{E}[C_0] \le \frac{Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, d_t\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \le d_t) + \frac{GLK^2 d}{2\tau} (\eta^t_\ell)^2 u_t
+ \frac{2^{\alpha-1} Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^t_\ell\, (M^\alpha + B^\alpha)\, u_t^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^t_{i,v}; \xi^t_{i,v})\| \ge u_t)
\]
\[
\le \frac{Gd}{\tau} K\eta^t_\ell d_t + \frac{GLK^2 d}{2\tau} (\eta^t_\ell)^2 u_t + \frac{2^{\alpha-1} Gd}{\tau} K\eta^t_\ell (M^\alpha + B^\alpha)\, u_t^{1-\alpha}.
\]
To bound the EMA-difference term $B_1$, we have
\[
\Big\langle \nabla F(x^{t-1}), \Delta_t \Big( \frac{1}{\sqrt{\tilde v_t} + \tau} - \frac{1}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big) \Big\rangle
\le \Big\langle |\nabla F(x^{t-1})|, \frac{(1 - \tilde\beta_2)\, |\Delta_t|^3}{(\sqrt{\tilde v_t} + \tau)(\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau)(\sqrt{\tilde v_t} + \sqrt{\tilde\beta_2 \tilde v_{t-1}})} \Big\rangle,
\]
using $\tilde v_t - \tilde\beta_2 \tilde v_{t-1} = (1 - \tilde\beta_2)\Delta_t^2$ coordinate-wise. We prepare the global inequality (12) for telescoping.
It is straightforward to see that collecting inequalities gives
\[
\mathbb{E}[F(x^t)] \le \mathbb{E}[F(x^{t-1})] + \frac{\eta_t^2 L K^2 u_t^2 (\eta^t_\ell)^2}{2\tau^2}
- K\eta_t\eta^t_\ell\, \mathbb{E}\Big[ \frac{\|\nabla F(x^{t-1})\|^2}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big]
+ \frac{Gd}{\tau} K\eta_t\eta^t_\ell d_t + \frac{GLK^2 d}{2\tau}\, \eta_t (\eta^t_\ell)^2 u_t
+ \frac{2^{\alpha-1} Gd}{\tau} K\eta_t\eta^t_\ell (M^\alpha + B^\alpha)\, u_t^{1-\alpha}
+ \frac{dG(1-\tilde\beta_2)(K u_t \eta^t_\ell)^3 \eta_t}{\tau^3}.
\]
Rearranging and telescoping gives
\[
\sum_{t=1}^{T} K\eta_t\eta^t_\ell\, \mathbb{E}\Big[ \frac{\|\nabla F(x^{t-1})\|^2}{\sqrt{\tilde\beta_2 \tilde v_{t-1}} + \tau} \Big]
\le \mathbb{E}[F(x^0)] - \mathbb{E}[F(x^T)] + \sum_{t=1}^{T} \Big[ \frac{\eta_t^2 L K^2 u_t^2 (\eta^t_\ell)^2}{2\tau^2}
+ \frac{Gd}{\tau} K\eta_t\eta^t_\ell d_t + \frac{GLK^2 d}{2\tau}\, \eta_t (\eta^t_\ell)^2 u_t
+ \frac{2^{\alpha-1} Gd}{\tau} K\eta_t\eta^t_\ell (M^\alpha + B^\alpha)\, u_t^{1-\alpha}
+ \frac{dG(1-\tilde\beta_2)(K u_t \eta^t_\ell)^3 \eta_t}{\tau^3} \Big].
\]
By non-negativity of squared pseudogradients, we immediately obtain $\tilde\beta_2 \tilde v_{t-1} \le \tilde v_{t-1}$. Therefore, up to constants, the convergence bound collapses to bounds asymptotically equivalent to those of Theorem 6, up to constant multiples arising from the exponentially decaying moving average of the second pseudogradient moment. The modification to coordinate-wise clipping instead of $L_2$ clipping follows analogous steps. Incorporating momentum into the first pseudogradient moment further complicates the analysis, and yields the results presented in Section C.6.

C.6. Convergence of Adam-TailClip

By incorporating a moving average of the first pseudogradient moment as a form of momentum, we derive Algorithm 7. For this variant, we demonstrate that the expected minimal gradient does not diverge, even when the algorithm undergoes restarts. Practically, this ensures that the update of any single step remains bounded in expectation. Investigating the conditions required to guarantee convergence to 0 under this framework presents a promising avenue for future research. Our bound highlights that the dominating terms are influenced by the upper clipping threshold $u_r$, suggesting that the algorithm's convergence behavior may be closely related to the choice of this threshold, which can be tuned in practice.

Theorem 8. Let the exponentially decaying moving average parameters satisfy $\tilde\beta_1 \in (0,1)$, $\tilde\beta_2 \in [0,1)$ for the outer optimizer first and second order pseudogradient moments, respectively.
Extremize the unbiased stochastic noise such that there exists no $\alpha_k \in (1,2)$ for which $\mathbb{E}[\|\xi_k\|^{\alpha_k}] < B^{\alpha_k}_k$, for integrable $\xi_k$. Then, Algorithm 7 gives, under a constant upper clipping threshold invariant to the global timestep $t$ (i.e., $\zeta = 0$), that
\[
\min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2] \le O(1),
\]
where for $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, and $d_t = \Theta(t^\gamma)$, we impose
\[
\nu \in (-1, 0), \qquad \nu - 1 < \omega \le 0, \qquad -(1 + \nu + \omega) < \gamma < 0. \tag{13}
\]

Algorithm 7 Adam-TailClip
Require: Initial model $x^1$, learning rate schedule $\eta_t$, clipping schedules $u_t, d_t$, synchronization timestep $z \in \mathbb{Z}_{>0}$, adaptivity/EMA parameters $\tau > 0$, $\tilde\beta_1, \tilde\beta_2 \in [0,1)$
1: for $t = 1, \dots, T$ do
2:   for each node $i \in [N]$ in parallel do
3:       $x^t_{i,0} \leftarrow x^{t-1}$
4:       for each local step $k \in [z]$ do
5:           Draw minibatch gradient $g^t_{i,k} = \nabla F_i(x^t_{i,k}, \xi^t_{i,k})$
6:           $x^t_{i,k+1} \leftarrow x^t_{i,k} - \eta^t_\ell\, \mathrm{TailClip}(u_t, d_t, g^t_{i,k})$
7:       end for
8:   end for
9:   $\Delta_t = \frac{1}{N}\sum_{i\in[N]} x^t_{i,z} - x^{t-1}$
10:  $\tilde m_t = \tilde\beta_1 \tilde m_{t-1} + (1 - \tilde\beta_1) \Delta_t$
11:  $\tilde v_t = \tilde\beta_2 \tilde v_{t-1} + (1 - \tilde\beta_2) \Delta_t^2$
12:  $x^t = x^{t-1} + \eta_t\, \frac{\tilde m_t}{\sqrt{\tilde v_t} + \tau}$
13: end for

Proof. As in the case of outer optimizer Adagrad, we analyze the convergence of the global objective. By $L$-smoothness, we have
\[
F(x^t) \le F(x^{t-1}) + \langle \nabla F(x^{t-1}), x^t - x^{t-1} \rangle + \tfrac{L}{2}\|x^t - x^{t-1}\|^2
= F(x^{t-1}) + \eta_t \Big\langle \nabla F(x^{t-1}), \underbrace{\frac{\tilde\beta_1^t \tilde m_0 + (1 - \tilde\beta_1) \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Delta_r}{\sqrt{\tilde v_t} + \tau}}_{A_1} \Big\rangle + \frac{L\eta_t^2}{2}\, \|A_1\|^2.
\]
(14)

To proceed with the proof, we note that
\[
\langle \nabla F(x^{t-1}), A_1 \rangle = \Big\langle \nabla F(x^{t-1}), \frac{\tilde\beta_1^t \tilde m_0}{\sqrt{\tilde v_t} + \tau} \Big\rangle
+ (1 - \tilde\beta_1) \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Big\langle \nabla F(x^{t-1}), \frac{\Delta_r}{\sqrt{\tilde v_t} + \tau} \Big\rangle,
\]
which we further decompose by using
\[
\Big\langle \nabla F(x^{t-1}), \frac{\Delta_r}{\sqrt{\tilde v_t} + \tau} \Big\rangle
= \sum_{q=0}^{t-r} \underbrace{\Big\langle \nabla F(x^{t-1}), \Delta_r \Big( \frac{1}{\sqrt{\tilde\beta_2^q \tilde v_{t-q}} + \tau} - \frac{1}{\sqrt{\tilde\beta_2^{q+1} \tilde v_{t-q-1}} + \tau} \Big) \Big\rangle}_{A_{1,q}}
\]
\[
+ \underbrace{\Big\langle \nabla F(x^{t-1}) - \nabla F(x^{r-1}), \frac{\Delta_r}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle}_{B_1}
+ \underbrace{\Big\langle \nabla F(x^{r-1}), \frac{\Delta_r}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle}_{B_2}.
\]
We have, using $\tilde\beta_2^q \tilde v_{t-q} - \tilde\beta_2^{q+1} \tilde v_{t-q-1} = (1 - \tilde\beta_2)\tilde\beta_2^q \Delta_{t-q}^2$ coordinate-wise, that
\[
A_{1,q} \le \Big\langle |\nabla F(x^{t-1})|, \frac{(1 - \tilde\beta_2)\, \tilde\beta_2^q\, \Delta_{t-q}^2\, |\Delta_r|}{(\sqrt{\tilde\beta_2^q \tilde v_{t-q}} + \tau)(\sqrt{\tilde\beta_2^{q+1} \tilde v_{t-q-1}} + \tau)(\sqrt{\tilde\beta_2^q \tilde v_{t-q}} + \sqrt{\tilde\beta_2^{q+1} \tilde v_{t-q-1}})} \Big\rangle.
\]
To upper bound $B_2$, we observe
\[
B_2 = \underbrace{\Big\langle \nabla F(x^{r-1}), \frac{\Delta_r + K\eta^r_\ell \nabla F(x^{r-1})}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle}_{C_{0,r}}
- K\eta^r_\ell \Big\langle \nabla F(x^{r-1}), \frac{\nabla F(x^{r-1})}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle,
\]
where $C_{0,r} = C_{1,r} + C_{2,r}$ for
\[
C_{1,r} = \Big\langle \nabla F(x^{r-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^r_\ell \big( \nabla F_i(x^r_{i,v}) - \mathrm{BiClip}(u_r, d_r, \nabla F_i(x^r_{i,v}) + \xi^r_{i,v}) \big)}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle,
\]
\[
C_{2,r} = \Big\langle \nabla F(x^{r-1}), \frac{\sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^r_\ell \big( \nabla F_i(x^r_{i,0}) - \nabla F_i(x^r_{i,v}) \big)}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big\rangle.
\]
Noting that $\mathbb{E}[\cdot] = \mathbb{E}[\mathbb{E}_r[\cdot]]$ by the tower law, we have as before
\[
\mathbb{E}[C_{0,r}] \le \frac{Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^r_\ell\, d_r\, \mathbb{P}(\|\nabla F_i(x^r_{i,v}; \xi^r_{i,v})\| \le d_r) + \frac{GLK^2 d}{2\tau} (\eta^r_\ell)^2 u_r
+ \frac{2^{\alpha-1} Gd}{\tau} \sum_{i\in[N]}\sum_{v\in[K]-1} p_i \eta^r_\ell\, (M^\alpha + B^\alpha)\, u_r^{1-\alpha}\, \mathbb{P}(\|\nabla F_i(x^r_{i,v}; \xi^r_{i,v})\| \ge u_r)
\]
\[
\le \frac{Gd}{\tau} K\eta^r_\ell d_r + \frac{GLK^2 d}{2\tau} (\eta^r_\ell)^2 u_r + \frac{2^{\alpha-1} Gd}{\tau} K\eta^r_\ell (M^\alpha + B^\alpha)\, u_r^{1-\alpha}.
\]
We retain the $\alpha$ for clarity and to draw comparison to previous proofs; however, we note that $\alpha = 1$, as higher moments do not exist. Now, to bound $B_1$, we use $L$-smoothness:
\[
B_1 \le \frac{L\eta^r_\ell u_r K}{\tau}\, \|x^{t-1} - x^{r-1}\| \le \frac{L\eta^r_\ell u_r K}{\tau}\, \mathrm{diam}(\mathcal{X}).
\]
Collecting all inequalities gathered thus far gives
\[
\mathbb{E}[F(x^t)] \le \mathbb{E}[F(x^{t-1})] + \frac{\eta_t^2 L}{2}\, \mathbb{E}[\|A_1\|^2]
+ \tilde\beta_1^t \eta_t\, \mathbb{E}\Big[ \Big\langle \nabla F(x^{t-1}), \frac{\tilde m_0}{\sqrt{\tilde v_t} + \tau} \Big\rangle \Big]
\]
\[
+ (1 - \tilde\beta_1)\eta_t \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Big( \sum_{q=0}^{t-r} \mathbb{E}[A_{1,q}]
- K\eta^r_\ell\, \mathbb{E}\Big[ \frac{\|\nabla F(x^{r-1})\|^2}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big]
+ \frac{L\eta^r_\ell u_r K\, \mathrm{diam}(\mathcal{X})}{\tau} \Big)
\]
\[
+ (1 - \tilde\beta_1)\eta_t \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Big( \frac{Gd}{\tau} K\eta^r_\ell d_r + \frac{GLK^2 d}{2\tau} (\eta^r_\ell)^2 u_r + \frac{2^{\alpha-1} Gd}{\tau} K\eta^r_\ell (M^\alpha + B^\alpha)\, u_r^{1-\alpha} \Big).
\]
We note the use of Jensen and convexity to ensure $\mathbb{E}[B_1] \le \mathbb{E}[\|B_1\|]$.
We now rearrange and telescope over $t \in [1, T]$:
\[
\underbrace{(1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}\, K\eta^r_\ell\, \mathbb{E}\Big[ \frac{\|\nabla F(x^{r-1})\|^2}{\sqrt{\tilde\beta_2^{t-r+1} \tilde v_{r-1}} + \tau} \Big]}_{F_1}
\le \underbrace{\mathbb{E}[F(x^0)] - \mathbb{E}[F(x^T)]}_{F_2}
+ \underbrace{\sum_{t=1}^{T} \frac{\eta_t^2 L}{2}\, \mathbb{E}[\|A_1\|^2]}_{F_3}
+ \underbrace{\sum_{t=1}^{T} \eta_t \tilde\beta_1^t\, \mathbb{E}\Big[ \Big\langle \nabla F(x^{t-1}), \frac{\tilde m_0}{\sqrt{\tilde v_t} + \tau} \Big\rangle \Big]}_{F_4}
\]
\[
+ \underbrace{(1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}}_{F_5}
\Big( \underbrace{\sum_{q=0}^{t-r} \mathbb{E}[A_{1,q}]}_{F_6}
+ \underbrace{\frac{L\eta^r_\ell u_r K\, \mathrm{diam}(\mathcal{X})}{\tau}}_{F_7}
+ \underbrace{\frac{Gd}{\tau} K\eta^r_\ell d_r}_{F_8}
+ \underbrace{\frac{GLK^2 d}{2\tau} (\eta^r_\ell)^2 u_r}_{F_9}
+ \underbrace{\frac{2^{\alpha-1} Gd}{\tau} K\eta^r_\ell (M^\alpha + B^\alpha)\, u_r^{1-\alpha}}_{F_{10}} \Big).
\]
We now aim to bound each term on the left-hand side from below, and on the right-hand side from above. Letting $\eta_t = \Theta(t^\omega)$, $\eta^t_\ell = \Theta(t^\nu)$, $d_t = \Theta(t^\gamma)$, and $u_t = \Theta(t^\zeta)$, we move to the asymptotic regime to simplify notation and suppress auxiliary constants for readability. We have that
\[
\sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r} \eta^r_\ell
= \sum_{t=1}^{T} \eta_t \tilde\beta_1^{t} \sum_{r=1}^{t} \tilde\beta_1^{-r} \eta^r_\ell
\approx \sum_{t=1}^{T} \eta_t \tilde\beta_1^{t} \int_{1}^{t} \tilde\beta_1^{-r} r^{\nu}\, dr. \tag{15}
\]
Then, L'Hôpital's rule allows us to derive an asymptotically sharp bound as follows:
\[
\int_{1}^{t} \tilde\beta_1^{-r} r^{\nu}\, dr
= \Big[ \frac{\tilde\beta_1^{-r} r^{\nu}}{|\log \tilde\beta_1|} \Big]_{1}^{t} - \frac{\nu}{|\log \tilde\beta_1|} \int_{1}^{t} \tilde\beta_1^{-r} r^{\nu-1}\, dr
\sim \frac{\tilde\beta_1^{-t} t^{\nu}}{|\log \tilde\beta_1|}. \tag{16}
\]
Here, we used that $\nu \le 0$ and $0 < \tilde\beta_1 < 1$. Asymptotic equivalence is verified via
\[
\lim_{t\to\infty} \frac{|\log \tilde\beta_1| \int_{1}^{t} \tilde\beta_1^{-r} r^{\nu}\, dr}{\tilde\beta_1^{-t} t^{\nu}}
= \lim_{t\to\infty} \frac{|\log \tilde\beta_1|\, \tilde\beta_1^{-t} t^{\nu}}{|\log \tilde\beta_1|\, \tilde\beta_1^{-t} t^{\nu} + \nu\, \tilde\beta_1^{-t} t^{\nu-1}} = 1.
\]
Therefore, the rightmost side of (16) is an asymptotically sharp approximation, relieving the condition $\nu \le 0$ for the validity of the approximation. Within $\tilde\beta_1 \in (0,1)$, the approximation diverges as expected, validating the asymptotic analysis. Recall that $\|\Delta_r\| \le K\eta^r_\ell u_r$, which now gives via (16)
\[
\tilde\beta_2^{t-r+1} \tilde v_{r-1} \le \tilde v_{r-1} = (1 - \tilde\beta_2) \sum_{z=1}^{r-1} \tilde\beta_2^{r-1-z} \Delta_z^2
\le (1 - \tilde\beta_2) \sum_{z=1}^{r-1} \tilde\beta_2^{r-1-z} (K \eta^z_\ell u_z)^2
\le \max\big\{ O(1),\, T^{2(\nu+\zeta)} \big\}. \tag{17}
\]
Here, we used $\tilde\beta_2 \le 1$ and $r \le T$. We thus obtain
\[
\sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r} \eta^r_\ell
\ge \frac{1}{|\log \tilde\beta_1|} \int_{1}^{T} t^{\omega+\nu}\, dt
\ge \frac{T^{\omega+\nu+1}}{(\omega + \nu + 1)\, |\log \tilde\beta_1|}
\]
up to constants. Therefore, as $\nu + \zeta < 0$ (so that $\tilde v_{r-1} = O(1)$ and the preconditioner is bounded), we conclude
\[
F_1 \ge \Omega\Big( \frac{(1 - \tilde\beta_1)\, T^{\omega+\nu+1}}{(\omega + \nu + 1)\, |\log \tilde\beta_1|} \Big)\, \min_{t\in[T]} \mathbb{E}[\|\nabla F(x^t)\|^2].
\]
Clearly, $F_2 \le O(1)$. To bound $F_3$, we have
\[
\|A_1\|^2 = \Big\| \frac{\tilde\beta_1^t \tilde m_0 + (1 - \tilde\beta_1) \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Delta_r}{\sqrt{\tilde v_t} + \tau} \Big\|^2
\le \frac{2\tilde\beta_1^{2t} \|\tilde m_0\|^2}{\tau^2} + \frac{2(1 - \tilde\beta_1)^2 \big\| \sum_{r=1}^{t} \tilde\beta_1^{t-r} \Delta_r \big\|^2}{\tau^2},
\]
so that
\[
F_3 \le \frac{O(1)}{\tau^2} + \frac{(1 - \tilde\beta_1)^2 \sum_{t=1}^{T} t^{2\nu+2\zeta+2\omega}}{\tau^2 (\log \tilde\beta_1)^2}
\le \frac{O(1)}{\tau^2} + \frac{(1 - \tilde\beta_1)^2\, T^{2(\nu+\zeta+\omega)+1}}{\tau^2 (\log \tilde\beta_1)^2}.
\]
$F_4$ is bounded similarly after using Jensen:
\[
F_4 \le \sum_{t=1}^{T} \eta_t \tilde\beta_1^t\, \mathbb{E}\Big[ \Big\langle |\nabla F(x^{t-1})|, \frac{|\tilde m_0|}{\sqrt{\tilde v_t} + \tau} \Big\rangle \Big]
\le \sum_{t=1}^{T} \eta_t \tilde\beta_1^t\, d\, G \max_{j\in[d]} \frac{|\tilde m_0|_j}{\sqrt{[\tilde v_t]_j} + \tau} \le O(1).
\]
Bounding $F_5$ and $F_6$ is more complex. We begin by noting that, coordinate-wise,
\[
\mathbb{E}[A_{1,q}] \le \frac{(1 - \tilde\beta_2)\, \tilde\beta_2^q\, G}{\tau^3} \sum_{j\in[d]} \mathbb{E}\big[ [\Delta_{t-q}^2\, |\Delta_r|]_j \big],
\]
so that
\[
F_5 F_6 \le (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r} (1 - \tilde\beta_2) \sum_{q=0}^{t-r} \tilde\beta_2^q\, \frac{Gd}{\tau^3}\, \eta^r_\ell u_r K\, (\eta^{t-q}_\ell u_{t-q} K)^2.
\]
Under the substitution $q \mapsto t - \tilde q$, we have that
\[
F_5 F_6 \lesssim (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r} (1 - \tilde\beta_2)\, \eta^r_\ell u_r\, \frac{t^{2(\nu+\zeta)}}{|\log \tilde\beta_2|}
\lesssim \frac{(1 - \tilde\beta_1)(1 - \tilde\beta_2)}{|\log \tilde\beta_1|\, |\log \tilde\beta_2|} \max\big\{ O(1),\, T^{\omega+3(\nu+\zeta)+1} \big\}.
\]
As $O(1)$ terms are subsumed by $F_4$, $F_5 F_7$ is bounded via
\[
F_5 F_7 \le (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}\, \frac{L\eta^r_\ell u_r K\, \mathrm{diam}(\mathcal{X})}{\tau}
\lesssim \frac{(1 - \tilde\beta_1)\, T^{\omega+\nu+\zeta+1}}{\tau\, |\log \tilde\beta_1|}.
\]
The remaining terms may also be bounded as follows:
\[
F_5 F_8 \le (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}\, \frac{GdK}{\tau}\, \eta^r_\ell d_r
\lesssim \frac{(1 - \tilde\beta_1)}{|\log \tilde\beta_1|} \sum_{t=1}^{T} t^{\omega} t^{\nu+\gamma}
\le \frac{(1 - \tilde\beta_1)}{|\log \tilde\beta_1|} \max\big\{ T^{\omega+\nu+\gamma+1},\, O(1) \big\},
\]
where $F_9$ and $F_{10}$ can be bounded via
\[
F_5 F_9 \le (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}\, \frac{GLK^2 d}{2\tau}\, (\eta^r_\ell)^2 u_r
\lesssim \frac{(1 - \tilde\beta_1)}{|\log \tilde\beta_1|}\, T^{2\nu+\zeta+1+\omega},
\]
\[
F_5 F_{10} \le (1 - \tilde\beta_1) \sum_{t=1}^{T} \sum_{r=1}^{t} \eta_t \tilde\beta_1^{t-r}\, \frac{2^{\alpha-1} GdK (M^\alpha + B^\alpha)}{\tau}\, \eta^r_\ell u_r^{1-\alpha}
\lesssim \frac{(1 - \tilde\beta_1)}{|\log \tilde\beta_1|}\, T^{\omega+\nu+\zeta(1-\alpha)+1}.
\]
Standard calculations imply that under the conditions (13), the dominating terms are $F_7$, $F_{10}$, with order $O(1)$. Within the derived upper bound, $\zeta > 0$ destabilizes $F_7$ and decays $F_{10}$ to 0, while $\zeta < 0$ gives the analogous properties with $F_7$ and $F_{10}$ swapped.

D. Experiment Setup & Full Results

In this section, we present the experimental setups and results across two primary domains: synthetic data and natural language processing tasks.
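Before turning to the setups, the nested structure analyzed in the preceding sections can be summarized in a short, runnable sketch: each node takes local BiClip-SGD steps, and the server applies an Adam-style outer update to the averaged pseudogradient, in the spirit of Algorithms 5–7. This is an illustrative toy under our own assumptions (hyperparameter values, function names, and the quadratic test problem are ours), not the authors' implementation.

```python
import numpy as np

def biclip(u, d, g):
    """Coordinate-wise BiClip: clamp each |g_j| into [d, u], keeping signs."""
    mag = np.abs(g)
    return np.where(mag == 0.0, 0.0, np.sign(g) * np.clip(mag, d, u))

def tailopt_adam(grad_fns, x0, rounds=50, local_steps=4,
                 eta_out=0.1, eta_in=0.01, u=1.0, d=1e-4,
                 beta1=0.9, beta2=0.99, tau=1e-8):
    """Hypothetical TailOPT instantiation: inner BiClip-SGD on each node,
    Adam-style outer update on the averaged pseudogradient (cf. Algorithm 7)."""
    x = x0.copy()
    m = np.zeros_like(x)  # EMA of pseudogradients (first moment)
    v = np.zeros_like(x)  # EMA of squared pseudogradients (second moment)
    N = len(grad_fns)
    for _ in range(rounds):
        delta = np.zeros_like(x)
        for g_fn in grad_fns:          # each node starts from the global model
            xi = x.copy()
            for _ in range(local_steps):
                xi -= eta_in * biclip(u, d, g_fn(xi))
            delta += (xi - x) / N      # pseudogradient: averaged model delta
        m = beta1 * m + (1 - beta1) * delta
        v = beta2 * v + (1 - beta2) * delta ** 2
        x = x + eta_out * m / (np.sqrt(v) + tau)
    return x
```

Note that only the model deltas cross the node boundary: the preconditioner state `m`, `v` lives purely at the outer optimizer, while heavy-tailed gradient spikes are absorbed locally by the upper threshold `u`. Setting `beta1 = beta2 = 0` and removing the preconditioning reduces the outer step to plain averaging of the local updates.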
More precisely, we evaluate the performance of TailOPT instantiations against state-of-the-art benchmarks on convex models (with synthetic data), transformer encoders, as well as generative models. For the convex, synthetic experiments, we construct datasets to emulate heavy-tailed stochastic gradients, focusing on linear regression models trained under contaminated label noise. The design includes generating feature matrices and labels while injecting noise from heavy-tailed distributions to study convergence behaviors. Additionally, we introduce the SynToken dataset, which models the heavy-tailed distribution of token frequencies observed in natural language processing. For brevity, we only include the results of the SynToken dataset, denoted "Synthetic data", in the main text (Figure 1). This allows us to evaluate learning algorithms in controlled settings, isolating and exploring the effects of both common and rare features. To assess the optimization of transformer encoders on natural language processing tasks, we evaluate RoBERTa (Liu et al., 2019) on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), which encompasses a diverse range of tasks such as sentiment analysis, paraphrase detection, and natural language inference. By finetuning RoBERTa on GLUE, we assess its generalization capabilities and robustness. The benchmark's inclusion of multiple datasets ensures a comprehensive evaluation of model performance across various linguistic phenomena. Additionally, we evaluate the capabilities of the T5 (Raffel et al., 2020) generative model on WMT machine translation tasks (Foundation, 2019). These experiments provide insights into the behavior of optimization algorithms and pretrained models under realistic and challenging conditions. For RoBERTa, we optimize over GLUE across 10 simulated compute nodes, whereas for T5, we model 3-compute-node finetuning on WMT benchmark datasets.

D.1. Convex Models

D.1.1.
DATA GENERATION PROCESS

To simulate heavy-tailed stochastic gradients in a simple yet controlled linear regression setting, we generated a synthetic dataset as follows. The feature matrix $X \in \mathbb{R}^{M\times m}$ was constructed with entries drawn independently from a standard normal distribution, $X_{ij} \sim \mathcal{N}(0,1)$. The true weight vector $w_{\mathrm{true}} \in \mathbb{R}^m$ was sampled from $\mathcal{N}(0, I_m)$, where $I_m$ is the $m\times m$ identity matrix. The true labels were computed using $y_{\mathrm{true}} = X w_{\mathrm{true}}$. To induce heavy-tailed stochastic gradients, we injected noise into the label vector by adding a noise term $\xi$, resulting in contaminated labels $\hat y = y_{\mathrm{true}} + \xi$, where $\xi \in \mathbb{R}^M$ is a noise vector with entries drawn independently from a heavy-tailed distribution $\mathcal{D}$. For simplicity, we assume coordinate-wise independence of the noise components. After generating the dataset, we distributed the data across $n = 10$ datacenters in an IID fashion. Notably, the heavy-tailed noise was injected once prior to distribution, and no additional data were generated afterward. This approach ensured that the same contaminated training data are used locally throughout the training process.

D.1.2. LINEAR REGRESSION MODEL

We consider a single-layer neural network without biases, parameterized by $w \in \mathbb{R}^m$, which is equivalent to linear regression. Training is performed using the contaminated labels $(X, \hat y)$ with the mean-squared error (MSE) loss $\mathcal{L}(w) = \frac{1}{2}\|\hat y - Xw\|^2$. The gradient of the loss with respect to $w$ is given by
\[
\nabla_w \mathcal{L}(w) = -X^\top(\hat y - Xw).
\]
Substituting $\hat y = y_{\mathrm{true}} + \xi = X w_{\mathrm{true}} + \xi$, we have
\[
\nabla_w \mathcal{L}(w) = -X^\top(X w_{\mathrm{true}} + \xi - Xw) = -X^\top X(w_{\mathrm{true}} - w) - X^\top \xi.
\]
Simplifying, we obtain
\[
\nabla_w \mathcal{L}(w) = X^\top X(w - w_{\mathrm{true}}) - X^\top \xi.
\]
The term $X^\top \xi$ reflects the influence of the heavy-tailed noise on the gradient. Given that $X$ has Gaussian entries and $\xi$ follows a heavy-tailed distribution, the stochastic gradients $\nabla_w \mathcal{L}(w)$ are also heavy-tailed.

D.1.3.
THE SYNTOKEN DATASET

To model the heavy-tailed nature of token frequencies observed in natural language processing, we created the synthetic SynToken dataset. In natural language, word or token usage often follows a heavy-tailed distribution: a small number of tokens appear very frequently, while a large number of tokens appear infrequently but carry significant contextual information. In our dataset, we partitioned the feature space into common and rare features to reflect this phenomenon. Specifically, we designated the first $p = 10\%$ of the columns of $X$ as common features and the remaining $90\%$ as rare features. The common features were generated by sampling from a Bernoulli distribution with a high probability of success, $X_{\mathrm{common}} \sim \mathrm{Bernoulli}(0.9)$, resulting in features that are frequently active. The rare features were sampled from a Bernoulli distribution with a low probability of success, $X_{\mathrm{rare}} \sim \mathrm{Bernoulli}(0.1)$, introducing sparsity and emulating infrequently occurring tokens. The complete feature matrix $X$ was formed by concatenating $X_{\mathrm{common}}$ and $X_{\mathrm{rare}}$: $X = [X_{\mathrm{common}}, X_{\mathrm{rare}}]$. The weight vector $w$ was sampled from a standard multivariate normal distribution, $w \sim \mathcal{N}(0, I_m)$, consistent with the previous setup. Noise injection was applied to the labels analogously as before. This approach was taken to mimic the key characteristics of tokenization and word embeddings in natural language processing via a minimal yet effective model. One benefit of synthetic datasets is that, by simulating the distribution of common and rare tokens, the SynToken dataset allows us to study the effects of heavy-tailed data distributions on learning algorithms in a controlled setting. Additionally, we note that the problem being studied is $\mu$-strongly convex.

D.2. Synthetic Data Experiments Discussion

Does the heavy-tailed distribution of covariates matter?
Figure 4 (a) and (c) illustrate that a heavy-tailed distribution of token frequencies has significant impacts on the performance of optimization strategies. In (a), RMSProp-BiClip performs competitively under standard tokenization. However, in (c), heavy-tailed tokenization applied to the feature matrix destabilizes RMSProp-BiClip. Interestingly, under tokenized conditions without noise, RMSProp exhibits oscillatory behavior, whereas Adam maintains relative stability. This is consistent with the interpretation of Adam as incorporating an exponentially decaying moving average of the gradient's first moment, which augments optimization stability. Upon noise injection, the best-performing hyperparameters for RMSProp-BiClip do not show oscillatory behavior, but yield a larger distance $\|w - \hat w\|$ than in the case without noise.

Does noise matter? When noise is injected into the labels, the performance dynamics shift considerably. Outer optimizers, adaptive or non-adaptive, combined with inner optimizer SGD perform poorly, which may indicate that inner optimizers should take a focal role in addressing the challenges posed by heavy-tailed noise. While the choice of the outer optimizer may appear to have a limited impact on the binary question of learnability for this specific synthetic data (i.e., "Can the algorithm decrease the distance to the true $w$ or not?"), under tokenized conditions with heavy-tailed noise (Figure 4(d)), outer optimizer Adam demonstrates the best performance. Figure 4 reveals that heavy-tailed noise generally destabilizes all algorithms, including adaptive methods, clipped approaches, and pure SGD (cf. the minimum values in (a) and (c) against (b) and (d)). Notably, coordinate-wise BiClip consistently outperforms $L_2$ clipping, aligning with the results in Table 1.

How far should these results generalize? A word of caution is warranted against overgeneralization.
These results are derived from a simplified regression model, limiting the ability to generalize the observed trends. Nevertheless, the experiments underscore the pronounced effects of heavy-tailed noise in a controlled synthetic environment and highlight the noise-mitigating capabilities of optimizers such as Adam, RMSProp, and BiClip. Additionally, it is important to note that real-world transformer models often comprise tens of millions to billions of parameters.

[Figure 4: four panels plotting log(||w − ŵ||) against log(Outer Node Steps) for (a) the pure non-tokenized task, (b) the noised non-tokenized task, (c) the pure tokenized task, and (d) the noised tokenized task, comparing Avg-SGD, Avg-BiClip (L2), Avg-BiClip, Avg-Adam, Avg-Adagrad, Adagrad-SGD, Adagrad-BiClip, Adam-SGD, Adam-BiClip, RMSProp-BiClip, and Adam2.]

Figure 4: (Top) The results on the non-tokenized synthetic dataset are presented. In the absence of noise injection, Avg-Adam, Avg-SGD, and RMSProp-BiClip demonstrate the most competitive performance. However, under heavy-tailed noise injection, RMSProp-BiClip and Adam-BiClip achieve the highest performance, while Avg-SGD exhibits among the poorest outcomes.
Notably, oscillations observed in Adam-BiClip may reflect the impact of amplified update learning rates in the outer optimizer, potentially enabling finer-grained exploration of the optimization landscape. (Bottom) Tokenization drastically alters algorithmic performance. Without noise, Avg-SGD decays the fastest, while Avg-Adam converges to a superior optimum. However, when synthetic, unbiased heavy-tailed noise is introduced, Avg-SGD becomes highly unstable, whereas Adam-BiClip and RMSProp-BiClip consistently deliver the best results.

D.3. Transformer Encoders

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) serves as a comprehensive framework for evaluating natural language understanding (NLU) models across a diverse range of tasks. By incorporating datasets that span various linguistic challenges, GLUE provides a rigorous testbed for assessing the generalization capabilities of NLP models. RoBERTa is a state-of-the-art transformer-based model designed to enhance the performance of the original BERT architecture through improved pretraining strategies. Proposed by Liu et al. (2019), RoBERTa optimizes BERT by refining its training setup, enabling more robust natural language understanding across diverse tasks.

Table 3: Evaluation results on GLUE benchmark datasets at test time. Metrics: CoLA (Matthews Correlation Coefficient, MCC), SST-2 (Accuracy), MRPC (Accuracy/F1), STS-B (Spearman/Pearson), QQP (Accuracy/F1), MNLI (Accuracy), QNLI (Accuracy), RTE (Accuracy). Entries marked with 0.0 indicate the actual metric value (averaged across the granularity of each datapoint in the baseline dataset), which implies random guessing or failure to learn. The top first, second, and third best-performing algorithms are highlighted.
We note that nested optimization algorithms employing adaptivity or coordinate-wise BiClip on both the inner and outer optimizers generally achieve averaged performance above 80% (out of 100%). For Adam2, preconditioners are transmitted between the inner and outer optimizers, whereas DiLoCo requires maintaining preconditioners on the inner optimizers; both incur significant communication or memory overhead.

Algorithm | MNLI | QNLI | QQP (Acc/F1) | RTE | SST-2 | MRPC (Acc/F1) | CoLA | STS-B (S/P) | Average
Avg-SGD (McMahan et al., 2017) | 81.13 | 83.21 | 78.71/78.69 | 57.40 | 90.94 | 67.30/80.52 | 0.0 | 26.76/28.20 | 61.17
Avg-L2Clip (Yang et al., 2022) | 81.82 | 85.68 | 80.00/79.82 | 54.51 | 91.97 | 68.38/81.22 | 0.0 | 41.27/40.96 | 64.15
Avg-BiClip (L2) | 81.95 | 86.16 | 84.62/79.89 | 55.59 | 92.31 | 68.38/81.23 | 0.0 | 36.93/37.22 | 64.03
Avg-Adagrad | 84.70 | 88.79 | 87.09/83.34 | 64.26 | 93.34 | 71.56/82.63 | 27.72 | 81.93/81.26 | 76.97
Avg-Adam | 84.97 | 89.47 | 87.66/84.09 | 64.62 | 93.80 | 81.86/87.74 | 41.41 | 86.21/86.55 | 80.76
Avg-BiClip | 85.08 | 89.45 | 87.83/84.12 | 66.06 | 94.03 | 71.32/82.45 | 41.40 | 84.08/84.48 | 79.12
Bi2Clip (L2) | 84.31 | 89.20 | 86.36/82.60 | 72.20 | 93.34 | 86.52/90.23 | 60.02 | 82.41/83.00 | 82.74
Adagrad-SGD (Reddi et al., 2021) | 82.40 | 86.61 | 82.51/77.68 | 71.48 | 92.08 | 85.53/89.52 | 47.80 | 40.37/42.24 | 72.69
RMSProp-SGD (Reddi et al., 2021) | 84.20 | 88.46 | 87.12/83.30 | 72.56 | 91.85 | 85.50/89.17 | 52.39 | 45.72/41.80 | 74.73
Adam-SGD (Reddi et al., 2021) | 82.93 | 86.98 | 85.99/80.87 | 66.78 | 90.71 | 87.01/90.09 | 49.93 | 44.48/41.26 | 73.37
Adam-L2Clip | 82.54 | 86.69 | 85.88/80.72 | 59.92 | 89.67 | 85.29/89.90 | 48.54 | 69.19/67.16 | 76.86
Adagrad-BiClip | 85.54 | 90.02 | 88.60/85.05 | 73.36 | 93.23 | 85.78/89.86 | 48.87 | 84.03/85.90 | 82.75
RMSProp-BiClip | 85.56 | 89.82 | 88.50/84.44 | 70.75 | 93.69 | 84.80/88.92 | 50.99 | 87.65/87.79 | 82.99
Adam-BiClip | 84.26 | 89.20 | 88.64/84.74 | 69.67 | 92.43 | 86.52/90.09 | 56.12 | 82.83/79.71 | 82.20
Adam-BiClip (L2) | 83.18 | 86.47 | 85.63/80.27 | 67.50 | 89.56 | 86.02/89.65 | 53.17 | 74.73/73.48 | 79.06
Adam2 (Wang et al., 2021b) | 85.11 | 88.87 | 89.04/85.51 | 71.48 | 92.66 | 87.50/91.03 | 52.70 | 84.47/83.82 | 82.93
DiLoCo (Douillard et al., 2024) | 85.68 | 89.87 | 88.78/85.19 | 67.87 | 91.89 | 87.99/91.20 | 54.77 | 85.93/84.76 | 83.08
Bi2Clip | 85.06 | 89.73 | 84.93/83.97 | 76.53 | 93.80 | 89.21/92.44 | 60.08 | 87.07/86.89 | 84.52

D.4. Generative Models

We additionally evaluate our method using T5 (Raffel et al., 2020), a state-of-the-art text-to-text transformer model developed by Google Research. T5 unifies natural language processing tasks under a text-to-text framework, where both inputs and outputs are text strings, making it highly versatile across tasks such as summarization, translation, and classification. The model was pretrained on the Colossal Clean Crawled Corpus (C4) using a span-corruption objective and is available in multiple sizes, ranging from T5-Small (60M parameters) to T5-XXL (11B parameters). This unified framework and scalability allow T5 to excel across a wide range of tasks, making it a strong baseline for evaluating our proposed method. To evaluate machine translation, we utilize the WMT datasets, a widely recognized benchmark for translation research (Foundation, 2019). Specifically, we finetune T5 on the TED Talks and News Commentary datasets. The TED Talks dataset, originally sourced from IWSLT 2017 (Cettolo et al., 2017), provides multilingual translations of TED Talk transcripts, offering diverse linguistic and domain-specific challenges. In contrast, the News Commentary dataset contains parallel text derived from news articles in various languages, presenting a more formal and structured domain. These datasets represent distinct styles and linguistic features, providing a rigorous evaluation of algorithmic agility in optimizing across domains and tasks.

Table 4: Evaluation results on machine translation benchmarks. Metrics reported are BLEU and METEOR scores for various language pairs across the TED Talks and News Commentary datasets. The final column reports the average score across all metrics for each algorithm.
Algorithm | TED Talks (en-de) BLEU/METEOR | TED Talks (en-fr) BLEU/METEOR | News Commentary (en-fr) BLEU/METEOR | Average
Avg-SGD | 28.02/58.52 | 27.48/54.67 | 30.07/54.13 | 42.15
Avg-L2Clip | 28.99/58.94 | 29.66/57.40 | 31.02/56.73 | 43.79
Bi2Clip | 29.41/59.18 | 30.70/58.13 | 31.79/57.69 | 44.48
Adam2 | 28.06/58.05 | 30.94/57.48 | 30.97/55.85 | 43.56

D.5. Performance under Non-IID Data

D.5.1. CUSTOM SHAKESPEARE DATASET

Though not the main focus of this work, in this section we briefly evaluate the performance of Tail OPT and baselines under non-datacenter, distributed environments. We utilized the LEAF repository (Caldas et al., 2018), originally a benchmark suite for federated learning, which provides datasets, tools, and baselines to evaluate algorithms under real-world conditions. LEAF emphasizes non-IID data distributions, enabling the study of federated systems where data is naturally heterogeneous across smaller compute nodes. Among the datasets in LEAF, we modified the Shakespeare dataset, originally designed for next-character prediction, where each user now represented a character from Shakespeare's works. After preprocessing, the dataset contained 1144 inner compute nodes, each corresponding to a character's dialogue, with substantial variations in sample sizes, vocabulary, and syntax across compute nodes. This structure mirrors the imbalanced, domain-specific data distributions often encountered in federated learning. We modify the Shakespeare dataset by redefining the prediction task from next-character prediction to next-token prediction.

Table 5: Perplexity scores on the Federated Shakespeare Next Word Prediction Task at a 0.1% participation rate, for DistilGPT-2 architecture finetuning after 3 communication rounds.

Algorithm | Avg-SGD | Avg-L2Clip | Avg-BiClip | RMSProp-BiClip | Bi2Clip | Adam2
Perplexity Score | 1.9813 | 2.0126 | 1.7827 | 2.0054 | 1.9112 | 1.9445

D.5.2. CUSTOM PHILOSOPHER DATASET

To mitigate potential data leakage, we constructed a custom dataset, termed the Philosopher Dataset, to evaluate the non-IID setting and to facilitate training from scratch. The Philosopher Dataset was synthesized by allocating each literary work to one of eight compute nodes, followed by an 80-20 train-test split. These texts were open sourced from Project Gutenberg³, an extensive online repository offering over 75,000 classic or traditional books while strictly adhering to copyright protections.

Table 6: Composition of the Philosopher Dataset.

Title | Author | Translator
The Critique of Pure Reason | Immanuel Kant | J. Meiklejohn
The Collected Works of William Hazlitt, Volume One | William Hazlitt | -
The Works of Jane Austen | Jane Austen | -
The Republic | Plato | Benjamin Jowett
War and Peace | Leo Tolstoy | -
The Federalist Papers | Alexander Hamilton, John Jay, James Madison | -
The Count of Monte Cristo | Alexandre Dumas | -
The Brothers Karamazov | Fyodor Dostoevsky | Constance Garnett

We instantiated a shallower GPT-2 architecture comprising 2 layers, an embedding dimension of 256, and 4 attention heads. This model was trained from scratch on the Philosopher Dataset. The training results are summarized in Table 7.

Table 7: Perplexity scores on the Philosopher Next Word Prediction Task at a 100% participation rate for the compressed GPT-2 architecture after 3 communication rounds.

Algorithm | Avg-SGD | Avg-L2Clip | Avg-BiClip | RMSProp-BiClip | Bi2Clip | Adam2
Perplexity Score | 2.6361 | 2.1183 | 1.6266 | 1.7983 | 2.3488 | 2.5861

Discussion. In the synthesized non-IID setting, we observe that algorithmic instantiations employing joint adaptivity or adaptive approximations, i.e., incorporating adaptivity or its efficient approximations at both the inner and outer optimizers, tend to underperform slightly.
This aligns with the theoretical intuition that highly sensitive, rapidly adapting optimizers are more susceptible to unmitigated client drift, effectively overfitting to the biases of local data shards at the inner optimizers. However, Avg-BiClip, which integrates a clipping mechanism to regulate noise variance and stabilize optimization dynamics, exhibits notably robust performance. In particular, Avg-BiClip achieves the strongest results in settings with high data heterogeneity across compute nodes, suggesting that BiClip mitigates not only noise variance but also client drift. We further compare these findings to results on the synthetic dataset (Appendix D.1), where noise-injected data were distributed IID across nodes, in contrast with the Shakespeare and Philosopher datasets. We note that the perplexities obtained are lower than those achieved on larger text datasets, such as WikiText-103 or large-scale Common Crawl subsets (e.g., DistilGPT-2 reportedly achieves a perplexity of around 16 on the WikiText-103 benchmark, a long-term dependency language modeling dataset)⁴. This arises from the smaller size of the Shakespeare and Philosopher datasets in comparison to larger benchmarks. We provide the optimal hyperparameters for the non-IID experiments in Table 15.

³ https://www.gutenberg.org/

D.6. Gradient Distributions

Figure 5 highlights how gradient distributions can be distinctly altered by adaptive or clipping operations.
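For concreteness, the coordinate-wise BiClip operation referenced throughout this appendix can be sketched as follows. This is an illustrative NumPy reconstruction under our notation (upper threshold u, lower threshold d), not a verbatim excerpt of the training code; the function name and the treatment of exact-zero coordinates are our own simplifications.

```python
import numpy as np

def biclip(g: np.ndarray, u: float, d: float) -> np.ndarray:
    """Clip each gradient coordinate's magnitude into [d, u], keeping its sign.

    Coordinates larger than u in magnitude are scaled down to u, and nonzero
    coordinates smaller than d in magnitude are raised up to d; exact zeros
    are left untouched (an illustrative convention).
    """
    mag = np.abs(g)
    clipped = np.clip(mag, d, u)
    return np.where(mag == 0.0, 0.0, np.sign(g) * clipped)

# A heavy-tailed spike (5.0) is attenuated and a vanishing coordinate
# (-1e-5) is revived, while moderate coordinates pass through unchanged.
g = np.array([5.0, -1e-5, 0.2, 0.0])
print(biclip(g, u=1.5, d=0.001))
```

Applying this operator at both the inner and outer optimizers, with separate (i_u, i_d) and (o_u, o_d) thresholds, corresponds to the Bi2Clip instantiation.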
[Figure 5: a grid of histograms of gradient L2 norms (log frequency), one row per algorithm and one column per communication round.]

Figure 5: Gradient statistics for MNLI across different algorithms for the first 5 communication rounds, where rounds increase from left to right. (Top) We visualize local minibatch stochastic gradient distributions, where outliers can dominate model updates upon outer aggregation. The BiClip and Adam optimizers mitigate this phenomenon. (Middle) Row 2 displays the local gradients accumulated from all inner optimizers during Bi2Clip prior to clipping, which reveals outliers akin to those visible in Avg-SGD. In Row 3, the identical gradients are plotted after applying the coordinate-wise BiClip operation. We observe that BiClip stabilizes updates by rescaling large and small gradient coordinates, constraining model update lengths within a defined range.
(Bottom) Similarly to the above, Row 4 shows the accumulated gradient lengths across all inner optimizers while training via Adam2. Optimal inner optimizer learning rates are 0.0059, 0.5, and 1.8e-5 for Avg-SGD, Bi2Clip, and Adam2, respectively, with corresponding outer optimizer learning rates of 1 and 3.2e-4 for the latter two algorithms. Test-time results show that Bi2Clip outperforms Adam2, which in turn outperforms Avg-SGD (Table 1). Finally, we note that upon centering, the aggregate update gradient histograms in red depict the stochastic gradient noise distributions after application of the optimizer strategy. BiClip attenuates the pure gradient noise (in blue) by projecting the noise distribution onto an almost bell-shaped curve (in red), while Adam implicitly samples gradient noise from a skewed distribution.

⁴ https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation

D.7. Hyperparameter Sweep Grids

The sweep grids in Tables 8 and 9 were determined by first performing a coarser sweep over an approximate grid, then localizing near the discovered well-performing hyperparameters.

Table 8: Hyperparameter sweeps over gradient clipping parameters. i_u, i_d = inner optimizer u, d; o_u, o_d = outer optimizer u, d.
Algorithm | i_u | i_d | o_u | o_d
Avg-SGD | - | - | - | -
Avg-L2Clip | np.linspace(1e-4, 1.5, 12) | 0.0 | - | -
Avg-BiClip | np.linspace(1e-4, 1.5, 4) | np.linspace(1e-7, i_u, 4) | - | -
Avg-BiClip (L2) | np.linspace(1e-4, 1.5, 4) | np.linspace(1e-7, i_u, 4) | - | -
Avg-Adagrad | - | - | - | -
Avg-Adam | - | - | - | -
Adagrad-SGD | - | - | - | -
RMSProp-SGD | - | - | - | -
Adam-SGD | - | - | - | -
Adagrad-BiClip | np.linspace(1e-4, 1.5, 3) | np.linspace(1e-7, i_u, 3) | - | -
RMSProp-BiClip | np.linspace(1e-4, 1.5, 3) | np.linspace(1e-7, i_u, 3) | - | -
Adam-L2Clip | np.linspace(1e-4, 1.5, 12) | 0.0 | - | -
Adam-BiClip | np.logspace(-2, 1, 5) | np.linspace(1e-7, i_u, 3) | - | -
Adam-BiClip (L2) | np.linspace(1e-4, 1.5, 3) | np.linspace(1e-7, i_u, 3) | - | -
Adam2 | - | - | - | -
Bi2Clip (Coordinate-wise) | np.linspace(1e-4, 1.5, 3) | np.linspace(1e-7, i_u, 3) | np.linspace(1e-4, 1.5, 3) | np.linspace(1e-7, o_u, 3)
Bi2Clip (L2) | np.logspace(-1, 0.5, 3) | np.linspace(1e-7, i_u, 3) | np.logspace(-1, 0.5, 3) | np.linspace(1e-7, o_u, 3)
DiLoCo | - | - | - | -

Table 9: Hyperparameter sweeps. ilr = inner optimizer learning rate, olr = outer optimizer learning rate, ieps = inner optimizer ε, oeps = outer optimizer ε. Additionally, DiLoCo swept over the Nesterov learning rates (0.9, 0.95) and inner optimizer weight decay parameters (1e-1, 1e-4), as reported in prior works.
Algorithm | ilr | olr | ieps | oeps
Avg-SGD | np.logspace(-9, 1, 100) | - | - | -
Avg-L2Clip | np.linspace(1e-9, 1, 10) | - | - | -
Avg-BiClip | np.linspace(1e-9, 1, 10) | - | - | -
Avg-BiClip (L2) | np.linspace(1e-9, 1, 10) | - | - | -
Avg-Adagrad | np.linspace(1e-9, 1, 30) | - | {1e-8, 1e-6, 1e-4, 1e-3} | -
Avg-Adam | np.linspace(1e-9, 1, 30) | - | {1e-8, 1e-6, 1e-4, 1e-3} | -
Adagrad-SGD | np.linspace(1e-5, 0.1, 7) | np.logspace(-5, 1, 7) | - | {1e-7, 1e-5, 1e-3}
RMSProp-SGD | np.linspace(1e-5, 0.1, 7) | np.linspace(1e-5, 0.1, 7) | - | {1e-7, 1e-5, 1e-3}
Adam-SGD | np.linspace(1e-5, 0.1, 7) | np.logspace(-5, 1, 7) | - | {1e-7, 1e-5, 1e-3}
Adagrad-BiClip | np.linspace(1e-5, 0.1, 4) | np.logspace(-5, 1, 4) | - | {1e-7, 1e-5, 1e-3}
RMSProp-BiClip | np.linspace(1e-5, 0.1, 4) | np.logspace(-5, 1, 4) | - | {1e-7, 1e-5, 1e-3}
Adam-L2Clip | np.linspace(1e-5, 0.1, 4) | np.linspace(1e-5, 0.1, 4) | - | {1e-7, 1e-5, 1e-3}
Adam-BiClip | np.logspace(-6, 1, 5) | np.logspace(-6, 1, 5) | - | {1e-7, 1e-5, 1e-3}
Adam-BiClip (L2) | np.linspace(1e-5, 0.1, 4) | np.linspace(1e-5, 0.1, 4) | - | {1e-7, 1e-5, 1e-3}
Adam2 | np.logspace(-6, 1, 5) | np.logspace(-6, 1, 5) | {1e-7, 1e-5, 1e-3} | {1e-7, 1e-5, 1e-3}
Bi2Clip (Coordinate-wise) | np.linspace(1e-9, 1, 3) | np.linspace(1e-9, 1, 3) | - | -
Bi2Clip (L2) | np.logspace(-1, 0.5, 3) | np.logspace(-1, 0.5, 3) | - | -
DiLoCo | np.logspace(-5, 1, 5) | {1, 0.7, 0.5, 1e-1, 1e-2} | - | {1e-7, 1e-5, 1e-3}

D.8. Optimal Hyperparameters

In this subsection, we report the optimal hyperparameters identified during our extensive sweep.

Table 10: Best hyperparameter selection over a sweep of various parameter grids. ilr = inner optimizer learning rate, olr = outer optimizer learning rate, ieps = inner optimizer ε, oeps = outer optimizer ε; o_u, o_d = outer optimizer u, d; i_u, i_d = inner optimizer u, d. Here, ε is the adaptivity parameter in the denominator of adaptive optimizers that enhances the stability of learning dynamics.
Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Avg-SGD | STS-B | 0.019 | - | - | - | - | - | - | -
Avg-SGD | RTE | 0.095 | - | - | - | - | - | - | -
Avg-SGD | QNLI | 0.0059 | - | - | - | - | - | - | -
Avg-SGD | QQP | 0.0074 | - | - | - | - | - | - | -
Avg-SGD | CoLA | 0.019 | - | - | - | - | - | - | -
Avg-SGD | SST-2 | 0.0074 | - | - | - | - | - | - | -
Avg-SGD | MRPC | 0.038 | - | - | - | - | - | - | -
Avg-SGD | MNLI | 0.0059 | - | - | - | - | - | - | -
Avg-L2Clip | STS-B | 0.56 | - | - | - | - | - | 1.5 | 0.0
Avg-L2Clip | RTE | 1 | - | - | - | - | - | 0.14 | 0.0
Avg-L2Clip | QNLI | 0.33 | - | - | - | - | - | 0.14 | 0.0
Avg-L2Clip | QQP | 0.44 | - | - | - | - | - | 0.14 | 0.0
Avg-L2Clip | CoLA | 0.33 | - | - | - | - | - | 0.14 | 0.0
Avg-L2Clip | SST-2 | 0.11 | - | - | - | - | - | 0.27 | 0.0
Avg-L2Clip | MRPC | 0.22 | - | - | - | - | - | 0.41 | 0.0
Avg-L2Clip | MNLI | 0.11 | - | - | - | - | - | 0.41 | 0.0
Avg-BiClip | STS-B | 0.44 | - | - | - | - | - | 0.0001 | 0.0001
Avg-BiClip | RTE | 1 | - | - | - | - | - | 0.0001 | 6.7e-5
Avg-BiClip | QNLI | 0.44 | - | - | - | - | - | 0.0001 | 6.7e-5
Avg-BiClip | QQP | 0.56 | - | - | - | - | - | 0.0001 | 3.3e-5
Avg-BiClip | CoLA | 0.89 | - | - | - | - | - | 0.0001 | 0.0001
Avg-BiClip | SST-2 | 0.56 | - | - | - | - | - | 0.0001 | 6.7e-5
Avg-BiClip | MRPC | 0.89 | - | - | - | - | - | 0.0001 | 6.7e-5
Avg-BiClip | MNLI | 0.56 | - | - | - | - | - | 0.0001 | 3.3e-5
Avg-BiClip (L2) | STS-B | 0.067 | - | - | - | - | - | 0.75 | 0.75
Avg-BiClip (L2) | RTE | 1 | - | - | - | - | - | 0.0001 | 6.7e-5
Avg-BiClip (L2) | QNLI | 0.067 | - | - | - | - | - | 0.75 | 0.75
Avg-BiClip (L2) | QQP | 0.11 | - | - | - | - | - | 0.5 | 0.33
Avg-BiClip (L2) | CoLA | 0.067 | - | - | - | - | - | 0.75 | 0.75
Avg-BiClip (L2) | SST-2 | 0.1 | - | - | - | - | - | 0.75 | 0.38
Avg-BiClip (L2) | MRPC | 0.11 | - | - | - | - | - | 1 | 1
Avg-BiClip (L2) | MNLI | 0.033 | - | - | - | - | - | 1.5 | 1.5
Bi2Clip | STS-B | 0.5 | 0.5 | - | - | 0.0001 | 0.0001 | 0.0001 | 1e-7
Bi2Clip | RTE | 1 | 1 | - | - | 0.0001 | 0.0001 | 0.001 | 5e-5
Bi2Clip | QNLI | 0.5 | 1 | - | - | 0.0001 | 0.0001 | 0.0001 | 5e-5
Bi2Clip | QQP | 0.5 | 1 | - | - | 1.5 | 1e-7 | 0.0001 | 5e-5
Bi2Clip | CoLA | 0.5 | 1 | - | - | 0.0001 | 0.0001 | 0.0001 | 0.0001
Bi2Clip | SST-2 | 0.5 | 1 | - | - | 0.75 | 1e-7 | 0.0001 | 1e-7
Bi2Clip | MRPC | 1 | 1 | - | - | 0.0001 | 0.0001 | 0.0001 | 1e-7
Bi2Clip | MNLI | 0.5 | 1 | - | - | 0.75 | 1e-7 | 0.0001 | 1e-7
Bi2Clip (L2) | STS-B | 0.56 | 3.2 | - | - | 0.1 | 0.05 | 0.1 | 0.05
Bi2Clip (L2) | RTE | 0.1 | 0.56 | - | - | 0.1 | 0.1 | 0.56 | 0.56
Bi2Clip (L2) | QNLI | 0.1 | 0.1 | - | - | 3.2 | 3.2 | 0.56 | 1e-7
Bi2Clip (L2) | QQP | 0.1 | 3.2 | - | - | 0.56 | 1e-7 | 0.56 | 0.56
Bi2Clip (L2) | CoLA | 0.1 | 3.2 | - | - | 0.1 | 0.05 | 0.56 | 1e-7
Bi2Clip (L2) | SST-2 | 0.56 | 0.1 | - | - | 3.2 | 3.2 | 0.1 | 1e-7
Bi2Clip (L2) | MRPC | 0.56 | 0.1 | - | - | 0.56 | 0.56 | 0.1 | 0.1
Bi2Clip (L2) | MNLI | 0.1 | 0.56 | - | - | 3.2 | 1.6 | 0.56 | 1e-7

Table 11: Best hyperparameter selection over a sweep of various parameter grids. ilr = inner optimizer learning rate, olr = outer optimizer learning rate, ieps = inner optimizer ε, oeps = outer optimizer ε; o_u, o_d = outer optimizer u, d; i_u, i_d = inner optimizer u, d.
Here, ε is the adaptivity or ε-smoothing parameter employed in the denominator of adaptive optimizers to enhance the stability of learning dynamics.

Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Adam-SGD | STS-B | 0.017 | 4.6e-5 | - | 1e-7 | - | - | - | -
Adam-SGD | RTE | 0.033 | 4.6e-5 | - | 1e-7 | - | - | - | -
Adam-SGD | QNLI | 0.017 | 2.2e-4 | - | 1e-7 | - | - | - | -
Adam-SGD | QQP | 0.017 | 2.2e-4 | - | 1e-7 | - | - | - | -
Adam-SGD | CoLA | 0.033 | 0.001 | - | 1e-5 | - | - | - | -
Adam-SGD | SST-2 | 0.017 | 4.6e-5 | - | 1e-7 | - | - | - | -
Adam-SGD | MRPC | 0.017 | 4.6e-5 | - | 1e-7 | - | - | - | -
Adam-SGD | MNLI | 0.017 | 2.2e-4 | - | 1e-7 | - | - | - | -
Adam-L2Clip | STS-B | 0.067 | 0.033 | - | 0.001 | - | - | 0.75 | 0.0
Adam-L2Clip | RTE | 0.033 | 1e-5 | - | 1e-7 | - | - | 1.5 | 0.0
Adam-L2Clip | QNLI | 0.067 | 0.067 | - | 0.001 | - | - | 0.75 | 0.0
Adam-L2Clip | QQP | 0.067 | 0.033 | - | 0.001 | - | - | 1.5 | 0.0
Adam-L2Clip | CoLA | 0.1 | 0.033 | - | 0.001 | - | - | 0.75 | 0.0
Adam-L2Clip | SST-2 | 0.1 | 0.033 | - | 0.001 | - | - | 1.5 | 0.0
Adam-L2Clip | MRPC | 0.033 | 0.033 | - | 0.001 | - | - | 0.75 | 0.0
Adam-L2Clip | MNLI | 0.067 | 0.033 | - | 0.001 | - | - | 0.75 | 0.0
Adam-BiClip | STS-B | 0.0056 | 3.2e-4 | - | 1e-5 | - | - | 0.01 | 0.0067
Adam-BiClip | RTE | 3.2e-4 | 1.8e-5 | - | 1e-7 | - | - | 0.01 | 0.0067
Adam-BiClip | QNLI | 0.0056 | 3.2e-4 | - | 1e-7 | - | - | 0.01 | 0.0067
Adam-BiClip | QQP | 0.0056 | 3.2e-4 | - | 1e-7 | - | - | 0.01 | 0.0033
Adam-BiClip | CoLA | 0.0056 | 1.8e-5 | - | 1e-7 | - | - | 0.01 | 0.01
Adam-BiClip | SST-2 | 0.0056 | 1.8e-5 | - | 1e-7 | - | - | 0.01 | 0.0067
Adam-BiClip | MRPC | 0.0056 | 0.0056 | - | 0.001 | - | - | 0.056 | 0.019
Adam-BiClip | MNLI | 0.0056 | 3.2e-4 | - | 1e-5 | - | - | 0.01 | 0.0033
Adam-BiClip (L2) | STS-B | 0.033 | 0.033 | - | 0.001 | - | - | 1.5 | 0.75
Adam-BiClip (L2) | RTE | 0.033 | 0.067 | - | 0.001 | - | - | 0.75 | 0.38
Adam-BiClip (L2) | QNLI | 0.033 | 0.067 | - | 0.001 | - | - | 1.5 | 0.75
Adam-BiClip (L2) | QQP | 0.067 | 0.033 | - | 0.0001 | - | - | 0.75 | 0.38
Adam-BiClip (L2) | CoLA | 0.033 | 0.033 | - | 0.001 | - | - | 1.5 | 0.75
Adam-BiClip (L2) | SST-2 | 0.067 | 0.033 | - | 0.001 | - | - | 1.5 | 1e-7
Adam-BiClip (L2) | MRPC | 0.033 | 0.033 | - | 0.001 | - | - | 1.5 | 1e-7
Adam-BiClip (L2) | MNLI | 0.067 | 0.033 | - | 0.001 | - | - | 1.5 | 0.75
Adam2 | STS-B | 1.8e-5 | 1.8e-5 | 1e-5 | 1e-7 | - | - | - | -
Adam2 | RTE | 1.8e-5 | 1.8e-5 | 1e-5 | 1e-7 | - | - | - | -
Adam2 | QNLI | 1.8e-5 | 3.2e-4 | 1e-5 | 1e-5 | - | - | - | -
Adam2 | QQP | 1.8e-5 | 3.2e-4 | 1e-5 | 1e-7 | - | - | - | -
Adam2 | CoLA | 1.8e-5 | 0.0056 | 1e-5 | 0.001 | - | - | - | -
Adam2 | SST-2 | 1.8e-5 | 1.8e-5 | 0.001 | 1e-7 | - | - | - | -
Adam2 | MRPC | 1.8e-5 | 1.8e-5 | 1e-5 | 1e-7 | - | - | - | -
Adam2 | MNLI | 1.8e-5 | 3.2e-4 | 1e-5 | 1e-7 | - | - | - | -

Table 12: The notational setup is analogous to Table 11. For DiLoCo, we provide the Nesterov learning rate and the weight decay parameter in the i_u and i_d entries, respectively.
Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Adagrad-SGD | STS-B | 0.017 | 0.0046 | - | 0.001 | - | - | - | -
Adagrad-SGD | RTE | 0.033 | 0.001 | - | 1e-5 | - | - | - | -
Adagrad-SGD | QNLI | 0.017 | 0.001 | - | 1e-5 | - | - | - | -
Adagrad-SGD | QQP | 0.017 | 0.0001 | - | 1e-5 | - | - | - | -
Adagrad-SGD | CoLA | 0.017 | 2.2e-4 | - | 1e-7 | - | - | - | -
Adagrad-SGD | SST-2 | 0.017 | 2.2e-4 | - | 1e-5 | - | - | - | -
Adagrad-SGD | MRPC | 0.017 | 2.2e-4 | - | 1e-7 | - | - | - | -
Adagrad-SGD | MNLI | 0.017 | 0.0001 | - | 1e-7 | - | - | - | -
RMSProp-SGD | STS-B | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | RTE | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | QNLI | 0.033 | 0.001 | - | 1e-5 | - | - | - | -
RMSProp-SGD | QQP | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | CoLA | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | SST-2 | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | MRPC | 0.033 | 1e-5 | - | 1e-7 | - | - | - | -
RMSProp-SGD | MNLI | 0.017 | 1e-5 | - | 1e-7 | - | - | - | -
Adagrad-BiClip | STS-B | 1e-5 | 2.2e-4 | - | 1e-7 | - | - | 1.5 | 1.5
Adagrad-BiClip | RTE | 0.033 | 2.2e-4 | - | 1e-7 | - | - | 1.5 | 1e-7
Adagrad-BiClip | QNLI | 1e-5 | 0.0046 | - | 0.001 | - | - | 1.5 | 1.5
Adagrad-BiClip | QQP | 1e-5 | 0.0046 | - | 0.0001 | - | - | 1.5 | 1.5
Adagrad-BiClip | CoLA | 0.1 | 2.2e-4 | - | 1e-7 | - | - | 0.0001 | 5e-5
Adagrad-BiClip | SST-2 | 1e-5 | 0.0046 | - | 0.001 | - | - | 1.5 | 1.5
Adagrad-BiClip | MRPC | 1e-5 | 2.2e-4 | - | 1e-7 | - | - | 1.5 | 0.75
Adagrad-BiClip | MNLI | 1e-5 | 0.0046 | - | 0.001 | - | - | 1.5 | 1.5
RMSProp-BiClip | STS-B | 1e-5 | 1e-5 | - | 1e-7 | - | - | 1.5 | 1.5
RMSProp-BiClip | RTE | 0.067 | 1e-5 | - | 1e-7 | - | - | 0.0001 | 5e-5
RMSProp-BiClip | QNLI | 0.1 | 1e-5 | - | 1e-7 | - | - | 0.0001 | 0.0001
RMSProp-BiClip | QQP | 0.1 | 0.0046 | - | 1e-7 | - | - | 0.0001 | 5e-5
RMSProp-BiClip | CoLA | 0.1 | 0.0046 | - | 0.001 | - | - | 0.0001 | 1e-7
RMSProp-BiClip | SST-2 | 0.1 | 1e-5 | - | 1e-7 | - | - | 0.0001 | 0.0001
RMSProp-BiClip | MRPC | 1e-5 | 0.0046 | - | 0.001 | - | - | 0.75 | 0.75
RMSProp-BiClip | MNLI | 0.1 | 0.0046 | - | 0.001 | - | - | 0.0001 | 0.0001
DiLoCo | STS-B | 1.8e-5 | 0.7 | 1e-5 | - | - | - | 0.9 | 0.1
DiLoCo | RTE | 1.8e-5 | 1 | 1e-5 | - | - | - | 0.95 | 0.0001
DiLoCo | QNLI | 1.8e-5 | 1 | 1e-5 | - | - | - | 0.9 | 0.0001
DiLoCo | QQP | 1.8e-5 | 1 | 1e-5 | - | - | - | 0.95 | 0.0001
DiLoCo | CoLA | 1.8e-5 | 1 | 1e-5 | - | - | - | 0.95 | 0.1
DiLoCo | SST-2 | 1.8e-5 | 0.1 | 1e-5 | - | - | - | 0.9 | 0.0001
DiLoCo | MRPC | 1.8e-5 | 0.7 | 1e-5 | - | - | - | 0.9 | 0.1
DiLoCo | MNLI | 1.8e-5 | 1 | 1e-5 | - | - | - | 0.9 | 0.1

Table 13: Best hyperparameter selection over a sweep of various parameter grids for GLUE tasks. The notation is analogous to Table 11.
Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Avg-Adagrad | STS-B | 3e-5 | - | 1e-8 | - | - | - | - | -
Avg-Adagrad | RTE | 1.5e-4 | - | 1e-6 | - | - | - | - | -
Avg-Adagrad | QNLI | 3.3e-4 | - | 0.001 | - | - | - | - | -
Avg-Adagrad | QQP | 3.3e-4 | - | 0.001 | - | - | - | - | -
Avg-Adagrad | CoLA | 6.7e-5 | - | 1e-6 | - | - | - | - | -
Avg-Adagrad | SST-2 | 3.3e-4 | - | 0.001 | - | - | - | - | -
Avg-Adagrad | MRPC | 1.5e-4 | - | 1e-6 | - | - | - | - | -
Avg-Adagrad | MNLI | 3.3e-4 | - | 0.001 | - | - | - | - | -
Avg-Adam | STS-B | 1.4e-5 | - | 1e-6 | - | - | - | - | -
Avg-Adam | RTE | 3e-5 | - | 1e-8 | - | - | - | - | -
Avg-Adam | QNLI | 6.2e-6 | - | 1e-8 | - | - | - | - | -
Avg-Adam | QQP | 1.4e-5 | - | 1e-8 | - | - | - | - | -
Avg-Adam | CoLA | 6.2e-6 | - | 1e-8 | - | - | - | - | -
Avg-Adam | SST-2 | 6.2e-6 | - | 1e-8 | - | - | - | - | -
Avg-Adam | MRPC | 3e-5 | - | 1e-8 | - | - | - | - | -
Avg-Adam | MNLI | 3e-5 | - | 0.0001 | - | - | - | - | -

Table 14: Best hyperparameter selection over a sweep of various parameter grids for WMT. The conventions are identical to Tables 11-13.

Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Avg-SGD | TED-T (en-de) | 0.03 | - | - | - | - | - | - | -
Avg-SGD | TED-T (en-fr) | 0.015 | - | - | - | - | - | - | -
Avg-SGD | News Comm (en-fr) | 0.015 | - | - | - | - | - | - | -
Avg-L2Clip | TED-T (en-de) | 0.89 | - | - | - | - | - | 1.4 | 0.0
Avg-L2Clip | TED-T (en-fr) | 0.89 | - | - | - | - | - | 0.55 | 0.0
Avg-L2Clip | News Comm (en-fr) | 0.78 | - | - | - | - | - | 0.41 | 0.0
Bi2Clip | TED-T (en-de) | 1 | 1 | - | - | 0.0001 | 0.0001 | 0.75 | 1e-7
Bi2Clip | TED-T (en-fr) | 1 | 1 | - | - | 0.0001 | 0.0001 | 0.75 | 1e-7
Bi2Clip | News Comm (en-fr) | 0.5 | 1 | - | - | 1.5 | 1e-7 | 0.0001 | 5e-5
Adam2 | TED-T (en-de) | 3.2e-4 | 0.0056 | 1e-7 | 0.001 | - | - | - | -
Adam2 | TED-T (en-fr) | 1.8e-5 | 1.8e-5 | 1e-5 | 1e-7 | - | - | - | -
Adam2 | News Comm (en-fr) | 3.2e-4 | 0.0056 | 1e-5 | 0.001 | - | - | - | -

Table 15: Best hyperparameter selection over a sweep of various parameter grids. The conventions are identical to Tables 11-14.
Algorithm | Dataset | ilr | olr | ieps | oeps | o_u | o_d | i_u | i_d
Avg-SGD | Shakespeare | 0.012 | - | - | - | - | - | - | -
Avg-SGD | Philosopher | 0.15 | - | - | - | - | - | - | -
Avg-L2Clip | Shakespeare | 0.56 | - | - | - | - | - | 0.55 | 0
Avg-L2Clip | Philosopher | 1 | - | - | - | - | - | 0.41 | 0
Avg-BiClip | Shakespeare | 1 | - | - | - | - | - | 0.0001 | 3.3e-5
Avg-BiClip | Philosopher | 1 | - | - | - | - | - | 0.0001 | 3.3e-5
RMSProp-BiClip | Shakespeare | 0.067 | 2.2e-4 | - | 1e-5 | - | - | 0.75 | 1e-7
RMSProp-BiClip | Philosopher | 0.067 | 0.0046 | - | 0.001 | - | - | 0.75 | 1e-7
Bi2Clip | Shakespeare | 1 | 1 | - | - | 1.5 | 1e-7 | 0.0001 | 0.0001
Bi2Clip | Philosopher | 1 | 1 | - | - | 1.5 | 1e-7 | 0.0001 | 5e-5
Adam2 | Shakespeare | 1.8e-5 | 0.0056 | 1e-7 | 0.001 | - | - | - | -
Adam2 | Philosopher | 1.8e-5 | 0.0056 | 1e-5 | 1e-5 | - | - | - | -
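For concreteness, the grid notation used in Tables 8 and 9 can be materialized directly with NumPy. The snippet below is a hypothetical reconstruction (variable names are ours) showing how a clipping-threshold grid can nest the lower threshold d on each candidate upper threshold u, alongside a log-spaced learning-rate grid:

```python
import numpy as np

# Clipping-threshold sweep in the style of Table 8: upper thresholds u come
# from a linear grid, and for each u the lower threshold d is swept on a
# grid from 1e-7 up to that u (so d <= u always holds).
upper_grid = np.linspace(1e-4, 1.5, 4)               # candidate i_u values
configs = [(u, d) for u in upper_grid
           for d in np.linspace(1e-7, u, 4)]         # i_d nested on i_u

# Learning-rate sweep in the style of Table 9 (log-spaced exponents).
lr_grid = np.logspace(-5, 1, 7)                      # 1e-5, 1e-4, ..., 1e1

print(len(configs))  # 16 (u, d) pairs
```

Each (u, d) pair would then be combined with a learning rate from lr_grid to form one trial of the sweep.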