# Distributed Lion for Communication Efficient Distributed Training

Bo Liu (The University of Texas at Austin, bliu@cs.utexas.edu) · Lemeng Wu (Meta AI, lmwu@google.com) · Lizhang Chen (The University of Texas at Austin, lzchen@utexas.edu) · Kaizhao Liang (The University of Texas at Austin, kaizhaol@utexas.edu) · Jiaxu Zhu (Meta AI, jiaxuzhu@meta.com) · Chen Liang (Google, crazydonkey@google.com) · Raghuraman Krishnamoorthi (Meta AI, raghuraman@meta.com) · Qiang Liu (The University of Texas at Austin, lqiang@cs.utexas.edu)

**Abstract.** The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an adaptation of Lion to distributed training environments. Leveraging the sign operator in Lion, Distributed Lion only requires communicating binary or low-precision vectors between the workers and the central server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains performance comparable to standard Lion or AdamW applied to the aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. In addition, we demonstrate that Distributed Lion presents a more favorable performance-bandwidth trade-off than existing communication-efficient distributed methods such as deep gradient compression and ternary gradients.

## 1 Introduction

The pursuit of modern artificial intelligence hinges on the training of large-scale models like large language models [28] and large vision models (LVM) [20]. As the stakes in terms of time, cost, and environmental impact grow ever higher for training expansive AI systems, the hunt for efficient optimizers becomes critical. Recently, a new optimizer named Lion (evolved sign momentum) [11] was discovered with an evolutionary program. It exhibits performance on par with the current state-of-the-art AdamW [26] across a wide range of tasks, while reducing memory cost and training time. (Equal contribution. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).)

Figure 1: Illustration of Distributed Lion. Each worker keeps its own optimizer state and applies the Lion optimizer individually to produce a binary update $\delta_{i,t} = \mathrm{Lion}(x, D_i)$ (without the weight decay); the server then aggregates all $\delta_{i,t}$ into a binary $\Delta_t$ by majority vote (or a low-precision $\Delta_t$ by averaging) and sends it back to all workers. The workers then apply $\Delta_t$ together with weight decay to update their model parameters (Algorithm 1).

Consider optimizing a loss function $f_D(x)$ on $\mathbb{R}^d$ with a dataset $D$. The update rule of Lion is:

$$
m_{t+1} = \beta_2 m_t + (1-\beta_2)\nabla f_D(x_t), \qquad
\delta_t = \mathrm{Lion}(x_t, D) \overset{\mathrm{def}}{=} \mathrm{sign}\big(\beta_1 m_t + (1-\beta_1)\nabla f_D(x_t)\big), \qquad
x_{t+1} = x_t - \epsilon\,(\delta_t + \lambda x_t), \tag{1}
$$

where $m_t$ plays the role of the momentum, $\epsilon$ is the learning rate, $\beta_1, \beta_2 \in [0, 1]$ are two momentum-related coefficients,[2] and $\lambda \geq 0$ is the weight decay coefficient.
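For concreteness, the update (1) can be written in a few lines of NumPy. This is a minimal sketch with illustrative names of our own (it is not the authors' reference implementation), and it omits the per-parameter-group bookkeeping a production optimizer would have:

```python
import numpy as np

def lion_step(x, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=1.0):
    """One Lion update: sign of an interpolated momentum, plus decoupled weight decay."""
    delta = np.sign(beta1 * m + (1 - beta1) * grad)  # binary direction in {-1, 0, +1}^d
    x_new = x - lr * (delta + wd * x)                # weight decay applied to x, not grad
    m_new = beta2 * m + (1 - beta2) * grad           # the only optimizer state Lion stores
    return x_new, m_new, delta
```

Note that the stored momentum $m$ is updated with $\beta_2$ while the applied direction is interpolated with $\beta_1$; this double-$\beta$ structure is what Section 4 contrasts with Signum.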
Comparing Lion against AdamW, one observes that Lion only requires storing the first-order momentum term, which results in a more relaxed memory requirement. In this study, we tailor the Lion optimizer to distributed training. Lion is particularly suitable for this context due to two main attributes: (1) its simple update mechanism, which relies solely on first-order momentum, and (2) its use of the sign(·) function. We show how the sign(·) function can streamline communication, leading to a novel distributed training framework named Distributed Lion. Within the Distributed Lion framework, each participating worker independently adjusts the model parameters using its own instance of the Lion optimizer, thereby maintaining separate optimizer states. A distinctive feature of the framework is that communication between the workers and the central server is restricted to binary or low-precision vectors. Crucially, workers convey updates rather than raw gradients to the central server. The server, in turn, aggregates these updates through either a straightforward averaging process (Distributed Lion-Avg) or a majority voting mechanism (Distributed Lion-MaVo). In the case of Distributed Lion-MaVo, the consolidated update remains a binary vector, whereas for Distributed Lion-Avg, given $n$ workers, each element of the update vector can be encoded using $\log(n)$ bits. This markedly reduces the bandwidth requirements compared to traditional distributed training, which typically communicates high-precision floating-point vectors. The bandwidth savings achieved by our method are detailed in Table 1.

Our contributions are: 1) We introduce the Distributed Lion algorithm, a simple yet effective approach that extends Lion to distributed training, where all communication between workers and the server is done through binary or low-precision vectors (Section 2); 2) We provide a theoretical analysis that ensures the convergence of Distributed Lion (Section 3); 3) Empirically, we demonstrate on both vision and language modeling tasks that Distributed Lion achieves performance comparable to applying Lion or Adam on the synchronized gradients from all workers, while being significantly more communication efficient. In addition, we show that Distributed Lion achieves a better trade-off than existing communication-efficient distributed training methods like deep gradient compression [24] and ternary gradients [36] (Section 5).

[2] Chen et al. [11] suggest $(\beta_1 = 0.9, \beta_2 = 0.99)$ based on empirical findings.

| Method | Worker → Server | Server → Worker |
| --- | --- | --- |
| Global Lion/AdamW | 32d | 32d |
| TernGrad [36] | 1.5d | log(2n+1)·d |
| DGC [24] | (1−η)·32d | 32d |
| Distributed Lion-Avg | d | log(n)·d |
| Distributed Lion-MaVo | d | d |

Table 1: Minimum bandwidth requirements of different methods for a model with d parameters and n workers. For Deep Gradient Compression (DGC), η denotes the compression rate (default: η = 0.96).
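As a quick sanity check on Table 1, the per-iteration round-trip bits per worker can be computed directly from the table's formulas. This is a minimal sketch of our own (idealized bit counts, not measured network traffic):

```python
import math

def bits_per_iter(d, n):
    """Round-trip communication bits per worker per iteration, per Table 1."""
    return {
        "Global Lion/AdamW": 32 * d + 32 * d,
        "TernGrad":          1.5 * d + math.log2(2 * n + 1) * d,
        "DGC (eta=0.96)":    (1 - 0.96) * 32 * d + 32 * d,
        "D-Lion (Avg)":      d + math.log2(n) * d,
        "D-Lion (MaVo)":     d + d,
    }

# For d = 1M parameters and n = 32 workers, MaVo needs 2e6 bits per round,
# roughly 32x less than the 6.4e7 bits of the global float32 baselines.
print(bits_per_iter(d=1_000_000, n=32))
```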
## 2 The Distributed Lion

We first introduce the distributed learning problem and then our Distributed Lion framework.

### 2.1 Distributed Training

In distributed training, we aim to minimize the following learning objective:

$$
\min_x F(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\xi_i \sim D_i}\big[f(x;\,\xi_i)\big]. \tag{2}
$$

Here, $N$ denotes the number of workers, $\{D_i\}$ are $N$ datasets,[3] and $x$ is the model parameter (e.g., the weights of a neural network).

In the distributed learning setting, each worker $i \in [N]$ has its own dataset $D_i$, and we assume there is a centralized server that all workers can communicate with. The simplest distributed training technique is distributed gradient aggregation:

$$
g_{\text{server}} = \frac{1}{N}\sum_{i=1}^{N} g_i, \quad \text{where} \quad g_i = \mathbb{E}_{\xi_i \sim D_i}\big[\nabla_x f(x; \xi_i)\big]. \tag{3}
$$

Here, each local gradient $g_i$ is an unbiased estimate of the true gradient $\nabla_x F(x)$ when the $D_i$ are i.i.d. draws from the same underlying distribution. The server aggregates all local gradients into $g_{\text{server}}$ and then applies an optimizer like Adam [19] on top of $g_{\text{server}}$. However, the aggregation step requires communicating the full gradient vectors $g_i$, which can be expensive for large models.

Notation. Given a function $f(x; \xi)$, the gradient $\nabla f(x; \xi)$ is taken with respect to the variable $x$. We use $\|\cdot\|$, $\|\cdot\|_1$, and $\|\cdot\|_\infty$ to denote the $\ell_2$, $\ell_1$, and $\ell_\infty$ norms, respectively. $\xi_{i,t}$ is the data sampled at time $t$ by the $i$-th worker, and $g_{i,t} = \nabla f(x_t; \xi_{i,t})$. We similarly write $z_{i,t}$ for any variable $z$ at time $t$ on worker $i$.

### 2.2 Distributed Lion

The main idea of Distributed Lion is to leverage the binary nature of Lion's update for efficient communication. To enable that, we want the workers to send only binary updates to the server. As a result, we let each worker keep track of its own optimizer state, i.e., the momentum $m_{i,t}$. At each step, each worker $i$ first computes:

$$
m_{i,t+1} = \beta_2 m_{i,t} + (1-\beta_2)\,g_{i,t}, \qquad
\delta_{i,t} = \mathrm{sign}\big(\beta_1 m_{i,t} + (1-\beta_1)\,g_{i,t}\big). \tag{4}
$$

All workers then send the binary "updates" $\delta_{i,t}$ to the server, which aggregates them. Here, we propose two simple aggregation rules. Denote $S_t = \sum_{i=1}^{N}\delta_{i,t}$, a vector of integers in $\{-N, \ldots, N\}$, and define

$$
\Delta_t = \mathrm{aggregate}(S_t) =
\begin{cases}
\frac{1}{N} S_t & \text{(Averaging)} \\
\mathrm{sign}(S_t) & \text{(Majority Vote)}.
\end{cases} \tag{5}
$$

[3] Throughout this work, we assume the $\{D_i\}$ consist of i.i.d. data samples ($\xi_i$ sampled from $D_i$ is i.i.d.), though our method should be directly applicable to non-i.i.d. data.

Algorithm 1: Distributed Lion Training
  Inputs: initial parameters $x_0 \in \mathbb{R}^d$, datasets $\{D_1, \ldots, D_N\}$, loss function $f$, learning rate $\epsilon$, hyperparameters $\beta_1, \beta_2 \in [0, 1]$ (default 0.9, 0.99), and weight decay $\lambda$.
  Initialization: $t = 0$; for all $i$, $m_{i,0} = 0$ and $x_{i,0} = x_0$.
  While not converged:
    Worker side: each worker $i$ samples a batch $\xi_{i,t} \sim D_i$, computes the following, and sends $\delta_{i,t}$ to the server:
      if $t > 0$: $x_{i,t} \leftarrow x_{i,t-1} - \epsilon\,(\Delta_{t-1} + \lambda x_{i,t-1})$
      $\delta_{i,t} \leftarrow \mathrm{sign}\big(\beta_1 m_{i,t} + (1-\beta_1)\nabla_x f(x_{i,t}; \xi_{i,t})\big)$
      $m_{i,t+1} \leftarrow \beta_2 m_{i,t} + (1-\beta_2)\nabla_x f(x_{i,t}; \xi_{i,t})$
    Server side: the server computes the aggregated update $\Delta_t$, broadcasts it to all workers, and sets $t \leftarrow t + 1$:
      $\Delta_t = \frac{1}{N}\sum_{i=1}^{N}\delta_{i,t}$ (Averaging)  or  $\Delta_t = \mathrm{sign}\big(\sum_{i=1}^{N}\delta_{i,t}\big)$ (Majority Vote)

So we simply average, or take the majority vote over, the $\{\delta_{i,t}\}$. (In the paper's typesetting, binary vectors are shown in magenta and low-precision vectors in cyan.) Finally, the server broadcasts $\Delta_t$ back to each worker $i$, and each worker performs $x_{i,t+1} = x_{i,t} - \epsilon(\Delta_t + \lambda x_{i,t})$, where $\epsilon$ is the step size and $\lambda$ is the weight decay coefficient.
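The worker/server loop of Algorithm 1 is short enough to sketch directly. The following NumPy snippet is our own illustration of one synchronous round (the function names are hypothetical, and real systems would move `deltas` over the network rather than a Python list):

```python
import numpy as np

def worker_step(m, grad, beta1=0.9, beta2=0.99):
    """Worker side of Algorithm 1: emit a binary update and refresh local momentum."""
    delta = np.sign(beta1 * m + (1 - beta1) * grad)  # sent to the server (1 bit/coord)
    m_new = beta2 * m + (1 - beta2) * grad           # stays local; never communicated
    return delta, m_new

def server_aggregate(deltas, mode="mavo"):
    """Server side: combine the workers' binary updates per Eq. (5)."""
    s = np.sum(deltas, axis=0)                       # integer vector in {-N, ..., N}
    return np.sign(s) if mode == "mavo" else s / len(deltas)

def apply_update(x, agg, lr, wd):
    """Each worker applies the broadcast update plus decoupled weight decay."""
    return x - lr * (agg + wd * x)
```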
Communication Cost. In both variants of Distributed Lion, the $N$ workers only need to send the binary vectors $\delta_{i,t}$ to the server. The server then sends the aggregated update $\Delta_t$ back to the workers; it is binary under majority-vote aggregation, and a scaled integer in $\{-N, \ldots, N\}$ under averaging aggregation. Such an integer can be represented with roughly $\log_2(N)$ bits (plus a sign bit). In practice, we usually have $N \ll 2^{32}$, hence $\log_2(N) < 32$, and we still save communication bandwidth even with averaging aggregation compared against communicating 32-bit floating-point numbers (see Table 1). The full Distributed Lion algorithm is summarized in Algorithm 1.
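To illustrate the 1-bit-per-parameter encoding on the worker-to-server link, here is a minimal packing sketch. The `np.packbits`/`np.unpackbits` calls are standard NumPy; mapping the (measure-zero) sign value 0 to +1 is our own simplification, not something the paper prescribes:

```python
import numpy as np

def pack_delta(delta):
    """Encode a sign vector in {-1, 0, +1}^d as d bits (zeros sent as +1)."""
    bits = (delta >= 0).astype(np.uint8)   # +1 -> 1, -1 -> 0
    return np.packbits(bits), delta.size

def unpack_delta(payload, d):
    """Decode d bits back to a vector in {-1, +1}^d."""
    bits = np.unpackbits(payload)[:d]
    return bits.astype(np.int8) * 2 - 1

delta = np.sign(np.random.randn(10))
payload, d = pack_delta(delta)             # 10 bits instead of 10 float32s (320 bits)
assert np.all(unpack_delta(payload, d) == np.where(delta == 0, 1, delta))
```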
## 3 Theoretical Analysis

We provide a theoretical analysis of the Distributed Lion algorithm with both the averaging and the majority-vote aggregation methods. In the following, we first describe how the distributed training problem can be viewed as a constrained optimization problem when Distributed Lion is used, and then provide convergence results for both aggregation methods.

### 3.1 Lion as Constrained Optimization

Chen et al. [10] showed that the (global) Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ subject to a box constraint:

$$
\min_{x \in \mathbb{R}^d} f(x) \quad \text{s.t.} \quad \|\lambda x\|_\infty \leq 1, \tag{6}
$$

where the constraint is introduced by the use of the weight decay coefficient $\lambda$. Moreover, Chen et al. [10] showed that the Lion dynamics consists of two phases:

1) [Phase 1] When the constraint is not satisfied, that is, $x \notin \mathcal{F}$ where $\mathcal{F} \overset{\mathrm{def}}{=} \{x \colon \|\lambda x\|_\infty \leq 1\}$ is the feasible set, the dynamics exponentially decays the distance to $\mathcal{F}$: there exists $\alpha \in (0, 1)$ such that $\mathrm{dist}(x_{t+n}, \mathcal{F}) \leq \alpha^n\,\mathrm{dist}(x_t, \mathcal{F})$ for all $n \geq 0$. Hence, $x_t$ converges to $\mathcal{F}$ rapidly and stays within $\mathcal{F}$ once it reaches it.

2) [Phase 2] After $x_t$ enters $\mathcal{F}$, the dynamics minimizes the objective $f(x)$ while remaining confined within the set $\mathcal{F}$. This step is proved in [10] by constructing a Lyapunov function when $\mathrm{sign}(\cdot)$ is treated as the sub-gradient of a convex function.

### 3.2 Convergence Analysis

In this section, we analyze the convergence of the Distributed Lion algorithms. As with global Lion, we show that Distributed Lion also solves the box-constrained optimization (6), and its dynamics unfolds into the same two phases: Phase I exhibits rapid convergence to the feasible set $\mathcal{F}$, while Phase II minimizes the objective $f(x)$ within $\mathcal{F}$. Unlike the Lyapunov approach used in Chen et al. [10], our Phase II proof introduces a surrogate metric $S(x)$ of constrained optimality and provides an upper bound on $S(x_t)$ along the algorithm's trajectory. Our analysis makes the following assumptions.

Assumption 3.1 (Variance bound). Each $D_i$ is i.i.d. drawn from a common distribution $\pi$, the stochastic samples $\xi_i \sim D_i$ are i.i.d., and upon receiving a query $x \in \mathbb{R}^d$, the stochastic gradient oracle returns an independent unbiased estimate $\nabla f(x; \xi_i)$ from the $i$-th worker with bounded variance:
$$
\mathbb{E}_\xi[\nabla f(x; \xi_i)] = \nabla f(x), \qquad \mathbb{E}_\xi\big[\|\nabla f(x; \xi_i) - \nabla f(x)\|^2\big] \leq \sigma^2.
$$

Assumption 3.2 (Smooth and differentiable $f$). The function $f(\cdot)$ is differentiable and $L$-smooth.

Assumption 3.3 (Bias correction). For the sequence $\{\tilde m_{i,t}\}_{t>0,\, i \in [N]}$ generated by Algorithm 1, the elementwise ratio $\mathbb{E}[\tilde m_{i,t}]/\mathbb{E}[\mathrm{sign}(\tilde m_{i,t})] \geq 0$.

Note that Assumptions 3.1 and 3.2 are standard in the analysis of stochastic optimization algorithms [8, 34]. When Assumption 3.1 holds, $\mathbb{E}\big\|\frac{1}{N}\sum_{i=1}^{N}\nabla f(x; \xi_i) - \nabla f(x)\big\|^2 \leq \sigma^2/N$. In the distributed training setting, $\tilde m_{1,t}, \tilde m_{2,t}, \ldots, \tilde m_{N,t}$ are i.i.d., so $\mathbb{E}[\beta_1 m_{i,t} + (1-\beta_1) g_{i,t}]$ and $\mathbb{E}[\mathrm{sign}(\tilde m_{i,t})]$ do not depend on $i$. Assumption 3.3 controls the discrepancy between the expected value and the expected sign of the pre-sign momentum, positing that $\mathbb{E}[\tilde m_{i,t}]$ and $\mathbb{E}[\mathrm{sign}(\tilde m_{i,t})]$ ought to share the same sign. We now present our results.

Similar to the case of global Lion, the dynamics of Distributed Lion can be divided into two phases depending on whether the constraint $x \in \mathcal{F}$ is satisfied.

Phase I ($x \notin \mathcal{F}$). In line with the behavior observed for global Lion, when the constraint is not satisfied, both variants of Distributed Lion decrease the distance to the feasible set exponentially fast.

Theorem 3.4 (Phase I). Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth, $\beta_1, \beta_2 \in (0, 1)$ with $\beta_2 > \beta_1$, and $\epsilon, \lambda > 0$. Let $(x_t)_{t \geq 0}$ be generated by Algorithm 1. Define $\mathcal{F} = \{x \colon \|\lambda x\|_\infty \leq 1\}$ and $\mathrm{dist}(x_t, \mathcal{F}) = \inf_{z \in \mathcal{F}} \|z - x_t\|$ w.r.t. any norm $\|\cdot\|$. Then for any two non-negative integers $s \leq t$, we have
$$
\mathrm{dist}(x_t, \mathcal{F}) \leq (1 - \epsilon\lambda)^{t-s}\,\mathrm{dist}(x_s, \mathcal{F}).
$$
Hence, $x_t$ converges to $\mathcal{F}$ rapidly and stays within $\mathcal{F}$ once it arrives.

Phase II ($x \in \mathcal{F}$). We now present the main results of the analysis for Phase II in Theorems 3.6, 3.7, and 3.8. We start by introducing a surrogate metric that quantifies the optimality of a solution within Phase II:

$$
S(x) := \big\langle \nabla f(x),\; \mathrm{sign}(\nabla f(x)) + \lambda x \big\rangle. \tag{7}
$$

Let us delve into the implications of $S(x) = 0$.

Proposition 3.5. Assume $f$ is continuously differentiable, $\lambda > 0$, and $\|\lambda x\|_\infty \leq 1$. Then $S(x) = 0$ implies a KKT stationary condition of $\min_x f(x)$ s.t. $\|\lambda x\|_\infty \leq 1$.

This KKT score (7) is tailored to capture the stationary solutions of the box-constrained problem (6). Building on it, we analyze the convergence of the majority-vote, averaging, and global Lion strategies throughout this section.

Theorem 3.6 (Majority Vote). Let Assumptions 3.1, 3.2, and 3.3 hold. Consider the Majority Vote scheme in Algorithm 1 with $\beta_1, \beta_2 \in (0, 1)$, $\beta_2 > \beta_1$, $\epsilon, \lambda > 0$, and $\sigma \geq 2\sqrt{d}\,\beta_1\beta_2^t\,\|\nabla f(x_0)\|$ for $1 \leq t \leq T$. Let $(x_t)_{t \geq 0}$ be generated by Majority Vote, and assume it is in Phase II: $\|\lambda x_t\|_\infty \leq 1$ for all $t$. Then
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2D\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ \frac{2\bar\rho\sqrt{d}\,(\sqrt{C}\,\sigma + 1)}{\sqrt{N}}
+ 2L\epsilon d, \tag{8}
$$
where $C = \frac{\beta_1^2(1-\beta_2)}{1+\beta_2} + (1-\beta_1)^2$, $D = \max\{1,\; \sigma/(2\sqrt{d}\,\beta_1\beta_2^T\|\nabla f(x_0)\|)\}$,
$$
\rho_t[k] = \begin{cases} 0 & \text{if } \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])] = 0, \\ \mathbb{E}[\tilde m^i_{t+1}[k]]/\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])] & \text{otherwise}, \end{cases}
$$
and $\bar\rho = \max_{1 \leq t \leq T} \|\rho_t\|_\infty$.

The result above shows that $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]$ decays at a rate of $O\big(\frac{1}{T\epsilon} + \frac{1}{T(1-\beta_2)} + \epsilon + \frac{1}{\sqrt{N}}\big)$. This rate is in fact on par with global Lion, as the following result shows.

Theorem 3.7 (Global). Let Assumptions 3.1 and 3.2 hold, and consider the global Lion scheme in (16) with the same settings as Theorem 3.6. We have
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ 2L\epsilon d. \tag{9}
$$

Theorem 3.8 (Averaging). Let Assumptions 3.1 and 3.2 hold, and consider the Averaging scheme in Algorithm 1 with the same settings as Theorem 3.6. We have
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{1+\beta_2}}
+ 2(1-\beta_1)\sigma\sqrt{d}
+ 2L\epsilon d. \tag{10}
$$

The Averaging method's convergence bound does not improve with more workers, since its aggregate $\frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m_{i,t})$ does not benefit from the same variance cancellation as the Majority Vote aggregate $\mathrm{sign}\big(\sum_{i=1}^{N}\mathrm{sign}(\tilde m_{i,t})\big)$.
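The surrogate metric $S(x)$ in (7) is cheap to evaluate numerically. Below is a minimal sketch of our own that computes it and checks the non-negativity property used in the analysis ($S(x) \geq 0$ whenever $\|\lambda x\|_\infty \leq 1$); the names and the random test point are purely illustrative:

```python
import numpy as np

def kkt_score(grad, x, lam):
    """Surrogate optimality metric S(x) = <grad f(x), sign(grad f(x)) + lam * x>."""
    return float(np.dot(grad, np.sign(grad) + lam * x))

# Inside the feasible set (||lam * x||_inf <= 1), S(x) >= 0, and S(x) -> 0 at KKT points.
rng = np.random.default_rng(0)
lam, d = 0.1, 5
x = rng.uniform(-1 / lam, 1 / lam, size=d)  # feasible point
g = rng.normal(size=d)                      # stand-in for grad f(x)
assert kkt_score(g, x, lam) >= 0.0
```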
## 4 Related Work

In this section, we summarize optimizers that use the sign function and the existing literature on bandwidth-friendly distributed training.

Sign Operation in Optimization. The sign operation is integral to optimization for several reasons. Primarily, it acts as a normalization mechanism: by disregarding the magnitude of gradients, it equilibrates updates across dimensions and can facilitate the avoidance of saddle points. Additionally, the binary nature of the sign function's output significantly reduces the memory footprint required for storing gradient updates. The concept of sign-based optimization dates back to RProp [30] and has seen renewed interest with the advent of SignSGD and its momentum-enhanced variant, Signum [4]. A more recent advancement is the generalized SignSGD algorithm introduced by [14], which incorporates a preconditioner, making it a superset of SignSGD and akin to Adam in certain aspects. A noteworthy addition to sign-based optimizers is the Lion optimizer, which emerged from evolutionary program search and was the first to achieve performance comparable to Adam [19] and AdamW [26]. Lion distinguishes itself from Signum by employing a different convex combination for the applied update, a technique referred to as the double-β scheme, reminiscent of Nesterov's momentum update; it encapsulates Signum as a special case (a minimal sketch contrasting the two updates follows at the end of this section). On the theoretical front, SignSGD and Signum have been shown to exhibit convergence rates comparable to traditional SGD [4]. Recent work by [34] extends the theory with a convergence analysis that relaxes the requirements of bounded stochastic gradients and enlarged batch sizes. Additionally, Lion has been shown to perform constrained optimization under an $\ell_\infty$-norm constraint [10].

Distributed Training. In addressing the communication constraints of distributed training, the research community has devised several strategies, prominently featuring asynchronous stochastic gradient descent (SGD), gradient quantization, and sparsification. Asynchronous SGD enables parameter updates immediately after back-propagation, bypassing gradient synchronization and thereby expediting training [9, 40, 25]. Li et al. [21] utilize sketch-based algorithms for lossless data compression [23], achieving an asymptotically optimal compression ratio [22]; however, their applicability is limited to highly sparse gradients, making that line of work orthogonal to ours. In the realm of gradient quantization, methods such as 1-bit SGD [33], QSGD [2], and TernGrad [36] are pivotal. These approaches compact the gradient data, substantially reducing the required communication bandwidth: 1-bit SGD demonstrated a tenfold acceleration in speech applications, and both QSGD and TernGrad confirmed that quantized training can maintain convergence. Gradient sparsification further mitigates the communication load by transmitting only the most significant gradients. Techniques like threshold quantization and Gradient Dropping [1] exemplify this, with Gradient Dropping notably achieving a 99% reduction in gradient exchange at minimal cost to performance (e.g., a mere 0.3 BLEU loss on machine translation). The Deep Gradient Compression (DGC) strategy [24] additionally incorporates momentum correction and local gradient clipping, among other techniques, to maintain accuracy while significantly reducing communication, albeit at the cost of increased computational overhead. Compared to gradient quantization methods, Distributed Lion uniquely leverages the binary nature of Lion's update and can be viewed as quantizing the update rather than the gradient.
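To make the Signum/Lion distinction above concrete, here is a minimal sketch of our own (illustrative names, no weight decay or learning rate): Lion's double-β scheme uses one coefficient for the sign argument and another for the stored momentum, and coincides with Signum when $\beta_1 = \beta_2$:

```python
import numpy as np

def signum_direction(m, grad, beta=0.99):
    """Signum: a single momentum coefficient; the update is sign(new momentum)."""
    m_new = beta * m + (1 - beta) * grad
    return np.sign(m_new), m_new

def lion_direction(m, grad, beta1=0.9, beta2=0.99):
    """Lion's double-beta scheme: the sign uses beta1, the stored momentum uses beta2."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    m_new = beta2 * m + (1 - beta2) * grad
    return update, m_new  # with beta1 == beta2 this reduces to Signum
```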
## 5 Experiment

In this section, we perform a thorough evaluation of the Distributed Lion algorithm, employing both the averaging and majority-vote aggregation methods. The design of our experiments addresses the following questions to ascertain the algorithm's efficacy and performance: (Q1) How does Distributed Lion perform in comparison to traditional global distributed training methods, which aggregate gradients from local workers and apply an optimizer to the collective gradient? (Q2) How does Distributed Lion measure up against established methods known for their communication efficiency in distributed training? (Q3) How does Distributed Lion scale on large vision and language problems?

### 5.1 Comparing Distributed Lion Against Established Methods on CIFAR-10

To address Q1 and Q2, we compare Distributed Lion, with both the averaging and majority-vote aggregation, against established low-bandwidth distributed training techniques and global distributed training methods. We consider the following baselines: 1) Global AdamW (G-AdamW), which applies AdamW to the averaged gradients from all workers. 2) Global Lion (G-Lion), which applies Lion to the averaged gradients from all workers. Note that G-AdamW and G-Lion serve as the performance and communication upper bounds. 3) Distributed Lion with Averaged Updates (D-Lion (Avg)): in contrast to the majority-vote mechanism, this variant averages the binary update vectors from all workers. While D-Lion (Avg) might offer improved performance in principle, it comes at the cost of non-binary communication from the server to the workers. 4) TernGrad [36], whose main idea is to ternarize the gradient into a vector in {−1, 0, 1}, similar in spirit to what Lion does, but applied at the gradient level rather than the update level (a sketch of this ternarization is given below, after the experiment setup). 5) Gradient Dropping (GradDrop) [1], whose main idea is to drop insignificant gradient entries and transmit only sparse gradient signals. 6) Deep Gradient Compression (DGC) [24], which builds on GradDrop but additionally applies momentum correction, local gradient clipping, momentum factor masking, and warm-up training.

Experiment Setup. For GradDrop, DGC, and TernGrad, we choose a compression rate of 0.04 (note that 1/32 = 0.03125) to match the bandwidth of D-Lion (MaVo). We conduct experiments on the CIFAR-10 dataset using a vision transformer (ViT) with 6 layers, 8 heads, and a hidden dimension of 512; ViT has arguably become the most widely used architecture in computer vision, and we empirically found no additional performance gain from a larger ViT on CIFAR-10. In addition, to validate how Distributed Lion behaves with different numbers of workers, we consider k ∈ {4, 8, 16, 32}, where each worker at each step samples an i.i.d. batch of size 32. The optimal hyperparameters selected for each method in Figure 2 are listed in Table 4 (Appendix A). The learning rates are selected from {0.00005, 0.001, 0.005, 0.01} and the weight decays from {0.0005, 0.001, 0.005}. For each experiment, we use a cosine learning rate scheduler and run for 200 epochs, ensuring that in each epoch each local worker sees the entire dataset once.

Figure 2: Performance of Distributed Lion vs. baseline distributed optimizers on CIFAR-10 with 4, 8, 16, and 32 workers (one panel per worker count, test accuracy over epochs); each worker at each step runs on a local batch of size 32. All results are averaged over three seeds. Each experiment is conducted with three random seeds {42, 52, 62}, for a total of 4 × 7 × 3 = 84 runs.
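For reference, the TernGrad baseline quantizes each gradient tensor stochastically to three levels. The sketch below is our own simplification of the scheme in [36] (per-tensor max scaling, no gradient clipping); it is unbiased by construction:

```python
import numpy as np

def ternarize(grad, rng=np.random.default_rng()):
    """TernGrad-style stochastic ternarization: quantize a gradient to {-s, 0, +s}."""
    s = np.max(np.abs(grad))            # per-tensor scale
    p = np.abs(grad) / s                # keep-probability per coordinate
    keep = rng.random(grad.shape) < p   # unbiased: E[output] == grad
    return s * np.sign(grad) * keep
```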
Figure 3: Best test accuracy of G-Lion, G-AdamW, GradDrop, DGC, TernGrad, and D-Lion (Avg/MaVo) vs. the number of workers k.

Figure 4: Test error (1 − accuracy) vs. communication bits per parameter per iteration for G-AdamW, G-Lion, GradDrop (0.96), TernGrad, DGC (0.96), D-SIGNUM (Avg/MaVo), and D-Lion (Avg/MaVo); closer to the lower-left is better. Note that we place G-Lion and G-AdamW at 64 bits, because they require 32 bits per parameter in each direction (worker-to-server and server-to-worker).

Observation. We plot the test accuracy over epochs for the different methods in Figure 2, the best test accuracy versus the number of workers in Figure 3, and performance versus per-iteration bandwidth with k = 4 workers in Figure 4. From these plots, we make the following observations. Compared to the global methods, D-Lion (MaVo) performs on par with G-Lion, while D-Lion (Avg) performs slightly worse than G-Lion but on par with G-AdamW (Figure 2). Compared to established communication-efficient methods, both D-Lion (MaVo) and D-Lion (Avg) outperform GradDrop, DGC, and TernGrad by a large margin (Figure 2). Both D-Lion variants exhibit strong performance while being roughly 30x more communication efficient than global distributed training methods like G-AdamW. To broaden the comparison, we introduce two additional baselines, D-SIGNUM (Avg) and D-SIGNUM (MaVo), which apply our proposed techniques to the SIGNUM framework instead of Lion.[5] We set β = 0.99 for D-SIGNUM. As depicted in Figure 4, these SIGNUM-based methods do not perform as well as their Lion-based counterparts. We also notice that the overall performance of the same optimizer degrades as k grows, consistent with the observation made in DGC [24]. We hypothesize that this is due to the larger effective batch size reducing stochasticity, which is also consistent with why D-Lion (MaVo) performs slightly better than G-Lion on CIFAR-10 (Figure 3).

[5] Note that D-SIGNUM (Avg/MaVo) further subsumes D-SignSGD [5, 6].

### 5.2 Scaling to Larger Models on Larger Datasets

To answer Q3, we validate Distributed Lion on several large-scale setups spanning both vision and natural language processing tasks. Under this setting, we compare D-Lion (MaVo) and D-Lion (Avg) against G-AdamW and G-Lion. For the vision task, we test ViT-S/16 [16] and ViT-B/16 on the ImageNet-1K [31] classification benchmark. For natural language processing, we perform both language pretraining and finetuning tasks, since Lion has shown good results on language modeling. For the language model pretraining task, we pretrain GPT2++ [29] (the GPT-2 model with modern training techniques adopted from the LLaMA model [35]) on the OpenWebText [17] benchmark, at both the 350M and 760M model sizes. For the language model finetuning task, we conduct few-shot finetuning of the LLaMA 7B model [35] and evaluate the models' downstream performance on standard evaluation benchmarks [13, 37, 12, 27, 7, 32].

Experiment Setup. For the ImageNet-1K benchmark, we train all methods for 300 epochs with a global batch size of 4096, using MixUp [39] (coefficient 0.5) and AutoAugment [15] as data augmentation. When training ViT-S/16, we use a learning rate of 3e-3 for G-AdamW, with betas of (0.9, 0.999) and a weight decay of 0.1.
For G-Lion, D-Lion (MaVo), and D-Lion (Avg), we use a learning rate of 3e-4, betas of (0.9, 0.99), and a weight decay of 1.0. For ViT-B/16, we use a learning rate of 1e-3 for G-AdamW, with betas of (0.9, 0.999) and a weight decay of 1.0, while for all Lion variants we use a learning rate of 1e-4, betas of (0.9, 0.99), and a weight decay of 10.0. For pretraining language models on the OpenWebText dataset, we build GPT2++ models from the original GPT-2 model with modern training techniques from the LLaMA model, including the Gated Linear Unit activation for the multilayer perceptron (MLP) layers and RMSNorm [38] in place of LayerNorm [3]. Following the Chinchilla scaling law [18], we train the 350M model for 14,000 iterations and the 760M model for 30,000 iterations, both with 1,024-token sequences. For G-AdamW, we use a learning rate of 3e-4, betas of (0.95, 0.99), and a weight decay of 0.1; for all Lion variants, a learning rate of 9e-5, betas of (0.9, 0.99), and a weight decay of 1.0. All models are trained with a global batch size of 480. For the instruction finetuning task, we instruction-finetune a LLaMA 7B model for 3 epochs with batch size 32, using a 2e-5 learning rate, betas of (0.9, 0.999), and 0 weight decay for G-AdamW, and a 6e-6 learning rate, (0.9, 0.99) betas, and 0.01 weight decay for all Lion variants. All pretraining experiments use 4 nodes × 8 GPUs = 32 workers; instruction finetuning experiments use 4 workers each.

Table 2: Results on ImageNet classification and OpenWebText language modeling. For ImageNet we report Top-1 accuracy (higher is better); for language modeling we report validation perplexity (lower is better). The best result per column is in bold and the second best in italics.

| Method | ViT-S/16 | ViT-B/16 | GPT-2++ (350M) | GPT-2++ (760M) |
| --- | --- | --- | --- | --- |
| G-AdamW | 79.74 | 80.94 | 18.43 | 14.70 |
| G-Lion | *79.82* | *80.99* | **18.35** | **14.66** |
| D-Lion (MaVo) | 79.69 | 80.79 | *18.37* | **14.66** |
| D-Lion (Avg) | **80.11** | **81.13** | 18.39 | *14.69* |

Table 3: 3-shot instruction finetuning downstream evaluation results on various datasets. The best result per column is in bold and the second best in italics.

| Method | Arc-Easy | Arc-Challenge | BoolQ | PIQA | SIQA | HellaSwag | OBQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0-Shot | 76.64 | 43.06 | 76.43 | 78.64 | 45.96 | 56.87 | 33.53 |
| G-AdamW | *77.06* | **46.06** | *77.23* | **79.18** | 48.97 | **59.23** | *35.51* |
| G-Lion | **77.11** | 45.54 | **77.50** | **79.18** | *49.64* | 58.93 | *35.51* |
| D-Lion (MaVo) | 76.86 | *45.72* | 77.14 | *78.92* | **49.75** | 58.96 | **35.71** |
| D-Lion (Avg) | 76.35 | 45.54 | 76.90 | 78.76 | 48.06 | *59.06* | 32.14 |

Observation. We summarize the results in Table 2 (ImageNet-1K and OpenWebText language model pretraining) and Table 3 (instruction finetuning). Both D-Lion (Avg) and D-Lion (MaVo) maintain performance similar to, or even better than, G-AdamW and G-Lion on large-scale vision and language tasks. We observe that D-Lion (Avg) outperforms D-Lion (MaVo) on ImageNet, and the opposite on language modeling and instruction finetuning. We hypothesize that these differences are due to the impact of global batch size; accordingly, we recommend D-Lion (Avg) when the global batch size is large and D-Lion (MaVo) when it is small.

## 6 Conclusion and Future Work

In this paper, we introduced Distributed Lion, a communication-efficient distributed training strategy that builds upon the Lion optimizer's binary update mechanism.
Distributed Lion is designed to minimize communication overhead by letting workers independently manage their optimizer states and exchange only binary or low-precision update vectors with the server. We proposed two aggregation techniques within the Distributed Lion framework: averaging-based (Distributed Lion-Avg) and majority-vote-based (Distributed Lion-MaVo). We provide both theoretical and empirical results demonstrating Distributed Lion's effectiveness, scalability, and efficiency. Notably, Distributed Lion performs significantly better than existing communication-friendly methods, while matching strong global distributed training baselines at roughly 32x less communication. As our method is orthogonal to existing communication-efficient methods, an interesting future direction is to combine the two techniques for further improvement. As a limitation, Distributed Lion (Avg/MaVo) currently performs inconsistently across datasets and benchmarks; understanding when and why one variant outperforms the other is an interesting direction for future research.

## 7 Acknowledgment

The research was conducted in the Statistics & AI group at UT Austin, which receives support in part from NSF CAREER 1846421, SenSE 2037267, the Office of Naval Research, and the NSF AI Institute for Foundations of Machine Learning (IFML).

## References

[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.

[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[4] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems, August 2018. arXiv:1802.04434 [cs, math].

[5] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560-569. PMLR, 2018.

[6] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.

[7] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432-7439, 2020.

[8] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

[9] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

[10] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts. arXiv preprint arXiv:2310.05898, 2023.

[11] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, et al. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675, 2023.

[12] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova.
BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.

[13] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv, abs/1803.05457, 2018.

[14] Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signSGD. arXiv preprint arXiv:2208.11195, 2022.

[15] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[17] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[18] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

[21] Haoyu Li, Qizhi Chen, Yixin Zhang, Tong Yang, and Bin Cui. Stingy sketch: A sketch framework for accurate and fast frequency estimation. Proceedings of the VLDB Endowment, 15(7):1426-1438, 2022.

[22] Haoyu Li, Liuhui Wang, Qizhi Chen, Jianan Ji, Yuhan Wu, Yikai Zhao, Tong Yang, and Aditya Akella. ChainedFilter: Combining membership filters by chain rule. Proceedings of the ACM on Management of Data, 1(4):1-27, 2023.

[23] Haoyu Li, Yuchen Xu, Jiayi Chen, Rohit Dwivedula, Wenfei Wu, Keqiang He, Aditya Akella, and Daehyeok Kim. Accelerating distributed deep learning using lossless homomorphic compression, 2024.

[24] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[25] Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A Rusu, Jiajun Shen, Arthur Szlam, and Marc'Aurelio Ranzato. Asynchronous Local-SGD training for language modeling. arXiv preprint arXiv:2401.09135, 2024.

[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[27] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018.

[28] OpenAI. GPT-4 technical report. arXiv, abs/2303.08774, 2023.

[29] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[30] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586-591. IEEE, 1993.
[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.

[32] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.

[33] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[34] Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang. Momentum ensures convergence of signSGD under weaker assumptions. In International Conference on Machine Learning, pages 33077-33099. PMLR, 2023.

[35] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[36] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.

[37] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

[38] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

[39] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[40] Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning, pages 4120-4129. PMLR, 2017.

## A Additional Experiment Details

In this section, we provide additional experiment details.

CIFAR Experiments. We list the optimal hyperparameters selected for each method from Figure 2 in Table 4. The learning rates are selected from {0.00005, 0.001, 0.005, 0.01} and the weight decays from {0.0005, 0.001, 0.005}. For each experiment, we use a cosine learning rate scheduler and run for 200 epochs, ensuring that in each epoch each local worker sees the entire dataset once.

| Method | LR ϵ | WD λ | Compression Rate |
| --- | --- | --- | --- |
| G-AdamW | 0.0001 | 0.0005 | - |
| G-Lion | 0.00005 | 0.005 | - |
| DGC | 0.01 | 0.0005 | 0.96 |
| GradDrop | 0.001 | 0.0005 | 0.96 |
| TernGrad | 0.001 | 0.0005 | - |
| D-Lion (Avg) | 0.00005 | 0.005 | - |
| D-Lion (MaVo) | 0.00005 | 0.005 | - |

Table 4: Hyperparameters for each method in Figure 2, where LR is the learning rate and WD is the weight decay.

## B Proofs

This section focuses on the proofs for the Lion dynamics and is organized into the following parts:

- Constraint enforcing (discrete time)
- Majority Vote convergence
- Averaging update convergence
- Global Lion convergence

In line with the behavior observed for the global Lion approach, Lion in the distributed setting also exhibits the two phases. In Section B.1, we show that convergence to the box can be exponentially fast under Algorithm 1.
We start by introducing a notion of a KKT score function that quantifies a stationary solution to the box-constrained optimization problem (6) in Section B.2. Building on this, we then analyze convergence in terms of the KKT score for the majority-vote (Section B.2.1), averaging (Section B.2.2), and global Lion strategies (Section B.2.3).

### B.1 Phase I: Constraint Enforcing

We study Phase I in this section and show that when the constraint is not satisfied, both variants of Distributed Lion decrease the distance to the feasible set exponentially fast.

Theorem B.1 (Phase I). Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth, $\beta_1, \beta_2 \in (0, 1)$ with $\beta_2 > \beta_1$, $\epsilon, \lambda > 0$, and $1 - \epsilon\lambda \in (0, 1)$. Let $(x_t)_{t \geq 0}$ be generated by Algorithm 1. Define $\mathcal{F} = \{x \colon \|\lambda x\|_\infty \leq 1\}$ and $\mathrm{dist}(x_t, \mathcal{F}) = \inf_{z \in \mathcal{F}} \|z - x_t\|$ w.r.t. any norm $\|\cdot\|$. Then for any two non-negative integers $s \leq t$, we have
$$
\mathrm{dist}(x_t, \mathcal{F}) \leq (1 - \epsilon\lambda)^{t-s}\,\mathrm{dist}(x_s, \mathcal{F}).
$$

Proof. Recall Algorithm 1:
$$
\delta_{i,t} \leftarrow \mathrm{sign}\big(\beta_1 m_{i,t} + (1-\beta_1)\nabla_x f(x_t; \xi_{i,t})\big), \qquad
m_{i,t+1} \leftarrow \beta_2 m_{i,t} + (1-\beta_2)\nabla_x f(x_t; \xi_{i,t}),
$$
$$
\Delta_t = \begin{cases} \frac{1}{N}\sum_{i=1}^{N}\delta_{i,t} & \text{(Averaging)} \\ \mathrm{sign}\big(\sum_{i=1}^{N}\delta_{i,t}\big) & \text{(Majority Vote)}, \end{cases}
\qquad
x_{t+1} = x_t - \epsilon(\Delta_t + \lambda x_t).
$$
Rewrite the update as $x_{t+1} = (1 - \epsilon\lambda)x_t - \epsilon\Delta_t$ and define $w_s^t = (1 - \epsilon\lambda)^{t-s}$. Unrolling this update yields
$$
x_t = (1 - w_s^t)\, z_s^t + w_s^t\, x_s, \qquad z_s^t = \frac{\sum_{k=s}^{t-1} w_k^t\,(-\Delta_k/\lambda)}{\sum_{k=s}^{t-1} w_k^t}.
$$
We have $z_s^t \in \mathcal{F}$, since each $-\Delta_k/\lambda \in \mathcal{F}$ (as $\|\Delta_k\|_\infty \leq 1$). For any $\eta > 0$, let $\hat x_s \in \mathcal{F}$ be a point satisfying $\|\hat x_s - x_s\| \leq \mathrm{dist}(x_s, \mathcal{F}) + \eta$. Hence,
$$
\mathrm{dist}(x_t, \mathcal{F}) = \inf_{z \in \mathcal{F}} \|x_t - z\|
\leq \big\|x_t - \big((1 - w_s^t) z_s^t + w_s^t \hat x_s\big)\big\|
= w_s^t \|x_s - \hat x_s\|
\leq (1 - \epsilon\lambda)^{t-s}\big(\mathrm{dist}(x_s, \mathcal{F}) + \eta\big).
$$
Letting $\eta \to 0$, we achieve the desired result. ∎

### B.2 Phase II

We study the convergence of Phase II in this section. We begin by defining a KKT score function that quantifies stationary solutions of the box-constrained optimization problem, and then analyze convergence through this score for the majority-vote (Section B.2.1), averaging (Section B.2.2), and global Lion strategies (Section B.2.3). First, we list the assumptions used in our proofs.

Assumption B.2 (Smooth and differentiable $f$). The function $f(\cdot)$ is differentiable and $L$-smooth.

Assumption B.3 (Variance bound). Each $D_i$ is i.i.d. drawn from a common distribution $\pi$, the stochastic samples $\xi_i \sim D_i$ are i.i.d., and upon receiving a query $x \in \mathbb{R}^d$, the stochastic gradient oracle returns an independent unbiased estimate $\nabla f(x; \xi_i)$ from the $i$-th worker with bounded variance:
$$
\mathbb{E}_\xi[\nabla f(x; \xi_i)] = \nabla f(x), \qquad \mathbb{E}_\xi\big[\|\nabla f(x; \xi_i) - \nabla f(x)\|^2\big] \leq \sigma^2.
$$

Assumption B.4 (Bias correction). For the sequence $\{\tilde m^i_t\}_{t > 0,\, i \in [N]}$ generated by Algorithm 1, the elementwise ratio $\mathbb{E}[\tilde m^i_t]/\mathbb{E}[\mathrm{sign}(\tilde m^i_t)] \geq 0$.

Here we define the KKT score function for the box-constrained problem (6):
$$
S(x) := \big\langle \nabla f(x),\; \mathrm{sign}(\nabla f(x)) + \lambda x \big\rangle.
$$

Proposition B.5. Assume $f$ is continuously differentiable, $\lambda > 0$, and $\|\lambda x\|_\infty \leq 1$. Then $S(x) = 0$ implies a KKT stationary condition of $\min_x f(x)$ s.t. $\|\lambda x\|_\infty \leq 1$.

Proof. We verify that $S(x) = 0$ coincides with the first-order KKT conditions of the box-constrained optimization problem (6), which we can rewrite as
$$
\min_{x \in \mathbb{R}^d} f(x) \quad \text{s.t.} \quad \lambda x_i - 1 \leq 0, \;\; -\lambda x_i - 1 \leq 0, \;\; \forall i \in [d].
$$
Let $\mu = (\mu_1, \mu_2, \ldots, \mu_d)^\top$ and $\bar\mu = (\bar\mu_1, \bar\mu_2, \ldots, \bar\mu_d)^\top$ be the multipliers. The first-order KKT stationarity conditions read, for all $i \in \{1, 2, \ldots, d\}$:
$$
\nabla_{x_i} f(x) + \mu_i\lambda - \bar\mu_i\lambda = 0 \quad \text{//stationarity}
$$
$$
\mu_i(\lambda x_i - 1) = 0, \quad \bar\mu_i(-\lambda x_i - 1) = 0 \quad \text{//complementary slackness}
$$
$$
\mu_i \geq 0, \quad \bar\mu_i \geq 0 \quad \text{//dual feasibility}
$$
$$
\lambda x_i - 1 \leq 0, \quad -\lambda x_i - 1 \leq 0 \quad \text{//primal feasibility}.
$$
Expressing $S(x)$ elementwise, we obtain
$$
S(x) = \sum_{k=1}^{d} S_k(x), \qquad S_k(x) = \nabla_{x_k} f(x)\,\big(\mathrm{sign}(\nabla_{x_k} f(x)) + \lambda x_k\big),
$$
where $x_k$ denotes the $k$-th element of the vector $x$.
Since $\|\lambda x\|_\infty \leq 1$, we have $S_k(x) \geq 0$, because
$$
S_k(x) = \nabla_{x_k} f(x)\big(\mathrm{sign}(\nabla_{x_k} f(x)) + \lambda x_k\big)
= |\nabla_{x_k} f(x)| + \lambda\,\nabla_{x_k} f(x)\, x_k
\geq |\nabla_{x_k} f(x)| - |\nabla_{x_k} f(x)|\,|\lambda x_k|
= |\nabla_{x_k} f(x)|\,(1 - |\lambda x_k|) \geq 0,
$$
where the last step uses $\|\lambda x\|_\infty \leq 1$. Hence, if $S(x) = 0$, then $S_k(x) = 0$ for each component $k$, which means that for each coordinate $k$ either $\mathrm{sign}(\nabla_{x_k} f(x)) + \lambda x_k = 0$ or $\nabla_{x_k} f(x) = 0$. There are two primary cases to consider for each $k$:

Case I: $\nabla_{x_k} f(x) = 0$. This is a stationary condition of $f(x)$ w.r.t. the coordinate $x_k$, and the KKT conditions are satisfied with $\mu_k = \bar\mu_k = 0$.

Case II: $\mathrm{sign}(\nabla_{x_k} f(x)) + \lambda x_k = 0$, so that $x_k = -\frac{1}{\lambda}\mathrm{sign}(\nabla_{x_k} f(x))$. If $\mathrm{sign}(\nabla_{x_k} f(x)) = 1$, then $\nabla_{x_k} f(x) \geq 0$, and the KKT conditions are satisfied with $\mu_k = 0$ and $\bar\mu_k = \nabla_{x_k} f(x)/\lambda$. If $\mathrm{sign}(\nabla_{x_k} f(x)) = -1$, then $\nabla_{x_k} f(x) \leq 0$, and the KKT conditions are satisfied with $\bar\mu_k = 0$ and $\mu_k = -\nabla_{x_k} f(x)/\lambda$.

The two cases above exactly cover the KKT stationary solution tuples $(x, \mu, \bar\mu)$ of the box-constrained problem (6). In conclusion, $S(x) = 0$ signifies reaching a stationary point of the box-constrained optimization problem (6), providing critical insight into the convergence behavior of the algorithms under consideration. ∎

#### B.2.1 Majority Vote

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $N$ is the number of workers. On the $i$-th worker, consider the following scheme based on the majority vote:
$$
g^i_t := \nabla f(x_t; \xi^i_t), \qquad
m^i_{t+1} = \beta_2 m^i_t + (1-\beta_2) g^i_t, \qquad
\tilde m^i_{t+1} = \beta_1 m^i_t + (1-\beta_1) g^i_t,
$$
$$
x_{t+1} = x_t - \epsilon\Big(\mathrm{sign}\Big(\textstyle\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1})\Big) + \lambda x_t\Big). \quad \text{//Majority Vote} \tag{11}
$$

Theorem B.6 (Convergence in Phase II). Let Assumptions B.2, B.3, and B.4 hold, consider the scheme in (11) with $\beta_1, \beta_2 \in (0, 1)$, $\beta_2 > \beta_1$, $\epsilon, \lambda > 0$, and $\|\lambda x_0\|_\infty \leq 1$. Then
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2D\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ \frac{2\bar\rho\sqrt{d}\,(\sqrt{C}\,\sigma + 1)}{\sqrt{N}}
+ 2L\epsilon d,
$$
where $C = \frac{\beta_1^2(1-\beta_2)}{1+\beta_2} + (1-\beta_1)^2$, $D = \max\{1,\; \sigma/(2\sqrt{d}\,\beta_1\beta_2^T\|\nabla f(x_0)\|)\}$, and $\rho_t[k] = 0$ if $\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])] = 0$ and $\rho_t[k] = \mathbb{E}[\tilde m^i_{t+1}[k]]/\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])]$ otherwise, with $\bar\rho = \max_{1 \leq t \leq T}\|\rho_t\|_\infty$.

Proof. By Theorem B.1 from Phase I, once $\|\lambda x_0\|_\infty \leq 1$, we stay within the constraint set with $\|\lambda x_t\|_\infty \leq 1$ for all subsequent times $t \geq 0$. For notation, write $\tilde M_{t+1} = \sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1})$, which yields $x_{t+1} = x_t - \epsilon\,\mathrm{sign}(\tilde M_{t+1}) - \epsilon\lambda x_t$. We have
$$
\begin{aligned}
f(x_{t+1}) - f(x_t)
&\leq \langle \nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \quad \text{//$L$-smoothness of $f$} \\
&= -\epsilon\,\big\langle \nabla f(x_t),\, \mathrm{sign}(\tilde M_{t+1}) + \lambda x_t\big\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \\
&= -\epsilon\,\big\langle \nabla f(x_t),\, \mathrm{sign}(\nabla f(x_t)) + \lambda x_t\big\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 + \epsilon\,\big\langle \nabla f(x_t),\, \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle \\
&\leq -\epsilon S(x_t) + 2L\epsilon^2 d + \epsilon\,\big\langle \nabla f(x_t),\, \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle, \quad (12)
\end{aligned}
$$
where we used $\|x_{t+1} - x_t\|^2 = \epsilon^2\|\mathrm{sign}(\tilde M_{t+1}) + \lambda x_t\|^2 \leq 4\epsilon^2 d$, because $\|\lambda x_t\|_\infty \leq 1$.

By Assumption B.3, $\tilde m^1_{t+1}, \tilde m^2_{t+1}, \ldots, \tilde m^N_{t+1}$ are i.i.d., so $\mathbb{E}[\tilde m^i_{t+1}]$ and $\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1})]$ do not depend on $i$. Hence we can define $R_{t+1} = \mathbb{E}[\tilde m^i_{t+1}]/\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1})]$, where the division is elementwise, so $R_{t+1} \in \mathbb{R}^d$. By Assumption B.4, $R_t$ is non-negative. One special case for the ratio $R_t$ is when $\mathbb{E}[\mathrm{sign}(\tilde m^i_t[k])] = 0$ yet $\mathbb{E}[\tilde m^i_t[k]] \neq 0$, leading to $R_t[k] = +\infty$ for that $k \in [d]$. In this instance, $P(\tilde m^i_t[k] > 0) = 1/2$, derived from the equation $\mathbb{E}[\mathrm{sign}(\tilde m^i_t[k])] = 2P(\tilde m^i_t[k] > 0) - 1 = 0$, for $k \in [d]$. Recognizing that $\mathbb{E}[\mathrm{sign}(\tilde M_t[k])] = 0$ is then straightforward, as $\sum_i \mathbb{1}\{\tilde m^i_t[k] > 0\}$ follows a binomial distribution with success probability $p = 1/2$. This leads to the result
$$
\mathbb{E}\big[\nabla f(x_t)[k]\,\big(\mathrm{sign}(\nabla f(x_t)[k]) - \mathrm{sign}(\tilde M_t[k])\big)\big] = \mathbb{E}\,\big|\nabla f(x_t)[k]\big|.
$$
Recall that $\mathbb{E}[X] = \arg\min_z \mathbb{E}\|X - z\|^2$ characterizes the expectation of a random variable $X$ as the minimizer of the expected squared distance, while the median $\mathrm{med}(X) = \arg\min_z \mathbb{E}\|X - z\|_1$ minimizes the expected absolute distance for a real-valued $X$. In our case $P(\tilde m^i_t[k] > 0) = 1/2$, which is equivalent to the median of $\tilde m^i_t[k]$ being 0.
From this, it follows that
$$
\mathbb{E}\,|\nabla f(x_t)[k]|
\leq \mathbb{E}\big[\mathbb{E}_\xi\big[\,|\nabla f(x_t; \xi^i_t)[k] - \nabla f(x_t)[k]|\,\big]\big]
\leq \mathbb{E}\sqrt{\mathbb{E}_\xi\big[\,|\nabla f(x_t; \xi^i_t)[k] - \nabla f(x_t)[k]|^2\,\big]}
\leq \sigma.
$$

To bound the last term in (12), $\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\rangle$, we follow a structured approach:

1. Transform the inner product into the norm of a difference, using Lemma B.8.
2. Introduce $R_t$ as a de-biasing ratio that corrects for the discrepancy between the expected value of $\tilde m^i_t$ and the expected sign of $\tilde m^i_t$, as in Assumption B.4.
3. Handle the cases of $R_t$ separately. Given the possibility of $R_t[k] = +\infty$, it is essential to treat the scenarios $R_t[k] < +\infty$ and $R_t[k] = +\infty$ with separate proofs. For $R_t[k] < +\infty$, standard bounding techniques apply, leveraging the properties of $R_t$ to establish a finite upper bound. For $R_t[k] = +\infty$, the task reduces to bounding $\|\nabla f(x_t)\|$, which is controlled by the variance of the stochastic gradient $g^i_t$.
4. Merge the cases by replacing $R_t$ with the finite $\rho_t$, which serves the same purpose while ensuring a manageable, finite adjustment.

Case I (finite $R_{t+1}$). First, we expand the inner product:
$$
\begin{aligned}
\mathbb{E}\big\langle \nabla f(x_t),\, \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle
&= \sum_{k=1}^{d}\mathbb{E}\Big[\nabla f(x_t)[k]\Big(\mathrm{sign}(\nabla f(x_t)[k]) - \mathrm{sign}\Big(\tfrac{1}{N}\textstyle\sum_i \mathrm{sign}(\tilde m^i_{t+1}[k])\Big)\Big)\Big] \\
&\leq \sum_{k=1}^{d} 2R_{t+1}[k]\;\mathbb{E}\Big|\nabla f(x_t)[k]/R_{t+1}[k] - \tfrac{1}{N}\textstyle\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1}[k])\Big|, \quad \text{//Lemma B.8 and Assumption B.4}
\end{aligned}
$$
where we used $\mathrm{sign}(\tilde M_{t+1}) = \mathrm{sign}(\tilde M_{t+1}/N)$. By the definition of $R_{t+1}$ as the de-biasing ratio between $\mathbb{E}[\tilde m^i_{t+1}]$ and $\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1})]$, we decouple the difference via the triangle inequality:
$$
\mathbb{E}\Big|\frac{\nabla f(x_t)[k]}{R_{t+1}[k]} - \frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1}[k])\Big|
\leq \frac{1}{R_{t+1}[k]}\,\mathbb{E}\Big|\nabla f(x_t)[k] - \frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}[k]\Big|
+ \mathbb{E}\Big|\frac{1}{N}\sum_{i=1}^{N}\frac{\tilde m^i_{t+1}[k]}{R_{t+1}[k]} - \frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1}[k])\Big|.
$$
The first term, after multiplying back by $R_{t+1}[k]$, does not depend on $R_{t+1}$; summing over the $d$ coordinates and applying Lemma B.10,
$$
\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}\Big\|
\leq \beta_1\,\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_{i=1}^{N}m^i_t\Big\| + (1-\beta_1)\,\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_{i=1}^{N}g^i_t\Big\|
\leq \beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}\Big) + \frac{(1-\beta_1)\sigma}{\sqrt{N}}. \quad \text{//Lemma B.10}
$$
The second term, multiplied by $R_{t+1}[k]$, can be decoupled into the deviations of $\frac{1}{N}\sum_i \tilde m^i_{t+1}[k]$ and of $\frac{1}{N}\sum_i \mathrm{sign}(\tilde m^i_{t+1}[k])$ around their expectations, using $\mathbb{E}[\tilde m^i_{t+1}[k]/R_{t+1}[k]] = \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])]$:
$$
R_{t+1}[k]\,\mathbb{E}\Big|\frac{1}{N}\sum_i \frac{\tilde m^i_{t+1}[k]}{R_{t+1}[k]} - \frac{1}{N}\sum_i \mathrm{sign}(\tilde m^i_{t+1}[k])\Big|
\leq \mathbb{E}\Big|\frac{1}{N}\sum_i \tilde m^i_{t+1}[k] - \mathbb{E}[\tilde m^i_{t+1}[k]]\Big|
+ R_{t+1}[k]\,\mathbb{E}\Big|\frac{1}{N}\sum_i \mathrm{sign}(\tilde m^i_{t+1}[k]) - \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])]\Big|.
$$
We bound the two deviations one by one. For the deviation of $\frac{1}{N}\sum_i \tilde m^i_{t+1}$,
$$
\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1} - \mathbb{E}[\tilde m^i_{t+1}]\Big\|
\leq \sqrt{\frac{1}{N}\,\mathbb{E}\big\|\tilde m^i_{t+1} - \mathbb{E}[\tilde m^i_{t+1}]\big\|^2}
\leq \sqrt{\frac{C\sigma^2}{N}}, \quad \text{//Lemma B.11}
$$
where $C = \frac{\beta_1^2(1-\beta_2)}{1+\beta_2} + (1-\beta_1)^2$. For the deviation of $\frac{1}{N}\sum_i \mathrm{sign}(\tilde m^i_{t+1})$, per coordinate,
$$
\mathbb{E}\Big|\frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1}[k]) - \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])]\Big|
\leq \sqrt{\frac{1}{N}\,\mathbb{E}\big|\mathrm{sign}(\tilde m^i_{t+1}[k]) - \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])]\big|^2}
\leq \frac{1}{\sqrt{N}}. \quad \text{//Lemma B.9}
$$
Putting the pieces together, the last term in (12) is bounded in Case I by
$$
\mathbb{E}\big\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle
\leq 2\sqrt{d}\Big[\beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}\Big) + \frac{(1-\beta_1)\sigma}{\sqrt{N}}\Big]
+ \frac{2\|R_{t+1}\|_\infty\sqrt{d}\,(\sqrt{C}\,\sigma + 1)}{\sqrt{N}}.
$$

Case II (infinite $R$). From the discussion above, we know that $P(\tilde m^i_t[k] > 0) = 1/2$ whenever $\mathbb{E}[\mathrm{sign}(\tilde m^i_t[k])] = 0$, for $k \in [d]$. For notation, write $\mathcal{D} = \{j \in [d] \mid \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[j])] = 0\}$. In this case, we have
$$
\mathbb{E}\sum_{j \in \mathcal{D}} \nabla f(x_t)[j]\big(\mathrm{sign}(\nabla f(x_t)[j]) - \mathrm{sign}(\tilde M_t[j])\big)
= \mathbb{E}\sum_{j \in \mathcal{D}} |\nabla f(x_t)[j]|
\leq \sum_{j \in \mathcal{D}} \mathbb{E}\sqrt{\mathbb{E}_\xi\,\big|\nabla f(x_t; \xi^i_t)[j] - \nabla f(x_t)[j]\big|^2}
\leq \sqrt{d}\,\sigma,
$$
so the inner product $\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\rangle$ remains bounded. Hence we can merge both cases into a unified bound by simply replacing $R_t$ with the finite $\rho_t$:
$$
\rho_t[k] = \begin{cases} 0 & \text{if } \mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])] = 0, \\ \mathbb{E}[\tilde m^i_{t+1}[k]]/\mathbb{E}[\mathrm{sign}(\tilde m^i_{t+1}[k])] & \text{otherwise}, \end{cases}
$$
and by adding one constant $D = \max\{1, \sigma/(2\sqrt{d}\,\beta_1\beta_2^T\|\nabla f(x_0)\|)\} \geq 1$ to make the finite-case bound adaptive to the infinite case, so that $D \cdot 2\sqrt{d}\,\beta_1\beta_2^t\|\nabla f(x_0)\| \geq \sigma$ for all $1 \leq t \leq T$. Finally, we have the bound for both cases:
$$
\mathbb{E}\big\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle
\leq 2\sqrt{d}\Big[\beta_1\Big(D\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}\Big) + \frac{(1-\beta_1)\sigma}{\sqrt{N}}\Big]
+ \frac{2\bar\rho\sqrt{d}\,(\sqrt{C}\,\sigma + 1)}{\sqrt{N}}.
$$
Then we have
$$
f(x_{t+1}) - f(x_t)
\leq -\epsilon S(x_t) + 2L\epsilon^2 d + \epsilon\,\mathbb{E}\big\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde M_{t+1})\big\rangle,
$$
and a telescope over $t = 1, \ldots, T$ yields
$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2D\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ \frac{2\bar\rho\sqrt{d}\,(\sqrt{C}\,\sigma + 1)}{\sqrt{N}}
+ 2L\epsilon d,
$$
where $\bar\rho = \max_{1 \leq t \leq T}\|\rho_t\|_\infty$. ∎

Lemma B.7. Let $(X, Y)$ be a joint random variable on $\mathbb{R}^d \times \mathbb{R}^d$. For any constant $a \in (0, +\infty)$, we have
$$
\mathbb{E}[\langle X, \mathrm{sign}(X) - \mathrm{sign}(Y)\rangle] \leq 2a\sqrt{d}\;\mathbb{E}\|X/a - Y\|.
$$

Proof. Without loss of generality, set $a = 1$. Then
$$
\mathbb{E}[\langle X, \mathrm{sign}(X) - \mathrm{sign}(Y)\rangle]
= \mathbb{E}[\|X\|_1 - \langle X, \mathrm{sign}(Y)\rangle]
\leq 2\,\mathbb{E}[\|X - Y\|_1] \quad \text{//Lemma B.8}
\leq 2\sqrt{d}\;\mathbb{E}[\|X - Y\|], \quad \text{//Cauchy-Schwarz}
$$
where $\|\cdot\|_1$ is the $\ell_1$ norm and $\|\cdot\|$ denotes the Euclidean norm. ∎

Lemma B.8. For any $x, y \in \mathbb{R}$, we have $|x| - x\,\mathrm{sign}(y) \leq 2|x - y|$.

Proof. If $\mathrm{sign}(y) = \mathrm{sign}(x)$, we have $|x| - x\,\mathrm{sign}(y) = 0 \leq 2|x - y|$. If $\mathrm{sign}(y) = -\mathrm{sign}(x)$, we have $|x| - x\,\mathrm{sign}(y) = 2|x| \leq 2|x| + 2|y| = 2|x - y|$. If $\mathrm{sign}(y) = 0$, we have $|x| - x\,\mathrm{sign}(y) = |x| = |x - y| \leq 2|x - y|$. ∎

Lemma B.9. Let $X$ be a random variable in $\mathbb{R}$. Then $\mathbb{E}\big|\mathrm{sign}(X) - \mathbb{E}[\mathrm{sign}(X)]\big|^2 < 1$.

Proof. The result is a direct derivation from the variance of a Bernoulli-type variable:
$$
\mathbb{E}\big|\mathrm{sign}(X) - \mathbb{E}[\mathrm{sign}(X)]\big|^2 = \mathbb{E}[\mathrm{sign}(X)^2] - \mathbb{E}[\mathrm{sign}(X)]^2 < 1. \;\; ∎
$$

Lemma B.10. Under the same setting as Theorem B.6, we have
$$
\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^{N} m^i_t - \nabla f(x_t)\Big\|
\leq \beta_2^t\,\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}.
$$

Proof. We use the notation $g^i_t := \nabla f(x_t; \xi^i_t)$, $M_t = \frac{1}{N}\sum_{i=1}^{N} m^i_t$, $\varepsilon_t := M_t - \nabla f(x_t)$, $g_t = \frac{1}{N}\sum_{i=1}^{N} g^i_t$, $\delta_t := g_t - \nabla f(x_t)$, and $s_t = \nabla f(x_{t-1}) - \nabla f(x_t)$. Then
$$
\varepsilon_t = M_t - \nabla f(x_t) = \beta_2 M_{t-1} + (1-\beta_2) g_t - \nabla f(x_t)
= \beta_2(M_{t-1} - \nabla f(x_{t-1})) + (1-\beta_2)(g_t - \nabla f(x_t)) + \beta_2(\nabla f(x_{t-1}) - \nabla f(x_t)),
$$
that is, $\varepsilon_t = \beta_2\varepsilon_{t-1} + (1-\beta_2)\delta_t + \beta_2 s_t$. Under the $L$-smoothness Assumption B.2,
$$
\|s_t\| = \|\nabla f(x_{t-1}) - \nabla f(x_t)\| \leq L\|x_{t-1} - x_t\| \leq 2L\epsilon\sqrt{d}, \tag{13}
$$
where $\epsilon$ is the step size. Using mathematical induction, we have
$$
\varepsilon_t = \beta_2^t\varepsilon_0 + \sum_{i=1}^{t}\beta_2^{t-i+1} s_i + (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\delta_i. \tag{14}
$$
Taking norms of both sides of the above equation and using the bound (13), we obtain
$$
\|\varepsilon_t\| \leq \beta_2^t\|\varepsilon_0\| + 2L\epsilon\sqrt{d}\sum_{i=1}^{t}\beta_2^{t-i+1} + (1-\beta_2)\Big\|\sum_{i=1}^{t}\beta_2^{t-i}\delta_i\Big\|.
$$
Taking expectations on both sides,
$$
\mathbb{E}\|\varepsilon_t\| \leq \beta_2^t\|\varepsilon_0\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + (1-\beta_2)\,\mathbb{E}\Big\|\sum_{i=1}^{t}\beta_2^{t-i}\delta_i\Big\|.
$$
Note that the random variables $(\delta_i)_{1 \leq i \leq t}$ are mean zero; using Lemma B.11, we have
$$
\mathbb{E}\Big\|\sum_{i=1}^{t}\beta_2^{t-i}\delta_i\Big\| \leq \sqrt{\sum_{i=1}^{t}\beta_2^{2t-2i}\,\frac{\sigma^2}{N}} \leq \frac{\sigma}{\sqrt{N(1-\beta_2^2)}},
$$
so that
$$
\mathbb{E}\|\varepsilon_t\| \leq \beta_2^t\|\varepsilon_0\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}.
$$
Since $M_0 = 0$ under our setting, $\|\varepsilon_0\| = \|\nabla f(x_0)\|$, and we conclude
$$
\mathbb{E}\|\varepsilon_t\| \leq \beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}. \;\; ∎
$$

Lemma B.11 (Cumulative error of stochastic gradient [4]). Assume the same settings as in Theorem B.6.
Define $Y_k := \sum_{l=1}^{k}\alpha_l\delta_l$, where $\delta_t := g_t - \nabla f(x_t)$ with $g_t = \frac{1}{N}\sum_{i=1}^{N} g^i_t$ and $g^i_t := \nabla f(x_t; \xi^i_t)$ following the update in (11), and $\{\alpha_l \colon l = 0, 1, \ldots\}$ is a deterministic sequence. Then $Y_k$ is a martingale, and
$$
\mathbb{E}\big[\|Y_k\|^2\big] \leq \frac{\sigma^2}{N}\sum_{l=1}^{k}\alpha_l^2.
$$

Proof. We simply check the definition of a martingale. First, we have
$$
\mathbb{E}[|Y_k|] \leq \sum_l |\alpha_l|\,\mathbb{E}[|\delta_l|] \quad \text{//triangle inequality}
= \sum_l |\alpha_l|\,\mathbb{E}\big[\mathbb{E}[|\delta_l| \mid x_l]\big] \quad \text{//law of total probability}
\leq \sum_l |\alpha_l|\,\mathbb{E}\Big[\sqrt{\mathbb{E}[\delta_l^2 \mid x_l]}\Big] \quad \text{//Jensen's inequality}
\leq \sum_l |\alpha_l|\,\sigma < \infty. \quad \text{//Assumption B.3}
$$
Second, again using the law of total probability,
$$
\begin{aligned}
\mathbb{E}[Y_{k+1} \mid Y_1, \ldots, Y_k]
&= \mathbb{E}[Y_{k+1} \mid \alpha_1\delta_1, \ldots, \alpha_k\delta_k] \\
&= Y_k + \alpha_{k+1}\,\mathbb{E}[\delta_{k+1} \mid \alpha_1\delta_1, \ldots, \alpha_k\delta_k] \\
&= Y_k + \alpha_{k+1}\,\mathbb{E}\big[\mathbb{E}[\delta_{k+1} \mid x_{k+1}, \alpha_1\delta_1, \ldots, \alpha_k\delta_k] \,\big|\, \alpha_1\delta_1, \ldots, \alpha_k\delta_k\big] \\
&= Y_k + \alpha_{k+1}\,\mathbb{E}\big[\mathbb{E}[\delta_{k+1} \mid x_{k+1}] \,\big|\, \alpha_1\delta_1, \ldots, \alpha_k\delta_k\big] = Y_k.
\end{aligned}
$$
This completes the proof that $Y_k$ is a martingale. We now use the properties of martingale difference sequences to establish a variance bound on the martingale:
$$
\mathbb{E}\Big[\Big(\sum_{l=1}^{k}\alpha_l\delta_l\Big)^2\Big]
= \sum_{l=1}^{k}\mathbb{E}[\alpha_l^2\delta_l^2] + 2\sum_{l < l'}\mathbb{E}[\alpha_l\delta_l\,\alpha_{l'}\delta_{l'}]
= \sum_{l=1}^{k}\alpha_l^2\,\mathbb{E}[\delta_l^2]
\leq \frac{\sigma^2}{N}\sum_{l=1}^{k}\alpha_l^2,
$$
since the cross terms vanish by the martingale property. ∎

#### B.2.2 Averaging Convergence

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $N$ is the number of workers. On the $i$-th worker, consider the following scheme based on averaging:
$$
g^i_t := \nabla f(x_t; \xi^i_t), \qquad
m^i_{t+1} = \beta_2 m^i_t + (1-\beta_2) g^i_t, \qquad
\tilde m^i_{t+1} = \beta_1 m^i_t + (1-\beta_1) g^i_t,
$$
$$
x_{t+1} = x_t - \epsilon\Big(\frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1}) + \lambda x_t\Big). \quad \text{//Averaging} \tag{15}
$$

Theorem B.13 (Convergence in Phase II). Let Assumptions B.2 and B.3 hold, consider the scheme in (15) with $\beta_1, \beta_2 \in (0, 1)$, $\beta_2 > \beta_1$, $\epsilon, \lambda > 0$, and $\|\lambda x_0\|_\infty \leq 1$. We have
$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{1+\beta_2}}
+ 2(1-\beta_1)\sigma\sqrt{d}
+ 2L\epsilon d.
$$

Proof. For notation, write $\bar M_{t+1} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{sign}(\tilde m^i_{t+1})$, which yields $x_{t+1} = x_t - \epsilon\bar M_{t+1} - \epsilon\lambda x_t$. Following Theorem B.1 from Phase I, once $\|\lambda x_0\|_\infty \leq 1$, we stay within the constraint set with $\|\lambda x_t\|_\infty \leq 1$ for all subsequent times $t \geq 0$. Following a similar procedure as in Theorem B.6, we have
$$
\begin{aligned}
f(x_{t+1}) - f(x_t)
&\leq \langle \nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \\
&= -\epsilon\,\langle \nabla f(x_t), \bar M_{t+1} + \lambda x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \\
&= -\epsilon\,\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) + \lambda x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 + \epsilon\,\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \bar M_{t+1}\rangle \\
&\leq -\epsilon S(x_t) + 2L\epsilon^2 d + \epsilon\,\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \bar M_{t+1}\rangle.
\end{aligned}
$$
Let us bound the last term, $\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \bar M_{t+1}\rangle$:
$$
\begin{aligned}
\mathbb{E}\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \bar M_{t+1}\rangle
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde m^i_{t+1})\rangle \\
&= \mathbb{E}\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde m^i_{t+1})\rangle \quad \text{//$\{\tilde m^i_{t+1}\}_{1 \leq i \leq N}$ are i.i.d.} \\
&\leq 2\sqrt{d}\,\mathbb{E}\|\nabla f(x_t) - \tilde m^i_{t+1}\| \quad \text{//Lemma B.7} \\
&\leq 2\sqrt{d}\,\big(\beta_1\,\mathbb{E}\|\nabla f(x_t) - m^i_t\| + (1-\beta_1)\,\mathbb{E}\|\nabla f(x_t) - g^i_t\|\big) \quad \text{//triangle inequality} \\
&\leq 2\sqrt{d}\,\Big[\beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{1+\beta_2}}\Big) + (1-\beta_1)\sigma\Big]. \quad \text{//Lemma B.10 with $N = 1$}
\end{aligned}
$$
Then we have
$$
f(x_{t+1}) - f(x_t) \leq -\epsilon S(x_t) + 2L\epsilon^2 d + 2\epsilon\sqrt{d}\,\Big[\beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{1+\beta_2}}\Big) + (1-\beta_1)\sigma\Big].
$$
Hence, a telescope yields
$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{1+\beta_2}}
+ 2(1-\beta_1)\sigma\sqrt{d}
+ 2L\epsilon d. \;\; ∎
$$

#### B.2.3 Global Lion Convergence

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $N$ is the number of workers. On the $i$-th worker, consider the following scheme based on the global Lion:
$$
g^i_t := \nabla f(x_t; \xi^i_t), \qquad
m^i_{t+1} = \beta_2 m^i_t + (1-\beta_2) g^i_t, \qquad
\tilde m^i_{t+1} = \beta_1 m^i_t + (1-\beta_1) g^i_t,
$$
$$
x_{t+1} = x_t - \epsilon\Big(\mathrm{sign}\Big(\frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}\Big) + \lambda x_t\Big). \quad \text{//Global Lion} \tag{16}
$$

Theorem B.14 (Convergence in Phase II). Under Assumptions B.2 and B.3, consider the scheme in (16) with $\beta_1, \beta_2 \in (0, 1)$, $\beta_2 > \beta_1$, $\epsilon, \lambda > 0$, and $\|\lambda x_0\|_\infty \leq 1$. We have
$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ 2L\epsilon d.
$$

Proof. For notation, write $\tilde G_{t+1} = \frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}$, which yields $x_{t+1} = x_t - \epsilon\,\mathrm{sign}(\tilde G_{t+1}) - \epsilon\lambda x_t$. Following Theorem B.1 from Phase I, once $\|\lambda x_0\|_\infty \leq 1$, we stay within the constraint set with $\|\lambda x_t\|_\infty \leq 1$ for all subsequent times $t \geq 0$. Following the same procedure as in Theorem B.6, we have
$$
\begin{aligned}
f(x_{t+1}) - f(x_t)
&\leq \langle \nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \\
&= -\epsilon\,\langle \nabla f(x_t), \mathrm{sign}(\tilde G_{t+1}) + \lambda x_t\rangle + \tfrac{L}{2}\|x_{t+1} - x_t\|_2^2 \\
&\leq -\epsilon S(x_t) + 2L\epsilon^2 d + \epsilon\,\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde G_{t+1})\rangle.
\end{aligned}
$$
Let us bound $\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde G_{t+1})\rangle$:
$$
\begin{aligned}
\mathbb{E}\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}(\tilde G_{t+1})\rangle
&= \mathbb{E}\Big\langle \nabla f(x_t), \mathrm{sign}(\nabla f(x_t)) - \mathrm{sign}\Big(\frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}\Big)\Big\rangle \\
&\leq 2\sqrt{d}\,\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_{i=1}^{N}\tilde m^i_{t+1}\Big\| \quad \text{//Lemma B.7} \\
&\leq 2\sqrt{d}\,\Big(\beta_1\,\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_i m^i_t\Big\| + (1-\beta_1)\,\mathbb{E}\Big\|\nabla f(x_t) - \frac{1}{N}\sum_i g^i_t\Big\|\Big) \quad \text{//triangle inequality} \\
&\leq 2\sqrt{d}\,\Big[\beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}\Big) + \frac{(1-\beta_1)\sigma}{\sqrt{N}}\Big]. \quad \text{//Lemma B.10}
\end{aligned}
$$
Then we have
$$
f(x_{t+1}) - f(x_t) \leq -\epsilon S(x_t) + 2L\epsilon^2 d + 2\epsilon\sqrt{d}\,\Big[\beta_1\Big(\beta_2^t\|\nabla f(x_0)\| + \frac{2L\epsilon\sqrt{d}}{1-\beta_2} + \frac{\sigma}{\sqrt{N(1+\beta_2)}}\Big) + \frac{(1-\beta_1)\sigma}{\sqrt{N}}\Big].
$$
Hence, a telescope yields
$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[S(x_t)]
\leq \frac{f(x_0) - f^*}{T\epsilon}
+ \frac{2\beta_1\beta_2\sqrt{d}\,\|\nabla f(x_0)\|}{T(1-\beta_2)}
+ \frac{4\beta_1 L\epsilon d}{1-\beta_2}
+ \frac{2\beta_1\sigma\sqrt{d}}{\sqrt{N(1+\beta_2)}}
+ \frac{2(1-\beta_1)\sigma\sqrt{d}}{\sqrt{N}}
+ 2L\epsilon d. \;\; ∎
$$

## NeurIPS Paper Checklist

1. Claims. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes]. Justification: The claims are supported with theoretical and empirical results.

2. Limitations. Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes]. Justification: In the Conclusion section, we mention that Distributed Lion can be further improved when combined with compression techniques. Currently, a limitation is that D-Lion (Avg) and D-Lion (MaVo) perform inconsistently across datasets and benchmarks, and it will be good to understand why in future work.

3. Theory Assumptions and Proofs. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]. Justification: We list our assumptions and results explicitly in the theory section.

4. Experimental Result Reproducibility. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes]. Justification: We provide the benchmark, algorithm, and hyperparameters for reproducing our results.

5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: All the data we use are public. We will release code upon acceptance.

6. Experimental Setting/Details. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes]. Justification: We provide the details for training and testing in the experiment section for reproducing our results.

7. Experiment Statistical Significance. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA]. Justification: We average the results over 3 seeds and report the mean in Figure 2 for the CIFAR experiment. For the larger-scale experiments, it is extremely computationally expensive to conduct the experiments multiple times, and the results are known to be relatively stable; hence we run each large-scale experiment once.

8. Experiments Compute Resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes]. Justification: We specify how many workers are needed for each experiment; the GPU type can be arbitrary as long as the model fits in memory.

9. Code of Ethics. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]. Justification: We believe our paper conforms in every respect with the Code of Ethics.

10. Broader Impacts. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes]. Justification: We do not think our work leads to any negative societal impact.

11. Safeguards. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA]. Justification: We think the paper poses no such risks.

12. Licenses for existing assets. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes]. Justification: We cite the data and benchmarks we use, and the baseline methods we compare against.

13. New Assets. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA]. Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA]. Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA]. Justification: The paper does not involve crowdsourcing nor research with human subjects.