# ldadam_adaptive_optimization_from_lowdimensional_gradient_statistics__e8d28a63.pdf

Published as a conference paper at ICLR 2025

LDADAM: ADAPTIVE OPTIMIZATION FROM LOWDIMENSIONAL GRADIENT STATISTICS

Thomas Robert1 , Mher Safaryan2, Ionut-Vlad Modoranu2, Dan Alistarh2

1Institut Polytechnique de Paris (IPP) 2Institute of Science and Technology Austria (ISTA)

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer s memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at https://github.com/IST-DASLab/LDAdam.

1 INTRODUCTION

Scaling up deep neural networks leads to ever-improving models (Kaplan et al., 2020), but also to ever-increasing memory and compute requirements. In this work, we focus on the memory cost of optimizer states, i.e., the additional firstand second-order gradient statistics required during the adaptive optimization process, which in the case of the baseline algorithm Adam (Kingma & Ba, 2017; Loshchilov & Hutter, 2019) reaches up to twice the size of the model. Our goal is to replicate Adam s performance and theoretical guarantees while significantly reducing its memory footprint.

We start from the observation that the optimization landscape of deep networks exhibits low intrinsic dimensionality (Li et al., 2018). Low-rank learning has been successfully leveraged for model finetuning, with the Lo RA technique (Hu et al., 2021) and its many extensions (Dettmers et al., 2023; Liu et al., 2024; Nikdan et al., 2024) becoming standard. However, such adapter-based methods do not extend to end-to-end accurate training (Lialin et al., 2023; Zhao et al., 2024), nor even to finetuning on more demanding tasks (Zhang et al., 2023; Nikdan et al., 2024; Liu et al., 2024). Despite promising results such as Ga Lore (Zhao et al., 2024), designing memory-efficient full-parameter optimizers which leverage low-dimensional gradient statistics with provable convergence and good practical performance remains an open challenge.

Contributions. We propose Low-Dimensional Adam (LDAdam), a memory-efficient adaptive optimizer which, for the first time, incorporates low-rank compression of both gradients and optimizer states, with convergence guarantees under standard assumptions. The goal of the algorithm is to perform optimization steps within subspaces of lower dimension, while introducing mechanisms to 1) maintain relevant information from the last seen subspaces and 2) recover information lost due to low-rank compression. To reach this goal, our work is based on three main techniques. First, we formulate projection-aware update rules that allow us to perform adaptive optimization on dynamically changing coordinate systems and subspaces. Second, we leverage block power iteration (Bentbib & Kanber, 2015) at each step to efficiently and accurately project the gradient and optimizer states. Third, we use a generalized error feedback mechanism for both gradients and optimizer states. Moreover, we prove the convergence of LDAdam under standard assumptions and provide empirical evidence supporting its training efficiency.

Work performed as an intern at ISTA. Correspondence to dan.alistarh@ist.ac.at.

Published as a conference paper at ICLR 2025

When performing adaptive optimization within low-dimensional subspaces, a key challenge is to adapt the optimizer states to changes in the projection map. Adapting gradient statistics to new subspaces is not obvious and comes with two challenges. First, standard adaptive optimization algorithms are anisotropic: that is, the parameter-wise learning rate tuning implies that the choice of coordinate system used to express the gradient matrix impacts the optimization process. Second, the second-order statistics are nonlinear, which complicates estimation of the gradient statistics and error compensation. We address the first challenge by carefully selecting the coordinate system used for low-rank compression, i.e. the basis given by the most significant singular vectors. The second challenge is addressed by approximating the gradient statistics in the current subspace given the raw moments estimates in the previous subspace. Thus, with LDAdam, optimizer step is performed from the projection-aware intermediate optimizer states and the projected gradient, rather than from the optimizer states at the previous iteration and the full-rank gradient (Kingma & Ba, 2017). In turn, this leads to stronger convergence guarantees, and better practical performance.

The choice of projection map at each step is key to low-rank compression. Performing singular value decomposition (SVD) at each step would add prohibitive overhead; instead, we use block power iteration with warm initialization (Bentbib & Kanber, 2015; Vogels et al., 2020), which can achieve a tight approximation of the most significant singular vectors of successive gradients, at low cost. To preserve the information from the last iterations kept within the optimizer states, we show that one can perform the block power iteration on a barycenter of the newly computed gradient and an average of the previous gradients. Finally, to mitigate the errors due to low-rank projection, we introduce a new variant of the error feedback mechanism (Seide et al., 2014; Karimireddy et al., 2019; Alistarh et al., 2018) that supports compression of both gradient and optimizer states. Moreover, we propose a memory efficient implementation of it by reusing the space allocated to gradient accumulation.

The resulting LDAdam method is memory-efficient: for a weight layer of shape n m, the space cost of optimizer states is nr + 2rm, where r is the projection rank, while the cost of Adam s state is 2nm. This matches the efficiency of the popular Ga Lore algorithm (Zhao et al., 2024), the closest prior work in this area; and can bring major gains in practice, as usually r min(m, n). Relative to Ga Lore, LDAdam has the advantages of tracking gradient evolution in the adapted subspace at each step, and correcting for the projection error. This leads to better practical performance, but also to theoretical guarantees of convergence under standard assumptions. Specifically, for smooth non-convex objectives, LDAdam can preserve the asymptotic convergence rate of AMSGrad (Reddi et al., 2019), the provably-convergent version of Adam (see Theorem 1). Moreover, we show even faster rates for objectives that obey the Polyak-Łojasiewicz condition (see Theorem 2).

We validate the practical performance of LDAdam via an efficient Py Torch implementation. We apply LDAdam for fine-tuning Ro BERTa (Liu et al., 2019) and Llama-family (Touvron et al., 2023) models on the GLUE (Wang et al., 2018) and Grade-School Math (GSM) (Cobbe et al., 2021) benchmarks, respectively. In addition, we provide pre-training results for Llama-type models with 130M and 350M parameters. Across all tasks and models, LDAdam shows competitive accuracy relative to Adam, but with much smaller optimizer states relative to the baseline, and comparable runtime. Relative to Ga Lore, LDAdam shows consistently higher accuracy, for matching memory costs and similar runtime. Ablations show that this is a consequence of both our projection-aware intermediate updates and error correction.

In summary, we show that it is possible to achieve both theoretical guarantees and strong practical performance when performing adaptive optimization from low-dimensional gradient statistics. Our techniques may also be relevant to areas such as distributed optimization, while our implementation is relevant in reducing the practical memory overheads of training large language models (LLMs).

2 RELATED WORK

Memory-Efficient First-Order Methods in Optimization. Gradient statistics are used to determine the descent direction (Rumelhart et al., 1986) and to adapt the learning rate per parameter (Duchi et al., 2011; Hinton, 2012). Adam (Kingma & Ba, 2017) combines both ideas in a biascorrected design, and Adam W (Loshchilov & Hutter, 2019) adds weight decay. Although it is the de facto optimizer for deep neural networks, Adam/Adam W has been the subject of several extensions aimed at reducing its memory footprint. For example, Adafactor (Shazeer & Stern, 2018), CAME (Luo et al., 2023) and Adam-mini (Zhang et al., 2024a) rely on a factorized representation of the second-order statistic to reduce memory. Quantization has also been proposed to reduce the

Published as a conference paper at ICLR 2025

footprint of optimizer states (Dettmers et al., 2022), and fusing computation of the backward pass and the optimizer step (Lv et al., 2024b;a) has also proven effective.

Error Feedback in Distributed Optimization. Several methods have been proposed to reduce the communication overhead in distributed optimization: quantization (Alistarh et al., 2017), sparsification (Alistarh et al., 2018; Lin et al., 2020) and low-rank approximation (Vogels et al., 2020). Error feedback (Seide et al., 2014) has been shown to be effective in improving the empirical performance of biased gradient compression methods (Karimireddy et al., 2019; Vogels et al., 2020) methods and key to ensuring their convergence (Stich et al., 2018; Alistarh et al., 2018). Micro Adam (Modoranu et al., 2024) only stores sparse gradients from each optimization step, and uses error feedback for correction, but reconstructs the full optimizer state at each step, instead of compressing it.

Parameter-Efficient Fine-Tuning (PEFT). A natural approach for reducing memory overheads is to train less parameters (Han et al., 2024). This idea led to additive PEFT, where new trainable modules are added to the model architecture (Houlsby et al., 2019; Li & Liang, 2021); selective PEFT, where only certain parameters of the original model are retrained (Guo et al., 2021); and reparametrized PEFT, where new modules are trained and merged with the original parameters (Hu et al., 2021; Nikdan et al., 2024; Liu et al., 2024). However, methods that train significantly less parameters tend to underperform for pre-training and fine-tuning on challenging tasks.

Optimization via Low-Rank Gradient Projection. Ga Lore (Zhao et al., 2024), which partly motivated our work, performs adaptive optimization in a low-dimensional subspace that is updated periodically. The adaptive optimizer is fed with a low-rank representation of the gradients obtained via SVD, thus saving memory, and produces a low-rank representation of the descent direction, which is then up-projected to update the model parameters. The idea is similar to PEFT in that low-rank gradient projection yields selective training in the coordinate system induced by SVD. The addition of low-rank components to achieve full-rank training is also proposed in Re Lo RA (Lialin et al., 2023) or COLA (Xia et al., 2024). However, unlike PEFT methods, Ga Lore computes and exploits the full gradient of the original model. The final algorithm allows pre-training of 7B parameter language models on consumer GPUs. However, it can only do so with a very small batch size and without the ability to accumulate gradients over multiple mini-batches.

Ga Lore Comparison. Relative to Ga Lore, we propose the following new techniques: (1) we introduce a new projection-aware update rule (see Sections 3.1 and 3.2), which avoids accumulating low-rank gradients from different subspaces into the optimizer states, and performs frequent updates; (2) to correct for projection errors in both gradients and optimizer states, we introduce a new generalized error feedback mechanism (see Section 3.4). The convergence of Ga Lore is only guaranteed under a very strong stable-rank assumption, whereas we show convergence under standard conditions. Experimentally, we show that these improvements lead to consistently improved accuracy, and that they can be implemented efficiently in terms of both memory and runtime.

Several orthogonal improvements such as quantization (Dettmers et al., 2022; Zhang et al., 2024b), per-layer weight update (Lv et al., 2024b), and weight factorization (Jaiswal et al., 2024) have been proposed to further improve projection-based methods. Many of these improvements are compatible with LDAdam, and we plan to investigate them in future work.

3 THE LOW-DIMENSIONAL ADAM (LDADAM) ALGORITHM

3.1 THE NEED FOR A PROJECTION-AWARE UPDATE RULE

Standard Adaptive Optimization. We denote f the loss function to minimize, θt the model parameters at step t, Gt = θf(θt) Rn m the gradient at θt, gt its mini-batch stochastic counterpart, ηt the learning rate, and ϵ a positive scalar used for numerical stability. The standard adaptive optimization relies on the gradient statistics estimates mt and vt, which are obtained from the exponential moving average of the gradients at rate β1 and the squared gradients at rate β2, respectively. With ˆmt = mt 1 βt 1 and ˆvt = vt 1 βt 2 their unbiased counterpart, Adam s update is: θt+1 = θt ηt ˆmt ˆvt+ϵ.

The Challenges Behind Low-Rank Updates. When adapting the learning subspaces to capture the low-rank gradient structure, one expects the optimizer states to retain information from previous iterations. However, frequent updates of the low-rank projection causes significant challenges: first, projections are lossy and gradient information is lost; second, since the projections may be different

Published as a conference paper at ICLR 2025

between steps, subspace adaptation may alter the representation of the low-dimensional optimizer states. To illustrate this second point, we briefly discuss the Ga Lore solution (Zhao et al., 2024).

Let r be the compression rank, and U t and Ut be the projection and back-projection matrices we use at step t. U t Rr n yields a truncation to its first r rows of the gradient matrix written in an adapted coordinate system, leading to a low-dimensional representation of the gradient with shape r m. Multiplication with the matrix Ut is then used to express the compressed gradient in the canonical high-dimensional coordinate system. From now on, let mt and vt (resp. ˆmt and ˆvt) be the first and second moment estimates (resp. unbiased estimates) in the low-dimensional space induced at step t by U t , and let Mt = Ut mt denote the back-projected low-rank counterpart of mt.

In this context, Equation 1 describes the Ga Lore dynamics (Zhao et al., 2024, Algorithm 2):

θt+1 = θt ηt Ut ˆmt ˆvt+ϵ = θt ηt ˆvt+ϵ 1 β1 1 βt 1 Pt τ=1 βt τ 1 Ut U τ gτ. (1)

Notice that, in the above equation, the change in the low-dimensional projection basis between steps is ignored: as can be seen in the right-hand sum, gradients are projected and back-projected using different maps, i.e. Ut and U τ , respectively. Moreover, the statistics of gradients projected into different subspaces are accumulated on the same momentum buffers. The issue we raise here (highlighted in bold in Equation 1) is partially mitigated in practice by heuristics: occasional updates of the projection map and exponential decay rates lead to only a fraction of the steps being effectively resulting from accumulation of compressed gradients in different subspaces. However, we will see experimentally that this leads to a drop in accuracy relative to the correct projections.

3.2 THE LDADAM UPDATE RULE

Overview. With LDAdam, we address two key challenges in low-rank updates: we introduce a projection-aware update rule that accounts for the change in the low-dimensional representation of the optimizer states, and we provide a generalized error feedback mechanism that accounts for the loss of gradient accuracy due to low-rank compression. Compared to standard adaptive optimization step, optimizer states are replaced by low-dimensional intermediate optimizer states, denoted by mt 1/2, vt 1/2; and the gradient is replaced by the accumulator at = U t At obtained by lowrank projection of the high-dimensional accumulator At = gt + ξt, where ξt is the (generalized) error feedback buffer, storing projection errors. In the following, we build the LDAdam algorithm step-by-step. The end-to-end construction is provided in Algorithm 1.

Our first objective is to find a mechanism to transfer gradient information from one lower dimensional vector space to another. For this, we take inspiration from orthogonal projections, which provide the best approximation in terms of ℓ2 error. The projection-aware update rule for the first moment estimate (Algorithm 1 Line 8) follows from the multiplication with the matrix U t Ut 1, which stands for both the projection matrix and the change of basis matrix. The matrix U t Ut 1 Rr r is low-dimensional and thus allows efficient transition between subspaces, as no intermediate highdimensional vector is created. We obtain the following update:

mt = β1PROJUt(Mt 1) + (1 β1)at = β1 U t Ut 1 mt 1 | {z } mt 1/2

+(1 β1)at. (2)

Optimizer States Compression. Note already that projection yields compression. Yet, the discrepancy between successive subspaces leads to a loss of information contained in the optimizer states. We address this by enforcing slight shifts between subspaces, and by generalizing error feedback to account for the updating of optimizer states. The information lost on gradient momentum due to compression, and that needs to be reintroduced via the error feedback mechanism, is given by:

Mt 1 Ut U t Mt 1 = Ut 1 mt 1 Ut mt 1/2. (3)

Estimating the Statistics of Projected Gradients. Although reliable, the orthogonal projection of the optimizer state does not apply to the estimation of the second-order statistic. In particular, Adam does not rely solely on linear operations and is anisotropic (see Appendix A). To overcome these challenges, we interpret Adam s optimizer states as statistical estimates of the first two moments of each gradient coordinate. Indeed, Adam s optimization step t holds for an approximation of (Et,β1[Gei]/ p

Et,β2[(Gei)2])i n, where Et = (e1, . . . , en) is the basis provided by the canonical

Published as a conference paper at ICLR 2025

parameterization of the model, Gei = G, ei a column of the gradient matrix, and Et,β[ ] is the exponential time-weighted expectation from time t with decay-rate β.

We rely on this observation to rewrite Adam s update in any given coordinate system Et+1: we build ˆmt+1/2 and ˆvt+1/2 as statistical estimates of (Et,β1[G ei])i n and (Et,β2[(G ei)2])i n. It follows from G ei = P

j n ei, ej Gej that the matrix of change of basis ( ei, ej )i,j n provides transition between coordinate systems. Its truncation to i, j r induces a rank-r orthogonal projection of the gradients, and thus holds for the transition between subspaces:

Et,β1[G ei] = Pn j=1 ei, ej Et,β1[Gej] Pr j=1 ei, ej ( ˆmt)j = (U t+1 Ut ˆmt)i. (4)

Linearity of expectation leads to an autoregressive update rule for the first moment estimate (see Equation 4). With ( ei, ej )i,j = (U t+1 Ut)i,j the estimation of the first-order statistic in the coordinate system Et+1 resolves to an orthogonal projection when considering only the first r components (i.e., after truncation to i, j r), and the above discussion on compression of optimizer states remains valid. Following our statistical approach, we obtain the projection-aware update rule for the second moment estimates (Algorithm 1 Line 9) from the approximation below:

Et,β2[(G ei)2] = Pn j=1 ei, ej 2Et,β2[(Gej)2] + Pn k =l ei, ek ei, el Et,β2[G ek G el]

Pr j=1 ei, ej 2(ˆvt)j + Pr k =l ei, ek ei, el ( ˆmt)k( ˆmt)l. (5)

The approximation of the second-order statistics in a new coordinate system requires the estimation of the covariance between the gradient coordinates (see Equation 5). Although too large to store, the covariance can be approximated by the product of first-order moment estimates, assuming independence between gradient coordinates. The latter assumption is enforced on average over the gradient matrix columns by the nature of the coordinate system inherited from SVD. In practice, we guarantee a positive estimate of the second-order statistics by clipping value to zero if necessary.

3.3 LEARNING SUBSPACE ADAPTATION

LDAdam s memory savings stem from the low-dimensional structure of the moments estimate. The compression strategy, defined by a choice of truncated coordinate systems, sets the exploration of the parameter space, and its design meets two criteria: (i) the learning subspace must be adapted at each step to integrate error feedback; (ii) the compression operator must be set to approximate the current gradient and to preserve the information stored in the optimizer states.

Projection Map Computation. Although singular value decomposition (SVD) provides an optimal low-rank approximation, its computational cost makes it prohibitive in our case. We follow Power SGD (Vogels et al., 2020) to efficiently approximate the most significant singular vectors: we perform a single block power iteration (Bentbib & Kanber, 2015) initialized with the approximation from the previous optimization step; and apply the Gram-Schmidt process to derive the projection map (Algorithm 1 Line 6). In addition, we propose to start the whole process by computing SVD of the first gradient.

Learning A Smooth Subspace Transition. LDAdam relies at each step on compression of the gradient and the optimizer states via a unique low-rank projection map. Instead of fitting two projection maps and interpolate between them, we propose to fit a single projection map to the interpolation of the gradient and an average of the previous gradients (Algorithm 1 Line 5). We introduce an interpolation factor ρ and perform a block power iteration to approximate SVD of the matrix: Bt = ρ Ut 1 ˆmt 1 + (1 ρ)At. The interpolation factor helps to balance the accuracy of the gradient compression with the preservation of the optimizer states. We suggest setting ρ = β1, i.e. computing the optimal subspace for gradient momentum compression.

3.4 THE GENERALIZED ERROR FEEDBACK MECHANISM

Error Feedback was introduced for distributed optimization (Seide et al., 2014) to account for gradient compression errors by reintroducing them into the next iteration gradient. We extend this mechanism to account for loss of information on both the gradient and the optimizer states. We keep the same structure of the error feedback mechanism: the optimizer is fed by a unique accumulator that is a sum of the gradient and the error buffer.

Published as a conference paper at ICLR 2025

Unbiased Error Buffer Loading. With gradient and optimizer states decayed at different rates within the optimizer step, we must adjust the error buffer loading strategy of the optimizer states projection error by the factor β1/(1 β1) to recover the first moment estimate. Furthermore, the condition under which the error buffer loading strategy is consistent with the second moment estimate is given by: β2 = (1 β2)( β1 1 β1 )2. We suggest setting β2 = 0.99 and β1 0.908, as the slight decrease in the decay rate of the second moment estimate helps to adapt faster to a new learning subspace. Equation 6 describes the loading strategy of the generalized error feedback mechanism (Algorithm 1 Line 16).

ξt+1 = [At Ut at] + β1(β2) 1 β1(β2)[Ut 1 mt 1 Ut mt 1/2]. (6)

Implementing Error Feedback. A naive implementation of the error feedback mechanism would add memory overhead equal to the model size. Since the gradient and the error buffer are added together before the optimization step is performed, we store the error buffer in the variable used for gradient accumulation (Algorithm 1 Lines 16 and 3)(e.g. in Py Torch, modify the optimizer.zero grad function to store the error buffer in the p.grad variable). This results in an error feedback mechanism that does not require any further memory for optimizer states.

Algorithm 1 LDAdam ( Practical View Only, gt Rn m / Analytical View Only, gt Rd )

Hyperparameters: step size ηt; decay rates β1, β2 LDAdam Hyperparameters: projection rank r; interpolation factor ρ LDAdam Hyperparameters: contraction factor qr [0, 1); ρ = β1 1: Initialization: m0 = 0; v0 = 0; A0 = 0, U0 = SVD(g0); v0 = 0; ξ1 = 0 2: for t = {1, 2, . . . , T} do

Gradient accumulation and error buffer unloading 3: At = At + gt 4: At = ξt + gt Learning subspace adaptation 5: Bt = ρ Ut 1 ˆmt 1 + (1 ρ)At 6: Ut = GRAM-SCHMIDT(Bt B t Ut 1) 7: Ut is any d r orthogonal matrix such that (I Ut U t )Bt qr Bt Optimizer states projection-aware update 8: mt 1/2 = U t Ut 1mt 1 9: vt 1/2 = (1 βt 1 2 ) (U t Ut 1)2 (ˆvt 1 ( ˆmt 1)2) + (U t Ut 1 ˆmt 1)2 Optimizer states Adam-type update 10: at = U t At 11: mt = β1mt 1/2 + (1 β1)at 12: vt = β2vt 1/2 + (1 β2)a2 t 13: vt = max(vt, vt 1 max) Model update 14: θt+1 = θt ηt Ut ˆmt ˆvt+ϵ 15: θt+1 = θt ηt Ut mt vt+ϵ Error buffer loading 16: At+1 = (At Ut at) + β1 1 β1 (Ut 1 mt 1 Ut mt 1/2) = ξt+1 17: end for

4 CONVERGENCE GUARANTEES FOR LDADAM

Analytical Overview. We now present our theoretical convergence results for LDAdam. We first provide/discuss an analytical perspective on our approach in Algorithm 1. Then we introduce and discuss the analytical assumptions used in the theory, along with two theoretical convergence rates.

The outlined steps closely resemble those in the practical view of the algorithm. One notable difference is the introduction of a new AMSGrad-type normalization, vt = max(vt, || vt 1||max) in line 13. This represents uniform version of the original AMSGrad technique (Reddi et al., 2019), vt = max(vt, vt 1), which enforces coordinate-wise monotonicity for the diagonal preconditioning. In our setup, as we transition between different low-dimensional spaces, the preconditioning matrix

Published as a conference paper at ICLR 2025

is no longer diagonal, and scaling factors may change after such a transition. To ensure that adaptive preconditioning remains monotonic, we enforce a monotonic spectrum of the preconditioning.

To avoid unnecessary complications in the notations and derivations, in this section we assume that the stochastic gradients gt and other variables derived from the gradients, such as momentum, are vectors. We remind the reader that, in practice and in our implementation, gradients are viewed/represented as 2D matrices. In our analysis, vector gradients should be treated as the columns of the gradient matrix. Nevertheless, all steps in Algorithm 1 are still applicable when dealing with matrix gradients. Specifically, given a matrix gradient gt, we obtain a matrix Bt, for which we assume that, regardless of the low-rank compression method used (e.g., SVD or power iterations), the contractive inequality (I U t Ut)Bt F qr Bt F holds with respect to the Frobenious norm. We also incorporate the debiasing factor p

1 βt 2/(1 βt 1) in ηt.

Convergence Guarantees for General Smooth Non-convex Functions. We first outline the analytical assumptions under which we establish LDAdam s convergence guarantees. Assumption 1 (Lower bound and smoothness). The loss function f : Rd R is L-smooth and lower bounded by some f R: f(θ) f(θ ) L θ θ , for any θ, θ Rd. Assumption 2 (Unbiased and bounded stochastic gradient). For all iterates t 1, the stochastic gradient gt at θt is unbiased and uniformly bounded by some constant G 0:

E[gt] = f(θt), gt G.

Assumption 3 (Bounded variance). For all iterates t 1, the variance of the stochastic gradient gt at θt is uniformly bounded by some constant σ2 0: E[ gt f(θt) 2] σ2.

All three assumptions, including the bounded gradient condition, are standard and commonly used in adaptive optimization literature (Reddi et al., 2019; Chen et al., 2019; D efossez et al., 2022; Modoranu et al., 2024). Attempting to relax the bounded gradient condition, some works (Shi et al., 2020; Zhang et al., 2022; Wang et al., 2024) resorted to using a strong growth condition (which is not necessarily a relaxation over bounded gradient condition) or bounded stochastic noise conditions (Li et al., 2023; Hong & Lin, 2024). Taniguchi et al. (2024) achieved an optimal rate by modifying Adam and only slightly relaxing the bounded gradient assumption by requiring the expected gradient norm to be bounded. The bounded gradient condition in Assumption 2 directly implies a bounded variance condition with a constant σ2 = 2G2. Therefore, for the purpose of asymptotic analysis, one can omit Assumption 3. However, we will show that in the non-convex convergence rate, the σ2-term decays at a rate of 1/

T, while all other G-terms decay at a faster rate of 1/T. Theorem 1 states our general non-convex convergence result. Theorem 1 (Non-convex convergence rate). Let Assumptions 1, 2 and 3 hold. Then, choosing step-size η = min( ϵ 4LC0 1+C2 , 1

T ), LDAdam (Algorithm 1) satisfies

1 T PT t=1 E[ f(θt) 2] 2C0

f(θ1) f + Lσ2

with constants C0 := q

1+β2 1 β2 (1 β1(1 qr))2 (1 β1)2(1 qr)2 G2 + ϵ and C2 = β1+(1 β1)q2 r (1 β1)2(1 qr)2 .

Discussion. Compared to the standard stochastic gradient descent (Ghadimi & Lan, 2016), the leading term 1/

T of the obtained rate recovers the optimal non-convex convergence speed. Additionally, the asymptotic rate O( 1+σ2

T ) aligns with the rate of uncompressed AMSGrad in the stochastic non-convex setup (Zhou et al., 2024). Therefore, the introduced low-rank compression framework along with the error feedback mechanism does not hurt the asymptotic convergence with respect to T. The slowdown caused by compression in the leading term is only through the constant C0 = O( 1 1 qr ), where the factor 1 qr (0, 1] measures the aggressiveness of the compression. Note that our theory applies to any low-rank compression with r 1. More compression corresponds to a smaller rank r for the projections and a smaller factor 1 qr. For instance, when using low-rank compression via SVD decomposition, one can prove 1 qr = Θ( r

d) for any rank r 1. In this case, the leading term of the rate becomes O( d

T ), indicating that the convergence slows down in the worst case by a factor of d/r due to compression. The proof can be found in Appendix E.2.

Convergence Rate for Non-Convex Functions under the PL Condition. Next, we show faster convergence for LDAdam when the loss function satisfies the PL condition.

Published as a conference paper at ICLR 2025

Assumption 4 (PL-condition). For some constant µ > 0 the loss function f satisfies the PolyakŁojasiewicz (PL) inequality: f(θ) 2 2µ(f(θ) f ), for any θ Rd. Theorem 2. (PL convergence rate) Let Assumptions 1, 2, 3 and 4 hold. Then, choosing step-size η = min(η0, 2C0 log T

µT ), LDAdam (Algorithm 1) satisfies

E[f(θT +1)] f log T

µ2ϵ + 6C0(1+C1)G2

µ ϵ + e O G4 T 2

with constants C1 := β1+(1 β1)qr

(1 β1)(1 qr) and η0 := min( ϵ 16LC0 , C0(1 β1)(1 qr)

Discussion. Notably, the leading term of the convergence rate under the PL condition improves to O( log T

T ), compared to O( 1

T ) in the general non-convex setting. Faster rates of e O( 1

T ) for adaptive optimization algorithms under the PL condition have been established when β2 1 (He et al., 2023) or through the use of the AMSGrad trick (Modoranu et al., 2024). In LDAdam, we apply a uniform version of AMSGrad normalization with arbitrary β2 < 1 and derive the rate for compression with any rank r 1. Hence, similar to the general non-convex case, we show the best-known convergence rate in the leading term, up to a logarithmic factor. While the last term has higher-order constant dependencies, these are negligible due to the T 2 damping effect. Thus, the theory suggests that the algorithm s convergence rate remains comparable to that of the uncompressed version, with only a constant factor affected by the compression rank. To quantify the slowdown caused by compression, we again consider low-rank compression via SVD decomposition, as discussed earlier. As noted before, C0 = O( 1 1 qr ) = O( d

r ), and similarly, from the expression of C1, we have

C1 = O( 1 1 qr ) = O( d

r ). Therefore, the leading term of the rate becomes O(( d

T ), with a compression slowdown factor of ( d

r )2. The detailed proof is deferred to Appendix E.3.

5 TRAINING MEMORY FOOTPRINT AND RUNTIME

LDAdam s memory savings over Adam (Kingma & Ba, 2017) (see Table 1 for comparison) follow from the low dimensionality of the optimizer states: for gradients of shape n m with n m, the gradient statistics in a learning subspace of rank r are stored as tensors of shape r m, and the additional projection matrix is of shape n r. As a result, the memory footprint of LDAdam s optimizer states is only a fraction of the size of the model. For an n n layer, the compression ratio of LDAdam relative to Adam is of 3

2 r n. Table 6 and Figure 3 reports peak memory for fine-tuning and pre-training tasks.

With the error buffer stored in the gradient variable, the error feedback mechanism is memory neutral. However, this comes at the cost of not supporting gradient clipping and per-layer weight updates (Lv et al., 2024b; Zhao et al., 2024). The latter allows memory savings proportional to the model size, but at the expense of gradient accumulation. To be precise, the standard Ga Lore algorithm does not support gradient accumulation (see Appendix D.2 for a detailed explanation), while its newer retaining grad version (Zhao et al., 2024) does, but at the expense of a memory overhead equal to the size of the model. Although applicable to most optimizers, per-layer weight updates is not widely adopted, due to the large batch size used for large language model training (Touvron et al., 2023) and the memory-intensive nature of activations in attention layers.

Compared to Ga Lore, LDAdam adds optimizer states projection-aware update and the generalized error feedback mechanism. Yet, the practical runtime differences are not substantial: on Llama350M pre-training, LDAdam has a 10% longer runtime compared to Ga Lore, and is 15% slower than Adam. Table 7 and Figure 3 report runtimes, and Table 5 reports theoretical complexity in terms of standard matrix operations.

6 EXPERIMENTS

We empirically evaluate LDAdam for fine-tuning and pre-training large language models against baseline Adam (Kingma & Ba, 2017) and the memory-efficient Ga Lore (Zhao et al., 2024) (Algorithm 2). Similar to most parameter-efficient fine-tuning (PEFT) methods and to Ga Lore, we apply low-rank compression only to the two-dimensional matrices of the self-attention layers, e.g. wide embedding and output layers are trained using standard Adam optimizer. For a fair comparison, in

Published as a conference paper at ICLR 2025

Table 1: Optimizer comparison: parameter count during training for a weight layer of shape n m with n m (i.e., left projection), training capabilities, and estimates of optimizer states memory footprint in half precision (models architecture is described in Table 9).

Adam LDAdam Ga Lore (retaining grad) Ga Lore

Token count Weights nm nm nm nm Gradients nm nm nm Optimizer States 2nm nr + 2rm nr + 2rm nr + 2rm

Gradient Clipping Gradient Accumulation

Memory estimates Ro BERTa-base (r=8) 0.46 GB 0.15 GB 0.15 GB 0.15 GB Llama 350M (r=256) 1.37 GB 0.95 GB 0.95 GB 0.95 GB Llama-2 7B (r=32) 25.1 GB 1.22 GB 1.22 GB 1.22 GB Llama-2 7B (r=512) 25.1 GB 4.87 GB 4.87 GB 4.87 GB

all experiments we do not apply additional layer-wise learning rate scaling, and apply hyperparameters suggested in the papers introducing the algorithm. To ensure reproducibility, the full list of hyperparameters we used is provided in the Appendix C.

Fine-tuning on GLUE. We evaluate LDAdam for fine-tuning Ro BERTa-base model (Liu et al., 2019) on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018). Ro BERTa-base is a 125M parameters large language model, built upon the encoder only attentionbased BERT architecture (Devlin et al., 2019), and pre-trained for natural language understanding (NLU). GLUE benchmark includes a wide variety of NLU tasks with training samples of various size. We mimic memory-constrained setups by training for only 3 epochs with a batch size of 16, and used rank r=8 compression for LDAdam and Ga Lore. For a fair comparison, we tune the learning rate over the set {1e-5, 2e-5, . . ., 5e-5}. Table 2 reports the best average over 3 seeds (Table 8 reports standard deviation), and Figure 5 shows that the error buffer norm is of the same order of magnitude as the gradient norm, and that their ratio is stable throughout training.

Table 2: Results of fine-tuning Ro BERTa-base model on the GLUE benchmark.

MRPC STS-B Co LA SST-2 QNLI QQP MNLI Avg Training Samples 3.7k 7k 8.5k 67k 105k 364k 393k

Adam 88.97 90.08 58.57 94.34 92.81 91.38 87.85 86.28

LDAdam 88.40 90.11 59.91 95.00 92.87 91.28 87.81 86.48 LDAdam no-EF 88.00 89.88 56.86 95.00 92.32 89.75 86.99 85.54 Ga Lore 86.19 88.97 55.12 94.15 92.01 89.86 86.80 84.73

Fine-tuning on GSM8K. We evaluate LDAdam for fine-tuning Llama-2 7B model (Touvron et al., 2023) on the GSM8K mathematical reasoning dataset (Cobbe et al., 2021) containing grade school math word problems. We run experiments for ranks r=32 and r=512, on 3 epochs, with a batch size of 32, and tune the learning rate over {5e-5, 6e-5, . . ., 9e-5, 1e-4, 2e-4, ..., 5e-4}. Table 3 reports the best average accuracy. We also provide peak memory usage, but also the fixed memory cost of storing model, gradient and activations.

Table 3: Results of fine-tuning Llama-2 7B on the GSM8K dataset.

Adam LDAdam LDAdam no-EF Ga Lore

r=32 r=512 r=32 r=512 r=32 r=512

Accuracy 34.72 32.53 35.86 30.70 35.78 27.07 35.18

Peak Memory 55.34 GB 32.08 GB 35.58 GB 32.08 GB 35.58 GB 32.52 GB 35.58 GB

Published as a conference paper at ICLR 2025

Across all fine-tuning experiments, LDAdam outperforms Ga Lore, and is either on par or better than Adam. When ablating the LDAdam techniques, we observe that LDAdam without error feedback also outperforms Ga Lore, supporting LDAdam s continuous subspace adaptation strategy over Ga Lore s sequential update approach. The fact that we can match Adam once error feedback is enabled suggests that this technique is key.

Pre-training on C4. We evaluate LDAdam for pre-training Llama models (Touvron et al., 2023) on the C4 dataset (Raffel et al., 2023). Due to limited compute resources, we pre-train smaller 130M, 350M and 1.3B parameter variants (Lialin et al., 2023; Zhang & Sennrich, 2019; Shazeer, 2020) of the Llama-2 models. C4 is a clean version of Common Crawl s web crawl corpus. We adjust the number of training steps according to the Chinchilla scaling law (Hoffmann et al., 2022), i.e. 20 training tokens per model parameter; and use batch size 512 and sequence length 256, accounting to 131 072 tokens per step, thus using the same settings as Ga Lore. For a fair comparison, we tune the learning rate over the set {5e-4, 1e-3, 5e-3}, and report the best results for ranks r=8 and r=256 in Table 4. Figure 1 depicts the pre-training dynamics.

Table 4: Results (validation perplexity ) of pre-training Llama models on the C4 dataset.

Adam LDAdam LDAdam no-EF Ga Lore

r=8 r=256 r=8 r=256 r=8 r=256

Llama 130M 24.64 23.82 22.65 45.17 25.32 67.78 26.04

Llama 350M 18.08 18.37 17.30 - 19.85 - 20.03

Furthermore, we scale the experiments up to the Llama 1.3B parameter model, using the same hyperparameters but fixing the learning rate at 5e 4 due to limited computational resources. Adam achieves a validation perplexity of 14.86 , while LDAdam with rank r=16 achieves 14.09.

Figure 1: Pre-training dynamics for Llama 350M (left) and Llama 1.3B (right) on the C4 dataset.

Pre-training experiments also show that LDAdam has comparable performance to Adam, both in terms of convergence and validation accuracy, and confirm the improvements over Ga Lore. Further, for very high compression rates (i.e., r=8), LDAdam still converges, but its no-EF variant does not, nor does Ga Lore. This highlights the need for an error correction mechanism for pre-training tasks, which are known to require models updates of high rank. For lower compression (i.e. r=256), LDAdam outperforms the no-EF version and Ga Lore, which is correlated to their consistent projection errors. For more details on the impact of the rank on training, see Figure 4. LDAdam even appears to slightly outperform Adam. We attribute this small improvement to compression, as sparse updates might enhance implicit regularization (Zhang et al., 2017; Neyshabur, 2017).

7 DISCUSSION

We proposed a new low-rank learning method with the memory-efficient LDAdam optimizer, analyzed its convergence under standard assumptions, and provided empirical evidence of its ability to match the performance of Adam at a fraction of its memory cost. LDAdam relies on Adam to estimate gradient statistics, but its development required the design of a specific intermediate optimizer state update rule. Therefore, efficient implementations of Adam are not directly applicable to LDAdam, and extending our work to other optimizers requires further work. On the experimental side, a natural next step would be to execute large-scale billion-parameter pre-training experiments.

Published as a conference paper at ICLR 2025

8 ETHICS STATEMENT

This paper presents work that aims to advance the field of machine learning. We believe that memory-efficient optimization is a step toward democratizing large-scale model training, and thus provides opportunities to foster both the development of new applications and the research in the field. There are many societal concerns about the rapidly growing use of artificial intelligence, we feel that none of them is specifically related to our work.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding, 2017. URL https: //arxiv.org/abs/1610.02132.

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and C edric Renggli. The convergence of sparsified gradient methods, 2018. URL https://arxiv.org/ abs/1809.10505.

Abdeslem Bentbib and A. Kanber. Block Power Method for SVD Decomposition. Analele Stiintifice ale Universitatii Ovidius Constanta, Seria Matematica, 23:45 58, 06 2015. doi: 10.1515/auom-2015-0024.

Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the Convergence of A Class of Adam Type Algorithms for Non-Convex Optimization. International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1808.02941.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv. org/abs/2110.14168.

Alexandre D efossez, Leon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=ZPQhz TSWA7.

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization, 2022. URL https://arxiv.org/abs/2110.02861.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/ abs/1810.04805.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121 2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 2016. ISSN 156(1-2):59 99. URL https://arxiv.org/abs/1310.3787.

Demi Guo, Alexander M. Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning, 2021. URL https://arxiv.org/abs/2012.07463.

Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey, 2024. URL https://arxiv.org/abs/2403. 14608.

Meixuan He, Yuqing Liang, Jinlan Liu, and Dongpo Xu. Convergence of adam for non-convex objectives: Relaxed hyperparameters and non-ergodic case. Journal of Machine Learning Research, 2023. URL https://arxiv.org/abs/2307.11782.

Published as a conference paper at ICLR 2025

Geoffrey Hinton. Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012. https://www.cs. toronto.edu/ tijmen/csc321/slides/lecture_slides_lec6.pdf.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/ 2203.15556.

Yusu Hong and Junhong Lin. On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions. Advances in Neural Information Processing Systems, 2024. URL https:// arxiv.org/abs/2402.03982.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp, 2019. URL https://arxiv.org/abs/1902.00751.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https: //arxiv.org/abs/2106.09685.

Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. From galore to welore: How low-rank weights non-uniformly emerge from low-rank gradients, 2024. URL https://arxiv.org/abs/2407.11239.

Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes, 2019. URL https://arxiv.org/ abs/1901.09847.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes, 2018. URL https://arxiv.org/abs/1804.08838.

Haochuan Li, Ali Jadbabaie, and Alexander Rakhlin. Convergence of Adam under relaxed assumptions. Advances in Neural Information Processing Systems, 2023. URL https://arxiv. org/abs/2304.13972.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021. URL https://arxiv.org/abs/2101.00190.

Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. Relora: High-rank training through low-rank updates, 2023. URL https://arxiv.org/abs/2307.05695.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training, 2020. URL https://arxiv. org/abs/1712.01887.

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation, 2024. URL https://arxiv.org/abs/2402.09353.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692.

Published as a conference paper at ICLR 2025

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https: //arxiv.org/abs/1711.05101.

Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, and Yang You. Came: Confidenceguided adaptive memory efficient optimization, 2023. URL https://arxiv.org/abs/ 2307.02047.

Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, and Xipeng Qiu. Adalomo: Low-memory optimization with adaptive learning rate, 2024a. URL https://arxiv.org/abs/2310.10195.

Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources, 2024b. URL https://arxiv. org/abs/2306.09782.

Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, and Dan Alistarh. Microadam: Accurate adaptive optimization with low space overhead and provable convergence, 2024. URL https://arxiv.org/abs/2405.15593.

Behnam Neyshabur. Implicit regularization in deep learning, 2017. URL https://arxiv.org/ abs/1709.01953.

Mahdi Nikdan, Soroush Tabesh, Elvir Crnˇcevi c, and Dan Alistarh. Rosa: Accurate parameterefficient fine-tuning via robust adaptation, 2024. URL https://arxiv.org/abs/2401. 04679.

Pytorch Tutorials. How to save memory by fusing the optimizer step into the backward pass, n.d. URL https://pytorch.org/tutorials/intermediate/optimizer_step_ in_backward_tutorial.html. Accessed: 2024-11-16.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. ar Xiv preprint ar Xiv:1904.09237, 2019. URL https://arxiv.org/abs/1904.09237.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. Nature, 323(6088):533 536, 1986.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Noam Shazeer. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/ 2002.05202.

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost, 2018. URL https://arxiv.org/abs/1804.04235.

Naichen Shi, Dawei Li, Mingyi Hong, and Ruoyu Sun. Rmsprop converges with proper hyperparameter. International Conference on Learning Representations, 2020. URL https: //openreview.net/pdf?id=3UDSdy Ic BDA.

Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory, 2018. URL https://arxiv.org/abs/1809.07599.

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate. Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2411.02853.

Published as a conference paper at ICLR 2025

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization, 2020. URL https://arxiv.org/abs/1905. 13727.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi (eds.), Proceedings of the 2018 EMNLP Workshop Blackbox NLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353 355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Zhi-Ming Ma, Tie-Yan Liu, and Wei Chen. Provable adaptivity in adam. International Conference on Knowledge Discovery and Data Mining, 2024. URL https://arxiv.org/abs/2208.09900.

Wenhan Xia, Chengwei Qin, and Elad Hazan. Chain of lora: Efficient fine-tuning of language models via residual learning, 2024. URL https://arxiv.org/abs/2401.04151.

Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019. URL https:// arxiv.org/abs/1910.07467.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2017. URL https://arxiv.org/abs/ 1611.03530.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameterefficient fine-tuning, 2023. URL https://arxiv.org/abs/2303.10512.

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules. Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2208.09632.

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more, 2024a. URL https: //arxiv.org/abs/2406.16793.

Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. Q-galore: Quantized galore with int4 projection and layer-adaptive low-rank gradients, 2024b. URL https://arxiv.org/abs/2407.08296.

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection, 2024. URL https://arxiv.org/abs/2403.03507.

Dongruo Zhou, Jinghui Chen, Yuan Cao, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id= Gh0cxhbz3c. Featured Certification.

Published as a conference paper at ICLR 2025

A CHALLENGES FOR ESTIMATING THE SECOND-ORDER STATISTIC OF PROJECTED GRADIENTS

Adam Does Not Rely Solely on Linear Operations. Adam follows the proposition from Ada Grad (Duchi et al., 2011) to use diagonal approximation to estimate: ((1 β2) Pt τ=0 βt τ 2 gτ g T τ ) 1

2 . This leads to v Adam t = (1 β2) Pt τ=0 βt τ 2 gτ gτ and to the matrix inverse square root being replaced by coordinate wise inverse square root operations. Such an approximation saves both memory and computation, but at the cost of leaving the linear framework. Therefore, one cannot expect to draw an autoregressive linear algebra formula to update the second moment estimate.

Adam is Anisotropic. Adaptive optimization learning rate tuning for each parameter is equivalent to a direction-wise rescaling of the training space. It implies that the choice of coordinate system affects the optimization steps. Let U be a fixed orthogonal matrix which acts as rotation in the parameter space and consider the following two minimization problems:

(PI) : arg minθ Rn m f(θ) (7)

(P U) : arg min θ Rn m f( U θ) (8)

These two problems are equivalent up to a change of parametrization. However, applying Adam to each of them will lead to optimization process that are not equivalent: adaptive estimate of the second-order statistics of the gradient leads to θt = U θt :

θt+1 = θt ηt

1 βt 2 1 βt 1

(1 β1) Pt τ=1 βt τ 1 f(θt)

(1 β2) Pt τ=1 βt τ 2 ( f(θt))2+ϵ θt ηt Et,β1[ f(θt)]

Et,β2[( f(θt))2]+ϵ (9)

θt+1 = θt ηt

1 βt 2 1 βt 1

(1 β1) Pt τ=1 βt τ 1 U f( U θt)

(1 β2) Pt τ=1 βt τ 2 ( U f( U θt))2+ϵ θt ηt Et,β1[ U f( U θt)]

Et,β2[( U f( U θt))2]+ϵ (10)

Low-rank adaptive optimization follows from (PU) where U Rn r is the truncation of the rotation matrix U to its first r rows, and thus yields reparametrization and projection to a lower-dimensional space. However, this leads to θ Rr m and the optimization set is restricted to a low-rank subspace. To enable full-parameter training, Ga Lore s strategy is to keep the model parameter high-rank while performing low-dimensional updates, leading to:

θGALORE t+1 = θGALORE t U ηt

1 βt 2 1 βt 1

(1 β1) Pt τ=1 βt τ 1 U f(θGALORE t )

(1 β2) Pt τ=1 βt τ 2 (U f(θGALORE t ))2+ϵ

= θGALORE t U ηt ˆmt ˆvt+ϵ

With the projection map U = Ut occasionally updated, Ga Lore s low-rank gradient statistic estimate ˆmt (resp. ˆvt) fails to approximate Et,β1[U t f(θGALORE t )] (resp. Et,β2[(U t f(θGALORE t ))2]), because gradients projected into different subspaces are accumulated on the same momentum buffers. A simple idea would be to fix a coordinate system for the momentum buffers. However, this would have to be high-dimensional and thus would prevent memory savings. The purpose of LDAdam s projection-aware update rule is to adapt the optimizer states to the new learning subspace and coordinate system, leading to ˆmt 1/2 Et 1,β1[U t f(θLDADAM t 1 )], and ˆvt 1/2 Et 1,β2[(U t f(θLDADAM t 1 ))2] and enabling gradient accumulation within low-dimensional momentum buffers. One can thus rewrite LDAdam s heuristic as performing the following update:

θLDADAM t+1 θLDADAM t Ut ηt Et,β1[U t f(θLDADAM t )]

Et,β2[(U t f(θLDADAM t ))2]+ϵ

As a result, while Adam is anisotropic, LDAdam is isotropic since it doesn t rely on any specific coordinate system but instead always uses the one induced by the SVD of its gradient.

Published as a conference paper at ICLR 2025

B ADDITIONAL MEASUREMENTS

B.1 TRAINING MEMORY FOOTPRINT AND RUNTIME

Table 5 reports theoretical complexity in terms of standard matrix operations. We write ORTHO(r, m) the complexity of orthogonalizing a family of r vectors of dimension m. In our implemention, similar to power SGD (Vogels et al., 2020), we apply power iteration the Gram-Schmidt process which has complexity O(r2 m).

Table 5: Optimizer comparison: matrix operation count for a weight layer of shape n m with n m (i.e., left projection).

Adam LDAdam Learning Subspace Adaptation n r m + 3 n m + r m + ORTHO(r, m) Projection-aware Update 3 r2 m + 4 r m + r2

Adam-type Update 9 n m 9 r m Descent Direction 3 n m n r m + 3 n m Error Buffer loading n r m + 3 n m

Table 6 reports peak memory for fine-tuning and pre-training on a single NVIDIA H100 80GB GPU with micro batch size 1 and without activation checkpointing.

Table 6: Peak memory for fine-tuning (FT) and pre-training (PT) tasks for micro batch size of 1.

Model Task Rank Optimizer

Adam LDAdam

Ro BERTa-base MNLI (FT) r=8 2.39 GB 1.60 GB

Llama 350M C4 (PT) r=8 4.20 GB 2.67 GB Llama 350M C4 (PT) r=256 4.20 GB 3.01 GB

Llama 1.3B C4 (PT) r=16 15.20 GB 8.67 GB

Llama-2 7B GSM8K (FT) r=32 57.62 GB 31.98 GB Llama-2 7B GSM8K (FT) r=512 57.62 GB 35.90 GB

Figure 2 depicts the pre-training dynamics over time, allowing comparison of training efficiency with respect to time.

Figure 2: Pre-training dynamics over time for Llama 350M (left) and Llama 1.3B (right) on the C4 dataset.

Figure 3 shows the throughput (token per second) and peak memory (GB) of Adam and LDAdam with respect to rank for pre-training the Llamma 350M model on the C4 dataset. We used a micro batch size of 1 for both to replicate the memory constrained setting. This helps to isolate the effect of the optimisation algorithm itself. See Table 7 and Table 6 for results closer to the standard use case.

Published as a conference paper at ICLR 2025

Figure 3: Throughput (token per second) and peak memory (GB) of Adam and LDAdam with respect to rank for pre-training the Llamma 350M model on the C4 dataset, on a single NVIDIA H100 80BG GPU, using micro batch size of 1.

Table 7 reports runtime for fine-tuning and pre-training on a single (except for Llama 1.3B pretraining were we use four) NVIDIA H100 80GB GPU. We don t use activation checkpointing. We use a micro batch size of 1 for GSM8K fine-tuning, a micro batch size of 128 for Llama 350M model, and micro batch size of 64 for Llama 1.3B model.

Table 7: Runtime for fine-tuning (FT) and pre-training (PT) tasks.

Model Task Rank Optimizer

Adam LDAdam LDAdam no-EF Ga Lore

Ro BERTa-base MNLI (FT) r=8 37m 01h 07m 01h 01m 56m

Llama 350M C4 (PT) r=8 21h 59m 23h 12m - - Llama 350M C4 (PT) r=256 21h 59m 24h 50m 24h 44m 23h 09m

Llama 1.3B C4 (PT) r=16 04d 13h 09m 05d 04h 24m - -

Llama-2 7B GSM8K (FT) r=32 50m 56m 55m 01h 08m Llama-2 7B GSM8K (FT) r=512 50m 01h 08m 01h 07m 01h 09m

B.2 IMPACT OF THE RANK ON TRAINING

Figure 4: Training dynamics and validation perplexity for various rank when pre-training Llama 350M model. For training dynamics we used a single learning rate of 5e 4 to allow comparison between runs and provide results for the first 10000 optimization steps. We report the best validation perplexity for learning rates tuned over the set {5e 4, 1e 3, 5e 3}.

Published as a conference paper at ICLR 2025

B.3 FINE-TUNING STANDARD DEVIATION

Table 8: Standard deviation over 3 seeds for the results of fine-tuning Ro BERTa-base model on the GLUE benchmark.

MRPC STS-B Co LA SST-2 QNLI QQP MNLI Training Samples 3.7k 7k 8.5k 67k 105k 364k 393k

Adam 1.068 0.226 2.604 0.517 0.074 0.050 0.167

LDAdam 0.617 0.211 0.633 0.463 0.275 0.080 0.177 LDAdam no-EF 0.849 0.555 0.903 0.066 0.285 0.687 0.087 Ga Lore 0.861 0.330 2.348 0.413 0.166 0.034 0.111

B.4 ERROR BUFFER NORM DURING TRAINING

Figure 5: Error buffer norm and gradient norm during the fine-tuning of the Ro BERTa-base model on the GLUE benchmark.

Published as a conference paper at ICLR 2025

C MATERIALS FOR REPRODUCIBILITY

C.1 OPTIMIZER STATES MEMORY ESTIMATES

Table 9 describes the architecture of the trained models architecture and the low-rank structure of the low-dimensional optimizer states. It allows optimizer states token count, thus memory estimates in GB by applying factor 2 10243 for half precision training.

Table 9: Weights and low-dimensional optimizer states shape for used models.

Weight Low-rank optimizer states

Ro BERTa-base Token Embedding 50265*768+768 - Positional Embedding 564*768 - Attention Head 12*4*768*768 12*4*3*768*r MLP Block 12*3*3072*768 12*3*(2*3072*r+768*r) Normalization 2*768+12*(9*768+3072) - Dense Layer 768*768+768 - Output Layer 768*n label+n label -

Llama 130M Embedding Layer 32000*768 - Attention Head 12*4*768*768 12*4*3*768*r MLP Block 12*3*2048*768 12*3*(2*2048*r+768*r) Layer Normalization (12*2+1)*768 - Output Layer 768*32000 -

Llama 350M Embedding Layer 32000*1024 - Attention Head 24*4*1024*1024 24*4*3*1024*r - MLP Block 24*3*2736*1024 24*3*(2*2736*r+1024*r) Layer Normalization (24*2+1)*1024 - Output Layer 1024*32000

Llama-2 7B Embedding Layer 32000*4096 - Attention Head 32*4*4096*4096 32*4*3*4096*r MLP Block 32*3*4096*11008 32*3*(2*11008*r+4096*r) Layer Normalization (32*2+1)*4096 - Output Layer 4096*32000 -

C.2 FINE-TUNING HYPERPARAMETERS

Tables 10 and 11 detail all the hyperparameters we use when fine-tuning respectively Ro BERTa-base model on the GLUE benchmark and Llama-2 7B model on the GSM8K dataset.

Published as a conference paper at ICLR 2025

Table 10: Hyperparameters used for fine-tuning Ro BERTa-base model on the GLUE benchmark.

Adam LDAdam Ga Lore

Epochs 3 Warm-up Batch Size 16 Maximum Length 128 Data Type bfloat32

Learning Rate {1e 5, 2e 5, . . . , 5e 5} Learning Rate Scheduling linear to 0% Decay Rate β1 0.9 0.908 0.9 Decay Rate β2 0.999 0.99 0.999 Weight Decay Dropout Gradient Clipping

Interpolation Factor ρ 0.908 Error Feedback Subspace Frequency 1 500 Learning Rate Scaling

Table 11: Hyperparameters used for fine-tuning Llama-2 7B model on the GSM8K dataset.

Adam LDAdam Ga Lore

Epochs 3 Training Steps 702 Warm-up Steps 20 Batch Size 32 Maximum Length 512 Data Type bfloat16

Learning Rate {5e 5, 6e 5, . . . , 9e 5} {1e 4, 2e 4, . . . , 5e 4} Warm-up Scheduling linear from 0% Learning Rate Scheduling linear to 0% Decay Rate β1 0.9 0.908 0.9 Decay Rate β2 0.999 0.99 0.999 Weight Decay Dropout Gradient Clipping 1.0

Interpolation Factor ρ 0.908 Error Feedback Subspace Frequency 1 200 Learning Rate Scaling

Published as a conference paper at ICLR 2025

C.3 PRE-TRAINING HYPERPARAMETERS

Table 12 details all the hyperparameters we use when pre-training Llama models on the C4 dataset.

Table 12: Hyperparameters used for pre-training Llama models on the C4 dataset.

Llama 130M Llama 350M Llama 1.3B

Adam LDAdam Ga Lore Adam LDAdam Ga Lore Adam LDAdam

Training Steps 20000 55000 200000 Warm-up Steps 2000 5500 10000 Maximum Length 256 256 256 Batch Size 512 512 512 Token Batch Size 131 072 131 072 131 072 Total Training Tokens 2 621 440 000 7 208 960 000 26 621 440 000 Data Type bfloat16 bfloat16 bfloat16

Learning Rate {5e 4, 1e 3, 5e 3} {5e 4, 1e 3, 5e 3} 5e 4 Warm-up Scheduling linear from 0% linear from 0% linear from 0% Learning Rate Scheduling cosine to 10% cosine to 10% cosine to 10% Decay Rate β1 0.9 0.908 0.9 0.9 0.908 0.9 0.9 0.908 Decay Rate β2 0.999 0.98 0.999 0.999 0.98 0.999 0.999 0.98 Weight Decay Dropout Gradient Clipping 1.0 1.0 1.0

Interpolation Factor ρ 0.908 0.908 0.908 Error Feedback Subspace Frequency 1 200 1 200 1 Learning Rate Scaling

Published as a conference paper at ICLR 2025

D GALORE ALGORITHM

Algorithm 2 describes Ga Lore s (Zhao et al., 2024) proposal for performing an Adam-type update from a low-rank gradient projection. For experiments we use the author s implementation of the Ga Lore algorithm accessible at: https://github.com/jiaweizzhao/Ga Lore. Other than layer-wise learning rate rescaling, in our experiments we use the suggested hyperparameters.

Algorithm 2 Ga Lore

Hyperparameters: step size ηt; decay rates β1, β2 Ga Lore Hyperparameters: projection rank r; subspace change frequency T , scale factor α 1: Initialization: m0 = 0; v0 = 0; U0 = 0 2: for t = {1, 2, . . . , T} do 3: if t mod T = 0 then 4: Ut, Σ, V = SVD(gt) 5: Ut = Ut[:, 1 : r] 6: else 7: Ut = Ut 1 8: end if 9: at = U t gt 10: mt = β1mt + (1 β1)at 11: vt = β2vt + (1 β2)a2 t 12: θt+1 = θt αηt Ut ˆmt ˆvt+ϵ 13: end for

D.1 LAYER-WISE LEARNING RATE RESCALING

Note that the learning rate rescaling induced by multiplying by α = 1 is equivalent to using different learning rates for different layers. This level of hyperparameter tuning is not usual, so for a fair comparison we use a single learning rate for all layers. However, our preliminary experiments suggest that LDAdam would also benefit from layer-wise learning rate tuning.

D.2 INCOMPATIBILITY OF GRADIENT ACCUMULATION AND PER-LAYER WEIGHT UPDATE

Ga Lore s improvement in memory efficiency over the standard Adam implementation comes from three distinct additions, namely: low-rank gradient projection (Zhao et al., 2024), 8-bit quantization (Dettmers et al., 2022), and per-layer weight update (Lv et al., 2024b). The latter saves memory by releasing the variable used to store the gradient after the model layer has been updated rather than after the entire model has been updated. A practical implementation of per-layer weight updates is to add a gradient hook that triggers the model update and releases the gradient variable immediately after the gradient for that particular layer has been computed. For example, in Py Torch, the per-layer weight update is built by adding a gradient hook using p.register post accumulate grad hook(optimizer hook), with the optimizer hook function implementing both optimizer.step() and optimizer.zero grad() (Pytorch Tutorials, n.d.).

Therefore, when using per-layer weight update, one never has access to the gradient for the entire model, and furthermore one cannot accumulate the gradient over multiple micro batches and perform the model update (e.g. in Py Torch, run optimizer.step()) only after accumulation. For the same reason, gradient clipping is not possible when using per-layer weight update, since the norm of the gradient for the entire model is not computable.

To enable gradient accumulation for Ga Lore, the per-layer weight update has to be abandoned. The method is called by the author Ga Lore (no retaining grad) (Zhao et al., 2024) and it results in an additional memory overhead equal to the size of the model compared to Ga Lore s claim.

Published as a conference paper at ICLR 2025

E DEFERRED PROOFS

Algorithm 3 LDAdam: Analytical View

1: Hyperparameters: step size ηt, decay rates β1 and β2, projection rank r with contraction qr. 2: Initialization: error ξ1 = 0d, moments m0 = v0 = v0 = 0r and U0 = 0d r. 3: for t = {1, 2, . . . , T} do

Compute error corrected gradient and momentum 4: At = gt + ξt, where gt = e θf(θt) is a mini-batch stochastic gradient at θt 5: Bt = β1Ut 1mt 1 + (1 β1)At Update the projection matrix 6: Ut is any d r orthogonal matrix such that (I Ut U t )Bt qr Bt with qr < 1 Intermediate updates to adjust to the new low-dimensional space 7: at = U t At 8: mt 1/2 = U t Ut 1mt 1

9: vt 1/2 = (1 βt 1 2 ) (U t Ut 1)2 ( vt 1 1 βt 1 2 ( mt 1

1 βt 1 1 )2) + (U t Ut 1 mt 1 1 βt 1 1 )2 Adam updates in the low-dimensional space 10: mt = β1mt 1/2 + (1 β1)at 11: vt = β2vt 1/2 + (1 β2)a2 t 12: vt = max(vt, vt 1 max) AMSGrad-type normalization Update the main model 13: θt+1 = θt ηt Ut mt vt+ϵ Update the error feedback 14: ξt+1 = (At Ut at) + β1 1 β1 (Ut 1 mt 1 Ut mt 1/2) 15: end for

E.1 KEY LEMMAS

Lemma 1. With ΣT = Tσ2 + PT t=1 E f(θt) 2 , for any t 1 the following bounds hold:

Bt G 1 qr ,

t=1 E Bt 2 1 (1 qr)2 ΣT (13)

ξt qr G (1 β1)(1 qr),

t=1 E ξt 2 q2 r (1 β1)2(1 qr)2 ΣT (14)

β1 1 β1 Bt + (1 β1)ξt+1

β1 + (1 β1)qr

(1 β1)(1 qr)G (15)

" β1 1 β1 Bt 1 + (1 β1)ξt

β1 + (1 β1)q2 r (1 β1)2(1 qr)2 ΣT . (16)

Proof. Let us start with the proof of the first bound on Bt. Denote qr := (1 β1)qr + β1 [qr, 1). Bt+1 = β1Mt + (1 β1)At+1 = β1Mt + (1 β1)gt+1 + (1 β1)ξt+1 = (1 β1)Mt + (1 β1)gt+1 + Bt

= (1 β1)Ut U t Bt + (1 β1)gt+1 + Bt

= (1 β1)(Bt Ut U t Bt) + (1 β1)gt+1 + β1Bt

(1 β1) Bt Ut U t Bt + (1 β1) gt+1 + β1 Bt ((1 β1)qr + β1) Bt + (1 β1) gt+1 = qr Bt + (1 β1) gt+1

= qt r B1 + (1 β1)

τ=2 qt+1 τ r gτ = (1 β1)

τ=1 qt+1 τ r gτ .

Published as a conference paper at ICLR 2025

Using the bounded gradient assumption, we get

τ=1 qt τ r (1 β1)G

1 qr = G 1 qr .

To derive the bound with expectation, we apply Cauchy-Schwartz inequality and the bounded variance assumption:

t=1 E Bt 2 (1 β1)2 T X

τ=1 qt τ r gτ

(1 β1)2 T X

τ=1 qt τ r gτ 2 !#

τ=1 qt τ r E gτ 2

= 1 (1 qr)2

t=1 E gt f(θt) + f(θt) 2

σ2 + E f(θt) 2

(1 qr)2 + 1 (1 qr)2

t=1 E f(θt) 2

To bound the norm of the error ξt , notice that

ξt+1 = 1 1 β1 Bt Ut U t Bt qr 1 β1 Bt qr G (1 β1)(1 qr). (17)

To get the bound with expectation, we apply previous inequality on Bt and get

t=1 E ξt 2 =

t=1 E ξt+1 2

q2 r (1 β1)2 E Bt 2

q2 r (1 β1)2(1 qr)2

t=1 E f(θt) 2 !

The fifth bound (15) follows from the triangle inequality and combining the obtained two bounds. For the last bound with expectation, we have

" β1 1 β1 Bt 1 + (1 β1)ξt

β1 (1 β1)2 E Bt 1 2 + (1 β1)E ξt 2

(1 qr)2 + 1 (1 qr)2

t=1 E f(θt) 2 !

+ q2 r (1 β1)(1 qr)2

t=1 E f(θt) 2 !

= β1 + (1 β1)q2 r (1 β1)2(1 qr)2

t=1 E f(θt) 2 !

Published as a conference paper at ICLR 2025

Lemma 2. If γ < 1 and 1 γ 1

2(1 β1)(1 qr), then

" β1 1 β1 Bt 1 + (1 β1)ξt

t=1 γT t E f(θt) 2 !

where C2 = β1+(1 β1)q2 r (1 β1)2(1 qr)2 .

Proof. From the condition on γ and the notation qr := (1 β1)qr + β1 [qr, 1) from the previous proof, we have 1 γ 1 qr

2 or equivalently γ qr 1 qr

2 . Using this inequality on γ and the previous bound (16) for E Bt 2 , we have

t=1 γT t E Bt 2 (1 β1)2 T X

τ=1 qt τ r gτ

(1 β1)2 T X

τ=1 qt τ r gτ 2 !#

τ=1 γT t qt τ r E gτ 2

t=τ γT t qt τ r E gτ 2

t=τ γτ t qt τ r

1 ( qr/γ)T τ+1

1 qr/γ γT τE gτ 2

(1 qr)(1 qr/γ)

t=1 γT τE gt 2

t=1 γT τE gt 2

= 2 (1 qr)2

t=1 γT τE gt f(θt) + f(θt) 2

t=1 γT τ σ2 + E f(θt) 2

t=1 γT τE f(θt) 2 !

Using the bound (17) for ξt 2, we have

t=1 γT τE ξt 2 = 2q2 r (1 β1)2(1 qr)2

t=1 γT τE f(θt) 2 !

Published as a conference paper at ICLR 2025

Combining these two bounds we get

" β1 1 β1 Bt 1 + (1 β1)ξt

t=1 γT t β1 (1 β1)2 E Bt 1 2 + (1 β1)E ξt 2

β1 (1 β1)2 2 (1 qr)2

t=1 γT t E f(θt) 2 !

+ 2(1 β1)q2 r (1 β1)2(1 qr)2

t=1 γT t E f(θt) 2 !

= 2(β1 + (1 β1)q2 r) (1 β1)2(1 qr)2

t=1 γT t E f(θt) 2 !

Lemma 3. For Γt = Γt 1 Γt we have

t=1 Γt 2 ϵ,

Proof. From the definitions of Γt (19) and vt = max(vt, vt 1 max) we have

Γt = Ut Diag 1/2( vt + ϵ, vt min + ϵ) U t Ut Diag 1/2( vt min + ϵ) U t

vt min + ϵ Ut U t = 1 p

vt min + ϵ I

vt 1 max + ϵ I = 1 p

vt 1 max + ϵ Ut 1 U t 1

= Ut 1Diag 1/2( vt 1 max + ϵ) U t 1 Ut 1Diag 1/2( vt 1 + ϵ, vt 1 min + ϵ) U t 1 = Γt 1, which implies that Γt = Γt 1 Γt is positive semidefinite. Hence, Γt = λmax( Γt) 0. Using the convexity of λmax over symmetric matrices, we get

t=1 λmax(Γt 1 Γt)

t=1 λmax(Γt 1) + λmax( Γt) =

t=1 λmax(Γt 1) λmin(Γt)

vt 1 min + ϵ 1 p

vt 1 min + ϵ 1 p

vt 1 max + ϵ + 1 p

vt 1 max + ϵ 1 p

vt 1 min + ϵ 1 p

vt min + ϵ +

vt 1 max + ϵ 1 p

v0 min + ϵ 1 p

v T min + ϵ + 1 p

v0 max + ϵ 1 p

v T max + ϵ

v0 min + ϵ 2 ϵ.

Published as a conference paper at ICLR 2025

For the second sum of squared norms, notice that for scalars a b 0, it holds that

(a b)2 (a b)(a + b) = a2 b2.

Therefore, the above derivation can be repeated without the square roots as follows:

t=1 (λmax(Γt 1 Γt))2

t=1 (λmax(Γt 1) + λmax( Γt))2 =

t=1 (λmax(Γt 1) λmin(Γt))2

t=1 (λmax(Γt 1))2 (λmin(Γt))2

1 vt 1 min + ϵ 1 vt max + ϵ

1 vt 1 min + ϵ 1 vt 1 max + ϵ + 1 vt 1 max + ϵ 1 vt max + ϵ

1 vt 1 min + ϵ 1 vt min + ϵ +

1 vt 1 max + ϵ 1 vt max + ϵ

= 1 v0 min + ϵ 1 v T min + ϵ + 1 v0 max + ϵ 1 v T max + ϵ

2 v0 min + ϵ 2

which completes the proof.

Lemma 4. For all iterates t 1 the following bound holds

vt max 1 + β2

(1 β1(1 qr))2

(1 β1)2(1 qr)2 G2.

Proof. First, let us bound the at term using Lemma 1 and Assumption 2.

at = U t At At ξt + gt 1 (1 qr)β1

(1 β1)(1 qr)G =: CG.

Next, we bound momentum mt:

mt β1 mt 1/2 + (1 β1) at

β1 mt 1 + (1 β1)CG

βt 1 m0 + (1 β1)CG

τ=0 βτ 1 = (1 βt 1)CG.

Next, we bound the intermediate term vt 1/2 1. Note that by the triangle inequality we have the following direct bound for it:

U t Ut 1 2 vt 1 1 +

U t Ut 1 2 mt 1 1 βt 1 1

U t Ut 1 mt 1 1 βt 1 1

Published as a conference paper at ICLR 2025

Now let us bound each term separately. For the first term we have U t Ut 1 2 vt 1 1 =

j=1 Ut[:, i], Ut 1[:, j] 2 vt 1,j

j=1 vt 1,j max 1 j r

i=1 Ut[:, i], Ut 1[:, j] 2

= vt 1 1 max 1 j r U t Ut 1[:, j] 2

vt 1 1 max 1 j r Ut 1[:, j] 2 = vt 1 1,

since columns Ut[:, i] and Ut 1[:, j] have unit length by construction. Similarly, for the second term we have U t Ut 1 2 mt 1 1 βt 1 1

j=1 Ut[:, i], Ut 1[:, j] 2 mt 1,j

2 max 1 j r

i=1 Ut[:, i], Ut 1[:, j] 2

= mt 1 1 βt 1 1

2 max 1 j r U t Ut 1[:, j] 2 mt 1 1 βt 1 1

Finally, for the third term we have

U t Ut 1 mt 1 1 βt 1 1

2 1 = U t Ut 1 mt 1 1 βt 1 1

2 mt 1 1 βt 1 1

Combining all three bounds together, we arrive

vt 1/2 1 vt 1 1 + 2C2G2.

From this we get the bound for vt using the initialization v0 = 0:

vt max vt 1 β2 vt 1/2 1 + (1 β2) at 2

β2( vt 1 1 + 2C2G2) + (1 β2)C2G2

β2 vt 1 1 + (1 + β2)C2G2

βt 2 v0 1 + (1 + β2)C2G2 t 1 X

τ=0 βτ 2 1 + β2

Hence, using the update rule of vt and initialization v0 = 0, we conclude

vt max max( vt max, vt 1 max) 1 + β2

1 β2 C2G2 = 1 + β2

(1 (1 qr)β1)2

(1 β1)2(1 qr)2 G2.

E.2 NON-CONVEX ANALYSIS

Theorem 3 (Non-convex convergence rate). Let Assumptions 1, 2 and 3 hold. Then, choosing stepsize η = min(η0, 1

T ) with η0 = ϵ 4LC0 1+C2 , Algorithm 3 satisfies

t=1 E[ f(θt) 2] 2C0

2η0T + L2C0C2σ2

2ϵ2T + (1 + C1)G2

ϵT + (1 + 2C1)C1LG2

with constants C0 := q

1+β2 1 β2 (1 β1(1 qr))2 (1 β1)2(1 qr)2 G2 + ϵ, C1 := β1+(1 β1)qr

(1 β1)(1 qr), C2 := β1+(1 β1)q2 r (1 β1)2(1 qr)2 .

Published as a conference paper at ICLR 2025

Remark 1. Further ignoring absolute constants, the bound becomes C0

f(θ1) f + Lσ2

η0 + L2C0C2σ2

ϵ + C2 1LG2

Let us assume that for the low-rank compression of the algorithm we have 1 qr = O( r

d) (e.g., SVD option). Then C0 = O( d

r G), C1 = O( d

r ), C2 = O( d2

r2 ) and 1 η0 = O( d2

r2 G). Plugging this asymptotic expressions into the bound and ignoring other parameters (e.g., σ2, L, ϵ), we get d r G

or equivalently d r G

Proof. Let Γt := Ut Diag 1/2( vt+ϵ)U t be a preconditioning matrix and Mt := Utmt the exponential moving averages in the full space. With these notations, the update rule of the model becomes θt+1 = θt ηt Γt Mt. As it will be used later, we need to make sure that our preconditioning is positive definite. Due to the structure of Γt, it is positive semi-definite only. To make it positive definite, notice that Γt Ut = Γt Ut, where the full-rank preconditioner Γt takes the form

Γt = Ut Diag 1/2( vt + ϵ, vt min + ϵ) U t , (19) where Ut Rd d is an orthogonal matrix with the same first r columns as Ut Rd r and the diagonal matrix in the middle has been extended to meet the sizes of Ut by adding the values vt min +ϵ on the diagonal. Γt Ut = Γt Ut comes from the observation that U t Ut Rd r is composed of two blocks: upper r r block of the identity matrix and (d r) r block of the zero matrix. Hence, the added (d r) elements on the diagonal do not really affect, so does the last (d r) columns of Ut. Therefore, we can write the model update rule as θt+1 = θt ηtΓt Mt, (20) with full-rank preconditioning Γt. Next, note that Bt = β1Mt 1 + (1 β1)gt + (1 β1)et, (21)

mt 1/2 = U t Mt 1. (22) Then, for the low dimensional momentum and the erorr we get

mt = β1mt 1/2 + (1 β1)at = U t (β1Mt 1 + (1 β1)gt + (1 β1)ξt) = U t Bt, (23)

(1 β1)ξt+1 = (I Ut U t )Bt = Bt Mt. (24)

Letting B0 = 0, we define virtual iterates xt as follows:

xt+1 = θt+1 ηΓt

(1 β1)ξt+1 + β1 1 β1 Bt

In particular, x1 = θ1. Then, we derive the recurrence relation for the new sequence as follows:

xt+1 = θt+1 ηΓt

(1 β1)ξt+1 + β1 1 β1 Bt

(24) = θt ηΓt Mt ηΓt

Bt Mt + β1 1 β1 Bt

= θt η 1 β1 Γt Bt

(21) = θt η 1 β1 Γt (β1Mt 1 + (1 β1)gt + (1 β1)ξt)

(24) = θt ηΓt

β1 1 β1 (Bt 1 (1 β1)ξt) + ξt

β1 1 β1 Bt 1 + (1 β1)ξt

(25) = xt ηΓtgt + η Γt

β1 1 β1 Bt 1 + (1 β1)ξt

Published as a conference paper at ICLR 2025

where Γt := Γt 1 Γt.

Next we apply smoothness (Assumption 1) of the loss function f over the iterates xt. From the gradient Lipschitzness we have

f(xt+1) f(xt) + f(xt), xt+1 xt + L

2 xt+1 xt 2.

Taking expectation, we obtain

E[f(xt+1)] E[f(xt)] ηE [ f(xt), Γtgt ]

+ηE f(xt), Γt

β1 1 β1 Bt 1 + (1 β1)ξt

β1 1 β1 Bt 1 + (1 β1)ξt

= ηE [ f(θt), Γtgt ] | {z } I

+ ηE f(xt), Γt

β1 1 β1 Bt 1 + (1 β1)ξt

β1 1 β1 Bt 1 + (1 β1)ξt

| {z } III + ηE [ f(θt) f(xt), Γtgt ] | {z } IV

In the following, we bound all the four terms mentioned above.

Bounding term I. Let Γ be the operator norm (with respect to ℓ2 norm) of the matrix Γ. We have

I = ηE [ f(θt), Γt 1gt ] ηE [ f(θt), Γtgt ]

ηE [ f(θt), Γt 1 f(θt) ] + ηG2E[ Γt ].

ηλmin(Γt 1)E[ f(θt) 2] + ηG2E[ Γt ]

C0 E[ f(θt) 2] + ηG2E[ Γt ], (28)

where we use Assumption 2 and Lemma 4 to bound

λmin(Γt 1) ( vt 1 max + ϵ) 1/2 1 + β2

(1 β1(1 qr))2

(1 β1)2(1 qr)2 G2 + ϵ 1/2 = 1

Note that the purpose of making Γ matrix positive definite is to have negative term in (28).

Bounding term II. Then we have

II ηE f(θt), Γt

β1 1 β1 Bt 1 + (1 β1)ξt

+ ηE f(xt) f(θt), Γt

β1 1 β1 Bt 1 + (1 β1)ξt

ηE f(θt) Γt

β1 1 β1 Bt 1 + (1 β1)ξt

β1 1 β1 Bt 1 + (1 β1)ξt

β1 1 β1 Bt 1 + (1 β1)ξt

(15) ηC1G2E[ Γt ] + η2C2 1LG2 ϵ E[ Γt ], (29)

Published as a conference paper at ICLR 2025

where C1 = β1+(1 β1)qr (1 β1)(1 qr) and we used the fact that the largest eigenvalue λmax(Γt) = Γt =

( vt min + ϵ) 1/2 ϵ 1/2. The second inequality is because of smoothness of f(θ), and the last inequality is due to Lemma 1, Assumption 2 and the property of norms.

Bounding term III. This term can be bounded as follows:

III η2LE h Γtgt 2i + η2LE

β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E[ gt f(θt) + f(θt) 2] + η2LE

β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E[ f(θt) 2] + σ2 + η2C2 1LG2E[ Γt 2]

ϵ E[ f(θt) 2] + η2Lσ2

ϵ + η2C2 1LG2E[ Γt 2], (30)

where we used Assumption 3 that gt is unbiased with bounded variance σ2.

Bounding term IV. We have

IV = ηE [ f(θt) f(xt), Γt 1gt ] + ηE [ f(θt) f(xt), Γtgt ]

ηE [ f(θt) f(xt), Γt 1 f(θt) ] + η2LE Γt

β1 1 β1 Bt 1 + (1 β1)ξt

2ϵ E[ f(θt) 2] + η

2ρE[ f(θt) f(xt) 2] + η2C1LG2

2ϵ E[ f(θt) 2] + η3L2

β1 1 β1 Bt 1 + (1 β1)ξt

2ϵ E[ f(θt) 2] + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

where (a) is due to Young s inequality and (b) is based on Assumption 1. Now integrating (28), (29), (30), (31) into (27),

I ηλmin(Γt 1)E[ f(θt) 2] + ηG2E[ Γt ]

II ηC1G2E[ Γt ] + η2C2 1LG2 ϵ E[ Γt ]

ϵ E[ f(θt) 2] + η2Lσ2

ϵ + η2C2 1LG2E[ Γt 2]

2ϵ E[ f(θt) 2] + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

Published as a conference paper at ICLR 2025

and taking the telescoping summation over t = 1, . . . , T, we obtain

E[f(x T +1) f(x1)]

t=1 E[ f(θt) 2] + Tη2Lσ2

" β1 1 β1 Bt 1 + (1 β1)ξt

+ η(1 + C1)G2 + η2(1 + C1)C1LG2

t=1 E[ Γt ] + η2C2 1LG2 T X

t=1 E[ Γt 2]

2ϵ + η3L2C2

t=1 E[ f(θt) 2] + Tη2Lσ2

ϵ + Tη3L2C2σ2

+ η(1 + C1)G2 + η2(1 + C1)C1LG2

t=1 E[ Γt ] + η2C2 1LG2 T X

t=1 E[ Γt 2],

where we used (16) of Lemma 1 with constant C2 = β1+(1 β1)q2 r (1 β1)2(1 qr)2 . Choosing ρ = ϵ 2C0 and η η0 := ϵ 4LC0 1+C2 and using Lemma 3, we get

E[f(x T +1) f(x1)] η

t=1 E[ f(θt) 2] + Tη2Lσ2

ϵ + Tη3L2C0C2σ2

+ 2η(1 + C1)G2

ϵ + 2η2(1 + 2C1)C1LG2

Re-arranging terms, we get that

t=1 E[ f(θt) 2] 2C0

ϵ + η2L2C0C2σ2

T ϵ + η(1 + 2C1)C1LG2

where in the last inequality we used x1 = θ1 and the lower bound f f(θ) for all θ Rd. Finally, choosing η = min(η0, 1

T ) and considering the two cases, we arrive at the following rate

t=1 E[ f(θt) 2] 2C0

max 1, 1 η0

T + L2C0C2σ2

ϵT + (1 + 2C1)C1LG2

2η0T + L2C0C2σ2

2ϵ2T + (1 + C1)G2

ϵT + (1 + 2C1)C1LG2

which completes the proof of the theorem.

E.3 ANALYSIS UNDER PL CONDITION

As in the non-convex analysis, here we derive the convergence rate with fixed step-size η. Theorem 4. (Convergence rate under PL) Let Assumptions 1, 2, 3 and 4 hold. Then, choosing step-size η = min(η0, 2C0 log T

µT ) with η0 = min( ϵ 16LC0 , C0(1 β1)(1 qr)

6L C0C2 ), Algorithm 3 satisfies

E[f(θT +1)] f log T

µ2ϵ + 6C0(1 + C1)G2

Published as a conference paper at ICLR 2025

Proof. We start from descent lemma

E[f(xt+1)] E[f(xt)] = ηE [ f(xt), Γtgt ] | {z } I

+ ηE f(xt), Γt

β1 1 β1 Bt 1 + (1 β1ξt)

β1 1 β1 Bt 1 + (1 β1ξt)

We bound part II and part III in the same way as it was done in the non-convex analysis. We now provide a bound for part I :

I = ηE [ f(xt), Γtgt ] = ηE [ f(xt), Γt 1gt ] ηE [ f(xt), Γtgt ] = ηE [ f(xt), Γt 1(gt f(xt) + f(xt)) ] ηE [ f(xt), Γtgt ] = ηE [ f(xt), Γt 1 f(xt) ] ηE [ f(xt), Γt 1(gt f(xt)) ] ηE [ f(xt), Γtgt ] .

We further expand and bound this equation as follows:

C0 E h f(xt) 2i

ηE [ f(xt) f(θt) + f(θt), Γt 1 ( f(θt) f(xt)) ] ηE [ f(xt) f(θt) + f(θt), Γtgt ]

C0 E h f(xt) 2i

ηE [ f(xt) f(θt), Γt 1 ( f(θt) f(xt)) ] ηE [ f(θt), Γt 1 ( f(θt) f(xt)) ] ηE [ f(xt) f(θt), Γtgt ] ηE [ f(θt), Γtgt ]

C0 E h f(xt) 2i + η ϵE h f(xt) f(θt) 2i

ηE [ f(θt), Γt 1 ( f(θt) f(xt)) ] + ηE [ f(xt) f(θt), Γtgt ] + ηE [ f(θt), Γtgt ] .

To bound the second and third terms above we reuse derivation done in (31) to have

ηE [ f(θt), Γt 1 ( f(θt) f(xt)) ] (31) ηρ

2ϵ E[ f(θt) 2] (33)

" β1 1 β1 Bt 1 + (1 β1)ξt

E h f(xt) f(θt) 2i (31) η2L2

" β1 1 β1 Bt 1 + (1 β1)ξt

Next, we use the Cauchy Schwartz inequality to bound inner products above, L-smoothness inequality to bound f(xt) f(θt) L xt θt ηLC1G ϵ , and the inequality a 2

Published as a conference paper at ICLR 2025

2 b 2 + a b 2 for the first term:

C0 E h f(xt) 2i + η ϵE h f(xt) f(θt) 2i

2ϵ E[ f(θt) 2] + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

+ ηGE [ f(xt) f(θt) Γt ] + ηG2E [ Γt ]

η 2C0 E h f(xt) 2i η 2C0 E h f(xt) 2i

2ϵ E[ f(θt) 2] + η ϵE h f(xt) f(θt) 2i + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

+ ηGηLC1G ϵ E [ Γt ] + ηG2E [ Γt ]

η 2C0 E h f(xt) 2i η 4C0 E[ f(θt) 2] + η 2C0 E[ f(xt) f(θt) 2]

2ϵ E[ f(θt) 2] + η ϵE h f(xt) f(θt) 2i + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E [ Γt ] + ηG2E [ Γt ]

η 2C0 E h f(xt) 2i η 4C0 E[ f(θt) 2]

2ϵ E[ f(θt) 2] + 3η

2 ϵE h f(xt) f(θt) 2i + η3L2

" β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E [ Γt ] + ηG2E [ Γt ]

(34) η 2C0 E h f(xt) 2i η 4C0 E[ f(θt) 2] + ηρ

2ϵ E[ f(θt) 2]

" β1 1 β1 Bt 1 + (1 β1)ξt

" β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E [ Γt ] + ηG2E [ Γt ]

η 2C0 E h f(xt) 2i η 4C0 E[ f(θt) 2] + ηρ

2ϵ E[ f(θt) 2]

" β1 1 β1 Bt 1 + (1 β1)ξt

ϵ E [ Γt ] + ηG2E [ Γt ] .

Plugging the obtained bound for I with previously obtained bounds for II and III

II ηC1G2E[ Γt ] + η2C2 1LG2 ϵ E[ Γt ]

ϵ E[ f(θt) 2] + η2Lσ2

ϵ + η2C2 1LG2E[ Γt 2]

Published as a conference paper at ICLR 2025

into (32), choosing ρ = ϵ 8C0 and using the step-size bound η ϵ 16LC0 we get

E[f(xt+1)] E[f(xt)] η 2C0 E h f(xt) 2i η 4C0 E[ f(θt) 2] + ηρ

2ϵ E[ f(θt) 2]

" β1 1 β1 Bt 1 + (1 β1)ξt

+ ηC1G2E[ Γt ] + η2C2 1LG2 ϵ E[ Γt ] + ηG2E [ Γt ]

ϵ E[ f(θt) 2] + η2Lσ2

ϵ + η2C2 1LG2E[ Γt 2]

η 2C0 E h f(xt) 2i η 8C0 E[ f(θt) 2] + η2Lσ2

3 ϵ + ϵ 8C0

" β1 1 β1 Bt 1 + (1 β1)ξt

+ η(1 + C1)G2E[ Γt ] + η2(1 + C1)C1LG2

ϵ E[ Γt ] + η2C2 1LG2E[ Γt 2]

C0 (E[f(xt)] f ) η 8C0 E[ f(θt) 2] + η2Lσ2

" β1 1 β1 Bt 1 + (1 β1)ξt

+ η(1 + C1)G2E[ Γt ] + η2(1 + C1)C1LG2

ϵ E[ Γt ] + η2C2 1LG2E[ Γt 2],

where in the last inequality we applied PL condition from Assumption 4. After some reshuffling of the terms, we obtain the following recursion:

E[f(xt+1)] f 1 ηµ

(E[f(xt)] f ) η 8C0 E[ f(θt) 2] + η2Lσ2

" β1 1 β1 Bt 1 + (1 β1)ξt

+ η(1 + C1)G2E[ Γt ] + η2(1 + C1)C1LG2

ϵ E[ Γt ] + η2C2 1LG2E[ Γt 2].

Now we unroll the obtained recursion and invoke Lemma 2 with γ = 1 ηµ

C0 . From η ϵ 4LC0 C0

4µ , we conclude that γ = 1 ηµ

C0 (0, 1). To satisfy the condition on γ of Lemma 1, we enforce the

Published as a conference paper at ICLR 2025

2µ (1 β1)(1 qr) on the step size. Therefore,

E[f(x T +1)] f (16) 1 ηµ

T (f(x1) f ) η 8C0

T t E[ f(θt) 2]

" β1 1 β1 Bt 1 + (1 β1)ξt

+ η(1 + C1)G2 T X

T t E[ Γt ]

+ η2(1 + C1)C1LG2

T t E[ Γt ]

+ η2C2 1LG2 T X

T t E[ Γt 2]. (35)

For the second sum above we upper bound it by its infinite sum as

For the third sum, we apply Lemma 2 and for the other three sums we bound 1 ηµ

C0 1 and apply the bounds in Lemma 3:

t=1 E[ Γt ] 2 ϵ,

t=1 E[ Γt 2] 2

Published as a conference paper at ICLR 2025

Plugging all this bounds into (35), cancelling terms involving gradients with step size restriction η ϵ3/4

6L C0C2 and noticing that x1 = θ1, we finally get

E[f(x T +1)] f 1 ηµ

T (f(θ1) f ) η 8C0

T t E[ f(θt) 2]

T t E[ f(θt) 2]

+ η(1 + C1)G2 2 ϵ + η2(1 + C1)C1LG2

ϵ 2 ϵ + η2C2 1LG2 2

T (f(θ1) f )

µϵ + 4η2L2C0C2σ2

+ 2η(1 + C1)G2

ϵ + 2η2(1 + C1)C1LG2

ϵ + 2η2C2 1LG2

T (f(θ1) f )

µϵ + 2(1 + C1)G2

+ η2 4L2C0C2σ2

µϵ3/2 + 2(1 + 2C1)C1LG2

The obtained rate above is with respect to the virtual iterates xt that we defined for the purposes of analysis. To convert this rate with respect to the iterates θt of the algorithm, we apply L-smoothness to bound the functional difference:

|f(xt) f(θt)| | f(θt), xt θt) | + L

2 xt θt 2 ηC1G2

ϵ + η2LC2 1G2

which implies

E[f(θT +1)] f 1 ηµ

T (f(θ1) f )

µϵ + 3(1 + C1)G2

+ η2 4L2C0C2σ2

µϵ3/2 + 5(1 + C1)C1LG2

Plugging η = min(η0, 2C0 log T

µT ) with η0 = min( ϵ 16LC0 , C0(1 β1)(1 qr)

6L C0C2 ), we get

E[f(θT +1)] f max

1 T 2 , 1 η0µ

µϵ + 3(1 + C1)G2

T 2 4C2 0 µ2

µϵ3/2 + 5(1 + C1)C1LG2

µ2ϵ + 6C0(1 + C1)G2

which completes the proof.