Published as a conference paper at ICLR 2025

ADAGRAD UNDER ANISOTROPIC SMOOTHNESS

Yuxing Liu, Rui Pan, Tong Zhang
University of Illinois Urbana-Champaign
{yuxing6,ruip4,tozhang}@illinois.edu
∗Equal contribution.

Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g., SGD) have not been fully understood, especially in the large-batch-size setting commonly used in practice. This is because the only theoretical result demonstrating this benefit was obtained in the original Adagrad paper for convex nonsmooth objective functions, which is insufficient for large-batch algorithms. In this work, we attempt to resolve this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing a corresponding analysis of Adagrad. It is shown that under anisotropic smoothness and noise conditions, Adagrad can achieve faster convergence guarantees, in terms of better dimensional dependence, than algorithms with uniform step sizes across all coordinates. Experiments on logistic regression and instruction-following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis.

1 INTRODUCTION

To solve the stochastic optimization problem

$$\min_{w \in \mathbb{R}^d} f(w) \triangleq \mathbb{E}_\xi[f(w; \xi)], \tag{1}$$

adaptive gradient methods (Duchi et al., 2011; Zeiler, 2012; Tieleman et al., 2012; Kingma & Ba, 2014; Loshchilov & Hutter, 2017) are among the most popular methods. These methods have gained incredible importance from their superior efficiency, especially in training large foundation models, where large batch sizes are commonly employed. One of the most important features of adaptive gradient methods is the coordinate-wise anisotropic step size.
Take Adagrad (Duchi et al., 2011; Streeter & McMahan, 2010), the first adaptive gradient method, as an example, which writes $w_{t+1} = w_t - \eta_t \Lambda_t^{-1} g_t$, where $g_t$ denotes the stochastic gradient estimation obtained at $w_t$ and $\Lambda_t$ is a diagonal matrix representing the square root of a coordinate-wise summation of former squared gradient estimations. Despite the huge success in practice, as also noted in Li et al. (2021); Kunstner et al. (2023); Li et al. (2024), the theoretical understanding of when and why adaptive gradient methods enjoy acceleration over classical gradient algorithms with uniform step sizes across all coordinates, such as SGD, is still limited. On the theory side, the original Adagrad papers (Duchi et al., 2011; Streeter & McMahan, 2010) show the superiority of Adagrad over SGD in the convex non-smooth scheme, suggesting that if the gradients are sparse and the predictor is limited in an appropriate convex set, Adagrad can converge faster in terms of better dimensional dependence. However, as the convergence rates in nonsmooth settings rely on the scale of the stochastic gradients, their results can be insufficient for the smooth and large-batch training scheme, which is a realistic setting gaining extensive focus. This is because the scale of stochastic gradients does not decrease linearly as the batch size $M$ increases. Therefore, if we fix the stochastic gradient computation number $N = MT$, where $M$ is the batch size and $T$ is the total iteration number, such that increasing $M$ leads to a linear decrease in $T$, the original convergence rate $O(1/\sqrt{T})$ may be unsatisfying. Fine analysis in smooth settings can solve this problem, as the obtained convergence results in this case depend on the gradient variance, which decreases linearly as the batch size $M$ increases. Many theoretical papers have also conducted analysis of popular adaptive gradient methods (e.g.
Adagrad, Adam) in smooth settings under smoothness and noise assumptions with respect to the $\ell_2$ norm. However, to the best of our knowledge, their proven results have no better or even worse dimensional dependence than the standard convergence results of SGD in the same settings. The existence of this gap between theory and practice makes us wonder: Can we obtain better theoretical guarantees of adaptive gradient methods to explain their practical success, especially in large batch settings? To answer this question, it would be a good starting point to revisit the well-known insight of adaptive gradient methods. Intuitively, Adagrad shines when the problem is highly imbalanced, i.e., coordinates of the gradients have very different magnitudes. It schedules a larger learning rate (compared to SGD) for coordinates with small gradients and thus converges faster. This implies that the performance of adaptive gradient methods relies largely on the anisotropic structure of the problem. However, existing standard assumptions fail to describe this property. Take the standard smoothness (Nesterov et al., 2018) as an example, which assumes a constant $L > 0$ such that $-L I_d \preceq \nabla^2 f(w) \preceq L I_d$ for all $w$. It is evident that this assumption is coordinate-wise equivalent and cannot reflect the imbalance between coordinates; thus the benefits of adaptive gradient methods are hidden, as in the results of Vaswani et al. (2020); Défossez et al. (2020); Wang et al. (2023). To better explore the provable benefits of adaptive gradient methods, it is necessary to employ appropriate assumptions that can better describe the structure of models. In this paper, we consider the anisotropic smoothness and noise assumptions. Based on these assumptions, we give a novel convergence analysis of Adagrad and show that Adagrad can adapt well to the problem's anisotropic nature.
To extend our results to a more general setting, we additionally introduce a novel generalized anisotropic smoothness assumption and provide a corresponding analysis of Adagrad. By comparing the convergence results of Adagrad and gradient methods with uniform step sizes across all coordinates (SGD and Adagrad-Norm (Streeter & McMahan, 2010) as two representatives), we justify the power of adaptive gradient methods when the problem has a highly anisotropic nature.¹ Our contributions are summarized as follows:

1. We present a fine-grained analysis of Adagrad under anisotropic smoothness and noise assumptions, leading to novel theoretical convergence guarantees for Adagrad. We further introduce a generalized form of anisotropic smoothness, extending these results to more practical settings.
2. We discuss how the convergence results indicate the potential benefits of Adagrad compared to algorithms with coordinate-wise uniform step sizes such as SGD and Adagrad-Norm.
3. Experiments on logistic regression with real-world datasets and instruction-following fine-tuning with GPT-2 provide concrete empirical evidence to support our claims.

2 RELATED WORK

Adaptive gradient methods. Adaptive gradient methods are popular optimizers for training neural networks (Choi et al., 2019; Vani & Rao, 2019). Among them, Adagrad (Duchi et al., 2011; Streeter & McMahan, 2010) is considered to be the first adaptive gradient method in this branch, which was originally proposed to solve non-smooth online convex optimization problems. Ever since its first appearance, numerous adaptive gradient methods have emerged, such as RMSProp (Tieleman et al., 2012), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2014), SC-Adagrad (Mukkamala & Hein, 2017), AdamW (Loshchilov & Hutter, 2017), WNGrad (Wu et al., 2018), AMSGrad (Reddi et al., 2019), and SAdam (Wang et al., 2019), to name a few.
They have revolutionized the field of deep neural network training, and are still widely adopted in the literature of large language models (Radford et al., 2019; Touvron et al., 2023; Touvron et al.).

¹Note that the comparison between upper bounds is commonly adopted in past literature, e.g., (Bernstein et al., 2018a; Allen-Zhu, 2018), though theoretically it only suggests a worst-case performance comparison of algorithms.

Convergence results of SGD. Stochastic gradient descent (SGD), a popular optimizer for many real-world tasks, has been extensively studied in the literature (Robbins & Monro, 1951; Nemirovski & Yudin, 1978; Nemirovskij & Yudin, 1983; Nemirovski et al., 2009; Hazan et al., 2007; Rakhlin et al., 2011; Shamir & Zhang, 2013). In the smooth stochastic optimization scheme, Moulines & Bach (2011) and Ghadimi & Lan (2013) gave analyses of SGD under standard assumptions in the convex and nonconvex settings, respectively. More recently, Zhang et al. (2019) introduced the $(L_0, L_1)$-smoothness to relax the standard smoothness assumption and studied the convergence of SGD and clipped SGD under this assumption. A line of follow-up work appeared after that (Zhang et al., 2020a; Chen et al., 2020; Gorbunov et al., 2020; Qian et al., 2021; Koloskova et al., 2023).

Convergence results of Adagrad. As the pioneering work on Adagrad, Duchi et al. (2011) provided an analysis of Adagrad's convergence guarantees in online convex optimization settings, showing acceleration over SGD when the gradients are sparse. However, the presented analysis is only for general nonsmooth objectives, which is insufficient to explain Adagrad's effectiveness under large batch settings. After that, Levy et al. (2018); Vaswani et al. (2020) studied the smooth convex convergence but showed no better results than SGD.
On another side, the convergence of Adagrad or its close variants in the smooth nonconvex setting has been extensively studied (Li & Orabona, 2019; Défossez et al., 2020; Ward et al., 2020; Faw et al., 2022; 2023; Wang et al., 2023; Kavis et al., 2022; Liu et al., 2023b; Attia & Koren, 2023; Hong & Lin, 2024b). Among them, Ward et al. (2020) obtained the $O(1/\sqrt{T})$ rate of Adagrad-Norm under a globally bounded gradient assumption. More recently, Faw et al. (2023); Wang et al. (2023) further improved the result to hold under the $(L_0, L_1)$-smoothness assumptions. There are also extensive studies focused on the convergence of Adam and its close variants; we list some of them here for reference: (Reddi et al., 2019; De et al., 2018; Défossez et al., 2020; Guo et al., 2021; Zhang et al., 2022; Wang et al., 2022; Li et al., 2024; Wang et al., 2024; Hong & Lin, 2024a).

Theoretical understanding of adaptive gradient methods. Surprisingly, though the community tends to share the common sense that adaptive gradient methods converge much faster than algorithms with uniform step sizes in specific tasks, theoretical explanations of when and why this happens are relatively rare compared to extensive empirical studies. It is worth pointing out that among all the above-mentioned works on theoretical analysis of adaptive gradient methods, only the initial Adagrad analysis (Duchi et al., 2011) clearly shows this acceleration in terms of possibly lower dimensional dependence in the online convex programming setting. How adaptive gradient methods can help accelerate convergence in smooth or large batch stochastic optimization still remains unclear. We also notice that Cesa-Bianchi et al. (2007); Orabona & Pál (2015; 2018); Zhuang et al. (2022) mentioned the intuition of scale-free algorithms, and Zhuang et al. (2022) demonstrates the connection between the scale-freeness of adaptive gradient methods and better condition number dependence. More recently, Zhang et al. (2024); Das et al.
(2024) also investigated why Adam is effective in certain tasks compared to SGD and give theoretical results in quadratic settings. However, their results are more intuitive, with very restrictive analyses that are insufficient to explain real-world tasks. In another line of work, Bernstein et al. (2018a); Wu et al. (2020); Kunstner et al. (2023); Liu et al. (2023a) suggest a relation between the benefits of adaptive gradient methods and their sign-based nature. These intuitions may shed light on the theoretical understanding of adaptive gradient methods.

Large batch training. Large batch training enjoys extensive focus for its practical impact. It has been observed that large batch sizes are beneficial for accelerating large model training in practice (You et al., 2017a;b; 2018; 2019; Pan et al., 2022). Furthermore, large batch training is a valuable acceleration technique in distributed machine learning (Verbraeken et al., 2020) and pretraining (Zhou et al., 2023), where adaptive gradient methods are popular. In particular, it is a common practice in pretraining of large language models to combine large batch sizes with adaptive gradient methods (Radford et al., 2019; Touvron et al., 2023; Touvron et al.).

Concurrent work. In parallel with our work, Maladkar et al. (2024) also investigates the use of anisotropic assumptions to describe the nonconvex convergence of Adagrad. Though the starting points are similar, there are some notable differences between the two works. First, our result applies to a more general smoothness assumption that can well describe practical neural network training (as in Figure 1) and covers the result in Maladkar et al. (2024). Second, we include a convex part in our paper, which is a simpler case for illustrating the power of anisotropic assumptions. Based on this part and our new assumption and theory, we also conducted various empirical verifications in the experiment section, which provide strong evidence for our theory.
Third, Maladkar et al. (2024) further introduced a lower bound for gradient $\ell_1$-norm convergence of SGD, which may help a more rigorous comparison between SGD and Adagrad in the nonconvex case. However, it is not clear whether the $\ell_1$-norm is a better convergence measure, as this depends on the training curvature. To conclude, both works contribute to a better theoretical understanding of the benefits of adaptive gradient methods.

3 PRELIMINARIES

3.1 NOTATIONS

We use $\odot$ to denote the coordinate-wise product of vectors and, when no confusion arises, $\sqrt{\cdot}$ is sometimes used to denote the coordinate-wise square root of a vector or diagonal matrix. Let $H \in \mathbb{R}^{d \times d}$ be a symmetric positive definite matrix; we denote the vector norm induced by $H$ by $\|w\|_H^2 \triangleq w^\top H w$. With a slight abuse of notation, for a vector $h \in \mathbb{R}_+^d$, we denote $\|w\|_h^2 = \sum_{j=1}^d h_j w_j^2$. For a symmetric positive definite matrix $H \in \mathbb{R}^{d \times d}$ and a convex set $W$, we introduce the $H$-based projection operator $\Pi_W^H(\cdot)$ such that $\Pi_W^H(w) = \operatorname{argmin}_{z \in W} \|z - w\|_H^2$. As discussed in Hazan et al. (2007), the projection is a convex program and can be solved efficiently. Let us denote by $\nabla_w f(w; \xi)$ the stochastic gradient oracle at $w$ and by $g_t$ the gradient estimation employed at $w_t$. $\mathcal{F}_t \triangleq \sigma(g_0, \dots, g_{t-1})$ stands for the sigma field of the gradient estimators from the first iteration to the $(t-1)$-th iteration. We use $\mathbb{E}[\cdot]$ to denote the total expectation over $\mathcal{F}_T$, where $T$ is the maximum iteration number, and $\mathbb{E}_t[\cdot]$ as an abbreviation of the conditional expectation $\mathbb{E}[\cdot \mid \mathcal{F}_t]$.

3.2 PROBLEM SETTINGS AND ASSUMPTIONS

We study the stochastic optimization problem (1), where we can only access the stochastic gradient oracle $\nabla f(w; \xi)$ at $w$. Throughout this paper, we consider the following assumptions.

Assumption 3.1 (Convexity). $f(\cdot)$ is convex. For convex cases, we search for a solution in a closed convex set $W \subseteq \mathbb{R}^d$ such that there exists at least one optimal solution $w^* \in W$ and

$$\max_{w, w' \in W} \|w - w'\|_\infty \le D_\infty \quad \text{and} \quad \max_{w, w' \in W} \|w - w'\|_2 \le D_2. \tag{2}$$

Assumption 3.2 (Lower bounded).
There exists a constant $f^*$ such that for all $w \in \mathbb{R}^d$, $f(w) \ge f^*$.

Assumption 3.3 (Anisotropic $L$-smoothness). There exists a positive vector $L = [L_1, \dots, L_d] \in \mathbb{R}_+^d$ such that $f(\cdot)$ is $L$-smooth, namely, for all $w, w' \in \mathbb{R}^d$,

$$\|\nabla f(w) - \nabla f(w')\|_{L^{-1}} \le \|w - w'\|_L. \tag{3}$$

Assumptions 3.1 and 3.2 are standard for convex and nonconvex problems, respectively. Assumption 3.3 is an anisotropic generalization of the smoothness condition, which has also been employed in a line of work on SignSGD (Bernstein et al., 2018a;b). It is also worth pointing out that a similar block Lipschitz assumption is commonly employed in the study of coordinate descent (Wright, 2015; Richtárik & Takáč, 2016). It can be an interesting direction to find out the relations in between. The intuition of Assumption 3.3 can be understood in the following manner. When the loss function $f(\cdot)$ is twice-differentiable, Assumption 3.3 is equivalent to $\nabla^2 f \preceq \operatorname{diag}(L)$, which implies the standard smoothness assumption with $L = \|L\|_\infty$. When the Hessian of $f(\cdot)$ is imbalanced, namely, coordinates have very different scales, $L$ can be adapted to this imbalanced distribution and can describe a tighter upper bound of the Hessian of $f(\cdot)$, resulting in $\|L\|_1 \ll Ld$. This benefit is realistic, as a highly imbalanced spectrum distribution of the Hessian has been widely observed in multiple circumstances (Sagun et al., 2016; Arjevani & Field, 2020; Pan et al., 2021). The power of this adaptation shines when adaptive gradient methods are employed. We consider the standard stochastic approximation framework (Kushner & Clark, 2012) and denote the gradient noise at $w$ by $n(w; \xi) \triangleq \nabla f(w) - \nabla_w f(w; \xi)$. We assume the following conditions on the gradient noise throughout this paper.

Assumption 3.4 (Unbiased independent gradient). Each $f(w; \xi)$ is independently drawn and

$$\mathbb{E}[n(w; \xi)] = 0. \tag{4}$$
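As a sanity check on Assumption 3.3, inequality (3) can be verified numerically on a toy quadratic with diagonal Hessian, where it in fact holds with equality. The sketch below is purely illustrative: the function, the constants in `L`, and all names are our own choices, not objects from the paper.

```python
import numpy as np

# Toy quadratic f(w) = 1/2 * sum_j L_j * w_j^2, whose Hessian is diag(L).
# For this f, ||grad f(w) - grad f(w')||_{L^{-1}} <= ||w - w'||_L holds
# with equality, while the standard smoothness constant would be max(L).
rng = np.random.default_rng(0)
L = np.array([100.0, 1.0, 0.01])      # highly imbalanced coordinate-wise constants
grad = lambda w: L * w                # gradient of the quadratic above

def aniso_norms(w, wp):
    lhs = np.sqrt(np.sum((grad(w) - grad(wp)) ** 2 / L))  # ||.||_{L^{-1}}
    rhs = np.sqrt(np.sum(L * (w - wp) ** 2))              # ||w - w'||_L
    return lhs, rhs

for _ in range(100):
    w, wp = rng.normal(size=3), rng.normal(size=3)
    lhs, rhs = aniso_norms(w, wp)
    assert lhs <= rhs + 1e-9          # Assumption 3.3 holds (here with equality)

# ||L||_1 can be far below d * ||L||_inf when the curvature is imbalanced:
print(np.sum(np.abs(L)), len(L) * np.max(L))  # 101.01 vs 300.0
```

The last line illustrates the gap $\|L\|_1 \ll Ld$ discussed above: one stiff coordinate inflates the isotropic constant for every coordinate, while the anisotropic vector pays for it only once.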
Algorithm 1 Adagrad
1: Input: $w_0 \in \mathbb{R}^d$, $\{\eta_t\}_{t=0}^{T-1} \subset \mathbb{R}$, $\epsilon \in \mathbb{R}$, and batch size $M \in \mathbb{N}$
2: Initialize $v_{-1} = \epsilon^2 \mathbf{1}_d$
3: for $t = 0$ to $T-1$ do
4:&nbsp;&nbsp;&nbsp;Sample mini-batch $B_t$ with $|B_t| = M$ uniformly
5:&nbsp;&nbsp;&nbsp;$g_t = \frac{1}{M} \sum_{\xi \in B_t} \nabla_w f(w_t; \xi)$
6:&nbsp;&nbsp;&nbsp;$v_t = v_{t-1} + (g_t \odot g_t)$
7:&nbsp;&nbsp;&nbsp;$\Lambda_t = \operatorname{diag}(\sqrt{v_t})$
8:&nbsp;&nbsp;&nbsp;Option I: $w_{t+1} = \Pi_W^{\Lambda_t}(w_t - \eta_t \Lambda_t^{-1} g_t)$
9:&nbsp;&nbsp;&nbsp;Option II: $w_{t+1} = w_t - \eta_t \Lambda_t^{-1} g_t$
10: end for
11: Output: $\frac{1}{T} \sum_{t=0}^{T-1} w_t$

Assumption 3.5 (Anisotropic noise). There exists a positive vector $\sigma = [\sigma_1, \dots, \sigma_d] \in \mathbb{R}_+^d$ such that

$$\mathbb{E}\left[n_j(w; \xi)^2\right] \le \sigma_j^2 \quad \text{for all } j \in [d]. \tag{5}$$

Note that Assumption 3.5 implies the standard bounded noise assumption with $\mathbb{E}[\|n(w; \xi)\|_2^2] \le \|\sigma\|_2^2$. Intuitively, it upper bounds all the coordinates of $n(w; \xi)$ instead of only the norm and gives more detailed information on the scale of the noise. Generally, combining Assumptions 3.3 and 3.5 allows a more fine-grained analysis, which can take the sparsity of the model into account. Note that the anisotropic noise assumption has also been explored in other lines of studies on sign-based methods (Bernstein et al., 2018a;b; Crawshaw et al., 2022) and quadratics (Dieuleveut et al., 2017; Ge et al., 2019; Pan et al., 2021; 2023). It is also closely related to Assumption 2 in Zhang et al. (2020b), where the authors attempt to model the heavy-tailedness of neural networks.

4 ADAGRAD WITH ANISOTROPIC ASSUMPTIONS

How to accelerate the convergence of SGD has been a fundamental problem in training machine learning models. One possible approach is to lower the implicit dependence on the dimension $d$. As discussed in Nguyen et al. (2019), in smooth strongly convex settings, the lower bound of SGD can be $d$ times larger than that of a wider class of algorithms including adaptive gradient methods. The original Adagrad paper (Duchi et al., 2011) showed that Adagrad has better dimension dependence than SGD when the stochastic gradients are generally sparse in the non-smooth convex scheme.
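Algorithm 1 (option II, without projection) can be sketched in a few lines of numpy. This is a minimal illustrative implementation; the quadratic test function, the default step size, and all names are our own choices for the sketch, not the paper's.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.5, eps=1e-10, T=500):
    """Minimal sketch of Algorithm 1, option II (no projection).

    grad_fn(w) plays the role of the mini-batch gradient estimate g_t;
    v accumulates coordinate-wise squared gradients, and the update
    multiplies g_t by Lambda_t^{-1} = diag(sqrt(v_t))^{-1}.
    """
    w = np.asarray(w0, dtype=float).copy()
    v = (eps ** 2) * np.ones_like(w)      # v_{-1} = eps^2 * 1_d
    iterates = []
    for _ in range(T):
        g = grad_fn(w)
        v = v + g * g                     # v_t = v_{t-1} + g_t (.) g_t
        w = w - eta * g / np.sqrt(v)      # w_{t+1} = w_t - eta * Lambda_t^{-1} g_t
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)      # output: averaged iterate

# Ill-conditioned quadratic: coordinates with very different curvature
# effectively receive different step sizes from the same eta.
H = np.array([100.0, 1.0])
w_bar = adagrad(lambda w: H * w, np.array([1.0, 1.0]))
```

Because the per-coordinate normalization is scale-invariant, the stiff coordinate (curvature 100) and the flat one (curvature 1) make similar relative progress here, which is exactly the anisotropic adaptation the analysis below quantifies.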
However, in the stochastic smooth optimization scheme, to the best of our knowledge, existing theoretical results are insufficient to account for the benefit of adaptive gradient methods. We attempt to fill this gap, equipped with the anisotropic Assumptions 3.3 and 3.5.

4.1 CONVEX CASES

As a warmup, we present results in the convex case to first gain some intuition on how the anisotropic assumptions can better describe the convergence of Adagrad in the large batch setting.

Theorem 4.1 (Convex convergence of Adagrad). Under Assumptions 3.1, 3.3, 3.4, 3.5, for the sequence $\{w_t\}_{t=1}^{T}$ generated by Adagrad (Algorithm 1 with option I) with constant step size $\eta_t \equiv \eta = D_\infty$, it holds that for $\bar{w}_T = (1/T)\sum_{t=0}^{T-1} w_t$,

$$\mathbb{E}[f(\bar{w}_T) - f(w^*)] = O\!\left(\frac{D_\infty \|\sigma\|_1}{\sqrt{MT}} + \frac{\|L\|_1 D_\infty^2}{T}\right) + O\!\left(\frac{\epsilon D_2^2}{D_\infty T}\right).$$

Note that $\epsilon$ is employed mainly for numerical stability and is commonly very small (on the order of $10^{-10}$ by default). Theorem 4.1 shows that the convergence of Adagrad depends on $\|L\|_1$ and $\|\sigma\|_1$, which require Assumptions 3.3 and 3.5 to describe. In contrast, if we only use the standard $L$-smoothness and $\|\sigma\|_2^2$-bounded gradient variance assumptions, explicit dimensional dependence will inevitably be involved, resulting in worse results than those for coordinate-wise uniform step sizes, as in Vaswani et al. (2020). To better see how Theorem 4.1 can describe the potential benefits of Adagrad, we also include the convergence rates of SGD and Adagrad-Norm as representatives of algorithms with coordinate-wise uniform step sizes here²:

$$\text{SGD \& Adagrad-Norm:} \quad \mathbb{E}[f(w) - f(w^*)] = O\!\left(\frac{D_2 \|\sigma\|_2}{\sqrt{MT}} + \frac{\|L\|_\infty D_2^2}{T}\right). \tag{6}$$

By comparing the results in Theorem 4.1 and (6), we can find that whether Adagrad is better than SGD or Adagrad-Norm largely relies on the ratios $\frac{D_\infty \|\sigma\|_1}{D_2 \|\sigma\|_2}$ and $\frac{\|L\|_1 D_\infty^2}{\|L\|_\infty D_2^2}$, which reflect the sparsity of the curvature, the noise, and the geometry of $W$. When $M$ is small, such that the variance term, i.e.
the term relevant to $\sigma$, is dominant, our conclusion is consistent with that presented in Duchi et al. (2011) for nonsmooth cases. Generally, when (1) $\sigma$ is sparse, i.e., has very different scales in different coordinates, which implies that $\|\sigma\|_1 \ll \sqrt{d}\|\sigma\|_2$; and (2) $W$ satisfies that $D_2$ is close to $\sqrt{d} D_\infty$, which can be achieved by setting $W$ to be a hypercube, the variance term of Adagrad might be much smaller than that of SGD. When a large batch size is employed, the bias term, i.e., the $O(1/T)$ term irrelevant to the noise, can also be important, and thus the superior performance of Adagrad in this case additionally requires that (3) $L$ is sparse, such that $\|L\|_1 \ll d\|L\|_\infty$. Following Duchi et al. (2011), we also provide a concrete example for better understanding these quantities in Appendix B.

Remark 4.2. It is also worth pointing out that the ratio between the bias terms of Theorem 4.1 and (6) can be of order $\Theta(1/d)$ in extreme cases, while the variance ratio can only be $\Theta(1/\sqrt{d})$. This suggests an even sharper possible gap when $M$ is large and might provide some intuition for the observation that adaptive gradient methods benefit more from large batch sizes than SGD (Kunstner et al., 2023).

4.2 NONCONVEX CASES

Next we consider the more general nonconvex scheme.

Theorem 4.3 (Nonconvex convergence of Adagrad). Under Assumptions 3.2, 3.3, 3.4, 3.5, for the sequence $\{w_t\}_{t=1}^{T-1}$ generated by Adagrad (Algorithm 1 with option II) with an appropriately chosen constant step size, it holds that

$$\left(\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(w_t)\|_1\right]\right)^2 = \tilde{O}\!\left(\frac{\|\sigma\|_1 \sqrt{\|L\|_1 \Delta}}{\sqrt{MT}} + \frac{\|L\|_1 \Delta + \|\sigma\|_1^2 / M}{T}\right),$$

where $\Delta = f(w_0) - f^*$ and we use $\tilde{O}(\cdot)$ to hide logarithmic factors.

It is worth pointing out that Theorem 4.3 obtains convergence of $\|\nabla f(w)\|_1$ instead of the common $\|\nabla f(w)\|_2$ obtained by algorithms with coordinate-wise uniform step sizes like SGD, which indicates at most $d$ times tighter results when the gradients are generally dense.
Similar to the convex case, the introduction of the anisotropic Assumptions 3.3 and 3.5 removes the explicit dependence on the dimension $d$ compared to existing results on adaptive gradient methods like Défossez et al. (2020); Liu et al. (2023b); Zhang et al. (2022); Wang et al. (2022), rendering Adagrad at least comparable to algorithms with coordinate-wise uniform step sizes like SGD, which has the following result:

$$\text{SGD:} \quad \mathbb{E}\left[\|\nabla f(w)\|_2^2\right] \le O\!\left(\frac{\|\sigma\|_2 \sqrt{\|L\|_\infty \Delta}}{\sqrt{MT}} + \frac{\|L\|_\infty \Delta}{T}\right). \tag{7}$$

We can find that when the batch size is large enough such that $M \ge \|\sigma\|_1^2 / (\|L\|_1 \Delta)$, the comparison between Theorem 4.3 and (7) is generally consistent with the comparison between SGD and SignSGD (Bernstein et al., 2018a). It mainly relies on the sparsity of $\nabla f(w_t)$, $L$, and $\sigma$, which can determine the ratio between the upper bounds. Generally speaking, when $L$ and $\sigma$ are sparse and the gradients $\nabla f(w_t)$ are relatively dense, Adagrad shines compared to SGD in terms of having a tighter convergence guarantee.

²We also include the standard proof of SGD and discussions on convex SGD convergence in Appendix D for completeness. The result of Adagrad-Norm can be found in Levy et al. (2018).

Table 1: We summarize the convergence rates of algorithms under the nonconvex smooth settings, with a large enough batch size such that $M \ge \|\sigma\|_1^2 / (\|L\|_1 \Delta)$. Note that we omit logarithmic terms here. The convergence of SGD in the smooth setting is a standard result, and we include the theorem in the appendix for reference. Also, we leave the convergence of SGD under generalized smoothness blank, as it is generally incomparable with other results, which we will discuss in Section 5.

| Algorithm | Convergence objective | Anisotropic smooth | Anisotropic generalized smooth |
|---|---|---|---|
| SGD | $\mathbb{E}\,\Vert\nabla f(w)\Vert_2^2$ | $\frac{\Vert L\Vert_\infty \Delta}{T}$ (Theorem D.3) | — |
| Adagrad-Norm | $\mathbb{E}\,\Vert\nabla f(w)\Vert_2$ | $\frac{\Vert L\Vert_\infty \Delta}{T}$ (Faw et al., 2023) | $\frac{\sqrt{L_0 \Delta}\,\Vert\sigma\Vert_2}{\sqrt{MT}} + \frac{L_0 \Delta + L_1^2 \Delta^2}{T}$ (Faw et al., 2023) |
| Adagrad | $\mathbb{E}\,\Vert\nabla f(w)\Vert_1$ | $\frac{\Vert L\Vert_1 \Delta}{T}$ (Theorem 4.3) | $\frac{\sqrt{\Vert L_0\Vert_1 \Delta}\,\Vert\sigma\Vert_1}{\sqrt{MT}} + \frac{\Vert L_0\Vert_1 \Delta + \Vert L_1\Vert_2^2 \Delta^2}{T}$ (Theorem 5.3) |
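The sparsity conditions in the comparison above are easy to quantify: the gap between $\|x\|_1$ and $\sqrt{d}\,\|x\|_2$ is exactly what separates sparse from dense vectors. A small numerical illustration, with hypothetical vectors chosen by us purely for intuition:

```python
import numpy as np

d = 10_000
sparse = np.zeros(d)
sparse[:10] = 1.0                 # energy concentrated in 10 coordinates
dense = np.ones(d) / np.sqrt(d)   # equal energy in every coordinate

def l1_over_l2(x):
    # The ratio ||x||_1 / ||x||_2 always lies in [1, sqrt(d)]:
    # about sqrt(nnz) for a flat sparse vector, sqrt(d) for a dense one.
    return np.abs(x).sum() / np.sqrt((x ** 2).sum())

print(l1_over_l2(sparse))  # sqrt(10) ~ 3.16, far below sqrt(d) = 100
print(l1_over_l2(dense))   # exactly sqrt(d) = 100
```

When the smoothness vector $L$ or noise vector $\sigma$ behaves like `sparse`, the $\|\cdot\|_1$ quantities in Theorem 4.3 are far smaller than their worst case, while gradients behaving like `dense` keep $\|\nabla f\|_1$ close to $\sqrt{d}\,\|\nabla f\|_2$; this combination is the regime favorable to Adagrad.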
We note that this can be a realistic case, as $L$ can be extremely sparse based on many observations in multiple scenarios (Sagun et al., 2016; Arjevani & Field, 2020; Pan et al., 2021), while on the other side, $\|\sigma\|_1 / \|\sigma\|_2$ and $\|\nabla f(w_t)\|_1 / \|\nabla f(w_t)\|_2$ can be mild constants, as examined by Bernstein et al. (2018a). Therefore, with all these conditions satisfied, we show the potentially faster convergence of Adagrad compared to SGD. Also, the consistency between the convergence results of Adagrad and SignSGD provides theoretical insights into the close relation between adaptive gradient methods and sign-based methods.

5 ADAGRAD WITH GENERALIZED ANISOTROPIC SMOOTHNESS

In previous sections, we have discussed how Adagrad can potentially outperform algorithms with uniform step sizes under the anisotropic smoothness settings. However, the loss function in practice can commonly dissatisfy the smoothness assumption, with the local smoothness potentially unbounded. To better describe the complicated real cases, Zhang et al. (2019) introduced the $(L_0, L_1)$-smoothness as a generalization of the standard $L$-smoothness, which can be written as

$$\|\nabla f(w) - \nabla f(w')\|_2 \le (L_0 + L_1 \|\nabla f(w)\|_2) \|w - w'\|_2 \tag{8}$$

for all $\|w - w'\|_2 \le 1/L_1$ (Zhang et al., 2020a). By involving the gradient term to describe the local smoothness, this assumption can even be applicable to describing neural networks (Zhang et al., 2019; Crawshaw et al., 2022). In this section, we discuss an extension of both the anisotropic $L$-smoothness and the $(L_0, L_1)$-smoothness and the corresponding convergence results of Adagrad.

Assumption 5.1 (Anisotropic $(L_0, L_1)$-smoothness). There exist positive vectors $L_0 = [L_{0,1}, \dots, L_{0,d}] \in \mathbb{R}_+^d$ and $L_1 = [L_{1,1}, \dots, L_{1,d}] \in \mathbb{R}_+^d$ such that $f(\cdot)$ is $(L_0, L_1)$-smooth, namely, for $w, w' \in W$ such that $\|w - w'\|_{L_1} \le \sqrt{d}$, it holds that

$$\|\nabla f(w) - \nabla f(w')\|_{(L(w))^{-1}} \le \|w - w'\|_{L(w)}, \tag{9}$$

where $L(w) = [[L(w)]_1, \dots, [L(w)]_d] \in \mathbb{R}^d$ and $[L(w)]_j = L_{0,j} + L_{1,j} |\partial_j f(w)|$ for all $j \in [d]$.
For one thing, Assumption 5.1 is a natural generalization of Assumption 3.3 and directly implies it by setting $L_1 = 0$; hence it also enjoys the nice properties we have discussed for Assumption 3.3. Also, in the spirit of (8), we include the gradient magnitudes to describe the local smoothness. Instead of the gradient $\ell_2$-norm, we use the absolute value of each coordinate of the gradient, which is intuitively more relevant to the anisotropic nature of the curvature. We also conducted numerical experiments to verify the validity of the assumption, as shown in Figure 1.

Remark 5.2. It is also worth noticing that a similar but different coordinate-wise generalized smoothness assumption has been considered in Crawshaw et al. (2022):

$$|\partial_j f(w) - \partial_j f(w')| \le (L_{0,j} + L_{1,j} |\partial_j f(w)|) \|w - w'\|_2 \tag{10}$$

for all $\|w - w'\|_2 \le 1/\|L_1\|_\infty$ and all coordinates $j \in [d]$. Compared to this assumption, Assumption 5.1 offers several evident advantages: (1) Assumption 5.1 does not require conditions for each coordinate; (2) experiments show that Assumption 5.1 can well describe the real anisotropic curvature; (3) technically, it is difficult to avoid the explicit appearance of $d$ in the final convergence bound of Adagrad if Eqn. (10) is employed; for instance, even the convergence of the generalized SignSGD obtained in Crawshaw et al. (2022) has explicit dependence on $d$.

Theorem 5.3 (Convergence of Adagrad with generalized smoothness). Under Assumptions 3.2, 3.4, 3.5, 5.1, for the sequence $\{w_t\}_{t=1}^{T-1}$ generated by Adagrad (Algorithm 1 with option II) with a sufficiently small constant step size $\eta_t \equiv \eta \le \frac{1}{4\|L_1\|_\infty}$, it holds that

$$\left(\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(w_t)\|_1\right]\right)^2 = \tilde{O}\!\left(\frac{\|\sigma\|_1 \sqrt{\|L_0\|_1 \Delta}}{\sqrt{MT}} + \frac{\|L_0\|_1 \Delta + \|\sigma\|_1^2 / M + \|L_1\|_2^2 \Delta^2}{T}\right),$$

where $\Delta = f(w_0) - f^*$ and we use $\tilde{O}(\cdot)$ to hide logarithmic factors.

Note that Theorem 5.3 is a generalization of Theorem 4.3 based on the relation between Assumptions 5.1 and 3.3, and directly implies Theorem 4.3 by simply setting $L_1 = 0$. Intuitively, the interpretation of the theorem leads to the following implications.

Comparison with SGD.
Zhang et al. (2019) show that SGD generally requires an additional assumption of universally bounded gradients to obtain convergence under the generalized $(L_0, L_1)$-smoothness, and thus clipping should be introduced. The theoretical properties of clipped SGD have been extensively studied after that. However, to the best of our knowledge, existing results on clipped SGD either require restrictive noise assumptions (Zhang et al., 2019; 2020a; Chen et al., 2020; Qian et al., 2021) or can only show worse rates in $T$ and $M$ (Koloskova et al., 2023; Gorbunov et al., 2020). These results are generally incomparable with Theorem 4.3, showing the superiority of adaptive gradient methods, as also found in existing results (Wang et al., 2023; Li et al., 2024).

Comparison with Adagrad-Norm. One can also look at the convergence results of Adagrad-Norm (Streeter & McMahan, 2010), a scalar step size variant of Adagrad, under assumption (8):

$$\text{Adagrad-Norm:} \quad (\mathbb{E}[\|\nabla f(w)\|_2])^2 \le O\!\left(\frac{\sqrt{L_0 \Delta}\, \|\sigma\|_2}{\sqrt{MT}} + \frac{L_0 \Delta + L_1^2 \Delta^2}{T}\right),$$

which is obtained by taking $\eta$ to optimize the upper bound in Theorem 3 of Faw et al. (2023). It is worth noticing that in the case $L_1 = 0$, the comparison between Adagrad and Adagrad-Norm is generally consistent with that between Adagrad and SGD discussed in Section 4, showing Adagrad's effectiveness for anisotropic problems. On the other hand, when it comes to the general $(L_0, L_1)$-smoothness assumption, things become more complicated given the indeterminate relationship between the anisotropic $L_1$ and the scalar $L_1$ in (8). This relation is mainly about whether Assumption 5.1 or the initial $(L_0, L_1)$-smoothness assumption can better fit common training settings, such as large-scale language model pre-training, which is an intriguing topic worth exploring in the future.

Remark 5.4 (Technical contribution). From the technical perspective, our proof generally follows a similar main line to Défossez et al. (2020); Ward et al.
(2020) but involves multiple novel proof techniques to remove the restrictive assumption of a globally bounded $\|\nabla f(w)\|$ and enable the presence of generalized smoothness. Also, Theorem 4.3 can directly imply a convergence result of Adagrad-Norm comparable to Faw et al. (2023); Wang et al. (2023), while it avoids the complicated proof in Faw et al. (2023) and the heavy dependence on $1/\epsilon$ in Wang et al. (2023). On the other hand, however, Faw et al. (2023); Wang et al. (2023) allow a relaxed noise assumption, which may be of interest, and we leave it for possible future work.

6 EXPERIMENTAL RESULTS

6.1 CONVEX CASE

To verify the aforementioned theoretical results, we conduct experiments on logistic regression, whose loss functions are generally convex and smooth. Specifically, we utilize the real-world datasets a4a, a6a, a9a, real-sim, and rcv1.binary from libsvm (Chang & Lin, 2011), which comprise $N = 4781$, $11220$, $32561$, $20242$, and $72309$ samples, respectively. Within the former three datasets, each sample has a feature of $d = 123$ dimensions, which is generally sparse given only 14 non-zero-valued dimensions on average per sample. For the latter two large datasets, each sample possesses $d = 47{,}236$ and $d = 20{,}958$ feature dimensions, respectively, where only 51.29 and 74.05 dimensions on average are non-zero. More details of the experimental setup are available in Appendix A.

Table 2: Statistics on logistic regression. $D_2 = \max_t \|w_t - w^*\|_2$ and $D_\infty = \max_t \|w_t - w^*\|_\infty$ are estimated by the maximum value under all searched settings without loss explosion. Smaller values of $C_{\mathrm{var}} \triangleq D_\infty / D_2$ represent better theoretical bounds for Adagrad when compared with SGD. It is evident that in large sparse datasets, $D_\infty$ is much smaller than $D_2$, verifying the empirical gains of Adagrad implied by our theory.
DATASET    D∞     D₂      ‖L‖₁   L∞    CVAR
A4A        9.73   32.24   14.87  7.26  0.30
A6A        8.88   28.57   14.87  7.26  0.31
A9A        9.79   29.43   14.87  7.27  0.32
REAL-SIM   34.70  729.50  2.00   1.01  0.05
RCV1.BIN   15.32  635.47  2.00   1.02  0.02

As shown by the results presented in Table 3, in cases where the variance term dominates, e.g. a9a, Adagrad demonstrates similar convergence behaviors across varied batch sizes. This behavior cannot be explained by previous nonsmooth theories, since T = N/M implies worse convergence guarantees O(1/√T) as the batch size M increases. Furthermore, Adagrad consistently provides faster convergence than SGD, which verifies the superiority suggested by our theorems. In addition, when the batch size increases and the bias term becomes dominant, Adagrad is affected less than SGD, showing its robustness to different batch sizes.

Table 3: Training losses of SGD and Adagrad on logistic regression over 3 seeds with batch size M. Note that Adagrad's convergence behavior is generally unaffected by the batch size when M

DATASET (SMALL)  METHOD   (f(w) − f(w*)) × 10⁻³
                          M = 1        M = 4        M = 16       M = 64       M = 256      M = 1024
A4A              SGD      2.66 ± 0.08  2.32 ± 0.34  3.47 ± 0.07  2.24 ± 0.14  4.61 ± 0.03  8.47 ± 0.05
A4A              ADAGRAD  0.22 ± 0.03  0.22 ± 0.03  0.25 ± 0.03  0.24 ± 0.03  0.29 ± 0.03  0.51 ± 0.08
A6A              SGD      0.87 ± 0.09  1.40 ± 0.01  0.82 ± 0.09  1.56 ± 0.03  1.03 ± 0.01  2.53 ± 0.03
A6A              ADAGRAD  0.16 ± 0.00  0.16 ± 0.00  0.16 ± 0.01  0.21 ± 0.05  0.17 ± 0.01  0.20 ± 0.02
A9A              SGD      0.77 ± 0.08  0.47 ± 0.01  0.48 ± 0.05  0.58 ± 0.01  0.52 ± 0.06  0.76 ± 0.01
A9A              ADAGRAD  0.14 ± 0.01  0.14 ± 0.00  0.15 ± 0.00  0.16 ± 0.01  0.20 ± 0.01  0.12 ± 0.02

DATASET (LARGE)  METHOD   (f(w) − f(w*)) × 10⁻¹
                          M = 1        M = 4        M = 16       M = 64       M = 256      M = 1024
REAL-SIM         SGD      0.42 ± 0.08  0.27 ± 0.08  0.52 ± 0.06  0.92 ± 0.03  1.57 ± 0.02  2.68 ± 0.01
REAL-SIM         ADAGRAD  0.14 ± 0.00  0.14 ± 0.00  0.14 ± 0.00  0.14 ± 0.00  0.15 ± 0.00  0.19 ± 0.00
RCV1.BINARY      SGD      0.48 ± 0.02  0.28 ± 0.07  0.68 ± 0.04  1.33 ± 0.04  2.55 ± 0.21  5.02 ± 0.16
RCV1.BINARY      ADAGRAD  0.10 ± 0.00  0.10 ± 0.00  0.10 ± 0.00  0.11 ± 0.00  0.14 ± 0.01  0.20 ± 0.01

6.2 NONCONVEX CASE

Table 4: Losses on instruction-following tasks with
dataset Alpaca and the GPT-2 model.

METHOD    TRAINING LOSS f(w)
          M = 32  M = 64  M = 128  M = 256  M = 512  M = 1024
SGD       2.20    2.24    2.29     2.36     2.45     2.57
ADAGRAD   2.14    2.12    2.11     2.12     2.14     2.20

For nonconvex cases, we examine the instruction-following fine-tuning task on the Alpaca (Taori et al., 2023) dataset with the GPT-2 (Radford et al., 2019) model. GPT-2 utilizes GELU (Hendrycks & Gimpel, 2016) as its activation function. As shown in Table 4, Adagrad still outperforms SGD across different batch sizes. The loss gap is especially salient under large batch sizes. This also confirms that Adagrad's convergence speed is not significantly affected by large batch sizes, which matches the results in our theory. Full experimental details are available in Appendix A.

Figure 1: Verification of Assumption 5.1 in GPT-2 on the Alpaca dataset. x-axis: |∂jf(w)|; y-axis: |∂jf(w) − ∂jf(w′)|/|w − w′|j, where the color represents the iteration index of w. We run Adam with full gradients for 100 steps and randomly select nearby points w and w′ along the trajectory to plot the scatter points.

We also examine Assumption 5.1 under this nonconvex setting, following the common practice of Zhang et al. (2019); Crawshaw et al. (2022). Due to the inherent dependence on dimensions in Assumption 5.1, we randomly sample a small portion of the parameters in GPT-2 to check. As shown in Figure 1, a linear bound on local smoothness with respect to the gradient value holds in all randomly chosen dimensions. Note that the observed condition in Figure 1 is even stronger, as it directly implies Assumption 5.1 by summing over all dimensions.

7 CONCLUSION

In this paper, we present theoretical convergence results for Adagrad under anisotropic smoothness and noise assumptions. We further introduce a novel anisotropic generalized smoothness assumption to extend the aforementioned results in a more fine-grained analysis.
Based on the theorems, we conduct comparisons among the convergence rates of AdaGrad, SGD, and AdaGrad-Norm in large batch settings, which provides a deeper theoretical understanding of when and why adaptive gradient methods can outperform classical gradient algorithms with uniform step sizes across all coordinates. Empirical studies offer strong evidence to support our proposed assumptions and theory.

REFERENCES

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(221):1–51, 2018.

Yossi Arjevani and Michael Field. Analytic characterization of the hessian in shallow relu models: A tale of symmetry. Advances in Neural Information Processing Systems, 33:5441–5452, 2020.

Amit Attia and Tomer Koren. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv preprint arXiv:2302.08783, 2023.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018a.

Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018b.

Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66:321–352, 2007.

Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

Xiangyi Chen, Steven Z Wu, and Mingyi Hong. Understanding gradient clipping in private sgd: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.
Dami Choi, Christopher J Shallue, Zachary Nado, Jaehoon Lee, Chris J Maddison, and George E Dahl. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.

Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signsgd. Advances in Neural Information Processing Systems, 35:9955–9968, 2022.

Rudrajit Das, Naman Agarwal, Sujay Sanghavi, and Inderjit S Dhillon. Towards quantifying the preconditioning effect of adam. arXiv preprint arXiv:2402.07114, 2024.

Soham De, Anirbit Mukherjee, and Enayat Ullah. Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018.

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. arXiv preprint arXiv:2003.02395, 2020.

Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, and Rachel Ward. The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pp. 313–355. PMLR, 2022.

Matthew Faw, Litu Rout, Constantine Caramanis, and Sanjay Shakkottai. Beyond uniform smoothness: A stopped analysis of adaptive sgd. In The Thirty Sixth Annual Conference on Learning Theory, pp. 89–160. PMLR, 2023.

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. In Advances in Neural Information Processing Systems, pp. 14977–14988, 2019.

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.

Robert Gower, Othmane Sebbouh, and Nicolas Loizou. Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pp. 1315–1323. PMLR, 2021.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International Conference on Machine Learning, pp. 5200–5209. PMLR, 2019.

Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the adam family. arXiv preprint arXiv:2112.03459, 2021.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.

Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions. arXiv preprint arXiv:2402.03982, 2024a.

Yusu Hong and Junhong Lin. Revisiting convergence of adagrad with relaxed assumptions. arXiv preprint arXiv:2402.13794, 2024b.

Ali Kavis, Kfir Yehuda Levy, and Volkan Cevher. High probability bounds for a class of nonconvex algorithms with adagrad stepsize. arXiv preprint arXiv:2204.02833, 2022.
Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In International Conference on Machine Learning, pp. 17343–17363. PMLR, 2023.

Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be. arXiv preprint arXiv:2304.13960, 2023.

Harold Joseph Kushner and Dean S Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media, 2012.

Kfir Y Levy, Alp Yurtsever, and Volkan Cevher. Online adaptive methods, universality and acceleration. Advances in Neural Information Processing Systems, 31, 2018.

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36, 2024.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR, 2019.

Yan Li, Dhruv Choudhary, Xiaohan Wei, Baichuan Yuan, Bhargav Bhushanam, Tuo Zhao, and Guanghui Lan. Frequency-aware sgd for efficient embedding learning with provable benefits. arXiv preprint arXiv:2110.04844, 2021.

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023a.

Zijian Liu, Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. High probability convergence of stochastic gradient methods.
In International Conference on Machine Learning, pp. 21884–21914. PMLR, 2023b.

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Devyani Maladkar, Ruichen Jiang, and Aryan Mokhtari. Convergence analysis of adaptive gradient methods under refined smoothness and noise assumptions. arXiv preprint arXiv:2406.04592, 2024.

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.

Lev Muchnik, Sen Pei, Lucas C Parra, Saulo DS Reis, José S Andrade Jr, Shlomo Havlin, and Hernán A Makse. Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Scientific Reports, 3(1):1783, 2013.

Mahesh Chandra Mukkamala and Matthias Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. In International Conference on Machine Learning, pp. 2545–2553. PMLR, 2017.

Arkadi Nemirovski and D Yudin. On cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. In Soviet Mathematics. Doklady, volume 19, pp. 258–269, 1978.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.

Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.

Phuong Ha Nguyen, Lam Nguyen, and Marten van Dijk. Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in sgd. Advances in Neural Information Processing Systems, 32, 2019.

Francesco Orabona.
A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Francesco Orabona and Dávid Pál. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory, pp. 287–301. Springer, 2015.

Francesco Orabona and Dávid Pál. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018.

Rui Pan, Haishan Ye, and Tong Zhang. Eigencurve: Optimal learning rate schedule for sgd on quadratic objectives with skewed hessian spectrums. arXiv preprint arXiv:2110.14109, 2021.

Rui Pan, Shizhe Diao, Jianlin Chen, and Tong Zhang. Extremebert: A toolkit for accelerating pretraining of customized bert. arXiv preprint arXiv:2211.17201, 2022.

Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise. arXiv preprint arXiv:2312.14567, 2023.

Jiang Qian, Yuren Wu, Bojin Zhuang, Shaojun Wang, and Jing Xiao. Understanding gradient clipping in incremental gradient methods. In International Conference on Artificial Intelligence and Statistics, pp. 1504–1512. PMLR, 2021.

Zhaonan Qu, Wenzhi Gao, Oliver Hinder, Yinyu Ye, and Zhengyuan Zhou. Optimal diagonal preconditioning: Theory and practice. arXiv preprint arXiv:2209.00809, 2022.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.

Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156:433–484, 2016.

Peter Richtárik and Martin Takáč.
Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41(2):487–524, 2020.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71–79. PMLR, 2013.

Matthew Streeter and H Brendan McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862, 2010.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Hugo Touvron, Louis Martin, and Kevin Stone. Llama 2: Open foundation and fine-tuned chat models.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

S Vani and TV Madhusudhana Rao. An experimental approach towards the performance assessment of various optimizers on convolutional neural network. In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 331–336. IEEE, 2019.
Sharan Vaswani, Issam H Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, and Simon Lacoste-Julien. Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835, 2020.

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1–33, 2020.

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Zhi-Ming Ma, Tie-Yan Liu, and Wei Chen. Provable adaptivity in adam. arXiv preprint arXiv:2208.09900, 2022.

Bohan Wang, Huishuai Zhang, Zhiming Ma, and Wei Chen. Convergence of adagrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pp. 161–190. PMLR, 2023.

Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and lower bound of adam's iteration complexity. Advances in Neural Information Processing Systems, 36, 2024.

Guanghui Wang, Shiyin Lu, Weiwei Tu, and Lijun Zhang. Sadam: A variant of adam for strongly convex functions. arXiv preprint arXiv:1905.02957, 2019.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21(1):9047–9076, 2020.

Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Xiaoxia Wu, Rachel Ward, and Léon Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.

Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting hessian: Understanding common structure of hessian in neural networks. arXiv preprint arXiv:2010.04261, 2020.

Hongzhi Yin, Bin Cui, Jing Li, Junjie Yao, and Chen Chen. Challenging the long tail recommendation. arXiv preprint arXiv:1205.6700, 2012.
Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017a.

Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 100-epoch imagenet training with alexnet in 24 minutes. arXiv preprint arXiv:1709.05011, 8, 2017b.

Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10, 2018.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511–15521, 2020a.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020b.

Tong Zhang. Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press, 2023. doi: 10.1017/9781009093057.

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules. Advances in Neural Information Processing Systems, 35:28386–28399, 2022.

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need adam: A hessian perspective. arXiv preprint arXiv:2402.16788, 2024.
Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.

Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, and Francesco Orabona. Understanding adamw through proximal methods and scale-freeness. arXiv preprint arXiv:2202.00089, 2022.

George Kingsley Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, 2016.

A MORE EXPERIMENTAL DETAILS

A.1 CONVEX CASE

For a4a, a6a and a9a, we run 100 epochs of optimization with SGD and Adagrad individually, starting from a uniform initialization w0 ∼ U(−0.05, 0.05)^d. To find the best hyperparameters, grid searches are conducted for both algorithms, with the search space being the initial learning rate η ∈ {10.0, 1.0, 0.1, 0.01} and the learning rate schedule being either constant ηt ≡ η or inverse square root decay ηt = η/√(t + 1), the same choices as in our theorems. For the large datasets real-sim and rcv1.binary, all settings stay the same, except that the number of epochs is 3 and the initialization is w0 ∼ U(−1/d, 1/d)^d × 10⁻². Since it is generally hard to obtain analytical closed-form solutions for logistic regression, we run gradient descent for 10⁶ epochs to obtain an approximate optimum w* for the small datasets a4a, a6a and a9a. For the larger real-sim and rcv1.binary, we run 10² epochs of Adagrad instead, since GD converges much more slowly in comparison. In addition, for computing ‖H‖₂ in the large datasets, we run 10 iterations of power iteration to approximate the largest eigenvalue, which quickly converges to the desired precision of 10⁻⁵.

Table 5: Statistics on logistic regression. D₂ = maxₜ ‖wt − w*‖₂ and D∞ = maxₜ ‖wt − w*‖∞ are estimated by the maximum value over all searched settings without loss explosion.
Smaller values of Cvar ≜ (D∞‖σ‖₁)/(D₂‖σ‖₂) represent better theoretical bounds for Adagrad when compared with SGD, and Cbias ≜ (‖L‖₁D∞²)/(L∞D₂²) corresponds to the difference in bias terms.

DATASET  D∞    D₂     ‖L‖₁   L∞    CVAR  CBIAS  SGD (minₜ f(wt) − f(w*))  ADAGRAD (minₜ f(wt) − f(w*))
A4A      9.73  32.24  14.87  7.26  0.30  0.19   1.46 × 10⁻³               6.88 × 10⁻⁵
A6A      8.88  28.57  14.87  7.26  0.31  0.20   4.36 × 10⁻⁴               6.39 × 10⁻⁵
A9A      9.79  29.43  14.87  7.27  0.32  0.23   3.47 × 10⁻⁴               5.04 × 10⁻⁵

A.2 NONCONVEX CASE

For all experiments, we run 3 epochs of optimization with SGD and Adagrad, which is one of the recommended settings for the Alpaca dataset (Taori et al., 2023) (https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#fine-tuning). We search the learning rate η ∈ {1.0, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} and report the best training-loss result for both SGD and Adagrad. The maximum sequence length is set to 512, and the learning rate schedule is cosine decay (Loshchilov & Hutter, 2016). Other settings remain the defaults of huggingface transformers (https://github.com/huggingface/transformers); in all our implementations, we use the version transformers==4.38.2. All experiments are conducted on a single A40 GPU, where gradient accumulation is adopted for batch sizes larger than 128 to reduce memory cost. In addition to the presented results, we include in Figure 2 the loss curves of SGD and Adagrad under batch sizes 256 and 512. They represent the typical loss decrease of all experiments, where Adagrad converges faster than SGD from the beginning and converges to a point with smaller loss.

Figure 2: Training loss curves of SGD and Adagrad for instruction-following tasks on Alpaca with GPT-2. Left: batch size 256. Right: batch size 512.
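Appendix A.1 mentions estimating ‖H‖₂ with 10 steps of power iteration. A minimal sketch of that estimator (the function name, tolerance handling, and seeding below are our own choices, not the paper's code):

```python
import numpy as np

def power_iteration(H, num_iters=10, tol=1e-5, seed=0):
    """Approximate the largest eigenvalue of a symmetric PSD matrix H."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        Hv = H @ v
        lam_new = v @ Hv              # Rayleigh quotient estimate
        v = Hv / np.linalg.norm(Hv)   # renormalize the iterate
        if abs(lam_new - lam) < tol:  # stop once the estimate stabilizes
            break
        lam = lam_new
    return lam_new
```

Since only matrix-vector products `H @ v` are needed, H = (1/n) XᵀX can be kept as a sparse operator for real-sim and rcv1.binary rather than densified.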
Regarding licenses, the Alpaca dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International Public License (https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE), while GPT-2 is released under the MIT License (https://huggingface.co/openai-community/gpt2). The code repository of huggingface transformers is released under the Apache License 2.0 (https://github.com/huggingface/transformers/blob/main/LICENSE).

B AN EXAMPLE IN THE CONVEX CASE

Following the SVM example in Duchi et al. (2011), we provide Example B.1 as a concrete case to aid our illustration.

Example B.1. Consider a finite-sum optimization problem in a hypercube W:

min_{w∈W} f(w) = (1/n) Σ_{i=1}^n ϕi(xiᵀw),

where ϕi : R → R are convex, twice-differentiable functions and xi ∈ R^d, w ∈ W. We assume that the first and second order derivatives are bounded such that |ϕ′i(w)| ≤ G1 and ϕ″i(w) ≤ G2 for all i and w ∈ W. We also assume a bound on the data points {xi}_{i=1}^n for p ≥ 1, and denote H = (1/n) Σ_{i=1}^n xixiᵀ. In this case, the stochastic gradient is ∇wf(w; ξ) = xξ ϕ′ξ(xξᵀw), where ξ is uniformly sampled from {1, …, n}.

Note that Example B.1 includes a wide range of problems such as linear and logistic regression. One can check that the problem in Example B.1 satisfies the assumptions of Theorems D.1 and 4.1. As the stochastic gradients are sampled uniformly with replacement, the variance of the gradient noise is reduced by a factor of M, where M is the batch size, as proven in Lemma C.2. We also assume an exponential tail to quantify the imbalance of the Hessian in the following example.

Assumption B.2. We say the problem has an exponential tail when there exists a constant τ > 1/d such that

Hj,j = (1/n) Σ_{i=1}^n x²_{i,j} ≤ exp(−τj) for all j = 1, …, d,

where, without loss of generality, we assume that Hj,j ≥ Hk,k for all j ≤ k.
Note that the exponential tail is a typical example of highly imbalanced data distributions, which are common in natural language modeling (Zipf, 2016), social networks (Muchnik et al., 2013), and recommender systems (Yin et al., 2012). Then recall the comparison between AdaGrad and SGD in the convex case:

AdaGrad: E[f(w̄T) − f(w*)] = O( D∞‖σ‖₁/√(MT) + ‖L‖₁D∞²/T ),
SGD & AdaGrad-Norm: E[f(w̄T) − f(w*)] = O( D₂‖σ‖₂/√(MT) + L∞D₂²/T ),

and take the following ratios to denote the difference in the variance and bias terms, respectively:

R1 ≜ (D∞‖σ‖₁)/(D₂‖σ‖₂),  R2 ≜ (‖L‖₁D∞²)/(L∞D₂²).

As we take W to be a hypercube, it holds that D₂² = d·D∞². Concerning R1, the coordinate-wise noise satisfies σj² ≤ max_{w∈W} (1/n) Σ_{i=1}^n ϕ′i(xiᵀw)² x²_{i,j} ≤ G1² Hj,j. Based on the exponential-tail assumption (Assumption B.2), the ratio ‖σ‖₁/‖σ‖₂ ≤ (Σ_{j=1}^d exp(−τj/2))/√(Σ_{j=1}^d exp(−τj)) is then bounded by a constant depending only on τ, while D∞/D₂ = 1/√d. If we consider τ to be a mild constant, it should be clear that R1 can be close to 1/√d and R1 ≪ 1.

On the other side, we have for all w ∈ W,

0 ≼ ∇²f(w) = (1/n) Σ_{i=1}^n ϕ″i(xiᵀw) xixiᵀ ≼ G2 H.

Moreover, a diagonal matrix Lm = diag(L) satisfying Lm ≽ G2H is a sufficient condition for the smoothness assumption (Assumption 3.3); this is equivalent to ‖L_m^{−1/2} H L_m^{−1/2}‖₂ ≤ 1/G2. Thus Lm can be taken as the optimal solution to an optimization problem:

min_{L∈R₊^d} ‖L‖₁  s.t.  diag(L) ≽ G2H,

which shares a similar form with the optimal diagonal preconditioner problem in solving linear systems, where an optimal preconditioner for a fixed matrix H can result in much faster convergence, as pointed out in Qu et al. (2022). As an example, when H is diagonally dominant, we can choose Lm = 2G2·diag(H), which takes the diagonal entries. Then we have L∞ = 2G2H₁,₁ and ‖L‖₁ = 2G2 tr(H). In this case, based on the exponential-tail assumption (Assumption B.2),

R2 = (‖L‖₁D∞²)/(L∞D₂²) = tr(H)/(d·H₁,₁) ≤ (1/d)·(Σ_{j=1}^d exp(−τj))/exp(−τ) ≤ (1/d)·1/(1 − exp(−τ)).

When τ is a mild constant, it should be clear that R2 can be close to 1/d and R2 ≪ 1.
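The bias-ratio calculation above can be illustrated numerically. In this sketch (our own, with τ fixed to 1 purely for concreteness), an exponential-tail diagonal Hj,j = exp(−τj) gives tr(H)/(d·H1,1) ≈ 1.58/d, i.e. the ratio R2 shrinks like 1/d:

```python
import numpy as np

def bias_ratio(d, tau):
    """R2 ~ tr(H) / (d * H11) for an exponential-tail diagonal, Hjj = exp(-tau * j)."""
    diag = np.exp(-tau * np.arange(1, d + 1))  # Hjj = exp(-tau * j), j = 1..d
    return diag.sum() / (d * diag[0])

for d in [10, 100, 1000]:
    print(d, bias_ratio(d, tau=1.0))  # shrinks roughly like 1.58 / d
```

The constant in front of 1/d depends only on τ, through 1/(1 − exp(−τ)), matching the bound in the text.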
To conclude, in this concrete example, R1 can be close to 1/√d, R2 can be close to 1/d, and both R1, R2 ≪ 1, showing that AdaGrad enjoys better dimensional dependence than SGD and AdaGrad-Norm, with a larger gap in the bias terms than in the variance terms.

C PROOF PRELIMINARIES

Notations. In the appendix, we define

nt ≜ gt − ∇f(wt) = (1/M) Σ_{ξ∈Bt} n(wt; ξ)  (11)

to denote the gradient noise at iteration t, where n(wt; ξ) ≜ ∇wf(wt; ξ) − ∇f(wt). We further use ft,j, gt,j and nt,j to denote the j-th coordinate of ∇f(wt), gt and nt, respectively.

Lemma C.1 (Projection). Suppose W ⊆ R^d is a closed convex set and Λ ∈ R^{d×d} is symmetric and positive definite. Then for w ∈ R^d, w̄ = Π^Λ_W[w] if and only if for all z ∈ W,

(1/2)‖w − w̄‖²_Λ ≤ (1/2)‖w − z‖²_Λ,  (12)

or equivalently, for all z ∈ W,

⟨z − w̄, Λ(w − w̄)⟩ ≤ 0.  (13)

Proof. Equation (12) simply follows from the definition. To prove (13), take any z ∈ W and α ∈ (0, 1); we know that αz + (1 − α)w̄ ∈ W as W is a convex set. Hence by (12), we have

(1/2)‖w − w̄‖²_Λ ≤ (1/2)‖w − [αz + (1 − α)w̄]‖²_Λ = (1/2)‖(w − w̄) − α(z − w̄)‖²_Λ
= (1/2)‖w − w̄‖²_Λ − α⟨z − w̄, Λ(w − w̄)⟩ + (α²/2)‖z − w̄‖²_Λ,

which gives ⟨z − w̄, Λ(w − w̄)⟩ ≤ (α/2)‖z − w̄‖²_Λ. As α can be arbitrarily close to 0, we have that (13) holds.

With the definition in (11), we can obtain the following lemma describing the variance reduction from batch size M.

Lemma C.2 (Variance reduced by batch size). Under Assumptions 3.4 and 3.5, it holds that

Et[n²_{t,j}] ≤ σj²/M.  (14)

Proof. It holds that

Et[n²_{t,j}] = Et[((1/M) Σ_{ξ∈Bt} nj(wt; ξ))²] (4)= (1/M²) Σ_{ξ∈Bt} Et[(nj(wt; ξ))²] (5)≤ σj²/M,

where the second equality is based on the independence of the n(wt; ξ).

C.1 PROPERTIES OF ASSUMPTION 5.1

The following is the descent lemma based on Assumption 5.1. We basically follow proof techniques similar to Zhang et al. (2019; 2020a); Crawshaw et al. (2022) but obtain different results because of the difference in assumptions.

Lemma C.3 (Descent lemma).
Under Assumption 5.1, for all w, w′ ∈ W such that ‖w′ − w‖₂ ≤ 1/(‖L1‖∞√d), it holds that

f(w′) ≤ f(w) + ⟨∇f(w), w′ − w⟩ + (1/2) Σ_{j=1}^d (L0,j + L1,j|∂jf(w)|) |w′ − w|²_j,  (15)

where |w′ − w|_j denotes the absolute value of the j-th coordinate of w′ − w.

Algorithm 2 SGD
1: Input: w0 ∈ R^d, {ηt}_{t=0}^{T−1} ⊆ R, and batch size M ∈ N
2: for t = 0 to T − 1 do
3:   Sample mini-batch Bt with |Bt| = M uniformly
4:   gt = (1/M) Σ_{ξ∈Bt} ∇wf(wt; ξ)
5:   Option I: wt+1 = Π^{Id}_W(wt − ηt gt)
6:   Option II: wt+1 = wt − ηt gt
7: end for
8: Output: (1/T) Σ_{t=0}^{T−1} wt

Algorithm 3 Adagrad-Norm
1: Input: w0 ∈ R^d, {ηt}_{t=0}^{T−1} ⊆ R, ϵ ∈ R, and batch size M ∈ N
2: Initialize v₋₁ = ϵ²
3: for t = 0 to T − 1 do
4:   Sample mini-batch Bt with |Bt| = M uniformly
5:   gt = (1/M) Σ_{ξ∈Bt} ∇wf(wt; ξ)
6:   vt = vt−1 + ‖gt‖₂²
7:   bt = √vt
8:   Option I: wt+1 = Π^{Λt}_W(wt − (ηt/bt) gt)
9:   Option II: wt+1 = wt − (ηt/bt) gt
10: end for
11: Output: (1/T) Σ_{t=0}^{T−1} wt

Proof. Denote wu = u·w′ + (1 − u)·w for u ∈ [0, 1]. Based on the Taylor expansion, we have

f(w′) = f(w) + ∫₀¹ ⟨∇f(wu), w′ − w⟩ du
= f(w) + ⟨∇f(w), w′ − w⟩ + ∫₀¹ ⟨∇f(wu) − ∇f(w), w′ − w⟩ du
= f(w) + ⟨∇f(w), w′ − w⟩ + ∫₀¹ Σ_{j=1}^d (∂jf(wu) − ∂jf(w))(w′ − w)_j du
≤ f(w) + ⟨∇f(w), w′ − w⟩ + ∫₀¹ √( Σ_{j=1}^d |∂jf(wu) − ∂jf(w)|²/(L0,j + L1,j|∂jf(w)|) ) · √( Σ_{j=1}^d (L0,j + L1,j|∂jf(w)|)|w′ − w|²_j ) du
(3)≤ f(w) + ⟨∇f(w), w′ − w⟩ + ∫₀¹ u du · Σ_{j=1}^d (L0,j + L1,j|∂jf(w)|) |w′ − w|²_j
= f(w) + ⟨∇f(w), w′ − w⟩ + (1/2) Σ_{j=1}^d (L0,j + L1,j|∂jf(w)|) |w′ − w|²_j,

where we use the Cauchy–Schwarz inequality in the first inequality and |wu − w|_j = u|w′ − w|_j in the second.

D STANDARD CONVERGENCE OF SGD

Theorem D.1 (Convex convergence of SGD). Under Assumptions 3.1, 3.3, 3.4 and 3.5, for the sequence {wt}_{t=1}^T generated by SGD (Algorithm 2 with Option I), if we appropriately take the step size ηt =
Note that under the same settings, the convex convergence of SGD with the same step-size schedule can also be made to depend only on ‖w₀ − w*‖₂² instead of D₂², if we use a slightly different proof approach than that of Theorem D.1; one can refer to Theorem 5.7 in Garrigos & Gower (2023) for this alternative. While that approach obtains convergence with dependence on ‖w₀ − w*‖₂² instead of D₂², it suffers from an additional logarithmic term in the bound when using the step size η_t = O(1/√t). Therefore, each of these two approaches has its own merits, and we take the current approach for discussion.

Theorem D.3 (Nonconvex convergence of SGD). Under Assumptions 3.2, 3.3, 3.4, 3.5, for the sequence {w_t}_{t=1}^T generated by SGD (Algorithm 2 with Option II), if we appropriately take the constant step size η_t ≡ η = min{1/(2L_∞), √(2(f(w₀) − f*) M / (L_∞ ‖σ‖₂² T))}, it holds that
(1/T) Σ_{t=0}^{T−1} E[‖∇f(w_t)‖₂²] = O(√(L_∞ (f(w₀) − f*) ‖σ‖₂² / (MT)) + L_∞ (f(w₀) − f*) / T).

Theorems D.1 and D.3 are standard results on the convergence of SGD under L_∞-smooth settings. Similar results have been extensively studied (e.g., Orabona, 2019; Garrigos & Gower, 2023; Ghadimi & Lan, 2013; Bernstein et al., 2018a). We also include the proofs below for completeness.

Proof of Theorem D.1. For w* ∈ W, it holds that
E_t[‖w_{t+1} − w*‖₂²] ≤(12) E_t[‖w_t − η_t g_t − w*‖₂²]
= ‖w_t − w*‖₂² − 2η_t E_t[⟨g_t, w_t − w*⟩] + η_t² E_t[‖g_t‖₂²]
= ‖w_t − w*‖₂² − 2η_t ⟨∇f(w_t), w_t − w*⟩ + η_t² ‖∇f(w_t)‖₂² + η_t² E_t[‖n_t‖₂²],
where n_t = g_t − ∇f(w_t). With convexity and standard smoothness of f(·), we have
⟨∇f(w_t), w_t − w*⟩ ≥ f(w_t) − f(w*) and ⟨∇f(w_t), w_t − w*⟩ ≥ (1/L_∞) ‖∇f(w_t)‖₂².
Then after rearranging, it holds that
(2 − η_t L_∞)(f(w_t) − f(w*)) ≤ (1/η_t) ‖w_t − w*‖₂² − (1/η_t) E_t[‖w_{t+1} − w*‖₂²] + η_t E_t[‖n_t‖₂²].
If we take summation and full expectation, with η_t ≤ 1/L_∞ and η_{t+1} ≤ η_t for t = 0, …, T − 1, it holds that
Σ_{t=0}^{T−1} E[f(w_t) − f(w*)] ≤ Σ_{t=0}^{T−1} E[(1/η_t) ‖w_t − w*‖₂² − (1/η_t) ‖w_{t+1} − w*‖₂²] + Σ_{t=0}^{T−1} η_t E[‖n_t‖₂²]
= (1/η₀) ‖w₀ − w*‖₂² − (1/η_{T−1}) E[‖w_T − w*‖₂²] + Σ_{t=1}^{T−1} (1/η_t − 1/η_{t−1}) E[‖w_t − w*‖₂²] + Σ_{t=0}^{T−1} η_t E[‖n_t‖₂²]
≤(2) D₂²/η₀ + Σ_{t=1}^{T−1} (1/η_t − 1/η_{t−1}) D₂² + Σ_{t=0}^{T−1} η_t ‖σ‖₂²/M = D₂²/η_{T−1} + Σ_{t=0}^{T−1} η_t ‖σ‖₂²/M.
Therefore, by having both sides of the inequality divided by T and making use of the convexity, we conclude the proof that E [f( w T ) f(w )] Asm. 3.1 1 T t=0 E [f(wt) f(w )] D2 2 ηT 1T + σ 2 2 MT where w T = 1/T PT 1 t=0 wt. Then by taking p D2 2M σ2 2(t + 1) where p is a constant, we can obtain that E [f( w T ) f(w )] D2 2 ηT 1T + σ 2 2 MT D2 2 σ 2 2 MT r1 D2 2 σ 2 2 MT 2 p, where the second inequality holds as t t + 1 t = 2 Thus by taking p = 1 2, we finish the proof. Proof of Theorem D.3. Based on the smoothness condition, it holds that f(wt+1) f(wt) f(wt), wt+1 wt + L 2 wt+1 wt 2 2 =f(wt) ηt f(wt), gt + L η2 t 2 gt 2 2 f(wt) ηt f(wt), gt + L η2 t f(wt) 2 2 + L η2 t nt 2 2 . By taking expectation, we have E [f(wt+1)] E [f(wt)] ηt E [ f(wt), gt ] + L η2 t 2 E h gt 2 2 i =E [f(wt)] ηt E h f(wt) 2 2 i + L η2 t 2 E h f(wt) 2 2 i + L η2 t 2 E h nt 2 2 i (14) E [f(wt)] ηt E h f(wt) 2 2 i + L η2 t 2 E h f(wt) 2 2 i + L η2 t σ 2 2 2M E [f(wt)] ηt 2 E h f(wt) 2 2 i + L η2 t σ 2 2 2M , where nt = f(wt) gt and the last inequality holds as we take ηt 1/ L . As ηt η, we have after rearranging that t=0 E h f(wt) 2 2 i 2E[f(w0) f(w T )] ηT + L η σ 2 2 M 2(f(w0) f ) ηT + L η σ 2 2 M . Then by taking η = min n 1 L , q (f(w0) f )M o , we finish the proof. Published as a conference paper at ICLR 2025 E PROOF OF THEOREM 4.1 To prove Theorem 4.1, we first look at the standard Ada Grad convergence under the non-smooth convex stochastic optimization setting. Theorem E.1 (Convergence for convex nonsmooth Ada Grad). Under Assumption 3.1, 3.4, for the sequence {wt}T t=1 generated by Algorithm 1 with constant step size ηt η, it holds that t=0 E[f(wt)] f(w ) 1 t=0 f 2 t,j + + ϵD2 2 2η (16) Proof. This proof is a stochastic optimization version of the proof of Ada Grad in the online convex learning scheme Duchi et al. (2011); Streeter & Mc Mahan (2010); Zhang (2023). We also include the proof here for completeness. 
First, we give an important result that the gradient norm can be expressed as Λt Λt 1. gt 2 Λ 1 t = g2 t,j λt,j g2 t,j λt,j + λt 1,j = λ2 t,j λ2 t 1,j λt,j + λt 1,j = tr (Λt Λt 1) , (17) where we note λt,j is the j-th coordinate of Λt. Then we start the proof. Et h wt+1 w 2 Λt i (12) E h wt w ηtΛ 1 t gt 2 Λt =Et h wt w 2 Λt 2ηt gt, wt w + η2 t gt 2 Λ 1 t =Et h wt w 2 Λt i 2ηt f(wt), wt w + Et h η2 t gt 2 Λ 1 t (17) Et h wt w 2 Λt i 2ηt f(wt), wt w + Et η2 t tr (Λt Λt 1) wt w 2 Λt 1 2ηt (f(wt) f(w )) + Et η2 t tr (Λt Λt 1) + Et h wt w 2 Λt Λt 1 (2) wt w 2 Λt 1 2ηt (f(wt) f(w )) + Et η2 t tr (Λt Λt 1) + Et D2 tr (Λt Λt 1) , where the third inequality holds as f( ) is convex. After taking ηt = η, summing up, taking full expectation and rearrangement, we can obtain that t=0 E [f(wt) f(w )] 1 t=0 E h wt w 2 Λt 1 wt+1 w 2 Λt t=0 E [tr (Λt Λt 1)] =ϵ w0 w 2 2 2η w T w 2 ΛT 1 2η + η E [tr (ΛT 1 ϵId)] (2) ϵD2 2 2η + η t=0 f 2 t,j + where we take Λ 1 = ϵId and the last inequality holds as x + y x + y for all x, y 0. Published as a conference paper at ICLR 2025 Then we consider giving a bound on the noise summation term. Lemma E.2 (variance bound). Under Assumption 3.4 and 3.5, it holds that Proof. It holds that t=0 Et n2 t,j where the first inequality holds as Jensen s inequality. Then we are ready to prove Theorem 4.1. Proof of Theorem 4.1. From Theorem E.1, we can obtain that t=0 E[f(wt)] f(w ) (16) 1 t=0 f 2 t,j + + ϵD2 2 2η , and in Lemma E.2 we bound the noise term. Then we consider the bias term. It holds that t=0 f 2 t,j v u u t L 1 v u u u t L 1 v u u t L 1 t=0 E h f(wt) 2 L 1 i , where the first inequality uses Lemma G.1 and the second inequality holds as the Jensen s inequality. Moreover, based on the smoothness assumption (Assumption 3.3), we can obtain that t=0 E h f(wt) 2 L 1 i 1 t=0 E [f(wt) f(w )] . 
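Behind the trace bound above is a per-coordinate telescoping argument: since λ²_{t,j} = λ²_{t−1,j} + g²_{t,j} and λ_{t−1,j} ≤ λ_{t,j}, one gets g²_{t,j}/λ_{t,j} ≤ 2g²_{t,j}/(λ_{t,j} + λ_{t−1,j}) = 2(λ_{t,j} − λ_{t−1,j}). A minimal numeric sketch, with synthetic Gaussian gradients (our own choice) standing in for g_t:

```python
import random

random.seed(2)

d, eps = 3, 1e-2
lam_prev = [eps] * d  # lambda_{-1,j} = eps
for t in range(50):
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    # Adagrad accumulator: lambda_{t,j}^2 = lambda_{t-1,j}^2 + g_{t,j}^2
    lam = [(lp ** 2 + gj ** 2) ** 0.5 for lp, gj in zip(lam_prev, g)]
    # || g_t ||^2 under Lambda_t^{-1} (diagonal weights 1 / lambda_{t,j})
    weighted_norm_sq = sum(gj ** 2 / lj for gj, lj in zip(g, lam))
    trace_increment = sum(lj - lp for lj, lp in zip(lam, lam_prev))
    # coordinate-wise: g^2/lam <= 2 g^2/(lam + lam_prev) = 2 (lam - lam_prev)
    assert weighted_norm_sq <= 2.0 * trace_increment + 1e-12
    lam_prev = lam
```

Summed over coordinates this gives ‖g_t‖²_{Λ_t^{-1}} ≤ 2 tr(Λ_t − Λ_{t−1}), which telescopes over t into a bound by tr(Λ_{T−1} − Λ_{−1}).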
Thus if we denote A = 1/TPT 1 t=0 E [f(wt) f(w )], then combining (16), there is a simplified expression that A CC0 0, where from Theorem E.1 and Lemma E.2, T , C0 = σ 1 MT + ϵD2 2 2ηTC . Then we can solve this inequality by conducting simple derivation that C2C2 1 + 4CC0 2 C2C2 1 + C2C2 1 + 4CC0 = C2C2 1 + 2CC0. If we replace C, C0 and C1 by their corresponding value, we can obtain that t=0 E [f(wt) f(w )] 1 MT + ϵD2 2 ηT If we further take the optimal step size η = D based on this bound, we can obtain that t=0 E [f(wt) f(w )] 4 L 1 D2 T + MT + ϵD2 2 D T , which concludes the proof. Published as a conference paper at ICLR 2025 E.1 RELAXATION OF THE NOISE ASSUMPTION Note that Assumption 3.5 assumes a universal bound on the noise. However, similar to a line of recent works on SGD (Richtárik & Takác, 2020; Gower et al., 2019; 2021; Khaled & Richtárik, 2020), Theorem 4.1 can also be extended to the settings that we only assume upper bound of noise variance on w as Assumption E.3, with Assumption 3.3 replaced by an assumption that aligns with the expected smoothness (Assumption E.4). Based on this setting, our analysis can be well adapted to the interpolation or overparameterized scheme. Assumption E.3 (Anisotropic noise on w ). There exists a positive vector σ = [σ ,1, . . . , σ ,d] Rd such that for all w W that are minimum of f( ), it holds that E nj(w ; ξ)2 σ2 ,j for all j [d]. (18) Assumption E.4 (Expected anisotropic smoothness). There exists a positive vector L = [L ,1, . . . , L ,d] Rd + such that f( ) is L -expected smooth, namely, for w W, g( ) is the stochastic gradient oracle and all w W that are the minimum of f( ), it holds that 1 2E h g(w) g(w ) 2 L 1 i f(w) f(w ). 
(19) With these two assumptions replaced, we may rewrite the last step in Theorem E.1 such that t=0 E g2 t,j t=0 E h 2 |g(wt) g(w )|2 j i + t=0 E h 2 |g(w )|2 j i v u u t2 L 1 t=0 E h g(wt) g(w ) 2 L 1 v u u t4 L 1 t=0 E [f(wt) f(w )] + σ 1 Then we can process the proof of Theorem 4.1 in the same manner, which gives the following convergence result: t=0 E [f(wt) f(w )] O L 1 D2 T + D σ 1 MT + ϵD2 2 D T Over-parameterized models usually result in interpolation of the data, which typically holds wf(w ; ξ) = f(w) = 0, which corresponds to that σ = 0. Therefore, we have O (1/T) convergence rate for Ada Grad under over-parameterization based on this relaxation of the noise assumption. F PROOF OF THEOREM 5.3 In this section, we prove the convergence of Ada Grad in the nonconvex generalized smooth setting. Note that by setting L1 = 0, the proof below can be directly applied to prove Theorem 4.3. Let us first give a brief overview. Based on Assumption 3.3 and Lemma C.3, when we set ηt η 1/ L1 , it holds that wt+1 wt L1 f(wt+1) (15) f(wt) + f(wt), wt+1 wt + 1 j=1 (L0,j + L1,j | ft,j|) |wt+1 wt|2 j =f(wt) η f(wt), Λ 1 t gt + η2 j=1 (L0,j + L1,j | ft,j|) g2 t,j λt,j . Published as a conference paper at ICLR 2025 Here a critical problem is it is nontrivial to straightforwardly transfer Et[ f(wt), Λ 1 t gt ] to Et[ f(w) ] as both Λt and gt is relevant with gt. To deal with this issue, we consider an auxiliary diagonal matrix Λt with each diagonal entry being λ2 t,j = λ2 t 1,j + Et g2 t,j (20) for all j [d]. Note that this auxiliary sequence is the same as Défossez et al. (2020), but our proof technique is much different and obtains better results. Then we consider f(wt), Λ 1 t gt = D f(wt), Λ 1 t gt E + D f(wt), Λ 1 t Λ 1 t gt E and attempt to bound the second term, which is described in Lemma F.1. Another problem is how to bound the additional terms introduced by the generalized smoothness, namely, Pd j=1 L1,j | ft,j| g2 t,j λt,j . 
We use the divide-and-conquer strategy to have this additional term resolved by the existing terms, as described in Lemma F.2. With these issues solved, the final convergence property would be determined by t=0 E h D f(wt), Λ 1 t gt Ei , which largely relies on E [tr (Λt)] that we bound in Lemma F.4. Note that Lemma F.4 considers a different main line and is important for the proof. Then we are ready to complete the proof of Theorem 5.3. Lemma F.1 (Bound of D f(wt), Λ 1 t Λ 1 t gt E ). Under the same setting as Theorem 5.3, if we take diagonal matrix Λt as defined in (20), it holds that Et h D f(wt), Λ 1 t Λ 1 t gt Ei " g2 t,j λ2 t,j D f(wt), Λ 1 t f(wt) E . (21) Proof. It holds that D f(wt), Λ 1 t Λ 1 t gt E = j=1 ft,jgt,j 1 λt,j 1 λt,j ft,jgt,j g2 t,j Et g2 t,j λt,jλt,j(λt,j + λt,j) | ft,j| |gt,j| g2 t,j Et g2 t,j λt,jλt,j(λt,j + λt,j) | ft,j| |gt,j| |gt,j| q where in the first inequality we take the properties of absolute values. In the last inequality, we use the fact that λ2 t,j = λ2 t 1,j + Et g2 t,j Et g2 t,j and λ2 t,j = λ2 t 1,j + g2 t,j g2 t,j. Then we consider an arbitrary coordinate j [d]. By applying inequality (31) with Et g2 t,j 2 λt,j , x = |gt,j| Et g2 t,j | ft,j| Published as a conference paper at ICLR 2025 it holds that | ft,j| |gt,j| |gt,j| q " g2 t,j λ2 t,j 2c f 2 t,j λ2 t,j Et Et g2 t,j 2 " g2 t,j λ2 t,j 2 f 2 t,j λ2 t,j Et g2 t,j 2# " g2 t,j λ2 t,j + f 2 t,j 2 λt,j " g2 t,j λ2 t,j + f 2 t,j 2 λt,j , where in the second inequality we use the fact that Et g2 t,j 2# =2Et g2 t,j 2Et [|gt,j|] q Et g2 t,j 2Et g2 t,j and the last inequality is based on the fact that Et g2 t,j 2# =2Et g2 t,j 2Et [|gt,j|] q Et g2 t,j q Et g2 t,j Et [|gt,j|] Et g2 t,j Et g2 t,j Et [|gt,j|]2 q Et g2 t,j + Et [|gt,j|] Et g2 t,j Et g2 t,j Et [gt,j]2 q Et g2 t,j + Et [|gt,j|] Et g2 t,j σ2 j q Et g2 t,j + Et [|gt,j|] 2σ2 j . 
Thus by substituting into (22), we can obtain that Et h D f(wt), Λ 1 t Λ 1 t gt Ei Et | ft,j| |gt,j| |gt,j| q " g2 t,j λ2 t,j f 2 t,j 2 λt,j " g2 t,j λ2 t,j D f(wt), Λ 1 t f(wt) E , which concludes the proof. Lemma F.2 (Bound of L1,j | ft,j| g2 t,j λ2 t,j ). Under the same settings as Theorem 5.3, it holds that j=1 L1,j | ft,j| g2 t,j λ2 t,j 2 L1 D f(wt), Λ 1 t f(wt) E + j=1 L1,jσj Et " g2 t,j λ2 t,j Published as a conference paper at ICLR 2025 Proof. Let us first consider an arbitrary coordinate j [d] and two cases regarding | ft,j|. (1) If | ft,j| σj, we have L1,j | ft,j| g2 t,j λ2 t,j = L1,j | ft,j| Et " g2 t,j λ2 t,j " g2 t,j λ2 t,j (2) Else, we have | ft,j| σj. In this case, we have " g2 t,j λ2 t,j " g2 t,j λ2 t 1,j + g2 t,j λ2 t 1,j + Et g2 t,j (5) f 2 t,j + σ2 j λ2 t 1,j + Et g2 t,j 2 f 2 t,j λ2 t,j , where we use the Jensen inequality of convex function h(x) = x a2+x in the first inequality. Therefore, L1,j | ft,j| Et " g2 t,j λ2 t,j 2L1,j | ft,j| f 2 t,j λ2 t,j 2L1,j f 2 t,j λt,j , where in the last inequality we use the fact that | ft,j| = |E[gt,j]| q E g2 t,j λt,j. By combining the two cases, we have L1,j | ft,j| Et " g2 t,j λ2 t,j 2L1,j f 2 t,j λt,j + L1,jσj Et " g2 t,j λ2 t,j and by summing up, we have j=1 L1,j | ft,j| g2 t,j λ2 t,j 2 L1 D f(wt), Λ 1 t f(wt) E + j=1 L1,jσj Et " g2 t,j λ2 t,j Lemma F.3 (Bound of PT 1 t=0 g2 t,j λ2 t,j ). Under the same settings of Theorem 5.3, for all j [d], it holds g2 t,j λ2 t,j 2 ln E tr (ΛT 1) Proof. It holds that " g2 t,j λ2 t,j ln 2E λT 1,j 2 ln E tr (ΛT 1) where the first inequality is based on Lemma G.3 and the fact that λ2 t,j = λ2 t 1,j + g2 t,j, and the second inequality is based on Jensen s inequality. Lemma F.4 (Bound of tr (Λt)). Under the same settings of Theorem 5.3, it holds that E [tr (ΛT 1)] 2dϵ + 4 η (f(w0) f ) + 5 η L0 1 + 3 2η L0 1 + 5 T σ 1 ϵ + e Published as a conference paper at ICLR 2025 Proof. As we take η 1/ L1 , we have wt+1 wt L1 d. 
Then from Assumption 3.3 and Lemma C.3 it holds that f(wt+1) (15) f(wt) + f(wt), wt+1 wt + 1 j=1 (L0,j + L1,j | ft,j|) |wt+1 wt|2 j =f(wt) η f(wt), Λ 1 t gt + η2 j=1 (L0,j + L1,j | ft,j|) g2 t,j λ2 t,j =f(wt) η gt, Λ 1 t gt + η2 j=1 (L0,j + L1,j | ft,j|) g2 t,j λ2 t,j + η nt, Λ 1 t gt f(wt) η gt, Λ 1 t gt + η2 j=1 (L0,j + L1,j(|gt,j| + |nt,j|)) g2 t,j λ2 t,j + η nt, Λ 1 t gt =f(wt) η gt, Λ 1 t gt + η2 j=1 L0,j g2 t,j λ2 t,j + η2 j=1 L1,j |gt,j|3 j=1 L1,j |nt,j| gt,j2 λ2 t,j + η nt, Λ 1 t gt f(wt) η gt, Λ 1 t gt + η2 j=1 L0,j g2 t,j λ2 t,j + η2 j=1 L1,j |gt,j|3 j=1 L1,j |nt,j| g2 t,j λ2 t,j + η j=1 |nt,j| |gt,j| where we use absolute value inequality. Then we deal with the terms separately by j=1 L1,j |gt,j|3 g2 t,j λt,j = η2 2 L1 gt, Λ 1 t gt 2 gt, Λ 1 t gt , where we use the fact that |gt,j| λt,j and η 1/ L1 . Similarly, we can also obtain that j=1 L1,j |nt,j| g2 t,j λ2 t,j η2 j=1 L1,j |nt,j| |gt,j| j=1 |nt,j| |gt,j| where we use the fact that |gt,j| λt,j and η 1/ L1 . By combining the bounds of the two terms, we can obtain that f(wt+1) f(wt) η 2 gt, Λ 1 t gt + η2 j=1 L0,j g2 t,j λ2 t,j + 3η j=1 |nt,j| |gt,j| Then after summation in t and rearrangement, it holds that η 2 gt, Λ 1 t gt t=0 [f(wt) f(wt+1)] + η2 g2 t,j λ2 t,j + 3η t=0 |nt,j| |gt,j| f(w0) f(w T ) + η2 g2 t,j λ2 t,j + 3η g2 t,j λ2 t,j , Published as a conference paper at ICLR 2025 where in the last inequality we use Cauchy-Schwarz Inequality. Moreover, it holds that gt, Λ 1 t gt = g2 t,j λt,j g2 t,j λt,j + λt 1,j = λ2 t,j λ2 t 1,j λt,j + λt 1,j t=0 (λt,j λt 1,j) = tr (ΛT 1 Λ 1) , where Λ 1 = ϵId. 
Then by combining the two inequalities together with Lemma F.3 and taking expectation, we can obtain that E [tr (ΛT 1)] tr (Λ 1) + 2 η E[f(w0) f(w T )] + 2η g2 t,j λ2 t,j g2 t,j λ2 t,j (24) dϵ + 2 η E[f(w0) f(w T )] + 2η j=1 L0,j ln E tr (ΛT 1) g2 t,j λ2 t,j η E[f(w0) f(w T )] + 2η j=1 L0,j ln E tr (ΛT 1) g2 t,j λ2 t,j (5),(24) dϵ + 2 η E[f(w0) f(w T )] + 2η j=1 L0,j ln E tr (ΛT 1) 2E ln tr (ΛT 1) η E[f(w0) f(w T )] + 2η L0 1 + 5 E h tr (ΛT 1)2i where the second inequality is based on the Jensen inequality and the third inequality is based on the Cauchy-Schwarz inequality. Then by taking x = E[tr (ΛT 1)] ϵ , C1 = 2η L0 1 + 5 T σ 1 ϵ , and C0 = dϵ + 1 η[f(w0) f(w T )] in Lemma G.4, and using Assumption 3.2, we can obtain that E [tr (ΛT 1)] 2dϵ + 4 η (f(w0) f ) + 5 2η L0 1 + 5 2η L0 1 + 5 T σ 1 ϵ + e which concludes the proof. Then we are ready to prove Theorem 5.3. Published as a conference paper at ICLR 2025 Proof of Theorem 5.3. As we take η 1/ L1 , we have wt+1 wt L1 d. Then from Assumption 3.3 and Lemma C.3 it holds that f(wt+1) (15) f(wt) + f(wt), wt+1 wt + 1 j=1 (L0,j + L1,j | ft,j|) |wt+1 wt|2 j =f(wt) η f(wt), Λ 1 t gt + η2 j=1 (L0,j + L1,j | ft,j|) g2 t,j λ2 t,j . Then by taking expectation and summation on t we can obtain that t=0 Et[f(wt+1)] t=0 f(wt) η t=0 Et f(wt), Λ 1 t gt + η2 " g2 t,j λ2 t,j t=0 | ft,j| Et " g2 t,j λ2 t,j t=0 f(wt) η t=0 Et f(wt), Λ 1 t gt + η2 " g2 t,j λ2 t,j D f(wt), Λ 1 t f(wt) E + 1 " g2 t,j λ2 t,j We first consider the second term on the right hand side of (26). 
It holds that t=0 E f(wt), Λ 1 t gt = t=0 E h D f(wt), Λ 1 t gt Ei t=0 E h D f(wt), Λ 1 t Λ 1 t gt Ei t=0 E h D f(wt), Λ 1 t gt Ei + " g2 t,j λ2 t,j Therefore, after substituting the inequality and (24) into (26) and setting η 1 4 L1 , we have t=0 E h D f(wt), Λ 1 t gt Ei 4 η E [f(w0) f(w T )] + (12 σ 1 + 2 L0 1) ln " (tr ΛT 1)2 Moreover, for the left hand side of (27), we have E h D f(wt), Λ 1 t gt Ei =E h D f(wt), Λ 1 t f(wt) Ei = λ2 t 1,j + Et [gt,j]2 λ2 t 1,j + f 2 t,j + σ2 j λ2 T 1,j + PT 1 s=0 f 2 s,j + σ2 j Published as a conference paper at ICLR 2025 Thus by taking summation we have t=0 E h D f(wt), Λ 1 t gt Ei λ2 T 1,j + PT 1 s=0 f 2 s,j + σ2 j PT 1 t=0 | ft,j|2 q λ2 T 1,j + PT 1 s=0 f 2 s,j + σ2 j E q PT 1 t=0 | ft,j|2 2 λ2 T 1,j + PT 1 s=0 f 2 s,j + σ2 j q PT 1 t=0 | ft,j|2 2 λ2 T 1,j + PT 1 s=0 f 2 s,j + σ2 j where in the second last and last inequality we apply Cauchy-Schwarz inequality E[|XY |] E h |X|2i 1 2 E h |Y |2i 1 j=1 |Xj Yj| We can further deal with the denominator such that for all j [d], v u u tλ2 T 1,j + s=0 f 2 s,j + σ2 j s=0 f 2 s,j + σ2 j s=0 (gs,j ns,j)2 + σ2 j s=0 2g2 s,j + 2n2 s,j + σ2 j s=0 2n2 s,j + σ2 j T + 1σj = 3E [λt,j] + where the first and the third inequality holds as b for a, b > 0 and the second inequality holds as (a + b)2 2a2 + 2b2 for a, b R. Therefore, with Lemma F.1 bounding tr (Λt), combining (27) with (28) and (29) we can obtain that t=0 | ft,j|2 v u u tλ2 T 1,j + s=0 f 2 s,j + σj Published as a conference paper at ICLR 2025 4 η E[f(w0) f(w T )] + (12 σ 1 + 2η L0 1) ln " (tr ΛT 1)2 (29) 3E [tr (ΛT 1)] + 4 η E[f(w0) f(w T )] + (12 σ 1 + 2η L0 1) ln " (tr ΛT 1)2 (25) 30B(η) + 6dϵ + 76 T + 1V (4B(η) + 12V ), where we denote η (f(w0) f ) + η L0 1 Clog, V = σ 1 Clog η (f(w0) f ) + 5 2η L0 1 + 5 2η L0 1 + 5 T σ 1 ϵ + e based on Lemma F.4. By taking η = min 1 4 L1 , q , where f(w0) f , we can obtain that L0 1 + L1 . 
Then, combining the fact proven in Lemma G.5 that (Σ_{j=1}^d √(Σ_{t=0}^{T−1} |∇f_{t,j}|²))² ≥ Σ_{t=0}^{T−1} ‖∇f(w_t)‖₁², and dividing both sides by T, we obtain the claimed bound on (1/T) Σ_{t=0}^{T−1} E[‖∇f(w_t)‖₁²] stated in Theorem 5.3, which concludes the proof.

G USEFUL LEMMAS

Lemma G.1 (A useful inequality). Assume a non-negative sequence {x_j}_{j=1}^n and a positive sequence {s_j}_{j=1}^n with S = Σ_{j=1}^n s_j. It holds that
(Σ_{j=1}^n x_j)² ≤ S Σ_{j=1}^n x_j²/s_j. (30)
The inequality holds as an equality if and only if x_i/s_i = x_j/s_j for all i = 1, …, n and j = 1, …, n.

Proof. Expanding both sides, the right-hand side minus the left-hand side equals
S Σ_{j=1}^n x_j²/s_j − (Σ_{j=1}^n x_j)² = Σ_{i<j} (s_i x_j²/s_j + s_j x_i²/s_i − 2 x_i x_j) = Σ_{i<j} (√(s_i/s_j) x_j − √(s_j/s_i) x_i)² ≥ 0,
which concludes the proof, with equality exactly when x_j/s_j is constant in j. Note that this result is an application of the Cauchy–Schwarz inequality.

Lemma G.2 (A basic inequality). For c > 0 and x, y ∈ ℝ, it holds that
|xy| ≤ c x² + y²/(4c). (31)

Lemma G.3 (Sum of ratios with the denominator being the sum of past numerators). Assume a non-negative sequence (a_n)_{n≥1} and ϵ > 0. We define b_n = Σ_{i=1}^n a_i and b₀ = 0. Then it holds that
Σ_{t=1}^N a_t/(b_t + ϵ) ≤ ln((b_N + ϵ)/ϵ). (32)

Proof. It holds that
a_t/(b_t + ϵ) ≤ −ln(1 − a_t/(b_t + ϵ)) = ln(b_t + ϵ) − ln(b_t − a_t + ϵ) = ln(b_t + ϵ) − ln(b_{t−1} + ϵ),
where the inequality is based on the fact that x ≥ ln(1 + x) for all x > −1. Then by summing up for t = 1 to N we finish the proof.

Lemma G.4 (Solving the inequality x ≤ C₀ + C₁ ln x). Assume C₀, C₁ ≥ 0 and x > 0. If x ≤ C₀ + C₁ ln x (where ln denotes the natural logarithm), it holds that for ζ ≥ 5,
x ≤ 2C₀ + ζC₁ ln(C₁ + e). (33)

Proof. Denote g(x) = x − C₁ ln x − C₀. Then g is a convex function that attains its minimum at x = C₁; for x ≥ C₁, g is monotonically increasing. Therefore, to verify (33), it is equivalent to verify that if x = 2C₀ + ζC₁ ln(C₁ + e), where ζ = 5, then x ≥ C₀ + C₁ ln x. We begin the verification then. Denote z = 2C₀ + ζC₁ ln(C₁ + e).
Then we have
z − (C₀ + C₁ ln z) = z − C₀ − C₁ ln(2C₀ + ζC₁ ln(C₁ + e)) ≥ C₀ + ζC₁ ln(C₁ + e) − C₀ − C₁ ln(ζC₁ ln(C₁ + e)) = C₁ (ζ ln(C₁ + e) − ln(ζC₁ ln(C₁ + e))),
where the inequality is based on the fact that
C₁ ln(2C₀ + ζC₁ ln(C₁ + e)) ≤ C₀ + C₁ ln(ζC₁ ln(C₁ + e))
for C₀ ≥ 0 and ζ ≥ 2. Then let us consider the function h(y) = ζ ln(y + e) − ln(ζ y ln(y + e)) for y ≥ 0, and we have h(1) ≥ 0. We also have
h′(y) = ζ/(y + e) − 1/y − 1/((y + e) ln(y + e)),
which implies that if y ≥ 1, we have h′(y) ≥ 0 and thus h(y) ≥ 0 for y ≥ 1. For y ∈ [0, 1), it is also straightforward to obtain that h(y) ≥ ζ − ln(ζ ln(1 + e)) ≥ 0. Therefore, we conclude that h(y) ≥ 0. By substituting y = C₁, we have z − (C₀ + C₁ ln z) ≥ 0, and thus x ≤ C₀ + C₁ ln x implies x ≤ z, which finishes the proof.

Lemma G.5 (Comparison of measures). For a sequence {a_{t,j}} with t = 1, …, T and j = 1, …, d, it holds that
(Σ_{j=1}^d √(Σ_{t=1}^T a²_{t,j}))² ≥ Σ_{t=1}^T (Σ_{j=1}^d |a_{t,j}|)².

Proof. It holds that
(Σ_{j=1}^d √(Σ_{t=1}^T a²_{t,j}))² = Σ_{j=1}^d Σ_{t=1}^T a²_{t,j} + Σ_{j₁≠j₂} √(Σ_{t=1}^T a²_{t,j₁}) √(Σ_{t=1}^T a²_{t,j₂}) ≥ Σ_{t=1}^T Σ_{j=1}^d a²_{t,j} + Σ_{j₁≠j₂} Σ_{t=1}^T |a_{t,j₁} a_{t,j₂}| = Σ_{t=1}^T (Σ_{j=1}^d |a_{t,j}|)².
The last inequality follows from the Cauchy–Schwarz inequality. Thus we finish the proof.
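Since Lemmas G.1 and G.4 are self-contained numeric facts, they can be sanity-checked directly. Below is a small Python check; the random instances and the particular equality case of (30) are our own illustrative choices:

```python
import math
import random

random.seed(3)

def lemma_g1_gap(xs, ss):
    # Lemma G.1: (sum_j x_j)^2 <= S * sum_j x_j^2 / s_j, with S = sum_j s_j;
    # returns right-hand side minus left-hand side (should be >= 0)
    S = sum(ss)
    return S * sum(x * x / s for x, s in zip(xs, ss)) - sum(xs) ** 2

def lemma_g4_bound(C0, C1, zeta=5.0):
    # Lemma G.4: any x > 0 with x <= C0 + C1*ln(x) satisfies x <= z below
    return 2.0 * C0 + zeta * C1 * math.log(C1 + math.e)

for _ in range(1000):
    n = random.randint(1, 8)
    xs = [random.uniform(0.0, 5.0) for _ in range(n)]
    ss = [random.uniform(0.1, 5.0) for _ in range(n)]
    assert lemma_g1_gap(xs, ss) >= -1e-9  # inequality (30)

# equality case of (30): x_j proportional to s_j
assert abs(lemma_g1_gap([0.5, 1.0, 1.5], [1.0, 2.0, 3.0])) < 1e-9

for _ in range(1000):
    C0, C1 = random.uniform(0.0, 50.0), random.uniform(0.0, 50.0)
    z = lemma_g4_bound(C0, C1)
    # crux of the proof of Lemma G.4: z itself satisfies z >= C0 + C1*ln(z);
    # combined with monotonicity of x - C1*ln(x) for x >= C1, this yields (33)
    assert z >= C0 + C1 * math.log(z) - 1e-9
```

The second loop checks exactly the quantity z − (C₀ + C₁ ln z) ≥ 0 verified analytically in the proof above.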