Published as a conference paper at ICLR 2025

GRADIENT DESCENT WITH GENERALIZED NEWTON'S METHOD

Zhiqi Bu (woodyx218@gmail.com)
Shiyun Xu, University of Pennsylvania (shiyunxu@sas.upenn.edu)

We propose the generalized Newton's method (GeN), a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers.

1 INTRODUCTION

Deep learning models are trained via the gradient descent $w_{t+1} = w_t - \eta_t P_t^{-1} g_t$, where $w_t \in \mathbb{R}^d$ is the vector of model parameters. The model update $w_t \to w_{t+1}$ consists of a learning rate scheduler $\eta_t \in \mathbb{R}$ that is controlled by several hyperparameters, a pre-conditioner $P_t \in \mathbb{R}^{d \times d}$ that can depend on the Hessian or Fisher information, and the first-order gradient $g_t \in \mathbb{R}^d$ of the loss function $L(w_t)$.

Table 1: Summary of optimizers with $w_{t+1} = w_t - \eta_t P_t^{-1} g_t$, where $H$ is the Hessian, $F$ is the Fisher information, and $d$ is the number of model parameters. We highlight that GeN can take advantage of any pre-conditioner $P_t$.
| Method | $\eta_t$ | $P_t$ | $\dim(P_t)$ |
|---|---|---|---|
| Newton-Raphson | $=1$ | $=H$ | $d^2$ |
| BFGS (Broyden, 1970), LBFGS (Byrd et al., 1995) | need to tune | $\approx H$ | $d^2$ |
| GeN (this work) | $= G^\top g / g^\top H g$ | Any | $1$ or $d$ |
| D-adaptation (Defazio & Mishchenko, 2023), Prodigy (Mishchenko & Defazio), DoG (Ivgi et al., 2023), DoWG (Khaled et al., 2023) | $\approx ∥w_0 - w_*∥ / \max ∥G∥$ | $=I$ or $\sqrt{\mathrm{diag}(F)}$ | $1$ or $d$ |
| AdaHessian (Yao et al., 2021), Sophia (Liu et al.) | need to tune | $\mathrm{diag}(H)$ | $d$ |
| Shampoo (Gupta et al., 2018), K-FAC (Martens & Grosse, 2015), Natural Gradient (Amari, 1998) | need to tune | $F$ | $d$ |
| Adam (Kingma & Ba, 2014), AdamW (Loshchilov & Hutter, 2017), AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSProp (Hinton et al., 2012) | need to tune | $\sqrt{\mathrm{diag}(F)}$ | $d$ |
| SGD (Robbins & Monro, 1951), Heavyball (Polyak, 1964), NAG (Nesterov, 1983) | need to tune | $=I$ | $1$ |

*Equal contribution. Code available at https://github.com/ShiyunXu/AutoGeN.

For large-scale optimization problems with millions to billions of parameters, the training can be significantly costly. For example, large language models are trained with trillions of tokens, on thousands of GPUs, costing millions of dollars (Sharir et al., 2020; Biderman et al., 2023), and at a high carbon footprint (Patterson et al., 2021; Luccioni et al., 2023; Dodge et al., 2022; Touvron et al., 2023a): GPT-3 (175B, Brown et al. (2020)), LLaMA (7-70B, Touvron et al. (2023a;b)), Chinchilla (70B, Hoffmann et al. (2022)), PaLM (540B, Chowdhery et al. (2023)), and Falcon (7-180B, Almazrouei et al. (2023)) models are trained on 1T tokens; state-of-the-art models require extremely long training time, measured in thousands of PetaFLOPS-days (Almazrouei et al., 2023, Figure 2) or millions of GPU hours (AI@Meta, 2024). It is therefore computationally prohibitive to instantiate a dense $P_t$ or to tune the hyperparameters of $\eta_t$. However, the second-order pre-conditioner and the learning rate scheduler are critical to fast convergence.
On the one hand, a Hessian-informed pre-conditioner is necessary in large-scale model training, as SGD (using no pre-conditioning) empirically does not compete with adaptive optimizers like Adam. However, implementing or approximating the full Hessian matrix induces $O(d^2)$ or even $O(d^3)$ complexity, which is infeasible for large models on the current generation of computing devices. In order to compute the pre-conditioner $P_t$ efficiently, a long list of adaptive optimizers (see Table 1 with diagonal $P_t$) has been proposed that only leverages the diagonal of $P_t$ and/or replaces the Hessian with the Fisher information, although some information in the full Hessian may be lost.

On the other hand, manually setting a learning rate scheduler requires a critical balance between the utility (the more hyperparameters, the better) and the tuning effort (the fewer hyperparameters, the easier). The simplest approach is to set a constant learning rate $\eta_t = \eta$ (only one hyperparameter) through grid search. However, mini-batch gradient descent with a constant $\eta_t$ may not allow the model to converge even in the convex setting, which is easier to optimize than the non-convex setting of deep learning. Additionally, as we see in Figure 1, a small $\eta$ converges slowly but to a better solution, and vice versa for a large $\eta$, which depicts the importance of setting the proper $\eta_t$ at each iteration. That is, tuning the learning rate scheduler is essentially selecting $T$ (the number of iterations) learning rates to maximize the convergence speed. To accommodate this challenge, a line of schedulers (including the linear decay, warm-up, and cosine schedulers) has been proposed to control the learning rate during the training, with a number of hyperparameters. For instance, LLaMA 65B (Touvron et al., 2023a) uses a scheduler with 3 hyperparameters: warm-up steps = 2000, maximum $\eta_t$ = 1.5e-4, and final $\eta_t$ = 1.5e-5.
Unfortunately, most schedulers are heuristic and have their limits: they do not guarantee fast convergence and become increasingly difficult to tune as the model and data sizes increase.

Related works. This work is closely related to previous literature on learning rate schedulers (including heuristic and parameter-free methods), optimization with Hessian information, and deep learning system design, which are discussed in Section 3.2 and Appendix D.

Contribution. We propose the generalized Newton's method (GeN), a Hessian-informed approach which merges the information from the full Hessian into the learning rate, so as to dynamically adapt to the loss landscape and accelerate the convergence. We examine GeN on 5 criteria: automaticity, applicability, scalability, computation efficiency, and convergence speed. Overall, GeN¹ is an automatic optimizer that is applicable to general optimizers and works successfully in deep learning without adding much computation. Our efficiency analysis shows that GeN is perfectly scalable to large-scale training and as computationally efficient as existing optimizers. We empirically demonstrate that GeN is highly performant on image classification, text classification, natural language generation, object detection, instance segmentation, recommendation systems, and parameter-efficient training (PET).

¹We use "generalized" to mean that (1) the Newton-Raphson method can be viewed as a sub-case of GeN, and (2) we leverage the generalized inverse without inverting the Hessian matrix.

2 MOTIVATION

2.1 NOTATIONS

We denote the model parameters as $w_t \in \mathbb{R}^d$, where $t \le T$ is the index of iterations; $\{x_i\}$ are the samples, with or without labels. For any loss function, we denote $L(w) := \mathbb{E}_x L(w, x)$ as the
generalization loss, and $\bar{L}(w) := \frac{1}{B}\sum_{i=1}^{B} L(w, x_i)$ as the training loss. We denote the mini-batch stochastic gradient of any optimizer as $g_t^{\text{optim}}(\nabla \bar{L}) = P_t^{-1} \nabla \bar{L} \in \mathbb{R}^d$, where $\nabla \bar{L}(w_t) := \frac{1}{B}\sum_i \nabla L(w_t, x_i)$ is the vanilla gradient and $g_t^{\text{optim}}$ is the pre-conditioned gradient. For instance, SGD simply gives $g^{\text{SGD}}(\nabla \bar{L}) = \nabla \bar{L}$; SignSGD (a special case of Adam) post-processes the gradient as $g^{\text{SignSGD}}(\nabla \bar{L}) = \text{sign}(\nabla \bar{L})$; gradient clipping gives $g^{\text{clip}}(\nabla \bar{L}) = \min\{C/\|\nabla \bar{L}\|, 1\} \cdot \nabla \bar{L}$, and similarly for the projection; PET (e.g. LoRA) gives $g^{\text{PET}}(\nabla \bar{L}) = m \odot \nabla \bar{L}$, where $m$ is a binary mask and $\odot$ is the element-wise multiplication. We highlight that many optimization algorithms, including but not limited to AdaGrad, Adam, Sophia, momentum, and weight decay, can also be summarized as post-processings of $\nabla \bar{L}$.

Figure 1: Effects of various learning rate schedulers. Left two: ResNet18 on the CIFAR100 dataset, compared with constant learning rates ($\eta \in \{0.5, 0.1, 0.05, 0.01, 0.005, 0.001\}$). Right two: GPT2 on the E2E dataset, compared with heuristic learning rate schedulers (linear, inverse, inverse-sqrt, constant, cosine).

2.2 SECOND-ORDER TAYLOR EXPANSION

In deep learning, the models are updated via gradient descent by various optimizers:

$$w_{t+1} = w_t - \eta_t g_t^{\text{optim}}(w_t) \quad (2.1)$$

Applying the Taylor expansion to the loss and leveraging (2.1), we can derive the loss improvement at the current iteration,

$$L(w_t) - L(w_{t+1}) = L(w_t) - L(w_t - \eta_t g_t^{\text{optim}}) = \eta_t G_t^\top g_t^{\text{optim}} - \frac{\eta_t^2}{2} (g_t^{\text{optim}})^\top H_t g_t^{\text{optim}} + O(\eta_t^3), \quad (2.2)$$

where $G_t := \frac{\partial L}{\partial w_t}$ and $H_t := \frac{\partial^2 L}{\partial w_t^2}$ are the oracle gradient and Hessian, respectively.
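As a quick numerical sanity check (our illustration, not part of the paper), the expansion above can be verified on a toy non-quadratic loss; the log-sum-exp function and the evaluation point below are hypothetical stand-ins, chosen so the $O(\eta^3)$ remainder is visible:

```python
import numpy as np

# Toy non-quadratic loss: L(w) = log(sum(exp(w))), whose gradient is the
# softmax and whose Hessian is diag(p) - p p^T for p = softmax(w).
def loss(w):
    return np.log(np.sum(np.exp(w)))

def grad(w):
    e = np.exp(w)
    return e / e.sum()

def hess(w):
    p = grad(w)
    return np.diag(p) - np.outer(p, p)

w = np.array([0.5, -1.0, 2.0])
G, H = grad(w), hess(w)
g = G  # GeN-SGD direction: g_optim is the gradient itself

errs = {}
for eta in (1e-1, 1e-2):
    exact = loss(w) - loss(w - eta * g)               # true loss improvement
    model = eta * G @ g - 0.5 * eta**2 * (g @ H @ g)  # quadratic model (2.2)
    errs[eta] = abs(exact - model)
# shrinking eta by 10x shrinks the gap roughly 1000x, i.e. the gap is O(eta^3)
```

In the small learning rate regime, the quadratic model tracks the true loss improvement up to a cubic remainder, which is the sufficiency claim made in the text.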
In light of (2.2), we claim that employing the second-order Taylor expansion is (1) necessary, since the first-order Taylor expansion only characterizes a linear approximation of the loss and thus fails to characterize the loss curvature; and (2) sufficient, especially in the small learning rate regime where large models are universally trained and $O(\eta^3)$ is negligible². In fact, (2.2) is also used in the theoretical analysis of neural scaling laws (Kaplan et al., 2020; McCandlish et al., 2018). We visualize the loss function in Figure 2, in which a quadratic function with respect to $\eta$ is fitted by ignoring the higher-order terms,

$$L(w_t) - L(w_t - \eta_t g_t^{\text{optim}}) \approx \hat{L}(g_t^{\text{optim}}, \eta_t) := \eta_t G_t^\top g_t^{\text{optim}} - \frac{\eta_t^2}{2} (g_t^{\text{optim}})^\top H_t g_t^{\text{optim}} \quad (2.3)$$

3 METHODOLOGY

3.1 GENERALIZED NEWTON'S METHOD

Our goal is to minimize the loss over the domain of the learning rate $\eta \in \mathbb{R}$, in particular the next-iteration loss in (2.3). In fact, if we solve $\min_{g,\eta} \hat{L}(g, \eta)$ over both the learning rate and the descent

²The classical convergence analysis of gradient descent also leverages the second-order Taylor expansion $L(w - \eta G) \le L(w) - \eta G^\top G + \frac{L\eta^2}{2}\|G\|^2$, where $L$ is the Lipschitz smoothness of $L$. This quadratic function is minimized at $\eta = 1/L$.

Figure 2: Illustration of the second-order Taylor expansion in (2.3); the quadratic curve fitting achieves $R^2 \ge 0.99$ throughout training. Left two: ResNet18 on CIFAR100 with SGD.
Right two: GPT2 on E2E with AdamW.

direction, then the optimal solution is $\eta_t^* g_t^* = H_t^{-1} G_t$ and therefore $w_{t+1} = w_t - H_t^{-1} G_t$. This recovers the vanilla Newton's method, conditioning on $H_t$ being invertible. By working with the domain of $\eta$, we effectively reduce the high-dimensional $(d+1)$-variable problem $\min_{g,\eta} \hat{L}(g, \eta)$ to a uni-variate problem $\min_\eta \hat{L}(g, \eta)$. From (2.3), it is obvious that the optimal learning rate is

$$\eta^*_{\text{GeN}}(g_t^{\text{optim}}) = \frac{G_t^\top g_t^{\text{optim}}}{(g_t^{\text{optim}})^\top H_t g_t^{\text{optim}}}. \quad (3.1)$$

Remark 3.1. The closed form of (2.3) allows us to directly derive the optimal learning rate, without resorting to more complex methods such as back-tracking or line searches.

That is, given $g_t^{\text{optim}}$ (determined by the pre-conditioner $P_t$) and provided that $(g_t^{\text{optim}})^\top H_t g_t^{\text{optim}} > 0$, our method transforms any base optimizer into a new optimizer by

$$w_{t+1} = w_t - \eta^*_{\text{GeN},t}\, g_t^{\text{optim}} = w_t - \frac{G_t^\top g_t^{\text{optim}}}{(g_t^{\text{optim}})^\top H_t g_t^{\text{optim}}}\, g_t^{\text{optim}} \quad (3.2)$$

We term the family of optimizers in (3.2) the generalized Newton's method (GeN), because the vanilla Newton's method is a sub-case by Proposition 3.3. We highlight that $g_t^{\text{optim}}$ can come from any optimizer or even be random: e.g. we equip (3.2) with $g_t^{\text{SGD}}$ to obtain GeN-SGD from SGD, or with $g_t^{\text{Adam}}$ to obtain GeN-Adam from Adam.

Remark 3.2. When the formulation of (3.2) is restricted to SGD without pre-conditioning, this equivalent form of GeN-SGD has also been presented in prior works like LQA (Zhu et al., 2021), QLABGrad (Fu & Wu, 2024), and ADLER (Balboni & Bacciu, 2023). However, without the general $g^{\text{optim}}$ through pre-conditioning, such a formulation does not include Newton's method as a sub-case. We dedicate Appendix D.4 to highlighting the similarities and differences between GeN and these works.

Proposition 3.3. If $g_t^{\text{optim}} = H_t^{-1} G_t$, then (3.2) reduces to Newton's method, as $\eta_t^* g_t^{\text{optim}} = H_t^{-1} G_t$.

Another way to understand (3.2) is via the generalized right inverse³ (see Definition A.1).
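As a sanity check of (3.1) and Proposition 3.3, the closed-form learning rate can be exercised on a toy quadratic loss (a numpy sketch of ours, not the authors' code; the positive-definite $H$ and vector $c$ below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5 * np.eye(5)        # hypothetical positive-definite Hessian
c = rng.standard_normal(5)
w_star = np.linalg.solve(H, c)     # unique minimizer of the quadratic

def loss(w):
    return 0.5 * w @ H @ w - c @ w

def grad(w):
    return H @ w - c               # oracle gradient G_t

def gen_lr(G, g):
    # optimal learning rate (3.1) for an arbitrary descent direction g
    return (G @ g) / (g @ H @ g)

w = rng.standard_normal(5)
G = grad(w)

# GeN-SGD: g_optim = G; one step with the closed-form learning rate
w_sgd = w - gen_lr(G, G) * G

# Newton direction: g_optim = H^{-1} G; gen_lr evaluates to exactly 1,
# so the GeN step coincides with Newton's update (Proposition 3.3)
g_newton = np.linalg.solve(H, G)
w_newton = w - gen_lr(G, g_newton) * g_newton
```

On a quadratic, the Newton-direction step lands on the minimizer in one iteration, while the GeN-SGD step still strictly decreases the loss.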
For simplicity of presentation, we drop the super-script of $g_t^{\text{optim}}$ from now on. We can write

$$(g_t^\top H_t)^{-1}_R = \frac{g_t}{g_t^\top H_t g_t} \implies \eta_t^* g_t = (g_t^\top H_t)^{-1}_R\, g_t^\top G_t, \quad (3.3)$$

which resembles the Newton's update $H_t^{-1} G_t$, but with $H_t$ and $G_t$ projected along the direction of $g_t$:

$$w_{t+1} = w_t - H_t^{-1} G_t \implies g_t^\top H_t w_{t+1} = g_t^\top H_t w_t - g_t^\top G_t \implies w_{t+1} = w_t - (g_t^\top H_t)^{-1}_R\, g_t^\top G_t.$$

Remark 3.4. GeN-SGD is scale-invariant, as $\frac{G_t^\top g_t}{g_t^\top H_t g_t} g_t$ does not change if $g_t$ is multiplied by a factor of $c \in \mathbb{R}$, i.e. $g_t \to c g_t$. This indicates that GeN-SGD (but not GeN-Adam) is stable w.r.t. the vanishing/exploding gradient and does not need gradient clipping.

³For a row vector $A \in \mathbb{R}^{1 \times d}$, its left inverse does not exist since $\text{rank}(A) = 1$; its right inverse is a column vector $A^{-1}_R \in \mathbb{R}^{d \times 1}$ such that $A A^{-1}_R = 1$.

3.2 DERIVING $\eta^*$ WITHOUT BACK-PROPAGATION

To compute $\eta^*$ in (3.1) for GeN optimizers, or equivalently to compute $G_t^\top g_t$ and especially $g_t^\top H_t g_t$, the straightforward but inefficient approach is to compute or approximate $G_t$ and the full $H_t$. For example, $H_t$ can be approximated iteratively by BFGS methods, although instantiating and inverting an $\mathbb{R}^{d \times d}$ matrix is prohibitively expensive for large models. More commonly, $H_t$ is approximated by its low-rank representation (e.g. K-FAC) or its diagonal (e.g. AdaHessian, Sophia), which is usually combined with a replacement by the Fisher information (e.g. Adam, AdaGrad). However, these approximations incur at least $O(d)$ memory overhead, and may be sub-optimal in performance due to the gap between the approximated Hessian and the full Hessian. Alternatively, we can estimate $g_t^\top H_t g_t$ via the Hessian-vector product $H_t g_t$, without directly accessing $H_t$: first back-propagating on $\bar{L}$ gives $g_t$, and then back-propagating on $g_t^\top \bar{g}_t$ (where $\bar{g}_t$ is a copy of $g_t$ detached from the computation graph) gives $H_t g_t$.
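The Hessian-vector-product idea can be illustrated framework-free; the sketch below substitutes a central finite difference of the gradient for the double back-propagation (a stand-in we introduce for illustration, not the paper's implementation), on a hypothetical toy loss whose Hessian is known exactly for checking:

```python
import numpy as np

def loss_grad(w):
    # gradient of the toy loss L(w) = sum(w^4)/4 + ||w||^2/2
    return w**3 + w

def hvp(w, v, eps=1e-5):
    # Hessian-vector product H @ v without materializing H (O(d) memory):
    # central difference of the gradient along v
    return (loss_grad(w + eps * v) - loss_grad(w - eps * v)) / (2 * eps)

w = np.array([1.0, -2.0, 0.5])
g = loss_grad(w)                  # descent direction g_t
gHg = g @ hvp(w, g)               # the scalar g^T H g needed by (3.1)

H_exact = np.diag(3 * w**2 + 1)   # exact Hessian of the toy loss
```

The key point is that `hvp` touches only gradient evaluations, never the $d \times d$ matrix, which is why $g^\top H g$ is cheap even when forming $H$ is not.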
In fact, all the above-mentioned methods (except BFGS and Fisher-related methods) need the Hessian-vector product⁴, which is not supported in large-scale distributed learning such as DeepSpeed (Rasley et al., 2020), Megatron (Shoeybi et al., 2019), and FSDP (FairScale authors, 2021), because the computation graph is complicated and interferes with the communication orchestra. In summary, existing methods suffer from two constraints: they have to either approximate $H_t$ and sacrifice the performance, or rely on the Hessian-vector product and sacrifice the efficiency and scalability.

In stark contrast, we can overcome these constraints by using multiple additional forward passes to efficiently compute $G_t^\top g_t$ and $g_t^\top H_t g_t$ without accessing $G_t$ and $H_t$, which is universally easy-to-implement because the forward pass is a fundamental operation in any deep learning system. To be specific, we apply gradient descent with different learning rates and derive the quadratic function in (2.3). Our approach can be viewed as an extension of LQA (Zhu et al., 2021), but allowing more flexibility while enhancing the stability and speed of convergence (see Algorithm 1 and Remark 3.5). As an extension, our method can use more forward passes to fit a higher-order polynomial than the second-order one in (2.3).

3.3 ALGORITHM

We now present a quadratic curve fitting approach to estimate the optimal learning rate $\eta^*_{\text{GeN}}$ in (3.1):

$$\eta^*(\Gamma) = b^*/A^*, \quad \text{where } (A^*, b^*) = \arg\min_{A \in \mathbb{R},\, b \in \mathbb{R}} \sum_{\eta \in \Gamma} \Big( \bar{L}(w_t - \eta g_t^{\text{optim}}) - \big( \bar{L}(w_t) - b\eta + \tfrac{A}{2}\eta^2 \big) \Big)^2, \quad (3.4)$$

in which $\Gamma$ is a set of learning rates, e.g. $\Gamma = \{-\eta_{t-1}, 0, \eta_{t-1}\}$ in Algorithm 1, and the objective function leverages the Taylor expansion in (2.3). Therefore, we obtain $G_t^\top g_t^{\text{optim}} \approx b^*$ and $(g_t^{\text{optim}})^\top H_t g_t^{\text{optim}} \approx A^*$, so that $\eta^* \approx \eta^*_{\text{GeN}}$. In fact, a mathematical equivalence can be established between this curve fitting and the finite difference method (see Appendix A.3). In particular, for the 3-point $\Gamma = \{-\eta_{t-1}, 0, \eta_{t-1}\}$, the finite difference approach from Zhu et al.
(2021) gives an estimate that is equal to (3.4):

$$\eta^*_{\text{LQA}}(\Gamma) = \frac{\eta_{t-1}}{2} \cdot \frac{\bar{L}_- - \bar{L}_+}{\bar{L}_+ - 2\bar{L}_0 + \bar{L}_-}, \quad \text{where } \bar{L}_\pm := \bar{L}(w_t \mp \eta_{t-1} g_t^{\text{optim}}). \quad (3.5)$$

However, on the one hand, (3.5) is not generalizable, as it takes different forms whenever $\Gamma$ changes (see an example of $\eta^*_{\text{LQA}}(\{0, \eta, 2\eta\})$ in Appendix A.3). On the other hand, the formulation becomes complicated when $\Gamma$ takes more than 3 points (see an example of 5 points in (A.2)), while (3.4) easily extends to any number of points, as demonstrated in Figure 2. In practice, we leverage the curve fitting to implement (3.2) via an efficient algorithm in Algorithm 1.

Algorithm 1: Generalized Newton's optimizers (GeN), e.g. $\gamma = 0.9$, $\Phi = 8$

1: for $t = 1, \dots, T$ do
2:   Compute the loss $\bar{L}_0 = \bar{L}(w_t)$ by the standard forward pass
3:   Compute the gradient $g_t^{\text{optim}}(w_t)$ by back-propagation on $\bar{L}_0$
4:   if $t \bmod \Phi == 0$ then
5:     Compute $\bar{L}_\pm = \bar{L}(w_t \mp \eta_{t-1} g_t^{\text{optim}})$ by two forward passes
6:     Fit the quadratic function via (3.4): $\{-\eta_{t-1}, 0, \eta_{t-1}\} \to \{\bar{L}_-, \bar{L}_0, \bar{L}_+\}$
7:     Extract $A^*, b^*$ from the quadratic function and derive $\eta^* = b^*/A^*$
8:     if $A^* > 0$, $b^* > 0$, $R^2$ score $> 0.99$ then
9:       Update the learning rate $\eta_t = \gamma \eta_{t-1} + (1 - \gamma)\eta^*$
10:  Update $w_{t+1} = w_t - \eta_t g_t$

Remark 3.5. We discuss the flexible designs of Algorithm 1 and extend the discussion in Appendix C. In line 3, GeN can operate on $g_t^{\text{optim}}$ from any base optimizer such as SGD, AdamW, and PET. In line 4, setting a large $\Phi$ significantly amortizes the computational overhead (see Section 4.1, where GeN optimizers are almost as fast as the base optimizers).

⁴Estimating the diagonal Hessian needs the Hutchinson method and the Hessian-vector product, where $\frac{1}{K}\sum_{k \le K} v_k \odot H v_k \to \text{diag}(H)$ as $K \to \infty$ for $v_k \sim N(0, I)$. Because the precision of these methods relies on computing $v_k \odot H v_k$ for $K$ rounds, the training is very slow, as the cost is $K$ times that of a standard back-propagation, say $K = 20$ in Sophia and $K = 100$ in PyHessian.
In line 5, the forward passes are cheaper than line 2, in that they do not store the activation tensors, which saves memory and translates into faster training. In addition, we can use more forward passes to fit (2.3), incurring more computation cost but reducing the variance in the estimation of $\eta^*$. Notice that the additional computation can be easily amortized through $\Phi$. In line 6, we highlight that $\{-\eta_{t-1}, 0, \eta_{t-1}\}$ is symmetric and auto-regressive, without introducing a new hyperparameter. The choice is justified in Appendix A.3. In line 8, we ensure the convexity ($A^* > 0$), the positivity ($b^* > 0$), and the goodness of fit ($R^2$ score $> 0.99$) before updating the learning rate.

Algorithm 1 is an approximation to the GeN method in (3.2), since (3.4) approximates $\eta^*_{\text{GeN}}$ in (3.1). We also give Algorithm 2, which implements GeN exactly as described in Section 3.2.

3.4 ERROR ANALYSIS FOR THE OPTIMAL LEARNING RATE

We now analyze the estimation error of $\eta^*$ in (3.4), with batch size $B$ and mini-batch loss $\bar{L}(w_t)$. The estimation error consists of the sub-sampling error from using the mini-batch, and the precision error from fitting (2.3). We note that the analysis is based on the closed form of $\eta^*$, which is available to us through $\eta^*_{\text{LQA}}$ in (3.5).

Proposition 3.6. The estimation error of $\eta^*$ in (3.1) by the mini-batch losses $(\bar{L}_0, \bar{L}_+, \bar{L}_-)$ is $O_p(\frac{1}{\sqrt{B}}) + O(\eta_{t-1}^2)$, where $B$ is the batch size and $\eta_{t-1}$ is the learning rate from the previous iteration.

Empirically, the sub-sampling error $O_p(\frac{1}{\sqrt{B}})$ is dominant, compared to the precision error $O(\eta_{t-1}^2)$. As a consequence, we advocate applying GeN optimizers with a large batch size for the best performance.
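The fitting step of Algorithm 1 (lines 5-9) can be sketched in a few lines of numpy; the toy quadratic loss, $\gamma$, and the previous learning rate below are hypothetical stand-ins, with `np.polyfit` playing the role of the least-squares fit (3.4):

```python
import numpy as np

H = np.diag([4.0, 1.0])                  # toy quadratic, minimizer at origin
def loss(w):
    return 0.5 * w @ H @ w

w = np.array([2.0, -3.0])
eta_prev, gamma = 0.05, 0.9              # e.g. gamma = 0.9 as in Algorithm 1

g = H @ w                                 # line 3: gradient (GeN-SGD direction)
Gamma = np.array([-eta_prev, 0.0, eta_prev])
L_vals = np.array([loss(w - eta * g) for eta in Gamma])  # lines 2 and 5

# line 6: fit L(w - eta*g) ~ L0 - b*eta + (A/2)*eta^2 over Gamma, as in (3.4)
coef = np.polyfit(Gamma, L_vals, 2)       # coefficients [A/2, -b, L0]
A, b = 2 * coef[0], -coef[1]
eta_star = b / A                          # line 7: estimate of (3.1)

if A > 0 and b > 0:                       # line 8 (R^2 check omitted here)
    eta_t = gamma * eta_prev + (1 - gamma) * eta_star  # line 9: EMA update
    w = w - eta_t * g                     # line 10
```

Because the toy loss is exactly quadratic, `eta_star` coincides with $G^\top g / (g^\top H g)$ up to floating point; on a real network the fit is approximate and the $R^2$ score serves as the third safeguard in line 8.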
Figure 3: Convergence of ResNet18 on CIFAR10, optimized by GeN-SGD with various batch sizes ($B \in \{100, 200, 500, 1000, 2000\}$); losses and train/test accuracy are plotted against the number of samples trained.

4 EFFICIENCY AND SCALABILITY ANALYSIS

In this section, we show that GeN optimizers are as computationally efficient and scalable (in terms of the utility and the system design) as their base optimizers. We train CIFAR10 (Krizhevsky et al., 2009) on ResNet-18/34/50/152 (He et al., 2016) and ViT-tiny/small/base/large (Dosovitskiy et al., 2020). For fine-tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). For the utility, we observe on CIFAR10 that GeN optimizers work well with different model sizes across different architectures.

Figure 4: Convergence of GeN-SGD (upper panel) and GeN-AdamW (lower panel) on CIFAR10 with various model architectures and sizes (ResNet-18/34/50/152 and ViT-tiny/small/base/large).
For efficiency and system-wise scalability, we focus on the additional forward passes in Algorithm 1, especially in scenarios such as parameter-efficient training (PET) and distributed learning. Our default setting is full-parameter training (including mixed precision training), $\Phi = 1$, and a single GPU (no communication cost among devices). To set the stage, GeN optimizers have the same peak memory cost as the base optimizers, as all optimizers need forward passes and we adopt CPU off-loading. Hence it suffices to count the additional time complexity introduced by GeN. We can write

$$\text{absolute speed of base} \propto \frac{1}{F + B + C}, \quad \text{absolute speed of GeN} \propto \frac{1}{F + B + C + 2F/\Phi}, \quad \text{relative speed of GeN} = \frac{F + B + C}{F + B + C + 2F/\Phi},$$

where $F$ is the time complexity of the forward pass, in terms of floating-point operations, $B$ is that of the back-propagation, and $C$ covers other costs including the communication and the data loading, which are minimal on a single GPU. Given that in full-parameter training $B \approx 2F$⁵, the GeN optimizers at $\Phi = 1$ are roughly 60% as fast as the base optimizers.

4.1 LAZY LEARNING RATE UPDATE

One simple trick to make GeN almost as fast as the base optimizers is to update the learning rate every $\Phi > 1$ iterations, so that the additional computation is amortized. For instance, GeN achieves 86% relative speed at $\Phi = 4$ and $> 92\%$ relative speed at $\Phi \ge 8$.

Figure 5: Convergence of ResNet18 on CIFAR10, optimized by GeN-SGD with various $\Phi \in \{1, 2, 4, 8\}$, over both iterations and wall-clock training time.

⁵The forward pass takes 1 unit of time. Back-propagation consists of two sub-processes, the output gradient and the parameter gradient, each taking 1 unit of time. See the complexity analysis in Table 1 of Bu et al. (2022).

In the PET setting, most parameters are frozen and their gradients are not computed.
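The cost model above can be checked with a short calculation; a sketch assuming the paper's unit costs ($F = 1$, $B \approx 2F$ for full-parameter training, $B \approx F$ for PET, $C \approx 0$ on a single GPU):

```python
def relative_speed(Phi, F=1.0, B=2.0, C=0.0):
    # GeN pays 2 extra forward passes (cost 2F) once every Phi iterations
    return (F + B + C) / (F + B + C + 2 * F / Phi)

# full-parameter training (B ~ 2F): 60% at Phi=1, ~86% at Phi=4, >92% at Phi>=8
full = {Phi: relative_speed(Phi) for Phi in (1, 4, 8)}

# PET (B ~ F): the relative speed reduces to 1 / (1 + 1/Phi)
pet = {Phi: relative_speed(Phi, B=1.0) for Phi in (1, 8)}
```

The full-parameter numbers reproduce the 60%/86%/92% figures quoted in the text, and the PET variant gives 50% at $\Phi = 1$ and about 89% at $\Phi = 8$.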
Hence $B \approx F$ and the relative speed of GeN becomes $\frac{1}{1 + 1/\Phi}$, e.g. 50% at $\Phi = 1$ and 89% at $\Phi = 8$. Empirically, applying $\Phi$ has an insignificant effect on the convergence.

4.2 COMMUNICATION IN DISTRIBUTED LEARNING

For large-scale optimization, distributed learning with multiple devices is necessary but challenging, in that (1) the communication orchestra is complicated, and (2) the communication cost is significant ($C \gg 0$). These two challenges determine if and how well an optimizer scales under a distributed solution. For data-parallel solutions such as Distributed Data Parallel and ZeRO-1/2 (Rajbhandari et al., 2020), the communication orchestra is relatively simple, so many Hessian-related optimizers are executable. For model-parallel and pipeline-parallel solutions such as ZeRO-3 and FSDP, each forward pass requires the communication of model parameters, and GeN indeed adds a noticeable volume of communication. Nevertheless, applying the lazy learning rate update can reduce the communication overhead to a negligible level.

5 EXPERIMENTS ON SYNTHETIC DATA

To compare GeN and different optimizers deterministically in the synthetic setting, we experiment on 2-dimensional functions and carefully select an optimal learning rate for each optimizer (see the details in Appendix B.1).

Figure 6: Optimizing the Rosenbrock (non-convex) function. Plots 1&3: losses over iterations for different optimizers (GeN-SGD, SGD, Adam, AdaGrad, RMSprop, AdaBelief, AdaHessian, and GeN variants). Plots 2&4: 2D visualization of the optimization trajectories at the 40-th iteration (Plot 2) and the 180-th iteration (Plot 4).
In Figure 6, we test on the Rosenbrock function, which is non-convex and has a unique minimum at (1, 1) lying in a narrow valley. The left two plots show the trajectory of our GeN-SGD and its convergence speed, which significantly outperforms the other optimizers. The right two plots show the one-to-one comparison between the base optimizers (dashed curves) and their GeN variants (solid curves), from which the acceleration by GeN is observed universally. In Figure 7, we test on the Beale function, which is convex and has a unique minimum at (3, 0.5), and observe the same advantage of GeN.

Figure 7: Optimizing the Beale (convex) function. Plots 1&3: losses over iterations for different optimizers. Plots 2&4: 2D visualization of the optimization trajectories at the 60-th iteration (Plot 2) and the 40-th iteration (Plot 4).

6 EXPERIMENTS ON REAL DATA

In this section, we test GeN on real datasets across image classification, text classification, natural language generation, object detection, and instance segmentation. We additionally experiment on image generation and recommendation systems in Appendix B, where all training details can be found. Our experiments cover end-to-end training as well as fine-tuning, and include PET methods such as LoRA (Hu et al.) and BitFit (Zaken et al., 2022).

6.1 IMAGE CLASSIFICATION

We apply variants of SGD on ResNet50 and AdamW on ViT, training on computer vision datasets of various sizes (e.g. Places365 has 2 million images) and difficulty (ranging from 30% to 98% accuracy).
In Table 2 and Figure 8, GeN optimizers consistently achieve high accuracy when compared with heuristic learning rate schedulers (i.e. constant, linear, and cosine decay) as well as automatic optimizers like Prodigy and D-Adaptation⁶.

Table 2: Test accuracy of ResNet (optimized by SGD) and ViT (optimized by AdamW) on various image classification datasets.

| dataset | CIFAR10 | CIFAR100 | Food101 | GTSRB | SVHN | Places365 | INat2021 |
|---|---|---|---|---|---|---|---|
| reference | Krizhevsky et al. (2009) | Krizhevsky et al. (2009) | Bossard et al. (2014) | Houben et al. (2013) | Netzer et al. (2011) | Zhou et al. (2014) | ina (2021) |
| epochs | 5 | 5 | 5 | 5 | 5 | 5 | 10 |
| GeN-SGD | 96.76 | 84.77 | 82.43 | 97.96 | 97.02 | 54.24 | 44.57 |
| SGD (constant) | 95.91 | 81.64 | 74.24 | 95.20 | 95.08 | 51.09 | 33.80 |
| SGD (linear decay) | 95.52 | 81.67 | 75.14 | 95.33 | 95.34 | 50.95 | 30.69 |
| SGD (cosine decay) | 95.82 | 80.71 | 76.39 | 95.20 | 95.27 | 51.15 | 31.10 |
| Prodigy | 95.17 | 80.39 | 80.74 | 97.80 | 96.23 | 50.33 | 33.77 |
| D-adapt SGD | 95.20 | 80.95 | 80.50 | 97.38 | 95.32 | 47.24 | 38.00 |
| GeN-AdamW | 98.68 | 92.62 | 90.48 | 99.06 | 97.14 | 59.80 | 66.28 |
| AdamW (constant) | 97.49 | 89.23 | 88.44 | 98.54 | 96.65 | 58.28 | 65.62 |
| AdamW (linear decay) | 98.48 | 92.60 | 90.54 | 98.74 | 97.08 | 58.52 | 65.43 |
| AdamW (cosine decay) | 98.73 | 92.71 | 90.46 | 98.77 | 97.16 | 58.19 | 67.04 |
| Prodigy | 98.92 | 92.49 | 90.42 | 98.88 | 97.13 | 57.24 | 62.61 |
| D-adapt AdamW | 97.56 | 88.11 | 89.45 | 99.04 | 96.77 | 56.19 | 66.52 |

Figure 8: Convergence of ViT on the INat2021 dataset, optimized by variants of AdamW (constant, linear, cosine, Prodigy, D-adaptation, GeN-AdamW): learning rate, loss, and train/test accuracy over iterations.

6.2 NATURAL LANGUAGE PROCESSING

We compare GeN-AdamW with a variety of learning rate schedulers on natural language generation (NLG) and natural language understanding (NLU). For the NLG experiments, we train a GPT2 (Radford et al., 2019) model with LoRA on the E2E dataset (Novikova et al., 2017).
We measure the quality of the generated texts by the perplexity (the exponential of the loss), BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), METEOR (Banerjee & Lavie, 2005), NIST (Doddington, 2002), and CIDEr (Vedantam et al., 2015). We use (Hu et al.) as the baseline, which trains for 5 epochs with a linearly decaying learning rate. In Figure 1 and Table 3, GeN is consistently performant and outperforms the baseline over multiple metrics. Somewhat surprisingly, the learning rate of GeN continues to go up even when we extend the training to 20 epochs in Figure 9, which is not captured by previous heuristic learning rate schedulers.

⁶We have observed that D-Adaptation SGD diverges on all datasets when a weight decay of 5e-4 is used.

Figure 9: Loss and learning rate during GPT2 training.

Table 3: Test performance of GPT2 on the E2E dataset (the higher the better).

| | BLEU | ROUGE-L | NIST | METEOR | CIDEr |
|---|---|---|---|---|---|
| GeN | 67.30 | 67.09 | 8.6508 | 0.4407 | 2.2810 |
| linear decay | 66.85 | 67.78 | 8.4438 | 0.4375 | 2.2561 |
| cosine decay | 66.59 | 67.35 | 8.4511 | 0.4290 | 2.2553 |
| constant | 65.95 | 66.69 | 8.3982 | 0.4462 | 2.1697 |
| inverse sqrt ($1/\sqrt{t}$) | 65.10 | 66.19 | 8.1274 | 0.4047 | 2.1001 |
| inverse ($1/t$) | 64.43 | 65.95 | 8.0447 | 0.4010 | 2.0792 |

For the NLU experiments, we evaluate RoBERTa-base (Liu et al., 2019) on the GLUE (Wang et al., 2019) benchmark with LoRA, BitFit, and full-parameter training (FT).

Table 4: Test performance of the RoBERTa model with different methods on the GLUE benchmark (the higher the better). The left number of each pair is the result in the published paper, produced by the heuristic learning rate schedulers in Hu et al. (linear warm-up and linear decay); the right number is the result of the GeN optimizer. We report the overall (matched and mismatched) accuracy for MNLI, Matthew's correlation for CoLA, the F1 score for MRPC and QQP, and accuracy for the other tasks.
| | Trainable param | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE |
|---|---|---|---|---|---|---|---|---|
| LoRA | 0.3M | 87.5\|86.7 | 95.1\|94.3 | 90.8\|92.1 | 63.4\|63.7 | 93.3\|92.3 | 85.3\|86.9 | 86.6\|79.1 |
| BitFit | 0.1M | 85.0\|85.1 | 93.7\|94.3 | 92.0\|92.3 | 61.8\|62.5 | 91.3\|91.5 | 84.2\|84.3 | 77.8\|78.0 |
| FT | 125.0M | 86.7\|86.8 | 94.2\|94.7 | 92.5\|92.3 | 61.1\|64.8 | 92.3\|91.9 | 88.0\|88.4 | 77.4\|80.5 |

6.3 OBJECT DETECTION & INSTANCE SEGMENTATION

We train a Mask R-CNN (He et al., 2017) with SGD on the Penn-Fudan dataset (Wang et al., 2007), following the official PyTorch tutorial, which uses a linear warm-up and then a constant learning rate. The loss $\bar{L}(w)$ is the sum of 5 losses from the classification and the localization in different model components: the classifier loss, the bounding box regression loss in object detection, the mask loss, the objectness loss, and the Region Proposal Network (RPN) bounding box regression loss. The model is built on ResNet50 and pre-trained on the COCO dataset (Lin et al., 2014). We measure the precision in Table 5, based on the Intersection over Union (IoU).

Figure 10: Loss and learning rate during Mask R-CNN training.

Table 5: Average precision of Mask R-CNN (the higher the better). AP_s/m/l means small/medium/large instances, and AP is the average over all instances.

| Object detection | AP | APs | APm | APl |
|---|---|---|---|---|
| GeN | 0.805±0.038 | 0.452±0.002 | 0.624±0.082 | 0.813±0.032 |
| manual | 0.802±0.025 | 0.368±0.013 | 0.586±0.079 | 0.819±0.017 |
| Instance segmentation | | | | |
| GeN | 0.771±0.019 | 0.432±0.010 | 0.518±0.104 | 0.781±0.015 |
| manual | 0.768±0.022 | 0.388±0.029 | 0.455±0.092 | 0.783±0.020 |

7 DISCUSSION

In this work, we propose the GeN optimizers as a generally applicable and efficient approach towards second-order optimization that is automatic and performant. Specifically, GeN is applicable to any future pre-conditioner that improves the convergence. However, additional work is needed to further our understanding of GeN.
We remark that the learning rate in GeN is locally optimal, but less is known about the global convergence speed. Looking forward, we expect GeN to be combined with more optimization algorithms, and to be explored on various domains and problems.

REFERENCES

iNaturalist 2021 competition, 2021. URL https://github.com/visipedia/inat_comp/tree/master/2021.

AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The Falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Dario Balboni and Davide Bacciu. Adler: an efficient Hessian-based strategy for adaptive learning rate. arXiv preprint arXiv:2305.16396, 2023.

Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397-2430. PMLR, 2023.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: mining discriminative components with random forests. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, pp. 446-461. Springer, 2014.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1):76-90, 1970.

Zhiqi Bu, Jialin Mao, and Shiyun Xu. Scalable and efficient training of large convolutional neural networks with differential privacy. Advances in Neural Information Processing Systems, 35:38305-38318, 2022.

Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208, 1995.

Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008-1031, 2016.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1-113, 2023.

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, pp. 7449-7479. PMLR, 2023.

George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pp. 138-145, 2002.

Jesse Dodge, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A Smith, Nicole DeCario, and Will Buchanan. Measuring the carbon intensity of AI in cloud instances. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.
1877-1894, 2022.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

FairScale authors. FairScale: a general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale, 2021.

Minghan Fu and Fang-Xiang Wu. QLABGrad: a hyperparameter-free and convergence-guaranteed scheme for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 12072-12081, 2024.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pp. 1842-1850. PMLR, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent, 2012.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 30016-30030, 2022.

Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: the German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD's best friend: a parameter-free dynamic step size schedule. In International Conference on Machine Learning, pp. 14465-14499. PMLR, 2023.

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. Checkmate: breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems, 2:497-511, 2020.

Xiaomeng Jin, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, and Mingyi Hong. Unlearning as multi-task optimization: a normalized gradient difference approach with an adaptive learning rate. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 11278-11294, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. URL https://aclanthology.org/2025.naacl-long.563/.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, Minneapolis, Minnesota, 2019.

Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. Advances in Neural Information Processing Systems, 36:6748-6769, 2023.

Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74-81, 2004.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: a scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations, 2024.

Ruixuan Liu and Zhiqi Bu. Towards hyperparameter-free optimization with differential privacy. In The Thirteenth International Conference on Learning Representations, 2025.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2016.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.

Aurelien Lucchi, Brian McWilliams, and Thomas Hofmann. A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316, 2015.

Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253):1-15, 2023.

Haipeng Luo, Alekh Agarwal, Nicolo Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. Advances in Neural Information Processing Systems, 29, 2016.

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038-53075, 2023.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408-2417. PMLR, 2015.

Sam McCandlish, Jared Kaplan, Dario Amodei, and the OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

Konstantin Mishchenko and Aaron Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In Forty-first International Conference on Machine Learning, 2024.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR, 269(3):543, 1983.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning.
In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 7, Granada, Spain, 2011.

Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. The E2E dataset: new challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201-206, 2017.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, 2002.

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551, 2020.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: memory optimizations toward training trillion parameter models.
In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. IEEE, 2020.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505-3506, 2020.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Nicol N Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Artificial Intelligence and Statistics, pp. 436-443. PMLR, 2007.

Or Sharir, Barak Peleg, and Yoav Shoham. The cost of training NLP models: a concise overview. arXiv preprint arXiv:2004.08900, 2020.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Leslie N Smith. No more pesky learning rate guessing games. CoRR, abs/1506.01186, 5:575, 2015.

Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441-1450, 2019.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288, 2023b.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566-4575, 2015.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

Liming Wang, Jianbo Shi, Gang Song, and I-fan Shen. Object detection combining recognition and segmentation. In Asian Conference on Computer Vision, pp. 189-199. Springer, 2007.

Xiao Wang, Shiqian Ma, Donald Goldfarb, and Wei Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927-956, 2017.

Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael Mahoney. AdaHessian: an adaptive second order optimizer for machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10665-10673, 2021.

Rui Yuan, Alessandro Lazaric, and Robert M Gower. Sketched Newton-Raphson. SIAM Journal on Optimization, 32(3):1555-1583, 2022.

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1-9, 2022.

Matthew D Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. Advances in Neural Information Processing Systems, 27, 2014.
Yingqiu Zhu, Danyang Huang, Yuan Gao, Rui Wu, Yu Chen, Bo Zhang, and Hansheng Wang. Automatic, dynamic, and nearly optimal learning rate specification via local quadratic approximation. Neural Networks, 141:11-29, 2021.

A PRELIMINARIES AND PROOFS

A.1 FINITE DIFFERENCE METHOD

The finite difference method can be used to approximate high-order derivatives of a scalar function f(w). We denote Δ as one unit of difference. We start with the first-order finite difference for df/dw, which uses two points. There are three popular methods: the forward, the backward, and the central difference:

Forward: (f(w+Δ) − f(w))/Δ
Backward: (f(w) − f(w−Δ))/Δ
Central: (f(w+Δ) − f(w−Δ))/(2Δ)

The precision of the forward and backward differences is O(Δ), and that of the central difference is O(Δ²). We can approximate higher-order derivatives, such as the second-order d²f/dw², with more points (say three):

Forward: (f(w+2Δ) − 2f(w+Δ) + f(w))/Δ²
Backward: (f(w) − 2f(w−Δ) + f(w−2Δ))/Δ²
Central: (f(w+Δ) − 2f(w) + f(w−Δ))/Δ²

The precision of the forward and backward differences is O(Δ), and that of the central difference is O(Δ²). Generally speaking, we can use (k+1) points to approximate the k-th order derivative at O(Δ²) with the central difference. The precision can be improved with more points for the same derivative, say for df/dw:

Forward: (−(1/2)f(w+2Δ) + 2f(w+Δ) − (3/2)f(w))/Δ
Backward: ((3/2)f(w) − 2f(w−Δ) + (1/2)f(w−2Δ))/Δ
Central: (−(1/12)f(w+2Δ) + (2/3)f(w+Δ) − (2/3)f(w−Δ) + (1/12)f(w−2Δ))/Δ

Compared with the two-point difference, using three points improves the precision of the forward difference from O(Δ) to O(Δ²); using four points improves the precision of the central difference from O(Δ²) to O(Δ⁴).

A.2 GENERALIZED INVERSE

Definition A.1. Any matrix A ∈ R^{m×n} has at least one generalized inverse A⁻¹_G ∈ R^{n×m} such that A A⁻¹_G A = A. The generalized right inverse and left inverse are sub-cases of the generalized inverses: if A⁻¹_R satisfies A A⁻¹_R = I_m and rank(A) = m, then A⁻¹_R is the right inverse of A; similarly, if A⁻¹_L A = I_n and rank(A) = n, then A⁻¹_L is the left inverse of A.
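The precision orders above can be checked numerically. The following sketch is our illustration, not part of the paper; the test function (exp) and step sizes are arbitrary choices:

```python
import math

# Finite-difference estimates from Appendix A.1, applied to a smooth scalar f.
def forward_diff(f, w, d):
    # two-point forward estimate of f'(w), O(d) error
    return (f(w + d) - f(w)) / d

def central_diff(f, w, d):
    # two-point central estimate of f'(w), O(d^2) error
    return (f(w + d) - f(w - d)) / (2 * d)

def central_second(f, w, d):
    # three-point central estimate of f''(w), O(d^2) error
    return (f(w + d) - 2 * f(w) + f(w - d)) / d ** 2

f, w = math.exp, 0.3
exact = math.exp(w)  # for exp, f' = f'' = f, so the exact value is known
for d in (1e-2, 1e-3):
    err_fwd = abs(forward_diff(f, w, d) - exact)
    err_cen = abs(central_diff(f, w, d) - exact)
    # the O(d^2) central difference should beat the O(d) forward difference
    assert err_cen < err_fwd
```
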
A.3 EQUIVALENCE BETWEEN FINITE DIFFERENCE AND POLYNOMIAL FITTING

Proof. Consider fitting the quadratic function:

L₊ := L0 + η Gᵀg + (η²/2) gᵀHg
L₋ := L0 − η Gᵀg + (η²/2) gᵀHg

Then Gᵀg = (L₊ − L₋)/(2η) and gᵀHg = (L₊ − 2L0 + L₋)/η², which is equivalent to applying the second-order central finite difference method to both terms. Hence, using the symmetric learning rates (−η, 0, η), the optimal learning rate is estimated as

η* ≈ (η/2) · (L₊ − L₋)/(L₊ − 2L0 + L₋)

with O(η²) error.

If we instead fit another quadratic function using (0, η, 2η), then

L₊₂ := L0 + 2η Gᵀg + 2η² gᵀHg
L₊ := L0 + η Gᵀg + (η²/2) gᵀHg

Then Gᵀg = (−L₊₂ + 4L₊ − 3L0)/(2η) and gᵀHg = (L₊₂ − 2L₊ + L0)/η², which is equivalent to applying the second-order forward difference to Gᵀg and the first-order forward difference to gᵀHg. Hence, using the non-symmetric learning rates (0, η, 2η), the optimal learning rate is estimated as

η* ≈ (η/2) · (4L₊ − L₊₂ − 3L0)/(L₊₂ − 2L₊ + L0)

with O(η) error.

A.4 ESTIMATION ERROR OF η*

Proof of Proposition 3.6. We omit the subscript t for simplicity of presentation. Assume the samples xᵢ are independently and identically distributed; then by the central limit theorem, the mini-batch loss satisfies

L̃(w − ηg) := (1/B) Σᵢ L(w − ηg, xᵢ) = L(w − ηg) + Op(1/√B).

Hence the mini-batch approximation is

η̃* = (η/2) · (L̃₊ − L̃₋)/(L̃₊ − 2L̃0 + L̃₋) = (η/2) · (L₊ − L₋ + Op(1/√B))/(L₊ − 2L0 + L₋ + Op(1/√B)) = (η/2) · (L₊ − L₋)/(L₊ − 2L0 + L₋) + Op(1/√B).

Next, the finite difference method has a precision of O(η²). Hence, the overall estimation error of η* is Op(1/√B) + O(η²).

A.5 DERIVATION OF η*

We give the derivation of (3.4), given the learning rates {−η_{t-1}, 0, η_{t-1}} and the losses {L₋, L0, L₊}. We aim to minimize (2.3): ignoring the superscript in g_t^optim,

min_η L(w_t − ηg_t), or equivalently min_η L(w_t − ηg_t) − L(w_t).

By the second-order Taylor expansion, we get

min_η L(w_t − ηg_t) − L(w_t) ≈ min_η (η²/2) g_tᵀH_tg_t − η G_tᵀg_t,

which requires the knowledge of two coefficients, g_tᵀH_tg_t and G_tᵀg_t.
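The symmetric three-point estimate can be verified on a toy quadratic, where it is exact. The sketch below is our illustration (the coefficients a, b, L0 are made-up stand-ins for gᵀHg, Gᵀg, and the current loss):

```python
# Recover the optimal learning rate eta* = b/a from three loss probes,
# following the symmetric estimate of Appendix A.3.
# L(eta) models the loss at w + eta*g: L0 + b*eta + 0.5*a*eta^2,
# so descent along -g has optimal step eta* = b/a.
a, b, L0 = 4.0, 2.0, 1.0           # stand-ins for g'Hg, G'g, and L(w)
L = lambda eta: L0 + b * eta + 0.5 * a * eta ** 2

probe = 0.1                        # the probing learning rate eta_{t-1}
Lp, Lm = L(probe), L(-probe)       # L+ and L-
eta_star = 0.5 * probe * (Lp - Lm) / (Lp - 2 * L0 + Lm)
assert abs(eta_star - b / a) < 1e-9   # exact for a quadratic loss
```

For a truly quadratic loss the recovery is exact; for a general loss the error is O(η²) as stated above.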
Using curve fitting, we estimate the coefficients (g_tᵀH_tg_t, G_tᵀg_t) ≈ (A*, b*) by

(A*, b*) = argmin_{A,b} Σ_{η ∈ {−η_{t-1}, 0, η_{t-1}}} [(L(w_t − ηg_t) − L(w_t)) − (Aη²/2 − bη)]².

The same result can be derived with finite difference as in (3.4) (though not through an optimization problem), through the numerical analysis demonstrated in Appendix A.3. Finally, we can write η* = b*/A* ≈ G_tᵀg_t / g_tᵀH_tg_t, where the error is controlled to small values by Proposition 3.6.

As an important extension, we can use more points (say {−2η_{t-1}, −η_{t-1}, 0, η_{t-1}, 2η_{t-1}}) to derive η*. This usually gives a slightly more stable convergence and almost the same accuracy, but more computational overhead. Using curve fitting, we easily extend to

(g_tᵀH_tg_t, G_tᵀg_t) ≈ argmin_{A,b} Σ_{η ∈ {−2η_{t-1}, −η_{t-1}, 0, η_{t-1}, 2η_{t-1}}} [(L(w_t − ηg_t) − L(w_t)) − (Aη²/2 − bη)]².

Using finite difference, the derivation is more complicated than (3.4): with the central five-point estimates G_tᵀg_t ≈ (−L₊₂ + 8L₊ − 8L₋ + L₋₂)/(12η_{t-1}) and g_tᵀH_tg_t ≈ (−L₊₂ + 16L₊ − 30L0 + 16L₋ − L₋₂)/(12η_{t-1}²), we obtain

η* ≈ η_{t-1} · (−L₊₂ + 8L₊ − 8L₋ + L₋₂)/(−L₊₂ + 16L₊ − 30L0 + 16L₋ − L₋₂)   (A.2)

where L±₂ := L(w_t ± 2η_{t-1}g_t). We refer to https://en.wikipedia.org/wiki/Finite_difference_coefficient for the table of finite difference coefficients.

B MORE ON EXPERIMENTS

B.1 SYNTHETIC DATA

In Section 5, we carefully select the best learning rate for each optimizer through an exponential grid search over {10^{-k}, 2×10^{-k}, 5×10^{-k}} for k = 5, 4, ..., 0. For each optimizer, we train with each learning rate for 1000 iterations and choose the one with the smallest objective value.

The Rosenbrock function, also known as the Valley or Banana function, is a classic test problem. We take its two-dimensional case: f(x1, x2) = 100(x2 − x1²)² + (1 − x1)². The Beale function is another classic test problem, with the expression: f(x1, x2) = (1.5 − x1 + x1x2)² + (2.25 − x1 + x1x2²)² + (2.625 − x1 + x1x2³)².

B.2 IMAGE CLASSIFICATION

For image classification problems, we use models that are pre-trained on ImageNet and can be loaded from the torchvision and timm libraries. We resize all images to 224×224 and normalize the pixel values to [−1, 1].
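As a concrete illustration of the three-point estimate on the synthetic objective of B.1, the following self-contained sketch (ours, not the authors' code) runs a GeN-style SGD on the 2-D Rosenbrock function; the starting point, initial learning rate, and the halving backstop when a step would increase the loss are illustrative choices, not part of the paper:

```python
# GeN-style SGD on the 2-D Rosenbrock function, estimating the learning rate
# from the three symmetric loss probes {L-, L0, L+} as in Appendix A.3/A.5.
def loss(w):
    x1, x2 = w
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2

def grad(w):
    x1, x2 = w
    return [-400.0 * x1 * (x2 - x1 ** 2) - 2.0 * (1.0 - x1),
            200.0 * (x2 - x1 ** 2)]

w0 = [-1.0, 1.0]                    # illustrative starting point
w, eta = list(w0), 1e-3             # illustrative initial learning rate
for _ in range(2000):
    g = grad(w)
    L0 = loss(w)
    Lp = loss([wi + eta * gi for wi, gi in zip(w, g)])  # probe at lr -eta
    Lm = loss([wi - eta * gi for wi, gi in zip(w, g)])  # probe at lr +eta
    denom = Lp - 2.0 * L0 + Lm
    if denom > 0:                   # only trust the fit when it is convex
        eta = 0.5 * eta * (Lp - Lm) / denom
    w_new = [wi - eta * gi for wi, gi in zip(w, g)]
    if loss(w_new) < L0:
        w = w_new
    else:
        eta *= 0.5                  # simple backstop, not in the paper
assert loss(w) < loss(w0)           # the objective decreased
```
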
In what follows, all ResNets are trained with SGD with momentum 0.9 and weight decay 5e-4, and ViT is trained with AdamW using the default hyperparameters in PyTorch.

B.2.1 CIFAR

CIFAR10 and CIFAR100 are standard tiny-image datasets that we use as a test-bed. Our default hyperparameters for Figure 1, Figure 2, Figure 3, Figure 4, and Figure 5 are: B = 500, Φ = 4, SGD learning rate 1e-2, and AdamW learning rate 1e-4, unless one of the hyperparameters is varied for the ablation study.

B.2.2 SECTION 6.1

We use B = 500, SGD learning rate 0.1, and AdamW learning rate 1e-4 for all datasets. Notice that D-adapt SGD diverges on all datasets, so we have to remove the 5e-4 weight decay for this optimizer only. For Places365 and INat2021, we use Φ = 20; for the other datasets, which are much smaller in sample size, we use Φ = 4.

B.3 NATURAL LANGUAGE PROCESSING

All transformers (RoBERTa, GPT2) are trained with AdamW using the default hyperparameters in PyTorch. In Figure 1, Figure 2, Figure 9, and Table 3, we follow the codebase of Hu et al. and use B = 256, sequence length 128, η0 = 1e-3, and 5 epochs. While applying GeN, we set Φ = 4.

The results of full-parameter training and BitFit training are from Table 2 of Zaken et al. (2022), and those of LoRA are from Table 2 of Hu et al.. However, since LoRA did not report the F1 score of MRPC and QQP, we run LoRA under our settings (i.e., the same batch size and number of epochs) for a fair comparison. Table 6 records our hyperparameters for training with GeN. While applying GeN, we set Φ = 8 for all tasks.

Table 6: Hyperparameters and evaluation metrics for training GLUE datasets with GeN.

       Batch size  Initial learning rate for FT  # of epochs    Eval metrics
MRPC   128         2e-5                          10             F1
SST2   128         1e-6                          10             acc.
MNLI   128         1e-6                          5 (1 for FT)   matched acc. & mismatched acc.
CoLA   128         2e-5                          10             Matthews corr.
QNLI   128         2e-5                          10             acc.
QQP    256         2e-5                          5              F1
RTE    128         2e-5                          60             acc.
For hyperparameters not mentioned in Table 6, we follow Table 9 of Hu et al..

B.4 IMAGE GENERATION

We apply GeN-Adam to pre-train a deep convolutional generative adversarial network (DCGAN) (Radford et al., 2015), following the official PyTorch tutorial, which uses a constant learning rate, with a batch size of 128 and training for 5 epochs. The training uses CelebA (Liu et al., 2015), a real dataset of over 200,000 face images. We qualitatively compare the fake generated images with the real ones, which demonstrates the effectiveness of our optimization and can be further improved with longer training. We believe this showcases the potential applicability of GeN to vision generation, e.g., audio/music generation and diffusion models.

Figure 11: Left: learning rate of GeN-Adam over the iterations. Right: losses of the components in DCGAN, where D is the discriminator and G is the generator.

B.5 RECOMMENDATION SYSTEM

We apply GeN-Adam to a BERT (Kenton & Toutanova, 2019) model for a recommendation system, following the setting in Sun et al. (2019), with a batch size of 1024 and training for 100 epochs. We train on the MovieLens-1M dataset, containing 1 million user-rating pairs. The performance is measured by the recall and the Normalized Discounted Cumulative Gain (NDCG, related to the precision). The baseline is a constant learning rate given by the codebase. We observe that GeN outperforms the state-of-the-art setting (batch size 128) but needs a larger batch size to converge stably.

Table 7: Performance of top-K recommendations (Recall@5 means the recall at K = 5).

          batch size  Recall@1  Recall@5  Recall@10  Recall@20  NDCG@1  NDCG@5  NDCG@10  NDCG@20
GeN       1024        0.405     0.720     0.821      0.900      0.405   0.575   0.607    0.627
Constant  1024        0.374     0.693     0.796      0.883      0.374   0.546   0.580    0.602
Constant  128         0.397     0.718     0.820      0.895      0.397   0.570   0.603    0.622

Figure 12: Side-by-side comparison of images, which is comparable to the quality here.
Figure 13: Convergence of GeN-Adam (B = 1024) and Adam with constant learning rates (B = 1024 and B = 128) over 100 epochs.

C MORE ON THE DESIGN OF ALGORITHM 1

C.1 TWO MODES OF FORWARD PASS

As discussed in Section 3.3, there are two modes of forward pass: with or without the activation. If the forward pass is followed by back-propagation, then it must store the activation in each layer. The activation lies in R^{B×T×d}, where B is the batch size (say 256), T is the sequence length (say 2048 for LLaMA) or the number of pixels (say 224×224 for ImageNet), and d is the input dimension of one layer (say 768 for a transformer block). In practice, the activation is very expensive to store and can take up 95% of the total memory cost (see Jain et al. (2020), Figure 3). On the other hand, the forward pass can be activation-free if we only do inference, not training. In PyTorch, this is enabled by with torch.no_grad(). We can leverage the saved memory to use a larger batch size and thus accelerate the computation, further reducing the GeN overhead.

C.2 CHOICE OF FINITE DIFFERENCE

We extend our discussion on fitting the quadratic function. In general, we can learn gᵀHg and Gᵀg by fitting any collection of finite differences, say {ξᵢ}, to the corresponding losses {L(w_t − ξᵢg_t)}. For instance, we can use more than 3 points, or use non-symmetric sequences like (0, ξ, 2ξ) → (L(w_t), L(w_t − ξg_t), L(w_t − 2ξg_t)). We recommend using symmetric sequences like {−2ξ, −ξ, 0, ξ, 2ξ} for better precision (c.f. Appendix A.3). Nevertheless, this introduces a hyperparameter ξ and considerable effort to tune it, as is the case in zeroth-order optimization (Malladi et al., 2023). Therefore we use symmetric and auto-regressive sequences that employ the previous learning rates: (−2η_{t-1}, −η_{t-1}, 0, η_{t-1}, 2η_{t-1}).
Lastly, to reduce the computation overhead, we use the fact that any quadratic function is uniquely determined by three points. Therefore, we can use (−η_{t-1}, 0, η_{t-1}) as the shortest sequence that is both symmetric and auto-regressive.

C.3 SMOOTHING

In Algorithm 1, we propose to use smoothing (e.g., setting γ = 0.9) as a standard technique in deep learning optimization. Note that smoothing is also known as momentum. For example, Nesterov accelerated gradient and SGD both use smoothing on the gradients; Adam and Sophia additionally use smoothing on the pre-conditioner. We empirically confirm that the smoothing helps stabilize the training in all our experiments.

C.4 INITIAL LEARNING RATE

The GeN optimizer and existing parameter-free methods such as D-adaptation (Defazio & Mishchenko, 2023) and DoG (Ivgi et al., 2023) still need some hyperparameters. To be specific, D-adaptation needs an initial estimate of D and a growth rate if the training is unstable; DoG needs to carefully select rϵ to compute the initial distance: too large a value could make DoG diverge, while too small a value cannot optimize the model. In our case, GeN only needs one hyperparameter, the initial learning rate η0, which can be determined with nearly zero effort of manual tuning. We now introduce two methods to set η0.

C.4.1 AUTOMATIC CORRECTION

Empirically, GeN optimizers have the capability to automatically correct the learning rate if η0 is set too large or too small. We test a wide range of learning rates, with the largest one 1000× larger than the smallest, all resulting in similar final performance. This capability allows practitioners to start the training with a highly robust choice of η0.
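The smoothing in C.3 can be sketched as an exponential moving average over the per-step learning-rate estimates. This is our illustration, not the authors' code; γ = 0.9 follows the text, and the noisy estimates below are made-up values:

```python
# Momentum-style smoothing of the per-step learning-rate estimates (C.3).
gamma = 0.9
raw_estimates = [0.10, 0.14, 0.02, 0.12, 0.11, 0.30, 0.09]  # made-up noisy eta*

eta = raw_estimates[0]
smoothed = [eta]
for est in raw_estimates[1:]:
    eta = gamma * eta + (1 - gamma) * est  # EMA update, as in Algorithm 1
    smoothed.append(eta)

# the smoothed sequence varies far less than the raw estimates
assert max(smoothed) - min(smoothed) < max(raw_estimates) - min(raw_estimates)
```
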
Figure 14: Convergence of ResNet18 on CIFAR10, optimized by GeN-SGD with B = 200, under initial learning rates η0 ∈ {1.0, 0.5, 0.1, 0.01, 0.001}. (Panels: learning rate on a log scale, train cross-entropy loss, train accuracy, and test accuracy over 300 iterations.)

In Figure 14, a smaller η0 (say 0.1) quickly catches up at an exponential speed and then stabilizes, whereas a larger η0 tends to decay from the beginning of training. We recommend setting η0 to small values, to avoid risking the possibility of loss divergence.

C.4.2 AUTOMATIC SEARCH

Alternatively, we can search for η0 automatically at the first iteration. The computation overhead is negligible since it is amortized over a large number of iterations. In practice, we observe that an exponential grid search ranging from 10⁻⁶ to 10² can quickly give an accurate estimate of η0. Our experiment in the same setting as Figure 14 selects η0 = 0.008, close to the red curve.

C.5 EXACT ALGORITHM FOR GEN

We give an algorithm that computes (3.2) precisely using higher-order differentiation. In contrast to Algorithm 1, which computes GeN approximately, this algorithm uses second-order back-propagation, which is generally inefficient and incompatible with large-scale distributed systems due to the lack of support.
Algorithm 2 Generalized Newton's optimizers with higher-order differentiation (GeN-HOD)
1: for t = 1, ..., T do
2:   Compute the loss L(w_t) by the standard forward pass
3:   Compute the vanilla gradient g_t by first-order back-propagation from L(w_t) ∈ R
4:   Compute the modified gradient g_t^optim(w_t)
5:   Compute the Hessian-vector product H_t g_t^optim by second-order back-propagation from g_t^T g_t^optim ∈ R
6:   Derive G_t^T g_t^optim and (g_t^optim)^T H_t g_t^optim by multiplication
7:   Derive the optimal learning rate η_t = G_t^T g_t^optim / ((g_t^optim)^T H_t g_t^optim) by (3.1)
8:   Update w_{t+1} = w_t − η_t g_t^optim

C.6 PRELIMINARY RESULTS ON PRE-TRAINING

Following https://github.com/kuangliu/pytorch-cifar.git, we experimented with training ResNet18 from scratch on CIFAR10. We replace SGD with AdamW at learning rate 1e-3, but the remaining settings are the same as in the repository. We use Φ = 4 for GeN. For D-adaptation, we use their AdamW version for a fair comparison.

[Figure 15: CIFAR10 pre-training on ResNet18 for 15 epochs, comparing AdamW (cosine), Prodigy, D-adaptation, and GeN over 1400 iterations; the right panel shows test accuracy.]

C.7 DEPENDENCE ON TOTAL STEPS

The total number of steps T is an important factor in existing learning rate schedules (e.g. linear and cosine decay in Table 8): a larger T allows longer training at a large learning rate, while a smaller T requires early learning rate decay. Our GeN in (3.1) is originally proposed as T-independent. Yet it may be slightly improved with a dependence on T, especially if T is relatively small:

η_GeN(g_t^optim) ← η_GeN(g_t^optim) · (1 − t/T)    (C.1)

where the last term is a hyperparameter-free factor, decaying from 1 to 0. The form of this factor is flexible: instead of linear decay, we can also use cosine decay as in D-adaptation Defazio & Mishchenko (2023) and Prodigy Mishchenko & Defazio, which denote such a factor by γ_k. We have used this variant of GeN in the experiments on Places365, INat2021, and NLU.
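For intuition, Algorithm 2 can be instantiated exactly on a quadratic loss L(w) = ½ w^T A w − b^T w, where the gradient and Hessian-vector product are available in closed form instead of via second-order back-propagation. This pure-Python sketch takes g^optim = g (the SGD case) and is illustrative only, not the paper's implementation:

```python
def gen_hod_step(A, b, w):
    """One GeN step on L(w) = 0.5*w^T A w - b^T w: here g = A w - b and H v = A v."""
    d = len(w)
    g = [sum(A[i][j] * w[j] for j in range(d)) - b[i] for i in range(d)]  # gradient
    Hg = [sum(A[i][j] * g[j] for j in range(d)) for i in range(d)]        # Hessian-vector product
    eta = sum(x * x for x in g) / sum(g[i] * Hg[i] for i in range(d))     # (3.1): g^T g / (g^T H g)
    return [w[i] - eta * g[i] for i in range(d)], eta
```

For A = diag(2, 4), b = (2, 4), and w = (0, 0), the gradient is (−2, −4), giving η = 20/72 and a guaranteed decrease of the quadratic loss along −g.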
D RELATED WORKS

D.1 HEURISTIC LEARNING RATE SCHEDULER

A proper learning rate is important for fast convergence, and there is a long line of work on searching for the best hyperparameters, especially the learning rate. In general, the learning rate should decay to 0 as the training iteration increases, because mini-batch sampling does not guarantee that SGD converges, even in the convex setting. As a consequence of this theoretical insight, different learning rate schedulers have been devised, albeit usually based on empirical evidence.

Table 8: Learning rate schedulers for training with T iterations. HP means hyperparameter.

| scheduler | η_t form | HP | # HP | reference |
| Constant | c_0 | c_0 | 1 | Raffel et al. (2020) |
| Cosine decay | c_0(1 + cos(tπ/T))/2 | c_0 | 1 | Loshchilov & Hutter (2016); Radford et al. (2021) |
| Stepwise | c_k if T_{k−1} < t < T_k | {c_k, T_k}_{k≤K} | 2K + 1 | |
| Linear decay | c_0(1 − t/T) | c_0 | 1 | Smith (2015) |
| Polynomial decay | c_0/t or c_0/√t | c_0 | 1 | |
| Exponential decay | c_0 exp(−pt) | c_0, p | 2 | |
| Warm-up | c_min + t(c_0 − c_min)/p | c_0, c_min, p | 3 | Goyal et al. (2017) |

In deep learning, state-of-the-art performance is usually obtained by a multi-hyperparameter scheduler: for instance, LLAMA Touvron et al. (2023a) combines linear warm-up and cosine decay (not decaying to 0 but to a minimum value); ResNet He et al. (2016) uses the stepwise scheduler (starting at η = 0.1 and decaying by 10× at 32k and 48k iterations) with 5 hyperparameters. While tuning multiple hyperparameters may be feasible for some models, it is generally expensive for large models (say, over 1B parameters).

D.2 AUTOMATIC LEARNING RATE

A number of automatic or parameter-free learning rate methods have attempted to dynamically adjust the learning rate throughout the training. D-adaptation Defazio & Mishchenko (2023) and Prodigy Mishchenko & Defazio both improve on AdaGrad by characterizing ||w_0 − w*||, where w* is the global minimum.
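The closed-form schedulers in Table 8 can be written down directly; a brief sketch (function names are ours):

```python
import math

def cosine_decay(c0, t, T):
    """c_0 * (1 + cos(t*pi/T)) / 2"""
    return c0 * (1 + math.cos(t * math.pi / T)) / 2

def linear_decay(c0, t, T):
    """c_0 * (1 - t/T)"""
    return c0 * (1 - t / T)

def stepwise(cs, Ts, t):
    """Return c_k for the first milestone T_k exceeding t (2K+1 hyperparameters)."""
    for ck, Tk in zip(cs, Ts):
        if t < Tk:
            return ck
    return cs[-1]
```

For example, the ResNet schedule mentioned in the text corresponds to stepwise([0.1, 0.01, 0.001], [32000, 48000], t).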
However, this motivation is not well-defined in deep learning, since the minimum is not unique. In fact, in all our computer vision experiments, D-adapt SGD diverges with even a small magnitude of weight decay. Also, Prodigy is only implemented for Adam, hence not future-proof for more advanced algorithms such as Sophia. DoG Ivgi et al. (2023) and DoWG Khaled et al. (2023) are parameter-free dynamic SGD schedulers with convergence guarantees for stochastic convex optimization. Additionally, these methods only work for SGD but not for adaptive optimizers like Adam, and the algorithms require storing the past iterates, because the output is the weighted average Σ_t c_t w_t (see Equation 2 of Ivgi et al. (2023) and Theorem 1 of Khaled et al. (2023)) instead of the last iterate w_T. As the authors admit on their Github, "DoG must be combined with iterate averaging", which adds a memory cost that may be unacceptable for large models. In practice, we have observed these methods to be very unstable and inaccurate, e.g. on toy tasks like fine-tuning on CIFAR10/100.

D.3 SECOND-ORDER METHODS

There is a long list of second-order or quasi-second-order methods in the optimization literature, including but not limited to stochastic Newton methods Byrd et al. (2016); Wang et al. (2017); Lucchi et al. (2015); Schraudolph et al. (2007), sketching methods Yuan et al. (2022); Luo et al. (2016), and the diagonal Hessian methods in Section 3.2. We note that these methods are difficult to implement in distributed learning at large scale, because the Hessian information (or Hessian-vector product) is generally not supported by auto-differentiation, or too expensive to store for large models.

D.4 PRIOR WORKS SIMILAR TO GEN

We elaborate on multiple prior works that also present GeN-SGD in (3.1) and leverage the second-order Taylor approximation in (2.3).
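To make the memory point concrete, weighted iterate averaging maintains one extra parameter-sized buffer alongside the model. A sketch under our own naming (the weights c_t are method-specific):

```python
def update_weighted_average(w_bar, c_sum, w_t, c_t):
    """Running form of (sum_s c_s w_s) / (sum_s c_s): one extra d-dim buffer."""
    c_new = c_sum + c_t
    w_bar = [(c_sum * wb + c_t * wt) / c_new for wb, wt in zip(w_bar, w_t)]
    return w_bar, c_new
```

For a billion-parameter model this buffer doubles the optimizer's parameter memory, which is the cost the text refers to.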
However, these methods are limited to g^optim = g_SGD and do not generalize to pre-conditioned gradients like those in AdamW. As a consequence, the role of gradient pre-conditioning is under-investigated, and the connection to the generalized inverse in (3.3) and to Newton's method in Proposition 3.3 is lacking. Importantly, a missing piece in these works is the validation of the second-order Taylor approximation of L(w − ηg). Our empirical visualization in Figure 2 serves as a necessary validation, without which the method and algorithms are not justified in the first place (see footnote 7). Experiment-wise, prior works focus on computer vision and convolutional neural networks, whereas we have extensively tested various model architectures (transformers, GANs, recommendation systems) over many tasks, such as language modeling.

LQA. The closest work to ours is LQA Zhu et al. (2021), which proposes (3.5). At the method level, LQA-SGD is equivalent to GeN-SGD, but it is not extended to general optimizers: in their Figures 1-4, where the red curves in each figure are exactly the same, LQA-SGD is compared to SGD, SGD-momentum, and Adam, but there is no LQA-Adam. Additionally, the convergence speed of LQA-SGD (or GeN-SGD) is missing, whereas we provide an analysis under the convex Lipschitz regime in Appendix E. At the algorithm level, LQA relies on estimating the losses {L(w − ηg), L(w), L(w + ηg)}:

η_LQA = (η/2) · (L(w + ηg) − L(w − ηg)) / (L(w + ηg) − 2L(w) + L(w − ηg)).

This fixed choice of learning rates gives a closed form but could limit the general applicability when one considers other learning rates (see our discussion below (3.5)). The estimation error of LQA was not analyzed. In contrast, our Proposition 3.6 indicates the benefit of a large batch size, which is empirically verified in Figure 3.
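The three-point LQA estimate can be sketched as follows (our reconstruction of the closed form, under our own naming; it is exact when L is quadratic along the gradient direction):

```python
def lqa_lr(eta, loss_minus, loss_zero, loss_plus):
    """eta_LQA = (eta/2) * (L(w+eta*g) - L(w-eta*g)) / (L(w+eta*g) - 2*L(w) + L(w-eta*g)).
    Arguments: loss_minus = L(w - eta*g), loss_zero = L(w), loss_plus = L(w + eta*g)."""
    return 0.5 * eta * (loss_plus - loss_minus) / (loss_plus - 2 * loss_zero + loss_minus)
```

Sanity check: for L(w) = (w − 3)² at w = 0 we have g = −6, and with trial rate η = 0.1 the three losses are L(0.6) = 5.76, L(0) = 9, and L(−0.6) = 12.96; the formula returns 0.5, which is the exact optimal step since w − 0.5g = 3 is the minimizer.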
Additionally, we adopt critical techniques in Remark 3.5, such as smoothing and the lazy frequency, to stabilize the training and enhance the computational efficiency; these are not present in LQA.

QLABGrad. While LQA and GeN require 2 extra forward passes, QLABGrad Fu & Wu (2024) only requires 1 extra forward pass, by restricting to SGD (hence incompatible with AdamW and other adaptive optimizers):

η_QLABGrad = η² ||∇L||² / (2(L(w − ηg) − L(w) + η ||∇L||²)).

QLABGrad provides a convergence analysis in terms of the gradient norm, i.e. ||∇L_t|| → 0, whereas our convergence analysis in Appendix E is in terms of the loss, i.e. L_t − L* → 0.

ADLER. ADLER Balboni & Bacciu (2023) uses the Hessian-vector product to compute H_t g_t (viewed as a sub-case of Algorithm 2). We note that the Hessian-vector product is not only slow, but also not yet supported in distributed learning systems. Nevertheless, ADLER indeed implements (3.2) without approximating η_GeN, whereas the estimation error must be considered for the other algorithms, since η_LQA and η_QLABGrad only approximate η_GeN.

D.5 APPLYING GEN TO MORE PROBLEMS

We expect general applicability of GeN to problems beyond the formulation of (2.1). Examples include differentially private optimization Liu & Bu, multi-task optimization Jin et al. (2025), gradient ascent, projected/proximal gradient descent, reinforcement learning, etc.

E CONVERGENCE ANALYSIS OF GEN

We provide the convergence analysis of GeN in multiple dimensions, given that the 1-dimensional GeN is equivalent to the classic Newton-Raphson method. We work with the convex and Lipschitz conditions, the same conditions under which Prodigy and D-adaptation are proven to converge.

Footnote 7: In fact, Figure 1 in Fu & Wu (2024) contradicts their own motivation in Equation (4), which is the Taylor expansion at η = 0, since there is no quadratic curve for η ∈ [0, α].
We note that, similar to the proof of Newton's method, we also need to bound the third-order derivative, so that GeN can enjoy the quadratic convergence rate.

Theorem 1. For a convex and G-Lipschitz loss L, assuming H(x) ≠ 0 for all x (i.e. no completely flat point), GeN-GD in (3.2) with ||g|| ≤ G gives

||w_{t+1} − w*|| ≤ M ||w_t − w*||²  and  L_t − L* ≤ G ||w_t − w*|| ≤ G M^{2^t − 1} ||w_0 − w*||^{2^t},

where w* is the minimum of L, M = sup_x ||κ(x)|| · sup_x 1/||H(x)||, and κ is the third-order derivative of L. Furthermore, if M ||w_0 − w*|| < 1, then we have the quadratic convergence w_t → w* as t → ∞.

Proof. Denote e_t = w_t − w*. Then

w_{t+1} = w_t − (G_t^T G_t / G_t^T H_t G_t) G_t  ⟹  e_{t+1} = e_t − (G_t^T G_t / G_t^T H_t G_t) G_t.

A Taylor expansion of the gradient at w_t leads to

G(w*) = G(w_t) − H(w_t) e_t + κ(ξ_t)[e_t] e_t,

where G(w*) = 0 ∈ R^d and κ(ξ_t) = ∇³L(ξ_t) = ∇H(ξ_t) ∈ R^{d×d×d} is the third-order remainder. We have denoted the directional derivative κ(x)[e_t] = lim_{h→0} (H(x + h e_t) − H(x))/h.

Left-multiplying by G_t^T = G(w_t)^T gives

G_t^T H_t e_t − G_t^T G_t = G_t^T κ(ξ_t)[e_t] e_t.

Multiplying by the generalized inverse, i.e. dividing by the scalar G_t^T H_t G_t, gives

e_{t+1} = e_t − G_t (G_t^T G_t) / (G_t^T H_t G_t) = G_t (G_t^T κ(ξ_t)[e_t] e_t) / (G_t^T H_t G_t).

By the matrix norm inequality, we obtain ||e_{t+1}|| ≤ M ||e_t||², i.e. the quadratic convergence in the parameter space. Furthermore, the Lipschitz condition allows the quadratic convergence in the loss space.
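The quadratic rate in Theorem 1 can be observed numerically. Since 1-dimensional GeN-GD reduces to Newton-Raphson, the following toy check (our own example, L(w) = w² + w⁴ with minimizer w* = 0 and H(w*) = 2 ≠ 0) shows the error shrinking quadratically:

```python
def newton_step(w):
    """One 1-d GeN-GD / Newton step on L(w) = w^2 + w^4."""
    grad = 2 * w + 4 * w ** 3
    hess = 2 + 12 * w ** 2
    return w - grad / hess

w, errors = 1.0, []
for _ in range(4):
    w = newton_step(w)
    errors.append(abs(w))  # e_t = |w_t - w*| with w* = 0
```

Starting from w_0 = 1, four steps already drive the error below 10^{-3}, and each error is bounded by a constant times the square of the previous one, matching ||e_{t+1}|| ≤ M ||e_t||².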