# HVAdam: A Full-Dimension Adaptive Optimizer

Yiheng Zhang1*, Shaowu Wu1*, Yuanzhuo Xu1, Jiajun Wu2, Shang Xu3, Steve Drew2, Xiaoguang Niu1
1 School of Computer Science, Wuhan University
2 Department of Electrical and Software Engineering, University of Calgary
3 Department of Computer Science, University College London
{gjryhxt,wshaow,xyzxyz,xgniu}@whu.edu.cn, {jiajun.wu1, steve.drew}@ucalgary.ca, shang.xu.24@ucl.ac.uk

Adaptive optimizers such as Adam and RMSProp have gained traction in complex neural networks, including generative adversarial networks (GANs) and Transformers, thanks to their stable performance and fast convergence compared to non-adaptive optimizers. A frequently overlooked limitation of adaptive optimizers is that adjusting the learning rate of each dimension individually ignores knowledge of the whole loss landscape, resulting in slow parameter updates, invalidating the learning-rate adjustment strategy, and eventually leading to widespread insufficient convergence of parameters. In this paper, we propose HVAdam, a novel optimizer that associates all dimensions of the parameters to find a new parameter update direction, leading to a refined parameter update strategy and an increased convergence rate. We validate HVAdam in extensive experiments, showing its faster convergence, higher accuracy, and more stable performance on image classification, image generation, and natural language processing tasks. In particular, HVAdam achieves a significant improvement on GANs compared with other state-of-the-art methods, especially on Wasserstein GAN (WGAN) and its improved variant with gradient penalty (WGAN-GP).

## Introduction

Optimizers play a crucial role in machine learning, efficiently minimizing loss and achieving generalization.
There are two main types of state-of-the-art optimizers. Non-adaptive optimizers, for instance stochastic gradient descent (SGD) (Robbins and Monro 1951), use a global learning rate for all parameters. In contrast, adaptive optimizers, such as RMSProp (Graves 2013) and Adam (Kingma and Ba 2014), use the partial derivative of each parameter to adjust its learning rate separately. When applied correctly, adaptive optimizers can significantly accelerate the convergence of parameters, even in cases where partial derivatives are minimal. These adaptive optimizers are crucial for overcoming saddle points or navigating regions characterized by small partial derivatives, which can otherwise hinder the convergence process (Xie et al. 2022).

*These authors contributed equally. Corresponding Author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: A typical example of the valley dilemma. (a) and (c) depict the trajectories of SGD, Adam, AdaBelief, and HVAdam in 2D and 3D plots. (b) is a close-up of the red box area in (a), showing the slow convergence and zigzagging behavior of Adam, AdaBelief, and SGD; HVAdam, by contrast, converges rapidly along the hidden-vector direction.

In deep learning, model parameters often change significantly during training, and these changes frequently reverse direction, especially when the model approaches its optimal performance (near the minimum of the loss function). Adam dynamically adjusts the learning rate according to the magnitude of the gradients, while SGD incorporates momentum to smooth the gradient and does not directly modify the learning rate. AdaBelief (Zhuang et al. 2020), a variant of Adam, adjusts the learning rate based on the deviation of the gradient from its moving average.
However, these approaches can be problematic in complex landscapes, such as the valley-like regions that are common in the loss functions of deep learning models (Zhuang et al. 2020). In such regions, the true direction of optimization often has a small gradient, making it difficult for traditional optimizers to identify and efficiently follow. Traditional optimizers tend to oscillate along the valley sides, where gradients are larger, leading to slow and inefficient convergence. We refer to this phenomenon as the Valley Dilemma: optimization fails to converge effectively because of the challenging landscape, as shown in Figure 1.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 2: Trajectories of SGD, Adam, AdaBelief, and HVAdam on the functions f1, f2, f3, and f4 ((a) to (d)), defined in Sec. E of the supplementary material. HVAdam reaches the optimal point (marked as an orange cross in the 2D and 3D plots) the fastest in all cases.

To further demonstrate the Valley Dilemma, we illustrate its effect on optimizers using two more common loss functions ((c) and (d) in Figure 2). In valley-like regions, traditional optimizers face several key challenges:
1. Identifying the update direction: In valley-like regions, the true optimal direction (along the valley floor) often has a very small gradient, making it difficult for traditional optimizers to detect and follow, as shown in Figure 1.
2. Changing landscapes: As training progresses, the direction of optimization can change, as shown in Figure 2(b)(c), especially in complex and dynamic landscapes.
3. Adjusting the learning rate: Since the stable update direction incorporates trend information, the learning rate needs to be adjusted accordingly. To help convergence in the other directions, a more reasonable learning-rate adjustment strategy should be designed for them based on this trend information.
To address the aforementioned challenges, we propose the hidden-vector-based adaptive gradient method (HVAdam), an adaptive optimizer that considers the full dimensions of the parameters. With the help of the hidden vector, HVAdam can determine the stable update direction, which enables faster convergence along the descending trajectory of the loss function through an increased learning rate, compared to the traditional zigzag approach. Additionally, HVAdam employs a restart strategy to adapt to changes of the hidden vector, namely the changing direction of the landscape. Lastly, HVAdam incorporates information from all dimensions to adjust the learning rate for each dimension, using a new preconditioning matrix based on the hidden vector. The contributions of our paper can be summarized as follows:
- We propose HVAdam, which constructs a vector that approximates the invariant components within the gradients, namely the hidden vector, to more effectively guide parameter updates as a solution to the Valley Dilemma.
- We demonstrate HVAdam's convergence properties under online convex and non-convex stochastic optimization, emphasizing its efficacy and robustness.
- We empirically evaluate the performance of HVAdam, demonstrating significant improvements across image classification, NLP, and GAN tasks.

## Background and Motivation

### Notations

- $f(\theta) \in \mathbb{R}$, $\theta \in \mathbb{R}^d$: $f(\theta)$ is the scalar-valued function to minimize; $\theta$ is the parameter in $\mathbb{R}^d$ to be optimized.
- $Q_{\mathcal{F},S}(y) = \arg\min_{x \in \mathcal{F}} \lVert S^{1/2}(x - y)\rVert$: the projection of $y$ onto the convex feasible set $\mathcal{F}$.
- $g_t \in \mathbb{R}^d$: the gradient at step $t$.
- $m_t \in \mathbb{R}^d$: the exponential moving average (EMA) of $g_t$.
- $v_t \in \mathbb{R}^d$: the hidden vector, calculated from $v_{t-1}$ and $m_t$.
- $h_t \in \mathbb{R}$: the EMA of $m_t^2$.
- $p_t \in \mathbb{R}^d$: $p_t = (g_t - v_{t-1})^2$; an intermediate variable.
- $\eta_t \in \mathbb{R}^d$: the factor measuring the size of $p_t$.
- $a_t, s_t \in \mathbb{R}^d$: the EMAs of $g_t^2$ and $\eta_t p_t$, respectively.
- $\alpha_1, \alpha_2, \gamma \in \mathbb{R}$: $\alpha_1$ is the unadjusted learning rate for $m_t$; $\alpha_2$ is the unadjusted learning rate for $v_t$; $\gamma$ is a constant limiting the value of $\eta_t$, usually set to 0.
These are hyperparameters.

- $\delta_{t_2} \in \mathbb{R}$: the factor used to adjust $\alpha_2$.
- $\epsilon \in \mathbb{R}$: a small number used to avoid division by zero.
- $\mathrm{lr}(\delta_{t_2}, \hat{\delta}_{t_2}): \mathbb{R} \times \mathbb{R} \to \mathbb{R}$: equal to $10^{6\delta_{t_2} - 3}$ if $\hat{\delta}_{t_2} \ge 0.1$, and $0$ otherwise. The function can be replaced by other, more suitable choices; we do not change it in the rest of the paper unless otherwise specified.
- $\beta_1, \beta_2 \in \mathbb{R}$: the EMA hyperparameters, $0 \le \beta_1, \beta_2 < 1$, typically set to 0.9 and 0.999.

### Adaptive Moment Estimation (Adam)

Algorithm 1: Adam Optimizer

```
Input: α1, β1, β2, ε
Initialize θ0, m0 ← 0, a0 ← 0, t ← 0
 1: while θt not converged do
 2:   t ← t + 1
 3:   gt ← ∇θ ft(θt−1)
 4:   mt ← β1 mt−1 + (1 − β1) gt
 5:   at ← β2 at−1 + (1 − β2) gt²
 6:   // Bias correction
 7:   m̂t ← mt / (1 − β1^t),  ât ← at / (1 − β2^t)
 8:   // Update
 9:   θt ← Q_{F,√ât}( θt−1 − α1 m̂t / (√ât + ε) )
10: end while
```

Adam integrates the merits of the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp) by adaptively adjusting learning rates based on estimates of the first and second moments of the gradients, making it exceptionally suitable for large-scale and complex machine learning problems. The algorithm is summarized in Algorithm 1; all operations are element-wise. In Algorithm 1, $\beta_1$ and $\beta_2$ are hyperparameters representing the exponential decay rates of the moving averages of past gradients and squared gradients, respectively. The learning rate is denoted by $\alpha_1$, and $\epsilon$ is a small constant added to the denominator to ensure numerical stability. Adam adapts the learning rate for each parameter, and its straightforward implementation has made it a popular choice in deep learning, particularly when working with large datasets or high-dimensional spaces.

### Problems and Motivation

Traditional Adam and its variants do not effectively solve the valley dilemma. Adam adjusts the learning rate for each parameter. In regions with small partial derivatives or at saddle points, the gradient $g_{t,i}$ at step $t$ is small, leading to a small $a_{t,i}$, since it is the EMA of $g_{t,i}^2$.
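The per-coordinate mechanics described above can be made concrete with a minimal NumPy sketch of one Adam step; the function and variable names are ours, and the projection $Q_{\mathcal{F},\cdot}$ is omitted, as is common in unconstrained training.

```python
import numpy as np

def adam_step(theta, grad, m, a, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One element-wise Adam update (Algorithm 1), without the projection step."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (line 4)
    a = beta2 * a + (1 - beta2) * grad ** 2   # EMA of squared gradients (line 5)
    m_hat = m / (1 - beta1 ** t)              # bias correction (line 7)
    a_hat = a / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(a_hat) + eps)
    return theta, m, a
```

The per-coordinate denominator $\sqrt{\hat a_t}$ is what makes the step large where gradients are small and small where they are large; this is exactly the behavior the valley dilemma undermines.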
As $a_{t,i}$ appears in the denominator, Adam takes a relatively large step in the $\theta_{t,i}$ direction due to this small denominator. This adjustment strategy is effective for escaping saddle points or regions with small partial derivatives, because it enables larger updates in such scenarios, which is critical for maintaining the momentum of the optimizer (Xie et al. 2022). However, in the valley dilemma, all related $g_{t,i}$ can be large, so the corresponding $a_{t,i}$ will also be large. This results in small update step sizes for all parameters, including in directions where acceleration is needed for effective updating. This direction is therefore hidden from Adam. Adam adjusts the learning rates for each parameter separately, which allows it to perform well on problems where parameter updates are primarily aligned with the coordinate axes, but this approach is less effective for problems that require significant updates in non-axis-aligned directions. This limitation highlights that Adam's learning-rate adjustment strategy is most effective for problems that are close to "axis-aligned", as discussed in (Balles and Hennig 2018). Most other first-order adaptive optimizers face similar challenges. Therefore, in the context of the non-axis-aligned valley dilemma, the critical challenge is to identify this hidden direction, which we refer to as the hidden vector. We provide specific examples to clarify our intuitive explanation of the valley dilemma and the hidden vector. As shown in Figure 2, panels (a) and (b) depict two typical valley functions. The hidden vector, corresponding to the gradient at the bottom of the valley, represents the intersection line of two planes. HVAdam moves most rapidly along the direction of the hidden vector and achieves the fastest convergence. Figure 2(a) illustrates a typical coordinate-aligned valley problem, where AdaBelief also converges quickly.
However, for the non-axis-aligned valley problem represented in Figure 2(b), Adam's convergence is relatively slow. We further generalize the concept of the hidden vector to functions where the hidden vector changes frequently, using a restart strategy, and we verify the convergence performance of HVAdam in more general cases, as shown in Figure 2(c)(d), where HVAdam continues to demonstrate good convergence. These examples provide insight into the local behavior of optimizers in deep learning: their behavior reflects common patterns seen in deep networks, such as ReLU activations, neuron connections, cross-entropy loss, and smooth activations (Zhuang et al. 2020). Further details on the analysis of these examples are available in Sec. E of the supplementary material.

| Step | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $g_x$ | 5 | -3 | 5 | -3 | 5 |
| $g_y$ | -3 | 5 | -3 | 5 | -3 |
| $\hat{m}_x$ | 5 | 0.7895 | 2.3432 | 0.7895 | 1.8177 |
| $\hat{m}_y$ | -3 | -1.2105 | -0.3432 | 1.2105 | 0.1823 |
| $v_x$ | 5 | 1 | 1 | 1 | 1 |
| $v_y$ | -3 | 1 | 1 | 1 | 1 |
| $v^*_x$ | 1 | 1 | 1 | 1 | 1 |
| $v^*_y$ | 1 | 1 | 1 | 1 | 1 |

Table 1: Optimization process for the example function $f(x, y) = 4|x - y| + |x + y|$. Our algorithm needs only two steps to obtain the hidden vector $v^*$.

We propose HVAdam, a first-order, full-dimension optimizer designed to address the non-axis-aligned valley dilemma. It not only solves the valley dilemma but also proves effective for general deep learning optimization. Specifically, we obtain a hidden vector using gradient projection, which represents the stable gradient trend of the loss function. By employing the restart strategy, we extend this approach to situations where the hidden vector may change over time. Finally, we enhance HVAdam's effectiveness through a hidden-vector-based preconditioning matrix adjustment strategy, in which the hidden vector is used to adjust the learning rate. The HVAdam algorithm is summarized in Algorithm 2; all operations are element-wise, except for the inner product $\langle \cdot, \cdot \rangle$ and the norm $\lVert \cdot \rVert$. The proof of the optimizer's convergence is given in Sec. C and Sec. D of the supplementary material.
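Before the full algorithm, the projection at the core of HVAdam (Eq. (1) and Eq. (2) below) can be sketched in a few lines of NumPy. The function name is ours, and for illustration we feed it the raw gradients of the two valley walls of $f(x, y) = 4|x - y| + |x + y|$, namely $(5, -3)$ and $(-3, 5)$, instead of the bias-corrected EMA $\hat m_t$ used in the algorithm.

```python
import numpy as np

def hidden_vector_update(v_prev, m_hat, eps=1e-12):
    """Update the hidden-vector estimate: move v to the foot of the
    altitude from the origin onto the line through v_prev and m_hat."""
    diff = v_prev - m_hat
    denom = float(np.dot(diff, diff))
    if denom < eps:                       # v_{t-1} == m_hat_t, so k_t = 0
        return v_prev.copy()
    k = np.dot(diff, v_prev) / denom      # Eq. (1)
    return k * m_hat + (1 - k) * v_prev   # Eq. (2)

# One update from the two wall gradients recovers the valley-floor
# direction v* = (1, 1) of f(x, y) = 4|x - y| + |x + y|.
v = hidden_vector_update(np.array([5.0, -3.0]), np.array([-3.0, 5.0]))
```

The resulting $v$ is orthogonal to $\hat m_t - v_{t-1}$, which is why the iterates in Table 1 settle on $v^* = (1, 1)$ after only two steps.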
Algorithm 2: HVAdam Optimizer

```
Input: α1, α2, β1, β2, ε, γ
Initialize θ0, m0 ← 0, s0 ← 0, v0 ← 0, t ← 0, t2 ← −1, δ0 ← 0
 1: while θt not converged do
 2:   t ← t + 1
 3:   t2 ← t2 + 1
 4:   gt ← ∇θ ft(θt−1)
 5:   mt ← β1 mt−1 + (1 − β1) gt
 6:   pt ← (gt − vt−1)²
 7:   ηt ← pt / ((gt − mt)² + γ pt + ε)
 8:   st ← β2 st−1 + (1 − β2) ηt pt + ε
 9:   // Bias correction
10:   m̂t ← mt / (1 − β1^t),  ŝt ← st / (1 − β2^t)
11:   if t2 ≠ 0 then
12:     if vt2−1 = m̂t then
13:       kt ← 0
14:     else
15:       kt ← ⟨vt2−1 − m̂t, vt2−1⟩ / ‖vt2−1 − m̂t‖²
16:     end if
17:     vt2 ← kt m̂t + (1 − kt) vt2−1
18:     δt2 ← β2 δt2−1 + (1 − β2) cos⟨vt2, m̂t⟩
19:     δ̂t2 ← δt2 / (1 − β2^t2)
20:     bt ← lr(δt2, δ̂t2)
21:     if bt = 0 then
22:       t2 ← −1
23:     end if
24:   else
25:     v0 ← m̂t, δ0 ← 0
26:   end if
27:   // Update
28:   θt ← Q_{F,√ŝt}( θt−1 − α1 m̂t / (√ŝt + ε) − α2 bt vt2 )
29: end while
```

### Hidden Vector

$$k_t := \begin{cases} \dfrac{\langle v_{t-1} - \hat{m}_t,\; v_{t-1} \rangle}{\lVert v_{t-1} - \hat{m}_t \rVert^2}, & \text{if } v_{t-1} \neq \hat{m}_t, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

$$v_t := k_t \hat{m}_t + (1 - k_t) v_{t-1}. \tag{2}$$

Figure 3: Consider $f(x, y) = 4|x - y| + |x + y|$. The figure shows how we make $v_t$ approximate $v^*$ for this function.

To obtain the hidden vector of the loss function, we analyze the relationship between the hidden vector of the bivariate function $f(x, y) = 4|x - y| + |x + y|$ in its valley region and the corresponding vectors $v_t$ and $\hat{m}_t$. The process is illustrated in Table 1 and Figure 3. We observe that the foot of the altitude of the triangle formed by the edges $v_{t-1}$ and $\hat{m}_t$ can be used to update $v_t$, leading to its convergence to the hidden vector $v^*$. The update process is formulated in Eq. (1) and Eq. (2). We extend the algorithm to higher dimensions and prove its convergence in Sec. B of the supplementary material. The update of the hidden vector is detailed in lines 12-17 of Algorithm 2.

### Restart Strategy

We can now calculate $v_t$ through Eq. (1) and Eq. (2); however, $v_t$ only gradually converges to $v^*$ over time. It is therefore necessary to measure the current convergence rate of $v_t$, and then to update the parameters in the direction of $v_t$ according to this convergence rate.
Considering that the moving average $\hat{m}_t$ can reflect the regional trend of the loss function, we use the cosine similarity between $v_t$ and $\hat{m}_t$ to measure the convergence rate of $v_t$. Although $\hat{m}_t$ can roughly reflect the current regional trend, it is unstable and therefore cannot replace $v_t$. To make the results more stable, we use the moving average of the cosine similarity as the index (line 18 of Algorithm 2):

$$\delta_{t_2} := \beta_2 \delta_{t_2 - 1} + (1 - \beta_2)\cos\langle v_{t_2}, \hat{m}_t \rangle. \tag{3}$$

Furthermore, $\hat{m}_t$ adapts to any change between different local trends, while $v_t$ cannot automatically adjust itself. Whenever the difference between $\hat{m}_t$ and $v_t$ becomes too large, it indicates that the regional trend is changing, which requires restarting the calculation of $v_t$ at time $t$. Considering that the initialized values of $v_t$ and $\hat{m}_t$ do not represent an accurate estimate of the current trend, we introduce the bias-corrected estimate of $\delta_{t_2}$ as the criterion for deciding whether a restart is needed:

$$\hat{\delta}_{t_2} := \frac{\delta_{t_2}}{1 - \beta_2^{t_2}}. \tag{4}$$

A larger $\delta_{t_2}$ means that the direction of $v_t$ is closer to that of $\hat{m}_t$, and vice versa; a larger step can therefore be taken in the direction of $v_t$, while a small $\delta_{t_2}$ indicates less confidence. When $\delta_{t_2}$ is extremely small, it implies that the hidden vector $v^*$ has changed, so we restart the calculation of $v_t$. Finally, we obtain the step size $b_t$ from $\delta_{t_2}$, which represents the convergence rate of $v_t$. $\mathrm{lr}(\cdot)$ should be an increasing function capable of covering a wide range of magnitudes; after some experiments, we empirically select Eq. (5), following the same process as in (Luo et al. 2019). For the restart decision, we use $\hat{\delta}_{t_2}$ as the criterion: when $\hat{\delta}_{t_2}$ falls below the threshold of 0.1, it indicates a significant deviation between $v_t$ and $\hat{m}_t$, and we recompute $v_t$ starting from the current step $t$. The value 0.1 is selected empirically.
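The confidence measure of Eq. (3), its bias correction in Eq. (4), and the restart test amount to only a few lines; this is a minimal sketch with names of our own choosing (in Algorithm 2, a triggered restart additionally resets the counter $t_2$).

```python
import numpy as np

def cosine(u, w, eps=1e-12):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + eps))

def confidence_step(delta_prev, v, m_hat, t2, beta2=0.999):
    delta = beta2 * delta_prev + (1 - beta2) * cosine(v, m_hat)  # Eq. (3)
    delta_hat = delta / (1 - beta2 ** t2)                        # Eq. (4)
    restart = delta_hat < 0.1   # empirically chosen threshold from the paper
    return delta, delta_hat, restart
```

When $v_t$ and $\hat m_t$ point the same way the bias-corrected estimate stays near 1 and no restart fires; when they disagree strongly it drops below the threshold and the hidden-vector calculation starts over.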
$$b_t = \mathrm{lr}(\delta_{t_2}, \hat{\delta}_{t_2}) := \begin{cases} 10^{6\delta_{t_2} - 3}, & \text{if } \hat{\delta}_{t_2} \ge 0.1, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

The pseudocode for the restart strategy corresponds to lines 18-23 of Algorithm 2, where $t_2$ indicates whether a restart is needed.

### Hidden-vector-based preconditioning matrix adjustment strategy

A crucial element of adaptive optimizers is the preconditioning matrix, which incorporates information about the gradient and controls the step size in each direction of the gradient (Yue et al. 2023). To adjust the learning rate based on the stable trend information obtained from the hidden vector $v_t$, we measure the difference between the gradient at the current position and the trend of the current region:

$$p_t := (g_t - v_{t-1})^2. \tag{6}$$

This difference indicates the magnitude of the noise along each coordinate direction. The magnitude of the noise determines the step size of parameter updates in that direction: large noise corresponds to small steps, and small noise corresponds to large steps. However, $p_t$ only represents the absolute magnitude of the noise; for different orders of magnitude of $g_t$, it is necessary to incorporate the relative magnitude of $p_t$ in each dimension into the design of the learning-rate adjustment strategy. Therefore, we introduce Eq. (7) to extract the relative magnitude of the noise in each dimension:

$$\eta_t := \frac{p_t}{(g_t - m_t)^2 + \gamma p_t + \epsilon}. \tag{7}$$

We obtain the relative magnitude of the difference between $g_t$ and $v_t$ by comparing it to the difference between $g_t$ and $m_t$. To reduce the impact of abnormal data that would make the denominator too small, we add $\gamma p_t$; $\epsilon$ is used to avoid a zero denominator. Finally, the relative noise magnitude $\eta_t p_t$ constitutes the adjustment factor of the learning rate:

$$s_t := \beta_2 s_{t-1} + (1 - \beta_2)\eta_t p_t + \epsilon. \tag{8}$$

The hidden-vector-based preconditioning matrix adjustment strategy corresponds to lines 6-8 of Algorithm 2.
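A minimal sketch of the preconditioner, Eqs. (6)-(8), together with the step-size factor $b_t$; the exponential form $10^{6\delta - 3}$ follows our reading of Eq. (5), and all names are our own.

```python
import numpy as np

def lr_factor(delta, delta_hat):
    # Eq. (5): zero (triggering a restart) below the 0.1 confidence
    # threshold, otherwise a factor spanning several orders of magnitude
    return 10.0 ** (6.0 * delta - 3.0) if delta_hat >= 0.1 else 0.0

def precondition_step(s_prev, g, m, v_prev, beta2=0.999, gamma=0.0, eps=1e-8):
    p = (g - v_prev) ** 2                             # Eq. (6): absolute noise
    eta = p / ((g - m) ** 2 + gamma * p + eps)        # Eq. (7): relative noise
    s = beta2 * s_prev + (1 - beta2) * eta * p + eps  # Eq. (8)
    return s
```

When the gradient agrees with the hidden vector ($g_t \approx v_{t-1}$), $p_t \approx 0$ and $s_t$ stays near $\epsilon$, so the effective step along that coordinate is large; strong disagreement inflates $s_t$ and shrinks the step.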
If $p_t$ is large, there is a significant difference between the gradient and the projection of $v_t$ on the parameters. In this case, $s_t$, the EMA of $\eta_t p_t$, appears in the denominator, making the effective learning rate small and the update more cautious. If $p_t$ is small, the update should be accelerated, and the small $s_t$ makes the effective learning rate large. The denominators of Adam and AdaBelief have narrow value ranges: Adam's $a_t$ lies in $(0, \max g_t^2)$ and AdaBelief's in $(0, \max (2g_t)^2)$, and the range of the denominator determines the adjustment range of $\alpha_1$. For HVAdam, we multiply $p_t$ by $\eta_t$, using $(g_t - m_t)^2$ to gauge $p_t$: if the latter is larger than the former, $\eta_t$ decreases; otherwise, $\eta_t$ increases. The formula shows that the value range of $\eta_t$ is $(0, 1/\gamma)$, which widens the adjustment range so that an optimal learning rate can be reached.

## Validation on Tasks in Deep Learning

We compare HVAdam with 13 baseline optimizers in various experiments: SGD (Sutskever et al. 2013), Adam, AdamW (Loshchilov and Hutter 2017), Yogi (Zaheer et al. 2018), AdaBound (Luo et al. 2019), RAdam (Liu et al. 2019), Fromage (Bernstein et al. 2020), RMSProp, SWA (Izmailov et al. 2018), Lookahead (Zhang et al. 2019), AdaBelief, Adai (Xie et al. 2022), and Lookaround (Zhang et al. 2023). The experiments include: (a) image classification on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, Hinton et al. 2009) with VGG (Simonyan and Zisserman 2014), ResNet (He et al. 2016), and DenseNet (Huang et al. 2017); (b) natural language processing with LSTMs (Ma et al. 2015) on the Penn Treebank dataset (Marcus, Santorini, and Marcinkiewicz 1993) and a Transformer on the IWSLT14 dataset; (c) WGAN (Arjovsky, Chintala, and Bottou 2017), WGAN-GP (Gulrajani et al. 2017), and Spectral-Norm GAN (SN-GAN) (Miyato et al. 2018) on CIFAR-10. The hyperparameter settings and search procedures are given in the supplementary material.
## Experiments for Image Classification

### CNNs on image classification

The experiments are conducted with VGG11, ResNet34, and DenseNet121 on CIFAR-10 and CIFAR-100, using the official implementation of AdaBelief. The accuracy of the other optimizers is taken from (Buvanesh and Panwar 2021). As Figure 4 shows, HVAdam achieves fast convergence comparable to other adaptive methods, and when the training accuracy of several optimizers approaches 100%, HVAdam achieves the highest test accuracy, outperforming the other optimizers. The results show that HVAdam offers both fast convergence and high generalization performance. We then train a ResNet50 on ImageNet (Deng et al. 2009) and report the accuracy on the validation set in Table 2. This experiment uses the official implementation of Lookaround; for the other optimizers, we report the best results from the literature. Due to the heavy computational burden, we were unable to perform an extensive hyperparameter search, but our optimizer still outperforms the other adaptive methods and achieves accuracy comparable to Lookaround (77.22 vs. 77.32), closing the generalization gap between adaptive and non-adaptive methods. These experiments validate the fast convergence and strong generalization performance of HVAdam.
Figure 4: Test accuracies of (a) VGG11, (b) ResNet34, and (c) DenseNet121 on CIFAR-10, and (d) VGG11, (e) ResNet34, and (f) DenseNet121 on CIFAR-100, using different optimizers.

| SGDM | Adam | Adai | SWA | Lookahead | Lookaround | HVAdam |
|---|---|---|---|---|---|---|
| 76.49 | 72.87 | 76.80 | 76.78 | 76.52 | 77.32 | 77.22 |

Table 2: Top-1 accuracy of ResNet50 on ImageNet. Results for the baselines are reported in (Xie et al. 2022) and (Zhang et al. 2023).

### Visual Transformer on image classification

Besides validating on classical CNNs, we also evaluate the performance of HVAdam with the Vision Transformer (ViT) (Dosovitskiy et al. 2021). As Table 3 shows, HVAdam outperforms all other optimizers, indicating that it performs well on both classical and modern architectures.

| | Adam | SWA | Lookahead | Lookaround | HVAdam |
|---|---|---|---|---|---|
| CIFAR-10 | 98.34 | 98.47 | 98.51 | 98.71 | 99.00 |
| CIFAR-100 | 91.55 | 91.32 | 91.76 | 92.21 | 92.38 |

Table 3: Test-set accuracy with ViT-B/16 under different optimizers. Results for the other optimizers are taken from (Zhang et al. 2023).

## Experiments for Natural Language Processing

### LSTMs on language modeling

We experiment with 1-layer, 2-layer, and 3-layer LSTM models on the Penn Treebank dataset; the results are shown in Figure 5. Except for HVAdam, the scores of the other optimizers are provided by (Buvanesh and Panwar 2021). For all LSTM models, HVAdam achieves the lowest perplexity, i.e., the best performance.
The experiments confirm both the fast convergence and the excellent accuracy of HVAdam.

## Experiments for Image Generation

### Generative adversarial networks (GANs) on image generation

Stability is also important for optimizers. As noted in (Salimans et al. 2016), mode collapse and numerical instability can easily derail GAN training, so training GANs reflects the stability of an optimizer: SGD often fails when training GANs, whereas Adam can train them effectively. To assess the robustness of HVAdam, we performed experiments with WGAN, WGAN-GP, and SN-GAN on the CIFAR-10 dataset. The results were generated using the code from the official implementation of AdaBelief and compared with the findings in (Buvanesh and Panwar 2021), where hyperparameters were explored more extensively. We perform 5 runs of each experiment; the comparison results on WGAN, WGAN-GP, and SN-GAN are reported in Figure 6 and Table 4. We also report encouraging results of experiments on a diffusion model (Nichol and Dhariwal 2021) in the supplementary material. HVAdam obtains the lowest FID (Heusel et al. 2017) scores with all GANs, where a lower FID indicates better quality and diversity of the generated images.

Figure 5: Perplexity on Penn Treebank for 1-, 2-, and 3-layer LSTMs (left to right). Lower is better.

| HVAdam | AdaBelief | RAdam | RMSProp | Adam | Fromage | Yogi | SGD | AdaBound |
|---|---|---|---|---|---|---|---|---|
| 12.72 ± 0.21 | 12.98 ± 0.22 | 13.10 ± 0.20 | 12.86 ± 0.08 | 13.01 ± 0.15 | 46.31 ± 0.86 | 14.16 ± 0.05 | 48.94 ± 2.88 | 16.84 ± 0.10 |

Table 4: FID values (μ ± σ) of an SN-GAN with ResNet generator on CIFAR-10. A lower FID value is better.

Figure 6: FID scores of different optimizers on (a) WGAN and (b) WGAN-GP, using the vanilla CNN generator on CIFAR-10. Lower FID indicates better performance. For each model, the successful and failed optimizers are displayed on the left and right sides, with different y-axis ranges.

Furthermore, HVAdam's FID score with WGAN is even better than the other optimizers' FID scores with WGAN-GP, so the stability of HVAdam is fully validated.

## Ablation Study

We perform an ablation study to demonstrate that each step of improvement plays an important role. We train a 1-layer LSTM with four methods: (a) Adam (HVAdam0); (b) Adam with the hidden vector $v_t$ introduced (HVAdam1); (c) additionally changing Adam's learning-rate adjustment strategy by using $v_t$ to calculate $s_t$ (HVAdam2); (d) HVAdam2 with the restart strategy added (HVAdam). As Table 5 shows, every step helps the optimizer perform better.

| | HVAdam0 | HVAdam1 | HVAdam2 | HVAdam |
|---|---|---|---|---|
| valid ppl | 85.04 | 84.54 | 83.42 | 83.31 |

Table 5: Validation perplexity in the ablation study.

## Conclusion

We propose the HVAdam optimizer, which obtains the trend of the loss function through a hidden vector and a restart strategy, and uses this trend to adjust the learning rate, thereby accelerating the model's convergence. We validate HVAdam's advantages on four simple but representative functions and prove its convergence in both convex and non-convex cases. Experimental results show that HVAdam outperforms almost all other optimizers across a comprehensive range of deep learning tasks. Although our optimizer achieves excellent results and faster convergence in these experiments, its calculation process is relatively complex and incurs extra memory overhead; these factors may limit its practicality in memory-constrained environments.
In future work, we plan to optimize the algorithm's computational efficiency and reduce its memory requirements to enhance its applicability.

## Acknowledgements

This project was supported by the Key Research and Development Project of Hubei Province under grant number 2022BCA057. The numerical calculations in this paper were performed on the supercomputing system at the Supercomputing Center of Wuhan University.

## References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223. PMLR.
Balles, L.; and Hennig, P. 2018. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In International Conference on Machine Learning, 404–413. PMLR.
Bernstein, J.; Vahdat, A.; Yue, Y.; and Liu, M.-Y. 2020. On the distance between two neural networks and the stability of learning. Advances in Neural Information Processing Systems, 33: 21370–21381.
Buvanesh, A.; and Panwar, M. 2021. [Re] AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients. In ML Reproducibility Challenge 2021 (Fall Edition).
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
Graves, A. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 5767–5777.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
Izmailov, P.; Podoprikhin, D. A.; Garipov, T.; Vetrov, D.; and Wilson, A. G. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. Conference on Uncertainty in Artificial Intelligence.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; and Han, J. 2019. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Luo, L.; Xiong, Y.; Liu, Y.; and Sun, X. 2019. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
Ma, X.; Tao, Z.; Wang, Y.; Yu, H.; and Wang, Y. 2015. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54: 187–197.
Marcus, M.; Santorini, B.; and Marcinkiewicz, M. A. 1993. Building a large annotated corpus of English: The Penn Treebank.
Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
Nichol, A. Q.; and Dhariwal, P. 2021. Improved denoising diffusion probabilistic models.
In International Conference on Machine Learning, 8162–8171. PMLR.
Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2234–2242.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 1139–1147.
Xie, Z.; Wang, X.; Zhang, H.; Sato, I.; and Sugiyama, M. 2022. Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum. In International Conference on Machine Learning, 24430–24459. PMLR.
Yue, Y.; Ye, Z.; Jiang, J.; Liu, Y.; and Zhang, K. 2023. AGD: An Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix. Advances in Neural Information Processing Systems, 36: 45812–45832.
Zaheer, M.; Reddi, S.; Sachan, D.; Kale, S.; and Kumar, S. 2018. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, 9793–9803.
Zhang, J.; Liu, S.; Song, J.; Zhu, T.; Xu, Z.; and Song, M. 2023. Lookaround Optimizer: k steps around, 1 step average. In Advances in Neural Information Processing Systems.
Zhang, M.; Lucas, J.; Ba, J.; and Hinton, G. E. 2019. Lookahead Optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, 9593–9604.
Zhuang, J.; Tang, T.; Ding, Y.; Tatikonda, S. C.; Dvornek, N.; Papademetris, X.; and Duncan, J. 2020. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33: 18795–18806.