# Faster Double Adaptive Gradient Methods

Feihu Huang^{1,2*}, Yuning Luo^1
^1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
^2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
huangfeihu2018@gmail.com

## Abstract

In this paper, we propose a class of faster double adaptive gradient methods to solve nonconvex finite-sum optimization problems, possibly with nonsmooth regularization, by simultaneously using an adaptive learning rate and an adaptive mini-batch size. Specifically, we first propose a double adaptive stochastic gradient method (2AdaSGD), and prove that our 2AdaSGD obtains a low stochastic first-order oracle (SFO) complexity for finding a stationary solution under the population smoothness condition. Furthermore, we propose a variance-reduced double adaptive stochastic gradient method (2AdaSPIDER), and prove that our 2AdaSPIDER obtains an optimal SFO complexity under the average smoothness condition, which is lower than the SFO complexity of the existing double adaptive gradient algorithms. In particular, we introduce a new stochastic gradient mapping to adaptively adjust the mini-batch size in our stochastic gradient methods. We conduct some numerical experiments to verify the efficiency of our proposed methods.

## Introduction

In the era of big data and large models, efficient stochastic optimization algorithms have received widespread attention in the machine learning community (Bottou, Curtis, and Nocedal 2018). In this paper, we study efficient stochastic algorithms for solving the following nonconvex optimization problem, possibly with nonsmooth regularization:

$$
\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n f_i(x;\xi_i) + h(x), \qquad (1)
$$

where the function $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x;\xi_i)$ is smooth but possibly nonconvex, and the function $h(x)$ is convex and possibly nonsmooth.
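As a concrete instance of problem (1), consider an $\ell_1$-regularized logistic loss. The sketch below is our own illustration (not from the paper; here the $f_i$ happen to be convex, whereas (1) also allows nonconvex components): it evaluates the smooth finite-sum part and the convex nonsmooth regularizer $h(x) = \lambda\|x\|_1$.

```python
import numpy as np

def objective(x, A, y, lam):
    """F(x) = (1/n) * sum_i log(1 + exp(-y_i * <a_i, x>)) + lam * ||x||_1."""
    f = np.mean(np.log1p(np.exp(-y * (A @ x))))  # smooth finite-sum part f(x)
    h = lam * np.abs(x).sum()                    # convex, nonsmooth h(x)
    return f + h
```

The nonsmooth term is exactly what the proximal step in the proposed algorithms is designed to handle.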
Here the samples $\{\xi_i\}_{i=1}^n$ are drawn from an unknown data distribution, and $x\in\mathbb{R}^d$ denotes the parameter vector of the model. When $n$ is large, the large-scale problem (1) arises widely in machine learning tasks such as training deep learning models (You et al. 2019). Stochastic gradient descent (SGD) (Robbins and Monro 1951; Ghadimi and Lan 2013) and its variants (Allen-Zhu and Hazan 2016; Allen-Zhu 2018; Reddi et al. 2016; J Reddi et al. 2016; Fang et al. 2018; Wang et al. 2019; Zhou, Xu, and Gu 2020) have been extensively used to solve these large-scale problems. In fact, the performance of these large-scale models often relies on the hyper-parameters of the training process. It is well known that the learning rate is one of the most critical hyper-parameters in optimization algorithms. Recently, many adaptive gradient methods (Duchi, Hazan, and Singer 2011; Kingma and Ba 2014; Zhuang et al. 2020; Xie et al. 2024) with adaptive learning rates have been developed. For example, Adagrad (Duchi, Hazan, and Singer 2011) is the first adaptive gradient method, using a global adaptive learning rate. Adam (Kingma and Ba 2014) is a momentum-based adaptive gradient method using coordinate-wise learning rates, and is a popular optimizer for training deep learning models. Subsequently, some variants of Adam (Reddi, Kale, and Kumar 2019; Chen et al. 2019) have been proposed to deal with its divergence under the nonconvex setting. More recently, some accelerated adaptive gradient methods have been proposed based on variance reduction techniques. Specifically, (Cutkosky and Orabona 2019) proposed an efficient adaptive method (i.e., STORM) via a momentum-based variance reduction technique. Subsequently, (Huang, Li, and Huang 2021; Kavis et al. 2022) proposed some variance-reduced adaptive methods.

(*Corresponding Author. Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)
Meanwhile, (Yun, Lozano, and Yang 2021) proposed an effective adaptive proximal gradient method for nonconvex problems with nonsmooth regularization. Another important hyper-parameter is the batch size, which also heavily affects the generalization performance of large-scale models (Lau, Liu, and Kolar 2024). Thus, an adaptive batch size is also a good choice, like an adaptive learning rate (Smith et al. 2018; McCandlish et al. 2018). Recently, some adaptive gradient methods (Devarakonda, Naumov, and Garland 2017; De et al. 2017; Smith et al. 2018; Zhou, Yuan, and Feng 2018; Ji et al. 2020) have been proposed by using an adaptive batch size. For example, (De et al. 2017) adapted the batch size so that certain properties of the gradient and variance are satisfied. (Zhou, Yuan, and Feng 2018) used exponentially and polynomially increasing batch sizes. (Ji et al. 2020) applied an adaptive batch size, based on history gradients, to variance-reduced methods. More recently, (Lau, Liu, and Kolar 2024) have

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

| Algorithm | Reference | ALR | ABS | SFO | NSR |
|---|---|---|---|---|---|
| Adam | (Kingma and Ba 2014; Zhang et al. 2022) | ✓ | | $O(\epsilon^{-2})$ | |
| Adam-type | (Chen et al. 2019; Reddi, Kale, and Kumar 2019) | ✓ | | $O(\epsilon^{-2})$ | |
| Adagrad | (Ward, Wu, and Bottou 2020) | ✓ | | $O(\epsilon^{-2})$ | |
| AdaBelief | (Zhuang et al. 2020) | ✓ | | $O(\epsilon^{-2})$ | |
| STORM | (Cutkosky and Orabona 2019) | ✓ | | $O(\epsilon^{-3/2})$ | |
| SUPER-ADAM | (Huang, Li, and Huang 2021) | ✓ | | $O(\epsilon^{-3/2})$ | |
| AbaSPIDER | (Ji et al. 2020) | | ✓ | $O(n\epsilon^{-1}\wedge\epsilon^{-3})$ | |
| AdaSPIDER | (Kavis et al. 2022) | ✓ | | $O(n + \sqrt{n}\epsilon^{-1})$ | |
| PROXGEN | (Yun, Lozano, and Yang 2021) | ✓ | | $O(\epsilon^{-2})$ | ✓ |
| AdAdaGrad | (Lau, Liu, and Kolar 2024) | ✓ | ✓ | $O(\epsilon^{-2})$ | |
| 2AdaSGD | Ours | ✓ | ✓ | $O(n\epsilon^{-1}\wedge\epsilon^{-2})$ | ✓ |
| 2AdaSPIDER | Ours | ✓ | ✓ | $O(\sqrt{n}\epsilon^{-1}\wedge\epsilon^{-3/2})$ | ✓ |

Table 1: Comparison of our methods and the existing adaptive gradient methods for nonconvex optimization. ALR denotes adaptive learning rate. ABS denotes adaptive mini-batch size.
SFO denotes the stochastic first-order oracle (SFO) complexity for finding an $\epsilon$-stationary solution of nonconvex optimization (i.e., $\mathbb{E}\|\nabla f(x)\|^2 \le \epsilon$ or its equivalent form). NSR denotes that the method can handle nonsmooth regularization.

begun to study an adaptive gradient method (i.e., AdAdaGrad) that simultaneously uses an adaptive learning rate and an adaptive batch size. However, the proposed AdAdaGrad method (Lau, Liu, and Kolar 2024) suffers from a high computation and sample complexity for finding a stationary solution (please see Table 1). Thus, to fill this gap, we propose a class of faster double adaptive gradient methods that simultaneously use an adaptive learning rate and an adaptive batch size. Our main contributions are as follows:

1) We propose a double adaptive stochastic gradient method (2AdaSGD) to solve Problem (1) by introducing stochastic gradient mappings. Moreover, we also propose a variance-reduced double adaptive stochastic gradient method (2AdaSPIDER) based on the variance reduction technique of SARAH/SPIDER (Nguyen et al. 2017; Fang et al. 2018; Wang et al. 2019).

2) We provide a solid convergence analysis for our methods. Under some mild conditions, we prove that our 2AdaSGD obtains a low stochastic first-order oracle (SFO) complexity of $O(n\epsilon^{-1}\wedge\epsilon^{-2})$, and the worst-case SFO complexity of our 2AdaSPIDER method is $O(\sqrt{n}\epsilon^{-1}\wedge\epsilon^{-3/2})$, which matches the lower-bound complexity for finding an $\epsilon$-stationary solution of nonconvex finite-sum optimization (Fang et al. 2018).

3) We conduct some numerical experiments to verify the efficiency of our proposed methods.

## Related Work

In this section, we review some typical adaptive learning rate methods and adaptive batch size methods, respectively.

### Adaptive Learning Rate Methods

It is well known that the learning rate is one of the most critical hyper-parameters in optimization algorithms, and it affects the performance of large-scale models.
Recently, adaptive gradient methods (Duchi, Hazan, and Singer 2011; Kingma and Ba 2014; Loshchilov and Hutter 2017; Zhuang et al. 2020; Xie et al. 2024) using adaptive learning rates have been widely studied in the machine learning community. For example, Adam (Kingma and Ba 2014) is one of the most popular adaptive gradient methods, using a coordinate-wise adaptive learning rate together with a momentum technique to accelerate convergence. Thus, Adam has become a default optimizer for training large-scale models. Subsequently, some variants of Adam (Reddi, Kale, and Kumar 2019; Chen et al. 2019; Guo et al. 2021) have been proposed to obtain convergence guarantees under the nonconvex setting. Due to its coordinate-wise adaptive learning rate, Adam often exhibits poor generalization performance when training machine learning models. To improve the generalization performance of Adam, some adaptive gradient methods such as AdamW (Loshchilov and Hutter 2017), Padam (Chen et al. 2018) and AdaBelief (Zhuang et al. 2020) have recently been proposed. Meanwhile, some accelerated adaptive gradient methods (Cutkosky and Orabona 2019; Huang, Li, and Huang 2021; Kavis et al. 2022) have been proposed based on variance reduction techniques.

### Adaptive Batch Size Methods

In fact, the batch size is also one of the most critical hyper-parameters in optimization algorithms. Recently, some adaptive batch size methods (Devarakonda, Naumov, and Garland 2017; De et al. 2017; Smith et al. 2018; Zhou, Yuan, and Feng 2018; Ji et al. 2020) have been studied in the machine learning community. Specifically, (Zhou, Yuan, and Feng 2018) presented an adaptive gradient method with exponentially or polynomially increasing batch sizes. (De et al. 2017) used a class of adaptive batch sizes satisfying certain properties of the gradient and variance. (Ji et al. 2020) proposed adaptive variance-reduced methods using an adaptive batch size based on the norms of old stochastic gradients.
More recently, (Lau, Liu, and Kolar 2024) proposed an adaptive gradient method using an adaptive batch size derived from existing adaptive sampling methods.

    Algorithm 1: Double Adaptive SGD (2AdaSGD)
     1: Input: T, m ∈ N+, α1 > 0, α2 > 0, η > 0 and ϵ > 0;
     2: Initialize: x_0 ∈ R^d and g_{-1} = ··· = g_{-m} = γ > 0;
     3: for t = 0, 1, ..., T-1 do
     4:   Compute β_t = (1/m) Σ_{i=1}^m ||g_{t-i}||^2;
     5:   Randomly sample B_t from [n] without replacement, where
          b_t = |B_t| = min(α1 σ^2 β_t^{-1}, α2 σ^2 ϵ^{-1}, n);
     6:   v_t = ∇f_{B_t}(x_t) = (1/b_t) Σ_{i∈B_t} ∇f_i(x_t; ξ_t^i);
     7:   Generate the adaptive matrix A_t ∈ R^{d×d}; one example of A_t:
          given a_0 = 0, 0 < ϱ < 1, ρ > 0, compute
          a_t = ϱ a_{t-1} + (1-ϱ) ∇f_{B_t}(x_t)^2, A_t = diag(√a_t) + ρ I_d;
     8:   Update x_{t+1} = argmin_{x∈R^d} { <v_t, x> + (1/(2η)) (x - x_t)^T A_t (x - x_t) + h(x) };
     9:   Set g_t = (x_t - x_{t+1})/η;
    10: end for
    11: Output: x_ζ chosen uniformly at random from {x_t}_{t=1}^T.

### Notations

$\lfloor\cdot\rfloor$ denotes the floor function, and $a \wedge b = \min(a, b)$. Given two vectors $x$ and $y$, $x^r$ ($r > 0$) denotes the element-wise power, $x/y$ denotes the element-wise division, and $\max(x, y)$ denotes the element-wise maximum. $I_d$ denotes the $d$-dimensional identity matrix. $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices. $\langle x, y\rangle$ denotes the inner product of two vectors $x$ and $y$. The indicator function $\mathbb{1}(a)$ equals 1 if the event $a$ occurs and 0 otherwise.

## Preliminaries

In this section, we provide some mild conditions for the above problem (1) and our methods.

**Assumption 1.** The stochastic gradient is unbiased and has bounded variance, i.e., there exists a constant $\sigma > 0$ such that for all $i = 1, 2, \ldots, n$, $\mathbb{E}[\nabla f(x;\xi_i)] = \nabla f(x)$ and $\mathbb{E}\|\nabla f(x;\xi_i) - \nabla f(x)\|^2 \le \sigma^2$.

**Assumption 2.** The function $F(x) = f(x) + h(x)$ is bounded from below on $\mathbb{R}^d$, i.e., $F^* = \inf_{x\in\mathbb{R}^d}\{f(x) + h(x)\} > -\infty$.

**Assumption 3.** The function $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x;\xi_i)$ is $L$-smooth, i.e., $\|\nabla f(x_1) - \nabla f(x_2)\| \le L\|x_1 - x_2\|$ for all $x_1, x_2 \in \mathbb{R}^d$.

**Assumption 4.** Each component function $f(x;\xi_i)$ is $L$-smooth for all $i = 1, 2, \ldots, n$, i.e., $\|\nabla f(x_1;\xi_i) - \nabla f(x_2;\xi_i)\| \le L\|x_1 - x_2\|$ for all $x_1, x_2 \in \mathbb{R}^d$.
Assumption 1 is commonly used in existing stochastic optimization (Ghadimi, Lan, and Zhang 2016; Cutkosky and Orabona 2019). Assumption 2 guarantees the feasibility of Problem (1). Assumption 3 is widely used in stochastic optimization algorithms (Chen et al. 2019; Zhuang et al. 2020). Assumption 4 is widely used in variance-reduced algorithms (Fang et al. 2018; Cutkosky and Orabona 2019). According to Assumption 4, we have $\|\nabla f(x) - \nabla f(y)\| = \|\mathbb{E}[\nabla f(x;\xi_i) - \nabla f(y;\xi_i)]\| \le \mathbb{E}\|\nabla f(x;\xi_i) - \nabla f(y;\xi_i)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$. Thus, Assumption 3 is milder than Assumption 4.

    Algorithm 2: Double Adaptive SPIDER (2AdaSPIDER)
     1: Input: S, m ∈ N+, α1 > 0, α2 > 0, η > 0, b > 0 and ϵ > 0;
     2: Initialize: x̃_0 = x_0 ∈ R^d and β_1 > 0;
     3: for s = 1, 2, ..., S do
     4:   x_0^s = x̃_{s-1};
     5:   Randomly sample B_s from [n] without replacement, where
          |B_s| = min(α1 σ^2 β_s^{-1}, α2 σ^2 ϵ^{-1}, n);
     6:   v_0^s = ∇f_{B_s}(x̃_{s-1}) = (1/|B_s|) Σ_{i∈B_s} ∇f_i(x̃_{s-1}; ξ_i^s);
     7:   Generate the adaptive matrix A^s ∈ R^{d×d}; one example of A^s:
          given a^0 = 0, 0 < ϱ < 1, ρ > 0, compute
          a^s = ϱ a^{s-1} + (1-ϱ) ∇f_{B_s}(x_0^s)^2, A^s = diag(√a^s) + ρ I_d;
     8:   Update x_1^s = argmin_{x∈R^d} { <v_0^s, x> + (1/(2η)) (x - x_0^s)^T A^s (x - x_0^s) + h(x) };
     9:   Compute β_{s+1} = ||g_0^s||^2 / m with g_0^s = (x_0^s - x_1^s)/η;
    10:   for t = 1, 2, ..., m-1 do
    11:     Randomly sample I_t from [n] without replacement with |I_t| = b;
    12:     Compute v_t^s = ∇f_{I_t}(x_t^s) - ∇f_{I_t}(x_{t-1}^s) + v_{t-1}^s;
    13:     Generate the adaptive matrix A_t^s ∈ R^{d×d}; one example of A_t^s:
            given a_0^s = 0, 0 < ϱ < 1, ρ > 0, compute
            a_t^s = ϱ a_{t-1}^s + (1-ϱ) ∇f_{I_t}(x_t^s)^2, A_t^s = diag(√a_t^s) + ρ I_d;
    14:     Update x_{t+1}^s = argmin_{x∈R^d} { <v_t^s, x> + (1/(2η)) (x - x_t^s)^T A_t^s (x - x_t^s) + h(x) };
    15:     Compute β_{s+1} ← β_{s+1} + ||g_t^s||^2 / m with g_t^s = (x_t^s - x_{t+1}^s)/η;
    16:   end for
    17: end for
    18: Output: x_ζ chosen uniformly at random from {{x_t^s}_{t=1}^m}_{s=1}^S.

**Assumption 5.** Assume the adaptive matrix $A_t$ satisfies $A_t \succeq \rho I_d \succ 0$ for all $t \ge 1$, where $\rho > 0$ denotes a lower bound on the smallest eigenvalue of $A_t$ for all $t \ge 1$.
Assumption 5 ensures that the adaptive matrices $\{A_t\}_{t\ge 1}$ are positive definite, with their smallest eigenvalues bounded below by $\rho > 0$, as in (Huang, Li, and Huang 2021; Yun, Lozano, and Yang 2021).

## Faster Double Adaptive Gradient Methods

In this section, we propose an efficient double adaptive SGD (2AdaSGD) algorithm that simultaneously uses an adaptive learning rate and an adaptive batch size. We then further propose a variance-reduced double adaptive SGD (2AdaSPIDER) algorithm based on the variance reduction technique of SPIDER (Nguyen et al. 2017; Fang et al. 2018; Wang et al. 2019).

### 2AdaSGD Algorithm

Algorithm 1 shows the algorithmic framework of our 2AdaSGD method. In Algorithm 1, we set the mini-batch size $b_t = \min(\alpha_1\sigma^2\beta_t^{-1}, \alpha_2\sigma^2\epsilon^{-1}, n)$, which depends on the average norm of the stochastic gradient mappings, $\beta_t = \frac{1}{m}\sum_{i=1}^m \|g_{t-i}\|^2$, and on the variance $\sigma^2$ of the stochastic gradient. Thus, the stochastic gradient mappings $\{g_{t-i}\}_{i=1}^m$ adaptively adjust the mini-batch size in our 2AdaSGD algorithm. Specifically, when the regularization term $h(x) = 0$ and the adaptive matrix $A_t = I_d$, $g_t = \nabla f_{B_t}(x_t)$ is the stochastic gradient as in (Ji et al. 2020); otherwise, $g_t = \frac{x_t - x_{t+1}}{\eta}$ is the stochastic gradient mapping. In Algorithm 1, we also use the adaptive matrix $A_t$ to update the variable $x$. Here the adaptive matrix $A_t$ is generally diagonal and encodes the coordinate-wise adaptive learning rates. For example, we can use the Adam-like adaptive learning rate: $a_t = \varrho a_{t-1} + (1-\varrho)\nabla f_{B_t}(x_t)^2$, $A_t = \mathrm{diag}(\sqrt{a_t}) + \rho I_d$, where $\varrho \in (0, 1)$ and $\rho > 0$ is a small positive number. We can also use the AdamW-like adaptive learning rate: $a_t = \varrho a_{t-1} + (1-\varrho)\big(\nabla f_{B_t}(x_t) + \lambda x_t\big)^2$, $A_t = \mathrm{diag}(\sqrt{a_t}) + \rho I_d$, where $\lambda > 0$ denotes the regularization parameter.
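To make the double adaptivity concrete, the following NumPy sketch renders one iteration of Algorithm 1. This is our illustrative code, not the authors' implementation: the function names and the assumption that $\sigma^2$ is known (or estimated) are ours. The batch size grows as the recent gradient-mapping norms $\beta_t$ shrink, and the update uses the Adam-like diagonal matrix, whose proximal step has a closed form.

```python
import numpy as np

def adaptive_batch_size(g_hist, sigma2, eps, n, alpha1, alpha2):
    # Line 5 of Algorithm 1: b_t = min(alpha1*sigma^2/beta_t, alpha2*sigma^2/eps, n),
    # where beta_t averages the squared norms of the last m gradient mappings.
    beta_t = np.mean([g @ g for g in g_hist])
    return max(1, int(min(alpha1 * sigma2 / beta_t, alpha2 * sigma2 / eps, n)))

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0), applied coordinate-wise.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ada_sgd_step(x, grad, a, eta, rho, varrho, lam=0.0):
    # Lines 7-9 of Algorithm 1 with the Adam-like diagonal preconditioner.
    a = varrho * a + (1.0 - varrho) * grad**2   # a_t = varrho*a_{t-1} + (1-varrho)*grad^2
    a_hat = np.sqrt(a) + rho                    # diagonal of A_t = diag(sqrt(a_t)) + rho*I_d
    x_new = x - eta * grad / a_hat              # closed-form proximal step when h(x) = 0
    if lam > 0.0:                               # for h(x) = lam*||x||_1, soft-threshold
        x_new = soft_threshold(x_new, eta * lam / a_hat)
    g = (x - x_new) / eta                       # stochastic gradient mapping g_t
    return x_new, a, g
```

As the iterates approach a stationary point, $\beta_t \to 0$ and the batch size saturates at $\min(\alpha_2\sigma^2\epsilon^{-1}, n)$, which is what drives the SFO complexity analysis below.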
At line 8 of Algorithm 1, when $h(x) = 0$ and $A_t = \mathrm{diag}(\hat{a}_t)$ with $\hat{a}_t = (\hat{a}_{1,t}, \hat{a}_{2,t}, \ldots, \hat{a}_{d,t}) \in \mathbb{R}^d$, we have

$$
x_{t+1} = \arg\min_{x\in\mathbb{R}^d}\Big\{\langle v_t, x\rangle + \frac{1}{2\eta}(x - x_t)^T \mathrm{diag}(\hat{a}_t)(x - x_t)\Big\} = x_t - \eta\,\frac{1}{\hat{a}_t}\circ v_t, \qquad (2)
$$

where $\circ$ denotes the coordinate-wise product. In other words, we have $x_{i,t+1} = x_{i,t} - \frac{\eta}{\hat{a}_{i,t}}v_{i,t}$ for all $i \in \{1, 2, \ldots, d\}$. When $h(x) = \lambda\|x\|_1$ and $A_t = \mathrm{diag}(\hat{a}_t)$ with $\hat{a}_t = (\hat{a}_{1,t}, \hat{a}_{2,t}, \ldots, \hat{a}_{d,t}) \in \mathbb{R}^d$, we have

$$
x_{t+1} = \arg\min_{x\in\mathbb{R}^d}\Big\{\langle v_t, x\rangle + \frac{1}{2\eta}(x - x_t)^T A_t(x - x_t) + \lambda\|x\|_1\Big\} = \mathcal{S}\Big(x_t - \eta\,\frac{1}{\hat{a}_t}\circ v_t,\ \frac{\eta\lambda}{\hat{a}_t}\Big), \qquad (3)
$$

where $\mathcal{S}(a, \lambda) = \mathrm{sign}(a)\max(|a| - \lambda, 0)$ denotes the soft-threshold operator, applied coordinate-wise. In other words, we have $x_{i,t+1} = \mathcal{S}\big(x_{i,t} - \frac{\eta}{\hat{a}_{i,t}}v_{i,t},\ \frac{\eta\lambda}{\hat{a}_{i,t}}\big)$ for all $i \in \{1, 2, \ldots, d\}$.

### 2AdaSPIDER Algorithm

Algorithm 2 provides the algorithmic framework of our 2AdaSPIDER method, which basically follows Algorithm 1. In fact, Algorithm 2 uses the same adaptive batch size and adaptive learning rate as Algorithm 1, while using the variance-reduced stochastic gradient estimator (Nguyen et al. 2017; Fang et al. 2018; Wang et al. 2019): at the outer loop,

$$
v_0^s = \nabla f_{B_s}(\tilde{x}_{s-1}) = \frac{1}{|B_s|}\sum_{i\in B_s}\nabla f_i(\tilde{x}_{s-1};\xi_i^s), \qquad (4)
$$

and at the inner loop,

$$
v_t^s = \nabla f_{I_t}(x_t^s) - \nabla f_{I_t}(x_{t-1}^s) + v_{t-1}^s. \qquad (5)
$$

## Convergence Analysis

In this section, we study the convergence properties of our methods under some mild assumptions. All related proofs are provided in the Appendix. We first define a useful gradient mapping as in (Ghadimi, Lan, and Zhang 2016): $\mathcal{G}_t = \frac{1}{\eta}(x_t - \hat{x}_{t+1})$, where the iterate $\hat{x}_{t+1}$ is generated from

$$
\hat{x}_{t+1} = \arg\min_{x\in\mathbb{R}^d}\Big\{\langle\nabla f(x_t), x\rangle + \frac{1}{2\eta}(x - x_t)^T A_t(x - x_t) + h(x)\Big\}, \qquad (6)
$$

with $\eta > 0$. Clearly, when $h(x) = 0$ and $A_t = I_d$, we have $\mathcal{G}_t = \nabla f(x_t)$.

**Theorem 1.** Under Assumptions 1, 2, 3 and 5, suppose the sequence $\{x_t\}_{t=0}^{T-1}$ is generated from Algorithm 1, and the output $x_\zeta$ is chosen uniformly at random from it. Setting $\theta_1 = \frac{3\rho}{2} - \frac{1}{\rho\alpha_1}$ and $\theta_2 = 2\rho^2\alpha_1 + 2$, we have

$$
\mathbb{E}\|\mathcal{G}_\zeta\|^2 \le \frac{\theta_2(F(x_0) - F^*)}{\theta_1\eta T} + \frac{\theta_2 m\gamma^2}{T\theta_1\rho\alpha_1} + \frac{\theta_2\epsilon}{\theta_1\rho\alpha_2}. \qquad (7)
$$

**Remark 1.**
Let $\alpha_1 = \alpha_2 = \frac{4}{\rho^2}$ and $0 < \eta \le \frac{\rho}{2L}$; then $\theta_1 \ge \frac{\rho}{4}$ and $\theta_2 = 10$. Let $m = O(1) < n$ and $\gamma = \frac{1}{m}$. Then, letting $T = O(\epsilon^{-1})$, we have $\mathbb{E}\|\mathcal{G}_\zeta\|^2 \le O(\epsilon)$. The SFO complexity of our 2AdaSGD method is

$$
\sum_{t=0}^{T-1} b_t = \sum_{t=0}^{T-1}\min\Big\{\alpha_1\sigma^2\beta_t^{-1},\ \alpha_2\sigma^2\epsilon^{-1},\ n\Big\} = \sum_{t=0}^{T-1}\min\Big\{\tfrac{4\sigma^2}{\rho^2\frac{1}{m}\sum_{i=1}^m\|g_{t-i}\|^2},\ \tfrac{4\sigma^2}{\rho^2}\epsilon^{-1},\ n\Big\} \le T\min\Big\{\tfrac{4\sigma^2}{\rho^2}\epsilon^{-1},\ n\Big\} = O\big(n\epsilon^{-1}\wedge\epsilon^{-2}\big). \qquad (8)
$$

**Theorem 2.** Under Assumptions 1, 2, 4 and 5, suppose the sequence $\{\{x_t^s\}_{t=0}^{m-1}\}_{s=1}^S$ is generated from Algorithm 2, and the output $x_\zeta$ is chosen uniformly at random from it. Setting $\alpha = \min(\alpha_1, \alpha_2)$, $\theta_1 = \frac{3\rho}{2} - \frac{m\eta^2L^2}{b\rho} - \frac{1}{2\alpha\rho} > 0$ and $\theta_2 = \frac{2m\eta^2L^2}{\rho^2 b} + \frac{1}{\rho^2\alpha} + 2$, we have

$$
\mathbb{E}\|\mathcal{G}_\zeta\|^2 \le \frac{\theta_2(F(x_0) - F^*)}{\theta_1\eta T}, \qquad (9)
$$

where $T = Sm$ and $x_0^1 = x_0$.

**Remark 2.** Let $\alpha_1 = \alpha_2 = \frac{4}{\rho^2}$ and $0 < \eta \le \frac{\rho}{2L}$; then $\theta_1 \ge \frac{\rho}{8}$ and $\theta_2 = \frac{9}{4} + \frac{2m\eta^2L^2}{\rho^2 b}$. Further, letting $\eta = \frac{\rho}{2L}$ with $m = b$, we have $\theta_1 \ge \frac{\rho}{8}$ and $\theta_2 = \frac{11}{4}$. Then, letting $b = \lfloor(n\wedge\epsilon^{-1})^{1/2}\rfloor$, $m = (n\wedge\epsilon^{-1})b^{-1}$, $S = \epsilon^{-1}(n\wedge\epsilon^{-1})^{-1/2}$ and $T = Sm = O(\epsilon^{-1})$, we have $\mathbb{E}\|\mathcal{G}_\zeta\|^2 \le O(\epsilon)$. The SFO complexity of our 2AdaSPIDER method is

$$
\sum_{s=1}^S |B_s| + Smb = \sum_{s=1}^S\min\Big\{\alpha_1\sigma^2\beta_s^{-1},\ \alpha_2\sigma^2\epsilon^{-1},\ n\Big\} + Smb = \sum_{s=1}^S\min\Big\{\tfrac{4\sigma^2}{\rho^2\frac{1}{m}\sum_{t=1}^m\|g_{t-1}^{s-1}\|^2},\ \tfrac{4\sigma^2}{\rho^2}\epsilon^{-1},\ n\Big\} + Smb = O\big(\sqrt{n}\epsilon^{-1}\wedge\epsilon^{-3/2}\big). \qquad (10)
$$

## Numerical Experiments

In this section, we conduct some experiments on image classification and language modeling tasks to verify the efficiency of our proposed methods. In the following experiments, we compare our methods with the existing methods listed in Table 1. Since SUPER-ADAM (Huang, Li, and Huang 2021) extends STORM (Cutkosky and Orabona 2019), we exclude STORM from the comparisons. All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8352V CPU and one NVIDIA RTX 4090 GPU.

Figure 1: Image Classification on CIFAR-10 dataset without Regularization.
Figure 2: Image Classification on CIFAR-10 dataset with Regularization.
Figure 3: Image Classification on ImageNet dataset without Regularization.
Figure 4: Image Classification on ImageNet dataset with Regularization.
| Layer |
|---|
| Inputs (d channels) |
| Conv d→32, BatchNorm, ReLU |
| Conv 32→64, BatchNorm, ReLU |
| Conv 64→64, BatchNorm, ReLU |
| Linear 1024→512, ReLU |
| Dropout (0.5) |
| Linear 512→C |

Table 2: The CNN used in our experiments. C is the number of classes, and d is the number of input channels.

### Image Classification Task

In this experiment, we conduct the image classification task on the CIFAR-10 (Krizhevsky, Hinton et al. 2009) and ImageNet (Deng et al. 2009) datasets, respectively. Given the training data $\{x_i, y_i\}_{i=1}^n$, where $x_i$ denotes the features and $y_i$ the label, we train neural networks by solving the following problem:

$$
\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n \mathcal{L}\big(\Phi(x_i; w), y_i\big) + h(w), \qquad (11)
$$

where $\Phi(\cdot\,; w)$ denotes the neural network model, $\mathcal{L}(\cdot,\cdot)$ denotes the loss function, and $h(w)$ denotes a regularizer. Here $w \in \mathbb{R}^d$ denotes the parameters of the neural network model, which we learn. In the experiment, we use the cross-entropy loss as $\mathcal{L}(\cdot,\cdot)$. Specifically, we train a 3-layer Convolutional Neural Network (CNN) on the CIFAR-10 dataset and ResNet-18 (He et al. 2016) on the ImageNet dataset. The architecture of the 3-layer CNN is provided in Table 2. For the learning rates and other hyper-parameters, we perform a grid search and report the best result for each optimizer. We set $\gamma = 10^{-3}$, $m = 50$ in our 2AdaSGD algorithm, and $\gamma = 10^{-2}$, $b = 64$ in our 2AdaSPIDER algorithm. For the other algorithms, we set the basic learning rate to 0.001 and the basic batch size to 64. When $h(w) = 0$, i.e., without regularization, Figures 1 and 3 show that our methods generally outperform the other methods. Likewise, when $h(w) = \|w\|_1$, i.e., with $\ell_1$-regularization, Figures 2 and 4 show that our methods also generally outperform the other methods.

Figure 5: Language Modeling on LSTM without Regularization.
Figure 6: Language Modeling on LSTM with Regularization.
### Language Modeling Task

In this experiment, we conduct the language modeling task on the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) and WikiText-2 (Merity et al. 2016) datasets, respectively. Specifically, we train a 2-layer LSTM (Hochreiter and Schmidhuber 1997) on the Penn Treebank dataset and a 2-layer Transformer (Vaswani 2017) on the WikiText-2 dataset. Given $n$ samples $\{x^i\}_{i=1}^n$, we learn the language model by solving the following problem:

$$
\min_{\theta\in\mathbb{R}^d}\ -\frac{1}{n}\sum_{i=1}^n\sum_{t=1}^{l_i}\log P\big(x_t^i \mid x_{1:t-1}^i;\theta\big) + h(\theta), \qquad (12)
$$

where the sample $x^i$ consists of $l_i$ tokens for all $i \in \{1, 2, \ldots, n\}$, and $h(\theta)$ denotes a regularizer. Here $P(x_t^i \mid x_{1:t-1}^i;\theta)$ denotes the probability of the token $x_t^i$ given the tokens $x_{1:t-1}^i$, and $\theta \in \mathbb{R}^d$ denotes the parameters of the language model. For the learning rates and other hyper-parameters, we perform a grid search and report the best result for each optimizer. We set $\gamma = 10^{-3}$, $m = 50$ in our 2AdaSGD algorithm, and $\gamma = 10^{-2}$, $b = 20$ in our 2AdaSPIDER algorithm. For the other algorithms, we set the basic learning rate to 0.001 and the basic batch size to 20. When $h(\theta) = 0$, i.e., without regularization, Figures 5 and 7 show that our methods generally outperform the other methods. Likewise, when $h(\theta) = \|\theta\|_1$, i.e., with $\ell_1$-regularization, Figures 6 and 8 show that our methods also generally outperform the other methods.

Figure 7: Language Modeling on Transformer without Regularization.
Figure 8: Language Modeling on Transformer with Regularization.

## Conclusion

In this paper, we studied double adaptive gradient methods for nonconvex optimization, possibly with nonsmooth regularization, and proposed a class of double adaptive stochastic gradient methods that simultaneously use an adaptive mini-batch size and an adaptive learning rate.
Moreover, we studied the convergence properties of our methods, and proved that the worst-case SFO complexity of our 2AdaSGD method is $O(n\epsilon^{-1}\wedge\epsilon^{-2})$ for finding an $\epsilon$-stationary solution, while the worst-case SFO complexity of our 2AdaSPIDER method is the optimal SFO complexity of $O(\sqrt{n}\epsilon^{-1}\wedge\epsilon^{-3/2})$. To the best of our knowledge, our methods are the first double adaptive stochastic gradient methods for solving nonconvex optimization with nonsmooth regularization.

## Acknowledgments

This paper was partially supported by NSFC under Grant No. 62376125. It was also partially supported by the Fundamental Research Funds for the Central Universities No. NJ2023032.

## References

Allen-Zhu, Z. 2018. Natasha 2: Faster non-convex optimization than SGD. Advances in Neural Information Processing Systems, 31.
Allen-Zhu, Z.; and Hazan, E. 2016. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, 699-707. PMLR.
Bottou, L.; Curtis, F. E.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review, 60(2): 223-311.
Chen, J.; Zhou, D.; Tang, Y.; Yang, Z.; Cao, Y.; and Gu, Q. 2018. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763.
Chen, X.; Liu, S.; Sun, R.; and Hong, M. 2019. On the convergence of a class of Adam-type algorithms for non-convex optimization. In 7th International Conference on Learning Representations, ICLR 2019.
Cutkosky, A.; and Orabona, F. 2019. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32.
De, S.; Yadav, A.; Jacobs, D.; and Goldstein, T. 2017. Automated inference with adaptive batches. In Artificial Intelligence and Statistics, 1504-1513. PMLR.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255. IEEE.
Devarakonda, A.; Naumov, M.; and Garland, M. 2017. AdaBatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029.
Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).
Fang, C.; Li, C. J.; Lin, Z.; and Zhang, T. 2018. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31.
Ghadimi, S.; and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4): 2341-2368.
Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1): 267-305.
Guo, Z.; Xu, Y.; Yin, W.; Jin, R.; and Yang, T. 2021. A novel convergence analysis for algorithms of the Adam family and beyond. arXiv preprint arXiv:2104.14840.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780.
Huang, F.; Li, J.; and Huang, H. 2021. SUPER-ADAM: Faster and universal framework of adaptive gradients. Advances in Neural Information Processing Systems, 34: 9074-9085.
J Reddi, S.; Sra, S.; Poczos, B.; and Smola, A. J. 2016. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in Neural Information Processing Systems, 29.
Ji, K.; Wang, Z.; Weng, B.; Zhou, Y.; Zhang, W.; and Liang, Y. 2020. History-gradient aided batch size adaptation for variance reduced algorithms. In International Conference on Machine Learning, 4762-4772. PMLR.
Kavis, A.; Skoulakis, S.; Antonakopoulos, K.; Dadi, L. T.; and Cevher, V. 2022.
Adaptive stochastic variance reduction for non-convex finite-sum minimization. Advances in Neural Information Processing Systems, 35: 23524-23538.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Lau, T. T.-K.; Liu, H.; and Kolar, M. 2024. AdAdaGrad: Adaptive batch size schemes for adaptive gradient methods. arXiv preprint arXiv:2402.11215.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Marcus, M.; Santorini, B.; and Marcinkiewicz, M. A. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2): 313-330.
McCandlish, S.; Kaplan, J.; Amodei, D.; and Team, O. D. 2018. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Nguyen, L. M.; Liu, J.; Scheinberg, K.; and Takáč, M. 2017. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, 2613-2621. PMLR.
Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, 314-323. PMLR.
Reddi, S. J.; Kale, S.; and Kumar, S. 2019. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400-407.
Smith, S. L.; Kindermans, P.-J.; Ying, C.; and Le, Q. V. 2018. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations.
Vaswani, A. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Wang, Z.; Ji, K.; Zhou, Y.; Liang, Y.; and Tarokh, V.
2019. SpiderBoost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32.
Ward, R.; Wu, X.; and Bottou, L. 2020. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 21(219): 1-30.
Xie, X.; Zhou, P.; Li, H.; Lin, Z.; and Yan, S. 2024. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; and Hsieh, C.-J. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
Yun, J.; Lozano, A. C.; and Yang, E. 2021. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34: 24365-24378.
Zhang, Y.; Chen, C.; Shi, N.; Sun, R.; and Luo, Z.-Q. 2022. Adam can converge without any modification on update rules. Advances in Neural Information Processing Systems, 35: 28386-28399.
Zhou, D.; Xu, P.; and Gu, Q. 2020. Stochastic nested variance reduction for nonconvex optimization. Journal of Machine Learning Research, 21(103): 1-63.
Zhou, P.; Yuan, X.; and Feng, J. 2018. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. Advances in Neural Information Processing Systems, 31.
Zhuang, J.; Tang, T.; Ding, Y.; Tatikonda, S. C.; Dvornek, N.; Papademetris, X.; and Duncan, J. 2020. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33: 18795-18806.