Implicit Regularization of AdaDelta

Published in Transactions on Machine Learning Research (12/2024)

Matthias Englert (m.englert@warwick.ac.uk), University of Warwick
Ranko Lazić (r.s.lazic@warwick.ac.uk), University of Warwick
Avi Semler (avi.semler@warwick.ac.uk), University of Warwick

Reviewed on OpenReview: https://openreview.net/forum?id=nm40lbbwoR

Abstract

We consider the AdaDelta adaptive optimization algorithm on locally Lipschitz, positively homogeneous, and o-minimally definable non-smooth neural networks, with either the exponential or the logistic loss. We prove that, after achieving perfect training accuracy, the resulting adaptive gradient flows converge in direction to a Karush-Kuhn-Tucker point of the margin maximization problem, i.e. perform the same implicit regularization as the plain gradient flows. We also prove that the loss decreases to zero and the Euclidean norm of the parameters increases to infinity at the same rates as for the plain gradient flows. Moreover, we consider generalizations of AdaDelta where the exponential decay coefficients may vary with time and the numerical stability terms may be different across the parameters, and we obtain the same results provided the former do not approach 1 too quickly and the latter have isotropic quotients. Finally, we corroborate our theoretical results by numerical experiments on convolutional networks with the MNIST and CIFAR-10 datasets.

1 Introduction

Understanding when, why, and how training overparameterized neural networks by gradient-based algorithms achieves good generalization (Zhang, Bengio, Hardt, Recht, and Vinyals, 2021; Belkin, Hsu, Ma, and Mandal, 2019) remains one of the central questions in machine learning, despite several years of vibrant research. Much progress on the question has been made by investigating implicit regularization (or implicit bias) (Neyshabur, Bhojanapalli, McAllester, and Srebro, 2017): the mysterious preference of the training algorithms for interpolators that perform well at test time.

Building on the seminal work of Soudry, Hoffer, Nacson, Gunasekar, and Srebro (2018), one of the most celebrated results in the field was obtained by Lyu and Li (2020); Ji and Telgarsky (2020): that after achieving perfect training accuracy, gradient flow implicitly regularizes locally Lipschitz, positively homogeneous, and o-minimally definable non-smooth networks so that their parameters converge in direction to a margin maximization KKT point. This precise bias towards margin maximization for this wide class of networks has been the basis of numerous theoretical works (cf. Vardi (2023)), as well as remarkable practical methods such as the reconstruction of training data by Haim, Vardi, Yehudai, Shamir, and Irani (2022); Buzaglo, Haim, Yehudai, Vardi, Oz, Nikankin, and Irani (2023).

Nevertheless, already Soudry et al. (2018) observed that algorithms such as Adam (Kingma and Ba, 2015) may not perform the same implicit regularization, and posed the research question:

Can we characterize the bias of adaptive optimization algorithms for classification problems?

Equal contribution.
The pertinence of this question was further attested by Wilson, Roelofs, Stern, Srebro, and Recht (2017), who demonstrated that, for several realistic deep learning models, the solutions found by adaptive methods often generalize significantly worse than those found by stochastic gradient descent, even when the former solutions have better training performance.

Extending the pioneering work of Gunasekar, Lee, Soudry, and Srebro (2018); Qian and Qian (2019); Zhou, Feng, Ma, Xiong, Hoi, and E (2020), a major advance was made by Wang, Meng, Chen, and Liu (2021), who proved that, for the same wide class of networks as admitted by Lyu & Li (2020); Ji & Telgarsky (2020) and in the continuous setting corresponding to an infinitesimal learning rate, if the adapter¹ of an algorithm without momentum can be shown to converge to an isotropic vector without large fluctuations, then the implicit regularization is the same as for plain gradient flow. They also established that this holds for RMSProp (Hinton, Srivastava, and Swersky, 2012) and Adam without momentum,² but fails for AdaGrad (Duchi, Hazan, and Singer, 2011).

Our contributions. In this work, we tackle the posed research question for AdaDelta (Zeiler, 2012), which has remained open. AdaDelta is one of the main adaptive optimization algorithms, implemented in PyTorch³, and known to perform well in many circumstances compared to other algorithms including RMSProp and Adam (cf. e.g. Ruder (2016)). Specifically:

- we overcome the technical challenge of AdaDelta having exponentially decaying averages in both the numerator and the denominator of the adapter (which makes it not readily amenable to the techniques of Wang et al. (2021)), and prove that the adapter has the required convergence properties, which enables us to conclude the same implicit regularization as for plain gradient flow;
- we show that the implicit regularization critically depends on the numerical stability terms in the numerator and the denominator of the adapter, and that it changes if they are permitted to have different components so that their quotient is not isotropic;
- we also investigate permitting the exponential decay coefficients to vary with time, and show that under a mild assumption on their integrals, the implicit regularization is not affected;
- we corroborate these theoretical results in three empirical settings, ranging from a simple visualization to a 14-layer convolutional network on CIFAR-10.

Further related work. The implicit regularization of adaptive optimization algorithms was the focus of several other recent works. Wang, Meng, Zhang, Sun, Chen, Ma, and Liu (2022) proved that momentum does not affect the implicit bias towards margin maximization for linear classification. Tarzanagh, Li, Zhang, and Oymak (2023) showed margin maximization when optimizing attention by gradient descent. Cattaneo, Klusowski, and Shigida (2024) studied the implicit bias of Adam and RMSProp by backward error analysis. Zhang, Zou, and Cao (2024) established that Adam without a numerical stability term maximizes the ℓ∞-norm margin for linear logistic regression. Some of the prior empirical works on generalization of adaptive methods are Keskar and Socher (2017); Zaheer, Reddi, Sachan, Kale, and Kumar (2018); Luo, Xiong, Liu, and Sun (2019); Chen, Zhou, Tang, Yang, Cao, and Gu (2020).

Convergence properties of the popular Adam algorithm have been considered further in works that include the following.
Wang, Fu, Zhang, Zheng, and Chen (2023) proved a tight upper bound on Adam's iteration complexity. Wang, Zhang, Zhang, Meng, Sun, Ma, Liu, Luo, and Chen (2024) studied convergence of randomly reshuffled Adam with diminishing learning rate and without assuming bounded smoothness, and showed that it can be faster than stochastic gradient descent with diminishing learning rate. Thilak, Littwin, Zhai, Saremi, Paiss, and Susskind (2024) identified and investigated a slingshot phenomenon in late-stage Adam and related it to grokking (Power, Burda, Edwards, Babuschkin, and Misra, 2022). Xie and Li (2024) showed that Adam with decoupled weight decay implicitly performs constrained optimization.

¹In an adaptive algorithm, for each model parameter, the learning rate is adapted by a separate coefficient. The adapter consists of those coefficients, and is updated at every step of the algorithm.
²Here the difference between RMSProp and Adam without momentum is that the latter involves a bias-correction coefficient in the computation of the exponentially decaying average.
³https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html

A number of works have investigated convergence of classes of adaptive optimization algorithms with momentum, often by means of continuous-time limits. Da Silva and Gazeau (2020) derived a system of non-autonomous differential equations that is the limit for a class of algorithms that includes Adam, AdaFom, Heavy Ball, and Nesterov's Accelerated Gradient (NAG); and they studied convergence of its trajectories. In a concurrent work, Barakat and Bianchi (2021) focused on a similar system of equations, handling an irregular vector field that arises from the original version of Adam, and establishing further connections with the discrete-time iterates. The latter were extended by Barakat, Bianchi, Hachem, and Schechtman (2021) to allow noise in the Euler discretization of the non-autonomous differential equation system, thus being able to express algorithms such as stochastic Heavy Ball and stochastic NAG, and in particular to prove convergence of stochastic NAG in a non-convex setting. In a concurrent work, Gadat and Gavra (2022) were also able to prove almost sure escape from local maxima with low mini-batch noise level. More recently, Xiao, Hu, Liu, and Toh (2024) provided a framework for establishing convergence properties of Adam-family algorithms without assuming smoothness, which is thus applicable to neural networks with non-smooth activations.

Understanding generalization in deep learning by means of obtaining bounds in terms of the normalized margin and related quantities has attracted extensive attention, cf. e.g. Bartlett, Foster, and Telgarsky (2017); Neyshabur, Bhojanapalli, and Srebro (2018); Golowich, Rakhlin, and Shamir (2018); Wei, Lee, Liu, and Ma (2019); Wei and Ma (2020); Glasgow, Wei, Wootters, and Ma (2023). Although a variety of questions remain open, there is significant evidence behind the general understanding that larger normalized margins lead to better generalization, cf. e.g. (Jiang, Krishnan, Mobahi, and Bengio, 2019; Jiang, Neyshabur, Mobahi, Krishnan, and Bengio, 2020).

2 Preliminaries

Basic notation. We write: [n] for the set {1, ..., n}; ⟨u, v⟩ for the inner product of vectors u and v; ∥v∥ = √⟨v, v⟩ for the Euclidean length of a vector v; v_i for the i-th component of a vector v; and 0, 1, etc.
for the vectors whose dimension is inferred from the context and all of whose components are equal to the specified value, so that for all i we have 0_i = 0, 1_i = 1, etc. Provided u and v are vectors of equal dimensions, we write uv, u/v, v², √v, etc. for the component-wise product, quotient, square, square root, etc. operations, so that for all i we have (uv)_i = u_i v_i, (u/v)_i = u_i/v_i, (v²)_i = (v_i)², (√v)_i = √(v_i), etc. Similarly, we write u < v, u ≤ v, etc. for the component-wise less than, less than or equal, etc. relations, so that we have u < v ⟺ ∀i: u_i < v_i, u ≤ v ⟺ ∀i: u_i ≤ v_i, etc.

Local Lipschitz continuity and the Clarke subdifferential. Suppose a function f: R^k → R is locally Lipschitz, i.e. every point v ∈ R^k has a neighborhood U such that f is Lipschitz continuous on U. By Rademacher's theorem (cf. e.g. Borwein and Lewis (2010, Theorem 9.1.2)), f is then differentiable almost everywhere. The Clarke subdifferential of f at a point v is the convex hull

∂f(v) := conv { lim_{i→∞} ∇f(v^(i)) : lim_{i→∞} v^(i) = v and ∇f(v^(i)) exists for all i }.

It is nonempty and compact for all v, and equals the singleton {∇f(v)} if f is continuously differentiable at v (Clarke, 1975). It consists of subgradients, which we may refer to simply as gradients.

O-minimal structures and definable functions. An o-minimal structure S is a family {S_k}_{k=1}^∞ such that: each S_k is a set of subsets of R^k; S_1 is the set of all finite unions of open intervals and points; each S_k contains the zero sets of all polynomials on R^k; each S_k is closed under finite union, finite intersection, and complement; each S_{k+k'} contains the Cartesian products of all sets in S_k and S_{k'}; and each S_k contains the projections of all sets in S_{k+1} onto the first k components. A function f: R^k → R^{k'} is definable in S if and only if its graph is a set in S_{k+k'}. For every o-minimal structure, the collection of all definable functions is closed under algebraic operations, composition, inverse, maximum, minimum, etc. (cf. e.g. Ji & Telgarsky (2020, Appendix B)). Moreover, by Wilkie's theorem (Wilkie, 1996), there exists an o-minimal structure in which the exponential function is definable.

Procedure 1 (Discrete generalized AdaDelta). In step k+1, using a current gradient ∇̃L(w_k) ∈ ∂L(w_k), it computes the next exponentially decaying average g_{k+1} of squared gradients, and computes the next adapted gradient Δ_{k+1}. It then computes the next exponentially decaying average h_{k+1} of squared adapted gradients, and computes the next parameter vector w_{k+1} by subtracting Δ_{k+1} scaled by the learning rate. The hyperparameters are: learning rate η > 0, exponential decay coefficients ρ_k ∈ [0, 1]^p, and numerical stability terms δ > 0 and ε > 0. The implementation of AdaDelta in PyTorch³ corresponds to the special case where ρ_k = ϱ1 for all k, and δ = ε = ϵ1. By specializing further to η = 1, we obtain the original AdaDelta (Zeiler, 2012).

g_{k+1} = ρ_k g_k + (1 − ρ_k) ∇̃L(w_k)²
Δ_{k+1} = √((ε + h_k)/(δ + g_{k+1})) ∇̃L(w_k)
h_{k+1} = ρ_k h_k + (1 − ρ_k) Δ_{k+1}²
w_{k+1} = w_k − η Δ_{k+1}
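To make the update concrete, the following is a minimal NumPy sketch of one step of Procedure 1 on flattened parameter and hyperparameter vectors; the function name and the user-supplied `grad_fn` are our own illustrative choices, not part of the paper.

```python
import numpy as np

def generalized_adadelta_step(w, g, h, grad_fn, eta, rho, delta, eps):
    """One step of the discrete generalized AdaDelta of Procedure 1.

    w, g, h         : current parameters and decaying averages, arrays of shape (p,)
    grad_fn         : returns a (sub)gradient of the total loss L at w
    eta             : learning rate
    rho, delta, eps : decay coefficients and numerical stability terms, arrays of shape (p,)
    """
    grad = grad_fn(w)                                    # current gradient of L at w_k
    g_next = rho * g + (1.0 - rho) * grad ** 2           # decaying average of squared gradients
    step = np.sqrt((eps + h) / (delta + g_next)) * grad  # adapted gradient Delta_{k+1}
    h_next = rho * h + (1.0 - rho) * step ** 2           # decaying average of squared adapted gradients
    w_next = w - eta * step                              # parameter update
    return w_next, g_next, h_next
```

Setting rho to a constant ϱ1, delta = eps = ϵ1, and eta = 1 recovers the original AdaDelta update.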
Predictor and loss functions. We assume the following properties of the predictor function Φ(w, x) with parameters w ∈ R^p, inputs x ∈ R^d, and scalar outputs. The L-positive homogeneity means that scaling the parameters w by any α > 0 scales the output by α^L, i.e. Φ(αw, x) = α^L Φ(w, x).

Assumption 1. For some L > 0 and some o-minimal structure S in which the exponential function is definable, for each x we have that Φ(w, x) as a function of w is: (i) locally Lipschitz; (ii) L-positively homogeneous; (iii) definable in S.

This assumption admits neural networks that are constructed from a wide variety of layer types, including fully connected, convolutional, ReLU, Leaky ReLU, squared ReLU, cube activation, and max-pooling, which may be composed arbitrarily. Points of nondifferentiability, such as at 0 in the case of the ReLU nonlinearity, are permitted because we assume only local Lipschitzness instead of continuous differentiability, and we work with the Clarke subdifferential instead of the gradient. However, due to the L-positive homogeneity (ii), skip connections are excluded, biases are excluded except at the first layer, and activations based on the exponential function (e.g. the sigmoid activation) are excluded. The homogeneity exponent L is the degree to which the parameters are multiplied when computing the network output; this may be determined by starting with exponent 0 and going forwards through the layers, e.g. adding 1 to the exponent if the layer is fully connected, not changing the exponent if the layer is ReLU, doubling the exponent if the layer is squared ReLU, etc.

We consider minimizing the total loss L(w) := Σ_{i=1}^n ℓ(y^(i) Φ(w, x^(i))), which is with respect to a finite dataset (x^(i) ∈ R^d, y^(i) ∈ {±1})_{i=1}^n and where the individual loss function is one of:

ℓ(z) = e^{−z} for the exponential loss;
ℓ(z) = log(1 + e^{−z}) for the logistic loss.

Now we make use of the definability of the exponential function in the o-minimal structure S from Assumption 1. (The exponential function was effectively excluded by the homogeneity clause (ii) from the building of the predictor function Φ(w, x).) Since the individual loss function ℓ (in both cases, exponential and logistic) is locally Lipschitz and definable in S, the same is true of the total loss function L. Note however that the loss functions we consider are not homogeneous.
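As a small illustration of the homogeneity exponent (our own example, not from the paper), the following sketch checks numerically that a bias-free two-layer ReLU network is 2-positively homogeneous in its parameters: the first fully connected layer contributes 1 to the exponent, the ReLU contributes nothing, and the second fully connected layer contributes 1 again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bias-free two-layer ReLU network: Phi(w, x) = v^T relu(W x), parameters w = (W, v).
W = rng.normal(size=(16, 8))
v = rng.normal(size=16)
x = rng.normal(size=8)

def phi(W, v, x):
    return v @ np.maximum(W @ x, 0.0)

alpha = 3.0
lhs = phi(alpha * W, alpha * v, x)   # scale all parameters by alpha
rhs = alpha ** 2 * phi(W, v, x)      # L-positive homogeneity with L = 2
print(np.isclose(lhs, rhs))          # True
```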
Generalized AdaDelta flow trajectories. The starting point of our analysis is the adaptive gradient descent of Procedure 1, which is a generalization of AdaDelta (Zeiler, 2012) by allowing: the learning rate to be specified (this is already the case in the PyTorch implementation³), the exponential decay coefficients to vary with time (e.g. by following a specified schedule), and both those coefficients and the numerical stability terms to be specified differently across the vector components.

The focus of our theoretical study is the adaptive gradient flow that corresponds to the adaptive gradient descent with an infinitesimal learning rate. To arrive at its definition, we first restate the equations of Procedure 1 as follows. We eliminate the auxiliary variable Δ_{k+1}, we suppose the hyperparameters ρ_k obey that (1 − ρ_k)/η is constant with respect to the hyperparameter η, and we replace steps k by times t = kη:

(g(t + η) − g(t))/η = (1 − ρ(t)) (∇̃L(w(t))² − g(t))
(h(t + η) − h(t))/η = (1 − ρ(t)) (((w(t + η) − w(t))/η)² − h(t))
(w(t + η) − w(t))/η = −√((ε + h(t))/(δ + g(t + η))) ∇̃L(w(t)).

Now, regarding these equations as determining the endpoints g(t + η), h(t + η), w(t + η) of the next line segments in some continuous polygonal curves g, h, w: [0, ∞) → R^p born from the adaptive gradient descent of Procedure 1 with learning rate η (cf. e.g. Elkabetz and Cohen (2021)), letting η tend to 0 we obtain that the limits of their directions are given by the right-hand sides of eqs. (1) to (3), which we take to be the derivatives that determine the adaptive gradient flow we are seeking.

Process 2 (Continuous generalized AdaDelta). This adaptive gradient flow, where ∇̃L(w(t)) ∈ ∂L(w(t)) is a gradient of the loss at the current parameters w(t), corresponds to the adaptive gradient descent of Procedure 1 with an infinitesimal learning rate. The hyperparameters are: exponential decay coefficients schedule ρ: [0, ∞) → [0, 1]^p, and numerical stability terms δ > 0 and ε > 0.

g'(t) = (1 − ρ(t)) (∇̃L(w(t))² − g(t))    (1)
h'(t) = (1 − ρ(t)) (w'(t)² − h(t))    (2)
w'(t) = −√((ε + h(t))/(δ + g(t))) ∇̃L(w(t))    (3)

We therefore analyze trajectories g, h, w: [0, ∞) → R^p of the two exponentially decaying averages and of the parameters, which are arcs (i.e. absolutely continuous on every compact subinterval) and which obey the adaptive gradient flow of Process 2 for almost all t ≥ 0.

Assumption 2. (i) ∫_0^∞ (1 − ρ(t)) dt = ∞. (ii) g(0) ≥ 0 and h(0) ≥ 0. (iii) There exists a time t_0 such that L(w(t_0)) < ℓ(0).

This is a mild regularity assumption. Part (i) ensures that the exponential decay coefficients are not scheduled to approach 1 so fast that they stop the learning, and for example it is satisfied by any constant coefficients smaller than 1. Part (ii) just requires that the exponentially decaying averages of squared gradients and squared adapted gradients are initialized as nonnegative (in original AdaDelta they are initialized to zero). Part (iii) assumes that the dataset is separable by the network, and moreover that for every parameters trajectory w that we consider arising from the generalized AdaDelta flow, there exists a time t_0 at which separation (i.e. perfect training accuracy) is achieved; this commonly occurs when training overparameterized neural networks by gradient-based algorithms (cf. e.g. Zhang et al. (2021)). We also remark that, for both the exponential and the logistic individual loss function ℓ(z), if f(z) is defined by ℓ(z) = e^{−f(z)}, then f(z) is monotonically increasing, and so the condition L(w(t_0)) < ℓ(0) is equivalent to f^{−1}(log(1/L(w(t_0)))) > 0.

Admittance of a chain rule. Since the total loss function L is locally Lipschitz and definable in S (for both the exponential and the logistic individual loss functions), and the trajectory w of the parameters is an arc, we are able to use the following fact applied to them.

Proposition 1 (Davis, Drusvyatskiy, Kakade, and Lee (2020, Theorem 5.8)). If f: R^k → R is locally Lipschitz and definable in an o-minimal structure, then it admits a chain rule: for all arcs v: [0, ∞) → R^k, almost all t ≥ 0, and all u ∈ ∂f(v(t)), we have df(v(t))/dt = ⟨u, dv(t)/dt⟩.

Directional convergence, KKT conditions, and margin maximization. That a trajectory v converges in direction to a vector u means lim_{t→∞} v(t)/∥v(t)∥ = u/∥u∥.

Following Dutta, Deb, Tulshyan, and Arora (2013, Section 2.2), supposing f, g_1, ..., g_n: R^k → R are locally Lipschitz, we have that v ∈ R^k is a Karush-Kuhn-Tucker point of the problem

minimize: f(v)  subject to: g_i(v) ≤ 0 for all i ∈ [n]

if and only if there exist Lagrange multipliers λ_1, ..., λ_n ≥ 0 such that: (feasibility) g_i(v) ≤ 0 for all i ∈ [n]; (equilibrium inclusion) 0 ∈ ∂f(v) + Σ_{i=1}^n λ_i ∂g_i(v); (complementary slackness) λ_i g_i(v) = 0 for all i ∈ [n].
For local optimality, these first-order conditions are necessary, however in general not sufficient.

The margin of a parameters vector w is the smallest label-adjusted prediction for an input, i.e. formally min_{i∈[n]} y^(i) Φ(w, x^(i)). By the L-positive homogeneity of the predictor function (cf. Assumption 1.(ii)), the normalized margin

min_{i∈[n]} y^(i) Φ(w, x^(i))/∥w∥^L = min_{i∈[n]} y^(i) Φ(w/∥w∥, x^(i))

depends only on the direction of w, and it is straightforward to show that the directions that maximize it are also the optimal directions of the problem

minimize: (1/2)∥w∥²  subject to: y^(i) Φ(w, x^(i)) ≥ 1 for all i ∈ [n].    (4)

3 Main result

We prove that continuous generalized AdaDelta obeys the same tight rates for convergence of the loss and growth of the parameters as were established for plain gradient flow by Lyu & Li (2020, Corollary A.11), and also implicitly regularizes the parameters to converge in direction to a KKT point of a variant of the margin maximization problem in eq. (4), in which the objective function (1/2)∥w∥² is replaced by the skewed (1/2)∥⁴√(δ/ε) w∥². If the quotient δ/ε of the numerical stability hyperparameters is isotropic (i.e. δ_j/ε_j are equal for all j ∈ [p]), then this modification does not alter the directions of the KKT points, so the implicit regularization is the same as was established for plain gradient flow by Lyu & Li (2020, Theorem A.8) and Ji & Telgarsky (2020, Theorem 3.1). Otherwise, the sets of points w that have the same objective value (1/2)∥⁴√(δ/ε) w∥² are ellipsoids which are not spheres, and the directions of the KKT points may be different from those for the margin maximization problem in eq. (4). Moreover, these conclusions are robust with respect to changing the exponential decay coefficients schedule hyperparameter ρ.

Theorem 2. Under Assumptions 1 and 2, for the continuous generalized AdaDelta defined in Process 2, with either the exponential loss or the logistic loss, we have that L(w(t)) = Θ(1/(t (log t)^{2−2/L})), ∥w(t)∥ = Θ((log t)^{1/L}), and w(t) converges in direction to a KKT point of the problem

minimize: (1/2)∥⁴√(δ/ε) w∥²  subject to: y^(i) Φ(w, x^(i)) ≥ 1 for all i ∈ [n].

Remark 3. The growth of the Euclidean length ∥w(t)∥ of the parameters to infinity during the training process is an artifact of the L-positively homogeneous predictor function and the strictly decreasing individual loss function (both exponential and logistic): indeed, once perfect training accuracy is achieved, scaling the parameters w(t) by any α > 1 decreases the total loss. Theorem 2 however establishes that this growth is slow, bounded above and below by the L-th root of log t. Note also that the convergence shown in Theorem 2 to a KKT point of the skewed normalized margin maximization problem is directional, and so independent of the growth of ∥w(t)∥.
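As a minimal illustration (our own, not from the paper) of how a non-isotropic quotient δ/ε can skew the limit direction, take p = 2, a linear predictor Φ(w, x) = ⟨w, x⟩ (so L = 1), and a single training point x = (1, 1) with label y = 1:

$$
\begin{aligned}
\text{isotropic } \delta/\varepsilon = (1,1):\quad & \min\ \tfrac12\,(w_1^2 + w_2^2)\ \text{ s.t. }\ w_1 + w_2 \ge 1 \;\Longrightarrow\; w^\star = (\tfrac12, \tfrac12),\\
\text{skewed } \delta/\varepsilon = (16,1):\quad & \min\ \tfrac12\,(4\,w_1^2 + w_2^2)\ \text{ s.t. }\ w_1 + w_2 \ge 1 \;\Longrightarrow\; w^\star = (\tfrac15, \tfrac45).
\end{aligned}
$$

Here ⁴√(δ/ε) = (2, 1) doubles the weight of w_1 in the objective, and the two optima point in different directions, so with a non-isotropic quotient the limit direction need not maximize the normalized margin of eq. (4).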
Our proof of Theorem 2 builds on the following result, stated here in terms of the notations in this paper.

Theorem 4 (Wang et al. (2021, Theorems 2, 3, and 10)). Under Assumption 1, if χ ∈ R^p has no zero component and v, β: [0, ∞) → R^p are arcs that obey the adaptive gradient flow

v'(t) = −β(t) ∇̃L̂(v(t)) for almost all t ≥ 0,

where L̂(v) := L(χv) = Σ_{i=1}^n ℓ(y^(i) Φ(χv, x^(i))) is with either exponential or logistic ℓ, if L̂(v(t_0)) < ℓ(0) at some time t_0, if lim_{t→∞} β(t) = 1, and if d log β(t)/dt is Lebesgue integrable, then we have that L̂(v(t)) = Θ(1/(t (log t)^{2−2/L})), ∥v(t)∥ = Θ((log t)^{1/L}),⁴ and v(t) converges in direction to a KKT point of the problem

minimize: (1/2)∥v∥²  subject to: y^(i) Φ(χv, x^(i)) ≥ 1 for all i ∈ [n].

⁴Wang et al. (2021, Theorem 10) contains a typo: the bound Θ(1/(log t)^{1/L}) should be Θ((log t)^{1/L}).

To apply Theorem 4 to the continuous generalized AdaDelta in Process 2, we define for all t ≥ 0:

χ := ⁴√(ε/δ)
v(t) := w(t)/χ
β(t) := √((1 + h(t)/ε)/(1 + g(t)/δ))
∇̃L̂(v(t)) := χ ∇̃L(w(t)).

In terms of the skewing vector χ and the skewed adapter β(t), eq. (3) that governs the parameters trajectory can then be restated as

w'(t) = −χ² β(t) ∇̃L(w(t)) for almost all t ≥ 0.    (5)

Hence v'(t) = w'(t)/χ = −χ β(t) ∇̃L(w(t)) = −β(t) ∇̃L̂(v(t)) for almost all t ≥ 0, as required. From Assumption 2.(iii), we have L̂(v(t_0)) = L(w(t_0)) < ℓ(0), also as required. Therefore, to establish Theorem 2, it remains to prove that lim_{t→∞} β(t) = 1 and that d log β(t)/dt is Lebesgue integrable.

The remainder of this section consists of a sequence of lemmas, which culminates in Lemma 8 that shows the former fact and Lemma 9 that shows the latter fact. The lemmas in particular make use of Assumption 2.(ii) on the initialization of the arcs of the exponentially decaying averages g(t) and h(t), and of Assumption 2.(i) on the unboundedness of the integrals of the complements of the exponential decay coefficients, to show that each component of the skewed adapter β(t) monotonically (either from below or from above, depending on g(0) and h(0), and on the numerical stability terms δ and ε) tends to 1 as the time t tends to infinity.

Our first lemma shows a useful expression for the derivative of 1 minus the squared skewed adapter.

Lemma 5. For almost all t ≥ 0 we have d/dt (1 − β(t)²) = −((1 − ρ(t))/(1 + g(t)/δ)) (1 − β(t)²).

Proof. Observe that

d/dt (1 − β(t)²)
= (−h'(t)/ε − h'(t) g(t)/(εδ) + g'(t)/δ + g'(t) h(t)/(δε)) / (1 + g(t)/δ)²    [by the def. of β(t)]
= −((1 − ρ(t))/(1 + g(t)/δ)²) (w'(t)²/ε − h(t)/ε − ∇̃L(w(t))²/δ + g(t)/δ + w'(t)² g(t)/(εδ) − ∇̃L(w(t))² h(t)/(δε))    [by eqs. (1) and (2)]
= −((1 − ρ(t))/(1 + g(t)/δ)²) (g(t)/δ − h(t)/ε + (w'(t)²/ε)(1 + g(t)/δ) − (∇̃L(w(t))²/δ)(1 + h(t)/ε))    [rearranging]
= −((1 − ρ(t))/(1 + g(t)/δ)²) (g(t)/δ − h(t)/ε)    [by eq. (5)]
= −((1 − ρ(t))/(1 + g(t)/δ)) (1 − β(t)²)    [calculation]

for almost all t ≥ 0.

The second lemma verifies that the exponentially decaying averages of squared gradients and squared adapted gradients are nonnegative at all times.

Lemma 6. For all t ≥ 0 we have g(t) ≥ 0 and h(t) ≥ 0.

Proof. Suppose g(t)_j < 0 for some t ≥ 0 and j ∈ [p]. By Assumption 2.(ii) and since g_j: [0, ∞) → R is an arc, there exists t_0 < t such that g(t_0)_j = 0 and for all τ ∈ (t_0, t] we have g(τ)_j < 0. Recalling that ρ_j: [0, ∞) → [0, 1], from eq. (1) we obtain that g'(τ)_j ≥ 0 for almost all τ ∈ (t_0, t]. Hence g(t)_j = g(t_0)_j + ∫_{t_0}^t g'(τ)_j dτ ≥ 0, which is a contradiction. Supposing h(t)_j < 0 for some t ≥ 0 and j ∈ [p], we obtain a contradiction analogously, using eq. (2).
Consider an arbitrary j-th component of the skewed adapter. By Assumption 2.(ii), its initial value β(0)_j is positive. The proof of the third lemma shows that there are three cases: if β(0)_j < 1 then β(t)_j is monotonically nondecreasing and bounded above by 1, if β(0)_j = 1 then β(t)_j is constant, and if β(0)_j > 1 then β(t)_j is monotonically nonincreasing and bounded below by 1.

Lemma 7. The following hold for each j ∈ [p]. (i) For almost all t ≥ 0 we have: if 0 < β(t)_j < 1 then β'(t)_j ≥ 0, if β(t)_j = 1 then β'(t)_j = 0, and if β(t)_j > 1 then β'(t)_j ≤ 0. (ii) For all t ≥ 0 we have β(t)_j ≥ min{β(0)_j, 1}.

Proof. Consider an arbitrary j ∈ [p]. For part (i), suppose t ≥ 0 is such that Lemma 5 applies. If 0 < β(t)_j < 1 then 1 − β(t)_j² > 0, so from Lemmas 5 and 6, and by recalling that 0 ≤ ρ(t)_j ≤ 1, we have d(1 − β(t)_j²)/dt ≤ 0, i.e. dβ(t)_j²/dt ≥ 0. But dβ(t)_j²/dt = 2β(t)_j β'(t)_j, and so β'(t)_j ≥ 0. In the remaining two cases, namely if β(t)_j = 1 or β(t)_j > 1, the reasoning is analogous. Now part (ii) follows by part (i) and the fact that, since g_j, h_j: [0, ∞) → R are arcs, so is β_j.

Lemma 7 provides a lower bound for the skewed adapter, but it leaves open its asymptotic behavior. The next lemma fills that gap, establishing that all its components converge to 1. This equivalently means that the continuous generalized AdaDelta adapter √((ε + h(t))/(δ + g(t))) converges to the squared skewing vector χ² = √(ε/δ).

Lemma 8. We have that lim_{t→∞} β(t) = 1.

Proof. We first use Lemma 7 to prove that the squared gradients have bounded integrals. To that end, for all t ≥ 0 we have

L(w(0)) > L(w(0)) − L(w(t))    [since L(w(t)) > 0]
= −∫_0^t (dL(w(τ))/dτ) dτ    [calculus]
= −∫_0^t ⟨∇̃L(w(τ)), w'(τ)⟩ dτ    [by Proposition 1]
= ∫_0^t ⟨∇̃L(w(τ)), √(ε/δ) β(τ) ∇̃L(w(τ))⟩ dτ    [by eq. (5)]
≥ ∫_0^t ⟨√(ε/δ) min{β(0), 1}, ∇̃L(w(τ))²⟩ dτ    [by Lemma 7.(ii)]
≥ √(ε_j/δ_j) min{β(0)_j, 1} ∫_0^t ∇̃L(w(τ))²_j dτ    [rearranging],

so for each j ∈ [p] we have

∫_0^∞ ∇̃L(w(τ))²_j dτ < √(δ_j/ε_j) L(w(0)) / min{β(0)_j, 1} =: C_j.

Now suppose j ∈ [p]. Equation (1) and Lemma 6 imply that

g(t)_j = g(0)_j + ∫_0^t g'(τ)_j dτ ≤ g(0)_j + ∫_0^t ∇̃L(w(τ))²_j dτ < g(0)_j + C_j =: C'_j for all t ≥ 0,

so recalling Lemma 5 and Lemma 7.(i), and setting C''_j := 1/(1 + C'_j/δ_j), we have

d log|1 − β(t)_j²|/dt ≤ −C''_j (1 − ρ(t)_j) for almost all t ≥ 0 such that β(t)_j ≠ 1.

Hence

|1 − β(t)_j²| ≤ |1 − β(0)_j²| exp(−C''_j ∫_0^t (1 − ρ(τ)_j) dτ) for all t ≥ 0,

which by positivity of C''_j and unboundedness of the integrals of the complements of the exponential decay coefficients (cf. Assumption 2.(i)) establishes the lemma.

Our final lemma shows that the components of the skewed adapter converge without large fluctuations of their logarithms.

Lemma 9. The function d log β(t)/dt is Lebesgue integrable.

Proof. For each j ∈ [p], from the proof of Lemma 7 we have that d log β(t)_j/dt = β'(t)_j/β(t)_j is either nonnegative for almost all t ≥ 0, or nonpositive for almost all t ≥ 0. Thus by Lemma 8 we have

∫_0^∞ |d log β(t)_j/dt| dt = |∫_0^∞ (d log β(t)_j/dt) dt| = |log 1 − log β(0)_j| = |log β(0)_j| < ∞.
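The following is a small numerical sketch (our own illustration, with made-up data and hyperparameters) that integrates eqs. (1) to (3) by forward Euler for a linear predictor Φ(w, x) = ⟨w, x⟩ with the exponential loss, and prints the skewed adapter β(t); the averages g and h are initialized away from zero so that both monotone cases of Lemma 7 are visible.

```python
import numpy as np

# Toy separable dataset for a linear predictor Phi(w, x) = <w, x> (L = 1), exponential loss.
X = np.array([[ 1.0,  0.5],
              [ 0.8, -0.2],
              [-1.0, -0.6],
              [-0.7,  0.3]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def loss_grad(w):
    # gradient of L(w) = sum_i exp(-y_i <w, x_i>)
    e = np.exp(-y * (X @ w))
    return -(y * e) @ X

# Hyperparameters of Process 2; the non-isotropic delta/eps is deliberate.
delta = np.array([0.5, 2.0])
eps   = np.array([1.0, 0.5])
rho   = np.array([0.9, 0.9])

w = np.array([0.1, -0.2])
g = np.array([4.0, 0.0])   # nonnegative initializations, per Assumption 2.(ii)
h = np.array([0.0, 2.0])
dt = 1e-3
for step in range(1, 200_001):
    grad = loss_grad(w)
    w_dot = -np.sqrt((eps + h) / (delta + g)) * grad   # eq. (3)
    g += dt * (1.0 - rho) * (grad ** 2 - g)            # eq. (1)
    h += dt * (1.0 - rho) * (w_dot ** 2 - h)           # eq. (2)
    w += dt * w_dot
    if step % 40_000 == 0:
        beta = np.sqrt((1.0 + h / eps) / (1.0 + g / delta))
        print(f"t = {step * dt:6.1f}   beta = {beta}")
# In line with Lemmas 7 and 8, each component of beta drifts monotonically towards 1,
# one from below and one from above for these initializations.
```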
4 Experiments

To test our theoretical result for generalized AdaDelta, we evaluated comparatively five algorithms:

- SGD: Stochastic gradient descent as implemented in PyTorch⁵.
- AdaDelta: Its standard PyTorch implementation³, with the exponential decay coefficient ϱ = 0.9, and the numerical stability term ϵ = 10^{−5}.
- AdaDelta S: This is AdaDelta amended to have the exponential decay coefficients follow the schedule ϱ_k = 1 − 0.1/(1 + ⌊100k/K⌋), where K is the total number of steps. Thus 1 − ϱ_k follows a harmonic sequence, increasing the coefficient from ϱ_0 = 0.9 at the first step to ϱ_{K−1} = 0.999 at the last step, which lessens the aggressiveness of the decay in computing the averages of the squared gradients and the squared adapted gradients along the training.
- AdaDelta N: This is AdaDelta amended to have the numerical stability terms different in the numerator and the denominator of the adapter, and different across the network parameters. At the start of the training, each component of δ and of ε is sampled independently from 10^{−5+X}, where X is a centered Gaussian with standard deviation 1 for the smaller two networks we consider, and with standard deviation 0.25 for the largest network.
- AdaDelta NS: This combines the extensions in AdaDelta S and AdaDelta N, i.e. has exponential decay coefficients that follow the specified schedule as well as numerical stability terms whose components are initialized randomly as above. Both amendments are sketched in code below.

⁵https://pytorch.org/docs/stable/generated/torch.optim.SGD.html

We performed experiments in the following three gradually more complex settings. The total compute for the perceptron setting was around 10 min on a mid-range CPU; whereas one run of all five algorithms for the smaller convolutional setting took roughly 2 h, and for the larger convolutional setting took roughly 12 h, in both cases on a mid-range GPU.
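As a rough sketch of the two amendments (our own illustrative code; the actual experiments use the PyTorch AdaDelta implementation, which exposes only scalar rho and eps, so a faithful reproduction would modify the optimizer itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def adadelta_s_schedule(K):
    """Decay-coefficient schedule of AdaDelta S: rho_k = 1 - 0.1/(1 + floor(100k/K))."""
    k = np.arange(K)
    return 1.0 - 0.1 / (1.0 + np.floor(100.0 * k / K))

def adadelta_n_stability_terms(num_params, sigma):
    """AdaDelta N: per-parameter stability terms delta, eps ~ 10**(-5 + X), X ~ N(0, sigma^2)."""
    delta = 10.0 ** (-5.0 + sigma * rng.standard_normal(num_params))
    eps   = 10.0 ** (-5.0 + sigma * rng.standard_normal(num_params))
    return delta, eps

rho = adadelta_s_schedule(K=5000)
print(rho[0], rho[-1])    # 0.9 at the first step, 0.999 at the last step
delta, eps = adadelta_n_stability_terms(num_params=10, sigma=1.0)
```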
Two-layer Leaky ReLU perceptron on a two-dimensional dataset. Following Wang et al. (2021), we first considered a perceptron with parameters v ∈ R and w ∈ R², whose prediction for an input x ∈ R² equals v σ(⟨w, x⟩), where σ is the Leaky ReLU nonlinearity with inactive gradient 0.5. The dataset (x^(1), y^(1)), ..., (x^(100), y^(100)) consists of 50 points sampled from (cos 0.5, sin 0.5) + u independently and labelled 1, and 50 points sampled from −(cos 0.5, sin 0.5) + u independently and labelled −1, where u is distributed uniformly on the square [−0.6, 0.6]². This toy setting is convenient for three-dimensional visualizations of the adapter's reciprocal square roots.

We trained the network for K = 5000 full-batch epochs using the exponential loss, with learning rate 0.1 for SGD and learning rate 1 for the four variants of AdaDelta. The learning rate for SGD was chosen so that it achieves perfect training accuracy after a similar number of epochs as the four versions of AdaDelta. We repeated the training 100 times, where in each round the five algorithms used the same randomly initialized parameters and numerical stability terms (if applicable). The results are shown in fig. 1. We observe that:

- all rounds achieve 100% training accuracy by around 256 epochs, in line with the separation Assumption 2.(iii);
- the final normalized margin, which here equals min_{i∈[100]} y^(i) v_K σ(⟨w_K, x^(i)⟩)/(v_K² + ∥w_K∥²), is within a higher and much narrower range for SGD and with isotropic numerical stability hyperparameters (i.e. for AdaDelta and AdaDelta S), confirming the prediction of Theorem 2 that, without the isotropy, the implicit regularization may not be towards margin maximization;
- the final adapter's reciprocal square root, which in the notations of Procedure 1 equals ⁴√((δ + g_K)/(ε + h_{K−1})), has a large variance of its direction when the numerical stability hyperparameters have random components (i.e. for AdaDelta N and AdaDelta NS), and it is the limit of this direction that according to the proof of Theorem 2 determines the nature of the implicit regularization.

Four-layer convolutional network on MNIST. Also following Wang et al. (2021), we then considered the network from Mądry, Makelov, Schmidt, Tsipras, and Vladu (2018) with biases removed, consisting in order of: a 32-channel 5×5-filter convolutional layer, ReLU, 2-kernel 2-stride max-pooling, a 64-channel 3×3-filter convolutional layer, ReLU, 2-kernel 2-stride max-pooling, a 1024-width fully connected layer, ReLU, and a 10-width fully connected layer. We trained the network on MNIST (LeCun, Bottou, Bengio, and Haffner, 1998) from the default PyTorch random initialization for 500 epochs using the cross-entropy loss: in a finer regime with batch size 100, learning rate 0.01 for SGD, and learning rate 0.1 for the four variants of AdaDelta; and in a coarser regime with batch size 1000, learning rate 0.1 for SGD, and learning rate 1 for the four variants of AdaDelta. The learning rates for SGD were chosen so that it achieves perfect training accuracy after a similar number of epochs as the four versions of AdaDelta. The results are shown in fig. 2. For both regimes, we observe that:

- all algorithms achieve perfect training accuracy by around 250 epochs, in line with the separation Assumption 2.(iii);
- the normalized margin, whose values here are relatively small partly due to the division by the fourth power of the norm of the network parameters, grows significantly higher for SGD and with isotropic numerical stability hyperparameters (i.e. for AdaDelta and AdaDelta S), confirming the prediction of Theorem 2 that, without the isotropy, the implicit regularization may not be towards margin maximization;
- after all algorithms achieve perfect training accuracy, their test accuracies remain roughly constant, in the finer regime without significant differences among the four versions of AdaDelta, and in the coarser regime with AdaDelta and AdaDelta S generalizing slightly but consistently better;
- the training loss shrinks faster when the numerical stability hyperparameters have random components (i.e. for AdaDelta N and AdaDelta NS), however it remains within a constant multiplicative band of the training loss when those hyperparameters are isotropic;
- the results are not substantially affected by whether the exponential decay coefficients follow the increasing schedule, i.e. they are similar for AdaDelta and AdaDelta S, and also for AdaDelta N and AdaDelta NS.

VGG on CIFAR-10. Following Lyu & Li (2020), we finally considered the 14-layer VGG-16 (Simonyan and Zisserman, 2015) with biases only at the first level, consisting in order of: 64-channel convolutional then ReLU, repeated 2 times; max-pooling; 128-channel convolutional then ReLU, repeated 2 times; max-pooling; 256-channel convolutional then ReLU, repeated 3 times; max-pooling; 512-channel convolutional then ReLU, repeated 3 times; max-pooling; 256-channel convolutional then ReLU, repeated 3 times; max-pooling; and a 10-width fully connected layer. Each convolutional layer is 3×3-filter, and each max-pooling is 2-kernel 2-stride. We trained the network on CIFAR-10 (Krizhevsky, 2009) from the default PyTorch random initialization for 1000 epochs using the cross-entropy loss: in a finer regime with batch size 100 and learning rate 0.1 for all five algorithms; and in a coarser regime with batch size 250 and learning rate 0.25 for all five algorithms. The results are shown in fig. 3.
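A PyTorch sketch of the architecture just described, under assumptions the paper does not spell out (we use "same" padding for all convolutions and read "biases only at the first level" as the first convolutional layer only):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n):
    # n conv-ReLU pairs followed by one 2x2 max-pooling, as in the layer list above
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1, bias=False),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=True), nn.ReLU(),   # only layer with biases
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    *conv_block(64, 128, 2),
    *conv_block(128, 256, 3),
    *conv_block(256, 512, 3),
    *conv_block(512, 256, 3),
    nn.Flatten(),
    nn.Linear(256, 10, bias=False),   # 32 / 2**5 = 1, so the final feature map is 256 x 1 x 1
)
```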
For both regimes, we observe that:

- all algorithms achieve perfect training accuracy by around 250 epochs, in line with the separation Assumption 2.(iii);
- the normalized margin, whose values here are relatively small partly due to the division by the 14th power of the network norm, grows significantly higher for SGD and with isotropic numerical stability hyperparameters, with the exception in the coarser regime of AdaDelta, whose training loss shrinks considerably slower compared with AdaDelta S; this is in line with Theorem 2, whose prediction of implicit regularization towards margin maximization depends both on the isotropy and on convergence to zero training loss;
- after all algorithms achieve perfect training accuracy, their test accuracies remain roughly constant and without substantial differences among them (approximately within 81 ± 1% in the finer regime, or 79 ± 1% in the coarser regime), with the exception in the coarser regime of AdaDelta and AdaDelta N, whose training losses shrink considerably slower compared with AdaDelta S and AdaDelta NS;
- whether the exponential decay coefficients follow the increasing schedule does not substantially affect the results in the finer regime, however in the coarser regime AdaDelta S and AdaDelta NS shrink the training loss considerably faster than AdaDelta and AdaDelta N, leading to higher normalized margins as well as better generalization.

5 Conclusion

Our main result, Theorem 2, which holds under Assumptions 1 and 2, indicates that AdaDelta may be used in practical methods that rely on the optimization algorithm implicitly regularizing the network parameters to converge in direction to a margin maximization KKT point, which places AdaDelta in the same category as e.g. RMSProp and Adam without momentum (Wang et al., 2021, Theorem 7). This relies on the isotropy of the quotient of the numerical stability hyperparameters, without which according to Theorem 2 the implicit regularization of AdaDelta may not be towards margin maximization, as is the case for e.g. AdaGrad (Wang et al., 2021, Theorem 6).

Relaxing Assumption 1 on the predictor function is challenging even without considering adaptive algorithms, e.g. the definability excludes pathological examples such as those based on the Mexican hat function (cf. Lyu & Li (2020, Appendix J)). We focused on binary classification for simplicity of presentation; it is straightforward to extend Theorem 2 for the logistic loss to an arbitrary number of classes (cf. Wang et al. (2021, Appendix E)).

An important future goal is to obtain a counterpart of Theorem 2 directly for adaptive gradient descent as in Procedure 1. This is also a challenge already without adaptivity: the directional convergence result (Ji & Telgarsky, 2020, Theorem 3.1) is only for gradient flow, and the characterization of directional limits for gradient descent (Lyu & Li, 2020, Theorem E.3) assumes C²-smoothness of the predictor function, which excludes nonlinearities such as ReLU and Leaky ReLU.
Figure 1: We trained a two-layer Leaky ReLU perceptron on a synthetic binary classification dataset which consists of 100 points in two-dimensional space. Each of 100 rounds of the experiment consisted of randomly initializing the network parameters and the numerical stability terms (if applicable), and then running separately each of the five algorithms (SGD, AdaDelta, AdaDelta S, AdaDelta N, AdaDelta NS) for 5000 epochs. The plots in the top row show: on the left, the training accuracy averaged over the rounds, depending on the epoch; and on the right, the normalized margin at the end of the training, across each of the rounds. The plots in the next two rows visualize the direction of the adapter's reciprocal square root at the end of the training, where the isotropic direction is indicated by the longer dashed orange arrow.

Figure 2: The results of training a 4-layer convolutional network on MNIST (training loss, training accuracy, test accuracy, and normalized margin over the epochs), with batch sizes and learning rates for the right-hand column that are 10 times larger than for the left-hand column. The solid lines show the median values, and the shaded areas are between the 25th and 75th percentiles, over 19 runs.

Figure 3: The results of training a 14-layer convolutional network on CIFAR-10 (training loss, training accuracy, test accuracy, and normalized margin over the epochs), with batch sizes and learning rates for the right-hand column that are 2.5 times larger than for the left-hand column. The solid lines show the median values, and the shaded areas are between the 25th and 75th percentiles, over 19 runs.

Broader Impact Statement

This is foundational research on a general-purpose algorithm for optimizing neural networks. Greater understanding of its implicit regularization properties may lead to machine learning models that have cheaper training, greater efficiency, and increased performance.

Acknowledgments

We acknowledge the Centre for Discrete Mathematics and Its Applications at the University of Warwick for partial support.

References

Anas Barakat and Pascal Bianchi. Convergence and Dynamical Behavior of the ADAM Algorithm for Nonconvex Stochastic Optimization. SIAM J. Optim., 31(1):244–274, 2021.

Anas Barakat, Pascal Bianchi, Walid Hachem, and Sholom Schechtman. Stochastic optimization with momentum: Convergence, fluctuations, and traps avoidance. Electron. J. Statist., 15(2):3892–3947, 2021.

Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, pp. 6240–6249, 2017.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci., 116(32):15849–15854, 2019.

Jonathan Borwein and Adrian Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2nd edition, 2010.

Gon Buzaglo, Niv Haim, Gilad Yehudai, Gal Vardi, Yakir Oz, Yaniv Nikankin, and Michal Irani. Deconstructing Data Reconstruction: Multiclass, Weight Decay and General Losses.
In NeurIPS, 2023.

Matias D. Cattaneo, Jason M. Klusowski, and Boris Shigida. On the Implicit Bias of Adam. In ICML, 2024.

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks. In IJCAI, pp. 3267–3275, 2020.

Frank H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247–262, 1975.

André Belotto da Silva and Maxime Gazeau. A General System of Differential Equations to Model First-Order Adaptive Algorithms. J. Mach. Learn. Res., 21:129:1–129:42, 2020.

Damek Davis, Dmitriy Drusvyatskiy, Sham M. Kakade, and Jason D. Lee. Stochastic Subgradient Method Converges on Tame Functions. Found. Comput. Math., 20(1):119–154, 2020.

John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 12:2121–2159, 2011.

Joydeep Dutta, Kalyanmoy Deb, Rupesh Tulshyan, and Ramnik Arora. Approximate KKT points and a proximity measure for termination. J. Glob. Optim., 56(4):1463–1499, 2013.

Omer Elkabetz and Nadav Cohen. Continuous vs. Discrete Optimization of Deep Neural Networks. In NeurIPS, pp. 4947–4960, 2021.

Sébastien Gadat and Ioana Gavra. Asymptotic Study of Stochastic Adaptive Algorithms in Non-convex Landscape. J. Mach. Learn. Res., 23:228:1–228:54, 2022.

Margalit Glasgow, Colin Wei, Mary Wootters, and Tengyu Ma. Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence. In ICLR, 2023.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-Independent Sample Complexity of Neural Networks. In COLT, pp. 297–299, 2018. Extended version in CoRR abs/1712.06541.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing Implicit Bias in Terms of Optimization Geometry. In ICML, pp. 1827–1836, 2018.

Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, and Michal Irani. Reconstructing Training Data From Trained Neural Networks. In NeurIPS, 2022.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Overview of mini-batch gradient descent, 2012. Neural Networks for Machine Learning: Lecture 6a.

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. In NeurIPS, 2020.

Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the Generalization Gap in Deep Networks with Margin Distributions. In ICLR, 2019.

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them. In ICLR, 2020.

Nitish Shirish Keskar and Richard Socher. Improving Generalization Performance by Switching from Adam to SGD. CoRR, abs/1712.07628, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.

Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive Gradient Methods with Dynamic Bound of Learning Rate. In ICLR, 2019.

Kaifeng Lyu and Jian Li. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. In ICLR, 2020.
Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR, 2018.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring Generalization in Deep Learning. In NeurIPS, pp. 5947–5956, 2017.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian Approach to Spectrally Normalized Margin Bounds for Neural Networks. In ICLR, 2018.

Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. CoRR, abs/2201.02177, 2022.

Qian Qian and Xiaoyuan Qian. The Implicit Bias of AdaGrad on Separable Data. In NeurIPS, pp. 7759–7767, 2019.

Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.

Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. J. Mach. Learn. Res., 19(70):1–57, 2018.

Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, and Samet Oymak. Max-Margin Token Selection in Attention Mechanism. In NeurIPS, 2023.

Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua M. Susskind. The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods. Trans. Mach. Learn. Res., 2024.

Gal Vardi. On the Implicit Bias in Deep-Learning Algorithms. Commun. ACM, 66(6):86–93, 2023.

Bohan Wang, Qi Meng, Wei Chen, and Tie-Yan Liu. The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks. In ICML, pp. 10849–10858, 2021.

Bohan Wang, Qi Meng, Huishuai Zhang, Ruoyu Sun, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Does Momentum Change the Implicit Regularization on Separable Data? In NeurIPS, 2022.

Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and lower bound of Adam's iteration complexity. In NeurIPS, 2023.

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, and Wei Chen. Provable Adaptivity of Adam under Non-uniform Smoothness. In KDD, pp. 2960–2969, 2024.

Colin Wei and Tengyu Ma. Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin. In ICLR, 2020.

Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel. In NeurIPS, pp. 9709–9721, 2019.

Alex Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. J. Am. Math. Soc., 9(4):1051–1094, 1996.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NeurIPS, pp. 4148–4158, 2017.

Nachuan Xiao, Xiaoyin Hu, Xin Liu, and Kim-Chuan Toh. Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees. J. Mach. Learn. Res., 25:48:1–48:53, 2024.

Shuo Xie and Zhiyuan Li. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. In ICML, pp. 54488–54510, 2024.

Manzil Zaheer, Sashank J.
Reddi, Devendra Singh Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive Methods for Nonconvex Optimization. In NeurIPS, pp. 9815–9825, 2018.

Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701, 2012.

Chenyang Zhang, Difan Zou, and Yuan Cao. The Implicit Bias of Adam on Separable Data. CoRR, abs/2406.10650, 2024. Accepted to NeurIPS 2024.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 64(3):107–115, 2021.

Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven Chu-Hong Hoi, and Weinan E. Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning. In NeurIPS, 2020.