Journal of Machine Learning Research 23 (2022) 1-34. Submitted 3/21; Revised 4/22; Published 4/22.

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Aaron Defazio (adefazio@fb.com), Facebook AI Research, New York
Samy Jelassi (sjelassi@princeton.edu), Princeton University, Princeton

Editor: Animashree Anandkumar

Abstract

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and Adam in test set performance, even on problems for which adaptive methods normally perform poorly. An implementation is available at https://github.com/facebookresearch/madgrad.(1)

Keywords: Stochastic Optimization, AdaGrad, Optimization, Optimization for Machine Learning

(1) This work revises and extends the technical report Jelassi and Defazio (2020) with a new algorithm and further theory and experiments.

©2022 Aaron Defazio and Samy Jelassi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0226.html.

1. Introduction

Optimization for deep learning forms a relatively new and growing sub-field of the optimization community. Compared to classical first-order optimization, deep learning problems introduce additional concerns which require new tools to overcome. Deep learning problems are characterized by very large parameter vector sizes $D$, making it computationally infeasible to store matrices of size $D \times D$; even limited-memory approaches can be impractical for problems such as the 100+ billion parameter models currently being explored (Rajbhandari et al., 2019; Brown et al., 2020). The practical limit on these problems is storage that is fixed at a small multiple of the parameter vector size. For this reason, diagonal scaling approaches have become the industry standard for deep learning. In this class of methods, adaptivity is performed independently for each coordinate, so that memory usage scales as $O(D)$. We consider Adam (Kingma and Ba, 2014) the benchmark method in this class; it has seen widespread adoption, and there are no alternative adaptive methods that consistently out-perform it (Choi et al., 2020; Schmidt et al., 2020).

Adam builds upon a rich history of diagonal adaptive methods. The AdaGrad method (Duchi et al., 2011) introduced a principled approach to diagonal adaptivity that arises naturally as a simplification of a full-matrix adaptivity scheme. This approach is clearly motivated and yields natural convergence rate bounds for convex losses. Also within this family, the RMSProp method (Tieleman and Hinton, 2012) arose as a well-performing empirical method, albeit with little theoretical motivation. The development of the Adam method can be seen as a natural extension of the scaling used in RMSProp to include a form of momentum, as well as a stabilizing bias correction that significantly dampens the adaptivity and step size during the early stages of optimization.

Despite its widespread success, Adam is far from a panacea for deep learning optimization.
Wilson et al. (2017) show that Adam, as well as other common adaptive optimizers, converges to bad local minima on some important problems, such as the widely studied problem of image classification. This has led to the general claim that adaptive methods generalize poorly. As we will show, this is not necessarily the case. The method we develop in this work combines adaptivity with strong generalization performance. Our MADGRAD (Momentumized, Adaptive, Dual averaged GRADient) method performs consistently at a state-of-the-art level across a varied set of realistic large-scale deep learning problems, without requiring any more tuning than Adam. MADGRAD is constructed from the lesser-used dual averaging form of AdaGrad, through a series of direct and systematic changes that adapt the method to deep learning optimization.

2. Problem Setup

We consider the stochastic optimization framework, where the goal is to minimize a parameterized function
$$f(x) = \mathbb{E}_{\xi}\left[f(x, \xi)\right],$$
where $x \in \mathbb{R}^{D}$, and each $\xi$ is a random variable drawn from a fixed known distribution. In the case of empirical risk minimization, $\xi$ is a data-point drawn from the data distribution, typically further processed by a stochastic data-augmentation procedure. At each step $k$, a stochastic optimization algorithm is given $\xi_k$ and has access to $f(x_k, \xi_k)$ and $\nabla f(x_k, \xi_k)$ for a pre-specified iterate $x_k$.

3. Related Work

The theory of adaptive methods for non-convex optimization is still in its infancy. The current best known convergence theory for Adam, due to Défossez et al. (2020), greatly improves over earlier theory (Zou et al., 2019b), but has the important caveat that it requires momentum values of the order $\beta = 1 - 1/N$ for $N$ iterations, which is far from the values used in practice, which are of the order $\beta = 0.9$ to $\beta = 0.99$. Results for these settings may not be possible, as Reddi et al. (2018) show via a counter-example that Adam may fail to converge under common parameter settings, even in the convex case. When $\beta_1$ and $\beta_2$ are small, the Adam update is close to sign-SGD (i.e. $x_{k+1} = x_k - \gamma\,\operatorname{sign}(\nabla f(x_k, \xi_k))$), a method that also fails to converge in the general stochastic case (Balles and Hennig, 2018), although some theory is possible under a large-batch assumption (Bernstein et al., 2018) where the behavior is closer to the non-stochastic case.

AdaGrad's convergence in the non-convex case has also been studied. Ward et al. (2019) establish convergence for a restricted variant where only a global step size is adaptively updated. Li and Orabona (2019) establish almost sure convergence for a variant of AdaGrad where the most recently seen gradient is omitted from the denominator. Convergence with high probability is also established for a variant with a global rather than coordinate-wise step size. More recently, Zhou et al. (2020) and Zou et al. (2019a) establish convergence of non-momentum and momentum variants respectively, although with bounds that are much worse than those established by Défossez et al. (2020), who also cover AdaGrad in their analysis. Weighted AdaGrad as we use in this work has been explored to varying degrees before, including in the non-convex case in the aforementioned work by Zou et al. (2019a), and in the convex case by Levy et al. (2018). Weighting is particularly interesting in the strongly convex case, where weights such as $\lambda_k \propto k^2$ can be used to achieve accelerated convergence. Neither of these works covers the dual averaged form of AdaGrad which we explore.
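The sign-SGD regime of Adam noted above is easy to check numerically; the following sketch (plain NumPy, with bias correction and $\epsilon$ omitted for clarity, and all names ours for illustration) shows that an Adam step with $\beta_1 = \beta_2 = 0$ moves each coordinate by $-\gamma\,\operatorname{sign}(g)$:

```python
import numpy as np

def adam_step(x, g, m, v, lr, beta1, beta2, eps=0.0):
    """One Adam step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return x - lr * m / (np.sqrt(v) + eps), m, v

g = np.array([0.3, -2.0, 0.01])
x0 = np.zeros(3)
# With beta1 = beta2 = 0 (and eps = 0), the update is -lr * g / |g| = -lr * sign(g).
x1, _, _ = adam_step(x0, g, m=np.zeros(3), v=np.zeros(3), lr=0.1, beta1=0.0, beta2=0.0)
print(np.allclose(x1, -0.1 * np.sign(g)))  # True
```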
4. Adaptivity in deep learning beyond Adam

To understand the motivation and design of the MADGRAD method, a clear understanding of the short-comings of existing methods is needed. Consider Adam, the most heavily used adaptive method in practice. Although it works remarkably well on some important problems, it also suffers from the following issues:

- It greatly under-performs the non-adaptive SGD-M method in a number of important situations, including the widely studied ImageNet training problem.
- Problems can be constructed on which it will fail to converge entirely, even in the convex setting.
- The exponential moving average updates used are non-sparse when given sparse gradients, which makes the method poorly suited to sparse problems.

Due to these issues, Adam doesn't quite reach the goal of being a general-purpose deep learning optimizer. The MADGRAD method is directly designed to address these issues. MADGRAD:

- Achieves state-of-the-art performance across problems traditionally tackled by Adam, while simultaneously achieving state-of-the-art results on problems where Adam normally under-performs.
- Has provable and strong convergence theory on convex problems.
- Is directly applicable to sparse problems when momentum is not used.

5. The MADGRAD method

The MADGRAD method is the combination of a number of techniques that individually address separate short-comings of the AdaGrad method when applied to deep learning optimization problems. By building upon a method with known convergence theory, we are able to construct a method that is still provably convergent (under convexity assumptions) without sacrificing the practical performance characteristics of Adam. We will detail each of these techniques in turn, to build up MADGRAD from its foundations.

5.1 Dual averaging for deep learning

MADGRAD is based upon the dual averaging formulation of AdaGrad, rather than the mirror descent formulation. Although the original seminal work on AdaGrad (Duchi et al., 2011) presents the dual averaging formulation with equal weight to the mirror descent (MD) form, the dual averaging form has seen virtually no use for deep learning optimization. The AdaGrad implementations available in major deep learning frameworks (PyTorch, TensorFlow) contain the mirror descent form only. This is despite the theory presented for the dual averaging formulation being arguably more elegant than the mirror descent theory. The dual averaging form of AdaGrad satisfies the following bound:
$$\sum_{i=1}^{k} \left[f(x_i) - f(x^{*})\right] \leq \frac{1}{\gamma}\,\psi_k(x^{*}) + \frac{\gamma}{2}\sum_{i=1}^{k}\left\|f_i'(x_i)\right\|_{\psi_{i-1}^{*}}^{2},$$
whereas the mirror descent form satisfies the following more complex bound, involving the Bregman divergence of $\psi$:
$$\sum_{i=1}^{k} \left[f(x_i) - f(x^{*})\right] \leq \frac{1}{\gamma}\,B_{\psi_1}(x^{*}, x_1) + \frac{1}{\gamma}\sum_{i=1}^{k-1}\left[B_{\psi_{i+1}}(x^{*}, x_{i+1}) - B_{\psi_i}(x^{*}, x_{i+1})\right] + \frac{\gamma}{2}\sum_{i=1}^{k}\left\|f_i'(x_i)\right\|_{\psi_i^{*}}^{2}.$$

Given the clear advantage in terms of theoretical simplicity, why then are dual averaging approaches not used more widely? We believe this is due to a number of misconceptions. The first misconception is that dual averaging is only interesting in the composite optimization setting, where sophisticated regularizers are used to encourage sparsity or induce other properties of the solution. It is true that for smooth non-stochastic optimization, gradient descent and mirror descent coincide (under optimal hyper-parameters). However, when the objective is stochastic or non-smooth, the methods become distinct, and actually behave quite differently. Dual averaging has the general form, given a proximal function $\psi$:
$$g_k = \nabla f(x_k, \xi_k), \qquad s_{k+1} = s_k + \lambda_k g_k, \qquad x_{k+1} = \arg\min_x \left\{ \langle s_{k+1}, x \rangle + \beta_{k+1}\psi(x) \right\}. \tag{1}$$
The gradient buffer $s_0$ is initialized as the zero vector.
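A minimal NumPy sketch of update (1) may help fix ideas. It uses the Euclidean proximal function $\psi(x) = \frac{1}{2}\|x - x_0\|^2$ discussed next, for which the arg min has a closed form; `grad_fn` is a placeholder stochastic-gradient oracle and $\beta_{k+1} = \sqrt{k+1}$ is assumed:

```python
import numpy as np

def dual_averaging(x0, grad_fn, num_steps, lam=1.0):
    """Plain dual averaging (Equation 1) with psi(x) = 0.5 * ||x - x0||^2.

    For this psi the arg min in Equation 1 has the closed form x0 - s / beta.
    grad_fn(x, k) should return a stochastic (sub)gradient at x for step k.
    """
    x, s = x0.copy(), np.zeros_like(x0)   # s_0 = 0
    for k in range(num_steps):
        g = grad_fn(x, k)
        s = s + lam * g                   # s_{k+1} = s_k + lambda_k g_k
        beta = np.sqrt(k + 1.0)           # beta_{k+1} ~ sqrt(k+1) for convergence
        x = x0 - s / beta                 # closed-form arg min
    return x

# Toy usage: noisy quadratic centred at the origin, started from x0 = (1, ..., 1).
rng = np.random.default_rng(0)
x_hat = dual_averaging(np.ones(5), lambda x, k: x + rng.normal(0, 0.1, size=5), 2000)
```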
The simplest form of dual averaging occurs when the standard Euclidean squared norm is used, $\psi(x) = \frac{1}{2}\|x - x_0\|^2$, and $\lambda_k = 1$, in which case the method takes the form:
$$x_{k+1} = x_0 - \frac{1}{\beta_{k+1}}\sum_{i=0}^{k} g_i. \tag{2}$$
If the objective is either non-smooth or stochastic (or both), $\beta$ sequences of the form $\beta_{k+1} = \sqrt{k+1}$ give a convergent method. Although Equation 2 has little resemblance to SGD as written, SGD's update,
$$x_{k+1} = x_k - \gamma_k \nabla f(x_k, \xi_k),$$
can be written in the more comparable form:
$$x_{k+1} = x_0 - \sum_{i=0}^{k} \gamma_i g_i, \tag{3}$$
where, to achieve convergence without a fixed stopping time, a step size of the form $\gamma_i \propto 1/\sqrt{i+1}$ is standard. Comparing SGD and DA at a step $k$, it's clear that the weighting sequence used by SGD places a smaller weight on newer $g_i$ in the summation compared to earlier $g_i$, whereas the sequence used by DA places equal weight on all $g_i$. This difference is key to understanding why methods in the DA family behave differently from SGD in practice, even without additional regularization or non-Euclidean proximal functions.

The second misconception arises from implementing the dual averaging form of AdaGrad without considering what modifications need to be made for the deep learning setting. The algorithm as originally stated uses an initial point at the origin, $x_0 = 0$, and a proximity function $\psi_t(x) = \frac{1}{2}\langle x, H_t x \rangle$ that is quadratic but centered around the origin. It is well known that neural network training exhibits pathological behavior when initialized at the origin, and so naive use of this algorithm does not perform well. When centering around 0, we have observed severely degraded empirical performance and a high risk of divergence. Instead, a proximity function centered about $x_0$ needs to be used,
$$\psi_t(x) = \frac{1}{2}\left\langle x - x_0,\; H_t (x - x_0) \right\rangle,$$
with the initialization of $x_0$ following standard conventions for the network being trained.

5.2 Dual averaging generalizes well

In addition to the theoretical advantages of dual averaging methods, we have also observed that they enjoy a strong practical advantage in the form of better generalization performance. Dual averaging based methods include a form of implicit regularization, which we believe is a crucial factor contributing to their good generalization performance. To see this, consider the classical dual averaging update:
$$x_{k+1} = x_0 - \frac{1}{\sqrt{k+1}}\sum_{i=0}^{k} g_i.$$
This update can be written in a form closer to the SGD update by substituting for $x_0$:
$$x_{k+1} = x_k - \frac{1}{\sqrt{k+1}}\, g_k - \left(1 - \sqrt{\tfrac{k}{k+1}}\right)(x_k - x_0).$$
Since $1 - \sqrt{k/(k+1)} \approx \frac{1}{2(k+1)}$, the behavior of dual averaging resembles an SGD step with a step-dependent regularizer of order $\frac{1}{\sqrt{k}}\,\|x - x_0\|^2$, which decays in strength during the course of optimization. We speculate that the indirect decaying regularization inherent in dual averaging methods may explain why MADGRAD also requires less weight decay than other methods to match their performance. The strong initial regularization may have a positive effect during early iterations, while not negatively affecting the ability of the model to fit the data during the later "fine-tuning" epochs. Given the practical advantages we observe in our experiments, we believe further research into the effect of using stronger regularization at the early stages of optimization may be interesting more generally.

In the online learning setting, where dual averaging is known as follow-the-regularized-leader, the differences between the two approaches are more heavily explored. There are situations where the DA variant of a method is clearly superior to the mirror descent variant of the same method.
For instance, when the stopping time is not known in advance, scale-free mirror descent may suffer from linear regret, whereas DA does not (Orabona and Pál, 2015). Another advantage of DA approaches is their dependence on the sub-optimality of the initial point. They have regret bounds that depend multiplicatively on the initial distance to the solution, $\|x_0 - x^{*}\|^2$, compared to $\max_{k=0\ldots n} \|x_k - x_0\|^2$ for mirror descent. The MD bound may be much worse in cases where the iterate sequence is poorly behaved.

5.3 λ sequences for deep learning

Even with this modification, dual averaging both with and without adaptivity is not competitive with SGD on standard benchmark problems such as CIFAR-10, as shown in Figure 1. The top row shows AdaGrad and DA methods using a flat learning rate schedule, and the bottom row shows a stage-wise schedule. SGD is shown as a baseline. For the DA family methods, $\lambda_k$ is decreased for the stage-wise schedules. AdaGrad, DA and AdaGrad-DA all under-perform SGD with either learning rate schedule.

Part of this performance gap can be attributed to the fact that each of these methods either implicitly or explicitly uses a $1/\sqrt{i+1}$ learning rate sequence. This sequence is actually harmful, as we can confirm by testing SGD using a schedule of the form $a/\sqrt{i+b}$. Figure 2 illustrates the learning curves achievable for varying $b$ values on CIFAR-10. A full description of our experimental setup is in Section 7. We performed a hyper-parameter search over $a$ separately for each $b$, with test accuracy as the target quantity. All sqrt-decay sequences are significantly worse than the baseline stage-wise schedule, where the learning rate is decreased 10-fold at epochs 150 and 225. We speculate that the sqrt-decay sequences result in convergence that is too rapid, skipping over the initial annealing stage of learning and resulting in convergence to a poor local minimum. The AdaGrad and AdaGrad-DA methods also use an implicitly decreasing sequence, although the rate of decrease depends on the magnitude of the gradients, which is very problem dependent.

[Figure 1 plots (test accuracy and training loss versus epoch on CIFAR-10). Flat schedule, final test accuracy: SGD stagewise 94.26% (SE 0.12), AdaGrad 91.31% (SE 0.13), AdaGrad-DA 91.85% (SE 0.11), DA 87.03% (SE 0.19); final training loss: SGD stagewise 2.5e-04, AdaGrad 4.5e-03, AdaGrad-DA 2.1e-02, DA 1.1e-02. Stage-wise schedule, final test accuracy: SGD 94.33% (SE 0.13), AdaGrad 92.05% (SE 0.10), AdaGrad-DA 93.66% (SE 0.07), DA 84.48% (SE 0.12); final training loss: SGD 2.7e-04, AdaGrad 5.0e-04, AdaGrad-DA 2.2e-03, DA 1.6e-01.]

Figure 1: Comparison of SGD without momentum to DA, AdaGrad-DA and AdaGrad on CIFAR-10. The left column is test classification performance, the right column is training loss. The "stage" learning rate scheme involves a 10-fold decrease in the learning rate at epochs 150 and 225. See Section 7 for a full description of the experimental setup.
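For reference, the two schedule families compared in Figures 1 and 2 can be written as follows (a sketch; `base_lr`, `a` and `b` are the swept constants, and 150/225 are the CIFAR-10 stage boundaries):

```python
def stagewise_lr(epoch, base_lr=0.1):
    """Stage-wise schedule: 10-fold decreases at epochs 150 and 225."""
    if epoch < 150:
        return base_lr
    if epoch < 225:
        return base_lr / 10.0
    return base_lr / 100.0

def sqrt_decay_lr(step, a=0.1, b=300):
    """Sqrt-decay schedule of the form a / sqrt(step + b)."""
    return a / (step + b) ** 0.5
```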
[Figure 2 plots (test accuracy and training loss versus epoch on CIFAR-10). Final test accuracy: SGD stagewise 94.15% (SE 0.06), SGD b=1 84.25% (SE 0.15), SGD b=300 91.68% (SE 0.05), SGD b=1000 92.75% (SE 0.07), SGD b=10000 93.55% (SE 0.04). Final training loss: SGD stagewise 2.3e-04, SGD b=1 4.9e-03, SGD b=300 1.6e-04, SGD b=1000 1.5e-04, SGD b=10000 1.3e-04.]

Figure 2: Sqrt-decay learning rate schedules under-perform stage-wise schedules, with batch size 128 on CIFAR-10. No momentum is used in this comparison. A range of offsets $b$ in the rate $a/\sqrt{i+b}$ were tried, with values up to 10,000 shown. Larger values of $b$, up to 100,000, were also tested; they also failed to match the performance of the stage-wise schedule. The left column is test classification performance, the right column is training loss.

If gradients stay of similar magnitude over a particular time-scale, then the rate of decrease will also be a $1/\sqrt{k}$ rate at step $k$. This step size scheme is also undesirable as it prevents the use of standard SGD and Adam step size sequences for choosing the explicit step size constants $\lambda_i$. Since in practice the same learning rate scheme is commonly used when comparing different optimization methods, this schedule contributes to the commonly held perception that AdaGrad is not as effective as other adaptive methods such as Adam.

For the DA method, we propose to remedy this issue by introducing a scaling of the $\lambda$ values to counter-act the step size sequence. In particular we propose the choice
$$\lambda_i = (i+1)^{1/2}\,\gamma_i,$$
where $\gamma$ is a conventional (SGD/Adam) step size sequence. The advantage of this choice is that the leading term in the sum in Equation 2 has constant weight across $k$:
$$x_{k+1} = x_0 - \frac{1}{\sqrt{k+1}}\sum_{i=0}^{k}(i+1)^{1/2}\gamma_i g_i = x_0 - \gamma_k g_k - \frac{1}{\sqrt{k+1}}\sum_{i=0}^{k-1}(i+1)^{1/2}\gamma_i g_i,$$
mirroring the behavior of SGD during a constant step size phase, but retaining the $1/\sqrt{k+1}$ decay of past gradients. This simple change is sufficient to greatly improve the test-set performance of DA when using the same learning rate schedule as SGD.

Another advantage of this sequence is that it places higher weights on later gradients in the final convergence rate bound. This makes no difference if we expect gradients to be of similar magnitude at all stages of optimization (which can happen for non-smooth problems in the worst case), but in practice even for non-smooth objectives the gradient typically shrinks to some degree during optimization, leading to tighter bounds when using a forward-weighted lambda sequence. We discuss this difference further in Section 6.
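The forward-weighted update proposed above can be sketched as follows (a minimal NumPy illustration with constant $\gamma$ and the plain Euclidean proximal function, i.e. no adaptivity or momentum yet; `grad_fn` is a placeholder oracle):

```python
import numpy as np

def weighted_dual_averaging(x0, grad_fn, num_steps, gamma=0.1):
    """Dual averaging with lambda_i = sqrt(i+1) * gamma and beta_{k+1} = sqrt(k+1).

    The newest gradient enters with coefficient lambda_k / beta_{k+1} = gamma, so the
    effective step matches constant-step SGD, while older gradients still decay as
    sqrt(i+1) / sqrt(k+1).
    """
    x, s = x0.copy(), np.zeros_like(x0)
    for k in range(num_steps):
        g = grad_fn(x, k)
        s = s + np.sqrt(k + 1.0) * gamma * g   # s_{k+1} = s_k + lambda_k g_k
        x = x0 - s / np.sqrt(k + 1.0)          # x_{k+1} = x_0 - s_{k+1} / beta_{k+1}
    return x
```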
5.4 Momentum

The use of momentum on top of SGD is known to be highly beneficial, if not crucial, for deep learning optimization across a wide variety of architectures and problem settings (Sutskever et al., 2013). Given how crucial it can be to maintaining competitive performance, we now examine how we can add a form of momentum to the dual averaging updates, and later to the AdaGrad updates. We will consider an update of the following form, which was first explored in this general form by Nesterov and Shikhman (2015) under the name Dual Averaging with Double Averaging:
$$g_k = \nabla f(x_k, \xi_k), \qquad s_{k+1} = s_k + \lambda_k g_k, \qquad z_{k+1} = \arg\min_x \left\{ \langle s_{k+1}, x \rangle + \beta_{k+1}\psi(x) \right\}, \qquad x_{k+1} = (1 - c_{k+1})\, x_k + c_{k+1} z_{k+1}. \tag{4}$$

The essential idea behind this algorithm is simple. Instead of evaluating the gradient at each step at the value of the arg-min operation, as with regular DA, it is evaluated at a moving average point. This serves to smooth the iterate sequence. This technique has the advantage in the convex setting of making it possible to prove convergence properties of the last iterate $x_{k+1}$ rather than the average iterate $\bar{x}_{k+1} = \frac{1}{k+1}\sum_{i=0}^{k} x_i$. Essentially, the averaging operation is incorporated into the algorithm itself.

Momentum is normally thought of as performing more than just a smoothing of the iterate sequence, although a line of recent research has shown that inline averaging of the above form is actually exactly equivalent to momentum (Sebbouh et al., 2020; Defazio, 2020). This is clearly illustrated when momentum is added on top of SGD, where inline averaging,
$$z_{k+1} = z_k - \eta_k \nabla f(x_k, \xi_k), \qquad x_{k+1} = (1 - c_{k+1})\, x_k + c_{k+1} z_{k+1},$$
is actually exactly equivalent to the more common equational forms of momentum,
$$m_{k+1} = \beta_k m_k + \nabla f(x_k, \xi_k), \qquad x_{k+1} = x_k - \alpha_k m_{k+1},$$
for appropriate choices of the hyper-parameters. In the convex setting the advantage of this form arises when $c_{k+1} = \frac{1}{k+1}$, which corresponds to an equal-weighted moving average $x_{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} z_i$. Under this setting, convergence of the last iterate can be shown just as when this kind of averaging is used with dual averaging (Defazio and Gower, 2020). In the non-convex setting, constant $c_{k+1}$ values, which correspond to an exponential moving average, appear to be the best choice (Defazio, 2020).

5.5 Adaptivity

Our goal is to combine these ideas together with the adaptivity technique from the AdaGrad method. The dual averaging form of coordinate-wise AdaGrad has the following form:
$$x_{k+1} = x_0 - \frac{1}{\sqrt{\sum_{i=0}^{k} \gamma\, g_i^2}} \odot \sum_{i=0}^{k} \gamma\, g_i,$$
where $\odot$ represents the element-wise (Hadamard) product (with the division and square root also applied element-wise), and $\gamma$ is a fixed step size hyper-parameter. There are many different ways of combining this kind of coordinate-wise adaptivity with the weighted gradient sequence $\lambda_i = (i+1)^{1/2}\gamma_i$ that we have proposed. Due to the flexibility of the dual averaging framework, it's possible to prove a convergence rate of some form for practically any choice of denominator sequence. However, we must take into consideration that we also want to maintain the magnitude of the effective step size, as discussed in Section 5.3. We also need to ensure that the weighted denominator includes $\gamma_i$, not just $(i+1)^{1/2}$, as this mitigates a problem illustrated for DA in Figure 1: when $\lambda$ is decreased 10-fold at epoch 150, the method starts to diverge. At this point, the $\beta$ sequence continues to grow at a square-root rate, while the sum of gradients starts growing ten times more slowly. This results in the method shrinking the iterates towards $x_0$ far too strongly. We review a number of possible alternatives below and discuss their practicality.

5.5.1 Unweighted denominator

One possibility is to keep the denominator the same but just weight the gradients in the sum:
$$x_{k+1} = x_0 - \frac{1}{\sqrt{\sum_{i=0}^{k} \gamma_i g_i^2}} \odot \sum_{i=0}^{k} (i+1)^{1/2} \gamma_i g_i.$$
This is appealing as it maintains the constant effective step size property; however, the resulting convergence rate bound derivable from this form depends on $\sqrt{\sum_{i=0}^{k} \gamma_i g_i^2}$ rather than $\sqrt{\sum_{i=0}^{k} (i+1)^{1/2}\gamma_i g_i^2}$, which defeats the purpose of using a front-weighted gradient sequence.

5.5.2 Weighted denominator

We can weight the gradient sequence in the denominator by $\lambda$ also:
$$x_{k+1} = x_0 - \frac{1}{\sqrt{\sum_{i=0}^{k} (i+1)^{1/2}\gamma_i g_i^2}} \odot \sum_{i=0}^{k} (i+1)^{1/2} \gamma_i g_i.$$
This form does not maintain a constant effective step size, which results in poor empirical performance. We experimented with mitigations such as adding additional terms to the numerator that would counteract this growth, however this still resulted in unsatisfactory empirical results.
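As a rough calculation (not part of the formal analysis), suppose gradient coordinates remain of magnitude roughly $g$ and $\gamma_i = \gamma$. Using $\sum_{i=0}^{k}(i+1)^{1/2} \approx \tfrac{2}{3}(k+1)^{3/2}$, the coefficient multiplying the newest gradient $g_k$ under the three denominator choices scales as
$$\frac{(k+1)^{1/2}\gamma}{\sqrt{\sum_{i=0}^{k}\gamma\, g^2}} \approx \frac{\gamma^{1/2}}{g}, \qquad
\frac{(k+1)^{1/2}\gamma}{\sqrt{\sum_{i=0}^{k}(i+1)^{1/2}\gamma\, g^2}} \approx \left(\tfrac{3}{2}\right)^{1/2}\frac{\gamma^{1/2}}{g\,(k+1)^{1/4}}, \qquad
\frac{(k+1)^{1/2}\gamma}{\sqrt[3]{\sum_{i=0}^{k}(i+1)^{1/2}\gamma\, g^2}} \approx \left(\tfrac{3}{2}\right)^{1/3}\frac{\gamma^{2/3}}{g^{2/3}}.$$
The unweighted square-root denominator keeps a constant coefficient, the weighted square-root denominator shrinks it like $(k+1)^{-1/4}$, and the cube root introduced next restores a constant coefficient under the forward weighting.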
5.5.3 Weighted numerator

The AdaGrad variant proposed by Zou et al. (2019a) uses a weighting scheme where the weights $\lambda_k$ are included in the numerator as well as the denominator, giving a per-coordinate scaling of the form
$$\gamma\,\frac{\sqrt{\sum_{i=0}^{k}\lambda_i}}{\sqrt{\sum_{i=0}^{k}\lambda_i g_i^2}}\,.$$
This numerator is proportional to $t^{1/4}$. To adapt this sequence to dual averaging, we must include a step size parameter in the weights. It's unclear exactly how to do this in a way that maintains the effective step size property, since if $\lambda_i \propto \gamma_i$ then the step size will cancel between the numerator and denominator.

5.5.4 MADGRAD's cube-root denominator

To maintain the correct effective step size we propose the use of a cube root instead:
$$x_{k+1} = x_0 - \frac{1}{\sqrt[3]{\sum_{i=0}^{k}(i+1)^{1/2}\gamma_i g_i^2}} \odot \sum_{i=0}^{k}(i+1)^{1/2}\gamma_i g_i. \tag{5}$$
Although this modification appears ad-hoc, the use of a cube root here can actually be motivated by a similar argument to that used to motivate the standard square-root formulation. Duchi et al. (2011) consider the following minimization problem over a $D$ dimensional vector $s$:
$$\min_s\ \sum_{d=1}^{D}\sum_{i=0}^{k}\frac{g_{id}^2}{s_d}\quad \text{s.t.}\ \langle 1, s\rangle \leq c,\ \forall d: s_d > 0,$$
which is solved by $s_d \propto \sqrt{\sum_{i=0}^{k} g_{id}^2}$. The motivation for this surrogate problem is to minimize the weighted square norm of the gradients in hindsight. Rather than a linear penalty on the size of $s$, which when combined with the positivity constraint is just an L1 norm penalty $\|s\|_1 \leq c$, if we instead use an L2 norm penalty,
$$\min_s\ \sum_{d=1}^{D}\sum_{i=0}^{k}\frac{g_{id}^2}{s_d}\quad \text{s.t.}\ \|s\|_2^2 \leq c,\ \forall d: s_d > 0,$$
then we recover a cube-root solution $s_d \propto \sqrt[3]{\sum_{i=0}^{k} g_{id}^2}$. We show this in the Appendix. The cube root maintains the effective step size, as can be seen by considering that $\sum_{i=0}^{k}(i+1)^{1/2} \propto (k+1)^{3/2}$, which after the cube root operation gives the necessary $\sqrt{k+1}$-scaled denominator required to cancel against $\lambda$'s square-root growth.

Algorithm 1 MADGRAD
Require: step size sequence $\gamma_k$, momentum sequence $c_k$, initial point $x_0$, epsilon $\epsilon$
 1: $s_0 = 0$, $\nu_0 = 0$
 2: for $k = 0, \ldots, T$ do
 3:   Sample $\xi_k$ and set $g_k = \nabla f(x_k, \xi_k)$
 4:   $\lambda_k = \gamma_k \sqrt{k+1}$
 5:   $s_{k+1} = s_k + \lambda_k g_k$
 6:   $\nu_{k+1} = \nu_k + \lambda_k \left(g_k \odot g_k\right)$
 7:   $z_{k+1} = x_0 - \frac{1}{\sqrt[3]{\nu_{k+1}} + \epsilon}\, s_{k+1}$
 8:   $x_{k+1} = (1 - c_{k+1})\, x_k + c_{k+1} z_{k+1}$
 9: end for
10: return $x_T$

One disadvantage of this weighting is that it results in a final convergence rate bound that is not fully adaptive, in the sense that the choice of global step size will depend on an expression involving the gradient norms. We don't believe this is a significant problem, given that the choice of step size still depends on other unknown quantities even when using a fully adaptive sequence, such as the function sub-optimality gap and the gradient bound $G$.

6. Convergence Theory

The MADGRAD algorithm, combining the discussed ideas, is listed in Algorithm 1. In order to establish convergence results for potentially non-smooth functions, we rely on a bounded gradient assumption: $\|\nabla f(x, \xi)\|_{\infty} \leq G$ for all $x, \xi$. We also assume each $f(\cdot, \xi)$ is proper and convex in $x$ over all of $\mathbb{R}^D$. Our analysis uses a slight variant of Algorithm 1, where the denominator includes an extra term $\lambda_{k+1} G^2$:
$$z_{k+1} = x_0 - \frac{1}{\sqrt[3]{\lambda_{k+1} G^2 + v_{k+1}}}\, s_{k+1}. \tag{6}$$
A similar term is also needed by the original DA-AdaGrad method in Duchi et al. (2011). We don't believe this term plays an important role in practice as its magnitude quickly diminishes, and so we have not included this term in Algorithm 1.
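For concreteness, a minimal NumPy sketch of the Algorithm 1 update follows (this is an illustration, not the reference implementation linked in the abstract; it assumes $\lambda_k = \gamma_k\sqrt{k+1}$ as in Section 5.3, a constant momentum $c$, and reads the denominator as $\sqrt[3]{\nu_{k+1}} + \epsilon$ without the extra $\lambda G^2$ analysis term):

```python
import numpy as np

def madgrad(x0, grad_fn, num_steps, gamma=0.01, c=0.1, eps=1e-6):
    """Sketch of the Algorithm 1 update. grad_fn(x, k) returns a stochastic gradient."""
    x = x0.copy()
    s = np.zeros_like(x0)    # weighted gradient sum s_k
    nu = np.zeros_like(x0)   # weighted squared-gradient sum nu_k
    for k in range(num_steps):
        g = grad_fn(x, k)
        lam = gamma * np.sqrt(k + 1.0)     # lambda_k = gamma_k sqrt(k+1), Section 5.3
        s += lam * g                       # s_{k+1} = s_k + lambda_k g_k
        nu += lam * g * g                  # nu_{k+1} = nu_k + lambda_k g_k * g_k
        z = x0 - s / (np.cbrt(nu) + eps)   # cube-root denominator
        x = (1.0 - c) * x + c * z          # flat momentum c, as recommended for deep learning
    return x
```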
Orabona and Pál (2015) describe a way to remove this term from AdaGrad-DA methods, at the expense of a more complicated analysis. A per-coordinate upper bound $G_d$ may be used instead of $G$ to further tighten the theory.

Theorem 1. After $k$ steps of MADGRAD using the update in Equation 6,
$$\mathbb{E}\left[f(x_k) - f(x^{*})\right] \leq \frac{6}{k^{1/2}}\, \|x_0 - x^{*}\|_2\, G D^{1/2},$$
if $c_k = \frac{3/2}{k + 3/2}$ and $\gamma = \frac{1}{k^{3/4} D^{3/4} G^{1/2}}\, \|x_0 - x^{*}\|_2^{3/2}$.

This bound is very loose. It results from the application of $|\nabla f(x, \xi)_i| \leq G$ to bound each index of the gradient at each time-step separately, which does not capture any of the adaptivity of the convergence rate. We discuss more precise bounds below. Note that $\|g\|_2 \leq D^{1/2}\|g\|_{\infty} = G D^{1/2}$, so the dependence on dimensionality here is comparable to bounds established for non-adaptive stochastic methods, which have bounds on the 2-norm of the gradient on the right instead. Note also that we recommend using a flat $c_k = c$ momentum for non-convex problems; this decaying rate is only optimal in the convex case. A value of $c = 0.1$ corresponds to the $\beta = 0.9$ momentum commonly used with SGD and Adam.

6.1 Adaptivity

To understand the adaptivity of the method at a more granular level, the convergence rate can be expressed as a sum over coordinates $d$ of terms involving $\mathbb{E}\big(\sum_{i=0}^{k}\lambda_i g_{id}^2\big)^{2/3}$ and $(x_{0,d} - x^{*}_{d})^2\, \mathbb{E}\big(\sum_{i=0}^{k}\lambda_i g_{id}^2\big)^{1/3}$ (see the proof of Theorem 6 in the appendix). The convergence rate thus heavily depends on the weighted sequence
$$\sum_{i=0}^{k}\lambda_i g_{id}^2 = \gamma\sum_{i=0}^{k}(i+1)^{1/2} g_{id}^2,$$
rather than the unweighted sum $\sum_{d=1}^{D}\sum_{i=0}^{k} g_{id}^2$ used in AdaGrad. This is key to understanding the performance characteristics of MADGRAD over traditional AdaGrad. In particular, large gradients at the early stages have a smaller effect on the overall bound than they do for AdaGrad. This can be quantified by considering the behavior when the gradient norm bound is time dependent, i.e. $\|\nabla f(x_i, \xi)\|_{\infty} \leq G_i$. Then, as we show in the appendix, for MADGRAD, when using optimal step sizes,
$$\mathbb{E}\left[f(x_k) - f(x^{*})\right] \leq \frac{6}{(k+1)^{5/4}}\, \|x_0 - x^{*}\|_2\, D^{1/2} \left(\sum_{i=0}^{k}(i+1)^{1/2} G_i^2\right)^{1/2},$$
whereas for AdaGrad with the use of momentum,
$$\mathbb{E}\left[f(x_k) - f(x^{*})\right] \leq \frac{6}{k+1}\, \|x_0 - x^{*}\|_2\, D^{1/2} \left(\sum_{i=0}^{k} G_i^2\right)^{1/2}.$$
In MADGRAD the effect of an outlier $G_i$ that is particularly large at time-step $i$ decays at a faster rate, with a power $5/4$ compared to linearly for AdaGrad. Using $\lambda_i$ with a larger power than $1/2$ is also possible within our momentumized dual averaged gradient framework, which would result in a faster decay. We have found that the $1/2$ power is a sweet spot, as larger values result in empirically slower convergence. Similar convergence rate bounds can be derived using the same proof technique, although they are prefixed by progressively larger constants (growing factorially in the power) as the power used is increased. In general, the advantage of MADGRAD over AdaGrad manifests in the common situation where the gradients are largest at the early stages of optimization.

6.2 Comparison to Adam

Although Adam is known to potentially diverge, we can consider the theoretical properties of the AMSGrad variant of Adam, which is perhaps the smallest modification to Adam that results in provable convergence. For AMSGrad, parameterized by momentum $\beta_1\lambda^{i-1}$ at step $i$, assuming a bounded domain with $R = \max_{x,y}\|x - y\|_2$, defining $\gamma = \beta_1/\sqrt{\beta_2}$, and using step size $\alpha_i = \alpha/\sqrt{i}$, the regret bound of Reddi et al. (2018) on $\sum_{i=1}^{k}\left[f(x_i) - f(x^{*})\right]$ consists of three terms: one proportional to $\frac{\beta_1 R G}{(1-\beta_1)^2(1-\lambda)^2}$, one proportional to $R\sum_{d=1}^{D}(\hat v_{k,d})^{1/2}$, and one proportional to $\frac{\alpha\sqrt{1+\log k}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}}$, where $\hat v$ is the maximum of the exponential moving average of the squared gradients; see Reddi et al. (2018) for further details. This result has a number of shortcomings compared to MADGRAD.
Firstly, note that the momentum term $1 - \beta_1$, comparable to $c$ in MADGRAD, divides each term in the bound. This means that momentum hurts rather than improves performance. The dependence on a bounded domain is also an undesirable property compared to MADGRAD, and the convergence theory of MADGRAD avoids log factors.

7. Experimental Results

In our experiments we compared MADGRAD against SGD, Adam and AdaGrad. SGD is known to perform well on computer vision classification problems due to its ability to produce solutions that generalize better than adaptive methods. In contrast, Adam is the method of choice in other domains with structured output where overfitting is less of an issue. We present results across a large number of problems across both categories to validate the general-purpose utility of the MADGRAD approach.

In our experiments we use the most common step-size reduction scheme used in the literature for each respective problem. For all algorithms, we performed a learning rate and decay sweep on a grid of values $\{1 \times 10^{i},\ 2.5 \times 10^{i},\ 5 \times 10^{i}\}$ for a range of $i$ large enough to ensure the best parameters for each problem and method were considered. We present the results from the best learning rate and decay for each method when considering test set performance. For other hyper-parameters, we used commonly accepted defaults for each respective problem. Full parameter settings used for each method are listed in the appendix. All presented results are averaged over a number of seeds, with error bars indicating 2 standard errors. Ten seeds were used for CIFAR-10 and IWSLT14, whereas only five seeds were used for the remaining larger-scale problems.

[Figure 3 plots (test metric and training loss versus epoch). CIFAR-10 (PreResNet152), test accuracy: Adam 94.69% (SE 0.04), SGD+M 95.22% (SE 0.10), AdaGrad 91.62% (SE 0.31), MADGRAD 95.38% (SE 0.05); training loss: MADGRAD 6.5e-04, AdaGrad 1.1e-03, Adam 6.9e-04, SGD+M 5.2e-04. ILSVRC 2012 ImageNet (ResNet-50), test accuracy: SGD+M 76.09% (SE 0.06), Adam 73.03% (SE 0.04), AdaGrad 68.55% (SE 0.13), MADGRAD 76.22% (SE 0.05); training loss: MADGRAD 8.2e-01, AdaGrad 1.4e+00, Adam 1.3e+00, SGD+M 9.5e-01. fastMRI Knee (VarNet 2.0), test metric: SGD+M 0.9032 (SE 0.00378), Adam 0.9111 (SE 0.00022), AdaGrad 0.9100 (SE 0.00036), MADGRAD 0.9119 (SE 0.00017); training loss: SGD+M 0.4390, Adam 0.4417, AdaGrad 0.4262, MADGRAD 0.4209.]

Figure 3: Experimental results for the CIFAR-10, ImageNet and fastMRI Knee problems. The left column shows test set performance and the right column shows training set performance.

7.1 CIFAR-10 image classification

CIFAR-10 (Krizhevsky, 2009) is an established baseline within the deep learning community due to its manageable size and representative performance within the class of data-limited supervised image classification problems. It is particularly notable for showing clear differences between adaptive and non-adaptive methods, as the former tend to overfit considerably on this problem. Following standard practice, we apply a data-augmentation step consisting of random horizontal flipping, 4px padding followed by random cropping to 32px, at training time only.
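In torchvision terms, this augmentation pipeline can be sketched as follows (the normalization follows the centering around 0.5 described in Appendix A; the unit standard deviation here is our assumption, as only the centering is specified):

```python
from torchvision import transforms

# Training-time augmentation: flip, 4px pad + random 32x32 crop, then centring around 0.5.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
])

# Test-time: no augmentation, same normalization.
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
])
```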
We used a high-performance pre-activation ResNet architecture (He et al., 2016b), which is known to work well on this problem, consisting of 58,144,842 parameters across 152 layers. The depth of this network is representative of the typical point of diminishing returns for network depth on computer vision problems. As this network is greatly over-parameterized, each method can be expected to fit the training data exactly, achieving near-zero loss, even with this data augmentation. For this reason, this task is particularly sensitive to differences in the generalization performance of each method.

As illustrated in Figure 3, both Adam and AdaGrad perform poorly on this problem in terms of test accuracy. The under-performance of Adam on this problem is well known (Wilson et al., 2017), and is typically attributed to convergence to poor local minima, as the training set convergence is very rapid initially. MADGRAD exhibits excellent test accuracy results on this problem, achieving the highest test accuracy among the methods considered. This demonstrates that, unlike Adam and AdaGrad, MADGRAD's adaptivity does not come at the cost of inferior generalization performance.

7.2 ILSVRC 2012 ImageNet image classification

The ImageNet problem (Krizhevsky et al., 2012) is a larger problem, more representative of the image classification problems encountered in industrial applications, where a large number of classes and higher-resolution input images are encountered. Like CIFAR-10, overfitting can be an issue on this problem for adaptive methods. We ran experiments using the ResNet-50 architecture, which is considered the standard baseline for this problem. This combination of data set and architecture is one of the most studied in all of machine learning, which makes it an ideal testing ground for optimization algorithms.

Our setup used data preprocessing consisting of a mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] normalization of the three respective color channels, followed by a RandomResizedCrop PyTorch operation to reduce the resolution to 224 pixels, followed by a random 50% chance of horizontal flipping. For test set evaluation, a resize to 256 pixels followed by a center crop to 224 pixels is used instead. This setup was used as it is standard within the PyTorch community; however, it differs from the setup in He et al. (2016a), meaning that test accuracy is close but not directly comparable.

On this problem both Adam and AdaGrad show similar convergence properties to those seen on the CIFAR-10 problem. They both greatly under-perform SGD with momentum. MADGRAD shows strong performance here as well, achieving higher test accuracy than any other method for the majority of the training time, and yielding the best final test accuracy. The accuracy of MADGRAD at epoch 70 is 75.87, a level only reached by SGD+M after the learning rate reduction at epoch 90, more than 28% longer. MADGRAD also performs the best on training loss on this problem.
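This preprocessing corresponds to the following standard torchvision sketch (illustrative; any parameters beyond those listed above are left at PyTorch defaults):

```python
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),   # 50% chance
    transforms.ToTensor(),
    normalize,
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```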
7.3 fastMRI challenge MRI reconstruction

The fastMRI Knee challenge (Zbontar et al., 2018) is a recently proposed large-scale image-to-image problem. Unlike the previously explored classification problems, the scale of this problem makes overfitting a non-concern given the number of weights in the largest models currently trainable on contemporary hardware, meaning that adaptive methods are not prone to overfitting. This problem is also particularly notable for being poorly conditioned among image processing problems. Part of the reason for the poor conditioning is the high depth of current SOTA models, such as the VarNet 2.0 model (Sriram et al., 2020) that we used. This model has 12,931,532 parameters over 273 layers. Our implementation uses 16 auto-calibration lines and an offset equispaced sampling pattern (Defazio, 2019), which is much closer to a realistic clinical configuration than the challenge's random sampling mask.

Figure 3 shows a number of interesting properties of the methods. SGD+M exhibits extremely variable performance on this problem, and under-performs the other methods by a large margin. AdaGrad also has a clear performance gap compared to the top performing methods, MADGRAD and Adam. MADGRAD is the best performer, with a small but statistically significant improvement over Adam, which is the standard method for this problem. Training set performance shows a much higher degree of variability, making comparisons difficult; however MADGRAD appears to be the best performing method on training loss as well.

7.4 Machine translation with a recurrent neural network

For a machine translation baseline we trained our model on the IWSLT14 German-to-English dataset (Cettolo et al., 2014), using a popular LSTM variant introduced by Wiseman and Rush (2016). Figure 4 shows that all of the adaptive methods out-perform SGD on this problem by a significant margin. The results are close, but MADGRAD has a small performance lead, yielding 4.33 test loss compared to 4.38 for AdaGrad and 4.35 for Adam. In training loss, AdaGrad's lead over the other methods can be attributed to a slight degree of overfitting; there is a slight increase in test loss near the end of optimization for AdaGrad which indicates this.

7.5 Masked language modeling with a Transformer

Bidirectional training objectives, as used in the BERT approach (Devlin et al., 2019), have quickly established themselves as the new standard for large-scale pre-training of natural language models. We performed our experiments using the RoBERTa variant of BERT-Base (Liu et al., 2019), a 110M parameter transformer model. This model is large enough to provide a realistic optimization test-bed for large-scale Transformer models while still being trainable in time comparable to a ResNet-50 model on ImageNet.

Similar to the LSTM problem, SGD+M performs poorly here. It exhibits some spikes where training loss rapidly degrades then recovers quickly.

[Figure 4 plots (loss versus training step). IWSLT14 (LSTM), test loss: MADGRAD 4.33 (SE 0.004), SGD+M 4.49 (SE 0.003), Adam 4.35 (SE 0.005), AdaGrad 4.38 (SE 0.002); training loss: MADGRAD 4.20, SGD+M 4.34, Adam 4.27, AdaGrad 4.06. BookWiki (RoBERTa masked language model), test loss: SGD+M 2.37 (SE 0.002), Adam 2.09 (SE 0.003), MADGRAD 2.07 (SE 0.001), AdaGrad 2.49 (SE 0.004); training loss: SGD+M 2.46, Adam 2.17, MADGRAD 2.14, AdaGrad 2.58.]

Figure 4: Experimental results for the IWSLT14 and BookWiki problems. The left column shows test set performance and the right column shows training set performance.
Both Adam and MADGRAD perform well; however, MADGRAD is significantly faster initially, and also achieves a better final test loss of 2.07, compared to 2.09 for Adam.

8. Discussion

8.1 Hyper-parameter settings

We have made the following observations during our experimentation:

- Typically, using the default weight decay from previous SGD/Adam training runs will result in poor generalization performance. Weight decay will need to be much smaller, potentially even 0, for good performance. We recommend reducing the weight decay before any learning rate tuning.
- Learning rate values are not directly comparable to those of SGD/Adam; a full learning rate sweep is necessary to find the optimal value. In the appendix we list the best LR values for each of our test problems, which should form a good starting point. Sweeping across a power-of-2 grid is recommended, as the best value may be several orders of magnitude different from that of SGD/Adam.
- Momentum values used for SGD/Adam should work without issue, by setting $c = 1 - \beta$ for momentum $\beta$.

8.2 Empirical results in deep learning

We believe our experimental validation is one of the most comprehensive performed for any newly proposed deep learning optimization method. More than 20,000 hours of GPU time were needed to perform the grid search and final evaluation mentioned above, as we performed the search for each of the methods considered, rather than just the MADGRAD method. This prevents our method from looking better than it otherwise would due to hyper-parameter optimization rather than an actual performance advantage. Our comparison also includes a number of large and realistic problems, which are more representative of modern deep learning than small-scale problems. Finally, our final results are averaged over a sufficiently large number of seeds for each problem to ensure that run-to-run variation is not mistaken for actual performance differences. This is particularly a problem with CIFAR-10, yet many published results still use only a single seed for comparisons on that problem. For these reasons, we believe our experimental results for MADGRAD are representative of the performance of the method across modern large-scale empirical risk minimization problems.

8.3 Sparsity

The reliance on a slowly updating moving average for the squared gradient within the Adam method greatly hinders its application to sparse models. In contrast, MADGRAD maintains a simple sum of the squared gradient entries, which may be updated in a sparse fashion. One potential problem in the sparse case is that the buffer of iterates (rather than gradients) is maintained with a moving average. To support sparse applications, the iterate buffer may be removed, which is effectively equivalent to setting $c = 1$.

9. Conclusion

We have proposed the MADGRAD (Momentumized, Adaptive, Dual averaged GRADient) method as a general-purpose optimizer for deep learning. Given MADGRAD's state-of-the-art empirical performance, together with its strong theoretical foundations, it is an excellent first choice of optimizer across many sub-fields of machine learning.

References

Lukas Balles and Philipp Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, 2018.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), 2020.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. 2014.

Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, and George E. Dahl. On empirical comparisons of optimizers for deep learning, 2020.

Aaron Defazio. Offset sampling improves deep learning based accelerated MRI reconstructions by exploiting symmetry. arXiv preprint arXiv:1912.01101, 2019.

Aaron Defazio. Understanding the role of momentum in non-convex optimization: Practical insights from a Lyapunov analysis. arXiv preprint arXiv:2010.00406, 2020.

Aaron Defazio and Robert M. Gower. The power of factorial powers: New parameter settings for (stochastic) optimization, 2020.

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision - ECCV 2016, 2016b.

Samy Jelassi and Aaron Defazio. Dual averaging is surprisingly effective for deep learning optimization, 2020. URL https://arxiv.org/abs/2010.10502.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

Kfir Y. Levy, Alp Yurtsever, and Volkan Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, 2018.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 983-992. PMLR, 16-18 Apr 2019. URL http://proceedings.mlr.press/v89/li19c.html.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Yu. Nesterov and Vladimir Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. Journal of Optimization Theory and Applications, 165(3):917-940, 2015.

Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

Francesco Orabona and Dávid Pál. Scale-free algorithms for online linear optimization. In Kamalika Chaudhuri, Claudio Gentile, and Sandra Zilles, editors, Algorithmic Learning Theory, pages 287-301, Cham, 2015. Springer International Publishing. ISBN 978-3-319-24486-0.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv, 2019.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Robin M. Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley - benchmarking deep learning optimizers, 2020.

Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. On the convergence of the stochastic heavy ball method. arXiv preprint arXiv:2006.07867, 2020.

Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa Yakubova, Florian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI reconstruction. arXiv preprint arXiv:2004.06688, 2020.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139-1147, 2013.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp, COURSERA: Neural networks for machine learning. Technical report, 2012. URL https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.

Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016.

Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.

Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization, 2020.

Fangyu Zou, Li Shen, Zequn Jie, Ju Sun, and Wei Liu. Weighted AdaGrad with unified momentum, 2019a.

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSProp. CVPR, 2019b.

Appendix A. Parameter settings

Our data augmentation pipeline followed standard practice: random horizontal flipping, then random cropping to 32x32, then normalization by centering around (0.5, 0.5, 0.5).
CIFAR-10:

| Hyper-parameter | Value |
| --- | --- |
| Architecture | PreActResNet152 |
| Epochs | 300 |
| GPUs | 1x V100 |
| Batch size per GPU | 128 |
| LR schedule | 150-225 tenthing |
| Seeds | 10 |

| Method | LR | Decay |
| --- | --- | --- |
| MADGRAD | 2.5e-4 | 0.0001 |
| AdaGrad | 0.01 | 0.0001 |
| Adam | 0.00025 | 0.0001 |
| SGD | 0.1 | 0.0001 |

ImageNet: A standard LR schedule was used, where the learning rate is decreased 10-fold every 30 epochs. Interestingly, for this problem, a smaller decay constant improved the performance of MADGRAD, but didn't yield any improvement for the other methods considered.

| Hyper-parameter | Value |
| --- | --- |
| Architecture | ResNet50 |
| Epochs | 90 |
| GPUs | 8x V100 |
| Batch size per GPU | 32 |
| LR schedule | 30-60-90 tenthing |
| Seeds | 5 |

| Method | LR | Decay |
| --- | --- | --- |
| MADGRAD | 0.001 | 2.5e-5 |
| AdaGrad | 0.01 | 0.0001 |
| Adam | 0.00025 | 0.0001 |
| SGD | 0.1 | 0.0001 |

fastMRI Knee: For this task, the best learning rate schedule is a flat schedule, with a small number of fine-tuning epochs at the end to stabilize. To this end, we decreased the learning rate 10-fold at epoch 40.

| Hyper-parameter | Value |
| --- | --- |
| Architecture | 12 layer VarNet 2 |
| Epochs | 50 |
| GPUs | 8x V100 |
| Batch size per GPU | 1 |
| Acceleration factor | 4 |
| Low frequency lines | 16 |
| Mask type | Offset-1 |
| LR schedule | 40 tenthing |
| Seeds | 5 |

| Method | LR | Decay |
| --- | --- | --- |
| MADGRAD | 0.01 | 0.0 |
| AdaGrad | 0.25 | 0.0 |
| Adam | 0.00025 | 0.0 |
| SGD | 0.01 | 0.0 |

IWSLT14: Our implementation used FairSeq defaults except for the parameters listed below.

| Hyper-parameter | Value |
| --- | --- |
| Architecture | lstm_wiseman_iwslt_de_en |
| Max updates | 60,000 |
| GPUs | 1x V100 |
| Max tokens per batch | 4096 |
| Warmup steps | 4000 |
| Dropout | 0.3 |
| Label smoothing | 0.1 |
| Share decoder/input/output embed | True |
| Float16 | True |
| Update frequency | 1 |
| LR schedule | Inverse square-root |
| Seeds | 10 |

| Method | LR | Decay |
| --- | --- | --- |
| MADGRAD | 0.025 | 5e-6 |
| AdaGrad | 0.25 | 1e-5 |
| Adam | 0.01 | 0.05 |
| SGD | 1.0 | 1e-5 |

BookWiki (RoBERTa): Our implementation used FairSeq defaults except for the parameters listed below.

| Hyper-parameter | Value |
| --- | --- |
| Architecture | roberta_base |
| Task | masked_lm |
| Max updates | 20,000 |
| GPUs | 8x V100 |
| Max tokens per sample | 512 |
| Dropout | 0.1 |
| Attention dropout | 0.1 |
| Max sentences | 16 |
| Warmup | 10,000 |
| Sample break mode | Complete |
| Share decoder/input/output embed | True |
| Float16 | True |
| Update frequency | 16 |
| LR schedule | Polynomial decay |
| Seeds | 5 |
| Gradient clipping | 0.5 |

| Method | LR | Decay |
| --- | --- | --- |
| MADGRAD | 0.005 | 0.0 |
| AdaGrad | 0.01 | 0.0 |
| Adam | 0.001 | 0.0 |
| SGD | 1.0 | 0.0 |

Appendix B. Theory

B.1 Theoretical variant

We analyze a variant of the MADGRAD algorithm, using a fixed step size $\gamma$ and $\lambda_k = \gamma\sqrt{k+1}$:
$$s_{k+1} = s_k + \lambda_k g_k, \qquad v_{k+1} = v_k + \lambda_k g_k^2, \qquad z_{k+1} = x_0 - \frac{1}{\sqrt[3]{\lambda_{k+1} G^2 + v_{k+1}}}\, s_{k+1}, \qquad x_{k+1} = (1 - c_{k+1})\, x_k + c_{k+1} z_{k+1}. \tag{7}$$
This variant differs from Algorithm 1 only in the addition of $\lambda_{k+1} G^2$ in the denominator, which is necessitated by our analysis method. Note that the AdaGrad-DA formulation originally proposed by Duchi et al. (2011) also requires this extra term.

B.2 Support function

We define a matrix analogue of the support function from Nesterov (2009):
$$V_{A_k}(-s_k) = \max_x \left\{ \langle -s_k,\; x - x_0 \rangle - \frac{1}{2}\|x - x_0\|_{A_k}^2 \right\}. \tag{8}$$
In this work we only consider diagonal $A_k$, represented by a vector $a_k$: $A_k = \operatorname{diag}(a_k)$. In this notation, we have $a_k = \sqrt[3]{\lambda_k G^2 + v_k}$. The maximizer of expression 8 is (using component-wise division):
$$z_k = x_0 - \frac{s_k}{a_k}.$$
Since $v_{k+1}$ is non-decreasing, it's clear that:
$$V_{A_{k+1}}(-s_k) \leq V_{A_k}(-s_k). \tag{9}$$
We will also use the following properties, which follow directly by modifying the argument in Nesterov (2009) to handle scaling matrices instead of constants:
$$\nabla V_{A_k}(-s_k) = z_k - x_0, \tag{10}$$
$$V_{A_k}(s + \delta) \leq V_{A_k}(s) + \langle \delta,\; \nabla V_{A_k}(s) \rangle + \frac{1}{2}\|\delta\|_{A_k^{-1}}^2. \tag{11}$$

Lemma 2. For all natural $k$, assuming $\lambda_{k+1} \geq \lambda_k$:
$$\sum_{t=0}^{k} \frac{\lambda_t^2 g_t^2}{\left(\lambda_t G^2 + \sum_{i=0}^{t-1}\lambda_i g_i^2\right)^{1/3}} \leq \frac{3}{2}\,\lambda_k \left(\sum_{i=0}^{k}\lambda_i g_i^2\right)^{2/3}.$$
Proof. We prove by induction.
For the base case: g2 0 (G2)1/3 g2(1 1/3) = g2 2/3 3 Now assume the lemma holds for k 1 then using the inductive hypothesis λ2 t g2 t λt G2 + Pt 1 i=0 λig2 i 1/3 λ2 kg2 k λt G2 + Pk 1 i=0 λig2 i 1/3 + 3 λ2 kg2 k λt G2 + Pk 1 i=0 λig2 i 1/3 + 3 Define bk = Pk i=0 λig2 i and ak = g2 k then we have: λ2 t g2 t λt G2 + Pt 1 i=0 λig2 i 1/3 λ2 kak λk G2 + bk λkak 1/3 + 3 2λk (bk λkak)2/3 . A Momentumized, Adaptive, Dual Averaged Gradient Method We have two terms on the right to consider. For the first term, note that since ak G2, λ2 kak λk G2 + bk λkak 1/3 λ2 kak (bk) 1/3 . For the 2nd term, we can use concavity to get: 3 2λk (bk λkak)2/3 3 2λk (bk)2/3 λ2 kak (bk) 1/3 . Combining gives: k X λ2 t g2 t λt G2 + Pt 1 i=0 λig2 i 1/3 3 2λk (bk)2/3 , and so the inductive case is proven. Lemma 3. Let 0 < r < 1 and j 0. Then define: ck = r + 1 k + j + r, for all k 0 it then holds that: ck (k + j)r 1 ck 1 (k + j 1)r. Proof We start by simplifying: ck (k + j)r = 1 r+1 k+j+r r+1 k+j+r (k + j)r, r + 1 (k + j)r, = k + j + r 1 r + 1 k + j 1 k + j + r 1(k + j)r, k + j 1 k + j + r 1(k + j)r. So we need: (k + j)r k + j + r 1 k + j 1 (k + j 1)r . Recall the concavity upper bound: f(x) f(y) + f(y), x y , using f(x) = (k + j)r which is concave for r (0, 1), and x = k + j, y = k + j 1, we have: (k + j)r (k + j 1)r + r (k + j 1)r 1 , = (k + j 1)r + r k + j 1 (k + j 1)r , = k + j 1 + r k + j 1 (k + j 1)r . Which establishes the result. Lemma 4. The dual averaging iterates obey: zk = xk 1 ck ck (xk 1 xk) . (12) Proof We rearrange the x update: xk+1 = (1 ck+1) xk + ck+1zk+1. xk = (1 ck) xk 1 + ckzk, ckzk = xk (1 ck)xk 1, Theorem 5. Consider the MADGRAD method. We upper bound the quantity VAk+1 ( sk+1) as follows: For the first step k = 0: VA1 ( s1) λ2 0 2 f (x0, ξk) 2 A 1 0 . For subsequent steps k 1: VAk+1 ( sk+1) VAk ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k + λk f (xk, ξk) , x0 x ck λk [f(xk, ξk) f(x , ξk)] + 1 ck ck λk [f(xk 1, ξk) f(x , ξk)] . Proof Base case: VA1 ( s1) (13) λ0 f (x0, ξk) , V0 ( s0) + λ2 0 2 f (x0, ξk) 2 A 1 0 (Eq. 11), = λk f (xk, ξk) , x0 x0 + λ2 0 2β0 f (x0, ξk) 2 A 1 0 , (Eq. 10) = λ2 0 2 f (x0, ξk) 2 A 1 0 . (14) A Momentumized, Adaptive, Dual Averaged Gradient Method Inductive case: VAk+1 ( sk+1) VAk ( sk+1) VAk ( sk) λk f (xk, ξk) , VAk ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k , (Eq. 11) = VAk ( sk) + λk f (xk, ξk) , x0 zk + λ2 k 2 f (xk, ξk) 2 A 1 k , (Eq. 10) = VAk ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k f (xk, ξk) , x0 xk + 1 ck (xk 1 xk) , (Eq. 12) = VAk+1 ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k + λi f (xk, ξk) , x0 xk + λk 1 ck ck f (xk, ξk) , xk 1 xk , = VAk+1 ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k + λk f (xk, ξk) , x0 x + λk f (xk, ξk) , x xk ck f (xk, ξk) , xk 1 xk . Now we use: f (xk, ξk) , x xk f(x , ξk) f(xk, ξk), and: f (xk, ξk) , xk 1 xk f(xk 1, ξk) f(xk, ξk), VAk+1 ( sk+1) VAk ( sk) + λ2 k 2 f (xk, ξk) 2 A 1 k + λk f (xk, ξk) , x0 x + λk [f(x , ξk) f(xk, ξk)] + λk 1 ck ck [f(xk 1, ξk) f(xk, ξk)] , grouping function value terms gives the result. B.4 Convergence rate Theorem 6. After k steps of MADGRAD, E [f(xk) f(x )] 6 k1/2 x0 x GD1/2, if ck = 3/2 k+3/2 and γ = 1 k3/4D3/4G1/2 x0 x 3/2 . We assume that γk = γ is a constant. First note that for our choice of λk = γ (k + 1)1/2 ck = 3/2 k + 3/2, applying Lemma 3 gives that: 1 ck ck λk 1 ck 1 λk 1. 
Using this bound we can telescope the bound from Theorem 5 after taking expectations: 1 ck λk [f(xk, ξk) f(x , ξk)] E VAk+1 ( sk+1) + 1 t=0 λ2 t f (xt, ξt) 2 A 1 t i=0 λi f (xi, ξi) , x0 x Now note that sk+1 = Pk i=0 λi f (xi, ξi), so: E VAk+1 ( sk+1) = E max x sk+1, x x0 1 2 x x0 2 Ak+1 E sk+1, x x0 1 2 x x0 2 Ak+1 i=0 λi f (xi, ξi) , x0 x 2 x x0 2 Ak+1 . So combining this bound and further using the definition of ck and λk: 3/2 γ (k + 1)1/2 E [f(xk) f(x )] 1 t=0 λ2 t f (xt, ξt) 2 A 1 t 2 x x0 2 Ak+1 . To simplify further we need to start working in a coordinate wise fashion. Let D be the number of dimensions in x, then we can write the above bound using Lemma 2 applied coordinate wise as: 3/2 γ (k + 1)1/2 E [f(xk) f(x )] 1 i=0 λig2 id d=0 (x0x x d)2 E i=0 λig2 id A Momentumized, Adaptive, Dual Averaged Gradient Method We now apply the bound gid G: 3/2 γ (k + 1)1/2 E [f(xk) f(x )] 3 i=0 λi G2 !2/3 d=0 (x0x x d)2 k+1 X i=0 λi G2 !1/3 Since λk = γ (k + 1)1/2, we can further simplify using the summation property: i=0 (i + 1)1/2 2 3 (k + 2)3/2 , we apply on the two locations on the right to give: 3/2 γ (k + 1)1/2 E [f(xk) f(x )] 1 d=0 (k + 1)1/2 (k + 2) G4/3 d=0 (x0x x d)2 (k + 3)1/2 G2/3. (k + 3/2)(k + 1) (k + 3/2)1/2 + (3/2)1/2 (k + 3/2)(k + 1) 1 k + 1 + 1 (k + 1) and likewise: k + 2 k + 3/2 2 so after rearranging: 2 3E [f(xk) f(x )] 2γ2/3G4/3D + γ 2/3G2/3 2 k + 1 d=0 (x0x x d)2 , E [f(xk) f(x )] 3γ2/3G4/3D + 3 k + 1γ 2/3G2/3 x0 x 2 . Taking the gradient with respect to γ to zero gives 3γ 1/3G4/3D 2 3(k + 1)γ 5/3G2/3 x0 x 2 , γ 1G4D3 = 1 (k + 1)3 γ 5G2 x0 x 6 , γ4 = 1 (k + 1)3 D3G2 x0 x 6 , (k + 1)3/4 D3/4G1/2 x0 x 3/2 . Using this optimal γ gives: γ2/3 = 1 k1/2D1/2G1/3 x0 x . E [f(xk) f(x )] 6 k1/2 x0 x GD1/2. Note that g 2 D1/2 g = D1/2G, so the dependence on dimensionality here is comparable to standard stochastic method proofs which have g 2 on the right instead. B.5 Time varying case Consider the situation where the bound on the gradient potentially varies over time. f(xi, ξ) Gi for all x, ξ. Then using the same argument as in the previous section we arrive at: E [f(xk) f(x )] 3γ2/3 1 (k + 1)D i=0 (i + 1)1/2 G2 i (k + 1)3/2 x0 x 2 2 i=0 (i + 1)1/2 G2 i We may solve for the optimal step size, giving: (k + 1)1/2 x0 x 2 2 Pk+1 i=0 (i + 1)1/2 G2 i 1/3 D Pk+1 i=0 (i + 1)1/2 G2 i 2/3 , (k + 1)1/2 x0 x 2 2 D Pk+1 i=0 (i + 1)1/2 G2 i 1/3 , (k + 1)1/4 x0 x 2 D1/2 Pk+1 i=0 (i + 1)1/2 G2 i 1/6 . A Momentumized, Adaptive, Dual Averaged Gradient Method Then substituting this in gives: E [f(xk) f(x )] 6 1 (k + 1)5/4 D1/2 x0 x 2 i=0 (i + 1)1/2 G2 i When applying λi = γ, as in Ada Grad, we instead get: E [f(xk) f(x )] 3 γ1/2 + 3 1 (k + 1) γ1/2 x0 x 2 2 solving for the optimal step size: = 1 (k + 1) γ 3/2 x0 x 2 2 γ2 = x0 x 2 2 D . E [f(xk) f(x )] 6 (k + 1) x0 x 2 D1/2 k X Appendix C. Cube root formulation Consider the minimization problem parameterized by g : k D and a single vector s : D, where g2 id sd , s 2 2 c, d : sd > 0 In this section we show that sd 3q Pk i=0 g2 id is a solution. Without loss of generality we disregard the inequality constraint on sd and consider only positive solutions to the equality constrained problem. We will apply the method of Lagrange multipliers. Firstly we form the Lagrangian with multiplier µ: g2 id sd + µ Saddle-points of the Lagrangian can be found by equating the gradients to zero and solving. 
$$\frac{\partial}{\partial s_d} L(s, \mu) = -\frac{1}{s_d^2}\sum_{i=0}^{k} g_{id}^2 + \mu s_d, \qquad \frac{\partial}{\partial \mu} L(s, \mu) = \frac{1}{2}\left(\sum_{d=1}^{D} s_d^2 - c\right).$$
Setting the first equation to zero gives
$$\frac{1}{s_d^2}\sum_{i=0}^{k} g_{id}^2 = \mu s_d,$$
and therefore $s_d = \mu^{-1/3}\left(\sum_{i=0}^{k} g_{id}^2\right)^{1/3}$; since $s_d$ is positive we take the positive root. The Lagrange multiplier $\mu$ is given by the requirement that $\sum_{d=1}^{D} s_d^2 = c$, which gives
$$\mu^{2/3} = \frac{1}{c}\sum_{d=1}^{D}\left(\sum_{i=0}^{k} g_{id}^2\right)^{2/3},$$
and therefore
$$s_d = \sqrt{\frac{c}{\sum_{d'=1}^{D}\left(\sum_{i=0}^{k} g_{id'}^2\right)^{2/3}}}\;\left(\sum_{i=0}^{k} g_{id}^2\right)^{1/3}.$$
We can verify that this is an extreme point of the original problem by noting that the linear independence constraint qualification (LICQ) condition trivially holds when using one equality constraint. Since the objective is convex for $s_d > 0$, this point must be a minimizer.
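As a quick numerical sanity check of this solution (illustrative only, not part of the analysis), the closed form can be compared against random feasible points on the sphere $\|s\|_2^2 = c$:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(50, 8))           # k x D matrix of gradients
c = 1.0
col = (g ** 2).sum(axis=0)             # sum_i g_{id}^2 for each coordinate d

def objective(s):
    return (col / s).sum()

# Closed-form candidate: s_d proportional to (sum_i g_{id}^2)^(1/3), scaled so ||s||_2^2 = c.
s_star = np.cbrt(col)
s_star *= np.sqrt(c) / np.linalg.norm(s_star)

for _ in range(1000):
    s = np.abs(rng.normal(size=8)) + 1e-12   # random positive direction
    s *= np.sqrt(c) / np.linalg.norm(s)      # project onto the constraint set
    assert objective(s_star) <= objective(s) + 1e-9
print("closed-form s attains the smallest objective among sampled feasible points")
```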