# Better SGD using Second-order Momentum

Hoang Tran, Boston University, tranhp@bu.edu
Ashok Cutkosky, Boston University, ashok@cutkosky.com

We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations. Our algorithm uses Hessian-vector products to correct a bias term in the momentum of SGD with momentum. This leads to better gradient estimates in a manner analogous to variance reduction methods. In contrast to prior work, we do not require excessively large batch sizes, and we are able to provide an adaptive algorithm whose convergence rate automatically improves with decreasing variance in the gradient estimates. We validate our results on a variety of large-scale deep learning architectures and benchmark tasks.

## 1 Introduction

In the last couple of decades, first-order algorithms such as Stochastic Gradient Descent (SGD) or Adam [Kingma and Ba, 2014] have emerged as the main workhorse for modern Machine Learning (ML) tasks. They achieve good empirical results while being easy to implement and requiring relatively few computational resources. When the objective function is convex, SGD's convergence is well-understood [Zinkevich, 2003]. However, the results are much less favorable for the non-convex setting in which modern deep learning models operate. In fact, the problem of optimizing a non-convex function is NP-hard in general, so analysis instead often focuses on finding a critical point - that is, a point at which the gradient of the loss is zero. SGD is well-known to find an $\epsilon$-approximate critical point in at most $O(\epsilon^{-4})$ total stochastic gradient evaluations [Ghadimi and Lan, 2013].

In an effort to improve upon SGD, many different algorithms and heuristics have been proposed, including various advanced learning rate schedules [Loshchilov and Hutter, 2016, Goyal et al., 2017, Xie et al., 2020], or per-coordinate learning rates and adaptive algorithms [Duchi et al., 2011, McMahan and Streeter, 2010, Kingma and Ba, 2014, Reddi et al., 2018]. All of these methods have enjoyed practical success, but none of their convergence rates has shown any asymptotic benefit over the $O(\epsilon^{-4})$ rate of SGD, which is to be expected because this rate is in fact optimal in the worst case for first-order methods operating on smooth losses [Arjevani et al., 2019]. Thus, in order to build improved algorithms, we need additional assumptions. In this work, we consider the case in which our algorithm is not a pure first-order method, but has some limited access to second-order information.

Second-order algorithms such as Newton's method are among the most powerful methods in optimization theory. Instead of using a linear approximation of the objective via the gradient, Newton's method employs a quadratic approximation using both the gradient and the Hessian. This quadratic approximation hugs the curvature of the error surface and allows each iteration to make much faster progress. Unfortunately, second-order algorithms are also much more complex and expensive to implement. The cost of forming the Hessian matrix is $O(d^2)$ and the cost of a Newton step is typically $O(d^3)$, rendering such algorithms largely impractical for large-scale learning problems [Bottou and Bousquet, 2007]. The good news is that we may not have to explicitly compute the Hessian matrix to take advantage of the nice properties of second-order methods.
In particular, it is possible to compute a Hessian-vector product, that is, a vector of the form $\nabla^2 F(x)v$ for arbitrary $x$ and $v$, in roughly the same time it takes to evaluate $\nabla F(x)$ [Pearlmutter, 1994]. In this paper, we develop a novel second-order SGD-based algorithm that uses a stochastic gradient and Hessian-vector product oracle to expedite the training process. Our algorithm not only enjoys optimal theoretical properties, it is also practically effective, as demonstrated through our experimental results across various deep learning tasks. Importantly, our method works well on a variety of tasks in different domains, while standard practice is to employ different optimizers for different domains (e.g. AdamW for NLP, SGD for computer vision).

### 1.1 Contributions

We present a novel algorithm based on SGD with momentum (SGDHess) that uses Hessian-vector products to correct a bias term in the momentum. Through our theoretical analysis, we show that our algorithm requires the optimal $O(\epsilon^{-3})$ oracle calls for finding stationary points [Arjevani et al., 2020b]. In contrast to previous algorithms with this property (e.g. [Arjevani et al., 2020b]), we do not require excessively large batch sizes, and our algorithm and analysis are relatively straightforward: our method is a simple modification to SGD that could easily be applied to any momentum-based optimizer. We also provide a variant of our algorithm based on normalized SGD, which dispenses with a Lipschitz assumption on the objective, and another variant with an adaptive learning rate that automatically improves to a rate of $O(\epsilon^{-2})$ when the noise in the gradients is negligible. Finally, we test our algorithm on multiple learning tasks on different deep architectures. In all of these tasks, our algorithm consistently matches or exceeds the best algorithms for the task. Further, the tuning process for our algorithm is reasonably simple; in many cases, we can use the exact parameters of SGD for the new algorithm.

### 1.2 Related work

Although second-order methods have many attractive theoretical properties, making these methods practical is challenging. There have been multiple efforts to develop efficient second-order algorithms. One popular approach is the Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) [Keskar and Wächter, 2019, Liu and Nocedal, 1989, Bollapragada et al., 2018, Pan et al., 2017], which provides a faster approximation to the Newton step using only a first-order oracle. Although BFGS and similar methods have been applied with some success to deep learning tasks [Bollapragada et al., 2018, Ma, 2020], we stress that such methods are fundamentally incapable of matching the convergence guarantees available to algorithms that truly use the Hessian in the stochastic non-convex setting [Arjevani et al., 2019]. More recent attention has focused on the use of Hessian-vector products as a way to achieve some of the benefits of second-order information without incurring significant computational overhead [Carmon et al., 2018, Agarwal et al., 2017]. [Tripuraneni et al., 2017] uses Hessian-vector products to approximate the cubic regularized Newton method [Nesterov and Polyak, 2006], converging in only $O(\epsilon^{-3.5})$ stochastic oracle calls. [Zhou et al., 2019] leverages both Hessian information and variance-reduction techniques to achieve a convergence rate of $O(n^{4/5}/\epsilon^{3/2})$ for finite-sum problems with $n$ summands.
However, both algorithms fail to achieve the optimal $O(\epsilon^{-3})$ rate, and to our knowledge none of these has been tested extensively on deep learning benchmarks. The first algorithm achieving the optimal $O(\epsilon^{-3})$ rate using Hessian-vector products was introduced in [Arjevani et al., 2020a]. However, their algorithm requires the batch size to be roughly $O(T^{1/3})$, which is excessively large for practical use-cases. On the other hand, our algorithm will work with any batch size. Finally, variance-reduction based algorithms [Johnson and Zhang, 2013] can also achieve this same $O(1/T^{1/3})$ rate for non-convex optimization [Fang et al., 2018, Zhou et al., 2018, Cutkosky and Orabona, 2019, Tran-Dinh et al., 2019]. Despite favorable theoretical guarantees, variance-reduction algorithms often rely on stronger smoothness assumptions, are tricky to implement properly, and seem to not perform as well as expected in practice [Defazio and Bottou, 2018]. A comparison between variance-reduction algorithms and our algorithm (SGDHess) is discussed further at the end of the paper.

### 1.3 Problem setup

We are interested in minimizing a function $F$ given by:
$$F(x) = \mathbb{E}_{z\sim P_z}[f(x, z)]$$
where $f(x, z)$ is a differentiable function of $x$. Here $x$ indicates the model parameters (e.g. the weights of some neural network) and $z$ indicates an example data point ($z$ could also indicate a minibatch of examples). We are also given a point $x_1$ and define $\Delta$ by:
$$\sup_{x\in\mathbb{R}^d} F(x_1) - F(x) = \Delta \tag{1}$$
Further, we assume $F$ is $L$-smooth. That is, for all $x$ and $y$:
$$\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\| \tag{2}$$
To design a second-order algorithm, we also assume access to a second-order stochastic oracle. Given any $x$ and vector $w$, we are allowed to sample $z\sim P_z$ and compute $\nabla f(x, z)$ and $\nabla^2 f(x, z)w$. We assume that for all $x$ and $w$:
$$\mathbb{E}[\|\nabla f(x, z) - \nabla F(x)\|^2] \le \sigma_G^2 \tag{3}$$
$$\mathbb{E}[\|\nabla^2 f(x, z)w - \nabla^2 F(x)w\|^2] \le \sigma_H^2\|w\|^2 \tag{4}$$
In addition, we will also need the assumption of second-order smoothness. For all $x$, $y$, and $w$:
$$\|(\nabla^2 F(x) - \nabla^2 F(y))w\| \le \rho\|x - y\|\,\|w\| \tag{5}$$
Finally, in two of our algorithms, we need to assume $f$ is $G$-Lipschitz:
$$\|\nabla f(x, z)\| \le G \tag{6}$$

## 2 SGD with Hessian-corrected momentum

In this section, we provide our SGD-based second-order algorithm. Before we go into the analysis, let us compare our algorithm to SGD to see why adding the Hessian-vector product term to the momentum update is a good idea. The standard SGD with momentum update is the following:
$$\hat g_t = (1-\alpha)\hat g_{t-1} + \alpha\nabla f(x_t, z_t), \qquad x_{t+1} = x_t - \eta\hat g_t$$
To gain some intuition for this update, let us suppose that $\hat g_{t-1} = \nabla F(x_{t-1})$. Then, viewing $\hat g_t$ as an estimate of $\nabla F(x_t)$, define the error as:
$$\hat\epsilon_t = \hat g_t - \nabla F(x_t)$$
Intuitively, SGD will converge rapidly if this error is small. We can write:
$$\hat\epsilon_t = (1-\alpha)(\hat g_{t-1} - \nabla F(x_t)) + \alpha(\nabla f(x_t, z_t) - \nabla F(x_t))$$
The second term is fairly benign: it is zero in expectation and is multiplied by a potentially small value $\alpha$. However, the first term is a bit trickier to bound since $\nabla F(x_{t-1}) \ne \nabla F(x_t)$. Our approach is to leverage the second-order oracle to improve this estimate. Specifically, we modify the momentum update to:
$$\hat g_t = (1-\alpha)(\hat g_{t-1} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})) + \alpha\nabla f(x_t, z_t)$$
Now, if $\hat g_{t-1} = \nabla F(x_{t-1})$, we would have
$$\mathbb{E}[\hat g_{t-1} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})] = \hat g_{t-1} + \nabla^2 F(x_t)(x_t - x_{t-1}) = \nabla F(x_t) + O(\|x_t - x_{t-1}\|^2)$$
Further, we have:
$$\hat\epsilon_t = (1-\alpha)(\hat g_{t-1} - \nabla F(x_t) + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})) + \alpha(\nabla f(x_t, z_t) - \nabla F(x_t))$$
so that the first term is now bounded by $O(\|x_{t-1} - x_t\|^2)$, which can be controlled through the learning rate $\eta$, and the second term is again easy to control via tuning $\alpha$. Without this second-order correction, we need to rely on $\hat g_{t-1} = \nabla F(x_t) + O(\|x_t - x_{t-1}\|)$. Thus, by using another term of the Taylor expansion, we have improved the dependency to $O(\|x_t - x_{t-1}\|^2)$, which will create a corresponding reduction in our final error.
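To make the update concrete, here is a minimal PyTorch sketch of one SGDHess-style iteration on a single flat parameter tensor. The function name, the flat-parameter setup, and the toy quadratic in the usage comment are illustrative choices of ours rather than the paper's released code, and the clipping step anticipates Algorithm 1 below.

```python
import torch

def sgdhess_step(x, x_prev, g_buf, loss_fn, batch, lr, alpha, G):
    """One iteration of SGD with Hessian-corrected momentum (illustrative sketch only).

    x, x_prev : current and previous iterates (1-D tensors; x has requires_grad=True)
    g_buf     : clipped momentum buffer from the previous step, \hat{g}^{clip}_{t-1}
    loss_fn   : callable mapping (parameters, minibatch) -> scalar loss f(x, z_t)
    """
    loss = loss_fn(x, batch)
    # Stochastic gradient grad f(x_t, z_t), keeping the graph for the double backward
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]
    # Hessian-vector product H f(x_t, z_t)(x_t - x_{t-1}) via d/dx <grad, v>
    v = (x - x_prev).detach()
    hvp = torch.autograd.grad(torch.dot(grad, v), x)[0]
    # Hessian-corrected momentum \hat{g}_t
    g_hat = (1 - alpha) * (g_buf + hvp) + alpha * grad.detach()
    # Clip to norm at most G, then take the SGD step
    g_clip = g_hat * torch.clamp(G / (g_hat.norm() + 1e-12), max=1.0)
    x_next = (x - lr * g_clip).detach().requires_grad_(True)
    return x_next, x.detach(), g_clip

# Toy usage (quadratic loss), purely for illustration:
# x = torch.randn(10, requires_grad=True)
# x_prev, g_buf = x.detach().clone(), torch.zeros(10)
# quad = lambda p, _: 0.5 * (p ** 2).sum()
# for _ in range(100):
#     x, x_prev, g_buf = sgdhess_step(x, x_prev, g_buf, quad, None, lr=0.1, alpha=0.1, G=10.0)
```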
Of course, the case $\hat g_{t-1} = \nabla F(x_{t-1})$ above is examined for intuition only. In our analysis, we do not make any such assumption. Note that this approach is morally similar to the methods of Cutkosky and Orabona [2019], Tran-Dinh et al. [2019], which are themselves similar to the recursive variance-reduction algorithm of Nguyen et al. [2017] for stochastic convex optimization. In these algorithms, the Hessian-vector product is replaced with two gradient evaluations with the same example $z_t$: $\nabla f(x_t, z_t) - \nabla f(x_{t-1}, z_t)$. So long as the individual functions $f(\cdot, z_t)$ are $L$-smooth, this will eventually have a similar correction effect. However, past empirical work suggests that variance reduction may not be effective in practice on deep learning tasks Defazio and Bottou [2018], while the use of Hessian-vector products has not been as extensively tested to our knowledge.

Our SGD with Hessian-corrected momentum is described in Algorithm 1, and its analysis is presented in Theorem 1. Note that we only need to make a small modification to the standard SGD update, which allows for streamlined analyses. Concretely, we avoid the large batch-size requirement of Arjevani et al. [2020a], and can extend the analysis to adaptive learning rates in Section 4.

**Algorithm 1** SGD with Hessian-corrected Momentum (SGDHess)
Input: initial point $x_1$, learning rates $\eta_t$, momentum parameters $\alpha_t$, time horizon $T$, Lipschitz parameter $G$.
Sample $z_1 \sim P_z$.
$\hat g_1^{\mathrm{clip}} \leftarrow \hat g_1 \leftarrow \nabla f(x_1, z_1)$
$x_2 \leftarrow x_1 - \eta_1\hat g_1^{\mathrm{clip}}$
for $t = 2, \dots, T$ do
  Sample $z_t \sim P_z$.
  $\hat g_t \leftarrow (1-\alpha_{t-1})(\hat g_{t-1}^{\mathrm{clip}} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})) + \alpha_{t-1}\nabla f(x_t, z_t)$
  $\hat g_t^{\mathrm{clip}} \leftarrow \hat g_t$ if $\|\hat g_t\| \le G$; otherwise $\hat g_t^{\mathrm{clip}} \leftarrow G\frac{\hat g_t}{\|\hat g_t\|}$
  $x_{t+1} \leftarrow x_t - \eta_t\hat g_t^{\mathrm{clip}}$
end for
Return $\hat x$ chosen uniformly at random from $x_1, \dots, x_T$ (in practice, $\hat x = x_T$).

**Theorem 1.** Assume (1), (2), (4), (5), (6). Then Algorithm 1 with $\eta_t = \frac{1}{Ct^{1/3}}$ and $\alpha_t = 2K\eta_t\eta_{t+1}$, where $C \ge \sqrt{2K}$ and $C \ge 4L$, $K = \frac{2G^2\rho^2}{\sqrt{4\sigma_H^4 + \rho^2G^2/2}}$, and $D = \frac{24}{25K^2}$, guarantees
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{20C\Delta + 96C^2G^2/K}{T^{2/3}} + \frac{20G^2D(1+\log T)}{C^2T^{2/3}}$$

To see why we want to use a clipped gradient, let us consider the error term with an unclipped gradient:
$$\hat\epsilon_{t+1} = (1-\alpha_t)(\hat g_t - \nabla F(x_t)) + (1-\alpha_t)\big(\nabla^2 f(x_{t+1}, z_{t+1})(x_{t+1} - x_t) - \nabla^2 F(x_{t+1})(x_{t+1} - x_t)\big) + (1-\alpha_t)\big(\nabla F(x_t) + \nabla^2 F(x_{t+1})(x_{t+1} - x_t) - \nabla F(x_{t+1})\big) + \alpha_t\epsilon^G_{t+1}$$
where $\epsilon^G_{t+1} = \nabla f(x_{t+1}, z_{t+1}) - \nabla F(x_{t+1})$. It is easy to see that we can bound the second and the fourth terms using assumptions (4) and (6), and the first term is simply $(1-\alpha_t)\hat\epsilon_t$, which we can use to analyze how the error changes over each iteration. However, the third term is a bit trickier to control. Let $\delta_t = \nabla F(x_t) + \nabla^2 F(x_{t+1})(x_{t+1} - x_t) - \nabla F(x_{t+1})$; from second-order smoothness we have:
$$\|\delta_t\|^2 \le \frac{\rho^2}{4}\|x_{t+1} - x_t\|^4 \le \frac{\rho^2}{4}\eta_t^4\|\hat g_t\|^4$$
Intuitively, if we let $\hat g_t$ be unbounded, $\|\hat g_t\|^4$ might be difficult to control, since we only assume a bound on the variance of $\nabla f(x_t, z_t)$ rather than the fourth moment. Therefore, by enforcing some bound on the norm of $\hat g_t$ (using clipping), we make sure that we can control this term properly.

In order to prove this Theorem, we will require two Lemmas. The first (Lemma 2) is due to Cutkosky and Orabona [2019], and provides a bound on the progress of one iteration of stochastic gradient descent without making any assumptions (such as unbiasedness) about the gradient estimates. The second (Lemma 3) is a technical result characterizing the quality of the gradient estimates $\hat g_t^{\mathrm{clip}}$ generated by Algorithm 1.
**Lemma 2** (Cutkosky and Orabona [2019], Lemma 1). Define:
$$\hat\epsilon_t = \hat g_t^{\mathrm{clip}} - \nabla F(x_t)$$
Suppose $\eta_t$ is a deterministic and non-increasing choice of learning rate. Then, so long as $\eta_t \le \frac{1}{4L}$,
$$\mathbb{E}[F(x_{t+1}) - F(x_t)] \le -\frac{\eta_t}{4}\mathbb{E}[\|\nabla F(x_t)\|^2] + \frac{3\eta_t}{4}\mathbb{E}[\|\hat\epsilon_t\|^2]$$

**Lemma 3.** Suppose that $f(x, z)$ satisfies (2), (4), (6), and (5). Define:
$$\hat\epsilon_t = \hat g_t^{\mathrm{clip}} - \nabla F(x_t)$$
Now, for some constant $C$, set $K = \frac{2G^2\rho^2}{\sqrt{4\sigma_H^4 + \rho^2G^2/2}}$, $\eta_t = \frac{1}{Ct^{1/3}}$, and $\alpha_t = 2K\eta_t\eta_{t+1}$. Then we have:
$$\frac{6}{5K\eta_{t+1}}\mathbb{E}[\|\hat\epsilon_{t+1}\|^2] - \frac{6}{5K\eta_t}\mathbb{E}[\|\hat\epsilon_t\|^2] \le -\frac{3\eta_t}{4}\mathbb{E}[\|\hat\epsilon_t\|^2] + \frac{\eta_t}{5}\mathbb{E}[\|\nabla F(x_t)\|^2] + \frac{\eta_t^3}{5}\left(KG^2 + 16C^6G^2\right)$$

Let us look into how Lemma 3 is used in our analysis. To prove Theorem 1, we use the Lyapunov function defined as $\Phi_t = F(x_t) + \frac{6}{5K\eta_t}\|\hat\epsilon_t\|^2$ to bound the $\mathbb{E}[\|\nabla F(x_t)\|^2]$ term. Then:
$$\mathbb{E}[\Phi_{t+1} - \Phi_t] = \mathbb{E}\left[F(x_{t+1}) - F(x_t) + \frac{6}{5K\eta_{t+1}}\|\hat\epsilon_{t+1}\|^2 - \frac{6}{5K\eta_t}\|\hat\epsilon_t\|^2\right] \le \mathbb{E}\left[-\frac{\eta_t}{4}\|\nabla F(x_t)\|^2 + \frac{3\eta_t}{4}\|\hat\epsilon_t\|^2 + \frac{6}{5K\eta_{t+1}}\|\hat\epsilon_{t+1}\|^2 - \frac{6}{5K\eta_t}\|\hat\epsilon_t\|^2\right] \tag{7}$$
where the inequality comes from Lemma 2. Now if we plug in the result of Lemma 3, we are able to simplify the bound (7) by canceling the positive error term $\frac{3\eta_t}{4}\|\hat\epsilon_t\|^2$ while keeping the coefficient of $\|\nabla F(x_t)\|^2$ negative. Then, we can move the negative term to the left-hand side and derive a bound for $\mathbb{E}[\|\nabla F(x_t)\|^2]$. With Lemma 2 and Lemma 3 in hand, we are now ready to prove Theorem 1.

*Proof of Theorem 1.* We define the potential:
$$\Phi_t = F(x_t) + \frac{6}{5K\eta_t}\|\hat\epsilon_t\|^2$$
We will then show that $\Phi_t$ roughly decreases with $t$ at a rate that depends on $\|\nabla F(x_t)\|^2$. Specifically:
$$\mathbb{E}[\Phi_{t+1} - \Phi_t] = \mathbb{E}\left[F(x_{t+1}) - F(x_t) + \frac{6}{5K\eta_{t+1}}\|\hat\epsilon_{t+1}\|^2 - \frac{6}{5K\eta_t}\|\hat\epsilon_t\|^2\right]$$
Applying Lemmas 2 and 3, the $\frac{3\eta_t}{4}\mathbb{E}[\|\hat\epsilon_t\|^2]$ terms cancel and, collecting the remaining terms,
$$\mathbb{E}[\Phi_{t+1} - \Phi_t] \le -\frac{\eta_t}{20}\mathbb{E}[\|\nabla F(x_t)\|^2] + \eta_t^3G^2D$$
Now sum over $t$, and use $\eta_t \ge \eta_T$ for all $t$:
$$\mathbb{E}[\Phi_{T+1} - \Phi_1] \le -\frac{\eta_T}{20}\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|^2] + G^2D\sum_{t=1}^{T}\eta_t^3$$
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{20}{\eta_T}\mathbb{E}[\Phi_1 - \Phi_{T+1}] + \frac{20G^2D\sum_{t=1}^{T}\eta_t^3}{\eta_T}$$
Now, observe that:
$$\mathbb{E}[\Phi_1 - \Phi_{T+1}] = \mathbb{E}\left[F(x_1) - F(x_{T+1}) + \frac{6}{5K\eta_1}\|\hat\epsilon_1\|^2 - \frac{6}{5K\eta_{T+1}}\|\hat\epsilon_{T+1}\|^2\right] \le \Delta + \frac{24G^2}{5K\eta_1} = \Delta + \frac{24CG^2}{5K}$$
where in the second line we used $\mathbb{E}[\|\hat\epsilon_1\|^2] \le 4G^2$. Also, we have:
$$\sum_{t=1}^{T}\eta_t^3 = \frac{1}{C^3}\sum_{t=1}^{T}\frac{1}{t} \le \frac{1 + \log T}{C^3}$$
Putting all this together yields:
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{20}{T\eta_T}\mathbb{E}[\Phi_1 - \Phi_{T+1}] + \frac{20G^2D\sum_{t=1}^{T}\eta_t^3}{T\eta_T} \le \frac{20C\Delta + 96C^2G^2/K}{T^{2/3}} + \frac{20G^2D(1+\log T)}{C^2T^{2/3}}$$

## 3 Normalized SGD with Hessian-corrected momentum

**Algorithm 2** Normalized SGD with Hessian-corrected Momentum (N-SGDHess)
Input: initial point $x_1$, learning rate $\eta$, momentum parameter $\alpha$, time horizon $T$.
Sample $z_1 \sim P_z$.
$\hat g_1 \leftarrow \nabla f(x_1, z_1)$
$x_2 \leftarrow x_1 - \eta\frac{\hat g_1}{\|\hat g_1\|}$
for $t = 2, \dots, T$ do
  Sample $z_t \sim P_z$.
  $\hat g_t \leftarrow (1-\alpha)(\hat g_{t-1} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})) + \alpha\nabla f(x_t, z_t)$
  $x_{t+1} \leftarrow x_t - \eta\frac{\hat g_t}{\|\hat g_t\|}$
end for
Return $\hat x$ chosen uniformly at random from $x_1, \dots, x_T$ (in practice, $\hat x = x_T$).

In this section, we introduce an algorithm that dispenses with the assumption (6) required by Algorithm 1. This method (Algorithm 2) uses SGD with normalized updates and Hessian-vector product-based momentum. We will show that normalization can significantly simplify the analysis from Section 2 while still maintaining the $O(\epsilon^{-3})$ convergence rate. Indeed, the technical term in the analysis of Algorithm 1 that required us to enforce a bound on the updates via clipping simply does not appear, because the updates are automatically bounded by $\eta$. The new convergence bound is presented in Theorem 4.
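For concreteness, the sketch below shows the momentum-plus-normalization computation at the heart of Algorithm 2. The helper name and the assumption that the stochastic gradient and Hessian-vector product are already available as tensors are ours, not part of the official implementation.

```python
import torch

def nsgdhess_direction(g_hat_prev, grad_t, hvp_t, alpha, eps=1e-12):
    """Normalized SGD with Hessian-corrected momentum: one momentum/direction update.

    grad_t : stochastic gradient of f(x_t, z_t)
    hvp_t  : Hessian-vector product of f(x_t, z_t) applied to (x_t - x_{t-1})
    Returns the new momentum buffer \hat{g}_t and the unit-norm update direction,
    so that the step is x_{t+1} = x_t - eta * direction.
    """
    g_hat = (1 - alpha) * (g_hat_prev + hvp_t) + alpha * grad_t
    direction = g_hat / (g_hat.norm() + eps)
    return g_hat, direction
```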
**Theorem 4.** Assuming (1), (2), (3), (4), and (5) hold (but not assuming (6)), with
$$\alpha = \min\left\{\max\left\{\frac{1}{T^{2/3}}, \frac{\Delta^{4/5}\rho^{2/5}}{T^{4/5}\sigma_G^{6/5}}, \frac{(2\Delta\sigma_H)^{2/3}}{T^{2/3}\sigma_G^{4/3}}\right\}, 1\right\} \quad\text{and}\quad \eta = \min\left\{\sqrt{\frac{\Delta}{T(L\sqrt{\alpha} + 4\sigma_H)}},\ \left(\frac{\Delta\alpha}{\rho T}\right)^{1/3}\right\},$$
Algorithm 2 guarantees
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|] \le \frac{6\sigma_G}{\sqrt{T}} + \frac{54^{1/3}(\Delta\sigma_H)^{1/3}\sigma_G^{1/3}}{T^{1/3}} + \frac{6\sigma_G^{2/5}\Delta^{2/5}\rho^{1/5}}{T^{2/5}} + \frac{9\sqrt{\Delta L} + 72\sqrt{\Delta\sigma_H}}{\sqrt{T}} + \frac{6\Delta^{2/3}\rho^{1/3}}{T^{3/2}} + \frac{3\Delta^{2/3}\rho^{1/3}}{T^{2/3}}$$

In words, Algorithm 2 achieves $O(1/T^{1/3})$ with large $\sigma_H$ and $\sigma_G$, and achieves $O(1/\sqrt{T})$ in the noiseless case, without requiring a Lipschitz bound on the objective. The proof of Theorem 4 is provided in the appendix.

## 4 Adaptive SGD with Hessian-corrected Momentum

**Algorithm 3** Adaptive learning rate for SGD with Hessian-corrected Momentum
Input: initial point $x_1$, parameters $c$, $w$, $\alpha_t$, time horizon $T$, parameter $G$.
Sample $z_1 \sim P_z$.
$\hat g_1^{\mathrm{clip}} \leftarrow \hat g_1 \leftarrow \nabla f(x_1, z_1)$
$G_1 \leftarrow \|\nabla f(x_1, z_1)\|$
$\eta_1 \leftarrow \frac{c}{w^{1/3}}$
$x_2 \leftarrow x_1 - \eta_1\hat g_1^{\mathrm{clip}}$
for $t = 2, \dots, T$ do
  Sample $z_t \sim P_z$.
  $G_t \leftarrow \|\nabla f(x_t, z_t)\|$
  $\hat g_t \leftarrow (1-\alpha_{t-1})(\hat g_{t-1}^{\mathrm{clip}} + \nabla^2 f(x_t, z_t)(x_t - x_{t-1})) + \alpha_{t-1}\nabla f(x_t, z_t)$
  $\hat g_t^{\mathrm{clip}} \leftarrow \hat g_t$ if $\|\hat g_t\| \le G$; otherwise $\hat g_t^{\mathrm{clip}} \leftarrow G\frac{\hat g_t}{\|\hat g_t\|}$
  $\eta_t \leftarrow \frac{c}{\left(w + \sum_{i=1}^{t-2}G_i^2\right)^{1/3}}$ (set $\eta_2 = \frac{c}{w^{1/3}}$)
  $x_{t+1} \leftarrow x_t - \eta_t\hat g_t^{\mathrm{clip}}$
end for
Return $\hat x$ chosen uniformly at random from $x_1, \dots, x_T$ (in practice, $\hat x = x_T$).

In previous sections, we have presented two different versions of our algorithm. Both algorithms achieve the worst-case optimal $O(1/T^{1/3})$ convergence rate. However, the bound in Section 2 is non-adaptive and remains $O(1/T^{1/3})$ even in noiseless settings (e.g. if $\sigma_G = 0$). The bound for the normalized SGD algorithm of Section 3 is better in the sense that when the noise is negligible, the algorithm can achieve $O(1/\sqrt{T})$ with explicit tuning of the learning rate, but this requires us to set the parameters $\eta$ and $\alpha$ based on prior knowledge of $\sigma_G$. In this section, we describe an adaptive version of our algorithm that has the best of both worlds (Algorithm 3). The algorithm does not require knowledge of $\sigma_G$ but still automatically improves to a tighter bound whenever $\sigma_G$ is small.

**Theorem 5.** With $K = \frac{2G^2\rho^2}{\sqrt{16\sigma_H^4 + \rho^2G^2/2}}$, $\eta_t = \frac{c}{\left(w + \sum_{i=1}^{t-2}G_i^2\right)^{1/3}}$ with $c \ge 2G^{2/3}$ and $w \ge \max\{(4Lc)^3, 3G^2\}$, and $\alpha_t = 2K\eta_t\eta_{t+1}$, Algorithm 3 guarantees:
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(x_t)\|] \le \frac{2M + 2M^{3/4}}{\sqrt{T}} + \frac{2\sigma_G^{1/3}}{T^{1/3}}$$
where $M = \frac{1}{c}\left(20\left(\Delta + \frac{6\sigma_G^2 w^{1/3}}{5Kc}\right) + 192Kc^2\ln(T+1) + \frac{64G^4}{5K^2c^3}\ln T\right)$.

In words, Algorithm 3 converges at an $O(1/T^{1/3})$ rate in the noisy case and automatically improves to $O(1/\sqrt{T})$ in the noiseless case. We defer the proof of Theorem 5 to the appendix.

## 5 Experiments

Our algorithm is a simple extension of the official SGD implementation in PyTorch. The Hessian-vector products can be efficiently computed using the automatic differentiation package Paszke et al. [2017]: $\nabla^2 f(x, z)v = \nabla h(x)$ where $h(x) = \langle\nabla f(x, z), v\rangle$. Since PyTorch allows backpropagation through the differentiation process itself, this is straightforward to implement.
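A minimal sketch of this double-backward construction for a model with several parameter tensors might look as follows; the helper name and calling convention are ours, and the released code may organize this differently.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Hessian-vector product via double backward: grad of h(x) = <grad f(x, z), v>.

    loss   : scalar loss f(x, z) for the current minibatch, built from `params`
    params : list of parameter tensors (the pieces of x), each with requires_grad=True
    vec    : list of tensors shaped like `params` (the pieces of v, e.g. x_t - x_{t-1})
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # h(x) = <grad f(x, z), v>
    inner = sum((g * v).sum() for g, v in zip(grads, vec))
    # grad of h(x) is the Hessian-vector product
    return torch.autograd.grad(inner, params)

# Usage sketch: with `disp` holding x_t - x_{t-1} per parameter,
#   hvps = hessian_vector_product(loss, list(model.parameters()), disp)
# and each hvps[i] is added into the corresponding momentum buffer.
```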
To validate the effectiveness of our proposed algorithm, we perform experiments on two tasks, image classification and neural machine translation, using popular deep learning benchmarks. The performance of SGDHess (unclipped version; we found that the clipped version had similar results but was slightly slower) is compared to that of commonly used optimizers such as Adam and SGD, as well as AdaHessian, another algorithm that incorporates second-order information. All experiments are run on NVIDIA V100 GPUs. The link to the code is provided in the appendix.

Figure 1: (a) SGDHess (red) achieves the best accuracy among all optimizers (92.46 ± 0.07%). (b) Similar to the experiment on Resnet20, SGDHess also converges to the best accuracy (93.19 ± 0.08%). For the full plot, refer to the appendix.

### 5.2 Image Classification

Our Cifar10 experiment is conducted using the official implementation of AdaHessian. For AdaHessian, we use their recommended values for the same task for all the parameters. For the rest of the optimizers, we performed a grid search on the base learning rate $\eta \in \{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1\}$ to find the best settings. Similar to the Cifar10 experiment of AdaHessian, we also trained our models for 160 epochs; we ran each optimizer 5 times and report the average best accuracy as well as the standard deviation (detailed results in the appendix). As we can see from the results in Figure 1, SGDHess outperforms all other optimizers (0.32% and 0.11% better than the next best optimizer on Resnet20 and Resnet32, respectively).

Table 1: Top-1 accuracy on Imagenet

| SGD   | SGDHess | AdaHessian |
|-------|---------|------------|
| 70.36 | 70.58   | 69.89      |

We also train SGD, SGDHess, and AdaHessian on Imagenet Deng et al. [2009] with Resnet18 to see how well SGDHess performs on a larger-scale benchmark. We use standard parameter values for SGD (lr = 0.1, momentum = 0.9, weight_decay = 1e-4) for both SGD and SGDHess, and the recommended parameter values for AdaHessian. For the learning rate scheduler, we employ the plateau decay scheduler that was used in Yao et al. [2020]. We train our model for 90 epochs as usual. Even without extensive tuning, SGDHess not only still outperforms SGD (as shown in Table 1) but also comes close to the state-of-the-art accuracy (70.7%) on this particular task Redmon [2013-2016], even though these settings were chosen with SGD in mind rather than SGDHess.

Table 2: BLEU score on IWSLT'14

| SGD   | SGDHess | AdamW | AdaHessian |
|-------|---------|-------|------------|
| 29.75 | 33.73   | 33.95 | 33.62      |

### 5.3 Neural Machine Translation

We use the IWSLT'14 German-to-English dataset, which contains 153k/7k/7k sentence pairs in the train/validation/test sets. Our experiments are run using all the default values specified in the official implementation of Fairseq Ott et al. [2019]. We use BLEU Papineni et al. [2002] as the evaluation metric for our experiment. For AdaHessian, we use the parameters specified in Yao et al. [2020]. For the other optimizers, we again run a grid search to find the best learning rate for the task. The best BLEU scores are reported in Table 2. It is worth stressing that SGDHess is an algorithm based on SGD, which consistently performs much worse than adaptive algorithms such as AdaHessian and AdamW on this type of task. Still, SGDHess is able to produce results comparable to those of AdaHessian and AdamW, thus significantly bridging the gap between SGD and other adaptive algorithms (an almost 4-point increase in BLEU score compared to SGD, which is significant for this task).

## 6 Conclusion and Future Work

In this paper, we have presented SGDHess, a novel SGD-based algorithm using Hessian-corrected momentum. We show that when the objective is second-order smooth, our algorithm can achieve the optimal $O(\epsilon^{-3})$ bound, and we provide a variant that automatically adapts to the level of noise in the gradients. Finally, we provide experimental results on computer vision and neural machine translation. On both tasks, SGDHess consistently performs better than or comparably to other commonly used optimizers such as SGD and Adam. It is our hope that this work demonstrates that Hessian-based optimization can combine both theoretical and practical improvements for large-scale machine learning problems.
Our algorithm is one of the first second-order methods able to achieve the optimal theoretical convergence rate and compete with first-order methods on large-scale tasks in terms of performance and runtime.

Limitations: Our modification to the standard momentum formula is extremely simple, and so we suspect that it is possible to make similar modifications to other popular optimization algorithms that use momentum, but we did not investigate this possibility. For example, one might hope that an appropriate per-coordinate or per-layer adaptation à la Adam Kingma and Ba [2014], AMSGrad Reddi et al., or LAMB You et al. [2019] might improve performance. On the theoretical side, our adaptive results require Lipschitz losses, which may not be truly necessary.

## References

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195-1199, 2017.
Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pages 242-299, 2020a.
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 242-299. PMLR, 09-12 Jul 2020b. URL http://proceedings.mlr.press/v125/arjevani20a.html.
Raghu Bollapragada, Jorge Nocedal, Dheevatsa Mudigere, Hao-Jun Shi, and Ping Tak Peter Tang. A progressive batching l-bfgs method for machine learning. In International Conference on Machine Learning, pages 620-629. PMLR, 2018.
Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 161-168, 2007.
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751-1772, 2018.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in Neural Information Processing Systems, 32, 2019.
Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529, 2018.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 315-323, Red Hook, NY, USA, 2013. Curran Associates Inc.
N Keskar and Andreas Wächter. A limited-memory quasi-newton algorithm for bound-constrained non-smooth optimization. Optimization Methods and Software, 34(1):150-171, 2019.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
Dong C. Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 45:503-528, 1989.
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Xuezhe Ma. Apollo: An adaptive parameter-wise diagonal quasi-newton method for nonconvex stochastic optimization. arXiv preprint arXiv:2009.13586, 2020.
H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In Conference on Learning Theory, pages 244-256, 2010.
Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613-2621. PMLR, 2017.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
Wenyong Pan, Kristopher A Innanen, and Wenyuan Liao. Accelerating hessian-free gauss-newton full-waveform inversion via l-bfgs preconditioned conjugate-gradient algorithm. Geophysics, 82(2):R49-R64, 2017.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://www.aclweb.org/anthology/P02-1040.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6:147-160, 1994.
Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
SJ Reddi, S Kale, and S Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
Joseph Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013-2016.
Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan, and Lam M Nguyen. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.
Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. arXiv preprint arXiv:1711.02838, 2017.
Zeke Xie, Issei Sato, and Masashi Sugiyama. Understanding and scheduling weight decay. arXiv preprint arXiv:2011.11152, 2020.
Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3921-3932, 2018.
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularization methods. Journal of Machine Learning Research, 20(134):1-47, 2019. URL http://jmlr.org/papers/v20/19-055.html.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928-936, 2003.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [No] The paper is a theoretical paper; it has no negative societal impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes] See the appendix.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No] The data is open source.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] The data contains no personally identifiable information.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]