# Momentum Improves Normalized SGD

Ashok Cutkosky¹ ² and Harsh Mehta¹

¹Google Research, California, USA. ²Boston University, Massachusetts, USA. Correspondence to: Ashok Cutkosky, Harsh Mehta.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $\epsilon$-critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large-scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks.

## 1. Non-Convex Stochastic Optimization

The rise of deep learning has focused research attention on the problem of solving optimization problems that are high-dimensional, large-scale, and non-convex. Modern neural networks can have billions of parameters (high-dimensional) (Raffel et al., 2019; Shazeer et al., 2017) and are trained using datasets containing millions of examples (large-scale) (Deng et al., 2009) on objective functions that are non-convex. Because of these considerations, stochastic gradient descent (SGD) has emerged as the de-facto method-of-choice for training deep models. SGD has several properties that make it attractive for this setting. First, it can operate in a streaming manner by running over the large dataset while only keeping a few examples in memory at any one time. Second, SGD has a dimension-free convergence guarantee for finding critical points on non-convex objectives. This means that SGD's convergence rate is independent of the dimension of the problem, allowing it to scale more easily to massive neural networks. In this paper we will investigate a variant of SGD that incorporates gradient normalization and momentum, which are both commonly used empirical modifications to SGD that are poorly understood in the non-convex setting.

A common way to analyze algorithms for training neural nets is through the lens of stochastic optimization. In this setting, we are interested in minimizing a function
$$F(w) = \mathbb{E}_{\xi\sim D}[f(w, \xi)],$$
where $w \in \mathbb{R}^d$ and $D$ is a distribution over an arbitrary space $\Xi$. This general formulation allows for cleaner analysis, but for a more concrete intuition, $w$ may represent the weights of a neural network, $\xi$ may represent an individual training example, and $D$ is the distribution over training examples. We do not know the entire function $F$, but we are allowed to access $F$ through a stochastic gradient oracle. That is, given any $w$, we may compute $\nabla f(w, \xi)$ for some $\xi$ drawn i.i.d. from $D$. Under this formulation, the SGD algorithm employs the simple iterative update
$$w_{t+1} = w_t - \eta_t \nabla f(w_t, \xi_t),$$
where $\xi_1, \dots, \xi_T$ are i.i.d. examples from the distribution $D$, and $\eta_t \in \mathbb{R}$ is a scalar called the learning rate.
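To make the oracle model concrete, here is a minimal sketch in Python/NumPy that runs the SGD update above on a toy least-squares objective standing in for $F$. The names `sample_example` and `stochastic_grad`, and the specific problem, are purely illustrative assumptions for this sketch and are not part of the paper.

```python
import numpy as np

# Toy instance of F(w) = E_xi[ f(w, xi) ]: least-squares regression where each
# xi = (x, y) is a single training example. The paper's F is an arbitrary
# non-convex objective; this is only a stand-in for the oracle model.
rng = np.random.default_rng(0)
d = 10
w_star = rng.normal(size=d)

def sample_example():
    """Draw xi = (x, y) i.i.d. from the data distribution D."""
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return x, y

def stochastic_grad(w, xi):
    """The stochastic gradient oracle: grad_w f(w, xi) for one example xi."""
    x, y = xi
    return (x @ w - y) * x

# Plain SGD: w_{t+1} = w_t - eta_t * grad f(w_t, xi_t)
w = np.zeros(d)
eta = 0.01
for t in range(1000):
    w -= eta * stochastic_grad(w, sample_example())

print("final squared distance to w*:", np.sum((w - w_star) ** 2))
```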
Since our objective $F$ is non-convex, it may be computationally intractable to find a true minimizer. Instead, we will look for an $\epsilon$-critical point. That is, we would like our optimization algorithm to output a point $w$ such that $\mathbb{E}[\|\nabla F(w)\|] \le \epsilon$, where $\|\cdot\|$ indicates the standard 2-norm. The expectation is taken over the randomness of the stochastic gradient oracle and any randomness inherent in the algorithm itself.

In order to make the problem tractable, we make three structural assumptions. First, we assume that $F$ is $L$-smooth, which means that for all $w$ and $x$,
$$\|\nabla F(w) - \nabla F(x)\| \le L\|w - x\|. \tag{A1}$$
Second, we assume that the objective $F$ is bounded from below, and without loss of generality (since our algorithms will only access gradients rather than function values), we may assume it is positive:
$$F(w) \ge 0. \tag{A2}$$
Finally, we assume that the stochastic gradient oracle has bounded variance:
$$\mathbb{E}\big[\|\nabla f(w, \xi) - \nabla F(w)\|^2\big] \le \sigma^2. \tag{A3}$$
Under these assumptions, SGD with an optimally tuned learning rate ensures that after $T$ iterations, we can output a point $w$ such that (Ghadimi & Lan, 2013)
$$\mathbb{E}[\|\nabla F(w)\|] \le O\left(\frac{1}{\sqrt{T}} + \frac{\sqrt{\sigma}}{T^{1/4}}\right). \tag{1}$$

Spurred by the empirical success of SGD, there has been a tremendous amount of work in designing modifications that improve upon its performance in various ways. However, it has recently been shown that the $O(1/T^{1/4})$ rate is optimal in the worst case (Arjevani et al., 2019), and so improvements must either make more assumptions about the problem setting, or provide an algorithm that can somehow interpolate between a worst-case and non-worst-case problem.

One popular modification to SGD is the use of adaptive learning rates, popularized by AdaGrad (Duchi et al., 2010). These learning rates enable SGD to converge faster when the objective under consideration is in some technical sense easier than a worst-case objective, specifically when the variance of the loss from one example to another is small (Li & Orabona, 2019; Ward et al., 2019). (In this paper we use "adaptive" in a statistical sense of the word: an algorithm is adaptive if it automatically adjusts itself to some unknown parameter, such as the variance of the gradients. This is different from the idea of using a different learning rate for each dimension of an optimization problem that was also popularized by Duchi et al. (2010).) The success of AdaGrad has inspired a huge number of related algorithms, including notably the Adam algorithm (Kingma & Ba, 2014).

One of the key improvements added to AdaGrad by Adam is the use of momentum. Inspired by the heavy-ball and acceleration algorithms in convex optimization (Polyak, 1964; Nesterov, 1983), momentum attempts to improve the convergence rate on non-convex objectives by modifying the update to have the form
$$m_t = \beta m_{t-1} + (1 - \beta)\nabla f(w_t, \xi_t),$$
$$w_{t+1} = w_t - \eta_t m_t.$$
Intuitively, the value $m_t$ holds a running average of the past gradient values, and the hope is that this style of update may provide some kind of better stability that enables improvements over the base SGD. Momentum has had dramatic empirical success, but although prior analyses have considered momentum updates (Reddi et al., 2018; Zaheer et al., 2018), none of these have shown a strong theoretical benefit in using momentum, as their bounds do not improve on (1).
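As a sketch of how this update differs from plain SGD, the momentum buffer is a single extra vector maintained as an exponential moving average of stochastic gradients. This is illustrative code under the same assumptions as the previous sketch; `grad_oracle` and `sample` stand in for whatever stochastic gradient source is available.

```python
import numpy as np

def sgd_with_momentum(grad_oracle, sample, w0, eta=0.01, beta=0.9, steps=1000):
    """Heavy-ball style momentum as an exponential moving average of gradients:
        m_t     = beta * m_{t-1} + (1 - beta) * grad f(w_t, xi_t)
        w_{t+1} = w_t - eta * m_t
    grad_oracle(w, xi) and sample() are supplied by the caller."""
    w = w0.copy()
    m = np.zeros_like(w0)
    for t in range(steps):
        g = grad_oracle(w, sample())
        m = beta * m + (1.0 - beta) * g
        w = w - eta * m
    return w
```

Setting `beta = 0` recovers the plain SGD step from the previous sketch.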
Finally, a third popular modification is the use of normalized updates. For example, the LARS (You et al., 2017) and LAMB (You et al., 2019) optimizers use updates similar to
$$m_t = \beta m_{t-1} + (1 - \beta)\nabla f(w_t, \xi_t), \tag{2}$$
$$w_{t+1} = w_t - \eta_t \frac{m_t}{\|m_t\|}. \tag{3}$$
Intuitively, this style of update attempts to capture the idea that in non-convex objectives, unlike convex ones, the magnitude of the gradient provides less information about the value of the function, while the direction still indicates the direction of steepest descent. However, all analyses of normalized updates we know of (e.g. You et al., 2017; 2019; Hazan et al., 2015) require the variance of the gradient oracle to be very small, or, equivalently, require the algorithm to use an extremely large batch size in order to achieve any convergence guarantees. Intuitively, this requirement arises because the normalization can inflate very small errors: even if $m_t = \nabla F(w_t) + \zeta$ for some small error $\zeta$, it might be that $\frac{m_t}{\|m_t\|}$ is actually very far from $\frac{\nabla F(w_t)}{\|\nabla F(w_t)\|}$. Nevertheless, this update has also been used to empirically accelerate training of neural networks.

From a more theoretical perspective, a recent approach to going beyond the SGD rate (1) is to add some additional structural assumptions to the loss. One appealing assumption is second-order smoothness, in which we assume that the third derivative of $F$ has bounded operator norm. Equivalently, for all $w$ and $z$, we have
$$\|\nabla^2 F(w) - \nabla^2 F(z)\|_{\mathrm{op}} \le \rho\|w - z\|. \tag{A4}$$
Many previous works have achieved improved results by utilizing this assumption in concert with stronger oracles, such as evaluating two gradients per example or evaluating Hessian-vector products (Allen-Zhu, 2018; Tripuraneni et al., 2018). However, using our same stochastic gradient oracle model and this second-order smoothness assumption, it was recently proven (Fang et al., 2019) that a variant of SGD can achieve a convergence rate of
$$\|\nabla F(w)\| \le O\left(\frac{\mathrm{polylog}(d)}{T^{2/7}}\right),$$
where $d$ is the dimension of the problem. (In fact, this result provides convergence to a local minimum, which is stronger than a critical point.) This breakthrough result shows that even for fairly high-dimensional problems, SGD can obtain faster convergence than the initial analysis suggests. However, in modern deep learning architectures, $d$ can be on the order of billions (Radford et al., 2019). In this regime, it may easily hold that the logarithmic term is large enough that this new analysis of SGD does not actually suggest improved performance over the previous $O(1/T^{1/4})$ rate. To deal with extremely high-dimensional regimes, we would like a convergence rate that is completely dimension-free.

In a somewhat orthogonal direction, several algorithms have been proposed based on variance reduction (Johnson & Zhang, 2013). Recently, these have provided provable improvements over SGD's convergence rate, finding a point with $\mathbb{E}[\|\nabla F(w)\|] \le O(1/T^{1/3})$ after $T$ iterations (Fang et al., 2018; Zhou et al., 2018). Unfortunately, these algorithms require two gradient evaluations for every i.i.d. sample $\xi$. Our model requires each $\xi$ to be used only once, and so forbids this operation. Moreover, it turns out to be surprisingly hard to implement this two-gradient model efficiently in modern machine learning frameworks. Nevertheless, more recent work in this area has produced algorithms whose updates are very similar to the (unnormalized) momentum update (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2019), suggesting that momentum may operate by reducing variance in SGD.
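For concreteness, the normalized momentum update of equations (2) and (3), which is the object of study in Section 2, differs from the previous sketch only in that each step is rescaled to have length exactly $\eta$. The small constant `tiny` below is a numerical guard we add for the sketch and is not part of the analysis.

```python
import numpy as np

def normalized_sgd_with_momentum(grad_oracle, sample, w0, eta=0.01, beta=0.9,
                                 steps=1000, tiny=1e-12):
    """Update (2)-(3): m_t = beta*m_{t-1} + (1-beta)*grad f(w_t, xi_t),
       w_{t+1} = w_t - eta * m_t / ||m_t||.
    Every step moves exactly distance eta; only the direction is stochastic."""
    w = w0.copy()
    m = np.zeros_like(w0)  # the paper initializes m_1 with the first gradient;
                           # starting from zero gives the same first direction
    for t in range(steps):
        g = grad_oracle(w, sample())
        m = beta * m + (1.0 - beta) * g
        w = w - eta * m / (np.linalg.norm(m) + tiny)
    return w
```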
Finally, another technique that also suggests a connection between momentum and variance reduction is implicit gradient transport (Arnold et al., 2019). Implicit gradient transport is a very recent discovery that provably reduces the variance in gradient estimates in the special case that the Hessian of the objective is constant (e.g. if the objective is a linear regression problem). The original presentation considers the following version of momentum:
$$m_t = \frac{t}{t+1} m_{t-1} + \frac{1}{t+1}\nabla f\big(w_t + t(w_t - w_{t-1}),\ \xi_t\big),$$
$$w_{t+1} = w_t - \eta_t m_t.$$
To gain some intuition for why this is a good idea when the Hessian is constant, suppose we have $m_{t-1} = \nabla F(w_{t-1}) + X$, where $X$ is some mean-zero random variable with variance $\sigma^2/t$. Let us also write $\nabla f(w_t + t(w_t - w_{t-1}), \xi_t) = \nabla F(w_t + t(w_t - w_{t-1})) + Y$, where $Y$ is also a mean-zero random variable with variance $\sigma^2$. Then since the Hessian is constant, we have
$$\nabla F(w_{t-1}) = \nabla F(w_t) + \nabla^2 F\,(w_{t-1} - w_t),$$
$$\nabla F(w_t + t(w_t - w_{t-1})) = \nabla F(w_t) + t\,\nabla^2 F\,(w_t - w_{t-1}),$$
so that the Hessian terms cancel in the momentum average and
$$m_t = \frac{t}{t+1}\big(\nabla F(w_{t-1}) + X\big) + \frac{1}{t+1}\big(\nabla F(w_t + t(w_t - w_{t-1})) + Y\big) = \nabla F(w_t) + \frac{tX + Y}{t+1}.$$
Notice that $\frac{tX + Y}{t+1}$ has variance $\sigma^2/(t+1)$, so that by induction we will have for all $t$ that $m_t$ is an unbiased estimate of $\nabla F(w_t)$ with variance only $\sigma^2/(t+1)$. This technique shows great promise, but the current theoretical analysis is limited to convex quadratic objectives.
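The variance-reduction claim above is easy to check numerically on a constant-Hessian objective. The following toy simulation is our own illustration (not an experiment from the paper): it runs the implicit gradient transport recursion on $F(w) = \tfrac{1}{2}\|w\|^2$, whose Hessian is the identity, and compares the error of $m_t$ against the error of a single stochastic gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, eta, T = 20, 1.0, 0.05, 500

def grad_F(w):           # F(w) = 0.5 * ||w||^2, so the Hessian is constant (= I)
    return w

def noisy_grad(v):       # stochastic gradient with noise of norm roughly sigma
    return grad_F(v) + sigma * rng.normal(size=d) / np.sqrt(d)

w = rng.normal(size=d)
w_prev = w.copy()
m = np.zeros(d)
for t in range(T):
    y = w + t * (w - w_prev)                       # extrapolated query point
    m = (t / (t + 1)) * m + (1 / (t + 1)) * noisy_grad(y)
    w_prev, w = w, w - eta * m                     # m estimates grad F at old w

print("IGT estimate error :", np.linalg.norm(m - grad_F(w_prev)))
print("single-sample error:", np.linalg.norm(noisy_grad(w_prev) - grad_F(w_prev)))
```

Under these assumptions the final estimate error should be on the order of $\sigma/\sqrt{T}$, while a single stochastic gradient has error on the order of $\sigma$, consistent with the $\sigma^2/(t+1)$ variance argument above.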
In this paper we address some of these preceding issues. First, we show that the normalized stochastic gradient descent update with momentum does not require small variance or large batches in order to match the convergence rate of SGD. Next, we tackle the case of second-order smooth objectives. For this, we introduce a further modification of the momentum update based on the implicit gradient transport idea. We show that combining this newer kind of momentum with normalized gradient descent enables a dimension-free convergence rate of $O(1/T^{2/7})$. Moreover, we feel that our new analysis is substantially more straightforward than prior work, as demonstrated by the number of pages required to prove our results. Finally, we propose an adaptive version of our algorithm and show that this final algorithm's convergence guarantee automatically improves when the stochastic gradient oracle has small variance. We hope our results can shed some light on the empirical success of momentum in training deep networks.

In addition to the theoretical discussion, we also demonstrate the effectiveness of our method, NIGT (pronounced "night"), on popular deep learning tasks. We show comparisons of our method on 1) BERT pretraining and 2) ImageNet classification with ResNet-50. Adam is typically used to achieve state-of-the-art results on the BERT task (Devlin et al., 2019), whereas SGD is used for ResNet-50 on the ImageNet dataset since Adam fails to perform well on it (You et al., 2019; Anil et al., 2019). We show a comparison with the stronger baseline in each case. Finally, since momentum is the only significant slot variable kept, our method incurs greatly reduced memory overhead compared to other methods such as Adam. This is a significant practical advantage due to the continuing trend of increasingly large models (Shazeer & Stern, 2018; Anil et al., 2019).

The rest of this paper is organized as follows: in Section 2, we provide our first result showing that momentum can be used to rescue normalized SGD in the high-variance/small-batch regime. In Section 3, we expand upon this theory to show that incorporating a momentum update inspired by (Arnold et al., 2019) allows for improved convergence rates when the objective is second-order smooth, and we proceed to incorporate adaptivity into our analysis in Section 4. In Section 5 we describe our implementation and experimental study, and we provide some concluding remarks and open problems in Section 6.

## 2. Normalized SGD with Momentum

In this section, we analyze the normalized SGD with momentum update given by equations (2) and (3). The result is Theorem 1, which shows that the addition of momentum allows normalized SGD to match the optimal convergence rate obtained by ordinary SGD, without requiring any large batch sizes.

To gain some intuition for why the momentum is necessary in this case, consider a one-dimensional optimization scenario in which $\nabla f(w, \xi) = p$ with probability $1 - p$ and $\nabla f(w, \xi) = p - 1$ with probability $p$, for some $p \in (0, 1/2)$. Then we clearly have $\nabla F(w) = \mathbb{E}[\nabla f(w, \xi)] = 0$, so that intuitively the correct thing for the algorithm to do is to not move, at least on average. Unfortunately, we have
$$\mathbb{E}\left[\frac{\nabla f(w, \xi)}{\|\nabla f(w, \xi)\|}\right] = 1 - 2p > 0,$$
so that the algorithm will be biased toward moving away from this critical point. The problem here was that, without momentum, the error in the gradient estimate before normalization is much larger than the true value of the gradient. In order to fix this, one must decrease the variance in the stochastic gradient oracle by employing a very large batch size, usually on the order of $T$ (You et al., 2019; Hazan et al., 2015). Our Theorem 1 avoids this requirement and matches the optimal bound (1) up to constants.

Theorem 1. Suppose $F$ and $D$ satisfy the assumptions (A1), (A2), and (A3). Let $w_1$ be some given initial point and set $m_1 = \nabla f(w_1, \xi_1)$. Let $R$ be any upper bound on $F(w_1)$. Set $\alpha = \min\left\{\frac{\sqrt{RL}}{\sigma\sqrt{T}}, 1\right\}$ and $\eta = \sqrt{\frac{R\alpha}{TL}}$. Then let $w_t$ be the iterates produced by the recurrences (2) and (3) with $\eta_t = \eta$ and $\beta_t = 1 - \alpha$ for all $t$. Then the average norm of $\nabla F(w_t)$ satisfies
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(w_t)\|\big] \le \frac{21\sqrt{RL}}{\sqrt{T}} + \frac{21\sqrt{\sigma}(RL)^{1/4}}{T^{1/4}} + \frac{8\sigma}{T} + \frac{8\sigma^2}{\sqrt{RLT}}.$$

Before proving this Theorem, we provide a general technical Lemma about each step of normalized SGD.

Lemma 2. Suppose $F$ and $D$ satisfy the assumptions (A1) and (A3). Let $w_1$ be some initial point, and consider the updates
$$w_{t+1} = w_t - \eta_t\frac{\hat g_t}{\|\hat g_t\|},$$
where $\hat g_t$ is generated by some arbitrary process or algorithm. Let $\hat\epsilon_t = \hat g_t - \nabla F(w_t)$. Then we have
$$F(w_{t+1}) \le F(w_t) - \frac{\eta_t}{3}\|\nabla F(w_t)\| + \frac{8\eta_t}{3}\|\hat\epsilon_t\| + \frac{L\eta_t^2}{2}.$$
In particular, if $F$ also satisfies (A2) and $\eta_t$ is a constant $\eta$ for all $t$, we have
$$\sum_{t=1}^{T}\|\nabla F(w_t)\| \le \frac{3F(w_1)}{\eta} + 8\sum_{t=1}^{T}\|\hat\epsilon_t\| + \frac{3L\eta T}{2}.$$

Proof. Since $F$ is $L$-smooth, we have
$$F(w_{t+1}) \le F(w_t) + \langle\nabla F(w_t), w_{t+1} - w_t\rangle + \frac{L}{2}\|w_{t+1} - w_t\|^2 = F(w_t) - \eta_t\frac{\langle\nabla F(w_t), \hat g_t\rangle}{\|\hat g_t\|} + \frac{L\eta_t^2}{2}.$$
Now consider two cases: either $\|\nabla F(w_t)\| \ge 2\|\hat\epsilon_t\|$ or not. In the former case, we have
$$\frac{\langle\hat g_t, \nabla F(w_t)\rangle}{\|\hat g_t\|} = \frac{\|\nabla F(w_t)\|^2 + \langle\nabla F(w_t), \hat\epsilon_t\rangle}{\|\hat g_t\|} \ge \frac{\|\nabla F(w_t)\|^2 - \|\nabla F(w_t)\|\,\|\hat\epsilon_t\|}{\|\nabla F(w_t)\| + \|\hat\epsilon_t\|} \ge \frac{\|\nabla F(w_t)\|}{3} \ge \frac{\|\nabla F(w_t)\|}{3} - \frac{8\|\hat\epsilon_t\|}{3}.$$
In the latter case, we have
$$\frac{\langle\hat g_t, \nabla F(w_t)\rangle}{\|\hat g_t\|} \ge -\|\nabla F(w_t)\| = \frac{\|\nabla F(w_t)\|}{3} - \frac{4\|\nabla F(w_t)\|}{3} \ge \frac{\|\nabla F(w_t)\|}{3} - \frac{8\|\hat\epsilon_t\|}{3}.$$
Thus, the same bound holds in both cases. Putting everything together now proves the first statement of the Lemma. The second statement follows by summing over $T$, observing that the left-hand side telescopes, and rearranging.

By this Lemma, if we can show that our choices for $\beta$ and $\eta$ result in a small value for $\sum_{t=1}^{T}\|\hat\epsilon_t\|$, then setting $\eta$ appropriately will prove Theorem 1. We carry out this agenda in more detail below.

Proof of Theorem 1. Define $\hat\epsilon_t = m_t - \nabla F(w_t)$ and define $\epsilon_t = \nabla f(w_t, \xi_t) - \nabla F(w_t)$. Notice that by our assumptions, we have $\mathbb{E}[\|\epsilon_t\|^2] \le \sigma^2$ and $\mathbb{E}[\langle\epsilon_i, \epsilon_j\rangle] = 0$ for $i \ne j$. Further, define $S(a, b) = \nabla F(a) - \nabla F(b)$.
By smoothness, we must have $\|S(w_t, w_{t+1})\| \le L\|w_t - w_{t+1}\| = \eta L$ for all $t$. With this notation, we have the following recursive formulation for any $t \ge 1$:
$$m_{t+1} = (1-\alpha)(\nabla F(w_t) + \hat\epsilon_t) + \alpha\nabla f(w_{t+1}, \xi_{t+1}) = \nabla F(w_{t+1}) + (1-\alpha)\big(S(w_t, w_{t+1}) + \hat\epsilon_t\big) + \alpha\epsilon_{t+1},$$
$$\hat\epsilon_{t+1} = (1-\alpha)S(w_t, w_{t+1}) + (1-\alpha)\hat\epsilon_t + \alpha\epsilon_{t+1}.$$
Now, we unravel the recursion for $t$ iterations:
$$\hat\epsilon_{t+1} = (1-\alpha)^t\hat\epsilon_1 + \alpha\sum_{\tau=0}^{t-1}(1-\alpha)^\tau\epsilon_{t+1-\tau} + \sum_{\tau=0}^{t-1}(1-\alpha)^{\tau+1}S(w_{t-\tau}, w_{t+1-\tau}).$$
Next, take the magnitude of both sides and use the triangle inequality:
$$\|\hat\epsilon_{t+1}\| \le (1-\alpha)^t\|\epsilon_1\| + \alpha\left\|\sum_{\tau=0}^{t-1}(1-\alpha)^\tau\epsilon_{t+1-\tau}\right\| + \eta L\sum_{\tau=0}^{t-1}(1-\alpha)^{\tau+1},$$
where in the above we have observed that $\hat\epsilon_1 = \epsilon_1$. Now take expectations, use the fact that any two $\epsilon_i$ are uncorrelated, and apply Jensen's inequality to obtain
$$\mathbb{E}[\|\hat\epsilon_{t+1}\|] \le (1-\alpha)^t\mathbb{E}[\|\epsilon_1\|] + \alpha\sqrt{\sum_{\tau=0}^{t-1}(1-\alpha)^{2\tau}\sigma^2} + \frac{\eta L}{\alpha} \le (1-\alpha)^t\sigma + \frac{\alpha\sigma}{\sqrt{1-(1-\alpha)^2}} + \frac{\eta L}{\alpha} \le (1-\alpha)^t\sigma + \sqrt{\alpha}\,\sigma + \frac{\eta L}{\alpha}.$$
Summing over $t$ yields
$$\sum_{t=1}^{T}\mathbb{E}[\|\hat\epsilon_t\|] \le \frac{\sigma}{\alpha} + T\sqrt{\alpha}\,\sigma + \frac{T\eta L}{\alpha}.$$
Applying Lemma 2 now yields
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le \frac{3F(w_1)}{\eta} + \frac{8\sigma}{\alpha} + 8T\sqrt{\alpha}\,\sigma + \frac{8TL\eta}{\alpha} + \frac{3TL\eta}{2} \le \frac{3R}{\eta} + \frac{8\sigma}{\alpha} + 8T\sqrt{\alpha}\,\sigma + \frac{10TL\eta}{\alpha},$$
where we have used $\alpha \le 1$ and $F(w_1) \le R$. Recalling the settings $\alpha = \min\left\{\frac{\sqrt{RL}}{\sigma\sqrt{T}}, 1\right\}$ and $\eta = \sqrt{\frac{R\alpha}{TL}}$, observe that
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le 13\sqrt{\frac{RLT}{\alpha}} + 8\sqrt{\alpha}\,\sigma T + \frac{8\sigma}{\alpha}.$$
Now suppose $\alpha = 1$. This implies that $\sigma \le \sqrt{RL/T}$, and therefore the right-hand side of the above expression is bounded by $21\sqrt{RLT} + 8\sigma$. On the other hand, if $\alpha = \frac{\sqrt{RL}}{\sigma\sqrt{T}}$, the right-hand side is bounded by $21\sqrt{\sigma}(RL)^{1/4}T^{3/4} + \frac{8\sigma^2\sqrt{T}}{\sqrt{RL}}$. Adding these expressions and dividing by $T$ yields the desired result.

## 3. Faster Convergence with Second-Order Smoothness

Now that we have seen a simpler example of how to analyze the convergence of normalized SGD with momentum in the non-convex setting, we can proceed to the more advanced case in which $F$ is second-order smooth. Notice that in the proof of Theorem 1, we made use of the quantity $S(a, b) = \nabla F(a) - \nabla F(b)$, which satisfies $\|S(a, b)\| \le L\|a - b\|$. In this section, we will need a related quantity:
$$Z(a, b) = \nabla F(a) - \nabla F(b) - \nabla^2 F(b)(a - b).$$
Assuming $F$ satisfies (A4), Taylor's theorem implies that $Z$ satisfies $\|Z(a, b)\| \le \rho\|a - b\|^2$. The improved dependency on $\|a - b\|$, from linear to quadratic, will allow us to achieve a faster convergence rate. Without further ado, we present our algorithm NIGT in Algorithm 1 below.

Algorithm 1: Normalized SGD with Implicit Gradient Transport (NIGT, pronounced "night").
Input: initial point $w_1$, learning rate $\eta$, momentum parameter $\beta$.
Set $m_1 = \nabla f(w_1, \xi_1)$ and $w_2 = w_1 - \eta\,\frac{m_1}{\|m_1\|}$.
For $t = 2, \dots, T$ do:
- Set $x_t = w_t + \frac{\beta}{1-\beta}(w_t - w_{t-1})$.
- Set $m_t = \beta m_{t-1} + (1-\beta)\nabla f(x_t, \xi_t)$.
- Set $w_{t+1} = w_t - \eta\,\frac{m_t}{\|m_t\|}$.

The auxiliary variable $x_t$ is introduced for notational convenience; it can be recomputed every iteration from $w_t$ and $w_{t-1}$. Notice that, with $\beta_t = \frac{t}{t+1}$, the update for $m_t$ corresponds exactly to the implicit gradient transport update proposed by (Arnold et al., 2019). We will instead use the different setting $\beta = 1 - O(T^{-4/7})$. We show that this value of $\beta$, combined with the normalized SGD update, enables faster convergence on second-order smooth non-convex objectives. This significantly extends the prior results of (Arnold et al., 2019), which hold only on convex quadratic objectives with constant Hessian, and helps realize the theoretical potential of the implicit gradient transport idea.

Theorem 3. Suppose $f$ and $D$ satisfy the assumptions (A1), (A2), (A3), and (A4). Let $w_1$ be an arbitrary starting point, and set $m_1 = \nabla f(w_1, \xi_1)$. Let $R \ge F(w_1)$ and set
$$\eta = \min\left\{\frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}},\ \sqrt{\frac{R}{TL}}\right\}, \qquad \alpha = \min\left\{\frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}},\ 1\right\}.$$
Suppose $w_1, \dots, w_T$ are the iterates produced by Algorithm 1 with $\eta_t = \eta$ and $\beta_t = 1 - \alpha$ for all $t$. Then the average expected gradient norm satisfies
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(w_t)\|\big] \le \frac{5\sqrt{RL}}{\sqrt{T}} + \frac{8\sigma}{T} + \frac{8\sigma^{13/7}}{R^{4/7}\rho^{2/7}T^{3/7}} + \frac{27R^{2/7}\rho^{1/7}\sigma^{4/7}}{T^{2/7}}.$$
In particular, the dependence on $T$ is $O(1/T^{2/7})$.
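Before turning to the proof, here is a minimal Python sketch of Algorithm 1 as stated above. It keeps a single momentum vector and normalizes the full update with a constant learning rate and momentum, as in Theorem 3; our actual implementation described in Section 5 normalizes per layer and adds learning-rate heuristics, so treat this as an illustration rather than the production optimizer.

```python
import numpy as np

def nigt(grad_oracle, sample, w1, eta, beta, T, tiny=1e-12):
    """Sketch of Algorithm 1 (NIGT). grad_oracle(w, xi) returns grad f(w, xi)
    and sample() draws xi ~ D. Constant eta and beta, with beta = 1 - alpha."""
    w_prev = w1.copy()
    m = grad_oracle(w1, sample())                   # m_1 = grad f(w_1, xi_1)
    w = w1 - eta * m / (np.linalg.norm(m) + tiny)   # w_2
    for t in range(2, T + 1):
        x = w + (beta / (1.0 - beta)) * (w - w_prev)    # extrapolated point x_t
        g = grad_oracle(x, sample())
        m = beta * m + (1.0 - beta) * g                 # momentum evaluated at x_t
        w_prev, w = w, w - eta * m / (np.linalg.norm(m) + tiny)
    return w
```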
Proof. Our proof begins very similarly to that of Theorem 1. We define $\hat\epsilon_t = m_t - \nabla F(w_t)$ and $\epsilon_t = \nabla f(x_t, \xi_t) - \nabla F(x_t)$. Then we have the following recursive formulation:
$$m_{t+1} = (1-\alpha)(\nabla F(w_t) + \hat\epsilon_t) + \alpha\nabla f(x_{t+1}, \xi_{t+1})$$
$$= (1-\alpha)\big(\nabla F(w_{t+1}) + \nabla^2 F(w_{t+1})(w_t - w_{t+1}) + Z(w_t, w_{t+1}) + \hat\epsilon_t\big) + \alpha\Big(\nabla F(w_{t+1}) + \tfrac{1-\alpha}{\alpha}\nabla^2 F(w_{t+1})(w_{t+1} - w_t) + Z(x_{t+1}, w_{t+1}) + \epsilon_{t+1}\Big)$$
$$= \nabla F(w_{t+1}) + (1-\alpha)Z(w_t, w_{t+1}) + \alpha Z(x_{t+1}, w_{t+1}) + (1-\alpha)\hat\epsilon_t + \alpha\epsilon_{t+1},$$
$$\hat\epsilon_{t+1} = (1-\alpha)\hat\epsilon_t + (1-\alpha)Z(w_t, w_{t+1}) + \alpha Z(x_{t+1}, w_{t+1}) + \alpha\epsilon_{t+1}.$$
Here we have already used the key insight of implicit gradient transport, as well as our second-order smoothness assumption: since $x_{t+1} - w_{t+1} = \frac{1-\alpha}{\alpha}(w_{t+1} - w_t)$, the carefully constructed equation for $x_t$ is such that the $\nabla^2 F(w_{t+1})$ terms cancel out, and we keep track of the error introduced by having a non-constant Hessian in the $Z$ function. Next, we unroll the recursion to obtain
$$\hat\epsilon_{t+1} = (1-\alpha)^t\hat\epsilon_1 + \alpha\sum_{\tau=0}^{t-1}(1-\alpha)^\tau\epsilon_{t+1-\tau} + \sum_{\tau=0}^{t-1}(1-\alpha)^{\tau+1}Z(w_{t-\tau}, w_{t+1-\tau}) + \alpha\sum_{\tau=0}^{t-1}(1-\alpha)^\tau Z(x_{t+1-\tau}, w_{t+1-\tau}).$$
Now, recall that $\|Z(a, b)\| \le \rho\|a - b\|^2$. This implies
$$\alpha\|Z(x_t, w_t)\| \le \alpha\rho\left(\frac{1-\alpha}{\alpha}\right)^2\eta^2 = \frac{\rho(1-\alpha)^2\eta^2}{\alpha}, \qquad (1-\alpha)\|Z(w_t, w_{t+1})\| \le (1-\alpha)\rho\eta^2.$$
Now, just as in the proof of Theorem 1, we take magnitudes, use the triangle inequality, and take expectations, but this time we also use the bounds on $Z$:
$$\mathbb{E}[\|\hat\epsilon_{t+1}\|] \le (1-\alpha)^t\sigma + \alpha\sqrt{\sum_{\tau=0}^{t-1}(1-\alpha)^{2\tau}\sigma^2} + \frac{2\eta^2\rho(1-\alpha)}{\alpha^2} \le (1-\alpha)^t\sigma + \sqrt{\alpha}\,\sigma + \frac{2\eta^2\rho(1-\alpha)}{\alpha^2},$$
where we have used $\hat\epsilon_1 = \epsilon_1$ in the first line. Now sum over $t$ to obtain
$$\sum_{t=1}^{T}\mathbb{E}[\|\hat\epsilon_t\|] \le \frac{\sigma\big(1 - (1-\alpha)^T\big)}{\alpha} + T\sqrt{\alpha}\,\sigma + \frac{2T\eta^2\rho(1-\alpha)}{\alpha^2}.$$
Next, we apply Lemma 2:
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le \frac{3F(w_1)}{\eta} + \frac{3TL\eta}{2} + \frac{8\sigma\big(1 - (1-\alpha)^T\big)}{\alpha} + 8T\sqrt{\alpha}\,\sigma + \frac{16T\eta^2\rho(1-\alpha)}{\alpha^2}. \tag{4}$$
Now the setting $\eta = \min\left\{\frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}}, \sqrt{\frac{R}{TL}}\right\}$ gives
$$\frac{3F(w_1)}{\eta} + \frac{3TL\eta}{2} \le 5\sqrt{RLT} + 3R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7}.$$
Further, with $\alpha = \min\left\{\frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}}, 1\right\}$, notice that in all cases we have $T\sqrt{\alpha}\,\sigma \le R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7}$, so we can conclude
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le 5\sqrt{RLT} + 11R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7} + \frac{8\sigma\big(1 - (1-\alpha)^T\big)}{\alpha} + \frac{16T\eta^2\rho(1-\alpha)}{\alpha^2}.$$
Let us consider the case $\alpha = 1$. In this case the last term vanishes and the third term is at most $8\sigma$, so we have
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le 5\sqrt{RLT} + 11R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7} + 8\sigma. \tag{5}$$
Next, suppose $\alpha = \frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}}$. Then instead we use the fact that $\eta \le \frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}}$ to refine (4) as
$$\sum_{t=1}^{T}\mathbb{E}[\|\nabla F(w_t)\|] \le 5\sqrt{RLT} + 27R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7} + \frac{8\sigma^{13/7}T^{4/7}}{R^{4/7}\rho^{2/7}}. \tag{6}$$
Taking the maximum of (5) and (6) and dividing by $T$ yields the desired convergence guarantee.

## 4. Adaptive Algorithm

In the previous sections, we considered the case of constant learning rate $\eta$ and momentum $\beta$ parameters. We showed that with appropriate settings, we can obtain fast convergence rates. In this section, we go further and consider varying $\eta$ and $\beta$ values. In particular, we will provide an adaptive schedule in which $\eta$ and $\beta$ are adjusted on the fly based on the observed gradients. This adaptive schedule yields a convergence guarantee that automatically improves when $\sigma$ is small, without knowing $\sigma$. In order to do this, we will need to sample two gradients at each point $x_t$. The two gradient samples are independent, so this still fits within our stochastic gradient oracle model of computation. We denote the second independent gradient estimate evaluated at the point $x_t$ by $\nabla f(x_t, \xi'_t)$. Finally, we will add another assumption: the function $F$ (or at least the values $F(x_t)$) should be bounded above by some constant $M$. With this notation, our adaptive algorithm is presented in Algorithm 2 and analyzed in Theorem 4 below.

Algorithm 2: Adaptive Normalized SGD with Momentum.
Input: initial point $w_1$, Lipschitz bound $g$.
Set $C = \sqrt{\tfrac{7}{26}g^{6/7}}$ and $D = C^{14/3}$.
Set $w_0 = w_1$, $m_0 = 0$, $G_1 = 3g^2 + D$, and $\eta_0 = \frac{C}{D^{2/7}}$.
For $t = 1, \dots, T$ do:
- Set $\eta_t = \frac{C}{\big(G_t^2(t+1)^3\big)^{1/7}}$.
- Set $\alpha_t = \frac{1}{t\,\eta_{t-1}^2 G_{t-1}}$ and $\beta_t = 1 - \alpha_t$.
- Set $x_t = w_t + \frac{\beta_t}{1-\beta_t}(w_t - w_{t-1})$.
- Set $m_t = \beta_t m_{t-1} + (1-\beta_t)\nabla f(x_t, \xi_t)$.
- Set $G_{t+1} = G_t + \|\nabla f(x_t, \xi_t) - \nabla f(x_t, \xi'_t)\|^2 + g^2\big((t+1)^{1/4} - t^{1/4}\big)$.
- Set $w_{t+1} = w_t - \eta_t\,\frac{m_t}{\|m_t\|}$.
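The following sketch mirrors the structure of Algorithm 2: two independent gradient samples per step, the running statistic $G_t$, and the adaptive $\eta_t$ and $\beta_t$. Because the closed-form constants $C$ and $D$ in the pseudocode did not survive the source cleanly, the sketch takes them as plain inputs, clips $\alpha_t$ to $[0, 1]$, and reuses the current value of $G$ where the pseudocode's indexing is ambiguous; treat it as schematic rather than as the paper's exact prescription.

```python
import numpy as np

def adaptive_nigt(grad_oracle, sample, w1, g, C, D, T, tiny=1e-12):
    """Schematic sketch of Algorithm 2. C and D are treated as tunable inputs
    here (an assumption of this sketch, not the paper's prescription)."""
    w_prev, w = w1.copy(), w1.copy()
    m = np.zeros_like(w1)
    G = 3.0 * g**2 + D                       # G_1
    eta_prev = C / D**(2.0 / 7.0)            # eta_0
    for t in range(1, T + 1):
        eta = C / (G**2 * (t + 1)**3)**(1.0 / 7.0)
        alpha = min(1.0, 1.0 / (t * eta_prev**2 * G))   # clipped; uses current G
        beta = 1.0 - alpha
        x = w + (beta / max(1.0 - beta, tiny)) * (w - w_prev)
        xi1, xi2 = sample(), sample()                    # two independent samples
        g1, g2 = grad_oracle(x, xi1), grad_oracle(x, xi2)
        m = beta * m + (1.0 - beta) * g1
        G = G + np.sum((g1 - g2)**2) + g**2 * ((t + 1)**0.25 - t**0.25)
        w_prev, w = w, w - eta * m / (np.linalg.norm(m) + tiny)
        eta_prev = eta
    return w
```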
Theorem 4. Suppose that $f$ and $D$ satisfy (A1), (A2), (A3), and (A4). Further, suppose $\|\nabla f(w, \xi)\| \le g$ for all $w$ with probability 1, and that $F(w) \le M$ for all $w$ for some $M$. Then Algorithm 2 guarantees
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(w_t)\|\big] \le O\left(\frac{1}{\sqrt{T}} + \frac{\sigma^{4/7}}{T^{2/7}}\right),$$
where the $O$ notation hides constants that depend on $R$, $G$, and $M$, and a factor of $\log(T)$. We defer the proof to Appendix A.

## 5. Experiments

We now turn to an experimental evaluation of the proposed method NIGT on two popular large-scale deep learning benchmarks: BERT pretraining and ResNet-50. First we explain the setup and choice of hyperparameters for our method, and then describe both tasks and the details of the baseline used for each task.

We implemented our algorithm in the TensorFlow framework. For simplicity, we implemented a per-layer version of our algorithm, normalizing the gradients for each layer in the network rather than normalizing the full gradient. Taking our cue from defaults in the previous empirical literature on momentum, we set the $\beta$ parameter to 0.9 for NIGT for both BERT and ResNet-50. For BERT, we stick with the learning rate schedule used for Adam in (Devlin et al., 2019), i.e., linear warmup and polynomial decay of $\eta_t = \eta_0(1 - t/T)$, whereas for ResNet-50 we found that linear warmup and polynomial decay of $\eta_t = \eta_0(1 - t/T)^2$ worked best (You et al., 2017). We performed a grid search over the base learning rate $\eta_0 \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$ for both tasks. In our implementation, we also scale the learning rate with the norm of the weights for each layer, similar to (You et al., 2017). We did not normalize gradients for bias, batch normalization, and layer normalization parameters, and we scaled their learning rates by a factor of 1000. All our experiments were conducted on a TPUv3 architecture.

### 5.2. BERT pretraining

We replicate most of the setup created by (Devlin et al., 2019) for our BERT baseline. Our text corpus is a concatenation of Wikipedia and BooksCorpus, with a total of roughly 3.3B words. We use examples of sequence length 512 with a token masking rate of 0.15 and train the BERT-Base model toward a Masked Language Modeling target accuracy of 70.0% for 500k steps. Figure 1 compares the masked language modeling accuracy obtained with our method against the baseline. We set the NIGT base learning rate $\eta_0$ to 0.001 and the batch size to 256. We replicate the hyperparameters used for our baseline method Adam exactly from (Devlin et al., 2019), i.e., step size $\alpha = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$.

Figure 1. (a) Masked language modeling validation accuracy comparisons for our method compared to Adam. Our method NIGT (dark blue), in the end, fares slightly better with 70.91 vs 70.76 for Adam (light blue). (b) Top-1 validation accuracy attained by the ResNet-50 architecture on the ImageNet dataset. Accuracy at the end of 90 epochs: NIGT (dark blue) 76.37% and SGD (light blue) 76.2%.

### 5.3. ResNet-50

We train on the ImageNet dataset (Deng et al., 2009) with the ResNet-50 architecture and compare our method against the commonly used SGD optimizer with momentum in Figure 1. Others have observed that Adam/AdaGrad fail to achieve the accuracy attained by SGD on this task (You et al., 2019; Anil et al., 2019). For our experiments, we set the batch size to 1024 and train the models for 90 epochs. We set the base learning rate $\eta_0$ for NIGT to 0.01.
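For reference, the linear-warmup-plus-polynomial-decay shape used for NIGT in these experiments can be written as a small helper. This is an illustrative sketch; the warmup length is left as a parameter because the text above only fixes it explicitly for the SGD baseline described next (5 epochs).

```python
def lr_schedule(step, total_steps, base_lr, warmup_steps, power=1.0):
    """Linear warmup followed by polynomial decay:
       eta_t = eta_0 * (1 - t/T)        for BERT      (power = 1)
       eta_t = eta_0 * (1 - t/T) ** 2   for ResNet-50 (power = 2)
    warmup_steps is a placeholder parameter, not a value quoted in the paper."""
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    frac = 1.0 - step / float(total_steps)
    return base_lr * max(frac, 0.0) ** power
```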
For the SGD baseline, we stick with the common practice of employing gradual warmup for 5 epochs up to a base learning rate of 0.4 and then dividing it by 10 at the 30th, 60th, and 80th epochs, as done in (Goyal et al., 2017; He et al., 2015).

## 6. Conclusion and Future Work

We have presented a new analysis of the effects of momentum on normalized SGD, and showed that when the objective is second-order smooth, normalized SGD with momentum can find an $\epsilon$-critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rates in a dimension-free and surprisingly straightforward manner. Further, we provided an adaptive version of our algorithm whose convergence guarantee automatically improves when the underlying distribution has low variance. Finally, we provided an empirical evaluation showing that our algorithm matches the performance of the methods used to achieve state-of-the-art results when training ResNet-50 on ImageNet and in BERT pretraining. Notably, these two tasks are often tackled with different optimizers (SGD and Adam, respectively). Our method is one of the select few (You et al., 2019; Anil et al., 2019) that performs well on both simultaneously. Also, since NIGT does not maintain any second-order statistics, it is significantly more memory efficient compared to popular optimizers such as Adam or LAMB.

Our results provide some intuition that can help explain the practical success of momentum in stochastic non-convex optimization. Prior analyses of momentum have failed to demonstrate significant theoretical benefits, but our results show that if one is willing to posit second-order smoothness, momentum may play a role in accelerating convergence. However, there are several open problems remaining. First, we suspect that the upper bound $M$ imposed on $F$ in our adaptive analysis is unnecessary, and also that the use of an extra gradient sample per iteration can be removed. Second, we note that in the noise-free case with second-order smoothness, the best-known dimension-free convergence rate is actually $O(1/\epsilon^{1.75})$ (Carmon et al., 2017) rather than the $O(1/\epsilon^2)$ achieved by gradient descent. Is it possible to design an adaptive algorithm that interpolates between this rate and $O(1/\epsilon^{3.5})$ in the noisy setting? Finally, our implementation employs a few extra learning rate heuristics popular in practice, notably linear warmup and polynomial decay (Devlin et al., 2019), as well as scaling the learning rate by the norm of the weights (You et al., 2017). We found these heuristics to perform better than our adaptive algorithm, despite their lack of theoretical motivation. Understanding the principles underlying the success of these heuristics poses an interesting question orthogonal to understanding momentum, and we hope that our work inspires future investigation into these areas.

## References

Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pp. 2675-2686, 2018.

Anil, R., Gupta, V., Koren, T., and Singer, Y. Memory efficient adaptive optimization. In NeurIPS, 2019.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Arnold, S. M., Manzagol, P.-A., Babanezhad, R., Mitliagkas, I., and Roux, N. L. Reducing the variance in online optimization by transporting past gradients. arXiv preprint arXiv:1906.03532, 2019.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A.
Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 654-663. JMLR.org, 2017.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In NeurIPS, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689-699, 2018.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory, pp. 1192-1234, 2019.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017.

Hazan, E., Levy, K., and Shalev-Shwartz, S. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pp. 1594-1602, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2015.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983-992, 2019.

Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543-547, 1983.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Raffel, C., Shazeer, N., Roberts, A. K., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv, abs/1910.10683, 2019.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In ICML, 2018.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2899-2908, 2018.

Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pp. 6677-6686, 2019.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

You, Y., Li, J., Hseu, J., Song, X., Demmel, J., and Hsieh, C.-J. Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 9793-9803, 2018.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925-3936. Curran Associates Inc., 2018.