# SMG: A Shuffling Gradient-Based Method with Momentum

Trang H. Tran (1), Lam M. Nguyen (2), Quoc Tran-Dinh (3)

(1) School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA. (2) IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA. (3) Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill, NC, USA. Correspondence to: Lam M. Nguyen.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. We combine two advanced ideas widely used in optimization for machine learning, the shuffling strategy and the momentum technique, to develop a novel shuffling gradient-based method with momentum, coined Shuffling Momentum Gradient (SMG), for nonconvex finite-sum optimization problems. While our method is inspired by momentum techniques, its update is fundamentally different from existing momentum-based methods. We establish state-of-the-art convergence rates of SMG for any shuffling strategy using either a constant or a diminishing learning rate under standard assumptions (i.e., L-smoothness and bounded variance). When the shuffling strategy is fixed, we develop another new algorithm that is similar to existing momentum methods, and prove the same convergence rates for this algorithm under the L-smoothness and bounded gradient assumptions. We demonstrate our algorithms via numerical simulations on standard datasets and compare them with existing shuffling methods. Our tests have shown encouraging performance of the new algorithms.

1. Introduction

Most training tasks in supervised learning boil down to solving the following finite-sum minimization problem:
$$\min_{w \in \mathbb{R}^d} \Big\{ F(w) := \frac{1}{n}\sum_{i=1}^n f(w; i) \Big\}, \qquad (1)$$
where $f(\cdot; i) : \mathbb{R}^d \to \mathbb{R}$ is a given smooth and possibly nonconvex function for $i \in [n] := \{1, \dots, n\}$.

Problem (1) looks simple, but covers various convex and nonconvex applications in machine learning and statistical learning, including, but not limited to, logistic regression, multi-kernel learning, conditional random fields, and neural networks. In particular, (1) covers empirical risk minimization as a special case. Solution methods for approximately solving (1) have been widely studied in the literature under different sets of assumptions. The most common approach is perhaps stochastic gradient-type (SGD) methods (Robbins & Monro, 1951; Ghadimi & Lan, 2013; Bottou et al., 2018; Nguyen et al., 2018; 2019) and their variants.

Motivation. While SGD and its variants rely on randomized sampling strategies with replacement, gradient-based methods using without-replacement strategies are often easier and faster to implement. Moreover, practical evidence (Bottou, 2009) has shown that they usually produce a faster decrease of the training loss. Randomized shuffling strategies (also viewed as sampling without replacement) allow the algorithm to use each function component $f(\cdot; i)$ exactly once per epoch, in contrast to SGD, which has only statistical convergence guarantees (e.g., in expectation or with high probability). However, the analysis of shuffling methods is often more challenging than that of SGD due to the lack of statistical independence. In the deterministic case, single permutation (also called shuffle once, or single shuffling) and incremental gradient methods can be considered as special cases of the shuffling gradient-based methods we study in this paper.
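For readers who prefer code to notation, the following is a minimal NumPy sketch of the finite-sum structure in (1): an average of $n$ component losses $f(w; i)$, each accessible through a component-gradient oracle, which is exactly the quantity a shuffling method queries once per sample in every epoch. The least-squares component and the synthetic data below are our own illustrative placeholders and are not taken from the paper.

```python
import numpy as np

def component_loss(w, X, y, i):
    """f(w; i): loss of the i-th sample (toy least-squares component, for illustration only)."""
    r = X[i] @ w - y[i]
    return 0.5 * r ** 2

def component_grad(w, X, y, i):
    """Gradient of the i-th component f(w; i)."""
    return (X[i] @ w - y[i]) * X[i]

def full_objective(w, X, y):
    """F(w) = (1/n) * sum_i f(w; i), the finite-sum objective in (1)."""
    n = X.shape[0]
    return sum(component_loss(w, X, y, i) for i in range(n)) / n

def full_gradient(w, X, y):
    """Full gradient: the average of the n component gradients."""
    n = X.shape[0]
    return sum(component_grad(w, X, y, i) for i in range(n)) / n

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
w = np.zeros(3)
print(full_objective(w, X, y), full_gradient(w, X, y))
```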
One special shuffling strategy is randomized reshuffling, which is broadly used in practice: a new random permutation is generated at each epoch. In addition, in recent years it has been shown that many gradient-based methods with a momentum update can notably boost the convergence speed both in theory and in practice (Nesterov, 2004; Dozat, 2016; Wang et al., 2020). These methods have been widely used in both convex and nonconvex settings, especially in the deep learning community. Remarkably, Nesterov's accelerated method (Nesterov, 1983) has revolutionized large-scale convex optimization over the last two decades and has also been widely exploited in nonconvex problems. The developments we have discussed here motivate us to raise the following research question: Can we combine the shuffling strategy and the momentum scheme to develop new provable gradient-based algorithms for handling (1)?

In this paper, we answer this question affirmatively by proposing a novel algorithm called Shuffling Momentum Gradient (SMG). We establish its convergence guarantees for different shuffling strategies, in particular for the randomized reshuffling strategy, and we investigate different variants of our method.

Our contribution. Our contributions in this paper can be summarized as follows.

(a) We develop a novel shuffling gradient-based method with momentum (Algorithm 1 in Section 2) for approximating a stationary point of the nonconvex minimization problem (1). Our algorithm covers any shuffling strategy, ranging from deterministic to randomized, including the incremental, single shuffling, and randomized reshuffling variants.

(b) We establish the convergence of our method in the nonconvex setting and achieve the state-of-the-art $O(1/T^{2/3})$ convergence rate under standard assumptions (i.e., the L-smoothness and bounded variance conditions), where $T$ is the number of epochs. For the randomized reshuffling strategy, we improve our convergence rate to $O\big(1/(n^{1/3}T^{2/3})\big)$.

(c) We study different strategies for selecting learning rates (LR), including constant, diminishing, exponential, and cosine scheduled learning rates. In all cases, we prove the same convergence rate for the corresponding variants without any additional assumption.

(d) When a single shuffling strategy is used, we show that a momentum strategy can be incorporated directly at each iteration of the shuffling gradient method, leading to a different variant presented in Algorithm 2. We analyze the convergence of this algorithm and achieve the same $O(1/T^{2/3})$ epoch-wise convergence rate, but under a bounded gradient assumption instead of the bounded variance assumption used for the SMG algorithm.

Our $O(1/T^{2/3})$ convergence rate is the best known so far for shuffling gradient-type methods in nonconvex optimization (Nguyen et al., 2020; Mishchenko et al., 2020). However, like (Mishchenko et al., 2020), our SMG method only requires a generalized bounded variance assumption (Assumption 1(c)), which is weaker and more standard than the bounded component gradient assumption used in existing works. Algorithm 2 uses the same set of assumptions as in (Nguyen et al., 2020) to achieve the same rate, but has a momentum update. For the randomized reshuffling strategy, our $O\big(1/(n^{1/3}T^{2/3})\big)$ convergence rate also matches the rate of the without-momentum algorithm in (Mishchenko et al., 2020). It leads to a total of $nT = O(\sqrt{n}\,\varepsilon^{-3})$ individual gradient evaluations to reach an $\varepsilon$-accurate stationary point.
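As a quick back-of-envelope check of this complexity (our own derivation, included only for the reader's convenience), requiring the $O\big(1/(n^{1/3}T^{2/3})\big)$ bound to be at most the target accuracy $\varepsilon^2$ gives:

```latex
% Back-of-envelope check: epochs T needed so that the bound is at most eps^2,
% and the resulting total number of individual gradient evaluations nT.
\frac{C}{n^{1/3} T^{2/3}} \le \varepsilon^{2}
\;\Longrightarrow\;
T \ge \frac{C^{3/2}}{\sqrt{n}\,\varepsilon^{3}}
\;\Longrightarrow\;
n T = O\!\Big(\frac{\sqrt{n}}{\varepsilon^{3}}\Big),
```

where $C$ collects the constants hidden in the rate and $nT$ counts individual component-gradient evaluations ($n$ per epoch over $T$ epochs). The same calculation with the general $O(1/T^{2/3})$ rate gives $nT = O(n\varepsilon^{-3})$, which is where the $n^{1/3}$ improvement of randomized reshuffling pays off.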
We emphasize that, in many existing momentum variants, the momentum $m^{(t)}_i$ is updated recursively at each iteration as $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$ for a given weight $\beta \in (0, 1)$. This update shows that the momentum $m^{(t)}_{i+1}$ incorporates all the past gradient terms $g^{(t)}_i, g^{(t)}_{i-1}, g^{(t)}_{i-2}, \dots, g^{(t)}_0$ with exponentially decaying weights $1, \beta, \beta^2, \dots, \beta^i$, respectively. However, in shuffling methods, the convergence guarantee is obtained epoch-wise. Based on this observation, we modify the classical momentum update for the shuffling method as shown in Algorithm 1. More specifically, the momentum term $m^{(t)}_0$ is fixed at the beginning of each epoch, and an auxiliary sequence $\{v^{(t)}_i\}$ is introduced to keep track of the gradient average, which is then used to update the momentum term for the next epoch. This modification makes Algorithm 1 fundamentally different from existing momentum-based methods, yet the new algorithm still achieves an $O(1/T^{2/3})$ epoch-wise convergence rate under standard assumptions. To the best of our knowledge, our work is the first to analyze convergence rate guarantees of shuffling-type gradient methods with momentum under standard assumptions.

Besides Algorithm 1, we also exploit recent advanced strategies for selecting learning rates, including exponential and cosine scheduled learning rates. These two strategies have shown state-of-the-art performance in practice (Smith, 2017; Loshchilov & Hutter, 2017; Li et al., 2020). Therefore, it is worth incorporating them into shuffling methods.

Related work. Let us briefly review the works most closely related to the methods studied in this paper.

Shuffling gradient-based methods. Shuffling gradient-type methods for solving (1) have been widely studied in the literature in recent years (Bottou, 2009; Gürbüzbalaban et al., 2019; Shamir, 2016; Haochen & Sra, 2019; Nguyen et al., 2020) for both convex and nonconvex settings. They were empirically investigated in a short note (Bottou, 2009) and also discussed in (Bottou, 2012). These methods have also been implemented in several software packages broadly used in machine learning, such as TensorFlow and PyTorch (Abadi et al., 2015; Paszke et al., 2019).

In the strongly convex case, shuffling methods have been extensively studied in (Ahn et al., 2020; Gürbüzbalaban et al., 2019; Haochen & Sra, 2019; Safran & Shamir, 2020; Nagaraj et al., 2019; Rajput et al., 2020; Nguyen et al., 2020; Mishchenko et al., 2020) under different assumptions. The best known convergence rate in this case is $O\big(1/(nT)^2 + 1/(nT^3)\big)$, which matches the lower bound studied in (Safran & Shamir, 2020) up to a constant factor. Most results in the convex case are for the incremental gradient variant, studied in (Nedić & Bertsekas, 2001a;b). Convergence results of shuffling methods in the general convex case are investigated in (Shamir, 2016; Mishchenko et al., 2020), where (Mishchenko et al., 2020) provides a unified approach covering different settings. The authors in (Ying et al., 2017) combine a randomized shuffling strategy with a variance reduction technique (e.g., SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013)) to develop a new variant. They show a linear convergence rate for strongly convex problems, but in terms of an energy function for which it is unclear how to convert the result into standard convergence criteria.
In the nonconvex case, (Nguyen et al., 2020) first shows an $O(1/T^{2/3})$ convergence rate for a general class of shuffling gradient methods under the L-smoothness and bounded gradient assumptions on (1). This analysis is then extended in (Mishchenko et al., 2020) to a more relaxed assumption. The authors in (Meng et al., 2019) study different distributed SGD variants with shuffling for strongly convex, general convex, and nonconvex problems. An incremental gradient method for weakly convex problems is investigated in (Li et al., 2019), where the authors show an $O(1/T^{1/2})$ convergence rate, as in standard SGD. To the best of our knowledge, the best known rate of shuffling gradient methods in the nonconvex case under standard assumptions is $O(1/T^{2/3})$, as shown in (Nguyen et al., 2020; Mishchenko et al., 2020). Our Algorithm 1 developed in this paper is a nontrivial momentum variant of the general shuffling method in (Nguyen et al., 2020), but our analysis uses the standard bounded variance assumption instead of the bounded gradient one.

Momentum-based methods. Gradient methods with momentum were studied in early works for convex problems, such as the heavy-ball, inertial, and Nesterov's accelerated gradient methods (Polyak, 1964; Nesterov, 2004). Nesterov's accelerated method is the most influential scheme and achieves an optimal convergence rate for convex problems. While momentum-based methods are not yet known to improve theoretical convergence rates in the nonconvex setting, they show significantly encouraging performance in practice (Dozat, 2016; Wang et al., 2020), especially in the deep learning community. However, the momentum strategy has not yet been exploited in shuffling methods.

Adaptive learning rate schemes. Gradient-type methods with adaptive learning rates, such as AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014), have shown state-of-the-art performance in several optimization applications. Recently, many adaptive schemes for learning rates have been proposed, such as diminishing (Nguyen et al., 2020), exponential scheduled (Li et al., 2020), and cosine scheduled (Smith, 2017; Loshchilov & Hutter, 2017) learning rates. In (Li et al., 2020), the authors analyze convergence guarantees for the exponential and cosine learning rates in SGD. These adaptive learning rates have also been empirically studied in the literature, especially in machine learning tasks, and their convergence guarantees have been investigated under certain assumptions. Although the analyses for adaptive learning rates are generally nontrivial, our theoretical results in this paper are flexible enough to cover various learning rates. However, we only exploit the diminishing, exponential scheduled, and cosine scheduled schemes for our shuffling methods with momentum. We establish that in the last two cases, our algorithm still achieves the state-of-the-art $O(T^{-2/3})$ (and possibly up to $O(n^{-1/3}T^{-2/3})$) epoch-wise rates.

Content. The rest of this paper is organized as follows. Section 2 describes our novel method, the Shuffling Momentum Gradient (Algorithm 1). Section 3 investigates its convergence rate under different shuffling strategies and different learning rates. Section 4 proposes an algorithm with a traditional momentum update for the single shuffling strategy. Section 5 presents our numerical simulations. Due to space limitations, the convergence analysis of our methods, all technical proofs, and additional experiments are deferred to the Supplementary Document (Supp. Doc.).
2. Shuffling Gradient-Based Method with Momentum

In this section, we describe our new shuffling gradient algorithm with momentum in Algorithm 1. We also compare our algorithm with existing methods and discuss its per-iteration complexity and possible modifications, e.g., mini-batches.

Algorithm 1 Shuffling Momentum Gradient (SMG)
1: Initialization: Choose $\tilde{w}_0 \in \mathbb{R}^d$ and set $\tilde{m}_0 := 0$.
2: for $t := 1, 2, \dots, T$ do
3:   Set $w^{(t)}_0 := \tilde{w}_{t-1}$; $m^{(t)}_0 := \tilde{m}_{t-1}$; and $v^{(t)}_0 := 0$;
4:   Generate an arbitrary deterministic or random permutation $\pi^{(t)}$ of $[n]$;
5:   for $i := 0, \dots, n-1$ do
6:     Query $g^{(t)}_i := \nabla f\big(w^{(t)}_i; \pi^{(t)}(i+1)\big)$;
7:     Choose $\eta^{(t)}_i := \frac{\eta_t}{n}$ and update
       $m^{(t)}_{i+1} := \beta m^{(t)}_0 + (1-\beta) g^{(t)}_i$,
       $v^{(t)}_{i+1} := v^{(t)}_i + \frac{1}{n} g^{(t)}_i$,
       $w^{(t)}_{i+1} := w^{(t)}_i - \eta^{(t)}_i m^{(t)}_{i+1}$;   (2)
8:   end for
9:   Set $\tilde{w}_t := w^{(t)}_n$ and $\tilde{m}_t := v^{(t)}_n$;
10: end for
11: Output: Choose $\hat{w}_T \in \{\tilde{w}_0, \dots, \tilde{w}_{T-1}\}$ at random with probability $\mathbb{P}[\hat{w}_T = \tilde{w}_{t-1}] = \frac{\eta_t}{\sum_{t=1}^T \eta_t}$.

Clearly, if $\beta = 0$, then Algorithm 1 reduces to the standard shuffling gradient method as in (Nguyen et al., 2020; Mishchenko et al., 2020). Since each inner iteration of our method uses one component $f(\cdot; i)$, we use $\eta^{(t)}_i = \frac{\eta_t}{n}$ in Algorithm 1 to make it consistent with one full gradient, which consists of $n$ gradient components. This form looks different from the learning rates used in previous work on SGD (Shamir, 2016; Mishchenko et al., 2020); however, it does not necessarily make our learning rate smaller. In the same order of the number of training samples $n$, our learning rate matches the one in (Mishchenko et al., 2020) as well as the state-of-the-art complexity results. The detailed learning rate $\eta_t$ used in Algorithm 1 will be specified in Section 3.

Comparison. Unlike existing momentum methods, where $m^{(t)}_i$ in (2) is updated recursively as $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$ for $\beta \in (0, 1)$, we instead fix the first term $m^{(t)}_0$ in (2) during each epoch. It is only updated at the end of each epoch by averaging all the gradient components $\{g^{(t)}_i\}_{i=0}^{n-1}$ evaluated in that epoch. To avoid storing these gradients, we introduce an auxiliary variable $v^{(t)}_i$ to keep track of the gradient average. Here, we fix the weight $\beta$ for the sake of our analysis, but it is possible to make $\beta$ adaptive.

Our new method is based on the following observations. First, while SGD generally uses an unbiased estimator of the full gradient, shuffling gradient methods often do not have such a nice property, which makes them more difficult to analyze. Due to this fact, updating the momentum at each inner iteration does not seem preferable, since it could make the estimator deviate further from the true gradient. Therefore, we consider updating the momentum term only after each epoch. Second, unlike the traditional momentum with exponentially decaying weights, our momentum term $m^{(t)}_0$ is an equally weighted average of all the past gradients in an epoch, evaluated at the different points $w^{(t-1)}_i$ of the inner loop. Based on these observations, the momentum term should aid the full gradient $\nabla F$ instead of its component gradients $\nabla f(\cdot; i)$, which leads to the update at the end of each epoch.

It is also worth noting that our algorithm is fundamentally different from variance-reduced methods. For instance, SVRG and SARAH variants require evaluating the full gradient at each snapshot point, while our method always uses a single component gradient. Hence, our algorithm does not require a full gradient evaluation at each outer iteration, and our momentum term does not require a full gradient of $f$.
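To make the epoch-wise momentum update concrete, here is a minimal NumPy sketch of Algorithm 1 under a randomized reshuffling strategy. The helper grad(w, i), the toy data, and the constant learning rate eta are placeholders of our own choosing; the sketch only illustrates the structure of the SMG update and is not a tuned implementation.

```python
import numpy as np

def smg(grad, w0, n, T, eta, beta=0.5, seed=0):
    """Sketch of Algorithm 1 (SMG): the momentum m is frozen within each epoch
    and refreshed at the epoch's end with the running gradient average v."""
    rng = np.random.default_rng(seed)
    w, m = w0.copy(), np.zeros_like(w0)
    for t in range(1, T + 1):
        v = np.zeros_like(w0)                  # v_0^{(t)} := 0
        perm = rng.permutation(n)              # randomized reshuffling; any permutation is allowed
        for i in range(n):
            g = grad(w, perm[i])               # gradient of component pi^{(t)}(i+1)
            m_new = beta * m + (1 - beta) * g  # uses m_0^{(t)}, kept fixed during the epoch
            v += g / n                         # running average of this epoch's gradients
            w -= (eta / n) * m_new             # step size eta_t / n
        m = v                                  # momentum for the next epoch
    # The analysis outputs a randomly selected epoch iterate; for simplicity we return the last one.
    return w

# Toy usage: minimize F(w) = (1/n) sum_i 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
X, y = rng.standard_normal((32, 5)), rng.standard_normal(32)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = smg(grad, np.zeros(5), n=32, T=50, eta=0.5)
```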
When the learning rate $\eta^{(t)}_i$ is fixed within each epoch as $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t > 0$, we can derive from (2) that
$$w^{(t)}_{i+1} = w^{(t)}_i - \frac{(1-\beta)\eta_t}{n}\, g^{(t)}_i + \frac{\beta \eta_t}{n(1-\beta)\eta_{t-1}}\, e_t, \quad \text{where } e_t := \tilde{w}_{t-1} - \tilde{w}_{t-2} + \beta \eta_{t-1} \tilde{m}_{t-2}.$$
Here, $e_t$ plays the role of a momentum or inertial term, but it is different from the usual momentum term. We nevertheless name Algorithm 1 the Shuffling Momentum Gradient method since it is inspired by momentum methods.

Per-iteration complexity. The per-iteration complexity of Algorithm 1 is almost the same as in standard shuffling gradient schemes; see, e.g., (Shamir, 2016). It needs to store two additional vectors $m^{(t)}_{i+1}$ and $v^{(t)}_i$, and performs two extra vector additions and three scalar-vector multiplications. Such additional costs are very mild.

Shuffling strategies. Our convergence guarantees in Section 3 hold for any permutation $\pi^{(t)}$ of $\{1, 2, \dots, n\}$, including deterministic and randomized ones. Therefore, our method covers any shuffling strategy, including the incremental, single shuffling, and randomized reshuffling variants, as special cases. We highlight that our convergence result for the randomized reshuffling variant is significantly better than the general one, as we explain in detail in Section 3.

Mini-batch. Algorithm 1 also works with mini-batches, and our theory can be slightly adapted to establish the same convergence rates for mini-batch variants. However, it remains unclear to us how mini-batches can improve the convergence rate guarantee of Algorithm 1 in the general case where the shuffling strategy is not specified.

3. Convergence Analysis

We analyze the convergence of Algorithm 1 under standard assumptions, organized as follows.

3.1. Technical Assumptions

Our analysis relies on the following assumptions.

Assumption 1. Problem (1) satisfies:
(a) (Boundedness from below) $\mathrm{dom}(F) \neq \emptyset$ and $F$ is bounded from below, i.e., $F_* := \inf_{w \in \mathbb{R}^d} F(w) > -\infty$.
(b) (L-smoothness) $f(\cdot; i)$ is L-smooth for all $i \in [n]$, i.e., there exists a universal constant $L > 0$ such that, for all $w, w' \in \mathrm{dom}(F)$,
$$\|\nabla f(w; i) - \nabla f(w'; i)\| \le L \|w - w'\|. \qquad (3)$$
(c) (Generalized bounded variance) There exist two non-negative and finite constants $\Theta$ and $\sigma$ such that for any $w \in \mathrm{dom}(F)$,
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f(w; i) - \nabla F(w)\|^2 \le \Theta \|\nabla F(w)\|^2 + \sigma^2. \qquad (4)$$

Assumption 1(a) is required by any algorithm to guarantee the well-definedness of (1). The L-smoothness condition (3) is standard in gradient-type methods for both stochastic and deterministic algorithms. From this assumption, we have for any $w, w' \in \mathrm{dom}(F)$ (see (Nesterov, 2004)):
$$F(w) \le F(w') + \langle \nabla F(w'), w - w' \rangle + \frac{L}{2}\|w - w'\|^2. \qquad (5)$$
The condition (4) in Assumption 1(c) reduces to the standard bounded variance condition when $\Theta = 0$. Therefore, (4) is more general than the bounded variance assumption often used in stochastic optimization. Unlike recent works on momentum SGD and shuffling (Chen et al., 2018; Nguyen et al., 2020), we do not impose a bounded gradient assumption on each $f(\cdot; i)$ in Algorithm 1 (see Assumption 2); that condition is stronger than (4).

3.2. Main Result 1 and Its Consequences

Our first main result is the following convergence theorem for Algorithm 1 under any shuffling strategy.

Theorem 1. Suppose that Assumption 1 holds for (1). Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with a fixed momentum weight $0 \le \beta < 1$ and an epoch learning rate $\eta^{(t)}_i := \frac{\eta_t}{n}$ for every $t \ge 1$. Assume that $\eta_0 = \eta_1$, $\eta_t \ge \eta_{t+1}$, and $0 < \eta_t \le \frac{1}{L\sqrt{K}}$ for $t \ge 1$, where $K := \max\Big\{\frac{5}{2}, \frac{9(5-3\beta)(\Theta+1)}{1-\beta}\Big\}$.
Then, it holds that
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] = \frac{1}{\sum_{t=1}^T \eta_t}\sum_{t=1}^T \eta_t\, \mathbb{E}\big[\|\nabla F(\tilde{w}_{t-1})\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\sum_{t=1}^T \eta_t} + \frac{9\sigma^2 L^2 (5-3\beta)}{1-\beta}\cdot\frac{\sum_{t=1}^T \eta_t^3}{\sum_{t=1}^T \eta_t}. \qquad (6)$$

Remark 1 (Convergence guarantee). When a deterministic permutation $\pi^{(t)}$ is used, our convergence rate holds in a deterministic sense. However, to unify our analysis, we express our convergence guarantees in expectation, where the expectation is taken over all the randomness generated by $\pi^{(t)}$ and $\hat{w}_T$ up to $T$ epochs. Since the permutations $\pi^{(t)}$ can be chosen either deterministically or randomly, our bounds in the sequel hold either deterministically or with probability 1 (w.p. 1), respectively. Without loss of generality, we write these results in expectation.

Next, we derive two direct consequences of Theorem 1 by choosing constant and diminishing learning rates.

Corollary 1 (Constant learning rate). Let us fix the number of epochs $T \ge 1$ and choose a constant learning rate $\eta_t := \frac{\gamma}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma}{T^{1/3}} \le \frac{1}{L\sqrt{K}}$ for $t \ge 1$ in Algorithm 1. Then, under the conditions of Theorem 1, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{T^{2/3}}\left[\frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{9\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$
Consequently, the convergence rate of Algorithm 1 is $O(T^{-2/3})$ in epoch.

With a constant LR as in Corollary 1, the convergence rate of Algorithm 1 is expressed as $O\Big(\frac{[F(\tilde{w}_0) - F_*] + \sigma^2}{T^{2/3}}\Big)$, which matches the best known rate in the literature (Mishchenko et al., 2020; Nguyen et al., 2020) in terms of $T$ for general shuffling-type strategies.

Corollary 2 (Diminishing learning rate). Let us choose a diminishing learning rate $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ for some $\gamma > 0$ and $\lambda \ge 0$ for $t \ge 1$ such that $\eta_1 := \frac{\gamma}{(1+\lambda)^{1/3}} \le \frac{1}{L\sqrt{K}}$ in Algorithm 1. Then, under the conditions of Theorem 1, after $T$ epochs with $T \ge 2$, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{C_1 + C_2 \log(T - 1 + \lambda)}{(T+\lambda)^{2/3} - (1+\lambda)^{2/3}},$$
where $C_1$ and $C_2$ are respectively given by
$$C_1 := \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{18\sigma^2 L^2(5-3\beta)\gamma^2}{(1-\beta)(1+\lambda)} \quad \text{and} \quad C_2 := \frac{9\sigma^2 L^2(5-3\beta)\gamma^2}{1-\beta}.$$
Consequently, the convergence rate of Algorithm 1 is $O(T^{-2/3}\log(T))$ in epoch.

The diminishing LR $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ allows us to use larger learning rate values in early epochs compared to the constant case. However, we lose a $\log(T)$ factor in the second term of our worst-case convergence rate bound.

We also derive the following two consequences of Theorem 1 for exponential and cosine scheduled learning rates.

Exponential scheduled learning rate. Given an epoch budget $T \ge 1$ and two positive constants $\gamma > 0$ and $\rho > 0$, we consider the following exponential LR; see (Li et al., 2020):
$$\eta_t := \frac{\gamma\, \alpha^t}{T^{1/3}}, \quad \text{where } \alpha := \rho^{1/T} \in (0, 1). \qquad (7)$$
The following corollary shows the convergence of Algorithm 1 using this LR without any additional assumption.

Corollary 3. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t$ is given by (7) such that $0 < \eta_t \le \frac{1}{L\sqrt{K}}$. Then, under Assumption 1, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma\rho\, T^{2/3}} + \frac{9\sigma^2 L^2(5-3\beta)\gamma^2}{(1-\beta)\rho\, T^{2/3}}.$$

Cosine annealing learning rate. Alternatively, given an epoch budget $T \ge 1$ and a positive constant $\gamma > 0$, we consider the following cosine LR for Algorithm 1:
$$\eta_t := \frac{\gamma}{T^{1/3}}\Big(1 + \cos\frac{t\pi}{T}\Big), \quad t = 1, 2, \dots, T. \qquad (8)$$
This LR is adopted from (Loshchilov & Hutter, 2017; Smith, 2017). However, unlike these works, we fix our learning rate within each epoch instead of updating it at every iteration as in (Loshchilov & Hutter, 2017; Smith, 2017).

Corollary 4. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t$ is given by (8) such that $0 < \eta_t \le \frac{1}{L\sqrt{K}}$.
Then, under Assumption 1 and for $T \ge 2$, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{T^{2/3}}\left[\frac{8\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{144\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$

The scheduled LRs (7) and (8) still preserve our best known convergence rate $O(T^{-2/3})$. Note that exponential learning rates are available in both TensorFlow and PyTorch, while cosine learning rates are also implemented in PyTorch.

3.3. Main Result 2: Randomized Reshuffling Variant

A variant of Algorithm 1 is called a randomized reshuffling variant if, at each epoch $t$, the permutation $\pi^{(t)} = \big(\pi^{(t)}(1), \dots, \pi^{(t)}(n)\big)$ is uniformly sampled at random without replacement from $\{1, \dots, n\}$. Since the randomized reshuffling strategy is extremely popular in practice, we analyze Algorithm 1 under this strategy.

Theorem 2. Suppose that Assumption 1 holds for (1). Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 under a randomized reshuffling strategy, a fixed momentum weight $0 \le \beta < 1$, and an epoch learning rate $\eta^{(t)}_i := \frac{\eta_t}{n}$ for every $t \ge 1$. Assume that $\eta_0 = \eta_1$, $\eta_t \ge \eta_{t+1}$, and $0 < \eta_t \le \frac{1}{L\sqrt{D}}$, where $D := \max\Big\{\frac{5}{3}, \frac{6(5-3\beta)(\Theta+n)}{n(1-\beta)}\Big\}$. Then
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] = \frac{1}{\sum_{t=1}^T \eta_t}\sum_{t=1}^T \eta_t\, \mathbb{E}\big[\|\nabla F(\tilde{w}_{t-1})\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\sum_{t=1}^T \eta_t} + \frac{6\sigma^2(5-3\beta)L^2}{n(1-\beta)}\cdot\frac{\sum_{t=1}^T \eta_t^3}{\sum_{t=1}^T \eta_t}. \qquad (9)$$

We can derive the following two consequences.

Corollary 5 (Constant learning rate). Let us fix the number of epochs $T \ge 1$ and choose a constant learning rate $\eta_t := \frac{\gamma n^{1/3}}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma n^{1/3}}{T^{1/3}} \le \frac{1}{L\sqrt{D}}$ for $t \ge 1$ in Algorithm 1. Then, under the conditions of Theorem 2, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{n^{1/3}T^{2/3}}\left[\frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{6\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$
Consequently, the convergence rate of Algorithm 1 is $O(n^{-1/3}T^{-2/3})$ in epoch.

With a constant LR as in Corollary 5, the convergence rate of Algorithm 1 is improved to $O\Big(\frac{[F(\tilde{w}_0) - F_*] + \sigma^2}{n^{1/3}T^{2/3}}\Big)$, which matches the best known rate for the randomized reshuffling scheme; see, e.g., (Mishchenko et al., 2020). In this case, the total number of iterations $T_{\mathrm{tol}} := nT$ needed to obtain $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \varepsilon^2$ is $T_{\mathrm{tol}} = O(\sqrt{n}\,\varepsilon^{-3})$. Compared to the $O(\varepsilon^{-4})$ complexity of SGD, our randomized reshuffling variant is better than SGD if $n \le O(\varepsilon^{-2})$.

Corollary 6 (Diminishing learning rate). Let us choose a diminishing learning rate $\eta_t := \frac{\gamma n^{1/3}}{(t+\lambda)^{1/3}}$ for some $\gamma > 0$ and $\lambda \ge 0$ for $t \ge 1$ such that $\eta_1 := \frac{\gamma n^{1/3}}{(1+\lambda)^{1/3}} \le \frac{1}{L\sqrt{D}}$ in Algorithm 1. Then, under the conditions of Theorem 2, after $T$ epochs with $T \ge 2$, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{C_3 + C_4 \log(T - 1 + \lambda)}{n^{1/3}\big[(T+\lambda)^{2/3} - (1+\lambda)^{2/3}\big]},$$
where $C_3$ and $C_4$ are respectively given by
$$C_3 := \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{12\sigma^2(5-3\beta)L^2\gamma^2}{(1-\beta)(1+\lambda)} \quad \text{and} \quad C_4 := \frac{6\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}.$$
Consequently, the convergence rate of Algorithm 1 is $O(n^{-1/3}T^{-2/3}\log(T))$ in epoch.

Remark 2. Algorithm 1 under randomized reshuffling still works with the exponential and cosine scheduled learning rates, and the analysis is similar to the one for general shuffling schemes. We therefore omit it here.

4. Single Shuffling Variant

In this section, we modify the single-shuffling gradient method by directly incorporating a momentum term at each iteration. We prove for the first time that this variant still achieves a state-of-the-art convergence rate guarantee under the smoothness and bounded gradient (i.e., Assumption 2) assumptions used in existing shuffling methods. This variant, though somewhat special, also covers an incremental gradient method with momentum as a special case. Our new momentum algorithm using a single shuffling strategy for solving (1) is presented in Algorithm 2. Besides fixing the permutation $\pi$, Algorithm 2 differs from Algorithm 1 in how it updates $m^{(t)}_i$.
Here, $m^{(t)}_{i+1}$ is updated from $m^{(t)}_i$ instead of from $m^{(t)}_0$ as in Algorithm 1. In addition, we set $\tilde{m}_t := m^{(t)}_n$ instead of the epoch gradient average. Note that we can write the main update of Algorithm 2 as
$$w^{(t)}_{i+1} := w^{(t)}_i - \frac{(1-\beta)\eta_t}{n}\, g^{(t)}_i + \beta\big(w^{(t)}_i - w^{(t)}_{i-1}\big),$$
which exactly reduces to existing momentum updates.

Algorithm 2 Single Shuffling Momentum Gradient
1: Initialization: Choose $\tilde{w}_0 \in \mathbb{R}^d$ and set $\tilde{m}_0 := 0$;
2: Generate a permutation $\pi$ of $[n]$;
3: for $t := 1, 2, \dots, T$ do
4:   Set $w^{(t)}_0 := \tilde{w}_{t-1}$ and $m^{(t)}_0 := \tilde{m}_{t-1}$;
5:   for $i := 0, \dots, n-1$ do
6:     Query $g^{(t)}_i := \nabla f\big(w^{(t)}_i; \pi(i+1)\big)$;
7:     Choose $\eta^{(t)}_i := \frac{\eta_t}{n}$ and update
       $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$,
       $w^{(t)}_{i+1} := w^{(t)}_i - \eta^{(t)}_i m^{(t)}_{i+1}$;
8:   end for
9:   Set $\tilde{w}_t := w^{(t)}_n$ and $\tilde{m}_t := m^{(t)}_n$;
10: end for
11: Output: Choose $\hat{w}_T \in \{\tilde{w}_0, \dots, \tilde{w}_{T-1}\}$ at random with probability $\mathbb{P}[\hat{w}_T = \tilde{w}_{t-1}] = \frac{\eta_t}{\sum_{t=1}^T \eta_t}$.

Incremental gradient method with momentum. If we choose $\pi := [n]$ (the identity permutation), then we obtain the well-known incremental gradient variant, but with momentum. Hence, Algorithm 2 is still new compared to the standard incremental gradient algorithm (Bertsekas, 2011). To prove the convergence of Algorithm 2, we replace Assumption 1(c) by the following:

Assumption 2 (Bounded gradient). There exists $G > 0$ such that $\|\nabla f(w; i)\| \le G$ for all $w \in \mathrm{dom}(F)$ and $i \in [n]$.

Now, we state the convergence of Algorithm 2 in the following theorem as our third main result.

Theorem 3. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 2 with a LR $\eta^{(t)}_i := \frac{\eta_t}{n}$ and $0 < \eta_t \le \frac{1}{L}$ for $t \ge 1$. Then, under Assumption 1(a)-(b) and Assumption 2, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{\Delta_1}{(1-\beta^n)\sum_{t=1}^T \eta_t} + \frac{4\beta^n G^2}{1-\beta^n} + \frac{L^2 G^2 \sum_{t=1}^T \xi_t^3}{(1-\beta^n)\sum_{t=1}^T \eta_t}, \qquad (10)$$
where $\xi_t := \max(\eta_t, \eta_{t-1})$ for $t \ge 2$, $\xi_1 := \eta_1$, and $\Delta_1 := 2\,[F(\tilde{w}_0) - F_*] + \eta_1\|\nabla F(\tilde{w}_0)\|^2 + 2L\eta_1^2 G^2$.

Compared to (9) in Theorem 2, the bound (10) depends strongly on the weight $\beta$, since the term $\frac{4\beta^n G^2}{1-\beta^n}$ appears on its right-hand side. Hence, $\beta$ must be chosen in a specific form to obtain the desired convergence rate. Theorem 3 provides a key bound from which we derive concrete convergence rates in the following two corollaries.

Corollary 7 (Constant learning rate). In Algorithm 2, let us fix $T \ge 1$ and choose the parameters $\beta := \big(\frac{\nu}{T^{2/3}}\big)^{1/n}$ for some constant $\nu \ge 0$ such that $\frac{1}{1-\beta^n} \le R$ for some $R \ge 1$, and $\eta_t := \frac{\gamma}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma}{T^{1/3}} \le \frac{1}{L}$. Then, under the conditions of Theorem 3, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{D_0}{T^{2/3}},$$
where $D_0$ is a constant depending only on $F(\tilde{w}_0) - F_*$, $\|\nabla F(\tilde{w}_0)\|$, $L$, $G$, $\gamma$, $\nu$, and $R$. Thus, the convergence rate of Algorithm 2 is $O(T^{-2/3})$.

With a constant learning rate as in Corollary 7, the convergence rate of Algorithm 2 is $O\Big(\frac{L[F(\tilde{w}_0) - F_*] + \|\nabla F(\tilde{w}_0)\|^2 + G^2}{T^{2/3}}\Big)$, which depends on $L[F(\tilde{w}_0) - F_*]$, $\|\nabla F(\tilde{w}_0)\|^2$, and $G^2$, and is slightly different from Corollary 1.

Corollary 8 (Diminishing learning rate). In Algorithm 2, let us choose the parameters $\beta := \big(\frac{\nu}{T^{2/3}}\big)^{1/n}$ for some constant $\nu \ge 0$ such that $\frac{1}{1-\beta^n} \le R$ for some $R \ge 1$, and a diminishing learning rate $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ for all $t \in [T]$, for some $\gamma > 0$ and $\lambda \ge 0$ such that $\eta_1 = \frac{\gamma}{(1+\lambda)^{1/3}} \le \frac{1}{L}$. Then, under the conditions of Theorem 3, for $T \ge 2$ we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{D_1 + L^2 G^2 \gamma^2 \log(T - 1 + \lambda)}{(T+\lambda)^{2/3} - (1+\lambda)^{2/3}} + \frac{4\nu G^2 R}{T^{2/3}},$$
where $D_1$ is a constant depending only on $F(\tilde{w}_0) - F_*$, $\|\nabla F(\tilde{w}_0)\|$, $L$, $G$, $\gamma$, $R$, and $\lambda$. Thus, the convergence rate of Algorithm 2 is $O\big(\frac{\log(T)}{T^{2/3}}\big)$.

Remark 3. Algorithm 2 also works with exponential and cosine scheduled LRs, and the analysis is similar to that of Algorithm 1; it is deferred to the Supp. Doc.
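For contrast with the SMG sketch in Section 2, the following is a minimal NumPy sketch of Algorithm 2: the permutation is generated once (single shuffling; the identity permutation gives the incremental variant), and the momentum follows the classical recursion at every inner step. As before, grad, eta, and beta are illustrative placeholders of our own choosing, not a definitive implementation.

```python
import numpy as np

def single_shuffling_momentum(grad, w0, n, T, eta, beta, perm=None, seed=0):
    """Sketch of Algorithm 2: a fixed permutation and the classical momentum
    recursion m <- beta * m + (1 - beta) * g applied at every inner iteration."""
    rng = np.random.default_rng(seed)
    # perm = np.arange(n) yields the incremental gradient method with momentum.
    perm = rng.permutation(n) if perm is None else perm
    w, m = w0.copy(), np.zeros_like(w0)
    for t in range(1, T + 1):
        for i in range(n):
            g = grad(w, perm[i])            # gradient of component pi(i+1)
            m = beta * m + (1 - beta) * g   # momentum updated at every iteration
            w -= (eta / n) * m              # step size eta_t / n
    return w
```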
5. Numerical Experiments

In order to examine our algorithms, we present two numerical experiments on different nonconvex problems and compare our methods with some state-of-the-art SGD-type and shuffling gradient methods.

5.1. Models and Datasets

Neural networks. We test our Shuffling Momentum Gradient (SMG) algorithm using two standard network architectures: a fully connected network (FCN) and a convolutional neural network (CNN). For the fully connected setting, we train the classic LeNet-300-100 model (LeCun et al., 1998) on the Fashion-MNIST dataset (Xiao et al., 2017) with 60,000 images. We also use the convolutional LeNet-5 architecture (LeCun et al., 1998) to train on the well-known CIFAR-10 dataset (Krizhevsky & Hinton, 2009) with 50,000 samples. We repeat each experiment for 10 random seeds and report the average results. All the algorithms are implemented and run in Python using the PyTorch package (Paszke et al., 2019).

Nonconvex logistic regression. Nonconvex regularizers are widely used in statistical learning, e.g., to approximate sparsity or gain robustness. We consider the following nonconvex binary classification problem:
$$\min_{w \in \mathbb{R}^d}\Big\{ \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^\top w)\big) + \lambda\, r(w) \Big\},$$
where $\{(x_i, y_i)\}_{i=1}^n$ is a set of training samples, $r(w) := \frac{1}{2}\sum_{j=1}^d \frac{w_j^2}{1+w_j^2}$ is a nonconvex regularizer, and $\lambda := 0.01$ is a regularization parameter. This example was also previously used in (Wang et al., 2019; Tran-Dinh et al., 2019; Nguyen et al., 2020). We conduct the experiments on two classification datasets, w8a (49,749 samples) and ijcnn1 (91,701 samples), from LIBSVM (Chang & Lin, 2011). Each experiment is repeated 10 times with different random seeds, and the average result is reported.

5.2. Comparing SMG with Other Methods

We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). For the latter two algorithms, we use the hyper-parameter settings recommended and widely used in practice (i.e., momentum 0.9 for SGD-M, and $\beta_1 := 0.9$, $\beta_2 := 0.999$ for Adam). For our new SMG algorithm, we fix the parameter $\beta := 0.5$ since it usually performs the best in our experiments. To ensure a fair comparison, we apply the randomized reshuffling scheme to all methods. Note that shuffling strategies are broadly used in practice and have been implemented in TensorFlow, PyTorch, and Keras (Chollet et al., 2015). We tune each algorithm using a constant learning rate and report the best result obtained.

Results. Our first experiment is presented in Figure 1, where we depict the train loss (i.e., $F(w)$ in (1)) on the y-axis against the number of effective passes (i.e., the number of epochs) on the x-axis. It has been observed that SGD-M and Adam work well for machine learning tasks (see, e.g., (Ruder, 2017)). In line with this fact, we also observe from Figure 1 that our SMG algorithm and SGD are slightly worse than SGD-M and Adam at the initial stage when training a neural network. However, SMG quickly catches up to Adam and demonstrates good performance at the end of the training process.

Figure 1. The train loss produced by SMG, SGD, SGD-M, and Adam on Fashion-MNIST and CIFAR-10, respectively.

For the nonconvex logistic regression problem, our results are reported in Figure 2. On the two small datasets tested in our experiments, our algorithm performs significantly better than SGD and Adam, and slightly better than SGD with momentum.
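To make the test problem of Section 5.1 concrete, here is a small NumPy sketch of the nonconvex logistic regression objective and its full gradient, using the regularizer $r(w)$ and $\lambda = 0.01$ exactly as written above. It is meant only as an illustration of the loss, not a reproduction of the experimental pipeline.

```python
import numpy as np

LAMBDA = 0.01  # regularization parameter lambda from Section 5.1

def nonconvex_logreg_loss(w, X, y):
    """F(w) = (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + lambda * r(w),
    with the nonconvex regularizer r(w) = 0.5 * sum_j w_j^2 / (1 + w_j^2)."""
    margins = -y * (X @ w)
    data_term = np.mean(np.log1p(np.exp(margins)))
    reg_term = 0.5 * np.sum(w ** 2 / (1.0 + w ** 2))
    return data_term + LAMBDA * reg_term

def nonconvex_logreg_grad(w, X, y):
    """Full gradient of the objective above; labels y are assumed to be in {-1, +1}."""
    margins = -y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-margins))       # sigmoid of the margin
    data_grad = -(X.T @ (sigma * y)) / X.shape[0]
    reg_grad = w / (1.0 + w ** 2) ** 2           # derivative of 0.5 * w_j^2 / (1 + w_j^2)
    return data_grad + LAMBDA * reg_grad

# Usage: with features X (n x d) and labels y in {-1, +1}, e.g. w = np.zeros(X.shape[1]).
```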
Figure 2. The train loss produced by SMG, SGD, SGD-M, and Adam on the w8a and ijcnn1 datasets, respectively.

5.3. The Choice of the Hyper-parameter β

Since the hyper-parameter $\beta$ plays a critical role in the proposed SMG method, our next experiment investigates how the algorithm depends on $\beta$ while using the same constant learning rate.

Results. Our results are presented in Figure 3. We observe from this figure that in the early stage of the training process, the choice $\beta := 0.5$ gives comparably good performance relative to smaller values. This choice also results in the best train loss at the end of the training process. However, the difference is not really significant, suggesting that SMG is robust to the choice of the momentum weight $\beta$ in the range $[0.1, 0.5]$.

Figure 3. The train loss produced by SMG with different values of β on the Fashion-MNIST and CIFAR-10 datasets, respectively.

For the nonconvex logistic regression problem, the same choices of $\beta$ yield similar outcomes on the two datasets w8a and ijcnn1, as shown in Figure 4. We have also experimented with $\beta \in \{0.6, 0.7, 0.8, 0.9\}$. However, these choices do not lead to good performance, and we therefore omit them here. This finding raises an open question on the optimal choice of $\beta$. Our empirical study shows that values of $\beta$ in $[0.1, 0.5]$ work reasonably well, and $\beta := 0.5$ seems to be the best in our tests.

Figure 4. The train loss produced by SMG under different values of β on the w8a and ijcnn1 datasets, respectively.

5.4. Different Learning Rate Schemes

Our last experiment examines the effect of different learning rate schemes on the performance of our SMG method (Algorithm 1).

Results. We conduct this test using four learning rate variants: constant, diminishing, exponential decay, and cosine annealing learning rates. Our results are reported in Figures 5 and 6. From Figure 5, we observe that the cosine scheduled and diminishing learning rates converge relatively fast in the early stage. However, the exponential decay and constant learning rates make faster progress in the last epochs and tend to give the best results at the end of the training process in our neural network experiments. For the nonconvex logistic regression problem, Figure 6 shows that the cosine learning rate has certain advantages over the other choices after a few dozen epochs. The detailed settings and additional experiments for this section can be found in the Supp. Doc.

Figure 5. The train loss produced by SMG under four different learning rate schemes on the Fashion-MNIST and CIFAR-10 datasets, respectively.

Figure 6. The train loss produced by SMG using four different learning rates on the w8a and ijcnn1 datasets, respectively.

6. Conclusions and Future Work

We have developed two new shuffling gradient algorithms with momentum for solving nonconvex finite-sum minimization problems. Our Algorithm 1 is novel and can work with any shuffling strategy, while Algorithm 2 is similar to existing momentum methods and uses a single shuffling strategy. Our methods achieve the state-of-the-art $O(1/T^{2/3})$ convergence rate under standard assumptions using different learning rates and shuffling strategies. When a randomized reshuffling strategy is exploited in Algorithm 1, we can further improve our rate by a factor of $n^{1/3}$ in the data size $n$, matching the best known results for non-momentum shuffling methods.
Our numerical results also show encouraging performance of the new methods. We believe that our analysis framework can be extended to study non-asymptotic rates for some recent adaptive SGD schemes such as Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011), as well as variance-reduced methods such as SARAH (Nguyen et al., 2017), under shuffling strategies. It is also interesting to extend our framework to other problems such as minimax problems and federated learning. We will further investigate these opportunities in the future.

Acknowledgements

The authors would like to thank the reviewers for their useful comments and suggestions, which helped to improve the exposition of this paper. The work of Q. Tran-Dinh has partly been supported by the Office of Naval Research under Grant No. ONR-N00014-20-1-2088 (2020-2023).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Ahn, K., Yun, C., and Sra, S. SGD with shuffling: optimal rates without component convexity and large epoch requirements. arXiv preprint arXiv:2006.06946, 2020.

Bertsekas, D. Incremental proximal methods for large scale convex optimization. Math. Program., 129(2):163–195, 2011.

Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, volume 8, pp. 2624–2633, 2009.

Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. ICLR, 2018.

Chollet, F. et al. Keras, 2015. URL https://github.com/fchollet/keras.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Dozat, T. Incorporating Nesterov momentum into Adam. ICLR Workshop, 1:2013–2016, 2016.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.

Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, Oct 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01440-w. URL http://dx.doi.org/10.1007/s10107-019-01440-w.

Haochen, J. and Sra, S.
Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pp. 2624–2633. PMLR, 2019.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, L., Zhuang, Z., and Orabona, F. Exponential step sizes for non-convex optimization. arXiv preprint arXiv:2002.05273, 2020.

Li, X., Zhu, Z., So, A., and Lee, J. D. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts, 2017.

Meng, Q., Chen, W., Wang, Y., Ma, Z.-M., and Liu, T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. Neurocomputing, 337:46–57, 2019.

Mishchenko, K., Khaled Ragab Bayoumi, A., and Richtárik, P. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.

Nagaraj, D., Jain, P., and Netrapalli, P. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.

Nedić, A. and Bertsekas, D. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pp. 223–264. Springer, 2001.

Nedić, A. and Bertsekas, D. P. Incremental subgradient methods for nondifferentiable optimization. SIAM J. on Optim., 12(1):109–138, 2001.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 3747–3755, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 2613–2621. JMLR.org, 2017.

Nguyen, L. M., Nguyen, P. H., Richtárik, P., Scheinberg, K., Takáč, M., and van Dijk, M. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20(176):1–49, 2019. URL http://jmlr.org/papers/v20/18-759.html.

Nguyen, L. M., Tran-Dinh, Q., Phan, D. T., Nguyen, P. H., and van Dijk, M. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.
PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Ruder, S. An overview of gradient descent optimization algorithms, 2017.

Safran, I. and Shamir, O. How good is SGD with random shuffling? In Conference on Learning Theory, pp. 3250–3284. PMLR, 2020.

Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 46–54, 2016.

Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE, 2017.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.

Wang, B., Nguyen, T. M., Bertozzi, A. L., Baraniuk, R. G., and Osher, S. J. Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583, 2020.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32:2406–2416, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Ying, B., Yuan, K., and Sayed, A. H. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2(3):6, 2017.