# SMG: A Shuffling Gradient-Based Method with Momentum

Trang H. Tran (1), Lam M. Nguyen (2), Quoc Tran-Dinh (3)

(1) School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA. (2) IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA. (3) Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill, NC, USA. Correspondence to: Lam M. Nguyen.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. We combine two advanced ideas widely used in optimization for machine learning, the shuffling strategy and the momentum technique, to develop a novel shuffling gradient-based method with momentum, coined Shuffling Momentum Gradient (SMG), for nonconvex finite-sum optimization problems. While our method is inspired by momentum techniques, its update is fundamentally different from existing momentum-based methods. We establish state-of-the-art convergence rates of SMG for any shuffling strategy using either a constant or a diminishing learning rate under standard assumptions (i.e., L-smoothness and bounded variance). When the shuffling strategy is fixed, we develop another new algorithm that is similar to existing momentum methods, and prove the same convergence rates for this algorithm under the L-smoothness and bounded gradient assumptions. We demonstrate our algorithms via numerical simulations on standard datasets and compare them with existing shuffling methods. Our tests have shown encouraging performance of the new algorithms.

1. Introduction

Most training tasks in supervised learning boil down to solving the following finite-sum minimization problem:
$$\min_{w \in \mathbb{R}^d} \Big\{ F(w) := \frac{1}{n}\sum_{i=1}^n f(w; i) \Big\}, \qquad (1)$$
where $f(\cdot; i) : \mathbb{R}^d \to \mathbb{R}$ is a given smooth and possibly nonconvex function for $i \in [n] := \{1, \dots, n\}$.

Problem (1) looks simple, but covers various convex and nonconvex applications in machine learning and statistical learning, including, but not limited to, logistic regression, multi-kernel learning, conditional random fields, and neural networks. In particular, (1) covers empirical risk minimization as a special case. Solution methods for approximately solving (1) have been widely studied in the literature under different sets of assumptions. The most common approach is perhaps stochastic gradient-type (SGD) methods (Robbins & Monro, 1951; Ghadimi & Lan, 2013; Bottou et al., 2018; Nguyen et al., 2018; 2019) and their variants.

Motivation. While SGD and its variants rely on randomized sampling strategies with replacement, gradient-based methods using without-replacement strategies are often easier and faster to implement. Moreover, practical evidence (Bottou, 2009) has shown that they usually produce a faster decrease of the training loss. Randomized shuffling strategies (also viewed as sampling without replacement) allow the algorithm to use each function component $f(\cdot; i)$ exactly once per epoch, in contrast to SGD, which has only statistical convergence guarantees (e.g., in expectation or with high probability). However, the analysis of shuffling methods is often more challenging than that of SGD due to the lack of statistical independence. In the deterministic case, single permutation (also called shuffle once, or single shuffling) and incremental gradient methods can be considered as special cases of the shuffling gradient-based methods we study in this paper.
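For readers who prefer code to notation, the following is a minimal NumPy sketch of the finite-sum structure in (1): an average of $n$ component losses $f(w; i)$, each accessible through a component-gradient oracle, which is exactly the quantity a shuffling method queries once per sample in every epoch. The least-squares component and the synthetic data below are our own illustrative placeholders and are not taken from the paper.

```python
import numpy as np

def component_loss(w, X, y, i):
    """f(w; i): loss of the i-th sample (toy least-squares component, for illustration only)."""
    r = X[i] @ w - y[i]
    return 0.5 * r ** 2

def component_grad(w, X, y, i):
    """Gradient of the i-th component f(w; i)."""
    return (X[i] @ w - y[i]) * X[i]

def full_objective(w, X, y):
    """F(w) = (1/n) * sum_i f(w; i), the finite-sum objective in (1)."""
    n = X.shape[0]
    return sum(component_loss(w, X, y, i) for i in range(n)) / n

def full_gradient(w, X, y):
    """Full gradient: the average of the n component gradients."""
    n = X.shape[0]
    return sum(component_grad(w, X, y, i) for i in range(n)) / n

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
w = np.zeros(3)
print(full_objective(w, X, y), full_gradient(w, X, y))
```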
One special shuffling strategy is randomized reshuffling, which is broadly used in practice: a new random permutation is generated at each epoch. In addition, in recent years it has been shown that many gradient-based methods with a momentum update can notably boost the convergence speed both in theory and in practice (Nesterov, 2004; Dozat, 2016; Wang et al., 2020). These methods have been widely used in both convex and nonconvex settings, especially in the deep learning community. Remarkably, Nesterov's accelerated method (Nesterov, 1983) has revolutionized large-scale convex optimization over the last two decades and has also been widely exploited in nonconvex problems. The developments we have discussed here motivate us to raise the following research question: Can we combine the shuffling strategy and the momentum scheme to develop new provable gradient-based algorithms for handling (1)?

In this paper, we answer this question affirmatively by proposing a novel algorithm called Shuffling Momentum Gradient (SMG). We establish its convergence guarantees for different shuffling strategies, in particular for the randomized reshuffling strategy, and we investigate different variants of our method.

Our contribution. Our contributions in this paper can be summarized as follows.

(a) We develop a novel shuffling gradient-based method with momentum (Algorithm 1 in Section 2) for approximating a stationary point of the nonconvex minimization problem (1). Our algorithm covers any shuffling strategy, ranging from deterministic to randomized, including the incremental, single shuffling, and randomized reshuffling variants.

(b) We establish the convergence of our method in the nonconvex setting and achieve the state-of-the-art $O(1/T^{2/3})$ convergence rate under standard assumptions (i.e., the L-smoothness and bounded variance conditions), where $T$ is the number of epochs. For the randomized reshuffling strategy, we improve our convergence rate to $O\big(1/(n^{1/3}T^{2/3})\big)$.

(c) We study different strategies for selecting learning rates (LR), including constant, diminishing, exponential, and cosine scheduled learning rates. In all cases, we prove the same convergence rate for the corresponding variants without any additional assumption.

(d) When a single shuffling strategy is used, we show that a momentum strategy can be incorporated directly at each iteration of the shuffling gradient method, leading to a different variant presented in Algorithm 2. We analyze the convergence of this algorithm and achieve the same $O(1/T^{2/3})$ epoch-wise convergence rate, but under a bounded gradient assumption instead of the bounded variance assumption used for the SMG algorithm.

Our $O(1/T^{2/3})$ convergence rate is the best known so far for shuffling gradient-type methods in nonconvex optimization (Nguyen et al., 2020; Mishchenko et al., 2020). However, like (Mishchenko et al., 2020), our SMG method only requires a generalized bounded variance assumption (Assumption 1(c)), which is weaker and more standard than the bounded component gradient assumption used in existing works. Algorithm 2 uses the same set of assumptions as in (Nguyen et al., 2020) to achieve the same rate, but has a momentum update. For the randomized reshuffling strategy, our $O\big(1/(n^{1/3}T^{2/3})\big)$ convergence rate also matches the rate of the without-momentum algorithm in (Mishchenko et al., 2020). It leads to a total of $nT = O(\sqrt{n}\,\varepsilon^{-3})$ individual gradient evaluations to reach an $\varepsilon$-accurate stationary point.
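As a quick back-of-envelope check of this complexity (our own derivation, included only for the reader's convenience), requiring the $O\big(1/(n^{1/3}T^{2/3})\big)$ bound to be at most the target accuracy $\varepsilon^2$ gives:

```latex
% Back-of-envelope check: epochs T needed so that the bound is at most eps^2,
% and the resulting total number of individual gradient evaluations nT.
\frac{C}{n^{1/3} T^{2/3}} \le \varepsilon^{2}
\;\Longrightarrow\;
T \ge \frac{C^{3/2}}{\sqrt{n}\,\varepsilon^{3}}
\;\Longrightarrow\;
n T = O\!\Big(\frac{\sqrt{n}}{\varepsilon^{3}}\Big),
```

where $C$ collects the constants hidden in the rate and $nT$ counts individual component-gradient evaluations ($n$ per epoch over $T$ epochs). The same calculation with the general $O(1/T^{2/3})$ rate gives $nT = O(n\varepsilon^{-3})$, which is where the $n^{1/3}$ improvement of randomized reshuffling pays off.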
We emphasize that, in many existing momentum variants, the momentum $m^{(t)}_i$ is updated recursively at each iteration as $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$ for a given weight $\beta \in (0, 1)$. This update shows that the momentum $m^{(t)}_{i+1}$ incorporates all the past gradient terms $g^{(t)}_i, g^{(t)}_{i-1}, g^{(t)}_{i-2}, \dots, g^{(t)}_0$ with exponentially decaying weights $1, \beta, \beta^2, \dots, \beta^i$, respectively. However, in shuffling methods, the convergence guarantee is obtained epoch-wise. Based on this observation, we modify the classical momentum update for the shuffling method as shown in Algorithm 1. More specifically, the momentum term $m^{(t)}_0$ is fixed at the beginning of each epoch, and an auxiliary sequence $\{v^{(t)}_i\}$ is introduced to keep track of the gradient average, which is then used to update the momentum term for the next epoch. This modification makes Algorithm 1 fundamentally different from existing momentum-based methods, yet the new algorithm still achieves an $O(1/T^{2/3})$ epoch-wise convergence rate under standard assumptions. To the best of our knowledge, our work is the first to analyze convergence rate guarantees of shuffling-type gradient methods with momentum under standard assumptions.

Besides Algorithm 1, we also exploit recent advanced strategies for selecting learning rates, including exponential and cosine scheduled learning rates. These two strategies have shown state-of-the-art performance in practice (Smith, 2017; Loshchilov & Hutter, 2017; Li et al., 2020). Therefore, it is worth incorporating them into shuffling methods.

Related work. Let us briefly review the works most closely related to the methods studied in this paper.

Shuffling gradient-based methods. Shuffling gradient-type methods for solving (1) have been widely studied in the literature in recent years (Bottou, 2009; Gürbüzbalaban et al., 2019; Shamir, 2016; Haochen & Sra, 2019; Nguyen et al., 2020) for both convex and nonconvex settings. They were empirically investigated in a short note (Bottou, 2009) and also discussed in (Bottou, 2012). These methods have also been implemented in several software packages broadly used in machine learning, such as TensorFlow and PyTorch (Abadi et al., 2015; Paszke et al., 2019).

In the strongly convex case, shuffling methods have been extensively studied in (Ahn et al., 2020; Gürbüzbalaban et al., 2019; Haochen & Sra, 2019; Safran & Shamir, 2020; Nagaraj et al., 2019; Rajput et al., 2020; Nguyen et al., 2020; Mishchenko et al., 2020) under different assumptions. The best known convergence rate in this case is $O\big(1/(nT)^2 + 1/(nT^3)\big)$, which matches the lower bound studied in (Safran & Shamir, 2020) up to a constant factor. Most results in the convex case are for the incremental gradient variant, studied in (Nedić & Bertsekas, 2001a;b). Convergence results of shuffling methods in the general convex case are investigated in (Shamir, 2016; Mishchenko et al., 2020), where (Mishchenko et al., 2020) provides a unified approach covering different settings. The authors in (Ying et al., 2017) combine a randomized shuffling strategy with a variance reduction technique (e.g., SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013)) to develop a new variant. They show a linear convergence rate for strongly convex problems, but in terms of an energy function for which it is unclear how to convert the result into standard convergence criteria.
In the nonconvex case, (Nguyen et al., 2020) first shows an $O(1/T^{2/3})$ convergence rate for a general class of shuffling gradient methods under the L-smoothness and bounded gradient assumptions on (1). This analysis is then extended in (Mishchenko et al., 2020) to a more relaxed assumption. The authors in (Meng et al., 2019) study different distributed SGD variants with shuffling for strongly convex, general convex, and nonconvex problems. An incremental gradient method for weakly convex problems is investigated in (Li et al., 2019), where the authors show an $O(1/T^{1/2})$ convergence rate, as in standard SGD. To the best of our knowledge, the best known rate of shuffling gradient methods in the nonconvex case under standard assumptions is $O(1/T^{2/3})$, as shown in (Nguyen et al., 2020; Mishchenko et al., 2020). Our Algorithm 1 developed in this paper is a nontrivial momentum variant of the general shuffling method in (Nguyen et al., 2020), but our analysis uses the standard bounded variance assumption instead of the bounded gradient one.

Momentum-based methods. Gradient methods with momentum were studied in early works for convex problems, such as the heavy-ball, inertial, and Nesterov's accelerated gradient methods (Polyak, 1964; Nesterov, 2004). Nesterov's accelerated method is the most influential scheme and achieves an optimal convergence rate for convex problems. While momentum-based methods are not yet known to improve theoretical convergence rates in the nonconvex setting, they show significantly encouraging performance in practice (Dozat, 2016; Wang et al., 2020), especially in the deep learning community. However, the momentum strategy has not yet been exploited in shuffling methods.

Adaptive learning rate schemes. Gradient-type methods with adaptive learning rates, such as AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014), have shown state-of-the-art performance in several optimization applications. Recently, many adaptive schemes for learning rates have been proposed, such as diminishing (Nguyen et al., 2020), exponential scheduled (Li et al., 2020), and cosine scheduled (Smith, 2017; Loshchilov & Hutter, 2017) learning rates. In (Li et al., 2020), the authors analyze convergence guarantees for the exponential and cosine learning rates in SGD. These adaptive learning rates have also been empirically studied in the literature, especially in machine learning tasks, and their convergence guarantees have been investigated under certain assumptions. Although the analyses for adaptive learning rates are generally nontrivial, our theoretical results in this paper are flexible enough to cover various learning rates. However, we only exploit the diminishing, exponential scheduled, and cosine scheduled schemes for our shuffling methods with momentum. We establish that in the last two cases, our algorithm still achieves the state-of-the-art $O(T^{-2/3})$ (and possibly up to $O(n^{-1/3}T^{-2/3})$) epoch-wise rates.

Content. The rest of this paper is organized as follows. Section 2 describes our novel method, the Shuffling Momentum Gradient (Algorithm 1). Section 3 investigates its convergence rate under different shuffling strategies and different learning rates. Section 4 proposes an algorithm with a traditional momentum update for the single shuffling strategy. Section 5 presents our numerical simulations. Due to space limitations, the convergence analysis of our methods, all technical proofs, and additional experiments are deferred to the Supplementary Document (Supp. Doc.).
2. Shuffling Gradient-Based Method with Momentum

In this section, we describe our new shuffling gradient algorithm with momentum in Algorithm 1. We also compare our algorithm with existing methods and discuss its per-iteration complexity and possible modifications, e.g., mini-batches.

Algorithm 1 Shuffling Momentum Gradient (SMG)
1: Initialization: Choose $\tilde{w}_0 \in \mathbb{R}^d$ and set $\tilde{m}_0 := 0$.
2: for $t := 1, 2, \dots, T$ do
3:   Set $w^{(t)}_0 := \tilde{w}_{t-1}$; $m^{(t)}_0 := \tilde{m}_{t-1}$; and $v^{(t)}_0 := 0$;
4:   Generate an arbitrary deterministic or random permutation $\pi^{(t)}$ of $[n]$;
5:   for $i := 0, \dots, n-1$ do
6:     Query $g^{(t)}_i := \nabla f\big(w^{(t)}_i; \pi^{(t)}(i+1)\big)$;
7:     Choose $\eta^{(t)}_i := \frac{\eta_t}{n}$ and update
       $m^{(t)}_{i+1} := \beta m^{(t)}_0 + (1-\beta) g^{(t)}_i$,
       $v^{(t)}_{i+1} := v^{(t)}_i + \frac{1}{n} g^{(t)}_i$,
       $w^{(t)}_{i+1} := w^{(t)}_i - \eta^{(t)}_i m^{(t)}_{i+1}$;   (2)
8:   end for
9:   Set $\tilde{w}_t := w^{(t)}_n$ and $\tilde{m}_t := v^{(t)}_n$;
10: end for
11: Output: Choose $\hat{w}_T \in \{\tilde{w}_0, \dots, \tilde{w}_{T-1}\}$ at random with probability $\mathbb{P}[\hat{w}_T = \tilde{w}_{t-1}] = \frac{\eta_t}{\sum_{t=1}^T \eta_t}$.

Clearly, if $\beta = 0$, then Algorithm 1 reduces to the standard shuffling gradient method as in (Nguyen et al., 2020; Mishchenko et al., 2020). Since each inner iteration of our method uses one component $f(\cdot; i)$, we use $\eta^{(t)}_i = \frac{\eta_t}{n}$ in Algorithm 1 to make it consistent with one full gradient, which consists of $n$ gradient components. This form looks different from the learning rates used in previous work on SGD (Shamir, 2016; Mishchenko et al., 2020); however, it does not necessarily make our learning rate smaller. In the same order of the number of training samples $n$, our learning rate matches the one in (Mishchenko et al., 2020) as well as the state-of-the-art complexity results. The detailed learning rate $\eta_t$ used in Algorithm 1 will be specified in Section 3.

Comparison. Unlike existing momentum methods, where $m^{(t)}_i$ in (2) is updated recursively as $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$ for $\beta \in (0, 1)$, we instead fix the first term $m^{(t)}_0$ in (2) during each epoch. It is only updated at the end of each epoch by averaging all the gradient components $\{g^{(t)}_i\}_{i=0}^{n-1}$ evaluated in that epoch. To avoid storing these gradients, we introduce an auxiliary variable $v^{(t)}_i$ to keep track of the gradient average. Here, we fix the weight $\beta$ for the sake of our analysis, but it is possible to make $\beta$ adaptive.

Our new method is based on the following observations. First, while SGD generally uses an unbiased estimator of the full gradient, shuffling gradient methods often do not have such a nice property, which makes them more difficult to analyze. Due to this fact, updating the momentum at each inner iteration does not seem preferable, since it could make the estimator deviate further from the true gradient. Therefore, we consider updating the momentum term only after each epoch. Second, unlike the traditional momentum with exponentially decaying weights, our momentum term $m^{(t)}_0$ is an equally weighted average of all the past gradients in an epoch, evaluated at the different points $w^{(t-1)}_i$ of the inner loop. Based on these observations, the momentum term should aid the full gradient $\nabla F$ instead of its component gradients $\nabla f(\cdot; i)$, which leads to the update at the end of each epoch.

It is also worth noting that our algorithm is fundamentally different from variance-reduced methods. For instance, SVRG and SARAH variants require evaluating the full gradient at each snapshot point, while our method always uses a single component gradient. Hence, our algorithm does not require a full gradient evaluation at each outer iteration, and our momentum term does not require a full gradient of $f$.
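To make the epoch-wise momentum update concrete, here is a minimal NumPy sketch of Algorithm 1 under a randomized reshuffling strategy. The helper grad(w, i), the toy data, and the constant learning rate eta are placeholders of our own choosing; the sketch only illustrates the structure of the SMG update and is not a tuned implementation.

```python
import numpy as np

def smg(grad, w0, n, T, eta, beta=0.5, seed=0):
    """Sketch of Algorithm 1 (SMG): the momentum m is frozen within each epoch
    and refreshed at the epoch's end with the running gradient average v."""
    rng = np.random.default_rng(seed)
    w, m = w0.copy(), np.zeros_like(w0)
    for t in range(1, T + 1):
        v = np.zeros_like(w0)                  # v_0^{(t)} := 0
        perm = rng.permutation(n)              # randomized reshuffling; any permutation is allowed
        for i in range(n):
            g = grad(w, perm[i])               # gradient of component pi^{(t)}(i+1)
            m_new = beta * m + (1 - beta) * g  # uses m_0^{(t)}, kept fixed during the epoch
            v += g / n                         # running average of this epoch's gradients
            w -= (eta / n) * m_new             # step size eta_t / n
        m = v                                  # momentum for the next epoch
    # The analysis outputs a randomly selected epoch iterate; for simplicity we return the last one.
    return w

# Toy usage: minimize F(w) = (1/n) sum_i 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
X, y = rng.standard_normal((32, 5)), rng.standard_normal(32)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = smg(grad, np.zeros(5), n=32, T=50, eta=0.5)
```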
When the learning rate $\eta^{(t)}_i$ is fixed within each epoch as $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t > 0$, we can derive from (2) that
$$w^{(t)}_{i+1} = w^{(t)}_i - \frac{(1-\beta)\eta_t}{n}\, g^{(t)}_i + \frac{\beta \eta_t}{n(1-\beta)\eta_{t-1}}\, e_t, \quad \text{where } e_t := \tilde{w}_{t-1} - \tilde{w}_{t-2} + \beta \eta_{t-1} \tilde{m}_{t-2}.$$
Here, $e_t$ plays the role of a momentum or inertial term, but it is different from the usual momentum term. We nevertheless name Algorithm 1 the Shuffling Momentum Gradient method since it is inspired by momentum methods.

Per-iteration complexity. The per-iteration complexity of Algorithm 1 is almost the same as in standard shuffling gradient schemes; see, e.g., (Shamir, 2016). It needs to store two additional vectors $m^{(t)}_{i+1}$ and $v^{(t)}_i$, and performs two extra vector additions and three scalar-vector multiplications. Such additional costs are very mild.

Shuffling strategies. Our convergence guarantees in Section 3 hold for any permutation $\pi^{(t)}$ of $\{1, 2, \dots, n\}$, including deterministic and randomized ones. Therefore, our method covers any shuffling strategy, including the incremental, single shuffling, and randomized reshuffling variants, as special cases. We highlight that our convergence result for the randomized reshuffling variant is significantly better than the general one, as we explain in detail in Section 3.

Mini-batch. Algorithm 1 also works with mini-batches, and our theory can be slightly adapted to establish the same convergence rates for mini-batch variants. However, it remains unclear to us how mini-batches can improve the convergence rate guarantee of Algorithm 1 in the general case where the shuffling strategy is not specified.

3. Convergence Analysis

We analyze the convergence of Algorithm 1 under standard assumptions, organized as follows.

3.1. Technical Assumptions

Our analysis relies on the following assumptions.

Assumption 1. Problem (1) satisfies:
(a) (Boundedness from below) $\mathrm{dom}(F) \neq \emptyset$ and $F$ is bounded from below, i.e., $F_* := \inf_{w \in \mathbb{R}^d} F(w) > -\infty$.
(b) (L-smoothness) $f(\cdot; i)$ is L-smooth for all $i \in [n]$, i.e., there exists a universal constant $L > 0$ such that, for all $w, w' \in \mathrm{dom}(F)$,
$$\|\nabla f(w; i) - \nabla f(w'; i)\| \le L \|w - w'\|. \qquad (3)$$
(c) (Generalized bounded variance) There exist two non-negative and finite constants $\Theta$ and $\sigma$ such that for any $w \in \mathrm{dom}(F)$,
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f(w; i) - \nabla F(w)\|^2 \le \Theta \|\nabla F(w)\|^2 + \sigma^2. \qquad (4)$$

Assumption 1(a) is required by any algorithm to guarantee the well-definedness of (1). The L-smoothness condition (3) is standard in gradient-type methods for both stochastic and deterministic algorithms. From this assumption, we have for any $w, w' \in \mathrm{dom}(F)$ (see (Nesterov, 2004)):
$$F(w) \le F(w') + \langle \nabla F(w'), w - w' \rangle + \frac{L}{2}\|w - w'\|^2. \qquad (5)$$
The condition (4) in Assumption 1(c) reduces to the standard bounded variance condition when $\Theta = 0$. Therefore, (4) is more general than the bounded variance assumption often used in stochastic optimization. Unlike recent works on momentum SGD and shuffling (Chen et al., 2018; Nguyen et al., 2020), we do not impose a bounded gradient assumption on each $f(\cdot; i)$ in Algorithm 1 (see Assumption 2); that condition is stronger than (4).

3.2. Main Result 1 and Its Consequences

Our first main result is the following convergence theorem for Algorithm 1 under any shuffling strategy.

Theorem 1. Suppose that Assumption 1 holds for (1). Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with a fixed momentum weight $0 \le \beta < 1$ and an epoch learning rate $\eta^{(t)}_i := \frac{\eta_t}{n}$ for every $t \ge 1$. Assume that $\eta_0 = \eta_1$, $\eta_t \ge \eta_{t+1}$, and $0 < \eta_t \le \frac{1}{L\sqrt{K}}$ for $t \ge 1$, where $K := \max\Big\{\frac{5}{2}, \frac{9(5-3\beta)(\Theta+1)}{1-\beta}\Big\}$.
Then, it holds that
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] = \frac{1}{\sum_{t=1}^T \eta_t}\sum_{t=1}^T \eta_t\, \mathbb{E}\big[\|\nabla F(\tilde{w}_{t-1})\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\sum_{t=1}^T \eta_t} + \frac{9\sigma^2 L^2 (5-3\beta)}{1-\beta}\cdot\frac{\sum_{t=1}^T \eta_t^3}{\sum_{t=1}^T \eta_t}. \qquad (6)$$

Remark 1 (Convergence guarantee). When a deterministic permutation $\pi^{(t)}$ is used, our convergence rate holds in a deterministic sense. However, to unify our analysis, we express our convergence guarantees in expectation, where the expectation is taken over all the randomness generated by $\pi^{(t)}$ and $\hat{w}_T$ up to $T$ epochs. Since the permutations $\pi^{(t)}$ can be chosen either deterministically or randomly, our bounds in the sequel hold either deterministically or with probability 1 (w.p. 1), respectively. Without loss of generality, we write these results in expectation.

Next, we derive two direct consequences of Theorem 1 by choosing constant and diminishing learning rates.

Corollary 1 (Constant learning rate). Let us fix the number of epochs $T \ge 1$ and choose a constant learning rate $\eta_t := \frac{\gamma}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma}{T^{1/3}} \le \frac{1}{L\sqrt{K}}$ for $t \ge 1$ in Algorithm 1. Then, under the conditions of Theorem 1, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{T^{2/3}}\left[\frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{9\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$
Consequently, the convergence rate of Algorithm 1 is $O(T^{-2/3})$ in epoch.

With a constant LR as in Corollary 1, the convergence rate of Algorithm 1 is expressed as $O\Big(\frac{[F(\tilde{w}_0) - F_*] + \sigma^2}{T^{2/3}}\Big)$, which matches the best known rate in the literature (Mishchenko et al., 2020; Nguyen et al., 2020) in terms of $T$ for general shuffling-type strategies.

Corollary 2 (Diminishing learning rate). Let us choose a diminishing learning rate $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ for some $\gamma > 0$ and $\lambda \ge 0$ for $t \ge 1$ such that $\eta_1 := \frac{\gamma}{(1+\lambda)^{1/3}} \le \frac{1}{L\sqrt{K}}$ in Algorithm 1. Then, under the conditions of Theorem 1, after $T$ epochs with $T \ge 2$, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{C_1 + C_2 \log(T - 1 + \lambda)}{(T+\lambda)^{2/3} - (1+\lambda)^{2/3}},$$
where $C_1$ and $C_2$ are respectively given by
$$C_1 := \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{18\sigma^2 L^2(5-3\beta)\gamma^2}{(1-\beta)(1+\lambda)} \quad \text{and} \quad C_2 := \frac{9\sigma^2 L^2(5-3\beta)\gamma^2}{1-\beta}.$$
Consequently, the convergence rate of Algorithm 1 is $O(T^{-2/3}\log(T))$ in epoch.

The diminishing LR $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ allows us to use larger learning rate values in early epochs compared to the constant case. However, we lose a $\log(T)$ factor in the second term of our worst-case convergence rate bound.

We also derive the following two consequences of Theorem 1 for exponential and cosine scheduled learning rates.

Exponential scheduled learning rate. Given an epoch budget $T \ge 1$ and two positive constants $\gamma > 0$ and $\rho > 0$, we consider the following exponential LR; see (Li et al., 2020):
$$\eta_t := \frac{\gamma\, \alpha^t}{T^{1/3}}, \quad \text{where } \alpha := \rho^{1/T} \in (0, 1). \qquad (7)$$
The following corollary shows the convergence of Algorithm 1 using this LR without any additional assumption.

Corollary 3. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t$ is given by (7) such that $0 < \eta_t \le \frac{1}{L\sqrt{K}}$. Then, under Assumption 1, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma\rho\, T^{2/3}} + \frac{9\sigma^2 L^2(5-3\beta)\gamma^2}{(1-\beta)\rho\, T^{2/3}}.$$

Cosine annealing learning rate. Alternatively, given an epoch budget $T \ge 1$ and a positive constant $\gamma > 0$, we consider the following cosine LR for Algorithm 1:
$$\eta_t := \frac{\gamma}{T^{1/3}}\Big(1 + \cos\frac{t\pi}{T}\Big), \quad t = 1, 2, \dots, T. \qquad (8)$$
This LR is adopted from (Loshchilov & Hutter, 2017; Smith, 2017). However, unlike these works, we fix our learning rate within each epoch instead of updating it at every iteration as in (Loshchilov & Hutter, 2017; Smith, 2017).

Corollary 4. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 with $\eta^{(t)}_i := \frac{\eta_t}{n}$, where $\eta_t$ is given by (8) such that $0 < \eta_t \le \frac{1}{L\sqrt{K}}$.
Then, under Assumption 1 and for $T \ge 2$, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{T^{2/3}}\left[\frac{8\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{144\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$

The scheduled LRs (7) and (8) still preserve our best known convergence rate $O(T^{-2/3})$. Note that exponential learning rates are available in both TensorFlow and PyTorch, while cosine learning rates are also implemented in PyTorch.

3.3. Main Result 2: Randomized Reshuffling Variant

A variant of Algorithm 1 is called a randomized reshuffling variant if, at each epoch $t$, the permutation $\pi^{(t)} = \big(\pi^{(t)}(1), \dots, \pi^{(t)}(n)\big)$ is uniformly sampled at random without replacement from $\{1, \dots, n\}$. Since the randomized reshuffling strategy is extremely popular in practice, we analyze Algorithm 1 under this strategy.

Theorem 2. Suppose that Assumption 1 holds for (1). Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 1 under a randomized reshuffling strategy, a fixed momentum weight $0 \le \beta < 1$, and an epoch learning rate $\eta^{(t)}_i := \frac{\eta_t}{n}$ for every $t \ge 1$. Assume that $\eta_0 = \eta_1$, $\eta_t \ge \eta_{t+1}$, and $0 < \eta_t \le \frac{1}{L\sqrt{D}}$, where $D := \max\Big\{\frac{5}{3}, \frac{6(5-3\beta)(\Theta+n)}{n(1-\beta)}\Big\}$. Then
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] = \frac{1}{\sum_{t=1}^T \eta_t}\sum_{t=1}^T \eta_t\, \mathbb{E}\big[\|\nabla F(\tilde{w}_{t-1})\|^2\big] \le \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\sum_{t=1}^T \eta_t} + \frac{6\sigma^2(5-3\beta)L^2}{n(1-\beta)}\cdot\frac{\sum_{t=1}^T \eta_t^3}{\sum_{t=1}^T \eta_t}. \qquad (9)$$

We can derive the following two consequences.

Corollary 5 (Constant learning rate). Let us fix the number of epochs $T \ge 1$ and choose a constant learning rate $\eta_t := \frac{\gamma n^{1/3}}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma n^{1/3}}{T^{1/3}} \le \frac{1}{L\sqrt{D}}$ for $t \ge 1$ in Algorithm 1. Then, under the conditions of Theorem 2, $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big]$ is upper bounded by
$$\frac{1}{n^{1/3}T^{2/3}}\left[\frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{6\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}\right].$$
Consequently, the convergence rate of Algorithm 1 is $O(n^{-1/3}T^{-2/3})$ in epoch.

With a constant LR as in Corollary 5, the convergence rate of Algorithm 1 is improved to $O\Big(\frac{[F(\tilde{w}_0) - F_*] + \sigma^2}{n^{1/3}T^{2/3}}\Big)$, which matches the best known rate for the randomized reshuffling scheme; see, e.g., (Mishchenko et al., 2020). In this case, the total number of iterations $T_{\mathrm{tol}} := nT$ needed to obtain $\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \varepsilon^2$ is $T_{\mathrm{tol}} = O(\sqrt{n}\,\varepsilon^{-3})$. Compared to the $O(\varepsilon^{-4})$ complexity of SGD, our randomized reshuffling variant is better than SGD if $n \le O(\varepsilon^{-2})$.

Corollary 6 (Diminishing learning rate). Let us choose a diminishing learning rate $\eta_t := \frac{\gamma n^{1/3}}{(t+\lambda)^{1/3}}$ for some $\gamma > 0$ and $\lambda \ge 0$ for $t \ge 1$ such that $\eta_1 := \frac{\gamma n^{1/3}}{(1+\lambda)^{1/3}} \le \frac{1}{L\sqrt{D}}$ in Algorithm 1. Then, under the conditions of Theorem 2, after $T$ epochs with $T \ge 2$, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{C_3 + C_4 \log(T - 1 + \lambda)}{n^{1/3}\big[(T+\lambda)^{2/3} - (1+\lambda)^{2/3}\big]},$$
where $C_3$ and $C_4$ are respectively given by
$$C_3 := \frac{4\,[F(\tilde{w}_0) - F_*]}{(1-\beta)\gamma} + \frac{12\sigma^2(5-3\beta)L^2\gamma^2}{(1-\beta)(1+\lambda)} \quad \text{and} \quad C_4 := \frac{6\sigma^2(5-3\beta)L^2\gamma^2}{1-\beta}.$$
Consequently, the convergence rate of Algorithm 1 is $O(n^{-1/3}T^{-2/3}\log(T))$ in epoch.

Remark 2. Algorithm 1 under randomized reshuffling still works with the exponential and cosine scheduled learning rates, and the analysis is similar to the one for general shuffling schemes. We therefore omit it here.

4. Single Shuffling Variant

In this section, we modify the single-shuffling gradient method by directly incorporating a momentum term at each iteration. We prove for the first time that this variant still achieves a state-of-the-art convergence rate guarantee under the smoothness and bounded gradient (i.e., Assumption 2) assumptions used in existing shuffling methods. This variant, though somewhat special, also covers an incremental gradient method with momentum as a special case. Our new momentum algorithm using a single shuffling strategy for solving (1) is presented in Algorithm 2. Besides fixing the permutation $\pi$, Algorithm 2 differs from Algorithm 1 in how it updates $m^{(t)}_i$.
Here, $m^{(t)}_{i+1}$ is updated from $m^{(t)}_i$ instead of from $m^{(t)}_0$ as in Algorithm 1. In addition, we set $\tilde{m}_t := m^{(t)}_n$ instead of the epoch gradient average. Note that we can write the main update of Algorithm 2 as
$$w^{(t)}_{i+1} := w^{(t)}_i - \frac{(1-\beta)\eta_t}{n}\, g^{(t)}_i + \beta\big(w^{(t)}_i - w^{(t)}_{i-1}\big),$$
which exactly reduces to existing momentum updates.

Algorithm 2 Single Shuffling Momentum Gradient
1: Initialization: Choose $\tilde{w}_0 \in \mathbb{R}^d$ and set $\tilde{m}_0 := 0$;
2: Generate a permutation $\pi$ of $[n]$;
3: for $t := 1, 2, \dots, T$ do
4:   Set $w^{(t)}_0 := \tilde{w}_{t-1}$ and $m^{(t)}_0 := \tilde{m}_{t-1}$;
5:   for $i := 0, \dots, n-1$ do
6:     Query $g^{(t)}_i := \nabla f\big(w^{(t)}_i; \pi(i+1)\big)$;
7:     Choose $\eta^{(t)}_i := \frac{\eta_t}{n}$ and update
       $m^{(t)}_{i+1} := \beta m^{(t)}_i + (1-\beta) g^{(t)}_i$,
       $w^{(t)}_{i+1} := w^{(t)}_i - \eta^{(t)}_i m^{(t)}_{i+1}$;
8:   end for
9:   Set $\tilde{w}_t := w^{(t)}_n$ and $\tilde{m}_t := m^{(t)}_n$;
10: end for
11: Output: Choose $\hat{w}_T \in \{\tilde{w}_0, \dots, \tilde{w}_{T-1}\}$ at random with probability $\mathbb{P}[\hat{w}_T = \tilde{w}_{t-1}] = \frac{\eta_t}{\sum_{t=1}^T \eta_t}$.

Incremental gradient method with momentum. If we choose $\pi := [n]$ (the identity permutation), then we obtain the well-known incremental gradient variant, but with momentum. Hence, Algorithm 2 is still new compared to the standard incremental gradient algorithm (Bertsekas, 2011). To prove the convergence of Algorithm 2, we replace Assumption 1(c) by the following:

Assumption 2 (Bounded gradient). There exists $G > 0$ such that $\|\nabla f(w; i)\| \le G$ for all $w \in \mathrm{dom}(F)$ and $i \in [n]$.

Now, we state the convergence of Algorithm 2 in the following theorem as our third main result.

Theorem 3. Let $\{w^{(t)}_i\}_{t=1}^T$ be generated by Algorithm 2 with a LR $\eta^{(t)}_i := \frac{\eta_t}{n}$ and $0 < \eta_t \le \frac{1}{L}$ for $t \ge 1$. Then, under Assumption 1(a)-(b) and Assumption 2, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{\Delta_1}{(1-\beta^n)\sum_{t=1}^T \eta_t} + \frac{4\beta^n G^2}{1-\beta^n} + \frac{L^2 G^2 \sum_{t=1}^T \xi_t^3}{(1-\beta^n)\sum_{t=1}^T \eta_t}, \qquad (10)$$
where $\xi_t := \max(\eta_t, \eta_{t-1})$ for $t \ge 2$, $\xi_1 := \eta_1$, and $\Delta_1 := 2\,[F(\tilde{w}_0) - F_*] + \eta_1\|\nabla F(\tilde{w}_0)\|^2 + 2L\eta_1^2 G^2$.

Compared to (9) in Theorem 2, the bound (10) depends strongly on the weight $\beta$, since the term $\frac{4\beta^n G^2}{1-\beta^n}$ appears on its right-hand side. Hence, $\beta$ must be chosen in a specific form to obtain the desired convergence rate. Theorem 3 provides a key bound from which we derive concrete convergence rates in the following two corollaries.

Corollary 7 (Constant learning rate). In Algorithm 2, let us fix $T \ge 1$ and choose the parameters $\beta := \big(\frac{\nu}{T^{2/3}}\big)^{1/n}$ for some constant $\nu \ge 0$ such that $\frac{1}{1-\beta^n} \le R$ for some $R \ge 1$, and $\eta_t := \frac{\gamma}{T^{1/3}}$ for some $\gamma > 0$ such that $\frac{\gamma}{T^{1/3}} \le \frac{1}{L}$. Then, under the conditions of Theorem 3, we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{D_0}{T^{2/3}},$$
where $D_0$ is a constant depending only on $F(\tilde{w}_0) - F_*$, $\|\nabla F(\tilde{w}_0)\|$, $L$, $G$, $\gamma$, $\nu$, and $R$. Thus, the convergence rate of Algorithm 2 is $O(T^{-2/3})$.

With a constant learning rate as in Corollary 7, the convergence rate of Algorithm 2 is $O\Big(\frac{L[F(\tilde{w}_0) - F_*] + \|\nabla F(\tilde{w}_0)\|^2 + G^2}{T^{2/3}}\Big)$, which depends on $L[F(\tilde{w}_0) - F_*]$, $\|\nabla F(\tilde{w}_0)\|^2$, and $G^2$, and is slightly different from Corollary 1.

Corollary 8 (Diminishing learning rate). In Algorithm 2, let us choose the parameters $\beta := \big(\frac{\nu}{T^{2/3}}\big)^{1/n}$ for some constant $\nu \ge 0$ such that $\frac{1}{1-\beta^n} \le R$ for some $R \ge 1$, and a diminishing learning rate $\eta_t := \frac{\gamma}{(t+\lambda)^{1/3}}$ for all $t \in [T]$, for some $\gamma > 0$ and $\lambda \ge 0$ such that $\eta_1 = \frac{\gamma}{(1+\lambda)^{1/3}} \le \frac{1}{L}$. Then, under the conditions of Theorem 3, for $T \ge 2$ we have
$$\mathbb{E}\big[\|\nabla F(\hat{w}_T)\|^2\big] \le \frac{D_1 + L^2 G^2 \gamma^2 \log(T - 1 + \lambda)}{(T+\lambda)^{2/3} - (1+\lambda)^{2/3}} + \frac{4\nu G^2 R}{T^{2/3}},$$
where $D_1$ is a constant depending only on $F(\tilde{w}_0) - F_*$, $\|\nabla F(\tilde{w}_0)\|$, $L$, $G$, $\gamma$, $R$, and $\lambda$. Thus, the convergence rate of Algorithm 2 is $O\big(\frac{\log(T)}{T^{2/3}}\big)$.

Remark 3. Algorithm 2 also works with exponential and cosine scheduled LRs, and the analysis is similar to that of Algorithm 1; it is deferred to the Supp. Doc.
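For contrast with the SMG sketch in Section 2, the following is a minimal NumPy sketch of Algorithm 2: the permutation is generated once (single shuffling; the identity permutation gives the incremental variant), and the momentum follows the classical recursion at every inner step. As before, grad, eta, and beta are illustrative placeholders of our own choosing, not a definitive implementation.

```python
import numpy as np

def single_shuffling_momentum(grad, w0, n, T, eta, beta, perm=None, seed=0):
    """Sketch of Algorithm 2: a fixed permutation and the classical momentum
    recursion m <- beta * m + (1 - beta) * g applied at every inner iteration."""
    rng = np.random.default_rng(seed)
    # perm = np.arange(n) yields the incremental gradient method with momentum.
    perm = rng.permutation(n) if perm is None else perm
    w, m = w0.copy(), np.zeros_like(w0)
    for t in range(1, T + 1):
        for i in range(n):
            g = grad(w, perm[i])            # gradient of component pi(i+1)
            m = beta * m + (1 - beta) * g   # momentum updated at every iteration
            w -= (eta / n) * m              # step size eta_t / n
    return w
```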
5. Numerical Experiments

In order to examine our algorithms, we present two numerical experiments on different nonconvex problems and compare our methods with some state-of-the-art SGD-type and shuffling gradient methods.

5.1. Models and Datasets

Neural networks. We test our Shuffling Momentum Gradient (SMG) algorithm using two standard network architectures: a fully connected network (FCN) and a convolutional neural network (CNN). For the fully connected setting, we train the classic LeNet-300-100 model (LeCun et al., 1998) on the Fashion-MNIST dataset (Xiao et al., 2017) with 60,000 images. We also use the convolutional LeNet-5 architecture (LeCun et al., 1998) to train on the well-known CIFAR-10 dataset (Krizhevsky & Hinton, 2009) with 50,000 samples. We repeat each experiment for 10 random seeds and report the average results. All the algorithms are implemented and run in Python using the PyTorch package (Paszke et al., 2019).

Nonconvex logistic regression. Nonconvex regularizers are widely used in statistical learning, e.g., to approximate sparsity or gain robustness. We consider the following nonconvex binary classification problem:
$$\min_{w \in \mathbb{R}^d}\Big\{ \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^\top w)\big) + \lambda\, r(w) \Big\},$$
where $\{(x_i, y_i)\}_{i=1}^n$ is a set of training samples, $r(w) := \frac{1}{2}\sum_{j=1}^d \frac{w_j^2}{1+w_j^2}$ is a nonconvex regularizer, and $\lambda := 0.01$ is a regularization parameter. This example was also previously used in (Wang et al., 2019; Tran-Dinh et al., 2019; Nguyen et al., 2020). We conduct the experiments on two classification datasets, w8a (49,749 samples) and ijcnn1 (91,701 samples), from LIBSVM (Chang & Lin, 2011). Each experiment is repeated 10 times with different random seeds, and the average result is reported.

5.2. Comparing SMG with Other Methods

We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). For the latter two algorithms, we use the hyper-parameter settings recommended and widely used in practice (i.e., momentum 0.9 for SGD-M, and $\beta_1 := 0.9$, $\beta_2 := 0.999$ for Adam). For our new SMG algorithm, we fix the parameter $\beta := 0.5$ since it usually performs the best in our experiments. To ensure a fair comparison, we apply the randomized reshuffling scheme to all methods. Note that shuffling strategies are broadly used in practice and have been implemented in TensorFlow, PyTorch, and Keras (Chollet et al., 2015). We tune each algorithm using a constant learning rate and report the best result obtained.

Results. Our first experiment is presented in Figure 1, where we depict the train loss (i.e., $F(w)$ in (1)) on the y-axis against the number of effective passes (i.e., the number of epochs) on the x-axis. It has been observed that SGD-M and Adam work well for machine learning tasks (see, e.g., (Ruder, 2017)). In line with this fact, we also observe from Figure 1 that our SMG algorithm and SGD are slightly worse than SGD-M and Adam at the initial stage when training a neural network. However, SMG quickly catches up to Adam and demonstrates good performance at the end of the training process.

Figure 1. The train loss produced by SMG, SGD, SGD-M, and Adam on Fashion-MNIST and CIFAR-10, respectively.

For the nonconvex logistic regression problem, our results are reported in Figure 2. On the two small datasets tested in our experiments, our algorithm performs significantly better than SGD and Adam, and slightly better than SGD with momentum.
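To make the test problem of Section 5.1 concrete, here is a small NumPy sketch of the nonconvex logistic regression objective and its full gradient, using the regularizer $r(w)$ and $\lambda = 0.01$ exactly as written above. It is meant only as an illustration of the loss, not a reproduction of the experimental pipeline.

```python
import numpy as np

LAMBDA = 0.01  # regularization parameter lambda from Section 5.1

def nonconvex_logreg_loss(w, X, y):
    """F(w) = (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + lambda * r(w),
    with the nonconvex regularizer r(w) = 0.5 * sum_j w_j^2 / (1 + w_j^2)."""
    margins = -y * (X @ w)
    data_term = np.mean(np.log1p(np.exp(margins)))
    reg_term = 0.5 * np.sum(w ** 2 / (1.0 + w ** 2))
    return data_term + LAMBDA * reg_term

def nonconvex_logreg_grad(w, X, y):
    """Full gradient of the objective above; labels y are assumed to be in {-1, +1}."""
    margins = -y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-margins))       # sigmoid of the margin
    data_grad = -(X.T @ (sigma * y)) / X.shape[0]
    reg_grad = w / (1.0 + w ** 2) ** 2           # derivative of 0.5 * w_j^2 / (1 + w_j^2)
    return data_grad + LAMBDA * reg_grad

# Usage: with features X (n x d) and labels y in {-1, +1}, e.g. w = np.zeros(X.shape[1]).
```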
Figure 2. The train loss produced by SMG, SGD, SGD-M, and Adam on the w8a and ijcnn1 datasets, respectively.

5.3. The Choice of the Hyper-parameter β

Since the hyper-parameter $\beta$ plays a critical role in the proposed SMG method, our next experiment investigates how the algorithm depends on $\beta$ while using the same constant learning rate.

Results. Our results are presented in Figure 3. We observe from this figure that in the early stage of the training process, the choice $\beta := 0.5$ gives comparably good performance relative to smaller values. This choice also results in the best train loss at the end of the training process. However, the difference is not really significant, suggesting that SMG is robust to the choice of the momentum weight $\beta$ in the range $[0.1, 0.5]$.

Figure 3. The train loss produced by SMG with different values of β on the Fashion-MNIST and CIFAR-10 datasets, respectively.

For the nonconvex logistic regression problem, the same choices of $\beta$ yield similar outcomes on the two datasets w8a and ijcnn1, as shown in Figure 4. We have also experimented with $\beta \in \{0.6, 0.7, 0.8, 0.9\}$. However, these choices do not lead to good performance, and we therefore omit them here. This finding raises an open question on the optimal choice of $\beta$. Our empirical study shows that values of $\beta$ in $[0.1, 0.5]$ work reasonably well, and $\beta := 0.5$ seems to be the best in our tests.

Figure 4. The train loss produced by SMG under different values of β on the w8a and ijcnn1 datasets, respectively.

5.4. Different Learning Rate Schemes

Our last experiment examines the effect of different learning rate schemes on the performance of our SMG method (Algorithm 1).

Results. We conduct this test using four learning rate variants: constant, diminishing, exponential decay, and cosine annealing learning rates. Our results are reported in Figures 5 and 6. From Figure 5, we observe that the cosine scheduled and diminishing learning rates converge relatively fast in the early stage. However, the exponential decay and constant learning rates make faster progress in the last epochs and tend to give the best results at the end of the training process in our neural network experiments. For the nonconvex logistic regression problem, Figure 6 shows that the cosine learning rate has certain advantages over the other choices after a few dozen epochs. The detailed settings and additional experiments for this section can be found in the Supp. Doc.

Figure 5. The train loss produced by SMG under four different learning rate schemes on the Fashion-MNIST and CIFAR-10 datasets, respectively.

Figure 6. The train loss produced by SMG using four different learning rates on the w8a and ijcnn1 datasets, respectively.

6. Conclusions and Future Work

We have developed two new shuffling gradient algorithms with momentum for solving nonconvex finite-sum minimization problems. Our Algorithm 1 is novel and can work with any shuffling strategy, while Algorithm 2 is similar to existing momentum methods and uses a single shuffling strategy. Our methods achieve the state-of-the-art $O(1/T^{2/3})$ convergence rate under standard assumptions using different learning rates and shuffling strategies. When a randomized reshuffling strategy is exploited in Algorithm 1, we can further improve our rate by a factor of $n^{1/3}$ in the data size $n$, matching the best known results for non-momentum shuffling methods.
Our numerical results also show encouraging performance of the new methods. We believe that our analysis framework can be extended to study non-asymptotic rates for some recent adaptive SGD schemes such as Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011), as well as variance-reduced methods such as SARAH (Nguyen et al., 2017), under shuffling strategies. It is also interesting to extend our framework to other problems such as minimax problems and federated learning. We will further investigate these opportunities in the future.

Acknowledgements

The authors would like to thank the reviewers for their useful comments and suggestions, which helped to improve the exposition of this paper. The work of Q. Tran-Dinh has partly been supported by the Office of Naval Research under Grant No. ONR-N00014-20-1-2088 (2020-2023).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Ahn, K., Yun, C., and Sra, S. SGD with shuffling: optimal rates without component convexity and large epoch requirements. arXiv preprint arXiv:2006.06946, 2020.

Bertsekas, D. Incremental proximal methods for large scale convex optimization. Math. Program., 129(2):163–195, 2011.

Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, volume 8, pp. 2624–2633, 2009.

Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. ICLR, 2018.

Chollet, F. et al. Keras, 2015. URL https://github.com/fchollet/keras.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Dozat, T. Incorporating Nesterov momentum into Adam. ICLR Workshop, 1:2013–2016, 2016.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.

Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, Oct 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01440-w. URL http://dx.doi.org/10.1007/s10107-019-01440-w.

Haochen, J. and Sra, S.
Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pp. 2624–2633. PMLR, 2019.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, L., Zhuang, Z., and Orabona, F. Exponential step sizes for non-convex optimization. arXiv preprint arXiv:2002.05273, 2020.

Li, X., Zhu, Z., So, A., and Lee, J. D. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts, 2017.

Meng, Q., Chen, W., Wang, Y., Ma, Z.-M., and Liu, T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. Neurocomputing, 337:46–57, 2019.

Mishchenko, K., Khaled Ragab Bayoumi, A., and Richtárik, P. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.

Nagaraj, D., Jain, P., and Netrapalli, P. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.

Nedić, A. and Bertsekas, D. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pp. 223–264. Springer, 2001.

Nedić, A. and Bertsekas, D. P. Incremental subgradient methods for nondifferentiable optimization. SIAM J. on Optim., 12(1):109–138, 2001.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 3747–3755, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 2613–2621. JMLR.org, 2017.

Nguyen, L. M., Nguyen, P. H., Richtárik, P., Scheinberg, K., Takáč, M., and van Dijk, M. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20(176):1–49, 2019. URL http://jmlr.org/papers/v20/18-759.html.

Nguyen, L. M., Tran-Dinh, Q., Phan, D. T., Nguyen, P. H., and van Dijk, M. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.
PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Ruder, S. An overview of gradient descent optimization algorithms, 2017.

Safran, I. and Shamir, O. How good is SGD with random shuffling? In Conference on Learning Theory, pp. 3250–3284. PMLR, 2020.

Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 46–54, 2016.

Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE, 2017.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.

Wang, B., Nguyen, T. M., Bertozzi, A. L., Baraniuk, R. G., and Osher, S. J. Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583, 2020.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32:2406–2416, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Ying, B., Yuan, K., and Sayed, A. H. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2(3):6, 2017.