# Random Shuffling Beats SGD after Finite Epochs

Jeff HaoChen¹  Suvrit Sra²

Abstract

A long-standing problem in optimization is proving that RANDOMSHUFFLE, the without-replacement version of SGD, converges faster than (the usual) with-replacement SGD. Building upon (Gürbüzbalaban et al., 2015b), we present the first non-asymptotic results for this problem, proving that after a reasonable number of epochs RANDOMSHUFFLE converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RANDOMSHUFFLE converge to the optimal solution as $\mathcal{O}(1/T^2 + n^3/T^3)$, where $n$ is the number of components in the objective and $T$ is the number of iterations. This result implies that after $\mathcal{O}(\sqrt{n})$ epochs, RANDOMSHUFFLE is strictly better than SGD (which converges as $\mathcal{O}(1/T)$). The key step toward showing this better dependence on $T$ is the introduction of $n$ into the bound; and as our analysis shows, in general a dependence on $n$ is unavoidable without further changes. To understand how RANDOMSHUFFLE works in practice, we further explore two valuable settings: data sparsity and over-parameterization. For sparse data, RANDOMSHUFFLE has the rate $\mathcal{O}(1/T^2)$, again strictly better than SGD. Under a setting closely related to over-parameterization, RANDOMSHUFFLE is shown to converge faster than SGD after any arbitrary number of iterations. Finally, we extend the analysis of RANDOMSHUFFLE to smooth convex and some non-convex functions.

¹Institute for Interdisciplinary Information Sciences, Tsinghua University. ²Massachusetts Institute of Technology. Correspondence to: Jeff HaoChen, Suvrit Sra.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

We focus on minimization of the finite-sum

$$F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1.1)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is smooth and convex, and the sum $F$ is strongly convex. A classical approach to solving (1.1) is stochastic gradient descent (SGD). At each iteration SGD independently samples an index $i$ uniformly from $\{1,\dots,n\}$ and uses the (stochastic) gradient $\nabla f_i$ to compute its update. The stochasticity makes each iteration of SGD cheap, and the independent uniform sampling of $i$ makes $\nabla f_i$ an unbiased estimator of the full gradient $\nabla F$. These properties are central to SGD's effectiveness in large-scale machine learning, and underlie much of its theoretical analysis (see e.g., (Rakhlin et al., 2012; Bertsekas, 2011; Bottou et al., 2016; Shalev-Shwartz and Ben-David, 2014)).

However, what is actually used in practice is the without-replacement version of SGD, henceforth called RANDOMSHUFFLE. At each epoch RANDOMSHUFFLE independently samples a uniformly random permutation of the $n$ functions (some implementations shuffle the data only once at load, rather than at each epoch). Then, it performs SGD-style updates by going through the $n$ functions according to the sampled permutation. By avoiding random sampling at each iteration, RANDOMSHUFFLE can be computationally more practical (Bottou, 2012); and empirically, it is known to converge faster than SGD (Bottou, 2009). Resolving this discrepancy between the theory and practice of SGD has long been an open problem. Recently, this problem has drawn renewed attention, with the goal of better understanding RANDOMSHUFFLE. The key difficulty is that without-replacement sampling produces non-independent samples, which greatly complicates the analysis.
Two extreme-case results are known: Shamir (2016) shows that RANDOMSHUFFLE is not much worse than SGD, provided the number of epochs is not too large; while Gürbüzbalaban et al. (2015b) show that RANDOMSHUFFLE converges faster than SGD asymptotically, at the rate $\mathcal{O}(1/T^2)$. But it remains unclear what happens in between, i.e., after a (reasonable) finite number of epochs are run. This regime is the most compelling one to study, since in practice one runs neither one, nor infinitely many epochs. This background motivates the central question of our paper:

Does RANDOMSHUFFLE converge faster than SGD after a reasonable number of epochs?

We answer this question positively in this paper; our results are summarized more precisely below.

1.1. Summary of results

We follow the common practice of reporting convergence rates depending on $T$, the number of calls to the (stochastic / incremental) gradient oracle. For instance, SGD converges at the rate $\mathcal{O}(1/T)$ for solving (1.1), ignoring logarithmic terms in the bound (Rakhlin et al., 2012). Our key observation for RANDOMSHUFFLE is that one should include a dependence on $n$ in the bound (see Section 3.3). This compromise then leads to a better dependence on $T$, which in turn shows how RANDOMSHUFFLE beats SGD after a finite number of epochs. Our main contributions are the following:

- Under moderate assumptions, we establish a convergence rate of $\mathcal{O}(1/T^2 + n^3/T^3)$ for RANDOMSHUFFLE, where $n$ is the number of components in (1.1) and $T$ is the total number of iterations (Thms. 1 and 2). From the bounds we can calculate the number of epochs after which RANDOMSHUFFLE is strictly better than SGD.
- We prove that a dependence on $n$ is necessary for beating the SGD rate $\mathcal{O}(1/T)$. This tradeoff precludes the possibility of proving a rate of the type $\mathcal{O}(1/T^{1+\delta})$ for some $\delta > 0$ in the general case, and justifies our choice of introducing $n$ into the rate (Thm. 3).
- Assuming sparse data, a setting common in machine learning, we further improve the convergence rate of RANDOMSHUFFLE to $\mathcal{O}(1/T^2)$. This rate is strictly better than SGD, indicating RANDOMSHUFFLE's advantage in such cases (Thm. 4).
- We consider a setting where the variance vanishes at the global minimum, which is closely related to the notion of over-parameterization in recent literature. We show that RANDOMSHUFFLE converges faster than SGD after an arbitrary number of iterations (Thm. 5).

In Section 4, we discuss various aspects of our results in detail, including explicit comparisons to SGD, the role of condition numbers, as well as limitations. In Section 7, we extend the analysis of RANDOMSHUFFLE to a class of non-convex functions and to smooth convex functions.

1.2. Related work

Very recently, Jain et al. (2019) obtained a convergence rate of $\mathcal{O}(n/T^2)$ for RANDOMSHUFFLE in the strongly convex setting without assuming Hessian smoothness. Their result improves on ours when the number of epochs is smaller than $\kappa n$, with $\kappa$ being the condition number. Recht and Ré (2012) conjectured a remarkable matrix AM-GM inequality that underlies RANDOMSHUFFLE's superiority over SGD. While limited progress on this inequality has been reported (Israel et al., 2016; Zhang, 2014), the full conjecture remains wide open. Using the technique of transductive Rademacher complexity, Shamir (2016) shows that RANDOMSHUFFLE is not worse than SGD provided the number of iterations is not too large. Ying et al. (2018) show that for a fixed step size, RANDOMSHUFFLE converges asymptotically to a distribution closer to optimal than SGD.
Most closely related to our work, Gürbüzbalaban et al. (2015b) prove that RANDOMSHUFFLE attains an $\mathcal{O}(1/T^2)$ rate asymptotically, for large $T$. Their analysis is based on an epoch-level iteration, which wraps the bias introduced by without-replacement sampling into the error of a single epoch. However, their results are based on an asymptotic version of Chung's lemma, and are therefore asymptotic. A precise bound on the number of epochs after which their convergence rates apply is not clear. Moreover, the constants hidden in their bound are unclear and can potentially be large. Building upon a similar approach, while introducing new theory to explicitly control the constants, we establish precise non-asymptotic results. A key challenge therein was to reduce the dependence on $n$, which we overcome by a few steps of carefully constructed AM-GM inequalities.

When the functions in (1.1) are visited in a deterministic order (e.g., cyclic), the method turns into Incremental Gradient Descent (IGD), which has a long history (Bertsekas, 2011). Kohonen (1974) shows that IGD converges to a limit cycle under a constant step size and quadratic functions. Convergence to a neighborhood of optimality for more general functions is studied in several works, under the assumption that the step size is bounded away from zero (see for instance (Solodov, 1998)). With a properly diminishing step size, Nedić and Bertsekas (2001) show that an $\mathcal{O}(1/\sqrt{T})$ convergence rate in terms of distance to optimal can be achieved under strong convexity of the finite-sum. This rate is further improved in (Gürbüzbalaban et al., 2015a) to $\mathcal{O}(1/T)$ under a second-order differentiability assumption.

In practice, RANDOMSHUFFLE has been proposed as a standard heuristic (Bottou, 2012). Through numerical experiments, Bottou (2009) observes an approximately $\mathcal{O}(1/T^2)$ convergence rate for RANDOMSHUFFLE. Without-replacement sampling also improves data-access efficiency in distributed settings (Feng et al., 2012; Lee et al., 2015). The permutation-sampling idea has also been embedded into more complicated algorithms; see (De and Goldstein, 2016; Defazio et al., 2014; Shamir, 2016) for variance-reduced methods, and (Shalev-Shwartz and Zhang, 2013) for decomposition methods.

Table 1. Comparison of convergence rates of SGD and RANDOMSHUFFLE. All functions considered are strongly convex except in the Polyak-Łojasiewicz condition setting. We omit all constants from the rates (for details on constants, please see Section 4). Under the sparse setting (sparsity level $\rho$), we are not aware of specialized results for SGD.

| Algorithm | Quadratic | Lipschitz Hessian | Sparse Data | PL Condition |
|---|---|---|---|---|
| SGD | $\mathcal{O}(1/T)$ | $\mathcal{O}(1/T)$ | $\mathcal{O}(1/T)$ | $\mathcal{O}(1/T)$ |
| RANDOMSHUFFLE | $\mathcal{O}(1/T^2 + n^3/T^3)$ | $\mathcal{O}(1/T^2 + n^3/T^3)$ | $\mathcal{O}(1/T^2 + \rho^2 n^3/T^3)$ | $\mathcal{O}(1/T^2 + n^3/T^3)$ |

Finally, we note a related but separate body of work on coordinate descent, where a similar question has been studied: when does random permutation over coordinates behave well? Gürbüzbalaban et al. (2017) give two kinds of quadratic problems where cyclic coordinate descent beats the with-replacement randomized version, which is a stronger result indicating that random permutation also beats the with-replacement method. However, such a deterministic version of the algorithm suffers from a poor worst case. Indeed, Sun and Ye (2016) analyze a setting where cyclic coordinate descent can be much worse than both the with-replacement and random permutation versions.
Lee and Wright (2016) further analyze how the random permutation version of coordinate descent avoids the slow convergence of the cyclic version. Wright and Lee (2017) propose a more general class of quadratic functions where random permutation outperforms cyclic coordinate descent.

2. Background and problem setup

For problem (1.1), we assume the finite-sum function $F : \mathbb{R}^d \to \mathbb{R}$ is strongly convex, i.e.,

$$F(x) \;\ge\; F(y) + \langle \nabla F(y),\, x - y\rangle + \frac{\mu}{2}\|x - y\|^2 \quad \text{for all } x, y \in \mathbb{R}^d,$$

where $\mu > 0$ is the strong convexity parameter. Furthermore, we assume each component function is $L$-smooth, so that for $i = 1,\dots,n$ there exists a constant $L$ such that

$$\|\nabla f_i(x) - \nabla f_i(y)\| \;\le\; L\,\|x - y\|. \qquad (2.1)$$

Furthermore, we assume that the component functions are second-order differentiable with a Lipschitz continuous Hessian. We use $H_i(x)$ to denote the Hessian of function $f_i$ at $x$. Specifically, for each $i = 1,\dots,n$ and all $x, y \in \mathbb{R}^d$, there exists a constant $L_H$ such that

$$\|H_i(x) - H_i(y)\| \;\le\; L_H\,\|x - y\|. \qquad (2.2)$$

The norm $\|\cdot\|$ denotes the spectral norm for matrices and the $\ell_2$ norm for vectors. We denote the unique minimizer of $F(x)$ by $x^*$, and the index set $\{1,\dots,n\}$ by $[n]$. Complexity bounds are written as $\mathcal{O}(\cdot)$, with all logarithmic terms hidden. All other parameters that might be hidden in the complexity bounds will be clarified in the corresponding sections.

2.1. The algorithms under study: SGD and RANDOMSHUFFLE

For both SGD and RANDOMSHUFFLE, we use $\gamma$ as the step size, which is determined before the algorithms are run. The sequences generated by both methods are denoted $(x_k)_{k=0}^{T}$; here $x_0$ is the initial point and $T$ is the total number of iterations (i.e., the number of stochastic gradients used). SGD runs as follows: at each iteration $1 \le k \le T$, it picks an index $s(k)$ independently and uniformly from the index set $[n]$, and then performs the update

$$x_k = x_{k-1} - \gamma\,\nabla f_{s(k)}(x_{k-1}). \qquad \text{(SGD)}$$

In contrast, RANDOMSHUFFLE runs as follows: for each epoch $t$, it picks one permutation $\sigma_t(\cdot) : [n] \to [n]$ independently and uniformly from the set of all permutations of $[n]$. Then, it sequentially visits each of the component functions of the finite-sum (1.1) and performs the update

$$x^t_k = x^t_{k-1} - \gamma\,\nabla f_{\sigma_t(k)}\big(x^t_{k-1}\big), \qquad \text{(RANDOMSHUFFLE)}$$

for $1 \le k \le n$. Here $x^t_k = x_{(t-1)n+k}$ represents the $k$-th iterate within the $t$-th epoch. For two consecutive epochs $t$ and $t+1$, one has $x^{t+1}_0 = x^t_n$; for the initial point one has $x^1_0 = x_0$. For convenience of analysis, we always assume RANDOMSHUFFLE is run for an integer number of epochs, i.e., $T = \ell n$ for some $\ell \in \mathbb{Z}_+$. This is a reasonable assumption given that our main interest is the regime where several epochs of RANDOMSHUFFLE are run.

3. Convergence analysis

The goal of this section is to build a theoretical analysis of RANDOMSHUFFLE. Specifically, we answer the following question: when can we show RANDOMSHUFFLE to be better than SGD? We begin by analyzing quadratic functions in Section 3.1, where the analysis benefits from having a constant Hessian. Subsequently, in Section 3.2, we extend our analysis to the general (smooth) strongly convex setting. A key idea in our analysis is to make the convergence rate bounds sensitive to $n$, the number of components in the finite-sum (1.1). In Section 3.3, we discuss and justify the necessity of introducing $n$ into our convergence bound.

3.1. RANDOMSHUFFLE for quadratics

We first consider the quadratic instance of (1.1), where

$$f_i(x) = \tfrac{1}{2}\,x^\top A_i\, x + b_i^\top x, \qquad i = 1,\dots,n, \qquad (3.1)$$

where $A_i \in \mathbb{R}^{d\times d}$ is positive semi-definite and $b_i \in \mathbb{R}^d$.
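To make the two update rules and the quadratic instance (3.1) concrete, the following sketch (not from the paper; the problem data, step size, and epoch count are illustrative assumptions) runs both methods on a randomly generated quadratic finite-sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10

# Quadratic components f_i(x) = 0.5 x^T A_i x + b_i^T x, as in (3.1).
# Each A_i = M_i M_i^T / d is positive semi-definite (almost surely positive definite).
M = rng.standard_normal((n, d, d))
A = M @ M.transpose(0, 2, 1) / d
b = rng.standard_normal((n, d))

def grad(i, x):
    """Gradient of the i-th component: A_i x + b_i."""
    return A[i] @ x + b[i]

def sgd(x0, gamma, T):
    """With-replacement SGD: sample an index uniformly at every iteration."""
    x = x0.copy()
    for _ in range(T):
        x -= gamma * grad(rng.integers(n), x)
    return x

def random_shuffle(x0, gamma, epochs):
    """RANDOMSHUFFLE: draw a fresh uniform permutation each epoch, then sweep through it."""
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n):
            x -= gamma * grad(i, x)
    return x

# Minimizer of F(x) = (1/n) sum_i f_i(x): solve (mean_i A_i) x = -(mean_i b_i).
x_star = np.linalg.solve(A.mean(axis=0), -b.mean(axis=0))
x0 = np.zeros(d)
gamma, epochs = 1e-2, 20  # illustrative choices, not the step size from Theorem 1

print("SGD:           ", np.linalg.norm(sgd(x0, gamma, epochs * n) - x_star) ** 2)
print("RandomShuffle: ", np.linalg.norm(random_shuffle(x0, gamma, epochs) - x_star) ** 2)
```

Both runs use the same number of stochastic gradients; the sketch is only meant to show the structural difference between the two sampling schemes, not to reproduce any quantitative claim of the paper.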
We note that, in the analysis of strongly convex problems, the quadratic case is often the setting in which tight bounds are attained. Quadratic functions have a constant Hessian, $H_i(x) = A_i$, which eases our analysis. In particular, our bound depends on the following constants: (i) the strong convexity parameter $\mu$ and the component-wise Lipschitz constant $L$; (ii) a diameter bound $\|x - x^*\| \le D$ (i.e., any iterate $x$ remains bounded; this can be enforced by explicit projection if needed); and (iii) bounded gradients $\|\nabla f_i(x)\| \le G$ for each $f_i$ ($1 \le i \le n$) and any $x$ satisfying (ii). We omit these constants for clarity, but discuss the condition number further in Section 4. Our main result for RANDOMSHUFFLE is the following theorem (omitting logarithmic terms):

Theorem 1. With $f_i$ defined by (3.1), let the condition number of problem (1.1) be $\kappa = L/\mu$. So long as $T/\log T > 6(1+\kappa)n$, with step size $\gamma = \frac{4\log T}{\mu T}$, RANDOMSHUFFLE achieves the convergence rate

$$\mathbb{E}\big[\|x_T - x^*\|^2\big] \;\le\; \mathcal{O}\Big(\frac{1}{T^2} + \frac{n^3}{T^3}\Big).$$

We prove this theorem based on the same idea as (Gürbüzbalaban et al., 2015b), i.e., by establishing an epoch-based recursion inequality. However, we no longer treat $n$ as a constant, but rather state it explicitly in the bound and try to optimize the dependence on it. The main challenge lies in controlling the dependence on $n$, which is reduced to what appears in Theorem 1 by several steps of a carefully designed AM-GM inequality (see equations (A.8) and (A.9) in the supplementary material). We provide a proof sketch for Theorem 1 in Section 8, deferring the involved technical details to the appendix. In terms of sample complexity, Theorem 1 implies the following corollary:

Corollary 1. With $f_i$ defined by (3.1), the sample complexity for RANDOMSHUFFLE to achieve $\mathbb{E}[\|x_T - x^*\|^2] \le \epsilon$ is no more than $\mathcal{O}\big(\epsilon^{-1/2} + n\,\epsilon^{-1/3}\big)$.

3.2. RANDOMSHUFFLE for strongly convex problems

Next, we consider the more general case where each component function $f_i$ is convex and the sum $F(x) = \frac{1}{n}\sum_i f_i(x)$ is strongly convex. Surprisingly,¹ one can easily adapt the methodology of the proof of Theorem 1 to this setting. To this end, our analysis requires one further assumption: each component function is twice differentiable, and its Hessian satisfies the Lipschitz condition (2.2) with constant $L_H$. Under these assumptions, we obtain the following result:

Theorem 2. There is a constant $C$, depending only on $\mu$, $L$, $L_H$, $D$, and $G$ (through the quantities $\frac{1}{\mu^2}(L_H L D + 3 L_H G)$ and $12(1 + \frac{L}{\mu})$), such that so long as $T/\log T > Cn$, with step size $\eta = \frac{8\log T}{\mu T}$, RANDOMSHUFFLE achieves the convergence rate

$$\mathbb{E}\big[\|x_T - x^*\|^2\big] \;\le\; \mathcal{O}\Big(\frac{1}{T^2} + \frac{n^3}{T^3}\Big).$$

Except for the extra dependence on $L_H$ and a mildly different step size, this rate is essentially the same as in the quadratic case. The proof of this result can be found in the supplement. Due to the similar form, most of the consequences noted in Section 3.1 also hold in this general setting.

¹Intuitively, the change of the Hessian over the domain can raise challenges. However, our convergence rate here is quite similar to the quadratic case, with only a mild dependence on the Hessian Lipschitz constant.

3.3. Understanding the dependence on n

Since the motivation for our convergence rate analysis is to show that RANDOMSHUFFLE behaves better than SGD, we would certainly hope that our convergence bounds have a better dependence on $T$ than the $\mathcal{O}(1/T)$ bound for SGD. In an ideal situation, one might hope for a rate of the form $\mathcal{O}(1/T^{1+\delta})$ for some $\delta > 0$. One intuitive objection to this goal is evident: if we allow $T < n$, then by setting $n > T^2$, RANDOMSHUFFLE is essentially the same as SGD by the birthday paradox. Therefore, an $\mathcal{O}(1/T^{1+\delta})$ bound is unlikely to hold.
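The birthday-paradox step can be spelled out with a standard calculation (our addition, not part of the paper's text): if SGD draws $T$ indices with replacement from $[n]$ and $n > T^2$, then

$$\Pr[\text{all } T \text{ indices are distinct}] \;=\; \prod_{k=0}^{T-1}\Big(1 - \frac{k}{n}\Big) \;\ge\; 1 - \frac{T(T-1)}{2n} \;>\; \frac{1}{2},$$

and conditioned on the indices being distinct, they are distributed exactly like the first $T$ entries of a uniformly random permutation. Hence, with probability more than $1/2$, the first $T$ steps of SGD and of RANDOMSHUFFLE are statistically indistinguishable in this regime.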
However, this argument is not rigorous when we require a positive number of epochs to be run (at least one pass through all the data). To this end, we provide the following result indicating the impossibility of obtaining $\mathcal{O}(1/T^{1+\delta})$ even when $T \ge n$ is required.

Theorem 3. Suppose we are given only the information $\mu$, $L$, $G$. Under the assumption of a constant step size, no step-size choice for RANDOMSHUFFLE leads to a convergence rate of $o(1/T)$ valid for every $T \ge n$, if we do not allow $n$ to appear in the bound.

The key idea in proving Theorem 3 is to construct a special instance of problem (1.1). In particular, the following quadratic instance of (1.1) lays the foundation of our proof:²

$$f_i(x) = \begin{cases} \tfrac{1}{2}(x - b)^\top A\,(x - b), & i \text{ odd},\\ \tfrac{1}{2}(x + b)^\top A\,(x + b), & i \text{ even}. \end{cases} \qquad (3.2)$$

Here $(\cdot)^\top$ denotes the transpose of a vector, $A \in \mathbb{R}^{d\times d}$ is some positive definite matrix, and $b \in \mathbb{R}^d$ is some vector. Running RANDOMSHUFFLE on (3.2) leads to a closed-form expression for RANDOMSHUFFLE's error. Then, by setting $T = n$ (i.e., running RANDOMSHUFFLE for only one epoch) and assuming a convergence rate of $o(1/T)$, we deduce a contradiction by choosing $A$ and $b$ appropriately. The detailed proof can be found in our supplementary document. We directly have the following corollary:

²The same setting is also used for the comparison of limit cycles of randomized/deterministic incremental gradient descent methods; see for instance Example 1.5.6 of (Bertsekas, 1999).

Corollary 2. Given the information $\mu$, $L$, $G$, under the assumption $T \ge n$ and a constant step size, there is no step-size choice that leads to a convergence rate of $\mathcal{O}(1/T^{1+\delta})$ for any $\delta > 0$.

This result indicates that in order to achieve a better dependence on $T$ using constant step sizes, the bound should either: (i) depend on $n$; (ii) make some stronger assumption on $T$ being large enough (at least excluding $T = n$); or (iii) leverage a more versatile step-size schedule, which could potentially be hard to design and analyze.

Although Theorem 3 shows that one cannot hope (assuming constant step sizes) for a better dependence on $T$ for RANDOMSHUFFLE without an extra $n$ dependence, whether the current dependence on $n$ is optimal still requires further discussion. In the special case $n = T$, numerical evidence has shown that RANDOMSHUFFLE behaves at least as well as SGD. However, our bound fails to even show that RANDOMSHUFFLE converges in this setting. Therefore, it is reasonable to conjecture that a better dependence on $n$ exists. In the following sections, we improve the dependence on $n$ under a specific setting; but whether a better dependence on $n$ can be achieved in general remains open.

4. Discussion of results

We discuss our results below in more detail, including their implications, strengths, and limitations.

Comparison with SGD: It is well known that under strong convexity SGD converges at a rate of $\mathcal{O}(1/T)$ (Rakhlin et al., 2012). A direct comparison yields the following fact: RANDOMSHUFFLE is provably better than SGD after $\mathcal{O}(\sqrt{n})$ epochs. This is an acceptable number of epochs for even some of the largest data sets in the current machine learning literature. To our knowledge, this is the first result rigorously showing that RANDOMSHUFFLE behaves better than SGD within a reasonable number of epochs. To some extent, this result confirms the belief and observation that RANDOMSHUFFLE is the correct choice in practice, at least when the number of epochs is comparable to $\sqrt{n}$.
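The $\mathcal{O}(\sqrt{n})$ threshold follows from a direct comparison of the two bounds (a back-of-the-envelope check, ignoring constants and logarithmic factors):

$$\frac{n^3}{T^3} \;\le\; \frac{1}{T} \;\Longleftrightarrow\; T \ge n^{3/2} \;\Longleftrightarrow\; \frac{T}{n} \ge \sqrt{n},$$

so once the number of epochs $T/n$ exceeds order $\sqrt{n}$, the $n^3/T^3$ term falls below the SGD rate $1/T$, and the remaining $1/T^2$ term is already $o(1/T)$; beyond this point the RANDOMSHUFFLE bound is strictly smaller than the SGD bound.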
Deterministic variant: When the algorithm is run in a deterministic fashion, i.e., the functions $f_i$ are visited in a fixed order, a better convergence rate than SGD can also be achieved as $T$ becomes large. For instance, a result in (Gürbüzbalaban et al., 2015a) translates into an $\mathcal{O}(n^2/T^2)$ bound for the deterministic case. This directly implies the same bound for RANDOMSHUFFLE, since the random permutation is never worse than the worst deterministic order. But according to this bound, at least $n$ epochs are required for RANDOMSHUFFLE to achieve an error smaller than SGD, which is not a realistic number of epochs in most applications.

Comparison with GD: Another interesting viewpoint comes from comparing RANDOMSHUFFLE with Gradient Descent (GD). One limitation of our result is that we do not exhibit a regime where RANDOMSHUFFLE is better than GD. By computing the average gradient over each epoch and running exact GD on (1.1), one can obtain a convergence rate of the form $\mathcal{O}(e^{-T/n})$. This shows that our convergence rate for RANDOMSHUFFLE is worse than that of GD. This arises naturally from the epoch-based recursion (8.1) in our proof methodology, since over one epoch the sum of the gradients is only shown to be no worse than a full gradient. It is true that GD should behave better in the long term, as its dependence on $n$ is negligible, and comparing with GD is not the major goal of this paper. However, being worse than GD even when $T$ is relatively small indicates that the dependence on $n$ can probably still be improved. It may be worth investigating whether RANDOMSHUFFLE can be better than both SGD and GD in some regime; however, different techniques may be required.

Epochs required: It is also a limitation that our bound only holds after a certain number of epochs. Moreover, this number of epochs depends on $\kappa$ (e.g., $\mathcal{O}(\kappa)$ epochs for the quadratic case). This limits the interest of our result to cases where the problem is not too ill-conditioned; otherwise, such a number of epochs would be unrealistic by itself. We are currently not certain whether similar bounds can be proved when allowing $T$ to take smaller values, or even after only one epoch.

Dependence on κ: It should be noted that $\kappa$ can sometimes be large. Therefore, it may be informative to view our result in a $\kappa$-dependent form. In particular, we still assume $D$, $L$, $L_H$ are constants, but no longer $\mu$. We use the bound $G \le \max_i \|\nabla f_i(x^*)\| + DL$ and assume $\max_i \|\nabla f_i(x^*)\|$ is constant. Since $\kappa = L/\mu$, we now have $\kappa = \Theta(1/\mu)$. Our results translate into a $\kappa$-dependent sample complexity of $\mathcal{O}(\kappa n + \kappa^2\epsilon^{-1/2} + n\kappa^{4/3}\epsilon^{-1/3} + n\kappa^{3/2}\epsilon^{-1/4})$ for quadratic problems, and $\mathcal{O}(\kappa^2 n + \kappa^2\epsilon^{-1/2} + n\kappa^{4/3}\epsilon^{-1/3} + n\kappa^{3/2}\epsilon^{-1/4})$ for strongly convex ones. At first sight, the dependence on $\kappa$ in the convergence rate may seem relatively high. However, it is important to note that our sample complexity's dependence on $\kappa$ is actually better than what is known for SGD. An $\mathcal{O}\big(\frac{4G^2}{\mu^2 T}\big)$ convergence bound for SGD has long been known (Rakhlin et al., 2012), which translates into an $\mathcal{O}(\kappa^2/\epsilon)$ $\kappa$-dependent sample complexity in our notation. Although a better $\kappa$ dependence has been shown for $F(x_T) - F(x^*) < \epsilon$ (see e.g., (Hazan and Kale, 2014)), the $\kappa^2$ dependence is still the best we know of for $\mathbb{E}[\|x_T - x^*\|^2] < \epsilon$ (e.g., Nguyen et al. (2018)). Furthermore, according to (Nemirovskii et al., 1983), the lower bound for achieving $F(x_T) - F(x^*) < \epsilon$ for strongly convex $F$ using stochastic gradients is $\Omega(\kappa/\epsilon)$.
Translating this into a sample complexity for achieving $\mathbb{E}[\|x_T - x^*\|^2] < \epsilon$ is likely to introduce another factor of $\kappa$ into the bound. Therefore, it is reasonable to believe that $\mathcal{O}(\kappa^2/\epsilon)$ is the best sample complexity one can obtain for SGD (which is worse than that of RANDOMSHUFFLE) to achieve $\mathbb{E}[\|x_T - x^*\|^2] < \epsilon$.

Assumptions on bounded iterates / gradients: Although the bounded iterates / gradients assumptions may appear strong, they are actually quite reasonable: unlike SGD, RANDOMSHUFFLE with a proper step size decreases the loss function after one epoch no matter what the random permutation is. The bounded iterates / gradients assumptions can therefore be guaranteed by showing that RANDOMSHUFFLE is indeed a descent algorithm (after ensuring that the initial sublevel set is bounded).

5. Sparse functions

In the literature on large-scale machine learning, sparsity is a common feature of data. When the data are sparse, each training data point has only a few non-zero features. Under such a setting, each iteration of SGD modifies only a few coordinates of the decision variable. Some commonly occurring sparse problems include large-scale logistic regression, matrix completion, and graph cuts.

Sparse data provides a promising setting in which RANDOMSHUFFLE might be powerful. Intuitively, when data are sparse, the with-replacement sampling used by SGD is likely to miss some decision variables, while RANDOMSHUFFLE is guaranteed to update all possible decision variables in one epoch. In this section, we give theoretical results justifying this intuition. Formally, a sparse finite-sum problem takes the form

$$F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x_{e_i}),$$

where $e_i$ ($1 \le i \le n$) denotes a small subset of $\{1,\dots,d\}$ and $x_{e_i}$ denotes the entries of the vector $x$ indexed by $e_i$. Define the set $E := \{e_i : 1 \le i \le n\}$. By representing each subset $e_i \in E$ with a node, and adding an edge $(e_i, e_j)$ whenever $e_i \cap e_j \neq \emptyset$, we obtain a graph with $n$ nodes. Following the notation of (Recht et al., 2011), we consider the sparsity factor of this graph:

$$\rho \;:=\; \frac{\max_{1\le i\le n} \big|\{e_j \in E : e_i \cap e_j \neq \emptyset\}\big|}{n}. \qquad (5.1)$$

One obvious fact is $\frac{1}{n} \le \rho \le 1$. The statistic (5.1) indicates how likely it is that two subsets of indices intersect, which reflects the sparsity of the problem. For a problem with strong sparsity, we may anticipate a relatively small value of $\rho$. We summarize our result in the following theorem:

Theorem 4. Let the constant $C$ be defined as in Theorem 2. So long as $T/\log T > Cn$, with step size $\eta = \frac{8\log T}{\mu T}$, RANDOMSHUFFLE achieves the convergence rate

$$\mathbb{E}\big[\|x_T - x^*\|^2\big] \;\le\; \mathcal{O}\Big(\frac{1}{T^2} + \frac{\rho^2 n^3}{T^3}\Big).$$

Compared with Theorem 2, the bound in Theorem 4 depends on the parameter $\rho$, so we can exploit sparsity to obtain a faster convergence rate. The key to proving Theorem 4 lies in constructing a tighter bound for the error term in the main recursion (see Section 8) by including a discount due to sparsity.

Corollary 3. When $\rho = \mathcal{O}(1/n)$, with the constant $C$ defined as in Theorem 4, so long as $T/\log T > Cn$, for step size $\eta = \frac{8\log T}{\mu T}$, RANDOMSHUFFLE achieves the convergence rate $\mathbb{E}[\|x_T - x^*\|^2] \le \mathcal{O}(1/T^2)$.

As shown in the above corollary, with sparsity factor $\rho = \mathcal{O}(1/n)$, the proven convergence rate of RANDOMSHUFFLE is strictly better than the $\mathcal{O}(1/T)$ rate of SGD. This result matches the following intuition: when each dimension is touched by only a few functions, letting the algorithm visit every function avoids missing certain dimensions. For larger $\rho$, a similar speedup can be observed: in fact, so long as $\rho = o(n^{-1/2})$, the proven bound is strictly better than that of SGD. Such a result supports the use of RANDOMSHUFFLE in the sparse setting.
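As a concrete illustration of the sparsity factor (5.1), the following snippet (our example; the index sets are made up for illustration) computes $\rho$ for a small collection of index sets $e_i$:

```python
def sparsity_factor(index_sets):
    """Sparsity factor rho from (5.1): the largest fraction of components j
    whose index set e_j intersects a given e_i (the count includes j = i)."""
    n = len(index_sets)
    sets = [set(e) for e in index_sets]
    overlaps = [sum(1 for f in sets if e & f) for e in sets]
    return max(overlaps) / n

# Toy example with n = 6 components over d = 6 coordinates arranged in a ring;
# each component touches two adjacent coordinates.
E = [{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 0}]
print(sparsity_factor(E))  # each e_i meets itself and its two neighbours: rho = 3/6 = 0.5
```

For fully dense data (every $e_i = \{1,\dots,d\}$) this returns $\rho = 1$, while for fully separable data (pairwise disjoint $e_i$) it returns the minimum value $1/n$.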
6. An example where RANDOMSHUFFLE always converges faster

In this section, we study a specialized class of convex problems on which RANDOMSHUFFLE always converges faster than SGD, i.e., after any arbitrary number of iterations. We build our example around the vanishing variance property: $\nabla f_i(x^*) = 0$ for the optimal point $x^*$. Moulines and Bach (2011) show that when $F(x)$ is strongly convex, SGD converges linearly in this setting. Notably, this setting is closely related to the notion of over-parameterization in recent machine learning models such as deep neural networks, a relationship addressed in (Ma et al., 2017). One special case of this setting is when the variance of the stochastic gradients can be bounded by either the norm of the full gradient or the suboptimality of the loss. Many previous works build on such assumptions, e.g., (Tseng, 1998; Solodov, 1998; Schmidt and Roux, 2013; Vaswani et al., 2018).

We consider below a nontrivial subclass of strongly convex problems which we call valid problems. Given the constraints defined by the strong convexity and Lipschitz-continuity constants, we show that RANDOMSHUFFLE achieves a better worst-case convergence rate over all valid problems. Given $n$ pairs of constants $(\mu_1, L_1), \dots, (\mu_n, L_n)$ with $0 \le \mu_i \le L_i$, a dimension $d$, and a point $x^* \in \mathbb{R}^d$, we say a $d$-dimensional finite-sum function $F(x) = \sum_{i=1}^{n} f_i(x)$ is a valid problem if: each component $f_i(x)$ is $\mu_i$-strongly convex, the gradient of each component $f_i(x)$ is $L_i$-Lipschitz continuous, and $x^*$ minimizes all components simultaneously (which is equivalent to vanishing component-wise gradients at $x^*$). A set of valid problems $\mathcal{P}$ is characterized by $n$, $(\mu_1, L_1), \dots, (\mu_n, L_n)$, $d$, and an upper bound on the initial distance: $R \ge \|x_0 - x^*\|$. For a problem $P \in \mathcal{P}$, let the random variable $X_{\mathrm{RS}}$ be the result of running RANDOMSHUFFLE from initial point $x_0$ for $T$ iterations with step size $\gamma$ on problem $P$. Similarly, let $X_{\mathrm{SGD}}$ be the result of running SGD from initial point $x_0$ for $T$ iterations with step size $\gamma$ on problem $P$. For a fixed set $\mathcal{P}$, we can compare the worst-case convergence of RANDOMSHUFFLE and SGD within this set:

Theorem 5. Given constants $(\mu_1, L_1), \dots, (\mu_n, L_n)$ with $0 \le \mu_i \le L_i$, a dimension $d$, a point $x^* \in \mathbb{R}^d$, and an upper bound $R$ on the initial distance $\|x_0 - x^*\|$, let $\mathcal{P}$ be the corresponding set of valid problems. For any step size $\gamma \le \min_i \big\{\frac{2}{L_i + \mu_i}\big\}$ and any $T \ge 1$, we have

$$\max_{P \in \mathcal{P}} \mathbb{E}\big[\|X_{\mathrm{RS}} - x^*\|^2\big] \;\le\; \max_{P \in \mathcal{P}} \mathbb{E}\big[\|X_{\mathrm{SGD}} - x^*\|^2\big].$$

This theorem indicates that RANDOMSHUFFLE has a better worst-case convergence rate than SGD after an arbitrary number of iterations in the noted setting.³

³The inequality holds even if every component is only convex and not necessarily strongly convex, i.e., $\mu_i = 0$ for all $i$. However, comparing worst-case convergence in terms of distance to a specific $x^*$ is not meaningful in that case, since there can potentially exist more than one global minimizer.

7. Extensions

In this section, we provide some further extensions.

7.1. RANDOMSHUFFLE for nonconvex optimization

The first extension we discuss is to nonconvex finite-sum problems. In particular, we study RANDOMSHUFFLE applied to functions satisfying the Polyak-Łojasiewicz (PL) condition (such functions are also known as gradient dominated):

$$\tfrac{1}{2}\|\nabla F(x)\|^2 \;\ge\; \mu\,\big(F(x) - F^*\big), \qquad \forall x.$$

Here $\mu > 0$ is some real number, and $F^*$ is the minimal value of $F(\cdot)$. Strong convexity is a special case of this condition, with $\mu$ the strong convexity parameter. One important implication of this condition is that every stationary point is a global minimum.
However, the function $F$ can be non-convex under this condition, and the condition does not imply that $F$ has a unique minimizer. This setting was proposed and analyzed in (Polyak, 1963), where a linear convergence rate for GD was shown. Later, many other optimization methods have been proven efficient under this condition (see (Nesterov and Polyak, 2006) for second-order methods and (Reddi et al., 2016) for variance-reduced gradient methods). Notably, SGD converges at the rate $\mathcal{O}(1/T)$ in this setting (see the appendix for a proof). Assume that each component gradient $\nabla f_i$ is $L$-Lipschitz continuous and that the average function $F(x)$ satisfies the Polyak-Łojasiewicz condition with some constant $\mu$. We have the following extension of our previous results:

Theorem 6. Under the Polyak-Łojasiewicz condition, define the condition number $\kappa = L/\mu$. So long as $T/\log T > 16\kappa^2 n$, with step size $\eta = \frac{2\log T}{\mu T}$, RANDOMSHUFFLE achieves the convergence rate

$$\mathbb{E}\big[\|x_T - x^*\|^2\big] \;\le\; \mathcal{O}\Big(\frac{1}{T^2} + \frac{n^3}{T^3}\Big).$$

7.2. RANDOMSHUFFLE for convex problems

An important extension of RANDOMSHUFFLE is to the general (smooth) convex case, without assuming strong convexity. There are no previous results on the convergence rate of RANDOMSHUFFLE in this setting that show it to be faster than SGD. The only result we are aware of is by Shamir (2016), who shows that RANDOMSHUFFLE is not worse than SGD in the (smooth) convex setting. We extend our results to the general convex case, and show a convergence rate that can be faster than SGD in certain parameter regimes, albeit only in the constant factors. We take the viewpoint of gradients with errors, and denote the difference between a component gradient and the full gradient as the error: $\nabla F(x) - \nabla f_i(x) = e_i(x)$. Different assumptions bounding the error term $e_i(x)$ have been studied in the optimization literature. We assume that there is a constant $\delta$ bounding the norm of the gradient error:

$$\|e_i(x)\| \le \delta, \qquad \forall x,$$

for any index $i$ and any point $x$ in the domain. Obviously, $\delta \le 2G$, with $G$ being the gradient-norm bound as before.⁴

Theorem 7. Define $\Delta := \big\|\mathbb{E}_{i\neq j}\big[H_i(x^*)\,\nabla f_j(x^*)\big]\big\|$, with $i, j$ drawn uniformly from $[n]$ and $x^*$ an arbitrary minimizer of $F$. Let $\bar{x} = \frac{n}{T}\sum_{i=1}^{T/n} x^i_0$ be the average of the epoch ending points of RANDOMSHUFFLE. Then, with a proper step size, $F(\bar{x}) - F(x^*)$ is bounded by a leading term of order $1/\sqrt{T}$ whose constant is proportional to $D$, plus lower-order terms in $T$ whose constants involve $n$, $D$, $\delta^{1/3}$, and the quantity $\Delta + L_H L D^2 + 2 L_H D G$.

Compared to the known convergence rate of $\mathcal{O}(DG/\sqrt{T})$ for SGD, we see that RANDOMSHUFFLE and SGD share the same asymptotic rate of $\mathcal{O}(1/\sqrt{T})$. However, in certain parameter regimes, the constant in front of $1/\sqrt{T}$ for RANDOMSHUFFLE can be smaller than that of SGD.

8. Proof sketch of Theorem 1

In this section we provide a proof sketch for Theorem 1. The central idea is to establish an inequality

$$\mathbb{E}\big[\|x^{t+1}_0 - x^*\|^2\big] \;\le\; (1 - n\gamma\alpha_1)\,\|x^t_0 - x^*\|^2 + n\gamma^3\alpha_2 + n^4\gamma^4\alpha_3, \qquad (8.1)$$

where $x^t_0$ and $x^{t+1}_0$ are the beginning and final points of the $t$-th epoch, respectively. The constant $\alpha_1$ captures the speed of the linearly convergent part, while $\alpha_2$ and $\alpha_3$ together bound the error introduced by randomness. We start from the following inequality for one epoch of RANDOMSHUFFLE:

$$\|x^{t+1}_0 - x^*\|^2 \;\le\; \|x^t_0 - x^*\|^2 \underbrace{-\,2\gamma\big\langle x^t_0 - x^*,\, n\nabla F(x^t_0)\big\rangle}_{A^t_1} \underbrace{-\,2\gamma\big\langle x^t_0 - x^*,\, R^t\big\rangle}_{A^t_2} + \underbrace{2\gamma^2\big\|n\nabla F(x^t_0)\big\|^2}_{A^t_3} + \underbrace{2\gamma^2\big\|R^t\big\|^2}_{A^t_4},$$

where the random variable

$$R^t \;:=\; \sum_{i=1}^{n} \nabla f_{\sigma_t(i)}\big(x^t_{i-1}\big) \;-\; \sum_{i=1}^{n} \nabla f_{\sigma_t(i)}\big(x^t_0\big)$$

denotes the gradient error of RANDOMSHUFFLE in epoch $t$, which depends on the permutation $\sigma_t(\cdot)$. The right-hand side can be thought of as two parts: terms that behave like full gradient descent ($A^t_1$ and $A^t_3$) and terms that capture the effect of random sampling ($A^t_2$ and $A^t_4$).
The main body of our analysis involves bounding each of these terms separately.

⁴Another common assumption is that the variance of the gradient error (i.e., $\mathbb{E}[\|e_i(x)\|^2]$) is bounded. We make the stronger assumption here for the sake of a simpler analysis; due to the finite-sum structure, the two assumptions differ by at most an extra $n$-dependent factor.

A key challenge in building (8.1) is to bound $\mathbb{E}[A^t_2]$, where the expectation is over $\sigma_t(\cdot)$. It is not easy to bound this term directly by $\gamma^3 C$ for some constant $C$. Instead, by introducing second-order information and several steps of a carefully designed AM-GM inequality, we obtain the following bound for $A^t_2$:

Lemma 1. Over the randomness of the permutation, we have the inequality

$$\mathbb{E}\big[A^t_2\big] \;\le\; \frac{\gamma\mu(n-1)}{2}\,\|x^t_0 - x^*\|^2 \;+\; \gamma^2 n^2\,\big\|\nabla F(x^t_0)\big\|^2 \;+\; \gamma^3\mu^{-1} n^2 (n-1)\,\Delta^2 \;+\; 2\mu^{-1}\gamma^5 L^4 G^2 n^5,$$

where $\Delta := \|\mathbb{E}_{i\neq j}[H_i(x^*)\,\nabla f_j(x^*)]\|$ with $i, j$ drawn uniformly from $[n]$, and $x^*$ is the minimizer of the sum function. Furthermore, $\Delta \le \frac{1}{n-1} L G$.

We bound $A^t_1$ with the standard inequality (Nesterov, 2013)

$$A^t_1 \;\le\; -\frac{2n\gamma}{L+\mu}\,\big\|\nabla F(x^t_0)\big\|^2 \;-\; \frac{2n\gamma L\mu}{L+\mu}\,\|x^t_0 - x^*\|^2, \qquad (8.2)$$

where the first term is further used to absorb the $\|\nabla F(x^t_0)\|^2$ terms arising from $A^t_3$ and $A^t_2$, and the second term leads to the optimization gain $\alpha_1 > 0$. We absorb $A^t_4$ into the $\alpha_3$ term of (8.1), which completes the construction of the recursion. Finally, expanding the recursion (8.1) and substituting a proper step size leads to a bound of the form $\mathcal{O}(1/T^2 + n^3/T^3)$. The complete proof can be found in the supplement.

9. Conclusions

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that RANDOMSHUFFLE converges faster than the usual with-replacement SGD. In this paper, we provide the first non-asymptotic convergence rate analysis for RANDOMSHUFFLE. We show in particular that after $\mathcal{O}(\sqrt{n})$ epochs, RANDOMSHUFFLE behaves strictly better than SGD under strong convexity and second-order differentiability. The introduction of a dependence on $n$ into the bound plays an important role in obtaining a better dependence on $T$. We further improve the dependence on $n$ in sparse data settings, showing RANDOMSHUFFLE's advantage in such situations.

Acknowledgments

SS acknowledges support from the NSF-CAREER award (id 1846088).

References

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.

D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, 2009.

L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421-436. Springer, 2012.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.

S. De and T. Goldstein. Efficient distributed SGD with variance reduction. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 111-120, 2016.

A. Defazio, J. Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125-1133, 2014.

X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 325-336. ACM, 2012.

M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Convergence rate of incremental gradient and Newton methods. arXiv preprint arXiv:1510.08562, 2015a.
M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015b.

M. Gürbüzbalaban, A. E. Ozdaglar, P. A. Parrilo, and N. D. Vanli. When cyclic coordinate descent outperforms randomized coordinate descent. In NIPS, 2017.

E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489-2512, 2014.

A. Israel, F. Krahmer, and R. Ward. An arithmetic-geometric mean inequality for products of three matrices. Linear Algebra and its Applications, 488:1-12, 2016.

P. Jain, D. Nagaraj, and P. Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. arXiv preprint arXiv:1903.01463, 2019.

T. Kohonen. An adaptive associative memory principle. IEEE Transactions on Computers, 100(4):444-445, 1974.

C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320, 2016.

J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. arXiv preprint arXiv:1712.06559, 2017.

E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451-459, 2011.

A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223-264. Springer, 2001.

A. Nemirovskii, D. B. Yudin, and E. R. Dawson. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. arXiv preprint arXiv:1802.03801, 2018.

B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864-878, 1963.

A. Rakhlin, O. Shamir, K. Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML. Citeseer, 2012.

B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. arXiv preprint arXiv:1202.4184, 2012.

B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.

S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.

M. Schmidt and N. L. Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46-54, 2016.

M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23-35, 1998.

R. Sun and Y. Ye. Worst-case complexity of cyclic coordinate descent: O(n²) gap with randomized version. arXiv preprint arXiv:1604.07130, 2016.

P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506-531, 1998.

S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.

S. J. Wright and C.-P. Lee. Analyzing random permutations for cyclic coordinate descent. arXiv preprint arXiv:1706.00908, 2017.

B. Ying, K. Yuan, S. Vlaski, and A. H. Sayed. Stochastic learning under random reshuffling. arXiv preprint arXiv:1803.07964, 2018.

T. Zhang. A note on the non-commutative arithmetic-geometric mean inequality. arXiv:1411.5058, 2014.