# Random Reshuffling: Simple Analysis with Vast Improvements

Konstantin Mishchenko (KAUST, Thuwal, Saudi Arabia) · Ahmed Khaled (Cairo University, Giza, Egypt) · Peter Richtárik (KAUST, Thuwal, Saudi Arabia)

Abstract

Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $\kappa^2$ to $\kappa$ (resp. from $\kappa$ to $\sqrt{\kappa}$) and, in addition, show that RR has a different type of variance. We argue through theory and experiments that the new variance type gives an additional justification of the superior performance of RR. To go beyond strong convexity, we present several results for non-strongly convex and non-convex objectives. We show that in all cases, our theory improves upon existing literature. Finally, we prove fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once, at the beginning of the optimization process. Our theory for strongly convex objectives tightly matches the known lower bounds for both RR and SO and substantiates the common practical heuristic of shuffling once or only a few times. As a byproduct of our analysis, we also get new results for the Incremental Gradient algorithm (IG), which does not shuffle the data at all.

1 Introduction

We study the finite-sum minimization problem

$$\min_{x \in \mathbb{R}^d} \left[ f(x) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right], \qquad (1)$$

where each $f_i \colon \mathbb{R}^d \to \mathbb{R}$ is differentiable and smooth, and we are particularly interested in the big data machine learning setting where the number of functions n is large. Thanks to their scalability and low memory requirements, first-order methods are especially popular in this setting (Bottou et al., 2018). Stochastic first-order algorithms in particular have attracted a lot of attention in the machine learning community and are often used in combination with various practical heuristics. Explaining these heuristics may lead to further development of stable and efficient training algorithms. In this work, we aim at a better and sharper theoretical explanation of one intriguingly simple but notoriously elusive heuristic: data permutation/shuffling.

1.1 Data permutation

In particular, the goal of our paper is to obtain a deeper theoretical understanding of methods for solving (1) which rely on random or deterministic permutation/shuffling of the data {1, 2, . . . , n} and perform incremental gradient updates following the permuted order. We study three methods which belong to this class, described next.

An immensely popular but theoretically elusive method belonging to the class of data permutation methods is the Random Reshuffling (RR) algorithm (see Algorithm 1). This is the method we pay most attention to in this work, as reflected in the title. In each epoch t of RR, we sample indices $\pi_0, \pi_1, \ldots, \pi_{n-1}$ without replacement from $\{1, 2, \ldots, n\}$, i.e.,
$\{\pi_0, \pi_1, \ldots, \pi_{n-1}\}$ is a random permutation of the set $\{1, 2, \ldots, n\}$, and proceed with n iterates of the form
$$x_t^{i+1} = x_t^i - \gamma \nabla f_{\pi_i}(x_t^i),$$
where γ > 0 is a stepsize. We then set $x_{t+1} = x_t^n$ and repeat the process for a total of T epochs. Notice that in RR, a new permutation/shuffling is generated at the beginning of each epoch, which is why the term reshuffling is used.

Furthermore, we consider the Shuffle-Once (SO) algorithm, which is identical to RR with the exception that it shuffles the dataset only once, at the very beginning, and then reuses this random permutation in all subsequent epochs (see Algorithm 2). Our results for SO follow as corollaries of the tools we developed in order to conduct a sharp analysis of RR.

Finally, we also consider the Incremental Gradient (IG) algorithm, which is identical to SO, with the exception that the initial permutation is not random but deterministic. Hence, IG performs incremental gradient steps through the data in a cyclic fashion. The ordering could be arbitrary, e.g., it could be selected implicitly by the order in which the data comes, or chosen adversarially. Again, our results for IG follow as a byproduct of our efforts to understand RR. A compact code sketch covering all three methods is given at the end of Section 1.2.

Algorithm 1 Random Reshuffling (RR)
Input: Stepsize γ > 0, initial vector $x_0 = x_0^0 \in \mathbb{R}^d$, number of epochs T
1: for epochs t = 0, 1, . . . , T − 1 do
2:     Sample a permutation $\pi_0, \pi_1, \ldots, \pi_{n-1}$ of {1, 2, . . . , n}
3:     for i = 0, 1, . . . , n − 1 do
4:         $x_t^{i+1} = x_t^i - \gamma \nabla f_{\pi_i}(x_t^i)$
5:     $x_{t+1} = x_t^n$

Algorithm 2 Shuffle-Once (SO)
Input: Stepsize γ > 0, initial vector $x_0 = x_0^0 \in \mathbb{R}^d$, number of epochs T
1: Sample a permutation $\pi_0, \pi_1, \ldots, \pi_{n-1}$ of {1, 2, . . . , n}
2: for epochs t = 0, 1, . . . , T − 1 do
3:     for i = 0, 1, . . . , n − 1 do
4:         $x_t^{i+1} = x_t^i - \gamma \nabla f_{\pi_i}(x_t^i)$
5:     $x_{t+1} = x_t^n$

1.2 Brief literature review

RR is usually contrasted with its better-studied sibling Stochastic Gradient Descent (SGD), in which each $\pi_i$ is sampled uniformly with replacement from {1, 2, . . . , n}. RR often converges faster than SGD on many practical problems (Bottou, 2009; Recht and Ré, 2013), is more friendly to cache locality (Bengio, 2012), and is in fact standard in deep learning (Sun, 2020). The convergence properties of SGD are well understood, with tightly matching lower and upper bounds in many settings (Rakhlin et al., 2012; Drori and Shamir, 2019; Nguyen et al., 2019).

Sampling without replacement allows RR to leverage the finite-sum structure of (1) by ensuring that each function contributes to the solution once per epoch. On the other hand, it also introduces a significant complication: the steps are now biased. Indeed, in any iteration i > 0 within an epoch, we face the challenge of not having (conditionally) unbiased gradients, since $\mathbb{E}\left[\nabla f_{\pi_i}(x_t^i) \mid x_t^i\right] \neq \nabla f(x_t^i)$. This bias implies that individual iterations do not necessarily approximate a full gradient descent step. Hence, in order to obtain meaningful convergence rates for RR, it is necessary to resort to more involved proof techniques.

In recent work, various convergence rates have been established for RR. However, a satisfactory, let alone complete, understanding of the algorithm's convergence remains elusive. For instance, the early line of attack pioneered by Recht and Ré (2012) seems to have hit a wall, as their noncommutative arithmetic-geometric mean conjecture is not true (Lai and Lim, 2020).
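To make the distinction between RR, SO, and IG concrete, here is a minimal NumPy sketch of all three methods in one routine. This is our own illustration rather than the authors' code: the gradient oracle `grads` and all other names are placeholder assumptions.

```python
import numpy as np

def shuffling_gd(grads, x0, gamma, n_epochs, mode="RR", seed=0):
    """Run RR, SO, or IG on f(x) = (1/n) * sum_i f_i(x).

    grads: list of n callables; grads[i](x) returns the gradient of f_i at x.
    mode:  "RR" reshuffles every epoch (Algorithm 1), "SO" shuffles once
           (Algorithm 2), "IG" keeps the deterministic order 0, 1, ..., n-1.
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    perm = np.arange(n)                  # IG: fixed, possibly adversarial ordering
    if mode == "SO":
        perm = rng.permutation(n)        # SO: one random permutation, reused forever
    x = np.array(x0, dtype=float)
    for t in range(n_epochs):
        if mode == "RR":
            perm = rng.permutation(n)    # RR: a fresh permutation at every epoch
        for i in perm:                   # n incremental steps within the epoch
            x = x - gamma * grads[i](x)  # x_t^{i+1} = x_t^i - gamma * grad f_{pi_i}(x_t^i)
    return x
```

The only difference between the three variants is when, if ever, `perm` is resampled; everything else, including the per-epoch cost of n individual gradient evaluations, is identical.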
The situation is even more pronounced with the SO method: Safran and Shamir (2020) point out that there are no convergence results specific to this method, and the only convergence rates for SO follow by applying the worst-case bounds for IG. Rajput et al. (2020) state that a common practical heuristic is to use methods like SO that do not reshuffle the data every epoch. Indeed, they add that current theoretical bounds are insufficient to explain this phenomenon, and that "a new theoretical breakthrough may be required to tackle it".

IG has a long history owing to its success in training neural networks (Luo, 1991; Grippo, 1994), and its asymptotic convergence was established early (Mangasarian and Solodov, 1994; Bertsekas and Tsitsiklis, 2000). Several rates for the non-smooth and smooth cases were established by Nedić and Bertsekas (2001); Li et al. (2019); Gürbüzbalaban et al. (2019a); Ying et al. (2019) and Nguyen et al. (2020). Using IG poses the challenge of choosing a specific permutation for cycling through the iterates, which Nedić and Bertsekas (2001) note to be difficult. Bertsekas (2011) gives an example that highlights the susceptibility of IG to bad orderings compared to RR. Yet, thanks to Gürbüzbalaban et al. (2019b) and Haochen and Sra (2019), RR is known to improve upon both SGD and IG for twice-smooth objectives. Nagaraj et al. (2019) also study the convergence of RR for smooth objectives, and Safran and Shamir (2020); Rajput et al. (2020) give lower bounds for RR and related methods.

2 Contributions

In this work, we study the convergence behavior of the data-permutation methods RR, SO and IG. While existing proof techniques succeed in obtaining insightful bounds for RR and IG, they fail to fully capitalize on the intrinsic power that reshuffling and shuffling offer, and are not applicable to SO at all.¹ Our proof techniques are novel, simple, and more insightful, and lead to improved convergence results, all under weaker assumptions on the objectives than in prior work.

2.1 New and improved convergence rates for RR, SO and IG

In Section 3, we analyze the RR and SO methods and present novel convergence rates for strongly convex, convex, and non-convex smooth objectives. Our results for RR are summarized in Table 1.

Strongly convex case. If each $f_i$ is strongly convex, we introduce a new proof technique for studying the convergence of RR/SO that allows us to obtain a better dependence on problem constants, such as the number of functions n and the condition number κ, compared to prior work (see Table 1). Key to our results is a new notion of variance specific to RR/SO (see Definition 2), which we argue explains the superior convergence of RR/SO compared to SGD in many practical scenarios. Our result for SO tightly matches the lower bound of Safran and Shamir (2020). We prove similar results in the more general setting where each $f_i$ is convex and f is strongly convex (see Theorem 2), but in this case we are forced to use smaller stepsizes.

Convex case. For convex but not necessarily strongly convex objectives $f_i$, we give the first result showing that RR/SO can provably achieve better convergence than SGD for a large enough number of iterations. This holds even when comparing against results that assume second-order smoothness, like the result of Haochen and Sra (2019).

Non-convex case. For non-convex objectives $f_i$, we obtain for RR a much better dependence on the number of functions n compared to the prior work of Nguyen et al. (2020).
Furthermore, in the appendix we formulate and prove convergence results for IG for strongly convex, convex, and non-convex objectives as well. The bounds are worse than those for RR by a factor of n in the noise/variance term, as IG does not benefit from randomization. Our result for strongly convex objectives tightly matches the lower bound of Safran and Shamir (2020) up to an extra iteration and logarithmic factors, and is the first result to tightly match this lower bound.

¹ As we have mentioned before, the best known bounds for SO are those which apply to IG also, which means that the randomness inherent in SO is wholly ignored.

Table 1: Number of individual gradient evaluations needed by RR to reach an ε-accurate solution (defined in Section 3). Logarithmic factors and constants that are not related to the assumptions are ignored. For non-convex objectives, A and B are the constants given by Assumption 2.

| Citation | N.L.(1) | U.V.(2) | µ-Strongly Convex | Non-Strongly Convex | Non-Convex |
|---|---|---|---|---|---|
| Ying et al. (2019) | ✓ | ✗ | $\frac{\kappa^2 n\, \sigma_\ast^2}{\mu \varepsilon}$ | — | — |
| Nagaraj et al. (2019) | ✗ | ✗ | $\kappa^2 n + \frac{\kappa \sqrt{n}\, G}{\mu \sqrt{\varepsilon}}$ | $\frac{G^2 D^2}{\varepsilon^2}$ (3) | — |
| Nguyen et al. (2020) | ✓ | ✗ | $\frac{\kappa^2 n}{\mu \varepsilon} + \frac{\kappa^2 n\, \sigma_\ast}{\mu \sqrt{\varepsilon}}$ (4) | — | $\frac{n}{\varepsilon^3}$ |
| Ahn and Sra (2020) | ✗ | ✗ | $\frac{\kappa^\alpha}{\varepsilon^{1/\alpha}} + \frac{\kappa \sqrt{n}\, G\, \alpha^{3/2}}{\mu \sqrt{\varepsilon}}$ (5) | — | — |
| Ahn et al. (2020) | ✓ | ✓ | $\kappa n + \frac{\sqrt{n}}{\mu \sqrt{\varepsilon}} + \frac{\kappa \sqrt{n}\, G_0}{\mu \sqrt{\varepsilon}}$ (6) | — | — |
| This work | ✓ | ✓ | $\kappa n + \frac{\sqrt{\kappa n}\, \sigma_\ast}{\mu \sqrt{\varepsilon}}$ (7) | $\frac{\sqrt{L n}\, \sigma_\ast}{\varepsilon^{3/2}}$ | $\frac{L n}{\varepsilon^2} + \frac{L \sqrt{n}\,(B + \sqrt{A})}{\varepsilon^3}$ |

(1) Support for non-Lipschitz functions (N.L.): proofs without assuming that $\max_{i=1,\ldots,n} \|\nabla f_i(x)\| \le G$ for all $x \in \mathbb{R}^d$ and some G > 0. Note that $\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x_\ast)\|^2 \overset{\text{def}}{=} \sigma_\ast^2 \le G^2$ and $B^2 \le G^2$.
(2) Unbounded variance (U.V.): there may be no constant σ such that Assumption 2 holds with A = 0 and B = σ. Note that when the individual gradients are bounded, the variance is automatically bounded too.
(3) Nagaraj et al. (2019) require, for non-strongly convex functions, projecting at each iteration onto a bounded convex set of diameter D. We study the unconstrained problem.
(4) For strongly convex objectives, Nguyen et al. (2020) bound $f(x) - f(x_\ast)$ rather than squared distances, hence we use strong convexity to translate their bound into a bound on $\|x - x_\ast\|^2$.
(5) The constant α > 2 is a parameter to be specified in the stepsize used by Ahn and Sra (2020). Their full bound has several extra terms, but we include only the most relevant ones.
(6) The result of Ahn et al. (2020) holds when f satisfies the Polyak-Łojasiewicz inequality, a generalization of strong convexity. We nevertheless specialize it to strong convexity for our comparison. The constant $G_0$ is defined as $G_0 \overset{\text{def}}{=} \sup_{x \colon f(x) \le f(x_0)} \max_{i \in [n]} \|\nabla f_i(x)\|$. Note that $\sigma_\ast \le G_0$. We show a better complexity for PL functions under bounded variance in Theorem 4.
(7) This result is the first to show that RR and SO work with any $\gamma \le \frac{1}{L}$, but it asks for each $f_i$ to be strongly convex. The second result assumes that only f is strongly convex.

2.2 More general assumptions on the function class

Previous non-asymptotic convergence analyses of RR either obtain worse bounds that also apply to IG, e.g., (Ying et al., 2019; Nguyen et al., 2020), or depend crucially on the assumption that each $f_i$ is Lipschitz (Nagaraj et al., 2019; Haochen and Sra, 2019; Ahn and Sra, 2020). Unfortunately, requiring each $f_i$ to be Lipschitz contradicts strong convexity (Nguyen et al., 2018), and is furthermore not satisfied in least squares regression, matrix factorization, or for neural networks with smooth activations. In contrast, our work is the first to show how to leverage randomization to obtain better rates for RR without assuming each $f_i$ to be Lipschitz. In concurrent work, Ahn et al.
(2020) also obtain a result for non-convex objectives satisfying the Polyak-Łojasiewicz inequality, a generalization of strong convexity. Their result holds without assuming bounded gradients or bounded variance, but unfortunately with a worse dependence on κ and n when specialized to µ-strongly convex functions.

Strongly convex and convex case. For strongly convex and convex objectives, we do not require any assumptions on the functions beyond smoothness and convexity.

Non-convex case. For non-convex objectives, we obtain our results under a significantly more general assumption than the bounded gradients assumption employed in prior work. Our assumption is also provably satisfied whenever each function $f_i$ is lower bounded, and hence is not only more general but also more realistic.

3 Convergence theory

We derive results for strongly convex, convex, as well as non-convex objectives. To compare the performance of first-order methods, we define an ε-accurate solution as a point $\hat{x} \in \mathbb{R}^d$ that satisfies (in expectation if $\hat{x}$ is random) $\|\nabla f(\hat{x})\| \le \varepsilon$, or $\|\hat{x} - x_\ast\|^2 \le \varepsilon$, or $f(\hat{x}) - f(x_\ast) \le \varepsilon$ for non-convex, strongly convex, and non-strongly convex objectives, respectively, where $x_\ast$ is assumed to be a minimizer of f if f is convex. We then measure the performance of first-order methods by the number of individual gradients $\nabla f_i(\cdot)$ they access to reach an ε-accurate solution.

Our first assumption is that the objective is bounded from below and smooth. This assumption is used in all of our results and is widely used in the literature.

Assumption 1. The objective f and the individual losses $f_1, \ldots, f_n$ are all L-smooth, i.e., their gradients are L-Lipschitz. Further, f is lower bounded by some $f_\ast \in \mathbb{R}$. If f is convex, we also assume the existence of a minimizer $x_\ast \in \mathbb{R}^d$.

Assumption 1 is necessary in order to obtain better convergence rates for RR compared to SGD, since without smoothness the SGD rate is optimal and cannot be improved (Nagaraj et al., 2019). The following quantity is key to our analysis and serves as an asymmetric distance between two points, measured in terms of the functions.

Definition 1. For any i, the quantity
$$D_{f_i}(x, y) \overset{\text{def}}{=} f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle$$
is the Bregman divergence between x and y associated with $f_i$.

It is well known that if $f_i$ is L-smooth and µ-strongly convex, then for all $x, y \in \mathbb{R}^d$
$$\frac{\mu}{2} \|x - y\|^2 \le D_{f_i}(x, y) \le \frac{L}{2} \|x - y\|^2, \qquad (2)$$
so each Bregman divergence is closely related to the Euclidean distance. Moreover, the difference between the gradients of a convex and L-smooth $f_i$ is related to its Bregman divergence by
$$\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le 2L\, D_{f_i}(x, y). \qquad (3)$$

3.1 Main result: strongly convex objectives

Before we proceed to the formal statement of our main result, we need to present the central finding of our work. The analysis of many stochastic methods, including SGD, relies on the fact that the iterates converge to $x_\ast$ up to some noise. This is exactly where we part ways with the standard analysis techniques, since, it turns out, the intermediate iterates of shuffling algorithms converge to some other points. Given a permutation π, these real limit points are defined below:
$$x_\ast^i \overset{\text{def}}{=} x_\ast - \gamma \sum_{j=0}^{i-1} \nabla f_{\pi_j}(x_\ast), \qquad i = 1, \ldots, n - 1. \qquad (4)$$
In fact, it is predicted by our theory and later validated by our experiments that within an epoch the iterates move away from $x_\ast$, and only closer to the end of the epoch do they make a sudden comeback to $x_\ast$. The second reason the vectors introduced in Equation (4) are so pivotal is that they allow us to define a new notion of variance.
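Before defining this new variance, here is a small self-contained sketch that illustrates Equation (4) numerically. It is our own illustration, not part of the paper: the quadratic losses, the stepsize choice, and all names are assumptions made for the example. It computes the limit points $x_\ast^i$ for a random permutation and shows that one RR epoch started at $x_\ast$ tracks the points $x_\ast^i$ much more closely than it tracks $x_\ast$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
A = np.eye(d) + 0.3 * rng.standard_normal((n, d, d))  # data of f_i(x) = 0.5*||A_i x - b_i||^2
b = rng.standard_normal((n, d))

def grad(i, x):
    # Gradient of the strongly convex quadratic f_i(x) = 0.5*||A_i x - b_i||^2.
    return A[i].T @ (A[i] @ x - b[i])

# Minimizer x_* of f = (1/n) sum_i f_i, via the normal equations of the average.
H = sum(A[i].T @ A[i] for i in range(n))
rhs = sum(A[i].T @ b[i] for i in range(n))
x_star = np.linalg.solve(H, rhs)

L = max(np.linalg.norm(A[i].T @ A[i], 2) for i in range(n))  # smoothness constant
gamma = 1.0 / (L * n)                                        # small stepsize regime
perm = rng.permutation(n)

# Limit points of Equation (4): x_*^i = x_* - gamma * sum_{j<i} grad f_{pi_j}(x_*).
x_lim = [x_star]
for j in perm[:-1]:
    x_lim.append(x_lim[-1] - gamma * grad(j, x_star))

# One RR epoch started at x_*: the iterates drift away from x_* mid-epoch,
# yet they stay far closer to the limit points x_*^i.
x = x_star.copy()
for i, p in enumerate(perm):
    print(f"i={i:2d}  ||x - x_*|| = {np.linalg.norm(x - x_star):.2e}"
          f"  ||x - x_*^i|| = {np.linalg.norm(x - x_lim[i]):.2e}")
    x = x - gamma * grad(p, x)
```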
Without this new notion of variance, there seems to be no explanation for why RR sometimes overtakes SGD from the very beginning of the optimization process. We define it below.

Definition 2 (Shuffling variance). Given a stepsize γ > 0 and a random permutation π of {1, 2, . . . , n}, define $x_\ast^i$ as in (4). Then, the shuffling variance is given by
$$\sigma_{\mathrm{Shuffle}}^2 \overset{\text{def}}{=} \max_{i=1,\ldots,n-1} \left[ \frac{1}{\gamma}\, \mathbb{E}\left[ D_{f_{\pi_i}}(x_\ast^i, x_\ast) \right] \right], \qquad (5)$$
where the expectation is taken with respect to the randomness in the permutation π.

Naturally, $\sigma_{\mathrm{Shuffle}}^2$ depends on the functions $f_1, \ldots, f_n$, but, unlike the variance appearing in the analysis of SGD, it also depends in a nontrivial manner on the stepsize γ. The easiest way to understand the new notion is to compare it to the standard definition of variance used in the analysis of SGD. We argue that $\sigma_{\mathrm{Shuffle}}^2$ is the natural counterpart of the standard variance used in SGD. We relate the two by the following upper and lower bounds:

Proposition 1. Suppose that each of $f_1, f_2, \ldots, f_n$ is µ-strongly convex and L-smooth. Then
$$\frac{\gamma \mu n}{8}\, \sigma_\ast^2 \;\le\; \sigma_{\mathrm{Shuffle}}^2 \;\le\; \frac{\gamma L n}{4}\, \sigma_\ast^2, \qquad \text{where } \sigma_\ast^2 \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x_\ast)\|^2.$$

In practice, $\sigma_{\mathrm{Shuffle}}^2$ may be much closer to the lower bound than to the upper bound; see Section 4. This leads to a dramatic difference in performance and provides additional evidence of the superiority of RR over SGD. The next theorem states how exactly the convergence of RR depends on the introduced variance.

Theorem 1. Suppose that the functions $f_1, \ldots, f_n$ are µ-strongly convex and that Assumption 1 holds. Then for Algorithm 1 or 2 run with a constant stepsize $\gamma \le \frac{1}{L}$, the iterates generated by either of the algorithms satisfy
$$\mathbb{E}\left[ \|x_T - x_\ast\|^2 \right] \le (1 - \gamma \mu)^{n T}\, \|x_0 - x_\ast\|^2 + \frac{2 \gamma\, \sigma_{\mathrm{Shuffle}}^2}{\mu}.$$

Proof. The key insight of our proof is that the intermediate iterates $x_t^1, x_t^2, \ldots$ do not converge to $x_\ast$, but rather converge to the sequence $x_\ast^1, x_\ast^2, \ldots$ defined in (4). Keeping this intuition in mind, it makes sense to study the following recursion:
$$\mathbb{E}\left[ \|x_t^{i+1} - x_\ast^{i+1}\|^2 \right] = \mathbb{E}\left[ \|x_t^i - x_\ast^i\|^2 \right] - 2\gamma\, \mathbb{E}\left[ \langle \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_\ast),\, x_t^i - x_\ast^i \rangle \right] + \gamma^2\, \mathbb{E}\left[ \|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_\ast)\|^2 \right]. \qquad (6)$$
Once we have this recursion, it is useful to notice that the scalar product can be decomposed as
$$\begin{aligned}
\langle \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_\ast),\, x_t^i - x_\ast^i \rangle
&= \left[ f_{\pi_i}(x_\ast^i) - f_{\pi_i}(x_t^i) - \langle \nabla f_{\pi_i}(x_t^i),\, x_\ast^i - x_t^i \rangle \right] \\
&\quad + \left[ f_{\pi_i}(x_t^i) - f_{\pi_i}(x_\ast) - \langle \nabla f_{\pi_i}(x_\ast),\, x_t^i - x_\ast \rangle \right] \\
&\quad - \left[ f_{\pi_i}(x_\ast^i) - f_{\pi_i}(x_\ast) - \langle \nabla f_{\pi_i}(x_\ast),\, x_\ast^i - x_\ast \rangle \right] \\
&= D_{f_{\pi_i}}(x_\ast^i, x_t^i) + D_{f_{\pi_i}}(x_t^i, x_\ast) - D_{f_{\pi_i}}(x_\ast^i, x_\ast). \qquad (7)
\end{aligned}$$
This decomposition is, in fact, very standard: it is a special case of the so-called three-point identity (Chen and Teboulle, 1993), so it should not be surprising that we use it. The rest of the proof relies on obtaining appropriate bounds for the terms in the recursion. First, we bound each of the three Bregman divergence terms appearing in (7). By µ-strong convexity of $f_{\pi_i}$, the first term in (7) satisfies
$$\frac{\mu}{2} \|x_t^i - x_\ast^i\|^2 \overset{(2)}{\le} D_{f_{\pi_i}}(x_\ast^i, x_t^i),$$
which we will use to obtain a contraction. The second term in (7) can be bounded via
$$\frac{1}{2L} \|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_\ast)\|^2 \overset{(3)}{\le} D_{f_{\pi_i}}(x_t^i, x_\ast),$$
which gets absorbed by the last term in the expansion of $\|x_t^{i+1} - x_\ast^{i+1}\|^2$. The expectation of the third divergence term in (7) is trivially bounded as follows:
$$\mathbb{E}\left[ D_{f_{\pi_i}}(x_\ast^i, x_\ast) \right] \le \max_{i=1,\ldots,n-1} \mathbb{E}\left[ D_{f_{\pi_i}}(x_\ast^i, x_\ast) \right] = \gamma\, \sigma_{\mathrm{Shuffle}}^2.$$
Plugging these three bounds back into (7), and the resulting inequality into (6), we obtain
$$\mathbb{E}\left[ \|x_t^{i+1} - x_\ast^{i+1}\|^2 \right] \le \mathbb{E}\left[ (1 - \gamma \mu) \|x_t^i - x_\ast^i\|^2 - 2\gamma (1 - \gamma L)\, D_{f_{\pi_i}}(x_t^i, x_\ast) \right] + 2\gamma^2 \sigma_{\mathrm{Shuffle}}^2 \le (1 - \gamma \mu)\, \mathbb{E}\left[ \|x_t^i - x_\ast^i\|^2 \right] + 2\gamma^2 \sigma_{\mathrm{Shuffle}}^2. \qquad (8)$$
The rest of the proof is just solving this recursion, and is relegated to Section 8.2 in the appendix.
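The shuffling variance of Definition 2 is straightforward to estimate numerically. Continuing the quadratic sketch from Section 3.1 above (again our own illustration; `A`, `grad`, `x_star`, `L`, `gamma`, and `n` are as defined there), the following Monte Carlo estimate of $\sigma_{\mathrm{Shuffle}}^2$ can be checked against both bounds of Proposition 1:

```python
import numpy as np

def bregman(i, x, y):
    # For f_i(x) = 0.5*||A_i x - b_i||^2 we have D_{f_i}(x, y) = 0.5*||A_i (x - y)||^2.
    return 0.5 * np.linalg.norm(A[i] @ (x - y)) ** 2

def shuffle_variance(gamma, n_samples=2000, seed=1):
    """Monte Carlo estimate of sigma^2_Shuffle from Definition 2."""
    rng = np.random.default_rng(seed)
    div_sums = np.zeros(n)    # accumulates D_{f_{pi_i}}(x_*^i, x_*) for each i = 1, ..., n-1
    for _ in range(n_samples):
        perm = rng.permutation(n)
        x_i = x_star
        for i in range(1, n):
            x_i = x_i - gamma * grad(perm[i - 1], x_star)   # recursion defining (4)
            div_sums[i] += bregman(perm[i], x_i, x_star)
    return div_sums[1:].max() / (n_samples * gamma)          # max_i E[D] / gamma

sigma2_star = np.mean([np.linalg.norm(grad(i, x_star)) ** 2 for i in range(n)])
mu = min(np.linalg.eigvalsh(A[i].T @ A[i]).min() for i in range(n))
est = shuffle_variance(gamma)
print(f"{gamma * mu * n / 8 * sigma2_star:.3e} <= sigma^2_Shuffle "
      f"~= {est:.3e} <= {gamma * L * n / 4 * sigma2_star:.3e}")
```

Up to Monte Carlo error, the printed estimate falls between the two bounds of Proposition 1, and in examples like this one it tends to sit closer to the lower bound, consistent with the experiments in Section 4.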
We show (Corollary 1 in the appendix) that by carefully controlling the stepsize, the final iterate of RR after T epochs satisfies
$$\mathbb{E}\left[ \|x_T - x_\ast\|^2 \right] = \mathcal{O}\!\left( \exp\left( -\frac{\mu n T}{L} \right) \|x_0 - x_\ast\|^2 + \frac{\kappa\, \sigma_\ast^2}{\mu^2 n T^2} \right), \qquad (9)$$
where the $\mathcal{O}(\cdot)$ notation suppresses absolute constants and polylogarithmic factors. Note that Theorem 1 covers both RR and SO, and for SO, Safran and Shamir (2020) give almost the same lower bound. Stated in terms of the squared distance from the optimum, their lower bound is
$$\mathbb{E}\left[ \|x_T - x_\ast\|^2 \right] = \Omega\!\left( \min\left\{ 1,\; \frac{\sigma_\ast^2}{\mu^2 n T^2} \right\} \right),$$
where we note that in their problem κ = 1. This translates to a sample complexity of $\mathcal{O}\!\left( \frac{\sqrt{n}\, \sigma_\ast}{\mu \sqrt{\varepsilon}} \right)$ for ε ≤ 1.² Specializing κ = 1 in Equation (9) gives the sample complexity $\mathcal{O}\!\left( 1 + \frac{\sqrt{n}\, \sigma_\ast}{\mu \sqrt{\varepsilon}} \right)$, matching the optimal rate up to an extra iteration. More recently, Rajput et al. (2020) also proved a similar lower bound for RR. We emphasize that Theorem 1 is not only tight, but it is also the first convergence bound that applies to SO. Moreover, it also immediately works if one permutes once every few epochs, which interpolates between RR and SO, as mentioned by Rajput et al. (2020).

Comparison with SGD. To understand when RR is better than SGD, let us borrow a convergence bound for the latter. Several works have shown (e.g., see (Needell et al., 2014; Stich, 2019)) that for any $\gamma \le \frac{1}{2L}$ the iterates of SGD satisfy
$$\mathbb{E}\left[ \|x_{nT}^{\mathrm{SGD}} - x_\ast\|^2 \right] \le (1 - \gamma \mu)^{n T}\, \|x_0 - x_\ast\|^2 + \frac{2 \gamma\, \sigma_\ast^2}{\mu}.$$
Thus, the question as to which method is faster boils down to which variance is smaller: $\sigma_{\mathrm{Shuffle}}^2$ or $\sigma_\ast^2$. According to Proposition 1, this depends on both n and the stepsize. Once the stepsize is sufficiently small, $\sigma_{\mathrm{Shuffle}}^2$ becomes smaller than $\sigma_\ast^2$, but this might not be true in general. Similarly, if we partition the n functions into n/τ groups, i.e., use minibatches of size τ, then $\sigma_\ast^2$ decreases as $\mathcal{O}(1/\tau)$ while $\sigma_{\mathrm{Shuffle}}^2$ decreases as $\mathcal{O}(1/\tau^2)$, so RR can become faster even without decreasing the stepsize. We illustrate this later with numerical experiments.

While Theorem 1 requires each $f_i$ to be strongly convex, we can also obtain results in the case where the individual strong convexity assumption is replaced by convexity. However, in such a case, we need to use a smaller stepsize, as the next theorem shows.

Theorem 2. Suppose that each $f_i$ is convex, f is µ-strongly convex, and Assumption 1 holds. Then provided the stepsize satisfies $\gamma \le \frac{1}{2 L n}$, the final iterate generated by Algorithm 1 or 2 satisfies
$$\mathbb{E}\left[ \|x_T - x_\ast\|^2 \right] \le \left( 1 - \frac{\gamma \mu n}{2} \right)^{T} \|x_0 - x_\ast\|^2 + \gamma^2 \kappa n\, \sigma_\ast^2.$$

It is not difficult to show that by properly choosing the stepsize γ, the guarantee given by Theorem 2 translates to a sample complexity of $\mathcal{O}\!\left( \kappa n + \frac{\sqrt{\kappa n}\, \sigma_\ast}{\mu \sqrt{\varepsilon}} \right)$, which matches the dependence on the accuracy ε in Theorem 1, but with κ(n − 1) additional iterations in the beginning. For κ = 1, this translates to a sample complexity of $\mathcal{O}\!\left( n + \frac{\sqrt{n}\, \sigma_\ast}{\mu \sqrt{\varepsilon}} \right)$, which is worse than the lower bound of Safran and Shamir (2020) when ε is large. In concurrent work, Ahn et al. (2020) obtain in the same setting a complexity of $\mathcal{O}\!\left( \frac{1}{\varepsilon^{1/\alpha}} + \frac{\sqrt{n}\, G}{\mu \sqrt{\varepsilon}} \right)$ (for a constant α > 2), which requires that each $f_i$ is Lipschitz and matches the lower bound only when the accuracy ε is large enough that $1/\varepsilon^{1/\alpha} \le 1$. Obtaining an optimal convergence guarantee for all accuracies ε in the setting of Theorem 2 remains open.

3.2 Non-strongly convex objectives

We also make a step towards better bounds for RR/SO without any strong convexity at all, and provide the following convergence statement.

Theorem 3. Let the functions $f_1, f_2, \ldots, f_n$ be convex, and suppose that Assumption 1 holds.
Then for Algorithm 1 or Algorithm 2 run with a stepsize $\gamma \le \frac{1}{2 L n}$, the average iterate $\hat{x}_T \overset{\text{def}}{=} \frac{1}{T} \sum_{j=1}^{T} x_j$ satisfies
$$\mathbb{E}\left[ f(\hat{x}_T) - f(x_\ast) \right] \le \frac{\|x_0 - x_\ast\|^2}{2 \gamma n T} + \frac{\gamma^2 L n\, \sigma_\ast^2}{4}.$$

Unfortunately, the theorem above relies on small stepsizes, but we still deem it a valuable contribution, since it is based on a novel analysis. Indeed, prior works showed that RR approximates a full gradient step, but we show that it is even closer to an implicit gradient step; see the appendix. To translate the recursion in Theorem 3 into a complexity, one can choose a small stepsize and obtain (Corollary 2 in the appendix) the following bound for RR/SO:
$$\mathbb{E}\left[ f(\hat{x}_T) - f(x_\ast) \right] = \mathcal{O}\!\left( \frac{L \|x_0 - x_\ast\|^2}{T} + \frac{L^{1/3} \|x_0 - x_\ast\|^{4/3}\, \sigma_\ast^{2/3}}{n^{1/3} T^{2/3}} \right).$$

² In their problem, the initialization point $x_0$ satisfies $\|x_0 - x_\ast\|^2 \le 1$, and hence asking for accuracy ε > 1 does not make sense.

Stich (2019) gives a convergence upper bound of $\mathcal{O}\!\left( \frac{L \|x_0 - x_\ast\|^2}{n T} + \frac{\sigma_\ast \|x_0 - x_\ast\|}{\sqrt{n T}} \right)$ for SGD. Comparing upper bounds, we see that RR/SO beats SGD when the number of epochs satisfies $T \ge \frac{L^2 \|x_0 - x_\ast\|^2\, n}{\sigma_\ast^2}$. To the best of our knowledge, there are no strict lower bounds in this setting. Safran and Shamir (2020) suggest a lower bound of $\Omega\!\left( \frac{\sigma_\ast}{\sqrt{n}\, T} \right)$ by setting µ to be small in their lower bound for µ-strongly convex functions; however, this bound may be too optimistic.

3.3 Non-convex objectives

For non-convex objectives, we formulate the following assumption on the variance of the gradients.

Assumption 2. There exist nonnegative constants A, B ≥ 0 such that for any $x \in \mathbb{R}^d$ we have
$$\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le 2A\, (f(x) - f_\ast) + B^2. \qquad (10)$$

Assumption 2 is quite general: if there exists some G > 0 such that $\|\nabla f_i(x)\| \le G$ for all $x \in \mathbb{R}^d$ and $i \in \{1, 2, \ldots, n\}$, then Assumption 2 is clearly satisfied by setting A = 0 and B = G. Assumption 2 also generalizes the uniformly bounded variance assumption commonly invoked in work on non-convex SGD, which is equivalent to (10) with A = 0. Assumption 2 is a special case of the Expected Smoothness assumption of Khaled and Richtárik (2020), and it holds whenever each $f_i$ is smooth and lower bounded, as the next proposition shows.

Proposition 2 (Khaled and Richtárik, 2020, special case of Proposition 3). Suppose that $f_1, f_2, \ldots, f_n$ are lower bounded by $f_1^\ast, f_2^\ast, \ldots, f_n^\ast$, respectively, and that Assumption 1 holds. Then there exist constants A, B ≥ 0 such that Assumption 2 holds.

We now give our main convergence theorem for RR without assuming convexity.

Theorem 4. Suppose that Assumptions 1 and 2 hold. Then for Algorithm 1 run for T epochs with a stepsize $\gamma \le \min\left\{ \frac{1}{2 L n},\; \frac{1}{(A L^2 n^2 T)^{1/3}} \right\}$, we have
$$\min_{t=0,\ldots,T-1} \mathbb{E}\left[ \|\nabla f(x_t)\|^2 \right] \le \frac{12\, (f(x_0) - f_\ast)}{\gamma n T} + 2 \gamma^2 L^2 n\, B^2.$$

In the extended arXiv version, we also show that for PL functions satisfying Assumption 2 with A = 0 it holds that
$$\mathbb{E}\left[ f(x_T) - f_\ast \right] \le \left( 1 - \frac{\gamma \mu n}{2} \right)^{T} (f(x_0) - f_\ast) + \gamma^2 \kappa L n\, B^2, \qquad \text{where } \kappa = \frac{L}{\mu}.$$

Comparison with SGD. From Theorem 4, one can recover the complexity that we provide in Table 1; see Corollary 3 in the appendix. Let us ignore constants not related to our assumptions and specialize to uniformly bounded variance. Then the sample complexity of RR, $K_{\mathrm{RR}} = \mathcal{O}\!\left( \frac{L \sqrt{n}\, \sigma}{\varepsilon^3} \right)$, becomes better than that of SGD, $K_{\mathrm{SGD}} = \mathcal{O}\!\left( \frac{L \sigma^2}{\varepsilon^4} \right)$, whenever $\sqrt{n}\, \varepsilon \le \sigma$.

4 Experiments

We run our experiments on the ℓ2-regularized logistic regression problem given by
$$\min_{x \in \mathbb{R}^d}\; \frac{1}{N} \sum_{i=1}^{N} \left[ -b_i \log h(a_i^\top x) - (1 - b_i) \log\left( 1 - h(a_i^\top x) \right) \right] + \frac{\lambda}{2} \|x\|^2,$$
where $(a_i, b_i) \in \mathbb{R}^d \times \{0, 1\}$, $i = 1, \ldots, N$, are the data samples and $h \colon t \mapsto 1/(1 + e^{-t})$ is the sigmoid function. For better parallelism, we use minibatches of size 512 for all methods and datasets. We set $\lambda = L/\sqrt{N}$ and use stepsizes decreasing as $\mathcal{O}(1/t)$.
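As a concrete sketch of this setup — our own simplification for illustration only; the authors' actual implementation is in the repository referenced below, and names such as `rr_logreg` are ours — RR with minibatches and an $\mathcal{O}(1/t)$ stepsize schedule can be written as:

```python
import numpy as np

def rr_logreg(features, labels, lam, batch=512, n_epochs=150, gamma0=1.0, seed=0):
    """RR for l2-regularized logistic regression with an O(1/t) stepsize schedule."""
    rng = np.random.default_rng(seed)
    N, d = features.shape
    x = np.zeros(d)
    for t in range(n_epochs):
        gamma = gamma0 / (t + 1)              # stepsize decreasing as O(1/t)
        perm = rng.permutation(N)             # resampled every epoch (RR); hoist out for SO
        for start in range(0, N, batch):      # minibatches of 512 permuted samples
            idx = perm[start:start + batch]
            z = features[idx] @ x
            # Gradient of -b*log(h(z)) - (1-b)*log(1-h(z)) w.r.t. x is a*(h(z) - b).
            g = features[idx].T @ (1.0 / (1.0 + np.exp(-z)) - labels[idx]) / len(idx)
            x -= gamma * (g + lam * x)        # add the gradient of (lam/2)*||x||^2
    return x
```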
See the appendix for more details on the parameters used, implementation details, and reproducibility.

Reproducibility. Our code is provided at https://github.com/konstmish/random_reshuffling. All used datasets are publicly available, and all additional implementation details are provided in the appendix.

Observations. One notable property of all shuffling methods is that they converge with oscillations, as can be seen in Figure 1. There is nothing surprising about this, as the proof of our Theorem 1 shows that the intermediate iterates converge to $x_\ast^i$ instead of $x_\ast$. It is, however, surprising how striking the difference between the intermediate iterates within one epoch can be. Next, one can see that SO and RR converge almost identically, which is in line with Theorem 1. The contrast with IG, on the other hand, is dramatic, suggesting the existence of bad permutations. The probability of sampling such a permutation seems negligible; see the right plot in Figure 2. Finally, we remark that the first two plots in Figure 2 demonstrate the importance of the new variance introduced in Definition 2. The upper and lower bounds from Proposition 1 are depicted in these two plots, and one can observe that the lower bound is often closer to the actual value of $\sigma_{\mathrm{Shuffle}}^2$ than the upper bound. The fact that $\sigma_{\mathrm{Shuffle}}^2$ very quickly becomes smaller than $\sigma_\ast^2$ explains why RR often outperforms SGD starting from the early iterations.

Figure 1: Top: real-sim dataset (N = 72,309; d = 20,958), middle row: w8a dataset (N = 49,749; d = 300), bottom: RCV1 dataset (N = 804,414; d = 47,236). Left: convergence of $f(x_t^i)$, middle column: convergence of $\|x_t^i - x_\ast\|^2$, right: convergence of SO with different permutations.

Figure 2: Estimated variance at the optimum, $\sigma_{\mathrm{Shuffle}}^2$ and $\sigma_\ast^2$, for the w8a dataset. Left: the values of the variance for different minibatch sizes with γ = 1/L. Middle: variance with fixed minibatch size 64 for different γ, starting with γ = 1/L and ending with γ = 10⁻⁴/L. Right: the empirical distribution of $\sigma_{\mathrm{Shuffle}}^2$ for 500,000 sampled permutations with γ = 1/L and minibatch size 64.

Broader Impact

Our contribution is primarily theoretical. Moreover, we study methods that are already in use in practice, but are notoriously hard to analyze. We believe we have made a breakthrough in this area by developing new and remarkably simple proof techniques, leading to sharp bounds. This, we hope, will inspire other researchers to apply and further develop our techniques in other contexts and for other algorithms. These applications may one day push the state of the art in practice for existing or new supervised machine learning applications, which may then have broader impacts.
Besides this, we do not expect any direct or short-term societal consequences.

Acknowledgments and Disclosure of Funding

Ahmed Khaled acknowledges internship support from the Optimization and Machine Learning Lab led by Peter Richtárik at KAUST.

References

Kwangjun Ahn and Suvrit Sra. On Tight Convergence Rates of Without-replacement SGD. arXiv preprint arXiv:2004.08657, 2020.

Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. arXiv preprint arXiv:2006.06946. To appear in NeurIPS 2020, 2020.

Yoshua Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade, pages 437–478, 2012. doi: 10.1007/978-3-642-35289-8_26.

Dimitri P. Bertsekas. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, chapter 4. The MIT Press, 2011. ISBN 9780262298773.

Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient Convergence in Gradient Methods with Errors. SIAM Journal on Optimization, 10(3):627–642, January 2000. doi: 10.1137/s1052623497331063.

Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173.

Gong Chen and Marc Teboulle. Convergence Analysis of a Proximal-Like Minimization Algorithm Using Bregman Functions. SIAM Journal on Optimization, 3(3):538–543, 1993. doi: 10.1137/0803026.

Yoel Drori and Ohad Shamir. The Complexity of Finding Stationary Points with Stochastic Gradient Descent. arXiv preprint arXiv:1910.01845, 2019.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General Analysis and Improved Rates. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5200–5209. PMLR, 2019.

Luigi Grippo. A class of unconstrained minimization methods for neural network training. Optimization Methods and Software, 4(2):135–150, January 1994. doi: 10.1080/10556789408805583.

Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo A. Parrilo. Convergence Rate of Incremental Gradient and Incremental Newton Methods. SIAM Journal on Optimization, 29(4):2542–2565, 2019a. doi: 10.1137/17M1147846.

Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo A. Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, October 2019b. doi: 10.1007/s10107-019-01440-w.

Jeff Haochen and Suvrit Sra. Random Shuffling Beats SGD after Finite Epochs. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2624–2633. PMLR, 2019.

Ahmed Khaled and Peter Richtárik. Better Theory for SGD in the Nonconvex World. arXiv preprint arXiv:2002.03329, 2020.

Zehua Lai and Lek-Heng Lim.
Recht–Ré Noncommutative Arithmetic-Geometric Mean Conjecture is False. arXiv preprint arXiv:2006.01510, 2020.

Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and Jason D. Lee. Incremental Methods for Weakly Convex Optimization. arXiv preprint arXiv:1907.11687, 2019.

Zhi-Quan Luo. On the Convergence of the LMS Algorithm with Adaptive Learning Rate for Linear Feedforward Networks. Neural Computation, 3(2):226–245, June 1991. doi: 10.1162/neco.1991.3.2.226.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529, 2019.

Olvi L. Mangasarian and Mikhail V. Solodov. Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Optimization Methods and Software, 4(2):103–116, 1994. doi: 10.1080/10556789408805581.

Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without Replacement: Sharper Rates for General Smooth Convex Functions. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4703–4711. PMLR, 2019.

Angelia Nedić and Dimitri P. Bertsekas. Incremental Subgradient Methods for Nondifferentiable Optimization. SIAM Journal on Optimization, 12(1):109–138, January 2001. doi: 10.1137/s1052623499362111.

Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz Algorithm. In Advances in Neural Information Processing Systems 27, pages 1017–1025. Curran Associates, Inc., 2014.

Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3750–3758. PMLR, 2018.

Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A Unified Convergence Analysis for Shuffling-Type Gradient Methods. arXiv preprint arXiv:2002.08246, 2020.

Phuong Ha Nguyen, Lam Nguyen, and Marten van Dijk. Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD. In Advances in Neural Information Processing Systems 32, pages 3660–3669. Curran Associates, Inc., 2019.

Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. arXiv preprint arXiv:2002.10400, 2020.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), pages 1571–1578. Omnipress, 2012. ISBN 9781450312851.

Benjamin Recht and Christopher Ré. Toward a noncommutative arithmetic-geometric mean inequality: Conjectures, case-studies, and consequences. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23, pages 11.1–11.24, 2012.

Benjamin Recht and Christopher Ré. Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion.
Mathematical Programming Computation, 5(2):201–226, April 2013. doi: 10.1007/s12532-013-0053-8.

Itay Safran and Ohad Shamir. How Good is SGD with Random Shuffling? In Proceedings of the 33rd Annual Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3250–3284. PMLR, 2020.

Sebastian U. Stich. Unified Optimal Analysis of the (Stochastic) Gradient Method. arXiv preprint arXiv:1907.04232, 2019.

Ruo-Yu Sun. Optimization for Deep Learning: An Overview. Journal of the Operations Research Society of China, 8(2):249–294, June 2020. doi: 10.1007/s40305-020-00309-6.

Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. Stochastic Learning Under Random Reshuffling With Constant Step-Sizes. IEEE Transactions on Signal Processing, 67(2):474–489, 2019. doi: 10.1109/TSP.2018.2878551.