# Random Reshuffling is Not Always Better

Christopher De Sa, Department of Computer Science, Cornell University. cdesa@cs.cornell.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** Many learning algorithms, such as stochastic gradient descent, are affected by the order in which training examples are used. It is generally believed that sampling the training examples without replacement, also known as random reshuffling, causes learning algorithms to converge faster. We give a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms [19]. We use this to give an example of a learning task and algorithm for which with-replacement random sampling outperforms random reshuffling.

## 1 Introduction

Many machine learning algorithms work by iteratively updating a model based on one of a number of possible steps. For example, in stochastic gradient descent (SGD), each model update is performed based on a single example selected from a training dataset. The order in which the samples are selected — that is, the order in which the update steps are performed — can have an impact on the convergence rate of the algorithm. There is a general sense in the community that the random reshuffling method, which selects the order by without-replacement sampling of the steps in an epoch (where an epoch means a single pass through the data, and different epochs may use different random orders), is better (for convergence) than ordinary with-replacement sampling for these algorithms [7, 8, 19].

There are two intuitive reasons why we might expect random reshuffling to outperform sampling with replacement. The first applies when our model updates are in some sense noisy: each one could perturb us away from the desired optimum, and they are only guaranteed to approach the optimum on average. In this case, random reshuffling ensures that the noise in some sense cancels out over an epoch in which all samples are used. Most previous work on random reshuffling has studied this noisy case, and this intuition has been borne out in a series of results that show random reshuffling results in a convergence rate of O(1/t²) rather than O(1/t) for convex SGD [8, 10, 16, 18, 20]. The second intuitive reason is that, because sampling without replacement avoids using the same update step repeatedly, it should tend to be more contractive than sampling with replacement. This intuition applies even for noiseless algorithms that converge at a linear rate (i.e., geometrically in t).

In contrast to the noisy case, the belief that random reshuffling should be better in general for these algorithms that converge at a linear rate is backed up theoretically only with conjectures. The main conjecture in this space is the Operator Inequality of Noncommutative Arithmetic and Geometric Means, stated as Conjecture 1 of Recht and Ré [19]. That conjecture, which is motivated by algorithms such as the randomized Kaczmarz method [23] that converge at a linear rate, asserts the following.

**Conjecture 1** (Operator Inequality of Noncommutative Arithmetic and Geometric Means). Let A_1, . . . , A_n ∈ R^{d×d} be a collection of (symmetric) positive semidefinite matrices. Then it is conjectured that the following inequalities always hold:

$$\left\| \frac{1}{n!} \sum_{\sigma \in P(n)} \prod_{i=1}^{n} A_{\sigma(i)} \right\| \;\le\; \left\| \frac{1}{n^n} \sum_{f \in \{1,\ldots,n\}^n} \prod_{i=1}^{n} A_{f(i)} \right\| \;=\; \left\| \frac{1}{n} \sum_{i=1}^{n} A_i \right\|^n \qquad (1)$$

and

$$\left\| \frac{1}{n!} \sum_{\sigma \in P(n)} \left( \prod_{i=1}^{n} A_{\sigma(i)} \right)^{\!T} \left( \prod_{i=1}^{n} A_{\sigma(i)} \right) \right\| \;\le\; \left\| \frac{1}{n^n} \sum_{f \in \{1,\ldots,n\}^n} \left( \prod_{i=1}^{n} A_{f(i)} \right)^{\!T} \left( \prod_{i=1}^{n} A_{f(i)} \right) \right\| \qquad (2)$$

where P(n) denotes the set of permutations of the set {1, . . . , n}
and ‖·‖ denotes the ℓ2 induced operator norm (the magnitude of the largest-magnitude eigenvalue for symmetric matrices).

A variant of the conjecture, which moves the sums to the outside of the norms, was given by [7].

**Conjecture 2.** Let A_1, . . . , A_n ∈ R^{d×d} be a collection of (symmetric) positive semidefinite matrices. Then it is conjectured that the following inequality always holds:

$$\frac{1}{n!} \sum_{\sigma \in P(n)} \left\| \prod_{i=1}^{n} A_{\sigma(i)} \right\| \;\le\; \frac{1}{n^n} \sum_{f \in \{1,\ldots,n\}^n} \left\| \prod_{i=1}^{n} A_{f(i)} \right\|. \qquad (3)$$

Conjecture 1 is a quite natural generalization of the ordinary arithmetic-mean–geometric-mean (AM-GM) inequality of real numbers, which states that for non-negative numbers x_i,

$$\prod_{i=1}^{n} x_i \;\le\; \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^{\!n}.$$

In Conjecture 1, positive semidefinite matrices (matrices with non-negative eigenvalues) take the place of the non-negative scalars of the AM-GM inequality, and indeed Conjecture 1 reduces to the AM-GM inequality when d = 1. Conjecture 1 was proven by the original authors in the case of n = 2, and has been proven subsequently for n = 3 [12, 29]. It also seems to be true for random ensembles of matrices [2, 19], and random testing seems to suggest that Conjecture 1 is always true. However, recent work has shown non-constructively that Conjecture 1 is false [3, 14].¹ These non-constructive disproofs are interesting, but deliver limited insight about random reshuffling, both because they involve complicated proof techniques and because they do not translate to concrete counterexamples of matrices A_1, A_2, . . . , A_n that can be used to study learning algorithms empirically. In this paper, we propose simple counterexamples for these conjectures; to our knowledge this is the first explicit counterexample known for any of these conjectures, and the first disproof of Conjecture 2. We explore the consequences and limitations of this counterexample throughout the paper, and end by showing concrete problems for which SGD with random reshuffling converges asymptotically slower than SGD using with-replacement sampling.

¹ Lai and Lim [14] can be considered parallel work to this paper.

Our paper is structured as follows. In Section 2, we construct a family of counterexamples for Conjectures 1 and 2, showing constructively that all three conjectured inequalities are false. In Section 3, we adapt the counterexample to give concrete ML algorithms for which with-replacement sampling outperforms without-replacement sampling, contrary to folklore. In Section 4, we prove that for non-trivial matrix ensembles (1) always holds with strict inequality for sufficiently small step sizes. Thus, for algorithms with a slowly decreasing step size, without-replacement sampling always outperforms with-replacement sampling. On the other hand, we show that when optimal step sizes are chosen separately for with- and without-replacement sampling (but may not decrease to zero), with-replacement sampling can still perform better. In Section 5, we give an example convex learning task for which SGD using with-replacement sampling converges asymptotically faster than random reshuffling.

### 1.1 Notation

In this paper, ‖·‖ applied to a vector always denotes the Euclidean ℓ2 norm, and applied to an operator denotes the ℓ2 induced norm. We let 1 denote the all-1s vector. We let ⊗ denote the Kronecker product and ⊕ denote the matrix direct sum, such that x ⊕ y is the block diagonal matrix [ x 0 ; 0 y ]. We let S_d denote the set of symmetric d × d matrices over R, and let P_d denote the set of symmetric positive definite d × d matrices. We let ⪯ denote inequality with respect to the positive definite ordering (i.e. A ⪯ B when B − A ∈ P_d).
When K ⊆ R^d is a convex cone (a set closed under sums and non-negative scalar multiplication) and x, y ∈ R^d, we say x ⪯_K y if and only if y − x ∈ K, and analogously for matrices A, B ∈ R^{d×d}. We let M(K) denote the set {A ∈ S_d | ∀x ∈ K, Ax ∈ K} of matrices that preserve the convex cone K, and we note that M(K) is also a convex cone. For brevity, we let rr_k : S_d × · · · × S_d → S_d denote the random reshuffling function

$$\mathrm{rr}_k(A_1, A_2, \ldots, A_n) = \frac{1}{n!} \sum_{\sigma \in P(n)} \prod_{i=1}^{k} A_{\sigma(i)},$$

and define srr_k similarly as the symmetric random reshuffling function

$$\mathrm{srr}_k(A_1, A_2, \ldots, A_n) = \frac{1}{n!} \sum_{\sigma \in P(n)} \left( \prod_{i=1}^{k} A_{\sigma(i)} \right)^{\!T} \left( \prod_{i=1}^{k} A_{\sigma(i)} \right).$$

Note that this lets us write (1) more compactly as ‖rr_n(A_1, . . .)‖ ≤ ‖rr_1(A_1, . . .)‖^n.
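These two averaging operators are easy to evaluate by brute force when n is small, which is how the counterexamples in Section 2 can be checked numerically. The following minimal numpy sketch (not code from the paper; it simply transcribes the definitions above by enumerating all n! permutations) is only practical for small n.

```python
import itertools
import math
import numpy as np

def rr(k, mats):
    """rr_k(A_1, ..., A_n): average over all permutations sigma of the
    (left-to-right) product A_{sigma(1)} ... A_{sigma(k)}."""
    n, d = len(mats), mats[0].shape[0]
    total = np.zeros((d, d))
    for sigma in itertools.permutations(range(n)):
        prod = np.eye(d)
        for i in sigma[:k]:   # only the first k factors are used
            prod = prod @ mats[i]
        total += prod
    return total / math.factorial(n)

def srr(k, mats):
    """srr_k(A_1, ..., A_n): like rr_k, but averaging P^T P for each
    permuted product P, so every term is positive semidefinite."""
    n, d = len(mats), mats[0].shape[0]
    total = np.zeros((d, d))
    for sigma in itertools.permutations(range(n)):
        prod = np.eye(d)
        for i in sigma[:k]:
            prod = prod @ mats[i]
        total += prod.T @ prod
    return total / math.factorial(n)

def op_norm(mat):
    """The l2 induced operator norm (largest singular value)."""
    return np.linalg.norm(mat, 2)
```

With these helpers, inequality (1) is exactly the statement `op_norm(rr(n, mats)) <= op_norm(rr(1, mats)) ** n`.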
### 1.2 Related Work

Conjecture 1 was a generalization of a line of older work on matrix arithmetic-mean–geometric-mean inequalities for two matrices [4, 5]. It was proved by Recht and Ré [19] in the case of n = 2 and by Zhang [29] for n = 3. Duchi [7] proposed a variant, Conjecture 2, in which the sum appears outside the norm, and proved it for n = 2; it was extended to the n = 3 case by Israel et al. [12]. Albar et al. [2] proves a version of the inequality of Conjecture 1 that is weaker by a constant. Albar et al. [3] provides a non-constructive disproof of (2), and very recently Lai and Lim [14] gave a non-constructive disproof of (1) via a transformation to the noncommutative Positivstellensatz. Alaifari et al. [1] studies a related class of matrix rearrangement inequalities.

Several prior works have studied SGD on noisy learning problems, for which at the optimum w* it is not the case that ∇f_i(w*) = 0 for every component loss function f_i. Gürbüzbalaban et al. [8] exhibited an SGD variant for which random reshuffling converges at a Θ(1/t²) rate, which improves on the Ω(1/t) rate of standard with-replacement-sampled SGD; similar results were also proved for the noisy case in other settings [10, 20, 22]. Ying et al. [25] and Ying et al. [26] show that random reshuffling converges for variance-reduced algorithms, and Ying et al. [27] analyzes random reshuffling in the constant-step-size case. Meng et al. [15] studies a distributed variant of SGD with random shuffling, albeit one different from the one we study here (Algorithm 1). Beyond SGD, Oswald and Zhou [17] analyzes random reshuffling for methods such as Gauss-Seidel and Kaczmarz, and He et al. [11] studies scan order for Gibbs sampling.

## 2 Constructing a Counterexample

We start by outlining the main idea that underlies our counterexample. Fix some dimension d ∈ N, and let n = d. For any permutation σ ∈ P(n), let P_σ denote the permutation matrix over R^n, such that (P_σ x)_i = x_{σ(i)} for any vector x ∈ R^n. The main idea is to construct a sequence of matrices A_1, A_2, . . . , A_n such that P_σ A_i P_σ^T = A_{σ(i)} for any σ. For any permutation matrix P_ς,

$$\mathrm{rr}_k(A_1, \ldots, A_n) = \mathrm{rr}_k(P_ς A_1 P_ς^T, \ldots, P_ς A_n P_ς^T) = \frac{1}{n!} \sum_{\sigma \in P(n)} \prod_{i=1}^{k} P_ς A_{\sigma(i)} P_ς^T = P_ς \left( \frac{1}{n!} \sum_{\sigma \in P(n)} \prod_{i=1}^{k} A_{\sigma(i)} \right) P_ς^T = P_ς \, \mathrm{rr}_k(A_1, \ldots, A_n) \, P_ς^T,$$

where the first equality holds because rr_k is a symmetric function of its arguments, and the third holds because P_ς^T P_ς = I. This shows that rr_k(A_1, . . .) is preserved by any permutation of its coordinates, and the only such matrices are of the form X = α11^T + βI, where 1 denotes the all-1s vector. With careful choice of the A_i, we can find formulas for α and β, and show that they violate Conjecture 1.

### 2.1 A Counterexample for the First Inequality

Define a family of vectors y_k for k ∈ {1, . . . , n} such that (y_k)_k = √(n−1)/n and (y_k)_i = −1/(n√(n−1)) for i ≠ k. Consider the matrices A_k = I + 1y_k^T + y_k 1^T. For example, when n = 5, the matrices look like

$$A_1 = \begin{bmatrix} 1.8 & 0.3 & 0.3 & 0.3 & 0.3 \\ 0.3 & 0.8 & -0.2 & -0.2 & -0.2 \\ 0.3 & -0.2 & 0.8 & -0.2 & -0.2 \\ 0.3 & -0.2 & -0.2 & 0.8 & -0.2 \\ 0.3 & -0.2 & -0.2 & -0.2 & 0.8 \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 0.8 & 0.3 & -0.2 & -0.2 & -0.2 \\ 0.3 & 1.8 & 0.3 & 0.3 & 0.3 \\ -0.2 & 0.3 & 0.8 & -0.2 & -0.2 \\ -0.2 & 0.3 & -0.2 & 0.8 & -0.2 \\ -0.2 & 0.3 & -0.2 & -0.2 & 0.8 \end{bmatrix}.$$

It is clear by construction that for any σ ∈ P(n), P_σ^T A_k P_σ = A_{σ(k)}. It is also easily seen that the sum of the y_k is zero, from which we can see immediately that

$$\mathrm{rr}_1(A_1, \ldots, A_n) = \frac{1}{n} \sum_{i=1}^{n} A_i = I + \frac{1}{n} \sum_{i=1}^{n} \left( 1 y_i^T + y_i 1^T \right) = I.$$

So, ‖rr_1(A_1, . . . , A_n)‖ = 1. It is less immediate but still straightforward to show the following.

**Statement 1.** If we define λ (for any k ∈ N) as

$$\lambda = \left( 1 + \frac{1}{n-1} \right)^{k/2} \cos\!\left( k \arcsin \frac{1}{\sqrt{n}} \right),$$

then the random-reshuffled product of these matrices can be written as

$$\mathrm{rr}_k(A_1, \ldots, A_n) = \lambda \cdot \frac{11^T}{n} + \left( 1 + \frac{\lambda - 1}{n-1} \right) \left( I - \frac{11^T}{n} \right),$$

and so λ will be an eigenvalue of rr_n(A_1, . . . , A_n) with corresponding eigenvector 1.

We include a full derivation of this result (which is relatively easy to do by hand) in the appendix. It is easy to find n for which |λ| is greater than 1. The smallest such n is n = 5, where

$$\mathrm{rr}_n(A_1, \ldots, A_n) = -\frac{19}{16} \cdot \frac{11^T}{n} + \frac{29}{64} \left( I - \frac{11^T}{n} \right),$$

so λ = −19/16 and ‖rr_n(A_1, . . . , A_n)‖ = 19/16 > 1. This fact can be easily verified numerically, by computing rr_n directly for n = 5. Note that we can also have λ > 1, e.g. for n = 40, λ ≈ 1.655. This shows directly that (1) is false. Note that while this setup may seem to suggest that a counterexample requires n = d (while usually in linear regression n ≫ d), it is straightforward to construct examples for which n and d are arbitrary (but no less than 5) by either adding additional I matrices to the ensemble or adding additional dimensions containing only a 1 on the diagonal: this will change the norms of neither the arithmetic nor the geometric mean.

### 2.2 A Counterexample for the Second Inequality

Using our construction from the previous section, define a collection of positive semidefinite symmetric matrices B_i such that B_i² = A_i. For these matrices, srr_1(B_1, . . . , B_n) = (1/n) Σ_{i=1}^n B_i² = I, so by induction

$$\frac{1}{n^n} \sum_{(s_1, \ldots, s_n) \in \{1,\ldots,n\}^n} \left( \prod_{i=1}^{n} B_{s_i} \right)^{\!T} \left( \prod_{i=1}^{n} B_{s_i} \right) = I.$$

Thus, its norm will be 1. Just as before, these matrices have the property that P_σ B_i P_σ^T = B_{σ(i)} for any σ, so P_σ srr_n(B_1, . . .) P_σ^T = srr_n(B_1, . . .), and the symmetrized random-reshuffled product can also be written as α11^T + βI. It is possible to perform the same sort of analysis as done in Section 2.1 to find an expression for the eigenvalues of srr_n(B_1, . . .) explicitly as an analytic expression in n; however, since it is much more complicated and does not deliver additional insight, for lack of space we will just state the result for the particular case of n = 10. This case is convenient because 10 − 1 = 3², so A_i is rational and thus B_i has entries in Q(√2), and so we can do exact arithmetic easily. In this case, the eigenvalue of srr_n(B_1, . . . , B_n) with corresponding eigenvector 1 is exactly

$$\lambda = \frac{16623165607286458}{16677181699666569} + \frac{2195717144015980}{16677181699666569} \sqrt{2} \approx 1.18,$$

which shows directly that (2) is false, because if it were true this number could be at most 1.
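Both counterexamples can be checked numerically. The sketch below (my code, not the paper's) verifies ‖rr_1‖ = 1 and ‖rr_n‖ > 1 for n = 5 by exact enumeration of all 120 permutations, and then estimates the symmetrized average for n = 10 by sampling random permutations (exact enumeration of all 10! orderings is possible but slow); scipy's sqrtm supplies B_i = A_i^{1/2}.

```python
import itertools
import math
import numpy as np
from scipy.linalg import sqrtm

def build_A(n):
    """The counterexample matrices A_k = I + 1 y_k^T + y_k 1^T."""
    ones = np.ones(n)
    mats = []
    for k in range(n):
        y = np.full(n, -1.0 / (n * math.sqrt(n - 1)))
        y[k] = math.sqrt(n - 1) / n
        mats.append(np.eye(n) + np.outer(ones, y) + np.outer(y, ones))
    return mats

def rr_full(mats):
    """rr_n: exact average of the full product over all permutations."""
    n = len(mats)
    total = np.zeros((n, n))
    for sigma in itertools.permutations(range(n)):
        prod = np.eye(n)
        for i in sigma:
            prod = prod @ mats[i]
        total += prod
    return total / math.factorial(n)

# Counterexample to inequality (1), n = 5: ||rr_1|| = 1 but ||rr_5|| > 1.
A = build_A(5)
print(np.linalg.norm(sum(A) / 5, 2))   # 1.0  (= ||rr_1||)
print(np.linalg.norm(rr_full(A), 2))   # ~1.1875 = 19/16 > 1

# Counterexample to inequality (2), n = 10: Monte Carlo estimate of srr_10.
# (||srr_1|| = 1 exactly, since the B_i^2 = A_i average to I.)
rng = np.random.default_rng(0)
B = [np.real(sqrtm(Ai)) for Ai in build_A(10)]
acc = np.zeros((10, 10))
num_samples = 20000
for _ in range(num_samples):
    prod = np.eye(10)
    for i in rng.permutation(10):
        prod = prod @ B[i]
    acc += prod.T @ prod
print(np.linalg.norm(acc / num_samples, 2))  # estimate; exact value is about 1.18 > 1
```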
### 2.3 A Counterexample for the Third Inequality

We can construct a counterexample to Conjecture 2 based on the tight frame example of Recht and Ré [19]. The tight frame example in dimension d = 2 consists of symmetric projection matrices A_k ∈ R^{2×2} for k ∈ {1, . . . , n} defined as A_k = u_k u_k^T, where u_k = [cos(πk/n), sin(πk/n)]^T. These matrices have the interesting property that their fixed-order product A_1 A_2 · · · A_n has an asymptotically larger norm than the nth power of their mean, and they are used by Recht and Ré [19] to motivate why symmetrizing the order (by sampling without replacement rather than just going with some arbitrary fixed order) is important. Starting with this, we construct the family of matrices B_k defined by

$$B_k = \bigoplus_{ς \in P(n)} A_{ς(k)},$$

where ⊕ here denotes an indexed matrix direct sum (which constructs a block diagonal matrix). The direct sum has the important properties that ‖X ⊕ Y‖ = max(‖X‖, ‖Y‖) and that (if the dimensions match) (X_1 ⊕ X_2)(Y_1 ⊕ Y_2) = (X_1 Y_1) ⊕ (X_2 Y_2). As a consequence, for any permutation σ,

$$\left\| \prod_{i=1}^{n} B_{\sigma(i)} \right\| = \left\| \prod_{i=1}^{n} \bigoplus_{ς \in P(n)} A_{ς(\sigma(i))} \right\| = \left\| \bigoplus_{ς \in P(n)} \prod_{i=1}^{n} A_{ς(\sigma(i))} \right\| = \max_{ς \in P(n)} \left\| \prod_{i=1}^{n} A_{ς(\sigma(i))} \right\| = \max_{ς \in P(n)} \left\| \prod_{i=1}^{n} A_{ς(i)} \right\| = \left\| \prod_{i=1}^{n} A_i \right\|,$$

where the last equality is a known property of the tight frame example, and the other equalities follow from properties of the direct sum. This means that all the terms on the left side of (3) for this example will be the same, and in particular

$$\frac{1}{n!} \sum_{\sigma \in P(n)} \left\| \prod_{i=1}^{n} B_{\sigma(i)} \right\| = \left\| \prod_{i=1}^{n} A_i \right\|.$$

By a similar argument, the right side of that equation will be

$$\frac{1}{n^n} \sum_{f \in \{1,\ldots,n\}^n} \left\| \prod_{i=1}^{n} B_{f(i)} \right\| = \frac{1}{n^n} \sum_{f \in \{1,\ldots,n\}^n} \max_{ς \in P(n)} \left\| \prod_{i=1}^{n} A_{ς(f(i))} \right\|.$$

These formulas make it straightforward to compute these values, even though the matrices in question are of dimension 2 · n!. For the particular case of n = 6, the left side is ‖Π_{i=1}^n A_i‖ = cos(π/6)^5 = 9√3/32 ≈ 0.487, while the right side can be evaluated exactly and is strictly smaller. This is a counterexample to Conjecture 2.

## 3 A Machine Learning Example

Stochastic gradient descent is perhaps the central example in ML of an algorithm where sample order can affect convergence. Consider the parallel SGD algorithm described in Algorithm 1. Here, for every epoch, each of M parallel workers runs n iterations of SGD, using either with-replacement or without-replacement sampling. Then, the resulting weights are averaged among the workers to produce the starting value for the next epoch. This once-per-epoch averaging in some sense simulates the expected value in Conjecture 1, which makes Algorithm 1 a natural SGD-like algorithm to explore with the conjecture.² This is equivalent to the method of local SGD with periodic averaging [9, 28] with the averaging period equal to the epoch length. Based on folklore, random reshuffling should outperform standard sampling for this sort of algorithm. We will show that this is not always the case by constructing an example for which standard sampling converges at a linear rate, but random reshuffling fails to converge at all.

Consider the following matrix-completion-like task. We have an unknown rank-1 matrix X ⪰ 0, and we are given noisy measurements from it of the form u_i^T X v_i ≈ a_i, where we know (u_i, v_i, a_i). We want to recover X by solving the regularized least-squares minimization problem

$$\text{minimize:} \quad \frac{1}{n} \sum_{i=1}^{n} \left( u_i^T X v_i - a_i \right)^2 + \frac{\gamma}{2} \operatorname{tr}(X) \qquad \text{subject to:} \quad X \succeq 0, \; \operatorname{rank}(X) \le 1.$$

² Note that the parallelism itself is not necessary here; what is necessary for our purposes is the averaging. The averaging is necessary (even for large n) because it models the expected value in the original inequality (1): without it the convergence rate may be affected by higher-order moments (not just the expected value). We study parallel SGD because it is a real method from the literature that uses averaging [9, 28].

**Algorithm 1** Parallel SGD

- 1: **given:** n loss functions f_i, step size scheme α_1, α_2, . . ., initial state w_0 ∈ R^d
- 2: **given:** number of epochs K, parallel machines M, replacement policy RP
- 3: **for** k = 1 **to** K **do**
  - 4: **for all** m ∈ {1, . . . , M} **do in parallel on machine** m
    - 5: u_{k,m,0} ← w_{k−1}
    - 6: **if** RP = with-replacement (a.k.a. standard sampling) **then**
      - 7: sample σ_{k,m} uniformly from the set of functions from {1, . . . , n} to {1, . . . , n}
    - 8: **else if** RP = without-replacement (a.k.a. random reshuffling) **then**
      - 9: sample σ_{k,m} uniformly from P(n)
    - 10: **for** t = 1 **to** n **do**
      - 11: u_{k,m,t} ← u_{k,m,t−1} − α_k ∇f_{σ_{k,m}(t)}(u_{k,m,t−1})
  - 12: **average:** w_k ← (1/M) Σ_{m=1}^{M} u_{k,m,n}
- 13: **return** w_K
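For concreteness, here is a minimal Python sketch of Algorithm 1 (my own transcription, not the authors' code); `grad_f(i, w)` is a placeholder gradient oracle for f_i, and the sequential loop over m stands in for the parallel workers.

```python
import numpy as np

def parallel_sgd(grad_f, n, w0, alphas, K, M, with_replacement, rng=None):
    """Algorithm 1 (parallel SGD): M workers each run one epoch of n SGD
    steps from the same starting point, then their iterates are averaged.

    grad_f(i, w) should return the gradient of the i-th loss f_i at w.
    alphas[k] is the step size used throughout epoch k.
    with_replacement=True is standard sampling; False is random reshuffling.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)
    for k in range(K):
        iterates = []
        for _ in range(M):  # "in parallel" on M machines (sequential here)
            u = w.copy()
            if with_replacement:
                order = rng.integers(0, n, size=n)   # i.i.d. indices
            else:
                order = rng.permutation(n)           # a random permutation
            for i in order:
                u = u - alphas[k] * grad_f(i, u)
            iterates.append(u)
        w = np.mean(iterates, axis=0)                # once-per-epoch average
    return w
```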
To solve this more efficiently, we apply Algorithm 1 to a quadratic factorization [6] X = yy^T, a common technique which results in the equivalent unconstrained problem

$$\text{minimize:} \quad f(y) = \frac{1}{n} \sum_{i=1}^{n} f_i(y) = \frac{1}{n} \sum_{i=1}^{n} \left( \left( u_i^T y y^T v_i - a_i \right)^2 + \frac{\gamma}{2} \| y \|^2 \right) \qquad \text{subject to:} \quad y \in R^d.$$

We are now going to pick a particular dataset of (u_i, v_i, a_i) such that the global optimum of f is at y = 0, and where near y = 0, Algorithm 1 behaves like our counterexample of Section 2. To do this, notice that near y = 0,

$$y - \alpha \nabla f_i(y) = \left( (1 - \alpha\gamma) I - 2\alpha \left( u_i^T y y^T v_i - a_i \right) \left( v_i u_i^T + u_i v_i^T \right) \right) y = \left( (1 - \alpha\gamma) I + 2\alpha a_i \left( v_i u_i^T + u_i v_i^T \right) \right) y + O(\|y\|^3).$$

So, if we choose α, γ, a_i, v_i, and u_i such that

$$(1 - \alpha\gamma) I + 2\alpha a_i \left( v_i u_i^T + u_i v_i^T \right) = (1 - \alpha\gamma) A_i,$$

where this A_i is from our counterexample of Section 2, then when the algorithm is sufficiently close to y = 0, applying an iteration of SGD will behave like multiplying by a single matrix A_i, and averaging across multiple parallel workers will concentrate around the expected value over the sampled scan order. Concretely, we pick n = 40, a constant step size α = 0.1, γ = 0.05, M = 1000, K = 100, u_i = 1, v_i = y_i (the y_i of Section 2), and a_i = (1 − αγ)/(2α); we initialize w_0 randomly such that ‖w_0‖ = 1. It is easy to see that the global optimum of this task is at w* = 0, and it has no other stationary points. Note that this is not necessarily a very realistic setting (the number of parallel workers is large and the dataset is relatively small): the artificial setting is chosen to make the comparison stand out clearly. Running Algorithm 1 on this example produces the results shown in Figure 1. This shows empirically that, counter-intuitively, standard with-replacement random sampling can outperform random reshuffling for this algorithm.

**Figure 1.** Loss gap of multiple random runs of Algorithm 1 on the task of Section 3 (loss gap versus epoch number, 0 to 100; curves for with-replacement and without-replacement sampling). Notice that random reshuffling fails to converge to the minimal loss.
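As a quick check of this construction, the following sketch (mine; it assumes the parameter choices just stated and the y_i of Section 2) compares one exact SGD step on f_i near y = 0 with the linear map (1 − αγ)A_i.

```python
import math
import numpy as np

n, alpha, gamma = 40, 0.1, 0.05

def y_vec(i):
    """The vectors y_i from Section 2 (for this n)."""
    y = np.full(n, -1.0 / (n * math.sqrt(n - 1)))
    y[i] = math.sqrt(n - 1) / n
    return y

a_const = (1 - alpha * gamma) / (2 * alpha)
u = np.ones(n)                       # u_i = 1 for every i

def grad_f(i, w):
    """Gradient of f_i(y) = (u^T y y^T v_i - a_i)^2 + (gamma/2)||y||^2."""
    v = y_vec(i)
    r = (u @ w) * (w @ v) - a_const
    return 2 * r * (np.outer(u, v) + np.outer(v, u)) @ w + gamma * w

# Near y = 0, one SGD step y - alpha * grad_f_i(y) agrees with
# (1 - alpha*gamma) * A_i y up to terms that are cubic in ||y||.
rng = np.random.default_rng(1)
y = 1e-4 * rng.standard_normal(n)
i = 7
A_i = np.eye(n) + np.outer(u, y_vec(i)) + np.outer(y_vec(i), u)
step = y - alpha * grad_f(i, y)
linear = (1 - alpha * gamma) * (A_i @ y)
print(np.linalg.norm(step - linear))   # tiny (order ||y||^3)
```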
**Figure 2.** Distance to optimum of Algorithm 1 for the task of Section 4 after K = 20 epochs, as a function of step size (roughly 0 to 1.7). Light dotted series indicate nonparallel SGD (M = 1); the other series use M = 100; curves shown for both with-replacement and without-replacement sampling.

**Figure 3.** Asymptotic convergence (log₁₀ of the loss versus epoch number, over 1000 epochs) of SGD using with- and without-replacement sampling for multiple trials on the example of Section 5.

## 4 Taking Step Size into Account

Many algorithms, including SGD, use a step size or learning rate that often decreases over time. Much of the previous work on random reshuffling has been done in such cases [8, 10, 20]. In this section, we develop a variant of Conjecture 1 that incorporates a step size, and prove that this statement must hold true for sufficiently small step sizes. Our step-size-incorporating variant of Conjecture 1 is based on the following intuition. For a smooth objective, for w close to the optimum w* where ∇f_i(w*) = 0,

$$w - \alpha \nabla f_i(w) \approx \left( I - \alpha \nabla^2 f_i(w) \right) (w - w^*) + w^*;$$

for a quadratic function f_i, this approximation is exact. So we can model SGD with step size α by allowing the matrices A_i in Conjecture 1 to vary as a function of a step size α. We prove the following theorem about this modified inequality.

**Theorem 1** (Matrix AM-GM Inequality, Sufficiently Small Step Size). Let A_1, . . . , A_n be a collection of continuously twice-differentiable functions from R_+ to S_d that all satisfy A_i(0) = I and that are non-trivial in the sense that they have no eigenvalue/eigenvector pairs shared among all the matrices A′_1(0), A′_2(0), . . . , A′_n(0). Then for any 2 ≤ k ≤ n, there exists an α_max > 0 and a constant C > 0 such that for all 0 < α ≤ α_max,

$$\left\| \mathrm{rr}_k(A_1(\alpha), A_2(\alpha), \ldots, A_n(\alpha)) \right\| < \left\| \mathrm{rr}_1(A_1(\alpha), A_2(\alpha), \ldots, A_n(\alpha)) \right\|^k - \alpha^2 C.$$

The proof of Theorem 1 is a straightforward combination of the following two lemmas. The main idea is that we can expand the rr_k expression approximately as a polynomial in α, and then consider only the dominant quadratic α² term, which we can bound directly. We defer the proof of the theorem and both lemmas to the appendix.

**Lemma 1** (Binomial Theorem for Random Reshuffling). For any symmetric matrices X_1, . . . , X_n, and any constants α and β,

$$\mathrm{rr}_k(\alpha I + \beta X_1, \alpha I + \beta X_2, \ldots, \alpha I + \beta X_n) = \sum_{i=0}^{k} \binom{k}{i} \alpha^{k-i} \beta^{i} \, \mathrm{rr}_i(X_1, X_2, \ldots, X_n).$$

**Lemma 2.** For any symmetric matrices X_1, . . . , X_n ∈ R^{d×d}, and for any u ∈ R^d such that ‖u‖ = 1,

$$u^T \left( \mathrm{rr}_2(X_1, \ldots, X_n) \right) u \le \left\| \mathrm{rr}_1(X_1, \ldots, X_n) \right\|^2,$$

and equality can hold only if there exists a λ ∈ R such that for all i ∈ {1, . . . , n}, u is an eigenvector of X_i with eigenvalue λ, that is, X_i u = λu.

We can use Theorem 1 to show that random reshuffling must outperform standard with-replacement sampling under slowly decaying step size schemes for noiseless convex quadratic problems, which can be thought of as a simplified model for optimization problems satisfying the strong growth condition of Schmidt and Le Roux [21]. Specifically, we study the following simplified class of problems.

**Definition 1.** We say that a set of loss functions f_1, . . . , f_n is a noiseless convex quadratic problem if the following conditions hold.

- (Noiselessness.) There exists a unique w* ∈ R^d such that for all i, ∇f_i(w*) = 0.
- (Convex quadratics.) Each loss function f_i is a convex quadratic f_i(w) = w^T H_i w / 2 + b_i^T w.
- (Lipschitz gradients.) For some L, each loss function f_i satisfies ∇²f_i(w) ⪯ LI.

Additionally, we say that the problem is non-trivial if the Hessian matrices H_i = ∇²f_i(w*) share no eigenvalue/eigenvector pairs, i.e. there is no (λ, u) with u ≠ 0 such that for all i, H_i u = λu.

Note that the last condition here is designed to rule out trivial cases such as all the loss functions f_i being the same. Such trivial cases only happen on a set of measure 0 within the space of all possible loss functions f_1, . . . , f_n and initializations, so it is reasonable for us to exclude them. The class of problems described by Definition 1 includes the setting of the randomized Kaczmarz method originally studied in Recht and Ré [19], as well as well-known tasks such as linear regression and ridge regression.
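Lemma 1 is easy to sanity-check numerically. The sketch below (not from the paper) compares the two sides of the identity on a few random symmetric matrices; the small values of n, d, and k are arbitrary.

```python
import itertools
import math
import numpy as np

def rr(k, mats):
    """rr_k: average over all permutations of the product of the first k."""
    n, d = len(mats), mats[0].shape[0]
    total = np.zeros((d, d))
    for sigma in itertools.permutations(range(n)):
        prod = np.eye(d)
        for i in sigma[:k]:
            prod = prod @ mats[i]
        total += prod
    return total / math.factorial(n)

rng = np.random.default_rng(0)
n, d, k = 4, 3, 3
X = [(lambda M: (M + M.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]
alpha, beta = 0.7, -0.3

# Left side: rr_k applied to the shifted/scaled matrices alpha*I + beta*X_i.
lhs = rr(k, [alpha * np.eye(d) + beta * Xi for Xi in X])
# Right side: the binomial expansion in terms of rr_0, ..., rr_k.
rhs = sum(math.comb(k, i) * alpha ** (k - i) * beta ** i * rr(i, X)
          for i in range(k + 1))
print(np.max(np.abs(lhs - rhs)))   # ~1e-15: the two sides agree
```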
Theorem 1 now directly implies the following useful corollary, which says that without-replacement sampling outperforms with-replacement sampling whenever a diminishing but not square-summable step size scheme is used — including for the important case of M = 1 in Algorithm 1, which corresponds to the most common case of ordinary single-worker SGD.³

³ Note also that it should be straightforward to extend Corollary 1 to the case of nice convex losses, on the basis of the idea that any such functions must behave like quadratics locally in the neighborhood of w*. But since this would not deliver any new insight, we do not present such a result here.

**Corollary 1.** Consider Algorithm 1 on a non-trivial noiseless convex quadratic problem (using any M). Suppose that the step size scheme satisfies 0 < α_k L < 1 and is diminishing but not square-summable, i.e. lim_{k→∞} α_k = 0 and Σ_{k=1}^∞ α_k² = ∞. Then for almost all initial values w_0 ≠ w*,

$$\lim_{k \to \infty} \frac{\left\| \mathbb{E}[w_{k,\text{without-replacement}}] - w^* \right\|}{\left\| \mathbb{E}[w_{k,\text{with-replacement}}] - w^* \right\|} = 0.$$

Although it seems that Theorem 1 and Corollary 1 suggest that random reshuffling is indeed better when we allow the use of small step sizes, there is a significant limitation that would make that conclusion invalid: Corollary 1 only compares with-replacement and without-replacement sampling using the same learning rate scheme for both. Instead, when comparing two algorithms in the most fair way, we should select the best learning rate scheme for each algorithm individually. Surprisingly, when we allow this, we can give an example of a convex learning task for which with-replacement sampling, with a particular constant step size, converges faster than without-replacement sampling no matter what fixed step size scheme it uses.

The convex functions in question can be constructed directly from our counterexample of Section 2. Let f_i(w) = (1/2) w^T H_i w, where

$$H_i = \left( I - \tfrac{1}{2} A_i \right) \oplus \left( \tfrac{1}{2} \right) \oplus \left( \tfrac{3}{2} \right),$$

where ⊕ denotes the matrix direct sum (such that X ⊕ Y is a block diagonal matrix with diagonal blocks X and Y). This function must be convex because H_i is positive semidefinite (which follows from the fact that the eigenvalues of A_i are 0, 1, and 2). The main idea of this construction is to force the step size to be α = 1, since otherwise either the 1/2 or the 3/2 coordinate will end up converging at a suboptimal rate. Although a theoretical analysis of this would be straightforward, as such analysis delivers no new insight, we only validate empirically that this example works on a concrete instance. We pick n = 8, M = 100, and K = 20, and we initialize w_0 from a Gaussian with less power on the last two coordinates, which makes the effect more visible. In Figure 2 we plot the distance to optimum after K epochs for all step sizes α ∈ {0, 0.001, 0.002, . . . , 1.7}. Observe that while for smaller step sizes random reshuffling outperforms standard sampling — which validates Theorem 1 — the best convergence overall is achieved by standard with-replacement sampling. Interestingly, the averaging of Algorithm 1 is necessary for this effect to happen for this task: in Figure 2 we also display results for standard SGD on the same problem (equivalent to setting M = 1), for which without-replacement sampling seems to consistently outperform with-replacement sampling.
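The following sketch (my own illustration, assuming the block form of H_i written above) constructs the losses for n = 8 and checks the two facts the construction relies on: each H_i is positive semidefinite, and at step size α = 1 a single SGD step multiplies the first n coordinates of w by A_i/2 while contracting the two extra coordinates by exactly 1/2 in magnitude.

```python
import math
import numpy as np
from scipy.linalg import block_diag

n = 8  # number of loss functions; d = n + 2 total coordinates

def counterexample_A(n):
    """The A_i from Section 2.1."""
    mats = []
    for k in range(n):
        y = np.full(n, -1.0 / (n * math.sqrt(n - 1)))
        y[k] = math.sqrt(n - 1) / n
        mats.append(np.eye(n) + np.outer(np.ones(n), y) + np.outer(y, np.ones(n)))
    return mats

A = counterexample_A(n)
H = [block_diag(np.eye(n) - Ai / 2, [[0.5]], [[1.5]]) for Ai in A]

# Each H_i is positive semidefinite, so f_i(w) = w^T H_i w / 2 is convex.
print(min(np.linalg.eigvalsh(Hi).min() for Hi in H) >= -1e-12)

# At step size alpha = 1, an SGD step w -> w - alpha * H_i w = (I - H_i) w
# multiplies the first n coordinates by A_i / 2 and the two extra
# coordinates by 1/2 and -1/2 (so |contraction| = 1/2 for both).
alpha = 1.0
step = np.eye(n + 2) - alpha * H[0]
print(np.allclose(step[:n, :n], A[0] / 2), step[n, n], step[n + 1, n + 1])
```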
## 5 Random Reshuffling Can be Worse Asymptotically Even for SGD

While we have shown Conjecture 1 is false, this does not necessarily imply random reshuffling can be worse with stochastic gradient descent, because the averaging present in Conjecture 1 is not present in plain SGD. Our counterexamples so far do not show random reshuffling performing worse with no averaging (Figure 2), so it remains consistent with our observations so far that random reshuffling could always outperform with-replacement sampling for SGD. But is this necessarily true? In this section we will show that it is not: even for SGD without any averaging (Algorithm 1 with M = 1), we can construct a learning task for which with-replacement sampling converges strictly faster than random reshuffling — albeit one not based on a counterexample to any of the conjectures we have studied. Here, when we say it converges strictly faster, we mean that for any coupling of the two algorithms, with-replacement sampling almost surely eventually achieves lower loss than random reshuffling and its loss remains lower for all time.

The main idea behind this construction, which is based on the idea that any rank-deficient square matrix can be written as the product of three symmetric positive semidefinite matrices [24], is as follows. Consider the matrices

$$A_1 = \frac{1}{4} \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad A_2 = \frac{1}{4} \begin{bmatrix} 2 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 2 \end{bmatrix}, \quad A_3 = \frac{1}{4} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}, \quad \text{and} \quad R = \frac{uu^T}{6} \;\text{ where }\; u = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}.$$

It is easy to check that all these matrices are symmetric and positive semidefinite (and ⪯ I), and that (A_1 A_2 A_3)³ = 0. Now, consider the behavior of SGD with step size α = 1/2 on the problem where

$$f_1(w) = w^T (I - A_1) w, \quad f_2(w) = w^T (I - A_2) w, \quad f_3(w) = w^T (I - A_3) w, \quad f_4(w) = w^T (I - R) w.$$

Observe that all these functions are convex, and that choosing to do a step with f_1 has the effect of multiplying w by A_1, etc. This means that, for with-replacement sampling, if we sample our examples in the order (1, 2, 3, 1, 2, 3, 1, 2, 3), the result after running those SGD steps will be to have w = 0, regardless of what value of w we started with (because (A_1 A_2 A_3)³ = 0). Since we are guaranteed to sample that run of examples eventually, it follows that almost surely, after some finite amount of time, with-replacement SGD achieves w_t = 0 = w*, which minimizes the loss. On the other hand, this sequence of samples cannot occur for sampling without replacement. Instead, every epoch of SGD will contain an R, which disrupts the sequence. The sequence of matrices that are multiplied by w will consist of runs of A_i matrices of length no more than 6 broken up by R matrices. Because of the structure of R, for any matrix X, 6RXR = R · (u^T X u): this means that the sequences of matrices R A · · · A R appearing in the product reduce to scalar multiples of R, with scalars (up to normalization) of the form u^T A · · · A u. It is straightforward to verify numerically that this scalar is nonzero for any sequence of A matrices that can occur for sampling without replacement. So, for almost all initializations w_0, SGD with random reshuffling on this task will never reach 0, while SGD using with-replacement sampling is guaranteed to reach 0 in finite time. We conclude that with-replacement sampling converges asymptotically faster than random reshuffling for this task and step size.

In fact, we can say something even more general: among all step sizes α that satisfy αL ≤ 1, where L is the smallest constant such that each f_i has L-Lipschitz gradients, SGD with α = 1/2 using with-replacement sampling converges asymptotically faster than all other settings. That is, here with-replacement sampling is still converging asymptotically faster, even if we choose the optimal learning rates for both sampling strategies separately. We can see that this holds immediately from the fact that for 0 < αL < 1, taking a step with respect to f_i has the effect of multiplying w by a full-rank matrix; such a step can never reach 0.
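The algebraic facts this argument relies on are quick to check numerically. The sketch below (my code, using the matrices as written above) verifies positive semidefiniteness, the identity (A_1A_2A_3)³ = 0, the R-sandwich identity, and that u^T Q u ≠ 0 for every product Q of A-matrices that can appear between two consecutive R steps under random reshuffling.

```python
import itertools
import numpy as np

A = [np.array([[2, 1, 1], [1, 2, 1], [1, 1, 1]]) / 4,
     np.array([[2, 0, -1], [0, 1, -1], [-1, -1, 2]]) / 4,
     np.array([[0, 0, 0], [0, 1, 1], [0, 1, 2]]) / 4]
u = np.array([1.0, 2.0, 1.0])
R = np.outer(u, u) / 6

# All four matrices are symmetric PSD with largest eigenvalue at most 1.
for M in A + [R]:
    ev = np.linalg.eigvalsh(M)
    assert ev.min() > -1e-12 and ev.max() <= 1 + 1e-12

# The product A1 A2 A3 is nilpotent, so (A1 A2 A3)^3 is the zero matrix.
print(np.linalg.matrix_power(A[0] @ A[1] @ A[2], 3))

# The R-sandwich identity 6 R X R = (u^T X u) R for any X.
X = np.random.default_rng(0).standard_normal((3, 3))
print(np.allclose(6 * R @ X @ R, (u @ X @ u) * R))

# Between two consecutive R steps, random reshuffling can only produce a
# suffix of one permutation of {A1, A2, A3} followed by a prefix of the
# next; the claim is that every such product Q has u^T Q u != 0.
def chunks():
    perms = list(itertools.permutations(range(3)))
    for p, q in itertools.product(perms, perms):
        for i in range(4):
            for j in range(4):
                yield list(p[i:]) + list(q[:j])

vals = []
for seq in chunks():
    Q = np.eye(3)
    for i in seq:
        Q = Q @ A[i]
    vals.append(u @ Q @ u)
print(min(abs(v) for v in vals))  # should be strictly positive
```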
We explore this task empirically in Figure 3, where we ran a thousand epochs of SGD using both with- and without-replacement sampling on the example task we constructed in this section. Observe that for every run (we ran 20 independent runs), with-replacement sampling starts off a little worse, but quickly moves to zero loss, while without-replacement sampling continues to have nonzero loss for all time. Note that we needed to use exact arithmetic and look at points very close to the optimum for this effect to be observable: the figure ranges down to losses of 10^{−4000}. Although not practical, this experiment does conclusively illustrate empirically that it is possible for with-replacement SGD to converge asymptotically faster than random reshuffling, even when no averaging is used and when step sizes are chosen optimally for both algorithms.⁴ In future work, it may be interesting to study whether and to what extent this sort of effect can occur in real learning tasks.

⁴ Observe that this example could also be applied to a noisy dataset case with SVRG [13], showing that the phenomenon of random reshuffling sometimes being worse is limited neither to SGD nor to noiseless datasets. Since the transformation is straightforward, it is left as an exercise for the reader.

## 6 Conclusion

In this paper, we compared random reshuffling to with-replacement sampling for stochastic learning algorithms on noiseless problems. We found a counterexample to two longstanding conjectures from the literature (Conjectures 1 and 2) that would have implied random reshuffling is always no worse for many learning algorithms. Using this counterexample, we constructed concrete learning tasks for which with-replacement sampling outperforms without-replacement sampling in a way that can be observed empirically, even when the step size is allowed to vary. This shows that, contrary to folklore, random reshuffling can actually cause learning algorithms to converge asymptotically slower than with-replacement sampling on a particular problem. New insights will be required to develop theory that gets around our counterexamples to explain why random reshuffling appears to consistently perform better on individual tasks in practice, even for noiseless problems. We hope that this work will bring us closer to a deeper understanding of the effect of scan order in machine learning.

## Broader Impact

We expect that the counterexamples presented in this work will have an impact on the ML theory community as we further try to understand the effect of scan order in large-scale optimization. We hope that these counterexamples will help guide future researchers towards proving more variants of Conjectures 1 and 2 that are true. Beyond this impact on the ML community, this work is primarily theoretical and does not present any foreseeable societal consequence.

## Funding Disclosure

No external funding (apart from usual faculty support by Cornell University) was used in support of this work. The author has an engagement with SambaNova Systems, but funding from SambaNova was not used to directly support this work.

## References

[1] Rima Alaifari, Xiuyuan Cheng, Lillian B. Pierce, and Stefan Steinerberger. On matrix rearrangement inequalities. arXiv preprint arXiv:1904.05239, 2019.

[2] Wafaa Albar, Marius Junge, and Mingyu Zhao.
Noncommutative versions of the arithmetic-geometric mean inequality. arXiv preprint arXiv:1703.00546, 2017.

[3] Wafaa Albar, Marius Junge, and Mingyu Zhao. On the symmetrized arithmetic-geometric mean inequality for operators. arXiv preprint arXiv:1803.02435, 2018.

[4] Rajendra Bhatia and Priyanka Grover. Norm inequalities related to the matrix geometric mean. Linear Algebra and its Applications, 437(2):726–733, 2012.

[5] Rajendra Bhatia and Fuad Kittaneh. The matrix arithmetic-geometric mean inequality revisited. Linear Algebra and its Applications, 428(8-9):2177–2191, 2008.

[6] Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[7] John C. Duchi. Commentary on "Towards a noncommutative arithmetic-geometric mean inequality" by B. Recht and C. Ré. Journal of Machine Learning Research, 23:11.25, 2012.

[8] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 2019.

[9] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pages 11080–11092, 2019.

[10] Jeff Z. HaoChen and Suvrit Sra. Random shuffling beats SGD after finite epochs. arXiv preprint arXiv:1806.10077, 2018.

[11] Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Scan order in Gibbs sampling: Models in which it matters and bounds on how much. NIPS, 2016.

[12] Arie Israel, Felix Krahmer, and Rachel Ward. An arithmetic-geometric mean inequality for products of three matrices. Linear Algebra and its Applications, 488:1–12, 2016.

[13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[14] Zehua Lai and Lek-Heng Lim. Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. ICML, 2020.

[15] Qi Meng, Wei Chen, Yue Wang, Zhi-Ming Ma, and Tie-Yan Liu. Convergence analysis of distributed stochastic gradient descent with shuffling. Neurocomputing, 337:46–57, 2019.

[16] Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Proceedings of Machine Learning Research, volume 97, pages 4703–4711, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/nagaraj19a.html.

[17] Peter Oswald and Weiqi Zhou. Random reordering in SOR-type methods. Numerische Mathematik, 135(4):1207–1220, 2017.

[18] Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. arXiv preprint arXiv:2002.10400, 2020.

[19] Benjamin Recht and Christopher Ré. Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Conference on Learning Theory, 2012.

[20] Itay Safran and Ohad Shamir. How good is SGD with random shuffling? arXiv preprint arXiv:1908.00045, 2019.

[21] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

[22] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.

[23] Thomas Strohmer and Roman Vershynin.
A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.

[24] Pei Yuan Wu. Products of positive semidefinite matrices. Linear Algebra and its Applications, 111:53–61, 1988.

[25] Bicheng Ying, Kun Yuan, and Ali H. Sayed. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2(3):6, 2017.

[26] Bicheng Ying, Kun Yuan, and Ali H. Sayed. Variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2017.

[27] Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H. Sayed. On the performance of random reshuffling in stochastic learning. In 2017 Information Theory and Applications Workshop (ITA), pages 1–5. IEEE, 2017.

[28] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? OptML: Optimization Methods for the Next Generation of Machine Learning, 2016.

[29] Teng Zhang. A note on the non-commutative arithmetic-geometric mean inequality. arXiv preprint arXiv:1411.5058, 2014.