# Permutation-Based SGD: Is Random Optimal?

Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos
University of Wisconsin-Madison

Published as a conference paper at ICLR 2022

**Abstract.** A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling. However, is random optimal? We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent. We first show that for 1-dimensional strongly convex functions with smooth second derivatives, there exist permutations that offer exponentially faster convergence compared to random. However, for general strongly convex functions, random permutations are optimal. Finally, we show that for quadratic, strongly convex functions, there are easy-to-construct permutations that lead to accelerated convergence compared to random. Our results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.

## 1 INTRODUCTION

Finite-sum optimization seeks to solve the following problem:

$$\min_{\mathbf{x}} \; F(\mathbf{x}) := \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}). \tag{1}$$

Stochastic gradient descent (SGD) approximately solves finite-sum problems by iteratively updating the optimization variables according to the rule

$$\mathbf{x}_{t+1} := \mathbf{x}_t - \alpha \nabla f_{\sigma_t}(\mathbf{x}_t), \tag{2}$$

where $\alpha$ is the step size and $\sigma_t \in [n] = \{1, 2, \ldots, n\}$ is the index of the function sampled at iteration $t$. There exist various ways of sampling $\sigma_t$, the most common being with- and without-replacement sampling. In the former, $\sigma_t$ is chosen uniformly at random from $[n]$; in the latter, $\sigma_t$ is the $t$-th element of a random permutation of $[n]$.
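Concretely, the two sampling schemes differ only in how the index sequence is drawn. The following minimal 1-D sketch illustrates this; the quadratic components and all constants are illustrative and not from the paper:

```python
import random

def sgd_epoch(x, grads, alpha, order):
    # One pass over the component gradients in the given index order.
    for i in order:
        x = x - alpha * grads[i](x)
    return x

# Illustrative components: f_i(x) = (x - c_i)^2 / 2, so grad f_i(x) = x - c_i.
c = [-1.0, 0.5, 0.5]                     # minimizer of F is mean(c) = 0
grads = [lambda x, ci=ci: x - ci for ci in c]
n, alpha, K = len(c), 0.1, 200

# With-replacement ("vanilla") SGD: sample an index independently at every step.
x_vanilla = 1.0
for _ in range(K):
    x_vanilla = sgd_epoch(x_vanilla, grads, alpha,
                          [random.randrange(n) for _ in range(n)])

# Without-replacement (permutation-based) SGD: one fresh permutation per epoch.
x_perm = 1.0
for _ in range(K):
    order = list(range(n))
    random.shuffle(order)
    x_perm = sgd_epoch(x_perm, grads, alpha, order)
```

Permutation-based SGD visits every component exactly once per epoch, which is what makes its last-iterate bias analyzable epoch by epoch.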
We henceforth refer to these two SGD variants as vanilla and permutation-based, respectively. Although permutation-based SGD has been widely observed to perform better in practice (Bottou, 2009; Recht & Ré, 2012; 2013), the vanilla version has attracted the vast majority of theoretical analysis. This is because at each iteration the update is, in expectation, a scaled version of the true gradient, which allows for simple performance analyses of the algorithm, e.g., see (Bubeck et al., 2015). Permutation-based SGD resisted a tight analysis for a long time, but a recent line of breakthrough results provides the first tight convergence guarantees for several classes of convex functions $F$ (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020; Mishchenko et al., 2020; Ahn et al., 2020; Nguyen et al., 2020). These studies mainly focus on two variants of permutation-based SGD, where (1) a new random permutation is sampled at each epoch (also known as RANDOM RESHUFFLE) (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020), and (2) a random permutation is sampled once and reused throughout all SGD epochs (SINGLE SHUFFLE) (Safran & Shamir, 2019; Mishchenko et al., 2020; Ahn et al., 2020). (Correspondence to Shashank Rajput: rajput3@wisc.edu.)

Perhaps interestingly, RANDOM RESHUFFLE and SINGLE SHUFFLE exhibit different convergence rates, and a performance gap that varies across function classes. In particular, when run for $K$ epochs on strongly convex functions, the convergence rate is $\tilde{O}(1/nK^2)$ for both RANDOM RESHUFFLE and SINGLE SHUFFLE (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020). However, when run specifically on strongly convex quadratics, RANDOM RESHUFFLE experiences an acceleration of rates, whereas SINGLE SHUFFLE does not (Safran & Shamir, 2019; Rajput et al., 2020; Ahn et al., 2020; Mishchenko et al., 2020).
All the above rates are complemented by matching lower bounds, at least up to constants and sometimes log factors (Safran & Shamir, 2019; Rajput et al., 2020).

Algorithm 1: Permutation-based SGD variants
    Input: initialization x^1_0, step size α, number of epochs K
    σ = a random permutation of [n]
    for k = 1, ..., K do
        if IGD then
            σ_k = (1, 2, ..., n)
        else if SINGLE SHUFFLE then
            σ_k = σ
        else if RANDOM RESHUFFLE then
            σ_k = a random permutation of [n]
        end if
        if FLIPFLOP and k is even then
            σ_k = reverse of σ_{k-1}
        end if
        for i = 1, ..., n do                      // epoch k
            x^k_i := x^k_{i-1} - α ∇f_{σ_k(i)}(x^k_{i-1})
        end for
        x^{k+1}_0 := x^k_n
    end for

|     | Plain (lower bound) | With FLIPFLOP (upper bound) |
|-----|---------------------|-----------------------------|
| RR  | $\Omega\big(\tfrac{1}{n^2K^2} + \tfrac{1}{nK^3}\big)$ | $\tilde{O}\big(\tfrac{1}{n^2K^2} + \tfrac{1}{nK^5}\big)$ |
| SS  | $\Omega\big(\tfrac{1}{nK^2}\big)$ | $\tilde{O}\big(\tfrac{1}{n^2K^2} + \tfrac{1}{nK^4}\big)$ |
| IGD | $\Omega\big(\tfrac{1}{K^2}\big)$ | $\tilde{O}\big(\tfrac{1}{n^2K^2} + \tfrac{1}{K^3}\big)$ |

Table 1: Convergence rates of RANDOM RESHUFFLE (RR), SINGLE SHUFFLE (SS) and INCREMENTAL GRADIENT DESCENT (IGD) on strongly convex quadratics: plain vs. FLIPFLOP. Lower bounds for the plain versions are taken from (Safran & Shamir, 2019). When $n \gg K$, that is, when the training set is much larger than the number of epochs, which arguably is the case in practice, the convergence rates of RANDOM RESHUFFLE, SINGLE SHUFFLE, and INCREMENTAL GRADIENT DESCENT are $\Omega(\tfrac{1}{nK^3})$, $\Omega(\tfrac{1}{nK^2})$, and $\Omega(\tfrac{1}{K^2})$, respectively. By combining these methods with FLIPFLOP, the convergence rates become faster, i.e., $\tilde{O}(\tfrac{1}{nK^5})$, $\tilde{O}(\tfrac{1}{nK^4})$, and $\tilde{O}(\tfrac{1}{K^3})$, respectively.

From the above we observe that reshuffling at the beginning of every epoch may not always help. Yet there are cases where RANDOM RESHUFFLE is faster than SINGLE SHUFFLE, implying that certain ways of generating permutations are better suited to certain subfamilies of functions. The goal of our paper is to take a first step into exploring the relationship between convergence rates and the particular choice of permutations.
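The variants of Algorithm 1 can be transcribed directly in Python. The sketch below is illustrative (the function and parameter names are ours, not from the paper); `grad(i, x)` is assumed to return the gradient of $f_i$ at $x$:

```python
import random

def permutation_sgd(x0, grad, n, alpha, K, variant="RR", flipflop=False):
    """Permutation-based SGD in the style of Algorithm 1.

    variant: "IGD" | "SS" | "RR".  If flipflop is True, every even epoch
    reuses the previous epoch's permutation in reverse order.
    """
    x = x0
    sigma = list(range(n))
    random.shuffle(sigma)                 # the fixed permutation for SINGLE SHUFFLE
    prev = None
    for k in range(1, K + 1):
        if variant == "IGD":
            sigma_k = list(range(n))
        elif variant == "SS":
            sigma_k = list(sigma)
        else:                             # RANDOM RESHUFFLE
            sigma_k = list(range(n))
            random.shuffle(sigma_k)
        if flipflop and k % 2 == 0:
            sigma_k = list(reversed(prev))  # flip the previous epoch's order
        for i in sigma_k:                 # one epoch
            x = x - alpha * grad(i, x)
        prev = sigma_k
    return x
```

For example, with `grad = lambda i, x: x - c[i]` this minimizes the mean-computation objective $\frac{1}{n}\sum_i (x - c_i)^2/2$, and all variants drive the iterate close to the mean of `c`.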
We are particularly interested in understanding whether random permutations are as good as optimal, or whether SGD can experience faster rates with carefully crafted permutations. As we see in the following, the answer is not straightforward, and depends heavily on the function class at hand.

Our Contributions: We define permutation-based SGD to be any variant of the iterates in (2) where a permutation of the $n$ functions is generated at the start of each epoch, whether deterministically, randomly, or by a combination of the two. For example, SINGLE SHUFFLE, RANDOM RESHUFFLE, and INCREMENTAL GRADIENT DESCENT (IGD) are all permutation-based SGD variants (see Algorithm 1).

We first want to understand: even in the absence of computational constraints on picking the optimal permutations, what is the fastest rate one can get for permutation-based SGD? In other words, are there permutations that are better than random in the eyes of SGD? Perhaps surprisingly, we show that there exist permutations that may offer up to exponentially faster convergence than random permutations, but for a limited set of functions. Specifically, we show this for 1-dimensional functions (Theorem 1). However, such exponential improvement is no longer possible in higher dimensions (Theorem 2), or for general strongly convex objectives (Theorem 3), where random is optimal. These results highlight that an analysis of how permutations affect convergence rates needs to be nuanced enough to account for the structure of the functions at hand. Otherwise, in the absence of further assumptions, random permutations may simply appear to be as good as optimal. In this work, we further identify a subfamily of convex functions for which there exist easy-to-generate permutations that lead to accelerated convergence.
We specifically introduce a new technique, FLIPFLOP, which can be used in conjunction with existing permutation-based methods, e.g., RANDOM RESHUFFLE, SINGLE SHUFFLE, or INCREMENTAL GRADIENT DESCENT, to provably improve their convergence on quadratic functions (Theorems 4, 5, and 6). The way FLIPFLOP works is rather simple: every even epoch uses the flipped (or reversed) version of the previous epoch's permutation. The intuition behind why FLIPFLOP leads to faster convergence is as follows. Towards the end of an epoch, the contribution of earlier gradients gets attenuated. To counter this, we flip the permutation for the next epoch, so that every function's contribution is diluted (approximately) equally over the course of two consecutive epochs. FLIPFLOP demonstrates that finding better permutations for specific classes of functions might be computationally easy. We summarize FLIPFLOP's convergence rates in Table 1 and report the results of numerical verification in Section 6.2.

Note that in this work, we focus on the dependence of the error on the number of iterations, and in particular the number of epochs. We acknowledge that the dependence on other parameters, like the condition number, is also very important; we leave such analysis for future work.

Notation: We use lowercase for scalars ($a$), lowercase boldface for vectors ($\mathbf{a}$), and uppercase boldface for matrices ($\mathbf{A}$).

## 2 RELATED WORK

Gürbüzbalaban et al. (2019a;b) provided the first theoretical results establishing that RANDOM RESHUFFLE and INCREMENTAL GRADIENT DESCENT (and hence SINGLE SHUFFLE) are indeed faster than vanilla SGD, as they offer an asymptotic rate of $O(1/K^2)$ for strongly convex functions, which beats the $O(1/nK)$ convergence rate of vanilla SGD when $K = \Omega(n)$. Shamir (2016) used techniques from online learning and transductive learning theory to prove an optimal convergence rate of $\tilde{O}(1/n)$ for the first epoch of RANDOM RESHUFFLE (and hence SINGLE SHUFFLE).
Later, Haochen & Sra (2019) established a non-asymptotic convergence rate of $\tilde{O}(\tfrac{1}{n^2K^2} + \tfrac{1}{K^3})$ when the objective function is quadratic or has a smooth Hessian. Nagaraj et al. (2019) used a very interesting iterate-coupling approach to give a new upper bound on the error of RANDOM RESHUFFLE, thus proving for the first time that for general strongly convex smooth functions, it converges faster than vanilla SGD in all regimes of $n$ and $K$. This was followed by (Safran & Shamir, 2019), where the authors established the first lower bounds, in terms of both $n$ and $K$, for RANDOM RESHUFFLE. However, there was a gap between these upper and lower bounds. The gap was closed by Rajput et al. (2020), who showed that the upper bound given by Nagaraj et al. (2019) and the lower bound given by Safran & Shamir (2019) were both tight up to logarithmic terms. For SINGLE SHUFFLE, Mishchenko et al. (2020) and Ahn et al. (2020) showed an upper bound of $\tilde{O}(\tfrac{1}{nK^2})$, which matched the lower bound given earlier by (Safran & Shamir, 2019) up to logarithmic terms. Ahn et al. (2020) and Mishchenko et al. (2020) also proved tight upper bounds for RANDOM RESHUFFLE, with a simpler analysis and under more relaxed assumptions than (Nagaraj et al., 2019) and (Rajput et al., 2020). In particular, the results of Ahn et al. (2020) hold under the PŁ condition and do not require individual component convexity. INCREMENTAL GRADIENT DESCENT on strongly convex functions has also been well studied in the literature (Nedić & Bertsekas, 2001; Bertsekas, 2011; Gürbüzbalaban et al., 2019a). More recently, Nguyen et al. (2020) provided a unified analysis for all permutation-based algorithms. The dependence of their convergence rates on the number of epochs $K$ is also optimal for INCREMENTAL GRADIENT DESCENT, SINGLE SHUFFLE, and RANDOM RESHUFFLE.
There has also been some recent work on the analysis of RANDOM RESHUFFLE on non-strongly convex and non-convex functions. Specifically, Nguyen et al. (2020) and Mishchenko et al. (2020) show that even there, RANDOM RESHUFFLE outperforms SGD under certain conditions. Mishchenko et al. (2020) show that RANDOM RESHUFFLE and SINGLE SHUFFLE beat vanilla SGD on non-strongly convex functions after $\Omega(n)$ epochs, and that RANDOM RESHUFFLE is faster than vanilla SGD on non-convex objectives if the desired error is $O(1/\sqrt{n})$. Speeding up convergence by combining without-replacement sampling with other techniques, like variance reduction (Shamir, 2016; Ying et al., 2020) and momentum (Tran et al., 2020), has also received some attention. In this work, we focus solely on the power of good permutations to achieve fast convergence.

## 3 PRELIMINARIES

We will use combinations of the following assumptions:

Assumption 1 (Component convexity). The $f_i(\mathbf{x})$'s are convex.

Assumption 2 (Component smoothness). The $f_i(\mathbf{x})$'s are $L$-smooth, i.e., $\forall \mathbf{x}, \mathbf{y}: \|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$.

Note that Assumption 2 immediately implies that $F$ also has $L$-Lipschitz gradients: $\forall \mathbf{x}, \mathbf{y}: \|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$.

Assumption 3 (Objective strong convexity). $F$ is $\mu$-strongly convex, i.e., $\forall \mathbf{x}, \mathbf{y}: F(\mathbf{y}) \ge F(\mathbf{x}) + \langle \nabla F(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\mu}{2}\|\mathbf{y} - \mathbf{x}\|^2$.

Note that Assumption 3 implies

$$\forall \mathbf{x}, \mathbf{y}: \langle \nabla F(\mathbf{x}) - \nabla F(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle \ge \mu \|\mathbf{y} - \mathbf{x}\|^2. \tag{3}$$

We denote the condition number by $\kappa := L/\mu$; it is easy to see that $\kappa \ge 1$ always. Let $\mathbf{x}^*$ denote the minimizer of Eq. (1), that is, $\mathbf{x}^* = \arg\min_{\mathbf{x}} F(\mathbf{x})$. We study permutation-based algorithms in the constant step size regime: the step size is chosen at the beginning of the algorithm and remains fixed throughout. We denote the iterate after the $i$-th iteration of the $k$-th epoch by $\mathbf{x}^k_i$; hence, the initialization point is $\mathbf{x}^1_0$. Similarly, the permutation of $(1, \ldots, n)$ used in the $k$-th epoch is denoted by $\sigma_k$, and its $i$-th ordered element by $\sigma_k(i)$.
Note that if the ambient space is 1-dimensional, we represent the iterates and the minimizer by non-bold characters, i.e., $x^k_i$ and $x^*$, to remain consistent with the notation. In the following, due to lack of space, we only provide proof sketches when possible. The full proofs of the lemmas and theorems are provided in the Appendix.

## 4 EXPONENTIAL CONVERGENCE IN 1 DIMENSION

In this section, we show that for Hessian-smooth 1-dimensional functions, there exist permutations that lead to exponentially faster convergence compared to random.

Assumption 4 (Component Hessian smoothness). The $f_i(x)$'s have $L_H$-smooth second derivatives, that is, $\forall x, y: |\nabla^2 f_i(x) - \nabla^2 f_i(y)| \le L_H |x - y|$.

We also define the following instance-dependent constants: $G^* := \max_i |\nabla f_i(x^*)|$, $D := \max\{|x^1_0 - x^*|, \frac{G^*}{2L}\}$, and $G := G^* + 2DL$.

Theorem 1. Let Assumptions 1, 2, 3, and 4 hold, and let $D$ and $G$ be as defined above. If $\alpha = \frac{\mu}{8n(L^2 + L_H G)}$, then there exists a sequence of permutations $\sigma_1, \sigma_2, \ldots, \sigma_K$ such that using those permutations from any initialization point $x^1_0$ gives the error $|x^K_n - x^*| \le (D + 4n\alpha G)e^{-CK}$, where $C = \frac{\mu^2}{16(L^2 + L_H G)}$.

Figure 1: (A graphical depiction of Theorem 1's proof sketch.) Assume that the minimizer is at the origin. The proof of Theorem 1 first shows that there exist an initialization and a sequence of permutations such that, using those, we get to the exact minimizer. Let the sequence of iterates for this run be $x^k_{\text{opt}}$. Consider a parallel run which uses the same sequence of permutations but an arbitrary initialization point; let this sequence be $x^k$. The figure shows how $x^k_{\text{opt}}$ converges to the exact optimum, and how the distance between $x^k_{\text{opt}}$ and $x^k$ decreases exponentially, leading to exponential convergence for $x^k$.

An important thing to notice in the theorem statement is that the sequence of permutations $\sigma_1, \sigma_2, \ldots, \sigma_K$ depends only on the function, and not on the initialization point $x^1_0$. This implies that for any such function, there exists a sequence of permutations $\sigma_1, \sigma_2, \ldots, \sigma_K$ that gives exponentially fast convergence, unconditionally of the initialization. Note that the convergence rate is slower than that of gradient descent, for which the constant $C$ would be larger. However, here we are purely interested in the convergence rates of the best permutations and their (asymptotic) dependence on $K$.

Proof sketch. The core idea is to establish that there exist an initialization point $x^1_0$ (close to the minimizer $x^*$) and a sequence of permutations such that starting from $x^1_0$ and using that sequence of permutations leads exactly to the minimizer. Once we have proved this, we show that if two parallel runs of the optimization process are initialized from two different iterates, and they are coupled so that they use the exact same permutations, then they approach each other at an exponential rate. Thus, using the same permutations from any initialization point, the iterates converge to the minimizer at an exponential rate. See Figure 1 for a graphical depiction of this sketch. We note that the figure is not the result of an actual run, but only serves to explain the proof sketch.

## 5 LOWER BOUNDS FOR PERMUTATION-BASED SGD

The result in the previous section leads us to wonder whether exponentially fast convergence can also be achieved in higher dimensions. Unfortunately, for higher dimensions there exist strongly convex quadratic functions for which no sequence of permutations leads to an exponential convergence rate. This is formalized in the following theorem.

Theorem 2.
For any $n \ge 4$ ($n$ must be even), there exists a $(2n+1)$-dimensional strongly convex function $F$ that is the mean of $n$ convex quadratic functions, such that every permutation-based algorithm with any constant step size satisfies $\|\mathbf{x}^K_n - \mathbf{x}^*\|^2 = \Omega\big(\tfrac{1}{n^3K^2}\big)$.

This theorem shows that we cannot hope to develop constant step size algorithms that exhibit exponentially fast convergence in multiple dimensions.

Proof sketch. Here we give a proof sketch for a simpler version of the theorem, which works in 2 dimensions, for $n = 2$, and when the step size is $\alpha = \Omega(1/K)$. Consider $F(x, y) = \frac{1}{2}f_1(x, y) + \frac{1}{2}f_2(x, y)$ with

$$f_1(x, y) = \frac{x^2}{2} - x + y, \quad \text{and} \quad f_2(x, y) = \frac{y^2}{2} + x - y.$$

Hence, $F(x, y) = \frac{1}{4}(x^2 + y^2)$, which has its minimizer at the origin. Each epoch has two possible permutations, $\sigma = (1, 2)$ or $\sigma = (2, 1)$. Working out the details manually, it can be seen that regardless of the permutation, $x^{k+1}_0 + y^{k+1}_0 > x^k_0 + y^k_0$, that is, the sum of the coordinates keeps increasing. This can be used to lower bound the error term $\|(x^K_n, y^K_n)\|^2$.

Next, we show that even in 1 dimension, individual component convexity might be necessary to obtain faster rates than RANDOM RESHUFFLE.

Theorem 3. There exists a 1-dimensional strongly convex function $F$, the mean of two quadratic functions $f_1$ and $f_2$ of which one is non-convex, such that every permutation-based algorithm with constant step size $\alpha \le \frac{1}{L}$ incurs an error of at least $|x^K_n - x^*|^2 = \Omega\big(\tfrac{1}{K^2}\big)$.

Proof sketch. The idea behind the construction is to make one of the two component functions strongly concave. The farther away the iterate is from the concave component's maximum, the harder that component pushes the iterate away. Hence, the deviation essentially increases in each epoch, which leads to a slow convergence rate.

In the setting where the individual $f_i$ may be non-convex, Nguyen et al. (2020) and Ahn et al.
(2020) show that SINGLE SHUFFLE, RANDOM RESHUFFLE, and INCREMENTAL GRADIENT DESCENT achieve an error rate of $\tilde{O}(\tfrac{1}{K^2})$ for fixed $n$. In particular, their results only require that the component functions be smooth, and hence they apply to the function $F$ from Theorem 3. The theorem above essentially shows that when $n = 2$, this is the best possible error rate for any permutation-based algorithm, deterministic or random. Hence, at least for $n = 2$, the three algorithms are optimal when the component functions can be non-convex. Note, however, that here we only consider the dependence of the convergence rate on $K$; these algorithms may not be optimal once the joint dependence on $n$ and $K$ is taken into account. Indeed, if we also consider the dependence on $n$, INCREMENTAL GRADIENT DESCENT has a convergence rate of $\Omega(1/K^2)$ (Safran & Shamir, 2019), whereas the other two achieve $\tilde{O}(1/nK^2)$ (Ahn et al., 2020).

## 6 FLIPPING PERMUTATIONS FOR FASTER CONVERGENCE IN QUADRATICS

In this section, we introduce a new algorithm, FLIPFLOP, which improves the convergence rates of SINGLE SHUFFLE, RANDOM RESHUFFLE, and INCREMENTAL GRADIENT DESCENT on strongly convex quadratic functions.

Assumption 5. The $f_i(\mathbf{x})$'s are quadratic.

The following theorem gives the convergence rate of FLIPFLOP WITH SINGLE SHUFFLE:

Theorem 4. If Assumptions 1, 2, 3, and 5 hold, then running FLIPFLOP WITH SINGLE SHUFFLE for $K$ epochs, where $K \ge 80\kappa^{3/2}\log(nK)\max\{1, \tfrac{\kappa}{n}\}$ is an even integer, with step size $\alpha = \frac{10\log(nK)}{\mu nK}$, gives the error

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \tilde{O}\Big(\frac{1}{n^2K^2} + \frac{1}{nK^4}\Big). \tag{4}$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla SINGLE SHUFFLE:

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \Omega\Big(\frac{1}{nK^2}\Big). \tag{5}$$

Note that both terms in Eq. (4) are smaller than the term in Eq. (5). In particular, when $n \ge K^2$ and $n$ is fixed as we vary $K$, the RHS of Eq. (5) decays as $\tilde{O}(\tfrac{1}{K^2})$, whereas the RHS of Eq.
(4) decays as $\tilde{O}(\tfrac{1}{K^4})$. Conversely, when $K^2 \ge n$ and $K$ is fixed as we vary $n$, the RHS of Eq. (5) decays as $\tilde{O}(\tfrac{1}{n})$, whereas the RHS of Eq. (4) decays as $\tilde{O}(\tfrac{1}{n^2})$. Hence, in both cases, FLIPFLOP WITH SINGLE SHUFFLE outperforms SINGLE SHUFFLE.

The next theorem shows that FLIPFLOP also improves the convergence rate of RANDOM RESHUFFLE.

Figure 2: Dependence of convergence rates on the number of epochs $K$ for quadratic functions (three panels: RANDOM RESHUFFLE vs. FLIPFLOP with RR; SINGLE SHUFFLE vs. FLIPFLOP with SS; IGD vs. FLIPFLOP with IGD). The figures show the median and inter-quartile range over 10 runs of each algorithm, with random initializations and random permutation seeds (note that SS and IGD exhibit extremely small variance). We set $n = 800$, so that $n \gg K$ and hence the higher-order terms in $K$ dominate the convergence rates. Note that both axes are in logarithmic scale.

Theorem 5. If Assumptions 1, 2, 3, and 5 hold, then running FLIPFLOP WITH RANDOM RESHUFFLE for $K$ epochs, where $K \ge 55\kappa\log(nK)\max\{1, \tfrac{\sqrt{n}}{\kappa}\}$ is an even integer, with step size $\alpha = \frac{10\log(nK)}{\mu nK}$, gives the error

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \tilde{O}\Big(\frac{1}{n^2K^2} + \frac{1}{nK^5}\Big).$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla RANDOM RESHUFFLE:

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \Omega\Big(\frac{1}{n^2K^2} + \frac{1}{nK^3}\Big).$$

Hence, we see that in the regime $n \gg K$, which happens when the number of components in the finite sum of $F$ is much larger than the number of epochs, FLIPFLOP WITH RANDOM RESHUFFLE is much faster than vanilla RANDOM RESHUFFLE. Note that the theorems above do not contradict Theorem 2, because for a fixed $n$, both theorems give a convergence rate of $\tilde{O}(1/K^2)$.
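The 2-D obstruction behind Theorem 2 is easy to simulate. The sketch below uses one consistent reconstruction of the proof sketch's component pair, $f_1(x,y) = x^2/2 - x + y$ and $f_2(x,y) = y^2/2 + x - y$ (chosen so that $F = (x^2+y^2)/4$ with minimizer at the origin; the exact pair is our assumption), and checks that the coordinate sum increases every epoch regardless of which permutation is used:

```python
import random

# Reconstructed component pair for n = 2:
# f1(x, y) = x^2/2 - x + y,  f2(x, y) = y^2/2 + x - y,  so F = (x^2 + y^2)/4.
def grad_f1(x, y):
    return (x - 1.0, 1.0)

def grad_f2(x, y):
    return (1.0, y - 1.0)

def epoch(x, y, alpha, order):
    for g in order:
        gx, gy = g(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

alpha, K = 0.1, 50
x, y = 0.0, 0.0                 # start at the minimizer of F
sums = [x + y]
for _ in range(K):
    # Either permutation induces the same recursion: s <- (1 - alpha) * s + alpha^2.
    order = random.choice([(grad_f1, grad_f2), (grad_f2, grad_f1)])
    x, y = epoch(x, y, alpha, order)
    sums.append(x + y)

# The coordinate sum x + y grows monotonically toward alpha, so the iterate is
# pushed a distance of order alpha away from the minimizer, whatever the permutations.
```

With $\alpha = \Omega(1/K)$, this drift keeps the final error at $\Omega(1/K^2)$, matching the spirit of the sketch.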
We also note that the theorems above require the number of epochs to be much larger than $\kappa$, a regime in which gradient descent performs better than with- or without-replacement SGD, so GD should be preferred over SGD in that case. However, we believe this requirement on the number of epochs is a limitation of our analysis rather than of the algorithms themselves.

Finally, the next theorem shows that FLIPFLOP improves the convergence rate of INCREMENTAL GRADIENT DESCENT.

Theorem 6. If Assumptions 1, 2, 3, and 5 hold, then running FLIPFLOP WITH INCREMENTAL GRADIENT DESCENT for $K$ epochs, where $K \ge 36\kappa\log(nK)$ is an even integer, with step size $\alpha = \frac{6\log(nK)}{\mu nK}$, gives the error

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \tilde{O}\Big(\frac{1}{n^2K^2} + \frac{1}{K^3}\Big).$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla INCREMENTAL GRADIENT DESCENT:

$$\mathbb{E}\big[\|\mathbf{x}^K_n - \mathbf{x}^*\|^2\big] = \Omega\Big(\frac{1}{K^2}\Big).$$

In the next subsection, we give a short sketch of the proofs of these theorems.

### 6.1 PROOF SKETCH

In this sketch, we consider scalar quadratic functions. The same intuition carries over to multi-dimensional quadratics, but the analysis is more involved. Let $f_i(x) := \frac{a_i x^2}{2} + b_i x + c$. Assume that $F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$ has its minimizer at $0$. This assumption is valid because it can be achieved by a simple translation of the origin (see (Safran & Shamir, 2019) for a more detailed explanation), and it implies that $\sum_{i=1}^n b_i = 0$. For the sake of this sketch, assume $x^1_0 = 0$, that is, we start at the minimizer itself. Further, without loss of generality, assume that $\sigma = (1, 2, \ldots, n)$. Then, for the last iteration of the first epoch,

$$x^1_n = x^1_{n-1} - \alpha f_n'(x^1_{n-1}) = x^1_{n-1} - \alpha(a_n x^1_{n-1} + b_n) = (1 - \alpha a_n)x^1_{n-1} - \alpha b_n.$$

Applying this recursion over all iterations of the first epoch, we get

$$x^1_n = \prod_{i=1}^n (1 - \alpha a_i)\, x^1_0 - \alpha \sum_{i=1}^n b_i \prod_{j=i+1}^n (1 - \alpha a_j). \tag{6}$$

Substituting $x^1_0 = 0$, we get

$$x^1_n = -\alpha \sum_{i=1}^n b_i \prod_{j=i+1}^n (1 - \alpha a_j). \tag{7}$$

Note that the sum above is not weighted uniformly: $b_1$ is multiplied by $\prod_{j=2}^n (1 - \alpha a_j)$, whereas $b_n$ is multiplied by $1$.
Because $(1 - \alpha a_j) < 1$, we see that $b_1$'s weight is much smaller than $b_n$'s. If the weights were all $1$, then we would get $x^1_n = -\alpha \sum_{i=1}^n b_i = 0$, i.e., we would not move away from the minimizer. Since we want to stay close to the minimizer, we want the weights of all the $b_i$ to be roughly equal. The idea behind FLIPFLOP is to add something like $-\alpha \sum_{i=1}^n b_i \prod_{j=1}^{i-1} (1 - \alpha a_j)$ in the next epoch, to counteract the bias in Eq. (7). To achieve this, we simply take the permutation that the algorithm used in the previous epoch and flip it for the next epoch. Roughly speaking, in the next epoch $b_1$ gets multiplied by $1$ whereas $b_n$ gets multiplied by $\prod_{j=1}^{n-1} (1 - \alpha a_j)$. Thus, over two epochs, both get scaled approximately the same.

The main reason that the analysis for multi-dimensional quadratics is not as simple as the 1-dimensional sketch above is that, unlike scalar multiplication, matrix multiplication is not commutative, and the AM-GM inequality does not hold in higher dimensions (Lai & Lim, 2020; De Sa, 2020). One way to bypass this inequality is to bound, for small enough $\alpha$, the norm of the product $\prod_{i=1}^n (\mathbf{I} - \alpha \mathbf{A}_i) \prod_{i=1}^n (\mathbf{I} - \alpha \mathbf{A}_{n-i+1})$, that is, a product of matrices taken in some order followed by the reverse order. Ahn et al. (2020) proved a stochastic version of such a bound (see Lemma 6 in their paper); we prove a deterministic version in Lemma 3 (in the Appendix).

### 6.2 NUMERICAL VERIFICATION

We verify the theorems numerically by running RANDOM RESHUFFLE, SINGLE SHUFFLE, and their FLIPFLOP versions on the task of mean computation. We randomly sample $n = 800$ points from a 100-dimensional sphere. Let the points be $\mathbf{x}_i$ for $i = 1, \ldots, n$. Their mean is the solution to the quadratic problem $\arg\min_{\mathbf{x}} F(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x} - \mathbf{x}_i\|^2$. We solve this problem using the given algorithms, and report the results in Figure 2. The results are plotted on a log-log graph, so that we can read off the dependence of the error on the power of $K$. Note that since the points are sampled randomly, INCREMENTAL GRADIENT DESCENT essentially becomes SINGLE SHUFFLE.
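The weight-balancing effect from the sketch can be checked numerically in one dimension. The snippet below (illustrative values; not from the paper) tracks the coefficient multiplying each $b_i$ in the iterate after two epochs, comparing two forward passes against a forward pass followed by its flipped version:

```python
import random

def bias_coeffs(a, alpha, orders):
    # With x_0 = 0 and updates x <- (1 - alpha * a_i) * x - alpha * b_i,
    # the iterate is linear in the b_i: x = sum_i w[i] * b_i. Track w directly.
    w = [0.0] * len(a)
    for order in orders:
        for i in order:
            w = [(1.0 - alpha * a[i]) * wj for wj in w]
            w[i] -= alpha
    return w

random.seed(0)
n, alpha = 10, 0.01
a = [random.uniform(0.5, 1.5) for _ in range(n)]   # component curvatures
fwd = list(range(n))

w_plain = bias_coeffs(a, alpha, [fwd, fwd])                   # e.g. IGD, two epochs
w_flip = bias_coeffs(a, alpha, [fwd, list(reversed(fwd))])    # FLIPFLOP, two epochs

spread = lambda w: max(w) - min(w)
# Flipping the second epoch equalizes the weights on the b_i, so the residual
# bias sum_i w[i] * b_i (recall sum_i b_i = 0) is much smaller.
```

The spread of the FLIPFLOP weights is an order of magnitude smaller here, which is exactly why the bias in Eq. (7) largely cancels over pairs of epochs.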
Hence, to verify Theorem 6, we need hard instances for INCREMENTAL GRADIENT DESCENT; in particular, we use the ones from Theorem 3 in (Safran & Shamir, 2019). These results are also reported on a log-log graph in Figure 2. We also tried FLIPFLOP in the training of deep neural networks, but unfortunately we did not see a big speedup there. We additionally ran experiments on logistic regression for 1-dimensional artificial data; the results are in Figure 3, and the details are in Appendix H. The code for all the experiments can be found at https://github.com/shashankrajput/flipflop .

Figure 3: Dependence of convergence rates on the number of epochs $K$ for logistic regression (three panels: RANDOM RESHUFFLE vs. FLIPFLOP with RR; SINGLE SHUFFLE vs. FLIPFLOP with SS; IGD vs. FLIPFLOP with IGD). The figures show the median and inter-quartile range over 10 runs of each algorithm, with random initializations and random permutation seeds (note that IGD exhibits extremely small variance). We set $n = 800$, so that $n \gg K$ and hence the higher-order terms in $K$ dominate the convergence rates. Note that both axes are in logarithmic scale.

### 6.3 FASTER PERMUTATIONS FOR NON-QUADRATIC OBJECTIVES

The analysis of FLIPFLOP leverages the fact that the Hessians of quadratic functions are constant. We think the analysis might extend to strongly convex functions, or even PŁ functions (which are non-convex in general), under some assumptions on the Lipschitz continuity of the Hessians, similar to how Haochen & Sra (2019) extended their analysis of quadratic functions to more general classes. A key takeaway from FLIPFLOP is that we had to understand how permutation-based SGD works specifically on quadratic functions; that is, we did a white-box analysis.
In general, we believe that for specific classes of non-convex functions (say, deep neural networks), practitioners will have to think about permutation-based SGD in a similarly white-box fashion to come up with better shuffling heuristics. In concurrent work, Lu et al. (2021) show that by greedily sorting stale gradients, one can find a permutation order that converges faster on some deep learning tasks. Hence, better-than-random permutations do exist even for deep learning.

## 7 CONCLUSION AND FUTURE WORK

In this paper, we explore the theoretical limits of permutation-based SGD for solving finite-sum optimization problems. We focus on the power of good, carefully designed permutations, and whether they can lead to a much better convergence rate than random. We prove that for 1-dimensional strongly convex functions, good sequences of permutations indeed exist and lead to convergence exponentially faster than random permutations. We also show that, unfortunately, this is not true in higher dimensions, and that for general strongly convex functions, random permutations might be optimal. However, we think that for some subfamilies of strongly convex functions, good permutations might exist and be easy to generate. Towards that end, we introduce a very simple technique, FLIPFLOP, that generates permutations leading to faster convergence on strongly convex quadratics. It is a black-box technique, that is, it does not inspect the optimization problem to come up with the permutations, and it can be implemented easily. This serves as an example that for other classes of functions, there may exist other techniques for coming up with good permutations. Finally, note that we only consider constant step sizes in this work, for both upper and lower bounds. Exploring regimes in which the step size changes, e.g., diminishing step sizes, is a very interesting open problem, which we leave for future work.
We think that the upper and lower bounds in this paper provide important insights and can help in the development of better algorithms and heuristics. We strongly believe that under nice distributional assumptions on the component functions, good heuristics for generating good permutations can exist, and this should also be investigated in future work.

## ETHICS STATEMENT

This paper explores the theoretical limits of convergence rates of variants of popular SGD-type algorithms. Thus, the authors do not think there are any ethical concerns regarding the paper.

## REPRODUCIBILITY STATEMENT

The proofs of all theoretical claims are provided in the Appendix. The assumptions and notation are specified in the Preliminaries section (Section 3). The code for all the experiments can be found at https://github.com/shashankrajput/flipflop .

## REFERENCES

Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements, 2020.

Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Christopher M. De Sa. Random reshuffling is not always better. Advances in Neural Information Processing Systems, 33, 2020.

Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo A. Parrilo. Convergence rate of incremental gradient and incremental Newton methods. SIAM Journal on Optimization, 29(4):2542-2565, 2019a.

Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A. Parrilo. Why random reshuffling beats stochastic gradient descent.
Mathematical Programming, pp. 1-36, 2019b.
Jeff HaoChen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pp. 2624-2633, 2019.
Zehua Lai and Lek-Heng Lim. Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. In International Conference on Machine Learning, pp. 5608-5617. PMLR, 2020.
Yucheng Lu, Si Yi Meng, and Christopher De Sa. A general analysis of example-selection for stochastic gradient descent. In International Conference on Learning Representations, 2021.
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. arXiv preprint arXiv:2006.05988, 2020.
Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703-4711, 2019.
Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pp. 223-264. Springer, 2001.
Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004. ISBN 978-1-4613-4691-3. doi: 10.1007/978-1-4419-8853-9. URL https://doi.org/10.1007/978-1-4419-8853-9.
Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.
Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pp. 7964-7973. PMLR, 2020.
Benjamin Recht and Christopher Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences.
arXiv preprint arXiv:1202.4184, 2012.
Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201-226, 2013.
Itay Safran and Ohad Shamir. How good is SGD with random shuffling?, 2019.
Itay Safran and Ohad Shamir. Random shuffling beats SGD only after many epochs on ill-conditioned problems. arXiv preprint arXiv:2106.06880, 2021.
M. Schneider. Probability inequalities for kernel embeddings in sampling without replacement. In AISTATS, 2016.
Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 46-54, 2016.
Trang H. Tran, Lam M. Nguyen, and Quoc Tran-Dinh. Shuffling gradient-based methods with momentum. arXiv preprint arXiv:2011.11884, 2020.
Bicheng Ying, Kun Yuan, and Ali H. Sayed. Variance-reduced stochastic learning under random reshuffling. IEEE Transactions on Signal Processing, 68:1390-1408, 2020.

A DISCUSSION AND FUTURE WORK

As this is, to the best of our knowledge, the first paper to theoretically analyze the optimality of random permutations, we limited our scope to a specific but common theoretical setting: strongly convex functions with a constant step size. We think that future work can generalize the results of this paper to non-convex settings, variable step sizes, and techniques such as variance reduction and momentum.

A.1 LOWER BOUNDS FOR VARIABLE STEP SIZES

All the existing lower bounds (to the best of our knowledge) work in the constant step size regime (Safran & Shamir, 2019; Rajput et al., 2020; Safran & Shamir, 2021). Thus, generalizing the lower bounds to variable step size algorithms would be a very important direction for future research.
However, proving lower bounds when the step sizes are not constant can be tricky, since the step size could potentially depend on the permutation and on the current iterate. A more tractable setting for proving lower bounds is when the step sizes follow a schedule over epochs, similar to what happens in practice.

A.2 FLIPFLOP ON RANDOM COORDINATE DESCENT

A shuffling scheme similar to FlipFlop has been used in random coordinate descent for faster practical convergence (see page 231 in Nocedal & Wright (2006)). This should be investigated further, both empirically and theoretically, in future work. Even though the current analysis of FlipFlop does not directly go through for random coordinate descent, we think the analysis can be adapted to work.

B PROOF OF THEOREM 1

In certain places in the proof, we need that α ≤ 1/(4nL). To see how this is satisfied, note that we have assumed α ≤ µ/(8n(L² + L_H G)) in the theorem statement. Using the inequality µ ≤ L in α ≤ µ/(8n(L² + L_H G)) gives α ≤ 1/(8n(L + L_H G/µ)) ≤ 1/(4nL). In this proof, we assume that the minimizer of F is at 0 to simplify the analysis. This assumption can be satisfied by simply translating the origin to the minimizer (Safran & Shamir, 2019). There are three main components in the proof:
1. Say that an epoch starts at the minimizer. We prove that there exists at least one pair of permutations such that, if we do two separate runs of the epoch, the first using the first permutation and the second using the second, then at the end of that epoch the iterates corresponding to the two runs end up on opposite sides of the minimizer.
2. There exist a sequence of permutations and a point in the neighborhood of the minimizer such that, initializing at that point and using this sequence of permutations, we converge exactly to the minimizer.
3.
Starting from any other point, we can couple the iterates with those from the previous component, showing that the two sequences of iterates approach each other exponentially fast.

We prove the first and second components in Subsections B.1 and B.2, and we conclude the proof in Subsection B.3, where we also prove the third component.

B.1 PERMUTATIONS IN ONE EPOCH

In this subsection, we prove that if x_0, x_1, ..., x_n are the iterates in an epoch such that x_0 = 0, then there exists a permutation of the functions such that x_n ≥ 0. By the same logic, there exists a permutation such that x_n ≤ 0. These facts give us control over the movement of the iterates across epochs. Order the gradients at the minimizer, f'_i(0), in decreasing order; WLOG assume that this order is simply f_1, f_2, ..., f_n. We claim that this permutation leads to x_n ≥ 0. We will need the following intermediate result. Let y_{i−1}, y_i be such that y_i = y_{i−1} − α f'_i(y_{i−1}). Assume α ≤ 1/L and y_{i−1} ≥ x_{i−1}. Then,

y_i − x_i = y_{i−1} − x_{i−1} − α(f'_i(y_{i−1}) − f'_i(x_{i−1})) ≥ y_{i−1} − x_{i−1} − αL(y_{i−1} − x_{i−1}) = (1 − αL)(y_{i−1} − x_{i−1}) ≥ 0, (8)

that is, y_{i−1} ≥ x_{i−1} implies y_i ≥ x_i. Because 0 is the minimizer, we know that ∑_{i=1}^n f'_i(0) = 0. Also, recall that x_0 = 0. There can be two cases:
1. ∀i ∈ [1, n]: x_i < 0. This cannot be true, because if x_i < 0 for all 1 ≤ i ≤ n − 1, then x_n = −∑_{i=1}^n α f'_i(x_{i−1}) ≥ −α ∑_{i=1}^n f'_i(0) = 0, where we used the fact that f'_i(x) ≤ f'_i(y) if x ≤ y, since f_i is convex.
2. Thus, ∃i ∈ [1, n]: x_i ≥ 0.
Now, consider the sequence y_i, y_{i+1}, ..., y_n such that y_i = 0 and, for j ≥ i + 1, y_j = y_{j−1} − α f'_j(y_{j−1}). Because α ≤ 1/L and x_i ≥ y_i = 0, we get that x_j ≥ y_j for all j ≥ i (using Ineq. (8)). Hence, there is an i ≥ 1 such that x_i ≥ y_i = 0, and further x_j ≥ y_j for all j ≥ i. Next, we repeat the process above for y_i, ..., y_n. That is, there can be two cases:
1. ∀j ∈ [i + 1, n]: y_j < 0. This cannot be true, because if y_j < 0 for all i + 1 ≤ j ≤ n − 1, then y_n = −∑_{j=i+1}^n α f'_j(y_{j−1}) ≥ −α ∑_{j=i+1}^n f'_j(0) ≥ 0 (the tail sum ∑_{j=i+1}^n f'_j(0) is non-positive, because the f'_j(0) are in decreasing order and sum to zero).
2. Thus, ∃j ∈ [i + 1, n]: y_j ≥ 0.
Now, consider the sequence z_j, z_{j+1}, ..., z_n such that z_j = 0 and, for k ≥ j + 1, z_k = z_{k−1} − α f'_k(z_{k−1}). Because α ≤ 1/L and y_j ≥ z_j = 0, we get that y_k ≥ z_k for all k ≥ j (using Ineq. (8)). Hence, there exists an integer j > i > 0 such that y_j ≥ 0. We have already proved that x_j ≥ y_j; thus, x_j ≥ 0. We can continue repeating this process (applying the same two cases to z_j, z_{j+1}, ..., z_n, and so on) to get that x_n ≥ 0. We define p to be this non-negative value of x_n. The following lemma gives us that the gradients are bounded:

Lemma 1. Define G := max_i |f'_i(x*)| and D := max{|x^1_0 − x*|, G/(2L)}. If Assumptions 2 and 3 hold and α < 1/(8κnL), then for any permutation-based algorithm (deterministic or random), we have, for all i, j, k: |x^k_i − x*| ≤ 2D and |f'_j(x^k_i)| ≤ G + 2DL.

Because the gradients are bounded by G, we get that p ≤ nαG. Similarly, we can show that the reverse permutation leads to x_n ≤ 0. We define q to be this non-positive value of x_n. Because the gradients are bounded by G, we get that |q| ≤ nαG.

B.2 EXACT CONVERGENCE TO THE MINIMIZER

In this section, we show that there exists a point such that, if we initialize there and follow a specific permutation sequence, we land exactly at the minimizer. In particular, we show that there exists such a point in [4q, 4p]. We show this recursively. We prove that there exists a point m_K ∈ [4q, 4p] such that if the last epoch begins there, that is, x^K_0 = m_K, then we land at the minimizer at the end of the last epoch, that is, x^K_n = 0. Then, we show that there exists a point m_{K−1} ∈ [4q, 4p] such that if x^{K−1}_0 = m_{K−1}, then x^K_0 = x^{K−1}_n = m_K. Repeating this for K − 2, ...
, 1, we get that there exists a point m_0 ∈ [4q, 4p] such that if we initialize the first epoch there, that is, x^1_0 = m_0, then there is a permutation sequence under which ultimately x^K_n = 0. We prove that any point m_j ∈ [4q, 4p] can be reached at the end of an epoch by beginning the epoch at some point m_{j−1} ∈ [4q, 4p]; that is, if x^{j−1}_0 = m_{j−1}, then x^{j−1}_n = m_j.

Case: m_j ∈ [p, 4p]. In this case, we show that m_{j−1} ∈ [0, 4p]. We proved in the previous subsection that there exists a permutation σ such that if x^{j−1}_0 = 0, then x^{j−1}_n = p. Next, we have the following helpful lemma, which we will also use later.

Lemma 2. Let x_0, x_1, ..., x_n and y_0, y_1, ..., y_n be two sequences of iterates in an epoch, both using the same permutation of the functions. If α ≤ µ/(2n(L² + L_H G)), then (1 − nαL)|y_0 − x_0| ≤ (1 − αL)^n |y_0 − x_0| ≤ |y_n − x_n| ≤ (1 − nµα/2)|y_0 − x_0|.

If we set x_0 = 4p, y_0 = 0 in Lemma 2 and follow the permutation σ, then we get x_n − y_n ∈ (x_0 − y_0)·[1 − αnL, 1 − αnµ/2], which gives x_n − p ∈ (4p − 0)·[1 − αnL, 1 − αnµ/2] (since using σ with y_0 = 0 results in y_n = p), and hence x_n ≥ p + 4p(1 − αnL) ≥ 4p, where we used α ≤ 1/(4nL) in the last step. Thus, if x^{j−1}_0 = 4p and we follow the permutation σ, then x^{j−1}_n ≥ 4p. Next, note that x^{j−1}_1 = x^{j−1}_0 − α f'_{σ(0)}(x^{j−1}_0), x^{j−1}_2 = x^{j−1}_1 − α f'_{σ(1)}(x^{j−1}_1), ..., x^{j−1}_n = x^{j−1}_{n−1} − α f'_{σ(n−1)}(x^{j−1}_{n−1}). Looking at these equations, x^{j−1}_1 is a continuous function of x^{j−1}_0, x^{j−1}_2 is a continuous function of x^{j−1}_1, and so on. Since a composition of continuous functions is continuous, x^{j−1}_n is a continuous function of x^{j−1}_0. We have shown that x^{j−1}_0 = 0 gives x^{j−1}_n = p, and x^{j−1}_0 = 4p gives x^{j−1}_n ≥ 4p. Hence, by continuity (the intermediate value theorem), for any point m_j ∈ [p, 4p] there is at least one point m_{j−1} ∈ [0, 4p] such that x^{j−1}_0 = m_{j−1} leads to x^{j−1}_n = m_j.

Case: m_j ∈ [4q, q].
We can apply the same logic as above to show that there is at least one point m_{j−1} ∈ [4q, 0] such that x^{j−1}_0 = m_{j−1} implies x^{j−1}_n = m_j.

Case: m_j ∈ [q, p]. WLOG assume that |q| < |p|. Let σ_q be the permutation such that if x^{j−1}_0 = 0 and the epoch uses this permutation, then we end up at x^{j−1}_n = q. If we set x_0 = 4p, y_0 = 0 in Lemma 2 and follow σ_q, then x_n − y_n ∈ (x_0 − y_0)·[1 − αnL, 1 − αnµ/2], which gives x_n − q ∈ (4p − 0)·[1 − αnL, 1 − αnµ/2] (since using σ_q with y_0 = 0 results in y_n = q), and hence x_n ≥ q + 4p(1 − αnL) ≥ q + 3p ≥ 2p, where we used α ≤ 1/(4nL) in the last step. Thus, if x^{j−1}_0 = 4p and we follow σ_q, then x^{j−1}_n ≥ 2p. By the same continuity argument as in the first case, there is a point m_{j−1} ∈ [0, 4p] such that x^{j−1}_0 = m_{j−1} leads to x^{j−1}_n = m_j when we use the permutation σ_q.

B.3 SAME SEQUENCE PERMUTATIONS GET CLOSER

In the previous subsection, we showed that there exist a point m_0 ∈ [4q, 4p] and a sequence of permutations σ^1, σ^2, ..., σ^K such that if x^1_0 = m_0 and epoch j uses permutation σ^j, then x^K_n = 0. In this subsection, we show that if x^1_0 is initialized at any other point with |x^1_0| ≤ D, then using the same permutations σ^1, σ^2, ..., σ^K gives |x^K_n| ≤ (D + 4nαG)e^{−µαnK/2}. For this, we repeatedly apply Lemma 2 to all K epochs. Assume that x^1_0 = ν_0 with |ν_0| ≤ D. Let y^j_i be the sequence of iterates with y^1_0 = m_0 that uses the permutation sequence σ^1, σ^2, ..., σ^K; then we know that y^K_n = 0. Let x^j_i be the sequence of iterates with x^1_0 = ν_0 that uses the same permutation sequence. Lemma 2 gives |y^1_n − x^1_n| ≤ |ν_0 − m_0|(1 − µαn/2), and thus |y^2_0 − x^2_0| ≤ |ν_0 − m_0|(1 − µαn/2). Applying Lemma 2 again gives |y^2_n − x^2_n| ≤ |ν_0 − m_0|(1 − µαn/2)². Therefore, after applying it K times, we get |y^K_n − x^K_n| ≤ |ν_0 − m_0|(1 − µαn/2)^K. Note that |x^1_0| = |ν_0| ≤ D and |y^1_0| = |m_0|, with m_0 ∈ [4q, 4p].
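The two-sided contraction of Lemma 2 is the engine of this coupling argument. On quadratic components, where each SGD step is an exact affine map, the contraction can be checked numerically. The following sketch uses hypothetical component curvatures a_i ∈ [µ, L] and offsets b_i (all values illustrative, not from the paper) to verify that two runs sharing one permutation move toward each other at a per-epoch rate between (1 − αL)^n and 1 − nµα/2.

```python
import random

# Sketch: two SGD runs through one epoch with the SAME permutation, on
# 1-D quadratic components f_i(x) = (a_i/2) x^2 + b_i x with mu <= a_i <= L.
# For quadratics, the gap contracts exactly by prod_i (1 - alpha * a_i).
random.seed(0)
n, mu, L, alpha = 10, 0.5, 2.0, 0.001
a = [random.uniform(mu, L) for _ in range(n)]   # hypothetical curvatures
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

def epoch(x, perm):
    for i in perm:
        x -= alpha * (a[i] * x + b[i])  # gradient step on component i
    return x

perm = list(range(n))
random.shuffle(perm)
x0, y0 = 1.0, -1.0
xn, yn = epoch(x0, perm), epoch(y0, perm)
ratio = (yn - xn) / (y0 - x0)

# Two-sided contraction in the spirit of Lemma 2.
assert (1 - alpha * L) ** n <= ratio <= 1 - n * mu * alpha / 2
```

Because the two runs share the permutation, the linear-offset terms cancel in the difference and only the contraction factor survives, which is exactly why the coupling across epochs shrinks geometrically.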
We showed earlier, in Subsection B.1, that |p| ≤ nαG and |q| ≤ nαG. Therefore, |y^K_n − x^K_n| ≤ (D + 4nαG)(1 − µαn/2)^K. Further, we know that y^K_n = 0. Thus, |x^K_n| ≤ (D + 4nαG)(1 − µαn/2)^K ≤ (D + 4nαG)e^{−µαnK/2}. Substituting the value of α completes the proof. Next, we prove the lemmas used in this proof.

B.4 PROOF OF LEMMA 1

We restate the lemma below.

Lemma. Define G := max_i |f'_i(x*)| and D := max{|x^1_0 − x*|, G/(2L)}. If Assumptions 2 and 3 hold and α < 1/(8κnL), then for any permutation-based algorithm (deterministic or random), we have, for all i, j, k: |x^k_i − x*| ≤ 2D and |f'_j(x^k_i)| ≤ G + 2DL.

This lemma says that for any permutation-based algorithm, the iterates and the gradient norms stay bounded during the optimization. This means that we may assume bounds on the norms of the iterates and the gradients, which is not true in general for unconstrained SGD. This makes analyzing such algorithms much easier, and hence this lemma may be of independent interest for proving future results for permutation-based algorithms.

Remark 1. This lemma does not hold in general for vanilla SGD, where sampling is done with replacement. Consider the example with two functions f_1(x) = x² − x and f_2(x) = x, with F(x) = ½(f_1(x) + f_2(x)). This satisfies Assumptions 2 and 3, but one may choose f_2 consecutively for arbitrarily many iterations, which leads the iterates a proportional distance away from the minimizer. This situation can never happen for permutation-based SGD: every function is seen exactly once in every epoch, so no particular function can attract the iterates too much towards its own minimizer, and by the end of the epoch most of the noise cancels out.

Note that the requirement α < 1/(8κnL) is stricter than the usual requirement α = O(1/(nL)), but we believe the lemma should also hold in that regime. For the current paper, however, this stronger requirement suffices.

Proof.
To prove this lemma, we show two facts: if for some epoch k, |x^k_0 − x*| ≤ D, then a) for all i, |x^k_i − x*| ≤ 2D, and b) |x^{k+1}_0 − x*| = |x^k_n − x*| ≤ D. To see why a) and b) are sufficient to prove the lemma, assume that they are true. Since the first epoch begins inside the bounded region, |x^1_0 − x*| ≤ D, fact b) implies that every subsequent epoch begins inside the same bounded region, that is, |x^k_0 − x*| ≤ D as well. Hence, using a), the iterates satisfy |x^k_i − x*| ≤ 2D throughout, which is the first part of the lemma. This bound, together with gradient Lipschitzness, directly gives the upper bound G + 2DL on the gradients. Thus, all we need to do is prove a) and b), which we do next.

We will prove a) and b) for D = max{|x^1_0 − x*|, 2κnαG/(1 − 4κnαL)}. Once we do this, using α < 1/(8κnL) gives the exact statement of the lemma. Let |x^k_0 − x*| ≤ D for some epoch k. We find the minimum number of iterations i needed so that |x^k_i − x*| > 2D. Within this region, the gradient is bounded by G + 2DL, so each iteration moves the iterate by at most α(G + 2DL), and the minimum number of iterations needed is at least D/(α(G + 2DL)). However,

D/(α(G + 2DL)) = 1/(α(G/D + 2L)) ≥ 1/(α((1 − 4κnαL)/(2κnα) + 2L)) (using the fact that D ≥ 2κnαG/(1 − 4κnαL)) = 2κn/((1 − 4κnαL) + 4κnαL) = 2κn > n.

Thus, the minimum number of iterations needed to leave the region |x^k_i − x*| ≤ 2D is more than the length of the epoch. This implies that within the epoch, |x^k_i − x*| ≤ 2D, which proves a). We prove b) next:

x^k_n − x* = x^k_0 − x* − α ∑_{i=0}^{n−1} f'_{σ^k_i}(x^k_i) = x^k_0 − x* − α ∑_{i=0}^{n−1} f'_{σ^k_i}(x^k_0) + α ∑_{i=0}^{n−1} (f'_{σ^k_i}(x^k_0) − f'_{σ^k_i}(x^k_i)).

Note that ∑_{i=0}^{n−1} f'_{σ^k_i}(x^k_0) is just the sum of all component gradients at x^k_0, that is, ∑_{i=0}^{n−1} f'_{σ^k_i}(x^k_0) = n F'(x^k_0). Using this, we get

|x^k_n − x*| ≤ |x^k_0 − x* − nα F'(x^k_0)| + α ∑_{i=0}^{n−1} |f'_{σ^k_i}(x^k_0) − f'_{σ^k_i}(x^k_i)| (triangle inequality)
≤ |x^k_0 − x* − nα F'(x^k_0)| + αL ∑_{i=0}^{n−1} |x^k_0 − x^k_i|, (9)

where we used gradient Lipschitzness (Assumption 2) in the last step.
To bound the first term above, we use the standard analysis of gradient descent on smooth, strongly convex functions:

|x^k_0 − x* − nα F'(x^k_0)|² = |x^k_0 − x*|² − 2nα⟨x^k_0 − x*, F'(x^k_0)⟩ + n²α²|F'(x^k_0)|²
≤ |x^k_0 − x*|² − 2nαµ|x^k_0 − x*|² + n²α²|F'(x^k_0)|² (using Ineq. (3))
= (1 − nαµ)|x^k_0 − x*|² + nα(nα|F'(x^k_0)|² − µ|x^k_0 − x*|²)
≤ (1 − nαµ)|x^k_0 − x*|² + nα(nαL²|x^k_0 − x*|² − µ|x^k_0 − x*|²) (using gradient Lipschitzness)
= (1 − nαµ)|x^k_0 − x*|² + nα(nαL² − µ)|x^k_0 − x*|²
≤ (1 − nαµ)|x^k_0 − x*|²,

where in the last step we used that α ≤ µ/(nL²), which holds since α ≤ 1/(8κnL). Substituting this inequality in Ineq. (9), we get

|x^k_n − x*| ≤ √(1 − nαµ)|x^k_0 − x*| + αL ∑_{i=0}^{n−1} |x^k_0 − x^k_i| ≤ (1 − nαµ/2)|x^k_0 − x*| + αL ∑_{i=0}^{n−1} |x^k_0 − x^k_i|.

We have already proven a), which says that the iterates satisfy |x^k_i − x*| ≤ 2D. Using gradient Lipschitzness, this implies that the gradient norms stay bounded by G + 2DL; hence, |x^k_0 − x^k_i| ≤ αi(G + 2DL). Using this,

|x^k_n − x*| ≤ (1 − nαµ/2)|x^k_0 − x*| + αL ∑_{i=0}^{n−1} αi(G + 2DL) ≤ (1 − nαµ/2)D + n²α²L(G + 2DL) ≤ D,

where we used the fact that D ≥ 2κnαG/(1 − 4κnαL) in the last step.

B.5 PROOF OF LEMMA 2

Without loss of generality, let σ = (1, 2, 3, ..., n); this is only for ease of notation, and the analysis goes through for any other permutation σ. First, we show the lower bound. WLOG assume y_0 > x_0. Because α < 1/L, we have y_i > x_i for all i by induction (see the equations below). Then,

y_i − x_i = y_{i−1} − x_{i−1} − α(f'_i(y_{i−1}) − f'_i(x_{i−1})) ≥ y_{i−1} − x_{i−1} − αL(y_{i−1} − x_{i−1}) = (1 − αL)(y_{i−1} − x_{i−1}) = ··· = (1 − αL)^i (y_0 − x_0) ≥ (1 − iαL)(y_0 − x_0). (10)

Next, we show the upper bound:

y_n − x_n = y_0 − x_0 − α ∑_{i=1}^n (f'_i(y_{i−1}) − f'_i(x_{i−1}))
= y_0 − x_0 − α ∑_{i=1}^n (f'_i(y_0) − f'_i(x_0)) + α ∑_{i=1}^n (f'_i(y_0) − f'_i(x_0) − f'_i(y_{i−1}) + f'_i(x_{i−1}))
= y_0 − x_0 − nα(F'(y_0) − F'(x_0)) + α ∑_{i=1}^n (f'_i(y_0) − f'_i(x_0) − f'_i(y_{i−1}) + f'_i(x_{i−1}))
≤ (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (f'_i(y_0) − f'_i(x_0) − f'_i(y_{i−1}) + f'_i(x_{i−1})).
(Using strong convexity in the last step.) We now use the fact that the functions are twice differentiable:

y_n − x_n ≤ (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (∫_{x_0}^{y_0} f''_i(t) dt − ∫_{x_{i−1}}^{y_{i−1}} f''_i(t) dt)
= (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (∫_{x_0}^{y_0} f''_i(t) dt − ∫_{x_{i−1}}^{x_{i−1}+(y_0−x_0)} f''_i(t) dt − ∫_{x_{i−1}+(y_0−x_0)}^{y_{i−1}} f''_i(t) dt)
= (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (∫_{x_0}^{y_0} (f''_i(t) − f''_i(x_{i−1} − x_0 + t)) dt − ∫_{x_{i−1}+(y_0−x_0)}^{y_{i−1}} f''_i(t) dt).

In the above, we used the convention that ∫_a^b f(x) dx is the same as −∫_b^a f(x) dx if a > b. Now, we can use Hessian Lipschitzness to bound the terms as follows:

y_n − x_n ≤ (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (∫_{x_0}^{y_0} L_H |x_{i−1} − x_0| dt − ∫_{x_{i−1}+(y_0−x_0)}^{y_{i−1}} f''_i(t) dt)
≤ (1 − nαµ)(y_0 − x_0) + α ∑_{i=1}^n (∫_{x_0}^{y_0} L_H Gαn dt − ∫_{x_{i−1}+(y_0−x_0)}^{y_{i−1}} f''_i(t) dt)
= (1 − nαµ)(y_0 − x_0) + L_H Gα²n²(y_0 − x_0) − α ∑_{i=1}^n ∫_{x_{i−1}+(y_0−x_0)}^{y_{i−1}} f''_i(t) dt
≤ (1 − nαµ)(y_0 − x_0) + L_H Gα²n²(y_0 − x_0) + α ∑_{i=1}^n L((y_0 − x_0) − (y_{i−1} − x_{i−1}))
≤ (1 − nαµ)(y_0 − x_0) + L_H Gα²n²(y_0 − x_0) + α ∑_{i=1}^n L(iαL)(y_0 − x_0) (using Ineq. (10))
≤ (1 − nαµ)(y_0 − x_0) + L_H Gα²n²(y_0 − x_0) + α²n²L²(y_0 − x_0).

Thus, if we have α ≤ µ/(2n(L² + L_H G)), then y_n − x_n ≤ (1 − µnα/2)(y_0 − x_0).

C PROOF OF THEOREM 2

To prove this theorem, we consider three step-size ranges and do a case-by-case analysis for each of them. We construct functions for each range such that the convergence of any permutation-based algorithm is slow for the functions on their corresponding step-size regime. The final lower bound is the minimum among the lower bounds obtained for the three regimes. In this proof, we use notation different from the rest of the paper, because we work with scalars: the superscript denotes the scalar power, not the epoch number, and we use x_{k,i} to denote the i-th iterate of the k-th epoch. We construct three functions F_1(x), F_2(y), and F_3(z), each the mean of n component functions, such that:

Any permutation-based algorithm on F_1(x) with α ∈ [1/(2(n−1)KL), 3/(nL)] and initialization x_{1,0} = 0 results in ‖x_{K,n}‖² = Ω(1/(n³K²L²)). F_1 is an n-dimensional function, that is, x ∈ R^n. This function has its minimizer at 0 ∈ R^n.
NOTE: This is the key step-size range, and the proof sketch explained in the main paper corresponds to this function's construction.

Any permutation-based algorithm on F_2(y) with α ∈ [3/(nL), 1/L] and initialization y_{1,0} = 0 results in ‖y_{K,n}‖² = Ω(1/(n²L²)). F_2 is an n-dimensional function, that is, y ∈ R^n. This function has its minimizer at 0 ∈ R^n. The construction for F_2 is also inspired by that of F_1, but it is constructed significantly differently due to the different step-size range.

Any permutation-based algorithm on F_3(z) with α ∉ [1/(2(n−1)KL), 1/L] and initialization z_{1,0} = 1 results in |z_{K,n}|² = Ω(1). F_3 is a 1-dimensional function, that is, z ∈ R. This function has its minimizer at 0.

Then, the (2n+1)-dimensional function F([x^T, y^T, z]^T) = F_1(x) + F_2(y) + F_3(z) exhibits slow convergence in every step-size regime. This function has its minimizer at 0 ∈ R^{2n+1}. Furthermore, ((n−1)L/n)·I ⪯ ∇²F_1, ∇²F_2, ∇²F_3, ∇²F ⪯ 2L·I; that is, F_1, F_2, F_3, and F are all ((n−1)L/n)-strongly convex and 2L-smooth. In the subsequent subsections, we prove the lower bounds for F_1, F_2, and F_3 separately.

NOTE: Above, we have used a specific initialization; however, in Appendix C.4, we discuss how the lower bound is in fact invariant to the initialization.

C.1 LOWER BOUND FOR F_1, α ∈ [1/(2(n−1)KL), 3/(nL)]

We work in n dimensions (n even) and represent a vector in the space as z = [x_1, y_1, ..., x_{n/2}, y_{n/2}]. These x_i and y_i are not related to the vectors used by F_2 and F_3; we only use this notation to make the proof easier to follow. We start off by defining the n component functions: for i ∈ [n/2],

f_i(z) := (L/2)x_i² − x_i + y_i + (L/2)∑_{j≠i}(x_j² + y_j²),
g_i(z) := (L/2)y_i² − y_i + x_i + (L/2)∑_{j≠i}(x_j² + y_j²).

Thus, F(z) := (1/n)(∑_{i=1}^{n/2} f_i + ∑_{i=1}^{n/2} g_i) = ((n−1)L/(2n))‖z‖². This function has its minimizer at z = 0. Let z_{k,j} denote z at the j-th iteration of the k-th epoch. We initialize at z_{1,0} = 0.
For the majority of this proof, we work inside a given epoch, so we skip the subscript denoting the epoch. We denote the j-th iterate within the epoch by z_j, with coordinates z_j = [x_{j,1}, y_{j,1}, ..., x_{j,n/2}, y_{j,n/2}]. In the current epoch, let σ be the permutation of {f_1, ..., f_{n/2}, g_1, ..., g_{n/2}} used. For any i ∈ [n/2], let p and q be the indices such that σ_p = f_i and σ_q = g_i. Let us consider the case p < q (the case p > q is analogous). Then, it can be seen that

x_{n,i} = (1 − αL)^{n−1} x_{0,i} + (1 − αL)^{n−p−1} α − (1 − αL)^{n−q} α, and
y_{n,i} = (1 − αL)^{n−1} y_{0,i} − (1 − αL)^{n−p} α + (1 − αL)^{n−q} α. (11)

Adding these,

x_{n,i} + y_{n,i} = (1 − αL)^{n−1}(x_{0,i} + y_{0,i}) + 2(1 − αL)^{n−p−1} α²L ≥ (1 − αL)^{n−1}(x_{0,i} + y_{0,i}) + 2(1 − αL)^{n−1} α²L.

In the other case, when p > q, we get the same inequality. Let x_{K,n,i} and y_{K,n,i} denote the values of x_{n,i} and y_{n,i} at the K-th epoch. Then, recalling that z was initialized to 0, we use the inequality above to get

x_{K,n,i} + y_{K,n,i} ≥ (1 − αL)^{(n−1)K} · 0 + 2(1 − αL)^{n−1} α²L · (1 − (1 − αL)^{(n−1)K})/(1 − (1 − αL)^{n−1}) = 2(1 − αL)^{n−1} α²L · (1 − (1 − αL)^{(n−1)K})/(1 − (1 − αL)^{n−1}). (12)

Since this inequality is valid for all i, we get

‖z_{K,n}‖² = ∑_{i=1}^{n/2} (x_{K,n,i}² + y_{K,n,i}²) ≥ ∑_{i=1}^{n/2} ½ (x_{K,n,i} + y_{K,n,i})² ≥ (n/2) · ½ · (2(1 − αL)^{n−1} α²L · (1 − (1 − αL)^{(n−1)K})/(1 − (1 − αL)^{n−1}))² = n ((1 − αL)^{n−1} α²L · (1 − (1 − αL)^{(n−1)K})/(1 − (1 − αL)^{n−1}))². (13)

Note that if α ≤ 3/(nL) and n ≥ 4, then (1 − αL)^{n−1} ≥ 1/8. Using this in (13), we get

‖z_{K,n}‖² ≥ (n/128) α⁴L² ((1 − (1 − αL)^{(n−1)K})/(1 − (1 − αL)^{n−1}))².

We consider two cases:
1. Case A, αL ≤ 1/(2(n+2)): It can be verified that (αL)²/(1 − (1 − αL)^{n−1}) is an increasing function of α when αL ≤ 1/(2(n+2)). Noting that we are working in the range α ≥ 1/(2(n−1)KL), we get

‖z_{K,n}‖² ≥ (n/(128L²)) · ((1/(2(n−1)K))² / (1 − (1 − 1/(2(n−1)K))^{n−1}))² · (1 − (1 − αL)^{(n−1)K})²
≥ (n/(128L²)) · ((1/(2(n−1)K))² / ((n−1)/(2(n−1)K)))² · (1 − (1 − αL)^{(n−1)K})²
= (n/(512(n−1)⁴K²L²)) · (1 − (1 − αL)^{(n−1)K})²
≥ (n/(512(n−1)⁴K²L²)) · (1 − e^{−αL(n−1)K})²
= Ω(1/(n³K²L²)),

where the last step uses αL(n−1)K ≥ 1/2 in this range.
2.
Case B, αL > 1/(2(n+2)): In this case, using 1 − (1 − αL)^{n−1} ≤ 1,

‖z_{K,n}‖² ≥ (n/(128L²)) (αL)⁴ (1 − (1 − αL)^{(n−1)K})² ≥ (n/(128L²)) (1/(2(n+2)))⁴ (1 − e^{−αL(n−1)K})² ≥ (n/(128L²)) (1/(2(n+2)))⁴ (1 − e^{−(n−1)K/(2(n+2))})² = Ω(1/(n³L²)).

C.2 LOWER BOUND FOR F_2, α ∈ [3/(nL), 1/L]

We work in n dimensions and represent a vector in the space as y = [y_1, ..., y_n]. We start off by defining the n component functions: for i ∈ [n],

f_i(y) := −y_i + ∑_{j≠i} (L y_j²/2 + y_j/(n−1)).

Thus, F(y) := (1/n)∑_{i=1}^n f_i = ((n−1)L/(2n))‖y‖². This function has its minimizer at y = 0. Let y_{k,j} denote y at the j-th iteration of the k-th epoch. We initialize at y_{1,0} = 0. For the majority of this proof, we work inside a given epoch; we denote the j-th iterate within the epoch by y_j, with coordinates y_j = [y_{j,1}, ..., y_{j,n}]. In the current epoch, let σ be the permutation of {f_1, ..., f_n} used. Let i be the index such that σ_n = f_i, that is, f_i is the last element of the permutation σ. Then, at the end of the epoch,

y_{n,i} = (1 − αL)^{n−1} y_{0,i} + α − (α/(n−1)) ∑_{j=0}^{n−2} (1 − αL)^j = (1 − αL)^{n−1} y_{0,i} + α − (1 − (1 − αL)^{n−1})/((n−1)L). (14)

For some j ∈ [n], let s be the integer such that σ_s = f_j, that is, f_j is the s-th element of the permutation σ. Then, for any j and any epoch,

y_{n,j} = (1 − αL)^{n−1} y_{0,j} + α(1 − αL)^{n−s} − (α/(n−1)) ∑_{j'=0}^{n−2} (1 − αL)^{j'} ≥ (1 − αL)^{n−1} y_{0,j} − (α/(n−1)) ∑_{j'=0}^{n−2} (1 − αL)^{j'} = (1 − αL)^{n−1} y_{0,j} − (1 − (1 − αL)^{n−1})/((n−1)L).

Note that the above is independent of σ, and hence applicable to all epochs. Applying it recursively, and noting that we initialized y_{1,0} = 0, we get

y_{n,j} ≥ −(1 − (1 − αL)^{n−1})/((n−1)L) · ∑_{k=0}^{K−1} (1 − αL)^{(n−1)k} ≥ −(1 − (1 − αL)^{n−1})/((n−1)L) · 1/(1 − (1 − αL)^{n−1}) = −1/((n−1)L).

Note that y_{0,i} is just y_{n,i} from the previous epoch. Hence, we can substitute the inequality above for y_{0,i} in (14). Thus,

y_{n,i} ≥ −(1 − αL)^{n−1}/((n−1)L) + α − (1 − (1 − αL)^{n−1})/((n−1)L) = α − 1/((n−1)L) ≥ 3/(nL) − 1/((n−1)L).

This gives us that ‖y_{k,n}‖² = Ω(1/(n²L²)) for any k.

C.3 LOWER BOUND FOR F_3, α ∉ [1/(2(n−1)KL), 1/L]

Consider the function F(z) = (1/n)∑_{i=1}^n f_i(z), where f_i(z) = Lz² for all i. Note that the gradient of any component function at z is just 2Lz. Hence, regardless of the permutation, if we start the shuffling-based gradient descent method at z_{1,0} = 1, we get that z_{K,n} = (1 − 2αL)^{nK} z_{1,0} = (1 − 2αL)^{nK}.
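The F_3 dynamics admit the closed form z_{K,n} = (1 − 2αL)^{nK}, which is easy to confirm numerically. The sketch below (with arbitrary illustrative values of n, K, and L) also exhibits the two bad regimes: a step size below 1/(2(n−1)KL) leaves z essentially untouched, while α ≥ 1/L fails to shrink |z| at all.

```python
# Sketch: F_3 from Appendix C.3, f_i(z) = L*z^2 for all i, so every SGD step
# multiplies z by (1 - 2*alpha*L) regardless of the permutation.
L, n, K = 1.0, 4, 5   # illustrative values

def run(alpha, z=1.0):
    for _ in range(n * K):          # K epochs of n identical steps
        z -= alpha * 2 * L * z      # gradient of L*z^2 is 2*L*z
    return z

alpha_small = 1 / (2 * (n - 1) * K * L)
# Simulation matches the closed form (1 - 2*alpha*L)^(n*K).
assert abs(run(alpha_small) - (1 - 2 * alpha_small * L) ** (n * K)) < 1e-12
assert run(alpha_small) > 0.2       # too small a step: little progress
assert abs(run(1.0 / L)) >= 1.0     # alpha >= 1/L: |1 - 2*alpha*L| >= 1
```

A well-tuned step size in between (for example α = 0.25/L here) does drive z toward 0, which is why the lower bound must exclude the interior of the interval.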
In the case α ≤ 1/(2(n−1)KL), we see that z_{K,n} = (1 − 2αL)^{nK} ≥ (1 − 1/((n−1)K))^{nK} ≥ 1/16 for n, K ≥ 2. Finally, in the case α ≥ 1/L, we see that |z_{K,n}| = |1 − 2αL|^{nK} ≥ 1, since |1 − 2αL| ≥ 1.

C.4 DISCUSSION ABOUT INITIALIZATION

The lower bound partitions the step sizes into three ranges. In the ranges α ∈ [1/(2(n−1)KL), 3/(nL)] and α ∈ [3/(nL), 1/L], the initialization is done at the minimizer, and it is shown that any permutation-based algorithm will still move away from the minimizer. The choice of initializing at the minimizer was solely for convenience of analysis and calculations, and the proof should work for any other initialization as well. Furthermore, the effect of initializing at an arbitrary (non-zero) point decays exponentially fast with the epochs: every epoch can be treated as n steps of full gradient descent plus some noise, and the full-gradient-descent part keeps shrinking the effect of the initialization exponentially, so what we are left with is the noise accumulated in each epoch. Thus, it was more convenient for us to assume initialization at the minimizer and focus only on the per-epoch noise. The range α ∉ [1/(2(n−1)KL), 1/L] can be divided into two parts: α ∈ [0, 1/(2(n−1)KL)] and α ∈ [1/L, ∞). For α ∈ [0, 1/(2(n−1)KL)], we essentially show that the step size is too small to make any meaningful progress towards the minimizer; hence, initializing at any other arbitrary (non-zero) point, instead of at 1, will also give the same slow convergence rate. For α ∈ [1/L, ∞), we show that the optimization algorithm in fact diverges, since the step size is too large; hence, here too, any other arbitrary (non-zero) initialization will also give divergence.

D PROOF OF THEOREM 3

Define f_1(x) := (L/2)x² − x and f_2(x) := −(L/4)x² + x. Thus, F(x) = ½(f_1(x) + f_2(x)) = (L/8)x². This function has its minimizer at x = 0.
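The two-function construction above can be simulated directly. The sketch below uses the reconstructed f_2(x) = −(L/4)x² + x (the sign making F = (L/8)x², consistent with the recursion below) and illustrative values for L, α, and the epoch count; under both epoch orderings, the iterate stalls at a positive point of order α instead of reaching the minimizer.

```python
# Sketch of the Theorem 3 example: f1(x) = (L/2)x^2 - x, f2(x) = -(L/4)x^2 + x.
# F(x) = (f1 + f2)/2 = (L/8)x^2 has its minimizer at 0, yet every permutation
# of {f1, f2} leaves the epoch-end iterate a distance of order alpha from 0.
L, alpha, K = 1.0, 0.01, 2000   # illustrative values

def grad1(x): return L * x - 1.0
def grad2(x): return -L / 2 * x + 1.0

def run(order, x=1.0 / L):
    for _ in range(K):              # K epochs over the two components
        for g in order:
            x -= alpha * g(x)
    return x

x12 = run((grad1, grad2))           # permutation (1, 2)
x21 = run((grad2, grad1))           # permutation (2, 1)
# Both runs stall at a point of order alpha > 0 instead of reaching 0.
assert x12 > alpha / 2 and x21 > alpha / 2
```

This matches the per-epoch recursion x_{0,k+1} ≥ (1 + αL/2)(1 − αL)x_{0,k} + α²L/2, whose fixed point is of order α.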
For this proof, we use the convention that x_{i,j} is the iterate after the i-th iteration of the j-th epoch. Further, a number in the superscript denotes the scalar power; for example, x_{i,j}² = x_{i,j} · x_{i,j}. Initialize at x_{0,0} = 1/L. Then, at epoch k, there are two possible permutations: σ = (1, 2) and σ = (2, 1). In the first case, σ = (1, 2), we get that after the first iteration of the epoch,

x_{1,k} = x_{0,k} − α f'_1(x_{0,k}) = (1 − αL)x_{0,k} + α.

Continuing on, in the second iteration we get

x_{2,k} = x_{1,k} − α f'_2(x_{1,k}) = (1 + αL/2)x_{1,k} − α = (1 + αL/2)((1 − αL)x_{0,k} + α) − α = (1 + αL/2)(1 − αL)x_{0,k} + α²L/2.

Note that x_{0,k+1} = x_{2,k}. Thus, x_{0,k+1} = (1 + αL/2)(1 − αL)x_{0,k} + α²L/2. Similarly, for the other possible permutation, σ = (2, 1), we get x_{0,k+1} = (1 + αL/2)(1 − αL)x_{0,k} + α²L. Thus, regardless of which permutation we use, we get that

x_{0,k+1} ≥ (1 + αL/2)(1 − αL)x_{0,k} + α²L/2 ≥ (1 − αL)x_{0,k} + α²L/2.

Hence, recalling that we initialized at x_{0,0} = 1/L,

x_{n,K} = x_{0,K+1} ≥ (1/L)(1 − αL)^K + (α²L/2) ∑_{i=0}^{K−1} (1 − αL)^i = (1/L)(1 − αL)^K + (α/2)(1 − (1 − αL)^K). (15)

Now, if α ≥ 1/(KL), then (α/2)(1 − (1 − αL)^K) ≥ (α/2)(1 − e^{−1}), and hence x_{n,K}² = Ω(1/(K²L²)). Otherwise, if α ≤ 1/(KL), then, continuing from (15), x_{n,K} ≥ (1/L)(1 − αL)^K ≥ (1/L)(1 − 1/K)^K, and hence in this case x_{n,K}² = Ω(1/L²).

E PROOF OF THEOREM 4

Let F(x) := (1/n)∑_{i=1}^n f_i(x) be such that its minimizer is at the origin. This can be assumed without loss of generality, because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019). Since the f_i are convex quadratics, we can write them as f_i(x) = ½x^T A_i x − b_i^T x + c_i, where the A_i are symmetric, positive semi-definite matrices. We can omit the constants c_i because they affect neither the minimizer nor the gradients. Because we assume that the minimizer of F(x) is at the origin, we get that

∑_{i=1}^n b_i = 0. (16)

Let σ = (σ_1, σ_2, ..., σ_n) be the random permutation of (1, 2, ..., n) generated at the beginning of the algorithm. Then, for k ∈ (1, 2, ...
, K/2), epoch 2k − 1 sees the n functions in the sequence (½x^T A_{σ_1} x − b_{σ_1}^T x, ½x^T A_{σ_2} x − b_{σ_2}^T x, ..., ½x^T A_{σ_n} x − b_{σ_n}^T x), whereas epoch 2k sees the n functions in the reverse sequence (½x^T A_{σ_n} x − b_{σ_n}^T x, ½x^T A_{σ_{n−1}} x − b_{σ_{n−1}}^T x, ..., ½x^T A_{σ_1} x − b_{σ_1}^T x). We define S_i := αA_{σ_i} and t_i := αb_{σ_i} for convenience of notation. We start off by computing the progress made during an even-indexed epoch. Since the even epochs use the reverse permutation, we get

x^{2k+1}_0 = x^{2k}_n = x^{2k}_{n−1} − α(A_{σ_1} x^{2k}_{n−1} − b_{σ_1}) (f_{σ_1} is used at the last iteration of even epochs)
= (I − αA_{σ_1}) x^{2k}_{n−1} + αb_{σ_1} = (I − S_1) x^{2k}_{n−1} + t_1.

We recursively apply the same procedure to the whole epoch to get

x^{2k+1}_0 = (I − S_1) x^{2k}_{n−1} + t_1 = (I − S_1)((I − S_2) x^{2k}_{n−2} + t_2) + t_1 = (I − S_1)(I − S_2) x^{2k}_{n−2} + (I − S_1) t_2 + t_1 = ··· = ∏_{i=1}^n (I − S_i) x^{2k}_0 + ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_j)) t_{n+1−i}, (17)

where the product of matrices {M_i} is defined as ∏_{i=l}^m M_i = M_l M_{l+1} ··· M_m if m ≥ l, and the identity otherwise. Similar to Eq. (17), we can compute the progress made during an odd-indexed epoch. Recall that the only difference is that the odd-indexed epochs see the permutation in the order (σ_1, σ_2, ..., σ_n) instead of (σ_n, σ_{n−1}, ..., σ_1). After doing the computation, we get

x^{2k}_0 = ∏_{i=1}^n (I − S_{n−i+1}) x^{2k−1}_0 + ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_{n+1−j})) t_i.

Combining the results above, we get the total progress made after the pair of epochs 2k − 1 and 2k:

x^{2k+1}_0 = ∏_{i=1}^n (I − S_i) ∏_{i=1}^n (I − S_{n−i+1}) x^{2k−1}_0 + ∏_{i=1}^n (I − S_i) ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_{n+1−j})) t_i + ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_j)) t_{n+1−i}. (18)

In the sum above, the first term has an exponential decay, so we need to control the other two terms. We denote their sum by z:

z := ∏_{i=1}^n (I − S_i) ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_{n+1−j})) t_i + ∑_{i=1}^n (∏_{j=1}^{n−i} (I − S_j)) t_{n+1−i},

and we will control its norm later in this proof. To see where the iterates end up after K epochs, we simply set 2k = K in Eq. (18) and then keep applying the equation recursively to the preceding epochs.
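The epoch unrolling in Eq. (17) composes the n affine SGD steps of an epoch into a single affine map. This composition is easy to check numerically; the sketch below (scalar quadratics, illustrative values) runs a forward epoch followed by its reversal, as in FLIPFLOP, and verifies the simulated iterate against the accumulated affine map.

```python
import random

# Sketch: an epoch pair of FLIPFLOP on scalar quadratics
# f_i(x) = (a_i/2)x^2 - b_i*x. One SGD step is the affine map
# x -> (1 - alpha*a_i)x + alpha*b_i, so an epoch is a composition of
# affine maps, mirroring Eq. (17).
random.seed(1)
n, alpha = 8, 0.05
a = [random.uniform(0.5, 2.0) for _ in range(n)]   # hypothetical curvatures
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

def sgd_epoch(x, perm):
    for i in perm:
        x = (1 - alpha * a[i]) * x + alpha * b[i]
    return x

def affine_epoch(perm):
    m, c = 1.0, 0.0                 # compose x -> m*x + c step by step
    for i in perm:
        m, c = (1 - alpha * a[i]) * m, (1 - alpha * a[i]) * c + alpha * b[i]
    return m, c

perm = list(range(n))
random.shuffle(perm)
x0 = 1.0
x1 = sgd_epoch(x0, perm)            # odd epoch: forward permutation
x2 = sgd_epoch(x1, perm[::-1])      # even epoch: reversed permutation

m, c = affine_epoch(perm)
mr, cr = affine_epoch(perm[::-1])
assert abs(x2 - (mr * (m * x0 + c) + cr)) < 1e-12
```

In the scalar case the contraction factors m and mr coincide, so the forward and reversed epochs differ only in their offset terms, which is the noise that the pairing is designed to cancel.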
Then, we get
$$x_n^K = x_0^{K+1} = \prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\,x_0^{K-1} + z = \left(\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right)^{K/2} x_0^1 + \sum_{k=0}^{K/2-1}\left(\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right)^{k} z.$$
Taking squared norms and expectations on both sides, and using $(a+b)^2 \le 2a^2 + 2b^2$, we get
$$\mathbb{E}\left[\|x_n^K\|^2\right] \le 2\,\mathbb{E}\left[\left\|\left(\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right)^{K/2} x_0^1\right\|^2\right] + 2\,\mathbb{E}\left[\left\|\sum_{k=0}^{K/2-1}\left(\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right)^{k} z\right\|^2\right].$$
We assumed that the functions $f_i$ have $L$-Lipschitz gradients (Assumption 2). This translates to the $A_i$ having maximum eigenvalue at most $L$. Hence, if $\alpha \le 1/L$, we get that $I - \alpha A_i$ is positive semi-definite with maximum eigenvalue bounded by $1$, so $\|I - S_i\| \le 1$. Using this and the fact that $\|M_1M_2\| \le \|M_1\|\|M_2\|$ for matrices $M_1$ and $M_2$, we get that
$$\mathbb{E}\left[\|x_n^K\|^2\right] \le 2\,\mathbb{E}\left[\left\|\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right\|^{K}\right]\|x_0^1\|^2 + \frac{K^2}{2}\,\mathbb{E}\left[\|z\|^2\right].$$
We handle the two terms above separately. For the first term, we have the following bound:

Lemma 3. If $\alpha \le \frac{1}{8\kappa L}\min\left\{2, \frac{\kappa}{n}\right\}$, then
$$\mathbb{E}\left[\left\|\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\right\|^{K}\right] \le (1-n\alpha\mu)^{K/2}.$$

Note that $K \ge 80\kappa^{3/2}\log(nK)\max\left\{1, \frac{\kappa}{n}\right\}$ ensures that $\alpha \le \frac{1}{8\kappa L}\min\left\{2, \frac{\kappa}{n}\right\}$.

We also have the following lemma that bounds the expected squared norm of $z$:

Lemma 4. If $\alpha \le \frac{1}{nL}$, then
$$\mathbb{E}\left[\|z\|^2\right] \le 2n^2\alpha^4L^2(G^*)^2 + 170n^5\alpha^6L^4(G^*)^2\log n,$$
where $G^* = \max_i \|b_i\|$, and the expectation is taken over the randomness of $z$.

Using these lemmas, we get that
$$\mathbb{E}\left[\|x_n^K\|^2\right] \le 2(1-n\alpha\mu)^{K/2}\|x_0^1\|^2 + K^2n^2\alpha^4L^2(G^*)^2 + 85K^2n^5\alpha^6L^4(G^*)^2\log n \le 2e^{-n\alpha\mu K/2}\|x_0^1\|^2 + K^2n^2\alpha^4L^2(G^*)^2 + 85K^2n^5\alpha^6L^4(G^*)^2\log n.$$
Substituting $\alpha = \frac{10\log(nK)}{\mu nK}$ gives us the result.

E.1 PROOF OF LEMMA 3

We define $(\tilde S_1, \ldots, \tilde S_n) := (S_1, \ldots, S_n)$ and $(\tilde S_{n+1}, \ldots, \tilde S_{2n}) := (S_n, \ldots, S_1)$. Then,
$$\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1}) = \prod_{i=1}^{2n}(I-\tilde S_i) = I - \sum_{i=1}^{2n}\tilde S_i + \sum_{l<j}\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j.$$

Lemma 8. Let $n > 6$. Then, for $0 \le k < k' \le K$, the corresponding cross term in Ineq. (29) is bounded by
$$1000n^2\alpha^4L^2(G^*)^2 + 2000n^6\alpha^7L^5(G^*)^2\log n.$$
Using Lemma 8, and inequalities (30) and (31) in Ineq. (29), we get
$$\mathbb{E}\left[\|x_n^K\|^2\right] \le 2(1-n\alpha\mu)^{K/2}\|x_0^1\|^2 + 2n^2K\alpha^4L^2(G^*)^2 + 170n^5K\alpha^6L^4(G^*)^2\log n + 1000n^2K^2\alpha^4L^2(G^*)^2 + 2000n^6K^2\alpha^7L^5(G^*)^2\log n$$
$$\le 2(1-n\alpha\mu)^{K/2}\|x_0^1\|^2 + 1002n^2K^2\alpha^4L^2(G^*)^2 + 2170n^6K^2\alpha^7L^5(G^*)^2\log n.$$
Substituting $\alpha = \frac{10\log(nK)}{\mu nK}$ gives us the desired result.
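The argument above leans on two linear-algebra facts: if each $A_i$ is symmetric PSD with eigenvalues at most $L$ (the quadratic analogue of $L$-Lipschitz gradients) and $\alpha \le 1/L$, then $\|I - \alpha A_i\| \le 1$; and by submultiplicativity the whole epoch product is a non-expansion. A quick numeric check on hypothetical PSD matrices (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, L = 4, 6, 2.0
alpha = 1.0 / L                      # the regime alpha <= 1/L used above

prod = np.eye(d)
for _ in range(n):
    # Hypothetical symmetric PSD A_i with eigenvalues in [0, L].
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = Q @ np.diag(rng.uniform(0.0, L, d)) @ Q.T
    factor = np.eye(d) - alpha * A   # I - S_i
    # eig(I - alpha * A) lies in [0, 1], so the spectral norm is at most 1.
    assert np.linalg.norm(factor, 2) <= 1 + 1e-9
    prod = factor @ prod

# Submultiplicativity: the epoch map is a non-expansion as well.
assert np.linalg.norm(prod, 2) <= 1 + 1e-9
```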
F.1 PROOF OF LEMMA 7

Recall that we have defined $N_k := \prod_{i=1}^{n}(I - S_i^{2k})\prod_{i=1}^{n}(I - S_{n-i+1}^{2k})$. Hence,
$$N_k^\top = \left(\prod_{i=1}^{n}(I - S_i^{2k})\prod_{i=1}^{n}(I - S_{n-i+1}^{2k})\right)^{\!\top} = \prod_{i=1}^{n}(I - S_i^{2k})\prod_{i=1}^{n}(I - S_{n-i+1}^{2k}) = N_k,$$
where we used the fact that the $S_i^{2k}$ are symmetric (transposition reverses the order of the product, which swaps the forward and reverse passes). Hence $N_k$ is symmetric. Then,
$$\left\|\mathbb{E}\left[N_k^\top N_k\right]\right\| = \max_{x:\|x\|=1} \mathbb{E}\left[x^\top N_k^\top N_k x\right] = \max_{x:\|x\|=1} \mathbb{E}\left[x^\top N_k N_k x\right].$$
Next, note that $\|N_k\| \le 1$ as long as $\alpha \le \frac{1}{nL}$. This, combined with the fact that $N_k$ is symmetric and positive semi-definite, gives us that $x^\top N_k N_k x \le x^\top N_k x$. Hence, we get
$$\left\|\mathbb{E}\left[N_k^\top N_k\right]\right\| \le \max_{x:\|x\|=1} \mathbb{E}\left[x^\top N_k x\right] \le \left\|\mathbb{E}\left[N_k\right]\right\|.$$
To complete the proof, we apply the following lemma from Ahn et al. (2020):

Lemma 9 (Lemma 6, Ahn et al. (2020)). For any $0 \le \alpha \le \frac{3}{16nL}\min\left\{1, \frac{n}{\kappa}\right\}$ and $k \in [K]$,
$$\left\|\mathbb{E}\left[\prod_{i=1}^{n}(I - S_i^{k})\left(\prod_{i=1}^{n}(I - S_i^{k})\right)^{\!\top}\right]\right\| \le 1 - n\alpha\mu.$$

Remark 2. To avoid confusion, we remark here that the term $S_k$ in the paper Ahn et al. (2020) is the same as the term $\prod_{i=1}^{n}(I - S_i^{k})$ in our paper, and hence the original lemma statement in their paper looks different from what is written above.

F.2 PROOF OF LEMMA 8

We begin by decomposing the product into a product of independent terms, similar to the proof of Lemma 8 in Ahn et al. (2020). However, after that we diverge from their proof, since we use FLIPFLOP-specific analysis. Since $k < k'$, the terms
$$z^{K-2k'}, \qquad \prod_{l=k+1}^{k'}\left(\prod_{i=1}^{n}(I - S_i^{2l})\prod_{i=1}^{n}(I - S_{n-i+1}^{2l})\right), \qquad \text{and} \qquad z^{K-2k}$$
are independent. Hence, we can write the expectation as a product of expectations. Applying the Cauchy-Schwarz inequality to this decomposition leaves a product of two expectations, where in the last step we used that $\|I - S_i^{k}\| \le 1$ for all $i, k$. To see why this is true, recall that $S_i^{k} = \alpha A_{\sigma_i^k}$. Further, by Assumption 2, $\|A_{\sigma_i^k}\| \le L$, and hence as long as $\alpha \le 1/L$, we have $\|I - S_i^{k}\| \le 1$. For the two terms in the product above, we have the following lemmas:

Lemma 10. If $n > 6$ and $\alpha \le \frac{1}{nL}$, then
$$\mathbb{E}\left[\left\|z^{K-2k}\right\|\right] \le 28n\alpha^2LG^* + 9\alpha^5L^4n^4G^*\sqrt{2n\log n}.$$

Lemma 11.
If $n > 6$ and $\alpha \le \frac{1}{2nL}$, then
$$\mathbb{E}\left[\left\|M\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\,z^{K-2k'}\right\|\right] \le 32n\alpha^2LG^* + 24\alpha^5L^4n^4G^*\sqrt{2n\log n}.$$
Finally, using these lemmas we get
$$\left(28n\alpha^2LG^* + 9\alpha^5L^4n^4G^*\sqrt{2n\log n}\right)\left(32n\alpha^2LG^* + 24\alpha^5L^4n^4G^*\sqrt{2n\log n}\right)$$
$$= 896n^2\alpha^4L^2(G^*)^2 + 960\alpha^7L^5n^5(G^*)^2\sqrt{2n\log n} + 432\alpha^{10}L^8n^9(G^*)^2\log n \le 1000n^2\alpha^4L^2(G^*)^2 + 2000\alpha^7L^5n^6(G^*)^2\log n.$$

F.3 PROOF OF LEMMA 10

Since we are dealing with just a single epoch, we will skip the superscript. Using Lemma 5, we can expand $z$ so that $\mathbb{E}[\|z\|]$ is bounded by three groups of terms: the first-order terms $\sum_{i=1}^{n}\mathbb{E}[\|S_it_i\|]$, and two groups of second-order terms of the form $\mathbb{E}\big[\big(\prod_{p=1}^{l-1}(I-\tilde S_p)\big)\tilde S_l\tilde S_j\,t_i\big]$; call this bound Ineq. (32).

Define $G^* := \max_i \|b_i\|$. Then $\|S_it_i\| = \|\alpha A_{\sigma_i}\|\,\|\alpha b_{\sigma_i}\| \le \alpha^2LG^*$, and hence
$$\sum_{i=1}^{n}\mathbb{E}\left[\|S_it_i\|\right] \le n\alpha^2LG^*. \quad (33)$$
Next, we bound the other two terms. Using Eq. (23), we get that for any $l < j$, the second-order terms involve products of the form $S_lS_jt_i$. For $i \notin \{l, j\}$, conditioned on $S_l$ and $S_j$, the index $\sigma_i$ is uniform over the remaining $n-2$ coordinates, so Eq. (16) gives
$$\mathbb{E}\left[t_i \mid S_l, S_j\right] = -\frac{t_l + t_j}{n-2}.$$
Summing over all pairs $l < j$, the second term of Ineq. (32) is bounded as
$$\sum_{l<j}\mathbb{E}\left[\cdots\right] \le 10n^2\alpha^3L^2G^* + 10n^3\alpha^4L^3G^* + 2\alpha^5L^4n^4G^*\sqrt{18n\log n}. \quad (34)$$
The third term in Ineq. (32) can be handled similarly. For any $l, j$, the special cases are $i = j$ and $i = 2n-j+1$; additionally, if $i = l$ or $i = 2n-l+1$, we use for that case the fact that $\|\mathbb{E}[\tilde S_l\tilde S_jt_i]\| \le \alpha^3L^2G^*$. For all other $i$, we can again use that $\mathbb{E}[t_i \mid \tilde S_l, \tilde S_j] = -\frac{t_l + t_j}{n-2}$ if $l \le n$, or $\mathbb{E}[t_i \mid \tilde S_l, \tilde S_j] = -\frac{t_{2n-l+1} + t_j}{n-2}$ otherwise. Similarly, we can bound $\mathbb{E}[\tilde S_p\tilde S_l\tilde S_jt_i]$. Proceeding in a way similar to how we derived Ineq. (34), we get
$$\mathbb{E}\left[\left\|\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j\cdots\right\|\right] \le \alpha^3L^2G^* + \frac{2n}{n-2}\alpha^3L^2G^* + \alpha^4L^3G^* + n\alpha^4L^3G^* + \frac{3n^2}{2(n-3)}\alpha^4L^3G^* \le 5\alpha^3L^2G^* + 5n\alpha^4L^3G^*,$$
where we used $n > 6$ in the last step, and hence, including the highest-order term,
$$\le 5\alpha^3L^2G^* + 5n\alpha^4L^3G^* + \alpha^5L^4n^2G^*\sqrt{18n\log n}. \quad (35)$$
Substituting Ineqs. (33), (34) and (35) into (32), we get
$$\mathbb{E}\left[\|z\|\right] \le n\alpha^2LG^* + 10n^2\alpha^3L^2G^* + 10n^3\alpha^4L^3G^* + 2\alpha^5L^4n^4G^*\sqrt{18n\log n} + 4n^2\alpha^3L^2G^* + 3n^3\alpha^4L^3G^* + \alpha^5L^4n^4G^*\sqrt{18n\log n}$$
$$\le 28n\alpha^2LG^* + 9\alpha^5L^4n^4G^*\sqrt{2n\log n},$$
where we used the assumption that $\alpha \le \frac{1}{nL}$ in the last step.
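The key combinatorial step in the proof of Lemma 10 is the without-replacement identity $\mathbb{E}[t_i \mid S_l, S_j] = -\frac{t_l + t_j}{n-2}$, which is just Eq. (16) applied to the coordinates left after conditioning. A minimal exact check with hypothetical $b_i$'s (the positions `a`, `c` and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
b = rng.standard_normal(n)
b -= b.mean()                        # enforce Eq. (16): sum_i b_i = 0

# Condition on two values already placed at positions l and j of the
# permutation: sigma_l = a and sigma_j = c (hypothetical choices).
a, c = 3, 5
rest = [k for k in range(n) if k not in (a, c)]

# sigma_i is uniform over the remaining n - 2 indices, so its conditional
# mean is (sum_i b_i - b_a - b_c) / (n - 2) = -(b_a + b_c) / (n - 2).
cond_mean = np.mean(b[rest])
assert np.isclose(cond_mean, -(b[a] + b[c]) / (n - 2))
```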
F.4 PROOF OF LEMMA 11

Define the matrix
$$M := \prod_{l=k+1}^{k'}\left(\prod_{i=1}^{n}(I - S_i^{2l})\prod_{i=1}^{n}(I - S_{n-i+1}^{2l})\right).$$
Since $M$ is independent of $\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})$ and $z^{K-2k'}$, we can use the tower rule. We will now drop the superscript $K-2k'$ for convenience. Hence, we need to control the following term:
$$\left\|\mathbb{E}\left[M\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\,z\right]\right\|.$$
We define $(\tilde S_1, \ldots, \tilde S_{2n})$ as $(\tilde S_1, \ldots, \tilde S_n) = (S_1, \ldots, S_n)$ and $(\tilde S_{n+1}, \ldots, \tilde S_{2n}) = (S_n, \ldots, S_1)$. Then, we use Eq. (21) to get
$$\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1}) = \prod_{i=1}^{2n}(I-\tilde S_i) = I - \sum_{j=1}^{2n}\tilde S_j + \sum_{l<j}\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j.$$
Note that $I - \sum_{j=1}^{2n}\tilde S_j$ is a constant matrix, since $\sum_{j=1}^{2n}\tilde S_j = 2\alpha\sum_{i=1}^{n}A_i$ does not depend on the permutation. Since $\tilde S_j = \alpha A_{\sigma_j}$, we have $\|\tilde S_j\| \le \alpha L$ by Assumption 2. Hence, if $\alpha \le \frac{1}{2nL}$, then $\|I - \sum_{j=1}^{2n}\tilde S_j\| \le 1$. Further, $\alpha \le 1/L$ implies $\|I - S_i\| \le 1$, which implies $\|M\| \le 1$. Hence,
$$\left\|\mathbb{E}\left[M\prod_{i=1}^{n}(I-S_i)\prod_{i=1}^{n}(I-S_{n-i+1})\,z\right]\right\| \le \mathbb{E}\left[\|z\|\right] + \left\|\mathbb{E}\left[M\sum_{l<j}\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j\,z\right]\right\|.$$
We can apply Lemma 10 to bound $\mathbb{E}[\|z\|]$, so we focus on the other term. Using Lemma 5 to expand $z$, and then using that $\|M\| \le 1$, $\|I - \tilde S_i\| \le 1$, $\|\tilde S_i\| \le \alpha L$ and $\|\tilde t_i\| \le \alpha G^*$, we get
$$\left\|\mathbb{E}\left[M\sum_{l<j}\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j\,z\right]\right\| \le 4n^3\alpha^4L^3G^* + 4n^4\alpha^4L^4\,\mathbb{E}\left[\,\cdot\,\right],$$
and using Lemma 6 on the remaining expectation,
$$\left\|\mathbb{E}\left[M\sum_{l<j}\left(\prod_{p=1}^{l-1}(I-\tilde S_p)\right)\tilde S_l\tilde S_j\,z\right]\right\| \le 4n^3\alpha^4L^3G^* + 15n^4\alpha^5L^4G^*\sqrt{2n\log n}.$$
Putting everything together,
$$\left(28n\alpha^2LG^* + 9\alpha^5L^4n^4G^*\sqrt{2n\log n}\right) + \left(4n^3\alpha^4L^3G^* + 15n^4\alpha^5L^4G^*\sqrt{2n\log n}\right) \le 32n\alpha^2LG^* + 24\alpha^5L^4n^4G^*\sqrt{2n\log n}.$$

G PROOF OF THEOREM 6

Proof. We start off by defining the error term
$$r_k := \left(\sum_{i=1}^{n}\nabla f_i\!\left(x_{i-1}^{2k-1}\right) - \sum_{i=1}^{n}\nabla f_i\!\left(x_0^{2k-1}\right)\right) + \left(\sum_{i=1}^{n}\nabla f_{n-i+1}\!\left(x_{i-1}^{2k}\right) - \sum_{i=1}^{n}\nabla f_{n-i+1}\!\left(x_0^{2k-1}\right)\right),$$
where $k \in [K/2]$. This captures the difference between the gradients that the algorithm actually observes and the gradients that a full step of gradient descent would have seen.
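By the definition of $r_k$, a FlipFlop epoch pair is exactly one full-gradient step of effective length $2n\alpha$ perturbed by $r_k$. The following sketch (hypothetical small quadratics, identity permutation as in incremental GD; sizes and seed are arbitrary) verifies this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha = 5, 3, 0.01

# Hypothetical quadratic components f_i(x) = 0.5 x^T A_i x - b_i^T x.
A = [(lambda M: M @ M.T + np.eye(d))(rng.standard_normal((d, d))) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]
grad = lambda i, x: A[i] @ x - b[i]
grad_F = lambda x: sum(grad(i, x) for i in range(n)) / n

x0 = rng.standard_normal(d)
x, fwd, bwd = x0.copy(), [], []
for i in range(n):                   # epoch 2k-1: incremental (forward) order
    fwd.append(x.copy())
    x = x - alpha * grad(i, x)
for i in reversed(range(n)):         # epoch 2k: reversed order (FlipFlop)
    bwd.append(x.copy())
    x = x - alpha * grad(i, x)

# r_k per its definition: observed gradients minus gradients at x_0^{2k-1}.
r = sum(grad(i, fwd[i]) - grad(i, x0) for i in range(n)) \
  + sum(grad(i, xi) - grad(i, x0) for i, xi in zip(reversed(range(n)), bwd))

# The epoch pair is one perturbed full-gradient step of length 2 n alpha.
assert np.allclose(x, x0 - alpha * (2 * n * grad_F(x0) + r))
```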
For two consecutive epochs of FLIPFLOP WITH INCREMENTAL GD, we have the following inequality:
$$\|x_n^{2k} - x^*\|^2 = \|x_0^{2k-1} - x^*\|^2 - 2\alpha\left\langle x_0^{2k-1} - x^*,\ \sum_{i=1}^{n}\nabla f_i\!\left(x_{i-1}^{2k-1}\right) + \sum_{i=1}^{n}\nabla f_{n-i+1}\!\left(x_{i-1}^{2k}\right)\right\rangle + \alpha^2\left\|\sum_{i=1}^{n}\nabla f_i\!\left(x_{i-1}^{2k-1}\right) + \sum_{i=1}^{n}\nabla f_{n-i+1}\!\left(x_{i-1}^{2k}\right)\right\|^2$$
$$= \|x_0^{2k-1} - x^*\|^2 - 2\alpha\left\langle x_0^{2k-1} - x^*,\ 2n\nabla F(x_0^{2k-1})\right\rangle - 2\alpha\left\langle x_0^{2k-1} - x^*,\ r_k\right\rangle + \alpha^2\left\|2n\nabla F(x_0^{2k-1}) + r_k\right\|^2$$
$$\le \|x_0^{2k-1} - x^*\|^2 - 4n\alpha\left(\frac{L\mu}{L+\mu}\|x_0^{2k-1} - x^*\|^2 + \frac{1}{L+\mu}\left\|\nabla F(x_0^{2k-1})\right\|^2\right) - 2\alpha\left\langle x_0^{2k-1} - x^*, r_k\right\rangle + 8\alpha^2n^2\left\|\nabla F(x_0^{2k-1})\right\|^2 + 2\alpha^2\|r_k\|^2$$
$$= \left(1 - 4n\alpha\frac{L\mu}{L+\mu}\right)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 8\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 - 2\alpha\left\langle x_0^{2k-1} - x^*, r_k\right\rangle + 2\alpha^2\|r_k\|^2, \quad (36)$$
where the first inequality is due to Theorem 2.1.11 in Nesterov (2004) and the second one is simply $(a+b)^2 \le 2a^2 + 2b^2$. What remains is to bound the two terms with $r_k$ dependence. Firstly, we give a bound on the norm of $r_k$:
$$\|r_k\| \le \sum_{i=1}^{n}\left\|\nabla f_i\!\left(x_{i-1}^{2k-1}\right) - \nabla f_i\!\left(x_0^{2k-1}\right)\right\| + \sum_{i=1}^{n}\left\|\nabla f_{n-i+1}\!\left(x_{i-1}^{2k}\right) - \nabla f_{n-i+1}\!\left(x_0^{2k-1}\right)\right\|.$$
Next, we use the smoothness assumption and the bounded-gradients property (Lemma 1):
$$\|r_k\| \le L\sum_{i=1}^{n}\left\|x_{i-1}^{2k-1} - x_0^{2k-1}\right\| + L\sum_{i=1}^{n}\left\|x_{i-1}^{2k} - x_0^{2k-1}\right\| \le LG\alpha\sum_{i=1}^{n}(i-1) + LG\alpha\sum_{i=1}^{n}(n+i-1) = n(2n-1)\alpha GL.$$
Hence,
$$\|r_k\|^2 \le 4n^4\alpha^2G^2L^2. \quad (37)$$
For the inner-product term involving $r_k$, we need a more careful bound. Since the Hessian is constant for quadratic functions, we use $H_i$ to denote the Hessian matrix of the function $f_i(\cdot)$. We start off by using the definition of $r_k$:
$$r_k = \sum_{i=1}^{n}\left(\nabla f_i\!\left(x_{i-1}^{2k-1}\right) - \nabla f_i\!\left(x_0^{2k-1}\right)\right) + \sum_{i=1}^{n}\left(\nabla f_{n-i+1}\!\left(x_{i-1}^{2k}\right) - \nabla f_{n-i+1}\!\left(x_0^{2k-1}\right)\right) = \sum_{i=1}^{n}H_i\!\left(x_{i-1}^{2k-1} - x_0^{2k-1}\right) + \sum_{i=1}^{n}H_{n-i+1}\!\left(x_{i-1}^{2k} - x_0^{2k-1}\right),$$
where we used the fact that for a quadratic function $f$ with Hessian $H$, we have that $\nabla f(x) - \nabla f(y) = H(x-y)$.
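The first inequality in (36) uses the coercivity property of $\mu$-strongly convex, $L$-smooth functions (Nesterov, 2004, Theorem 2.1.11): $\langle \nabla F(x) - \nabla F(y), x-y\rangle \ge \frac{\mu L}{\mu+L}\|x-y\|^2 + \frac{1}{\mu+L}\|\nabla F(x)-\nabla F(y)\|^2$. A quick numeric check on a hypothetical quadratic whose Hessian spectrum lies in $[\mu, L]$ (all constants here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, mu, L = 5, 0.5, 4.0

# Hypothetical quadratic F(x) = 0.5 x^T H x with eigenvalues of H in [mu, L].
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(rng.uniform(mu, L, d)) @ Q.T
grad = lambda x: H @ x

for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    g = grad(x) - grad(y)
    lhs = g @ (x - y)
    rhs = (mu * L / (mu + L)) * np.dot(x - y, x - y) + (g @ g) / (mu + L)
    assert lhs >= rhs - 1e-9    # coercivity holds for every pair x, y
```

Per eigendirection with eigenvalue $\lambda$, the inequality reduces to $(\lambda-\mu)(\lambda-L) \le 0$, which is exactly the statement that the spectrum lies in $[\mu, L]$.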
After that, we express $x_{i-1}^{2k-1} - x_0^{2k-1}$ and $x_{i-1}^{2k} - x_0^{2k-1}$ as sums of gradient descent steps:
$$x_{i-1}^{2k-1} - x_0^{2k-1} = -\sum_{j=1}^{i-1}\alpha\nabla f_j\!\left(x_{j-1}^{2k-1}\right), \qquad x_{i-1}^{2k} - x_0^{2k-1} = -\sum_{j=1}^{n}\alpha\nabla f_j\!\left(x_{j-1}^{2k-1}\right) - \sum_{j=1}^{i-1}\alpha\nabla f_{n-j+1}\!\left(x_{j-1}^{2k}\right).$$
We then split each observed gradient into its value at $x_0^{2k-1}$ plus a deviation, $\nabla f_j(x_{j-1}^{2k-1}) = \nabla f_j(x_0^{2k-1}) + \big(\nabla f_j(x_{j-1}^{2k-1}) - \nabla f_j(x_0^{2k-1})\big)$, and use the facts that $\sum_{j=1}^{n}\nabla f_j(x) = n\nabla F(x)$ and that, for a quadratic function $f$ with Hessian $H$, $\nabla f(x) - \nabla f(y) = H(x-y)$. Splitting further via $\nabla f_i(x_0^{2k-1}) = H_i(x_0^{2k-1} - x^*) + \nabla f_i(x^*)$, we can decompose $r_k = a_k + b_k$, where $a_k$ collects the terms evaluated at $x_0^{2k-1}$,
$$a_k := -2\alpha n\sum_{i=1}^{n}H_i\nabla F\!\left(x_0^{2k-1}\right) + \alpha\sum_{i=1}^{n}H_i^2\!\left(x_0^{2k-1} - x^*\right) + \alpha\sum_{i=1}^{n}H_i\nabla f_i(x^*),$$
and $b_k := r_k - a_k$ collects the remaining second-order deviation terms of the form $\sum_j\big(\nabla f_j(x_{j-1}) - \nabla f_j(x_0)\big)$. Again using the smoothness assumption and the bounded-gradients property (Lemma 1), we get
$$\|b_k\| \le 3\alpha^2L^2Gn^3. \quad (38)$$
Next, we decompose the inner product of $x_0^{2k-1} - x^*$ and $r_k$ in Eq. (36):
$$-2\alpha\left\langle x_0^{2k-1} - x^*, r_k\right\rangle = -2\alpha\left\langle x_0^{2k-1} - x^*, a_k + b_k\right\rangle = -2\alpha\left\langle x_0^{2k-1} - x^*, a_k\right\rangle - 2\alpha\left\langle x_0^{2k-1} - x^*, b_k\right\rangle. \quad (39)$$
For the first term in (39),
$$-2\alpha\left\langle x_0^{2k-1} - x^*, a_k\right\rangle = 4\alpha^2n\left\langle x_0^{2k-1} - x^*, \sum_{i=1}^{n}H_i\nabla F(x_0^{2k-1})\right\rangle - 2\alpha^2\left\langle x_0^{2k-1} - x^*, \sum_{i=1}^{n}H_i^2\!\left(x_0^{2k-1} - x^*\right)\right\rangle - 2\alpha^2\left\langle x_0^{2k-1} - x^*, \sum_{i=1}^{n}H_i\nabla f_i(x^*)\right\rangle$$
$$\le 4\alpha^2n^2\left\|\nabla F(x_0^{2k-1})\right\|^2 - 2\alpha^2\left\langle x_0^{2k-1} - x^*, \sum_{i=1}^{n}H_i\nabla f_i(x^*)\right\rangle \le 4\alpha^2n^2\left\|\nabla F(x_0^{2k-1})\right\|^2 + 2\alpha^2nLG\left\|x_0^{2k-1} - x^*\right\|, \quad (40)$$
where we used that $n\nabla F(x_0^{2k-1}) = \sum_i H_i(x_0^{2k-1} - x^*)$, dropped the term involving the positive semi-definite matrices $H_i^2$, and bounded $\|\sum_i H_i\nabla f_i(x^*)\| \le nLG$. For the second term in (39), we use Cauchy-Schwarz and Ineq. (38):
$$-2\alpha\left\langle x_0^{2k-1} - x^*, b_k\right\rangle \le 6\alpha^3L^2Gn^3\left\|x_0^{2k-1} - x^*\right\|. \quad (41)$$
Substituting (40) and (41) back into (39), we get
$$-2\alpha\left\langle x_0^{2k-1} - x^*, r_k\right\rangle \le 4\alpha^2n^2\left\|\nabla F(x_0^{2k-1})\right\|^2 + 2\alpha^2nLG\left\|x_0^{2k-1} - x^*\right\| + 6\alpha^3L^2Gn^3\left\|x_0^{2k-1} - x^*\right\|. \quad (42)$$
Substituting (37) and (42) back into (36), we finally get a recursion bound for one epoch pair:
$$\|x_n^{2k} - x^*\|^2 \le \left(1 - 4n\alpha\frac{L\mu}{L+\mu}\right)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 8\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 - 2\alpha\left\langle x_0^{2k-1} - x^*, r_k\right\rangle + 2\alpha^2\|r_k\|^2$$
$$\le \left(1 - 4n\alpha\frac{L\mu}{L+\mu}\right)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 12\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 + 2\alpha^2nLG(1 + 3\alpha Ln^2)\left\|x_0^{2k-1} - x^*\right\| + 8\alpha^4n^4G^2L^2.$$
Next, we use the fact that $2ab \le \lambda a^2 + b^2/\lambda$ (for any $\lambda > 0$) on the term $2\alpha^2nLG(1 + 3\alpha Ln^2)\|x_0^{2k-1} - x^*\|$ to get that
$$2\alpha^2nLG(1 + 3\alpha Ln^2)\left\|x_0^{2k-1} - x^*\right\| \le \frac{\left(\alpha^2nLG(1 + 3\alpha Ln^2)\right)^2}{n\alpha\mu} + n\alpha\mu\left\|x_0^{2k-1} - x^*\right\|^2 = \mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + n\alpha\mu\left\|x_0^{2k-1} - x^*\right\|^2.$$
Substituting this back, we get
$$\|x_n^{2k} - x^*\|^2 \le \left(1 - 4n\alpha\frac{L\mu}{L+\mu}\right)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 12\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 + \mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + n\alpha\mu\left\|x_0^{2k-1} - x^*\right\|^2 + 8\alpha^4n^4G^2L^2$$
$$\le (1 - 2n\alpha\mu + n\alpha\mu)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 12\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 + \mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + 8\alpha^4n^4G^2L^2 \quad \text{(since } \mu \le L\text{)}$$
$$= (1 - n\alpha\mu)\|x_0^{2k-1} - x^*\|^2 - \left(\frac{4n\alpha}{L+\mu} - 12\alpha^2n^2\right)\left\|\nabla F(x_0^{2k-1})\right\|^2 + \mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + 8\alpha^4n^4G^2L^2.$$
Now, substituting the value of $\alpha$ and the bound on $K$, we get that $\frac{4n\alpha}{L+\mu} - 12\alpha^2n^2 \ge 0$, and hence
$$\|x_n^{2k} - x^*\|^2 \le (1 - n\alpha\mu)\|x_0^{2k-1} - x^*\|^2 + \mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + 8\alpha^4n^4G^2L^2.$$
Now, iterating this for the $K/2$ epoch pairs, we get
$$\|x_n^K - x^*\|^2 \le (1 - n\alpha\mu)^{K/2}\|x_0^1 - x^*\|^2 + \frac{K}{2}\mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + 4K\alpha^4n^4G^2L^2$$
$$\le e^{-n\alpha\mu K/2}\|x_0^1 - x^*\|^2 + \frac{K}{2}\mu^{-1}\alpha^3nL^2G^2(1 + 3\alpha Ln^2)^2 + 4K\alpha^4n^4G^2L^2$$
$$\le e^{-n\alpha\mu K/2}\|x_0^1 - x^*\|^2 + K\mu^{-1}\alpha^3nL^2G^2 + 9K\mu^{-1}\alpha^5n^5L^4G^2 + 4K\alpha^4n^4G^2L^2 \quad \text{(since } (1+a)^2 \le 2 + 2a^2\text{)}.$$
Substituting $\alpha = \frac{6\log(nK)}{\mu nK}$ gives us the desired result.

H ADDITIONAL EXPERIMENTS

Although our theoretical guarantees for FLIPFLOP hold only for quadratic objectives, we conjecture that FLIPFLOP may also improve the convergence performance for other classes of functions whose Hessians are smooth near their minimizers. To explore this, we also ran some experiments on 1-dimensional logistic regression.
As we can see in Figure 3, the convergence rates are very similar to those on quadratic functions. The data was synthetically generated so that the objective function becomes strongly convex and well conditioned near the minimizer. Note that the logistic loss is not strongly convex on linearly separable data; therefore, to make the loss strongly convex, we ensured that the data was not linearly separable. Essentially, the dataset was the following: all the datapoints were $z = 1$, and their labels were $y = 1_{z>0}$ with probability $3/4$ and $y = 1_{z<0}$ with probability $1/4$. Framing this as an optimization problem, we have
$$\min_x F(x) := \mathbb{E}\left[-y\log(h(xz)) - (1-y)\log(1-h(xz))\right], \quad \text{where } h(u) = \frac{1}{1+e^{-u}}.$$
Note that $x^* = \log 3$ is the minimizer of this function, which is helpful because we can use it to compute the exact error. Similar to the experiment on quadratic functions, $n$ was set to $800$ and the step size was set in the same regime as in Theorems 4, 5, and 6.
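The setup above is easy to reproduce. The following sketch is not the code used for Figure 3; it is a minimal re-implementation under stated assumptions: an exact 3:1 label split (so that the empirical minimizer is exactly $\log 3$), Single Shuffle as the baseline, $K = 50$ epochs, and a step size in the regime of Theorem 6 with $\mu$ treated as a constant:

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n, K = 800, 50
z = np.ones(n)                             # all datapoints are z = 1
y = np.array([1.0] * 600 + [0.0] * 200)    # labels are 1 w.p. 3/4 (exact split)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
grad = lambda i, x: (sigmoid(x * z[i]) - y[i]) * z[i]   # logistic-loss gradient

x_star = math.log(3)                       # exact minimizer of F
alpha = 6 * math.log(n * K) / (n * K)      # step-size regime of Theorem 6

def run(flipflop):
    x = 0.0
    perm = rng.permutation(n)              # one permutation, reused throughout
    for k in range(K):
        # FlipFlop reverses the permutation on every other epoch.
        order = perm[::-1] if (flipflop and k % 2 == 1) else perm
        for i in order:
            x -= alpha * grad(i, x)
    return x

err_ss = abs(run(False) - x_star)          # Single Shuffle baseline
err_ff = abs(run(True) - x_star)           # FlipFlop
print(err_ss, err_ff)
```

Both runs converge to a small neighborhood of $\log 3$; comparing the two error floors over many seeds reproduces the qualitative comparison in Figure 3.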