# Proximal and Federated Random Reshuffling

Konstantin Mishchenko¹, Ahmed Khaled², Peter Richtárik³

**Abstract.** Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization. We propose two new algorithms: Proximal and Federated Random Reshuffling (ProxRR and FedRR). The first algorithm, ProxRR, solves composite finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of $n$ smooth objectives. ProxRR evaluates the proximal operator only once per epoch. When the proximal operator is expensive to compute, this small difference makes ProxRR up to $n$ times faster than algorithms that evaluate the proximal operator in every iteration, such as proximal (stochastic) gradient descent. We give examples of practical optimization tasks where the proximal operator is difficult to compute and ProxRR has a clear advantage. One such task is federated or distributed optimization, where the evaluation of the proximal operator corresponds to communication across the network. We obtain our second algorithm, FedRR, as a special case of ProxRR applied to federated optimization, and prove it has a smaller communication footprint than either distributed gradient descent or Local SGD. Our theory covers both constant and decreasing stepsizes, and allows for importance resampling schemes that can improve conditioning, which may be of independent interest. Our theory covers both convex and nonconvex regimes. Finally, we corroborate our results with experiments on real data sets.

¹CNRS, DI ENS, Inria. ²Princeton University. ³KAUST. Correspondence to: Konstantin Mishchenko. *Proceedings of the 39th International Conference on Machine Learning*, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Modern theory and practice of training supervised machine learning models are based on the paradigm of regularized empirical risk minimization (ERM) (Shalev-Shwartz & Ben-David, 2014). While the ultimate goal of supervised learning is to train models that generalize well to unseen data, in practice only a finite data set is available during training. Settling for a model that merely minimizes the average loss on this training set (the empirical risk) is insufficient, as this often leads to over-fitting and poor generalization in practice. For this reason, the empirical risk is virtually always amended with a suitably chosen regularizer whose role is to encode prior knowledge about the learning task at hand, thus biasing the training algorithm towards better-performing models.

The regularization framework is quite general, and perhaps surprisingly it also allows us to consider methods for federated learning (FL), a paradigm in which we aim to train a model for a number of clients that do not want to reveal their data (Konečný et al., 2016; McMahan et al., 2017; Kairouz et al., 2019). In FL, training usually happens on devices, with only a small number of model updates being shared with a global host. To this end, the Federated Averaging algorithm has emerged, which performs Local SGD updates on the clients' devices and periodically aggregates their average. Its analysis usually requires special techniques and deliberately constructed sequences, hindering research in this direction.
We shall see, however, that the convergence of our FedRR follows from merely applying our algorithm for regularized problems to a carefully chosen reformulation.

Formally, regularized ERM problems are optimization problems of the form

$$\min_{x \in \mathbb{R}^d} \Big[ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \Big], \qquad (1)$$

where $f_i : \mathbb{R}^d \to \mathbb{R}$ is the loss of the model parameterized by vector $x \in \mathbb{R}^d$ on the $i$-th training data point, and $\psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a regularizer. Let $[n] := \{1, 2, \dots, n\}$. We shall make the following assumption throughout the paper without explicitly mentioning it:

**Assumption 1.** The functions $f_i$ are $L_i$-smooth, and the regularizer $\psi$ is proper, closed and convex. Let $L_{\max} := \max_{i \in [n]} L_i$.

In some results we will additionally assume that either the individual functions $f_i$, their average $f := \frac{1}{n}\sum_i f_i$, or the regularizer $\psi$ is $\mu$-strongly convex. Whenever we need such additional assumptions, we make this explicit. While all these concepts are standard, we review them briefly in Appendix A.

**Proximal SGD.** When the number $n$ of training data points is huge, as is increasingly common in practice, the most efficient algorithms for solving (1) are stochastic first-order methods, such as stochastic gradient descent (SGD) (Bordes et al., 2009), in one or another of its many variants proposed in the last decade (Shang et al., 2018; Pham et al., 2020). These methods almost invariably alternate stochastic gradient steps with evaluations of the proximal operator

$$\mathrm{prox}_{\gamma \psi}(x) := \operatorname*{argmin}_{z \in \mathbb{R}^d} \Big\{ \gamma \psi(z) + \frac{1}{2} \|z - x\|^2 \Big\}.$$

The simplest of these has the form

$$x^{\mathrm{SGD}}_{k+1} = \mathrm{prox}_{\gamma_k \psi}\big(x^{\mathrm{SGD}}_k - \gamma_k \nabla f_{i_k}(x^{\mathrm{SGD}}_k)\big), \qquad (2)$$

where $i_k$ is an index from $\{1, 2, \dots, n\}$ chosen uniformly at random, and $\gamma_k > 0$ is a properly chosen learning rate. Our understanding of (2) is quite mature; see (Gorbunov et al., 2020) for a general treatment which considers methods of this form in conjunction with more advanced stochastic gradient estimators in place of $\nabla f_{i_k}$. Applications such as training sparse linear models (Tibshirani, 1996), nonnegative matrix factorization (Lee & Seung, 1999), image deblurring (Rudin et al., 1992; Vogel, 2002), and training with group selection (Yuan & Lin, 2006) all rely on hand-crafted regularizers. For many of them, the proximal operator can be evaluated efficiently, and SGD is near or at the top of the list of efficient training algorithms.

**Random reshuffling.** A particularly successful variant of SGD is based on the idea of random shuffling (permutation) of the training data followed by $n$ iterations of the form (2), with the index $i_k$ following the pre-selected permutation (Bottou, 2012). This process is repeated several times, each time using a freshly sampled random permutation of the data, and the resulting method is known under the name Random Reshuffling (RR). When the same permutation is used throughout, the technique is known as Shuffle-Once (SO). One of the main advantages of this approach is rooted in its intrinsic ability to avoid cache misses when reading the data from memory, which enables a significantly faster implementation. Furthermore, RR is often observed to converge in fewer iterations than SGD in practice. This can intuitively be ascribed to the fact that, due to its sampling-with-replacement approach, SGD can fail to learn from some data points in any given epoch, whereas RR learns from each data point in every epoch. A minimal sketch of update (2) is given below.
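For concreteness, here is a minimal NumPy sketch of the proximal operator and of update (2). It is not the paper's code: the $\ell_1$ regularizer, its closed-form prox (soft-thresholding), and the per-sample gradient callables are our own illustrative assumptions.

```python
import numpy as np

def prox_l1(x, step):
    # prox_{step*||.||_1}(x): soft-thresholding, the closed-form solution of
    # argmin_z { step*||z||_1 + 0.5*||z - x||^2 }.
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def proximal_sgd(grads, x, gamma, n_steps, lam1, rng):
    # Update (2): one stochastic gradient step followed by one prox per iteration.
    # grads[i](x) returns the gradient of f_i at x (hypothetical callables).
    n = len(grads)
    for _ in range(n_steps):
        i = rng.integers(n)                            # i_k sampled uniformly
        x = prox_l1(x - gamma * grads[i](x), gamma * lam1)
    return x
```

Note that the prox is evaluated once per stochastic step here; this is exactly the cost that ProxRR, introduced below, amortizes over a whole epoch.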
Understanding the random reshuffling trick, and why it works, has been a non-trivial open problem for the past decade (Bottou, 2009; Recht & Ré, 2012; Gürbüzbalaban et al., 2019; Haochen & Sra, 2019), and has inspired significant ongoing research effort (Shamir, 2016; Haochen & Sra, 2019; Nagaraj et al., 2019; Mishchenko et al., 2020; Ahn et al., 2020).

## 2. Contributions

Our goal in this paper is twofold: we develop RR in new settings (namely, for proximal and federated learning), and we also address some shortcomings of existing theory, in particular the dependence on the condition number and stepsize scheduling. The difficulty of analyzing RR has been the main obstacle to developing even some of the most seemingly benign extensions of the method. Indeed, while these extensions are well understood in combination with its much simpler-to-analyze cousin SGD, to the best of our knowledge there exists no theoretical analysis of proximal, parallel, and importance sampling variants of RR with both constant and decreasing stepsizes, and in most cases it is not even clear how such methods should be constructed. In this section we outline the key contributions of our work, and also offer a few intuitive explanations motivating some of the development.

### 2.1. RR in new problem settings

**New algorithm: ProxRR.** Despite the rich literature on Proximal SGD (Gorbunov et al., 2020), it is not obvious how one should extend RR to solve problem (1) when a regularizer $\psi$ is present. Indeed, the standard practice for SGD is to apply the proximal operator after each stochastic step (Duchi & Singer, 2009), in analogy with (2). On the other hand, RR is motivated by the fact that a full pass over the data better approximates the full gradient step (Bertsekas, 2011). The following example shows that if we apply the proximal operator after each step of RR, we no longer approximate the full gradient step after an epoch:

**Example 1.** Let $n = 2$, $\psi(x) = \frac{1}{2}\|x\|^2$, $f_1(x) = \langle c_1, x \rangle$, $f_2(x) = \langle c_2, x \rangle$ with some $c_1, c_2 \in \mathbb{R}^d$, $c_1 \neq c_2$. Let $x_0 \in \mathbb{R}^d$, $\gamma > 0$ and define $x_1 = x_0 - \gamma \nabla f_1(x_0)$, $x_2 = x_1 - \gamma \nabla f_2(x_1)$. Then we have $\mathrm{prox}_{2\gamma\psi}(x_2) = \mathrm{prox}_{2\gamma\psi}(x_0 - 2\gamma \nabla f(x_0))$. However, if $\hat{x}_1 = \mathrm{prox}_{\gamma\psi}(x_0 - \gamma \nabla f_1(x_0))$ and $\hat{x}_2 = \mathrm{prox}_{\gamma\psi}(\hat{x}_1 - \gamma \nabla f_2(\hat{x}_1))$, then $\hat{x}_2 \neq \mathrm{prox}_{2\gamma\psi}(x_0 - 2\gamma \nabla f(x_0))$.

Motivated by this observation, we propose ProxRR (Algorithm 1), in which the proximal operator is applied at the end of each epoch of RR, i.e., after each pass through all the randomly reshuffled data.

**Algorithm 1** Proximal Random Reshuffling (ProxRR) and Shuffle-Once (ProxSO)
1: **Input:** stepsizes $\gamma_t > 0$, initial vector $x_0 \in \mathbb{R}^d$, number of epochs $T$
2: Sample a permutation $\pi = (\pi_0, \pi_1, \dots, \pi_{n-1})$ of $[n]$ (do step 2 only for ProxSO)
3: **for** epochs $t = 0, 1, \dots, T-1$ **do**
4: &nbsp;&nbsp;&nbsp; Sample a permutation $\pi = (\pi_0, \pi_1, \dots, \pi_{n-1})$ of $[n]$ (do step 4 only for ProxRR)
5: &nbsp;&nbsp;&nbsp; $x_t^0 = x_t$
6: &nbsp;&nbsp;&nbsp; **for** $i = 0, 1, \dots, n-1$ **do**
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_t^{i+1} = x_t^i - \gamma_t \nabla f_{\pi_i}(x_t^i)$
8: &nbsp;&nbsp;&nbsp; **end for**
9: &nbsp;&nbsp;&nbsp; $x_{t+1} = \mathrm{prox}_{\gamma_t n \psi}(x_t^n)$
10: **end for**

A notable property of Algorithm 1 is that only a single proximal operator evaluation is needed during each data pass. This is in sharp contrast with the way Proximal SGD works, and offers significant advantages in regimes where evaluating the proximal mapping is expensive (e.g., comparable to evaluating the $n$ gradients $\nabla f_1, \dots, \nabla f_n$). A minimal implementation sketch follows.
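The following is a minimal NumPy sketch of Algorithm 1, not the authors' released implementation; the gradient callables and the generic `prox` argument are illustrative assumptions.

```python
import numpy as np

def prox_rr(grads, prox, x, gamma, n_epochs, shuffle_once=False, seed=0):
    # Algorithm 1: ProxRR (fresh permutation per epoch) / ProxSO (one permutation).
    # grads[i](x) = gradient of f_i at x; prox(x, step) = prox_{step * psi}(x).
    rng = np.random.default_rng(seed)
    n = len(grads)
    perm = rng.permutation(n)              # kept for all epochs in ProxSO
    for _ in range(n_epochs):
        if not shuffle_once:
            perm = rng.permutation(n)      # resampled each epoch in ProxRR
        for i in perm:
            x = x - gamma * grads[i](x)    # n plain gradient steps, no prox
        x = prox(x, gamma * n)             # a single prox per epoch, stepsize gamma*n
    return x
```

The design choice visible in line 9 of Algorithm 1 is reflected in the last statement: the prox is taken with the aggregated stepsize $\gamma n$, matching the $n$ gradient steps it follows.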
**Convergence of ProxRR (for strongly convex functions or regularizer).** We establish several convergence results for ProxRR, of which we highlight two here. Both offer a linear convergence rate with a fixed stepsize to a neighborhood of the solution, and both rely on Assumption 1. First, when in addition each $f_i$ is $\mu$-strongly convex, we prove the rate (see Theorem 2)

$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \le (1 - \gamma\mu)^{nT} \|x_0 - x_*\|^2 + \frac{2\gamma^2 \sigma_{\mathrm{rad}}^2}{\mu},$$

where $\gamma_t = \gamma \le 1/L_{\max}$ is the stepsize and $\sigma_{\mathrm{rad}}^2$ is a shuffling radius constant (for a precise definition, see (4)). In Theorem 1 we bound the shuffling radius in terms of $\|\nabla f(x_*)\|^2$, $n$, $L_{\max}$, and the more common quantity $\sigma_*^2 := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_*) - \nabla f(x_*)\|^2$. We give a similar rate of convergence when $\psi$ rather than the $f_i$ is strongly convex. Both rates show exponential (linear on a logarithmic scale) convergence to a neighborhood whose size is proportional to $\gamma^2 \sigma_{\mathrm{rad}}^2$. Since we can choose $\gamma$ arbitrarily small, or periodically decrease it, the iterates converge to $x_*$ in the limit. Moreover, we show in Section 3 that when $\gamma = \mathcal{O}(1/T)$ the error is $\mathcal{O}(1/T^2)$, which is superior to the $\mathcal{O}(1/T)$ error of SGD. All of our results in the convex case apply to the Shuffle-Once algorithm as well.

**Convergence of ProxRR for nonconvex optimization.** In the nonconvex regime, and under suitable assumptions, we establish (see Theorems 3 and 5) an $\mathcal{O}(\frac{1}{\gamma T})$ rate up to a neighborhood of size $\mathcal{O}(\gamma^2)$. For a certain stepsize this yields an $\mathcal{O}(1/\varepsilon^3)$ convergence rate.

**Application to Federated Learning.** In Section 5 we describe an application of our results to federated learning (Konečný et al., 2016; McMahan et al., 2017; Kairouz et al., 2019). In this way we obtain the FedRR method, which is similar to Local SGD, except that the local solver is a single pass of RR over the local data. Empirically, FedRR can be vastly superior to Local SGD (see Figure 2). Remarkably, we also show that for large enough $n$ the rate of FedRR beats the best known lower bound for Local SGD due to Woodworth et al. (2020) (which we adapt from the original online setting to the finite-sum setting considered in this paper). See Appendix G for more details.

### 2.2. Improving vanilla RR

Besides the above results, we describe two extensions that improve upon the rates of vanilla Random Reshuffling (i.e., with no prox) and are of independent interest.

**Extension 1: Importance resampling for proximal RR.** All existing rates of convergence of RR in the strongly convex regime exhibit a dependence on $\max_i L_i/\mu$ (e.g., (Mishchenko et al., 2020; Ahn et al., 2020) and others), where $L_i$ is the smoothness constant of $f_i$. We observe that this is highly suboptimal compared to the $\bar{L}/\mu$ rate (for $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$) that SGD and variance-reduced methods can achieve: indeed, the difference between these two condition numbers can be of order $n$ (Gower et al., 2019). This is a serious problem, as the difference between the convergence rates of RR and SGD is tightly related to the condition number (Safran & Shamir, 2021). In other words, existing results on RR can be suboptimal by up to a factor of $n$ compared to the best known rates for SGD. To handle this, we reformulate problem (1) as a similar problem with a larger number of summands. In particular, for each $i \in [n]$ we include $n_i$ copies of the function $\frac{1}{n_i} f_i$, and then take the average of all $N = \sum_i n_i$ functions constructed this way. The value of $n_i$ depends on the importance of $f_i$: if $f_i$ is $L_i$-smooth for all $i \in [n]$ and we let $\bar{L} := \frac{1}{n}\sum_i L_i$, then we choose $n_i = \lceil L_i / \bar{L} \rceil$. We then apply ProxRR to this reformulation; a sketch of the construction follows.
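The resampling can be sketched in a few lines of NumPy; the closure-based rescaling below is our reading of the construction, with full details in Appendix I.

```python
import math
import numpy as np

def importance_resample(grads, L):
    # Replace each f_i by n_i = ceil(L_i / mean(L)) copies of f_i / n_i.
    # Each copy is (L_i / n_i)-smooth <= mean(L)-smooth, and the copies of
    # f_i sum back to f_i, so the objective is unchanged up to rescaling.
    L_bar = float(np.mean(L))
    new_grads = []
    for g, L_i in zip(grads, L):
        n_i = math.ceil(L_i / L_bar)
        new_grads.extend([lambda x, g=g, c=1.0 / n_i: c * g(x)] * n_i)
    return new_grads   # N = len(new_grads) <= 2n functions in total
```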
It is easy to show that $N \le 2n$, and hence our reformulation leads to at most a doubling of the number of functions forming the finite sum. However, the overall complexity of RR/ProxRR applied to this reformulation depends on $\bar{L}$ instead of $\max_i L_i$ (see Theorem 9), which can improve the convergence rate by up to a factor of $n$. For the details of the construction and our complexity results, see Appendix I.

**Extension 2: Decreasing stepsizes.** The convergence of RR is not always exact and depends on the parameters of the objective. Similarly, if the shuffling radius $\sigma_{\mathrm{rad}}^2$ is positive and we wish to find an $\varepsilon$-approximate solution, the optimal choice of a fixed stepsize for ProxRR depends on $\varepsilon$. This deficiency can be fixed by using decreasing stepsizes, both in vanilla RR (Ahn et al., 2020) and in SGD (Stich, 2019). However, the rate given by Ahn et al. (2020) does not recover linear convergence in the absence of noise (i.e., when $\sigma_* = 0$). We propose a stepsize schedule that both recovers linear convergence in the absence of noise and attains the optimal $\mathcal{O}((nT^2)^{-1})$ rate of RR in the presence of noise. Our proposed schedule is thus noise-adaptive, requiring no knowledge of the magnitude of the noise $\sigma_*$. For details, see Appendix J.

## 3. Theory for strongly convex objectives

### 3.1. Preliminaries

In the strongly convex setting, we build upon the notion of shuffling variance introduced by Mishchenko et al. (2020) for analyzing RR. Given a stepsize $\gamma > 0$ (held constant during each epoch) and a permutation $\pi$ of $\{1, 2, \dots, n\}$, we introduce the points $x_*^1, x_*^2, \dots, x_*^n$ defined by

$$x_*^i := x_* - \gamma \sum_{j=0}^{i-1} \nabla f_{\pi_j}(x_*), \qquad i = 1, \dots, n. \qquad (3)$$

The intuition behind this definition is fairly simple: if we performed $i$ steps starting at $x_*$, we would end up close to $x_*^i$. To quantify this closeness, we define the shuffling radius and then show how to upper bound it.

**Definition 1 (Shuffling radius).** Given a stepsize $\gamma > 0$ and a random permutation $\pi$ of $\{1, 2, \dots, n\}$ used in Algorithm 1, define $x_*^i = x_*^i(\gamma, \pi)$ as in (3). Then the shuffling radius is defined by

$$\sigma_{\mathrm{rad}}^2(\gamma) := \max_{i=0,\dots,n-1} \Big[ \frac{1}{\gamma^2}\, \mathbb{E}_\pi\big[ D_{f_{\pi_i}}(x_*^i, x_*) \big] \Big], \qquad (4)$$

where $D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$ is the Bregman divergence associated with $f$ evaluated at $x, y$, and where the expectation is taken with respect to the randomness in the permutation $\pi$. If multiple stepsizes $\gamma_1, \gamma_2, \dots$ are used in Algorithm 1, we take the maximum over all of them, i.e., $\sigma_{\mathrm{rad}}^2 := \max_{t \ge 1} \sigma_{\mathrm{rad}}^2(\gamma_t)$.

**Theorem 1 (Bounding the shuffling radius).** For any stepsize $\gamma > 0$ and any random permutation $\pi$ of $\{1, 2, \dots, n\}$ we have

$$\sigma_{\mathrm{rad}}^2 \le \frac{L_{\max}}{2}\, n \Big( n\, \|\nabla f(x_*)\|^2 + \frac{1}{2} \sigma_*^2 \Big),$$

where $x_*$ is a solution of problem (1) and $\sigma_*^2$ is the population variance at the optimum,

$$\sigma_*^2 := \frac{1}{n} \sum_{i=1}^n \|\nabla f_i(x_*) - \nabla f(x_*)\|^2. \qquad (5)$$

All proofs are relegated to the supplementary material. To better understand the bound given by Theorem 1, observe that if there is no proximal operator (i.e., $\psi = 0$), then $\nabla f(x_*) = 0$ and we get $\sigma_{\mathrm{rad}}^2 \le \frac{L_{\max} n \sigma_*^2}{4}$. This recovers the existing upper bound on the shuffling variance of Mishchenko et al. (2020) for vanilla RR. On the other hand, if $\nabla f(x_*) \neq 0$, we get an additive term proportional to the squared norm of $\nabla f(x_*)$. The bound is easy to sanity-check numerically, as sketched below.
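Below is a small Monte Carlo estimate of (4) on a toy problem with quadratics $f_i(x) = \frac{L_i}{2}\|x - c_i\|^2$ and $\psi(x) = \frac{\lambda}{2}\|x\|^2$, compared against Theorem 1; the problem instance and all constants are our own construction, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, gamma, trials = 10, 5, 0.5, 1e-2, 2000
L = rng.uniform(1.0, 3.0, n)                  # f_i(x) = L_i/2 * ||x - c_i||^2
C = rng.normal(size=(n, d))
x_star = (L @ C / n) / (L.mean() + lam)       # closed-form minimizer of f + lam/2*||x||^2

grad = lambda i, x: L[i] * (x - C[i])
breg = lambda i, x: 0.5 * L[i] * np.linalg.norm(x - x_star) ** 2  # D_{f_i}(x, x_*)

acc = np.zeros(n)
for _ in range(trials):                        # Monte Carlo over permutations pi
    perm, x = rng.permutation(n), x_star.copy()
    for i, p in enumerate(perm):
        acc[i] += breg(p, x) / gamma ** 2      # term i of (4) for this permutation
        x = x - gamma * grad(p, x_star)        # x_*^{i+1}: gradients evaluated at x_*
sigma_rad_sq = (acc / trials).max()

gf2 = lam ** 2 * np.linalg.norm(x_star) ** 2   # ||grad f(x_*)||^2 = ||-lam * x_*||^2
sg2 = np.mean([np.linalg.norm(grad(i, x_star) + lam * x_star) ** 2 for i in range(n)])
print(sigma_rad_sq, "<=", L.max() / 2 * n * (n * gf2 + sg2 / 2))   # Theorem 1
```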
### 3.2. Convergence guarantees

Our first theorem establishes a convergence rate for Algorithm 1 applied with a constant stepsize to problem (1) when each objective $f_i$ is strongly convex. This assumption is commonly satisfied in machine learning applications where each $f_i$ represents a regularized loss on some data points, as in $\ell_2$-regularized linear regression and $\ell_2$-regularized logistic regression.

**Theorem 2.** Let Assumption 1 be satisfied. Further, assume that each $f_i$ is $\mu$-strongly convex. If Algorithm 1 is run with constant stepsize $\gamma_t = \gamma \le 1/L_{\max}$, then its iterates satisfy

$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \le (1 - \gamma\mu)^{nT} \|x_0 - x_*\|^2 + \frac{2\gamma^2 \sigma_{\mathrm{rad}}^2}{\mu}.$$

We can convert the guarantee of Theorem 2 into a convergence rate by properly tuning the stepsize and using the upper bound of Theorem 1 on the shuffling radius. In particular, if we choose the stepsize as $\gamma = \min\big\{\frac{1}{L_{\max}}, \frac{\sqrt{\varepsilon\mu}}{\sqrt{2}\,\sigma_{\mathrm{rad}}}\big\}$, let $\kappa := L_{\max}/\mu$ and $r_0 := \|x_0 - x_*\|^2$, then we obtain $\mathbb{E}\big[\|x_T - x_*\|^2\big] = \mathcal{O}(\varepsilon)$ provided that the total number of iterations $K_{\mathrm{RR}} = nT$ is at least

$$K_{\mathrm{RR}} \ge \Big( \kappa + \sqrt{\tfrac{\kappa n}{\varepsilon\mu}} \big( \sqrt{n}\,\|\nabla f(x_*)\| + \sigma_* \big) \Big) \log\Big(\frac{2 r_0}{\varepsilon}\Big). \qquad (6)$$

**Comparison with vanilla RR.** If there is no proximal operator, then $\nabla f(x_*) = 0$ and we recover the earlier result of Mishchenko et al. (2020) on the convergence of RR without prox, which is optimal in $\varepsilon$ up to logarithmic factors. On the other hand, when the proximal operator is nontrivial, we get an extra term in the complexity proportional to $\|\nabla f(x_*)\|$: thus, even when all the functions are the same (i.e., $\sigma_* = 0$), we do not recover the linear convergence of Proximal Gradient Descent (Karimi et al., 2016; Beck, 2017). This is easily explained by the fact that Algorithm 1 performs $n$ gradient steps per proximal step. Hence, even if $f_1 = \dots = f_n$, Algorithm 1 does not reduce to Proximal Gradient Descent. We note that other algorithms for composite optimization that do not take a proximal step at every iteration suffer from the same dependence (Patrascu & Irofti, 2021).

**Comparison with Proximal SGD.** To compare (6) against the complexity of Proximal SGD (Algorithm 3 in Appendix C), we recall that Proximal SGD achieves $\mathbb{E}\big[\|x_K - x_*\|^2\big] = \mathcal{O}(\varepsilon)$ if either $f$ or $\psi$ is $\mu$-strongly convex and

$$K_{\mathrm{SGD}} \ge \Big( \kappa + \frac{\sigma_*^2}{\varepsilon\mu^2} \Big) \log\Big(\frac{2 r_0}{\varepsilon}\Big). \qquad (7)$$

This result is standard (Needell et al., 2016; Gower et al., 2019), with the exception that we do not know of any proof in the literature for the case when $\psi$ is strongly convex. For completeness, we prove it in Appendix C. Comparing $K_{\mathrm{SGD}}$ (given by (7)) and $K_{\mathrm{RR}}$ (given by (6)), we see that ProxRR has a milder dependence on $\varepsilon$ than Proximal SGD. In particular, ProxRR converges faster whenever the target accuracy $\varepsilon$ is small enough to satisfy

$$\varepsilon \le \frac{1}{L_{\max} n \mu} \cdot \frac{\sigma_*^4}{n \|\nabla f(x_*)\|^2 + \sigma_*^2}.$$

Furthermore, ProxRR is much better when we consider the proximal iteration complexity (the number of proximal operator accesses), in which case the complexity of ProxRR is (6) reduced by a factor of $n$ (because we take one proximal step every $n$ iterations), while the proximal iteration complexity of Proximal SGD remains (7). In this case, ProxRR is better whenever the accuracy $\varepsilon$ satisfies

$$\varepsilon \ge \frac{n G_*^2}{L_{\max}\mu} \quad \text{or} \quad \varepsilon \le \frac{n \sigma_*^4}{L_{\max}\mu G_*^2}, \qquad \text{where } G_*^2 := n\|\nabla f(x_*)\|^2 + \sigma_*^2.$$

We see that if the target accuracy is large enough or small enough, and if the cost of the proximal operator dominates the computation, ProxRR converges much more quickly than Proximal SGD; a toy numerical illustration of these thresholds follows.
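As a worked example, one can plug constants into the thresholds above; all of the numbers here are hypothetical and serve only to make the comparison concrete.

```python
L_max, mu, n = 10.0, 0.1, 1_000             # hypothetical problem constants
grad_norm_sq, sigma_sq = 1e-2, 1.0          # ||grad f(x_*)||^2 and sigma_*^2
G_sq = n * grad_norm_sq + sigma_sq

# ProxRR needs fewer stochastic gradients than Proximal SGD below this accuracy:
eps_grad = sigma_sq ** 2 / (L_max * n * mu * G_sq)

# ProxRR wins on prox-operator calls for large or small enough accuracy:
eps_hi = n * G_sq / (L_max * mu)
eps_lo = n * sigma_sq ** 2 / (L_max * mu * G_sq)
print(f"gradient-count threshold: eps <= {eps_grad:.2e}")
print(f"prox-count thresholds:    eps >= {eps_hi:.2e} or eps <= {eps_lo:.2e}")
```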
**Comparison with ProxSVRG.** Variance-reduced methods can improve the convergence rate of SGD from sublinear to linear by using better estimates of the gradients $\nabla f_i$. ProxSVRG (Xiao & Zhang, 2014) is a common variance-reduced method used for solving (1) in practice (Tang et al., 2020). Xiao & Zhang (2014) give the following iteration complexity (in both proximal and gradient oracle calls), up to constant factors:

$$K_{\mathrm{SVRG}} \ge (\kappa + n) \log\Big(\frac{4 r_0}{\varepsilon}\Big). \qquad (8)$$

Comparing (8) and (6) reveals that, in terms of gradient oracle calls, using ProxRR is beneficial if the accuracy satisfies $\varepsilon \ge \frac{\kappa G_*^2}{\mu n}$, where $G_*^2 := n\|\nabla f(x_*)\|^2 + \sigma_*^2$. This bound means that ProxRR is better when the problem is well-conditioned, or when the number of functions $n$ is large and the minimizers of $P$ and $f$ match well (i.e., $\|\nabla f(x_*)\|$ is small). The situation is much better for proximal oracle calls, where ProxRR is better when $\varepsilon \ge \frac{\kappa}{\mu n}\big(\|\nabla f(x_*)\|^2 + \frac{\sigma_*^2}{n}\big)$. Observe that this expression is decreasing in $n$ as long as $\sigma_*^2/n$ is nonincreasing in $n$, a very mild requirement satisfied, for example, if the functions $f_i$ are themselves sampled i.i.d. from some distribution with mean $f$ and bounded variance.

**Extension for strongly convex regularizers.** In Theorem 2 we assume that each $f_i$ is $\mu$-strongly convex. This is motivated by the common practice of using $\ell_2$ regularization in machine learning. However, applying $\ell_2$ regularization in every step of Algorithm 1 can be expensive when the data are sparse and the iterates $x_t^i$ are dense, because it requires accessing every coordinate of $x_t^i$, which can be much more expensive than computing the sparse gradients $\nabla f_i(x_t^i)$. Alternatively, we may place the $\ell_2$ regularization inside $\psi$ and only ask that $\psi$ be strongly convex; this way, we can save a lot of time, as we need to access each coordinate of the dense iterates only once per epoch rather than in every iteration. Theorem 8 in Appendix D.3 gives a convergence guarantee in this setting, similar to that of Theorem 2.

## 4. Theory for non-convex objectives

We now present our theory for the nonconvex case. To quantify convergence, we use the proximal-gradient mapping, which has also been used in the prior literature to show convergence of Proximal SGD.

**Definition 2.** Given a stepsize $\gamma > 0$, a convex function $\psi$ and an arbitrary $f$, we define the proximal-gradient mapping as

$$G_\gamma(x) := \frac{1}{\gamma}\big( x - \mathrm{prox}_{\gamma\psi}(x - \gamma \nabla f(x)) \big).$$

Similarly to Theorem 1, the analysis shows that a gradient term appears in the variance bound. However, in contrast to the convex setting of Theorem 1, there might not exist an optimum to which the iterates converge, so we cannot use $\|\nabla f(x_*)\|^2$ in the variance bound. For this reason, we resort to the following assumption, which bounds full gradients in terms of the proximal-gradient mapping and an extra constant.

**Assumption 2.** There exists a constant $\zeta \ge 0$ such that the full gradient of $f$ is uniformly bounded by the proximal-gradient mapping and $\zeta$:

$$\|\nabla f(x)\|^2 \le \|G_{\gamma n}(x)\|^2 + \zeta^2 \quad \text{for any } x \in \mathrm{dom}(\psi) \text{ and } \gamma > 0.$$

We note that this assumption is trivially satisfied with $\zeta = 0$ if $\psi \equiv 0$, because in that case $G_\gamma(x) \equiv \nabla f(x)$. Therefore, when there is no proximal term, it is not an extra assumption compared to the analysis of Mishchenko et al. (2020). We also rely on the following measure of gradient variance, needed for the same reason: there might be no optimum $x_*$ at which to measure the variance the way we did for Theorem 1.

**Assumption 3.** There exists a constant $\sigma > 0$ such that $\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$ for any $x \in \mathbb{R}^d$.

Note that this assumption can be relaxed by introducing extra terms on the right-hand side, as in (Khaled & Richtárik, 2020). Nevertheless, for the sake of simplicity and readability, we prefer the stronger version presented above. A direct transcription of the mapping from Definition 2 is given below.
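The proximal-gradient mapping of Definition 2 is a one-liner in code, assuming generic callables for $\nabla f$ and the prox:

```python
def prox_grad_mapping(x, grad_f, prox, gamma):
    # G_gamma(x) = (x - prox_{gamma*psi}(x - gamma * grad f(x))) / gamma  (Definition 2).
    # With psi = 0 the prox is the identity and G_gamma(x) reduces to grad f(x);
    # in general ||G_gamma(x)|| = 0 exactly at stationary points of f + psi.
    return (x - prox(x - gamma * grad_f(x), gamma)) / gamma
```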
Theorem 3 gives our main convergence result.

**Theorem 3 (Convergence in the nonconvex case).** Let Assumptions 2 and 3 hold and choose any $\gamma \le \frac{1}{5 L_{\max} n}$. Then

$$\min_{t=0,\dots,T-1} \mathbb{E}\big[\|G_{\gamma n}(x_t)\|^2\big] \le \frac{4 (P(x_0) - P_*)}{\gamma n T} + 2\gamma^2 L_{\max}^2 n^2 \zeta^2 + 2\gamma^2 L_{\max}^2 n \sigma^2.$$

Instead of a guarantee on the minimum of the prox-grad mapping norm, we can get the same guarantee by choosing an iterate at random. This is standard in stochastic nonconvex optimization since, for any oracle with access only to stochastic gradients, a guarantee for a fixed iterate (e.g., the final one) is impossible in general (Drori & Shamir, 2020).

**Obtaining a complexity.** Set the stepsize $\gamma$ to

$$\gamma = \min\Big\{ \frac{1}{5 L_{\max} n},\; \frac{\varepsilon}{L_{\max}\big(\sqrt{n}\,\sigma + n\zeta\big)} \Big\}.$$

Plugging this into Theorem 3 and denoting $\delta_0 := P(x_0) - P_*$, we obtain that, to reach an $\varepsilon$-stationary point, the complexity in terms of the full number of stochastic gradients $nT$ is

$$nT = \mathcal{O}\Big( \frac{\delta_0 L_{\max} n}{\varepsilon^2} + \frac{\delta_0 L_{\max} \sqrt{n}\,\sigma}{\varepsilon^3} + \frac{\delta_0 L_{\max} n \zeta}{\varepsilon^3} \Big).$$

When the accuracy $\varepsilon$ is small enough, this rate is better than the $\mathcal{O}(\varepsilon^{-4})$ rate of convergence of Proximal SGD (Davis & Drusvyatskiy, 2018).

## 5. FedRR: application of ProxRR to federated learning

Let us now consider the problem of minimizing the average of $N = \sum_{m=1}^M N_m$ functions stored on $M$ devices, which hold $N_1, \dots, N_M$ samples respectively:

$$\min_{x \in \mathbb{R}^d} F(x) + R(x), \qquad F(x) = \frac{1}{N} \sum_{m=1}^M F_m(x), \qquad (9)$$

where $F_m(x) := \sum_{j=1}^{N_m} f_{mj}(x)$. For example, $f_{mj}(x)$ can be the loss associated with a single sample $(X_{mj}, y_{mj})$, where the pairs $(X_{mj}, y_{mj})$ follow a distribution $\mathcal{D}_m$ specific to device $m$. An important instance of this formulation is federated learning, where $M$ devices train a shared model by communicating periodically with a central node. We normalize the objective in (9) by $N$ because this is the total number of functions after each $F_m$ is expanded into a sum. We denote the solution of (9) by $x_*$.

**Extending the space.** To rewrite the problem as an instance of (1), we consider a bigger product space, as is sometimes done in distributed optimization (Bianchi et al., 2015). Define $n := \max\{N_1, \dots, N_M\}$ and introduce $\psi_C$, the consensus constraint, defined via

$$\psi_C(x_1, \dots, x_M) := \begin{cases} 0, & x_1 = \dots = x_M, \\ +\infty, & \text{otherwise.} \end{cases}$$

By introducing dummy variables $x_1, \dots, x_M$ and adding the constraint $x_1 = \dots = x_M$, we arrive at the intermediate problem

$$\min_{x_1, \dots, x_M \in \mathbb{R}^d} \frac{1}{N} \sum_{m=1}^M F_m(x_m) + (R + \psi_C)(x_1, \dots, x_M),$$

where $R + \psi_C$ is defined, with a slight abuse of notation, as $(R + \psi_C)(x_1, \dots, x_M) = R(x_1)$ if $x_1 = \dots = x_M$, and $(R + \psi_C)(x_1, \dots, x_M) = +\infty$ otherwise. Since we have replaced $R$ with the more complicated regularizer $R + \psi_C$, we need to understand how to compute the proximal operator of the latter. We show (Lemma 8 in the supplementary) that the proximal operator of $(R + \psi_C)$ is merely the projection onto $\{(x_1, \dots, x_M) \mid x_1 = \dots = x_M\}$, i.e., averaging, followed by the proximal operator of $R$ with a smaller stepsize; a sketch of this operator appears below.

**Reformulation.** To have $n$ functions in every $F_m$, we pad $F_m$ with $n - N_m$ zero functions, $f_{mj}(x) \equiv 0$ for $j > N_m$, so that $F_m(x_m) = \sum_{j=1}^n f_{mj}(x_m) = \sum_{j=1}^{N_m} f_{mj}(x_m) + \sum_{j=N_m+1}^n 0$. We can now stack the vectors together into $\mathbf{x} = (x_1, \dots, x_M) \in \mathbb{R}^{Md}$ and multiply the objective by $\frac{N}{n}$, which gives the following reformulation:

$$\min_{\mathbf{x} \in \mathbb{R}^{Md}} \frac{1}{n} \sum_{i=1}^n f_i(\mathbf{x}) + \psi(\mathbf{x}), \qquad (10)$$

where $\psi(\mathbf{x}) := \frac{N}{n}(R + \psi_C)(\mathbf{x})$ and $f_i(\mathbf{x}) = f_i(x_1, \dots, x_M) := \sum_{m=1}^M f_{mi}(x_m)$. In other words, the function $f_i(\mathbf{x})$ includes the $i$-th data sample from each device and contains at most one loss from every device, while $F_m(x)$ combines all data losses on device $m$.
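Under our reading of Lemma 8, the proximal operator of the scaled consensus regularizer $\psi = \frac{N}{n}(R + \psi_C)$ reduces to averaging followed by a single prox of $R$. The stepsize rescaling below is derived from (10) and is an assumption on our part, not a verbatim transcription:

```python
import numpy as np

def prox_consensus(Y, prox_R, gamma, N):
    # Prox of gamma*n*psi with psi = (N/n)*(R + psi_C) over the product space:
    # minimize gamma*N*R(w) + 0.5 * sum_m ||w - y_m||^2 over consensus points
    # (w, ..., w), which equals prox of R at the average with stepsize gamma*N/M.
    M = Y.shape[0]                        # Y has one row per device
    w = prox_R(Y.mean(axis=0), gamma * N / M)
    return np.tile(w, (M, 1))             # all blocks equal after the prox
```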
Note that the solution of (10) is $\mathbf{x}_* := (x_*, \dots, x_*)$, and the gradient of the extended function $f_i$ is given by

$$\nabla f_i(\mathbf{x}) = \big( \nabla f_{1i}(x_1)^\top, \dots, \nabla f_{Mi}(x_M)^\top \big)^\top.$$

Therefore, a stochastic gradient step using $\nabla f_i(\mathbf{x})$ corresponds to updating all local models with the gradient of the $i$-th data sample, without any communication. Algorithm 1 for this specific problem can be written in terms of $x_1, \dots, x_M$, which results in Algorithm 2 (a Python sketch of it appears at the end of this subsection). Note that since $f_{mi}(x_m)$ depends only on $x_m$, computing its gradient requires no communication. Only once the local epochs are finished are the vectors averaged, as the result of projecting onto the set $\{(x_1, \dots, x_M) \mid x_1 = \dots = x_M\}$.

**Algorithm 2** Federated Random Reshuffling (FedRR)
1: **Input:** stepsize $\gamma > 0$, initial vector $x_0 = x_0^0 \in \mathbb{R}^d$, number of epochs $T$, number of functions $N_m$ on each machine $m$; set $N = \sum_{m=1}^M N_m$ and $n = \max_m N_m$
2: **for** epochs $t = 0, 1, \dots, T-1$ **do**
3: &nbsp;&nbsp;&nbsp; **for** $m = 1, \dots, M$ locally in parallel **do**
4: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t,m}^0 = x_t$
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sample a permutation $\pi_{0,m}, \pi_{1,m}, \dots, \pi_{N_m-1,m}$ of $\{1, 2, \dots, N_m\}$
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **for** $i = 0, 1, \dots, N_m - 1$ **do**
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t,m}^{i+1} = x_{t,m}^i - \gamma \nabla f_{m\pi_{i,m}}(x_{t,m}^i)$
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **end for**
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{t,m}^n = x_{t,m}^{N_m}$
10: &nbsp;&nbsp;&nbsp; **end for**
11: &nbsp;&nbsp;&nbsp; $x_{t+1} = \mathrm{prox}_{\frac{\gamma N}{M} R}\big( \frac{1}{M} \sum_{m=1}^M x_{t,m}^n \big)$
12: **end for**

**Reformulation properties.** To analyze FedRR, all we need to do is understand the properties of the reformulation (10) and then apply Theorem 2 or Theorem 8. The following lemma gives the smoothness and strong convexity properties of (10).

**Lemma 1.** Let the function $f_{mi}$ be $L_i$-smooth and $\mu$-strongly convex for every $m$. Then $f_i$ from reformulation (10) is $L_i$-smooth and $\mu$-strongly convex.

The previous lemma shows that the conditioning of the reformulation is $\kappa = \frac{L_{\max}}{\mu}$, just as we would expect. Moreover, it implies that the requirement on the stepsize remains exactly the same: $\gamma \le 1/L_{\max}$. What remains unknown is the value of $\sigma_{\mathrm{rad}}^2$, which plays a key role in the convergence bounds for ProxRR and ProxSO. To find an upper bound on $\sigma_{\mathrm{rad}}^2$, let us define

$$\sigma_{m,*}^2 := \frac{1}{N_m} \sum_{j=1}^{N_m} \Big\| \nabla f_{mj}(x_*) - \frac{1}{N_m} \nabla F_m(x_*) \Big\|^2,$$

which is the variance of the local gradients on device $m$. This quantity characterizes the convergence rate of Local SGD (Yuan et al., 2020), so we should expect it to appear in our bounds too. The next lemma explains how to use it to upper bound $\sigma_{\mathrm{rad}}^2$.

**Lemma 2.** The shuffling radius $\sigma_{\mathrm{rad}}^2$ of the reformulation (10) is upper bounded by

$$\sigma_{\mathrm{rad}}^2 \le \frac{L_{\max}}{2} \sum_{m=1}^M \|\nabla F_m(x_*)\|^2 + \frac{L_{\max} n}{4} \sum_{m=1}^M \sigma_{m,*}^2.$$

The lemma shows that the upper bound on $\sigma_{\mathrm{rad}}^2$ depends on the sum of the local variances $\sum_{m=1}^M \sigma_{m,*}^2$, as well as on the local gradient norms $\sum_{m=1}^M \|\nabla F_m(x_*)\|^2$. Both of these sums appear in the existing literature on the convergence of Local GD/SGD (Woodworth et al., 2020; Yuan et al., 2020). We are now ready to present formal convergence results. For simplicity, we consider the heterogeneous and homogeneous cases separately and assume that $N_1 = \dots = N_M = n$. To further illustrate the generality of our results, we present the heterogeneous case assuming strong convexity of $R$, and the homogeneous case under strong convexity of the functions $f_{mi}$.

**Heterogeneous data.** For the case of heterogeneous data, we provide the first local RR method. We can apply either Theorem 2 or Theorem 8, but for brevity we give only the corollary obtained from Theorem 8.

**Theorem 4.** Assume that the functions $f_{mi}$ are convex and $L_i$-smooth for each $m$ and $i$. If $R$ is $\mu$-strongly convex and $\gamma \le 1/L_{\max}$, then the iterates produced by Algorithm 2 satisfy

$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \le (1 + 2\gamma\mu n)^{-T} \|x_0 - x_*\|^2 + \frac{\gamma^2 L_{\max}}{M\mu} \Big( \frac{1}{2} \sum_{m=1}^M \|\nabla F_m(x_*)\|^2 + \frac{N}{4M} \sum_{m=1}^M \sigma_{m,*}^2 \Big).$$
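For reference, here is a compact single-process simulation of Algorithm 2. It is not the authors' implementation: the callables and the sequential stand-in for the parallel device loop are illustrative assumptions.

```python
import numpy as np

def fed_rr(device_grads, prox_R, x, gamma, n_epochs, seed=0):
    # Algorithm 2: each device runs one epoch of local RR, then the server
    # averages the local iterates and applies a single prox of R.
    rng = np.random.default_rng(seed)
    M = len(device_grads)                          # device_grads[m][j](x) = grad f_{mj}(x)
    N = sum(len(g) for g in device_grads)
    for _ in range(n_epochs):
        local = []
        for grads_m in device_grads:               # in parallel on real devices
            xm = x.copy()
            for j in rng.permutation(len(grads_m)):
                xm = xm - gamma * grads_m[j](xm)   # local reshuffled pass, no prox
            local.append(xm)
        x = prox_R(np.mean(local, axis=0), gamma * N / M)   # one communication round
    return x
```

Note that the gradient steps never communicate; communication happens only in the averaging-plus-prox line, once per epoch, which is the source of FedRR's small communication footprint.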
In the supplementary material (Appendix G), we show that our rates for FedRR improve over the best known rates for both Local SGD and Distributed Gradient Descent in the heterogeneous data setting. For the nonconvex analysis, we consider $R \equiv 0$ and require the following standard assumption.

**Assumption 4 (Bounded variance and dissimilarity).** There exist constants $\sigma, \zeta > 0$ such that for any $x \in \mathbb{R}^d$ and any $m$,

$$\frac{1}{n} \sum_{i=1}^n \Big\| \nabla f_{mi}(x) - \frac{1}{n} \nabla F_m(x) \Big\|^2 \le \sigma^2 \quad \text{and} \quad \frac{1}{M} \sum_{m=1}^M \Big\| \frac{1}{n} \nabla F_m(x) - \nabla F(x) \Big\|^2 \le \zeta^2.$$

Note that above, $\frac{1}{n}\nabla F_m(x) = \frac{1}{N_m}\nabla F_m(x)$ is the gradient of a local dataset and $\nabla F(x) = \frac{1}{N}\sum_{l=1}^M \nabla F_l(x)$ is the full gradient over all data.

**Theorem 5 (Nonconvex convergence).** Let Assumptions 1 and 4 be satisfied, and let $R \equiv 0$ (no prox). Then the communication complexity to achieve $\mathbb{E}\big[\|\nabla F(x_T)\|^2\big] \le \varepsilon^2$ is

$$T = \mathcal{O}\Big( \Big( \frac{1}{\varepsilon^2} + \frac{\sigma}{\sqrt{n}\,\varepsilon^3} + \frac{\zeta}{\varepsilon^3} \Big) (F(x_0) - F_*) \Big).$$

Notice that by replicating the data locally on each device, thereby increasing the value of $n$ without changing the objective, we can improve the second term in the communication complexity. In particular, if the data are not too dissimilar ($\sigma \gg \zeta$) and $\varepsilon$ is small ($\frac{1}{\varepsilon^3} \gg \frac{1}{\varepsilon^2}$), the second term in the complexity dominates, and it helps to take more local steps. However, if the data are less similar, the nodes have to communicate more frequently to get information about the other objectives.

**Homogeneous data.** For simplicity, in the homogeneous (i.e., i.i.d.) data case we provide guarantees without the proximal operator. Since then $F_1(x) = \dots = F_M(x)$, for any $m$ it holds that $\nabla F_m(x_*) = 0$, and thus $\sigma_{m,*}^2 = \frac{1}{n}\sum_{j=1}^n \|\nabla f_{mj}(x_*)\|^2$. The full variance is then given by

$$\sum_{m=1}^M \sigma_{m,*}^2 = \frac{1}{n} \sum_{m=1}^M \sum_{i=1}^n \|\nabla f_{mi}(x_*)\|^2 = \frac{N}{n} \sigma_*^2,$$

where $\sigma_*^2 := \frac{1}{N}\sum_{i=1}^n \sum_{m=1}^M \|\nabla f_{mi}(x_*)\|^2$ is the variance of the gradients over all data.

**Theorem 6.** Let $R \equiv 0$ (no prox) and let the data be i.i.d., that is, $\nabla F_m(x_*) = 0$ for any $m$, where $x_*$ is the solution of (9). Let $\sigma_*^2 := \frac{1}{N}\sum_{i=1}^n \sum_{m=1}^M \|\nabla f_{mi}(x_*)\|^2$. If each $f_{mj}$ is $L_{\max}$-smooth and $\mu$-strongly convex, then the iterates of Algorithm 2 satisfy

$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \le (1 - \gamma\mu)^{nT} \|x_0 - x_*\|^2 + \frac{\gamma^2 L_{\max} N \sigma_*^2}{M\mu}.$$

Observe that the guarantee given by Theorem 6 scales with the effective number of functions per machine $N/M$, similar to the scaling displayed by single-node RR.

**Corollary 5.1.** Choose the stepsize $\gamma > 0$ as $\gamma = \min\Big\{ \frac{1}{L_{\max}}, \sqrt{\frac{\epsilon M \mu}{2 L_{\max} N \sigma_*^2}} \Big\}$ and suppose that the total number of iterations $K = nT$ satisfies

$$K \ge \Big( \kappa + \frac{\sqrt{2 L_{\max} N}\,\sigma_*}{\sqrt{\epsilon M}\,\mu^{3/2}} \Big) \log\Big( \frac{2\|x_0 - x_*\|^2}{\epsilon} \Big).$$

Then $\mathbb{E}\big[\|x_T - x_*\|^2\big] \le \epsilon$.

In the small-accuracy regime, Corollary 5.1 shows that FedRR enjoys a convergence rate depending on $\frac{1}{\sqrt{\epsilon}}$, compared to the $\frac{1}{\epsilon}$ rate of convergence of FedAvg (Karimireddy et al., 2020).

## 6. Experiments

(Our code is available on GitHub: https://github.com/konstmish/rr_prox_fed; more experimental details are in the appendix.)

**ProxRR vs SGD.** In Figure 1, we look at the logistic regression loss with elastic net regularization,

$$\frac{1}{N} \sum_{i=1}^N f_i(x) + \lambda_1 \|x\|_1 + \frac{\lambda_2}{2} \|x\|^2, \qquad (11)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is defined as $f_i(x) := -b_i \log\big(h(a_i^\top x)\big) - (1 - b_i) \log\big(1 - h(a_i^\top x)\big)$, the pairs $(a_i, b_i) \in \mathbb{R}^d \times \{0, 1\}$, $i = 1, \dots, N$, are the data samples, $h : t \mapsto 1/(1 + e^{-t})$ is the sigmoid function, and $\lambda_1, \lambda_2 \ge 0$ are parameters. We set the minibatch size to 32 for all methods and use theoretical stepsizes, without any tuning. We denote by "RR (iteration prox)" the heuristic version of RR that performs a proximal step after each iteration. From the experiments, we can see that all methods behave more or less the same way per data pass. However, the algorithm that we propose needs only a small fraction of the proximal operator evaluations, which gives it a huge advantage whenever the operator takes more time to compute than the stochastic gradients. A sketch of the experimental objective follows.
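For concreteness, here is a sketch of the per-sample loss gradient of (11) together with the $\ell_1$ part of the elastic-net prox; the helper names and factory structure are ours, not the paper's.

```python
import numpy as np

def logreg_grad(A, b, lam2):
    # Gradient of f_i(x) = -b_i*log h(a_i'x) - (1 - b_i)*log(1 - h(a_i'x)),
    # with the smooth l2 term lam2/2 * ||x||^2 folded into the gradient.
    def grad(i, x):
        h = 1.0 / (1.0 + np.exp(-A[i] @ x))        # sigmoid h(a_i' x)
        return (h - b[i]) * A[i] + lam2 * x
    return grad

def prox_elastic_l1(x, step, lam1):
    # The remaining nonsmooth part lam1 * ||x||_1 is handled by soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - step * lam1, 0.0)
```

With these two pieces, ProxRR from the sketch in Section 2 applies the prox once per epoch, while the "iteration prox" heuristic would call it after every gradient step.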
**FedRR vs Local SGD and Scaffold.** We also compare the performance of FedRR, Local SGD and Scaffold (Karimireddy et al., 2020) on homogeneous (i.e., i.i.d.) and heterogeneous data. Since Local SGD and Scaffold require smaller stepsizes to converge, they are significantly slower in the i.i.d. regime, as can be seen in Figure 2. FedRR, however, does not need a small initial stepsize and very quickly converges to a noisy neighborhood of the solution. We obtain the heterogeneous regime by sorting the data with respect to the labels and mixing the sorted dataset with the unsorted one. In this scenario, we also use the same small stepsize for every method to address the data heterogeneity. Clearly, Scaffold is the best in terms of functional values because it performs variance reduction with respect to the data. Extending FedRR in the same way might be useful too, but this goes beyond the scope of our paper and we leave it for future work. We also note that in terms of distance from the optimum, FedRR still performs much better than Local SGD and Scaffold.

*Figure 1.* Experimental results for problem (11). The first two plots (suboptimality against data passes and against prox steps, for SGD, RR (iteration prox), and RR (epoch prox)) show averages with confidence intervals estimated on 20 random seeds and clearly demonstrate that one can save a lot of proximal operator computations with our method. The right plot shows the best/worst convergence of ProxSO over 20,000 sampled permutations.

*Figure 2.* FedRR vs Local SGD and Scaffold, plotted against communication rounds: i.i.d. data (left) and heterogeneous data (middle and right). We set $\lambda_1 = 0$ and estimate the averages and standard deviations by running 10 random seeds for each method.

## References

Ahn, K., Yun, C., and Sra, S. SGD with shuffling: optimal rates without component convexity and large epoch requirements. NeurIPS 2020; arXiv:2006.06946, 2020.

Beck, A. First-Order Methods in Optimization. SIAM, Philadelphia, PA, 2017.

Bertsekas, D. P. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In Sra, S., Nowozin, S., and Wright, S. J. (eds.), Optimization for Machine Learning, chapter 4. The MIT Press, 2011.

Bianchi, P., Hachem, W., and Iutzeler, F. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2015.

Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, 2009.

Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009.
Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.

Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.

Davis, D. and Drusvyatskiy, D. Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv:1802.02988, 2018.

Drori, Y. and Shamir, O. The complexity of finding stationary points with stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of PMLR, pp. 2658–2667, 2020.

Duchi, J. and Singer, Y. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.

Gorbunov, E., Hanzely, F., and Richtárik, P. A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent. In Proceedings of Machine Learning Research, volume 108, pp. 680–690. PMLR, 2020.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. SGD: general analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of PMLR, pp. 5200–5209, 2019.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. Mathematical Programming, pp. 1–58, 2020.

Gürbüzbalaban, M., Ozdağlar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 2019.

Haochen, J. and Sra, S. Random shuffling beats SGD after finite epochs. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of PMLR, pp. 2624–2633, 2019.

Kairouz, P. et al. Advances and open problems in federated learning. arXiv:1912.04977, 2019.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In ECML PKDD 2016, pp. 795–811. Springer, 2016.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S. U., and Suresh, A. T. SCAFFOLD: stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.

Khaled, A. and Richtárik, P. Better theory for SGD in the nonconvex world. arXiv:2002.03329, 2020.
Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 4519–4529. PMLR, 2020.

Konečný, J., McMahan, H. B., Yu, F., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Mishchenko, K., Khaled, A., and Richtárik, P. Random Reshuffling: simple analysis with vast improvements. NeurIPS 2020; arXiv:2006.05988, 2020.

Nagaraj, D., Jain, P., and Netrapalli, P. SGD without replacement: sharper rates for general smooth convex functions. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of PMLR, pp. 4703–4711, 2019.

Needell, D., Srebro, N., and Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1):549–573, 2016.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Patrascu, A. and Irofti, P. Stochastic proximal splitting algorithm for composite minimization. Optimization Letters, pp. 1–19, 2021.

Pham, N. H., Nguyen, L. M., Phan, D. T., and Tran-Dinh, Q. ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. Journal of Machine Learning Research, 21(110):1–48, 2020.

Recht, B. and Ré, C. Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23, pp. 11.1–11.24, 2012.

Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

Safran, I. and Shamir, O. Random shuffling beats SGD only after many epochs on ill-conditioned problems. arXiv:2106.06880, 2021.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 46–54, 2016.

Shang, F., Jiao, L., Zhou, K., Cheng, J., Ren, Y., and Jin, Y. ASVRG: accelerated proximal SVRG. In Proceedings of Machine Learning Research, volume 95, pp. 815–830. PMLR, 2018.
Stich, S. U. Unified optimal analysis of the (stochastic) gradient method. arXiv:1907.04232, 2019.

Sun, R.-Y. Optimization for deep learning: an overview. Journal of the Operations Research Society of China, 8(2):249–294, 2020.

Tang, J., Egiazarian, K., Golbabaee, M., and Davies, M. The practicality of stochastic optimization in imaging inverse problems. IEEE Transactions on Computational Imaging, 6:1471–1485, 2020.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Tran, T. H., Nguyen, L. M., and Tran-Dinh, Q. Shuffling gradient-based methods with momentum. arXiv:2011.11884, 2020.

Vogel, C. Computational Methods for Inverse Problems. SIAM, Philadelphia, 2002.

Woodworth, B., Patel, K. K., and Srebro, N. Minibatch vs Local SGD for heterogeneous distributed learning. NeurIPS 2020; arXiv:2006.04735, 2020.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Yuan, H., Zaheer, M., and Reddi, S. Federated composite optimization. arXiv:2011.08474, 2020.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

## Supplementary Material

**Contents:** 1. Introduction; 2. Contributions (2.1 RR in new problem settings; 2.2 Improving vanilla RR); 3. Theory for strongly convex objectives (3.1 Preliminaries; 3.2 Convergence guarantees); 4. Theory for non-convex objectives; 5. FedRR: application of ProxRR to federated learning; 6. Experiments; A. Basic notions and preliminaries (A.1 Bregman divergence; A.2 Properties of the proximal operator); B. Proof of Theorem 1 (bounding the shuffling radius); C. Proof of convergence of Proximal SGD; D. Proofs of Theorem 2 and Theorem 8 (D.1 A key lemma for shuffling-based methods; D.2 Proof of Theorem 2; D.3 Theorem 8 and its proof); E. Nonconvex analysis (E.1 A key lemma; E.2 Proof of Theorem 3); F. Proofs for federated learning (F.1 Lemma for the extended proximal operator; F.2 Proof of Lemma 1; F.3 Proof of Lemma 2; F.4 Proof of Theorem 4; F.5 Proof of Theorem 6; F.6 Proposition 1 and its proof; F.7 Proof of Theorem 5); G. Comparison of FedRR with other algorithms for federated learning (G.1 Heterogeneous data; G.1.1 Distributed gradient descent; G.1.2 Local SGD); H. Further experimental details; I. Extension: importance resampling; J. Extension: decreasing stepsizes (J.1 A recursion lemma; J.2 Proof of Theorem 10).
## Proofs

### A. Basic notions and preliminaries

We say that an extended real-valued function $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is proper if its domain, $\mathrm{dom}\,\varphi := \{x : \varphi(x) < +\infty\}$, is nonempty. We say that it is convex (resp. closed) if its epigraph, $\mathrm{epi}\,\varphi := \{(x, t) \in \mathbb{R}^d \times \mathbb{R} : \varphi(x) \le t\}$, is a convex (resp. closed) set. Equivalently, $\varphi$ is convex if $\mathrm{dom}\,\varphi$ is a convex set and $\varphi(\alpha x + (1 - \alpha) y) \le \alpha \varphi(x) + (1 - \alpha)\varphi(y)$ for all $x, y \in \mathrm{dom}\,\varphi$ and $\alpha \in (0, 1)$. Finally, $\varphi$ is $\mu$-strongly convex if $\varphi(x) - \frac{\mu}{2}\|x\|^2$ is convex, and $L$-smooth if $\frac{L}{2}\|x\|^2 - \varphi(x)$ is convex. One useful fact that we will need is that for any vectors $a_1, \dots, a_M \in \mathbb{R}^d$ we have

$$\frac{1}{M} \sum_{m=1}^M \|a_m\|^2 = \Big\| \frac{1}{M} \sum_{m=1}^M a_m \Big\|^2 + \frac{1}{M} \sum_{m=1}^M \Big\| a_m - \frac{1}{M}\sum_{i=1}^M a_i \Big\|^2. \qquad (12)$$

The identity above is sometimes called the bias-variance decomposition. To prove the upper bound in Theorem 1, we rely on a lemma due to Mishchenko et al. (2020) that bounds the variance when sampling without replacement.

**Lemma 3 (Lemma 1 in (Mishchenko et al., 2020)).** Let $X_1, \dots, X_n \in \mathbb{R}^d$ be fixed vectors, let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ be their mean, and let $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \|X_i - \bar{X}\|^2$ be their variance. Fix any $i \in [n]$, let $X_{\pi_0}, \dots, X_{\pi_{i-1}}$ be sampled uniformly without replacement from $\{X_1, \dots, X_n\}$, and let $\bar{X}_\pi = \frac{1}{i}\sum_{j=0}^{i-1} X_{\pi_j}$ be their average. Then the sample average and variance are given by

$$\mathbb{E}\big[\bar{X}_\pi\big] = \bar{X}, \qquad \mathbb{E}\big[\|\bar{X}_\pi - \bar{X}\|^2\big] = \frac{n - i}{i(n-1)}\, \sigma^2. \qquad (13)$$

Finally, we define $[n] := \{1, 2, \dots, n\}$.

#### A.1. Bregman divergence

These notions have a more useful characterization in the case of real-valued, continuously differentiable functions $\varphi : \mathbb{R}^d \to \mathbb{R}$. The Bregman divergence of such a $\varphi$ is defined by

$$D_\varphi(x, y) := \varphi(x) - \varphi(y) - \langle \nabla \varphi(y), x - y \rangle.$$

A continuously differentiable function $\varphi$ is called $\mu$-strongly convex if $\frac{\mu}{2}\|x - y\|^2 \le D_\varphi(x, y)$ for all $x, y \in \mathbb{R}^d$; it is convex if this holds with $\mu = 0$. Moreover, a continuously differentiable function $\varphi$ is called $L$-smooth if

$$-\frac{L}{2}\|x - y\|^2 \le D_\varphi(x, y) \le \frac{L}{2}\|x - y\|^2, \qquad x, y \in \mathbb{R}^d. \qquad (14)$$
Note that the first inequality in (14) is redundant for convex $\varphi$, because convexity implies $0 \le D_\varphi(x, y)$.

#### A.2. Properties of the proximal operator

Before proceeding to the proofs of convergence, we state some basic and well-known properties of regularized objectives. The following lemma explains why the solution of (1) is a fixed point of the proximal-gradient step for any stepsize.

**Lemma 4.** Let Assumption 1 be satisfied (we only need the part about $\psi$). Then the point $x_*$ is a minimizer of $P(x) = f(x) + \psi(x)$ if and only if for any $\gamma, b > 0$ we have

$$x_* = \mathrm{prox}_{\gamma b \psi}(x_* - \gamma b \nabla f(x_*)).$$

*Proof.* This follows by writing the first-order optimality conditions for problem (1); see (Parikh & Boyd, 2014, p. 32) for a full proof. ∎

The lemma above only shows that the proximal-gradient step does not hurt if we are already at the solution. In addition, we rely on the following somewhat stronger result, which states that the proximal operator is a contraction (resp. strong contraction) if the regularizer $\psi$ is convex (resp. strongly convex).

**Lemma 5.** Let Assumption 1 be satisfied (we only need the part about $\psi$). If $\psi$ is $\mu$-strongly convex with $\mu \ge 0$, then for any $\gamma > 0$ we have

$$\|\mathrm{prox}_{\gamma n\psi}(x) - \mathrm{prox}_{\gamma n\psi}(y)\|^2 \le \frac{1}{1 + 2\gamma\mu n}\, \|x - y\|^2 \qquad (15)$$

for all $x, y \in \mathbb{R}^d$.

*Proof.* Let $u := \mathrm{prox}_{\gamma n\psi}(x)$ and $v := \mathrm{prox}_{\gamma n\psi}(y)$. By definition, $u = \operatorname*{argmin}_w \{\psi(w) + \frac{1}{2\gamma n}\|w - x\|^2\}$. By first-order optimality, we have $0 \in \partial\psi(u) + \frac{1}{\gamma n}(u - x)$, or simply $x - u \in \gamma n\, \partial\psi(u)$. Using a similar argument for $v$, we get $x - u - (y - v) \in \gamma n\big(\partial\psi(u) - \partial\psi(v)\big)$. Thus, by strong convexity of $\psi$, we get

$$\langle x - u - (y - v),\, u - v \rangle \ge \gamma\mu n\, \|u - v\|^2.$$

Therefore,

$$\|x - y\|^2 = \|u - v + (x - u - (y - v))\|^2 = \|u - v\|^2 + 2\langle x - u - (y - v), u - v\rangle + \|x - u - (y - v)\|^2 \ge \|u - v\|^2 + 2\langle x - u - (y - v), u - v\rangle \ge (1 + 2\gamma\mu n)\|u - v\|^2. \qquad \blacksquare$$

### B. Proof of Theorem 1 (Bounding the shuffling radius)

*Proof.* By the $L_i$-smoothness of $f_i$ and the definition of $x_*^i$, we can bound the Bregman divergence in (4) as

$$\mathbb{E}\big[D_{f_{\pi_i}}(x_*^i, x_*)\big] \overset{(14)}{\le} \mathbb{E}\Big[\frac{L_{\pi_i}}{2}\|x_*^i - x_*\|^2\Big] \le \frac{L_{\max}}{2}\mathbb{E}\big[\|x_*^i - x_*\|^2\big] \overset{(3)}{=} \frac{\gamma^2 L_{\max}}{2}\mathbb{E}\Big[\Big\|\sum_{j=0}^{i-1}\nabla f_{\pi_j}(x_*)\Big\|^2\Big] = \frac{\gamma^2 L_{\max} i^2}{2}\mathbb{E}\big[\|\bar{X}_\pi\|^2\big], \qquad (16)$$

where $\bar{X}_\pi = \frac{1}{i}\sum_{j=0}^{i-1} X_{\pi_j}$ with $X_j := \nabla f_j(x_*)$ for $j = 1, 2, \dots, n$. Since $\bar{X} = \nabla f(x_*)$, applying Lemma 3 gives

$$\mathbb{E}\big[\|\bar{X}_\pi\|^2\big] = \|\bar{X}\|^2 + \mathbb{E}\big[\|\bar{X}_\pi - \bar{X}\|^2\big] \overset{(13)+(5)}{=} \|\nabla f(x_*)\|^2 + \frac{n - i}{i(n-1)}\sigma_*^2. \qquad (17)$$

It remains to combine (16) and (17), use the bounds $i^2 \le n^2$ and $i(n - i) \le \frac{n(n-1)}{2}$, which hold for all $i \in \{0, 1, \dots, n-1\}$, and divide both sides of the resulting inequality by $\gamma^2$. ∎

### C. Proof of convergence of Proximal SGD

**Theorem 7 (Proximal SGD).** Let Assumption 1 hold. Further, suppose that either $f := \frac{1}{n}\sum_{i=1}^n f_i$ or $\psi$ is $\mu$-strongly convex. If Algorithm 3 is run with a constant stepsize $\gamma_k = \gamma > 0$ satisfying $\gamma \le \frac{1}{2 L_{\max}}$, then the final iterate after $K$ steps satisfies

$$\mathbb{E}\big[\|x_K - x_*\|^2\big] \le (1 - \gamma\mu)^K\|x_0 - x_*\|^2 + \frac{2\gamma\sigma_*^2}{\mu}.$$

**Algorithm 3** Proximal SGD
1: **Input:** stepsizes $\gamma_k > 0$, initial vector $x_0 \in \mathbb{R}^d$, number of steps $K$
2: **for** steps $k = 0, 1, \dots, K-1$ **do**
3: &nbsp;&nbsp;&nbsp; Sample $i_k$ uniformly at random from $[n]$
4: &nbsp;&nbsp;&nbsp; $x_{k+1} = \mathrm{prox}_{\gamma_k\psi}(x_k - \gamma_k \nabla f_{i_k}(x_k))$
5: **end for**

*Proof.* We prove the case when $\psi$ is $\mu$-strongly convex; the other result follows as a straightforward special case of (Gorbunov et al., 2020, Theorem 4.1). We start by analyzing one step of SGD with stepsize $\gamma_k = \gamma$, using Lemma 4 and Lemma 5:

$$\|x_{k+1} - x_*\|^2 = \|\mathrm{prox}_{\gamma\psi}(x_k - \gamma\nabla f_\xi(x_k)) - \mathrm{prox}_{\gamma\psi}(x_* - \gamma\nabla f(x_*))\|^2 \le \frac{1}{1 + 2\gamma\mu}\|x_k - \gamma\nabla f_\xi(x_k) - (x_* - \gamma\nabla f(x_*))\|^2. \qquad (18)$$

We may write the squared norm term in (18) as

$$\|x_k - \gamma\nabla f_\xi(x_k) - (x_* - \gamma\nabla f(x_*))\|^2 = \|x_k - x_*\|^2 - 2\gamma\langle x_k - x_*, \nabla f_\xi(x_k) - \nabla f(x_*)\rangle + \gamma^2\|\nabla f_\xi(x_k) - \nabla f(x_*)\|^2. \qquad (19)$$

We denote by $\mathbb{E}_k[\cdot]$ the expectation conditional on $x_k$. Note that the gradient estimate is conditionally unbiased, i.e., $\mathbb{E}_k[\nabla f_\xi(x_k)] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_k) = \nabla f(x_k)$.
Hence, taking conditional expectation in (19) and using unbiasedness, we have

$$\mathbb{E}_k\big[\|x_k - \gamma\nabla f_\xi(x_k) - (x_* - \gamma\nabla f(x_*))\|^2\big] = \|x_k - x_*\|^2 - 2\gamma\langle x_k - x_*, \nabla f(x_k) - \nabla f(x_*)\rangle + \gamma^2\mathbb{E}_k\big[\|\nabla f_\xi(x_k) - \nabla f(x_*)\|^2\big]. \qquad (20)$$

By the convexity of $f$ we have $\langle x_k - x_*, \nabla f(x_k) - \nabla f(x_*)\rangle \ge D_f(x_k, x_*)$. Furthermore, we may estimate the third term in (20) by first using the fact that $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ for any two vectors $x, y \in \mathbb{R}^d$:

$$\mathbb{E}_k\big[\|\nabla f_\xi(x_k) - \nabla f(x_*)\|^2\big] \le 2\mathbb{E}_k\big[\|\nabla f_\xi(x_k) - \nabla f_\xi(x_*)\|^2\big] + 2\mathbb{E}_k\big[\|\nabla f_\xi(x_*) - \nabla f(x_*)\|^2\big] = 2\mathbb{E}_k\big[\|\nabla f_\xi(x_k) - \nabla f_\xi(x_*)\|^2\big] + 2\sigma_*^2.$$

We now use that, by the $L_{\max}$-smoothness of $f_i$, we have $\|\nabla f_i(x_k) - \nabla f_i(x_*)\|^2 \le 2 L_{\max} D_{f_i}(x_k, x_*)$:

$$\mathbb{E}_k\big[\|\nabla f_\xi(x_k) - \nabla f_\xi(x_*)\|^2\big] = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_k) - \nabla f_i(x_*)\|^2 \le \frac{2 L_{\max}}{n}\sum_{i=1}^n \big[f_i(x_k) - f_i(x_*) - \langle\nabla f_i(x_*), x_k - x_*\rangle\big] = 2 L_{\max}\big[f(x_k) - f(x_*) - \langle\nabla f(x_*), x_k - x_*\rangle\big] = 2 L_{\max} D_f(x_k, x_*). \qquad (21)$$

Combining (20) and (21), we obtain

$$\mathbb{E}_k\big[\|x_k - \gamma\nabla f_\xi(x_k) - (x_* - \gamma\nabla f(x_*))\|^2\big] \le \|x_k - x_*\|^2 - 2\gamma(1 - 2\gamma L_{\max}) D_f(x_k, x_*) + 2\gamma^2\sigma_*^2.$$

Since $\gamma \le \frac{1}{2 L_{\max}}$ by assumption, we have $1 - 2\gamma L_{\max} \ge 0$. Since $D_f(x_k, x_*) \ge 0$ by the convexity of $f$, we arrive at

$$\mathbb{E}_k\big[\|x_k - \gamma\nabla f_\xi(x_k) - (x_* - \gamma\nabla f(x_*))\|^2\big] \le \|x_k - x_*\|^2 + 2\gamma^2\sigma_*^2.$$

Taking unconditional expectation and combining this with (18), we have

$$\mathbb{E}\big[\|x_{k+1} - x_*\|^2\big] \le \frac{1}{1 + 2\gamma\mu}\mathbb{E}\big[\|x_k - x_*\|^2\big] + \frac{2\gamma^2\sigma_*^2}{1 + 2\gamma\mu} \le \frac{1}{1 + 2\gamma\mu}\mathbb{E}\big[\|x_k - x_*\|^2\big] + 2\gamma^2\sigma_*^2.$$

To simplify this further, we use that $\frac{1}{1 + 2x} \le 1 - x$ for any $x \le \frac{1}{2}$, and that $\gamma\mu \le \frac{\mu}{2 L_{\max}} \le \frac{1}{2}$:

$$\mathbb{E}\big[\|x_{k+1} - x_*\|^2\big] \le (1 - \gamma\mu)\mathbb{E}\big[\|x_k - x_*\|^2\big] + 2\gamma^2\sigma_*^2.$$

Recursing the above inequality for $K$ steps yields

$$\mathbb{E}\big[\|x_K - x_*\|^2\big] \le (1 - \gamma\mu)^K\|x_0 - x_*\|^2 + 2\gamma^2\sigma_*^2\sum_{k=0}^{K-1}(1 - \gamma\mu)^k \le (1 - \gamma\mu)^K\|x_0 - x_*\|^2 + 2\gamma^2\sigma_*^2\sum_{k=0}^{\infty}(1 - \gamma\mu)^k = (1 - \gamma\mu)^K\|x_0 - x_*\|^2 + \frac{2\gamma\sigma_*^2}{\mu}. \qquad \blacksquare$$

Furthermore, by choosing the stepsize $\gamma = \min\big\{\frac{1}{2 L_{\max}}, \frac{\varepsilon\mu}{4\sigma_*^2}\big\}$, we get $\mathbb{E}\big[\|x_K - x_*\|^2\big] = \mathcal{O}(\varepsilon)$ provided that the number of iterations is at least $K_{\mathrm{SGD}} \ge \big(\kappa + \frac{\sigma_*^2}{\varepsilon\mu^2}\big)\log\big(\frac{2 r_0}{\varepsilon}\big)$, which we previously stated in (7).

### D. Proofs of Theorem 2 and Theorem 8

#### D.1. A key lemma for shuffling-based methods

The intermediate limit points $x_*^i$ are extremely important for showing tight convergence guarantees for Random Reshuffling, even without a proximal operator. The following lemma illustrates this by giving a simple recursion, whose derivation follows (Mishchenko et al., 2020, proof of Theorem 1). The proof is included for completeness.

**Lemma 6 (Theorem 1 in (Mishchenko et al., 2020)).** Suppose that each $f_i$ is $L_i$-smooth and $\lambda$-strongly convex (where $\lambda = 0$ means each $f_i$ is merely convex). Then the inner iterates generated by Algorithm 1 satisfy

$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \le (1 - \gamma\lambda)\mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma(1 - \gamma L_{\max})\mathbb{E}\big[D_{f_{\pi_i}}(x_t^i, x_*)\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2, \qquad (22)$$

where $x_*^i$ is as in (3), $i = 0, 1, \dots, n-1$, and $x_*$ is any minimizer of $P$.

*Proof.* By the definitions of $x_t^{i+1}$ and $x_*^{i+1}$, we have

$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] = \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma\mathbb{E}\big[\langle\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_*), x_t^i - x_*^i\rangle\big] + \gamma^2\mathbb{E}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_*)\|^2\big]. \qquad (23)$$

Note that the third term in (23) can be bounded as

$$\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_*)\|^2 \le 2 L_{\max} D_{f_{\pi_i}}(x_t^i, x_*). \qquad (24)$$

We may rewrite the second term in (23) using the three-point identity (Chen & Teboulle, 1993, Lemma 3.1) as

$$\langle\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_*), x_t^i - x_*^i\rangle = D_{f_{\pi_i}}(x_*^i, x_t^i) + D_{f_{\pi_i}}(x_t^i, x_*) - D_{f_{\pi_i}}(x_*^i, x_*). \qquad (25)$$

Combining (23), (24), and (25), we obtain

$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \le \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma\mathbb{E}\big[D_{f_{\pi_i}}(x_*^i, x_t^i)\big] + 2\gamma\mathbb{E}\big[D_{f_{\pi_i}}(x_*^i, x_*)\big] - 2\gamma(1 - \gamma L_{\max})\mathbb{E}\big[D_{f_{\pi_i}}(x_t^i, x_*)\big]. \qquad (26)$$

Using the $\lambda$-strong convexity of $f_{\pi_i}$, we derive

$$\frac{\lambda}{2}\|x_t^i - x_*^i\|^2 \le D_{f_{\pi_i}}(x_*^i, x_t^i). \qquad (27)$$

Furthermore, by the definition of the shuffling radius (Definition 1), we have

$$\mathbb{E}\big[D_{f_{\pi_i}}(x_*^i, x_*)\big] \le \max_{i=0,\dots,n-1}\mathbb{E}\big[D_{f_{\pi_i}}(x_*^i, x_*)\big] = \gamma^2\sigma_{\mathrm{rad}}^2. \qquad (28)$$

Using (27) and (28) in (26) yields (22). ∎

#### D.2. Proof of Theorem 2

*Proof.* Starting with Lemma 6 with $\lambda = \mu$, we have

$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \le (1 - \gamma\mu)\mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma(1 - \gamma L_{\max})\mathbb{E}\big[D_{f_{\pi_i}}(x_t^i, x_*)\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2.$$
Since $D_{f_{\pi_i}}(x_t^i, x_*)$ is a Bregman divergence of a convex function, it is nonnegative. Combining this with the fact that the stepsize satisfies $\gamma \leq 1/L_{\max}$, we have
$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \leq (1-\gamma\mu)\,\mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2.$$
Unrolling this recursion for $n$ steps, we get
$$\mathbb{E}\big[\|x_t^n - x_*^n\|^2\big] \leq (1-\gamma\mu)^n\,\mathbb{E}\big[\|x_t^0 - x_*^0\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2\sum_{j=0}^{n-1}(1-\gamma\mu)^j = (1-\gamma\mu)^n\,\mathbb{E}\big[\|x_t - x_*\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2\sum_{j=0}^{n-1}(1-\gamma\mu)^j, \qquad (29)$$
where we used the fact that $x_t^0 - x_*^0 = x_t - x_*$. Since $x_*$ minimizes $P$, we have by Lemma 4 that
$$x_* = \mathrm{prox}_{\gamma n\psi}\Big(x_* - \gamma\sum_{i=0}^{n-1}\nabla f_{\pi_i}(x_*)\Big) = \mathrm{prox}_{\gamma n\psi}(x_*^n).$$
Moreover, by Lemma 5 we obtain that
$$\|x_{t+1} - x_*\|^2 = \|\mathrm{prox}_{\gamma n\psi}(x_t^n) - \mathrm{prox}_{\gamma n\psi}(x_*^n)\|^2 \leq \|x_t^n - x_*^n\|^2.$$
Using this in (29) yields
$$\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] \leq (1-\gamma\mu)^n\,\mathbb{E}\big[\|x_t - x_*\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2\sum_{j=0}^{n-1}(1-\gamma\mu)^j.$$
We now unroll this recursion again for $T$ steps:
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \leq (1-\gamma\mu)^{nT}\,\mathbb{E}\big[\|x_0 - x_*\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2\Big(\sum_{j=0}^{n-1}(1-\gamma\mu)^j\Big)\Big(\sum_{i=0}^{T-1}(1-\gamma\mu)^{ni}\Big). \qquad (30)$$
Following Mishchenko et al. (2020), we rewrite and bound the product in the last term as
$$\Big(\sum_{j=0}^{n-1}(1-\gamma\mu)^j\Big)\Big(\sum_{i=0}^{T-1}(1-\gamma\mu)^{ni}\Big) = \sum_{i=0}^{T-1}\sum_{j=0}^{n-1}(1-\gamma\mu)^{ni+j} = \sum_{k=0}^{nT-1}(1-\gamma\mu)^k \leq \sum_{k=0}^{\infty}(1-\gamma\mu)^k = \frac{1}{\gamma\mu}.$$
It remains to plug this bound into (30).

D.3. Theorem 8 and its proof

Theorem 8. Let Assumption 1 hold and $f_1, \ldots, f_n$ be convex. Further, assume that $\psi$ is $\mu$-strongly convex. If Algorithm 1 is run with constant stepsize $\gamma_t = \gamma \leq 1/L_{\max}$, where $L_{\max} = \max_i L_i$, then its iterates satisfy
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \leq (1+2\gamma\mu n)^{-T}\|x_0 - x_*\|^2 + \frac{\gamma^2\sigma_{\mathrm{rad}}^2}{\mu}.$$

Proof. Starting with Lemma 6 with $\lambda = 0$, we have
$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \leq \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma(1-\gamma L_{\max})\,\mathbb{E}\big[D_{f_{\pi_i}}(x_t^i, x_*)\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2.$$
Since $\gamma \leq 1/L_{\max}$ and $D_{f_{\pi_i}}(x_t^i, x_*)$ is nonnegative, we may simplify this to
$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \leq \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2.$$
Unrolling this recursion over an epoch we have
$$\mathbb{E}\big[\|x_t^n - x_*^n\|^2\big] \leq \mathbb{E}\big[\|x_t^0 - x_*^0\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2 n = \mathbb{E}\big[\|x_t - x_*\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2 n. \qquad (31)$$
Since $x_*$ minimizes $P$, we have by Lemma 4 that
$$x_* = \mathrm{prox}_{\gamma n\psi}\Big(x_* - \gamma\sum_{i=0}^{n-1}\nabla f_{\pi_i}(x_*)\Big) = \mathrm{prox}_{\gamma n\psi}(x_*^n).$$
Hence, $x_{t+1} - x_* = \mathrm{prox}_{\gamma n\psi}(x_t^n) - \mathrm{prox}_{\gamma n\psi}(x_*^n)$. We may now use Lemma 5 to get
$$(1+2\gamma\mu n)\,\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] \leq \mathbb{E}\big[\|x_t^n - x_*^n\|^2\big].$$
Combining this with (31), we obtain
$$\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] \leq \frac{1}{1+2\gamma\mu n}\mathbb{E}\big[\|x_t - x_*\|^2\big] + \frac{2\gamma^3\sigma_{\mathrm{rad}}^2 n}{1+2\gamma\mu n}.$$
We may unroll this recursion again, this time for $T$ steps, and then use that $\sum_{j=1}^{T}(1+2\gamma\mu n)^{-j} \leq \sum_{j=1}^{\infty}(1+2\gamma\mu n)^{-j} = \frac{1}{2\gamma\mu n}$:
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \leq (1+2\gamma\mu n)^{-T}\mathbb{E}\big[\|x_0 - x_*\|^2\big] + \frac{2\gamma^3\sigma_{\mathrm{rad}}^2 n}{1+2\gamma\mu n}\sum_{j=0}^{T-1}(1+2\gamma\mu n)^{-j} = (1+2\gamma\mu n)^{-T}\mathbb{E}\big[\|x_0 - x_*\|^2\big] + 2\gamma^3\sigma_{\mathrm{rad}}^2 n\sum_{j=1}^{T}(1+2\gamma\mu n)^{-j} \leq (1+2\gamma\mu n)^{-T}\mathbb{E}\big[\|x_0 - x_*\|^2\big] + \frac{2\gamma^3\sigma_{\mathrm{rad}}^2 n}{2\gamma\mu n} = (1+2\gamma\mu n)^{-T}\mathbb{E}\big[\|x_0 - x_*\|^2\big] + \frac{\gamma^2\sigma_{\mathrm{rad}}^2}{\mu}.$$
Using Theorem 8 and choosing the stepsize as $\gamma = \min\big\{\frac{1}{L_{\max}}, \frac{\sqrt{\varepsilon\mu}}{\sigma_{\mathrm{rad}}}\big\}$, we get $\mathbb{E}[\|x_T - x_*\|^2] = \mathcal{O}(\varepsilon)$ provided that the total number of iterations satisfies
$$K \gtrsim \Big(\kappa + \frac{\sigma_{\mathrm{rad}}}{\mu\sqrt{\varepsilon\mu}} + n\Big)\log\frac{2r_0}{\varepsilon}.$$
This can be converted to a bound similar to (3.2) by using Theorem 1, in which case the only difference between the two cases is an extra $n\log\frac{1}{\varepsilon}$ term when only the regularizer $\psi$ is $\mu$-strongly convex. Since for small enough accuracies the $1/\sqrt{\varepsilon}$ term dominates, this difference is minimal.

E. Nonconvex analysis

E.1. A key lemma

For notational convenience, we define
$$g_t := \frac{1}{\gamma n}(x_t - x_t^n) = \frac{1}{n}\sum_{i=0}^{n-1}\nabla f_{\pi_i}(x_t^i),$$
which is equivalent to $x_t^n = x_t - \gamma n g_t$.

Lemma 7. Let functions $f_1, \ldots, f_n$ be $L_{\max}$-smooth, let Assumptions 2 and 3 be satisfied, and let $\gamma \leq \frac{1}{2L_{\max} n}$. Then,
$$\mathbb{E}_t\big[\|\nabla f(x_t) - g_t\|^2\big] \leq \gamma^2 L_{\max}^2 n^2\big(\|G_{\gamma n}(x_t)\|^2 + \zeta^2\big) + \gamma^2 L_{\max}^2 n\sigma^2. \qquad (34)$$

Proof. We start with the observation that gradient Lipschitzness reduces the left-hand side to a difference of iterates:
$$\|\nabla f(x_t) - g_t\|^2 = \Big\|\frac{1}{n}\sum_{i=0}^{n-1}\big(\nabla f_{\pi_i}(x_t) - \nabla f_{\pi_i}(x_t^i)\big)\Big\|^2 \leq \frac{1}{n}\sum_{i=0}^{n-1}\|\nabla f_{\pi_i}(x_t) - \nabla f_{\pi_i}(x_t^i)\|^2 \leq \frac{L_{\max}^2}{n}\sum_{i=0}^{n-1}\|x_t - x_t^i\|^2.$$
Define $V_t := \sum_{i=0}^{n-1}\|x_t^i - x_t\|^2$. Clearly, it is sufficient to bound $\mathbb{E}[V_t]$ to finish the proof.
Also note that for any intermediate iterate $x_t^k$ within epoch $t$ we do not use a proximal step, so the following identity holds:
$$x_t^k = x_t - \gamma\sum_{i=0}^{k-1}\nabla f_{\pi_i}(x_t^i).$$
This identity only includes gradients, so to bound the deviation of $x_t^k$ from $x_t$ we apply Jensen's inequality and gradient Lipschitzness:
$$\mathbb{E}_t\big[\|x_t^k - x_t\|^2\big] = \gamma^2\,\mathbb{E}_t\Big[\Big\|\sum_{i=0}^{k-1}\big(\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_t)\big) + \sum_{i=0}^{k-1}\nabla f_{\pi_i}(x_t)\Big\|^2\Big] \leq 2\gamma^2 k\sum_{i=0}^{k-1}\mathbb{E}_t\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x_t)\|^2\big] + 2\gamma^2\,\mathbb{E}_t\Big[\Big\|\sum_{i=0}^{k-1}\nabla f_{\pi_i}(x_t)\Big\|^2\Big] \leq 2\gamma^2 L_{\max}^2 k\sum_{i=0}^{k-1}\mathbb{E}_t\big[\|x_t^i - x_t\|^2\big] + 2\gamma^2\,\mathbb{E}_t\Big[\Big\|\sum_{i=0}^{k-1}\nabla f_{\pi_i}(x_t)\Big\|^2\Big].$$
Now we are going to use the fact that for any $i$ in RR we have $\mathbb{E}_t[\nabla f_{\pi_i}(x_t)] = \nabla f(x_t)$. Note that this property does not hold if $x_t$ is not independent of $\pi_i$, which is why the result does not hold for SO. Let us also define $\sigma_t^2 := \frac{1}{n}\sum_{j=1}^n\|\nabla f_j(x_t) - \nabla f(x_t)\|^2$. By Lemma 3 we have
$$\mathbb{E}_t\Big[\Big\|\sum_{i=0}^{k-1}\nabla f_{\pi_i}(x_t)\Big\|^2\Big] = k^2\|\nabla f(x_t)\|^2 + \mathbb{E}_t\Big[\Big\|\sum_{i=0}^{k-1}\big(\nabla f_{\pi_i}(x_t) - \nabla f(x_t)\big)\Big\|^2\Big] \overset{(13)}{=} k^2\|\nabla f(x_t)\|^2 + \frac{k(n-k)}{n-1}\sigma_t^2.$$
Plugging this back and using Assumption 3, we derive
$$\mathbb{E}_t\big[\|x_t^k - x_t\|^2\big] \leq 2\gamma^2 L_{\max}^2 k\sum_{i=0}^{k-1}\mathbb{E}_t\big[\|x_t^i - x_t\|^2\big] + 2\gamma^2 k^2\|\nabla f(x_t)\|^2 + 2\gamma^2\frac{k(n-k)}{n-1}\sigma^2 \leq 2\gamma^2 L_{\max}^2 k\,\mathbb{E}[V_t] + 2\gamma^2 k^2\|\nabla f(x_t)\|^2 + 2\gamma^2\frac{k(n-k)}{n-1}\sigma^2.$$
Let us use the obtained bound on a single iterate distance $\mathbb{E}_t[\|x_t^k - x_t\|^2]$ to upper bound $\mathbb{E}[V_t]$:
$$\mathbb{E}[V_t] = \sum_{k=0}^{n-1}\mathbb{E}_t\big[\|x_t^k - x_t\|^2\big] \leq \gamma^2 L_{\max}^2 n(n-1)\,\mathbb{E}_t[V_t] + \frac{1}{3}\gamma^2(n-1)n(2n-1)\|\nabla f(x_t)\|^2 + \frac{1}{3}\gamma^2 n(n+1)\sigma^2.$$
This inequality has $\mathbb{E}_t[V_t]$ on both sides, so we can rearrange it and use the assumption $\gamma \leq \frac{1}{2L_{\max} n}$, which gives $\gamma^2 L_{\max}^2 n(n-1) \leq \frac14$ and results in
$$\mathbb{E}_t[V_t] \leq \frac{4}{9}\gamma^2(n-1)n(2n-1)\|\nabla f(x_t)\|^2 + \frac{4}{9}\gamma^2 n(n+1)\sigma^2 \leq \gamma^2 n^3\|\nabla f(x_t)\|^2 + \gamma^2 n^2\sigma^2.$$
To conclude the proof, apply Assumption 2 to $x_t \in \mathrm{dom}(\psi)$ and plug the bound on $\mathbb{E}_t[V_t]$ into the bound on $\mathbb{E}_t[\|\nabla f(x_t) - g_t\|^2]$.

E.2. Proof of Theorem 3

Proof. Let us introduce $w_t := \mathrm{prox}_{\gamma n\psi}(x_t - \gamma n\nabla f(x_t))$. The idea of our proof is to first obtain a descent recursion for $P(w_t)$ and then bound $P(x_{t+1}) - P(w_t)$. By convexity of $\psi$, we have for any $g \in \partial\psi(w_t)$
$$\psi(w_t) \leq \psi(x_t) + \langle g, w_t - x_t\rangle.$$
Furthermore, the definition of $w_t$ implies by first-order optimality that $\frac{1}{\gamma n}\big(x_t - \gamma n\nabla f(x_t) - w_t\big) \in \partial\psi(w_t)$, so we can plug it into the bound above to get
$$\psi(w_t) \leq \psi(x_t) + \frac{1}{\gamma n}\langle x_t - \gamma n\nabla f(x_t) - w_t, w_t - x_t\rangle = \psi(x_t) - \langle\nabla f(x_t), w_t - x_t\rangle - \frac{1}{\gamma n}\|w_t - x_t\|^2.$$
At the same time, by $L_{\max}$-smoothness of $f$ we have
$$f(w_t) \leq f(x_t) + \langle\nabla f(x_t), w_t - x_t\rangle + \frac{L_{\max}}{2}\|w_t - x_t\|^2.$$
Adding the two inequalities together yields
$$P(w_t) = f(w_t) + \psi(w_t) \leq P(x_t) + \Big(\frac{L_{\max}}{2} - \frac{1}{\gamma n}\Big)\|w_t - x_t\|^2.$$
Now we shall upper bound $P(x_{t+1})$. Using the convexity of $\psi$ with $\frac{1}{\gamma n}(x_t^n - x_{t+1}) \in \partial\psi(x_{t+1})$, we derive
$$\psi(x_{t+1}) \leq \psi(w_t) + \frac{1}{\gamma n}\langle x_t^n - x_{t+1}, x_{t+1} - w_t\rangle = \psi(w_t) - \langle g_t, x_{t+1} - w_t\rangle + \frac{1}{\gamma n}\langle x_t - x_{t+1}, x_{t+1} - w_t\rangle = \psi(w_t) - \langle g_t, x_{t+1} - w_t\rangle + \frac{1}{2\gamma n}\big(\|x_t - w_t\|^2 - \|x_t - x_{t+1}\|^2 - \|x_{t+1} - w_t\|^2\big).$$
Next, we apply $L_{\max}$-smoothness of $f$ two times, to upper bound $D_f(x_{t+1}, x_t)$ and to lower bound $D_f(w_t, x_t)$:
$$f(x_{t+1}) \leq f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \frac{L_{\max}}{2}\|x_{t+1} - x_t\|^2, \quad\text{and}\quad f(x_t) \leq f(w_t) + \langle\nabla f(x_t), x_t - w_t\rangle + \frac{L_{\max}}{2}\|w_t - x_t\|^2,$$
which together give
$$f(x_{t+1}) \leq f(w_t) + \langle\nabla f(x_t), x_{t+1} - w_t\rangle + \frac{L_{\max}}{2}\big(\|x_{t+1} - x_t\|^2 + \|w_t - x_t\|^2\big).$$
Combining the inequalities for $\psi(x_{t+1})$ and $f(x_{t+1})$, we obtain
$$P(x_{t+1}) \leq P(w_t) + \langle\nabla f(x_t) - g_t, x_{t+1} - w_t\rangle + \Big(\frac{L_{\max}}{2} + \frac{1}{2\gamma n}\Big)\|x_t - w_t\|^2 + \Big(\frac{L_{\max}}{2} - \frac{1}{2\gamma n}\Big)\|x_{t+1} - x_t\|^2 - \frac{1}{2\gamma n}\|x_{t+1} - w_t\|^2.$$
By Young's inequality and Lemma 7 we have
$$\mathbb{E}_t\big[\langle\nabla f(x_t) - g_t, x_{t+1} - w_t\rangle\big] \leq \frac{\gamma n}{2}\mathbb{E}_t\big[\|\nabla f(x_t) - g_t\|^2\big] + \frac{1}{2\gamma n}\mathbb{E}_t\big[\|x_{t+1} - w_t\|^2\big] \overset{(34)}{\leq} \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^3}{2}\|G_{\gamma n}(x_t)\|^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 + \frac{1}{2\gamma n}\mathbb{E}_t\big[\|x_{t+1} - w_t\|^2\big].$$
If we plug this back, the term $\|x_{t+1} - w_t\|^2$ will cancel out, giving us for $\gamma \leq \frac{1}{L_{\max} n}$
$$\mathbb{E}_t[P(x_{t+1})] \leq P(w_t) + \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^3}{2}\|G_{\gamma n}(x_t)\|^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 + \Big(\frac{L_{\max}}{2} + \frac{1}{2\gamma n}\Big)\|x_t - w_t\|^2 + \Big(\frac{L_{\max}}{2} - \frac{1}{2\gamma n}\Big)\mathbb{E}_t\big[\|x_{t+1} - x_t\|^2\big] \leq P(w_t) + \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^3}{2}\|G_{\gamma n}(x_t)\|^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 + \Big(\frac{L_{\max}}{2} + \frac{1}{2\gamma n}\Big)\|x_t - w_t\|^2.$$
Using the recursion for $P(w_t)$ and our choice $\gamma \leq \frac{1}{5L_{\max} n}$, we finally obtain, after plugging in $\|x_t - w_t\|^2 = \gamma^2 n^2\|G_{\gamma n}(x_t)\|^2$,
$$\mathbb{E}_t[P(x_{t+1})] \leq P(x_t) + \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 + \Big(\frac{\gamma n L_{\max}^2}{2} + \frac{L_{\max}}{2} + \frac{1}{2\gamma n} + \frac{L_{\max}}{2} - \frac{1}{\gamma n}\Big)\gamma^2 n^2\|G_{\gamma n}(x_t)\|^2 \leq P(x_t) + \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 + \Big(\frac{L_{\max}}{10} + L_{\max} - \frac{1}{2\gamma n}\Big)\gamma^2 n^2\|G_{\gamma n}(x_t)\|^2 \leq P(x_t) + \frac{\gamma^3 L_{\max}^2 n^3}{2}\zeta^2 + \frac{\gamma^3 L_{\max}^2 n^2}{2}\sigma^2 - \frac{1}{4\gamma n}\gamma^2 n^2\|G_{\gamma n}(x_t)\|^2.$$
Recursing this down to $P(x_0)$ and using $P_* \leq P(x_T)$, we get the theorem's claim.

F. Proofs for federated learning

F.1. Lemma for the extended proximal operator

Lemma 8. Let $\psi_C$ be the consensus constraint and let $R$ be a closed convex proximable function. Suppose that $x_1, x_2, \ldots, x_M$ are all in $\mathbb{R}^d$. Then,
$$\mathrm{prox}_{\gamma(R+\psi_C)}(x_1, \ldots, x_M) = \big(\mathrm{prox}_{\frac{\gamma}{M}R}(\bar{x}), \ldots, \mathrm{prox}_{\frac{\gamma}{M}R}(\bar{x})\big),$$
where $\bar{x} = \frac{1}{M}\sum_{m=1}^M x_m$.

Proof. This is a simple consequence of the definition of the proximal operator. Indeed, the result of $\mathrm{prox}_{\gamma(R+\psi_C)}$ must have all blocks equal to some vector $z$ such that
$$z = \operatorname*{argmin}_{x\in\mathbb{R}^d}\Big\{\gamma R(x) + \frac12\sum_{m=1}^M\|x - x_m\|^2\Big\}.$$
Expanding the quadratic term,
$$\sum_{m=1}^M\|x - x_m\|^2 = \sum_{m=1}^M\big(\|x - \bar{x}\|^2 + 2\langle x - \bar{x}, \bar{x} - x_m\rangle + \|\bar{x} - x_m\|^2\big) = M\|x - \bar{x}\|^2 + \sum_{m=1}^M\|\bar{x} - x_m\|^2,$$
where the cross terms vanish by the definition of $\bar{x}$. Hence, up to an additive constant,
$$z = \operatorname*{argmin}_{x\in\mathbb{R}^d}\Big\{\gamma R(x) + \frac{M}{2}\|x - \bar{x}\|^2\Big\} = \mathrm{prox}_{\frac{\gamma}{M}R}(\bar{x}).$$

F.2. Proof of Lemma 1

Proof. Given some vectors $x, y \in \mathbb{R}^{dM}$, let us use their block representation $x = (x_1^\top, \ldots, x_M^\top)^\top$, $y = (y_1^\top, \ldots, y_M^\top)^\top$. Since we use the Euclidean norm, we have
$$\|\nabla f_i(x) - \nabla f_i(y)\|^2 = \sum_{m=1}^M\|\nabla f_{mi}(x_m) - \nabla f_{mi}(y_m)\|^2 \leq \sum_{m=1}^M L_i^2\|x_m - y_m\|^2 = L_i^2\|x - y\|^2.$$
We can obtain a lower bound by doing the same derivation and applying strong convexity instead of smoothness:
$$\sum_{m=1}^M\|\nabla f_{mi}(x_m) - \nabla f_{mi}(y_m)\|^2 \geq \mu^2\sum_{m=1}^M\|x_m - y_m\|^2 = \mu^2\|x - y\|^2.$$
Thus, we have $\mu\|x - y\| \leq \|\nabla f_i(x) - \nabla f_i(y)\| \leq L_i\|x - y\|$, which is exactly $\mu$-strong convexity and $L_i$-smoothness of $f_i$.

F.3. Proof of Lemma 2

Proof. By Theorem 1 we have
$$\sigma_{\mathrm{rad}}^2 \leq L_{\max}\big(n^2\|\nabla f(x_*)\|^2 + n\sigma_*^2\big), \quad\text{where}\quad \sigma_*^2 := \frac{1}{n}\sum_{i=1}^n\|\nabla f_i(x_*) - \nabla f(x_*)\|^2.$$
Due to the separable structure of $f$, we have for the variance term
$$\sum_{i=1}^n\|\nabla f_i(x_*) - \nabla f(x_*)\|^2 = \sum_{m=1}^M\sum_{i=1}^{N_m}\Big\|\nabla f_{mi}(x_*) - \frac{1}{n}\nabla F_m(x_*)\Big\|^2.$$
The expression inside the summation is not exactly the variance due to the different normalization: $\frac1n$ instead of $\frac{1}{N_m}$. Nevertheless, we can expand the norm and recover the actual variance: since $\sum_{i=1}^{N_m}\big(\nabla f_{mi}(x_*) - \frac{1}{N_m}\nabla F_m(x_*)\big) = 0$, the cross term vanishes and
$$\sum_{i=1}^{N_m}\Big\|\nabla f_{mi}(x_*) - \frac{1}{N_m}\nabla F_m(x_*) + \Big(\frac{1}{N_m} - \frac{1}{n}\Big)\nabla F_m(x_*)\Big\|^2 = N_m\sigma_{m,*}^2 + N_m\Big(\frac{1}{N_m} - \frac{1}{n}\Big)^2\|\nabla F_m(x_*)\|^2,$$
where $\sigma_{m,*}^2 := \frac{1}{N_m}\sum_{i=1}^{N_m}\|\nabla f_{mi}(x_*) - \frac{1}{N_m}\nabla F_m(x_*)\|^2$. Moreover, the gradient term has the same block structure, so
$$n^2\|\nabla f(x_*)\|^2 = n^2\sum_{m=1}^M\Big\|\frac{1}{n}\sum_{i=1}^{N_m}\nabla f_{mi}(x_*)\Big\|^2 = \sum_{m=1}^M\|\nabla F_m(x_*)\|^2.$$
Plugging the last two bounds back inside the upper bound on $\sigma_{\mathrm{rad}}^2$, we deduce the lemma's statement.

F.4. Proof of Theorem 4

Proof. Since we assume that $N_1 = \cdots = N_M = n$, we have $\frac{N}{M} = n$, and the strong convexity constant of $\psi = \frac{N}{n}(R + \psi_C)$ is equal to $\frac{N}{nM}\mu = \mu$. By applying Theorem 8 we obtain
$$\mathbb{E}\big[\|\mathbf{x}_T - \mathbf{x}_*\|^2\big] \leq (1+2\gamma\mu n)^{-T}\|\mathbf{x}_0 - \mathbf{x}_*\|^2 + \frac{\gamma^2\sigma_{\mathrm{rad}}^2}{\mu},$$
where bold letters denote stacked vectors in $\mathbb{R}^{dM}$. Since $\mathbf{x}_T = \mathrm{prox}_{\gamma N(R+\psi_C)}(\mathbf{x}_{T-1}^n)$, we have $\mathbf{x}_T \in \mathcal{C}$, i.e., all of its blocks are equal to each other: $\mathbf{x}_T = (x_T^\top, \ldots, x_T^\top)^\top$. Since we use the Euclidean norm, it also implies
$$\|\mathbf{x}_T - \mathbf{x}_*\|^2 = M\|x_T - x_*\|^2.$$
The same is true for $\mathbf{x}_0$, so we need to divide both sides of the upper bound on $\|\mathbf{x}_T - \mathbf{x}_*\|^2$ by $M$. Doing so together with applying Lemma 2 yields
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \leq (1+2\gamma\mu n)^{-T}\|x_0 - x_*\|^2 + \frac{\gamma^2\sigma_{\mathrm{rad}}^2}{M\mu} \leq (1+2\gamma\mu n)^{-T}\|x_0 - x_*\|^2 + \frac{\gamma^2 L_{\max}}{\mu}\Big(\frac{1}{M}\sum_{m=1}^M\|\nabla F_m(x_*)\|^2 + \frac{N}{M}\cdot\frac{1}{M}\sum_{m=1}^M\sigma_{m,*}^2\Big),$$
where we used $n = \frac{N}{M}$ in the last step.

F.5. Proof of Theorem 6

Proof. According to Lemma 1, each $f_i$ is $\mu$-strongly convex and $L_{\max}$-smooth, so we obtain the result by trivially applying Theorem 2 and upper bounding $\sigma_{\mathrm{rad}}^2$ in the same way as in the proof of Theorem 4.
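To make Lemma 8 concrete: its identity translates directly into the server-side step used above, in which the client blocks are averaged, a single prox of $R$ with the effective stepsize $\gamma/M$ is applied, and the result is broadcast. Below is a minimal NumPy sketch of this step; the function names are ours, and we take $R(x) = \frac{\lambda}{2}\|x\|^2$ purely for illustration, since its prox is a simple shrinkage.

```python
import numpy as np

def prox_l2sq(x, t, lam=0.1):
    # prox of t * (lam / 2) * ||.||^2 is a simple shrinkage (illustrative R)
    return x / (1.0 + t * lam)

def prox_consensus_plus_R(X, gamma, prox_R=prox_l2sq):
    """Prox of gamma * (R + psi_C) over stacked client vectors X of shape (M, d).

    Per Lemma 8, the consensus constraint reduces the computation to averaging
    the blocks, applying the prox of R with effective stepsize gamma / M to the
    average, and broadcasting the result back to all M blocks.
    """
    M = X.shape[0]
    x_bar = X.mean(axis=0)        # projection onto the consensus set
    z = prox_R(x_bar, gamma / M)  # single prox on R with stepsize gamma / M
    return np.tile(z, (M, 1))     # every block of the output equals z

# tiny demo: three clients with d = 2; all output blocks coincide
X = np.array([[1.0, -2.0], [0.5, 0.0], [-0.5, 4.0]])
out = prox_consensus_plus_R(X, gamma=1.5)
assert np.allclose(out, out[0])   # consensus holds exactly
```

In Fed RR this averaging-then-prox structure is what makes the proximal step cheap: $R$ is resolved once per communication round, independently of the number of clients $M$.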
F.6. Proposition 1 and its proof

An important property of Assumption 2 is that it is equivalent to the bounded dissimilarity assumption that was previously used for the nonconvex analysis of Local SGD. We formalize this in the following proposition.

Proposition 1. Consider the federated learning reformulation (10). If $\psi \equiv \psi_C$, i.e., $R \equiv 0$, then Assumption 2 with constant $\tilde\zeta^2 := M\zeta^2$ is equivalent to $\zeta$-bounded dissimilarity (Assumption 4):
$$\frac{1}{M}\sum_{m=1}^M\Big\|\frac{1}{n}\nabla F_m(x) - \nabla F(x)\Big\|^2 \leq \zeta^2 \quad\text{for all } x\in\mathbb{R}^d.$$

Proof. First, observe that if $\mathbf{x} \in \mathrm{dom}(\psi)$, then $\mathbf{x}$ has all blocks equal to some $x \in \mathbb{R}^d$: $\mathbf{x} = (x^\top, \ldots, x^\top)^\top$. Therefore, for the objective in reformulation (10) and $\mathbf{x} \in \mathrm{dom}(\psi)$, we have
$$\nabla f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n\nabla f_i(\mathbf{x}) = \Big(\frac{1}{n}\nabla F_1(x)^\top, \ldots, \frac{1}{n}\nabla F_M(x)^\top\Big)^\top. \qquad (35)$$
With the help of the bias-variance decomposition, the left-hand side of Assumption 2 can be written as
$$\|\nabla f(\mathbf{x})\|^2 \overset{(35)}{=} \frac{1}{n^2}\sum_{m=1}^M\|\nabla F_m(x)\|^2 \overset{(12)}{=} \frac{1}{n^2}\sum_{m=1}^M\Big\|\nabla F_m(x) - \frac{1}{M}\sum_{l=1}^M\nabla F_l(x)\Big\|^2 + \frac{M}{n^2}\Big\|\frac{1}{M}\sum_{l=1}^M\nabla F_l(x)\Big\|^2.$$
Let us now work out the proximal-gradient mapping. According to Lemma 8, the proximal operator of $\psi$ is simply the averaging of all blocks, while the full gradient is given in (35), which gives when combined
$$\mathrm{prox}_{\gamma n\psi}\big(\mathbf{x} - \gamma n\nabla f(\mathbf{x})\big) = \Big(\Big(\frac{1}{M}\sum_{m=1}^M(x - \gamma\nabla F_m(x))\Big)^\top, \ldots, \Big(\frac{1}{M}\sum_{m=1}^M(x - \gamma\nabla F_m(x))\Big)^\top\Big)^\top. \qquad (36)$$
Consequently,
$$\|G_{\gamma n}(\mathbf{x})\|^2 = \frac{1}{\gamma^2 n^2}\big\|\mathbf{x} - \mathrm{prox}_{\gamma n\psi}(\mathbf{x} - \gamma n\nabla f(\mathbf{x}))\big\|^2 \overset{(36)}{=} \frac{M}{n^2}\Big\|\frac{1}{M}\sum_{l=1}^M\nabla F_l(x)\Big\|^2 = M\|\nabla F(x)\|^2. \qquad (37)$$
Having the expressions for both sides, we can write
$$\|\nabla f(\mathbf{x})\|^2 = \|G_{\gamma n}(\mathbf{x})\|^2 + \frac{1}{n^2}\sum_{m=1}^M\Big\|\nabla F_m(x) - \frac{1}{M}\sum_{l=1}^M\nabla F_l(x)\Big\|^2.$$
From this expression and the fact that $\frac{1}{N}\sum_{l=1}^M\nabla F_l(x) = \nabla F(x)$, it is easy to see the equivalence.

F.7. Proof of Theorem 5

The federated learning reformulation (10) has different constant scaling than the finite-sum federated learning problem (9), and the only constant that does not change at all is $L_{\max}$. For the initial error of the reformulation we have $\tilde\delta_0 = \frac{N}{n}\delta_0 = M\delta_0$, where $\delta_0 := \frac{1}{N}\sum_{m=1}^M F_m(x_0) - \min_x\frac{1}{N}\sum_{m=1}^M F_m(x)$, and we consider only the simplified case $N_1 = \cdots = N_M = n$, so $\frac{N}{n} = M$. For the variance, we have
$$\tilde\sigma^2 = \sup_{\mathbf{x}}\,\mathbb{E}\big[\|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\|^2\big] = \sup_{\mathbf{x}}\sum_{m=1}^M\mathbb{E}\Big[\Big\|\nabla f_{mi}(x_m) - \frac{1}{n}\nabla F_m(x_m)\Big\|^2\Big] = M\sigma^2.$$
As we derived in (37), the proximal-gradient mapping norm is equal to $\|G_{\gamma n}(\mathbf{x})\|^2 = M\|\nabla F(x)\|^2$, so to have $\|\nabla F(x_T)\|^2 \leq \varepsilon^2$, we need $\|G_{\gamma n}(\mathbf{x}_T)\|^2 \leq \tilde\varepsilon^2 := M\varepsilon^2$. In addition, notice that by Proposition 1 the constant from Assumption 2 is $\tilde\zeta = \sqrt{M}\zeta$. Thus, Theorem 3 implies, if we ignore $L_{\max}$, that we need
$$\mathcal{O}\Big(\frac{\delta_0}{\varepsilon^2} + \frac{\delta_0\sigma}{\sqrt{n}\,\varepsilon^3} + \frac{\delta_0\zeta}{\varepsilon^2}\Big)$$
communication rounds to achieve $\min_{t=0,\ldots,T-1}\mathbb{E}\big[\|\nabla F(x_t)\|^2\big] = \mathcal{O}(\varepsilon^2)$.

G. Comparison of Fed RR with other algorithms for federated learning

G.1. Heterogeneous Data

In this section we compare Fed RR with several known baseline algorithms for federated learning. In particular, we consider the following algorithms:

1. Distributed gradient descent (DGD)
2. Local SGD (with $M$ nodes and $n$ local steps per node)

To be clear, the problem we are considering is
$$\min_{x\in\mathbb{R}^d}\; f(x) := \frac{1}{M}\sum_{m=1}^M F_m(x) + R(x),$$
where each objective $F_m$ can be written as $F_m(x) = \frac{1}{n}\sum_{i=1}^n f_{m,i}(x)$. We further assume that each objective is $L$-smooth and convex, and that $R$ is $\mu$-strongly convex. This implies that $f$ is $L$-smooth and $\mu$-strongly convex. Note that this is a special case of (9) where we keep $N_1 = N_2 = \cdots = N_M = n$ for simplicity.

Corollary 1. Let $c^2 = \zeta^2 + \frac{n}{4}\sigma^2$, where $\zeta^2 := \frac{1}{M}\sum_{m=1}^M\|\nabla F_m(x_*)\|^2$ and $\sigma^2 := \frac{1}{M}\sum_{m=1}^M\|\nabla F(x_*) - \nabla F_m(x_*)\|^2$. Then the communication complexity required by Fed RR to reach an $\epsilon$-accurate solution is
$$T = \mathcal{O}\Big(\frac{1}{n}\Big(\kappa + \frac{c}{\mu}\sqrt{\frac{\kappa}{\epsilon}}\Big)\log\frac{r_0}{\epsilon}\Big), \qquad (38)$$
where $r_0 = \|x_0 - x_*\|^2$.

Proof. This is a straightforward consequence of Theorem 4.

G.1.1. DISTRIBUTED GRADIENT DESCENT

When we compute $n$ gradients on each node per communication round, we are essentially running distributed gradient descent (DGD).
In order to reach an $\epsilon$-accurate solution, DGD requires the following number of iterations:
$$T = \Omega\Big(\kappa\log\frac{r_0}{\epsilon}\Big).$$
Comparing against the result of Corollary 1, we see that Fed RR is better whenever the accuracy $\epsilon$ satisfies
$$\epsilon \gtrsim \frac{1}{\mu^2\kappa}\Big(\frac{\zeta^2}{n^2} + \frac{\sigma^2}{n}\Big).$$
Note that this requirement grows more stringent with increasing levels of heterogeneity; this has been observed for other local methods as well, such as Local SGD (Woodworth et al., 2020).

G.1.2. LOCAL SGD

The best current lower bound for Local SGD is given by Woodworth et al. (2020) in the stochastic case. By the stochastic case, we mean that the problem considered is
$$\min_{x\in\mathbb{R}^d}\;\mathbb{E}_{\xi\sim\mathcal{D}}[f_\xi(x)].$$
This is a more general problem than the finite-sum minimization problem (1) and is usually strictly harder to solve (i.e., requires more iterations to achieve an $\epsilon$-accurate solution). We are not aware of any analysis of Local SGD specifically for the finite-sum problem, and thus we specialize the result of Woodworth et al. (2020) anyway. For Local SGD on $\mu$-strongly convex and $L$-smooth functions, and with $n$ local steps per node, the lower bound they give after $T$ communication rounds is
$$\Omega\Big(\min\Big\{\Delta, \frac{L\zeta^2}{\mu^2 T^2}\Big\} + \frac{\sigma^2}{\mu M n T} + \min\Big\{\Delta, \frac{L\sigma^2}{\mu^2 n T^2}\Big\}\Big), \qquad (39)$$
where $\sigma^2$ is a uniform bound on the variance (i.e., $\mathbb{E}[\|\nabla f_\xi(x) - \nabla f(x)\|^2] \leq \sigma^2$ for all $x \in \mathbb{R}^d$), $\zeta^2$ is defined as in Corollary 1, and $\Delta$ is an upper bound on $f(x_0) - f_*$. We note that this lower bound is not actually met by any of the existing analyses for Local SGD. Even ignoring the dependence on $\sigma$ (which may not be tight because this is the stochastic case), the first term (i.e., the "optimization term") in (39) requires a number of communication rounds that scales as $\sqrt{\kappa}\,\zeta/\sqrt{\mu\epsilon}$. This is clearly worse than (38) for large $n$.

H. Further experimental details

Implementation details. For each $i$, we have $L_i = \frac14\|a_i\|^2$. For the $\ell_1$-regularized problem, we set $\lambda_2 = 3\cdot 10^{-5}L$ and tune $\lambda_1$ to obtain a solution with about 25% zero coordinates, which gives $\lambda_1 = 5\cdot 10^{-5}$. We use stepsizes decreasing as $\mathcal{O}(1/t)$ for all methods. We use the w8a dataset⁴ for the experiment with $\ell_1$ regularization.

Proximal operator calculation. It is well known (see, for instance, (Parikh & Boyd, 2014)) that the proximal operator for $\psi(x) = \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|^2$ is given by
$$\mathrm{prox}_{\gamma\psi}(x) = \frac{1}{1+\gamma\lambda_2}\,\mathrm{prox}_{\gamma\lambda_1\|\cdot\|_1}(x),$$
where the $j$-th coordinate of $\mathrm{prox}_{\gamma\lambda_1\|\cdot\|_1}(x)$ is
$$\big[\mathrm{prox}_{\gamma\lambda_1\|\cdot\|_1}(x)\big]_j = \begin{cases}\mathrm{sign}([x]_j)\,(|[x]_j| - \gamma\lambda_1), & \text{if } |[x]_j| \geq \gamma\lambda_1,\\ 0, & \text{otherwise.}\end{cases}$$

Federated experiments. The experiments comparing Fed RR, Local SGD and Scaffold use no $\ell_1$ regularization and $\lambda_2 = 10^{-5}L$. To make the comparison fair, all methods use $n$ local steps. For Fed RR, the initial stepsize was $\frac1L$ in the i.i.d. regime and $\frac{1}{Ln}$ in the heterogeneous regime. As per Theorem 3 in (Khaled et al., 2020), the stepsizes for Local SGD must satisfy $\gamma_t = \mathcal{O}(1/(LH))$, where $H$ is the number of local steps; a similar result holds for Scaffold (Karimireddy et al., 2020). The parallelization of local runs is done using the Ray package⁵. We use the w8a dataset for the i.i.d. experiment. For the heterogeneous experiment, we sort the a9a dataset with respect to the target labels $b \in \{0, 1\}$ and then mix it with the original order in proportion 2:1. For all methods, the local workers used minibatch size 16. The exact implementation can be found in our code.
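Since the formula above fully specifies the proximal operator used in our $\ell_1$ experiments, here is a short self-contained sketch (function names ours), together with a one-dimensional grid check against the definition of the prox:

```python
import numpy as np

def prox_l1(x, t):
    # soft-thresholding: prox of t * ||.||_1, applied coordinate-wise
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_elastic_net(x, gamma, lam1, lam2):
    # prox of gamma * (lam1 * ||.||_1 + (lam2 / 2) * ||.||^2):
    # soft-threshold at gamma * lam1, then shrink by 1 / (1 + gamma * lam2)
    return prox_l1(x, gamma * lam1) / (1.0 + gamma * lam2)

# check against argmin_z { gamma * psi(z) + 0.5 * (z - x)^2 } on a fine grid
x0, gamma, lam1, lam2 = 0.7, 0.5, 0.3, 0.2
z = np.linspace(-2.0, 2.0, 400001)
obj = gamma * (lam1 * np.abs(z) + 0.5 * lam2 * z ** 2) + 0.5 * (z - x0) ** 2
z_grid = z[np.argmin(obj)]
z_formula = prox_elastic_net(np.array([x0]), gamma, lam1, lam2)[0]
assert abs(z_grid - z_formula) < 1e-4  # both give (0.7 - 0.15) / 1.1 = 0.5
```

The order of operations matters: thresholding first and then shrinking matches the closed form above, whereas shrinking first would change the effective $\ell_1$ threshold.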
⁴The datasets were downloaded from LibSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
⁵https://ray.io/

Extensions

Here we discuss two extensions of our theory that matter significantly in practice: using decreasing stepsizes and applying importance resampling.

I. Extension: Importance resampling

Suppose that each $f_i$ is $L_i$-smooth. Then the iteration complexities of both SGD and RR depend on $L_{\max}/\mu$, where $L_{\max}$ is the maximum among the smoothness constants $L_1, L_2, \ldots, L_n$. The maximum smoothness constant can be arbitrarily worse than the average smoothness constant $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$. This situation is in contrast to the complexity of gradient descent, which depends on the smoothness constant $L_f$ of $f = \frac{1}{n}\sum_{i=1}^n f_i$, for which we have $L_f \leq \bar{L}$. This is a problem commonly encountered with stochastic optimization methods and may cause significantly degraded performance in practical optimization tasks in comparison with deterministic methods (Tang et al., 2020).

Importance sampling is a common technique to improve the convergence of SGD (Algorithm 3): we sample function $(\bar{L}/L_i)f_i$ with probability $p_i$ proportional to $L_i$, where $\bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$. In that case, the SGD update is still unbiased since
$$\sum_{i=1}^n p_i\frac{\bar{L}}{L_i}\nabla f_i = \nabla f.$$
Moreover, the smoothness constant of the function $(\bar{L}/L_i)f_i$ is $\bar{L}$ for any $i$, so the guarantees would depend on $\bar{L}$ instead of $\max_{i=1,\ldots,n} L_i$. Importance sampling successfully improves the iteration complexity of SGD to depend on $\bar{L}$ (Needell et al., 2016), and has been investigated in a wide variety of settings (Gower et al., 2020; Gorbunov et al., 2020).

Importance sampling is a neat technique, but it relies heavily on the fact that we use unbiased sampling. How can we obtain a similar result if inside any permutation the sampling is biased? The answer requires us to think again about what happens when we replace $f_i$ with $(\bar{L}/L_i)f_i$. To make sure the problem remains the same, it is sufficient to have $(\bar{L}/L_i)f_i$ appear inside a permutation exactly $L_i/\bar{L}$ times. And since $L_i/\bar{L}$ is not necessarily an integer, we should use $n_i = \lceil L_i/\bar{L}\rceil$ and solve
$$\min_{x\in\mathbb{R}^d}\;\frac{1}{N}\sum_{i=1}^n\Big(\underbrace{\frac{1}{n_i}f_i(x) + \cdots + \frac{1}{n_i}f_i(x)}_{n_i\text{ times}}\Big) + \psi(x), \qquad N := n_1 + \cdots + n_n = \Big\lceil\frac{L_1}{\bar{L}}\Big\rceil + \cdots + \Big\lceil\frac{L_n}{\bar{L}}\Big\rceil. \qquad (40)$$
Clearly, this problem is equivalent to the original formulation in (1). At the same time, we have improved all smoothness constants to $\bar{L}$. It might seem that the new problem has more functions, but it turns out that the new number of functions satisfies $N \leq 2n$, so any related costs, such as longer loops or storing duplicates of the data, are negligible, as the next theorem shows.

Theorem 9. For every $i$, assume that each $f_i$ is convex and $L_i$-smooth, and let $\psi$ be $\mu$-strongly convex. Then the number of functions $N$ in (40) satisfies $N \leq 2n$, and Algorithm 1 applied to problem (40) has the same complexity as (33), but proportional to $\bar{L}$ rather than $L_{\max}$.

Proof. We show that $N \leq 2n$, as the rest of the theorem's claim trivially follows from Theorem 8. Firstly, note that for any number $a \in \mathbb{R}$ we have $\lceil a\rceil \leq a + 1$. Therefore,
$$N = \sum_{i=1}^n\Big\lceil\frac{L_i}{\bar{L}}\Big\rceil \leq \sum_{i=1}^n\Big(\frac{L_i}{\bar{L}} + 1\Big) = n + \frac{1}{\bar{L}}\sum_{i=1}^n L_i = n + n = 2n.$$
(A short code sketch of this construction is given below.)

J. Extension: Decreasing stepsizes

Using the theoretical stepsize (32) requires knowing the desired accuracy $\varepsilon$ ahead of time, as well as estimating $\sigma_{\mathrm{rad}}$. It also results in extra polylogarithmic factors in the iteration complexity (33), a phenomenon observed and fixed by using decreasing stepsizes both in vanilla RR (Ahn et al., 2020) and in SGD (Stich, 2019). We show that we can adapt the same technique to our setting.
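Before turning to the stepsize analysis, here is a minimal sketch of the resampling construction of Appendix I referenced above (function name ours); it builds the duplicated objective of (40) as a list of (index, weight) pairs and checks the bound $N \leq 2n$ from Theorem 9:

```python
import math

def resample_by_smoothness(L):
    """Importance-resampling construction of (40), as a sketch.

    Given smoothness constants L = [L_1, ..., L_n], duplicate f_i
    n_i = ceil(L_i / L_bar) times with weight 1 / n_i, so that every copy
    (1 / n_i) * f_i is at most L_bar-smooth. Returns (index, weight) pairs.
    """
    n = len(L)
    L_bar = sum(L) / n
    copies = []
    for i, L_i in enumerate(L):
        n_i = math.ceil(L_i / L_bar)
        copies.extend([(i, 1.0 / n_i)] * n_i)
    assert len(copies) <= 2 * n  # Theorem 9: N <= 2n
    return copies

# example: one badly conditioned function among three easy ones;
# L_bar = 3, so the last function appears ceil(9 / 3) = 3 times, weight 1/3
print(resample_by_smoothness([1.0, 1.0, 1.0, 9.0]))  # N = 6 <= 8
```

An RR epoch then shuffles these $N \leq 2n$ weighted copies instead of the original $n$ functions, which is why the per-epoch cost increases by at most a factor of two.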
However, we depart from the stepsize scheme of Ahn et al. (2020) by varying the stepsize only once per epoch rather than at every iteration. This is closer to the common practical heuristic of decreasing the stepsize once every epoch or once every few epochs (Sun, 2020; Tran et al., 2020). The stepsize scheme we use is inspired by the schemes of (Stich, 2019; Khaled & Richtárik, 2020): in particular, we fix $T > 0$, let $t_0 = \lfloor T/2\rfloor$, and choose the stepsizes $\gamma_t > 0$ by
$$\gamma_t = \begin{cases}\dfrac{1}{L_{\max}} & \text{if } T \leq \dfrac{7L_{\max}}{2\mu n} \text{ or } t \leq t_0,\\[6pt] \dfrac{7}{2\mu n(s + t - t_0)} & \text{if } T > \dfrac{7L_{\max}}{2\mu n} \text{ and } t > t_0,\end{cases} \qquad (41)$$
where $s := \frac{7L_{\max}}{4\mu n}$. Hence, we fix the stepsize used in the first half of the epochs and then start decreasing it once per epoch afterwards. Using this stepsize schedule, we can obtain the following convergence guarantee when each $f_i$ is smooth and convex and the regularizer $\psi$ is $\mu$-strongly convex.

Theorem 10. Suppose that each $f_i$ is $L_{\max}$-smooth and convex, and that the regularizer $\psi$ is $\mu$-strongly convex. Fix $T > 0$. Then choosing stepsizes $\gamma_t$ according to (41), we have $\gamma_t \leq 1/L_{\max}$ for all $t$, and the final iterate generated by Algorithm 1 satisfies
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] = \mathcal{O}\Big(\exp\Big(-\frac{nT}{\kappa + 2n}\Big)r_0 + \frac{\sigma_{\mathrm{rad}}^2}{\mu^3 n^2 T^2}\Big),$$
where $\kappa := L_{\max}/\mu$, $r_0 := \|x_0 - x_*\|^2$, and $\mathcal{O}(\cdot)$ hides absolute (non-problem-specific) constants. This guarantee holds for any number of epochs $T > 0$.

We believe a similar guarantee can be obtained in the case where each $f_i$ is strongly convex and the regularizer $\psi$ is just convex, but we did not include it as it adds little to the overall message. In the rest of the section we provide a proof of Theorem 10.

J.1. A recursion lemma

We first state and prove the following algorithm-independent lemma. This lemma plays a key role in the proof of Theorem 10 and is heavily inspired by the stepsize schemes of Stich (2019) and Khaled & Richtárik (2020) and their proofs.

Lemma 9. Suppose that there exist constants $a, b, c \geq 0$ such that the sequence $(r_t)_{t\geq 0}$ satisfies, for all $\gamma_t \leq \frac1b$,
$$(1 + \gamma_t a n)\,r_{t+1} \leq r_t + \gamma_t^3 c. \qquad (42)$$
Fix $T > 0$ and let $t_0 = \lfloor T/2\rfloor$. Then choosing the stepsizes $\gamma_t > 0$ by
$$\gamma_t = \begin{cases}\dfrac{1}{b} & \text{if } t \leq t_0 \text{ or } T \leq \dfrac{7b}{an},\\[6pt] \dfrac{7}{an(s + t - t_0)} & \text{if } t > t_0 \text{ and } T > \dfrac{7b}{an},\end{cases}$$
where $s = \frac{7b}{2an}$, we have
$$r_T \leq \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{1421\,c}{a^3 n^3 T^2}.$$

Proof. If $T \leq \frac{7b}{an}$, then we have $\gamma_t = \gamma = \frac1b$ for all $t$. Hence, recursing, we have
$$r_T \leq (1+\gamma an)^{-T}r_0 + \frac{\gamma^3 c}{\gamma an} = (1+\gamma an)^{-T}r_0 + \frac{\gamma^2 c}{an}.$$
Note that $\frac{1}{1+x} \leq \exp\big(-\frac{x}{1+x}\big)$ for all $x \geq 0$, hence
$$r_T \leq \exp\Big(-\frac{\gamma an T}{1+\gamma an}\Big)r_0 + \frac{\gamma^2 c}{an}.$$
Substituting for $\gamma = \frac1b$ yields
$$r_T \leq \exp\Big(-\frac{nT}{b/a + n}\Big)r_0 + \frac{c}{b^2 an}.$$
Note that by assumption we have $\frac1b \leq \frac{7}{Tan}$, hence
$$r_T \leq \exp\Big(-\frac{nT}{b/a + n}\Big)r_0 + \frac{49c}{T^2 a^3 n^3}. \qquad (43)$$
If $T > \frac{7b}{an}$, then for the first phase, when $t \leq t_0$, we have stepsize $\gamma_t = \frac1b$ and therefore
$$r_{t_0} \leq \exp\Big(-\frac{nt_0}{b/a + n}\Big)r_0 + \frac{c}{b^2 an} \leq \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{c}{b^2 an}. \qquad (44)$$
Then for $t > t_0$ we have
$$(1+\gamma_t an)\,r_{t+1} \leq r_t + \gamma_t^3 c = r_t + \frac{7^3 c}{a^3 n^3(s+t-t_0)^3}.$$
Multiplying both sides by $(s+t-t_0)^3$ yields
$$(s+t-t_0)^3(1+\gamma_t an)\,r_{t+1} \leq (s+t-t_0)^3 r_t + \frac{7^3 c}{a^3 n^3}. \qquad (45)$$
Note that because $t$ and $t_0$ are integers and $t > t_0$, we have $t - t_0 \geq 1$ and therefore $s + t - t_0 \geq 1$. We may use this to lower bound the multiplicative factor on the left-hand side of (45) as
$$(s+t-t_0)^3(1+\gamma_t an) = (s+t-t_0)^3\Big(1 + \frac{7}{s+t-t_0}\Big) = (s+t-t_0)^3 + 7(s+t-t_0)^2 \geq (s+t-t_0)^3 + 3(s+t-t_0)^2 + 3(s+t-t_0) + 1 = (s+t+1-t_0)^3. \qquad (46)$$
Using (46) in (45) we obtain
$$(s+t+1-t_0)^3\,r_{t+1} \leq (s+t-t_0)^3\,r_t + \frac{7^3 c}{a^3 n^3}.$$
Let $w_t = (s+t-t_0)^3$. Then we can rewrite the last inequality as
$$w_{t+1} r_{t+1} \leq w_t r_t + \frac{7^3 c}{a^3 n^3}.$$
Summing up and telescoping from $t = t_0$ to $T$ yields
$$w_T r_T \leq w_{t_0} r_{t_0} + \frac{7^3 c}{a^3 n^3}(T - t_0).$$
Note that $w_{t_0} = s^3$ and $w_T = (s + T - t_0)^3$.
Hence,
$$r_T \leq \frac{s^3}{(s+T-t_0)^3}r_{t_0} + \frac{7^3 c}{a^3 n^3}\cdot\frac{T-t_0}{(s+T-t_0)^3} \leq \frac{s^3}{(s+T-t_0)^3}r_{t_0} + \frac{7^3 c}{a^3 n^3}\cdot\frac{1}{(s+T-t_0)^2},$$
where we used $\frac{T-t_0}{s+T-t_0} \leq 1$. Since we have $s + T - t_0 \geq T - t_0 \geq T/2$, it holds that
$$r_T \leq \frac{8s^3}{T^3}r_{t_0} + \frac{4\cdot 7^3 c}{a^3 n^3 T^2}. \qquad (47)$$
The bound in (44) can be rewritten as
$$\frac{8s^3}{T^3}r_{t_0} \leq \frac{8s^3}{T^3}\exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{8s^3 c}{b^2 an T^3}.$$
We now rewrite the last inequality, use that $T > 2s$ (so that $\frac{8s^3}{T^3} \leq 1$ and $\frac{8s^3}{T^3} \leq \frac{4s^2}{T^2}$), and further use the fact that $s = \frac{7b}{2an}$:
$$\frac{8s^3}{T^3}r_{t_0} \leq \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{4s^2 c}{b^2 an T^2} = \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{7^2 c}{a^3 n^3 T^2}. \qquad (48)$$
Plugging the estimate (48) into (47) we obtain
$$r_T \leq \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{7^2 c}{a^3 n^3 T^2} + \frac{4\cdot 7^3 c}{a^3 n^3 T^2} = \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{7^2(1 + 4\cdot 7)\,c}{a^3 n^3 T^2}. \qquad (49)$$
Taking the maximum of (43) and (49), we see that for any $T > 0$ we have
$$r_T \leq \exp\Big(-\frac{nT}{2(b/a + n)}\Big)r_0 + \frac{1421\,c}{a^3 n^3 T^2}.$$

J.2. Proof of Theorem 10

Proof. Start with Lemma 6 with $\lambda = 0$, $L = L_{\max}$, and $\gamma = \gamma_t$:
$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \leq \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] - 2\gamma_t(1 - \gamma_t L_{\max})\,\mathbb{E}\big[D_{f_{\pi_i}}(x_t^i, x_*)\big] + 2\gamma_t^3\sigma_{\mathrm{rad}}^2.$$
Since $\gamma_t \leq 1/L_{\max}$ and $D_{f_{\pi_i}}(x_t^i, x_*)$ is nonnegative, we may simplify this to
$$\mathbb{E}\big[\|x_t^{i+1} - x_*^{i+1}\|^2\big] \leq \mathbb{E}\big[\|x_t^i - x_*^i\|^2\big] + 2\gamma_t^3\sigma_{\mathrm{rad}}^2.$$
Unrolling this recursion for $n$ steps we get
$$\mathbb{E}\big[\|x_t^n - x_*^n\|^2\big] \leq \mathbb{E}\big[\|x_t - x_*\|^2\big] + 2n\gamma_t^3\sigma_{\mathrm{rad}}^2.$$
By Lemma 5 we have
$$(1 + 2\gamma_t\mu n)\,\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] \leq \mathbb{E}\big[\|x_t - x_*\|^2\big] + 2n\gamma_t^3\sigma_{\mathrm{rad}}^2.$$
We may then use Lemma 9 with $a = 2\mu$, $b = L_{\max}$, and $c = 2n\sigma_{\mathrm{rad}}^2$ to obtain
$$\mathbb{E}\big[\|x_T - x_*\|^2\big] \leq \exp\Big(-\frac{nT}{\kappa + 2n}\Big)\|x_0 - x_*\|^2 + \frac{356\,\sigma_{\mathrm{rad}}^2}{\mu^3 n^2 T^2} = \mathcal{O}\Big(\exp\Big(-\frac{nT}{\kappa + 2n}\Big)\|x_0 - x_*\|^2 + \frac{\sigma_{\mathrm{rad}}^2}{\mu^3 n^2 T^2}\Big),$$
which concludes the proof.
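As a quick numerical sanity check (not part of the analysis above), one can iterate the worst case of recursion (42) under the stepsize schedule of Lemma 9 and watch the product $r_T\cdot T^2$ stay bounded, matching the $\mathcal{O}(1/T^2)$ noise term. The sketch below additionally caps the decreasing stepsize at $1/b$, a safety choice of ours so that the recursion's stepsize condition is always met:

```python
def lemma9_stepsize(t, T, a, b, n):
    """Stepsize schedule of Lemma 9 (sketch): constant 1/b in the first half
    of the run, then decreasing once per step; capped at 1/b (our choice)."""
    t0 = T // 2
    s = 7.0 * b / (2.0 * a * n)
    if T <= 7.0 * b / (a * n) or t <= t0:
        return 1.0 / b
    return min(1.0 / b, 7.0 / (a * n * (s + t - t0)))

# iterate recursion (42) with equality: r_{t+1} = (r_t + g^3 c) / (1 + g a n)
a, b, n, c = 0.1, 10.0, 20, 1.0
for T in (100, 400, 1600):
    r = 1.0
    for t in range(T):
        g = lemma9_stepsize(t, T, a, b, n)
        r = (r + g ** 3 * c) / (1.0 + g * a * n)
    print(f"T={T:5d}  r_T={r:.3e}  r_T * T^2 = {r * T ** 2:.1f}")
```

The printed values of $r_T\cdot T^2$ approach a constant as $T$ grows, which is the behavior the lemma's $\mathcal{O}(c/(a^3 n^3 T^2))$ term predicts.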