# SGD: General Analysis and Improved Rates

Robert M. Gower¹, Nicolas Loizou², Xun Qian³, Alibek Sailanbayev³, Egor Shulgin⁴, Peter Richtárik³ ² ⁴

¹Télécom ParisTech, LTCI, Université Paris-Saclay, France. ²University of Edinburgh, United Kingdom. ³King Abdullah University of Science and Technology, Kingdom of Saudi Arabia. ⁴Moscow Institute of Physics and Technology, Russian Federation. Correspondence to: Robert M. Gower.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm. Our theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. This is the first time such an analysis is performed, and most of our variants of SGD have never been explicitly considered in the literature before. Our analysis relies on the recently introduced notion of expected smoothness and does not rely on a uniform bound on the variance of the stochastic gradients. By specializing our theorem to different mini-batching strategies, such as sampling with replacement and independent sampling, we derive exact expressions for the stepsize as a function of the mini-batch size. With this we can also determine the mini-batch size that optimizes the total complexity, and show explicitly that as the variance of the stochastic gradient evaluated at the minimum grows, so does the optimal mini-batch size. For zero variance, the optimal mini-batch size is one. Moreover, we prove insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime.

1. Introduction

We consider the optimization problem

$$x^* = \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \Big], \qquad (1)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is smooth (but not necessarily convex). Further, we assume that $f$ has a unique global minimizer $x^*$ (this assumption can be relaxed, but we enforce it for simplicity of exposition) and is $\mu$-strongly quasi-convex (Karimi et al., 2016; Necoara et al., 2018):

$$f(x^*) \geq f(x) + \langle \nabla f(x), x^* - x \rangle + \frac{\mu}{2} \|x^* - x\|^2 \qquad (2)$$

for all $x \in \mathbb{R}^d$.

1.1. Background and contributions

Stochastic gradient descent (SGD) (Robbins & Monro, 1951; Nemirovski & Yudin, 1978; 1983; Shalev-Shwartz et al., 2007; Nemirovski et al., 2009; Hardt et al., 2016) has become the workhorse for training supervised machine learning problems of the generic form (1).

Linear convergence of SGD. Moulines & Bach (2011) provided a non-asymptotic analysis of SGD showing linear convergence for strongly convex $f$ up to a certain noise level. Needell et al. (2016) improved upon these results by removing the quadratic dependency on the condition number in the iteration complexity, and considered importance sampling. The analysis of Needell et al. (2016) was later extended to a mini-batch variant where the mini-batches are formed by partitioning the data (Needell & Ward, 2017). These works are the main starting point for ours.

Contributions: We further tighten and generalize these results to virtually all forms of sampling. We introduce an expected smoothness assumption (Assumption 2.1), first introduced in (Gower et al., 2018) in the context of a certain class of variance-reduced methods.
This assumption is a joint property of $f$ and the sampling scheme $\mathcal{D}$ utilized by an SGD method, and allows us to prove a generic complexity result (Theorem 3.1) that holds for arbitrary sampling schemes $\mathcal{D}$. Our work is the first to analyse SGD under this assumption. We obtain linear convergence rates without strong convexity; in particular, we assume only strong quasi-convexity, a class that includes some non-convex functions as well. Furthermore, we do not require the functions $f_i$ to be convex.

Gradient noise assumptions. Shamir & Zhang (2013) extended the analysis of SGD to convex non-smooth optimization (including the strongly convex case). However, their proofs still rely on the assumption that the variance of the stochastic gradient is bounded for all iterates of the algorithm: there exists $c \in \mathbb{R}$ such that $\mathbb{E}_i \|\nabla f_i(x^k)\|^2 \leq c$ for all $k$. The same assumption was used in the analysis of several recent papers (Recht et al., 2011; Hazan & Kale, 2014; Rakhlin et al., 2012). A much more relaxed weak growth assumption, $\mathbb{E}_i \|\nabla f_i(x^k)\|^2 \leq c_1 + c_2\, \mathbb{E}\|\nabla f(x^k)\|^2$ for all $k$, was apparently first used in the late 90s to prove the asymptotic convergence of SGD (see Proposition 4.2 of Bertsekas & Tsitsiklis (1996)). Bottou et al. (2018) establish linear convergence of SGD under this weak growth assumption. Recently, Nguyen et al. (2018) turned this assumption into a theorem by establishing formulas for $c_1$ and $c_2$ under some reasonable conditions, and provided further insights into the workings of SGD and its parallel asynchronous cousin, Hogwild!. Similar conditions have also been proved and used in the analysis of decentralized variants of SGD (Lian et al., 2017; Assran et al., 2018). Based on a strong growth condition ($c_1 = 0$), Schmidt & Roux (2013) were the first to establish linear convergence of SGD, with Cevher & Vu (2017) later giving sufficient and necessary conditions for the linear convergence of SGD under this condition.

Contributions: Our analysis does not directly assume a growth condition. Instead, we make use of the remarkably weak expected smoothness assumption.

Optimal mini-batch size. It was recently shown experimentally by Goyal et al. (2017) that using larger mini-batch sizes is key to efficient training of large scale non-convex problems, leading to the training of ImageNet in under 1 hour. The authors conjectured that the stepsize should grow linearly with the mini-batch size. Contributions: We prove (see Section 4) that this is the case, up to a certain optimal mini-batch size, and provide exact formulas for the dependency of the stepsizes on the mini-batch sizes.

Learning schedules. Chee & Toulis (2018) develop techniques for detecting the convergence of SGD within a region around the solution. Contributions: We provide a closed-form formula for when SGD should switch from a constant stepsize to a decreasing stepsize (see Theorem 3.2). Further, we clearly show how the optimal stepsize (learning rate) increases and the iteration complexity decreases as the mini-batch size increases, for both independent sampling and sampling with replacement. We also recover the well known $\frac{L}{\mu}\log(1/\epsilon)$ convergence rate of gradient descent (GD) when the mini-batch size is $n$; this is the first time a generic SGD analysis recovers the correct rate of GD.
Over-parameterized models. There has been some recent work analysing SGD in the setting where the underlying model being trained has more parameters than there is data available. In this zero-noise setting, Ma et al. (2018) showed that SGD converges linearly. Contributions: In the case of over-parametrized models, we extend the findings of Ma et al. (2018) to independent sampling and sampling with replacement by showing that the optimal mini-batch size is 1. (Recently, the results of Ma et al. (2018) were extended to the accelerated case by Vaswani et al. (2018); however, we do not study accelerated methods in this work.) Moreover, we provide results in the more general setting where the model is not necessarily over-parametrized.

Practical performance. We corroborate our theoretical results with extensive experimental testing.

1.2. Stochastic reformulation

In this work we provide a single theorem through which we can analyse all importance sampling and mini-batch variants of SGD. To do this, we introduce a sampling vector which we use to re-write problem (1).

Definition 1.1. We say that a random vector $v \in \mathbb{R}^n$ drawn from some distribution $\mathcal{D}$ is a sampling vector if its mean is the vector of all ones:

$$\mathbb{E}_{\mathcal{D}}[v_i] = 1, \quad \forall i \in [n]. \qquad (3)$$

With each distribution $\mathcal{D}$ we now associate a stochastic reformulation of (1):

$$\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{\mathcal{D}}\Big[ f_v(x) := \frac{1}{n}\sum_{i=1}^n v_i f_i(x) \Big]. \qquad (4)$$

By the definition of the sampling vector, $f_v(x)$ and $\nabla f_v(x)$ are unbiased estimators of $f(x)$ and $\nabla f(x)$, respectively, and hence problem (4) is indeed equivalent to (i.e., a reformulation of) the original problem (1). In the case of the gradient, for instance, we get

$$\mathbb{E}_{\mathcal{D}}[\nabla f_v(x)] \overset{(4)}{=} \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\mathcal{D}}[v_i]\, \nabla f_i(x) \overset{(3)}{=} \nabla f(x). \qquad (5)$$

Similar but different stochastic reformulations were recently proposed by Richtárik & Takáč (2017), and further used in (Loizou & Richtárik, 2017; 2019), for the more special problem of solving linear systems, and by Gower et al. (2018) in the context of variance-reduced methods.

Reformulation (4) can be solved using SGD in a natural way:

$$x^{k+1} = x^k - \gamma^k\, \nabla f_{v^k}(x^k), \qquad (6)$$

where $v^k \sim \mathcal{D}$ is sampled i.i.d. at each iteration and $\gamma^k > 0$ is a stepsize. For different distributions $\mathcal{D}$, (6) has a different interpretation as an SGD method for solving the original problem (1). In our main result we analyse (6) for any $\mathcal{D}$ satisfying (3); by substituting specific choices of $\mathcal{D}$, we obtain specific variants of SGD for solving (1).
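To make the reformulation concrete, the following is a minimal sketch (ours, not from the paper) of iteration (6) in Python for a least-squares instance of (1), using single-element uniform sampling, for which $v = n e_i$ with probability $1/n$ and hence $\nabla f_v(x) = \nabla f_i(x)$. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_fi(x, i):
    """Gradient of f_i(x) = 0.5 * (A[i] @ x - y[i])**2."""
    return A[i] * (A[i] @ x - y[i])

def sgd(x0, gamma, iters):
    """Iteration (6) under single-element uniform sampling: v = n * e_i
    with probability 1/n, so that grad f_v(x) = grad f_i(x)."""
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(n)
        x -= gamma * grad_fi(x, i)
    return x

x_final = sgd(np.zeros(d), gamma=1e-3, iters=10_000)
```

Swapping in a different sampling vector changes only how the stochastic gradient is formed, not the loop itself; this is the sense in which (6) is a single method parameterized by $\mathcal{D}$.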
2. Expected Smoothness and Gradient Noise

In our analysis of SGD (6) applied to the stochastic reformulation (4) we rely on a generic and remarkably weak assumption of expected smoothness, which we now define and relate to existing growth conditions.

2.1. Expected smoothness

Expected smoothness (Gower et al., 2018) is an assumption that combines the properties of the distribution $\mathcal{D}$ and the smoothness properties of the function $f$.

Assumption 2.1 (Expected Smoothness). We say that $f$ is $\mathcal{L}$-smooth in expectation with respect to a distribution $\mathcal{D}$ if there exists $\mathcal{L} = \mathcal{L}(f, \mathcal{D}) > 0$ such that

$$\mathbb{E}_{\mathcal{D}}\big[\|\nabla f_v(x) - \nabla f_v(x^*)\|^2\big] \leq 2\mathcal{L}\,(f(x) - f(x^*)) \qquad (7)$$

for all $x \in \mathbb{R}^d$. For simplicity, we write $(f, \mathcal{D}) \sim ES(\mathcal{L})$ to say that (7) holds. When $\mathcal{D}$ is clear from the context, we often omit it and simply state that the expected smoothness constant is $\mathcal{L}$.

There are scenarios where the above inequality is tight. Indeed, in the setting of stochastic reformulations of linear systems considered in (Richtárik & Takáč, 2017), one has $f_v(x) = \frac{1}{2}\|\nabla f_v(x)\|^2$, $f_v(x^*) = 0$ and $\nabla f_v(x^*) = 0$, which means that (7) holds as an identity with $\mathcal{L} = 1$.

In Section 3.3 we show how convexity and $L_i$-smoothness of the $f_i$ imply expected smoothness. The opposite implication does not hold, however: the expected smoothness assumption can hold even when the $f_i$'s and $f$ are not convex, as we show in the next example.

Example 2.2 (Non-convexity and expected smoothness). Let $f_i = \phi$ for $i = 1, \ldots, n$, where $\phi$ is an $L_\phi$-smooth and non-convex function with a global minimum $x^* \in \mathbb{R}^d$. Such functions exist: there are invex functions satisfying these conditions (Karimi et al., 2016); for example, $\phi(x) = x^2 + 3\sin^2(x)$ is smooth, non-convex, and has a unique global minimizer. Consequently $f = \phi$ and $f_v = \frac{\sum_i v_i}{n}\phi$. Letting $\theta := \mathbb{E}_{\mathcal{D}}\big[(\sum_i v_i)^2\big]$, we have

$$\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x) - \nabla f_v(x^*)\|^2 = \frac{\theta}{n^2}\|\nabla\phi(x) - \nabla\phi(x^*)\|^2 \leq \frac{2\theta L_\phi}{n^2}(f(x) - f(x^*)),$$

where the last inequality follows from Proposition A.1. So $(f, \mathcal{D}) \sim ES(\mathcal{L})$ with $\mathcal{L} = \theta L_\phi / n^2$.

2.2. Gradient noise

Our second key assumption is finiteness of the gradient noise, defined next.

Assumption 2.3 (Finite Gradient Noise). The gradient noise $\sigma = \sigma(f, \mathcal{D})$, defined by

$$\sigma^2 := \mathbb{E}_{\mathcal{D}}\big[\|\nabla f_v(x^*)\|^2\big], \qquad (8)$$

is finite.

This is a very weak assumption, and should intuitively be seen as an assumption on $\mathcal{D}$ rather than on $f$. For instance, if the sampling vector $v$ is non-negative with probability one and $\mathbb{E}[v_i \sum_j v_j]$ is finite for all $i$, then $\sigma$ is finite. When (1) is the training problem of an over-parametrized model, which often occurs in deep neural networks, each individual loss function $f_i$ attains its minimum at $x^*$, and thus $\nabla f_i(x^*) = 0$. It follows that $\sigma = 0$.

2.3. Key lemma and connection to the weak growth condition

A common assumption used to prove the convergence of SGD is uniform boundedness of the stochastic gradients: there exists $0 < c < \infty$ such that $\mathbb{E}\|\nabla f_v(x)\|^2 \leq c$ for all $x$. (Alternatively, it is assumed that $\mathbb{E}\|\nabla f_v(x^k)\|^2 \leq c$ for all iterates $x^k$; but this too has issues, since it implicitly assumes that the iterates remain within a compact set, and yet it is used to prove convergence to within a compact set, raising the issue of a circular argument.) However, this assumption often does not hold, for example when $f$ is strongly convex (Bottou et al., 2018; Nguyen et al., 2018). We do not assume such a bound. Instead, we use the following direct consequence of expected smoothness to bound the expected norm of the stochastic gradients.

Lemma 2.4. If $(f, \mathcal{D}) \sim ES(\mathcal{L})$, then

$$\mathbb{E}_{\mathcal{D}}\|\nabla f_v(x)\|^2 \leq 4\mathcal{L}(f(x) - f(x^*)) + 2\sigma^2. \qquad (9)$$

When the gradient noise is zero ($\sigma = 0$), inequality (9) is known as the weak growth condition (Vaswani et al., 2018).

Corollary 2.5. If $(f, \mathcal{D}) \sim ES(\mathcal{L})$ and $\sigma = 0$, then $f$ satisfies the weak growth condition $\mathbb{E}_{\mathcal{D}}[\|\nabla f_v(x)\|^2] \leq 2\rho(f(x) - f(x^*))$ with $\rho = 2\mathcal{L}$.

This corollary should be contrasted with Proposition 2 in (Vaswani et al., 2018) and Lemma 1 in (Nguyen et al., 2018), where it is shown, by assuming the $f_i$ functions to be smooth and convex, that the weak growth condition holds with $\rho = 2L_{\max}$. However, as we show in Lemma F.1, $\mathcal{L} \leq L_{\max}$, and hence our bound is often tighter.
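As a numerical sanity check (ours, under the assumption $f_i(x) = \frac{1}{2}(a_i^\top x - y_i)^2$, so that $L_i = \|a_i\|^2$), both sides of (7) and the gradient noise (8) can be evaluated exactly for single-element uniform sampling, for which $\mathcal{L} \leq \max_i \|a_i\|^2$ (cf. Proposition 3.7 below with $p_i = 1/n$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)   # global minimizer of f

def f(x):
    return 0.5 * np.mean((A @ x - y) ** 2)

def grad_fi(x, i):
    return A[i] * (A[i] @ x - y[i])

def expected_sq_diff(x):
    """E_D ||grad f_v(x) - grad f_v(x*)||^2 for single-element uniform
    sampling, where grad f_v(x) = grad f_i(x) with probability 1/n."""
    return np.mean([np.linalg.norm(grad_fi(x, i) - grad_fi(x_star, i)) ** 2
                    for i in range(n)])

sigma2 = np.mean([np.linalg.norm(grad_fi(x_star, i)) ** 2 for i in range(n)])
L_bound = max(np.linalg.norm(A[i]) ** 2 for i in range(n))  # bound on L

x = rng.standard_normal(d)
lhs = expected_sq_diff(x)                 # left-hand side of (7)
rhs = 2 * L_bound * (f(x) - f(x_star))    # right-hand side of (7)
assert lhs <= rhs + 1e-9
print(f"ES check: {lhs:.3f} <= {rhs:.3f}, sigma^2 = {sigma2:.3f}")
```

Note that $\sigma^2 > 0$ here because the system is over-determined; for an interpolating (over-parametrized) model, the same script would report $\sigma^2 = 0$.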
3. Convergence Analysis

3.1. Main results

We now present our main theorem. Note that we do not assume $f_i$ nor $f$ to be convex.

Theorem 3.1. Assume $f$ is $\mu$-quasi-strongly convex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Choose $\gamma^k = \gamma \in (0, \frac{1}{2\mathcal{L}}]$ for all $k$. Then the iterates of SGD given by (6) satisfy

$$\mathbb{E}\|x^k - x^*\|^2 \leq (1 - \gamma\mu)^k \|x^0 - x^*\|^2 + \frac{2\gamma\sigma^2}{\mu}. \qquad (10)$$

Hence, given any $\epsilon > 0$, choosing the stepsize

$$\gamma = \min\Big\{\frac{1}{2\mathcal{L}}, \frac{\epsilon\mu}{4\sigma^2}\Big\} \qquad (11)$$

and

$$k \geq \max\Big\{\frac{2\mathcal{L}}{\mu}, \frac{4\sigma^2}{\epsilon\mu^2}\Big\}\log\Big(\frac{2\|x^0 - x^*\|^2}{\epsilon}\Big) \qquad (12)$$

implies $\mathbb{E}\|x^k - x^*\|^2 \leq \epsilon$.

Theorem 3.1 states that SGD converges linearly up to the additive constant $2\gamma\sigma^2/\mu$, which depends on the gradient noise $\sigma^2$ and on the stepsize $\gamma$. A smaller stepsize yields a more accurate solution, but also slows down the convergence rate. Since we control $\mathcal{D}$, we also control $\sigma^2$ and $\mathcal{L}$ (we compute these parameters for several distributions $\mathcal{D}$ in Section 3.3). Furthermore, we can control this additive constant by carefully choosing the stepsize, as shown in the next result.

Theorem 3.2 (Decreasing stepsizes). Assume $f$ is $\mu$-quasi-strongly convex and that $(f, \mathcal{D}) \sim ES(\mathcal{L})$. Let $\mathcal{K} := \mathcal{L}/\mu$ and

$$\gamma^k = \begin{cases} \frac{1}{2\mathcal{L}} & \text{for } k \leq 4\lceil\mathcal{K}\rceil, \\[2pt] \frac{2k+1}{(k+1)^2\mu} & \text{for } k > 4\lceil\mathcal{K}\rceil. \end{cases} \qquad (13)$$

If $k \geq 4\lceil\mathcal{K}\rceil$, then the SGD iterates given by (6) satisfy

$$\mathbb{E}\|x^k - x^*\|^2 \leq \frac{\sigma^2}{\mu^2}\frac{8}{k} + \frac{16\lceil\mathcal{K}\rceil^2}{e^2 k^2}\|x^0 - x^*\|^2. \qquad (14)$$
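The switching rule (13) is straightforward to implement. Below is a short sketch (ours), assuming the constants $\mathcal{L}$ and $\mu$ (or estimates of them) are available:

```python
import math

def stepsize(k, L_exp, mu):
    """Stepsize-switching rule (13): constant stepsize 1/(2*L_exp) for the
    first 4*ceil(L_exp/mu) iterations, then a decreasing O(1/k) schedule.
    L_exp plays the role of the expected smoothness constant."""
    K = math.ceil(L_exp / mu)
    if k <= 4 * K:
        return 1.0 / (2.0 * L_exp)
    return (2 * k + 1) / ((k + 1) ** 2 * mu)
```

Plugging this schedule into the loop of the earlier SGD sketch reproduces the two-regime behaviour studied experimentally in Section 6.1.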
3.2. Choosing $\mathcal{D}$

For (6) to be efficient, the sampling vector $v$ should be sparse. For this reason we construct $v$ so that only a (small and random) subset of its entries are non-zero. Before we formally define $v$, let us first establish some random set terminology. Let $C \subseteq [n]$ and let $e_C := \sum_{i \in C} e_i$, where $e_1, \ldots, e_n$ are the standard basis vectors in $\mathbb{R}^n$. These subsets are selected using a random set-valued map $S$, referred to in the literature as a sampling (Richtárik & Takáč, 2016; Qu & Richtárik, 2016). A sampling is uniquely characterized by choosing subset probabilities $p_C \geq 0$ for all subsets $C$ of $[n]$:

$$\mathbb{P}[S = C] = p_C, \quad \forall C \subseteq [n], \qquad (15)$$

where $\sum_{C \subseteq [n]} p_C = 1$. We only consider proper samplings: a sampling $S$ is called proper if $p_i := \mathbb{P}[i \in S] = \sum_{C: i \in C} p_C$ is positive for all $i$.

The first analysis of a randomized optimization method with an arbitrary (proper) sampling was performed by Richtárik & Takáč (2016) in the context of randomized coordinate descent for strongly convex functions. This arbitrary sampling paradigm was later adopted in many other settings, including accelerated coordinate descent for strongly convex functions (Hanzely & Richtárik, 2018), coordinate and accelerated coordinate descent for convex functions (Qu & Richtárik, 2016), primal-dual methods (Qu et al., 2015; Chambolle et al., 2018), and variance-reduced methods with convex (Csiba & Richtárik, 2015) and non-convex (Horváth & Richtárik, 2018) objectives. Arbitrary sampling arises as a special case of our more general analysis by specializing the sampling vector to one dependent on a sampling $S$. We now define a practical sampling vector $v = v(S)$ as follows.

Lemma 3.3. Let $S$ be a proper sampling, and let $\hat{P} = \mathrm{Diag}(p_1, \ldots, p_n)$. Then the random vector $v = v(S)$ given by

$$v = \hat{P}^{-1} e_S \qquad (16)$$

is a sampling vector.

Proof. Note that $v_i = \mathbb{1}(i \in S)/p_i$, where $\mathbb{1}(i \in S)$ is the indicator of the event $i \in S$. It follows that $\mathbb{E}[v_i] = \mathbb{E}[\mathbb{1}(i \in S)]/p_i = 1$.

We can further specialize and define the following commonly used samplings. Each sampling $S$ gives rise to a particular sampling vector $v = v(S)$ (i.e., distribution $\mathcal{D}$), which in turn gives rise to a particular stochastic reformulation (4) and SGD variant (6).

Independent sampling. The sampling $S$ includes every $i$, independently, with probability $p_i > 0$. This type of sampling was considered in different contexts in (Horváth & Richtárik, 2018; Hanzely & Richtárik, 2018).

Partition sampling. A partition $G$ of $[n]$ is a set consisting of subsets of $[n]$ such that $\cup_{C \in G}\, C = [n]$ and $C_i \cap C_j = \emptyset$ for any $C_i, C_j \in G$ with $i \neq j$. A partition sampling $S$ is a sampling such that $p_C = \mathbb{P}[S = C] > 0$ for all $C \in G$ and $\sum_{C \in G} p_C = 1$.

Single element sampling. Only the singleton sets $\{i\}$ for $i = 1, \ldots, n$ have a non-zero probability of being sampled; that is, $\mathbb{P}[|S| = 1] = 1$. We have $\mathbb{P}[v(S) = e_i/p_i] = p_i$.

$\tau$-nice sampling. We say that $S$ is $\tau$-nice if $S$ samples from all subsets of $[n]$ of cardinality $\tau$ uniformly at random. In this case $p_i = \tau/n$ for all $i \in [n]$, so $\mathbb{P}[v(S) = \frac{n}{\tau}e_C] = 1/\binom{n}{\tau}$ for all subsets $C \subseteq [n]$ with $\tau$ elements.

3.3. Bounding $\mathcal{L}$ and $\sigma^2$

By assuming that the $f_i$ functions are convex and smooth, we can calculate closed-form expressions for the expected smoothness $\mathcal{L}$ and the gradient noise $\sigma^2$. In particular, we make the following smoothness assumption.

Assumption 3.4. There exists a symmetric positive definite matrix $M_i \in \mathbb{R}^{d \times d}$ such that

$$f_i(x + h) \leq f_i(x) + \langle \nabla f_i(x), h \rangle + \tfrac{1}{2}\|h\|^2_{M_i} \qquad (17)$$

for all $x, h \in \mathbb{R}^d$ and $i \in [n]$, where $\|h\|^2_{M_i} := \langle M_i h, h \rangle$. In this case we say that $f_i$ is $M_i$-smooth. Furthermore, we assume that each $f_i$ is convex.

To better relate the above assumption to standard smoothness assumptions, we make the following remark.

Remark 3.5. As a consequence of Assumption 3.4, each $f_i$ is $L_i := \lambda_{\max}(M_i)$-smooth and $f$ is $L := \frac{1}{n}\lambda_{\max}(\sum_{i=1}^n M_i)$-smooth. Let $L_{\max} := \max_{i \in [n]} L_i$.

Using Assumption 3.4 and a sampling, we establish the following bounds on $\mathcal{L}$.

Theorem 3.6. Let $S$ be a proper sampling and $v = v(S)$ (i.e., $v$ is defined by (16)). Let $f_i$ be $M_i$-smooth, and let $P \in \mathbb{R}^{n \times n}$ be defined by $P_{ij} = \mathbb{P}[i \in S \;\&\; j \in S]$. Then $(f, \mathcal{D}) \sim ES(\mathcal{L})$, where

$$\mathcal{L} \leq \max_{i \in [n]}\Big\{\frac{1}{p_i}\sum_{C : i \in C} p_C\, \mathcal{L}_C\Big\} \quad \text{and} \quad \mathcal{L} \leq \frac{1}{n}\max_{i \in [n]}\Big\{\frac{1}{p_i}\sum_{j \in [n]} P_{ij}\, \lambda_{\max}(M_j)\Big\}, \qquad (18)$$

with $\mathcal{L}_C := \frac{1}{n}\lambda_{\max}\big(\sum_{j \in C} \frac{1}{p_j} M_j\big)$. If $|S| \leq \tau$ with probability 1, then

$$\mathcal{L} \leq \tau \max_{i \in [n]}\Big\{\frac{1}{n p_i}\Big\}\, L_{\max}, \quad \text{where } L_{\max} = \max_{i \in [n]} \lambda_{\max}(M_i). \qquad (19)$$

By applying the above result to specific samplings, we obtain the following practical bounds on $\mathcal{L}$.

Proposition 3.7. (i) For single element sampling $S$, we have

$$\mathcal{L} \leq \frac{1}{n}\max_{i \in [n]} \frac{\lambda_{\max}(M_i)}{p_i}. \qquad (20)$$

(ii) For partition sampling $S$ with partition $G$, we have

$$\mathcal{L} \leq \frac{1}{n}\max_{C \in G}\Big\{\frac{1}{p_C}\lambda_{\max}\Big(\sum_{j \in C} M_j\Big)\Big\}. \qquad (21)$$

For $\tau$-nice sampling and independent sampling, we get the following very informative bounds on $\mathcal{L}$.

Proposition 3.8. (iii) For independent sampling $S$, we have

$$\mathcal{L} \leq L + \max_{i \in [n]} \frac{1 - p_i}{n\, p_i}\lambda_{\max}(M_i). \qquad (22)$$

(iv) For $\tau$-nice sampling, we have

$$\mathcal{L} \leq \frac{n(\tau - 1)}{\tau(n - 1)}L + \frac{n - \tau}{\tau(n - 1)}\max_{i \in [n]}\lambda_{\max}(M_i). \qquad (23)$$

Gazagnadou et al. (2019) were the first to suggest using (23) as an approximation for $\mathcal{L}$. Through extensive experiments, they showed that the bound (23) is very tight. Here we give the first proof that (23) is indeed a valid upper bound.

For $v = v(S)$ given by (16), formulas for the gradient noise $\sigma^2$ are provided in the next result.

Theorem 3.9. Let $h_i = \nabla f_i(x^*)$. Then

$$\sigma^2 = \frac{1}{n^2}\sum_{i,j \in [n]} \frac{P_{ij}}{p_i p_j}\langle h_i, h_j \rangle. \qquad (24)$$

Specializing the above theorem to specific samplings $S$ gives the following formulas for $\sigma^2$.

Proposition 3.10. (i) For single element sampling $S$, we have

$$\sigma^2 = \frac{1}{n^2}\sum_{i \in [n]} \frac{1}{p_i}\|h_i\|^2. \qquad (25)$$

(ii) For independent sampling $S$ with $\mathbb{E}[|S|] = \tau$, we have

$$\sigma^2 = \frac{1}{n^2}\sum_{i \in [n]} \frac{1 - p_i}{p_i}\|h_i\|^2. \qquad (26)$$

(iii) For $\tau$-nice sampling $S$, we have

$$\sigma^2 = \frac{1}{n\tau}\frac{n - \tau}{n - 1}\sum_{i \in [n]} \|h_i\|^2. \qquad (27)$$

(iv) For partition sampling $S$ with partition $G$, we have

$$\sigma^2 = \frac{1}{n^2}\sum_{C \in G}\frac{1}{p_C}\Big\|\sum_{i \in C} h_i\Big\|^2. \qquad (28)$$

Generally, we do not know the values of $h_i = \nabla f_i(x^*)$. But if we have prior knowledge that $x^*$ belongs to some set $C$, we can obtain upper bounds on $\sigma^2$ for these samplings from Proposition 3.10 in a straightforward way.
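For example, under Assumption 3.4 with $M_i = a_i a_i^\top$ (the least-squares setting), the $\tau$-nice bounds (23) and (27) can be evaluated directly from the data. A sketch (ours, with illustrative variable names):

```python
import numpy as np

def tau_nice_constants(A, H, tau):
    """Evaluate the tau-nice bounds: (23) for the expected smoothness and
    (27) for the gradient noise, in the least-squares setting where
    M_i = a_i a_i^T, so L_i = ||a_i||^2 and L = lambda_max(A^T A) / n.
    H is an (n, d) array whose i-th row is h_i = grad f_i(x*)."""
    n = A.shape[0]
    L = np.linalg.eigvalsh(A.T @ A / n).max()      # smoothness of f
    L_max = (A ** 2).sum(axis=1).max()             # max_i ||a_i||^2
    L_tau = (n * (tau - 1) / (tau * (n - 1))) * L \
          + ((n - tau) / (tau * (n - 1))) * L_max  # bound (23)
    sigma2 = (n - tau) / (n * tau * (n - 1)) * (H ** 2).sum()  # formula (27)
    return L_tau, sigma2
```

At $\tau = n$ this returns $\mathcal{L} = L$ and $\sigma^2 = 0$, recovering the full-batch (gradient descent) constants used in the next section.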
4. Optimal Mini-Batch Size

Here we develop the iteration complexity of SGD for different samplings by plugging the bounds on $\mathcal{L}$ and $\sigma^2$ from Section 3.3 into Theorem 3.1. To keep the notation brief, in this section we drop the logarithmic term $\log(2\|x^0 - x^*\|^2/\epsilon)$ from the iteration complexity results. Furthermore, for brevity and to better compare our results to others in the literature, we use $L_i = \lambda_{\max}(M_i)$ and $L_{\max} = \max_{i \in [n]} L_i$ (see Remark 3.5). Finally, let $\bar h := \frac{1}{n}\sum_{i \in [n]} \|h_i\|^2$.

Gradient descent. As a first sanity check, we consider the case where $|S| = n$ with probability one; that is, each iteration (6) uses the full batch gradient. Thus $\sigma = 0$, and it is not hard to see that for $\tau = n$ in (23), or $p_i = 1$ for all $i$ in (22), we have $\mathcal{L} = L$. Consequently, the resulting iteration complexity (12) is $k \geq \frac{2L}{\mu}$. This is exactly the rate of gradient descent, which is precisely what we would expect, since the resulting method is gradient descent. Though an obvious sanity check, we believe this is the first convergence theorem for SGD that includes gradient descent as a special case. Clearly, this is a necessary prerequisite if we are to hope to understand the complexity of mini-batching.

4.1. Nonzero gradient noise

To better appreciate how our iteration complexity evolves with increasing mini-batch size, we now consider independent sampling with $\mathbb{E}[|S|] = \tau$ and $\tau$-nice sampling.

Independent sampling. Inserting the bound (22) on $\mathcal{L}$ and the formula (26) for $\sigma^2$ into (12) gives the iteration complexity

$$k \geq \frac{2}{\mu}\max\Big\{L + \max_{i \in [n]}\frac{1 - p_i}{n p_i}L_i,\; \frac{2}{\mu\epsilon}\sum_{i \in [n]}\frac{1 - p_i}{n^2 p_i}\|h_i\|^2\Big\}. \qquad (29)$$

This is a completely new mini-batch complexity result, which opens up the possibility of optimizing the mini-batch size and the sampling probabilities. For instance, if we fix uniform probabilities $p_i = \tau/n$, then (29) becomes $k \geq \frac{2}{\mu}\max\{l(\tau), r(\tau)\}$, where

$$l(\tau) := L + \frac{n - \tau}{\tau n}L_{\max}, \qquad r(\tau) := \frac{2}{\mu\epsilon}\frac{n - \tau}{\tau n}\bar h. \qquad (30)$$

This complexity result corresponds to using the stepsize

$$\gamma = \frac{1}{2}\min\Big\{\frac{1}{l(\tau)}, \frac{1}{r(\tau)}\Big\} \qquad (31)$$

if $\tau < n$; otherwise only the left-hand term in the minimum remains. The stepsize (31) is increasing in $\tau$, since both $l(\tau)$ and $r(\tau)$ decrease as $\tau$ increases.

With such a simple expression for the iteration complexity, we can choose a mini-batch size that optimizes the total complexity. Defining the total complexity $T(\tau)$ as the number of iterations $k$ times the number of gradient evaluations $\tau$ per iteration gives

$$T(\tau) := \frac{2}{\mu n}\max\Big\{\tau n L + (n - \tau)L_{\max},\; \frac{2(n - \tau)\bar h}{\mu\epsilon}\Big\}. \qquad (32)$$

Minimizing $T(\tau)$ in $\tau$ is easy, because $T(\tau)$ is a maximum of a term $\tau\, l(\tau)$ that is linearly increasing in $\tau$ and a term $\tau\, r(\tau)$ that is linearly decreasing in $\tau$. Furthermore, $n\, l(n) \geq 0 = n\, r(n)$. Consequently, if $l(1) \geq r(1)$ then $\tau^* = 1$; otherwise

$$\tau^* = n\,\frac{\frac{2}{\mu\epsilon}\bar h - L_{\max}}{\frac{2}{\mu\epsilon}\bar h - L_{\max} + nL}. \qquad (33)$$

Since $r(1)$ is proportional to the noise and to $1/\epsilon$, while $l(1)$ is proportional to the smoothness constants, the condition $l(1) \leq r(1)$ holds when there is comparatively a lot of noise or the precision is high. As we will see in Section 4.2, this logic extends to the case where the noise is zero, where the optimal mini-batch size is $\tau^* = 1$.

$\tau$-nice sampling. Inserting the bound (23) on $\mathcal{L}$ and the formula (27) for $\sigma^2$ into (12) gives the iteration complexity $k \geq \frac{2}{\mu}\max\{l(\tau), r(\tau)\}$, where now

$$l(\tau) = \frac{n(\tau - 1)}{\tau(n - 1)}L + \frac{n - \tau}{\tau(n - 1)}L_{\max}, \qquad (34)$$

$$r(\tau) = \frac{2(n - \tau)}{\epsilon\mu(n - 1)}\frac{\bar h}{\tau}, \qquad (35)$$

which holds for the stepsize

$$\gamma = \frac{1}{2}\min\Big\{\frac{1}{l(\tau)}, \frac{1}{r(\tau)}\Big\}. \qquad (36)$$

Again, this stepsize is an increasing function of $\tau$. We are once more able to calculate the mini-batch size that optimizes the total complexity $T(\tau) = \frac{2\tau}{\mu}\max\{l(\tau), r(\tau)\}$. Once again, $T(\tau)$ is a maximum of a linearly increasing term $\tau\, l(\tau)$ and a linearly decreasing term $\tau\, r(\tau)$. Furthermore, $r(n) = 0 \leq l(n)$. Consequently, if $r(1) \leq l(1)$ then $\tau^* = 1$; otherwise

$$\tau^* = n\,\frac{L - L_{\max} + \frac{2}{\mu\epsilon}\bar h}{nL - L_{\max} + \frac{2}{\mu\epsilon}\bar h}. \qquad (37)$$
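The rule (33) is cheap to evaluate. Here is a sketch (ours) for independent sampling with uniform probabilities, where $L$, $L_{\max}$, $\bar h$ and the target accuracy are assumed given (e.g., from the bounds of Section 3.3); the $\tau$-nice rule (37) is analogous:

```python
def optimal_minibatch_independent(n, L, L_max, h_bar, mu, eps):
    """Optimal mini-batch size tau* from (33), for independent sampling
    with uniform probabilities p_i = tau/n.  Returns 1 when l(1) >= r(1),
    i.e. when the noise/accuracy term is comparatively small."""
    l1 = L + (n - 1) / n * L_max                 # l(1) from (30)
    r1 = 2 * (n - 1) * h_bar / (mu * eps * n)    # r(1) from (30)
    if l1 >= r1:
        return 1
    noise = 2 * h_bar / (mu * eps)
    tau = n * (noise - L_max) / (noise - L_max + n * L)
    return max(1, min(n, round(tau)))
```

As the formula makes explicit, $\tau^*$ grows toward $n$ as the noise term $2\bar h/(\mu\epsilon)$ dominates the smoothness constants, matching the abstract's claim.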
4.2. Zero gradient noise

Consider the case where the gradient noise is zero ($\sigma = 0$). According to Theorem 3.1, the resulting complexity of SGD with constant stepsize $\gamma = \frac{1}{2\mathcal{L}}$ is given by the very simple expression

$$k \geq \frac{2\mathcal{L}}{\mu}, \qquad (38)$$

where we have dropped the logarithmic term $\log(2\|x^0 - x^*\|^2/\epsilon)$. In this setting, due to Corollary 2.5, we know that $f$ satisfies the weak growth condition. Thus our results are directly comparable to those developed in (Ma et al., 2018) and (Vaswani et al., 2018). In particular, Theorem 1 in (Ma et al., 2018) states that when running SGD with mini-batches based on sampling with replacement, the resulting iteration complexity (again dropping the logarithmic term) is

$$k \geq \frac{1}{\mu}\frac{L_{\max} + (\tau - 1)L}{\tau}. \qquad (39)$$

Now gaining insight into the complexity (38) is a matter of studying the expected smoothness parameter $\mathcal{L}$ for different sampling strategies.

Independent sampling. Setting $\sigma = 0$ (and thus $\bar h = 0$) and using uniform probabilities $p_i = \tau/n$ in (29) gives

$$k \geq \frac{2}{\mu}\Big(L + \frac{n - \tau}{\tau n}L_{\max}\Big). \qquad (40)$$

$\tau$-nice sampling. If we use uniform probabilities and $\sigma = 0$, then the resulting iteration complexity is

$$k \geq \frac{n(\tau - 1)}{\tau(n - 1)}\frac{2L}{\mu} + \frac{n - \tau}{\tau(n - 1)}\frac{2L_{\max}}{\mu}. \qquad (41)$$

The iteration complexities (39), (40) and (41) tell essentially the same story: the complexity improves as $\tau$ increases to $n$, but this improvement is not enough when considering the total complexity (multiplying by $\tau$). Indeed, for total complexity, these results all say that $\tau^* = 1$ is optimal.

5. Importance Sampling

In this section we propose importance sampling for single element sampling and for independent sampling with $\mathbb{E}[|S|] = \tau$. Due to lack of space, the details of this section are in the appendix, Section K. Again we drop the log term in (12) and adopt the notation of Remark 3.5.

5.1. Single element sampling

For single element sampling, plugging (20) and (25) into (12) gives the iteration complexity

$$k \geq \frac{2}{\epsilon\mu^2}\max\Big\{\frac{\epsilon\mu}{n}\max_{i \in [n]}\frac{L_i}{p_i},\; \frac{2}{n^2}\sum_{i \in [n]}\frac{1}{p_i}\|h_i\|^2\Big\},$$

where $0 < p_i \leq 1$ and $\sum_{i \in [n]} p_i = 1$. Optimizing this iteration complexity over the $p_i$ requires solving an $n$-dimensional linearly constrained non-smooth convex minimization problem, which could be harder than the original problem (1). So instead, we focus on minimizing the bounds on $\mathcal{L}$ and $\sigma^2$ over the $p_i$ separately, and then use the two resulting (sub)optimal probabilities to construct a sampling. In particular, for single element sampling we can recover the partially biased sampling developed in (Needell et al., 2016).

First, from (20) it is easy to see that the probabilities minimizing the bound on $\mathcal{L}$ are $p^L_i = L_i / \sum_{j \in [n]} L_j$ for all $i$. Using these suboptimal probabilities, we construct a partially biased sampling by letting $\hat p_i := \frac{1}{2}p^L_i + \frac{1}{2n}$. Plugging this sampling into (20) gives $\mathcal{L} \leq 2\bar L := \frac{2}{n}\sum_{i \in [n]} L_i$, and from (25) we have $\sigma^2 \leq \frac{2}{n}\sum_{i \in [n]} \|h_i\|^2 = 2\bar h$. This sampling is the same as the partially biased sampling in (Needell et al., 2016). From Theorem 3.1 we get that the total complexity is now given by

$$k \geq \max\Big\{\frac{4\bar L}{\mu}, \frac{8\bar h}{\epsilon\mu^2}\Big\}. \qquad (42)$$

For uniform sampling, the corresponding constants are $\mathcal{L} \leq L_{\max} = \max_{i \in [n]} L_i \geq \bar L$ and $\sigma^2 = \frac{1}{n}\sum_{i \in [n]} \|h_i\|^2 = \bar h$. Hence, compared to uniform sampling, the iteration complexity of partially biased sampling is at most two times larger, but can be $n/2$ times smaller in the extreme case where $L_{\max} = n\bar L$.
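The partially biased probabilities $\hat p_i$ are trivial to construct from the constants $L_i$. A sketch (ours):

```python
import numpy as np

def partially_biased_probs(L):
    """Partially biased probabilities p_i = 0.5 * L_i / sum_j L_j + 0.5/n,
    where L is the vector of smoothness constants L_i
    (cf. Needell et al., 2016)."""
    L = np.asarray(L, dtype=float)
    n = len(L)
    return 0.5 * L / L.sum() + 0.5 / n
```

When index $i$ is drawn with probability $\hat p_i$, the unbiased SGD step uses $\nabla f_i(x)/(n \hat p_i)$ as the stochastic gradient, per the sampling vector (16). The $1/(2n)$ floor is what keeps $\sigma^2$ bounded by $2\bar h$ even when some $L_i$ are tiny.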
5.2. Minibatches

Importance sampling for minibatches was first considered in (Csiba & Richtárik, 2018), though not in the context of SGD. Here we propose the first importance sampling for minibatch SGD. In Section K.2 of the appendix we introduce the use of partially biased sampling together with independent sampling with $\mathbb{E}[|S|] = \tau$, and show that we can achieve an iteration complexity of (by Proposition K.3)

$$k \geq \max\Big\{\frac{2}{\mu}\Big(L + \frac{2\bar L}{\tau}\Big),\; \frac{8\bar h}{\tau\epsilon\mu^2}\Big\}, \qquad (43)$$

which not only eliminates the dependence on $L_{\max}$, but also improves as the mini-batch size $\tau$ increases.

6. Experiments

In this section we empirically validate our theoretical results. We perform three experiments, in each of which we highlight a different aspect of our contributions. In the first two experiments we focus on ridge regression and regularized logistic regression problems (problems with a strongly convex objective $f$ and convex components $f_i$), and we evaluate the performance of SGD on both synthetic and real data. In the second experiment (Section 6.2) we compare the convergence of SGD for several choices of the distribution $\mathcal{D}$ (different sampling strategies), as described in Section 3.2. In the last experiment (Section 6.3) we focus on the problem of principal component analysis (PCA), which by construction can be seen as a problem with a strongly convex objective $f$ but with non-convex functions $f_i$ (Allen-Zhu & Yuan, 2016; Garber & Hazan, 2015; Shalev-Shwartz, 2016).

In all experiments, we evaluate SGD using the relative error measure $\|x^k - x^*\|^2 / \|x^0 - x^*\|^2$. For all implementations, the starting point $x^0$ is sampled from the standard Gaussian distribution. We run each method until $\|x^k - x^*\|^2 \leq 10^{-3}$ or until a prespecified maximum number of epochs is reached. The horizontal axis is always the number of epochs. For more experiments we refer the interested reader to Section L of the appendix.

Regularized Regression Problems: For the ridge regression problem we solve

$$\min_x f(x) = \frac{1}{2n}\sum_{i=1}^n (A[i,:]x - y_i)^2 + \frac{\lambda}{2}\|x\|^2,$$

while for the L2-regularized logistic regression problem we solve

$$\min_x f(x) = \frac{1}{2n}\sum_{i=1}^n \log\big(1 + \exp(-y_i\, A[i,:]x)\big) + \frac{\lambda}{2}\|x\|^2.$$

In both problems, $A \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ are the given data and $\lambda > 0$ is the regularization parameter.
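For reference, a sketch (ours) of the ridge objective and its component gradients, written so that $f = \frac{1}{n}\sum_i f_i$ matches the formulation (1):

```python
import numpy as np

def ridge_f(x, A, y, lam):
    """f(x) = 1/(2n) * ||Ax - y||^2 + (lam/2) * ||x||^2."""
    return 0.5 * np.mean((A @ x - y) ** 2) + 0.5 * lam * (x @ x)

def ridge_grad_fi(x, A, y, lam, i):
    """Gradient of the i-th component
    f_i(x) = 0.5 * (A[i] @ x - y[i])**2 + (lam/2) * ||x||^2,
    so that f = (1/n) * sum_i f_i."""
    return A[i] * (A[i] @ x - y[i]) + lam * x
```

Folding the regularizer into every $f_i$ (rather than treating it separately) is what makes each component $\lambda$-strongly convex, so $f$ satisfies (2) with $\mu \geq \lambda$.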
We generated synthetic data for both problems by sampling the rows $A[i,:]$ of the matrix $A$ from the standard Gaussian distribution $N(0, 1)$. For ridge regression we also sampled the entries of $y$ from the standard Gaussian, while for logistic regression $y \in \{-1, 1\}^n$ with $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = \frac{1}{2}$. For our experiments on real data we chose several LIBSVM (Chang & Lin, 2011) datasets.

6.1. Constant vs decreasing step size

We compare the performance of SGD in the constant and decreasing stepsize regimes of Theorems 3.1 (see (11)) and 3.2 (see (13)), respectively, using uniform single element sampling. As expected from the theory, we see in Figure 1 that the decreasing stepsize regime is vastly superior at reaching a higher precision than the constant stepsize variant. In our plots, the vertical red line marks the value $4\lceil \mathcal{L}/\mu \rceil$ predicted by Theorem 3.2, the point where SGD switches its update rule from constant to decreasing stepsize.

Figure 1. Comparison between constant and decreasing step size regimes of SGD. Ridge regression (first row): synthetic data (left, n = 1000, d = 400) and the abalone dataset from LIBSVM (right, n = 4177, d = 8). Logistic regression (second row): synthetic data (left, n = 2000, d = 100) and the a1a dataset from LIBSVM (right, n = 1605, d = 119). In all experiments λ = 1/n.

6.2. Minibatches

In Figures 2 and 5 we compare single element sampling (uniform and importance), $\tau$-independent sampling (uniform, uniform with optimal batch size, and importance) and $\tau$-nice sampling (with some $\tau$ and with the optimal $\tau^*$). The probabilities for importance sampling in the single element and $\tau$-independent samplings are calculated by formulas (67) and (77) in the appendix. Formulas for the optimal mini-batch size $\tau^*$ for independent and $\tau$-nice sampling are given in (33) and (37), respectively. Observe that mini-batching with the optimal $\tau^*$ gives the best convergence. In addition, note that for constant step size, the importance sampling variants depend on the accuracy $\epsilon$. From Figure 2 we can see that, before the error reaches the required accuracy, the importance sampling variants are comparable to or better than their corresponding uniform sampling variants.

Figure 2. Performance of SGD with several minibatch strategies for logistic regression. Above: the w3a dataset from LIBSVM (n = 4912, d = 300, λ = 100/n, ϵ = 10⁻³, τ = n/100, τ* = 2633). Below: standard Gaussian data (n = 200, d = 10, λ = 100/n, ϵ = 10⁻³, τ = n/100, τ* ≈ 180).

6.3. Sum-of-non-convex functions

In Figure 3, our goal is to illustrate that Theorem 3.1 holds even if the functions $f_i$ are non-convex. This experiment is based on the experimental setup of (Allen-Zhu & Yuan, 2016). We first generate random vectors $a_1, \ldots, a_n, b \in \mathbb{R}^d$ from $U(0, 10)$ and set $A := \frac{1}{n}\sum_{i=1}^n a_i a_i^\top$. Then we consider the problem

$$\min_x f(x) = \frac{1}{2n}\sum_{i=1}^n x^\top\big(a_i a_i^\top + D_i\big)x + b^\top x,$$

where the $D_i$, $i \in [n]$, are diagonal matrices satisfying $D := D_1 + \cdots + D_n = 0$. In particular, to guarantee that $D = 0$, we randomly select half of the matrices and assign their $j$-th diagonal value $(D_i)_{jj}$ equal to $11$; for the other half we assign $(D_i)_{jj} = -11$. We repeat this for all diagonal values. Under this construction, each $f_i$ is a non-convex function.

Figure 3. Left: comparison between constant and decreasing step size regimes of SGD for PCA (n = 1000, d = 100). Right: comparison of different sampling strategies of SGD for PCA (n = 1000, d = 10, τ = n/5).

Once again, in the first plot we observe that while both variants are equally fast in the beginning, the decreasing stepsize variant is better at reaching higher accuracy than the fixed stepsize variant. In the second plot we see, as expected, that all four minibatch versions of SGD outperform single element SGD. However, while the $\tau$-nice and $\tau$-independent samplings with $\tau = n/5$ lead only to a slight improvement, the theoretically optimal choice $\tau = \tau^*$ leads to a vast improvement.
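A sketch (ours) of this sum-of-non-convex construction; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 100
a = rng.uniform(0, 10, size=(n, d))
b = rng.uniform(0, 10, size=d)

# Diagonal D_i with entries +/-11, arranged so that sum_i D_i = 0:
# for each coordinate j, a random half of the D_i get +11, the rest -11.
D = np.empty((n, d))
for j in range(d):
    signs = np.array([11.0] * (n // 2) + [-11.0] * (n // 2))
    D[:, j] = rng.permutation(signs)
assert np.allclose(D.sum(axis=0), 0.0)

def grad_fi(x, i):
    """Gradient of f_i(x) = 0.5 * x^T (a_i a_i^T + D_i) x + b^T x.
    Each f_i is non-convex whenever a_i a_i^T + D_i has a negative
    eigenvalue, yet f itself is convex since the D_i cancel in the sum."""
    return a[i] * (a[i] @ x) + D[i] * x + b
```

Because the $D_i$ cancel exactly, the averaged objective reduces to $f(x) = \frac{1}{2}x^\top A x + b^\top x$, which is strongly convex whenever $A \succ 0$, while every individual component remains non-convex; this is exactly the regime Theorem 3.1 covers.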
Acknowledgements

RMG acknowledges support by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with the Gaspard Monge Program for optimization, operations research and their interactions with data sciences.

References

Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pp. 1080–1089, 2016.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Cevher, V. and Vu, B. C. On the linear convergence of the stochastic gradient method with constant step-size. arXiv:1712.01906, pp. 1–9, 2017.

Chambolle, A., Ehrhardt, M. J., Richtárik, P., and Schönlieb, C.-B. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.

Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Chee, J. and Toulis, P. Convergence diagnostics for stochastic gradient descent with constant learning rate. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1476–1485. PMLR, 2018.

Csiba, D. and Richtárik, P. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015.

Csiba, D. and Richtárik, P. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27):1–21, 2018.

Garber, D. and Hazan, E. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.

Gazagnadou, N., Gower, R. M., and Salmon, J. Optimal mini-batch and step sizes for SAGA. In 36th International Conference on Machine Learning, 2019.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. arXiv preprint arXiv:1809.09354, 2018.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: stability of stochastic gradient descent. In 33rd International Conference on Machine Learning, 2016.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Horváth, S. and Richtárik, P. Nonconvex variance reduced optimization with arbitrary sampling. arXiv:1809.04146, 2018.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.

Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

Loizou, N. and Richtárik, P. Convergence analysis of inexact randomized iterative methods. arXiv preprint arXiv:1903.07971, 2019.

Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pp. 3331–3340, 2018.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.

Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1–39, 2018. doi:10.1007/s10107-018-1232-1.

Needell, D. and Ward, R. Batched stochastic gradient descent with weighted sampling. In Approximation Theory XV, volume 204 of Springer Proceedings in Mathematics & Statistics, pp. 279–306, 2017.

Needell, D., Srebro, N., and Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, Series A, 155(1):549–573, 2016.

Nemirovski, A. and Yudin, D. B. On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Mathematics Doklady, 19, 1978.

Nemirovski, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley Interscience, 1983.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! Convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3750–3758. PMLR, 2018.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.

Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems, pp. 865–873, 2015.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In 29th International Conference on Machine Learning, volume 12, pp. 1571–1578, 2012.

Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.
Richtárik, P. and Takáč, M. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.

Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

Richtárik, P. and Takáč, M. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Schmidt, M. and Roux, N. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370, 2013.

Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pp. 747–754, 2016.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: primal estimated sub-gradient solver for SVM. In 24th International Conference on Machine Learning, pp. 807–814, 2007.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pp. 71–79, 2013.

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv:1810.07288, 2018.