# nonconvex_variance_reduced_optimization_with_arbitrary_sampling__c9385f1a.pdf Nonconvex Variance Reduced Optimization with Arbitrary Sampling Samuel Horváth 1 Peter Richtárik 1 2 3 We provide the first importance sampling variants of variance-reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon current mini-batch analysis of these methods by proposing importance sampling for minibatches in this setting. Ours are the first optimal samplings for minibatches in the literature on stochastic optimization. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting. 1. Introduction Empirical risk minimization (ERM) is a key problem in machine learning as it plays a key role in training supervised learning models, including classification and regression problems, such as support vector machine, logistic regression and deep learning. A generic ERM problem has the finite-sum form min x Rd f(x) := 1 i=1 fi(x), (1) 1King Abdullah University of Science and Technology, Saudi Arabia 2Moscow Institute of Physics and Technology, Russia 3University of Edinburgh, United Kingdom. Correspondence to: Samuel Horváth , Peter Richtárik . Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). where x corresponds to the parameters defining a model, fi(x) is the loss of the model x associated with data point i, and f is the average (empirical) loss across the entire training dataset. In this paper we focus on the case when the functions fi are Li smooth but non-convex. We assume the problem has a solution x . One of the most popular algorithms for solving (1) is stochastic gradient descent (SGD) (Nemirovsky and Yudin, 1983; Nemirovski et al., 2009). In recent years, tremendous effort was exerted to improve its performance, leading to various enhancements which use acceleration (Allen-Zhu, 2016), momentum (Loizou and Richtárik, 2017), minibatching (Takáˇc et al., 2013), distributed implementation (Ma et al., 2015; 2017), importance sampling (Zhao and Zhang, 2015; Csiba and Richtárik, 2018; Qu et al., 2015; Chambolle et al., 2018), higher-order information (Qu et al., 2016; Gower et al., 2016), and a number of other techniques. 1.1. Variance-reduced methods A particularly important recent advance has to do with the design of variance-reduced (VR) stochastic gradient methods, such as SAG (Roux et al., 2012), SDCA (Shalev Shwartz and Zhang, 2013; Richtárik and Takáˇc, 2014), SVRG (Johnson and Zhang, 2013), S2GD (Koneˇcný and Richtárik, 2017), SAGA (Defazio et al., 2014a), MISO (Mairal, 2015), FINITO (Defazio et al., 2014b) and SARAH (Nguyen et al., 2017a), which operate by modifying the classical stochastic gradient direction in each step of the training process in various clever ways so as to progressively reduce its variance as an estimator of the true gradient. We note that SAG and SARAH, historically the oldest and one the newest VR methods in the list, respectively, use a biased estimator of the gradient. In theory, all these methods enjoy linear convergence rates on smooth and strongly convex functions, which is in contrast with slow sublinear rate of SGD. VR methods are also easier to implement as they do not rely on a decreasing learning rate schedule. VR methods were recently extended to work with (non-strongly) convex losses by Koneˇcný and Richtárik (2017), and more recently also to non-convex losses by Reddi et al. (2016b;a); Allen-Zhu and Hazan (2016); Nguyen et al. (2017b) - in all cases leading to best current rates for (1) in a given function class. Nonconvex Variance Reduced Optimization with Arbitrary Sampling 1.2. Importance sampling, minibatching and non-convex models In the context of problem (1), importance sampling refers to the technique of assigning carefully designed non-uniform probabilities {pi} to the n functions {fi}, and using these, as opposed to uniform probabilities, to sample the next data point (stochastic gradient) during the training process. Despite the huge theoretical and practical success of VR methods, there are still considerable gaps in our understanding. For instance, an importance sampling variant of the popular SAGA method, with the correct convergence rate, was only designed very recently by Gower et al. (2018); and the analysis applies to strongly convex f only. A coordinate descent variant of SVRG with importance sampling, also in the strongly convex case, was analyzed by Koneˇcný et al. (2017). However, the method does not seem to admit a fast implementation. For dual methods based on coordinate descent, importance sampling is relatively well understood (Nesterov, 2012; Richtárik and Takáˇc, 2014; Qu and Richtárik, 2016; Qu et al., 2015; Allen-Zhu et al., 2016). The territory is completely unmapped in the non-convex case, however. To the best of our knowledge, no importance sampling VR methods have been designed nor analyzed in the popular case when the functions {fi} are non-convex. An exception to this is df SDCA (Csiba and Richtárik, 2015); however, this method applies to an explicitly regularized version of (1), and while the individual functions are allowed to be non-convex, the average f is assumed to be convex. Given the dominance of stochastic gradient type methods in training large non-convex models such as deep neural networks, theoretical investigation of VR methods that can benefit from importance sampling is much needed. The situation is worse still when one asks for importance sampling of minibatches. To the best of our knowledge, there are only a handful of papers on this topic (Richtárik and Takáˇc, 2016a; Csiba and Richtárik, 2018; Hanzely and Richtárik, 2019), none of which apply to the non-convex setting considered here, nor to the methods we will analyze, and the problem is open. This is despite the fact that minibatch methods are de-facto the norm for training deep nets. In practice, typically relatively small (O(1) or O(log n)) minibatch sizes are used. 1.3. Contributions The main contributions of this paper are: Arbitrary sampling. We peform a general analysis of three popular VR methods SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014a) and SARAH (Nguyen et al., 2017a) in the arbitrary sampling paradigm (Richtárik and Takáˇc, 2016a; Qu and Richtárik, 2016; Qu and Richtárik, 2016; Qu et al., 2015; Chambolle et al., 2018). That is, we prove general complexity results which hold for an arbitrary random set valued mapping (aka arbitrary sampling) generating the minibatches of examples used by the algorithms in each iteration. Optimal sampling. Starting from our general complexity results which hold for arbitrary sampling (see the second column in Table 1), we are able calculate the optimal sampling out of all samplings of a given minibatch size (see Lemma 2 and also the last column in Table 1). This is the first time an optimal minibatch sampling was computed (from the class of all samplings) in the literature for any stochastic optimization method we know, including all variants of SGD and coordinate descent. Indeed, while the results in (Richtárik and Takáˇc, 2016a; Csiba and Richtárik, 2018; Hanzely and Richtárik, 2019) and other works on this topic construct importance sampling for minibatches, these are not shown nor believed to be optimal. Improved rates. Our iteration complexity bounds improve upon the best current rates for these methods even in the non-minibatch case. For SVRG and SAGA, this is true even when Li = Lj for all i, j, which is counter-intuitive as classical importance sampling is proportional to the constants Li, which in this case would lead to uniform probabilities. Our importance sampling can be faster by up to the factor of n compared to the current state of the art (see Table 1 and Appendix C). Our methods can enjoy linear speedup or even for some specific samplings superlinear speedup in minibatch size. That is, the number of iterations needed to output a solution of a given accuracy drops by a factor equal or greater to the minibatch size used. This is of utmost relevance to the practice of training neural nets with minibatch stochastic methods as our results predict that this is to be expected. We design importance sampling and approximate importance sampling for minibatches which in our experiments vastly outperform the standard uniform minibatch strategies. Best rates for SARAH under convexity. Lastly, we also perform an analysis of importance sampling variant of SARAH in the convex and strongly convex case (Appendix I). These are the currently fastest rates for SARAH. 2. Importance Sampling for Minibatches As mentioned in the introduction, we assume throughout that fi : Rd 7 R are smooth, but not necessarily convex. In particular, we assume that fi is Li smooth; that is, fi(x) fi(y) Li x y for all x, y Rd, where x := (P i x2 i )1/2 is the standard Euclidean norm. Let us define L := 1 n Pn i=1 Li. Without loss of generality assume that L1 L2 Ln. In this work, our aim is to find an ϵ accurate solution in expectation. A stochastic iterative algorithm for solving (1) Nonconvex Variance Reduced Optimization with Arbitrary Sampling Algorithm Uniform sampling Arbitrary sampling [NEW] S sampling [NEW] SVRG max n n, (1+4/3)Lmaxc1n2/3 max n n, (1+4α/3) Lc1n2/3 max n, (1+ 4(n b) 3n ) Lc1n2/3 SAGA n + 2Lmaxc2n2/3 ϵ n + (1+α) Lc2n2/3 ϵ n + (1+ n b n ) Lc2n2/3 n b n 1 L2 maxc3 ϵ2 n + α L2c3 Table 1: Stochastic gradient evaluation complexity for achieving E f(x) 2 ϵ for two variants of SVRG, SAGA and SARAH for minimizing the average of smooth non-convex functions. Constants: c1, c2, c3 are universal constant, Lmax = maxi Li; L = 1 i Li; b = (average) minibatch size (hidden in α); α can be for specific samplings smaller than 1 and decreasing with increasing b, which can lead to superlinear speedup in b. For SARAH this guarantee holds for one outer loop with minibatch size, where we assume 16 L2(f(x0) f(x )2)/(ϵb)2 0, in other words, minibatch size is not too big comparing to the required precision. is said to achieve ϵ accurate solution if the random output xa of this algorithm satisfies E f(xa) 2 ϵ. 2.1. Samplings Let S be a random set-valued mapping ( sampling ) with values in 2[n], where [n] := {1, 2, . . . , n}. A sampling1 is uniquely defined by assigning probabilities to all 2n subsets of [n]. With each sampling we associate a probability matrix P Rn n defined by Pij := Prob({i, j} S). The probability vector associated with S is the vector composed of the diagonal entries of P: p = (p1, . . . , pn) Rn, where pi := Prob(i S). We say that S is proper if pi > 0 for all i. It is easy to show that b := E [|S|] = Trace (P) = i=1 pi. (2) From now on, we will refer to b as the minibatch size of sampling S. It is known that P pp 0 (Richtárik and Takáˇc, 2016b). Let us without loss of generality assume that p1 p2 pn and define constant k = k(S) := |{i [n] : pi < 1}| = max{i : pi < 1} to be the number of pi s not equal to one. While our complexity results are general in the sense that 1Note that With-Replacement Sampling (WRS) does not arise as a special case of our definition of a sampling. Indeed, WRC allows a single element to be selected multiple times, which would not result in a subset of [n]. We analyzed our methods for WRS as well, using the tools we developed in this work. However, we found that WRS does not lead to any improvements in the rates, and hence decided to omit this sampling strategy and focus on the notion of arbitrary sampling, as defined here, for uniformity and simplicity of exposition. they hold for any proper sampling, we shall now consider three special samplings; all with expected minibatch size b (0, n]: 1. Standard uniform minibatch sampling (S = Su). S is chosen uniformly at random from all subsets of [n] of cardinality b. Clearly, |S| = b with probability 1. The probability matrix is given by ( b n i = j, b(b 1) n(n 1) i = j. (3) In the literature, this is known as the b nice sampling (Richtárik and Takáˇc, 2016b; Qu and Richtárik, 2016). 2. Independent sampling (S = S ). For each i [n] we independently flip a coin, and with probability pi > 0 include element i into S. Hence, by construction, pi = Prob(i S) and E [|S|] (2)= P i pi = b. The probability matrix of S is ( pi i = j, pipj i = j. 3. Approximate independent sampling (S = Sa). Independent sampling has the disadvantage that k = k(S) coin tosses need to be performed in order to generate the random set. However, we would like to sample at the cost O(b + k n) coin tosses instead. We now design a sampling which has this property and which in a certain precise sense, as we shall see later, approximates the independent sampling. In particular, given an independent sampling with parameters pi for i [n], let a = k maxi k pi . Since maxi k pi b+k n k , it follows that a b + k n. On the other hand, if maxi k pi = O((b + k n)/k), then Nonconvex Variance Reduced Optimization with Arbitrary Sampling a = O(b + k n). We now sample a single set S of cardinality a using the standard uniform minibatch sampling (just for i k). Subsequently, we apply an independent sampling to select elements of S , with selection probabilities p i = kpi/a. The resulting set is S. Since Prob(i S) = ( k 1 a 1) Prob({i, j} S) = ( k 2 a 2) for i, j k, the probability matrix of S is given by pi i = j, (a 1)k a(k 1)pipj i = j; i, j k, pipj otherwise. Since (a 1)k a(k 1) 1, the probability matrix of the approximate independent sampling approximates that of the independent sampling. Note that S includes both the standard uniform minibatch sampling and the independent sampling as special cases. Indeed, the former is obtained by choosing pi = b/n for all i (whence a = b and p i = 1 for all i), and the latter is obtained by choosing a = n instead of a = k maxi k pi . 2.2. Key lemma The following lemma, which we use as an upper bound for variance, plays a key role in our analysis. Lemma 1. Let ζ1, ζ2, . . . , ζn be vectors in Rd and let ζ := 1 n Pn i=1 ζi be their average. Let S be a proper sampling (i.e., assume that pi = Prob(i S) > 0 for all i). Assume that there is v Rn such that P pp Diag (p1v1, p2v2, . . . , pnvn) . (4) vi pi ζi 2, (5) where the expectation is taken over sampling S. Whenever (4) holds, it must be the case that vi 1 pi. (6) Moreover, (4) is always satisfied for vi = n(1 pi) for i k and 0 otherwise. Further, if |S| d with probability 1 for some d, then (4) holds for vi = d. The standard uniform minibatch sampling admits vi = n b n 1, the independent sampling admits vi = 1 pi, and the approximate independent sampling admits the choice vi = 1 pi(1 k a a(k 1)) if i k, vi = 0 otherwise. 2.3. Optimal sampling The following quantities play a key role in our general complexity results: vi L2 i pi , α := K Above, b = E [|S|] is the minibatch size, pi = Prob(i S), L := 1 i Li and {vi} are defined in (4) in Lemma 1. Our theory shows (see the 2nd column of Table 1 for a summary, and Section 3 for the full results) that in order to optimize the iteration complexity, we need to design sampling S for which the value α is as small as possible. The following result sheds light on how S should be chosen, from samplings of a given minibatch size b, to minimize α. Lemma 2. Fix a minibatch size b (0, n]. Then the quantity α, defined in (7), is minimized for the choice S = S with the probabilities ( (b + k n) Li Pk j=1 Lj, if i k 1, if i > k , (8) where k is the largest integer satisfying 0 < b + k n Pk i=1 Li/Lk (for instance, k = n b + 1 satisfies this). Usually, if Li s are not too much different, then k = O(n), for instance, if b Ln Pn i=1 Li then k = n. If we choose S = Sa, then α is minimized for (8) with b Pk i=1 Li 2 (b + k n)n2 bs where s = 1 for S and s = 1 k a a(k 1) for Sa. Moreover, if we assume2 b Ln Pn i=1 Li, then k = n, thus αS = 1 b Pn i=1 L2 i (Pn i=1 Li)2 n b αSu = (n b) n n 1 Pn i=1 L2 i (Pn i=1 Li)2 . For the special case of all L is are the same, one obtains n , αSu = n b From now on, let S , Sa denote Independent Sampling and Approximate Inpedendent Sampling, respectively, with the probabilities defined in (8). Lemma 2 guarantees that the sampling S is optimal (i.e., minimizes α). Moreover, if we let bmax := max {b | b Ln P i Li} , then we obtain superlinear speedup in b, up to bmax for all three algorithms. 2Note, that this can be always satisfied, if we uplift the smallest Li s, because if function is L-smooth, then it is also smooth with L L. Nonconvex Variance Reduced Optimization with Arbitrary Sampling Algorithm 1 SVRG with arb. sampling x0, m, T, η, S 1: x0 = x0 m = x0, M = T/m 2: for s = 0 to M 1 do 3: xs+1 0 = xs m 4: gs+1 = 1 n Pn i=1 fi( xs) 5: for t = 0 to m 1 do 6: Draw a random subset (minibatch) St S 7: vs+1 t = X fit(xs+1 t ) fit( xs) npit + gs+1 8: xs+1 t+1 = xs+1 t ηvs+1 t 9: end for 10: xs+1 = xs+1 m 11: end for 12: Output: Iterate xa chosen uniformly at random from {{xs+1 t }m t=0}M s=0. 3. SVRG, SAGA and SARAH In all of the results of this section we assume that S is an arbitrary proper sampling. Let b = E [|S|] be the (average) minibatch size. We assume that v satisfies (4) and that α (which depends on v) is defined as in (7). All complexity results will depend on α and b. We propose three methods, Algorithm 1, 2 and 3, which are generalizations of original SVRG (Reddi et al., 2016a), SAGA (Reddi et al., 2016b) and SARAH (Nguyen et al., 2017b) to the arbitrary sampling setting, respectively. The original non-minibatch methods arise as special cases for the sampling S = {i} with probability 1/n, and the original minibatch methods arise as a special case for the sampling Su (described in Section 2.1). Our general result for SVRG follows. Theorem 3 (Complexity of SVRG with arbitrary sampling). There exist universal constants µ2 > 0, 0 < ν2 < 1 such that the output of Alg. 1 with mini-batch size b αn2/3, step size η = µ2b/(α Ln2/3), and parameters β = L/n1/3, m = nα/(3bµ2) and T (multiple of m) satisfies: E f(xa) 2 α Ln2/3[f(x0) f(x )] Thus in terms of stochastic gradient evaluations to obtain ϵaccurate solution, one needs following number of iterations max n, µ2 Ln(2/3)(f(x0) f(x )) In the next theorem we provide a generalization of the results by Reddi et al. (2016b). Theorem 4 (Complexity of SAGA with arbitrary sampling). There exist universal constants µ3 > 0, 0 < ν3 < 1 such Algorithm 2 SAGA with arbitrary sampling x0, d, T, η, S 1: α0 i = x0 for i [n], 2: g0 = 1 n Pn i=1 fi(α0 i ) 3: for t = 0 to T 1 do 4: Draw a random subset (minibatch) St S 5: Pick random subset Jt [n] s.t. Prob(j Jt) = d fi(xt) fi(αt it) npit + gt 7: xt+1 = xt ηvt 8: αt+1 j = xt for j Jt and αt+1 j = αt j for j / Jt 9: gt+1 = gt 1 j Jt( fj(αt j) fj(αt+1 j )) 10: end for 11: Output: Iterate xa chosen uniformly at random from {xt}T t=0. Algorithm 3 SARAH with arb. sampling x0, m, T, η, S 1: x0 0 = x0 2: for s = 1 to M 1 do 3: v0 s = 1 n Pn i=1 fi(x0 s) 4: x1 = x0 ηv0 5: for t = 1 to m 1 do 6: Draw a random subset (minibatch) St S 7: vt = P i St 1 npi ( fi(xt) fi(xt 1)) + vt 1 8: xt+1 = xt ηvt 9: end for 10: x0 s+1 chosen uniformly at randomly from {xt s}m t=0 11: end for 12: Output: Iterate xa = x0 M that the output of Alg. 2 with mini-batch size b αn2/3, step size η = b/(µ3α L2n2/3), and parameter d = b/α satisfies: E f(xa) 2 α Ln2/3[f(x0) f(x )] Thus, in terms of stochastic gradient evaluations, to obtain ϵ accurate solution, one needs following number of iterations n + Ln(2/3)(f(x0) f(x )) ϵν3 (1 + α). We now introduce Algorithm 3: a general form of the SARAH algorithm (Nguyen et al., 2017b). Theorem 5 (Complexity of SARAH with arbitrary sampling). Consider one outer loop of Alg. 3 with b +1 . (10) Then the output xa satisfies: E f(xa) 2 2 η(m + 1)[f(x0 s) f(x )]. Nonconvex Variance Reduced Optimization with Arbitrary Sampling Algorithm 4 GD-Algorithm x0, T, A Input: x0 Rd, T, A for k = 0 to K do xk = Non-convex algorithm(xk 1, T, A) end for Output: x K Thus, to obtain ϵ accurate solution, one needs n + 16α L2(f(x0) f(x ))2 162α2 L4(f(x0) f(x ))4+16ϵ2 L2(f(x0) f(x ))2b2 stochastic gradient evaluations. If all Li s are the same and we choose S to be Sa, thus uniform with mini-batch size b, we can get back original result from (Nguyen et al., 2017b). Taking b = n, we can restore gradient descent with the correct step size 1/ L. 4. Additional Results In this section we describe three additional results: linear convergence for SVRG, SAGA and SARAH for gradient dominated functions, the first importance sampling results for SARAH for convex functions, and an array of new rates (with slight improvements) for non-minibatch versions of the above three methods for non-convex problems. 4.1. Gradient dominated functions Definition 1. We say that f is τ-gradient dominated if f(x) f(x ) τ f(x) 2, for all x Rd, where x is an optimal solution of (1). Gradient dominance is a weaker version of strong convexity due to the fact that if function is µ-strongly convex then it is τ-gradient dominated, where τ = 1/(2µ). Any of the non-convex methods in this paper can be used as a subroutine of Algorithm 4, where T is the number of steps of the subroutine and A is the set of optimal parameters for the subroutine. We set T = αn2/3/(bν2) for SVRG and T = αn2/3/(bν3) for SAGA. In the case of SARAH, T is obtained by solving m + 1 = 2/η in m and setting T m. Using Theorems 3, 4, 5 and the above special choice of T, we get E f(xk) 2 1 2τ E f(xk 1) f(x ) . Combined with Definition 1, this guarantees a linear convergence with the same constant terms consisting of α, L and b that we had before in our analysis. 4.2. Importance sampling for SARAH under convexity In addition to the results presented in previous sections, we also establish importance sampling results for SARAH in convex and strongly convex cases (Appendix I) with similar improvements as for the non-convex algorithm. Ours are the best current rates for SARAH in these settings. 4.3. Better rates for non-minibatch methods for non-convex problems Lastly, we also provide specialized non-minibatch versions of non-convex SAGA, SARAH and SVRG, which are special cases of their minibatch versions presented in the main part with slightly improved guarantees (see Theorems 17, 18, 19 and 20 in the Appendix). 5. Experiments In this section, we perform experiments with regression for binary classification, where our loss function has the form i=1 (1 yiσ(a i x))2, where σ(z) is the sigmoid function. Hence, f is smooth but non-convex. We chose this function because Li s can be easily computed (however, in many cases, even for much more complex problems, they can usually be estimated). We use four LIBSVM datasets3: covtype, ijcnn1, splice, australian. Parameters of each algorithm are chosen as suggested by the theorems in Section 3, and x0 = 0. For SARAH, we chose m = n/b . The y axis in all plots displays the norm of the gradient ( f(x) 2) or the function value f(x), and the x axis depicts either epochs (1 epoch = 1 pass over data) or iterations. 5.1. Importance vs uniform sampling Here we provide comparison of the methods with uniform (Su) and importance (S ) sampling. Looking at Figure 1, one can see that importance sampling outperforms uniform sampling for all three methods, in some cases, even by several orders of magnitude. For instance, in the left plot (minibatch size b = 2 and australian dataset) the improvement is as large as 4 orders of magnitude. Looking at Figure 2, one can see that there is an improvement not just in the norm of the gradient, but also in the function value. In the case of the australian dataset, the constants Li s are very non-uniform, and we can see that 3The LIBSVM dataset collection is available at https: //www.csie.ntu.edu.tw/~cjlin/libsvmtools/ datasets/ Nonconvex Variance Reduced Optimization with Arbitrary Sampling Figure 1: Comparison of all methods with uniform and importance sampling: gradient norm. Figure 2: Comparison of all methods with uniform and importance sampling for function values. the improvement is very significant. 5.2. Linear or superlinear speedup Our theory suggests that linear or even superlinear speedup (in minibatch size b) can be obtained using the optimal independent S . Our experiments show that this is indeed the case in practice as well, and for all three algorithms. Figure 3 confirms that linear, and sometimes even superlinear, speedup is present. For this dataset, such speedup is present up to the minibatch size of 250. The plots in the top row of Figure 3 depict convergence in a simulated multi-core setting, where the number of cores is the same as the minibatch size. 5.3. Independent vs approximate independent sampling According to our theory, Independent Sampling S is slightly better than Approximate Independent Sampling Sa. However, it is more expensive to use it in practice as generating samples from it involves more computational effort for large n. The goal of our next experiment is to show that in practice Sa yields comparable or even faster convergence. Hence, it is more reasonable to use this sampling for datasets where the number of data points n is big. For an efficient implementation of Sa, we can almost get rid of dependence on n. Intuitively, Sa works better because it has smaller variance in minibatch size than S . Indeed, it can be seen from Figure 4 that Sa can outperform S in practice even though S is optimal in theory. The difference can be small (the left and the middle plot), but also quite significant (right plot). Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. STOC 2017: Symposium on Theory of Computing, 19-23, 2016. Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In The 33th International Conference on Machine Learning, pages 699 707, 2016. Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In The 33rd International Conference on Machine Learning, pages 1110 1119, 2016. Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schöenlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28 (4):2783 2808, 2018. Nonconvex Variance Reduced Optimization with Arbitrary Sampling Figure 3: Minibatch speedup, ijcnn1 dataset Figure 4: Performance of sampling S vs. Sa, splice dataset. Dominik Csiba and Peter Richtárik. Primal method for ERM with flexible mini-batching schemes and non-convex losses. ar Xiv:1506.02227, 2015. Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27), 2018. Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27, 2014a. Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for Big Data problems. The 31st International Conference on Machine Learning, 2014b. Robert Mansel Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: squeezing more curvature out of data. In The 33rd International Conference on Machine Learning, pages 1869 1878, 2016. Robert Mansel Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. ar Xiv:1805.02632, 2018. Filip Hanzely and Petert Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315 323, 2013. Nonconvex Variance Reduced Optimization with Arbitrary Sampling Jakub Koneˇcný and Peter Richtárik. S2GD: Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, pages 1 14, 2017. Jakub Koneˇcný, Zheng Qu, and Peter Richtárik. S2CD: Semi-stochastic coordinate descent. Optimization Methods and Software, 32(5):993 1005, 2017. Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. ar Xiv:1712.09677, 2017. Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáˇc. Adding vs. averaging in distributed primal-dual optimization. In The 32nd International Conference on Machine Learning, pages 1973 1982, 2015. Chenxin Ma, Jakub Koneˇcný, Martin Jaggi, Virginia Smith, Michael I Jordan, Peter Richtárik, and Martin Takáˇc. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813 848, 2017. Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829 855, 2015. A Nemirovski, A Juditsky, G Lan, and A Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009. Arkadi Nemirovsky and David B. Yudin. Problem complexity and method efficiency in optimization. Wiley, New York, 1983. Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341 362, 2012. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáˇc. SARAH: A novel method for machine learning problems using stochastic recursive gradient. The 34th International Conference on Machine Learning, 2017a. Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáˇc. Stochastic recursive gradient algorithm for nonconvex optimization. ar Xiv:1705.07261, 2017b. Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829 857, 2016. Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optimization Methods and Software, 31(5):858 884, 2016. Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pages 865 873, 2015. Zheng Qu, Peter Richtárik, Martin Takáˇc, and Olivier Fercoq. SDNA: stochastic dual Newton ascent for empirical risk minimization. In The 33rd International Conference on Machine Learning, pages 1823 1832, 2016. Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In The 33th International Conference on Machine Learning, pages 314 323, 2016a. Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971 1977. IEEE, 2016b. Peter Richtárik and Martin Takáˇc. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233 1243, 2016a. Peter Richtárik and Martin Takáˇc. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433 484, 2016b. Peter Richtárik and Martin Takáˇc. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1 38, 2014. Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663 2671, 2012. Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567 599, 2013. Martin Takáˇc, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, 2013. Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In The 32rd International Conference on Machine Learning, pages 1 9, 2015.