# Stochastic Proximal Gradient Descent with Acceleration Techniques

**Atsushi Nitanda**
NTT DATA Mathematical Systems Inc.
1F Shinanomachi Rengakan, 35, Shinanomachi, Shinjuku-ku, Tokyo, 160-0016, Japan
nitanda@msi.co.jp

## Abstract

Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. The method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction technique for the stochastic gradient. Accelerated proximal gradient descent (APG) and the proximal stochastic variance reduced gradient method (Prox-SVRG) are in a trade-off relationship. We show that our method, with an appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.

## 1 Introduction

This paper considers the following optimization problem:

$$\min_{x \in \mathbb{R}^d} \; f(x) \stackrel{\text{def}}{=} g(x) + h(x), \tag{1}$$

where $g$ is the average of the smooth convex functions $g_1, \dots, g_n$ from $\mathbb{R}^d$ to $\mathbb{R}$, i.e., $g(x) = \frac{1}{n}\sum_{i=1}^n g_i(x)$, and $h : \mathbb{R}^d \to \mathbb{R}$ is a relatively simple convex function that can be non-differentiable. In machine learning, we often encounter optimization problems of this form. For example, given a sequence of training examples $(a_1, b_1), \dots, (a_n, b_n)$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$, if we set $g_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$, then we obtain ridge regression by setting $h(x) = \frac{\lambda}{2}\|x\|^2$, or we obtain Lasso by setting $h(x) = \lambda \|x\|_1$. If we set $g_i(x) = \log(1 + \exp(-b_i x^T a_i))$, then we obtain regularized logistic regression.

One popular method for solving the optimization problem (1) is proximal gradient descent (PGD), described by the following update rule for $k = 1, 2, \dots$:

$$x_{k+1} = \mathrm{prox}_{\eta_k h}\left(x_k - \eta_k \nabla g(x_k)\right),$$

where $\mathrm{prox}$ is the proximity operator,

$$\mathrm{prox}_{\eta h}(y) = \operatorname*{arg\,min}_{x \in \mathbb{R}^d} \left\{ \frac{1}{2}\|x - y\|^2 + \eta h(x) \right\}.$$

A stochastic variant of PGD is stochastic proximal gradient descent (SPGD): at each iteration $k = 1, 2, \dots$, we pick $i_k$ uniformly at random from $\{1, 2, \dots, n\}$ and take the update

$$x_{k+1} = \mathrm{prox}_{\eta_k h}\left(x_k - \eta_k \nabla g_{i_k}(x_k)\right).$$

The advantage of SPGD over PGD is that each iteration of SPGD requires the computation of only a single gradient $\nabla g_{i_k}(x_k)$, whereas each iteration of PGD evaluates all $n$ component gradients. The per-iteration cost of SPGD is thus $1/n$ that of PGD. However, due to the variance introduced by random sampling, SPGD attains a slower convergence rate than PGD.

In this paper we consider problem (1) under the following assumptions.

**Assumption 1.** Each convex function $g_i(x)$ is $L$-Lipschitz smooth, i.e., there exists $L > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$\|\nabla g_i(x) - \nabla g_i(y)\| \le L \|x - y\|. \tag{2}$$

From (2), one can derive the following inequality:

$$g_i(x) \le g_i(y) + \langle \nabla g_i(y), x - y \rangle + \frac{L}{2}\|x - y\|^2. \tag{3}$$

**Assumption 2.** $g(x)$ is $\mu$-strongly convex, i.e., there exists $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$g(x) \ge g(y) + \langle \nabla g(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2. \tag{4}$$

Note that it is obvious that $L \ge \mu$.

**Assumption 3.** The regularization function $h(x)$ is a lower semi-continuous proper convex function; however, it can be non-differentiable or non-continuous.

Under Assumptions 1 and 2, if $h(x) \equiv 0$, then PGD (which is equivalent to gradient descent in this case) with the constant learning rate $\eta_k = \frac{1}{L}$ achieves a linear convergence rate.
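To make the proximity operator concrete, here is a minimal sketch (our own illustration, not code from the paper) of $\mathrm{prox}_{\eta h}$ for the two regularizers above: for $h(x) = \lambda\|x\|_1$ the operator is elementwise soft-thresholding, and for $h(x) = \frac{\lambda}{2}\|x\|^2$ it is a closed-form shrinkage.

```python
import numpy as np

def prox_l1(y, eta, lam):
    # prox of h(x) = lam * ||x||_1: soft-threshold each coordinate at eta * lam.
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

def prox_l2(y, eta, lam):
    # prox of h(x) = (lam / 2) * ||x||^2: closed-form shrinkage toward zero.
    return y / (1.0 + eta * lam)

def pgd_step(x, eta, grad_g, prox):
    # One PGD update: x <- prox_{eta h}(x - eta * grad g(x)).
    return prox(x - eta * grad_g(x), eta)
```

For example, with the Lasso objective one would call `pgd_step(x, 1.0 / L, grad_g, lambda v, e: prox_l1(v, e, lam))`, where `grad_g` evaluates the full gradient of $g$.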
On the other hand, for stochastic (proximal) gradient descent, because of the variance introduced by random sampling, we need to choose a diminishing learning rate $\eta_k = O(1/k)$, and thus stochastic (proximal) gradient descent converges only at a sub-linear rate. To improve it, we need a variance reduction technique that allows us to take a larger learning rate.

Recently, several papers have proposed such variance reduction methods for various special cases of (1). For the case where $g_i(x)$ is Lipschitz smooth and $h(x)$ is strongly convex, Shalev-Shwartz and Zhang [1, 2] proposed proximal stochastic dual coordinate ascent (Prox-SDCA); the same authors developed accelerated variants of SDCA [3, 4]. Le Roux et al. [5] proposed the stochastic average gradient (SAG) for the case where $g_i(x)$ is Lipschitz smooth, $g(x)$ is strongly convex, and $h(x) \equiv 0$. These methods achieve a linear convergence rate. However, SDCA and SAG need to store all gradients (or dual variables), so that $O(nd)$ storage is required for general problems. Although this can be reduced to $O(n)$ for linear prediction problems, these methods may be unsuitable for more complex and large-scale problems. More recently, Johnson and Zhang [6] proposed the stochastic variance reduced gradient (SVRG) for the case where $g_i(x)$ is $L$-Lipschitz smooth, $g(x)$ is $\mu$-strongly convex, and $h(x) \equiv 0$. SVRG achieves the following overall complexity (total number of component gradient evaluations needed to find an $\epsilon$-accurate solution):

$$O\left((n + \kappa)\log\frac{1}{\epsilon}\right), \tag{5}$$

where $\kappa$ is the condition number $L/\mu$. Furthermore, this method need not store all gradients. Xiao and Zhang [7] proposed a proximal variant of SVRG, called Prox-SVRG, which achieves the same complexity.

Another effective method for solving (1) is accelerated proximal gradient descent (APG), proposed by Nesterov [8, 9]. APG [8] is an accelerated variant of deterministic gradient descent and achieves the following overall complexity for finding an $\epsilon$-accurate solution:

$$O\left(n\sqrt{\kappa}\log\frac{1}{\epsilon}\right). \tag{6}$$

Complexities (5) and (6) are in a trade-off relationship. For example, if $\kappa = n$, then complexity (5) is $O(n\log\frac{1}{\epsilon})$, which is less than the $O(n^{3/2}\log\frac{1}{\epsilon})$ of (6). On the other hand, the complexity of APG has a better dependence on the condition number $\kappa$.

In this paper, we propose and analyze a new method called the Accelerated Mini-Batch Prox-SVRG (Acc-Prox-SVRG) for solving (1). Acc-Prox-SVRG incorporates two acceleration techniques in the mini-batch setting: (1) Nesterov's acceleration method of APG and (2) the variance reduction technique of SVRG. We show that the overall complexity of this method, with an appropriate mini-batch size, is lower than that of both Prox-SVRG and APG; even when the mini-batch size is not appropriate, our method is still comparable to APG or Prox-SVRG.

## 2 Accelerated Mini-Batch Prox-SVRG

As mentioned above, to ensure convergence of SPGD, the learning rate $\eta_k$ has to decay to zero to counteract the variance of the stochastic gradient, which slows down the convergence. As a remedy, we use the variance reduction technique of SVRG [6] (see also [7]), which allows us to take a larger learning rate. Acc-Prox-SVRG is a multi-stage scheme. During each stage, the method performs $m$ APG-like iterations and employs the following mini-batch direction instead of the gradient:

$$v_k = \nabla g_{I_k}(y_k) - \nabla g_{I_k}(\bar{x}) + \nabla g(\bar{x}), \tag{7}$$

where $I_k = \{i_1, \dots, i_b\}$ is a randomly chosen size-$b$ subset of $\{1, 2, \dots, n\}$ and $\nabla g_{I_k} = \frac{1}{b}\sum_{j=1}^b \nabla g_{i_j}$. At the beginning of each stage, the initial point $x_1$ is set to $\bar{x}$, and at the end of the stage, $\bar{x}$ is updated.
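As a concrete rendering of this scheme (the paper's pseudocode appears in Figure 1 below), here is a minimal numpy sketch; the interfaces `grad_i(i, x)` for $\nabla g_i(x)$ and `prox(v, eta)` for $\mathrm{prox}_{\eta h}(v)$, and the use of a constant momentum coefficient `beta`, are our own assumptions.

```python
import numpy as np

def acc_prox_svrg(x0, grad_i, prox, n, eta, m, b, beta, num_stages, seed=0):
    # Sketch of Acc-Prox-SVRG: num_stages outer stages, m inner iterations each.
    rng = np.random.default_rng(seed)
    x_bar = x0
    for _ in range(num_stages):
        # Snapshot full gradient at x_bar (n component-gradient evaluations).
        v_bar = sum(grad_i(i, x_bar) for i in range(n)) / n
        x = y = x_bar
        for _ in range(m):
            I = rng.choice(n, size=b, replace=False)  # mini-batch I_k
            # Variance-reduced direction (7); unbiased estimate of grad g(y).
            v = (sum(grad_i(i, y) for i in I) / b
                 - sum(grad_i(i, x_bar) for i in I) / b
                 + v_bar)
            x_prev, x = x, prox(y - eta * v, eta)  # proximal step
            y = x + beta * (x - x_prev)            # Nesterov-style momentum
        x_bar = x  # stage update: x_bar <- x_{m+1}
    return x_bar
```

Each inner iteration touches $2b$ component gradients, plus $n$ per stage for the snapshot; this is the accounting used in the experiments of Section 4.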
Conditioned on $y_k$, we can take the expectation with respect to $I_k$ and obtain $\mathbb{E}_{I_k}[v_k] = \nabla g(y_k)$, so that $v_k$ is an unbiased estimator. As described in the next section, the conditional variance $\mathbb{E}_{I_k}\|v_k - \nabla g(y_k)\|^2$ can be much smaller than $\mathbb{E}_i\|\nabla g_i(y_k) - \nabla g(y_k)\|^2$ near the optimal solution. The pseudocode of our Acc-Prox-SVRG is given in Figure 1.

    Parameters: update frequency m, learning rate η, mini-batch size b,
                and non-negative sequence β_1, ..., β_m
    Initialize x̃_1
    Iterate: for s = 1, 2, ...
        x̄ = x̃_s
        v̄ = (1/n) Σ_{i=1}^n ∇g_i(x̄)
        x_1 = y_1 = x̄
        Iterate: for k = 1, 2, ..., m
            Randomly pick a subset I_k ⊂ {1, 2, ..., n} of size b
            v_k = ∇g_{I_k}(y_k) − ∇g_{I_k}(x̄) + v̄
            x_{k+1} = prox_{ηh}(y_k − η v_k)
            y_{k+1} = x_{k+1} + β_k (x_{k+1} − x_k)
        end
        set x̃_{s+1} = x_{m+1}
    end

Figure 1: Acc-Prox-SVRG

In our analysis, we focus on a basic variant of the algorithm (Figure 1) with the constant momentum coefficient

$$\beta_k = \frac{1 - \sqrt{\mu\eta}}{1 + \sqrt{\mu\eta}}.$$

## 3 Convergence Analysis

In this section, we present our analysis of the convergence rate of Acc-Prox-SVRG (Figure 1) under Assumptions 1, 2, and 3, and provide some notation and definitions. Note that we may omit the outer index $s$ for notational simplicity. By the definition of the proximity operator, there exists a subgradient $\xi_k \in \partial h(x_{k+1})$ such that

$$x_{k+1} = y_k - \eta(v_k + \xi_k).$$

We define the estimate sequence $\Phi_k(x)$ $(k = 1, 2, \dots, m+1)$ by

$$\Phi_1(x) = f(x_1) + \frac{\mu}{2}\|x - x_1\|^2$$

and, for $k \ge 1$,

$$\Phi_{k+1}(x) = (1 - \sqrt{\mu\eta})\,\Phi_k(x) + \sqrt{\mu\eta}\left(g_{I_k}(y_k) + \langle v_k, x - y_k\rangle + \frac{\mu}{2}\|x - y_k\|^2 + h(x_{k+1}) + \langle \xi_k, x - x_{k+1}\rangle\right).$$

Let $\Phi_k^* = \min_{x \in \mathbb{R}^d} \Phi_k(x)$ and $z_k = \operatorname*{arg\,min}_{x \in \mathbb{R}^d} \Phi_k(x)$. Since $\nabla^2 \Phi_k(x) = \mu I$, it follows that for $x \in \mathbb{R}^d$,

$$\Phi_k(x) = \frac{\mu}{2}\|x - z_k\|^2 + \Phi_k^*. \tag{8}$$

The following lemma is the key to the analysis of our method.

**Lemma 1.** Consider Acc-Prox-SVRG in Figure 1 under Assumptions 1, 2, and 3. If $\eta \le \frac{1}{2L}$, then for $k \ge 1$ we have

$$\mathbb{E}\left[\Phi_k(x)\right] \le f(x) + (1 - \sqrt{\mu\eta})^{k-1}(\Phi_1 - f)(x) \tag{9}$$

and

$$\mathbb{E}\left[f(x_k)\right] \le \mathbb{E}\left[\Phi_k^* + \sum_{l=1}^{k-1}(1 - \sqrt{\mu\eta})^{k-1-l}\left(-\frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}}\|x_l - y_l\|^2 + \eta\|\nabla g(y_l) - v_l\|^2\right)\right], \tag{10}$$

where the expectation is taken with respect to the history of random variables $I_1, \dots, I_{k-1}$.

Note that if the conditional variance of $v_l$ is equal to zero, we immediately obtain a linear convergence rate from (9) and (10). Before we can prove Lemma 1, additional lemmas are required, whose proofs may be found in the Supplementary Material.

**Lemma 2.** If $\eta < \frac{1}{\mu}$, then for $k \ge 1$ we have

$$z_{k+1} = (1 - \sqrt{\mu\eta})\,z_k + \sqrt{\mu\eta}\,y_k - \sqrt{\tfrac{\eta}{\mu}}\,(v_k + \xi_k) \tag{11}$$

and

$$z_k - y_k = \frac{1}{\sqrt{\mu\eta}}(y_k - x_k). \tag{12}$$

**Lemma 3.** For $k \ge 1$, we have

$$\langle \nabla g(y_k) + \xi_k,\, v_k + \xi_k\rangle = \frac{1}{2}\left(\|\nabla g(y_k) + \xi_k\|^2 + \|v_k + \xi_k\|^2 - \|\nabla g(y_k) - v_k\|^2\right), \tag{13}$$

$$\|v_k + \xi_k\|^2 \le 2\left(\|\nabla g(y_k) + \xi_k\|^2 + \|\nabla g(y_k) - v_k\|^2\right), \tag{14}$$

$$\|\nabla g(y_k) + \xi_k\|^2 \le 2\left(\|v_k + \xi_k\|^2 + \|\nabla g(y_k) - v_k\|^2\right). \tag{15}$$

*Proof of Lemma 1.* Using induction, it is easy to show (9); the proof is in the Supplementary Material. We now prove (10) by induction. From the definition of $\Phi_1$, $\Phi_1^* = f(x_1)$, so (10) holds for $k = 1$. We assume (10) is true for $k$. Using Eq. (11), we have

$$\|y_k - z_{k+1}\|^2 = \left\|(1 - \sqrt{\mu\eta})(y_k - z_k) + \sqrt{\tfrac{\eta}{\mu}}(v_k + \xi_k)\right\|^2 = (1 - \sqrt{\mu\eta})^2\|y_k - z_k\|^2 + 2\sqrt{\tfrac{\eta}{\mu}}(1 - \sqrt{\mu\eta})\langle y_k - z_k, v_k + \xi_k\rangle + \tfrac{\eta}{\mu}\|v_k + \xi_k\|^2.$$

From the above equation and (8) with $x = y_k$, we get

$$\Phi_{k+1}(y_k) = \Phi_{k+1}^* + \frac{\mu}{2}\left((1 - \sqrt{\mu\eta})^2\|y_k - z_k\|^2 + 2\sqrt{\tfrac{\eta}{\mu}}(1 - \sqrt{\mu\eta})\langle y_k - z_k, v_k + \xi_k\rangle + \tfrac{\eta}{\mu}\|v_k + \xi_k\|^2\right).$$

On the other hand, from the definition of the estimate sequence and (8),

$$\Phi_{k+1}(y_k) = (1 - \sqrt{\mu\eta})\left(\Phi_k^* + \frac{\mu}{2}\|y_k - z_k\|^2\right) + \sqrt{\mu\eta}\left(g_{I_k}(y_k) + h(x_{k+1}) + \langle \xi_k, y_k - x_{k+1}\rangle\right).$$

Therefore, from these two equations, we have

$$\Phi_{k+1}^* = (1 - \sqrt{\mu\eta})\Phi_k^* + \frac{\mu}{2}(1 - \sqrt{\mu\eta})\sqrt{\mu\eta}\,\|y_k - z_k\|^2 + \sqrt{\mu\eta}\left(g_{I_k}(y_k) + h(x_{k+1}) + \langle \xi_k, y_k - x_{k+1}\rangle\right) - (1 - \sqrt{\mu\eta})\sqrt{\mu\eta}\,\langle y_k - z_k, v_k + \xi_k\rangle - \frac{\eta}{2}\|v_k + \xi_k\|^2. \tag{16}$$

Since $g$ is Lipschitz smooth, we can bound $f(x_{k+1})$ as follows:

$$f(x_{k+1}) \le g(y_k) + \langle \nabla g(y_k), x_{k+1} - y_k\rangle + \frac{L}{2}\|x_{k+1} - y_k\|^2 + h(x_{k+1}). \tag{17}$$
Using (16), (17), (12), and $x_{k+1} - y_k = -\eta(v_k + \xi_k)$, we have

$$
\begin{aligned}
\mathbb{E}_{I_k}\left[f(x_{k+1}) - \Phi_{k+1}^*\right]
&\overset{(16),(17)}{\le} \mathbb{E}_{I_k}\Big[(1 - \sqrt{\mu\eta})\left(-\Phi_k^* + g(y_k) + h(x_{k+1})\right) + \langle \nabla g(y_k), x_{k+1} - y_k\rangle + \sqrt{\mu\eta}\,\langle \xi_k, x_{k+1} - y_k\rangle \\
&\qquad + \frac{L}{2}\|x_{k+1} - y_k\|^2 - \frac{\mu}{2}(1 - \sqrt{\mu\eta})\sqrt{\mu\eta}\,\|y_k - z_k\|^2 + (1 - \sqrt{\mu\eta})\sqrt{\mu\eta}\,\langle y_k - z_k, v_k + \xi_k\rangle + \frac{\eta}{2}\|v_k + \xi_k\|^2\Big] \quad (18)\\
&\overset{(12)}{=} \mathbb{E}_{I_k}\Big[(1 - \sqrt{\mu\eta})\left(-\Phi_k^* + g(y_k) + h(x_{k+1}) + \langle x_k - y_k, v_k + \xi_k\rangle\right) - \eta\langle \nabla g(y_k), v_k + \xi_k\rangle \\
&\qquad - \eta\sqrt{\mu\eta}\,\langle \xi_k, v_k + \xi_k\rangle - \frac{\mu}{2}\cdot\frac{1 - \sqrt{\mu\eta}}{\sqrt{\mu\eta}}\|y_k - x_k\|^2 + \frac{\eta}{2}(L\eta + 1)\|v_k + \xi_k\|^2\Big], \quad (19)
\end{aligned}
$$

where for the first inequality we used $\mathbb{E}_{I_k}[g_{I_k}(y_k)] = g(y_k)$. Next, we have

$$
\begin{aligned}
\mathbb{E}_{I_k}\left[g(y_k) + h(x_{k+1}) + \langle x_k - y_k, v_k + \xi_k\rangle\right]
&= \mathbb{E}_{I_k}\left[g(y_k) + \langle v_k, x_k - y_k\rangle + h(x_{k+1}) + \langle \xi_k, x_k - x_{k+1}\rangle + \langle \xi_k, x_{k+1} - y_k\rangle\right] \\
&\le \mathbb{E}_{I_k}\left[g(x_k) - \frac{\mu}{2}\|x_k - y_k\|^2 + h(x_k) - \eta\langle \xi_k, v_k + \xi_k\rangle\right], \quad (20)
\end{aligned}
$$

where the inequality uses $\mathbb{E}_{I_k}[v_k] = \nabla g(y_k)$ together with the strong convexity of $g$ and the convexity of $h$. Thus we have

$$
\begin{aligned}
\mathbb{E}_{I_k}\left[f(x_{k+1}) - \Phi_{k+1}^*\right]
&\overset{(19),(20)}{\le} \mathbb{E}_{I_k}\Big[(1 - \sqrt{\mu\eta})\left(f(x_k) - \Phi_k^*\right) - \frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}}\|x_k - y_k\|^2 - \eta\langle \nabla g(y_k) + \xi_k, v_k + \xi_k\rangle + \frac{\eta}{2}(1 + L\eta)\|v_k + \xi_k\|^2\Big] \\
&\overset{(13)}{=} \mathbb{E}_{I_k}\Big[(1 - \sqrt{\mu\eta})\left(f(x_k) - \Phi_k^*\right) - \frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}}\|x_k - y_k\|^2 - \frac{\eta}{2}\|\nabla g(y_k) + \xi_k\|^2 + \frac{L\eta^2}{2}\|v_k + \xi_k\|^2 + \frac{\eta}{2}\|v_k - \nabla g(y_k)\|^2\Big] \\
&\overset{(14)}{\le} \mathbb{E}_{I_k}\Big[(1 - \sqrt{\mu\eta})\left(f(x_k) - \Phi_k^*\right) - \frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}}\|x_k - y_k\|^2 + \eta\|v_k - \nabla g(y_k)\|^2\Big],
\end{aligned}
$$

where the last inequality also uses $\eta \le \frac{1}{2L}$. Taking the expectation with respect to the history of random variables $I_1, \dots, I_{k-1}$ and applying the induction hypothesis finishes the proof of (10). $\square$

Our bound on the variance of $v_k$ is given in the following lemma, whose proof is in the Supplementary Material.

**Lemma 4.** Suppose Assumption 1 holds, and let $x^* = \operatorname*{arg\,min}_{x \in \mathbb{R}^d} f(x)$. Conditioned on $y_k$, we have

$$\mathbb{E}_{I_k}\|v_k - \nabla g(y_k)\|^2 \le \frac{1}{b}\cdot\frac{n - b}{n - 1}\left(2L^2\|y_k - x_k\|^2 + 8L\left(f(x_k) - f(x^*) + f(\bar{x}) - f(x^*)\right)\right). \tag{21}$$

From (10), (21), and (9) with $x = x^*$, it follows that

$$\mathbb{E}\left[f(x_k) - f(x^*)\right] \le (1 - \sqrt{\mu\eta})^{k-1}(\Phi_1 - f)(x^*) + \mathbb{E}\left[\sum_{l=1}^{k-1}(1 - \sqrt{\mu\eta})^{k-1-l}\left\{\left(-\frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}} + \frac{2\eta L^2}{b}\cdot\frac{n - b}{n - 1}\right)\|x_l - y_l\|^2 + \frac{8\eta L}{b}\cdot\frac{n - b}{n - 1}\left(f(x_l) - f(x^*) + f(\bar{x}) - f(x^*)\right)\right\}\right].$$

If $\eta \le \min\left\{\frac{(pb)^2}{64}\left(\frac{n-1}{n-b}\right)^2\frac{\mu}{L^2},\ \frac{1}{2L}\right\}$, then the coefficients of $\|x_l - y_l\|^2$ are non-positive for $0 < p \le 2$. Indeed, this choice of $\eta$ gives $\frac{2\eta L^2}{b}\cdot\frac{n-b}{n-1} \le \frac{p}{4}L\sqrt{\mu\eta}$, so that for $0 < p \le 2$,

$$-\frac{\mu}{2}\cdot\frac{1 - \mu\eta}{\sqrt{\mu\eta}} + \frac{2\eta L^2}{b}\cdot\frac{n - b}{n - 1} \le \frac{1}{2\sqrt{\mu\eta}}\left(-\mu + \mu^2\eta + \mu L\eta\right) \overset{\mu \le L}{\le} \frac{1}{2\sqrt{\mu\eta}}\left(-\mu + 2\mu L\eta\right) \le 0, \tag{22}$$

where the last inequality holds because $\eta \le \frac{1}{2L}$. Thus, dropping the non-positive $\|x_l - y_l\|^2$ terms and using $\frac{8\eta L}{b}\cdot\frac{n-b}{n-1} \le p\sqrt{\mu\eta}$ (obtained from the choice of $\eta$ as in (22)) with $p \le 1$, we have

$$
\begin{aligned}
\mathbb{E}\left[f(x_k) - f(x^*)\right]
&\le (1 - \sqrt{\mu\eta})^{k-1}(\Phi_1 - f)(x^*) + \mathbb{E}\left[\sum_{l=1}^{k-1}(1 - \sqrt{\mu\eta})^{k-1-l}\,p\sqrt{\mu\eta}\left(f(x_l) - f(x^*) + f(\bar{x}) - f(x^*)\right)\right] \\
&\le (1 - \sqrt{\mu\eta})^{k-1}(\Phi_1 - f)(x^*) + p\left(f(\bar{x}) - f(x^*)\right) + \mathbb{E}\left[\sum_{l=1}^{k-1}(1 - \sqrt{\mu\eta})^{k-1-l}\,p\sqrt{\mu\eta}\left(f(x_l) - f(x^*)\right)\right], \quad (23)
\end{aligned}
$$

where for the last inequality we used $\sum_{l=1}^{k-1}(1 - \sqrt{\mu\eta})^{k-1-l} \le \sum_{t=0}^{\infty}(1 - \sqrt{\mu\eta})^t = \frac{1}{\sqrt{\mu\eta}}$.

**Theorem 1.** Suppose Assumptions 1, 2, and 3 hold. Let $\eta \le \min\left\{\frac{(pb)^2}{64}\left(\frac{n-1}{n-b}\right)^2\frac{\mu}{L^2},\ \frac{1}{2L}\right\}$ with $0 < p < 1$. Then we have

$$\mathbb{E}\left[f(\tilde{x}_{s+1}) - f(x^*)\right] \le \left(\left(1 - (1 - p)\sqrt{\mu\eta}\right)^m + \frac{p}{1 - p}\right)(2 + p)\left(f(\tilde{x}_s) - f(x^*)\right). \tag{24}$$

Moreover, if $m \ge \frac{1}{(1 - p)\sqrt{\mu\eta}}\log\frac{1 - p}{p}$, then it follows that

$$\mathbb{E}\left[f(\tilde{x}_{s+1}) - f(x^*)\right] \le \frac{2p(2 + p)}{1 - p}\left(f(\tilde{x}_s) - f(x^*)\right). \tag{25}$$

From Theorem 1, we can see that for sufficiently small $p > 0$ (so that $\frac{2p(2+p)}{1-p} < 1$ and each stage contracts the objective gap by a constant factor), the overall complexity of Acc-Prox-SVRG (total number of component gradient evaluations to find an $\epsilon$-accurate solution) is $O\left(\left(n + \frac{b}{\sqrt{\mu\eta}}\right)\log\frac{1}{\epsilon}\right)$, since each stage costs $n + 2bm$ evaluations and $O(\log\frac{1}{\epsilon})$ stages suffice. Thus, we have the following corollary.

**Corollary 1.** Suppose Assumptions 1, 2, and 3 hold, let $p$ be sufficiently small as stated above, and set $\eta = \min\left\{\frac{(pb)^2}{64}\left(\frac{n-1}{n-b}\right)^2\frac{\mu}{L^2},\ \frac{1}{2L}\right\}$. If the mini-batch size $b$ is smaller than $\left\lceil \frac{8\sqrt{\kappa}\,n}{\sqrt{2}\,p(n-1) + 8\sqrt{\kappa}} \right\rceil$, then the learning rate $\eta$ equals $\frac{(pb)^2}{64}\left(\frac{n-1}{n-b}\right)^2\frac{\mu}{L^2}$ and the overall complexity is

$$O\left(\left(n + \frac{n - b}{n - 1}\,\kappa\right)\log\frac{1}{\epsilon}\right).$$

Otherwise, $\eta = \frac{1}{2L}$ and the complexity becomes

$$O\left(\left(n + b\sqrt{\kappa}\right)\log\frac{1}{\epsilon}\right).$$

Table 1: Comparison of overall complexity, where $b_0 = \frac{8\sqrt{\kappa}\,n}{\sqrt{2}\,p(n-1) + 8\sqrt{\kappa}}$.

| Prox-SVRG | Acc-Prox-SVRG ($b < b_0$) | APG [8] | Acc-Prox-SVRG ($b \ge b_0$) |
|---|---|---|---|
| $O\left((n + \kappa)\log\frac{1}{\epsilon}\right)$ | $O\left(\left(n + \frac{n-b}{n-1}\kappa\right)\log\frac{1}{\epsilon}\right)$ | $O\left(n\sqrt{\kappa}\log\frac{1}{\epsilon}\right)$ | $O\left((n + b\sqrt{\kappa})\log\frac{1}{\epsilon}\right)$ |

Table 1 lists the overall complexities of the algorithms that achieve linear convergence. As seen from Table 1, the complexity of Acc-Prox-SVRG monotonically decreases with respect to $b$ for $b < b_0$ and monotonically increases for $b \ge b_0$ (a small numerical sketch of this trade-off follows).
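The following sketch makes the parameter recipe of Corollary 1 concrete; the helper name is ours, and the constants come from the reconstructed bounds above, so treat it as an illustration rather than the paper's reference implementation.

```python
import math

def acc_prox_svrg_params(n, L, mu, b, p=0.1):
    # Pick (eta, m, beta, b0) following the recipe of Theorem 1 / Corollary 1.
    kappa = L / mu
    b0 = 8 * math.sqrt(kappa) * n / (math.sqrt(2) * p * (n - 1) + 8 * math.sqrt(kappa))
    eta = min((p * b) ** 2 / 64 * ((n - 1) / (n - b)) ** 2 * mu / L ** 2,
              1 / (2 * L))
    rho = math.sqrt(mu * eta)
    m = math.ceil(math.log((1 - p) / p) / ((1 - p) * rho))  # inner iterations per stage
    beta = (1 - rho) / (1 + rho)                            # constant momentum coefficient
    return eta, m, beta, b0

# Per-stage cost is n + 2*b*m component-gradient evaluations, so the overall
# complexity scales like (n + b / sqrt(mu * eta)) * log(1 / eps), matching Table 1.
eta, m, beta, b0 = acc_prox_svrg_params(n=60000, L=1.0, mu=1e-4, b=500)
print(f"eta={eta:.3g}, m={m}, beta={beta:.4f}, b0={b0:.1f}")
```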
Moreover, if $b = 1$, then Acc-Prox-SVRG has the same complexity as Prox-SVRG, while if $b = n$, the complexity of this method is equal to that of APG. Therefore, with an appropriate mini-batch size, Acc-Prox-SVRG can outperform both Prox-SVRG and APG; even if the mini-batch size is not appropriate, Acc-Prox-SVRG is still comparable to Prox-SVRG or APG. The best possible overall complexity of Acc-Prox-SVRG, attained at $b = b_0$, is

$$O\left(\left(n + \frac{n\kappa}{n + \sqrt{\kappa}}\right)\log\frac{1}{\epsilon}\right).$$

Now we give the proof of Theorem 1.

*Proof of Theorem 1.* We denote $\mathbb{E}[f(x_k) - f(x^*)]$ by $V_k$, and we use $W_k$ to denote the last expression in (23), so that $V_k \le W_k$ for $k \ge 1$. For $k \ge 2$, we have

$$
\begin{aligned}
W_k &= (1 - \sqrt{\mu\eta})\left((1 - \sqrt{\mu\eta})^{k-2}(\Phi_1 - f)(x^*) + p\,V_1 + \sum_{l=1}^{k-2}(1 - \sqrt{\mu\eta})^{k-2-l}\,p\sqrt{\mu\eta}\,V_l\right) + p\sqrt{\mu\eta}\,V_{k-1} + p\sqrt{\mu\eta}\,V_1 \\
&\le \left(1 - (1 - p)\sqrt{\mu\eta}\right)W_{k-1} + p\sqrt{\mu\eta}\,W_1.
\end{aligned}
$$

Since $0 < (1 - p)\sqrt{\mu\eta} < 1$, the above inequality leads to

$$W_k \le \left(\left(1 - (1 - p)\sqrt{\mu\eta}\right)^{k-1} + \frac{p}{1 - p}\right)W_1.$$

From the strong convexity of $g$ (and hence of $f$), we can see that

$$W_1 = (1 + p)\left(f(\bar{x}) - f(x^*)\right) + \frac{\mu}{2}\|\bar{x} - x^*\|^2 \le (2 + p)\left(f(\bar{x}) - f(x^*)\right).$$

Thus, for $k \ge 2$, we have

$$V_k \le W_k \le \left(\left(1 - (1 - p)\sqrt{\mu\eta}\right)^{k-1} + \frac{p}{1 - p}\right)(2 + p)\left(f(\bar{x}) - f(x^*)\right),$$

which for $k = m + 1$ is exactly (24). Using $\log(1 - \alpha) \le -\alpha$ and $m \ge \frac{1}{(1 - p)\sqrt{\mu\eta}}\log\frac{1 - p}{p}$, we have

$$\log\left(1 - (1 - p)\sqrt{\mu\eta}\right)^m \le -m(1 - p)\sqrt{\mu\eta} \le \log\frac{p}{1 - p},$$

so that $\left(1 - (1 - p)\sqrt{\mu\eta}\right)^m \le \frac{p}{1 - p}$. This finishes the proof of Theorem 1. $\square$

## 4 Numerical Experiments

In this section, we compare Acc-Prox-SVRG with Prox-SVRG and APG on L1-regularized multi-class logistic regression with regularization parameter $\lambda$. Table 2 provides details of the datasets and regularization parameters used in our experiments. The datasets can be found at the LIBSVM website¹.

Table 2: Details of data sets and regularization parameters.

| Dataset | Classes | Training size | Testing size | Features | $\lambda$ |
|---|---|---|---|---|---|
| mnist | 10 | 60,000 | 10,000 | 780 | $10^{-5}$ |
| covtype.scale | 7 | 522,910 | 58,102 | 54 | $10^{-6}$ |
| rcv1.binary | 2 | 20,242 | 677,399 | 47,236 | $10^{-5}$ |

The best choice of mini-batch size is $b = b_0$, which allows us to take the large learning rate $\eta = \frac{1}{2L}$; in that case $m = O(\sqrt{\kappa})$ and $\beta_k = \frac{\sqrt{2\kappa} - 1}{\sqrt{2\kappa} + 1}$. When the number of components $n$ is very large compared with $\kappa$, we have $b_0 = O(\sqrt{\kappa})$; accordingly, we set $m = \delta b$ with $\delta \in \{0.1, 1.0, 10\}$ and $\beta_k = \frac{b - 2}{b + 2}$, varying $b$ over the set $\{100, 500, 1000\}$. We ran Acc-Prox-SVRG using values of $\eta$ from the range $\{0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0\}$, and we chose the best $\eta$ for each mini-batch setting.

Figure 2 compares Acc-Prox-SVRG with Prox-SVRG and APG.

[Figure 2 (panels: mnist, covtype.scale, rcv1.binary). Comparison of Acc-Prox-SVRG with Prox-SVRG and APG. Top: objective gap of L1-regularized multi-class logistic regression. Bottom: test error rates.]

The horizontal axis is the number of single-component gradient evaluations. For Acc-Prox-SVRG, each iteration computes $2b$ component gradients, and at the beginning of each stage, $n$ component gradients are evaluated. For Prox-SVRG, each iteration computes two gradients, and at the beginning of each stage, $n$ gradients are evaluated. For APG, each iteration evaluates all $n$ gradients.

As can be seen from Figure 2, Acc-Prox-SVRG with good values of $b$ performs better than or comparably to Prox-SVRG, and much better than APG. On the other hand, for relatively large $b$, Acc-Prox-SVRG may perform worse because of an overestimate of $b_0$ and, hence, worse estimates of $m$ and $\beta_k$.

## 5 Conclusion

We have introduced a method that incorporates Nesterov's acceleration method and the variance reduction technique of SVRG in the mini-batch setting.
We have proved that the overall complexity of our method, with an appropriate mini-batch size, is lower than that of both Prox-SVRG and APG; even when the mini-batch size is not appropriate, our method is still comparable to APG or Prox-SVRG. In addition, the gradient evaluations for each mini-batch can be parallelized [3, 10, 11]; hence, our method can run much faster in a distributed framework.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

## References

[1] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.
[2] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, 2013.
[3] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. Advances in Neural Information Processing Systems 26, pages 378-385, 2013.
[4] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Proceedings of the 31st International Conference on Machine Learning, pages 64-72, 2014.
[5] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems 25, pages 2672-2680, 2012.
[6] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26, pages 315-323, 2013.
[7] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. arXiv:1403.4699, 2014.
[8] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.
[9] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007.
[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.
[11] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. Advances in Neural Information Processing Systems 24, pages 873-881, 2011.