# sdca_without_duality_regularization_and_individual_convexity__f6c47fa8.pdf SDCA without Duality, Regularization, and Individual Convexity Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel Stochastic Dual Coordinate Ascent is a popular method for solving regularized loss minimization for the case of convex losses. We describe variants of SDCA that do not require explicit regularization and do not rely on duality. We prove linear convergence rates even if individual loss functions are non-convex, as long as the expected loss is strongly convex. 1. Introduction We consider the following loss minimization problem: min w Rd F(w) := 1 i=1 fi(w) . An important sub-class of problems is when each fi can be written as fi(w) = φi(w) + λ 2 w 2, where φi is Lismooth and convex. A popular method for solving this sub-class of problems is Stochastic Dual Coordinate Ascent (SDCA), and (Shalev-Shwartz & Zhang, 2013) established the convergence rate of O((Lmax/λ + n) log(1/ϵ)), where Lmax = maxi Li. As its name indicates, SDCA is derived by considering a dual problem. In this paper, we consider the possibility of applying SDCA for problems in which individual fi do not necessarily have the form φi(w)+ λ 2 w 2, and can even be non-convex (e.g., deep learning optimization problems, or problems arising in fast calculation of the top singular vectors (Jin et al., 2015)). In many such cases, the dual problem is meaningless. Instead of directly using the dual problem, we describe and analyze a variant of SDCA in which only gradients of fi are being used. Following (Johnson & Zhang, 2013), we show that SDCA is a member of the Stochastic Gradient Descent (SGD) family of algorithms, that is, its update is based on an unbiased estimate of the gradient, but unlike the vanilla SGD, for SDCA the vari- Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). ance of the estimation of the gradient tends to zero as we converge to a minimum. Our analysis assumes that F is λ-strongly convex and each fi is Li-smooth. When each fi is also convex we establish the convergence rate of O( L/λ+n), where L is the average of Li and the O notation hides logarithmic terms, including the factor log(1/ϵ). This matches the best known bound for SVRG given in (Xiao & Zhang, 2014). Lower bounds have been derived in (Arjevani et al., 2015; Agarwal & Bottou, 2014). Applying an acceleration technique ((Shalev Shwartz & Zhang, 2015; Lin et al., 2015)) we obtain the convergence rate O(n1/2 p L/λ+n). If fi are non-convex we first prove that SDCA enjoys the rate O( L2/λ2 + n). Finally, we show how the acceleration technique yields the bound O n3/4p L/λ + n . That is, we have the same dependency on the square root of the condition number, p L/λ, but this term is multiplied by n3/4 rather than by n1/2. Understanding if this factor can be eliminated is left to future work. Related work: In recent years, many randomized methods for optimizing average of functions have been proposed. For example, SAG (Le Roux et al., 2012), SVRG (Johnson & Zhang, 2013), Finito (Defazio et al., 2014b), SAGA (Defazio et al., 2014a), S2GD (Koneˇcn y & Richt arik, 2013), and Uni Vr (Allen-Zhu & Yuan, 2015). All of these methods have similar convergence rates for strongly convex and smooth problems. Here we show that SDCA achieves the best known convergence rate for the case in which individual loss functions are convex, and a slightly worse rate for the case in which individual loss functions are non-convex. A systematic study of the convergence rate of the different methods under non-convex losses is left to future work. This version of the paper improves upon a previous unpublished version of the paper (Shalev-Shwartz, 2015) in three aspects. First, the convergence rate here depends on L as opposed to Lmax in (Shalev-Shwartz, 2015). Second, the version in (Shalev-Shwartz, 2015) only deals with the regularized case, while here we show that the same rate can be obtained for unregularized objectives. Last, for the non-convex case, here we derive SDCA without Duality, Regularization, and Individual Convexity the bound O n3/4p L/λ + n while in (Shalev-Shwartz, 2015) only the bound of O(L2 max/λ2 + n) has been given. (Csiba & Richt arik, 2015) extended the work of (Shalev Shwartz, 2015) to support arbitrary mini-batching schemes, and (He & Tak aˇc, 2015) extended the work of (Shalev Shwartz, 2015) to support adaptive sampling probabilities. A primal form of SDCA has been also given in (Defazio, 2014). Using SVRG for non-convex individual functions has been recently studied in (Shamir, 2015; Jin et al., 2015), in the context of fast computation of the top singular vectors of a matrix. 2. SDCA without Duality We start the section by describing a variant of SDCA that do not rely on duality. To simplify the presentation, we start in Section 2.1 with regularized loss minimization problems. In Section 2.2 we tackle the non-regularized case and in Section 2.3 we tackle the non-convex case. We recall the following basic definitions: A (differentiable) function f is λ-strongly convex if for every u, w we have f(w) f(u) f(u) (w u) + λ 2 w u 2. We say that f is convex if it is 0-strongly convex. We say that f is L-smooth if f(w) f(u) L w u . It is well known that smoothness and convexity also implies that f(w) f(u) f(u) (w u) + L 2.1. Regularized problems In regularized problems, each fi can be written as fi(w) = φi(w)+ λ 2 w 2. Similarly to the original SDCA algorithm, we maintain vectors α1, . . . , αn, where each αi Rd. We call these vectors pseudo-dual vectors. The algorithm is described below. Algorithm 1: Dual-Free SDCA for Regularized Objectives Goal: Minimize F(w) = 1 n Pn i=1 φi(w) + λ Input: Objective F, number of iterations T, step size η, Smoothness parameters L1, . . . , Ln Initialize: w(0) = 1 λ n Pn i=1 α(0) i for some α(0) = (α(0) 1 , . . . , α(0) n ) i [n], qi = (Li + L)/(2n L) where L = 1 n Pn i=1 Li For t = 1, . . . , T Pick i q, denote ηi = η qin Update: α(t) i = α(t 1) i ηiλn φi(w(t 1)) + α(t 1) i w(t) = w(t 1) ηi φi(w(t 1)) + α(t 1) i Observe that SDCA keeps the primal-dual relation i=1 α(t 1) i Observe also that the update of α can be rewritten as α(t) i = (1 βi)α(t 1) i + βi φi(w(t 1)) , where βi = ηiλn. Namely, the new value of αi is a convex combination of its old value and the negative gradient. Finally, observe that, conditioned on the value of w(t 1) and α(t 1), we have that Ei q[w(t)] = w(t 1) η X qi qin( φi(w(t 1)) + α(t 1) i ) i=1 φi(w(t 1)) + λw(t 1) ! = w(t 1) η P(w(t 1)) . That is, SDCA is in fact an instance of Stochastic Gradient Descent (SGD). As we will see shortly, the advantage of SDCA over a vanilla SGD algorithm is because the variance of the update goes to zero as we converge to an optimum. Our convergence analysis relies on bounding the following potential function, defined for every t 0, 2 w(t) w 2 + η qi α(t) i α i 2] , (1) w = argmin w F(w), and i, α i = φi(w ) . (2) Intuitively, Ct measures the distance to the optimum both in primal and pseudo-dual variables. Observe that if F is LF -smooth and convex then F(w(t)) F(w ) LF 2 w(t) w 2 LF and therefore a bound on Ct immediately implies a bound on the sub-optimality of w(t). The following theorem establishes the convergence rate of SDCA for the case in which each φi is convex. Theorem 1 Assume that each φi is Li-smooth and convex, and Algorithm 1 is run with η min 1 4 L , 1 4 λn . Then, for every t 1, E[Ct] (1 ηλ)t C0 , where Ct is as defined in (1). In particular, to achieve E[F(w(T )) F(w )] ϵ it suffices to set η = min 1 4 L , 1 4 λn and SDCA without Duality, Regularization, and Individual Convexity Variance Reduction: The lemma below tells us that the variance of the SDCA update decreases as we get closer to the optimum. Lemma 1 Under the same conditions of Theorem 1, the expected value of w(t) w(t 1) 2 conditioned on w(t 1) E[ w(t) w(t 1) 2] 3 η 1 2 w(t 1) w 2 + Ct 1 . 2.2. SDCA without regularization We now turn to the case in which the objective is not explicitly regularized. The algorithm below tackles this problem by a reduction to the regularized case. In particular, we artificially add regularization to the objective and compensate for it by adding one more loss function that cancels out the regularization term. While the added function is not convex (in fact, it is concave), we prove that the same convergence rate holds due to the special structure of the added loss function. Algorithm 2: Dual-Free SDCA for Non-Regularized Objectives Goal: Minimize F(w) = 1 n Pn i=1 fi(w) Input: Objective F, number of iterations T, step size η, Strong convexity parameter λ, Smoothness parameters L1, . . . , Ln Define: For all i [n], φi(w) = n+1 n fi(w), Li = n+1 n Li For i = n + 1, φi(w) = λ i 2 w 2, Li = λ i Solve: Rewrite F as F(w) = 1 n+1 Pn+1 i=1 φi(w) + λ Call Algorithm 1 with F above and with { Li} Theorem 2 Assume that F is λ-strongly convex, that each fi is Li-smooth and convex, and that Algorithm 2 is run with η min n 1 8( L+λ) , 1 4 λ(n+1) o . Then, for every t 1, E[Ct] (1 ηλ)t C0 , where Ct is as defined in (1). In particular, to achieve E[F(w(T )) F(w )] ϵ it suffices to set η = min n 1 8( L+λ) , 1 4 λ(n+1) o and 2.3. The non-convex case We now consider the non-convex case. For simplicity, we focus on the regularized setting. In the non-regularized setting we can simply replace every fi with φi(w) = 2 w 2 and apply the regularized setting. Note that this does not change significantly the smoothness (because λ is typically much smaller than the average smoothness of the fi). We can apply Algorithm 1 for the non-convex case, and the only change is the choice of η, as reflected in the theorem below. Theorem 3 Consider running algorithm 1 on F which is λ-strongly convex, assume that each φi is Li-smooth, and η min λ 4 L2 , 1 4 λn . Then, for every t 1, E[Ct] (1 ηλ)t C0 , where Ct is as defined in (1). In particular, to achieve E[F(w(T )) F(w )] ϵ it suffices to set η = min λ 4 L2 , 1 4 λn and As can be seen, the dependence of T on the condition number, L λ , is quadratic for the non-convex case, as opposed to a linear dependency for the convex case. We next show how to improve the bound using acceleration. 2.4. Acceleration Accelerated SDCA (Shalev-Shwartz & Zhang, 2015) is obtained by solving (using SDCA) a sequence of problems, where at each iteration, we add an artificial regularization of the form κ 2 w y(t 1) 2, where y(t 1) is a function of w(t 1) and w(t 2). The algorithm has been generalized in (Lin et al., 2015) to allow the inner solver to be any algorithm. For completeness, we provide the pseudo-code of the Catalyst algorithm of (Lin et al., 2015) and its analysis. Algorithm 3: Acceleration Goal: Minimize a λ-strongly convex function F(w) Parameters: κ, T Initialize: Initial solution w(0) ϵ0 s.t. ϵ0 F(w(0)) F(w ) y(0) = w(0), q = λ λ+κ For: t = 1, . . . , T Define Gt(w) = F(w) + κ 2 w y(t 1) 2 Set ϵt = (1 0.9 q) ϵt 1 Find w(t) s.t. Gt(w(t)) minw Gt(w) ϵt Set y(t) = w(t) + q q q+q (w(t) w(t 1)) Output: w(T ) SDCA without Duality, Regularization, and Individual Convexity Lemma 2 Fix ϵ > 0 and suppose we run the Acceleration algorithm (Algorithm 3) for λ log λ + κ iterations. Then, F(w(T )) F(w ) ϵ. Proof The lemma follows directly from Theorem 3.1 of (Lin et al., 2015) by observing that Algorithm 3 is a specification of Algorithm 1 in (Lin et al., 2015) with α0 = q (which implies that αt = α0 for every t), with ϵt = ϵ0(1 ρ)t, and with ρ = 0.9 q. Theorem 4 Let F = 1 n Pn i=1 φi(w) + λ 2 w 2, assume that each φi is Li smooth and that F is λ-strongly convex. Assume also that ( L/λ)2 3n (otherwise we can simply apply O(n) iterations of Algorithm 1). Then, running Algorithm 3 with parameters κ = L/ n, T = Ω 1 + n 1/4p L/λ , and while at each iteration of Al- gorithm 3 using Ω(n) iterations of Algorithm 1 to minimize Gt, guarantees that F(w(T )) F(w ) ϵ (with high probability). The total required number of iterations of Algorithm 1 is therefore bounded by O n + n3/4p L/λ . Observe that for the case of convex individual functions, accelerating Algorithm 1 yields the upper bound O n + n1/2p L/λ . Therefore, the convex and nonconvex cases have the same dependency on the condition number, but the non-convex case has a worse dependence on n. 3.1. Proof of Theorem 1 Observe that 0 = F(w ) = 1 n P i φi(w ) + λw , which implies that w = 1 λn P i α i , where α i = φi(w ). Define ui = φi(w(t 1)) and vi = ui + α(t 1) i . We also denote two potentials: 1 qj α(t) j α j 2 , Bt = w(t) w 2 . We will first analyze the evolution of At and Bt. If on round t we update using element i then α(t) i = (1 βi)α(t 1) i + βiui. It follows that, At 1 At = 1 qi α(t) i α i 2 + 1 qi α(t 1) i α i 2 qi (1 βi)(α(t 1) i α i ) + βi(ui α i ) 2 qi α(t 1) i α i 2 qi ( (1 βi) α(t 1) i α i 2 βi ui α i 2 + βi(1 βi) α(t 1) i ui 2 + α(t 1) i α i 2 ) α(t 1) i α i 2 ui α i 2 + (1 βi) vi 2 α(t 1) i α i 2 ui α i 2 + (1 βi) vi 2 . Taking expectation w.r.t. i q we obtain E[At 1 At] = (5) α(t 1) i α i 2 ui α i 2 + (1 βi) vi 2 ui α i 2 + (1 βi) vi 2 ! As to the second potential, we have Bt 1 Bt = w(t) w 2 + w(t 1) w 2 (6) = 2 (w(t 1) w ) (η vi) η2 i vi 2 . Taking expectation w.r.t. i q and noting that Ei q(ηivi) = η F(w(t 1)) we obtain E[Bt 1 Bt] =2η (w(t 1) w ) F(w(t 1)) (7) 1 qi vi 2 . We now take a potential of the form Ct = ca At + cb Bt. Combining (5) and (7) we obtain E[Ct 1 Ct] = caηλAt 1 caηλ X 1 qi ui α i 2 + 2cbη(w(t 1) w ) F(w(t 1)) 1 qi vi 2 caηλ(1 βi) cbη2 We will choose the parameters η, ca, cb such that SDCA without Duality, Regularization, and Individual Convexity This implies that βi = ηiλn = ηλ qi 1/2, and therefore the term in (8) is non-negative. Next, due to strong convexity of F we have that (w(t 1) w ) F(w(t 1)) F(w(t 1)) F(w ) + λ 2 w(t 1) w 2 . E[Ct 1 Ct] = caηλAt 1 caηλ X 1 qi ui α i 2 + 2cbη(F(w(t 1)) F(w )) + cbηλBt 1 = η λ Ct 1+ 2cb(F(w(t 1)) F(w )) caλ X 1 qi ui α i 2 ! Note that ui α i = φi(w(t 1)) φi(w ). In Lemma 3 we show that when φi is Li smooth and convex then φi(w(t 1)) φi(w ) 2 (11) 2 Li (φi(w(t 1)) φi(w ) φi(w ) (w(t 1) w )) Therefore, denoting τ = 2 maxi Li we obtain that 1 qi ui α i 2 = X 1 qi φi(w(t 1)) φi(w ) 2 i (φi(w(t 1)) φi(w ) φi(w ) (w(t 1) w )) = τ n F(w(t 1)) F(w ) λ 2 w(t 1) w 2 τ n F(w(t 1)) F(w ) . (13) The definition of qi implies that for every i, qi = 2n L Li Li + L 2n L . (14) Combining this with (12) and (10) we obtain η λ Ct 1 + η 2cb 4n2 Lλca (F(w(t 1)) F(w )) Plugging the value of cb = caλn2 2η yields that the coefficient in the last term is 2η 4n2 Lλca = caλn2 1 where we used the choice of η 1 4 L. In summary, we have shown that E[Ct 1 Ct] η λ Ct 1, which implies that E[Ct] (1 η λ) Ct 1 . Taking expectation over Ct 1 and continue recursively, we obtain that E[Ct] (1 η λ)t C0 e η λ t C0. Finally, since qi 1/(2n) for every i, we can choose 4 L , 1 4 λn and therefore 1 ηλ 4 n + L λ The proof is concluded by choosing cb = λ/2 and ca = η/n2. 3.2. Proof of Lemma 1 E[ w(t) w(t 1) 2] = X i qiη2 i φi(w(t 1)) + α(t 1) i 2 1 qi ( φi(w(t 1)) + α i 2 + α(t 1) i α i 2) (triangle inequality) qi φi(w(t 1)) φi(w ) 2 qi α(t 1) i α i 2) 2n L w(t 1) w 2 + 1 qi α(t 1) i α i 2 (smoothness and (14)) 3 η 1 2 w(t 1) w 2 + Ct 1 (because η 1 4 L) . 3.3. Proof of Theorem 2 The beginning of the proof is identical to the proof of Theorem 1. The change starts in (12), where we cannot apply (11) to φn+1 because it is not convex. To overcome this, we first apply (11) to φ1, . . . , φn, and obtain that 1 qi ui α i 2 = 1 qi φi(w(t 1)) φi(w ) 2 i=1 (φi(w(t 1)) φi(w ) φi(w ) (w(t 1) w )) = 2 (n + 1) (F(w(t 1)) F(w )) , SDCA without Duality, Regularization, and Individual Convexity where the last equality follows from the fact that Pn i=1 φi(w) = (n + 1)F(w), which also implies that P i φi(w ) = 0. In addition, since φn+1(w) = λ(n+1) 2 w 2, we have 1 qn+1 φn+1(w) φn+1(w ) 2 = λ2(n + 1)2 = 2 (n + 1) Ln+1 qn+1 λ 2 (n + 1) Ln+1 qn+1 (F(w) F(w )) , where the last inequality is because of the λ-strong convexity of F. Combining the two inequalities, we obtain an analogue of (12), 1 qi ui α i 2 max i [n+1] (F(w(t 1)) F(w )) . The rest of the proof is almost identical, except that we have n replaced by n+1 and L replaced by L := 1 n+1 Pn i=1 Li. We now need to choose 8 L , 1 4 λ(n + 1) Observe that, (n+1) L = n + 1 +λ(n+1) = (n+1)( L+λ) , so we can rewrite η = min 1 8( L + λ) , 1 4 λ(n + 1) This yields 1 ηλ 4 n + 3 + 2 L 3.4. Proof of Theorem 3 The beginning of the proof is identical to the proof of Theorem 1 up to (8). We will choose the parameters η, ca, cb such that This implies that βi = ηiλn = ηλ qi 1/2, and therefore the term in (8) is non-negative. Next, due to strong convexity of F we have that (w(t 1) w ) F(w(t 1)) F(w(t 1)) F(w ) + λ 2 w(t 1) w 2 λ w(t 1) w 2 . = caηλAt 1 caηλ X 1 qi ui α i 2 + 2cbηλBt 1 = η λ Ct 1 + η λ cb Bt 1 ca X 1 qi ui α i 2 ! Next, we use the smoothness of the φi to get 1 qi ui α i 2 = X 1 qi φi(w(t 1)) φi(w ) 2 L2 i qi w(t 1) w 2 = Bt 1 X The definition of qi implies that for every i, qi = 2n L Li Li + L 2n L , so by combining with (16) we obtain E[Ct 1 Ct] η λ Ct 1 + ηλ cb 2n2 L2ca Bt 1 The last term will be non-negative if cb ca 2n2 L2. Since we chose cb 2η we obtain the requirement 2η 2n2 L2 η λ 4 L2 . In summary, we have shown that E[Ct 1 Ct] η λ Ct 1. The rest of the proof is identical, but the requirement on η is 4 L2 , 1 4 λn and therefore 1 ηλ 4 n + L2 4. Proof of Theorem 4 Proof Each iteration of Algorithm 3 requires to minimize Gt to accuracy ϵt O(1) (1 ρ)t, where ρ = 0.9 q. If SDCA without Duality, Regularization, and Individual Convexity t T where T is as defined in Lemma 2, then we have that, t log(1 ρ) T log(1 ρ) = log(1 ρ) Using Lemma 4, log(1 ρ) ρ 2 for every ρ (0, 1/2). In our case, ρ is indeed in (0, 1/2) because of the definition of κ and our assumption that ( L/λ)2 3n. Hence, ϵt ) = O(log((λ + κ)/(λϵ))) . Combining this with Theorem 3, and using the definition of Gt, we obtain that the number of iterations required1 by each application of Algorithm 3 is O ( L + κ)2 (λ + κ)2 + n = O(n) , where in the equality we used the definition of κ. Finally, multiplying this by the value of T as given in Lemma 2 we obtain (ignoring log-terms): r λ n (1 + rκ λ) n = n + n3/4 r L 4.1. Technical Lemmas Lemma 3 Assume that φ is L-smooth and convex. Then, for every w and u, φ(w) φ(u) 2 2L φ(w) φ(u) φ(u) (w u) . Proof For every i, define g(w) = φ(w) φ(u) φ(u) (w u) . Clearly, since φ is L-smooth so is g. In addition, by convexity of φ we have g(w) 0 for all w. It follows that g is nonnegative and smooth, and therefore, it is self-bounded (see Section 12.1.3 in (Shalev-Shwartz & Ben-David, 2014)): g(w) 2 2Lg(w) . Using the definition of g, we obtain φ(w) φ(u) 2 = g(w) 2 2Lg(w) = 2L φ(w) φ(u) φ(u) (w u) . 1While Theorem 3 bounds the expected sub-optimality, by techniques similar to (Shalev-Shwartz & Zhang, 2015) it can be converted to a bound that holds with high probability. Lemma 4 For a (0, 1/2) we have log(1 a)/a 1.4. Proof Denote g(a) = log(1 a)/a. It is easy to verify that the derivative of g in (0, 1/2) is positive and that g(0.5) 1.4. The proof follows. We have described and analyzed a dual free version of SDCA that supports non-regularized objectives and nonconvex individual loss functions. Our analysis shows a linear rate of convergence for all of these cases. Two immediate open questions are whether the worse dependence on the condition number for the non-accelerated result for the non-convex case is necessary, and whether the factor n3/4 in Theorem 4 can be reduced to n1/2. Acknowledgements In a previous draft of this paper, the bound for the nonconvex case was n5/4 + n3/4p L/λ. We thank Ohad Shamir for showing us how to derive the improved bound of n + n3/4p L/λ. The work is supported by ICRI-CI and by the European Research Council (Theory DL project). Agarwal, Alekh and Bottou, Leon. A lower bound for the optimization of finite sums. In ICML, 2014. Allen-Zhu, Zeyuan and Yuan, Yang. Univr: A universal variance reduction framework for proximal stochastic gradient method. ar Xiv preprint ar Xiv:1506.01972, 2015. Arjevani, Yossi, Shalev-Shwartz, Shai, and Shamir, Ohad. On lower and upper bounds for smooth and strongly convex optimization problems. ar Xiv preprint ar Xiv:1503.06833, 2015. Csiba, Dominik and Richt arik, Peter. Primal method for erm with flexible mini-batching schemes and nonconvex losses. ar Xiv preprint ar Xiv:1506.02227, 2015. Defazio, Aaron. New Optimisation Methods for Machine Learning. Ph D thesis, Australian National University, 2014. Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646 1654, 2014a. SDCA without Duality, Regularization, and Individual Convexity Defazio, Aaron J, Caetano, Tib erio S, and Domke, Justin. Finito: A faster, permutable incremental gradient method for big data problems. ar Xiv preprint ar Xiv:1407.2710, 2014b. He, Xi and Tak aˇc, Martin. Dual free sdca for empirical risk minimization with adaptive probabilities. ar Xiv preprint ar Xiv:1510.06684, 2015. Jin, Chi, Kakade, Sham M, Musco, Cameron, Netrapalli, Praneeth, and Sidford, Aaron. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. ar Xiv preprint ar Xiv:1510.08896, 2015. Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315 323, 2013. Koneˇcn y, Jakub and Richt arik, Peter. Semi-stochastic gradient descent methods. ar Xiv preprint ar Xiv:1312.1666, 2013. Le Roux, Nicolas, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663 2671, 2012. Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3366 3374, 2015. Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming SERIES A and B (to appear), 2015. Shalev-Shwartz, Shai. Sdca without duality. ar Xiv preprint ar Xiv:1502.06177, 2015. Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press, 2014. Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567 599, Feb 2013. Shamir, Ohad. A stochastic pca and svd algorithm with an exponential convergence rate. In ICML, 2015. Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057 2075, 2014.