# Stop Wasting My Gradients: Practical SVRG

Reza Babanezhad¹, Mohamed Osama Ahmed¹, Alim Virani², Mark Schmidt¹
¹Department of Computer Science, University of British Columbia, {rezababa, moahmed, schmidtm}@cs.ubc.ca; ²alim.virani@gmail.com

Jakub Konečný
School of Mathematics, University of Edinburgh
kubo.konecny@gmail.com

Scott Sallinen
Department of Electrical and Computer Engineering, University of British Columbia
scotts@ece.ubc.ca

## Abstract

We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods. We first show that the convergence rate of these methods can be preserved under a decreasing sequence of errors in the control variate, and use this to derive variants of SVRG that use growing-batch strategies to reduce the number of gradient calculations required in the early iterations. We further (i) show how to exploit support vectors to reduce the number of gradient computations in the later iterations, (ii) prove that the commonly used regularized SVRG iteration is justified and improves the convergence rate, (iii) consider alternate mini-batch selection strategies, and (iv) consider the generalization error of the method.

## 1 Introduction

We consider the problem of optimizing the average of a finite but large sum of smooth functions,

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x). \tag{1}$$

A huge proportion of the model-fitting procedures in machine learning can be mapped to this problem. This includes classic models like least squares and logistic regression, but also more advanced methods like conditional random fields and deep neural network models. In the high-dimensional setting (large $d$), the traditional approaches for solving (1) are: full gradient (FG) methods, which have linear convergence rates but need to evaluate the gradient $f_i'$ for all $n$ examples on every iteration; and stochastic gradient (SG) methods, which make rapid initial progress as they only use a single gradient on each iteration but ultimately have slower sublinear convergence rates.

Le Roux et al. [1] proposed the first general method, stochastic average gradient (SAG), that only considers one training example on each iteration but still achieves a linear convergence rate. Other methods have subsequently been shown to have this property [2, 3, 4], but these all require storing a previous evaluation of the gradient $f_i'$ or the dual variables for each $i$. For many objectives this only requires $O(n)$ space, but for general problems it requires $O(nd)$ space, making them impractical.

Recently, several methods have been proposed with convergence rates similar to SAG but without the memory requirements [5, 6, 7, 8]. They are known as mixed gradient, stochastic variance-reduced gradient (SVRG), and semi-stochastic gradient methods (we will use SVRG). We give a canonical SVRG algorithm in the next section, but the salient features of these methods are that they evaluate two gradients on each iteration and occasionally must compute the gradient on all examples. SVRG methods often dramatically outperform classic FG and SG methods, but these extra evaluations mean that SVRG is slower than SG methods in the important early iterations. They also mean that SVRG methods are typically slower than memory-based methods like SAG.

In this work we first show that SVRG is robust to inexact calculation of the full gradients it requires (§3), provided the accuracy increases over time. We use this to explore growing-batch strategies that require fewer gradient evaluations when far from the solution, and we propose a mixed SG/SVRG method that may improve performance in the early iterations (§4). We next explore using support vectors to reduce the number of gradients required when close to the solution (§5), give a justification for the regularized SVRG update that is commonly used in practice (§6), consider alternative mini-batch strategies (§7), and finally consider the generalization error of the method (§8).
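To make (1) concrete, here is a minimal Python sketch of the finite-sum objective for least squares, one of the classic models mentioned above (the function names and interface are ours, for illustration only):

```python
import numpy as np

def f(x, A, b):
    """Problem (1) with f_i(x) = 0.5*(a_i^T x - b_i)^2: the average loss."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad_i(i, x, A, b):
    """Gradient of a single component, f_i'(x) = (a_i^T x - b_i) * a_i."""
    return (A[i] @ x - b[i]) * A[i]
```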
## 2 Notation and SVRG Algorithm

SVRG assumes $f$ is $\mu$-strongly convex, each $f_i$ is convex, and each gradient $f_i'$ is Lipschitz-continuous with constant $L$. The method begins with an initial estimate $x^0$, sets $x_0 = x^0$, and then generates a sequence of iterates $x_t$ using

$$x_t = x_{t-1} - \eta\left(f_{i_t}'(x_{t-1}) - f_{i_t}'(x^s) + \mu^s\right), \tag{2}$$

where $\eta$ is the positive step size, we set $\mu^s = f'(x^s)$, and $i_t$ is chosen uniformly from $\{1, 2, \dots, n\}$. After every $m$ steps, we set $x^{s+1} = x_t$ for a random $t \in \{1, \dots, m\}$, and we reset $t = 0$ with $x_0 = x^{s+1}$.

To analyze the convergence rate of SVRG, we will find it convenient to define the function

$$\rho(a, b) = \frac{1}{1 - 2a\eta}\left(\frac{1}{m\mu\eta} + 2b\eta\right),$$

as it appears repeatedly in our results. We will use $\rho(a)$ to indicate the value of $\rho(a, b)$ when $a = b$, and we will simply use $\rho$ for the special case when $a = b = L$. Johnson & Zhang [6] show that if $\eta$ and $m$ are chosen such that $0 < \rho < 1$, the algorithm achieves a linear convergence rate of the form

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho\,\mathbb{E}[f(x^s) - f(x^*)],$$

where $x^*$ is the optimal solution. This convergence rate is very fast for appropriate $\eta$ and $m$. While this result relies on constants we may not know in general, practical choices with good empirical performance include setting $m = n$, $\eta = 1/L$, and using $x^{s+1} = x_m$ rather than a random iterate.

Unfortunately, the SVRG algorithm requires $2m + n$ gradient evaluations for every $m$ iterations of (2), since updating $x_t$ requires two gradient evaluations and computing $\mu^s$ requires $n$ gradient evaluations. We can reduce this to $m + n$ if we store the gradients $f_i'(x^s)$, but this is not practical in most applications. Thus, SVRG requires many more gradient evaluations than classic SG iterations or memory-based methods like SAG.

## 3 SVRG with Error

We first give a result for the SVRG method where we assume that $\mu^s$ is equal to $f'(x^s)$ up to some error $e^s$. This is in the spirit of the analysis of [9], who analyze FG methods under similar assumptions. We assume that $\|x_t - x^*\| \le Z$ for all $t$, which has been used in related work [10] and is reasonable because of the coercivity implied by strong convexity.

**Proposition 1.** If $\mu^s = f'(x^s) + e^s$ and we set $\eta$ and $m$ so that $\rho < 1$, then the SVRG algorithm (2) with $x^{s+1}$ chosen randomly from $\{x_1, x_2, \dots, x_m\}$ satisfies

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho\,\mathbb{E}[f(x^s) - f(x^*)] + Z\mathbb{E}\|e^s\| + \eta\mathbb{E}\|e^s\|^2.$$

We give the proof in Appendix A. This result implies that SVRG does not need a very accurate approximation of $f'(x^s)$ in the crucial early iterations, since the first term in the bound will dominate. Further, it implies that we can maintain the exact convergence rate of SVRG as long as the errors $e^s$ decrease at an appropriate rate. For example, we obtain the same convergence rate provided that $\max\{\mathbb{E}\|e^s\|, \mathbb{E}\|e^s\|^2\} \le \gamma\tilde{\rho}^s$ for any $\gamma \ge 0$ and some $\tilde{\rho} < \rho$. Further, we still obtain a linear convergence rate as long as $\|e^s\|$ converges to zero with a linear convergence rate.

**Algorithm 1** Batching SVRG

    Input: initial vector x_0, update frequency m, learning rate η.
    for s = 0, 1, 2, ... do
      Choose batch size |B^s|
      B^s = |B^s| elements sampled without replacement from {1, 2, ..., n}
      µ^s = (1/|B^s|) Σ_{i∈B^s} f'_i(x^s)
      x_0 = x^s
      for t = 1, 2, ..., m do
        Randomly pick i_t ∈ {1, ..., n}
        x_t = x_{t-1} − η(f'_{i_t}(x_{t-1}) − f'_{i_t}(x^s) + µ^s)    (*)
      end for
      option I: set x^{s+1} = x_m
      option II: set x^{s+1} = x_t for random t ∈ {1, ..., m}
    end for
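As a minimal sketch of Algorithm 1 (assuming a `grad_i(i, x)` oracle returning $f_i'(x)$, such as the least-squares one above; the names and the option-I snapshot choice are ours):

```python
import numpy as np

def batching_svrg(grad_i, n, x0, m, eta, batch_sizes, seed=0):
    """Algorithm 1 sketch: SVRG with mu_s estimated on a growing batch."""
    rng = np.random.default_rng(seed)
    x_snap = np.array(x0, dtype=float)
    for size in batch_sizes:  # |B^s| for each outer iteration s
        batch = rng.choice(n, size=size, replace=False)  # without replacement
        mu = sum(grad_i(i, x_snap) for i in batch) / size  # approximates f'(x^s)
        x = x_snap.copy()
        for _ in range(m):
            it = rng.integers(n)  # uniform inner sampling
            x = x - eta * (grad_i(it, x) - grad_i(it, x_snap) + mu)
        x_snap = x  # option I: take the last inner iterate as the snapshot
    return x_snap
```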
### 3.1 Non-Uniform Sampling

Xiao & Zhang [11] show that non-uniform sampling (NUS) improves the performance of SVRG. They assume each $f_i'$ is $L_i$-Lipschitz continuous, and sample $i_t = i$ with probability $L_i/(n\bar{L})$, where $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$. The iteration is then changed to

$$x_t = x_{t-1} - \eta\left(\frac{\bar{L}}{L_{i_t}}\left[f_{i_t}'(x_{t-1}) - f_{i_t}'(x^s)\right] + \mu^s\right),$$

which keeps the search direction unbiased. In Appendix A, we show that if $\mu^s$ is computed with error for this algorithm, and if we set $\eta$ and $m$ so that $0 < \rho(\bar{L}) < 1$, then we have a convergence rate of

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho(\bar{L})\,\mathbb{E}[f(x^s) - f(x^*)] + Z\mathbb{E}\|e^s\| + \eta\mathbb{E}\|e^s\|^2,$$

which can be faster since the average $\bar{L}$ may be much smaller than the maximum value $L$.

### 3.2 SVRG with Batching

There are many ways we could allow an error in the calculation of $\mu^s$ to speed up the algorithm. For example, if evaluating each $f_i'$ involves solving an optimization problem, then we could solve this optimization problem inexactly. In particular, if we are fitting a graphical model with an iterative approximate inference method, we can terminate the iterations early to save time.

When the $f_i$ are simple but $n$ is large, a natural way to approximate $\mu^s$ is with a subset (or "batch") of training examples $B^s$ (chosen without replacement),

$$\mu^s = \frac{1}{|B^s|}\sum_{i \in B^s} f_i'(x^s).$$

The batch size $|B^s|$ controls the error in the approximation, and we can drive the error to zero by increasing it to $n$. Existing SVRG methods correspond to the special case where $|B^s| = n$ for all $s$. Algorithm 1 gives pseudo-code for an SVRG implementation that uses this sub-sampling strategy.

If we assume that the sample variance of the norms of the gradients is bounded by $S^2$ for all $x^s$,

$$\frac{1}{n-1}\sum_{i=1}^n \left\|f_i'(x^s) - f'(x^s)\right\|^2 \le S^2,$$

then we have that [12, Chapter 2]

$$\mathbb{E}\|e^s\|^2 \le \frac{n - |B^s|}{n\,|B^s|}\,S^2.$$

So if we want $\mathbb{E}\|e^s\|^2 \le \gamma\tilde{\rho}^{2s}$, where $\gamma \ge 0$ is a constant and $\tilde{\rho} < 1$, we need

$$|B^s| \ge \frac{nS^2}{S^2 + n\gamma\tilde{\rho}^{2s}}. \tag{3}$$

If the batch size satisfies the above condition, then

$$Z\mathbb{E}\|e^s\| + \eta\mathbb{E}\|e^s\|^2 \le Z\sqrt{\gamma}\,\tilde{\rho}^s + \eta\gamma\tilde{\rho}^{2s} \le 2\max\{Z\sqrt{\gamma},\, \eta\gamma\}\,\tilde{\rho}^s,$$

and the convergence rate of SVRG is unchanged compared to using the full batch on all iterations. Condition (3) guarantees a linear convergence rate under any exponentially-increasing sequence of batch sizes, the strategy suggested by [13] for classic SG methods. However, a tedious calculation shows that (3) has an inflection point at $s = \log(S^2/\gamma n)/2\log(1/\tilde{\rho})$, corresponding to $|B^s| = \frac{n}{2}$. This was previously observed empirically [14, Figure 3], and occurs because we are sampling without replacement. This transition means we don't need to increase the batch size exponentially.
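For concreteness, a small helper that tabulates the smallest batch sizes satisfying condition (3); the values of $S^2$, $\gamma$, and $\tilde{\rho}$ below are illustrative placeholders rather than quantities the analysis prescribes:

```python
import numpy as np

def batch_sizes_from_condition_3(n, S2, gamma, rho_tilde, stages):
    """Smallest |B^s| with |B^s| >= n*S2 / (S2 + n*gamma*rho_tilde**(2*s))."""
    s = np.arange(stages)
    sizes = np.ceil(n * S2 / (S2 + n * gamma * rho_tilde ** (2 * s)))
    return np.minimum(sizes, n).astype(int)

# e.g. batch_sizes_from_condition_3(10**5, S2=1.0, gamma=1e-4, rho_tilde=0.9,
# stages=40) grows toward n and passes |B^s| = n/2 near s = 11, after which
# the required growth is no longer exponential.
```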
## 4 Mixed SG and SVRG Method

An approximate $\mu^s$ can drastically reduce the computational cost of the SVRG algorithm, but does not affect the "2" in the $2m + n$ gradients required for $m$ SVRG iterations. This factor of 2 is significant in the early iterations, since this is when stochastic methods make the most progress and when we typically see the largest reduction in the test error.

To reduce this factor, we can consider a mixed strategy: if $i_t$ is in the batch $B^s$, then perform an SVRG iteration; if it is not in the current batch, then use a classic SG iteration. We illustrate this modification in Algorithm 2. It allows the algorithm to take advantage of the rapid initial progress of SG, since it predominantly uses SG iterations when far from the solution. Below we give a convergence rate for this mixed strategy.

**Algorithm 2** Mixed SVRG and SG Method

    Replace (*) in Algorithm 1 with the following lines:
    if i_t ∈ B^s then
      x_t = x_{t-1} − η(f'_{i_t}(x_{t-1}) − f'_{i_t}(x^s) + µ^s)
    else
      x_t = x_{t-1} − η f'_{i_t}(x_{t-1})
    end if

**Proposition 2.** Let $\mu^s = f'(x^s) + e^s$, and set $\eta$ and $m$ so that $0 < \rho(L, \alpha L) < 1$ with $\alpha = |B^s|/n$. If we assume $\mathbb{E}\|f_i'(x)\|^2 \le \sigma^2$, then Algorithm 2 has

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho(L, \alpha L)\,\mathbb{E}[f(x^s) - f(x^*)] + Z\mathbb{E}\|e^s\| + \eta\mathbb{E}\|e^s\|^2 + \frac{\eta\sigma^2(1-\alpha)}{2(1 - 2\eta L)}.$$

We give the proof in Appendix B. The extra term depending on the variance $\sigma^2$ is typically the bottleneck for SG methods. Classic SG methods require the step size $\eta$ to converge to zero because of this term. However, the mixed SG/SVRG method can keep the fast progress from using a constant $\eta$, since the term depending on $\sigma^2$ converges to zero as $\alpha$ converges to one. Since $\alpha < 1$ implies that $\rho(L, \alpha L) < \rho$, this result shows that the mixed SG/SVRG method actually converges faster when $[f(x^s) - f(x^*)]$ is large compared to $e^s$ and $\sigma^2$.

Sharing a single step size $\eta$ between the SG and SVRG iterations in Proposition 2 is sub-optimal. For example, if $x$ is close to $x^*$ and $|B^s| < n$, then the SG iteration might take us far away from the minimizer. Thus, we may want to use a decreasing sequence of step sizes for the SG iterations. In Appendix B, we show that using $\eta = O\!\left(\sqrt{(n - |B|)/(n|B|)}\right)$ for the SG iterations can improve the dependence on the error $e^s$ and variance $\sigma^2$.
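A sketch of the modified inner loop of Algorithm 2, under the same assumed `grad_i` oracle as before (all names are ours):

```python
import numpy as np

def mixed_inner_loop(grad_i, n, x_snap, mu, batch, m, eta, seed=0):
    """Algorithm 2 sketch: SVRG step when i_t is in B^s, SG step otherwise."""
    rng = np.random.default_rng(seed)
    in_batch = np.zeros(n, dtype=bool)
    in_batch[np.asarray(batch)] = True
    x = x_snap.copy()
    for _ in range(m):
        it = rng.integers(n)
        if in_batch[it]:
            # Variance-reduced step: f_i'(x^s) is consistent with mu here.
            x = x - eta * (grad_i(it, x) - grad_i(it, x_snap) + mu)
        else:
            x = x - eta * grad_i(it, x)  # classic SG step
    return x
```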
## 5 Using Support Vectors

Using a batch $B^s$ decreases the number of gradient evaluations required when SVRG is far from the solution, but its benefit diminishes over time. However, for certain objectives we can further reduce the number of gradient evaluations by identifying support vectors. For example, consider minimizing the Huberized hinge loss (HSVM) with threshold $\epsilon$ [15],

$$\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(b_i a_i^T x), \qquad f(\tau) = \begin{cases} 0 & \text{if } \tau > 1 + \epsilon, \\ 1 - \tau & \text{if } \tau < 1 - \epsilon, \\ \frac{(1+\epsilon-\tau)^2}{4\epsilon} & \text{if } |1 - \tau| \le \epsilon. \end{cases}$$

In terms of (1), we have $f_i(x) = f(b_i a_i^T x)$. The performance of this loss function is similar to logistic regression and the hinge loss, but it has the appealing properties of both: it is differentiable like logistic regression, meaning we can apply methods like SVRG, and it has support vectors like the hinge loss, meaning that many examples will have $f_i(x^*) = 0$ and $f_i'(x^*) = 0$. We can also construct Huberized variants of many non-smooth losses for regression and multi-class classification.

If we knew the support vectors where $f_i(x^*) > 0$, we could solve the problem faster by ignoring the non-support vectors. For example, if there are 100,000 training examples but only 100 support vectors in the optimal solution, we could solve the problem 1,000 times faster. While we typically don't know the support vectors, in this section we outline a heuristic that gives large practical improvements by trying to identify them as the algorithm runs.

Our heuristic has two components. The first component is maintaining a list of the non-support vectors at $x^s$. Specifically, we maintain a list of examples $i$ where $f_i'(x^s) = 0$. When SVRG picks an example $i_t$ that is part of this list, we know that $f_{i_t}'(x^s) = 0$ and thus the iteration only needs one gradient evaluation. This modification is not a heuristic, in that it still applies the exact SVRG algorithm. However, at best it can only cut the runtime in half.

The heuristic part of our strategy is to skip the evaluation of $f_i'(x^s)$ or $f_i'(x_t)$ if our evaluation of $f_i'$ has been zero more than two consecutive times (and to skip it an exponentially larger number of times each time it remains zero). Specifically, for each example $i$ we maintain two variables, $sk_i$ (for "skip") and $ps_i$ (for "pass"). Whenever we need to evaluate $f_i'$ for some $x^s$ or $x_t$, we run Algorithm 3, which may skip the evaluation. This strategy can lead to huge computational savings in later iterations if there are few support vectors, since many iterations will require no gradient evaluations.

**Algorithm 3** Heuristic for skipping evaluations of $f_i'$ at $x$

    if sk_i = 0 then
      compute f'_i(x)
      if f'_i(x) = 0 then
        ps_i = ps_i + 1              {update the number of consecutive times f'_i(x) was zero}
        sk_i = 2^{max{0, ps_i − 2}}   {skip an exponential number of future evaluations if it remains zero}
      else
        ps_i = 0                     {this could be a support vector; do not skip it next time}
      end if
      return f'_i(x)
    else
      sk_i = sk_i − 1                {in this case, we skip the evaluation}
      return 0
    end if

Identifying support vectors to speed up computation has long been an important part of SVM solvers, and is related to the classic shrinking heuristic [16]. While it has previously been explored in the context of dual coordinate ascent methods [17], this is the first work exploring it for linearly-convergent stochastic gradient methods.
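A runnable rendering of Algorithm 3 (a sketch, assuming gradients are returned as numpy arrays; the class name is ours):

```python
import numpy as np

class SkipHeuristic:
    """Algorithm 3: skip evaluations of f_i' that have repeatedly been zero."""

    def __init__(self, n):
        self.sk = np.zeros(n, dtype=int)  # evaluations of f_i' left to skip
        self.ps = np.zeros(n, dtype=int)  # consecutive zero evaluations seen

    def grad(self, i, x, grad_i):
        if self.sk[i] == 0:
            g = grad_i(i, x)
            if not np.any(g):  # f_i'(x) evaluated to exactly zero
                self.ps[i] += 1
                # Skip exponentially many future evaluations while it stays zero.
                self.sk[i] = 2 ** max(0, self.ps[i] - 2)
            else:
                self.ps[i] = 0  # possibly a support vector: never skip next time
            return g
        self.sk[i] -= 1  # skip this evaluation and treat the gradient as zero
        return 0.0
```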
## 6 Regularized SVRG

We are often interested in the special case where problem (1) has the decomposition

$$\min_{x \in \mathbb{R}^d} f(x) \equiv h(x) + \frac{1}{n}\sum_{i=1}^n g_i(x). \tag{4}$$

A common choice of $h$ is a scaled $\ell_1$-norm of the parameter vector, $h(x) = \lambda\|x\|_1$. This non-smooth regularizer encourages sparsity in the parameter vector, and can be addressed with the proximal SVRG method of Xiao & Zhang [11]. Alternately, if we want an explicit $Z$, we could set $h$ to the indicator function for a 2-norm ball containing $x^*$. In Appendix C, we give a variant of Proposition 1 that allows errors in the proximal-SVRG method for non-smooth/constrained settings like this.

Another common choice is the $\ell_2$-regularizer, $h(x) = \frac{\lambda}{2}\|x\|^2$. With this regularizer, the SVRG updates can be equivalently written in the form

$$x_{t+1} = x_t - \eta\left(h'(x_t) + g_{i_t}'(x_t) - g_{i_t}'(x^s) + \mu^s\right), \tag{5}$$

where $\mu^s = \frac{1}{n}\sum_{i=1}^n g_i'(x^s)$. That is, they take an exact gradient step with respect to the regularizer and an SVRG step with respect to the $g_i$ functions. When the $g_i'$ are sparse, this form of the update allows us to implement the iteration without needing full-vector operations. A related update is used by Le Roux et al. to avoid full-vector operations in the SAG algorithm [1, 4]. In Appendix C, we prove the following convergence rate for this update.

**Proposition 3.** Consider instances of problem (1) that can be written in the form (4), where $h'$ is $L_h$-Lipschitz continuous and each $g_i'$ is $L_g$-Lipschitz continuous, and assume that we set $\eta$ and $m$ so that $0 < \rho(L_m) < 1$ with $L_m = \max\{L_g, L_h\}$. Then the regularized SVRG iteration (5) has

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho(L_m)\,\mathbb{E}[f(x^s) - f(x^*)].$$

Since $L_m \le L$, and strictly so in the case of $\ell_2$-regularization, this result shows that for $\ell_2$-regularized problems SVRG actually converges faster than the standard analysis would indicate (a similar result appears in Konečný et al. [18]). Further, this result gives a theoretical justification for using the update (5) for other $h$ functions where it is not equivalent to the original SVRG method.

## 7 Mini-Batching Strategies

Konečný et al. [18] have also recently considered using batches of data within SVRG. They consider using mini-batches in the inner iteration (the update of $x_t$) to decrease the variance of the method, but still use full passes through the data to compute $\mu^s$. This prior work is thus complementary to the current work (in practice, both strategies can be used together to improve performance). In Appendix D we show that sampling the inner mini-batch proportional to $L_i$ achieves a convergence rate of

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho_M\,\mathbb{E}[f(x^s) - f(x^*)],$$

where $M$ is the size of the mini-batch and

$$\rho_M = \frac{1}{1 - \frac{2\eta\bar{L}}{M}}\left(\frac{1}{m\mu\eta} + \frac{2\bar{L}\eta}{M}\right),$$

and we assume $0 < \rho_M < 1$. This generalizes the standard rate of SVRG and improves on the result of Konečný et al. [18] in the smooth case. This rate can be faster than the rate of the standard SVRG method at the cost of a more expensive iteration, and may be clearly advantageous in settings where parallel computation allows us to compute several gradients simultaneously.

The regularized SVRG form (5) suggests an alternate mini-batch strategy for problem (1): consider a mini-batch that contains a "fixed" set $B_f$ and a "random" set $B^t$. Without loss of generality, assume that we sort the $f_i$ based on their $L_i$ values, so that $L_1 \ge L_2 \ge \cdots \ge L_n$. For the fixed set $B_f$ we always choose the $M_f$ functions with the largest $L_i$, $B_f = \{f_1, f_2, \dots, f_{M_f}\}$. In contrast, we choose the members of the random set $B^t$ by sampling from $B_r = \{f_{M_f+1}, \dots, f_n\}$ proportional to their Lipschitz constants, $p_i = \frac{L_i}{M_r \bar{L}_r}$ with $\bar{L}_r = \frac{1}{M_r}\sum_{i=M_f+1}^n L_i$. In Appendix D, we show the following convergence rate for this strategy.

**Proposition 4.** Let $g(x) = \frac{1}{n}\sum_{i \notin B_f} f_i(x)$ and $h(x) = \frac{1}{n}\sum_{i \in B_f} f_i(x)$. If we replace the SVRG update with

$$x_{t+1} = x_t - \eta\left(h'(x_t) + \frac{1}{M_r}\sum_{i \in B^t} \frac{\bar{L}_r}{L_i}\left(f_i'(x_t) - f_i'(x^s)\right) + g'(x^s)\right),$$

then the convergence rate is

$$\mathbb{E}[f(x^{s+1}) - f(x^*)] \le \rho(\kappa, \zeta)\,\mathbb{E}[f(x^s) - f(x^*)],$$

where $\zeta = \frac{(n - M_f)\bar{L}_r}{(M - M_f)n}$ and $\kappa$ is a constant depending on $L_1$ (defined in Appendix D).

If $L_1 \le n\bar{L}/M$ and $M_f < \frac{(\alpha - 1)nM}{\alpha n - M}$ with $\alpha = \bar{L}/\bar{L}_r$, then we get a faster convergence rate than SVRG with a mini-batch of size $M$. The scenario where this rate is slower than the existing mini-batch SVRG strategy is when $L_1 > n\bar{L}/M$. But we can relax this assumption by dividing each element of the fixed set $B_f$ into two functions, $\beta f_i$ and $(1-\beta)f_i$ with $\beta = 1/M$, then replacing each function $f_i$ in $B_f$ with $\beta f_i$ and adding $(1-\beta)f_i$ to the random set $B_r$.

This result may be relevant if we have access to a field-programmable gate array (FPGA) or graphics processing unit (GPU) that can compute the gradient for a fixed subset of the examples very efficiently. However, our experiments (Appendix F) indicate that this strategy only gives marginal gains. In Appendix F, we also consider constructing mini-batches by sampling proportional to $f_i(x^s)$ or $\|f_i'(x^s)\|$. These seemed to work as well as Lipschitz sampling on all but one of the datasets in our experiments, and they are appealing because we have access to these values while we may not know the $L_i$. However, these strategies diverged on one of the datasets.
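A sketch of the fixed-plus-random mini-batch construction from Proposition 4 (the Lipschitz constants $L_i$ are assumed given; e.g., for the $\ell_2$-regularized logistic loss used in our experiments one can take $L_i = \frac{1}{4}\|a_i\|^2 + \lambda$; all names are ours):

```python
import numpy as np

def fixed_plus_random_batch(L, M_f, M, seed=0):
    """B_f = the M_f functions with largest L_i, always included; the random
    part is drawn from the rest with probability proportional to L_i."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    order = np.argsort(L)[::-1]        # indices sorted by decreasing L_i
    fixed = order[:M_f]                # B_f: always in the mini-batch
    rest = order[M_f:]                 # B_r: candidates for the random part
    p = L[rest] / L[rest].sum()        # sampling proportional to L_i
    random_part = rng.choice(rest, size=M - M_f, replace=True, p=p)
    return fixed, random_part
```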
## 8 Learning Efficiency

In this section we compare the performance of SVRG as a large-scale learning algorithm with that of FG and SG methods. Following Bottou & Bousquet [19], we can formulate the generalization error $\mathcal{E}$ of a learning algorithm as the sum of three terms,

$$\mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}},$$

where the approximation error $\mathcal{E}_{\mathrm{app}}$ measures the effect of using a limited class of models, the estimation error $\mathcal{E}_{\mathrm{est}}$ measures the effect of using a finite training set, and the optimization error $\mathcal{E}_{\mathrm{opt}}$ measures the effect of inexactly solving problem (1). Bottou & Bousquet [19] study the asymptotic performance of various algorithms for a fixed approximation error and under certain conditions on the distribution of the data, depending on parameters $\alpha$ or $\nu$. In Appendix E, we discuss how SVRG can be analyzed in their framework. The table below includes SVRG among their results; here the condition number is $\kappa = L/\mu$.

| Algorithm | Time to reach $\mathcal{E}_{\mathrm{opt}} \le \epsilon$ | Time to reach $\mathcal{E} = O(\mathcal{E}_{\mathrm{app}} + \epsilon)$ | Previous with $\kappa \sim n$ |
|---|---|---|---|
| FG | $O\!\left(n\kappa d \log\tfrac{1}{\epsilon}\right)$ | $O\!\left(\tfrac{d^2\kappa}{\epsilon^{1/\alpha}}\log^2\tfrac{1}{\epsilon}\right)$ | $O\!\left(\tfrac{d^3}{\epsilon^{2/\alpha}}\log^3\tfrac{1}{\epsilon}\right)$ |
| SVRG | $O\!\left((n+\kappa)d\log\tfrac{1}{\epsilon}\right)$ | $O\!\left(\tfrac{d^2}{\epsilon^{1/\alpha}}\log^2\tfrac{1}{\epsilon} + \kappa d\log\tfrac{1}{\epsilon}\right)$ | $O\!\left(\tfrac{d^2}{\epsilon^{1/\alpha}}\log^2\tfrac{1}{\epsilon}\right)$ |

In this setting, linearly-convergent stochastic gradient methods can obtain better bounds for ill-conditioned problems, with a better dependence on the dimension and without depending on the noise variance $\nu$.

## 9 Experimental Results

In this section, we present experimental results that evaluate our proposed variations on the SVRG method. We focus on logistic regression classification: given a set of training data $(a_1, b_1), \dots, (a_n, b_n)$, where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$, the goal is to find the $x \in \mathbb{R}^d$ solving

$$\operatorname*{arg\,min}_{x \in \mathbb{R}^d} \ \frac{\lambda}{2}\|x\|^2 + \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-b_i a_i^T x)\right).$$

We consider the datasets used by [1], whose properties are listed in the supplementary material. As in their work, we add a bias variable, normalize dense features, and set the regularization parameter $\lambda$ to $1/n$. We used a step size of $\eta = 1/L$ and set $m = |B^s|$, which gave good performance across methods and datasets. In our first experiment, we compared three variants of SVRG: the original strategy that uses all $n$ examples to form $\mu^s$ (Full), a growing-batch strategy that sets $|B^s| = 2^s$ (Grow), and the mixed SG/SVRG method described by Algorithm 2 under this same choice (Mixed). While a variety of practical batching methods have been proposed in the literature [13, 20, 21], we did not find that any of these strategies consistently outperformed the doubling used by the simple Grow strategy. Our second experiment focused on the $\ell_2$-regularized HSVM on the same datasets, and we compared the original SVRG algorithm with variants that try to identify the support vectors (SV).

[Figure 1: Comparison of training objective (left) and test error (right) against the number of effective passes, on the spam dataset, for the logistic regression (top) and HSVM (bottom) losses, under different batch strategies for choosing $\mu^s$ (Full, Grow, and Mixed) and with or without identifying support vectors (SV).]

We plot the experimental results for one run of the algorithms on one dataset in Figure 1, while Appendix F reports results on the other eight datasets over ten different runs. In our results, the growing-batch strategy (Grow) always had better test-error performance than using the full batch, while for large datasets it also performed substantially better in terms of the training objective. In contrast, the Mixed strategy sometimes helped performance and sometimes hurt it. Utilizing support vectors often improved the training objective, often by large margins, but its effect on the test objective was smaller.
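For reference, a sketch of the experimental objective's component gradients, with the regularizer folded into each $f_i$ (our implementation choice; the names are ours):

```python
import numpy as np

def logistic_grad_i(i, x, A, b, lam):
    """Gradient of f_i(x) = (lam/2)*||x||^2 + log(1 + exp(-b_i * a_i^T x))."""
    z = -b[i] * (A[i] @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))  # derivative of log(1 + exp(z)) in z
    return lam * x - sigma * b[i] * A[i]

# For this loss, L_i = 0.25 * ||a_i||^2 + lam is a standard Lipschitz bound,
# so eta = 1 / max_i L_i gives the step size 1/L used above.
```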
## 10 Discussion

As SVRG is the only memory-free method among the new stochastic linearly-convergent methods, it represents the natural method to use for a huge variety of machine learning problems. In this work we showed that the convergence rate of the SVRG algorithm can be preserved even under an inexact approximation to the full gradient. We also showed that using mini-batches to approximate $\mu^s$ gives a natural way to do this, explored the use of support vectors to further reduce the number of gradient evaluations, gave an analysis of the regularized SVRG update, and considered several new mini-batch strategies. Our theoretical and experimental results indicate that many of these simple modifications should be considered in any practical implementation of SVRG.

## Acknowledgements

We would like to thank the reviewers for their helpful comments. This research was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN 312176-2010, RGPIN 311661-08, RGPIN-06068-2015). Jakub Konečný is supported by a Google European Doctoral Fellowship.

## References

[1] N. Le Roux, M. Schmidt, and F. Bach, "A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets," Advances in Neural Information Processing Systems (NIPS), 2012.

[2] S. Shalev-Shwartz and T. Zhang, "Stochastic dual coordinate ascent methods for regularized loss minimization," Journal of Machine Learning Research, vol. 14, pp. 567–599, 2013.

[3] J. Mairal, "Optimization with first-order surrogate functions," International Conference on Machine Learning (ICML), 2013.

[4] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," Advances in Neural Information Processing Systems (NIPS), 2014.

[5] M. Mahdavi, L. Zhang, and R. Jin, "Mixed optimization for smooth functions," Advances in Neural Information Processing Systems (NIPS), 2013.

[6] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Advances in Neural Information Processing Systems (NIPS), 2013.

[7] L. Zhang, M. Mahdavi, and R. Jin, "Linear convergence with condition number independent access of full gradients," Advances in Neural Information Processing Systems (NIPS), 2013.

[8] J. Konečný and P. Richtárik, "Semi-stochastic gradient descent methods," arXiv preprint, 2013.

[9] M. Schmidt, N. Le Roux, and F. Bach, "Convergence rates of inexact proximal-gradient methods for convex optimization," Advances in Neural Information Processing Systems (NIPS), 2011.

[10] C. Hu, J. Kwok, and W. Pan, "Accelerated gradient methods for stochastic optimization and online learning," Advances in Neural Information Processing Systems (NIPS), 2009.

[11] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM Journal on Optimization, vol. 24, no. 2, pp. 2057–2075, 2014.

[12] S. Lohr, Sampling: Design and Analysis. Cengage Learning, 2009.

[13] M. P. Friedlander and M. Schmidt, "Hybrid deterministic-stochastic methods for data fitting," SIAM Journal on Scientific Computing, vol. 34, no. 3, pp. A1351–A1379, 2012.

[14] A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen, "Robust inversion, dimensionality reduction, and randomized sampling," Mathematical Programming, vol. 134, no. 1, pp. 101–125, 2012.

[15] S. Rosset and J. Zhu, "Piecewise linear regularized solution paths," The Annals of Statistics, vol. 35, no. 3, pp. 1012–1030, 2007.

[16] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods – Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), ch. 11, pp. 169–184, Cambridge, MA: MIT Press, 1999.
[17] N. Usunier, A. Bordes, and L. Bottou, "Guarantees for approximate incremental SVMs," International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[18] J. Konečný, J. Liu, P. Richtárik, and M. Takáč, "mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting," arXiv preprint, 2014.

[19] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," Advances in Neural Information Processing Systems (NIPS), 2007.

[20] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu, "Sample size selection in optimization methods for machine learning," Mathematical Programming, vol. 134, no. 1, pp. 127–155, 2012.

[21] K. van den Doel and U. Ascher, "Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements," SIAM Journal on Scientific Computing, vol. 34, 2012.