Beating SGD Saturation with Tail-Averaging and Minibatching

Nicole Mücke
Institute for Stochastics and Applications, University of Stuttgart
nicole.muecke@mathematik.uni-stuttgart.de

Gergely Neu
Universitat Pompeu Fabra
gergely.neu@gmail.com

Lorenzo Rosasco
Universita degli Studi di Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology
lrosasco@mit.edu

Abstract

While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the learning properties of many of its practically used variants are still poorly understood. In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and averaging, in particular tail averaging. Our results show how these different variants of SGD can be combined to achieve optimal learning rates, providing practical insights. A novel key result is that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Further, we show that a combination of tail-averaging and minibatching allows more aggressive step-size choices than using either component alone.

1 Introduction

Stochastic gradient descent (SGD) provides a simple and yet stunningly efficient way to solve a broad range of machine learning problems. Our starting observation is that, while a number of variants including multiple passes over the data, mini-batching and averaging are commonly used, their combination and learning properties have been studied only partially. The literature on convergence properties of SGD is vast, but usually only one pass over the data is considered, see, e.g., [23]. In the context of nonparametric statistical learning, which we consider here, the study of one-pass SGD was probably first considered in [35] and then further developed in a number of papers (e.g., [37, 36, 25]). Another line of work derives statistical learning results for one-pass SGD with averaging from a worst-case sequential prediction analysis [29, 18, 28]. The idea of using averaging has a long history, going back at least to the works of [32] and [27]; see also [34] and references therein. More recently, averaging was shown to lead to larger, possibly constant, step-sizes, see [2, 10, 11]. A different take on the role of (weighted) averaging was given in [24], highlighting a connection with ridge regression, a.k.a. Tikhonov regularization. A different flavor of averaging, called tail averaging, was considered for one-pass SGD in [19] in a parametric setting. The role of minibatching has also been considered and shown to potentially lead to linear parallelization speedups, see e.g. [7] and references therein. Very few results consider the role of multiple passes for learning. Indeed, this variant of SGD is typically analyzed for the minimization of the empirical risk, rather than the actual population risk, see for example [4]. To the best of our knowledge, the first paper to analyze the learning properties of multipass SGD was [31], where a cyclic selection strategy was considered.
Other results for multipass SGD were then given in [16] and [20]. Our starting point is the set of results in [21], where optimal rates for multipass SGD were derived, considering also the effect of mini-batching. Following the approach in this latter paper, multipass SGD with averaging, but without minibatching, was analyzed in [26]. In this paper, we develop and improve the above results on two fronts. On the one hand, we consider for the first time the role of multiple passes, mini-batching and averaging at once. On the other hand, we further study the beneficial effect of tail averaging. Both mini-batching and averaging are known to allow larger step-sizes. Our results show that their combination allows even more aggressive parameter choices. At the same time, averaging is known to lead to slower convergence rates in some cases. In a parametric setting, averaging prevents linear convergence rates [2, 11]. In a nonparametric setting, it prevents exploiting the possible regularity of the solution [10], a phenomenon called saturation [12]. In other words, uniform averaging can prevent optimal rates in a nonparametric setting. Our results provide a simple explanation for this effect, showing that it has a purely deterministic nature. Further, we show that tail averaging allows bypassing this problem. These results parallel the findings of [19], showing similar beneficial effects of tail-averaging and minibatching in the finite-dimensional setting. Following [21], our analysis relies on the study of batch gradient descent and then of the discrepancy between batch gradient descent and SGD, with the additional twist that it also considers the role of tail-averaging.

The rest of the paper is organized as follows. In Section 2, we describe the least-squares learning problem that we consider, as well as the different SGD variants we analyze. In Section 3, we collect a number of observations shedding light on the role of uniform and tail averaging. In Section 4, we present and discuss our main results. In Section 5, we illustrate our results via some numerical simulations. Proofs and technical results are deferred to the appendices.

2 Least Squares Learning with SGD

In this section, we introduce the problem of supervised learning with the least squares loss and then present SGD and its variants.

2.1 Least squares learning

We let $(X, Y)$ be a pair of random variables with values in $H \times \mathbb{R}$, with $H$ a real separable Hilbert space. This setting is known to be equivalent to nonparametric learning with kernels [31]. We focus on it since considering infinite dimensions allows to highlight more clearly the regularization role played by the different parameters. Indeed, unlike in finite dimensions, regularization is needed to derive learning rates in this case. Throughout the paper we will suppose that the following assumption holds:

Assumption 1. Assume $\|X\| \le \kappa$ and $|Y| \le M$ almost surely, for some $\kappa, M > 0$.

The problem of interest is to solve
$$\min_{w \in H} L(w), \qquad L(w) = \frac{1}{2}\,\mathbb{E}\big[(Y - \langle w, X\rangle)^2\big], \qquad (1)$$
provided a realization $(x_1, y_1), \ldots, (x_n, y_n)$ of $n$ identical copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $(X, Y)$. Defining
$$\Sigma = \mathbb{E}[X \otimes X], \qquad h = \mathbb{E}[XY], \qquad (2)$$
the optimality condition of problem (1) shows that a solution $w_*$ satisfies the normal equation
$$\Sigma w_* = h. \qquad (3)$$
Finally, recall that the excess risk associated with any $w \in H$ can be written as¹
$$L(w) - L(w_*) = \frac{1}{2}\,\big\|\Sigma^{1/2}(w - w_*)\big\|^2.$$

¹ It is a standard fact that the operator $\Sigma$ is symmetric, positive definite and trace class (hence compact), since $X$ is bounded. Then fractional powers of $\Sigma$ are naturally defined using spectral calculus.
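To make the quantities above concrete, here is a minimal numerical sanity check of the normal equation (3) and the excess-risk identity in the finite-dimensional special case $H = \mathbb{R}^d$. This is our own illustration (all names and values are hypothetical), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d                       # a positive definite covariance, Sigma = E[X X^T]
w_star = rng.standard_normal(d)

# Monte Carlo estimate of h = E[XY] with Y = <w_*, X> + noise
n = 200_000
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Y = X @ w_star + 0.1 * rng.standard_normal(n)
h = (X * Y[:, None]).mean(axis=0)
print(np.max(np.abs(Sigma @ w_star - h)))          # normal equation (3): small up to MC error

# Excess risk identity: L(w) - L(w_*) = 0.5 * ||Sigma^{1/2}(w - w_*)||^2
w = rng.standard_normal(d)
L = lambda v: 0.5 * np.mean((Y - X @ v) ** 2)      # empirical version of the risk in (1)
lhs = L(w) - L(w_star)
rhs = 0.5 * (w - w_star) @ Sigma @ (w - w_star)
print(lhs, rhs)                                     # close up to Monte Carlo error
```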
2.2 Learning with stochastic gradients

We now introduce the various gradient iterations relevant in the following. The basic stochastic gradient iteration is given by the recursion
$$w_{t+1} = w_t - \gamma_t\, x_t\big(\langle x_t, w_t\rangle - y_t\big) \quad \text{for all } t = 0, 1, \ldots,$$
with $w_0 = 0$. For all $w \in H$ and $t = 1, \ldots, n$,
$$\mathbb{E}\big[X_t\big(\langle X_t, w\rangle - Y_t\big)\big] = \nabla L(w), \qquad (4)$$
hence the name. While the above iteration is not ensured to decrease the objective at each step, this procedure and its variants are commonly called Stochastic Gradient Descent (SGD), and we will also use this terminology. The sequence $(\gamma_t)_t$, with $\gamma_t > 0$, is called the step-size or learning rate.

In its basic form, the above iteration prescribes using each data point only once. This is the classical stochastic approximation perspective pioneered by [30]. In practice, however, a number of different variants are considered. In particular, oftentimes data points are visited multiple times, in which case we can write the recursion as
$$w_{t+1} = w_t - \gamma_t\, x_{i_t}\big(\langle x_{i_t}, w_t\rangle - y_{i_t}\big).$$
Here $i_t = i(t)$ denotes a map specifying the strategy with which data are selected at each iteration. Popular choices include: cyclic, where an order over $[n]$ is fixed a priori and data points are visited multiple times according to it; reshuffling, where the order of the data points is permuted after all of them have been sampled once, amounting to sampling without replacement; and finally the most common approach, which is sampling each point with replacement uniformly at random. This latter choice is also the one we consider in this paper. We broadly refer to this variant of SGD as multipass-SGD, referring to the multiple passes over the data set as $t$ grows larger than $n$.

Another variant of SGD is based on considering more than one data point at each iteration, a procedure called mini-batching. Given $b \in [n]$, the mini-batch SGD recursion is given by
$$w_{t+1} = w_t - \gamma_t\, \frac{1}{b} \sum_{i = b(t-1)+1}^{bt} \big(\langle w_t, x_{j_i}\rangle - y_{j_i}\big)\, x_{j_i},$$
where $j_1, \ldots, j_{bT}$ are i.i.d. random variables, distributed according to the uniform distribution on $[n]$. Here the number of passes over the data after $t$ iterations is $\lceil bt/n \rceil$. Mini-batching can be useful for at least two different reasons. The most important is that mini-batches are natural for making the best use of memory resources, in particular when distributed computations are available. Another advantage is that more accurate gradient estimates are available at each step.

Finally, one last idea is to consider averaging of the iterates, rather than working with the final iterate. This is a classical idea in optimization, where it is known to provide improved convergence results [32, 27, 15, 2], but it is also used when recovering stochastic results from a worst-case sequential prediction analysis [33, 17]. More recently, averaging was shown to allow larger step-sizes, see [2, 10, 11]. In the following, we consider a variant of the above idea, namely tail-averaging, where for $0 \le S \le T - 1$ we let
$$\bar{w}_{S,T} = \frac{1}{T - S} \sum_{t = S+1}^{T} w_t.$$
We will occasionally write $\bar{w}_L = \bar{w}_{S,T}$, with $L = T - S$. In the following, we study how the above ideas can be combined to solve problem (1) and how such combinations affect the learning properties of the obtained solutions.
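The variants above compose naturally. As a reference point, here is a minimal sketch of minibatch SGD with replacement sampling and tail averaging for the least squares objective (1). This is a hypothetical implementation of our own, not the authors' code:

```python
import numpy as np

def tail_averaged_minibatch_sgd(X, y, gamma, b, T, S):
    """Minibatch SGD with tail averaging for least squares.

    Constant step-size gamma, minibatches of size b sampled uniformly
    with replacement, and the tail average over iterates S+1, ..., T
    returned, as in the recursion and definition above.
    """
    n, d = X.shape
    rng = np.random.default_rng()
    w = np.zeros(d)                                  # w_0 = 0
    tail_sum = np.zeros(d)
    for t in range(1, T + 1):
        idx = rng.integers(0, n, size=b)             # sampling with replacement
        Xb, yb = X[idx], y[idx]
        w = w - (gamma / b) * Xb.T @ (Xb @ w - yb)   # minibatch gradient step
        if t > S:                                    # accumulate iterates S+1, ..., T
            tail_sum += w
    return tail_sum / (T - S)                        # tail average \bar{w}_{S,T}
```

Setting $b = 1$ and $S = 0$ recovers uniformly-averaged single-sample SGD; the number of passes over the data is $\lceil bT/n \rceil$.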
3 An appetizer: Averaging and Gradient Descent Convergence

Averaging is known to allow larger step-sizes for SGD, but also to lead to slower convergence rates in certain settings [10]. In this section, we present calculations shedding light on these effects. In particular, we show that the slower convergence is a completely deterministic effect and that tail averaging can provide a remedy. In the rest of the paper, we build on these observations to derive novel quantitative results in terms of learning bounds. The starting observation is that, since SGD is based on stochastic estimates of the gradient of the expected risk (cf. equations (1), (4)), it is natural to start from exact gradient descent to understand the role played by averaging. For $\gamma > 0$ and $u_0 = 0$, consider the population gradient descent iteration
$$u_t = u_{t-1} - \gamma\, \mathbb{E}\big[X\big(\langle X, u_{t-1}\rangle - Y\big)\big] = (I - \gamma\Sigma)\, u_{t-1} + \gamma h,$$
where the last equality follows from (2). Then, using the normal equation (3) and a simple induction argument [12], it is easy to see that
$$u_T = g_T(\Sigma)\, \Sigma\, w_*, \qquad g_T(\Sigma) = \gamma \sum_{j=0}^{T-1} (I - \gamma\Sigma)^j. \qquad (5)$$
Here, $g_T$ is a spectral filtering function corresponding to a truncated matrix geometric series (the von Neumann series). For the latter to converge, we need $\gamma$ such that $\|I - \gamma\Sigma\| < 1$, e.g. $\gamma < 1/\sigma_M$, which holds in particular for $\gamma < 1/\kappa^2$ since $\sigma_M = \sigma_{\max}(\Sigma) \le \kappa^2$, hence recovering a classical step-size choice. The above computation provides a way to analyze the convergence of gradient descent. Indeed, one can easily show that
$$w_* - u_T = r_T(\Sigma)\, w_*, \qquad r_T(\Sigma) = (I - \gamma\Sigma)^T,$$
since $g_T(\Sigma)\,\Sigma\, w_* = \big(I - (I - \gamma\Sigma)^T\big)\, w_*$, from basic properties of the Neumann series defining $g_T$. The properties of the so-called residual operator $r_T(\Sigma)$ control the convergence of GD. Indeed, if $\sigma_m = \sigma_{\min}(\Sigma) > 0$, then
$$\big\|\Sigma^{1/2}(u_T - w_*)\big\|^2 = \big\|\Sigma^{1/2} r_T(\Sigma)\, w_*\big\|^2 \le \sigma_M\, (1 - \gamma\sigma_m)^{2T}\, \|w_*\|^2 \le \sigma_M\, e^{-2\sigma_m \gamma T}\, \|w_*\|^2,$$
using the basic inequality $1 + z \le e^z$, highlighting that the population GD iteration converges exponentially fast to the risk minimizer. However, a major caveat is that assuming $\sigma_{\min}(\Sigma) > 0$ is clearly restrictive in an infinite-dimensional (nonparametric) setting, since it effectively implies that $\Sigma$ has finite rank. In general, $\Sigma$ will not be of finite rank, but rather compact with $0$ as the only accumulation point of its spectrum. In this case, it is easy to see that the slower rate
$$\big\|\Sigma^{1/2}(u_T - w_*)\big\|^2 \lesssim \frac{1}{\gamma T}\, \|w_*\|^2$$
holds without any further assumption on the spectrum, since one can show, using spectral calculus and a direct computation², that $\sup_s\, s^{1/2}\, r_T(s) \lesssim 1/\sqrt{\gamma T}$. It is reasonable to ask whether it is possible to interpolate between the above slow and fast rates by making some intermediate assumption. Rather than making assumptions on the spectrum of $\Sigma$, one can assume the optimal solution $w_*$ to belong to a subspace of the range of $\Sigma$, more precisely, that
$$w_* = \Sigma^r v_* \qquad (6)$$
holds for some $r \ge 0$ and $v_* \in H$, where larger values of $r$ correspond to more stringent assumptions. In particular, as $r$ goes to infinity we are essentially assuming $w_*$ to belong to a finite-dimensional space. Assumption (6) is common in the literature on inverse problems [12] and statistical learning [8, 9]. Interestingly, it is also related to so-called conditioning and Łojasiewicz conditions, known to lead to improved rates in continuous optimization, see [13] and references therein. Under assumption (6), and using again spectral calculus, it is possible to show that, for all $r \ge 0$,
$$\big\|\Sigma^{1/2}(u_T - w_*)\big\|^2 = \big\|\Sigma^{1/2} r_T(\Sigma)\, \Sigma^r v_*\big\|^2 \lesssim \Big(\frac{1}{\gamma T}\Big)^{2r+1} \|v_*\|^2.$$
Thus, higher values of $r$ result in faster convergence rates, at the price of more stringent assumptions.

² Setting $\frac{d}{ds}\, s(1 - \gamma s)^T = 0$ gives $1 - \gamma s - s\gamma T = 0$, i.e. $s = \frac{1}{\gamma(T+1)}$, with maximum value $\frac{1}{\gamma(T+1)}\big(1 - \frac{1}{T+1}\big)^T \le \frac{1}{\gamma(T+1)}$.
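These scalar computations are easy to verify numerically, since by spectral calculus each eigenvalue of $\Sigma$ can be treated independently. A small check of our own (grid and values hypothetical) of the Neumann-series identity behind (5) and of the $(1/\gamma T)^{2r+1}$ decay:

```python
import numpy as np

gamma = 0.1
s = np.logspace(-6, 0, 1000)          # eigenvalue grid standing in for spec(Sigma)

# Identity g_T(s) * s = 1 - (1 - gamma*s)^T behind eq. (5)
T = 500
g_T = gamma * np.sum([(1 - gamma * s) ** j for j in range(T)], axis=0)
print(np.allclose(g_T * s, 1 - (1 - gamma * s) ** T))          # True

# sup_s s^{r+1/2} (1 - gamma*s)^T, squared, decays like (1/(gamma*T))^{2r+1}
for r in [0.0, 1.0]:
    for T in [100, 1000, 10_000]:
        val = np.max(s ** (r + 0.5) * (1 - gamma * s) ** T) ** 2
        ref = (1 / (gamma * T)) ** (2 * r + 1)
        print(f"r={r}, T={T}: {val:.2e} vs (1/(gamma*T))^(2r+1) = {ref:.2e}")
```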
3.1 Tail-averaged gradient descent

Given the above discussion, we can derive analogous computations for (tail-)averaged GD and draw some insights. Using (5), for $S < T$ we can write the tail-averaged gradient iterate
$$\bar{u}_{S,T} = \frac{1}{T - S} \sum_{t = S+1}^{T} u_t \qquad (7)$$
as
$$\bar{u}_{S,T} = G_{S,T}(\Sigma)\, \Sigma\, w_*, \qquad G_{S,T}(\Sigma) = \frac{1}{T - S} \sum_{t = S+1}^{T} g_t(\Sigma). \qquad (8)$$
As before, we can analyze convergence by considering a suitable residual operator,
$$w_* - \bar{u}_{S,T} = R_{S,T}(\Sigma)\, w_*, \qquad R_{S,T}(\Sigma) = I - G_{S,T}(\Sigma)\, \Sigma, \qquad (9)$$
which, in this case, can be shown to take the form
$$R_{S,T}(\Sigma) = \frac{(I - \gamma\Sigma)^{S+1}}{\gamma(T - S)}\, \big(I - (I - \gamma\Sigma)^{T - S}\big)\, \Sigma^{-1},$$
where with an abuse of notation we denote by $\Sigma^{-1}$ the pseudoinverse of $\Sigma$. The case of uniform averaging corresponds to $S = 0$, in which case the residual operator simplifies to
$$R_{0,T}(\Sigma) = \frac{(I - \gamma\Sigma)}{\gamma T}\, \big(I - (I - \gamma\Sigma)^T\big)\, \Sigma^{-1}.$$
When $\sigma_m > 0$, the residual operators behave roughly as
$$\big\|R_{S,T}(\Sigma)\big\|^2 \lesssim \frac{e^{-2\sigma_m \gamma (S+1)}}{\big(\gamma(T - S)\,\sigma_m\big)^2}, \qquad \big\|R_{0,T}(\Sigma)\big\|^2 \lesssim \frac{1}{(\gamma T \sigma_m)^2},$$
respectively. This leads to a slower convergence rate for uniform averaging, and shows instead how tail averaging with $S \asymp T$ can preserve the fast convergence of GD. When $\sigma_m = 0$, taking again $S \asymp T$, it is easy to see by spectral calculus that the residual operators behave similarly,
$$\big\|\Sigma^{1/2} R_{S,T}(\Sigma)\big\|^2 \lesssim \frac{1}{\gamma T}, \qquad \big\|\Sigma^{1/2} R_{0,T}(\Sigma)\big\|^2 \lesssim \frac{1}{\gamma T},$$
leading to comparable rates. The advantage of tail averaging becomes apparent again if we consider Assumption (6). In this case, for all $r > 0$, if we take $S \asymp T$,
$$\big\|\Sigma^{1/2} R_{S,T}(\Sigma)\, \Sigma^r\big\|^2 \lesssim \Big(\frac{1}{\gamma T}\Big)^{2r+1}, \qquad (10)$$
whereas with uniform averaging one can only prove
$$\big\|\Sigma^{1/2} R_{0,T}(\Sigma)\, \Sigma^r\big\|^2 \lesssim \Big(\frac{1}{\gamma T}\Big)^{2\min(r,\,1/2)+1}. \qquad (11)$$
One immediate observation following from the above discussion is that uniform averaging induces a so-called saturation effect [12], meaning that the rates do not improve once $r$ exceeds a critical value. As shown above, this effect vanishes when considering tail-averaging, and the convergence rate of GD is recovered. These results are critically important for our analysis and constitute the main conceptual contribution of our paper. They are proved in Appendix B, while Section A.1 highlights their critical role for SGD. To the best of our knowledge, we are the first to highlight this acceleration property of tail averaging beyond the finite-dimensional setting.
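The contrast between (10) and (11) is easy to see numerically. Below is a small, self-contained check of our own (names and grid values hypothetical) that evaluates the scalar residual function over a grid standing in for the spectrum of $\Sigma$ and reports $\sup_s s^{r+1/2}|R(s)|$, whose square controls the bounds above. Uniform averaging stalls once $r > 1/2$, while tail averaging keeps improving:

```python
import numpy as np

gamma, T = 0.1, 10_000
s = np.logspace(-8, 0, 2000)       # eigenvalue grid standing in for spec(Sigma)

def R_ST(s, S, T):
    # tail-averaged residual function, closed form below eq. (9)
    a = 1.0 - gamma * s
    return a ** (S + 1) * (1.0 - a ** (T - S)) / (gamma * (T - S) * s)

for r in [0.25, 0.5, 1.0, 2.0]:
    unif = np.max(s ** (r + 0.5) * np.abs(R_ST(s, 0, T)))       # uniform averaging, S = 0
    tail = np.max(s ** (r + 0.5) * np.abs(R_ST(s, T // 2, T)))  # tail averaging, S = T/2
    print(f"r = {r:4.2f}   uniform: {unif:.2e}   tail: {tail:.2e}")
```

For $r \le 1/2$ the two quantities are comparable, while for larger $r$ the uniform-averaging value stops improving, exactly the saturation effect described above.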
4 Main Results and Discussion

In this section, we present and discuss our main results. We start by presenting a general bound and then use it to derive optimal parameter settings and corresponding performance guarantees. A key quantity in our results is the effective dimension
$$\mathcal{N}(1/\gamma L) = \mathrm{Tr}\Big(\Sigma\,\big(\Sigma + \tfrac{1}{\gamma L}\, I\big)^{-1}\Big),$$
introduced in [38] to generalize results from parametric estimation problems to nonparametric kernel methods. It will be one of the main quantities in our learning bounds. Further, in all our results we require that the stepsize is bounded as $\gamma\kappa^2 < 1/4$, and that the tail length $L = T - S$ is scaled appropriately with the total number of iterations $T$. More precisely, our analysis considers two different scenarios: one where $S = 0$ (plain averaging) is explicitly allowed, and one where $S > 0$, i.e., where we investigate the merits of tail-averaging. To do so, we will assume $0 \le S \le \frac{K-1}{K+1}\, T$ for some $K \ge 1$, and also $T \le (K+1)S$ in the latter case.

The following theorem presents a simplified version of our main technical result, which we present in its general form in the Appendix. Here, we omit constants and lower-order terms for clarity, and give first insights into the interplay between the tuning parameters, namely the step-size $\gamma$, the tail-length $L$, the mini-batch size $b$, and the number of points $n$. Note that in a nonparametric setting these are the quantities controlling learning rates. The following result provides a bound for any choice of the tuning parameters, and will allow deriving optimal choices balancing the various error contributions.

Theorem 1. Let $\alpha \in (0, 1]$, $1 \le L \le T$, and let Assumption 1 hold. Assume $\gamma\kappa^2 < 1/4$ as well as $n \gtrsim \gamma L\, \mathcal{N}(1/\gamma L)$. Then, the excess risk of the tail-averaged SGD iterates satisfies
$$\mathbb{E}\,\big\|\Sigma^{1/2}(\bar{w}_L - w_*)\big\|^2 \lesssim \big\|\Sigma^{1/2} R_L(\Sigma)\, w_*\big\|^2 + \frac{\mathcal{N}(1/\gamma L)}{n} + \frac{\gamma\, \mathrm{Tr}[\Sigma^{\alpha}]}{b}\, (\gamma L)^{\alpha - 1}.$$

The proof of the result is given in Appendix E. We make a few comments. The first term in the bound is the approximation error, already discussed in Section 3. It is controlled by the bound in (10) and is decreasing in $\gamma L$. The second term corresponds to a variance error due to sampling and noise in the data. It depends on the effective dimension, which is increasing in $\gamma L$. The third term is a computational error due to the randomization in SGD. Note how it depends on both $\gamma L$ and the minibatch size $b$: the larger $b$ is, the smaller this error becomes. The dependence of all three terms on $\gamma L$ already suggests at this stage that $(\gamma L)^{-1}$ plays the role of a regularization parameter. We derive our final bound by balancing all terms, i.e., choosing them to be of the same order. To do so, we make additional assumptions. The first one is Eq. (6), enforcing the optimal solution $w_*$ to belong to a subspace of the range of $\Sigma$.

Assumption 2. For some $r \ge 0$ we assume $w_* = \Sigma^r v_*$, for some $v_* \in H$ satisfying $\|v_*\| \le R$.

The larger $r$ is, the more stringent the assumption, or, equivalently, the easier the problem; see Section 3. A second assumption is related to the effective dimension.

Assumption 3. For some $\nu \in (0, 1]$ and $C_\nu < \infty$ we assume $\mathcal{N}(1/\gamma L) \le C_\nu\, (\gamma L)^{\nu}$.

This assumption is common in the nonparametric regression setting, see e.g. [6]. Roughly speaking, it quantifies how far $\Sigma$ is from being of finite rank. Indeed, it is satisfied if the eigenvalues $(\sigma_i)_i$ of $\Sigma$ have a polynomial decay $\sigma_i \lesssim i^{-1/\nu}$. Since $\Sigma$ is trace class, the assumption is always satisfied for $\nu = 1$ with $C_\nu = \kappa^2$. Smaller values of $\nu$ lead to faster convergence rates. The following corollary of Theorem 1, together with Assumptions 2 and 3, derives optimal parameter settings and corresponding learning rates.

Corollary 1. Let all assumptions of Theorem 1 be satisfied, and suppose that Assumptions 2 and 3 also hold. Further, assume either

1. $0 \le r \le 1/2$ and $1 \le L \le T$ (here $S = 0$, i.e., full averaging is allowed), or

2. $1/2 < r$ and $1 \le L < T$, with the additional constraint that, for some $K \ge 2$,
$$\frac{K+1}{K-1}\, S \le T \le (K+1)\, S$$
(only tail-averaging is considered).

Then, for any $n$ sufficiently large, the excess risk of the (tail-)averaged SGD iterate satisfies
$$\mathbb{E}\,\big\|\Sigma^{1/2}(\bar{w}_{L_n} - w_*)\big\|^2 \lesssim n^{-\frac{2r+1}{2r+1+\nu}}$$
for each of the following choices:

(a) $b_n \simeq 1$, $L_n \simeq n$, $\gamma_n \simeq n^{-\frac{2r+\nu}{2r+1+\nu}}$ (one pass over the data);

(b) $b_n \simeq n^{\frac{2r+\nu}{2r+1+\nu}}$, $L_n \simeq n^{\frac{1}{2r+1+\nu}}$, $\gamma_n \simeq 1$ (one pass over the data);

(c) $b_n \simeq n$, $L_n \simeq n^{\frac{1}{2r+1+\nu}}$, $\gamma_n \simeq 1$ ($n^{\frac{1}{2r+1+\nu}}$ passes over the data).

The proof of Corollary 1 is given in Appendix E. It gives optimal rates [6, 5] under different assumptions and choices for the stepsize $\gamma$, the minibatch size $b$ and the tail length $L$, considered as functions of $n$ and of the parameters $r$ and $\nu$ from Assumptions 2 and 3. We now discuss our findings in more detail and compare them to previous related work.

Optimality of the bound: The above results show that different parameter choices achieve the same error bound. The latter is known to be optimal in the minimax sense, see e.g. [6]. As noted before, here we provide simplified statements highlighting the dependence of the bound on the number of points $n$ and on the parameters $r$ and $\nu$ that control the regularity of the problem. These are the quantities controlling the learning rates, and for which lower bounds are available. Note, however, that all the constants in the theorem are worked out and reported in detail in the appendices.

Regularization properties of the tail-length: We recall that for GD it is well known that $(\gamma T)^{-1}$ serves as a regularization parameter, having a quantitatively similar effect to Tikhonov regularization with parameter $\lambda > 0$, see e.g. [12]. More generally, our result shows that in the case of tail averaging the quantity $(\gamma L)^{-1}$ becomes the regularizing parameter, for both GD and SGD.
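To make the prescriptions of Corollary 1 concrete, they can be written as a small helper. This is a hypothetical sketch of our own; the corollary fixes the parameters only up to constants, so the returned values are order-of-magnitude prescriptions only:

```python
def sgd_parameters(n, r, nu, choice="b"):
    """Return (batch size b, tail length L, stepsize gamma) per Corollary 1.

    n: sample size; r: smoothness (Assumption 2); nu: effective-dimension
    exponent (Assumption 3). Constants are ignored, as in the corollary.
    """
    D = 2 * r + 1 + nu                          # common exponent denominator
    if choice == "a":                           # one pass, b = 1, small stepsize
        return 1, round(n), n ** (-(2 * r + nu) / D)
    if choice == "b":                           # one pass, large batch, gamma ~ 1
        return round(n ** ((2 * r + nu) / D)), round(n ** (1 / D)), 1.0
    if choice == "c":                           # full batch, multiple passes
        return round(n), round(n ** (1 / D)), 1.0
    raise ValueError("choice must be 'a', 'b', or 'c'")

# Example: n = 10_000, r = 1, nu = 1/2 gives b ~ n^{5/7}, L ~ n^{2/7}, gamma ~ 1.
print(sgd_parameters(10_000, 1.0, 0.5, choice="b"))
```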
The benefit of tail-averaging: For SGD with $b = 1$ and full averaging, it has been shown in [10] that a single pass over the data (i.e., $T_n = n$) gives optimal rates of convergence, provided that $\gamma_n$ is chosen as in case (a) of the corollary. However, the results in [10] hold only in the case $r \le 1/2$. Indeed, beyond this regime there is a saturation effect which precludes optimality for higher smoothness, see the discussion in Section 3, eq. (11). Our analysis for case (a) shows that optimal rates for all $r \ge 0$ can still be achieved with the same number of passes and step-size by using non-trivial tail averaging. Additionally, we compare our results with those of [26]. In that paper it is shown that multiple passes are beneficial for obtaining improved rates for averaged SGD in a regime where the optimal solution $w_*$ does not belong to $H$ (Assumption 2 does not hold in that case). In that regime, tail-averaging does not improve convergence. Our analysis focuses on the opposite regime, where $w_* \in H$ and saturation slows down the convergence of uniformly-averaged SGD, preventing optimal rates. Here, tail-averaging is indeed beneficial and leads to improved rates.

The benefit of multiple passes and mini-batching: We compare our results with those in [21], where mini-batching but no averaging is considered. In particular, there it is shown that a relatively large stepsize of order $\log(n)^{-1}$ can be chosen, provided the minibatch size is set to $n^{\frac{2r+1}{2r+1+\nu}}$ and $n^{\frac{1}{2r+1+\nu}}$ passes are taken. Comparing to these results, we can see the benefits of combining minibatching with tail averaging. Indeed, from (c) we see that, with a comparable number of passes, we can use a larger, constant step-size already with a much smaller minibatch size. Further, comparing (b) and (c), we see that the settings of $\gamma$ and $L$ are the same, and that there is a full range of possible values for $b_n$ in $\big[n^{\frac{2r+\nu}{2r+1+\nu}}, n\big]$ for which a constant stepsize is allowed, still ensuring optimality. As noted in [21], increasing the minibatch size beyond a critical value does not yield any benefit. Compared to [21], we show that tail-averaging can lead to a much smaller critical minibatch size, and hence to more efficient computations.

Comparison to the finite-dimensional setting: The relationship between the step-size and the batch size in finite dimensions, $\dim H = d < \infty$, was derived in [19], where tail-averaging is also considered, but only with one pass over the data. One of the main contributions of that work is characterizing the largest stepsize that allows achieving statistically optimal rates, showing that the largest permissible stepsize grows linearly in $b$ before hitting a certain quantity $b_{\mathrm{thresh}}$. Setting $b > b_{\mathrm{thresh}}$ results in a loss of computational and statistical efficiency: in this regime, each step of minibatch SGD is exactly as effective in decreasing the bias as a step of batch gradient descent. The critical value $b_{\mathrm{thresh}}$ and the corresponding largest admissible stepsize are problem-dependent and do not depend on the sample size $n$. Notably, the statistically optimal rate of order $\sigma^2 d/n$ is achieved for all constant minibatch sizes, and the particular choice of $b$ only impacts the constants in the decay rate of the bias (which is of the lower order $1/n^2$ anyway). That is, choosing the right minibatch size does not involve a tradeoff between statistical and optimization error. In contrast, our work shows that setting a large batch size $b_n \simeq n^{\alpha}$, $\alpha \in [0, 1]$, yields optimality guarantees in the infinite-dimensional setting.
This is due to the fact that choosing the optimal values for parameters like $\gamma$ and $b$ involves a tradeoff between the bias and the variance in this setting. [19] also show that tail-averaging improves the rate at which the initial bias decays if the smallest eigenvalue of the covariance matrix, $\sigma_{\min}(\Sigma)$, is lower-bounded by a constant. Their analysis of this algorithmic component is based on observations similar to the ones we made in Section 3. Our analysis significantly extends these arguments by showing the usefulness of tail-averaging in cases where $\sigma_{\min}$ is not necessarily lower-bounded.

5 Numerical Illustration

This section provides an empirical illustration of the effects characterized in the previous sections. We focus on two aspects of our results: the benefits of tail-averaging over uniform averaging as a function of the smoothness parameter $r$, and the impact of tail-averaging on the best choice of the minibatch size. All experiments are conducted on synthetic data with $d = 1{,}000$ dimensions, generated as follows. We set $\Sigma$ as a diagonal matrix with entries $\Sigma_{ii} = i^{-1/\nu}$ and choose $w_* = \Sigma^r e$, where $e$ is the vector of all ones. The covariates $X_t$ are generated from a Gaussian distribution with covariance $\Sigma$, and labels are generated as $Y_t = \langle w_*, X_t\rangle + \varepsilon_t$, where $\varepsilon_t$ is standard Gaussian noise. For all experiments, we choose $\nu = 1/2$ and $n = 10{,}000$. With this choice of parameters, we have observed that increasing $d$ beyond $100$ does not yield any noticeable change in the results, indicating that setting $d = 1{,}000$ is an appropriate approximation of the infinite-dimensional setting.

Our first experiment illustrates the saturation effect described in Section 3 (cf. Eqs. (10), (11)) by plotting the respective excess risks of uniformly-averaged and tail-averaged SGD as a function of $r$ (Figure 1(a)). We fix $b = 1$ and set $\gamma = n^{-\frac{2r+\nu}{2r+1+\nu}}$, as recommended by Corollary 1. As predicted by our theoretical results, the two algorithms behave similarly for smaller values of $r$, but uniformly-averaged SGD noticeably starts to lag behind its tail-averaged counterpart for values of $r$ exceeding $1/2$, eventually flattening out and showing no improvement as $r$ increases. On the other hand, the performance of the tail-averaged version continues to improve for large values of $r$, confirming that this algorithm can indeed massively benefit from favorable structural properties of the data.

In our second experiment, we study the performance of both tail- and uniformly-averaged SGD as a function of the stepsize $\gamma$ and the minibatch size $b$ (Figure 1(b), (c)). We fix $r = 1/2$ and set $T = n/b$ for all tested values of $b$, amounting to a single pass over the data. Again, as the theory predicts, performance remains largely constant as long as the product $\gamma b$ remains constant for both algorithms, until a critical threshold stepsize is reached. However, it is readily apparent from the figures that tail-averaging permits the use of larger minibatch sizes, therefore allowing for more efficient parallelization.

Figure 1: Illustration of the effects of tail-averaging and minibatching. (a) Excess risk as a function of $r$ with uniform and tail averaging. (b) Excess risk as a function of stepsize $\gamma$ and minibatch size $b$ for SGD with uniform averaging. (c) Excess risk as a function of stepsize $\gamma$ and minibatch size $b$ for SGD with tail-averaging.
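A compact sketch of the first experiment, reconstructed by us from the description above (names and details such as the tail fraction and the reduced dimension are hypothetical, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nu = 100, 10_000, 0.5                    # smaller d for speed; the paper uses d = 1,000
sigma = np.arange(1, d + 1) ** (-1.0 / nu)     # spectrum of the diagonal Sigma

def excess_risk(w, w_star):
    return np.sum(sigma * (w - w_star) ** 2)   # ||Sigma^{1/2}(w - w_*)||^2

for r in [0.25, 0.75, 1.5]:
    w_star = sigma ** r                                 # w_* = Sigma^r e
    X = rng.standard_normal((n, d)) * np.sqrt(sigma)    # Gaussian covariates, cov = Sigma
    y = X @ w_star + rng.standard_normal(n)             # standard Gaussian label noise
    gamma = n ** (-(2 * r + nu) / (2 * r + 1 + nu))     # stepsize from Corollary 1(a)
    w, unif_sum, tail_sum = np.zeros(d), np.zeros(d), np.zeros(d)
    S = n // 2                                          # tail over the last half of iterates
    for t in range(n):                                  # one pass, b = 1
        i = rng.integers(n)
        w = w - gamma * (X[i] @ w - y[i]) * X[i]
        unif_sum += w
        if t >= S:
            tail_sum += w
    print(f"r = {r}: uniform {excess_risk(unif_sum / n, w_star):.4f}, "
          f"tail {excess_risk(tail_sum / (n - S), w_star):.4f}")
```

Consistent with Figure 1(a), the gap between the two averaging schemes should widen as $r$ grows past $1/2$.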
Acknowledgments

Nicole Mücke is supported by the German Research Foundation under DFG Grant STE 1074/4-1. Gergely Neu was supported by La Caixa Banking Foundation through the Junior Leader Postdoctoral Fellowship Program and a Google Faculty Research Award. Lorenzo Rosasco acknowledges the financial support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS (DLV-777826).

References

[1] R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM Journal on Control and Optimization, 39(3):872-899, 2000.
[2] F. R. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS, pages 773-781, 2013.
[3] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52-72, 2007.
[4] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913-926, 1997.
[5] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971-1013, 2017.
[6] A. Caponnetto and E. De Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2006.
[7] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, pages 1647-1655. Curran Associates, Inc., 2011.
[8] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
[9] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5(1):59-85, 2005.
[10] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Annals of Statistics, 44(4):1363-1399, 2016.
[11] A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(101):1-51, 2017.
[12] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.
[13] G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: Beyond the worst case with the help of geometry. arXiv:1703.09477, 2017.
[14] Z.-C. Guo, S.-B. Lin, and D.-X. Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.
[15] L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31-61, 1996.
[16] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1225-1234. JMLR.org, 2016.
[17] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.
[18] E. Hazan and S. Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489-2512, 2014.
[19] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford.
Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1-42, 2018.
[20] J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. CoRR, abs/1605.08375, 2016.
[21] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1-47, 2017.
[22] J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.
[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[24] G. Neu and L. Rosasco. Iterate averaging as regularization for stochastic gradient descent. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 3222-3242. PMLR, 2018.
[25] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116-1124, 2014.
[26] L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. CoRR, abs/1805.10074, 2018.
[27] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[28] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv:1109.5647, 2011.
[29] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1571-1578, 2012.
[30] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.
[31] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In NIPS, pages 1630-1638, 2015.
[32] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
[33] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.
[34] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[35] S. Smale and Y. Yao. Online learning algorithms. Foundations of Computational Mathematics, 6(2):145-170, 2006.
[36] P. Tarres and Y. Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716-5735, 2014.
[37] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561-596, 2008.
[38] T. Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, 2003.