# Minibatch Stochastic Approximate Proximal Point Methods

Hilal Asi (Stanford University, asi@stanford.edu), Karan Chadha (Stanford University, knchadha@stanford.edu), Gary Cheng (Stanford University, chenggar@stanford.edu), John C. Duchi (Stanford University, jduchi@stanford.edu)

(Denotes equal contribution; authors listed in alphabetical order. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.)

We extend the Approximate-Proximal Point (APROX) family of model-based methods for solving stochastic convex optimization problems, including stochastic subgradient, proximal point, and bundle methods, to the minibatch setting. To do this, we propose two minibatched algorithms for which we prove a non-asymptotic upper bound on the rate of convergence, revealing a linear speedup in minibatch size. In contrast to standard stochastic gradient methods, these methods may have linear speedup in the minibatch setting even for non-smooth functions. Our algorithms maintain the desirable traits characteristic of the APROX family, such as robustness to initial step size choice. Additionally, we show improved convergence rates for "interpolation" problems, which (for example) gives a new parallelization strategy for alternating projections. We corroborate our theoretical results with extensive empirical testing, which demonstrates the gains provided by accurate modeling and minibatching.

## 1 Introduction

We develop parallel stochastic approximate proximal point methods (APROX) for solving the stochastic optimization problem

$$\text{minimize } f(x) = E_P[F(x; S)] = \int_{\mathcal{S}} F(x; s)\, dP(s) \quad \text{subject to } x \in X. \tag{1}$$

Here the set $\mathcal{S}$ is a sample space, and for each $s \in \mathcal{S}$, the function $F(\cdot; s) : \mathbb{R}^n \to \mathbb{R}$ is a closed convex function, subdifferentiable on the closed convex set $X \subset \mathbb{R}^n$. While stochastic gradient methods (SGM) are the de facto choice for problem (1) [32, 22, 8, 29], enjoying several convergence guarantees [32] and straightforward parallel extensions that make them practically attractive [18, 11, 13], they are very sensitive to the objective $f$, noise, and hyperparameter tuning [19, 1, 2]. For example, stochastic gradient methods are extremely sensitive to stepsize, and they may even diverge for objectives that do not satisfy their convergence criteria [2]. Motivated by these limitations, several researchers [6, 16, 10, 12, 2] have developed stochastic (approximate) proximal-point and model-based methods as a more robust alternative to standard gradient methods. These APROX methods, as we explain more carefully in Section 1.1, construct a model of the function at each iterate and update by minimizing a regularized version of the model.

Recent results demonstrate the improvements of these frameworks over standard stochastic gradient methods, as the methods are robust to stepsize choice, adapt to problem difficulty, and converge on a broader range of problems than the stochastic gradient method [12, 2]. Yet these APROX methods are inherently sequential, and as we hit physical limits on processor speeds, it is becoming clear that opportunities for improvements in large-scale computation and energy use must focus on parallelization [14]; it is not immediately apparent how to efficiently parallelize stochastic model-based and proximal point methods.
**Contributions** Motivated by these advantages of APROX over SGM and the importance of parallel computation in large-scale stochastic optimization, we propose extensions of APROX to the minibatch setting that allow fast parallelization. We show that our extensions maintain the desirable robustness properties of APROX and enjoy significant speedups in the minibatch size. In contrast to standard optimization methods, which can only guarantee speedups for smooth functions [18, 11], we show that our APROX extensions have speedups even for non-smooth and non-Lipschitz functions assuming a weak form of strong convexity. Moreover, these algorithms exhibit speedups that are robust to stepsize choices, unlike naive stochastic gradient methods, which require careful stepsize tuning to achieve the desired speedup. Finally, we show on a class of interpolation problems [4, 5] that minibatched APROX algorithms enjoy a linear convergence rate that also improves linearly with minibatch size. Our experimental investigation illustrates the importance of our APROX extensions over standard stochastic gradient methods. Please visit github.com/garyxcheng/parallel-aprox for the code for our methods and experiments.

## 1.1 Preliminaries

The starting point of our methods is the APROX framework [10, 12, 2]. The APROX algorithms rely on creating models of the function $F$, where the model $F_x$ of $F$ at $x$ satisfies the conditions

(C.i) The function $y \mapsto F_x(y; s)$ is convex and subdifferentiable on $X$.

(C.ii) The model $F_x$ satisfies the equality $F_x(x; s) = F(x; s)$ and $F_x(y; s) \le F(y; s)$ for all $y$.

With said model, at iterate $k$, APROX algorithms perform the update

$$x_{k+1} := \mathop{\mathrm{argmin}}_{x \in X} \left\{ F_{x_k}(x; S_k) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2 \right\}. \tag{2}$$

The APROX framework builds on the idea that using better models in the update (2) results in more stable algorithms with better guarantees. The following models are key to illustrating our methods:

Stochastic gradient methods: for some $F'(x; s) \in \partial F(x; s)$, use the linear model
$$F_x(y; s) := F(x; s) + \langle F'(x; s), y - x \rangle. \tag{3}$$

Proximal point methods: use the full proximal model
$$F_x(y; s) := F(y; s). \tag{4}$$

Truncated methods: for some $F'(x; s) \in \partial F(x; s)$, use a truncated version of the function
$$F_x(y; s) := \max\left\{ F(x; s) + \langle F'(x; s), y - x \rangle,\; \inf_{z \in X} F(z; s) \right\}. \tag{5}$$

The truncated model (5) is often easy to apply. In most machine learning applications, loss functions are non-negative, so $F$ is readily modeled by $F_{x_k}(x; s) = \left[F(x_k; s) + \langle F'(x_k; s), x - x_k \rangle\right]_+$. We note that it is possible to use a lower bound instead of the infimum in (5). The full proximal (4) and truncated (5) models provide more accurate approximations of $F$ than the linear model (3), which Asi and Duchi [2, 1], motivated by previous work in the area [16, 15, 10], show yields more robust algorithms with better theoretical and practical convergence.

**Notation** For a convex function $f$, $\partial f(x)$ denotes its subgradient set at $x$, and $f'(x) \in \partial f(x)$ denotes an arbitrary element of the subdifferential. We let $X^\star = \mathop{\mathrm{argmin}}_{x \in X} f(x)$ denote the set of minimizers for problem (1) and $x^\star \in X^\star$ denote a single minimizer. We let $\mathcal{F}_k := \sigma(S_1, \ldots, S_k)$ be the $\sigma$-field generated by the first $k$ random variables $S_i$, so $x_k \in \mathcal{F}_{k-1}$ for all $k$ under iteration (2). We defer proofs to the appendix.
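To make the update (2) and the models (3)-(5) concrete, here is a minimal NumPy sketch (ours, not from the paper's released code) of the resulting single-sample updates in the unconstrained case $X = \mathbb{R}^n$; the proximal example specializes to a squared loss so that the update has a closed form.

```python
import numpy as np

def sgm_update(x, grad, alpha):
    """Linear model (3): the update (2) reduces to a stochastic (sub)gradient step."""
    return x - alpha * grad

def truncated_update(x, loss, grad, alpha, lower_bound=0.0):
    """Truncated model (5) with X = R^n and inf_z F(z; s) = lower_bound
    (e.g., 0 for nonnegative losses). Minimizing the truncated model plus
    (1/(2*alpha)) * ||y - x||^2 gives a step of length
    min(alpha, (loss - lower_bound)/||grad||^2) along the (sub)gradient."""
    g2 = float(grad @ grad)
    if g2 == 0.0:
        return x
    return x - min(alpha, (loss - lower_bound) / g2) * grad

def prox_update_squared_loss(x, a, b, alpha):
    """Full proximal model (4) for F(x; (a, b)) = 0.5*(a @ x - b)^2:
    argmin_y 0.5*(a @ y - b)^2 + (1/(2*alpha)) * ||y - x||^2 in closed form
    (via the Sherman-Morrison formula)."""
    return x - (alpha * (a @ x - b) / (1.0 + alpha * (a @ a))) * a
```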
## 1.2 Related work

Stochastic gradient methods [26] are the most widely used methods for solving stochastic minimization problems; an enormous literature exists [24, 25, 32, 22, 31, 17, 3]. Several researchers also demonstrate the improvements that minibatching and parallelization can provide stochastic gradient methods, which can enjoy linear speedups as batch sizes increase [18, 11, 13, 23, 9]. Other works recognize the instability of stochastic gradient methods and the importance of robustness, demonstrating situations where they can converge slowly as a result of mis-specified stepsizes [22, 3, 1]. Rockafellar [27] introduces proximal point methods, which have seen a resurgence in applications to stochastic optimization [16, 6, 15, 7, 20]. Of most relevance to our work are extensions of stochastic proximal methods that use approximate models in the proximal update [12, 10, 2, 1]. Asi and Duchi [2] develop a stochastic approximate proximal point method, namely APROX, and establish several convergence guarantees and stability properties that are superior to standard stochastic gradient methods. Yet this work does not address our central challenge: how to leverage parallelization.

## 2 Minibatch APROX methods

While it is usually clear how to extend standard stochastic gradient methods to parallel settings, the same is not true for proximal methods, as there are many different ways to model the function or average the iterates. Indeed, in this section, we describe three different methods for extending APROX, all of which are identical for linear models (SGM) but can exhibit different behavior, as reflected by our experiments. Given a choice of model $F_x(\cdot; s)$ and $m$ samples $S_k^{1:m} \in \mathcal{S}^m$, we propose three methods.

**Iterate averaging (IA):** The most natural and direct way to extend APROX to the minibatch setting is to average the individual updates per sample:
$$x_{k+1} = \frac{1}{m} \sum_{i=1}^m x_{k+1}^i \quad \text{where} \quad x_{k+1}^i := \mathop{\mathrm{argmin}}_{x \in X} \left\{ F_{x_k}(x; S_k^i) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2 \right\}. \tag{6}$$
This method's simplicity and (near) full parallelization make it attractive. Its empirical performance, however, is weaker than that of our other proposed methods. Instead of averaging the iterates, we propose an alternative algorithm: at each iteration, perform the update
$$x_{k+1} := \mathop{\mathrm{argmin}}_{x \in X} \left\{ \bar{F}_{x_k}(x; S_k^{1:m}) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2 \right\}, \tag{7}$$
where $\bar{F}_{x_k}(x; S_k^{1:m})$ is a model of the average function
$$\bar{F}(x; S_k^{1:m}) = \frac{1}{m} \sum_{i=1}^m F(x; S_k^i) \tag{8}$$
satisfying Conditions (C.i) and (C.ii). While our theorems hold generally for any $\bar{F}_{x_k}(x; S_k^{1:m})$ satisfying Conditions (C.i) and (C.ii), we now present two instances of the model family which are of practical interest.

**Truncated Average (TruncAv):** The simplest such model beyond the naive linear model is an extension of the truncated model (5). Here, we set
$$\bar{F}_x(y; S^{1:m}) := \max\left\{ \bar{F}(x; S^{1:m}) + \langle \bar{F}'(x; S^{1:m}), y - x \rangle,\; \inf_{z \in X} \bar{F}(z; S^{1:m}) \right\}.$$
In the standard case that the functions $F$ are nonnegative, for $g_k := \bar{F}'(x_k; S_k^{1:m})$, the update (7) is
$$x_{k+1} = x_k - \min\left\{ \alpha_k,\; \frac{\bar{F}(x_k; S_k^{1:m})}{\|g_k\|_2^2} \right\} g_k. \tag{9}$$
The update (9) for the truncated models thus yields an embarrassingly parallelizable scheme: each worker computes $F(x_k; S_k^i)$ and $F'(x_k; S_k^i)$, which need only be averaged to apply the update (9).

**Average of Truncated Models (AvTrunc):** The update (9) ignores some structural aspects of the objectives $F$; it is natural to consider an even more accurate model, which averages models of individual samples. Here we construct a model $F_{x_k}(x; S_k^i) = \max\{ F(x_k; S_k^i) + \langle F'(x_k; S_k^i), x - x_k \rangle,\; \inf_{z \in X} F(z; S_k^i) \}$ for each sample, noting that $\frac{1}{m} \sum_{i=1}^m F_{x_k}(\cdot; S_k^i)$ satisfies Conditions (C.i) and (C.ii), then set
$$x_{k+1} := \mathop{\mathrm{argmin}}_{x \in X} \left\{ \frac{1}{m} \sum_{i=1}^m F_{x_k}(x; S_k^i) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2 \right\}. \tag{10}$$
There is an $O(m)$-dimensional dual problem to (10), which allows this method to be practically parallelizable.
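As a concrete illustration (our own sketch, not the authors' released implementation), the three minibatch updates can be written as follows for nonnegative losses with $X = \mathbb{R}^n$. Each routine consumes the per-sample losses $F(x_k; S_k^i)$ and (sub)gradients $F'(x_k; S_k^i)$, which the $m$ workers can compute in parallel; the AvTrunc routine solves the $m$-dimensional dual of (10), spelled out immediately below, by projected gradient ascent.

```python
import numpy as np

def ia_truncated_update(x, losses, grads, alpha):
    """Iterate averaging (6): apply the single-sample truncated update to each
    sample in the batch (each term can run on a separate worker), then
    average the resulting iterates."""
    new_iterates = []
    for F_i, g_i in zip(losses, grads):
        g2 = float(g_i @ g_i)
        step = 0.0 if g2 == 0.0 else min(alpha, F_i / g2)
        new_iterates.append(x - step * g_i)
    return np.mean(new_iterates, axis=0)

def truncav_update(x, losses, grads, alpha):
    """TruncAv (9): truncate the model of the averaged function (8). Workers
    only report their loss and gradient; the update is a single truncated
    step on the averages (nonnegative losses, X = R^n)."""
    F_bar = float(np.mean(losses))
    g_bar = np.mean(grads, axis=0)
    g2 = float(g_bar @ g_bar)
    step = 0.0 if g2 == 0.0 else min(alpha, F_bar / g2)
    return x - step * g_bar

def avtrunc_update(x, losses, grads, alpha, n_iters=200):
    """AvTrunc (10) through its dual: maximize
        -(alpha/2) * ||G.T @ lam||^2 + lam @ F   over   lam in [0, 1/m]^m,
    where the rows of G are the per-sample gradients and F the per-sample
    losses, then set x_{k+1} = x_k - alpha * G.T @ lam."""
    G = np.asarray(grads, dtype=float)      # shape (m, n)
    F = np.asarray(losses, dtype=float)     # shape (m,)
    m = F.shape[0]
    H = alpha * (G @ G.T)                   # curvature of the dual objective
    step = 1.0 / max(np.linalg.norm(H, 2), 1e-12)
    lam = np.full(m, 1.0 / (2.0 * m))       # start in the interior of the box
    for _ in range(n_iters):
        dual_grad = F - H @ lam             # gradient of the concave objective
        lam = np.clip(lam + step * dual_grad, 0.0, 1.0 / m)
    return x - alpha * (G.T @ lam)
```

For $m = 1$, all three routines reduce to the same single-sample truncated step, consistent with the remark that the strategies coincide for the linear model and differ only in how they combine per-sample models.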
Consider the following iterative algorithm, equivalent to the update (10) (taking $X = \mathbb{R}^n$ and nonnegative losses for simplicity, so that $\inf_{z \in X} F(z; S_k^i) = 0$): compute
$$\lambda_k \in \mathop{\mathrm{argmax}}_{\lambda \ge 0,\; \nu \ge 0} \left\{ -\frac{\alpha_k}{2} \Big\| \sum_{i=1}^m \lambda_{(i)} F'(x_k; S_k^i) \Big\|_2^2 + \sum_{i=1}^m \lambda_{(i)} F(x_k; S_k^i) \right\} \quad \text{subject to} \quad -\frac{1}{m} + \lambda_{(i)} + \nu_{(i)} = 0,$$
that is, a concave quadratic maximization over the box $0 \le \lambda_{(i)} \le \frac{1}{m}$, and then set
$$x_{k+1} = x_k - \alpha_k \sum_{i=1}^m \lambda_{k,(i)} F'(x_k; S_k^i),$$
where $\lambda_{(i)}$ is the $i$th component of the vector $\lambda$.

**Remark** We note that we could easily generalize the TruncAv and AvTrunc methods beyond the truncated model. Let $\mathcal{H}$ denote the set of convex functions. Let $G : \mathcal{H} \times X \to \mathcal{H}$ be some operator which takes a convex function $F$ and a point $y$ in the domain as input and constructs a model of $F$ at $y$ satisfying Conditions (C.i) and (C.ii). Then the TruncAv model is more generally written as $G(\bar{F}(\cdot; S^{1:m}), x_k)$. The AvTrunc model is more generally $\frac{1}{m} \sum_{i=1}^m G(F(\cdot; S^i), x_k)$.

Before proceeding to our theoretical guarantees, we provide a simple example to illustrate the differences between these methods. Consider the problem of finding a point in the intersection of convex sets $C_1, \ldots, C_k$ by minimizing $f(x) = \frac{1}{k} \sum_{i=1}^k \mathrm{dist}(x, C_i)$. Let $m = 2$, and consider the truncated model (5), which exhibits different behavior for each parallelization strategy. Figure 1 illustrates the IA, TruncAv, and AvTrunc updates given infinite stepsize (which still guarantees convergence if there is $x^\star \in \cap_i C_i$ [2]). In this case, iterate averaging (6) projects the current iterate $x_k$ to the two sets in the batch and averages these updates. TruncAv (9) constructs an (average) hyperplane that gives a (somewhat) good representation of the two sets and projects to this set. AvTrunc (10) provides a more accurate representation of the two sets using two hyperplanes and projects to this set.

Figure 1: Updates for (a) IA (6) with the truncated model, (b) TruncAv (9), and (c) AvTrunc (10).

## 3 Non-asymptotic convergence guarantees and minibatch speedups

Having described our methods, we prove several convergence guarantees and minibatch speedups for these methods. We begin in Section 3.1 with a standard speedup result, which resembles known results for SGM and holds for any model satisfying Conditions (C.i) and (C.ii). In Section 3.2 we show that, somewhat surprisingly, our methods using the full proximal model (4) enjoy a linear speedup in minibatch size even for non-smooth functions.

## 3.1 Speedup for smooth functions

We begin by proving minibatch speedup results for smooth functions. As is the case for stochastic gradient methods [11], our results require a bound on the variance of $\nabla F(x_k; S_k)$. We therefore begin with a one-step progress guarantee for any APROX model that depends on the noise of the estimate of $\nabla f(x_k)$.

**Lemma 3.1.** Let $f(\cdot)$ be convex and have $L$-Lipschitz gradient. Define the function value errors $e_k = [F_{x_k}(x^\star; S_k) - f(x^\star)] - [F(x_k; S_k) - f(x_k)]$ and let $\alpha_k = \frac{1}{L + \eta_k}$. Let $x_{k+1}$ be generated using the update (2) with any model $F_{x_k}(x; S_k)$ satisfying (C.i) and (C.ii). Then
$$f(x_{k+1}) - f(x^\star) \le \frac{1}{2\alpha_k} \left[ \|x_k - x^\star\|_2^2 - \|x_{k+1} - x^\star\|_2^2 \right] + e_k + \frac{1}{2\eta_k} \|\nabla F(x_k; S_k) - \nabla f(x_k)\|_2^2.$$

By noting that $E[e_k] \le 0$, Lemma 3.1 nearly immediately implies the following theorem.

**Theorem 1.** Let the conditions of Lemma 3.1 hold and $\eta_k$ be non-decreasing. Additionally, assume that $\mathrm{diam}(X) = \sup_{x, x' \in X} \|x - x'\|_2 \le R$ and $E[\|\nabla F(x; S) - \nabla f(x)\|_2^2] \le \sigma_0^2$ for $x \in X$. Then
$$\sum_{i=1}^k E[f(x_{i+1}) - f(x^\star)] \le \frac{L R^2}{2} + \frac{\eta_k R^2}{2} + \sum_{i=1}^k \frac{\sigma_0^2}{2 \eta_i}.$$

Having established a convergence result that depends on the noise of the gradient estimates, a speedup guarantee for the minibatched APROX methods is nearly immediate. Equation (7) applies APROX updates to the average (8), reducing variance by a factor of $m$, giving Corollary 3.1.
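The factor-of-$m$ variance reduction invoked above is easy to check numerically; the following small sketch (ours, on synthetic least-squares data with arbitrary sizes and seed) estimates $E\|\nabla \bar{F}(x; S^{1:m}) - \nabla f(x)\|_2^2$ for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 40
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
x = rng.standard_normal(d)

grads = A * (A @ x - b)[:, None]        # grad F(x; s_i) for F = 0.5*(a @ x - b)^2
full_grad = grads.mean(axis=0)          # grad f(x)

for m in [1, 4, 16, 64]:
    # Monte Carlo estimate of E|| mean of m sampled gradients - grad f(x) ||^2
    errs = []
    for _ in range(2000):
        batch = rng.choice(n, size=m, replace=False)
        errs.append(float(np.sum((grads[batch].mean(axis=0) - full_grad) ** 2)))
    print(m, round(float(np.mean(errs)), 3))   # decreases roughly like 1/m
```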
**Corollary 3.1.** Let the conditions of Theorem 1 hold, let $\eta_k = \eta_0 \sqrt{k}$ with $\eta_0 = \frac{\sigma_0}{\sqrt{m}\, R}$, and let $x_k$ be generated using (7) with any model satisfying (C.i) and (C.ii) with minibatch size $m$. Then the average $\bar{x}_k = \frac{1}{k} \sum_{i=2}^{k+1} x_i$ satisfies
$$E[f(\bar{x}_k) - f(x^\star)] \le \frac{L R^2}{2k} + \frac{3 R \sigma_0}{2 \sqrt{mk}}.$$

When the batch size $m \lesssim \frac{k \sigma_0^2}{L^2 R^2}$, the second term dominates the rate of convergence. Letting $T(\epsilon)$ denote the number of iterations to achieve $E[f(x_{T(\epsilon)}) - f(x^\star)] \le \epsilon$, we obtain that $T(\epsilon) \lesssim \frac{R^2 \sigma_0^2}{\epsilon^2 m}$, that is, there is a linear speedup as a function of the minibatch size $m$. This is similar to the speedup that standard stochastic gradient methods achieve [18, 11], and it is minimax optimal.

## 3.2 Speedup for non-smooth functions

We turn now to a complementary look at potentially non-Lipschitz and non-smooth functions, studying the minibatch APROX framework (7) using the full proximal model (4), that is, a minibatched stochastic proximal point method. Here, we show a speedup as the minibatch size grows, but in contrast to the previous section, it has the benefit that it requires only that the noise of the gradient estimate is bounded over the optimal set $X^\star$. Here, we consider the following restricted strong convexity (RSC) assumption [2].

**Assumption A1 (Restricted strong convexity).** The functions $F(\cdot; s)$ are strongly convex w.r.t. the matrix $\Sigma(s) \succeq 0$, i.e., for arbitrary $x$ and $y$,
$$F(y; s) \ge F(x; s) + F'(x; s)^T (y - x) + \tfrac{1}{2} (y - x)^T \Sigma(s) (y - x) \quad \text{for all } F'(x; s) \in \partial F(x; s).$$
The matrix $\Sigma(S)$ satisfies $E[\Sigma(S)] \succeq \lambda_{\min} I_{n \times n}$ with $\lambda_{\min} > 0$.

Fixing an (otherwise arbitrary) constant $c > 1$, we define
$$\Sigma_k = E\left[ \frac{\Sigma(S)}{1 + 2 \alpha_k \lambda_{\max}(\Sigma(S))} \right] \quad \text{and} \quad \lambda_k = \lambda_{\min}(\Sigma_k).$$
We then have the following convergence guarantee for minibatched stochastic proximal point methods (4), which is a consequence of Proposition 5 of Asi and Duchi [2].

**Corollary 3.2.** Let Assumption A1 hold and assume $E[\|F'(x^\star; S)\|_2^2] \le \sigma_1^2$. For $\alpha_k = \alpha_0 k^{-\beta}$ with $\beta \in (0, 1)$, the APROX method with the proximal point model (4) using batch size $m$ guarantees
$$E[\|x_{k+1} - x^\star\|_2^2] \le \exp\left( -\frac{1}{c} \sum_{i=1}^k \lambda_i \alpha_i \right) \|x_1 - x^\star\|_2^2 + C \frac{\sigma_1^2}{m \lambda_0} \alpha_k \log k,$$
where $C$ is a numerical constant and $\lambda_0 = \lambda_{\min}(\Sigma_0)$.

The second term dominates the convergence rate in Corollary 3.2, so taking $\alpha_k = b/k$ for a large constant $b$, we obtain $E[\|x_{k+1} - x^\star\|_2^2] \le O\!\left( \frac{\sigma_1^2 \log k}{m k \lambda_0} \right)$, demonstrating the desired speedup in $m$. Existing results for standard stochastic gradient methods do not exhibit a speedup in the batch size for non-smooth functions [11].

## 4 Linear rates and speedups for interpolation problems

The previous section shows that minibatching produces speedups for APROX for certain families of functions. Here, we consider a specific class of interpolation problems, where we will show linear convergence rates and minibatch speedups for APROX.

**Definition 4.1.** The optimization problem (1) is an interpolation problem if there exists $x^\star \in X^\star := \mathop{\mathrm{argmin}}_{x \in X} f(x)$ such that for $P$-almost all $s \in \mathcal{S}$, we have $\inf_{x \in X} F(x; s) = F(x^\star; s)$.

While this restricts the class of objectives, many problems satisfy this condition [21, 28, 2], including overdetermined linear systems, finding a point in the intersection of convex sets, and modern machine learning problems where it is possible to achieve zero training loss [4, 5]. For interpolation problems, we consider bounding the noise of the gradient estimate by its suboptimality.

**Assumption A2.** There is $\sigma_2^2 < \infty$ such that for every $x \in X$,
$$E[\|\nabla f(x) - \nabla F(x; S)\|_2^2] \le \sigma_2^2 \, \mathrm{dist}(x, X^\star)^2.$$

Assumption A2 holds for many interpolation problems. For example, consider a linear regression problem with data $s = (a, b) \in \mathbb{R}^n \times \mathbb{R}$, $a^T x^\star = b$ for all $(a, b)$, and $F(x; (a, b)) = \frac{1}{2}(a^T x - b)^2$. We immediately have $\mathrm{Var}(\nabla F(x; S)) \le E[\|a\|_2^2 \langle a, x - x^\star \rangle^2]$, evidently satisfying Assumption A2.
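To make Assumption A2 concrete, the following self-contained check (our own illustration; sizes and seed are arbitrary) empirically estimates the ratio $E\|\nabla F(x; S) - \nabla f(x)\|_2^2 / \mathrm{dist}(x, X^\star)^2$ for the noiseless linear regression example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 40
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                      # noiseless targets: the interpolation setting

def a2_ratio(x):
    """Empirical E||grad F(x;S) - grad f(x)||^2 / ||x - x_star||^2 for
    F(x; (a, b)) = 0.5*(a @ x - b)^2; Assumption A2 asks this to stay bounded.
    (Here A has full column rank, so X* = {x_star} and dist(x, X*) = ||x - x_star||.)"""
    residuals = A @ x - b                        # shape (n,)
    sample_grads = A * residuals[:, None]        # grad F(x; s_i) = (a_i @ x - b_i) a_i
    full_grad = sample_grads.mean(axis=0)        # grad f(x)
    variance = np.mean(np.sum((sample_grads - full_grad) ** 2, axis=1))
    return variance / float((x - x_star) @ (x - x_star))

print([round(a2_ratio(x_star + rng.standard_normal(d)), 2) for _ in range(5)])
```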
The assumption implies linear convergence guarantees:

**Proposition 1.** Assume $f$ is $\lambda$-strongly convex, has $L$-Lipschitz gradients, and satisfies Assumption A2 and Definition 4.1. Let $x_k$ be generated using update (7) with any model satisfying (C.i) and (C.ii) and minibatch size $m > \sigma_2^2 / (\lambda (L + \lambda))$. Set $\alpha_k = (L + \eta)^{-1}$. For $\eta = \max\left\{ L,\, \frac{8 \sigma_2^2}{m \lambda} \right\}$, we have
$$E[\mathrm{dist}(x_k, X^\star)^2] \le \exp\left( -c\, k \min\left\{ \frac{\lambda}{L},\, \frac{m \lambda^2}{\sigma_2^2} \right\} \right) E[\mathrm{dist}(x_0, X^\star)^2]$$
for a numerical constant $c > 0$.

It is possible to relax the strong convexity condition on $f$ into a weaker quadratic growth assumption:

**Assumption A3 (Quadratic growth).** There exist $\lambda_0, \lambda_1 > 0$ such that for all $x \in X$ and $\alpha > 0$,
$$(f(x) - f(x^\star)) \min\left\{ \alpha,\, \frac{f(x) - f(x^\star)}{\|\nabla f(x)\|_2^2} \right\} \ge \min\{\lambda_0 \alpha,\, \lambda_1\} \, \mathrm{dist}(x, X^\star)^2.$$

Functions satisfying Assumption A3 need only grow quadratically away from their minimizers, so they are less restrictive than a global strong convexity assumption. We then have the following.

**Theorem 2.** Assume $f$ has $L$-Lipschitz gradients and satisfies Assumptions A2, A3, and Definition 4.1. Define the step sizes $\alpha_k = (2L + 2\eta_k)^{-1}$, where $\sum_{i=1}^\infty \eta_i^{-1} = \infty$ and $\sum_{i=1}^\infty \eta_i^{-2} < \infty$, and let $K$ be the largest index $k$ with $\alpha_k > \lambda_1 / (2 \lambda_0)$ (so that $\alpha_k \le \lambda_1 / (2\lambda_0)$ for all $k > K$). If $x_k$ is generated using update (7) with any model satisfying (C.i) and (C.ii) and minibatch size $m$, then
$$E[\mathrm{dist}(x_k, X^\star)^2] \le \exp\left( \sum_{i=K}^{k+1} \left( -\lambda_0 \alpha_i + \frac{\sigma_2^2}{m (L + \eta_i) \eta_i} \right) \right) \mathrm{dist}(x_0, X^\star)^2.$$

To give some intuition for this bound, we provide a heuristic sketch for stepsizes with a given minibatch size $m$, assuming for simplicity that $\lambda_0 = \lambda_1 = 1$. Take $\eta_k = \eta_0 k^\beta / \sqrt{m}$, where $\beta \in (1/2, 1)$. Then $\sum_{i=1}^k \frac{\sigma_2^2}{m (L + \eta_i) \eta_i} \lesssim \frac{\sigma_2^2}{\eta_0^2}$ and $K \approx (\sqrt{m} / \eta_0)^{1/\beta}$, while for large $k$ we have $\sum_{i=K+1}^k \alpha_i \gtrsim k^{1-\beta} \sqrt{m} / \eta_0$. This gives a heuristic convergence guarantee of
$$E[\mathrm{dist}(x_k, X^\star)^2] \lesssim \exp\left( -\frac{k^{1-\beta} \sqrt{m}}{\eta_0} + \frac{\sigma_2^2}{\eta_0^2} \right) \mathrm{dist}(x_0, X^\star)^2,$$
which (taking the limit as $\beta \downarrow \tfrac{1}{2}$) gives $E[\mathrm{dist}(x_k, X^\star)^2] \lesssim \exp(-\sqrt{km} / \eta_0) \, \mathrm{dist}(x_0, X^\star)^2$, exhibiting a linear speedup in $m$.

## 5 Experiments

Figure 2: Speedups with best possible stepsizes vs. batch size m (panels: noiseless l1 regression, noiseless logistic regression, noisy l1 regression, noisy logistic regression; curves: SGM, AvTrunc, TruncAv, IA, Prox).

Our goal now is to demonstrate the speedup and robustness of APROX methods with minibatches, comparing the relative performance of the proposed methods. We compare

1. SGM: stochastic gradient methods, i.e., the linear model (3).
2. Proximal (Prox): the full proximal model (4) with the averaged function (8).
3. Truncated-IA (IA): the truncated model (5) in iterate averaging (6).
4. TruncAv (9).
5. AvTrunc (10).

We study performance on linear, absolute loss, and logistic regression. In each case, we have data in the form of a matrix $A \in \mathbb{R}^{n \times d}$. We have $b \in \mathbb{R}^n$ for linear and absolute loss regression, and $b \in \{-1, +1\}^n$ for logistic regression. We use $n = 1000$, $d = 40$, minibatch sizes $m \in \{1, 4, 8, 16, 32, 64\}$, and initial stepsizes $\alpha_0 \in \{10^{-2}, 10^{-1.5}, \ldots, 10^{2.5}, 10^3\}$ ($\alpha_0 \in \{10^{-2}, 10^{-1.5}, \ldots, 10^{4.5}, 10^5\}$ for logistic regression). For all experiments we run 30 trials with different seeds and plot the 95% confidence sets.
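For the linear regression experiments, the Prox baseline above (the full proximal model (4) applied to the averaged minibatch objective (8)) has a closed-form update; a minimal sketch of one such update (ours, for the unconstrained case) follows.

```python
import numpy as np

def prox_minibatch_least_squares(x, A_batch, b_batch, alpha):
    """Full proximal model (4) on the averaged minibatch objective (8) for
    F(x; (a, b)) = 0.5 * (a @ x - b)^2, i.e. solve exactly
        argmin_y  (1/(2m)) * ||A_batch @ y - b_batch||^2
                  + (1/(2*alpha)) * ||y - x||^2
    via its normal equations (a single d x d linear system per update)."""
    m, d = A_batch.shape
    lhs = np.eye(d) + (alpha / m) * (A_batch.T @ A_batch)
    rhs = x + (alpha / m) * (A_batch.T @ b_batch)
    return np.linalg.solve(lhs, rhs)
```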
Figure 3: Speedup vs. stepsizes for linear regression (panels: noiseless m = 8, noiseless m = 32, noisy m = 8, noisy m = 32; curves: SGM, AvTrunc, TruncAv, IA, Prox).

Figure 4: Time to convergence (iterations to ε) vs. stepsizes for noiseless absolute regression (curves: SGM, AvTrunc, TruncAv, IA, Prox).

Figure 5: Time to convergence (iterations to ε) vs. stepsizes for absolute regression, noisy with σ = 0.5 (curves: SGM, AvTrunc, TruncAv, IA, Prox).

We describe the objective function and noise-adding mechanism for each experiment below.

a) Linear regression: We have $f(x) = \frac{1}{2n} \|Ax - b\|_2^2$. For each experiment we first generate a random matrix $A$ and $x^\star \sim N(0, I_d)$, and then use $b = Ax^\star + \sigma v$ with $v \sim N(0, I_n)$. Varying $\sigma$ gives different noise levels; in the noisy setting for our experiments, we set $\sigma = 0.5$.

b) Absolute loss regression: We have $f(x) = \frac{1}{2n} \|Ax - b\|_1$. We generate a random matrix $A$, $x^\star \sim N(0, I_d)$, and use $b = Ax^\star + v$, where $v_i \sim \mathrm{Lap}(0, \sigma/\sqrt{2})$ ($\mathrm{Lap}(\cdot, \cdot)$ denotes the Laplace distribution). Varying $\sigma$ gives different noise levels; in the noisy setting for our experiments, we set $\sigma = 0.5$.

c) Logistic regression: We have $f(x) = \frac{1}{2n} \sum_{i=1}^n \log(1 + \exp(-b_i \langle a_i, x \rangle))$. We generate a random matrix $A$, $x^\star \sim N(0, I_d)$, and set $b_i = \mathrm{sign}(\langle a_i, x^\star \rangle)$. To add noise, we flip each sign in $b$ independently with probability $p$; varying $p$ gives different noise levels, and in the noisy setting for our experiments, we set $p = 0.01$.

For the experiments in the main paper, we set the condition number of $A$ to 1. Please refer to the appendix for experiments where the condition number of $A$ is set to 10. For each of the problem types, we use stepsizes $\alpha_k = \alpha_0 k^{-1/2}$ and find the number $T_m(\alpha_0)$ of iterations $k$ required to reach $\varepsilon$ accuracy, $f(x_k) - f(x^\star) \le \varepsilon$. A maximum-iteration termination threshold is chosen such that roughly three out of the five methods converge to $\varepsilon$ accuracy before the threshold; if a method does not converge in that time, we set $T_m(\alpha_0)$ to the threshold value. We also let $T^\star_m = \min_{\alpha_0} T_m(\alpha_0)$ denote the smallest time to convergence for a given method using batch size $m$. The supplement contains our full experimental setup.
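As a rough sketch of the measurement protocol just described (ours; the helper below is specialized to the linear regression objective, and the update routine may be any of the minibatch updates sketched earlier, e.g. `prox_minibatch_least_squares`):

```python
import numpy as np

def iterations_to_eps(update_fn, A, b, alpha0, m, eps,
                      max_iters=100_000, seed=0):
    """Return T_m(alpha0): the first iteration k with f(x_k) - f(x*) <= eps,
    or max_iters if the method does not reach eps accuracy by then.
    Specialized here to f(x) = (1/(2n)) * ||A x - b||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    f = lambda z: 0.5 * np.mean((A @ z - b) ** 2)
    f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
    for k in range(1, max_iters + 1):
        batch = rng.choice(n, size=m, replace=False)
        x = update_fn(x, A[batch], b[batch], alpha0 * k ** (-0.5))
        if f(x) - f_star <= eps:
            return k
    return max_iters

# Speedup curves (schematically, following the definitions above):
#   T_star_m  = min over the alpha0 grid of T_m(alpha0), aggregated over seeds
#   Fig. 2 plots T_star_1 / T_star_m against m
#   Fig. 3 plots T_star_1 / T_m(alpha0) against alpha0
```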
We present three types of plots.

**Best speedups for minibatching (Fig. 2)** For each method, we plot $T^\star_1 / T^\star_m$ against the minibatch size $m$ to show the speedup minibatching provides using the best step sizes. This shows the best possible speedup obtained by minibatching through tuning the initial step size $\alpha_0$. All algorithms except the iterate averaging method IA (6) enjoy a linear speedup w.r.t. minibatch size for small minibatches before it tapers off, in both absolute and logistic regression. In the absolute regression case, the full proximal model performs much better than any other model, while in logistic regression the behaviour of each proposed model is comparable; this is consistent with our theory, as our speedup result for the proximal point method (Cor. 3.2) does not require smoothness, while our results for TruncAv and AvTrunc (Thms. 1 and 2) hold for smooth functions.

**Speedups w.r.t. step-size (Fig. 3)** For each method and minibatch size, we plot $T^\star_1 / T_m(\alpha_0)$ against the initial step size $\alpha_0$. This illustrates the robustness of the speedups obtained by minibatching to the choice of step size $\alpha_0$. In Fig. 3, we expect the stochastic gradient method to exhibit improved performance for good stepsizes, though this improvement is not robust: it achieves no speedup when stepsizes are poorly chosen. For the parallel APROX methods (AvTrunc and TruncAv), however, we observe that the speedup persists over many stepsizes, essentially perfectly in the noiseless case and with some degradation in the noisy case.

**Time to solution w.r.t. step-size (Figs. 4 and 5)** For each method and minibatch size, we plot $T_m(\alpha_0)$ against the initial step size $\alpha_0$. Whereas the previous plots explore relative performance, this plot characterizes performance on an absolute scale. In the noiseless case (Fig. 4), we clearly observe the robustness of model-based methods compared to SGM. IA performs better than SGM, but worse than the others; TruncAv, AvTrunc, and the full prox method have similar performance. In the noisy case, surprisingly, IA is more robust to stepsizes, though its performance under the best stepsize is not as good as AvTrunc or TruncAv. We believe this is a consequence of its extremely conservative updates. A natural next direction for future work here is to understand precisely which updates can yield improvement with parallelization.

## Broader Impact

Data centers draw increasing amounts of the total energy we consume, and increasing applications of machine learning mean that model-fitting and parameter exploration require a larger and larger proportion of their energy expenditures [1, 14, 30]. Indeed, as Asi and Duchi [1] note, the energy to train and tune some models is roughly on the scale of driving thousands of cars from San Francisco to Los Angeles, while training a modern transformer network (with architecture search) generates roughly six times the total CO2 of an average car's lifetime [30]. It is thus centrally important to build more efficient and robust methods, which allow us to avoid wasteful hyperparameter search and simply work. A major challenge in building better algorithms is that fundamental physical limits have forced CPU speed and energy to essentially plateau; only through parallelization can we harness increased speed and reduce the energy required to fit models [14]. In this context, our methods take a step toward reducing the energy and overhead needed to perform machine learning.

Taking a step farther back, we believe optimization and model-fitting research in machine learning should refocus its attention: rather than developing algorithms that, with appropriate hyperparameter tuning, achieve state-of-the-art accuracy for a given dataset, we should evaluate algorithms by whether they robustly work. This would allow a more careful consideration of an algorithm's costs and benefits: is it worth 2× faster training, for appropriate hyperparameters, if one has to spend 25× as much time to find the appropriate algorithmic hyperparameters? Even more, as Strubell et al. [30] point out, the extraordinary costs of hyperparameter tuning for fitting large-scale models price many researchers out of making progress on certain frontiers; to the extent that we can mitigate these challenges, we will allow more equity in who can help machine learning progress.

## Funding Transparency Statement

Funding in direct support of this work: NSF CAREER CCF-1553086, ONR YIP N00014-19-2288, Sloan Foundation, NSF HDR 1934578 (Stanford Data Science Collaboratory), Stanford DAWN Consortium, and the Professor Michael J. Flynn Stanford Graduate Fellowship.

## References
[1] H. Asi and J. C. Duchi. The importance of better models in stochastic optimization. Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019. URL https://doi.org/10.1073/pnas.1908018116.

[2] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019. URL https://arxiv.org/abs/1810.05633.

[3] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459, 2011.

[4] M. Belkin, D. Hsu, and P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pages 2300–2311. Curran Associates, Inc., 2018.

[5] M. Belkin, A. Rakhlin, and A. B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619, 2019.

[6] D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, Series B, 129:163–195, 2011.

[7] P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.

[8] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, 2007.

[9] S. Chaturapruek, J. C. Duchi, and C. Ré. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems 28, 2015.

[10] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

[11] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

[12] J. C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.

[13] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

[14] S. Fuller and L. Millett. The Future of Computing Performance: Game Over or Next Level? National Academies Press, 2011.

[15] N. Karampatziakis and J. Langford. Online importance weight aware updates. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.

[16] B. Kulis and P. Bartlett. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[17] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003.

[18] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, Series A, 133(1–2):365–397, 2012.

[19] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2017.

[20] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(212), 2018.

[21] S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[22] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[23] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.

[24] B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

[25] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[26] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[27] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877–898, 1976.

[28] M. Schmidt and N. Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370 [math.OC], 2013.

[29] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 127(1):3–30, 2011.

[30] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

[31] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

[32] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.