# Without-Replacement Sampling for Stochastic Gradient Methods

Ohad Shamir
Department of Computer Science and Applied Mathematics
Weizmann Institute of Science, Rehovot, Israel
ohad.shamir@weizmann.ac.il

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled with replacement. In contrast, sampling without replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data. Moreover, we describe a useful application of these results in the context of distributed optimization with randomly-partitioned data, yielding a nearly-optimal algorithm for regularized least squares (in terms of both communication complexity and runtime complexity) under broad parameter regimes. Our proof techniques combine ideas from stochastic optimization, adversarial online learning and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems.

## 1 Introduction

Many canonical machine learning problems boil down to solving a convex empirical risk minimization problem of the form

$$\min_{\mathbf{w} \in \mathcal{W}} F(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m} f_i(\mathbf{w}), \tag{1}$$

where each individual function $f_i(\cdot)$ is convex (e.g. the loss on a given example in the training data), and the set $\mathcal{W} \subseteq \mathbb{R}^d$ is convex. In large-scale applications, where both $m$ and $d$ can be huge, a very popular approach is to employ stochastic gradient methods. Generally speaking, these methods maintain some iterate $\mathbf{w}_t \in \mathcal{W}$, and at each iteration sample an individual function $f_i(\cdot)$ and perform some update to $\mathbf{w}_t$ based on $\nabla f_i(\mathbf{w}_t)$. Since the update is with respect to a single function, it is usually computationally cheap. Moreover, when the sampling is done independently and uniformly at random, $\nabla f_i(\mathbf{w}_t)$ is an unbiased estimator of the true gradient $\nabla F(\mathbf{w}_t)$, which allows for good convergence guarantees after a reasonable number of iterations (see for instance [18, 15]).

However, in practical implementations of such algorithms, it is actually quite common to use without-replacement sampling, or equivalently, to pass sequentially over a random shuffling of the functions $f_i$. Intuitively, this forces the algorithm to process all data points more equally, and often leads to better empirical performance. Moreover, without-replacement sampling is often easier and faster to implement, as it requires sequential data access, as opposed to the random data access required by with-replacement sampling (see for instance [3, 16, 8]).
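To make the contrast concrete, here is a minimal sketch of the two sampling schemes for plain SGD on Eq. (1). The squared-loss objective, step size, and synthetic data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sgd(X, y, T, eta=0.1, with_replacement=True, seed=0):
    """Run T steps of SGD on F(w) = (1/m) sum_i 0.5*(<w,x_i> - y_i)^2.

    with_replacement=True  : sample an index i.i.d. uniformly each step.
    with_replacement=False : pass sequentially over one random shuffle
                             (requires T <= m, i.e. at most one pass).
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    order = rng.permutation(m)  # used only in the without-replacement case
    for t in range(T):
        i = rng.integers(m) if with_replacement else order[t]
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of the single loss f_i
        w -= eta * grad
    return w

rng = np.random.default_rng(1)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
for wr in (True, False):
    w = sgd(X, y, T=1000, with_replacement=wr)
    print(wr, 0.5 * np.mean((X @ w - y) ** 2))
```

Note how the without-replacement variant reads the data strictly sequentially after the initial shuffle, which is exactly the data-access advantage discussed above.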
### 1.1 What is Known so Far?

Unfortunately, without-replacement sampling is not covered well by current theory. The challenge is that unlike with-replacement sampling, the functions processed at every iteration are not statistically independent, and their correlations are difficult to analyze. Since this lack of theory is the main motivation for our paper, we describe the existing known results in some detail before moving to our contributions.

To begin with, there exist classic convergence results which hold deterministically for every order in which the individual functions are processed, and in particular when we process them by sampling without replacement (e.g. [14]). However, these can be exponentially worse than those obtained using random without-replacement sampling, and this gap is inevitable (see for instance [16]).

More recently, Recht and Ré [16] studied this problem, attempting to show that at least for least squares optimization, without-replacement sampling is always better (or at least not substantially worse) than with-replacement sampling on a given dataset. They showed that this reduces to a fundamental conjecture about arithmetic-mean inequalities for matrices, and provided partial results in that direction, such as when the individual functions themselves are assumed to be generated i.i.d. from some distribution. However, the general question remains open.

In a recent breakthrough, Gürbüzbalaban et al. [8] provided a new analysis of gradient descent algorithms for solving Eq. (1) based on random reshuffling: in each epoch, the algorithm draws a new permutation of $\{1,\dots,m\}$ uniformly at random, and processes the individual functions in that order. Under smoothness and strong convexity assumptions, the authors obtain convergence guarantees of essentially $\mathcal{O}(1/k^2)$ after $k$ epochs, vs. $\mathcal{O}(1/k)$ using with-replacement sampling (with the $\mathcal{O}(\cdot)$ notation including certain dependencies on the problem parameters and data size). Thus, without-replacement sampling is shown to be strictly better than with-replacement sampling after sufficiently many passes over the data. However, this leaves open the question of why without-replacement sampling works well after a few, or even just one, passes over the data. Indeed, this is often the regime in which stochastic gradient methods are most useful, do not require repeated data reshuffling, and have well-understood convergence properties in the with-replacement case.

### 1.2 Our Results

In this paper, we provide convergence guarantees for stochastic gradient methods under several scenarios, in the natural regime where the number of passes over the data is small, and in particular where no data reshuffling is necessary. We emphasize that our goal here is more modest than that of [16, 8]: rather than show the superiority of without-replacement sampling, we only show that it will not be significantly worse (in a worst-case sense) than with-replacement sampling. Nevertheless, such guarantees are novel, and still justify the use of without-replacement sampling, especially in situations where it is advantageous due to data access constraints or other reasons. Moreover, these results have a useful application in the context of distributed learning and optimization, as we will shortly describe. Our main contributions can be summarized as follows:

- For convex functions on some convex domain $\mathcal{W}$, we consider algorithms which perform a single pass over a random permutation of $m$ individual functions, and show that their suboptimality can be characterized by a combination of two quantities, each from a different field: first, the regret which the algorithm can attain in the setting of adversarial online convex optimization [17, 9]; second, the transductive Rademacher complexity of $\mathcal{W}$ with respect to the individual functions, a notion stemming from transductive learning theory [22, 6].
- As a concrete application of the above, we show that if each function $f_i(\cdot)$ corresponds to a convex Lipschitz loss of a linear predictor, and the algorithm belongs to the class of algorithms which in the online setting attain $\mathcal{O}(\sqrt{T})$ regret on $T$ such functions (which includes, for example, stochastic gradient descent), then the suboptimality using without-replacement sampling, after processing $T$ functions, is $\mathcal{O}(1/\sqrt{T})$. Up to numerical constants, the guarantee is the same as that obtained using with-replacement sampling.

- We turn to consider more specifically the stochastic gradient descent algorithm, and show that if the objective function $F(\cdot)$ is $\lambda$-strongly convex, and the functions $f_i(\cdot)$ are also smooth, then the suboptimality bound becomes $\mathcal{O}(1/\lambda T)$, which matches the with-replacement guarantees (although with replacement, smoothness is not needed, and the dependence on some parameters hidden in the $\mathcal{O}(\cdot)$ is somewhat better).

- In recent years, a new set of fast stochastic algorithms to solve Eq. (1) has emerged, such as SAG, SDCA, SVRG, and quite a few other variants. These algorithms are characterized by cheap stochastic iterations involving computations of individual function gradients, yet unlike traditional stochastic algorithms, they enjoy a linear convergence rate (runtime scaling logarithmically with the required accuracy). To the best of our knowledge, all existing analyses require sampling with replacement. We consider a representative algorithm from this set, namely SVRG, applied to the problem of regularized least squares, and show that similar guarantees can be obtained using without-replacement sampling. This result has a potentially interesting implication: under the mild assumption that the problem's condition number is smaller than the data size, SVRG can converge to an arbitrarily accurate solution (even up to machine precision) without the need to reshuffle the data: only a single shuffle at the beginning suffices. Thus, at least for this problem, we can obtain fast and high-quality solutions even if random data access is expensive.

- A further application of the SVRG result is in the context of distributed learning: by simulating without-replacement SVRG on data randomly partitioned between several machines, we get a nearly-optimal algorithm for regularized least squares, in terms of communication and computational complexity, as long as the condition number is smaller than the data size per machine (up to logarithmic factors). This builds on the work of Lee et al. [13], who were the first to recognize the applicability of SVRG to distributed optimization. However, their results relied on with-replacement sampling, and are applicable only for much smaller condition numbers.

We note that our focus is on scenarios where no reshufflings are necessary. In particular, the $\mathcal{O}(1/\sqrt{T})$ and $\mathcal{O}(1/\lambda T)$ bounds apply for all $T \in \{1, 2, \dots, m\}$, namely up to one full pass over a random permutation of the entire data. However, our techniques are also applicable to a constant ($>1$) number of passes, by randomly reshuffling the data after every pass. In a similar vein, our SVRG result can be readily extended to a situation where each epoch of the algorithm is performed on an independent permutation of the data. We leave a full treatment of this to future work.

## 2 Preliminaries and Notation

We use bold-face symbols to denote vectors. Given a vector $\mathbf{w}$, $w_i$ denotes its $i$-th coordinate.
We utilize the standard $\mathcal{O}(\cdot)$, $\Theta(\cdot)$, $\Omega(\cdot)$ notation to hide constants, and $\tilde{\mathcal{O}}(\cdot)$, $\tilde{\Theta}(\cdot)$, $\tilde{\Omega}(\cdot)$ to hide constants and logarithmic factors. Given convex functions $f_1(\cdot), f_2(\cdot), \dots, f_m(\cdot)$ from $\mathbb{R}^d$ to $\mathbb{R}$, we define our objective function $F : \mathbb{R}^d \to \mathbb{R}$ as $F(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^m f_i(\mathbf{w})$, with some fixed optimum $\mathbf{w}^* \in \arg\min_{\mathbf{w} \in \mathcal{W}} F(\mathbf{w})$. In machine learning applications, each individual $f_i(\cdot)$ usually corresponds to a loss with respect to a data point, hence we will use the terms "individual function", "loss function" and "data point" interchangeably throughout the paper.

We let $\sigma$ be a permutation over $\{1,\dots,m\}$ chosen uniformly at random. In much of the paper, we consider methods which draw loss functions without replacement according to that permutation (that is, $f_{\sigma(1)}(\cdot), f_{\sigma(2)}(\cdot), f_{\sigma(3)}(\cdot), \dots$). We will use the shorthand notation

$$F_{1:t-1}(\cdot) = \frac{1}{t-1}\sum_{i=1}^{t-1} f_{\sigma(i)}(\cdot), \qquad F_{t:m}(\cdot) = \frac{1}{m-t+1}\sum_{i=t}^{m} f_{\sigma(i)}(\cdot)$$

to denote the average loss with respect to the first $t-1$ and last $m-t+1$ loss functions, respectively, as ordered by the permutation (intuitively, the losses in $F_{1:t-1}(\cdot)$ are those already observed by the algorithm at the beginning of iteration $t$, whereas the losses in $F_{t:m}(\cdot)$ are those not yet observed). We use the convention that $F_{1:1}(\cdot) \equiv 0$, and the same goes for other expressions of the form $\frac{1}{t-1}\sum_{i=1}^{t-1}$ throughout the paper when $t = 1$. We also define quantities such as $\nabla F_{1:t-1}(\cdot)$ and $\nabla F_{t:m}(\cdot)$ similarly.

A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\lambda$-strongly convex if for any $\mathbf{w}, \mathbf{w}'$,

$$f(\mathbf{w}') \ \ge\ f(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}' - \mathbf{w} \rangle + \frac{\lambda}{2}\|\mathbf{w}' - \mathbf{w}\|^2,$$

where $\mathbf{g}$ is any (sub)gradient of $f$ at $\mathbf{w}$, and is $\mu$-smooth if for any $\mathbf{w}, \mathbf{w}'$,

$$f(\mathbf{w}') \ \le\ f(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}' - \mathbf{w} \rangle + \frac{\mu}{2}\|\mathbf{w}' - \mathbf{w}\|^2.$$

$\mu$-smoothness also implies that the function $f$ is differentiable, and that its gradient is $\mu$-Lipschitz. Based on these properties, it is easy to verify that if $\mathbf{w}^* \in \arg\min_{\mathbf{w}} f(\mathbf{w})$, and $f$ is $\lambda$-strongly convex and $\mu$-smooth, then

$$\frac{\lambda}{2}\|\mathbf{w} - \mathbf{w}^*\|^2 \ \le\ f(\mathbf{w}) - f(\mathbf{w}^*) \ \le\ \frac{\mu}{2}\|\mathbf{w} - \mathbf{w}^*\|^2.$$

We will also require the notion of transductive Rademacher complexity, as developed by El-Yaniv and Pechyony [6, Definition 1], with a slightly different notation adapted to our setting:

**Definition 1.** Let $\mathcal{V}$ be a set of vectors $\mathbf{v} = (v_1, \dots, v_m)$ in $\mathbb{R}^m$. Let $s, u$ be positive integers such that $s + u = m$, and denote $p := \frac{su}{(s+u)^2} \in (0, 1/2)$. We define the transductive Rademacher complexity $\mathcal{R}_{s,u}(\mathcal{V})$ as

$$\mathcal{R}_{s,u}(\mathcal{V}) = \left(\frac{1}{s} + \frac{1}{u}\right) \cdot \mathbb{E}_{r_1,\dots,r_m}\left[\sup_{\mathbf{v} \in \mathcal{V}} \sum_{i=1}^m r_i v_i\right],$$

where $r_1, \dots, r_m$ are i.i.d. random variables such that $r_i = +1$ or $-1$ with probability $p$ each, and $r_i = 0$ with probability $1 - 2p$.

This quantity is an important parameter in studying the richness of the set $\mathcal{V}$, and will prove crucial in providing some of the convergence results presented later on. Note that it differs slightly from the standard Rademacher complexity used in the theory of learning from i.i.d. data, where the Rademacher variables $r_i$ take only $-1, +1$ values, and $(1/s + 1/u)$ is replaced by $1/m$.
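For intuition, when $\mathcal{V}$ is a finite set, the expectation in Definition 1 can be estimated by plain Monte Carlo. The sketch below does exactly that; the finite set of predictors used to build $\mathcal{V}$ is an arbitrary illustration, not part of the paper's analysis:

```python
import numpy as np

def transductive_rademacher(V, s, u, n_samples=20000, seed=0):
    """Monte Carlo estimate of R_{s,u}(V) from Definition 1.

    V : (k, m) array, each row one vector v in the finite set V.
    """
    rng = np.random.default_rng(seed)
    m = V.shape[1]
    assert s + u == m
    p = s * u / (s + u) ** 2  # P(r_i = +1) = P(r_i = -1) = p, else r_i = 0
    r = rng.choice([-1.0, 0.0, 1.0], size=(n_samples, m),
                   p=[p, 1 - 2 * p, p])
    # sup over the finite set V of sum_i r_i v_i, for each draw of r
    sups = (r @ V.T).max(axis=1)
    return (1.0 / s + 1.0 / u) * sups.mean()

# Example: V given by the value vectors w -> (<w,x_1>,...,<w,x_m>)
# for a few fixed predictors w (a crude finite proxy for a norm ball).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))   # m = 50 data points
W = rng.standard_normal((10, 3))   # 10 candidate predictors
V = W @ X.T                        # (10, 50) value vectors
print(transductive_rademacher(V, s=20, u=30))
```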
## 3 Convex Lipschitz Functions

We begin by considering loss functions $f_1(\cdot), f_2(\cdot), \dots, f_m(\cdot)$ which are convex and $L$-Lipschitz over some convex domain $\mathcal{W}$. We assume the algorithm sequentially goes over some permuted ordering of the losses, and before processing the $t$-th loss function, produces an iterate $\mathbf{w}_t \in \mathcal{W}$. Moreover, we assume that the algorithm has a regret bound in the adversarial online setting, namely that for any sequence of $T$ convex Lipschitz losses $f_1(\cdot), \dots, f_T(\cdot)$ and any $\mathbf{w} \in \mathcal{W}$,

$$\sum_{t=1}^{T} f_t(\mathbf{w}_t) - \sum_{t=1}^{T} f_t(\mathbf{w}) \ \le\ R_T$$

for some $R_T$ scaling sub-linearly in $T$. (For simplicity, we assume the algorithm is deterministic given $f_1, \dots, f_m$, but all results in this section also hold for randomized algorithms, in expectation over the algorithm's randomness, assuming the expected regret of the algorithm w.r.t. any $\mathbf{w} \in \mathcal{W}$ is at most $R_T$.)

For example, online gradient descent (which is equivalent to stochastic gradient descent when the losses are i.i.d.), with a suitable step size, satisfies $R_T = \mathcal{O}(BL\sqrt{T})$, where $L$ is the Lipschitz parameter and $B$ upper bounds the norm of any vector in $\mathcal{W}$. Similar regret bounds can also be shown for other online algorithms (see [9, 17, 23]).

Since the ideas used for analyzing this setting will also be used in the more complicated results which follow, we provide the analysis in some detail. We first have the following straightforward theorem, which upper bounds the expected suboptimality in terms of regret and the expected difference between the average loss on prefixes and suffixes of the data.

**Theorem 1.** Suppose the algorithm has a regret bound $R_T$, and sequentially processes $f_{\sigma(1)}(\cdot), \dots, f_{\sigma(T)}(\cdot)$, where $\sigma$ is a random permutation on $\{1,\dots,m\}$. Then in expectation over $\sigma$,

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t) - F(\mathbf{w}^*)\right] \ \le\ \frac{R_T}{T} + \frac{1}{mT}\sum_{t=2}^{T}(t-1)\,\mathbb{E}\left[F_{1:t-1}(\mathbf{w}_t) - F_{t:m}(\mathbf{w}_t)\right].$$

The left-hand side in the inequality above can be interpreted as an expected bound on $F(\mathbf{w}_t) - F(\mathbf{w}^*)$, where $t$ is drawn uniformly at random from $\{1, 2, \dots, T\}$. Alternatively, by Jensen's inequality and the fact that $F(\cdot)$ is convex, the same bound also applies to $\mathbb{E}[F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*)]$, where $\bar{\mathbf{w}}_T = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t$.

The proof of the theorem relies on the following simple but key lemma, which expresses the expected difference between with-replacement and without-replacement sampling in an alternative form, similar to Thm. 1, and one which lends itself to tools and ideas from transductive learning theory. This lemma will be used in proving all our main results, and its proof appears in Subsection A.2.

**Lemma 1.** Let $\sigma$ be a permutation over $\{1,\dots,m\}$ chosen uniformly at random. Let $s_1, \dots, s_m \in \mathbb{R}$ be random variables which, conditioned on $\sigma(1), \dots, \sigma(t-1)$, are independent of $\sigma(t), \dots, \sigma(m)$. Then

$$\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m} s_i - s_{\sigma(t)}\right] = \begin{cases} \frac{t-1}{m}\,\mathbb{E}\left[\bar{s}_{1:t-1} - \bar{s}_{t:m}\right] & t > 1,\\[2pt] 0 & t = 1,\end{cases}$$

where $\bar{s}_{1:t-1} = \frac{1}{t-1}\sum_{i=1}^{t-1} s_{\sigma(i)}$ and $\bar{s}_{t:m} = \frac{1}{m-t+1}\sum_{i=t}^{m} s_{\sigma(i)}$.

*Proof of Thm. 1.* Adding and subtracting terms, and using the facts that $\sigma$ is a permutation chosen uniformly at random and $\mathbf{w}^*$ is fixed,

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t) - F(\mathbf{w}^*)\right] = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_{\sigma(t)}(\mathbf{w}_t) - F(\mathbf{w}^*)\right)\right] + \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(F(\mathbf{w}_t) - f_{\sigma(t)}(\mathbf{w}_t)\right)\right]$$

$$= \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(f_{\sigma(t)}(\mathbf{w}_t) - f_{\sigma(t)}(\mathbf{w}^*)\right)\right] + \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(F(\mathbf{w}_t) - f_{\sigma(t)}(\mathbf{w}_t)\right)\right].$$

Applying the regret bound assumption to the sequence of losses $f_{\sigma(1)}(\cdot), \dots, f_{\sigma(T)}(\cdot)$, the above is at most

$$\frac{R_T}{T} + \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[F(\mathbf{w}_t) - f_{\sigma(t)}(\mathbf{w}_t)\right].$$

Since $\mathbf{w}_t$ (as a random variable over the permutation $\sigma$ of the data) depends only on $\sigma(1), \dots, \sigma(t-1)$, we can use Lemma 1 (with $s_i = f_i(\mathbf{w}_t)$, noting that the expectation above is $0$ when $t = 1$), and get that the above equals

$$\frac{R_T}{T} + \frac{1}{mT}\sum_{t=2}^{T}(t-1)\,\mathbb{E}\left[F_{1:t-1}(\mathbf{w}_t) - F_{t:m}(\mathbf{w}_t)\right]. \qquad \blacksquare$$
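Since Lemma 1 is an exact identity, it can be sanity-checked by brute force on a small instance. The sketch below enumerates all permutations for $m = 6$, $t = 3$, with each $s_i$ depending on $\sigma(1), \dots, \sigma(t-1)$ through a fitted "iterate", mimicking the choice $s_i = f_i(\mathbf{w}_t)$ in the proof above; the concrete construction is an arbitrary illustration:

```python
import itertools
import numpy as np

m, t = 6, 3
a = np.array([0.3, -1.2, 0.7, 2.0, -0.5, 1.1])  # arbitrary fixed data

lhs, rhs = [], []
for perm in itertools.permutations(range(m)):
    perm = np.array(perm)
    # s_i may depend on sigma(1..t-1) only; here, via an "iterate" w_t
    # fitted to the first t-1 data points (mimicking s_i = f_i(w_t)).
    w_t = a[perm[:t - 1]].mean()
    s = (a - w_t) ** 2
    lhs.append(s.mean() - s[perm[t - 1]])
    s_head = s[perm[:t - 1]].mean()   # \bar{s}_{1:t-1}
    s_tail = s[perm[t - 1:]].mean()   # \bar{s}_{t:m}
    rhs.append((t - 1) / m * (s_head - s_tail))

print(np.mean(lhs), np.mean(rhs))  # the two expectations coincide
```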
Having reduced the expected suboptimality to the expected difference $\mathbb{E}[F_{1:t-1}(\mathbf{w}_t) - F_{t:m}(\mathbf{w}_t)]$, the next step is to upper bound it by $\mathbb{E}[\sup_{\mathbf{w} \in \mathcal{W}}(F_{1:t-1}(\mathbf{w}) - F_{t:m}(\mathbf{w}))]$: namely, having split our loss functions at random into two groups of sizes $t-1$ and $m-t+1$, how large can the difference between the average loss of any $\mathbf{w}$ on the two groups be? Such uniform convergence bounds are exactly of the type studied in transductive learning theory, where a fixed dataset is randomly split into a training set and a test set, and we consider the generalization performance of learning algorithms run on the training set. Such results can be provided in terms of the transductive Rademacher complexity of $\mathcal{W}$, and combined with Thm. 1, lead to the following bound in our setting:

**Theorem 2.** Suppose that each $\mathbf{w}_t$ is chosen from a fixed domain $\mathcal{W}$, that the algorithm enjoys a regret bound $R_T$, and that $\sup_{i, \mathbf{w} \in \mathcal{W}} |f_i(\mathbf{w})| \le B$. Then in expectation over the random permutation $\sigma$,

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t) - F(\mathbf{w}^*)\right] \ \le\ \frac{R_T}{T} + \frac{1}{mT}\sum_{t=2}^{T}(t-1)\,\mathcal{R}_{t-1,m-t+1}(\mathcal{V}) + \frac{24B}{\sqrt{m}},$$

where $\mathcal{V} = \{(f_1(\mathbf{w}), \dots, f_m(\mathbf{w})) \mid \mathbf{w} \in \mathcal{W}\}$.

Thus, we obtain a generic bound which depends on the online learning characteristics of the algorithm, as well as on the statistical learning properties of the class $\mathcal{W}$ with respect to the loss functions. The proof (as the proofs of all our results from now on) appears in Section A. We now instantiate the theorem to the prototypical case of bounded-norm linear predictors, where the loss is some convex and Lipschitz function of the prediction $\langle \mathbf{w}, \mathbf{x} \rangle$ of a predictor $\mathbf{w}$ on an instance $\mathbf{x}$, possibly with some regularization:

**Corollary 1.** Under the conditions of Thm. 2, suppose that $\mathcal{W} \subseteq \{\mathbf{w} : \|\mathbf{w}\| \le B\}$, and that each loss function $f_i$ has the form $\ell_i(\langle \mathbf{w}, \mathbf{x}_i \rangle) + r(\mathbf{w})$ for some $L$-Lipschitz $\ell_i$, $\|\mathbf{x}_i\| \le 1$, and a fixed function $r$. Then

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t) - F(\mathbf{w}^*)\right] \ \le\ \frac{R_T}{T} + \mathcal{O}\left(\frac{BL}{\sqrt{T}}\right).$$

As discussed earlier, in the setting of Corollary 1, typical regret bounds are on the order of $\mathcal{O}(BL\sqrt{T})$. Thus, the expected suboptimality is $\mathcal{O}(BL/\sqrt{T})$, all the way up to $T = m$ (i.e. a full pass over a random permutation of the data). Up to constants, this is the same as the suboptimality attained by $T$ iterations of with-replacement sampling, using stochastic gradient descent or similar algorithms.

## 4 Faster Convergence for Strongly Convex Functions

We now consider more specifically the stochastic gradient descent algorithm on a convex domain $\mathcal{W}$, which can be described as follows: we initialize at some $\mathbf{w}_1 \in \mathcal{W}$, and perform the update steps

$$\mathbf{w}_{t+1} = \Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t \mathbf{g}_t),$$

where $\eta_t > 0$ are fixed step sizes, $\Pi_{\mathcal{W}}$ is the projection on $\mathcal{W}$, and $\mathbf{g}_t$ is a subgradient of $f_{\sigma(t)}(\cdot)$ at $\mathbf{w}_t$. Moreover, we assume the objective function $F(\cdot)$ is $\lambda$-strongly convex for some $\lambda > 0$. In this scenario, using with-replacement sampling (i.e. $\mathbf{g}_t$ is a subgradient of an independently drawn $f_i(\cdot)$), performing $T$ iterations as above and returning a randomly sampled iterate $\mathbf{w}_t$ or their average results in expected suboptimality $\mathcal{O}(G^2/\lambda T)$, where $G^2$ is an upper bound on the expected squared norm of $\mathbf{g}_t$ [15, 18].

Here, we study a similar situation in the without-replacement case. In the result below, we consider specifically the case where each $f_i(\mathbf{w})$ is a Lipschitz and smooth loss of a linear predictor $\mathbf{w}$, possibly with some regularization. The smoothness assumption is needed to get a good bound on the transductive Rademacher complexity of quantities such as $\langle \nabla f_i(\mathbf{w}), \mathbf{w} \rangle$. However, the technique can potentially be applied to more general cases.

**Theorem 3.** Suppose $\mathcal{W}$ has diameter $B$, and that $F(\cdot)$ is $\lambda$-strongly convex on $\mathcal{W}$. Assume that $f_i(\mathbf{w}) = \ell_i(\langle \mathbf{w}, \mathbf{x}_i \rangle) + r(\mathbf{w})$, where $\|\mathbf{x}_i\| \le 1$, $r(\cdot)$ is possibly some regularization term, and each $\ell_i$ is $L$-Lipschitz and $\mu$-smooth on $\{z : z = \langle \mathbf{w}, \mathbf{x} \rangle,\ \mathbf{w} \in \mathcal{W},\ \|\mathbf{x}\| \le 1\}$. Furthermore, suppose $\sup_{\mathbf{w} \in \mathcal{W}} \|\nabla f_i(\mathbf{w})\| \le G$. Then for any $1 < T \le m$, if we run SGD for $T$ iterations with step size $\eta_t = 2/\lambda t$, we have (for a universal positive constant $c$)

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t) - F(\mathbf{w}^*)\right] \ \le\ \frac{c\left((L + \mu B)^2 + G^2\right)\log(T)}{\lambda T}.$$

As in the results of the previous section, the left-hand side is the expected suboptimality of a single $\mathbf{w}_t$, where $t$ is chosen uniformly at random, or an upper bound on the expected suboptimality of the average $\bar{\mathbf{w}}_T = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t$.

This result is similar to the with-replacement case, up to numerical constants and the additional $(L + \mu B)^2$ term in the numerator. We note that in some cases, $G^2$ is the dominant term anyway. (Indeed, $G$ is generally on the order of $L + \lambda B$, which is the same order as $L + \mu B$ if $L$ is the dominant term; this happens for instance with the squared loss, whose Lipschitz parameter is on the order of $\mu B$.) However, it is not clear that the $(L + \mu B)^2$ term is necessary, and removing it is left to future work. We also note that the $\log(T)$ factor in the theorem can be removed by considering not $\frac{1}{T}\sum_{t=1}^{T} F(\mathbf{w}_t)$, but rather an average over only some suffix of the iterates, or weighted averaging (see for instance [15, 12, 21]; the same techniques can be applied here).

The proof of Thm. 3 is somewhat more challenging than the results of the previous section, since we are attempting to get a faster rate of $\mathcal{O}(1/T)$ rather than $\mathcal{O}(1/\sqrt{T})$, all the way up to $T = m$. A significant technical obstacle is that our proof technique relies on concentration of averages around expectations, which on $T$ samples does not go down faster than $\mathcal{O}(1/\sqrt{T})$. To overcome this, we apply concentration results not on the function values (i.e. $F_{1:t-1}(\mathbf{w}_t) - F_{t:m}(\mathbf{w}_t)$ as in the previous section), but rather on gradient-iterate inner products, i.e. $\langle \nabla F_{1:t-1}(\mathbf{w}_t) - \nabla F_{t:m}(\mathbf{w}_t), \mathbf{w}_t - \mathbf{w}^* \rangle$, where $\mathbf{w}^*$ is the optimal solution. To get good bounds, we need to assume these gradients have a certain structure, which is why we make the assumption in the theorem about the form of each $f_i(\cdot)$. Using transductive Rademacher complexity tools, we manage to upper bound the expectation of these inner products by quantities roughly of the form $\sqrt{\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2]/t}$ (assuming here $t < m/2$ for simplicity). We then utilize the fact that in the strongly convex case, $\|\mathbf{w}_t - \mathbf{w}^*\|$ itself decreases to zero with $t$ at a certain rate, to get fast rates decreasing as $1/t$.
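As an illustration of the scheme analyzed in Thm. 3, the following sketch runs projected SGD over a single random permutation with step size $\eta_t = 2/\lambda t$ and returns the average iterate. The ridge-regularized squared loss, the projection radius, and the synthetic data are assumptions made for the example:

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def sgd_without_replacement(X, y, lam, radius, T, seed=0):
    """Projected SGD on f_i(w) = 0.5*(<w,x_i> - y_i)^2 + (lam/2)*||w||^2,
    processing one random permutation with step size eta_t = 2/(lam*t)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    assert T <= m  # at most one pass, no reshuffling
    order = rng.permutation(m)
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        i = order[t - 1]
        g = (X[i] @ w - y[i]) * X[i] + lam * w  # subgradient of f_{sigma(t)}
        w = project_ball(w - (2.0 / (lam * t)) * g, radius)
        iterates.append(w)
    return np.mean(iterates, axis=0)  # average iterate, as in the theorem

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 5)) / np.sqrt(5)  # keep ||x_i|| roughly <= 1
y = X @ np.ones(5) + 0.1 * rng.standard_normal(2000)
w_bar = sgd_without_replacement(X, y, lam=0.01, radius=10.0, T=2000)
print(0.5 * np.mean((X @ w_bar - y) ** 2) + 0.005 * w_bar @ w_bar)
```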
## 5 Without-Replacement SVRG for Least Squares

In this section, we consider a more sophisticated stochastic gradient approach, namely the SVRG algorithm of [11], designed to solve optimization problems with a finite-sum structure as in Eq. (1). Unlike purely stochastic gradient procedures, this algorithm does require several passes over the data. However, assuming the condition number $1/\lambda$ is smaller than the data size (where each $f_i(\cdot)$ is $\mathcal{O}(1)$-smooth, and $\lambda$ is the strong convexity parameter of $F(\cdot)$), only $\mathcal{O}(m \log(1/\epsilon))$ gradient evaluations are required to get an $\epsilon$-accurate solution, for any $\epsilon$. Thus, we can get a high-accuracy solution after the equivalent of a small number of data passes. As discussed in the introduction, over the past few years several other algorithms have been introduced and shown to have similar behavior. We focus on the algorithm in its basic form, where the domain $\mathcal{W}$ equals $\mathbb{R}^d$.

The existing analysis of SVRG [11] assumes stochastic iterations, which involve sampling the data with replacement. Thus, it is natural to consider whether a similar convergence behavior occurs using without-replacement sampling. As we shall see, a positive reply has at least two implications. The first is that as long as the condition number is smaller than the data size, SVRG can be used to obtain a high-accuracy solution without the need to reshuffle the data: only a single shuffle at the beginning suffices, and the algorithm terminates after a small number of sequential passes (logarithmic in the required accuracy).
The second implication is that such without-replacement SVRG can be used to get a nearly-optimal algorithm for convex distributed learning and optimization on randomly partitioned data, as long as the condition number is smaller than the data size at each machine.

The SVRG algorithm using without-replacement sampling on a dataset of size $m$ is described as Algorithm 1. It is composed of multiple epochs (indexed by $s$), each involving one gradient computation on the entire dataset, and $T$ stochastic iterations involving gradient computations with respect to individual data points. Although the gradient computation on the entire dataset is expensive, it only needs to be done once per epoch. Since the algorithm will be shown to have linear convergence as a function of the number of epochs, this requires only a small (logarithmic) number of passes over the data.

**Algorithm 1** SVRG using Without-Replacement Sampling

- Parameters: $\eta$, $T$, permutation $\sigma$ on $\{1, \dots, m\}$
- Initialize $\tilde{\mathbf{w}}_1 := \mathbf{0}$
- **for** $s = 1, 2, \dots$ **do**
  - $\mathbf{w}_{(s-1)T+1} := \tilde{\mathbf{w}}_s$
  - $\tilde{\mathbf{n}} := \nabla F(\tilde{\mathbf{w}}_s) = \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(\tilde{\mathbf{w}}_s)$
  - **for** $t = (s-1)T+1, \dots, sT$ **do**
    - $\mathbf{w}_{t+1} := \mathbf{w}_t - \eta\left(\nabla f_{\sigma(t)}(\mathbf{w}_t) - \nabla f_{\sigma(t)}(\tilde{\mathbf{w}}_s) + \tilde{\mathbf{n}}\right)$
  - **end for**
  - Let $\tilde{\mathbf{w}}_{s+1}$ be the average of $\mathbf{w}_{(s-1)T+1}, \dots, \mathbf{w}_{sT}$, or one of them drawn uniformly at random
- **end for**

We consider specifically the regularized least squares problem, where

$$f_i(\mathbf{w}) = \frac{1}{2}\left(\langle \mathbf{w}, \mathbf{x}_i \rangle - y_i\right)^2 + \frac{\hat{\lambda}}{2}\|\mathbf{w}\|^2 \tag{2}$$

for some $\mathbf{x}_i, y_i$ and $\hat{\lambda} > 0$. Moreover, we assume that $F(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m} f_i(\mathbf{w})$ is $\lambda$-strongly convex (note that necessarily $\lambda \ge \hat{\lambda}$). For convenience, we assume that $\|\mathbf{x}_i\|, |y_i|, \lambda$ are all at most $1$ (this is without much loss of generality, since we can always re-scale the loss functions by an appropriate factor). Note that under this assumption, each $f_i(\cdot)$, as well as $F(\cdot)$, is also $(1 + \hat{\lambda})$-smooth.

**Theorem 4.** Suppose each loss function $f_i(\cdot)$ corresponds to Eq. (2), where $\mathbf{x}_i \in \mathbb{R}^d$, $\max_i \|\mathbf{x}_i\| \le 1$, $\max_i |y_i| \le 1$, $\hat{\lambda} > 0$, and that $F(\cdot)$ is $\lambda$-strongly convex, where $\lambda \in (0, 1)$. Moreover, let $B \ge 1$ be such that $\|\mathbf{w}^*\|^2 \le B$ and $\max_t F(\mathbf{w}_t) - F(\mathbf{w}^*) \le B$ with probability $1$ over the random permutation. There is some universal constant $c_0 \ge 1$, such that for any $c \ge c_0$ and any $\epsilon \in (0, 1)$, if we run Algorithm 1 using parameters $\eta, T$ satisfying

$$T \ \ge\ \frac{c}{\eta\lambda}, \qquad m \ \ge\ c\,T\,\log^2\!\left(\frac{64\,d\,m\,B^2}{\epsilon}\right),$$

then after $S = \lceil \log_4(9/\epsilon) \rceil$ epochs of the algorithm, $\tilde{\mathbf{w}}_{S+1}$ satisfies $\mathbb{E}[F(\tilde{\mathbf{w}}_{S+1}) - \min_{\mathbf{w}} F(\mathbf{w})] \le \epsilon$.

In particular, by taking $\eta = \Theta(1)$ and $T = \Theta(1/\lambda)$, the algorithm attains an $\epsilon$-accurate solution after $\mathcal{O}(\log(1/\epsilon)/\lambda)$ stochastic iterations of without-replacement sampling, and $\mathcal{O}(\log(1/\epsilon))$ sequential passes over the data to compute gradients of $F(\cdot)$. This implies that as long as $1/\lambda$ (which stands for the condition number of the problem) is smaller than $\mathcal{O}(m/\log(1/\epsilon))$, the number of without-replacement stochastic iterations is smaller than the data size $m$. Thus, assuming the data is randomly shuffled, we can get a solution using only sequential data passes, without the need to reshuffle.

In terms of the log factors, we note that the condition $\max_t F(\mathbf{w}_t) - F(\mathbf{w}^*) \le B$ with probability $1$ is needed for technical reasons in our analysis, and we conjecture that it can be improved. However, since $B$ appears only inside log factors, even a crude bound suffices. In Appendix C, we indeed show that there is always a valid $B$ satisfying $\log(B) = \mathcal{O}(\log(1/\epsilon)\log(T) + \log(1/\lambda))$. Regarding the logarithmic dependence on the dimension $d$, it is due to an application of a matrix Bernstein inequality for $d \times d$ matrices, and can possibly be improved.
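Below is a minimal sketch of Algorithm 1 specialized to the losses in Eq. (2), with a single up-front shuffle and parameters in the $\eta = \Theta(1)$, $T = \Theta(1/\lambda)$ regime discussed above. The concrete constants, data, and helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def svrg_without_replacement(X, y, lam_hat, eta, T, S, seed=0):
    """Algorithm 1 for f_i(w) = 0.5*(<w,x_i> - y_i)^2 + (lam_hat/2)*||w||^2.

    One permutation is drawn up front; each epoch consumes the next T
    points of it, so we need S*T <= m (no reshuffling)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    assert S * T <= m, "a single shuffle must cover all stochastic steps"
    sigma = rng.permutation(m)
    grad = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam_hat * w
    w_tilde = np.zeros(d)
    pos = 0  # current position in the permutation
    for s in range(S):
        # full gradient at the anchor point (one pass over the data)
        n_tilde = (X.T @ (X @ w_tilde - y)) / m + lam_hat * w_tilde
        w, acc = w_tilde.copy(), np.zeros(d)
        for _ in range(T):
            i = sigma[pos]; pos += 1
            acc += w  # accumulate w_t for t = (s-1)T+1, ..., sT
            w = w - eta * (grad(w, i) - grad(w_tilde, i) + n_tilde)
        w_tilde = acc / T  # average of the epoch's iterates
    return w_tilde

rng = np.random.default_rng(3)
m, d, lam = 20000, 10, 0.05
X = rng.standard_normal((m, d)) / np.sqrt(d)
y = np.clip(X @ rng.standard_normal(d), -1, 1)
w = svrg_without_replacement(X, y, lam_hat=lam, eta=0.1, T=int(2 / lam), S=10)
print(0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w)
```

Note that the assertion `S * T <= m` is precisely the "condition number smaller than the data size" requirement: all $S \cdot T = \Theta(\log(1/\epsilon)/\lambda)$ stochastic steps must fit within one pass over the shuffled data.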
### 5.1 Application of Without-Replacement SVRG to distributed learning

An important variant of the problems discussed so far is when the training data (or equivalently, the individual loss functions $f_1(\cdot), \dots, f_m(\cdot)$) is split between different machines, which need to communicate in order to reach a good solution. This naturally models situations where the data is too large to fit on a single machine, or where we wish to speed up the optimization process by parallelizing it across several computers. Over the past few years, there has been much research on this question in the machine learning community, with just a few examples including [24, 2, 1, 5, 4, 10, 20, 19, 25, 13]. A substantial number of these works focus on the setting where the data is split equally and at random between $k$ machines (or equivalently, where data is assigned to each machine by sampling without replacement from $\{f_1(\cdot), \dots, f_m(\cdot)\}$). Intuitively, this creates statistical similarities between the data at different machines, which can be leveraged to aid the optimization process.

Recently, Lee et al. [13] proposed a simple and elegant approach, which applies at least in certain parameter regimes. It is based on the observation that SVRG, according to its with-replacement analysis, requires $\mathcal{O}(\log(1/\epsilon))$ epochs, where in each epoch one needs to compute an exact gradient of the objective function $F(\cdot) = \frac{1}{m}\sum_{i=1}^{m} f_i(\cdot)$, and $\mathcal{O}(1/\lambda)$ gradients of individual losses $f_i(\cdot)$ chosen uniformly at random (where $\epsilon$ is the required suboptimality, and $\lambda$ is the strong convexity parameter of $F(\cdot)$). Therefore, if each machine had i.i.d. samples from $\{f_1(\cdot), \dots, f_m(\cdot)\}$ whose union covers $\{f_1(\cdot), \dots, f_m(\cdot)\}$, the machines could simply simulate SVRG: first, each machine splits its data into batches of size $\mathcal{O}(1/\lambda)$. Then, each SVRG epoch is simulated by the machines computing a gradient of $F(\cdot)$, which can be fully parallelized and involves one communication round (assuming a broadcast communication model), and by one machine computing gradients with respect to one of its batches. Overall, this would require $\mathcal{O}(\log(1/\epsilon))$ communication rounds, and $\mathcal{O}((m/k + 1/\lambda)\log(1/\epsilon))$ runtime, where $m/k$ is the number of data points per machine (ignoring communication time, and assuming constant time to compute the gradient of a single $f_i(\cdot)$). Under the reasonable assumption that the strong convexity parameter $\lambda$ is at least $k/m$, this requires runtime linear in the data size per machine and logarithmic in the target accuracy $\epsilon$, with only a logarithmic number of communication rounds. Up to log factors, this is essentially the best one can hope for with a distributed algorithm. Moreover, a lower bound in [13] indicates that, at least in the worst case, $\mathcal{O}(\log(1/\epsilon))$ communication rounds are impossible to obtain if $\lambda$ is significantly smaller than $k/m$.

Unfortunately, the reasoning above crucially relies on each machine having access to i.i.d. samples, which can be reasonable in some cases, but differs from the more common assumption that the data is randomly split among the machines. To circumvent this issue, [13] propose to communicate individual data points / losses $f_i(\cdot)$ between machines, so as to simulate i.i.d. sampling. However, by the birthday paradox, this only works well in the regime where the overall number of samples required, $\mathcal{O}((1/\lambda)\log(1/\epsilon))$, is not much larger than $\sqrt{m}$. Otherwise, with-replacement and without-replacement sampling become statistically very different, and a large number of data points would need to be communicated. In other words, if communication is an expensive resource, the solution proposed in [13] only works well when $\lambda$ is at least on the order of $1/\sqrt{m}$. In machine learning applications, the strong convexity parameter $\lambda$ often comes from explicit regularization designed to prevent over-fitting, and needs to scale with the data size, usually between $1/\sqrt{m}$ and $1/m$. Thus, the solution above is communication-efficient only when $\lambda$ is relatively large.

However, the situation immediately improves if we can use a without-replacement version of SVRG, which can easily be simulated with randomly partitioned data: the stochastic batches can now simply be subsets of each machine's data, which are statistically identical to sampling $\{f_1(\cdot), \dots, f_m(\cdot)\}$ without replacement. Thus, no data points need to be sent across machines, even if $\lambda$ is small. For clarity, we present an explicit pseudocode as Algorithm 2 in Appendix D; a simplified single-process sketch of this simulation appears below.

Let us consider the analysis of without-replacement SVRG provided in Thm. 4. According to this analysis, by setting $T = \Theta(1/\lambda)$, as long as the total number of batches is at least $\Omega(\log(1/\epsilon))$ and $\lambda = \Omega(1/m)$, the algorithm will attain an $\epsilon$-suboptimal solution in expectation. In other words, without any additional communication, we extend the applicability of distributed SVRG (at least for regularized least squares) from the $\lambda = \Omega(1/\sqrt{m})$ regime to the $\lambda = \Omega(1/m)$ regime. We emphasize that this formal analysis only applies to the regularized squared loss, which is the scope of Thm. 4. However, the algorithmic approach can be applied to any loss function, and we conjecture that it will have similar performance for any smooth losses and strongly-convex objectives.
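As a minimal illustration of this simulation (a simplified stand-in for Algorithm 2 in Appendix D, not reproduced here), the following single-process sketch mimics the distributed protocol: the data is randomly partitioned across $k$ "machines", each machine pre-splits its share into batches of size $T$, and each epoch combines one broadcast-style full-gradient computation with stochastic steps on one fresh local batch. Names and constants are illustrative:

```python
import numpy as np

def distributed_svrg_simulation(X, y, lam_hat, eta, T, S, k, seed=0):
    """Single-process simulation of without-replacement distributed SVRG
    on f_i(w) = 0.5*(<w,x_i> - y_i)^2 + (lam_hat/2)*||w||^2: randomly
    partitioned data stays put; each epoch, one machine runs the
    stochastic steps on a fresh local batch of size T."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    parts = np.array_split(rng.permutation(m), k)  # random partition
    # each machine pre-splits its local indices into disjoint full batches
    batches = [idx[j:j + T] for idx in parts
               for j in range(0, len(idx) - T + 1, T)]
    assert len(batches) >= S, "need at least S batches overall"
    grad = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam_hat * w
    w_tilde = np.zeros(d)
    for s in range(S):
        # one broadcast round: machines jointly compute the full gradient
        n_tilde = (X.T @ (X @ w_tilde - y)) / m + lam_hat * w_tilde
        w, acc = w_tilde.copy(), np.zeros(d)
        for i in batches[s]:  # stochastic steps on one local batch
            acc += w
            w = w - eta * (grad(w, i) - grad(w_tilde, i) + n_tilde)
        w_tilde = acc / len(batches[s])
    return w_tilde

rng = np.random.default_rng(4)
m, d, lam = 40000, 10, 0.02
X = rng.standard_normal((m, d)) / np.sqrt(d)
y = np.clip(X @ rng.standard_normal(d), -1, 1)
w = distributed_svrg_simulation(X, y, lam, eta=0.1, T=int(2 / lam), S=10, k=8)
print(0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w)
```

Since the concatenation of the machines' shares is itself a uniformly random permutation of the data, consuming the batches in order is statistically identical to running Algorithm 1 on a single shuffle, which is exactly the point of the reduction.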
**Acknowledgments:** This research is supported in part by an FP7 Marie Curie CIG grant, ISF grant 425/13, and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

## References

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.
[2] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.
[3] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.
[4] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.
[5] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
[6] R. El-Yaniv and D. Pechyony. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35:193–234, 2009.
[7] D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.
[8] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.
[9] E. Hazan. Introduction to online convex optimization. Book draft, 2015.
[10] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. Jordan. Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.
[11] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[12] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
[13] J. Lee, T. Ma, and Q. Lin. Distributed stochastic variance reduced gradient methods. arXiv preprint arXiv:1507.07595, 2015.
[14] A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223–264. Springer, 2001.
[15] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[16] B. Recht and C. Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In COLT, 2012.
[17] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
[18] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[19] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Allerton, 2014.
[20] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.
[21] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.
[22] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[23] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[24] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
[25] Y. Zhang and L. Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. In ICML, 2015.