SAGA with Arbitrary Sampling

Xun Qian 1, Zheng Qu 2, Peter Richtárik 1 3

We study the problem of minimizing the average of a very large number of smooth functions, which is of key importance in training supervised learning models. One of the most celebrated methods in this context is the SAGA algorithm of Defazio et al. (2014). Despite years of research on the topic, a general-purpose version of SAGA, one that would include arbitrary importance sampling and minibatching schemes, does not exist. We remedy this situation and propose a general and flexible variant of SAGA following the arbitrary sampling paradigm. We perform an iteration complexity analysis of the method, largely made possible by the construction of new stochastic Lyapunov functions. We establish linear convergence rates in the smooth and strongly convex regime, and under a quadratic functional growth condition (i.e., in a regime not assuming strong convexity). Our rates match those of the primal-dual method Quartz (Qu et al., 2015), for which an arbitrary sampling analysis is available; this makes a significant step towards closing the gap in our understanding of the complexity of primal and dual methods for finite-sum problems.

*Equal contribution. 1King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia. 2University of Hong Kong, Hong Kong. 3Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia. Correspondence to: Peter Richtárik. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

We consider a convex composite optimization problem

  min_{x ∈ R^d} P(x) := Σ_{i=1}^n λ_i f_i(x) + ψ(x),   (1)

where f := Σ_i λ_i f_i is a conic combination (with coefficients λ_1, ..., λ_n > 0) of a very large number of smooth convex functions f_i : R^d → R, and ψ : R^d → R ∪ {+∞} is a proper closed convex function. We do not assume ψ to be smooth. In particular, ψ can be the indicator function of a nonempty closed convex set, turning problem (1) into the constrained minimization of the function f. We are interested in the regime where n ≫ d, although all our theoretical results hold without this assumption.

In a typical setup in the literature, λ_i = 1/n for all i ∈ [n] := {1, 2, ..., n}, f_i(x) corresponds to the loss of a supervised machine learning model x on example i from a training dataset of size n, and f represents the average loss (i.e., the empirical risk). Problems of the form (1) are often called finite-sum or regularized empirical risk minimization (ERM) problems, and are of immense importance in supervised learning, essentially forming the dominant training paradigm (Shalev-Shwartz & Ben-David, 2014).
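To fix ideas before reviewing specific methods, the following minimal sketch (ours, not taken from the paper; the data is synthetic and all names are hypothetical) spells out a concrete instance of problem (1): uniform weights λ_i = 1/n, logistic losses f_i, and an ℓ1 regularizer ψ.

```python
import numpy as np

def make_problem(A, b, l1=0.0):
    """P(x) = (1/n) * sum_i f_i(x) + psi(x), with logistic f_i and psi = l1 * ||.||_1."""
    n, d = A.shape

    def f_i(x, i):                 # individual smooth loss f_i
        return np.log1p(np.exp(-b[i] * (A[i] @ x)))

    def psi(x):                    # nonsmooth regularizer
        return l1 * np.abs(x).sum()

    def P(x):                      # composite objective (1) with lambda_i = 1/n
        return sum(f_i(x, i) for i in range(n)) / n + psi(x)

    return P

# toy usage on synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.choice([-1.0, 1.0], size=100)
P = make_problem(A, b, l1=0.01)
print(P(np.zeros(10)))
```

Every method discussed in this paper, including SAGA-AS, targets composite finite-sum objectives of exactly this shape.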
1.1. Variance-reduced methods

One of the most successful methods for solving ERM problems is stochastic gradient descent (SGD) (Robbins & Monro, 1951; Nemirovski et al., 2009) and its many variants, including those with minibatches (Takáč et al., 2013), importance sampling (Needell et al., 2015; Zhao & Zhang, 2015) and momentum (Loizou & Richtárik, 2017a;b). One of the most interesting developments in recent years concerns variance-reduced variants of SGD. The first method in this category is the celebrated stochastic average gradient (SAG) method of Schmidt et al. (2017), whose authors received the 2018 Lagrange Prize in continuous optimization for this work. Many additional variance-reduced methods were proposed since, including SDCA (Richtárik & Takáč, 2014; Shalev-Shwartz & Zhang, 2013; Shalev-Shwartz, 2016), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014b), S2GD (Konečný & Richtárik, 2017; Konečný et al., 2016), MISO (Mairal, 2015), JacSketch (Gower et al., 2018) and SAGD (Bibi et al., 2018).

1.2. SAGA: the known and the unknown

Since the SAG gradient estimator is not unbiased, SAG is notoriously hard to analyze. Soon after SAG was proposed, the SAGA method (Defazio et al., 2014) was developed, obtained by replacing the biased SAG estimator with a similar, but unbiased, SAGA estimator. This method admits a simpler analysis, while retaining the favourable convergence properties of SAG. SAGA is one of the earliest and most successful variance-reduced methods for (1). A better understanding of the behaviour of SAGA remains one of the open challenges in the literature.

Consider problem (1) with λ_i = 1/n for all i. Assume, for simplicity, that each f_i is L_i-smooth and f is µ-strongly convex. In this regime, the iteration complexity of SAGA with uniform sampling probabilities is O((n + L_max/µ) log(1/ϵ)), where L_max := max_i L_i; this was established already by Defazio et al. (2014). Schmidt et al. (2015) conjectured that there exist nonuniform sampling probabilities for which the complexity improves to O((n + L̄/µ) log(1/ϵ)), where L̄ := Σ_i L_i / n. However, the correct importance sampling strategy leading to this result was not discovered until recently, in the work of Gower et al. (2018), where the conjecture was resolved in the affirmative. One of the key difficulties in the analysis was the construction of a suitable stochastic Lyapunov function controlling the iterative process.

Likewise, until recently very little was known about the minibatch performance of SAGA, even for the simplest uniform minibatch strategies. Notable advancements in this area were made by Gower et al. (2018), who derived the currently best rates for SAGA with standard uniform minibatch strategies and the first importance sampling results for a block variant of SAGA.
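For reference, here is a minimal sketch (ours, not the authors' code) of the classical serial SAGA iteration for λ_i = 1/n and no regularizer, with an optional nonuniform sampling rule p_i ∝ L_i shown purely for illustration; the precise probabilities and stepsizes that yield the rates quoted above (Defazio et al., 2014; Gower et al., 2018) differ and are not reproduced here.

```python
import numpy as np

def saga(grad_i, n, L, x0, steps=1000, alpha=None, importance=False, seed=0):
    """Classical serial SAGA for (1/n) * sum_i f_i(x), no regularizer.
    grad_i(x, i): gradient of f_i at x; L: array of smoothness constants L_i."""
    rng = np.random.default_rng(seed)
    p = L / L.sum() if importance else np.full(n, 1.0 / n)   # sampling probabilities
    alpha = alpha if alpha is not None else 1.0 / (3.0 * L.max())
    x = np.array(x0, dtype=float)
    J = np.stack([grad_i(x, i) for i in range(n)])            # table of stored gradients
    g_avg = J.mean(axis=0)
    for _ in range(steps):
        i = rng.choice(n, p=p)
        g_new = grad_i(x, i)
        # unbiased SAGA estimator; the 1/(n p_i) factor corrects for nonuniform sampling
        g = (g_new - J[i]) / (n * p[i]) + g_avg
        x = x - alpha * g
        g_avg = g_avg + (g_new - J[i]) / n                    # keep the running average in sync
        J[i] = g_new
    return x
```

The key point is the unbiasedness of the estimator: taking the expectation over i gives E[g] = ∇f(x) regardless of the probabilities p_i, which is exactly the property SAG lacks.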
1.3. Contributions

SAGA with arbitrary sampling. We study the performance of SAGA under fully general data sampling strategies known in the literature as arbitrary sampling, generalizing all previous results and obtaining an array of new theoretically and practically useful samplings. We call our general method SAGA-AS. Our theorems are expressed in terms of new Lyapunov functions, the construction of which was essential to our success. In the arbitrary sampling paradigm, first proposed by Richtárik & Takáč (2016) in the context of randomized coordinate descent methods, one considers all (proper) random set-valued mappings S (called samplings) with values being subsets of [n]. A sampling is uniquely determined by assigning a probability to each of the 2^n subsets of [n]. A sampling is called proper if the probability of each i ∈ [n] being sampled is positive; that is, if p_i := P(i ∈ S) > 0 for all i. (It does not make sense to consider samplings S that are not proper: if p_i = 0 for some i, a method based on S loses access to f_i and, consequently, the ability to solve (1).) The term arbitrary sampling refers to an arbitrary proper sampling.

Smooth case. We perform an iteration complexity analysis in the smooth case (ψ ≡ 0), assuming f is µ-strongly convex. Our analysis generalizes the results of Defazio et al. (2014) and Gower et al. (2018) to arbitrary sampling. The JacSketch method of Gower et al. (2018) and its analysis rely on the notion of a bias-correcting random variable. Unfortunately, such a random variable does not exist for SAGA-AS. We overcome this obstacle by proposing a bias-correcting random vector (BCRV) which, as we show, always exists. While Gower et al. (2018) and Bibi et al. (2018) consider particular suboptimal choices, we are able to find the BCRV which minimizes the iteration complexity bound. Unlike all known and new variants of SAGA considered in (Gower et al., 2018), SAGA-AS does not arise as a special case of JacSketch. Our linear rates for SAGA-AS are the same as those for the primal-dual stochastic fixed point method Quartz (Qu et al., 2015), the first arbitrary sampling based method for (1), in the regime when Quartz is applicable, which is the case when an explicit strongly convex regularizer is present. In contrast, we do not need an explicit regularizer, which means that SAGA-AS can utilize the strong convexity of f fully, even if the strong convexity parameter µ is not known. While the importance sampling results in (Gower et al., 2018) require each f_i to be strongly convex, we impose this requirement on f only.

Nonsmooth case. We perform an iteration complexity analysis in the general nonsmooth case. When the regularizer ψ is strongly convex, which is the same setting as that considered in (Qu et al., 2015), our iteration complexity bounds are essentially the same as those of Quartz. However, we also prove linear convergence results, with the same rates, under a quadratic functional growth condition, which does not imply strong convexity (Necoara et al., 2018). These are the first linear convergence results for any variant of SAGA without strong convexity. Moreover, to the best of our knowledge, SAGA-AS is the only variance-reduced method which achieves linear convergence without any a priori knowledge of the error bound condition number. Our arbitrary sampling rates are summarized in Table 1.

| Regime | Arbitrary sampling complexity | Thm |
| --- | --- | --- |
| Smooth: ψ ≡ 0; f_i is L_i-smooth, f is µ-strongly convex | max{ max_i (1/p_i + 4(1+B)L_i A_i λ_i/µ), 2(1+B)L/µ } log(1/ϵ) | 3.3 |
| Nonsmooth: P satisfies the µ-growth condition (19) and Assumption 4.3; f_i(x) = φ_i(A_i^T x), φ_i is 1/γ-smooth, f is L-smooth | max{ 6L/µ, 3 max_i (1/p_i + 4v_i λ_i/(p_i µ γ)) } log(1/ϵ) | 4.4 |
| Nonsmooth: ψ is µ-strongly convex; f_i(x) = φ_i(A_i^T x), φ_i is 1/γ-smooth | max_i (1 + 1/p_i + 3v_i λ_i/(p_i µ γ)) log(1/ϵ) | 4.5 |

Table 1. Iteration complexity results for SAGA-AS. Here p_i := P(i ∈ S), where S is the sampling of subsets of [n] used by SAGA-AS. The key complexity parameters A_i, B and v_i are defined in the sections containing the theorems.

1.4. Brief review of arbitrary sampling results

The arbitrary sampling paradigm was proposed by Richtárik & Takáč (2016), where a randomized coordinate descent method with arbitrary sampling of subsets of coordinates was analyzed for the unconstrained minimization of a strongly convex function. Subsequently, the primal-dual method Quartz with arbitrary sampling of dual variables was studied in Qu et al. (2015) for solving (1) in the case when ψ is strongly convex (and λ_i = 1/n for all i). An accelerated randomized coordinate descent method with arbitrary sampling called ALPHA was proposed by Qu & Richtárik (2016a) in the context of minimizing the sum of a smooth convex function and a convex block-separable regularizer.
A key concept in the analysis of all known methods in the arbitrary sampling paradigm is the notion of expected separable overapproximation (ESO), introduced by Richtárik & Takáč (2016) and studied in depth in the context of arbitrary sampling by Qu & Richtárik (2016b). A stochastic primal-dual hybrid gradient algorithm (aka Chambolle-Pock) with arbitrary sampling of dual variables was studied by Chambolle et al. (2017). Recently, an accelerated coordinate descent method with arbitrary sampling for minimizing a smooth and strongly convex function was studied by Hanzely & Richtárik (2018). Finally, the first arbitrary sampling analysis in a nonconvex setting was performed by Horváth & Richtárik (2018); this is also the first work in which the optimal sampling out of the class of all samplings of a given minibatch size was identified.

2. The Algorithm

Let F : R^d → R^n and G : R^d → R^{d×n} be defined by F(x) := (f_1(x), ..., f_n(x)) ∈ R^n and G(x) := [∇f_1(x), ..., ∇f_n(x)] ∈ R^{d×n}. We refer to G(x) as the Jacobian of F at x.

2.1. JacSketch

Gower et al. (2018) propose a new family of variance-reduced SGD methods called JacSketch which progressively build a variance-reduced estimator of the gradient via a new technique they call Jacobian sketching. As shown in (Gower et al., 2018), state-of-the-art variants of SAGA can be obtained as special cases of JacSketch. However, SAGA-AS does not arise as a special case of JacSketch. In fact, the generic analysis provided in (Gower et al., 2018) (Thm 3.6) is too coarse and does not lead to good bounds for any variant of SAGA with importance sampling. On the other hand, the specialized analysis of Gower et al. (2018) which does handle importance sampling well does not generalize to arbitrary sampling, regularized objectives or regimes without strong convexity. In this section we provide a brief review of the JacSketch method, establishing some useful notation along the way, with the goal of pointing out the moment of departure from the JacSketch construction which leads to SAGA-AS.

The iterations of JacSketch have the form

  x^{k+1} = x^k − α g^k,   (2)

where α > 0 is a fixed stepsize, and g^k is a variance-reduced unbiased estimator of the gradient ∇f(x^k), built iteratively through a process involving Jacobian sketching, sketch-and-project and bias-correction. Starting from an arbitrary matrix J^0 ∈ R^{d×n}, at each iteration k ≥ 0 of JacSketch an estimator J^{k+1} ∈ R^{d×n} of the true Jacobian G(x^k) is constructed using a sketch-and-project iteration:

  J^{k+1} = argmin_{J ∈ R^{d×n}} ||J − J^k||  subject to  J S^k = G(x^k) S^k.   (3)

Above, ||X|| := sqrt(Tr(X X^T)) is the Frobenius norm (Gower et al. (2018) consider a weighted Frobenius norm, but this is not needed for our purposes), and S^k ∈ R^{n×τ} is a random matrix drawn from some ensemble of matrices D in an i.i.d. fashion in each iteration. The solution to (3) is the closest matrix to J^k consistent with the true Jacobian in its action on S^k. Intuitively, the higher τ is, the more accurate J^{k+1} will be as an estimator of the true Jacobian G(x^k). However, in order to control the cost of computing J^{k+1}, in practical applications one chooses τ ≪ n. In the case of standard SAGA, for instance, S^k is a standard unit basis vector in R^n chosen uniformly at random (and hence τ = 1). The projection subproblem (3) has the explicit solution (see Lemma B.1 in (Gower et al., 2018)):

  J^{k+1} = J^k + (G(x^k) − J^k) Π_{S^k},

where Π_S := S(S^T S)^† S^T is the orthogonal projection matrix onto the column space of S, and † denotes the Moore-Penrose pseudoinverse.
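The sketch-and-project update is easy to check numerically. The snippet below (ours, for illustration only) forms Π_S for a column submatrix S = I_{:S} of the identity and verifies that J + (G − J)Π_S overwrites exactly the sampled columns of J with the corresponding columns of G and leaves the rest untouched (cf. identity (6) below).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 8
J = rng.standard_normal((d, n))            # current Jacobian estimate J^k
G = rng.standard_normal((d, n))            # stand-in for the true Jacobian G(x^k)

S_set = [2, 5]                             # sampled subset S of [n]
S = np.eye(n)[:, S_set]                    # S = I_{:S}, a column submatrix of the identity

Pi_S = S @ np.linalg.pinv(S.T @ S) @ S.T   # Pi_S = S (S^T S)^dagger S^T
J_new = J + (G - J) @ Pi_S                 # sketch-and-project update

# For S = I_{:S}, Pi_S = Diag(e_S): sampled columns are replaced by the true
# gradients, all other columns are kept.
assert np.allclose(J_new[:, S_set], G[:, S_set])
untouched = [i for i in range(n) if i not in S_set]
assert np.allclose(J_new[:, untouched], J[:, untouched])
```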
Since J^{k+1} is constructed to be an approximation of G(x^k), and since ∇f(x^k) = G(x^k)λ, where λ := (λ_1, ..., λ_n)^T, it makes sense to estimate the gradient via ∇f(x^k) ≈ J^{k+1}λ. However, this gradient estimator is not unbiased, which poses dramatic challenges for complexity analysis. Indeed, the celebrated SAG method, with its infamously technical analysis, uses precisely this estimator in the special case when S^k is chosen to be a standard basis vector in R^n sampled uniformly at random. Fortunately, as shown in (Gower et al., 2018), an unbiased estimator can be constructed by taking a random linear combination of J^{k+1}λ and J^kλ:

  g^k = (1 − θ_{S^k}) J^k λ + θ_{S^k} J^{k+1} λ = J^k λ + θ_{S^k} (G(x^k) − J^k) Π_{S^k} λ,   (4)

where θ = θ_S ∈ R is a bias-correcting random variable (dependent on S), defined as any random variable for which E[θ_S Π_S λ] = λ. Under this condition, g^k becomes an unbiased estimator of ∇f(x^k). The JacSketch method is obtained by alternating optimization steps (2) (producing the iterates x^k) with sketch-and-project steps (producing J^k).

2.2. Bias-correcting random vector

In order to construct SAGA-AS, we take a departure here and consider a bias-correcting random vector (θ_S^1, ..., θ_S^n) ∈ R^n instead. From now on it will be useful to think of θ_S as an n×n diagonal matrix, with the vector (θ_S^1, ..., θ_S^n) embedded in its diagonal. In contrast to (4), we propose to construct g^k via

  g^k = J^k λ + (G(x^k) − J^k) θ_{S^k} Π_{S^k} λ.

It is easy to see that under the following assumption, g^k is an unbiased estimator of ∇f(x^k).

Assumption 2.1 (Bias-correcting random vector). We say that the diagonal random matrix θ_S ∈ R^{n×n} is a bias-correcting random vector if

  E[θ_S Π_S λ] = λ.   (5)

2.3. Choosing the distribution D

In order to complete the description of SAGA-AS, we need to specify the distribution D. We choose D to be a distribution over random column submatrices of the n×n identity matrix I. Such a distribution is uniquely characterized by a random subset of the columns of I ∈ R^{n×n}, i.e., by a random subset of [n]. This leads us to the notion of a sampling, already outlined in the introduction.

Definition 2.2 (Sampling). A sampling S is a random set-valued mapping with values being the subsets of [n]. It is uniquely characterized by the choice of probabilities p_C := P[S = C] associated with every subset C of [n]. Given a sampling S, we let p_i := P[i ∈ S] = Σ_{C : i ∈ C} p_C. We say that a sampling S is i) proper if p_i > 0 for all i, ii) serial if |S| = 1 with probability 1, iii) τ-nice if it selects from all subsets of [n] of cardinality τ uniformly at random, and iv) independent if it is formed as follows: for each i ∈ [n] we flip a biased coin, independently, with probability of success p_i > 0; if we are successful, we include i in S. (Independent samplings were also considered for nonconvex ERM problems in (Horváth & Richtárik, 2018) and for accelerated coordinate descent in (Hanzely & Richtárik, 2018).)

Given a proper sampling S, we sample matrices S ∼ D as follows: i) draw a random set S, ii) define S = I_{:S} ∈ R^{n×|S|}, the random column submatrix of I corresponding to the columns i ∈ S. For h = (h_1, ..., h_n) ∈ R^n and a sampling S, define the vector h_S ∈ R^n by (h_S)_i := h_i 1(i ∈ S), where 1(i ∈ S) equals 1 if i ∈ S and 0 otherwise. It is easy to observe (see Lemma 4.7 in (Gower et al., 2018)) that for S = I_{:S} we have the identity

  Π_S = Π_{I_{:S}} = I_S := Diag(e_S),   (6)

i.e., the diagonal matrix with ones in the positions i ∈ S and zeros elsewhere. To simplify notation, we will write θ_S instead of θ_S = θ_{I_{:S}}.

Lemma 2.3. Let S be a proper sampling and define D by setting S = I_{:S}. Then condition (5) is equivalent to

  E[θ_S^i 1(i ∈ S)] = Σ_{C ⊆ [n] : i ∈ C} p_C θ_C^i = 1  for all i ∈ [n].   (7)
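The three named samplings of Definition 2.2 and condition (7) are straightforward to simulate. The sketch below (ours, with hypothetical helper names) implements them and checks empirically, for the τ-nice case, that the default choice θ_S^i = 1/p_i discussed next indeed satisfies E[θ_S^i 1(i ∈ S)] = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def serial(p):                      # |S| = 1; element i drawn with probability p_i
    return {int(rng.choice(len(p), p=p))}

def tau_nice(n, tau):               # uniformly random subset of size tau
    return set(rng.choice(n, size=tau, replace=False).tolist())

def independent(p):                 # include each i independently with probability p_i
    return {i for i in range(len(p)) if rng.random() < p[i]}

# empirical check of condition (7) for theta_S^i = 1/p_i under the tau-nice sampling
n, tau, trials = 6, 3, 200_000
p = np.full(n, tau / n)             # for the tau-nice sampling, p_i = tau/n
acc = np.zeros(n)
for _ in range(trials):
    for i in tau_nice(n, tau):
        acc[i] += 1.0 / p[i]        # accumulates theta_S^i * 1(i in S)
print(acc / trials)                  # each entry should be close to 1
```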
This condition is satisfied by the default choice θ_S^i ≡ 1/p_i. In general, there are infinitely many bias-correcting random vectors characterized by (7). In SAGA-AS we reserve the freedom to choose any of these vectors.

2.4. SAGA-AS

By putting all of the development above together, we arrive at SAGA-AS (Algorithm 1).

Algorithm 1: SAGA with Arbitrary Sampling (SAGA-AS)
  Parameters: arbitrary proper sampling S; bias-correcting random vector θ_S; stepsize α > 0
  Initialization: choose x^0 ∈ R^d, J^0 ∈ R^{d×n}
  for k = 0, 1, 2, ... do
    Sample a fresh set S^k ∼ S, S^k ⊆ [n]
    J^{k+1} = J^k + (G(x^k) − J^k) I_{S^k}
    g^k = J^k λ + (G(x^k) − J^k) θ_{S^k} I_{S^k} λ
    x^{k+1} = prox_α^ψ(x^k − α g^k)

Note that since we consider problem (1) with a regularizer ψ, the optimization step involves a proximal operator, defined as

  prox_α^ψ(x) := argmin_y { (1/(2α)) ||x − y||^2 + ψ(y) },   α > 0.

To shed more light on the key steps of SAGA-AS, note that an alternative way of writing the Jacobian update is J^{k+1}_{:i} = ∇f_i(x^k) for i ∈ S^k and J^{k+1}_{:i} = J^k_{:i} for i ∉ S^k. The gradient estimate can alternatively be written as

  g^k = Σ_{i=1}^n λ_i J^k_{:i} + Σ_{i ∈ S^k} λ_i θ^i_{S^k} (∇f_i(x^k) − J^k_{:i})
      = Σ_{i ∉ S^k} λ_i J^k_{:i} + Σ_{i ∈ S^k} λ_i (θ^i_{S^k} ∇f_i(x^k) + (1 − θ^i_{S^k}) J^k_{:i}).

3. Analysis in the Smooth Case

In this section we consider problem (1) in the smooth case; i.e., we let ψ ≡ 0. We make the following assumption on S.

Assumption 3.1. There exist constants A_i ≥ 0 for 1 ≤ i ≤ n and 0 ≤ B ≤ 1 such that for any matrix M ∈ R^{d×n},

  E[ ||M θ_S I_S λ||^2 ] ≤ Σ_i A_i λ_i^2 ||M_{:i}||^2 + B ||Mλ||^2.   (8)

3.1. Main result

Given an arbitrary proper sampling S and a bias-correcting random vector θ_S, for each i ∈ [n] define

  β_i := Σ_{C ⊆ [n] : i ∈ C} p_C |C| (θ_C^i)^2,   (9)

where |C| is the cardinality of the set C. As we shall see, these quantities play a key role in our complexity result, presented next.

Lemma 3.2. (i) For an arbitrary proper sampling S, Assumption 3.1 is satisfied by A_i = β_i and B = 0. (ii) For the τ-nice sampling S with θ_S^i = 1/p_i, Assumption 3.1 is satisfied by A_i = n(n−τ)/(τ(n−1)) and B = n(τ−1)/(τ(n−1)). (iii) For the independent sampling S with θ_S^i = 1/p_i, Assumption 3.1 is satisfied by A_i = 1/p_i − 1 and B = 1.

Theorem 3.3. Let S be an arbitrary proper sampling, and let θ_S be a bias-correcting random vector satisfying (7). Let f be µ-strongly convex and L-smooth, and let each f_i be convex and L_i-smooth. Let {x^k, J^k} be the iterates produced by Algorithm 1. Consider the stochastic Lyapunov function

  Ψ^k := ||x^k − x*||^2 + 2α Σ_{i=1}^n σ_i A_i λ_i^2 ||J^k_{:i} − ∇f_i(x*)||^2,

where σ_i = 1/(4(1+B) L_i A_i p_i λ_i) for all i. If the stepsize α satisfies

  α ≤ min{ min_{1≤i≤n} p_i/(µ + 4(1+B) L_i A_i λ_i p_i), 1/(2(1+B)L) },   (10)

then E[Ψ^k] ≤ (1 − µα)^k E[Ψ^0]. This implies that if we choose α equal to the upper bound in (10), then E[Ψ^k] ≤ ϵ E[Ψ^0] as long as

  k ≥ max{ max_i (1/p_i + 4(1+B) L_i A_i λ_i/µ), 2(1+B)L/µ } log(1/ϵ).

If µ is unknown and we choose

  α ≤ min{ min_{1≤i≤n} 1/(8(1+B) L_i A_i λ_i), 1/(2(1+B)L) },   (11)

then E[Ψ^k] ≤ (1 − min{µα, min_i p_i/2})^k E[Ψ^0]. This implies that if we choose α equal to the upper bound in (11), then E[Ψ^k] ≤ ϵ E[Ψ^0] as long as

  k ≥ max{ max_i max{ 2/p_i, 8(1+B) L_i A_i λ_i/µ }, 2(1+B)L/µ } log(1/ϵ).

Our result involves a novel stochastic Lyapunov function Ψ^k, different from that used in (Gower et al., 2018).
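To make Algorithm 1 and Theorem 3.3 concrete, here is a compact NumPy sketch (ours, under the stated assumptions, not the authors' reference implementation) of SAGA-AS with the default bias-correcting vector θ_S^i = 1/p_i and a soft-thresholding prox for ψ(x) = c||x||_1; the callables grad_i and sampling, as well as the helper names, are hypothetical.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding); plays the role of prox_alpha^psi."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def saga_as(grad_i, sampling, p, lam, x0, alpha, l1=0.0, iters=1000):
    """Sketch of SAGA-AS (Algorithm 1) with theta_S^i = 1/p_i.
    grad_i(x, i): gradient of f_i at x; sampling(): random subset of range(n);
    p[i] = P(i in S); lam[i] = lambda_i; psi(x) = l1 * ||x||_1."""
    n = len(p)
    x = np.array(x0, dtype=float)
    J = np.stack([grad_i(x, i) for i in range(n)])   # row i stores the column J^k_{:i}
    Jlam = lam @ J                                   # J^k lambda, maintained incrementally
    for _ in range(iters):
        S = sampling()
        new_grads = {i: grad_i(x, i) for i in S}     # columns of G(x^k) touched by S^k
        g = Jlam.copy()
        for i in S:
            g += lam[i] * (new_grads[i] - J[i]) / p[i]   # g^k with theta_S^i = 1/p_i
        x = prox_l1(x - alpha * g, alpha * l1)           # x^{k+1} = prox_alpha^psi(x^k - alpha g^k)
        for i in S:                                      # J^{k+1} = J^k + (G(x^k) - J^k) I_{S^k}
            Jlam = Jlam + lam[i] * (new_grads[i] - J[i])
            J[i] = new_grads[i]
    return x
```

Note that the Jacobian update and the gradient estimate reuse the same gradients ∇f_i(x^k), i ∈ S^k, so the per-iteration cost is |S^k| gradient oracle calls, exactly as in the algorithm box above.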
3.2. Optimal bias-correcting random vector

Note that for an arbitrary proper sampling the complexity bound improves as the quantities β_i get smaller. Having said that, even for a fixed sampling S the choice of β_i is not unique; indeed, β_i depends on the choice of θ_S. In view of Lemma 2.3, we have many choices for this random vector. Let Θ(S) be the collection of all bias-correcting random vectors associated with the sampling S. In our next result we compute the bias-correcting random vector θ_S which leads to the minimal complexity parameters β_i. In the rest of the paper, let E_i[·] := E[· | i ∈ S].

Lemma 3.4. Let S be a proper sampling. Then (i) min_{θ ∈ Θ(S)} β_i = 1/(Σ_{C : i ∈ C} p_C/|C|) = 1/(p_i E_i[1/|S|]) for all i, and the minimum is attained at the θ* ∈ Θ(S) given by θ*^i_C = 1/(|C| Σ_{C' : i ∈ C'} p_{C'}/|C'|) = 1/(p_i |C| E_i[1/|S|]) for all C such that i ∈ C; (ii) 1 ≤ E_i[1/|S|] E_i[|S|] for all i.

3.3. Importance sampling for minibatches

In this part we construct an importance sampling for minibatches. This is in general a daunting task, and only a handful of papers exist on the topic. In particular, Csiba & Richtárik (2018) and Hanzely & Richtárik (2019) focused on coordinate descent methods, and Gower et al. (2018) considered minibatch SAGA with importance sampling over subsets of [n] forming a partition.

Let τ := E[|S|] be the expected minibatch size, and let L̄ := Σ_{i ∈ [n]} L_i λ_i. We consider the independent sampling with θ_S^i = 1/p_i. From Lemma 3.2 and Theorem 3.3, the iteration complexity becomes

  max{ max_{1≤i≤n} (1/p_i + 8 L_i A_i λ_i/µ), 4L/µ } log(1/ϵ),

where A_i = 1/p_i − 1. Hence

  1/p_i + 8 L_i A_i λ_i/µ = (µ + 8 L_i λ_i (1 − p_i))/(p_i µ).   (12)

Let q_i := (µ + 8 L_i λ_i) τ / Σ_{j ∈ [n]} (µ + 8 L_j λ_j), and let T := {i : q_i > 1}. Next, we discuss two cases.

Case 1. Suppose T = ∅. In this case, we choose p_i = q_i. From (12), we have

  1/p_i + 8 L_i A_i λ_i/µ ≤ (µ + 8 L_i λ_i)/(p_i µ) = Σ_{j ∈ [n]} (µ + 8 L_j λ_j)/(τ µ) = n/τ + 8 L̄/(τµ).

Therefore, the iteration complexity has the upper bound max{ n/τ + 8 L̄/(τµ), 4L/µ } log(1/ϵ).

Case 2. Suppose T ≠ ∅. In this case, we choose p_i = 1 for i ∈ T and q_i ≤ p_i ≤ 1 for i ∉ T, such that Σ_{i ∈ [n]} p_i = τ. Notice that p_i = 1 means A_i = 0. Hence, for i ∈ T, from (12) we have 1/p_i + 8 L_i A_i λ_i/µ = (µ + 8 L_i λ_i (1 − p_i))/(µ p_i) = 1. For i ∉ T, from (12) we have 1/p_i + 8 L_i A_i λ_i/µ ≤ (µ + 8 L_i λ_i)/(p_i µ) ≤ Σ_{j ∈ [n]} (µ + 8 L_j λ_j)/(τ µ). Therefore, the iteration complexity again has the upper bound max{ n/τ + 8 L̄/(τµ), 4L/µ } log(1/ϵ).

To summarize the two cases: by choosing min{q_i, 1} ≤ p_i ≤ 1 such that Σ_{i ∈ [n]} p_i = τ, the iteration complexity admits the upper bound

  max{ n/τ + 8 L̄/(τµ), 4L/µ } log(1/ϵ).   (13)

It should be noted that in practice we can simply choose p_i = min{q_i, 1} for convenience; then (13) also holds, but with E[|S|] = Σ_{i ∈ [n]} p_i ≤ τ.

Linear speedup. When τ ≤ (nµ + 8L̄)/(4L), (13) becomes (n/τ + 8L̄/(τµ)) log(1/ϵ), which yields linear speedup with respect to τ. When τ ≥ (nµ + 8L̄)/(4L), (13) becomes (4L/µ) log(1/ϵ).
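The construction above reduces to a few lines of code. The sketch below (ours; names hypothetical) computes the practical choice p_i = min{q_i, 1} for the independent sampling and draws a minibatch from it.

```python
import numpy as np

def minibatch_importance_probs(L, lam, mu, tau):
    """Independent-sampling probabilities for SAGA-AS minibatches:
    q_i proportional to (mu + 8 L_i lambda_i), scaled to expected size tau,
    then clipped at 1 (the practical choice p_i = min{q_i, 1})."""
    w = mu + 8.0 * L * lam
    q = tau * w / w.sum()
    return np.minimum(q, 1.0)

def independent_sampling(p, rng):
    """Draw S by including each i independently with probability p_i."""
    return np.flatnonzero(rng.random(p.shape[0]) < p)

# toy usage
rng = np.random.default_rng(0)
L = rng.uniform(1.0, 100.0, size=20)       # smoothness constants L_i
lam = np.full(20, 1.0 / 20)                # lambda_i = 1/n
p = minibatch_importance_probs(L, lam, mu=0.1, tau=5)
print(p.sum())                              # expected minibatch size E|S| <= tau
print(independent_sampling(p, rng))
```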
3.4. SAGA-AS vs Quartz

In this section we compare our results for SAGA-AS with known complexity results for the primal-dual method Quartz of Qu et al. (2015). We do this because Quartz was the first, and (with the exception of the dfSDCA method of Csiba & Richtárik (2015)) remains the only, SGD-type method for solving (1) which was analyzed in the arbitrary sampling paradigm. Prior to this work, we conjectured that SAGA-AS would attain the same complexity as Quartz. As we shall show, this is indeed the case. The problem studied in (Qu et al., 2015) is

  min_x (1/n) Σ_{i=1}^n φ_i(A_i^T x) + ψ(x),   (14)

where A_i ∈ R^{d×m}, φ_i : R^m → R is 1/γ-smooth and convex, and ψ : R^d → R is a µ-strongly convex function. When ψ is also smooth, problem (14) can be written in the form of problem (1) with λ_i = 1/n and

  f_i(x) = φ_i(A_i^T x) + ψ(x).   (15)

Quartz guarantees the duality gap to be less than ϵ in expectation using at most

  O( max_i { 1/p_i + v_i/(p_i µ γ n) } log(1/ϵ) )   (16)

iterations, where the parameters v_1, ..., v_n are assumed to satisfy the following expected separable overapproximation (ESO) inequality, which needs to hold for all h_1, ..., h_n ∈ R^m:

  E[ ||Σ_{i ∈ S} A_i h_i||^2 ] ≤ Σ_{i=1}^n p_i v_i ||h_i||^2.   (17)

If in addition ψ is L_ψ-smooth, then f_i in (15) is L_i-smooth with L_i ≤ λ_max(A_i^T A_i)/γ + L_ψ. We now consider several particular samplings.

Serial samplings. By Lemma 5 in (Qu et al., 2015), v_i = λ_max(A_i^T A_i). Hence, the bound (16) becomes

  O( max_i { 1/p_i + λ_max(A_i^T A_i)/(p_i µ γ n) } log(1/ϵ) ).

By choosing θ_S^i = 1/p_i (this is both the default choice mentioned in Lemma 2.3 and the optimal choice in view of Lemma 3.4), we have A_i = β_i = 1/p_i and B = 0, and our iteration complexity bound in Theorem 3.3 becomes

  max{ max_i { 1/p_i + 4L_ψ/(nµp_i) + 4λ_max(A_i^T A_i)/(p_i µ γ n) }, 2L/µ } log(1/ϵ).

We can see that as long as L_ψ/µ = O(n), the two bounds are essentially the same.

Parallel (τ-nice) sampling. By Lemma 6 in (Qu et al., 2015),

  v_i = λ_max( Σ_{j=1}^d (1 + (ω_j − 1)(τ − 1)/(n − 1)) A_{ji}^T A_{ji} ),

where A_{ji} is the j-th row of A_i, and for each 1 ≤ j ≤ d, ω_j is the number of nonzero blocks in the j-th row of A := (A_1, ..., A_n), i.e., ω_j := |{i ∈ [n] : A_{ji} ≠ 0}|. In the dense case (i.e., ω_j = n for all j), v_i = τ λ_max(A_i^T A_i). Hence, (16) becomes

  O( max_i { n/τ + λ_max(A_i^T A_i)/(µγ) } log(1/ϵ) ).

By choosing θ_S^i = 1/p_i = n/τ, from Lemma 3.2 and Theorem 3.3 the iteration complexity becomes

  max{ n/τ + ((n − τ)/(n − 1)) · 4(1+B) n max_i (L_i λ_i)/(µτ), 2(1+B)L/µ } log(1/ϵ),

where B = n(τ−1)/(τ(n−1)) ≤ 1 and L_i ≤ λ_max(A_i^T A_i)/γ + L_ψ. We can see that these bounds would be better than those of Quartz. However, if ω_j ≪ n, then Quartz may enjoy a tighter bound. The parameters v_1, ..., v_n used in (Qu et al., 2015) allow one to exploit the sparsity of the data matrix A = (A_1, ..., A_n) and achieve almost linear speedup when A is sparse or has favourable spectral properties. In the next section we study SAGA-AS further in the case when the objective function is of the form (14), and obtain results which, like Quartz, are able to improve with data sparsity.

4. Analysis in the Composite Case

We now consider the general problem (1) with ψ ≢ 0. In order to be able to take advantage of data sparsity, we assume that the functions f_i take the form

  f_i(x) ≡ φ_i(A_i^T x).   (18)

Then clearly ∇f_i(x) = A_i ∇φ_i(A_i^T x). Thus if SAGA-AS starts with J^0 = [A_1 α_1^0, A_2 α_2^0, ..., A_n α_n^0] for some α_i^0 ∈ R^m, i ∈ [n], then we always have J^k = [A_1 α_1^k, A_2 α_2^k, ..., A_n α_n^k] for some α_i^k ∈ R^m, i ∈ [n].

We assume that the set of minimizers X* := argmin{P(x) : x ∈ R^d} is nonempty, and let P* = P(x*) for x* ∈ X*. We denote by [x]* := argmin{||x − y|| : y ∈ X*} the optimal solution closest to x. Further, for any M > 0 define X(M) to be the set of points with objective value bounded by P* + M, i.e., X(M) := {x ∈ dom(ψ) : P(x) ≤ P* + M}. We make several further assumptions:

Assumption 4.1 (Smoothness). Each φ_i : R^m → R is 1/γ-smooth and convex, i.e.,

  0 ≤ ⟨∇φ_i(a) − ∇φ_i(b), a − b⟩ ≤ ||a − b||^2/γ  for all a, b ∈ R^m.

Assumption 4.2 (Quadratic functional growth condition; see (Necoara et al., 2018)). For any M > 0, there is µ > 0 such that for any x ∈ X(M),

  P(x) − P* ≥ (µ/2) ||x − [x]*||^2.   (19)

Assumption 4.3 (Nullspace consistency). For any x*, y* ∈ X* we have A_i^T x* = A_i^T y* for all i ∈ [n].

We shall need a slightly stronger condition than Assumption 4.2: there is a constant µ > 0 such that

  P(x^k) − P* ≥ (µ/2) ||x^k − [x^k]*||^2  with probability 1, for all k ≥ 1,   (20)

for the sequence {x^k} produced by Algorithm 1. For (20) to hold, it suffices that (19) holds for any x ∈ dom(ψ). This can be achieved by adding a sufficiently large box constraint to problem (1), which makes dom(ψ) compact without changing the optimal solution set.

4.1. Linear convergence under the quadratic functional growth condition

Our first complexity result states a linear convergence rate for SAGA-AS under the quadratic functional growth condition. Note that the Lyapunov function used here is not stochastic (i.e., E[Ψ^{k+1} | x^k, α_i^k] is not random).

Theorem 4.4. Assume f is L-smooth. Let v ∈ R^n_+ be a positive vector satisfying (17) for a proper sampling S. Let {x^k, J^k} be the iterates produced by Algorithm 1 with θ_S^i = 1/p_i for all i and S. Let x* ∈ X* be arbitrary.
Consider the Lyapunov function

  Ψ^k := ||x^k − [x^k]*||^2 + α Σ_{i=1}^n σ_i p_i^{-1} v_i λ_i^2 ||α_i^k − ∇φ_i(A_i^T x*)||^2,

where σ_i = γ/(2 v_i λ_i) for all i. Then there is a constant µ > 0 such that the following is true. If the stepsize α satisfies

  α ≤ min{ (2/3) min_{1≤i≤n} p_i/(µ + 4 v_i λ_i/γ), 1/(3L) },   (21)

then E[Ψ^k] ≤ ((1 + αµ/2)/(1 + αµ))^k E[Ψ^0]. This implies that if we choose α equal to the upper bound in (21), then E[Ψ^k] ≤ ϵ E[Ψ^0] when

  k ≥ 2 + max{ 6L/µ, 3 max_i (1/p_i + 4 v_i λ_i/(p_i µ γ)) } log(1/ϵ).

If µ is unknown and we choose

  α ≤ min{ min_{1≤i≤n} p_i γ/(12 v_i λ_i), 1/(3L) },   (22)

then E[Ψ^k] ≤ (1 − min{ αµ/(2(1+αµ)), min_i p_i/2 })^k E[Ψ^0]. This implies that if we choose α equal to the upper bound in (22), then E[Ψ^k] ≤ ϵ E[Ψ^0] when

  k ≥ 2 + max{ 6L/µ, max_i max{ 2/p_i, 24 v_i λ_i/(p_i µ γ) } } log(1/ϵ).

Non-strongly convex problems of the form (1) and (18) which satisfy Assumptions 4.1, 4.2 and 4.3 include the case when each φ_i is strongly convex and ψ is polyhedral (Necoara et al., 2018). In particular, Theorem 4.4 applies to the following logistic regression problem, which we use in the experiments (λ1 ≥ 0 and λ2 ≥ 0):

  min_x (1/n) Σ_{i=1}^n log(1 + e^{b_i A_i^T x}) + λ1 ||x||_1 + (λ2/2) ||x||^2.   (23)

4.2. Linear convergence for strongly convex regularizer

For the problem studied in (Qu et al., 2015), where the regularizer ψ is µ-strongly convex (and hence Assumption 4.2 holds and the minimizer is unique), we obtain the following refinement of Theorem 4.4.

Theorem 4.5. Let ψ be µ-strongly convex. Let v ∈ R^n_+ be a positive vector satisfying (17) for a proper sampling S. Let {x^k, J^k} be the iterates produced by Algorithm 1 with θ_S^i = 1/p_i for all i and S. Consider the same Lyapunov function as in Theorem 4.4, but set σ_i = 2γ/(3 v_i λ_i) for all i. If the stepsize α satisfies

  α ≤ min_{1≤i≤n} p_i/(µ + 3 v_i λ_i/γ),   (24)

then E[Ψ^k] ≤ (1 + αµ)^{-k} E[Ψ^0]. So, if we choose α equal to the upper bound in (24), then E[Ψ^k] ≤ ϵ E[Ψ^0] when

  k ≥ max_i (1 + 1/p_i + 3 v_i λ_i/(p_i µ γ)) log(1/ϵ).

If µ is unknown and we choose σ_i = γ/((1+αµ) v_i λ_i) for all i and α ≤ min_{1≤i≤n} p_i γ/(4 v_i λ_i), then E[Ψ^k] ≤ (1 − min{ αµ/(1+αµ), min_i p_i/2 })^k E[Ψ^0]. So, if we choose α equal to the upper bound, then E[Ψ^k] ≤ ϵ E[Ψ^0] when

  k ≥ max_i max{ 1 + 4 v_i λ_i/(p_i µ γ), 2/p_i } log(1/ϵ).

Note that, up to small constants, the rate provided by Theorem 4.5 is the same as that of Quartz. Hence, the analysis for special samplings provided in Section 3.4 applies, and we conclude that SAGA-AS is also able to accelerate on sparse data.

5. Experiments

We tested SAGA-AS on the logistic regression problem (23) on 3 different datasets: w8a, a9a and ijcnn1 (all from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The experiments presented in Sections 5.1 and 5.2 use λ1 = 0 and λ2 = 1e−5, which is of the same order as the inverse of the number of samples in the three datasets. In Section 5.3 we test the unregularized problem with λ1 = λ2 = 0. In all plots, the x-axis records the number of passes over the dataset. More experiments can be found in the supplementary material.

5.1. Batch sampling

Here we compare SAGA-AS with SDCA for the τ-nice sampling S with τ ∈ {1, 10, 50}. Note that SDCA with τ-nice sampling behaves the same, both in theory and in practice, as Quartz with τ-nice sampling. We report in Figure 1 the results obtained for the dataset ijcnn1. When we increase τ from 1 to 50, the number of epochs needed by SAGA-AS increases only by a factor of less than 6. This indicates a considerable speedup if parallel computation is used in the implementation.

Figure 1. Mini-batch SAGA vs. mini-batch SDCA on ijcnn1 (primal-dual gap vs. number of epochs): (a) SAGA with τ ∈ {1, 10, 50}; (b) SDCA with τ ∈ {1, 10, 50}. (Plots omitted.)
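As a usage illustration only (ours, not the authors' experimental code), the gradient oracle of problem (23) with λ1 = 0 can be plugged into the saga_as sketch given after Section 3.1; that helper is assumed to be in scope, the data below is synthetic, and the stepsize is an arbitrary small constant rather than the value prescribed by Theorem 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
b = rng.choice([-1.0, 1.0], size=n)
lam2 = 1e-5                                    # l2 parameter, as in Sections 5.1-5.2

def grad_i(x, i):
    """Gradient of f_i(x) = log(1 + exp(b_i * A_i^T x)) + (lam2/2) * ||x||^2."""
    t = b[i] * (A[i] @ x)
    return b[i] * A[i] / (1.0 + np.exp(-t)) + lam2 * x

tau = 10
p = np.full(n, tau / n)                        # tau-nice sampling: p_i = tau/n
lam = np.full(n, 1.0 / n)                      # lambda_i = 1/n
sampling = lambda: set(rng.choice(n, size=tau, replace=False).tolist())

# saga_as is the sketch shown earlier (default theta_S^i = 1/p_i; l1 prox unused here)
x = saga_as(grad_i, sampling, p, lam, x0=np.zeros(d), alpha=0.05, iters=2000)
```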
5.2. Importance sampling

We compare uniform-sampling SAGA (SAGA-UNI) with importance-sampling SAGA (SAGA-IP), as described in Section 3.3, for three values of τ ∈ {1, 10, 50}. The results for the datasets w8a and ijcnn1 are shown in Figure 2. For the dataset ijcnn1, mini-batching with importance sampling almost achieves linear speedup, as the number of epochs does not increase with τ. For the dataset w8a, mini-batching with importance sampling can even require fewer epochs than serial uniform sampling. Note that we adopt the importance sampling strategy described in (Hanzely & Richtárik, 2019), and the actual running time is the same as for uniform sampling.

Figure 2. Importance sampling vs. uniform sampling (primal-dual gap vs. number of epochs): SAGA-IP and SAGA-UNI with τ ∈ {1, 10, 50} on the w8a and ijcnn1 datasets. (Plots omitted.)

5.3. Comparison with coordinate descent

We consider the unregularized logistic regression problem (23) with λ1 = λ2 = 0. In this case, Theorem 4.4 applies and we expect linear convergence of SAGA without any knowledge of the constant µ satisfying Assumption 4.2. This makes SAGA comparable with descent methods such as the gradient method and the coordinate descent (CD) method. However, compared with their deterministic counterparts, the speedup provided by CD can be at most of order d, while the speedup provided by SAGA can be of order n. Thus SAGA is much preferable to CD when n is larger than d. We provide numerical evidence in Figure 3.

Figure 3. SAGA vs. CD (norm of gradient step vs. number of epochs). (Plots omitted.)

References

Bibi, A., Sailanbayev, A., Ghanem, B., Gower, R. M., and Richtárik, P. Improving SAGA via a probabilistic interpolation with gradient descent. arXiv:1806.05633, 2018.

Chambolle, A., Ehrhardt, M. J., Richtárik, P., and Schönlieb, C. B. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783-2808, 2017.

Csiba, D. and Richtárik, P. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015.

Csiba, D. and Richtárik, P. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27):1-21, 2018.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 1646-1654. Curran Associates, Inc., 2014.

Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.

Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. arXiv:1809.09354, 2018.

Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Horváth, S. and Richtárik, P. Nonconvex variance reduced optimization with arbitrary sampling. arXiv:1809.04146, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315-323, 2013.
Konečný, J. and Richtárik, P. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, pp. 1-14, 2017. URL http://arxiv.org/abs/1312.1666.

Konečný, J., Lu, J., Richtárik, P., and Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242-255, 2016.

Loizou, N. and Richtárik, P. Linearly convergent stochastic heavy ball method for minimizing generalization error. In NIPS Workshop on Optimization for Machine Learning, 2017a.

Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv:1712.09677, 2017b.

Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829-855, 2015.

Necoara, I., Nesterov, Y., and Glineur, F. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1-39, 2018. doi: 10.1007/s10107-018-1232-1.

Needell, D., Srebro, N., and Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 2015.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829-857, 2016a.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858-884, 2016b.

Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 865-873. Curran Associates, Inc., 2015.

Richtárik, P. and Takáč, M. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233-1243, 2016.

Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1-38, 2014.

Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433-484, 2016.

Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

Schmidt, M., Babanezhad, R., Ahmed, M. O., Defazio, A., Clifton, A., and Sarkar, A. Non-uniform stochastic average gradient method for training conditional random fields. In 18th International Conference on Artificial Intelligence and Statistics, 2015.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.

Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 747-754, 2016.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567-599, 2013.
Takáč, M., Bijral, A., Richtárik, P., and Srebro, N. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, pp. 537-552, 2013.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014a. doi: 10.1137/140961791.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014b.

Zhao, P. and Zhang, T. Stochastic optimization with importance sampling. In The 32nd International Conference on Machine Learning, 37:1-9, 2015.