# Stochastic Reweighted Gradient Descent

Ayoub El Hanchi¹, David A. Stephens², Chris J. Maddison¹

¹University of Toronto and Vector Institute. ²McGill University. Correspondence to: Ayoub El Hanchi.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

**Abstract.** Importance sampling is a promising strategy for improving the convergence rate of stochastic gradient methods. It is typically used to precondition the optimization problem, but it can also be used to reduce the variance of the gradient estimator. Unfortunately, this latter point of view has yet to lead to practical methods that provably improve the asymptotic error of stochastic gradient methods. In this work, we propose stochastic reweighted gradient descent (SRG), a stochastic gradient method based solely on importance sampling that can reduce the variance of the gradient estimator and improve on the asymptotic error of stochastic gradient descent (SGD) in the strongly convex and smooth case. We show that SRG can be extended to combine the benefits of both importance-sampling-based preconditioning and variance reduction. When compared to SGD, the resulting algorithm can simultaneously reduce the condition number and the asymptotic error, both by up to a factor equal to the number of component functions. We demonstrate improved convergence in practice on regularized logistic regression problems.

## 1. Introduction

Unconstrained optimization of finite-sum objectives is a core algorithmic problem in machine learning. The prototypical way of solving such problems is by viewing them through the lens of stochastic optimization, where the source of stochasticity resides in the choice of the index in the sum. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) remains the standard algorithm for this class of problems.

A natural way to improve on SGD is by considering importance sampling schemes. This idea is not new and dates back to Needell et al. (2014), who use importance sampling as a preconditioning technique. They propose sampling the indices with probabilities proportional to the smoothness constants of the corresponding component functions, and show that this sampling scheme provably reduces the condition number of the problem.

In another line of work, variance-reduced methods were found to achieve linear convergence in the strongly convex and smooth case (Roux et al., 2012; Schmidt et al., 2017). Many of these methods rely on control variates to reduce the variance of the gradient estimator used by SGD (Johnson & Zhang, 2013; Defazio et al., 2014). While very successful, the applicability of these methods is limited by the large memory overhead that they introduce, or by the periodic full-gradient recomputation that they require (Defazio & Bottou, 2019). Despite strong progress in this research area, an importance-sampling-based analogue of these algorithms, potentially free from these drawbacks, has yet to emerge.

In this work, we propose such an analogue. We introduce stochastic reweighted gradient (SRG), an importance-sampling-based stochastic optimization algorithm that reduces the variance of the gradient estimator. Like SGD, SRG requires a single gradient oracle call per iteration, and it needs only $O(n)$ additional memory and $O(\log n)$ additional floating point operations per iteration.
We analyze the convergence rate of SRG in the strongly convex and smooth case, and show that it can provably improve on the asymptotic error of SGD. Finally, we show how our importance sampling strategy can be combined with smoothness-based importance sampling, and prove that the resulting algorithm simultaneously performs variance reduction and preconditioning. We demonstrate improved convergence in practice on $\ell_2$-regularized logistic regression problems.

## 2. Preliminaries

We consider the finite-sum optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \tag{1}$$

where $F$ is $\mu$-strongly convex for $\mu > 0$, and, for all $i \in [n]$, $f_i$ is convex and $L_i$-smooth for $L_i > 0$. Note that by strong convexity, $F$ has a unique minimizer $x^\star \in \mathbb{R}^d$. We define the maximum $L_{\max} := \max_{i \in [n]} L_i$ and average $\bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$ smoothness constants. Similarly, we define the maximum $\kappa_{\max} := L_{\max}/\mu$ and average $\bar{\kappa} := \bar{L}/\mu$ condition numbers.

The classical way of solving (1) is by viewing it as a stochastic optimization problem where the randomness comes from the choice of the index $i \in [n]$. Starting from some $x_0 \in \mathbb{R}^d$, and for an iteration number $k \in \mathbb{N}$, stochastic gradient descent (SGD) performs the update:

$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k)$$

for a step size $\alpha_k > 0$ and a random index $i_k$ drawn uniformly from $[n]$. The idea behind importance sampling for SGD is to instead sample the index $i_k$ according to a chosen distribution $p_k$ on $[n]$, and to perform the update (Needell et al., 2014):

$$x_{k+1} = x_k - \alpha_k \frac{1}{n p_k^{i_k}} \nabla f_{i_k}(x_k) \tag{2}$$

where $p_k^i$ is the $i$th component of the probability vector $p_k$. It is immediate to verify that the importance sampling estimator of the gradient is unbiased as long as $p_k > 0$.

The question that we address in this paper is how to design a sequence $\{p_k\}_{k=0}^\infty$ that produces more efficient gradient estimators than the ones produced by uniform sampling. One way to design such a sequence is by adopting a greedy strategy: at each iteration $k$ we choose $p_k$ to minimize the conditional variance of the gradient estimator, which is given, up to an additive constant, by:

$$\sigma^2(x_k, p) = \frac{1}{n^2}\sum_{i=1}^n \frac{1}{p^i}\|\nabla f_i(x_k)\|_2^2 \tag{3}$$

This conditional variance is minimized at (Zhao & Zhang, 2015):

$$\left[\arg\min_p \sigma^2(x_k, p)\right]^i = \frac{\|\nabla f_i(x_k)\|_2}{\sum_{j=1}^n \|\nabla f_j(x_k)\|_2} \tag{4}$$

Ideally, we would like to set $p_k$ to this minimizer. However, this requires knowledge of the gradient norms $(\|\nabla f_i(x_k)\|_2)_{i=1}^n$, which, in general, requires $n$ gradient evaluations per iteration, making this approach intractable.

## 3. Algorithm

In this section, we show how to design a tractable sequence of importance sampling distributions for SGD that approximates the conditional-variance-minimizing distributions (4). First we construct efficient approximations of the conditional variances (3). We then state a simple bound on their approximation errors and use it to motivate our choice of importance sampling distributions.

**Algorithm 1** SRG

1: **Parameters:** step sizes $(\alpha_k)_{k=0}^\infty > 0$, mixture coefficients $(\theta_k)_{k=0}^\infty \subset (0, 1]$
2: **Initialization:** $x_0 \in \mathbb{R}^d$, $(\|g_0^i\|_2)_{i=1}^n \in \mathbb{R}^n$
3: **for** $k = 0, 1, 2, \dots$ **do**
4: $\quad p_k = (1 - \theta_k)q_k + \theta_k \mathbf{1}/n$ {$q_k$ is defined in (6)}
5: $\quad b_k \sim \mathrm{Bernoulli}(\theta_k)$
6: $\quad$**if** $b_k = 1$ **then** $i_k \sim \mathbf{1}/n$ **else** $i_k \sim q_k$
7: $\quad x_{k+1} = x_k - \alpha_k \frac{1}{n p_k^{i_k}} \nabla f_{i_k}(x_k)$
8: $\quad \|g_{k+1}^i\|_2 = \begin{cases} \|\nabla f_i(x_k)\|_2 & \text{if } b_k = 1 \text{ and } i_k = i \\ \|g_k^i\|_2 & \text{otherwise} \end{cases}$
9: **end for**

To approximate the conditional variances, we follow the strategy of certain variance-reduced methods (Roux et al., 2012; Schmidt et al., 2017; Defazio et al., 2014). These methods maintain a table $(g_k^i)_{i=1}^n$ that tracks the component gradients $(\nabla f_i(x_k))_{i=1}^n$ and that is updated at each iteration at the index $i_k$ used to update the iterates $x_k$ in (2).
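For intuition, here is a minimal sketch of this kind of gradient-table tracking, in the style of SAGA (the code and names are illustrative, not the authors' implementation; `grad_fns` is an assumed list of per-component gradient oracles):

```python
import numpy as np

def saga_style_step(x, table, grad_fns, alpha, rng):
    """One SGD step with a SAGA-style table of historical gradients.

    `table` has shape (n, d); row i stores the last gradient computed
    for component i, and is refreshed at the sampled index.
    """
    n = len(grad_fns)
    i = rng.integers(n)                  # uniform index
    g_i = grad_fns[i](x)                 # single gradient oracle call
    # Unbiased estimator of grad F(x): over uniform i,
    # E[g_i - table[i]] + table.mean(axis=0) = grad F(x).
    estimate = g_i - table[i] + table.mean(axis=0)
    table[i] = g_i                       # refresh the table at index i
    return x - alpha * estimate
```

Storing the table costs $O(nd)$ memory; this is exactly the overhead that SRG avoids by keeping only the $n$ scalars $\|g_k^i\|_2$, as described next.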
Our method instead maintains an array of gradient norms $(\|g_k^i\|_2)_{i=1}^n$, from which we construct an approximation of the conditional variance (3) of the gradient estimator:

$$\hat{\sigma}^2(x_k, p) := \frac{1}{n^2}\sum_{i=1}^n \frac{1}{p^i}\|g_k^i\|_2^2 \tag{5}$$

This approximation is minimized at:

$$q_k^i := \left[\arg\min_p \hat{\sigma}^2(x_k, p)\right]^i = \frac{\|g_k^i\|_2}{\sum_{j=1}^n \|g_k^j\|_2} \tag{6}$$

We do not directly use $q_k$ as an importance sampling distribution, because this approximation may be poor. In particular, we have the following bound relating the conditional variance (3) to the approximation (5):

$$\sigma^2(x_k, p) \le \frac{2}{n^2}\sum_{i=1}^n \frac{\|\nabla f_i(x_k) - g_k^i\|_2^2}{p^i} + 2\hat{\sigma}^2(x_k, p) \tag{7}$$

Recall that our goal is to pick $p_k$ to minimize $\sigma^2(x_k, p)$. $q_k$ minimizes the second term on the right-hand side of (7), but we must ensure that both terms are small. Two conditions are needed to keep the first term small. The first is to control the terms $\|\nabla f_i(x_k) - g_k^i\|_2^2$, which we can do by making sure that the historical gradients $g_k^i$ are frequently updated. The second is to ensure that the probabilities $p_k^i$ are lower bounded. We achieve both of these properties, and approximately minimize the right-hand side of (7), by mixing $q_k$ with the uniform distribution over $[n]$. For a given mixture coefficient $\theta_k \in (0, 1]$, this yields our final importance sampling distribution:

$$p_k = (1 - \theta_k)q_k + \theta_k \frac{\mathbf{1}}{n} \tag{8}$$

Our analysis in section 4 clarifies the role of the sequence of mixture coefficients $(\theta_k)_{k=0}^\infty$, and relates it to both the step size sequence $(\alpha_k)_{k=0}^\infty$ and the asymptotic error of SRG in the constant step size setting. Curiously, our analysis requires performing the update of the array $(\|g_k^i\|_2)_{i=1}^n$ only when the index $i_k$ is drawn from the uniform mixture component. It is not clear to us whether this constraint is an artifact of the analysis or a property of the algorithm. We discuss this further after the statement of Lemma 4.1 in section 4. Pseudocode for our method is given in Algorithm 1.

We briefly discuss how SRG can be efficiently implemented. To sample from $q_k$, we store $(\|g_k^i\|_2)_{i=1}^n$ in a binary indexed tree using $O(n)$ memory, along with the normalizing constant $\lambda_k = \sum_{i=1}^n \|g_k^i\|_2$. The binary indexed tree can be updated and maintained in $O(\log n)$ operations, while the normalizing constant can be maintained in constant time. We can then sample from $q_k$ using inverse transform sampling: we multiply a uniform random variable $u \in [0, 1)$ by $\lambda_k$, then return the smallest index $i_k$ whose prefix sum $\sum_{i=1}^{i_k}\|g_k^i\|_2$ exceeds $\lambda_k u$. This procedure also requires only $O(\log n)$ operations. The total overhead of SRG when compared to SGD is therefore $O(n)$ memory and $O(\log n)$ floating point operations per iteration, both of which are dimension-free.

## 4. Analysis

In this section, we analyze the convergence rate of SRG and show that it can achieve a better asymptotic error than SGD. Two key constants are helpful in contrasting the asymptotic errors of SRG and SGD. Recall the definition of the conditional variance $\sigma^2(x_k, p_k)$ in (3), and define:

$$\sigma^2 := \sigma^2(x^\star, \mathbf{1}/n) = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|_2^2, \qquad \sigma_\star^2 := \min_p \sigma^2(x^\star, p) = \left(\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|_2\right)^2$$

It is well known that the asymptotic error of SGD in the strongly convex and smooth case depends linearly on $\sigma^2$ (Needell et al., 2014). We here show that SRG reduces this to a linear dependence on $\sigma_\star^2$, which can be up to $n$ times smaller.

To study the convergence rate of SRG, we use the following Lyapunov function, which is similar to the one used to study the convergence rate of SAGA (Hofmann et al., 2015; Defazio, 2016):

$$T_k := \frac{a\,\alpha_k}{\theta_k L_{\max}}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + \|x_k - x^\star\|_2^2 \tag{9}$$

for a constant $a > 0$ that we set during the analysis.
The proofs for this section can be found in Appendix B.

### 4.1. Intermediate lemmas

Before proceeding with the main result, let us first state two intermediate lemmas. The first studies the evolution of $(g_k^i)_{i=1}^n$ from one iteration to the next.

**Lemma 4.1.** *Let $k \in \mathbb{N}$ and suppose that $(g_k^i)_{i=1}^n$ evolves as in Algorithm 1. Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have:*

$$\mathbb{E}\left[\sum_{i=1}^n \|g_{k+1}^i - \nabla f_i(x^\star)\|_2^2\right] \le 2\theta_k L_{\max}[F(x_k) - F(x^\star)] + \left(1 - \frac{\theta_k}{n}\right)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2$$

The use of the Bernoulli random variable $b_k$ in Algorithm 1 to gate the update of $(g_k^i)_{i=1}^n$ is necessary for Lemma 4.1 to hold. In particular, without the use of $b_k$, the elements of $(g_k^i)_{i=1}^n$ may have different probabilities of being updated. In that case, the first term of Lemma 4.1 becomes a weighted average of terms which is not easily bounded. When the importance sampling distribution does not depend on the iteration number, this issue can be fixed with a slight modification of the Lyapunov function (9) (Schmidt et al., 2015). In our case, however, we are dealing with time-varying importance sampling distributions, and this approach fails. Instead, we condition the update of $(g_k^i)_{i=1}^n$ on the Bernoulli random variable $b_k$, so that the probability of updating $g_k^i$ is fixed to $\theta_k/n$ for all $i \in [n]$.

The second lemma is a bound on the conditional variance of the gradient estimator used by SRG. We intentionally leave as many free parameters as possible in the bound and optimize over them in the main result of section 4.2.

**Lemma 4.2.** *Let $k \in \mathbb{N}$ and assume that $\theta_k \in (0, 1/2]$. Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have, for all $\beta, \gamma, \delta, \eta > 0$:*

$$\mathbb{E}\left[\left\|\frac{1}{n p_k^{i_k}}\nabla f_{i_k}(x_k)\right\|_2^2\right] \le \frac{2 D_1 L_{\max}}{\theta_k}[F(x_k) - F(x^\star)] + \frac{D_2}{\theta_k n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + D_3(1 + 2\theta_k)\sigma_\star^2$$

*where $D_1$, $D_2$ and $D_3$ are given by:*

$$D_1 := 1 + \beta + \gamma, \quad D_2 := (1 + \beta^{-1} + \delta) + (1 + \gamma^{-1} + \delta^{-1})(1 + \eta), \quad D_3 := (1 + \gamma^{-1} + \delta^{-1})(1 + \eta^{-1})$$

### 4.2. Main result

Our main result is a bound on the evolution of the Lyapunov function $T_k$ along the steps of SRG.

**Theorem 4.3.** *Suppose that $(x_k, (g_k^i)_{i=1}^n)$ evolves according to Algorithm 1. Further, assume that for all $k \in \mathbb{N}$: (i) $\alpha_k/\theta_k$ is non-increasing; (ii) $\theta_k \in (0, 1/2]$; (iii) $\alpha_k \le \theta_k/12L_{\max}$. Then:*

$$\mathbb{E}[T_{k+1}] \le (1 - \rho_k)\,\mathbb{E}[T_k] + (1 + 2\theta_k)6\alpha_k^2\sigma_\star^2$$

*for all $k \in \mathbb{N}$, where:*

$$\rho_k := \min\left\{\frac{\theta_k}{12n},\ \alpha_k\mu\right\}$$

The constants in this theorem ($1/12$ in the bound on the step size and in $\rho_k$, $6$ in front of the $\sigma_\star^2$ term) are optimized under the following constraints. First, the parameterized form of the above bound shows that the largest allowable step size is given by $c_2\theta_k/L_{\max}$. Using it, we get:

$$\rho_k = \min\left\{c_1\frac{\theta_k}{n},\ c_2\frac{\theta_k}{\kappa_{\max}}\right\}$$

for some constants $c_1, c_2 > 0$. As we do not know a priori the relative magnitudes of $n$ and $\kappa_{\max}$, and since $c_1$ and $c_2$ are inversely proportional, we impose the constraint $c_1 = c_2$. Similarly, our parameterized bound gives an asymptotic error of the form $(1 + 2\theta_k)c_3\alpha_k^2\sigma_\star^2$ for a constant $c_3$. We chose to impose the constraint $c_3 = 6$. $c_1$, $c_2$ and $c_3$ are all functions of the free parameters of Lemma 4.2 and of the constant $a$ of the Lyapunov function (9). Numerically maximizing $c_2$ (and therefore the largest allowable step size) subject to these two constraints ($c_3 = 6$ and $c_1 = c_2$) with respect to these parameters yields the result in Theorem 4.3. To give the reader an idea of the sensitivity of the result to the choice of $c_3$, note that setting $c_3 = 2$ yields $c_2 \approx 1/20$, whereas taking $c_3 \to \infty$ yields $c_2 \approx 1/10$.
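These constants can be checked numerically from the feasible point reported in the proof in Appendix B; a quick sketch, using our reconstruction of the formulas for $c_1$ and $c_2$ from that proof:

```python
# Feasible point from the proof of Theorem 4.3 (Appendix B).
a, beta, gamma, delta, eta = 0.677, 1.047, 1.666, 1.591, 0.591

D1 = 1 + beta + gamma
D2 = (1 + 1 / beta + delta) + (1 + 1 / gamma + 1 / delta) * (1 + eta)
D3 = (1 + 1 / gamma + 1 / delta) * (1 + 1 / eta)

c2 = (1 - a) / D1                   # prefactor of the largest allowable step size
c1 = 1 - (1 - a) * D2 / (a * D1)    # prefactor of theta_k / n in rho_k

print(round(D3, 3))                 # ~6.0, the constraint c3 = 6
print(round(c1, 3), round(c2, 3))   # both ~0.087 >= 1/12
```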
We have attempted to obtain the best possible bound on the largest allowable step size, but the rather small prefactor $c_2 = 1/12$ seems to be an inevitable consequence of the multiple (but as far as we can tell necessary) uses of Young's inequality in our analysis. Our experiments suggest that the dependence of the largest allowable step size on the mixture coefficient $\theta$ is real, but that the prefactor $c_2 = 1/12$ may be an artifact of the analysis.

For SRG with constant mixture coefficient and step size, the convergence rate and complexity are given by the following corollary of Theorem 4.3.

**Corollary 4.4.** *Suppose that $(x_k, (g_k^i)_{i=1}^n)$ evolves according to Algorithm 1 with a constant mixture coefficient $\theta_k = \theta \in (0, 1/2]$ and a constant step size $\alpha_k = \alpha \le \theta/12L_{\max}$. Then for any $k \in \mathbb{N}$:*

$$\mathbb{E}[T_k] \le (1 - \rho)^k T_0 + \frac{(1 + 2\theta)6\alpha^2\sigma_\star^2}{\rho}$$

*where $\rho = \rho_k$ is as defined in Theorem 4.3. For any $\varepsilon > 0$ and $\theta \in (0, 1/2]$, choosing:*

$$\alpha = \min\left\{\frac{\theta}{12 L_{\max}},\ \frac{\varepsilon\mu}{(1 + 2\theta)12\sigma_\star^2},\ \sqrt{\frac{\theta}{1 + 2\theta}\cdot\frac{\varepsilon}{144 n \sigma_\star^2}}\right\}$$

*guarantees $\mathbb{E}\left[\|x_k - x^\star\|_2^2\right] \le \varepsilon$ for all $k \ge \frac{1}{\rho}\log\frac{2 T_0}{\varepsilon}$.*

Comparing the convergence rate of SRG in Corollary 4.4 with the standard result for SGD (Needell et al., 2014), we see that they are of similar form. When $\rho = \alpha\mu$, the bound of Corollary 4.4 is better asymptotically. Indeed, in this case, and as $k \to \infty$, the iterates of SRG stay within a ball of radius $O(\sigma_\star)$ around the minimizer, while those of SGD stay within a ball of radius $O(\sigma)$. The equality $\rho = \alpha\mu$ holds when $\alpha \le \theta/12n\mu$, which is true for all allowable step sizes if the problem is dominated by its maximum condition number ($\kappa_{\max} \ge n$), and for smaller step sizes otherwise.

In terms of complexity, we have the following comparison. Up to constants, the complexity of SRG with a constant mixture coefficient and step size is of the form:

$$O\left(\left(n + \sqrt{\frac{n\sigma_\star^2}{\mu^2\varepsilon}} + \kappa_{\max} + \frac{\sigma_\star^2}{\mu^2\varepsilon}\right)\log\frac{1}{\varepsilon}\right)$$

We compare this to the complexity of SGD with constant step size (Needell et al., 2014):

$$O\left(\left(\kappa_{\max} + \frac{\sigma^2}{\mu^2\varepsilon}\right)\log\frac{1}{\varepsilon}\right)$$

In the high accuracy regime, the $\varepsilon^{-1}$ terms dominate the complexities of SRG and SGD. In this case, SRG enjoys a better complexity than SGD since $\sigma_\star^2 \le \sigma^2$.

**Algorithm 2** SRG+

1: **Parameters:** step sizes $(\alpha_k)_{k=0}^\infty > 0$, mixture coefficients $(\theta_k)_{k=0}^\infty \subset (0, 1]$
2: **Initialization:** $x_0 \in \mathbb{R}^d$, $(\|g_0^i\|_2)_{i=1}^n \in \mathbb{R}^n$
3: **for** $k = 0, 1, 2, \dots$ **do**
4: $\quad p_k = (1 - \theta_k)q_k + \theta_k v$ {$q_k$ is given by (6), $v$ is given by (12)}
5: $\quad b_k \sim \mathrm{Bernoulli}(\theta_k)$
6: $\quad$**if** $b_k = 1$ **then** $(i_k, j_k) \sim \pi$ **else** $i_k \sim q_k$ {$\pi$ maximally couples $(v, \mathbf{1}/n)$}
7: $\quad x_{k+1} = x_k - \alpha_k \frac{1}{n p_k^{i_k}} \nabla f_{i_k}(x_k)$
8: $\quad \|g_{k+1}^j\|_2 = \begin{cases} \|\nabla f_j(x_k)\|_2 & \text{if } b_k = 1 \text{ and } j = j_k \\ \|g_k^j\|_2 & \text{otherwise} \end{cases}$
9: **end for**

## 5. Extension

In this section, we extend SRG to combine its variance reduction capacity with the preconditioning ability of smoothness-based importance sampling. A straightforward way to generalize the argument given in the derivation of SRG in section 3 is to consider the following bound on the conditional variance (3) of the gradient estimator, which can be derived from Young's inequality and the $L_i$-smoothness of each $f_i$:

$$\sigma^2(x_k, p) \le \frac{3}{n^2}\sum_{i=1}^n \frac{L_i}{p^i}\langle\nabla f_i(x_k) - \nabla f_i(x^\star), x_k - x^\star\rangle + \frac{3}{n^2}\sum_{i=1}^n \frac{1}{p^i}\|g_k^i - \nabla f_i(x^\star)\|_2^2 + 3\hat{\sigma}^2(x_k, p) \tag{10}$$

While the motivating bound (7) seems more intuitive, because we think of $g_k^i$ as tracking $\nabla f_i(x_k)$, it turns out that this second bound better captures the evolution of $g_k^i$ in relation to $\nabla f_i(x_k)$. At a high level, this is because $g_k^i$ tracks $\nabla f_i(x_k)$ indirectly: both hover around $\nabla f_i(x^\star)$ as $k$ gets large.

Similar to our approach in section 3, our goal is to pick $p_k$ to minimize the right-hand side of (10). We know that $q_k$ (6) minimizes the third term, but we need to make sure that the first two are also small.
To minimize the first term, knowing nothing about the relative sizes of the inner product terms, it makes sense to use probabilities proportional to the smoothness constants.¹ On the other hand, to keep the second term small, we need to ensure that the historical gradients $(g_k^i)_{i=1}^n$ are frequently updated, which we can do by imposing a uniform lower bound on the probabilities. These considerations motivate us to consider the following distributions:

$$p_k' = (1 - \eta_k - \theta_k)q_k + \eta_k v + \theta_k\frac{\mathbf{1}}{n} \tag{11}$$

for positive mixture coefficients $(\theta_k, \eta_k)$ satisfying $\theta_k + \eta_k \in (0, 1]$, and where $v$ is given by:

$$v^i := \frac{L_i}{\sum_{j=1}^n L_j} \tag{12}$$

¹Based on this argument alone, one would want them to be proportional to the square root of the smoothness constants. This, however, does not lead to a nice averaging of the inner product terms, which is important for reasons related to the strong convexity of $F$ but not of the component functions $f_i$.

Using these probabilities, we are able to show that the resulting algorithm does indeed achieve both variance reduction and preconditioning. However, in the worst case, its complexity is twice as large as what we would expect from simply replacing $L_{\max}$ with $\bar{L}$ in Corollary 4.4. Intuitively, this is because the probability assigned to the uniform component in (8) needs to be split between the uniform and the smoothness-based components in (11).

### 5.1. Carefully decoupling the updates of $(g_k^i)_{i=1}^n$ and $x_k$

Here we show how to design our algorithm so that this additional factor of 2 in the complexity is reduced to:

$$1 + \left\|v - \frac{\mathbf{1}}{n}\right\|_{TV} \le 2$$

where $\|\cdot\|_{TV}$ denotes the total variation distance; this factor is usually much closer to 1 in practical settings, where the smoothness constants are roughly on the same scale. Following Schmidt et al. (2015), our method is based on the observation that we can decouple the index used to update the historical gradients and the index used to update the iterates, which we refer to as $j_k$ and $i_k$, respectively. Intuitively, to minimize (10) we would ideally want $j_k$ to be uniformly distributed and $i_k$ to be distributed according to $p_k = (1 - \theta_k)q_k + \theta_k v$. Unfortunately, sampling $(j_k, i_k)$ independently with these marginals does not address our issue, because we would still require an average of approximately two gradient evaluations per iteration.

Our main observation is that we can use any coupling between $(j_k, i_k)$, because the evolution of the expectation of the Lyapunov function (9) only depends on their marginal distributions and not on their joint distribution. In particular, we obtain the same bound on the evolution of $\mathbb{E}[T_k]$ regardless of how $i_k$ and $j_k$ are coupled. It therefore makes sense to pick the coupling that maximizes the probability that $i_k = j_k$, since this minimizes the number of gradient evaluations required per iteration. Such couplings are known as maximal couplings in the literature, and can easily be computed for discrete random variables (see, e.g., Algorithm 5 in Biswas et al. (2019)). With a maximal coupling, the expected number of gradient evaluations per iteration becomes $1 + \|v - \mathbf{1}/n\|_{TV}$. Using one such coupling leads to SRG+, described in Algorithm 2.

Let us briefly discuss the implementation of SRG+. Sampling from $\pi$, a maximal coupling of $v$ and the uniform distribution, requires forming three probability vectors (see Algorithm 5 in Biswas et al. (2019) for details). As both $v$ and $\mathbf{1}/n$ are constant throughout the optimization process, we can form these vectors at the beginning of the algorithm, along with their partial sums, for a total initial cost of $O(n)$ operations.
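For concreteness, here is one standard way to form such a maximal coupling of two discrete distributions, in the spirit of Algorithm 5 of Biswas et al. (2019) (an illustrative sketch, not the authors' code):

```python
import numpy as np

def maximal_coupling_sampler(v, w, rng):
    """Sample (i, j) with i ~ v and j ~ w while maximizing P(i == j).

    With probability sum(min(v, w)) = 1 - TV(v, w), draw a single shared
    index from the normalized overlap; otherwise draw i and j independently
    from the normalized residuals of v and w.
    """
    m = np.minimum(v, w)
    overlap = m.sum()                    # = 1 - ||v - w||_TV
    if rng.random() < overlap:
        i = rng.choice(len(v), p=m / overlap)
        return i, i                      # coupled draw: one gradient evaluation
    r_v = (v - m) / (1.0 - overlap)      # residual of v
    r_w = (w - m) / (1.0 - overlap)      # residual of w
    return rng.choice(len(v), p=r_v), rng.choice(len(w), p=r_w)
```

For fixed $v$ and $\mathbf{1}/n$, the normalized overlap and the two residuals above are exactly the three probability vectors mentioned in the text, so they and their partial sums can be precomputed once.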
We can then sample from $\pi$ in $O(\log n)$ time at each iteration using binary search on the partial sums. The overhead of SRG+ is therefore the same as that of SRG.

### 5.2. Analysis

The analysis of SRG+ is similar to that of SRG. In particular, the iterates of SRG+ also obey a slight modification of Theorem 4.3 in which the bound on the largest allowable step size is loosened to $\alpha_k \le \theta_k/12\bar{L}$. We refer the reader to Appendix C for more details. Thanks to this improvement, the complexity of SRG+ with a constant mixture coefficient and step size is given by:

$$O\left(\left(n + \sqrt{\frac{n\sigma_\star^2}{\mu^2\varepsilon}} + \bar{\kappa} + \frac{\sigma_\star^2}{\mu^2\varepsilon}\right)\log\frac{1}{\varepsilon}\right)$$

This shows that SRG+ performs both variance reduction, as shown by the dependence of the complexity on $\sigma_\star^2$ instead of $\sigma^2$, and preconditioning, as shown by the dependence on $\bar{\kappa}$ instead of $\kappa_{\max}$.

## 6. Related work

At a high level, three lines of work study the use of importance sampling with SGD. The first considers fixed importance sampling distributions based on the constants of the problem (Needell et al., 2014; Zhao & Zhang, 2015; Gower et al., 2019), and shows that such a strategy leads to improved conditioning of the problem. The second considers adaptive importance sampling and, similar to our work, targets the variance of the gradient estimator, but generally fails to provide strong convergence rate guarantees under standard assumptions (Papa et al., 2015; Gopal, 2016; Alain et al., 2016; Canevet et al., 2016; Yuan et al., 2016; Stich et al., 2017; Katharopoulos & Fleuret, 2018; Johnson & Guestrin, 2018; Liu et al., 2021). The third line of work frames the problem as an online learning problem with bandit feedback and provides guarantees on the regret of the proposed distributions in terms of the variance of the resulting gradient estimators (Namkoong et al., 2017; Salehi et al., 2017; Borsos et al., 2018; 2019; El Hanchi & Stephens, 2020).

Our main method is very closely related to the one proposed by Papa et al. (2015); the main result of their analysis, however, is asymptotic in nature. Liu et al. (2021) provide a non-asymptotic analysis of the algorithm proposed by Papa et al. (2015), but under strong and non-standard assumptions. To the best of our knowledge, our work is the first to provide non-asymptotic guarantees on the suboptimality of the iterates for variance-reducing importance sampling under standard assumptions.

## 7. Experiments

In this section, we empirically verify our two main claims: (i) SRG performs variance reduction, which can improve the asymptotic error of SGD; (ii) SRG+ performs both variance reduction and preconditioning, and can both reduce the asymptotic error of SGD and allow the use of larger step sizes. We start with controlled synthetic experiments that provide direct support for our claims. We then compare SRG to other baseline optimizers on $\ell_2$-regularized logistic regression problems. In all experiments, we ran each algorithm 10 times and present the averaged result.

### 7.1. Synthetic experiments

For our first experiment, we consider the following toy problem. We let $x \in \mathbb{R}$ and $f_i(x) = \frac{1}{2}(x - a_i)^2$, where $a_i = 0$ for $i \in [n-1]$ and $a_n = 1$. In this case, $x^\star = 1/n$, $\sigma^2 \approx 1/n$, and $\sigma_\star^2 \approx 4/n^2$. We consider five instantiations of this problem with $n \in \{8, 16, 32, 64, 128\}$, yielding ratios $\sigma^2/\sigma_\star^2$ approximately in $\{2, 4, 8, 16, 32\}$. For each instantiation, we ran SGD and SRG until they reached stationarity and recorded their asymptotic errors $\lim_{k\to\infty}\mathbb{E}[\|x_k - x^\star\|_2^2]$, which we denote by $\mathcal{E}_{\mathrm{SGD}}$ and $\mathcal{E}_{\mathrm{SRG}}$ respectively.
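The constants of this toy problem can be computed directly; a small sketch of the calculation (our code, for reference):

```python
import numpy as np

for n in [8, 16, 32, 64, 128]:
    a = np.zeros(n)
    a[-1] = 1.0                        # f_i(x) = (x - a_i)^2 / 2
    x_star = a.mean()                  # minimizer of F: x* = 1/n
    norms = np.abs(x_star - a)         # |f_i'(x*)|
    sigma2 = np.mean(norms ** 2)       # uniform-sampling variance at x*
    sigma2_star = np.mean(norms) ** 2  # optimal-sampling variance at x*
    print(n, sigma2 / sigma2_star)     # ratios ~ {2, 4, 8, 16, 32}
```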
We experimented with three different step sizes, two of them allowed by Corollary 4.4, and one larger one for which we can only prove that SRG has a convergence guarantee similar to that of SGD. For SRG, we used the mixture coefficient $\theta = 1/2$. We plot $\mathcal{E}_{\mathrm{SGD}}/\mathcal{E}_{\mathrm{SRG}}$ against $\sigma^2/\sigma_\star^2$ in Figure 1 (left) for each of the three step sizes, from which we see that the relationship between the two ratios is linear, and very close to the identity.

From an asymptotic error point of view, these results support our theory in that the improvement is seen to be directly proportional to the ratio $\sigma^2/\sigma_\star^2$. On the other hand, the constant $6(1 + 2\theta)/\rho$ in Corollary 4.4 would suggest that the proportionality constant is much smaller than 1, particularly when $n$ is large, but this is not what we observe in practice. This could be because the first term of the Lyapunov function (9) is quite large at stationarity, or because the constants in our bound are not sharp due to the multiple uses of Young's inequality. This latter possibility is further supported by the fact that we see a similar behaviour for SRG at step sizes larger than the ones allowed by our theory. We have consistently made these two observations in other experiments. It is, however, unclear to us how our analysis can be improved to match these observations.

**Figure 1:** Left: Linear dependence of $\mathcal{E}_{\mathrm{SGD}}/\mathcal{E}_{\mathrm{SRG}}$, the ratio of the asymptotic errors of SGD and SRG, on $\sigma^2/\sigma_\star^2$, the ratio of the uniform and optimal variances at the minimum. Right: SRG+ achieves smaller asymptotic error than both SGD with $L_i$-sampling (SGD+) and SGD with partially biased sampling (SGD++), while using large $O(1/\bar{L})$ step sizes just like SGD+ and SGD++.

For our second experiment, we considered the following problem. We let $x \in \mathbb{R}$ and $f_i(x) = \frac{L_i}{2}(x - a_i)^2$, and fixed $n = 20$. As in the first experiment, we take $a_i = 0$ for $i \in [n-1]$ and $a_n = 1$. We then set $L_1 = n - 1$, $L_n = 1/n$, and $L_i = \frac{n-1}{n(n-2)}$ for the remaining indices, so that $\bar{L} = 1$ and $L_{\max} = n - 1$. In this case, we get $x^\star = 1/n^2$, $\sigma_+^2 := \sigma^2(x^\star, v) \approx 1/n$, $\sigma^2 \approx 2/n^2$, and $\sigma_\star^2 \approx 4/n^3$.

We then ran SGD with $L_i$-sampling (SGD+), SGD with partially biased sampling (SGD++, Needell et al. (2014)), and SRG+, all with the largest step size allowed by our theory for SRG+, $\alpha = \theta/12\bar{L}$, and we use the mixture coefficient $\theta = 1/2$ for both SRG+ and SGD++. We obtained very similar results when using the larger step size $\alpha = \theta/2\bar{L}$, which is the maximum allowable for SGD+. We plot the relative error $\mathbb{E}[\|x_k - x^\star\|_2^2]/\|x_0 - x^\star\|_2^2$ for each algorithm against the number of gradient oracle calls it makes in Figure 1 (right).

We see that all three algorithms are able to converge even when using the large $O(1/\bar{L})$ step sizes. The theory of SGD+ allows for larger step sizes compared to SGD, but the asymptotic error of the algorithm depends on $\sigma_+^2$ (Needell et al., 2014), which can be up to $n$ times bigger than $\sigma^2$. To solve this problem, Needell et al. (2014) proposed partially biased sampling (SGD++), which mixes the smoothness-based distribution with uniform sampling, allowing the use of large step sizes just like SGD+ while preserving the asymptotic error of SGD. SRG+ further improves on SGD++ and reduces the asymptotic error, making it proportional to $\sigma_\star^2$.

### 7.2. $\ell_2$-regularized logistic regression

For our last experiment, we test SRG on $\ell_2$-regularized logistic regression problems. In this case, the functions $f_i$ are given by:

$$f_i(x) := \log\left(1 + \exp\left(-y_i a_i^T x\right)\right) + \frac{\mu}{2}\|x\|_2^2$$

where $y_i \in \{-1, 1\}$ is the label of data point $a_i \in \mathbb{R}^d$. Each $f_i$ is convex and $L_i$-smooth with $L_i = 0.25\|a_i\|_2^2 + \mu$.
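For reference, a sketch of these component functions, their gradients, and their smoothness constants (our code, assuming the form stated above):

```python
import numpy as np

def f_i(x, a_i, y_i, mu):
    """Regularized logistic loss of one data point."""
    return np.log1p(np.exp(-y_i * (a_i @ x))) + 0.5 * mu * (x @ x)

def grad_f_i(x, a_i, y_i, mu):
    """Gradient of f_i; the scalar s is sigmoid(-y_i * a_i^T x)."""
    s = 1.0 / (1.0 + np.exp(y_i * (a_i @ x)))
    return -y_i * s * a_i + mu * x

def smoothness_L_i(a_i, mu):
    """L_i = 0.25 * ||a_i||^2 + mu, as stated above."""
    return 0.25 * (a_i @ a_i) + mu
```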
Their average $F$ is also $\mu$-strongly convex. As is standard, we select $\mu = 1/n$. We experiment with SRG on four datasets from LIBSVM (Chang & Lin, 2011): ijcnn1, w8a, mushrooms and phishing. For each dataset, and to be able to run our experiments efficiently, we randomly select a subset of the data of size $n = 1000$. As the datasets are normalized, we have $\|a_i\|_2 = 1$ for all $i \in [n]$. This makes $L_i = \bar{L} = 0.25 + \mu$, which reduces SGD+ to SGD, and SRG+ to SRG.

We tested the performance of SRG against three baselines, the first of which is standard SGD. The second is SGD with the optimal conditional-variance-minimizing probabilities $p_k^i \propto \|\nabla f_i(x_k)\|_2$, which allows us to compare SRG with the best possible variance-reducing importance sampling scheme. The last baseline is SGD with random shuffling, which is also known to improve the asymptotic error of SGD (Mishchenko et al., 2020).² We evaluate the performance of the algorithms by tracking the average relative error $\mathbb{E}[\|x_k - x^\star\|_2^2]/\|x_0 - x^\star\|_2^2$. We used the mixture coefficient $\theta = 1/2$ for SRG, and used the same step size $\alpha = \theta/2\bar{L}$ for all algorithms.

²Under the additional assumption that each $f_i$ is also strongly convex.

The results of this experiment are shown in Figure 2. We observe that SRG consistently outperforms SGD on all datasets, and that it closely matches the performance of SGD with the variance-minimizing distributions, which it tries to approximate. We also see that SRG outperforms SGD with random shuffling on two datasets, and is competitive with it on the remaining two.

**Figure 2:** Comparison of the evolution of the average relative error $\|x_k - x^\star\|_2^2/\|x_0 - x^\star\|_2^2$ for different optimizers on $\ell_2$-regularized logistic regression problems using the datasets ijcnn1, w8a, mushrooms, phishing. We compare SRG (orange) with SGD (blue), SGD with random shuffling (purple), and SGD with the optimal variance-minimizing distributions at each iteration (green).

## 8. Conclusion

We introduced SRG, a new importance-sampling-based stochastic optimization algorithm for finite-sum problems that reduces the variance of the gradient estimator. We analyzed its convergence rate in the strongly convex and smooth case, and showed that it can improve on the asymptotic error of SGD. We also introduced SRG+, an extension of SRG which simultaneously performs variance reduction and preconditioning through importance sampling. We expect our algorithms to be most useful in the medium accuracy regime, where the required accuracy is higher than the one achieved by SGD, but low enough that the overhead of classical variance-reduced methods becomes significant. Finally, an interesting future direction would be to explore non-greedy strategies for the design of importance sampling distributions for SGD that not only minimize the variance of the current gradient estimator, but also take into account the variance of the gradient estimators of subsequent iterations.

## References

Alain, G., Lamb, A., Sankar, C., Courville, A., and Bengio, Y. Variance Reduction in SGD by Distributed Importance Sampling. arXiv:1511.06481 [cs, stat], 2016.

Biswas, N., Jacob, P. E., and Vanetti, P. Estimating Convergence of Markov Chains with L-Lag Couplings. Advances in Neural Information Processing Systems, 32, 2019.

Borsos, Z., Krause, A., and Levy, K. Y. Online Variance Reduction for Stochastic Optimization. In Conference on Learning Theory, pp. 324–357. PMLR, 2018.

Borsos, Z., Curi, S., Levy, K. Y., and Krause, A.
Online Variance Reduction with Mixtures. In International Conference on Machine Learning, pp. 705–714. PMLR, 2019.

Canevet, O., Jose, C., and Fleuret, F. Importance Sampling Tree for Large-scale Empirical Expectation. In International Conference on Machine Learning, pp. 1454–1462. PMLR, 2016.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 2011.

Defazio, A. A Simple Practical Accelerated Method for Finite Sums. Advances in Neural Information Processing Systems, 29:676–684, 2016.

Defazio, A. and Bottou, L. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning. Advances in Neural Information Processing Systems, 32:1755–1765, 2019.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. Advances in Neural Information Processing Systems, 27:1646–1654, 2014.

El Hanchi, A. and Stephens, D. Adaptive Importance Sampling for Finite-Sum Optimization and Sampling with Decreasing Step-Sizes. Advances in Neural Information Processing Systems, 33, 2020.

Gopal, S. Adaptive Sampling for SGD by Exploiting Side Information. In International Conference on Machine Learning, pp. 364–372. PMLR, 2016.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. SGD: General Analysis and Improved Rates. In International Conference on Machine Learning, pp. 5200–5209. PMLR, 2019.

Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance Reduced Stochastic Gradient Descent with Neighbors. Advances in Neural Information Processing Systems, 28:2305–2313, 2015.

Johnson, R. and Zhang, T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Advances in Neural Information Processing Systems, 26:315–323, 2013.

Johnson, T. B. and Guestrin, C. Training Deep Models Faster with Robust, Approximate Importance Sampling. Advances in Neural Information Processing Systems, 31:7265–7275, 2018.

Katharopoulos, A. and Fleuret, F. Not All Samples Are Created Equal: Deep Learning with Importance Sampling. In International Conference on Machine Learning, pp. 2525–2534. PMLR, 2018.

Liu, H., Wang, X., Li, J., and So, A. M.-C. Low-Cost Lipschitz-Independent Adaptive Importance Sampling of Stochastic Gradients. In 25th International Conference on Pattern Recognition (ICPR), pp. 2150–2157. IEEE, 2021.

Mishchenko, K., Khaled Ragab Bayoumi, A., and Richtárik, P. Random Reshuffling: Simple Analysis with Vast Improvements. Advances in Neural Information Processing Systems, 33, 2020.

Namkoong, H., Sinha, A., Yadlowsky, S., and Duchi, J. C. Adaptive Sampling Probabilities for Non-Smooth Optimization. In International Conference on Machine Learning, pp. 2574–2583. PMLR, 2017.

Needell, D., Ward, R., and Srebro, N. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz Algorithm. Advances in Neural Information Processing Systems, 27:1017–1025, 2014.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer US, 2004.

Papa, G., Bianchi, P., and Clémençon, S. Adaptive Sampling for Incremental Optimization Using Stochastic Gradient Descent. In Chaudhuri, K., Gentile, C., and Zilles, S.
(eds.), Algorithmic Learning Theory, Lecture Notes in Computer Science, pp. 317–331. Springer International Publishing, 2015.

Robbins, H. and Monro, S. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

Roux, N., Schmidt, M., and Bach, F. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. Advances in Neural Information Processing Systems, 25:2663–2671, 2012.

Salehi, F., Celis, L. E., and Thiran, P. Stochastic Optimization with Bandit Sampling. arXiv preprint arXiv:1708.02544, 2017.

Schmidt, M., Babanezhad, R., Ahmed, M., Defazio, A., Clifton, A., and Sarkar, A. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields. In Artificial Intelligence and Statistics, pp. 819–828. PMLR, 2015.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.

Stich, S. U., Raj, A., and Jaggi, M. Safe Adaptive Importance Sampling. Advances in Neural Information Processing Systems, 30:4381–4391, 2017.

Yuan, K., Ying, B., Vlaski, S., and Sayed, A. H. Stochastic gradient descent with finite samples sizes. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2016.

Zhao, P. and Zhang, T. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In International Conference on Machine Learning, pp. 1–9. PMLR, 2015.

## Supplementary material

### A. Preliminaries

**Lemma A.1.** *For all $i \in [n]$, assume that $f_i$ is convex and $L_i$-smooth. Then we have:*

$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(x^\star)\|_2^2 \le 2 L_{\max}[F(x) - F(x^\star)]$$

$$\frac{1}{n}\sum_{i=1}^n \frac{\bar{L}}{L_i}\|\nabla f_i(x) - \nabla f_i(x^\star)\|_2^2 \le 2\bar{L}[F(x) - F(x^\star)]$$

*Proof.* By the convexity and smoothness assumptions on the functions $\{f_i\}_{i=1}^n$, and Theorem 2.1.5 in Nesterov (2004), we have for all $i \in [n]$:

$$f_i(x) \ge f_i(x^\star) + \langle\nabla f_i(x^\star), x - x^\star\rangle + \frac{1}{2L_i}\|\nabla f_i(x) - \nabla f_i(x^\star)\|_2^2$$

Rearranging gives:

$$\|\nabla f_i(x) - \nabla f_i(x^\star)\|_2^2 \le 2L_i\left[f_i(x) - f_i(x^\star) - \langle\nabla f_i(x^\star), x - x^\star\rangle\right]$$

Using $L_i \le L_{\max}$, averaging over $i \in [n]$, and noticing that $\frac{1}{n}\sum_{i=1}^n \nabla f_i(x^\star) = \nabla F(x^\star) = 0$, we get the first inequality. For the second, we first divide both sides by $L_i/\bar{L}$, then average over $i \in [n]$ and notice that $\frac{1}{n}\sum_{i=1}^n \nabla f_i(x^\star) = \nabla F(x^\star) = 0$ to get the inequality. ∎

### B. Analysis of SRG

We study the convergence rate of SRG by studying the one-step evolution of the Lyapunov function:

$$T_k = T(x_k, (g_k^i)_{i=1}^n) := \frac{a\,\alpha_k}{\theta_k L_{\max}}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + \|x_k - x^\star\|_2^2$$

We start by proving Lemma 4.1, which studies the evolution of the first term of $T_k$.

**Lemma 4.1.** *Let $k \in \mathbb{N}$ and suppose that $(g_k^i)_{i=1}^n$ evolves as in Algorithm 1. Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have:*

$$\mathbb{E}\left[\sum_{i=1}^n \|g_{k+1}^i - \nabla f_i(x^\star)\|_2^2\right] \le 2\theta_k L_{\max}[F(x_k) - F(x^\star)] + \left(1 - \frac{\theta_k}{n}\right)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2$$

*Proof.* Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have:

$$\begin{aligned}
\mathbb{E}\left[\sum_{i=1}^n \|g_{k+1}^i - \nabla f_i(x^\star)\|_2^2\right]
&= P(b_k = 0)\sum_{j=1}^n P(i_k = j \mid b_k = 0)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 \\
&\quad + P(b_k = 1)\sum_{j=1}^n P(i_k = j \mid b_k = 1)\left[\sum_{i \ne j} \|g_k^i - \nabla f_i(x^\star)\|_2^2 + \|\nabla f_j(x_k) - \nabla f_j(x^\star)\|_2^2\right] \\
&= (1 - \theta_k)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 \\
&\quad + \frac{\theta_k}{n}\sum_{j=1}^n \left[\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 - \|g_k^j - \nabla f_j(x^\star)\|_2^2 + \|\nabla f_j(x_k) - \nabla f_j(x^\star)\|_2^2\right] \\
&= \left(1 - \frac{\theta_k}{n}\right)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + \frac{\theta_k}{n}\sum_{i=1}^n \|\nabla f_i(x_k) - \nabla f_i(x^\star)\|_2^2 \\
&\le \left(1 - \frac{\theta_k}{n}\right)\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + 2\theta_k L_{\max}[F(x_k) - F(x^\star)]
\end{aligned}$$

where the first and second equalities follow from the update of $(g_k^i)_{i=1}^n$ in Algorithm 1, and the last inequality follows from the first inequality of Lemma A.1. ∎

The evolution of the second term of $T_k$ depends on the second moment of the gradient estimator. We prove Lemma 4.2, which gives a bound on it.

**Lemma 4.2.**
*Let $k \in \mathbb{N}$ and assume that $\theta_k \in (0, 1/2]$. Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have, for all $\beta, \gamma, \delta, \eta > 0$:*

$$\mathbb{E}\left[\left\|\frac{1}{n p_k^{i_k}}\nabla f_{i_k}(x_k)\right\|_2^2\right] \le \frac{2 D_1 L_{\max}}{\theta_k}[F(x_k) - F(x^\star)] + \frac{D_2}{\theta_k n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + D_3(1 + 2\theta_k)\sigma_\star^2$$

*where $D_1$, $D_2$ and $D_3$ are given by:*

$$D_1 := 1 + \beta + \gamma, \quad D_2 := (1 + \beta^{-1} + \delta) + (1 + \gamma^{-1} + \delta^{-1})(1 + \eta), \quad D_3 := (1 + \gamma^{-1} + \delta^{-1})(1 + \eta^{-1})$$

*Proof.* Taking expectation with respect to $(b_k, i_k)$, conditional on $(b_t, i_t)_{t=0}^{k-1}$, we have:

$$\mathbb{E}\left[\left\|\frac{1}{n p_k^{i_k}}\nabla f_{i_k}(x_k)\right\|_2^2\right] = \frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|\nabla f_i(x_k)\|_2^2$$

Now by Young's inequality (Peter–Paul inequality), applied to the decomposition $\nabla f_i(x_k) = [\nabla f_i(x_k) - \nabla f_i(x^\star)] + [\nabla f_i(x^\star) - g_k^i] + g_k^i$:

$$\begin{aligned}
\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|\nabla f_i(x_k)\|_2^2
&\le (1 + \beta + \gamma)\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|\nabla f_i(x_k) - \nabla f_i(x^\star)\|_2^2 \\
&\quad + (1 + \beta^{-1} + \delta)\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|g_k^i - \nabla f_i(x^\star)\|_2^2 + (1 + \gamma^{-1} + \delta^{-1})\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|g_k^i\|_2^2
\end{aligned}$$

Let us bound each of the three terms. The first can be bounded using $p_k \ge \theta_k/n$ and the first inequality of Lemma A.1:

$$\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|\nabla f_i(x_k) - \nabla f_i(x^\star)\|_2^2 \le \frac{1}{\theta_k n}\sum_{i=1}^n \|\nabla f_i(x_k) - \nabla f_i(x^\star)\|_2^2 \le \frac{2L_{\max}}{\theta_k}[F(x_k) - F(x^\star)]$$

The second term is easily bounded using $p_k \ge \theta_k/n$:

$$\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|g_k^i - \nabla f_i(x^\star)\|_2^2 \le \frac{1}{\theta_k n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2$$

Finally, the third term can be bounded by:

$$\begin{aligned}
\frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_k^i}\|g_k^i\|_2^2
&\le (1 + 2\theta_k)\left(\frac{1}{n}\sum_{i=1}^n \|g_k^i\|_2\right)^2 \\
&\le (1 + 2\theta_k)\left(\frac{1}{n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2 + \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|_2\right)^2 \\
&\le (1 + 2\theta_k)(1 + \eta)\left(\frac{1}{n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2\right)^2 + (1 + 2\theta_k)(1 + \eta^{-1})\sigma_\star^2 \\
&\le \theta_k(1 + 2\theta_k)(1 + \eta)\frac{1}{\theta_k n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + (1 + 2\theta_k)(1 + \eta^{-1})\sigma_\star^2 \\
&\le \frac{1 + \eta}{\theta_k n}\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2 + (1 + 2\theta_k)(1 + \eta^{-1})\sigma_\star^2
\end{aligned}$$

Here the first line follows from the inequality $p_k \ge (1 - \theta_k)q_k$, the definition of $q_k$ in (6), and the inequality $1/(1 - \theta_k) \le 1 + 2\theta_k$, which holds since $\theta_k \in (0, 1/2]$ by hypothesis. The second line follows from the triangle inequality, and the third from Young's inequality (Peter–Paul inequality). The fourth line follows from the definition of $\sigma_\star^2$ and an application of the Cauchy–Schwarz inequality: let $v$ be the vector whose $i$th component is $\|g_k^i - \nabla f_i(x^\star)\|_2$ and let $\mathbf{1}_n$ be the vector of ones; then:

$$\left(\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2\right)^2 = |\langle\mathbf{1}_n, v\rangle|^2 \le \|\mathbf{1}_n\|_2^2\|v\|_2^2 = n\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2^2$$

Finally, the last line follows from the inequality $\theta_k(1 + 2\theta_k) \le 1$, which holds since $\theta_k \in (0, 1/2]$ by hypothesis. The result of the lemma
Combining the resulting bound with the one on the first term of the Lyapunov function we get: E T k+1 1 θk n + D2αk Lmax gi k fi(x ) 2 2 + (1 αkµ) xk x 2 2 + D3(1 + 2θk)α2 kσ2 θk + a 1 [F(xk) F(x )] To ensure that the last parenthesis is not positive, we need: θk Lmax (13) Assuming αk satisfies this condition, and replacing in the first parenthesis we get: n + D2αk Lmax Stochastic Reweighted Gradient Descent It remains to choose the parameters β, γ, δ, η > 0, and the parameter a > 0 so as to maximize the largest allowable step size in (13). First however, note that we have the constraint a < 1 so that the step size can be allowed to be positive in (13). Furthermore, since we do not know the relative magnitudes of n and κmax, we decide to impose the equality: a D2 D1 = 1 a D1 Finally, we choose to impose D3 = 6. These considerations lead us to the following constrained optimization problem: max a,β,γ,δ,η (1 a) subject to: 1 1 a a D2 D1 = 1 a β, γ, δ, η > 0 which we solve numerically to find the feasible point: a = 0.677, β = 1.047, γ = 1.666, δ = 1.591, η = 0.591 With this choice of parameters, we get: E T k+1 (1 ρk) T k + (1 + 2θk)6α2 kσ2 under the condition: αk 1 12 θk Lmax which ensures that (13) holds. From this result, we can derive the convergence rate of SRG when it is used with a constant mixture coefficient and step size. Corollary 4.4. Suppose that (xk, (gi k)n i=1) evolves according to Algorithm 1 with a constant mixture coefficient θk = θ (0, 1/2] and a constant step size αk = α θ/12Lmax. Then for any k N: E T k (1 ρ)k T 0 + (1 + 2θ)6α2σ2 ρ where ρ = ρk is as defined in Theorem 4.3. For any ε > 0 and θ (0, 1/2], choosing: ( θ 12Lmax , εµ (1 + 2θ)12σ2 , θ 1 + 2θ ε 144nσ2 guarantees E h xk x 2 2 i ε Proof. Under the assumptions of the corollary, the conditions of Theorem 4.3 are satisfied for all k N. Starting from E T k and repeatedly applying Theorem 4.3 we get: E T k (1 ρ)k T 0 + (1 + 2θ)6α2σ2 (1 ρ)k T 0 + (1 + 2θ)6α2σ2 = (1 ρ)k T 0 + (1 + 2θ)6α2σ2 ρ Stochastic Reweighted Gradient Descent This proves the first part. For the complexity part, let ε > 0. We choose the step size α so that: (1 + 2θ)6α2σ2 ρ ε We solve this inequality for α using the definition of ρ. Similarly, we need to ensure that the first term satisfies: (1 ρ)k T 0 ε using the inequality (1 ρ) exp ( ρ), taking logarithms of both sides, and solving for k we get the desired lower bound. Using Theorem 4.3, one can similarly derive the convergence rate of SRG when it is used with decreasing step sizes. In such a case, the mixture coefficient can also be decreased, provided that the ratio αk/θk remains non-increasing as required by Theorem 4.3. We do not pursue this further here. C. Analysis of SRG+ The analysis of SRG+ is very similar to that of SRG. The Lyapunov function needs to be modified to make use of index-specific constants: T k = T(xk, (gi k)n i=1) := αk gi k fi(x ) 2 2 + xk x 2 2 Similar to Lemma 4.1, the following lemma studies the evolution of the first term of this Lyapunov function. Lemma C.1. Let k N. Suppose (gi k)n i=1 evolves as in Algorithm 2. Taking expectation with respect to (bk, jk, ik), conditional on (bt, jt, it)k 1 t=0 , we have: gi k+1 fi(x ) 2 2 gi k fi(x ) 2 2 + 2θk [F(xk) F(x )] Proof. The proof is identical to that of Lemma 4.1, with the only difference being in the use of the second inequality of Lemma A.1 instead of the first. Similar to Lemma 4.2, the following lemma bounds the second moment of the gradient estimator of SRG+. Lemma C.2. Let k N and assume that θk (0, 1/2]. 
*Taking expectation with respect to $(b_k, j_k, i_k)$, conditional on $(b_t, j_t, i_t)_{t=0}^{k-1}$, we have, for all $\beta, \gamma, \delta, \eta > 0$:*

$$\mathbb{E}\left[\left\|\frac{1}{n p_k^{i_k}}\nabla f_{i_k}(x_k)\right\|_2^2\right] \le \frac{2 D_1 \bar{L}}{\theta_k}[F(x_k) - F(x^\star)] + \frac{D_2}{\theta_k n}\sum_{i=1}^n \frac{\bar{L}}{L_i}\|g_k^i - \nabla f_i(x^\star)\|_2^2 + D_3(1 + 2\theta_k)\sigma_\star^2$$

*where:*

$$D_1 := 1 + \beta + \gamma, \quad D_2 := (1 + \beta^{-1} + \delta) + (1 + \gamma^{-1} + \delta^{-1})(1 + \eta), \quad D_3 := (1 + \gamma^{-1} + \delta^{-1})(1 + \eta^{-1})$$

*Proof.* The proof is again very similar to that of Lemma 4.2. In particular, we use the same decomposition into three terms. The first term is bounded using the second inequality of Lemma A.1 instead of the first. The third term is also bounded using the same arguments, except in the use of the Cauchy–Schwarz inequality. In particular, we use instead the following bound:

$$\left(\sum_{i=1}^n \|g_k^i - \nabla f_i(x^\star)\|_2\right)^2 = |\langle u, v\rangle|^2 \le \|u\|_2^2\,\|v\|_2^2 = n\sum_{i=1}^n \frac{\bar{L}}{L_i}\|g_k^i - \nabla f_i(x^\star)\|_2^2$$

where $u$ is the vector whose $i$th component is $\sqrt{L_i}$, $v$ is the vector whose $i$th component is $\frac{1}{\sqrt{L_i}}\|g_k^i - \nabla f_i(x^\star)\|_2$, and we used $\|u\|_2^2 = \sum_{i=1}^n L_i = n\bar{L}$. ∎

Combining Lemmas C.1 and C.2, and using arguments identical to the ones used to prove Theorem 4.3, we arrive at Theorem 4.3 for the iterates of SRG+, with the condition $\alpha_k \le \theta_k/12L_{\max}$ loosened to $\alpha_k \le \theta_k/12\bar{L}$. By extension, Corollary 4.4 also holds for SRG+, with every occurrence of $L_{\max}$ replaced by $\bar{L}$.