Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD

Sharan Vaswani 1, Benjamin Dubois-Taine 2, Reza Babanezhad 3

Abstract

We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate is slowed down proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $O\left(\exp(-T/\sqrt{\kappa}) + \sigma^2/T\right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. Finally, we empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.

1. Introduction

We study unconstrained minimization of a finite-sum objective $f: \mathbb{R}^d \to \mathbb{R}$ prevalent in machine learning,

$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w). \tag{1}$$

1 Simon Fraser University. 2 DI ENS, École normale supérieure, Université PSL, CNRS, INRIA, 75005 Paris, France. 3 SAIT AI Lab, Montreal. Correspondence to: Sharan Vaswani.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

For supervised learning, $n$ represents the number of training examples, and $f_i$ is the loss on example $i$. We assume $f$ to be a smooth, strongly-convex function and denote by $w^*$ the unique minimizer of the above problem. We study stochastic gradient descent (SGD) and its accelerated variant for minimizing $f$ (Robbins and Monro, 1951; Nemirovski and Yudin, 1983; Nesterov, 2004; Bottou et al., 2018).

The empirical performance and the theoretical convergence of SGD are governed by the choice of its step-size, and there are numerous ways of selecting it. For example, Moulines and Bach (2011); Gower et al. (2019) use a constant step-size for convex and strongly-convex functions. A constant step-size only guarantees convergence to a neighbourhood of the solution. In order to converge to the exact minimizer, a common technique is to decrease the step-size at an appropriate rate, and such decreasing step-sizes have been well-studied (Robbins and Monro, 1951; Ghadimi and Lan, 2012). The rate at which the step-size needs to be decayed depends on the function class under consideration. For example, when minimizing smooth, strongly-convex functions using $T$ iterations of SGD, the step-size is decayed at an $O(1/k)$ rate, where $k$ is the iteration number. This results in a $\Theta(1/T)$ convergence rate for SGD, which is optimal in the stochastic setting (Nguyen et al., 2018). On the other hand, when minimizing a smooth, strongly-convex function with condition number $\kappa$, deterministic (full-batch) gradient descent (GD) with a constant step-size converges linearly at an $O(\exp(-T/\kappa))$ rate.
Augmenting constant step-size GD with Nesterov acceleration can further improve the convergence rate to $\Theta(\exp(-T/\sqrt{\kappa}))$, which is optimal in the deterministic setting (Nesterov, 2004). Hence, the stochastic and deterministic algorithms use different step-size strategies to obtain the optimal rates in their respective settings.

Noise-adaptive SGD: Ideally, we want to design step-size schemes that make SGD adaptive to the noise in the stochastic gradients, matching the optimal convergence rates in both the deterministic and stochastic settings. Furthermore, in order for the algorithm to be practical, it should not require knowledge of the stochasticity (e.g. a bound on $\sigma^2$, the variance in the stochastic gradients). Recently, Khaled and Richtárik (2020); Li et al. (2020) achieved the $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ rate for smooth functions satisfying the Polyak-Łojasiewicz (PL) condition (Karimi et al., 2016), a generalization of strong-convexity. More importantly, these works are noise-adaptive and do not require the knowledge of $\sigma^2$. For this, Li et al. (2020) use SGD with an exponentially decreasing sequence of step-sizes, while Khaled and Richtárik (2020) use a constant-then-decaying step-size. There are two limitations of these works: (i) they require the knowledge of problem-dependent constants such as the smoothness and strong-convexity of the underlying function, and (ii) they do not match the optimal $\sqrt{\kappa}$ dependence (of the Nesterov accelerated method) in the linear convergence term, and are hence sub-optimal in the deterministic setting. We will address both these limitations in this work.

Towards noise and problem-adaptive SGD: Typically, SGD requires the knowledge of problem-dependent constants to set the step-size. In practice, it is difficult to estimate these quantities, and one can only obtain loose bounds on them. Consequently, there have been numerous methods (Duchi et al., 2011; Li and Orabona, 2019; Kingma and Ba, 2015; Bengio, 2015; Vaswani et al., 2019b; Loizou et al., 2021) that can adapt to the problem and adjust the step-size on the fly. We term such methods problem-adaptive. Unfortunately, it is unclear if such problem-adaptive methods can also be made noise-adaptive. On the other hand, as mentioned above, none of the noise-adaptive methods (Li et al., 2020; Khaled and Richtárik, 2020; Stich, 2019) are problem-adaptive. Amongst these, the noise-adaptive algorithm in Li et al. (2020) only requires knowledge of the smoothness constant, and we try to relax this requirement.

Contribution: In Section 3.2, we use a stochastic line-search (SLS) (Vaswani et al., 2019b) to estimate the smoothness constant on the fly. We prove that SGD in conjunction with exponentially decreasing step-sizes and SLS converges at the desired noise-adaptive rate, but only to a neighbourhood of the solution. This neighbourhood depends on the noise and the error in estimating the smoothness. We prove a corresponding lower-bound that shows the necessity of this neighbourhood term. Our lower-bound shows that if the SGD step-size is adaptively set in an online fashion (using the sampled function), no decreasing sequence of step-sizes can converge to the minimizer.

Contribution: In Section 3.3, we consider estimating the smoothness constant in an offline fashion (before running the algorithm).
We prove that SGD with an offline estimate of the smoothness and exponentially decreasing step-sizes converges to the solution, though its rate is slowed down by a factor proportional to the estimation error in the smoothness. In particular, our upper-bound shows that misestimating the smoothness constant can slow down the convergence rate. We complement this result with a lower-bound showing that this slowdown is unavoidable. Our results thus demonstrate the difficulty of obtaining noise-adaptive rates while being adaptive to problem-dependent parameters.

Noise-adaptive SGD with Nesterov acceleration: We now turn to the second limitation of existing noise-adaptive methods, and aim to use Nesterov acceleration in order to obtain the optimal $O\left(\exp(-T/\sqrt{\kappa}) + \sigma^2/T\right)$ rate, without the knowledge of $\sigma^2$. The work in Jain et al. (2018); Arjevani et al. (2020) satisfies the desired criteria for quadratic functions. For general smooth, strongly-convex functions, Ghadimi and Lan (2013); Kulunchakov and Mairal (2019) obtain the desired rate, but require the knowledge of $\sigma^2$, and are consequently not noise-adaptive. Aybat et al. (2019) propose a multi-stage accelerated algorithm that does not require knowledge of $\sigma^2$. The authors use a dynamical systems analysis and prove that their algorithm achieves the desired optimal rate only for $T \ge 2\sqrt{\kappa}$.

Contribution: In contrast, in Section 4, we use SGD with a stochastic variant of Nesterov acceleration (Cohen et al., 2018; Vaswani et al., 2019a) and the same exponentially decreasing step-sizes. We refer to the resulting method as Accelerated SGD (ASGD). Compared to Aybat et al. (2019), ASGD is a more natural extension of the deterministic Nesterov accelerated gradient method. Under a growth condition (Li et al., 2020; Khaled and Richtárik, 2020; Bottou et al., 2018) similar to Aybat et al. (2019), we use the standard estimating sequences analysis and prove that ASGD achieves the desired rate for all $T$ without the knowledge of $\sigma^2$. Hence, exponentially decreasing step-sizes result in noise-adaptivity for both SGD and ASGD.

Contribution: ASGD requires the knowledge of both the smoothness and strong-convexity parameters. As a step towards problem-adaptivity for ASGD, we analyze its convergence with offline estimates of these problem-dependent constants. To the best of our knowledge, this is the first such result. We prove that, similar to SGD, misspecified ASGD converges to the minimizer, but its rate is slowed down by a factor proportional to the estimation errors.

Contribution: Finally, in Section 5, we evaluate the performance of different step-size schemes on strongly-convex supervised learning problems. We show that (A)SGD with exponential step-sizes consistently outperforms existing noise-adaptive algorithms. We propose a novel variant of SLS that guarantees convergence to the minimizer and demonstrate its practical effectiveness in making (A)SGD problem-adaptive.

Additional contributions: In Appendix B.1.1, we show matching results for SGD on strongly star-convex functions (Hinder et al., 2020), a class of structured non-convex functions. Finally, we prove upper-bounds for non-strongly-convex functions (Appendix B.1.2) and show that even when the smoothness constant is known, exponentially decreasing step-sizes converge to a neighbourhood of the solution. We give some justification as to why polynomially or exponentially decreasing step-sizes are unlikely to be noise-adaptive in this setting.
2. Problem setup and Background

We assume that $f$ and each $f_i$ are differentiable and lower-bounded by $f^*$ and $f_i^*$, respectively. Throughout the paper, we assume that $f$ is $\mu$-strongly convex and that each $f_i$ is convex. We also assume that each function $f_i$ is $L_i$-smooth, implying that $f$ is $L$-smooth with $L := \max_i L_i$ (see Appendix A for the necessary definitions), and define $\kappa := L/\mu$.

We use stochastic gradient descent (SGD) or SGD with Nesterov acceleration (Nesterov, 2004) (referred to as ASGD) to minimize $f$ in Eq. (1). In each iteration $k \in [T]$, SGD selects a function $f_{i_k}$ (typically uniformly) at random, computes its gradient, and takes a descent step. Specifically,

$$w_{k+1} = w_k - \gamma_k \alpha_k \nabla f_{i_k}(w_k), \tag{2}$$

where $w_{k+1}$ and $w_k$ are the SGD iterates, and $\nabla f_{i_k}(\cdot)$ is the gradient of the loss function chosen at iteration $k$. Each stochastic gradient $\nabla f_{i_k}(w)$ is unbiased, implying that $\mathbb{E}_i[\nabla f_i(w)] = \nabla f(w)$. The product of scalars $\eta_k := \gamma_k \alpha_k$ defines the step-size for iteration $k$. The step-size consists of two parts: $\gamma_k$, a problem-dependent scaling term that captures the (local) smoothness of the function; and $\alpha_k$, a problem-independent term that controls the decay of the step-size. Typically, $\alpha_k$ is a decreasing sequence in $k$, and $\lim_{k \to \infty} \alpha_k = 0$. The choice of the $\alpha_k$ sequence depends on the properties of $f$; for example, for smooth, strongly-convex functions, $\alpha_k$ is typically set to be $O(1/k)$. Throughout the paper, we will assume that $T$ is known in advance.

In order to obtain noise-adaptive rates, we consider exponentially decreasing step-sizes (Li et al., 2020) of the form $\alpha_k := \alpha^k$, where $\alpha := \left(\frac{\beta}{T}\right)^{1/T} \le 1$ for a constant $\beta \ge 1$. These step-sizes lie between the constant step-size used in the deterministic setting and the $1/k$ decreasing step-sizes used in the stochastic setting, meaning that for $k \in [T]$, $\frac{\beta}{T} \le \alpha_k \le 1$ (see the sketch below).
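To make the schedule concrete, the following is a minimal sketch (our illustration, not the authors' released implementation) of the SGD update in Eq. (2) with the exponential $\alpha_k$-sequence, applied to a toy least-squares finite sum; all function and variable names here are ours.

```python
import numpy as np

def sgd_exp_stepsizes(grad_i, w0, n, T, L, beta=1.0, seed=0):
    """SGD (Eq. (2)) with gamma_k = 1/L and alpha_k = (beta/T)^(k/T)."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)   # per-iteration decay factor; alpha^T = beta/T
    w = np.array(w0, dtype=float)
    for k in range(1, T + 1):
        i = rng.integers(n)           # sample a component f_i uniformly at random
        w -= (1.0 / L) * alpha**k * grad_i(w, i)   # eta_k = gamma_k * alpha_k
    return w

# toy finite sum: f_i(w) = 0.5 * (x_i^T w - y_i)^2, so L_i = ||x_i||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)); y = X @ np.ones(5)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
L = np.max(np.sum(X**2, axis=1))      # L = max_i L_i
w_T = sgd_exp_stepsizes(grad, np.zeros(5), n=100, T=2000, L=L)
```

Note that $\alpha_1 \approx 1$ (an essentially constant step-size early on) while $\alpha_T = \beta/T$ (the usual $O(1/T)$ magnitude at the end), which is how the schedule interpolates between the two regimes.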
In the next section, we analyze the convergence of SGD with exponentially decreasing step-sizes for smooth, strongly-convex functions.

3. Towards noise & problem adaptive SGD

In this section, we consider approaches for developing noise and problem-adaptive SGD, i.e. we aim to obtain the noise-adaptive rate matching Stich (2019); Li et al. (2020); Khaled and Richtárik (2020), but do so without the knowledge of problem-dependent constants. Instead of the typical assumption of finite gradient noise $z^2 := \mathbb{E}_i\left[\|\nabla f_i(w^*)\|^2\right] < \infty$, we assume a finite optimal objective difference. Specifically, we define the noise as $\sigma^2 := \mathbb{E}_i\left[f_i(w^*) - f_i^*\right] \ge 0$. This notion of noise has been used to study the convergence of constant step-size SGD in the interpolation setting for over-parameterized models (Zhang and Zhou, 2019; Loizou et al., 2021; Vaswani et al., 2020). Note that when interpolation is exactly satisfied, $\sigma = z = 0$. In general, if each function $f_i$ is $\mu$-strongly convex and $L$-smooth, then $\frac{1}{2L} z^2 \le \sigma^2 \le \frac{1}{2\mu} z^2$.

As a warm-up, we first assume knowledge of the smoothness constant in Section 3.1 and analyze the resulting SGD algorithm with exponentially decreasing step-sizes. In Section 3.2, we consider using a stochastic line-search (Vaswani et al., 2019b; 2020) in order to estimate the smoothness constant and set the step-size on the fly. Finally, in Section 3.3, we analyze the convergence of SGD when using an offline estimate of the smoothness.

3.1. Known smoothness

We use the knowledge of smoothness to set the problem-dependent part of the step-size for SGD, specifically, $\gamma_k = 1/L$. With an exponentially decreasing $\alpha_k$-sequence, we prove the following theorem in Appendix C.1.

Theorem 1. Assuming (i) convexity and $L_i$-smoothness of each $f_i$ and (ii) $\mu$ strong-convexity of $f$, SGD (Eq. (2)) with $\gamma_k = \frac{1}{L}$ and $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ converges as,

$$\mathbb{E}\|w_{T+1} - w^*\|^2 \le c_2 \exp\left(-\frac{T}{\kappa}\,\frac{\alpha}{\ln(T/\beta)}\right)\|w_1 - w^*\|^2 + \frac{8\kappa\sigma^2 c_2}{\mu e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2\, T},$$

where $c_2 = \exp\left(\frac{1}{\kappa}\,\frac{2\beta}{\ln(T/\beta)}\right)$.

Compared to Moulines and Bach (2011), who use polynomially decreasing step-sizes, exponential step-sizes result in a better trade-off between the bias (initial distance to the minimizer) and variance (noise) terms, achieving the desired $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ noise-adaptive rate. In Lemmas 3 and 4 in Appendix B.3, we show that no polynomially decreasing step-size can result in the desired noise-adaptive rate. In order to interpolate between the stochastic (mini-batch size equal to 1) and fully deterministic (mini-batch size equal to $n$) setting, we show the explicit dependence of $\sigma^2$ on the mini-batch size in Appendix B.2.

Since strongly-convex functions also satisfy the PL condition (Karimi et al., 2016), the above result can be deduced from Li et al. (2020). However, unlike Li et al. (2020), our result does not require the growth condition and uses a weaker notion of noise. Moreover, we use a different proof technique: Li et al. (2020) use the smoothness inequality in the first step and obtain the rate in terms of the function suboptimality $\mathbb{E}[f(w_T) - f^*]$, whereas our proof uses an expansion of the iterates to obtain the rate in terms of the distance to the minimizer, $\mathbb{E}\|w_{T+1} - w^*\|^2$. This change allows us to easily handle the case when the smoothness constant is unknown and needs to be estimated. Next, we use stochastic line-search techniques to estimate the unknown smoothness and set the step-size on the fly.

3.2. Online estimation of unknown smoothness

In this section, we assume that the smoothness constant is unknown, and aim to estimate it and set the step-size in an online fashion. By online estimation, we mean that in iteration $k$, we use the knowledge of the sampled function $i_k$ to set the step-size, i.e. setting $\gamma_k$ depends on $i_k$. We only consider methods that use the knowledge of $i_k$ in iteration $k$ and are not allowed to access the other functions in $f$ (for example, to compute the full-batch gradient at $w_k$). Methods based on a stochastic line-search (Vaswani et al., 2019b; 2020) or the stochastic Polyak step-size (Loizou et al., 2021; Berrada et al., 2020) satisfy this criterion.

We use a stochastic line-search (SLS) to estimate the local Lipschitz constant and set $\gamma_k$, the problem-dependent part of the step-size. SLS is the stochastic analog of the traditional Armijo line-search (Armijo, 1966) used for deterministic gradient descent (Nocedal and Wright, 2006). In iteration $k$, SLS estimates the smoothness constant $L_{i_k}$ of the sampled function using $f_{i_k}$ and $\nabla f_{i_k}$. In particular, starting from a guess $\gamma_{\max}$ of the step-size, SLS uses a backtracking procedure and returns the largest step-size $\gamma_k$ that satisfies $\gamma_k \le \gamma_{\max}$ and

$$f_{i_k}\left(w_k - \gamma_k \nabla f_{i_k}(w_k)\right) \le f_{i_k}(w_k) - c\,\gamma_k \left\|\nabla f_{i_k}(w_k)\right\|^2. \tag{3}$$

Here, $c \in (0, 1)$ is a hyper-parameter to be determined theoretically. SLS guarantees that the resulting step-size $\gamma_k$ lies in the $\left[\min\left\{\frac{2(1-c)}{L_{i_k}}, \gamma_{\max}\right\}, \gamma_{\max}\right]$ range (Lemma 8). If the initial guess is large enough, i.e. $\gamma_{\max} > 1/L_{i_k}$, then the resulting step-size satisfies $\gamma_k \ge \frac{2(1-c)}{L_{i_k}}$. Thus, with $c = 1/2$, SLS can be used to obtain an upper-bound on $1/L_{i_k}$. In the interpolation ($\sigma = 0$) setting, a constant step-size ($\alpha_k = 1$ for all $k$) suffices, and SGD with SLS achieves a linear rate of convergence (for $c \ge 1/2$) when minimizing smooth, strongly-convex functions (Vaswani et al., 2019b).
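The backtracking procedure itself is not spelled out in the text above; the following sketch shows one standard way to search for the largest step-size satisfying Eq. (3). The multiplicative backtracking factor is our assumption.

```python
def stochastic_line_search(f_i, grad_i, w, i, gamma_max, c=0.5, backtrack=0.8):
    """Return the largest gamma <= gamma_max satisfying the SLS condition (Eq. (3))
    on the sampled function f_i (up to the backtracking resolution)."""
    g = grad_i(w, i)
    sq_norm = g @ g
    gamma = gamma_max
    # For an L_i-smooth f_i, the condition holds once gamma <= 2(1 - c)/L_i,
    # so the loop terminates with gamma >= min{backtrack * 2(1-c)/L_i, gamma_max}.
    while f_i(w - gamma * g, i) > f_i(w, i) - c * gamma * sq_norm:
        gamma *= backtrack
    return gamma
```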
In general, for a non-zero $\sigma$, using SGD with SLS and no step-size decay ($\alpha_k = 1$) results in an $O\left(\exp(-T/\kappa) + \gamma_{\max}\sigma^2\right)$ rate (Vaswani et al., 2020), implying convergence to a neighbourhood determined by the $\gamma_{\max}\sigma^2$ term. In order to obtain a similar rate as Theorem 1 but without the knowledge of $L$, we set $\gamma_k$ with SLS and use the same exponentially decreasing $\alpha_k$-sequence. We prove the following theorem in Appendix C.2.

Theorem 2. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k$ set as the largest step-size that satisfies $\gamma_k \le \gamma_{\max}$ and Eq. (3) with $c = 1/2$ converges as,

$$\mathbb{E}\|w_{T+1} - w^*\|^2 \le c_1 \exp\left(-\frac{T}{\kappa^*}\,\frac{\alpha}{\ln(T/\beta)}\right)\|w_1 - w^*\|^2 + \frac{8\sigma^2 c_1 (\kappa^*)^2 \gamma_{\max}}{e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2 T} + \frac{2\sigma^2 c_1 \kappa^* \ln(T/\beta)}{e\,\alpha}\,\gamma_{\text{err}},$$

where $\gamma_{\text{err}} := \gamma_{\max} - \min\left\{\gamma_{\max}, \frac{1}{L}\right\}$, $\kappa^* := \max\left\{\frac{L}{\mu}, \frac{1}{\mu\gamma_{\max}}\right\}$, and $c_1 = \exp\left(\frac{1}{\kappa^*}\,\frac{2\beta}{\ln(T/\beta)}\right)$.

We observe that the first two terms are similar to those in Theorem 1. For $\gamma_{\max} \ge \frac{1}{L}$, $\kappa^* = \kappa$, and the above theorem implies the same $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ rate of convergence. However, as $T \to \infty$, $w_{T+1}$ does not converge to $w^*$, but rather to a neighbourhood determined by the last term $\frac{2\sigma^2\kappa^* c_1 \ln(T/\beta)}{e\,\alpha}\left[\gamma_{\max} - \min\left\{\gamma_{\max}, \frac{1}{L}\right\}\right]$. The neighbourhood thus depends on the noise $\sigma^2$ and on $\gamma_{\text{err}}$, the estimation error (in the smoothness) of the initial guess. When $\sigma^2 = 0$, this neighbourhood term disappears, and SGD converges to the minimizer despite the estimation error. This matches the result for SLS in the interpolation setting (Vaswani et al., 2019b). Conversely, when the smoothness is known and $\gamma_{\max}$ can be set equal to $\frac{1}{L}$, we also obtain convergence to the minimizer and recover the result of Theorem 1. In fact, if we can guess a value of $\gamma_{\max} \le \frac{1}{L}$, the neighbourhood term becomes zero, thus ensuring convergence to the minimizer. In this case, the stochastic line-search does not decrease the step-size in any iteration, and the algorithm becomes the same as using a constant step-size equal to $\gamma_{\max}$. Finally, we contrast our result with the $\alpha_k = 1$ setting (Vaswani et al., 2020) and observe that instead of the dependence on $\gamma_{\max}$, our neighbourhood term depends on the estimation error in the smoothness. Next, we show the necessity of such a neighbourhood term.

3.2.1. Lower bound on quadratics

In order to prove a lower-bound, we consider a pair of one-dimensional quadratics $f_i(w) = \frac{1}{2}(x_i w - y_i)^2$ for $i = 1, 2$. Here, $w$, $x_i$, $y_i$ are all scalars. The overall function to be minimized is $f(w) = \frac{1}{2}\left[f_1(w) + f_2(w)\right]$. We assume that $|x_1| \ne |x_2|$, and since $L_i = x_i^2$, this assumption implies different smoothness constants for the two functions. For a sufficiently large value of $\gamma_{\max}$, i.e. $\gamma_{\max} \ge \frac{1}{\min_{i \in [2]} L_i}$, using SLS with $c \le 1/2$ (required for convergence) results in $\gamma_k \ge 1/L_{i_k}$ (see Lemma 8).¹ With these choices, we prove the following lower-bound.

¹For one-dimensional quadratics, $\gamma_k = 1/L_{i_k}$ for $c = 1/2$.

Theorem 3. When using $T$ iterations of SGD to minimize the sum $f(w) = \frac{f_1(w) + f_2(w)}{2}$ of two one-dimensional quadratics, $f_1(w) = \frac{1}{2}(w - 1)^2$ and $f_2(w) = \frac{1}{2}\left(2w + \frac{1}{2}\right)^2$, setting $\gamma_k$ using SLS with $\gamma_{\max} \ge 1$ and $c \le 1/2$, any convergent sequence of $\alpha_k$ results in convergence to a neighbourhood of the solution. Specifically, if $w^*$ is the minimizer of $f$ and $w_1 > 0$, then,

$$\mathbb{E}(w_T - w^*) \ge \min\left\{w_1, \frac{3}{8}\right\}.$$

The above result (proved in Appendix D.1) shows that using SGD with SLS to set $\gamma_k$ and any convergent sequence of $\alpha_k$ (including the exponentially-decreasing sequence in Theorem 2) will necessarily result in convergence to a neighbourhood. The neighbourhood term can thus be viewed as the price of misestimating the unknown smoothness constant.
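The construction in Theorem 3 is easy to reproduce numerically. The following small simulation (our illustration, under the assumption that SLS returns exactly $\gamma_k = 1/L_{i_k}$ on these quadratics, as in the footnote above) shows the iterates settling around $3/8$ instead of the minimizer $w^* = 0$.

```python
import numpy as np

# f1(w) = 0.5*(w - 1)^2 with L1 = 1; f2(w) = 0.5*(2w + 0.5)^2 with L2 = 4.
# The minimizer of f = (f1 + f2)/2 is w* = 0.
grads = [lambda w: w - 1.0, lambda w: 4.0 * w + 1.0]
Ls = [1.0, 4.0]

T, beta, runs = 10_000, 1.0, 200
alpha = (beta / T) ** (1.0 / T)
rng = np.random.default_rng(0)
endpoints = []
for _ in range(runs):
    w = 1.0
    for k in range(1, T + 1):
        i = rng.integers(2)
        w -= (1.0 / Ls[i]) * alpha**k * grads[i](w)   # gamma_k = 1/L_{i_k}
    endpoints.append(w)
print(np.mean(endpoints))   # empirically close to 3/8, not to w* = 0
```

The reason is visible in the expected update: $\mathbb{E}[w_{k+1} \mid w_k] = (1 - \alpha_k)\,w_k + \frac{3}{8}\,\alpha_k$, whose fixed point is $3/8$ rather than $w^*$.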
This result is in contrast to the conventional thinking that choosing an $\alpha_k$ sequence such that $\lim_{k\to\infty}\alpha_k = 0$ will always ensure convergence to the minimizer. Note that this result is not specific to SLS and would hold for other related methods (Loizou et al., 2021; Berrada et al., 2020). Since the lower-bound holds for any convergent $\alpha_k$ sequence, a possible reason for this convergence to the neighbourhood is the correlation between $i_k$ and the computation of $\gamma_k$. We investigate this hypothesis in the next section.

3.3. Offline estimation of unknown smoothness

In this section, we consider an offline estimation of the smoothness constant. By offline, we mean that in iteration $k$, $\gamma_k$ is set before sampling $i_k$ and cannot use any information about it. This ensures that $\gamma_k$ is decorrelated from the sampled function $i_k$. The entire sequence of $\gamma_k$ can even be chosen before running SGD. For simplicity of calculations, we consider a fixed $\gamma_k = \gamma$ for all iterations. Here $\gamma$ is an offline estimate of $\frac{1}{L}$ and can be obtained by any method. Without loss of generality, we assume that this offline estimate is off by a multiplicative factor $\nu$, that is, $\gamma = \frac{\nu}{L}$ for some $\nu > 0$. Here $\nu$ quantifies the estimation error in $\gamma$, with $\nu = 1$ corresponding to an exact estimation of $L$. In practice, it is typically possible to obtain lower-bounds on the smoothness constant; hence, the $\nu > 1$ regime is of practical interest.

For SGD with $\gamma_k = \gamma = \frac{\nu}{L}$ and an exponentially decreasing $\alpha_k$-sequence, we prove the following theorem in Appendix C.3.

Theorem 4. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k = \frac{\nu}{L}$ converges as,

$$\Delta_{T+1} \le c_2 \exp\left(-\min\{\nu, 1\}\,\frac{T}{\kappa}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \max\{\nu^2, 1\}\,\frac{8 c_2 \kappa \ln(T/\beta)}{\mu\,e^2\,\alpha^2\,T}\left[2\sigma^2\ln(T/\beta) + G\,[\ln(\nu)]_+\right],$$

where $c_2 = \exp\left(\frac{1}{\kappa}\,\frac{2\beta}{\ln(T/\beta)}\right)$, $[x]_+ = \max\{x, 0\}$, $k_0 = \frac{T\,[\ln(\nu)]_+}{\ln(T/\beta)}$, $G = \max_{j\in[k_0]}\{f(w_j) - f^*\}$, and $\Delta_k := \mathbb{E}\|w_k - w^*\|^2$.

The above theorem implies an $O\left(\exp\left(-\min\{\nu,1\}\frac{T}{\kappa}\right) + \max\{\nu^2,1\}\,\frac{\sigma^2 + G\,[\ln(\nu)]_+}{T}\right)$ convergence to the minimizer. The first two terms are similar to those in Theorem 1 and imply an $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ convergence to the minimizer. Analyzing the third term, we observe that when $\nu \le 1$, it is zero (since $[\ln(\nu)]_+ = 0$), and the rate matches that of Theorem 1 up to constants that depend on $\nu$. The third term depends on $G = \max_{j \le k_0}\{f(w_j) - f^*\}$ because if $\nu > 1$, the step-size $\gamma_k\alpha_k = \frac{\nu}{L}\alpha_k \ge \frac{1}{L}$ initially, and SGD can diverge in this regime. Since $\alpha_k$ is an exponentially decreasing sequence, after $k_0 := \frac{T\ln(\nu)}{\ln(T/\beta)}$ iterations, $\frac{\nu}{L}\alpha_k \le \frac{1}{L}$; the distance to the minimizer decreases after iteration $k_0$, eventually converging to the solution. Furthermore, observe that the second term depends on $\max\{\nu^2, 1\}$, meaning that if we misestimate the smoothness constant by a multiplicative factor of $\nu > 1$, it can slow down the convergence rate by an $O(\nu^2)$ factor. Finally, our theorem implies that even in the deterministic setting, misestimating $L$ can slow down the convergence rate to $O\left(\frac{\nu^2}{T}\right)$ instead of the usual linear rate of convergence. The third term can thus be viewed as the price of misestimating the unknown smoothness constant. Unlike Theorem 2, where this price was convergence to a neighbourhood, here the price of misestimation is slower convergence to the minimizer.

Moulines and Bach (2011) also considered the effect of misspecifying $L$, but in conjunction with polynomially decreasing step-sizes. Specifically, they proved that using a step-size of $\frac{\nu}{L}\,\frac{1}{k^\theta}$ results in the following bounds that depend on $\theta$ and $\nu$ (Moulines and Bach, 2011, Theorem 1).
Below, we show their bounds for three common choices of $\theta \in \{0, 1/2, 1\}$ and emphasize the effect of $\nu$:

$$\Delta_{T+1} = O\left(\exp\left(\left(\nu^2 - \nu/\kappa\right)T\right)\left(\Delta_1 + \sigma^2\right) + \nu\sigma^2\right) \qquad (\text{when } \theta = 0)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2\ln(T) - (\nu/\kappa)\sqrt{T}\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu\sigma^2}{\sqrt{T}}\right) \qquad (\text{when } \theta = 1/2)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2 - (\nu/\kappa)\ln(T)\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu^2\sigma^2}{T^{\nu/\kappa}}\right) \qquad (\text{when } \theta = 1 \text{ and } \nu < 2\kappa)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2 - (\nu/\kappa)\ln(T)\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu^2\sigma^2}{T}\right) \qquad (\text{when } \theta = 1 \text{ and } \nu \ge 2\kappa)$$

Observe that for each regime, the convergence rate depends on $\exp(\nu^2)$. In contrast, the convergence rate in Theorem 4 depends on $O(\nu^2)$. This robustness towards misspecification can be viewed as an additional advantage of using exponentially decreasing step-sizes. In the next section, we justify the dependence on $[\ln(\nu)]_+$ in Theorem 4 by proving a corresponding lower-bound. It is unclear whether the $\nu^2$ dependence in Theorem 4 is tight, and we leave verifying this for future work.

3.3.1. Lower bound on quadratics

In this section, we consider gradient descent on a one-dimensional quadratic and study the effect of misestimating the smoothness constant by a factor of $\nu > 1$. We consider minimizing a single quadratic, ensuring that $\sigma^2 = 0$, and prove the following lower-bound in Appendix D.2.

Theorem 5. When minimizing a one-dimensional quadratic function $f(w) = \frac{1}{2}(xw - y)^2$, GD with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k = \frac{\nu}{L}$ for $\nu > 3$ satisfies

$$w_{k+1} - w^* = (w_1 - w^*)\prod_{i=1}^{k}\left(1 - \nu\alpha_i\right).$$

After $k^* := \frac{T}{\ln(T/\beta)}\ln\left(\frac{\nu}{3}\right)$ iterations, we have that $|w_{k^*+1} - w^*| \ge 2^{k^*}\,|w_1 - w^*|$.

Instantiating this lower-bound, suppose the estimate of $L$ is off by a factor of $\nu = 10$; then $\ln\left(\frac{\nu}{3}\right) \ge 1$, which implies that $k^* \ge \frac{T}{\ln(T/\beta)}$. In other words, we do not make any progress in the first $\frac{T}{\ln(T/\beta)}$ iterations, and at this point the optimality gap has been multiplied by a factor of $2^{T/\ln(T/\beta)}$ compared to the starting optimality gap. This simple example shows the slowdown in the rate of convergence caused by misestimating the smoothness.

4. Towards noise & problem adaptive ASGD

In this section, we will first aim to use SGD with Nesterov acceleration and obtain the optimal $O\left(\exp\left(-\frac{T}{\sqrt{\kappa\rho}}\right) + \frac{\sigma^2}{T}\right)$ rate without knowledge of $\sigma^2$. Subsequently, we will analyze the convergence of ASGD with offline estimates of the smoothness and strong-convexity parameters, quantifying the price of misspecification (similar to Section 3.3).

ASGD maintains two sequences $\{w_k, y_k\}$ and an additional extrapolation parameter $b_k$. ASGD computes the stochastic gradient at the extrapolated point $y_k$ and takes a descent step in that direction. The update in iteration $k$ is:

$$y_k = w_k + b_k\,(w_k - w_{k-1}), \tag{4}$$
$$w_{k+1} = y_k - \gamma_k\alpha_k\,\nabla f_{i_k}(y_k). \tag{5}$$

For analyzing the convergence of ASGD, we will assume that the stochastic gradients satisfy a growth condition similar to Bottou et al. (2018); Li et al. (2020); Khaled and Richtárik (2020):² there exists a pair $(\rho, \sigma)$ with $\rho \ge 1$ and $\sigma \ge 0$, such that for all $w$,

$$\mathbb{E}_i\|\nabla f_i(w)\|^2 \le \rho\,\|\nabla f(w)\|^2 + \sigma^2. \tag{6}$$

In the deterministic setting (when using the full gradient in Eq. (5)), $\rho = 1$ and $\sigma = 0$. Similarly, $\sigma = 0$ when the stochastic gradients satisfy the strong-growth condition when using over-parameterized models (Schmidt and Roux, 2013; Vaswani et al., 2019a).³ Using this growth condition, we prove the following result in Appendix E.3.

Theorem 6. Under the same assumptions as Theorem 1 and (iii) the growth condition in Eq. (6), ASGD (Eqs. (4) and (5)) with $w_1 = y_1$, $\gamma_k = \frac{1}{\rho L}$, $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, $r_k = \sqrt{\frac{\mu}{\rho L}}\left(\frac{\beta}{T}\right)^{k/2T}$, and $b_k = \frac{(1 - r_{k-1})\,r_{k-1}\,\alpha}{r_k + r_{k-1}^2\,\alpha}$ converges as,

$$\Delta_{T+1} \le 2 c_3 \exp\left(-\frac{T}{\sqrt{\kappa\rho}}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \frac{8 c_3\,\sqrt{\kappa\rho}\,\sigma^2}{\rho\mu\,e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2\,T},$$

where $\Delta_k := \mathbb{E}[f(w_k) - f^*]$ and $c_3 = \exp\left(\frac{1}{\sqrt{\rho\kappa}}\,\frac{2\beta}{\ln(T/\beta)}\right)$.
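A minimal sketch of the ASGD recursion in Eqs. (4)-(5) with this schedule is given below. The $b_k$ formula follows our reading of the garbled statement of Theorem 6 in the source, so treat the exact extrapolation constants as an assumption; the structure of the updates is as in the equations above.

```python
import numpy as np

def asgd_exp(grad_i, w0, n, T, L, mu, rho=1.0, beta=1.0, seed=0):
    """Stochastic Nesterov acceleration (Eqs. (4)-(5)) with exponentially
    decreasing step-sizes alpha_k = (beta/T)^(k/T) and gamma_k = 1/(rho*L)."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)
    r = lambda k: np.sqrt(mu / (rho * L)) * (beta / T) ** (k / (2.0 * T))
    w_prev = w = np.array(w0, dtype=float)
    for k in range(1, T + 1):
        # extrapolation weight b_k (our reconstruction of the schedule)
        b = (1 - r(k - 1)) * r(k - 1) * alpha / (r(k) + r(k - 1) ** 2 * alpha)
        y = w + b * (w - w_prev)                                   # Eq. (4)
        i = rng.integers(n)
        w_prev, w = w, y - (alpha**k / (rho * L)) * grad_i(y, i)   # Eq. (5)
    return w
```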
²The growth condition in Bottou et al. (2018, Eq. (4.8)) is $\mathbb{E}_i\|\nabla f(w) - \nabla f_i(w)\|^2 \le M_V\,\|\nabla f(w)\|^2 + M$ for some constants $M$ and $M_V$. Since the stochastic gradient is unbiased, this is the same as Eq. (6) with $\rho = M_V + 1$ and $\sigma^2 = M$.

³Note that the noise-model in Section 3 is equivalent (up to constants) to assuming that the variance of the stochastic gradients at the optimum is bounded. However, the noise-model in Eq. (6) is equivalent to assuming that the variance of the stochastic gradients at any iterate is bounded. Hence, the two noise-models are similar, though Eq. (6) makes a stronger assumption on the noise.

The above theorem implies that ASGD achieves an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa\rho}}\right) + \frac{\sigma^2}{T}\right)$ convergence rate. This improves over the non-accelerated $O\left(\exp\left(-\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ noise-adaptive rate obtained in Theorem 1 and in Stich (2019); Khaled and Richtárik (2020); Li et al. (2020).⁴ In the fully-deterministic setting ($\rho = 1$ and $\sigma = 0$), Theorem 6 implies an $O(\exp(-T/\sqrt{\kappa}))$ convergence to the minimizer, matching the optimal rate in the deterministic setting (Nesterov, 2004). Under the strong-growth condition (when $\sigma = 0$), ASGD improves (by a $\sqrt{\rho}$ factor) over the rate in Vaswani et al. (2019a) and matches (up to log factors) the rate in Mishkin (2020). In the general stochastic case, when $\sigma \ne 0$, Cohen et al. (2018); Vaswani et al. (2019a) use constant step-sizes ($\alpha_k = 1$) and prove convergence to a neighbourhood of the solution, whereas we show convergence to the minimizer at a rate governed by the $O(\sigma^2/T)$ term. To smoothly interpolate between the stochastic (batch size equal to 1) and fully deterministic (batch size equal to $n$) setting, we generalize the growth condition to show an explicit dependence on the batch size (Appendix B).

Comparing our result to that in Aybat et al. (2019), we note that they prove the accelerated noise-adaptive rate under two different assumptions on the noise: (i) bounded variance of the stochastic gradients, which is a special case of the growth condition in Eq. (6) with $\rho = 1$; and (ii) a weaker assumption which is equivalent to Eq. (6) up to constants (see Appendix G for a detailed comparison of the two assumptions and the resulting upper-bounds). Using a dynamical systems perspective, Aybat et al. (2019, Corollary 3.8) prove the desired rate only for $T \ge 2\sqrt{\kappa}$, whereas our proof uses the more standard Nesterov estimate sequences, and our result holds for all $T$.

The result in Theorem 6 requires the knowledge of both $\mu$ and $L$ and is thus not problem-adaptive. In the next section, we analyze the convergence of ASGD when it is used with offline estimates of $L$ and $\mu$.

4.1. Offline estimation of unknown smoothness & strong-convexity

Similar to Section 3.3, for simplicity, we will assume that $\gamma_k$ is set using an offline estimate $\hat{L}$ of the smoothness, where, without loss of generality, $\frac{1}{\hat{L}} = \frac{\nu_L}{L}$ for some $\nu_L > 0$. Similarly, we use $\hat{\mu}$ as the offline estimate of the strong-convexity, and assume that $\hat{\mu} = \nu_\mu\mu$. We will only consider the case where we underestimate $\mu$, and hence $\nu_\mu \le 1$. This is the typical case in practice.

⁴Though we typically expect $\rho < \kappa$, in the worst-case stochastic setting, $\rho$ can be as large as $\kappa$, and we cannot obtain an accelerated rate (with a dependence on $\sqrt{\kappa}$) in this case (Liu and Belkin, 2018). Mini-batching (see Appendix B.2) is one way to decrease the effective $\rho$. For a large enough batch-size, $\rho < \kappa$, ensuring a $\sqrt{\kappa}$ dependence in the stochastic setting.
For example, while optimizing regularized convex loss functions in supervised learning (see Section 5 for empirical results), $\hat{\mu}$ is set to the regularization strength and thus underestimates the true strong-convexity parameter. The following theorem (proved in Appendix E.4.1) analyzes the effect of misspecifying $L$ and $\mu$ on the ASGD convergence.

Theorem 7. Under the same assumptions as Theorem 6 and (iv) $\nu := \nu_L\nu_\mu \le \rho\kappa$, ASGD (Eqs. (4) and (5)) with $w_1 = y_1$, $\gamma_k = \frac{1}{\rho\hat{L}} = \frac{\nu_L}{\rho L}$, $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, $\hat{\mu} = \nu_\mu\mu \le \mu$, $r_k = \sqrt{\frac{\hat{\mu}}{\rho\hat{L}}}\left(\frac{\beta}{T}\right)^{k/2T} = \sqrt{\frac{\nu}{\rho\kappa}}\left(\frac{\beta}{T}\right)^{k/2T}$, and $b_k = \frac{(1 - r_{k-1})\,r_{k-1}\,\alpha}{r_k + r_{k-1}^2\,\alpha}$ converges as,

$$\Delta_{T+1} \le 2 c_3 \exp\left(-\sqrt{\min\{\nu, 1\}}\,\frac{T}{\sqrt{\kappa\rho}}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \frac{2 c_3 (\ln(T/\beta))^2}{e^2\,\alpha^2\,T}\left[\frac{\sigma^2}{\rho\mu} + G^2\,\min\left\{\frac{k_0}{T}, 1\right\}\right]\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\},$$

where $\Delta_k := \mathbb{E}[f(w_k) - f^*]$, $c_3 = \exp\left(\frac{1}{\sqrt{\rho\kappa}}\,\frac{2\beta}{\ln(T/\beta)}\right)$, $[x]_+ = \max\{x, 0\}$, $k_0 := \frac{T\,[\ln(\nu_L)]_+}{\ln(T/\beta)}$, and $G = \max_{j\in[k_0]}\|\nabla f(y_j)\|$.

The above theorem implies an $O\left(\exp\left(-\sqrt{\frac{\min\{\nu, 1\}}{\kappa\rho}}\,T\right) + \left[\frac{\sigma^2 + G^2\,[\ln(\nu_L)]_+}{T}\right]\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\}\right)$ convergence to the minimizer. Observe that (i) when the problem-dependent parameters are known ($\nu_\mu = \nu_L = 1$), we recover the rate of Theorem 6; (ii) if $\nu_\mu = 1$ and we misestimate $L$, then, similar to SGD (Theorem 4), ASGD converges to the minimizer at an $O(1/T)$ rate, even in the deterministic setting (when $\sigma = 0$); (iii) if $\nu_L = 1$, underestimating $\mu$ matches the rate in Theorem 6 up to (potentially large) constants, resulting in linear convergence when $\sigma = 0$; and (iv) compared to Theorem 6, the decrease in the bias term is slowed down by an $O\left(\exp\left(\sqrt{\min\{\nu_L\nu_\mu, 1\}}\right)\right)$ factor, whereas the decrease in the variance is slowed by an $O\left(\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\}\right)$ factor.

In the next section, we design an SLS variant that ensures convergence to the minimizer while empirically controlling the misestimation for both SGD and ASGD.

5. Experiments

For comparing different step-size choices, we consider two common supervised learning losses: the squared loss for regression tasks and the logistic loss for classification.⁵ With a linear model and an $\ell_2$ regularization equal to $\frac{\lambda}{2}\|w\|^2$, both objectives are strongly-convex. We use three standard datasets from LIBSVM (Chang and Lin, 2011), namely mushrooms, ijcnn and rcv1, and use $\lambda = 0.01$. For each experiment, we consider 5 independent runs and plot the average result and standard deviation. We use the (full) gradient norm as the performance measure and plot it against the number of gradient evaluations.

⁵The code to reproduce our experiments is available here: https://github.com/R3za/expsls
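For concreteness, the following sketch sets up the regularized logistic objective and the theoretical smoothness upper-bound used below; it is our illustration (the released code linked in the footnote is the authoritative implementation), and the data-loading step is assumed.

```python
import numpy as np

def logistic_setup(X, y, lam=0.01):
    """lambda-regularized logistic loss for labels y_i in {-1, +1}.
    Returns a per-example gradient oracle and the smoothness upper-bound
    L = max_i ||x_i||^2 / 4 + lambda used in Section 5."""
    def grad_i(w, i):
        margin = -y[i] * (X[i] @ w)
        s = 1.0 / (1.0 + np.exp(-margin))     # sigmoid of the negative margin
        return -y[i] * s * X[i] + lam * w     # gradient of f_i at w
    L = np.max(np.sum(X**2, axis=1)) / 4.0 + lam
    mu_hat = lam                              # underestimate of mu, as in Theorem 7
    return grad_i, L, mu_hat
```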
[Figure 1: full gradient norm (log scale) versus iteration on the three datasets, for (a) the squared loss and (b) the logistic loss, comparing K-EXP, SLS-EXP, K-CNST, M-ASG, ACC-K-EXP, ACC-SLS-EXP, ACC-K-CNST and KR-20.]

Figure 1. Comparison for (a) squared loss and (b) logistic loss. Observe that exponentially decreasing step-sizes (i) result in more stable performance compared to using a constant step-size (for both SGD and ASGD), (ii) consistently outperform the noise-adaptive methods KR-20 and M-ASG, and (iii) methods using the SLS in Eq. (7) match the performance of those with known smoothness.

For each dataset, we fix $T = 10n$, use a batch-size of 1, and compare the performance of the following optimization strategies: (i) the noise-adaptive constant-and-then-decaying step-size scheme in Khaled and Richtárik (2020, Theorem 3) (denoted KR-20 in the plots); specifically, for $b = \max\left\{\frac{2L^2}{\mu}, 2\rho L\right\}$, we use a constant step-size equal to $1/b$ when $T < b/\mu$ or $k < \lceil T/2\rceil$, and otherwise we set the step-size at iteration $k$ to be $\frac{2}{\mu\left((2b/\mu) + k - \lceil T/2\rceil\right)}$; (ii) constant step-size SGD with $\gamma_k = \frac{1}{L}$ and $\alpha_k = 1$ for all $k$ (denoted K-CNST); (iii) SGD with an exponentially decreasing step-size and knowledge of the smoothness (Li et al., 2020), i.e. $\gamma_k = \frac{1}{L}$ and $\alpha_k = \alpha^k$ for $\alpha = \left(\frac{\beta}{T}\right)^{1/T}$ (denoted K-EXP); (iv) accelerated SGD (ASGD) with a constant step-size ($\alpha_k = 1$ for all $k$) (Vaswani et al., 2019a; Cohen et al., 2018) (denoted ACC-K-CNST); (v) ASGD with exponentially decreasing step-sizes (Section 4), denoted ACC-K-EXP; and (vi) multistage ASGD (M-ASG) from Aybat et al. (2019) with parameters as in their Corollary 3.8; specifically, we ensure that $T \ge 2\sqrt{\kappa}$ and set $T_1 = T/C$, $T_k \ge 2^k\left\lceil\sqrt{\kappa}\,\log(2(p+2))\right\rceil$, $\alpha_1 = 1/L$ and $\alpha_k = 1/(2^{2k}L)$, where $p$ and $C$ are hyper-parameters.

None of the above strategies is problem-adaptive, and all of them require the knowledge of the smoothness constant $L$. Additionally, KR-20 and the ASGD variants also require knowledge of $\rho$, the parameter of the growth condition in Eq. (6), while the ASGD variants and M-ASG require knowledge of $\mu$. If $x_i$ is the feature vector corresponding to example $i$, then we obtain theoretical upper-bounds on the smoothness and set $L = \max_i \|x_i\|^2 + \lambda$ for the squared loss and $L = \max_i \frac{1}{4}\|x_i\|^2 + \lambda$ for the logistic loss. Similarly, we set $\hat{\mu} = \lambda$ for both the squared and logistic loss. Note that this underestimates the true strong-convexity parameter and is in line with Theorem 7. To set $\rho$, we use a grid search over $\{10, 100, 1000\}$. Similarly, to set $p$ and $C$ for M-ASG, we use a grid search over $\{1, 2, 4\}$ and $\{2, 10, 100\}$ respectively. For each method, we plot the variant that results in the smallest gradient norm.

Using a stochastic line-search (SLS) to estimate $L$ can result in convergence to a neighbourhood (Section 3.2) because of the correlations between $i_k$ and $\gamma_k$. To alleviate this, and still be problem-adaptive, we design a decorrelated, conservative variant of SLS: at iteration $k$ of SGD, we set $\gamma_k$ using a stochastic line-search on the previously sampled function $i_{k-1}$ (we can use a randomly sampled $j_k$ as well). This ensures that there is no correlation between $i_k$ and the computation of $\gamma_k$. The overall procedure can be described as follows: starting from $\gamma_{k-1}$ (the conservative aspect), with $\gamma_0 = \gamma_{\max}$, find the largest step-size $\gamma_k$ that satisfies, for a random or previously sampled index $j_k$,

$$f_{j_k}\left(w_k - \gamma_k\nabla f_{j_k}(w_k)\right) \le f_{j_k}(w_k) - c\,\gamma_k\left\|\nabla f_{j_k}(w_k)\right\|^2, \tag{7}$$

and update $w_k$ according to Eq. (2); a code sketch of this procedure is given below. The above procedure with $c = 1/2$ ensures that $\gamma_k \in \left[\min\{\gamma_{k-1}, 1/L\}, \gamma_{k-1}\right]$. Since there is no correlation between $\gamma_k$ and $i_k$, we can treat $\gamma_k$ as an offline estimate of the smoothness, meaning that $\gamma_k = \nu_k/L$ for some $\nu_k > 0$. Moreover, since we are using a conservative line-search, $\gamma_k \in \left[\min\{\gamma_{k-1}, 1/L\}, \gamma_{k-1}\right]$, meaning that $\nu_k \le \nu_{k-1} \le \ldots \le \nu_1$. Hence the maximum misspecification in the smoothness is given by $\nu_1 > 1$, which is governed by the line-search in the first iteration. Given this, we can use a similar analysis as Theorem 4, upper-bounding $\nu_k$ by $\nu_1$ for each $k$ and obtaining the corresponding result in terms of $\nu_1$.
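The following is a minimal sketch of one SGD iteration with this decorrelated, conservative SLS; the backtracking factor and function names are our assumptions.

```python
def conservative_sls_step(f_i, grad_i, w, j_k, gamma_prev, c=0.5, backtrack=0.8):
    """Largest gamma <= gamma_prev satisfying Eq. (7) on index j_k, which is the
    previously sampled (or a freshly drawn) index, *not* the index i_k used for
    the update. Starting from gamma_prev makes the search conservative."""
    g = grad_i(w, j_k)
    gamma = gamma_prev
    while f_i(w - gamma * g, j_k) > f_i(w, j_k) - c * gamma * (g @ g):
        gamma *= backtrack
    return gamma

# inside the SGD loop (sketch): j_k = i_prev; i_k drawn fresh
# gamma = conservative_sls_step(f_i, grad_i, w, i_prev, gamma)
# w = w - gamma * alpha**k * grad_i(w, i_k)
# i_prev = i_k
```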
We use this variant of SLS with exponentially decreasing step-sizes for both SGD and ASGD, and denote the resulting variants as SLS-EXP and ACC-SLS-EXP respectively. We emphasize that this strategy is both noise-adaptive and problem-adaptive. From Fig. 1, we observe that exponentially decreasing step-sizes (i) result in more stable performance compared to the constant step-size variants (for both SGD and ASGD) and (ii) consistently outperform the noise-adaptive methods KR-20 and M-ASG. We also observe that (iii) the methods using the SLS condition in Eq. (7) (SLS-EXP and ACC-SLS-EXP) consistently match the performance of those with known smoothness (K-EXP and ACC-K-EXP).

6. Conclusion

We used exponentially decreasing step-sizes to make SGD noise-adaptive and considered two strategies for problem-adaptivity. Using upper and lower-bounds, we quantified the price of problem-adaptivity: estimating the smoothness in an online fashion results in convergence to a neighbourhood of the solution, while an offline estimation results in a slower convergence to the minimizer. We then developed an accelerated variant of SGD (ASGD) and proved that it achieves the near-optimal convergence rate. We analyzed the effect of misspecifying the strong-convexity and smoothness parameters for ASGD. Finally, we empirically demonstrated the effectiveness of (A)SGD with exponential step-sizes coupled with a novel variant of SLS.

7. Acknowledgements

We would like to thank Si Yi Meng, Yifan Sun and Frederik Kunstner for helpful feedback on the paper. Benjamin Dubois-Taine would like to acknowledge support from the European Research Council (grant SEQUOIA 724063) and funding by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

References

Arjevani, Y., Shamir, O., and Srebro, N. (2020). A tight convergence analysis for stochastic gradient descent with delayed updates. In Algorithmic Learning Theory, pages 111-132. PMLR.

Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics.

Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2019). A universally optimal multistage accelerated stochastic gradient method. Advances in Neural Information Processing Systems, 32:8525-8536.

Bengio, Y. (2015). RMSProp and equilibrated adaptive learning rates for nonconvex optimization. CoRR, abs/1502.04390.

Berrada, L., Zisserman, A., and Kumar, M. P. (2020). Training neural networks for and by interpolation. In International Conference on Machine Learning, pages 799-809. PMLR.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, M., Diakonikolas, J., and Orecchia, L. (2018). On acceleration with noise-corrupted gradients. In International Conference on Machine Learning, pages 1019-1028. PMLR.

Duchi, J. C., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159.

Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework.
SIAM Journal on Optimization, 22(4):1469-1492.

Ghadimi, S. and Lan, G. (2013). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061-2089.

Gower, R., Sebbouh, O., and Loizou, N. (2021). SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315-1323. PMLR.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. (2019). SGD: General analysis and improved rates. In ICML.

Hinder, O., Sidford, A., and Sohoni, N. (2020). Near-optimal methods for minimizing star-convex functions and beyond. In Conference on Learning Theory, pages 1894-1938. PMLR.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. (2018). Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, pages 545-604. PMLR.

Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer.

Khaled, A. and Richtárik, P. (2020). Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329.

Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698-2707. PMLR.

Kulunchakov, A. and Mairal, J. (2019). Estimate sequences for variance-reduced stochastic composite optimization. In International Conference on Machine Learning, pages 3541-3550. PMLR.

Levy, K. Y., Yurtsever, A., and Cevher, V. (2018). Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, NeurIPS.

Li, X. and Orabona, F. (2019). On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS.

Li, X., Zhuang, Z., and Orabona, F. (2020). A second look at exponential and cosine step sizes: Simplicity, convergence, and performance. arXiv preprint arXiv:2002.05273.

Liu, C. and Belkin, M. (2018). Accelerating SGD with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395.

Lohr, S. L. (2019). Sampling: Design and Analysis. Chapman and Hall/CRC.

Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306-1314. PMLR.

Lucas, J., Bae, J., Zhang, M. R., Fort, S., Zemel, R., and Grosse, R. (2021). Analyzing monotonic linear interpolation in neural network loss landscapes. arXiv preprint arXiv:2104.11044.

Mishkin, A. (2020). Interpolation, growth conditions, and stochastic gradient descent. PhD thesis, University of British Columbia.

Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24:451-459.

Nemirovski, A. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley Interscience.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course.
Springer Science & Business Media.

Nguyen, P. H., Nguyen, L. M., and van Dijk, M. (2018). Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in SGD. arXiv preprint arXiv:1810.04723.

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Science & Business Media.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407.

Schmidt, M. and Roux, N. L. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370.

Stich, S. U. (2019). Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232.

Vaswani, S., Bach, F., and Schmidt, M. (2019a). Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195-1204. PMLR.

Vaswani, S., Laradji, I. H., Kunstner, F., Meng, S. Y., Schmidt, M., and Lacoste-Julien, S. (2020). Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search).

Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., and Lacoste-Julien, S. (2019b). Painless stochastic gradient: Interpolation, line-search, and convergence rates. Advances in Neural Information Processing Systems, 32:3732-3745.

Zhang, L. and Zhou, Z.-H. (2019). Stochastic approximation of smooth and strongly convex functions: Beyond the O(1/t) convergence rate. In Conference on Learning Theory, pages 3160-3179. PMLR.

Organization of the Appendix

A. Definitions
B. Additional theoretical results
C. Upper-bound Proofs for Section 3
D. Lower-bound Proofs for Section 3
E. Proofs for Section 4
F. Helper Lemmas
G. Comparison to Aybat et al. (2019)

A. Definitions

Our main assumptions are that each individual function $f_i$ is differentiable, has a finite minimum $f_i^*$, and is $L_i$-smooth, meaning that for all $v$ and $w$,

$$f_i(v) \le f_i(w) + \langle\nabla f_i(w), v - w\rangle + \frac{L_i}{2}\|v - w\|^2, \tag{Individual Smoothness}$$

which also implies that $f$ is $L$-smooth, where $L$ is the maximum smoothness constant of the individual functions. A consequence of smoothness is the following bound on the norm of the stochastic gradients,

$$\|\nabla f_i(w)\|^2 \le 2L\left(f_i(w) - f_i^*\right).$$

We also assume that each $f_i$ is convex, meaning that for all $v$ and $w$,

$$f_i(v) \ge f_i(w) + \langle\nabla f_i(w), v - w\rangle. \tag{Convexity}$$

Depending on the setting, we will also assume that $f$ is $\mu$ strongly-convex, meaning that for all $v$ and $w$,

$$f(v) \ge f(w) + \langle\nabla f(w), v - w\rangle + \frac{\mu}{2}\|v - w\|^2. \tag{Strong Convexity}$$

B. Additional theoretical results

In this section, we relax the strong-convexity assumption to handle broader function classes in Appendix B.1, prove results that give an explicit dependence on the mini-batch size (Appendix B.2), and show in Appendix B.3 that polynomially decreasing step-sizes cannot obtain the desired noise-adaptive rate.

B.1. Relaxing the assumptions

In this section, we extend our theoretical results to a richer class of functions: strongly quasar-convex functions (Hinder et al., 2020) in Appendix B.1.1, and (non-strongly) convex functions in Appendix B.1.2.

B.1.1. Extension to strongly star-convex functions

We consider the class of smooth, non-convex, but strongly star-convex functions (Hinder et al., 2020; Gower et al., 2021), a subset of strongly quasar-convex functions.
Quasar-convex functions are unimodal along lines that pass through a global minimizer, i.e. the function monotonically decreases along the line to the minimizer and monotonically increases thereafter. In addition to this, strongly quasar-convex functions also have curvature near the global minimizer. Importantly, this property is satisfied for neural networks with common architectures and learning problems (Lucas et al., 2021; Kleinberg et al., 2018). Formally, a function is $(\zeta, \mu)$ strongly quasar-convex if it satisfies the following for all $w$ and minimizers $w^*$,

$$f(w^*) \ge f(w) + \frac{1}{\zeta}\langle\nabla f(w), w^* - w\rangle + \frac{\mu}{2}\|w^* - w\|^2. \tag{8}$$

Strongly star-convex functions are a subset of this class of functions with $\zeta = 1$. If $L$ is known, it is straightforward to show that the results of Theorem 1 carry over to strongly star-convex functions, and we obtain the similar $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ rate. In the case when $L$ is not known, it was recently shown that SGD with a stochastic Polyak step-size (Gower et al., 2021) results in linear convergence to the minimizer on strongly star-convex functions under interpolation and achieves an $O\left(\exp(-T/\kappa) + \gamma_{\max}\sigma^2\right)$ convergence rate in general. The proposed stochastic Polyak step-size (SPS) does not require knowledge of $L$ and matches the rate achieved for strongly-convex functions (Loizou et al., 2021). However, SPS requires knowledge of $f_i^*$, which is usually zero for machine learning models under interpolation but difficult to get a handle on in the general case. Consequently, we continue to use SLS to estimate the smoothness constant. Our proofs only use strong-convexity between $w$ and a minimizer $w^*$, and hence we can extend all our results from strongly-convex functions to structured non-convex functions satisfying the strong star-convexity property, matching the rates in Theorem 2 and Theorem 4. Finally, we note that given knowledge of $\zeta$, there is no fundamental limitation in extending all our results to strongly quasar-convex functions. In the next section, we relax the strong-convexity assumption in a different way, by considering convex functions without curvature.

B.1.2. Handling (non-strongly)-convex functions

In this section, we analyze the behaviour of exponentially decreasing step-sizes on convex functions (without strong-convexity). As a starting point, we assume that $L$ is known, and the algorithm is only required to adapt to the noise $\sigma^2$. In the following theorem (proved in Appendix C.4), we show that SGD with an exponentially decreasing step-size is not guaranteed to converge to the minimizer, but to a neighbourhood of the solution.

Theorem 8. Assuming (i) convexity and (ii) $L_i$-smoothness of each $f_i$, SGD with step-size $\eta_k = \frac{\alpha_k}{2L}$ has the following convergence rate,

$$\mathbb{E}\left[f(\bar{w}_{T+1}) - f(w^*)\right] \le \frac{2L\|w_1 - w^*\|^2}{\sum_{k=1}^{T}\alpha_k} + \frac{\sigma^2\sum_{k=1}^{T}\alpha_k^2}{\sum_{k=1}^{T}\alpha_k}, \tag{9}$$

where $\bar{w}_{T+1} = \frac{\sum_{k=1}^{T}\alpha_k w_k}{\sum_{k=1}^{T}\alpha_k}$. For $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, the convergence rate is given by,

$$\mathbb{E}\left[f(\bar{w}_{T+1}) - f(w^*)\right] \le \frac{2L\ln(T/\beta)\,\|w_1 - w^*\|^2}{\alpha T - 2\beta} + \frac{\sigma^2\left(T - \frac{\beta^2}{T}\right)}{2\left(\alpha T - 2\beta\right)}.$$

We thus see that even with the knowledge of $L$, SGD converges to a neighbourhood of the solution at an $O(1/T)$ rate. We contrast our result with AdaGrad (Duchi et al., 2011; Levy et al., 2018), which adapts the step-sizes as the algorithm progresses (as opposed to using a predetermined sequence of step-sizes as in our case), is able to adapt to the noise, and achieves an $O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right)$ rate.
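The averaged iterate $\bar{w}_{T+1}$ in Theorem 8 is cheap to maintain online; the following sketch (ours) accumulates the $\alpha_k$-weighted average while running SGD with $\eta_k = \alpha_k/(2L)$.

```python
import numpy as np

def sgd_convex_averaged(grad_i, w0, n, T, L, beta=1.0, seed=0):
    """SGD with eta_k = alpha_k/(2L); returns the weighted average iterate
    bar{w}_{T+1} = sum_k alpha_k w_k / sum_k alpha_k from Theorem 8."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)
    w = np.array(w0, dtype=float)
    w_bar, weight = np.zeros_like(w), 0.0
    for k in range(1, T + 1):
        a_k = alpha**k
        w_bar += a_k * w            # accumulate alpha_k * w_k (pre-update iterate)
        weight += a_k
        w = w - (a_k / (2 * L)) * grad_i(w, rng.integers(n))
    return w_bar / weight
```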
In order to be noise-adaptive and match the AdaGrad rate, we can use Eq. (9) to infer that a sufficient condition is for the $\alpha_k$-sequence to satisfy the following inequalities: (i) $\sum_{k=1}^{T}\alpha_k \ge C_1\,T$ and (ii) $\sum_{k=1}^{T}\alpha_k^2 \le C_2\sqrt{T}$, where $C_1, C_2$ are constants. Unfortunately, in Lemmas 9 and 10, we prove that it is not possible for any polynomially or exponentially-decreasing sequence to satisfy these sufficient conditions. While we do not have a formal lower-bound in the convex case, it seems unlikely that these $\alpha_k$-sequences can result in the desired rate, and we conjecture a possible lower-bound. Finally, we note that, to the best of our knowledge, the only predetermined (non-adaptive) step-size that achieves the AdaGrad rate is $\min\left\{\frac{1}{2L}, \frac{1}{\sigma\sqrt{T}}\right\}$ (Ghadimi and Lan, 2012). We also conjecture a lower-bound showing that there is no predetermined sequence of step-sizes (that does not use knowledge of $\sigma^2$) that is noise-adaptive and can achieve the $O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right)$ rate.

B.2. Dependence on the mini-batch size

In this section, we prove two results in order to explicitly model the dependence on the mini-batch size. We denote a mini-batch as $B$, its size as $B \in [1, n]$, and the corresponding mini-batch gradient as $\nabla f_B(w) = \frac{1}{B}\sum_{i \in B}\nabla f_i(w)$. The mini-batch gradient is also unbiased, i.e. $\mathbb{E}_B[\nabla f_B(w)] = \nabla f(w)$, implying that all the proofs remain unchanged, but we need to use a different growth condition for the ASGD proofs in Section 4 and a different definition of $\sigma$ for the SGD proofs in Section 3. We refine these quantities here and show the explicit dependence on the mini-batch size.

Lemma 1. If $\mathbb{E}_i\|\nabla f_i(w)\|^2 \le \rho\,\|\nabla f(w)\|^2 + \sigma^2$, then

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 \le \left((\rho - 1)\,\frac{n - B}{(n-1)\,B} + 1\right)\|\nabla f(w)\|^2 + \frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Proof.

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 = \mathbb{E}_B\|\nabla f_B(w) - \nabla f(w) + \nabla f(w)\|^2 = \mathbb{E}_B\|\nabla f_B(w) - \nabla f(w)\|^2 + \|\nabla f(w)\|^2,$$

since $\mathbb{E}_B[\nabla f_B(w)] = \nabla f(w)$. Since we are sampling the batch without replacement, using (Lohr, 2019),

$$\mathbb{E}_B\|\nabla f_B(w) - \nabla f(w)\|^2 = \frac{n - B}{(n-1)\,B}\left(\mathbb{E}_i\|\nabla f_i(w)\|^2 - \|\nabla f(w)\|^2\right) \le \frac{n - B}{(n-1)\,B}\left((\rho - 1)\|\nabla f(w)\|^2 + \sigma^2\right),$$

using the growth condition. Combining the two relations,

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 \le \left((\rho - 1)\,\frac{n - B}{(n-1)\,B} + 1\right)\|\nabla f(w)\|^2 + \frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Lemma 2. If $\sigma^2 := \mathbb{E}_i[f_i(w^*) - f_i^*]$ and each function $f_i$ is $\mu$ strongly-convex and $L$-smooth, then

$$\sigma_B^2 := \mathbb{E}_B\left[f_B(w^*) - f_B^*\right] \le \frac{L}{\mu}\,\frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Proof. By strong-convexity of $f_B$,

$$\mathbb{E}_B\left[f_B(w^*) - f_B^*\right] \le \frac{1}{2\mu}\,\mathbb{E}_B\|\nabla f_B(w^*)\|^2.$$

Since we are sampling the batch without replacement, using (Lohr, 2019),

$$\mathbb{E}_B\|\nabla f_B(w^*)\|^2 = \frac{n - B}{(n-1)\,B}\,\mathbb{E}_i\|\nabla f_i(w^*)\|^2 \le \frac{n - B}{(n-1)\,B}\,2L\,\mathbb{E}_i\left[f_i(w^*) - f_i^*\right],$$

by smoothness of $f_i$, which completes the proof.
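As a quick numerical sanity check of the $(n-B)/((n-1)B)$ factor (our script, with random vectors standing in for the per-example gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B, trials = 50, 3, 10, 20_000
G = rng.standard_normal((n, d))             # stand-ins for gradients grad f_i(w)
g = G.mean(axis=0)                          # full gradient grad f(w)
per_example = np.mean(np.sum((G - g) ** 2, axis=1))
batch = [np.sum((G[rng.choice(n, B, replace=False)].mean(axis=0) - g) ** 2)
         for _ in range(trials)]
# Without-replacement sampling: E_B ||grad f_B - grad f||^2 equals
# (n - B)/((n - 1) B) times the per-example variance; the two printed
# values should agree up to Monte Carlo error.
print(np.mean(batch), (n - B) / ((n - 1) * B) * per_example)
```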
The GD iteration with ηk = 1 L 1 (k+1)δ reads wk+1 = wk 1 L 1 (k + 1)δ x2wk xy = wk 1 1 (k + 1)δ x 1 (k + 1)δ = wk 1 1 (k + 1)δ + w 1 (k + 1)δ wk+1 w = (wk w ) 1 1 (k + 1)δ w T +1 w = (w1 w ) 1 1 (k + 1)δ w T +1 w = (w1 w ) k k + 1 = (w1 w ) 1 T + 1 If 0 < δ < 1 and w1 w > 0, w T +1 w = (w1 w ) 1 1 (k + 1)δ 1 2 2(k + 1)δ We wish to use the inequality 1 2x 2 2 2x which is true for all x [0, 1/2]. In our case it holds for 1 (k + 1)δ 1 Let k0 = 21/δ . Then for T k0, w T +1 w = (w1 w ) 1 1 (k + 1)δ 1 2 2(k + 1)δ Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now, for k k0 1, we have that 1 (k+1)δ 1 2δ and thus 1 1 (k + 1)δ For k k0, we have 1 2 2(k+1)δ 2 2 1 (k+1)δ and thus 1 2 2(k + 1)δ 2 2 PT k=k0 1 (k+1)δ = 2 2 PT +1 k=1 1 kδ Pk0 k=1 1 kδ 2 2 PT +1 k=1 1 kδ Using the bound in the proof of Lemma 9, we have 1 kδ 1 + 1 1 δ (T + 1)1 δ 1 Putting this together we have that 2 2 PT +1 k=1 1 kδ 2 2(1+ 1 1 δ((T +1)1 δ 1)) = 41/(1 δ) 4 4 (T +1)1 δ 1 δ = 4 2δ 1 1 δ 4 (T +1)1 δ Putting everything together we get that w T +1 w (w1 w ) 1 1 21/δ 1 4 2δ 1 1 δ 4 (T +1)1 δ The next lemma shows that when δ = 0, namely when the step-size is constant, SGD applied to the sum of two quadratics fails to converge to the minimizer. Lemma 4. When using SGD to minimize the sum f(w) = f1(w)+f2(w) 2 of two one-dimensional quadratics: f1(w) = 1 2(w 1)2 and f2(w) = 1 2(2w + 1/2)2 with a constant step-size η = 1 L, the following holds: whenever |wk w | < 1/8, the next iterate satisfies |wk+1 w | > 1/8. Proof. First observe that w = 0 and that L = 4. The updates then read If ik = 1 : wk+1 = wk η(wk 1) = wk(1 1 If ik = 2 : wk+1 = wk η2(2wk + 1 2) = wk(1 4 Suppose that |wk w | =|wk| < 1/8. We want to show that |wk+1| > 1/8. We can separate the analyses in three cases. If wk ( 1/8, 0) and ik = 1 then If wk (0, 1/8) and ik = 1 then If ik = 2 then Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD implying that in each case, |wk+1| > 1/8. C. Upper-bound Proofs for Section 3 C.1. Proof of Theorem 1 Theorem 1. Assuming (i) convexity and Li-smoothness of each fi, (ii) µ strong-convexity of f, SGD (Eq. (2)) with γk = 1 L, αk = β T k/T converges as, E w T +1 w 2 w1 w 2 c2 exp T κ α ln(T/β) µe2 (ln(T/β))2 where c2 = exp 1 κ 2β ln(T/β) . wk+1 w 2 = wk ηk fik(wk) w 2 = wk w 2 2ηk fik(wk), wk w + η2 k fik(wk) 2 = wk w 2 2γkαk fik(wk), wk w + γ2 kα2 k fik(wk) 2 wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + γ2 kα2 k 2L[fik(wk) f ik] (Smoothness) Lαk fik(wk), wk w + 2 Lα2 k [fik(wk) fik(w )] + 2 Lα2 k [fik(w ) f ik] (Since γk = 1/L.) Taking expectation w.r.t ik, E wk+1 w 2 E wk w 2 2 Lαk f(wk), wk w + 2 Lα2 k [f(wk) f(w )] + 2 Lαk f(wk), wk w + 2 Lαk [f(wk) f(w )] + 2 Lα2 kσ2 (Since αk 1) E wk+1 w 2 1 µαk E wk w 2 + 2 Lα2 kσ2 (By µ-strong convexity of f) Unrolling the recursion starting from w1 and using the exponential step-sizes, E w T +1 w 2 w1 w 2 T Y i=k+1 α2k 1 µαi Writing k = E wk w 2 T +1 1 exp µ k=1 α2k exp µ | {z } :=Bt Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Using Lemma 5 to lower-bound A, we obtain A αT ln(T/β) 2β ln(T/β). The first term in the above expression can then be bounded as, LA = 1 c2 exp T κ α ln(T/β) where κ = L µ and c2 = exp 1 κ 2β ln(T/β) . Using Lemma 6 to upper-bound Bt, we obtain Bt 4κ2c2(ln(T/β))2 e2α2T , thus bounding the second term. Putting everything together, T +1 1 c2 exp T κ α ln(T/β) Le2 (ln(T/β))2 C.2. Proof of Theorem 2 Theorem 2. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with αk = β T k/T , γk as the largest step-size that satisfies γk γmax and Eq. 
(3) with c = 1/2 converges as, E w T +1 w 2 w1 w 2 c1 exp T κ α ln(T/β) + 8σ2c1(κ )2γmax e2 (ln(T/β))2 + 2σ2c1κ ln(T/β) γerr where γerr := γmax min γmax, 1 κ := max n L µ , 1 µγmax o , c1 = exp 1 κ 2β ln(T/β) . wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + γkα2 k fik(wk) f ik c (By Lemma 8) Setting c = 1/2, = wk w 2 2γkαk fik(wk), wk w + 2γkα2 k [fik(wk) f ik] = wk w 2 2γkαk fik(wk), wk w + 2γkα2 k [fik(wk) fik(w )] + 2γkα2 k [fik(w ) f ik] Adding, subtracting 2γkαk[fik(wk) fik(w )], = wk w 2 + 2γkαk [ fik(wk), wk w + [fik(wk) fik(w )]] 2γkαk[fik(wk) fik(w )] + 2γkα2 k [fik(wk) fik(w )] + 2γkα2 k [fik(w ) f ik] wk w 2 + 2γminαk [ fik(wk), wk w + [fik(wk) fik(w )]] 2γk(αk α2 k)[fik(wk) fik(w )] + 2γmaxα2 k [fik(w ) f ik] where we used convexity of fik to ensure that fik(wk), wk w + [fik(wk) fik(w )] 0. Taking expectation, E wk+1 w 2 wk w 2 + 2γminαk [ f(wk), wk w + [f(wk) f(w )]] Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD (αk α2 k)E [2γk[fik(wk) fik(w )]] + 2γmaxα2 kσ2 E wk w 2 (1 αkγminµ) wk w 2 (αk α2 k)E [2γk[fik(wk) fik(w )]] + 2γmaxα2 kσ2 Since αk 1, and αk α2 k 0, let us analyze E[[γk[fik(wk) fik(w )]]. E[[γk[fik(wk) fik(w )]] = E[[γk[fik(wk) f ik]] E[[γk[f ik fik(w )]] E[[γmin[fik(wk) f ik]] E[[γmax[f ik fik(w )]] (γk γmax) = E[[γmin[fik(wk) f ik]] + γmaxσ2 = E[[γmin[fik(wk) fik(w )]] E[[γmin[fik(w ) f ik]] + γmaxσ2 = γmin[f(wk) f(w )] γminσ2 + γmaxσ2 (γmax γmin)σ2 Putting this relation back, E wk w 2 (1 αkγminµ) wk w 2 + 2(αk α2 k) (γmax γmin)σ2 + 2γmaxα2 kσ2 (1 αkγminµ) wk w 2 + 2αk (γmax γmin)σ2 + 2γmaxα2 kσ2. Setting κ = max{ L µ , 1 µγmax } we get that 1 αkγminµ 1 1 κ . Writing k = E wk w 2 and unrolling the recursion we get 1 + 2γmaxσ2 T X k=1 α2k T Y κ αi + 2 σ2 T X k=1 αk(γmax γmin) + 2γmaxσ2 T X k=1 α2k exp 1 | {z } :=Bt + 2 σ2 (γmax γmin) k=1 αk exp 1 | {z } :=Ct Using Lemma 5 to lower-bound A, we obtain A αT ln(T/β) 2β ln(T/β). The first term in the above expression can then be bounded as, κ A 1 c1 exp T κ α ln(T/β) where c1 = exp 1 κ 2β ln(T/β) . Using Lemma 6 to upper-bound Bt, we obtain Bt 4(κ )2c1(ln(T/β))2 e2α2T , thus bounding the second term. Using Lemma 7 to upper-bound Ct, we obtain Ct c1 κ ln(T/β) eα , thus bounding the third term. Finally, by Lemma 8 we have that γmin min γmax, 1 Putting everything together, T +1 1 c1 exp T κ α ln(T/β) + 8σ2c1(κ )2γmax e2 (ln(T/β))2 α2T + 2c1σ2κ ln(T/β) γmax min γmax, 1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD C.3. Proof of Theorem 4 Theorem 4. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with αk = β T k/T , γk = ν L converges as, T +1 1 c2 exp min{ν, 1} T κ α ln(T/β) + max{ν2, 1}8c2κ ln(T/β) µ e2 α2 T 2σ2 ln(T/β) + G [ln(ν)]+ where c2 = exp 1 κ 2β ln(T/β) , [x]+ = max{x, 0}, k0 = T [ln(ν)]+ ln(T/β) , G = maxj [k0]{f(wj) f } and k := wk w 2. Proof. Following the steps from the proof of Theorem 1, wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + 2Lγ2 kα2 k [fik(wk) fik(w )] + 2Lγ2 kα2 k [fik(w ) f ik] Taking expectation wrt ik, and since both γk and αk are independent of ik, E wk+1 w 2 wk w 2 2γkαk f(wk), wk w + 2Lγ2 kα2 k [f(wk) f ] + 2Lγ2 kα2 k σ2 E wk+1 w 2 (1 µγkαk) wk w 2 + 2Lγ2 kα2 k σ2 + [f(wk) f ] (2Lγ2 kα2 k 2γkαk) (By strong convexity) Let us separately consider the ν 1 and ν > 1 case. For the ν 1 case, (2Lγ2 kα2 k 2γkαk) = 2ν2α2 k L 2ναk L = 0. 
Hence, the above equation can be simplified as: E wk+1 w 2 1 µναk wk w 2 + 2ν2α2 k L σ2 Proceeding in the same way as the proof of Theorem 1, define k = E wk w 2 and unroll the recursion, L αk) + 2ν2σ2 i=k+1 (1 µν Bounding the first term similar to Lemma 5, L αk) exp µν κ α ln (T/β) κ 2β ln (T/β) Bounding the second term similar to Lemma 6, i=k+1 (1 µν k=1 α2 k exp k=1 α2 k exp ν κ αk+1 αT +1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = exp ναT +1 k=1 α2 k exp ναk+1 = exp ναT +1 ν2e2α2 ln(T/β)2 κ 2β ln(T/β) ν2e2α2 ln(T/β)2 Putting everything together, we obtain that, T +1 1 exp νT κ α ln (T/β) κ 2β ln (T/β) κ 2β ln(T/β) e2α2 ln(T/β)2 = T +1 1c2 exp νT κ α ln (T/β) e2α2 ln(T/β)2 T (Since ν 1) For the ν > 1 case, E wk+1 w 2 1 µναk wk w 2 + 2ν2α2 k σ2 L + [f(wk) f ] (2Lγ2 kα2 k 2γkαk) wk w 2 + 2ν2α2 k σ2 L + [f(wk) f ] 2ν2α2 k L 2ναk (Since ν > 1) For the last term to be negative, we require αk 1 ν . By definition of αk, this will happen after k k0 := T ln(ν) ln(T/β) iterations. However, until k0 iterations, we observe that (2Lγ2 kα2 k 2γkαk) 2ν (ν 1) For the k < k0 regime, E wk+1 w 2 1 µαk wk w 2 + 2Lγ2 kα2 k σ2 + max j [k0]{f(wj) f } 2ν (ν 1) Writing k = E wk w 2, and unrolling the recursion for the first k0 iterations we get L σ2 + max j [k0]{f(wj) f }2ν(ν 1) L | {z } :=c5 Bounding the first term similar to Lemma 5, Bounding the second term similar to Lemma 6, k=1 α2 k exp Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD k=1 α2 k exp 1 k=1 α2 k exp αk+1 e2α2 k0 ln(T/β)2 Putting everything together, we obtain, + c5 exp αk0 e2α2 k0 ln(T/β)2 Now let us consider the regime k k0 where αk 1 ν , so that we have E wk+1 w 2 1 µαk wk w 2 + 2ν2σ2 Writing k = E wk w 2, and unrolling the recursion from k = k0 to T, Lαk) + 2ν2σ2 Bounding the first term similar to Lemma 5, L αk0 αT +1 Bounding the second term similar to Lemma 6, k=k0 α2 k exp k=k0 α2 k exp 1 κ αk+1 αT +1 = exp αT +1 k=k0 α2 k exp αk+1 = exp αT +1 e2α2 (T k0 + 1) e2α2 (T k0 + 1) ln(T/β)2 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Putting everything together, T +1 k0 exp µ L αk0 αT +1 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 Combining the above bounds for the two regimes, we get, L αk0 αT +1 + c5 exp µαk0 e2α2 k0 ln(T/β)2 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 2e2α2 k0 ln(T/β)2 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 Using Lemma 5 to bound the first term, and noting that αT +1 1 α 2β ln(T/β) T +1 1 c2 exp T κ α ln(T/β) e2α2 k0 ln(T/β)2 T 2 + 2ν2σ2 e2α2 (T k0 + 1) ln(T/β)2 where κ = L µ and c2 = exp 1 κ 2β ln(T/β) . Putting in the value of c5 and k0, and rearranging, we get T +1 1 c2 exp T κ α ln(T/β) L T 4c2κ2 ln(T/β)2 + max j [k0]{f(wj) f }2ν(ν 1) ν2e2α2 [ln(ν)]+ ln(T/β) Combining the statements from ν 1 and ν > 1 gives us the theorem statement. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD C.4. Proof of Theorem 8 Theorem 8. Assuming (i) convexity and (ii) Li-smoothness of each fi, SGD with step-size ηk = 1 2L αk has the following convergence rate, E[f( w T +1) f(w )] 2L w1 w 2 PT k=1 αk + σ2 PT k=1 α2 k PT k=1 αk (9) where w T +1 = PT k=1 αkwk PT k=1 αk . For αk = h β T ik/T , the convergence rate is given by, E[f( w T +1) f(w )] 2L ln(T/β) w1 w 2 αT 2β + σ2 T T β Proof. Following the proof of Theorem 1, wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + 2Lγ2 kα2 k [fik(wk) fik(w )] Lγ2 kα2 k [fik(wk) f ik] wk+1 w 2 wk w 2 αk L fik(wk), wk w + α2 k 2L [fik(wk) fik(w )] + α2 k 2L [fik(wk) f ik] (γk = 1 2L for all k.) 
L [fik(wk) fik(w )] + α2 k 2L [fik(wk) fik(w )] + α2 k 2L [fik(wk) f ik] (By convexity) Taking expectation, E wk+1 w 2 wk w 2 αk L [f(wk) f(w )] + α2 k 2L [f(wk) f(w )] + α2 k 2Lσ2 2L[f(wk) f(w )] + α2 k 2Lσ2 (Since f(wk) f(w ) 0 and αk 1) Rearranging and summing from k = 1 to T, k=1 αk[f(wk) f(w )] 2L w1 w 2 + σ2 T X By averaging and using Jensen. Denote w T +1 = PT k=1 αkwk PT k=1 αk , E[f( w T +1) f(w )] 2L w1 w 2 PT k=1 αk + σ2 PT k=1 α2 k PT k=1 αk Next, we bound PT k=1 αk and PT k=1 α2 k for the exponentially-decreasing αk sequence, when αk = h β T ik/T . From Lemma 5, we know that, k=1 αk αT ln(T/β) 2β ln(T/β). Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Bounding the ratio PT k=1 α2 k PT k=1 αk = PT k=1 α2k PT k=1 αk where α = h β T i1/T , PT k=1 α2k PT k=1 αk α2 1 α2 1 α α αT +1 = α 1 + α 1 1 αT 1 1 αT = T T β Putting everything together, E[f( w T +1) f(w )] 2L ln(T/β) w1 w 2 αT 2β + σ2 T T β C.5. Additional lemmas for upper-bound proofs t=1 αt αT ln(T/β) 2β ln(T/β) t=1 αt = α αT +1 1 α = α 1 α αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) (10) where in the inequality we used Lemma 16 and the fact that 1/α > 1. Plugging back into A we get, A α 1 α 2β ln(T/β) α ln(1/α) 2β ln(T/β) (1 x ln( 1 = αT ln(T/β) 2β ln(T/β) Lemma 6. For α = β T 1/T and any κ > 0, k=1 α2k exp 4κ2c2(ln(T/β))2 where c2 = exp 1 κ 2β ln(T/β Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Proof. First observe that, i=k+1 αi = αk+1 αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) where in the inequality we used 16 and the fact that 1/α > 1. These relations imply that, i=k+1 αi αk+1 1 α 2β ln(T/β) κ 2β ln(T/β) We then have k=1 α2k exp k=1 α2k exp 1 k=1 α2k 2(1 α)κ 2 (Lemma 17) e2α2 T(1 α)2 e2α2 T(ln(1/α))2 = 4κ2c2(ln(T/β))2 Lemma 7. For α = β T 1/T and any κ > 0, c2 κ ln(T/β) for c2 = exp 1 κ 2β ln(T/β) Proof. Proceeding in the same way as Lemma 6, we obtain the following inequality, k=1 αk exp 1 Further bounding this term, k=1 αk (1 α)κ eαk+1 (Lemma 17) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD c2 ln(1/α)κT = c2 κ ln(T/β) Lemma 8. If fi is Li-smooth, stochastic lines-searches ensures that γ fi(w) 2 1 c (fi(w) f i ), and min γmax, 2 (1 c) Moreover, if fi is a one-dimensional quadratic, γ = min γmax, 2 (1 c) Proof. Recall that if fi is Li-smooth, then for an arbitrary direction d, fi(w d) fi(w) fi(w), d + Li For the stochastic line-search, d = γ fi(w). The smoothness and the line-search condition are then Smoothness: fi(w γ fi(w)) fi(w) Li 2 γ2 γ fi(w) 2 , Line-search: fi(w γ fi(w)) fi(w) cγ fi(w) 2 . The line-search condition is looser than smoothness if Li 2 γ2 γ fi(w) 2 cγ fi(w) 2 . The inequality is satisfied for any γ [a, b], where a, b are values of γ that satisfy the equation with equality, a = 0, b = 2(1 c)/Li, and the line-search condition holds for γ 2(1 c)/Li. As the line-search selects the largest feasible step-size, γ 2(1 c)/Li. If the step-size is capped at γmax, we have η min{γmax, 2(1 c)/Li}, and the proof for the stochastic line-search is complete. From the previous discussion, observe that if γ > 2(1 c) Li , then we have 2 γ2 γ fi(w) 2 > cγ fi(w) 2 . If f is a one-dimensional quadratic, the smoothness inequality is actually an equality, and thus fi(w γ fi(w)) fi(w) = Li 2 γ2 γ fi(w) 2 So if γ > 2(1 c) fi(w γ fi(w)) fi(w) cγ fi(w) 2 and the line-search condition does not hold. This implies that for one-dimensional quadratics γ = min{γmax, 2(1 c) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD D. 
Lower-bound proofs for Section 3 D.1. Proof of Theorem 3 Theorem 3. When using T iterations of SGD to minimize the sum f(w) = f1(w)+f2(w) 2 of two onedimensional quadratics, f1(w) = 1 2(w 1)2 and f2(w) = 1 2 (2w + 1/2)2, setting γk using SLS with γmax 1 and c 1/2, any convergent sequence of αk results in convergence to a neighbourhood of the solution. Specifically, if w is the minimizer of f and w1 > 0, then, E(w T w ) min w1, 3 Proof. For SLS with a general c 1/2 on quadratics, we know that γk = 2(1 c) Lik (see Lemma 8 for a formal proof). Recall that we consider two one-dimensional quadratics fi(w) = 1 2(wxi yi)2 for i {1, 2} such that x1 = 1, y1 = 1, x2 = 2, y2 = 1 2. Specifically, 2(w 1)2 L1 = 1 4(w 1)2 + 1 wk+1 = wk αk2(1 c)(wk 1) = 2(1 c)αk + (1 2(1 c)αk)wk wk+1 = wk 2(1 c)αk 2 4(2wk + 1 2) = (1 2(1 c)αk)wk 1 Ewk+1 = (1 2(1 c)αk)wk + 1 22(1 c)αk 1 82(1 c)αk = (1 2(1 c)αk)wk + 3 Ew T = E(w T w ) = (w1 w ) k=1 (1 2(1 c)αk) + 3 k=1 2(1 c)αk i=k+1 (1 2(1 c)αi) Using Lemma 18 and the fact that 2(1 c)αk 1 for all k, we have that if w1 w = w1 > 0, E(w T w ) min w1, 3 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD D.2. Proof of Theorem 5 Theorem 5. When minimizing a one-dimensional quadratic function f(w) = 1 2(xw y)2, GD with αk = β T k/T , γk = ν L for ν > 3, satisfies wk+1 w = (w1 w ) i=1 (1 ναi). After k := T ln(T/β) ln ν 3 iterations, we have that |wk +1 w | 2k |w1 w |. Proof. One has w = y x and L = x2. Therefore wk+1 w = wk w αkηkx (xwk y) = wk w αk ν LLwk + αk ν Lxy = wk w αkνwk + αkρw = (1 ναk)(wk w ) Iterating gives the first part of the result. Now, for k k , we have 1 ναk 1 ναk 1 να T ln(T/β) (ln ν ln 3) = 1 ν β 1 ln(T/β) (ln ν ln 3) = 1 ν 3 |wk +1 w | =|w1 w | i=1 |1 ναk| |w1 w |2k D.3. Lemmas for convex setting Lemma 9. The polynomial stepsize defined as αk = (1/k)δ for some 0 δ 1 cannot satisfy PT k=1 αk C1T and PT k=1 α2 k C2 T for positive constants C1 and C2. Proof. If δ = 0, αk = 1 for all k, and then PT k=1 α2 k = T. If δ = 1, then PT k=1 αk = Θ(ln T). If 0 < δ < 1, basic calculus shows that 1 kδ 1 + Z T 1 1 δ (T + 1)1 δ 1 1 kδ 1 + 1 1 δ T 1 δ 1 which shows that PT k=1 αk = Θ(T 1 δ), and thus we cannot have PT k=1 αk C1T for all T. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Lemma 10. The exponential stepsize defined as αk = αk for some α < 1 cannot satisfy PT k=1 αk C1T and PT k=1 α2 k C2 T for positive constants C1 and C2. Proof. Suppose by contradiction that the exponential stepsize satisfies the two conditions. Then k=1 α2k 1 = By assumption, P2T k=1 αk C12T and PT k=1 α2k C2 T. Therefore k=1 α2k 2C1T 1 But then we obtain which is a contradiction by taking T to infinity. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD E. Proofs for Section 4 E.1. Reformulation Let us consider a general ASGD update whose parameters satisfy the following conditions. r2 k = (1 rk)r2 k 1 ηk ηk 1 + rkµηk. (11) bk = (1 rk 1)rk 1 ηk ηk 1 rk + r2 k 1 ηk ηk 1 , (12) It can be verified that setting ηk = γkαk = 1 ρL β T k/T , rk = q µ ρL β T k/2T satisfies Eq. (11). We first show that the update in Eq. (4)-Eq. (5) satisfying the conditions in Eq. (12) and Eq. (11) can be written in an equivalent form more amenable to the analysis. Lemma 11. The following update: yk = wk rkqk qk + rkµ(wk zk) (13) wk+1 = yk ηk fik(yk) (14) zk+1 = wk + 1 rk [wk+1 wk] (15) qk+1 = (1 rk)qk + rkµ (16) r2 k = qk+1ηk (17) zk+1 = 1 qk+1 [(1 rk)qkzk + rkµyk rk fik(yk)] (18) is equivalent to the update in Eq. (4)-Eq. (5). First we check the consistency of the update (Eq. 
(15)) and definition (Eq. (18)) of zk. Using Eq. (18), zk+1 = 1 qk+1 [(1 rk)qkzk + rkµyk rk fik(yk)] rk wk rk qk+1 fik(yk) + yk (1 rk)(qk + rkµ) qk+1rk + rkµ rk wk rk qk+1 fik(yk) + yk qk(1 rk) + (rkµ r2 kµ) qk+1rk + r2 kµ qk+1rk rk wk rk qk+1 fik(yk) + yk (qk+1 rkµ) + (rkµ r2 kµ) + r2 kµ qk+1rk (From Eq. (16)) rk [yk ηk fik(yk)] (From Eq. (17)) zk+1 = wk + 1 rk [wk+1 wk] (From Eq. (14)) which recovers Eq. (15) showing that the definition of zk and its update is consistent. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now we check the equivalence of Eq. (11) and Eq. (16)-Eq. (17). Eliminating qk using Eq. (16)-Eq. (17), r2 k ηk = (1 rk)r2 k 1 ηk 1 + rkµ Multiplying by ηk recovers Eq. (11). Since Eq. (5) and Eq. (14) are equivalent, we need to establish the equivalence of Eq. (4) and the updates in Eq. (13)- Eq. (15). From Eq. (15) zk = wk 1 + 1 rk 1 [wk wk 1] = zk wk = 1 rk 1 rk 1 (wk wk 1) Starting from Eq. (13) and using the above relation to eliminate zk, yk = wk + rkqk qk + rkµ 1 rk 1 rk 1 [wk wk 1] which is in the same form as Eq. (4). We now eliminate qk from rkqk qk+rkµ 1 rk 1 rk 1 . From Eq. (16) and Eq. (17), r2 k ηk = (1 rk)qk + rkµ = qk + rkµ = r2 k ηk + rkqk Using this relation, rkqk qk + rkµ 1 rk 1 rk 1 = qkηk rk + qkηk Using Eq. (17), observe that ηkqk = ηk ηk 1 ηk 1qk = ηk ηk 1 r2 k 1. Using this relation, rkqk qk + rkµ 1 rk 1 ηk ηk 1 r2 k 1 rk + ηk ηk 1 r2 k 1 rk 1 = (1 rk 1)rk 1 ηk + r2 k 1 = bk which establishes the equivalence to Eq. (4) and completes the proof. E.2. Estimating sequences Similar to (Nesterov, 2004; Mishkin, 2020), we will use the estimating sequence {φk, λk} k=1 such that λk (0, 1) and λ1 = 1 ; λk+1 = (1 rk)λk, (19) φk(w) = [inf w φk(w)] + qk 2 w zk 2 , (20) and satisfies the following update condition φk(w) (1 λk)f(w) + λkφ1(w) (21) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD The above definitions impose the following update for φ k := [infw φk(w)], φ k+1 = (1 rk)φ k + rk f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (22) Finally note that the definition of φk can be used to rewrite Eq. (13) as yk = wk rk qk + rkµ φk(wk). (23) E.3. Proof of Theorem 6 Given the definitions in Appendix E.2, we first prove the descent lemma for ηk = 1 ρLαk, where αk 1 is the exponentially decreasing step-size. Lemma 12. Using the update in Eq. (14) with ηk = 1 ρLαk, we obtain the following descent condition. E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + 1 2ρ2Lα2 kσ2 Proof. By smoothness, and the update in Eq. (14), f(wk+1) f(yk) ηk f(yk), fik(yk) + L 2 η2 k fik(yk) 2 Taking expectation w.r.t. ik, E[f(wk+1)] E[f(yk)] ηk f(yk) 2 + L 2 η2 k E[ fik(yk) 2] (ηk is independent of ik.) E[f(yk)] ηk f(yk) 2 + ρL 2 η2 k E[ f(yk) 2] + L 2 η2 kσ2 (From Eq. (6)) = E[f(yk)] ηk f(yk) 2 + ηkαk 2 E[ f(yk) 2] + 1 2ρ2Lα2 kσ2 E[f(yk)] ηk 2 f(yk) 2 + 1 2ρ2Lα2 kσ2 The main part of the proof is to show that φ k is an upper-bound on f(wk) (upto a factor governed by the noise term Nk depending on σ2) for all k and is proved in the following lemma. Lemma 13. For the estimating sequences defined in Appendix E.2 and the updates in Eq. (13)-Eq. (18), for all k, E[φ k] := E[inf w φk(w)] E[f(wk)] Nk where Nk := σ2 2ρ2L Pk j=1 α2 j Qk i=j+1(1 ri). Proof. We will prove the lemma by induction. For k = 1, we define φ 1 = f(w1), and since Nk 0 for all k, E[φ 1] f(w1) N1, thus satisfying the base-case for the induction. For the induction, we will use the fact that Nk+1 = (1 rk)Nk + 2σ2 Assuming the induction hypothesis, E[φ k] E[f(wk)] Nk, we use Eq. 
(22) to prove the statement for k + 1 as follows. Taking expectations w.r.t to the randomness in j = 1 to k, E[φ k+1] = (1 rk)E[φ k] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD (1 rk)E[f(wk) Nk] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (by the induction hypothesis) = (1 rk)E[f(wk)] + rk E[f(yk)] r2 k 2qk+1 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = (1 rk)E[f(wk)] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk (Using Eq. (17)) By convexity, f(wk) f(yk) + f(yk), wk yk , (1 rk)E[f(yk) + f(yk), wk yk ] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = E h f(yk) ηk 2 f(yk) 2i + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk By Lemma 12, E f(wk+1) 1 2ρ2Lα2 kσ2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk = E [f(wk+1)] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk + 1 2ρ2Lα2 kσ2 Since Nk+1 = h (1 rk)Nk + 1 2ρ2Lα2 kσ2i , E[φ k+1] E [f(wk+1)] Nk+1 + (1 rk)E[ f(yk), wk yk ] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk Now we show that (1 rk)E h f(yk), wk yk + rk qk µ 2 yk zk 2 + f(yk), zk yk i 0. For this, we use Eq. (13) yk = wk qkrk qk + rkµ(wk zk) = zk yk = zk wk + qkrk qk + rkµ(wk zk) = 1 qkrk qk + rkµ = qk(1 rk) + rkµ (zk wk) = qk+1 qk + rkµ (zk wk) (By Eq. (16)) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD qk+1 f(yk), zk yk = f(yk), rk qk qk + rkµ (wk zk) = f(yk), yk wk Using this relation to simplify, (1 rk)E f(yk), wk yk + rk qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E rk qk 2 yk zk 2 + (1 rk) [ f(yk), wk yk + f(yk), yk wk ] = (1 rk)E rk qk 2 yk zk 2 0 (Since rk 1.) Putting everything together, E[φ k+1] E [f(wk+1)] Nk+1 and we conclude that E[φ k] E [f(wk)] Nk for all k by induction. We now use the above lemma to prove Theorem 6. Theorem 6. Under the same assumptions of Theorem 1 and (iii) the growth condition in Eq. (6), ASGD (Eqs. (4) and (5)) with w1 = y1, γk = 1 ρL, αk = β T k/T , rk = q µ ρL β T k/2T and bk = (1 rk 1) rk 1 α rk+r2 k 1 α converges as, T +1 2c3 exp T κρ α ln(T/β) ρµe2 (ln(T/β))2 where k := E[f(wk) f ] and c3 = exp 2β ρκ ln(T/β) . Proof. Using the reformulation in Lemma 11 gives us qk = µ for all k and z1 = w1. For the estimating sequences defined in Appendix E.2, using Lemma 13, we know that the (reformulated) updates satisfy the following relation, E[f(w T +1)] E[φ T +1] + NT +1 E[φT +1(w )] + NT +1 From Eq. (21), we know that for all w and k, φk(w) (1 λk)f(w) + λkφ1(w) Using these relations, E[f(w T +1)] (1 λT )f + λT φ1(w ) + NT +1 = E[f(w T +1) f ] λT [φ1(w ) f ] + NT +1 By Eq. (20), λT h φ 1 + q1 2 w z1 2 f i + NT +1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Choosing φ 1 = f(w1), λT h f(w1) f + q1 2 w z1 2i + NT Since z1 = w1, q1 = µ, = E[f(w T +1) f ] λT h f(w1) f + µ 2 w w1 2i + 2σ2 i=j+1 (1 ri) Using the fact that λ1 = 1 and λk+1 = (1 rk)λk, we know that that λT = QT k=1(1 rk), and E[f(w T +1) f ] # h f(w1) f + µ 2 w w1 2i + σ2 i=j+1 (1 ri). Now our task is to upper-bound bound the 1 rk terms. From Eq. 
(17), we know that rk = qk+1ηk = rqk+1 ρL αk rqk+1 ρL αk (Since αk 1 for all k) = (1 rk) 1 rqk+1 Since qk = µ for all k, putting everything together, E[f(w T +1) f ] # h f(w1) f + µ 2 w w1 2i + σ2 Denoting k = E[f(wk) f ], using the exponential step-size αk = α k/T = 1 T k/T and that f(w1) f µ 2 w w1 2, k=1 α2k exp Using Lemma 5, we can bound the first term as 1 2 exp r 1 αT ln(T/β) 2β ln(T/β) = 2c3 exp T κρ α ln(T/β) where c3 = exp 2β ρκ ln(T/β) . We can now bound the second term by a proof similar to Lemma 6. Indeed we have k=1 α2k exp k=1 α2k exp r 1 ρκ αk+1 αT +1 = exp 1 ρκ αT +1 k=1 α2k exp r 1 exp 1 ρκ αT +1 k=1 α2k 2(1 α) ρκ 2 (Lemma 17) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = exp 1 ρκ αT +1 e2α2 T(1 α)2 exp 1 ρκ αT +1 e2α2 T ln(1/α)2 = exp 1 ρκ αT +1 4ρκ ln(T/β)2 exp 1 ρκ αT +1 exp 2β ρκ ln(T/β) where the inequality comes from the bound in Eq. (10) in the proof of Lemma 5. Putting everything together we obtain E[f(w T +1) f ] 2c3 exp T κρ α ln(T/β) [f(w1) f ] + 2σ2c3κ ln(T/β)2 E.4. Proof for misspecified ASGD In this section, we assume that L and µ are misspecified by positive νL and νµ and we set ν = νLνµ. In particular, we will use L and µ, offline estimates of the smoothness and strong-convexity parameters. W.l.o.g we will assume that µ = νµµ and L = L νL . Importantly, we will assume that νµ 1 i.e. we will only consider the more practical case where we underestimate the strong-convexity. Since νµ 1, f is also µ strongly-convex. Hence, all the equations of (11) hold where we replace µ with µ. Similarly in the definition of φ k in (22) we can replace µ with µ. With this estimated value, the extrapolation parameter bk is computed as follows: r2 k = (1 rk)r2 k 1 ηk ηk 1 + rk µηk. (24) bk = (1 rk 1)rk 1 ηk ηk 1 rk + r2 k 1 ηk ηk 1 , (25) where ηk = γkαk = 1 ρ Lαk = νL ρL β T k/T , rk = q β T k/2T = q νµ ρL β T k/2T = q ν ρκ β T k/2T satisfy the above equations. It can be verified that the reformulation in Appendix E.1 does not use the specific form of rk and ηk, and only relies on the consistency of the update above. Hence, the update can be be reformulated as a 3 variable sequence as in Appendix E.1 with a different choice of ηk and rk, but with qk and zk defined analogously in terms of ηk and rk. Similarly, it can be verified that the definition of the estimating sequences in Appendix E.2 also does not depend on the specific value of rk and ηk, and hence we use the same definition of φk. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD E.4.1. PROOF OF THEOREM 7 Given the definitions in Appendix E.2, we first prove the following descent lemma with the misspecified step-size. Lemma 14. Using the update in Eq. (14) with ηk = νL ρLαk, and defining k0 := T [ln(νL)]+ ln(T/β) and G = maxj [k0] f(yj) , we obtain the following descent lemma. E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k Proof. By smoothness, and the update in Eq. (14), f(wk+1) f(yk) ηk f(yk), fik(yk) + L 2 η2 k fik(yk) 2 Taking expectation w.r.t. ik, E[f(wk+1)] E[f(yk)] ηk f(yk) 2 + L 2 η2 k E[ fik(yk) 2] (ηk is independent of the randomness in ik.) E[f(yk)] ηk f(yk) 2 + ρL 2 η2 k f(yk) 2 + L 2 η2 kσ2 (By the growth condition in Eq. (6)) = E[f(yk)] ηk f(yk) 2 + ηkνLαk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 = E[f(yk)] ηk 2 f(yk) 2 ηk 2 (1 νLαk) f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 For k k0, 1 νLαk 0, implying that E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2. 
For k < k0, E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ηkνLαk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + α2 kν2 L 2ρL f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 (Since ηk = νL Since G = maxj [k0] f(yj) , we can further upper-bound the RHS by E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + G2ν2 L 2ρL α2 k + ν2 L 2ρ2Lα2 kσ2. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Let us now prove the equivalent of Lemma 13 using the above modified descent lemma. Lemma 15. For the estimating sequences defined in Appendix E.2 and the updates in Eq. (13)-Eq. (18), E[φ k] := E[inf w φk(w)] E[f(wk)] Nk where Nk := σ2ν2 L 2ρ2L Pk j=1 α2 j Qk i=j+1(1 ri) + G2ν2 L 2ρL Pmin{k0,k} 1 j=1 α2 j Qk i=j+1(1 ri) Proof. We will prove the lemma by induction. For k = 1, we define φ 1 = f(w1), and since Nk 0 for all k, E[φ ] f(w1) N1, thus satisfying the base-case for the induction. For the induction, we will use the fact that Nk+1 = (1 rk)Nk + 2ν2 Lσ2 ρ2L α2 k + I {k < k0} G2ν2 L 2ρL α2 k. Assuming the induction hypothesis, E[φ k] E[f(wk)] Nk, we use Eq. (22) to prove the statement for k + 1 as follows. Taking expectations w.r.t to the randomness in j = 1 to k, E[φ k+1] = (1 rk)E[φ k] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)E[f(wk) Nk] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (by the induction hypothesis) = (1 rk)E[f(wk)] + rk E[f(yk)] r2 k 2qk+1 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E[f(wk)] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk (Using Eq. (17)) By convexity, f(wk) f(yk) + f(yk), wk yk , (1 rk)E[f(yk) + f(yk), wk yk ] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = E h f(yk) ηk 2 f(yk) 2i + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk By Lemma 14, E f(wk+1) ν2 L 2ρ2Lα2 kσ2 I {k < k0} G2ν2 L 2ρL α2 k + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk = E [f(wk+1)] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Since Nk+1 = h (1 rk)Nk + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k i , E[φ k+1] E [f(wk+1)] Nk+1 + (1 rk)E[ f(yk), wk yk ] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk Similar to Lemma 13, we show that (1 rk)E h f(yk), wk yk + rk qk µ 2 yk zk 2 + f(yk), zk yk i 0. For this, we use modified Eq. (13) yk = wk qkrk qk + rk µ(wk zk) = zk yk = zk wk + qkrk qk + rk µ(wk zk) = 1 qkrk qk + rk µ = qk(1 rk) + rk µ (zk wk) = qk+1 qk + rk µ (zk wk) (By Eq. (16)) qk+1 f(yk), zk yk = f(yk), rk qk qk + rk µ (wk zk) = f(yk), yk wk Using this relation to simplify, (1 rk)E f(yk), wk yk + rk qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E rk qk µ 2 yk zk 2 + (1 rk) [ f(yk), wk yk + f(yk), yk wk ] = (1 rk)E rk qk µ 2 yk zk 2 0 (Since rk 1.) Putting everything together, E[φ k+1] E [f(wk+1)] Nk+1 and we conclude that E[φ k] E [f(wk)] Nk for all k by induction. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD We now use the above lemma to prove Theorem 7. Theorem 7. Under the same assumptions as Theorem 6 and (iv) ν = νLνµ ρκ, ASGD (Eqs. (4) and (5)) with w1 = y1, γk = 1 ρ L = νL ρL, αk = β T k/T , µ = νµµ µ, rk = q β T k/2T = q ν ρκ β T k/2T and bk = (1 rk 1) rk 1 α rk+r2 k 1 α converges as, T +1 2c3 exp min{ν, 1}T κρ α ln(T/β) + 2c3(ln(T/β))2 ρ + G2 min{k0 T , 1} max{νL νµ , ν2 L}, where k := E[f(wk) f ], c3 = exp 1 ρκ 2β ln(T/β) , [x]+ = max{x, 0}, k0 := T [ln(νL)]+ ln(T/β) and G = maxj [k0] f(yj) . Proof. 
In order to have a valid estimate sequence, we need to restrict values that ν. Since ν ρκ, 0 rk 1 and hence λk (0, 1), as required for a valid estimate sequence. For the estimating sequences defined in Appendix E.2, using Lemma 13, we know that the (reformulated) updates satisfy the following relation E[f(w T +1)] E[φ T +1] + NT +1 E[φT +1(w )] + NT +1 From Eq. (21), we know that for all w and k, φk(w) (1 λk)f(w) + λkφ0(w) Using these relations, E[f(w T +1)] (1 λT )f + λT φ1(w ) + NT = E[f(w T +1) f ] λT [φ1(w ) f ] + NT +1 By Eq. (20), λT h φ 1 + q1 2 w z1 2 f i + NT +1 Choosing φ 1 = f(w1), λT h f(w1) f + q1 2 w z1 2i + NT Since z1 = w1, and we set q1 = µ, = E[f(w T +1) f ] λT f(w1) f + µ + σ2ν2 L 2ρ2L i=j+1 (1 ri) + G2ν2 L 2ρL min{k0,T } 1 X i=j+1 (1 ri) Using the fact that λ1 = 1 and λk+1 = (1 rk)λk, and since ν ρκ, we know that rk 1 and λT = QT k=1(1 rk), and E[f(w T +1) f ] # f(w1) f + µ 2 w w1 2 + σ2ν2 L 2ρ2L i=j+1 (1 ri) + G2ν2 L 2ρL min{k0,T } 1 X i=j+1 (1 ri). Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now our task is to upper-bound bound the 1 rk terms. From Eq. (17), we know that rk = qk+1ηk = rqk+1νL ρL αk rqk+1νL ρL αk (Since αk 1 for all k) = (1 rk) 1 rqk+1νL Since qk = µ for all k, putting everything together, E[f(w T +1) f ] # f(w1) f + µ 2 w w1 2 + σ2ν2 L 2ρ2L + G2ν2 L 2ρL min{k0,T } 1 X Denoting k = E[f(wk) f ] and 1 = E[f(w1) f ], using the exponential step-size αk = α k/T = 1 T k/T and that f(w1) f µ 2 w1 w 2. Similar to the proof of Theorem 4, if ν > 1, we can replace ν by 1 and get an upper-bound for the 1 q ν ρκαk term. Hence, we define ˆν := min{1, ν}, and obtain the following upper-bound. 1 + σ2ν2 L 2ρ2L k=1 α2k exp + G2ν2 L 2ρL min{k0,T } 1 X k=1 α2k exp Using Lemma 5, we can bound the first term as αT ln(T/β) 2β ln(T/β) 2 exp 2β ρκ ln(T/β) ˆν κρ α ln(T/β) [f(w1) f ] (since ˆν 1) ˆν κρ α ln(T/β) where c3 = exp 2β ρκ ln(T/β) . We can now bound the second term by a proof similar to Lemma 6. Indeed we have k=1 α2k exp k=1 α2k exp ˆν ρκ αk+1 αT +1 ˆν ρκ αT +1 k=1 α2k exp ˆν ρκ αT +1 k=1 α2k 2(1 α) ρκ 2 (Lemma 17) ˆν ρκ αT +1 ! 4ρκ e2ˆνα2 T(1 α)2 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD ˆν ρκ αT +1 ! 4ρκ e2ˆνα2 T ln(1/α)2 ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min{k0,T } 1 X k=1 α2k exp ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min{k0, T} ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min n ln(ν) ln(T/β), 1 o ˆν ρκ αT +1 ˆν ρκ ln(T/β) where the inequality comes from the bound in Eq. (10) in the proof of Lemma 5. Putting everything together we obtain E[f(w T +1) f ] 2c3 exp ˆνT κρ α ln(T/β) 1 + 2c3κ ln(T/β)2 e2α2T σ2ν2 L ˆνρL + 2c3κ ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2ν2 L Lˆν ˆνT κρ α ln(T/β) 1 + 2c3κ ln(T/β)2 e2α2T σ2 max{ν2 L, νL/νµ} ρL + 2c3κ ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2 max{ν2 L, νL/νµ} L ˆνT κρ α ln(T/β) 1 + 2c3 ln(T/β)2 e2α2T σ2 max{ν2 L, νL/νµ} ρµ + 2c3 ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2 max{ν2 L, νL/νµ} µ For the second inequality, we consider the term ν2 L ˆν . If ˆν = ν, then ν2 L ˆν = ν2 L ν = νL νµ . If ˆν = 1, then ν2 L ˆν = ν2 L. Putting these two cases together we get max{ν2, νL/νµ}. The last equality comes from the fact that κ Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD F. Helper Lemmas Lemma 16. For all x > 1, 1 x 1 2 ln(x) Proof. For x > 1, we have 1 x 1 2 ln(x) ln(x) < 2x 2 Define f(x) = 2x 2 ln(x). We have f (x) = 2 1 x. Thus for x 1, we have f (x) > 0 so f is increasing on [1, ). Moreover we have f(1) = 2 2 ln(1) = 0 which shows that f(x) 0 for all x > 1 and ends the proof. Lemma 17. For all x, γ > 0, Proof. 
Let x > 0. Define f(γ) = γ ex γ exp( x). We have f(γ) = exp (γ ln(γ) γ ln(ex)) exp( x) f (γ) = γ 1 γ + ln(γ) ln(ex) exp (γ ln(γ) γ ln(ex)) f (γ) 0 1 + ln(γ) ln(ex) 0 γ exp (ln(ex) 1) = x So f is decreasing on (0, x] and increasing on [x, ). Moreover, x exp( x) = 1 x exp( x) = 0 and thus f(γ) 0 for all γ > 0 which proves the lemma. Lemma 18. For any sequence αk k=1 (1 αk) + i=k+1 (1 αi) = 1 Proof. We show this by induction on T. For T = 1, (1 α1) + α1 = 1 Induction step: k=1 (1 αk) + i=k+1 (1 αi) = (1 αT +1) k=1 (1 αk) + i=k+1 (1 αi) = (1 αT +1) k=1 (1 αk) + αT +1 + (1 αT +1) i=k+1 (1 αi) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = (1 αT +1) k=1 (1 αk) + i=k+1 (1 αi) (Induction hypothesis) = (1 αT +1) + αT +1 = 1 G. Comparison to Aybat et al. (2019) Aybat et al. (2019) use the condition Ei fi(w) f(w) 2 σ2 + η2 w w 2 (26) In order to relate the above condition to the one in Eq. (6), we use smoothness of f, E fik(xk) f(xk) 2 = E fik(xk) 2 f(xk) 2 σ2 + (ρ 1) f(xk) 2 σ2 + (ρ 1)L2 xk x 2 , and thus η2 = (ρ 1)L2. Similarly, we can use the strong-convexity of f, and obtain the following relation, E fik(xk) 2 = E fik(xk) f(xk) 2 + f(xk) 2 σ2 + η2 xk x 2 + f(xk) 2 µ2 f(xk) 2 + f(xk) 2 , and thus ρ = η2 Hence, the two assumptions are equivalent up to constants. In order to get an idea about the strength of the two results, we compare the two upper-bounds (we note that such a comparison is not ideal, and it may be possible to improve and tailor the upper-bounds to the specific growth condition). Under the assumption in Eq. (26), in Appendix K, the authors prove the following rate, E[f(w T ) f ] O(1) exp T O(1)( κ + η2/µ2) (f(w0) f ) + σ2 If we were to use Eq. (6), since η2 = (ρ 1)L2, this result gives the following upper-bound, E[f(w T ) f ] O(1) exp T O(1)( κ + (ρ 1)κ2) (f(x0) f ) + σ2 Observe that for ρ > 1, this bound does not result in an accelerated rate. On the other hand, if we were to use Eq. (26), since ρ = η2 µ2 + 1 , the result in Theorem 6 can be written as: E[f(w T ) f ] 2c3 exp κ(1 + η2/µ2) (f(w0) f ) + 8σ2c4κ (η2/µ2)Le2 (ln(T/β))2 Observe that for any η > 1, this bound gives an accelerated rate, meaning that the accelerated result in Theorem 6 also holds for the alternative growth condition in Eq. (26).