# Tuning-Free Stochastic Optimization

Ahmed Khaled¹, Chi Jin¹

Abstract

Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of tuning-free algorithms: algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show that tuning-free matching of SGD is possible and is achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result showing that no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.

¹Electrical and Computer Engineering Department, Princeton University, Princeton, NJ, USA. Correspondence to: Ahmed Khaled.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

The hyperparameters we supply to an optimization algorithm can have a significant effect on the runtime of the algorithm and the quality of the final model (Yang et al., 2021; Sivaprasad et al., 2020). Yet hyperparameter tuning is costly, and for large models might prove intractable (Black et al., 2022). As a result, researchers often resort to using a well-known optimizer like Adam (Kingma & Ba, 2015) or AdamW (Loshchilov & Hutter, 2019) with widely used or default hyperparameters. For example, GPT-3 (Brown et al., 2020), BLOOM (Workshop et al., 2022), LLaMA (Touvron et al., 2023a), and LLaMA 2 (Touvron et al., 2023b) all use either Adam or AdamW with identical momentum parameters and similar training recipes. This situation presents an immense opportunity for algorithms that can tune hyperparameters on-the-fly. Yet such algorithms and their limits are still poorly understood in the setting of stochastic optimization.

Let us make our setting more specific. We consider the minimization problem

$$\min_{x \in \mathcal{X}} f(x), \qquad \text{(OPT)}$$

where $f : \mathcal{X} \to \mathbb{R}$ is differentiable and lower bounded by $f^*$. We assume that we have access to (stochastic) gradients $g(x)$ that satisfy certain regularity conditions that we shall make precise later. Our main objects of study are tuning-free algorithms. To make this notion more precise, let $A$ be an optimization algorithm that takes in $n$ problem parameters $a = (a_1, a_2, \ldots, a_n)$ and after $T$ (stochastic) gradient accesses returns a point $x$ such that with high probability

$$f(x) - f^* \le \mathrm{Error}_A(f, a, T). \tag{1}$$

The function $\mathrm{Error}_A$ characterizes how well the algorithm $A$ minimizes the function $f$ in $T$ steps given the supplied parameters.
Let $a^* = a^*(f, T)$ denote the set of parameters that minimizes the right-hand side of equation (1) for a specific function $f$ and number of steps $T$. In order for an algorithm to find $a^*(f, T)$, it must start somewhere. We assume that we can easily find lower and upper bounds on the optimal parameters: two sets $\underline{a}$ and $\bar{a}$ such that for $i = 1, 2, \ldots, n$ we have $\underline{a}_i \le a^*_i \le \bar{a}_i$. Such hints on problem parameters can often be easily estimated in practice, and are a much easier ask than the optimal parameters. To be a tuning-free version of $A$, an algorithm $B$ has to approximately match the performance of $A$ with optimally tuned parameters given those hints, a definition we make rigorous next.

Definition 1.1 (Tuning-free algorithms). We call an algorithm $B$ a tuning-free version of $A$ if, given hints $\underline{a}, \bar{a}$ on the optimal parameters $a^*$ for a function $f$, it achieves the same error as $A$ with the optimal parameters up to only polylogarithmic degradation that depends on the hints and the number of (stochastic) gradient accesses $T$. That is, if $A$ achieves error $f(x) - f^* \le \mathrm{Error}_A(f, a^*(f, T), T)$, then $B$ achieves the guarantee

$$f(x) - f^* \le \iota \cdot \mathrm{Error}_A(f, a^*, T), \qquad \text{(B-error)}$$

where $\iota = \mathrm{polylog}\left(\frac{\bar{a}_1}{\underline{a}_1}, \ldots, \frac{\bar{a}_n}{\underline{a}_n}, T\right)$ is a polylogarithmic function of the hints.

Clearly, asking a tuning-free algorithm $B$ to achieve exactly the same error as $A$ is too much: we ought to pay some price for not knowing $a^*$ upfront. On the other hand, if we allow polynomial dependencies on the hints, then our hints have to be very precise to avoid large errors. This defeats the point of being tuning-free in the first place.

The algorithm $A$ that we are primarily concerned with is Stochastic Gradient Descent (SGD). SGD and its variants dominate in practice, owing to their scalability and low memory requirements (Bottou et al., 2018). We consider three classes of functions: (a) $L$-smooth and convex functions, (b) $G$-Lipschitz and convex functions, and (c) $L$-smooth and potentially nonconvex functions. We ask for tuning-free algorithms for each of these function classes. We give precise definitions of these classes and our oracle model later in Section 1.1.

In the setting of deterministic optimization, we have a very good understanding of tuning-free optimization: there are many methods that, given only hints on the problem parameters required by Gradient Descent (GD), achieve the same rate as GD up to only polylogarithmic degradation. We review this case briefly in Section 4. Despite immense algorithmic developments in stochastic optimization (Duchi et al., 2010; Levy, 2017; Levy et al., 2018; Li & Orabona, 2019; Kavis et al., 2019; Carmon & Hinder, 2022; Ivgi et al., 2023; Cutkosky et al., 2023) and the related setting of online learning (Orabona & Pál, 2016; Cutkosky & Boahen, 2016; Cutkosky, 2019; Mhammedi et al., 2019; Mhammedi & Koolen, 2020; Orabona & Cutkosky, 2020), we are not aware of any algorithms that fit our definition as tuning-free counterparts of SGD for any of the function classes we consider. The main question of our work is thus:

Can we find tuning-free counterparts for SGD in the setting of stochastic optimization and the classes of functions we consider (convex and smooth functions, convex and Lipschitz functions, and nonconvex and smooth functions)?

Our contributions. We answer the above question in the negative for the first two function classes and make some progress towards answering it for the third.
In particular, our main contributions are:

- For convex optimization: if the domain of optimization is bounded, we highlight results from the literature showing that tuning-free optimization matching SGD is possible. If the domain of optimization $\mathcal{X}$ is unbounded, we give an impossibility result showing that no algorithm can be a tuning-free counterpart of SGD for smooth and convex functions (Theorem 2), as well as for Lipschitz and convex functions (Theorem 3). Additionally, if the stochastic gradient noise has a certain large signal-to-noise ratio (defined in Section 4.2), then tuning-free optimization is possible even when the domain of optimization $\mathcal{X}$ is unbounded, and can be achieved by the recently-proposed DoG (Ivgi et al., 2023) and DoWG (Khaled et al., 2023) algorithms for both smooth and/or Lipschitz functions (Theorem 4).

- For nonconvex optimization: we consider two different notions of tuning-free optimization that correspond to the best-known convergence error bounds for SGD in expectation (Ghadimi & Lan, 2013) and with high probability (Liu et al., 2023). We show that tuning-free optimization is impossible in the former setting (Theorem 5). On the other hand, for the latter, slightly weaker notion, we give a positive result and a tuning-free variant of SGD (Theorem 6).

1.1. Preliminaries

In this section we review some preliminary notions and definitions that we shall make use of throughout the paper. We say that a function $f : \mathcal{X} \to \mathbb{R}$ is convex if for any $x, y \in \mathcal{X}$ we have $f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$ for all $t \in [0, 1]$. We call a function $G$-Lipschitz if $|f(x) - f(y)| \le G \|x - y\|$ for all $x, y \in \mathcal{X}$. All norms considered in this paper are Euclidean. We let $\log_+ x \overset{\text{def}}{=} 1 + \log x$. A differentiable function $f$ is $L$-smooth if for any $x, y \in \mathcal{X}$ we have $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$.

Oracle model. All algorithms we consider shall access gradients through one of the two oracles defined below.

Definition 1.2. We say that $O(f)$ is a deterministic first-order oracle for the function $f$ if, given a point $x \in \mathcal{X}$, the oracle returns the pair $\{f(x), \nabla f(x)\}$. If we only allow the algorithm access to stochastic gradients, then we call this a stochastic oracle.

Our main lower bounds are developed under the following noise model.

Assumption 1.1. The stochastic gradient estimates are bounded almost surely. That is, there exists some $R \ge 0$ such that for all $x \in \mathcal{X}$, $\|\hat{g}(x) - \nabla f(x)\| \le R$.

The stochastic oracle model we consider allows for access to both function values and stochastic gradients satisfying Assumption 1.1.

Definition 1.3. We say that $O(f, \sigma_f)$ is a stochastic first-order oracle for the function $f$ with bound $\sigma_f$ if, given a point $x \in \mathcal{X}$, it returns a pair of random variables $\left\{\hat{f}(x), \hat{g}(x)\right\}$ such that (a) the estimates are unbiased, $\mathbb{E}[\hat{f}(x)] = f(x)$ and $\mathbb{E}[\hat{g}(x)] = \nabla f(x)$, and (b) the stochastic gradients satisfy Assumption 1.1 with $R = \sigma_f$.

The above oracle restricts the noise to be bounded almost surely. We shall develop our lower bounds under that oracle. However, for some of the upper bounds we develop, we shall relax the requirement on the noise from boundedness to being sub-gaussian (see Section 4.2), and we shall make this clear then.
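To make the oracle model concrete, the following Python sketch implements a stochastic first-order oracle satisfying Definition 1.3 and Assumption 1.1. The class name, the uniform-in-a-ball noise distribution, and the quadratic test function are illustrative assumptions of ours, not constructions from the paper; any zero-mean noise with norm bounded by $\sigma_f$ would do.

```python
import numpy as np


class StochasticFirstOrderOracle:
    """Sketch of a stochastic first-order oracle O(f, sigma_f) (Definition 1.3).

    Returns unbiased estimates of f(x) and grad f(x) whose errors are bounded
    in norm by sigma_f almost surely (Assumption 1.1). The uniform-in-a-ball
    noise below is one concrete choice satisfying the assumption.
    """

    def __init__(self, f, grad_f, sigma_f, seed=0):
        self.f, self.grad_f, self.sigma_f = f, grad_f, sigma_f
        self.rng = np.random.default_rng(seed)

    def query(self, x):
        d = x.shape[0]
        u = self.rng.standard_normal(d)
        # Uniform point in the ball of radius sigma_f: a random direction
        # scaled by sigma_f * U^(1/d); symmetric, hence zero mean.
        u *= self.sigma_f * self.rng.uniform() ** (1.0 / d) / np.linalg.norm(u)
        f_hat = self.f(x) + self.sigma_f * self.rng.uniform(-1.0, 1.0)
        g_hat = self.grad_f(x) + u
        return f_hat, g_hat


# Example: the smooth convex quadratic f(x) = 0.5 ||x||^2 with sigma_f = 0.1.
oracle = StochasticFirstOrderOracle(lambda x: 0.5 * x @ x, lambda x: x, 0.1)
f_hat, g_hat = oracle.query(np.ones(5))
```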
2. Related Work

This section reviews existing approaches in the literature aimed at reducing or eliminating hyperparameter tuning.

Parameter-free optimization. An algorithm $A$ is called parameter-free if it achieves the convergence rate $O\left(\frac{G \|x_0 - x^*\|}{\sqrt{T}}\right)$ given $T$ stochastic gradient accesses for any convex function $f$ with stochastic subgradients bounded in norm by $G$, with possible knowledge of $G$ (Orabona, 2023, Remark 1). There exists a vast literature on such methods, particularly in the setting of online learning; see e.g. (Orabona & Cutkosky, 2020). Parameter-free optimization differs from tuning-free optimization in two ways: (a) the $O(\cdot)$ can suppress higher-order terms that are not permitted under the tuning-free definition, and (b) the algorithm may be given knowledge of a parameter like $G$, whereas tuning-free algorithms only get to see upper and lower bounds on $G$. Nevertheless, many parameter-free methods do not need any knowledge of $G$ (Cutkosky, 2019; Mhammedi et al., 2019; Mhammedi & Koolen, 2020). However, Cutkosky & Boahen (2017b; 2016) give lower bounds showing that any online learning algorithm insisting on a linear dependence on $\|x_0 - x^*\|$ (as in optimally tuned SGD) must suffer from potentially exponential regret. If we do not insist on a linear dependence on $\|x_0 - x^*\|$, then the best achievable convergence bound scales with $\|x_0 - x^*\|^3$, and this is tight (Mhammedi & Koolen, 2020). None of the aforementioned lower bounds apply to the setting of stochastic optimization, since in general online learning assumes an adversarial oracle, which is stronger than a stochastic oracle.

Tuning-free algorithms in the deterministic setting. Gradient descent augmented with line search (Nesterov, 2014; Beck, 2017) is tuning-free for smooth convex and nonconvex optimization. Bisection search (Carmon & Hinder, 2022) is tuning-free for both convex and smooth as well as convex and Lipschitz optimization, as is a restarted version of gradient descent with Polyak stepsizes (Hazan & Kakade, 2019). In the smooth setting, the adaptive descent method of Malitsky & Mishchenko (2020) is also tuning-free. There are also accelerated methods (Lan et al., 2023), methods for the Lipschitz setting (Defazio & Mishchenko, 2023), methods based on online learning (Orabona, 2023), and others. Renegar & Grimmer (2021) show that for strongly convex optimization, a simple restarting scheme suffices to obtain tuning-free algorithms in the deterministic setting.

Algorithms for the stochastic setting. Observe that because online learning is a more general setting than the stochastic one, we can apply algorithms from online convex optimization here, e.g. (Mhammedi & Koolen, 2020), coupled with an appropriate online-to-batch conversion (Hazan, 2022). In more recent work (Carmon & Hinder, 2022; Ivgi et al., 2023), we see algorithmic developments specific to the stochastic setting. We discuss the convergence rates these algorithms achieve in more detail in Section 4.1. There are algorithms based on line search in the stochastic setting, but proving their convergence requires either extra assumptions like interpolation (Vaswani et al., 2019) or the use of large minibatch sizes (Paquette & Scheinberg, 2020). This drawback of stochastic line search is unavoidable, as (Vaswani et al., 2021, Theorem 4) shows that applying stochastic line search to a quadratic objective results in non-convergence.

Other hyperparameter tuning approaches. In practice, hyperparameters are often found by grid search, random search, or methods based on Bayesian optimization (Bischl et al., 2023); none of these approaches come with efficient theoretical guarantees.
Another approach is meta-optimization, where we have a sequence of optimization problems and seek to minimize the cumulative error over this sequence. Often, another optimization algorithm is then used to select the learning rates, e.g. hypergradient descent (Baydin et al., 2018). Meta-optimization approaches are quite difficult to establish theoretical guarantees for, and only recently have some theoretical results been shown (Chen & Hazan, 2023). Our setting in this paper is different: rather than seeking to minimize regret over a sequence of optimization problems, we have a single function and an oracle that gives us (stochastic) gradient estimates for this function.

Concurrent work. In concurrent work, Carmon & Hinder (2024) and Attia & Koren (2024) also study lower bounds for first-order stochastic optimization. In both papers, as in our work, the algorithm is provided with a certain range that the problem parameters fall in (what we term hints) and must make use of only that to minimize the function with stochastic gradient evaluations. Carmon & Hinder (2024) study the minimum possible multiplicative-factor slowdown that any algorithm must suffer compared to optimally-tuned baselines when provided access only to hints, which they term the price of adaptivity. They provide lower bounds for stochastic convex optimization for Lipschitz functions in expectation and with high probability, and also consider the case where some of the problem parameters have no uncertainty (e.g. when we know the Lipschitz constant but not the initial distance to the optimum). Our lower bound in this setting (Theorem 3) rules out any polylogarithmic price of adaptivity as impossible. Additionally, we also give lower bounds for nonconvex and smooth convex optimization (Theorems 2 and 5). Attia & Koren (2024) study stochastic optimization in a similar setting to ours, and give a new upper bound for restarted nonconvex SGD that achieves a similar convergence guarantee to Theorem 6. We give our upper bound under a slightly more general noise distribution (that the noise has subgaussian norm) at the cost of a polylogarithmic dependence on the problem dimension. We also additionally give a lower bound that rules out the stronger in-expectation convergence guarantee for nonconvex SGD. In the convex setting, Attia & Koren (2024) give lower bounds for smooth and nonsmooth stochastic optimization showing that a polynomial dependence on the hints is the best we can hope to achieve, and give a matching upper bound based on restarted SGD with AdaGrad-like stepsizes. In contrast, for our upper bounds in this case we study more specifically which noise distributions are amenable to optimization and prove results for the DoG and DoWG algorithms (with no restarting procedures). Additionally, we also investigate whether tuning-free optimization is possible under a bounded domain and provide guarantees for DoG/DoWG there (Theorem 1).

3. Tuning-Free Optimization Under a Bounded Domain

We begin our investigation by studying the bounded setting, where we make the following assumption on the minimization problem (OPT):

Assumption 3.1. The optimization domain $\mathcal{X}$ is bounded. There exists some constant $D > 0$ such that $\|x - y\| \le D$ for all $x, y \in \mathcal{X}$.

We seek a tuning-free version of SGD.
Recall that SGD achieves, with probability at least $1 - \delta$, the following convergence guarantee (Jain et al., 2019; Liu et al., 2023):

$$f(x_{\mathrm{out}}) - f^* \le \nu \cdot \begin{cases} \dfrac{LD^2}{T} + \dfrac{\sigma D}{\sqrt{T}} & \text{if } f \text{ is } L\text{-smooth}, \\[4pt] \dfrac{(G + \sigma) D}{\sqrt{T}} & \text{if } f \text{ is } G\text{-Lipschitz}, \end{cases} \tag{2}$$

where $\nu = \mathrm{polylog}\left(\frac{1}{\delta}\right)$ and $\sigma$ is an upper bound on the stochastic gradient noise (per Assumption 1.1). To achieve the convergence guarantee given by equation (2), we need to know the parameters $D$, $\sigma$, and $L$ in the smooth case or $G$ in the nonsmooth case. Per Definition 1.1, a tuning-free version of SGD will thus be given the hints $D \in [\underline{D}, \bar{D}]$, $\sigma \in [\underline{\sigma}, \bar{\sigma}]$, and either $L \in [\underline{L}, \bar{L}]$ in the smooth setting or $G \in [\underline{G}, \bar{G}]$ in the nonsmooth setting. Given those hints, we then ask the algorithm to achieve the same rate as equation (2) up to a multiplicative polylogarithmic function of the hints.

It turns out that tuning-free optimization under a bounded domain is solvable in many ways. Many methods from the online learning literature, e.g. (Cutkosky & Boahen, 2017a; Mhammedi et al., 2019; Cutkosky, 2019), can solve this problem when combined with standard online-to-batch conversion bounds. We give the details of this construction for one such algorithm in the next proposition.

Proposition 1. Coin betting through Online Newton Steps with Hints (Cutkosky, 2019, Algorithm 1) is tuning-free in the bounded setting.

The proof of this result is provided in the appendix, and essentially just combines (Cutkosky, 2019, Theorem 2) with online-to-batch conversion. In this paper, we shall focus particularly on methods that fit the stochastic gradient descent paradigm, i.e. that use updates of the form $x_{k+1} = x_k - \eta_k g_k$, where $g_k$ is the stochastic gradient at step $k$. Two methods that fit this paradigm are DoG (Ivgi et al., 2023) and DoWG (Khaled et al., 2023); a runnable sketch of both stepsize rules is given at the end of this section. DoG uses stepsizes of the form

$$\eta_t = \frac{r_t}{u_t}, \qquad r_t = \max_{k \le t}\left(\|x_k - x_0\|,\ r_\epsilon\right), \qquad u_t = \sqrt{\sum_{k=0}^{t} \|g_k\|^2}, \tag{3}$$

where $r_\epsilon$ is a parameter that we will always set to $\underline{D}$. Similarly, DoWG uses stepsizes of the form

$$\eta_t = \frac{r_t^2}{v_t}, \qquad r_t = \max_{k \le t}\left(\|x_k - x_0\|,\ r_\epsilon\right), \qquad v_t = \sqrt{\sum_{k=0}^{t} r_k^2 \|g_k\|^2}. \tag{4}$$

The next theorem shows that in the bounded setting, both DoG and DoWG are tuning-free.

Theorem 1. DoG and DoWG are tuning-free in the bounded setting. That is, there exists some $\iota = \mathrm{polylog}\left(\frac{\bar{D}}{\underline{D}}, \frac{\bar{\sigma}}{\underline{\sigma}}, T, \delta^{-1}\right)$ such that

$$f(x_{\mathrm{out}}) - f^* \le \iota \cdot \begin{cases} \dfrac{LD^2}{T} + \dfrac{\sigma D}{\sqrt{T}} & \text{if } f \text{ is } L\text{-smooth}, \\[4pt] \dfrac{(G + \sigma) D}{\sqrt{T}} & \text{if } f \text{ is } G\text{-Lipschitz}. \end{cases}$$

This rate is achieved simultaneously for both classes of functions without prior knowledge of whether $f$ is smooth or Lipschitz (and thus no usage of the hints $\underline{L}, \bar{L}, \underline{G}, \bar{G}$).

This theorem essentially comes for free by modifying the results in (Ivgi et al., 2023; Khaled et al., 2023), and while the proof modifications are quite lengthy we claim no significant novelty here. We note further that unlike (Cutkosky, 2019, Algorithm 1), both DoG and DoWG are single-loop algorithms: they do not restart the optimization process or throw away progress. This is a valuable property and one of the reasons we focus on these algorithms in the paper. Moreover, DoG and DoWG are universal. An algorithm is universal if it achieves the same rate as SGD for Lipschitz functions and also takes advantage of smoothness when it exists (Levy, 2017), without any prior knowledge of whether $f$ is smooth. DoG and DoWG enjoy this property in the bounded domain setting.
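The following minimal sketch shows how the two stepsize rules (3) and (4) translate into code. It is an illustration under our own choices (the function names, the toy quadratic, and the Gaussian gradient noise are assumptions), not the authors' reference implementation.

```python
import numpy as np


def dog_dowg(x0, grad, T, r_eps, variant="dog"):
    """Run DoG (eq. 3) or DoWG (eq. 4) for T steps; returns all iterates.

    `grad(x)` returns a stochastic gradient g_k, and `r_eps` is the small
    initial distance parameter (set to the hint D_lower in Section 3).
    Assumes the first stochastic gradient is nonzero.
    """
    x, iterates = x0.astype(float).copy(), [x0.astype(float).copy()]
    r, sum_g2, sum_r2g2 = r_eps, 0.0, 0.0
    for _ in range(T):
        g = grad(x)
        r = max(r, np.linalg.norm(x - x0))   # r_t = max(max_k ||x_k - x_0||, r_eps)
        sum_g2 += g @ g                      # u_t^2 = sum_{k<=t} ||g_k||^2
        sum_r2g2 += r**2 * (g @ g)           # v_t^2 = sum_{k<=t} r_k^2 ||g_k||^2
        if variant == "dog":
            eta = r / np.sqrt(sum_g2)        # equation (3)
        else:
            eta = r**2 / np.sqrt(sum_r2g2)   # equation (4)
        x = x - eta * g
        iterates.append(x.copy())
    return iterates


# Toy run: quadratic f(x) = 0.5 ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
xs = dog_dowg(np.ones(10), lambda x: x + 0.1 * rng.standard_normal(10),
              T=1000, r_eps=1e-4, variant="dowg")
```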
4. Tuning-Free Optimization Under an Unbounded Domain

We now continue our investigation to the general, unbounded setting where $\mathcal{X} = \mathbb{R}^d$. Now the diameter $D$ in Assumption 3.1 is infinite, and the convergence of SGD is instead characterized by the initial distance to the optimum $D = \|x_0 - x^*\|$ (Liu et al., 2023). We can show that SGD with optimally-tuned stepsizes achieves, with probability at least $1 - \delta$, the convergence rates

$$f(x_{\mathrm{out}}) - f^* \le \upsilon \cdot \begin{cases} \dfrac{LD^2}{T} + \dfrac{\sigma D}{\sqrt{T}} & \text{if } f \text{ is } L\text{-smooth}, \\[4pt] \dfrac{(G + \sigma) D}{\sqrt{T}} & \text{if } f \text{ is } G\text{-Lipschitz}, \end{cases} \tag{5}$$

where $\upsilon = \mathrm{polylog}\left(\frac{1}{\delta}\right)$, $\sigma$ is the maximum stochastic gradient noise norm, and $D = \|x_0 - x^*\|$ is the initial distance from the minimizer. An algorithm is a tuning-free version of SGD in the unbounded setting if it can match the best SGD rates given by equation (5) up to polylogarithmic factors given access to the hints $\underline{D}, \bar{D}$, $\underline{\sigma}, \bar{\sigma}$, and $\underline{G}, \bar{G}$ or $\underline{L}, \bar{L}$. This is a tall order: unlike in the bounded setting, a tuning-free algorithm now has to compete with SGD's convergence on any possible initialization.

Deterministic setting. When there is no stochastic gradient noise, i.e. $\sigma = 0$ and the algorithm accesses gradients according to the deterministic first-order oracle (Definition 1.2), tuning-free versions of gradient descent exist. For example, the Adaptive Polyak algorithm (Hazan & Kakade, 2019), a restarted version of gradient descent with the Polyak stepsizes (Polyak, 1987), is tuning-free:

Proposition 2 (Hazan & Kakade (2019)). The Adaptive Polyak algorithm from (Hazan & Kakade, 2019) is tuning-free in the deterministic setting.

This is far from the only solution, and we mention a few others next. Parameter-free methods augmented with normalization are also tuning-free and universal; e.g. plugging in $d_0 = \underline{D}$ in (Orabona, 2023) gives tuning-free algorithms matching SGD. The bisection algorithm from (Carmon & Hinder, 2022) is also tuning-free, as is the simple doubling trick. Finally, T-DoG and T-DoWG, variants of DoG and DoWG which use polylogarithmically smaller stepsizes, are also tuning-free, as the following direct corollary of (Ivgi et al., 2023; Khaled et al., 2023) shows.

Proposition 3. T-DoG and T-DoWG are tuning-free in the deterministic setting.

T-DoG and T-DoWG use the same stepsize structure as DoG and DoWG (given in equations (3) and (4)), but divide these stepsizes by running logarithmic factors as follows:

$$\text{T-DoG:}\quad \eta_t = \frac{r_t}{u_t \log_+ \frac{u_t}{u_0}}, \qquad \text{T-DoWG:}\quad \eta_t = \frac{r_t^2}{v_t \log_+ \frac{v_t}{v_0}}.$$

Both methods achieve the same convergence guarantee as in equation (5) up to polylogarithmic factors in the hints.
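For concreteness, here is how the tempered stepsizes modify the DoG/DoWG sketch from Section 3. The helper names are our own, and $u_0, v_0$ denote the first values of the running sums.

```python
import numpy as np


def log_plus(x):
    """log_+(x) = 1 + log(x), as defined in Section 1.1."""
    return 1.0 + np.log(x)


def t_dog_eta(r_t, u_t, u_0):
    """T-DoG stepsize: DoG's r_t / u_t divided by the running log factor."""
    return r_t / (u_t * log_plus(u_t / u_0))


def t_dowg_eta(r_t, v_t, v_0):
    """T-DoWG stepsize: DoWG's r_t^2 / v_t divided by the running log factor."""
    return r_t**2 / (v_t * log_plus(v_t / v_0))
```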
4.1. Impossibility Results in the Stochastic Setting

The positive results in the deterministic setting give us some hope of obtaining a tuning-free algorithm. Unfortunately, the stochastic setting turns out to be a tougher nut to crack. Our first major result, given below, slashes any hope of finding a tuning-free algorithm for smooth stochastic convex optimization.

Theorem 2. For any polylogarithmic function $\iota : \mathbb{R}^4 \to \mathbb{R}$ and any algorithm $A$, there exist a time horizon $T$, an $L$-smooth and convex function $f$, a stochastic oracle $O(f, \sigma_f)$, and valid hints $\underline{L}, \bar{L}, \underline{D}, \bar{D}, \underline{\sigma}, \bar{\sigma}$ such that the algorithm $A$ initialized at some $x_0$ returns with some constant probability a point $x_{\mathrm{out}}$ satisfying

$$\mathrm{Error}_A = f(x_{\mathrm{out}}) - f^* > \iota\!\left(\frac{\bar{L}}{\underline{L}}, \frac{\bar{D}}{\underline{D}}, \frac{\bar{\sigma}}{\underline{\sigma}}, T\right) \cdot \left(\frac{L D^2}{T} + \frac{\sigma_f D}{\sqrt{T}}\right),$$

where $D = \|x_0 - x^*\|$ is the initial distance to the optimum and $\sigma_f$ is the maximum norm of the stochastic gradient noise.

Proof idea. This lower bound is achieved by one-dimensional functions. In particular, we construct two one-dimensional quadratic functions $f$ and $h$ with associated oracles $O(f, \sigma_f)$ and $O(h, \sigma_h)$, and we supply the algorithm with hints that are valid for both functions and oracles. We show that with some constant probability, the algorithm observes the same stochastic gradients from both $O(f, \sigma_f)$ and $O(h, \sigma_h)$ for the entire run. Since the algorithm cannot tell the oracles apart, it must guarantee that equation (5) holds with high probability for both $f$ and $h$ if it is to be tuning-free. Now, if we choose $f$ and $h$ further apart, ensuring that their respective oracles return the same gradients with some constant probability becomes harder. On the other hand, if we choose $f$ and $h$ too close, the algorithm can conceivably guarantee that equation (5) holds (up to the same polylogarithmic factor of the hints) for both of them. By carefully choosing $f$ and $h$ to balance this tradeoff, we show that no algorithm can be tuning-free in the unbounded and stochastic setting. The full proof is provided in Section 8.3 in the appendix.

Comparison with prior lower bounds. The above theorem shows a fundamental separation between the deterministic and stochastic settings when the algorithm is not given knowledge of the problem parameters. The classical lower bounds for deterministic and stochastic optimization algorithms (Nesterov, 2018; Woodworth & Srebro, 2016; Carmon et al., 2019) rely on a chain construction that is agnostic to whether the optimization algorithm has access to problem parameters. On the other hand, lower bounds from the online learning literature show that tuning-free optimization matching SGD is impossible when the oracle can be adversarial (and not stochastic); see e.g. (Cutkosky & Boahen, 2017b; 2016). However, adversarial oracles are much stronger than stochastic oracles, as they can change the function being optimized in response to the algorithm's choices. Our lower bound is closest in spirit to the lower bounds from the stochastic multi-armed bandits literature, which also rely on confusing the algorithm with two close problems (Mannor & Tsitsiklis, 2004).

Our next result shows that tuning-free optimization is also impossible in the nonsmooth case.

Theorem 3. For any polylogarithmic function $\iota : \mathbb{R}^4 \to \mathbb{R}$ and any algorithm $A$, there exist a time horizon $T$, valid hints $\underline{G}, \bar{G}, \underline{D}, \bar{D}, \underline{\sigma}, \bar{\sigma}$, a $G$-Lipschitz and convex function $f$, and an oracle $O(f, \sigma_f)$ such that the algorithm $A$ returns with some constant probability a point $x_{\mathrm{out}}$ satisfying

$$\mathrm{Error}_A = f(x_{\mathrm{out}}) - f^* > \iota\!\left(\frac{\bar{G}}{\underline{G}}, \frac{\bar{D}}{\underline{D}}, \frac{\bar{\sigma}}{\underline{\sigma}}, T\right) \cdot \frac{(G + \sigma_f) D}{\sqrt{T}}.$$

The proof technique for this result relies on a similar construction to Theorem 2, but uses the absolute loss instead of quadratics.

Existing algorithms and upper bounds in the stochastic setting. Carmon & Hinder (2022) give a restarted variant of SGD with bisection search. If $f$ is $G$-Lipschitz and all the stochastic gradients are also bounded by $G$, their method uses the hint $\bar{G} \ge G$ and achieves the following convergence rate with probability at least $1 - \delta$:

$$f(\hat{x}) - f^* \le c \iota \left(\frac{G D}{\sqrt{T}} + \bar{G} \eta_\epsilon\right),$$

where $c$ is an absolute constant, $\iota$ a poly-logarithmic and double-logarithmic factor, and $\eta_\epsilon$ an input parameter. If we set $\eta_\epsilon = \underline{D}/T \le D/T$, then the guarantee of this method becomes

$$f(\hat{x}) - f^* \le c \iota \left(\frac{G D}{\sqrt{T}} + \frac{\bar{G} D}{T}\right).$$

Unfortunately, this does not meet our bar, as the polynomial dependence on $\bar{G}$ sits in a higher-order term. A similar result is achieved by DoG (Ivgi et al., 2023). A different sort of guarantee is achieved by Mhammedi & Koolen (2020), who give a method with regret

$$\sum_{t=0}^{T-1} \langle g_t, x_t - x^* \rangle \le c \iota \left(G D \sqrt{T} + G D^3\right)$$

for some absolute constant $c$ and polylogarithmic factor $\iota$. This result is in the adversarial setting of online learning, and clearly does not meet the bar for tuning-free optimization matching SGD due to the cubic term $D^3$.

4.2. Guarantees Under Benign Noise

In the last subsection, we saw that tuning-free optimization is in general impossible.
However, it is clear that it is sometimes possible to get within the performance of tuned SGD with self-tuning methods (Ivgi et al., 2023; Defazio & Mishchenko, 2023). Note that the oracles used in Theorems 2 and 3 provide stochastic gradients $g(x)$ whose noise is almost surely bounded (i.e. satisfies Assumption 1.1): $\|g(x) - \nabla f(x)\| \le R$. So boundedness alone is clearly not enough to enable tuning-free optimization. However, we know from prior results (e.g. (Carmon & Hinder, 2022)) that if we can reliably estimate the upper bound $R$ on the noise, we can adapt to an unknown distance to the optimum $D$ or an unknown smoothness constant $L$. The main issue is that the oracles in the lower bound of Theorem 2 make it impossible to do that: while the noise $n(x) = g(x) - \nabla f(x)$ is bounded almost surely by $R$, the algorithm only gets to observe the same noise $n(x)$ for the entire optimization run. This foils any attempt at estimating $R$ from the observed trajectory.

A note on notation in this section and the next. In the past section we used $\sigma$ to denote a uniform upper bound on the gradient noise, while in this section and the next we use $\sigma$ to denote the variance of the stochastic gradient noise $n(x)$ rather than a uniform upper bound on it. Instead, we use $R$ to denote the uniform upper bound on the noise.

We will see that for some notion of benign noise, tuning-free optimization matching SGD is possible. We will develop our results under a more general assumption on the distribution of the stochastic gradient noise $g(x) - \nabla f(x)$:

Assumption 4.1 (Noise with sub-gaussian norm). For all $x \in \mathbb{R}^d$, the noise vector $n(x) = g(x) - \nabla f(x)$ satisfies:

- $n(x)$ is unbiased: $\mathbb{E}[g(x)] = \nabla f(x)$;
- $n(x)$ has sub-gaussian norm with modulus $R$: $\mathrm{Prob}\left(\|n(x)\| \ge t\right) \le 2 \exp\left(-\frac{t^2}{R^2}\right)$ for all $t \ge 0$;
- $n(x)$ has bounded variance: $\mathbb{E}\left[\|n(x)\|^2\right] = \sigma^2 < +\infty$.

This assumption is very general: it subsumes bounded noise (where $R = \sigma_f$) and sub-gaussian noise. The next definition gives a notion of signal-to-noise that turns out to be key in characterizing benign noise distributions.

Definition 4.1. Suppose the stochastic gradient noise satisfies Assumption 4.1. We define the signal-to-noise ratio associated with the noise as

$$K_{\mathrm{snr}} = \frac{\sigma^2}{R^2}.$$

To better understand the meaning of $K_{\mathrm{snr}}$, we consider the following example. Let $Y$ be a random vector with mean $\mathbb{E}[Y] = \mu$ and variance $\mathbb{E}\left[\|Y - \mu\|^2\right] = \sigma^2$. Suppose further that the errors $Y - \mu$ are bounded almost surely by some $R$. Then $Y - \mu$ satisfies the conditions of Assumption 4.1. Let $Y_1, \ldots, Y_b$ be independent copies of $Y$. Through standard concentration results (see Lemma 8), we can show that with high probability and for large enough $b$,

$$\hat{\sigma}^2 = \frac{1}{b} \sum_{i=1}^{b} \|Y_i - \mu\|^2 \approx \sigma^2.$$

Now observe that if the ratio $K_{\mathrm{snr}}$ is small, then we cannot use the sample variance $\hat{\sigma}$ as an estimator for the almost-sure bound $R$. But if the ratio $K_{\mathrm{snr}}$ is closer to 1, then we have $\sigma^2 \approx R^2$ and we can use $\hat{\sigma}$ as an estimator for $R$. This fixes the problem we highlighted earlier: now we are able to get an accurate estimate of $R$ from the observed stochastic gradients. The next proposition gives examples of noise distributions where $K_{\mathrm{snr}}$ is close to 1.

Proposition 4. Suppose that the noise vectors $g(x) - \nabla f(x)$ follow one of the following two distributions:

- a Gaussian distribution with mean $0$ and covariance $\frac{\sigma^2}{d} I_{d \times d}$, with $\sigma > 0$; or
- a Bernoulli distribution, where $g(x) - \nabla f(x) = \pm \sigma \phi(x)$ with equal probability, for some $\phi$ such that $\|\phi(x)\|^2 = 1$ almost surely.

Then $K_{\mathrm{snr}} = \Theta(1)$.
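The following sketch illustrates the estimation idea: given a minibatch of stochastic gradients taken at a single point, the sample variance concentrates around $\sigma^2$, and when $K_{\mathrm{snr}}$ is bounded away from zero, $\hat{\sigma}/\sqrt{\theta}$ serves as a proxy for the modulus $R$. The function name and the concrete Gaussian test case are our own illustrative choices.

```python
import numpy as np


def estimate_noise_scale(grads, theta=0.5):
    """Estimate the noise scale from a minibatch of stochastic gradients.

    `grads` has shape (b, d): b stochastic gradients taken at the same point.
    The sample variance sigma_hat^2 concentrates around sigma^2, and when
    K_snr = sigma^2 / R^2 >= theta, sigma_hat / sqrt(theta) upper-bounds the
    sub-gaussian modulus R up to constants.
    """
    g_bar = grads.mean(axis=0)                  # estimate of the true gradient
    sigma_hat2 = np.mean(np.sum((grads - g_bar) ** 2, axis=1))
    return np.sqrt(sigma_hat2 / theta)          # proxy for R


# Gaussian example from Proposition 4: noise N(0, (sigma^2 / d) I_d), for
# which R is of order sigma and hence K_snr = Theta(1).
rng = np.random.default_rng(0)
d, b, sigma = 100, 64, 2.0
grads = np.ones(d) + (sigma / np.sqrt(d)) * rng.standard_normal((b, d))
print(estimate_noise_scale(grads), "vs. sigma =", sigma)
```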
We now give an algorithm whose convergence rate is characterized by the signal-to-noise ratio $K_{\mathrm{snr}}$. We combine a variance-estimation procedure with the T-DoG/T-DoWG algorithms in Algorithms 2 and 3. The next theorem gives the convergence of this algorithm. The theorem is generic, and does not by itself guarantee any tuning-free matching of SGD, but it leads to tuning-free matching of SGD when the signal-to-noise ratio is high enough.

Theorem 4. Suppose we are given access to stochastic gradient estimates $g(x)$ such that the noise vectors $g(x) - \nabla f(x) \in \mathbb{R}^d$ satisfy Assumption 4.1 with modulus $R$ and signal-to-noise ratio $K_{\mathrm{snr}}$. Suppose we run T-DoG or T-DoWG with variance estimation (Algorithms 2 and 3) with a minibatch size $b \ge 2$ large enough to satisfy

$$b \ge \frac{c}{(K_{\mathrm{snr}} - \theta)^2}\left(\log \frac{2 (b \vee d) T}{\delta} + 1\right),$$

where $c$ is some absolute constant and $\theta \in [0, K_{\mathrm{snr}}]$ is some known constant. Then Algorithms 2 and 3 with either option return a point $x_{\mathrm{out}}$ such that, with probability at least $1 - \delta$:

- If $f$ is $L$-smooth:
$$f(x_{\mathrm{out}}) - f^* \le c \iota \left(\frac{L D^2 b}{T_{\mathrm{total}}} + \frac{\theta^{-1} R D \sqrt{b}}{\sqrt{T_{\mathrm{total}}}}\right),$$
where $D = \|x_0 - x^*\|$, $c$ is an absolute constant, $\iota$ is a polylogarithmic factor of the hints, and $T_{\mathrm{total}}$ denotes the total number of stochastic gradient accesses.

- If $f$ is $G$-Lipschitz:
$$f(x_{\mathrm{out}}) - f^* \le c \iota \sqrt{G^2 + \theta^{-2} R^2}\, \frac{D \sqrt{b}}{\sqrt{T_{\mathrm{total}}}}.$$

Note on dimension dependence in Theorem 4. We note that the logarithmic dependence on the dimension $d$ can be removed if, rather than assuming the norm of the noise is subgaussian, we assume it is bounded.

On the surface, it looks like Theorem 4 simply trades knowledge of the absolute bound $R$ on the noise for knowledge of some constant $\theta$ that lies in the interval $[0, K_{\mathrm{snr}}]$. To see how Theorem 4 can be useful, consider the special cases given in Proposition 4. For these noise distributions, choosing a minibatch size $b = O\left(\log \frac{2dT}{\delta} + 1\right)$ suffices to ensure Algorithms 2 and 3 converge with the simple choice $\theta = \frac{1}{2}$. Even though we had no a priori knowledge of the variance $\sigma^2$ and did not assume the noise distribution was stationary, we could still optimize the function. In general, the minibatch size $b = O\left(\log \frac{2dT}{\delta} + 1\right)$ suffices as long as $K_{\mathrm{snr}}$ is bounded away from zero by some constant. The final cost of running the algorithm is $T_{\mathrm{total}} = b T = \iota T$, where $\iota$ is some polylogarithmic factor. Therefore, we only pay a logarithmic price for not knowing the distribution. Of course, if $K_{\mathrm{snr}}$ is small enough, there can be no optimization: there is not enough signal to do any estimation of the sub-gaussian modulus $R$. The distribution used in Theorem 2 does force $K_{\mathrm{snr}} \ll 1$.

5. Nonconvex Tuning-Free Optimization

In this section, we consider the case where the optimization problem (OPT) is possibly nonconvex. Throughout the section, we assume that $f$ is $L$-smooth and lower bounded by some $f^* \in \mathbb{R}$. In this setting, SGD with a tuned stepsize achieves the following rate in expectation:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le c \left(\sqrt{\frac{L (f(x_0) - f^*) \sigma^2}{T}} + \frac{L (f(x_0) - f^*)}{T}\right) \tag{9}$$

for some absolute constant $c > 0$. This rate is known to be tight for convergence in expectation (Arjevani et al., 2019). However, it is not known whether it is tight for returning a high-probability guarantee. The best-known high-probability convergence rate for SGD is given by (Liu et al., 2023, Theorem 4.1) and guarantees with probability at least $1 - \delta$ that

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \le 5 \sqrt{\frac{L (f(x_0) - f^*) R^2}{T}} + \frac{2 (f(x_0) - f^*) L}{T} + \frac{12 R^2 \log \frac{1}{\delta}}{T}. \tag{10}$$

We now consider tuning-free algorithms that can match the performance of SGD characterized by either equation (9) or equation (10).

Algorithm 1 Restarted SGD
Require: Initialization $y_0$, probability $\delta$, hints $\underline{R}, \bar{R}, \underline{\Delta}, \bar{\Delta}, \underline{L}, \bar{L}$, total budget $T_{\mathrm{total}}$
1: Set $\eta_\epsilon = 1$
2: if $T_{\mathrm{total}} < N$ then
3:   Return $y_0$.
4: end if
5: Set the per-epoch iteration budget as $T = T_{\mathrm{total}}/N$.
6: for $n = 1$ to $N$ do
7:   Set $\eta = \eta_\epsilon 2^n$.
8:   Run SGD for $T$ iterations with stepsize $\eta$ starting from $y_0$ to get outputs $x^n_1, \ldots, x^n_T$.
9:   Set $(y_n, \hat{g}_n) = \mathrm{FindLeader}(S, \delta, T)$ (see Algorithm 4).
10: end for
11: Return $y = \arg\min_{n \in [N]} \|\hat{g}_n\|$.

Per Definition 1.1, an algorithm $B$ is given (1) an initialization $x_0$, (2) a budget of $T_{\mathrm{total}}$ stochastic gradient accesses, and (3) hints $\underline{L}, \bar{L}, \underline{R}, \bar{R}, \underline{\Delta}, \bar{\Delta}$ on the problem parameters such that (a) if $L$ is the smoothness constant of $f$ then $L \in [\underline{L}, \bar{L}]$, (b) $R \in [\underline{R}, \bar{R}]$, and (c) $\Delta \overset{\text{def}}{=} f(x_0) - f^* \in [\underline{\Delta}, \bar{\Delta}]$. We call $B$ strongly tuning-free if it matches the performance of SGD characterized by equation (9) up to polylogarithmic factors. Alternatively, if it instead matches the weaker guarantee given by equation (10), then we call it weakly tuning-free. Our first result in this setting shows that we cannot hope to achieve the rate given by equation (9) in high probability, even given access to hints on all the problem parameters.

Theorem 5. For any polylogarithmic function $\iota : \mathbb{R}^4 \to \mathbb{R}$ and any algorithm $A$, there exist a time horizon $T$, valid hints $\underline{L}, \bar{L}, \underline{\Delta}, \bar{\Delta}, \underline{\sigma}, \bar{\sigma}$, an $L$-smooth and lower-bounded function $f$, and an oracle $O(f, \sigma_f)$ such that the algorithm $A$ returns with some constant probability a point $x_{\mathrm{out}}$ satisfying

$$\mathrm{Error}_A = \|\nabla f(x_{\mathrm{out}})\|^2 > \iota\!\left(\frac{\bar{L}}{\underline{L}}, \frac{\bar{\Delta}}{\underline{\Delta}}, \frac{\bar{\sigma}}{\underline{\sigma}}, T\right) \cdot \left(\sqrt{\frac{L \Delta \sigma_f^2}{T}} + \frac{L \Delta}{T}\right),$$

where $\Delta = f(x_0) - f^*$.

Surprisingly, our next theorem shows that the rate given by equation (10) is achievable up to polylogarithmic factors given only access to hints. To achieve this, we use a restarted variant of SGD (Algorithm 1) combined with a leader-finding procedure that selects a well-performing iterate by subsampling.

Theorem 6 (Convergence of Restarted SGD). Let $f$ be an $L$-smooth function lower bounded by $f^*$, and suppose the stochastic gradient noise vectors satisfy Assumption 4.1. Suppose that we are given the following hints on the problem parameters: (a) $L \in [\underline{L}, \bar{L}]$, (b) $R_f \in [\underline{R}, \bar{R}]$, and (c) $\Delta_f \overset{\text{def}}{=} f(x_0) - f^* \in [\underline{\Delta}, \bar{\Delta}]$. Then there exists some absolute constant $c$ such that the output of Algorithm 1 satisfies, after $T_{\mathrm{total}} \log_+ \frac{1}{\delta}$ stochastic gradient evaluations,

$$\|\nabla f(y)\|^2 \le c \left(\frac{R^2 \log \frac{2 d \max\left(\log \frac{1}{\delta},\, N\right)}{\delta}}{T_{\mathrm{total}}} + \sqrt{\frac{L (f(y_0) - f^*) R^2}{T_{\mathrm{total}}}} + \frac{(f(y_0) - f^*) L}{T_{\mathrm{total}}}\right),$$

where $c$ is an absolute constant, $N$ is a polylogarithmic function of the hints defined in equation (11), and $d$ is the problem dimensionality.

Discussion of Theorem 6. This theorem shows that in the nonconvex setting, we pay only an additional polylogarithmic factor to achieve the same high-probability rate as when we know all parameters. We emphasize that we do not know whether the rate given by equation (10) is tight, but it is the best in the literature. Finally, the logarithmic dependence on the dimension $d$ can be removed if, rather than assuming the norm of the noise is subgaussian, we assume it is bounded almost surely.

Proof idea. The proof is an application of the so-called doubling trick with a careful comparison procedure. If we start with a small enough stepsize, we only need to double a logarithmic number of times until we find a stepsize $\eta$ such that $\frac{\eta^*}{2} \le \eta \le \eta^*$, where $\eta^*$ is the optimal stepsize for SGD on this problem. We therefore run SGD for $N$ epochs with a carefully chosen $N$, each time doubling the stepsize. At the end of every SGD run, we run the FindLeader procedure (Algorithm 4) to get with high probability a point $y_n$ such that

$$\|\nabla f(y_n)\|^2 = O\!\left(\frac{1}{T} \sum_{t=1}^{T} \|\nabla f(x^n_t)\|^2\right),$$

where $x^n_1, \ldots, x^n_T$ are the SGD iterates from the $n$-th epoch. Finally, we know that at least one of these $N$ points $y_1, \ldots, y_N$ has a small gradient norm, so we return the point with the minimal estimated gradient norm and bound the estimation error as a function of $T$.
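A minimal sketch of Algorithm 1 follows, assuming $N$ and $\eta_\epsilon$ are supplied. The `find_leader` helper below is a simplified stand-in that subsamples candidate iterates and estimates their gradient norms; the actual FindLeader procedure (Algorithm 4) carries high-probability guarantees that this sketch does not reproduce, and the sample counts `m` and `k` are our own illustrative choices.

```python
import numpy as np


def find_leader(iterates, grad, m=32, k=8, rng=None):
    """Simplified stand-in for FindLeader (Algorithm 4): subsample k candidate
    iterates, estimate each gradient norm with m stochastic gradients, and
    return the candidate with the smallest estimate, plus the estimate."""
    rng = rng or np.random.default_rng(0)
    candidates = [iterates[i] for i in rng.integers(0, len(iterates), size=k)]
    estimates = [np.linalg.norm(np.mean([grad(x) for _ in range(m)], axis=0))
                 for x in candidates]
    j = int(np.argmin(estimates))
    return candidates[j], estimates[j]


def restarted_sgd(y0, grad, T_total, N, eta_eps):
    """Sketch of Algorithm 1 (Restarted SGD) with doubled stepsizes."""
    if T_total < N:                                        # steps 2-4
        return y0
    T = T_total // N                                       # step 5
    leaders, scores = [], []
    for n in range(1, N + 1):                              # step 6
        eta, x, iterates = eta_eps * 2**n, y0.copy(), []   # step 7
        for _ in range(T):                                 # step 8: SGD from y_0
            x = x - eta * grad(x)
            iterates.append(x.copy())
        y_n, g_n = find_leader(iterates, grad)             # step 9
        leaders.append(y_n)
        scores.append(g_n)
    return leaders[int(np.argmin(scores))]                 # step 11
```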
The total number of gradient accesses performed is at most $N (T + M_T) \le T_{\mathrm{total}} \log_+ \frac{1}{\delta}$. Therefore, both the restarting and comparison procedures add at most a logarithmic number of gradient accesses.

Related work. Many papers give high-probability bounds for SGD or AdaGrad and their variants in the nonconvex setting (Ghadimi & Lan, 2013; Madden et al., 2020; Lei & Tang, 2021; Li & Orabona, 2019; 2020; Faw et al., 2022; Kavis et al., 2022), but to the best of our knowledge none give a tuning-free algorithm matching SGD per Definition 1.1. The FindLeader procedure is essentially extracted from (Madden et al., 2020, Theorem 13), and is similar to the post-processing step in (Ghadimi & Lan, 2013).

Comparison with the convex setting. The rate achieved by Theorem 6 stands in contrast to the best-known rates in the convex setting, where we suffer from a polynomial dependence on the hints, as in equation (7). One potential reason for this divergence is the difficulty of telling apart good and bad points. In the convex setting, we ask for a point $y$ with a small function value $f(y)$. And while the oracle gives us access to stochastic realizations of $f(y)$, the error in those realizations is not controlled. Instead, to compare two points $y_1$ and $y_2$ we have to rely on stochastic gradient information to approximate $f(y_1) - f(y_2)$, and this seems to be too difficult without a priori control on the distance between $y_1$ and $y_2$. On the other hand, in the nonconvex setting, such comparison is feasible through sampling methods like Algorithm 4.

6. Conclusion and Open Problems

We have reached the end of our investigation. To summarize: we defined tuning-free algorithms and studied several settings where tuning-free optimization is possible, and several where we proved impossibility results. Yet many open questions remain. For example, tuning-free optimization might be possible in the finite-sum setting, where we can periodically evaluate the function value exactly. The upper bounds we develop in both the convex and nonconvex settings require quite stringent assumptions on the noise (such as boundedness or sub-gaussian norm), and it is not known whether they can be relaxed to expected smoothness (Gower et al., 2019; Khaled & Richtárik, 2020) or some variant of it. In the nonconvex case we only consider smooth objectives, whereas in deep learning the objectives are usually highly nonsmooth, and exploring this area may yield more practically useful insights. Finally, we did not study tuning-free algorithms for strongly convex objectives. We leave these questions to future work.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

We thank Aaron Defazio, Yair Carmon, and Oliver Hinder for discussions during the preparation of this work. This work is partially supported by National Science Foundation Grant NSF-OAC-2411299.

References

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Attia, A. and Koren, T.
SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv preprint arXiv:2302.08783, 2023.

Attia, A. and Koren, T. How free is parameter-free stochastic optimization? arXiv preprint arXiv:2402.03126, 2024.

Baydin, A. G., Cornish, R., Martínez-Rubio, D., Schmidt, M., and Wood, F. Online learning rate adaptation with hypergradient descent. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

Beck, A. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics; Mathematical Optimization Society, 2017. ISBN 9781611974997.

Bischl, B., Binder, M., Lang, M., Pielok, T., Richter, J., Coors, S., Thomas, J., Ullmann, T., Becker, M., Boulesteix, A., Deng, D., and Lindauer, M. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. WIREs Data Mining and Knowledge Discovery, 13(2), January 2023. ISSN 1942-4795. doi: 10.1002/widm.1484.

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. GPT-NeoX-20B: an open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018. doi: 10.1137/16M1080173.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Carmon, Y. and Hinder, O. Making SGD parameter-free. In Loh, P. and Raginsky, M. (eds.), Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pp. 2360-2389. PMLR, 2022.

Carmon, Y. and Hinder, O. The price of adaptivity in stochastic convex optimization. arXiv preprint arXiv:2402.10898, 2024.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1-2):315-355, September 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01431-x.

Chen, X. and Hazan, E. A nonstochastic control approach to optimization. arXiv preprint arXiv:2301.07902, 2023.

Cutkosky, A. Artificial constraints and hints for unbounded online learning. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 874-894. PMLR, 25-28 Jun 2019.

Cutkosky, A. and Boahen, K. A. Online convex optimization with unconstrained domains and losses. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R.
(eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 748-756, 2016.

Cutkosky, A. and Boahen, K. A. Stochastic and adversarial online learning without hyperparameters. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5059-5067, 2017a.

Cutkosky, A. and Boahen, K. A. Online learning without prior information. In Kale, S. and Shamir, O. (eds.), Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pp. 643-677. PMLR, 2017b.

Cutkosky, A., Defazio, A., and Mehta, H. Mechanic: a learning rate tuner. arXiv preprint arXiv:2306.00144, 2023.

Defazio, A. and Mishchenko, K. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Kalai, A. T. and Mohri, M. (eds.), COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pp. 257-269. Omnipress, 2010.

Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power of adaptivity in SGD: self-tuning step sizes with unbounded gradients and affine variance. In Loh, P. and Raginsky, M. (eds.), Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pp. 313-355. PMLR, 2022.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013. doi: 10.1137/120880811.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. SGD: General analysis and improved rates. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5200-5209. PMLR, 2019.

Hazan, E. Introduction to Online Convex Optimization. Book draft, MIT University Press, 2022.

Hazan, E. and Kakade, S. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313, 2019.

Ivgi, M., Hinder, O., and Carmon, Y. DoG is SGD's best friend: a parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023.

Jain, P., Nagaraj, D., and Netrapalli, P. Making the last iterate of SGD information theoretically optimal. In Beygelzimer, A. and Hsu, D. (eds.), Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pp. 1752-1755. PMLR, 2019.

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.

Kavis, A., Levy, K. Y., Bach, F. R., and Cevher, V. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R.
(eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 6257-6266, 2019.

Kavis, A., Levy, K. Y., and Cevher, V. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In ICLR. OpenReview.net, 2022.

Khaled, A. and Richtárik, P. Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.

Khaled, A., Mishchenko, K., and Jin, C. DoWG unleashed: An efficient universal parameter-free gradient descent method. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 6748-6769. Curran Associates, Inc., 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Lan, G., Ouyang, Y., and Zhang, Z. Optimal and parameter-free gradient minimization methods for smooth optimization. arXiv preprint arXiv:2310.12139, 2023.

Lei, Y. and Tang, K. Learning rates for stochastic gradient descent with nonconvex objectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4505-4511, December 2021. ISSN 1939-3539. doi: 10.1109/tpami.2021.3068154.

Levy, K. Y. Online to offline conversions, universality and adaptive minibatch sizes. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 1613-1622, 2017.

Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6501-6510, 2018.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In Chaudhuri, K. and Sugiyama, M. (eds.), The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pp. 983-992. PMLR, 2019.

Li, X. and Orabona, F. A high probability analysis of adaptive SGD with momentum. arXiv preprint arXiv:2007.14294, 2020.

Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. High probability convergence of stochastic gradient methods. arXiv preprint arXiv:2302.14843, 2023.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Madden, L., Dall'Anese, E., and Becker, S. High-probability convergence bounds for non-convex stochastic gradient descent. arXiv preprint arXiv:2006.05610, 2020.

Malitsky, Y. and Mishchenko, K. Adaptive gradient descent without descent.
In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 6702-6712. PMLR, 2020.

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623-648, 2004. ISSN 1532-4435.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Abernethy, J. D. and Agarwal, S. (eds.), Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pp. 2858-2887. PMLR, 2020.

Mhammedi, Z., Koolen, W. M., and van Erven, T. Lipschitz adaptivity with multiple learning rates in online learning. In Beygelzimer, A. and Hsu, D. (eds.), Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pp. 2490-2511. PMLR, 2019.

Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381-404, 2014. doi: 10.1007/s10107-014-0790-0.

Nesterov, Y. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018. ISBN 3319915770.

Orabona, F. Normalized gradients for all. arXiv preprint arXiv:2308.05621, 2023.

Orabona, F. and Cutkosky, A. ICML 2020 tutorial on parameter-free online optimization. ICML Tutorials, 2020.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 577-585, 2016.

Paquette, C. and Scheinberg, K. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349-376, 2020. doi: 10.1137/18M1216250.

Polyak, B. T. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software Inc., New York, 1987.

Renegar, J. and Grimmer, B. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of Computational Mathematics, 22(1):211-256, 2021. doi: 10.1007/s10208-021-09502-2.

Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., and Fleuret, F. Optimizer benchmarking needs to account for hyperparameter tuning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 9036-9045. PMLR, 2020.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.
LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., and Lacoste-Julien, S. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Vaswani, S., Dubois-Taine, B., and Babanezhad, R. Towards noise-adaptive, problem-adaptive stochastic gradient descent. arXiv preprint arXiv:2110.11442, 2021.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. doi: 10.1017/9781108231596.

Woodworth, B. E. and Srebro, N. Tight complexity bounds for optimizing composite objectives. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3639-3647, 2016.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al.
S., Manica, M., Nayak, N., Teehan, R., Albanie, S., Shen, S., Ben-David, S., Bach, S. H., Kim, T., Bers, T., Fevry, T., Neeraj, T., Thakker, U., Raunak, V., Tang, X., Yong, Z.-X., Sun, Z., Brody, S., Uri, Y., Tojarieh, H., Roberts, A., Chung, H. W., Tae, J., Phang, J., Press, O., Li, C., Narayanan, D., Bourfoune, H., Casper, J., Rasley, J., Ryabinin, M., Mishra, M., Zhang, M., Shoeybi, M., Peyrounette, M., Patry, N., Tazi, N., Sanseviero, O., Platen, P. v., Cornette, P., Lavall ee, P. F., Lacroix, R., Rajbhandari, S., Gandhi, S., Smith, S., Requena, S., Patil, S., Dettmers, T., Baruwa, A., Singh, A., Cheveleva, A., Ligozat, A.-L., Subramonian, A., N ev eol, A., Lovering, C., Garrette, D., Tunuguntla, D., Reiter, E., Taktasheva, E., Voloshina, E., Bogdanov, E., Winata, G. I., Schoelkopf, H., Kalo, J.-C., Novikova, J., Forde, J. Z., Clive, J., Kasai, J., Kawamura, K., Hazan, L., Carpuat, M., Clinciu, M., Kim, N., Cheng, N., Serikov, O., Antverg, O., Wal, O. v. d., Zhang, R., Zhang, R., Gehrmann, S., Mirkin, S., Pais, S., Shavrina, T., Scialom, T., Yun, T., Limisiewicz, T., Rieser, V., Protasov, V., Mikhailov, V., Pruksachatkun, Y., Belinkov, Y., Bamberger, Z., Kasner, Z., Rueda, A., Pestana, A., Feizpour, A., Khan, A., Faranak, A., Santos, A., Hevia, A., Unldreaj, A., Aghagol, A., Abdollahi, A., Tammour, A., Haji Hosseini, A., Behroozi, B., Ajibade, B., Saxena, B., Ferrandis, C. M., Mc Duff, D., Contractor, D., Lansky, D., David, D., Kiela, D., Nguyen, D. A., Tan, E., Baylor, E., Ozoani, E., Mirza, F., Ononiwu, F., Rezanejad, H., Jones, H., Bhattacharya, I., Solaiman, I., Sedenko, I., Nejadgholi, I., Passmore, J., Seltzer, J., Sanz, J. B., Dutra, L., Samagaio, M., Elbadri, M., Mieskes, M., Gerchick, M., Akinlolu, M., Mc Kenna, M., Qiu, M., Ghauri, M., Burynok, M., Abrar, N., Rajani, N., Elkott, N., Fahmy, N., Samuel, O., An, R., Kromann, R., Hao, R., Alizadeh, S., Shubber, S., Wang, S., Roy, S., Viguier, S., Le, T., Oyebade, T., Le, T., Yang, Y., Nguyen, Z., Kashyap, A. R., Palasciano, A., Callahan, A., Shukla, A., Miranda Escalada, A., Singh, A., Beilharz, B., Wang, B., Brito, C., Zhou, C., Jain, C., Xu, C., Fourrier, C., Peri n an, D. L., Molano, D., Yu, D., Manjavacas, E., Barth, F., Fuhrimann, Tuning-Free Stochastic Optimization F., Altay, G., Bayrak, G., Burns, G., Vrabec, H. U., Bello, I., Dash, I., Kang, J., Giorgi, J., Golde, J., Posada, J. D., Sivaraman, K. R., Bulchandani, L., Liu, L., Shinzato, L., Bykhovetz, M. H. d., Takeuchi, M., P amies, M., Castillo, M. A., Nezhurina, M., S anger, M., Samwald, M., Cullan, M., Weinberg, M., Wolf, M. D., Mihaljcic, M., Liu, M., Freidank, M., Kang, M., Seelam, N., Dahlberg, N., Broad, N. M., Muellner, N., Fung, P., Haller, P., Chandrasekhar, R., Eisenberg, R., Martin, R., Canalli, R., Su, R., Su, R., Cahyawijaya, S., Garda, S., Deshmukh, S. S., Mishra, S., Kiblawi, S., Ott, S., Sang-aroonsiri, S., Kumar, S., Schweter, S., Bharati, S., Laud, T., Gigant, T., Kainuma, T., Kusa, W., Labrak, Y., Bajaj, Y. S., Venkatraman, Y., Xu, Y., Xu, Y., Xu, Y., Tan, Z., Xie, Z., Ye, Z., Bras, M., Belkada, Y., and Wolf, T. BLOOM: a 176b-parameter open-access multilingual language model. ar Xiv preprint ar Xiv:2211.05100, abs/2211.05100, 2022. Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tuning large neural networks via zero-shot hyperparameter transfer. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. 
Appendix

7. Proofs for Section 3

Proposition 1. Coin betting through Online Newton Steps with Hints (Cutkosky, 2019, Algorithm 1) is tuning-free in the bounded setting.

Proof. In the bounded setting, Cutkosky (2019) gives an algorithm that takes as parameters $\epsilon, \alpha$ and achieves the regret bound
$$\sum_{t=0}^{T-1} \langle g_t, x_t - x^\star \rangle \le \epsilon + GD + \|x^\star - x_0\|\, G \log\left( \frac{\|x^\star - x_0\|\, G \exp(\alpha/4G^2)}{\epsilon}\Big(1 + \sum_{t=0}^{T-1} \|g_t\|^2\Big) \right) + \|x^\star - x_0\| \sqrt{ \sum_{t=0}^{T-1} \|g_t\|^2 \,\log\left(1 + \frac{\sum_{t=0}^{T-1} \|g_t\|^2}{10 \exp(\alpha/2G^2)} \|x^\star - x_0\|^2 \right)}.$$
If we set $\epsilon = \underline{G}\,\underline{D}$, $\alpha = \underline{G}^2$, use the upper bound $\|x_0 - x^\star\| \le D$, and simplify, we get the regret
$$\sum_{t=0}^{T-1} \langle g_t, x_t - x^\star \rangle \le \underline{G}\,\underline{D} + GD + DG \log\left( \frac{GD}{\underline{G}\,\underline{D}}\,(1 + G^2 T) \right) + D \sqrt{ \sum_{t=0}^{T-1} \|g_t\|^2 \,\log\left(1 + \frac{10\, G^2 T D^2}{\underline{G}^2\,\underline{D}^2}\right)}.$$
Observe that because $\underline{G} \le G$ and $\underline{D} \le D$, the above can be simplified to
$$\sum_{t=0}^{T-1} \langle g_t, x_t - x^\star \rangle \le 2 GD \log_+\left( \frac{GD\,T}{\underline{G}\,\underline{D}} \right) + D \sqrt{ \sum_{t=0}^{T-1} \|g_t\|^2 \,\log_+\left( \frac{10\, G^2 T D^2}{\underline{G}^2\,\underline{D}^2} \right)}.$$
Call the maximum of the two logarithmic terms $\iota$; then the above rate is
$$\sum_{t=0}^{T-1} \langle g_t, x_t - x^\star \rangle \le GD\,\iota + D \sqrt{\iota \sum_{t=0}^{T-1} \|g_t\|^2}. \tag{12}$$
Applying online-to-batch conversion starting from equation (12) proves the algorithm is tuning-free. For the smooth setting, it suffices to observe that under a bounded domain we have, for any t,
$$\|g_t\| \le \|g_t - \nabla f(x_t)\| + \|\nabla f(x_t)\| = \|g_t - \nabla f(x_t)\| + \|\nabla f(x_t) - \nabla f(x^\star)\| \le \sigma + L \|x_t - x^\star\| \le \sigma + LD.$$
Combining this and following the online-to-batch conversion of (Levy, 2017) shows the algorithm considered is tuning-free in the smooth setting as well. □

We will make use of the following lemmas throughout the upper bound proofs for DoG and DoWG. Throughout, we use the quantities
$$r_t = \max\left(\max_{k \le t} \|x_k - x_0\|,\ r_\epsilon\right),\qquad d_t = \max_{k \le t} \|x_k - x^\star\|,\qquad u_t = \sum_{k=0}^{t} \|g_k\|^2,\qquad v_t = \sum_{k=0}^{t} r_k^2 \|g_k\|^2,\qquad \theta_{t,\delta} \overset{\mathrm{def}}{=} \log \frac{60 \log(6t)}{\delta}.$$

Lemma 1 ((Ivgi et al., 2023, Lemma 7)). Let $C_t \in \mathcal{F}_{t-1}$ be a nonnegative and nondecreasing sequence, and let $X_t$ be a martingale difference sequence adapted to $\mathcal{F}_t$ such that $|X_t| \le C_t$ with probability 1 for all t. Then for all $\delta \in (0,1)$ and all $\hat{X}_t \in \mathcal{F}_{t-1}$ such that $|\hat{X}_t| \le C_t$ with probability 1, we have for all $c > 0$:
$$\operatorname{Prob}\left( \exists\, t \le T:\ \Big| \sum_{i=1}^{t} X_i \Big| \ge 8 \sqrt{\theta_{t,\delta} \sum_{i=1}^{t} (X_i - \hat{X}_i)^2} + 16\, c\, \theta_{t,\delta}^2 \right) \le \delta + \operatorname{Prob}\left( \exists\, t \le T : C_t > c \right).$$

Lemma 2 ((Ivgi et al., 2023, Lemma 3)). Let $s_0, s_1, \ldots, s_T$ be a positive increasing sequence. Then there exists some $t \le T$ such that
$$\frac{s_t}{\sum_{k=0}^{t-1} s_k} \le \frac{e}{T} \max\left( 2 \log_+ \frac{s_T}{s_0},\ 1 \right).$$

Lemma 3 ((Ivgi et al., 2023, Lemma 1)). Suppose that f is convex and has a minimizer $x^\star$. Then the iterates generated by DoG satisfy for each t:
$$\sum_{k=0}^{t-1} r_k \langle g_k, x_k - x^\star \rangle \le r_t \left( 2 d_t + r_t \right) \sqrt{u_{t-1}}.$$

Lemma 4. Suppose that f is convex and has a minimizer $x^\star$. Then the iterates generated by DoWG satisfy for every t:
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}} + \sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k) - g_k, x_k - x^\star \rangle.$$

Proof. This is a modification of (Khaled et al., 2023, Lemma 3) to account for the case where $g_k \neq \nabla f(x_k)$ (i.e. when the gradients used are not deterministic), following (Ivgi et al., 2023, Lemma 1). We start from
$$\|x_{k+1} - x^\star\|^2 = \|x_k - \eta_k g_k - x^\star\|^2 = \|x_k - x^\star\|^2 + \eta_k^2 \|g_k\|^2 - 2 \eta_k \langle g_k, x_k - x^\star \rangle.$$
Rearranging we get
$$2 \eta_k \langle g_k, x_k - x^\star \rangle \le \|x_k - x^\star\|^2 - \|x_{k+1} - x^\star\|^2 + \eta_k^2 \|g_k\|^2.$$
Multiplying both sides by $\frac{r_k^2}{2 \eta_k}$ we get
$$r_k^2 \langle g_k, x_k - x^\star \rangle \le \frac{r_k^2}{2 \eta_k} \left( \|x_k - x^\star\|^2 - \|x_{k+1} - x^\star\|^2 \right) + \frac{r_k^2 \eta_k}{2} \|g_k\|^2.$$
We then follow the same proof as in (Khaled et al., 2023, Lemma 3) to get
$$\sum_{k=0}^{t-1} r_k^2 \langle g_k, x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}}. \tag{13}$$
We then decompose
$$\sum_{k=0}^{t-1} r_k^2 \langle g_k, x_k - x^\star \rangle = \sum_{k=0}^{t-1} r_k^2 \langle g_k - \nabla f(x_k), x_k - x^\star \rangle + \sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle.$$
Plugging back into equation (13) we get
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}} + \sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k) - g_k, x_k - x^\star \rangle. \tag{14}$$
□

7.1. Proof of Theorem 1

Theorem 1. DoG and DoWG are tuning-free in the bounded setting. That is, there exists some $\iota = \operatorname{poly} \log\left( \bar{D}/\underline{D},\ T,\ \delta^{-1} \right)$ such that, with probability at least $1 - \delta$,
$$f(x_{\mathrm{out}}) - f^\star \le \begin{cases} \iota \left( \dfrac{L D^2}{T} + \dfrac{\sigma D}{\sqrt{T}} \right) & \text{if } f \text{ is } L\text{-smooth}, \\[4pt] \iota\ \dfrac{G D}{\sqrt{T}} & \text{if } f \text{ is } G\text{-Lipschitz}. \end{cases}$$
This rate is achieved simultaneously for both classes of functions without prior knowledge of whether f is smooth or Lipschitz (and thus no usage of the hints $\underline{L}, \bar{L}, \underline{G}, \bar{G}$).
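Since the quantities $r_t$, $d_t$, $u_t$, and $v_t$ recur throughout the proofs below, the following is a minimal sketch (ours, not from the original papers) of the DoG and DoWG updates they refer to, written in plain NumPy. The stepsize normalizations follow the definitions above (DoG: $\eta_t = r_t/\sqrt{u_t}$; DoWG: $\gamma_t = r_t^2/\sqrt{v_t}$); the exact constants and safeguards of (Ivgi et al., 2023; Khaled et al., 2023) are omitted, and the function names are our own.

```python
import numpy as np

def dog_dowg(grad_oracle, x0, r_eps, T, variant="dog"):
    """Minimal sketch of DoG / DoWG (no projection, no safeguards).

    grad_oracle(x) should return a stochastic gradient g(x);
    r_eps > 0 is the initial distance estimate (the hint on D).
    """
    x = x0.copy()
    r = r_eps          # r_t = max(max_{k<=t} ||x_k - x_0||, r_eps)
    u = 0.0            # u_t = sum of ||g_k||^2          (DoG normalization)
    v = 0.0            # v_t = sum of r_k^2 ||g_k||^2    (DoWG normalization)
    weighted_sum, weight_total = np.zeros_like(x0), 0.0
    for _ in range(T):
        r = max(r, float(np.linalg.norm(x - x0)))
        g = grad_oracle(x)
        u += float(np.linalg.norm(g)) ** 2
        v += r ** 2 * float(np.linalg.norm(g)) ** 2
        if variant == "dog":
            eta, w = r / np.sqrt(u + 1e-12), r        # averaging weight r_k
        else:
            eta, w = r ** 2 / np.sqrt(v + 1e-12), r ** 2  # averaging weight r_k^2
        weighted_sum += w * x                         # average the pre-step iterate
        weight_total += w
        x = x - eta * g
    return weighted_sum / weight_total                # the weighted average iterate
```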
Proof of Theorem 1. We first handle the case that $T < 4 \log_+ \frac{\bar{D}}{\underline{D}}$. In this case we just return $x_0$. If f is G-Lipschitz, then by convexity we have
$$f(x_0) - f^\star \le \langle \nabla f(x_0), x_0 - x^\star \rangle \le \|\nabla f(x_0)\| \, \|x_0 - x^\star\| \le GD \le 2 GD \sqrt{\frac{\log_+(\bar{D}/\underline{D})}{T}},$$
where the last step uses $T < 4 \log_+(\bar{D}/\underline{D})$. If f is L-smooth, then by smoothness we have
$$f(x_0) - f^\star \le \frac{L}{2} \|x_0 - x^\star\|^2 \le \frac{L D^2}{2} \le \frac{2 L D^2 \log_+(\bar{D}/\underline{D})}{T}.$$
Therefore in both cases the point we return achieves a small enough loss almost surely. Throughout the rest of the proof, we shall assume that $T \ge 4 \log_+ \frac{\bar{D}}{\underline{D}}$.

Part 1: DoG. In the nonsmooth setting, this is a straightforward consequence of (Ivgi et al., 2023, Proposition 3). In particular, when using DoG with $r_\epsilon = \underline{D}$, Corollary 1 in their work gives that with probability $1 - \delta$ there exist some $\tau \in [T]$ and an absolute constant $c > 0$ such that
$$f(\bar{x}_\tau) - f^\star \le c\, \frac{DG}{\sqrt{T}} \log \frac{60 \log (6\tau)}{\delta},$$
where $\bar{x}_t \overset{\mathrm{def}}{=} \frac{1}{\sum_{i=0}^{t-1} r_i} \sum_{i=0}^{t-1} r_i x_i$. For the smooth setting, we start with Lemma 3 to get
$$\sum_{k=0}^{t-1} r_k \langle \nabla f(x_k), x_k - x^\star \rangle \le r_t \left( 2 d_t + r_t \right) \sqrt{u_{t-1}} + \sum_{k=0}^{t-1} r_k \langle \nabla f(x_k) - g_k, x_k - x^\star \rangle. \tag{15}$$
We follow (Ivgi et al., 2023, Proposition 3) and modify the proof in a straightforward manner to accommodate the assumption of bounded noise (rather than bounded gradients). Define
$$\tau_i = \min \left\{ \min \left\{ k \mid r_k \ge 2 r_{\tau_{i-1}} \right\},\ T \right\}, \qquad \tau_0 \overset{\mathrm{def}}{=} 0,$$
and denote by K the first index such that $\tau_K = T$. Define $X_k = r_k \langle g_k - \nabla f(x_k), x_k - x^\star \rangle$, $\hat{X}_k = 0$, and $C_k = r_k d_k \sigma$. Observe that $x_k$ is determined by $\mathcal{F}_{k-1}$, and since $r_k = \max_{t \le k} \max(\|x_t - x_0\|, r_\epsilon)$, it is also determined by $\mathcal{F}_{k-1}$. Therefore
$$\mathbb{E} \left[ X_k \mid \mathcal{F}_{k-1} \right] = r_k \langle \mathbb{E} \left[ g_k - \nabla f(x_k) \right], x_k - x^\star \rangle = 0.$$
Moreover, observe that $|X_k| \le r_k \|g_k - \nabla f(x_k)\| \, \|x_k - x^\star\| \le r_k d_k \sigma$. Therefore the $X_k$ form a martingale difference sequence, and we can apply Lemma 1 (with $c = r_t d_t \sigma$) to get that with probability $1 - \delta$, for every $t \le T$,
$$\left| \sum_{k=0}^{t-1} r_k \langle g_k - \nabla f(x_k), x_k - x^\star \rangle \right| \le 8 d_t r_t \sqrt{\theta_{t,\delta} \sum_{k=0}^{t-1} \|g_k - \nabla f(x_k)\|^2} + 16 d_t r_t \theta_{t,\delta}^2 \sigma \le 16 d_t r_t \theta_{t,\delta} \sigma \sqrt{T}, \tag{16}$$
using $\sum_{k=0}^{t-1} \|g_k - \nabla f(x_k)\|^2 \le \sigma^2 T$ and $1 \le \theta_{t,\delta} \le \sqrt{T}$. Now observe that we can use equation (16) to get
$$\left| \sum_{k=\tau_{i-1}}^{\tau_i - 1} r_k \langle g_k - \nabla f(x_k), x_k - x^\star \rangle \right| \le \left| \sum_{k=0}^{\tau_i - 1} r_k \langle g_k - \nabla f(x_k), x_k - x^\star \rangle \right| + \left| \sum_{k=0}^{\tau_{i-1} - 1} r_k \langle g_k - \nabla f(x_k), x_k - x^\star \rangle \right| \le 16 d_{\tau_i} r_{\tau_i} \theta_{t,\delta} \sigma \sqrt{T} + 16 d_{\tau_{i-1}} r_{\tau_{i-1}} \theta_{t,\delta} \sigma \sqrt{T} \le 32 d_{\tau_i} r_{\tau_i} \theta_{t,\delta} \sigma \sqrt{T}. \tag{17}$$
Now observe that by convexity we have, for $k \in \{\tau_{i-1}, \tau_{i-1}+1, \ldots, \tau_i - 1\}$,
$$0 \le f(x_k) - f^\star \le \langle \nabla f(x_k), x_k - x^\star \rangle \le \frac{r_k}{r_{\tau_{i-1}}} \langle \nabla f(x_k), x_k - x^\star \rangle.$$
Summing up from $k = \tau_{i-1}$ to $k = \tau_i - 1$ we get
$$\sum_{k=\tau_{i-1}}^{\tau_i - 1} \langle \nabla f(x_k), x_k - x^\star \rangle \le \frac{1}{r_{\tau_{i-1}}} \sum_{k=\tau_{i-1}}^{\tau_i - 1} r_k \langle \nabla f(x_k), x_k - x^\star \rangle \le \frac{1}{r_{\tau_{i-1}}} \left[ \sum_{k=\tau_{i-1}}^{\tau_i - 1} r_k \langle \nabla f(x_k) - g_k, x_k - x^\star \rangle + \sum_{k=\tau_{i-1}}^{\tau_i - 1} r_k \langle g_k, x_k - x^\star \rangle \right]. \tag{18}$$
We now use Lemma 3 to get
$$\sum_{k=\tau_{i-1}}^{\tau_i - 1} r_k \langle g_k, x_k - x^\star \rangle \le r_{\tau_i} \left( 2 d_{\tau_i} + r_{\tau_i} \right) \sqrt{u_{\tau_i - 1}}. \tag{19}$$
Plugging in the upper bounds from equations (17) and (19) into equation (18) we get
$$\sum_{k=\tau_{i-1}}^{\tau_i - 1} \langle \nabla f(x_k), x_k - x^\star \rangle \le \frac{r_{\tau_i}}{r_{\tau_{i-1}}} \left[ \left( 2 d_{\tau_i} + r_{\tau_i} \right) \sqrt{u_{\tau_i - 1}} + 32 d_{\tau_i} \theta_{t,\delta} \sigma \sqrt{T} \right]. \tag{20}$$
Now observe that $r_{k+1} \le r_k + \|x_{k+1} - x_k\| = r_k + \eta_k \|g_k\| \le 2 r_k$, so that $\frac{r_{\tau_i}}{r_{\tau_i - 1}} \le 2$. Moreover, by the definition of the $\tau_i$ we have $\frac{r_{\tau_i - 1}}{r_{\tau_{i-1}}} \le 2$. Therefore
$$\frac{r_{\tau_i}}{r_{\tau_{i-1}}} = \frac{r_{\tau_i}}{r_{\tau_i - 1}} \cdot \frac{r_{\tau_i - 1}}{r_{\tau_{i-1}}} \le 2 \cdot 2 = 4. \tag{21}$$
Using equation (21) in equation (20) we get
$$\sum_{k=\tau_{i-1}}^{\tau_i - 1} \langle \nabla f(x_k), x_k - x^\star \rangle \le 4 \left[ \left( 2 d_{\tau_i} + r_{\tau_i} \right) \sqrt{u_{\tau_i - 1}} + 32 d_{\tau_i} \theta_{t,\delta} \sigma \sqrt{T} \right].$$
Summing up over the i, we get
$$\sum_{t=0}^{T-1} \langle \nabla f(x_t), x_t - x^\star \rangle = \sum_{i=1}^{K} \sum_{k=\tau_{i-1}}^{\tau_i - 1} \langle \nabla f(x_k), x_k - x^\star \rangle \le 4 K \left[ \left( 2 d_T + r_T \right) \sqrt{u_{T-1}} + 32 d_T \theta_{T,\delta} \sigma \sqrt{T} \right].$$
Observe that by definition we have $K \le 1 + \log \frac{r_T}{r_0} \le \log \frac{2 r_T}{r_0}$. Therefore, using the last equation and convexity, we have
$$\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right) \le 4 \log \left( \frac{2 r_T}{r_0} \right) \left[ \left( 2 d_T + r_T \right) \sqrt{u_{T-1}} + 32 d_T \theta_{T,\delta} \sigma \sqrt{T} \right].$$
Note that because the domain is bounded we have $\max(r_T, d_T) \le \bar{D}$, and we used $r_0 = \underline{D}$, therefore
$$\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right) \le 4 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \left[ 4 \bar{D} \sqrt{u_{T-1}} + 32 \bar{D} \theta_{T,\delta} \sigma \sqrt{T} \right]. \tag{22}$$
Observe that by our assumption on the noise and smoothness we have
$$u_{T-1} = \sum_{k=0}^{T-1} \|g_k\|^2 \le 2 \sum_{k=0}^{T-1} \|g_k - \nabla f(x_k)\|^2 + 2 \sum_{k=0}^{T-1} \|\nabla f(x_k)\|^2 \le 2 \sigma^2 T + 4 L \sum_{k=0}^{T-1} \left( f(x_k) - f^\star \right),$$
where we used $\|\nabla f(x)\|^2 \le 2L (f(x) - f^\star)$. Using this in equation (22) gives
$$\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right) \le 8 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \bar{D} \sqrt{L} \sqrt{\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right)} + 160 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \bar{D}\, \theta_{T,\delta}\, \sigma \sqrt{T}. \tag{23}$$
Observe that if $y^2 \le a y + b$, then by the quadratic formula and the triangle inequality we have
$$y \le \frac{a + \sqrt{a^2 + 4 b}}{2}.$$
Squaring both sides and using $(x + z)^2 \le 2 x^2 + 2 z^2$ gives
$$y^2 \le \frac{1}{4} \left( a + \sqrt{a^2 + 4 b} \right)^2 \le \frac{1}{2} \left( a^2 + a^2 + 4 b \right) = a^2 + 2 b. \tag{24}$$
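As a quick numerical sanity check of this elementary fact (the check is ours, not part of the proof): for random $a, b \ge 0$, the largest $y$ satisfying $y^2 = a y + b$ is the worst case of $y^2 \le a y + b$, and the bound $y^2 \le a^2 + 2b$ holds for it.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    a, b = rng.uniform(0, 10, size=2)
    # Largest root of y^2 = a*y + b, i.e. the worst case of y^2 <= a*y + b.
    y = (a + np.sqrt(a ** 2 + 4 * b)) / 2
    assert y ** 2 <= a ** 2 + 2 * b + 1e-9  # the claimed bound (24)
```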
Applying equation (24) to equation (23) with the choices
$$y = \sqrt{\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right)}, \qquad a = 8 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \bar{D} \sqrt{L}, \qquad b = 160 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \bar{D}\, \theta_{T,\delta}\, \sigma \sqrt{T},$$
we obtain
$$\sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right) \le 64 \log^2 \left( \frac{2 \bar{D}}{\underline{D}} \right) L \bar{D}^2 + 320 \log \left( \frac{2 \bar{D}}{\underline{D}} \right) \bar{D}\, \theta_{T,\delta}\, \sigma \sqrt{T}.$$
Dividing both sides by T and using Jensen's inequality we finally get
$$f(\bar{x}_T) - f^\star \le \frac{1}{T} \sum_{t=0}^{T-1} \left( f(x_t) - f^\star \right) \le \frac{64 \log^2 (2\bar{D}/\underline{D})\, L \bar{D}^2}{T} + \frac{320 \log (2\bar{D}/\underline{D})\, \theta_{T,\delta}\, \bar{D} \sigma}{\sqrt{T}}.$$
This shows DoG is tuning-free in this setting. (Equivalently, plugging equation (16) back into equation (15) gives, with probability $1 - \delta$,
$$\sum_{k=0}^{t-1} r_k \langle \nabla f(x_k), x_k - x^\star \rangle \le r_t \left( 2 d_t + r_t \right) \sqrt{u_{t-1}} + 16 d_t r_t \theta_{t,\delta} \sigma \sqrt{T},$$
and dividing both sides by $r_t$,
$$\sum_{k=0}^{t-1} \frac{r_k}{r_t} \langle \nabla f(x_k), x_k - x^\star \rangle \le \left( 2 d_t + r_t \right) \sqrt{u_{t-1}} + 16 d_t \theta_{t,\delta} \sigma \sqrt{T}.)$$

Part 2: DoWG. By Lemma 4 we have that our iterates satisfy
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}} + \sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k) - g_k, x_k - x^\star \rangle.$$
Define $X_k = r_k^2 \langle g_k - \nabla f(x_k), x_k - x^\star \rangle$, $\hat{X}_k = 0$, and $C_k = r_k^2 d_k \sigma$. Observe that $x_k$ is determined by $\mathcal{F}_{k-1}$, and since $r_k = \max_{t \le k} \max(\|x_t - x_0\|, r_\epsilon)$, it is also determined by $\mathcal{F}_{k-1}$. Therefore
$$\mathbb{E} \left[ X_k \mid \mathcal{F}_{k-1} \right] = r_k^2 \langle \mathbb{E} \left[ g_k - \nabla f(x_k) \right], x_k - x^\star \rangle = 0.$$
Moreover, observe that $|X_k| \le r_k^2 \|g_k - \nabla f(x_k)\| \, \|x_k - x^\star\| \le r_k^2 d_k \sigma$. Therefore the $X_k$ form a martingale difference sequence, and we can apply Lemma 1 to get that with probability $1 - \delta$, for every $t \in [T]$,
$$\left| \sum_{k=0}^{t-1} r_k^2 \langle g_k - \nabla f(x_k), x_k - x^\star \rangle \right| \le 8 d_t r_t^2 \sqrt{\theta_{t,\delta} \sum_{k=0}^{t-1} \|g_k - \nabla f(x_k)\|^2} + 16 d_t r_t^2 \theta_{t,\delta}^2 \sigma \le 16 d_t r_t^2 \theta_{t,\delta} \sigma \sqrt{T}.$$
Plugging this back into equation (14) we get
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}} + 16 d_t r_t^2 \theta_{t,\delta} \sigma \sqrt{T}. \tag{25}$$
We now divide the proof in two cases.

If f is G-Lipschitz: then $\sigma = \sup_{x \in \mathbb{R}^d} \|\nabla f(x) - g(x)\| \le 2 G$ and therefore equation (25) reduces to
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t \left( d_t + r_t \right) \sqrt{v_{t-1}} + 32 d_t r_t^2 \theta_{t,\delta} G \sqrt{T}.$$
And we have $v_{t-1} = \sum_{k=0}^{t-1} r_k^2 \|g_k\|^2 \le r_t^2 G^2 T$. Therefore, using $\max(d_t, r_t) \le \bar{D}$,
$$\sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 2 r_t^2 \left( d_t + r_t \right) G \sqrt{T} + 32 d_t r_t^2 \theta_{t,\delta} G \sqrt{T} \le 68\, r_t^2\, \bar{D} G\, \theta_{t,\delta} \sqrt{T}.$$
Using convexity we have
$$\sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right) \le \sum_{k=0}^{t-1} r_k^2 \langle \nabla f(x_k), x_k - x^\star \rangle \le 68\, r_t^2\, \bar{D} G\, \theta_{t,\delta} \sqrt{T}.$$
Dividing both sides by $\sum_{k=0}^{t-1} r_k^2$ and using Jensen's inequality we get
$$f(\bar{x}_t) - f^\star \le \frac{1}{\sum_{k=0}^{t-1} r_k^2} \sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right) \le \frac{r_t^2}{\sum_{k=0}^{t-1} r_k^2}\, 68\, \bar{D} G \sqrt{T}\, \theta_{t,\delta}. \tag{26}$$
We now use Lemma 2 to conclude that there exists some $t \le T$ such that
$$\frac{r_t^2}{\sum_{k=0}^{t-1} r_k^2} \le \frac{e}{T} \max \left( 2 \log_+ \frac{r_T}{r_\epsilon},\ 1 \right). \tag{27}$$
Note that by the fact that $r_T \le \bar{D}$, $r_0 = \underline{D}$, and that we assume $T \ge 4 \log_+ \frac{\bar{D}}{\underline{D}}$ (see the beginning of this proof), plugging into equation (27) gives
$$\frac{r_t^2}{\sum_{k=0}^{t-1} r_k^2} \le \frac{4 e \log_+ (\bar{D}/\underline{D})}{T}.$$
Using this in conjunction with equation (26) we thus have that for some $t \le T$:
$$f(\bar{x}_t) - f^\star \le \frac{748\, \bar{D} G\, \theta_{T,\delta} \log_+ (\bar{D}/\underline{D})}{\sqrt{T}}.$$

If f is L-smooth: observe that by straightforward algebra, our assumption on the noise, and smoothness,
$$v_{t-1} = \sum_{k=0}^{t-1} r_k^2 \|g_k\|^2 \le 2 \sum_{k=0}^{t-1} r_k^2 \|g_k - \nabla f(x_k)\|^2 + 2 \sum_{k=0}^{t-1} r_k^2 \|\nabla f(x_k)\|^2 \le 2 r_t^2 \sigma^2 T + 4 L \sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right).$$
Using the last estimate in equation (25) with the triangle inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, and then using convexity $\langle \nabla f(x_k), x_k - x^\star \rangle \ge f(x_k) - f^\star$, we get
$$\sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right) \le 4 r_t \left( d_t + r_t \right) \sqrt{L} \sqrt{\sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right)} + 20 r_t^2 \left( d_t + r_t \right) \theta_{t,\delta} \sigma \sqrt{T}. \tag{28}$$
Observe that if $y^2 \le a y + b$, then we have shown in equation (24) that $y^2 \le a^2 + 2 b$. Applying this to equation (28) with $a = 4 r_t (d_t + r_t) \sqrt{L}$ and $b = 20 r_t^2 (d_t + r_t) \theta_{t,\delta} \sigma \sqrt{T}$, we obtain
$$\sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right) \le 16 r_t^2 \left( d_t + r_t \right)^2 L + 40 r_t^2 \left( d_t + r_t \right) \theta_{t,\delta} \sigma \sqrt{T}.$$
Dividing both sides by $\sum_{k=0}^{t-1} r_k^2$ and using Jensen's inequality we get
$$f(\hat{x}_t) - f^\star \le \frac{1}{\sum_{k=0}^{t-1} r_k^2} \sum_{k=0}^{t-1} r_k^2 \left( f(x_k) - f^\star \right) \le \frac{r_t^2}{\sum_{k=0}^{t-1} r_k^2} \left[ 16 \left( d_t + r_t \right)^2 L + 40 \left( d_t + r_t \right) \theta_{t,\delta} \sigma \sqrt{T} \right],$$
where $\hat{x}_t = \frac{1}{\sum_{k=0}^{t-1} r_k^2} \sum_{k=0}^{t-1} r_k^2 x_k$. We now use Lemma 2 to conclude that there exists some $\tau \le T$ such that
$$f(\hat{x}_\tau) - f^\star \le \frac{e}{T} \max \left( 2 \log_+ \frac{r_T}{r_\epsilon},\ 1 \right) \left[ 16 \left( d_\tau + r_\tau \right)^2 L + 40 \left( d_\tau + r_\tau \right) \theta_{\tau,\delta} \sigma \sqrt{T} \right].$$
By our assumption $T \ge 4 \log_+ \frac{\bar{D}}{\underline{D}}$, therefore
$$f(\hat{x}_\tau) - f^\star \le \frac{4 e \log_+ (\bar{D}/\underline{D})}{T} \left[ 16 \left( d_\tau + r_\tau \right)^2 L + 40 \left( d_\tau + r_\tau \right) \theta_{\tau,\delta} \sigma \sqrt{T} \right] \le 700\, \theta_{T,\delta} \log_+ \left( \frac{\bar{D}}{\underline{D}} \right) \left( \frac{L \bar{D}^2}{T} + \frac{\bar{D} \sigma}{\sqrt{T}} \right),$$
where in the last line we used that $\max(d_t, r_t) \le \bar{D}$. □
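To illustrate what Theorem 1 promises, the following toy experiment (ours) runs the DoG sketch from above on a smooth one-dimensional quadratic with Gaussian gradient noise, using only a crude hint $r_\epsilon = \underline{D}$ that is a factor of 100 smaller than the true initial distance; under the stated assumptions the suboptimality should decay roughly like $L D^2 / T + \sigma D / \sqrt{T}$ up to logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(1)
L, sigma = 1.0, 0.5
f = lambda x: 0.5 * L * float(np.dot(x, x))          # f(x) = (L/2)||x||^2, f* = 0
oracle = lambda x: L * x + sigma * rng.standard_normal(x.shape)

x0 = np.array([10.0])                                # true D = 10, loose hint below
for T in (100, 1_000, 10_000):
    x_bar = dog_dowg(oracle, x0, r_eps=0.1, T=T)     # r_eps is a 100x-loose hint on D
    print(T, f(x_bar))                               # suboptimality shrinks with T
```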
8. Proofs for Section 4

8.1. Proof of Proposition 2

Proposition 2 (Hazan & Kakade (2019)). The Adaptive Polyak algorithm from (Hazan & Kakade, 2019) is tuning-free in the deterministic setting.

Proof. By (Hazan & Kakade, 2019, Theorem 2) we have that the point x returned by the algorithm satisfies
$$f(x) - f^\star \le \frac{GD}{\sqrt{T}} \log_+ \left( \frac{f^\star - \hat{f}_0}{GD/\sqrt{T}} \right) \quad \text{if } f \text{ is } G\text{-Lipschitz}, \qquad f(x) - f^\star \le \frac{2 L D^2}{T} \log_+ \left( \frac{f^\star - \hat{f}_0}{L D^2 / T} \right) \quad \text{if } f \text{ is } L\text{-smooth},$$
provided that $\hat{f}_0 \le f^\star$, where $\hat{f}_0$ is a parameter supplied to the algorithm. To get a valid lower bound on $f^\star$, observe that by the convexity of f we have
$$f(x_0) - f^\star \le \langle \nabla f(x_0), x_0 - x^\star \rangle \le \|\nabla f(x_0)\| \, \|x_0 - x^\star\| \le \|\nabla f(x_0)\|\, \bar{D}.$$
It follows that $f^\star \ge f(x_0) - \|\nabla f(x_0)\|\, \bar{D}$, and thus we can use $\hat{f}_0 = f(x_0) - \|\nabla f(x_0)\|\, \bar{D}$. □

8.2. Proof of Proposition 3

Proposition 3. T-DoG and T-DoWG are tuning-free in the deterministic setting.

Proof. This is shown in (Khaled et al., 2023, Supplementary material, Section 7) for DoWG. The proof for DoG is similar and we omit it for simplicity. □

8.3. Proof of Theorem 2

Proof. Let $\sigma > 0$ and let $L = \sigma T$. Define the functions
$$f_1(x) \overset{\mathrm{def}}{=} \frac{L}{2} x^2 + \sigma x, \qquad f_2(x) \overset{\mathrm{def}}{=} \frac{L}{2} x^2 - \frac{\sigma}{T-1} x, \qquad f(x) \overset{\mathrm{def}}{=} \frac{1}{T} f_1(x) + \left( 1 - \frac{1}{T} \right) f_2(x) = \frac{L}{2} x^2.$$
We shall consider the stochastic oracle $O(f, \sigma_f)$ that returns function values and gradients as follows:
$$O(f, \sigma_f)(x) \overset{\mathrm{def}}{=} \{f_z(x), \nabla f_z(x)\} = \begin{cases} \{f_1(x), \nabla f_1(x)\} & \text{with probability } \frac{1}{T}, \\ \{f_2(x), \nabla f_2(x)\} & \text{with probability } 1 - \frac{1}{T}. \end{cases}$$
Clearly we have $\mathbb{E}[f_z(x)] = f(x)$ and $\mathbb{E}[\nabla f_z(x)] = \nabla f(x)$. Moreover,
$$\|\nabla f_1(x) - \nabla f(x)\| = \sigma, \qquad \|\nabla f_2(x) - \nabla f(x)\| = \frac{\sigma}{T-1} \le \sigma.$$
It follows that $\sigma_f \le \sigma$. Therefore $O(f, \sigma_f)$ is a valid stochastic first-order oracle. This oracle is similar to the one used by Attia & Koren (2023) in their lower bound on the convergence of AdaGrad-Norm. The minimizer of the function f is clearly $x^\star_f = 0$.

Let $u \ge 0$; we shall choose it later. Define
$$h_1(x) \overset{\mathrm{def}}{=} \frac{L}{2} (x-u)^2 + \left( \sigma - (T-1) L u \right) x + \frac{(T-1) L u^2}{2}, \qquad h_2(x) \overset{\mathrm{def}}{=} \frac{L}{2} (x-u)^2 + L u x - \frac{\sigma}{T-1} x - \frac{L u^2}{2}, \qquad h(x) \overset{\mathrm{def}}{=} \frac{L}{2} (x-u)^2,$$
with the oracle $O(h, \sigma_h)$ given by
$$O(h, \sigma_h)(x) \overset{\mathrm{def}}{=} \{h_z(x), \nabla h_z(x)\} = \begin{cases} \{h_1(x), \nabla h_1(x)\} & \text{with probability } \frac{1}{T}, \\ \{h_2(x), \nabla h_2(x)\} & \text{with probability } 1 - \frac{1}{T}. \end{cases}$$
Observe that
$$\mathbb{E}[h_z(x)] = \frac{1}{T} h_1(x) + \left( 1 - \frac{1}{T} \right) h_2(x) = \frac{L}{2} (x-u)^2 + \frac{\sigma x}{T} - \frac{(T-1) L u x}{T} + \frac{(T-1) L u^2}{2 T} + \frac{T-1}{T} \left( L u x - \frac{\sigma x}{T-1} - \frac{L u^2}{2} \right) = \frac{L}{2} (x-u)^2 = h(x).$$
We can similarly prove that $\mathbb{E}[\nabla h_z(x)] = \nabla h(x)$. Moreover,
$$\|\nabla h_1(x) - \nabla h(x)\| = |\sigma - (T-1) L u| \le \sigma + (T-1) L u, \qquad \|\nabla h_2(x) - \nabla h(x)\| = \left| L u - \frac{\sigma}{T-1} \right| \le \frac{\sigma}{T-1} + L u.$$
It follows that $\sigma_h \le \sigma + (T-1) L u$, therefore $O(h, \sigma_h)$ is a valid stochastic oracle. Finally, observe that the minimizer of h is $x^\star_h = u$.

We fix the initialization $x_0 = v > 0$. Then the initial distances from the optimum for f and h are
$$D^\star(f) = |v - 0| = v, \qquad D^\star(h) = |v - u|. \tag{29}$$
And recall that
$$\sigma_f \le \sigma, \qquad \sigma_h \le \sigma + (T-1) L u. \tag{30}$$
Observe that both f and h share the same smoothness constant L. We supply the algorithm with the following hints:
$$\underline{L} = \bar{L} = L, \qquad \underline{D} = \min(v, |u-v|), \qquad \bar{D} = \max(v, |u-v|), \qquad \underline{\sigma} = \sigma, \qquad \bar{\sigma} = \sigma + T L u. \tag{31}$$
We note that in light of equations (29) and (30) and the definitions of f and h, the hints given by equation (31) are valid for both problems. Now observe the following:
$$h_2(x) = \frac{L}{2} \left( x^2 - 2 u x + u^2 \right) + L u x - \frac{\sigma}{T-1} x - \frac{L u^2}{2} = \frac{L}{2} x^2 - \frac{\sigma}{T-1} x = f_2(x).$$
Therefore both oracles $O(f, \sigma_f)$ and $O(h, \sigma_h)$ return the same stochastic gradients and stochastic function values with probability $1 - \frac{1}{T}$. We thus have that over a run of T steps, with probability $\left( 1 - \frac{1}{T} \right)^T \ge e^{-1} \left( 1 - \frac{1}{T} \right) \ge \frac{e^{-1}}{2}$ (for $T \ge 2$), the algorithm will only get the evaluations $\{h_2(x), \nabla h_2(x)\}$ from either oracle, and will get the same hints defined in equation (31). In this setting, the algorithm cannot distinguish whether it is minimizing h or minimizing f, and therefore must minimize both.
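The following small simulation (ours; the parameter values are illustrative, and u is fixed to the value $u = T^2 + 1$ chosen later in the proof) implements the gradient parts of the two oracles above and confirms both that they are unbiased for f and h respectively, and that a full run of T queries sees only $\{f_2, \nabla f_2\} = \{h_2, \nabla h_2\}$ with probability roughly $(1 - 1/T)^T \approx e^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
sigma = 1.0
L = sigma * T                       # the coupling L = sigma * T used in the proof
u = T ** 2 + 1                      # minimizer of h (the choice made later on)

def grad_f1(x): return L * x + sigma
def grad_f2(x): return L * x - sigma / (T - 1)          # identical to grad h2
def grad_h1(x): return L * (x - u) + (sigma - (T - 1) * L * u)

# Unbiasedness: (1/T) grad_f1 + (1-1/T) grad_f2 == L*x     = grad f, and
#               (1/T) grad_h1 + (1-1/T) grad_f2 == L*(x-u) = grad h.
x = 3.0
assert np.isclose(grad_f1(x) / T + (1 - 1 / T) * grad_f2(x), L * x)
assert np.isclose(grad_h1(x) / T + (1 - 1 / T) * grad_f2(x), L * (x - u))

# Probability that T consecutive queries never reveal which oracle we face:
runs = 100_000
never_rare = (rng.random((runs, T)) > 1 / T).all(axis=1).mean()
print(never_rare, np.exp(-1))       # both close to 1/e
```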
This is the main idea behind this proof: we use that the algorithm is tuning-free, which gives us that the output of the algorithm $x_{\mathrm{out}}$ satisfies, with probability $1 - \delta$,
$$h(x_{\mathrm{out}}) - h^\star \le c\, \operatorname{poly} \log_+ \left( \frac{\bar{L}}{\underline{L}}, \frac{\bar{D}}{\underline{D}}, \frac{\bar{\sigma}}{\underline{\sigma}}, \frac{1}{\delta}, T \right) \left[ \frac{L\, D^\star(h)^2}{T} + \frac{\sigma_h\, D^\star(h)}{\sqrt{T}} \right]. \tag{32}$$
We shall let $\iota \overset{\mathrm{def}}{=} \operatorname{poly} \log_+ \left( \frac{\bar{L}}{\underline{L}}, \frac{\bar{D}}{\underline{D}}, \frac{\bar{\sigma}}{\underline{\sigma}}, \frac{1}{\delta}, T \right)$, and note that because all of the relevant parameters (the hints, the horizon T, and the probability δ) supplied to the algorithm are unchanged between h and f, this ι is the same for h and f. Continuing from equation (32) and substituting the expressions for $D^\star(h)$ and $\sigma_h$ from equations (29) and (30) we get
$$h(x_{\mathrm{out}}) - h^\star \le c \iota \left[ \frac{L (u-v)^2}{T} + \frac{\left( \sigma + (T-1) L u \right) |u-v|}{\sqrt{T}} \right].$$
Using the definition of h and the fact that $h^\star = 0$ we have
$$\frac{L}{2} \|x_{\mathrm{out}} - u\|^2 \le c \iota \left[ \frac{L (u-v)^2}{T} + \frac{\sigma |u-v|}{\sqrt{T}} + \frac{(T-1) L u |u-v|}{\sqrt{T}} \right].$$
Multiplying both sides by $\frac{2}{L}$ and then using the definition $L = \sigma T$ we get
$$\|x_{\mathrm{out}} - u\|^2 \le 2 c \iota \left[ \frac{(u-v)^2}{T} + \frac{|u-v|}{T^{3/2}} + \sqrt{T}\, u |u-v| \right].$$
This gives, by taking square roots and using the triangle inequality for square roots,
$$\|x_{\mathrm{out}} - u\| \le \sqrt{2 c \iota} \left( \frac{|u-v|}{T^{1/4}} + T^{1/4} \sqrt{u |u-v|} \right),$$
where we absorbed the lower-order terms into the $T^{-1/4}$ term. And finally this implies
$$x_{\mathrm{out}} \ge u - \sqrt{2 c \iota} \left( \frac{|u-v|}{T^{1/4}} + T^{1/4} \sqrt{u |u-v|} \right). \tag{33}$$
Similarly, applying the tuning-free guarantee to f and using that $D^\star(f) = v$, we have
$$\frac{L}{2} \|x_{\mathrm{out}}\|^2 = f(x_{\mathrm{out}}) - f^\star \le c \iota \left[ \frac{L v^2}{T} + \frac{\sigma v}{\sqrt{T}} \right],$$
which gives
$$x_{\mathrm{out}} \le \|x_{\mathrm{out}}\| \le \sqrt{2 c \iota} \left( \frac{v}{\sqrt{T}} + \frac{\sqrt{v}}{T^{3/4}} \right). \tag{34}$$
Now let us consider the difference between the lower bound on $x_{\mathrm{out}}$ given by equation (33) and the upper bound given by equation (34):
$$u - \sqrt{2 c \iota} \left( \frac{|u-v|}{T^{1/4}} + T^{1/4} \sqrt{u |u-v|} \right) - \sqrt{2 c \iota} \left( \frac{v}{\sqrt{T}} + \frac{\sqrt{v}}{T^{3/4}} \right). \tag{35}$$
Let us put $u = v + 1$ and $v = T^2$; then equation (35) becomes
$$T^2 + 1 - \sqrt{2 c \iota} \left( \frac{1}{T^{1/4}} + T^{1/4} \sqrt{T^2 + 1} \right) - \sqrt{2 c \iota} \left( T^{3/2} + T^{1/4} \right). \tag{36}$$
Now observe that
$$\iota = \operatorname{poly} \left( \log_+ \frac{\bar{L}}{\underline{L}}, \log_+ \frac{\bar{D}}{\underline{D}}, \log_+ \frac{\bar{\sigma}}{\underline{\sigma}}, \log_+ \frac{1}{\delta}, \log T \right) = \operatorname{poly} \left( \log_+ 1, \log_+ T^2, \log_+ (1 + T^2 + T^4), \log_+ \frac{1}{\delta}, \log_+ T \right) = \operatorname{poly} \left( \log_+ T, \log_+ \frac{1}{\delta} \right).$$
We set $\delta = \frac{e^{-1}}{4}$, therefore we finally get that $\iota = \operatorname{poly}(\log T)$. Plugging back into equation (36), the difference between the lower bound of equation (33) and the upper bound of equation (34) is
$$T^2 + 1 - \sqrt{2 c\, \operatorname{poly}(\log T)} \left( \frac{1}{T^{1/4}} + T^{1/4} \sqrt{T^2 + 1} + T^{3/2} + T^{1/4} \right).$$
It is obvious that for large enough T this expression is positive: the leading term grows like $T^2$ while the subtracted terms grow like $T^{3/2}$ times polylogarithmic factors. Moreover, this situation happens with positive probability, since by the union bound
$$\operatorname{Prob} \left( \text{algorithm correct for both } f, h \text{ and the oracle outputs only } \{h_2, \nabla h_2\} \right) \ge \left( 1 - \frac{1}{T} \right)^T - 2 \delta \ge e^{-1} \left( \frac{1}{2} - \frac{1}{T} \right) > 0 \quad \text{for } T \ge 3.$$
By contradiction, it follows that no algorithm can be tuning-free. □

8.4. Proof of Theorem 3

Proof. We consider the following functions:
$$f(x) = G |x|, \qquad f_1(x) = G |x| + G x, \qquad f_2(x) = G |x| - \frac{G}{T-1} x.$$
We consider the stochastic oracle $O(f, \sigma_f)$ that returns function values and gradients as follows:
$$O(f, \sigma_f)(x) \overset{\mathrm{def}}{=} \{f_z(x), \nabla f_z(x)\} = \begin{cases} \{f_1(x), \nabla f_1(x)\} & \text{with probability } \frac{1}{T}, \\ \{f_2(x), \nabla f_2(x)\} & \text{with probability } 1 - \frac{1}{T}. \end{cases}$$
Clearly we have $\mathbb{E}[f_z(x)] = f(x)$ and $\mathbb{E}[\nabla f_z(x)] = \nabla f(x)$. It is also not difficult to prove that $\|\nabla f(x) - \nabla f_z(x)\| \le G$. We define a second function
$$h(x) = G |x - u|, \qquad h_1(x) = T G |x - u| - (T-1) G |x| + G x, \qquad h_2(x) = G |x| - \frac{G}{T-1} x, \tag{37}$$
and we shall use the oracle $O(h, \sigma_h)$ given by
$$O(h, \sigma_h)(x) \overset{\mathrm{def}}{=} \{h_z(x), \nabla h_z(x)\} = \begin{cases} \{h_1(x), \nabla h_1(x)\} & \text{with probability } \frac{1}{T}, \\ \{h_2(x), \nabla h_2(x)\} & \text{with probability } 1 - \frac{1}{T}. \end{cases}$$
By direct computation we have that $\mathbb{E}[h_z(x)] = h(x)$ and $\mathbb{E}[\nabla h_z(x)] = \nabla h(x)$. From the definition of the functions in equation (37) it is immediate that all the gradients and stochastic gradients are bounded in norm by $2 G T$; in particular $\sigma_h \le 2 G T$. All in all, this shows $O(h, \sigma_h)$ is a valid stochastic oracle. We set $x_0 = 1$ and observe that, like in Theorem 2, with some small but constant probability both oracles return the same gradients and function values over all T steps, and therefore the algorithm cannot distinguish between them.
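The analogous unbiasedness computation for this Lipschitz construction can be checked symbolically; below is a short verification (ours) that $\frac{1}{T} f_1 + (1 - \frac{1}{T}) f_2 = f$ and $\frac{1}{T} h_1 + (1 - \frac{1}{T}) h_2 = h$, where $h_1 = T G |x-u| - (T-1) G |x| + G x$ is the form reconstructed in equation (37) above.

```python
import sympy as sp

x, u = sp.symbols("x u", real=True)
G, T = sp.symbols("G T", positive=True)
f  = G * sp.Abs(x)
f1 = G * sp.Abs(x) + G * x
f2 = G * sp.Abs(x) - G / (T - 1) * x
h  = G * sp.Abs(x - u)
h1 = T * G * sp.Abs(x - u) - (T - 1) * G * sp.Abs(x) + G * x  # reconstructed form

assert sp.simplify(f1 / T + (1 - 1 / T) * f2 - f) == 0
assert sp.simplify(h1 / T + (1 - 1 / T) * f2 - h) == 0        # note h2 = f2
```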
It is therefore forced to approximately minimize both, giving us the guarantees
$$f(x_{\mathrm{out}}) - f^\star \le c\, \iota\, \frac{G}{\sqrt{T}} \tag{38}$$
(since $D^\star(f) = |x_0 - 0| = 1$), and
$$h(x_{\mathrm{out}}) - h^\star \le c\, \iota\, \frac{(G T)\, |1 - u|}{\sqrt{T}} = c\, \iota\, \sqrt{T}\, G\, |1 - u|,$$
so that, using $h(x_{\mathrm{out}}) - h^\star = G |x_{\mathrm{out}} - u|$,
$$|x_{\mathrm{out}} - u| \le c\, \iota\, \sqrt{T}\, |1 - u|.$$
Let us put $u = 1 - \frac{1}{T}$; then $|1 - u| = \frac{1}{T}$, and the last inequality implies
$$x_{\mathrm{out}} \ge u - \frac{c \iota}{\sqrt{T}} = 1 - \frac{1}{T} - \frac{c \iota}{\sqrt{T}}. \tag{39}$$
And equation (38), together with $f(x_{\mathrm{out}}) - f^\star = G |x_{\mathrm{out}}|$, implies
$$|x_{\mathrm{out}}| \le \frac{c \iota}{\sqrt{T}}. \tag{40}$$
Because $\iota = \operatorname{poly}(\log T)$ (by direct computation), we have that the lower bound on $x_{\mathrm{out}}$ given by equation (39) exceeds the upper bound on the same iterate given by equation (40) as T becomes large enough, and we get our contradiction. □

9. Proofs for Section 4.2

We have the two following algorithm-independent lemmas.

Lemma 5. Suppose that Y is a sub-exponential random variable (see Definition 9.1) with mean 0 and sub-exponential modulus $R^2$, i.e. for all $t > 0$:
$$\operatorname{Prob} \left( |Y| \ge t \right) \le 2 \exp \left( -\frac{t}{R^2} \right).$$
Let $Y_1, \ldots, Y_n$ be i.i.d. copies of Y. Then with probability $1 - \delta$ it holds that
$$\left| \frac{1}{n} \sum_{i=1}^{n} Y_i \right| \le c\, R^2 \left( \sqrt{\frac{\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{n} \right),$$
where $c > 0$ is an absolute constant.

Proof. By Bernstein's inequality (Vershynin, 2018, Corollary 2.8.3) we have
$$\operatorname{Prob} \left( \left| \frac{1}{n} \sum_{i=1}^{n} Y_i \right| \ge t \right) \le 2 \exp \left( -c\, n \min \left( \frac{t^2}{R^4}, \frac{t}{R^2} \right) \right)$$
for some $c > 0$. Let us set t as follows:
$$t = \begin{cases} R^2 \sqrt{\frac{1}{c n} \log \frac{2}{\delta}} & \text{if } \frac{1}{c n} \log \frac{2}{\delta} \le 1, \\ R^2\, \frac{1}{c n} \log \frac{2}{\delta} & \text{if } \frac{1}{c n} \log \frac{2}{\delta} > 1. \end{cases}$$
In the first case the minimum in the exponent is attained by $\frac{t^2}{R^4}$, and in the second by $\frac{t}{R^2}$; by combining the two cases we get
$$2 \exp \left( -c\, n \min \left( \frac{t^2}{R^4}, \frac{t}{R^2} \right) \right) \le \delta.$$
It follows that with probability at least $1 - \delta$ we have
$$\left| \frac{1}{n} \sum_{i=1}^{n} Y_i \right| \le R^2 \left( \sqrt{\frac{\log(2/\delta)}{c n}} + \frac{\log(2/\delta)}{c n} \right). \qquad \square$$

Recall the definition of sub-exponential random variables:

Definition 9.1. We call a random variable Y R-sub-exponential if $\operatorname{Prob}(|Y| \ge t) \le 2 \exp\left( -\frac{t}{R} \right)$ for all $t \ge 0$.

Definition 9.2. We call a random variable Y R-sub-gaussian if $\operatorname{Prob}(|Y| \ge t) \le 2 \exp\left( -\frac{t^2}{R^2} \right)$ for all $t \ge 0$.

Lemma 6 ((Vershynin, 2018, Lemma 2.7.7)). A random variable Y is R-sub-gaussian if and only if $Y^2$ is $R^2$-sub-exponential.

Lemma 7 ((Vershynin, 2018, Exercise 2.7.10)). If A is E-sub-exponential then $A - \mathbb{E}[A]$ is cE-sub-exponential for some absolute constant c.

Lemma 8. Suppose that X is a random variable that satisfies the assumptions in Definition 4.1 and $X_1, \ldots, X_n$ are all i.i.d. copies of X. Then with probability $1 - \delta$ we have that
$$\left| \frac{1}{n} \sum_{i=1}^{n} \left( \|X_i\|^2 - \sigma^2 \right) \right| \le c\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{n} \right).$$

Proof. By assumption we have that $\|X_i\|$ is R-sub-gaussian, therefore by Lemma 6 we have that $\|X_i\|^2$ is $R^2$-sub-exponential. By Lemma 7 we then have that $\|X_i\|^2 - \sigma^2$ is $c_1 R^2$-sub-exponential for some absolute constant $c_1$. By Lemma 5 applied to $Y_i = \|X_i\|^2 - \sigma^2$ we have with probability $1 - \delta$ that
$$\left| \frac{1}{n} \sum_{i=1}^{n} \left( \|X_i\|^2 - \sigma^2 \right) \right| \le c_2 R^2 \left( \sqrt{\frac{\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{n} \right),$$
where $c_2 > 0$ is some absolute constant. Using the definition of the signal-to-noise ratio $K_{\mathrm{snr}}^{-1} = \frac{R}{\sigma}$ we get that for some absolute constant c:
$$\left| \frac{1}{n} \sum_{i=1}^{n} \left( \|X_i\|^2 - \sigma^2 \right) \right| \le c\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{n} \right). \qquad \square$$

9.1. Proof of Theorem 4

The main idea in the proof is the following lemma, which characterizes the convergence of the sample variance estimator of b i.i.d. random variables in terms of the number of samples b as well as the signal-to-noise ratio $K_{\mathrm{snr}}^{-1}$.

Lemma 9. Let Y be a random vector in $\mathbb{R}^d$ such that $Z = Y - \mathbb{E}[Y]$ satisfies the assumptions in Definition 4.1. Let $Y_1, Y_2, \ldots, Y_b$ be i.i.d. copies of Y. Define the sample mean and variance as
$$\hat{Y} = \frac{1}{b} \sum_{i=1}^{b} Y_i, \qquad \hat{\sigma}^2 = \frac{1}{b} \sum_{i=1}^{b} \|Y_i - \hat{Y}\|^2.$$
Then it holds with probability $1 - \delta$ that
$$\left| \frac{\hat{\sigma}^2}{\sigma^2} - 1 \right| \le c\, K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log \frac{2 (b \vee d)}{\delta}}{b}} + \frac{\log \frac{2 (b \vee d)}{\delta}}{b} \right),$$
where c is an absolute (non-problem-dependent) constant, $b \vee d \overset{\mathrm{def}}{=} \max(b, d)$, $\sigma^2 \overset{\mathrm{def}}{=} \mathbb{E} \left[ \|Y - \mathbb{E}[Y]\|^2 \right]$, and $K_{\mathrm{snr}}^{-2}$ is the ratio defined in Definition 4.1.
Proof. We shall use the shorthand $\mu = \mathbb{E}[Y]$. We have
$$\|Y_i - \hat{Y}\|^2 = \|Y_i - \mu + \mu - \hat{Y}\|^2 = \|Y_i - \mu\|^2 + \|\mu - \hat{Y}\|^2 + 2 \left\langle Y_i - \mu, \mu - \hat{Y} \right\rangle,$$
so that, averaging over i,
$$\hat{\sigma}^2 = \frac{1}{b} \sum_{i=1}^{b} \|Y_i - \mu\|^2 + \|\mu - \hat{Y}\|^2 - \frac{2}{b} \sum_{i=1}^{b} \left\langle Y_i - \mu, \hat{Y} - \mu \right\rangle.$$
We have by the triangle inequality
$$\left| \hat{\sigma}^2 - \sigma^2 \right| \le \left| \frac{1}{b} \sum_{i=1}^{b} \|Y_i - \mu\|^2 - \sigma^2 \right| + \|\mu - \hat{Y}\|^2 + \frac{2}{b} \left| \sum_{i=1}^{b} \left\langle Y_i - \mu, \hat{Y} - \mu \right\rangle \right|. \tag{41}$$
By Lemma 8, we may bound the first term on the right-hand side of equation (41) as
$$\left| \frac{1}{b} \sum_{i=1}^{b} \|Y_i - \mu\|^2 - \sigma^2 \right| \le c\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log(2/\delta)}{b}} + \frac{\log(2/\delta)}{b} \right). \tag{42}$$
For the second term on the right-hand side of equation (41), we apply (Jin et al., 2019, Corollary 7) to $X_i = \mu - Y_i$ and obtain, with probability $1 - \delta$,
$$\|\mu - \hat{Y}\| \le \frac{c}{\sqrt{b}}\, R \sqrt{\log \frac{2 d}{\delta}}.$$
Squaring both sides we get
$$\|\mu - \hat{Y}\|^2 \le \frac{c^2 R^2 \log(2d/\delta)}{b}. \tag{43}$$
For the third term on the right-hand side of equation (41) we have
$$\frac{1}{b} \sum_{i=1}^{b} \left\langle Y_i - \mu, \hat{Y} - \mu \right\rangle = \frac{1}{b^2} \sum_{i=1}^{b} \|Y_i - \mu\|^2 + \frac{1}{b^2} \sum_{i=1}^{b} \sum_{j \ne i} \left\langle Y_i - \mu, Y_j - \mu \right\rangle.$$
Taking absolute values of both sides and using the triangle inequality we get
$$\left| \frac{1}{b} \sum_{i=1}^{b} \left\langle Y_i - \mu, \hat{Y} - \mu \right\rangle \right| \le \frac{1}{b^2} \sum_{i=1}^{b} \|Y_i - \mu\|^2 + \frac{1}{b^2} \left| \sum_{i=1}^{b} \sum_{j \ne i} \left\langle Y_i - \mu, Y_j - \mu \right\rangle \right|. \tag{44}$$
By our sub-gaussian assumption on $\|Y - \mu\|$ (via Lemma 8), the first term on the right-hand side of equation (44) can be bounded with high probability as
$$\frac{1}{b^2} \sum_{i=1}^{b} \|Y_i - \mu\|^2 \le \frac{1}{b} \left( \sigma^2 + c\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log(2/\delta)}{b}} + \frac{\log(2/\delta)}{b} \right) \right). \tag{45}$$
Define $Z_{i,j} = \left\langle Y_i - \mu, Y_j - \mu \right\rangle$. Observe that for each fixed i, the random variables $Z_{i,1}, \ldots, Z_{i,i-1}, Z_{i,i+1}, \ldots, Z_{i,b}$ are all independent, and $\mathbb{E}[Z_{i,j}] = 0$ for $i \ne j$. Observe that by the Cauchy-Schwarz inequality
$$|Z_{i,j}| = \left| \left\langle Y_i - \mu, Y_j - \mu \right\rangle \right| \le \|Y_i - \mu\| \, \|Y_j - \mu\|.$$
Observe that each of $\|Y_i - \mu\|$ and $\|Y_j - \mu\|$ is sub-gaussian with modulus R, therefore by (Vershynin, 2018, Lemma 2.7.7) their product is sub-exponential with modulus $R^2$. It follows that
$$\operatorname{Prob} \left( |Z_{i,j}| \ge t \right) \le \operatorname{Prob} \left( \|Y_i - \mu\| \, \|Y_j - \mu\| \ge t \right) \le 2 \exp \left( -\frac{t}{R^2} \right).$$
Therefore $Z_{i,j}$ is also sub-exponential with modulus $R^2$. By Lemma 5 we then get that for any fixed i, with probability at least $1 - \delta$,
$$\left| \frac{1}{b-1} \sum_{j \ne i} Z_{i,j} \right| \le c\, R^2 \left( \sqrt{\frac{\log(2/\delta)}{b-1}} + \frac{\log(2/\delta)}{b-1} \right) \tag{46}$$
for some absolute constant $c > 0$. Multiplying both sides of equation (46) by $\frac{b-1}{b}$ and then using straightforward algebra we get
$$\left| \frac{1}{b} \sum_{j \ne i} Z_{i,j} \right| \le c\, R^2 \left( \sqrt{\frac{\log(2/\delta)}{b}} + \frac{\log(2/\delta)}{b} \right). \tag{47}$$
We now use the union bound over all i (replacing δ by δ/b) with the triangle inequality to get
$$\frac{1}{b^2} \left| \sum_{i=1}^{b} \sum_{j \ne i} Z_{i,j} \right| \le \frac{1}{b} \sum_{i=1}^{b} \left| \frac{1}{b} \sum_{j \ne i} Z_{i,j} \right| \le c\, R^2 \left( \sqrt{\frac{\log(2b/\delta)}{b}} + \frac{\log(2b/\delta)}{b} \right). \tag{48}$$
Combining equations (45) and (48) we get that with probability $1 - \delta$ there exists some absolute constant $c > 0$ such that
$$\left| \frac{1}{b} \sum_{i=1}^{b} \left\langle Y_i - \mu, \hat{Y} - \mu \right\rangle \right| \le c\, R^2 \left( \sqrt{\frac{\log(2b/\delta)}{b}} + \frac{\log(2b/\delta)}{b} \right). \tag{49}$$
Combining equations (42), (43) and (49) into equation (41) we get
$$\left| \hat{\sigma}^2 - \sigma^2 \right| \le c_1\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log(2/\delta)}{b}} + \frac{\log(2/\delta)}{b} \right) + c_2\, \frac{R^2 \log(2d/\delta)}{b} + c_3\, R^2 \left( \sqrt{\frac{\log(2b/\delta)}{b}} + \frac{\log(2b/\delta)}{b} \right)$$
for some absolute constants $c_1, c_2, c_3 > 0$. Therefore, using the definition $K_{\mathrm{snr}}^{-1} = \frac{R}{\sigma}$ and simplifying in the last equation, we finally get that
$$\left| \hat{\sigma}^2 - \sigma^2 \right| \le c_4\, \sigma^2 K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log \frac{2 (b \vee d)}{\delta}}{b}} + \frac{\log \frac{2 (b \vee d)}{\delta}}{b} \right)$$
for some absolute constant $c_4 > 0$. Dividing both sides by $\sigma^2$ yields the statement of the lemma. □

Algorithm 2 T-DoG + Variance Estimation
Require: initial point $x_0 \in \mathcal{X}$, initial distance estimate $r_\epsilon > 0$, minibatch size b, $\theta > 0$.
1: Initialize $r_\epsilon = \underline{D}$, $\alpha = 8^4 \log(60 \log(6T)/\delta)\, \theta^{-1}$.
2: for $t = 0, 1, 2, \ldots, T-1$ do
3: Update distance estimator: $r_t \leftarrow \max \left( \|x_t - x_0\|, r_{t-1} \right)$.
4: Sample b stochastic gradients $\mu^1_t, \mu^2_t, \ldots, \mu^b_t$ at $x_t$ and compute:
$$\hat{\mu}_t = \frac{1}{b} \sum_{i=1}^{b} \mu^i_t, \qquad \hat{\sigma}^2_t = \frac{1}{b} \sum_{i=1}^{b} \|\mu^i_t - \hat{\mu}_t\|^2, \qquad \sigma^2_t = \max_{k \le t} \hat{\sigma}^2_k.$$
5: Compute a new stochastic gradient $g_t$ evaluated at $x_t$.
6: Update the gradient sum $u_t = u_{t-1} + \|g_t\|^2$.
7: Set the stepsize:
$$\eta_t = \frac{r_t}{\alpha \sqrt{u_t + \sigma^2_t}\, \log^2_+ \left( 1 + \frac{u_t + \sigma^2_t}{u_0 + \sigma^2_0} \right)}.$$
8: Gradient descent step: $x_{t+1} \leftarrow x_t - \eta_t g_t$.
9: end for
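The following is a minimal sketch (ours) of the loop in Algorithm 2; the constant $\alpha$ and the exact stepsize normalization follow our reconstruction above and should be treated as indicative rather than exact.

```python
import numpy as np

def t_dog_variance_estimation(batch_oracle, x0, r_eps, b, theta, T, delta=0.1):
    """Sketch of Algorithm 2 (T-DoG + variance estimation).

    batch_oracle(x, n) returns n stochastic gradients at x, shape (n, d).
    """
    alpha = 8 ** 4 * np.log(60 * np.log(6 * T) / delta) / theta  # reconstructed constant
    x, r = x0.copy(), r_eps
    u, sig2 = 0.0, 0.0
    u0 = sig2_0 = None
    for t in range(T):
        r = max(r, float(np.linalg.norm(x - x0)))        # line 3: distance estimator
        mus = batch_oracle(x, b)                         # line 4: minibatch for variance
        sig2 = max(sig2, float(((mus - mus.mean(0)) ** 2).sum(1).mean()))
        g = batch_oracle(x, 1)[0]                        # line 5: fresh stochastic gradient
        u += float(np.linalg.norm(g)) ** 2               # line 6: gradient sum
        if u0 is None:
            u0, sig2_0 = u, sig2                         # reference values for log_+
        log_plus = max(np.log(1 + (u + sig2) / (u0 + sig2_0 + 1e-12)), 1.0)
        eta = r / (alpha * np.sqrt(u + sig2) * log_plus ** 2)  # line 7: stepsize
        x = x - eta * g                                  # line 8: descent step
    return x
```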
Algorithm 3 T-DoWG + Variance Estimation
Require: initial point $x_0 \in \mathcal{X}$, initial distance estimate $r_\epsilon > 0$, minibatch size b, $\theta > 0$.
1: Initialize $r_\epsilon = \underline{D}$, $\alpha = 8^4 \log(60 \log(6T)/\delta)\, \theta^{-1}$.
2: for $t = 0, 1, 2, \ldots, T-1$ do
3: Update distance estimator: $r_t \leftarrow \max \left( \|x_t - x_0\|, r_{t-1} \right)$.
4: Sample b stochastic gradients $\mu^1_t, \mu^2_t, \ldots, \mu^b_t$ at $x_t$ and compute:
$$\hat{\mu}_t = \frac{1}{b} \sum_{i=1}^{b} \mu^i_t, \qquad \hat{\sigma}^2_t = \frac{1}{b} \sum_{i=1}^{b} \|\mu^i_t - \hat{\mu}_t\|^2, \qquad \sigma^2_t = \max_{k \le t} \hat{\sigma}^2_k.$$
5: Compute a new stochastic gradient $g_t$ evaluated at $x_t$.
6: Update weighted gradient sum: $v_t \leftarrow v_{t-1} + r_t^2 \|g_t\|^2$.
7: Set the stepsize:
$$\gamma_t = \frac{r_t^2}{\alpha \sqrt{v_t + \beta r_t^2 \sigma^2_t}\, \log^2_+ \left( 1 + \frac{v_t + r_t^2 \sigma^2_t}{v_0 + r_0^2 \sigma^2_0} \right)}.$$
8: Gradient descent step: $x_{t+1} \leftarrow x_t - \gamma_t g_t$.
9: end for

Proof of Theorem 4. First, observe that at every timestep t, conditioned on $\mathcal{F}_t = \sigma(g_{1:t-1}, x_{1:t})$, we have by Lemma 9 that with probability $1 - \frac{\delta}{T}$ the sample variance $\hat{\sigma}^2_t$ satisfies, for some absolute constant $c > 0$,
$$\left| \frac{\hat{\sigma}^2_t}{\sigma^2(x_t)} - 1 \right| \le c\, K_{\mathrm{snr}}^{-2} \left( \sqrt{\frac{\log \frac{2 (b \vee d) T}{\delta}}{b}} + \frac{\log \frac{2 (b \vee d) T}{\delta}}{b} \right),$$
where $\sigma^2_t = \sigma^2(x_t)$ denotes the variance of the noise at $x_t$ (we do not assume that the noise distribution is the same for all t). By our assumption on the minibatch size b, the right-hand side above is at most $1 - \frac{\theta}{K^2_{\mathrm{snr}}}$, and therefore
$$\frac{\hat{\sigma}^2_t}{\sigma^2(x_t)} \ge 1 - \left( 1 - \frac{\theta}{K^2_{\mathrm{snr}}} \right) = \frac{\theta}{K^2_{\mathrm{snr}}}.$$
Multiplying both sides by $\sigma^2_t$ we get
$$\hat{\sigma}^2_t \ge \frac{\theta\, \sigma^2_t}{K^2_{\mathrm{snr}}} = R^2 \theta.$$
Therefore $\hat{\sigma}^2_t / \theta$ is, with high probability, an upper bound on any noise norm, and we can use it as normalization in T-DoG/T-DoWG. This is the key idea of the proof, and it is entirely owed to Lemma 9. The rest of the proof follows (Ivgi et al., 2023) with only a few changes to incorporate the variance estimation process.

Following (Ivgi et al., 2023), we define the stopping time
$$T_{\mathrm{out}} = \min \left\{ t \mid r_t > 3 d_0 \right\},$$
and define the proxy sequences
$$\tilde{\eta}_k = \begin{cases} \eta_k & \text{if } k < T_{\mathrm{out}}, \\ 0 & \text{otherwise}, \end{cases} \qquad \tilde{\gamma}_k = \begin{cases} \gamma_k & \text{if } k < T_{\mathrm{out}}, \\ 0 & \text{otherwise}. \end{cases} \tag{52}$$

Lemma 10 (Modification of (Ivgi et al., 2023, Lemma 8)). Under the conditions of Theorem 4, both the DoG (50) and DoWG (51) updates satisfy for all $t \le T$:
$$\rho_t \in \sigma \left( g_0, \mu^1_0, \ldots, \mu^b_0, \ldots, g_{t-1}, \mu^1_{t-1}, \ldots, \mu^b_{t-1} \right), \qquad \left| \rho_t \langle g_t - \nabla f(x_t), x_t - x^\star \rangle \right| \le \frac{6 d_0^2}{8^2 \theta_{T,\delta}},$$
$$\sum_{k=0}^{t-1} \rho_k^2 \|g_k\|^2 \le \frac{9 d_0^2}{8^4 \theta_{T,\delta}}, \qquad \sum_{k=0}^{t-1} \left( \rho_k \langle g_k, x_k - x^\star \rangle \right)^2 \le \frac{12^2 d_0^4}{8^4 \theta_{T,\delta}},$$
where $\rho_t$ stands for either the DoG stepsize proxy $\tilde{\eta}_t$ or the DoWG stepsize proxy $\tilde{\gamma}_t$.

Proof. The modification of this lemma to account for bounded noise $\|g(x_k) - \nabla f(x_k)\|$ rather than bounded gradients is straightforward, and we omit it for simplicity. □

Lemma 11 (Modification of (Ivgi et al., 2023, Lemma 9)). Under the conditions of Theorem 4, both the DoG (50) and DoWG (51) updates satisfy for all $t \le T$, with probability at least $1 - \delta$,
$$\left| \sum_{k=0}^{t-1} \tilde{\eta}_k \langle g_k - \nabla f(x_k), x^\star - x_k \rangle \right| \le d_0^2.$$

Proof. The modification is straightforward and omitted. □

Lemma 12 (Modification of (Ivgi et al., 2023, Lemma 10)). Under the conditions of Theorem 4, if $\sum_{k=0}^{t-1} \rho_k \langle g_k - \nabla f(x_k), x^\star - x_k \rangle \le d_0^2$ for all $t \le T$, then $T_{\mathrm{out}} > T$.

Proof. The modification is straightforward and omitted. □

By Lemmas 11 and 12 we get that $r_T \le 3 d_0$, and it follows that $d_t = \max_{k \le t} \|x_k - x^\star\| \le \max_{k \le t} r_k + r_0 \le 4 d_0$. Then, a straightforward modification of Theorem 1 to handle the slightly smaller stepsizes used by T-DoG/T-DoWG shows that both methods are tuning-free. The proof is very similar to Theorem 1 and is omitted. □

10. Proofs for Section 5

10.1. Proof of Theorem 5

Proof. We use the exact same construction from Theorem 2 with the following hints (here the hints on the initial suboptimality $\Delta = f(x_0) - f^\star$ replace the hints on the initial distance):
$$\underline{L} = \bar{L} = L, \qquad \underline{\Delta} = \frac{L}{2} \min(v, |u-v|)^2, \qquad \bar{\Delta} = \frac{L}{2} \max(v, |u-v|)^2, \qquad \underline{\sigma} = \sigma, \qquad \bar{\sigma} = \sigma + T L u,$$
where $u > 0$ and $v > 0$ are parameters we shall choose later.
Suppose that the algorithm's output point $x_{\mathrm{out}}$ satisfies, with probability $1 - \delta$, the tuning-free nonconvex guarantee
$$\|\nabla f(x_{\mathrm{out}})\|^2 \le c \iota \left[ \sqrt{\frac{L \left( f(x_0) - f^\star \right) \sigma_f^2}{T}} + \frac{L \left( f(x_0) - f^\star \right)}{T} \right].$$
We now use the facts that $f(x_0) - f^\star = \frac{L}{2} (x_0 - x^\star)^2$ and $\|\nabla f(x)\|^2 = L^2 (x - x^\star)^2$ to get
$$L^2 (x_{\mathrm{out}} - x^\star)^2 = \|\nabla f(x_{\mathrm{out}})\|^2 \le c \iota\, \frac{L |x_0 - x^\star| \sigma_f}{\sqrt{T}} + c \iota\, \frac{L^2 (x_0 - x^\star)^2}{T}.$$
Dividing both sides by $L^2$ we get
$$(x_{\mathrm{out}} - x^\star)^2 \le \frac{c \iota\, |x_0 - x^\star| \sigma_f}{L \sqrt{T}} + \frac{c \iota\, (x_0 - x^\star)^2}{T}.$$
Taking square roots and using the triangle inequality gives
$$|x_{\mathrm{out}} - x^\star| \le \sqrt{c \iota} \sqrt{|x_0 - x^\star|} \sqrt{\frac{\sigma_f}{L}}\, T^{-1/4} + \sqrt{c \iota}\, \frac{|x_0 - x^\star|}{\sqrt{T}}. \tag{53}$$
Applying equation (53) to the function f with $x_0 = v > 0$, $x^\star = 0$, $\sigma_f \le \sigma$, and $L = \sigma T$:
$$x_{\mathrm{out}} \le |x_{\mathrm{out}}| \le \sqrt{c \iota} \left( \frac{\sqrt{v}}{T^{3/4}} + \frac{v}{\sqrt{T}} \right). \tag{54}$$
On the other hand, applying equation (53) to the function $h = \frac{L}{2} (x - u)^2$ (as in the proof of Theorem 2, with $x^\star = u$ and $\sigma_h \le \sigma + (T-1) L u$) we obtain
$$|x_{\mathrm{out}} - u| \le \sqrt{c \iota} \sqrt{|u - v|} \sqrt{\frac{\sigma + (T-1) L u}{L}}\, T^{-1/4} + \sqrt{c \iota}\, \frac{|u - v|}{\sqrt{T}} \le \sqrt{c \iota} \left( \frac{\sqrt{|u - v|}}{T^{3/4}} + T^{1/4} \sqrt{u |u - v|} + \frac{|u - v|}{\sqrt{T}} \right),$$
using $\frac{\sigma + (T-1) L u}{L} \le \frac{1}{T} + T u$ together with $L = \sigma T$. Therefore
$$x_{\mathrm{out}} \ge u - \sqrt{c \iota} \left( \frac{\sqrt{|u - v|}}{T^{3/4}} + T^{1/4} \sqrt{u |u - v|} + \frac{|u - v|}{\sqrt{T}} \right). \tag{55}$$
Combining equations (54) and (55) gives
$$u \le \sqrt{c \iota} \left( \frac{\sqrt{v}}{T^{3/4}} + \frac{v}{\sqrt{T}} + \frac{\sqrt{|u - v|}}{T^{3/4}} + T^{1/4} \sqrt{u |u - v|} + \frac{|u - v|}{\sqrt{T}} \right).$$
Dividing both sides by $\sqrt{c \iota}$ and putting $v = T^2$ and $u = T^2 + 1$, we get
$$\frac{T^2 + 1}{\sqrt{c \iota}} \le \frac{T}{T^{3/4}} + \frac{T^2}{\sqrt{T}} + \frac{1}{T^{3/4}} + T^{1/4} \sqrt{T^2 + 1} + \frac{1}{\sqrt{T}}.$$
For large enough T, since $\iota = \operatorname{poly}(\log T)$, this inequality does not hold: the left-hand side grows like $T^2 / \operatorname{poly}(\log T)$ while the right-hand side grows like $T^{3/2}$. Therefore we get our contradiction. □

10.2. Proof of Theorem 6

Theorem 7 ((Liu et al., 2023), high-probability convergence of SGD in the nonconvex setting). Let f be L-smooth and possibly nonconvex. Suppose that the stochastic gradient noise is $R^2$-sub-gaussian. Then for any fixed stepsize η such that $\eta L \le 1$ we have, with probability at least $1 - \delta$,
$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \le \frac{2 \left( f(x_0) - f^\star \right)}{\eta T} + 5 \eta L R^2 + \frac{12 R^2 \log \frac{1}{\delta}}{T}.$$

Proof. This is a very straightforward generalization of (Liu et al., 2023, Theorem 4.1), and we include it for completeness. By (Liu et al., 2023, Corollary 4.4), if $\eta_t L \le 1$ and $0 \le w_t \eta_t^2 L \le \frac{1}{2 R^2}$, then with probability at least $1 - \delta$ the weighted sum $\sum_{t=1}^{T} v_t \|\nabla f(x_t)\|^2$, with weights
$$v_t = w_t \eta_t - 3 R^2 w_t^2 \eta_t^2 \left( \eta_t L - 1 \right)^2,$$
is bounded by the telescoping quantity $w_1 \Delta_1 - w_T \Delta_{T+1} + \sum_{t=2}^{T} (w_t - w_{t-1}) \Delta_t$, where $\Delta_t = f(x_t) - f^\star$, plus a noise term proportional to $\eta L T$ and $\log \frac{1}{\delta}$. (56)

Choose $\eta_t = \eta$ and $w_t = w = \frac{1}{6 R^2 \eta}$, so that $w \eta^2 L \le \frac{1}{6 R^2} \le \frac{1}{2 R^2}$. Then
$$3 R^2 w^2 \eta^2 \left( \eta L - 1 \right)^2 = \frac{3 R^2 \eta^2 (1 - \eta L)^2}{36 R^4 \eta^2} = \frac{(1 - \eta L)^2}{12 R^2},$$
and therefore
$$v_t = \frac{1}{6 R^2} - \frac{(1 - \eta L)^2}{12 R^2} = \frac{1}{12 R^2} \left( 2 - (1 - \eta L)^2 \right) = \frac{1}{12 R^2} \left( 1 + 2 \eta L - \eta^2 L^2 \right) \ge \frac{1}{12 R^2},$$
since $x (2 - x) \ge 0$ for $x = \eta L \in [0, 1]$. Plugging into equation (56), where the middle sum vanishes because the weights are constant, we get
$$\frac{1}{12 R^2} \sum_{t=1}^{T} \|\nabla f(x_t)\|^2 \le \frac{f(x_0) - f^\star}{6 R^2 \eta} + \frac{3 \eta L T}{8} + \log \frac{1}{\delta}.$$
Multiplying both sides by $\frac{12 R^2}{T}$ and using $12 \cdot \frac{3}{8} = \frac{9}{2} \le 5$ yields the claim. □

10.3. Restarting SGD

We will use the following lemma from (Madden et al., 2020):

Lemma 13 ((Madden et al., 2020, Lemma 33)). Let $Z = k$ with probability $p_k$ for $k \in \{1, 2, \ldots, K\}$, where $\sum_{k=1}^{K} p_k = 1$. Let $Z_1, \ldots, Z_m$ be independent copies of Z, and let $X = (X_1, \ldots, X_K)$ be a random vector on the reals independent of Z. Define $Y = (Y_1, \ldots, Y_m)$ by $Y_j = X_{Z_j}$. Then for any $\gamma > 0$ we have
$$\operatorname{Prob} \left( \min_{j \in [m]} Y_j > e \gamma \right) \le \exp(-m) + \operatorname{Prob} \left( \sum_{k=1}^{K} p_k X_k > \gamma \right).$$

Theorem 8 (Convergence of FindLeader). If we run Algorithm 4 on a set V of P points $v_1, v_2, \ldots, v_P$, with sampling budget M and per-point estimation budget K, then the output of the algorithm satisfies, for some absolute constant $c > 0$ and all $\gamma > 0$,
$$\operatorname{Prob} \left( \|\nabla f(s_{\mathrm{lead}})\|^2 > e \gamma + c\, \frac{R^2 \log \frac{2 d M}{\delta}}{K} \right) \le \exp(-M) + \operatorname{Prob} \left( \sum_{p=1}^{P} q_p \|\nabla f(v_p)\|^2 > \gamma \right) + \delta,$$
where $q_p$ is the probability that Algorithm 4 samples $v_p$, and moreover
$$\|\hat{g}_{m^\star} - \nabla f(s_{\mathrm{lead}})\| \le c \sqrt{\frac{R^2 \log \frac{2 d M}{\delta}}{K}}.$$

Algorithm 4 FindLeader(V, δ, K)
1: Require: set of points V, desired accuracy δ, and per-point estimation budget K.
2: Set $M = \log \frac{1}{\delta}$ and let $P = |V|$.
3: Construct the set $S = (s_1, \ldots, s_M)$ by sampling M points from $v_1, \ldots, v_P$ with replacement, with probabilities
$$q_i \propto \frac{1}{i + 1}, \qquad \sum_{i=1}^{P} q_i = 1.$$
4: for $m = 1$ to M do
5: Sample K stochastic gradients $g^m_1, \ldots, g^m_K$ evaluated at $s_m$ and compute their average $\hat{g}_m = \frac{1}{K} \sum_{i=1}^{K} g^m_i$.
6: Compute and store $h_m = \|\hat{g}_m\|$.
7: end for
8: Find the point $s_{\mathrm{lead}} \in S$ with the minimal average stochastic gradient norm:
$$m^\star = \arg\min_{m \in \{1, 2, \ldots, M\}} h_m, \qquad s_{\mathrm{lead}} = s_{m^\star}.$$
9: Return $s_{\mathrm{lead}}$ and its estimated gradient norm $\|\hat{g}_{m^\star}\|$.
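A sketch (ours) of Algorithm 4 follows; the sampling distribution over V uses the $\propto 1/(i+1)$ weighting on line 3 as we read it, which should be treated as an assumption of this sketch.

```python
import numpy as np

def find_leader(V, delta, K, grad_oracle, rng=np.random.default_rng()):
    """Sketch of FindLeader: pick the candidate with smallest estimated gradient norm.

    V is a list of points; grad_oracle(x, n) returns n stochastic gradients at x.
    """
    P = len(V)
    M = max(1, int(np.ceil(np.log(1.0 / delta))))      # line 2: sampling budget
    q = 1.0 / (np.arange(1, P + 1) + 1.0)              # line 3: weights 1/(i+1)
    q /= q.sum()
    idx = rng.choice(P, size=M, replace=True, p=q)     # sample S with replacement
    estimates = []
    for m in idx:
        g_hat = grad_oracle(V[m], K).mean(axis=0)      # lines 4-7: average K gradients
        estimates.append((float(np.linalg.norm(g_hat)), int(m)))
    norm_star, m_star = min(estimates, key=lambda pair: pair[0])  # line 8
    return V[m_star], norm_star                        # line 9
```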
Proof. The proof of this theorem loosely follows the proofs of (Ghadimi & Lan, 2013, Theorem 2.4) and (Madden et al., 2020, Theorem 13). First, define the following two sets of true gradients for the iterates in V and S respectively:
$$U_V = \left\{ \nabla f(v_1), \nabla f(v_2), \ldots, \nabla f(v_P) \right\}, \qquad U_S = \left\{ \nabla f(s_1), \nabla f(s_2), \ldots, \nabla f(s_M) \right\}.$$
Lemma 13 gives us
$$\operatorname{Prob} \left( \min_{m \in \{1, \ldots, M\}} \|\nabla f(s_m)\|^2 > e \gamma \right) \le \exp(-M) + \operatorname{Prob} \left( \sum_{p=1}^{P} q_p \|\nabla f(v_p)\|^2 > \gamma \right).$$
We now compute how using the minimum of the stochastic estimates $\hat{g}_m$ affects the error. Fix m. Observe that because the norm of the stochastic gradient noise $g(x) - \nabla f(x)$ is sub-gaussian with modulus $R^2$, using (Jin et al., 2019, Corollary 7) we get, with probability at least $1 - \frac{\delta}{M}$, for some absolute constant $c_1$,
$$\|\hat{g}_m - \nabla f(s_m)\| \le c_1 \sqrt{\frac{R^2 \log \frac{2 d M}{\delta}}{K}}.$$
Squaring both sides gives
$$\|\hat{g}_m - \nabla f(s_m)\|^2 \le c_1^2\, \frac{R^2 \log \frac{2 d M}{\delta}}{K}.$$
Taking a union bound gives us that, with probability at least $1 - \delta$, for all $m \in [M]$,
$$\max_{m \in [M]} \|\hat{g}_m - \nabla f(s_m)\|^2 \le c_1^2\, \frac{R^2 \log \frac{2 d M}{\delta}}{K}. \tag{57}$$
We have by straightforward algebra
$$\min_{m \in [M]} \|\hat{g}_m\|^2 = \min_{m \in [M]} \left\| \hat{g}_m - \nabla f(s_m) + \nabla f(s_m) \right\|^2 \le \min_{m \in [M]} \left[ 2 \|\hat{g}_m - \nabla f(s_m)\|^2 + 2 \|\nabla f(s_m)\|^2 \right] \le 2 \max_{\alpha \in [M]} \|\hat{g}_\alpha - \nabla f(s_\alpha)\|^2 + 2 \min_{m \in [M]} \|\nabla f(s_m)\|^2.$$
Let $m^\star$ be the argmin. Then
$$\|\nabla f(s_{m^\star})\|^2 \le 2 \|\nabla f(s_{m^\star}) - \hat{g}_{m^\star}\|^2 + 2 \|\hat{g}_{m^\star}\|^2 \le 2 \|\nabla f(s_{m^\star}) - \hat{g}_{m^\star}\|^2 + 4 \max_{m \in [M]} \|\hat{g}_m - \nabla f(s_m)\|^2 + 4 \min_{m \in [M]} \|\nabla f(s_m)\|^2 \le 6 \max_{m \in [M]} \|\hat{g}_m - \nabla f(s_m)\|^2 + 4 \min_{m \in [M]} \|\nabla f(s_m)\|^2 \le 6 c_1^2\, \frac{R^2 \log \frac{2 d M}{\delta}}{K} + 4 \min_{m \in [M]} \|\nabla f(s_m)\|^2.$$
Therefore there exists some absolute constant c such that
$$\operatorname{Prob} \left( \|\nabla f(s_{m^\star})\|^2 > e \gamma + c\, \frac{R^2 \log \frac{2 d M}{\delta}}{K} \right) \le \exp(-M) + \operatorname{Prob} \left( \sum_{p=1}^{P} q_p \|\nabla f(v_p)\|^2 > \gamma \right) + \delta.$$
It remains to put $s_{\mathrm{lead}} = s_{m^\star}$. □

Proof of Theorem 6. First, observe that Theorem 7 gives that SGD run for T steps with a fixed stepsize η such that $\eta L \le 1$ satisfies, with probability at least $1 - \delta$,
$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \le \frac{2 \left( f(x_0) - f^\star \right)}{\eta T} + 5 \eta L R^2 + \frac{12 R^2 \log \frac{1}{\delta}}{T}. \tag{59}$$
Minimizing the above in η gives
$$\eta^\star = \min \left( \frac{1}{L},\ \sqrt{\frac{2 \left( f(x_0) - f^\star \right)}{5 L R^2 T}} \right).$$
Observe that $\eta_0 \ge \eta^\star$. Now let $N^\star = \left\lceil \log_2 \frac{\eta_0}{\eta^\star} \right\rceil$.

First, if we exit Algorithm 1 at line 4, i.e. if $T_{\mathrm{total}} < N$, then by the L-smoothness of f we have
$$\|\nabla f(y_0)\|^2 \le 2 L \left( f(y_0) - f^\star \right) \le 2 L \left( f(x_0) - f^\star \right) \le \frac{2 N L \left( f(x_0) - f^\star \right)}{T_{\mathrm{total}}},$$
which fulfills the theorem's statement. From here on, we assume that $N \le T_{\mathrm{total}}$. Observe that our choice of N guarantees that $N \ge N^\star$. Let τ be the first n (in the loop on line 2 of Algorithm 1) such that $\eta_\tau \le \eta^\star$; because the stepsizes are halved between restarts, we also have $\eta^\star \le 2 \eta_\tau$. Plugging $\eta = \eta_\tau$ into equation (59) we get, with probability at least $1 - \delta$,
$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(x^\tau_t)\|^2 \le \frac{2 \left( f(x_0) - f^\star \right)}{\eta_\tau T} + 5 \eta_\tau L R^2 + \frac{12 R^2 \log \frac{1}{\delta}}{T} \le \frac{4 \left( f(x_0) - f^\star \right)}{\eta^\star T} + 5 \eta^\star L R^2 + \frac{12 R^2 \log \frac{1}{\delta}}{T} \le c_2 \left( \sqrt{\frac{L \left( f(x_0) - f^\star \right) R^2}{T}} + \frac{\left( f(x_0) - f^\star \right) L}{T} + \frac{12 R^2 \log \frac{1}{\delta}}{T} \right). \tag{60}$$
We now apply Theorem 8 with the parameters
$$V = \left\{ x^\tau_0, x^\tau_1, \ldots, x^\tau_{T-1} \right\}, \qquad \gamma = c_2 \left( \sqrt{\frac{L \left( f(x_0) - f^\star \right) R^2}{T}} + \frac{\left( f(x_0) - f^\star \right) L}{T} + \frac{12 R^2 \log \frac{1}{\delta}}{T} \right), \qquad K = T.$$
The theorem combined with equation (60) gives us that with probability at least $1 - 4 \delta$,
$$\|\nabla f(y_\tau)\|^2 \le 13 e \left( \sqrt{\frac{L \left( f(x_0) - f^\star \right) R^2}{T}} + \frac{\left( f(x_0) - f^\star \right) L}{T} + \frac{12 R^2 \log \frac{1}{\delta}}{T} \right) + c\, \frac{R^2 \log \frac{2 d M}{\delta}}{T}. \tag{61}$$
By straightforward algebra,
$$\|\hat{g}_r\|^2 = \min_{n \in [N]} \|\hat{g}_n\|^2 \le \min_{n \in [N]} \left[ 2 \|\hat{g}_n - \nabla f(y_n)\|^2 + 2 \|\nabla f(y_n)\|^2 \right] \le 2 \max_{\alpha \in [N]} \|\hat{g}_\alpha - \nabla f(y_\alpha)\|^2 + 2 \min_{n \in [N]} \|\nabla f(y_n)\|^2.$$
Recall that $r = \arg\min_{n \in [N]} \|\hat{g}_n\|^2$; then, as in the proof of Theorem 8, we have
$$\|\nabla f(y_r)\|^2 \le 2 \|\nabla f(y_r) - \hat{g}_r\|^2 + 2 \|\hat{g}_r\|^2 \le 6 \max_{n \in [N]} \|\hat{g}_n - \nabla f(y_n)\|^2 + 4 \min_{n \in [N]} \|\nabla f(y_n)\|^2. \tag{62}$$
Observe that because we passed the budget $K = T$ to the FindLeader procedure, we can use equation (57) and the union bound to get that, with probability $1 - \delta$,
$$\max_{n \in [N]} \|\hat{g}_n - \nabla f(y_n)\|^2 \le c\, \frac{R^2 \log \frac{2 d N}{\delta}}{T}. \tag{63}$$
And clearly
$$\min_{n \in [N]} \|\nabla f(y_n)\|^2 \le \|\nabla f(y_\tau)\|^2. \tag{64}$$
Using the estimates of equations (61), (63) and (64) to upper bound the right-hand side of equation (62) gives us that, with probability at least $1 - 5 \delta$,
$$\|\nabla f(y_r)\|^2 \le 6 c\, \frac{R^2 \log \frac{2 d N}{\delta}}{T} + 4 \cdot 13 e \left( \sqrt{\frac{L \left( f(x_0) - f^\star \right) R^2}{T}} + \frac{\left( f(x_0) - f^\star \right) L}{T} + \frac{12 R^2 \log \frac{1}{\delta}}{T} \right) + 4 c\, \frac{R^2 \log \frac{2 d M}{\delta}}{T}.$$
Combining the terms and substituting in the definition of $T_{\mathrm{total}}$ gives the theorem's statement. □
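Algorithm 1 itself is not reproduced in this appendix; the following is a hedged sketch (ours, under the assumptions the proof above describes) of the restart scheme it analyzes: run SGD for T steps at each stepsize in a halving grid $\eta_n = \eta_0 2^{-n}$, use FindLeader with budget $K = T$ to pick a leader $y_n$ from each run, and return the $y_r$ with the smallest estimated gradient norm. All names and the loop structure are our reconstruction, not the authors' exact pseudocode.

```python
import numpy as np

def restarted_sgd(grad_oracle, x0, eta0, N, T, delta=0.1,
                  rng=np.random.default_rng()):
    """Sketch of the restart scheme analyzed in Theorem 6 (names are ours).

    grad_oracle(x, n) returns n stochastic gradients at x, shape (n, d).
    """
    leaders, estimates = [], []
    for n in range(N):
        eta = eta0 * 2.0 ** (-n)                 # halving stepsize grid
        x, iterates = x0.copy(), []
        for _ in range(T):                       # plain SGD run at stepsize eta
            x = x - eta * grad_oracle(x, 1)[0]
            iterates.append(x.copy())
        # Pick a leader y_n from this run (FindLeader sketch from above, K = T).
        y_n, _ = find_leader(iterates, delta, K=T, grad_oracle=grad_oracle, rng=rng)
        g_hat = grad_oracle(y_n, T).mean(axis=0) # re-estimate its gradient norm
        leaders.append(y_n)
        estimates.append(float(np.linalg.norm(g_hat)))
    r = int(np.argmin(estimates))                # y_r with smallest estimated norm
    return leaders[r]
```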