# how_free_is_parameterfree_stochastic_optimization__e4178d75.pdf

How Free is Parameter-Free Stochastic Optimization?

Amit Attia 1 Tomer Koren 1 2

We study the problem of parameter-free stochastic optimization, inquiring whether, and under what conditions, do fully parameter-free methods exist: these are methods that achieve convergence rates competitive with optimally tuned methods, without requiring significant knowledge of the true problem parameters. Existing parameterfree methods can only be considered partially parameter-free, as they require some non-trivial knowledge of the true problem parameters, such as a bound on the stochastic gradient norms, a bound on the distance to a minimizer, etc. In the non-convex setting, we demonstrate that a simple hyperparameter search technique results in a fully parameter-free method that outperforms more sophisticated state-of-the-art algorithms. We also provide a similar result in the convex setting with access to noisy function values under mild noise assumptions. Finally, assuming only access to stochastic gradients, we establish a lower bound that renders fully parameter-free stochastic convex optimization infeasible, and provide a method which is (partially) parameter-free up to the limit indicated by our lower bound.

1. Introduction

Stochastic first-order optimization is a cornerstone of modern machine learning, with stochastic gradient descent (SGD, Robbins & Monro, 1951) as the go-to method for addressing statistical learning problems. The tuning of algorithmic parameters, particularly those associated with SGD such as the step-size, proves to be a challenging task, both theoretically and in practice (e.g., Bottou, 2012; Schaul et al., 2013), especially when the problem parameters are unknown. The present paper focuses on theoretical aspects of this challenge.

1Blavatnik School of Computer Science, Tel Aviv University 2Google Research Tel Aviv. Correspondence to: Amit Attia <amitattia@mail.tau.ac.il>.

Proceedings of the 41𝑠𝑡International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Addressing the challenge, a variety of methods emerged over the years, with the goal of minimizing knowledge about problem parameters required for tuning, while maintaining performance competitive with carefully tuned algorithms. These include the so-called adaptive methods, such as Ada Grad and Adam (e.g., Duchi et al., 2011; Kingma & Ba, 2015) and their recent theoretical advancements (Reddi et al., 2018; Tran et al., 2019; Kavis et al., 2019; Alacaoglu et al., 2020; Faw et al., 2022; Kavis et al., 2022; Attia & Koren, 2023; Liu et al., 2023), that self-tune learning rates based on gradient statistics; parameter-free methods (e.g., Chaudhuri et al., 2009; Streeter & Mc Mahan, 2012; Luo & Schapire, 2015; Orabona & P al, 2021; Carmon & Hinder, 2022) that focus primarily on automatically adapting to the complexity of (i.e., distance to) an optimal solution; and advanced techniques from the literature on online learning (Orabona & P al, 2016; Cutkosky & Orabona, 2018) that can be used in a stochastic optimization setting via an online-to-batch reduction. More recently, several works have achieved advancements in improving practical performance, narrowing the gap to finely-tuned methods (Orabona & Tommasi, 2017; Chen et al., 2020; Ivgi et al., 2023; Defazio & Mishchenko, 2023; Mishchenko & Defazio, 2023).

While adaptive and parameter-free methods enjoy strong theoretical guarantees, each of the aforementioned approaches still demands some level of non-trivial knowledge regarding problem parameters. This includes knowledge about the smoothness parameter and the initial suboptimality for non-convex adaptive and self-tuning methods, or a bound on either the stochastic gradient norms or on the distance to an optimal solution for parameter-free convex (including online) optimization algorithms. To the best of our knowledge, in the absence of such non-trivial assumptions, no method in the literature is fully parameter-free, in either the non-convex or convex, smooth or non-smooth cases.

This paper explores the following question: in what stochastic optimization scenarios, and under what conditions, do fully parameter-free optimization algorithms exist? In this context, we note that direct hyperparameter tuning, which parameter-free self-tuning methods aim to eschew, could be considered a valid approach for designing parameterfree algorithms. Consequently, we also examine the question: can self-tuning optimization algorithms actually outperform direct tuning methods based on simple hyperparam-

How Free is Parameter-Free Stochastic Optimization?

eter search?

We begin our inquiry in the general setting of non-convex (smooth) optimization. We observe that a simple hyperparameter search technique for tuning of fixed stepsize SGD results in a fully parameter-free method with convergence rate that matches optimally tuned SGD up to poly-logarithmic factors. While extremely straightforward, this approach outperforms the state-of-the-art bounds for adaptive methods in this context (Faw et al., 2022; Kavis et al., 2022; Attia & Koren, 2023; Liu et al., 2023) that require non-trivial knowledge on the smoothness parameter and the initial suboptimality for optimal tuning.

Moving on to the convex optimization setting, we observe that a similar hyperparameter search technique can be used to design a simple and efficient parameter-free algorithm, provided access to noisy function value queries to the objective. Under very mild assumptions on the scale of noise in the value queries, the obtained algorithm is fully parameterfree with convergence rate matching that of perfectly tuned SGD (up to poly-log factors), in both the smooth and nonsmooth convex cases. This algorithm thus outperforms existing results in the literature, albeit under an additional assumption of a reasonably-behaved stochastic function value oracle. Indeed, an inspection of previous studies in this context reveals that they neglect access to function values and only rely on stochastic gradient queries (even though function value access is often readily available in practical applications of these methods).

Our last area of focus is therefore the convex optimization setting with access solely to stochastic gradients (and no access whatsoever to function values), which emerges as the only general stochastic optimization setting where achieving parameter-freeness is potentially non-trivial. Indeed, and perhaps somewhat surprisingly, we identify in this setting a limitation to obtaining a fully parameter-free algorithm: we establish a lower bound showing that any optimization method cannot be parameter-free unless it is provided with either the gradient-noise magnitude or the distance to a minimizer, each up to a (multiplicative) factor of 𝑂(

𝑇) where 𝑇is the number of optimization steps. The bound sheds light on why prior methods require non-trivial knowledge of problem parameters, yet still leaves room for improvement: state-of-the-art results in this context (e.g., Carmon & Hinder, 2022; Ivgi et al., 2023) assume knowledge on the magnitude of the stochastic gradients, whereas our lower bound only necessitates knowledge on the magnitude of the noise component in the gradient estimates.

Our final contribution complements the lower bound and introduces a parameter-free method in the convex non-smooth (Lipschitz) setting, achieving the same rate as of optimallytuned SGD, up to poly-log factors and an unavoidable term prescribed by our lower bound. When given a bound on

the noise in the stochastic gradients accurate up to a 𝑂(

𝑇) factor, our method becomes fully parameter-free. The same method achieves a similar parameter-free guarantee in the setting of convex smooth optimization, nearly matching the rate of tuned SGD in this case.

1.1. Summary of contributions

To summarize our results in some more detail, let us first specify the setup of parameter-free optimization slightly more concretely. In all cases, we consider unconstrained optimization of an objective function 𝑓: ℝ𝑑 ℝ. Rather than directly receiving the ground-true problem parameters (e.g., smoothness parameter, Lipschitz constant, etc.), the algorithms are provided with a range (lower and upper bounds) containing each parameter. We refer to a method as parameter-free if its convergence rate matches a benchmark rate (e.g., that of optimally-tuned SGD), up to polylogarithmic factors in the ranges parameters (as well as in the number of steps 𝑇and the probability margin 𝛿).

Non-convex setting: fully parameter-free algorithm. We give a fully parameter-free method with the same convergence rate as tuned SGD up to poly-log factors. Given an initialization 𝑤1 ℝ𝑑, number of queries 𝑇, probability margin 𝛿and a range [𝜂min, 𝜂max] containing the theoreticallytuned SGD stepsize (the latter can be computed from the problem parameter ranges), the algorithm produces 𝑤 ℝ𝑑

such that with probability at least 1 𝛿,

𝑓(𝑤) 2 𝛽 𝐹 + 𝜎2 𝑇 + 𝜎 𝛽 𝐹

poly log 𝜂max 𝜂min , 1

where 𝛽 is the smoothness parameter, 𝐹 = 𝑓(𝑤1) min 𝑓(𝑤), and 𝜎 is a noise bound of the stochastic gradients. While the algorithm is based on simple observations, it is fully parameter-free and provides stronger guarantees with a simpler analysis compared to existing adaptive or self-tuning methods in this setting.

Convex setting: fully parameter-free algorithm with noisy function values. Moving to the convex case, we first consider the canonical setting with access to noisy function values. We devise a simple parameter-free method, that given 𝑤1, 𝑇, 𝛿and a range [𝜂min, 𝜂max] containing the theoretically-tuned SGD stepsize, produces 𝑤 ℝ𝑑such that with probability 1 𝛿,

poly log 𝜂max 𝜂min , 1

where 𝐺 is the Lipschitz constant, 𝐷 = 𝑤1 𝑤 (𝑤

is a minimizer of 𝑓), and 𝜎0 and 𝜎 are noise bounds of the function and gradient oracles respectively. Under very mild

How Free is Parameter-Free Stochastic Optimization?

assumptions on the function values noise 𝜎0, the method is fully parameter-free and achieves the same rate as tuned SGD up to poly-log factors.

Convex setting: impossibility without function values. Next we consider the convex setting without any access to function values. We establish that without further assumptions, no convex optimization method can be fully parameter-free while nearly matching the rate of tuned SGD. We show that for any 𝛼 [1, 3

𝑇] and every 𝑇-queries deterministic algorithm receiving ranges [ 1 𝑎

𝑇𝐷max, 𝐷max],

𝑇𝜎max, 𝜎max] and known 𝐺 = 𝜎max (2𝑇 1) , there exist 𝐷 and 𝜎 that belong to the ranges (resp.), a convex and 𝐺 -Lipschitz function 𝑓with a minimizer 𝑤 such that 𝑤 = 𝐷 , and a 𝜎 -bounded first-order oracle such that with constant probability, the output of the algorithm, 𝑤, satisfy

𝑓(𝑤) 𝑓(𝑤 ) 𝐷 (𝐺 + 𝜎 )𝛼

Thus, without non-trivial prior knowledge on the parameter ranges, no algorithm can match the performance of tuned SGD and must include a term linear in 𝜎max.

Parameter-free algorithm for the convex non-smooth setting. Assuming only gradient access, we propose a method which requires knowledge of 𝐷min, 𝐷max, 𝐺max and 𝜎max such that 𝐷min 𝐷 = 𝑤1 𝑤 𝐷max, 𝐺 𝐺max and 𝜎 𝜎max and produces 𝑤such that with probability at least 1 𝛿,

𝑓(𝑤) 𝑓 e 𝐶 𝐷 (𝐺 + 𝜎 )

where e 𝐶= poly log( 𝐷max

𝐷min , 𝐺max, 𝜎max, 𝑇

𝛿). The method is parameter-free if 𝜎max is provided up to a tolerance of 𝑂(

𝑇), in which case it achieves the same rate of convergence as tuned SGD up to logarithmic factors. This is the maximal tolerance possible in light of our lower bound, which shows that the additional 𝜎max/𝑇term is unavoidable without further assumptions.

Parameter-free algorithm for the convex smooth setting. Assuming that the objective is 𝛽 -smooth rather than Lipschitz, the same convex optimization method with parameters 𝐷min, 𝐷max, 𝛽max and 𝜎max such that 𝐷min 𝐷 𝐷max, 𝛽 𝛽max and 𝜎 𝜎max, produce 𝑤 ℝ𝑑such that with probability at least 1 𝛿,

𝑓(𝑤) 𝑓 e 𝐶 𝛽 𝐷2 𝑇 + 𝐷 𝜎

where e 𝐶= poly log(𝐷max, 1 𝐷min , 𝛽max, 𝜎max, 𝑇

𝛿). To the best of our knowledge, this is the first parameter-free method

for stochastic convex and smooth optimization (in the case where 𝜎 is provided up to 𝑂(

𝑇) tolerance, which is again required to account for the unavoidable term prescribed by our lower bound).

1.2. Additional related work

Adaptive stochastic non-convex optimization. A long line of work focus on the analysis of SGD with Ada Gradtype stepsizes (Ward et al., 2019; Li & Orabona, 2019; Kavis et al., 2022; Faw et al., 2022; Attia & Koren, 2023; Liu et al., 2023), which we refer to as Ada SGD (also known as Ada Grad-norm, e.g., Ward et al. 2019; Faw et al. 2022). Ada SGD enjoys high-probability guarantees and rate interpolation for small noise (Attia & Koren, 2023; Liu et al., 2023). Compared to our tuning-based fully parameter-free method, the analysis of Ada SGD is cumbersome and suboptimal without knowing the smoothness and function suboptimality.

Parameter-free and adaptive stochastic convex optimization methods. Existing methods assume either known diameter or known stochastic gradient norm bound. Assuming a known diameter, (Kavis et al., 2019) presented a method with an (almost) optimal rate for convex optimization and an accelerated rate for convex smooth optimization. It is an open question whether parameter-free acceleration is possible. Given a a bound of the stochastic gradient norm, Carmon & Hinder (2022) provided the first method which achieve the optimal rate up to a double-logarithmic factor in the diameter range by utilizing an efficient bisection procedure to find a good SGD stepsize. Ivgi et al. (2023) introduced a parameter-free method which use dynamic stepsizes for SGD, and demonstrated strong practical performance. Both methods are parameter-free if the bound of the stochastic gradients norm is given up to a tolerance of 𝑂(

𝑇), while our method and analysis require only such bound for the noise of the stochastic gradients; this distinction is crucial to our guarantee in the convex and smooth case, where the gradients are unbounded.

Parameter-free deterministic optimization methods. While our focus here is on parameter-free stochastic optimization, in the context of deterministic optimization several fully parameter-free methods were previously suggested. The adaptive Polyak method of Hazan & Kakade (2019) achieves almost the same rate as tuned gradient descent for convex Lipschitz and convex smooth problems. For convex smooth problems, Beck & Teboulle (2009) and Nesterov (2015) use line-search techniques to obtain the optimal (accelerated) rate. It is also worth mentioning the tuning-free methods of Defazio & Mishchenko (2023); Mishchenko & Defazio (2023), which are not fully parameter-free (they need knowledge of the Lipschitz constant) yet demonstrate

How Free is Parameter-Free Stochastic Optimization?

strong practical performance without tuning.

Parameter-free online convex optimization. In the setting of online convex optimization (OCO), several works concerned themselves with parameter-free (e.g., Mc Mahan & Orabona, 2014; Orabona & P al, 2016; Cutkosky & Orabona, 2018; Mhammedi & Koolen, 2020) and scalefree (e.g., Orabona & P al, 2018; Mhammedi & Koolen, 2020) methods. The parameter-free OCO literature assume known gradient norm bound and achieve regret of e 𝑂( 𝑢 𝐺max

𝑇log 𝑢 ) where 𝑢, 𝐺max and 𝑇are an arbitrary comparator, gradient norm bound and number of rounds respectively (Mc Mahan & Orabona, 2014). Cutkosky & Boahen (2016; 2017) established lower bounds which rule out such regret bounds if both the comparator and gradient norm bounds are unknown. Sidestepping the lower bound, several works designed parameter-free methods (with unknown Lipschitz bound and comparator norm), suffering an additional e 𝑂( 𝑢 3𝐺max) term (Cutkosky, 2019; Mhammedi & Koolen, 2020). Considering online-to-batch conversions, online parameter-free methods with known gradient norm bound still require a non-trivial bound of the stochastic gradient norms, not only the noise bound. Jun & Orabona (2019) relaxed the bounded gradients assumption using stochastic gradients, but requires a known bound of the expected gradient norms instead. We note that the method of Cutkosky (2019) can use crude range bounds for a guarantee of e 𝑂(𝐷 𝐺 /

𝑇+ 𝐷max𝐺 /𝑇3/4), which is parameter-free when 𝐷max/𝐷min = 𝑂(𝑇1/4). Diameter tolerance is a promising future research direction, as most current studies focus on gradient norm/noise tolerance.

Classical analyses of stochastic gradient descent. Ghadimi & Lan (2013) were the first to examine stochastic gradient descent in the smooth non-convex setting and obtain tight convergence. They demonstrated that a properly tuned SGD with 𝑇steps achieves a rate of 𝑂(1/𝑇+ 𝜎/

𝑇) under the assumption of uniformly bounded variance 𝜎2; Arjevani et al. (2022) provided a tight lower bound. Assuming sub-Gaussian noise, Ghadimi & Lan (2013) provided a highprobability convergence rate of e 𝑂(1/𝑇+ 𝜎2/𝑇+ 𝜎/

𝑇) by amplifying the success probability using multiple runs of SGD, while Liu et al. (2023) established a similar rate for a single run of SGD. For the convex setting, Lan (2012) provided convergence guarantees for stochastic gradient descent in composite stochastic optimization, establishing a convergence rate of 𝑂((𝐺+ 𝜎)/

𝑇) for convex and 𝐺-Lipschitz objectives and 𝑂(1/𝑇+ 𝜎/

𝑇) for convex smooth objectives, assuming uniformly bounded variance of 𝜎2. Classical non-asymptotic analyses of SGD and minimax lower bounds in stochastic optimization trace back to Nemirovskij & Yudin (1983).

Addendum: concurrent work on parameter-free stochastic optimization. Shortly after the present manuscript first appeared on ar Xiv (Attia & Koren, 2024a), other works appeared which explore similar questions in parameter-free stochastic optimization. Carmon & Hinder (2024) study the price of adaptivity to uncertainty in problem parameters in stochastic convex problems, with focus on characterizing the tight penalty (i.e., poly-logarithmic factors in the bound) one must pay for being parameter-free. They provide several lower bounds for different uncertainty scenarios, including stochastic gradients with bounded norms and stochastic gradients with a bounded second moment, as well as a result similar to our Theorem 6 that quantifies the price of uncertainty in both diameter and stochastic gradient norms. In contrast, our primary focus is on studying conditions under which parameter-free algorithms are possible in various optimization scenarios, modulo poly-logarithmic factors in the convergence bounds.

In an independent work, Khaled & Jin (2024) study parameter-free optimization in the non-convex and convex settings and establish a set of results closely related to ours. For non-convex smooth optimization, they provide an upper bound similar to our Theorem 1 (they also give a lower bound for the in-expectation rate of SGD, showing that it is unattainable by parameter-free methods). For convex optimization, they prove that full parameter-freeness is in general impossible, while showing a convergence result that inversely depends (polynomially) on a certain SNR parameter, that for favorable noise distributions is bounded away from zero, in which case the algorithm is parameter-free. Comparing to our lower bound in Theorem 6, we provide a detailed characterization of the cost of being parameter-free in this setting, showing how convergence of parameter-free methods degrades with the degree of uncertainty in the gradient noise parameter. In terms of upper bounds, we establish feasibility of parameter-free convex optimization with no dependence on SNR (that can be arbitrarily small even for bounded noise distributions), with rates matching our lower bound up to poly-log factors. In addition, we establish an analogous upper upper bound for convex smooth problems, which is, to our knowledge, the first parameter-free result in this setting.

2. Preliminaries

Notation. Throughout, we use log to denote the base 2 logarithm and log+( ) 1 + log( ). e 𝑂and b 𝑂are asymptotic notations that suppress logarithmic and doubly-logarithmic factors respectively (in addition to constant factors).

2.1. Optimization setup

Our focus is unconstrained stochastic optimization in 𝑑dimensional Euclidean space ℝ𝑑with the ℓ2 norm, denoted

How Free is Parameter-Free Stochastic Optimization?

. We assume a stochastic first-order access to a differentiable objective function 𝑓: ℝ𝑑 ℝthrough a randomized oracle O. We will consider two variants of such access, common to the literature on stochastic optimization:

(i) Stochastic first-order oracle: The oracle O generates, given any 𝑤 ℝ𝑑, unbiased value and gradient samples for 𝑓at the point 𝑤, namely O(𝑤) = ( e𝑓(𝑤), e𝑔(𝑤)) such that for all 𝑤 ℝ𝑑, 𝔼[ e𝑓(𝑤)] = 𝑓(𝑤) and 𝔼[e𝑔(𝑤)] = 𝑓(𝑤). We will make the standard assumption that e𝑓and e𝑔have uniformly bounded noise,1 that is, for some parameters 𝜎0 > 0 and 𝜎 > 0,

𝑤 ℝ𝑑: Pr(| e𝑓(𝑤) 𝑓(𝑤)| 𝜎0) = 1

and Pr( e𝑔(𝑤) 𝑓(𝑤) 𝜎 ) = 1.

(ii) Stochastic gradient oracle: The oracle O only generates a stochastic gradient sample with 𝜎 -bounded noise for 𝑓given a point 𝑤; namely O(𝑤) = (e𝑔(𝑤)) such that for all 𝑤 ℝ𝑑, 𝔼[e𝑔(𝑤)] = 𝑓(𝑤) and

Pr( e𝑔(𝑤) 𝑓(𝑤) 𝜎 ) = 1.

In this framework, we distinguish between several different optimization scenarios:

(i) Non-convex setting. In this case we will assume the objective is not necessarily convex but is 𝛽 -smooth.2

We further assume that the function is lower bounded by some 𝑓 ℝand denote 𝐹 𝑓(𝑤1) 𝑓 for a given reference point 𝑤1 ℝ𝑑. The goal in the setting is, given 𝑇queries to the stochastic oracle and 𝛿> 0, is to find a point 𝑤 ℝ𝑑such that 𝑓(𝑤) 2 𝜖with probability at least 1 𝛿, for 𝜖as small as possible (as a function of the problem parameters 𝛽 , 𝐹 , 𝜎 , 𝜎0 and 𝑇, 𝛿).

(ii) Convex non-smooth setting. Here we assume 𝑓is convex, 𝐺 -Lipschitz and admits a minimizer 𝑤 arg min𝑤 ℝ𝑑𝑓(𝑤) with value 𝑓 𝑓(𝑤 ). We denote 𝐷 𝑤1 𝑤 for a reference point 𝑤1 ℝ𝑑. The goal is, given 𝑇queries to the stochastic oracle and 𝛿> 0, is to produce a point 𝑤 ℝ𝑑such that 𝑓(𝑤) 𝑓 𝜖with probability at least 1 𝛿, for 𝜖as small as possible (in terms of the problem parameters 𝐺 , 𝐷 , 𝜎 , 𝜎0 and 𝑇, 𝛿).

1The bounded noise assumption is made here for simplicity and can be relaxed to a sub-Gaussian assumption on the noise; see, for example, Attia & Koren (2024b) or Appendix C of Attia & Koren (2023) for further details. 2A function 𝑓 : ℝ𝑑 ℝis said to be 𝛽-smooth if 𝑓(𝑥) 𝑓(𝑦) 𝛽 𝑥 𝑦 for all 𝑥, 𝑦 ℝ𝑑. In particular, this implies that | 𝑓(𝑦) 𝑓(𝑤) 𝑓(𝑥) (𝑦 𝑥)| 𝛽 2 𝑦 𝑥 2

for all 𝑥, 𝑦 ℝ𝑑.

(iii) Convex and smooth setting. This scenario is identical to the former, but here we assume that 𝑓is 𝛽 -smooth rather than 𝐺 -Lipschitz. The convergence rate 𝜖is then given in terms of the parameters 𝛽 , 𝐷 , 𝜎 , 𝜎0 and 𝑇, 𝛿.

2.2. Parameter-free algorithmic setup

Since our focus is on parameter-free optimization, it is crucial to specify what algorithms in this setting are allowed to receive as input. First, we will always assume that the number of steps 𝑇, the failure probability 𝛿and the reference point 𝑤1 are given as inputs.3 The remaining problem parameters are not known ahead of time and thus are not received as inputs; however, we will assume that algorithms do receive, for each ground-truth parameter, a crude range (i.e., lower and upper bounds) in which it lies. Namely, for any of the relevant parameters among 𝛽 , 𝐹 , 𝐺 , 𝐷 , 𝜎 , algorithms receive respective ranges such that 𝛽 [𝛽min, 𝛽max], 𝐹 [𝐹min, 𝐹max], 𝐺 [𝐺min, 𝐺max], 𝐷 [𝐷min, 𝐷max], 𝜎 [𝜎min, 𝜎max] (a range for the parameter 𝜎0 is not used in any of our methods). We refer to a method as parameter-free, with respect to a well-tuned benchmark algorithm, if its convergence rate matches that of the benchmark, up to poly-logarithmic factors in the range parameters as well as in 𝑇and 1/𝛿. (We do not include a more formal definition here, since it will not be required for the statement of any of our results.)

2.3. Tuned benchmarks

Our benchmark rates, for both convex and non-convex optimization, are those of tuned (fixed stepsize, possibly projected) Stochastic Gradient Descent (SGD). Starting at 𝑤1 ℝ𝑑, the update step of SGD is 𝑤𝑡+1 = 𝑤𝑡 𝜂𝑔𝑡, where 𝜂is the stepsize parameter, 𝑔𝑡= e𝑔(𝑤𝑡) is the stochastic gradient at step 𝑡, evaluated at 𝑤𝑡. The update of projected SGD is 𝑤𝑡+1 = ΠΩ(𝑤𝑡 𝜂𝑔𝑡), where ΠΩ( ) is the Euclidean projection onto a convex domain Ω ℝ𝑑. With a tuned stepsize parameter, this method achieves the best known rate in the non-convex setting and the optimal rate in convex nonsmooth setting (accelerated methods achieve a slightly better rate in the convex and smooth setting). The benchmark rates of SGD in the optimization scenarios described above are detailed in Table 1. Note that SGD is not parameter-free with respect to tuned SGD, unless the range parameters are such that the ground-true parameters are known up to a constant factor.

3A standard doubling trick can be used to handle an unknown 𝑇while keeping the same rate of convergence up to log-factors.

How Free is Parameter-Free Stochastic Optimization?

Table 1. Benchmark high-probability rates of tuned SGD, ignoring log-factors.

SETTING BENCHMARK RATE

Non-convex 𝛽 𝐹

𝛽 𝐹 𝜎2 𝑇 (Ghadimi & Lan, 2013) Convex, Lipschitz 𝐷 (𝐺 +𝜎 )

𝑇 (Lan, 2012)

Convex, smooth 𝛽 𝐷2 𝑇 + 𝐷 𝜎

𝑇 (Lan, 2012)

3. Parameter-Free Non-convex Optimization

We begin with the general non-convex smooth case, providing a fully parameter-free method which achieves the same rate of tuned SGD up to poly-logarithmic factors. Our approach is based on a simple observation that has also been used in the two-phase SGD approach of Ghadimi & Lan (2013). Our parameter-free method performs grid search over multiple SGD stepsizes and for each sequence of SGD iterations randomly samples a small portion of the iterations. Then the method picks the solution with the minimal approximate gradient norm based on stochastic gradients. For the full algorithm statement see Appendix A. Here we only states its guarantees.

Theorem 1. Assume that 𝑓is 𝛽 -smooth and lower bounded by some 𝑓 and e𝑔is a 𝜎 -bounded unbiased gradient oracle of 𝑓. Let 𝜂min, 𝜂max > 0 such that

𝜂min 𝜂 min 1 2𝛽 ,

where 𝐹 = 𝑓(𝑤1) 𝑓 . Then for any 𝛿 (0, 1

3), given 𝑤1, 𝑇, 𝛿, 𝜂min and 𝜂max, Algorithm 2 performs 𝑇gradient queries and produce 𝑤such that with probability 1 𝛿,

𝑓(𝑤) 2 = b 𝑂

𝜂min ) log+( 1

𝛿)(𝛽 𝐹 + 𝜎2 log 1

𝛽 𝐹 log+( 𝜂max

𝜂min ) log+( 1

First, we note that the rate of Algorithm 2 match the rate of fully-tuned SGD up to logarithmic factors. Note that valid 𝜂min, 𝜂max can be determined using ranges of the problem parameters (details at Appendix A). Additionally, the method produce a single point instead of the obscure average norm guarantee of SGD, a product of comparing multiple iterations of SGD. Comparing the result to recent advances in the analysis of adaptive SGD with Ada Grad-like stepsizes (Faw et al., 2022; Attia & Koren, 2023; Liu et al., 2023), Algorithm 2 has only logarithmic dependency in the specification of the parameters 𝜂min and 𝜂max while adaptive SGD require tight knowledge of 𝐹 /𝛽 for tuning or suffer a suboptimal polynomial dependency on problem parameters. In addition, the proof technique of Theorem 1 is simple and

straightforward compared to the cumbersome analysis of adaptive SGD with Ada Grad-like stepsizes.

Due to the simplicity of the analysis, we defer the full proof to Appendix A and provide here the general guidelines. The proof uses standard techniques (which are also used in the original analysis of Ghadimi & Lan 2013). As 𝜂 [𝜂min, 𝜂max], the algorithm s grid search execute SGD with some 𝜂 𝜂 , and with high probability, obtains (among other candidates) at least one point with a similar guarantee to tuned SGD. A concentration of the stochastic gradient norms ensures that we can select a good candidate with high probability.

4. Parameter-Free Convex Optimization

We proceed to consider convex problems, where the objective is either convex Lipschitz or convex smooth. The first result in this section assumes access via a first-order oracle (which includes access to noisy function values), while later results deal with the case of stochastic gradient oracle access.

4.1. Optimization with a stochastic first-order oracle

In this setting we assume access using a first-order oracle O(𝑤) = ( e𝑓(𝑤), e𝑔(𝑤)) with bounded noise. Stochastic function value access (with a reasonable level of noise variance) is often a valid and natural assumption since (i) it is often the case that stochastic gradients are obtained by random samples and those can be used for both function and gradient evaluation; and (ii) the practice of tuning using a validation set implicitly assume this exact assumption. We present a fully parameter-free method which utilize the function value oracle to perform model selection, similar to the use of the gradient oracle for model selection in Algorithm 2. The algorithm is detailed in Appendix B and following is its convergence result. Due to the similarities between the method and Algorithm 2, the proof is also deferred to Appendix B.

Theorem 2. Let 𝑓 : ℝ𝑑 ℝbe a differentiable, 𝐺 -Lipschitz and convex function, admitting a minimizer 𝑤 arg min𝑤 ℝ𝑑𝑓(𝑤). Let e𝑓be a 𝜎0-bounded unbiased zero-oracle of 𝑓and e𝑔be a 𝜎 -bounded unbiased gradient oracle of 𝑓. Let 𝜂min, 𝜂max > 0 such that 𝜂min

(𝐺2 + 𝜎2 )𝑇 𝜂max for some 𝑤1 ℝ𝑑. Then Algorithm 3 performs 𝑇queries and produce 𝑤such that with probability at least 1 2𝛿,

𝑓(𝑤) 𝑓(𝑤 ) = b 𝑂

(𝐺2 + 𝜎2 ) log( 𝜂max

𝜂min ) log( 1

How Free is Parameter-Free Stochastic Optimization?

Algorithm 3 obtains the same rate of convergence as tuned SGD up to logarithmic factors and a term of order 𝜎0/

𝑇, being parameter-free given the mild assumption that 𝜎0 = 𝑂(𝐷 (𝐺 + 𝜎 )). Similarly to Algorithm 2, the inputs 𝜂min, 𝜂max may be replaced by problem parameters ranges (details at Appendix B). We note that Algorithm 3 achieves an analogous guarantee for convex smooth objectives (via a proper setting of 𝜂min, 𝜂max using 𝐷min, 𝐷max, 𝛽min, 𝛽max and 𝜎min, 𝜎max).

4.2. Optimization with a stochastic gradient oracle

While function value access is a reasonable assumption which leads to a simple parameter-free method, much of the existing work on adaptive, self-tuning optimization only assume access to stochastic gradient queries. In this section we present a parameter-free method, using only stochastic gradient access, which match the convergence rate of tuned SGD up to a lower-order term 𝑂(𝜎max/𝑇) and polylogarithmic factors. Our proposed method (Algorithm 1) is based on tuning the diameter of the projected variant of Stochastic Gradient Descent with Ada Grad-like stepsizes method (Ada PSGD). For a given diameter, the update step of Ada SGD is 𝑤𝑡+1 = Proj𝑤1,𝐷(𝑤𝑡 𝜂𝑡𝑔𝑡), where Proj𝑤1,𝐷( ) is the Euclidean projection onto the 𝐷-ball centered at the initialization 𝑤1, 𝑔𝑡= e𝑔(𝑤𝑡) is the stochastic gradient at step 𝑡, and for some 𝛼> 0 and 𝛾 0,

𝛾2 + P𝑡 𝑠=1 𝑔𝑠 2 .

Our method performs multiple runs of Ada PSGD with an exponential grid of diameters and stops when all iterations are fully within the 𝐷-ball. The method may be considered as a middle ground between Carmon & Hinder (2022) and Ivgi et al. (2023) as it combines tuning and dynamic stepsizes.

Following is the convergence result of our method. Theorem 3. Let 𝑓: ℝ𝑑 ℝbe a differentiable, 𝐺 - Lipschitz and convex function, admitting a minimizer 𝑤 arg min𝑤 ℝ𝑑𝑓(𝑤). Let e𝑔be an unbiased gradient oracle of 𝑓with 𝜎 -bounded noise, and let 𝑤1 ℝ𝑑. Then Algorithm 1 with parameters 𝐷min 𝑤1 𝑤 < 𝐷max, 𝐺max 𝐺 and 𝜎max 𝜎 performs at most 𝑇log 𝐷max

𝐷min gradient queries and produce 𝑤 ℝ𝑑such that with probability at least 1 𝛿log 𝐷max

𝑓(𝑤) 𝑓(𝑤 ) 𝑤1 𝑤 (𝐺 + 𝜎 )

+ 𝑤1 𝑤 𝜎max

poly log( 𝐺max

Observing the convergence rate in Theorem 3, Algorithm 1 achieve the same rate of convergence as fully-tuned SGD

Algorithm 1: Adaptive projected SGD tuning Input: 𝐷min, 𝐷max, 𝐺max, 𝜎max, 𝑇and 𝛿 𝐾 log 𝐷max

5𝜎2max log 𝑇

𝛿 𝜃𝑇, 𝛿 log 60 log(6𝑇)

𝛼 48 log 1 + 2𝐺2 max𝑇+2𝜎2 max𝑇

5𝜎2max log 𝑇

+ 32 log(1 + 𝑇) +

𝜃𝑇, 𝛿log(1 + 𝑇) + 𝜃2 𝑇, 𝛿

for 𝑘 1, 2, . . . , 𝐾do

𝐷𝑘 2𝑘+2𝐷min 𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑇 Ada PSGD(𝐷𝑘, 𝛼, 𝛾, 𝑤1,𝑇) if max𝑡 𝑇 𝑤(𝑘) 𝑡+1 𝑤1 < 𝐷𝑘then return 𝑤(𝑘) 1

𝑇 P𝑇 𝑡=1 𝑤(𝑘) 𝑡 end end return 𝑤1 Failure case

up to logarithmic factors and an additional lower order term of e 𝑂(𝜎 /𝑇).4 Hence, in case the ratio 𝜎max/(𝜎min + 𝐺min) is 𝑂(

𝑇), Algorithm 1 is parameter-free compared to the rate of tuned SGD. Previous parameter-free results for SCO (Carmon & Hinder, 2022; Ivgi et al., 2023) suffer a similar lower order term, while our result depends only on the noise bound and not a bound of the norm of stochastic gradients which encapsulate both a noise and gradient norm bound, a property that stems from a careful combination of decorrelated stepsizes and martingale analysis instead of using the crude bound of stochastic gradients norm. (This point will subsequently prove critical in the convex smooth case where gradients are not directly bounded.) We also note that the bisection technique of Carmon & Hinder (2022) can improve the poly-logarithmic dependency; we abstain from using such a technique in favour of making the tuning procedure straightforward and leave improvements of the logarithmic factors to future work.

The convergence result of Algorithm 1 holds in fact holds more generally, assuming the gradient and noise bounds are satisfied only inside a ball of radius 8 𝑤1 𝑤 . We will prove a more general version and Theorem 3 will follow as a corollary at Appendix C. The refined result will be relevant later when we use the convergence guarantee for convex smooth objectives where a global gradient norm bound does not hold. To that end we define 𝐺8 , 𝜎8 ℝ+ as the minimal parameters satisfying, for all 𝑤 ℝ𝑑such

4In Theorem 6 we show that such a term is unavoidable without further assumptions and in Appendix D we provide a further assumption which mitigate the excess term.

How Free is Parameter-Free Stochastic Optimization?

that 𝑤 𝑤1 8 𝑤1 𝑤 ,

𝑓(𝑤) 𝐺8 , Pr( e𝑔(𝑤) 𝑓(𝑤) 𝜎8 ) = 1. (1)

Following is the general version of Theorem 3.

Theorem 4. Let 𝑓: ℝ𝑑 ℝbe a differentiable and convex function, admitting a minimizer 𝑤 arg min 𝑓(𝑤). Let e𝑔 be an unbiased gradient oracle of 𝑓such that 𝜎8 < (see Equation (1)), and let 𝑤1 ℝ𝑑. Then Algorithm 1

with parameters 𝐷min 𝑤1 𝑤 < 𝐷max, 𝐺max 𝐺8 and 𝜎max 𝜎8 , where 𝐺8 and 𝜎8 are defined by Equation (1), performs at most 𝑇log 𝐷max

𝐷min gradient queries

and produce 𝑤𝑘= 1 𝑇 P𝑇 𝑡=1 𝑤(𝑘) 𝑡 for some 𝑘 [log 𝐷max

𝐷min ] such that with probability at least 1 𝛿log 𝐷max

𝐷min , 𝐷𝑘 8 𝑤1 𝑤 and

𝑡=1 𝑓(𝑤(𝑘) 𝑡 ) 𝑓

𝑡=1 𝑔(𝑘) 𝑡 2 + 𝜎max

where e 𝐶= log( 𝑇

𝜎max ) and 𝑔(𝑘) 𝑡 is the observed gradient

at 𝑤(𝑘) 𝑡 .

The bound of the general version uses the norms of the observed stochastic gradients instead of the gradient norm and noise bounds, and a simple translation appearing in Appendix C yields Theorem 3.

In order to prove Theorem 4, we first provide several intermediate results for 𝑇-steps Ada PSGD for some 𝐷 8 𝑤1 𝑤 at Lemmas 1 to 3 (in which case the bounds 𝐺8 and 𝜎8 hold for all the projected domain). We defer their proofs to Appendix C. Differently from common analyses of projected methods, we are interested in both the local minimizer inside the 𝐷-ball and the global minimizer in ℝ𝑑. To that end, let 𝑤 𝐷= arg min𝑤: 𝑤 𝑤1 𝐷𝑓(𝑤) such that 𝑤1 𝑤 𝐷 𝑤1 𝑤 .5 The first lemma establishes a convergence guarantee with respect to the global minimizer given the following two events. The first is that all the iterates are contained (fully) within the 𝐷-ball, i.e., 𝐸1 {max𝑡 𝑇 𝑤𝑡+1 𝑤1 < 𝐷}. The second,

𝑡=1 ( 𝑓(𝑤𝑡) 𝑔𝑡) (𝑤𝑡 𝑤 )

4(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝑔𝑡 2 + 4𝜃2 𝑇, 𝛿𝜎2 8

is the concentration needed to modify the regret analysis to a high-probability analysis.

5Such 𝑤 𝐷exists since either 𝑤1 𝑤 𝐷and we may use 𝑤 𝐷= 𝑤 , or 𝑤1 𝑤 𝐷 𝐷< 𝑤1 𝑤 .

Lemma 1. Under the event 𝐸1 𝐸2,

𝑡=1 𝑓(𝑤𝑡) 𝑓(𝑤 ) 2𝛾

𝛼+ 8𝜃𝑇, 𝛿𝜎8 𝑤1 𝑤 + 𝐷

+ ( 𝑤1 𝑤 + 𝐷)( 2

𝜃𝑇, 𝛿) + 𝛼𝐷 P𝑇 𝑡=1 𝑔𝑡 2

The second lemma bounds the distance of the iterates from the initialization. This lemma will prove useful to show that 𝐸1 holds with high probability for a large enough 𝐷.

Lemma 2. Let 𝛼, 𝛾as defined by Algorithm 1. Then with probability at least 1 2𝛿, for all 𝑡 [𝑇],

𝑤𝑡+1 𝑤1 2 𝑤1 𝑤 + 1

The third lemma lower bounds the probability of 𝐸2. Note that the concentration bound does not depends on the gradient norm bound, only the observed (noisy) gradients and the noise bound.

Lemma 3. Pr(𝐸2) 1 2𝛿.

We are ready to prove the main result.

Proof of Theorem 4. Let 𝑤(𝜏) for 𝜏 [𝐾+1] be the output of Algorithm 1 (treating 𝑤1 as 𝑤(𝐾+1)). Let

𝑖= min{𝑘 [𝐾] : 𝐷𝑘> 4 𝑤1 𝑤 }.

Note that 𝑖is well defined as 𝐷𝐾= 4𝐷max > 4 𝑤1 𝑤 and that by minimality of 𝑖, 𝐷𝑖 8 𝑤1 𝑤 . We define the events 𝐸(𝑘) 1 and 𝐸(𝑘) 2 as the respective 𝐸1 and 𝐸2 events for the 𝑘 th sequence of Ada PSGD, 𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑇. From Lemma 2, with probability at least 1 2𝛿, for all 𝑡 [𝑇],

𝑤(𝑖) 𝑡+1 𝑤1 2 𝑤1 𝑤 + 1

where the last inequality follows by 𝐷𝑖> 4 𝑤1 𝑤 . Hence, Pr{𝜏 𝑖} Pr{𝜏= 𝑖| 𝜏 𝑖} 1 2𝛿. By Lemma 3, with probability at least 1 2𝛿𝐾(union bound over all 𝑘 [𝑖], where 𝑖 𝐾), 𝐸(𝑘) 2 holds for all 𝑘 [𝑖]. Again using union bound with {𝜏 𝑖}, with probability at least 1 2𝛿(𝐾+ 1), we also have that {𝜏 𝑖}, which means that both 𝐸(𝜏) 1 and 𝐸(𝜏) 2 hold and we can invoke Lemma 1. Hence, with probability 1 2𝛿(𝐾+ 1), Algorithm 1 returns 𝑤(𝜏) for some 𝜏 𝑖for which

𝑡=1 𝑓(𝑤(𝜏) 𝑡 ) 𝑓 2𝛾

𝛼+ 8𝜃𝑇, 𝛿𝜎8 𝑤1 𝑤 + 𝐷𝜏

+ ( 𝑤1 𝑤 + 𝐷𝜏)( 2

𝜃𝑇, 𝛿) + 𝛼𝐷𝜏 P𝑇 𝑡=1 𝑔𝑡 2

How Free is Parameter-Free Stochastic Optimization?

Using the inequalities 𝐷𝜏 𝐷𝑖 8 𝑤1 𝑤 , 𝜎8 𝜎max, and a standard application of Jensen s inequality,

𝑓(𝑤(𝜏)) 𝑓 = 𝑂 𝛾 𝛼𝜎max + 𝜃𝑇, 𝛿 𝑤1 𝑤 𝜎max

𝜃𝑇,𝛿) + 𝛼 𝑤1 𝑤 P𝑇 𝑡=1 𝑔(𝜏) 𝑡 2

We conclude by substituting 𝛾, noting that 𝛼 1,

𝜃𝑇,𝛿, 𝜃𝑇,𝛿, 𝛼 1 = 𝑂 log 𝑇

4.3. Convex and smooth stochastic optimization

In this setting we assume that the objective is smooth rather then Lipschitz. We prove that Algorithm 1 also achieve the same rate of convergence as tuned SGD for smooth objectives, up to logarithmic factors and a lower-order term.

Theorem 5. Let 𝑓: ℝ𝑑 ℝbe a 𝛽 -smooth and convex function, admitting a minimizer 𝑤 arg min 𝑓(𝑤). Let e𝑔be an unbiased gradient oracle of 𝑓and let 𝑤1 ℝ𝑑. Then Algorithm 1 with parameters 𝐷min 𝑤1 𝑤 < 𝐷max, 𝐺max 9𝛽 𝑤1 𝑤 and 𝜎max 𝜎8 , where 𝜎8 is defined by Equation (1), performs at most 𝑇log 𝐷max

𝐷min gradient queries and produce 𝑤such that with probability at least 1 𝛿log 𝐷max

𝐷min , 𝐷𝑘 8 𝑤1 𝑤 and

e 𝐶2𝛽 𝑤1 𝑤 2

𝑇 + e 𝐶 𝑤1 𝑤 𝜎8

+ 𝜎max e 𝐶 𝑤1 𝑤

where e 𝐶= log( 𝑇

The method is parameter-free compared to tuned SGD, assuming 𝜎max/𝜎min = 𝑂(

𝑇). Previous parameter-free methods for stochastic convex optimization (Carmon & Hinder, 2022; Ivgi et al., 2023) provided only a noiseless guarantee for smooth objectives. Attia & Koren (2023) provided a convergence result for adaptive SGD with Ada Grad-like stepsizes for convex and smooth objectives, but similarly to the non-convex case, the method is not parameter-free compared to tuned SGD.

Theorem 4 requires only a local gradients norm bound, inside a ball of radius 8 𝑤1 𝑤 . An application of smoothness with the triangle inequality yields that for all 𝑤such that 𝑤 𝑤1 8 𝑤1 𝑤 , 𝑓(𝑤) 9𝛽 𝑤1 𝑤 . Hence, we can apply Theorem 4 and the error incurred by the bound is merely logarithmic. Using a standard analysis we can translate the convex rate guarantee to a convex and smooth rate guarantee. For the full proof see Appendix C.

4.4. Information-theoretic limits to parameter freeness

Both Theorems 4 and 5 suffer an additional term of e 𝑂( 𝑤1 𝑤 𝜎max/𝑇) compared to tuned SGD, that prevents them from being fully parameter-free compared to SGD due to the polynomial dependence on 𝜎max. The following theorem shows that without additional assumptions, this term is in fact unavoidable and the results of Theorems 4 and 5 are essentially the best one can hope for. Theorem 6. Let 𝑇 4, 𝐷max > 0, 𝜎max > 0, 𝛼 [1, 3

𝑇] and 𝐺 = 𝜎max 2𝑇 1. Let A be any deterministic algorithm which performs 𝑇gradient queries and outputs 𝑤 ℝ. Then there exist some 𝐷 [ 𝐷max

𝑇, 𝐷max], 𝜎 [ 𝜎max

𝑇, 𝜎max], a convex and 𝐺 -Lipschitz function 𝑓: ℝ ℝwhich admits a minimizer 𝑤 arg min𝑤 ℝ𝑓(𝑤) such that 𝑤 = 𝐷 , and a 𝜎 -bounded unbiased sub-gradient6 oracle of 𝑓, e𝑔 such that with probability at least 1

𝑓(𝑤) 𝑓(𝑤 ) 𝐷 (𝐺 + 𝜎 )𝛼

We remark that the deterministic requirement is made for simplicity and the argument can be adapted to randomized algorithms. To understand the lower bound consider the convergence guarantee of tuned SGD. A standard in-expectation analysis of SGD with a tuned stepsize of 𝜂= 𝐷/(𝐺+ 𝜎)

𝑇and an application of Markov s inequality shows that with probability at least 3

4, SGD produce 𝑤 with 𝑓(𝑤) 𝑓(𝑤 ) = 𝑂(𝐷 (𝐺 + 𝜎 )/

𝑇). Hence, when 𝛼= 𝜔(1), the worst case analysis of SGD is strictly better than any algorithm with the limited knowledge prescribed in Theorem 6. In particular, a method cannot be parameter-free and expect a comparable rate of convergence to SGD unless one of [𝐷min, 𝐷max], [𝜎min, 𝜎max] has ratio of e 𝑂(

𝑇) or the 𝐺 value is outside the range (for example if the noise bound is small). This lower bound of Theorem 6 indeed appears in the guarantees of Carmon & Hinder (2022); Ivgi et al. (2023) in the form of the bound of stochastic gradients norm and in our Theorem 4. While the term Ω( 𝑤1 𝑤 𝜎max/𝑇) is unavoidable, in Appendix D we show that assuming the initialization is a bad enough approximate minimizer, we can infer a reasonable bound of 𝜎 which reduce the lower bound to a term proportional to the guarantee of tuned SGD.

We defer the proof to Appendix C and provide the general structure of the proof. The proof is based on constructing two problem instances which are indistinguishable with constant probability using only 𝑇queries, where one instance pretends to be the other using a large 𝜎 value. The lower bound is established by showing that no point has good sub-optimality for both problems.

6The function is of the form 𝐺 |𝑤 𝐷 |, and smoothing can be used to support smooth objectives. With a slight abuse of notation we treat 𝑓(𝑤) as some sub-gradient of 𝑓such that 𝔼[e𝑔(𝑤)] = 𝑓(𝑤).

How Free is Parameter-Free Stochastic Optimization?

Acknowledgements

We are grateful to Yair Carmon for invaluable comments and discussions. This project has received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation program (grant agreement No. 101078075). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant number 2549/19), from the Len Blavatnik and the Blavatnik Family foundation, from the Adelis Foundation, and from the Prof. Amnon Shashua and Mrs. Anat Ramaty Shashua Foundation.

Impact Statement

This paper presents work whose goal is to advance the field of theoretical Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Alacaoglu, A., Malitsky, Y., Mertikopoulos, P., and Cevher, V. A new regret analysis for adam-type algorithms. In International conference on machine learning, pp. 202 210. PMLR, 2020.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. Mathematical Programming, pp. 1 50, 2022.

Attia, A. and Koren, T. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, 2023.

Attia, A. and Koren, T. How free is parameter-free stochastic optimization? ar Xiv preprint ar Xiv:2402.03126, 2024a.

Attia, A. and Koren, T. A note on high-probability analysis of algorithms with exponential, sub-gaussian, and general light tails. ar Xiv preprint ar Xiv:2403.02873, 2024b.

Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009.

Bottou, L. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421 436. Springer, 2012.

Carmon, Y. and Hinder, O. Making sgd parameter-free. In Conference on Learning Theory, pp. 2360 2389. PMLR, 2022.

Carmon, Y. and Hinder, O. The price of adaptivity in stochastic convex optimization. ar Xiv preprint ar Xiv:2402.10898, 2024.

Chaudhuri, K., Freund, Y., and Hsu, D. J. A parameterfree hedging algorithm. Advances in neural information processing systems, 22, 2009.

Chen, K., Langford, J., and Orabona, F. Better parameterfree stochastic optimization with ode updates for coinbetting. In AAAI Conference on Artificial Intelligence, 2020.

Cutkosky, A. Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pp. 874 894. PMLR, 2019.

Cutkosky, A. and Boahen, K. Online learning without prior information. In Conference on learning theory, pp. 643 677. PMLR, 2017.

Cutkosky, A. and Boahen, K. A. Online convex optimization with unconstrained domains and losses. Advances in neural information processing systems, 29, 2016.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in banach spaces. In Conference On Learning Theory, pp. 1493 1529. PMLR, 2018.

Defazio, A. and Mishchenko, K. Learning-rate-free learning by d-adaptation. In International Conference on Machine Learning, 2023.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.

Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. A. The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In COLT, 2022.

Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013.

Hazan, E. and Kakade, S. Revisiting the polyak step size. ar Xiv preprint ar Xiv:1905.00313, 2019.

Howard, S. R., Ramdas, A., Mc Auliffe, J., and Sekhon, J. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1055 1080, 2021.

How Free is Parameter-Free Stochastic Optimization?

Ivgi, M., Hinder, O., and Carmon, Y. Dog is sgd s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.

Jun, K.-S. and Orabona, F. Parameter-free online convex optimization with sub-exponential noise. In Conference on Learning Theory, pp. 1802 1823. PMLR, 2019.

Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.

Kavis, A., Levy, K. Y., and Cevher, V. High probability bounds for a class of nonconvex algorithms with adagrad stepsize. In International Conference on Learning Representations, 2022.

Khaled, A. and Jin, C. Tuning-free stochastic optimization. ar Xiv preprint ar Xiv:2402.07793, 2024.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365 397, 2012.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983 992. PMLR, 2019.

Li, X. and Orabona, F. A high probability analysis of adaptive sgd with momentum. In Workshop on Beyond First Order Methods in ML Systems at ICML 20, 2020.

Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. High probability convergence of stochastic gradient methods. ar Xiv preprint ar Xiv:2302.14843, 2023.

Luo, H. and Schapire, R. E. Achieving all with no parameters: Adanormalhedge. In Conference on Learning Theory, pp. 1286 1304. PMLR, 2015.

Mc Mahan, H. B. and Orabona, F. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pp. 1020 1039. PMLR, 2014.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Conference on Learning Theory, pp. 2858 2887. PMLR, 2020.

Mishchenko, K. and Defazio, A. Prodigy: An expeditiously adaptive parameter-free learner. ar Xiv preprint ar Xiv:2306.06101, 2023.

Nemirovskij, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983.

Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152 (1-2):381 404, 2015.

Orabona, F. and P al, D. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.

Orabona, F. and P al, D. Scale-free online learning. Theoretical Computer Science, 716:50 69, 2018.

Orabona, F. and P al, D. Parameter-free stochastic optimization of variationally coherent functions. ar Xiv preprint ar Xiv:2102.00236, 2021.

Orabona, F. and Tommasi, T. Training deep networks without learning rates through coin betting. Advances in Neural Information Processing Systems, 30, 2017.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.

Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951.

Schaul, T., Zhang, S., and Le Cun, Y. No more pesky learning rates. In International conference on machine learning, pp. 343 351. PMLR, 2013.

Streeter, M. J. and Mc Mahan, H. B. No-regret algorithms for unconstrained online convex optimization. In Neural Information Processing Systems, 2012.

Tran, P. T. et al. On the convergence proof of amsgrad and a new version. IEEE Access, 7:61706 61716, 2019.

Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pp. 6677 6686. PMLR, 2019.

How Free is Parameter-Free Stochastic Optimization?

Algorithm 2: Non-convex SGD tuning Input: 𝑤1, 𝑇, 𝛿, 𝜂min, 𝜂max (see Equation (2) for replacing 𝜂min, 𝜂max with parameter ranges) 𝑆 {}

𝐾 log 𝜂max 𝜂min

𝜂min ) log+( 1

𝛿 𝑇 𝑇 𝐾(1+𝑁) for 𝑘 1, 2, . . . , 𝐾do

𝜂𝑘 2𝑘 1𝜂min 𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑇 SGD(𝑤1, 𝜂𝑘,𝑇 ) Uniformly at random select 𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑁 from 𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑇 𝑆 𝑆 {𝑤(𝑘) 1 , . . . , 𝑤(𝑘) 𝑛} end return arg min𝑤 𝑆 P𝑇 𝑡=1 e𝑔𝑡(𝑤) e𝑔𝑡(𝑤) is an independent stochastic gradient at 𝑤

A. Additional Details for Section 3

In this section we provide the algorithmic statement, additional details and proofs related to Theorem 1. For completeness we re-state the theorem.

Theorem 1. Assume that 𝑓is 𝛽 -smooth and lower bounded by some 𝑓 and e𝑔is a 𝜎 -bounded unbiased gradient oracle of 𝑓. Let 𝜂min, 𝜂max > 0 such that

𝜂min 𝜂 min 1 2𝛽 ,

where 𝐹 = 𝑓(𝑤1) 𝑓 . Then for any 𝛿 (0, 1

3), given 𝑤1, 𝑇, 𝛿, 𝜂min and 𝜂max, Algorithm 2 performs 𝑇gradient queries and produce 𝑤such that with probability 1 𝛿,

𝑓(𝑤) 2 = b 𝑂

𝜂min ) log+( 1

𝛿)(𝛽 𝐹 + 𝜎2 log 1

𝛽 𝐹 log+( 𝜂max

𝜂min ) log+( 1

Note that Algorithm 2 receive 𝜂min, 𝜂max instead of ranges [𝐹min, 𝐹max], [𝛽min, 𝛽max], [𝜎min, 𝜎max]. This is done for framing the algorithm as a form of stepsize tuning (which in practice requires only two hyperparameters instead of two hyperparameters per problem parameter) and given the ranges the algorithm will use

( 1 2𝛽max ,

𝐹min 𝛽max𝜎2max𝑇

and 𝜂max = min

( 1 2𝛽min ,

𝐹max 𝛽min𝜎2 min𝑇

In order to prove the theorem we need the following lemmas , their proofs follow . The first is a standard convergence result of constant stepsize SGD. Alternatively, one can use the in-expectation analysis of SGD as in the two-phase SGD method of Ghadimi & Lan (2013).

Lemma 4 (SGD convergence with high probability). Assume that 𝑓is 𝛽 -smooth and lower bounded by 𝑓 . Then for the iterates of SGD with 𝜂 1/2𝛽 , it holds with probability at least 1 𝛿that

𝑡=1 𝑓(𝑤𝑡) 2 4𝐹

𝜂 + 12𝜎2 log 1

𝑇+ 4𝛽 𝜎2 𝜂,

where 𝐹 = 𝑓(𝑤1) 𝑓 .

How Free is Parameter-Free Stochastic Optimization?

Next is a standard success amplification technique for choosing between candidate solutions SGD; the precise version below is due to Ghadimi & Lan (2013). Lemma 5. Given candidates 𝑤1, . . . , 𝑤𝑁, let 𝑤= arg min𝑛 [𝑁] P𝑇 𝑡=1 e𝑔𝑡(𝑤𝑛) where 𝑔𝑡(𝑤) for 𝑡 [𝑇] are independent stochastic gradients at 𝑤. Then assuming 𝔼[e𝑔𝑡(𝑤)] = 𝑓(𝑤) and Pr( 𝑓(𝑤) e𝑔𝑡(𝑤) 𝜎) = 1 for any fixed 𝑤, for any 𝛿 (0, 1), we have with probability at least 1 𝛿that

𝑓(𝑤) 2 4 min 𝑛 [𝑁] 𝑓(𝑤𝑛) 2 + 24(1 + 3 log 𝑁

We proceed to proving the theorem.

Proof of Theorem 1. To fix the discrepancy between 𝑇and 𝑇 , we denote

𝐹 log+( 𝜂max

𝜂min ) log+( 1

Note that 𝜂1 𝜂 𝜂 𝜂

𝜂min ) log+( 1

𝜂min ) log+( 1

Let 𝜏= max{𝑘 [𝐾] : 𝜂𝑘 𝜂 }. 𝜏is well-defined as 𝜂1 𝜂 . Hence, since 2𝜂𝐾 𝜂 , 𝜂𝜏 (𝜂 /2, 𝜂 ]. Thus, by Lemma 4, with probability at least 1 𝛿,

𝑡=1 𝑓(𝑤(𝜏) 𝑡 ) 2 4𝐹

𝜂𝜏 + 12𝜎2 log 1

𝑇 + 4𝛽 𝜎2 𝜂𝜏

𝜂 + 12𝜎2 log 1

𝑇 + 4𝛽 𝜎2 𝜂

𝐾(1 + 𝑁)(16𝛽 𝐹 + 12𝜎2 log 1

𝛿) 𝑇 + 12𝐾(1 + 𝑁)𝜎 𝛽 𝐹

𝑇log+( 𝜂max

𝜂min ) log+( 1

By Markov s inequality, with probability at least 1

2, a uniformly at random index 𝑘 [𝑇 ] satisfy

𝑓(𝑤(𝜏) 𝑘) 2 2𝔼𝑘[ 𝑓(𝑤(𝜏) 𝑘) 2] = 2

𝑡=1 𝑓(𝑤(𝜏) 𝑡 ) 2,

and as we sample log 1

𝛿random indices, with probability at least 1 𝛿we add at least one point with the above guarantee to 𝑆. Hence, by a union bound with the convergence guarantee, with probability at least 1 2𝛿we add to 𝑆some 𝑤(𝜏) 𝑘 with

𝑓(𝑤(𝜏) 𝑘) 2 𝐾(1 + 𝑁)(32𝛽 𝐹 + 24𝜎2 log 1

𝛿) 𝑇 + 24𝐾(1 + 𝑁)𝜎 𝛽 𝐹

𝑇log+( 𝜂max

𝜂min ) log+( 1

The last step of the algorithm is an application of Lemma 5 (the bounded noise assumption satisfy the sub-Gaussian requirement of Lemma 5) with the set of 𝐾𝑁candidates, and we conclude with a union bound that with probability at least 1 3𝛿, 𝑤, the output of Algorithm 2 satisfy

𝑓(𝑤) 2 𝐾(1 + 𝑁)(128𝛽 𝐹 + 192𝜎2 log 𝐾𝑁

𝛿) 𝑇 + 96𝐾(1 + 𝑁)𝜎 𝛽 𝐹

𝑇log+( 𝜂max

𝜂min ) log+( 1

𝜂min ) log+( 1

𝛿)(𝛽 𝐹 + 𝜎2 log 1

𝛽 𝐹 log+( 𝜂max

𝜂min ) log+( 1

For each 𝑘 [𝐾] we perform a single 𝑇 -steps SGD execution and 𝑇 𝑁gradient queries to approximate the norm of the randomly selected points, to a total of 𝑇= 𝑇 𝐾(1 + 𝑁) queries. We translate the probability of 1 3𝛿to 1 𝛿for the b 𝑂 notation, limiting 𝛿 (0, 1

How Free is Parameter-Free Stochastic Optimization?

A.1. Proof of Lemma 4

In order to prove the lemma we use the following martingale concentration inequality due to Li & Orabona (2020).

Lemma 6 (Lemma 1 of Li & Orabona (2020)). Assume that 𝑍1, 𝑍2, ..., 𝑍𝑇is a martingale difference sequence with respect to 𝜉1, 𝜉2, ..., 𝜉𝑇and 𝔼𝑡 exp(𝑍2 𝑡/𝜎2 𝑡) exp(1) for all 𝑡, where 𝜎1, . . . , 𝜎𝑇is a sequence of random variables such that 𝜎𝑡 is measurable with respect to 𝜉1, 𝜉2, . . . , 𝜉𝑡 1. Then, for any fixed 𝜆> 0 and 𝛿 (0, 1), with probability at least 1 𝛿, it holds that 𝑇

𝑡=1 𝜎2 𝑡+ 1

Proof of Lemma 4. From smoothness,

𝑓(𝑤𝑡+1) 𝑓(𝑤𝑡) 𝜂 𝑓(𝑤𝑡) 𝑔𝑡+ 𝛽 𝜂2

Summing for 𝑡= 1, . . . ,𝑇and rearranging,

𝑡=1 𝑓(𝑤𝑡) 𝑔𝑡 𝑓(𝑤1) 𝑓(𝑤𝑇+1)

By Lemma 6 with 𝑍𝑡= 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡), where |𝑍𝑡| 𝜎 𝑓(𝑤𝑡) , and 𝜆= 1 3𝜎2 , with probability at least 1 𝛿,

𝑡=1 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡) 1

𝑡=1 𝑓(𝑤𝑡) 2 + 3𝜎2 log 1

Summing both inequalities and rearranging,

𝑡=1 𝑓(𝑤𝑡) 2 𝑓(𝑤1) 𝑓(𝑤𝑇+1)

𝑡=1 𝑔𝑡 2 + 3𝜎2 log 1

Since 𝑔𝑡 2 2 𝑓(𝑤𝑡) 2 + 2𝜎2 ,

𝑡=1 𝑓(𝑤𝑡) 2 𝑓(𝑤1) 𝑓(𝑤𝑇+1)

𝑡=1 𝑓(𝑤𝑡) 2 + 𝛽 𝜎2 𝑇𝜂+ 3𝜎2 log 1

As 𝜂 1 2𝛽 , substituting and rearranging,

𝑡=1 𝑓(𝑤𝑡) 2 4( 𝑓(𝑤1) 𝑓(𝑤𝑇+1))

𝜂 + 4𝛽 𝜎2 𝑇𝜂+ 12𝜎2 log 1

Dividing by 1

𝑇and replacing 𝑓(𝑤𝑇+1) 𝑓 we conclude the proof.

A.2. Proof of Lemma 5

In order to prove Lemma 5 we use the following martingale lemma.

Lemma 7 (Lemma 2.3 of Ghadimi & Lan (2013)). Let 𝑍1, . . . , 𝑍𝑛 ℝ𝑑be a martingale difference sequence with respect to 𝜉1, . . . , 𝜉𝑛. Assuming 𝔼[exp( 𝑍𝑖 2/𝜎2) | 𝜉1, . . . , 𝜉𝑖 1] exp(1), for any 𝜆> 0,

2(1 + 𝜆)𝜎 𝑛

exp( 𝜆2/3).

Proof of Lemma 5. Denote 𝑔(𝑤) 1

𝑇 P𝑇 𝑡=1 e𝑔𝑡(𝑤). Let 𝑘= arg min𝑛 [𝑁] 𝑓(𝑤𝑛) . Using 𝑎+ 𝑏 2 2 𝑎 2 + 2 𝑏 2 and the minimality of 𝑤,

𝑓(𝑤) 2 2 𝑔(𝑤) 2 + 2 𝑔(𝑤) 𝑓(𝑤) 2 2 𝑔(𝑤𝑘) 2 + 2 𝑔(𝑤) 𝑓(𝑤) 2

4 𝑓(𝑤𝑘) 2 + 4 𝑔(𝑤𝑘) 𝑓(𝑤𝑘) 2 + 2 𝑔(𝑤) 𝑓(𝑤) 2.

How Free is Parameter-Free Stochastic Optimization?

Algorithm 3: Convex SGD tuning

Input: 𝑤1, 𝑇, 𝛿, 𝜂min, 𝜂max (see Eq. treating 0

0 0 for replacing 𝜂min, 𝜂max with parameter ranges) 𝑆 {} 𝐾 log 𝜂max

𝜂min 𝑁 log 1

𝛿 𝑇 𝑇 2𝐾𝑁 for 𝑘 1, 2, . . . , 𝐾do

𝜂𝑘 2𝑘 1𝜂min

2𝐾𝑁factor fix discrepancy between 𝑇and 𝑇

for 𝑛 1, 2, . . . , 𝑁do

𝑤(𝑛) 𝑘 SGD(𝑤1, 𝜂𝑘,𝑇 ) 𝑤(𝑛) 𝑘 is the average of the iterates of SGD 𝑆 𝑆 {𝑤(𝑛) 𝑘} end end return arg min𝑤 𝑆 P𝑇 𝑡=1 e𝑓𝑡(𝑤) e𝑓𝑡(𝑤) is an independent evaluation of e𝑓(𝑤)

Let 𝑛 [𝑁]. By Lemma 7 (note that the bounded noise assumption satisfy the sub-Gaussian condition), for any 𝜆> 0,

Pr 𝑔(𝑤𝑛) 𝑓(𝑤𝑛) 2 2(1 + 𝜆)2𝜎2

𝑡=1 (e𝑔𝑡(𝑤𝑛) 𝑓(𝑤𝑛))

Thus, under a union bound,

Pr 𝑛 [𝑁], 𝑔(𝑤𝑛) 𝑓(𝑤𝑛) 2 2(1 + 𝜆)2𝜎2

𝑆exp( 𝜆2/3).

𝛿, with probability at least 1 𝛿, for all 𝑛 [𝑁],

𝑔(𝑤𝑛) 𝑓(𝑤𝑛) 2 4(1 + 𝜆2)𝜎2

𝑇 4(1 + 3 log 𝑁

Thus, with probability at least 1 𝛿.

𝑓(𝑤) 2 4 𝑓(𝑤𝑘) 2 + 4 𝑔(𝑤𝑘) 𝑓(𝑤𝑘) 2 + 2 𝑔(𝑤) 𝑓(𝑤) 2

4 𝑓(𝑤𝑘) 2 + 24(1 + 3 log 𝑁

B. Additional Details for Section 4.1

In this section we provide the algorithmic statement, additional details and proofs related to Theorem 2. For completeness we re-state the theorem.

Theorem 2. Let 𝑓: ℝ𝑑 ℝbe a differentiable, 𝐺 -Lipschitz and convex function, admitting a minimizer 𝑤 arg min𝑤 ℝ𝑑𝑓(𝑤). Let e𝑓be a 𝜎0-bounded unbiased zero-oracle of 𝑓and e𝑔be a 𝜎 -bounded unbiased gradient oracle of

𝑓. Let 𝜂min, 𝜂max > 0 such that 𝜂min 𝑤1 𝑤 /

(𝐺2 + 𝜎2 )𝑇 𝜂max for some 𝑤1 ℝ𝑑. Then Algorithm 3 performs 𝑇 queries and produce 𝑤such that with probability at least 1 2𝛿,

𝑓(𝑤) 𝑓(𝑤 ) = b 𝑂

(𝐺2 + 𝜎2 ) log( 𝜂max

𝜂min ) log( 1

How Free is Parameter-Free Stochastic Optimization?

Similarly to Algorithm 2, the inputs 𝜂min, 𝜂max may be replaced by problem parameters ranges by setting

𝜂min = 𝐷min

(𝐺2max + 𝜎2max)𝑇 and 𝜂max = 𝐷max

(𝐺2 min + 𝜎2 min)𝑇 .

Before proving Theorem 2, we need the following standard lemmas , their proofs follow. First is an in-expectation convergence of SGD. Lemma 8. Let 𝑓: ℝ𝑑 ℝbe a differentiable, 𝐺 -Lipschitz and convex function, admitting a minimizer 𝑤 arg min𝑤 ℝ𝑑𝑓(𝑤) and let e𝑔: ℝ𝑑 ℝ𝑑be a 𝜎 -bounded unbiased gradient oracle of 𝑓. Given some 𝑤1 ℝ𝑑, the iterates of 𝑇-steps SGD with stepsize 𝜂> 0 satisfy

2𝜂𝑇 + 𝜂(𝐺2 + 𝜎2 ) 2 .

The second is a candidate selection lemma based on stochastic function value access.

Lemma 9. Let 𝑓: ℝ𝑑 ℝand e𝑓a zero-order oracle of 𝑓such that

𝑤 ℝ𝑑: 𝔼[ e𝑓(𝑤)] = 𝑓(𝑤) and Pr(| e𝑓(𝑤) 𝑓(𝑤)| 𝜎0) = 1

for some 𝜎0 > 0. Given candidates 𝑤1, . . . , 𝑤𝑁, let

𝑤= arg min 𝑛 [𝑁]

𝑡=1 e𝑓𝑡(𝑤𝑛),

where e𝑓𝑡(𝑤𝑛) for 𝑡 [𝑇] are independent function evaluations at 𝑤𝑛. Then for any 𝛿 (0, 1), with probability at least 1 𝛿,

𝑓(𝑤) min 𝑛 [𝑁] 𝑓(𝑤𝑠) +

8𝜎2 0 log 2𝑁

We proceed to prove the convergence result.

Proof of Theorem 2. For each pair (𝑘, 𝑛), 𝑇 gradient queries are made for the SGD and additional 𝑇 function queries for the selection step. Hence, 𝑇= 2𝐾𝑁𝑇 queries are performed. By the condition of 𝜂min and 𝜂max, there is some 𝜏 [𝐾] such that

(𝐺2 + 𝜎2 )𝑇 , 𝑤1 𝑤

(𝐺2 + 𝜎2 )𝑇

Applying Markov s inequality to Lemma 8 with 𝜂𝜏, for all 𝑛 [𝑁], with probability at least 1

𝑤(𝑛) 𝜏 𝑓(𝑤 ) 𝑤1 𝑤 2

𝜂𝜏𝑇 + 𝜂𝜏(𝐺2 + 𝜎2 ) 3 𝑤1 𝑤

Hence, by a union bound, with probability at least 1 𝛿, 𝑆contains some 𝑤(𝑛) 𝜏 with the above guarantee. We conclude by a union bound of the above with Lemma 9 and substituting 𝑇 = 𝑇 2 log(𝐷max/𝐷min) log(1/𝛿) , such that with probability at least 1 2𝛿,

𝑓(𝑤) 𝑓(𝑤 ) 3 𝑤1 𝑤

(𝐺2 + 𝜎2 ) log( 𝜂max

𝜂min ) log( 1

𝛿) log( 𝜂max

(𝐺2 + 𝜎2 ) log( 𝜂max

𝜂min ) log( 1

𝛿) + 𝜎0 log( 1

How Free is Parameter-Free Stochastic Optimization?

B.1. Proof of Lemma 8

By the update step of SGD,

𝑤𝑡+1 𝑤 2 = 𝑤𝑡 𝑤 2 2𝜂𝑔𝑡 (𝑤𝑡 𝑤 ) + 𝜂2 𝑔𝑡 2.

Summing for 𝑡= 1, . . . ,𝑇and rearranging,

𝑡=1 𝑔𝑡 (𝑤𝑡 𝑤 ) = 𝑤1 𝑤 2 𝑤𝑇+1 𝑤 2

Note that by the Lipschitz and noise assumptions, 𝔼[ 𝑔𝑡 2] = 𝔼[ 𝑔𝑡 𝑓(𝑤𝑡) 2 2 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡)+ 𝑓(𝑤𝑡) 2] 𝐺2 + 𝜎2 . Taking expectation,

𝑡=1 𝑓(𝑤𝑡) (𝑤𝑡 𝑤 )

2𝜂 + 𝜂(𝐺2 + 𝜎2)𝑇

We conclude by an application of convexity and Jensen s inequality.

B.2. Proof of Lemma 9

Let 𝑘= arg min𝑛 [𝑁] 𝑓(𝑤𝑛). By the minimality of 𝑤,

𝑡=1 e𝑓𝑡(𝑤) +

𝑡=1 e𝑓𝑡(𝑤𝑘) +

𝑡=1 e𝑓𝑡(𝑤𝑘) 𝑓(𝑤𝑘)

Let 𝑛 [𝑁]. By Hoeffding s inequality, for any 𝜖> 0,

𝑡=1 e𝑓𝑡(𝑤𝑛)

Thus, under a union bound,

𝑡=1 e𝑓𝑡(𝑤𝑛)

2𝜎2 0 log 2𝑁

𝛿/𝑇, with probability at least 1 𝛿, for all 𝑛 [𝑁],

𝑡=1 e𝑓𝑡(𝑤𝑛)

2𝜎2 0 log 2𝑁

Thus, with probability at least 1 𝛿,

𝑓(𝑤) 𝑓(𝑤𝑘) +

8𝜎2 0 log 2𝑁

C. Proofs of Section 4

C.1. Proof of Theorem 3

Using the identity 𝑢+ 𝑣 2 2 𝑢 2 + 2 𝑣 2 and the global bounds, for all 𝑤 ℝ𝑑,

e𝑔(𝑤) 2 𝑓(𝑤) 2 + 2 e𝑔(𝑤) 𝑓(𝑤) 2 2𝐺2 + 2𝜎2 .

How Free is Parameter-Free Stochastic Optimization?

By the minimality of 𝐺8 and 𝜎8 , 𝐺max 𝐺 𝐺8 and 𝜎max 𝜎 𝜎8 , which means that Theorem 4 is applicable. Combining Theorem 4 with the above inequality, Algorithm 1 produce 𝑤𝑘such that with probability at least 1 𝛿log 𝐷max

𝑓(𝑤𝑘) 𝑓 = 𝑂

(𝐺2 + 𝜎2 )𝑇+ 𝜎max

We conclude by moving from big-𝑂notation to poly-log notation, replacing the term e 𝐶

𝛿with poly log( 𝐺max

C.2. Proof of Lemma 1

In order to prove the lemma we use the following standard inequality.

Lemma 10. Let 𝑎1, . . . , 𝑎𝑛 0. Then

𝑎𝑖 P𝑖 𝑗=1 𝑎𝑗 2

𝑖=1 𝑎𝑖. (treating 0

Proof. Using the identity 𝑎2 𝑏2 = (𝑎 𝑏)(𝑎+ 𝑏),

𝑎𝑖 P𝑖 𝑗=1 𝑎𝑗 = 𝑛

P𝑖 𝑗=1 𝑎𝑗 P𝑖 1 𝑗=1 𝑎𝑗 P𝑖 𝑗=1 𝑎𝑗+ P𝑖 1 𝑗=1 𝑎𝑗

𝑗=1 𝑎𝑗ª = 2

We proceed to prove the lemma.

Proof of Lemma 1. As 𝐸1 holds, max𝑡 𝑇 𝑤𝑡+1 𝑤1 < 𝐷and 𝑤𝑡+1 = 𝑤𝑡 𝜂𝑡𝑔𝑡for all 𝑡 [𝑇] (if a projection is performed the norm will equal 𝐷). Thus,

𝑤𝑡+1 𝑤 2 = 𝑤𝑡 𝑤 2 2𝜂𝑡𝑔𝑡 (𝑤𝑡 𝑤 ) + 𝜂2 𝑡 𝑔𝑡 2.

Rearranging and summing for 𝑡= 1, . . . ,𝑇,

𝑡=1 𝑔𝑡 (𝑤𝑡 𝑤 ) = 𝑤1 𝑤 2

2𝜂1 𝑤𝑇+1 𝑤 2

𝑡=1 𝜂𝑡 𝑔𝑡 2.

Let 𝑡 arg max𝑡 𝑇+1 𝑤𝑡 𝑤 . Hence,

𝑡=1 𝑔𝑡 (𝑤𝑡 𝑤 ) 𝑤1 𝑤 2

2𝜂1 𝑤𝑇+1 𝑤 2

2𝜂𝑇 + 𝑤𝑡 𝑤 2

𝑡=1 𝜂𝑡 𝑔𝑡 2

2𝜂1 𝑤𝑇+1 𝑤 2

2𝜂𝑇 + 𝑤𝑡 𝑤 2

𝑡=1 𝜂𝑡 𝑔𝑡 2

𝑤𝑡 𝑤 2 𝑤𝑇+1 𝑤 2

𝑡=1 𝜂𝑡 𝑔𝑡 2. ( 𝑤1 𝑤 𝑤𝑡 𝑤 )

Note that by the triangle inequality,

𝑤𝑡 𝑤 2 𝑤𝑇+1 𝑤 2 = ( 𝑤𝑡 𝑤 𝑤𝑇+1 𝑤 )( 𝑤𝑡 𝑤 + 𝑤𝑇+1 𝑤 )

2 𝑤𝑡 𝑤𝑇+1 𝑤𝑡 𝑤 (triangle inequality and def of 𝑡 )

2 𝑤𝑡 𝑤𝑇+1 (𝐷+ 𝑤1 𝑤 ) ( 𝑤𝑡 𝑤 𝑤𝑡 𝑤1 + 𝑤1 𝑤 )

4𝐷(𝐷+ 𝑤1 𝑤 ). ( 𝑤𝑡 𝑤𝑇+1 𝑤𝑡 𝑤1 + 𝑤1 𝑤𝑇+1 2𝐷)

How Free is Parameter-Free Stochastic Optimization?

Plugging the above inequality,

𝑡=1 𝑔𝑡 (𝑤𝑡 𝑤 ) 2𝐷(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝜂𝑡 𝑔𝑡 2.

By Lemma 10 with 𝑎𝑖= 𝑔𝑖 2,

𝑡=1 𝜂𝑡 𝑔𝑡 2 2𝛼𝐷

Combining the two inequalities with the inequality of 𝐸2,

𝑡=1 𝑓(𝑤𝑡) (𝑤𝑡 𝑤 ) 2𝐷(𝐷+ 𝑤1 𝑤 )

+ 4(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝑔𝑡 2 + 4𝜎2 8 𝜃2 𝑡, 𝛿

2(𝐷+ 𝑤1 𝑤 )

𝛼 + 𝛼𝐷+ 4(𝐷+ 𝑤1 𝑤 )

+ 2(𝐷+ 𝑤1 𝑤 )𝛾

𝛼 + 8𝜃𝑇, 𝛿(𝐷+ 𝑤1 𝑤 )𝜎8 (

= ( 𝑤1 𝑤 + 𝐷)( 2

𝜃𝑇, 𝛿) + 𝛼𝐷

𝑡=1 𝑔𝑡 2 + 2𝛾

𝛼+ 8𝜃𝑇, 𝛿𝜎8 ( 𝑤1 𝑤 + 𝐷).

Dividing by 𝑇, we conclude by applying 𝑓(𝑤𝑡) (𝑤𝑡 𝑤 ) 𝑓(𝑤𝑡) 𝑓(𝑤 ) due to convexity.

C.3. Proof of Lemma 2

In order to prove lemma Lemma 2 we use the following lemmas. Their proofs will be presented subsequently. The first lemma bounds the errors that arise from potential correlation between 𝜂𝑡and 𝑔𝑡. To that end we use the following decorrelated stepsize (Attia & Koren, 2023),

𝛾2 + 𝑓(𝑤𝑡) 2 + P𝑡 1 𝑠=1 𝑔𝑠 2 .

In the noiseless case we could simply replace the lemma with 𝑓(𝑤𝑠) (𝑤𝑠 𝑤 𝐷) 0 due to convexity.

Lemma 11. With probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑠=1 𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 + 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2

𝑠=1 𝜂2𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2 + 𝛼2𝐷2𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2 ,

where 𝜃𝑡,𝛿= log 60 log(6𝑡)

Following is a standard summation lemma of methods with Ada Grad-like stepsizes.

Lemma 12. For any 𝑡 [𝑇],

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 𝛼2𝐷2 log

1 + 2𝐺2 8 𝑇+ 2𝜎2 8 𝑇

How Free is Parameter-Free Stochastic Optimization?

The next lemma is similar to Lemma 12, while using decorrelated stepsizes. Here, knowledge of the noise bound is required to achieve a logarithmic summation. Lemma 13. Assuming 𝛾2 5𝜎2 8 log 𝑇

𝛿, with probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2 𝛼2𝐷2 log(1 + 𝑇).

Proof of Lemma 2. As 𝑤𝑠+1 is the projection of 𝑤𝑠 𝜂𝑠𝑔𝑠onto the convex set containing 𝑤 𝐷,

𝑤𝑠+1 𝑤 𝐷 2 𝑤𝑠 𝑤 𝐷 2 2𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) + 𝜂2 𝑠 𝑔𝑠 2.

Summing for 𝑠= 1, . . . , 𝑡,

𝑤𝑡+1 𝑤 𝐷 2 𝑤1 𝑤 𝐷 2 2 𝑡

𝑠=1 𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) + 𝑡

𝑠=1 𝜂2 𝑠 𝑔𝑠 2.

Using lemma Lemma 11, with probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑤𝑡+1 𝑤 𝐷 2 𝑤1 𝑤 𝐷 2 + 16𝐷

𝑠=1 𝜂2𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2 + 𝛼2𝐷2𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2 + 1 + 2

𝑠=1 𝜂2 𝑠 𝑔𝑠 2.

Applying Lemmas 12 and 13 under a union bound, with probability at least 1 2𝛿, for any 𝑡 [𝑇],

𝑤𝑡+1 𝑤 𝐷 2 𝑤1 𝑤 𝐷 2 + 16𝛼𝐷2

𝜃𝑡, 𝛿log(1 + 𝑇) + 𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2

+ 2𝛼𝐷2 log(1 + 𝑇) + 𝛼(𝛼+ 2)𝐷2 log

1 + 2𝐺2 8 𝑇+ 2𝜎2 8 𝑇

𝑤1 𝑤 𝐷 2 + 𝐷2

16 (Defs of 𝛼, 𝛾at Algorithm 1 and 𝛼 1)

= 𝑤1 𝑤 𝐷 2 + 1

Thus, using the triangle inequality and the fact that 𝑤1 𝑤 𝑤1 𝑤 𝐷 ,

𝑤𝑡+1 𝑤1 𝑤𝑡+1 𝑤 𝐷 + 𝑤 𝐷 𝑤1 2 𝑤1 𝑤 𝐷 + 1

2 𝐷 2 𝑤1 𝑤 + 1

In addition,

𝑤𝑡+1 𝑤 𝑤𝑡+1 𝑤1 + 𝑤1 𝑤 3 𝑤1 𝑤 + 1

C.4. Proof of Lemma 11

Using 𝜂𝑡we can decompose the term 𝜂𝑠𝑔𝑡 (𝑤𝑠 𝑤 𝐷),

𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) = 𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) + ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷)

= 𝜂𝑠 𝑓(𝑤𝑠) (𝑤𝑠 𝑤 𝐷) + 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) + ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷)

𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) + ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷),

where the last inequality follows from convexity. Thus,

𝑠=1 𝜂𝑠𝑔𝑠 (𝑤𝑠 𝑤 𝐷) 𝑡

𝑠=1 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) + 𝑡

𝑠=1 ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷). (3)

The next two lemmas bounds the two sums and concludes the proof of Lemma 11. Their proofs follows.

How Free is Parameter-Free Stochastic Optimization?

Lemma 14. With probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑠=1 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) 8𝐷

𝑠=1 𝜂2𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2 + 𝛼2𝐷2𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2 .

Lemma 15. For any 𝑡 [𝑇],

𝑠=1 ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷) 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 + 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2.

Combining both into Equation (3) yields the desired inequality.

C.5. Proof of Lemma 14

While 𝑓(𝑤𝑠) 𝑔𝑠 𝜎8 , this bound is too crude for the required martingale analysis. To this end we use the following empirical Bernstein inequality due to Howard et al. (2021) in order to exploit the relationship between 𝜂𝑡and 𝑓(𝑤𝑠) 𝑔𝑠. Lemma 16 (Corollary 2 of Ivgi et al. (2023)). Let 𝑐> 0 and 𝑋𝑡be a martingale difference sequence adapted to F𝑡such that |𝑋𝑡| 𝑐with probability 1 for all 𝑡. Then, for all 𝛿 (0, 1), and ˆ𝑋𝑡 F𝑡 1 such that | ˆ𝑋𝑡| 𝑐with probability 1,

𝑋𝑠 ˆ𝑋𝑠 2 + 𝑐2𝜃2 𝑡, 𝛿

where 𝜃𝑡,𝛿= log 60 log(6𝑡)

Invoking Lemma 16 with

𝑋𝑠= 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷), ˆ𝑋𝑠= 0 and 𝑐= 2𝛼𝐷2𝜎8

(note the use of 𝐷 8 𝑤1 𝑤 to bound 𝑓(𝑤𝑠) 𝑔𝑠 𝜎8 ) with probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑠=1 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷)

𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) 2 + 4𝛼2𝐷4𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2 .

By Cauchy-Schwarz inequality and the triangle inequality ( 𝑤𝑠 𝑤 𝐷 𝑤𝑠 𝑤1 + 𝑤1 𝑤 𝐷 2𝐷), 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷) 2 𝜂2 𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2 𝑤𝑠 𝑤 𝐷 2 4 𝜂2 𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2𝐷2.

Hence, with probability at least 1 𝛿, for any 𝑡 [𝑇],

𝑠=1 𝜂𝑠( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 𝐷)

𝑠=1 𝜂2𝑠 𝑓(𝑤𝑠) 𝑔𝑠 2 + 𝛼2𝐷2𝜎2 8 𝜃2 𝑡, 𝛿 𝛾2 .

C.6. Proof of Lemma 15

Let 𝐺𝑡= P𝑡 𝑠=1 𝑔𝑠 2. Thus,

| 𝜂𝑠 𝜂𝑠| = 𝛼𝐷| 𝑔𝑠 2 𝑓(𝑤𝑠) 2|

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1(

𝜎2 8 + 𝐺2𝑠+

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1) .

By the triangle inequality,

| 𝑔𝑠 2 𝑓(𝑤𝑠) 2|

𝜎2 8 + 𝐺2𝑠+

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1

𝑔𝑠 𝑓(𝑤𝑠) ( 𝑔𝑠 + 𝑓(𝑤𝑠) )

𝜎2 8 + 𝐺2𝑠+

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1 𝑔𝑠 𝑓(𝑤𝑠) .

How Free is Parameter-Free Stochastic Optimization?

Hence, by Cauchy-Schwarz inequality,

𝑠=1 ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷) 𝛼𝐷 𝑡

𝑔𝑠 𝑓(𝑤𝑠) 𝑔𝑠 𝑤𝑠 𝑤 𝐷

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1

𝑔𝑠 𝑓(𝑤𝑠) 𝑔𝑠

𝜎2 8 + 𝑓(𝑤𝑠) 2 + 𝐺2 𝑠 1

𝑠=1 𝜂𝑠 𝜂𝑠 𝑔𝑠 𝑓(𝑤𝑠) 𝑔𝑠 ,

where the second inequality follows by 𝑤𝑠 𝑤 𝐷 2𝐷. Using the identity 2𝑎𝑏 𝑎2 + 𝑏2,

𝑠=1 ( 𝜂𝑠 𝜂𝑠)𝑔𝑠 (𝑤𝑠 𝑤 𝐷) 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 + 1

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2.

C.7. Proof of Lemmas 12 and 13

In order to prove the lemmas we need the following standard lemma used in the analysis of Ada Grad-like methods.

Lemma 17. Let 𝑎0 > 0 and 𝑎1, 𝑎2, . . . , 𝑎𝑛 0. Then

𝑎𝑖 P𝑖 𝑗=0 𝑎𝑗 ln P𝑛 𝑖=0 𝑎𝑖

𝑎0 log P𝑛 𝑖=0 𝑎𝑖

Proof. Using the inequality 1 𝑥 ln(1/𝑥) for 𝑥> 0,

𝑎𝑖 P𝑖 𝑗=0 𝑎𝑗 = 𝑛

P𝑖 1 𝑗=0 𝑎𝑗 P𝑖 𝑗=0 𝑎𝑗

P𝑖 𝑗=0 𝑎𝑗 P𝑖 1 𝑗=0 𝑎𝑗

where the last inequality holds since ln(P𝑛 𝑗=0 𝑎𝑗/𝑎0) 0.

Proof of Lemma 12. Using Lemma 17 with 𝑎0 = 𝛾2 and 𝑎𝑖= 𝑔𝑖 2,

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 = 𝛼2𝐷2 𝑡

𝛾2 + P𝑠 𝑘=1 𝑔𝑘 2 𝛼2𝐷2 log

1 + P𝑡 𝑠=1 𝑔𝑠 2

By the inequality 𝑢+ 𝑣 2 2 𝑢 2 + 𝑣 2, the Lipschitz assumption and the noise assumption, 𝑔𝑠 2 2𝐺2 8 + 2𝜎2 8 . Thus,

𝑠=1 𝜂2 𝑠 𝑔𝑠 2 𝛼2𝐷2 log

1 + 2𝐺2 8 𝑇+ 2𝜎2 8 𝑇

Proof of Lemma 13. Using lemma Lemma 6, with probability at least 1 𝛿

𝑇, for 𝜆= 2 3𝜎2 8 ,

𝑠=1 𝑓(𝑤𝑠) ( 𝑓(𝑤𝑠) 𝑔𝑠) 1

𝑠=1 𝑓(𝑤𝑠) 2 + 3

2𝜎2 8 log 𝑇

Thus, with probability at least 1 𝛿

𝑠=1 𝑔𝑠 2 = 𝑡 1

𝑠=1 ( 𝑓(𝑤𝑠) 2 + 𝑔𝑠 𝑓(𝑤𝑠) 2 2 𝑓(𝑤𝑠) ( 𝑓(𝑤𝑠) 𝑔𝑠))

𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2 3𝜎2 8 log 𝑇

How Free is Parameter-Free Stochastic Optimization?

and since 𝛾2 5𝜎2 8 log 𝑇

𝛿and 𝑔𝑡 𝑓(𝑤𝑡) 𝜎8 ,

5𝜎2 8 log 𝑇

𝛿+ P𝑡 1 𝑠=1 𝑔𝑠 2 𝛼𝐷

2𝜎2 8 log 𝑇

𝛿+ P𝑡 1 𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2

𝜎2 8 + P𝑡 𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2 .

Hence, under a union bound, with probability at least 1 𝛿, for any 𝑡 [𝑇],

𝜎2 8 + P𝑡 𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2

and using Lemma 17 with 𝑎0 = 𝜎2 8 and 𝑎𝑖= 𝑔𝑖 𝑓(𝑤𝑖) 2,

𝑠=1 𝜂2 𝑠 𝑔𝑠 𝑓(𝑤𝑠) 2 𝛼2𝐷2 𝑡

𝜎2 8 + P𝑡 𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2

𝜎2 8 + P𝑡 𝑠=1 𝑔𝑠 𝑓(𝑤𝑠) 2

𝛼2𝐷2 log(1 + 𝑇).

C.8. Proof of Lemma 3

𝑋𝑠= ( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 ), ˆ𝑋𝑠= 0, and 𝑐= 𝜎8 (𝐷+ 𝑤1 𝑤 ).

Note that by Cauchy-Schwarz inequality and the triangle inequality,

( 𝑓(𝑤𝑠) 𝑔𝑠) (𝑤𝑠 𝑤 ) 𝑓(𝑤𝑠) 𝑔𝑠 𝑤𝑠 𝑤 𝑓(𝑤𝑠) 𝑔𝑠 ( 𝑤𝑠 𝑤1 + 𝑤1 𝑤 )

𝜎8 (𝐷+ 𝑤1 𝑤 ) = 𝑐,

where we used the assumption that 𝐷 8 𝑤1 𝑤 to bound 𝑓(𝑤𝑠) 𝑔𝑠 . Applying Lemma 16, with probability at least 1 𝛿,

𝑡=1 ( 𝑓(𝑤𝑡) 𝑔𝑡) (𝑤𝑠 𝑤 )

4(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝑓(𝑤𝑡) 𝑔𝑡 2 + 𝜎2 8 𝜃2 𝑇, 𝛿.

Applying Lemma 6 with 𝑍𝑡= 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡) and 𝜆= 2 3𝜎2 8 , with probability at least 1 𝛿

𝑡=1 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡) 3𝜎2 8 4 𝜆 𝑇

𝑡=1 𝑓(𝑤𝑡) + log 1

𝑡=1 𝑓(𝑤𝑡) + 3𝜎2 8 log 1

𝑡=1 𝑓(𝑤𝑡) 𝑔𝑡 2 = 𝑇

𝑔𝑡 2 + 2 𝑓(𝑤𝑡) ( 𝑓(𝑤𝑡) 𝑔𝑡) 𝑓(𝑤𝑡) 2

3𝜎2 8 log 1

Returning to the previous inequality, with probability at least 1 2𝛿(union bound),

𝑡=1 ( 𝑓(𝑤𝑡) 𝑔𝑡) (𝑤𝑠 𝑤 )

4(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝑔𝑡 2 + 𝜃𝑇, 𝛿 3 log 1

𝛿+ 𝜃𝑇, 𝛿 𝜎2 8

4(𝐷+ 𝑤1 𝑤 )

𝑡=1 𝑔𝑡 2 + 4𝜃2 𝑇, 𝛿𝜎2 8

where the last inequality follows by 𝜃𝑇, 𝛿 log 1

𝛿. Thus, Pr(𝐸2) 1 2𝛿.

How Free is Parameter-Free Stochastic Optimization?

C.9. Proof of Theorem 5

In order to show that Theorem 4 is applicable we need to show that 𝐺max 𝐺8 . Let 𝑤 ℝ𝑑such that 𝑤 𝑤1 8 𝑤1 𝑤 . Thus, by the smoothness of 𝑓and the triangle inequality,

𝑓(𝑤) 𝛽 𝑤 𝑤 𝛽 ( 𝑤 𝑤1 + 𝑤1 𝑤 ) 9𝛽 𝑤1 𝑤 𝐺max.

Hence, by Equation (1), 𝐺max 𝐺8 . Using Theorem 4, with probability at least 1 𝛿log 𝐷max

𝐷min , we have a sequence 𝑤1, . . . , 𝑤𝑇with

𝑡=1 𝑓(𝑤𝑡) 𝑓 𝐶

𝜎max ) 𝑤1 𝑤

𝑡=1 𝑔𝑡 2 + 𝜎max

for some 𝐶> 0, and for all 𝑡, 𝑤𝑡 𝑤1 8 𝑤1 𝑤 . By the identity 𝑎+ 𝑏 2 2 𝑎 2 + 2 𝑏 2 and Equation (1),

𝑔𝑡 2 2 𝑓(𝑤𝑡) 2 + 2 𝑔𝑡 𝑓(𝑤𝑡) 2 2 𝑓(𝑤𝑡) 2 + 2𝜎2 8 4𝛽 ( 𝑓(𝑤𝑡) 𝑓 ) + 2𝜎2 8 .

where the last inequality is due to the following standard smoothness argument. For any 𝑤 ℝ𝑑and 𝑤+ = 𝑤 1 𝛽 𝑓(𝑤), by the smoothness of 𝑓,

2𝛽 ( 𝑓(𝑤) 𝑓 ) 2𝛽 ( 𝑓(𝑤) 𝑓(𝑤+)) 2𝛽 ( 𝑓(𝑤) (𝑤+ 𝑤) 𝛽

2 𝑤+ 𝑤 2) = 𝑓(𝑤) 2.

Hence, by the identities

𝑏and 𝑎𝑏 1 2𝜆𝑎2 + 𝜆

2 𝑏2 for 𝜆> 0,

𝑡=1 ( 𝑓(𝑤𝑡) 𝑓 ) + 𝜎8

𝑡=1 ( 𝑓(𝑤𝑡) 𝑓 ) + 2𝜆𝛽 + 𝜎8

Setting 𝜆= 𝐶 𝑤1 𝑤 log( 𝑇

𝜎max ), returning to Equation (4) and rearranging,

𝑡=1 𝑓(𝑤𝑡) 𝑓 4𝐶2𝛽 𝑤1 𝑤 2 log2( 𝑇

8𝐶 𝑤1 𝑤 𝜎8 log( 𝑇

+ 𝜎max 2𝐶 𝑤1 𝑤 log( 𝑇

We conclude by a standard application of Jensen s inequality.

C.10. Proof of Theorem 6

We will define two problem instances ( 𝑓1, e𝑔1, 𝐷1, 𝐺1), ( 𝑓2, e𝑔2, 𝐷2, 𝐺2), which will be indistinguishable with constant probability using only 𝑇gradient queries, where no solution 𝑤 ℝcan satisfy the convergence bound for both. Let 𝜎1 = 𝜎max

𝑇, 𝐷1 = 𝐷max, 𝜎2 = 𝜎max, 𝐷2 = 𝐷max

𝑇. We define the two functions,

𝑓1(𝑤) = 𝐺 |𝑤 𝐷1| and 𝑓2(𝑤) = 𝐺 |𝑤+ 𝐷2|.

Note that both functions are 𝐺 -Lipschitz and their minimizers satisfy arg min𝑤 ℝ𝑓𝑖(𝑤) 𝐷𝑖for 𝑖 {1, 2}. In addition, we define the following two first-order oracles,

𝐺 if 𝑤 𝐷2; 𝐺 if 𝑤 𝐷1; 𝜎1 + 𝐺 if 𝑤 ( 𝐷2, 𝐷1), w.p. 1

2 + 𝐺 2(𝜎1 𝐺 ) ; 𝜎1 𝐺 if 𝑤 ( 𝐷2, 𝐷1), w.p. 1

2 𝐺 2(𝜎1 𝐺 )

How Free is Parameter-Free Stochastic Optimization?

𝐺 if 𝑤 𝐷2; 𝐺 if 𝑤 𝐷1; 𝜎2 + 𝐺 if 𝑤 ( 𝐷2, 𝐷1), w.p. 1

𝑇; 𝜎1 + 𝐺 if 𝑤 ( 𝐷2, 𝐷1), w.p. 1

2 + 𝐺 2(𝜎1 𝐺 ) 1 2𝑇; 𝜎1 𝐺 if 𝑤 ( 𝐷2, 𝐷1), w.p. 1

2 𝐺 2(𝜎1 𝐺 ) 1 2𝑇.

Note that the probabilities are within [0, 1] (and sum to 1) since 1 2𝑇 1

8 as 𝑇 4, and for 𝛼 [1, 3

𝐺 2(𝜎1 𝐺 ) = 𝜎max/(2𝑇 1)

𝑇 𝜎max/(2𝑇 1)) = 𝛼

𝑇) 3𝑇 10𝑇 8 3

Additionally, for 𝑤 ( 𝐷2, 𝐷1),

𝔼[e𝑔1(𝑤)] = (𝜎1 𝐺 ) 1

2 𝐺 2(𝜎1 𝐺 ) 1

2 𝐺 2(𝜎1 𝐺 )

= 𝐺 (𝜎1 𝐺 )

𝜎1 𝐺 = 𝐺 = 𝑓1(𝑤)

and similarly

𝔼[e𝑔2(𝑤)] = 𝜎2 + 𝐺

𝑇 𝐺 = (2𝑇 1)𝐺 + 𝐺

𝑇 𝐺 = 𝐺 = 𝑓2(𝑤).

Hence, for all 𝑤 ℝ, 𝔼[e𝑔1(𝑤)] = 𝑓1(𝑤), 𝔼[e𝑔2(𝑤)] = 𝑓2(𝑤), and with probability 1, as 𝐺 < 𝜎1 < 𝜎2,

e𝑔1(𝑤) 𝑓1(𝑤) 𝜎1 and e𝑔2(𝑤) 𝑓2(𝑤) 𝜎2.

We concluded the properties of 𝑓and e𝑔and we move to the lower bound. For 𝑛 4, 1 1

4 1 1 + 1 𝑛 1 𝑛 1 1

where the last inequality follows by 1 + 1 𝑛 1

𝑛 1 = 2 + 𝑛 1

(𝑛 1) 𝑘 2 + 𝑛 1

1 𝑘(𝑘 1) = 2 + 𝑛 1

Hence, with probability at least 1

4, the 1/𝑇event of e𝑔2 do not occur in any of the 𝑇queries and the two gradient oracles will be indistinguishable from each other. Let 𝑤be the output of A(𝑇) given that the two oracles return the exact same gradients. We will assume by contradiction that 𝑤satisfy a convergence rate better than 𝐷𝑖(𝐺 + 𝜎𝑖)𝛼/6

𝑇for each 𝑓𝑖(𝑖 {1, 2} respectively). Thus, as 𝜎1 = 𝐺 (2𝑇 1)

𝑓1(𝑤) 𝑓1(𝑤 ) = 𝐺 |𝑤 𝐷1| = 𝐷1(𝐺 + 𝜎1)𝛼

𝑇|𝑤 𝐷1| 𝐷1(𝐺 + 𝜎1)𝛼

= 𝐷1(𝐺 + 𝜎1)𝛼

𝑇+ 2𝑇 1) 𝐷1(𝐺 + 𝜎1)𝛼

which implies that 𝑤> 𝐷1/2 by the convergence assumption. Similarly, as 𝜎2 = 𝐺 (2𝑇 1),

𝑓2(𝑤) 𝑓2(𝑤 ) = 𝐺 |𝑤+ 𝐷2| = 𝐷2(𝐺 + 𝜎2)𝛼

𝑇|𝑤+ 𝐷2| 𝐷2(𝐺 + 𝜎2)𝛼= 𝐷2(𝐺 + 𝜎2)𝛼

which implies that 𝑤< (𝛼

𝑇/3 1)𝐷2 by the convergence assumption. But

𝑇/3 1)𝐷2 > 𝑤

and we obtain a contradiction. The second inequality of the convergence lower bound follows by (𝐺 + 𝜎)𝛼 𝜎max( 𝛼 2𝑇 1 + 1

𝑇for 𝜎 {𝜎1, 𝜎2}.

How Free is Parameter-Free Stochastic Optimization?

D. Assumption for Noise Upper Bound

In the convex non-smooth setting, Theorem 6 implies that a stochastic convex optimization method cannot be parameter-free with respect to all three parameters, diameter, Lipschitz bound and noise bound, while being competitive with tuned SGD. Next, we observe that while such a term is unavoidable without further assumptions, if a success probability of 1 𝑂(𝛿) is desired, the following reasonable assumption is sufficient.

Assumption: for some 𝛿 (0, 1) and 𝑐= 4

1 + 3 log 1

𝑓(𝑤1) 𝑓 > 𝑐𝜎 𝑤1 𝑤

In other words, 𝑤1 is a bad enough approximate minimizer.

Let us first explain why this is a reasonable assumption. Consider the convergence rate of tuned SGD, which ensures with probability at least 1 𝛿that

𝑓(𝑤) 𝑓 𝐶 𝑤1 𝑤 (𝐺 + 𝜎

for some constant 𝐶> 0. In that case, the assumption is weaker then assuming that the above convergence bound is sufficient to ensure that with probability at least 1 𝛿, SGD produce a solution better than 𝑤1 by some constant factor. If we aim to compete with tuned SGD, it is reasonable to focus on the regime where SGD produce better solutions than the initialization.

We move to analyze what the assumption above yields. Using the assumption and the convexity of the objective,

𝑇 < 𝑓(𝑤1) 𝑓 𝑓(𝑤1) (𝑤1 𝑤 ) 𝑓(𝑤1) 𝑤1 𝑤 .

So upper bounding 𝑓(𝑤1) produce an upper bound of 𝜎 . The simplest approach to estimate the gradient at 𝑤1 ℝ𝑑

is to average 𝑛stochastic samples. Let 𝑍= 1

𝑛 P𝑛 𝑖=1 e𝑔𝑖(𝑤1) where e𝑔𝑖(𝑤1) for 𝑖 [𝑛] are independent stochastic gradients. Thus, 𝔼[𝑍] = 𝑓(𝑤1). By Lemma 7,

2(1 + 𝜆)𝜎/ 𝑛 exp( 𝜆2/3).

𝛿. Hence, with probability at least 1 𝛿,

Setting 𝑛= 8(1 + 3 log 1

2 , with probability at least 1 𝛿,

𝑍 𝑓(𝑤1) [ 1

Hence, using 𝑇

2 steps we can estimate 𝑓(𝑤1) and obtain a bound

1 + 3 log 1

which holds with probability at least 1 𝛿, and such bound is tight enough to make Theorem 6 inapplicable. Further note that having a bound 𝜎 𝜎max = 𝑂(𝐺

𝑇), which we obtain with high probability using this technique, is sufficient to reduce the unavoidable lower order term presented at Theorem 4 to a term proportional to the (almost) optimal rate of tuned SGD.