# nesterov_acceleration_in_benignly_nonconvex_landscapes__60342c33.pdf Published as a conference paper at ICLR 2025 NESTEROV ACCELERATION IN BENIGNLY NON-CONVEX LANDSCAPES Kanan Gupta, Stephan Wojtowytsh Department of Mathematics, University of Pittsburgh kanan.g@pitt.edu, s.woj@pitt.edu While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex setting. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a benign non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov s accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates with purely additive noise and with noise that exhibits both additive and multiplicative scaling. 1 INTRODUCTION Accelerated first order methods of optimization are the backbone of modern deep learning. So far, theoretical guarantees that momentum-based methods accelerate over memory-less gradient-based methods have been limited to the setting of convex objective functions. Indeed, recent work of Yue et al. (2023) shows that the assumption of convexity cannot be weakened as far as, for instance, the Polyak-Lojasiewicz (PL) condition f 2 2µ (f inf f), which has been used to great success in the study of gradient descent algorithms without momentum for instance by Karimi et al. (2016). Optimization problems in deep learning are notoriously non-convex. Initial theoretical efforts focused on approximating the training of very wide neural networks by the parameter optimization in a related linear model: The neural tangent kernel (NTK). Jacot et al. (2018); E et al. (2019) show that for randomly initialized parameters, gradient flow and gradient descent trajectories remain uniformly close to those which are optimized by the linearization of the neural network around the law of its initialization. This analysis was extended to momentum-based optimization by Liu et al. (2022). Recall that a C2-smooth function f is convex if and only if its Hessian, D2f, is positive semi-definite, i.e. has only non-negative eigenvalues. Strictly negative (but small) eigenvalues of the Hessian of the loss function have been observed close to the set of global minimizers experimentally by Sagun et al. (2017; 2018); Alain et al. (2018) and their presence has been explained theoretically by Wojtowytsch (2023). This poses questions about the use of momentum-based optimizers such as SGD with (heavy ball or Nesterov) momentum or Adam in the training of deep neural networks. In this work, we show that acceleration can be guaranteed for Nesterov s method under much weaker geometric assumptions than (strong) convexity, in particular for certain objective functions that have non-unique and non-isolated minimizers and whose Hessian may have negative eigenvalues up to a certain size. In the remainder of this section, we briefly review how our work fits into the literature. In Section 2, we precisely state the assumptions under which we prove convergence at an accelerated rate and discuss how our work connects to optimization in deep learning. 
Our main results are presented in Section 3, both in discrete and continuous time. Some technical details are postponed to the appendix. 1.1 PREVIOUS WORK Gradient-based optimization was first proposed by Cauchy et al. (1847) in form of the gradient descent algorithm. Over a century later, momentum-based accelerated algorithms were introduced Published as a conference paper at ICLR 2025 by Hestenes & Stiefel (1952) for convex quadratic functions and by Nesterov (1983) for general smooth and convex objective functions. Nesterov s work was generalized to non-smooth convex optimization by Beck & Teboulle (2009) and to stochastic smooth convex optimization among others by Nemirovski et al. (2009); Shamir & Zhang (2013); Jain et al. (2019); Laborde & Oberman (2020) for additive noise and by Liu & Belkin (2018); Even et al. (2021); Vaswani et al. (2019); Gupta et al. (2024) for multiplicatively scaling noise. See also (Ghadimi & Lan, 2012; 2013) for more information on accelerated stochastic gradient methods. While the heavy ball method is used extensively in deep learning, Lessard et al. (2016); Goujaud et al. (2023) prove that it does not generally achieve accelerated convergence for smooth strongly convex functions and may even diverge see however (Kassing & Weissmann, 2024) for positive results under stronger smoothness assumptions Accelerated gradient methods have been studied e.g. by Josz et al. (2023) under much weaker regularity conditions and weaker geometric conditions than (strong) convexity, namely the Kurdyka Lojasiewicz (KL) condition. Under those weaker assumptions, it is at best possible to prove convergence to a local minimizer at a non-acclerated rate: Under the (comparatively weak) Polyak Lojasiewicz (PL) condition, a special case of the KL condition, Yue et al. (2023) show that it is not possible to obtain an accelerated rate of convergence. A slower linear rate of convergence is established by Apidopoulos et al. (2022) in continuous time under the assumption that the objective function f satisfies has an L-Lipschitz continuous gradient and satisfies the PL-inequality 2µ f inf f f 2. The rate of convergence is A stable time-discretization can generally be attained with effective step-size 1/ L for momentum methods (see Su et al., 2016, Section 2), suggesting convergence at the non-accelerated linear rate (1 µ/2L)k in discrete time. To the best of our knowledge, no proof has been given yet. There have been several efforts to find a reasonable relaxation of convexity for which accelerated convergence can still be achieved. Hinder et al. (2020); Fu et al. (2023); Wang & Wibisono (2023); Guminov et al. (2023) consider acceleration under the weaker condition that the objective function is γ-quasar or (γ, µ)-strongly quasar-convex, i.e. the inequality f(x), x x γ f(x) f(x ) + µ holds for any x Rd and any minimizer x of f. Compared to (strong) convexity, it relaxes the condition in two ways: It only considers pairs (x, x ) rather than general pairs of points x, y Rd, and it introduces a factor γ into the inequality which may be strictly smaller than one. Still, it has geometric implications which may be too strong in the context of deep learning: In the strongly quasar-convex case, minimizers are unique, and in the quasar-convex case, sub-level sets are starshaped with respect to any minimizer x since f(tx + (1 t)x ) is monotone increasing on [0, 1] (see Lemma 24, Appendix C). Accelerated rates of convergence were obtained by Necoara et al. 
(2019) in discrete time and by Aujol et al. (2022) in continuous time under the assumption that the objective function is both convex and quasi-strongly convex, and that it has a unique minimizer. Their results are generalized by Hermant et al. (2024) who allow for non-smooth composite optimization and consider γ (0, 1] rather than just γ = 1. Unlike the present work, Hermant et al. (2024) require the uniqueness of minimizers and only study deterministic optimization. Curiously, despite the difference in settings, they independently find the same lower bound on Hessian eigenvalues that we require (compare e.g. Theorem 11 and (Hermant et al., 2024, Theorem 2)). See also (Aujol et al., 2024, Table 1) for an overview of theoretical guarantees of acceleration without strong convexity. In overparametrized deep learning, the set of minimizers of the loss function is a (generally curved) manifold, and tangential motion to the manifold can have important implications on the implicit bias of an algorithm (Li et al., 2021; Damian et al., 2021). Any notion that takes into account all minimizers is quite rigid for such tasks. A more realistic assumption is the aiming condition of Liu et al. (2024) that f(x), x π(x) γ f(x) min f where π(x) is the closest minimizer to the point x. This notion enjoys much greater flexibility in terms of the global geometry of f. Liu et al. (2024) investigate the convergence of gradient flows under the aiming condition, but not that of momentum methods. Published as a conference paper at ICLR 2025 1.2 OUR CONTRIBUTION Our study can be seen as combining geometric ideas pertaining to (1, µ)-quasar convexity and the aiming condition. Our assumption in (1) is equivalent to quasi-strong convexity (Necoara et al., 2019), but notably we do not make any additional assumptions about the convexity of the objective function or uniqueness of minimizers unlike prior works. Since we focus on the closest minimizer at any point in the trajectory, this presents a unique challenge compared to analyzing the distance from a fixed minimizer. As the current iterate xt moves, its projection onto the set of minimizers, π(xt), also moves. This requires a modification of the usual Lyapunov function to account for the movement of π(xt) (or the tangential movement of xt parallel to the set of minimizers) as well as the drift in directions where the objective function is negatively curved. To the best of our knowledge, this is the first study which takes into account this tangential movement and obtains accelerated convergence. In overparametrized learning, the set of minimizers of a loss function is a submanifold of high dimension in a usually much higher-dimensional space. Unless the manifold of minimizers is a linear space, the Hessian of the loss function is geometrically required to have negative eigenvalues in any neighborhood of the set of a minimizer where the manifold is curved. Still, accelerated methods in first order optimization have been found to be highly successful in deep learning. A common heuristic has been that as long as the objective function is convex in the direction towards the set of minimizers, small negative eigenvalues in directions parallel to the set of minimizers can safely be ignored: Tangential drift along the set of minimizers should not affect the decay of the objective function significantly. 
We prove that this intuition indeed applies in a continuous time model for gradient descent with momentum (Theorems 6 and 8) and for Nesterov s time-stepping scheme (Theorems 11, 13 and 14) in deterministic optimization and stochastic optimization with bounded noise. With multiplicative (state-dependent) noise motivated by overparametrized deep learning, we prove an analogous statement for a modified version of Nesterov s algorithm (Theorem 15). 2.1 ASSUMPTIONS We always make the following assumptions on the regularity and geometry of the function f and its set of minimizers. 1. The objective function f : Rd R is bounded from below, C1-smooth and its gradient f : Rd Rd is locally Lipschitz continuous. 2. The set M = {x Rd : f(x) = infz Rd f(z)} of minimizers of f is a (non-empty) k-dimensional C2-submanifold of Rd for k < d. 3. There exists an open sub-level set Uα = {x Rd : f(x) < α} for some α > 0 such that for every x Uα there exists a unique z M which is closest to x. We denote z as π(x) and assume that the closest point projection map π : Uα M is C1-smooth. 4. f satisfies the µ-strong aiming condition (with respect to the closest minimizer) in Uα, i.e. f(x) x π(x) f(x) f π(x) + µ 2 x π(x) 2 x Uα. (1) These assumptions are significantly weaker than the assumption of strong convexity, but for instance strong enough to imply a PL inequality with constant µ (see Appendix C). As illustrated in Section 2.3, they match many geometric features of overparametrized deep learning. For an analysis in discrete time, we will make stronger quantitative assumptions. 2.2 SIMPLE EXAMPLES We give a number of examples which are covered by these assumptions where f is not merely a µ-strongly convex function. The first example illustrates how subtle the interplay between geometric conditions is, even in one dimension. Example 1. For ε, R > 0, consider the function f(x) = x2 2 x2 sin(2R log(|x|). The function f has a unique global minimizer at x = 0 and its derivative f is a Lipschitz-continuous function Published as a conference paper at ICLR 2025 Figure 1: We visualize f from Example 1 in the top row and its derivative in the bottom row with R = 2 and ε = 0.2 (left), ε = 0.1 (middle) and ε = 0.05 (right). Left: f has many local minimizers as the derivative crosses 0 an infinite number of times. Middle: f satisfies the PL condition, but not the strong aiming condition. Right: f is strongly aiming (with respect to the unique global minimizer, which implies the PL condition). In all plots, f is non-convex since f is non-monotone. with Lipschitz-constant 1 + ε 1 + 5R2 + 4R4. Furthermore, f has an infinite number of strict local minimizers if ε 1 + R2 > 1, but satisfies favorable geometric properties under stronger assumptions: PL condition µ-strongly aiming µ-strongly convex Must be < 1 ε 1 + 5R2 + 4R4 Constant (1 ε 1 + R2)2/(1 + ε) 1 ε 1 + 4R2 1 ε 1 + 5R2 + 4R4 Evidently, the geometric conditions and associated constants are quite different if R 1. See Figure 1 for an illustration of f. Further details for the example and a comparison to less common notions such as quasar-convexity are given in Appendix C.2. We note that the example exploits the fact that f is C1,1but not C2-smooth: For C2-functions, Rebjock & Boumal (2023) prove that the PL condition locally implies strong aiming condition. The next example is trivial, but useful to illustrate why tangential movement should not matter. Example 2. 
Let f1 : Rd k [0, ) be a non-negative µ-strongly convex function such that f1(0) = 0 and let f2 : Rk [a, ) be a continuous function for a > 0. Define f : Rd R, f(x) = f2(x1, . . . , xk) f1(xk+1, . . . , xd). Then f is aµ-strongly aiming π(x) = (x1, . . . , xk, 0, . . . , 0), but not strongly convex since the minimizer is non-unique. Similarly, if A : Rk R(d k) (d k) is a function which takes values in the set of symmetric matrices with eigenvalues larger than µ, then f : Rd R, f(x) = 1 2 (xk+1, . . . , xd) A(x1,...,xk) (xk+1, . . . , xd)T is µ-strongly aiming, but generally non-convex. Example 3. Let M be a compact Ck-submanifold of Rd, k 2 and d(x) := dist(x, M). Then there exists a tubular neighborhood Uε = {x Rd : d(x) < ε} on which d is Ck-smooth and the unique closest point projection π is well-defined and Ck 1-smooth see Appendix A. Assume that f : Uε R is given by f(x) = µ 2 d(x)2 (and extended arbitrarily to Rd \ Uε). Recall that d(x) is the unit vector pointing towards the closest point in M at all points x where the distance function is smooth, so in particular d(x) = 1. Thus π(x) = x d(x) d(x) and f(x) (x π(x)) = µ d(x) d(x) x (x d(x) d(x)) = µ d2(x) d(x) 2 2 d2(x) + µ 2 d2(x) = f(x) f(π(x)) + µ for x Uε, i.e. f is µ-strongly aiming. On the other hand, f is not convex unless M is. Otherwise, take x1, x2 M and t (0, 1) such that tx1+(1 t)x2 / M. Then the map t 7 d2 tx1+(1 t)x2 attains a maximum inside the interval (0, 1), meaning that d2 cannot be convex. Published as a conference paper at ICLR 2025 Figure 2: Left: The dashed red line connects two minimizers of the function f. Along the line, f must achieve an interior local maximum. At this point, the Hessian D2f cannot be positive definite. Middle, Right: Optimization trajectories for Nesterov s method (top) and its associated energy curve (bottom). The selection of limit point may depend crucially on optimization parameters: In the middle plot, we take 800 steps with stepsize 10 2 while on the right, we take 8,000 steps with stepsize 10 3 from the same initial point. The decay of f(xt) is similar for both trajectories, but the limit points on the manifold of minimizers are far apart. The objective function is f(x, y) = (x2/2 + 3y2 1)2. This consideration more generally shows that if the manifold of minimizers M of a function f is not perfectly straight, then the objective function cannot be convex see also Figure 2. More precisely: Lemma 4. (Wojtowytsch, 2023, based on Appendix B) Let f : Rd R be a C2-function and M = {x Rd : f(x) = inf f}. Assume that M is a k-dimensional C1-submanifold of Rd, z M, Tz M the tangent space at z, r > 0. If M Br(z) is not the same set as (z + Tz M) Br(z), then there exists x Br(z) such that D2f(x) has a strictly negative eigenvalue. 2.3 CONNECTION TO DEEP LEARNING An important class of objective functions are those which combine the geometric features of Examples 2 and 3. Such functions can be seen as geometric prototypes for loss functions in overparametrized regression problems, such as in deep learning. Namely, consider a parametrized function class h : Rp Rd R of weights w Rp and data x Rd (e.g. a neural network) and the mean squared error (MSE) loss function Ly : Rp [0, ), Ly(w) = 1 h(w, xi) yi 2, y = (y1, . . . , yn) Rn. (2) If h is sufficiently smooth in w and for every vector y Rn, there exists wy Rp such that Ly(wy) = 0, then Cooper (2021) showed that for Lebesgue-almost all y Rn, the set My = {w Rp : Ly(w) = 0} is a p n-dimensional submanifold of Rp. 
Essentially, the solution set of n equations h(w, xi) = yi in p variables is p n-dimensional, much like when h is linear in w. Cooper (2021) demonstrates that the expressivity and smoothness assumptions provably apply to parametrized function classes h(w, x) of sufficiently wide neural networks with analytic activation function such as tanh or sigmoid. Cooper (2021) s proof involves the regular value theorem and Sard s theorem to show that all gradients hw(w, xi) are linearly independent on almost every level set My. As a byproduct, this implies that the Hessian of the loss function h(w, xi) yi D2 wh(w, xi) + wh(w, xi) wh(w, xi) has full rank n at every x My (for almost every y). Thus, all n eigenvalues in direction orthogonal to M are non-zero. We prove the following in Appendix A. The same connection has been made e.g. by Rebjock & Boumal (2023) for C2-functions and overparametrized regression problems. Lemma 5. Assume that f : Rd R is C2-smooth and that M = {x Rd : f(x) = inf f} is a closed k-dimensional C2-submanifold of Rd (i.e. compact and without boundary). If D2f(x) has rank d k everywhere on M, then there exist µ, α > 0 such that there exists a C1-smooth closest point projection π : Uα R with Uα = {x : f(x) < α} and f is µ-strongly aiming with π. Published as a conference paper at ICLR 2025 Figure 3: Convexity analysis of ϕ(t) = L(w + tg) for w near global minimizers of a loss function L and g = L(w)/ L(w) . Left: ϕ(t), middle: second derivative of ϕ, right: estimated strong aiming parameter µ for t [ 1, 1]. Evidently, ϕ is strongly convex in a neighborhood of the minimizers. Strong aiming condition yields consistently larger constants than the strong convexity parameter obtained from second derivatives. Different colors correspond to different random initializations. In particular, f, M meet all conditions in Section 2.1. Note that by Lemma 4, the loss function f cannot be convex since a compact manifold cannot be perfectly straight everywhere. Lemma 5 does not apply to networks with the non-smooth Re LU activation σ(z) = max{z, 0}, also here minimizers cannot be isolated due to the continuous scaling symmetry σ(z) = λ 1σ(λz) for λ > 0. Wojtowytsch (2023, Theorem 2.6) shows that the assumption that M is compact is a simplification and generally does not apply in deep learning. Local versions of Lemma 5 could be proved with µ, α which are positive functions on the manifold, but not necessarily bounded away from zero. Naturally, this suffices in all cases where we provably remain in a local neighborhood in the course of optimization. We eschew this greater generality for the sake of geometric clarity and commit the pervasive sin of optimization theory for deep learning: We make global assumptions which can only be guaranteed locally. For a further comparison of geometric conditions in optimization and deep learning, see also Appendix C. We illustrate in Figure 3 that our assumptions are locally reasonable in deep learning. We trained a fully connected neural network (with 10 layers, width 35, tanh activation) to fit labels yi at 100 randomly generated datapoints xi R12. The small dataset size allowed us to use the exact gradient and loss function instead of stochastic approximations, for a better exploration of the loss landscape. Since the closest minimizer is generally unknown, we use the gradient as a proxy and examine the convexity of ϕ(t) = L(w + tg) for w very close to the set of global minimizers of the loss function L as in (2) and g = L(w)/ L(w) . 
Labels were generated using a randomly initialized teacher network (with 7 layers and width 20). Student networks were trained for 10,000 epochs using stochastic gradient descent with Nesterov momentum, with learning rate η = 0.005 and momentum ρ = 0.99. Final training loss ranged between 10 12 and 10 9 across the five runs. Second derivatives were approximated using second order difference quotients ϕ (t) ϕ(t+h) 2ϕ(t)+ϕ(t h) h2 for h = 0.01. Similarly, the strong aiming parameter with respect to the global minimizer was estimated by 2 ϕ (t)t ϕ(t)+inf ϕ t2 where ϕ (t) was estimated as ϕ(t+h) ϕ(t h) Due to the inherent randomness of training, geometric behaviors varied. Generally, the loss function was convex along the gradient direction near minimizers, but neighborhood size, magnitude of the second derivative, and steepness of the loss function varied. Some runs exhibited convex but not strongly convex behavior, while others failed to reach zero loss along the line w + tg/ g . The gradient is an imperfect approximation of the minimizer s direction, and rounding errors may occur when it is so small near the set of minimizers. 3 MAIN CONTRIBUTIONS 3.1 OPTIMIZATION IN CONTINUOUS TIME In this section, we study a continuous time version of gradient descent with (heavy ball or Nesterov) momentum derived by Su et al. (2016). Namely, we study solutions of the heavy ball ODE x + γ x = f(x), x(0) = x0, x(0) = 0. (3) Published as a conference paper at ICLR 2025 This is a popular model for the study of accelerated methods in optimization which avoids some of the technicalities of discrete time stepping algorithms while relying on the same core geometric concepts. Our main result is the following. Theorem 6. [Continuous time convergence guarantee] Assume that f satisfies the assumptions of Section 2.1, γ = 2 µ and that x0 satisfies f(x0) < α where α is as in Section 2.1. Then there exists a unique solution x(t) = xt of (3) and f(xt) < α for all t > 0. Furthermore, x is C2-smooth and f(xt) inf z Rd f(z) e µ t f(x0) inf f + µ 2 dist(x0, M)2 . The proof can be found in Appendix B. There, we prove that the Lyapunov function L(t) = f(xt) f(π(xt)) + 1 xt + µ xt π(xt) 2 satisfies L(t) e µt L(0). The function L can be considered as a modified energy with potential energy f inf f + µ 2 x π(x) 2 and kinetic energy 1 2 x 2, but with a more complex quadratic term in which velocity and position interact. Going back to (Su et al., 2016), this strategy of proof is common in convex optimization. The main difficulty in our more general geometric setting is that the time derivative of π(xt) (i.e. tangential velocity) is not zero in general, unless the minimizer is unique in Uα. Thus, we have to additionally control for the interaction of the tangential velocity with the other terms in the Lyapunov sequence. One key observation that helps us is that the connecting line xt π(xt) is orthogonal to the velocity of π(xt). The other technical tool is the following lemma. Lemma 7. Let M be a C2-submanifold of Rd and U an open set containing M such that there exists a unique closest point projection π : U M. Let x : ( ε, ε) U be a C1-curve and z(t) := π x(t). Then x, z 0 on ( ε, ε). Geometrically, Lemma 7 states that the closest point projection z of a point x does not move in the opposite direction when we move x. Despite its geometric simplicity, Lemma 7 is non-trivial and a crucial ingredient in our proofs. Its proof is given in Appendix A. Venturi et al. 
(2019) show that loss functions in overparametrized deep learning do not have strict local minimizers outside the set of global minimizers (under suitable assumptions). Under a quantitative version of this geometric assumption, we obtain a global version of Theorem 6, albeit with less precise quantitative guarantees. Theorem 8. In addition to the assumptions of Section 2.1, assume that 1. for every R > inf f, there exists LR > 0 such that f is Lipschitz-continuous with Lipschitz constant LR on UR = {x : f(x) < R} and 2. there exists a value δ > 0 such that f(x) 2 < δ implies that f(x) inf f < α Then, for any x0 Rd, there exists T 0 such that x(t) Uα for all t T and such that f(x(t)) inf f 3α/2 + dist(x T , M)2 e µ(T t) for all t > T. The rate of decay here is e µ t, as compared to e µt for gradient flow. We see more clearly why we speak of acceleration in discrete time (see the discussion below Theorem 11). Remark 9. We conjecture that Theorem 6 remains valid if the set M is convex rather than a smooth manifold. Then π is defined globally, but generally only Lipschitz-continuous and not smooth. 3.2 DETERMINISTIC OPTIMIZATION IN DISCRETE TIME We can equivalently write the heavy ball ODE (3) as a system of two first order ODEs: x = v v = 2 µ v f(x), x(0) = x0, v(0) = 0. (4) We choose a time-stepping scheme x n = xn + ηvn, gn = f(x n), xn+1 = x n ηgn, vn+1 = ρ vn ηgn (5) Published as a conference paper at ICLR 2025 for ρ = 1 µη 1+ µη = 1 2 µη + O(η). In particular, we have xn+1 = xn + η vn + O(η), vn+1 = vn η 2 µ vn + f(xn) + O(η), i.e. the scheme is a time discretization of the heavy ball system (4) with time-step size η. The square root is chosen for consistency with the literature. This scheme is a geometrically intuitive reparametrization of Nesterov s accelerated gradient descent algorithm, except for the fact that Nesterov s scheme typically begins with the gradient descent step rather than the momentum step (see Gupta et al., 2024, Appendix B, for a proof of equivalence). In discrete time, we need to make additional quantitative regularity assumptions on both f and the projection map π in order to ensure that the time step size is sufficiently small to recover the continuous time behavior. Note that the Hessian of the objective function f is positive semi-definite on the set of global minimizers M, i.e. it stands to reason that any negative eigenvalues of D2f should be small, at least close to M. We note the following. Lemma 10. Assume that D2f(x) ε in a ball Br(x0). Then f(x), x z f(x) f(z) ε 2 x z 2 x, z Br(x0). This lemma informs our geometric assumptions on the objective function f. For the closest point projection π, we assume that π(x) = Px + x for some (linear) orthogonal projection P onto a subspace V Rd and a fixed vector x which is the unique element with the smallest Euclidean norm in the affine space V = x + V . The assumption that the derivative Dπ P is constant is a (very restrictive) geometric linearization, and relaxing it is an important subject of future research. Still, it applies to many functions which are not convex, such as those with a unique minimizer (i.e. P 0) or those in Example 2. With an eye towards stochastic optimization, we opt for the simpler global geometric assumptions, see also Remark 16. Theorem 11. Assume that f is L-smooth and the sequences xn, x n, vn are generated according to the Nesterov scheme (5) with parameters η 1/L and ρ = (1 µη)/(1 + µη). 
Assume further that there exists an affine linear projection map π(x) = Px + x such that f(x), x π(x) f(x) f(π(x)) + µ 2 x π(x) 2. (6) Finally, assume that for arbitrary x, v Rd we have f(x + v), v f(x + v) f(x) ε with some ε p f(xn) inf f (1 µη)n h f(x0) inf f + µ 2 x0 π(x0) 2i . The proof, given in Appendix B.2, builds on similar ideas as Theorem 6, with additional complications introduced by the discrete-time setting. We have to modify the usual Lyapunov sequence used for strongly convex functions. We treat the tangential and normal components of the velocity (i.e. Pvn and P vn = (I P)vn) separately, and carefully choose coefficients to ensure that the following Lyapunov sequence decays at each step, Ln = f(xn) inf f + 1 2 P vn + µ(x n π(x n)) 2 + (1 + µη)2 2(1 µη) Pvn 2. In Theorem 11, we see more clearly than in Theorem 6 why we talk of acceleration: While gradient descent would achieve a decay rate of (1 µ/L)n with the commonly proposed step size η = 1/L in discrete time based on our assumptions, Nesterov s method achieves decay like (1 p µ/L)n with η = 1/L. Since µ/L 1, Nesterov s method converges much faster than gradient descent. Remark 12. Note that the negative eigenvalues of the Hessian may be as large as p µ/η in Theorem 11. If η is chosen as large as 1/L, this is a real restriction of the eigenvalues to the range [ µL, L]. However, since the Hessian eigenvalues of an L-smooth function are in [ L, L] a priori, there is no additional restriction if p µ/η L, i.e. if η < µ L2 . This corresponds to the continuous time guarantee of Theorem 6, which does not depend on the magnitude of the negative eigenvalues. Published as a conference paper at ICLR 2025 However, if the eigenvalues of D2f are as negative as L, the step size η = µ/L2 is so small that it does not improve upon gradient descent with step size η = 1/L since µη = µ/L in this case. Thus, if f is too far from being convex, acceleration may not be achievable in discrete time. Notably, especially close to the set of global minimizers, we can picture ε as small compared to L. 3.3 STOCHASTIC OPTIMIZATION IN DISCRETE TIME In typical applications in deep learning, the gradient f of the objective function/loss function f is prohibitively expensive to evaluate, but we have access to stochastic estimates of the true gradient. In this section, in addition to the assumptions of Theorem 11, we assume that we are given a probability space (Ω, A, Q) and a measurable function g : Rd Ω Rd such that Eω Q g(x, ω) = f(x), Eω Q g(x, ω) 2 < + x Rd. (8) For quantitative statements, a more precise assumption on the variance of the gradient estimates must be made. We make the modelling assumption Eω Q g(x, ω) f(x) 2 σ2 a + σ2 m f(x) 2 x Rd. (9) We call σa the additive standard deviation and σm the multiplicative standard deviation since they resemble the prototypical example g = (1 + σm N1) f + σa N2 where N1, N2 are random variables with mean zero and variance one. The case of purely additive noise (i.e. σm = 0) is classical and hails back to the seminal article of Robbins & Monro (1951). The case of purely multiplicative noise is much closer to reality in overparametrized learning: If all data points can be fit exactly, there is no noise when estimating the gradient of the empirical risk/training loss on the set of global minimizers. It has received significant attention more recently by Liu & Belkin (2018); Bassily et al. (2018); Vaswani et al. (2019); Even et al. (2021); Wojtowytsch (2023); Gupta et al. (2024) and others. 
Let us consider the purely additive case first. We follow the scheme (5), but we replace the deterministic gradient f(x n) by gn = g(x n, ωn) where ω0, ω1, . . . are drawn from Ωindependently of each other and the initial condition x0 with law Q. This framework allows e.g. for minibatch sampling (but assumes that all batches are drawn independently of each other from the dataset). Theorem 13. [Acceleration with additive noise] Assume that f, P are as in Theorem 11 and that the g satisfies (8) and (9) with σm = 0. Assume that the sequences xn, x n, vn are generated by the scheme (5) for parameters η 1/L and ρ = 1 µη 1+ µη, but with the stochastic gradient estimates g(x n, ωn) with independently identically distributed ωn in place of f(x n). Then E[f(xn) inf f] (1 µη)n h f(x0) inf f + µ 2 x0 π(x0) 2i + σ2 a η µ . Thus Nesterov s method reduces f(xn) below a noise level proportional to σ2 a p η/µ at a linear rate (1 µη)n in the presence of additive noise. The analogous bound for stochastic gradient descent was obtained (in the more general setting of PL functions) in (Karimi et al., 2016, Theorem 4) as E[f(xn) inf f] (1 µη)k E f(x0) inf f + Lσ2 aη 2µ . With the largest admissible learning rate η = 1/L, the noise level for GD is σ2 a/2µ compared to the usually much lower value σ2 a/ Lµ for Nesterov s method. Keeping a memory of previous gradient estimates facilitates averaging out the random noise. With a fixed positive learning rate η, generally f(xn) 0 unless σa = 0. We therefore consider a sequence of decreasing step sizes. Theorem 14. [Additive noise and decreasing step size] Assume that f, g are as in Theorem 13 and that the sequences xn, x n, ρn are generated by the scheme x n = xn + ηn 1vn, xn+1 = x n ηngn, vn+1 = ρn(vn ηngn) for parameters ηn = µ (n+ Lµ+1)2 , ρn = 1 µηn 1+ µηn . If ε p µ/η0 = µ + Lµ, then E f(xn) inf f L µ E f(x0) inf f + 1 2 x0 π(x0) 2 + σ2 a µ log 1 + n p Published as a conference paper at ICLR 2025 Note that the physical step size ηn decays as 1/n and thus satisfies the (non-)summability conditions of Robbins & Monro (1951). We can allow for larger ε by choosing ηn = µ/(n + n0) with n0 > p L/µ 1 for a smaller initial step size. If σm > 0, Liu & Belkin (2018) and Gupta et al. (2024) show that Nesterov s scheme no longer achieves acceleration. For general noise, we therefore consider a modified Nesterov scheme: x n = xn + αvn, xn+1 = x n ηgn, vn+1 = ρn vn αgn) (10) where again gn = g(x n, ωn). The scheme (10) was introduced as the Accelerated Gradient method with Noisy EStimators method (AGNES) by Gupta et al. (2024) in convex and strongly convex optimization with purely multiplicative noise. Compared to Nesterov s algorithm, AGNES has an additional parameter α which is required to adapt to the multiplicative variance, at least if σm 1. Here, we generalize the work of Gupta et al. (2024) by both allowing noise with general scaling for σa, σm > 0 and relaxing the convexity assumption on f. Theorem 15. [Additive and multiplicative noise] Assume that f, P, x are as in Theorem 11 and that g is a family of gradient estimators such that (8) and (9) hold for some σa, σm 0. Assume that the sequences xn, x n, vn are generated by the AGNES scheme (10) with parameters 0 < η 1 L(1 + σ2m), ρ = 1 q µ(1 + σ2m)η µ(1 + σ2m)η + σ2m η. Then, if ε < p µ(1 + σ2m)/η, we have E f(xn) inf f 1 r µη 1 + σ2m n E h f(x0) inf f + µ 2 x0 π(x0) 2i + σ2 a η p µ(1 + σ2m) . The proof of Theorem 15 is given in Appendix B.4. 
While at a glance it appears that the multiplicative noise is helping us reduce the additive error term, this is merely a consequence of the small learning rate which is forced upon us. The condition on the negative eigenvalues is relaxed to ε µL(1+σ2 m) with the largest admissible step size η = 1/(L(1 + σ2 m)) as the issues stemming from tangential drift pale in comparison to those stemming from stochastic gradient estimates. For comparison, if we chose the same η in Theorem 11, we could only allow for ε µL p 1 + σ2m < µL(1 + σ2 m), but we would obtain a rate of convergence of 1 p µ/L(1 + σ2m). In the stochastic case, we only achieve 1 p µ/L/(1 + σ2 m). Thus the limiting factor is the stochastic noise, not the geometry of f. Remark 16. We opted for a linear closest point projection map to facilitate proofs. In the non-linear case, closest point projections cannot be defined globally: Jessen (1940); Busemann (1947); Phelps (1957) show that if K is a subset of Rd such that for every x Rd there exists a unique closest point in K, then K is closed and convex. If K is both a k-dimensional submanifold of Rd and a convex set, then K is an affine k-dimensional subspace of Rd, i.e. our assumptions are the most general when assuming that a unique closest point projection onto a submanifold is defined globally. Thus, if M is not an affine space, we can only assume that π is good in a neighborhood of M. In stochastic optimization, where we can randomly jump out of the good neighborhood, this leads to serious technical challenges. Guarantees on remaining local with high probability have recently been derived for SGD with additive noise and decaying learning rates by Mertikopoulos et al. (2020) and with multiplicative noise by Wojtowytsch (2023). To avoid obscuring the new geometric constructions, we opted to forgo this highly technical setting here and prioritize the extension of Theorems 6 and 11 towards stochastic optimization. 4 CONCLUSION We have proved that first order momentum-based methods accelerate convergence in a more general setting than convex optimization with many geometric features motivated by loss landscapes encountered in deep learning. The models we studied include the heavy ball ODE and deterministic and stochastic optimization schemes in discrete time under various noise assumptions. The most cumbersome limitation of our convergence guarantees is the assumption that the closest point projection onto the set of minimizers is affine linear in the discrete time setting, i.e. the derivative Dπ is constant in space. Future work will focus on relaxing this assumption. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS The authors gratefully acknowledge the financial support of the NSF, grant DMS 2424801. The authors are grateful to Quentin Merigot, who pointed out a simpler proof of Lemma 7 which requires less regularity than the authors original approach. Guillaume Alain, Nicolas Le Roux, and Pierre-Antoine Manzagol. Negative eigenvalues of the hessian in deep neural networks, 2018. URL https://openreview.net/forum?id= S1iiddy DG. Vassilis Apidopoulos, Nicolò Ginatta, and Silvia Villa. Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under polyak łojasiewicz condition. Journal of Global Optimization, 84(3):563 589, 2022. J-F Aujol, Ch Dossal, and Aude Rondepierre. Convergence rates of the heavy ball method for quasi-strongly convex optimization. SIAM Journal on Optimization, 32(3):1817 1842, 2022. 
Jean-François Aujol, Charles Dossal, Hippolyte Labarrière, and Aude Rondepierre. Heavy ball momentum for non-strongly convex optimization, 2024. URL https://arxiv.org/abs/ 2403.06930. Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of sgd in non-convex over-parametrized learning. ar Xiv preprint ar Xiv:1811.02564, 2018. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. Herbert Busemann. Note on a theorem on convex sets. Matematisk Tidsskrift. B, pp. 32 34, 1947. Augustin Cauchy et al. Méthode générale pour la résolution des systemes d équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536 538, 1847. Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676 691, 2021. Alex Damian, Tengyu Ma, and Jason D Lee. Label noise sgd provably prefers flat global minimizers. Advances in Neural Information Processing Systems, 34:27449 27461, 2021. Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. Advances in neural information processing systems, 30, 2017. Weinan E, Chao Ma, and Lei Wu. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math, 2019. Mathieu Even, Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Hadrien Hendrikx, Laurent Massoulié, and Adrien Taylor. A continuized view on nesterov acceleration for stochastic gradient descent and randomized gossip. ar Xiv preprint ar Xiv:2106.07644, 2021. Qiang Fu, Dongchu Xu, and Ashia Camage Wilson. Accelerated stochastic optimization methods under quasar-convexity. In International Conference on Machine Learning, pp. 10431 10460. PMLR, 2023. Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469 1492, 2012. Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061 2089, 2013. Published as a conference paper at ICLR 2025 Baptiste Goujaud, Adrien Taylor, and Aymeric Dieuleveut. Provable non-accelerations of the heavyball method. ar Xiv preprint ar Xiv:2307.11291, 2023. Sergey Guminov, Alexander Gasnikov, and Ilya Kuruzov. Accelerated methods for weakly-quasiconvex optimization problems. Computational Management Science, 20(1):36, 2023. Kanan Gupta, Jonathan W. Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients. In Advances in Neural Information Processing Systems, volume 37, pp. 20694 20744, 2024. URL https://proceedings.neurips.cc/paper_files/paper/ 2024/file/24d2dd6dc9b79116f8ebc852ddb9dc94-Paper-Conference.pdf. J Hermant, J. F Aujol, C Dossal, and A Rondepierre. Study of the behaviour of nesterov accelerated gradient in a non convex setting: the strongly quasar convex case, 2024. URL https://arxiv. org/abs/2405.19809. Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. NBS Washington, DC, 1952. Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. 
In Conference on learning theory, pp. 1894 1938. PMLR, 2020. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pp. 1752 1755. PMLR, 2019. Børge Jessen. To saetninger om konvekse punktmaengder. Matematisk tidsskrift. B, pp. 66 70, 1940. Cédric Josz, Lexiao Lai, and Xiaopeng Li. Convergence of the momentum method for semialgebraic functions with locally lipschitz gradients. SIAM Journal on Optimization, 33(4):3012 3037, 2023. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximalgradient methods under the Polyak-Lojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795 811. Springer, 2016. Sebastian Kassing and Simon Weissmann. Polyak s heavy ball method achieves accelerated local rate of convergence under Polyak-Lojasiewicz inequality. ar Xiv preprint ar Xiv:2410.16849, 2024. Maxime Laborde and Adam Oberman. A lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In International Conference on Artificial Intelligence and Statistics, pp. 602 612. PMLR, 2020. Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical programming, 176:311 337, 2019. Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57 95, 2016. Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after sgd reaches zero loss? a mathematical framework. ar Xiv preprint ar Xiv:2110.06914, 2021. Chaoyue Liu and Mikhail Belkin. Mass: an accelerated stochastic method for over-parametrized learning. ar Xiv preprint ar Xiv:1810.13395, 2018. Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized problems. Advances in neural information processing systems, 36, 2024. Xin Liu, Zhisong Pan, and Wei Tao. Provable convergence of nesterov s accelerated gradient method for over-parameterized neural networks. Knowledge-Based Systems, 251:109277, 2022. Published as a conference paper at ICLR 2025 Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117 1128, 2020. Ion Necoara, Yu Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175:69 107, 2019. Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Yurii Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2). Dokl. Akad. Nauk SSSR, 269:543 547, 1983. Michael O Neill and Stephen J Wright. Behavior of accelerated gradient methods near critical points of nonconvex functions. Mathematical Programming, 176:403 427, 2019. RR Phelps. 
Convex sets and nearest points. Proceedings of the American Mathematical Society, 8(4): 790 797, 1957. Quentin Rebjock and Nicolas Boumal. Fast convergence to non-isolated minima: four equivalent conditions for c2-functions. ar Xiv preprint ar Xiv:2303.00096, 2023. Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951. Levent Sagun, Leon Bottou, and Yann Le Cun. Eigenvalues of the hessian in deep learning: Singularity and beyond, 2017. URL https://openreview.net/forum?id=B186c P9gx. Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks, 2018. URL https://openreview.net/ forum?id=r Jr Twxb Cb. Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pp. 71 79. PMLR, 2013. Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17 (153):1 43, 2016. Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for overparameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pp. 1195 1204. PMLR, 2019. Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1 34, 2019. Jun-Kun Wang and Andre Wibisono. Continuized acceleration for quasar convex functions in non-convex optimization. ar Xiv preprint ar Xiv:2302.07851, 2023. Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part i: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023. Pengyun Yue, Cong Fang, and Zhouchen Lin. On the lower bound of minimizing Polyak-łojasiewicz functions. In The Thirty Sixth Annual Conference on Learning Theory, pp. 2948 2968. PMLR, 2023. Published as a conference paper at ICLR 2025 A PROOFS OF LEMMAS 5 AND 7: GEOMETRY OF THE ENERGY LANDSCAPE We assume that the reader is familiar with basic concepts of differential geometry, such as submanifolds of Euclidean spaces and tangent spaces and with concepts of multi-variate analysis such as compactness and the inverse function theorem. We first recall an important observation: Assume that M is a C1-manifold. Fix a point x Rd. Assume that the function d : M R given by d(z) = x z has a local extremum at z M. Then for every C1-curve γ : ( ε, ε) M such that γ(0) = z, we have t=0 d(γ(t))2 = 2 x γ(0), γ(0) = 2 x z, γ(0) or in other words: the connecting line x z is orthogonal to the tangent space Tz M. This is in particular true if z is the closest point in M to x. Lemma 7. Let M be a C2-submanifold of Rd and U an open set containing M such that there exists a unique closest point projection π : U M. Let x : ( ε, ε) U be a C1-curve and z(t) := π x(t). Then x, z 0 on ( ε, ε). Proof. The closest point projection onto a C2-manifold is C1-smooth and satisfies π(x) = x d(x) d(x) (11) where d(x) = dist(x, M) is the distance function to the manifold, i.e. d(x) gives the unit vector pointing directly towards the manifold. For details, see the proof of Lemma 5. 
We can rewrite (11) as 2 minx M x z 2 2 x 2 2 x, z + z 2 = x max z M We are taking the (pointwise in x) maximum over a class of functions which are linear in x, i.e. we find that ξ(x) := max z M is convex. In particular, the derivative matrix Dπ = D2ξ is symmetric and positive semi-definite and thus x, d dtπ x = x, Dπ(x) x 0. Lemma 5. Assume that f : Rd R is C2-smooth and that M = {x Rd : f(x) = inf f} is a closed k-dimensional C2-submanifold of Rd (i.e. compact and without boundary). If D2f(x) has rank d k everywhere on M, then there exist µ, α > 0 such that there exists a C1-smooth closest point projection π : Uα R with Uα = {x : f(x) < α} and f is µ-strongly aiming with π. Proof. This result uses standard ideas from differential geometry. For the reader s convenience, we provide a full proof sketch. Step 1. Closest point projection. Assume that M is a Cm-manifold for m 2. Fix a point z0 M, a radius r > 0 and the neighbourhood U = Br(z0). We assume that r > 0 is so small that there exists a Cm-diffeomorphism ϕ : U V Rd such that ϕ(U M) = V {y : yk+1 = = yd = 0}. If M is a Cm-manifold, we can find a collection of Cm 1-smooth vector fields A1, . . . , Ad such that 1. A1(z), . . . , Ad(z) span the tangent space Tz M for all z M U and 2. Ai(z), Aj(z) = δij for all i, j = 1, . . . , d and z M U. Published as a conference paper at ICLR 2025 Such vector fields can be obtained for instance by applying the Gram-Schmidt algorithm to the columns of the derivative matrix D(ϕ 1)ϕ(z) = (Dϕz) 1 of the inverse diffeomorphism. The algorithm returns an orthonormal basis since ϕ is a diffeomorphism, i.e. Dϕ has full rank. The first k columns span the tangent space of Tx M since motion tangential to M in U corresponds to motion where the last d k coordinates are kept zero in V , i.e. to the first k columns of Dϕ 1. We now introduce new coordinates: Denote by ˆV Rk the set such that ˆV {0d k} = V {y : yk+1 = = yd = 0} and Ψ : ˆV Rd k Rd, Ψ(ˆy, sk+1, . . . , sd) = ϕ 1(ˆy, 0) + i=k+1 si Ai ϕ 1(ˆy, 0) . The map Ψ is Cm 1-smooth since it is linear in s and the least regular components, the vector fields Ai, are Cm 1-smooth in y. If m 2, we trivially find that Ψ is differentiable and DΨ(ˆy,0) = ( ˆy1ϕ, . . . , ˆykϕ, Ak+1, . . . , Ad) is invertible. Hence, the map Ψ is a local diffeomorphism by the inverse function theorem. Notably, we see that x Ψ(ˆy, 0) TΨ(ˆy,0)M x Ψ(ˆy, 0) span {Ak+1(Ψ(ˆy, 0)), . . . , Ad(Ψ(ˆy, 0))} and thus if and only if x = Ψ(ˆy, 0) + i=k+1 si Ai Ψ(ˆy, 0) = Ψ(ˆy, s) for some s Rd k. In a neighbourhood of z0 where Ψ is a diffeomorphism, we set π(Ψ(ˆy, s)) = Ψ(ˆy, 0), i.e. π = Ψ PRk Ψ 1, P(y1, . . . , yd) = (y1, . . . , yk, 0, . . . , 0). The map is as smooth as Ψ, i.e. Cm 1-smooth (assuming that m 2). The map π defined in this way may not be the unique closest point projection on all of U (e.g. when a point z M \ U is closer), but it is guaranteed to be the unique closest point projection on a smaller subset Br/2(z ) where the closest point on M is closer than the boundary of U. Thus, for every point z M, there exists a neighborhood Br(z)(z) for r(z) > 0 in which a unique closest point projection is defined. Setting U := S z M Br(z)(z), we find a neighborhood of the manifold M inside of which the closest point projection is defined. Assume that the radius r(z) is chosen as the supremum of all admissible radii. Then the function r(z) is strictly positive and Lipschitz-continuous with Lipschitz-constant 1: r(z ) r(z) z z since Br z z (z ) Br(z). 
Exchanging the role of z, z shows that r(z ) r(z) z z , r(z) r(z ) z z |r(z) r(z )| z z z, z M. In particular, if M is compact, then as a continuous positive function, r is uniformly positive and there exists a neighborhood Wδ := {x Rd : dist(x, M) < δ} on which the unique closest point projection is defined. Additionally, we find that Ψ(ˆy, s) π Ψ(ˆy, s) = i=k+1 si Ai(Ψ(ˆy, 0)) since the vector fields Ai are orthonormal. In Lemma 7, we require π(x(t)) to be C2-smooth if x(t) is C2-smooth due to the technicalities of the proof. For this reason, we make the assumption that M is a C3-manifold to ensure that π is a C2-map. For the required smoothness of solutions to the heavy ball ODE, it is sufficient to require that f is C2-smooth. Step 2. The geometry of f. Since M is the set of minimizers of f, the Hessian D2f(z) is positive semi-definite for every z M. As f is constant on M, we for every curve γ : ( ε, ε) M we have dt2 f(γ(t)) = f(γ(t)), γ (t) + γ (t)T D2f(γ(t)) γ (t) = γ (t)T D2f(γ(t)) γ (t) Published as a conference paper at ICLR 2025 because f 0 on the set M of minimizers of f, i.e. v T D2f(z)v = 0 for all z M and v Tz M. If we assume that D2f(z) has rank d k for all z M, then necessarily v T D2f(z)v > 0 for all v which are orthogonal to M or equivalently v T D2f(z)v λ(z) P z v 2 z M, v Rd where P z denotes the orthogonal projection onto the orthogonal complement of the tangent space of M at z, i.e. i=k+1 v, Ai(z) Ai(z). If M is compact, then the function λ is bounded from below by some λ0 > 0. Let ε = λ0/2. Using the uniform continuity of π, P and D2f on a compact set Wδ and choosing δ > 0 suitably small, we find that v T D2f(x)v λ0 P π(x)v 2 ε v 2 x Wδ, v Sd 1 since the map from matrix to smallest eigenvalue is continuous on the space of symmetric matrices. In particular, for any fixed (ˆy, s) ˆV Sd 1 we see that the function g : ( δ, δ) R, g(t) = f Ψ(ˆy, ts) is λ0 ε strongly convex. To see this, abbreviate v := Pd i=k+1 si Ai(ˆy) and compute dt2 f (Ψ(ˆy, 0) + tv) = v T D2f (Ψ(ˆy, ts)) v (λ0 ε) v 2 = λ0 ε since v TΨ(ˆy,0)M and v = s = 1. Hence f Ψ(ˆy, 0) = g(0) g(t) g (t)t + λ0 ε = f Ψ(ˆy, ts) t f(Ψ(ˆy, ts)), v + λ0 ε = f Ψ(ˆy, ts) + f(Ψ(ˆy, ts)), Ψ(ˆy, 0) Ψ(ˆy, ts) + λ0 ε Ψ(ˆy, 0) Ψ(ˆy, ts) 2 or in the original coordinates of Wδ Rd: f(π(x)) f(x) + f(x), π(x) x + λ0 ε 2 x π(x) 2. By the same argument with reversed roles for x, π(x) we have f(x) f(π(x)) + λ0 ε 2 x π(x) 2 = inf f + λ0 ε 2 dist(x, M)2. In particular: f(x) < inf f + α implies that dist(x, M) r 2 λ0 ε f(x) inf f < r Choosing α small enough, we see that the open neighborhood Uα := {x : f(x) < inf f + α} is a subset of Wδ. Within this neighborhood, the unique closest point projection is therefore welldefined. This concludes the proof of the Lemma and shows that the Assumptions of Section 2.1 are satisfied in this setting. Remark 17. Controlling the largest eigenvalue of the Hessian rather than the smallest, we see that there exist 0 < µ < L such that 2 dist(x, M)2 f(x) L 2 dist(x, M)2 in a neighborhood of U. For this reason, we presented Example 3 for context and intuition. Published as a conference paper at ICLR 2025 B PROOFS OF ACCELERATION IN OPTIMIZATION B.1 PROOF OF THEOREMS 6 AND 8: OPTIMIZATION IN CONTINUOUS TIME We first prove the local convergence statement. Theorem 6. [Continuous time convergence guarantee] Assume that f satisfies the assumptions of Section 2.1, γ = 2 µ and that x0 satisfies f(x0) < α where α is as in Section 2.1. Then there exists a unique solution x(t) = xt of (3) and f(xt) < α for all t > 0. 
Furthermore, x is C2-smooth and f(xt) inf z Rd f(z) e µ t f(x0) inf f + µ 2 dist(x0, M)2 . Proof. Step 0: Existence and Uniqueness. Note that a solution to the heavy ball ODE can be obtained as a solution to the ODE system x = v v = γv f(x). If f is locally Lipschitz-continuous (for instance if f is C2-smooth), a unique C1-solution (x, v) of the ODE system exists by the Picard-Lindelöff Theorem. Since x = v is C1-smooth, we see that the solution x of the heavy ball ODE is C2-smooth. Step 1: xt remains in Uα. Note that 2 xt 2 = f(xt), xt + xt, xt = x + f(x), x = 2 µ xt 2 0, so f(xt) f(xt) + 1 2 xt 2 f(x0) + 1 2 x0 2 = f(x0) < α for all t 0 since x0 = 0, which implies xt Uα for all t 0. Step 2: Bounding f(xt). Let zt := π(xt) denote the closest point projection of xt onto M and by zt its derivative. Consider the Lyapunov function L(t) = f(xt) f(π(xt)) + 1 2 xt + µ(xt π(xt)) 2 We will show that L (t) µL(t) under some neighborhood assumption on xt. Using the heavy ball dynamics and properties of the projection, we can bound L (t) as: L (t) = f(xt), xt + xt + µ(xt π(xt)), xt + µ( xt zt) = xt, f(xt) + xt + µ xt µ zt + µ xt π(xt), xt + µ xt µ zt = xt, µ xt µ zt, + µ xt zt, µ xt f(xt) µ zt µ xt 2 µ xt zt, xt µ f(xt), xt zt where we used the heavy ball dynamics xt = 2 µ xt f(xt) and the geometric properties of the closest point projection: 1. xt, zt 0 by Lemma 7 and 2. xt zt, zt = 0 since xt zt meets M orthogonally at zt and zt is tangent to M at zt. Next, using the µ-strong aiming condition, we have f(xt), xt π(xt) f(xt) f(π(xt)) + µ 2 xt π(xt) 2. Substituting this into the bound on L (t) and simplifying gives: L (t) µ xt 2 µ xt π(xt), xt µ f(xt) f(π(xt)) + µ 2 xt π(xt) 2 = µ f(xt) f(π(xt)) + 1 xt + µ xt π(xt) 2 + 1 Published as a conference paper at ICLR 2025 Now, we can bound the initial value of the Lyapunov function L(0) as follows: L(0) = f(x0) inf z Rd f(z) + 1 2 x0 + µ(x0 π(x0)) 2 = f(x0) inf z Rd f(z) + µ 2 dist(x0, M)2 since x0 = 0 and dist(x, M) = x π(x) . We deduce that d dte µt L(t) = µL(t) + L (t) e µt 0 f(xt) inf f L(t) e µt L(0) = e µt f(x0) inf f + µ 2 dist(x0, M)2 Next, we prove the global convergence statement. Theorem 8. In addition to the assumptions of Section 2.1, assume that 1. for every R > inf f, there exists LR > 0 such that f is Lipschitz-continuous with Lipschitz constant LR on UR = {x : f(x) < R} and 2. there exists a value δ > 0 such that f(x) 2 < δ implies that f(x) inf f < α Then, for any x0 Rd, there exists T 0 such that x(t) Uα for all t T and such that f(x(t)) inf f 3α/2 + dist(x T , M)2 e µ(T t) for all t > T. Proof. The idea of the proof is to show that at some large time T, the trajectory of x enters the set Uα with sufficiently low velocity that it gets trapped in Uα. From that point onwards, the proof of Theorem 6 applies with minor modifications. Step 1. Denote by E(t) = f(xt) + 1 2 xt 2 the total energy of the curve x at time t. As in the proof of Theorem 6, we find that E (t) = 2 µ xt 2, so in particular x 2 is square integrable in time: Z 0 x 2 dt = E(0) limt E(t) 2 µ f(x0) inf f Recall for future use that f(xt) E(t) E(0) = f(x0) for t 0. Step 2. In this step, we show that also f(xt) is square integrable in time. To do this, we first observe that Z T 0 f(x) + 2 µ x 2 dt = Z T 0 f(x) 2 + 4 µ f(x), x + x 2 dt 0 f(x) 2 dt + 4 µ Z T d dtf(xt) dt + 4µ Z T On the other hand, we can write Z T 0 f(x) + 2 µ x 2 dt = Z T 0 f(x) + 2 µ x, x dt 0 f(x), x dt µ Z T d dt x 2 dt = f(x T ), x T + Z T 0 D2f(x) x, x dt µ x T 2 since x0 = 0. 
Next, we prove the global convergence statement.

Theorem 8. In addition to the assumptions of Section 2.1, assume that
1. for every $R > \inf f$, there exists $L_R > 0$ such that $\nabla f$ is Lipschitz-continuous with Lipschitz constant $L_R$ on $U_R = \{x : f(x) < R\}$, and
2. there exists a value $\delta > 0$ such that $\|\nabla f(x)\|^2 < \delta$ implies that $f(x) - \inf f < \alpha/2$.
Then, for any $x_0 \in \mathbb{R}^d$, there exists $T \ge 0$ such that $x(t) \in U_\alpha$ for all $t \ge T$ and such that
\[ f(x(t)) - \inf f \le \Big( \frac{3\alpha}{2} + \mu\,\mathrm{dist}(x_T, M)^2 \Big)\, e^{-\sqrt\mu\,(t - T)} \qquad \text{for all } t > T. \]

Proof. The idea of the proof is to show that at some large time $T$, the trajectory of $x$ enters the set $U_\alpha$ with sufficiently low velocity that it gets trapped in $U_\alpha$. From that point onwards, the proof of Theorem 6 applies with minor modifications.

Step 1. Denote by $E(t) = f(x_t) + \frac12\,\|\dot x_t\|^2$ the total energy of the curve $x$ at time $t$. As in the proof of Theorem 6, we find that $E'(t) = -2\sqrt\mu\,\|\dot x_t\|^2$, so in particular $\|\dot x\|^2$ is integrable in time:
\[ \int_0^\infty \|\dot x\|^2\, dt = \frac{E(0) - \lim_{t\to\infty} E(t)}{2\sqrt\mu} \le \frac{f(x_0) - \inf f}{2\sqrt\mu}. \]
Recall for future use that $f(x_t) \le E(t) \le E(0) = f(x_0)$ for $t \ge 0$.

Step 2. In this step, we show that $\|\nabla f(x_t)\|$ is also square integrable in time. To do this, we first observe that
\[ \int_0^T \big\|\nabla f(x) + 2\sqrt\mu\,\dot x\big\|^2\, dt = \int_0^T \|\nabla f(x)\|^2 + 4\sqrt\mu\,\langle \nabla f(x), \dot x\rangle + 4\mu\,\|\dot x\|^2\, dt = \int_0^T \|\nabla f(x)\|^2\, dt + 4\sqrt\mu \int_0^T \frac{d}{dt} f(x_t)\, dt + 4\mu \int_0^T \|\dot x\|^2\, dt. \]
On the other hand, since $\nabla f(x) + 2\sqrt\mu\,\dot x = -\ddot x$, we can write
\[ \int_0^T \big\|\nabla f(x) + 2\sqrt\mu\,\dot x\big\|^2\, dt = -\int_0^T \langle \nabla f(x), \ddot x\rangle\, dt - \sqrt\mu \int_0^T \frac{d}{dt}\,\|\dot x\|^2\, dt = -\langle \nabla f(x_T), \dot x_T\rangle + \int_0^T \langle D^2f(x)\,\dot x, \dot x\rangle\, dt - \sqrt\mu\,\|\dot x_T\|^2, \]
where we integrated by parts and used that $\dot x_0 = 0$. Overall, we find that
\[ \int_0^T \|\nabla f(x)\|^2\, dt = 4\sqrt\mu\,\big(f(x_0) - f(x_T)\big) - \langle \nabla f(x_T), \dot x_T\rangle - \sqrt\mu\,\|\dot x_T\|^2 + \int_0^T \langle D^2f(x)\,\dot x, \dot x\rangle - 4\mu\,\|\dot x\|^2\, dt \]
\[ \le 4\sqrt\mu\,\big(f(x_0) - \inf f\big) + L \int_0^T \|\dot x\|^2\, dt - \langle \nabla f(x), \dot x\rangle(T), \]
where $L = L_{f(x_0)}$ is the Lipschitz constant of $\nabla f$ on the set $\{x : f(x) < f(x_0)\}$. Note that $t \mapsto \langle \nabla f(x), \dot x\rangle = \frac{d}{dt} f(x_t)$ is the continuous derivative of the bounded function $f(x_t) \in [\inf f, f(x_0)]$, so there exists a sequence of times $T_n \to \infty$ such that $\langle \nabla f(x), \dot x\rangle(T_n) \to 0$. Since $\|\nabla f\|^2$ is a non-negative integrand, we can bound
\[ \int_0^\infty \|\nabla f(x)\|^2\, dt \le \lim_{n\to\infty} \Big( 4\sqrt\mu\,\big(f(x_0) - \inf f\big) + L \int_0^{T_n} \|\dot x\|^2\, dt - \langle \nabla f(x), \dot x\rangle(T_n) \Big) = 4\sqrt\mu\,\big(f(x_0) - \inf f\big) + L \int_0^\infty \|\dot x\|^2\, dt < +\infty. \]

Step 3. Using Steps 1 and 2, we find that
\[ \int_0^\infty \|\nabla f(x_t)\|^2 + \|\dot x_t\|^2\, dt < +\infty. \]
In particular, there exists a sequence of times $t_n \to \infty$ such that $\|\nabla f(x(t_n))\|^2 + \|\dot x(t_n)\|^2 \to 0$ as $n \to \infty$. We can therefore choose $T > 0$ such that $\|\nabla f(x_T)\|^2 + \|\dot x_T\|^2 < \min\{\delta, \alpha\}$. Then we find that
\[ \|\nabla f(x_T)\|^2 < \delta \ \Rightarrow\ f(x_T) - \inf f < \frac\alpha2, \qquad \|\dot x_T\|^2 < \alpha \ \Rightarrow\ E(T) = f(x_T) + \frac12\,\|\dot x_T\|^2 < \inf f + \alpha. \]
In particular, we conclude that $f(x_t) \le E(t) < \inf f + \alpha$ for all $t > T$, i.e. $x_t \in U_\alpha$ for all $t > T$. Thus, by the same argument as in Theorem 6, the function
\[ L(t) := f(x_t) - \inf f + \frac12\,\big\|\dot x_t + \sqrt\mu\,(x_t - \pi(x_t))\big\|^2 \]
satisfies $L(t) \le e^{-\sqrt\mu\,(t-T)}\, L(T)$ for $t > T$, so
\[ f(x_t) - \inf f \le L(t) \le e^{-\sqrt\mu\,(t-T)}\, \Big( f(x_T) - \inf f + \frac12\,\big\|\dot x_T + \sqrt\mu\,(x_T - \pi(x_T))\big\|^2 \Big) \le e^{-\sqrt\mu\,(t-T)}\, \Big( f(x_T) - \inf f + \|\dot x_T\|^2 + \mu\,\|x_T - \pi(x_T)\|^2 \Big) \le \Big( \frac{3\alpha}{2} + \mu\,\mathrm{dist}(x_T, M)^2 \Big)\, e^{-\sqrt\mu\,(t-T)}, \]
where we used $\frac12\,\|a+b\|^2 \le \|a\|^2 + \|b\|^2$.

Remark 18. Obviously, the condition that $\nabla f$ is Lipschitz-continuous on all sublevel sets could easily be relaxed to requiring that the initialization $x_0$ is such that $\nabla f$ is merely Lipschitz-continuous on the set $\{x : f(x) < f(x_0)\}$, or even on every connected component of this set.

B.2 PROOF OF THEOREM 11: ACCELERATION IN DISCRETE TIME (DETERMINISTIC SETTING)

Lemma 10. Assume that $D^2 f(x) \ge -\varepsilon$ in a ball $B_r(x_0)$. Then
\[ \langle \nabla f(x), x - z\rangle \ge f(x) - f(z) - \frac\varepsilon2\,\|x - z\|^2 \qquad \forall\, x, z \in B_r(x_0). \]

Proof. The function $f(x) + \frac\varepsilon2\,\|x\|^2$ is convex since all eigenvalues of its Hessian $D^2f + \varepsilon I$ are non-negative, so
\[ f(z) + \frac\varepsilon2\,\|z\|^2 \ge f(x) + \frac\varepsilon2\,\|x\|^2 + \langle \nabla f(x) + \varepsilon x,\ z - x\rangle, \]
which is equivalent to
\[ \langle \nabla f(x), x - z\rangle \ge f(x) - f(z) + \frac\varepsilon2\,\|x\|^2 - \frac\varepsilon2\,\|z\|^2 + \varepsilon\,\langle x, z - x\rangle = f(x) - f(z) - \frac\varepsilon2\,\|x - z\|^2. \]

Before proving Theorem 11, we recall a well-known auxiliary result.

Lemma 19. (Gupta et al., 2024, Lemma 13) Assume that $\nabla f$ is Lipschitz-continuous with Lipschitz constant $L$. Then
\[ f(x - \eta g) \le f(x) - \eta\,\langle \nabla f(x), g\rangle + \frac{L\eta^2}{2}\,\|g\|^2. \]

Theorem 11. Assume that $f$ is $L$-smooth and the sequences $x_n, x^*_n, v_n$ are generated according to the Nesterov scheme (5) with parameters $\eta \le 1/L$ and $\rho = \big(1 - \sqrt{\mu\eta}\big)\big/\big(1 + \sqrt{\mu\eta}\big)$. Assume further that there exists an affine linear projection map $\pi(x) = Px + \bar x$ such that
\[ \langle \nabla f(x), x - \pi(x)\rangle \ge f(x) - f(\pi(x)) + \frac\mu2\,\|x - \pi(x)\|^2. \tag{6} \]
Finally, assume that for arbitrary $x, v \in \mathbb{R}^d$ we have
\[ \langle \nabla f(x + v), v\rangle \ge f(x + v) - f(x) - \frac\varepsilon2\,\|v\|^2 \tag{7} \]
with some $\varepsilon \le \sqrt{\mu/\eta}$. Then
\[ f(x_n) - \inf f \le \big(1 - \sqrt{\mu\eta}\big)^n \Big[ f(x_0) - \inf f + \frac\mu2\,\|x_0 - \pi(x_0)\|^2 \Big]. \]
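Before turning to the proof, here is a minimal runnable sketch of the iteration, assuming that scheme (5) takes the three-sequence form $x^*_n = x_n + \sqrt\eta\, v_n$, $x_{n+1} = x^*_n - \eta\,\nabla f(x^*_n)$, $v_{n+1} = \rho\,(v_n - \sqrt\eta\,\nabla f(x^*_n))$, which is our reading of the extracted text and the form on which the computations below are based. The toy objective (a quadratic with a line of minimizers, so that (6) holds with an affine $\pi$ and (7) holds with $\varepsilon = 0$) is our own choice.

```python
import numpy as np

# Sketch of the scheme of Theorem 11 (our reconstruction of (5)) on
# f(x) = mu/2 * x2^2 with minimizers M = {x2 = 0} and pi(x) = (x1, 0).
mu, L = 0.05, 1.0
f      = lambda x: 0.5 * mu * x[1] ** 2
grad_f = lambda x: np.array([0.0, mu * x[1]])

eta = 1.0 / L
rho = (1 - np.sqrt(mu * eta)) / (1 + np.sqrt(mu * eta))
rate = 1 - np.sqrt(mu * eta)

x, v = np.array([1.0, 3.0]), np.zeros(2)
L0 = f(x) + 0.5 * mu * x[1] ** 2      # f(x0) - inf f + mu/2 |x0 - pi(x0)|^2
for n in range(200):
    x_star = x + np.sqrt(eta) * v
    g = grad_f(x_star)
    x = x_star - eta * g
    v = rho * (v - np.sqrt(eta) * g)
    # guaranteed accelerated contraction from Theorem 11
    assert f(x) <= L0 * rate ** (n + 1) * (1 + 1e-9)
print("accelerated contraction factor per step:", rate)
```

The contraction factor $1 - \sqrt{\mu\eta}$ should be compared with the factor $1 - \mu\eta$ of plain gradient descent on the same objective.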
Proof. Setup. Denote by $P^\perp = I - P$ the orthogonal projection onto the orthogonal complement of the space onto which $P$ projects. Consider the Lyapunov sequence
\[ L_n = f(x_n) - \inf f + \frac12\,\big\| P^\perp v_n + \sqrt\mu\,(x^*_n - \pi(x^*_n))\big\|^2 + \frac{(1+\sqrt{\mu\eta})^2}{2\,(1-\sqrt{\mu\eta})}\,\|P v_n\|^2. \]
This is a variation of the usual Lyapunov sequence in which we separately analyze the tangential and normal velocities. Note however that $(1+\sqrt{\mu\eta})^2/(1-\sqrt{\mu\eta}) = 1 + O(\sqrt\eta)$, i.e. as $\eta \to 0$ we recover the Lyapunov function of the continuous time setting, where tangential and normal velocity are not separated:
\[ \big\|P^\perp v + \sqrt\mu\,(x - \pi(x))\big\|^2 + \frac{(1+\sqrt{\mu\eta})^2}{1-\sqrt{\mu\eta}}\,\|Pv\|^2 \ \to\ \big\|P^\perp v + \sqrt\mu\,(x-\pi(x))\big\|^2 + \|Pv\|^2 = \big\|v + \sqrt\mu\,(x - \pi(x))\big\|^2 \]
as $\eta \to 0$, since the vectors $Pv$ and $P^\perp v + \sqrt\mu\,(x - \pi(x))$ are orthogonal and $v = Pv + P^\perp v$. We want to show that $L_{n+1} \le (1-\sqrt{\mu\eta})\,L_n$. Note that $1 - \sqrt{\mu\eta} \ge 1 - \sqrt{\mu/L} \ge 0$ since $\mu \le L$ by Lemma 24. For simplicity, we assume that $\bar x = 0$, i.e. $x^*_n - \pi(x^*_n) = x^*_n - Px^*_n = P^\perp x^*_n$.

Step 1. Since $f$ is $L$-smooth and $\eta \le 1/L$, we have
\[ f(x_{n+1}) \le f(x^*_n) - \Big(1 - \frac{L\eta}{2}\Big)\,\eta\,\|\nabla f(x^*_n)\|^2 \le f(x^*_n) - \frac\eta2\,\|\nabla f(x^*_n)\|^2 \]
by Lemma 19 with $x = x^*_n$ and $g = \nabla f(x^*_n)$.

Step 2. We compute
\[ P^\perp v_{n+1} + \sqrt\mu\, P^\perp x^*_{n+1} = P^\perp v_{n+1} + \sqrt\mu\, P^\perp\big( x^*_n + \sqrt\eta\, v_{n+1} - \eta\,\nabla f(x^*_n)\big) = \big(1 + \sqrt{\mu\eta}\big)\, P^\perp v_{n+1} + \sqrt\mu\, P^\perp x^*_n - \sqrt\mu\,\eta\, P^\perp \nabla f(x^*_n) \]
\[ = \rho\,\big(1+\sqrt{\mu\eta}\big)\, P^\perp v_n + \sqrt\mu\, P^\perp x^*_n - \big( \sqrt\mu\,\eta + \rho\,(1+\sqrt{\mu\eta})\,\sqrt\eta \big)\, P^\perp \nabla f(x^*_n) = \big(1 - \sqrt{\mu\eta}\big)\, P^\perp v_n + \sqrt\mu\, P^\perp x^*_n - \sqrt\eta\, P^\perp \nabla f(x^*_n), \]
where we simplify the coefficients in the last step by substituting $\rho = \frac{1-\sqrt{\mu\eta}}{1+\sqrt{\mu\eta}}$. So
\[ \frac12\,\big\| P^\perp v_{n+1} + \sqrt\mu\, P^\perp x^*_{n+1}\big\|^2 = \frac{(1-\sqrt{\mu\eta})^2}{2}\,\|P^\perp v_n\|^2 + \frac\mu2\,\|P^\perp x^*_n\|^2 + \frac\eta2\,\|P^\perp \nabla f(x^*_n)\|^2 \]
\[ \qquad + \sqrt\mu\,(1-\sqrt{\mu\eta})\,\langle P^\perp v_n, P^\perp x^*_n\rangle - \sqrt{\mu\eta}\,\langle \nabla f(x^*_n), P^\perp x^*_n\rangle - \sqrt\eta\,(1-\sqrt{\mu\eta})\,\langle \nabla f(x^*_n), P^\perp v_n\rangle. \]
Recall that, since both $P$ and $P^\perp$ are orthogonal projections, for any $x, y \in \mathbb{R}^d$ we have $\langle Px, Py\rangle = \langle Px, y\rangle = \langle x, Py\rangle$, and the analogous result holds for $P^\perp$ as well.

Step 3. Now we expand the last term in the Lyapunov sequence,
\[ \frac{(1+\sqrt{\mu\eta})^2}{2\,(1-\sqrt{\mu\eta})}\,\|Pv_{n+1}\|^2 = \frac{(1+\sqrt{\mu\eta})^2}{2\,(1-\sqrt{\mu\eta})}\,\rho^2\,\big\|P\big(v_n - \sqrt\eta\,\nabla f(x^*_n)\big)\big\|^2 = \frac{1-\sqrt{\mu\eta}}{2}\,\Big( \|Pv_n\|^2 + \eta\,\|P\nabla f(x^*_n)\|^2 - 2\sqrt\eta\,\langle \nabla f(x^*_n), Pv_n\rangle \Big). \]

Step 4. We add the expressions from the previous steps and use the fact that $Pv_n + P^\perp v_n = v_n$ to get
\[ L_{n+1} = \frac{(1-\sqrt{\mu\eta})^2}{2}\,\|P^\perp v_n\|^2 + \frac\mu2\,\|P^\perp x^*_n\|^2 + \frac\eta2\,\|P^\perp\nabla f(x^*_n)\|^2 + \sqrt\mu\,(1-\sqrt{\mu\eta})\,\langle P^\perp v_n, P^\perp x^*_n\rangle - \sqrt{\mu\eta}\,\langle \nabla f(x^*_n), P^\perp x^*_n\rangle \]
\[ \qquad - \sqrt\eta\,(1-\sqrt{\mu\eta})\,\langle \nabla f(x^*_n), v_n\rangle + \frac{1-\sqrt{\mu\eta}}{2}\,\|Pv_n\|^2 + \frac{(1-\sqrt{\mu\eta})\,\eta}{2}\,\|P\nabla f(x^*_n)\|^2 + f(x_{n+1}) - \inf f. \]
Using (6) and (7) to bound the inner products $\langle \nabla f(x^*_n), v_n\rangle$ and $\langle \nabla f(x^*_n), P^\perp x^*_n\rangle$ (note that $x^*_n = x_n + \sqrt\eta\, v_n$, so (7) gives $\sqrt\eta\,\langle \nabla f(x^*_n), v_n\rangle \ge f(x^*_n) - f(x_n) - \frac{\varepsilon\eta}2\,\|v_n\|^2$), together with the bound on $f(x_{n+1})$ from Step 1, we obtain
\[ L_{n+1} \le \frac{(1-\sqrt{\mu\eta})^2}{2}\,\|P^\perp v_n\|^2 + \frac\mu2\,\|P^\perp x^*_n\|^2 + \frac\eta2\,\|P^\perp\nabla f(x^*_n)\|^2 + \sqrt\mu\,(1-\sqrt{\mu\eta})\,\langle P^\perp v_n, P^\perp x^*_n\rangle - \sqrt{\mu\eta}\,\Big( f(x^*_n) - \inf f + \frac\mu2\,\|P^\perp x^*_n\|^2\Big) \]
\[ \qquad - (1-\sqrt{\mu\eta})\,\Big( f(x^*_n) - f(x_n) - \frac{\varepsilon\eta}2\,\|v_n\|^2\Big) + \frac{1-\sqrt{\mu\eta}}{2}\,\|Pv_n\|^2 + \frac{(1-\sqrt{\mu\eta})\,\eta}{2}\,\|P\nabla f(x^*_n)\|^2 + f(x^*_n) - \frac\eta2\,\|\nabla f(x^*_n)\|^2 - \inf f. \]
Now, we use Pythagoras' theorem, i.e. $\|Pw\|^2 + \|P^\perp w\|^2 = \|Pw + P^\perp w\|^2 = \|w\|^2$ for all $w \in \mathbb{R}^d$, and rearrange some of the terms:
\[ L_{n+1} \le \frac{(1-\sqrt{\mu\eta})^2 + (1-\sqrt{\mu\eta})\,\eta\varepsilon}{2}\,\|P^\perp v_n\|^2 + \frac{\mu\,(1-\sqrt{\mu\eta})}{2}\,\|P^\perp x^*_n\|^2 + \frac{(1-\sqrt{\mu\eta})(1+\eta\varepsilon)}{2}\,\|Pv_n\|^2 - \frac{\sqrt{\mu\eta}\,\eta}{2}\,\|P\nabla f(x^*_n)\|^2 \]
\[ \qquad + \sqrt\mu\,(1-\sqrt{\mu\eta})\,\langle P^\perp v_n, P^\perp x^*_n\rangle + \big(1 - \sqrt{\mu\eta} - (1-\sqrt{\mu\eta})\big)\, f(x^*_n) + (1-\sqrt{\mu\eta})\,\big(f(x_n) - \inf f\big). \]
The coefficients of $f(x^*_n)$ and $\|P^\perp\nabla f(x^*_n)\|^2$ are zero and the coefficient of $\|P\nabla f(x^*_n)\|^2$ is negative, so we can disregard those terms. If $\varepsilon \le \sqrt{\mu/\eta}$, i.e. $\eta\varepsilon \le \sqrt{\mu\eta}$, the coefficient of $\|P^\perp v_n\|^2$ can be bounded as
\[ \frac{(1-\sqrt{\mu\eta})^2 + (1-\sqrt{\mu\eta})\,\eta\varepsilon}{2} \le \frac{(1-\sqrt{\mu\eta})\,\big(1 - \sqrt{\mu\eta} + \sqrt{\mu\eta}\big)}{2} = \frac{1-\sqrt{\mu\eta}}{2}, \]
and the coefficient of $\|Pv_n\|^2$ can be bounded as
\[ \frac{(1-\sqrt{\mu\eta})(1+\eta\varepsilon)}{2} \le \frac{(1-\sqrt{\mu\eta})(1+\sqrt{\mu\eta})}{2} \le \big(1-\sqrt{\mu\eta}\big)\,\frac{(1+\sqrt{\mu\eta})^2}{2\,(1-\sqrt{\mu\eta})}. \]
Thus, we conclude that
\[ L_{n+1} \le (1-\sqrt{\mu\eta})\,\big(f(x_n) - \inf f\big) + \frac{1-\sqrt{\mu\eta}}{2}\,\|P^\perp v_n\|^2 + \frac{\mu\,(1-\sqrt{\mu\eta})}{2}\,\|P^\perp x^*_n\|^2 + \sqrt\mu\,(1-\sqrt{\mu\eta})\,\langle P^\perp v_n, P^\perp x^*_n\rangle + (1-\sqrt{\mu\eta})\,\frac{(1+\sqrt{\mu\eta})^2}{2\,(1-\sqrt{\mu\eta})}\,\|Pv_n\|^2 = (1-\sqrt{\mu\eta})\,L_n. \]
Consequently,
\[ f(x_n) - \inf f \le L_n \le (1-\sqrt{\mu\eta})^n\, L_0 = (1-\sqrt{\mu\eta})^n\, \Big[ f(x_0) - \inf f + \frac\mu2\,\|P^\perp x_0\|^2 \Big] \]
since $v_0 = 0$.
B.3 PROOF OF THEOREMS 13 AND 14: ACCELERATION WITH ADDITIVE NOISE

The proof of Theorem 13 mimics that of Theorem 11 with minor modifications to account for stochastic gradient estimates. Since $\omega_n$ is stochastically independent of anything which has happened in the algorithm before, in particular of $x^*_n$ and $v_n$, we have the following.

Lemma 20. For all $n \in \mathbb N$, we have
\[ \mathbb E\big[\langle g(x^*_n, \omega_n), \nabla f(x^*_n)\rangle\big] = \mathbb E\big[\|\nabla f(x^*_n)\|^2\big], \qquad \mathbb E\big[\langle g(x^*_n, \omega_n), v_n\rangle\big] = \mathbb E\big[\langle \nabla f(x^*_n), v_n\rangle\big], \]
\[ \mathbb E\big[\langle g(x^*_n, \omega_n), P^\perp x^*_n\rangle\big] = \mathbb E\big[\langle \nabla f(x^*_n), P^\perp x^*_n\rangle\big], \qquad \mathbb E\big[\|g_n\|^2\big] = \mathbb E\big[\|\nabla f(x^*_n)\|^2\big] + \mathbb E\big[\|g_n - \nabla f(x^*_n)\|^2\big], \]
where the expectations are taken over the (potentially random) initial condition $x_0$ as well as the random coefficients $\omega_0, \dots, \omega_n$ which govern the gradient estimates.

A proof can be found in (Gupta et al., 2024, Lemma 15). The second identity, which is not included therein, can be proved analogously. As an application, we prove a stochastic analogue of Lemma 19.

Lemma 21. Assume that $f$ is $L$-smooth and $\mathbb E\big[\|g(x, \omega) - \nabla f(x)\|^2\big] \le \sigma_a^2 + \sigma_m^2\,\|\nabla f(x)\|^2$ for all $x \in \mathbb{R}^d$. Then, if $x, \omega$ are independent random variables, we have
\[ \mathbb E_{(x,\omega)}\big[ f(x - \eta\, g(x,\omega))\big] \le \mathbb E\big[f(x)\big] - \Big(1 - \frac{L(1+\sigma_m^2)\,\eta}{2}\Big)\,\eta\ \mathbb E\big[\|\nabla f(x)\|^2\big] + \frac{L\eta^2\,\sigma_a^2}{2}. \]

Proof. Recall that for any $x, \eta, g$, the following holds by Lemma 19:
\[ f(x - \eta g) \le f(x) - \eta\,\langle \nabla f(x), g\rangle + \frac{L\eta^2}2\,\|g\|^2. \]
We now assume that $g = g(x, \omega)$ is a random estimator for $\nabla f(x)$, where $x$ may be random, but $\omega$ is independent of $x$. Then, by Lemma 20, we have
\[ \mathbb E\big[f(x - \eta g)\big] \le \mathbb E\big[f(x)\big] - \eta\ \mathbb E\big[\|\nabla f(x)\|^2\big] + \frac{L\eta^2}2\,\Big( \mathbb E\big[\|\nabla f(x)\|^2\big] + \mathbb E\big[\|g - \nabla f(x)\|^2\big]\Big) \]
\[ \le \mathbb E\big[f(x)\big] - \Big(1 - \frac{L\eta}2\Big)\,\eta\ \mathbb E\big[\|\nabla f(x)\|^2\big] + \frac{L\eta^2}2\,\Big( \sigma_a^2 + \sigma_m^2\,\mathbb E\big[\|\nabla f(x)\|^2\big]\Big) = \mathbb E\big[f(x)\big] - \Big(1 - \frac{L(1+\sigma_m^2)\,\eta}2\Big)\,\eta\ \mathbb E\big[\|\nabla f(x)\|^2\big] + \frac{L\eta^2\,\sigma_a^2}2. \]

Finally, we provide an auxiliary result to resolve a recursion.

Lemma 22. Assume a sequence $x_n$ satisfies the recursive estimate $x_{n+1} \le a\,x_n + b$ for $a \in (0,1)$ and $b \ge 0$. Then $x_n \le a^n\, x_0 + \frac{b}{1-a}$.

Proof. Consider the sequence $y_n := x_n + \frac{b}{a-1}$. Then
\[ y_{n+1} = x_{n+1} + \frac{b}{a-1} \le a\,x_n + b + \frac{b}{a-1} = a\,x_n + \frac{(a-1)\,b + b}{a-1} = a\,x_n + \frac{a\,b}{a-1} = a\,\Big( x_n + \frac{b}{a-1}\Big) = a\, y_n. \]
In particular, $y_n \le a^n\, y_0$, so
\[ x_n = y_n + \frac{b}{1-a} \le a^n\,\Big( x_0 + \frac{b}{a-1}\Big) + \frac{b}{1-a}. \]
If $b > 0$, the estimate follows since $b/(a-1) < 0$.

We now prove the first main result of this section.

Theorem 13. [Acceleration with additive noise] Assume that $f, P$ are as in Theorem 11 and that $g$ satisfies (8) and (9) with $\sigma_m = 0$. Assume that the sequences $x_n, x^*_n, v_n$ are generated by the scheme (5) for parameters $\eta \le 1/L$ and $\rho = \frac{1-\sqrt{\mu\eta}}{1+\sqrt{\mu\eta}}$, but with the stochastic gradient estimates $g(x^*_n, \omega_n)$ with independent, identically distributed $\omega_n$ in place of $\nabla f(x^*_n)$. Then
\[ \mathbb E\big[f(x_n) - \inf f\big] \le \big(1-\sqrt{\mu\eta}\big)^n\, \Big[ f(x_0) - \inf f + \frac\mu2\,\|x_0 - \pi(x_0)\|^2\Big] + \sigma_a^2\,\sqrt{\frac\eta\mu}. \]

Proof. The proof is mostly identical to that of Theorem 11, with minor modifications. In Step 1, we obtain the estimate
\[ \mathbb E\big[f(x_{n+1})\big] \le \mathbb E\big[f(x^*_n)\big] - \frac\eta2\ \mathbb E\big[\|\nabla f(x^*_n)\|^2\big] + \frac{L\sigma_a^2\,\eta^2}2 \]
by Lemma 21. In the quadratic terms in Steps 2 and 3, we need to take the expectation of $g_n$ rather than $\nabla f(x^*_n)$. By construction, the terms involving $\mathbb E\big[\|\nabla f(x^*_n)\|^2\big]$ still balance with the same parameters, leading to an additional contribution of order $\eta\,\sigma_a^2$ due to the stochastic gradient estimates. Since $\eta \le 1/L$, we can bound $L\eta^2 \le \eta$, so overall we obtain the estimate
\[ L_{n+1} \le (1-\sqrt{\mu\eta})\,L_n + \sigma_a^2\,\eta \]
in place of $L_{n+1} \le (1-\sqrt{\mu\eta})\,L_n$, since all other terms are identical under the expectation by Lemma 20. The claim now follows from Lemma 22. For readers who are looking for a more detailed proof, we note that Theorem 13 is a special case of Theorem 15 with $\sigma_m = 0$, for which we provide a full proof.
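As an illustration of the additive-noise regime (our own sketch, not from the paper's experiments; the Gaussian oracle, the toy objective, and all constants are assumptions), the following code runs the same iteration with a noisy gradient and compares the stationary suboptimality to the noise floor $\sigma_a^2\sqrt{\eta/\mu}$ from Theorem 13.

```python
import numpy as np

# Additive-noise illustration for Theorem 13: the iterates contract at the
# deterministic rate until they reach a noise floor of order sigma_a^2 sqrt(eta/mu).
rng = np.random.default_rng(1)
mu, L, sigma_a = 0.05, 1.0, 0.1
f      = lambda x: 0.5 * mu * x[1] ** 2
grad_f = lambda x: np.array([0.0, mu * x[1]])

eta = 1.0 / L
rho = (1 - np.sqrt(mu * eta)) / (1 + np.sqrt(mu * eta))
floor = sigma_a ** 2 * np.sqrt(eta / mu)

gaps = []
for run in range(200):                               # average over independent runs
    x, v = np.array([1.0, 3.0]), np.zeros(2)
    for n in range(400):
        # unbiased estimate with E|g - grad f|^2 = sigma_a^2, i.e. (8)-(9) with sigma_m = 0
        g = grad_f(x + np.sqrt(eta) * v) + sigma_a / np.sqrt(2) * rng.normal(size=2)
        x_star = x + np.sqrt(eta) * v
        x = x_star - eta * g
        v = rho * (v - np.sqrt(eta) * g)
    gaps.append(f(x))
print(f"E[f(x_400)] ~ {np.mean(gaps):.4f}   vs   theoretical floor {floor:.4f}")
```

On this toy problem the empirical stationary value sits well below the floor, which is an upper bound derived for a whole function class rather than for this particular objective.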
In the same spirit, we sketch the proof of Theorem 14. Let us recall the statement.

Theorem 14. [Additive noise and decreasing step size] Assume that $f, g$ are as in Theorem 13 and that the sequences $x_n, x^*_n, v_n$ are generated by the scheme
\[ x^*_n = x_n + \sqrt{\eta_{n-1}}\, v_n, \qquad x_{n+1} = x^*_n - \eta_n\, g_n, \qquad v_{n+1} = \rho_n\,\big( v_n - \sqrt{\eta_n}\, g_n\big) \]
for parameters
\[ \eta_n = \frac{1}{\mu\,\big(n + \sqrt{L/\mu} + 1\big)^2}, \qquad \rho_n = \frac{1 - \sqrt{\mu\eta_n}}{1 + \sqrt{\mu\eta_n}}. \]
If $\varepsilon \le \sqrt{\mu/\eta_0} = \mu + \sqrt{L\mu}$, then
\[ \mathbb E\big[f(x_n) - \inf f\big] \le \frac{\sqrt{L/\mu}\ \mathbb E\Big[ f(x_0) - \inf f + \frac\mu2\,\|x_0 - \pi(x_0)\|^2\Big] + \frac{\sigma_a^2}{\mu}\,\log\Big(1 + \frac{n}{\sqrt{L/\mu}+1}\Big)}{n + \sqrt{L/\mu}}. \]

Proof. Note that in the proof of Theorem 15, we build on the relationships
\[ \mathbb E\big[f(x_{n+1})\big] \le \mathbb E\Big[f(x^*_n) - \frac{\eta_n}2\,\|\nabla f(x^*_n)\|^2\Big] + \frac{L\sigma_a^2\,\eta_n^2}2, \qquad x^*_{n+1} = x^*_n - \eta_n\, g_n + \alpha_n\, v_{n+1}, \qquad v_{n+1} = \rho_n\,(v_n - \alpha_n\, g_n) \]
(with $\alpha_n = \sqrt{\eta_n}$ in the purely additive setting) and do not enter further into the recursion. We additionally note that if $\eta_n$ is a monotone decreasing sequence, then the sequence $\lambda_{n+1} := (1+\sqrt{\mu\eta_n})^2/(1-\sqrt{\mu\eta_n})$ is also monotone decreasing. Hence, if $\eta_n \le \frac1{L(1+\sigma_m^2)}$ for all $n \in \mathbb N$, then by the same proof as for Theorem 13, the sequence
\[ L_n := \mathbb E\big[f(x_n)\big] - \inf f + \frac12\,\mathbb E\big\|P^\perp v_n + \sqrt\mu\, P^\perp x^*_n\big\|^2 + \frac{\lambda_{n+1}}2\,\mathbb E\big\|Pv_n\big\|^2 \]
satisfies
\[ L_{n+1} \le (1-\sqrt{\mu\eta_n})\,\Big( \mathbb E\big[f(x_n)\big] - \inf f + \frac12\,\mathbb E\big\|P^\perp v_n + \sqrt\mu\, P^\perp x^*_n\big\|^2 + \frac{\lambda_{n+1}}2\,\mathbb E\big\|Pv_n\big\|^2\Big) + \sigma_a^2\,\eta_n \le (1-\sqrt{\mu\eta_n})\, L_n + \sigma_a^2\,\eta_n, \]
using $\lambda_{n+2} \le \lambda_{n+1}$, if the parameters are chosen as in the theorem statement. If specifically $\eta_n = \frac1{\mu\,(n+n_0+1)^2}$ for $n_0 = \sqrt{L/\mu}$, then $1 - \sqrt{\mu\eta_n} = \frac{n+n_0}{n+n_0+1}$ and
\[ L_{n+1} \le \frac{n+n_0}{n+n_0+1}\, L_n + \frac{\sigma_a^2}{\mu\,(n+n_0+1)^2}. \]
Thus the sequence $z_n := (n+n_0)\,L_n$ satisfies the relation
\[ z_{n+1} = (n+n_0+1)\,L_{n+1} \le (n+n_0)\,L_n + \frac{\sigma_a^2}{\mu\,(n+n_0+1)} = z_n + \frac{\sigma_a^2}{\mu\,(n+n_0+1)}, \]
so
\[ z_n = z_0 + \sum_{k=1}^n (z_k - z_{k-1}) \le n_0\,L_0 + \sum_{k=1}^n \frac{\sigma_a^2}{\mu\,(k+n_0+1)} \le n_0\,L_0 + \frac{\sigma_a^2}{\mu} \int_0^n \frac{dt}{n_0 + 1 + t} = n_0\,L_0 + \frac{\sigma_a^2}{\mu}\,\log\Big(1 + \frac{n}{n_0+1}\Big). \]
Overall, we find that
\[ \mathbb E\big[f(x_n) - \inf f\big] \le L_n = \frac{z_n}{n + n_0} \le \frac{n_0\,L_0 + \frac{\sigma_a^2}{\mu}\,\log\big(1 + \frac{n}{n_0+1}\big)}{n + n_0}, \]
which is the claimed bound with $n_0 = \sqrt{L/\mu}$ and $L_0 = \mathbb E\big[ f(x_0) - \inf f + \frac\mu2\,\|x_0 - \pi(x_0)\|^2\big]$.

B.4 PROOF OF THEOREM 15: ACCELERATION WITH BOTH ADDITIVE AND MULTIPLICATIVE NOISE

In the strongly convex setting, Gupta et al. (2024) state a version of Theorem 15 in slightly greater generality in terms of the choice of parameters. The same, more general, proof goes through here as well, once we account for the tangential and normal components of the velocity as in the proof of Theorem 11.

Theorem 15. [Additive and multiplicative noise] Assume that $f, P, \bar x$ are as in Theorem 11 and that $g$ is a family of gradient estimators such that (8) and (9) hold for some $\sigma_a, \sigma_m \ge 0$. Assume that the sequences $x_n, x^*_n, v_n$ are generated by the AGNES scheme (10) with parameters
\[ 0 < \eta \le \frac{1}{L\,(1+\sigma_m^2)}, \qquad \alpha^2 = \frac{\eta}{1+\sigma_m^2}\cdot\frac{1 - \sqrt{\mu\eta\,(1+\sigma_m^2)}}{1 - \sqrt{\mu\eta/(1+\sigma_m^2)}}, \qquad \rho = \frac{1 - \sqrt{\mu\eta/(1+\sigma_m^2)}}{1 + \sqrt{\mu\eta/(1+\sigma_m^2)}}. \]
Then, if $\varepsilon < \sqrt{\mu\,(1+\sigma_m^2)/\eta}$, we have
\[ \mathbb E\big[f(x_n) - \inf f\big] \le \Big( 1 - \sqrt{\tfrac{\mu\eta}{1+\sigma_m^2}}\Big)^n\ \mathbb E\Big[ f(x_0) - \inf f + \frac\mu2\,\|x_0 - \pi(x_0)\|^2\Big] + \frac{\sigma_a^2\,\sqrt\eta}{\sqrt{\mu\,(1+\sigma_m^2)}}. \]
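The following sketch spells out one AGNES step with both noise components. The parameter formulas are our reconstruction from Step 5 of the proof below ($\gamma = \sqrt{\eta/(1+\sigma_m^2)}$, $\alpha$ and $\rho$ as in the statement) and should be checked against the original form of (10); the toy objective and the Gaussian noise model are our own choices.

```python
import numpy as np

# Sketch of the AGNES iteration of Theorem 15 (parameters as we reconstruct them).
rng = np.random.default_rng(2)
mu, L = 0.05, 1.0
sigma_a, sigma_m = 0.05, 1.0
f      = lambda x: 0.5 * mu * x[1] ** 2
grad_f = lambda x: np.array([0.0, mu * x[1]])

eta   = 1.0 / (L * (1 + sigma_m ** 2))
gamma = np.sqrt(eta / (1 + sigma_m ** 2))
alpha = gamma * np.sqrt((1 - np.sqrt(mu * eta * (1 + sigma_m ** 2)))
                        / (1 - np.sqrt(mu * eta / (1 + sigma_m ** 2))))
rho   = (1 - np.sqrt(mu) * gamma) / (1 + np.sqrt(mu) * gamma)

x, v = np.array([1.0, 3.0]), np.zeros(2)
for n in range(2000):
    x_star = x + alpha * v
    gf = grad_f(x_star)
    # oracle with E|g - grad f|^2 = sigma_a^2 + sigma_m^2 |grad f|^2, i.e. (8)-(9)
    g = gf + sigma_a / np.sqrt(2) * rng.normal(size=2) + sigma_m * rng.normal() * gf
    x = x_star - eta * g
    v = rho * (v - alpha * g)
print("f(x_N) =", f(x),
      " vs noise floor ~", sigma_a ** 2 * np.sqrt(eta / (mu * (1 + sigma_m ** 2))))
```

Note that for $\sigma_m = 0$ the formulas collapse to $\alpha = \sqrt\eta$ and $\rho = (1-\sqrt{\mu\eta})/(1+\sqrt{\mu\eta})$, recovering the scheme of Theorems 11 and 13.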
Proof. Setup. Mimicking the proof of Theorem 11, consider the sequence
\[ L_n = \mathbb E\big[f(x_n) - \inf f\big] + \frac12\,\mathbb E\big\| b\,P^\perp v_n + \sqrt\mu\,\big(x^*_n - \pi(x^*_n)\big)\big\|^2 + \frac\lambda2\,\mathbb E\big\|Pv_n\big\|^2 \]
for constants
\[ b = \sqrt{\frac{1+\sigma_m^2}{\eta}}\ \alpha, \qquad \lambda = \frac{(b+\sqrt\mu\,\alpha)^2\,\gamma}{(b - \sqrt\mu\,\alpha)\,\alpha}, \qquad \text{where} \quad \gamma := \sqrt\mu\,\eta + (b - \sqrt\mu\,\alpha)\,\alpha. \]
The constants will be motivated below where they are introduced. Note that we have $\alpha = \sqrt\eta$ if $\sigma_m = 0$, and thus $b = 1$ and $\gamma = \alpha$, recovering the situation of Theorem 13. We want to show that $L_{n+1} \le (1 - \sqrt\mu\,\gamma)\,L_n$ with $\gamma = \alpha/b$. For simplicity, we again assume without loss of generality that $\pi(x) = Px$, i.e. $\bar x = 0$, throughout the proof.

Step 1. Consider the first term first. Note that
\[ \mathbb E\big[f(x_{n+1})\big] = \mathbb E\big[f(x^*_n - \eta\, g_n)\big] \le \mathbb E\big[f(x^*_n)\big] - \Big(1 - \frac{L(1+\sigma_m^2)\,\eta}2\Big)\,\eta\ \mathbb E\big[\|\nabla f(x^*_n)\|^2\big] + \frac{L\eta^2\sigma_a^2}2 \le \mathbb E\big[f(x^*_n)\big] - \frac\eta2\,\mathbb E\big[\|\nabla f(x^*_n)\|^2\big] + \frac{L\eta^2\sigma_a^2}2 \]
if $\eta \le \frac1{L(1+\sigma_m^2)}$, by Lemma 21.

Step 2. We now turn to the second term and use the definition of $x^*_{n+1}$ from (10):
\[ b\,P^\perp v_{n+1} + \sqrt\mu\,P^\perp x^*_{n+1} = b\,P^\perp v_{n+1} + \sqrt\mu\,P^\perp\big(x^*_n + \alpha\,v_{n+1} - \eta\,g_n\big) = (b + \sqrt\mu\,\alpha)\,\rho\,P^\perp(v_n - \alpha\,g_n) + \sqrt\mu\,P^\perp x^*_n - \sqrt\mu\,\eta\,P^\perp g_n \]
\[ = (b - \sqrt\mu\,\alpha)\,P^\perp v_n + \sqrt\mu\,P^\perp x^*_n - \big(\sqrt\mu\,\eta + (b-\sqrt\mu\,\alpha)\,\alpha\big)\,P^\perp g_n, \]
where, in analogy to the proof of Theorem 11, we have $\rho = \frac{b - \sqrt\mu\,\alpha}{b + \sqrt\mu\,\alpha}$. In the deterministic case, where $\alpha = \sqrt\eta$ and $b = 1$, the coefficient $\sqrt\mu\,\eta + (b - \sqrt\mu\,\alpha)\,\alpha$ of $g_n$ simplified to $\sqrt\eta = \alpha$. It does not in this more general setting anymore, so we introduced the new notation $\gamma = \sqrt\mu\,\eta + (b-\sqrt\mu\,\alpha)\,\alpha$. Taking the expectation of the square and using Lemma 20 to replace $g_n$ by $\nabla f(x^*_n)$ in the mixed terms, we find that
\[ \frac12\,\mathbb E\big\|b\,P^\perp v_{n+1} + \sqrt\mu\,P^\perp x^*_{n+1}\big\|^2 = \frac{(b-\sqrt\mu\,\alpha)^2}2\,\mathbb E\|P^\perp v_n\|^2 + \sqrt\mu\,(b-\sqrt\mu\,\alpha)\,\mathbb E\langle P^\perp v_n, P^\perp x^*_n\rangle + \frac\mu2\,\mathbb E\|P^\perp x^*_n\|^2 \]
\[ \qquad - (b-\sqrt\mu\,\alpha)\,\gamma\,\mathbb E\langle \nabla f(x^*_n), P^\perp v_n\rangle - \sqrt\mu\,\gamma\,\mathbb E\langle \nabla f(x^*_n), P^\perp x^*_n\rangle + \frac{\gamma^2}2\,\mathbb E\|P^\perp g_n\|^2. \]

Step 3. We now consider the third term:
\[ \frac\lambda2\,\mathbb E\|Pv_{n+1}\|^2 = \frac{\lambda\rho^2}2\,\mathbb E\big\|P(v_n - \alpha\,g_n)\big\|^2 = \frac{\lambda\rho^2}2\,\Big( \mathbb E\|Pv_n\|^2 + \alpha^2\,\mathbb E\|Pg_n\|^2 - 2\alpha\,\mathbb E\langle \nabla f(x^*_n), Pv_n\rangle\Big). \]

Step 4. We now add the estimates of Steps 2 and 3, with
\[ \lambda = \frac{(b-\sqrt\mu\,\alpha)\,\gamma}{\rho^2\,\alpha} = \frac{(b+\sqrt\mu\,\alpha)^2\,\gamma}{(b-\sqrt\mu\,\alpha)\,\alpha} \]
chosen such that the coefficients $(b-\sqrt\mu\,\alpha)\,\gamma$ of $\mathbb E\langle \nabla f(x^*_n), P^\perp v_n\rangle$ and $\lambda\rho^2\,\alpha$ of $\mathbb E\langle \nabla f(x^*_n), Pv_n\rangle$ coincide. Note that in the deterministic case $\gamma = \alpha = \sqrt\eta$, and we recover the coefficient chosen in the proof of Theorem 11. Next, we combine this estimate with the bound on $\mathbb E[f(x_{n+1})]$ from Step 1 and use the geometric conditions (6) and (7) on $f$ to control the inner products of $\nabla f(x^*_n)$ with $v_n$ and $P^\perp x^*_n$, as well as the variance bound (9) for the gradient estimates: with $x^*_n = x_n + \alpha\, v_n$, condition (7) gives $\alpha\,\langle \nabla f(x^*_n), v_n\rangle \ge f(x^*_n) - f(x_n) - \frac{\varepsilon\alpha^2}2\,\|v_n\|^2$, condition (6) gives $\langle \nabla f(x^*_n), P^\perp x^*_n\rangle \ge f(x^*_n) - \inf f + \frac\mu2\,\|P^\perp x^*_n\|^2$, and $(b-\sqrt\mu\,\alpha)\,\alpha \le \gamma$ allows us to bound the quadratic terms in $g_n$ by $\frac{\gamma^2}2\,\mathbb E\|g_n\|^2 \le \frac{\gamma^2}2\,\big((1+\sigma_m^2)\,\mathbb E\|\nabla f(x^*_n)\|^2 + \sigma_a^2\big)$. This yields
\[ L_{n+1} \le \Big(1 - \frac{(b-\sqrt\mu\,\alpha)\,\gamma}\alpha - \sqrt\mu\,\gamma\Big)\,\mathbb E\big[f(x^*_n)\big] - \big(1 - \sqrt\mu\,\gamma\big)\,\inf f + \frac{(b-\sqrt\mu\,\alpha)\,\gamma}\alpha\,\mathbb E\big[f(x_n)\big] + \frac{(1+\sigma_m^2)\,\gamma^2 - \eta}2\,\mathbb E\|\nabla f(x^*_n)\|^2 + \frac{L\eta^2 + \gamma^2}2\,\sigma_a^2 \]
\[ \qquad + \frac{(b-\sqrt\mu\,\alpha)^2 + (b-\sqrt\mu\,\alpha)\,\varepsilon\gamma\alpha}2\,\mathbb E\|P^\perp v_n\|^2 + \sqrt\mu\,(b-\sqrt\mu\,\alpha)\,\mathbb E\langle P^\perp x^*_n, P^\perp v_n\rangle + \frac{\mu\,(1-\sqrt\mu\,\gamma)}2\,\mathbb E\|P^\perp x^*_n\|^2 + \frac{(b-\sqrt\mu\,\alpha)\,\gamma\,(1+\varepsilon\alpha^2)}{2\alpha}\,\mathbb E\|Pv_n\|^2. \]

Step 5. It remains to choose the constants. We set
\[ \gamma = \frac\alpha b = \sqrt{\frac\eta{1+\sigma_m^2}}, \qquad \text{i.e.} \qquad b = \sqrt{\frac{1+\sigma_m^2}\eta}\ \alpha, \]
so that $(1+\sigma_m^2)\,\gamma^2 - \eta = 0$ and the coefficient of $\mathbb E\|\nabla f(x^*_n)\|^2$ vanishes. The consistency condition $\gamma = \sqrt\mu\,\eta + (b-\sqrt\mu\,\alpha)\,\alpha$ then determines
\[ \alpha^2 = \gamma^2\,\frac{1 - \sqrt{\mu\eta\,(1+\sigma_m^2)}}{1 - \sqrt{\mu\eta/(1+\sigma_m^2)}}, \]
which is the choice made in the statement of the theorem. Since $\alpha = \gamma\,b$, we have
\[ b - \sqrt\mu\,\alpha = b\,(1-\sqrt\mu\,\gamma), \qquad \rho = \frac{b-\sqrt\mu\,\alpha}{b+\sqrt\mu\,\alpha} = \frac{1 - \sqrt{\mu\eta/(1+\sigma_m^2)}}{1 + \sqrt{\mu\eta/(1+\sigma_m^2)}}, \qquad \frac{(b-\sqrt\mu\,\alpha)\,\gamma}\alpha + \sqrt\mu\,\gamma = 1, \]
so the coefficients of $\mathbb E[f(x^*_n)]$ cancel and the zeroth order terms combine to $(1-\sqrt\mu\,\gamma)\,\mathbb E\big[f(x_n) - \inf f\big]$. For the quadratic terms, we use $\varepsilon \le \sqrt\mu/\gamma$ (which holds under the assumption $\varepsilon < \sqrt{\mu(1+\sigma_m^2)/\eta}$) and $\alpha \le \gamma$ to bound
\[ \frac{(b-\sqrt\mu\,\alpha)^2 + (b-\sqrt\mu\,\alpha)\,\varepsilon\gamma\alpha}2 = \frac{b^2\,(1-\sqrt\mu\,\gamma)\,\big(1-\sqrt\mu\,\gamma + \varepsilon\gamma^2\big)}2 \le (1-\sqrt\mu\,\gamma)\,\frac{b^2}2 \]
and
\[ \frac{(b-\sqrt\mu\,\alpha)\,\gamma\,(1+\varepsilon\alpha^2)}{2\alpha} = \frac{\lambda\rho^2\,(1+\varepsilon\alpha^2)}2 \le \frac{\lambda\rho^2\,(1+\sqrt\mu\,\gamma)}2 = \frac\lambda2\,\frac{(1-\sqrt\mu\,\gamma)^2}{1+\sqrt\mu\,\gamma} \le (1-\sqrt\mu\,\gamma)\,\frac\lambda2, \]
while the coefficient of $\mathbb E\langle P^\perp x^*_n, P^\perp v_n\rangle$ equals $\sqrt\mu\,b\,(1-\sqrt\mu\,\gamma)$. Overall, we find that
\[ L_{n+1} \le (1-\sqrt\mu\,\gamma)\,\Big( \mathbb E\big[f(x_n) - \inf f\big] + \frac{b^2}2\,\mathbb E\|P^\perp v_n\|^2 + \sqrt\mu\,b\,\mathbb E\langle P^\perp v_n, P^\perp x^*_n\rangle + \frac\mu2\,\mathbb E\|P^\perp x^*_n\|^2 + \frac\lambda2\,\mathbb E\|Pv_n\|^2\Big) + \frac{L\eta^2+\gamma^2}2\,\sigma_a^2 = (1-\sqrt\mu\,\gamma)\,L_n + \frac{L\eta^2+\gamma^2}2\,\sigma_a^2. \]
By Lemma 22, we deduce
\[ L_n \le (1-\sqrt\mu\,\gamma)^n\, L_0 + \frac{L\eta^2 + \gamma^2}{2\sqrt\mu\,\gamma}\,\sigma_a^2. \]
Since $L(1+\sigma_m^2)\,\eta \le 1$, we have $L\eta^2 \le \eta/(1+\sigma_m^2) = \gamma^2$ and can simplify the noise term to
\[ \frac{L\eta^2 + \gamma^2}{2\sqrt\mu\,\gamma}\,\sigma_a^2 \le \frac\gamma{\sqrt\mu}\,\sigma_a^2 = \frac{\sigma_a^2\,\sqrt\eta}{\sqrt{\mu\,(1+\sigma_m^2)}}. \]

C A BRIEF COMPARISON OF GEOMETRIC CONDITIONS FOR OPTIMIZATION

C.1 DEFINITIONS AND ELEMENTARY PROPERTIES

In this section, we compare some common geometric assumptions in optimization theory. Recall the following notions.

Definition 23. Let $U \subseteq \mathbb{R}^d$ be an open set. We say that a $C^1$-function $f : U \to \mathbb{R}$
1. is $\gamma$-quasar convex if $\mathrm{argmin}\, f \ne \emptyset$ and the inequality $\langle \nabla f(x), x - x^*\rangle \ge \gamma\,\big(f(x) - f(x^*)\big)$ holds for any $x \in U$ and any minimizer $x^* \in \mathrm{argmin}\, f$;
2. is star-convex if it is $1$-quasar convex;
3. is $(\gamma,\mu)$-strongly quasar convex if $\mathrm{argmin}\, f \ne \emptyset$ and $\langle \nabla f(x), x - x^*\rangle \ge \gamma\,\big(f(x) - f(x^*)\big) + \frac\mu2\,\|x - x^*\|^2$;
4. is first order $\mu$-strongly convex if $f(x) \ge f(z) + \langle \nabla f(z), x - z\rangle + \frac\mu2\,\|x-z\|^2$ for all $x, z \in U$;
5. satisfies the first order $\mu$-strong aiming condition if for all $x \in U$ we have
\[ f(\pi(x)) \ge f(x) + \langle \nabla f(x), \pi(x) - x\rangle + \frac\mu2\,\|x - \pi(x)\|^2, \qquad \pi(x) = \mathrm{argmin}\,\big\{ \|x - z\|^2 : f(z) = \inf_{x'\in U} f(x')\big\}.
In particular, we assume that the set of minimizers of $f$ is non-empty and that there exists a unique closest point $\pi(x)$ for all $x \in U$;
6. satisfies a PL condition with PL constant $\mu$ if $\|\nabla f(x)\|^2 \ge 2\mu\,\big(f(x) - \inf f\big)$.

If $U$ is a convex set, the fourth condition is of course equivalent to regular $\mu$-strong convexity.

Lemma 24.
1. If $f$ is first order $\mu$-strongly convex and has a minimizer in $U$, then $f$ satisfies the $\mu$-strong aiming condition.
2. If $f$ satisfies the $\mu$-strong aiming condition, then $f$ satisfies the PL condition with the same constant $\mu$.
3. If $f$ is $\mu$-strongly aiming on $U_\alpha = \{x \in \mathbb{R}^d : f(x) < \alpha\}$, then the line segment connecting $x$ and $\pi(x)$ is contained in $U_\alpha$ for all $x \in U_\alpha$.
4. If $\nabla f$ is $L$-Lipschitz continuous on $U_\alpha$ and $f$ satisfies the PL-inequality with constant $\mu$ on $U_\alpha$, then $\mu \le L$.
5. If $f$ is $(\gamma, \mu)$-strongly quasar convex, then $\mathrm{argmin}\, f$ consists of a single point.
6. If $f$ is $\gamma$-quasar convex and $U$ is star-shaped with respect to the minimizer $x^* \in \mathrm{argmin}\, f$, then all sub-level sets of $f$ are star-shaped with respect to $x^*$. In particular, the set of minimizers is convex.

If $U = \mathbb{R}^d$, then any strongly convex function $f : U \to \mathbb{R}$ has a minimizer in $U$. On general open sets, this is not guaranteed.

Proof. First claim. If $f$ is first order $\mu$-strongly convex and has a minimizer in $U$, then the minimizer $x^*$ is unique since for $x \ne x^*$ we have
\[ f(x) \ge f(x^*) + \langle \nabla f(x^*), x - x^*\rangle + \frac\mu2\,\|x - x^*\|^2 = f(x^*) + \frac\mu2\,\|x-x^*\|^2 > f(x^*). \]
In particular, for every $x$ there exists a unique closest minimizer $\pi(x) = x^*$. The strong aiming condition therefore requires the first order convexity condition only for the pairs $(x, x^*)$ rather than all pairs of points $x, z$.

Second claim. This follows by the same proof that yields the PL condition for strongly convex functions: if $f$ satisfies the $\mu$-strong aiming condition, then
\[ \|\nabla f(x)\|\,\|x - \pi(x)\| \ge \langle \nabla f(x), x - \pi(x)\rangle \ge f(x) - f(\pi(x)) + \frac\mu2\,\|x - \pi(x)\|^2, \]
so, with $\xi := \|x - \pi(x)\|$,
\[ f(x) - \inf f \le \|\nabla f(x)\|\,\xi - \frac\mu2\,\xi^2 \le \max_{\xi' \ge 0}\Big( \|\nabla f(x)\|\,\xi' - \frac\mu2\,\xi'^2\Big) = \frac{\|\nabla f(x)\|^2}{2\mu}. \]
Setting the derivative with respect to $\xi'$ to zero, we find that the maximum is attained at $\xi' = \|\nabla f(x)\|/\mu$. Hence $\|\nabla f(x)\| \ge \sqrt{2\mu\,(f(x) - \inf f)}$, and the PL condition follows by squaring both sides.

Third claim. Let $x_t = (1-t)\,\pi(x) + t\,x$ for $0 \le t \le 1$. First we observe that $\pi(x_t) = \pi(x)$ for every $t \in [0,1]$. Indeed, if there were another minimizer $z$ of $f$ such that $\|x_t - z\| \le \|x_t - \pi(x)\|$, then
\[ \|x - \pi(x)\| = \|x - x_t\| + \|x_t - \pi(x)\| \ge \|x - x_t\| + \|x_t - z\| \ge \|x - z\| \]
since $x$, $x_t$, and $\pi(x)$ all lie on a straight line. As there exists a unique closest point $\pi(x)$ in $M$, we find that $z = \pi(x)$. We note that $x_t - \pi(x) = t\,(x - \pi(x))$ and thus
\[ \frac{d}{dt} f(x_t) = \langle \nabla f(x_t), x - \pi(x)\rangle = \frac1t\,\langle \nabla f(x_t), x_t - \pi(x_t)\rangle \ge 0. \]
In particular, $t \mapsto f(x_t)$ is an increasing function on the set $I_x := \{t > 0 : x_t \in U_\alpha\}$. If $I_x$ had multiple connected components in $(0,1)$, this would only be possible if $f = \alpha$ on their boundaries. But if $f = \alpha$ on the lower boundary of a connected component, then $f \ge \alpha$ on the entire component as $f$ increases, contradicting the definition of $U_\alpha$.

Fourth claim. Since $f$ is continuous, $U_\alpha$ is open. Therefore, $x_t := x - t\,\nabla f(x) \in U_\alpha$ if $t$ is small. If $\nabla f$ is $L$-Lipschitz continuous in $U_\alpha$, then
\[ f(x_t) \le f(x) + \langle \nabla f(x), x_t - x\rangle + \frac L2\,\|x_t - x\|^2 = f(x) - t\,\Big(1 - \frac{Lt}2\Big)\,\|\nabla f(x)\|^2 \]
for all $t \le 1/L$ such that $x_s \in U_\alpha$ for $s \in (0, t)$, by Lemma 19. Since the function $t \mapsto f(x) - t\,(1 - Lt/2)\,\|\nabla f(x)\|^2$ is decreasing on $[0, 1/L]$, we see that $x_t \in U_\alpha$ for $t \in [0, 1/L]$. In particular, we note that
\[ 0 \le f\big(x - \nabla f(x)/L\big) - \inf f \le f(x) - \frac1{2L}\,\|\nabla f(x)\|^2 - \inf f \le f(x) - \inf f - \frac{2\mu}{2L}\,\big(f(x) - \inf f\big) = \Big(1 - \frac\mu L\Big)\,\big(f(x) - \inf f\big), \]
implying the result.

Fifth claim. Assume that $x^*, x^{**} \in \mathrm{argmin}\, f$ with $x^* \ne x^{**}$. Define $x_t = t\,x^{**} + (1-t)\,x^*$. Then
\[ \frac{d}{dt} f(x_t) = \langle \nabla f(x_t), x^{**} - x^*\rangle = \frac1t\,\langle \nabla f(x_t), x_t - x^*\rangle \ge \frac1t\,\Big( \gamma\,\big(f(x_t) - \min f\big) + \frac\mu2\,\|x_t - x^*\|^2\Big) > 0 \]
unless $\|x_t - x^*\|^2 = 0$, i.e.
unless $x_t = x^*$. Thus $f(x_t)$ is strictly monotone increasing on $\{t \in [0,1] : x_t \in U\}$. Since $x^{**} \in U$, this means that $f$ must be strictly increasing on the final segment $(1-\xi, 1]$, at the end of which it reaches the global minimizer $x^{**}$ at $t = 1$, leading to a contradiction.

Sixth claim. This follows by essentially the same argument as the fifth claim: if $x$ is any point, then $x_t = t\,x + (1-t)\,x^* \in U$ since $U$ is star-shaped about $x^*$, and
\[ \frac{d}{dt} f(x_t) = \langle \nabla f(x_t), x - x^*\rangle = \frac1t\,\langle \nabla f(x_t), x_t - x^*\rangle \ge \frac\gamma t\,\big(f(x_t) - f(x^*)\big) \ge 0. \]
In particular, $f$ is increasing along any ray starting at $x^*$, so if $f(x) < \alpha$, then $f(x_t) < \alpha$ for any $t \in [0,1]$. The set of minimizers is star-shaped about every minimizer, hence convex.

C.2 A ONE-DIMENSIONAL EXAMPLE

In the following simple one-dimensional example, we illustrate the hierarchy of geometric conditions between convexity and the PL condition.

Example 25. Let $R > 0$ and $\varepsilon \in (0,1)$. Consider the even function given for $x > 0$ by
\[ f(x) = \frac{1 + \varepsilon\,\sin(2R\log x)}{2}\, x^2, \]
so that
\[ f'(x) = \big(1 + \varepsilon\,\sin(2R\log x) + R\varepsilon\,\cos(2R\log x)\big)\, x \]
and
\[ f''(x) = 1 + \varepsilon\,\sin(2R\log x) + R\varepsilon\,\cos(2R\log x) + 2R\varepsilon\,\cos(2R\log x) - 2R^2\varepsilon\,\sin(2R\log x) = 1 + \varepsilon\,(1 - 2R^2)\,\sin(2R\log x) + 3R\varepsilon\,\cos(2R\log x). \]
We first note that evidently $\frac{1-\varepsilon}2\, x^2 \le f(x) \le \frac{1+\varepsilon}2\, x^2$ for all $x$. In particular, $x = 0$ is the unique global minimizer of $f$.

$L$-smoothness. The function $g(\xi) = A\,\sin\xi + B\,\cos\xi$ attains its maximum when $A\,\cos\xi - B\,\sin\xi = 0$ for $A, B \in \mathbb{R}$, i.e. when $\sin\xi = \frac A{\sqrt{A^2+B^2}}$ and $\cos\xi = \frac B{\sqrt{A^2+B^2}}$. In particular, $\max_\xi g(\xi) = \sqrt{A^2+B^2}$, so
\[ |f''(x)| \le 1 + \varepsilon\,\sqrt{(1-2R^2)^2 + (3R)^2} = 1 + \varepsilon\,\sqrt{1 + 5R^2 + 4R^4}. \]
The second derivative is discontinuous at $x = 0$, but bounded on $\mathbb{R}\setminus\{0\}$ by $L = 1 + \varepsilon\,\sqrt{1+5R^2+4R^4}$, i.e. $f$ is $L$-smooth.

Convexity. By the same consideration, we see that $f$ is convex if and only if $f'' \ge 0$, i.e. if and only if $\varepsilon\,\sqrt{1+5R^2+4R^4} \le 1$. If the inequality is strict, $f$ is strongly convex with constant $\mu = 1 - \varepsilon\,\sqrt{1+5R^2+4R^4}$.

PL inequality. By the same argument as above, we see that
\[ |f'(x)|^2 = \big(1 + \varepsilon\,\sin(2R\log x) + R\varepsilon\,\cos(2R\log x)\big)^2\, x^2 \ge \big(1 - \varepsilon\,\sqrt{1+R^2}\big)^2\, x^2 \qquad \text{if } \varepsilon\,\sqrt{1+R^2} < 1. \]
In particular, if $\varepsilon\,\sqrt{1+R^2} < 1$, then
\[ |f'(x)|^2 \ge \big(1 - \varepsilon\,\sqrt{1+R^2}\big)^2\, x^2 \ge 2\,\frac{\big(1-\varepsilon\,\sqrt{1+R^2}\big)^2}{1+\varepsilon}\, f(x), \]
i.e. $f$ satisfies the PL condition with constant $\mu_{\mathrm{PL}} = \big(1-\varepsilon\,\sqrt{1+R^2}\big)^2/(1+\varepsilon)$.

An infinite number of local minimizers clusters at the origin. If $\varepsilon\,\sqrt{1+R^2} > 1$, on the other hand, then by the same argument $f'$ changes sign an infinite number of times in any neighborhood of the origin, since $\lim_{x\to0^+}\log x = -\infty$.

$\mu$-strong aiming condition. Note that
\[ x\,f'(x) - f(x) = \Big( \frac12 + \frac\varepsilon2\,\sin(2R\log x) + R\varepsilon\,\cos(2R\log x)\Big)\, x^2 \ge \Big( \frac12 - \sqrt{(\varepsilon/2)^2 + (R\varepsilon)^2}\,\Big)\, x^2 = \frac{1 - \varepsilon\,\sqrt{1+4R^2}}2\, x^2. \]
In particular, $f$ is $\mu$-strongly aiming (with respect to the unique global minimizer) with $\mu = 1 - \varepsilon\,\sqrt{1+4R^2}$ if $\varepsilon\,\sqrt{1+4R^2} < 1$, and it fails to be $\mu$-strongly aiming for any $\mu > 0$ otherwise.

Quasar-convexity. Since the minimizer is unique, $(\gamma,\mu)$-strong quasar convexity is a strictly more general notion than the strong aiming condition, as we have an additional parameter $\gamma$ to relax the requirements. Essentially the same calculation reads
\[ x\,f'(x) - \gamma\, f(x) = \Big( 1 - \frac\gamma2 + \varepsilon\,\Big( \big(1 - \frac\gamma2\big)\,\sin(2R\log x) + R\,\cos(2R\log x)\Big)\Big)\, x^2 \ge \Big(1 - \frac\gamma2\Big)\,\Big( 1 - \varepsilon\,\sqrt{1 + \Big(\frac R{1-\gamma/2}\Big)^2}\,\Big)\, x^2. \]
In particular, $f$ is $\gamma$-quasar convex if
\[ \varepsilon\,\sqrt{1 + \Big(\frac R{1-\gamma/2}\Big)^2} \le 1. \]
Such a $\gamma$ can be found if and only if $R^2 < \frac{1-\varepsilon^2}{\varepsilon^2}$. Choosing $\gamma$ slightly smaller, it is then also always possible to make $f$ $(\gamma', \mu')$-strongly quasar convex for some $\gamma' \in (0, \gamma)$ and $\mu' > 0$.

[Figure 4: We compare the trajectories of gradient descent and Nesterov's algorithm for the objective function $f$ in Example 25 with $R = 6$ and $\varepsilon = 0.075$ (left), $\varepsilon = 0.08$ (middle) and $\varepsilon = 0.085$ (right). Evidently, if $\varepsilon\,\sqrt{1+4R^2}$ is very close to the threshold value 1, gradient descent outperforms Nesterov's algorithm with the theoretically guaranteed parameters.]
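The comparison in Figure 4 is easy to reproduce in a few lines. The sketch below is our own implementation of the experiment (initialization, iteration count, and the handling of $x = 0$ are arbitrary choices); it runs gradient descent and the scheme (5) on the objective of Example 25 with the theoretically guaranteed parameters $\eta = 1/L$ and $\rho = (1-\sqrt{\mu\eta})/(1+\sqrt{\mu\eta})$, where $\mu$ is the strong aiming constant computed above.

```python
import numpy as np

# Reproduction sketch for the experiment behind Figure 4 (our own implementation).
R, eps = 6.0, 0.08

def f(x):
    return 0.5 * (1 + eps * np.sin(2 * R * np.log(abs(x)))) * x * x if x != 0 else 0.0

def df(x):
    if x == 0:
        return 0.0
    t = 2 * R * np.log(abs(x))
    return (1 + eps * np.sin(t) + R * eps * np.cos(t)) * x

L  = 1 + eps * np.sqrt(1 + 5 * R**2 + 4 * R**4)   # smoothness constant of Example 25
mu = 1 - eps * np.sqrt(1 + 4 * R**2)              # strong aiming constant of Example 25
eta = 1 / L
rho = (1 - np.sqrt(mu * eta)) / (1 + np.sqrt(mu * eta))

x_gd = 1.0                                        # gradient descent iterate
x, v = 1.0, 0.0                                   # Nesterov iterates, scheme (5)
for n in range(500):
    x_gd -= eta * df(x_gd)
    x_star = x + np.sqrt(eta) * v
    g = df(x_star)
    x, v = x_star - eta * g, rho * (v - np.sqrt(eta) * g)
print(f"GD: f = {f(x_gd):.3e}   NAG: f = {f(x):.3e}")
```

Varying $\varepsilon$ near the threshold $\varepsilon\sqrt{1+4R^2} = 1$ reproduces the qualitative behavior described in the caption: as $\mu \to 0$, the guaranteed momentum parameters degrade and plain gradient descent can win.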
Relationship between the conditions. We quickly summarize the parameter ranges for which the function $f$ satisfies the various geometric conditions:

                   PL condition              µ-strongly aiming      µ-strongly convex
  must be < 1:     ε√(1+R²)                  ε√(1+4R²)              ε√(1+5R²+4R⁴)
  constant:        (1−ε√(1+R²))²/(1+ε)       1−ε√(1+4R²)            1−ε√(1+5R²+4R⁴)

Evidently, classical strong convexity implies the strong aiming condition with respect to the global minimizer, which in turn implies the PL condition. The parameter ranges and constants are generally vastly different. All estimates, except for the PL constant, are sharp. In particular, the one-dimensional example demonstrates that the strong aiming condition is strictly weaker than convexity along the line segment connecting $x$ to $\pi(x)$. In the case of quadratic objective functions or functions modelled on them (such as the squared distance function from a manifold), the PL constant is essentially the same as the parameter of strong convexity. For the function $f$ in Example 25, on the other hand, the PL constant is noticeably larger than the parameter of strong aiming with respect to the unique minimizer. In Figure 4, we observe that with the parameters $\eta = 1/L$ and $\rho = (1-\sqrt{\mu\eta})/(1+\sqrt{\mu\eta})$, gradient descent may at times converge faster, at least with the parameter choice that is derived over a large function class rather than for an individual objective function.

C.3 DEEP LEARNING

While some notions of a good geometry are weaker than others (for instance, strong convexity implies the PL condition, but not vice versa), they all share an important common feature: if $x$ is a critical point of $f$, then it is a global minimizer. Namely, the PL inequality implies that $\|\nabla f(x)\|^2 > 0$ unless $f(x) = \inf f$, and $\gamma$-quasar convexity implies that $\langle \nabla f(x), x - x^*\rangle > 0$ unless $f(x) = f(x^*) = \inf f$, so that $\nabla f(x) \ne 0$ at any point which is not a global minimizer.

In deep learning applications, critical points are guaranteed to occur under very general circumstances: if
\[ f(\beta, a, W, b;\, x) = \beta + \sum_{i=1}^n a_i\,\sigma\big(w_i^T x + b_i\big) \]
is a neural network with a single hidden layer and a $C^1$-activation function satisfying $\sigma(0) = 0$ (e.g. tanh), then the loss function
\[ \mathcal L(\beta, a, W, b) = \frac1N \sum_{j=1}^N \big( f(\beta, a, W, b;\, x_j) - y_j\big)^2 \]
satisfies $\nabla \mathcal L(\beta, a, W, b) = 0$ for $a = b = 0 \in \mathbb{R}^n$, $W = 0 \in \mathbb{R}^{n\times d}$ and $\beta = \frac1N\sum_{j=1}^N y_j$. The same is true if a row $w_i$ of $W$ is merely orthogonal to all data points $x_j$ but does not vanish. For deeper networks, the set of critical points becomes larger: as long as the parameters of two layers are all zero, the remaining layers can be chosen arbitrarily. If more than two layers are all zero, then also the second parameter derivative vanishes. In particular, the critical point for which all parameters are zero cannot be a strict saddle point. While there are guarantees that individual algorithms escape certain types of critical points almost surely (e.g. strict saddles, see Lee et al. (2019); O'Neill & Wright (2019)), they may take very long to do so (Du et al., 2017). The analysis of accelerated rates becomes asymptotic at best.

We claim that our notion of the strong aiming condition suffers from the same optimism globally, but locally captures two important features of deep learning landscapes close to the set of global minimizers which are not captured by concepts that require a geometric condition with respect to all minimizers: a manifold along which we can move tangentially, and convexity in the directions which are perpendicular to the manifold.
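The critical point described above is easy to verify numerically. The following sketch (our own; the data, network width, and finite-difference step are arbitrary choices) evaluates a central finite-difference gradient of the loss at $a = b = 0$, $W = 0$, $\beta = \frac1N\sum_j y_j$ for a one-hidden-layer tanh network and confirms that it vanishes up to floating-point error.

```python
import numpy as np

# Numerical check of the critical-point claim for a one-hidden-layer tanh network
# with mean-squared loss: a = b = 0, W = 0, beta = mean(y) is a critical point.
rng = np.random.default_rng(3)
N, d, n = 20, 3, 5                     # data points, input dimension, hidden width
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def loss(theta):
    beta = theta[0]
    a = theta[1:1 + n]
    W = theta[1 + n:1 + n + n * d].reshape(n, d)
    b = theta[1 + n + n * d:]
    pred = beta + np.tanh(X @ W.T + b) @ a
    return np.mean((pred - y) ** 2)

theta0 = np.zeros(1 + n + n * d + n)
theta0[0] = y.mean()                   # beta = average of the labels

h = 1e-6                               # central finite-difference gradient at theta0
grad = np.zeros_like(theta0)
for i in range(theta0.size):
    e = np.zeros_like(theta0); e[i] = h
    grad[i] = (loss(theta0 + e) - loss(theta0 - e)) / (2 * h)
print("max |dL/dtheta| at the candidate critical point:", np.abs(grad).max())
```

Since $\sigma(0) = 0$ and $a = 0$, every partial derivative vanishes exactly, so the printed value is zero up to finite-difference rounding.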