# Non-Uniform Smoothness for Gradient Descent

Published in Transactions on Machine Learning Research (02/2024)

**Albert S. Berahas** (albertberahas@gmail.com), Department of Industrial and Operations Engineering, University of Michigan
**Lindon Roberts** (lindon.roberts@sydney.edu.au), School of Mathematics and Statistics, University of Sydney
**Fred Roosta** (fred.roosta@uq.edu.au), School of Mathematics and Physics, University of Queensland; ARC Training Centre for Information Resilience

Reviewed on OpenReview: https://openreview.net/forum?id=17ESEjETbP

**Abstract.** The analysis of gradient descent-type methods typically relies on the Lipschitz continuity of the objective gradient. This generally requires an expensive hyperparameter tuning process to appropriately calibrate a stepsize for a given problem. In this work we introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradient smoothness condition and is applicable to any twice-differentiable function. We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method, and give global and local convergence results. We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima.

## 1 Introduction

In this work, we consider gradient descent-type algorithms for solving unconstrained optimization problems
$$\min_{x \in \mathbb{R}^d} f(x), \tag{1.1}$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable. Gradient descent-type methods for (1.1) have seen renewed interest arising from applications in data science, where $d$ is potentially very large (Bottou et al., 2018; Wright, 2018). When analyzing such methods with a fixed stepsize, it is most commonly assumed that $\nabla f$ is $L_f$-Lipschitz continuous, with a stepsize of order $1/L_f$, e.g. (Wright, 2018, Section 4).
The key reason behind this smoothness assumption is the resulting bound
$$|f(y) - f(x) - \nabla f(x)^T (y-x)| \leq \frac{L_f}{2}\|y-x\|_2^2, \qquad \forall x, y \in \mathbb{R}^d.$$
In practice, however, $L_f$ is usually not known. The simplest approach in this situation is to try several stepsize choices as part of a (possibly automated) hyperparameter tuning process (Snoek et al., 2012). That said, there are often benefits to changing the stepsize at each iteration, with common approaches including linesearch methods (Nocedal & Wright, 2006), Polyak's stepsize for convex functions (Polyak, 1969) (although this requires knowledge of the global minimum of $f$) and the Barzilai-Borwein stepsize (Barzilai & Borwein, 1988) (although global convergence theory exists only for strictly convex quadratic functions, and it does not converge for some (non-quadratic) strongly convex functions without modification (Burdakov et al., 2019)).

In the setting where $L_f$ is known, this approach is still overly pessimistic, as the stepsize is chosen to be sufficiently small to ensure decrease throughout the domain, since it reflects the global smoothness of $f$. In an iterative method it suffices to ensure decrease from one iterate to the next, and so of greater relevance is the local smoothness of $f$ in a neighborhood of the current iterate.

There are two natural ways that local smoothness could be used to modify the stepsizes in gradient descent. First, we could dynamically estimate $L_f$ by looking at how rapidly $\nabla f$ is changing near the current iterate. This approach is followed in (Malitsky & Mishchenko, 2020), where $L_f$ is estimated based on $\|\nabla f(x_k) - \nabla f(x_{k-1})\|_2 / \|x_k - x_{k-1}\|_2$. This estimation procedure allows for global convergence of a hyperparameter-free variant of gradient descent for convex functions.
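To make the local-estimation idea concrete, here is a minimal sketch (our own illustration, not code from any of the cited papers) of the gradient-difference estimate used by Malitsky & Mishchenko (2020), on a function whose gradient is not globally Lipschitz; the function and evaluation points are hypothetical.

```python
# Illustrative sketch: estimating a *local* Lipschitz constant of f' from two
# successive iterates, in the spirit of Malitsky & Mishchenko (2020).
def grad(x):
    return 4.0 * x ** 3          # f(x) = x**4, so f' is NOT globally Lipschitz

x_prev, x_curr = 1.0, 0.9
L_est = abs(grad(x_curr) - grad(x_prev)) / abs(x_curr - x_prev)
# By the mean value theorem, L_est = 12*xi**2 for some xi in (0.9, 1.0),
# so it lies between 9.72 and 12 even though no global constant L_f exists.
print(L_est)
```

The point of the sketch is that the local estimate is finite and informative near the current iterate, even when no global smoothness constant exists.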
An alternative approach for convex problems is to estimate the distance from the starting point to the minimizer, $\|x_0 - x^*\|$, and use this to construct hyperparameter-free methods (Carmon & Hinder, 2022; Defazio & Mishchenko, 2023; Mishchenko & Defazio, 2023). Alternatively, we could assume the existence of an oracle that gives us a suitable local Lipschitz constant for $\nabla f$ (Mei et al., 2021). There, it is assumed that one has access to an oracle $L(x)$ that satisfies
$$|f(y) - f(x) - \nabla f(x)^T (y-x)| \leq \frac{L(x)}{2}\|y-x\|_2^2, \qquad \forall y \in \mathbb{R}^d. \tag{1.2}$$
Although some useful properties arise when allowing a stepsize of $1/L(x_k)$ for gradient descent (at current iterate $x_k$), for example faster convergence than gradient descent with fixed stepsizes, the oracle $L(x)$ has limitations. For example, if we want to exploit extremely flat regions around minima, we may want $L(x) \to 0$ as $x \to x^*$, but for (1.2) to hold with $L(x^*) = 0$ requires $f$ to be a constant function.

In a separate line of work, (Zhang et al., 2020) considers gradient clipping methods applied to functions satisfying a smoothness condition of the form $\|\nabla^2 f(x)\|_2 \leq L_0 + L_1\|\nabla f(x)\|_2$, which was empirically motivated by training deep neural networks and also applies to some distributionally robust optimization problems (Jin et al., 2021). This condition was slightly relaxed in (Xie et al., 2023) to the generalized Lipschitz condition
$$\|\nabla f(y) - \nabla f(x)\|_2 \leq (L_0 + L_1\|\nabla f(x)\|_2)\|y-x\|_2, \qquad \forall x \in \mathbb{R}^d,\ y \in B(x, 1/L_1), \tag{1.3}$$
noting in particular that the inequality only needs to hold for $x$ and $y$ sufficiently close.

In this work, we consider a more general and localized form of the approach from (Mei et al., 2021), where our oracle only requires bounds similar to (1.2) to hold for $y \in B(x, R)$ for some $R > 0$. Such a local first-order smoothness oracle (LFSO) gives a local Lipschitz constant over an explicitly defined neighborhood.
Such LFSOs exist for any $C^2$ function (not just those with Lipschitz continuous gradients), including those satisfying the weaker smoothness condition from (Zhang et al., 2020; Jin et al., 2021) mentioned above. Given the availability of an LFSO, we show how it can be used to construct a convergent gradient descent-type iteration that does not require any hyperparameter tuning. Similar to the case of $1/L_f$ stepsizes, this parameter-free result comes from the LFSO capturing all the relevant problem information.

To show the promise of this approach, we demonstrate in Section 4 how the incorporation of an LFSO allows gradient descent to achieve a global linear convergence rate to arbitrarily flat minima of convex functions. We do this by considering a class of objective functions which includes $f(x) = \|x\|_2^{2p}$ for $p \in \mathbb{N}$, where $f(x)$ is extremely flat near the minimizer $x^* = 0$. This is achieved by the LFSO allowing larger stepsizes as $x \to x^*$, counteracting the degeneracy of the minimizer. These functions are known to be difficult for gradient descent iterations, and in fact provide the worst-case lower bounds for accelerated gradient descent (Attouch et al., 2018; 2022).¹ Further, in Section 5 we show that our LFSO approach gives global linear convergence rates for objectives of the form $f(x) = \|x\|_{2p}^{2p}$ for $p \in \mathbb{N}$, an alternative set of objectives with very degenerate minimizers. Our numerical results in Section 6 show the global linear rate achieved in practice by the LFSO approach for both types of objective functions, compared to the (expected) sublinear rate for standard gradient descent. We note that the objective $f(x) = \|x\|_{2p}^{2p}$ can alternatively be minimized in 1 iteration using iteratively reweighted least-squares (IRLS). IRLS only has local convergence guarantees for problems of the form $f(x) = \|Ax-b\|_{2p}^{2p}$ when $p < 3/2$, and can diverge for larger $p$ without careful modification (Adil et al., 2019).
¹ Of course, the LFSO requires more problem information than these methods to achieve this result.

Compared to the similar methods mentioned above, the LFSO approach is the only one with global and local convergence theory for general $C^2$ nonconvex objectives. By comparison, (Malitsky & Mishchenko, 2020; Carmon & Hinder, 2022; Defazio & Mishchenko, 2023; Mishchenko & Defazio, 2023) only have global convergence results for convex objectives,² and (Mei et al., 2021) only considers global convergence for nonconvex functions in the presence of a specific non-uniform Łojasiewicz condition. Our numerical results for specific convex functions in Section 6 show that the method from (Malitsky & Mishchenko, 2020) gives fast global linear convergence on the same problems, but this rate is not proven. We also note that the LFSO approach is potentially more suited for adaptation to stochastic optimization than (Malitsky & Mishchenko, 2020), as it does not require taking differences of (stochastic) gradients.

In the general nonconvex case with Lipschitz continuous gradients, the LFSO method recovers the standard $O(\epsilon^{-2})$ iteration complexity required to find an $\epsilon$-optimal point. The per-iteration cost is one gradient evaluation and up to two LFSO evaluations. In terms of scalability, the practicality of this approach depends on how efficient an LFSO evaluation is; identifying general settings where LFSOs are easily computable is an important direction for future work. However, for all the example problems considered in this work, the computational cost of evaluating the LFSO is similar to (or cheaper than) the cost of a single gradient evaluation. Our approach also avoids the need for any expensive tuning of learning rates (and in some cases achieves global linear convergence rates).
This introduction and first analysis of LFSO methods demonstrates them to be a promising avenue for future work, by giving an explicit mechanism for decoupling stepsize selection from objective smoothness (and hence avoiding hyperparameter tuning), and in some specific circumstances being able to adapt to extremely flat regions of the objective function. The LFSO approach requires a more complex oracle and a choice of initial radii $R_k$ (for the examples in Sections 4 and 5, $R_k = \|\nabla f(x_k)\|_2$ is used), and so an important starting point for future work would be the efficient construction of such oracles (and indeed it would be useful to know when such oracles are easy to compute), and developing a systematic understanding of suitable values for $R_k$. This future work would enable a thorough numerical evaluation of the LFSO approach on more realistic problems.

## 2 Algorithmic Framework

We now introduce our new oracle and present our algorithmic framework, which incorporates the LFSO into a gradient descent-type method.

**Definition 2.1.** Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable. A function $L : \mathbb{R}^d \times (0,\infty) \to [0,\infty)$ is a local first-order smoothness oracle (LFSO) for $f$ if
$$|f(y) - f(x) - \nabla f(x)^T(y-x)| \leq \frac{L(x,R)}{2}\|y-x\|^2, \tag{2.1}$$
for all $x \in \mathbb{R}^d$, all $y \in B(x,R)$ and all $R > 0$, and $L$ is non-decreasing in its second argument $R$.

For example, if $f$ is $L_f$-smooth, i.e., $\nabla f$ is Lipschitz continuous with constant $L_f$, then $L(x,R) = L_f$ defines an LFSO for $f$. Treating this as an oracle rather than a problem constant encodes the common algorithmic assumption that $L_f$ is known. However, LFSOs exist for a much broader class than $L_f$-smooth functions.

**Lemma 2.2.** If $f$ is $C^2(\mathbb{R}^d)$, then $L(x,R) = \max_{y \in B(x,R)} \|\nabla^2 f(y)\|$ is an LFSO for $f$.

*Proof.* It is straightforward that $L$ is non-decreasing in $R$. In this case, the property (2.1) is well-known, for instance (Cartis et al., 2022, Corollary A.8.4 & Theorem A.8.5). □

Another common smoothness class is the case of $C^2$ functions with Lipschitz continuous Hessians.
In this case, Lemma 2.2 gives us a simple, explicit form for an LFSO.

² Furthermore, (Carmon & Hinder, 2022) requires uniformly bounded (stochastic) (sub)gradients, and (Defazio & Mishchenko, 2023; Mishchenko & Defazio, 2023) require $f$ to be Lipschitz continuous.

**Lemma 2.3.** If $f$ is $C^2(\mathbb{R}^d)$ and $\nabla^2 f$ is $L_H$-Lipschitz continuous, then $L(x,R) = \|\nabla^2 f(x)\| + L_H R$ is an LFSO for $f$.

*Proof.* This follows from Lemma 2.2 and the fact that $\|\nabla^2 f(y)\| \leq \|\nabla^2 f(x)\| + L_H\|x-y\|$ for all $y \in B(x,R)$. □

A useful property that we will use to get explicit formulae for LFSOs is the following straightforward consequence of (2.1).

**Lemma 2.4.** Suppose $L$ is an LFSO for $f$ and $\widetilde{L}(x,R) \geq L(x,R)$ for all $x \in \mathbb{R}^d$ and all $R > 0$. Then $\widetilde{L}$ is an LFSO for $f$ if and only if $\widetilde{L}$ is non-decreasing in $R$.

To incorporate an LFSO into a gradient descent-like iteration, a natural step would be to consider an iteration of the form
$$x_{k+1} = x_k - \frac{\eta}{L(x_k, R_k)}\nabla f(x_k), \tag{2.2}$$
for some appropriately chosen values of $R_k > 0$ and $\eta > 0$. In the case where $f$ is $L_f$-smooth, standard convergence theory (e.g., (Wright, 2018, Theorem 4.2.1)) would suggest that a value such as $\eta = 1$ is appropriate. This shows that the LFSO notion encapsulates the problem-specific aspect of choosing a suitable stepsize. However, the choice of $R_k$ is not so straightforward: for property (2.1) to be useful (typically to prove that $f(x_{k+1}) < f(x_k)$), we would need $x_{k+1} \in B(x_k, R_k)$. This is not guaranteed if $R_k$ is chosen too small, as the following example illustrates.

**Example 2.5.** If $d = 1$ and $f(x) = x^4$, then a suitable LFSO is $L(x,R) = 24x^2 + 24R^2$, using Lemmas 2.2 and 2.4, as well as
$$\max_{|y-x| \leq R} 12y^2 = 12\max\{(x-R)^2, (x+R)^2\} \leq 12(2x^2 + 2R^2),$$
from Young's inequality. Suppose $x_k = 1$ and $\eta = 1$; then from (2.2) we may compute
$$|x_{k+1} - x_k| = \frac{1}{6 + 6R_k^2},$$
and so $x_{k+1} \notin B(x_k, R_k)$ whenever $R_k$ is smaller than $R \approx 0.16238$, the unique real root of $p(R) = 6R^3 + 6R - 1$.
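As a sanity check on Example 2.5, the LFSO bound can be verified numerically on a grid; this is an illustrative check we added, not part of the paper's development.

```python
# Numerical check (illustrative) that L(x, R) = 24*x**2 + 24*R**2 from
# Example 2.5 satisfies the LFSO property (2.1) for f(x) = x**4.
def f(x): return x ** 4
def df(x): return 4.0 * x ** 3
def lfso(x, R): return 24.0 * x ** 2 + 24.0 * R ** 2

ok = True
for i in range(-20, 21):
    x = i / 5.0                           # grid of centers x in [-4, 4]
    for R in (0.1, 1.0, 5.0):
        for j in range(-10, 11):
            y = x + R * j / 10.0          # all y with |y - x| <= R
            lhs = abs(f(y) - f(x) - df(x) * (y - x))
            if lhs > lfso(x, R) / 2.0 * (y - x) ** 2 + 1e-6:
                ok = False
print(ok)  # True: the Taylor-remainder bound holds on the whole grid
```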
To avoid this issue, we first pick an arbitrary value for $R_k$, then possibly increase it to a sufficiently large value that $x_{k+1}$ is in the required neighborhood of $x_k$. The resulting method is given in Algorithm 1. Note that it requires one evaluation of $\nabla f$ and possibly two evaluations of $L$ (oracle calls) per iteration.

**Algorithm 1** Gradient Descent with LFSO.

Input: starting point $x_0 \in \mathbb{R}^d$, stepsize factor $\eta > 0$.

1: for $k = 0, 1, 2, \ldots$ do
2: Choose any $R_k > 0$.
3: Set
$$\widetilde{R}_k := \max\left\{R_k,\ \frac{\eta}{L(x_k, R_k)}\|\nabla f(x_k)\|\right\}. \tag{2.3}$$
4: Set
$$x_{k+1} = x_k - \frac{\eta}{L(x_k, \widetilde{R}_k)}\nabla f(x_k). \tag{2.4}$$
5: end for

**Computational Cost of LFSO Evaluation.** In practice, the computational cost of evaluating $L$ (performed potentially twice per iteration of Algorithm 1) will depend on the specific problem being analyzed. However, for the examples in this work the cost of one LFSO evaluation is small, of the same order as a single gradient evaluation (or potentially cheaper if the gradient has already been evaluated). In Section 4, where $f(x) = h(g(x))$, the dominant cost of the LFSO (4.2) is the evaluation of $\|\nabla g(x)\|$, but $\nabla g(x)$ has already been evaluated when originally computing $\nabla f(x)$. In Section 5, for linear regression problems, provided quantities depending only on the problem parameters $A$ have been pre-computed, the LFSO (5.3) is dominated by one residual calculation $Ax - b$. This residual is computed when evaluating the gradient, so the LFSO cost is dominated by evaluating $\|Ax-b\|$ when $Ax-b$ is already known, of cost $O(n)$ for a problem with $n$ residuals. This is cheaper than the cost of a gradient evaluation, which also requires a matrix-vector product not present in the LFSO evaluation.
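A minimal one-dimensional sketch of Algorithm 1 (our own illustration, assuming the setting of Example 2.5 with $f(x) = x^4$, its LFSO $L(x,R) = 24x^2 + 24R^2$, and the choice $R_k = \|\nabla f(x_k)\|$ used later in Sections 4 and 5):

```python
# Sketch of Algorithm 1 (gradient descent with an LFSO) in 1-D, for
# f(x) = x**4 with the LFSO of Example 2.5. eta = 1 is the untuned factor.
def df(x): return 4.0 * x ** 3
def lfso(x, R): return 24.0 * x ** 2 + 24.0 * R ** 2

def lfso_gd(x0, eta=1.0, iters=200):
    x = x0
    for _ in range(iters):
        g = df(x)
        R = abs(g)                                   # step 2: choose any R_k > 0
        R_tilde = max(R, eta / lfso(x, R) * abs(g))  # step 3: enlarge R_k, cf. (2.3)
        x = x - eta / lfso(x, R_tilde) * g           # step 4: the update (2.4)
    return x

x_final = lfso_gd(1.0)
print(abs(x_final))  # drives x towards the degenerate minimizer x* = 0
```

Note that near $x^* = 0$ the effective step here is roughly $x/6$, so the iterates contract by a constant factor per iteration; this is the mechanism behind the linear rates of Sections 4 and 5.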
In Appendix A, we also derive an LFSO for 2-class logistic regression and show the same cost as Section 5: the dominant cost of evaluating the LFSO is an extra $O(n)$ for training data of length $n$ (beyond information already computed in a gradient evaluation), while the gradient evaluation requires an extra matrix-vector product not present in the LFSO evaluation.

## 3 Convergence Analysis

In this section we provide global convergence results for the general nonconvex and PL/strongly convex cases, and also local convergence rates to non-degenerate local minima. To enable these, we prove a descent lemma (Lemma 3.2) suitable for Algorithm 1.

**Assumption 3.1.** The function $f : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and bounded below by $f^*$, and $L$ is an LFSO for $f$.

**Lemma 3.2.** If Assumption 3.1 holds, then Algorithm 1 produces iterates satisfying
$$f(x_{k+1}) \leq f(x_k) - \frac{\eta(2-\eta)}{2L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\|^2, \tag{3.1}$$
for all $k = 0, 1, \ldots$.

*Proof.* Since $\widetilde{R}_k \geq R_k$ and $L$ is non-decreasing in $R$, we have that
$$\|x_{k+1} - x_k\| = \frac{\eta}{L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\| \leq \frac{\eta}{L(x_k, R_k)}\|\nabla f(x_k)\| \leq \widetilde{R}_k, \tag{3.2}$$
and so (2.1) can be used. That is,
$$f(x_{k+1}) \leq f(x_k) - \frac{\eta}{L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\|^2 + \frac{L(x_k, \widetilde{R}_k)}{2}\left(\frac{\eta}{L(x_k, \widetilde{R}_k)}\right)^2\|\nabla f(x_k)\|^2,$$
which gives the desired result. □

Since the LFSO captures all the necessary problem information, the requirements on the stepsize factor $\eta$ are straightforward.

### 3.1 Global Convergence

**Theorem 3.3.** If Assumption 3.1 holds and $0 < \eta < 2$, then either $\liminf_{k\to\infty} \|\nabla f(x_k)\| = 0$, or $\lim_{k\to\infty} L(x_k, \widetilde{R}_k) = \infty$.

*Proof.* From Lemma 3.2, we have
$$f(x_k) - f(x_{k+1}) \geq \frac{\eta(2-\eta)}{2L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\|^2. \tag{3.3}$$
By summing over $k$ we get
$$\sum_{k=0}^{\infty} \frac{\|\nabla f(x_k)\|^2}{L(x_k, \widetilde{R}_k)} \leq \frac{2[f(x_0) - f^*]}{\eta(2-\eta)} < \infty, \tag{3.4}$$
hence $\lim_{k\to\infty} \|\nabla f(x_k)\|^2 / L(x_k, \widetilde{R}_k) = 0$, and the result follows. □

Of course, this is not quite a convergence proof, as we need to be concerned about the risk of $L(x_k, \widetilde{R}_k)$ growing unboundedly, which could occur in cases such as $f(x) = \sin(x^2)$ if $x_k \to \infty$ during the iteration. One simple situation where this behavior does not occur is the following.
**Corollary 3.4.** Suppose Assumption 3.1 holds and $0 < \eta < 2$. If the sublevel set $\{x \in \mathbb{R}^d : f(x) \leq f(x_0)\}$ is bounded, $L$ is continuous in $x$, and the choices $R_k$ are bounded, then $\lim_{k\to\infty} \|\nabla f(x_k)\| = 0$.

*Proof.* If $R_k \leq R$ for all $k$, then $L(x_k, R_k) \leq L(x_k, R)$. From Lemma 3.2, we know $x_k \in \{x \in \mathbb{R}^d : f(x) \leq f(x_0)\}$ for all $k$. Then we know that $R_k$, $\|\nabla f(x_k)\|$ and $L(x_k, R_k)$ are all bounded, and so $\widetilde{R}_k$ is bounded too. Finally, this means $L(x_k, \widetilde{R}_k)$ is bounded, so $\lim_{k\to\infty}\|\nabla f(x_k)\|^2 / L(x_k, \widetilde{R}_k) = 0$ (derived in the proof of Theorem 3.3) gives the result. □

In fact, under the assumptions of Corollary 3.4, (3.4) actually gives us the common $O(\epsilon^{-2})$ worst-case iteration complexity rate³ for nonconvex problems (e.g., (Cartis et al., 2022, Chapter 2)). We note that these assumptions are weaker than assuming $L_f$-smoothness everywhere, as we only care about the LFSO in the initial sublevel set.

**Remark 3.5.** The above (in particular (3.2)) still works if we replace $\|\nabla f(x_k)\|$ in (2.3) with any upper bound $C_k \geq \|\nabla f(x_k)\|$. This will be useful in Section 5.

In the case where $f$ satisfies the Polyak-Łojasiewicz (PL) inequality with parameter $\mu > 0$ (for example, if $f$ is $\mu$-strongly convex), we can achieve convergence even in some cases where $L(x_k, \widetilde{R}_k) \to \infty$, provided it does not increase too quickly.

**Theorem 3.6.** Suppose Assumption 3.1 holds and $0 < \eta < 2$. If $f$ is $\mu$-PL, that is,
$$\|\nabla f(x)\|^2 \geq 2\mu(f(x) - f^*), \qquad \forall x \in \mathbb{R}^d, \tag{3.5}$$
and $\sum_{k=0}^{\infty} 1/L(x_k, \widetilde{R}_k) = \infty$, then $\lim_{k\to\infty} f(x_k) = f^*$.

³ i.e., the maximum number of iterations before $\|\nabla f(x_k)\| \leq \epsilon$ is first attained.

*Proof.* From Lemma 3.2 and (3.5), we get
$$f(x_{k+1}) - f^* \leq f(x_k) - f^* - \frac{\eta(2-\eta)}{2L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\|^2 \tag{3.6}$$
$$\leq \left(1 - \frac{\mu\eta(2-\eta)}{L(x_k, \widetilde{R}_k)}\right)(f(x_k) - f^*). \tag{3.7}$$
By the assumption in the theorem statement,
$$\sum_{k=0}^{\infty} \frac{\mu\eta(2-\eta)}{L(x_k, \widetilde{R}_k)} = \infty, \tag{3.8}$$
and so the result follows from (Bertsekas, 2015, Prop A.4.3). □

We should note that if $f$ is coercive (e.g. if $f$ is strongly convex), then the sublevel set $\{x : f(x) \leq f(x_0)\}$ is bounded, and so Corollary 3.4 guarantees convergence (provided the $R_k$ are bounded).
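As an illustrative sanity check (our own, reusing $f(x) = x^4$ and the LFSO of Example 2.5), the per-iteration decrease guaranteed by Lemma 3.2 can be verified numerically along a run of Algorithm 1:

```python
# Illustrative check of the descent guarantee (3.1) for Algorithm 1 on
# f(x) = x**4 with the LFSO of Example 2.5 and eta = 1.
def f(x): return x ** 4
def df(x): return 4.0 * x ** 3
def lfso(x, R): return 24.0 * x ** 2 + 24.0 * R ** 2

eta, x = 1.0, 1.0
ok = True
for _ in range(50):
    g = df(x)
    R = abs(g)
    R_t = max(R, eta / lfso(x, R) * abs(g))   # (2.3)
    L = lfso(x, R_t)
    x_new = x - eta / L * g                   # (2.4)
    # Lemma 3.2: f(x_{k+1}) <= f(x_k) - eta*(2-eta)/(2L) * |f'(x_k)|**2
    if f(x_new) > f(x) - eta * (2 - eta) / (2 * L) * g * g + 1e-12:
        ok = False
    x = x_new
print(ok)  # True: (3.1) holds at every iteration of this run
```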
### 3.2 Local Convergence Rate

Encouraged by the result in Theorem 3.6, we now consider the local convergence rate of Algorithm 1 to non-degenerate local minimizers.

**Theorem 3.7.** Suppose Assumption 3.1 holds and $0 < \eta < 2$. Suppose also that $f$ is $C^2(\mathbb{R}^d)$ and $x^*$ is a local minimizer of $f$ with $\lambda_{\min}(\nabla^2 f(x^*)) > 0$, $L$ is continuous in $x$, and $R_k > 0$ for all $k$. If $x_0$ is sufficiently close to $x^*$ in the sense that $\|x_0 - x^*\| \leq R_1$ (where $B(x^*, R_1)$ is a region where $f$ is $\mu := \lambda_{\min}(\nabla^2 f(x^*))/2$-strongly convex) and $f(x_0) - f(x^*) \leq \mu R_2^2/2$ (where $R_2 \leq R_1/2$ and $\|\nabla f(x)\| \leq \mu R_1/(2\eta)$ whenever $x \in B(x^*, R_2)$), then $x_k \to x^*$ R-linearly, with
$$\|x_k - x^*\|^2 \leq \frac{L_{\max}}{\mu}\left(1 - \frac{\mu\eta(2-\eta)}{L_{\max}}\right)^k \|x_0 - x^*\|^2, \tag{3.9}$$
for a constant $L_{\max} > 0$ (defined in (3.10) below).

*Proof.* Since $f$ is $C^2$, there exists a neighborhood $B(x^*, R_1)$ within which $f$ is $\mu$-strongly convex for $\mu := \lambda_{\min}(\nabla^2 f(x^*))/2$. Given this neighborhood, define
$$L_{\max} := \max_{x \in B(x^*, R_1)}\ \max_{0 \leq R \leq R_1} L(x, R). \tag{3.10}$$
Hence, whenever $\|x_k - x^*\| \leq R_1$, we have $L(x_k, R_k) \leq L_{\max}$. Strong convexity also gives $f(y) - f(x) - \nabla f(x)^T(y-x) \geq \frac{\mu}{2}\|y-x\|^2$ for any $x, y \in B(x^*, R_1)$, and so it follows that $L(x,R) \geq \mu$ for all $x \in B(x^*, R_1)$ and $R > 0$. Since $f$ is $C^1$, there exists⁴ an $R_2 \leq R_1/2$ such that $\|\nabla f(x)\| \leq \mu R_1/(2\eta)$ for all $x \in B(x^*, R_2)$. Then if $\|x_k - x^*\| \leq R_2$, we have
$$\|x_{k+1} - x^*\| \leq \|x_k - x^*\| + \frac{\eta}{L(x_k, \widetilde{R}_k)}\|\nabla f(x_k)\| \leq R_2 + \frac{R_1}{2} \leq R_1.$$

⁴ Since $\nabla f$ is continuous, there is a ball $B(x^*, R_2')$ such that $\|\nabla f(x)\| \leq \epsilon$ for all $x \in B(x^*, R_2')$ (with $\epsilon$ arbitrary, such as $\mu R_1/(2\eta)$ as above). Then take $R_2 = \min(R_1/2, R_2')$.

Now, for any $x \in B(x^*, R_1)$, by strong convexity in this region, it follows that if $f(x) - f(x^*) \leq \mu R_2^2/2$ then $\|x - x^*\| \leq R_2$. Given this, suppose that $x_0$ is sufficiently close to $x^*$ in the sense that $x_0 \in B(x^*, R_1)$ and $f(x_0) - f(x^*) \leq \mu R_2^2/2$. We then have that $x_0 \in B(x^*, R_2)$, and so $x_1 \in B(x^*, R_1)$ from the above. Lemma 3.2 implies that $f(x_1) \leq f(x_0)$, and so $x_1$ also satisfies $f(x_1) - f(x^*) \leq \mu R_2^2/2$, which in turn implies $x_1 \in B(x^*, R_2)$. By induction, we conclude that $x_k \in B(x^*, R_1)$ for all $k$ (and also that $x_k \in B(x^*, R_2)$ for all $k$).
Then, by the same reasoning as (3.7), since (3.5) holds in $B(x^*, R_1)$ and $L(x_k, \widetilde{R}_k) \leq L_{\max}$, we have
$$f(x_{k+1}) - f(x^*) \leq \left(1 - \frac{\mu\eta(2-\eta)}{L_{\max}}\right)(f(x_k) - f(x^*)),$$
and so we have a linear convergence rate for $f(x_k) \to f(x^*)$. Finally, we use the strong convexity result that (Nesterov, 2004, Lemma 1.2.3 & Theorem 2.1.7)
$$\frac{\mu}{2}\|x - x^*\|^2 \leq f(x) - f(x^*) \leq \frac{L_{\max}}{2}\|x - x^*\|^2$$
to conclude that
$$\frac{\mu}{2}\|x_k - x^*\|^2 \leq f(x_k) - f(x^*) \leq \left(1 - \frac{\mu\eta(2-\eta)}{L_{\max}}\right)^k (f(x_0) - f(x^*)) \leq \left(1 - \frac{\mu\eta(2-\eta)}{L_{\max}}\right)^k \frac{L_{\max}}{2}\|x_0 - x^*\|^2,$$
giving (3.9). □

## 4 Global Rates for Compositions of Functions

Motivated by problems with very flat minima, we now consider the performance of Algorithm 1 when the objective function has a specific compositional structure.

**Assumption 4.1.** The objective is $f(x) = h(g(x))$ where:
- The function $g : \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable, $\nabla g$ is $L_g$-Lipschitz continuous, and $g$ is $\mu_g$-PL with minimizer $x^*$ satisfying $g(x^*) = 0$;
- The function $h : [0,\infty) \to \mathbb{R}$ is twice continuously differentiable, strictly increasing, strictly convex, $h''$ is non-decreasing, and $h'(t) = \Theta(t^p)$ as $t \to 0^+$, for some $p \geq 0$.

We note that the assumptions on $g$ imply that
$$2\mu_g\, g(x) \leq \|\nabla g(x)\|^2 \leq 2L_g\, g(x), \qquad \forall x \in \mathbb{R}^d, \tag{4.1}$$
where the right-hand inequality follows from (Wright, 2018, eq. (4.1.3)), and that $x^*$ (the minimizer of $g$) is also a minimizer for $f$ with $f(x^*) = h(0)$. The function $g$ could be, for example, $g(x) = \|Ax-b\|_2^2$ for some consistent linear system $Ax = b$, but in general need not be convex. We are most interested in Assumption 4.1 when the outer function $h$ is very flat near 0, such as $h(t) = t^p$ for $p > 2$, although other functions such as $h(t) = t$ are also allowed. In general, this means that $f$ is not PL, as shown by the case $g(x) = x^2$ and $h(t) = t^2$. Even though $f$ is not PL, we will show that the iterates generated by Algorithm 1 exhibit a global linear convergence rate, which (for standard GD-type methods) is typically only seen for PL functions. To show this, we will need the following technical results.

**Lemma 4.2.**
Suppose Assumption 4.1 holds and we perform the iteration $x_{k+1} = x_k - \eta_k\nabla g(x_k)$, where there exists $\epsilon \in (0, 1/L_g]$ such that $\epsilon \leq \eta_k \leq 2/L_g - \epsilon$ for all $k$. Then
$$g(x_k) \leq (1 - \mu_g\epsilon(2 - L_g\epsilon))^k\, g(x_0).$$

*Proof.* This is a generalization of (Karimi et al., 2016, Theorem 1), which proves the case where $\epsilon = 1/L_g$ (i.e. $\eta_k = 1/L_g$ for all $k$). Since $g$ is $L_g$-smooth, we have
$$g(x_{k+1}) \leq g(x_k) - \eta_k\|\nabla g(x_k)\|^2 + \frac{L_g\eta_k^2}{2}\|\nabla g(x_k)\|^2 \leq (1 - \mu_g\eta_k(2 - L_g\eta_k))\, g(x_k),$$
where the last inequality holds from (4.1). The term $\mu_g\eta_k(2 - L_g\eta_k)$ is positive for all $\eta_k \in (0, 2/L_g)$ and maximized for $\eta_k = 1/L_g$, in which case $\mu_g\eta_k(2 - L_g\eta_k) = \mu_g/L_g \leq 1$. Hence, if $\eta_k \in [\epsilon, 2/L_g - \epsilon]$, then $0 < \mu_g\epsilon(2 - L_g\epsilon) \leq \mu_g\eta_k(2 - L_g\eta_k) \leq \mu_g/L_g \leq 1$. The result then follows. □

**Corollary 4.3.** Suppose the assumptions of Lemma 4.2 hold. Then $\nabla f(x_k) \to 0$ R-linearly.

*Proof.* Since $\nabla f(x) = h'(g(x))\nabla g(x)$, we use (4.1) to get
$$\|\nabla f(x_k)\| = h'(g(x_k))\|\nabla g(x_k)\| \leq \sqrt{2L_g}\, h'(g(x_k))\sqrt{g(x_k)},$$
noting that $h'(g(x_k)) > 0$ since $h$ is strictly increasing. Then by Lemma 4.2, we get $g(x_k) \leq g(x_0)$ and so
$$\|\nabla f(x_k)\| \leq \sqrt{2L_g}\, h'(g(x_0))\sqrt{g(x_0)}\,(1 - \mu_g\epsilon(2 - L_g\epsilon))^{k/2},$$
where we have used that $h'$ is increasing (since $h$ is increasing and convex), and so $\nabla f(x_k) \to 0$ R-linearly with rate $(1 - \mu_g\epsilon(2 - L_g\epsilon))^{1/2}$. □

Now, for the objective given by Assumption 4.1, we have $\nabla f(x) = h'(g(x))\nabla g(x)$ and $\nabla^2 f(x) = h''(g(x))\nabla g(x)\nabla g(x)^T + h'(g(x))\nabla^2 g(x)$, and so for the purposes of calculating $L(x,R)$ we estimate
$$\max_{\|s\| \leq R} \|\nabla^2 f(x+s)\| \leq \max_{\|s\| \leq R}\left[h''(g(x+s))\|\nabla g(x+s)\|^2 + h'(g(x+s))L_g\right].$$
Using (4.1) and $\|\nabla g(x+s)\| \leq L_g\|s\| + \|\nabla g(x)\|$, we get the LFSO
$$L(x,R) = h''\!\left(\frac{[L_gR + \|\nabla g(x)\|]^2}{2\mu_g}\right)[L_gR + \|\nabla g(x)\|]^2 + h'\!\left(\frac{[L_gR + \|\nabla g(x)\|]^2}{2\mu_g}\right)L_g. \tag{4.2}$$
That $L$ is non-decreasing in $R$ follows since $h'$ and $h''$ are non-decreasing and $h'' > 0$ (Assumption 4.1). By observing the form of the LFSO (4.2), it is natural to consider $R_k = \|\nabla g(x_k)\|$, since these two quantities (i.e. $R$ and $\|\nabla g(x)\|$) always appear together and it is a computable/known value. It is this choice that will give the fast global convergence rate of Algorithm 1.

**Lemma 4.4.**
Suppose Assumption 4.1 holds and we choose $R_k = \|\nabla g(x_k)\|$ and any $\eta > 0$ in Algorithm 1 (with LFSO (4.2)). Then $\widetilde{R}_k = D_k\|\nabla g(x_k)\|$ for some $D_k \in [1, \max(1, \eta/L_g)]$.

*Proof.* Substituting our choice of $R_k$ into (4.2), and noting $\|\nabla f(x_k)\| = h'(g(x_k))\|\nabla g(x_k)\|$, (2.3) gives
$$\widetilde{R}_k = \max\left\{\|\nabla g(x_k)\|,\ \frac{\eta\, h'(g(x_k))\|\nabla g(x_k)\|}{h''\!\left(\frac{(L_g+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)(L_g+1)^2\|\nabla g(x_k)\|^2 + h'\!\left(\frac{(L_g+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)L_g}\right\}$$
$$\leq \max\left\{\|\nabla g(x_k)\|,\ \frac{\eta\, h'(g(x_k))\|\nabla g(x_k)\|}{h'\!\left(\frac{(L_g+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)L_g}\right\} \leq \max\left\{1, \frac{\eta}{L_g}\right\}\|\nabla g(x_k)\|,$$
where the last line follows from $g(x_k) \leq \|\nabla g(x_k)\|^2/(2\mu_g) \leq (L_g+1)^2\|\nabla g(x_k)\|^2/(2\mu_g)$ (by (4.1)) and $h$ strictly convex (so $h'$ is non-decreasing). This gives $D_k \leq \max\{1, \eta/L_g\}$. That $D_k \geq 1$ follows from $\widetilde{R}_k \geq R_k$ (by definition of $\widetilde{R}_k$). □

We can now show our global linear rate for Algorithm 1 for this specific problem class.

**Theorem 4.5.** Suppose Assumption 4.1 holds and we choose $R_k = \|\nabla g(x_k)\|$ and $\eta \in (0,2)$ in Algorithm 1 (with LFSO (4.2)). Then $\nabla f(x_k) \to 0$ R-linearly.

*Proof.* From Lemma 4.4, we have
$$L(x_k, \widetilde{R}_k) = h''\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)(L_gD_k+1)^2\|\nabla g(x_k)\|^2 + h'\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)L_g.$$
Our iteration can be expressed as $x_{k+1} = x_k - \eta_k\nabla g(x_k)$ with $\eta_k := \eta\, h'(g(x_k))/L(x_k, \widetilde{R}_k)$. Since $h$ is convex, we have $h'' \geq 0$ and $h'$ is non-decreasing, so using (4.1) we get
$$L(x_k, \widetilde{R}_k) \geq h'\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)L_g \geq h'(g(x_k))L_g,$$
which gives
$$\eta_k = \frac{\eta\, h'(g(x_k))}{L(x_k, \widetilde{R}_k)} \leq \frac{\eta}{L_g}. \tag{4.3}$$
Separately, since $h'' > 0$ and $h''$ is non-decreasing, we have (for any $t \geq 0$)
$$t\, h''(t) \leq \int_t^{2t} h''(s)\,ds = h'(2t) - h'(t) \leq h'(2t),$$
and applying this with $t = (L_gD_k+1)^2\|\nabla g(x_k)\|^2/(2\mu_g)$ gives
$$h''\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{2\mu_g}\right)(L_gD_k+1)^2\|\nabla g(x_k)\|^2 \leq 2\mu_g\, h'\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{\mu_g}\right).$$
Since $h'$ is non-decreasing, we then conclude
$$L(x_k, \widetilde{R}_k) \leq (2\mu_g + L_g)\, h'\!\left(\frac{(L_gD_k+1)^2\|\nabla g(x_k)\|^2}{\mu_g}\right).$$
From (4.1) we then have
$$L(x_k, \widetilde{R}_k) \leq (2\mu_g + L_g)\, h'\!\left(\frac{2L_g(L_gD_k+1)^2\, g(x_k)}{\mu_g}\right).$$
Denoting $C_k := 2L_g(L_gD_k+1)^2/\mu_g$, this means
$$\eta_k = \frac{\eta\, h'(g(x_k))}{L(x_k, \widetilde{R}_k)} \geq \frac{\eta}{2\mu_g + L_g}\cdot\frac{h'(g(x_k))}{h'(C_k\, g(x_k))}.$$
We note that $1 \leq C_k \leq C_{\max} := 2L_g(L_g\max(1, \eta/L_g)+1)^2/\mu_g$, from $\mu_g \leq L_g$ and Lemma 4.4, respectively. Also, since Algorithm 1 is monotone (Lemma 3.2, noting Assumption 3.1 is implied by Assumption 4.1), we have $f(x_k) \leq f(x_0)$ and so $g(x_k) \leq g(x_0)$ since $h$ is increasing.
Now, since $h'(t) = \Theta(t^p)$ as $t \to 0^+$, there are an interval $[0, \delta]$ and constants $0 < C_1 \leq C_2$ for which
$$C_1 t^p \leq h'(t) \leq C_2 t^p, \qquad \forall t \in [0, \delta].$$
If $C_k\, g(x_k) \leq \delta$, then we have
$$\frac{h'(g(x_k))}{h'(C_k\, g(x_k))} \geq \frac{C_1\, g(x_k)^p}{C_2 C_k^p\, g(x_k)^p} \geq \frac{C_1}{C_2 C_{\max}^p} > 0.$$
By contrast, if $C_k\, g(x_k) > \delta$, then $g(x_0) \geq g(x_k) > \delta/C_{\max}$. Since $h'$ is continuous and strictly positive for all $t > 0$, we have
$$\epsilon_0 := \min_{\delta/C_{\max} \leq t \leq g(x_0)} \frac{h'(t)}{h'(C_{\max} t)} > 0,$$
and so in this case we have
$$\frac{h'(g(x_k))}{h'(C_k\, g(x_k))} \geq \frac{h'(g(x_k))}{h'(C_{\max}\, g(x_k))} \geq \epsilon_0.$$
In either case, we have
$$\eta_k \geq \frac{\eta}{2\mu_g + L_g}\min\left\{\frac{C_1}{C_2 C_{\max}^p},\, \epsilon_0\right\}. \tag{4.4}$$
The result then follows by combining (4.3) and (4.4) with $\eta \in (0,2)$ and Corollary 4.3. □

We reiterate that this result is a global linear rate, and does not require $x_0$ to be sufficiently close to $x^*$ (although the actual linear rate does potentially depend on $x_0$). This applies to functions with extremely flat minima, such as $f(x) = \|x\|_2^{2p}$ for any $p \geq 1$ (by taking $g(x) = \|x\|_2^2$ and $h(t) = t^p$).

## 5 Global Rate for Linear Regression

Another important problem class where Algorithm 1 can achieve a global linear rate (at least in some cases) is linear regression with an $\ell_{2p}$ loss function:
$$\min_{x \in \mathbb{R}^d} f(x) := \|Ax-b\|_{2p}^{2p} = \sum_{i=1}^n (a_i^T x - b_i)^{2p}, \tag{5.1}$$
for some $A \in \mathbb{R}^{n\times d}$ with rows $a_i \in \mathbb{R}^d$ for $i = 1, \ldots, n$, and $b \in \mathbb{R}^n$, and $p \in \mathbb{N}$. The choice of norm here avoids any issues of non-smoothness in the objective, but taking $p$ large again recovers a situation with extremely flat local minima. Our theoretical results will hold in the case where $A$ is sufficiently well-conditioned, which in particular includes the case $f(x) = \|x\|_{2p}^{2p}$.

**Assumption 5.1.** The matrix $A$ has full rank, $n \geq d$, and $\kappa(A)^4 < n/(n-1)$, where $\kappa(A)$ is the 2-norm condition number of $A$.

Observe that Assumption 5.1 implies the system $Ax = b$ is consistent. We also note that Assumption 5.1 is quite restrictive, especially when $n$ is large, requiring the rows of $A$ to be almost orthonormal.
Given this, we delegate the technical details of this section to Appendix B. For (5.1), we have
$$\nabla f(x) = 2p\, A^T(Ax-b)^{2p-1} \quad\text{and}\quad \nabla^2 f(x) = 2p(2p-1)\, A^T\operatorname{diag}\!\left((Ax-b)^{2p-2}\right)A,$$
where $(Ax-b)^p$ is understood to represent element-wise powers. We now need to provide an LFSO for $f(x)$ in (5.1). In the case $p = 1$, that is, typical linear least-squares regression, we have $\nabla^2 f(x) = 2A^TA$, and so we automatically get
$$L(x, R) = 2\|A\|_2^2, \tag{5.2}$$
as a valid LFSO. In the case $p = 2, 3, 4, \ldots$, a valid LFSO for (5.1) is
$$L(x, R) = 2p(2p-1)\|A\|_2^2\, 2^{2p-3}\left(\|Ax-b\|_\infty^{2p-2} + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2}\, R^{2p-2}\right), \tag{5.3}$$
as derived in Appendix B.1. We note that substituting $p = 1$ into (5.3) recovers (5.2), and so (5.3) is a valid LFSO for all $p \in \mathbb{N}$. Observing the form of (5.3), a natural choice for $R_k$ is $R_k = \|Ax_k - b\|_\infty$.

**Theorem 5.2.** Suppose Assumption 5.1 holds, and we choose $R_k = \|Ax_k - b\|_\infty$ and $\eta \in (0, 1]$ in Algorithm 1 (with LFSO (5.3)). Then for any $p \in \mathbb{N}$, the residual $r_k := Ax_k - b$ satisfies $r_k \to 0$ Q-linearly.

*Proof.* See Appendix B.2. □

The above gives a global linear rate under some restrictive assumptions on the problem (5.1). They are satisfied if $A = I$, for example, which gives us the objective $f(x) = \|x\|_{2p}^{2p}$ for any $p$.

## 6 Numerical Experiments

In this section, we provide some brief numerical experiments confirming the global linear rate for Algorithm 1 (with an appropriate choice of $R_k$ and no tuning of the fixed stepsize $\eta = 1$) for objectives of the form $f(x) = \|x\|_2^{2p}$ as in Section 4 and $f(x) = \|x\|_{2p}^{2p}$ as in Section 5. In all cases, we use $d = 10$ dimensional problems, starting point $x_0 = (1, \ldots, 1)^T$ and $p = 1, \ldots, 5$. For both objectives, the $p = 1$ case gives the strongly convex objective $f(x) = \|x\|_2^2$, but for $p > 1$ we only have (non-strong) convexity and a flat neighborhood of the global minimizer $x^* = 0$. In all cases we plot the normalized gradient decrease $\|\nabla f(x_k)\|_2/\|\nabla f(x_0)\|_2$ as a function of iteration $k$, for up to $10^4$ iterations. For standard gradient descent with fixed stepsize, we get the results shown in Figure 1.
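As an illustration of this special case (our own sketch, with $A = I$, $b = 0$ and $d = 1$, so $f(x) = x^{2p}$): the LFSO used below follows directly from Lemmas 2.2 and 2.4 (an upper bound on $\max|f''|$ over the ball) rather than formula (5.3) itself, and the choice $R_k = |x_k|$ mirrors $R_k = \|Ax_k - b\|$. The residual $|x_k|$ contracts by a constant factor per iteration, i.e. a Q-linear rate.

```python
# Hedged 1-D illustration of the Section 5 setting with A = I, b = 0, i.e.
# f(x) = x**(2p). LFSO via Lemmas 2.2/2.4: |f''(y)| <= 2p(2p-1)(|x|+R)**(2p-2).
p = 3
def df(x): return 2 * p * x ** (2 * p - 1)
def lfso(x, R): return 2 * p * (2 * p - 1) * (abs(x) + R) ** (2 * p - 2)

def run(x0, eta=1.0, iters=400):
    xs, x = [x0], x0
    for _ in range(iters):
        g = df(x)
        R = abs(x)                               # mimics R_k = ||A x_k - b||
        R_t = max(R, eta / lfso(x, R) * abs(g))  # (2.3)
        x = x - eta / lfso(x, R_t) * g           # (2.4)
        xs.append(x)
    return xs

xs = run(1.0)
# Residual |x_k| contracts by the same factor each step => global linear rate;
# here the step works out to x/80, so the factor is 1 - 1/80 = 0.9875.
ratios = [xs[k + 1] / xs[k] for k in range(10)]
print(ratios[0])
```

The constant contraction factor (independent of how close $x_k$ is to $x^* = 0$) is exactly the behavior standard fixed-stepsize gradient descent cannot achieve here.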
To see sufficiently fast convergence, some mild tuning of the stepsize $\eta$ was required. For both objectives, we see that gradient descent achieves a global linear rate for $p = 1$ as expected, but a clearly sublinear rate for all $p > 1$, again in line with expectations. When running Algorithm 1, we use $R_k = \|\nabla g(x_k)\|_2 = 2\|x_k\|_2$ for $f(x) = \|x\|_2^{2p}$ (based on the framework of Section 4) and $R_k = \|x_k\|_\infty$ for $f(x) = \|x\|_{2p}^{2p}$ (based on the framework of Section 5). The corresponding results are given in Figure 2. Here we see the expected global linear rate for all values of $p$, not just the strongly convex case $p = 1$. We do note, however, that the rate of convergence is faster for smaller values of $p$.

By comparison, Appendix C.1 shows the same results for the adaptive gradient descent algorithm of (Malitsky & Mishchenko, 2020) (using the recommended starting stepsize $\eta_0 = 10^{-10}$, which is automatically adjusted after the first iteration). This method outperforms both standard gradient descent and Algorithm 1, and indeed appears to show a global linear convergence rate for all problems. However, this rate is not proven, and no convergence results for this method are known for nonconvex objectives.

[Figure 1: Global sublinear rate $\nabla f(x_k) \to 0$ achieved by gradient descent with fixed stepsize for non-strongly convex functions with flat minima. Plots show $\|\nabla f(x_k)\|_2/\|\nabla f(x_0)\|_2$ as a function of $k$, up to $k = 10^4$. Panel (a): $f(x) = \|x\|_2^{2p}$ with tuned stepsizes $\eta = 0.01, 0.01, 0.001, 10^{-4}, 1.5\times 10^{-5}$ for $p = 1, \ldots, 5$. Panel (b): $f(x) = \|x\|_{2p}^{2p}$, all using $\eta = 10^{-2}$.]
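For contrast with the LFSO behavior, a minimal sketch (our own, with hypothetical parameter choices) of the fixed-stepsize baseline behind Figure 1: on $f(x) = x^{2p}$ with $p > 1$, plain gradient descent slows to a sublinear crawl because its per-step contraction factor tends to 1 as $x_k \to 0$.

```python
# Sketch of the fixed-stepsize baseline: plain gradient descent on
# f(x) = x**(2p), p > 1, stalls near the flat minimizer x* = 0.
p, eta = 3, 0.01
x = 1.0
for _ in range(10000):
    x = x - eta * 2 * p * x ** (2 * p - 1)
# Contraction factor is 1 - eta*2p*x**(2p-2), which tends to 1 as x -> 0,
# so even 10^4 iterations leave x far from 0 (roughly x ~ 1/7 here).
print(x)
```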
[Figure: two panels, (a) $f(x) = \|x\|_2^{2p}$ and (b) $f(x) = \|x\|_{2p}^{2p}$, each for $p = 1, \ldots, 5$.]

Figure 2: Global linear rate $\|\nabla f(x_k)\| \to 0$ achieved by Algorithm 1 for non-strongly convex functions with flat minima. Plots show $\|\nabla f(x_k)\|_2 / \|\nabla f(x_0)\|_2$ as a function of $k$.

Runtime Performance. Since Algorithm 1 is more computationally expensive than gradient descent on a per-iteration basis (from up to two evaluations of the LFSO per iteration), the above results are shown in terms of runtime (rather than iterations) in Appendix C.2. We see that Algorithm 1 is slower than gradient descent, but not significantly so (within a factor of 2 for the slower problems $p = 3, 4, 5$).

Mis-specification of $R_k$. The above results for Algorithm 1 assume that $R_k$ is chosen according to the theoretical results in Sections 4 and 5. In Appendix C.3, we consider the impact of significantly mis-specifying $R_k$ (i.e., changing its order of magnitude, not just adjusting it by a constant) in Algorithm 1. We see that taking $R_k$ too small appears to have minimal impact on the convergence rate, while taking $R_k$ too large worsens the convergence rate to sublinear.

7 Conclusions and Future Work

We have introduced a new oracle for local first-order smoothness, which exists for a wide range of functions and encodes all of the problem information relevant for selecting stepsizes for gradient descent-type methods. Using the LFSO, we introduced a practical gradient descent-type method, and showed global and local convergence results under reasonable assumptions. We then showed that this method gives global linear rates for some specific (non-strongly) convex functions with degenerate minima.
There are many potential directions for future study of LFSOs, including automatic differentiation techniques for building LFSOs and understanding how to pick the forcing sequence $R_k$. This would enable more extensive numerical testing of the LFSO approach to ascertain its usefulness on problems of more practical interest. Additionally, the use of LFSOs in broader optimization settings, such as constrained, bilevel, nonsmooth and/or stochastic optimization, is an interesting direction for future work.

Acknowledgments

FR was partially supported by the Australian Research Council through an Industrial Transformation Training Centre for Information Resilience (IC200100022). Albert S. Berahas was partially supported by the Office of Naval Research under award number N00014-21-1-2532.

References

Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm for p-norm linear regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/46c7cb50b373877fb2f8d5c4517bb969-Paper.pdf.

Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1-2):123-175, 2018.

Hedy Attouch, Zaki Chbani, Jalal Fadili, and Hassan Riahi. First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, 193(1):133-155, 2022.

Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988. doi: 10.1093/imanum/8.1.141.

Dimitri Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
doi: 10.1137/16M1080173.

Oleg Burdakov, Yuhong Dai, and Na Huang. Stabilized Barzilai-Borwein method. Journal of Computational Mathematics, 37(6):916-936, 2019. doi: 10.4208/jcm.1911-m2019-0171.

Yair Carmon and Oliver Hinder. Making SGD parameter-free. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of the Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 2360-2389. PMLR, 02-05 Jul 2022. URL https://proceedings.mlr.press/v178/carmon22a.html.

Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. Number 30 in MOS-SIAM Series on Optimization. MOS/SIAM, Philadelphia, 2022.

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-Adaptation. In The 40th International Conference on Machine Learning (ICML 2023), 2023.

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. In 35th Conference on Neural Information Processing Systems, 2021.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795-811, 2016.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. Leveraging non-uniformity in first-order non-convex optimization. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101, 2023.

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective.
MIT Press, 2012.

Yurii Nesterov. Introductory Lectures on Convex Optimization. Springer US, 2004. ISBN 978-1-4020-7553-7.

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, 2nd edition, 2006.

B.T. Polyak. Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics, 9(3):14-29, 1969. doi: 10.1016/0041-5553(69)90061-5.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems, volume 2, pp. 2951-2959, 2012. doi: 10.5555/2999325.2999464.

Stephen J. Wright. Optimization algorithms for data analysis. In Michael W. Mahoney, John C. Duchi, and Anna Gilbert (eds.), The Mathematics of Data, volume 25 of IAS/Park City Mathematics Series, pp. 49-97. American Mathematical Society, Providence, Rhode Island, 2018. doi: 10.1090/pcms/025/02.

Chenghan Xie, Chenxi Li, Chuwen Zhang, Qi Deng, Dongdong Ge, and Yinyu Ye. Trust region methods for nonconvex stochastic optimization beyond Lipschitz smoothness. arXiv preprint arXiv:2310.17319, 2023.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.

A LFSO for Logistic Regression

Here we derive an LFSO suitable for 2-class logistic regression. Following (Murphy, 2012, Chapter 8), the empirical risk minimization problem is
$$\min_{w \in \mathbb{R}^d} f(w) = -\sum_{i=1}^{n} \left[ y_i \log(\mu_i(w)) + (1 - y_i) \log(1 - \mu_i(w)) \right],$$
where $\mu_i(w) = \sigma(w^T x_i)$ for $\sigma(t) = 1/(1 + e^{-t})$, and we have features $x_i \in \mathbb{R}^d$ and labels $y_i \in \{0, 1\}$.
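For concreteness, this objective and its gradient (stated next) can be set up as follows; a minimal sketch, with function names ours:

```python
import math, random

def sigmoid(t):
    """sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def logistic_loss(w, X, y):
    """f(w) = -sum_i [y_i log mu_i(w) + (1 - y_i) log(1 - mu_i(w))], mu_i = sigmoid(w^T x_i)."""
    mu = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
    return -sum(yi * math.log(mi) + (1 - yi) * math.log(1 - mi) for yi, mi in zip(y, mu))

def logistic_grad(w, X, y):
    """grad f(w) = X^T (mu(w) - y)."""
    mu = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
    return [sum((mi - yi) * xi[j] for mi, yi, xi in zip(mu, y, X)) for j in range(len(w))]
```

The gradient formula can be sanity-checked against central finite differences of `logistic_loss` on random data.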
The associated gradient and Hessian may be written as $\nabla f(w) = X^T(\mu(w) - y)$ and $\nabla^2 f(w) = X^T S(w) X$, where $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ contain the training data and labels, $\mu(w) \in \mathbb{R}^n$ has entries $\mu_i(w)$, and $S(w) = \operatorname{diag}(\sigma'(w^T x_i)) \in \mathbb{R}^{n \times n}$. Since $\sigma'(t) \in [0, 1/4]$ for all $t$, we automatically have $\|\nabla^2 f(w)\|_2 \le \frac{1}{4}\|X\|_2^2$. Thus $f$ is $L_f$-smooth, and so Algorithm 1 with $\eta = 1$ can reduce to standard gradient descent with learning rate $1/L_f$.

If we want a tighter LFSO, we start by noting that $\sigma'(t) = \sigma(t)(1 - \sigma(t))$ and so $\sigma''(t) = 2\sigma(t)^3 - 3\sigma(t)^2 + \sigma(t)$, and since $\sigma(t) \in [0, 1]$ we have $|\sigma''(t)| \le \sqrt{3}/18$ for all $t$. Hence by the Cauchy-Schwarz inequality and Taylor's theorem, we have
$$\sigma'\big((w+s)^T x_i\big) \le \sigma'(w^T x_i) + \frac{\sqrt{3}}{18}\|x_i\|_2 R$$
whenever $\|s\|_2 \le R$. Hence a more precise LFSO for this problem is
$$L(w, R) = \|X\|_2^2 \max_{i=1,\ldots,n} \min\left\{ \sigma'(w^T x_i) + \frac{\sqrt{3}}{18}\|x_i\|_2 R, \ \frac{1}{4} \right\}. \qquad (A.1)$$
If the data $X$ is separable, then near the optimum $w^*$ it is likely that all $w^T x_i$ are far from zero, and so $\sigma'(w^T x_i) \ll 1/4$, and so taking this LFSO with small $R$ may improve on the simpler $L(w, R) = \frac{1}{4}\|X\|_2^2$ from Lipschitz continuous gradients.

The computational cost of evaluating $\nabla f(w)$ is one evaluation of $\mu_i(w) = \sigma(w^T x_i)$ for each $i$ and one matrix-vector product with $X$. If $\|X\|_2^2$ and the $\|x_i\|_2$ have been pre-computed, then since $\sigma'(w^T x_i) = \mu_i(w)(1 - \mu_i(w))$, computing the LFSO (A.1) requires the evaluation of each $\mu_i(w)$ plus $O(n)$ extra calculations. Thus one LFSO evaluation is actually computationally cheaper than one gradient evaluation.

B Technical Details for Section 5

B.1 Derivation of LFSO

Here we derive the LFSO (5.3) for problem (5.1) with $p = 2, 3, 4, \ldots$.

Lemma B.1. For any $x_1, \ldots, x_m \in \mathbb{R}$ and $t \ge 1$, we have
$$\left( \sum_{i=1}^{m} |x_i| \right)^t \le m^{t-1} \sum_{i=1}^{m} |x_i|^t.$$

Proof. If $t = 1$, this is the triangle inequality. For $t > 1$, we use Hölder's inequality to get $|x^T e|^t \le \|x\|_t^t \|e\|_{t/(t-1)}^t$, where $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$ and $e = (1, \ldots, 1) \in \mathbb{R}^m$. The result then follows from $\|e\|_{t/(t-1)}^t = m^{t-1}$.

Using Lemma B.1 (with $m = 2$ and $t = 2p - 2$), we get
$$\big(a_i^T x - b_i + a_i^T s\big)^{2p-2} \le 2^{2p-3}\left[ (a_i^T x - b_i)^{2p-2} + (a_i^T s)^{2p-2} \right],$$
for any $p = 2, 3, 4, \ldots$
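A quick numerical sanity check of this two-term instance of Lemma B.1 ($m = 2$, $t = 2p - 2$; the exponent is even, so absolute values can be dropped); helper name ours:

```python
import random

def two_term_power_bound_holds(u, v, p):
    """Check (u + v)^(2p-2) <= 2^(2p-3) * (u^(2p-2) + v^(2p-2)), i.e. Lemma B.1
    with m = 2 and t = 2p - 2, applied to u = a_i^T x - b_i and v = a_i^T s.
    A tiny multiplicative slack absorbs floating-point rounding near equality."""
    t = 2 * p - 2
    return (u + v) ** t <= 2 ** (t - 1) * (u ** t + v ** t) * (1 + 1e-12)

random.seed(0)
for p in (2, 3, 4, 5):
    for _ in range(1000):
        u, v = random.uniform(-10, 10), random.uniform(-10, 10)
        assert two_term_power_bound_holds(u, v, p)
```

Equality is attained at $u = v$, which is why the bound cannot be improved by shrinking the $2^{2p-3}$ factor.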
Noting that $\|\operatorname{diag}((Ax-b)^{2p-2})\|_2 = \|Ax-b\|_\infty^{2p-2}$, for this range of $p$ we get
$$\begin{aligned} \max_{y \in B(x,R)} \|\nabla^2 f(y)\|_2 &= \max_{\|s\|_2 \le R} \|\nabla^2 f(x+s)\|_2, \\ &\le \max_{\|s\|_2 \le R} 2p(2p-1)\|A\|_2^2 \max_{i=1,\ldots,n} \big(a_i^T x - b_i + a_i^T s\big)^{2p-2}, \\ &\le \max_{i=1,\ldots,n} \max_{\|s\|_2 \le R} 2p(2p-1)\|A\|_2^2\, 2^{2p-3}\left[ (a_i^T x - b_i)^{2p-2} + (a_i^T s)^{2p-2} \right], \\ &\le \max_{i=1,\ldots,n} 2p(2p-1)\|A\|_2^2\, 2^{2p-3}\left[ (a_i^T x - b_i)^{2p-2} + \|a_i\|_2^{2p-2} R^{2p-2} \right], \end{aligned}$$
and so (5.3) is a valid LFSO.

B.2 Proof of Theorem 5.2

Given the choice $R_k = \|Ax_k - b\|_\infty$, and noting that Algorithm 1 works if $\|\nabla f(x_k)\|$ in (2.3) is replaced by any upper bound for $\|\nabla f(x_k)\|$, we have for $p \ge 2$,
$$\tilde{R}_k = \min\left\{ \|Ax_k - b\|_\infty, \ \frac{2p\eta \|A\|_2 \sqrt{n}\, \|Ax_k - b\|_\infty^{2p-1}}{2p(2p-1)\|A\|_2^2\, 2^{2p-3}\left[ \|Ax_k - b\|_\infty^{2p-2} + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2}\, \|Ax_k - b\|_\infty^{2p-2} \right]} \right\} = \min\left\{ 1, \ \frac{\eta\sqrt{n}}{(2p-1)\|A\|_2\, 2^{2p-3}\left[ 1 + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2} \right]} \right\} \|Ax_k - b\|_\infty = c_1(A, n, p, \eta)\, \|Ax_k - b\|_\infty,$$
where
$$c_1(A, n, p, \eta) := \min\left\{ 1, \ \frac{\eta\sqrt{n}}{(2p-1)\|A\|_2\, 2^{2p-3}\left[ 1 + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2} \right]} \right\}.$$
Thus we have
$$L(x_k, \tilde{R}_k) = 2p(2p-1)\|A\|_2^2\, 2^{2p-3}\left[ \|Ax_k - b\|_\infty^{2p-2} + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2}\, c_1(A, n, p, \eta)^{2p-2}\, \|Ax_k - b\|_\infty^{2p-2} \right] = 2p\, c_2(A, n, p, \eta)\, \|Ax_k - b\|_\infty^{2p-2}, \qquad (B.1)$$
where
$$c_2(A, n, p, \eta) := (2p-1)\|A\|_2^2\, 2^{2p-3}\left[ 1 + \max_{i=1,\ldots,n}\|a_i\|_2^{2p-2}\, c_1(A, n, p, \eta)^{2p-2} \right],$$
again for $p = 2, 3, 4, \ldots$. For $p = 1$, since $L(x, R)$ is independent of $R$, we always have $L(x_k, \tilde{R}_k) = 2p\|A\|_2^2$, or equivalently (B.1) with $c_2(A, n, 1, \eta) := \|A\|_2^2$.

Finally, our iteration is
$$x_{k+1} = x_k - \frac{\eta}{c_2(A, n, p, \eta)\, \|Ax_k - b\|_\infty^{2p-2}}\, A^T (Ax_k - b)^{2p-1}.$$
Since we assume our linear system is consistent, we know (5.1) has a global minimizer $x^*$ with $f(x^*) = 0$, and so a suitable error metric is the residual, $r_k := Ax_k - b$. Written in terms of the residual, our iteration is
$$r_{k+1} = r_k - \frac{\eta}{c_2(A, n, p, \eta)\, \|r_k\|_\infty^{2p-2}}\, A A^T r_k^{2p-1}. \qquad (B.2)$$
In the case $p = 1$, the residual iteration (B.2) becomes $r_{k+1} = \left(I - \frac{\eta}{\|A\|_2^2} A A^T\right) r_k$, and so $r_k \to 0$ linearly for all $\eta \in (0, 1]$, as expected. Instead, if $p \ge 2$, it suffices to consider the case $r_k \ne 0$. Here, (B.2) may be written as
$$r_{k+1} = \left(I - \frac{\eta}{c_2} A A^T\right) r_k + \frac{\eta}{c_2} A A^T \left( r_k - \frac{r_k^{2p-1}}{\|r_k\|_\infty^{2p-2}} \right), \qquad (B.3)$$
dropping the arguments $c_2 = c_2(A, n, p, \eta)$ for brevity.
To handle the nonlinearity in the second term, we first look at the difference
$$e_k := r_k - \frac{r_k^{2p-1}}{\|r_k\|_\infty^{2p-2}}.$$
Looking at $e_k$ in terms of each component separately, and writing $|[r_k]_i| = \|r_k\|_\infty / \alpha_{k,i}$ for some $\alpha_{k,i} \ge 1$ (with $\alpha_{k,i} = \infty$ if $[r_k]_i = 0$), we get
$$\|e_k\|_2^2 = \sum_{i=1}^{n} \left( 1 - \frac{1}{\alpha_{k,i}^{2p-2}} \right)^2 [r_k]_i^2.$$
Note specifically that $\alpha_{k,i} = 1$ for the index $i$ for which $|[r_k]_i| = \|r_k\|_\infty$. Then for any $M > 1$, we have
$$\|e_k\|_2^2 = \sum_{i: \alpha_{k,i} < M} \left( 1 - \frac{1}{\alpha_{k,i}^{2p-2}} \right)^2 [r_k]_i^2 + \sum_{i: \alpha_{k,i} \ge M} \left( 1 - \frac{1}{\alpha_{k,i}^{2p-2}} \right)^2 [r_k]_i^2 \le \left( 1 - \frac{1}{M^{2p-2}} \right)^2 \sum_{i: \alpha_{k,i} < M} [r_k]_i^2 + \sum_{i: \alpha_{k,i} \ge M} [r_k]_i^2 = \|r_k\|_2^2 - \left( \frac{2}{M^{2p-2}} - \frac{1}{M^{4p-4}} \right) \sum_{i: \alpha_{k,i} < M} [r_k]_i^2.$$
Since $M > 1$, there is at least one $i$ with $\alpha_{k,i} < M$ (the index corresponding to $\|r_k\|_\infty$). So,
$$\sum_{i: \alpha_{k,i} < M} [r_k]_i^2 \ge \|r_k\|_\infty^2 \ge \frac{1}{n}\|r_k\|_2^2,$$
and hence
$$\|e_k\|_2^2 \le \left[ 1 - \frac{1}{n}\left( \frac{2}{M^{2p-2}} - \frac{1}{M^{4p-4}} \right) \right] \|r_k\|_2^2.$$
This bound is tightest as $M \to 1^+$, with $\frac{2}{M^{2p-2}} - \frac{1}{M^{4p-4}} \to 1$ in this case. So by taking $M$ sufficiently close to 1, we get
$$\|e_k\|_2 \le (1 - \epsilon)^{1/2} \|r_k\|_2,$$
for any $\epsilon < 1/n$. Returning to (B.3), we get
$$\|r_{k+1}\|_2 \le \left\| I - \frac{\eta}{c_2} A A^T \right\|_2 \|r_k\|_2 + \frac{\eta \|A\|_2^2}{c_2}(1-\epsilon)^{1/2}\|r_k\|_2 = \left[ \|I - \eta B\|_2 + \eta \|B\|_2 (1-\epsilon)^{1/2} \right] \|r_k\|_2,$$
where $B := A A^T / c_2$ has $\|B\|_2 \le 1$ since $c_2 \ge \|A\|_2^2$. Thus $\|r_k\|_2 \to 0$ Q-linearly provided $\|I - \eta B\|_2 + \eta \|B\|_2 (1-\epsilon)^{1/2} < 1$. Defining $\sigma_i > 0$ as the $i$th singular value of $B \in \mathbb{R}^{n \times n}$, since $\|B\|_2 \le 1$ and $B$ is full rank (since $A$ is full rank and $n \le d$), we know $0 < \sigma_n \le \sigma_1 \le 1$. Since $\eta \in (0, 1]$ by assumption, we have
$$\|I - \eta B\|_2 + \eta \|B\|_2 (1-\epsilon)^{1/2} = \max\{|1 - \eta\sigma_1|, |1 - \eta\sigma_n|\} + \eta(1-\epsilon)^{1/2}\sigma_1 = (1 - \eta\sigma_n) + \eta(1-\epsilon)^{1/2}\sigma_1.$$
This factor is in $(0, 1)$ provided $\eta\sigma_n - \eta(1-\epsilon)^{1/2}\sigma_1 \in (0, 1)$. Since $\eta, \sigma_n \le 1$, this holds if $\sigma_n > (1-\epsilon)^{1/2}\sigma_1$. Since $\kappa(A)^2 = \kappa(B) = \sigma_1/\sigma_n$, this is equivalent to $\kappa(A)^4 < 1/(1-\epsilon)$. Since $\epsilon < 1/n$ is arbitrary, this holds from $\kappa(A)^4 < n/(n-1)$ (Assumption 5.1).

C Additional Numerical Results

C.1 Adaptive Gradient Descent

The below shows the convergence of adaptive gradient descent (Malitsky & Mishchenko, 2020) on the same test problems as Section 6.
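For reference, the adaptive stepsize being compared can be sketched as follows. This is our paraphrase of the method of Malitsky & Mishchenko (2020), with $\lambda_0 = 10^{-10}$ matching the starting stepsize mentioned in Section 6; treat the exact update details as our reading of their algorithm rather than a verbatim transcription.

```python
import math

def adaptive_gd(grad, x0, lam0=1e-10, iters=200):
    """Adaptive gradient descent (our paraphrase of Malitsky & Mishchenko, 2020):
    lambda_k = min{ sqrt(1 + theta_{k-1}) * lambda_{k-1},
                    ||x_k - x_{k-1}|| / (2 ||grad_k - grad_{k-1}||) }."""
    x_prev, g_prev = list(x0), grad(x0)
    x = [xi - lam0 * gi for xi, gi in zip(x_prev, g_prev)]   # first (tiny) step
    lam_prev, theta = lam0, float("inf")                     # theta_0 = inf: first min
    for _ in range(iters):                                   # picks the curvature bound
        g = grad(x)
        dx = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prev)))
        dg = math.sqrt(sum((a - b) ** 2 for a, b in zip(g, g_prev)))
        if dg == 0:          # gradient unchanged: iterate is (near-)stationary
            break
        lam = min(math.sqrt(1 + theta) * lam_prev, dx / (2 * dg))
        x_prev, g_prev = x, g
        x = [xi - lam * gi for xi, gi in zip(x, g)]
        theta, lam_prev = lam / lam_prev, lam
    return x
```

On a simple quadratic, the local-curvature estimate immediately replaces the tiny $\lambda_0$ with a stepsize of the right order, so the iterates contract geometrically from the second iteration onwards.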
[Figure: two panels, (a) $f(x) = \|x\|_2^{2p}$ and (b) $f(x) = \|x\|_{2p}^{2p}$, each for $p = 1, \ldots, 5$, over roughly 100 iterations.]

Figure 3: Convergence $\|\nabla f(x_k)\| \to 0$ achieved by adaptive gradient descent (Malitsky & Mishchenko, 2020) for non-strongly convex functions with flat minima. Plots show $\|\nabla f(x_k)\|_2 / \|\nabla f(x_0)\|_2$ as a function of $k$.

C.2 Runtime Performance

In the below we compare the gradient decrease achieved by gradient descent (fixed $\eta$) and Algorithm 1 as in Section 6, but measured by runtime (total CPU time) rather than iterations.

[Figure: two panels, (a) $f(x) = \|x\|_2^{2p}$ with per-$p$ tuned stepsizes ($\eta = 10^{-2}$ for $p = 1, 2$; $\eta = 10^{-3}$ for $p = 3$; $\eta = 10^{-4}$ for $p = 4$; $\eta = 1.5 \times 10^{-5}$ for $p = 5$) and (b) $f(x) = \|x\|_{2p}^{2p}$ with $\eta = 10^{-2}$ for all $p$.]

Figure 4: Convergence rate $\|\nabla f(x_k)\| \to 0$ achieved by gradient descent with fixed stepsize for non-strongly convex functions with flat minima. Plots show $\|\nabla f(x_k)\|_2 / \|\nabla f(x_0)\|_2$ as a function of CPU time.

[Figure: two panels, (a) $f(x) = \|x\|_2^{2p}$ and (b) $f(x) = \|x\|_{2p}^{2p}$, each for $p = 1, \ldots, 5$.]

Figure 5: Convergence rate $\|\nabla f(x_k)\| \to 0$ achieved by Algorithm 1 for non-strongly convex functions with flat minima. Plots show $\|\nabla f(x_k)\|_2 / \|\nabla f(x_0)\|_2$ as a function of CPU time.

C.3 Sub-optimal Choice of $R_k$

The results for Algorithm 1 presented in Section 6 use the choices of $R_k$ given in Sections 4 and 5 (namely $R_k^{\text{optimal}} = \|g(x_k)\|_2$ for $f(x) = \|x\|_2^{2p}$ and $R_k^{\text{optimal}} = \|x_k\|_\infty$ for $f(x) = \|x\|_{2p}^{2p}$). Here we consider the performance of Algorithm 1 when $R_k$ is significantly mis-specified (i.e., changing the order of magnitude of $R_k$, not just constants).
[Figure: four panels, (a) $f(x) = \|x\|_2^{2p}$ with $R_k = \sqrt{R_k^{\text{optimal}}}$, (b) $f(x) = \|x\|_{2p}^{2p}$ with $R_k = \sqrt{R_k^{\text{optimal}}}$, (c) $f(x) = \|x\|_2^{2p}$ with $R_k = (R_k^{\text{optimal}})^2$, (d) $f(x) = \|x\|_{2p}^{2p}$ with $R_k = (R_k^{\text{optimal}})^2$, each for $p = 1, \ldots, 5$.]

Figure 6: Performance of Algorithm 1 with sub-optimal choices of $R_k$. Plots show $\|\nabla f(x_k)\|_2 / \|\nabla f(x_0)\|_2$ as a function of $k$.

In Figure 6c for $p \in \{3, 4, 5\}$, we note that the significant reduction in convergence rate is likely because the new $R_k$ is initially significantly too large. For $p \in \{1, 2\}$ we are in the regime where $R_k$ is too small (as $x_k \to 0$), and this appears to have minimal impact on the speed of convergence.