# Nonlinearly Preconditioned Gradient Methods under Generalized Smoothness

Konstantinos Oikonomidis 1 2, Jan Quan 1 2, Emanuel Laude 3, Panagiotis Patrinos 1 2

We analyze nonlinearly preconditioned gradient methods for solving smooth minimization problems. We introduce a generalized smoothness property, based on the notion of abstract convexity, that is broader than Lipschitz smoothness, and provide sufficient first- and second-order conditions. Notably, our framework encapsulates algorithms associated with the gradient clipping method and brings out novel insights for the class of (L0, L1)-smooth functions that has received widespread interest recently, thus allowing us to extend beyond already established methods. We investigate the convergence of the proposed method in both the convex and nonconvex settings.

1. Introduction and preliminaries

We consider minimization problems of the form

min_{x ∈ R^n} f(x), (1)

where f is a continuously differentiable and possibly nonconvex function. While gradient descent is a reliable solver for this type of problem, in many cases it does not fully take advantage of the properties of the cost function and requires well-tuned or costly stepsize strategies to converge. In this paper we thus focus on nonlinearly preconditioned gradient methods that are tailored to the properties of the cost function. Given stepsizes γ > 0 and λ > 0 and a starting point x^0 ∈ R^n, we consider the following iteration:

x^{k+1} = T_{γ,λ}(x^k) := x^k − γ ∇ϕ*(λ ∇f(x^k)), (2)

1 Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium. 2 Leuven.AI - KU Leuven Institute for AI, 3000 Leuven, Belgium. 3 Proxima Fusion GmbH, Flößergasse 2, 81369 Munich, Germany. Correspondence to: Konstantinos Oikonomidis.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
where ϕ* : R^n → R is convex and is called the dual reference function. Its convex conjugate ϕ is called the reference function, and its properties are crucial to our analysis. This general framework was originally analyzed in (Maddison et al., 2021) under the so-called dual relative smoothness condition for convex and essentially smooth functions. In the general nonconvex and composite nonsmooth setting, it was studied in (Laude & Patrinos, 2025) under a condition called the anisotropic descent property, which was itself first introduced in (Laude et al., 2023) and can be regarded as a globalization of anisotropic prox-regularity (Laude, 2021) for smooth functions. Our analysis mainly focuses on two types of reference functions, both generated by a kernel function h : R → R ∪ {+∞}, thus resulting in two different families of algorithms. The first one is given by composition with the Euclidean norm, i.e., ϕ = h ∘ ‖·‖, and is referred to as the isotropic reference function. In this case, the main iteration (2) takes the form of gradient descent with a scalar stepsize that depends on the iterates. The second is a separable sum obtained via ϕ(x) = Σ_{i=1}^n h(x_i), henceforth called the anisotropic reference function. In this case, (2) becomes gradient descent with a coordinate-wise stepsize that depends on the iterates of the algorithm. We remark that although the anisotropic reference function makes the analysis of the method more challenging, it generates more interesting algorithms, akin to the ones that are often used in practice.

1.1. Motivation

Unifying framework for clipping algorithms. Gradient clipping and signed gradient methods have garnered attention in recent years due to their efficiency in neural network training and other applications (Bernstein et al., 2018; Gorbunov et al., 2020; Zhang et al., 2020a;b;c; Koloskova et al., 2023; Kunstner et al., 2024).
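Before turning to the motivation, the abstract iteration (2) together with the two reference-function families just described can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `precond_step`, `isotropic` and `anisotropic` are hypothetical helper names, and `hs_prime` stands for the scalar map (h*)'.

```python
import numpy as np

def precond_step(x, grad, gamma, lam, dual_grad):
    """One step of iteration (2): x+ = x - gamma * grad_phi_star(lam * grad)."""
    return x - gamma * dual_grad(lam * grad)

def anisotropic(hs_prime):
    """Anisotropic preconditioner: grad_phi_star(y)_i = (h*)'(y_i), coordinate-wise."""
    return lambda y: hs_prime(y)  # hs_prime acts elementwise on the array y

def isotropic(hs_prime):
    """Isotropic preconditioner: grad_phi_star(y) = (h*)'(||y||) * y/||y||."""
    def dual_grad(y):
        n = np.linalg.norm(y)
        return hs_prime(n) * y / n if n > 0 else np.zeros_like(y)
    return dual_grad
```

For instance, with the kernel h = cosh − 1 one has (h*)' = arcsinh (see Table 1), so `isotropic(np.arcsinh)` yields a gradient step whose scalar stepsize shrinks as the gradient grows.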
The intuition behind gradient clipping is straightforward: by clipping, one does not allow potentially very large (stochastic) gradients to hinder the training. Nevertheless, in many cases the clipping threshold and stepsize must be carefully tuned in practice, otherwise leading to suboptimal performance (Koloskova et al., 2023). While algorithms of this type have been analyzed under various smoothness and stochasticity assumptions, there does not seem to exist a simple unifying framework that encapsulates them. Motivated by this gap, we propose a framework that provides further insights into existing methods but also naturally generates new algorithms.

Table 1. Examples of kernel functions along with the generated preconditioner and its (generalized) derivative in one dimension. The last column indicates whether Assumption 1.1 and Assumption 1.2 are satisfied, respectively.

| h(x) | dom h | (h*)'(y) | (h*)''(y) | Assumption 1.1/1.2 |
| --- | --- | --- | --- | --- |
| cosh(x) − 1 | R | arcsinh(y) | 1/√(1 + y²) | ✓ / ✓ |
| exp(\|x\|) − \|x\| − 1 | R | ln(1 + \|y\|) sgn(y) | 1/(1 + \|y\|) | ✓ / ✓ |
| −\|x\| − ln(1 − \|x\|) | (−1, 1) | y/(1 + \|y\|) | 1/(1 + \|y\|)² | ✓ / ✓ |
| 1 − √(1 − x²) | [−1, 1] | y/√(1 + y²) | (1 + y²)^{−3/2} | ✓ / ✓ |
| x arctanh(x) − ln(cosh(arctanh(x))) | (−1, 1) | tanh(y) | 1 − tanh²(y) | ✓ / ✓ |
| ½x² + δ_{[−1,1]}(x) | [−1, 1] | min(1, max(−1, y)) | ∂_C(Π_{[−1,1]})(y) | ✓ / ✗ |

Majorization-minimization and Φ-convexity. A plethora of well-known optimization algorithms belong to the so-called majorization-minimization framework, in that they are generated by successively minimizing upper bounds of the objective function. As a classical example, under Lipschitz smoothness of f, the celebrated gradient descent method with stepsize 1/L iteratively minimizes the following (global) quadratic upper bound around the current point x̄ ∈ R^n:

f(x) ≤ f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (L/2)‖x − x̄‖².

It is straightforward that this inequality can be written as

f(x) ≤ f(x̄) + (1/L)ϕ(L(x − y)) − (1/L)ϕ(L(x̄ − y)), (3)

with ϕ = ½‖·‖² and y = T_{L⁻¹,1}(x̄) = x̄ − (1/L)∇ϕ*(∇f(x̄)).
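Returning to Table 1, the closed-form preconditioners can be sanity-checked numerically. The sketch below (not the authors' code; the dictionary keys are hypothetical labels) uses the fact that (h*)' is the inverse of h', so composing the two must give the identity on R.

```python
import numpy as np

# Pairs (h', (h*)') for four kernels from Table 1.
pairs = {
    # h(x) = cosh(x) - 1        -> h'(x) = sinh(x),              (h*)'(y) = arcsinh(y)
    "cosh": (np.sinh, np.arcsinh),
    # h(x) = exp|x| - |x| - 1   -> h'(x) = (e^|x| - 1) sgn(x),   (h*)'(y) = ln(1+|y|) sgn(y)
    "exp":  (lambda x: (np.exp(np.abs(x)) - 1) * np.sign(x),
             lambda y: np.log1p(np.abs(y)) * np.sign(y)),
    # h(x) = -|x| - ln(1-|x|)   -> h'(x) = x/(1-|x|),            (h*)'(y) = y/(1+|y|)
    "log":  (lambda x: x / (1 - np.abs(x)),
             lambda y: y / (1 + np.abs(y))),
    # h(x) = 1 - sqrt(1-x^2)    -> h'(x) = x/sqrt(1-x^2),        (h*)'(y) = y/sqrt(1+y^2)
    "sqrt": (lambda x: x / np.sqrt(1 - x * x),
             lambda y: y / np.sqrt(1 + y * y)),
}

# Since (h*)' = (h')^{-1}, composing must reproduce the input exactly.
ys = np.linspace(-4.0, 4.0, 81)
for name, (hp, hsp) in pairs.items():
    assert np.allclose(hp(hsp(ys)), ys)
```

Note how each (h*)' is odd, vanishes at 0, and is bounded or sublinear: this is the sigmoid shape mentioned above.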
Going beyond the standard Lipschitzian assumptions, it is natural to consider reference functions ϕ that generate less restrictive descent inequalities thus allowing us to efficiently tackle more general problems. This is exactly the anisotropic descent property (Laude & Patrinos, 2025, Definition 3.1) and minimizing this upper bound leads to the algorithm described in (2) (for λ fixed). Considering (3), it is natural to examine reference functions that are strongly convex and grow faster than a quadratic in order to obtain a less restrictive descent inequality. It turns out that in many interesting cases, the preconditioner in (2) becomes similar to a sigmoid function and the algorithmic step takes the form of popular algorithms. We present examples of kernel functions and the corresponding preconditioners in Table 1. The abstraction discussed in the previous paragraph is in fact tightly connected to the notion of Φ-convexity (also known as c-concavity in the optimal transport literature), which states that a function is Φ-convex if it can be written as the pointwise supremum over a family of nonlinear functions. Similarly to the fact that every proper, lsc and convex function can be expressed as a pointwise supremum of its affine minorizers (Rockafellar & Wets, 1998, Theorem 8.13), anisotropic smoothness then requires that f is a pointwise supremum of nonlinear minorizers. This fact is once again in parallel to classical L-smoothness, which requires that f is the pointwise supremum over concave quadratics, and leads to an envelope representation of f that is useful in studying the corresponding calculus. In that regard, anisotropic smoothness is a straightforward extension of Lipschitz smoothness. A visualization of the concept of Φ-convexity is shown in Figure 1. Figure 1. Visualization of the quadratic upper bounds of the function f(x) at various points. 
By flipping the figure it can be seen that f is a Φ-convex function: it is the pointwise supremum over concave quadratics of the form −(L/2)‖x − y‖² + β, with y ∈ R^n and β ∈ R. Note that this function is not convex in the classical sense, as there are no linear functions supporting it.

1.2. Our contribution

Our approach departs from and improves upon existing works in the following aspects.

- We describe a common nonlinear gradient preconditioning scheme for the main iterates (i.e., without momentum or exponential moving average mechanisms) of popular algorithms including gradient clipping, Adam, Adagrad and recently introduced methods for (L0, L1)-smoothness. These preconditioners are gradients of smooth convex functions and have a sigmoid shape, reminiscent of common activation functions in neural networks.
- We introduce (L, L̄)-anisotropic smoothness, which extends (Laude & Patrinos, 2025) by allowing for two constants and a reference function ϕ of possibly non-full domain, and we prove that it is less restrictive than L-smoothness.
- Through a novel technique we study necessary and sufficient first- and second-order conditions for (L, L̄)-anisotropic smoothness, in the process obtaining novel characterizations of the forward operator T_{γ,λ}, which in turn lead to new insights into the class of (L0, L1)-smooth functions.
- We analyze the convergence of (2) in the nonconvex setting, obtaining new results for our stationarity measure. In the convex setting, we prove a sublinear rate of the method for a large family of isotropic reference functions, utilizing only a simple dual characterization. In the more challenging case where ϕ is anisotropic, we present an unconventional proof that is based on the envelope representation of anisotropic smoothness. We are thus able to obtain standard O(1/K) convergence rates for large classes of functions.

1.3. Related work

Dual space preconditioning and anisotropic smoothness.
The scheme described in (2) was originally introduced in (Maddison et al., 2021) in the convex setting, where it was analyzed under a condition called dual relative smoothness, for which sufficient second-order conditions were provided. In (Laude & Patrinos, 2025) the anisotropic smoothness condition was studied, which was shown to naturally lead to the convergence of the method in the nonconvex and proximal case. Moreover, in (Léger & Aubin-Frankowski, 2023) the scheme was also analyzed under the general framework of Φ-convexity and a sufficient second-order condition for anisotropic smoothness was provided. Nevertheless, this requires that ϕ ∈ C⁴(R^n) satisfies the non-negative cross-curvature condition (NNCC) from optimal transport (see (Figalli et al., 2011, Assumption (B3)) and (Léger & Aubin-Frankowski, 2023, Definition 2.8)), which is a strong assumption that does not hold for many interesting reference functions. An accelerated version of the method for Lipschitz smooth problems was introduced and studied in (Kim et al., 2023). Recently, the method was also extended to measure spaces in (Bonet et al., 2024). Furthermore, a relaxed proximal point algorithm with nonlinear preconditioning akin to (2) for solving monotone inclusion problems was studied in (Laude & Patrinos, 2023).

Generalized smoothness. Our work is also connected to other notions of generalized smoothness, i.e., descent inequalities beyond the standard Lipschitzian assumptions. To begin with, Bregman (relative) smoothness is a popular extension of Lipschitz smoothness (Bauschke et al., 2017a; Lu et al., 2018; Bolte et al., 2018; Ahookhosh et al., 2021) that can encapsulate a wide variety of functions, such as those whose Hessians exhibit a certain polynomial growth (Lu et al., 2018).
Other notions of generalized smoothness include Hölder smoothness (Bredies, 2008; Nesterov, 2015) and also higher-order smoothness, where higher-order derivatives of f are Lipschitz continuous (Nesterov & Polyak, 2006; Doikov & Nesterov, 2020). Recently, a new concept of smoothness has been introduced in order to capture the cases where the norm of the Hessian is upper bounded by some function of the norm of the gradient (Zhang et al., 2020b; Chen et al., 2020; Li et al., 2024). This condition, which in its most popular form is called (L0, L1)-smoothness (Zhang et al., 2020b, Definition 1), has received widespread attention. Various existing methods have been analyzed under this new smoothness condition (Wang et al., 2023; Faw et al., 2023; Koloskova et al., 2023), while new ones have also been proposed (Gorbunov et al., 2024; Vankov et al., 2024). We remark that a number of these algorithms can actually be obtained from (2) via a suitable choice of the reference function. Nevertheless, it is important to note that, in contrast to the aforementioned types of smoothness, anisotropic smoothness is not obtained via a linearization of the cost function around a point and thus it is not straightforward to compare the obtained descent lemmas.

1.4. Notation

We denote by ⟨·, ·⟩ the standard Euclidean inner product on R^n and by ‖·‖ the standard Euclidean norm on R^n as well as the spectral norm for matrices. For a square matrix A with real spectrum, λ_max(A) and λ_min(A) denote the largest and smallest eigenvalue respectively. We denote by C^k(Y) the class of functions which are k times continuously differentiable on an open set Y ⊆ R^n. For a proper function f : R^n → R ∪ {+∞} and λ ≥ 0 we define the epi-scaling (λ ⋆ f)(x) = λf(λ⁻¹x) for λ > 0 and (λ ⋆ f)(x) = δ_{{0}}(x) otherwise.
We adopt the notions of essential smoothness, essential strict convexity and Legendre functions from (Rockafellar, 1997, Section 26): we say that a proper, lsc and convex function f : R^n → R ∪ {+∞} is essentially smooth if int(dom f) ≠ ∅ and f is differentiable on int(dom f) with ‖∇f(x^ν)‖ → ∞ whenever int(dom f) ∋ x^ν → x ∈ bdry(dom f); essentially strictly convex if f is strictly convex on every convex subset of dom ∂f; and Legendre if f is both essentially smooth and essentially strictly convex. In particular, a smooth convex function on R^n is essentially smooth. For a locally Lipschitz function F : R^n → R^n, we denote the (Clarke) generalized Jacobian by ∂_C F(x) = conv{lim_{x^i → x} ∇F(x^i) : x^i ∉ Ω_F}, where Ω_F is the set of points where F fails to be differentiable. Π_C denotes the projection onto a set C. For f ∈ C²(R^n) we say that it is (L0, L1)-smooth for some L0, L1 > 0 if it holds that ‖∇²f(x)‖ ≤ L0 + L1‖∇f(x)‖ for all x ∈ R^n. Otherwise we adopt the notation from (Rockafellar & Wets, 1998). For clarity of exposition, for a vector x ∈ R^n we consider the function sgn(x) = x/‖x‖ for x ∈ R^n \ {0} and sgn(0) = 0.

1.5. Assumptions on the reference function

Our assumptions on the reference function ϕ are formulated as follows.

Assumption 1.1. The reference function ϕ : R^n → R ∪ {+∞} is proper, lsc, strongly convex and even, i.e., ϕ(x) = ϕ(−x), with ϕ(0) = 0.

Assumption 1.1 is considered valid throughout the paper. Note that, through the duality between strong convexity and Lipschitz smoothness, it implies that ϕ* ∈ C¹(R^n). Moreover, from (Bauschke et al., 2017b, Proposition 11.7), {0} = argmin ϕ and thus ϕ ≥ 0. Throughout the paper we also consider specifically the case where ϕ* ∈ C²(R^n), for which we encode a sufficient condition in the following assumption, in light of (Rockafellar, 1977, p. 42).

Assumption 1.2. ϕ ∈ C²(int dom ϕ) is essentially smooth.

It is important to note that through (Rockafellar, 1977, p.
42), under Assumptions 1.1 and 1.2, the Hessian matrix of ϕ* is positive definite everywhere. Although we state both assumptions for the reference function ϕ, we also use them throughout the paper, by a slight abuse of notation, for the kernel function h which generates ϕ. When considering a kernel function h, in the anisotropic case it is straightforward that ϕ inherits the properties of h and the preconditioner takes the form ∇ϕ*(x) = ((h*)'(x_1), ..., (h*)'(x_n)). In the isotropic case, the differentiability of ϕ* depends on the properties of h, as we show next.

Lemma 1.3. Let h : R → R ∪ {+∞} satisfy Assumption 1.1. Then h* ≥ 0 is an even function, increasing on R₊, with h*(0) = 0. Moreover, ϕ = h ∘ ‖·‖ is strongly convex, ϕ* = h* ∘ ‖·‖ and ∇ϕ*(y) = (h*)'(‖y‖) sgn(y) for all y ∈ R^n. If, furthermore, h satisfies Assumption 1.2, then ϕ* ∈ C²(R^n).

We provide examples of interesting kernel functions, along with the assumptions that they satisfy, in Table 1.

1.6. Connections with existing methods

As already mentioned, the scheme presented in (2) encompasses the basic iterations of various algorithms that are widely used in practice. In this subsection we thus provide some examples that showcase the generalizing properties of our framework.

Example 1.4. The standard gradient descent method can be obtained from (2) by choosing ϕ = ½‖·‖².

Example 1.5. Let ϕ(x) = Σ_{i=1}^n (1 − √(1 − x_i²)). Then, (2) becomes

x_i^{k+1} = x_i^k − γ ∂_i f(x^k) / √(1/λ² + (∂_i f(x^k))²),

and by choosing λ = ε^{−1/2} for some ε > 0 we retrieve the form of Adagrad (Duchi et al., 2011) without memory from (Défossez et al., 2022, Equation (3)) with β1 = β2 = 0.

Example 1.6. Let ϕ(x) = Σ_{i=1}^n (−|x_i| − ln(1 − |x_i|)). Then, the main iterate (2) takes the following form:

x_i^{k+1} = x_i^k − γ ∂_i f(x^k) / (1/λ + |∂_i f(x^k)|),

and by choosing λ = ε⁻¹ for ε > 0, we retrieve the iterates of Adam (Kingma & Ba, 2014, Algorithm 1) where both exponential decay rates are set to 0, i.e., β1 = β2 = 0.

Example 1.7. Let h(x) = ½x² + δ_{[−1,1]}(x) and ϕ = h ∘ ‖·‖. Then, (2) becomes

x^{k+1} = x^k − min(γ/‖∇f(x^k)‖, γλ) ∇f(x^k).
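The updates of Examples 1.5-1.7 can be sketched as follows (an illustrative sketch with hypothetical helper names, not the paper's code; `g` denotes the gradient ∇f(x^k)):

```python
import numpy as np

def adagrad_like_step(x, g, gamma, eps):
    """Example 1.5 (memoryless Adagrad), with lam = eps**(-0.5):
    x_i+ = x_i - gamma * g_i / sqrt(eps + g_i^2)."""
    return x - gamma * g / np.sqrt(eps + g * g)

def adam_like_step(x, g, gamma, eps):
    """Example 1.6 (memoryless Adam), with lam = 1/eps:
    x_i+ = x_i - gamma * g_i / (eps + |g_i|)."""
    return x - gamma * g / (eps + np.abs(g))

def clipped_step(x, g, gamma, lam):
    """Example 1.7 (gradient clipping): step of length min(gamma, gamma*lam*||g||)."""
    n = np.linalg.norm(g)
    if n == 0.0:
        return x
    return x - min(gamma / n, gamma * lam) * g
```

All three are instances of (2); only the kernel h generating the reference function changes.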
Note that the gradient clipping method as presented in (Zhang et al., 2020b, Equation (5)) is given by

x^{k+1} = x^k − min{η_c, γ_c η_c/‖∇f(x^k)‖} ∇f(x^k).

Therefore, by choosing the stepsizes in (2) as γ = γ_c η_c and λ = 1/γ_c, we can see that (2) encompasses the gradient clipping method.

2. The extended anisotropic descent inequality

In this section we extend the definition of anisotropic smoothness from (Laude & Patrinos, 2025) to our setting, where ϕ is potentially nonsmooth, and provide sufficient conditions for a smooth function f to satisfy this generalized descent inequality. The proofs can be found in Appendix B. We begin with the definition of our extension of anisotropic smoothness.

Definition 2.1 ((L, L̄)-anisotropic smoothness). Let f ∈ C¹(R^n). We say that f is (L, L̄)-anisotropically smooth relative to ϕ with constants L, L̄ > 0 if for all x, x̄ ∈ R^n

f(x) ≤ f(x̄) + L̄[(L⁻¹ ⋆ ϕ)(x − y) − (L⁻¹ ⋆ ϕ)(x̄ − y)], (4)

where y = T_{L⁻¹,L̄⁻¹}(x̄) = x̄ − L⁻¹∇ϕ*(L̄⁻¹∇f(x̄)).

Note that inequality (4) is well-defined for ϕ without full domain, but it does not provide information for points x ∈ R^n such that L(x − y) ∉ dom ϕ. The intuition behind this extension of (Laude & Patrinos, 2025) is straightforward: we allow for two different smoothness constants that play a complementary role, while allowing dom ϕ ≠ R^n leads to a more general descent inequality. It can be checked that for dom ϕ = R^n, Definition 2.1 reduces to (Laude & Patrinos, 2025, Definition 3.1) but w.r.t. L̄ϕ. Our first result is an extension of the envelope representation of f under anisotropic smoothness (Laude & Patrinos, 2025, Proposition 4.1) to our setting, where ϕ possibly does not have full domain.

Proposition 2.2. Let f : R^n → R satisfy Definition 2.1. Then, f(x) = inf_{y ∈ R^n} {L̄(L⁻¹ ⋆ ϕ)(x − y) + ξ(y)} for some proper ξ : R^n → R ∪ {+∞}. Moreover, f* = ψ + L⁻¹(L̄ ⋆ ϕ*) for some lsc and convex ψ : R^n → R ∪ {+∞}, implying that f* − L⁻¹(L̄ ⋆ ϕ*) is convex.
Note that Proposition 2.2 describes one direction of a conjugate duality between anisotropic smoothness and Bregman (relative) strong convexity (Lu et al., 2018, Definition 1.2). This result, along with the envelope representation of f, will be utilized later on in order to describe the convergence of the method in the convex setting. In this paper we are mostly interested in cost functions that are not covered by the classical L-smoothness assumption, and we thus study reference functions ϕ that generate a less restrictive descent inequality. Therefore, we next show that for strongly convex reference functions, the class of anisotropically smooth functions is at least as large as that of Lipschitz smooth ones.

Proposition 2.3. Let f : R^n → R be Lipschitz smooth with modulus L_f and let ϕ satisfy Assumption 1.1 with strong convexity parameter µ. Then f is (L_f/µ, 1)-anisotropically smooth relative to ϕ.

2.1. Second-order sufficient conditions

In contrast to Euclidean or Bregman smoothness, anisotropic smoothness cannot in general be directly obtained via a second-order condition. This fact becomes apparent by noting that (4) is equivalent to x̄ ∈ argmin_x g(x) := L̄(L⁻¹ ⋆ ϕ)(x − y) − f(x). The second-order necessary condition for the minimality of x̄ under Assumption 1.2 then becomes

L̄L ∇²ϕ(∇ϕ*(L̄⁻¹∇f(x̄))) − ∇²f(x̄) ⪰ 0,

which does not normally imply the convexity of g. From the implicit function theorem, the above expression can be written as

L̄L [∇²ϕ*(L̄⁻¹∇f(x̄))]⁻¹ − ∇²f(x̄) ⪰ 0,

which is the form that we consider throughout the paper. This condition becomes sufficient when f, ϕ ∈ C²(R^n) are Legendre, through (Laude & Patrinos, 2025, Proposition 4.1). Harnessing the connection with Φ-convexity, it is sufficient for general f when ϕ is a regular optimal transport cost, in light of (Villani, 2008, Theorem 12.46).
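The matrix [∇²ϕ*(·)]⁻¹ appearing in this condition admits simple closed forms for the two reference-function families (these are spelled out in Section 2.2). A minimal numerical sketch, with hypothetical helper names, where `hs1` plays the role of (h*)' and `hs2` of (h*)'':

```python
import numpy as np

def inv_hess_aniso(y, hs2):
    """Anisotropic case: [Hess phi*(y)]^{-1} is diagonal with entries 1/(h*)''(y_i)."""
    return np.diag(1.0 / hs2(y))

def inv_hess_iso(y, hs1, hs2):
    """Isotropic case phi* = h* o ||.||: inverse Hessian splits along y and its
    orthogonal complement."""
    n = np.linalg.norm(y)
    d = len(y)
    if n == 0.0:
        return np.eye(d) / hs2(0.0)
    P = np.outer(y, y) / n**2  # projector onto span(y)
    return P / hs2(n) + (n / hs1(n)) * (np.eye(d) - P)
```

For the kernel h = cosh − 1 one would take hs1 = arcsinh and hs2(t) = 1/√(1 + t²), and the product of this matrix with the (numerical) Hessian of ϕ* is the identity.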
Nevertheless, the regularity of ϕ is in general hard to verify, since its equivalent form requires the computation of fourth-order derivatives (Villani, 2008, Definition 12.27), and it is quite restrictive, failing for many interesting functions. We thus follow a different strategy and study the minimization of g using tools from optimization and nonsmooth analysis.

Definition 2.4. Let f ∈ C²(R^n). We say that f satisfies the second-order characterization for (L, L̄)-anisotropic smoothness if for all x ∈ R^n and all H ∈ ∂_C(∇ϕ*)(L̄⁻¹∇f(x)),

λ_max(H ∇²f(x)) < L̄L (5)

and lim_{‖x‖→∞} ‖T_{L⁻¹,L̄⁻¹}(x)‖ = ∞.

In particular, (5) reduces to λ_max(∇²ϕ*(L̄⁻¹∇f(x)) ∇²f(x)) < L̄L under Assumption 1.2. Note that, since ϕ* is Lipschitz smooth and convex, from (Hiriart-Urruty et al., 1984, Example 2.2) we know that H is always a positive semidefinite matrix, and thus from (Horn & Johnson, 2012, Theorem 1.3.22) with A = H^{1/2}∇²f(x) and B = H^{1/2} we have that H∇²f(x) has real eigenvalues. The generalized Jacobian considered in the definition above can in fact be computed for many interesting reference functions. As an example, we study inequality (5) for the reference function of Example 1.7 in Appendix D.2. The coercivity assumption on the forward operator in Definition 2.4 is very mild. For example, consider a reference function ϕ with dom ϕ ⊆ B(0, 1), such as the isotropic ones generated by the kernels in the last four rows of Table 1. Then, by standard convex conjugacy, ‖∇ϕ*‖ ≤ 1 and lim_{‖x‖→∞} ‖T_{γ,λ}(x)‖ = ∞ always. We provide more results regarding the norm-coercivity property of T_{L⁻¹,L̄⁻¹} in Appendix D.1. It is important to note that when the matrix H∇²f(x) is symmetric, we can remove this extra condition on the forward operator, as we show in Appendix D. Under Assumption 1.2 we obtain a condition that is more straightforward to check.

Lemma 2.5. Let ϕ satisfy Assumption 1.2 and f ∈ C²(R^n). Then, (5) holds if and only if for all x ∈ R^n

∇²f(x) ≺ L̄L [∇²ϕ*(L̄⁻¹∇f(x))]⁻¹.
(6)

A sufficient condition for (6) is given by

λ_max(∇²f(x)) < L̄L λ_min([∇²ϕ*(L̄⁻¹∇f(x))]⁻¹). (7)

Next, we show that under Definition 2.4 the forward operator is actually a global homeomorphism (Facchinei & Pang, 2003, Definition 2.1.9), which we further utilize later on in proving our main result regarding the sufficiency of the second-order condition, Proposition 2.9.

Proposition 2.6. Let f ∈ C²(R^n) satisfy Definition 2.4. Then, T_{L⁻¹,L̄⁻¹} is injective.

As a byproduct of our analysis, we obtain novel insights into the class of C²(R^n) (L0, L1)-smooth functions.

Corollary 2.7. Let f ∈ C²(R^n) be (L0, L1)-smooth. Then, T_{δL1⁻¹, L1/L0}(x) = x − δ/(L0 + L1‖∇f(x)‖) ∇f(x) is monotone for δ = 1 and strongly monotone for 0 < δ < 1.

Remark 2.8. Considering the forward operator as defined in Corollary 2.7, the main iteration (2) becomes:

x^{k+1} = x^k − δ/(L0 + L1‖∇f(x^k)‖) ∇f(x^k). (8)

Note that in this case, (Gorbunov et al., 2024, Algorithm 1) can be viewed as (2) with a conservative choice of δ ≈ 0.57. This algorithm is in fact generated by ϕ = h ∘ ‖·‖ with h(x) = −|x| − ln(1 − |x|). Although Corollary 2.7 establishes new characterizations for (L0, L1)-smoothness, the simplification used in the proof of the second-order condition is quite restrictive. We provide examples where we can obtain tighter constants utilizing Definition 2.4 directly in Appendix D.

Having obtained a sufficient condition for the injectivity of the forward operator in Proposition 2.6, we now move on to providing sufficient conditions for f to be (L, L̄)-anisotropically smooth.

Proposition 2.9. Let f ∈ C¹(R^n) and let T_{L⁻¹,L̄⁻¹} be injective. Let moreover either (i) dom ϕ be bounded, or (ii) dom ϕ = R^n and the following growth condition hold:

f(x) ≤ L̄(r⁻¹ ⋆ ϕ)(x) − β for all x ∈ R^n

and some r, β ∈ R such that 0 < r < L. Then, f is (L, L̄)-anisotropically smooth relative to ϕ.

Remark 2.10. The growth condition assumed in Proposition 2.9 when dom ϕ = R^n is in fact not restrictive.
Consider ϕ = cosh(‖·‖) − 1 and let f be bounded above by some polynomial of the norm. Then, lim_{‖x‖→∞} L̄(r⁻¹ ⋆ ϕ)(x) − f(x) = +∞ for any fixed constants L̄, r, implying that L̄(r⁻¹ ⋆ ϕ)(x) − f(x) is lower bounded. By combining Corollary 2.7 and Proposition 2.9 we can easily obtain the following result that describes the relation between (L, L̄)-anisotropic smoothness and (L0, L1)-smoothness.

Corollary 2.11. Let f ∈ C²(R^n) be (L0, L1)-smooth. Then, f is (δ⁻¹L1, L0/L1)-anisotropically smooth relative to ϕ(x) = −‖x‖ − ln(1 − ‖x‖) for any δ ∈ (0, 1).

2.2. How to compute the second-order condition

In this subsection we demonstrate how the second-order condition of Definition 2.4 can be computed for the two different instances of our preconditioned scheme. We consider kernel functions that satisfy Assumptions 1.1 and 1.2, implying that the results of Lemma 1.3 hold, thus lifting the need to compute generalized Jacobians.

Anisotropic reference functions. Thanks to the separability of ϕ, the condition is simple to compute, since [∇²ϕ*(L̄⁻¹∇f(x))]⁻¹ is just a diagonal matrix with elements

α_ii = 1/(h*)''(L̄⁻¹∂_i f(x)). (9)

Isotropic reference functions. As already mentioned in the introduction, in the isotropic case ϕ = h ∘ ‖·‖ the differentiability properties of ϕ* depend on those of h*. In the setting considered in this subsection, though, we have ∇ϕ*(y) = (h*)'(‖y‖) sgn(y) with ϕ* ∈ C²(R^n), and the inverse Hessian is given by

[∇²ϕ*(y)]⁻¹ = (1/(h*)''(‖y‖)) yyᵀ/‖y‖² + (‖y‖/(h*)'(‖y‖)) (I − yyᵀ/‖y‖²)

for y ∈ R^n \ {0} and [∇²ϕ*(y)]⁻¹ = (1/(h*)''(0)) I otherwise. We provide examples of this condition for kernel functions displayed in Table 1 in Appendix D.

3. Algorithmic analysis

In this section we study the convergence properties of the method in the nonconvex and convex settings. The following assumption is considered valid throughout the remainder of the paper. The proofs of this section are deferred to Appendix C.

Assumption 3.1. f ∈ C¹(R^n) is (L, L̄)-anisotropically smooth relative to ϕ and f_* = inf f > −∞.

3.1.
Nonconvex setting

The convergence of the method to stationary points, in the composite case with an additional nonconvex, nonsmooth term, was established in (Laude & Patrinos, 2025, Theorem 5.3) with a fixed stepsize γ ≤ L⁻¹ and in (Laude & Patrinos, 2025, Theorem 5.5) using an adaptive linesearch strategy for choosing γ, always under the assumption that ϕ has full domain. In the smooth setting studied in this paper, where ϕ is also even, we can improve upon the aforementioned results and show that stepsizes γ up to 2L⁻¹ can actually be used in the algorithm.

Figure 2. Minimizing (1/4)‖x‖⁴ using (2). The figure on the left corresponds to ϕ1(x) = cosh(‖x‖) − 1, the middle one to ϕ2(x) = exp(‖x‖) − ‖x‖ − 1 and the one on the right to ϕ3(x) = −‖x‖ − ln(1 − ‖x‖). We choose values of L̄, set λ = L̄⁻¹ and then compute γ = L⁻¹ with L as in Appendix D.

Theorem 3.2. Let Assumption 3.1 hold and let {x^k}_{k∈N0} be the sequence of iterates generated by (2) with γ = αL⁻¹, α ∈ (0, 2), λ = L̄⁻¹, and let β = 1 − |1 − α|. Then we have the following rate:

min_{0≤k≤K} ϕ(∇ϕ*(L̄⁻¹∇f(x^k))) ≤ L(f(x^0) − f_*) / (L̄β(K + 1)).

Using the result of Theorem 3.2 and specifying the reference function ϕ, we can obtain convergence guarantees for the standard stationarity measure ‖∇f(x^k)‖. For ϕ = cosh(‖·‖) − 1, this is captured in the following corollary.

Corollary 3.3. Let Assumption 3.1 hold, ϕ = cosh(‖·‖) − 1 and {x^k}_{k∈N0} be the sequence of iterates generated by (2) with γ = αL⁻¹, α ∈ (0, 2), λ = L̄⁻¹ and β = 1 − |1 − α|. Then, the following holds for P0 = f(x^0) − f_*:

min_{0≤k≤K} ‖∇f(x^k)‖ ≤ √(2L̄L P0/(β(K + 1))) + L P0/(β(K + 1)).

3.2. Convex setting

In the convex setting, Proposition 2.2 establishes a useful connection between anisotropic smoothness and the convexity of f* − L⁻¹(L̄ ⋆ ϕ*), i.e., the strong convexity of f* relative to L̄ ⋆ ϕ* with constant L⁻¹ in the Bregman sense.
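Before proceeding, the nonconvex guarantee above is easy to observe numerically. The sketch below runs (2) with the isotropic reference ϕ = cosh(‖·‖) − 1 on f(x) = ¼‖x‖⁴ (as in Figure 2, but in two dimensions); the choice γ = λ = 1 is purely illustrative and is not the stepsize rule of Appendix D.

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x) = 0.25 * ||x||^4, i.e. grad f(x) = ||x||^2 * x."""
    return np.dot(x, x) * x

def run(x0, gamma, lam, K):
    """Iteration (2) with the isotropic reference phi = cosh(||.||) - 1,
    for which grad phi*(y) = arcsinh(||y||) * y/||y||."""
    x, gnorms = np.array(x0, dtype=float), []
    for _ in range(K):
        g = grad_f(x)
        n = np.linalg.norm(g)
        gnorms.append(n)
        if n > 0:
            x = x - gamma * np.arcsinh(lam * n) * g / n
    return x, gnorms
```

Running it, the stationarity measure ‖∇f(x^k)‖ decays as Theorem 3.2 predicts for admissible stepsizes.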
We utilize this connection in Proposition 3.5 and Theorem 3.6 in order to obtain standard sublinear convergence rates for the suboptimality gap in the isotropic case. Henceforth we make the following standard assumption, which we consider valid throughout the rest of the paper unless stated otherwise.

Assumption 3.4. argmin f ≠ ∅.

Our first result is the following novel characterization regarding the minimizers of f.

Proposition 3.5. Let x ∈ R^n and x* ∈ argmin f. Moreover, let f be (L, L̄)-anisotropically smooth relative to ϕ and convex. Then, the following inequality holds:

⟨∇f(x), x − x*⟩ ≥ L⁻¹⟨∇ϕ*(L̄⁻¹∇f(x)), ∇f(x)⟩.

Obtaining sublinear rates for the function values is not a straightforward task. To the best of our knowledge, there do not exist such guarantees in the full generality of the setting we consider in this paper. Proposition 3.5 is useful in that regard, since it allows us to show an O(1/K) rate for the suboptimality gap in the isotropic case, as we show next.

Theorem 3.6. Let Assumption 3.1 hold, f be convex and ϕ = h ∘ ‖·‖ with h satisfying Assumption 1.1. For {x^k}_{k∈N0} the sequence of iterates generated by (2) with γ = L⁻¹ and λ = L̄⁻¹, the following holds:

‖x^{k+1} − x*‖ ≤ ‖x^k − x*‖, (10)

where x* ∈ argmin f, i.e., {x^k}_{k∈N0} is Fejér monotone w.r.t. argmin f. Moreover, the norm of the gradient of f monotonically decreases along the iterates of the algorithm: ‖∇f(x^{k+1})‖ ≤ ‖∇f(x^k)‖ for all k ∈ N0. If, in addition, (h*)'(x)/x is a decreasing function on R₊, we have the following rate for the suboptimality gap:

f(x^K) − f_* ≤ L ‖∇f(x^0)‖ ‖x^0 − x*‖² / ((h*)'(L̄⁻¹‖∇f(x^0)‖)(K + 1)). (11)

We remark that (h*)'(x)/x being a decreasing function on R₊ is in fact a mild assumption that holds for all the kernel functions presented in Table 1. The result above strengthens the known results regarding the convergence of the method from (Maddison et al., 2021) and (Laude & Patrinos, 2025), while also answering the question posed in (Maddison et al., 2021, p.
17), regarding convergence guarantees for the suboptimality gap f(x^k) − f_*.

Figure 3. Nonconvex phase retrieval. ϕ1 corresponds to the isotropic reference function and ϕ2 to the anisotropic one, both generated by cosh(·) − 1; the baselines are GD, β-GD and gradient clipping. The two figures on the left compare the algorithms for one instance of the problem. The figure on the right displays the results of gradient clipping and the isotropic version of (2) averaged across 100 random instances.

Figure 4. Simple NN training (training loss vs. number of iterations). (left) Results for (2) with ϕ1(x) = cosh(‖x‖) − 1 and (γ, λ) ∈ {(0.05, 250), (0.1, 50), (0.2, 10)}; (middle) ϕ2(x) = −‖x‖ − ln(1 − ‖x‖) and (γ, λ) ∈ {(0.5, 50), (1.0, 10), (2.0, 0.5)}; (right) the gradient clipping method as presented in Example 1.7 with (γ, λ) ∈ {(0.5, 7), (0.8, 3), (1.0, 1)}.

We remark that although the obtained rate in (11) depends on the initial norm of the gradient, one can use the techniques from (Vankov et al., 2024) or (Gorbunov et al., 2024) to achieve a better complexity when specifying the reference function ϕ. Nevertheless, such an endeavor is beyond the scope of this paper. Although Theorem 3.6 provides a sublinear rate for the isotropic case, obtaining such guarantees for anisotropic reference functions is not straightforward: key in the proof of Theorem 3.6 is the fact that ∇ϕ*(L̄⁻¹∇f(x^k)) = (h*)'(L̄⁻¹‖∇f(x^k)‖) sgn(∇f(x^k)), and therefore the convex gradient inequality for f can directly be utilized to show the Fejér monotonicity of {x^k}_{k∈N0}.
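The two monotonicity properties of Theorem 3.6 are easy to observe numerically. The sketch below (with illustrative stepsizes γ = λ = 1, not the theorem's L⁻¹ and L̄⁻¹) tracks the distance to the minimizer and the gradient norm of the convex function f(x) = ¼‖x − b‖⁴ under the isotropic cosh preconditioner; both sequences turn out to be nonincreasing, matching (10) and the gradient-norm decrease.

```python
import numpy as np

b = np.array([1.0, -2.0, 0.5])      # unique minimizer x* = b

def grad_f(x):
    """Gradient of the convex function f(x) = 0.25 * ||x - b||^4."""
    d = x - b
    return np.dot(d, d) * d

x = np.zeros(3)
dists, gnorms = [], []
for _ in range(100):
    g = grad_f(x)
    n = np.linalg.norm(g)
    dists.append(np.linalg.norm(x - b))
    gnorms.append(n)
    if n > 0:
        x = x - np.arcsinh(n) * g / n   # iteration (2), gamma = lam = 1
```

Inspecting `dists` and `gnorms` after the loop shows both are monotonically decreasing, i.e., the iterates are Fejér monotone w.r.t. {b} in this run.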
Nevertheless, we are able to prove the sublinear convergence rate for the suboptimality gap for subhomogeneous (Azé & Penot, 1995, p. 708) reference functions using a different technique based on generalized conjugacy. More precisely, we utilize the characterization of anisotropically smooth functions as (generalized) envelopes and interpret the algorithm as a nonlinear proximal point method, thus generalizing the duality between gradient descent and proximal point in the Euclidean case (Laude, 2021, Theorem 3.8). Then, we combine the subhomogeneity of $\phi$ with the proof technique of (Doikov & Nesterov, 2020, Theorem 1) for inexact tensor methods and obtain the claimed rate. This result is captured in the following theorem.

Theorem 3.7. Let Assumption 3.1 hold, $f$ be convex and $\{x_k\}_{k \in \mathbb{N}_0}$ be the sequence of iterates generated by (2) with $\gamma = L^{-1}$, $\lambda = \bar{L}^{-1}$ and assume that $\operatorname{dom} \phi = \mathbb{R}^n$. Moreover, let $\phi$ be 2-subhomogeneous, i.e., such that $\phi(\theta x) \leq \theta^2 \phi(x)$ for all $\theta \in [0, 1]$. Then, for all $K \geq 1$:
$$f(x_K) - f^\star \leq \frac{4 D_0}{K},$$
where $D_0 = \sup\{\bar{L}(L^{-1} \star \phi)(x - x^\star) : f(x) \leq f(x_0)\}$ for $x^\star \in \arg\min f$. In general $D_0$ might be unbounded, unless $f$ has bounded level-sets, which is the case if $\arg\min f$ is bounded in light of (Bauschke et al., 2017b, Proposition 11.13). In theory, this dependence on the initial level-set can be eliminated by considering an averaging procedure akin to (Doikov & Nesterov, 2020, Algorithm 3), thus leading to a convergence rate in terms of some function of the initial distance to the solution. Nevertheless, such methods tend to underperform in practice compared to their more straightforward counterparts. Examples of 2-subhomogeneous reference functions are those generated by $\cosh - 1$, as described in the following lemma.

Lemma 3.8. The function $h(x) := \cosh(x) - 1$ is 2-subhomogeneous, i.e., the following inequality holds:
$$h(\theta x) \leq \theta^2 h(x), \tag{13}$$
for all $\theta \in [0, 1]$ and $x \in \mathbb{R}$.

4.
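Lemma 3.8 can be spot-checked numerically; the grid of $\theta$ and $x$ values below is an arbitrary choice of ours.

```python
import numpy as np

h = lambda x: np.cosh(x) - 1.0

# Check the 2-subhomogeneity inequality h(theta*x) <= theta^2 * h(x)
# of Lemma 3.8 on a dense grid of (theta, x) pairs.
thetas = np.linspace(0.0, 1.0, 101)
xs = np.linspace(-10.0, 10.0, 201)
T, X = np.meshgrid(thetas, xs)
violation = float(np.max(h(T * X) - T**2 * h(X)))
assert violation <= 1e-9  # nonpositive up to floating-point slack
```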
Experiments

In this section we present some simple experiments that display the behavior of the proposed method on problems beyond traditional Lipschitzian assumptions. The code for reproducing the experiments is publicly available$^1$.

4.1. Norm to power

For the first part of our experiments we consider the toy example of minimizing $f(x) = \frac{1}{4}\|x\|^4$, with $x \in \mathbb{R}^{500}$, using different preconditioning schemes. We consider the reference functions $\cosh(\|x\|) - 1$, $\exp(\|x\|) - \|x\| - 1$ and $-\|x\| - \ln(1 - \|x\|)$ and remind the reader that the algorithm generated by the latter function is a tighter version of the algorithm proposed in (Gorbunov et al., 2024). In this experiment, we keep $\bar{L}$ fixed, compute $L$ according to the rules established in Appendix D and apply algorithm (2) with $\gamma = L^{-1}$ and $\lambda = \bar{L}^{-1}$. The results are presented in Figure 2. For different values of $L$ and $\bar{L}$ there seems to be a trade-off between faster convergence to medium accuracy and slower convergence to very good accuracy for all three preconditioned methods.

4.2. Nonconvex phase retrieval

In this experiment we consider the nonconvex phase retrieval problem
$$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2m}\sum_{i=1}^m \big(y_i - (a_i^\top x)^2\big)^2$$
with $y_i \in \mathbb{R}$, $a_i \in \mathbb{R}^n$. The data is generated as in (Chen et al., 2023): $n = 100$, $m = 3000$ and $a_i, z \sim \mathcal{N}(0, 0.5)$, $x_0 \sim \mathcal{N}(5, 0.5)$ generated element-wise, with $z$ denoting the true underlying object. The measurements are generated as $y_i = (a_i^\top z)^2 + n_i$ with $n_i \sim \mathcal{N}(0, 4^2)$. We compare the algorithm (2) with the isotropic and anisotropic reference functions generated by $\cosh - 1$, denoted respectively by $\phi_1(x) = \cosh(\|x\|) - 1$ and $\phi_2(x) = \sum_{i=1}^n \cosh(x_i) - 1$, against vanilla gradient descent, gradient clipping (Zhang et al., 2020b) and (Chen et al., 2023, Algorithm 1) with the tuning described in (Chen et al., 2023, Section 7). For the isotropic case of (2) we take $\gamma = 5/3$ and $\lambda = 1/100$, while for the anisotropic one $\gamma = 1/5$ and $\lambda = 1/14$. The results are presented in Figure 3.
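For intuition, a scaled-down sketch of this setup is given below: it builds a toy phase-retrieval instance and applies the anisotropic update generated by $\phi_2$, where $\nabla\phi_2^*$ acts as a coordinate-wise $\operatorname{arcsinh}$. The problem sizes, seed and stepsizes here are our own assumptions and differ from the paper's experiment.

```python
import numpy as np

# Toy instance of f(x) = (1/2m) sum_i (y_i - (a_i^T x)^2)^2,
# much smaller than the paper's n=100, m=3000 setting.
rng = np.random.default_rng(1)
n, m = 20, 200
a = rng.normal(0.0, np.sqrt(0.5), size=(m, n))
z = rng.normal(0.0, np.sqrt(0.5), size=n)   # true underlying object
y = (a @ z) ** 2 + rng.normal(0.0, 4.0, size=m)

def f(x):
    return 0.5 / m * np.sum((y - (a @ x) ** 2) ** 2)

def grad_f(x):
    ax = a @ x
    return -2.0 / m * (a.T @ ((y - ax ** 2) * ax))

# Anisotropic step for phi_2(x) = sum_i cosh(x_i) - 1:
# grad phi_2* is arcsinh applied coordinate-wise.
def step(x, gamma=0.1, lam=0.01):
    return x - gamma * np.arcsinh(lam * grad_f(x))

x = rng.normal(0.0, np.sqrt(0.5), size=n)
f0 = f(x)
for _ in range(500):
    x = step(x)
f_end = f(x)
assert f_end < f0
```

Note how the coordinate-wise $\operatorname{arcsinh}$ damps large gradient entries, which is exactly what makes the scheme robust on this quartically growing objective.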
We use as $f^\star$ the minimum value of the cost function among all algorithms. In this experiment the two versions of the algorithm proposed in this paper outperform the rest of the methods. Moreover, we test the clipping and the isotropic algorithm over 100 random instances of the problem and plot the mean along with error bars representing one standard deviation on a logarithmic scale. It can be seen that the isotropic algorithm outperforms the clipping method across the tests for this particular tuning. Note that the tuning for both of the methods is quite robust.

$^1$https://github.com/JanQ/nonlinearly-preconditioned-gradient

4.3. Neural network training

In this experiment we consider training a simple four-layer fully connected network with layer dimensions $[28 \cdot 28, 128, 64, 32, 32, 10]$ and ReLU activation functions on a subset of the MNIST dataset (Deng, 2012), using the cross-entropy loss. We consider a subset ($m = 600$) of the dataset in order to efficiently use full gradient updates. We compare the methods generated by $\phi_1(x) = \cosh(\|x\|) - 1$, $\phi_2(x) = -\|x\| - \ln(1 - \|x\|)$ and the gradient clipping method (Zhang et al., 2020b), which can also be considered an instance of (2) through Example 1.7, for various choices of the stepsizes and the clipping parameters. The results are presented in Figure 4. It can be seen that different combinations of $\gamma$ and $\lambda$ lead to different behaviors for the compared methods.

5. Conclusion and Future Work

In this paper we introduced and studied a new generalized smoothness inequality that is less restrictive than Lipschitz smoothness. We provided sufficient first- and second-order conditions through an unconventional technique that also leads to novel insights into the class of $(L_0, L_1)$-smooth functions. We moreover analyzed a nonlinearly preconditioned gradient scheme that is tailored to the proposed smoothness condition and studied its convergence properties both in the nonconvex and convex setting.
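Reading clipping as an instance of (2) through Example 1.7 amounts to a reparametrization of the usual clip-by-norm update. The short check below makes this explicit; the sample values of $\gamma$ and $\lambda$ and the helper names are ours.

```python
import numpy as np

def clip_step(x, g, gamma, lam):
    # Update from Example 1.7: x+ = x - gamma * min(1/||g||, lam) * g.
    ng = np.linalg.norm(g)
    return x - gamma * min(1.0 / ng, lam) * g

def classical_clip(g, c):
    # Standard clip-by-norm: rescale g so that its norm is at most c.
    return g * min(1.0, c / np.linalg.norm(g))

# With clipping threshold c = 1/lam and stepsize eta = gamma * lam,
# the two updates coincide for every gradient g.
rng = np.random.default_rng(2)
g = rng.standard_normal(10)
x = rng.standard_normal(10)
gamma, lam = 0.8, 3.0
step_a = clip_step(x, g, gamma, lam)
step_b = x - (gamma * lam) * classical_clip(g, 1.0 / lam)
assert np.allclose(step_a, step_b)
```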
This framework encapsulates a plethora of well-known methods, while it also generates new algorithms. Our work paves the way for a better understanding of clipping and signed gradient methods from a majorization-minimization perspective. Possible interesting future work includes integrating momentum both in the convex and nonconvex regime and studying the stochastic setup. Another interesting research direction is extending our convergence results for the suboptimality gap from the smooth to the additive nonsmooth setting, where the nonsmooth term is handled similarly to (Laude & Patrinos, 2025). We believe that this extension is not straightforward and requires additional effort compared to the standard Euclidean setup of gradient descent.

Acknowledgements

Work supported by: the Research Foundation Flanders (FWO) research projects G081222N, G033822N, G0A0920N; Research Council KUL grant C14/24/103. The authors thank Thomas Moellenhoff for helpful conversations.

Impact statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ahookhosh, M., Themelis, A., and Patrinos, P. A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM Journal on Optimization, 31(1):653–685, 2021.

Azé, D. and Penot, J.-P. Uniformly convex and uniformly smooth convex functions. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 4, pp. 705–730, 1995.

Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017a.

Bauschke, H. H. and Combettes, P. L. Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2017b.
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.

Bolte, J., Sabach, S., Teboulle, M., and Vaisbourd, Y. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131–2151, 2018.

Bonet, C., Uscidda, T., David, A., Aubin-Frankowski, P.-C., and Korba, A. Mirror and preconditioned gradient descent in Wasserstein space. arXiv preprint arXiv:2406.08938, 2024.

Bredies, K. A forward-backward splitting algorithm for the minimization of non-smooth convex functionals in Banach space. Inverse Problems, 25(1):015005, 2008.

Chen, X., Wu, S. Z., and Hong, M. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.

Chen, Z., Zhou, Y., Liang, Y., and Lu, Z. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning, pp. 5396–5427. PMLR, 2023.

Clarke, F. H. Optimization and nonsmooth analysis. SIAM, 1990.

Défossez, A., Bottou, L., Bach, F., and Usunier, N. A simple convergence proof of Adam and Adagrad. Transactions on Machine Learning Research, 2022.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Doikov, N. and Nesterov, Y. E. Inexact tensor methods with dynamic accuracies. In ICML, pp. 2577–2586, 2020.

Dontchev, A. L. and Rockafellar, R. T. Implicit functions and solution mappings, volume 543. Springer, 2009.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Facchinei, F. and Pang, J.-S.
Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.

Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analysis of adaptive SGD. In The Thirty Sixth Annual Conference on Learning Theory, pp. 89–160. PMLR, 2023.

Figalli, A., Kim, Y.-H., and McCann, R. J. When is multidimensional screening a convex program? Journal of Economic Theory, 146(2):454–478, 2011.

Gorbunov, E., Danilova, M., and Gasnikov, A. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.

Gorbunov, E., Tupitsa, N., Choudhury, S., Aliev, A., Richtárik, P., Horváth, S., and Takáč, M. Methods for convex $(L_0, L_1)$-smooth optimization: Clipping, acceleration, and adaptivity. arXiv preprint arXiv:2409.14989, 2024.

Hiriart-Urruty, J.-B., Strodiot, J.-J., and Nguyen, V. H. Generalized Hessian matrix and second-order optimality conditions for problems with $C^{1,1}$ data. Applied Mathematics and Optimization, 11(1):43–56, 1984.

Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012.

Kim, J., Park, C., Ozdaglar, A., Diakonikolas, J., and Ryu, E. K. Mirror duality in convex optimization. arXiv preprint arXiv:2311.17296, 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Koloskova, A., Hendrikx, H., and Stich, S. U. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In International Conference on Machine Learning, pp. 17343–17363. PMLR, 2023.

Kunstner, F., Yadav, R., Milligan, A., Schmidt, M., and Bietti, A. Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models. arXiv preprint arXiv:2402.19449, 2024.

Laude, E. Lower envelopes and lifting for structured nonconvex optimization. PhD thesis, Technical University of Munich, 2021.

Laude, E.
and Patrinos, P. Anisotropic proximal point algorithm. arXiv preprint arXiv:2312.09834, 2023.

Laude, E. and Patrinos, P. Anisotropic proximal gradient. Mathematical Programming, pp. 1–45, 2025.

Laude, E., Themelis, A., and Patrinos, P. Dualities for non-Euclidean smoothness and strong convexity under the light of generalized conjugacy. SIAM Journal on Optimization, 33(4):2721–2749, 2023.

Léger, F. and Aubin-Frankowski, P.-C. Gradient descent with a general cost. arXiv preprint arXiv:2305.04917, 2023.

Li, H., Qian, J., Tian, Y., Rakhlin, A., and Jadbabaie, A. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2024.

Lu, H., Freund, R. M., and Nesterov, Y. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.

Maddison, C. J., Paulin, D., Teh, Y. W., and Doucet, A. Dual space preconditioning for gradient descent. SIAM Journal on Optimization, 31(1):991–1016, 2021.

Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Rockafellar, R. T. Higher derivatives of conjugate convex functions. Int. J. Applied Analysis, (1):41–43, 1977.

Rockafellar, R. T. Convex analysis, volume 28. Princeton University Press, 1997.

Rockafellar, R. T. and Wets, R. J. Variational Analysis. Springer, New York, 1998.

Strichartz, R. S. The way of analysis. Jones & Bartlett Learning, 2000.

Themelis, A., Ahookhosh, M., and Patrinos, P. On the Acceleration of Forward-Backward Splitting via an Inexact Newton Method, pp. 363–412. Springer International Publishing, Cham, 2019.

Vankov, D., Rodomanov, A., Nedich, A., Sankar, L., and Stich, S. U. Optimizing $(L_0, L_1)$-smooth functions by gradient methods. arXiv preprint arXiv:2410.10800, 2024.
Villani, C. Optimal Transport: Old and New. Springer, 2008.

Wang, B., Zhang, H., Ma, Z., and Chen, W. Convergence of Adagrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pp. 161–190. PMLR, 2023.

Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511–15521, 2020a.

Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations, 2020b.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020c.

A. Missing proofs of Section 1

A.1. Proof of Lemma 1.3

Proof. To begin with, since $h$ is proper, lsc and convex, $h^{**} = h$. In light of (Rockafellar & Wets, 1998, Theorem 11.8), we have that $\min h^* = -h^{**}(0) = -h(0) = 0$, implying that $h^* \geq 0$. Moreover, from the same theorem, $\arg\min h^* = \{h'(0)\} = \{0\}$, further implying that $h^*(0) = 0$ and $h^{*\prime}(0) = 0$. Since $h$ is even, we have from (Bauschke et al., 2017b, Example 13.8) that $h^* = (h \circ |\cdot|)^* = h^* \circ |\cdot|$, which means that $h^*$ is also even. Therefore, through (Bauschke et al., 2017b, Proposition 11.7) we get that $h^*$ is increasing on $\mathbb{R}_+$. Now, note that the function $g = h - \frac{\mu}{2}|\cdot|^2$ is proper, lsc and convex, where $\mu$ is the strong convexity parameter of $h$. Moreover, it is even as the difference of two even functions. Therefore, from (Bauschke et al., 2017b, Proposition 11.7), $g$ is an increasing function on $\mathbb{R}_+$ and thus $g \circ \|\cdot\|$ is a convex function on $\mathbb{R}^n$. This implies that $h(\|x\|) - \frac{\mu}{2}\|x\|^2$ is convex and thus that $\phi = h \circ \|\cdot\|$ is strongly convex with the same strong convexity parameter.
Again from (Bauschke et al., 2017b, Example 13.8), we have that $\phi^* = h^* \circ \|\cdot\|$ and to get the gradient of $\phi^*$ we can then utilize (Bauschke et al., 2017b, Corollary 16.72) to obtain $\nabla\phi^*(y) = h^{*\prime}(\|y\|)\operatorname{sgn}(y)$. Regarding the twice continuous differentiability of $\phi^*$, it follows from (Rockafellar, 1977, p. 42) that $h^* \in C^2(\mathbb{R})$ and the claimed result follows from (Strichartz, 2000, Exercise 10.2.20), since $h^*$ is even.

A.2. Proof of Example 1.7

Proof. It is straightforward that $h(x) = \frac{1}{2}x^2 + \delta_{[-1,1]}(x)$ is a proper, lsc, strongly convex and even function with $h(0) = 0$. Then, from (Rockafellar & Wets, 1998, Theorem 11.23) we have that $h^*(y) = \inf_x \sigma_{[-1,1]}(x) + \frac{1}{2}(y - x)^2$, where $\sigma_{[-1,1]}$ is the support function of $[-1, 1]$, and in light of (Rockafellar & Wets, 1998, Exercise 11.27), $h^{*\prime}(y) = \Pi_{[-1,1]}(y)$, where $\Pi_{[-1,1]}(y) = \min(1, \max(-1, y))$ is the projection onto the closed convex set $[-1, 1]$. Using Lemma 1.3 we thus obtain $\nabla\phi^*(y) = \min(1, \|y\|)\operatorname{sgn}(y)$ and the algorithm becomes:
$$x_{k+1} = x_k - \gamma \min(1/\|\nabla f(x_k)\|, \lambda)\,\nabla f(x_k),$$
by pulling the norm inside the min.

B. Missing proofs of Section 2

B.1. Proof of Proposition 2.2

Proof. Consider the following quantities, which are generalized conjugates as defined in (Rockafellar & Wets, 1998, Chapter 11L):
$$(-f)^\Phi(y) = \sup_{x \in \mathbb{R}^n} -\bar{L}(L^{-1} \star \phi)(x - y) + f(x) = -\inf_{x \in \mathbb{R}^n} \bar{L}(L^{-1} \star \phi)(x - y) - f(x),$$
$$(-f)^{\Phi\Phi}(x) = \sup_{y \in \mathbb{R}^n} -\bar{L}(L^{-1} \star \phi)(x - y) - (-f)^\Phi(y).$$
Let $\bar{x} \in \mathbb{R}^n$. Since $f$ is $(L, \bar{L})$-anisotropically smooth, we have that
$$\bar{x} \in \arg\min_{x \in \mathbb{R}^n} \bar{L}(L^{-1} \star \phi)(x - \bar{y}) - f(x),$$
where $\bar{y} = T_{L^{-1}, \bar{L}^{-1}}(\bar{x}) = \bar{x} - L^{-1}\nabla\phi^*(\bar{L}^{-1}\nabla f(\bar{x}))$, and thus
$$(-f)^\Phi(\bar{y}) = -\bar{L}(L^{-1} \star \phi)(\bar{x} - \bar{y}) + f(\bar{x}) \in \mathbb{R}, \tag{14}$$
which implies that $(-f)^\Phi$ is proper. Since $\bar{x}$ was arbitrary, this holds for any $x \in \mathbb{R}^n$ and $y = T_{L^{-1}, \bar{L}^{-1}}(x)$. By the definition of $(-f)^{\Phi\Phi}$, for such pairs $x$ and $y$ we have that
$$(-f)^{\Phi\Phi}(x) + (-f)^\Phi(y) \geq -\bar{L}(L^{-1} \star \phi)(x - y).$$
Substituting (14) in the inequality above, we obtain $(-f)^{\Phi\Phi}(x) + f(x) \geq 0$, or $-f(x) \leq (-f)^{\Phi\Phi}(x)$. Moreover, by the definition of $(-f)^\Phi$, we have that for any $x, y \in \mathbb{R}^n$:
$$(-f)^\Phi(y) \geq f(x) - \bar{L}(L^{-1} \star \phi)(x - y).$$
Nonlinearly Preconditioned Gradient Methods Moving ( f)Φ(y) to the other side and taking the supremum with respect to y we obtain f(x) ( f)ΦΦ(x), which combined with the previous result means that f(x) = ( f)ΦΦ(x). Therefore, we have f(x) = ( f)ΦΦ(x) = sup y Rn L(L 1 ϕ)(x y) ( f)Φ(y) = inf y Rn L(L 1 ϕ)(x y) + ( f)Φ(y), which is the claimed result for ξ = ( f)Φ. We now show the convexity of f L 1( L ϕ ). In light of (Rockafellar & Wets, 1998, Theorem 11.23) and (Bauschke et al., 2017b, Proposition 13.23) we have f = inf y Rn L(L 1 ϕ)( y) + ( f)Φ(y) = ξ + ( L(L 1 ϕ)) = ξ + L 1( L ϕ ) and as such the result follows for ψ = ξ which is lsc and convex, using moreover the fact that dom ϕ = Rn since ϕ is strongly convex. B.2. Proof of Proposition 2.3 Proof. Consider any points x, x Rn. If L(x x+L 1 ϕ ( f( x))) / dom ϕ then the bound holds trivially. Otherwise, from the Euclidean descent lemma for f we have: f(x) f( x) + f( x), x x + Lf Moreover, from the strong convexity of ϕ between points L(x x + L 1 ϕ ( f( x))) and ϕ ( f( x)): 1 L ϕ(x x + L 1 ϕ ( f( x))) = 1 Lϕ(L(x x + L 1 ϕ ( f( x)))) L h ϕ( ϕ ( f( x))) + L f( x), x x + µL2 L ϕ(L 1 ϕ ( f( x))) + f( x), x x + µL where we have used the fact that rge ϕ dom ϕ along with f( x) ϕ( ϕ ( f( x))). Therefore, the claimed result follows. B.3. Proof of Lemma 2.5 Proof. Let Q := 2f(x) and note that H := 2ϕ ( L 1 f(x)) 0 from (Rockafellar, 1977, p. 42). Then Q LLH 1 H1/2QH1/2 LLI λmax(H1/2QH1/2) < LL λmax(HQ) < LL, where the first equivalence follows by (Horn & Johnson, 2012, Theorem 7.7.2c) with S = H1/2. The last equivalence follows by noting that the (generally nonsymmetric) matrix HQ is similar to the symmetric matrix H1/2QH1/2, by using (Horn & Johnson, 2012, Theorem 1.3.22) with A = H1/2Q and B = H1/2, and noting that H1/2 is nonsingular. The sufficient condition follows from Weyl s inequality (Horn & Johnson, 2012, Theorem 4.3.1). B.4. Proof of Proposition 2.6 Proof. 
We will prove that TL 1, L 1 = id L 1 ϕ ( L 1 f) is a global homeomorphism from Rn Rn using (Facchinei & Pang, 2003, Theorem 2.1.10). Note that from Definition 2.4, lim x TL 1, L 1(x) = , i.e. TL 1, L 1 is norm-coercive, we only need to show that it is everywhere a local homeomorphism. The mapping TL 1, L 1 is locally Lipschitz, since ϕ is globally Lipschitz and f C2(Rn) and thus the generalized Jacobian is well-defined. Now, we have that CTL 1, L 1(x) = {I L 1V : V C( ϕ L 1 f)(x)} and in light of (Clarke, 1990, p. 75), C( ϕ L 1 f)(x)v con{ C( ϕ )( L 1 f(x)) L 1 2f(x)v} Nonlinearly Preconditioned Gradient Methods for any v Rn. In order to show that TL 1, L 1 is everywhere a local homeomorphism we are going to use Clarke s inverse function theorem as presented in (Dontchev & Rockafellar, 2009, Theorem 4D.4). Consider thus any point x Rn and the mapping GA(x) := TL 1, L 1( x) + A(x x), where A CTL 1, L 1( x). Now, from the reasoning above we have that A(x x) con{(I L 1 C( ϕ )( L 1 f( x)) L 1 2f( x))(x x)}. From Definition 2.4 we have that for all H C( ϕ )( L 1 f(x)), λmin(I L 1 L 1H 2f( x)) > 0, implying that GA is an invertible linear mapping for any A CTL 1, L 1( x). Therefore, from (Dontchev & Rockafellar, 2009, Theorem 4D.4), there exists a Lipschitz continuous mapping T 1 L 1, L 1 such that T 1 L 1, L 1(TL 1, L 1(x)) = x for some neighborhoods U of x and V of TL 1, L 1( x). Since, moreover TL 1, L 1 is Lipschitz continuous, it is a local homeomorphism at x in the sense of (Facchinei & Pang, 2003, Definition 2.1.9). Since this holds for any x Rn, it is everywhere a local homeomorphism. This concludes our proof by using (Facchinei & Pang, 2003, Theorem 2.1.10). B.5. Proof of Corollary 2.7 Consider ϕ(x) = x ln(1 x ). In this case, h(x) = |x| ln(1 |x|) and thus from Lemma 1.3, ϕ (y) = y 1+ y . We thus have 2ϕ (y) = 1 1 + y I + 1 (1 + y )2 1 1 + y I y (1 + y ) yy The term multiplying yy y 2 is negative and as such λmax( 2ϕ (y)) 1 1+ y . 
Moreover, 2ϕ (y) is positive-definite, since 1 > y 1+ y , which then implies 2ϕ (y) 1 1+ y . Therefore, 2ϕ ( L 1y) L L + y = L0 L0 + L1 y , by choosing L = L0/L1. By (L0, L1)-smoothness we moreover have 2f(x) L0 + L1 f(x) and thus 2ϕ ( L 1 f(x)) 2f(x) 2ϕ ( L 1 f(x)) 2f(x) L0. Choosing now L = L1 we further have 2ϕ ( L 1 f(x)) 2f(x) L L. (15) Now take any points x, x Rn and note that from the Cauchy Schwarz inequality: ϕ ( L 1 f(x)) ϕ ( L 1 f( x)), x x ϕ ( L 1 f(x)) ϕ ( L 1 f( x)) x x . (16) By the fundamental theorem of calculus for the mapping ϕ ( L 1 f), ϕ ( L 1 f(x)) ϕ ( L 1 f( x)) = Z 1 0 L 1 2ϕ ( L 1 f( x + t(x x))) 2f( x + t(x x))(x x)dt and as such ϕ ( L 1 f(x)) ϕ ( L 1 f( x)) = L 1 0 2ϕ ( f( x + t(x x))) 2f( x + t(x x))(x x)dt 2ϕ ( f( x + t(x x))) 2f( x + t(x x)) dt x x Nonlinearly Preconditioned Gradient Methods where the second inequality follows by (15). Putting this result back into (16), multiplying with γ < 0 and adding x x 2 to both sides we obtain: x x 2 γ ϕ ( L 1 f(x)) γ ϕ ( L 1 f( x)), x x (1 γL) x x 2, implying that TδL 1, L 1(x) TδL 1, L 1( x), x x (1 δ) x x 2, since γ = δL 1. B.6. Proof of Proposition 2.9 Proof. Note that (4) is equivalent to x arg minx Rn g(x) := L(L 1 ϕ)(x y) f(x) for all x Rn, where y = x L 1 ϕ ( L 1 f( x)). In light of the modern version of Fermat s theorem (Rockafellar & Wets, 1998, Theorem 10.1), for x to be a local minimizer of g, the following inclusion should hold: 0 b g( x). Through (Rockafellar & Wets, 1998, Exercise 10.10) this implies that f( x) ( L(L 1 ϕ))( x y) = L ϕ(L( x y)) or that L( x y) = ϕ ( L 1 f( x)) or x L 1 ϕ ( L 1 f( x)) = y. This means that TL 1, L 1( x) = TL 1, L 1( x) and since TL 1, L 1 is injective, the only possible minimizer of g is x. Therefore, if we show that g has a minimizer we are done. Case 1: dom ϕ is bounded. In light of (Rockafellar & Wets, 1998, p. 91), g is a coercive function and as it is also proper and lsc, it attains its minimum. Case 2: dom ϕ = Rn. 
By the assumption of the proposition we have that g(x) L(L 1 ϕ)(x y) L(r 1 ϕ)(x) + β. (17) Let µ be the strong convexity parameter of ϕ, then we have the following for all α (0, 1): L(r 1 ϕ)(x) = Lr 1ϕ(rx) = Lr 1ϕ(r(x y) + r y) = Lr 1ϕ rαx y α + r(1 α) y 1 α α(x y) + Lr 1(1 α)ϕ r 1 α y µ 2 Lr 1α(1 α) r α(x y) r 1 α y where the inequality follows by the strong convexity inequality for ϕ between points r α(x y) and r 1 α y. Choosing now α = r L 1 < 1 we obtain: L(r 1 ϕ)(x) L(L 1 ϕ)(x y) + L(r 1(1 r L 1) ϕ)( y) µ L L(1 r L 1) (L r)x L y 2 . Substituting this inequality in (17) we get g(x) L(r 1(1 r L 1) ϕ)( y) + µ L L(1 r L 1) (L r)x L y 2 + β =: ψ(x). Note now that ψ is a proper, lsc and strongly convex function and as such it has bounded level-sets. Due to the inequality, the level-sets of g are contained in those of ψ and thus g also has bounded level-sets. Since moreover g is lsc, it attains its minimum. We thus have showed that in both of the above cases, g has a minimizer and the proof is complete. Nonlinearly Preconditioned Gradient Methods C. Missing proofs of Section 3 C.1. Proof of Theorem 3.2 Proof. From inequality (4) between points xk+1 and xk we have: f(xk+1) f(xk) + LL 1[ϕ((1 α) ϕ ( L 1 f(xk))) ϕ( ϕ ( L 1 f(xk)))]. Using the fact that ϕ is even, we have ϕ((1 α) ϕ ( L 1 f(xk))) = ϕ(|1 α| ϕ ( L 1 f(xk))) and since ϕ is convex, we have ϕ(θx) = ϕ((1 θ)0 + θx) (1 θ)ϕ(0) + θϕ(x) = θϕ(x), for any θ [0, 1]. Note now that |1 α| < 1 and the previous inequality becomes: f(xk+1) f(xk) (1 |1 α|) LL 1ϕ( ϕ ( L 1 f(xk))). (18) Therefore, summing up the above inequality we obtain k=0 ϕ( ϕ ( L 1 f(xk))) L Lβ (f(x0) f(x K+1)) L Lβ (f(x0) f ), which leads to (K + 1) min 0 k K ϕ( ϕ ( L 1 f(xk))) L Lβ (f(x0) f ). (19) Dividing now by K + 1 we obtain the claimed rate. C.2. Proof of Corollary 3.3 Proof. In light of Theorem 3.2, the following holds: min 0 k K ϕ( ϕ ( L 1 f(xk))) L(f(x0) f ) Lβ(K + 1) . 
(20) Now, using the fact that cosh(arcsinh(x)) = 1 + x2 we have: ϕ( ϕ ( L 1 f(xk))) = cosh arcsinh( L 1 f(xk) ) f(xk) f(xk) = cosh(arcsinh( L 1 f(xk) )) 1 1 + L 2 f(xk) 2 1. The function 1 + x2 1 is increasing for x 0 and as such k arg min0 k K{ϕ( ϕ ( L 1 f(xk)))} is equivalent to k arg min0 k K f(xk) 2. Therefore, by taking 1 to the other side in (20) and taking the square, we obtain: 1 + L 2 f(xk ) 2 L(f(x0) f ) 2 + 2L(f(x0) f ) Lβ(K + 1) + 1, f(xk ) 2 L(f(x0) f ) 2 + 2 LL(f(x0) f ) Taking the square root and using the fact that α + β α + β we obtain the claimed result. Nonlinearly Preconditioned Gradient Methods C.3. Proof of Proposition 3.5 Proof. To begin with, with similar arguments as in the proof of Lemma 1.3 we have that ϕ (0) = 0 and ϕ (0) = 0. In light of Proposition 2.2, f L 1( L ϕ ) is a convex function. By definition f(x) dom f dom f for all x Rn and as such we can consider the convex subgradient inequality for f L 1( L ϕ ) between points f(x) and f(x ) and obtain: f ( f(x)) L 1 Lϕ ( L 1 f(x)) f ( f(x )) + x , f(x) , (21) where we have used the fact that f(x ) = 0, ϕ (0) = 0, ϕ (0) = 0 and x f ( f(x)), since f is convex. Taking once again the convex gradient inequality between points f(x ) and f(x), we now have: f ( f(x )) f ( f(x)) L 1 Lϕ ( L 1 f(x)) + x L 1 ϕ ( L 1 f(x)), f(x) , (22) with the same reasoning as before. Summing now (21) and (22) and rearranging we obtain the claimed result. C.4. Proof of Theorem 3.6 Proof. We begin as in the classical analysis of gradient descent by using the Pythagorean theorem: xk+1 x 2 = xk x 2 2L 1 ϕ ( L 1 f(xk)), xk x + L 1 ϕ ( L 1 f(xk)) 2. (23) We further have: L 1 ϕ ( L 1 f(xk)), xk x = L 1 h ( L 1 f(xk) ) L 1 f(xk) L 1 f(xk), xk x L 2 h ( L 1 f(xk) ) L 1 f(xk) L 1 f(xk), ϕ ( L 1 f(xk)) = L 1 ϕ ( L 1 f(xk)) 2, (24) where in the inequality we used Proposition 3.5 and in the equalities the fact that ϕ (L 1 f(xk)) = h ( L 1 f(xk) ) L 1 f(xk) L 1 f(xk) from Lemma 1.3. 
In the inequality, we also used the fact that h (t) 0 for t 0. Indeed, by convexity, we have that: h (0) h (t) h (t)t h (t)t h (t) h (0), implying that h (t) 0 for all t 0. Plugging now (24) into (23), we obtain xk+1 x 2 xk x 2 L 1 ϕ ( L 1 f(xk)), xk x xk x 2 L 1 ϕ ( L 1 f(xk)) 2, (25) which shows the Fej er monotonicity of {xk}k N0 w.r.t. x arg min f. Since ( Lh) = L h and h is an even function, we have that ( Lϕ) = L (h ). In light of Proposition 2.2, we have that f L 1( Lϕ) is a convex function. Now, for any x, x Rn, from the convex subgradient inequality for this function, between points f(x) dom f and f( x) dom f we have: (f L 1( Lϕ) )( f(x)) (f L 1( Lϕ) )( f( x)) + x (L 1( Lϕ) )( f( x)), f(x) f( x) , where we have moreover used the fact that x f ( f( x)). Therefore, D( Lϕ) ( f(x), f( x)) = ( Lϕ) ( f(x)) ( Lϕ) ( f( x)) ( Lϕ) ( f( x)), f(x) f( x) L[f ( f(x)) f ( f( x)) x, f(x) f( x) ] = LDf( x, x) where Dg denotes the Bregman divergence associated with g and the equality follows by the definition of the convex conjugate. Thus, (Maddison et al., 2021, Assumption 3.1) holds for k = ( Lϕ) . Therefore, substituting from (Maddison et al., 2021, Equation (41)), we obtain ( Lϕ) ( f(xk+1)) ( Lϕ) ( f(xk)) for all k N0 and since h is an increasing function on R+ from Lemma 1.3, Lh ( L 1 f(xk+1) ) Lh ( L 1 f(xk) ) = f(xk+1) f(xk) . (26) Nonlinearly Preconditioned Gradient Methods and thus we proved that the norm of the gradient of f monotonically decreases along the iterates. We now return to the Fej er-type inequality (25): xk+1 x 2 xk x 2 L 1 h ( L 1 f(xk) ) L 1 f(xk) L 1 f(xk), xk x . Using the convex gradient inequality for f we further have: xk+1 x 2 xk x 2 L 1 L 1 h ( L 1 f(xk) ) L 1 f(xk) (f(xk) f ). 
Summing up now the inequality above we obtain: k=0 L 1 L 1 h ( L 1 f(xk) ) L 1 f(xk) (f(xk) f ) x0 x 2, which after utilizing the fact that L 1 f(xk+1) L 1 f(xk) k N0 = h ( L 1 f(xk) ) L 1 f(xk) h ( L 1 f(x0) ) L 1 f(x0) k N0, since h (x) x is decreasing on R+, implies that (K + 1)L 1 h ( L 1 f(x0) ) f(x0) min 0 k K(f(xk) f ) x0 x 2. This is the claimed result, since the function values decrease along the iterates of the algorithm from (18). C.5. Proof of Theorem 3.7 In light of Proposition 2.2, f = infy Rn L(L 1 ϕ)( y) + ξ(y) for some ξ : Rn R. Since f is moreover convex in this setting and dom ϕ = Rn, we can take ξ to be convex and lsc from (Laude & Patrinos, 2025, Proposition 4.1). In order to prove our result, we will resort to a nonlinear proximal point interpretation of (2) with a strongly convex prox-kernel. We therefore consider the following iteration: xk+1 = arg min y Rn L(L 1 ϕ)(xk y) + ξ(y). (27) From (Laude & Patrinos, 2025, Proposition 3.9 (ii)), since dom ϕ = Rn, we have that f(xk) = ( inf y Rn L(L 1 ϕ)(xk y) + ξ(y)) = L ϕ(L(xk xk+1)), which directly implies that xk+1 = xk L 1 ϕ ( L 1 f(xk)) and the claimed equivalence between the two schemes is established. Using this result we can now prove a certain three-point-like property for the iterates generated by (2). Lemma C.1. Let {xk}k N0 be the sequence of iterates generated from (27). Then, the following inequality holds for all x Rn: ξ(xk+1) ξ(x) L[(L 1 ϕ)(xk xk+1) (L 1 ϕ)(xk x)]. (28) Proof. By the optimality conditions for (27), we have that 0 L ϕ(L(xk xk+1)) + ξ(xk+1). Now, by the convex subgradient inequality for ξ, with uk+1 ξ(xk+1), we have that ξ(x) ξ(xk+1) + uk+1, x xk+1 = ξ(xk+1) + L ϕ(L(xk xk+1)), x xk+1 = ξ(xk+1) + LL 1 ϕ(L(xk xk+1)), L(x xk) + L(xk xk+1) ξ(xk+1) + L[(L 1 ϕ)(xk xk+1) (L 1 ϕ)(xk x)] Nonlinearly Preconditioned Gradient Methods where the first equality follows by the inclusion above and the second by simple algebraic manipulations. 
The final inequality follows by using the convex gradient inequality for ϕ between points L(xk x) and L(xk xk+1). Therefore, the claimed result follows by rearranging. Now, we move on to the proof of our main theorem. It is inspired by the proof of (Doikov & Nesterov, 2020), where we have also utilized the fact that ϕ is 2-subhomogeneous. Proof. As established above, we consider the sequence of iterates generated by (27). In light of Lemma C.1 and since ϕ 0 we have for any x Rn: ξ(xk+1) ξ(x) + L(L 1 ϕ)(xk x). (29) Now consider an arbitrary increasing sequence {Ak}k N0 with Ak > 0 and A0 = 0. We denote by ak+1 := Ak+1 Ak and by taking x := ak+1x +Akxk Ak+1 we have xk x = ak+1 Ak+1 (xk x ). Plugging this in (29) and using the convexity of ξ we obtain: ξ(xk+1) ak+1 Ak+1 ξ(x ) + Ak Ak+1 ξ(xk) + L(L 1 ϕ)( ak+1 Ak+1 (xk x )). Let θk := ak+1 Ak+1 1. By the subhomogeneity of ϕ we have that ξ(xk+1) ak+1 Ak+1 ξ(x ) + Ak Ak+1 ξ(xk) + θ2 k L(L 1 ϕ)(xk x ). Multiplying both sides with Ak+1 we get since ak+1 = Ak+1 Ak Ak+1 ξ(xk+1) ξ(x ) Ak ξ(xk) ξ(x ) + a2 k+1 Ak+1 L(L 1 ϕ)(xk x ). Summing the inequality from k = 0 to k = K 1 we obtain since A0 = 0: AK ξ(x K) ξ(x ) a2 k+1 Ak+1 L(L 1 ϕ)(xk x ). (30) Choosing x = xk in (29) we have ξ(xk+1) ξ(xk). and thus ξ(x K) ξ(x1) 0, which implies that f(x K) ξ(x K) ξ(x1) ξ(x1) + L(L 1 ϕ)(x0 x1) = f(x0). The first inequality in the above display follows by the envelope representation of f, which implies that f(x) ξ(x) for all x Rn. The equality also follows from the envelope representation, since f(x0) = inf y Rn L(L 1 ϕ)(x0 y) + ξ(y) = L(L 1 ϕ)(x0 x1) + ξ(x1) from (27). Thus we can further bound (30): AK ξ(x K) ξ(x ) D0 a2 k+1 Ak+1 . We choose Ak = k2 and by using the fact that PK k=1 a2 k Ak 4K (Doikov & Nesterov, 2020, Equation (35)): AK ξ(x K) ξ(x ) 4D0K. Dividing by AK we obtain: ξ(x K) ξ(x ) 4D0 Noting that f(x K) ξ(x K), from the envelope representation of f, and ξ(x ) = f(x ) we obtain the desired result. 
C.6. Proof of Lemma 3.8

Proof. Fix $x\in\mathbb{R}\setminus\{0\}$ and consider the function $g(\theta) := \cosh(\theta x) - 1 - \theta^2(\cosh(x)-1)$. If this function is nonpositive for $\theta\in[0,1]$, then the claim is proven. Note that $g(0) = 0$ and $g(1) = 0$. Moreover, $g'(\theta) = x\sinh(\theta x) - 2\theta(\cosh(x)-1)$ and thus $g'(0) = 0$. Now, $g''(\theta) = x^2\cosh(\theta x) - 2\cosh(x) + 2$ and thus $g''(0) = x^2 + 2 - 2\cosh(x) < 0$, which further implies that $0$ is a local maximum of $g$. Therefore, there exists a $\bar\theta\in(0,1]$ such that $g(\theta) < g(0) = 0$ for all $\theta\in(0,\bar\theta)$, which implies that if we prove that $g(\theta) \ne 0$ for all $\theta\in(0,1)$ we are done. Let us assume now that there exists a $\theta^\star\in(0,1)$ such that $g(\theta^\star) = 0$. Then, by Rolle's theorem, there must exist two critical points of $g$ in $(0,1)$: one in $(0,\theta^\star)$ and one in $(\theta^\star,1)$. We have that
$$g'(\theta) = 0 \iff \sinh(\theta x) = 2\theta\,\frac{\cosh(x)-1}{x}.$$
Setting $y = \theta x$, the equation above is the same as
$$\sinh(y) = 2y\,\frac{\cosh(x)-1}{x^2},$$
which has exactly three solutions $y_1 < 0$, $y_2 = 0$, $y_3 > 0$, since $2\cosh(x) > 2 + x^2$ for $x\ne 0$. Without loss of generality we assume that $x > 0$, and thus we get that there exists only one $\theta > 0$ such that $g'(\theta) = 0$, which is a contradiction.

D. Details on the second-order condition

In this section we provide further details on the second-order condition of Definition 2.4. We complement the discussion in Section 2 by showing that the norm-coercivity condition on the forward operator in Definition 2.4 is in fact mild, even for functions $\phi$ with full domain.

Proposition D.1. Let $\phi = h\circ\|\cdot\|$ be such that $h\in C^1(\mathbb{R})$ satisfies Assumption 1.1. If there exists some $C > 0$ such that
$$\|\nabla f(x)\| \le \bar L\,|h'(L\|x\|)| \tag{31}$$
for all $x$ such that $\|x\| \ge C$, then $\lim_{\|x\|\to\infty}\|T_{\delta L^{-1},\bar L^{-1}}(x)\| = \infty$ for all $\delta < 1$.

Proof. In the following we assume that $\|x\|$ is large enough such that the assumption of the proposition holds. We have that
$$\bar L^{-1}\|\nabla f(x)\| \le |h'(L\|x\|)| \quad\Longrightarrow\quad h^{*\prime}\big(\bar L^{-1}\|\nabla f(x)\|\big) \le L\|x\|. \tag{32}$$
The implication follows since $h^{*\prime}$ is nondecreasing and $h(0) \ge h(t) - h'(t)t$, meaning that $h'(t) \ge 0$ for $t \ge 0$ and thus $|h'(t)| = h'(t)$ on this interval, implying $h^{*\prime}(|h'(t)|) = t$.
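The 2-subhomogeneity inequality proved in Lemma 3.8 can be sanity-checked numerically (an illustration only, not part of the proof; the grid of $\theta$ and $x$ values is an arbitrary choice):

```python
import math

# Grid check of the 2-subhomogeneity of h(x) = cosh(x) - 1:
# cosh(theta * x) - 1 <= theta^2 * (cosh(x) - 1) for theta in [0, 1].
for i in range(101):
    theta = i / 100.0
    for x in [-10.0, -3.0, -0.5, 0.25, 1.0, 4.0, 12.0]:
        lhs = math.cosh(theta * x) - 1.0
        rhs = theta ** 2 * (math.cosh(x) - 1.0)
        assert lhs <= rhs + 1e-9, (theta, x)
print("cosh(theta*x) - 1 <= theta^2 * (cosh(x) - 1) verified on grid")
```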
Now, by the reverse triangle inequality:
$$\|T_{\delta L^{-1},\bar L^{-1}}(x)\| \ge \|x\| - \delta L^{-1}\,\big|h^{*\prime}\big(\bar L^{-1}\|\nabla f(x)\|\big)\big| \ge (1-\delta)\|x\|,$$
where the second inequality follows by (32). Therefore, since $\delta < 1$, $\lim_{\|x\|\to\infty}\|T_{\delta L^{-1},\bar L^{-1}}(x)\| = \infty$ and the proof is complete.

The fact that this condition is quite mild can be seen by choosing $\phi(x) = \cosh(\|x\|) - 1$, where we allow $\|\nabla f(x)\|$ to grow exponentially with $\|x\|$. It is straightforward that this condition holds, for example, when the norm of the gradient is bounded by some polynomial of the norm of $x$ when $\|x\|$ is large enough.

We next show that when the matrix $H\nabla^2 f(x)$ is symmetric, the norm-coercivity property of the forward operator in Definition 2.4 is not required in order to obtain a result similar to Proposition 2.6.

Proposition D.2. Let $f\in C^2(\mathbb{R}^n)$ be such that for all $x\in\mathbb{R}^n$ and $H\in\partial_C(\nabla\phi^*)\big(\bar L^{-1}\nabla f(x)\big)$,
$$\lambda_{\max}\big(H\nabla^2 f(x)\big) \le L\bar L \tag{33}$$
and $H\nabla^2 f(x)$ is symmetric. Then, the following inequality holds:
$$\big\langle T_{\gamma,\bar L^{-1}}(x) - T_{\gamma,\bar L^{-1}}(\bar x),\; x-\bar x\big\rangle \ge (1-\gamma L)\|x-\bar x\|^2 \tag{34}$$
for all $x,\bar x\in\mathbb{R}^n$. In particular, for $\gamma < L^{-1}$, the forward operator is strongly monotone with parameter $1-\gamma L$ and thus injective.

Proof. Note that the mapping $\nabla\phi^*(\bar L^{-1}\nabla f(\cdot))$ is locally Lipschitz, since $\nabla\phi^*$ is globally Lipschitz and $f\in C^2(\mathbb{R}^n)$. Then, we can invoke the generalized mean value theorem in its summation form (Facchinei & Pang, 2003, Proposition 7.1.16): for two points $x,\bar x\in\mathbb{R}^n$, there exist $n$ points $z_i\in(x,\bar x)$ and $n$ scalars $\alpha_i\ge 0$ summing to unity such that
$$\nabla\phi^*\big(\bar L^{-1}\nabla f(x)\big) - \nabla\phi^*\big(\bar L^{-1}\nabla f(\bar x)\big) = \sum_{i=1}^{n}\alpha_i V_i(x-\bar x), \tag{35}$$
where $V_i\in\partial_C\big(\nabla\phi^*\circ\bar L^{-1}\nabla f\big)(z_i)$. Now, in light of (Clarke, 1990, p. 75),
$$\partial_C\big(\nabla\phi^*\circ\bar L^{-1}\nabla f\big)(x)\,v \subseteq \operatorname{con}\big\{\partial_C(\nabla\phi^*)\big(\bar L^{-1}\nabla f(x)\big)\,\bar L^{-1}\nabla^2 f(x)\,v\big\}$$
for any $v\in\mathbb{R}^n$. Therefore, any $V_i(x-\bar x)$ can be written as $\bar L^{-1}\sum_{j=1}^{d}\beta_j H_j\nabla^2 f(z_i)(x-\bar x)$, for $d > 0$, with $H_j\in\partial_C(\nabla\phi^*)(\bar L^{-1}\nabla f(z_i))$ and $\beta_j\ge 0$ summing to unity. Taking an inner product with $(x-\bar x)$, we have that
$$\big\langle\nabla\phi^*\big(\bar L^{-1}\nabla f(x)\big) - \nabla\phi^*\big(\bar L^{-1}\nabla f(\bar x)\big),\; x-\bar x\big\rangle = \bar L^{-1}\sum_{i=1}^{n}\alpha_i\sum_{j=1}^{d}\beta_j\big\langle x-\bar x,\; H_j\nabla^2 f(z_i)(x-\bar x)\big\rangle \le L\|x-\bar x\|^2,$$
where we have used the fact that $\lambda_{\max}(H_j\nabla^2 f(z_i)) \le L\bar L$.
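The strong monotonicity claim (34) can be checked numerically in one dimension (a sketch with illustrative, hypothetical constants: kernel $h(t)=\cosh(t)-1$, $f(x)=\tfrac14 x^4$, $\bar L = 1$, $L = 2.5 > \sup_x 3x^2/\sqrt{1+x^6}$, and $\gamma = 0.2 < L^{-1}$):

```python
import math

# 1-D sketch of (34): T(x) = x - gamma * (h*)'(f'(x) / Lbar) with
# h(t) = cosh(t) - 1, so (h*)' = arcsinh, and f(x) = x**4 / 4.
# Illustrative constants: Lbar = 1 and L = 2.5 satisfy (33), since
# sup_x 3 x^2 / sqrt(1 + x^6) = sqrt(3) * 2**(1/3) ~ 2.19 < L * Lbar.
Lbar, L, gamma = 1.0, 2.5, 0.2

def T(x):
    return x - gamma * math.asinh((x ** 3) / Lbar)

# Check <T(x) - T(xb), x - xb> >= (1 - gamma * L) * |x - xb|^2 on sample pairs.
pts = [-4.0, -1.3, -0.2, 0.0, 0.7, 2.1, 5.0]
for x in pts:
    for xb in pts:
        lhs = (T(x) - T(xb)) * (x - xb)
        rhs = (1.0 - gamma * L) * (x - xb) ** 2
        assert lhs >= rhs - 1e-9, (x, xb)
print("forward operator is strongly monotone with parameter", 1.0 - gamma * L)
```

Here $T'(x) = 1 - \gamma\, 3x^2/\sqrt{1+x^6} \ge 1-\gamma L$ for all $x$, so the grid check is guaranteed to pass by the mean value theorem.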
By multiplying the inequality above with $-\gamma < 0$ and then adding $\|x-\bar x\|^2$ to both sides we obtain
$$\|x-\bar x\|^2 - \gamma\big\langle\nabla\phi^*\big(\bar L^{-1}\nabla f(x)\big) - \nabla\phi^*\big(\bar L^{-1}\nabla f(\bar x)\big),\; x-\bar x\big\rangle \ge (1-\gamma L)\|x-\bar x\|^2,$$
implying that
$$\big\langle T_{\gamma,\bar L^{-1}}(x) - T_{\gamma,\bar L^{-1}}(\bar x),\; x-\bar x\big\rangle \ge (1-\gamma L)\|x-\bar x\|^2,$$
which is the claimed result.

D.1. Examples

We now move on to providing examples of functions satisfying Definition 2.4. We consider the reference functions $\phi_1(x) = \cosh(\|x\|)-1$, $\phi_2(x) = \exp(\|x\|)-\|x\|-1$ and $\phi_3(x) = -\|x\|-\ln(1-\|x\|)$, which are generated by the (1-dimensional) kernel functions $h_1(x) = \cosh(x)-1$, $h_2(x) = \exp(|x|)-|x|-1$ and $h_3(x) = -|x|-\ln(1-|x|)$. We first recall the preconditioner $\nabla\phi^*$ and its Jacobian $\nabla^2\phi^*$ for general isotropic reference functions:
$$\nabla\phi_i^*(y) = h_i^{*\prime}(\|y\|)\,\frac{y}{\|y\|} \quad\text{for } y\ne 0,$$
$$\nabla^2\phi_i^*(y) = h_i^{*\prime\prime}(\|y\|)\,\frac{yy^\top}{\|y\|^2} + \frac{h_i^{*\prime}(\|y\|)}{\|y\|}\Big(I - \frac{yy^\top}{\|y\|^2}\Big) \quad\text{for } y\ne 0,$$
and $\nabla^2\phi_i^*(y) = h_i^{*\prime\prime}(\|y\|)\,I$ otherwise. For ease of presentation we denote $a_i(y) = h_i^{*\prime\prime}(y)$ and $b_i(y) = \frac{h_i^{*\prime}(y)}{y}$.

Table 2. Anisotropic smoothness constants for the examples of Appendix D.1.
$$\begin{array}{c|ccc}
 & \phi_1 & \phi_2 & \phi_3\\\hline
L_{\mathrm{norm}} & \dfrac{\sqrt{3}\,2^{1/3}}{\bar L^{1/3}} & \dfrac{2^{2/3}}{\bar L^{1/3}} & \dfrac{2^{4/3}}{3\bar L^{1/3}}\\[2mm]
L_{\mathrm{logistic}} & \dfrac{\|\alpha\|^2}{\sqrt{16\bar L^2+\|\alpha\|^2}} & \dfrac{\|\alpha\|^2}{4\bar L+\|\alpha\|} & \dfrac{\bar L\|\alpha\|^2+\|\alpha\|^3}{4(\bar L+\|\alpha\|)^2}
\end{array}$$

Example D.3 (Norm to power). Let $f(x) = \frac14\|x\|^4$. Then, $f$ is $(L,\bar L)$-anisotropically smooth relative to $\phi_i$ for any $\bar L > 0$ and $L > L_{\mathrm{norm}}$ from the first row of Table 2.

Proof. We first consider the more general $f(x) = \frac1p\|x\|^p$ with $p\ge 4$. The gradient and Hessian of $f$ are given respectively by
$$\nabla f(x) = \|x\|^{p-2}x, \qquad \nabla^2 f(x) = \|x\|^{p-2}I + (p-2)\|x\|^{p-4}xx^\top. \tag{36}$$
The second-order condition then involves the following quantity:
$$\begin{aligned}
\nabla^2\phi_i^*\big(\bar L^{-1}\nabla f(x)\big) &= a_i\big(\bar L^{-1}\|\nabla f(x)\|\big)\frac{\nabla f(x)\nabla f(x)^\top}{\|\nabla f(x)\|^2} + b_i\big(\bar L^{-1}\|\nabla f(x)\|\big)\Big(I - \frac{\nabla f(x)\nabla f(x)^\top}{\|\nabla f(x)\|^2}\Big)\\
&= a_i\big(\bar L^{-1}\|x\|^{p-1}\big)\frac{xx^\top}{\|x\|^2} + b_i\big(\bar L^{-1}\|x\|^{p-1}\big)\Big(I - \frac{xx^\top}{\|x\|^2}\Big).
\end{aligned}$$
We thus have:
$$\nabla^2\phi_i^*\big(\bar L^{-1}\nabla f(x)\big)\nabla^2 f(x) = (p-1)\,a_i\big(\bar L^{-1}\|x\|^{p-1}\big)\|x\|^{p-2}\frac{xx^\top}{\|x\|^2} + b_i\big(\bar L^{-1}\|x\|^{p-1}\big)\|x\|^{p-2}\Big(I - \frac{xx^\top}{\|x\|^2}\Big).$$
The largest eigenvalue of this (symmetric) matrix is
$$\lambda_{\max}\big(\nabla^2\phi_i^*(\bar L^{-1}\nabla f(x))\nabla^2 f(x)\big) = \max\big\{(p-1)\,a_i(\bar L^{-1}\|x\|^{p-1}),\; b_i(\bar L^{-1}\|x\|^{p-1})\big\}\,\|x\|^{p-2}, \tag{37}$$
and for all the reference functions we consider in this subsection, the supremum of (37) over $x\in\mathbb{R}^n$ is attained by the term involving $(p-1)\,a_i(\bar L^{-1}\|x\|^{p-1})$.
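The three isotropic preconditioners can be sketched in code (an illustration; the conjugate derivatives $h_1^{*\prime} = \operatorname{arcsinh}$, $h_2^{*\prime}(t) = \ln(1+t)$ and $h_3^{*\prime}(t) = t/(1+t)$ follow from inverting $h_1'(t) = \sinh(t)$, $h_2'(t) = e^t-1$ and $h_3'(t) = t/(1-t)$ on $t\ge 0$):

```python
import math

# Isotropic preconditioner: grad phi*_i(y) = (h_i*)'(||y||) * y / ||y||.
def precond(y, hstar_prime):
    n = math.sqrt(sum(v * v for v in y))
    if n == 0.0:
        return [0.0] * len(y)
    s = hstar_prime(n) / n
    return [s * v for v in y]

# (h*)' for the three kernels, valid on the nonnegative reals:
h1s = math.asinh                      # h1(t) = cosh(t) - 1
h2s = lambda t: math.log1p(t)         # h2(t) = exp(t) - t - 1 on t >= 0
h3s = lambda t: t / (1.0 + t)         # h3(t) = -t - log(1 - t) on 0 <= t < 1

# Sanity check: grad phi* inverts grad phi(x) = h'(||x||) * x / ||x||.
h1p = math.sinh
h2p = lambda t: math.expm1(t)
h3p = lambda t: t / (1.0 - t)
for hs, hp in [(h1s, h1p), (h2s, h2p), (h3s, h3p)]:
    x = [0.3, -0.2, 0.1]   # ||x|| < 1 so that h3 is defined
    y = precond(x, hp)     # y = grad phi(x), same isotropic formula
    x_back = precond(y, hs)
    assert all(abs(a - b) < 1e-9 for a, b in zip(x, x_back))
print("grad phi* o grad phi = id verified for all three kernels")
```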
Therefore, the inequality (5) dictates
$$(p-1)\,a_i\big(\bar L^{-1}\|x\|^{p-1}\big)\|x\|^{p-2} < L\bar L \quad\text{for all } x\in\mathbb{R}^n.$$
Now we specialize to each of the reference functions we consider, and take $p = 4$. For $\phi_1$, $\phi_2$ and $\phi_3$, we obtain the conditions
$$\phi_1:\ \frac{3\|x\|^2}{\sqrt{\bar L^2+\|x\|^6}} < L, \qquad \phi_2:\ \frac{3\|x\|^2}{\bar L+\|x\|^3} < L, \qquad \phi_3:\ \frac{3\bar L\|x\|^2}{(\bar L+\|x\|^3)^2} < L,$$
for which the left-hand sides are maximized at $\|x\| = (2\bar L^2)^{1/6}$, $\|x\| = (2\bar L)^{1/3}$ and $\|x\| = (\bar L/2)^{1/3}$ respectively. Plugging these values in yields the resulting lower bounds $L_{\mathrm{norm}}$ in Table 2. Since in every case $H\nabla^2 f(x)$ is a symmetric matrix, it follows from Proposition D.2 that the operator $T_{\delta L^{-1},\bar L^{-1}}$ is injective for any $\delta < 1$. This implies the anisotropic smoothness of $f$ relative to $\phi_3$ in light of Proposition 2.9, since $\operatorname{dom}\phi_3$ is bounded. For $\phi_1$ and $\phi_2$ the result follows from Proposition 2.9 by the reasoning in Remark 2.10.

We now move on to the logistic loss function.

Example D.4. Let $f(x) = \log(1+\exp(-\alpha^\top x))$. Then, $f$ is $(L,\bar L)$-anisotropically smooth relative to $\phi_i$ for any $\bar L > 0$ and $L > L_{\mathrm{logistic}}$ defined in the second row of Table 2.

Proof. The gradient and the Hessian of $f$ are given respectively by
$$\nabla f(x) = \frac{-\alpha}{1+\exp(\alpha^\top x)}, \qquad \nabla^2 f(x) = \frac{\alpha\alpha^\top\exp(-\alpha^\top x)}{(1+\exp(-\alpha^\top x))^2}. \tag{38}$$
In this case the second-order condition becomes
$$a_i\big(\bar L^{-1}\|\nabla f(x)\|\big)\,\frac{\|\alpha\|^2\exp(-\alpha^\top x)}{(1+\exp(-\alpha^\top x))^2} < L\bar L. \tag{39}$$
The results from Table 2 for $\phi_1$, $\phi_2$ and $\phi_3$ then follow respectively from
$$\frac{\|\alpha\|^2}{(1+\exp(-\alpha^\top x))\sqrt{\bar L^2(1+\exp(\alpha^\top x))^2+\|\alpha\|^2}} \le \frac{\|\alpha\|^2}{\sqrt{16\bar L^2+\|\alpha\|^2}} < L,$$
$$\frac{\|\alpha\|^2}{\big(\bar L(1+\exp(\alpha^\top x))+\|\alpha\|\big)(1+\exp(-\alpha^\top x))} \le \frac{\|\alpha\|^2}{4\bar L+\|\alpha\|} < L,$$
$$\frac{\bar L\|\alpha\|^2\exp(\alpha^\top x)}{\big(\bar L(1+\exp(\alpha^\top x))+\|\alpha\|\big)^2} \le \frac{\bar L\|\alpha\|^2+\|\alpha\|^3}{4(\bar L+\|\alpha\|)^2} < L.$$
Note that in this case as well, $H\nabla^2 f(x)$ is a symmetric matrix and it follows from Proposition D.2 that the operator $T_{\delta L^{-1},\bar L^{-1}}$ is injective for any $\delta < 1$. The growth condition in Proposition 2.9 is satisfied automatically, since the gradient of $f$ is bounded.

D.2. Gradient clipping

We now consider the case where $\phi = \frac12\|\cdot\|^2 + \delta_{B(0,1)}(\cdot)$. By (Themelis et al., 2019, p.
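The maximizations behind the $L_{\mathrm{norm}}$ row of Table 2 are easy to confirm numerically (a sketch with $\bar L = 1$, an illustrative choice; the grid and tolerances are arbitrary):

```python
import math

# Grid check (Lbar = 1) of the three maximizations behind L_norm:
#   phi1: 3 x^2 / sqrt(1 + x^6)  -> max sqrt(3) * 2**(1/3) at x = 2**(1/6)
#   phi2: 3 x^2 / (1 + x^3)      -> max 2**(2/3)           at x = 2**(1/3)
#   phi3: 3 x^2 / (1 + x^3)**2   -> max 2**(4/3) / 3       at x = (1/2)**(1/3)
funcs = [
    (lambda x: 3 * x * x / math.sqrt(1 + x ** 6), math.sqrt(3) * 2 ** (1 / 3)),
    (lambda x: 3 * x * x / (1 + x ** 3), 2 ** (2 / 3)),
    (lambda x: 3 * x * x / (1 + x ** 3) ** 2, 2 ** (4 / 3) / 3),
]
for g, m_analytic in funcs:
    m_grid = max(g(i / 2000.0) for i in range(1, 12000))  # x in (0, 6]
    assert m_grid <= m_analytic + 1e-9      # analytic value is an upper bound
    assert abs(m_grid - m_analytic) < 1e-4  # and it is attained on the grid
print("grid maxima match the closed-form L_norm values")
```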
404), the generalized Jacobian of the preconditioner is given by
$$\partial_C(\nabla\phi^*)(y) = \begin{cases}
\{I\}, & \text{if } \|y\| < 1,\\
\operatorname{con}\{I,\ \Pi_{y^\perp}\}, & \text{if } \|y\| = 1,\\
\{\|y\|^{-1}\Pi_{y^\perp}\}, & \text{if } \|y\| > 1,
\end{cases}$$
where $\Pi_{y^\perp}$ denotes the projection matrix onto the orthogonal complement of the subspace spanned by $y$. The second-order condition (5) then becomes
$$\begin{cases}
\lambda_{\max}(\nabla^2 f(x)) < L\bar L, & \text{if } \|\nabla f(x)\| < \bar L,\\
\lambda_{\max}\big((\alpha I + (1-\alpha)\Pi_{\nabla f(x)^\perp})\nabla^2 f(x)\big) < L\bar L, & \text{if } \|\nabla f(x)\| = \bar L,\\
\lambda_{\max}\big(\Pi_{\nabla f(x)^\perp}\nabla^2 f(x)\big) < L\|\nabla f(x)\|, & \text{if } \|\nabla f(x)\| > \bar L,
\end{cases}$$
for all $\alpha\in[0,1]$. In particular, for $\alpha = 1$, the second condition becomes $\nabla^2 f(x) \prec L\bar L I$, and using the fact that $\lambda_{\max}(\Pi_{\nabla f(x)^\perp}) = 1$, we also have
$$\lambda_{\max}\big((\alpha I + (1-\alpha)\Pi_{\nabla f(x)^\perp})\nabla^2 f(x)\big) \le \lambda_{\max}(\nabla^2 f(x)),$$
such that this clause can be merged with the first case. To rewrite the last case in terms of symmetric matrices, we note that
$$\lambda_{\max}\big(\Pi_{\nabla f(x)^\perp}\nabla^2 f(x)\big) = \lambda_{\max}\big(\Pi_{\nabla f(x)^\perp}^2\nabla^2 f(x)\big) = \lambda_{\max}\big(\Pi_{\nabla f(x)^\perp}\nabla^2 f(x)\Pi_{\nabla f(x)^\perp}\big),$$
where the last equality follows by (Horn & Johnson, 2012, Theorem 1.3.22). The second-order condition is therefore
$$\begin{cases}
\nabla^2 f(x) \prec L\bar L I, & \text{if } \|\nabla f(x)\| \le \bar L,\\
\Pi_{\nabla f(x)^\perp}\nabla^2 f(x)\Pi_{\nabla f(x)^\perp} \prec L\|\nabla f(x)\|\,I, & \text{if } \|\nabla f(x)\| > \bar L.
\end{cases}$$
Note that the first case is akin to standard $L$-smoothness, while the second case is reminiscent of $(L_0,L_1)$-smoothness with $L_0 = 0$, but restricted to some subspace. In particular, if $\nabla^2 f(x) = C(x)\nabla f(x)\nabla f(x)^\top$ for some $C:\mathbb{R}^n\to\mathbb{R}$, then the second case is always satisfied, and we only require that $\nabla^2 f(x) \prec L\bar L I$ for $\|\nabla f(x)\| \le \bar L$. This is the case for both of the examples considered in Appendix D.1.
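For this choice of reference function the preconditioned step reduces exactly to clipped gradient descent, which can be sketched as follows (an illustration; the constants $\gamma$, $\bar L$ and the test gradients are hypothetical):

```python
import math

# For phi = (1/2)||.||^2 + indicator of B(0,1), the preconditioner is
# grad phi*(y) = y if ||y|| <= 1, else y / ||y||  (clipping to the unit ball).
def clip_precond(y):
    n = math.sqrt(sum(v * v for v in y))
    return list(y) if n <= 1.0 else [v / n for v in y]

# The update x+ = x - gamma * clip_precond(g / Lbar) is gradient descent
# with effective step min(gamma / Lbar, gamma / ||g||) * g: clipped gradient.
gamma, Lbar = 0.1, 2.0

def update(x, g):
    p = clip_precond([gi / Lbar for gi in g])
    return [xi - gamma * pi for xi, pi in zip(x, p)]

# Small gradient (||g|| <= Lbar): a plain scaled gradient step.
x1 = update([1.0, 1.0], [1.0, 0.0])
assert abs(x1[0] - (1.0 - gamma * 1.0 / Lbar)) < 1e-12
# Large gradient: the direction is normalized, the step length is gamma.
x2 = update([1.0, 1.0], [100.0, 0.0])
assert abs(x2[0] - (1.0 - gamma)) < 1e-12
print("preconditioned step reduces to clipped/normalized gradient descent")
```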