# are_convex_optimization_curves_convex__d0622f6a.pdf

Published in Transactions on Machine Learning Research (06/2025)

Are Convex Optimization Curves Convex?

Guy Barzilai guybarzilai1@mail.tau.ac.il Tel-Aviv University

Ohad Shamir ohad.shamir@weizmann.ac.il Weizmann Institute of Science

Moslem Zamani moslem.zamani@uclouvain.be UCLouvain

Reviewed on Open Review: https: // openreview. net/ forum? id= TZtpxsel K2

In this paper, we study when we might expect the optimization curve induced by gradient descent to be convex precluding, for example, an initial plateau followed by a sharp decrease, making it difficult to decide when optimization should stop. Although such undesirable behavior can certainly occur when optimizing general functions, might it also occur in the benign and well-studied case of smooth convex functions? As far as we know, this question has not been tackled in previous work. We show, perhaps surprisingly, that the answer crucially depends on the choice of the step size. In particular, for the range of step sizes which are known to result in monotonic convergence to an optimal value, we characterize a regime where the optimization curve will be provably convex, and a regime where the curve can be non-convex. We also extend our results to gradient flow, and to the closely-related but different question of whether the gradient norm decreases monotonically.

1 Introduction

We consider the well-known gradient descent algorithm with constant step-size, which given a function f : Rn R, an initial point x0 Rn and step-size parameter η > 0, generates a sequence of points {xn} n=0 following the recurrence relation

n N : xn = xn 1 η f(xn 1) .

This sequence of points induces an optimization curve, which is the linear interpolation of {(n, f(xn))} n=0 in R2 (see Figure 1 for an example).

Our motivating question is to understand the properties of this optimization curve. For example, if the function f is convex and smooth, then choosing the step size η appropriately, it is well-known that the curve decreases monotonically to the minimal value of f (Nesterov, 2018). However, other properties might also be of interest. For example, an undesirable phenomenon concerning the optimization curve is when it plateaus for a while, and after that substantially decreases. This phenomenon is undesirable, since it can cause the user to terminate optimization prematurely, thinking that a minimum has been reached, where in fact continuing the optimization would have resulted in a significant improvement. An important property that disallows this undesirable phenomenon is convexity of the optimization curve: Namely, that the slope of the curve (or equivalently, {f(xn) f(xn+1)} n=0) decreases monotonically. Moreover, convexity of the optimization curve occurs very often in practice, in a wide variety of problems. However, to the best of our knowledge, no previous work attempted to rigorously study when we might expect the optimization curve to be convex.

To begin this investigation, we first note that for general non-convex functions, we have no hope of guaranteeing a convex optimization curve. For instance, if we initialize gradient descent close to a maximum of a

Published in Transactions on Machine Learning Research (06/2025)

3 2 1 0 1 2 3 x

Gradient Descent on f(x) = x2

0 2 4 6 8 10 n

Optimization Curve of Gradient Descent on f(x) = x2

Figure 1: Gradient descent on f(x) = x2 with step-size η = 0.1 and initial point x0 = 3.

smooth locally concave function, it is easily seen that the optimization rate will increase over time, leading to a concave optimization curve. Moreover, even if we consider the more benign case of convex functions, it is well-known that the optimization curve of gradient descent may not monotonically decrease in general (and certainly not be convex), as shown in the following simple example:

Example 1. Consider the convex 2-Lipschitz function

f(x) = |x| + max{0, x}

on R, which is minimized at 0. For any step size η > 0, suppose we initialize gradient descent at x0 = η

4. By definition of gradient descent, we have

x1 = x0 ηf (x0) = η

4η and x2 = x1 ηf (x1) = 3

hence f(x0) = η

4, f(x1) = 3

2η, f(x2) = 5

4η. Therefore, the optimization curve first increases and then decreases, and thus is neither convex nor monotonically decreasing.

This observation motivates us in focusing on the class of L-smooth convex functions, where the gradient is L-Lipschitz. This is an extremely well-studied class for gradient descent procedures, and in particular, it is well-known that for any η (0, 2

L), the optimization curve of gradient descent will monotonically converge to the minimal value (whereas for larger value of η, gradient descent may diverge; see Nesterov (2018, Theorem 2.1.14)). Thus, we will be interested in whether the optimization curve of gradient descent is convex, when optimizing such functions and in this step size regime. As far as we know, this question has not been addressed in previous literature.

In this paper, we provide an answer to this question, which is (perhaps unexpectedly) somewhat subtle: On the one hand, for any η (0, 1.75

L ] (which includes the worst-case optimal step size 1/L), we prove that the optimization curve is indeed convex. On the other hand, we prove that for η ( 1.75

L) the optimization curve may not be convex even though the curve is still monotonically decreasing and gradient descent is guaranteed to converge to a minimal value. In other words, there exist step sizes that ensure monotonic convergence of gradient descent, but do not ensure convexity of the induced optimization curve.

As another contribution, we consider a closely-related but different question: Namely, whether the norm of the gradients { f(xn) } n=0 monotonically decrease, for L-smooth convex functions. This question is related, as the gradient captures the local slope of the function, which correlates with the magnitude of the objective function decrease after a single gradient step, assuming smoothness and that the step is small enough (in fact, this intuition is used more formally in our proofs). Although one might suspect that the behavior of gradient norm decay will be similar to the question of optimization curve convexity, we prove

Published in Transactions on Machine Learning Research (06/2025)

that the answer here is different and more positive: The gradient norm is decreasing, for any η (0, 2

L]. Finally, we extend our analysis to gradient flow (the continuous-time analogue of gradient descent), and prove that for smooth convex functions, the gradient flow optimization curve is always convex, and that the gradient norm is monotonically decreasing.

2 Preliminaries

We will need the following simple lemma, which essentially states that the integral of a non-decreasing function is convex:

Lemma 1. Let g : R R be a Riemann integrable function on every closed sub-interval of R. Define f : R R such that for every x R:

If g is non-decreasing, then f is convex.

Proof. Let s < u < t be real numbers. By definition of convexity, it is enough to show that

u s f(t) f(u)

We note that by definition of f, it holds that f(u) f(s) = R u s g(x)dx, and by the monotonicity of g, it therefore holds that (u s) g(s) f(u) f(s) (u s) g(u). Thus,

g(s) f(u) f(s)

u s g(u) and similarly g(u) f(t) f(u)

u t g(t) . (2)

Obviously, (2) implies (1), completing the proof.

As discussed in the introduction, we consider an optimization curve induced {f(xn)} n=0 to be convex, if the curve formed from the linear interpolation of the discrete points {(n, f(xn))} n=0 is convex (in the standard sense, e.g. its epigraph is a convex set). This is formalized in the following definition:

Definition 1. Let {an} n=0 be a sequence of real numbers. Then we say that the mapping n 7 an is convex over Z 0, if the function g : [0, ) R defined as

g(t) = a t + (t t )(a t +1 a t )

By Lemma 1 and the Newton-Leibniz theorem (as appears, for instance, in Bartle & Sherbert (2011, Theorem 7.3.1)), we see that a real valued sequence {an} n=0 is convex over Z 0 if and only if the sequence {an an+1} n=0 is non-increasing (actually, the only if part follows from elementary properties of convex functions). It shall often be convenient to use the latter characterization rather than the former.

3 Gradient Descent

In this section, we shall investigate the optimization curves induced by gradient descent on convex L-smooth functions. For such functions, it is well-known that for any η (0, 2

L), the gradient descent optimization curve monotonically decreases to the minimal function value (see Nesterov (2018, Theorem 2.1.14)). In the following theorem, we prove that in the more restricted regime η (0, 1.75

L ], the optimization curve is also convex:

Theorem 1. Let f : Rn R be a convex L-smooth function, and let {xn} n=0 be the sequence of iterates produced by gradient descent, starting from some x0 Rn, using a step size η (0, 1.75

L ]. Then the mapping n 7 f(xn) is convex (equivalently, the sequence {f(xn) f(xn+1)} n=0 is non-increasing).

Published in Transactions on Machine Learning Research (06/2025)

Proof. Arguing by induction and using the fact that x0 is arbitrary, it suffices to prove that

f(x2) f(x1) f(x1) f(x0) . (3)

By convexity and by L-smoothness of f, we have (see Nesterov (2018, Theorem 2.1.5)):

f(x0) f(x1) f (x1) , x0 x1 + 1

2L f (x1) f (x0) 2 ,

f(x2) f(x1) f (x1) , x2 x1 + 1

2L f (x2) f (x1) 2 ,

f(x2) f(x0) f (x0) , x2 x0 + 1

2L f (x2) f (x0) 2 .

Due to gradient descent update rule, we have x1 = x0 η f (x0) and x2 = x0 η f (x0) η f (x1) . Substituting these values into the inequalities, we obtain:

f(x0) f(x1) η f (x1) , f (x0) + 1

2L f (x1) f (x0) 2 ,

f(x2) f(x1) η f (x1) 2 + 1

2L f (x2) f (x1) 2 ,

f(x2) f(x0) η f (x0) 2 η f (x0) , f (x1) + 1

2L f (x2) f (x0) 2 .

Multiplying the above inequalities by 3

2 respectively and summing them up, we obtain:

f(x2) f(x1) (f(x1) f(x0)) 3η

2 f (x1) , f (x0) + 3

4L f (x1) f (x0) 2

2 f (x1) 2 + 1

4L f (x2) f (x1) 2

2 f (x0) 2 η

2 f (x0) , f (x1)

4L f (x2) f (x0) 2 . (4)

Rearranging terms, the right-hand side equals 3

f (x1) f (x0) 2 + 1

4L f (x2) f (x1) 2 + 1

4L f (x2) f (x0) 2

f (x1) f (x0) 2

2 f (x1) f (x0) 2 + f (x2) f (x1) 2 + f (x2) f (x0) 2

f (x1) f (x0) 2 + 1

where the last transition is by the elementary equality 1

2 b a 2 + c b 2 + c a 2 = 2 c 1

2a 2, which can be easily verified via expansion. Overall, we have shown that

f(x2) f(x1) (f(x1) f(x0)) 7

f (x1) f (x0) 2

The first term is non-negative (since we assume η 7 4L = 1.75

L ), and so is the second term. This implies (3) and concludes the proof.

Published in Transactions on Machine Learning Research (06/2025)

We note that the theorem applies for step sizes in the range (0, 1.75

L ], which is more restrictive than the regime (0, 2

L) where gradient descent is known to monotonically decrease to an optimal value. Thus, one may wonder whether this restriction is necessary, or perhaps Theorem 1 can be extended for a broader range of step-sizes. In the following theorem, we prove that precisely this restriction is indeed necessary namely, for any step size η ( 1.75

L), gradient descent will converge and the optimization curve will monotonically decrease, without necessarily being convex:

Theorem 2. For every L > 0 there exists a convex L-smooth function f : R R and x0 R such that for every step-size η ( 1.75

L), the mapping n 7 f(xn) is not convex.

Proof. We note that without loss of generality, it suffices to prove the theorem for L = 1. To see this, suppose our theorem holds for L = 1. This means that there exists a convex 1-smooth function f : R R and x0 R, such that for every step-size η (1.75, 2), the optimization curve of gradient descent on f with initial point x0 and a constant step-size η isn t convex. Now, define g : R R such that for every t R: g(t) = f(

Lt), define x0 = x0

L and define η = η

L. The reader may verify that g is convex and L-smooth. Now, for n N, define xn = xn 1 ηf (xn 1), xn = xn 1 ηg ( xn 1) .

Arguing by induction, we claim that for n Z 0 it holds that xn = xn

L. Indeed, the base case n = 0 holds by definition, and if the claim holds for n Z 0, then

xn+1 = xn ηg ( xn) = xn

L (xn ηf (xn)) = xn+1

It obviously follows by the definition of g that g( xn) = f(xn) for all n Z 0. Thus, the optimization curve corresponding to g, x0 and η is the same as the optimization curve corresponding to f, x0 and η - which shows why it suffices to prove the theorem for L = 1.

We now turn to prove the theorem for L = 1. Specifically, consider the function f : R R defined as

( 1 2t2, t 1 t 1

It is fairly obvious and easy to check that f is indeed a convex 1-smooth function.

Now, set x0 = 1.8 and let there be η (1.75, 2). We have

x1 = x0 ηf (x0) = (η 1)( x0) = 1.8(η 1) ,

and due to the fact that η > 1.75, we have x1 > 0.75 1.8 = 1.35. This implies that

x2 = x1 ηf (x1) = x1 η = 1.8 + 0.8η < 1.8 + 0.8 2 = 0.2 ,

where in the last equality we recalled that η < 2. Thus:

f(x0) f(x1) < f(x1) f(x2) 1

2x2 0 (x1 1

0.5( 1.8)2 (1.8η 1.8 0.5) < (1.8η 1.8 0.5) 0.5( 1.8 + 0.8η)2 .

Rearranging the terms, we get the equivalent inequality η2 15.75η + 24.5 < 0, which is easily verified to be equivalent to 1.75 < η < 14. This holds due to the assumption that η (1.75, 2). Thus, f(x0) f(x1) < f(x1) f(x2) holds, which means that the relevant optimization curve is not convex. This concludes our proof.

Published in Transactions on Machine Learning Research (06/2025)

So far, we have considered the convexity of the optimization curve induced by gradient descent. A related property of interest is how the sequence of gradient norms { f(xn) } n=0 behave. It is well-known (see Nesterov (2012)) that for convex smooth functions, the minimal value after N iterations, namely minn N f(xn) , is upper bounded by a function monotonically decreasing to 0 (as should be expected when the method converges to a global minimum). However, does the sequence { f(xn) } n=0 itself decay monotonically? The following theorem establishes that this is true, for any η in the regime where gradient descent converges:

Theorem 3. Let f : Rn R be a convex L-smooth function, and let {xn} n=0 be the sequence of iterates produced by gradient descent, starting from some x0 Rn, using a step size η (0, 2

L]. Then the sequence { f(xn) } n=0 is non-increasing.

Proof. Arguing by induction, and using the fact that x0 is arbitrary, it suffices to show that f(x0) f(x1) .

Using a standard inequality for convex L-smooth functions (see Nesterov (2018, Theorem 2.1.5)), we have that for all x, y Rn:

f(x) f(y), x y 1

L f(x) f(y) 2 . (5)

Plugging x0, x1 into (5), we have

f(x1) f(x0), x1 x0 1

L f(x1) f(x0) 2

f(x1) f(x0), η f(x0) 1

L f(x1) f(x0) 2

η f(x1), f(x0) f(x0) 2 1

f(x1) 2 2 f(x1), f(x0) + f(x0) 2

L η) f(x1), f(x0) ( 1

L η) f(x0) 2 + 1

L f(x1) 2 . (6)

Now, if η = 2

L, then by substituting for η and rearranging the terms, we get from (6) that

1 L f(x0) 2 1

L f(x1) 2 ,

which obviously implies the desired result. Otherwise, we have 2 L η > 0 by assumption. Using Cauchy Schwarz, (6) implies

L η) f(x1) f(x0) ( 1

L η) f(x0) 2 + 1

L f(x1) 2 .

Rearranging the terms, we get the equivalent inequality

( f(x0) f(x1) ) (η 1

L) f(x0) + 1

L f(x1) 0 . (7)

Now, if f(x1) = 0 then we are obviously done. Otherwise, assume for the sake of contradiction that f(x1) > f(x0) . Then, splitting to the cases η 0, 1

L , it is easy to verify that (η 1

L) f(x0) + 1

L f(x1) > 0 .

Together with f(x0) f(x1) < 0 this results in

( f(x0) f(x1) ) (η 1

L) f(x0) + 1

L f(x1) < 0 ,

which contradicts (7).

Therefore, in any case, we have f(x0) f(x1) as required.

Published in Transactions on Machine Learning Research (06/2025)

4 Gradient Flow

In this section we extend our results to the optimization curve induced by gradient flow, and prove that it is always convex (when optimizing smooth convex functions).

First, we recall that given a function f C1(Rn, R) (namely, differentiable with a continuous gradient), and some x0 Rn, the gradient flow (x(t))t 0 is defined by the differential equation ( x (t) = f(x(t)) x(0) = x0 .

Assuming L-smoothness of f for some L < is a sufficient condition for this differential equation to have a unique, well-defined solution globally on all of t 0 (this follows from the Picard-Lindelöf theorem). We can then define the optimization curve induced by gradient flow as follows:

Definition 2. Given a function f C1(Rn, R) for which gradient flow (starting from some x0) is uniquely defined, the associated optimization curve g : [0, ) R is defined as g(t) = f(x(t)) for all t [0, ).

Before discussing the case of smooth convex functions, let us first consider the simpler case of convex C2

functions. In that case, one may verify that there exists a unique solution defined for t 0 for the gradient flow differential equation (again, this can be shown to follow from the Picard-Lindelöf theorem and the fact that the gradient norm is decreasing, due to the function s smoothness and convexity). Moreover, in that case convexity of the optimization curve is easy to establish:

Theorem 4. Let f : Rn R be a convex function such that f C2(Rn, R) and x0 Rn. Then the resulting gradient flow optimization curve is convex.

Proof. Recall that the optimization curve is defined as g(t) = f(x(t)) for all t 0. By elementary results, a sufficient condition for convexity of g is that it is twice differentiable, with a non-negative second derivative.

By the definition of the gradient flow differential equation, we have for all t 0:

x (t) = f(x(t)) . (8)

Therefore, x( ) is differentiable for all t 0. Now, using the chain rule we have:

g (t) = f (x(t)) , x (t) (8) = f (x(t)) 2 . (9)

Owing to the fact that f C2(Rn, R), we can differentiate (9) again using the chain rule to get:

g (t) = 2 f (x(t)) , Hf (x(t)) x (t) (8) = 2 f (x(t)) , Hf (x(t)) f (x(t)) (10)

where Hf is the hessian matrix of f. We note that f is convex and f C2(Rn, R). Therefore, for all z Rn

it hold that Hf(z) is positive semi-definite (see Nesterov (2018, Theorem 2.1.4)). Specifically, for all t 0,

( f (x(t)))T Hf (x(t)) f (x(t)) 0

and therefore by (10), g (t) 0 for all t 0. This implies that g is twice differentiable on [0, ), and that its second derivative is non-negative, which concludes the proof.

We now turn to discuss the case of convex and smooth functions (which are not necessarily twicedifferentiable). The theorem below implies that the resulting optimization curve is always convex:

Theorem 5. Let f : Rn R be a convex L-smooth function for some L (0, ), and x0 Rn. Then the resulting gradient flow optimization curve is convex.

The proof of the theorem (which is a bit involved and appears in the appendix) builds on a variant of the proof of Theorem 1, and the fact that gradient descent can be seen as a discretization of gradient flow.

Furthermore, from Theorem 5, one concludes the gradient flow equivalent of Theorem 3.

Published in Transactions on Machine Learning Research (06/2025)

Theorem 6. Let f : Rn R be a convex L-smooth function for some L (0, ), and x0 Rn. Let x( ) be the unique solution for the corresponding gradient flow differential equation. Then the map t 7 f (x(t)) is non-increasing over [0, )

Proof. Recall that the optimization curve is defined as g(t) = f(x(t)) for all t 0. From Theorem 5, g is convex, and by the monotonicity of gradients of convex functions, it follows that g is non-decreasing over [0, ). From the definition of the gradient flow differential equation, for t [0, ) it holds that:

x (t) = f (x(t)) .

Therefore, by the chain rule, for t [0, ):

g (t) = f (x(t)) 2 .

Hence, it follows that t 7 f (x(t)) 2 is non-decreasing on t [0, ), and the result follows.

5 Conclusion and Discussion

In this paper, we studied the convexity of optimization curves induced by gradient descent, focusing on convex L-smooth functions. For step sizes that assure monotonic convergence merely η (0, 2

L), we identified the regime that ensures convexity of the induced optimization curve, namely η (0, 1.75

L ]. Moreover, we showed that in the complementary regime of ( 1.75

L), the induced optimization curve may not be convex, even though the gradient descent still converges monotonically to a minimum. In addition, we used the close connection between gradient descent and gradient flow to show that the optimization curve induced by gradient flow is convex as well. Finally, we studied the closely-related question of whether the gradient norms decrease, and showed that here there are no separate regimes: For both gradient descent and gradient flow, the gradient norms decrease monotonically.

Our work leaves several open questions and directions for future research. For example, the proof of nonconvexity in Theorem 2 relies on showing lack of convexity between two consecutive steps - i.e, we showed that f(x0) f(x1) < f(x1) f(x2). One may ask whether there exist cases where more steps fail to be convex - i.e, f(xn) f(xn+1) < f(xn+1) f(xn+2) holds for multiple n Z 0, or whether the optimization curve might be concave for several consecutive steps. Another obvious direction for research is to consider other optimization settings where the optimization curve is convex or non-convex, or consider this question for other optimization methods.

Robert Gardner Bartle and Donald R Sherbert. Introduction to Real Analysis. Wiley, 4th edition, 2011.

Arieh Iserles. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, 1996.

Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10 11, 2012.

Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2nd edition, 2018.

Published in Transactions on Machine Learning Research (06/2025)

A Proof of Theorem 5

In order to prove theorem 5, we recall the well-known fact that gradient descent can be seen as a discrete approximation of gradient flow, namely the solution of its differential equation:

Definition 3. Let f C1(Rn, R) be a convex function, and x0 Rn such that the gradient flow differential equation has a unique solution x( ) (as a function on [0, )). Let η > 0 be fixed. As in gradient descent, for n N define: xn = xn 1 η f(xn 1). Define Euler s approximation with step-size η to be x(η) : [0, ) Rn such that for all t [0, ),

x(η)(t) = x t

The reader may have already observed that in fact, the points generated by gradient descent with step-size η > 0 are exactly the points x(η)(nη) n=0 where x(η)( ) is the corresponding Euler s approximation. In other words - gradient descent with step-size η > 0, is the Euler discretization (with step-size η) associated with the gradient flow differential equation.

Another important property we will need is that as η converges to 0, the Euler approximation generally converges to the solution of the differential equation. Proofs of this fact may be found in most textbooks on differential equations. However, for the sake of completeness we shall present such a proof, adapted to our specific assumptions and needs, and similar to the one that appears in Iserles (1996, Theorem 1.1):

Theorem 7. Let L > 0, f : Rn R be a convex L-smooth function and x0 Rn. For every η > 0 let x(η)( ) be the corresponding Euler s approximation, and let x( ) be the unique solution for the corresponding gradient flow differential equation. Then for every R > 0

uniformly on [0, R].

Proof. Fix R > 0 and 0 < η < 1. f is L-smooth, and thus x( ) (as a solution to the corresponding gradient flow differential equation) is uniquely defined on [0, ). Moreover, from the definition of gradient flow, for all t [0, ) it holds that: x (t) = f(x(t)) .

Therefore, x( ) is differentiable on [0, ), and therefore continuous on [0, ). Now, by the fact that f is L-smooth (and particularly f C(Rn, Rn)), it holds that the map t 7 f(x(t)) is continuous on [0, ), meaning that x C1 ([0, ), Rn) (and particularly x C ([0, ), Rn)). Therefore, x( ) is bounded on [0, R + 1] - there exists M > 0 such that x(t) M for all t [0, R + 1].

Now, from the fact that f C(Rn, Rn), it holds that f is bounded on {w Rn : w M}. It therefore holds that x( ) has a bounded derivative on [0, R+1], and therefore is Lipschitz on [0, R+1] - i.e there exists K > 0 such that for all t1, t2 [0, R + 1],

x(t1) x(t2) K |t1 t2| .

Now, for n Z 0 denote en = x(η)(nη) x(nη). By definition of the differential equation and of Euler s

approximation, it holds that e0 = 0. Denote n = j R

η k + 1. Because 0 < η < 1, it holds that η n R + 1.

Fixing some n {0, . . . , n 1}, we have

en+1 = x(η)((n + 1)η) x((n + 1)η) = x(η)(nη) η f(x(η)(nη))

x(nη) Z (n+1)η

nη f(x(s))ds

= en + Z (n+1)η

nη f(x(s))ds η f(x(η)(nη)) = en + Z (n+1)η

h f(x(s)) f(x(η)(nη)) i ds ,

Published in Transactions on Machine Learning Research (06/2025)

en+1 en + Z (n+1)η

f(x(s)) f(x(η)(nη)) ds en + L Z (n+1)η

x(s) x(η)(nη) ds

en + L Z (n+1)η

x(s) x(nη) + x(nη) x(η)(nη) ds

= (1 + Lη) en + L Z (n+1)η

nη x(s) x(nη) ds

(1 + Lη) en + LK Z (n+1)η

nη |s nη|ds

= (1 + Lη) en + LK

Thus, arguing by induction, we get that for n {0, . . . , n },

j=0 (1 + Lη)j .

Furthermore, a similar calculation implies that in fact, for every t [0, R],

x(η)(t) x(t) (1 + Lη) e t

and thus for every t [0, R],

x(η)(t) x(t) LK

j=0 (1 + Lη)j LK

j=0 (1 + Lη)j = LK

2 η2 (1 + Lη)n 1

2 (1 + Lη)n ( ) Kη

2 e Lηn ( ) Kη

2 e L(R+1) .

In the above, ( ) is due to the fact that for all s R it holds that 1+s es, and ( ) is due to η n R+1. Now, we notice that: Kη

2 e L(R+1) η 0 0 ,

which concludes our proof.

To apply this result in the context of optimization curves, we have the following immediate corollary:

Corollary 1. Let L > 0, f : Rn R be a convex L-smooth function and x0 Rn. For every η > 0 let x(η)( ) be the corresponding Euler s approximation, and let x( ) be the unique solution for the corresponding gradient flow differential equation. Then for all t 0:

f x(η)(t) η 0 f (x(t))

Proof. Theorem 7 obviously implies that for all t 0 we have:

x(η)(t) η 0 x(t) .

The result now follows from the continuity of f.

Now, we would like to use the convergence established by this corollary, along with the results of Theorem 1, to prove that the optimization curve of gradient flow is convex as well. However, one should note that Theorem 1 establishes the convexity of the optimization curve of gradient descent (which is defined to be

Published in Transactions on Machine Learning Research (06/2025)

the linear interpolation between the points {(n, f(xn))} n=0), rather than the convexity of the mapping t 7 f(x(η)(t)) discussed in the corollary. Therefore, we shall need a variant of Theorem 1, which concerns the convexity of the mapping t 7 f(x(η)(t)). For technical reasons, we will utilize a rather different proof technique to prove the result. The aforementioned variant is formalized as follows:

Theorem 8. Let f : Rn R be a convex L-smooth function, and let x0 Rn. For every η > 0 let x(η)( ) be Euler s approximation of the corresponding gradient flow differential equation with step size η. Then for all η (0, 1

L] the map t 7 f x(η)(t) is convex (over [0, )).

Proof. By the definition of L-smoothness, if follows that f is also L -smooth for all L L. Therefore, for all η (0, 1

L] it holds that f is 1

η-smooth. Therefore, without loss of generality, it is sufficient to prove the result for η = 1

Looking at x( 1

L ), it is easy to notice by its definition that for all n Z 0, x( 1

L , Rn). Thus, it holds that x( 1

L ) C1 P W ([0, ), Rn) - x( 1

L ) is a piece-wise C1 curve on [0, ). Thus, by the chain rule

and the fact that f C1(Rn, R), it holds that the mapping t 7 f x( 1

L )(t) is in C1( n

L , R) for every

n Z 0, and thus, it is a C1 piece-wise function on [0, ).

Now, define g : [0, ) R as

g(t) = D f x( 1

L ) (nt/L) (t nt/L) f x( 1

L ) (nt/L) , f x( 1

L ) (nt/L) E ,

where we denote nt = t L . It is easily verified that wherever the map t 7 f x( 1

L )(t) is differentiable, its

derivative is exactly g(t). Considering the fact that this mapping is differentiable at [0, ) \ { n

L} n=0 and that g is easily seen to be a piece-wise continuous function (and thus it is Riemann integrable on every closed sub-interval of [0, )), we have by the Newton-Leibniz theorem (see the exact form in Bartle & Sherbert (2011, Theorem 7.3.1)) that

L )(t) f(x0) = Z t

for all t 0. Now, by lemma 1, proving that g is non-decreasing shall suffice, as it shall imply that the map t 7 f x( 1

L )(t) is convex over [0, ).

To that end, we shall prove the following:

(i) For all n Z 0, g is non decreasing on n

(ii) For all n N, limt n

L g(t) g( n

Along with the fact that g is piece-wise continuous (with discontinuities at { n

L} n=0), these two properties are easily seen to imply that g is non-decreasing. Thus, proving them will conclude the proof.

Thus, we now turn to prove (i). Let there be n Z 0 and let there be t1, t2 n

L such that t2 > t1.

Denote y(t) = x( 1

(t2 t1)(g(t2) g(t1)) = f(y(t2)) f(y(t1)), y(t2) y(t1) 0 ,

where the inequality follows from convexity of f (and the fact that it is in C1(Rn, R)). It obviously follows that g(t2) g(t1), as desired.

Turning to prove (ii), let there be n N and denote z0 = x( 1

L , z1 = x( 1

L . Obviously we have:

L g(t) = f z0 1

L f (z0) , f (z0) .

Published in Transactions on Machine Learning Research (06/2025)

We now notice the following:

L g(t) = f(z1), f(z1) f(z1), f(z0) = f(z1), f(z0) f(z1) =

= f(z1), f(z0) f(z1) f(z0), f(z0) f(z1) + f(z0), f(z1) f(z0) =

= f(z1) f(z0), f(z0) f(z1) f(z0) 2 =

= L f(z1) f(z0), 1

L f(z1) f(z0) 2 =

= L f(z1) f(z0), z1 z0 1

L f(z1) f(z0) 2 0 ,

where the inequality holds since f is convex and L-smooth (see (Nesterov, 2018, Theorem 2.1.5)). Thus, limt n

L g(t) g( n

L) as desired. As mentioned above, this concludes the proof.

Now, having these results, we just need one more technical lemma in order to prove theorem 5.

Lemma 2. Let A Rn be a convex set, f : A R and g : (0, ) A R such that:

1. For every x A: g(s, x) s 0 f(x).

2. There exists δ > 0 such that for all s (0, δ) the map x 7 g(s, x) is convex over A.

Then f is a convex function.

Proof. Fix some x, y A and λ (0, 1). Then for every s (0, δ), by convexity of x 7 g(s, x), it holds that

g(s, λx + (1 λ)y) λg(s, x) + (1 λ)g(s, y) .

Now, taking the limit s 0 in both sides of the above inequality, and using the first assumption in the lemma statement, we get f(λx + (1 λ)y) λf(x) + (1 λ)f(y).

Since x, y, λ are arbitrary, the result follows by definition of convexity.

Now we are ready to prove theorem 5.

Proof of Theorem 5. As usual, for every η > 0 we denote the corresponding Euler s approximation as x(η)( ). By theorem 8 we get that for every η (0, 1

L), the map t 7 f x(η)(t) is convex over [0, ). Corollary 1 implies that for every t > 0, f x(η)(t) η 0 f (x(t)) .

Combined with Lemma 2, we get that the map t 7 f (x(t)) is convex over [0, ), as required.