# Adaptive Gradient Descent without Descent

Yura Malitsky¹*  Konstantin Mishchenko²*

*Equal contribution. ¹EPFL, Lausanne, Switzerland. ²KAUST, Thuwal, Saudi Arabia. Correspondence to: Yura Malitsky, Konstantin Mishchenko. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on the smoothness in a neighborhood of a solution. Given that the problem is convex, our method converges even if the global smoothness constant is infinity. As an illustration, it can minimize an arbitrary twice continuously differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including logistic regression and matrix factorization.

1. Introduction

Since the early days of optimization it was evident that there is a need for algorithms that are as independent from the user as possible. First-order methods have proven to be versatile and efficient in a wide range of applications, but one drawback has been present all that time: the stepsize. Despite certain success stories, line search procedures and adaptive online methods have not removed the need to manually tune the optimization parameters. Even in smooth convex optimization, which is often believed to be much simpler than the nonconvex counterpart, robust rules for stepsize selection have been elusive. The purpose of this work is to remedy this deficiency.

The problem formulation that we consider is the basic unconstrained optimization problem

$$\min_{x \in \mathbb{R}^d} f(x), \qquad (1)$$

where $f \colon \mathbb{R}^d \to \mathbb{R}$ is a differentiable function. Throughout the paper we assume that (1) has a solution and we denote its optimal value by $f^*$.

The simplest and most known approach to this problem is the gradient descent method (GD), whose origin can be traced back to Cauchy (Cauchy, 1847; Lemaréchal, 2012). Although it is probably the oldest optimization method, it continues to play a central role in modern algorithmic theory and applications. Its definition can be written in a mere one line,

$$x^{k+1} = x^k - \lambda \nabla f(x^k), \quad k \geq 0, \qquad (2)$$

where $x^0 \in \mathbb{R}^d$ is arbitrary and $\lambda > 0$. Under the assumption that $f$ is convex and $L$-smooth¹, that is,

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y, \qquad (3)$$

one can show that GD with $\lambda \in (0, \frac{2}{L})$ converges to an optimal solution (Polyak, 1963). Moreover, with $\lambda = \frac{1}{L}$ the convergence rate (Drori & Teboulle, 2014) is

$$f(x^k) - f^* \leq \frac{L\|x^0 - x^*\|^2}{2(2k+1)}, \qquad (4)$$

where $x^*$ is any solution of (1). Note that this bound is not improvable (Drori & Teboulle, 2014).

¹Alternatively, we will say that $\nabla f$ is $L$-Lipschitz.

We identify four important challenges that limit the applications of gradient descent even in the convex case:

1. GD is not general: many functions do not satisfy (3) globally.
2. GD is not a free lunch: one needs to guess $\lambda$, potentially trying many values before a success.
3. GD is not robust: failing to provide $\lambda < \frac{2}{L}$ may lead to divergence.
4. GD is slow: even if $L$ is finite, it might be arbitrarily larger than the local smoothness.
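For reference, the fixed-stepsize update (2) is a one-liner in code. The following minimal NumPy sketch is purely illustrative; the test function, stepsize, and iteration count are placeholder choices, not part of the paper:

```python
import numpy as np

def gradient_descent(grad_f, x0, lam, n_iters):
    """Plain gradient descent, eq. (2): x_{k+1} = x_k - lam * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lam * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x and L = 1,
# so any fixed stepsize lam in (0, 2) converges.
x_min = gradient_descent(lambda x: x, x0=np.ones(5), lam=1.0, n_iters=100)
```

The four challenges above all stem from the single constant `lam` in this loop: it must be guessed, it must respect a global $L$, and it cannot react to the local geometry.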
1.1. Related work

Certain ways to address some of the issues above already exist in the literature. They include line search, adaptive Polyak's stepsize, mirror descent, dual preconditioning, and stepsize estimation for subgradient methods. We discuss them one by one below, in a process reminiscent of cutting off Hydra's limbs: if one issue is fixed, two others take its place.

The most practical and generic solution to the aforementioned issues is known as line search (or backtracking). This direction of research started from the seminal works (Goldstein, 1962) and (Armijo, 1966) and continues to attract attention, see (Bello Cruz & Nghia, 2016; Salzo, 2017) and references therein. In general, at each iteration the line search executes another subroutine with additional evaluations of $f$ and/or $\nabla f$ until some condition is met. Obviously, this makes each iteration more expensive.

At the same time, the famous Polyak's stepsize (Polyak, 1969) stands out as a very fast alternative to gradient descent. Furthermore, it does not depend on the global smoothness constant and uses the current gradient to estimate the geometry. The formula might look deceitfully simple,

$$\lambda_k = \frac{f(x^k) - f^*}{\|\nabla f(x^k)\|^2},$$

but there is a catch: it is rarely possible to know $f^*$. This method, again, requires the user to guess $f^*$. What is more, with $\lambda$ it was fine to underestimate it by a factor of 10, but the guess for $f^*$ must be tight, otherwise it has to be reestimated later (Hazan & Kakade, 2019).

Seemingly no such issue is present in the Barzilai-Borwein stepsize. Motivated by quasi-Newton schemes, (Barzilai & Borwein, 1988) suggested using the steps

$$\lambda_k = \frac{\langle x^k - x^{k-1}, \nabla f(x^k) - \nabla f(x^{k-1})\rangle}{\|\nabla f(x^k) - \nabla f(x^{k-1})\|^2}.$$

Alas, the convergence results regarding this choice of $\lambda_k$ are very limited and the only known case where it provably works is quadratic problems (Raydan, 1993; Dai & Liao, 2002). In general it may not work even for smooth strongly convex functions, see the counterexample in (Burdakov et al., 2019).

Other, more interesting ways to deal with non-Lipschitzness of $\nabla f$ use the problem structure. One such method, proposed in (Birnbaum et al., 2011) and further developed in (Bauschke et al., 2016), shows that the mirror descent method (Nemirovsky & Yudin, 1983), which is another extension of GD, can be used with a fixed stepsize whenever $f$ satisfies a certain generalization of (3). In addition, (Maddison et al., 2019) proposed the dual preconditioning method, another refined version of GD. Similarly to the former technique, it also goes beyond the standard smoothness assumption on $f$, but in a different way. Unfortunately, these two simple and elegant approaches cannot resolve all issues yet. First, not many functions fulfill the respective generalized conditions. And secondly, both methods still get us back to the problem of not knowing the allowed range of stepsizes.

A whole branch of optimization considers adaptive extensions of GD that deal with functions whose (sub)gradients are bounded. Probably the earliest work in that direction was written by (Shor, 1962). He showed that the method $x^{k+1} = x^k - \lambda_k g^k$, where $g^k \in \partial f(x^k)$ is a subgradient, converges for properly chosen sequences $(\lambda_k)$, see, e.g., Section 3.2.3 in (Nesterov, 2013a). Moreover, $\lambda_k$ requires no knowledge about the function whatsoever. Similar methods that work in the online setting, such as Adagrad (Duchi et al., 2011; McMahan & Streeter, 2010), received a lot of attention in recent years and remain an active topic of research (Ward et al., 2019).
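For concreteness, the Polyak and Barzilai-Borwein stepsizes quoted above can be written as two small helpers. This is an illustrative sketch only: the function names are ours, and, as discussed, the first rule needs $f^*$ while the second may fail to converge outside quadratics.

```python
import numpy as np

def polyak_step(fx, f_star, gx):
    """Polyak's stepsize: (f(x_k) - f^*) / ||grad f(x_k)||^2, which requires knowing f^*."""
    return (fx - f_star) / np.dot(gx, gx)

def barzilai_borwein_step(x, x_prev, gx, gx_prev):
    """Barzilai-Borwein stepsize: <dx, dg> / ||dg||^2."""
    dx, dg = x - x_prev, gx - gx_prev
    return np.dot(dx, dg) / np.dot(dg, dg)
```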
Methods similar to Adagrad, namely Adam (Kingma & Ba, 2014; Reddi et al., 2018), RMSprop (Tieleman & Hinton, 2012) and Adadelta (Zeiler, 2012), remain state-of-the-art for training neural networks. The corresponding objective is usually neither smooth nor convex, and the theory often assumes Lipschitzness of the function rather than of the gradients. Therefore, this direction of research is mostly orthogonal to ours, although we do compare with some of these methods in our neural networks experiment. We also note that without momentum Adam and RMSprop reduce to signSGD (Bernstein et al., 2018), which is known to be non-convergent for arbitrary stepsizes on a simple quadratic problem (Karimireddy et al., 2019).

In a close relation to ours is the recent work (Malitsky, 2019), where an adaptive golden ratio algorithm for monotone variational inequalities was proposed. As it solves a more general problem, it does not exploit the structure of (1) and, as most variational inequality methods, has a more conservative update. Although the method estimates the smoothness, it still requires an upper bound on the stepsize as input.

Contribution. We propose a new version of GD that at no cost resolves all aforementioned issues. The idea is simple, and it is surprising that it has not been discovered yet. In each iteration we choose $\lambda_k$ as a certain approximation of the inverse local Lipschitz constant. With such a choice, we prove that convexity and local smoothness of $f$ are sufficient for convergence of the iterates, with the complexity $O(1/k)$ for $f(x^k) - f^*$ in the worst case.

Discussion. Let us now briefly discuss why we believe that proofs based on monotonicity and global smoothness lead to slower methods. Gradient descent is by far not a recent method, so optimal rates of convergence have already been obtained. However, we argue that adaptive methods require rethinking optimality of the stepsizes. Take as an example a simple quadratic problem, $f(x, y) = \frac{1}{2}x^2 + \frac{\delta}{2}y^2$, where $\delta \ll 1$. Clearly, the smoothness constant of this problem is $L = 1$ and the strong convexity constant is $\mu = \delta$. If we run GD from an arbitrary point $(x_0, y_0)$ with the optimal stepsize $\lambda = \frac{1}{L} = 1$, then one iteration of GD gives us $(x_1, y_1) = (0, (1-\delta)y_0)$, and similarly $(x_k, y_k) = (0, (1-\delta)^k y_0)$. Evidently, for $\delta$ small enough it will take a long time to converge to the solution $(0, 0)$. Instead, GD would converge in two iterations if it adjusted its step after the first iteration to $\lambda = \frac{1}{\delta}$ (a numerical sketch of this example is given at the end of this section). Nevertheless, all existing analyses of gradient descent with $L$-smooth $f$ use stepsizes bounded by $2/L$. Besides, the functional analysis gives

$$f(x^{k+1}) \leq f(x^k) - \lambda\Bigl(1 - \frac{\lambda L}{2}\Bigr)\|\nabla f(x^k)\|^2,$$

from which $1/L$ can be seen as the optimal stepsize. Alternatively, we can assume that $f$ is $\mu$-strongly convex, and the analysis in norms gives

$$\|x^{k+1} - x^*\|^2 \leq \Bigl(1 - \frac{2\lambda L\mu}{L+\mu}\Bigr)\|x^k - x^*\|^2 - \lambda\Bigl(\frac{2}{L+\mu} - \lambda\Bigr)\|\nabla f(x^k)\|^2,$$

whence the optimal step is $\frac{2}{L+\mu}$. Finally, line search procedures use a certain type of monotonicity, for instance ensuring that $f(x^{k+1}) \leq f(x^k) - c\|\nabla f(x^k)\|^2$ for some $c > 0$. We break with this tradition and merely ask for convergence in the end.
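A quick numerical check of the quadratic example above (a sketch with values chosen by us for illustration): fixed-step GD with $\lambda = 1/L$ stalls on the poorly conditioned coordinate, while switching to $\lambda = 1/\delta$ after the first step finishes the job.

```python
import numpy as np

delta = 1e-3
grad = lambda z: np.array([z[0], delta * z[1]])   # gradient of 0.5*x^2 + 0.5*delta*y^2

z0 = np.array([1.0, 1.0])
z_fixed = z0.copy()
for _ in range(100):                              # fixed stepsize lambda = 1/L = 1
    z_fixed = z_fixed - 1.0 * grad(z_fixed)

z_adapt = z0 - 1.0 * grad(z0)                     # first step with lambda = 1
z_adapt = z_adapt - (1.0 / delta) * grad(z_adapt) # then lambda = 1/delta

print(z_fixed)   # y-coordinate is still about (1 - delta)^100 ~ 0.90
print(z_adapt)   # exactly [0, 0]
```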
2. Main part

2.1. Local smoothness of $\nabla f$

Recall that a mapping is locally Lipschitz if it is Lipschitz over any compact set of its domain. A function $f$ with (locally) Lipschitz gradient $\nabla f$ is called (locally) smooth. It is natural to ask whether some interesting functions are smooth locally, but not globally. It turns out there is no shortage of examples, most prominently among highly nonlinear functions. In $\mathbb{R}$, they include $x \mapsto \exp(x)$, $\log(x)$, $\tan(x)$, $x^p$ for $p > 2$, etc. More generally, they include any twice differentiable $f$, since $\nabla^2 f(x)$, as a continuous mapping, is bounded over any bounded set $C$. In this case, we have that $\nabla f$ is Lipschitz on $C$, due to the mean value inequality

$$\|\nabla f(x) - \nabla f(y)\| \leq \max_{z \in C}\|\nabla^2 f(z)\|\,\|x - y\|, \quad \forall x, y \in C.$$

Algorithm 1 that we propose is just a slight modification of GD. The quick explanation why local Lipschitzness of $\nabla f$ does not cause us any problems, unlike most other methods, lies in the way we prove its convergence.

Algorithm 1 Adaptive gradient descent
1: Input: $x^0 \in \mathbb{R}^d$, $\lambda_0 > 0$, $\theta_0 = +\infty$
2: $x^1 = x^0 - \lambda_0 \nabla f(x^0)$
3: for $k = 1, 2, \dots$ do
4:   $\lambda_k = \min\Bigl\{\sqrt{1 + \theta_{k-1}}\,\lambda_{k-1},\ \dfrac{\|x^k - x^{k-1}\|}{2\|\nabla f(x^k) - \nabla f(x^{k-1})\|}\Bigr\}$
5:   $x^{k+1} = x^k - \lambda_k \nabla f(x^k)$
6:   $\theta_k = \lambda_k / \lambda_{k-1}$
7: end for

Whenever the stepsize $\lambda_k$ satisfies the two inequalities²

$$\lambda_k^2 \leq (1 + \theta_{k-1})\lambda_{k-1}^2, \qquad \lambda_k \leq \frac{\|x^k - x^{k-1}\|}{2\|\nabla f(x^k) - \nabla f(x^{k-1})\|},$$

independently of the properties of $f$ (apart from convexity), we can show that the iterates $(x^k)$ remain bounded. Here and everywhere else we use the convention $1/0 = +\infty$, so if $\|\nabla f(x^k) - \nabla f(x^{k-1})\| = 0$, the second inequality can be ignored. In the first iteration it might happen that $\lambda_1 = \min\{+\infty, +\infty\}$; in this case we suppose that any choice of $\lambda_1 > 0$ is possible.

²It can be shown that instead of the second condition it is enough to ask for $\lambda_k^2 \leq \frac{\|x^k - x^{k-1}\|^2}{[3\|\nabla f(x^k)\|^2 - 4\langle\nabla f(x^k), \nabla f(x^{k-1})\rangle]_+}$, where $[a]_+ \overset{\text{def}}{=} \max\{0, a\}$, but we prefer the option written in the main text for its simplicity.

Although Algorithm 1 needs $x^0$ and $\lambda_0$ as input, this is not an issue as one can simply fix $x^0 = 0$ and $\lambda_0 = 10^{-10}$. Equipped with a tiny $\lambda_0$, we ensure that $x^1$ will be close enough to $x^0$ and likely will give a good estimate for $\lambda_1$. Otherwise, this has no influence on further steps.
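A direct NumPy sketch of Algorithm 1 follows. The update mirrors steps 4-6 above; the stopping rule, default values, and function names are our illustrative choices and not part of the algorithm itself.

```python
import numpy as np

def adgd(grad_f, x0, lam0=1e-10, n_iters=1000):
    """Adaptive gradient descent (a sketch of Algorithm 1)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - lam0 * g_prev          # step 2
    lam_prev, theta_prev = lam0, np.inf
    for _ in range(n_iters):
        g = grad_f(x)
        denom = 2 * np.linalg.norm(g - g_prev)
        local = np.linalg.norm(x - x_prev) / denom if denom > 0 else np.inf
        lam = min(np.sqrt(1 + theta_prev) * lam_prev, local)   # step 4
        if not np.isfinite(lam):        # degenerate case lam_1 = min{+inf, +inf}
            lam = lam_prev
        x_prev, g_prev = x, g
        x = x - lam * g                 # step 5
        theta_prev, lam_prev = lam / lam_prev, lam              # step 6
    return x
```

Note that, just like GD, the method evaluates a single new gradient per iteration; the previous gradient is simply kept in memory to form the local curvature estimate in step 4.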
2.2. Analysis without descent

It is now time to show our main contribution, the new analysis technique. The tools that we are going to use are the well-known Cauchy-Schwarz and convexity inequalities. In addition, our methods are related to potential functions (Taylor & Bach, 2019), which are a powerful tool for producing tight bounds for GD.

Another divergence from the common practice is that our main lemma includes not only $x^{k+1}$ and $x^k$, but also $x^{k-1}$. This can be seen as a two-step analysis, while the majority of optimization methods have one-step bounds. However, as we want to adapt to the local geometry of our objective, it is rather natural to have two terms to capture the change in the gradients. Now, it is time to derive a characteristic inequality for a specific Lyapunov energy.

Lemma 1. Let $f \colon \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable and let $x^*$ be any solution of (1). Then for $(x^k)$ generated by Algorithm 1 it holds that

$$\|x^{k+1} - x^*\|^2 + \tfrac{1}{2}\|x^{k+1} - x^k\|^2 + 2\lambda_k(1+\theta_k)(f(x^k) - f^*) \leq \|x^k - x^*\|^2 + \tfrac{1}{2}\|x^k - x^{k-1}\|^2 + 2\lambda_k\theta_k(f(x^{k-1}) - f^*). \qquad (5)$$

Proof. Let $k \geq 1$. We start from the standard way of analyzing GD:

$$\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 + 2\langle x^{k+1} - x^k, x^k - x^*\rangle + \|x^{k+1} - x^k\|^2 = \|x^k - x^*\|^2 + 2\lambda_k\langle\nabla f(x^k), x^* - x^k\rangle + \|x^{k+1} - x^k\|^2.$$

As usual, we bound the scalar product by convexity of $f$:

$$2\lambda_k\langle\nabla f(x^k), x^* - x^k\rangle \leq 2\lambda_k(f^* - f(x^k)), \qquad (6)$$

which gives us

$$\|x^{k+1} - x^*\|^2 \leq \|x^k - x^*\|^2 - 2\lambda_k(f(x^k) - f^*) + \|x^{k+1} - x^k\|^2. \qquad (7)$$

These two steps have been repeated thousands of times, but now we continue in a completely different manner. We have precisely one bad term in (7), which is $\|x^{k+1} - x^k\|^2$. We will bound it using the difference of gradients:

$$\|x^{k+1} - x^k\|^2 = 2\|x^{k+1} - x^k\|^2 - \|x^{k+1} - x^k\|^2 = 2\lambda_k\langle\nabla f(x^k), x^k - x^{k+1}\rangle - \|x^{k+1} - x^k\|^2 = 2\lambda_k\langle\nabla f(x^k) - \nabla f(x^{k-1}), x^k - x^{k+1}\rangle + 2\lambda_k\langle\nabla f(x^{k-1}), x^k - x^{k+1}\rangle - \|x^{k+1} - x^k\|^2. \qquad (8)$$

Let us estimate the first two terms in the right-hand side above. First, the definition of $\lambda_k$, followed by the Cauchy-Schwarz and Young's inequalities, yields

$$2\lambda_k\langle\nabla f(x^k) - \nabla f(x^{k-1}), x^k - x^{k+1}\rangle \leq 2\lambda_k\|\nabla f(x^k) - \nabla f(x^{k-1})\|\,\|x^k - x^{k+1}\| \leq \|x^k - x^{k-1}\|\,\|x^k - x^{k+1}\| \leq \tfrac{1}{2}\|x^k - x^{k-1}\|^2 + \tfrac{1}{2}\|x^{k+1} - x^k\|^2. \qquad (9)$$

Secondly, by convexity of $f$,

$$2\lambda_k\langle\nabla f(x^{k-1}), x^k - x^{k+1}\rangle = \frac{2\lambda_k}{\lambda_{k-1}}\langle x^{k-1} - x^k, x^k - x^{k+1}\rangle = 2\lambda_k\theta_k\langle x^{k-1} - x^k, \nabla f(x^k)\rangle \leq 2\lambda_k\theta_k(f(x^{k-1}) - f(x^k)). \qquad (10)$$

Plugging (9) and (10) into (8), we obtain

$$\|x^{k+1} - x^k\|^2 \leq \tfrac{1}{2}\|x^k - x^{k-1}\|^2 - \tfrac{1}{2}\|x^{k+1} - x^k\|^2 + 2\lambda_k\theta_k(f(x^{k-1}) - f(x^k)).$$

Finally, using the produced estimate for $\|x^{k+1} - x^k\|^2$ in (7), we deduce the desired inequality (5).

The above lemma already might give a good hint why our method works. From inequality (5) together with the condition $\lambda_k^2 \leq (1 + \theta_{k-1})\lambda_{k-1}^2$, we obtain that the Lyapunov energy (the left-hand side of (5)) is decreasing. This gives us boundedness of $(x^k)$, which is often the key ingredient for proving convergence. In the next theorem we formally state our result.

Theorem 1. Suppose that $f \colon \mathbb{R}^d \to \mathbb{R}$ is convex with locally Lipschitz gradient $\nabla f$. Then $(x^k)$ generated by Algorithm 1 converges to a solution of (1) and we have that

$$f(\hat{x}^k) - f^* \leq \frac{D}{2S_k}, \qquad \text{where } \hat{x}^k = \frac{\lambda_k(1+\theta_k)x^k + \sum_{i=1}^{k-1} w_i x^i}{S_k},$$

$w_i = \lambda_i(1+\theta_i) - \lambda_{i+1}\theta_{i+1}$, $S_k = \lambda_k(1+\theta_k) + \sum_{i=1}^{k-1} w_i = \sum_{i=1}^{k}\lambda_i + \lambda_1\theta_1$, and $D$ is a constant that explicitly depends on the initial data and the solution set, see (11).

Our proof will consist of two parts. The first one is a straightforward application of Lemma 1, from which we derive boundedness of $(x^k)$ and the complexity result. Due to its conciseness, we provide it directly after this remark. In the second part, we prove that the whole sequence $(x^k)$ converges to a solution. Surprisingly, this part is a bit more technical than expected, and thus we postpone it to the appendix.

Proof. (Boundedness and complexity result.) Fix any $x^*$ from the solution set of (1). Telescoping inequality (5), we deduce

$$\|x^{k+1} - x^*\|^2 + \tfrac{1}{2}\|x^{k+1} - x^k\|^2 + 2\lambda_k(1+\theta_k)(f(x^k) - f^*) + 2\sum_{i=1}^{k-1}[\lambda_i(1+\theta_i) - \lambda_{i+1}\theta_{i+1}](f(x^i) - f^*) \leq \|x^1 - x^*\|^2 + \tfrac{1}{2}\|x^1 - x^0\|^2 + 2\lambda_1\theta_1(f(x^0) - f^*) \overset{\text{def}}{=} D. \qquad (11)$$

Note that by the definition of $\lambda_k$, all the coefficients $\lambda_i(1+\theta_i) - \lambda_{i+1}\theta_{i+1}$ in the sum above are nonnegative. Thus, the sequence $(x^k)$ is bounded. Since $\nabla f$ is locally Lipschitz, it is Lipschitz continuous on bounded sets. It means that for the set $C = \operatorname{conv}\{x^*, x^0, x^1, \dots\}$, which is bounded as the convex hull of a bounded set of points, there exists $L > 0$ such that

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y \in C.$$

Clearly, $\lambda_1 = \frac{\|x^1 - x^0\|}{2\|\nabla f(x^1) - \nabla f(x^0)\|} \geq \frac{1}{2L}$; thus, by induction one can prove that $\lambda_k \geq \frac{1}{2L}$, in other words, the sequence $(\lambda_k)$ is separated from zero. Now we want to apply Jensen's inequality to the sum of all terms $f(x^i) - f^*$ in the left-hand side of (11). Notice that the total sum of the coefficients at these terms is

$$\lambda_k(1+\theta_k) + \sum_{i=1}^{k-1}[\lambda_i(1+\theta_i) - \lambda_{i+1}\theta_{i+1}] = \sum_{i=1}^{k}\lambda_i + \lambda_1\theta_1 = S_k.$$

Thus, by Jensen's inequality,

$$\text{LHS of (11)} \geq 2S_k(f(\hat{x}^k) - f^*),$$

where $\hat{x}^k$ is given in the statement of the theorem. By this, the first part of the proof is complete. Convergence of $(x^k)$ to a solution is provided in the appendix.

As we have shown that $\lambda_i \geq \frac{1}{2L}$ for all $i$, we have the theoretical upper bound $f(\hat{x}^k) - f^* \leq \frac{DL}{k}$. Note that in practice, however, $(\lambda_k)$ might be much larger than the pessimistic lower bound $\frac{1}{2L}$, which we observe in our experiments together with a faster convergence.
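As a sanity check of Lemma 1, one can track the Lyapunov energy numerically along the iterates of Algorithm 1. The following sketch uses a toy quadratic of our own choosing (where $x^*$ and $f^*$ are known) and verifies that the left-hand side of (5) never increases:

```python
import numpy as np

A = np.diag([1.0, 0.01])                      # f(x) = 0.5 * x^T A x, minimized at x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
f_star, x_star = 0.0, np.zeros(2)

x_prev = np.array([5.0, 5.0])
x = x_prev - 1e-10 * grad(x_prev)             # step 2 of Algorithm 1
lam_prev, theta_prev = 1e-10, np.inf
energies = []
for k in range(200):
    g, g_prev = grad(x), grad(x_prev)
    denom = 2 * np.linalg.norm(g - g_prev)
    lam = min(np.sqrt(1 + theta_prev) * lam_prev,
              np.linalg.norm(x - x_prev) / denom if denom > 0 else np.inf)
    x_next = x - lam * g
    theta = lam / lam_prev
    # Left-hand side of (5): the energy that Lemma 1 shows to be non-increasing.
    energies.append(np.linalg.norm(x_next - x_star) ** 2
                    + 0.5 * np.linalg.norm(x_next - x) ** 2
                    + 2 * lam * (1 + theta) * (f(x) - f_star))
    x_prev, x, lam_prev, theta_prev = x, x_next, lam, theta

assert all(e2 <= e1 + 1e-9 for e1, e2 in zip(energies, energies[1:]))
```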
2.3. f is locally strongly convex

Since one of our goals is to make optimization easy to use, we believe that a good method should have state-of-the-art guarantees in various scenarios. For strongly convex functions, this means that we want to see linear convergence, which is not covered by normalized GD or online methods. In Section 2.2 we have shown that Algorithm 1 matches the $O(1/\varepsilon)$ complexity of GD on convex problems. Now we show that it also matches the $O(\frac{L}{\mu}\log\frac{1}{\varepsilon})$ complexity of GD when $f$ is locally strongly convex. Similarly to local smoothness, we call $f$ locally strongly convex if it is strongly convex over any compact set of its domain. For proof simplicity, instead of using the bound $\lambda_k \leq \sqrt{1+\theta_{k-1}}\,\lambda_{k-1}$ as in step 4 of Algorithm 1, we will use the more conservative bound $\lambda_k \leq \sqrt{1 + \frac{\theta_{k-1}}{2}}\,\lambda_{k-1}$ (otherwise the derivation would be too technical). It is clear that with such a change Theorem 1 still holds true, so the sequence is bounded and we can rely on local smoothness and local strong convexity.

Algorithm 2 Adaptive accelerated gradient descent
1: Input: $x^0 \in \mathbb{R}^d$, $\lambda_0 > 0$, $\Lambda_0 > 0$, $\theta_0 = \Theta_0 = +\infty$
2: $y^1 = x^1 = x^0 - \lambda_0 \nabla f(x^0)$
3: for $k = 1, 2, \dots$ do
4:   $\lambda_k = \min\Bigl\{\sqrt{1 + \frac{\theta_{k-1}}{2}}\,\lambda_{k-1},\ \dfrac{\|x^k - x^{k-1}\|}{2\|\nabla f(x^k) - \nabla f(x^{k-1})\|}\Bigr\}$
5:   $\Lambda_k = \min\Bigl\{\sqrt{1 + \frac{\Theta_{k-1}}{2}}\,\Lambda_{k-1},\ \dfrac{\|\nabla f(x^k) - \nabla f(x^{k-1})\|}{2\|x^k - x^{k-1}\|}\Bigr\}$
6:   $\beta_k = \dfrac{\sqrt{1/\lambda_k} - \sqrt{\Lambda_k}}{\sqrt{1/\lambda_k} + \sqrt{\Lambda_k}}$
7:   $y^{k+1} = x^k - \lambda_k \nabla f(x^k)$
8:   $x^{k+1} = y^{k+1} + \beta_k(y^{k+1} - y^k)$
9:   $\theta_k = \lambda_k/\lambda_{k-1}$, $\Theta_k = \Lambda_k/\Lambda_{k-1}$
10: end for

Theorem 2. Suppose that $f \colon \mathbb{R}^d \to \mathbb{R}$ is locally strongly convex and $\nabla f$ is locally Lipschitz. Then $(x^k)$ generated by Algorithm 1 (with the modification mentioned above) converges to the solution $x^*$ of (1). The complexity to get $\|x^k - x^*\|^2 \leq \varepsilon$ is $O(\kappa\log\frac{1}{\varepsilon})$, where $\kappa = \frac{L}{\mu}$ and $L$, $\mu$ are the smoothness and strong convexity constants of $f$ on the set $C = \operatorname{conv}\{x^*, x^0, x^1, \dots\}$.

We want to highlight that in our rate $\kappa$ depends on the local Lipschitz and strong convexity constants $L$ and $\mu$, which is meaningful even when these properties are not satisfied globally. Similarly, if $f$ is globally smooth and strongly convex, our rate is still faster as it depends on the smaller local constants.

3. Heuristics

In this section, we describe several extensions of our method. We do not have a full theory for them, but believe that they are of interest in applications.

3.1. Acceleration

Suppose that $f$ is $\mu$-strongly convex. One version of the accelerated gradient method proposed by Nesterov (Nesterov, 2013a) is

$$y^{k+1} = x^k - \frac{1}{L}\nabla f(x^k), \qquad x^{k+1} = y^{k+1} + \beta(y^{k+1} - y^k),$$

where $\beta = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$. Adaptive gradient descent for strongly convex $f$ efficiently estimated $\frac{1}{2L}$ by

$$\lambda_k = \min\Bigl\{\sqrt{1 + \frac{\theta_{k-1}}{2}}\,\lambda_{k-1},\ \frac{\|x^k - x^{k-1}\|}{2\|\nabla f(x^k) - \nabla f(x^{k-1})\|}\Bigr\}.$$

What about the strong convexity constant $\mu$? We know that it equals the inverse smoothness constant of the conjugate $f^*(y) \overset{\text{def}}{=} \sup_x\{\langle x, y\rangle - f(x)\}$. Thus, it is tempting to estimate this inverse constant just as we estimated the inverse smoothness of $f$, i.e., by the formula

$$\Lambda_k = \min\Bigl\{\sqrt{1 + \frac{\Theta_{k-1}}{2}}\,\Lambda_{k-1},\ \frac{\|p^k - p^{k-1}\|}{2\|\nabla f^*(p^k) - \nabla f^*(p^{k-1})\|}\Bigr\},$$

where $p^k$ and $p^{k-1}$ are some elements of the dual space and $\Theta_k = \Lambda_k/\Lambda_{k-1}$. A natural choice then is $p^k = \nabla f(x^k)$, since it is an element of the dual space that we use. What is its value? It is well known that $\nabla f^*(\nabla f(x)) = x$, so we come up with the update rule

$$\Lambda_k = \min\Bigl\{\sqrt{1 + \frac{\Theta_{k-1}}{2}}\,\Lambda_{k-1},\ \frac{\|\nabla f(x^k) - \nabla f(x^{k-1})\|}{2\|x^k - x^{k-1}\|}\Bigr\},$$

and hence we can estimate $\beta$ by

$$\beta_k = \frac{\sqrt{1/\lambda_k} - \sqrt{\Lambda_k}}{\sqrt{1/\lambda_k} + \sqrt{\Lambda_k}}.$$

We summarize our arguments in Algorithm 2. Unfortunately, we do not have any theoretical guarantees for it. Estimating the strong convexity parameter $\mu$ is important in practice. Most common approaches rely on the restarting technique proposed by (Nesterov, 2013b), see also (Fercoq & Qu, 2019) and references therein. Unlike Algorithm 2, these works have theoretical guarantees; however, the methods themselves are more complicated and still require tuning of other unknown parameters.
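A NumPy transcription of the heuristic above is given below as a sketch of Algorithm 2. As stated, it comes with no guarantees; the default initialization values and the function name are our own illustrative choices.

```python
import numpy as np

def adgd_accel(grad_f, x0, lam0=1e-10, Lam0=1e-10, n_iters=1000):
    """Adaptive accelerated gradient descent (a sketch of Algorithm 2, heuristic)."""
    x_prev = np.asarray(x0, dtype=float)
    y = x = x_prev - lam0 * grad_f(x_prev)
    lam, Lam, theta, Theta = lam0, Lam0, np.inf, np.inf
    for _ in range(n_iters):
        g, g_prev = grad_f(x), grad_f(x_prev)
        ndg, ndx = np.linalg.norm(g - g_prev), np.linalg.norm(x - x_prev)
        lam_new = min(np.sqrt(1 + theta / 2) * lam, ndx / (2 * ndg) if ndg > 0 else np.inf)
        Lam_new = min(np.sqrt(1 + Theta / 2) * Lam, ndg / (2 * ndx) if ndx > 0 else np.inf)
        beta = (np.sqrt(1 / lam_new) - np.sqrt(Lam_new)) / (np.sqrt(1 / lam_new) + np.sqrt(Lam_new))
        y_new = x - lam_new * g                       # gradient step
        x_prev, x = x, y_new + beta * (y_new - y)     # momentum step
        y = y_new
        theta, Theta, lam, Lam = lam_new / lam, Lam_new / Lam, lam_new, Lam_new
    return x
```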
3.2. Uniting our steps with stochastic gradients

Here we would like to discuss applications of our method to the problem

$$\min_x \mathbb{E}[f_\xi(x)],$$

where $f_\xi$ is almost surely $L$-smooth and $\mu$-strongly convex. Assume that at each iteration we get a sample $\xi_k$ to make a stochastic gradient step, $x^{k+1} = x^k - \lambda_k \nabla f_{\xi_k}(x^k)$. Then, we have two ways of incorporating our stepsize into SGD. The first is to reuse $\nabla f_{\xi_k}(x^k)$ to estimate

$$L_k = \frac{\|\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1})\|}{\|x^k - x^{k-1}\|},$$

but this would make $\lambda_k \nabla f_{\xi_k}(x^k)$ biased. Alternatively, one can use an extra sample to estimate $L_k$, but this is less intuitive since our goal is to estimate the curvature of the function used in the update. We give a full description in Algorithm 3. We remark that the option with a biased estimate performed much better in our experiments with neural networks. The theorem below provides convergence guarantees for both cases, but with different assumptions.

Algorithm 3 Adaptive SGD
1: Input: $x^0 \in \mathbb{R}^d$, $\lambda_0 > 0$, $\theta_0 = +\infty$, $\xi_0$, $\alpha > 0$
2: $x^1 = x^0 - \lambda_0 \nabla f_{\xi_0}(x^0)$
3: for $k = 1, 2, \dots$ do
4:   Sample $\xi_k$ and optionally $\zeta_k$
5:   Option I (biased): $L_k = \dfrac{\|\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1})\|}{\|x^k - x^{k-1}\|}$
6:   Option II (unbiased): $L_k = \dfrac{\|\nabla f_{\zeta_k}(x^k) - \nabla f_{\zeta_k}(x^{k-1})\|}{\|x^k - x^{k-1}\|}$
7:   $\lambda_k = \min\Bigl\{\sqrt{1 + \theta_{k-1}}\,\lambda_{k-1},\ \dfrac{\alpha}{L_k}\Bigr\}$
8:   $x^{k+1} = x^k - \lambda_k \nabla f_{\xi_k}(x^k)$
9:   $\theta_k = \lambda_k / \lambda_{k-1}$
10: end for

Theorem 3. Let $f_\xi$ be $L$-smooth and $\mu$-strongly convex almost surely. Assuming $\alpha \leq \frac{1}{2\kappa}$ and estimating $L_k$ with $\nabla f_{\zeta_k}$, the complexity to get $\mathbb{E}\|x^k - x^*\|^2 \leq \varepsilon$ is not worse than $O\bigl(\frac{\kappa^2}{\varepsilon}\bigr)$. Furthermore, if the model is overparameterized, i.e., $\nabla f_\xi(x^*) = 0$ almost surely, then one can estimate $L_k$ with $\xi_k$ and the complexity is $O\bigl(\kappa^2\log\frac{1}{\varepsilon}\bigr)$.

Note that in both cases we match the known dependency on $\varepsilon$ up to logarithmic terms, but we get an extra $\kappa$ as the price for adaptive estimation of the stepsize. Another potential application of our techniques is the estimation of decreasing stepsizes in SGD. The best known rates for SGD (Stich, 2019) are obtained using $\lambda_k$ that evolves as $O\bigl(\frac{1}{L+\mu k}\bigr)$. This requires estimates of both smoothness and strong convexity, which can be borrowed from the previous discussion. We leave a rigorous proof of such schemes for future work.

4. Experiments

In the experiments³, we compare our approach with the two most related methods: GD and Nesterov's accelerated method for convex functions (Nesterov, 1983). Additionally, we consider line search, Polyak step, and the Barzilai-Borwein method. For neural networks we also include a comparison with SGD, SGDm and Adam.

³See https://github.com/ymalitsky/adaptive_gd

Logistic regression. The logistic loss with $\ell_2$-regularization is given by $\frac{1}{n}\sum_{i=1}^{n}\log(1+\exp(-b_i a_i^\top x)) + \frac{\gamma}{2}\|x\|^2$, where $n$ is the number of observations, $\gamma > 0$ is a regularization parameter, and $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \dots, n$, are the observations. We use the mushrooms and covtype datasets to run the experiments. We choose $\gamma$ proportionally to $\frac{1}{n}$ as often done in practice. Since we have the closed-form expression $L = \frac{1}{4n}\|A\|^2 + \gamma$, where $A = (a_1^\top, \dots, a_n^\top)^\top$, we used the stepsize $\frac{1}{L}$ in GD and its acceleration. The results are provided in Figure 1 (a code sketch of this setup is given below).

Figure 1: Results for the logistic regression problem. (a) Mushrooms dataset, objective; (b) Mushrooms dataset, stepsize; (c) Covtype dataset, objective. Methods shown: GD, Nesterov, AdGD, AdGD-accel, Nesterov-strong.

Figure 2: Results for matrix factorization. The objective is neither convex nor smooth. Methods shown: GD, Nesterov, AdGD, AdGD-accel.
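As a usage illustration on this objective, the sketch below reuses the `adgd` routine sketched after Algorithm 1. The synthetic data and parameter values are our own placeholders, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
b = np.sign(rng.standard_normal(n))
gamma = 1.0 / n

def grad(x):
    """Gradient of (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (gamma/2) ||x||^2."""
    z = -b * (A @ x)
    return -(A.T @ (b / (1 + np.exp(-z)))) / n + gamma * x

# Fixed-step GD needs the global constant L = ||A||^2 / (4n) + gamma ...
L = np.linalg.norm(A, 2) ** 2 / (4 * n) + gamma
x_gd = np.zeros(d)
for _ in range(500):
    x_gd = x_gd - (1.0 / L) * grad(x_gd)

# ... while adgd (sketched earlier) needs no constants at all.
x_ad = adgd(grad, np.zeros(d), lam0=1e-10, n_iters=500)
```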
Matrix factorization. Given a matrix $A \in \mathbb{R}^{m \times n}$ and $r < \min\{m, n\}$, we want to solve

$$\min_{X=[U,V]} f(X) = f(U, V) = \frac{1}{2}\|UV^\top - A\|_F^2$$

for $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. It is a nonconvex problem, and the gradient $\nabla f$ is not globally Lipschitz. With some tuning, one still can apply GD and Nesterov's accelerated method, but (and we want to emphasize it) it was not a trivial thing to find the steps in practice. The steps we have chosen were almost optimal, namely, the methods did not converge if we doubled the steps. In contrast, our methods do not require any tuning, so even in this regard they are much more practical. For the experiments we used the MovieLens 100K dataset (Harper & Konstan, 2016) with more than a million entries and several values of $r = 10, 20, 30$. All algorithms were initialized at the same point, chosen randomly. The results are presented in Figure 2.

Cubic regularization. In the cubic regularization of Newton's method (Nesterov & Polyak, 2006), at each iteration we need to minimize

$$f(x) = \langle g, x\rangle + \frac{1}{2}\langle Hx, x\rangle + \frac{M}{6}\|x\|^3,$$

where $g \in \mathbb{R}^d$, $H \in \mathbb{R}^{d \times d}$ and $M > 0$ are given. This objective is smooth only locally due to the cubic term, which is our motivation to consider it. Here $g$ and $H$ were the gradient and the Hessian of the logistic loss with the covtype dataset, evaluated at $x = 0 \in \mathbb{R}^d$. Although the values of $M = 10, 20, 100$ led to similar results, they also required different numbers of iterations, so we present the corresponding results in Figure 3.

Barzilai-Borwein, Polyak and line searches. We started this paper with an overview of different approaches to tackle the issue of a stepsize for GD. Now, we demonstrate some of those solutions. We again consider the $\ell_2$-regularized logistic regression (same setting as before) with the mushrooms, covtype, and w8a datasets. In Figure 4 (left) we see that the Barzilai-Borwein method can indeed be very fast. However, as we said before, it lacks a theoretical basis, and Figure 4 (middle) illustrates this quite well. Just changing one dataset to another makes both versions of this method diverge on a strongly convex and smooth problem. Polyak's method consistently performs well (see Figure 4 (left and middle)), however, only after it was fed with $f^*$ that we found by running another method. Unfortunately, for logistic regression there is no way to guess this value beforehand. Finally, line search for GD (Armijo version, sketched below) and Nesterov GD (implemented as in (Nesterov, 2013b)) eliminates the need to know the stepsize, but this comes with a higher price per iteration, as Figure 4 (right) shows. Actually, in all our experiments for logistic regression with different datasets, one iteration of Armijo line search was approximately 2 times more expensive than AdGD, while line search for Nesterov GD was 4 times more expensive. We note that these observations are consistent with the theoretical derivations in (Nesterov, 2013b).

Figure 3: Results for the non-smooth subproblem from cubic regularization. (a) M = 10; (b) M = 20; (c) M = 100. Methods shown: GD, Nesterov, AdGD, AdGD-accel.

Figure 4: Additional results for the logistic regression problem; the horizontal axis counts matrix-vector multiplications. (a) Mushrooms dataset, objective (BB1, BB2, AdGD, Polyak); (b) Covtype dataset, objective (BB1, BB2, AdGD, Polyak); (c) W8a dataset, objective (Armijo, Nesterov-LS, AdGD, AdGD-accel).
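For reference, a generic Armijo backtracking step looks as follows. This is a textbook sketch with illustrative constants, not the exact implementation used in our comparison; each call may need several extra evaluations of $f$, which is the per-iteration overhead discussed above.

```python
import numpy as np

def armijo_step(f, grad_f, x, lam_init=1.0, c=1e-4, rho=0.5):
    """Backtrack lam until f(x - lam*g) <= f(x) - c*lam*||g||^2, then take the step."""
    g = grad_f(x)
    fx, g_sq = f(x), np.dot(g, g)
    lam = lam_init
    while f(x - lam * g) > fx - c * lam * g_sq:
        lam *= rho          # shrink the trial stepsize
    return x - lam * g, lam
```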
Neural networks. We use the standard ResNet-18 and DenseNet-121 architectures implemented in PyTorch (Paszke et al., 2017) and train them to classify images from the Cifar10 dataset (Krizhevsky et al., 2009) with the cross-entropy loss. We use batch size 128 for all methods. For our method, we observed that $\frac{1}{L_k}$ works better than $\frac{1}{2L_k}$. We ran it with $\sqrt{1+\gamma\theta_{k-1}}$ in the other factor, with values of $\gamma$ from $\{1, 0.1, 0.05, 0.02, 0.01\}$, and $\gamma = 0.02$ performed the best. For reference, we provide the result for the theoretical estimate as well, and the value $\gamma = 0.1$ in the plot with estimated stepsizes. The results are depicted in Figures 5 and 6, and other details are provided in Appendix D (see also the code sketch following Section 5). We can see that, surprisingly, our method achieves better test accuracy than SGD despite having the same train loss. At the same time, our method is significantly slower at the early stage and the results are quite noisy for the first 75 epochs. Another observation is that the smoothness estimates are very non-uniform and $\lambda_k$ plummets once the train loss becomes small.

Figure 5: Results for training ResNet-18 on Cifar10: (a) test accuracy, (b) stepsize, (c) train loss. Labels for AdSGD correspond to how $\lambda_k$ was estimated: compared are SGD, SGDm, Adam, and AdSGD with $(\sqrt{1+\theta_{k-1}},\ 0.5/L_k)$ and $(\sqrt{1+\theta_{k-1}/50},\ 1/L_k)$; the stepsize panel additionally shows $(\sqrt{1+\theta_{k-1}/10},\ 1/L_k)$.

Figure 6: Results for training DenseNet-121 on Cifar10, with the same panels and methods as in Figure 5.

5. Perspectives

We briefly provide a few directions which we personally consider to be important and challenging.

1. Nonconvex case. A great challenge for us is to obtain theoretical guarantees for the proposed method in nonconvex settings. We are not aware of any generic first-order method for nonconvex optimization that does not rely on the descent lemma (or its generalization), see, e.g., (Attouch et al., 2013).

2. Performance estimation. In our experiments we often observed much better performance of Algorithm 1 than GD or AGD. However, the theoretical rate we can show coincides with that of GD. The challenge here is to bridge this gap, and we hope that the approach pioneered by (Drori & Teboulle, 2014) and further developed in (Taylor et al., 2017; Kim & Fessler, 2016; Taylor & Bach, 2019) has the potential to do that.

3. Composite minimization. In classical first-order methods, the transition from smooth to composite minimization (Nesterov, 2013b) is rather straightforward. Unfortunately, the proposed proof of Algorithm 1 does not seem to provide any route for generalization, and we hope there is some way of resolving this issue.

4. Stochastic optimization. The derived bounds for the stochastic case are not satisfactory and have a suboptimal dependency on $\kappa$. However, it is not clear to us whether one can extend the techniques from the deterministic analysis to improve the rate.

5. Heuristics. Finally, we want to have some solid ground in understanding the performance of the proposed heuristics.
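As a concrete companion to the stochastic directions above, here is a minimal sketch of one step of Algorithm 3 with the biased Option I, the variant used in our neural network experiments. The wrapper is our own illustration and not the PyTorch code behind Figures 5 and 6; in particular, the caller must supply stochastic gradients of the same sample evaluated at the current and previous iterates.

```python
import numpy as np

def adaptive_sgd_step(x, x_prev, g, g_prev, lam_prev, theta_prev, alpha=0.5):
    """One step of Algorithm 3 (Option I, biased).

    g and g_prev are gradients of the SAME sampled function f_{xi_k},
    evaluated at x and x_prev respectively.
    """
    L_k = np.linalg.norm(g - g_prev) / np.linalg.norm(x - x_prev)   # step 5
    lam = min(np.sqrt(1 + theta_prev) * lam_prev, alpha / L_k)      # step 7
    x_new = x - lam * g                                             # step 8
    return x_new, lam, lam / lam_prev                               # step 9: theta_k
```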
Acknowledgements

Yura Malitsky wishes to thank Roman Cheplyaka for his interest in optimization that partly inspired the current work. Yura Malitsky was supported by the ONRG project N62909-17-1-2111 and HASLER project N16066.

References

Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1-3, 1966.

Attouch, H., Bolte, J., and Svaiter, B. F. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91-129, 2013.

Barzilai, J. and Borwein, J. M. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.

Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330-348, 2016.

Bello Cruz, J. Y. and Nghia, T. T. On the convergence of the forward-backward splitting method with linesearches. Optimization Methods and Software, 31(6):1209-1238, 2016.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 559-568, 2018.

Birnbaum, B., Devanur, N. R., and Xiao, L. Distributed algorithms via gradient descent for Fisher markets. In Proceedings of the 12th ACM Conference on Electronic Commerce, pp. 127-136, 2011.

Burdakov, O., Dai, Y., and Huang, N. Stabilized Barzilai-Borwein method. Journal of Computational Mathematics, 37(6):916-936, 2019.

Cauchy, A. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536-538, 1847.

Dai, Y.-H. and Liao, L.-Z. R-linear convergence of the Barzilai and Borwein gradient method. IMA Journal of Numerical Analysis, 22(1):1-10, 2002.

Drori, Y. and Teboulle, M. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1-2):451-482, 2014.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Fercoq, O. and Qu, Z. Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis, 39(4):2069-2095, 2019.

Goldstein, A. Cauchy's method of minimization. Numerische Mathematik, 4(1):146-150, 1962.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Hazan, E. and Kakade, S. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313, 2019.

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252-3261, 2019.

Kim, D. and Fessler, J. A. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1-2):81-107, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Lemaréchal, C. Cauchy and the gradient method. Documenta Mathematica, Extra Volume ISMP:251-254, 2012.
Maddison, C. J., Paulin, D., Teh, Y. W., and Doucet, A. Dual space preconditioning for gradient descent. arXiv preprint arXiv:1902.02257, 2019.

Malitsky, Y. Golden ratio algorithms for variational inequalities. Mathematical Programming, Jul 2019. doi: 10.1007/s10107-019-01416-w.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Nemirovsky, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. John Wiley & Sons, Inc., New York, 1983.

Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269(3):543-547, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013a.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013b.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Polyak, B. T. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.

Polyak, B. T. Minimization of nonsmooth functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 9(3):509-521, 1969.

Raydan, M. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321-326, 1993.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Salzo, S. The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM Journal on Optimization, 27(4):2153-2181, 2017.

Shor, N. An application of the method of gradient descent to the solution of the network transportation problem. Materialy Naucnovo Seminara po Teoret i Priklad. Voprosam Kibernet. i Issted. Operacii, Nucnyi Sov. po Kibernet, Akad. Nauk Ukrain. SSSR, vyp, 1:9-17, 1962.

Stich, S. U. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.

Taylor, A. and Bach, F. Stochastic first-order methods: nonasymptotic and computer-aided analyses via potential functions. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 2934-2992, Phoenix, USA, 25-28 Jun 2019. PMLR.

Taylor, A. B., Hendrickx, J. M., and Glineur, F. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307-345, 2017.

Tieleman, T. and Hinton, G. Lecture 6.5, RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6677-6686, Long Beach, California, USA, 09-15 Jun 2019. PMLR.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.