Published as a conference paper at ICLR 2025

ADAPTIVE BACKTRACKING LINE SEARCH

Joao V. Cavalcanti¹, Laurent Lessard² & Ashia C. Wilson¹
¹MIT, ²Northeastern University
{caval,ashia}@mit.edu, l.lessard@northeastern.edu

Backtracking line search is foundational in numerical optimization. The basic idea is to adjust the step-size of an algorithm by a constant factor until some chosen criterion (e.g. Armijo, descent lemma) is satisfied. We propose a novel way to adjust step-sizes, replacing the constant factor used in regular backtracking with one that takes into account the degree to which the chosen criterion is violated, with no additional computational burden. This lightweight adjustment leads to significantly faster optimization, which we confirm by performing a variety of experiments on over fifteen real-world datasets. For convex problems, we prove that adaptive backtracking requires no more adjustments to produce a feasible step-size than regular backtracking does. For nonconvex smooth problems, we prove that adaptive backtracking enjoys the same guarantees as regular backtracking. Furthermore, we prove that adaptive backtracking preserves the convergence rates of gradient descent and its accelerated variant.

1 INTRODUCTION

We consider learning settings that can be posed as the unconstrained optimization problem

$$\arg\min_{x \in \mathbb{R}^d} F(x). \qquad (1)$$

Typically, algorithms solve (1) iteratively, refining the current iterate xk by taking a step αk dk:

$$x_{k+1} = x_k + \alpha_k d_k. \qquad (2)$$

Here, αk is the size of the step taken in the direction dk. Examples of iterative algorithms include gradient descent (GD), Newton's method, quasi-Newton methods (Moré & Sorensen, 1982; Nocedal & Wright, 2006), Nesterov's accelerated gradient method (AGD) (Nesterov, 1983), adaptive gradient methods (Ruder, 2016) and their stochastic and coordinate-update variants (Boyd et al., 2011).

To find an appropriate step-size, iterative algorithms typically call a line search (LS) subroutine, which adopts some criterion and adjusts a tentative step-size until this criterion is satisfied. For many popular criteria, if the direction dk selected by the base algorithm is somewhat aligned with the gradient of F, then a feasible step-size can be produced in a finite number of updates by successively reducing an initial tentative step-size until the criteria are satisfied. The standard practice for this process, known as backtracking, is to multiply the tentative step-size by a predefined constant factor.

We propose a simple alternative to standard practice: adjust the step-size by an online variable factor that depends on the line search criterion violation. While in principle this idea can be applied to many criteria, this paper focuses on illustrating it in the context of two line search criteria: the Armijo condition (Armijo, 1966), arguably the most popular example of such criteria, and the descent lemma (Bertsekas, 2016, Proposition A.24) in the context of composite objectives (Beck & Teboulle, 2009). After motivating our choices of online adaptive factors, we show that they enjoy the best theoretical guarantees one can hope for. Moreover, we prove that adaptive backtracking preserves the convergence rates of GD and AGD.
To conclude, we present numerical experiments on several real-world problems confirming that using online adaptive factors in line search subroutines can produce higher-quality step-sizes and significantly reduce the total number of function evaluations that standard backtracking subroutines require.

Contributions. Our contributions can be summarized as follows. In Section 2, we propose a template for adaptive backtracking procedures with broad applicability. In Section 2.4, we apply the template to enforce the Armijo condition and, in Section 3, present experiments on real-world problems showcasing that the adaptive subroutine outperforms regular backtracking and improves the performance of standard baseline optimization algorithms. In Section 2.5, we apply the template to proximal-based algorithms to satisfy the descent lemma and present more real-world problems in Section 3 illustrating that the adaptive subroutine improves the performance of FISTA. For both subroutines, we prove in Section 4 that, for convex problems, adaptive backtracking takes no more function evaluations to terminate than regular backtracking in any single iteration. We also give global theoretical guarantees for adaptive backtracking in nonconvex smooth problems, which match those of regular backtracking. Furthermore, we show that adaptive backtracking preserves the convergence rates of gradient descent and its accelerated variant. In particular, the proof of accelerated convergence is based on a novel technical argument, to the best of our knowledge.

2 ADAPTIVE BACKTRACKING

2.1 LINE SEARCH: CRITERIA AND SEARCH PROCEDURES

Every line search subroutine can be decomposed into the criteria that it enforces and the procedure it uses to return a feasible step-size. We now briefly discuss each component and provide examples.

Criteria. The most popular line search criterion is the Armijo condition (Armijo, 1966), which requires that the objective function sufficiently decrease along consecutive iterates. Other popular sets of criteria are the weak and strong Wolfe conditions (Wolfe, 1969), which comprise the Armijo condition and an additional curvature condition that prevents excessively small step-sizes and induces step-sizes for which the objective function decreases even more. In contrast, nonmonotone criteria (Grippo et al., 1986; Zhang & Hager, 2004) only require that some aggregate metric of the objective function values (e.g., an exponential moving average) decrease along consecutive iterates.

Search procedures. The second component of a line search subroutine is the procedure that finds a step-size satisfying the target criteria. For example, Wolfe line search is often implemented by bracketing procedures based on polynomial interpolation (Nocedal & Wright, 2006, pp. 60–61). In contrast, several criteria consisting of a single condition, such as Armijo and nonmonotone criteria, are provably satisfied by sufficiently small step-sizes. For them, the standard procedure to compute step-sizes fixes an initial tentative step-size and then consecutively multiplies it by a constant ρ ∈ (0, 1) until the criteria are satisfied. This procedure is generally known as backtracking line search (BLS).

2.2 ADAPTIVE BACKTRACKING

BLS often enforces an inequality that is affine in the step-size.
In this case, BLS can be reformulated as computing a violation v(αk), which is less than 1 when the line search criterion evaluated at the tentative step-size αk is violated, and then scaling αk by a factor until v(αk) is at least 1. BLS (Algorithm 1) employs a fixed factor ρ ∈ (0, 1). We propose a simple modification of this procedure: replace ρ with an adaptive factor ρ̂(v(αk)) chosen as a nontrivial function of the violation v(αk).

Algorithm 1 Backtracking Line Search
Input: α0 > 0, v: R₊ → R, ρ ∈ (0, 1)
Output: αk
1: αk ← α0
2: while v(αk) < 1 do
3:   αk ← ρ · αk
4: end while

Algorithm 2 Adaptive Backtracking Line Search
Input: α0 > 0, v: R₊ → R, ρ̂: R → (0, 1)
Output: αk
1: αk ← α0
2: while v(αk) < 1 do
3:   αk ← ρ̂(v(αk)) · αk
4: end while

2.3 RELATED WORK

Backtracking is a simple and effective alternative to exact line search subroutines (Nocedal & Wright, 2006, Ch. 3), which helps to explain why it remains a popular procedure to enforce various line search criteria (Beck & Teboulle, 2009; Nesterov, 2013; Vaswani et al., 2019b; Galli et al., 2023; Aujol et al., 2024). Notwithstanding, there is no standard practice for selecting the adjustment factor ρ, but rather some rough guidelines suggesting that the parameter ρ is often chosen to be between 0.1 (which corresponds to a very crude search) and 0.8 (which corresponds to a less crude search) (Boyd & Vandenberghe, 2004, p. 466) and that it is usually chosen from 1/2 to 1/10, depending on the confidence we have in the quality of the initial step-size (Bertsekas, 2016, p. 36). Indeed, we find that ρ varies somewhat arbitrarily depending on empirical performance and the conditions enforced by backtracking. For example, ρ = 0.8 is used in (Aujol et al., 2024) to enforce the descent lemma, while ρ = 0.5 is used in (Vaswani et al., 2019b) to enforce the Armijo condition. With that in mind, in our experimental validation, we compare our proposed adaptive subroutine with regular backtracking for several values of ρ. Our goal is to exhibit compelling evidence for a simple alternative to a classic method that remains popular, rather than to champion a particular line search subroutine. Accordingly, we do not compare against subroutines that do not enforce similar line search criteria, such as (Fridovich-Keil & Recht, 2020; Orseau & Hutter, 2023), nor subroutines that do not release code (de Oliveira & Takahashi, 2021). Likewise, we leave recent twists on backtracking, such as (Truong & Nguyen, 2021; Calatroni & Chambolle, 2019; Rebegoldi & Calatroni, 2022), for future work, since these methods could in principle also benefit from adaptive adjustments (see Section 5).

2.4 CASE STUDY: ARMIJO CONDITION

The most popular criterion used in line search is the Armijo condition (Armijo, 1966), which is specified by a hyperparameter c ∈ (0, 1) and requires sufficient decrease in the objective function:

$$F(x_k + \alpha_k d_k) \le F(x_k) + c\,\alpha_k \langle \nabla F(x_k), d_k \rangle. \qquad (3)$$

For the Armijo condition, the direction dk is usually assumed to be a descent direction:

Assumption 1 (descent direction). The direction dk satisfies ⟨∇F(xk), dk⟩ < 0.

We define the violation of (3) as

$$v(\alpha_k) := \frac{F(x_k + \alpha_k d_k) - F(x_k)}{c\,\alpha_k \langle \nabla F(x_k), d_k \rangle}. \qquad (4a)$$

Under Assumption 1, (3) can be written as v(αk) ≥ 1. To account for the information conveyed by (3) when it is violated, we choose the corresponding adaptive geometric factor ρ̂(v(αk)) as

$$\hat\rho(v(\alpha_k)) := \max\left\{\epsilon,\ \rho\,\frac{1 - c}{1 - c\,v(\alpha_k)}\right\}, \qquad (4b)$$

where ε > 0 is a small factor that prevents occasional numerical errors in v(αk) from spreading to ρ̂(v(αk)).
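To make the template concrete, the following is a minimal sketch of Algorithms 1 and 2 specialized to the Armijo condition (3), using the violation (4a) and the adaptive factor (4b). It is an illustration only: the function names, the NumPy-based interface and the default c = 1e-4 are our own choices, not the authors' released code.

```python
import numpy as np

def armijo_violation(F, gradF, x, d, alpha, c):
    """v(alpha) from (4a); the denominator is negative under Assumption 1."""
    return (F(x + alpha * d) - F(x)) / (c * alpha * np.dot(gradF(x), d))

def backtrack(F, gradF, x, d, alpha0, c=1e-4, rho=0.5, adaptive=True, eps=0.01):
    """Shrink alpha until the Armijo condition (3) holds, i.e. v(alpha) >= 1."""
    alpha = alpha0
    v = armijo_violation(F, gradF, x, d, alpha, c)
    while v < 1.0:                       # criterion (3) is violated
        if adaptive:                     # Algorithm 2: adaptive factor (4b)
            factor = max(eps, rho * (1.0 - c) / (1.0 - c * v))
        else:                            # Algorithm 1: constant factor rho
            factor = rho
        alpha *= factor
        v = armijo_violation(F, gradF, x, d, alpha, c)
    return alpha

if __name__ == "__main__":
    # One gradient-descent step on an ill-conditioned quadratic (toy example).
    A = np.diag([1.0, 10.0])
    F = lambda x: 0.5 * float(x @ A @ x)
    gradF = lambda x: A @ x
    x0 = np.array([1.0, 1.0])
    d0 = -gradF(x0)                      # descent direction (Assumption 1)
    print(backtrack(F, gradF, x0, d0, alpha0=1.0))
```

Note that the adaptive factor costs only one extra division and multiplication per adjustment, consistent with the claim that it adds no computational burden.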
Although (4b) is parameterized by ε and ρ, for each method we fix their values across all experiments, effectively making Algorithm 2 parameter-free. We use our adaptive BLS procedure to find suitable step-sizes for three standard base methods: gradient descent (GD), Nesterov's accelerated gradient descent (AGD) (Nesterov, 1983) and Adagrad (Duchi et al., 2011). The standard implementations that we use for these algorithms are given in Appendix C. Incorporating line search into GD and Adagrad is straightforward, but the case of AGD merits further comment.

Backtracking and AGD. Unlike GD, AGD is not necessarily a monotone method, in the sense that F(xk + αkdk) ≤ F(xk) need not hold. But AGD is a multistep method, one step being a GD step, for which line search can help to compute a step-size or, equivalently, to estimate the Lipschitz constant L. If the estimate of L satisfies (3) with c = 1/2 and is increasingly multiplied by a lower-bounded positive geometric factor, then AGD with line search enjoys essentially the same theoretical guarantees as AGD tuned with constant parameters. We also consider AGD with memoryless line search, with fixed predetermined initial step-sizes. Then, unless some variant such as Scheinberg et al. (2014) is used, the theoretical guarantees are not necessarily preserved when AGD is combined with memoryless line search. For some values of ρ, however, we find empirically that not only does the resulting method converge, but it does so much faster than the monotone line search variant, which in turn typically converges faster than AGD tuned with a pre-computed estimate of L (see Appendix D.1).

2.5 CASE STUDY: DESCENT LEMMA

A standard assumption in the analysis and design of several optimization algorithms is that gradients are Lipschitz-smooth, which implies (Nesterov, 2018, Thm. 2.1.5) that there is some L > 0 such that

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y. \qquad (5)$$

Inequality (5) is commonly known as the descent lemma (Bertsekas, 2016). In particular, it is commonly assumed that (5) holds for algorithms that solve problems with composite objective functions F := f + ψ, where f is Lipschitz-smooth convex and ψ is continuous, possibly nonsmooth, convex. A prototypical example of such an algorithm is FISTA (Beck & Teboulle, 2009), which is an extension of Nesterov's AGD to composite problems (for details see Appendix C). FISTA assumes that it produces points yk and pαk(yk) satisfying (5) applied to F with x = yk and y = pαk(yk), where pα denotes the proximal operator (Parikh & Boyd, 2014) parameterized by α > 0 and defined by

$$p_\alpha(y) := \arg\min_x \left\{ \psi(x) + \frac{1}{2\alpha}\big\|x - \big(y - \alpha \nabla f(y)\big)\big\|^2 \right\}. \qquad (6)$$

In practice, α = 1/L is seldom known for a given f, and FISTA estimates it with some αk by checking

$$F(p_{\alpha_k}(y_k)) \le f(y_k) + \langle \nabla f(y_k),\, p_{\alpha_k}(y_k) - y_k \rangle + \frac{1}{2\alpha_k}\|p_{\alpha_k}(y_k) - y_k\|^2 + \psi(p_{\alpha_k}(y_k)). \qquad (7)$$

Since (7) holds for any αk ≤ 1/L, an estimate αk can be precomputed from analytical upper bounds on L for particular cases of f, but these bounds tend to be overly conservative and can lead to poor performance. A better alternative, adopted by FISTA and many other methods (Nesterov, 2013; Scheinberg et al., 2014), is to backtrack: reduce αk by some constant factor ρ < 1 until (7) holds. We define the violation of (7) as

$$v(\alpha_k) := \frac{\frac{1}{2\alpha_k}\|p_{\alpha_k}(y_k) - y_k\|^2}{f(p_{\alpha_k}(y_k)) - f(y_k) - \langle \nabla f(y_k),\, p_{\alpha_k}(y_k) - y_k \rangle}, \qquad (8a)$$

and the corresponding adaptive factor as

$$\hat\rho(v(\alpha_k)) := \rho\, v(\alpha_k). \qquad (8b)$$

In the experiments below, we use (8b) to find suitable step-sizes for FISTA, with a fixed value of ρ < 1.
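The composite case admits an analogous sketch. The snippet below assumes an ℓ1 regularizer, so that the proximal operator (6) reduces to soft-thresholding; it checks the descent-lemma criterion (7) and shrinks αk with the adaptive factor (8b). Since ψ(pαk(yk)) cancels from both sides of (7), only f appears in the test. Function names, the random data and the default ρ = 1/1.1 are illustrative choices, not the authors' code.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_backtrack(f, gradf, y, alpha0, lam, rho=1/1.1):
    """Return (alpha, p_alpha(y)) with alpha satisfying criterion (7)."""
    alpha, g, fy = alpha0, gradf(y), f(y)
    while True:
        p = soft_threshold(y - alpha * g, alpha * lam)   # prox step (6)
        diff = p - y
        quad = np.dot(diff, diff) / (2.0 * alpha)
        lin = fy + np.dot(g, diff)
        if f(p) <= lin + quad:                           # (7) holds (psi cancels)
            return alpha, p
        v = quad / (f(p) - lin)                          # violation (8a), v < 1
        alpha *= rho * v                                 # adaptive factor (8b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((50, 200)), rng.standard_normal(50)
    f = lambda x: 0.5 * float(np.dot(A @ x - b, A @ x - b))
    gradf = lambda x: A.T @ (A @ x - b)
    alpha, p = prox_backtrack(f, gradf, np.zeros(200), alpha0=10.0, lam=0.1)
    print(alpha)
```

In a FISTA loop this helper would be called once per iteration at the extrapolated point yk, with the returned αk reused as the next initial step-size so that the Lipschitz estimate increases monotonically, as described above.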
3 EMPIRICAL PERFORMANCE

We present four experiments illustrating different ways and scenarios in which our adaptive backtracking line search (ABLS) subroutine (Algorithm 2) can outperform regular backtracking (Algorithm 1).

3.1 CONVEX OBJECTIVE: LOGISTIC REGRESSION + ARMIJO

First, we consider the logistic regression objective with L2 regularization, defined by

$$F(x) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log\big(\sigma(a_i^\top x)\big) + (1 - y_i)\log\big(1 - \sigma(a_i^\top x)\big)\Big] + \frac{\gamma}{2}\|x\|^2, \qquad (9)$$

where σ(z) = 1/(1 + exp(−z)) is the sigmoid function, γ > 0, and (aᵢ, yᵢ) ∈ Rᵈ × {0, 1} are n observations from a given dataset. For each dataset, L̄ = λmax(AᵀA)/(4n), where A stacks the observations aᵢᵀ as rows, provides an upper bound on the true Lipschitz parameter of the first term in (9), with which we fix γ = L̄/(10n) and the step-size of gradient descent to 1/(L̄ + γ). In all experiments, the initial point x0 is the origin, as is standard.

Result Summary. A succinct summary of our results, contained in Table 1 and Appendix D, is that across the datasets and step-size initializations considered, adaptive backtracking is more robust than regular backtracking and often leads to significant improvements with respect to base methods.

GD — Backtracking Line Search (BLS) with ρ = 0.2 and ρ = 0.3, Adaptive BLS (ABLS) with ρ = 0.3:

| Dataset | #f | #∇f | ET [s] | #f | #∇f | ET [s] | #f | #∇f | ET [s] | gain |
|---|---|---|---|---|---|---|---|---|---|---|
| ADULT | 148597.2 | 32582.0 | 1319.0 | 74258.5 | 18749.0 | 694.7 | 37296.0 | 13583.8 | 370.0 | 46.7% |
| G_SCALE | 58488.5 | 18252.2 | 3050.4 | 65429.2 | 17111.5 | 3059.7 | 25917.2 | 11700.5 | 1607.0 | 47.3% |
| MNIST | 148469.8 | 41726.2 | 10986.4 | 207786.8 | 46475.5 | 13474.8 | 52616.5 | 22385.8 | 4841.9 | 55.9% |
| MUSHROOMS | 14170.5 | 6507.2 | 31.7 | 14800.8 | 6628.5 | 33.9 | 7611.0 | 3693.8 | 17.3 | 45.5% |
| PHISHING | 33388.2 | 8059.8 | 63.0 | 34193.5 | 7938.5 | 62.4 | 16543.2 | 6434.5 | 26.4 | 57.6% |
| PROTEIN | 28011.0 | 13260.5 | 4481.6 | 35733.2 | 14868.8 | 5282.0 | 27656.0 | 13207.2 | 4137.4 | 7.7% |
| WEB-1 | 6721.5 | 3139.0 | 9.0 | 6156.8 | 2972.8 | 8.4 | 6192.2 | 3024.8 | 7.9 | (5.4%)* |

AGD — BLS with ρ = 0.5 and ρ = 0.6, ABLS with ρ = 0.9:

| Dataset | #f | #∇f | ET [s] | #f | #∇f | ET [s] | #f | #∇f | ET [s] | gain |
|---|---|---|---|---|---|---|---|---|---|---|
| ADULT | 17288.8 | 2014.2 | 116.6 | 49331.2 | 3806.2 | 247.0 | 7999.0 | 2275.0 | 49.2 | 52.5% |
| G_SCALE | 3580.2 | 592.5 | 114.9 | 18530.8 | 1964.0 | 512.9 | 2204.5 | 728.5 | 84.1 | 26.8% |
| MNIST | 8934.2 | 1524.5 | 452.3 | 14365.8 | 1846.5 | 643.7 | 4943.2 | 1666.2 | 283.0 | 37.4% |
| MUSHROOMS | 1100.5 | 372.2 | 1.7 | 1146.8 | 367.8 | 1.6 | 850.0 | 376.2 | 1.4 | 15.5% |
| PHISHING | 6763.2 | 944.2 | 9.1 | 8344.2 | 944.8 | 8.6 | 3699.0 | 1058.0 | 4.0 | 53.6% |
| PROTEIN | 2865.0 | 1232.2 | 396.3 | 3397.5 | 1272.5 | 346.5 | 2743.2 | 1257.0 | 291.6 | 15.8% |
| WEB-1 | 651.8 | 208.5 | 0.6 | 699.8 | 202.2 | 0.5 | 519.2 | 217.8 | 0.4 | 3.3% |

Adagrad — BLS with ρ = 0.2 and ρ = 0.3, ABLS with ρ = 0.3:

| Dataset | #f | #∇f | ET [s] | #f | #∇f | ET [s] | #f | #∇f | ET [s] | gain |
|---|---|---|---|---|---|---|---|---|---|---|
| ADULT | 124102.0 | 20001.0 | 699.5 | 145159.8 | 19789.8 | 746.5 | 27179.0 | 8000.5 | 178.8 | 74.4% |
| G_SCALE | 274023.2 | 33071.5 | 7396.5 | 361852.0 | 34933.8 | 9740.9 | 84201.0 | 17176.2 | 2425.4 | 67.2% |
| MNIST | 86240.5 | 13843.8 | 3921.2 | 99021.8 | 14760.8 | 4377.2 | 12366.2 | 3521.8 | 679.5 | 82.7% |
| MUSHROOMS | 7794.8 | 1967.8 | 9.0 | 7239.5 | 1693.0 | 9.3 | 3751.5 | 1446.0 | 6.0 | 33.2% |
| PHISHING | 74737.5 | 15833.5 | 68.9 | 117564.0 | 20001.0 | 96.4 | 18053.0 | 6375.0 | 19.4 | 71.8% |
| PROTEIN | 6103.0 | 809.2 | 420.6 | 4040.2 | 429.0 | 257.9 | 1845.8 | 446.5 | 164.4 | 36.2% |
| WEB-1 | 4384.2 | 1027.0 | 2.9 | 3857.8 | 745.2 | 2.5 | 1726.5 | 568.8 | 1.3 | 50.1% |

Table 1: Logistic regression. #f and #∇f denote the number of function and gradient evaluations and ET refers to elapsed time in seconds. The gain is given by 1 − (ET of ABLS)/(ET of BLS) with the best ET for BLS across ρ in each experiment, which is bolded. We ran each BLS experiment with a grid of four ρ's and present the best two in the table. The gain for GD on WEB-1 is colored orange because, although ABLS terminated before the best-performing BLS variant, it required more function and gradient evaluations.
This anomaly can be attributed to the relatively small ET for this problem.

Datasets and methods. We take observations from seven datasets, whose details can be found in Appendix D. We consider GD, AGD, and Adagrad, described in Appendix C, with BLS for ρ ∈ {0.2, 0.3, 0.5, 0.6} and our ABLS with a pre-set ρ.

Initialization. We set the starting point x0 as the origin and fix ε = 0.01 in (4b) in all experiments. We also fix ρ, but change it according to the base method. For more details, see Appendix D.

Evaluation. We run all methods long enough to produce solutions with the designated precision, then average various metrics over different initial step-sizes. For more details, see Appendix D. All experiments were run on a supercomputer cluster containing Intel Xeon Platinum 8260 and Intel Xeon Gold 6248 CPUs (Reuther et al., 2018).

Remarks. Table 1 shows that ABLS significantly outperforms BLS. For GD and Adagrad, the ABLS variants outperform BLS for almost every combination of ρ and α0 by saving function evaluations and returning better step-sizes, which speed up convergence. Fig. 1 illustrates this point by showing how the suboptimality gap evolves with time for the baseline GD, its ABLS variant and BLS variants for two choices of initial step-size. In particular, increasing the initial step-size helps BLS in some datasets but not in others. Fig. 2 shows a similar trend for the case of AGD. In general, ABLS is more robust to the choice of initial step-size, and that is the main reason why the ABLS variant of AGD outperforms its BLS counterparts. In Appendix D.2, Figs. 14 and 15 show the corresponding step-sizes. Initially, ABLS returns smaller GD step-sizes than BLS, but this trend quickly reverses. A plausible explanation is that BLS returns the largest step-sizes that satisfy (3), within a factor of ρ. If the step-sizes are excessively large initially, they can lead to worse optimization paths (e.g., more zig-zagging). For AGD, the step-sizes follow a similar trend initially, but then the adaptive step-size seems to converge while the regular step-sizes do not always do so. This can be indicative of another shortcoming of regular backtracking, namely that it can only return step-sizes that are powers of ρ times the initial step-size, in contrast with adaptive backtracking.

[Figure 1: suboptimality gap versus time [s] on several datasets, including gisette_scale and protein, for the curves baseline (GD), reg (0.3, 1e2), reg (0.3, 1e4) and ad (0.3, 1e4).]
Figure 1: Baseline: GD with constant αk = 1/L̄; reg (ρ, β) and ad (ρ, β): GD with, respectively, regular and adaptive memoryless BLS parameterized by ρ and α0 = β/L̄.

[Figure 2: suboptimality gap versus time [s] on the same datasets, for the curves baseline (AGD), reg (0.6, 1e1), reg (0.6, 1e2) and ad (0.9, 1e2).]
Figure 2: Baseline: AGD with constant αk = 1/L̄; reg (ρ, β) and ad (ρ, β): AGD with, respectively, regular and adaptive memoryless BLS parameterized by ρ and α0 = β/L̄.

3.2 CONVEX OBJECTIVE: LINEAR INVERSE PROBLEMS + DESCENT LEMMA

The goal of a linear inverse problem is to recover a sparse signal x from a noisy measurement model y = Ax + ϵ, where A ∈ Rⁿˣᵈ and y ∈ Rⁿ are known, and ϵ is unknown noise. The problem of estimating x is typically posed as a Lasso objective (Santosa & Symes, 1986; Tibshirani, 1996)

$$\min_x\ \frac{1}{2}\|Ax - y\|^2 + \lambda \|x\|_1.$$

Datasets. We take observations A from eight datasets; see Appendix E for details.

Methods. We consider FISTA (Beck & Teboulle, 2009) (Algorithm 6) and BLS variants.
For BLS, ρ ∈ {1/2, 1/3, 1/5}, and for ABLS ρ = 1/1.1 ≈ 0.9, mirroring the choice for AGD above. All BLS methods start with the same initial Lipschitz constant estimate and increase it accordingly.

Initialization. For each dataset, we empirically find values of α0 = 1/L0 around which backtracking becomes active, and then increase them successively (values reported in Appendix E).

Results Summary. The ABLS variant of FISTA outperforms its BLS counterparts across all datasets tested. Moreover, the best value of ρ for BLS changes from one dataset to the other. Since the Lipschitz constant estimate is monotone, function evaluations vary little across methods and have a small impact on performance. Nevertheless, ABLS requires fewer function evaluations for all datasets.

FISTA — Backtracking Line Search (BLS) with ρ = 0.5 and ρ = 0.3, Adaptive BLS (ABLS) with ρ = 0.9:

| Dataset | #f | #∇f | #f | #∇f | #f | #∇f | gain |
|---|---|---|---|---|---|---|---|
| DIGITS | 15.5 | 28282.25 | 10 | 34730.75 | 4.75 | 16756 | 40.8% |
| IRIS | 10.75 | 726.25 | 7 | 816.5 | 4 | 710 | 2.2% |
| OLIVETTI | 10 | 242709.5 | 6.25 | 246827.75 | 2 | 212930.75 | 12.3% |
| LFW | 26.25 | 49093.75 | 17 | 49014.5 | 2 | 45070.75 | 8.0% |
| SPEAR3 | 13.25 | 328328.75 | 9 | 506308.5 | 2 | 255417.5 | 22.2% |
| SPEAR10 | 44.75 | 18691 | 29.75 | 19992.75 | 8 | 15128 | 19.1% |
| SPARCO3 | 27.75 | 266.25 | 18.5 | 276.75 | 3.25 | 251 | 5.7% |
| WINE | 48 | 529333.25 | 27.75 | 564293.75 | 8.5 | 472527 | 10.7% |

Table 2: Linear inverse problem. #∇f and #f denote the number of gradient and excess function evaluations (total function evaluations minus two times the total number of iterations). The gain is given by 1 − (#∇f of ABLS)/(#∇f of BLS) with the best #∇f for BLS across ρ in each dataset (bolded). We ran each BLS experiment with three ρ's and present the best two in the table. ABLS reached the desired precision in all testpoints, while the asterisk on LFW indicates BLS did not in at least one testpoint.

3.3 NONCONVEX OBJECTIVE: ROSENBROCK + ARMIJO

We consider the classical nonconvex problem given by the Rosenbrock objective function F(u, v) = 100(u − v²)² + (1 − v)². We use the origin as the initial point and 0.1 as the initial step-size. Fig. 3 shows the optimization paths for BLS and ABLS memoryless variants of GD and AGD after 1000 iterations, using ρ = 0.3 and ρ = 0.9, respectively. We see that the ABLS variants achieve better losses, requiring far fewer function evaluations and less time to do so.

[Figure 3: optimization paths on the Rosenbrock function, summarized by the following table.]

| Method | #f | loss | ET [s] |
|---|---|---|---|
| GD+BLS | 4992 | 7.30e-03 | 1.04e-1 |
| GD+ABLS | 2754 | 7.21e-12 | 9.98e-2 |
| AGD+BLS | 42263 | 9.25e-11 | 5.59e-1 |
| AGD+ABLS | 2991 | 4.01e-13 | 6.40e-2 |

Figure 3: Performance of GD and AGD regular (red) and adaptive (blue) BLS variants on Rosenbrock. "loss" refers to the final loss after 1000 iterations.

3.4 NONCONVEX OBJECTIVE: MATRIX FACTORIZATION + ARMIJO

Lastly, we consider the nonconvex problem of matrix factorization, defined by the objective F(U, V) = ½‖UVᵀ − A‖²_F, where A ∈ Rᵐˣⁿ, U ∈ Rᵐˣʳ, V ∈ Rⁿˣʳ and r < min{m, n}. We take A from the MovieLens 100K dataset (Harper & Konstan, 2015) and consider three rank values r ∈ {10, 20, 30} (see Appendix F for further details and full plots). For this experiment we replicate the initialization and evaluation methodologies of Section 3.1, except that we leave out Adagrad and pick different values for the initial step-sizes, {0.05, 0.5, 5, 50}, and for ρ. Namely, we let ρ ∈ {0.2, 0.3, 0.5, 0.6} for the BLS variants, but fix ρ = 0.3 and ρ = 0.9 for the ABLS GD and AGD variants, respectively. Table 3 summarizes the results for ABLS and the top two BLS variants.
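As context for this experiment, the sketch below spells out the factorization objective and its gradients; GD and AGD with (A)BLS treat the pair (U, V) as a single flattened variable, so the search direction is simply the concatenated negative gradient. Shapes and the random data are placeholders standing in for the MovieLens matrix.

```python
import numpy as np

def mf_objective(A, U, V):
    """F(U, V) = 0.5 * ||U V^T - A||_F^2."""
    R = U @ V.T - A
    return 0.5 * float(np.sum(R * R))

def mf_gradients(A, U, V):
    """Partial gradients of F with respect to U and V."""
    R = U @ V.T - A
    return R @ V, R.T @ U

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 100, 80, 10                       # placeholder sizes, rank r = 10
    A = rng.standard_normal((m, n))
    U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
    gU, gV = mf_gradients(A, U, V)
    print(mf_objective(A, U, V), gU.shape, gV.shape)
```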
Once again, the best value of ρ for BLS is inconsistent: ρ = 0.3 for ranks 10 and 20 but ρ = 0.2 for rank 30. ABLS requires significantly fewer gradient and function evaluations than the top BLS variant does, which leads to considerable gains in time to achieve the desired precision.

GD — Backtracking Line Search with ρ = 0.2 and ρ = 0.3, Adaptive BLS (ABLS) with ρ = 0.3:

| Rank | #f | #∇f | ET [s] | #f | #∇f | ET [s] | #f | #∇f | ET [s] | gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 39960.0 | 5683.5 | 1034.5 | 33531.8 | 4897.8 | 999.5 | 52075.8 | 1631.8 | 64.8 | 93.5% |
| 20 | 127480.2 | 18734.5 | 1153.2 | 111322.0 | 16109.2 | 1010.9 | 377111.8 | 5133.0 | 170.9 | 83.1% |
| 30 | 231170.5 | 35639.8 | 2035.5 | 394472.2 | 50001.0 | 4195.1 | 681952.0 | 14802.8 | 669.2 | 67.1% |

AGD — BLS with ρ = 0.3 and ρ = 0.5, ABLS with ρ = 0.9:

| Rank | #f | #∇f | ET [s] | #f | #∇f | ET [s] | #f | #∇f | ET [s] | gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 362029.7 | 21909.3 | 2377.0 | 44873.0 | 6198.3 | 379.4 | 22707.7 | 6663.7 | 172.0 | 54.7% |
| 20 | 479432.0 | 35672.0 | 2987.9 | 491753.3 | 33523.3 | 3588.6 | 80720.3 | 24588.7 | 738.0 | 75.3% |
| 30 | 357885.7 | 39234.3 | 2187.8 | 478084.3 | 42985.3 | 5157.0 | 95040.0 | 27454.0 | 855.1 | 60.9% |

Table 3: Backtracking for matrix factorization. #f and #∇f denote the number of function and gradient evaluations and ET refers to elapsed time in seconds. The gain is given by 1 − (ET of ABLS)/(ET of BLS) with the best ET for BLS across ρ in each experiment, which is bolded.

4 MOTIVATION AND THEORETICAL RESULTS

In this section, we motivate our choices of adaptive factors and characterize them theoretically. The particular choices of adaptive factors were made with two goals in mind: generate more aggressive backtracking factors to save function evaluations, and guarantee reasonably large step-sizes to achieve fast convergence. To meet our first goal, ρ̂(v(αk)) ∈ (0, ρ) must hold. Indeed, if (3) is violated, then v(αk) < 1 because dk is assumed to be a descent direction, where v is defined by (4a). So, if in addition ε ∈ (0, ρ) in (4b), then ρ̂(v(αk)) < ρ. To meet the second goal, the returned step-size must be nontrivially lower bounded. To this end, in Section 4.2 we show that if the objective function is Lipschitz-smooth, then ABLS returns a step-size on par with the greatest step-size that is guaranteed to satisfy (3). For now, we note that if (3) is violated, then 1 − c·v(αk) > 0, since c ∈ (0, 1) by assumption. Thus, ρ̂(v(αk)) is bounded away from zero. Moreover, that same bound applies to BLS. Similar conclusions can be reached if the BLS criterion is (7) and ρ̂ and v are chosen as (8b) and (8a).

4.1 THE SCOPE OF THEORETICAL GUARANTEES FOR A BACKTRACKING SUBROUTINE

Before characterizing adaptive backtracking theoretically, we state a theoretical bound on how well a general line search procedure can do relative to regular backtracking. Then, we state three further facts that substantiate why comparing line search procedures is hard. These statements are consequences of simple examples described in Appendix B.1. The fundamental obstacle we face in theoretically comparing adaptive backtracking, or any other line search procedure, with regular backtracking is described by Example 1, which shows that no line search procedure can be provably better than regular backtracking at enforcing any set of line search criteria that includes the Armijo condition or the descent lemma, for any class of functions that includes quadratics. We compare line search procedures on the basis of the number of adjustments needed to return a feasible step-size and the length of the returned step-size.
To establish this fact, in Appendix B.1 we construct a simple scalar quadratic example in which regular backtracking returns the greatest feasible step-size after one adjustment and, furthermore, this step-size leads to the minimum. Also in Appendix B.1, we present a similar example for the descent lemma. In general, comparing line search procedures is challenging because different procedures return different step-sizes, which lead to different optimization paths. For example, a procedure that reduces step-sizes more in each adjustment need not require fewer adjustments (function evaluations) overall than a procedure that reduces step-sizes less, because each procedure leads to a different sequence of iterates. In fact, comparison is challenging even for a single backtracking call because:

1. The step-size returned by backtracking is not monotone in ρ, even for convex problems.
2. For nonconvex problems, given step-sizes α and α′ with α < α′, α′ being feasible does not imply that α is also feasible.
3. For nonconvex problems, decreasing ρ may increase the number of criteria evaluations required to compute a feasible step-size.

Example 2 establishes the first fact and Example 3 establishes the second and third facts. With the above in mind, in the following we provide the best set of performance guarantees that one can hope for. Namely, we show that adaptive backtracking requires no more function evaluations than regular backtracking to return a feasible step-size, and that the step-size lower bounds for the two procedures are the same, which leads to the same global convergence rate results.

4.2 THEORETICAL RESULTS

In this subsection, we present theoretical results regarding regular and adaptive backtracking. Several additional convergence results and full proofs can be found in Appendix B.

Convex problems. We show that only the first fact above holds for convex problems, which expands the extent to which two backtracking subroutines can be compared.

Proposition 1. Let F be convex differentiable. Given a point xk, a direction dk and a step-size αk > 0 satisfying (3) for some c, then xk, dk and α′k also satisfy (3) for any α′k ∈ (0, αk).

The following proposition refers to compatible inputs. By this we mean:

Definition 1 (Compatibility). The inputs to Algorithms 1 and 2 are said to be compatible if α0, c, v coincide and the input ρ̂ to Algorithm 2 is parameterized by the same ρ that Algorithm 1 takes as input.

Proposition 2. Let F be convex differentiable. Fixing all other inputs, the number of function evaluations that Algorithms 1 and 2 take to return a feasible step-size is nondecreasing in the input ρ. Moreover, given compatible inputs with a descent direction and ε < ρ, Algorithm 2 takes no more function evaluations to return a feasible step-size than Algorithm 1 does.

Nonconvex problems. For convex problems, we were able to compare the number of times regular and adaptive backtracking must evaluate their criteria in order to return a single feasible step-size. But what really matters is the total number of criteria evaluations up to a given iteration. We bound this number for general nonconvex problems, hinging on the following properties.¹

Definition 2 (Smoothness). A function F is said to be Lipschitz-smooth if (5) holds for some L > 0.

Definition 3 (Gradient related). The directions dk are said to be gradient related if there are c1 > 0 and c2 > 0 such that ⟨∇F(xk), dk⟩ ≤ −c1‖∇F(xk)‖² and ‖dk‖ ≤ c2‖∇F(xk)‖ for all k ≥ 0.
Assumption 2. We assume F is Lipschitz-smooth and the directions dk are gradient related.

Gradient relatedness ensures that dk is not too large or too small with respect to ∇F(xk) and that the angle between dk and ∇F(xk) is not too close to being perpendicular (Bertsekas, 2016, p. 41). Together with c1 and c2, the Lipschitz constant L and the Armijo constant c define a step-size threshold ᾱ = 2c1(1 − c)/(Lc2²) below which (3) holds. This quantity is central in the following result.

Informal Theorem (Armijo). Let F be Lipschitz-smooth and dk gradient related. Given compatible inputs, if ε < ρ and v, ρ̂ are chosen as in (4), then Algorithms 1 and 2 share the same bounds on the total number of backtracking criteria evaluations up to any iteration. If αk is received as the initial step-size input at iteration k + 1 for all k ≥ 0, then they evaluate (3) at most ⌈log_ρ(ᾱ/α0)⌉ + 1 + k times up to iteration k. If, on the other hand, α0 is received as the initial step-size input at every iteration, then they evaluate (3) at most k(⌈log_ρ(ᾱ/α0)⌉ + 1) times up to iteration k. Moreover, Algorithms 1 and 2 always return a step-size αk such that αk ≥ min(α0, ρᾱ).

Informal Theorem (Descent lemma). Let f be Lipschitz-smooth convex and let ψ be continuous convex. Also, suppose v and ρ̂ are chosen as (8a) and (8b). If αk ∈ (0, 1/L), then (7) holds for all yk. If Algorithms 1 and 2 receive αk as the initial step-size input in iteration k + 1, then they evaluate (7) at most ⌈log_ρ(1/(Lα0))⌉ + 1 + k times up to iteration k. If, on the other hand, Algorithms 1 and 2 receive α0 as the initial step-size input in every iteration, then they evaluate (7) at most k(⌈log_ρ(1/(Lα0))⌉ + 1) times up to iteration k. Moreover, they return a feasible step-size αk such that αk ≥ min{α0, ρ/L}.

5 FUTURE WORK: FURTHER APPLICATIONS, EXTENSIONS AND LIMITATIONS

Adaptive backtracking is a general idea that can be broadly applied in a variety of settings. Our goal in this paper was to rigorously validate it in classical machine learning and optimization problems. This section outlines several promising directions for future work and speculates about potential limitations.

¹ Instead of the usual condition that ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖ hold for all x, y, we adopt the equivalent (Nesterov, 2018, Thm. 2.1.5) condition (5) as the definition of Lipschitz-smoothness for the sake of convenience.

5.1 FURTHER APPLICATIONS AND EXTENSIONS

Stochastic line search. In machine learning, models such as over-parameterized neural networks are sufficiently expressive to interpolate immense datasets (Zhang et al., 2016; Ma et al., 2018). Interpolation provides the theoretical foundation for the stochastic line search (SLS) proposed by Vaswani et al. (2019b), which enforces the Armijo condition on the training mini-batches. In the same vein, Galli et al. (2023) replaced the Armijo condition in SLS with a nonmonotone criterion and used Polyak's step-size to devise an initial step-size heuristic, obtaining the Polyak Nonmonotone Stochastic (PoNoS) method. Below, we reproduce experiment 1 from (Galli et al., 2023) to demonstrate the potential of applying ABLS in combination with stochastic line search in the interpolating regime. Fig. 4 shows that combining ABLS with PoNoS leads to good test accuracy in fewer epochs (details in Appendix A). We defer fully developing this application to future work.

[Figure 4: two panels plotting train loss (log) and test accuracy against epochs (0–75) for SLS, Adam, SPS, PoNoS and ABLS.]
Figure 4: MLP trained on MNIST with different algorithms.
Additional line search criteria. Adaptive backtracking may be useful to enforce other conditions that are affine in the step-size. A prominent candidate is the Goldstein conditions (Goldstein & Price, 1967), which comprise two inequalities that are affine in the step-size. Nonmonotone line search criteria (Grippo et al., 1986; Zhang & Hager, 2004) also offer several candidates.

Additional algorithms and increasing step-sizes. In this paper, we experimented with increasing step-sizes through memoryless line search, where the initial step-size is fixed for every iteration, but other schemes are possible. For example, adaptive adjustments can also be used to increase step-sizes. In addition, adaptive backtracking can replace regular backtracking in line search methods that increase the current step-size and then use it as the initial step-size, such as the two-way method (Truong & Nguyen, 2021) and FISTA variants (Scheinberg et al., 2014; Calatroni & Chambolle, 2019; Rebegoldi & Calatroni, 2022). Also, it would be interesting to see how adaptive backtracking works together with schemes that handle problems where the strong convexity constant is unknown, namely restarting schemes (Becker et al., 2011; O'Donoghue & Candès, 2015; Aujol et al., 2024). Finally, we note that the violation of a condition can also be used indirectly to adjust step-sizes, for example to pick the degree to which a fixed ρ is exponentiated, saving backtracking cycles.

5.2 LIMITATIONS

The weak and strong Wolfe conditions (Wolfe, 1969) are not affine in the step-size and are not satisfied by arbitrarily small step-sizes. Hence, Wolfe conditions are not enforced by backtracking (e.g., Nocedal & Wright (2006, pp. 60–61)) and it is unclear how to find analogous adaptive schemes. In turn, quasi-Newton methods, which often must enforce Wolfe conditions to guarantee global convergence, may not be suitable candidates for adaptive line search subroutines. In reality, the role of line search for these methods is to guarantee that they converge globally rather than to find the right step-size, since they work with unit step-sizes locally. Hence, only a few adjustments may be necessary. The same applies to Newton's method and the Barzilai–Borwein method (Barzilai & Borwein, 1988). It is also unclear whether adaptive adjustments can be useful for more general stochastic line search methods that do not rely on the interpolation property (Cartis & Scheinberg, 2017; Paquette & Scheinberg, 2020). Instead, they resample function and gradient mini-batches in every loop, whether a sufficient descent condition is violated or not. But the information conveyed by the violation for one sample need not be relevant to satisfying the same condition with a different sample, which poses a potential limitation.

REFERENCES

L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math., 16(1):1–3, 1966.

J.-F. Aujol, L. Calatroni, C. Dossal, H. Labarrière, and A. Rondepierre. Parameter-free FISTA by adaptive restart and backtracking. SIAM Journal on Optimization, 34(4):3259–3285, 2024.

J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141–148, 1988.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

B. Becker and R. Kohavi. Adult. UCI Machine Learning Repository, 1996.
S. R. Becker, E. J. Candès, and M. C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(2):165–218, 2011.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 3rd edition, 2016.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

L. Calatroni and A. Chambolle. Backtracking strategies for accelerated descent methods with smooth composite objectives. SIAM Journal on Optimization, 29(3):1772–1798, 2019.

C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program., 169:337–375, 2017.

R. Caruana, T. Joachims, and L. Backstrom. KDD-Cup 2004: Results and analysis. SIGKDD Explorations Newsletter, 6(2):95–108, 2004.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(27):1–27, 2011.

I. F. D. de Oliveira and R. H. C. Takahashi. Efficient solvers for Armijo's backtracking problem, 2021.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

S. Fridovich-Keil and B. Recht. Approximately exact line search, 2020.

L. Galli, H. Rauhut, and M. Schmidt. Don't be so monotone: Relaxing stochastic line search in over-parameterized models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 34752–34764. Curran Associates, Inc., 2023.

A. Goldstein and J. Price. An effective algorithm for minimization. Numer. Math., 10:184–189, 1967.

L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23(4):707–716, 1986.

F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), 2015.

D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. Proceedings of the IEEE, 86(11):2278–2324, 1998.

N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pp. 1306–1314, 2021.

D. A. Lorenz, M. E. Pfetsch, A. M. Tillmann, and C. Kruschel. SPEAR: Sparse exact and approximate recovery. https://wwwopt.mathematik.tu-darmstadt.de/spear/, 2014.

S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning (ICML), 2018.

J. J. Moré and D. C. Sorensen. Newton's method. Technical report, Argonne National Lab., IL (USA), 1982.

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983.

Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program. Ser. B, 140:125–161, 2013.

Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006.

L. Orseau and M. Hutter. Line search for convex minimization, 2023.

B. O'Donoghue and E. J. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

C. Paquette and K. Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, pp. 2825–2830, 2011.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods – Support Vector Learning. MIT Press, January 1998.

S. Rebegoldi and L. Calatroni. Scaled, inexact, and adaptive generalized FISTA for strongly convex optimization. SIAM Journal on Optimization, 32(3):2428–2459, 2022.

A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa, C. Yee, and P. Michaleas. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, 2018.

S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.

F. Santosa and W. W. Symes. Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4):1307–1330, 1986.

K. Scheinberg, D. Goldfarb, and X. Bai. Fast first-order methods for composite convex optimization with backtracking. Foundations of Computational Mathematics, 14:389–417, 2014.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.

T. T. Truong and T. H. Nguyen. Backtracking gradient descent method for general C¹ functions, with applications to deep learning. Appl. Math. Optim., 84:2557–2586, 2021.

E. van den Berg, M. P. Friedlander, G. Hennenfent, F. Herrmann, R. Saab, and Ö. Yilmaz. Sparco: A testing framework for sparse reconstruction. Technical Report TR-2007-20, University of British Columbia, Vancouver, Dept. Computer Science, 2007.

S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In NeurIPS, 2019a.

S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. Advances in Neural Information Processing Systems, 32, 2019b.

P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv, 2016.

H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4):1043–1056, 2004.

A STOCHASTIC LINE SEARCH EXAMPLE
The results presented in Fig. 4 correspond to experiment 1 from (Galli et al., 2023) without any modifications, and are run using the base code from the same paper, which can be found at https://github.com/leonardogalli91/PoNoS. In this experiment, a multilayer perceptron (MLP) with a single hidden layer, width 1000 and 535818 parameters is trained on the MNIST dataset (LeCun et al., 1998) until the training loss becomes less than 10⁻⁶. This is the same stopping criterion adopted in (Galli et al., 2023). To train the MLP, the following methods are used:

1. Adam: adaptive moment estimation method (Kingma & Ba, 2015).
2. SLS: stochastic line search (Vaswani et al., 2019a).
3. SPS: stochastic Polyak step-size method (Loizou et al., 2021).
4. PoNoS: Polyak nonmonotone stochastic method (Galli et al., 2023).
5. ABLS: an adaptive backtracking variant of PoNoS, detailed below.

As Fig. 4 shows, only PoNoS and ABLS terminate within 75 epochs. PoNoS does so after 56 epochs and 583 seconds, while ABLS finishes in 41 epochs and 464 seconds. For all the above methods, we preserve the parameters recommended in (Galli et al., 2023), unaltered from the source code.

The ABLS method combines our adaptive backtracking procedure with a simplified version of PoNoS. Namely, PoNoS generates initial step-sizes with

$$\eta_k = \eta_{k,0}\, \delta^{\,\tilde l_k + l_k}, \qquad (10)$$

where δ ∈ (0, 1) and l_k is the number of backtracks in iteration k − 1, which are accounted for in $\tilde l_k = \max\{\tilde l_{k-1} + l_{k-1} - 1,\ 0\}$. That is, previous backtracks are used to discount an initial step-size given by $\eta_{k,0} = \min\{\bar\eta_{k,0},\ \eta_{\max}\}$, which in turn is based on the Polyak initial step-size

$$\bar\eta_{k,0} = \frac{f_{i_k}(w_k) - f^*_{i_k}}{c_p\, \|\nabla f_{i_k}(w_k)\|^2}, \qquad (11)$$

where c_p ∈ (0, 1) is a hyperparameter, i_k denotes the mini-batch sampled in iteration k, w_k denotes the MLP parameters in iteration k, and f*_{i_k} refers to the minimum of the mini-batch training loss $f_{i_k} = \frac{1}{|i_k|}\sum_{j \in i_k} f_j$. Instead, we simply use (11) as the initial step-size, with c_p = 1/2, the same value proposed in (Galli et al., 2023). Then, we apply ABLS to enforce

$$f_{i_k}\big(w_k - \eta_k \nabla f_{i_k}(w_k)\big) \le C_k - c\, \eta_k\, \|\nabla f_{i_k}(w_k)\|^2, \quad c \in (0, 1), \qquad (12)$$

where C_k denotes an exponential moving average of losses given by

$$C_k = \max\{\tilde C_k,\ f_{i_k}(w_k)\}, \qquad \tilde C_k = \frac{\xi Q_k \tilde C_{k-1} + f_{i_k}(w_k)}{Q_{k+1}}, \qquad Q_{k+1} = \xi Q_k + 1,$$

with ξ ∈ (0, 1). The inequality (12) is a stochastic variant, enforced by PoNoS, of the deterministic criterion proposed by Zhang & Hager (2004). This inequality can be seen as a generalization of the Armijo condition with the current loss replaced by an average. As such, it preserves the structure of (3), which allows us to seamlessly apply (4b), with ρ = 0.9 and ε = 0.5.
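Because (12) has the same affine structure as (3), the adaptive factor drops in directly. The sketch below illustrates one such backtracking step on a mini-batch together with a Zhang–Hager-style update of the average Ck; the callables, the value of c for (12) and the ξ value are schematic placeholders, not the released PoNoS code.

```python
import numpy as np

def abls_step(loss_b, grad_b, w, eta0, C_k, c=0.5, rho=0.9, eps=0.5):
    """Backtrack eta until the nonmonotone criterion (12) holds for this batch."""
    g = grad_b(w)
    gnorm2 = float(np.dot(g, g))
    eta = eta0
    while True:
        trial = loss_b(w - eta * g)
        if trial <= C_k - c * eta * gnorm2:               # criterion (12) holds
            return w - eta * g, eta
        v = (trial - C_k) / (-c * eta * gnorm2)           # violation, as in (4a)
        eta *= max(eps, rho * (1.0 - c) / (1.0 - c * v))  # adaptive factor (4b)

def update_average(C_tilde_prev, Q_prev, batch_loss, xi=0.9):
    """Moving average used in (12); xi = 0.9 is a placeholder value."""
    Q = xi * Q_prev + 1.0
    C_tilde = (xi * Q_prev * C_tilde_prev + batch_loss) / Q
    return max(C_tilde, batch_loss), C_tilde, Q           # C_k, running average, Q
```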
B EXAMPLES, THEOREMS AND PROOFS

In this section, we present the examples mentioned in Section 4, prove the results stated therein and provide further convergence results.

B.1 EXAMPLES

The first example establishes the fundamental obstacle that no line search procedure can be provably better than regular backtracking at enforcing any set of line search criteria that includes the Armijo condition or the descent lemma, for any class of functions that includes quadratics. The second and third examples establish three more facts:

1. The step-size returned by backtracking is not monotone in ρ, even for convex problems.
2. For nonconvex problems, given step-sizes α and α′ with α < α′, α′ being feasible does not imply that α is too.
3. For nonconvex problems, decreasing ρ may increase the number of criteria evaluations required to compute a feasible step-size.

Example 1 (Fundamental obstacle). Let F be defined by F(x) = x²/2. If xk ≠ 0 and dk is a descent direction, then by definition ⟨∇F(xk), dk⟩ = dk·xk < 0 and it must be that dk = −c1·xk for some c1 > 0. Hence, given some c ∈ (0, 1), (3) holds if and only if

$$\tfrac{1}{2}(1 - \alpha_k c_1)^2 x_k^2 = F(x_{k+1}) \le F(x_k) + c\,\alpha_k \langle \nabla F(x_k), d_k \rangle = \tfrac{1}{2}x_k^2 - c\,\alpha_k c_1 x_k^2.$$

Therefore, (3) holds if and only if

$$c_1\big(2(1 - c) - c_1\alpha_k\big)\alpha_k x_k^2 = \big(1 - 2c\,c_1\alpha_k - (1 - c_1\alpha_k)^2\big)x_k^2 \ge 0,$$

or, equivalently, αk ≤ 2(1 − c)/c1, since xk² > 0. Now, suppose that α0 = 4(1 − c)/c1 > 0 and ρ = 1/2. Then, after testing (3) exactly once, backtracking returns the step-size αk = 2(1 − c)/c1, the greatest value that is guaranteed to satisfy (3). Thus, in this example, backtracking is optimal in the sense that it tests (3) only once to return the greatest feasible step-size possible. Therefore, no other line search procedure can be provably better than backtracking at enforcing any set of line search criteria that includes the Armijo condition for any class of functions that includes quadratics.

For the sake of concreteness, Fig. 5 illustrates the particular case where c = 1/2, x0 = 1, d0 = −∇f(x0) = −1, α0 = 2 and ρ = 1/2. The initial step-size α0 = 2 is too large and leads to a tentative iterate x0 + α0d0 = −1 that lies outside of the shaded region of iterates allowed by the Armijo condition. After one adjustment, however, the step-size is reduced to ρα0 = 1, leading to the tentative iterate x0 + ρα0d0 = 0, which lies at the boundary of the region of iterates allowed by the Armijo condition. That is, regular backtracking returns the greatest feasible step-size after one adjustment. Furthermore, this step-size leads to the minimum, therefore solving the problem after one iteration and one adjustment.

Figure 5: Regular backtracking returns the greatest feasible step-size after one adjustment.

Next, we show an analogous result for the descent lemma. Keeping F(x) = x²/2, for any x ≠ y, (5) holds with an estimate Lk of the Lipschitz constant if and only if

$$\frac{x^2}{2} = F(x) \le F(y) + \langle \nabla f(y), x - y \rangle + \frac{L_k}{2}(x - y)^2 = \frac{y^2}{2} + y(x - y) + \frac{L_k}{2}(x - y)^2,$$

which is equivalent to Lk ≥ 1. Hence, if L0 ∈ (0, 1) and ρ = L0, then backtracking returns the optimal estimate L0/ρ = 1 after testing (5) exactly once (requiring two function evaluations), no matter what x ≠ y are. Hence, in this example, backtracking is optimal in the sense that it tests (5) only once to return the tightest Lipschitz constant estimate possible. Therefore, no other line search procedure can be provably better than backtracking at enforcing any set of line search criteria that includes the descent lemma for any class of functions that includes quadratics. As a side note, we look into what adaptive backtracking would do. For this problem, (8a) becomes

$$v(\alpha_k) = \frac{\frac{1}{2\alpha_k}\|p_{\alpha_k}(y_k) - y_k\|^2}{f(p_{\alpha_k}(y_k)) - f(y_k) - \langle \nabla f(y_k),\, p_{\alpha_k}(y_k) - y_k \rangle} = \frac{1}{\alpha_k} = L_k.$$

Therefore, after one function evaluation, adaptive backtracking returns ρ·v(αk)·αk = ρ, which matches the theoretical lower bound of backtracking and corresponds to the Lipschitz constant estimate 1/ρ.

Example 2 (Fact 1). Let F be defined by F(x) = x² and fix xk = −1, dk = −F′(xk) = −2xk = 2. Then, the Armijo condition with c = 1/4 is satisfied if and only if (−1 + 2αk)² ≤ 1 − αk. To find the critical step-size values for which this inequality is satisfied, we simply solve a second-order equation, which gives the positive value α*k = 0.75, with corresponding iterate xk + α*k dk = 0.5. Thus, if the initial tentative step-size is αk = 1, which produces the tentative iterate 1, then the Armijo condition is not satisfied and the step-size must be adjusted.
If ρ = 0.75, then the adjusted step-size 0.75 produces the tentative iterate 0.5 and the Armijo condition is satisfied, so the step-size requires no further adjustments. On the other hand, if ρ = 0.8, then the adjusted step-size 0.8 produces the tentative iterate 0.6 and the Armijo condition is not satisfied. Adjusting the step-size once more produces a step-size of 0.64 < α*k and a corresponding iterate xk + αkdk = −1 + 0.64·2 = 0.28, satisfying the Armijo condition. Therefore, increasing ρ from 0.75 to 0.8 decreases the step-size that backtracking returns.

Figure 6: Reducing ρ can lead to greater step-sizes.

Example 3 (Facts 2 and 3). Let F be defined by F(x) = cos x − ax, where a = 1/(5π), and fix xk = π/2, dk = −F′(xk) = sin xk + a = 1 + a. Given the above choices, the Armijo condition parameterized by c = 1/(2π) is satisfied if and only if

$$\cos\!\Big(\frac{\pi}{2} + (1 + a)\alpha_k\Big) - a\Big(\frac{\pi}{2} + (1 + a)\alpha_k\Big) \le \cos\frac{\pi}{2} - a\,\frac{\pi}{2} - c\,(1 + a)^2 \alpha_k,$$

or, equivalently, if and only if

$$\cos\!\Big(\frac{\pi}{2} + (1 + a)\alpha_k\Big) \le a(1 + a)\alpha_k - c\,(1 + a)^2 \alpha_k.$$

If the initial tentative step-size is picked as 7π/(2(1 + a)), then (1 + a)αk = 7π/2, so that

$$\cos\!\Big(\frac{\pi}{2} + \frac{7\pi}{2}\Big) = \cos(4\pi) = 1 > -1.16 \approx \frac{7}{10} - \frac{7}{4} - \frac{7}{20\pi}.$$

That is, the Armijo condition is not satisfied, therefore the step-size must be adjusted. If ρ = 5/7, then the step-size is adjusted to 5π/(2(1 + a)), so that (1 + a)αk = 5π/2 and the Armijo condition is satisfied, since

$$\cos\!\Big(\frac{\pi}{2} + \frac{5\pi}{2}\Big) = \cos(3\pi) = -1 \le -0.83 \approx \frac{1}{2} - \frac{5}{4} - \frac{1}{4\pi}.$$

On the other hand, if ρ = 3/7, then (1 + a)αk = 3π/2 and the Armijo condition is not satisfied, since

$$\cos\!\Big(\frac{\pi}{2} + \frac{3\pi}{2}\Big) = \cos(2\pi) = 1 > -0.5 \approx \frac{3}{10} - \frac{3}{4} - \frac{3}{20\pi}.$$

Adjusting the step-size once more by ρ = 3/7 produces the step-size 9π/(14(1 + a)) and, in turn, the iterate xk + αkdk = π/2 + 9π/14 = 8π/7 ≈ 3.59, which is feasible since

$$\cos\!\Big(\frac{8\pi}{7}\Big) \approx -0.90 \le -0.21 \approx \frac{9}{70} - \frac{9}{28} - \frac{9}{140\pi}.$$

In this example, the step-size 3π/(2(1 + a)) is not feasible although it is smaller than 5π/(2(1 + a)), which is feasible. More generally, this establishes that feasibility is not monotone in the step-size for nonconvex functions. Moreover, after reducing ρ from 5/7 to 3/7, backtracking must adjust the initial step-size one additional time. Therefore, reducing ρ might increase the number of criteria evaluations needed to return a feasible step-size.

Figure 7: Reducing ρ can lead to more step-size adjustments.

B.2 CONVEX PROBLEMS

In this subsection, we present and prove the results for convex problems stated in Section 4.

Proposition 3. Let F be convex differentiable. Given a point xk, a direction dk and a step-size αk > 0 satisfying (3) for some c, then xk, dk and α′k also satisfy (3) for any α′k ∈ (0, αk).

Proof. Let β := α′k/αk ∈ (0, 1). Then, expressing xk + α′kdk as β(xk + αkdk) + (1 − β)xk, we obtain

$$F(x_k + \alpha'_k d_k) = F\big(\beta(x_k + \alpha_k d_k) + (1 - \beta)x_k\big) \le \beta F(x_k + \alpha_k d_k) + (1 - \beta)F(x_k) \le \beta\big(F(x_k) + c\,\alpha_k\langle\nabla F(x_k), d_k\rangle\big) + (1 - \beta)F(x_k) = c\,\alpha'_k\langle\nabla F(x_k), d_k\rangle + F(x_k),$$

where the first and second inequalities follow from F being convex and from xk, dk and αk satisfying (3), respectively.

Proposition 4. Let F be convex differentiable. Fixing all other inputs, the number of backtracking criteria evaluations that Algorithm 1 takes to return a feasible step-size is nondecreasing in the input ρ.

Proof. Consider the inputs α0, c and v to Algorithm 1 fixed. Then, let 0 < ρ1 < ρ2 < 1 and let N1 and N2 denote the number of adjustments Algorithm 1 takes to compute a feasible step-size when it receives respectively ρ1 and ρ2 as input. If $\rho_i^{N_i}\alpha_0$ is a feasible step-size and $N_i' > N_i$ for some i ∈ {1, 2}, then $\rho_i^{N_i'}\alpha_0$ is also feasible, in view of the fact that $\rho_i^{N_i'} < \rho_i^{N_i}$ and of Proposition 3.
Moreover, Algorithm 1 must test whether the step-size $\rho_i^{N_i}\alpha_0$ is feasible before testing the step-size $\rho_i^{N_i'}\alpha_0$, and therefore cannot return $\rho_i^{N_i'}\alpha_0$. Inductively, we conclude that N1 and N2 are the least nonnegative integers such that $\rho_i^{N_i}\alpha_0$ are feasible. Now, since $\rho_2^{N_2}\alpha_0$ is feasible and $\rho_1^{N_2}\alpha_0 < \rho_2^{N_2}\alpha_0$, the step-size $\rho_1^{N_2}\alpha_0$ is also feasible, in view of the assumption that ρ1 < ρ2 and of Proposition 3. Hence, if N1 > N2, then N1 is not the least nonnegative integer such that $\rho_1^{N_1}\alpha_0$ is feasible, a contradiction. Moreover, each adjustment requires evaluating the objective function once, so the total number of function evaluations Algorithm 1 takes to return a feasible step-size is Nᵢ + 2. Therefore, if Algorithm 1 receives ρ1 as input, then it takes no more function evaluations to return a feasible step-size than if it receives ρ2 as input.

Definition 4. The inputs to Algorithms 1 and 2 are said to be compatible if α0, c, v coincide and the input ρ̂ to Algorithm 2 is parameterized by the same ρ that Algorithm 1 takes as input.

Proposition 5. Let F be convex differentiable. Given compatible inputs with a descent direction dk and ε < ρ, Algorithm 2 takes no more function evaluations to return a feasible step-size than Algorithm 1 does.

Proof. Suppose Algorithms 1 and 2 receive compatible inputs. If (3) is violated for some tentative step-size αk, then v(αk) < 1, which together with c ∈ (0, 1) implies 1 − c·v(αk) > 1 − c > 0. In turn, ρ̂(v(αk)) < ρ because ε < ρ, by assumption. The result follows by repeating the arguments used to prove Proposition 4 above.

B.3 NONCONVEX PROBLEMS

B.3.1 ARMIJO CONDITION

Proposition 6 (Armijo feasibility for C² functions). Let F be twice continuously differentiable. Given a base point xk, a descent direction dk, an initial step-size α0 and a constant c ∈ (0, 1) for the Armijo condition (3), there is some ᾱ = ᾱ(xk, dk, c, α0) > 0 such that xk + αkdk satisfies (3) for all αk ∈ (0, ᾱ).

Proof. Since F is twice continuously differentiable, by Taylor's theorem (Nocedal & Wright, 2006, p. 14) there exists some t = t(xk, dk, αk) ∈ (0, 1) such that

$$F(x_k + \alpha_k d_k) = F(x_k) + \alpha_k \langle \nabla F(x_k), d_k \rangle + \frac{\alpha_k^2}{2}\,\langle d_k,\ \nabla^2 F(x_k + t\alpha_k d_k)\, d_k \rangle. \qquad (13)$$

Moreover, the eigenvalues of ∇²F are continuous and the line segment {xk + αkdk : αk ∈ [0, α0]} is compact, therefore there is some λ > 0 such that for all αk ∈ (0, α0) and t ∈ (0, 1),

$$d_k^\top \nabla^2 F(x_k + t\alpha_k d_k)\, d_k \le \lambda \|d_k\|^2. \qquad (14)$$

So, let ᾱ = ᾱ(xk, dk, c, α0) := −2(1 − c)⟨∇F(xk), dk⟩/(λ‖dk‖²), which is positive since dk is a descent direction, by assumption. Combining (13) with (14), it follows that if αk ∈ (0, ᾱ), then (3) holds.

For the sake of convenience, we now restate some definitions from Section 4.

Definition 5 (Smoothness). A function F is said to be Lipschitz-smooth if (5) holds for some L > 0.

Definition 6 (Gradient related). The directions dk are said to be gradient related if there are c1 > 0 and c2 > 0 such that ⟨∇F(xk), dk⟩ ≤ −c1‖∇F(xk)‖² and ‖dk‖ ≤ c2‖∇F(xk)‖ for all k ≥ 0.

Given a Lipschitz-smooth function F, we are particularly interested in applying the descent lemma (5) with x = xk and y = xk + αkdk, which gives

$$F(x_k + \alpha_k d_k) \le F(x_k) + \alpha_k \langle \nabla F(x_k), d_k \rangle + \frac{\alpha_k^2 L}{2}\|d_k\|^2. \qquad (15)$$

Proposition 7 (Armijo feasibility for Lipschitz-smooth functions). Let F be Lipschitz-smooth. Given a base point xk, a descent direction dk, an initial step-size α0 and a constant c ∈ (0, 1) for the Armijo condition (3), there is some ᾱ = ᾱ(xk, dk, c) > 0 such that (3) holds for all αk ∈ (0, ᾱ). If, in addition, the dk are gradient related, then (3) holds for all αk ∈ (0, 2(1 − c)c1/(Lc2²)], independent of xk and dk.

Proof.
To guarantee that (3) holds, we impose that the right-hand side of (15) be at most the right-hand side of (3):
F(x_k) + α_k⟨∇F(x_k), d_k⟩ + (α_k²L/2)‖d_k‖² ≤ F(x_k) + cα_k⟨∇F(x_k), d_k⟩.
Simplifying the above inequality, it follows that if
α_k ≤ 2(1 − c)⟨−∇F(x_k), d_k⟩/(L‖d_k‖²),   (16)
then (3) holds, where we note that the right-hand side of (16) is positive, since d_k is assumed to be a descent direction. Now, suppose that ⟨∇F(x_k), d_k⟩ ≤ −c_1‖∇F(x_k)‖² and ‖d_k‖ ≤ c_2‖∇F(x_k)‖ for some c_1 > 0 and c_2 > 0. Then, for all α_k such that α_k ≤ 2(1 − c)c_1/(Lc_2²), we have
α_k ≤ 2(1 − c)c_1‖∇F(x_k)‖²/(Lc_2²‖∇F(x_k)‖²) ≤ 2(1 − c)⟨−∇F(x_k), d_k⟩/(L‖d_k‖²).
That is, (16) holds. Therefore, (3) also holds.

Proposition 8. Let F be Lipschitz-smooth, let ε < ρ and assume v, ˆρ are given by (4). Also, suppose the d_k are gradient related. If Algorithms 1 and 2 receive α_k as the initial step-size input at iteration k + 1 for all k ≥ 0, then they evaluate (3) at most ⌈log_ρ(ᾱ/α_0)⌉ + 1 + k times up to iteration k, where ᾱ := 2(1 − c)c_1/(Lc_2²). If, on the other hand, Algorithms 1 and 2 receive α_0 as the initial step-size input at every iteration, then they evaluate (3) at most k(⌈log_ρ(ᾱ/α_0)⌉ + 1) times up to iteration k.

Proof. Suppose that Algorithm 2 evaluates (3) and it does not hold for a given tentative step-size α_k. Then,
F(x_k + α_k d_k) − F(x_k) > cα_k⟨∇F(x_k), d_k⟩.
Dividing both sides above by cα_k⟨∇F(x_k), d_k⟩ < 0 gives v(α_k) < 1. In turn, since c ∈ (0, 1), it follows that 1 − c·v(α_k) > 1 − c > 0 and (1 − c)/(1 − c·v(α_k)) < 1. Plugging this inequality into (4b), we obtain
ˆρ(v(α_k)) = max(ε, ρ(1 − c)/(1 − c·v(α_k))) < ρ,
since ε < ρ by assumption. Therefore, if (3) does not hold for a given tentative step-size, then Algorithms 1 and 2 multiply it by a factor of at most ρ to adjust it. Moreover, by Proposition 7, (3) is satisfied for all α_k ∈ (0, ᾱ), independently of x_k and d_k. Hence, if Algorithms 1 and 2 use α_0 as the initial step-size for the first iteration and α_k at iteration k + 1 for k ≥ 0, then at most ⌈log_ρ(ᾱ/α_0)⌉ + 1 adjustments are necessary until a step-size that is uniformly feasible is found. Each adjustment entails evaluating (3) once. In addition, (3) must be evaluated once every iteration. Therefore, (3) is evaluated at most ⌈log_ρ(ᾱ/α_0)⌉ + 1 + k times up to iteration k. Now, suppose Algorithms 1 and 2 use α_0 as the initial step-size in every iteration. Then, at most ⌈log_ρ(ᾱ/α_0)⌉ + 1 adjustments are necessary in every iteration until a feasible step-size is found. As before, each adjustment entails evaluating (3) once, in addition to the first evaluation. Therefore, (3) is evaluated at most k(⌈log_ρ(ᾱ/α_0)⌉ + 1) times up to iteration k.

Proposition 9 (step-size lower bounds). Let F be Lipschitz-smooth. Also, suppose the d_k are gradient related. Given appropriate inputs, Algorithm 2 with the choices specified by (4) and Algorithm 1 return a step-size α_k such that
α_k ≥ min{α_0, 2ρ(1 − c)c_1/(Lc_2²)}.

Proof. Since d_k is a descent direction, dividing both sides of (15) by α_k⟨∇F(x_k), d_k⟩ < 0 yields
c·v(α_k) = (F(x_k + α_k d_k) − F(x_k))/(α_k⟨∇F(x_k), d_k⟩) ≥ 1 − α_k L‖d_k‖²/(2⟨−∇F(x_k), d_k⟩).
Hence, if ˆρ is chosen as (4b) and (3) does not hold, then the step-sizes α_k returned by Algorithm 2 satisfy
ˆρ(v(α_k))α_k ≥ ρ(1 − c)/(1 − c·v(α_k))·α_k ≥ 2ρ(1 − c)⟨−∇F(x_k), d_k⟩/(L‖d_k‖²) ≥ 2ρ(1 − c)c_1/(Lc_2²).
Moreover, by Proposition 7, the greatest step-size for which (3) is guaranteed to hold is 2(1 − c)c_1/(Lc_2²). If α_0 ≥ 2(1 − c)c_1/(Lc_2²), then Algorithm 1 returns a step-size at least within a factor ρ of 2(1 − c)c_1/(Lc_2²).
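As a complement to Propositions 8 and 9, the sketch below (our illustration, not the authors' code; all names are ours) implements one call of the adaptive rule analyzed above, assuming v(α) = (F(x + αd) − F(x))/(cα⟨∇F(x), d⟩) and ˆρ(v) = max(ε, ρ(1 − c)/(1 − c·v)) as in (4).

```python
import numpy as np

def adaptive_armijo_backtracking(F, gradF, x, d, alpha0,
                                 rho=0.5, c=1e-4, eps=1e-2, max_iter=100):
    """Shrink alpha0 until the Armijo condition holds.

    Regular backtracking multiplies the step-size by the constant rho; here the
    factor is rho_hat(v) = max(eps, rho*(1 - c)/(1 - c*v)), which shrinks the
    step-size faster the more the Armijo condition is violated (v < 1 on failure).
    """
    g_dot_d = np.dot(gradF(x), d)   # negative for a descent direction
    Fx = F(x)
    alpha = alpha0
    for _ in range(max_iter):
        decrease = F(x + alpha * d) - Fx
        if decrease <= c * alpha * g_dot_d:      # Armijo condition (3) holds
            return alpha
        v = decrease / (c * alpha * g_dot_d)      # violation measure, v < 1 here
        alpha *= max(eps, rho * (1.0 - c) / (1.0 - c * v))
    return alpha

# Tiny usage example on a convex quadratic with a deliberately large alpha0.
F = lambda x: 0.5 * np.dot(x, x)
gradF = lambda x: x
x = np.array([3.0, -1.0])
d = -gradF(x)
print(adaptive_armijo_backtracking(F, gradF, x, d, alpha0=100.0))
```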
B.3.2 DESCENT LEMMA

First, note that the proximal operator p_{α_k} given by (6) is well-defined. Indeed, given a continuous convex function g, a point y_k and some α_k > 0, the map x ↦ g(x) + (1/(2α_k))‖x − (y_k − α_k∇f(y_k))‖² is continuous and strongly convex, and therefore admits a unique minimizer.

Proposition 10 (Lipschitz step-size feasibility). Let f be Lipschitz-smooth and convex and let g be continuous and convex. Also, suppose v and ˆρ are chosen as (8a) and (8b). If α_k ∈ (0, 1/L), then (7) holds for all y_k. If Algorithms 1 and 2 receive α_k as the initial step-size input in iteration k + 1, then they evaluate (7) at most ⌈log_ρ(1/(Lα_0))⌉ + 1 + k times up to iteration k. If, on the other hand, Algorithms 1 and 2 receive α_0 as the initial step-size input in every iteration, then they evaluate (7) at most k(⌈log_ρ(1/(Lα_0))⌉ + 1) times up to iteration k.

Proof. Given any y_k, if α_k ∈ (0, 1/L), then applying (5) with x = y_k and y = p_{α_k}(y_k), we get
f(p_{α_k}(y_k)) ≤ f(y_k) + ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩ + (L/2)‖p_{α_k}(y_k) − y_k‖²
≤ f(y_k) + ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩ + (1/(2α_k))‖p_{α_k}(y_k) − y_k‖².
Adding g(p_{α_k}(y_k)) to both sides, we recover (7). Thus, if α_k ∈ (0, 1/L), then (7) holds for all y_k.

Given an initial step-size α_k and the points y_k and p_{α_k}(y_k), Algorithm 1 checks whether (7) holds. If it does hold, then Algorithm 1 returns α_k; otherwise Algorithm 1 adjusts α_k by ρ, recomputes p_{α_k}(y_k), checks (7) again and repeats. Since (7) is guaranteed to hold for α_k ∈ (0, 1/L), given an initial step-size α_0, Algorithm 1 computes a feasible step-size after adjusting α_k at most ⌈log_ρ(1/(Lα_0))⌉ + 1 times. Each time Algorithm 1 adjusts α_k, it evaluates (7). In addition, Algorithm 1 evaluates (7) once every time it is called, to check whether the initial step-size is feasible. Hence, if Algorithm 1 receives α_k as the initial step-size input at iteration k + 1, then it evaluates (7) at most ⌈log_ρ(1/(Lα_0))⌉ + 1 + k times up to iteration k. On the other hand, if Algorithm 1 receives the same α_0 as the initial step-size input at every iteration, then it might have to adjust α_k up to ⌈log_ρ(1/(Lα_0))⌉ + 1 times in every iteration, therefore Algorithm 1 evaluates (7) at most k(⌈log_ρ(1/(Lα_0))⌉ + 1) times up to iteration k.

Now, consider Algorithm 2, with v and ˆρ chosen as (8a) and (8b). Given an initial step-size α_k and the points y_k and p_{α_k}(y_k), Algorithm 2 checks whether (7) holds. Suppose (7) does not hold. Then, moving the terms f(y_k) and ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩ to the left-hand side and cancelling g(p_{α_k}(y_k)) on both sides, we obtain
f(p_{α_k}(y_k)) − f(y_k) − ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩ > (1/(2α_k))‖p_{α_k}(y_k) − y_k‖².   (17)
Since α_k > 0, the left-hand side must be positive. So, dividing both sides by the left-hand side and using (8a), it follows that v(α_k) < 1. Hence, Algorithm 2 adjusts α_k to ˆρ(v(α_k))α_k < ρα_k. That is, the factor by which Algorithm 2 adjusts α_k is smaller than the factor by which Algorithm 1 adjusts α_k. Therefore, Algorithm 2 evaluates (7) at most as many times as Algorithm 1 does.
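As a concrete instance of the operator p_α analyzed above, the following sketch (illustrative only; it assumes g(x) = λ‖x‖_1, in which case p_α reduces to soft-thresholding of a gradient step, and the data A, b are made up) computes p_α(y) = argmin_x { g(x) + (1/(2α))‖x − (y − α∇f(y))‖² }.

```python
import numpy as np

def prox_l1(z, tau):
    """Proximal operator of tau*||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def p_alpha(y, alpha, grad_f, lam):
    """p_alpha(y) = argmin_x  lam*||x||_1 + (1/(2*alpha))*||x - (y - alpha*grad_f(y))||^2."""
    return prox_l1(y - alpha * grad_f(y), alpha * lam)

# Illustrative smooth part: f(x) = 0.5*||A x - b||^2 for random A, b.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)

y = np.zeros(5)
print(p_alpha(y, alpha=0.01, grad_f=grad_f, lam=0.1))
```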
Proposition 11. Let f be Lipschitz-smooth and convex and let g be continuous and convex. Also, suppose v and ˆρ are chosen as (8a) and (8b). If Algorithms 1 and 2 receive an initial step-size α_0 > 0, then they return a feasible step-size α_k such that α_k ≥ min{α_0, ρ/L}.

Proof. By Proposition 10, every step-size α_k ∈ (0, 1/L) is feasible. Hence, if α_0 is not feasible, then, since Algorithm 1 adjusts step-sizes by ρ, it must return a feasible step-size within a factor ρ of 1/L. Now, consider Algorithm 2, with v and ˆρ chosen as (8a) and (8b). Algorithm 2 only adjusts α_k when (7) does not hold, so suppose that is the case. Applying (5) with x = y_k and y = p_{α_k}(y_k) yields
f(p_{α_k}(y_k)) − f(y_k) − ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩ ≤ (L/2)‖p_{α_k}(y_k) − y_k‖².
Dividing both sides by (L/ρ)(f(p_{α_k}(y_k)) − f(y_k) − ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩), which is positive by (17), we obtain
ρ/L ≤ ρ‖p_{α_k}(y_k) − y_k‖²/(2(f(p_{α_k}(y_k)) − f(y_k) − ⟨∇f(y_k), p_{α_k}(y_k) − y_k⟩)) = ρv(α_k)α_k,
where the identity follows from (8a). Hence, Algorithm 2 adjusts α_k to ˆρ(v(α_k))α_k ≥ ρ/L.

B.4 CONVERGENCE RESULTS

B.4.1 A GENERAL CONVERGENCE RESULT FOR ADAPTIVE BACKTRACKING

Under mild conditions, we now show that lim_{k→+∞}‖∇f(x_k)‖² = 0 for iterates x_k of the form (2) with gradient related d_k and step-sizes generated by adaptive backtracking. We emphasize that the following results make no further assumptions on how the descent directions are generated and that (Nocedal & Wright, 2006, p. 40):

For line search methods of the general form (2), the limit lim_{k→+∞}‖∇f(x_k)‖² = 0 is the strongest global convergence result that can be obtained: we cannot guarantee that the method converges to a minimizer, but only that it is attracted by stationary points. Only by making additional requirements on the search direction d_k (by introducing negative curvature information from the Hessian ∇²f(x_k), for example) can we strengthen these results to include convergence to a local minimum.

Proposition 12. Let f be bounded below and Lipschitz-smooth on an open set containing the level set {x : f(x) ≤ f(x_0)}, where x_0 is the initial point of the iterates (2), where the d_k are gradient related and the α_k are generated by adaptive backtracking (Algorithm 2) with some α_0 > 0 and using ˆρ and v given by (4). Then, lim_{k→+∞}‖∇f(x_k)‖² = 0.

Proof. Under the above assumptions, we have that α_k ≥ min{α_0, ρᾱ}, where ᾱ = 2(1 − c)c_1/(Lc_2²), by Proposition 9. Moreover, we have that ⟨∇f(x_k), d_k⟩ ≤ −c_1‖∇f(x_k)‖², because the d_k are gradient related. Hence, since adaptive backtracking enforces the Armijo condition (3), it follows that
f(x_{k+1}) − f(x_k) ≤ −cα_k c_1‖∇f(x_k)‖² ≤ −cc_1 min{α_0, 2ρ(1 − c)c_1/(Lc_2²)}‖∇f(x_k)‖².
Telescoping the above difference, we get
f(x_{k+1}) − f(x_0) = Σ_{t=0}^{k}(f(x_{t+1}) − f(x_t)) ≤ −cc_1 min{α_0, 2ρ(1 − c)c_1/(Lc_2²)} Σ_{t=0}^{k}‖∇f(x_t)‖².
Rearranging the above inequality and using the assumption that f is bounded below, we obtain
cc_1 min{α_0, 2ρ(1 − c)c_1/(Lc_2²)} Σ_{t=0}^{k}‖∇f(x_t)‖² ≤ f(x_0) − f(x_{k+1}) < +∞.
That is, the gradient norms ‖∇f(x_k)‖ are square-summable. Therefore, it follows that lim_{k→+∞}‖∇f(x_k)‖² = 0.
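For intuition on Proposition 12, here is a minimal gradient descent loop (our sketch; the smooth nonconvex test objective and all names are illustrative) in which each step-size is produced by the adaptive rule (4) in memoryless fashion; the printed gradient norms decay toward zero, as the proposition guarantees.

```python
import numpy as np

def backtrack(F, gradF, x, alpha0, rho=0.5, c=1e-4, eps=1e-2):
    """Adaptive Armijo backtracking along the steepest-descent direction d = -gradF(x)."""
    g = gradF(x)
    d = -g
    gd = np.dot(g, d)
    alpha = alpha0
    while F(x + alpha * d) - F(x) > c * alpha * gd:
        v = (F(x + alpha * d) - F(x)) / (c * alpha * gd)
        alpha *= max(eps, rho * (1.0 - c) / (1.0 - c * v))
    return alpha, d

# Illustrative smooth, bounded-below, nonconvex objective: F(x) = 0.5*||x||^2 + 2*cos(x_1).
F = lambda x: 0.5 * np.dot(x, x) + 2.0 * np.cos(x[0])
gradF = lambda x: x + np.array([-2.0 * np.sin(x[0]), 0.0])

x = np.array([2.0, -3.0])
for k in range(30):
    alpha, d = backtrack(F, gradF, x, alpha0=10.0)  # memoryless: same alpha0 every iteration
    x = x + alpha * d
    if k % 5 == 0:
        print(k, np.linalg.norm(gradF(x)))
```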
B.4.2 CONVERGENCE RESULTS FOR GRADIENT DESCENT

We show that the standard convergence results for gradient descent are preserved if the step-sizes are generated by adaptive backtracking. We address smooth and then smooth strongly convex objectives.

Proposition 13. Let f be convex and Lipschitz-smooth, and suppose ∇f(x*) = 0 for some x*. If the step-sizes α_k of gradient descent (Algorithm 3) are chosen by adaptive backtracking (Algorithm 2) using ˆρ and v given by (4) with c ∈ [1/2, 1) and α_0 > 0, then α_k ≥ min{α_0, ρᾱ}, where ᾱ = 2(1 − c)/L, and
f(x_k) − f(x*) ≤ ‖x_0 − x*‖²/(2 min{α_0, ρᾱ}k).

Proof. Under the above assumptions, all iterates of gradient descent satisfy the Armijo condition (3). Moreover, since f is convex, we have that f(x_k) ≤ f(x*) + ⟨∇f(x_k), x_k − x*⟩. Hence, combining the above inequality with (3), it follows that
f(x_{k+1}) ≤ f(x_k) − cα_k‖∇f(x_k)‖² ≤ f(x*) + ⟨∇f(x_k), x_k − x*⟩ − cα_k‖∇f(x_k)‖².
In turn, since c ≥ 1/2, rearranging the above inequality and completing a square, we get
f(x_{k+1}) − f(x*) ≤ (1/(2α_k))(2α_k⟨∇f(x_k), x_k − x*⟩ − α_k²‖∇f(x_k)‖²)
= (1/(2α_k))(2α_k⟨∇f(x_k), x_k − x*⟩ − α_k²‖∇f(x_k)‖² − ‖x_k − x*‖² + ‖x_k − x*‖²)
= (1/(2α_k))(‖x_k − x*‖² − ‖x_k − α_k∇f(x_k) − x*‖²)
= (1/(2α_k))(‖x_k − x*‖² − ‖x_{k+1} − x*‖²).
Now, since gradient descent sets d_k = −∇f(x_k), the directions d_k are gradient related with c_1 = c_2 = 1. Moreover, since f is Lipschitz-smooth, α_k ≥ min{α_0, ρᾱ}, where ᾱ = 2(1 − c)/L, by Proposition 9. Plugging this lower bound into the above inequality, it follows that
f(x_{k+1}) − f(x*) ≤ (1/(2 min{α_0, ρᾱ}))(‖x_k − x*‖² − ‖x_{k+1} − x*‖²).
Telescoping the above, we get
Σ_{t=0}^{k}(f(x_{t+1}) − f(x*)) ≤ (1/(2 min{α_0, ρᾱ})) Σ_{t=0}^{k}(‖x_t − x*‖² − ‖x_{t+1} − x*‖²) = (‖x_0 − x*‖² − ‖x_{k+1} − x*‖²)/(2 min{α_0, ρᾱ}) ≤ ‖x_0 − x*‖²/(2 min{α_0, ρᾱ}).
Since ∇f(x*) = 0 and f is convex, we have that f(x_{k+1}) − f(x*) ≥ 0. Moreover, the values f(x_k) are decreasing because the Armijo condition holds in every iteration. Therefore,
f(x_{k+1}) − f(x*) ≤ ‖x_0 − x*‖²/(2 min{α_0, ρᾱ}(k + 1)).

Next, we show that adaptive backtracking also preserves the convergence rate of gradient descent on strongly convex objectives, which we define below.

Definition 7 (Strong convexity). A continuously differentiable function f is said to be strongly convex if there exists some m > 0 such that for every x and y
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (m/2)‖y − x‖².   (18)

Proposition 14. Let f be Lipschitz-smooth and strongly convex. If the step-sizes α_k of gradient descent (Algorithm 3) are chosen by adaptive backtracking (Algorithm 2) using ˆρ and v given by (4) with c ∈ [1/2, 1) and α_0 ∈ (0, 1/m), then
f(x_k) − f(x*) ≤ (1 − m min{α_0, ρᾱ})^k ((L + m)/2)‖x_0 − x*‖².
In particular, if c = 1/2 and α_0 > ρ/L, then
f(x_k) − f(x*) ≤ (1 − ρq)^k ((L + m)/2)‖x_0 − x*‖²,
where q = m/L is the reciprocal of the condition number of f.

Proof. Let L and m denote the Lipschitz-smoothness and strong convexity constants of f. The assumption that f is strongly convex implies the existence of a unique global minimizer x* of f. We then use x* to define a Lyapunov function V, given by
V(x_k) = f(x_k) − f(x*) + (m/2)‖x_k − x*‖²,
which is positive for x_k ≠ x*. To prove the result, we show that (1 + δ_k)V(x_{k+1}) − V(x_k) ≤ 0, where δ_k = 1/(Q_k − 1), Q_k = L_k/m and L_k = 1/α_k. We also note that the assumption α_k ≤ α_0 < 1/m implies L_k > m, thus the δ_k are well-defined. By assumption, the iterates of gradient descent satisfy (3) with c ∈ [1/2, 1), hence
f(x_{k+1}) − f(x_k) ≤ −cα_k‖∇f(x_k)‖² ≤ −(α_k/2)‖∇f(x_k)‖².
Moreover, by strong convexity, we have that
f(x_k) − f(x*) ≤ ⟨∇f(x_k), x_k − x*⟩ − (m/2)‖x_k − x*‖².
Next, expanding quadratic terms, it follows that
(1 + δ_k)‖x_{k+1} − x*‖² − ‖x_k − x*‖² = (1 + δ_k)(α_k²‖∇f(x_k)‖² − 2α_k⟨∇f(x_k), x_k − x*⟩) + δ_k‖x_k − x*‖².
Now, from the definition of δ_k, we obtain
(1 + δ_k)(1 − mα_k) = (Q_k/(Q_k − 1))((Q_k − 1)/Q_k) = 1 and (1 + δ_k)mα_k = (Q_k/(Q_k − 1))(1/Q_k) = δ_k.
Then, we put everything together to get
(1 + δ_k)V(x_{k+1}) − V(x_k) ≤ −(1 + δ_k)(1 − mα_k)(α_k/2)‖∇f(x_k)‖² + (δ_k − (1 + δ_k)mα_k)⟨∇f(x_k), x_k − x*⟩ = −(α_k/2)‖∇f(x_k)‖² ≤ 0.
Applying the above inequality inductively, it follows that
V(x_{k+1}) ≤ V(x_0) Π_{t=0}^{k} 1/(1 + δ_t).
Moreover, applying (5) with y = x_0 and x = x*, and noticing that ∇f(x*) = 0, we obtain
V(x_0) = f(x_0) − f(x*) + (m/2)‖x_0 − x*‖² ≤ ((L + m)/2)‖x_0 − x*‖².
Furthermore, under the above assumptions, we have that α_k ≥ min{α_0, ρᾱ}, where ᾱ = 2(1 − c)/L, which implies that
1 + δ_k = Q_k/(Q_k − 1) = 1/(1 − mα_k) ≥ 1/(1 − m min{α_0, ρᾱ}).
Finally, we put everything together and obtain
f(x_k) − f(x*) ≤ V(x_k) ≤ (1 − m min{α_0, ρᾱ})^k V(x_0) ≤ (1 − m min{α_0, ρᾱ})^k ((L + m)/2)‖x_0 − x*‖².

B.4.3 A CONVERGENCE RESULT FOR ACCELERATED GRADIENT DESCENT

To establish that adaptive backtracking preserves the convergence rate of accelerated gradient descent, we employ a Lyapunov argument based on the functions V_t defined by
V_t(x_k, y_k) = f(y_k) − f(x*) + (m/2)‖z_t − x*‖²,   (19)
where the point z_t = z_t(x_k, y_k) is defined as
z_t = x_k + √(Q_{t−1})(x_k − y_k),   (20)
and the estimated condition number Q_t and the estimated Lipschitz constant L_t are given by
Q_t = L_0/m for t < 0, Q_t = L_t/m for t ≥ 0, and L_t = 1/α_t.
Note that the index t of z_t follows that of V_t but is independent of the indices of x_k and y_k, which allows us to split the Lyapunov analysis into two auxiliary lemmas.
First, we show that, for a fixed index k + 1, the Lyapunov function V_{k+1} decreases along consecutive AGD iterates at an accelerated rate. Second, we bound by how much V_{k+1} can increase with respect to V_k for the same AGD iterate.

Lemma 1. Let f be Lipschitz-smooth and strongly convex. If the Lipschitz constant estimates L_k of accelerated gradient descent (Algorithm 4) are generated by adaptive backtracking (Algorithm 2) using ˆρ and v given by (4) with c ∈ [1/2, 1) and L_0 > m, then for k ≥ 0
(1 + δ_{k+1})V_{k+1}(y_{k+1}, x_{k+1}) − V_{k+1}(y_k, x_k) ≤ 0, where δ_{k+1} = 1/(√(Q_k) − 1).

Proof. We start by splitting (1 + δ_{k+1})(f(y_{k+1}) − f(x*)) into three further differences:
(1 + δ_{k+1})(f(y_{k+1}) − f(x*)) − (f(y_k) − f(x*)) = (1 + δ_{k+1})(f(y_{k+1}) − f(x_k)) + δ_{k+1}(f(x_k) − f(x*)) + (f(x_k) − f(y_k)).
Since c ∈ [1/2, 1), adaptive backtracking generates L_k such that
(1 + δ_{k+1})(f(y_{k+1}) − f(x_k)) ≤ −(1 + δ_{k+1})(1/(2L_k))‖∇f(x_k)‖².   (22)
Moreover, applying (18) with x = x_k and y = x*, and using that f is convex, we get
δ_{k+1}(f(x_k) − f(x*)) ≤ δ_{k+1}⟨∇f(x_k), x_k − x*⟩ − δ_{k+1}(m/2)‖x_k − x*‖²,   (23)
f(x_k) − f(y_k) ≤ ⟨∇f(x_k), x_k − y_k⟩.   (24)
Next, we express the difference z_{k+1} − x* as
z_{k+1} − x* = x_{k+1} + √(Q_k)(x_{k+1} − y_{k+1}) − x*
= y_{k+1} + β_k(y_{k+1} − y_k) + √(Q_k)β_k(y_{k+1} − y_k) − x*
= −(1/L_k)(1 + β_k(1 + √(Q_k)))∇f(x_k) + β_k(1 + √(Q_k))(x_k − y_k) + x_k − x*
= −(√(Q_k)/L_k)∇f(x_k) + (√(Q_k) − 1)(x_k − y_k) + x_k − x*,
where we used the identities 1 + β_k(1 + √(Q_k)) = √(Q_k) and β_k(1 + √(Q_k)) = √(Q_k) − 1. In the same vein, when expanding the 2-norm term ‖z_{k+1} − x*‖² below, we use the following identities after the colons to simplify the coefficients of the terms before the colons:
‖∇f(x_k)‖²: (Q_k/L_k²)(m/2) = 1/(2L_k),
⟨∇f(x_k), x_k − y_k⟩: m(1 + δ_{k+1})√(Q_k)(√(Q_k) − 1)/L_k = 1,
⟨∇f(x_k), x_k − x*⟩: m(1 + δ_{k+1})√(Q_k)/L_k = δ_{k+1},
‖x_k − y_k‖²: (1 + δ_{k+1})(√(Q_k) − 1)² = √(Q_k)(√(Q_k) − 1),
⟨x_k − y_k, x_k − x*⟩: (1 + δ_{k+1})(√(Q_k) − 1) = √(Q_k).
As a result, the 2-norm difference in (1 + δ_{k+1})V_{k+1}(y_{k+1}, x_{k+1}) − V_{k+1}(y_k, x_k) becomes
(1 + δ_{k+1})(m/2)‖z_{k+1} − x*‖² − (m/2)‖x_k − x* + √(Q_k)(x_k − y_k)‖²
= (1 + δ_{k+1})(1/(2L_k))‖∇f(x_k)‖² − ⟨∇f(x_k), x_k − y_k⟩ − δ_{k+1}⟨∇f(x_k), x_k − x*⟩ + (m/2)(√(Q_k)(√(Q_k) − 1)‖x_k − y_k‖² + 2√(Q_k)⟨x_k − y_k, x_k − x*⟩ + (1 + δ_{k+1})‖x_k − x*‖²) − (m/2)(Q_k‖x_k − y_k‖² + 2√(Q_k)⟨x_k − y_k, x_k − x*⟩ + ‖x_k − x*‖²)
= (1 + δ_{k+1})(1/(2L_k))‖∇f(x_k)‖² − ⟨∇f(x_k), x_k − y_k⟩ − δ_{k+1}⟨∇f(x_k), x_k − x*⟩ − (m/2)√(Q_k)‖x_k − y_k‖² + δ_{k+1}(m/2)‖x_k − x*‖².   (25)
Finally, combining (22) to (25) and then canceling several terms, we obtain
(1 + δ_{k+1})V_{k+1}(y_{k+1}, x_{k+1}) − V_{k+1}(y_k, x_k) ≤ −(m/2)√(Q_k)‖x_k − y_k‖² ≤ 0.

Lemma 2. Let f be Lipschitz-smooth and strongly convex. Given initial points x_0 = y_0, if the estimates L_k of the Lipschitz constant in accelerated gradient descent (Algorithm 4) are generated monotonically by adaptive backtracking (Algorithm 2 with L_k serving as the initial estimate for L_{k+1}) using ˆρ and v given by (4) with c ∈ [1/2, 1) and L_0 > m, then for k ≥ 0
V_{k+1}(y_k, x_k) ≤ (Q_k²/Q_{k−1}²)V_k(y_k, x_k).

Proof. We argue by induction. If x_0 and y_0 match, then
z_1(y_0, x_0) = x_0 + √(Q_0)(x_0 − y_0) = x_0 = x_0 + √(Q_{−1})(x_0 − y_0) = z_0(y_0, x_0).
Moreover, Q_{−1} = Q_0, by definition. Therefore, we have that
V_1(y_0, x_0) = f(y_0) − f(x*) + (m/2)‖z_1(y_0, x_0) − x*‖² = (Q_0²/Q_{−1}²)(f(y_0) − f(x*) + (m/2)‖z_0(y_0, x_0) − x*‖²) = (Q_0²/Q_{−1}²)V_0(y_0, x_0),
which establishes the base case. To prove the inductive step, we divide the analysis into two cases, each representing a possible sign of ⟨x_k − y_k, x_k − x*⟩. For each case, we bound
‖x_k − x* + √(Q_k)(x_k − y_k)‖² − ‖z_k − x*‖² = 2(√(Q_k) − √(Q_{k−1}))⟨x_k − x*, x_k − y_k⟩ + (Q_k − Q_{k−1})‖x_k − y_k‖².   (26)
In turn, bounds on (26) translate into bounds on V_{k+1}(y_k, x_k) − V_k(y_k, x_k), since
V_{k+1}(y_k, x_k) − V_k(y_k, x_k) = (m/2)(‖x_k − x* + √(Q_k)(x_k − y_k)‖² − ‖z_k − x*‖²).   (27)
Then, to prove the inductive step, we express bounds on (27) in terms of V_{k+1} and V_k. First, suppose ⟨x_k − y_k, x_k − x*⟩ ≥ 0. Also, assuming L_k ≥ L_{k−1}, we have √(Q_{k−1})/√(Q_k) ≤ 1, so that
√(Q_k) − √(Q_{k−1}) = (Q_k − √(Q_{k−1}Q_k))/√(Q_k) ≤ (Q_k − Q_{k−1})/√(Q_k).
Hence, applying the above inequality to (26) and then adding a nonnegative ‖x_k − x*‖² term to it, we get
‖x_k − x* + √(Q_k)(x_k − y_k)‖² − ‖z_k − x*‖²
≤ 2((Q_k − Q_{k−1})/√(Q_k))⟨x_k − x*, x_k − y_k⟩ + (Q_k − Q_{k−1})‖x_k − y_k‖² + ((Q_k − Q_{k−1})/Q_k)‖x_k − x*‖²
= ((Q_k − Q_{k−1})/Q_k)‖x_k − x* + √(Q_k)(x_k − y_k)‖².   (28)
Plugging (28) back into (27) yields
V_{k+1}(y_k, x_k) − V_k(y_k, x_k) ≤ ((Q_k − Q_{k−1})/Q_k)(m/2)‖x_k − x* + √(Q_k)(x_k − y_k)‖² ≤ ((Q_k − Q_{k−1})/Q_k)V_{k+1}(y_k, x_k),   (29)
where the last inequality follows from the definition of V_{k+1}, as f(y_k) − f(x*) ≥ 0 implies
V_{k+1}(y_k, x_k) ≥ (m/2)‖x_k − x* + √(Q_k)(x_k − y_k)‖².   (30)
Thus, rearranging terms in (29) and then multiplying both sides by Q_k/Q_{k−1}, we obtain
V_{k+1}(y_k, x_k) ≤ (Q_k/Q_{k−1})V_k(y_k, x_k) ≤ (Q_k²/Q_{k−1}²)V_k(y_k, x_k),
where the second inequality holds because Q_k/Q_{k−1} ≥ 1.

Now, suppose ⟨x_k − y_k, x_k − x*⟩ < 0. As in the previous case, we start by bounding the gap (26). But given the negative sign of the ⟨x_k − y_k, x_k − x*⟩ term, we bound the ‖x_k − y_k‖² term instead. To this end, we first invoke the assumption that ⟨x_k − y_k, x_k − x*⟩ < 0 to establish that
‖y_k − x*‖² = ‖x_k − x* − (x_k − y_k)‖² = ‖x_k − x*‖² − 2⟨x_k − x*, x_k − y_k⟩ + ‖x_k − y_k‖² ≥ ‖x_k − x*‖².   (31)
To use the above inequality on (26), first we rewrite (26) more conveniently as
‖x_k − x* + √(Q_k)(x_k − y_k)‖² − ‖z_k − x*‖²
= 2((√(Q_k) − √(Q_{k−1}))/√(Q_k))⟨x_k − x*, √(Q_k)(x_k − y_k)⟩ + (√(Q_k) − √(Q_{k−1}))(√(Q_k) + √(Q_{k−1}))‖x_k − y_k‖²
= ((√(Q_k) − √(Q_{k−1}))/√(Q_k))‖x_k − x* + √(Q_k)(x_k − y_k)‖² + √(Q_{k−1})(√(Q_k) − √(Q_{k−1}))‖x_k − y_k‖² − ((√(Q_k) − √(Q_{k−1}))/√(Q_k))‖x_k − x*‖².   (32)
Next, we use the following elementary inequality, which is a consequence of ‖a/c + bc‖² ≥ 0:
‖a − b‖² = ‖a‖² − 2⟨a, b⟩ + ‖b‖² ≤ (1 + 1/c²)‖a‖² + (1 + c²)‖b‖².
Namely, we apply the above inequality with a = z_k − x*, b = x_k − x* and c² = √(Q_{k−1})/√(Q_k) to bound the ‖x_k − y_k‖² term in (32) and obtain
√(Q_{k−1})(√(Q_k) − √(Q_{k−1}))‖x_k − y_k‖² = ((√(Q_k) − √(Q_{k−1}))/√(Q_{k−1}))‖z_k − x* − (x_k − x*)‖²
≤ ((√(Q_k) − √(Q_{k−1}))/√(Q_{k−1}))((1 + √(Q_k/Q_{k−1}))‖z_k − x*‖² + (1 + √(Q_{k−1}/Q_k))‖x_k − x*‖²).   (33)
Plugging (33) back into (32) and then using (31), we get
‖x_k − x* + √(Q_k)(x_k − y_k)‖² − ‖z_k − x*‖²
≤ ((√(Q_k) − √(Q_{k−1}))/√(Q_k))‖x_k − x* + √(Q_k)(x_k − y_k)‖² + ((Q_k − Q_{k−1})/Q_{k−1})‖z_k − x*‖² + (√(Q_k/Q_{k−1}) − 1)‖x_k − x*‖²
≤ ((√(Q_k) − √(Q_{k−1}))/√(Q_k))‖x_k − x* + √(Q_k)(x_k − y_k)‖² + ((Q_k − Q_{k−1})/Q_{k−1})‖z_k − x*‖² + (√(Q_k/Q_{k−1}) − 1)‖y_k − x*‖².   (34)
In turn, plugging (34) back into (27) yields
V_{k+1}(y_k, x_k) − V_k(y_k, x_k) ≤ ((√(Q_k) − √(Q_{k−1}))/√(Q_k))(m/2)‖x_k − x* + √(Q_k)(x_k − y_k)‖² + ((Q_k − Q_{k−1})/Q_{k−1})(m/2)‖z_k − x*‖² + (√(Q_k/Q_{k−1}) − 1)(m/2)‖y_k − x*‖².   (35)
Now, as in (30), the fact that f(y_k) − f(x*) ≥ 0 implies
V_k(y_k, x_k) ≥ (m/2)‖z_k − x*‖².   (36)
In the same vein, applying (18) with x = x* and y = y_k to the definition of V_k, we obtain
V_k(y_k, x_k) ≥ (m/2)‖y_k − x*‖².   (37)
Plugging (30), (36) and (37) back into (35), and then moving all V_{k+1}(y_k, x_k) terms to the left-hand side and all V_k(y_k, x_k) terms to the right-hand side, we obtain
(√(Q_{k−1})/√(Q_k))V_{k+1}(y_k, x_k) ≤ (Q_k/Q_{k−1} + √(Q_k/Q_{k−1}) − 1)V_k(y_k, x_k).   (38)
Multiplying both sides of (38) by √(Q_k)/√(Q_{k−1}), and then using the fact that Q_k ≥ Q_{k−1}, yields
V_{k+1}(y_k, x_k) ≤ (√(Q_k)/√(Q_{k−1}))(Q_k/Q_{k−1} + √(Q_k/Q_{k−1}) − 1)V_k(y_k, x_k) ≤ (Q_k²/Q_{k−1}²)V_k(y_k, x_k),
where the last inequality above holds because Q_k ≥ Q_{k−1} implies that the following equivalent inequalities hold:
Q_k/Q_{k−1} + √(Q_k/Q_{k−1}) − 1 ≤ Q_k^{3/2}/Q_{k−1}^{3/2},
√(Q_{k−1})Q_k + Q_{k−1}(√(Q_k) − √(Q_{k−1})) ≤ Q_k^{3/2},
Q_{k−1}(√(Q_k) − √(Q_{k−1})) ≤ Q_k(√(Q_k) − √(Q_{k−1})).
Therefore, both when ⟨x_k − y_k, x_k − x*⟩ ≥ 0 and when ⟨x_k − y_k, x_k − x*⟩ < 0, the inequality V_{k+1}(y_k, x_k) ≤ (Q_k²/Q_{k−1}²)V_k(y_k, x_k) holds for all y_k and x_k, proving the lemma.

Proposition 15. Let f be Lipschitz-smooth and strongly convex. Given initial points x_0 = y_0, if the estimates L_k of the Lipschitz constant in accelerated gradient descent (Algorithm 4) are generated monotonically by adaptive backtracking (Algorithm 2 with L_k serving as the initial estimate for L_{k+1}) using ˆρ and v given by (4) with c ∈ [1/2, 1) and L_0 > m, then for k ≥ 0
f(y_{k+1}) − f(x*) ≤ (1 − √(2(1 − c)ρ/Q))^k (L²/(4(1 − c)²ρ²L_0²))((L + m)/2)‖x_0 − x*‖², where Q = L/m.

Proof.
Combining Lemmas 1 and 2, we have that for every k ≥ 0
V_{k+1}(y_{k+1}, x_{k+1}) ≤ (1/(1 + δ_{k+1}))V_{k+1}(y_k, x_k) ≤ (1/(1 + δ_{k+1}))(Q_k²/Q_{k−1}²)V_k(y_k, x_k).
Moreover, from Proposition 9 and the assumption that L_0 > m, it follows that L_k ≤ max{L_0, L/(2(1 − c)ρ)} and, in turn, we obtain
1/(1 + δ_{k+1}) = (√(Q_k) − 1)/√(Q_k) ≤ 1 − √(2(1 − c)ρ/Q), where Q = L/m.
Furthermore, assuming y_0 = x_0, we have that
V_0(y_0, x_0) = f(y_0) − f(x*) + (m/2)‖z_0 − x*‖² ≤ ((L + m)/2)‖x_0 − x*‖².
Arguing inductively, all factors but Q_k² and Q_{−1}² = Q_0² cancel and, since L_0 > m, we get
f(y_{k+1}) − f(x*) ≤ V_{k+1}(y_{k+1}, x_{k+1}) ≤ (1 − √(2(1 − c)ρ/Q))^k (L_k²/L_0²)V_0(y_0, x_0) ≤ (1 − √(2(1 − c)ρ/Q))^k (L²/(4(1 − c)²ρ²L_0²))((L + m)/2)‖x_0 − x*‖².

In this subsection, we briefly state standard implementations of the base methods that we use in the paper. For the sake of simplicity, we only state a single iteration of the corresponding method. Algorithms 3 and 4 summarize gradient descent and Nesterov's accelerated gradient descent in the formulation with a constant momentum coefficient (Nesterov, 2018, 2.2.22). Algorithm 5 summarizes Adagrad (Duchi et al., 2011). To state the last base method that we consider in this paper, we must introduce an auxiliary operator. Given a convex Lipschitz-smooth function f with Lipschitz constant L and a continuous convex function g, the proximal operator p_L is defined by
p_L(y) := arg min_x { g(x) + (L/2)‖x − (y − (1/L)∇f(y))‖² }.
With this definition, we can state Algorithm 6, which summarizes FISTA (Beck & Teboulle, 2009).

Algorithm 3 Gradient Descent.
Input: x_k, ∇f(x_k), α_k > 0
Output: x_{k+1}
1: x_{k+1} ← x_k − α_k∇f(x_k)

Algorithm 4 Nesterov's accelerated gradient descent (Nesterov, 2018, 2.2.22).
Input: x_k, y_k, ∇f(x_k), L_k > m > 0
Output: x_{k+1}, y_{k+1}
1: y_{k+1} ← x_k − (1/L_k)∇f(x_k)
2: β_k ← (√(L_k) − √m)/(√(L_k) + √m)
3: x_{k+1} ← (1 + β_k)y_{k+1} − β_k y_k

Algorithm 5 Adagrad (Duchi et al., 2011). Superscript i means the i-th entry of the vector.
Input: x_k, ∇f(x_k), s_k, α_k > 0
Output: x_{k+1}, s_{k+1}
1: s^i_{k+1} ← s^i_k + (∇f(x_k)^i)²
2: x^i_{k+1} ← x^i_k − (α_k/√(s^i_{k+1}))∇f(x_k)^i

Algorithm 6 FISTA (Beck & Teboulle, 2009).
Input: x_k, y_k, t_k, ∇f(y_k)
Output: x_{k+1}, y_{k+1}, t_{k+1}
1: x_{k+1} ← p_L(y_k)
2: t_{k+1} ← (1 + √(1 + 4t_k²))/2
3: y_{k+1} ← x_{k+1} + ((t_k − 1)/t_{k+1})(x_{k+1} − x_k)
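To complement Algorithm 6, here is a minimal Python sketch of one FISTA iteration (our illustration, assuming g(x) = λ‖x‖_1, a fixed Lipschitz estimate L, and a synthetic least-squares problem; all names are ours).

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_step(x_k, y_k, t_k, grad_f, L, lam):
    """One FISTA iteration (cf. Algorithm 6): prox-gradient step, then momentum update."""
    x_next = soft_threshold(y_k - grad_f(y_k) / L, lam / L)    # x_{k+1} = p_L(y_k)
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t_k ** 2)) / 2.0        # t_{k+1}
    y_next = x_next + ((t_k - 1.0) / t_next) * (x_next - x_k)   # y_{k+1}
    return x_next, y_next, t_next

# Illustrative problem: f(x) = 0.5*||A x - b||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad_f
lam = 0.1

x = y = np.zeros(10)
t = 1.0
for _ in range(200):
    x, y, t = fista_step(x, y, t, grad_f, L, lam)
print(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```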
D LOGISTIC REGRESSION EXPERIMENTS

In this section, we provide further details of the logistic regression experiments and present full plots of all runs.

Table 4: Details of datasets and method precisions used in the logistic regression problem.
dataset | datapoints | dimensions | AGD | GD | GD (monotone) | Adagrad
a9a | 32561 | 123 | 10^-9 | 10^-6 | 10^-5 | 10^-6
gisette_scale | 6000 | 5000 | 10^-9 | 10^-9 | 10^-5 | 10^-9
MNIST | 60000 | 784 | 10^-9 | 10^-6 | 10^-3 | 10^-9
mushrooms | 8124 | 112 | 10^-9 | 10^-9 | 10^-5 | 10^-9
phishing | 11055 | 68 | 10^-9 | 10^-9 | 10^-6 | 10^-6
protein | 102025 | 75 | 10^-9 | 10^-9 | 10^-5 | 10^-9
web-1 | 2477 | 300 | 10^-9 | 10^-9 | 10^-8 | 10^-9

Dataset details. We take observations from seven datasets: A9A, GISETTE_SCALE (G_SCALE), MUSHROOMS, PHISHING and WEB-1 from LIBSVM (Chang & Lin, 2011), PROTEIN from KDD Cup 2004 (Caruana et al., 2004) and MNIST (LeCun et al., 1998). The dataset A9A is a preprocessed version of the ADULT dataset (Becker & Kohavi, 1996), while WEB-1 is a subsample of the WEB dataset (Platt, 1998).

Initialization details. For Lipschitz-smooth problems, a step-size of 1/L̂ is guaranteed to satisfy the Armijo condition (with c = 1/2) whenever L̂ ≥ L. Accordingly, we consider four choices of initial step-sizes, α_0 ∈ {10^1, 10^2, 10^3, 10^4}/L̂, which capture the transition from initial step-sizes that do not require adjustments to satisfy the Armijo condition to step-sizes that do. In practice, L is unknown and the transition would occur as one attempted an arbitrary initial step-size and adjusted it accordingly until the line search was activated. Hence, using L̂ to anchor the choice of initial step-sizes is merely an educated guess of the transition values that would be found in practice. We adopt the standard choice c = 10^-4 (Nocedal & Wright, 2006, p. 62) in (3) for BLS used with GD and Adagrad but, motivated by both theory and practice, we choose c = 1/2 in the case of AGD. Also, we use the regularization parameter γ as the strong convexity parameter input for AGD.

Evaluation details. We run all base methods and their variants long enough to produce solutions with the designated precision; then, we count the number of gradient and function evaluations and the elapsed time each variant takes to produce that solution. Finally, for each BLS variant we average those numbers over the four initial step-sizes considered. All methods compute exactly one gradient per iteration. To account for elapsed time, we record wall-clock time after every iteration. Although somewhat imprecise, elapsed time reflects the relative computational cost of gradient and function evaluations and, especially in larger problems, is a reasonable metric to compare performance.

Additional comments. We make the following additional remarks and observations:
- We considered two ways to initialize the step-size for line search at each iteration: (1) using the step-size from the previous iteration and (2) using the same fixed step-size at every iteration. We refer to the corresponding line search subroutines as monotone and memoryless. The monotone variants are robust to every choice of ρ, while some values of ρ may turn the memoryless variants of AGD unstable or unacceptably slow. When the memoryless variants work, however, they generally work much better than the corresponding monotone variants and the baseline methods.
- Monotone backtracking is not as appealing as memoryless backtracking because, although both variants take fewer iterations than the baseline method to reach a given precision, the savings in iterations generated by the monotone variants are not enough to outweigh the additional computational cost of the function evaluations that these variants accrue. Therefore, we only report results for the memoryless variant in the main text and defer results for the monotone variant to Appendix D.1.
- The initial step-sizes greatly impact performance. For some starting step-sizes, vanilla backtracking is better suited for finding the optimal solution than our adaptive method. However, we find that there tends to be more variance in the performance of vanilla backtracking.
- When L̂ is a good estimate of the true Lipschitz constant, the computational cost of function evaluations may outweigh the savings in gradient evaluations, and even memoryless backtracking might not improve on the baseline method. This is the case for the COVTYPE dataset from LIBSVM (Chang & Lin, 2011), as shown by Fig. 19b in Appendix D.2.
- The stable values of ρ for the adaptive counterpart of AGD lie in the upper interval (0.7, 1), and usually greater values of ρ make the adaptive variant more stable but also more computationally expensive. AGD with regular memoryless backtracking fails to consistently converge for values of ρ outside the interval (0.3, 0.5). In fact, on COVTYPE, for at least one of the initial step-sizes, AGD with regular memoryless backtracking line search fails to converge. On the other hand, as shown in Fig. 19b in Appendix D.2, the adaptive variant converges for ρ = 0.9 and even for ρ = 0.7, the more unstable end of feasible ρ values.
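For reference, the objective in these experiments is a regularized logistic loss; the sketch below is our illustrative implementation, assuming labels in {−1, +1} and an ℓ2 regularizer with parameter γ (our reading of the γ mentioned above; the paper's exact formulation may differ in details), on synthetic stand-in data rather than the datasets of Table 4.

```python
import numpy as np

def logistic_objective(w, X, y, gamma):
    """F(w) = (1/n) * sum_i log(1 + exp(-y_i * <x_i, w>)) + (gamma/2) * ||w||^2."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * gamma * np.dot(w, w)

def logistic_gradient(w, X, y, gamma):
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-m)) w.r.t. m is -1/(1+exp(m))
    return X.T @ coeffs / X.shape[0] + gamma * w

# Tiny synthetic example (stand-in for the LIBSVM datasets listed in Table 4).
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100))
w = np.zeros(5)
print(logistic_objective(w, X, y, gamma=1e-3),
      np.linalg.norm(logistic_gradient(w, X, y, gamma=1e-3)))
```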
D.1 MONOTONE BACKTRACKING LINE SEARCH

[Figure 8 (panels include (b) GISETTE_SCALE and (d) MUSHROOMS): Logistic regression on four different datasets and four initial step-sizes α_0 = {10^1, 10^2, 10^3, 10^4}/L̂: suboptimality gap for GD, GD with standard backtracking line search using ρ ∈ {0.5, 0.7, 0.9, 0.95} and GD with adaptive memoryless backtracking line search using ρ = 0.9. The light gray horizontal dashed line shows the precision used to compute performance for each dataset.]

[Figure 9 (panels include (a) PHISHING and (b) PROTEIN): as in Figure 8, for the remaining datasets.]

[Figure 10 (panels include (b) GISETTE_SCALE and (d) MUSHROOMS): as in Figure 8, but for AGD: suboptimality gap for AGD, AGD with standard backtracking line search using ρ ∈ {0.5, 0.7, 0.9, 0.95} and AGD with adaptive memoryless backtracking line search using ρ = 0.9.]

[Figure 11 (panels include (a) PHISHING and (b) PROTEIN): as in Figure 10, for the remaining datasets.]
[Figure 12 (panels include (b) GISETTE_SCALE and (d) MUSHROOMS): Logistic regression on four different datasets and four initial step-sizes α_0 = {10^1, 10^2, 10^3, 10^4}/L̂: suboptimality gap for Adagrad, Adagrad with standard backtracking line search using ρ ∈ {0.5, 0.7, 0.9, 0.95} and Adagrad with adaptive memoryless backtracking line search using ρ = 0.9. The light gray horizontal dashed line shows the precision used to compute performance for each dataset.]

[Figure 13 (panels include (a) PHISHING and (b) PROTEIN): as in Figure 12, for the remaining datasets.]

D.2 MEMORYLESS BACKTRACKING LINE SEARCH

[Figure 14 (panels plot step-sizes against wall-clock time [s]; one panel is gisette_scale; legend: reg (0.3, 1e2), reg (0.3, 1e4), ad (0.3, 1e4), 1/L̂): step-sizes for the experiments shown in Fig. 1. Baseline: GD with constant α_k = 1/L̂; reg (ρ, β) and ad (ρ, β): GD with, respectively, regular and adaptive memoryless BLS parameterized by ρ and α_0 = β/L̂. The thick black dashed line denotes 1/L̂, where L̂ = λ_max(AᵀA)/(4n).]

[Figure 15 (panels plot step-sizes against wall-clock time [s]; legend: reg (0.6, 1e1), reg (0.6, 1e2), ad (0.9, 1e2), 1/L̂): step-sizes for the experiments shown in Fig. 2. Baseline: AGD with constant α_k = 1/L̂; reg (ρ, β) and ad (ρ, β): AGD with, respectively, regular and adaptive memoryless BLS parameterized by ρ and α_0 = β/L̂. The thick black dashed line denotes 1/L̂, where L̂ = λ_max(AᵀA)/(4n).]

[Figure 16: Logistic regression on four different datasets and four initial step-sizes α_0 = {10^1, 10^2, 10^3, 10^4}/L̂: suboptimality gap for GD, GD with standard memoryless backtracking line search using ρ ∈ {0.2, 0.3, 0.5, 0.6} and GD with adaptive memoryless backtracking line search using ρ = 0.3. The light gray horizontal dashed line shows the precision used to compute performance for each dataset.]

[Figure 17 (panels include (a) GISETTE_SCALE and (c) MUSHROOMS): as in Figure 16, for the remaining datasets.]
[Figure 18 (panels include (a) PHISHING and (b) PROTEIN): as in Figure 16, for the remaining datasets.]

[Figure 19 (panels include (b) COVTYPE and (c) GISETTE_SCALE): Logistic regression on four different datasets and four initial step-sizes α_0 = {10^1, 10^2, 10^3, 10^4}/L̂: suboptimality gap for AGD, AGD with standard memoryless backtracking line search using ρ ∈ {0.2, 0.3, 0.5, 0.6} and AGD with adaptive memoryless backtracking line search using ρ = 0.9. The light gray horizontal dashed line shows the precision used to compute performance for all methods, 10^-9.]

[Figure 20 (panels include (a) MUSHROOMS, (b) PHISHING and (c) PROTEIN): as in Figure 19, for the remaining datasets.]

[Figure 21 (panels include (b) GISETTE_SCALE and (d) MUSHROOMS): Logistic regression on four different datasets and four initial step-sizes α_0 = {10^1, 10^2, 10^3, 10^4}/L̂: suboptimality gap for Adagrad, Adagrad with standard memoryless backtracking line search using ρ ∈ {0.2, 0.3, 0.5, 0.6} and Adagrad with adaptive memoryless backtracking line search using ρ = 0.3. The light gray horizontal dashed line shows the precision used to compute performance for each dataset.]

[Figure 22 (panels include (a) PHISHING and (b) PROTEIN): as in Figure 21, for the remaining datasets.]
E LINEAR INVERSE PROBLEMS

Dataset details. We build the observation matrices A from eight datasets: IRIS, DIGITS, WINE, OLIVETTI_FACES and LFW_PAIRS from scikit-learn (Pedregosa et al., 2011), SPEAR3 and SPEAR10 (Lorenz et al., 2014) and SPARCO (van den Berg et al., 2007). For multi-class datasets, only the first two classes are considered. The number of datapoints and dimensions of each dataset can be found in Table 5.

Table 5: Details of FISTA experiments.
dataset | datapoints | dimensions | λ | L_0
digits | 360 | 64 | 10^-1 | 1, 10^1, 10^2, 10^3
iris | 100 | 4 | 10^-2 | 10^-1, 1, 10^1, 10^2
lfw_pairs | 2200 | 5828 | 1 | 10^-3, 10^-2, 10^-1, 1
olivetti_faces | 20 | 4096 | 10^-2 | 1, 10^1, 10^2, 10^3
Spear3 | 512 | 1024 | 10^-1 | 10^-3, 10^-2, 10^-1, 1
Spear10 | 512 | 1024 | 10^-2 | 10^-3, 10^-2, 10^-1, 1
Sparco3 | 1024 | 2048 | 10^-2 | 10^-3, 10^-2, 10^-1, 1
wine | 130 | 13 | 10^-2 | 1, 10^1, 10^2, 10^3

[Figure: FISTA suboptimality gaps with regular backtracking using ρ ∈ {0.5, 0.3, 0.2} and the adaptive variant (AD); panels include (a) SYNTHETIC and (b) OLIVETTI_FACES.]

[Figure: FISTA suboptimality gaps with regular backtracking using ρ ∈ {0.5, 0.3, 0.2} and the adaptive variant (AD); panels include (c) DIGITS.]

F MATRIX FACTORIZATION EXPERIMENTS

We sample A from the file u.data, part of the MovieLens 100K dataset (grouplens.org/datasets/movielens/100k/). Moreover, we choose the precision to represent a reduction of the suboptimality gap to 10^-12 of its initial value, which corresponds to a lower bound of 10^-5, as the initial objective values typically hover around 10^7.

[Figure 25 (panels: (a) Rank 10, (b) Rank 20, (c) Rank 30): Matrix factorization and three values of the rank: suboptimality gap for gradient descent, gradient descent with standard backtracking line search using ρ ∈ {0.2, 0.3, 0.5, 0.6} and gradient descent with adaptive backtracking line search using ρ = 0.3. The light gray horizontal dashed line shows the precision used to compute performance for all methods, 10^-5.]
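As a rough illustration of the matrix factorization setup (the exact objective is not restated in this appendix, so the squared error over observed entries below is an assumption on our part, and all data and names are synthetic stand-ins for the MovieLens ratings), a rank-r factorization can be fit by plain gradient descent as follows.

```python
import numpy as np

def mf_objective(U, V, A, mask):
    """0.5 * squared error of the rank-r model U @ V.T on observed entries of A."""
    R = mask * (U @ V.T - A)
    return 0.5 * np.sum(R ** 2)

def mf_gradients(U, V, A, mask):
    R = mask * (U @ V.T - A)
    return R @ V, R.T @ U   # gradients with respect to U and V

# Toy stand-in for the MovieLens ratings matrix: 50 users x 40 items, rank 10.
rng = np.random.default_rng(3)
A = rng.integers(1, 6, size=(50, 40)).astype(float)
mask = (rng.random((50, 40)) < 0.2).astype(float)   # about 20% of entries observed
U = 0.1 * rng.standard_normal((50, 10))
V = 0.1 * rng.standard_normal((40, 10))

alpha = 1e-3   # a fixed step-size here; the experiments instead use backtracking line search
for _ in range(200):
    gU, gV = mf_gradients(U, V, A, mask)
    U, V = U - alpha * gU, V - alpha * gV
print(mf_objective(U, V, A, mask))
```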