Published as a conference paper at ICLR 2025

QNBO: QUASI-NEWTON MEETS BILEVEL OPTIMIZATION

Sheng Fang^{1,2}, Yong-Jin Liu^{2,3}, Wei Yao^{4,5}, Chengming Yu^{5,6}, Jin Zhang^{4,5,7}*
^1 Public Health School, Fujian Medical University; ^2 School of Mathematics and Statistics, Fuzhou University; ^3 Center for Applied Mathematics of Fujian Province; ^4 Department of Mathematics, SUSTech; ^5 National Center for Applied Mathematics Shenzhen, SUSTech; ^6 School of Science, BUPT; ^7 CETC Key Laboratory of Smart City Modeling Simulation and Intelligent Technology, The Smart City Research Institute of CETC
shengfang1996@163.com, yjliu@fzu.edu.cn, yaow@sustech.edu.cn, 2024010618@bupt.cn, zhangj9@sustech.edu.cn

ABSTRACT

Bilevel optimization, which addresses challenges in hierarchical learning tasks, has gained significant interest in machine learning. Applying gradient descent to bilevel optimization in practice encounters two computational hurdles: computing the exact lower-level solution and the inverse Hessian of the lower-level objective. Although these two aspects are inherently connected, existing methods typically handle them separately, solving the lower-level problem and then a linear system for the inverse Hessian-vector product. In this paper, we introduce a general framework that addresses both computational challenges in a coordinated manner. Specifically, we leverage quasi-Newton algorithms to accelerate the resolution of the lower-level problem while efficiently approximating the inverse Hessian-vector product. Furthermore, by exploiting the superlinear convergence properties of BFGS, we establish a non-asymptotic convergence analysis of the BFGS adaptation within our framework. Numerical experiments demonstrate the comparable or superior performance of the proposed algorithms on real-world learning tasks, including hyperparameter optimization, data hyper-cleaning, and few-shot meta-learning.

1 INTRODUCTION

Bilevel optimization (BLO), which addresses challenges in hierarchical decision processes, has gained significant interest in many real-world applications. Typical applications in machine learning include meta-learning (Franceschi et al., 2018; Rajeswaran et al., 2019), hyperparameter optimization (Pedregosa, 2016; Franceschi et al., 2017; Okuno et al., 2021), adversarial learning (Goodfellow et al., 2014; Pfau & Vinyals, 2016), neural architecture search (Chen et al., 2019; Elsken et al., 2020), and reinforcement learning (Yang et al., 2019; Hong et al., 2023). In this study, we revisit the following BLO problem:

$$\min_{x \in \mathbb{R}^m} \Phi(x) := F(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^n} f(x, y), \tag{1}$$

in which the upper-level (UL) objective function $F: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is generally nonconvex, while the lower-level (LL) objective function $f: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is strongly convex with respect to (w.r.t.) the LL variable $y \in \mathbb{R}^n$. The gradient of $\Phi(x)$, known as the hypergradient, is crucial not only for applying gradient descent but also for developing accelerated gradient-based methods for BLO problems. A fundamental question in solving BLO problems is therefore how to estimate the hypergradient efficiently. Assuming that $f$ is twice continuously differentiable, applying the chain rule together with the first-order optimality condition $\nabla_y f(x, y^*(x)) = 0$ of the LL problem (Ghadimi & Wang, 2018) yields the hypergradient

$$\nabla\Phi(x) = \nabla_x F(x, y^*(x)) - [\nabla^2_{xy} f(x, y^*(x))]^T [\nabla^2_{yy} f(x, y^*(x))]^{-1} \nabla_y F(x, y^*(x)). \tag{2}$$
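To make (2) concrete, the following self-contained sketch evaluates the hypergradient on a tiny instance with a quadratic lower level and checks it against finite differences of $\Phi$. It is an illustration only: the instance, all names, and all parameter choices below are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)  # SPD: f strongly convex in y
B = rng.standard_normal((n, m))

# Lower level: f(x, y) = 0.5 y^T A y - y^T B x, so grad_y f = A y - B x and y*(x) = A^{-1} B x.
def y_star(x):
    return np.linalg.solve(A, B @ x)

# Upper level: F(x, y) = 0.5 ||x||^2 + 0.5 ||y||^2, so grad_x F = x and grad_y F = y.
def Phi(x):
    y = y_star(x)
    return 0.5 * (x @ x) + 0.5 * (y @ y)

x = rng.standard_normal(m)
y = y_star(x)

# Equation (2): grad Phi = grad_x F - [d2_xy f]^T [d2_yy f]^{-1} grad_y F,
# where d2_xy f = -B (the Jacobian of grad_y f w.r.t. x) and d2_yy f = A here.
hypergrad = x - (-B).T @ np.linalg.solve(A, y)

# Cross-check against central finite differences of Phi.
eps, fd = 1e-6, np.zeros(m)
for i in range(m):
    e = np.zeros(m); e[i] = eps
    fd[i] = (Phi(x + e) - Phi(x - e)) / (2 * eps)
print(np.max(np.abs(hypergrad - fd)))  # agreement up to finite-difference error
```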
*Correspondence to Jin Zhang (zhangj9@sustech.edu.cn).

Two main challenges in estimating the hypergradient are: (C1) evaluating the LL solution $y^*(x)$; and (C2) estimating the Jacobian-inverse-Hessian-vector product $[\nabla^2_{xy} f(x, y)]^T [\nabla^2_{yy} f(x, y)]^{-1} \nabla_y F(x, y)$ once a good proxy $y$ for the LL solution $y^*(x)$ is obtained. For (C1), the common approach is to perform a few additional gradient descent steps for the LL problem on the current estimate $y_k$, using the result as a proxy for the LL solution. For (C2), two main approaches have been proposed in the literature. The first estimates the inverse Hessian using the (truncated) Neumann series (Ghadimi & Wang, 2018; Ji et al., 2021). The second computes the inverse Hessian-vector product $[\nabla^2_{yy} f(x, y)]^{-1} \nabla_y F(x, y)$ by solving the linear system $[\nabla^2_{yy} f(x, y)] z = \nabla_y F(x, y)$ for $z$, and then calculating $[\nabla^2_{xy} f(x, y)]^T z$ (Pedregosa, 2016; Arbel & Mairal, 2022; Dagréou et al., 2022). Clearly, the hypergradient approximation error depends on the errors in both (C1) and (C2).

Most existing methods handle (C1) and (C2) separately, using different techniques. A notable exception is the recent breakthrough by Ramzi et al. (2022), which introduces a novel approach (named SHINE), specifically designed for deep equilibrium models (DEQs) (Bai et al., 2019; 2020) and for BLO problems whose UL objective does not explicitly depend on the UL variable, i.e., $F(x, y) = L(y)$. The novelty of SHINE lies in addressing (C1) and (C2) jointly. The main idea is to use the quasi-Newton (qN) matrices produced while solving the LL problem to efficiently approximate the inverse Hessian in the direction needed for the hypergradient computation. SHINE provides three methods for approximating the hypergradient by incorporating a technique called OPA with Broyden's method and BFGS. Note that in the OPA method from Ramzi et al. (2022), the qN matrices derived from the LL resolution are influenced by additional updates. This can potentially lead to incorrect inversion, as noted in Ramzi et al. (2022); to mitigate this issue, they employ a fallback strategy. In theory, SHINE demonstrates asymptotic convergence to the hypergradient under various conditions, but it does not guarantee convergence of the algorithmic iterates. Inspired by SHINE, our focus is therefore on improving hypergradient approximation and reducing the barriers to solving BLO problems. In particular, our main research question is:

Can we develop a method to enhance hypergradient approximation for solving the bilevel optimization problem in (1) with a guaranteed convergence rate?

1.1 MAIN CONTRIBUTIONS

Our response to this question is affirmative, and our main contributions are summarized below.

- qNBO, a new algorithmic framework utilizing quasi-Newton techniques, is proposed for solving the BLO problem (1). Unlike SHINE, qNBO includes a subroutine that applies quasi-Newton recursion schemes specifically tailored to the direction $\nabla_y F(x, y)$, avoiding incorrect inversion.
- We validate the effectiveness and efficiency of qNBO with two practical algorithms, qNBO (BFGS) and qNBO (SR1), corresponding to two prominent quasi-Newton methods. The numerical results demonstrate qNBO's comparable or superior performance relative to its closest competitors, SHINE (Ramzi et al., 2022) and PZOBO (Sow et al., 2022b), as well as other BLO algorithms, including two popular fully first-order methods: BOME (Liu et al., 2022) and F2SA (Kwon et al., 2023).
- By leveraging the superlinear convergence rates of BFGS, we analyze the non-asymptotic convergence of the BFGS adaptation within our framework, qNBO.
1.2 ADDITIONAL RELATED WORK

Quasi-Newton methods. Because of their low per-iteration computational cost and fast convergence rates, quasi-Newton (qN) methods have been extensively studied (Nocedal & Wright, 2006). The most common qN methods are Broyden-Fletcher-Goldfarb-Shanno (BFGS) (Broyden, 1970b;a; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), its limited-memory extension L-BFGS (Liu & Nocedal, 1989), and the symmetric rank-one method (SR1) (Davidon, 1991; Broyden, 1967). For BLO problems, Pedregosa (2016) first uses L-BFGS to solve the LL problem to a certain tolerance. A conjugate-gradient method is then applied to solve the linear system $[\nabla^2_{yy} f(x, y)] z = \nabla_y F(x, y)$ through Hessian-vector products, and finally $[\nabla^2_{xy} f(x, y)]^T z$ is calculated (a sketch of this pipeline appears at the end of this subsection).

Hypergradient approximation methods. Various bilevel methods have been proposed recently to approximate the inverse Hessian or to omit some second-order derivative computations in the hypergradient. For example, FOMAML (Finn et al., 2017; Nichol et al., 2018) skips all second-order derivatives. DARTS (Liu et al., 2018) solves the LL problem with just one gradient descent step. The Jacobian-free method (JFB) (Fung et al., 2022) approximates the inverse Hessian with the identity. Giovannelli et al. (2021) propose practical low-rank bilevel methods (BSG1 and BSG-N-FD) that use first-order approximations of second-order derivatives through a finite-difference scheme or rank-1 approximations. Recently, several zeroth-order methods have been proposed to approximate the full hypergradient, such as ES-MAML (Song et al., 2019) and HOZOG (Gu et al., 2021). Another zeroth-order method, PZOBO (Sow et al., 2022b), approximates only part of the hypergradient by comparing two optimization paths.

There is another line of research on BLO problems that does not explicitly use the hypergradient in (2); see, e.g., Liu et al. (2023); Sow et al. (2022a); Shen & Chen (2023); Liu et al. (2022); Kwon et al. (2023; 2024). These algorithms first use the value function of the lower-level problem to transform the bilevel problem into a single-level problem, and then apply a penalty method or other techniques to solve the reformulated problem. For instance, BOME (Liu et al., 2022) is a fast gradient-based method that applies a modified dynamic barrier gradient descent to the value-function reformulation. F2SA (Kwon et al., 2023) is a fully first-order method derived from a value-function-based penalty formulation and can be implemented in a single-loop manner. Additionally, Kwon et al. (2023) show that the update direction of F2SA approximates the exact hypergradient globally up to $O(1/\lambda)$, where $\lambda$ is the penalty parameter.
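To ground the linear-system route to (C2) described above (Pedregosa, 2016), here is a minimal conjugate-gradient sketch. It is generic and not tied to any particular library; the explicit SPD matrix below merely stands in for $\nabla^2_{yy} f(x, y)$, which in practice would be accessed only through Hessian-vector products.

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H z = b for SPD H, accessed only through the product v -> H v."""
    z = np.zeros_like(b)
    r = b - hvp(z)
    p, rs = r.copy(), r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        step = rs / (p @ Hp)
        z += step * p
        r -= step * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

rng = np.random.default_rng(1)
n = 50
H = rng.standard_normal((n, n)); H = H @ H.T + n * np.eye(n)  # stand-in for d2_yy f
grad_y_F = rng.standard_normal(n)
z = conjugate_gradient(lambda v: H @ v, grad_y_F)
print(np.linalg.norm(H @ z - grad_y_F))  # ~1e-10: residual of the linear system
```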
2 PROPOSED FRAMEWORK

In this section, we introduce a general framework, qNBO, to enhance hypergradient approximation. It addresses the computational challenges (C1) and (C2) using quasi-Newton techniques. We begin by rewriting the hypergradient as

$$\nabla\Phi(x) = \nabla_x F(x, y^*(x)) - [\nabla^2_{xy} f(x, y^*(x))]^T u^*(x, y^*(x)), \quad \text{where } u^*(x, y) := [\nabla^2_{yy} f(x, y)]^{-1} \nabla_y F(x, y).$$

As in Arbel & Mairal (2022) and Dagréou et al. (2022), the proposed algorithms introduce an additional variable $u_k$ alongside $x_k$ and $y_k$. Naturally, qNBO consists of three components. The details of qNBO are presented in Algorithm 1.

Algorithm 1 qNBO: quasi-Newton Meets Bilevel Optimization
Input: $x_0$, $y_0$; initial matrix $H_0$; stepsize $\alpha > 0$; iteration numbers $K$, $\{Q_k\}_{k=0}^{K-1}$
for $k = 0, 1, \ldots, K-1$ do
  1. $y_{k+1} \leftarrow \mathcal{A}(x_k, y_k)$  # update $y_{k+1}$ by a subroutine $\mathcal{A}$
  2. if $Q_k = 1$: $u_{k+1} \leftarrow C_{qN}\big(\nabla_y F(x_k, y_{k+1}), H_0, \{s_t, g_t\}_{t=0}^{T-1}\big)$  # share $\{s_t, g_t\}_{t=0}^{T-1}$ with $\mathcal{A}(x_k, y_k)$
     else: $u_{k+1} \leftarrow \mathcal{B}(x_k, y_{k+1}, H_0, \nabla_y F(x_k, y_{k+1}), Q_k)$  # improve $u_{k+1}$ by a subroutine $\mathcal{B}$
  3. $x_{k+1} \leftarrow x_k - \alpha\big(\nabla_x F(x_k, y_{k+1}) - [\nabla^2_{xy} f(x_k, y_{k+1})]^T u_{k+1}\big)$
end for

Part 1: qNBO updates $y_k$ towards $y^*(x_k)$ using a qN-based subroutine $\mathcal{A}(x_k, y_k)$ (Algorithm 2), starting from $y_k$. The key ingredient is a quasi-Newton recursion scheme $C_{qN}$, which computes the matrix-vector product $Hd$ through a sequence of inner products and vector summations involving $d$ and the pairs $\{s_i, g_i\}_{i=0}^{t-1}$. Here $H$ represents the inverse Hessian approximation of the LL objective, $d = \nabla_y f(x, y_t)$, $s_i = y_{i+1} - y_i$, and $g_i = \nabla_y f(x, y_{i+1}) - \nabla_y f(x, y_i)$ in this subroutine. Two prominent quasi-Newton recursion schemes are provided in Appendix B.

Part 2: To update $u_{k+1}$, we provide two options: $Q_k = 1$ or $Q_k > 1$. In the case $Q_k = 1$, qNBO updates $u_{k+1}$ similarly to SHINE: the pairs $\{s_i, g_i\}_{i=0}^{T-1}$ from $\mathcal{A}(x_k, y_k)$ are shared to approximate the inverse Hessian in the direction $\nabla_y F(x_k, y_{k+1})$. Unfortunately, incorrect inversion may occur because the pairs $\{s_i, g_i\}_{i=0}^{T-1}$ in $\mathcal{A}(x_k, y_k)$ are designed to satisfy the secant equation $H_{t+1} g_t = s_t$ along the LL iterates, rather than being adapted to the direction $\nabla_y F(x_k, y_{k+1})$. To address this issue, qNBO adds a subroutine $\mathcal{B}$ (Algorithm 3), which applies quasi-Newton recursion schemes to the direction $\nabla_y F(x_k, y_{k+1})$ when $Q_k > 1$. The price to pay is that the pairs $\{s_i, g_i\}$ in $\mathcal{A}$ cannot be shared with $\mathcal{B}$, which increases the computational cost. Thus, choosing between $Q_k = 1$ and $Q_k > 1$ involves a trade-off between performance and computational cost.

Part 3: After computing $y_{k+1}$ and $u_{k+1}$, qNBO updates $x_{k+1}$ by $x_{k+1} = x_k - \alpha_k\big(\nabla_x F(x_k, y_{k+1}) - [\nabla^2_{xy} f(x_k, y_{k+1})]^T u_{k+1}\big)$, where $\alpha_k$ is a stepsize and $\widehat{\nabla}\Phi(x_k) := \nabla_x F(x_k, y_{k+1}) - [\nabla^2_{xy} f(x_k, y_{k+1})]^T u_{k+1}$ is the hypergradient approximation. This update rule for $x_k$ is commonly used in gradient-based algorithms, such as those in Arbel & Mairal (2022) and Dagréou et al. (2022).

Algorithm 2 $\mathcal{A}(x, y_0)$: gradient descent steps + qN steps for the LL problem
Input: $x$, $y_0$; initial matrix $H_0$; stepsizes $\beta, \gamma > 0$; iteration numbers $P$, $T$
1. for $j = 0, 1, 2, \ldots, P-1$:
     $y_{j+1} \leftarrow y_j - \beta \nabla_y f(x, y_j)$
   end for
2. $y_0 \leftarrow y_P$
   for $t = 0, \ldots, T-1$:
     $y_{t+1} \leftarrow y_t - \gamma d_t$, where $d_t \leftarrow C_{qN}\big(\nabla_y f(x, y_t), H_0, \{s_i, g_i\}_{i=0}^{t-1}\big)$ for $t \ge 1$ and $d_0 \leftarrow H_0 \nabla_y f(x, y_0)$
     $s_t \leftarrow y_{t+1} - y_t$, $g_t \leftarrow \nabla_y f(x, y_{t+1}) - \nabla_y f(x, y_t)$
   end for
Return $y_T$, $\{s_t, g_t\}_{t=0}^{T-1}$.

Algorithm 3 $\mathcal{B}(x, y, H_0, d, Q)$
Input: $x$, $y$; initial matrix $H_0$; vector $d$; stepsizes $\xi_i > 0$; iteration number $Q$
1. $u_0 \leftarrow H_0 d$, $\bar{s}_0 \leftarrow \xi_0 u_0$, and $\bar{g}_0 \leftarrow \nabla_y f(x, y + \bar{s}_0) - \nabla_y f(x, y)$
2. for $i = 1, 2, \ldots, Q-1$:
     $u_i \leftarrow C_{qN}\big(d, H_0, \{\bar{s}_j, \bar{g}_j\}_{j=0}^{i-1}\big)$
     $\bar{s}_i \leftarrow \xi_i u_i$, $\bar{g}_i \leftarrow \nabla_y f(x, y + \bar{s}_i) - \nabla_y f(x, y)$
   end for
Return $u_{Q-1}$.
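To make the interplay of the three parts concrete, the sketch below runs one outer iteration of Algorithm 1 on the quadratic instance of Section 4.1 using the $Q_k = 1$ (pair-sharing) branch, and compares the resulting hypergradient estimate with the closed-form hypergradient. It is a simplified illustration under our own choices: it maintains a dense matrix $H$ via the BFGS inverse update (stated as (3) below) instead of calling the recursion scheme $C_{qN}$ on stored pairs, and it uses unit quasi-Newton steps.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
W = rng.standard_normal((n, n))
A = W @ W.T / n + np.eye(n)            # SPD; LL objective f(x,y) = 0.5 y^T A y - y^T x
z0 = rng.standard_normal(n)
L = np.linalg.norm(A, 2)               # smoothness constant of f in y

grad_y_f = lambda x, y: A @ y - x
grad_x_F = lambda x, y: x - z0         # UL objective F(x,y) = 0.5||x - z0||^2 + 0.5 y^T A y
grad_y_F = lambda x, y: A @ y

def bfgs_inverse_update(H, s, g):
    rho = 1.0 / (g @ s)
    V = np.eye(len(s)) - rho * np.outer(s, g)
    return V @ H @ V.T + rho * np.outer(s, s)

def subroutine_A(x, y, P=2, T=15):
    """Algorithm 2: P gradient steps, then T BFGS steps (dense H instead of C_qN)."""
    for _ in range(P):
        y = y - (1.0 / L) * grad_y_f(x, y)
    H = np.eye(n) / L                  # H0 chosen as a scalar multiple of the identity
    for _ in range(T):
        y_new = y - H @ grad_y_f(x, y)
        s, g = y_new - y, grad_y_f(x, y_new) - grad_y_f(x, y)
        if g @ s <= 1e-15:             # already converged; avoid a degenerate pair
            return y_new, H
        H = bfgs_inverse_update(H, s, g)
        y = y_new
    return y, H

x, y0 = 2 * np.ones(n), 2 * np.ones(n)
y1, H = subroutine_A(x, y0)                          # Part 1
u = H @ grad_y_F(x, y1)                              # Part 2 (Q_k = 1): reuse H from A
estimate = grad_x_F(x, y1) - (-np.eye(n)).T @ u      # Part 3 direction; d2_xy f = -I here
exact = (np.linalg.inv(A) + np.eye(n)) @ x - z0      # closed-form grad Phi(x), Section 4.1
print(np.linalg.norm(estimate - exact) / np.linalg.norm(exact))  # relative estimation error
```

Replacing the $Q_k = 1$ line with a subroutine-$\mathcal{B}$ analogue would refine $u$ along the fixed direction $\nabla_y F(x, y_1)$, which is exactly the trade-off discussed in Part 2.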
Two practical qNBO algorithms: qNBO (BFGS) and qNBO (SR1). We describe two prominent quasi-Newton recursion schemes $C_{qN}$ for computing the inverse-Hessian-approximation-vector product $H_t d$. These schemes involve a sequence of inner products and vector summations with $d$ and the pairs $\{s_i, g_i\}_{i=0}^{t-1}$. The first is the BFGS two-loop recursion scheme, outlined in Algorithm 4, corresponding to the BFGS updating formula (Broyden, 1970b;a; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970):

$$H_{t+1} = \left(I - \rho_t s_t g_t^T\right) H_t \left(I - \rho_t g_t s_t^T\right) + \rho_t s_t s_t^T, \qquad \rho_t = \frac{1}{g_t^T s_t}. \tag{3}$$

The second is presented in Algorithm 5 and corresponds to the symmetric-rank-one (SR1) updating formula (Davidon, 1991; Broyden, 1967):

$$H_{t+1} = H_t + \frac{(s_t - H_t g_t)(s_t - H_t g_t)^T}{(s_t - H_t g_t)^T g_t}. \tag{4}$$
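Both updating formulas are calibrated to the lower-level curvature pairs through the secant equation $H_{t+1} g_t = s_t$, the very property that makes naive pair-sharing in the direction $\nabla_y F$ delicate, as discussed in Part 2. A quick numerical check of (3) and (4), with names and a random pair of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
H = np.eye(n)
s, g = rng.standard_normal(n), rng.standard_normal(n)
if g @ s < 0:
    g = -g                                   # enforce the curvature condition s^T g > 0

# BFGS update (3)
rho = 1.0 / (g @ s)
V = np.eye(n) - rho * np.outer(s, g)
H_bfgs = V @ H @ V.T + rho * np.outer(s, s)

# SR1 update (4)
v = s - H @ g
H_sr1 = H + np.outer(v, v) / (v @ g)

# Both updates satisfy the secant equation H_{t+1} g_t = s_t.
print(np.allclose(H_bfgs @ g, s), np.allclose(H_sr1 @ g, s))  # True True
```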
Implementation and improvement. Several details and enhancements are needed for an efficient implementation of qNBO. First, the purpose of including a few gradient descent steps in subroutine $\mathcal{A}$ is to bring the iterates closer to a neighborhood of the LL solution, enabling superlinear convergence in the subsequent quasi-Newton steps. A warm start for $y_0$ is effective because $y^*(x_{k+1})$ remains close to $y^*(x_k)$ when $x_{k+1}$ is near $x_k$; this is guaranteed by the Lipschitz continuity of $y^*(x)$. In practice, we conjecture that a few initial gradient descent steps are sufficient, although they are necessary for the theoretical analysis. Second, because $f(x, y)$ is strongly convex w.r.t. $y$ in our setting, the curvature condition $s_t^T g_t > 0$ required by BFGS is always satisfied. This allows the use of a fixed step size, eliminating the need for time-consuming line searches. Furthermore, once the iterates enter a region conducive to superlinear convergence, a few quasi-Newton steps suffice to achieve a satisfactory solution. Third, in the experiments, the initial matrix $H_0$ is chosen as a scalar multiple of the identity. This ensures that the recursion algorithms (Algorithms 4 and 5) involve only vector inner products, significantly reducing computational costs. In Algorithm 3, the parameter $\xi_i$ is typically set to either $1$ or $\|u_i\|$ in most cases. Fourth, qNBO is a flexible framework that can integrate other quasi-Newton methods, such as limited-memory BFGS (L-BFGS) (Liu & Nocedal, 1989). It also supports a "non-loop" implementation of L-BFGS by representing the quasi-Newton matrices in outer-product form (Byrd et al., 1994). Finally, qNBO consists of three parts, the first two of which utilize quasi-Newton recursion schemes. A stochastic adaptation would replace these schemes with stochastic methods (e.g., K-BFGS (Goldfarb et al., 2020), stochastic block BFGS (Gower et al., 2016)) and use stochastic gradients in Part 3, in line with Dagréou et al. (2022). Key challenges in such an adaptation include constructing effective unbiased or biased estimators in Part 3 using techniques like variance reduction and momentum, and analyzing the convergence rate and complexity of the resulting stochastic algorithms in a bilevel setting that leverages noisy second-order information. Addressing these theoretical issues may require breakthroughs beyond the scope of this work.

3 THEORETICAL ANALYSIS

In this section, we analyze the non-asymptotic convergence of the qNBO algorithm, as outlined in Algorithm 1, under standard assumptions commonly used in the BLO literature (Ghadimi & Wang, 2018; Ji et al., 2021; Chen et al., 2022; Dagréou et al., 2022; Ji et al., 2022).

3.1 ASSUMPTIONS

Assumption 3.1. Assume that the UL objective function $F$ satisfies the following properties: (i) For all $x$, the gradients $\nabla_x F(x, y)$ and $\nabla_y F(x, y)$ are Lipschitz continuous w.r.t. $y$, with Lipschitz constants $L_{F_x}$ and $L_{F_y}$, respectively. (ii) For all $y$, $\nabla_y F(x, y)$ is Lipschitz continuous w.r.t. $x$, with Lipschitz constant $L_{F_y}$. (iii) There exists a constant $C_{F_y}$ such that $\|\nabla_y F(x, y)\| \le C_{F_y}$ for all $x, y$.

Assumption 3.2. Assume that the LL objective function $f$ has the following properties: (i) For all $x$ and $y$, $f$ is twice continuously differentiable in $(x, y)$. (ii) For all $x$, $f(x, y)$ is strongly convex w.r.t. $y$ with parameter $\mu > 0$. Moreover, $\nabla_y f(x, y)$ and $\nabla^2_{yy} f(x, y)$ are Lipschitz continuous w.r.t. $y$ with parameters $L$ and $L_{f_{yy}}$, respectively. (iii) For all $x$, $\nabla^2_{xy} f(x, y)$ is Lipschitz continuous w.r.t. $y$ with constant $L_{f_{xy}}$. (iv) For all $x, y$, we have $\|\nabla^2_{xy} f(x, y)\| \le M_{f_{xy}}$ for some constant $M_{f_{xy}}$. (v) For all $y$, $\nabla^2_{xy} f(x, y)$ and $\nabla^2_{yy} f(x, y)$ are Lipschitz continuous w.r.t. $x$ with constants $L_{f_{xy}}$ and $L_{f_{yy}}$, respectively.

Assumption 3.3. There exists $\iota \in \mathbb{R}$ such that $\inf_x \Phi(x) \ge \iota$.

3.2 CONVERGENCE RESULTS

The evolving nature of the inverse Hessian approximations produced by the updating formulas in qNBO significantly complicates the analysis of non-asymptotic convergence. Additionally, different update formulas present different challenges, much as in the study of convergence rates of quasi-Newton methods. In this section, we focus on the non-asymptotic convergence of qNBO (BFGS), leveraging the superlinear convergence rates of BFGS. Some results can also be extended to qNBO (SR1). Comprehensive proofs are provided in Appendix D. Let $L_\Phi$ be the Lipschitz constant of $\nabla\Phi$ given in Lemma D.12. We first present the convergence results of qNBO (BFGS) for solving the bilevel problem (1) when the LL objective function is quadratic.

Theorem 3.4 (quadratic case). Suppose that $f$ in (1) takes the following quadratic form:
$$f(x, y) = \tfrac{1}{2} y^T A y - y^T x, \tag{5}$$
where $\mu I \preceq A \preceq LI$. Assume that Assumptions 3.1 and 3.3 hold. Set $Q_k = k + 1$ and $H_0 = LI$. Let $\kappa := L/\mu$, $t_b := 4n\ln\kappa$, $c_t := 2t_b^2$, and $\omega := c_1\left(1 + \tfrac{1}{\varepsilon}\right) c_t^2 \kappa^3 \left(\tfrac{1}{T}\right)^T$, where $c_1$ is a positive constant given in Theorem D.19. We can choose positive parameters $\alpha$, $\varepsilon$, and $T$ such that
$$\tau := c_t^2 \kappa^3 \left(\tfrac{1}{T}\right)^T (1 + \varepsilon) + \left(1 + \tfrac{1}{\varepsilon}\right)\alpha^2 c_1 < 1 \quad\text{and}\quad \alpha L_\Phi + \omega\alpha^2\left(\tfrac{1}{2} + \tfrac{\alpha L_\Phi}{1 - \tau}\right) \le \tfrac{1}{4}.$$
Then the iterates $x_k$ generated by qNBO (BFGS) satisfy
$$\frac{1}{K}\sum_{k=0}^{K-1} \|\nabla\Phi(x_k)\|^2 \le \frac{4\left(\Phi(x_0) - \inf_x \Phi(x)\right)}{\alpha K} + \frac{3\delta_0}{K(1 - \tau)} + \frac{18 n L C_{F_y}^2 \ln K}{\mu^3 K},$$
with initial error $\delta_0 = 3 c_t^2 \kappa^3 \left(\tfrac{1}{T}\right)^T c_2 \|y^*(x_0) - y_0\|^2$, where $c_2$ is a constant.

Remark 3.5. For the quadratic case, quasi-Newton methods achieve global superlinear convergence (Ye et al., 2023; Rodomanov & Nesterov, 2022; 2021b), allowing $P = 0$ in Algorithm 2.

Remark 3.6. The convergence rate of qNBO (SR1) for the quadratic case is similar to that in Theorem 3.4. For the general case, however, the qNBO (SR1) algorithm lacks convergence guarantees without the specific corrections used to achieve numerical stability, as noted in Ye et al. (2023).

Next, we explore the case where the LL objective function in (1) takes a general form. While the convergence rate of qNBO (BFGS) resembles that of the quadratic case, the analysis is much more challenging. The specific convergence rate for this general case is detailed in the following theorem.
Theorem 3.7 (general case). Suppose that Assumptions 3.1, 3.2, and 3.3 hold. Set $Q_k = k + 1$. Choose the parameters $\beta$ and $P$ such that $(1 - \beta\mu)^P \|y_k - y^*(x_k)\| \le \tfrac{\mu}{300}$, and assume $H_0$ satisfies
$$\left\|\nabla^2_{yy} f(x_k, y^*(x_k))^{-1/2}\left(H_0^{-1} - \nabla^2_{yy} f(x_k, y^*(x_k))\right)\nabla^2_{yy} f(x_k, y^*(x_k))^{-1/2}\right\|_F \le \tfrac{1}{7}.$$
Define $\tau := \kappa\left(\tfrac{1}{T}\right)^T (1 - \beta\mu)^P (1 + \varepsilon) + \left(1 + \tfrac{1}{\varepsilon}\right)\alpha^2 c_3$ and $\omega := c_3\left(1 + \tfrac{1}{\varepsilon}\right)\kappa\left(\tfrac{1}{T}\right)^T (1 - \beta\mu)^P$, with a constant $c_3$ given in Theorem D.24. We can choose positive parameters $\alpha$, $\varepsilon$, and $T$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\left(\tfrac{1}{2} + \tfrac{\alpha L_\Phi}{1-\tau}\right) \le \tfrac{1}{4}$. Then the iterates $x_k$ generated by qNBO (BFGS) satisfy
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4\left(\Phi(x_0) - \inf_x\Phi(x)\right)}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{18 n L M_{f_{xy}}^2 C_{F_y}^2 \ln K}{\mu^3 \xi K}, \tag{7}$$
where $\delta_0 = 3\kappa\left(\tfrac{1}{T}\right)^T (1-\beta\mu)^P c_4 \|y^*(x_0) - y_0\|^2$ is the initial error with constant $c_4$, and the constant $\xi$ is related to the properties of $f$, as given in (29).

The proof sketch of Theorem 3.7 can be found in Appendix D.2. The complete versions of the parameter specifications and the proofs of Theorems 3.4 and 3.7 are provided in Appendices D.3 and D.4, respectively.

Discussion on convergence rate and complexity. Choose $T = \Theta(\ln\kappa)$ and the step size $\alpha = \Theta(\kappa^{-3})$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\left(\tfrac{1}{2} + \tfrac{\alpha L_\Phi}{1-\tau}\right) \le \tfrac{1}{4}$. Under the same setting as Theorem 3.7, we have $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\left(\frac{\kappa^3}{K} + \frac{\kappa^3\ln K}{K}\right)$. To achieve an $\epsilon$-stationary point, $K = O(\kappa^3\epsilon^{-1})$ iterations are required, resulting in gradient complexities $G_c(f, \epsilon) = O(\kappa^6\epsilon^{-2})$ and $G_c(F, \epsilon) = O(\kappa^3\epsilon^{-1})$, as well as a Jacobian-vector product complexity $JV(\epsilon) = O(\kappa^3\epsilon^{-1})$. The details for ensuring these parameter conditions, as well as the specific complexity analysis, can be found in Appendix E.

Theoretical comparisons. Our analysis provides a non-asymptotic convergence rate, superior to that of SHINE (Ramzi et al., 2022). It is established in Ji et al. (2022) that the fastest deterministic convergence rate, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O(\frac{1}{K})$, is achievable. In comparison, BOME (Liu et al., 2022) reaches a convergence rate of $O(K^{-1/4})$, F2SA (Kwon et al., 2023) attains $O(\frac{\ln K}{K^{2/3}})$, and SABA (Dagréou et al., 2022) achieves $O(\frac{1}{K})$. To achieve an $\epsilon$-stationary point, the matrix-vector product complexity of qNBO is $O(\kappa^3\epsilon^{-1})$, primarily due to the $[\nabla^2_{xy} f(x, y)]^T u$ computations. It is worth noting that the numbers of Jacobian-vector and Hessian-vector products for AID-BIO (Ji et al., 2021) are of orders $O(\kappa^3\epsilon^{-1})$ and $O(\kappa^{3.5}\epsilon^{-1})$, respectively. Although the gradient complexity $G_c(f, \epsilon)$ of qNBO is higher than that of AID-BIO (Ji et al., 2021), computing gradients is generally cheaper than performing matrix-vector operations.

4 NUMERICAL EXPERIMENT

In this section, we conduct numerical experiments to evaluate the performance of the qNBO algorithms in solving bilevel optimization problems. We first validate the theoretical convergence through experiments on a toy example, followed by an assessment of efficiency by comparing qNBO with its closest competitor, SHINE (Ramzi et al., 2022), as well as other bilevel optimization (BLO) algorithms, such as AID-BIO (Ji et al. (2021) with the CG method), AID-TN (Ji et al. (2021) with the truncated Neumann method), AMIGO (Arbel & Mairal (2022) with the CG method), SABA (Dagréou et al., 2022), and BSG1 (Giovannelli et al., 2021). Additionally, we compare qNBO with two widely used fully first-order algorithms, BOME (Liu et al., 2022) and F2SA (Kwon et al., 2023), in real-world applications including hyperparameter optimization and data hyper-cleaning. Finally, we explore qNBO's applicability to complex machine learning tasks by exclusively comparing it with PZOBO (Sow et al., 2022b) in a meta-learning experiment, where PZOBO is regarded as the leading algorithm for few-shot meta-learning. Details of all experimental specifications are provided in Appendix C. Additionally, we perform an ablation study in Appendix C.5.

4.1 TOY EXAMPLE

In this subsection, we consider a quadratic bilevel problem in which both the UL and LL objective functions are quadratic. Given $z_0 \in \mathbb{R}^n$ and a symmetric positive definite matrix $A \in \mathbb{R}^{n\times n}$, the problem is formulated as
$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|x - z_0\|^2 + \frac{1}{2} y^*(x)^T A y^*(x) \quad\text{s.t.}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^n} \frac{1}{2} y^T A y - x^T y. \tag{8}$$
In the experiment, the vector $z_0$ and the matrix $A$ are randomly generated, with $n = 1000$. The hypergradient is $\nabla\Phi(x) = (A^{-1} + I)x - z_0$, which yields the unique solution $(x^*, y^*) = \big((A^{-1}+I)^{-1} z_0,\; A^{-1}(A^{-1}+I)^{-1} z_0\big)$.
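The closed-form solution makes (8) a convenient correctness check. The snippet below (our own construction of $A$ and $z_0$, with a smaller $n$ than the paper's $n = 1000$) verifies that the hypergradient vanishes at $x^* = (A^{-1}+I)^{-1} z_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100                                        # the paper uses n = 1000
W = rng.standard_normal((n, n))
A = W @ W.T / n + np.eye(n)                    # random symmetric positive definite matrix
z0 = rng.standard_normal(n)

Ainv = np.linalg.inv(A)
x_opt = np.linalg.solve(Ainv + np.eye(n), z0)  # x* = (A^{-1} + I)^{-1} z0
y_opt = Ainv @ x_opt                           # y* = A^{-1} x*

# grad Phi(x) = (A^{-1} + I) x - z0 must vanish at x*.
print(np.linalg.norm((Ainv + np.eye(n)) @ x_opt - z0))  # ~1e-13
```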
To evaluate the performance of the various methods on this problem, we analyze the hypergradient estimation errors and the distances between the iterates $(x_k, y_k)$ and the optimal solution $(x^*, y^*)$. The results, shown in Figure 1(a), indicate that AID-BIO, AMIGO-CG, and qNBO (SR1) achieve smaller hypergradient errors and produce iterates closer to the optimal solution than the other methods. Furthermore, we analyze the impact of the parameter $\{Q_k\}_{k=0}^{K-1}$ in Algorithm 1 on the performance of qNBO (BFGS). Figure 1(d) shows that the hypergradient norm $\|\nabla\Phi(x_k)\|$ generally decreases as $Q_k$ increases, with $\{Q_k\}_{k=0}^{K-1} = \{k+1\}_{k=0}^{K-1}$ performing best, thereby supporting the claims of Theorem 3.4.

Figure 1: Numerical results on the toy example. (d) Testing results on the impact of the parameter $\{Q_k\}_{k=0}^{K-1}$ in qNBO (BFGS).

4.2 HYPERPARAMETER OPTIMIZATION IN LOGISTIC REGRESSION

We perform hyperparameter optimization for $\ell_2$-regularized logistic regression on the 20News (Lang, 1995) and Real-sim (Chang & Lin) datasets, formulated as the bilevel problem
$$\min_x \sum_{(a_i', b_i')\in D_{val}} \ell(a_i', b_i', y^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^n} \sum_{(a_i, b_i)\in D_{train}} \ell(a_i, b_i, y) + \frac{\exp(x)}{2}\|y\|^2, \tag{9}$$
where $(a_i, b_i) \in D_{train}$ and $(a_i', b_i') \in D_{val}$ are the training and validation data, respectively, and $\ell(a_i, b_i, y) := \log\left(1 + \exp\left(-b_i a_i^T y\right)\right)$. The LL variable $y$ is the model's parameter, while the UL variable $x$ is the regularization hyperparameter.

The performance of the different methods on the unseen dataset $D_{test}$ is shown in Figure 2, where results over 10 runs are plotted for each method. Here AMIGO refers to the algorithm of Arbel & Mairal (2022) that employs gradient descent to solve the auxiliary quadratic problem. The results show that qNBO (BFGS) reaches its lowest loss faster than the other methods. Notably, the performance of qNBO (SR1) on the 20News dataset is omitted because it was ineffective in this experiment, producing oscillations. This issue arises from the numerical instability of SR1 on general functions without a correction strategy (Ye et al., 2023).

Figure 2: Hyperparameter optimization experiments for $\ell_2$-regularized logistic regression on two datasets (Left: 20News; Right: Real-sim).

Figure 3: Data hyper-cleaning results on two datasets (Left: MNIST; Right: Fashion MNIST).

4.3 DATA HYPER-CLEANING

This subsection focuses on data hyper-cleaning for MNIST (Deng, 2012) and Fashion MNIST (Xiao et al., 2017) to enhance model accuracy, using a noisy training set $D_{train} := \{a_i, b_i\}_{i=1}^m$ and a clean validation set $D_{val}$. The objective is to adjust the training-data weights to improve performance on $D_{val}$. This task can be formalized as the bilevel problem
$$\min_x \; \ell_{val}(y^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_y \{\ell_{train}(x, y) + c\|y\|^2\}, \tag{10}$$
where $\ell_{val}$ is the validation loss on $D_{val}$ and $\ell_{train} = \sum_{i=1}^m \sigma(x_i)\,\ell(a_i, b_i, y)$ is a weighted training loss with $\sigma(x) = \mathrm{Clip}(x, [0, 1])$ and $x \in \mathbb{R}^m$. In the experiment, both $\ell(a_i, b_i, y)$ and $\ell_{val}$ are the cross-entropy loss, with $c = 0.001$.
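For concreteness, the weighted training loss in (10) can be written out as follows. The sketch is ours; the shapes follow the linear model $y = (W, b)$ described in Appendix C.3, and the random arrays merely stand in for MNIST data.

```python
import numpy as np

def sigma(x):
    """Per-sample weights: sigma(x) = Clip(x, [0, 1])."""
    return np.clip(x, 0.0, 1.0)

def weighted_train_loss(x, y, feats, labels, c=0.001):
    """l_train(x, y) + c ||y||^2 with cross-entropy l and a 10-class linear model."""
    W, b = y
    logits = feats @ W.T + b                              # (m, 10)
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilized log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample_ce = -log_probs[np.arange(len(labels)), labels]
    return sigma(x) @ per_sample_ce + c * ((W * W).sum() + (b * b).sum())

rng = np.random.default_rng(5)
m, d = 8, 784
feats = rng.standard_normal((m, d))
labels = rng.integers(0, 10, size=m)
x = 0.5 * np.ones(m)                                      # x0 = 0.5e, as in Appendix C.3.1
y = (0.01 * rng.standard_normal((10, d)), np.zeros(10))
print(weighted_train_loss(x, y, feats, labels))
```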
Table 1: Comparison of results for hyper-cleaning. We compare the time and F1 score of various algorithms in achieving specific test accuracies (91.50% for MNIST and 83.00% for Fashion MNIST). Bold font indicates the fastest time to reach the target accuracy. If an algorithm fails to reach the required test accuracy, the time is recorded up to the highest accuracy it achieves.

                        MNIST                        Fashion MNIST
Method          Time (s)  Acc. (%)  F1 score   Time (s)  Acc. (%)  F1 score
qNBO (BFGS)       0.42     91.54     95.34       0.83     83.04     93.56
qNBO (SR1)        3.68     91.51     94.59       1.53     83.02     94.09
BOME              6.31     91.50     94.94       3.59     83.00     93.48
SABA              3.35     91.44     94.79      44.29     82.79     88.81
F2SA              8.06     91.46     93.31       8.72     82.98     86.55
SHINE-OPA        20.25     91.51     95.44       9.96     83.07     93.87
PZOBO             1.05     91.46     95.46       2.96     83.05     93.77

As shown in Figure 3, qNBO (BFGS) significantly outperforms the other methods, achieving lower test loss and higher test accuracy more quickly. All results are averaged over 10 random trials. Table 1 shows that while qNBO, BOME, and SHINE-OPA all achieve the required accuracy, qNBO (BFGS) does so in the shortest time. Notably, both F2SA and SABA fail to reach the target accuracy on either dataset. For example, on the MNIST dataset, qNBO (BFGS) requires less than one-tenth of the time taken by BOME, the second-fastest method. BSG1 is excluded from this experiment because it was ineffective at these data hyper-cleaning problems.

4.4 META-LEARNING

In this subsection, we consider few-shot meta-learning, which can be described as:
$$\min_x \frac{1}{m}\sum_{i=1}^m \mathcal{L}_{D_i}(x, y_i^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_y \frac{1}{m}\sum_{i=1}^m \mathcal{L}_{S_i}(x, y_i).$$
Due to the superior performance demonstrated by PZOBO over the baseline methods MAML (Finn et al., 2017) and ANIL (Raghu et al., 2020), we compare qNBO (BFGS) exclusively against PZOBO, excluding the aforementioned baselines.

Figure 4: 5-way 5-shot experiments on two datasets (Left: miniImageNet; Right: Omniglot).

As shown in Figure 4 and Table 2, qNBO (BFGS) achieves superior accuracy compared to PZOBO on the miniImageNet (Vinyals et al., 2016) and Omniglot (Lake et al., 2015) datasets. Results are averaged over 5 runs, with all algorithms starting from the same initial point with a test accuracy of 20%. For clarity, the graphs begin at the second data point, omitting the initial one. qNBO (BFGS) reaches its peak test accuracy on the Omniglot dataset within 800 seconds, after which performance declines, likely due to overfitting or other factors. Notably, on the miniImageNet dataset, qNBO (BFGS) attains a test accuracy exceeding 60%, while PZOBO fails to achieve comparable results. Furthermore, the results in Table 2 show that qNBO (BFGS) reaches higher test accuracy in significantly less time than PZOBO as the number of ways increases. We do not plot curves for other methods, such as SABA, SHINE-OPA, and BOME, as they are difficult to make converge under various hyperparameter configurations using their source code.

Table 2: Few-shot meta-learning on the Omniglot dataset: highest test accuracy and time required by each algorithm.

                   PZOBO                 qNBO (BFGS)
5-shot      Acc. (%)   Time (s)     Acc. (%)   Time (s)
5-way        99.31      11124        99.14       772
20-way       97.94      17869        99.14      1648
30-way       97.15      21043        99.05      2978

5 CONCLUSION

This paper introduces qNBO, a flexible algorithmic framework for improving hypergradient approximation.
It leverages quasi-Newton techniques to accelerate the solution of the lower-level problem and to efficiently approximate the inverse Hessian-vector product in the hypergradient computation. Notably, qNBO includes a subroutine using quasi-Newton recursion schemes specifically tailored to the direction $\nabla_y F(x, y)$ in order to avoid incorrect inversion. Furthermore, in addition to the prominent BFGS and SR1 methods, qNBO can integrate other quasi-Newton methods, such as limited-memory BFGS (L-BFGS) and single-loop implementations. Extensive numerical results verify the efficiency of the proposed algorithms.

ACKNOWLEDGMENTS

Authors are listed in alphabetical order. This work is supported by the National Key R&D Program of China (2023YFA1011400), the National Natural Science Foundation of China (12222106, 12271097, 12326605, 62331014, 12371305), the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020082), the Key Program of the National Science Foundation of Fujian Province of China (Grant No. 2023J02007), the Central Guidance on Local Science and Technology Development Fund of Fujian Province (Grant No. 2023L3003), and the Fujian Alliance of Mathematics (Grant No. 2023SXLMMS01). We sincerely thank the anonymous reviewers for their insightful comments and suggestions on this work.

REFERENCES

Michael Arbel and Julien Mairal. Amortized implicit differentiation for stochastic bilevel optimization. In International Conference on Learning Representations, 2022.

Sébastien M. R. Arnold, Praateek Mahajan, Debajyoti Datta, Ian Bunner, and Konstantinos Saitas Zarkias. learn2learn: A library for meta-learning research. arXiv preprint arXiv:2008.12284, 2020.

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models. Advances in Neural Information Processing Systems, 33:5238–5250, 2020.

Charles G. Broyden. Quasi-Newton methods and their application to function minimisation. Mathematics of Computation, 21(99):368–381, 1967.

Charles G. Broyden. The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA Journal of Applied Mathematics, 6(3):222–231, 1970a.

Charles G. Broyden. The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970b.

Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1):129–156, 1994.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. URL https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Accessed: 2021-05-06.

Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2466–2488. PMLR, 2022.

Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1294–1303, 2019.

Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35:26698–26710, 2022.
William C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1–17, 1991.

Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Thomas Elsken, Benedikt Staffler, Jan Hendrik Metzen, and Frank Hutter. Meta-learning of neural architectures for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12365–12375, 2020.

Jennifer B. Erway and Roummel F. Marcia. On solving large-scale limited-memory quasi-Newton equations. Linear Algebra and its Applications, 515:196–225, 2017.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR, 2017.

Roger Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317–322, 1970.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165–1173. PMLR, 2017.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568–1577. PMLR, 2018.

Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley Osher, and Wotao Yin. JFB: Jacobian-free backpropagation for implicit networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6648–6656, 2022.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Tommaso Giovannelli, Griffin Kent, and Luis Nunes Vicente. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. arXiv preprint arXiv:2110.00604, 2021.

Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109):23–26, 1970.

Donald Goldfarb, Yi Ren, and Achraf Bahamou. Practical quasi-Newton methods for training deep neural networks. Advances in Neural Information Processing Systems, 33:2386–2396, 2020.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: Squeezing more curvature out of data. In International Conference on Machine Learning, pp. 1869–1878. PMLR, 2016.

Bin Gu, Guodong Liu, Yanfu Zhang, Xiang Geng, and Heng Huang. Optimizing large-scale hyperparameters via automated learning algorithm. arXiv preprint arXiv:2102.09026, 2021.

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882–4892. PMLR, 2021.

Kaiyi Ji, Mingrui Liu, Yingbin Liang, and Lei Ying. Will bilevel optimizers benefit from loops. Advances in Neural Information Processing Systems, 35:3011–3023, 2022.
Qiujiang Jin and Aryan Mokhtari. Non-asymptotic superlinear convergence of standard quasi-Newton methods. Mathematical Programming, 200(1):425–473, 2023.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D. Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pp. 18083–18113. PMLR, 2023.

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D. Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. In International Conference on Learning Representations, 2024.

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pp. 331–339. Elsevier, 1995.

Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. BOME! Bilevel Optimization Made Easy: A simple first-order approach. Advances in Neural Information Processing Systems, 35:17248–17262, 2022.

Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2018.

Risheng Liu, Xuan Liu, Shangzhi Zeng, Jin Zhang, and Yixuan Zhang. Value-function-based sequential minimization for bi-level optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

Takayuki Okuno, Akiko Takeda, Akihiro Kawana, and Motokazu Watanabe. On ℓp-hyperparameter learning via bilevel nonsmooth optimization. The Journal of Machine Learning Research, 22(1):11093–11139, 2021.

Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. Advances in Neural Information Processing Systems, 31, 2018.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746. PMLR, 2016.

David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2020.

Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.

Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and Thomas Moreau. SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models. In International Conference on Learning Representations, 2022.
Anton Rodomanov and Yurii Nesterov. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1):785–811, 2021a.

Anton Rodomanov and Yurii Nesterov. New results on superlinear convergence of classical quasi-Newton methods. Journal of Optimization Theory and Applications, 188:744–769, 2021b.

Anton Rodomanov and Yurii Nesterov. Rates of superlinear convergence for classical quasi-Newton methods. Mathematical Programming, 194:159–190, 2022.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.

David F. Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.

Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pp. 30992–31015. PMLR, 2023.

Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. ES-MAML: Simple Hessian-free meta learning. In International Conference on Learning Representations, 2019.

Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A primal-dual approach to bilevel optimization with multiple inner minima. arXiv preprint arXiv:2203.01123, 2022a.

Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for Hessian-free bilevel algorithms. Advances in Neural Information Processing Systems, 35:4136–4149, 2022b.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. Advances in Neural Information Processing Systems, 32, 2019.

Haishan Ye, Dachao Lin, Xiangyu Chang, and Zhihua Zhang. Towards explicit superlinear convergence rate for SR1. Mathematical Programming, 199(1):1273–1303, 2023.

A OUTLINE OF APPENDIX

The appendix is organized as follows: Appendix B presents details of the recursive algorithms for computing the inverse quasi-Newton matrix-vector product. Appendix C provides details of the numerical experiments from Section 4. Appendix D includes the proofs of the theorems from Section 3.2: Appendix D.1 reviews some useful results on the BFGS method; Appendix D.2 provides the proof sketch of Theorem 3.7; Appendix D.3 presents the proof of Theorem 3.4; Appendix D.4 presents the proof of Theorem 3.7. Appendix E contains the theoretical discussion and complexity analysis of the proposed algorithms.

B RECURSIVE PROCEDURE TO COMPUTE THE INVERSE HESSIAN APPROXIMATION-VECTOR PRODUCT

Due to the low-rank structure of the updates in (3) and (4), for any vector $d$, $r = H_{t+1} d$ can be computed efficiently using the recursive methods detailed in Algorithm 4 for the BFGS update (Nocedal, 1980) and Algorithm 5 for the SR1 update (Erway & Marcia, 2017). Note that Algorithms 4 and 5 involve only the computation of first-order information, provided $H_0$ is a scalar multiple of the identity matrix. Consequently, by avoiding the storage and computation of the full Hessian, computational costs can be significantly reduced.
Algorithm 4 $C_b(d, H_0, \{s_i, g_i\}_{i=0}^{t-1})$: two-loop recursion for computing $r = H_t d$ when $H_t$ is the inverse of the BFGS matrix.
1: $q = d$
2: for $i = t-1, t-2, \ldots, 0$:
     $\alpha_i = (s_i^T q)/(g_i^T s_i)$; $q = q - \alpha_i g_i$
   end for
3: $r = H_0 q$
4: for $i = 0, \ldots, t-1$:
     $\beta = (g_i^T r)/(g_i^T s_i)$; $r = r + (\alpha_i - \beta) s_i$
   end for
Return $r = H_t d$.

Algorithm 5 $C_s(d, H_0, \{s_i, g_i\}_{i=0}^{t-1})$: computing $r = H_t d$ when $H_t$ is the inverse of an SR1 matrix.
1: for $i = 0, \ldots, t-1$:
     $p_i = s_i - H_0 g_i$
2:   for $j = 0, \ldots, i-1$:
       $p_i = p_i - ((p_j^T g_i)/(p_j^T g_j)) p_j$
     end for
3: end for
Return $r = H_0 d + \sum_{i=0}^{t-1} ((p_i^T d)/(p_i^T g_i)) p_i$.
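Both recursions translate directly into code. The NumPy sketch below implements Algorithms 4 and 5 (function names are ours) and verifies the two-loop recursion against the explicit BFGS update (3) on random curvature pairs; for simplicity, $H_0$ is passed as a dense matrix rather than the scalar multiple of the identity used in the experiments.

```python
import numpy as np

def bfgs_two_loop(d, H0, pairs):
    """Algorithm 4: r = H_t d for the inverse BFGS matrix built from H0 and {(s_i, g_i)}."""
    q = d.copy()
    alphas = []
    for s, g in reversed(pairs):                          # first loop: i = t-1, ..., 0
        a = (s @ q) / (g @ s)
        q -= a * g
        alphas.append(a)
    r = H0 @ q
    for (s, g), a in zip(pairs, reversed(alphas)):        # second loop: i = 0, ..., t-1
        beta = (g @ r) / (g @ s)
        r += (a - beta) * s
    return r

def sr1_recursion(d, H0, pairs):
    """Algorithm 5: r = H_t d for the inverse SR1 matrix."""
    ps = []
    for i, (s, g) in enumerate(pairs):
        p = s - H0 @ g
        for pj, (_, gj) in zip(ps, pairs[:i]):
            p -= ((pj @ g) / (pj @ gj)) * pj
        ps.append(p)
    r = H0 @ d
    for p, (_, g) in zip(ps, pairs):
        r += ((p @ d) / (p @ g)) * p
    return r

# Check Algorithm 4 against the explicit update (3) applied t times.
rng = np.random.default_rng(6)
n, t = 6, 4
H0 = np.eye(n)
H, pairs = H0.copy(), []
for _ in range(t):
    s, g = rng.standard_normal(n), rng.standard_normal(n)
    if g @ s < 0:
        g = -g                                            # keep the curvature condition s^T g > 0
    rho = 1.0 / (g @ s)
    V = np.eye(n) - rho * np.outer(s, g)
    H = V @ H @ V.T + rho * np.outer(s, s)
    pairs.append((s, g))
d = rng.standard_normal(n)
print(np.allclose(bfgs_two_loop(d, H0, pairs), H @ d))    # True
```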
C DETAILS ON EXPERIMENTS

In this section, we additionally compare more algorithms, such as AMIGO (Arbel & Mairal (2022) with the CG method), AID-BIO (Ji et al. (2021) with the CG method), and AID-TN (Ji et al. (2021) with the truncated Neumann method). We also add a meta-learning experiment comparing with the PZOBO algorithm. All experiments are conducted on a server equipped with two NVIDIA A40 GPUs, an Intel(R) Xeon(R) Gold 6326 CPU, and 256 GB of RAM.

C.1 DETAILS OF THE TOY EXAMPLE

In this experiment, the initial point for all algorithms is $(x_0, y_0) = (2e, 2e)$, where $e$ denotes the vector of all ones.

Figure 5: Numerical results on the toy example: comparison of qNBO (SR1), qNBO (BFGS), and qNBO (BFGS)-ws with other bilevel optimization methods. Here, "ws" indicates that a warm-start strategy is applied for $u_k$, with $Q_k = \min(k+1, 60)$.

BOME: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 100$, the inner step size is $\alpha = 0.1$, the outer step size is $\xi = 0.1$, and
$$\lambda_k = \max\left\{\frac{0.0001\,\hat q(x_k, y_k) - \langle\nabla F(x_k, y_k), \nabla\hat q(x_k, y_k)\rangle}{\|\nabla\hat q(x_k, y_k)\|^2},\; 0\right\}.$$

F2SA: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 10$, the inner step sizes are $\gamma_k = \alpha_k = 0.1$, the initial multiplier is $\lambda_0 = 0.1$, the multiplier increment is $\delta_k = 0.001$, and the step-size ratio is $\xi = 1$.

SHINE-OPA: The maximum number of outer iterations is $K = 5000$, the maximum number of inner iterations is $T = 100$, the inner stopping criterion is $\|\nabla_y f(x_k, y_{k+1})\| \le \frac{1}{k+1}$, the inner step size is determined using strong Wolfe line search, the number of extra updates of upper-level information in the BFGS algorithm is 5 (i.e., every 5 BFGS iterations include an update with the UL gradient $\nabla_y F$), the initial matrix is $H_0 = I$, and the outer step size is $\alpha_k = 0.1$.

Figure 6: Testing results on the impact of the parameter $Q_k$ in qNBO (BFGS) in the toy example. (The qNBO (BFGS) parameters are the same as those in Figure 1, except for $Q_k$. Additionally, $Q_k =$ ws means that $Q_k = \min(k+1, 60)$, where the warm-start strategy is used for $u_k$.)

AID-TN: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 1$, the TN iteration is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 0.2$.

AID-BIO/AMIGO-CG: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 1$, the CG iteration is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 0.2$.

PZOBO: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $Q = N = 10$, the parameter is $\mu = 100$, the inner step size is $\alpha = 0.01$, and the outer step size is $\beta = 0.01$.

qNBO (SR1): The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 6$, the number of warm-up iterations is $P = 9$, the number of iterations is $Q_k = 25$, the inner step sizes are $\beta = 0.1$ and $\gamma = 1$, the initial matrix is $H_0 = I$, and the outer step size is $\alpha = 0.4$.

qNBO (BFGS): The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 15$, the number of warm-up iterations is $P = 1$, the number of iterations is $Q_k = k + 1$, the inner step sizes are $\beta = 0.1$ and $\gamma = 1$, the initial matrix is $H_0 = I$, and the outer step size is $\alpha = 0.1$.

C.2 FURTHER SPECIFICATIONS FOR LOGISTIC REGRESSION

C.2.1 IMPLEMENTATIONS AND HYPERPARAMETER SETTINGS

This section introduces the specific parameters of the different algorithms used in the logistic regression experiment, formulated as the bilevel problem
$$\min_x \sum_{(a_i', b_i')\in D_{val}} \ell(a_i', b_i', y^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^n} \sum_{(a_i, b_i)\in D_{train}} \ell(a_i, b_i, y) + \frac{\exp(x)}{2}\|y\|^2, \tag{11}$$
where $(a_i, b_i) \in D_{train}$ and $(a_i', b_i') \in D_{val}$ are the training and validation data, respectively, and $\ell(a_i, b_i, y) := \log\left(1 + \exp\left(-b_i a_i^T y\right)\right)$. The LL variable $y$ is the model's parameter, while the UL variable $x$ is the regularization hyperparameter.

In this task, we consider two datasets, 20News and Real-sim. The 20News dataset (Lang, 1995) comprises 18,846 samples with 130,107 features, divided into three subsets: the training set $D_{train}$ with 16,961 samples, the validation set $D_{val}$ with 943 samples, and the test set $D_{test}$ with 942 samples. Similarly, the Real-sim dataset (Chang & Lin) contains 72,309 samples, each with 20,958 features, split into a training set of 65,078 samples, a validation set of 3,616 samples, and a test set of 3,615 samples.

For all algorithms, we set $x_0$ to 0 and $y_0$ to a random value. Unless otherwise stated, the batch size is 200. In the following, AMIGO refers to the algorithm of Arbel & Mairal (2022) that employs stochastic gradient descent to solve the auxiliary quadratic problem.

BOME: The maximum number of outer iterations is $K = 200$, the number of inner iterations is $T = 10$, the inner step size is $\alpha = 0.01$, and the outer step sizes are $\xi_x = 0.1$ and $\xi_y = 0.01$. The update rule for $\lambda_k$ is
$$\lambda_k = \max\left\{\frac{0.5\,\hat q(x_k, y_k) - \langle\nabla F(x_k, y_k), \nabla\hat q(x_k, y_k)\rangle}{\|\nabla\hat q(x_k, y_k)\|^2},\; 0\right\}.$$

F2SA: The maximum number of outer iterations is $K = 2000$, the number of inner iterations is $T = 10$, the starting point is $z_0 = y_0$, the inner step sizes are $\gamma_k = \alpha_k = 0.1$, the initial multiplier is $\lambda_0 = 0.1$, the multiplier increment is $\delta_k = 0.0001$, the step-size ratio is $\xi = 1$, and the batch size is 1000.

SABA: The maximum number of outer iterations is $K = 20000$, the initial point is $v_0 = 0$, the step sizes are $\alpha_k = 0.125$ and $\beta_k = \beta_k^v = 0.125$, and the batch size is 32.

BSG1: The maximum number of outer iterations is $K = 300$, the number of inner iterations is $T = 10$, the inner step size is $\beta_k = 0.01$, and the outer step size is $\alpha_k = 0.01$.

SHINE-OPA: The maximum number of outer iterations is $K = 30$, and the maximum number of inner iterations is $T = 1000$. The initial matrix is $H_0 = I$; for more details, see the SHINE code at https://github.com/zaccharieramzi/hoag/tree/shine.

qNBO (BFGS): The maximum number of outer iterations is $K = 50$, the number of inner iterations is $T = 9$, the number of warm-up iterations is $P = 1$, the number of iterations is $Q_k = 1$, the inner step sizes are $\beta = 0.0001/(j+1)$ and $\gamma = 0.1$, and the outer step-size selection strategy is the same as for SHINE-OPA. For the 20News dataset the initial matrix is $H_0 = 0.1I$; for the Real-sim dataset, $H_0 = 0.01I$.
qNBO (SR1): The maximum number of outer iterations is $K = 100$, the number of inner iterations is $T = 7$, the number of warm-up iterations is $P = 3$, the number of iterations is $Q_k = 3$, the inner step sizes are $\beta = 0.0001/(j+1)$ and $\gamma = 0.1$, the initial matrix is $H_0 = 0.1I$, and the outer step-size selection strategy is the same as for SHINE-OPA.

AID-BIO/AMIGO-CG: The maximum number of outer iterations is $K = 1000$, the number of inner iterations is $T = 1$, the CG iteration is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 0.01$.

AMIGO: The maximum number of outer iterations is $K = 1000$, the batch size is 32, the number of inner iterations is $T = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 0.1$.

AID-TN: The maximum number of outer iterations is $K = 1000$, the number of inner iterations is $T = 1$, the TN iteration is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 0.1$.

PZOBO: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $Q = N = 10$, the parameter is $\mu = 10$, the inner step size is $\alpha = 0.01$, and the outer step size is $\beta = 0.03$.

C.3 FURTHER SPECIFICATIONS ON DATA HYPER-CLEANING EXPERIMENTS

This subsection focuses on data hyper-cleaning to enhance model accuracy, using a noisy training set $D_{train} := \{a_i, b_i\}_{i=1}^m$ and a clean validation set $D_{val}$. The goal is to adjust the training-data weights to enhance performance on $D_{val}$. The task can be formalized as the bilevel problem
$$\min_x \; \ell_{val}(y^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_y \{\ell_{train}(x, y) + c\|y\|^2\}, \tag{12}$$
where $\ell_{val}$ is the validation loss on $D_{val}$ and $\ell_{train} = \sum_{i=1}^m \sigma(x_i)\,\ell(a_i, b_i, y)$ is a weighted training loss with $\sigma(x) = \mathrm{Clip}(x, [0, 1])$ and $x \in \mathbb{R}^m$. In the experiment, both $\ell(a_i, b_i, y)$ and $\ell_{val}$ are the cross-entropy loss, with $c = 0.001$.

Experiments are conducted using the MNIST (Deng, 2012) and Fashion MNIST (Xiao et al., 2017) datasets, with 50% of the training data corrupted by randomly assigned sampled labels. The data is divided into four parts: the training set, validation sets 1 and 2, and the test set. The training set comprises 50,000 samples, while the validation and test sets contain 5,000 and 10,000 samples, respectively. For each method, model training is conducted on the training set, with the hyperparameter $x$ tuned using validation set 1. The LL variable $y = (W, b)$ denotes the parameters of the linear model with weight $W \in \mathbb{R}^{10\times 784}$ and bias $b \in \mathbb{R}^{10}$.

C.3.1 IMPLEMENTATIONS AND HYPERPARAMETER SETTINGS

The initial point $y_0$ for all algorithms is obtained from a pretrained initialization model, and the initial weight vector is $x_0 = 0.5e \in \mathbb{R}^{50000}$.

BOME: The maximum number of outer iterations is $K = 10000$, the number of inner iterations is $T = 1$, the inner step size is $\alpha = 0.01$, the outer step sizes are $\xi_x = 100$ and $\xi_y = 0.01$, and
$$\lambda_k = \max\left\{\frac{0.1\,\hat q(x_k, y_k) - \langle\nabla F(x_k, y_k), \nabla\hat q(x_k, y_k)\rangle}{\|\nabla\hat q(x_k, y_k)\|^2},\; 0\right\}.$$
Details can be found in the BOME code at https://github.com/Cranial-XIX/BOME.

F2SA: The maximum number of outer iterations is $K = 7000$, the number of inner iterations is $T = 1$, the initial point is $z_0 = y_0$, the inner step sizes are $\gamma_k = \alpha_k = 0.01$, the initial multiplier is $\lambda_0 = 0.1$, the multiplier increment is $\delta_k = 0.001$, the step-size ratio is $\xi = 10000$, and the batch size is 2000.

SABA: The maximum number of outer iterations is $K = 10000$, the initial point is $v_0 = 0$, and the batch size is 2000. For the MNIST dataset, the step sizes are $\alpha_k = 10$, $\beta_k = 0.01$, and $\beta_k^v = 0.1$; otherwise, $\alpha_k = 100$ and $\beta_k = \beta_k^v = 0.001$.
SHINE-OPA: The maximum number of outer iterations is $K = 50$, the maximum number of inner iterations is $T = 1000$, the inner stopping criterion is $\|\nabla_y f(x_k, y_{k+1})\| \le \frac{1}{100k}$, the inner step size is determined using strong Wolfe line search, the number of extra updates in the BFGS algorithm is 5 (i.e., upper-level information is introduced into the BFGS iterations every 5 steps), the initial matrix is $H_0 = I$, and the outer step size is 100.

Figure 7: Data hyper-cleaning on two datasets: (a) test accuracy vs. time; (b) test loss vs. time. (First row: MNIST; second row: Fashion MNIST. All results are averaged over 10 random trials. BSG1 is excluded from this experiment because it was ineffective at these data hyper-cleaning problems.)

qNBO (BFGS): The number of iterations is $Q_k = 1$, the inner step sizes are $\beta = 0.1$ and $\gamma = 0.1$, the outer step size is $\alpha = 100$, and the initial matrix is $H_0 = I$. For the MNIST dataset, the maximum number of outer iterations is $K = 5000$, the number of inner iterations is $T = 7$, and the warm-up iteration count is $P = 3$; otherwise, $K = 600$, $T = 47$, and $P = 3$.

qNBO (SR1): The number of inner iterations is $T = 17$, the warm-up iteration count is $P = 3$, the number of iterations is $Q_k = 3$, the inner step sizes are $\beta = \gamma = 0.1$, the outer step size is $\alpha = 100$, and the initial matrix is $H_0 = 0.01I$. In addition, the inner iteration terminates early if $\|\nabla_y f(x_k, y_{k+1})\| \le 0.1$. For the MNIST dataset, the maximum number of outer iterations is $K = 5000$; otherwise, $K = 2000$.

AID-BIO/AMIGO-CG: The maximum number of outer iterations is $K = 1000$, the number of inner iterations is $T = 1$, the number of CG steps is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 10$.

AMIGO: The maximum number of outer iterations is $K = 1000$, the batch size is 200, the number of inner iterations is $T = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 100$.

AID-TN: The maximum number of outer iterations is $K = 1000$, the number of inner iterations is $T = 1$, the number of TN steps is $P = 1$, the inner step size is $\beta = 0.01$, and the outer step size is $\alpha = 30$.

PZOBO: The maximum number of outer iterations is $K = 5000$, the number of inner iterations is $Q = N = 10$, the parameter is $\mu = 0.01$, the inner step size is $\alpha = 0.01$, and the outer step size is $\beta = 10$.

C.4 META-LEARNING

In this subsection, we consider few-shot meta-learning, which can be described as
$$\min_x \frac{1}{m}\sum_{i=1}^m \mathcal{L}_{D_i}(x, y_i^*(x)) \quad\text{s.t.}\quad y^*(x) = \arg\min_y \frac{1}{m}\sum_{i=1}^m \mathcal{L}_{S_i}(x, y_i),$$
where $\mathcal{L}_{D_i}(x, y_i^*) = \frac{1}{|D_i|}\sum_{\xi\in D_i}\mathcal{L}(x, y_i^*;\xi)$ is the validation loss function and $\mathcal{L}_{S_i}(x, y_i) = \frac{1}{|S_i|}\sum_{\xi\in S_i}\left(\mathcal{L}(x, y_i;\xi) + \mathcal{R}(y_i)\right)$ is the training loss, with classification loss $\mathcal{L}$ and strongly convex regularizer $\mathcal{R}(y_i)$. In the experiment, $\mathcal{L}$ is the cross-entropy function and $\mathcal{R}$ is the $\ell_2$ norm. In our experimental setting, the task-specific parameters $y$ denote the weights of the last linear layer of a neural network, and $x$ are the parameters of a four-layer convolutional neural network (CNN4). Few-shot meta-learning involves $m$ tasks $\{\mathcal{T}_i,\, i = 1, \ldots, m\}$ sampled from a distribution $P_{\mathcal{T}}$. Each task $\mathcal{T}_i$ has a loss function $\mathcal{L}(x, y_i; \xi)$ with data sample $\xi$, task-specific parameters $y_i$, and the parameters $x$ of an embedding model shared by all tasks.
Figure 8: 5-way 5-shot on the FC100 dataset. (Results are averaged over 5 runs, with all algorithms starting from the same initial point with a test accuracy of 20%. For clarity, the graphs begin at the second data point, omitting the initial one.) We note that qNBO (BFGS) reaches its peak test accuracy within 800 seconds, after which performance declines, likely due to overfitting or other factors.

C.4.1 DATASETS

miniImageNet: The miniImageNet dataset (Vinyals et al., 2016), derived from ImageNet (Russakovsky et al., 2015), is a large-scale benchmark for few-shot learning. The dataset comprises 100 classes, each containing 600 images of size 84×84. Following Arnold et al. (2020), we partition the classes into 64 for meta-training, 16 for meta-validation, and 20 for meta-testing. In this experiment, CNN4 has four convolutional blocks, each containing a 3×3 convolution (padding = 1), ReLU activation, 2×2 max pooling, and batch normalization. Each convolutional layer has 32 filters.

FC100: The FC100 dataset (Oreshkin et al., 2018), generated from CIFAR-100 (Krizhevsky & Hinton, 2009), consists of 100 classes, each containing 600 images of size 32×32. Following Oreshkin et al. (2018), the classes are split into 60 for meta-training, 20 for meta-validation, and 20 for meta-testing. Each convolutional block of CNN4 comprises a 3×3 convolutional layer (with padding 1 and stride 2), followed by batch normalization, ReLU activation, and 2×2 max pooling. Each convolutional layer has 64 filters.

Omniglot: Comprising 1623 character classes derived from 50 diverse alphabets, the Omniglot dataset (Lake et al., 2015) contains 20 samples per class. The classes are divided into three parts: 1100 for meta-training, 100 for meta-validation, and 423 for meta-testing. The CNN4 network is identical to that used on the miniImageNet dataset but has 64 filters per layer.

C.4.2 EXPERIMENTAL SETUP

The meta-learning experiments are carried out using the code available at the following repository:

https://github.com/sowmaster/esjacobians

At each meta-iteration, a batch of 16 training tasks is sampled and the parameters are updated based on these tasks. The maximum number of outer steps K is set to 6000 for both algorithms. The initial parameter $y_0$ is set to 0.

PZOBO: The parameters involved in the algorithm are the same as those in Sow et al. (2022b).

qNBO (BFGS): For all three datasets, the number of inner steps T is set to 20, the number of hypergradient updates $Q_k$ is 3, the step sizes $\beta$ and $\gamma$ are both 0.1, and the Hessian approximation is initialized as $H_0 = 0.01I$. In addition, the inner iteration terminates early if $\|\nabla_A f\| + \|\nabla_b f\| \le \mathrm{tol}$, where A and b are the components of the parameter y denoting the weights and bias of the final linear layer of the neural network. The specific values of tol and other parameters can be found in the code provided in the supplementary materials.
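For reference, the CNN4 backbone described in Section C.4.1 can be sketched as follows (miniImageNet variant; an illustrative PyTorch rendering whose exact layer ordering may differ from the authors' implementation in minor details).

```python
import torch.nn as nn

# Illustrative CNN4 backbone per Section C.4.1 (miniImageNet variant):
# four blocks of 3x3 conv (padding 1), ReLU, 2x2 max pooling, batch norm,
# each with 32 filters; the FC100/Omniglot variants use 64 filters.
def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.BatchNorm2d(out_ch),
    )

def cnn4(in_channels=3, width=32):
    blocks = [conv_block(in_channels, width)]
    blocks += [conv_block(width, width) for _ in range(3)]
    return nn.Sequential(*blocks, nn.Flatten())
```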
Figure 9: Ablation study on the inner iteration number T of qNBO (BFGS). (The first row shows the test accuracy and test loss for MNIST, and the second row shows these metrics for FashionMNIST; each panel plots the metric against running time in seconds, comparing T = 500 and T = 200.)

C.5 ABLATION STUDY

C.5.1 TOY EXAMPLE

In this subsection, we conduct an ablation study to assess the impact of the parameter $Q_k$ on the performance of the qNBO (BFGS) algorithm in the toy experiment. As illustrated in Figure 6, the setting $Q_k = k + 1$ outperforms the others. Furthermore, employing the warm-start strategy for $u_k$ (denoted as $Q_k$ = ws) enhances the performance of qNBO (BFGS), suggesting the potential benefits of this strategy.

C.5.2 DATA HYPER-CLEANING

In this subsection, we conduct an ablation study to assess the impact of the parameters T and $Q_k$ on the performance of the qNBO (BFGS) algorithm in the data hyper-cleaning experiment. As illustrated in Figures 9 and 10, smaller values of T and $Q_k$ lead to improved performance in terms of both accuracy and loss across the two datasets, indicating the efficiency of qNBO (BFGS).

Figure 10: Ablation study on the iteration number Q of qNBO (BFGS). (The first row shows the test accuracy and test loss for the MNIST dataset, and the second row shows the same metrics for the FashionMNIST dataset; the horizontal axis is running time in seconds.)

D PROOF OF THE RESULTS IN SECTION 3.2

D.1 RESULTS OF QUASI-NEWTON METHOD

In this subsection, we first summarize the convergence properties of the BFGS method for solving the problem
\[ \min_{y \in \mathbb{R}^n} \ g(y). \tag{13} \]
In addition, to derive the upper bound on the hypergradient estimation error, we review some conclusions on the BFGS update presented in Jin & Mokhtari (2023) and Rodomanov & Nesterov (2022; 2021b).

D.1.1 CONVERGENCE RESULTS OF BFGS METHOD

Assumption D.1. Assume that g has the following properties:
(i) g(y) is strongly convex w.r.t. y with parameter $\mu > 0$, i.e., $\mu I \preceq \nabla^2 g(y)$. Moreover, the gradient of g is Lipschitz continuous w.r.t. y with parameter $L > 0$ (i.e., $\nabla^2 g(y) \preceq LI$).
(ii) The Hessian $\nabla^2 g(y)$ satisfies
\[ \nabla^2 g(y_1) - \nabla^2 g(y_2) \preceq M \|y_1 - y_2\|_z\, \nabla^2 g(w), \quad \forall\, y_1, y_2, z, w \in \mathbb{R}^n, \]
where $\|y\|_z := \langle \nabla^2 g(z) y, y\rangle^{1/2}$ and $M > 0$.

Lemma D.2 (Rodomanov & Nesterov (2021b), global convergence). Suppose that the function g has the quadratic form
\[ g(y) = \tfrac{1}{2} y^T A y - y^T x, \tag{14} \]
where $\mu I \preceq A \preceq LI$, so that Assumption D.1 holds. If the BFGS method is used to solve problem (13) with $B_0 = LI$ (i.e., $H_0 = L^{-1}I$), then for all $i \ge t_b := 4n \ln \frac{L}{\mu}$, the following inequality holds:
\[ \|y_i - y^*\| \le 2\kappa^{3/2}\Big(\frac{t_b}{i}\Big)^{i/2}\|y_0 - y^*\|, \tag{15} \]
where $\kappa = L/\mu$.

Lemma D.3 (Rodomanov & Nesterov (2021b), local convergence). Suppose that Assumption D.1 holds for g. If $B_0 = LI$ (i.e., $H_0 = L^{-1}I$) and the initial point $y_0$ satisfies
\[ \|y_0 - y^*\| \le K_1, \qquad K_1 = \frac{2\ln\frac{3}{2}}{3M\sqrt{\mu}}\,\max\Big\{\frac{\mu}{2L},\ \frac{1}{K_0 + 9}\Big\}, \tag{16} \]
where $K_0 = 8n\ln\frac{2L}{\mu}$, then for all $i \ge K_0$, the iterates generated by the BFGS algorithm enjoy the following superlinear convergence rate:
\[ \|y_i - y^*\| \le 2\kappa^{3/2}\Big(\frac{t_c}{i}\Big)^{i/2}\|y_0 - y^*\|, \tag{17} \]
with $t_c = \frac{9}{8}K_0$ and $\kappa = L/\mu$.

Lemma D.4 (Theorem 3 of Jin & Mokhtari (2023)). Suppose that Assumption D.1 holds for g. If the initial point $y_0$ and the initial Hessian approximation matrix $B_0$ (with $H_0 = B_0^{-1}$) satisfy
\[ \|\nabla^2 g(y^*)^{1/2}(y_0 - y^*)\| \le \epsilon, \qquad \big\|\nabla^2 g(y^*)^{-1/2}\big(B_0 - \nabla^2 g(y^*)\big)\nabla^2 g(y^*)^{-1/2}\big\|_F \le \delta, \tag{18} \]
where $\epsilon, \delta \in (0, \frac{1}{2})$, $\rho \in (0, 1)$, $\frac{(3+\epsilon)\epsilon}{(1-\epsilon)(1-\rho)} \le \delta$, and $\frac{\epsilon}{3} + 2\delta \le (1 - 2\delta)\rho$, then the iterates generated by the BFGS algorithm enjoy the following superlinear convergence rate:
\[ \|y_i - y^*\| \le \Big(\frac{C_1}{\sqrt{i}} + C_2\, q^i\Big)^i \|y_0 - y^*\|, \tag{19} \]
where $C_1 = 2\sqrt{2\delta}(1+\rho)(1+\frac{\epsilon}{3})$, $C_2 = \frac{(1+\rho)(1+\epsilon/3)\epsilon}{3(1-\rho)}$, and $q \in (0, 1)$ is a contraction factor determined by $\epsilon$, $\delta$, and $\rho$.
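For concreteness, a minimal sketch of the BFGS iteration analyzed in Lemmas D.2-D.4 is given below, written for problem (13) with $B_0 = LI$ (equivalently $H_0 = (1/L)I$). Here grad_g is an assumed gradient oracle; the plain unit step with a curvature safeguard is illustrative only, whereas the cited results rely on the precise step-size and initialization conditions stated in the lemmas.

```python
import numpy as np

# Minimal BFGS sketch for min_y g(y) in (13): maintain the inverse Hessian
# approximation H (H_0 = (1/L) I, i.e. B_0 = L I) via the standard update.
def bfgs(grad_g, y0, L, iters=100, tol=1e-10):
    y = y0.astype(float).copy()
    H = np.eye(y.size) / L
    g = grad_g(y)
    for _ in range(iters):
        if np.linalg.norm(g) <= tol:
            break
        s = -H @ g                     # quasi-Newton step (unit step size)
        y_next = y + s
        g_next = grad_g(y_next)
        v = g_next - g                 # gradient difference
        sv = s @ v
        if sv > 1e-12:                 # curvature condition keeps H positive definite
            rho = 1.0 / sv
            V = np.eye(y.size) - rho * np.outer(s, v)
            H = V @ H @ V.T + rho * np.outer(s, s)
        y, g = y_next, g_next
    return y
```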
To be specific, the superlinear convergence rate of the BFGS algorithm can also be expressed in the following alternative form.

Lemma D.5 (Corollary 4 of Jin & Mokhtari (2023)). Suppose that Assumption D.1 holds for g. If the initial point $y_0$ and the initial BFGS matrix $B_0$ satisfy
\[ \|\nabla^2 g(y^*)^{1/2}(y_0 - y^*)\| \le \frac{1}{300}, \qquad \big\|\nabla^2 g(y^*)^{-1/2}\big(B_0 - \nabla^2 g(y^*)\big)\nabla^2 g(y^*)^{-1/2}\big\|_F \le \frac{1}{7}, \tag{20} \]
then the iterates generated by the BFGS algorithm exhibit the following convergence rate:
\[ \|y_i - y^*\| \le \sqrt{\kappa}\,\Big(\frac{1}{i}\Big)^{i/2}\|y_0 - y^*\|. \tag{21} \]

D.1.2 PROPERTIES OF THE BFGS UPDATES

Lemma D.6. If the function g in problem (13) has the quadratic form
\[ g(y) = \tfrac{1}{2} y^T A y - y^T x, \tag{22} \]
with $\mu I \preceq A \preceq LI$, then the BFGS matrices $B_i$ satisfy
\[ \sum_{i=0}^{k-1}\frac{(B_i s_i - A s_i)^T A^{-1} (B_i s_i - A s_i)}{s_i^T B_i s_i} \le \frac{nL}{\mu}. \tag{23} \]

Proof. Define $\sigma_i := \sigma(A, B_i) = \langle A^{-1}, B_i - A\rangle = \mathrm{Tr}(A^{-1}(B_i - A)) \ge 0$. Then
\[ \sigma(A, B_i) - \sigma(A, B_{i+1}) = \langle A^{-1}, B_i - B_{i+1}\rangle = \frac{\langle B_i A^{-1} B_i s_i, s_i\rangle}{\langle B_i s_i, s_i\rangle} - 1 = \frac{\langle B_i (A^{-1} - B_i^{-1}) B_i s_i, s_i\rangle}{\langle B_i s_i, s_i\rangle} \ge \frac{\langle (B_i - A) A^{-1} (B_i - A) s_i, s_i\rangle}{\langle B_i s_i, s_i\rangle}, \]
where the last inequality follows from the fact that, since $A \preceq B_i$,
\[ (B_i - A)A^{-1}(B_i - A) = B_i A^{-1} B_i - 2B_i + A \preceq B_i A^{-1} B_i - B_i = B_i (A^{-1} - B_i^{-1}) B_i. \]
Thus, it is derived that
\[ \frac{\langle (B_i - A)A^{-1}(B_i - A)s_i, s_i\rangle}{\langle B_i s_i, s_i\rangle} \le \sigma_i - \sigma_{i+1}, \qquad 0 \le i \le k-1. \tag{24} \]
Finally, summing the above inequality over i yields
\[ \sum_{i=0}^{k-1}\frac{\langle (B_i - A)A^{-1}(B_i - A)s_i, s_i\rangle}{\langle B_i s_i, s_i\rangle} \le \sigma_0 - \sigma_k \le \sigma_0 = \sigma(A, LI) = \langle A^{-1}, LI - A\rangle \le \frac{nL}{\mu}. \]

Definition D.7. Define
\[ \psi(A, G) := \langle A^{-1}, G - A\rangle - \ln \mathrm{Det}(A^{-1}G), \tag{25} \]
where Det denotes the determinant of the matrix.

Definition D.8. Define $\theta(A, B, u) := \Big(\frac{\langle (B-A)A^{-1}(B-A)u,\, u\rangle}{\langle B A^{-1} B u,\, u\rangle}\Big)^{1/2}$, and let $\vartheta : (-1, +\infty) \to \mathbb{R}$ be the univariate function $\vartheta(t) := t - \ln(1+t) \ge 0$.

Remark D.9. On the interval $[0, +\infty)$, $\vartheta(t)$ satisfies
\[ \frac{t^2}{2(1+t)} \le \vartheta(t) \le \frac{t^2}{2+t}. \tag{26} \]

Lemma D.10 (Rodomanov & Nesterov (2021b), Lemma 5.2). Define $J_i := \int_0^1 \nabla^2 g(y_i + t s_i)\, dt$ and $y_{i+1} = y_i + s_i$. Then $J_i s_i = \nabla g(y_{i+1}) - \nabla g(y_i)$. If $B_0 = LI$, then for all $i \ge 0$ the BFGS matrices $B_i$ satisfy
\[ \frac{1}{\xi_i} \nabla^2 g(y_i) \preceq B_i \preceq \xi_i \frac{L}{\mu} \nabla^2 g(y_i), \tag{27} \]
\[ \frac{1}{\xi_{i+1}} J_i \preceq B_i \preceq \xi_{i+1} \frac{L}{\mu} J_i, \tag{28} \]
where $r_i := \|s_i\|_{y_i}$, $\xi_i := e^{M \sum_{j=0}^{i-1} r_j}\ (\ge 1)$, and M is the strong self-concordance constant of g.

Lemma D.11. Let $B_0 = LI$, so that the BFGS matrices $B_i$ satisfy (27) and (28). If $\bar\xi := \max_{i=0,\dots,k-1} \xi_{i+1} \le 2$ in (28), then the following inequality holds:
\[ \sum_{i=0}^{k-1} \theta_i^2 \le \frac{1}{\tilde\xi}\Big(\frac{nL}{\mu} + \sum_{i=0}^{k-1} \Delta_i\Big), \tag{29} \]
where $\tilde\xi = \frac{1}{2(\bar\xi^2 + \bar\xi)}$, $\theta_i := \theta(J_i, B_i, u_i)$, $\psi_i := \psi(J_i, B_i)$, $\tilde\psi_{i+1} := \psi(J_i, B_{i+1})$, and $\Delta_i := \psi_{i+1} - \tilde\psi_{i+1}$.

Proof. Note that $\frac{1}{\xi_{i+1}} J_i \preceq B_i \preceq \xi_{i+1}\frac{L}{\mu} J_i$. From Lemma 2.4 of Rodomanov & Nesterov (2022), it can further be deduced that
\[ \psi_i - \tilde\psi_{i+1} \ge \vartheta\Big(\frac{\theta_i}{\xi_{i+1}}\Big). \]
Since $\bar\xi \le 2$, it follows from the definition of $\theta_i$ that
\[ \theta_i^2 = \frac{\langle (B_i - J_i) J_i^{-1} (B_i - J_i) u_i, u_i\rangle}{\langle B_i J_i^{-1} B_i u_i, u_i\rangle} = 1 - \frac{\langle (2B_i - J_i) u_i, u_i\rangle}{\langle B_i J_i^{-1} B_i u_i, u_i\rangle} \le 1. \tag{30} \]
Then, by Remark D.9,
\[ \vartheta\Big(\frac{\theta_i}{\xi_{i+1}}\Big) \ge \frac{\theta_i^2/\xi_{i+1}^2}{2\big(1 + \theta_i/\xi_{i+1}\big)} \ge \frac{\theta_i^2}{2\big(\xi_{i+1}^2 + \xi_{i+1}\big)}. \tag{33} \]
Thus, it holds that
\[ \tilde\xi\, \theta_i^2 \le \psi_i - \tilde\psi_{i+1} = \psi_i - \psi_{i+1} + \Delta_i, \qquad i \in \{0, \dots, k-1\}, \tag{34} \]
where $\tilde\xi = \frac{1}{2(\bar\xi^2 + \bar\xi)}$ and
\[ \Delta_i := \psi_{i+1} - \tilde\psi_{i+1} = \langle J_{i+1}^{-1} - J_i^{-1}, B_{i+1}\rangle + \ln \mathrm{Det}(J_i^{-1} J_{i+1}). \tag{35} \]
By summing (34) over i and using $\psi_k \ge 0$, it follows that
\[ \tilde\xi \sum_{i=0}^{k-1} \theta_i^2 \le \psi_0 - \psi_k + \sum_{i=0}^{k-1}\Delta_i \le \psi(J_0, LI) + \sum_{i=0}^{k-1}\Delta_i = \langle J_0^{-1}, LI - J_0\rangle - \ln \mathrm{Det}(L J_0^{-1}) + \sum_{i=0}^{k-1}\Delta_i \le \frac{nL}{\mu} + \sum_{i=0}^{k-1}\Delta_i, \]
which establishes (29).

Notations: In Step 1 of Algorithm 1, $y_k^0 = y_k$ establishes the initial value of y at the start of the k-th iteration. After P warm-up iterations, we set $y_{k,0} = y_k^P$; the term $y_{k,T}$ then refers to the state of y after T further iterations with $x_k$ fixed. In Step 2 of Algorithm 1, $u_{k,Q_k}$ denotes the state of u after $Q_k$ iterations, with both $x_k$ and $y_{k+1}$ fixed.
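In this notation, one outer iteration of Algorithm 1 can be summarized schematically as follows (the x-update uses the approximate hypergradient built from $y_{k,T}$ and $u_{k,Q_k}$):

```latex
\begin{align*}
  y_k = y_k^0 &\xrightarrow{\ P\ \text{warm-up gradient steps}\ } y_{k,0} = y_k^P
    \xrightarrow{\ T\ \text{quasi-Newton steps}\ } y_{k,T} = y_{k+1},\\
  u_{k,0} &\xrightarrow{\ Q_k\ \text{quasi-Newton recursion steps along}\ \nabla_y F(x_k, y_{k+1})\ } u_{k,Q_k} = u_{k+1},\\
  x_{k+1} &= x_k - \alpha\,\widehat{\nabla}\Phi(x_k), \qquad
  \widehat{\nabla}\Phi(x_k) = \nabla_x F(x_k, y_{k,T}) - [\nabla^2_{xy} f(x_k, y_{k,T})]^{\mathsf{T}} u_{k,Q_k}.
\end{align*}
```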
D.2 PROOF SKETCH OF THEOREM 3.7

The proofs of Theorems 3.4 and 3.7 comprise three critical steps: first, we split the hypergradient approximation error into the T-step error $\|y_{k,T} - y_k^*\|^2$ of estimating the lower-level solution and the $Q_k$-step error $\|u_{k,Q_k} - u_k^*\|^2$; second, we establish upper bounds for these errors in terms of the errors at previous iterations; finally, we combine these results to substantiate the theorem's convergence. Since the proofs of Theorems 3.4 and 3.7 are similar, we only elaborate on the proof sketch of Theorem 3.7 below.

Step 1: decomposing the hypergradient estimation error. The hypergradient estimation error at the k-th iteration can be bounded by
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2}\Big)\|y_{k,T} - y_k^*\|^2 + 3M_{fxy}^2\,\|u_{k,Q_k} - u_k^*\|^2. \tag{37} \]

Step 2: upper-bounding the errors. The T-step error of estimating the lower-level solution is bounded by
\[ \|y_k^* - y_{k,T}\|^2 \le \tau\,\|y_{k-1}^* - y_{k-1,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\|\nabla\Phi(x_{k-1})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\frac{nL M_{fxy}^2 C_{Fy}^2}{\mu^3 \tilde\xi\, Q_{k-1}}, \tag{38} \]
where
\[ \tau = \kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P(1+\varepsilon) + 6\Big(1+\tfrac{1}{\varepsilon}\Big)L_y^2\alpha^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2 L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2 C_{Fy}^2 L_{fyy}^2}{\mu^4}\Big). \]
The $Q_k$-step error $\|u_{k,Q_k} - u_k^*\|^2$ is bounded by
\[ \|u_{k,Q_k} - u_k^*\|^2 \le \frac{2nL C_{Fy}^2}{\tilde\xi \mu^3 Q_k} + 4\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2 L_{fyy}^2}{\mu^4}\Big)\|y_k^* - y_{k,T}\|^2. \tag{39} \]

Step 3: combining Steps 1 and 2. Combining (37), (38), and (39), the upper bound on the hypergradient estimation error is derived as
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2 nL M_{fxy}^2 C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2 C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \]
where
\[ \delta_0 = 3\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\|y_0^* - y_0\|^2, \]
\[ \omega = 6\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2. \]
Owing to the $L_\Phi$-smoothness of $\Phi$, the final convergence result can then be established.

To prove Theorems 3.4 and 3.7, we first present the following lemma.

Lemma D.12 (Lemma 2.2 of Ghadimi & Wang (2018)). Under Assumptions 3.1 and 3.2, we have:
(i) For all x, y,
\[ \|\bar\nabla F(x; y) - \bar\nabla F(x; y^*(x))\| \le C\,\|y^*(x) - y\|, \]
where $\bar\nabla F(x; y) = \nabla_x F(x, y) - [\nabla^2_{xy} f(x, y)]^T[\nabla^2_{yy} f(x, y)]^{-1}\nabla_y F(x, y)$ and $C = L_{Fx} + \frac{L_{Fy} M_{fxy}}{\mu} + C_{Fy}\big(\frac{L_{fxy}}{\mu} + \frac{L_{fyy} M_{fxy}}{\mu^2}\big)$.
(ii) $y^*(x)$ is $L_y$-Lipschitz continuous in x: $\|y^*(x_1) - y^*(x_2)\| \le L_y\|x_1 - x_2\|$, where $L_y = \frac{M_{fxy}}{\mu}$.
(iii) $\nabla\Phi$ is $L_\Phi$-Lipschitz continuous in x: $\|\nabla\Phi(x_1) - \nabla\Phi(x_2)\| \le L_\Phi\|x_1 - x_2\|$, where $L_\Phi = L_{Fx} + \frac{(L_{Fy} + C) M_{fxy}}{\mu} + C_{Fy}\big(\frac{L_{fxy}}{\mu} + \frac{L_{fyy} M_{fxy}}{\mu^2}\big)$.

D.3 PROOF OF THEOREM 3.4

Lemma D.13. Suppose that Assumptions 3.1 and 3.2 hold. The error between the approximate hypergradient $\widehat\nabla\Phi(x_k)$ and the true hypergradient in Algorithm 1 can be bounded by
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2}\Big)\|y_{k,T} - y_k^*\|^2 + 3M_{fxy}^2\|u_{k,Q_k} - u_k^*\|^2. \tag{40} \]

Proof. Let $u_k^* = [\nabla^2_{yy} f(x_k, y_k^*)]^{-1}\nabla_y F(x_k, y_k^*)$. Then
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 = \big\|\nabla_x F(x_k, y_{k,T}) - [\nabla^2_{xy} f(x_k, y_{k,T})]^T u_{k,Q_k} - \nabla_x F(x_k, y_k^*) + [\nabla^2_{xy} f(x_k, y_k^*)]^T u_k^*\big\|^2 \]
\[ = \big\|\nabla_x F(x_k, y_{k,T}) - \nabla_x F(x_k, y_k^*) - [\nabla^2_{xy} f(x_k, y_{k,T})]^T(u_{k,Q_k} - u_k^*) - \big([\nabla^2_{xy} f(x_k, y_{k,T})] - [\nabla^2_{xy} f(x_k, y_k^*)]\big)^T u_k^*\big\|^2 \]
\[ \le 3\|\nabla_x F(x_k, y_{k,T}) - \nabla_x F(x_k, y_k^*)\|^2 + 3\|\nabla^2_{xy} f(x_k, y_{k,T})\|^2\|u_{k,Q_k} - u_k^*\|^2 + 3\|\nabla^2_{xy} f(x_k, y_{k,T}) - \nabla^2_{xy} f(x_k, y_k^*)\|^2\|u_k^*\|^2. \]
Based on Assumptions 3.1 and 3.2 (and the bound $\|u_k^*\| \le C_{Fy}/\mu$), it can be derived that
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3L_{Fx}^2\|y_{k,T} - y_k^*\|^2 + 3M_{fxy}^2\|u_{k,Q_k} - u_k^*\|^2 + \frac{3L_{fxy}^2 C_{Fy}^2}{\mu^2}\|y_{k,T} - y_k^*\|^2 = 3\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2}\Big)\|y_{k,T} - y_k^*\|^2 + 3M_{fxy}^2\|u_{k,Q_k} - u_k^*\|^2. \]
Lemma D.14. When the qNBO algorithm (Algorithm 1) is applied to solve problem (1), if the LL objective function f takes the quadratic form (22) and $H_0 = (1/L)I$, it holds that
\[ \min_{1 \le i \le Q_k}\|A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\| \le C_{Fy}\sqrt{\frac{nL}{\mu^3 Q_k}}, \qquad Q_k > 1. \]

Proof. From Lemma D.6, it follows that
\[ \sum_{i=1}^{Q_k}\frac{(A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})^T A (A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})}{u_{k,i}^T B_{k,i} u_{k,i}} = \sum_{i=1}^{Q_k}\frac{(A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})^T A (A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})}{\nabla_y F(x_k, y_{k+1})^T B_{k,i}^{-1} \nabla_y F(x_k, y_{k+1})} \le \frac{nL}{\mu}, \]
with $B_{k,i} = H_{k,i}^{-1}$ (here $u_{k,i} = H_{k,i}\nabla_y F(x_k, y_{k+1})$, so the two denominators coincide). Since $\mu I \preceq A \preceq LI$ and $A \preceq B_{k,i} \preceq \frac{L}{\mu}A$, we have
\[ \sum_{i=1}^{Q_k}\frac{\mu\,\|A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\|^2}{\frac{1}{\mu}\|\nabla_y F(x_k, y_{k+1})\|^2} \le \frac{nL}{\mu}. \]
Since $\|\nabla_y F(x_k, y_{k+1})\| \le C_{Fy}$, it can further be derived that
\[ \sum_{i=1}^{Q_k}\|A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\|^2 \le \frac{nL C_{Fy}^2}{\mu^3}. \tag{41} \]
Finally, since the minimum of $Q_k$ nonnegative terms is at most their average, we deduce that
\[ \min_{1 \le i \le Q_k}\|A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\| \le C_{Fy}\sqrt{\frac{nL}{\mu^3 Q_k}}. \]

Lemma D.15. Choose the parameters $\beta$ and P such that $(1-\beta\mu)^P\|y_k - y_k^*\| \le \frac{1}{300\sqrt{\mu}}$, and ensure $H_0$ satisfies
\[ \big\|[\nabla^2_{yy} f(x_k, y^*(x_k))]^{-1/2}\big(H_0^{-1} - \nabla^2_{yy} f(x_k, y^*(x_k))\big)[\nabla^2_{yy} f(x_k, y^*(x_k))]^{-1/2}\big\|_F \le \frac{1}{7}. \tag{42} \]
Then, under Assumptions 3.1 and 3.2, it holds that
\[ \|y_k^* - y_{k,T}\|^2 \le (1+\varepsilon)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\|y_{k-1,T} - y_{k-1}^*\|^2 + \Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\|x_k - x_{k-1}\|^2, \tag{43} \]
with a positive constant $\varepsilon$.

Proof. Under this choice of $\beta$, P, and $H_0$, condition (20) is satisfied. Based on Lemma D.5 and the facts $y_{k,0} = y_k^P$ and $y_k = y_k^0$, it holds that
\[ \|y_{k,T} - y_k^*\|^2 \le \kappa\Big(\tfrac{1}{T}\Big)^T\|y_{k,0} - y_k^*\|^2 \le \kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\|y_k - y_k^*\|^2, \]
where $\kappa = L/\mu$. Finally, since $y_k = y_{k-1,T}$, Young's inequality yields
\[ \|y_{k,T} - y_k^*\|^2 \le (1+\varepsilon)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\|y_{k-1,T} - y_{k-1}^*\|^2 + \Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\|y_{k-1}^* - y_k^*\|^2, \]
where $\varepsilon$ is a positive constant and the last term is bounded by $L_y^2\|x_{k-1} - x_k\|^2$ via Lemma 2.2 in Ghadimi & Wang (2018).

Lemma D.16. If the LL function f takes the quadratic form and $T \ge t_b$, then under Assumptions 3.1 and 3.2,
\[ \|y_k^* - y_{k,T}\|^2 \le (1+\varepsilon)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T\|y_{k-1,T} - y_{k-1}^*\|^2 + \Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\|x_k - x_{k-1}\|^2, \tag{44} \]
with $t_b = 4n\ln\frac{L}{\mu}$, $c_t = 2t_b^{T/2}$, and a positive constant $\varepsilon$.

Proof. From Lemma D.2, if $T \ge t_b$, it holds that $\|y_{k,T} - y_k^*\|^2 \le c_t^2\kappa^3(\tfrac{1}{T})^T\|y_{k,0} - y_k^*\|^2$, where $c_t = 2t_b^{T/2}$. Furthermore, since $y_{k,0} = y_{k-1,T}$, Young's inequality together with Lemma 2.2 in Ghadimi & Wang (2018) yields
\[ \|y_{k,T} - y_k^*\|^2 \le (1+\varepsilon)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T\|y_{k-1,T} - y_{k-1}^*\|^2 + \Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\|x_{k-1} - x_k\|^2. \]

Lemma D.17. Choose the parameters $\beta$ and P such that $(1-\beta\mu)^P\|y_k - y_k^*\| \le K_1$, and set $B_0 = LI$ (i.e., $H_0 = L^{-1}I$). Then, under Assumptions 3.1 and 3.2, for $T \ge K_0 := 8n\ln\frac{2L}{\mu}$,
\[ \|y_k^* - y_{k,T}\|^2 \le (1+\varepsilon)c_l^2\kappa^3(1-\beta\mu)^P\Big(\tfrac{1}{T}\Big)^T\|y_{k-1,T} - y_{k-1}^*\|^2 + \Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3(1-\beta\mu)^P\Big(\tfrac{1}{T}\Big)^T L_y^2\|x_{k-1} - x_k\|^2, \tag{45} \]
with a positive constant $\varepsilon$ and $c_l = 2t_c^{T/2}$.

Proof. From Lemma D.3, if $T \ge K_0$, it holds that
\[ \|y_{k,T} - y_k^*\|^2 \le c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T\|y_{k,0} - y_k^*\|^2 \le c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P\|y_k - y_k^*\|^2, \]
where $c_l = 2t_c^{T/2}$. Furthermore, using Young's inequality (with $y_k = y_{k-1,T}$) and Lemma 2.2 in Ghadimi & Wang (2018) yields (45).

Lemma D.18 (Error of $u_{k,Q_k}$). Suppose that the lower-level function f has the quadratic form and Assumptions 3.1 and 3.2 hold. If $u_{k,Q_k} = \bar u_k$, where
\[ \bar u_k := \operatorname*{arg\,min}_{u_{k,i},\ 1\le i\le Q_k}\|A^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\|, \tag{46} \]
then for $Q_k > 1$ the following inequality holds:
\[ \|u_{k,Q_k} - u_k^*\|^2 \le \frac{2nL C_{Fy}^2}{\mu^3 Q_k} + \frac{2L_{Fy}^2}{\mu^2}\|y_k^* - y_{k,T}\|^2. \tag{47} \]

Proof. From Lemma D.14, we have
\[ \|A^{-1}\nabla_y F(x_k, y_{k+1}) - \bar u_k\|^2 \le \frac{nL C_{Fy}^2}{\mu^3 Q_k}. \tag{48} \]
Moreover, under Assumption 3.1 and given that $y_{k+1} = y_{k,T}$ (and $u_k^* = A^{-1}\nabla_y F(x_k, y_k^*)$ for quadratic f), it holds that
\[ \|u_{k,Q_k} - u_k^*\|^2 \le 2\|\bar u_k - A^{-1}\nabla_y F(x_k, y_{k+1})\|^2 + 2\|A^{-1}\nabla_y F(x_k, y_{k+1}) - A^{-1}\nabla_y F(x_k, y_k^*)\|^2 \le \frac{2nL C_{Fy}^2}{\mu^3 Q_k} + \frac{2L_{Fy}^2}{\mu^2}\|y_k^* - y_{k,T}\|^2. \]
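As a numerical sanity check of Lemma D.14 (a sketch under our own setup, not the paper's code), one can run BFGS with $H_0 = (1/L)I$ on a random quadratic with $\mu I \preceq A \preceq LI$ and compare $\min_i\|A^{-1}b - u_i\|$ against the bound $C_{Fy}\sqrt{nL/(\mu^3 Q_k)}$, with $\|b\|$ standing in for $C_{Fy}$:

```python
import numpy as np

# Numerical illustration of Lemma D.14: BFGS with H0 = (1/L) I on the
# quadratic g(u) = 0.5 u^T A u - b^T u, where mu I <= A <= L I.
rng = np.random.default_rng(0)
n, mu, L, Qk = 20, 1.0, 10.0, 15
basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (basis * np.linspace(mu, L, n)) @ basis.T      # spectrum in [mu, L]
b = rng.standard_normal(n)                         # plays the role of grad_y F
u_star = np.linalg.solve(A, b)

u, H, errs = np.zeros(n), np.eye(n) / L, []
for _ in range(Qk):
    g = A @ u - b
    d = -H @ g
    t = -(g @ d) / (d @ A @ d)                     # exact line search
    s = t * d
    v = A @ s                                      # gradient difference on a quadratic
    rho = 1.0 / (v @ s)
    V = np.eye(n) - rho * np.outer(s, v)
    H = V @ H @ V.T + rho * np.outer(s, s)         # BFGS inverse-Hessian update
    u = u + s
    errs.append(np.linalg.norm(u_star - u))

bound = np.linalg.norm(b) * np.sqrt(n * L / (mu**3 * Qk))
print(min(errs), "<=", bound)                      # observed min-error vs the bound
```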
Theorem D.19 (Restatement of Theorem 3.4 with full parameter specifications). Suppose that the LL function f in (1) takes the quadratic form
\[ f(x, y) = \tfrac{1}{2}y^T A y - y^T x, \tag{49} \]
where $\mu I \preceq A \preceq LI$, so that Assumption 3.2 holds. Choose the step size $\alpha > 0$, the positive constant $\varepsilon > 0$, $B_0 = LI$ (i.e., $H_0 = L^{-1}I$), and $T \ge t_b$ ($t_b = 4n\ln\kappa$) such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2} + \alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$, where
\[ \tau = c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1+\varepsilon) + 6\Big(1+\tfrac{1}{\varepsilon}\Big)L_y^2\alpha^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2 L_{Fy}^2}{\mu^2}\Big), \]
\[ \omega = 6\Big(L_{Fx}^2 + \frac{L_{fxy}^2 C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2 L_{Fy}^2}{\mu^2}\Big)\Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2, \]
$\kappa = L/\mu$, and $c_t = 2t_b^{T/2}$. Then, under Assumptions 3.1 and 3.3, the iterates generated by the qNBO (BFGS) algorithm (Algorithm 1) satisfy
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2 C_{Fy}^2}{\mu^3 Q_k}, \tag{50} \]
with the initial error $\delta_0 = 3c_t^2\kappa^3\big(\tfrac{1}{T}\big)^T\big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2L_{Fy}^2}{\mu^2}\big)\|y_0^* - y_0\|^2$ and $\Phi(x^*) = \inf_x\Phi(x)$. Specifically, if $Q_k = k+1$, we have
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \inf_x\Phi(x))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{18nLM_{fxy}^2C_{Fy}^2\ln K}{\mu^3 K}. \tag{51} \]

Proof. Substituting inequality (47) into (40) yields
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le 3\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2}\Big)\|y_{k,T} - y_k^*\|^2 + 3M_{fxy}^2\Big(\frac{2nLC_{Fy}^2}{\mu^3 Q_k} + \frac{2L_{Fy}^2}{\mu^2}\|y_k^* - y_{k,T}\|^2\Big) \le \Big(3L_{Fx}^2 + \frac{3L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{6M_{fxy}^2L_{Fy}^2}{\mu^2}\Big)\|y_k^* - y_{k,T}\|^2 + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}. \tag{52} \]
Then, plugging (52) into (44), together with $\|x_k - x_{k-1}\|^2 = \alpha^2\|\widehat\nabla\Phi(x_{k-1})\|^2 \le 2\alpha^2\|\nabla\Phi(x_{k-1})\|^2 + 2\alpha^2\|\widehat\nabla\Phi(x_{k-1}) - \nabla\Phi(x_{k-1})\|^2$, we obtain
\[ \|y_k^* - y_{k,T}\|^2 \le \tau\|y_{k-1}^* - y_{k-1,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\alpha^2\|\nabla\Phi(x_{k-1})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_{k-1}}, \tag{53} \]
with $\tau$ as defined above. Telescoping (53) over k, it follows that
\[ \|y_k^* - y_{k,T}\|^2 \le \tau^k\|y_0^* - y_{0,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)c_t^2\kappa^3\Big(\tfrac{1}{T}\Big)^T L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}}. \]
Combining (52) and $\|y_0^* - y_{0,T}\|^2 \le c_t^2\kappa^3(\tfrac{1}{T})^T\|y_0^* - y_{0,0}\|^2$, we can further derive that
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2 nLM_{fxy}^2C_{Fy}^2}{\mu^3}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}, \tag{54} \]
with $\delta_0$ and $\omega$ as defined above. Since $\nabla\Phi(\cdot)$ is $L_\Phi$-Lipschitz continuous (Lemma 2.2 in Ghadimi & Wang (2018)), we have
\[ \Phi(x_{k+1}) \le \Phi(x_k) + \langle\nabla\Phi(x_k), x_{k+1} - x_k\rangle + \frac{L_\Phi}{2}\|x_{k+1} - x_k\|^2 = \Phi(x_k) - \alpha\langle\nabla\Phi(x_k), \widehat\nabla\Phi(x_k)\rangle + \frac{\alpha^2 L_\Phi}{2}\|\widehat\nabla\Phi(x_k)\|^2 \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2 L_\Phi\Big)\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2. \tag{55} \]
Using (54), it holds that
\[ \Phi(x_{k+1}) \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2 L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2 L_\Phi\Big)\Big[\delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2 nLM_{fxy}^2C_{Fy}^2}{\mu^3}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}\Big]. \tag{56} \]
Summing (56) from k = 0 to K - 1 yields
\[ \Big(\frac{\alpha}{2} - \alpha^2 L_\Phi\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \Phi(x_0) - \Phi(x_K) + \Big(\frac{\alpha}{2} + \alpha^2 L_\Phi\Big)\Big[\frac{\delta_0}{1-\tau} + \omega\alpha^2\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2 nLM_{fxy}^2C_{Fy}^2}{\mu^3}\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \sum_{k=0}^{K-1}\frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}\Big]. \tag{57} \]
Furthermore, from the inequality $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}a_j b_{k-1-j} \le \big(\sum_{k=0}^{K-1}a_k\big)\big(\sum_{k=0}^{K-1}b_k\big)$, we obtain
\[ \sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 \le \Big(\sum_{k=0}^{K-1}\tau^k\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{1}{1-\tau}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2, \qquad \sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} \le \frac{1}{1-\tau}\sum_{k=0}^{K-1}\frac{1}{Q_k}. \tag{58} \]
Thus, we can conclude that
\[ \Big(\frac{\alpha}{2} - \alpha^2 L_\Phi - \omega\alpha^2\Big(\frac{\alpha}{2} + \alpha^2 L_\Phi\Big)\frac{1}{1-\tau}\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \Phi(x_0) - \Phi(x_K) + \Big(\frac{\alpha}{2} + \alpha^2 L_\Phi\Big)\Big[\frac{\delta_0}{1-\tau} + \Big(1 + \frac{\omega\alpha^2}{1-\tau}\Big)\sum_{k=0}^{K-1}\frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}\Big]. \]
Moreover, if $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2} + \alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$, then
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}, \tag{59} \]
where $\Phi(x^*) = \inf_x\Phi(x)$. Finally, if $Q_k = k+1$, then (51) is established.
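For completeness, the step from (50) to (51) uses the following harmonic-sum bound with $Q_k = k+1$ (the additive constant is absorbed into the stated $\ln K$ factor for large K):

```latex
\[
  \frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{Q_k}
    = \frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{k+1}
    \le \frac{1 + \ln K}{K}.
\]
```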
Next, we consider the case with warm-up steps and provide the corresponding theorem, whose proof is similar to that of Theorem D.24 and is therefore omitted.

Theorem D.20 (Warm-up for quadratic f). Suppose that the LL function f in (1) takes the quadratic form
\[ f(x, y) = \tfrac{1}{2}y^T A y - y^T x, \tag{60} \]
where $\mu I \preceq A \preceq LI$, so that Assumption 3.2 holds. Choose the step size $\beta$ and the number of warm-up iterations P such that $(1-\beta\mu)^P\|y_k - y_k^*\| \le \frac{1}{300\sqrt{\mu}}$, and ensure that the initial Hessian approximation matrix $H_0$ satisfies
\[ \big\|[\nabla^2_{yy} f(x_k, y^*(x_k))]^{-1/2}\big(H_0^{-1} - \nabla^2_{yy} f(x_k, y^*(x_k))\big)[\nabla^2_{yy} f(x_k, y^*(x_k))]^{-1/2}\big\|_F \le \frac{1}{7}. \]
Choose the step size $\alpha > 0$ and the positive constant $\varepsilon > 0$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2}+\alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$, where
\[ \tau = \kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P(1+\varepsilon) + 6\Big(1+\tfrac{1}{\varepsilon}\Big)L_y^2\alpha^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2L_{Fy}^2}{\mu^2}\Big), \]
\[ \omega = 6\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2L_{Fy}^2}{\mu^2}\Big)\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2, \]
and $\kappa = L/\mu$. Then, under Assumptions 3.1 and 3.3, the iterates generated by the qNBO (BFGS) algorithm (Algorithm 1) satisfy
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3 Q_k}, \tag{61} \]
with the initial error $\delta_0 = 3\kappa\big(\tfrac{1}{T}\big)^T(1-\beta\mu)^P\big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{2M_{fxy}^2L_{Fy}^2}{\mu^2}\big)\|y_0^* - y_0\|^2$.

D.4 PROOF OF THEOREM 3.7

Proposition D.21 (Example 4.1 of Rodomanov & Nesterov (2021a)). Suppose that, for all x, the LL function f is $\mu$-strongly convex w.r.t. y and its Hessian is $L_{fyy}$-Lipschitz continuous w.r.t. y. Then f is strongly self-concordant with constant $M = \frac{L_{fyy}}{\mu^{3/2}}$, i.e.,
\[ \nabla^2_{yy}f(x, y_1) - \nabla^2_{yy}f(x, y_2) \preceq M\|y_1 - y_2\|_z\, \nabla^2_{yy}f(x, w), \quad \forall\, y_1, y_2, z, w \in \mathbb{R}^n, \]
where $\|y\|_z := \langle\nabla^2_{yy}f(x, z)y, y\rangle^{1/2}$.

Proof. Using the Lipschitz continuity of the Hessian, we have
\[ \nabla^2_{yy}f(x, y_1) - \nabla^2_{yy}f(x, y_2) \preceq L_{fyy}\|y_1 - y_2\| I \preceq \frac{L_{fyy}}{\mu^{1/2}}\langle\nabla^2_{yy}f(x, z)(y_1 - y_2), y_1 - y_2\rangle^{1/2} I \preceq \frac{L_{fyy}}{\mu^{3/2}}\|y_1 - y_2\|_z\, \nabla^2_{yy}f(x, w), \]
where the second and last inequalities follow from the fact that $\mu I \preceq \nabla^2_{yy}f(x, y)$. This demonstrates that f is strongly self-concordant with constant $M = \frac{L_{fyy}}{\mu^{3/2}}$.

Lemma D.22. If Assumptions 3.1 and 3.2 hold, then the iterates $u_{k,i}$ generated in Step 2 of Algorithm 1 satisfy
\[ \sum_{i=1}^{Q_k}\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\big\|^2 \le \frac{nLC_{Fy}^2}{\tilde\xi\mu^3}, \tag{62} \]
where $Q_k > 1$, $\tilde\xi = \min_{i=1,\dots,Q_k}\frac{1}{2(\xi_i^2 + \xi_i)}$, and $\xi_i = e^{M\sum_{j=0}^{i-1}\|\zeta_j u_{k,j}\|_{y_{k+1}}}$.

Proof. Note that in Algorithm 3,
\[ J_i := \int_0^1 \nabla^2_{yy}f(x_k, y_{k+1} + t s_i)\,dt, \qquad J_{i+1} := \int_0^1 \nabla^2_{yy}f(x_k, y_{k+1} + t s_{i+1})\,dt, \]
with $s_i = \zeta_i H_{k,i}\nabla_y F(x_k, y_{k+1})$. When the step size $\zeta_i$ is chosen appropriately, based on the definition of $J_i$ and the properties of f stated in Assumption 3.2, $J_i$ is nearly equal to $J_{i+1}$ for all $i \ge 0$ (so the terms $\Delta_i$ in Lemma D.11 are negligible), and $\mu I \preceq J_i \preceq LI$. From Lemma D.11, it follows that
\[ \sum_{i=1}^{Q_k}\frac{(J_i^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})^T J_i (J_i^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i})}{\nabla_y F(x_k, y_{k+1})^T J_i^{-1}\nabla_y F(x_k, y_{k+1})} \le \frac{nL}{\tilde\xi\mu}. \tag{63} \]
Since $\mu I \preceq J_i \preceq LI$, we have
\[ \sum_{i=1}^{Q_k}\frac{\mu\|J_i^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\|^2}{\frac{1}{\mu}\|\nabla_y F(x_k, y_{k+1})\|^2} \le \frac{nL}{\tilde\xi\mu}. \]
Moreover, it follows from Assumption 3.1 that
\[ \sum_{i=1}^{Q_k}\|J_i^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\|^2 \le \frac{nLC_{Fy}^2}{\tilde\xi\mu^3}. \tag{64} \]
If the parameter M of the function f or the steps $\zeta_i u_{k,i}$ are sufficiently small, $J_i$ can be regarded as an approximation of $\nabla^2_{yy}f(x_k, y_{k+1})$. Therefore, $\theta(J_i, B_i, u_i)$ can be used to characterize the approximation of $H_{k,i}$ to $[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}$ along the gradient direction $\nabla_y F(x_k, y_{k+1})$, i.e.,
\[ \sum_{i=1}^{Q_k}\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\big\|^2 \le \frac{nLC_{Fy}^2}{\tilde\xi\mu^3}, \tag{65} \]
where $\tilde\xi = \min_{i=1,\dots,Q_k}\frac{1}{2(\xi_i^2+\xi_i)}$ and $\xi_i = e^{M\sum_{j=0}^{i-1}\|\zeta_j u_{k,j}\|_{y_{k+1}}}$.

Lemma D.23. Suppose that Assumption 3.2 holds. Note that $u_{k+1} = u_{k,Q_k}$ in Step 2 of Algorithm 1.
If $u_{k,Q_k} = \bar u_k$ with
\[ \bar u_k := \operatorname*{arg\,min}_{u_{k,i},\ 1\le i\le Q_k}\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,i}\big\|^2, \tag{66} \]
then
\[ \|u_{k,Q_k} - u_k^*\|^2 \le \frac{2nLC_{Fy}^2}{\tilde\xi\mu^3 Q_k} + 4\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\|y_k^* - y_{k,T}\|^2, \qquad Q_k > 1, \tag{67} \]
where $\tilde\xi = \min_{i=1,\dots,Q_k}\frac{1}{2(\xi_i^2+\xi_i)}$ and $\xi_i = e^{M\sum_{j=0}^{i-1}\|\zeta_j u_{k,j}\|_{y_{k+1}}}$.

Proof. Combining Lemma D.22 and the definition of $\bar u_k$ yields
\[ \big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - u_{k,Q_k}\big\|^2 \le \frac{nLC_{Fy}^2}{\tilde\xi\mu^3 Q_k}. \tag{68} \]
Under Assumptions 3.1 and 3.2, since $y_{k+1} = y_{k,T}$, it holds that
\[ \big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - [\nabla^2_{yy}f(x_k, y_k^*)]^{-1}\nabla_y F(x_k, y_k^*)\big\|^2 \]
\[ \le 2\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\big(\nabla_y F(x_k, y_{k+1}) - \nabla_y F(x_k, y_k^*)\big)\big\|^2 + 2\big\|\big([\nabla^2_{yy}f(x_k, y_{k+1})]^{-1} - [\nabla^2_{yy}f(x_k, y_k^*)]^{-1}\big)\nabla_y F(x_k, y_k^*)\big\|^2 \]
\[ \le \frac{2L_{Fy}^2}{\mu^2}\|y_k^* - y_{k,T}\|^2 + 2C_{Fy}^2\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\big\|^2\big\|\nabla^2_{yy}f(x_k, y_{k+1}) - \nabla^2_{yy}f(x_k, y_k^*)\big\|^2\big\|[\nabla^2_{yy}f(x_k, y_k^*)]^{-1}\big\|^2 \]
\[ \le \frac{2L_{Fy}^2}{\mu^2}\|y_k^* - y_{k,T}\|^2 + \frac{2C_{Fy}^2L_{fyy}^2}{\mu^4}\|y_k^* - y_{k,T}\|^2 = 2\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\|y_k^* - y_{k,T}\|^2. \tag{69} \]
Finally, from Assumption 3.1 and inequality (69), it is derived that
\[ \|u_{k,Q_k} - u_k^*\|^2 \le 2\big\|u_{k,Q_k} - [\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1})\big\|^2 + 2\big\|[\nabla^2_{yy}f(x_k, y_{k+1})]^{-1}\nabla_y F(x_k, y_{k+1}) - [\nabla^2_{yy}f(x_k, y_k^*)]^{-1}\nabla_y F(x_k, y_k^*)\big\|^2 \le \frac{2nLC_{Fy}^2}{\tilde\xi\mu^3 Q_k} + 4\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\|y_k^* - y_{k,T}\|^2. \tag{70} \]

Theorem D.24 (Restatement of Theorem 3.7 with full parameter specifications). Suppose that Assumptions 3.1, 3.2, and 3.3 hold. Choose the step size $\beta$ and the number of warm-up iterations P such that $(1-\beta\mu)^P\|y_k - y_k^*\| \le \frac{1}{300\sqrt{\mu}}$, and ensure that the initial Hessian approximation matrix $H_0$ satisfies
\[ \big\|[\nabla^2_{yy}f(x_k, y^*(x_k))]^{-1/2}\big(H_0^{-1} - \nabla^2_{yy}f(x_k, y^*(x_k))\big)[\nabla^2_{yy}f(x_k, y^*(x_k))]^{-1/2}\big\|_F \le \frac{1}{7}. \]
Define
\[ \tau = \kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P(1+\varepsilon) + 6\Big(1+\tfrac{1}{\varepsilon}\Big)L_y^2\alpha^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big) \]
and
\[ \omega = 6\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2. \]
Choose the step size $\alpha > 0$, the positive constant $\varepsilon > 0$, and the iteration number $T > 0$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2}+\alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$. Then the iterates $x_k$ generated by Algorithm 1 achieve the following convergence rate:
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \inf_x\Phi(x))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \tag{71} \]
where $\delta_0 = 3\kappa\big(\tfrac{1}{T}\big)^T(1-\beta\mu)^P\big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\big)\|y_0^* - y_0\|^2$ is the initial error. Specifically, if $Q_k = k+1$, we have
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{18nLM_{fxy}^2C_{Fy}^2\ln K}{\mu^3\tilde\xi K}. \tag{72} \]

Proof. Substituting inequality (67) into (40) yields
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \Big(3L_{Fx}^2 + \frac{3L_{fxy}^2C_{Fy}^2}{\mu^2} + 12M_{fxy}^2\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\Big)\|y_k^* - y_{k,T}\|^2 + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}. \tag{73} \]
Then, based on Lemma D.15, substituting the above inequality into (43) yields
\[ \|y_k^* - y_{k,T}\|^2 \le \tau\|y_{k-1}^* - y_{k-1,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\|\nabla\Phi(x_{k-1})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_{k-1}}, \tag{74} \]
with $\tau$ as defined above. Summing (74) from 0 to k results in
\[ \|y_k^* - y_{k,T}\|^2 \le \tau^k\|y_0^* - y_{0,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}}. \tag{75} \]
Combining inequality (73) and $\|y_0^* - y_{0,T}\|^2 \le \kappa(\tfrac{1}{T})^T(1-\beta\mu)^P\|y_0^* - y_0\|^2$, it follows that
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2 nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \tag{76} \]
with $\delta_0$ and $\omega$ as defined above. Since $\nabla\Phi(\cdot)$ is $L_\Phi$-Lipschitz, it can be obtained that
\[ \Phi(x_{k+1}) \le \Phi(x_k) + \langle\nabla\Phi(x_k), x_{k+1} - x_k\rangle + \frac{L_\Phi}{2}\|x_{k+1} - x_k\|^2 \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2. \]
Using inequality (76) yields
\[ \Phi(x_{k+1}) \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\Big[\delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}\Big]. \tag{77} \]
Telescoping (77) from k = 0 to K - 1 and applying $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}a_jb_{k-1-j} \le \big(\sum_{k=0}^{K-1}a_k\big)\big(\sum_{k=0}^{K-1}b_k\big)$ as in (58), it is derived that
\[ \Big(\frac{\alpha}{2} - \alpha^2L_\Phi - \omega\alpha^2\Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\frac{1}{1-\tau}\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \Phi(x_0) - \Phi(x_K) + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\Big[\frac{\delta_0}{1-\tau} + \Big(1 + \frac{\omega\alpha^2}{1-\tau}\Big)\sum_{k=0}^{K-1}\frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}\Big]. \tag{79} \]
If $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2} + \alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$, then
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \inf_x\Phi(x))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}. \tag{80} \]
Finally, substituting $Q_k = k+1$ into (80) yields (72).

In addition, we present another theorem that does not require the strict assumption on the initial Hessian matrix $H_0$, but does require $T \ge 8n\ln\frac{2L}{\mu}$.

Theorem D.25. Suppose that Assumptions 3.1, 3.2, and 3.3 hold. Choose the step size $\beta$ and the number of warm-up iterations P such that $(1-\beta\mu)^P\|y_k - y_k^*\| \le K_1$, where $K_1$ is defined in (16). Set $B_0 = LI$ (i.e., $H_0 = L^{-1}I$) and $T \ge K_0 = 8n\ln\frac{2L}{\mu}$. Define
\[ \tau = c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P(1+\varepsilon) + 6\Big(1+\tfrac{1}{\varepsilon}\Big)L_y^2\alpha^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big) \]
and
\[ \omega = 6\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2. \]
Choose the step size $\alpha > 0$, the positive constant $\varepsilon > 0$, and the iteration number $T > 0$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2}+\alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$. Then the iterates $x_k$ generated by Algorithm 1 achieve the following convergence rate:
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \tag{81} \]
where $\delta_0 = 3c_l^2\kappa^3\big(\tfrac{1}{T}\big)^T(1-\beta\mu)^P\big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\big)\|y_0^* - y_0\|^2$ is the initial error. Specifically, if $Q_k = k+1$, we have
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \inf_x\Phi(x))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{18nLM_{fxy}^2C_{Fy}^2\ln K}{\mu^3\tilde\xi K}. \tag{82} \]

Proof. Substituting inequality (67) into (40) yields
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \Big(3L_{Fx}^2 + \frac{3L_{fxy}^2C_{Fy}^2}{\mu^2} + 12M_{fxy}^2\Big(\frac{L_{Fy}^2}{\mu^2} + \frac{C_{Fy}^2L_{fyy}^2}{\mu^4}\Big)\Big)\|y_k^* - y_{k,T}\|^2 + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}. \tag{83} \]
Then, based on Lemma D.17, substituting the above inequality into (45) yields
\[ \|y_k^* - y_{k,T}\|^2 \le \tau\|y_{k-1}^* - y_{k-1,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\|\nabla\Phi(x_{k-1})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_{k-1}}, \tag{84} \]
with $\tau$ as defined above. Summing (84) from 0 to k results in
\[ \|y_k^* - y_{k,T}\|^2 \le \tau^k\|y_0^* - y_{0,T}\|^2 + 2\Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + 12\Big(1+\tfrac{1}{\varepsilon}\Big)c_l^2\kappa^3\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P L_y^2\alpha^2\,\frac{nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}}. \tag{85} \]
Combining inequality (83) and $\|y_0^* - y_{0,T}\|^2 \le c_l^2\kappa^3(\tfrac{1}{T})^T(1-\beta\mu)^P\|y_0^* - y_0\|^2$, it follows that
\[ \|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2 \le \delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \tag{86} \]
with $\delta_0$ and $\omega$ as defined above. Since $\nabla\Phi(\cdot)$ is $L_\Phi$-Lipschitz, it can be obtained that
\[ \Phi(x_{k+1}) \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\|\widehat\nabla\Phi(x_k) - \nabla\Phi(x_k)\|^2, \]
and, using (86),
\[ \Phi(x_{k+1}) \le \Phi(x_k) - \Big(\frac{\alpha}{2} - \alpha^2L_\Phi\Big)\|\nabla\Phi(x_k)\|^2 + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\Big[\delta_0\tau^k + \omega\alpha^2\sum_{j=0}^{k-1}\tau^j\|\nabla\Phi(x_{k-1-j})\|^2 + \frac{6\omega\alpha^2nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi}\sum_{j=0}^{k-1}\frac{\tau^j}{Q_{k-1-j}} + \frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}\Big]. \tag{87} \]
Finally, telescoping (87) from k = 0 to K - 1 and using $\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}a_jb_{k-1-j} \le \big(\sum_{k=0}^{K-1}a_k\big)\big(\sum_{k=0}^{K-1}b_k\big)$, we deduce that
\[ \Big(\frac{\alpha}{2} - \alpha^2L_\Phi - \omega\alpha^2\Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\frac{1}{1-\tau}\Big)\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \Phi(x_0) - \Phi(x_K) + \Big(\frac{\alpha}{2} + \alpha^2L_\Phi\Big)\Big[\frac{\delta_0}{1-\tau} + \Big(1 + \frac{\omega\alpha^2}{1-\tau}\Big)\sum_{k=0}^{K-1}\frac{6nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}\Big]. \tag{89} \]
If $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2} + \alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$, then
\[ \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \le \frac{4(\Phi(x_0) - \Phi(x^*))}{\alpha K} + \frac{3\delta_0}{K(1-\tau)} + \frac{1}{K}\sum_{k=0}^{K-1}\frac{18nLM_{fxy}^2C_{Fy}^2}{\mu^3\tilde\xi Q_k}, \tag{90} \]
where $\Phi(x^*) = \inf_x\Phi(x)$. Finally, substituting $Q_k = k+1$ into (90) yields (82).

E COMPLEXITY AND THEORETICAL DISCUSSION

Corollary E.1. Consider $T = \Theta(\ln\kappa)$ and $\alpha = \Theta(\kappa^{-3})$ such that $\tau < 1$ and $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2}+\alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$. Under the same setting as Theorem 3.7, we have $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\big(\frac{\kappa^3}{K}\big)$. To achieve an $\epsilon$-stationary point, we require $K = O(\kappa^3\epsilon^{-1})$, resulting in gradient complexities of $G_c(f, \epsilon) = O(\kappa^6\epsilon^{-2})$ and $G_c(F, \epsilon) = O(\kappa^3\epsilon^{-1})$, and a Jacobian-vector product complexity of $JV(\epsilon) = O(\kappa^3\epsilon^{-1})$.

Proof. For Theorem 3.7, by Theorem D.24 we have
\[ c_3 := 6L_y^2\Big(L_{Fx}^2 + \frac{L_{fxy}^2C_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2L_{Fy}^2}{\mu^2} + \frac{4M_{fxy}^2C_{Fy}^2L_{fyy}^2}{\mu^4}\Big) = \Theta(\kappa^6). \]
Since $0 < (1-\beta\mu)^P \le 1$ and $\alpha = \Theta(\kappa^{-3})$, it is derived that
\[ \tau = \kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P(1+\varepsilon) + \Big(1+\tfrac{1}{\varepsilon}\Big)\alpha^2 c_3 = \Theta\big(\kappa(1/T)^T\big), \qquad \omega = c_3\Big(1+\tfrac{1}{\varepsilon}\Big)\kappa\Big(\tfrac{1}{T}\Big)^T(1-\beta\mu)^P = \Theta\big(\kappa^7(1/T)^T\big). \]
Based on Lemma D.12, $\nabla\Phi$ is $L_\Phi$-Lipschitz with $L_\Phi = \Theta(\kappa^3)$. For a suitable choice of $\alpha = \Theta(\kappa^{-3})$, it follows that $\alpha L_\Phi < \frac{1}{8}$. Additionally, with $T = \Theta(\ln\kappa)$, the conditions $0 < \tau \le \frac{1}{2}$ and $\omega\alpha^2 = \Theta(\kappa(1/T)^T) \le \frac{1}{10}$ are satisfied.
Consequently, $\alpha L_\Phi + \omega\alpha^2\big(\tfrac{1}{2}+\alpha L_\Phi\big)\tfrac{1}{1-\tau} \le \tfrac{1}{4}$. Since $\alpha = \Theta(\kappa^{-3})$, it can be obtained from (72) that $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = O\big(\frac{\kappa^3}{K}\big)$. Furthermore, to achieve an $\epsilon$-stationary point, we need $K = O\big(\kappa^3\epsilon^{-1}\ln\frac{\kappa^3}{\epsilon}\big) = \tilde O(\kappa^3\epsilon^{-1})$. Therefore, up to logarithmic factors, the following complexity results are derived:
\[ G_c(f, \epsilon) = K(T+P) + \sum_{k=0}^{K-1}Q_k = K(T+P) + \frac{K(K+1)}{2} = O(\kappa^6\epsilon^{-2}); \qquad G_c(F, \epsilon) = 2K = O(\kappa^3\epsilon^{-1}); \qquad JV(\epsilon) = K = O(\kappa^3\epsilon^{-1}). \]

Details for obtaining $\tau \le 1/2$ and $\omega\alpha^2 \le 1/10$: Since $\tau = \Theta\big(\kappa(1/T)^T\big)$, $\omega\alpha^2 = \Theta\big(\kappa(1/T)^T\big)$, and $\kappa \ge 1$, it suffices to show that $C_0\kappa(1/T)^T \le 1/10$ for a choice $T = \Theta(\ln\kappa)$. Here, $C_0 \ge 1$ is the positive constant hidden in $\tau$ and $\omega\alpha^2$, depending explicitly on the Lipschitz constants in the assumptions. Taking logarithms on both sides of $C_0\kappa(1/T)^T \le 1/10$ gives $\ln\kappa - T\ln T \le -\ln(10C_0)$, which is equivalent to $T\ln T \ge \ln\kappa + \ln(10C_0)$. Therefore, choosing $T \ge \ln\kappa + \ln(10C_0) + e$ is sufficient, since then $\ln T \ge 1$. Similarly, we can prove the corresponding result for Theorem 3.4.