# Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization

Feihu Huang 1 2

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. 2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China. Correspondence to: Feihu Huang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. Bilevel optimization is widely applied in many machine learning tasks such as hyper-parameter learning, meta learning and reinforcement learning. Although many algorithms have recently been developed to solve bilevel optimization problems, they generally rely on (strongly) convex lower-level problems. More recently, some methods have been proposed to solve nonconvex-PL bilevel optimization problems, where the upper-level problem is possibly nonconvex, and the lower-level problem is also possibly nonconvex while satisfying the Polyak-Łojasiewicz (PL) condition. However, these methods still suffer from a high convergence complexity or a high computation complexity, such as requiring expensive Hessian/Jacobian matrices and their inverses. In this paper, we therefore propose an efficient Hessian/Jacobian-free method (i.e., HJFBiO) with the optimal convergence complexity to solve the nonconvex-PL bilevel problems. Theoretically, under some mild conditions, we prove that our HJFBiO method obtains an optimal convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations, and has an optimal gradient complexity of $O(\epsilon^{-1})$ in finding an $\epsilon$-stationary solution. We conduct some numerical experiments on a bilevel PL game and a hyper-representation learning task to demonstrate the efficiency of our proposed method.

1. Introduction

Bilevel optimization (Colson et al., 2007; Liu et al., 2021a), as an effective two-level hierarchical optimization paradigm, is widely applied in many machine learning tasks such as hyper-parameter learning (Franceschi et al., 2018), meta learning (Franceschi et al., 2018; Ji et al., 2021) and reinforcement learning (Hong et al., 2020; Chakraborty et al., 2023). In this paper, we consider a class of nonconvex bilevel optimization problems:

$$\min_{x \in \mathbb{R}^d,\; y \in y^*(x)} \; f(x, y) + \phi(x), \qquad \text{(Upper-Level)} \tag{1}$$
$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^p} g(x, y), \qquad \text{(Lower-Level)}$$

where the upper-level function $f(x, y)$ with $y \in y^*(x)$ is possibly nonconvex, and $\phi(x)$ is a convex but possibly nonsmooth regularizer, e.g., $\phi(x) = 0$ when $x \in \mathcal{X} \subseteq \mathbb{R}^d$ for a convex set $\mathcal{X}$ and $\phi(x) = +\infty$ otherwise, or $\phi(x) = \|x\|_1$. The lower-level function $g(x, y)$ is possibly nonconvex in $y$ and satisfies the Polyak-Łojasiewicz (PL) condition (Polyak, 1963), which relaxes strong convexity. The PL condition holds for some machine learning models such as over-parameterized deep neural networks (Frei & Gu, 2021; Song et al., 2021). In fact, Problem (1) widely appears in many machine learning tasks such as meta learning (Huang, 2023b) and reinforcement learning (Chakraborty et al., 2023).

The inherent nested nature of bilevel optimization gives rise to several difficulties in effectively solving these bilevel problems. For example, compared with standard single-level optimization (i.e., $g(x, y) \equiv 0$ in Problem (1)), the main difficulty of bilevel optimization is that the minimization of the upper- and lower-level objectives is intertwined via the minimizer $y^*(x) \in \arg\min_y g(x, y)$ of the lower-level problem.
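To make the nested structure of Problem (1) concrete, the following minimal numpy sketch (our illustration, not an example from the paper) casts hyper-parameter learning in the form of (1): the lower level fits ridge regression on training data, the upper level evaluates validation loss at the fitted weights. All names (`A_tr`, `y_star`, etc.) are ours, and the lower level here is strongly convex for simplicity, whereas the paper only requires the weaker PL condition.

```python
# A minimal sketch of Problem (1) as hyper-parameter learning (assumed example):
# x is a regularization weight (upper-level variable), y are model weights
# (lower-level variable). The lower level fits ridge regression on training
# data; the upper level measures validation loss at the fitted weights y*(x).
import numpy as np

rng = np.random.default_rng(0)
A_tr, b_tr = rng.normal(size=(40, 5)), rng.normal(size=40)   # training data
A_va, b_va = rng.normal(size=(20, 5)), rng.normal(size=20)   # validation data

def g(x, y):
    """Lower-level objective g(x, y): regularized training loss."""
    return 0.5 * np.sum((A_tr @ y - b_tr) ** 2) + 0.5 * x * np.sum(y ** 2)

def y_star(x):
    """arg min_y g(x, y); unique here because g(x, .) is strongly convex."""
    return np.linalg.solve(A_tr.T @ A_tr + x * np.eye(5), A_tr.T @ b_tr)

def f(x, y):
    """Upper-level objective f(x, y): validation loss."""
    return 0.5 * np.sum((A_va @ y - b_va) ** 2)

x = 0.3                           # a candidate hyper-parameter
print("F(x) = f(x, y*(x)) =", f(x, y_star(x)))
```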
To deal with this difficulty, many bilevel optimization methods (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021; Huang et al., 2022; Chen et al., 2023b) have recently been proposed by imposing a strong convexity assumption on the Lower-Level (LL) problem. The LL strong convexity assumption ensures the uniqueness of the LL minimizer (i.e., LL Singleton), which simplifies both the optimization process and the theoretical analysis; e.g., the hyper-gradient $\nabla F(x)$ of the upper-level objective $F(x) = f(x, y^*(x))$ has a simple closed form:

$$\nabla F(x) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x)) \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla_y f(x, y^*(x)). \tag{2}$$

Table 1: Comparison of gradient (or iteration) complexity between our method and the existing methods in solving bilevel problem (1) for finding an $\epsilon$-stationary solution ($\|\nabla F(x)\|^2 \le \epsilon$ or its equivalent variants, where $F(x) = f(x, y)$ with $y \in y^*(x)$). Here $g(x, \cdot)$ denotes the function of the second variable $y$ with the variable $x$ fixed. SC stands for strongly convex. H.J.F. stands for Hessian/Jacobian-Free. L.H. stands for Lipschitz Hessian condition. The Prox-F2BA and F2BA methods rely on some strict conditions such as a Lipschitz Hessian of the function $f(x, y)$. Note that the GALET (Xiao et al., 2023) method simultaneously uses the PL condition, its Assumption 2 (i.e., $\sigma_g = \inf_{x,y}\{\sigma^+_{\min}(\nabla^2_{yy} g(x, y))\} > 0$ for all $(x, y)$) and its Assumption 1 (i.e., $\nabla^2_{yy} g(x, y)$ is Lipschitz continuous). Clearly, when the Hessian matrix $\nabla^2_{yy} g(x, y)$ is singular, its Assumption 1 and Assumption 2 imply that the lower bound of the non-zero singular values $\sigma_g$ is close to zero (i.e., $\sigma_g \to 0$); in this case, the convergence results of GALET are meaningless, e.g., the constant $L_w = \ell_{f,1}/\sigma_g^2 + \cdots$ used in its Lemmas 6 and 9. In the other case, the PL condition, Lipschitz continuity of the Hessian and its Assumption 2 (the singular values of the Hessian are bounded away from 0, i.e., $\sigma_g > 0$) imply that GALET assumes strong convexity (detailed discussion in Appendix B).

| Algorithm | Reference | $g(x,\cdot)$ | L.H. on $f(\cdot,\cdot)$ | Complexity | Loop(s) | H.J.F. |
| --- | --- | --- | --- | --- | --- | --- |
| BOME | (Liu et al., 2022) | PL / local-PL | | $O(\epsilon^{-1.5})$ / $O(\epsilon^{-2})$ | Double | |
| V-PBGD | (Shen & Chen, 2023) | PL / local-PL | | $O(\epsilon^{-1.5})$ / $O(\epsilon^{-1.5})$ | Double | |
| GALET | (Xiao et al., 2023) | SC / PL | | $O(\epsilon^{-1})$ / Meaningless | Triple | |
| SLM | (Lu, 2023) | PL / local-PL | | $O(\epsilon^{-3.5})$ / $O(\epsilon^{-3.5})$ | Double | |
| Prox-F2BA | (Kwon et al., 2023) | Proximal-EB | | $O(\epsilon^{-1.5})$ / $O(\epsilon^{-1.5})$ | Double | |
| F2BA | (Chen et al., 2024) | PL / local-PL | | $O(\epsilon^{-1})$ / $O(\epsilon^{-1})$ | Double | |
| MGBiO | (Huang, 2023b) | PL / local-PL | | $O(\epsilon^{-1})$ / $O(\epsilon^{-1})$ | Single | |
| AdaPAG | (Huang, 2023a) | PL / local-PL | | $O(\epsilon^{-1})$ / $O(\epsilon^{-1})$ | Single | |
| HJFBiO | Ours | PL / local-PL | | $O(\epsilon^{-1})$ / $O(\epsilon^{-1})$ | Single | |

Based on the above form of the hyper-gradient $\nabla F(x)$, some gradient-based methods (Chen et al., 2022; Dagréou et al., 2022) require an explicit extraction of second-order information of $g(x, y)$, with a major focus on efficiently estimating its Jacobian and inverse Hessian. Meanwhile, other gradient-based methods (Li et al., 2022; Dagréou et al., 2022; Sow et al., 2022b; Yang et al., 2023b) avoid this direct second-order computation and only use first-order information of both the upper and lower objectives. Recently, to relax the LL strong convexity assumption, another line of research has been dedicated to bilevel optimization with convex LL problems, which brings about several challenges such as the presence of multiple LL local optimal solutions (i.e., Non-Singleton). In this case, the hyper-gradient form (2) no longer exists.
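For reference, when the lower-level problem is strongly convex, the closed form (2) can be evaluated without explicitly forming the Hessian inverse by solving a linear system. The sketch below (our illustration on a ridge-regression instance like the one in the introduction, not the authors' code) shows this for a scalar hyper-parameter $x$.

```python
# A sketch of evaluating the hyper-gradient (2) for a strongly convex lower
# level: dF/dx = d_x f - d2_xy g [d2_yy g]^{-1} d_y f at (x, y*(x)).
# The ridge-regression instance and all variable names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
A_tr, b_tr = rng.normal(size=(40, 5)), rng.normal(size=40)
A_va, b_va = rng.normal(size=(20, 5)), rng.normal(size=20)

def y_star(x):
    return np.linalg.solve(A_tr.T @ A_tr + x * np.eye(5), A_tr.T @ b_tr)

def hypergradient(x):
    y = y_star(x)
    grad_x_f = 0.0                               # f has no direct dependence on x here
    grad_y_f = A_va.T @ (A_va @ y - b_va)        # d_y f(x, y)
    hess_yy_g = A_tr.T @ A_tr + x * np.eye(5)    # d2_yy g(x, y)
    jac_xy_g = y                                 # d2_xy g(x, y) for scalar x
    # Solve the linear system instead of forming the Hessian inverse.
    v = np.linalg.solve(hess_yy_g, grad_y_f)
    return grad_x_f - jac_xy_g @ v

print("hyper-gradient at x = 0.3:", hypergradient(0.3))
```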
To handle this concern, some effective methods (Sow et al., 2022a; Liu et al., 2023; Lu & Mei, 2023; Cao et al., 2023) recently have been developed. For example, (Sow et al., 2022a) developed the primal-dual algorithms for bilevel optimization with multiple inner minima in the LL problem. Subsequently, (Lu & Mei, 2023) studied the constrained bilevel optimization with convex lower-level. Meanwhile, (Liu et al., 2023) proposed an effective averaged method of multipliers for bilevel optimization with convex lower-level. In fact, the bilevel optimization problems with nonconvex LL problems frequently appear in many machine tasks such as hyper-parameter learning in training deep neural networks. Since the above bilevel optimization methods mainly rely on the restrictive LL strong convexity or convexity assumption, clearly, they can not effectively solve the bilevel optimization problems with nonconvex LL problems. Recently, some bilevel approaches (Liu et al., 2021b; 2022; Chen et al., 2023a; Liu et al., 2023; Huang, 2023b;a; Kwon et al., 2023; Chen et al., 2024) studied the bilevel optimization with non-convex lower-level. For example, (Liu et al., 2022) proposed an effective first-order method for nonconvex-PL bilevel optimization, where the lowerlevel problem is nonconex but satisfies PL condition. (Shen & Chen, 2023) designed an effective penalty-based gradient method for the constrained nonconvex-PL bilevel optimization. (Kwon et al., 2023) studied the nonconvex bilevel optimization with nonconvex lower-level satisfying proximal error-bound (EB) condition that is analogous to PL condition. Meanwhile, (Huang, 2023b) proposed a class of efficient momentum-based gradient methods for the nonconvex-PL bilevel optimization, which obtain an optimal gradient complexity but rely on requiring compute expensive projected Hessian/Jacobian matrices and its inverses. Subsequently, (Xiao et al., 2023) proposed a generalized alternating method (i.e., GALET) for nonconvex-PL bilevel optimization, which still relies on the expensive Hessian/Jacobian matrices. Unfortunately, the convergence results of the GALET method are meaningless (please see Table 1). Thus, there exists a natural question: Could we propose an efficient Hessian/Jacobianfree method for the nonconvex-PL bilevel optimization with an optimal complexity? Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization In the paper, we affirmatively answer to this question, and propose an efficient Hessian/Jacobian-free method to solve Problem (1) based on the finite-difference estimator and a new projection operator. Our main contributions are given: (i) We propose an efficient Hessian/Jacobian-free method (i.e., HJFBi O) based on the finite-difference estimator and a new useful projection operator. In particular, our HJFBi O method not only uses low computational first-order gradients instead of high computational Hessian/Jacobian matrices, but also applies the low computational new projection operator to vector variables instead of matrix variables. Thus, our HJFBi O method has a lower computation at each iteration. (ii) We provide a solid convergence analysis for our HJFBi O method. Under some mild conditions, we prove that our HJFBi O method reaches the best known iteration (gradient) complexity of O(ϵ 1) for finding an ϵ-stationary solution of Problem (1), which matches the lower bound established by the first-order method for finding an ϵ-stationary point of nonconvex smooth optimization problems (Carmon et al., 2020). 
(iii) We conduct some numerical experiments including bilevel Polyak-Łojasiewicz game and hyperrepresentation learning to demonstrate efficiency of our proposed method. Meanwhile, (Chen et al., 2024) proposed a F2BA method for the nonconvex-PL bilevel optimization, which also obtains the best known gradient complexity O(ϵ 1) for finding an ϵ-stationary solution of Problem (1), but it relies on some stricter conditions such as Lipschitz Hessian of the upper function f(x, y). Under these strict conditions, although the F2BA method (Chen et al., 2024) obtains a gradient complexity O(ϵ 1), this is not an optimal gradient complexity (Detailed discussion in the Appendix B). Given function f(x, y), f(x, ) denotes function w.r.t. the second variable with fixing x, and f( , y) denotes function w.r.t. the first variable with fixing y. x denotes the partial derivative on variable x. Let 2 xy = x y and 2 yy = y y. denotes the ℓ2 norm for vectors and spectral norm for matrices. x, y denotes the inner product of two vectors x and y. Id denotes a d-dimensional identity matrix. at = O(bt) denotes that at cbt for some constant c > 0. The notation O( ) hides logarithmic terms. S[µ,Lg][ ] denotes a projection on the set {X Rd d : µ ϱ(X) Lg}, where ϱ( ) denotes the eigenvalue function. S[µ,Lg] can be implemented by using Singular Value Decomposition (SVD) and thresholding the singular values. Prv( ) is a projection onto set {v Rp : v rv > 0}. 2. Preliminaries In this section, we provide some mild assumptions and useful lemmas on the above Problem (1). 2.1. Mild Assumptions Assumption 2.1. The function g(x, ) satisfies the PolyakŁojasiewicz (PL) condition, if there exist µ > 0 such that for any given x, it holds that yg(x, y) 2 2µ g(x, y) min y g(x, y) , y Rp. Assumption 2.2. The function g(x, y) is nonconvex and satisfies ϱ 2 yyg x, y (x) [µ, Lg], (3) where y (x) arg miny g(x, y), and ϱ( ) denotes the eigenvalue (or singular-value) function and Lg µ > 0. Assumption 2.3. The functions f(x, y) and g(x, y) satisfy 1) For all x, y, we have yf(x, y) Cfy, 2 xyg(x, y) Cgxy; 2) The partial derivatives xf(x, y) and yf(x, y) are Lf-Lipschitz continuous; 3) The partial derivatives xg(x, y) and yg(x, y) are Lg-Lipschitz continuous. Assumption 2.4. The partial derivatives 2 xyg(x, y) and 2 yyg(x, y) are Lgxy-Lipschitz and Lgyy-Lipschitz, e.g., for all x, x1, x2 Rd and y, y1, y2 Rp 2 xyg(x1, y) 2 xyg(x2, y) Lgxy x1 x2 , 2 xyg(x, y1) 2 xyg(x, y2) Lgxy y1 y2 . Assumption 2.5. The function Φ(x) = F(x) + ϕ(x) is bounded below in x Rd, i.e., Φ = infx Rd Φ(x) > . Assumption 2.1 is commonly used in bilevel optimization without the lower-level strongly convexity (Liu et al., 2022; Shen & Chen, 2023; Huang, 2023b). Assumption 2.2 ensures that the minimizer y (x) = arg miny g(x, y) is unique, which imposes the non-singularity of 2 yyg(x, y) only at the minimizers y (x) arg miny g(x, y), as in (Huang, 2023b). Based on Lemma G.6 of (Chen et al., 2024) given in the Appendix B, our Assumption 2.2 is reasonable when has an unique minimizer. Meanwhile, we also study the case that miny g(x, y) has multiple local minimizers in the following section 4.2. Note that since y (x) arg miny g(x, y), we can not have negative eigenvalues at the minimizer y (x), so Assumption 2.2 assumes that ϱ 2 yyg x, y (x) [µ, Lg] instead of ϱ 2 yyg x, y (x) [ Lg, µ] [µ, Lg]. Since Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization 2 yyg(x, y) is a symmetric matrix, its singular values are the absolute value of eigenvalues. 
Hence, we also can use ϱ( ) to denote the singular-value function. Assumption 2.3 is commonly appeared in bilevel optimization methods (Ghadimi & Wang, 2018; Ji et al., 2021; Liu et al., 2022). Meanwhile, the BOME (Liu et al., 2022) uses the stricter assumption that f(x, y) , g(x, y) , |f(x, y)| and |g(x, y)| are bounded for any (x, y) in its Assumption 3. Assumption 2.4 is also commonly used in bilevel optimization methods (Ghadimi & Wang, 2018; Ji et al., 2021). Assumption 2.5 ensures the feasibility of the bilevel Problem (1). For example, we consider a nonconvex-PL bilevel problem min x [1,2],y y (x) n f(x, y) = x2 + y2 + 3x sin2(y) o , (4) s.t. y (x) min y R n g(x, y) = xy2 + x sin2(y) o , which can be rewritten as min x R,y y (x) n f(x, y) + ϕ(x) o , (5) s.t. y (x) min y R n g(x, y) = xy2 + x sin2(y) o , where ϕ(x) = 0 when x [1, 2] otherwise ϕ(x) = + . From the above bilevel problem (5), we can easily obtain y (x) = 0, 2 yyg(x, y) = x 2+2 cos2(y) 2 sin2(y) , and then we have 2 yyg(x, y (x)) = 4x > 0 due to x [1, 2]. Since 2 yyg(x, y (x)) = 4x > 0, our Assumption 2.2 holds. Meanwhile 2 yyg(x, y) = x 2 + 2 cos2(y) 2 sin2(y) for any y R may be zero or negative such as 2 yyg(x, π/2) = 0. Meanwhile, for any y R, clearly |f(x, y)| and |g(x, y)| are not bounded. Thus, the assumptions used in (Liu et al., 2022; Xiao et al., 2023; Kwon et al., 2023) may be not satisfied. 2.2. Useful Lemmas In this subsection, based on the above assumptions, we give some useful lemmas. Lemma 2.6. ((Huang, 2023b)) Under the above Assumption 2.2, we have, for any x Rd, F(x) = xf(x, y (x)) 2 xyg(x, y (x)) h 2 yyg(x, y (x)) i 1 yf(x, y (x)). From the above Lemma 2.6, we can get the same form of hyper-gradient F(x) as in (2). Since the Hessian matrix 2 yyg(x, y) for all (x, y) may be singular, as in (Huang, 2023b), we define a useful hyper-gradient estimator: ˆ f(x, y) = xf(x, y) 2 xyg(x, y) S[µ,Lg] 2 yyg(x, y) 1 yf(x, y), which replaces the standard hyper-gradient estimator f(x, y) used in (Ghadimi & Wang, 2018; Ji et al., 2021) for the strongly-convex lower-level optimization, f(x, y) = xf(x, y) 2 xyg(x, y) 2 yyg(x, y) 1 yf(x, y). Lemma 2.7. ((Huang, 2023b)) Under the above Assumptions 2.1-2.4, the functions (or mappings) F(x) = f(x, y (x)), G(x) = g(x, y (x)) and y (x) arg miny Rp g(x, y) satisfy, for all x1, x2 Rd, y (x1) y (x2) κ x1 x2 , y (x1) y (x2) Ly x1 x2 , F(x1) F(x2) LF x1 x2 , G(x1) G(x2) LG x1 x2 , where κ = Cgxy/µ, Ly = Cgxy Lgyy µ (1 + Cgxy LF = Lf + Lfκ + Cfy Cgxy Lgyy µ (1 + κ) and LG = Lg + Lgκ + Cgy Cgxy Lgyy Lemma 2.8. ((Huang, 2023b)) Let F(x) = f(x, y (x)) and ˆ f(x, y) = xf(x, y) 2 xyg(x, y) S[µ,Lg] 2 yyg(x, y) 1 yf(x, y), we have ˆ f(x, y) F(x) 2 2ˆL2 µ g(x, y) min y g(x, y) , where ˆL2 = 4 L2 f + L2 gxy C2 fy µ2 + L2 gyy C2 gxy C2 fy µ4 + L2 f C2 gxy µ2 . 3. Efficient Hessian/Jacobian-Free Bilevel Optimization Method In the section, we propose an efficient Hessian/Jacobianfree method to solve the nonconvex-PL bilevel Problem (1) based on the finite-difference estimator and a new projection operator. Here we first define a useful projection operator: Definition 3.1. Given matrix H Rp p and vector v Rp, and S[µ,Lg][ ] is a projection operator on the set {H Rp p : µ ϱ(H) Lg} where ϱ( ) denotes the eigenvalue function, and Prv( ) is a projection operator onto the set {v Rp : v rv}, then we define a new projection operator Mrh( , ) on set {H Rp p, v Rp : Hv rh}, which satisfies Mrh H, v S[µ,Lg][H]Prv(v), (6) where 0 < rh rv Lg. For notational simplicity, let Mrh H, v = Mrh Hv in the following. 
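A minimal sketch (our reading, not the authors' code) of how the projections around Definition 3.1 could be implemented: $S_{[\mu, L_g]}$ by an eigendecomposition with eigenvalue thresholding, and $P_{r_v}$ together with the vector-level use of $M_{r_h}$ as Euclidean-norm clips. The function names are ours.

```python
# Sketch of the projections in Definition 3.1 (assumed implementations).
import numpy as np

def S_interval(H, mu, Lg):
    """Project a symmetric matrix onto {X : mu <= eig(X) <= Lg} by
    eigendecomposition and thresholding of the eigenvalues."""
    eigval, eigvec = np.linalg.eigh(H)
    eigval = np.clip(eigval, mu, Lg)
    return eigvec @ np.diag(eigval) @ eigvec.T

def clip_norm(v, radius):
    """Project a vector onto the Euclidean ball {v : ||v|| <= radius}."""
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

# P_rv(v) = clip_norm(v, r_v). The cheap surrogate M_rh applied in Algorithm 1
# to the finite-difference Hessian-vector estimate is clip_norm(Hv_hat, r_h),
# which avoids ever forming or projecting the p x p matrix itself.
```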
Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Algorithm 1 Hessian/Jacobian-free Bilevel Optimization (i.e, HJFBi O) Algorithm 1: Input: T, learning rates λ > 0, γ > 0, τ > 0, and tuning parameters δϵ > 0, rv > 0, rh > 0, and initial input x1 Rd, y1 Rp and v1 Rp ; 2: for t = 1, 2, . . . , T do 3: Compute ut = yg(xt, yt), and update yt+1 = yt λut; 4: Compute wt = e f(xt, yt, vt) = xf(xt, yt) e J(xt, yt, vt, δϵ), and update xt+1 = Pγ ϕ( ) xt, wt ; 5: Compute ht = e v R(xt, yt, vt) = Mrh e H(xt, yt, vt, δϵ) yf(xt, yt), and update vt+1 = Prv vt τht ; 6: end for 7: Output: Chosen uniformly random from {xt}T t=1. From the above Lemma 2.6, the hyper-gradient F(x) takes the form of F(x) = xf(x, y (x)) (7) 2 xyg(x, y (x)) 2 yyg(x, y (x)) 1 yf(x, y (x)). In the above problem (1), the lower objective function g(x, y) on variable y is not strongly convex, so its Hessian matrix 2 yyg(x, y) for all (x, y) may be singular. As in (Huang, 2023b), we define a useful hypergradient estimator: b f(x, y) = xf(x, y) (8) 2 xyg(x, y) S[µ,Lg] 2 yyg(x, y) 1 yf(x, y). Since the above hypergradient (8) requires computing the expensive projected Hessian inverse, as in (Yang et al., 2023b), we define a new hypergradient surrogates as follows: b f(x, y, v) = xf(x, y) 2 xyg(x, y)v, (9) where v Rp is an auxiliary vector to approximate the projected Hessian-inverse vector product S[µ,Lg] 2 yyg(x, y) 1 yf(x, y) in (8), which can be rewritten as a solution of the following linear system: S[µ,Lg] 2 yyg(x, y) 1 yf(x, y) (10) = arg min v Rp 2v T S[µ,Lg] 2 yyg(x, y) v v T yf(x, y) o . Under this case, we can use the following new iterations to solve the nonconvex-PL bilevel problem (1): for t 1, yt+1 = yt λ yg(xt, yt), xt+1 = Pγ ϕ( ) xt, b f(xt, yt, vt) , vt+1 = Prv vt τ v R(xt, yt, vt) , where λ > 0, γ > 0 and τ > 0 are learning rates, and the proximal operator defined as: given vectors xt, wt Rd, Pγ ϕ( ) xt, wt = arg min x Rd wt, x + 1 2γ x xt 2 + ϕ(x) . (12) In particular, on updating variable v Rp, we use a projection Prv( ) onto the set {v Rp : v rv} with 0 < rv Cfy µ to obtain the bounded variable v. Here we use the function R(x, y, v) = 1 2v T S[µ,Lg] 2 yyg(x, y) v v T yf(x, y), and then we have v R(x, y, v) = S[µ,Lg] 2 yyg(x, y) v yf(x, y). In the high-dimensional setting, clearly computing Hessian matrix 2 yyg(x, y) and Jacobian matrix 2 xyg(x, y) is expensive. To approximate this hypergradient efficiently, we further use the finite-difference technique to estimate the Hessian-vector 2 yyg(x, y)v and Jacobian-vector 2 xyg(x, y)v products. Specifically, given a small constant δϵ > 0, we define two finite-difference estimators to estimate 2 yyg(x, y)v and 2 xyg(x, y)v respectively, defined as: e H(x, y, v, δϵ) = yg(x, y + δϵv) yg(x, y δϵv) e J(x, y, v, δϵ) = xg(x, y + δϵv) xg(x, y δϵv) Then we can use the following Hessian/Jacobian-free iterations to solve the nonconvex-PL bilevel problem (1): at (t + 1)-th iteration, yt+1 = yt λ yg(xt, yt), xt+1 = Pγ ϕ( ) xt, e f(xt, yt, vt) , vt+1 = Prv vt τ e v R(xt, yt, vt) , where λ > 0, γ > 0 and τ > 0 are learning rates, and e f(xt, yt, vt) = xf(xt, yt) e J(xt, yt, vt, δϵ), e v R(xt, yt, vt) = Mrh e H(xt, yt, vt, δϵ) yf(xt, yt). Based on the above iterations (15), we give a procedure framework of our HJFBi O algorithm in Algorithm 1. When ϕ(x) 0, in updating the variable x, we have xt+1 = xt γwt = xt γ e f(xt, yt, vt). 
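To make the iterations (15) concrete, the following sketch performs one HJFBiO step. It assumes $\phi(x) \equiv 0$ (so the $x$-update is a plain gradient step), a central-difference scaling of $1/(2\delta_\epsilon)$ in the estimators (13)-(14), and user-supplied gradient oracles; it is an illustration under these assumptions, not the authors' implementation.

```python
# A sketch of one HJFBiO iteration (15) with finite-difference estimators,
# assuming phi(x) = 0 and gradient oracles (d_x f, d_y f, d_x g, d_y g).
import numpy as np

def hvp_fd(grad_y_g, x, y, v, delta):
    """Central-difference estimate of the Hessian-vector product d2_yy g(x,y) v."""
    return (grad_y_g(x, y + delta * v) - grad_y_g(x, y - delta * v)) / (2 * delta)

def jvp_fd(grad_x_g, x, y, v, delta):
    """Central-difference estimate of the Jacobian-vector product d2_xy g(x,y) v."""
    return (grad_x_g(x, y + delta * v) - grad_x_g(x, y - delta * v)) / (2 * delta)

def clip_norm(v, radius):
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

def hjfbio_step(x, y, v, oracles, lam, gamma, tau, delta, r_v, r_h):
    """One iteration: y by lower-level gradient descent, x by the
    Hessian/Jacobian-free hyper-gradient surrogate, v by a projected step."""
    gxf, gyf, gxg, gyg = oracles
    y_new = y - lam * gyg(x, y)                            # line 3 of Algorithm 1
    w = gxf(x, y) - jvp_fd(gxg, x, y, v, delta)            # surrogate hyper-gradient
    x_new = x - gamma * w                                  # line 4 with phi = 0
    h = clip_norm(hvp_fd(gyg, x, y, v, delta), r_h) - gyf(x, y)
    v_new = clip_norm(v - tau * h, r_v)                    # line 5
    return x_new, y_new, v_new
```

Only first-order gradients of $f$ and $g$ are evaluated, which is where the $O(p + d)$ per-iteration cost claimed for Algorithm 1 comes from.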
Note that in our Algorithm 1, we use two low computational finite-difference estimators (13) and (14) instead of computing high computational Hessian matrix 2 yyg(x, y) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Rd d and Jacobian matrix 2 xyg(x, y) Rd p. Moreover, we also use a low computational projection operator {|| e H(x, y, v, δϵ)|| rh} on vector e H(x, y, v, δϵ) instead of computing high computational projection operator S[µ,Lg] 2 yyg(x, y) used in (Huang, 2023b;a). Thus, our HJFBi O algorithm only requires a low computational complexity of O(p + d) at each iteration. From Algorithm 1, since vt = Prv(vt), we have Mrh e H(xt, yt, vt, δϵ) yg(xt, yt + δϵvt) yg(xt, yt δϵvt) yg(xt, yt + δϵvt) yg(xt, yt) + yg(xt, yt) yg(xt, yt δϵvt) k=0 2 yyg(xt, yt + kδϵvt)δϵvtdk k=0 2 yyg(xt, yt kδϵvt)δϵvtdk k=0 2 yyg(xt, yt + kδϵvt)dk k=0 2 yyg(xt, yt kδϵvt)dk vt = S[µ,Lg] h1 k=0 2 yyg(xt, yt + kδϵvt)dk k=0 2 yyg(xt, yt kδϵvt)dk i vt, (16) where the last equality holds by vt = Prv(vt) and the above definition 3.1. Thus, we have lim δϵ 0 Mrh e H(xt, yt, vt, δϵ) = lim δϵ 0 S[µ,Lg] h1 k=0 2 yyg(xt, yt + kδϵvt)dk k=0 2 yyg(xt, yt kδϵvt)dk i vt = S[µ,Lg] 2 yyg(xt, yt) vt. (17) Then we can obtain limδϵ 0 e v R(xt, yt, vt) = v R(xt, yt, vt). 4. Convergence Analysis In the section, we study convergence properties of our HJFBi O algorithm under some mild assumptions. Given xt from Algorithm 1, we define a useful gradient mapping G(xt, F(xt), γ) = 1 γ (xt xt+1), (18) where F(x) = f(x, y (x)) with y (x) arg miny g(x, y), and xt+1 is generated from xt+1 = Pγ ϕ( )(xt, F(xt)) = arg min x Rd n F(xt), x + 1 2γ x xt 2 + ϕ(x) o . When ϕ(x) 0, based on (18), we have xt+1 = xt γ F(xt), and then we obtain G(xt, F(xt), γ) = F(xt). 4.1. Convergence Properties of Our Algorithm on Unimodal g(x, y) In the subsection, we study the convergence properties of our HJFBi O algorithm when g(x, ) satisfies the global PL condition for all x Rd, i.e., it has a unique minimizer y (x) = arg miny g(x, y). We first give three useful lemmas. Lemma 4.1. Suppose the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1. Under the above Assumptions, given 0 < τ 1 6Lg , we have vt+1 v t+1 2 (1 µτ 4 ) vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ + 20 3 L2 f µ2 + L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 , (19) where v t = v (xt, yt) = S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt) for all t 1. Lemma 4.2. Assume the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1, given 0 < γ 1 2LF , we have Φ(xt+1) Φ(xt) γ 2 G(xt, wt, γ) 2 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v, where Φ(x) = F(x) + ϕ(x) and G(xt, wt, γ) = 1 γ (xt xt+1). Lemma 4.3. Suppose the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1. Under the above Assumptions 2.12.3, given γ min n λµ 16LG , µ 16L2g o and 0 < λ 1 2Lg , we have g(xt+1, yt+1) G(xt+1) 2 ) g(xt, yt) G(xt) + 1 8γ xt+1 xt 2 4λ yt+1 yt 2 + λ yg(xt, yt) ut 2, (20) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization where G(xt) = g(xt, y (xt)) with y (xt) arg miny g(xt, y) for all t 1. Based on the above useful lemmas, we give the convergence properties of our HJFBi O method in the following. Theorem 4.4. Assume the sequence {xt, yt, vt}T t=1 be generated from our Algorithm 1. 
Under the above Assumptions 2.1-2.5, let 0 < γ min 1 2LF , λµ 16LG , µ 16L2g , 3 160 L2 , µτ 30C2gxy , µ2λ 30(L2 f +r2v L2gxy) , 0 < λ min 1 2Lg , 3 80 L2 and 0 < τ 1 6Lg , we have t=1 G(xt, F(xt), γ) 2 8(Φ(x1) + g(x1, y1) G(x1) + v1 v 1 2 Φ ) Tγ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ , (21) where L2 = L2 f µ2 + L2 gyy C2 fx µ4 . Remark 4.5. Without loss of generality, let τ = O(1), λ = O(1) and γ = min 1 2LF , λµ 16LG , µ 16L2g , 3 160 L2 , µτ 30C2gxy , µ2λ 30(L2 f +r2 v L2 gxy) = O(1). Furthermore, let δϵ = T max(L2gxy,L2gyy/µ)r2v , we have t=1 G(xt, F(xt), γ) 2 O( 1 T ) = ϵ, we obtain T = (ϵ 1). Since requiring seven first-order gradients at each iteration, our HJFBi O algorithm can obtain an optimal gradient (or iteration) complexity of 7 T = O(ϵ 1) in finding an ϵ-stationary solution of Problem (1), which matches the lower bound established by the first-order method for finding an ϵ-stationary point of nonconvex smooth optimization (Carmon et al., 2020). When ϕ(x) 0, based on (18), we have G(xt, F(xt), γ) = F(xt). Thus our HJFBi O algorithm still obtains an optimal gradient complexity of 7 T = O(ϵ 1) in finding an ϵ-stationary solution of Problem (1) (i.e., min1 t T F(xt) 2 ϵ). 4.2. Convergence Properties of Our Algorithm on multimodal g(x, y) In this subsection, we study the convergence properties of our HJFBi O method when g(x, ) satisfies the local PL condition for all x, i.e, it has multi local minimizers y (x, y) arg miny g(x, y). As in (Liu et al., 2022), we define the attraction points. Definition 4.6. Given any (x, y), if sequence {yt} t=0 generated by gradient descent yt = yt 1 λ yg(x, yt 1) start- ing from y0 = y converges to y (x, y), we say that y (x, y) is the attraction point of (x, y) with step size λ > 0. An attraction basin be formed by the same attraction point in set of (x, y). In the following analysis, we assume the PL condition within the individual attraction basins. Let F (x) = f(x, y (x, y)). Assumption 4.7. (Local PL Condition in Attraction Basins) Assume that for any (x, y), y (x, y) exists. g(x, ) satisfies the local PL condition in attraction basins, i.e., for any (x, y), there exists a constant µ > 0 such that yg(x, y) 2 2µ g(x, y) G (x) , (22) where G (x) = g(x, y (x, y)). Assumption 4.8. The function g x, y (x, y) satisfies ϱ 2 yyg x, y (x, y) [µ, Lg], (23) where y (x, y) is the attraction point of (x, y), and ϱ( ) denotes the eigenvalue (or singular-value) function and Lg µ > 0. Assumption 4.9. The function Φ (x) = F (x) + ϕ(x) is bounded below in Rd, i.e., Φ = infx Rd Φ (x) > . Theorem 4.10. Assume the sequence {xt, yt, vt}T t=1 be generated from our Algorithm 1. Under the above Assumptions 4.7, 4.8, 2.3, 2.4, 4.9, let 0 < γ min 1 2LF , λµ 16LG , µ 16L2g , 3 160 L2 , µτ 30C2gxy , µ2λ 30(L2 f +r2v L2gxy) , 0 < λ min 1 2Lg , 3 80 L2 and 0 < τ 1 6Lg , we have t=1 G(xt, F (xt), γ) 2 8(Φ (x1) + g(x1, y1) G(x1) + v1 v 1 2 Φ ) Tγ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ , (24) where Φ (x) = F (x) + ϕ(x) and L2 = L2 f µ2 + L2 gyy C2 fx µ4 . Remark 4.11. The proof of Theorem 4.10 can follow the proof of Theorem 4.4. Let further δϵ = O 1 T max(L2gxy,L2gyy/µ)r2v , our HJFBi O algorithm can also obtain an optimal gradient (or iteration) complexity of 7 T = O(ϵ 1) in finding an ϵ-stationary solution of Problem (1) under local PL condition. 5. Experiments In the section, we conduct bilevel PL game and hyperrepresentation learning tasks to demonstrate efficiency of our method. 
In the experiments, we compare our method with the existing methods given in Table 1. Meanwhile, we add a baseline: BVFSM (Liu et al., 2021b). For fair comparison, since only the Ada PAG (Huang, 2023a) uses adaptive learning rate, we exclude it in the comparisons. Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Figure 1: PL Game: norm of gradient vs number of iteration under d = 100 (Left) and d = 200 (Right). 5.1. Bilevel Polyak-Łojasiewicz Game In this subsection, as in (Huang, 2023b), we apply the bilevel Polyak-Łojasiewicz game task to verify efficiency of our algorithm, defined as: min x Rd 1 2x T Px + x T R1y, (25) s.t. min y Rd 1 2y T Qy + x T R2y, where P = 1 n Pn i=1 pi(pi)T , Q = 1 n Pn i=1 qi(qi)T , R1 = 1 n Pn i=1 0.01r1 i (r1 i )T and R2 = 1 n Pn i=1 0.01r2 i (r2 i )T . Specifically, samples {pi}n i=1, {qi}n i=1, {r1 i }n i=1 and {r2 i }n i=1 are independently drawn from normal distributions N(0, ΣP ), N(0, ΣQ), N(0, ΣR1) and N(0, ΣR2), respectively. Here we set ΣP = U 1D1(U 1)T , where U 1 Rd l (l < d) is column orthogonal, and D1 Rl l is diagonal and its diagonal elements are distributed uniformly in the interval [µ, L] with 0 < µ < L. Let ΣQ = U 2D2(U 2)T , where U 2 Rd l is column orthogonal, and D2 Rl l is diagonal and its diagonal elements are distributed uniformly in the interval [µ, L] with 0 < µ < L. We also set ΣR1 = 0.001V 1(V 1)T and ΣR2 = 0.001V 2(V 2)T , where each element of V 1, V 2 Rd d is independently sampled from normal distribution N(0, 1). Since the covariance matrices ΣP and ΣQ are rank-deficient, it is ensured that both P and Q are singular. In the experiment, we set l = 50, n = 50 d, d = 100 and d = 200. For fair comparison, we set a basic learning rate as 0.01 for all algorithms. In our HJFBi O method, we set δϵ = 10 5. Figure 1 provides the results on norm of gradient vs iteration, where the iteration denotes iteration at outer loop in all algorithms. Here these results verify that our HJFBi O algorithm outperforms all comparisons. In particular, our HJFBi O method has a lower computation at each iteration than the MGBi O method. Figure 2: Distances of the algorithms under the case of d = 100 (Left) and d = 200 (Right). 5.2. Hyper-Representation Learning In this subsection, as in (Huang, 2023b), we consider the hyper-representation learning on matrix sensing task to verify efficiency of our method. Specifically, given n sensing matrices {Ci Rd d}n i=1 and n observations oi = Ci, H = trace(CT i H ), where H = U (U )T is a low-rank symmetric matrix with U Rd r, the goal of this task is to find an optimal matrix U , which can be defined as the following problem: min U Rd r 1 n i=1 ℓi(U) 1 2 Ci, UU T oi 2. (26) Next, we consider the hyper-representation learning in matrix sensing task, which be rewritten the following bilevel optimization problem: min x Rd r 1 1 |Dv| i Dv ℓi(x, y (x)), (27) s.t. y (x) arg min y Rd 1 |Dt| i Dt ℓi(x, y), where U = [y; x] Rd r is a concatenation of x and y. Here we define variable x to be the first r 1 columns of U and variable y to be the last column. Meanwhile, Dt denotes the training dataset, and Dv denotes the validation dataset. The ground truth low-rank matrix H is generated by H = U (U )T , where each entry of U is drawn from normal distribution N(0, 1/d) independently. We randomly generate n = 30 d samples of sensing matrices {Ci}n i=1 from standard normal distribution, and then compute the corresponding no-noise labels oi = Ci, H . 
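A sketch (our reconstruction, not the authors' code) of the matrix-sensing data generation just described: a low-rank ground truth $H^* = U^*(U^*)^T$, $n = 30d$ Gaussian sensing matrices, and noiseless labels $o_i = \langle C_i, H^*\rangle$. The rank `r` and the small dimension `d` below are assumptions chosen only to keep the example light; the paper uses $d \in \{100, 200\}$ and does not report `r` here.

```python
# Sketch of the matrix-sensing data generation (assumed r and reduced d).
import numpy as np

d, r = 20, 2                       # paper: d in {100, 200}; r is an assumption
n = 30 * d
rng = np.random.default_rng(0)

U_star = rng.normal(scale=np.sqrt(1.0 / d), size=(d, r))   # entries ~ N(0, 1/d)
H_star = U_star @ U_star.T                                 # low-rank symmetric target
C = rng.normal(size=(n, d, d))                             # sensing matrices
o = np.einsum('nij,ij->n', C, H_star)                      # o_i = <C_i, H*>

def loss(U):
    """Empirical loss (1/2n) * sum_i (<C_i, U U^T> - o_i)^2 from the experiments."""
    pred = np.einsum('nij,ij->n', C, U @ U.T)
    return 0.5 * np.mean((pred - o) ** 2)

print("loss at a random initial U:", loss(rng.normal(size=(d, r))))
```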
We split all samples into two dataset: a train dataset Dt with 40% data and a validation dataset Dv with 60% data. In the experiment, for fair comparison, we set the basic learning rate as 0.01 for all algorithms. In our HJFBi O method, we set δϵ = 10 5. Let ℓ(U) = 1 2n Pn i=1 Ci, UU T oi 2 denote the loss, and UU T H 2 F / H 2 F denotes the distance. Figures 2 and 3 show that our HJFBi O method Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Figure 3: Losses of the algorithms under the case of d = 100 (Left) and d = 200 (Right). outperforms all the comparisons on the distance vs iteration, where the iteration denotes iteration at outer loop in all algorithms. While our HJFBi O method is comparable with the MGBi O method on the loss vs iteration. In particular, our HJFBi O method has a lower computation at each iteration than the MGBi O method. 6. Conclusions In the paper, we proposed an efficient Hessian/Jacobian-free bilevel method to solve the nonconvex-PL bilevel problems based on the finite-difference estimator and a new projection operator. Moreover, under some mild assumptions, we proved that our HJFBi O method obtains the best known convergence rate O( 1 T ), and gets an optimal gradient (or iteration) complexity of O(ϵ 1) in finding ϵ-stationary solution under global and local PL condition, respectively. Acknowledgements We thank the anonymous reviewers for their helpful comments. This paper was partially supported by NSFC under Grant No. 62376125. It was also partially supported by the Fundamental Research Funds for the Central Universities NO.NJ2023032. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Cao, J., Jiang, R., Abolfazli, N., Hamedani, E. Y., and Mokhtari, A. Projection-free methods for stochastic simple bilevel optimization with convex lower-level problem. ar Xiv preprint ar Xiv:2308.07536, 2023. Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points i. Mathematical Programming, 184(1-2):71 120, 2020. Chakraborty, S., Bedi, A. S., Koppel, A., Manocha, D., Wang, H., Huang, F., and Wang, M. Aligning agent policy with externalities: Reward design via bilevel rl. ar Xiv preprint ar Xiv:2308.02585, 2023. Chen, L., Xu, J., and Zhang, J. On bilevel optimization without lower-level strong convexity. ar Xiv preprint ar Xiv:2301.00712, 2023a. Chen, L., Xu, J., and Zhang, J. On finding small hypergradients in bilevel optimization: Hardness results and improved analysis. 37th Annual Conference on Learning Theory, 2024. Chen, T., Sun, Y., Xiao, Q., and Yin, W. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2466 2488. PMLR, 2022. Chen, Z., Kailkhura, B., and Zhou, Y. An accelerated proximal algorithm for regularized nonconvex and nonsmooth bi-level optimization. Machine Learning, 112(5):1433 1463, 2023b. Colson, B., Marcotte, P., and Savard, G. An overview of bilevel optimization. Annals of operations research, 153 (1):235 256, 2007. Dagr eou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In Advances in Neural Information Processing Systems, 2022. Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. 
Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568 1577. PMLR, 2018. Frei, S. and Gu, Q. Proxy convexity: A unified framework for the analysis of neural networks trained by gradient descent. Advances in Neural Information Processing Systems, 34:7937 7949, 2021. Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018. Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A twotimescale framework for bilevel optimization: Complexity analysis and application to actor-critic. ar Xiv preprint ar Xiv:2007.05170, 2020. Huang, F. Adaptive mirror descent bilevel optimization. ar Xiv preprint ar Xiv:2311.04520, 2023a. Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Huang, F. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level. ar Xiv preprint ar Xiv:2303.03944, 2023b. Huang, F., Li, J., Gao, S., and Huang, H. Enhanced bilevel optimization via bregman distance. Advances in Neural Information Processing Systems, 35:28928 28939, 2022. Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pp. 4882 4892. PMLR, 2021. Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, pp. 795 811. Springer, 2016. Kwon, J., Kwon, D., Wright, S., and Nowak, R. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. ar Xiv preprint ar Xiv:2309.01753, 2023. Li, J., Gu, B., and Huang, H. A fully single loop algorithm for bilevel optimization without hessian inverse. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7426 7434, 2022. Liu, B., Ye, M., Wright, S., Stone, P., et al. Bome! bilevel optimization made easy: A simple first-order approach. In Advances in Neural Information Processing Systems, 2022. Liu, R., Gao, J., Zhang, J., Meng, D., and Lin, Z. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a. Liu, R., Liu, X., Zeng, S., Zhang, J., and Zhang, Y. Valuefunction-based sequential minimization for bi-level optimization. ar Xiv preprint ar Xiv:2110.04974, 2021b. Liu, R., Liu, Y., Yao, W., Zeng, S., and Zhang, J. Averaged method of multipliers for bi-level optimization without lower-level strong convexity. ar Xiv preprint ar Xiv:2302.03407, 2023. Lu, S. Slm: A smoothed first-order lagrangian method for structured constrained nonconvex optimization. In Thirtyseventh Conference on Neural Information Processing Systems, 2023. Lu, Z. and Mei, S. First-order penalty methods for bilevel optimization. ar Xiv preprint ar Xiv:2301.01716, 2023. Nesterov, Y. Lectures on convex optimization, volume 137. Springer, 2018. Polyak, B. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864 878, 1963. Shen, H. and Chen, T. On penalty-based bilevel gradient descent method. ar Xiv preprint ar Xiv:2302.05185, 2023. Song, C., Ramezani-Kebrya, A., Pethick, T., Eftekhari, A., and Cevher, V. Subquadratic overparameterization for shallow neural networks. Advances in Neural Information Processing Systems, 34:11247 11259, 2021. 
Sow, D., Ji, K., Guan, Z., and Liang, Y. A constrained optimization approach to bilevel optimization with multiple inner minima. ar Xiv preprint ar Xiv:2203.01123, 2022a. Sow, D., Ji, K., and Liang, Y. On the convergence theory for hessian-free bilevel algorithms. Advances in Neural Information Processing Systems, 35:4136 4149, 2022b. Xiao, Q., Lu, S., and Chen, T. A generalized alternating method for bilevel optimization under the polyak-{\L} ojasiewicz condition. ar Xiv preprint ar Xiv:2306.02422, 2023. Yang, H., Luo, L., Li, C. J., Jordan, M., and Fazel, M. Accelerating inexact hypergradient descent for bilevel optimization. In OPT 2023: Optimization for Machine Learning, 2023a. Yang, Y., Xiao, P., and Ji, K. Achieving o(ϵ 1.5) complexity in hessian/jacobian-free stochastic bilevel optimization. ar Xiv preprint ar Xiv:2312.03807, 2023b. Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization A. Detailed Convergence Analysis In this section, we provide the detailed convergence analysis of our algorithms. We first review some useful lemmas. Lemma A.1. ((Nesterov, 2018)) Assume that f(x) is a differentiable convex function and X is a convex set. x X is the solution of the constrained problem minx X f(x), if f(x ), x x 0, x X. (28) Lemma A.2. ((Karimi et al., 2016)) The function f(x) : Rd R is L-smooth and satisfies PL condition with constant µ, then it also satisfies error bound (EB) condition with µ, i.e., for all x Rd f(x) µ x x , (29) where x arg minx f(x). It also satisfies quadratic growth (QG) condition with µ, i.e., f(x) min x f(x) µ 2 x x 2. (30) A.1. Convergence Analysis of HJFBi O Algorithm for Bilevel Optimization with Regularization In this subsection, we detail the convergence analysis of our HJFBi O algorithm for bilevel optimization. We give some useful lemmas. Lemma A.3. (Restatement of Lemma 4.1) Suppose the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1. Under the above Assumptions 2.1-2.4, given 0 < τ 1 6Lg , we have vt+1 v t+1 2 (1 µτ 4 ) vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ + 20 3 L2 f µ2 + L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 , (31) where v t = v (xt, yt) = S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt) for all t 1. Proof. Since the function R(x, y, v) = 1 2v T S[µ,Lg] 2 yyg(x, y) v v T yf(x, y) is µ-strongly convex on variable v, we have R(xt, yt, v) R(xt, yt, vt) + v R(xt, yt, vt), v vt + µ = R(xt, yt, vt) + ht, v vt+1 + v R(xt, yt, vt) ht, v vt+1 + v R(xt, yt, vt), vt+1 vt + µ 2 v vt 2. (32) Since the function R(x, y, v) is Lg-smooth on variable v, we have R(xt, yt, vt+1) R(xt, yt, vt) + v R(xt, yt, vt), vt+1 vt + Lg 2 vt+1 vt 2. (33) By combining the about inequalities (32) with (33), we have R(xt, yt, v) R(xt, yt, vt+1) + ht, v vt+1 + v R(xt, yt, vt) ht, v vt+1 2 v vt 2 Lg 2 vt+1 vt 2. (34) According to the line 5 of Algorithm 1, we have vt+1 = Prv vt τht = arg minv Λv n ht, v vt + 1 2τ v vt 2o with Λv = {v Rp | v rv}. According to the above Lemma A.1, given v Λv, we have τ (vt+1 vt), v vt+1 0. (35) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization By plugging the inequalities (35) into (34), we have R(xt, yt, v) R(xt, yt, vt+1) + 1 τ vt+1 vt, vt v + 1 τ vt+1 vt 2 + v R(xt, yt, vt) ht, v vt+1 + µ 2 v vt 2 Lg 2 vt+1 vt 2. (36) Let v = v t = v (xt, yt) = S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt), then we have R(xt, yt, v t ) R(xt, yt, vt+1) + 1 τ vt+1 vt, vt v t + (1 2 ) vt+1 vt 2 + v R(xt, yt, vt) ht, v t vt+1 + µ 2 v t vt 2. 
(37) Due to the strongly-convexity of R(x, y, ) and v t = arg minv Λv R(xt, yt, v) with rv = Cfy µ , we have R(xt, yt, v t ) R(xt, yt, vt+1). Thus, we obtain τ vt+1 vt, vt v t + (1 2 ) vt+1 vt 2 + v R(xt, yt, vt) ht, v t vt+1 + µ 2 v t vt 2. (38) Consider the term vt+1 vt, vt v t , we have vt+1 vt, vt v t = 1 2 vt+1 v t 2 1 2 vt v t 2 1 2 vt+1 vt 2. (39) Consider the upper bound of the term v R(xt, yt, vt) ht, v t vt+1 , we have v R(xt, yt, vt) ht, v t vt+1 = v R(xt, yt, vt) ht, v t vt + v R(xt, yt, vt) ht, vt vt+1 µ v R(xt, yt, vt) ht 2 µ 4 v t vt 2 1 µ v R(xt, yt, vt) ht 2 µ 4 vt vt+1 2 µ v R(xt, yt, vt) ht 2 µ 4 v t vt 2 µ 4 vt vt+1 2 2L2 gyyr4 vδ2 ϵ µ µ 4 v t vt 2 µ 4 vt vt+1 2, (40) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization where the last inequality holds by the following inequality: v R(xt, yt, vt) ht 2 = v R(xt, yt, vt) e v R(xt, yt, vt) 2 = S[µ,Lg] 2 yyg(xt, yt) vt yf(xt, yt) Mrh e H(xt, yt, vt, δϵ) + yf(xt, yt) 2 = S[µ,Lg] 2 yyg(xt, yt) vt Mrh e H(xt, yt, vt, δϵ) 2 = S[µ,Lg] 2 yyg(xt, yt) vt Mrh 1 yg(xt, yt + δϵvt) yg(xt, yt δϵvt) 2 = S[µ,Lg] 2 yyg(xt, yt) vt Mrh 1 yg(xt, yt + δϵvt) yg(xt, yt) + yg(xt, yt) yg(xt, yt δϵvt) 2 = S[µ,Lg] 2 yyg(xt, yt) vt Mrh 1 k=0 2 yyg(xt, yt + kδϵvt)δϵvtdk + 1 k=0 2 yyg(xt, yt kδϵvt)δϵvtdk 2 = S[µ,Lg] 2 yyg(xt, yt) vt Mrh 1 k=0 2 yyg(xt, yt + kδϵvt)dk + 1 k=0 2 yyg(xt, yt kδϵvt)dk vt 2 (i) = S[µ,Lg] 2 yyg(xt, yt) vt S[µ,Lg] h1 k=0 2 yyg(xt, yt + kδϵvt)dk + 1 k=0 2 yyg(xt, yt kδϵvt)dk i vt 2 (ii) r2 v 2 yyg(xt, yt) 1 k=0 2 yyg(xt, yt + kδϵvt)dk 1 k=0 2 yyg(xt, yt kδϵvt)dk 2 2 yyg(xt, yt) 2 yyg(xt, yt + kδϵvt) dk 2 + r2 v 2 2 yyg(xt, yt) 2 yyg(xt, yt kδϵvt) dk 2 L2 gyyr4 vδ2 ϵ , (41) where the above equality (i) holds by vt = Prv(vt) and the above definition 3.1, and the above inequality (ii) holds by vt rv. By plugging the inequalities (39) and (40) into (38), we obtain 1 2τ vt+1 v t 2 ( 1 4 ) vt v t 2 + (µ 2τ ) vt+1 vt 2 + 2L2 gyyr4 vδ2 ϵ µ 4 ) vt v t 2 + (3Lg 2τ ) vt+1 vt 2 + 2L2 gyyr4 vδ2 ϵ µ 4 ) vt v t 2 ( 3 4 ) vt+1 vt 2 + 2L2 gyyr4 vδ2 ϵ µ 4 ) vt v t 2 3 8τ vt+1 vt 2 + 2L2 gyyr4 vδ2 ϵ µ , (42) where the second inequality holds by Lg µ, and the last inequality is due to 0 < τ 1 6Lg . It implies that vt+1 v t 2 (1 µτ 2 ) vt v t 2 3 4 vt+1 vt 2 + 4τL2 gyyr4 vδ2 ϵ µ . (43) Since v t = S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt) = arg minv Λv R(xt, yt, v) with rv = Cfy Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization v t+1 = S[µ,Lg] 2 yyg(xt+1, yt+1) 1 yf(xt+1, yt+1) = arg minv Λv R(xt+1, yt+1, v), we have v t+1 v t 2 = S[µ,Lg] 2 yyg(xt+1, yt+1) 1 yf(xt+1, yt+1) S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt) 2 = S[µ,Lg] 2 yyg(xt+1, yt+1) 1 yf(xt+1, yt+1) S[µ,Lg] 2 yyg(xt+1, yt+1) 1 yf(xt, yt) + S[µ,Lg] 2 yyg(xt+1, yt+1) 1 yf(xt, yt) S[µ,Lg] 2 yyg(xt, yt) 1 yf(xt, yt) 2 4L2 f µ2 + 4L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 . (44) Next, we decompose the term vt+1 v t+1 2 as follows: vt+1 v t+1 2 = vt+1 v t + v t v t+1 2 (45) = vt+1 v t 2 + 2 vt+1 v t , v t v t+1 + v t v t+1 2 4 ) vt+1 v t 2 + (1 + 4 µτ ) v t v t+1 2 4 ) vt+1 v t 2 + (1 + 4 µτ ) 4L2 f µ2 + 4L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 , where the first inequality holds by Cauchy-Schwarz inequality and Young s inequality, and the second inequality is due to the above inequality (44). By combining the above inequalities (43) and (45), we have vt+1 v t+1 2 (1 + µτ 2 ) vt v t 2 (1 + µτ 4 vt+1 vt 2 4 )4τL2 gyyr4 vδ2 ϵ µ + (1 + 4 µτ ) 4L2 f µ2 + 4L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 . Since 0 < τ 1 6Lg and Lg µ, we have τ 1 6Lg 1 6µ. 
Then we have Thus we have vt+1 v t+1 2 (1 µτ 4 ) vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ + 20 3 L2 f µ2 + L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 . (46) Lemma A.4. (Restatement of Lemma 4.2) Assume the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1, given 0 < γ 1 2LF , we have Φ(xt+1) Φ(xt) γ 2 G(xt, wt, γ) 2 + 12γ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v, where Φ(x) = F(x) + ϕ(x) and G(xt, wt, γ) = 1 γ (xt xt+1). Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Proof. By the line 4 of Algorithm 1, we have xt+1 = Pγ ϕ( ) xt, wt = arg min x Rd n wt, x + 1 2γ x xt 2 + ϕ(x) o . (47) Then we define a gradient mapping G(xt, wt, γ) = 1 γ (xt xt+1). By the optimality condition of the subproblem (47), we have for any z Rd γ (xt+1 xt) + ϑt+1, z xt+1 0, (48) where ϑt+1 ϕ(xt+1). Let z = xt, and by the convexity of ϕ(x), we can obtain wt, xt xt+1 1 γ xt+1 xt 2 + ϑt+1, xt+1 xt γ xt+1 xt 2 + ϕ(xt+1) ϕ(xt). (49) According to the Lemma 2.7, function F(x) has LF -Lipschitz continuous gradient. Let G(xt, wt, γ) = 1 γ (xt xt+1), we have F(xt+1) F(xt) + F(xt), xt+1 xt + LF 2 xt+1 xt 2 = F(xt) + wt, xt+1 xt + γ F(xt) wt, G(xt, wt, γ) + γ2LF 2 G(xt, wt, γ) 2 F(xt) γ G(xt, wt, γ) 2 ϕ(xt+1) + ϕ(xt) + γ F(xt) wt, G(xt, wt, γ) + γ2LF 2 G(xt, wt, γ) 2 (ii) F(xt) γ 2 G(xt, wt, γ) 2 ϕ(xt+1) + ϕ(xt) + γ wt F(xt) 2, (50) where the second last inequality holds by the above inequality (49), and the last inequality holds by 0 < γ 1 2LF and the following inequality F(xt) wt, G(xt, wt, γ) wt F(xt) G(xt, wt, γ) 2 wt F(xt) 2 + 1 2ρ G(xt, wt, γ) 2 = wt F(xt) 2 + 1 4 G(xt, wt, γ) 2, (51) where the above inequality holds by Young inequality with ρ = 2. Since wt = e f(xt, yt, vt) in Algorithm 1, we have wt F(xt) 2 = e f(xt, yt, vt) F(xt) 2 2 b f(xt, yt, vt) F(xt) 2 + 2 e f(xt, yt, vt) b f(xt, yt, vt) 2 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6C2 gxy vt v t 2 + 2L2 gxyδ2 ϵ r4 v, (52) where the last inequality holds by the following inequalities (53) and (54). Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Considering the term b f(xt, yt, vt) F(xt) 2, we have b f(xt, yt, vt) F(xt) 2 = xf(xt, yt) 2 xyg(xt, yt)vt F(xt) 2 = xf(xt, yt) 2 xyg(xt, yt)vt xf(xt, y (xt)) + 2 xyg(xt, y (xt))v t 2 = xf(xt, yt) xf(xt, y (xt)) 2 xyg(xt, yt)vt + 2 xyg(xt, y (xt))vt 2 xyg(xt, y (xt))vt + 2 xyg(xt, y (xt))v t 2 3L2 f + 3r2 v L2 gxy yt y (xt) 2 + 3C2 gxy vt v t 2 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 3C2 gxy vt v t 2, (53) where the last inequality holds by the above Lemma A.2. Considering the term e f(xt, yt, vt) b f(xt, yt, vt) 2, we have e f(xt, yt, vt) b f(xt, yt, vt) 2 = xf(xt, yt) e J(xt, yt, vt, δϵ) xf(xt, yt) + 2 xyg(xt, yt)vt 2 = xg(xt, yt + δϵvt) xg(xt, yt δϵvt) 2δϵ 2 xyg(xt, yt)vt xg(xt, yt + δϵvt) xg(xt, yt) δϵ 2 xyg(xt, yt)vt + xg(xt, yt) xg(xt, yt δϵvt) δϵ 2 xyg(xt, yt)vt 2 xg(xt, yt + δϵvt) xg(xt, yt) δϵ 2 xyg(xt, yt)vt 2 xg(xt, yt) xg(xt, yt δϵvt) δϵ 2 xyg(xt, yt)vt 2 2 xyg(xt, yt + kδϵvt) 2 xyg(xt, yt) δϵvtdk 2 2 xyg(xt, yt kδϵvt) 2 xyg(xt, yt) δϵvtdk 2 2 xyg(xt, yt + kδϵvt) 2 xyg(xt, yt) vt δϵdk 2 2 xyg(xt, yt kδϵvt) 2 xyg(xt, yt) vt δϵdk 2 L2 gxyδ2 ϵ vt 4 L2 gxyδ2 ϵ r4 v, (54) where the last inequality holds by vt rv. By combining the above inequalities (50) with (52), we can obtain Φ(xt+1) Φ(xt) γ 2 G(xt, wt, γ) 2 + 12γ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v. Lemma A.5. (Restatement of Lemma 4.3) Suppose the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1. 
Under the Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization above Assumptions 2.1-2.3, given γ min n λµ 16LG , µ 16L2g o and 0 < λ 1 2Lg , we have g(xt+1, yt+1) G(xt+1) (1 λµ 2 ) g(xt, yt) G(xt) + 1 8γ xt+1 xt 2 1 4λ yt+1 yt 2 + λ yg(xt, yt) ut 2, (55) where G(xt) = g(xt, y (xt)) with y (xt) arg miny g(xt, y) for all t 1. Proof. Using the Assumption 2.3, i.e., Lg-smoothness of g(x, ), such that g(xt+1, yt+1) g(xt+1, yt) + yg(xt+1, yt), yt+1 yt + Lg 2 yt+1 yt 2. (56) Since yt+1 = arg miny Rp ut, y + 1 2λ(y yt)T (y yt) = yt λut, we can obtain yg(xt+1, yt), yt+1 yt = λ yg(xt+1, yt), ut yg(xt+1, yt) 2 + ut 2 yg(xt+1, yt) yg(xt, yt) + yg(xt, yt) ut 2 2 yg(xt+1, yt) 2 1 2λ yt+1 yt 2 + λL2 g xt+1 xt 2 + λ yg(xt, yt) ut 2 λµ g(xt+1, yt) G(xt+1) 1 2λ yt+1 yt 2 + λL2 g xt+1 xt 2 + λ yg(xt, yt) ut 2, (57) where the last inequality is due to the quadratic growth condition of µ-PL functions, i.e., yg(xt+1, yt) 2 2µ g(xt+1, yt) min y g(xt+1, y ) = 2µ g(xt+1, yt) G(xt+1) . (58) Substituting (57) into (56), we have g(xt+1, yt+1) g(xt+1, yt) λµ g(xt+1, yt) G(xt+1) 1 2λ yt+1 yt 2 + λL2 g xt+1 xt 2 + λ yg(xt, yt) ut 2 + Lg 2 yt+1 yt 2, (59) then rearranging the terms, we can obtain g(xt+1, yt+1) G(xt+1) (1 λµ) g(xt+1, yt) G(xt+1) 1 2λ yt+1 yt 2 + λL2 g xt+1 xt 2 + λ yg(xt, yt) ut 2 + Lg 2 yt+1 yt 2. (60) Next, using Lg-smoothness of function f( , yt), such that g(xt+1, yt) g(xt, yt) + xg(xt, yt), xt+1 xt + Lg 2 xt+1 xt 2, (61) then we have g(xt+1, yt) g(xt, yt) xg(xt, yt), xt+1 xt + Lg 2 xt+1 xt 2 = xg(xt, yt) G(xt), xt+1 xt + G(xt), xt+1 xt + Lg 2 xt+1 xt 2 8γ xt+1 xt 2 + 2γ xg(xt, yt) G(xt) 2 + G(xt), xt+1 xt + Lg 2 xt+1 xt 2 8γ xt+1 xt 2 + 2L2 gγ yt y (xt) 2 + G(xt+1) G(xt) + LG 2 xt+1 xt 2 + Lg 2 xt+1 xt 2 4L2 gγ µ g(xt, yt) G(xt) + G(xt+1) G(xt) + ( 1 8γ + LG) xt+1 xt 2, (62) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization where the second last inequality is due to LG-smoothness of function G(x), and the last inequality holds by Lemma A.2 and Lg LG. Then we have g(xt+1, yt) G(xt+1) = g(xt+1, yt) g(xt, yt) + g(xt, yt) G(xt) + G(xt) G(xt+1) (1 + 4L2 gγ µ ) g(xt, yt) G(xt) + ( 1 8γ + LG) xt+1 xt 2. (63) Substituting (63) in (60), we get g(xt+1, yt+1) G(xt+1) (1 λµ)(1 + 4L2 gγ µ ) g(xt, yt) G(xt) + (1 λµ)( 1 8γ + LG) xt+1 xt 2 2λ yt+1 yt 2 + λL2 g xt+1 xt 2 + λ yg(xt, yt) ut 2 + Lg 2 yt+1 yt 2 = (1 λµ)(1 + 4L2 gγ µ ) g(xt, yt) G(xt) + 1 8γ LGλµ + L2 gλ xt+1 xt 2 λ Lg yt+1 yt 2 + λ yg(xt, yt) ut 2 2 ) g(xt, yt) G(xt) + 1 8γ xt+1 xt 2 1 4λ yt+1 yt 2 + λ yg(xt, yt) ut 2, (64) where the last inequality holds by γ min n λµ 16LG , µ 16L2g o , LG Lg(1 + κ)2 and λ 1 2Lg for all t 1, i.e., γ λµ 16LG λ 16LGγ µ (1 + κ)2γ 8κ2γ λµ 16LG , µ 16L2g 8γ LG + L2 gλ 2λ Lg, t 1. (65) Theorem A.6. (Restatement of Theorem 4.4) Assume the sequence {xt, yt, vt}T t=1 be generated from our Algorithm 1. Under the above Assumptions 2.1-2.5, let 0 < γ min 1 2LF , λµ 16LG , µ 16L2g , 3 160 L2 , µτ 30C2gxy , µ2λ 30(L2 f +r2v L2gxy) , 0 < λ min 1 2Lg , 3 80 L2 and 0 < τ 1 6Lg , we have t=1 G(xt, F(xt), γ) 2 8(Φ(x1) + g(x1, y1) G(x1) + v1 v 1 2 Φ ) Tγ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ , (66) where Φ(x) = F(x) + ϕ(x) and L2 = L2 f µ2 + L2 gyy C2 fx µ4 . Proof. According to Lemma A.4, we have Φ(xt+1) Φ(xt) γ 2 G(xt, wt, γ) 2 + 12γ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v, (67) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization According to the line 3 of Algorithm 1, we have ut = yg(xt, yt). 
Then by using Lemma A.5, we have g(xt+1, yt+1) G(xt+1) g(xt, yt) G(xt) 2 g(xt, yt) G(xt) + 1 8γ xt+1 xt 2 1 4λ yt+1 yt 2 + λ yg(xt, yt) ut 2 2 g(xt, yt) G(xt) + γ 8 G(xt, wt, γ) 2 1 4λ yt+1 yt 2, (68) where the above equality holds by ut = yg(xt, yt) and G(xt, wt, γ) = 1 γ (xt xt+1). By using Lemma A.3, we have vt+1 v t+1 2 vt v t 2 4 vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ 3 L2 f µ2 + L2 gyy C2 fx µ4 xt+1 xt 2 + yt+1 yt 2 4 vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ + 20 3 L2γ2 G(xt, wt, γ) 2 + 20 3 L2 yt+1 yt 2, (69) where the above equality holds by L2 = L2 f µ2 + L2 gyy C2 fx µ4 and G(xt, wt, γ) = 1 γ (xt xt+1). Next we define a useful Lyapunov function (i.e. potential function) for any t 1 Ψt = Φ(xt) + g(xt, yt) G(xt) + vt v t 2. (70) By using the above inequalities (67), (68) and (69), we have Ψt+1 Ψt = Φ(xt+1) Φ(xt) + g(xt+1, yt+1) G(xt+1) g(xt, yt) G(xt) + vt+1 v t+1 2 vt v t 2 2 G(xt, wt, γ) 2 + 12γ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v 2 g(xt, yt) G(xt) + γ 8 G(xt, wt, γ) 2 1 4λ yt+1 yt 2 4 vt v t 2 3 4 vt+1 vt 2 + 25τL2 gyyr4 vδ2 ϵ 6µ + 20 3 L2γ2 G(xt, wt, γ) 2 + 20 3 L2 yt+1 yt 2 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) γ 4 G(xt, wt, γ) 2 µτ 4 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v + 25τL2 gyyr4 vδ2 ϵ 6µ , (71) where the last inequality is due to 0 < γ 3 160 L2 and 0 < λ 3 80 L2 . x+ t+1 = Pγ ϕ( )(xt, F(xt)) = arg min x Rd n F(xt), x + 1 2γ x xt 2 + ϕ(x) o , (72) xt+1 = Pγ ϕ( )(xt, wt) = arg min x Rd n wt, x + 1 2γ x xt 2 + ϕ(x) o . (73) By the optimality conditions of (72) and (73), for any z Rd, there exist ϑ1 ϕ(x+ t+1) and ϑ2 ϕ(xt+1) such that γ (x+ t+1 xt) + ϑ1, z x+ t+1 0, (74) γ (xt+1 xt) + ϑ2, z xt+1 0. (75) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Putting z = xt+1 into (74), by the convexity of ϕ(x), we have F(xt), xt+1 x+ t+1 1 γ x+ t+1 xt, x+ t+1 xt+1 + ϑ1, x+ t+1 xt+1 γ x+ t+1 xt, x+ t+1 xt+1 + ϕ(x+ t+1) ϕ(xt+1). (76) Similarly, putting z = x+ t+1 into (75), by the convexity of ϕ(x), we have wt, x+ t+1 xt+1 1 γ xt+1 xt, xt+1 x+ t+1 + ϑ2, xt+1 x+ t+1 γ xt+1 xt, xt+1 x+ t+1 + ϕ(xt+1) ϕ(x+ t+1). (77) Summing up (76) and (77), we can obtain F(xt) wt xt+1 x+ t+1 F(xt) wt, xt+1 x+ t+1 1 γ x+ t+1 xt+1 2. (78) Then we have γ x+ t+1 xt+1 = 1 Pγ ϕ( )(xt, F(xt)) Pγ ϕ( )(xt, wt) . (79) Since G(xt, wt, γ) = 1 γ xt Pγ ϕ( )(xt, wt) and G(xt, F(xt), γ) = 1 γ xt Pγ ϕ( )(xt, F(xt)) , we have G(xt, F(xt), γ) 2 2 G(xt, wt, γ) 2 + 2 G(xt, wt, γ) G(xt, F(xt), γ) 2 = 2 G(xt, wt, γ) 2 + 2 γ2 Pγ ϕ( )(xt, F(xt)) Pγ ϕ( )(xt, wt) 2 (i) 2 G(xt, wt, γ) 2 + 2 wt F(xt)) 2 2 G(xt, wt, γ) 2 + 24 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 12C2 gxy vt v t 2 + 4L2 gxyδ2 ϵ r4 v, (80) where the inequality (i) holds by the above inequality (79). Then we can obtain G(xt, wt, γ) 2 1 2 G(xt, F(xt), γ) 2 + 12 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 6C2 gxy vt v t 2 + 2L2 gxyδ2 ϵ r4 v. 
(81) Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Plugging the above inequalities (81) into (71), we can further get µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) γ 4 G(xt, wt, γ) 2 µτ 4 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v + 25τL2 gyyr4 vδ2 ϵ 6µ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) γ 8 G(xt, F(xt), γ) 2 µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 3γC2 gxy 2 vt v t 2 + γL2 gxyδ2 ϵ r4 v 2 4 6γC2 gxy vt v t 2 + 2γL2 gxyδ2 ϵ r4 v + 25τL2 gyyr4 vδ2 ϵ 6µ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) γ 8 G(xt, F(xt), γ) 2 4 15γC2 gxy 2 vt v t 2 + 5γL2 gxyδ2 ϵ r4 v 2 + 25τL2 gyyr4 vδ2 ϵ 6µ 8 G(xt, F(xt), γ) 2 + 5γL2 gxyδ2 ϵ r4 v 2 + 25τL2 gyyr4 vδ2 ϵ 6µ , (82) where the last inequality holds by 0 < γ µτ 30C2gxy , µ2λ 30(L2 f +r2v L2gxy) . Based on the inequality (82), we have t=1 G(xt, F(xt), γ) 2 1 γ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ (i) 8(Ψ1 Φ ) Tγ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ = 8(Φ(x1) + g(x1, y1) G(x1) + v1 v 1 2 Φ ) Tγ + 20L2 gxyδ2 ϵ r4 v + 100τL2 gyyr4 vδ2 ϵ 3γµ , (83) where the above inequality (i) holds by Assumption 2.5. Set δϵ = O 1 T max(L2gxy,L2gyy/µ)r2v , we can obtain min 1 t T G(xt, F(xt), γ) 2 1 t=1 G(xt, F(xt), γ) 2 O( 1 A.2. Convergence Analysis of of HJFBi O Algorithm for Bilevel Optimization without Regularization In this subsection, we provide the convergence analysis of our HJFBi O algorithm for bilevel optimization without Regularization. Lemma A.7. Assume the sequence {xt, yt, vt}T t=1 be generated from Algorithm 1, given 0 < γ 1 2LF , we have F(xt+1) F(xt) γ 2 F(xt) 2 γ 4 e f(xt, yt, vt) 2 + γL2 gxyr4 vδ2 ϵ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 3γC2 gxy vt v t 2, Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization Proof. When ϕ(x) 0, at the line 4 of Algorithm 1, we have xt+1 = xt γwt. By using the Lipschitz smoothness of the objective function F(x), we have F(xt+1) F(xt) + F(xt), xt+1 xt + LF 2 xt+1 xt 2 = F(xt) γ F(xt), wt + γ2LF 2 F(xt) 2 γ 2 (1 γLF ) wt 2 + γ 2 wt F(xt) 2 2 F(xt) 2 γ 2 (1 γLF ) e f(xt, yt, vt) 2 + γ 2 e f(xt, yt, vt) F(xt) 2 (i) F(xt) γ 2 F(xt) 2 γ 4 e f(xt, yt, vt) 2 + γ 2 e f(xt, yt, vt) F(xt) 2 (ii) F(xt) γ 2 F(xt) 2 γ 4 e f(xt, yt, vt) 2 + γL2 gxyr4 vδ2 ϵ µ L2 f + r2 v L2 gxy g(xt, yt) G(xt) + 3γC2 gxy vt v t 2, (85) where the above inequality (i) holds by 0 < γ 1 2LF , and the above inequality (ii) holds by the above inequality (52). Theorem A.8. Assume the sequence {xt, yt, vt}T t=1 be generated from our Algorithm 1. Under the above Assumptions 2.12.5, let 0 < γ min 1 2LF , λµ 16LG , µ 16L2g , 3 160 L2 , µτ 12C2gxy , λµ2 12 L2 f +r2v L2gxy , 0 < λ min 1 2Lg , 3 80 L2 and 0 < τ 1 6Lg , t=1 F(xt) 2 2 F(x1) + g(x1, y1) G(x1) + v1 v 1 2 F Tγ + 2L2 gxyr4 vδ2 ϵ + 25τL2 gyyr4 vδ2 ϵ 2γµ , (86) where L2 = L2 f µ2 + L2 gyy C2 fx µ4 and F = infx Rd F(x) > . Proof. According to the line 3 of Algorithm 1, we have ut = yg(xt, yt). Then by using Lemma A.5, we have g(xt+1, yt+1) G(xt+1) g(xt, yt) G(xt) 2 g(xt, yt) G(xt) + 1 8γ xt+1 xt 2 1 4λ yt+1 yt 2 + λ yg(xt, yt) ut 2 2 g(xt, yt) G(xt) + γ 8 e f(xt, yt, vt) 2 1 4λ yt+1 yt 2, (87) where the above equality holds by ut = yg(xt, yt) and xt+1 = xt γwt = xt γ e f(xt, yt, vt). 
Theorem A.8. Assume the sequence $\{x_t, y_t, v_t\}_{t=1}^T$ is generated from our Algorithm 1. Under the above Assumptions 2.1–2.5, let $0 < \gamma \leq \min\Big\{\frac{1}{2L_F},\ \frac{\lambda\mu}{16 L_G},\ \frac{\mu}{16 L_g^2},\ \frac{3}{160 L^2},\ \frac{\mu\tau}{12 C^2_{gxy}},\ \frac{\lambda\mu^2}{12(L_f^2 + r_v^2 L^2_{gxy})}\Big\}$, $0 < \lambda \leq \min\Big\{\frac{1}{2L_g},\ \frac{3}{80 L^2}\Big\}$ and $0 < \tau \leq \frac{1}{6L_g}$, we have
$$\frac{1}{T}\sum_{t=1}^T \|\nabla F(x_t)\|^2 \leq \frac{2\big(F(x_1) + g(x_1, y_1) - G(x_1) + \|v_1 - v^*_1\|^2 - F^*\big)}{T\gamma} + 2 L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu}, \tag{86}$$
where $L^2 = \frac{L_f^2}{\mu^2} + \frac{L^2_{gyy} C^2_{fx}}{\mu^4}$ and $F^* = \inf_{x\in\mathbb{R}^d} F(x) > -\infty$.

Proof. According to the line 3 of Algorithm 1, we have $u_t = \nabla_y g(x_t, y_t)$. Then by using Lemma A.5, we have
\begin{align}
&g(x_{t+1}, y_{t+1}) - G(x_{t+1}) - \big(g(x_t, y_t) - G(x_t)\big) \nonumber\\
&\leq -\frac{\lambda\mu}{2}\big(g(x_t, y_t) - G(x_t)\big) + \frac{1}{8\gamma}\|x_{t+1} - x_t\|^2 - \frac{1}{4\lambda}\|y_{t+1} - y_t\|^2 + \lambda\|\nabla_y g(x_t, y_t) - u_t\|^2 \nonumber\\
&= -\frac{\lambda\mu}{2}\big(g(x_t, y_t) - G(x_t)\big) + \frac{\gamma}{8}\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 - \frac{1}{4\lambda}\|y_{t+1} - y_t\|^2, \tag{87}
\end{align}
where the above equality holds by $u_t = \nabla_y g(x_t, y_t)$ and $x_{t+1} = x_t - \gamma w_t = x_t - \gamma\widetilde{\nabla}f(x_t, y_t, v_t)$. By using Lemma A.3, we have
\begin{align}
&\|v_{t+1} - v^*_{t+1}\|^2 - \|v_t - v^*_t\|^2 \nonumber\\
&\leq -\frac{\mu\tau}{4}\|v_t - v^*_t\|^2 - \frac{3}{4}\|v_{t+1} - v_t\|^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{6\mu} + \frac{20}{3}\Big(\frac{L_f^2}{\mu^2} + \frac{L^2_{gyy} C^2_{fx}}{\mu^4}\Big)\big(\|x_{t+1} - x_t\|^2 + \|y_{t+1} - y_t\|^2\big) \nonumber\\
&= -\frac{\mu\tau}{4}\|v_t - v^*_t\|^2 - \frac{3}{4}\|v_{t+1} - v_t\|^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{6\mu} + \frac{20}{3}L^2\gamma^2\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 + \frac{20}{3}L^2\|y_{t+1} - y_t\|^2, \tag{88}
\end{align}
where the above equality holds by $x_{t+1} = x_t - \gamma w_t = x_t - \gamma\widetilde{\nabla}f(x_t, y_t, v_t)$ and $L^2 = \frac{L_f^2}{\mu^2} + \frac{L^2_{gyy} C^2_{fx}}{\mu^4}$. Then by using Lemma A.7, we have
$$F(x_{t+1}) \leq F(x_t) - \frac{\gamma}{2}\|\nabla F(x_t)\|^2 - \frac{\gamma}{4}\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 + \gamma L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{6\gamma}{\mu}\big(L_f^2 + r_v^2 L^2_{gxy}\big)\big(g(x_t, y_t) - G(x_t)\big) + 3\gamma C^2_{gxy}\|v_t - v^*_t\|^2. \tag{89}$$
Next, we define a useful Lyapunov function (i.e., potential function), for any $t \geq 1$:
$$\Omega_t = F(x_t) + g(x_t, y_t) - G(x_t) + \|v_t - v^*_t\|^2. \tag{90}$$
By combining the above inequalities (87), (88) and (89), we have
\begin{align}
\Omega_{t+1} - \Omega_t &= F(x_{t+1}) - F(x_t) + g(x_{t+1}, y_{t+1}) - G(x_{t+1}) - \big(g(x_t, y_t) - G(x_t)\big) + \|v_{t+1} - v^*_{t+1}\|^2 - \|v_t - v^*_t\|^2 \nonumber\\
&\leq -\frac{\gamma}{2}\|\nabla F(x_t)\|^2 - \frac{\gamma}{4}\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 + \gamma L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{6\gamma}{\mu}\big(L_f^2 + r_v^2 L^2_{gxy}\big)\big(g(x_t, y_t) - G(x_t)\big) + 3\gamma C^2_{gxy}\|v_t - v^*_t\|^2 \nonumber\\
&\quad - \frac{\lambda\mu}{2}\big(g(x_t, y_t) - G(x_t)\big) + \frac{\gamma}{8}\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 - \frac{1}{4\lambda}\|y_{t+1} - y_t\|^2 \nonumber\\
&\quad - \frac{\mu\tau}{4}\|v_t - v^*_t\|^2 - \frac{3}{4}\|v_{t+1} - v_t\|^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{6\mu} + \frac{20}{3}L^2\gamma^2\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 + \frac{20}{3}L^2\|y_{t+1} - y_t\|^2 \nonumber\\
&= -\frac{\gamma}{2}\|\nabla F(x_t)\|^2 - \Big(\frac{\gamma}{4} - \frac{\gamma}{8} - \frac{20}{3}L^2\gamma^2\Big)\|\widetilde{\nabla}f(x_t, y_t, v_t)\|^2 - \Big(\frac{\lambda\mu}{2} - \frac{6\gamma}{\mu}\big(L_f^2 + r_v^2 L^2_{gxy}\big)\Big)\big(g(x_t, y_t) - G(x_t)\big) \nonumber\\
&\quad - \Big(\frac{1}{4\lambda} - \frac{20}{3}L^2\Big)\|y_{t+1} - y_t\|^2 - \Big(\frac{\mu\tau}{4} - 3\gamma C^2_{gxy}\Big)\|v_t - v^*_t\|^2 - \frac{3}{4}\|v_{t+1} - v_t\|^2 + \gamma L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{6\mu} \nonumber\\
&\leq -\frac{\gamma}{2}\|\nabla F(x_t)\|^2 + \gamma L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{6\mu}, \tag{91}
\end{align}
where the last inequality holds by $0 < \gamma \leq \min\Big\{\frac{3}{160 L^2},\ \frac{\mu\tau}{12 C^2_{gxy}},\ \frac{\lambda\mu^2}{12(L_f^2 + r_v^2 L^2_{gxy})}\Big\}$ and $0 < \lambda \leq \frac{3}{80 L^2}$. Then we can obtain
$$\|\nabla F(x_t)\|^2 \leq \frac{2\big(\Omega_t - \Omega_{t+1}\big)}{\gamma} + 2 L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu}. \tag{92}$$
Averaging the above inequality (92), we have
\begin{align}
\frac{1}{T}\sum_{t=1}^T \|\nabla F(x_t)\|^2 &\leq \frac{2(\Omega_1 - \Omega_{T+1})}{T\gamma} + 2 L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu} \nonumber\\
&\overset{(i)}{\leq} \frac{2(\Omega_1 - F^*)}{T\gamma} + 2 L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu} \nonumber\\
&= \frac{2\big(F(x_1) + g(x_1, y_1) - G(x_1) + \|v_1 - v^*_1\|^2 - F^*\big)}{T\gamma} + 2 L^2_{gxy} r_v^4 \delta_\epsilon^2 + \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu}, \tag{93}
\end{align}
where the above inequality (i) holds by $F^* = \inf_{x\in\mathbb{R}^d} F(x) > -\infty$. Setting $\delta_\epsilon = O\Big(\frac{1}{T\max(L^2_{gxy},\, L^2_{gyy}/\mu)\, r_v^2}\Big)$, we can obtain
$$\min_{1\leq t\leq T}\|\nabla F(x_t)\|^2 \leq \frac{1}{T}\sum_{t=1}^T \|\nabla F(x_t)\|^2 \leq O\Big(\frac{1}{T}\Big).$$
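To spell out the last step (a brief expansion of the argument above, reading the displayed choice of $\delta_\epsilon$ at face value rather than as the paper's exact constant): with $\delta_\epsilon = \frac{c}{T\max(L^2_{gxy},\, L^2_{gyy}/\mu)\, r_v^2}$ for some constant $c > 0$, the two $\delta_\epsilon^2$-terms in (93) satisfy
$$2 L^2_{gxy} r_v^4 \delta_\epsilon^2 = \frac{2c^2 L^2_{gxy}}{T^2\max(L^2_{gxy}, L^2_{gyy}/\mu)^2} \leq \frac{2c^2}{T^2\max(L^2_{gxy}, L^2_{gyy}/\mu)}, \qquad \frac{25\tau L^2_{gyy} r_v^4 \delta_\epsilon^2}{2\gamma\mu} = \frac{25 c^2\tau L^2_{gyy}}{2\gamma\mu T^2\max(L^2_{gxy}, L^2_{gyy}/\mu)^2} \leq \frac{25 c^2\tau}{2\gamma T^2\max(L^2_{gxy}, L^2_{gyy}/\mu)},$$
so both are $O(1/T^2) \leq O(1/T)$ once $\gamma$, $\tau$ and the Lipschitz constants are treated as problem constants, and the $O(1/T)$ rate is driven by the telescoped term $\frac{2(\Omega_1 - F^*)}{T\gamma}$.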
B. Related Works

The GALET method (Xiao et al., 2023) is meaningless for nonconvex-PL bilevel optimization, based on the following facts: 1) In its convergence analysis, the GALET method simultaneously uses the PL condition, its Assumption 2 (i.e., $\sigma_g = \inf_{x,y}\{\sigma^+_{\min}(\nabla^2_{yy}g(x,y))\} > 0$ for all $(x,y)$) and its Assumption 1 (i.e., $\nabla^2_{yy}g(x,y)$ is Lipschitz continuous). 2) In nonconvex-PL bilevel optimization problems, the Hessian matrix $\nabla^2_{yy}g(x,y)$ falls into two cases: in the first case, $\nabla^2_{yy}g(x,y)$ is singular; in the second case, $\nabla^2_{yy}g(x,y)$ is nonsingular. 3) The first case: $\nabla^2_{yy}g(x,y)$ is singular. Since $\nabla^2_{yy}g(x,y)$ is Lipschitz continuous by Assumption 1 of (Xiao et al., 2023), the singular values of $\nabla^2_{yy}g(x,y)$ are also continuous. Thus, combining its Assumption 1 with its Assumption 2 implies that the lower bound of the non-zero singular values $\sigma_g = \inf_{x,y}\{\sigma^+_{\min}(\nabla^2_{yy}g(x,y))\} > 0$ is close to zero (i.e., $\sigma_g \rightarrow 0$). Under this case, the constant $L_w = \frac{\ell_{f,1}}{\sigma_g^2}$ used in its Lemmas 6 and 9 and the constant $L_F = \frac{\ell_{f,0}(\ell_{f,1} + \ell_{g,2})}{\sigma_g}$ used in its Lemma 12 blow up to $+\infty$, so its convergence results become meaningless. 4) The second case: $\nabla^2_{yy}g(x,y)$ is nonsingular. By Assumption 2 of (Xiao et al., 2023), the singular values of the Hessian are bounded away from 0, i.e., $\sigma_g > 0$. Under this case, the PL condition, the Lipschitz continuity of the Hessian and its Assumption 2 imply that GALET in effect uses a strong convexity assumption on $g(x,y)$ in the variable $y$.

Note that although the singular values of the Hessian $\nabla^2_{yy}g(x,y)$ exclude zero, i.e., the eigenvalues of $\nabla^2_{yy}g(x,y)$ may lie in $[-\ell_{g,2}, -\sigma_g] \cup [\sigma_g, \ell_{g,2}]$, we cannot have negative eigenvalues at the minimizer $y^*(x)$. Meanwhile, since the Hessian is Lipschitz continuous, its eigenvalues vary continuously and cannot jump across the excluded interval $(-\sigma_g, \sigma_g)$, so all of its eigenvalues lie in $[\sigma_g, \ell_{g,2}]$. Thus, the PL condition, the Lipschitz continuity of the Hessian and its Assumption 2 (the singular values of the Hessian are bounded away from 0, i.e., $\sigma_g > 0$) imply that GALET assumes strong convexity.

Although (Kwon et al., 2023) studied nonconvex-PL bilevel optimization, it also requires some relatively strict assumptions (e.g., Assumptions 1, 4, 5, 6, 7 and 8 of (Kwon et al., 2023)). For example, Assumption 1 of (Kwon et al., 2023) gives a proximal error-bound (EB) condition that is analogous to the PL condition, and its Assumption 4 (2) requires bounded $|f(x,y)|$. In particular, its Assumption 7 (2) assumes that the upper-level function $f(x,y)$ has a Lipschitz Hessian. Under these conditions, (Kwon et al., 2023) has a gradient complexity of $O(\epsilon^{-1.5})$ for finding an $\epsilon$-stationary solution of nonconvex-PL bilevel optimization. However, without relying on a Lipschitz Hessian of the function $f(x,y)$ or bounded $|f(x,y)|$, our algorithm obtains an optimal gradient complexity of $O(\epsilon^{-1})$, which matches the lower bound established for first-order methods in finding an $\epsilon$-stationary point of nonconvex smooth optimization (Carmon et al., 2020).

Meanwhile, (Chen et al., 2024) studied nonconvex-PL bilevel optimization, but it also relies on some strict assumptions, e.g., $h_\sigma = \sigma f + g$ is $\mu$-PL (please see Assumption 4.1 (a) of (Chen et al., 2024)), while our paper only assumes that the lower-level function $g$ is $\mu$-PL. When $\sigma > 0$, Assumption 4.1 (a) of (Chen et al., 2024) is stricter than our assumption (the lower-level function $g$ is $\mu$-PL). In particular, the upper-level function $f(x,y)$ is also required to have a Lipschitz Hessian (please see Assumption 4.1 (d) of (Chen et al., 2024)). Note that from (Carmon et al., 2020), the optimal gradient complexity is $O(\epsilon^{-0.75})$ (or $O(\epsilon^{-1.5})$) for finding an $\epsilon$-stationary point of a smooth nonconvex optimization problem $\min_x f(x)$ under the Lipschitz Hessian condition, i.e., $\|\nabla f(x)\|^2 \leq \epsilon$ (or $\|\nabla f(x)\| \leq \epsilon$). Meanwhile, based on a Lipschitz Hessian of the function $f(x,y)$, (Yang et al., 2023a) can obtain a lower gradient complexity of $O(\epsilon^{-0.875})$ (or $O(\epsilon^{-1.75})$) for finding an $\epsilon$-stationary point of nonconvex-strongly-convex bilevel optimization, i.e., $\|\nabla\Phi(x)\|^2 \leq \epsilon$ (or $\|\nabla\Phi(x)\| \leq \epsilon$). Thus, under these strict assumptions, although (Chen et al., 2024) also obtains a gradient complexity of $O(\epsilon^{-1})$, this is not an optimal gradient complexity.

Lemma B.1. (Lemma G.6 of (Chen et al., 2024)) For a $\mu$-PL function $h(x): \mathbb{R}^d \rightarrow \mathbb{R}$ that is twice differentiable, at any $x^* \in \arg\min_x h(x)$,
$$\lambda^+_{\min}\big(\nabla^2 h(x^*)\big) \geq \mu, \tag{95}$$
where $\lambda^+_{\min}(\cdot)$ denotes the smallest non-zero eigenvalue.

In fact, Lemma G.6 of (Chen et al., 2024) is exactly useful for our HJFBiO method. Based on Lemma G.6 of (Chen et al., 2024), our Assumption 2.2 is reasonable when the lower-level problem has a unique minimizer. Meanwhile, our Assumption 4.8 is also reasonable when the lower-level problem has multiple local minimizers.
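As a sanity check (ours, not from (Chen et al., 2024)), Lemma B.1 can be verified numerically on simple PL instances. The sketch below uses two toy functions: $h(y) = y_1^2$ on $\mathbb{R}^2$, which is $\mu$-PL with $\mu = 2$, has a whole line of minimizers and a singular Hessian, and the commonly used nonconvex PL example $h(x) = x^2 + 3\sin^2(x)$, whose PL constant is estimated here by a coarse grid search rather than taken from the literature.

```python
import numpy as np

# Example 1: h(y) = y1^2 on R^2.  PL with mu = 2 (i.e. 0.5*||grad h||^2 >= 2*(h - h*)),
# minimizer set {(0, y2)}, Hessian diag(2, 0) is singular, smallest NON-zero eigenvalue is 2.
H1 = np.diag([2.0, 0.0])
eig1 = np.linalg.eigvalsh(H1)
lam_min_plus = min(e for e in eig1 if e > 1e-12)
print("Example 1: lambda_min^+ =", lam_min_plus, ">= mu = 2 :", lam_min_plus >= 2.0)

# Example 2: h(x) = x^2 + 3*sin(x)^2, a nonconvex PL function with unique minimizer x* = 0.
# Estimate a valid PL constant on a finite grid as mu_hat = min_x ||h'(x)||^2 / (2*(h(x) - h*)),
# then check Lemma B.1: the Hessian eigenvalue at x* dominates mu_hat.
xs = np.linspace(-10.0, 10.0, 200001)
xs = xs[np.abs(xs) > 1e-6]            # avoid 0/0 exactly at the minimizer
h = xs**2 + 3.0 * np.sin(xs)**2       # h* = h(0) = 0
grad = 2.0 * xs + 3.0 * np.sin(2.0 * xs)
mu_hat = np.min(grad**2 / (2.0 * h))  # coarse grid estimate of the PL constant
hess_at_min = 2.0 + 6.0 * np.cos(0.0) # h''(0) = 8
print("Example 2: mu_hat ~", mu_hat, ", h''(x*) =", hess_at_min, ">= mu_hat :", hess_at_min >= mu_hat)
```

Both checks confirm the bound (95) on these toy instances, including the degenerate case where the Hessian at the minimizer has a zero eigenvalue and only the non-zero eigenvalues are constrained.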