# Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level

Siqi Zhang¹, Xing Huang¹, Feihu Huang¹²

Abstract

Bilevel optimization is widely applied in many machine learning tasks such as hyper-parameter learning and meta learning. Recently, many algorithms have been proposed to solve bilevel optimization problems, but they rely on the smoothness of the objective functions. In fact, some machine learning tasks, such as training language models, do not satisfy this smoothness condition. More recently, some methods have begun to study generalized-smooth bilevel optimization. However, these methods focus only on a (strongly) convex lower-level objective; moreover, they consider a generalized-smooth upper-level objective but still require a standard-smooth lower-level objective. To fill this gap, in this paper we study generalized-smooth bilevel optimization with a nonconvex lower-level objective, where both the upper-level and lower-level objectives are generalized-smooth. We propose an efficient single-loop Hessian/Jacobian-free penalty normalized gradient (PNGBiO) method. Moreover, we prove that PNGBiO attains a fast convergence rate of O(1/T^{1/4}) for finding a stationary solution, where T denotes the number of iterations. We also propose a stochastic version of PNGBiO (S-PNGBiO) to solve stochastic bilevel problems, and prove that S-PNGBiO has a fast convergence rate of O(1/T^{1/6}). Experimental results on hyper-parameter learning and meta learning demonstrate the efficiency of our proposed methods.

¹College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China.
²MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China. Correspondence to: Feihu Huang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

In this work, we study the following nonsmooth nonconvex bilevel optimization problem:

$$\min_{x\in\mathbb{R}^d,\; y\in y^*(x)} f(x,y), \quad \text{(UL)} \tag{1}$$
$$\text{s.t.}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^p} g(x,y), \quad \text{(LL)}$$

where the upper-level (UL) function f(x, y): R^d × R^p → R is generalized-smooth and nonconvex with respect to (w.r.t.) the variables x and y, and the lower-level (LL) function g(x, y): R^d × R^p → R is generalized-smooth and weakly convex w.r.t. the LL variable y. Problem (1) frequently appears in many machine learning tasks such as meta learning (Hao et al., 2024; Gong et al., 2024b) and deep neural network pruning (Gao et al., 2024).

Recently, bilevel optimization has received wide attention in the machine learning community due to its ability to capture the hierarchical structure of many machine learning tasks, such as meta learning (Ji et al., 2021; Hao et al., 2024) and reinforcement learning (Hong et al., 2023). Meanwhile, many algorithms have been developed to solve bilevel optimization problems. When the lower-level objective g(x, y) is strongly convex and twice differentiable in y, we can fortunately obtain a closed-form gradient of the upper-level objective F(x) = f(x, y*(x)):

$$\nabla F(x) = \nabla_1 f(x, y^*(x)) - \nabla_{12}^2 g(x, y^*(x))\big[\nabla_{22}^2 g(x, y^*(x))\big]^{-1}\nabla_2 f(x, y^*(x)). \tag{2}$$

Based on this closed-form gradient (2), some effective approximated gradient algorithms (Ghadimi & Wang, 2018; Ji et al., 2021; Hong et al., 2023) have been proposed to solve bilevel optimization, using the approximated gradient

$$\hat{\nabla} f(x, \hat{y}) = \nabla_1 f(x, \hat{y}) - \nabla_{12}^2 g(x, \hat{y})\big[\nabla_{22}^2 g(x, \hat{y})\big]^{-1}\nabla_2 f(x, \hat{y}), \tag{3}$$

where ŷ is an approximation of the solution y*(x) = arg min_y g(x, y).
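To make the closed-form gradient (2) concrete, the following sketch evaluates it on a one-dimensional toy bilevel problem and compares it with a finite-difference derivative of F(x) = f(x, y*(x)); all function choices here are illustrative, not taken from the paper.

```python
# Toy check of the implicit hypergradient formula (2) in one dimension.
# f(x, y) = (x - 1)^2 + (y - 2)^2 and g(x, y) = 0.5 * (y - 3x)^2, so the
# lower level has the closed-form solution y*(x) = 3x.

def hypergradient(x):
    y = 3.0 * x                        # lower-level solution y*(x)
    d1f = 2.0 * (x - 1.0)              # grad_1 f(x, y*)
    d2f = 2.0 * (y - 2.0)              # grad_2 f(x, y*)
    d12g = -3.0                        # cross partial of g
    d22g = 1.0                         # second derivative of g in y
    return d1f - d12g * (1.0 / d22g) * d2f   # scalar case of formula (2)

def F(x):                              # F(x) = f(x, y*(x))
    return (x - 1.0) ** 2 + (3.0 * x - 2.0) ** 2

x0, h = 0.5, 1e-6
fd = (F(x0 + h) - F(x0 - h)) / (2.0 * h)    # finite-difference reference
print(hypergradient(x0), fd)                # the two values agree (about -4)
```

The agreement relies on the lower level being strongly convex in y; once that fails, formula (2) is no longer available, which is exactly the regime this paper targets.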
For example, (Ghadimi & Wang, 2018) proposed a class of approximated gradient methods based on approximate implicit differentiation, and (Ji et al., 2021) proposed effective approximated gradient methods based on iterative differentiation.

Table 1. Comparison of our method and the existing methods for generalized-smooth bilevel optimization. Here S denotes the standard smoothness condition, GS the generalized-smoothness condition, SC the strongly convex condition, and NC the nonconvex condition. H.J.F. stands for Hessian/Jacobian-free. f(·,·) and g(·,·) denote the upper-level and lower-level objective functions, respectively; g(x,·) denotes the function of the second variable y with the variable x fixed.

| Algorithm | Reference | g(·,·) | f(·,·) | g(x,·) | Single-Loop | H.J.F. |
|---|---|---|---|---|---|---|
| BO-REP | (Hao et al., 2024) | S | GS | SC | | |
| SLIP | (Gong et al., 2024b) | S | GS | SC | ✓ | ✗ |
| ACCBO | (Gong et al., 2024a) | S | GS | SC | | |
| PNGBiO / S-PNGBiO | Ours | GS | GS | NC | ✓ | ✓ |

More recently, (Huang, 2023; 2024) studied nonconvex bilevel optimization where the lower-level objective g(x, y) is locally strongly convex in y, based on the following projection-aided approximated gradient:

$$\nabla f(x, \hat{y}) = \nabla_1 f(x, \hat{y}) - \nabla_{12}^2 g(x, \hat{y})\big[S_{[\mu, L_g]}\big(\nabla_{22}^2 g(x, \hat{y})\big)\big]^{-1}\nabla_2 f(x, \hat{y}), \tag{4}$$

where S_{[µ,L_g]}[·] denotes the projection onto the set {X ∈ R^{d×d}: µ ≤ ϱ(X) ≤ L_g}, and ϱ(·) denotes the eigenvalue function. In particular, (Huang, 2024) proposed an optimal Hessian/Jacobian-free method that uses a finite-difference estimator to approximate the projection-aided gradient (4). When the lower-level objective g(x, y) is not (locally) strongly convex in y, unfortunately, we cannot obtain the closed-form gradient defined in (2).
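The projection S_{[µ,L_g]} in (4) can be realized by clipping the eigenvalues of a symmetric matrix into [µ, L_g]; a minimal sketch (the matrix and the bounds are illustrative toy choices):

```python
import numpy as np

def project_spectrum(H, mu, Lg):
    # Clip the spectrum of a symmetric matrix H into [mu, Lg], so the
    # projected matrix is positive definite and safely invertible.
    w, V = np.linalg.eigh(H)                 # H = V diag(w) V^T
    return V @ np.diag(np.clip(w, mu, Lg)) @ V.T

H = np.array([[0.1, 0.0],
              [0.0, 5.0]])                   # one eigenvalue below mu, one above Lg
P = project_spectrum(H, mu=0.5, Lg=2.0)
print(np.linalg.eigvalsh(P))                 # eigenvalues clipped into [0.5, 2.0]
```

This eigendecomposition is precisely the expensive step that the Hessian/Jacobian-free methods discussed here are designed to avoid.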
Thus, most existing methods (Liu et al., 2021; 2022; Kwon et al., 2023a; Liu et al., 2024) instead solve the following single-level constrained optimization problem:

$$\min_{x\in\mathbb{R}^d,\, y\in\mathbb{R}^p} f(x,y) \quad \text{s.t.} \quad g(x,y) - \min_{y\in\mathbb{R}^p} g(x,y) \le 0. \tag{5}$$

For example, (Liu et al., 2021) proposed a value-function-based interior-point method for nonconvex bilevel optimization. Meanwhile, (Liu et al., 2022) designed an effective dynamic barrier gradient descent method for bilevel optimization. More recently, (Liu et al., 2024) solve a variant of the single-level constrained problem (5) that replaces min_{y∈R^p} g(x, y) with its Moreau envelope:

$$\min_{x\in\mathbb{R}^d,\, y\in\mathbb{R}^p} f(x,y) \quad \text{s.t.} \quad g(x,y) - v_\gamma(x,y) \le 0, \qquad v_\gamma(x,y) = \min_{\theta\in\mathbb{R}^p}\Big\{g(x,\theta) + \frac{1}{2\gamma}\|\theta - y\|^2\Big\}. \tag{6}$$

So far, the above methods for bilevel optimization rely on the smoothness of the objective functions. In fact, some machine learning tasks, such as training language models (Zhang et al., 2019) and distributionally robust optimization (Chen et al., 2023), do not satisfy this smoothness condition. More recently, some methods (Hao et al., 2024; Gong et al., 2024b) have therefore begun to study generalized-smooth bilevel optimization. However, these methods focus only on a (strongly) convex lower-level objective; moreover, they consider a generalized-smooth upper-level objective but still require a standard-smooth lower-level objective.

In this paper, to fill this gap, we study the generalized-smooth bilevel optimization problem (1), where both the upper-level and lower-level objectives are generalized-smooth and the lower level is nonconvex. We propose an efficient Hessian/Jacobian-free penalty normalized gradient (PNGBiO) method to solve the deterministic problem (1).
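The constraint in (5) can be checked numerically on a toy nonconvex lower level: the residual g(x, y) − min_y g(x, y) is zero exactly at lower-level minimizers and strictly positive elsewhere. A brute-force sketch (g and the grid are illustrative choices):

```python
import numpy as np

def g(x, y):
    # Illustrative lower-level objective, nonconvex in y (two wells).
    return (y ** 2 - 1.0) ** 2 + 0.1 * x * y

x = 0.3
ys = np.linspace(-2.0, 2.0, 4001)
vals = g(x, ys)
v = vals.min()                      # value function v(x) = min_y g(x, y) on a grid
y_star = ys[vals.argmin()]          # a lower-level minimizer

print(g(x, y_star) - v)             # 0.0 at the minimizer
print(g(x, 0.0) - v)                # strictly positive away from it
```

Enforcing this constraint to hold approximately is what drives the penalty formulations used later in the paper.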
Meanwhile, we also propose a stochastic PNGBiO (S-PNGBiO) method to solve the stochastic version of problem (1):

$$\min_{x\in\mathbb{R}^d,\; y\in y^*(x)} \mathbb{E}_{\xi\sim\mathcal{D}}[F(x,y;\xi)], \quad \text{(UL)} \tag{7}$$
$$\text{s.t.}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^p} \mathbb{E}_{\zeta\sim\mathcal{O}}[G(x,y;\zeta)], \quad \text{(LL)}$$

where f(x, y) = E_{ξ∼D}[F(x, y; ξ)] and g(x, y) = E_{ζ∼O}[G(x, y; ζ)]. Here ξ and ζ are random variables.

In summary, our contributions are as follows:

(1) We propose an efficient single-loop PNGBiO method to solve the generalized-smooth nonconvex bilevel problem (1), and a stochastic version (S-PNGBiO) to solve problem (7). In particular, our methods use only first-order gradients instead of computationally expensive Hessian/Jacobian matrices, so they have a lower computational cost per iteration.

(2) We present a solid convergence analysis for our methods. Under some mild conditions, we prove that our PNGBiO method attains a fast convergence rate of O(1/T^{1/4}) for finding a stationary solution of problem (1). Meanwhile, we also prove that our S-PNGBiO method attains a fast convergence rate of O(1/T^{1/6}) for finding a stationary solution of problem (7).

(3) We provide numerical experiments on data hyper-cleaning, hyper-parameter learning, and meta learning to demonstrate the efficiency of our methods.

2. Related Work

In this section, we review the generalized smoothness condition in optimization algorithms and algorithms for bilevel optimization with a nonconvex lower level.

2.1. Generalized Smoothness Condition

The generalized smoothness condition was first studied in (Zhang et al., 2019) and applied to gradient clipping and normalized gradient methods. Subsequently, (Zhang et al., 2020) studied a momentum-based gradient clipping method under the generalized smoothness condition.
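The motivation for clipped and normalized updates under generalized smoothness can be seen on a toy objective: f(x) = x⁴ satisfies an (L₀, L₁)-type condition but is not globally Lipschitz-smooth, so fixed-step gradient descent can diverge while a normalized step cannot. A minimal sketch (step sizes and the starting point are illustrative):

```python
# Contrast plain and normalized gradient descent on f(x) = x^4, whose
# local smoothness constant grows with |grad f|.
def grad(x):
    return 4.0 * x * x * x

def plain_gd(x, lr=0.01, steps=10):
    for _ in range(steps):
        x = x - lr * grad(x)
        if abs(x) > 1e6:              # fixed-step GD blows up far from 0
            break
    return x

def normalized_gd(x, lr=0.1, steps=200):
    for _ in range(steps):
        g = grad(x)
        if abs(g) < 1e-12:
            break
        x = x - lr * g / abs(g)       # step length lr regardless of |grad|
    return x

print(abs(plain_gd(10.0)) > 1e6)      # True: plain GD diverges
print(abs(normalized_gd(10.0)) <= 0.1)  # True: ends within one step of 0
```

This same normalization idea appears in lines 4 and 6 of the PNGBiO algorithm proposed later.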
Meanwhile, (Qian et al., 2021) and (Zhao et al., 2021) studied incremental gradient clipping and stochastic normalized gradient methods under the generalized smoothness condition, respectively. Recently, (Chen et al., 2023) proposed a new symmetric generalized smoothness condition, which extends the standard (L0, L1)-generalized smoothness, and studied variance reduction under it. Subsequently, (Li et al., 2023) presented an even more general smoothness condition and studied various gradient-based methods under it. More recently, (Hao et al., 2024; Gong et al., 2024b) studied bilevel optimization under the generalized-smoothness condition. Meanwhile, (Xian et al., 2024) studied nonconvex minimax optimization under the generalized-smoothness condition.

2.2. Bilevel Optimization with Nonconvex Lower-Level

In recent years, numerous methods have been proposed to solve bilevel optimization with a nonconvex lower level. For example, (Liu et al., 2022) proposed a first-order method for nonconvex-PL bilevel optimization, where the nonconvex LL objective satisfies the PL condition. (Shen & Chen, 2023) designed a penalty-based gradient method for constrained nonconvex-PL problems. Subsequently, (Huang, 2023) proposed momentum-based gradient methods for nonconvex-PL bilevel optimization. Meanwhile, (Kwon et al., 2023a) studied nonconvex bilevel optimization where the LL objective satisfies a proximal error-bound (EB) condition similar to PL. (Kwon et al., 2023b) proposed the F2BA method for nonconvex-PL bilevel optimization, achieving O(ε⁻¹) gradient complexity for finding ε-stationary solutions; however, this method requires stricter conditions such as a Lipschitz Hessian of the upper-level function f(x, y). Although these methods achieve strong results, they rely on computationally expensive projected Hessian/Jacobian matrices.
To address this issue, (Huang, 2024) proposed the HJFBiO method, which uses a finite-difference estimator and a new projection operator to replace high-cost matrices with low-cost first-order gradients. On the other hand, (Liu et al., 2024) proposed the MEHA method for general bilevel optimization with nonconvex and possibly nonsmooth LL objectives, avoiding Hessian-related approximations and enabling a single-loop implementation.

Notation. For simplicity, let ∇₁f(x, y) and ∇₂f(x, y) denote the partial gradients of f(x, y) w.r.t. the first variable x and the second variable y, respectively. ‖·‖ denotes the ℓ₂ norm for vectors and the spectral norm for matrices. ⟨x, y⟩ denotes the inner product of two vectors x and y. F_t := σ(x₀, y₀, θ₀, S⁰_f, S⁰_g, Ŝ⁰_g, …, Sᵗ_f, Sᵗ_g, Ŝᵗ_g) is a σ-algebra, E[·|F_t] is the conditional expectation given the first t iterations, and E[·] is the expectation w.r.t. the stochastic variables. {x_t} denotes a sequence.

3. Preliminaries

In problem (1), since the lower-level objective g(x, y) is nonconvex in y, we cannot easily obtain the minimum of the lower-level problem min_{y∈R^p} g(x, y). Thus, we reformulate the lower-level problem using the value function

$$v(x) := \min_{y\in\mathbb{R}^p} g(x,y). \tag{8}$$

Then problem (1) is equivalent to the following nonlinear programming problem:

$$\min_{(x,y)\in\mathbb{R}^d\times\mathbb{R}^p} f(x,y), \quad \text{s.t.}\quad g(x,y) - v(x) \le 0, \tag{9}$$

which was initially introduced by (Outrata, 1990). Unfortunately, local solutions of the reformulated problem (9) do not necessarily correspond to stationary points of problem (1) (Alcantara & Takeda, 2024). Therefore, as in (Liu et al., 2024), we consider a Moreau-envelope version of problem (9), which can also be seen as an approximate problem:

$$\min_{(x,y)\in\mathbb{R}^d\times\mathbb{R}^p} f(x,y), \quad \text{s.t.}\quad g(x,y) - v_\gamma(x,y) \le 0, \qquad v_\gamma(x,y) = \min_{\theta\in\mathbb{R}^p}\Big\{g(x,\theta) + \frac{1}{2\gamma}\|\theta - y\|^2\Big\}, \tag{10}$$

where γ > 0 is a regularization tuning parameter.
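The value function v_γ in (10) can be evaluated by brute force for low-dimensional toys; the sketch below (g, γ, and the grid are illustrative choices) checks that the constraint g − v_γ ≥ 0 always holds and is nearly tight near a lower-level minimizer:

```python
import numpy as np

def g(x, y):
    # Illustrative lower-level objective, nonconvex in y.
    return (y ** 2 - 1.0) ** 2 + 0.1 * x * y

def v_gamma(x, y, gamma=1.0, grid=np.linspace(-3.0, 3.0, 6001)):
    # Moreau-envelope value function of (10), minimized over a fine grid.
    return np.min(g(x, grid) + (grid - y) ** 2 / (2.0 * gamma))

x = 0.3
r_far = g(x, 0.5) - v_gamma(x, 0.5)     # y far from the lower-level wells
r_near = g(x, -1.0) - v_gamma(x, -1.0)  # y near a lower-level minimizer
print(r_far, r_near)                    # both nonnegative; r_near is tiny
```

Unlike the raw value function v(x), the envelope v_γ depends smoothly on (x, y), which is what makes the constraint in (10) amenable to first-order methods.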
Instead of solving problem (10) directly, we solve its Lagrangian (i.e., penalized) function:

$$\min_{x\in\mathbb{R}^d,\, y\in\mathbb{R}^p} f(x,y) + c_t\big(g(x,y) - v_\gamma(x,y)\big), \tag{11}$$

where c_t > 0 is a penalty parameter.

Figure 1. Relationship of the bilevel optimization problem (1) and the approximated optimization problem (10).

For problem (11), we consider the residual function

$$R_t(x,y) := \mathrm{dist}\Big(0,\; \nabla f(x,y) + c_t\big(\nabla g(x,y) - \nabla v_\gamma(x,y)\big)\Big).$$

This residual function is a stationarity measure for problem (11): when (x, y) is a stationary point of problem (11), it follows that ∇f(x, y) + c_t(∇g(x, y) − ∇v_γ(x, y)) = 0. Based on Theorem A.4 of (Liu et al., 2024), any limit point (x̄, ȳ) of the sequence (x_t, y_t) is a solution of problem (10).

4. Penalty Normalized Gradient Methods

In this section, we propose an effective single-loop penalty normalized gradient (PNGBiO) method to solve the deterministic bilevel optimization problem (1). Meanwhile, we present a stochastic PNGBiO (S-PNGBiO) method to solve the stochastic bilevel optimization problem (7). The PNGBiO algorithm is provided in Algorithm 1.

As discussed in the previous section, our PNGBiO method solves the bilevel problem (1) by directly solving the approximated problem (10); Figure 1 shows the relationship of these problems. Here we use the penalty method to solve the approximated problem (10) via problem (11). In Algorithm 1, given the current iterate {x_t, y_t, θ_t}, we begin by updating the variable θ:

$$\theta_{t+1} = \theta_t - \eta_t\Big(\nabla_2 g(x_t, \theta_t) + \frac{1}{\gamma}(\theta_t - y_t)\Big),$$

where η_t > 0.
We then compute the two gradient directions

$$d^t_x = \frac{1}{c_t}\nabla_1 f(x_t, y_t) + \nabla_1 g(x_t, y_t) - \nabla_1 g(x_t, \theta_{t+1}),$$
$$d^t_y = \frac{1}{c_t}\nabla_2 f(x_{t+1}, y_t) + \nabla_2 g(x_{t+1}, y_t) + \frac{1}{\gamma}(\theta_{t+1} - y_t),$$

to update the variables x and y, respectively.

Algorithm 1 PNGBiO Algorithm
Input: iteration number T; initialization x₀, y₀, θ₀; learning rates η_t, α_t, β_t; proximal parameter γ > 0; penalty parameter c_t > 0.
Output: x_T, y_T.
1: for t = 0, 1, …, T − 1 do
2:   θ_{t+1} = θ_t − η_t(∇₂g(x_t, θ_t) + (1/γ)(θ_t − y_t));
3:   compute d^t_x = (1/c_t)∇₁f(x_t, y_t) + ∇₁g(x_t, y_t) − ∇₁g(x_t, θ_{t+1});
4:   update x_{t+1} = x_t − α_t d^t_x / ‖d^t_x‖;
5:   compute d^t_y = (1/c_t)∇₂f(x_{t+1}, y_t) + ∇₂g(x_{t+1}, y_t) + (1/γ)(θ_{t+1} − y_t);
6:   update y_{t+1} = y_t − β_t d^t_y / ‖d^t_y‖.

In lines 4 and 6 of Algorithm 1, we use normalized gradient descent to update x and y. Compared with the MEHA algorithm of (Liu et al., 2024), our PNGBiO algorithm uses normalized gradient descent iterations, which unify the scales of the updates, making them more stable and efficient and facilitating fast convergence.

Algorithm 2 provides the algorithmic framework of our S-PNGBiO method, which extends Algorithm 1 to the stochastic setting. In line 2 of Algorithm 2, we randomly draw three independent minibatches Sᵗ_f = {ξᵗᵢ}ᵢ₌₁^{B_t} ⊆ D, Sᵗ_g = {ζᵗᵢ}ᵢ₌₁^{B_t} ⊆ O, and Ŝᵗ_g = {ζ̂ᵗᵢ}ᵢ₌₁^{B_t} ⊆ O. Then, in line 3, we update θ as

$$\theta_{t+1} = \theta_t - \eta_t\Big(\frac{1}{B_t}\sum_{i=1}^{B_t}\nabla_2 G(x_t, \theta_t; \hat\zeta^t_i) + \frac{1}{\gamma}(\theta_t - y_t)\Big).$$
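The steps of Algorithm 1 can be sketched end-to-end on a scalar toy bilevel problem with a known solution; the objectives, schedules, and constants below are illustrative choices, not the paper's tuned settings.

```python
# Minimal sketch of Algorithm 1 (PNGBiO) in one dimension.
# f(x, y) = (x - 1)^2 + y^2 and g(x, y) = (y - x)^2, so y*(x) = x and the
# bilevel solution minimizes f(x, x), i.e. (x*, y*) = (0.5, 0.5).

gamma = 0.25
x, y, theta = 0.0, 0.0, 0.0
for t in range(10000):
    c_t = 2.0 * (t + 1) ** 0.25           # penalty schedule c_t = c (t+1)^{1/4}
    alpha = beta = 0.3 / (t + 1) ** 0.6   # decaying normalized step sizes
    eta = 0.1
    # line 2: one proximal-gradient step on the lower level
    theta = theta - eta * (2.0 * (theta - x) + (theta - y) / gamma)
    # line 3: d_x = (1/c_t) grad_1 f + grad_1 g(x, y) - grad_1 g(x, theta)
    d_x = (2.0 / c_t) * (x - 1.0) - 2.0 * (y - x) + 2.0 * (theta - x)
    x = x - alpha * d_x / max(abs(d_x), 1e-12)   # line 4: normalized step
    # line 5: d_y = (1/c_t) grad_2 f + grad_2 g(x, y) + (theta - y)/gamma
    d_y = (2.0 / c_t) * y + 2.0 * (y - x) + (theta - y) / gamma
    y = y - beta * d_y / max(abs(d_y), 1e-12)    # line 6: normalized step

print(round(x, 2), round(y, 2))   # close to the bilevel solution (0.5, 0.5)
```

Note the growing penalty c_t: early iterations weight the upper-level objective heavily, while later iterations drive the iterate toward lower-level feasibility (y ≈ y*(x)).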
The stochastic gradient directions for updating x and y are

$$d^t_x = \frac{1}{B_t}\sum_{i=1}^{B_t}\Big(\frac{1}{c_t}\nabla_1 F(x_t, y_t; \xi^t_i) + \nabla_1 G(x_t, y_t; \zeta^t_i) - \nabla_1 G(x_t, \theta_{t+1}; \zeta^t_i)\Big), \tag{16}$$
$$d^t_y = \frac{1}{B_t}\sum_{i=1}^{B_t}\Big(\frac{1}{c_t}\nabla_2 F(x_{t+1}, y_t; \xi^t_i) + \nabla_2 G(x_{t+1}, y_t; \zeta^t_i)\Big) + \frac{1}{\gamma}(\theta_{t+1} - y_t). \tag{17}$$

Algorithm 2 S-PNGBiO Algorithm
Input: iteration number T; initialization x₀, y₀, θ₀; learning rates η_t, α_t, β_t; proximal parameter γ > 0; penalty parameter c_t > 0; minibatch size B_t ≥ 1.
Output: x_T, y_T.
1: for t = 0, 1, …, T − 1 do
2:   randomly draw three independent minibatches Sᵗ_f = {ξᵗᵢ}ᵢ₌₁^{B_t} ⊆ D, Sᵗ_g = {ζᵗᵢ}ᵢ₌₁^{B_t} ⊆ O, Ŝᵗ_g = {ζ̂ᵗᵢ}ᵢ₌₁^{B_t} ⊆ O;
3:   update θ_{t+1} = θ_t − η_t((1/B_t)Σᵢ₌₁^{B_t} ∇₂G(x_t, θ_t; ζ̂ᵗᵢ) + (1/γ)(θ_t − y_t));
4:   compute d^t_x as in (16);
5:   update x_{t+1} = x_t − α_t d^t_x / ‖d^t_x‖;
6:   compute d^t_y as in (17);
7:   update y_{t+1} = y_t − β_t d^t_y / ‖d^t_y‖.

5. Convergence Analysis

In this section, we study the convergence properties of our PNGBiO and S-PNGBiO methods under some mild assumptions. All related proofs are provided in Appendix A. We first give some standard assumptions on the upper-level objective f(x, y) and the lower-level objective g(x, y). For notational simplicity, let u = (x, y), u′ = (x′, y′), µ ∈ (0, 1) and u_µ = µu + (1 − µ)u′. Further let θ_µ = µθ + (1 − µ)θ*_γ(x, y) and M = max_{µ∈[0,1]} ‖∇₁g(x, θ_µ)‖, where θ*_γ(x, y) = arg min_{θ∈R^p} { g(x, θ) + (1/(2γ))‖θ − y‖² }.

Assumption 5.1. The upper-level objective f(u) is bounded below, i.e., f* = inf_{u∈R^d×R^p} f(u) > −∞.

Assumption 5.2 (Generalized smoothness of the UL function). There exist L_{fx,0}, L_{fx,1}, L_{fy,0}, L_{fy,1} > 0 such that ‖∇₁f(u) − ∇₁f(u′)‖ ≤ (L_{fx,0} + L_{fx,1} max_{µ∈[0,1]} ‖∇₁f(u_µ)‖^ρ)‖u − u′‖ and ‖∇₂f(u) − ∇₂f(u′)‖ ≤ (L_{fy,0} + L_{fy,1} max_{µ∈[0,1]} ‖∇₂f(u_µ)‖^ρ)‖u − u′‖, where ρ ∈ [0, 1] is a constant.

Assumption 5.3
(Generalized smoothness of the LL function). There exist L_{gx,0}, L_{gx,1}, L_{gy,0}, L_{gy,1} > 0 such that ‖∇₁g(u) − ∇₁g(u′)‖ ≤ (L_{gx,0} + L_{gx,1} max_{µ∈[0,1]} ‖∇₁g(u_µ)‖^ρ)‖u − u′‖ and ‖∇₂g(u) − ∇₂g(u′)‖ ≤ (L_{gy,0} + L_{gy,1} max_{µ∈[0,1]} ‖∇₂g(u_µ)‖^ρ)‖u − u′‖, where ρ ∈ [0, 1] is a constant.

Assumption 5.4. The lower-level objective g(·,·) is (κ_{g1}, κ_{g2})-weakly convex on R^d × R^p, i.e., g(x, y) + (κ_{g1}/2)‖x‖² + (κ_{g2}/2)‖y‖² is convex on R^d × R^p.

Assumption 5.1 gives a lower bound on the upper-level objective, which ensures the feasibility of problem (1). Assumption 5.2 states the generalized smoothness condition on the upper-level objective as in (Chen et al., 2023), which is milder than the generalized smoothness condition used in (Hao et al., 2024; Gong et al., 2024b). Moreover, Assumption 5.3 imposes a generalized smoothness condition on the lower-level objective instead of the standard smoothness condition used in (Hao et al., 2024; Gong et al., 2024b). Assumption 5.4 states that the lower-level objective is weakly convex, as in (Liu et al., 2024). In fact, this nonconvex lower-level condition is also milder than the strongly convex lower-level condition used in (Hao et al., 2024; Gong et al., 2024b).

5.1. Deterministic Setting

Theorem 5.5. Suppose Assumptions 5.1, 5.2, 5.3 and 5.4 hold. Let γ ∈ (0, 1/(2κ_{g2})) and c_t = c(t + 1)^{1/4} with c > 0. When ρ ∈ [0, 1), let

$$0 < \alpha_t \le \frac{\big\|\frac{1}{c_t}\nabla_1 f(x_t,y_t) + \nabla_1 g(x_t,y_t)\big\|}{4(K_0+\kappa+K_1+2K_2) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C + 6\|\nabla_1 g(x_t,\theta_{t+1})\|},\quad
0 < \beta_t \le \frac{\big\|\frac{1}{c_t}\nabla_2 f(x_t,y_t) + \nabla_2 g(x_t,y_t)\big\|}{4(K_3+\kappa+K_4+2K_5) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};$$

when ρ = 1, let

$$0 < \alpha_t \le \frac{\big\|\frac{1}{c_t}\nabla_1 f(x_t,y_t) + \nabla_1 g(x_t,y_t)\big\|}{4(L_0+\kappa+L_1) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C + 6\|\nabla_1 g(x_t,\theta_{t+1})\|},\quad
0 < \beta_t \le \frac{\big\|\frac{1}{c_t}\nabla_2 f(x_t,y_t) + \nabla_2 g(x_t,y_t)\big\|}{4(L_2+\kappa+L_3) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};$$

and let

$$L_\theta = \sqrt{1 + 4C\Big((L_{gx,0}+L_{gx,1}M)\,\alpha_t\|d^t_x\| + \frac{\beta_t}{\gamma^2}\|d^t_y\|\Big)},\qquad
0 < \eta_t \le \frac{1/\gamma - \kappa_{g2}}{\big(1/\gamma + L_{gy,0} + L_{gy,1}\max_{\mu\in(0,1)}\|\nabla_2 g(x,\theta_\mu)\|^{\rho}\big)^2},\qquad
C = L_{gx,0} + L_{gx,1}M^{\rho} + \frac{2}{\gamma^2}.$$
Then the sequence {x_t, y_t, θ_t}_{t=0}^T generated by Algorithm 1 satisfies

$$\min_{0\le t\le T}\|\theta_t - \theta^*_\gamma(x_t,y_t)\| = O\Big(\frac{1}{T^{1/2}}\Big),\quad
\min_{0\le t\le T} R_t(x_{t+1},y_{t+1}) = O\Big(\frac{1}{T^{1/4}}\Big),\quad
g(x_T,y_T) - v_\gamma(x_T,y_T) = O\Big(\frac{1}{T^{1/4}}\Big).$$

Remark 5.6. From Theorem 5.5, the two bounds min_{0≤t≤T}‖θ_t − θ*_γ(x_t, y_t)‖ = O(1/T^{1/2}) and min_{0≤t≤T} R_t(x_{t+1}, y_{t+1}) = O(1/T^{1/4}) show that our PNGBiO algorithm has a convergence rate of O(1/T^{1/4}) for finding a stationary solution of problem (10). The bound g(x_T, y_T) − v_γ(x_T, y_T) = O(1/T^{1/4}) further ensures that the sequence {g(x_t, y_t)} approaches {v_γ(x_t, y_t)}, so together the three bounds show that our PNGBiO algorithm has a convergence rate of O(1/T^{1/4}) for finding a stationary solution of problem (1). Thus, our PNGBiO method attains a gradient complexity of O(ε⁻⁴) for finding an ε-stationary solution of problem (1).

5.2. Stochastic Setting

Assumption 5.7. The stochastic gradients are unbiased, i.e., E_{ξ∼D}[∇F(x, y; ξ)] = ∇f(x, y) and E_{ζ∼O}[∇G(x, y; ζ)] = ∇g(x, y), and their variances are bounded: E_{ξ∼D}[‖∇F(x, y; ξ) − ∇f(x, y)‖²] ≤ δ²_f and E_{ζ∼O}[‖∇G(x, y; ζ) − ∇g(x, y)‖²] ≤ δ²_g.

Assumption 5.7 is classical for stochastic algorithms (Hong et al., 2023; Kwon et al., 2023b). Under Assumption 5.7, the variances of the directions satisfy E‖d^t_x − d̄^t_x‖² ≤ (δ²_f/c_t² + 2δ²_g)/B_t and E‖d^t_y − d̄^t_y‖² ≤ (δ²_f/c_t² + δ²_g)/B_t, where d̄^t_x and d̄^t_y denote the corresponding full gradients.

Theorem 5.8. Suppose Assumptions 5.1, 5.2, 5.3, 5.4 and 5.7 hold. Let γ ∈ (0, 1/(2κ_{g2})), c_t = c(t + 1)^{1/4} with c > 0, and B_t = O(T^{1/2}). When ρ ∈ [0, 1), let

$$0 < \alpha_t \le \frac{\big\|\frac{1}{c_t}\nabla_1 f(x_t,y_t) + \nabla_1 g(x_t,y_t)\big\|}{4(K_0+\kappa+K_1+2K_2+2) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C + 6\|\nabla_1 g(x_t,\theta_{t+1})\|},\quad
0 < \beta_t \le \frac{\big\|\frac{1}{c_t}\nabla_2 f(x_t,y_t) + \nabla_2 g(x_t,y_t)\big\|}{4(K_3+\kappa+K_4+2K_5+2) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};$$

when ρ = 1, let

$$0 < \alpha_t \le \frac{\big\|\frac{1}{c_t}\nabla_1 f(x_t,y_t) + \nabla_1 g(x_t,y_t)\big\|}{4(L_0+\kappa+L_1+2) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C + 6\|\nabla_1 g(x_t,\theta_{t+1})\|},\quad
0 < \beta_t \le \frac{\big\|\frac{1}{c_t}\nabla_2 f(x_t,y_t) + \nabla_2 g(x_t,y_t)\big\|}{4(L_2+\kappa+L_3+2) + 8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};$$

and let η_t, L_θ and C satisfy the same conditions as in Theorem 5.5.
Then the sequence (x_t, y_t, θ_t) generated by Algorithm 2 satisfies

$$\min_{0\le t\le T}\mathbb{E}\big[\|\theta_t - \theta^*_\gamma(x_t,y_t)\|\big] = O\Big(\frac{1}{T^{1/2}}\Big),\quad
\min_{0\le t\le T}\mathbb{E}[R_t(x_{t+1},y_{t+1})] = O\Big(\frac{1}{T^{1/6}}\Big),\quad
\mathbb{E}[g(x_T,y_T) - v_\gamma(x_T,y_T)] = O\Big(\frac{1}{T^{1/4}}\Big).$$

Remark 5.9. From Theorem 5.8, the two bounds min_{0≤t≤T} E[‖θ_t − θ*_γ(x_t, y_t)‖] = O(1/T^{1/2}) and min_{0≤t≤T} E[R_t(x_{t+1}, y_{t+1})] = O(1/T^{1/6}) show that our S-PNGBiO algorithm has a convergence rate of O(1/T^{1/6}) for finding a stationary solution of problem (10). The bound E[g(x_T, y_T) − v_γ(x_T, y_T)] = O(1/T^{1/4}) further ensures that the sequence {g(x_t, y_t)} approaches {v_γ(x_t, y_t)}, so together the three bounds show that our S-PNGBiO algorithm has a convergence rate of O(1/T^{1/6}) for finding a stationary solution of problem (7). Thus, our S-PNGBiO method attains a gradient complexity of Σ_{t=1}^T B_t = T·B_t = O(ε⁻⁹) for finding an ε-stationary solution of problem (7).

6. Numerical Experiments

In this section, we conduct experiments on data hyper-cleaning, hyper-parameter learning, and meta learning to verify the efficiency of our proposed methods. We evaluate the performance of our PNGBiO and S-PNGBiO algorithms in terms of loss and test accuracy by comparing them with several competitive baseline algorithms, including BOME (Liu et al., 2022), F2SA (Kwon et al., 2023a), PBGD (Shen & Chen, 2023), MEHA (Liu et al., 2024), and SLIP (Gong et al., 2024b). Due to the high computational cost of the Hessian and Jacobian matrices required by the SLIP algorithm, we omit it from the meta learning task on the CIFAR10 dataset (Krizhevsky et al., 2009).

6.1. Data Hyper-Cleaning

In this experiment, we conduct the data hyper-cleaning task (Franceschi et al., 2017; Shen & Chen, 2023) on the MNIST (Deng, 2012) and Fashion MNIST (Xiao et al., 2017) datasets, respectively. Specifically, we solve the following bilevel optimization problem:

$$\min_{\lambda}\; \frac{1}{|S_{val}|}\sum_{i\in S_{val}} \mathcal{L}_{val}\big(x_i, y_i; \theta^*(\lambda)\big) \quad \text{s.t.}$$
$$\theta^*(\lambda) = \arg\min_{\theta}\; \frac{1}{|S_{tr}|}\sum_{i\in S_{tr}} \lambda_i\, \mathcal{L}_{tr}(x_i, y_i; \theta),$$

where S_val and S_tr denote the validation and training datasets, respectively, and θ = (w, b) denotes the parameters of the logistic regression model. In the experiment, the dataset is partitioned into a training set, a validation set, and a test set at a ratio of 1:1:2. For our algorithm, we set η_t = 0.01, α_t = 0.013/(t+1)^{0.8}, β_t = 0.011/(t+1)^{0.8} on the MNIST dataset and η_t = 0.01, α_t = 0.013/(t+1)^{0.5}, β_t = 0.011/(t+1)^{0.5} on the Fashion MNIST dataset. The learning-rate settings of the other algorithms are shown in Tables 2 and 3.

From Figures 2 and 3, we find that our PNGBiO algorithm achieves higher accuracy and lower loss values on the test set. However, the convergence speed of our algorithm is slower than that of F2SA, which can be attributed to the momentum acceleration technique employed by F2SA and absent from our algorithm.

Figure 2. Experimental results of data hyper-cleaning on the MNIST dataset.

Figure 3. Experimental results of data hyper-cleaning on the Fashion MNIST dataset.

Table 2. Data hyper-cleaning learning rates of the other algorithms on the MNIST dataset.

| | BOME | F2SA | PBGD | MEHA | SLIP |
|---|---|---|---|---|---|
| α_t | 1.05 | 0.05 | 0.1 | 0.01 | 1.3 |
| β_t | 1.05 | 0.05 | 0.1 | 0.03 | 0.5 |
| η_t | 0.01 | 0.01 | 0.01 | 0.01 | - |

Table 3. Data hyper-cleaning learning rates of the other algorithms on the Fashion MNIST dataset.

| | BOME | F2SA | PBGD | MEHA | SLIP |
|---|---|---|---|---|---|
| α_t | 0.001/(t+1)^{1.5} | 0.001/(t+1)^{1.1} | 0.1 | 0.0005/(t+1)^{1.8} | 0.001/(t+1) |
| β_t | 0.01/(t+1) | 0.001/(t+1)^{1.1} | 0.1 | 0.001 | 0.001/(t+1) |
| η_t | 0.01 | 0.01 | 0.01 | 0.01 | - |

6.2. Hyper-Parameter Learning

In this experiment, we conduct the hyper-parameter learning task (Franceschi et al., 2018) on the MNIST (Deng, 2012) and Fashion MNIST (Xiao et al., 2017) datasets, respectively. Specifically, we solve the following bilevel optimization problem:

$$\min_{\lambda}\; \mathcal{L}_{S_{val}}(\lambda) = \frac{1}{|S_{val}|}\sum_{i\in S_{val}} \mathcal{L}(x_i, y_i; \omega^*(\lambda)) \quad \text{s.t.}$$
$$\omega^*(\lambda) = \arg\min_{\omega}\; \frac{1}{|S_{tr}|}\sum_{i\in S_{tr}} \big(\mathcal{L}(x_i, y_i; \omega) + \mathcal{R}_{\omega,\lambda}\big),$$

where ω denotes the model parameters and λ denotes the regularization parameters. Here S_val and S_tr are the validation and training data, L is the cross-entropy loss, and R_{ω,λ} is a regularizer. In the experiment, we use logistic regression. For our algorithm, we set η_t = 0.01, α_t = 1.3/(t+1)^{0.75}, β_t = 0.75/(t+1)^{0.74} on the MNIST dataset and η_t = 0.01, α_t = 1.5/(t+1)^{0.45}, β_t = 1.75/(t+1)^{0.44} on the Fashion MNIST dataset. The learning-rate settings of the other algorithms are shown in Table 4.

Figures 4 and 5 also show that our PNGBiO algorithm achieves higher accuracy and lower loss values on the test set. Similarly, the convergence speed of our algorithm is slower than that of F2SA.

Figure 4. Experimental results of hyper-parameter learning on the MNIST dataset.

Figure 5. Experimental results of hyper-parameter learning on the Fashion MNIST dataset.

Table 4. Hyper-parameter learning rates of the other algorithms on the MNIST (Fashion MNIST) dataset.

| | BOME | F2SA | PBGD | MEHA | SLIP |
|---|---|---|---|---|---|
| α_t | 1.05 | 0.05 | 0.1 | 0.01 | 1.3 |
| β_t | 1.05 | 0.05 | 0.1 | 0.03 | 0.5 |
| η_t | 0.01 | 0.01 | 0.01 | 0.01 | - |

6.3. Meta Learning

In this experiment, we conduct the meta learning task (Franceschi et al., 2018), which can be represented as a bilevel optimization problem. Specifically, given K tasks with training sets {D^tr_i | i = 1, …, K} and validation sets {D^val_i | i = 1, …, K}, we solve the following bilevel optimization problem:

$$\min_{x}\; \frac{1}{K}\sum_{i=1}^{K}\frac{1}{|D^{val}_i|}\sum_{\xi\in D^{val}_i} \mathcal{L}(x, y^*(x); \xi) \quad \text{s.t.}\quad y^*(x) = \arg\min_{y}\; \frac{1}{K}\sum_{i=1}^{K}\sum_{\zeta\in D^{tr}_i} \mathcal{L}(x, y_i; \zeta),$$

where y = (y₁, y₂, …, y_K). In the experiment, a fully-connected 3-layer network is used as the classifier for the LL problem, and a neural network consisting of one convolutional layer and three residual layers is used for the UL problem.

Table 5. Meta learning rates of the other algorithms on the CIFAR10 dataset.

| | BOME | F2SA | MEHA | PBGD |
|---|---|---|---|---|
| α_t | 0.03 | 0.01 | 0.05 | 0.05 |
| β_t | 0.03 | 0.02 | 0.01 | 0.01 |
| η_t | 0.03 | 0.04 | 0.03 | 0.01 |
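The shared/task-specific split of the meta-learning formulation above can be sketched with toy quadratic task losses. K, the targets, and the losses are illustrative choices, and the training and validation data coincide here for brevity (so the upper-level loss is driven to roughly zero, unlike the real task where they differ):

```python
import numpy as np

K = 5
targets = np.linspace(-1.0, 1.0, K)        # one toy regression target per task

def task_loss(x, y_i, t_i):                # L(x, y_i) for task i
    return (x + y_i - t_i) ** 2            # shared x plus task-specific y_i

def lower_level(x):                        # y*(x): solve each task given x
    return np.array([t_i - x for t_i in targets])

def upper_level(x):                        # average validation loss at y*(x)
    y = lower_level(x)
    return np.mean([task_loss(x, y[i], targets[i]) for i in range(K)])

print(upper_level(0.3))                    # ~0: the lower level fits each task
```

The key structural point is that x enters every task loss while each y_i enters only task i, so the lower-level problem decouples across tasks once x is fixed.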
We construct K = 5 tasks, where D^tr_i and D^val_i randomly sample disjoint categories from the CIFAR10 dataset (Krizhevsky et al., 2009). We use ResNet-18 (He et al., 2016) as the task-shared model for the UL problem and a 2-layer neural network as the task-specific model for the LL problem. Clearly, both the UL and LL problems are nonconvex. To ensure fairness, we adopt runtime as the metric: we run each method for 1800 seconds (CPU time) with minibatch size 64 for both the training and validation sets. For our algorithm, we set η_t = 0.05, α_t = 0.085/(t+1)^{0.5}, β_t = 0.5/(t+1)^{0.5} on the CIFAR10 dataset. The learning-rate settings of the other algorithms are shown in Table 5.

Figure 6. Experimental results of meta learning on the CIFAR10 dataset.

Figure 6 illustrates that, for the same runtime, our S-PNGBiO algorithm achieves a faster convergence rate than the baselines, as well as higher accuracy and lower loss values on the test set. In other words, our S-PNGBiO algorithm consistently yields higher test accuracy and lower loss values while maintaining stable performance throughout the evaluation, showcasing a robust generalization capability.

7. Conclusion

In this paper, we studied generalized-smooth bilevel optimization with a nonconvex lower-level objective, and proposed an efficient penalty normalized gradient (PNGBiO) method to solve the deterministic bilevel problem (1). Moreover, we proved that our PNGBiO method has a fast convergence rate of O(1/T^{1/4}) for finding a stationary solution of problem (1). Meanwhile, we proposed a stochastic version of PNGBiO (S-PNGBiO), and proved that our S-PNGBiO method has a fast convergence rate of O(1/T^{1/6}) for finding a stationary solution of the stochastic bilevel problem (7).
In particular, our methods do not compute expensive Hessian/Jacobian matrices, do not require any conditions on them, and can simultaneously handle generalized-smooth upper-level and lower-level objectives in bilevel optimization.

Acknowledgements

We thank the anonymous reviewers for their helpful suggestions. This paper was partially supported by NSFC under Grant No. 62376125, and partially supported by the Fundamental Research Funds for the Central Universities No. NJ2023032.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Alcantara, J. H. and Takeda, A. Theoretical smoothing frameworks for general nonsmooth bilevel problems. arXiv preprint arXiv:2401.17852, 2024.

Chen, Z., Zhou, Y., Liang, Y., and Lu, Z. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning, pp. 5396-5427. PMLR, 2023.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165-1173. PMLR, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568-1577. PMLR, 2018.

Gao, S., Zhang, Y., Huang, F., and Huang, H. BilevelPruning: unified dynamic and static channel pruning for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16090-16100, 2024.

Ghadimi, S. and Wang, M. Approximation methods for bilevel programming.
arXiv preprint arXiv:1802.02246, 2018.

Gong, X., Hao, J., and Liu, M. An accelerated algorithm for stochastic bilevel optimization under unbounded smoothness. arXiv preprint arXiv:2409.19212, 2024a.

Gong, X., Hao, J., and Liu, M. A nearly optimal single loop algorithm for stochastic bilevel optimization under unbounded smoothness. In Forty-first International Conference on Machine Learning, 2024b.

Hao, J., Gong, X., and Liu, M. Bilevel optimization under unbounded smoothness: A new algorithm and convergence analysis. arXiv preprint arXiv:2401.09587, 2024.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147-180, 2023.

Huang, F. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level. arXiv preprint arXiv:2303.03944, 2023.

Huang, F. Optimal Hessian/Jacobian-free nonconvex-PL bilevel optimization. In International Conference on Machine Learning, pp. 19598-19621. PMLR, 2024.

Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882-4892. PMLR, 2021.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kwon, J., Kwon, D., Wright, S., and Nowak, R. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. arXiv preprint arXiv:2309.01753, 2023a.

Kwon, J., Kwon, D., Wright, S., and Nowak, R. D. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pp. 18083-18113. PMLR, 2023b.
Li, H., Qian, J., Tian, Y., Rakhlin, A., and Jadbabaie, A. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2023.
Liu, B., Ye, M., Wright, S., Stone, P., and Liu, Q. BOME! Bilevel optimization made easy: A simple first-order approach. Advances in Neural Information Processing Systems, 35:17248-17262, 2022.
Liu, R., Liu, X., Yuan, X., Zeng, S., and Zhang, J. A value-function-based interior-point method for non-convex bilevel optimization. In International Conference on Machine Learning, pp. 6882-6892. PMLR, 2021.
Liu, R., Liu, Z., Yao, W., Zeng, S., and Zhang, J. Moreau envelope for nonconvex bi-level optimization: A single-loop and Hessian-free solution strategy. arXiv preprint arXiv:2405.09927, 2024.
Outrata, J. V. On the numerical solution of a class of Stackelberg problems. Zeitschrift für Operations Research, 34:255-277, 1990.
Qian, J., Wu, Y., Zhuang, B., Wang, S., and Xiao, J. Understanding gradient clipping in incremental gradient methods. In International Conference on Artificial Intelligence and Statistics, pp. 1504-1512. PMLR, 2021.
Shen, H. and Chen, T. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pp. 30992-31015. PMLR, 2023.
Xian, W., Chen, Z., and Huang, H. Delving into the convergence of generalized smooth minimax optimization. In Forty-first International Conference on Machine Learning, 2024.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511-15521, 2020.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
Zhao, S.-Y., Xie, Y.-P., and Li, W.-J. On the convergence and improvement of stochastic normalized gradient descent. Science China Information Sciences, 64:1-13, 2021.

A. Appendix
In this section, we provide the detailed convergence analysis of our algorithms. We first review some useful lemmas and then give some new lemmas.

Lemma A.1. (Lemma A.7 of (Liu et al., 2024)) Let $\gamma \in (0, \frac{1}{2\kappa_{g2}})$ and $(x', y') \in \mathbb{R}^d \times \mathbb{R}^p$. Then for any $\kappa_{v1} \geq \kappa_{g1}$, $\kappa_{v2} \geq \frac{1}{\gamma}$ and $(x, y) \in \mathbb{R}^d \times \mathbb{R}^p$, the following inequality holds:
$$v_\gamma(x, y) - v_\gamma(x', y') \leq \langle \nabla v_\gamma(x', y'), (x, y) - (x', y') \rangle + \frac{\kappa}{2}\|(x, y) - (x', y')\|^2, \quad (18)$$
where $\kappa := \max\{\kappa_{v1}, \kappa_{v2}\}$.

Lemma A.2. (Lemma 2 of (Chen et al., 2023)) $f$ is generalized smooth if and only if for any $u, u' \in \mathbb{R}^d \times \mathbb{R}^p$,
$$\|\nabla_1 f(u') - \nabla_1 f(u)\| \leq \Big(L_{fx,0} + L_{fx,1} \int_0^1 \|\nabla_1 f(u_\mu)\|^\rho \, d\mu\Big) \|u' - u\|, \quad (19)$$
$$\|\nabla_2 f(u') - \nabla_2 f(u)\| \leq \Big(L_{fy,0} + L_{fy,1} \int_0^1 \|\nabla_2 f(u_\mu)\|^\rho \, d\mu\Big) \|u' - u\|, \quad (20)$$
where $u_\mu = \mu u' + (1 - \mu) u$, $\mu \in [0, 1]$.

Lemma A.3. (Lemma 5 of (Chen et al., 2023)) For any $x \geq 0$, $C \geq 0$ and $0 \leq \omega' < \omega$, the following inequality holds:
$$C x^{\omega'} \leq x^{\omega} + C^{\frac{\omega}{\omega - \omega'}}. \quad (21)$$

Proposition A.4.
Under Assumptions 5.2 and 5.3, we have (i) For ρ [0, 1), then for any u, u Rd Rp, 1 ct f(u ) + g(u ) 1 ct f(u) + g(u) + 1 1 ct f(u) + 1g(u), u u + 1 2 K0 + K1( 1 1 ct f(u) ρ + 1g(u) ρ) ρ 1 ρ u u 2, (22) 1 ct f(u ) + g(u ) 1 ct f(u) + g(u) + 2( 1 ct f(u) + g(u)), u u + 1 2 K3 + K4( 1 ct 2f(u) ρ + 2g(u) ρ) ρ 1 ρ u u 2, (23) where K0 := ( 1 ct Lfx,0 + Lgx,0)(2 1 ρ + 1), K1 := max{Lfx,1, Lgx,1} 2 1 ρ 3ρ, K2 := (L 1 1 ρ fx,1 + L 1 1 ρ gx,1) 1 ρ 3ρ(1 ρ) ρ 1 ρ , K3 := ( 1 ct Lfy,0 + Lgy,0)(2 1 ρ + 1), K4 := max{Lfy,1, Lgy,1} 2 1 ρ 3ρ and K5 := 1 1 ρ fy,1 + L 1 1 ρ gy,1) 2 1 ρ 3ρ(1 ρ) (ii) For ρ = 1, then for any u, u Rd Rp, 1 ct f(u ) + g(u ) 1 ct f(u) + g(u) + 1 ct 1f(u) + 1g(u), u u + 1 2 L0 + L1( 1 ct 1f(u) + 1g(u) ) u u 2 exp(L1 u u ), (24) 1 ct f(u ) + g(u ) 1 ct f(u) + g(u) + 1 ct 2f(u) + 2g(u), u u + 1 2 L2 + L3( 1 ct 2f(u) + 2g(u) ) u u 2 exp(L3 u u ), (25) where L0 := 1 ct Lfx,0 + Lgx,0, L1 := max{Lfx,1, Lgx,1}, L2 := 1 ct Lfy,0 + Lgy,0, L3 := max{Lfy,1, Lgy,1}. Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Proof. According to Assumptions 5.2 and 5.3, (i) for ρ [0, 1), we have 1 ct f(u ) + g(u ) 1 ct f(u) g(u) 1 1 ct f(u) + 1g(u), u u 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u) T(u u)dµ 0 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u) u u dµ. For µ [0, 1], we have uµ = µ u +(1 µ )u, such that µ uµ +(1 µ )u = µ µu +(1 µ µ)u = uµ µ. Therefore, we obtain ct 1f(uµ) 1 ct Lfx,0 + Lfx,1 ct 1f(uµ µ) ρdµ uµ u ct Lfx,0µ + Lfx,1 ct 1f(uµ µ) ρµdµ u u . (26) Let H(µ ) := 1 ct Lfx,0µ + Lfx,1 R 1 0 1 ct 1f(uµ µ) ρµ dµ = 1 ct Lfx,0µ + Lfx,1 R µ ct 1f(uz) ρdz, we have the derivative formula ct Lfx,0 + Lfx,1 1 ct 1f(uµ ) 1 ct 1f(u) ρ + Lfx,1 1 ct Lfx,0 + Lfx,1 u u ρH(µ )ρ + Lfx,1 1 3 u u H(µ ) + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ where the last inequality applies Jensen s inequality to the concave function h(x) = xρ. 
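The Jensen step invoked just above can be checked numerically: for the concave map $h(x) = x^\rho$ with $\rho \in (0, 1)$ and nonnegative arguments, the average of $h$ over any samples is at most $h$ of the average. A minimal sketch (the samples and the choice $\rho = 0.5$ are arbitrary, for illustration only):

```python
import random

def jensen_gap_concave(values, rho):
    """For concave h(x) = x**rho with rho in (0, 1) and x >= 0,
    Jensen's inequality gives mean(h(x_i)) <= h(mean(x_i))."""
    mean_h = sum(v ** rho for v in values) / len(values)
    h_mean = (sum(values) / len(values)) ** rho
    return h_mean - mean_h  # nonnegative by concavity

random.seed(0)
samples = [random.uniform(0.0, 10.0) for _ in range(1000)]
gap = jensen_gap_concave(samples, rho=0.5)
print(gap >= 0)  # True
```

The gap is zero exactly when all samples coincide, which matches the equality case of Jensen's inequality.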
Rearranging the above inequality yields that 31 ρLfx,1(1 ρ) u u (1 ρ) u u u u H(µ ) + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ u u H(µ ) + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ Integrating the above inequality over µ [0, µ] yields that u u H(µ ) + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ 31 ρLfx,1(1 ρ) u u µ + u u H(0) + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ 2ρ 3(Lfx,1(1 ρ) u u µ) 1 1 ρ + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level where the last inequality follows from H(0) = 0 and Jensen s inequality to the concave function h(x) = x1 ρ. Thus, we can obtain ρ 1 ρ 3(Lfx,1(1 ρ) u u µ) 1 1 ρ + 1 ct 1f(u) + ( 1 ct Lfx,0) 1 ρ 1 ρ fx,0 (ct Lfx,1) 1 ρ , ct 1f(uµ) 1 ct 1f(u) + 1 ct 1f(uµ) 1 ct 1f(u) + u u H(µ) ρ 1 ρ 3(Lfx,1(1 ρ) u u µ) 1 1 ρ + 1 ct 1f(u) + L 1 ρ fx,0 (ct Lfx,1) 1 ρ Then we have ct 1f(u ) 1 ct Lfx,0 + max µ [0,1] 1 ct 1f(uµ) ρ) u u ct Lfx,0 + Lfx,12 1 ρ 3(Lfx,1(1 ρ) u u ) 1 1 ρ + 1 ct 1f(u) + L 1 ρ fx,0 (ct Lfx,1) 1 ρ ct Lfx,0 + Lfx,12 1 ρ 3ρ(Lfx,1(1 ρ) u u ) ct 1f(u) ρ + Lfx,0 (ct Lfx,1) ct Lfx,0(1 + 2 1 ρ ) + Lfx,1 2 ct 1f(u) ρ + L 1 1 ρ fx,1 2 1 ρ 3ρ(1 ρ) ρ 1 ρ u u . Similarly, we have 1g(u ) 1g(u) Lgx,0(1 + 2 1 ρ ) + Lgx,1 2 1 ρ 3ρ 1g(u) + L 1 1 ρ gx,1 2 1 ρ 3ρ(1 ρ) ρ 1 ρ u u . 
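Generalized-smoothness bounds of this form can be illustrated on a simple scalar example: $f(x) = x^4$ satisfies $|f'(u') - f'(u)| \leq (L_0 + L_1 \max_\mu |f'(u_\mu)|^\rho)\,|u' - u|$ along the segment $u_\mu = \mu u' + (1-\mu)u$ with $\rho = 2/3$. The constants $L_0 = 1$, $L_1 = 5$ below are chosen for this toy function and are not the paper's constants:

```python
def fprime(x):
    return 4.0 * x ** 3

def gsmooth_bound(u, up, L0=1.0, L1=5.0, rho=2.0 / 3.0, steps=200):
    """Check |f'(u') - f'(u)| <= (L0 + L1 * max_mu |f'(u_mu)|**rho) * |u' - u|
    along the segment u_mu = mu*u' + (1-mu)*u, for the toy f(x) = x**4."""
    grad_max = max(abs(fprime(u + (up - u) * k / steps)) for k in range(steps + 1))
    lhs = abs(fprime(up) - fprime(u))
    rhs = (L0 + L1 * grad_max ** rho) * abs(up - u)
    return lhs <= rhs

print(all(gsmooth_bound(u, u + 0.5) for u in [-3.0, -1.0, 0.0, 2.0, 5.0]))  # True
```

Note that no constant Lipschitz bound $|f''| \leq L$ holds globally for $f(x) = x^4$, which is exactly why the gradient-dependent term $L_1 \|\nabla f\|^\rho$ is needed.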
Therefore, we can obtain Z 1 0 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u) u u dµ 1 ρ + 1) + Lfx,1 2 ct 1f(u) ρ + L 1 1 ρ fx,1 2 1 ρ 3ρ(1 ρ) 1 ρ + 1) + Lgx,1 2 1 ρ 3ρ 1g(u) ρ + L 1 1 ρ gx,1 2 1 ρ 3ρ(1 ρ) ρ 1 ρ ) u u dµ 0 µ u u 2( 1 1 ρ + 1) + Lfx,1 2 ct 1f(u) ρ + L 1 1 ρ fx,1 2 1 ρ 3ρ(1 ρ) 1 ρ + 1) + Lgx,1 2 1 ρ 3ρ 1g(u) ρ + L 1 1 ρ gx,1 2 1 ρ 3ρ(1 ρ) 2 u u 2 ( 1 ct Lfx,0 + Lgx,0)(2 1 ρ + 1) + max{Lfx,1, Lgx,1} 2 ct 1f(u) ρ + 1g(u) ρ)+ 1 1 ρ fx,1 + L 1 1 ρ gx,1) 2 1 ρ 3ρ(1 ρ) 2 ρ 1 ρ Z 1 0 µ 1 1 ρ dµ 2 u u 2 K0 + K1( 1 ct 1f(u) ρ + 1g(u) ρ) + 2K2 u u Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level where K0 := ( 1 ct Lfx,0 + Lgx,0)(2 1 ρ +1), K1 := max{Lfx,1, Lgx,1} 2 1 ρ 3ρ and K2 := (L 1 1 ρ fx,1 + L 1 1 ρ gx,1) 2 Similarly, (23) can be proved. (ii) For ρ = 1, we prove (24) as follows. 1 ct f(u ) + g(u ) 1 ct f(u) g(u) 1 1 ct f(u) + 1g(u), u u 0 ( 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u))T(u u)dµ 0 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u) u u dµ. When ρ = 1, by (27), we have ct Lfx,0 + Lfx,1 u u H(µ ) + Lfx,1 1 where H(µ ) := 1 ct Lfx,0µ + Lfx,1 R µ ct f(uµ) dµ. It follows that Lfx,1 u u Lfx,1 u u H(µ ) 1 ct Lfx,0 + Lfx,1 u u H(µ ) + Lfx,1 1 1 = d dµ ln( 1 ct Lfx,0 + Lfx,1 u u H(µ ) + Lfx,1 1 1 ct f(u) ), and integrating the inequality over µ [0, µ], we have ct Lfx,0 + Lfx,1 u u H(µ ) + Lfx,1 1 1 ct f(u) ) ln( 1 ct Lfx,0 + Lfx,1 1 1 ct f(u) ) + Lfx,1 u u , which implies that Lfx,1 u u H(µ) ( 1 ct Lfx,0 + Lfy,1 1 1 ct f(u) ) exp(Lfx,1 u u ) 1 ct Lfx,0 Lfx,1 1 1 ct f(u) . Then, we have 1 1 ct f(uµ) 1 1 ct f(u) + 1 1 ct f(uµ) 1 1 ct f(u) 1 1 ct f(u) + 1 Lfx,1 ct Lfx,0 + Lfx,1 1 1 ct f(u) ) exp(Lfx,1 u u ) Lfx,0 Lfx,1 1 1 ct f(u) ct Lfx,1 + 1 1 ct f(u) exp(Lfx,1 u u ) Lfx,0 1f(u ) 1f(u) 1 ct Lfx,0 + Lfx,1 1 1 ct f(u) exp(Lfx,1 u u ) u u . Similarly, we have 1g(u ) 1g(u) Lgx,0 + Lgx,1 1g(u) exp(Lgx,1 u u ) u u . 
Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Therefore, we can obtain Z 1 0 1 1 ct f(uµ) + 1g(uµ) 1 1 ct f(u) 1g(u) u u dµ ct Lfx,0 + Lfx,1 1 1 ct f(u) exp(Lfx,1 u u ) + Lgx,0 + Lgx,1 1g(u) exp(Lgx,1 u u ) u u dµ 0 µ u u 2 1 ct Lfx,0 + Lfx,1 1 1 ct f(u) exp(Lfx,1 u u ) + Lgx,0 + Lgx,1 1g(u) exp(Lgx,1 u u ) dµ ct Lfx,0 + Lfx,1 1 1 ct f(u) exp(Lfx,1 u u ) + Lgx,0 + Lgx,1 1g(u) exp(Lgx,1 u u ) 2 u u 2(L0 + L1( 1 1 ct f(u) + 1g(u) )) exp(L1 u u ), where L0 := 1 ct Lfx,0 + Lgx,0, L1 := max{Lfx,1, Lgx,1}. Similarly, (25) can be proved. Lemma A.5. Let γ (0, 1 κg2 ). Then, there exists Lθ > 0 such that for any (x, y), (x , y ) Rd Rp, the following inequality holds: θ γ(x, y) θ γ(x , y ) Lθ (x, y) (x , y ) , (28) where Lθ = s 1 q γ κg2) max Lgy,0 + Lgy,1 maxµ [0,1] 2g(xµ, θ γ(x, y)) ρ, 1 γ , xµ := µx + (1 µ)x . Proof. Since θ γ(x, y) is the optimal solution of minθ Rp g(x, θ) + 1 2γ θ y 2, we have for any (x, y), (x , y ) Rd Rp, 2g(x, θ γ(x, y)) + 1 γ (θ γ(x, y) y) = 0, 2g(x , θ γ(x , y )) + 1 γ (θ γ(x , y ) y ) = 0. Then we can obtain θ γ(x, y) θ γ(x , y ) = θ γ(x, y) s( 2g(x, θ γ(x, y)) + 1 γ (θ γ(x, y) y)) θ γ(x , y ) + s( 2g(x , θ γ(x , y )) + 1 γ (θ γ(x , y ) y )) θ γ(x, y) s( 2g(x, θ γ(x, y)) + 1 γ (θ γ(x, y) y)) θ γ(x , y ) + s( 2g(x, θ γ(x , y )) + 1 γ (θ γ(x , y ) y)) + θ γ(x , y ) s( 2g(x, θ γ(x , y )) + 1 γ (θ γ(x , y ) y)) θ γ(x , y ) + s( 2g(x , θ γ(x , y )) γ (θ γ(x , y ) y )) . (29) Due to γ (0, 1 κg2 ), the function g(x, θ) + 1 2γ θ y 2 is ( 1 γ κg2)-strongly convex respect to θ Rp, it follows that 2g(x, θ γ(x, y)) + 1 γ (θ γ(x, y) y) 2g(x, θ γ(x , y )) 1 γ (θ γ(x , y ) y), θ γ(x, y) θ γ(x , y ) γ κg2) θ γ(x, , y) θ γ(x , y ) 2. 
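The strong-convexity step above says that for $\gamma < 1/\kappa_{g2}$, the gradient of $\theta \mapsto g(x, \theta) + \frac{1}{2\gamma}\|\theta - y\|^2$ is $(\frac{1}{\gamma} - \kappa_{g2})$-strongly monotone. A scalar sketch with a toy $\kappa_{g2}$-weakly-convex $g(x, \theta) = -\frac{\kappa_{g2}}{2}\theta^2 + x\theta$; all constants here are illustrative:

```python
kappa_g2 = 1.0
gamma = 0.25  # satisfies gamma < 1/(2*kappa_g2)

def h_prime(theta, x=0.3, y=-0.7):
    """Gradient of theta -> g(x, theta) + (theta - y)**2 / (2*gamma),
    with the toy weakly convex g(x, theta) = -(kappa_g2/2)*theta**2 + x*theta."""
    return -kappa_g2 * theta + x + (theta - y) / gamma

def monotone_gap(a, b):
    """(h'(a) - h'(b)) * (a - b) - (1/gamma - kappa_g2) * (a - b)**2 >= 0
    certifies (1/gamma - kappa_g2)-strong monotonicity of the gradient."""
    return (h_prime(a) - h_prime(b)) * (a - b) - (1.0 / gamma - kappa_g2) * (a - b) ** 2

print(all(monotone_gap(a, b) >= -1e-9 for a, b in [(0.0, 1.0), (-2.0, 3.0), (1.5, 1.5)]))
```

For this quadratic toy the gap is exactly zero, i.e., the strong-convexity modulus $\frac{1}{\gamma} - \kappa_{g2}$ is tight.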
Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Given 0 < s 1/γ κg2 (1/γ+Lgy,0+Lgy,1 maxµ (0,1) 2g(x,θ γ(x,y)µ) ρ)2 , we have θ γ(x, y) s( 2g(x, θ γ(x, y)) + 1 γ (θ γ(x, y) y)) θ γ(x , y ) + s( 2g(x, θ γ(x , y )) + 1 γ (θ γ(x , y ) y)) 2 γ κg2) + s2( 1 γ + Lgy,0 + Lgy,1 max µ (0,1) 2g(x, θ γ(x, y)µ) ρ)2 θ γ(x, y) θ γ(x , y ) 2 γ κg2) θ γ(x, y) θ γ(x , y ) 2, (30) θ γ(x , y ) s( 2g(x, θ γ(x , y )) + 1 γ (θ γ(x , y ) y)) θ γ(x , y ) + s( 2g(x , θ γ(x , y )) + 1 γ (θ γ(x , y ) y )) s x x (Lgy,0 + Lgy,1 max µ [0,1] 2g(xµ, θ γ(x , y )) ρ) + s γ y y , (31) where xµ := µx + (1 µ)x , θ γ(x, y)µ := µθ γ(x, y) + (1 µ)θ γ(x , y ). By combining the above inequalities (29), (30) and (31), we have θ γ(x, y) θ γ(x , y ) γ κg2) θ γ(x, y) θ γ(x , y ) + s((Lgy,0 + Lgy,1 max µ [0,1] 2g(xµ, θ γ(x, y)) ρ) x x + 1 Finally we can obtain θ γ(x, y) θ γ(x , y ) γ κg2) (Lgy,0 + Lgy,1 max µ [0,1] 2g(xµ, θ γ(x, y)) ρ) x x + 1 Lθ (x, y) (x , y ) , where Lθ = s 1 q γ κg2) max Lgy,0 + Lgy,1 maxµ [0,1] 2g(xµ, θ γ(x, y)) ρ, 1 γ . It implies that the solution θ γ is generally smooth with (x, y). Lemma A.6. Under Assumptions 5.2 and 5.3, let Ψct(x, y) = 1 ct f(x, y) + g(x, y) vγ(x, y) with ct+1 ct > 0 for all t 0, we have 1Ψct(xt+1, yt+1) 1Ψct(xt, yt) LΨ1 (xt+1, yt+1) (xt, yt) , (32) 2Ψct(xt+1, yt+1) 2Ψct(xt, yt) LΨ2 (xt+1, yt+1) (xt, yt) . (33) where LΨ1 = 1 ct Lfx,0 + Lfx,1 maxµ [0,1] 1 ct 1f(xµ, yµ) ρ + Lgx,0 + Lgx,1 maxµ [0,1] 1g(xµ, yµ) ρ + (Lgx,0 + Lgx,1 maxµ [0,1] 1g(xµ, θ γ(xµ, yµ)) ρ(1 + Lθ)) and LΨ2 = 1 ct Lfy,0 + Lfy,1 maxµ [0,1] 1 ct 2f(xµ, yµ) ρ + Lgy,0 + Lgy,1 maxµ [0,1] 2g(xµ, yµ) ρ + 1+Lθ Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Proof. 
When ρ [0, 1], we have 1Ψct(xt+1, yt+1) 1Ψct(xt, yt) ct 1f(xt+1, yt+1) 1 ct 1f(xt, yt) + 1g(xt+1, yt+1) 1g(xt, yt) 1g(xt+1, θ γ(xt+1, yt+1)) + 1g(xt, θ γ(xt, yt)) ct 1f(xt+1, , yt+1) 1 ct 1f(xt, yt) + 1g(xt+1, yt+1) 1g(xt, yt) + 1g(xt+1, θ γ(xt+1, yt+1)) 1g(xt, θ γ(xt, yt)) ct Lfx,0 + Lfx,1 max µ [0,1] 1 ct 1f(xµ, yµ) ρ) + Lgx,0 + Lgx,1 max µ [0,1] 1g(xµ, yµ) ρ (xt+1, yt+1) (xt, yt) + (Lgx,0 + Lgx,1 max µ [0,1] 1g(xµ, θ γ(xµ, yµ)) ρ(1 + Lθ)) (xt+1, yt+1) (xt, yt) ct Lfx,0 + Lfx,1 max µ [0,1] 1 ct 1f(xµ, yµ) ρ + Lgx,0 + Lgx,1 max µ [0,1] 1g(xµ, yµ) ρ + (Lgx,0 + Lgx,1 max µ [0,1] 1g(xµ, θ γ(xµ, yµ)) ρ(1 + Lθ)) (xt+1, yt+1) (xt, yt) , (34) where the first inequality follows from triangle inequality and the second inequality follows from Assumptions 5.2, 5.3 and Lemma A.5. Similarly, we have 2Ψct(xt+1, yt+1) 2Ψct(xt, yt) = 1 ct 2f(xt+1, , yt+1) 1 ct 2f(xt, yt) + 2g(xt+1, yt+1) 2g(xt, yt) + θ γ(xt+1, yt+1) yt+1 γ θ γ(xt, yt) yt ct Lfy,0 + Lfy,1 max µ [0,1] 1 ct 2f(xµ, yµ) ρ + Lgy,0 + Lgy,1 max µ [0,1] 2g(xµ, yµ) ρ (xt+1, yt+1) (xt, yt) γ (xt+1, yt+1) (xt, yt) ct Lfy,0 + Lfy,1 max µ [0,1] 1 ct 2f(xµ, yµ) ρ + Lgy,0 + Lgy,1 max µ [0,1] 2g(xµ, yµ) ρ + 1 + Lθ (xt+1, yt+1) (xt, yt) , (35) where the first inequality follows from triangle inequality and the second inequality follows from Assumptions 5.2, 5.3 and Lemma A.5. A.1. Deterministic Setting Lemma A.7. Given γ (0, 1 2κg2 ) and ηt 0, 1/γ κg2 (1/γ+Lgy,0+Lgy,1 maxµ (0,1) 2g(x,θ γ(x,y)µ) ρ)2 , the sequence {xt, yt, θt} generated by Algorithm 1 satisfies θt+1 θ (xt, yt) σt θt θ γ(xt, yt) , (36) where σt = p 1 ηt(1/γ κg2). Proof. Let θ γ(xt, yt) be the optimal solution of minθ Rp g(xt, θ) + 1 2γ θ yt , we have θ γ(xt, yt) =θ γ(xt, yt) ηt( 2g(x, θ γ(xt, yt)) + 1 γ (θ γ(xt, yt) yt)). 
Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level By using the update rule for θt+1 in the Algorithm 1, we can obtain θt+1 θ γ(xt, yt) θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt) + ηt( 2g(xt, θ γ(xt, yt)) + 1 γ (θ γ(xt, yt) yt)) 1 ηt(1/γ κg2) θt θ γ(xt, yt) . Lemma A.8. Under Assumptions 5.1, 5.2 and 5.3, suppose γ (0, 1 2κg2 ) the sequence of {xt, yt, θt} generated by Algorithm 1. We define a function Ψct(x, y) = 1 ct f(x, y) + g(x, y) vγ(x, y) with ct+1 ct > 0 for all t 0, which satisfies (i) when ρ [0, 1), Ψct(xt+1, yt+1) Ψct(xt, yt) + αt Lgx,0 + Lgx,1M ρ + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 1 ct 1f(xt, yt) + 1g(xt, yt) + 3 4αt 1g(xt, θt+1) + α2 t 2 K0 + κ + K1 + 2K2 + 4βt γ2 dty 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (K3 + κ + K4 + 2K5), (ii) when ρ = 1, Ψct(xt+1, yt+1) Ψct(xt, yt) + αt dtx (Lgx,0 + Lgx,1M ρ + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 1 ct 1f(xt, yt) + 1g(xt, yt) + 3 4αt 1g(xt, θt+1) + α2 t 2 L0 + κ + L1 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (L2 + κ + L3). Proof. By the definition of function Ψct(x, y), we have (i) when ρ [0, 1), Ψct(xt+1, yt) 1 ct f(xt, yt) + g(xt, yt) vγ(xt+1, yt) + 1 ct 1f(xt, yt) + 1g(xt, yt), xt+1 xt ct 1f(xt, yt) ρ + 1g(xt, yt) ρ) + 2K2 xt+1 xt ρ 1 ρ xt+1 xt 2 Ψct(xt, yt) + 1Ψct(xt, yt), xt+1 xt + 1 ct 1f(xt, yt) ρ + 1g(xt, yt) ρ) + 2K2 xt+1 xt ρ 1 ρ + κ xt+1 xt 2 =Ψct(xt, yt) + 1Ψct(xt, yt) dt x, xt+1 xt + dt x, xt+1 xt ct 1f(xt, yt) ρ + 1g(xt, yt) ρ) + 2K2 xt+1 xt ρ 1 ρ + κ xt+1 xt 2, Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level where the first inequality holds by Proposition A.4, and the second inequality holds by the Lemma A.1. Considering the update rule for the variable x in Algorithm 1, it follows that dt x, xt+1 xt = αt dt x . 
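The identity just derived, $\langle d_t^x, x_{t+1} - x_t \rangle = -\alpha_t \|d_t^x\|$, is a direct consequence of the normalized update $x_{t+1} = x_t - \alpha_t d_t^x / \|d_t^x\|$. A quick numerical check with an arbitrary direction $d$ and step size $\alpha$ (all values illustrative):

```python
import math

def normalized_step(x, d, alpha):
    """One normalized gradient step x_{t+1} = x_t - alpha * d / ||d||."""
    norm_d = math.sqrt(sum(di * di for di in d))
    return [xi - alpha * di / norm_d for xi, di in zip(x, d)]

x = [1.0, -2.0, 0.5]
d = [3.0, 4.0, 0.0]   # arbitrary search direction
alpha = 0.1
x_next = normalized_step(x, d, alpha)
inner = sum(di * (xn - xi) for di, xn, xi in zip(d, x_next, x))
norm_d = math.sqrt(sum(di * di for di in d))
print(abs(inner - (-alpha * norm_d)) < 1e-12)  # <d, x_{t+1} - x_t> = -alpha * ||d||
```

The same computation also shows $\|x_{t+1} - x_t\| = \alpha_t$ exactly, which is what makes normalized steps well suited to generalized-smooth objectives: the displacement is controlled independently of the gradient magnitude.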
(38) And we have 1Ψct(xt, yt) dt x, xt+1 xt = 1g(xt, θt+1) 1g(xt, θ γ(xt, yt)), xt+1 xt 1g(xt, θt+1) 1g(xt, θ γ(xt, yt)) xt+1 xt Lgx,0 + Lgx,1 max µ [0,1] 1g(xt, µθt+1 + (1 µ)θ γ(xt, yt)) ρ θ γ(xt, yt) θt+1 xt+1 xt dt x 4αt xt+1 xt 2 + αt dtx Lgx,0 + Lgx,1 max µ [0,1] 1g(xt, µθt+1 + (1 µ)θ γ(xt, yt)) ρ θ γ(xt, yt) θt+1 2, where the last inequality follows from Young s inequality. For the sake of simplicity in presentation, let M = maxµ [0,1] 1g(xt, µθt+1+(1 µ)θ γ(xt, yt)) , by combining (38) with (39), we have Ψct(xt+1, yt) Ψct(xt, yt) + αt dtx Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 + dt x 4αt xt+1 xt 2 dt x αt xt+1 xt 2 ct 1f(xt, yt) ρ + 1g(xt, yt) ρ) + 2K2 xt+1 xt ρ 1 ρ + κ xt+1 xt 2 Ψct(xt, yt) + αt dtx Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 3 3(K0 + κ)αt + 3( 1 ct 1f(xt, yt) + 1g(xt, yt) ) + 6K1αt + 6K2αt Ψct(xt, yt) + αt dtx Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 αt ct 1f(xt, yt) + 1g(xt, yt) ) 4 1g(xt, θt+1) + α2 t 2 (K0 + κ + K1 + 2K2), (40) where the second inequality follows from the above Lemma A.3. Similarly, we have Ψct(xt+1, yt+1) Ψct(xt+1, yt) + 2Ψct(xt+1, yt), yt+1 yt + 1 ct 2f(xt+1, yt) ρ + 2g(xt+1, yt) ρ) + 2K5 yt+1 yt ρ 1 ρ + κ yt+1 yt 2 Ψct(xt+1, yt) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 + xt+1 xt 2 3 dt y 4βt yt+1 yt 2 + 1 ct 2f(xt+1, yt) ρ + 2g(xt+1, yt) ρ) + 2K5 yt+1 yt ρ 1 ρ + κ yt+1 yt 2 Ψct(xt+1, yt) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 + xt+1 xt 2 βt 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (K3 + κ + K4 + 2K5). (41) Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level By combining the above inequalities (41) with (40), we can obtain Ψct(xt+1, yt+1) Ψct(xt, yt) + αt dtx (Lgx,0 + Lgx,1M ρ) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 ct 1f(xt, yt) + 1g(xt, yt) ) + 3 4αt 1g(xt, θt+1) + α2 t 2 K0 + κ + K1 + 2K2 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) ) + β2 t 2 (K3 + κ + K4 + 2K5). 
(ii) When ρ = 1, similarly, we have Ψct(xt+1, yt+1) Ψct(xt, yt) dtx (Lgx,0 + Lgx,1M) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 αt 1 ct 1f(xt, yt) + 1g(xt, yt) 4αt 1g(xt, θt+1) + α2 t 2 L0 + κ + L1 + 4βt γ2 dty 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (L2 + κ + L3). (42) The proof have completed. Lemma A.9. Under Assumptions 5.2 and 5.3, suppose γ (0, 1 2κg2 ), ct+1 ct > 0 and ηt satisfies ηt > v u u t1+4(Lgx,0+Lgx,1M ρ+ 2 (Lgx,0+Lgx,1M ρ) αt dtx + βt γ2 dty 2κg2(Lgx,0+Lgx,1M ρ+ 2 γ2 ) , we have ct 1f(xt, yt) + 1g(xt, yt) ) βt ct 2f(xt, yt) + 2g(xt, yt) ) ηtκg2 θt+1 θ γ(xt, yt) . where Ωt+1 = Ψct+1(xt+1, yt+1) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2. Proof. We first define a useful Lyapunov function, Ωt = Ψct(xt, yt) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt θ γ(xt, yt) 2, (44) where Ψct(x, y) = 1 ct f(x, y) + g(x, y) vγ(x, y) and ct+1 ct > 0 for all t 0. Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level By the above definition of Ωt, when ρ [0, 1), we have =Ψct+1(xt+1, yt+1) Ψct(xt, yt) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2 Lgx,0 + Lgx,1M ρ + 2 γ2 θt θ γ(xt, yt) 2 Ψct(xt+1, yt+1) Ψct(xt, yt) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2 Lgx,0 + Lgx,1M ρ + 2 γ2 θt θ γ(xt, yt) 2 dtx (Lgx,0 + Lgx,1M ρ) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 ct 1f(xt, yt) + 1g(xt, yt) + 3 4αt 1g(xt, θt+1) + α2 t 2 K0 + κ + K1 + 2K2 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) ) + β2 t 2 (K3 + κ + K4 + 2K5) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2 Lgx,0 + Lgx,1M ρ + 2 γ2 θt θ γ(xt, yt) 2 ct 1f(xt, yt) + 1g(xt, yt) 3 1g(xt, θt+1) 2αt K0 + κ + K1 + 2K2 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) 2βt(K3 + κ + K4 + 2K5) + Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2 Lgx,0 + Lgx,1M ρ + 2 γ2 θt θ γ(xt, yt) 2 Lgx,0 + Lgx,1M ρ + 2βt γ2 dty θt+1 θ γ(xt, yt) 2, (45) where the second inequality follows from Lemma A.7 and the last inequality by rearrangement. 
Next, we demonstrate that θt+1 θ γ(xt+1, yt+1) 2 θt θ γ(xt, yt) 2 + αt dtx θt+1 θ γ(xt, yt) 2 1 + νt + αt dtx θt+1 θ γ(xt, yt) 2 θt θ γ(xt, yt) 2 + 1 + 1 θ γ(xt+1, yt+1) θ γ(xt, yt) 2 1 + νt + αt dtx σ2 t θt θ γ(xt, yt) 2 θt θ γ(xt, yt) 2 + 1 + 1 L2 θ (xt+1, yt+1) (xt, yt) 2, (46) where the first inequality follows from Young s inequality with parameter νt and the last inequality follows from Lemma A.7. Given νt = ηtκg2, we have θt+1 θ γ(xt+1, yt+1) 2 θt θ γ(xt, yt) 2 + αt dtx θt+1 θ γ(xt, yt) 2 η2 t κ2 g2 αt dtx θt θ γ(xt, yt) 2 + 1 + 1 ηtκg2 L2 θ (xt+1, yt+1) (xt, yt) 2. (47) Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Similarly, we have θt+1 θ γ(xt+1, yt+1) 2 θt θ γ(xt, yt) 2 + 2βt dty θt+1 θ γ(xt, yt) 2 1 + νt + 2βt σ2 t θt θ γ(xt, yt) 2 θt θ γ(xt, yt) 2 + 1 + 1 L2 θ (xt+1, yt+1) (xt, yt) 2 η2 t κ2 g2 2βt θt θ γ(xt, yt) 2 + 1 + 1 ηtκg2 L2 θ (xt+1, yt+1) (xt, yt) 2. (48) Putting the above inequalities (45), (47) into (48), we have ct 1f(xt, yt) + 1g(xt, yt) 3 1g(xt, θt+1) 2αt K0 + κ + K1 + 2K2 + 4βt γ2 dty 4αt 1 + 1 ηtκg2 Lgx,0 + Lgx,1M ρ + 2 ct 2f(xt+1, yt) + 2g(xt+1, yt) 2βt(K3 + κ + K4 + 2K5) 4βt 1 + 1 ηtκg2 L2 θ(Lgx,0 + Lgx,1M ρ + 2 γ2 ) (Lgx,0 + Lgx,1M ρ + 2 γ )η2 t κ2 g2 (Lgx,0 + Lgx,1M ρ) αt dtx βt γ2 dty θt θ γ(xt, yt) 2. Given 0 < αt 1 ct 1f(xt,yt) + 1g(xt,yt) 4(K0+κ+K1+2K2)+8(1+ 1 ηtκg2 )L2 θ(Lgx,0+Lgx,1M ρ+ 2 γ2 )+6 1g(xt,θt+1) , ct 2f(xt,yt) + 2g(xt,yt) 4(K3+κ+K4+2K5)+8(1+ 1 ηtκg2 )L2 θ(Lgx,0+Lgx,1M ρ+ 2 γ2 ), and ηt > v u u t1+4(Lgx,0+Lgx,1M ρ+ 2 (Lgx,0+Lgx,1M ρ) αt dtx + βt γ2 dty 2κg2(Lgx,0+Lgx,1M ρ+ 2 γ2 ) , we can obtain ct 1f(xt, yt) + 1g(xt, yt) ) βt ct 2f(xt, yt) + 2g(xt, yt) ) ηtκg2 θt+1 θ γ(xt, yt) 2. 
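The descent property established here can also be observed empirically. Below is a minimal sketch of single-loop penalty normalized-gradient updates of the kind analyzed in this section, on a toy quadratic bilevel problem whose Moreau envelope $v_\gamma$ has a closed form. The objectives, step sizes, and the fixed penalty parameter $c$ are illustrative choices, not the paper's Algorithm 1 verbatim (in particular, $c_t$ grows over iterations in the paper but is held fixed here). The check is that the penalty function $\Psi_c = \frac{1}{c}f + g - v_\gamma$ decreases from its initial value:

```python
gamma, c = 0.1, 10.0  # proximal and penalty parameters (held fixed in this sketch)

def f(x, y):  # toy upper-level objective
    return (x - 1.0) ** 2 + y ** 2

def g(x, y):  # toy lower-level objective
    return (y - x) ** 2

def theta_star(x, y):
    """Closed-form proximal point argmin_th g(x, th) + (th - y)^2 / (2*gamma)."""
    return (2.0 * gamma * x + y) / (2.0 * gamma + 1.0)

def v_gamma(x, y):  # Moreau envelope of g in its second argument
    th = theta_star(x, y)
    return g(x, th) + (th - y) ** 2 / (2.0 * gamma)

def psi(x, y):  # penalty function Psi_c = f/c + g - v_gamma (nonnegative)
    return f(x, y) / c + g(x, y) - v_gamma(x, y)

x, y, alpha = 3.0, -2.0, 0.05
psi_hist = [psi(x, y)]
for _ in range(50):
    th = theta_star(x, y)
    # d_x = (1/c) df/dx + dg/dx(x, y) - dg/dx(x, theta*)    (penalty gradient in x)
    dx = 2.0 * (x - 1.0) / c - 2.0 * (y - x) + 2.0 * (th - x)
    # d_y = (1/c) df/dy + dg/dy(x, y) - (y - theta*)/gamma  (penalty gradient in y)
    dy = 2.0 * y / c + 2.0 * (y - x) - (y - th) / gamma
    x -= alpha * dx / (abs(dx) + 1e-12)  # normalized gradient steps
    y -= alpha * dy / (abs(dy) + 1e-12)
    psi_hist.append(psi(x, y))
print(psi_hist[-1] < psi_hist[0])  # the penalty objective decreased
```

Since $v_\gamma(x, y) \leq g(x, y)$ pointwise by definition of the Moreau envelope, $\Psi_c \geq 0$ throughout, matching the lower-boundedness used in the telescoping argument.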
When ρ = 1, according to Lemma A.2, we have dtx (Lgx,0 + Lgx,1M) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 αt 1 ct 1f(xt, yt) + 1g(xt, yt) 4αt 1g(xt, θt+1) + α2 t 2 L0 + κ + L1 + 4βt γ2 dty 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (L2 + κ + L3) + Lgx,0 + Lgx,1M + 2 γ2 θt θ γ(xt+1, yt+1) 2 Lgx,0 + Lgx,1M + 2 γ2 θt θ γ(xt, yt) 2 ct 1f(xt, yt) + 1g(xt, yt) 3 1g(xt, θt+1) 2αt(L0 + κ + L1) 4αt(1 + 1 ηtκg2 )L2 θ(Lgx,0 + Lgx,1M + 2 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) 2βt(L2 + L3 + κ) 4βt(1 + 1 ηtκg2 )L2 θ(Lgx,0 + Lgx,1M + 2 γ2 ) (Lgx,0 + Lgx,1M + 2 2 )η2 t κ2 t (Lgx,0 + Lgx,1M) αt dtx βt γ2 dty θt θ γ(xt, yt) 2, where the last inequality follows from Young s inequality with parameter ηtκg2 and rearrangement. Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Given 0 < αt 1 ct 1f(xt,yt) + 1g(xt,yt) 4(L0+κ+L1)+8(1+ 1 ηtκg2 )L2 θ(Lgx,0+Lgx,1M ρ+ 2 γ2 )+6 1g(xt,θt+1) , 0 < βt ct 2f(xt,yt) + 2g(xt,yt) 4(L2+κ+L3)+8(1+ 1 ηtκg2 )L2 θ(Lgx,0+Lgx,1M ρ+ 2 γ2 ), and ηt > v u u t1+4(Lgx,0+Lgx,1M ρ+ 2 (Lgx,0+Lgx,1M ρ) αt dtx + βt γ2 dty 2κg2(Lgx,0+Lgx,1M ρ+ 2 we can also obtain ct 1f(xt, yt) + 1g(xt, yt) ) βt ct 2f(xt, yt) + 2g(xt, yt) ) ηtκg2 θt+1 θ γ(xt, yt) 2. Thus the function Ωt is descent. Based on the Lyapunov function, the non-asymptotic convergence rate can be shown as: Theorem A.10. (Restatement of Theorem 5.5) Under Assumptions 5.1, 5.2, 5.3 and 5.4, given γ (0, 1 2κg2 ), ct = c(t + 1)1/4 with c > 0, and when ρ [0, 1) let 0 < αt ct 1f(xt,yt) + 1g(xt,yt) 4(K0+κ+K1+2K2)+8(1+ 1 ηtκg2 )L2 θC+6 1g(xt,θt+1) , 0 < βt 1 ct 2f(xt,yt) + 2g(xt,yt) 4(K3+κ+K4+2K5)+8(1+ 1 ηtκg2 )L2 θC , when ρ = 1 let 0 < αt 1 ct 1f(xt,yt) + 1g(xt,yt) 4(L0+κ+L1)+8(1+ 1 ηtκg2 )L2 θC+6 1g(xt,θt+1) , 0 < βt 1 ct 2f(xt,yt) + 2g(xt,yt) 4(L2+κ+L3)+8(1+ 1 ηtκg2 )L2 θC , and v u u t1+4C (Lgx,0+Lgx,1M) αt dtx + βt γ2 dty 2κg2C , 1/γ κg2 (1/γ+Lgy,0+Lgy,1 maxµ (0,1) 2g(x,θ γ(x,y)µ) ρ)2 C = Lgx,0 + Lgx,1M ρ + 2 γ2 . 
The the sequence of {xt, yt, θt}T t=0 generated by Algorithm 1 satisfies min 0 t T θt θ γ(xt, yt) = O 1 T 1/2 , min 0 t T Rt(xt+1, yt+1) = O 1 g(x T , y T ) vγ(x T , y T ) = O 1 Proof. By using the above Lemma A.9, telescoping the above inequality (43) over the range t = 0, 1, , T 1, given η = min0 t T 1 ηt, we have ct 1f(xt, yt) + 1g(xt, yt) ) + βt ct 2f(xt, yt) + 2g(xt, yt) ) + ηκg2 θt θ γ(xt, yt) 2 c T f Ω0, (51) where the last second inequality holds by Assumption 5.1. We obtain t=0 θt θ γ(xt, yt) 2 < , min 0 t T θt θ γ(xt, yt) = O( 1 T 1/2 ). Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level According to the update rule of variables (x, y) in Algorithm 1, we have ct dt x dtx + ct αt (xt+1 xt) = 0, (52) ct dt y dty + ct βt (yt+1 xt) = 0. (53) (et x, et y) = f(xt+1, yt+1) + ct( g(xt+1, yt+1) vγ(xt+1, yt+1)), et x := ct 1Ψct(xt+1, yt+1) ctdt x ct dt x αt (xt+1 xt), (54) et y := ct 2Ψct(xt+1, yt+1) ctdt y ct dt y βt (yt+1 yt). (55) Considering the term et x , we have et x ct 1Ψct(xt+1, yt+1) 1Ψct(xt, yt) + ct 1Ψct(xt, yt) ctdt x + ct dt x αt xt+1 xt (i) LΨ1 (xt+1, yt+1) (xt, yt) + ct 1Ψct(xt, yt) dt x + ct dt x αt xt+1 xt =ct LΨ1 (xt+1, yt+1) (xt, yt) + ct dt x αt xt+1 xt + 1g(xt, θt) 1g(xt, θ γ(xt, yt)) (ii) LΨ1 (xt+1, yt+1) (xt, yt) + ct dt x αt xt+1 xt + (Lgx,0 + Lgx,1M ρ) θt θ γ(xt, yt) , where the above inequality (i) holds by the Lemma A.6, and the above inequality (ii) is due to Assumption 5.3. Thus, we can obtain et x ct LΨ1 (xt+1, yt+1) (xt, yt) + ct dt x αt xt+1 xt + ct Lgx,0 + Lgx,1M ρ θt θ γ(xt, yt) . 
Similarity, based on Lemma A.6, we can obtain et y ct LΨ2 (xt+1, yt+1) (xt, yt) + ct dt y βt yt+1 yt + ct γ ( θt θ γ(xt, yt) + Lθ xt+1 xt ), where LΨ2 = 1 ct (Lfy,0 + Lfy,1 maxµ [0,1] 2f(xµ, yµ) ρ) + Lgy,0 + Lgy,1 maxµ [0,1] 2g(xµ, yµ) ρ + 1+Lθ Since Rt(x, y) = dist 0, f(x, y) + ct( g(x, y) vγ(x, y)) , we have Rt(xt+1, yt+1) ct LΨ (xt+1, yt+1) (xt, yt) + ct dt x αt + ct Lθ xt+1 xt + ct dt y βt yt+1 yt + ct(Lgx,0 + Lgx,1M ρ + 1 γ ) θt θ γ(xt, yt) . Given CR min 1f(xt,yt)+ 1g(xt,yt) , 2f(xt,yt)+ 2g(xt,yt) } ηκg2 max{ dtx , dty } , we have 1 c2 t Rt(xt+1, yt+1)2 ct 1f(xt, yt) + 1g(xt, yt) ) + βt ct 2f(xt, yt) + 2g(xt, yt) ) + ηκg2 θt θ γ(xt, yt) 2 . Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level By using the above inequality (51), then we have 1 c2 t Rt(xt+1, yt+1)2 < . (56) Since ct = c(t + 1) 1 4 with c > 0, we have 1/2 (T + 2) 1/2 1/2c2 . (57) Based on the above inequalities (56) and (57), we can obtain min 0 t T Rt(xt+1, yt+1) = O 1 T 1/4 Since {xt, yt}0 t T is a sequence generated from Algorithm 1, we can find m = max0 t T Ψct(xt, yt) = 1 ct f(xt, yt) + g(xt, yt) vγ(xt, yt), and f(xt, yt) f for any t. Then we have ct(g(xt, yt) vγ(xt, yt)) m f , t > 0. Based on ct = c(t + 1)1/4, we can obtain g(x T , y T ) vγ(x T , y T ) = O( 1 T 1/4 ). A.2. Stochastic Setting Lemma A.11. Given γ (0, 1 2κg2 ) and ηt 0, 1/γ κg2 (1/γ+Lgy,0+Lgy,1 maxµ (0,1) 2g(x,θ γ(x,y)µ) ρ)2 , the sequence {xt, yt, θt} generated by Algorithm 2 satisfies E[ θt+1 θ (xt, yt) 2|Ft] σ2 t θt θ γ(xt, yt) 2 + η2 t δ2 g Bt . (58) where σt = p 1 ηt(1/γ κg2) and θ γ(x, y) = arg minθ{Eζ O[G(x, θ; ζ)] + 1 2γ y θ 2}. Proof. 
By the update rule of θ in Algorithm 2 in line 3, we have θt+1 θ γ(xt, yt) 2 i=1 2G(xt, θt; ˆζt i) + 1 γ (θt yt)) θ γ(xt, yt) = θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt) ηt i=1 2G(xt, θt; ˆζt i) + 1 2g(xt, θt) + 1 = θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt) 2 + η2 t i=1 2G(xt, θt; ˆζt i) 2g(xt, θt) 2ηt θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt), 1 i=1 2G(xt, θt; ˆζt i) 2g(xt, θt) . Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level E[ θt+1 θ γ(xt, yt) 2|Ft] = θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt) 2 + η2 t E 1 Bt i=1 2G(xt, θt; ˆζt i) 2g(xt, θt) θt ηt( 2g(xt, θt) + 1 γ (θt yt)) θ γ(xt, yt) 2 + η2 t δ2 g Bt (1 ηt(1/γ κg2)) θt θ γ(xt, yt) 2 + η2 t δ2 g Bt , where the first equality and first inequality follows from Assumption 5.7 and the last inequality follows from Lemma A.7. Lemma A.12. Under Assumptions 5.1, 5.2, 5.3,and 5.7, suppose γ (0, 1 2κg2 ) the sequence of {xt, yt, θt} generated by Algorithm 2. We define a function Φct(x, y) = E[ 1 ct F(x, y; ξ) + G(x, y; ζ) vγ(x, y)] with ct+1 ct > 0 for all t 0, and δ = 1 c0 δ2 f + 2δ2 g, which satisfies (i) when ρ [0, 1), E[Φct(xt+1, yt+1)|Ft] Φct(xt, yt) + E αt dtx (Lgx,0 + Lgx,1M ρ) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 Ft ct 1f(xt, yt) + 1g(xt, yt) ) + 3 4αt 1g(xt, θt+1) + α2 t 2 K0 + κ + K1 + 2K2 + 2 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) ) + β2 t 2 (K3 + κ + K4 + 2K5 + 2) + δ 2Bt (αt + βt). (ii) when ρ = 1, E[Φct(xt+1, yt+1)|Ft] Φct(xt, yt) + E αt dtx (Lgx,0 + Lgx,1M) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 Ft 1 ct 1f(xt, yt) + 1g(xt, yt) + 3 4αt 1g(xt, θt+1) + α2 t 2 L0 + κ + L1 + 2 + 4βt γ2 dty 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (L2 + κ + L3 + 2) + δ(αt + βt) Proof. 
By the definition of function Φct(x, y), we have (i) when ρ [0, 1), E[Φct(xt+1, yt)|Ft] Φct(xt, yt) + 1Φct(xt, yt) dt x, xt+1 xt + dt x, xt+1 xt ct 1f(xt, yt) ρ + 1g(xt, yt) ρ) + 2K2 xt+1 xt ρ 1 ρ + κ xt+1 xt 2, (59) Considering the update rule for the variable x in Algorithm 2, it follows that dt x, xt+1 xt dt x dt x + dt x, xt+1 xt 4 dt x dt x 2 + xt+1 xt 2 dt x αt xt+1 xt 2, (60) Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level where the first inequality follows from Young s inequality and the update rule for variable x in Algorithm 2. And similar as (39), we have 1Φct(xt, yt) dt x, xt+1 xt dt x 4αt xt+1 xt 2 + αt dtx Lgx,0 + Lgx,1M ρ θ γ(xt, yt) θt+1 2, (61) where M = maxµ [0,1] 1g(xt, µθt+1 + (1 µ)θ γ(xt, yt)) . By combining (59) with (61), we have E[Φct(xt+1, yt)|Ft] Φct(xt, yt) + E αt Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 + 1 4 dt x dt x 2 Ft 4αt E[ dt x |Ft] + αt 3(K0 + κ)αt + 3( 1 ct 1f(xt, yt) + 1g(xt, yt) ) + 6K1αt + 6(K2 + 1)αt Φct(xt, yt) + E αt Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 + 1 4 dt x dt x 2 Ft 4αt dt x + αt 3(K0 + κ)αt + 3( 1 ct 1f(xt, yt) + 1g(xt, yt) ) + 6K1αt + 6(K2 + 1)αt Φct(xt, yt) + E αt Lgx,0 + Lgx,1M ρ θt+1 θ γ(xt, yt) 2 + 1 4 dt x dt x 2 Ft ct 1f(xt, yt) + 1g(xt, yt) ) + 3αt 4 1g(xt, θt+1) + α2 t 2 (K0 + κ + K1 + 2K2 + 2), (62) where the first inequality follows from the above Lemma A.3 and Assumption 5.7, the second inequality follows from Jensen s inequity E[ dt x |Ft] dt x and the last inequality follows from a b a b . Similarly, we have E[Φct(xt+1, yt+1)|Ft] Φct(xt+1, yt) + 2Φct(xt+1, yt), yt+1 yt + 1 ct 2f(xt+1, yt) ρ + 2g(xt+1, yt) ρ) + 2K5 yt+1 yt ρ 1 ρ + κ yt+1 yt 2 Φct(xt+1, yt) + E 2βt γ2 dty θt+1 θ γ(xt, yt) 2 + xt+1 xt 2 Ft E 3 dt y 4βt yt+1 yt 2 Ft ct 2f(xt+1, yt) ρ + 2g(xt+1, yt) ρ) + 2K5 yt+1 yt ρ 1 ρ + κ + 2 yt+1 yt 2 4 dt y dt y 2 Ft Φct(xt+1, yt) + E 2βt γ2 dty θt+1 θ γ(xt, yt) 2 + xt+1 xt 2 + 1 4 dt y dt y 2 Ft ct 2f(xt+1, yt) + 2g(xt+1, yt) ) + β2 t 2 (K3 + κ + K4 + 2K5 + 2). 
(63) Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level By combining the above inequalities (63) with (62), we can obtain E[Φct(xt+1, yt+1)|Ft] Φct(xt, yt) + E αt dtx (Lgx,0 + Lgx,1M ρ) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 Ft ct 1f(xt, yt) + 1g(xt, yt) ) + 3 4αt 1g(xt, θt+1) + α2 t 2 K0 + κ + K1 + 2K2 + 2 + 4βt γ2 dty ct 2f(xt+1, yt) + 2g(xt+1, yt) ) + β2 t 2 (K3 + κ + K4 + 2K5 + 2) + δ 2Bt (αt + βt). (64) (ii) When ρ = 1, similarly, we have E[Φct(xt+1, yt+1)|Ft] Φct(xt, yt) + E αt dtx (Lgx,0 + Lgx,1M) + 2βt γ2 dty θt+1 θ γ(xt, yt) 2 Ft 1 ct 1f(xt, yt) + 1g(xt, yt) + 3 4αt 1g(xt, θt+1) + α2 t 2 L0 + κ + L1 + 2 + 4βt γ2 dty 1 ct 2f(xt+1, yt) + 2g(xt+1, yt) + β2 t 2 (L2 + κ + L3 + 2) + δ 2Bt (αt + βt). (65) Γt = Lgx,0 + Lgx,1M ρ + 2 γ2 θt+1 θ γ(xt+1, yt+1) 2 + Φct(xt, yt). (66) Thus, we have the following Lemma. Lemma A.13. Under Assumptions 5.2, 5.3 and 5.7, suppose γ (0, 1 2κg2 ), ct+1 ct and ηt satisfies ηt > v u u t1+4(Lgx,0+Lgx,1M ρ+ 2 (Lgx,0+Lgx,1M ρ) αt dtx + βt γ2 dty 2κg2(Lgx,0+Lgx,1M ρ+ 2 γ2 ) , we have E[Γt+1 Γt|Ft] ct 1f(xt, yt) + 1g(xt, yt) ) βt ct 2f(xt, yt) + 2g(xt, yt) ) ηtκg2( θt θ γ(xt, yt) 2 + η2 t δ2 g Bt ) + 2δ Bt (αt + βt) (67) Generalized-Smooth Bilevel Optimization with Nonconvex Lower-Level Proof. By the definition of Γt. 
When $\rho \in [0,1)$, we have
$$
\begin{aligned}
\mathbb{E}[\Gamma_{t+1}-\Gamma_t \mid \mathcal{F}_t]
\le\;& \mathbb{E}\bigg[\Big(\frac{\alpha_t}{\|d^x_t\|}(L_{gx,0}+L_{gx,1}M^{\rho})+\frac{2\beta_t}{\gamma^2\|d^y_t\|}\Big)\|\theta_{t+1}-\theta^*_\gamma(x_t,y_t)\|^2 \\
&\quad -\frac{\alpha_t}{4}\Big(\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|\Big)+\frac{3}{4}\alpha_t\|\nabla_1 g(x_t,\theta_{t+1})\| \\
&\quad +\frac{\alpha_t^2}{2}\Big(K_0+\kappa+K_1+2K_2+2+\frac{4\beta_t}{\gamma^2\|d^y_t\|}\Big) \\
&\quad -\frac{\beta_t}{4}\Big(\frac{1}{c_t}\|\nabla_2 f(x_{t+1},y_t)\|+\|\nabla_2 g(x_{t+1},y_t)\|\Big)+\frac{\beta_t^2}{2}\big(K_3+\kappa+K_4+2K_5+2\big)+\frac{\delta}{2B_t}(\alpha_t+\beta_t) \\
&\quad +\Big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\Big)\Big(\mathbb{E}\big[\|\theta_{t+1}-\theta^*_\gamma(x_{t+1},y_{t+1})\|^2\mid\mathcal{F}_t\big]-\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2\Big)\,\bigg|\,\mathcal{F}_t\bigg]. \qquad (69)
\end{aligned}
$$
Next, we demonstrate that
$$
\begin{aligned}
&\mathbb{E}\Big[\|\theta_{t+1}-\theta^*_\gamma(x_{t+1},y_{t+1})\|^2-\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\alpha_t}{\|d^x_t\|}\|\theta_{t+1}-\theta^*_\gamma(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big] \\
&\le \mathbb{E}\Big[\Big(\Big(1+\nu_t+\frac{\alpha_t}{\|d^x_t\|}\Big)\sigma_t^2-1\Big)\Big(\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}\Big)+\Big(1+\frac{1}{\nu_t}\Big)L_\theta^2\|(x_{t+1},y_{t+1})-(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big],
\end{aligned}
$$
where the last inequality follows from Lemma A.11. Given $\nu_t=\eta_t\kappa_{g2}$, we have
$$
\begin{aligned}
&\mathbb{E}\Big[\|\theta_{t+1}-\theta^*_\gamma(x_{t+1},y_{t+1})\|^2-\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\alpha_t}{\|d^x_t\|}\|\theta_{t+1}-\theta^*_\gamma(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big] \\
&\le \mathbb{E}\Big[\Big(\eta_t^2\kappa_{g2}^2-\eta_t\kappa_{g2}+\frac{\alpha_t}{\|d^x_t\|}\Big)\Big(\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}\Big)+\Big(1+\frac{1}{\eta_t\kappa_{g2}}\Big)L_\theta^2\|(x_{t+1},y_{t+1})-(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big]. \qquad (70)
\end{aligned}
$$
Similarly, we have
$$
\begin{aligned}
&\mathbb{E}\Big[\|\theta_{t+1}-\theta^*_\gamma(x_{t+1},y_{t+1})\|^2-\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{2\beta_t}{\|d^y_t\|}\|\theta_{t+1}-\theta^*_\gamma(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big] \\
&\le \mathbb{E}\Big[\Big(\eta_t^2\kappa_{g2}^2-\eta_t\kappa_{g2}+\frac{2\beta_t}{\|d^y_t\|}\Big)\Big(\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}\Big)+\Big(1+\frac{1}{\eta_t\kappa_{g2}}\Big)L_\theta^2\|(x_{t+1},y_{t+1})-(x_t,y_t)\|^2\,\Big|\,\mathcal{F}_t\Big]. \qquad (71)
\end{aligned}
$$
Putting the above inequalities (70), (71) into (69), we have
$$
\begin{aligned}
\mathbb{E}[\Gamma_{t+1}-\Gamma_t\mid\mathcal{F}_t]
\le\;& \mathbb{E}\bigg[-\frac{\alpha_t}{4}\Big(\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|-3\|\nabla_1 g(x_t,\theta_{t+1})\| \\
&\qquad -2\alpha_t\Big(K_0+\kappa+K_1+2K_2+2+\frac{4\beta_t}{\gamma^2\|d^y_t\|}\Big)-4\alpha_t\Big(1+\frac{1}{\eta_t\kappa_{g2}}\Big)L_\theta^2\Big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\Big)\Big) \\
&\quad -\frac{\beta_t}{4}\Big(\frac{1}{c_t}\|\nabla_2 f(x_{t+1},y_t)\|+\|\nabla_2 g(x_{t+1},y_t)\|-2\beta_t\big(K_3+\kappa+K_4+2K_5+2\big) \\
&\qquad -4\beta_t\Big(1+\frac{1}{\eta_t\kappa_{g2}}\Big)L_\theta^2\Big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\Big)\Big) \\
&\quad +\Big(\Big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\Big)\big(\eta_t^2\kappa_{g2}^2-\eta_t\kappa_{g2}\big)+(L_{gx,0}+L_{gx,1}M^{\rho})\frac{\alpha_t}{\|d^x_t\|}+\frac{\beta_t}{\gamma^2\|d^y_t\|}\Big)\Big(\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}\Big)\,\bigg|\,\mathcal{F}_t\bigg]+\frac{\delta}{2B_t}(\alpha_t+\beta_t).
\end{aligned}
$$
Given
$$
0<\alpha_t\le\frac{\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|}{4(K_0+\kappa+K_1+2K_2+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2\big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\big)+6\|\nabla_1 g(x_t,\theta_{t+1})\|},
$$
$$
0<\beta_t\le\frac{\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|}{4(K_3+\kappa+K_4+2K_5+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2\big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\big)},
$$
and
$$
\eta_t>\sqrt{\frac{1+4\big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\big)\Big((L_{gx,0}+L_{gx,1}M^{\rho})\frac{\alpha_t}{\|d^x_t\|}+\frac{\beta_t}{\gamma^2\|d^y_t\|}\Big)}{2\kappa_{g2}\big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}\big)}},
$$
we can obtain
$$
\mathbb{E}[\Gamma_{t+1}-\Gamma_t\mid\mathcal{F}_t]\le-\frac{\alpha_t}{4}\Big(\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|\Big)-\frac{\beta_t}{4}\Big(\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|\Big)-\eta_t\kappa_{g2}\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}+2\delta(\alpha_t+\beta_t).
$$
When $\rho=1$, according to Lemma A.12, the same argument goes through with $M^{\rho}$ replaced by $M$, and with the constants $(K_0+\kappa+K_1+2K_2+2)$ and $(K_3+\kappa+K_4+2K_5+2)$ replaced by $(L_0+\kappa+L_1+2)$ and $(L_2+\kappa+L_3+2)$, respectively, where the last inequality again follows from Young's inequality with parameter $\eta_t\kappa_{g2}$ and rearrangement. Given
$$
0<\alpha_t\le\frac{\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|}{4(L_0+\kappa+L_1+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2\big(L_{gx,0}+L_{gx,1}M+\frac{2}{\gamma^2}\big)+6\|\nabla_1 g(x_t,\theta_{t+1})\|},
$$
$$
0<\beta_t\le\frac{\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|}{4(L_2+\kappa+L_3+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2\big(L_{gx,0}+L_{gx,1}M+\frac{2}{\gamma^2}\big)},
$$
and
$$
\eta_t>\sqrt{\frac{1+4\big(L_{gx,0}+L_{gx,1}M+\frac{2}{\gamma^2}\big)\Big((L_{gx,0}+L_{gx,1}M)\frac{\alpha_t}{\|d^x_t\|}+\frac{\beta_t}{\gamma^2\|d^y_t\|}\Big)}{2\kappa_{g2}\big(L_{gx,0}+L_{gx,1}M+\frac{2}{\gamma^2}\big)}},
$$
we can also obtain
$$
\mathbb{E}[\Gamma_{t+1}-\Gamma_t\mid\mathcal{F}_t]\le-\frac{\alpha_t}{4}\Big(\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|\Big)-\frac{\beta_t}{4}\Big(\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|\Big)-\eta_t\kappa_{g2}\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2+\frac{\eta_t^2\delta_g^2}{B_t}+2\delta(\alpha_t+\beta_t).
$$
Theorem A.14.
(Restatement of Theorem 5.8) Under Assumptions 5.1, 5.2, 5.3, 5.4 and 5.7, given $\gamma\in(0,\frac{1}{2\kappa_{g2}})$, $c_t=c(t+1)^{1/4}$ with $c>0$, and $B_t=O(T^{1/2})$; when $\rho\in[0,1)$, let
$$
0<\alpha_t\le\frac{\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|}{4(K_0+\kappa+K_1+2K_2+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C+6\|\nabla_1 g(x_t,\theta_{t+1})\|},\qquad
0<\beta_t\le\frac{\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|}{4(K_3+\kappa+K_4+2K_5+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};
$$
when $\rho=1$, let
$$
0<\alpha_t\le\frac{\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|}{4(L_0+\kappa+L_1+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C+6\|\nabla_1 g(x_t,\theta_{t+1})\|},\qquad
0<\beta_t\le\frac{\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|}{4(L_2+\kappa+L_3+2)+8\big(1+\frac{1}{\eta_t\kappa_{g2}}\big)L_\theta^2 C};
$$
and let
$$
\eta_t>\sqrt{\frac{1+4C\Big((L_{gx,0}+L_{gx,1}M^{\rho})\frac{\alpha_t}{\|d^x_t\|}+\frac{\beta_t}{\gamma^2\|d^y_t\|}\Big)}{2\kappa_{g2}C}},\qquad
\kappa_{g2}\le\frac{1/\gamma}{\big(1/\gamma+L_{gy,0}+L_{gy,1}\max_{\mu\in(0,1)}\|\nabla_2 g(x,\theta^*_\gamma(x,y)_\mu)\|^{\rho}\big)^2},
$$
where $C=L_{gx,0}+L_{gx,1}M^{\rho}+\frac{2}{\gamma^2}$. Then the sequence $\{(x_t,y_t,\theta_t)\}$ generated by Algorithm 2 satisfies
$$
\min_{0\le t\le T}\mathbb{E}\big[\|\theta_t-\theta^*_\gamma(x_t,y_t)\|\big]=O\Big(\frac{1}{T^{1/2}}\Big),\qquad
\min_{0\le t\le T}\mathbb{E}\big[R_t(x_{t+1},y_{t+1})\big]=O\Big(\frac{1}{T^{1/6}}\Big),\qquad
\mathbb{E}\big[g(x_T,y_T)-v_\gamma(x_T,y_T)\big]=O\Big(\frac{1}{T^{1/4}}\Big).
$$

Proof. By using the above Lemma A.13, telescoping the above inequality (67) over the range $t=0,1,\dots,T-1$, setting $\eta=\min_{0\le t\le T-1}\eta_t$, and taking the total expectation, we have
$$
\begin{aligned}
\mathbb{E}[\Gamma_T]-\mathbb{E}[\Gamma_0]
&\le-\sum_{t=0}^{T-1}\frac{\alpha_t}{4}\,\mathbb{E}\Big[\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|\Big]-\sum_{t=0}^{T-1}\frac{\beta_t}{4}\,\mathbb{E}\Big[\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|\Big] \\
&\quad-\eta\kappa_{g2}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2\big]+\sum_{t=0}^{T-1}\Big(\frac{\eta^2\delta_g^2}{B_t}+2\delta(\alpha_t+\beta_t)\Big).
\end{aligned}
$$
It follows that
$$
\sum_{t=0}^{T-1}\mathbb{E}\big[\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2\big]\le\frac{1}{\eta\kappa_{g2}}\bigg(\mathbb{E}[\Gamma_0]-\mathbb{E}[\Gamma_T]+\sum_{t=0}^{T-1}\Big(\frac{\eta^2\delta_g^2}{B_t}+2\delta(\alpha_t+\beta_t)\Big)\bigg),
$$
and letting $B_t=O(T^{1/2})$ gives $\frac{1}{B_t}=O(T^{-1/2})$. Consequently,
$$
\min_{0\le t\le T}\mathbb{E}\big[\|\theta_t-\theta^*_\gamma(x_t,y_t)\|\big]=O\Big(\frac{1}{T^{1/2}}\Big).
$$
According to the update rule of the variables $(x,y)$ in Algorithm 2, we have
$$
\frac{c_t d^x_t}{\|d^x_t\|}+\frac{c_t}{\alpha_t}(x_{t+1}-x_t)=0,\qquad(75)
$$
$$
\frac{c_t d^y_t}{\|d^y_t\|}+\frac{c_t}{\beta_t}(y_{t+1}-y_t)=0.\qquad(76)
$$
Let $(e^x_t,e^y_t)\in\nabla f(x_{t+1},y_{t+1})+c_t\big(\nabla g(x_{t+1},y_{t+1})-\nabla v_\gamma(x_{t+1},y_{t+1})\big)$, where
$$
e^x_t:=c_t\nabla_1\Phi_{c_t}(x_{t+1},y_{t+1})-c_t d^x_t-\frac{c_t\|d^x_t\|}{\alpha_t}(x_{t+1}-x_t),\qquad(77)
$$
$$
e^y_t:=c_t\nabla_2\Phi_{c_t}(x_{t+1},y_{t+1})-c_t d^y_t-\frac{c_t\|d^y_t\|}{\beta_t}(y_{t+1}-y_t).\qquad(78)
$$
Considering the term $\|e^x_t\|$, we have
$$
\begin{aligned}
\mathbb{E}[\|e^x_t\|\mid\mathcal{F}_t]
&\le\mathbb{E}\Big[c_t\|\nabla_1\Phi_{c_t}(x_{t+1},y_{t+1})-\nabla_1\Phi_{c_t}(x_t,y_t)\|+c_t\|\nabla_1\Phi_{c_t}(x_t,y_t)-d^x_t\|+\frac{c_t\|d^x_t\|}{\alpha_t}\|x_{t+1}-x_t\|\,\Big|\,\mathcal{F}_t\Big] \\
&\le\mathbb{E}\Big[c_tL_{\Phi_1}\|(x_{t+1},y_{t+1})-(x_t,y_t)\|+\frac{c_t\|d^x_t\|}{\alpha_t}\|x_{t+1}-x_t\|+c_t\|\nabla_1 g(x_t,\theta_t)-\nabla_1 g(x_t,\theta^*_\gamma(x_t,y_t))\|\,\Big|\,\mathcal{F}_t\Big]+\frac{c_t\delta}{B_t} \\
&\le\mathbb{E}\Big[c_tL_{\Phi_1}\|(x_{t+1},y_{t+1})-(x_t,y_t)\|+\frac{c_t\|d^x_t\|}{\alpha_t}\|x_{t+1}-x_t\|+c_t(L_{gx,0}+L_{gx,1}M^{\rho})\|\theta_t-\theta^*_\gamma(x_t,y_t)\|\,\Big|\,\mathcal{F}_t\Big]+\frac{c_t\delta}{B_t}.
\end{aligned}
$$
Similarly, based on Lemma A.6, we can obtain
$$
\mathbb{E}[\|e^y_t\|\mid\mathcal{F}_t]\le\mathbb{E}\Big[c_tL_{\Phi_2}\|(x_{t+1},y_{t+1})-(x_t,y_t)\|+\frac{c_t\|d^y_t\|}{\beta_t}\|y_{t+1}-y_t\|+\frac{c_t}{\gamma}\big(\|\theta_t-\theta^*_\gamma(x_t,y_t)\|+L_\theta\|x_{t+1}-x_t\|\big)\,\Big|\,\mathcal{F}_t\Big]+\frac{c_t\delta}{B_t},
$$
where $L_{\Phi_2}=\frac{1}{c_t}\big(L_{fy,0}+L_{fy,1}\max_{\mu\in[0,1]}\|\nabla_2 f(x_\mu,y_\mu)\|^{\rho}\big)+L_{gy,0}+L_{gy,1}\max_{\mu\in[0,1]}\|\nabla_2 g(x_\mu,y_\mu)\|^{\rho}+\frac{1+L_\theta}{\gamma}$. Since $R_t(x,y)=\mathrm{dist}\big(0,\nabla f(x,y)+c_t(\nabla g(x,y)-\nabla v_\gamma(x,y))\big)$, we have
$$
\begin{aligned}
\mathbb{E}[R_t(x_{t+1},y_{t+1})]\le\;&\mathbb{E}\Big[c_tL_{\Phi}\|(x_{t+1},y_{t+1})-(x_t,y_t)\|+\Big(\frac{c_t\|d^x_t\|}{\alpha_t}+\frac{c_tL_\theta}{\gamma}\Big)\|x_{t+1}-x_t\|+\frac{c_t\|d^y_t\|}{\beta_t}\|y_{t+1}-y_t\| \\
&\quad+c_t\Big(L_{gx,0}+L_{gx,1}M^{\rho}+\frac{1}{\gamma}\Big)\|\theta_t-\theta^*_\gamma(x_t,y_t)\|\Big]+\frac{2c_t\delta}{B_t}.
\end{aligned}
$$
Given a constant $C_R$ satisfying
$$
C_R\ge\frac{\max\{\|d^x_t\|,\|d^y_t\|\}}{\min\Big\{\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|,\ \frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|,\ \eta\kappa_{g2}\Big\}},
$$
we have
$$
\frac{1}{c_t^2}\mathbb{E}\big[R_t(x_{t+1},y_{t+1})^2\big]\le C_R\bigg(\frac{\alpha_t}{4}\Big(\frac{1}{c_t}\|\nabla_1 f(x_t,y_t)\|+\|\nabla_1 g(x_t,y_t)\|\Big)+\frac{\beta_t}{4}\Big(\frac{1}{c_t}\|\nabla_2 f(x_t,y_t)\|+\|\nabla_2 g(x_t,y_t)\|\Big)+\eta\kappa_{g2}\|\theta_t-\theta^*_\gamma(x_t,y_t)\|^2\bigg).
$$
By using the above inequality (74), we then have
$$
\frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{c_t^2}\mathbb{E}\big[R_t(x_{t+1},y_{t+1})^2\big]\le\frac{C_R}{T}\bigg(\mathbb{E}[\Gamma_0]-\mathbb{E}[\Gamma_T]+\sum_{t=0}^{T-1}\frac{2\delta}{B_t}(2+\alpha_t+\beta_t)\bigg)\le\frac{C_R}{T}\bigg(\mathbb{E}[\Gamma_0]-\mathbb{E}[\Gamma_{\inf}]+(2+l_2)\sum_{t=0}^{T-1}\frac{2\delta}{B_t}\bigg),\qquad(79)
$$
where $\mathbb{E}[\Gamma_{\inf}]$ is a lower bound of $\mathbb{E}[\Gamma_t]$, and $l_2$ is a positive constant upper-bounding $\alpha_t+\beta_t$ over $t\in\{0,\dots,T\}$. Based on the above inequality (79), letting $c_t=c(t+1)^{1/4}$, we can obtain
$$
\min_{0\le t\le T}\mathbb{E}\big[R_t(x_{t+1},y_{t+1})\big]=O\Big(\frac{1}{T^{1/6}}\Big).
$$
Since $\{x_t,y_t\}_{0\le t\le T}$ is a sequence generated by Algorithm 2, we can also find $m=\max_{0\le t\le T}\mathbb{E}\big[c_t\Phi_{c_t}(x_t,y_t)\big]$, where $\Phi_{c_t}(x_t,y_t)=\frac{1}{c_t}f(x_t,y_t)+g(x_t,y_t)-v_\gamma(x_t,y_t)$, and $f(x_t,y_t)\ge f^*$ for any $t$. Then we have
$$
c_t\,\mathbb{E}\big[g(x_t,y_t)-v_\gamma(x_t,y_t)\big]\le m-f^*,\qquad\forall t>0.
$$
Based on $c_t=c(t+1)^{1/4}$, we can obtain $\mathbb{E}\big[g(x_T,y_T)-v_\gamma(x_T,y_T)\big]\le\frac{m-f^*}{c(T+1)^{1/4}}=O\big(\frac{1}{T^{1/4}}\big)$.
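To make the quantities in the proof concrete, the following minimal sketch illustrates the two building blocks the analysis relies on: the normalized update solved from the optimality conditions (75)-(76), and an inner gradient step whose iterates track the proximal point $\theta^*_\gamma(x,y)$ of the Moreau envelope $v_\gamma$. This is a toy illustration, not the paper's Algorithm 2: the quadratic `g`, the direction `d_x`, and all step sizes below are hypothetical stand-ins.

```python
import numpy as np

def normalized_step(z, d, step):
    # Solves  c_t * d / ||d|| + (c_t / step) * (z_next - z) = 0  for z_next,
    # i.e. z_next = z - step * d / ||d||  (the penalty factor c_t cancels).
    # This is the update behind (75) (z = x_t, step = alpha_t) and (76).
    return z - step * d / np.linalg.norm(d)

def prox_theta_step(theta, grad_g_theta, y, eta, gamma):
    # One gradient step on the strongly convex inner problem
    #   min_theta  g(x, theta) + ||theta - y||^2 / (2 * gamma),
    # whose minimizer theta*_gamma(x, y) is the proximal point tracked by theta_t.
    return theta - eta * (grad_g_theta(theta) + (theta - y) / gamma)

# Hypothetical quadratic lower-level slice: g(x, theta) = 0.5 * ||theta - x||^2.
x = np.array([1.0, -2.0])
y = np.zeros(2)
gamma, eta = 0.5, 0.2

theta = np.zeros(2)
for _ in range(200):
    theta = prox_theta_step(theta, lambda th: th - x, y, eta, gamma)
# For this quadratic g there is a closed form: theta*_gamma(x, y) = (gamma*x + y) / (gamma + 1),
# and the iterates above contract toward it at a linear rate.
theta_star = (gamma * x + y) / (gamma + 1)

d_x = np.array([3.0, 4.0, 0.0])  # stand-in for the gradient estimator d_t^x
x_next = normalized_step(np.zeros(3), d_x, step=0.1)
# ||x_{t+1} - x_t|| equals alpha_t regardless of how large ||d_t^x|| is.
```

The second property is the point of the normalized updates: the per-step movement is exactly $\alpha_t$ (resp. $\beta_t$) no matter how large the gradient estimate is, which is what lets the analysis control $\|(x_{t+1},y_{t+1})-(x_t,y_t)\|$ deterministically under generalized (possibly unbounded) smoothness.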