Journal of Machine Learning Research 26 (2025) 1-61. Submitted 2/22; Revised 1/25; Published 1/25.

Efficiently Escaping Saddle Points in Bilevel Optimization

Minhui Huang, mhhuang@ucdavis.edu, Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA
Xuxing Chen, xuxchen@ucdavis.edu, Department of Mathematics, University of California, Davis, CA 95616, USA
Kaiyi Ji, kaiyiji@buffalo.edu, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA
Shiqian Ma, sqma@rice.edu, Department of Computational Applied Mathematics and Operations Research, Rice University, Houston, TX 77005, USA
Lifeng Lai, lflai@ucdavis.edu, Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA

Editor: Silvia Villa

Abstract

Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding first-order stationary points in the nonconvex-strongly-convex case. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that perturbed approximate implicit differentiation (AID) with a warm-start strategy finds an ϵ-approximate local minimum of bilevel optimization in Õ(ϵ⁻²) iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle points and find local minima of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of the perturbed multi-step gradient descent ascent (GDmax) algorithm, which converges to a local minimax point of minimax problems.

Keywords: bilevel optimization, minimax problem, local minimax point, saddle point, inexact NEON

©2025 Minhui Huang, Xuxing Chen, Kaiyi Ji, Shiqian Ma and Lifeng Lai.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0136.html.

1. Introduction

Bilevel optimization has become a powerful tool in various machine learning fields including reinforcement learning (Hong et al., 2020), hyperparameter optimization (Franceschi et al., 2018; Feurer and Hutter, 2019), meta-learning (Franceschi et al., 2018; Ji et al., 2020) and signal processing (Kunapuli et al., 2008). A general formulation of the bilevel optimization problem can be written as

min_{x∈R^d} Φ(x) := f(x, y*(x)), s.t. y*(x) = argmin_{y∈R^n} g(x, y). (1.1)

In this paper, we focus on the nonconvex-strongly-convex case, where the lower-level function g(x, y) is smooth and strongly convex with respect to y and the overall objective function Φ(x) is smooth but possibly nonconvex. One crucial but challenging task in bilevel optimization is the computation of the hypergradient ∇Φ(x), which, via the chain rule, can be written as

∇Φ(x) = ∇_x f(x, y*(x)) + (∂y*(x)/∂x) ∇_y f(x, y*(x)), (1.2)

where ∂y*(x)/∂x ∈ R^{d×n}. Note that the differentiability of y*(x) is a direct result of the Implicit Function Theorem, as mentioned in Lemma 2.1 of Ghadimi and Wang (2018). By taking the derivative with respect to x of the optimality condition ∇_y g(x, y*(x)) = 0, we obtain the relation

∇²_{xy} g(x, y*(x)) + (∂y*(x)/∂x) ∇²_{yy} g(x, y*(x)) = 0, (1.3)

which implies

∂y*(x)/∂x = −∇²_{xy} g(x, y*(x)) [∇²_{yy} g(x, y*(x))]⁻¹. (1.4)

Substituting (1.4) into (1.2), we get

∇Φ(x) = ∇_x f(x, y*(x)) − ∇²_{xy} g(x, y*(x)) [∇²_{yy} g(x, y*(x))]⁻¹ ∇_y f(x, y*(x)). (1.5)

Note that the above hypergradient ∇Φ(x) involves computationally intractable components such as the exact solution y*(x) and the Hessian inverse [∇²_{yy} g(x, y*(x))]⁻¹.
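As a sanity check, formula (1.5) can be verified numerically on a toy quadratic bilevel instance where y*(x) and Φ(x) are available in closed form. The instance below (matrices A, b and the regularizer c) is purely illustrative and not from the paper.

```python
import numpy as np

# Toy instance: g(x, y) = 0.5*||y - A x||^2  (strongly convex in y, so y*(x) = A x)
#               f(x, y) = 0.5*||y - b||^2 + 0.5*c*||x||^2
# Closed form:  Phi(x)  = 0.5*||A x - b||^2 + 0.5*c*||x||^2
rng = np.random.default_rng(0)
d, n, c = 3, 4, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)

y_star = A @ x                 # exact lower-level solution
grad_x_f = c * x               # ∇_x f at (x, y*(x))
grad_y_f = y_star - b          # ∇_y f at (x, y*(x))
hess_xy_g = -A.T               # ∇²_{xy} g ∈ R^{d×n}
hess_yy_g = np.eye(n)          # ∇²_{yy} g

# Hypergradient via formula (1.5)
hypergrad = grad_x_f - hess_xy_g @ np.linalg.solve(hess_yy_g, grad_y_f)

# Gradient of the closed-form Phi(x), for comparison
grad_phi = A.T @ (A @ x - b) + c * x
print(np.allclose(hypergrad, grad_phi))  # True
```

The two quantities agree to machine precision, confirming that (1.5) reproduces the gradient of the closed-form Φ on this instance.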
To address such difficulties, various computing approaches have been proposed, including the popular Approximate Implicit Differentiation (AID) (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Ghadimi and Wang, 2018; Grazzi et al., 2020; Ji et al., 2021) and Iterative Differentiation (ITD) (Domke, 2012; Maclaurin et al., 2015; Shaban et al., 2019; Grazzi et al., 2020; Ji et al., 2021). Among them, Ghadimi and Wang (2018) and Ji et al. (2021) further analyze the computational complexities of these two types of approaches in finding a stationary point. Besides these nested-loop approaches, Hong et al. (2020) and Chen et al. (2022) propose single-loop algorithms with convergence analysis to stationary points. However, it remains unknown how to provably find a local minimum in bilevel optimization. This type of study is important, as it has been widely shown that saddle points (which are also stationary points) can seriously undermine the quality of solutions (Choromanska et al., 2015; Dauphin et al., 2014). To address this issue, this paper focuses on escaping saddle points in bilevel optimization. We are interested in finding an approximate local minimum of Φ(x), defined as follows.

Definition 1 (ϵ-local minimum) We say x is an ϵ-local minimum of bilevel optimization (1.1) if

‖∇Φ(x)‖ ≤ ϵ, λmin(∇²Φ(x)) ≥ −√(ρφ ϵ), (1.6)

where λmin(Z) denotes the minimum eigenvalue of a matrix Z and ρφ is the Lipschitz constant of ∇²Φ(x), i.e.,

‖∇²Φ(x) − ∇²Φ(x′)‖ ≤ ρφ ‖x − x′‖, ∀ x, x′ ∈ R^d. (1.7)

For completeness, we also define stationary points and saddle points as follows.

Definition 2 We say x is an ϵ-stationary point of bilevel optimization (1.1) if ‖∇Φ(x)‖ ≤ ϵ. We say x is a saddle point of bilevel optimization (1.1) if ∇Φ(x) = 0 and x is neither a local maximum point nor a local minimum point.
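Given access to the gradient and Hessian of Φ, the two conditions in Definition 1 are straightforward to check numerically. A minimal sketch (the helper name and the test inputs are illustrative, not from the paper):

```python
import numpy as np

def is_eps_local_min(grad, hess, eps, rho):
    """Check Definition 1: ||grad|| <= eps and lambda_min(hess) >= -sqrt(rho*eps)."""
    lam_min = np.linalg.eigvalsh(hess).min()
    return np.linalg.norm(grad) <= eps and lam_min >= -np.sqrt(rho * eps)

# A near-stationary point with mildly negative curvature passes ...
grad = np.array([1e-4, 0.0])
hess = np.diag([1.0, -1e-3])
print(is_eps_local_min(grad, hess, eps=1e-3, rho=1.0))   # True: -1e-3 >= -sqrt(1e-3)

# ... while a genuine saddle direction fails the curvature test.
print(is_eps_local_min(grad, np.diag([1.0, -0.5]), eps=1e-3, rho=1.0))  # False
```

Note the asymmetry in the two thresholds: the curvature tolerance scales with √(ρφ ϵ), so small negative eigenvalues of order o(√ϵ) are still accepted.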
Motivated by the recent demand for solving online or large-scale bilevel optimization problems, we also generalize our technique to the following stochastic bilevel optimization:

min_{x∈R^d} Φ(x) = f(x, y*(x)) = E_ξ[F(x, y*(x); ξ)], s.t. y*(x) = argmin_{y∈R^n} g(x, y) = E_ζ[G(x, y; ζ)], (1.8)

where f(x, y) and g(x, y) take the expectation form with respect to the random variables ξ and ζ. There is a line of work studying stochastic bilevel algorithms that converge to stationary points (Ghadimi and Wang, 2018; Ji et al., 2021; Hong et al., 2020). Compared with these results, we are interested in providing new stochastic algorithms that provably converge to a local minimum.

1.1 Our Contributions

In this paper, we derive a framework of adding perturbations to the gradient sequence of bilevel optimization and design various new bilevel algorithms that provably escape saddle points and find local minima. Our approach is mostly inspired by existing works for nonconvex minimization and minimax problems (Jin et al., 2021; Xu et al., 2018; Allen-Zhu and Li, 2018). Our main contributions are summarized below.

(i) For deterministic bilevel optimization, we propose the perturbed AID with warm-start strategy. We prove that the proposed algorithm achieves an ϵ-local minimum of Φ(x) in at most Õ(ϵ⁻²) iterations. Here the notation Õ(·) hides logarithmic terms and absolute constants.

(ii) For the minimax problem, which is a special case of bilevel optimization, we prove that a strict local minimum of Φ(x) is equivalent to a strict local minimax point (Jin et al., 2020), and we propose the perturbed GDmax algorithm with a nonasymptotic convergence rate to a local minimax point. To the best of our knowledge, this is the first nonasymptotic analysis of gradient algorithms escaping saddle points in minimax problems.
(iii) For stochastic bilevel optimization, we propose the inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), a deterministic algorithm that extracts a negative curvature descent direction with high probability. Combining iNEON with stocBiO (Ji et al., 2021), we obtain a stochastic first-order algorithm with a gradient complexity of Õ(ϵ⁻⁴). To the best of our knowledge, our algorithms perturbed AID and stocBiO+iNEON are the first ones that provably converge to a local minimum of bilevel optimization.

1.2 Related Work

Escaping Saddle Points. Most existing works on finding local minima focus on classical optimization problems (i.e., minimization problems) and derive the complexity of reaching an ϵ-local minimum. Nesterov and Polyak (2006) and Curtis et al. (2017) proposed second-order methods for obtaining an ϵ-local minimum. To avoid the Hessian computation required in Nesterov and Polyak (2006) and Curtis et al. (2017), Carmon et al. (2018) and Agarwal et al. (2017) proposed to use Hessian-vector products and achieved a convergence rate of Õ(ϵ^{−7/4}). Recently, the complexity of pure first-order methods for obtaining a local minimum has been studied (see, e.g., Ge et al. (2015); Daneshmand et al. (2018); Jin et al. (2021); Fang et al. (2019)). Lee et al. (2016) provided asymptotic results showing that the gradient descent (GD) method converges to a local minimizer almost surely. Jin et al. (2017, 2021) proved that perturbed GD can converge to a local minimizer in a number of iterations that depends poly-logarithmically on the dimension, reaching a nonasymptotic iteration complexity of O(ϵ^{−2} log⁴(d)) for nonconvex minimization. For stochastic optimization problems, Jin et al. (2021), Jin et al. (2018), and Fang et al. (2019) provided nonasymptotic rates for finding local minimizers. How to escape saddle points in constrained problems and nonsmooth problems has also been studied in the literature. In particular, Lu et al.
(2020a), Criscitiello and Boumal (2019), and Sun et al. (2019) studied escaping saddle points in constrained optimization. Davis and Drusvyatskiy (2021), Davis et al. (2021), and Huang (2021) studied escaping saddle points in nonsmooth problems. All these algorithms are for solving minimization problems, and to the best of our knowledge, how to escape saddle points in bilevel optimization has not been addressed in the literature.

Minimax Optimization. Motivated by its applications in adversarial learning (Goodfellow et al., 2015; Sinha et al., 2018), training Generative Adversarial Nets (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017) and optimal transport (Lin et al., 2020a; Huang et al., 2021a,b,c), the convergence theory of nonconvex minimax problems has been extensively studied in the literature. Specifically, Nouiehed et al. (2019) and Jin et al. (2020) studied the complexity of multi-step gradient descent ascent (GDmax). Lin et al. (2020b) and Lu et al. (2020b) provided the first convergence analysis for the single-loop gradient descent ascent (GDA) algorithm. More recently, Luo et al. (2020) applied the stochastic variance reduction technique to the nonconvex-strongly-concave case and achieved the best known stochastic gradient complexity. Zhang et al. (2020) proposed smoothed GDA, which stabilizes the GDA algorithm and achieves a better complexity in the nonconvex-concave case. However, all the previous works targeted finding a stationary point of Φ(x). Very recently, Chen et al. (2021b) and Luo and Chen (2021) proposed cubic regularized GDA, a second-order algorithm that provably converges to a local minimum. Fiez et al. (2021) provided asymptotic results showing that GDA converges to a local minimax point almost surely. To the best of our knowledge, the convergence rate of first-order methods for obtaining a local minimax point has been missing in the literature.

Bilevel Optimization.
Bilevel optimization has a long history dating back to Bracken and McGill (1973). Recently, bilevel programming has been successfully applied to meta-learning (Snell et al., 2017; Rajeswaran et al., 2019; Franceschi et al., 2018; Ji et al., 2022) and hyperparameter optimization (Pedregosa, 2016; Franceschi et al., 2018; Shaban et al., 2019; Sow et al., 2021). Theoretically, Ghadimi and Wang (2018) provided the first convergence rate for the AID approach. Ji et al. (2021) further improved the complexity dependence on the condition number and analyzed the convergence of the ITD approach. Both AID and ITD have an iteration complexity of O(ϵ⁻²). Ji and Liang (2021) provided lower bounds for a class of AID- and ITD-based bilevel algorithms. For stochastic bilevel problems, Ghadimi and Wang (2018) and Ji et al. (2021) proposed the Bilevel Stochastic Approximation (BSA) and stochastic bilevel optimizer (stocBiO) methods, respectively, which are both double-loop algorithms inspired by AID. Hong et al. (2020) proposed a two-timescale framework for bilevel optimization (TTSA), a provable single-loop algorithm that updates the two variables in an alternating way with a convergence rate of O(ϵ⁻⁵). Chen et al. (2021a) proposed ALternating Stochastic gradient dEscenT (ALSET), a simple SGD-type approach, and improved the convergence rate to O(ϵ⁻⁴). Very recently, Khanduri et al. (2021), Yang et al. (2021), Chen et al. (2022), and Guo and Yang (2021) studied stochastic algorithms with variance reduction and momentum techniques, and provided the cutting-edge first-order oracle complexity of O(ϵ⁻³). All these previous analyses have focused on finding stationary points, and an algorithm for finding a local minimum is still missing. It is unclear whether the stationary points obtained by existing bilevel optimization algorithms are local minima or saddle points.
In other words, finding an ϵ-local minimum, i.e., escaping saddle points in bilevel optimization, has received little attention. Our work proposes a new algorithm with guaranteed convergence to an ϵ-local minimum of the bilevel optimization problem, which is much more challenging than most prior works because converging to an ϵ-local minimum is much stronger than merely converging to a stationary point. Furthermore, we note that existing works on escaping saddle points only consider single-level optimization problems. The task is more challenging in the bilevel setting: the hypergradient of the bilevel problem, i.e., ∇Φ(x) in (1.5), involves the first-order information of the upper-level problem and the second-order information of the lower-level problem. This brings additional challenges to the design of our algorithm as well as to the convergence analysis, as we need to handle the convergence error of the lower-level problem as well as the hypergradient estimation error.

Notation. Let Φ̂(x), ∇̂Φ(x), ∇̂²Φ(x) be the inexact function value, gradient and Hessian, respectively. Denote by G(f, ϵ), JV(f, ϵ), HV(f, ϵ) the complexity of gradient evaluations, Jacobian-vector product evaluations, and Hessian-vector product evaluations of a function f, respectively. In particular, for matrix-vector product oracles, say Hessian-vector products, HV(f, ϵ) represents the total number of (deterministic or stochastic) (∇²f)·v computations in our algorithm. Typically, computing a Hessian-vector product is as cheap as computing a gradient (Pearlmutter, 1994). Let κ be the condition number of the lower-level problem. We use the notation O(·) to hide only absolute constants which do not depend on any problem parameters, and Õ(·) to further hide additional log factors.

2. Escape Saddle Points in General Bilevel Optimization

In this section, we propose novel algorithms for general bilevel optimization (1.1) that are guaranteed to converge to a local minimum.
We consider one of the popular approaches, AID, to estimate the hypergradient ∇Φ(x). The AID approach is a nested-loop algorithm, which first updates the lower-level variable y with D steps of gradient descent and then constructs an estimate of the upper-level hypergradient. To efficiently approximate the Hessian inverse in the hypergradient (1.5), AID solves the linear system

∇²_{yy} g(x_k, y^D_k) v = ∇_y f(x_k, y^D_k) (2.1)

using N steps of the conjugate gradient (CG) method. The resulting vector v^N_k is used as an approximation to the solution of (2.1): [∇²_{yy} g(x_k, y^D_k)]⁻¹ ∇_y f(x_k, y^D_k). The hypergradient is then constructed as

∇̂Φ(x_k) = ∇_x f(x_k, y^D_k) − ∇²_{xy} g(x_k, y^D_k) v^N_k. (2.2)

However, current AID-based approaches can only guarantee convergence to a first-order stationary point. In Algorithm 1, we propose perturbed AID (i.e., Algorithm 1 with option AID in step 9) for solving bilevel optimization (1.1) with a convergence guarantee to second-order stationarity. In the proposed algorithms, we update the variable x with the hypergradient ∇̂Φ(x) estimated by AID. When the norm of ∇̂Φ(x) is small, we add random noise sampled from a uniform ball and keep running AID for at least T steps (see steps 10-13 of Algorithm 1). If the current point is a saddle point of Φ(x), we show that with high probability the function value Φ(x) has sufficient decrease after T steps, so the algorithm can escape the current saddle point.

2.1 Convergence Analysis

We first state the assumptions needed for our analysis.

Assumption 3 Assume the upper-level function f(x, y) and the lower-level function g(x, y) satisfy the following conditions:

(i) Function g(x, y) is three times differentiable and µ-strongly convex with respect to y for any fixed x.

(ii) Function f(x, y) is twice differentiable, and f(x, y) is M-Lipschitz continuous with respect to x and y.

(iii) Gradients ∇f(x, y) and ∇g(x, y) are ℓ-Lipschitz continuous with respect to x and y.
(iv) The second-order derivative matrices ∇²_{xx}f(x, y), ∇²_{xy}f(x, y), ∇²_{yy}f(x, y), ∇²_{xy}g(x, y) and ∇²_{yy}g(x, y) are ρ-Lipschitz continuous with respect to x and y.

(v) Third-order derivatives ∇³_{xyx}g(x, y), ∇³_{yxy}g(x, y) and ∇³_{yyy}g(x, y) are ν-Lipschitz continuous with respect to x and y.

Algorithm 1 Perturbed Algorithms for Minimax and Bilevel Optimization Problems
1: Input: iteration numbers K, D, N; step sizes τ, η; accuracy ϵ; radius r; perturbation time T.
2: Initialization: x₀, y₀, v₀.
3: Set k_perturb = 0
4: for k = 0, 1, 2, …, K−1 do
5:   Set y⁰_k = y^D_{k−1} if k > 0, otherwise y⁰_k = y₀
6:   for t = 1, 2, …, D do
7:     y^t_k = y^{t−1}_k − τ ∇_y g(x_k, y^{t−1}_k)
8:   end for
9:   Estimate the hypergradient:
     Option 1 (for minimax problems), GDmax: compute ∇̂Φ(x_k) = ∇_x f(x_k, y^D_k)
     Option 2 (for bilevel optimization), AID:
       1) set v⁰_k = v^N_{k−1} if k > 0, otherwise v⁰_k = v₀
       2) solve for v^N_k from ∇²_{yy} g(x_k, y^D_k) v = ∇_y f(x_k, y^D_k) via N steps of CG starting from v⁰_k
       3) get the Jacobian-vector product ∇²_{xy} g(x_k, y^D_k) v^N_k via automatic differentiation
       4) ∇̂Φ(x_k) = ∇_x f(x_k, y^D_k) − ∇²_{xy} g(x_k, y^D_k) v^N_k
10:  if ‖∇̂Φ(x_k)‖ ≤ 4ϵ/5 and k − k_perturb > T then
11:    x_k = x_k + η u, u ∼ Uniform(B(r))
12:    k_perturb = k
13:  end if
14:  x_{k+1} = x_k − η ∇̂Φ(x_k)
15: end for
16: Output: x_K.

Remark 4 Compared with the assumptions in the recent bilevel optimization literature (Ghadimi and Wang, 2018; Ji et al., 2021), we further assume the Lipschitz continuity of the Hessian of f(x, y) and of the third-order derivatives of g(x, y). These assumptions are required to prove the Hessian Lipschitz continuity of Φ(x), which is a common condition in the literature on escaping saddle points (Jin et al., 2021).

Remark 5 It should be noted that although we assume the third-order partial derivatives of g(x, y) to be Lipschitz continuous, this assumption is for the theoretical analysis only; we do not compute any third-order derivatives in our algorithms.
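The AID option of Algorithm 1 (inner gradient descent on y, then a CG solve of (2.1), then the estimate (2.2)) can be sketched on a toy quadratic instance where the exact hypergradient is known. The instance, step sizes and iteration counts below are illustrative choices, and the CG routine is a plain textbook implementation.

```python
import numpy as np

def cg(A, b, v0, num_steps):
    """Plain conjugate gradient for A v = b, A symmetric positive definite."""
    v = v0.copy()
    r = b - A @ v
    p, rs = r.copy(), r @ r
    for _ in range(num_steps):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        v += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < 1e-30:        # converged early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

# Toy instance: g(x, y) = 0.5*||y - A x||^2,  f(x, y) = 0.5*||y - b||^2,
# so Phi(x) = 0.5*||A x - b||^2 and the exact hypergradient is A^T (A x - b).
rng = np.random.default_rng(1)
d, n, D, N, tau = 3, 4, 200, 25, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)
y = np.zeros(n)

for _ in range(D):                       # inner loop: GD on y
    y -= tau * (y - A @ x)               # ∇_y g(x, y) = y - A x
hess_yy = np.eye(n)                      # ∇²_{yy} g
v = cg(hess_yy, y - b, np.zeros(n), N)   # solve (2.1): ∇²_{yy} g v = ∇_y f
hypergrad = A.T @ v                      # (2.2): ∇_x f = 0 here, ∇²_{xy} g = -A^T

exact = A.T @ (A @ x - b)
print(np.allclose(hypergrad, exact, atol=1e-6))  # True
```

In practice D and N are set logarithmically in ϵ⁻¹ (see Theorem 7); here generous values simply make the approximation error negligible.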
One of the key elements of our proof technique is to show that under Assumption 3, the function Φ(x) is Hessian Lipschitz continuous, as shown in the following lemma.

Lemma 6 Suppose Assumption 3 holds. Then Φ(x) is ρφ-Hessian Lipschitz continuous, i.e., (1.7) holds, where ρφ = O(κ⁵) is defined in (B.17).

For the AID approach, the main results are given in the following theorem.

Theorem 7 (Convergence of Perturbed AID) Suppose that Assumption 3 holds. Set the parameters of Algorithm 1 as

η = 1/Lφ, r = ϵ/(400ι³), T = (Lφ/√(ρφ ϵ))·ι, and D = O(κ log ϵ⁻¹), N = O(√κ log ϵ⁻¹). (2.3)

With probability at least 1 − δ, the iteration number of the perturbed AID algorithm for visiting an ϵ-local minimum of Φ(x) is

K = Õ(κ³ ϵ⁻²). (2.4)

Here δ ∈ (0, 1), Lφ is the Lipschitz constant of ∇Φ, ρφ is the Lipschitz constant of ∇²Φ (see Lemma 36), and ι is a constant satisfying

ι > 1, and (Lφ √d / √(ρφ ϵ)) · ι² · 2^{8−ι} ≤ δ. (2.5)

Corollary 8 The gradient complexities of the perturbed AID algorithm for finding an ϵ-local minimum of Φ(x) are G(f, ϵ) = Õ(κ³ϵ⁻²), G(g, ϵ) = Õ(κ⁴ϵ⁻²). The Jacobian- and Hessian-vector product complexities are JV(g, ϵ) = Õ(κ³ϵ⁻²), HV(g, ϵ) = Õ(κ^{3.5}ϵ⁻²).

Remark 9 Though the complexities of the perturbed AID method are worse than the results in Ji et al. (2021) by a log factor, it should be noted that the algorithms converge to different points. Specifically, our perturbed AID method converges to a local minimum of Φ(x), whereas the algorithms in Ji et al. (2021) are only guaranteed to converge to first-order stationarity.

2.2 Proof Sketch

We briefly describe the main elements in proving the above theorems. The main ideas follow Jin et al. (2021). However, in contrast to the problem studied in Jin et al. (2021), we do not have access to the exact hypergradient of Φ(x) in bilevel optimization problems. Therefore, we need to deal with the error introduced by this approximation. We first provide the inexact descent lemma.
Lemma 10 (Inexact Descent Lemma) Suppose Assumption 3 holds and set η = 1/Lφ. Then the inexact gradient sequence {x_k} satisfies

Φ(x_{k+1}) ≤ Φ(x_k) − (η/2)‖∇̂Φ(x_k)‖² + (η/2)‖∇Φ(x_k) − ∇̂Φ(x_k)‖². (2.6)

Secondly, the following lemma shows that with high probability, adding random noise sampled uniformly from a ball helps escape saddle points of Φ(x).

Lemma 11 (Escaping Saddle Points) Suppose Assumption 3 holds. Assume x̃ satisfies ‖∇Φ(x̃)‖ ≤ ϵ and λmin(∇²Φ(x̃)) ≤ −√(ρφ ϵ), where ρφ = O(κ⁵) is defined in (B.17). Let x₀ = x̃ + ηu, u ∼ Uniform(B₀(r)). With the parameters given in (3.2), as long as the following inequality holds in each iteration,

‖∇Φ(x_k) − ∇̂Φ(x_k)‖ ≤ min{17/(80ι²), 1/(16ι²2^ι)}·ϵ, (2.7)

then with probability at least 1 − δ it holds that

Φ(x_T) − Φ(x̃) ≤ −F/2, (2.8)

where x_T is the T-th gradient descent iterate starting from x₀, ι satisfies (3.3), and F = (1/(100ι³))·√(ϵ³/ρφ).

Finally, by Lemma 10, Lemma 11 and a proper choice of the parameters D and N, we can bound the total iteration number of Algorithm 1 by

K = O((Φ(x₀) − Φ*)·T/F + Lφ(Φ(x₀) − Φ*)/ϵ²).

Here the perturbation time T will be specified later in the proof of our main theorem. In practice we treat it as a hyperparameter to tune.

3. Escape Saddle Points for Minimax Problems

In this section, we consider the following nonconvex-strongly-concave minimax problem:

min_{x∈R^d} max_{y∈R^n} f(x, y), (3.1)

where f(x, y) is nonconvex with respect to x and µ-strongly concave with respect to y. We note that the problem aims at minimizing max_{y∈R^n} f(x, y) over x. Thus, by defining the function Φ(x) = max_y f(x, y), (3.1) reduces to the smooth nonconvex minimization problem min_{x∈R^d} Φ(x). Note that this is also a special case of the bilevel optimization problem, obtained by setting g(x, y) = −f(x, y) in (1.1), which leads to the following problem:

min_x f(x, y*(x)), s.t. y*(x) = argmin_y g(x, y) := −f(x, y),

where y*(x) is uniquely defined for any x, since f(x, y) is strongly concave with respect to y.
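This reduction is what the GDmax option of Algorithm 1 exploits: since the maximizer y*(x) is unique, the hypergradient of Φ(x) = max_y f(x, y) is simply the partial gradient ∇_x f evaluated at (x, y*(x)) (a Danskin-type argument). A minimal numerical check on an illustrative strongly-concave-in-y function (not from the paper):

```python
import numpy as np

# f(x, y) = sin(x)*y - 0.5*y^2 is 1-strongly concave in y, with y*(x) = sin(x),
# so Phi(x) = max_y f(x, y) = 0.5*sin(x)^2 (smooth and nonconvex in x).
def y_star(x):
    return np.sin(x)

def grad_x_f(x, y):
    return np.cos(x) * y

x = 0.7
# GDmax-style hypergradient: partial x-gradient of f at (x, y*(x)) ...
g_hat = grad_x_f(x, y_star(x))
# ... matches d/dx [0.5*sin(x)^2] = sin(x)*cos(x)
g_true = np.sin(x) * np.cos(x)
print(np.isclose(g_hat, g_true))  # True
```

In Algorithm 1 the exact y*(x) is replaced by the inner-loop iterate y^D_k, which introduces exactly the kind of hypergradient error that Lemmas 10 and 11 control.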
The minimax problem (3.1) seeks a Nash equilibrium of f(x, y). However, for nonconvex minimax problems, a global Nash equilibrium does not exist in general. Instead, one is more interested in finding a local Nash equilibrium (Daskalakis and Panageas, 2018; Mazumdar et al., 2019) or a local minimax point (Jin et al., 2020). Therefore, the following question arises naturally:

What is the relationship between the local minima of Φ(x) and the local optimality of the minimax problem (3.1)?

The answer to this question has so far remained ambiguous. We first discuss the relationship between the local minima of Φ(x) and the local Nash equilibria of (3.1). The Nash equilibrium and its local counterpart are defined as below.

Definition 12 (Local Nash Equilibrium, Mazumdar et al., 2019) We say (x*, y*) is a Nash equilibrium of a function f if for any (x, y): f(x*, y) ≤ f(x*, y*) ≤ f(x, y*). A point (x*, y*) is a local Nash equilibrium of f if there exists δ > 0 such that for any (x, y) satisfying ‖x − x*‖ ≤ δ and ‖y − y*‖ ≤ δ we have f(x*, y) ≤ f(x*, y*) ≤ f(x, y*).

Remark 13 It is worth noting that a Nash equilibrium is sometimes called a saddle point in the minimax optimization literature (Lin et al., 2020c). We highlight that, throughout this paper, our notion of saddle point follows Definition 2. In other words, we first view problem (3.1) as a bilevel problem, for which we have Φ(x), and then define the saddle points of Φ by Definition 2. The reader should not confuse these two completely different notions of saddle points.

A local Nash equilibrium can be characterized in terms of first-order and second-order conditions. Specifically, when f is twice differentiable, a stationary point (i.e., (x, y) such that ∇f(x, y) = 0) is a strict local Nash equilibrium if and only if ∇²_{yy} f(x, y) ≺ 0 and ∇²_{xx} f(x, y) ≻ 0.
We have the following proposition, showing that a local minimum of Φ(x) is indeed superior to a saddle point of Φ(x) in terms of local Nash optimality.

Proposition 14 For any smooth nonconvex-strongly-concave function f(x, y), define Φ(x) = max_{y∈R^n} f(x, y). Then we have:

(i) A saddle point of Φ(x) cannot be a strict local Nash equilibrium of f(x, y).

(ii) A strict local Nash equilibrium of f(x, y) must be a local minimum of Φ(x).

Moreover, Jin et al. (2020) introduced the concept of a local minimax point, which is a weakened notion of the local Nash equilibrium. Compared with the local Nash equilibrium, the local minimax point alleviates the non-existence issue¹ and is the first proper mathematical definition of local optimality for two-player sequential games.

Definition 15 (Strict Local Minimax Point, Jin et al., 2020) For any twice differentiable function f(x, y), a point (x, y) is a strict local minimax point if it satisfies ∇f(x, y) = 0, ∇²_{yy} f(x, y) ≺ 0, and

∇²_{xx} f(x, y) − ∇²_{xy} f(x, y) [∇²_{yy} f(x, y)]⁻¹ [∇²_{xy} f(x, y)]ᵀ ≻ 0.

¹ In Proposition 6 of Jin et al. (2020), the authors constructed a two-dimensional function showing that Nash equilibria may not exist; this is known as the non-existence issue.

The following proposition shows the equivalence between a strict local minimax point and a strict local minimum of Φ(x).

Proposition 16 For the nonconvex-strongly-concave minimax problem (3.1), suppose Φ(x) has a strict local minimum. Then a strict local minimax point of (3.1) always exists and is equivalent to a strict local minimum of Φ(x).

Most existing convergence theory for minimax problems focuses on finding ϵ-stationary points. Recently, Chen et al. (2021b) and Luo and Chen (2021) proposed second-order algorithms for minimizing Φ(x), which are guaranteed to converge to a local minimum.
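Definition 15 is easy to verify on a small quadratic game. The instance below, f(x, y) = ½x² + xy − ½y², is illustrative (not from the paper); it has Φ(x) = max_y f(x, y) = x², and the Schur complement in the definition coincides with Φ″(x), which is exactly the equivalence asserted by Proposition 16.

```python
import numpy as np

# f(x, y) = 0.5*x^2 + x*y - 0.5*y^2, stationary at (0, 0).
H = np.array([[1.0, 1.0],
              [1.0, -1.0]])   # full Hessian [[f_xx, f_xy], [f_yx, f_yy]]
f_xx, f_xy, f_yy = H[0, 0], H[0, 1], H[1, 1]

concave_in_y = f_yy < 0                        # ∇²_{yy} f ≺ 0
schur = f_xx - f_xy * (1.0 / f_yy) * f_xy      # ∇²_{xx}f − ∇²_{xy}f [∇²_{yy}f]⁻¹ ∇²_{yx}f
print(concave_in_y and schur > 0)              # True: (0, 0) is a strict local minimax point
print(schur)                                   # 2.0 = Phi''(x) for Phi(x) = x^2
```

Note that (0, 0) is not a local Nash equilibrium here (f_xx = 1 ≻ 0 holds, but fixing y = 0 and perturbing x shows x ↦ f(x, 0) is minimized, while y ↦ f(0, y) is maximized only locally), illustrating how the minimax notion is the weaker and more broadly applicable one.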
Second-order methods enjoy faster convergence rates than gradient methods, but require solving nonconvex subproblems in each iteration. Moreover, second-order methods are difficult to implement in large-scale problems due to the heavy computation of Hessian matrices. Fiez et al. (2021) proved that GDA asymptotically converges to a strict local minimax point almost surely. However, no convergence rate for finding a local minimax point was given in Fiez et al. (2021). The above facts motivate us to propose the perturbed GDmax algorithm (Algorithm 1 with option GDmax in step 9), a first-order nested-loop algorithm that provably escapes saddle points in minimax problems. In the inner loop, the perturbed GDmax runs D steps of gradient ascent to solve the y-subproblem inexactly. With the warm-start strategy, we set the initial point y⁰_k in the k-th iteration to be the output y^D_{k−1} of the inner loop in the previous iteration. In the outer loop, we estimate the hypergradient ∇Φ(x) by ∇̂Φ(x_k) = ∇_x f(x_k, y^D_k) and update x by one step of inexact gradient descent. When the first-order stationarity condition is satisfied (step 10 in Algorithm 1), we add a random noise vector sampled uniformly from a ball with radius r centered at the current iterate.

We now analyze the convergence of the perturbed GDmax algorithm. We first list our assumptions.

Assumption 17 For the minimax problem (3.1), f(x, y) satisfies the following conditions:

(i) f(x, y) is twice differentiable, µ-strongly concave with respect to y and possibly nonconvex with respect to x.

(ii) Denote z = (x, y). f(z) is ℓ-smooth, i.e., for any z, z′ it holds that ‖∇f(z) − ∇f(z′)‖ ≤ ℓ‖z − z′‖.

(iii) The Hessian and Jacobian matrices ∇²_{xx}f(x, y), ∇²_{xy}f(x, y), and ∇²_{yy}f(x, y) are ρ-Lipschitz continuous.

(iv) The function Φ(x) is bounded below and has compact sub-level sets.
Remark 18 Compared with the assumptions for the general bilevel optimization problem (Assumption 3 in Section 2), we do not require any third-order derivative information for the minimax problem.

Our perturbed GDmax algorithm is the first pure gradient algorithm with a nonasymptotic convergence rate for finding a local minimax point. The main results for the perturbed GDmax algorithm are given in the following theorem.

Theorem 19 (Convergence of Perturbed GDmax) Suppose f(x, y) satisfies Assumption 17. Set the parameters as

η = 1/Lφ, r = ϵ/(400ι³), T = (Lφ/√(ρφ ϵ))·ι (3.2)

and D = O(κ log ϵ⁻¹). Then with probability at least 1 − δ, the perturbed GDmax algorithm (i.e., Algorithm 1 with option GDmax in step 9) obtains an ϵ-local minimum of Φ(x) = max_y f(x, y) in

K = Õ(Lφ(Φ(x₀) − Φ(x*))·ϵ⁻²)

iterations. Here δ ∈ (0, 1), Lφ is the Lipschitz constant of ∇Φ, ρφ is the Lipschitz constant of ∇²Φ (see Lemma 36), and ι is a constant satisfying

ι > 1, and (Lφ √d / √(ρφ ϵ)) · ι² · 2^{8−ι} ≤ δ. (3.3)

Remark 20 Throughout this paper, we also assume

Lφ/√(ρφ ϵ) ≥ 1. (3.4)

We make this assumption because if (3.4) does not hold, finding an ϵ-local minimum is straightforward; see Jin et al. (2021).

Remark 21 Note that in practice, we may choose ι sufficiently large so that δ in (3.3) is small, which implies that Theorem 19 holds with probability at least 1 − δ.

Corollary 22 The gradient evaluation complexity of the perturbed GDmax algorithm for finding an ϵ-local minimum of Φ(x) = max_y f(x, y) is G(f, ϵ) = D·K = Õ(κ²ϵ⁻²).

Remark 23 Compared with the results of second-order methods for escaping saddle points in minimax problems (Chen et al., 2021b; Luo and Chen, 2021), our perturbed GDmax algorithm is purely first-order, which means we do not even need to compute Hessian-vector products. Moreover, the algorithms in Chen et al. (2021b) and Luo and Chen (2021) require solving a nonconvex cubic subproblem and multiple linear systems in each iteration.
All these expensive computations are avoided in our perturbed GDmax algorithm, which makes it practical in real applications.

Remark 24 Compared with the asymptotic results in Fiez et al. (2021), we provide a nonasymptotic convergence rate for finding a local minimax point of minimax problems.

Remark 25 The dependence on the condition number κ for the perturbed GDmax algorithm is O(κ²), which matches the order in Lin et al. (2020b) and is better than the results in Chen et al. (2021b).

4. Inexact NEON and Stochastic Bilevel Algorithms

In this section, we consider escaping saddle points for the stochastic bilevel optimization problem (1.8). Inspired by the recent works of Xu et al. (2018) and Allen-Zhu and Li (2018), we propose inexact NEON (iNEON), which helps escape saddle points in stochastic bilevel optimization (1.8).

4.1 Inexact NEON

Recently, Xu et al. (2018) and Allen-Zhu and Li (2018) proposed NEgative curvature Originated from Noise (NEON) and NEON2, two pure first-order methods that extract a negative curvature descent direction. NEON turns almost all stationary-point finding algorithms into local-minimum finding algorithms. The work of Xu et al. (2018) was inspired by the connection between the perturbed gradient descent method (Jin et al., 2017) and the power method, while the idea of Allen-Zhu and Li (2018) is based on an analysis of Oja's algorithm (Allen-Zhu and Li, 2017). Compared with classical optimization problems, bilevel optimization no longer has access to the exact gradient, which motivates us to propose the inexact NEON (iNEON). The proposed iNEON update is

u_{k+1} = u_k − η(∇̂Φ(x̃ + u_k) − ∇̂Φ(x̃)), (4.1)

where ∇̂Φ(x̃ + u_k) and ∇̂Φ(x̃) are hypergradient estimates. Our iNEON algorithm is described in Algorithm 2. Intuitively, iNEON can be viewed as an approximate power method.
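The power-method view of the update (4.1) can be sketched on a quadratic Φ whose Hessian has one negative eigenvalue. The Gaussian perturbation added to the gradient below stands in for the hypergradient estimation error, and all constants (η, r, T, the noise scale) are illustrative choices.

```python
import numpy as np

# Phi(w) = 0.5 * w^T H w with lambda_min(H) = -0.5: negative curvature along e2.
H = np.diag([1.0, -0.5])
rng = np.random.default_rng(7)

def inexact_grad(w):
    """Exact gradient H w plus tiny noise, standing in for the AID estimate."""
    return H @ w + 1e-9 * rng.standard_normal(2)

x_tilde = np.zeros(2)                 # potential saddle point
eta, r, T = 0.5, 1e-3, 40
u = rng.standard_normal(2)
u = eta * r * u / np.linalg.norm(u)   # u0 on the sphere of radius eta*r, for simplicity

for _ in range(T):                    # iNEON update (4.1)
    u = u - eta * (inexact_grad(x_tilde + u) - inexact_grad(x_tilde))

rayleigh = u @ H @ u / (u @ u)        # curvature along the extracted direction
print(rayleigh < -0.4)                # True: close to lambda_min = -0.5
```

Since u_{k+1} ≈ (I − ηH)u_k, the iterates align with the top eigenvector of I − ηH, i.e., the most negative curvature direction of H; the noise is dominated by the exponential growth along that direction.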
More specifically, note that (4.1) is equivalent to

u_{k+1} = u_k − η(∇Φ(x̄ + u_k) − ∇Φ(x̄)) + δ_k ≈ (I − η∇²Φ(x̄))u_k + δ_k, (4.2)

where δ_k = −η(∇̂Φ(x̄ + u_k) − ∇̂Φ(x̄) − ∇Φ(x̄ + u_k) + ∇Φ(x̄)) is the gradient estimation error, and in the last step we use the approximation ∇Φ(x̄ + u_k) − ∇Φ(x̄) ≈ ∇²Φ(x̄)u_k, which holds as long as u_k is small. Therefore, (4.1) amounts to applying an approximate power method to the matrix I − η∇²Φ(x̄) starting from the initial vector u₀. We next show that iNEON can extract a negative-curvature descent direction with high probability.

Algorithm 2 Inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON)
1: Input: iteration numbers T, D, step sizes τ, η, accuracy ϵ, radius r, potential saddle point x̄, initial point y₀
2: Select u₀ ∼ Uniform(B(ηr))
3: Set ỹ₀ = y₀
4: for t = 1, 2, ..., D do
5: ỹ_t = ỹ_{t−1} − τ∇_y g(x̄, ỹ_{t−1})
6: end for
7: Compute ∇̂Φ(x̄) using AID and ỹ_D
8: for k = 0, 1, 2, ..., T do
9: Set y⁰_k = y^D_{k−1} if k > 0, otherwise y₀
10: for t = 1, 2, ..., D do
11: y^t_k = y^{t−1}_k − τ∇_y g(x̄ + u_k, y^{t−1}_k)
12: end for
13: Compute ∇̂Φ(x̄ + u_k) using AID and y^D_k
14: u_{k+1} = u_k − η(∇̂Φ(x̄ + u_k) − ∇̂Φ(x̄))
15: if Φ̂(x̄ + u_{k+1}) − Φ̂(x̄) − ⟨∇̂Φ(x̄), u_{k+1}⟩ ≤ −(11519/12800)F then
16: return u_out = u_{k+1}/‖u_{k+1}‖
17: end if
18: end for
19: return 0

Lemma 26 Suppose Assumption 3 holds. Choose the parameters η = 1/Lφ, r = ϵ/(400ι³), T = (Lφ/√(ρφϵ))·ι and D = O(κ log(ϵ⁻¹)) in Algorithm 2. Let x̄ satisfy ‖∇Φ(x̄)‖ ≤ ϵ and λ_min(∇²Φ(x̄)) ≤ −√(ρφϵ), where ρφ = O(κ⁵) is defined in (B.17). Denote by u_out the output of Algorithm 2. If u_out ≠ 0, we have with probability at least 1 − δ that

u_outᵀ ∇²Φ(x̄) u_out ≤ −√(ρφϵ)/(40ι), (4.5)

where ι is a constant satisfying ι > 1 and δ ≥ (ℓ√d/√(ρϵ)) · ι² · 2⁸⁻ι/⁴. (4.6) If u_out = 0, then we conclude that λ_min(∇²Φ(x̄)) ≥ −√(ρφϵ) with high probability 1 − O(δ).

Remark 27 Compared with the results in Xu et al. (2018), we provide a simplified proof that can handle the gradient estimation error.
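The approximate-power-method view of (4.1)-(4.2) can be illustrated with exact gradients. The following is a minimal sketch, not the full Algorithm 2: the AID hypergradient estimate is replaced by a hypothetical `grad` oracle, and the function-decrease stopping test of step 15 is omitted.

```python
import numpy as np

def neon_direction(grad, x_bar, eta=0.05, r=1e-3, T=300, seed=0):
    """Power-method view of the NEON update (4.1)-(4.2): iterating
    u <- u - eta*(grad(x_bar + u) - grad(x_bar)) ~ (I - eta*H)u
    amplifies the most negative eigendirection of H, the Hessian at x_bar."""
    rng = np.random.default_rng(seed)
    u = eta * rng.uniform(-r, r, size=x_bar.shape)  # u0 ~ Uniform(B(eta*r))
    g0 = grad(x_bar)
    for _ in range(T):
        u = u - eta * (grad(x_bar + u) - g0)
        norm = np.linalg.norm(u)
        if norm > 1.0:        # renormalize so the local linearization stays valid
            u = u / norm
    norm = np.linalg.norm(u)
    return u / norm if norm > 0 else u
```

For example, for Φ(x) = (x₁² − x₂²)/2, whose Hessian at the origin is diag(1, −1), the iteration aligns u with the negative-curvature direction e₂.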
So far, we have treated iNEON as a deterministic algorithm that extracts a descent direction for a deterministic objective function Φ(x). We show how to apply iNEON to stochastic bilevel algorithms in the next section.

4.2 StocBiO Escapes Saddle Points

In this section, we apply iNEON to a popular algorithm for stochastic bilevel optimization, StocBiO (Ji et al., 2021). StocBiO is a double-loop batch stochastic algorithm with a structure similar to AID. In its inner loop, it runs D steps of stochastic gradient descent (SGD) to obtain an estimated solution y^D_k. Let Π^Q_{Q+1}(·) = I. In the outer loop, StocBiO samples data batches D_F, D_H = {B_j, j = 1, ..., Q} and D_G and constructs v_Q as an approximate solution of the linear system (2.1) as follows:

v₀ = ∇_y F(x_k, y^D_k; D_F), v_Q = η Σ_{q=0}^{Q} Π_{j=Q−q+1}^{Q} (I − η∇²_{yy}G(x_k, y^D_k; B_j)) v₀. (4.7)

The stochastic hypergradient can then be constructed as

∇̂Φ(x_k) = ∇_x F(x_k, y^D_k; D_F) − ∇²_{xy}G(x_k, y^D_k; D_G) v_Q. (4.8)

Algorithm 3 StocBiO with iNEON
1: Input: K, D, Q, batch size S, step sizes α and β, initializations x₀ and y₀
2: for k = 1, ..., K − 1 do
3: Set y⁰_k = y^D_{k−1} if k > 0, otherwise y₀
4: for t = 1, ..., D do
5: Draw a sample batch S_{t−1} with batch size S
6: Update y^t_k = y^{t−1}_k − α∇_y G(x_k, y^{t−1}_k; S_{t−1})
7: end for
8: Draw sample batches D_F, D_H and D_G
9: Compute ∇̂Φ(x_k) by (4.7)-(4.8)
10: Update x_{k+1} = x_k − β∇̂Φ(x_k)
11: k ← k + 1
12: Compute ∇̂Φ_D(x_k) via AID
13: if ‖∇̂Φ_D(x_k)‖ ≤ 4ϵ/5 then
14: u = iNEON(x_k, T, r, f_{D_F}, g_{D_G})
15: if u = 0 then
16: Return x_k
17: end if
18: Select Rademacher variable ξ ∈ {−1, 1}
19: x_{k+1} = x_k + (ξ/80)√(ϵ/ρφ) u
20: k ← k + 1
21: end if
22: end if
23: end for

When the norm of the batch hypergradient is small (see step 12 of Algorithm 3), we fix the sample batches D_F, D_G and call iNEON. Denote f_{D_F}(x, y) = (1/D_f) Σ_{i=1}^{D_f} F(x, y; ξᵢ), g_{D_G}(x, y) = (1/D_g) Σ_{i=1}^{D_g} G(x, y; ζᵢ), and Φ_D(x) = f_{D_F}(x, y*_{D_G}(x)). iNEON finds a descent direction for Φ_D(x) at saddle points with high probability.
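The estimator (4.7)-(4.8) truncates the Neumann series (∇²_{yy}g)⁻¹ = η Σ_q (I − η∇²_{yy}g)^q. The following deterministic (single-batch) sketch illustrates the idea; the oracle callables and function names are our own, not the paper's implementation.

```python
import numpy as np

def neumann_hypergrad(grad_x_f, grad_y_f, hess_yy_g, jac_xy_g, x, y,
                      eta=0.5, Q=50):
    """Deterministic sketch of (4.7)-(4.8): approximate
    v = (hess_yy_g)^{-1} grad_y_f by the truncated Neumann series
    eta * sum_{q=0}^{Q} (I - eta*H)^q b, then assemble the hypergradient
    grad_x_f - jac_xy_g @ v."""
    H = hess_yy_g(x, y)
    term = grad_y_f(x, y)                # (I - eta*H)^0 b
    v = np.zeros_like(term)
    for _ in range(Q + 1):
        v += eta * term
        term = term - eta * (H @ term)   # advance to the next power of (I - eta*H)
    return grad_x_f(x, y) - jac_xy_g(x, y) @ v
```

As a sanity check, for the lower level g(x, y) = ½‖y‖² − xᵀy (so y*(x) = x, ∇²_{yy}g = I, ∇²_{xy}g = −I) and upper level f(x, y) = ½‖y‖², we have Φ(x) = ½‖x‖², and the estimator evaluated at y = y*(x) returns approximately ∇Φ(x) = x.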
We list the assumptions for Algorithm 3 as follows.

Assumption 28 For the stochastic case, Assumption 3 holds for F(x, y; ξ) and G(x, y; ζ) for any given ξ and ζ.

Assumption 29 The variance of the stochastic gradient ∇G(x, y; ζ) is bounded: E_ζ‖∇G(x, y; ζ) − ∇g(x, y)‖² ≤ σ².

The following theorem provides our main results on stochastic bilevel optimization.

Theorem 30 Suppose Assumptions 28 and 29 hold. Set the parameters as in (4.3) and (4.4), and let α = 2/(ℓ + µ), β = 1/(4Lφ), D = O(κ log(ϵ⁻¹)), Q = O(κ log(ϵ⁻¹)), |B_{Q+1−j}| = BQ(1 − ηµ)^{j−1} for j = 1, ..., Q, where B = O(κ²ϵ⁻²), and S = O(κ⁵ϵ⁻²), D_f = O(κ²ϵ⁻²), D_g = O(κ⁶ϵ⁻²) in Algorithm 3. With high probability, the total number of iterations of Algorithm 3 before visiting an ϵ-local minimum of (1.8) can be bounded by K = O(Lφ(Φ(x₀) − Φ(x*))/ϵ²) = O(κ³ϵ⁻²), where Lφ is the Lipschitz constant of ∇Φ(x), defined in Lemma 34.

Corollary 31 For Algorithm 3, the gradient evaluation complexities for finding an ϵ-local minimum of (1.8) are G(f, ϵ) = O(κ⁵ϵ⁻⁴) and G(g, ϵ) = O(κ¹⁰ϵ⁻⁴), and the Jacobian- and Hessian-vector product complexities are JV(g, ϵ) = O(κ⁹ϵ⁻⁴) and HV(g, ϵ) = O(κ⁹·⁵ϵ⁻⁴).

Remark 32 Compared with the StocBiO complexity results in Ji et al. (2021), our results have a worse dependence on the condition number κ. This is because we set a larger sample size D_g in order to obtain a high-probability result.

5. Numerical Experiments

5.1 Deterministic Case

In this section we present experimental results to demonstrate the efficiency of our algorithm. We reformulate the problem in Du et al. (2017) as a bilevel optimization problem (1.1) and then compare our Algorithm 1 with AID-BiO in Ji et al. (2021). More precisely, we consider the following bilevel optimization problem:

min_{x∈R^d} Φ(x) := f₁(x, y*(x)), s.t. y*(x) = argmin_{y∈R} f₂(x, y). (5.1)

Motivated by Du et al. (2017), we construct the following functions.
For the upper-level function we use the piecewise construction

f₁(x, y) =
 f_{i,1}(x, y) if x₁, ..., x_{i−1} ∈ [2τ, 6τ], x_i ∈ [0, τ], x_{i+1}, ..., x_d ∈ [0, τ], 1 ≤ i ≤ d − 1,
 f_{i,2}(x, y) if x₁, ..., x_{i−1} ∈ [2τ, 6τ], x_i ∈ [τ, 2τ], x_{i+1}, ..., x_d ∈ [0, τ], 1 ≤ i ≤ d − 1,
 f_{d,1}(x, y) if x₁, ..., x_{d−1} ∈ [2τ, 6τ], x_d ∈ [0, τ],
 f_{d,2}(x, y) if x₁, ..., x_{d−1} ∈ [2τ, 6τ], x_d ∈ [τ, 2τ],
 f_{d+1,1}(x, y) if x₁, ..., x_d ∈ [2τ, 6τ], (5.2)

where

f_{i,1}(x, y) = Σ_{j=1}^{i−1} L(x_j − 4τ)² − γx_i² + Σ_{j=i+1}^{d} Lx_j² − (i − 1)ν, 1 ≤ i ≤ d − 1, (5.3)
f_{i,2}(x, y) = Σ_{j=1}^{i−1} L(x_j − 4τ)² + y + Σ_{j=i+2}^{d} Lx_j² − (i − 1)ν, 1 ≤ i ≤ d − 1, (5.4)
f_{d,1}(x, y) = Σ_{j=1}^{d−1} L(x_j − 4τ)² − γx_d² − (d − 1)ν, (5.5)
f_{d,2}(x, y) = Σ_{j=1}^{d−1} L(x_j − 4τ)² + y − (d − 1)ν, (5.6)
f_{d+1,1}(x, y) = Σ_{j=1}^{d} L(x_j − 4τ)² − dν. (5.7)

The lower-level function is defined as

f₂(x, y) = y²/2 − g(x)y, (5.8)

where

g(x) =
 h₁(x_i) + h₂(x_i)x_{i+1}² if x₁, ..., x_{i−1} ∈ [2τ, 6τ], x_i ∈ [τ, 2τ], x_{i+1}, ..., x_d ∈ [0, τ], 1 ≤ i ≤ d − 1,
 h₁(x_d) if x₁, ..., x_{d−1} ∈ [2τ, 6τ], x_d ∈ [τ, 2τ],
 0 elsewhere, (5.9)

h₁(c) = −γc² + (−14L + 10γ)(c − τ)³/(3τ) + (5L − 3γ)(c − τ)⁴/(2τ²), (5.10)
h₂(c) = −γ − 10(L + γ)(c − 2τ)³/τ³ − 15(L + γ)(c − 2τ)⁴/τ⁴ − 6(L + γ)(c − 2τ)⁵/τ⁵. (5.11)

The constants satisfy L > 0, γ > 0, τ = e, ν = −h₁(2τ) + 4Lτ². Note that from (5.2) the function Φ(x) is only defined on the following domain (see also Eq. (5) in Du et al. (2017)):

D₀ = {x ∈ R^d : 6τ ≥ x₁, ..., x_{i−1} ≥ 2τ, 2τ ≥ x_i ≥ 0, τ ≥ x_{i+1}, ..., x_d ≥ 0}. (5.12)

By Lemma A.3 in Du et al. (2017), there are d saddle points in D₀: (0, ..., 0)ᵀ, (4τ, 0, ..., 0)ᵀ, ..., (4τ, ..., 4τ, 0)ᵀ. Moreover, the only local optimum is (4τ, ..., 4τ)ᵀ. One can follow Steps 2 and 3 in Section A.1 of Du et al. (2017) to extend the domain to R^d; for simplicity we omit the extension here. We refer interested readers to Section 4 and the Appendix of Du et al. (2017) for the motivation behind constructing these functions. In our experiments, we choose the total number of iterations to be 1000 and all step sizes to be 0.05 in both Algorithm 1 and AID-BiO. Following Du et al. (2017), we conduct the comparison using different choices of the problem parameters.
In Figure 1 we plot the learning curves of Φ(x) − min Φ(x) versus the iteration number. Our algorithm is denoted "PBO", and AID-BiO is denoted "BO" in Figure 1. Note that each learning curve is nearly a step function consisting of vertical and horizontal line segments. A horizontal segment indicates that the function value does not change, so we may deduce that the iterates are stuck at a saddle point. A vertical segment indicates that a perturbation has successfully helped the iterates escape the saddle point. We observe that under different parameter choices our Algorithm 1 escapes saddle points more efficiently than the standard bilevel optimization algorithm.

Figure 1: Comparison between Algorithm 1 and AID-BiO in Ji et al. (2021). PBO and BO represent Algorithm 1 and AID-BiO, respectively. d is the dimension of the upper-level function in (1.1). L and γ are problem parameters used to generate the functions in both levels.

5.2 Stochastic Case

Next we investigate the performance of our Algorithm 3 in the stochastic setting. Consider the symmetric matrix sensing problem (Ge et al., 2017):

min_{U∈R^{d×r}} (1/2m) Σ_{i=1}^{m} (⟨A_i, UUᵀ⟩ − b_i)², (5.13)

where ⟨A, B⟩ = Trace(ABᵀ) for any matrices A, B, and 0 < r < d. The matrices {A_i : i = 1, ..., m} are sensing matrices and b_i = ⟨A_i, M*⟩ denotes the observation corresponding to A_i, where M* = U*(U*)ᵀ with U* ∈ R^{d×r} the ground-truth matrix. The goal of (5.13) is to recover the low-rank matrix U* from the (A_i, b_i) pairs. Note that (5.13) can be reformulated as follows:

min_{U∈R^{d×r}} (1/2m) Σ_{i=1}^{m} (⟨A_i, Y*(U)⟩ − b_i)², s.t. Y*(U) = argmin_{Y∈R^{d×d}} (1/2)⟨Y, Y⟩ − ⟨Y, UUᵀ⟩. (5.14)

By reformulating the original problem (5.13) as the bilevel problem (5.14), the objective functions in both levels are convex in Y. We compare StocBiO (Ji et al., 2021) and our Algorithm 3 for solving (5.14). Note that StocBiO is essentially our Algorithm 3 without lines 12-22.
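The data-generation step and the upper-level objective of (5.13)-(5.14) can be sketched as follows; the function names and the small default sizes are our own illustration, not the paper's experimental code.

```python
import numpy as np

def sensing_data(d=10, r=2, m=200, seed=0):
    """Symmetric matrix sensing instance: ground truth M* = U*(U*)^T,
    Gaussian sensing matrices A_i, observations b_i = <A_i, M*>."""
    rng = np.random.default_rng(seed)
    U_star = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))
    M_star = U_star @ U_star.T
    A = rng.normal(size=(m, d, d))
    b = np.einsum('kij,ij->k', A, M_star)   # b_k = <A_k, M*>
    return A, b, U_star

def upper_loss(A, b, Y):
    """Upper level of (5.14): (1/2m) * sum_i (<A_i, Y> - b_i)^2,
    evaluated at the lower-level solution Y = Y*(U) = U U^T."""
    res = np.einsum('kij,ij->k', A, Y) - b
    return 0.5 * res @ res / len(b)
```

Since Y*(U) = UUᵀ minimizes ½⟨Y, Y⟩ − ⟨Y, UUᵀ⟩, the upper-level loss vanishes exactly at U = U*.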
We first generate U* with every entry sampled from the normal distribution N(0, 1/d), and then generate m matrices {A_i : i = 1, ..., m} with each entry sampled from the standard normal distribution N(0, 1). For both algorithms, we set d = 50, r = 5, m = 2000, K = 200, D = 5, α = 0.1, β = 0.03, and the batch size to 200. For Algorithm 3, we set T = 5, r = 0.5, ϵ = 0.03, ρφ = 0.001, τ = 0.03, η = 0.1, F = 0.001. Each entry of the initial U is uniformly sampled from [0, 0.001] and the initial Y is set to the zero matrix. Denote by (U_k, Y_k) the matrices obtained at the k-th iteration (of StocBiO or our Algorithm 3). In Figure 2(a) we plot the training loss (1/2m) Σ_{i=1}^{m} (⟨A_i, Y_k⟩ − b_i)², and in Figure 2(b) we plot the distance between the iterate and the ground truth, √⟨U_k − U*, U_k − U*⟩, at each iteration. From the figures we observe that both algorithms get stuck near a saddle point in the first few iterations and then escape it. Our Algorithm 3 escapes the saddle point faster than StocBiO, which further validates our theory in the stochastic setting.

6. Conclusion

In this paper, we have proposed the perturbed AID algorithm, which provably converges to an ϵ-local minimum in bilevel optimization. As a byproduct, we have provided the first nonasymptotic convergence rate to a local minimax point of a minimax problem with a first-order method. Moreover, we have proposed inexact NEON, which can extract a negative-curvature descent direction at saddle points. By combining inexact NEON with StocBiO, we have obtained the first algorithm that converges to a local minimum in stochastic bilevel optimization.

Figure 2: Comparison between Algorithm 3 and StocBiO in Ji et al. (2021). Panel (a) shows the training loss and panel (b) the distance to the optimal point, both versus the iteration number, for StocBiO and StocBiO with iNEON.

Acknowledgments
We are grateful to two anonymous reviewers for their insightful comments and suggestions that have helped improve the presentation of this paper. This work has been supported in part by NSF grants DMS-2243650, CCF-2308597, CCF-1934568, ECCS-2000415, CCF-2112504, CCF-2232907, CCF-2311275, and ECCS-2326591, ONR grant N00014-24-1-2705, the UC Davis CeDAR (Center for Data Science and Artificial Intelligence Research) Innovative Data Science Seed Funding Program, and a startup fund at Rice University.

Appendix A. Preliminaries

Under Assumption 3, we have the following proposition; the proof can be found in Chen et al. (2021b, Lemma 1).

Proposition 33 Suppose Assumption 3 holds. Then the following bounds hold:
‖∇²_{yy}g(x, y)⁻¹‖ ≤ 1/µ,
max{‖∇_x f(x, y)‖, ‖∇_y f(x, y)‖, ‖∇_y g(x, y)‖} ≤ M,
max{‖∇²_{xx}f(x, y)‖, ‖∇²_{yy}f(x, y)‖, ‖∇²_{xy}f(x, y)‖, ‖∇²_{xy}g(x, y)‖, ‖∇²_{yy}g(x, y)‖} ≤ ℓ,
max{‖∇³_{xxy}g(x, y)‖, ‖∇³_{yxy}g(x, y)‖, ‖∇³_{yyy}g(x, y)‖} ≤ ρ.

Under Assumption 3, the gradient of Φ(x) is Lipschitz continuous.

Lemma 34 (Ghadimi and Wang, 2018, Lemma 2.2) Suppose Assumption 3 holds for f(x, y) and g(x, y). Then Φ(x) is Lφ-smooth and the following inequality holds for any x, x′ ∈ R^n:

‖∇Φ(x) − ∇Φ(x′)‖ ≤ Lφ‖x − x′‖, (A.1)

where Lφ = ℓ + (2ℓ² + ρM)/µ + (ℓ³ + 2ρℓM)/µ².

We postpone the proof of Theorem 19 to Section C and focus on the general bilevel optimization problem first.

Appendix B. Proofs of Results in Section 2

B.1 Proof of Lemma 6

Proof. For the general bilevel optimization problem (1.1), the Hessian of Φ(x) can be computed as

∇²Φ(x) = ∇²_{xx}f(x, y*(x)) + (∂y*(x)/∂x)ᵀ∇²_{yx}f(x, y*(x)) + (∂²y*(x)/∂x²)[∇_y f(x, y*(x))] + ∇²_{xy}f(x, y*(x))(∂y*(x)/∂x) + (∂y*(x)/∂x)ᵀ∇²_{yy}f(x, y*(x))(∂y*(x)/∂x), (B.1)

which is obtained by taking the derivative of (1.2).
By further taking the derivative with respect to x on (1.3), we obtain: 3 xxyg(x, y (x)) + y (x) x 3 yxyg(x, y (x)) + 2y (x) 2x 2 yyg(x, y (x)) x 3 xyyg(x, y (x)) + y (x) x 3 yyyg(x, y (x)) y (x) Huang and Chen and Ji and Ma and Lai By (B.1), we have 2Φ(x) 2Φ(x ) = 2 xxf(x, y (x)) + y (x) x 2 yxf(x, y (x)) + 2y (x) 2x yf(x, y (x)) x 2 xyf(x, y (x)) + y (x) x 2 yyf(x, y (x)) y (x) 2 xxf(x , y (x )) + y (x ) x 2 yxf(x , y (x )) + 2y (x ) 2x yf(x , y (x )) x 2 xyf(x , y (x )) + y (x ) x 2 yyf(x , y (x )) y (x ) 2 xxf(x, y (x)) 2 xxf(x , y (x )) | {z } (I) x 2 yxf(x, y (x)) y (x ) x 2 yxf(x , y (x )) | {z } (II) 2x yf(x, y (x)) 2y (x ) 2x yf(x , y (x )) | {z } (III) x 2 yyf(x, y (x)) y (x) x 2 yyf(x , y (x )) y (x ) | {z } (IV ) where the inequality is due to the triangle inequality and the fact that 2 yxf(x, y (x)) = 2 xyf(x, y (x)) for any smooth functions f(x, y). We then bound the terms (I) (IV ). By Ghadimi and Wang (2018)[Lemma 2.2], we know that y (x) is ℓ µ-Lipschitz continuous. Therefore, we can bound the first term as (I) = 2 xxf(x, y (x)) 2 xxf(x , y (x )) 2 xxf(x, y (x)) 2 xxf(x , y (x)) + 2 xxf(x , y (x)) 2 xxf(x , y (x )) ρ x x + y (x) y (x ) Efficiently Escaping Saddle Points in Bilevel Optimization where the second inequality applies Assumption 3 and the last inequality is obtained by the Lipschitz continuity of y (x). For the second term (II), we have the following bound: (II) =2 y (x) x 2 yxf(x, y (x)) y (x ) x 2 yxf(x , y (x )) x 2 yxf(x, y (x)) y (x) x 2 yxf(x , y (x )) x 2 yxf(x , y (x )) y (x ) x 2 yxf(x , y (x )) 2 yxf(x, y (x)) 2 yxf(x , y (x )) 2 yxf(x , y (x )) 2 yxf(x, y (x)) 2 yxf(x , y (x )) + 2ℓ y (x) x x + 2ℓ y (x) where the second inequality is by the Cauchy-Schwarz inequality, and the last two inequalities are obtained by Assumption 3. 
Moreover, we can bound y (x) 2 xyg(x, y (x)) 2 yyg(x, y (x)) 1 2 xyg(x , y (x )) 2 yyg(x , y (x )) 1 2 xyg(x, y (x)) 2 yyg(x, y (x)) 1 2 xyg(x, y (x)) 2 yyg(x , y (x )) 1 + xyg(x, y (x)) 2 yyg(x , y (x )) 1 2 xyg(x , y (x )) 2 yyg(x , y (x )) 1 ℓ 2 yyg(x, y (x)) 1 2 yyg(x , y (x )) 1 + 1 2 xyg(x, y (x)) 2 xyg(x , y (x )) µ2 2 yyg(x, y (x)) 2 yyg(x , y (x )) + 1 2 xyg(x, y (x)) 2 xyg(x , y (x )) ρ x x + y (x) y (x ) (B.6) where the first inequality is by equation (1.3), the third inequality applies Proposition 33 and the forth inequality follows the fact that X 1 Y 1 X 1 X Y Y 1 and Proposition 33. Therefore, we have the following bound for the second term: 2! x x . (B.7) Huang and Chen and Ji and Ma and Lai The third term (III) can be bounded by: (III) = 2y (x) 2x yf(x, y (x)) 2y (x ) 2x yf(x , y (x )) yf(x, y (x)) yf(x , y (x )) + yf(x , y (x )) 2y (x) Therefore, we need to give upper bounds for both 2y (x) 2x . Note that equation (B.2) yields 2x = 3 xxyg(x, y (x)) + y (x) x 3 yxyg(x, y (x)) + y (x) x 3 xyyg(x, y (x)) x 3 yyyg(x, y (x)) y (x) 2 yyg(x, y (x)) 1, (B.9) which, combined with Proposition 33, leads to Efficiently Escaping Saddle Points in Bilevel Optimization Furthermore, 2y (x) 2x can be upper bounded by 3 xxyg(x, y (x)) + y (x) x 3 yxyg(x, y (x)) + y (x) x 3 xyyg(x, y (x)) x 3 yyyg(x, y (x)) y (x) 2 yyg(x, y (x)) 1 3 xxyg(x , y (x )) + y (x ) x 3 yxyg(x , y (x )) + y (x ) x x y yg(x , y (x )) x 3 yyyg(x , y (x )) y (x ) 2 yyg(x , y (x )) 1 3 xxyg(x, y (x)) + y (x) x 3 yxyg(x, y (x)) + y (x) x 3 xyyg(x, y (x)) x 3 yyyg(x, y (x)) y (x) 3 xxyg(x , y (x )) + y (x ) x 3 yxyg(x , y (x )) x 3 xyyg(x , y (x )) + y (x ) x 3 yyyg(x , y (x )) y (x ) # 2 yyg(x , y (x )) 1 3 xxyg(x, y (x)) + y (x) x 3 yxyg(x, y (x)) + y (x) x 3 xyyg(x, y (x)) x 3 yyyg(x, y (x)) y (x) 2 yyg(x, y (x)) 1 2 yyg(x , y (x )) 1 µ 3 xxyg(x, y (x)) 3 xxyg(x , y (x )) x 3 yxyg(x, y (x)) y (x ) x 3 yxyg(x , y (x )) x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x , y (x )) y (x ) 2 2 yyg(x, y (x)) 1 2 
yyg(x , y (x )) 1 x 3 yxyg(x, y (x)) y (x ) x 3 yxyg(x , y (x )) x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x , y (x )) y (x ) 3 x x , (B.11) Huang and Chen and Ji and Ma and Lai where in the third inequality, we used Proposition 33 and the last inequality is due to (B.10), Assumption 3, and X 1 Y 1 X 1 X Y Y 1 . To bound the term x 3 yxyg(x, y (x)) y (x ) x 3 yxyg(x , y (x )) , we follow the computation of (II) and get x 3 yxyg(x, y (x)) y (x) x 3 yxyg(x, y (x)) 2! x x . (B.12) Moreover, we have x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x , y (x )) y (x ) x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x, y (x)) y (x) x 3 yyyg(x , y (x )) y (x) x 3 yyyg(x , y (x )) y (x) x 3 yyyg(x , y (x )) y (x ) 2 3 yyyg(x, y (x)) 3 yyyg(x , y (x )) (B.13) where the second inequality is due to y (x) µ and Proposition 33 and the last inequality is obtained by (B.6) and Assumption 3. Combining (B.11) - (B.13) leads to Efficiently Escaping Saddle Points in Bilevel Optimization Combining (B.8), (B.10), and (B.14) yields 3 x x + M 2y (x) Finally, similar to the computation in (B.13), we have x 2 yyf(x, y (x)) y (x) x 2 yyf(x , y (x )) y (x ) 2 2 yyf(x, y (x)) 2 yyf(x , y (x )) (B.16) The bounds of (I) (IV ) together lead to ρφ = ρ + 2ℓρ + Mν µ + 2Mℓν + ρℓ2 µ + 4Mρ2 + 2ℓ2ρ which completes the proof. In our proof of the main theorem, it is crucial to bound the estimation error for the hypergradient. For the AID method, we have the following lemma. Lemma 35 Suppose Assumption 3 holds. For the AID approach (i.e., the AID option in Algorithm 1) with parameters D = O(κ), N = O( κ) , we have: b Φ(xk) Φ(xk) 1 + ℓ µ 1 + 2 κ ℓ+ ρM 2 + 2ℓ κ κ 1 κ + 1 (B.18) where Γ1 = b + 2η κ2 + 2κ + ρM(1 + κ) b = y0 y (x0) + v0 v 0 , (B.20) and bvk = 2 yyg(xk, y D k ) 1 yf(xk, y D k ). Huang and Chen and Ji and Ma and Lai Proof. First note that the convergence rates of CG for the quadratic programming (see, e.g., eq. (17) in Grazzi et al. 
(2020)) and GD for strongly convex optimization (see, e.g., Theorem 2.1.14 in Nesterov (2004)) yield v N k bvk 2 κ κ 1 κ + 1 N v0 k bvk , (B.21) y D k y (xk) 1 µ 2 y0 k y (xk) . (B.22) Denote v k = 2 yyg(xk, y (xk)) 1 yf(xk, y (xk)). Note that Φ(xk) is defined in (1.5), i.e., Φ(xk) = xf(xk, y (xk)) 2 xyg(xk, y (xk))v k, and b Φ(x) is defined in step 9 of Algorithm 1, i.e., b Φ(xk) = xf(xk, y D k ) 2 xyg(xk, y D k )v N k . We then have the following inequality holds b Φ(xk) Φ(xk) xf(xk, y (xk)) xf(xk, y D k ) + 2 xyg(xk, y D k ) v k v N k (B.23) + 2 xyg(xk, y (xk)) 2 xyg(xk, y D k ) v k ℓ y (xk) y D k + ℓ v k v N k + ρ v k y D k y (xk) y (xk) y D k + ℓ v k v N k . (B.24) Here the last inequality follows from v k ( 2 yyg(xk, y (xk))) 1 yf(xk, y (xk)) M µ in which we use Proposition 33. Next we give an upper bound of v k v N k : v k v N k v k bvk + v N k bvk v k bvk + 2 κ κ 1 κ + 1 1 + 2 κ κ 1 κ + 1 N v k bvk + 2 κ κ 1 κ + 1 = 1 + 2 κ κ 1 κ + 1 N 2 yyg(xk, y D k ) 1 yf(xk, y D k ) 2 yyg(xk, y (xk)) 1 yf(xk, y (xk)) + 2 κ κ 1 κ + 1 1 + 2 κ κ 1 κ + 1 y D k y (xk) + 2 κ κ 1 κ + 1 y D k y (xk) + 2 κ κ 1 κ + 1 N v0 k v k , (B.25) Efficiently Escaping Saddle Points in Bilevel Optimization where the second inequality is due to (B.21) and in the second to last inequality we have used Assumption 3, Proposition 33, and X 1 Y 1 X 1 X Y Y 1 . Plugging (B.22) and (B.25) into (B.24) leads to b Φ(xk) Φ(xk) 1 + ℓ µ 1 + 2 κ ℓ+ ρM 2 y0 k y (xk) + 2ℓ κ κ 1 κ + 1 N v0 k v k . (B.26) Secondly, by the warm-start strategy y0 k = y D k 1, v0 k = v N k 1 in the inner loop, we have y0 k y (xk) y D k 1 y (xk 1) + y (xk 1) y (xk) 2 y0 k 1 y (xk 1) + κ xk 1 xk 2 y0 k 1 y (xk 1) + κη b Φ(xk 1) , where the second inequality is due to (B.22) and the fact that y (x) is κ-Lipschitz continuous Ghadimi and Wang (2018)[Lemma 2.2], the third inequality follows the update of xk. The next step is to bound v0 k v k . 
Before that, we prepare the following inequality: v k 1 v k = 2 yyg(xk 1, y (xk 1)) 1 yf(xk 1, y (xk 1)) 2 yyg(xk, y (xk)) 1 yf(xk, y (xk)) M 2 yyg(xk 1, y (xk 1)) 1 2 yyg(xk, y (xk)) 1 µ yf(xk 1, y (xk 1)) yf(xk, y (xk)) κ2 + κ + ρM(1 + κ) (B.28) where in the first inequality we used Proposition 33 and the second inequality follows Assumption 3. We then have the bound of v0 k v k as follows v0 k v k = v N k 1 v k v N k 1 v k 1 + v k 1 v k y D k 1 y (xk 1) + 2 κ κ 1 κ + 1 N v0 k 1 v k 1 + v k 1 v k 2 y0 k 1 y (xk 1) + 2 κ κ 1 κ + 1 N v0 k 1 v k 1 + κ2 + κ + ρM(1 + κ) 2 y0 k 1 y (xk 1) + 2 κ κ 1 κ + 1 N v0 k 1 v k 1 + η κ2 + κ + ρM(1 + κ) b Φ(xk 1) , Huang and Chen and Ji and Ma and Lai where the second inequality is by (B.25), the third inequality is obtained by (B.22) and (B.28). Summing (B.27) and (B.29), we obtain y0 k y (xk) + v0 k v k 2 y0 k 1 y (xk 1) + 2 κ κ 1 κ + 1 N v0 k 1 v k 1 + κ2 + 2κ + ρM(1 + κ) η b Φ(xk 1) . Set the parameters as 2 (1 + 2 κ) ℓ µ + ρM µ2 + 1 / log(1 κ 1) = O(κ), N log 1 4 κ/ log κ 1 κ + 1 such that we can have y0 k y (xk) + v0 k v k 2 y0 k 1 y (xk 1) + v0 k 1 v k 1 + κ2 + 2κ + ρM(1 + κ) η b Φ(xk 1) k b + κ2 + 2κ + ρM(1 + κ) k 1 j b Φ(xj) b + 2η κ2 + 2κ + ρM(1 + κ) (B.32) where the last inequality follows b Φ(x) M + ℓM µ by Proposition 33. Combining (B.32) with (B.26), we have b Φ(xk) Φ(xk) µ 1 + 2 κ ℓ+ ρM 2 + 2ℓ κ κ 1 κ + 1 N Γ1, (B.33) where the inequality follows from the inequality ab + cd (a + c)(b + d) for any positive a, b, c, d. This completes the proof. Efficiently Escaping Saddle Points in Bilevel Optimization B.2 Proof of Lemma 10 Proof. By Lemma 34, Φ(x) is Lφ-smooth, which yields Φ(xk+1) Φ(xk) + Φ(xk), xk+1 xk + LΦ 2 xk+1 xk 2 2 Φ(xk) + b Φ(xk), xk+1 xk + Φ(xk) b Φ(xk), xk+1 xk 2 xk+1 xk 2 2 Φ(xk) + b Φ(xk), xk+1 xk + 1 4η xk+1 xk 2 + η Φ(xk) b Φ(xk) 2 2 xk+1 xk 2 2 4 b Φ(xk) 2 2 + η Φ(xk) b Φ(xk) 2, where the third inequality is obtained by Young s inequality and the last inequality uses η = 1 Lφ . 
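Several of the bounds above repeatedly invoke the resolvent-difference inequality ‖X⁻¹ − Y⁻¹‖ ≤ ‖X⁻¹‖‖X − Y‖‖Y⁻¹‖, which follows from the identity X⁻¹ − Y⁻¹ = X⁻¹(Y − X)Y⁻¹. A quick numerical sanity check (our illustration, not part of the proof):

```python
import numpy as np

def inv_diff_bound(X, Y):
    """Return (lhs, rhs) of the spectral-norm inequality
    ||X^{-1} - Y^{-1}|| <= ||X^{-1}|| * ||X - Y|| * ||Y^{-1}||,
    which follows from X^{-1} - Y^{-1} = X^{-1}(Y - X)Y^{-1}."""
    Xi, Yi = np.linalg.inv(X), np.linalg.inv(Y)
    lhs = np.linalg.norm(Xi - Yi, 2)
    rhs = np.linalg.norm(Xi, 2) * np.linalg.norm(X - Y, 2) * np.linalg.norm(Yi, 2)
    return lhs, rhs

rng = np.random.default_rng(0)
for _ in range(100):
    # random symmetric positive definite matrices, mimicking the
    # lower-level Hessians with mu*I <= H (here mu = 1)
    B1, B2 = rng.normal(size=(2, 5, 5))
    X = B1 @ B1.T + np.eye(5)
    Y = B2 @ B2.T + np.eye(5)
    lhs, rhs = inv_diff_bound(X, Y)
    assert lhs <= rhs * (1 + 1e-12)
```

For X = 2I and Y = I the inequality is tight: both sides equal 1/2.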
B.3 Proof of Lemma 11 Proof. The proof of Lemma 11 closely follows Jin et al. (2021)[Lemma 22]. We first define two sequences {xk}, {x k} that are generated by Algorithm 1 with initial points x0 and x 0, respectively. That is, xk+1 = xk η b Φ(xk), x k+1 = x k η b Φ(x k). We require the two initial points to satisfy the following conditions: Condition (i): max{ x0 x , x 0 x } ηr; Condition (ii): x0 x 0 = ηr0e1, where e1 is the minimum eigenvector of 2Φ( x) with e1 = 1 and r0 > ω := 22 ιLφS , and ρφ . (B.34) Note that the parameters are given in (3.2). We show that for these two sequences, the following inequality must hold: min{Φ(x T ) Φ(x0), Φ(x T ) Φ(x 0)} F, (B.35) where F is defined in (2.9). We now prove (B.35) by contradiction. Assume the contrary of (B.35) holds, i.e.: min{Φ(x T ) Φ(x0), Φ(x T ) Φ(x 0)} > F. (B.36) Huang and Chen and Ji and Ma and Lai First, by the update of xk, we have for any τ k T : t=1 xt xt 1 t=1 xt xt 1 2 # 1 b Φ(xt 1) 2 # 1 b Φ(xt 1) 2 # 1 Φ(x0) Φ(x T ) + η t=1 Φ(xt) b Φ(xt) 2 ! where the last inequality is obtained by Lemma 10. We have for any k T : max{ xk x , x k x } max{ xk x0 , x k x 0 } + max{ x0 x , x 0 x } 4ηT F + 4η2T 2 17 6400ι4 ϵ2 + ηr ρφ + 1 400ι3 ϵ ρφ + 1 400ι r ϵ ρφ S , (B.38) where the second inequality uses (B.37), (B.36), (2.7), and Condition (i), the third inequality is due to (3.2) and (2.9), and the fourth inequality is due to (3.3) and (3.4). On the other hand, we can write the update equation for the difference ˆxk := xk x k as: ˆxk+1 = ˆxk η h b Φ(xk) b Φ(x k) i = ˆxk η Φ(xk) Φ(x k) η h b Φ(xk) Φ(xk) + Φ(x k) b Φ(x k) i = (I ηH)ˆxk η ( 1,kˆxk + 2,k) = (I ηH)k+1ˆx0 | {z } p(k+1) t=0 (I ηH)k t ( 1,tˆxt + 2,t) | {z } q(k+1) where we denote H = 2Φ( x), 2Φ(x k + θ(xk x k)) H dθ, 2,k = h b Φ(xk) Φ(xk) + Φ(x k) b Φ(x k) i . Efficiently Escaping Saddle Points in Bilevel Optimization For the two parts q(k), p(k), we show that p(k) is the dominant term by proving q(k) p(k) /2, k [T ]. 
(B.40) We now prove (B.40) by induction. First, (B.40) holds trivially when k = 0 becase q(0) = 0. Denote λmin( 2Φ( x)) = γ, which implies γ ρφϵ. Assume (B.40) holds for any t k. Since ˆx0 = ηr0e1, we have for any t k: ˆxt p(t) + q(t) 3 2 (I ηH)tˆx0 = 3 2(1 + ηγ)tηr0. (B.41) Therefore, at step k + 1 we have t=0 (I ηH)k t ( 1,tˆxt + 2,t) t=0 (I ηH)k t 1,tˆxt t=0 (I ηH)k t 2,t (I ηH)k t ˆxt + η (I ηH)k t 2,t t=0 (1 + ηγ)kηr0 + 2η t=0 (1 + ηγ)k Φ(xk) b Φ(xk) 2ηρφS T (1 + ηγ)kηr0 + 2ηT (1 + ηγ)k Φ(xk) b Φ(xk) 2ηρΦS T (1 + ηγ)kηr0 2ηρΦS T p(k + 1) . Here the second inequality is by 1,k ρφ max{ xk x , x k x } ρφS which uses (B.38). The third inequality is due to (B.41) and the fact that I ηH 0 which is because η = 1/Lφ and λmax(H) Lφ. The fifth inequality applies (2.7), i.e., Φ(xk) b Φ(xk) ϵ 16ι22ι 1 4ρφS ηr0. By noting 2ηρφS T = 1/2, we complete the proof of (B.40). Finally, (B.40) implies max{ x T x , x T x } 1 2[ p(T ) q(T ) ] 1 =(1 + ηγ)T ηr0 4 (1 + η ρφϵ)T ηr0 4 2ι 2ηr0 > S , where the second to last inequality uses the fact (1 + x)1/x 2 for any x (0, 1]. This contradicts with (B.38), which finishes the proof of (B.35). We then characterize the probability, which follows the ideas in Jin et al. (2021). Recall x0 Uniform(B x(ηr)). We refer to B x(ηr) the perturbation ball, and define the stuck region within the perturbation ball to be the set of points starting from which GD requires more than T steps to escape: X := {x B x(ηr) | {xt} is GD sequence with x0 = x, and Φ(x T ) Φ(x0) > F}. Huang and Chen and Ji and Ma and Lai Although the shape of the stuck region can be very complicated, we know that the width of X along the e1 direction is at most ηω. That is, Vol(X) Vol(Bd 1 0 (ηr))ηω. Therefore, Pr(x0 X) = Vol(X) Vol(Bd x(ηr)) ηω Vol(Bd 1 0 (ηr)) Vol(Bd 0(ηr)) d ρϵ ι228 ι. 
On the event {x0 X}, due to our parameter choice in (3.2), (3.3), (2.9) and (B.34), we have: Φ(x T ) Φ( x) = [Φ(x T ) Φ(x0)] + [Φ(x0) Φ( x)] F + ϵηr + Lφη2r2 where the first inequality uses the Lφ-smoothness of Φ( ). This finishes the proof. B.4 Proof of Theorem 7 Proof. For the AID method, we characterize the iteration complexity for D and N so that (2.7) holds. By Lemma 35, we require D and N to satisfy µ 1 + 2 κ ℓ+ ρM 2 + 2ℓ κΓ1 κ 1 κ + 1 17 80ι2 , 1 16ι22ι It is easy to verify that (B.43) holds when µ (1 + 2 κ) ℓ+ ρM 17 80ι2 , 1 16ι22ι o ϵ / log 1 1 κ 1 = O κ log 1 17 80ι2 , 1 16ι22ι o ϵ / log 1 + κ = O κ log 1 (B.44) Moreover, it is easy to verify that the right hand side of (B.43) is smaller than ϵ/5. Therefore, by choosing D and N as in (B.44), we know that Φ(xk) b Φ(xk) ϵ 5, k. (B.45) There are two possible cases to consider. Case 1: b Φ(xk) > 4 5ϵ and k kperturb > T . In this case, combining (B.45) and Lemma 10 leads to Φ(xk+1) Φ(xk) 4 25Lφ ϵ2 + 1 25Lφ ϵ2 = Φ(xk) 3 25Lφ ϵ2. Therefore, the total iteration number of Case 1 can be bounded by 25Lφ(Φ(x0) Φ ) 3ϵ2 . (B.46) Efficiently Escaping Saddle Points in Bilevel Optimization Case 2: k kperturb T . This case means that we are within T iterations of the last perturbation step, i.e., the step 10 in Algorithm 1. Suppose the last perturbation step happened at the k-th iteration. Therefore, from the step 10 in Algorithm 1 we know that b Φ(x k) 4 5ϵ. This together with (B.45) implies Φ(x k) ϵ. Now there are two cases to further consider. Case 2(i). If λmin( 2Φ(x k)) ρφϵ, then according to Lemma 11 we know that with probability at least 1 δ it holds that Φ(x k+T ) Φ(x k) F/2. So the total iteration number in this case is bounded by (Φ(x0) Φ )T F/2 . (B.47) Case 2(ii). If λmin( 2Φ(x k)) > ρφϵ, then we have already found an ϵ-local minimum of Φ(x). 
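The two-case structure of this proof mirrors the algorithm's run-time behavior: take gradient steps while the (estimated) gradient is large, and inject a small uniform perturbation when it is small. A toy single-level sketch of this logic (our illustration, with an exact gradient oracle in place of the hypergradient estimate ∇̂Φ):

```python
import numpy as np

def perturbed_descent(grad, x0, eta=0.1, eps=1e-3, radius=1e-2,
                      K=2000, seed=0):
    """Case 1: ||grad|| large -> a gradient step provably decreases Phi.
    Case 2: ||grad|| small -> perturb uniformly; near a strict saddle the
    subsequent steps escape along the negative-curvature direction."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(K):
        g = grad(x)
        if np.linalg.norm(g) > eps:
            x = x - eta * g                                     # Case 1
        else:
            x = x + rng.uniform(-radius, radius, size=x.shape)  # Case 2
    return x
```

For example, Φ(x) = x₁²/2 − x₂²/2 + x₂⁴/4 has a saddle at the origin and minima at (0, ±1); started exactly at the saddle, the iterates escape after a perturbation and settle near a minimum.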
Therefore, combining (B.46) and (B.47), the total number of iterations before we visit an ϵ-local minimum can be bounded by

K = (Φ(x₀) − Φ*)T/(F/2) + 25Lφ(Φ(x₀) − Φ*)/(3ϵ²) = O(κ³ϵ⁻²).

This completes the proof.

Appendix C. Proofs of Results in Section 3

C.1 Proof of Proposition 14

Proof. By Danskin's theorem, the gradient of Φ(x) is ∇Φ(x) = ∇_x f(x, y*(x)). Therefore the Hessian of Φ(x) is given by

∇²Φ(x) = ∇²_{xx}f(x, y*(x)) + ∇²_{xy}f(x, y*(x))(∂y*(x)/∂x). (C.1)

Note that the optimality condition for the max-player is ∇_y f(x, y*(x)) = 0, which leads to

∇²_{yx}f(x, y*(x)) + ∇²_{yy}f(x, y*(x))(∂y*(x)/∂x) = 0. (C.2)

Combining (C.1) and (C.2) yields

∇²Φ(x) = ∇²_{xx}f(x, y*(x)) − ∇²_{xy}f(x, y*(x))∇²_{yy}f(x, y*(x))⁻¹∇²_{yx}f(x, y*(x)). (C.3)

Since f(x, y) is µ-strongly concave with respect to y, the second term on the right-hand side of (C.3), i.e., −∇²_{xy}f(x, y*(x))∇²_{yy}f(x, y*(x))⁻¹∇²_{yx}f(x, y*(x)), is always positive semidefinite. Therefore, we have the following conclusions. A saddle point of Φ(x) satisfies λ_min(∇²Φ(x)) < 0, which, together with (C.3), implies λ_min(∇²_{xx}f(x, y*(x))) < 0; therefore, it cannot be a strict local Nash equilibrium. A strict local Nash equilibrium of f(x, y) satisfies λ_min(∇²_{xx}f(x, y*(x))) > 0, which yields λ_min(∇²Φ(x)) > 0; so it must be a local minimum of Φ(x).

C.2 Proof of Proposition 16

Proof. A local minimum of Φ(x) satisfies

∇Φ(x) = 0, ∇²Φ(x) ⪰ 0. (C.4)

According to (C.3), the inequality in (C.4) is equivalent to

∇²_{xx}f(x, y*(x)) − ∇²_{xy}f(x, y*(x))∇²_{yy}f(x, y*(x))⁻¹∇²_{yx}f(x, y*(x)) ⪰ 0. (C.5)

Moreover, for nonconvex-strongly-concave problems it holds that ∇²_{yy}f(x, y) ≺ 0. Therefore, we only need to show that ∇Φ(x) = 0 is equivalent to ∇f(x, y) = 0. Notice that for a pair (x, y) satisfying ∇_x f(x, y) = 0 and ∇_y f(x, y) = 0, we have y = y*(x) by strong concavity, and hence ∇_x f(x, y*(x)) = ∇Φ(x) = 0. Furthermore, when ∇Φ(x) = 0, we can always choose y = y*(x) so that ∇_x f(x, y) = 0 and ∇_y f(x, y) = 0. Therefore, these two conditions are equivalent to each other.
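Formula (C.3) is easy to verify numerically on the quadratic model f(x, y) = ½xᵀAx + xᵀBy − ½µ‖y‖², for which y*(x) = Bᵀx/µ and (C.3) reduces to ∇²Φ = A + BBᵀ/µ. A finite-difference check (our illustration, with hypothetical helper names):

```python
import numpy as np

def hess_phi_formula(A, B, mu):
    """Formula (C.3) on the quadratic model: H = A - B (d2yy f)^{-1} B^T
    with d2yy f = -mu*I, i.e. H = A + B B^T / mu."""
    return A - B @ np.linalg.solve(-mu * np.eye(B.shape[1]), B.T)

def phi(A, B, mu, x):
    """Phi(x) = max_y f(x, y); for this model y*(x) = B^T x / mu."""
    y = B.T @ x / mu
    return 0.5 * x @ A @ x + x @ B @ y - 0.5 * mu * y @ y

def hess_phi_fd(A, B, mu, x, h=1e-4):
    """Second-order central differences of Phi (exact up to rounding,
    since Phi is quadratic in x)."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (phi(A, B, mu, x + e_i + e_j) - phi(A, B, mu, x + e_i)
                       - phi(A, B, mu, x + e_j) + phi(A, B, mu, x)) / h**2
    return H
```

Note that the correction term BBᵀ/µ is positive semidefinite, matching the sign discussion after (C.3): maximizing over y can only add curvature to Φ.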
When function Φ(x) has a strict local minimum, the local minimax point is guaranteed to exist. C.3 Proof of Theorem 19 To prove Theorem 19, we need the following lemmas. The first lemma shows that under Assumption 17, the function Φ(x) is smooth and Hessian-Lipschitz continuous. Lemma 36 Chen et al. (2021b)[Proposition 1] Suppose f(x, y) satisfies Assumption 17, we have Φ(x) is Lφ-smooth, where Lφ = ℓ(1 + κ). Φ(x) is ρφ-Hessian Lipschitz continuous, i.e., (1.7) holds, where ρφ = ρ(1 + κ)3. The second lemma gives an upper bound for the gradient estimation error with the warm start strategy. Lemma 37 Suppose Assumption 17 holds. For the GDmax algorithm (i.e., the GDmax option in Algorithm 1) with parameters D = O(κ), we have, b Φ(xk) Φ(xk) ℓ b + 2ηκ M + ℓM where b = y0 y (x0) . Proof. The gradient estimation error for minimax problem can be bounded by b Φ(xk) Φ(xk) = xf(xk, y D k ) xf(xk, y (xk)) ℓ y D k y (xk) 2 y0 k y (xk) , (C.7) where the last inequality follows (B.22). By the warm start strategy y0 k = y D k 1, we have y0 k y (xk) y D k 1 y (xk 1) + y (xk 1) y (xk) 2 y0 k 1 y (xk 1) + κ xk xk 1 2 y0 k 1 y (xk 1) + ηκ b Φ(xk 1) . Efficiently Escaping Saddle Points in Bilevel Optimization D > 2 log 2/ log 1 1 κ 1 = O(κ), (C.9) y0 k y (xk) y D k 1 y (xk 1) + y (xk 1) y (xk) 2 y0 k 1 y (xk 1) + ηκ b Φ(xk 1) k y0 y (x0) + ηκ k 1 j b Φ(xk 1) b + 2ηκ M + ℓM where the last inequality uses b Φ(xk 1) M + ℓM µ for any k. Combining (C.10) and (C.7) yields b Φ(xk) Φ(xk) ℓ b + 2ηκ M + ℓM which completes the proof. We now give the proof of Theorem 19. Proof. By Lemma 37, we require D to satisfy ℓ b + 2ηκ M + ℓM 17 80ι2 , 1 16ι22ι It is easy to verify that ℓ b + 2ηκ M + ℓM 17 80ι2 , 1 16ι22ι o ϵ / log 1 1 κ 1 = O κ log 1 satisfies (C.12). The rest of the proof is the same as the proof of Theorem 7 in Section B.4. Appendix D. Proofs of Results in Section 4 D.1 Proof of Lemma 26 Proof. Define the following function: ˆΦx(u) = Φ(x + u) Φ(x) Φ(x) u. 
(D.1) We first characterize the required estimation error, which is used in the later proof. Specifically, we choose the inner iteration number D (step 9 of Algorithm 2) and N (used in steps Huang and Chen and Ji and Ma and Lai 6 and 12 of Algorithm 2) such that for any k T , the following inequalities hold: b Φ( x + uk) Φ( x + uk) min 17 40ι2 , 1 16ι22ι/4 , 91/3 2 8ι , 1 750ι2 M y D k ( x + uk) y ( x + uk)) 1 750ι3Lφ ϵ2. Note that both inequalities in (D.2) also imply the following inequality, which corresponds to the case uk = 0: b Φ( x) Φ( x) min 17 40ι2 , 1 16ι22ι/4 , 91/3 2 8ι , 1 750ι2 M y D k ( x) y ( x)) 1 750ι3Lφ ϵ2. Since the lower level problem is strongly convex, combining with Lemma 35, we can set D and N in Algorithm 2 as µ (1 + 2 κ) ℓ+ ρM 17 40ι2 , 1 16ι22ι/4 , 91/3 2 8ι , 1 750ι2 o ϵ , 2 log 750MLφι3Γ1 / log 1 1 κ 1 17 40ι2 , 1 16ι22ι/4 , 91/3 2 8ι , 1 750ι2 o ϵ / log 1 + κ = O κ log 1 (D.4) such that (D.2) holds in each iteration. Next, we show that with probability at least d ρφϵ ι228 ι/4, the following inequality holds: ˆΦ x(u T ) ˆΦ x(u0) F, (D.5) Efficiently Escaping Saddle Points in Bilevel Optimization where ˆΦx(u) is defined in (D.1). Note that it is easy to verify that ˆΦx(u) is Lφ-smooth, and it yields ˆΦ x(uk+1) ˆΦ x(uk) + ˆΦ x(uk), uk+1 uk + Lφ 2 uk+1 uk 2 2 =ˆΦ x(uk) + Φ( x + uk) Φ( x), uk+1 uk + Lφ 2 uk+1 uk 2 2 =ˆΦ x(uk) + b Φ( x + uk) b Φ( x), uk+1 uk + Φ( x + uk) b Φ( x + uk), uk+1 uk + b Φ( x) Φ( x), uk+1 uk + Lφ 2 uk+1 uk 2 2 η uk+1 uk 2 + 1 8η uk+1 uk 2 + 2η Φ( x + uk) b Φ( x + uk) 2 8η uk+1 uk 2 + 2η b Φ( x) Φ( x) 2 + 1 2η uk+1 uk 2 2 =ˆΦ x(uk) 1 4η uk+1 uk 2 + 2η Φ( x + uk) b Φ( x + uk) 2 + 2η b Φ( x) Φ( x) 2, (D.6) where the first equality is from (D.1), the second inequality is from (4.1) and Young s inequality. We then follow the same ideas as in the proof of Lemma 11. We design two coupling sequences {ut}, {wt} generated by i NEON (Algorithm 2) with initial points u0 and w0, respectively. 
We require the two sequences to satisfy: Condition (i). max{ u0 , w0 } ηr; and Condition (ii). u0 w0 = ηr0e1, where e1 is the minimum eigenvector of 2Φ( x) with e1 = 1 and r0 > ω := 22 ι/4LφS . The rest is to prove min{ˆΦx(u T ) ˆΦx(u0), ˆΦx(w T ) ˆΦx(w0)} F. (D.7) We prove (D.7) by contradiction. Assume the contrary holds: min{ˆΦ x(u T ) ˆΦ x(u0), ˆΦ x(w T ) ˆΦ x(w0)} > F. (D.8) First, by the update of uk (i.e., (4.1) or step 13 of Algorithm 2), we have for any τ k: t=1 ut ut 1 t=1 ut ut 1 2 # 1 t=1 ut ut 1 2 # 1 ˆΦ x(u0) ˆΦ x(u T ) + 2η t=1 Φ( x + ut) b Φ( x + ut) 2 + b Φ( x) Φ( x) 2 !! 4ηT F + 8η2T 2 17 800ι4 ϵ2 + ηr S , (D.9) where the fourth inequality is obtained by (D.6), the fifth inequality follows from (D.8), (D.2) and (D.3), and the last inequality is due to the parameter choice in (4.3), and ι 1, Lφ/ ρφϵ 1. Therefore, from (D.9) we have for any k T : max{ uk , wk } max{ uk u0 , wk w0 } + max{ u0 , w0 } S . (D.10) Huang and Chen and Ji and Ma and Lai On the other hand, we can write the update equation for the difference vk := uk wk as: vk+1 =vk η h b Φ( x + uk) b Φ( x + wk) i =vk η [ Φ( x + uk) Φ( x + wk)] η h b Φ( x + uk) Φ( x + uk) + Φ( x + wk) b Φ( x + wk) i = (I ηH)k+1v0 | {z } p(k+1) t=0 (I ηH)k t ( 1,tvt + 2,t) | {z } q(k+1) where we denote H = 2Φ( x), (D.12) 2Φ( x + wk + θ(uk wk)) H dθ, (D.13) 2,k = h b Φ( x + uk) Φ( x + uk) + Φ( x + wk) b Φ( x + wk) i . (D.14) We then prove q(k) p(k) /2, k [T ]. We prove it by induction. It is easy to check that it holds at k = 0. Assume it holds for any t k. Denote λmin( 2Φ( x)) = γ. Since v0 lies in the direction of the minimum eigenvector of 2Φ( x0), we have for any t k: vt p(t) + q(t) 3 2(1 + ηγ)tηr0. (D.15) At step k + 1, similar to (B.42), we have q(k + 1) 2ηρΦS T p(k + 1) , (D.16) where we used (D.2). Combining (D.16) with the choice of parameters in (4.3) finishes the proof of q(k) p(k) /2. 
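The coupling argument can be checked on an exact quadratic: if $\hat\Phi(u) = \frac{1}{2}u^\top H u$ with $\lambda_{\min}(H) = -\gamma < 0$, gradient descent maps the difference of the two sequences to $v_k = (I - \eta H)^k v_0$, which grows like $(1+\eta\gamma)^k$ along the negative-curvature direction $e_1$. A minimal numeric sketch of this mechanism (ours, with exact gradients and the noise terms $\Delta_{1,t}, \Delta_{2,t}$ set to zero):

```python
import numpy as np

gamma, eta, T = 0.5, 0.1, 60
H = np.diag([-gamma, 0.2, 1.0])            # saddle: one negative eigenvalue, e1 = (1, 0, 0)
r0 = 1e-6
u = np.array([r0, 1e-3, 1e-3])             # two coupled starts differing by r0 * e1
w = np.array([0.0, 1e-3, 1e-3])
for _ in range(T):
    u = u - eta * (H @ u)                  # exact gradient step on 0.5 u'Hu
    w = w - eta * (H @ w)

v = u - w                                  # difference obeys v_k = (I - eta H)^k v_0
assert np.allclose(v, [(1 + eta * gamma) ** T * r0, 0.0, 0.0])
assert np.linalg.norm(v) > 10 * r0         # geometric escape along e1
```

In the paper the estimation-error terms are nonzero, which is why the induction above must additionally verify that the perturbation part q(k) never exceeds half of the power-iteration part p(k).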
Therefore, we have max{ u T , w T } 1 2[ p(T ) q(T ) ] 1 =(1 + ηγ)T ηr0 4 2ι/4 2ηr0 > S , where the second to last inequality uses the fact (1 + x)1/x 2 for any x (0, 1]. This contradicts with (D.10), which finishes the proof of (D.7). To characterize the probability, we define the stuck region: X := {u B0(ηr) | {ut} is the i NEON sequence with u0 = u, and ˆΦ x(u T ) ˆΦ x(u0) > F}. Although the shape of the stuck region can be very complicated, we know that the width of X along the e1 direction is at most ηω. That is, Vol(X) Vol(Bd 1 0 (ηr))ηω. Therefore: Pr(u0 X) ηω Vol(Bd 1 0 (ηr)) Vol(Bd 0(ηr)) = ω r π Γ(d d ρϵ ι228 ι/4. Efficiently Escaping Saddle Points in Bilevel Optimization On the event {u0 X}, due to the choice of the parameters in (4.3), we have with probability at least 1 δ, where δ > ℓ d ρϵ ι228 ι/4, that ˆΦ x(u T ) ˆΦ x(u0) < F. Therefore, there exists some k T such that ˆΦ x(uk ) ˆΦ x(u0) < F and uτ S , τ < k . In other words, k is the first iteration that satisfies ˆΦ x(uk ) ˆΦ x(u0) < F. (D.17) By the update uk = uk 1 η(b Φ( x + uk 1) b Φ( x)), we can bound the norm of uk : uk uk 1 + η Φ( x + uk 1) Φ( x) + η b Φ( x + uk 1) Φ( x + uk 1) + η b Φ( x) Φ( x) uk 1 + ηLφ uk 1 + 2η b Φ( x + uk 1) Φ( x + uk 1) =2S + 2η b Φ( x + uk 1) Φ( x + uk 1) 91/3S , (D.18) where the second inequality is by the smoothness of Φ(x) given in Lemma 36, the equality is due to the parameter choice in (4.3), and the last inequality is obtained by (D.2). Moreover, by (D.17) and the smoothness of Φ(x), we have Φ( x + uk ) Φ( x) Φ( x) uk Φ( x + u0) Φ( x) Φ( x) u0 F Lφ 160000ι6L2 φ 1 25ι3 ρφ 1 320000 1 (D.19) where the third inequality is due to the parameter choice in (4.3), and the last inequality applies ι > 1 and Lφ/ ρφϵ > 1. 
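The stuck-region probability bound above is purely geometric: a set whose width along $e_1$ is at most $\eta\omega$ occupies only an $O(\omega/r)$ fraction of the perturbation ball $B_0(\eta r)$. A seeded Monte Carlo sketch of this volume argument (our own illustration; uniform sampling in the ball via the standard Gaussian-direction trick):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n = 3, 1.0, 10000
# Sample n points uniformly in the d-dimensional ball of radius r:
# random direction times radius scaled by U^(1/d).
g = rng.standard_normal((n, d))
u = g / np.linalg.norm(g, axis=1, keepdims=True) * r * rng.random(n)[:, None] ** (1 / d)

w = 0.01 * r                               # slab width along e1 (plays the role of eta*omega)
frac = np.mean(np.abs(u[:, 0]) < w / 2)    # fraction landing in the thin slab
assert frac < 0.05                         # only an O(w/r) fraction of draws get "stuck"
```

This is exactly why a random perturbation escapes with high probability: a single thin slab cannot capture more than a vanishing share of the ball's volume.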
From (D.2) we can get ˆΦ( x + uk ) ˆΦ( x) b Φ( x) uk Φ( x + uk ) Φ( x) Φ( x) uk ˆΦ( x + uk ) Φ( x + uk ) + ˆΦ( x) Φ( x) + b Φ( x) Φ(x) uk f( x + uk , y D k ( x + uk )) f( x + uk , y ( x + uk )) + f( x, y D k ( x)) f( x, y ( x)) + b Φ( x) Φ(x) uk M y D k ( x + uk ) y ( x + uk )) + M y D k ( x) y ( x) + 3S b Φ( x) Φ(x) 2 750ι3Lφ ϵ2 + 3 2 750ι3Lφ ϵ2 Lφ ρφϵ + 1 750ι3 Huang and Chen and Ji and Ma and Lai where the third inequality is due to Assumption 3 and (D.18), the fourth inequality uses (D.2), (D.3) and the parameter choice in (4.3), and the fifth inequality is due to Lφ > ρφϵ. Combining (D.19) and (D.20) we get ˆΦ( x + uk ) ˆΦ( x) b Φ( x) uk 1 320000 1 12800F, (D.21) i.e., the stopping criterion in step 14 of Algorithm 2 is satisfied. Therefore, this shows that with high probability, the Algorithm 2 terminates with the stopping criterion in step 14 being satisfied. When this happens, we have Φ(x + uk ) Φ(x) Φ(x) uk ˆΦ( x + uk ) ˆΦ( x) ˆ Φ( x) uk + ˆΦ( x + uk ) ˆΦ( x) ˆ Φ( x) uk Φ(x + uk ) Φ(x) Φ(x) uk 25 + 1 250 + 1 250 ρφ , (D.22) where we used (D.20). By the Hessian Lipschitz continuity of Φ(x), we have 1 2uk 2Φ(x)uk Φ(x + uk ) Φ(x) Φ(x) uk + ρφ 25 + 1 250 + 1 250 25 + 1 250 + 1 250 ρφ + 3 128ι3 where the second inequality follows from (D.22) and (D.18). Finally, by (D.18) we have 1125ι ρφϵ 1 40ι ρφϵ. (D.24) If NEON returns 0, by Bayes theorem, we have λmin( 2Φ(x)) ρφϵ with high probability 1 O(δ) for a sufficiently small δ. D.2 Proof of Theorem 30 We first provide several useful lemmas. Lemma 38 Xu et al. (2018)[Lemma 4] Suppose Assumption 28 holds. Let 2FD(x) = 1 n PDf i=1 2F(x; ξi). For any ϵ, δ (0, 1), x Rd when Df 16L2 log(2d/δ) ϵ2 , we have with probability at least 1 δ: 2FD(x) 2F(x) ϵ. (D.25) Efficiently Escaping Saddle Points in Bilevel Optimization The next two lemmas mimic Lemma 38 for first-order and third-order derivatives. Their proofs are mostly identical to that of Lemma 38, and hence we omit the details for brevity. 
Lemma 39 Suppose Assumption 28 holds. Let FD(x) = 1 n PDf i=1 F(x; ξi). For any ϵ, δ (0, 1), x Rd when Df 16M2 log(2d/δ) ϵ2 , we have with probability at least 1 δ: FD(x) F(x) ϵ. (D.26) Lemma 40 Suppose Assumption 28 holds. Let 3GD(x) = 1 n PDg i=1 3G(x; ζi). For any ϵ, δ (0, 1), x Rd when Dg 16ρ2 log(2d/δ) ϵ2 , we have with probability at least 1 δ: 3GD(x) 3G(x) ϵ. (D.27) The following two lemmas show that with sample batch sizes Df, Dg = max{O( 1 ρφϵ)}, we can bound the batch gradient and Hessian errors with high probability. Lemma 41 Suppose Assumptions 28 and 29 hold. Set batch sizes Dg = O( κ10 ϵ2 ), Df = O( κ6 ϵ2 ), with probability at least 1 δ, we have the following inequality holds: 2ΦD(x) 2Φ(x) 1 80ι ρΦϵ. (D.28) Proof. Note that the Hessian 2Φ(x) is given in (B.1). The Hessian estimation error between 2Φ(x) and 2ΦD(x) can be computed as 2Φ(x) 2ΦD(x) 2 xxf(x, y (x)) 2 xxf DF (x, y DG(x)) | {z } (I) x 2 yxf(x, y (x)) y DG(x) x 2 yxf DF (x, y DG(x)) | {z } (II) 2x yf(x, y (x)) 2y DG(x) 2x yf DF (x, y DG(x)) | {z } (III) x 2 yyf(x, y (x)) y (x) x 2 yyf DF (x, y DG(x)) y DG(x) | {z } (IV ) (D.29) where the inequality is obtained by triangle inequality and 2 yxf(x, y) = 2 xyf(x, y) for any smooth functions. Similar to the proof of Lemma 6, we bound the terms (I) (IV ). Huang and Chen and Ji and Ma and Lai The first term can be bounded as follows: (I) = 2 xxf(x, y (x)) 2 xxf DF (x, y DG(x)) 2 xxf(x, y (x)) 2 xxf DF (x, y (x)) + 2 xxf DF (x, y (x)) 2 xxf DF (x, y DG(x)) 2 xxf(x, y (x)) 2 xxf DF (x, y (x)) + ρ y (x) y DG(x) , (D.30) where we used the Assumption 3 in the last step. Secondly, we have (II) =2 y (x) x 2 yxf(x, y (x)) y DG(x) x 2 yxf DF (x, y DG(x)) 2 yxf(x, y (x)) 2 yxf DF (x, y DG(x)) 2 yxf DF (x, y DG(x)) µ 2 yxf(x, y (x)) 2 yxf DF (x, y (x)) + ρ y (x) y DG(x) where the first inequality is due to the triangle inequality and the Cauchy-Schwarz inequality and the last step uses Assumption 3 as well as Proposition 33. 
Furthermore, we bound y (x) = 2 xyg(x, y (x)) 2 yyg(x, y (x)) 1 2 xyg DG(x, y DG(x)) 2 yyg DG(x, y DG(x)) 1 2 xyg(x, y (x)) 2 yyg(x, y (x)) 1 2 xyg(x, y (x)) 2 yyg DG(x, y DG(x)) 1 + 2 xyg(x, y (x)) 2 yyg DG(x, y DG(x)) 1 2 xyg DG(x, y DG(x)) 2 yyg DG(x, y DG(x)) 1 µ2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + ρ y (x) y DG(x) µ 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) + ρ y (x) y DG(x) , (D.32) where the first equality is by (1.4) and the last step follows the fact that X 1 Y 1 X 1 X Y Y 1 , Assumption 3 and Proposition 33. Combining (D.31) and (D.32), we get 2 yxf(x, y (x)) 2 yxf DF (x, y (x)) + 2ℓ2 µ2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) + 4ℓ ρ y (x) y DG(x) . Efficiently Escaping Saddle Points in Bilevel Optimization The third term (III) can be written as 2x yf(x, y (x)) 2y DG(x) 2x yf DF (x, y DG(x)) yf(x, y (x)) yf DF (x, y DG(x)) + yf DF (x, y DG(x)) 2x 2y DG(x) 2x 2 yf(x, y (x)) yf DF (x, y DG(x)) 2x 2y DG(x) 2x where the last inequality follows (B.10) and Proposition 33. To bound 2y (x) 2x 2y DG(x) we follow the computation of the second to last inequality in (B.11) and get 2x 2y DG(x) 2x µ 3 xxyg(x, y (x)) 3 xxyg DG(x, y DG(x)) x 3 yxyg(x, y (x)) y DG(x) x 3 yxyg DG(x, y DG(x)) x 3 yyyg(x, y (x)) y (x) x 3 yyyg DG(x, y DG(x)) y DG(x) 2 2 yyg(x, y (x)) 1 2 yyg DG(x, y DG(x)) 1 . To bound the term y (x) x 3 yxyg(x, y (x)) y DG(x) x 3 yxyg DG(x, y DG(x)) , we follow the computation of (II) and get x 3 yxyg(x, y (x)) y DG(x) x 3 yxyg DG(x, y DG(x)) 3 yxyg(x, y (x)) 3 yxyg DG(x, y (x)) + 2ℓρ µ2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) + 2ℓν y (x) y DG(x) . 
Huang and Chen and Ji and Ma and Lai For the third term, we follow similar computation in (B.13) and give the following bound x 3 yyyg(x, y (x)) y (x) x 3 yyyg DG(x, y DG(x)) y DG(x) 2 3 yyyg(x, y (x)) 3 yyyg DG(x, y DG(x)) µ3 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) µ2 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ2 3 yyyg(x, y (x)) 3 yyyg DG(x, y (x)) y (x) y DG(x) , (D.37) where the last inequality is by (D.32), Assumption 3 and Proposition 33. Combining (D.35) - (D.37) together yields 2x 2y DG(x) 2x µ 3 xxyg(x, y (x)) 3 xxyg DG(x, y (x)) + 4ℓ µ2 3 yxyg(x, y (x)) 3 yxyg DG(x, y (x)) µ3 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + 4ρ µ2 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ4 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + 2ρℓ µ3 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ3 3 yyyg(x, y (x)) 3 yyyg DG(x, y (x)) 2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) µ + 4ℓν + 4ρ2 µ2 + 6ℓρ2 + ℓ2ν y (x) y DG(x) , Efficiently Escaping Saddle Points in Bilevel Optimization where the last inequality uses the fact that X 1 Y 1 X 1 X Y Y 1 . Plug (D.38) into (D.34) leads to 2 yf(x, y (x)) yf DF (x, y (x)) µ 3 xxyg(x, y (x)) 3 xxyg DG(x, y (x)) µ2 3 yxyg(x, y (x)) 3 yxyg DG(x, y (x)) µ3 3 yyyg(x, y (x)) 3 yyyg DG(x, y (x)) 2! 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ + 4ℓνM + 4ρ2M µ2 + 6ℓρ2M + ℓ2νM y (x) y DG(x) . (D.39) Finally, similar to the computation in (D.37), we bound the last term as x 2 yyf(x, y (x)) y (x) x 2 yyf DF (x, y DG(x)) y DG(x) µ3 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + 2ℓ2 µ2 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ2 2 yyf(x, y (x)) 2 yyf DF (x, y (x)) + 2ℓ3ρ y (x) y DG(x) . Huang and Chen and Ji and Ma and Lai Combining terms (I) (IV ) together, we have 2Φ(x) 2ΦD(x) 2 xxf(x, y (x)) 2 xxf DF (x, y (x)) + 2ℓ 2 yxf(x, y (x)) 2 yxf DF (x, y (x)) µ2 2 yyf(x, y (x)) 2 yyf DF (x, y (x)) 2 yf(x, y (x)) yf DF (x, y (x)) µ2 + 4ℓρM + 2ℓ3 2! 
2 yyg(x, y (x)) 2 yyg DG(x, y (x)) µ + 4ρM + 2ℓ2 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) µ 3 xxyg(x, y (x)) 3 xxyg DG(x, y (x)) + ℓ2M µ3 3 yyyg(x, y (x)) 3 yyyg DG(x, y (x)) µ2 3 yxyg(x, y (x)) 3 yxyg DG(x, y (x)) 2 + νM + 4ℓρ µ + 4ℓνM + 4ρ2M + 5ℓ2ρ +6ℓρ2M + ℓ2νM + 2ℓ3ρ µ3 + 2ρ2ℓ2M y (x) y DG(x) . (D.41) To deal with the last term y (x) y DG(x) , we utilize the strongly convexity of g(x, y) and g DG(x, y) with respect to y and have 0 yg(x, y DG(x)) yg(x, y (x)), y (x) y DG(x) + µ y (x) y DG(x) 2, 0 yg DG(x, y DG(x)) yg DG(x, y (x)), y (x) y DG(x) + µ y (x) y DG(x) 2. (D.42) Note that the optimality conditions are yg(x, y (x)) = 0, yg DG(x, y DG(x)) = 0, which combining with (D.42), yields 0 yg(x, y DG(x)) yg DG(x, y (x)), y (x) y DG(x) + 2µ y (x) y DG(x) 2. (D.43) Therefore, we can bound y (x) y DG(x) as y (x) y DG(x) 2µ yg(x, y DG(x)) yg DG(x, y (x)) 2µ yg(x, y DG(x)) + 1 2µ yg DG(x, y (x)) 2µ yg(x, y DG(x)) yg DG(x, y DG(x)) + 1 2µ yg(x, y (x)) yg DG(x, y (x)) , Efficiently Escaping Saddle Points in Bilevel Optimization where the first inequality is obtained from (D.43) and the last inequality is due to the optimality conditions. Plugging (D.44) into (D.41), we get 11 different terms in (D.41) and the major parts of them are 2 xxf(x, y (x)) 2 xxf DF (x, y (x)) , 2 yxf(x, y (x)) 2 yxf DF (x, y (x)) , 2 yyf(x, y (x)) 2 yyf DF (x, y (x)) , yf(x, y (x)) yf DF (x, y (x)) , 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) , 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) , 3 xxyg(x, y (x)) 3 xxyg DG(x, y (x)) , 3 yxyg(x, y (x)) 3 yxyg DG(x, y (x)) , 3 yyyg(x, y (x)) 3 yyyg DG(x, y (x)) , yg(x, y DG(x)) yg DG(x, y DG(x)) , yg(x, y (x)) yg DG(x, y (x)) . Note that each of the above terms is the difference between the empirical and population derivative. Therefore, each of them can be bounded by one of Lemmas 38 - 40. For example, we consider the term with the largest coefficient, e.g. O(κ5) yg(x, y (x)) yg DG(x, y (x)) . 
(D.46) By Lemma 38, we can choose Dg = O κ10 log(2d/δ ) ρφϵ so that the above term can be bounded by 1 880ι ρφϵ with probability 1 δ . For other terms, we can apply the similar techniques, and select batch sizes Df = O κ6 log(2d/δ ) ρφϵ , Dg = O κ10 log(2d/δ ) that each term can be bounded by 1 880ι ρφϵ with probability 1 δ . This completes the proof of (D.28) with probability 1 δ = (1 δ )11. Lemma 42 Suppose Assumptions 28 and 29 hold. Set batch sizes Dg = O(κ6ϵ 2), Df = O(κ2ϵ 2), with probability at least 1 δ, we have the following inequality holds: ΦD(x) Φ(x) 1 10ϵ. (D.47) Huang and Chen and Ji and Ma and Lai Proof. The gradient estimate error between Φ(x) and ΦD(x) can be computed as xf(x, y (x)) xf DF (x, y DG(x)) x yf(x, y (x)) y DG(x) x yf DF (x, y DG(x)) xf(x, y (x)) xf DF (x, y DG(x)) + y (x) yf(x, y (x)) yf DF (x, y (x)) yf DF (x, y (x)) xf(x, y (x)) xf DF (x, y (x)) + ℓ y (x) y DG(x) µ yf(x, y (x)) yf DF (x, y (x)) + ℓ2 µ y (x) y DG(x) µ2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + ρ y (x) y DG(x) µ 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) + ρ y (x) y DG(x) xf(x, y (x)) xf DF (x, y (x)) + ℓ µ yf(x, y (x)) yf(x, y (x)) µ2 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) + M 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) yg(x, y DG(x)) yg DG(x, y DG(x)) + yg(x, y (x)) yg DG(x, y (x)) ) , (D.48) where the first inequality is obtained by (1.2) and the triangle inequality, the third inequality is due to (D.32), Assumption 3 and Proposition 33. The last inequality is by (D.44). Nota that there are 6 different terms in the above gradient estimate error and they are xf(x, y (x)) xf DF (x, y (x)) , yf(x, y (x)) yf DF (x, y (x)) , 2 yyg(x, y (x)) 2 yyg DG(x, y (x)) , 2 xyg(x, y (x)) 2 xyg DG(x, y (x)) , yg(x, y DG(x)) yg DG(x, y DG(x)) , yg(x, y (x)) yg DG(x, y (x)) . For every term, we can apply one of Lemma 38 - Lemma 40 and set batch sizes Df = O κ2 log(2d/δ ) , Dg = O κ6 log(2d/δ ) such that each term can be bounded by 1 60ϵ with probability 1 δ . 
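The batch-size requirements in Lemmas 41 and 42 ultimately rest on the concentration bounds of Lemmas 38-40: averaging n i.i.d. derivative samples shrinks the spectral-norm error roughly like $1/\sqrt{n}$, so error $\epsilon$ needs $n = O(\epsilon^{-2})$ samples. A seeded Monte Carlo sketch of this effect (our own illustration; here the per-sample Hessian is modeled as the population Hessian plus zero-mean symmetric Gaussian noise, which is an assumption for the demo, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
H_pop = np.eye(d)                               # stand-in for the population Hessian grad^2 F(x)

def batch_error(n):
    """Spectral-norm error of the empirical Hessian averaged over n samples."""
    noise = np.zeros((d, d))
    for _ in range(n):
        N = rng.standard_normal((d, d))
        noise += (N + N.T) / 2                  # symmetric zero-mean noise per sample
    H_batch = H_pop + noise / n
    return np.linalg.norm(H_batch - H_pop, 2)   # ord=2 is the spectral norm

e_small, e_large = batch_error(10), batch_error(1000)
# Error decays roughly like 1/sqrt(n), so a batch of size O(1/eps^2) achieves error eps.
assert e_large < e_small
```

The log(2d/δ) factors in Lemmas 38-40 come from the matrix Bernstein inequality, which turns this average-case decay into a high-probability statement.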
This completes the proof of (D.47) with probability 1 δ = (1 δ )6. For the Stoc Bi O algorithm (Algorithm 3), we have the following descent lemma. Efficiently Escaping Saddle Points in Bilevel Optimization Lemma 43 Suppose Assumption 28 holds. We have with high probability the update xk+1 = xk ξ 80 q ϵ ρφ u in Algorithm 3 yields EΦ(xk+1) EΦ(xk) 1 3 803ι3 ρφ . (D.50) Proof. Combining Lemma 26 and Lemma 41, we have u out 2Φ(xk)uout uout 2 u out 2ΦD(xk)uout uout 2 + u out 2ΦD(xk)uout uout 2 u out 2Φ(xk)uout u out 2ΦD(xk)uout uout 2 + 2ΦD(xk) 2Φ(xk) 80ι ρφϵ. (D.51) By Xu et al. (2018)[Lemma 1], we have EΦ(xk+1) EΦ(xk) 1 3 803ι3 ρφ . (D.52) Lemma 44 Suppose Assumptions 28 and 29 hold. Set parameters as S = O(κ5ϵ 2), B = O(κ2ϵ 2), Df = O(κ2ϵ 2), Dg = O(κ2ϵ 2), Q = O(κ log 1 ϵ ), D = O(κ log 1 ϵ ), α = 2 ℓ+ µ, β = 1 4Lφ . With high probability, the update xk+1 = xk b Φ(xk) in Algorithm 3 yields E [Φ(xk+1) Φ(xk)] 1 16Lφ E Φ(xk) 2 + 1 400Lφ ϵ2. (D.53) To prove Lemma 44, we borrow the following two useful lemmas for stoc Bi O from Ji et al. (2021). Lemma 45 Ji et al. (2021)[Lemma 7] Suppose Assumptions 28 and 29 hold. Denote the Ek as the conditional expectation conditioning on xk and y D k , i.e., Ek[ ] = E[ | xk, y D k ]. We have Ek b Φ(xk) Φ(xk) 2 2 ℓ+ ℓ2 2 y D k y (xk) 2 + 2ℓ2M2(1 κ 1)2Q Huang and Chen and Ji and Ma and Lai Lemma 46 Ji et al. (2021)[Lemma 8] Suppose Assumptions 28 and 29 hold. Then, we have E b Φ(xk) Φ(xk) 2 4ℓ2M2 Df + 16η2ℓ4M2 µ2 1 B + 16ℓ2M2(1 κ 1)2Q 2 E y D k y (xk) 2. We now prove Lemma 44. Proof. By Assumptions 28, Φ(x) is Lφ smooth, which yields Φ(xk+1) Φ(xk) + Φ(xk), xk+1 xk + Lφ 2 xk+1 xk 2 Φ(xk) β Φ(xk), b Φ(xk) + β2Lφ Φ(xk) 2 + β2Lφ Φ(xk) b Φ(xk) 2, where the second inequality is by Young s inequality. 
Taking expectation over the above inequality, we have EΦ(xk) βE Φ(xk), Ek b Φ(xk) + β2LφE Φ(xk) 2 + β2LφE Φ(xk) b Φ(xk) 2 2 E Ek b Φ(xk) Φ(xk) 2 β 4 E Φ(xk) 2 + β 4 E Φ(xk) b Φ(xk) 2 4 E Φ(xk) 2 + βℓ2M2(1 κ 1)2Q Df + 16η2ℓ4M2 µ2 1 B + 16ℓ2M2(1 κ 1)2Q 2 E y D k y (xk) 2. (D.54) The second inequality is by Young s inequality and β = 1 4Lφ . The last inequality is by Lemmas 45 and 46. Further note that for an integer t D yt+1 k y (xk) 2 = yt+1 k yt k 2 + 2 yt+1 k yt k, yt k y (xk) + yt k y (xk) 2 =α2 y G(xk, yt k; St) 2 2α y G(xk, yt k; St), yt k y (xk) + yt k y (xk) 2. (D.55) Conditioning on xk, yt k and taking expectation in (D.55), we have E[ yt+1 k y (xk) 2|xk, yt k] S + yg(xk, yt k) 2 2α yg(xk, yt k), yt k y (xk) + yt k y (xk) 2 S + α2 yg(xk, yt k) 2 2α ℓµ ℓ+ µ yt k y (xk) 2 + yg(xk, yt k) 2 + yt k y (xk) 2 S α 2 ℓ+ µ α yg(xk, yt k) 2 + 1 2αℓµ yt k y (xk) 2, (D.56) Efficiently Escaping Saddle Points in Bilevel Optimization where the first inequality is by Assumption 29 and the second inequality follows from the strong-convexity (with respect to y) and smoothness of the function g. Since α = 2 ℓ+µ, we obtain from (D.56) that E[ yt+1 k y (xk) 2|xk, yt k] ℓ µ 2 yt k y (xk) 2 + 4σ2 (ℓ+ µ)2S . (D.57) Unconditioning on xk, yt k in (D.57) and telescoping (D.57) over t from 0 to D 1 yield E y D k y (xk) 2 ℓ µ 2D E y0 k y (xk) 2 + σ2 ℓµS . (D.58) By setting D > 1/2 log(1/4)/ log 1 κ 1+κ = O(κ), we get ℓ µ ℓ+µ 2D < 1/4. Therefore, we have E y0 k y (xk) 2 2E y D k 1 y (xk 1) 2 + 2E y (xk) y (xk 1) 2 2D E y0 k 1 y (xk 1) 2 + 2κ2E xk xk 1 2 + 2σ2 2E y0 k 1 y (xk 1) 2 + 2κ2β2 M + ℓM k E y0 0 y (x0) 2 + 2κ2β2 M + ℓM E y0 0 y (x0) 2 + 4κ2β2 M + ℓM =b + 4κ2β2 M + ℓM (D.59) where we denoted b = y0 0 y (x0) 2, the second inequality is true as y (x) is κ-Lipschitz continuous and (D.58), the third inequality is by the fact that b Φ(x) can be bounded by M + ℓM µ . 
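The inner-loop contraction in (D.56)-(D.58) is the classical rate of gradient descent with step size $\alpha = 2/(\ell+\mu)$ on a $\mu$-strongly convex, $\ell$-smooth objective: on a quadratic, each step shrinks the distance to $y^*(x)$ by exactly $(\ell-\mu)/(\ell+\mu)$. A minimal deterministic sketch (our own, with $\sigma = 0$, i.e., exact rather than stochastic gradients):

```python
import numpy as np

mu, ell = 1.0, 9.0
alpha = 2.0 / (ell + mu)                    # the step size used in Lemma 44
G = np.diag([mu, ell])                      # Hessian of g(x, y) = 0.5 (mu y1^2 + ell y2^2)
y = np.array([1.0, 1.0])                    # here y*(x) = 0
rate = (ell - mu) / (ell + mu)              # contraction factor; 0.8 for these constants

for _ in range(5):
    y_next = y - alpha * (G @ y)            # one inner gradient step
    # Each coordinate is scaled by 1 - alpha*mu = rate or 1 - alpha*ell = -rate,
    # so the norm contracts by exactly `rate` per step on this quadratic.
    assert np.isclose(np.linalg.norm(y_next), rate * np.linalg.norm(y))
    y = y_next

assert np.isclose(np.linalg.norm(y), rate ** 5 * np.sqrt(2))
```

With stochastic gradients, the same recursion picks up the additive variance term $4\sigma^2/((\ell+\mu)^2 S)$ of (D.57), which is why the batch size S must also grow as the target accuracy shrinks.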
Plug (D.58) and (D.59) into (D.54) yields 4 E Φ(xk) 2 + βℓ2M2(1 κ 1)2Q Df + 16η2ℓ4M2 µ2 1 B + 16ℓ2M2(1 κ 1)2Q 2 b + 4κ2β2 M + ℓM Therefore, it suffices to choose the parameters as S = O(κ5ϵ 2), Df = O(κ2ϵ 2), Dg = O(κ2ϵ 2), B = O(κ2ϵ 2), Q = O(κ log 1 ϵ ), D = O(κ log 1 ϵ ), (D.61) Huang and Chen and Ji and Ma and Lai such that the following inequality holds, βℓ2M2(1 κ 1)2Q Df + 16η2ℓ4M2 µ2 1 B + 16ℓ2M2(1 κ 1)2Q 2 b + 4κ2 M + ℓM 1 400Lφ ϵ2. (D.62) Finally, combining the above two inequalities yields E [Φ(xk+1) Φ(xk)] β 4 E Φ(xk) 2 + 1 400Lφ ϵ2. (D.63) D.3 Proof of Theorem 30 and Corollary 31 Proof. We consider the following two possible cases Case 1: E[ Φ(xk) |xk] > 3 5ϵ. Unconditioning on xk, we have E Φ(xk) > 3 5ϵ. In this case, Lemma 44 yields E [Φ(xk+1) Φ(xk)] 9 400Lφ ϵ2 + 1 400Lφ ϵ2 = 1 50Lφ ϵ2 and the total iteration number of Case 1 can be bounded by 50Lφ(Φ(x0) Φ ) ϵ2 . (D.64) Case 2: E[ Φ(xk) |xk] 3 5ϵ, which indicates Φ(xk) = E[ Φ(xk)|xk] E[ Φ(xk) |xk] 3 Unconditioning on xk, we have E Φ(xk) 3 5ϵ. In this case we run AID in line 12 of Algorithm 3 such that b ΦD(xk) ΦD(xk) 1 10ϵ. According to Lemma 35, it only requires us to set D = O(κ log ϵ 1) and N = O( κ log ϵ 1) in the AID. Moreover, combining with Lemma 42, we have b ΦD(xk) ΦD(xk) + b ΦD(xk) ΦD(xk) Φ(xk) + ΦD(xk) Φ(xk) + b ΦD(xk) ΦD(xk) Therefore, Algorithm 2 is called with high probability. When the if condition in line 13 of Algorithm 3 is satisfied, we can guarantee Φ(xk) |b ΦD(xk) + ΦD(xk) Φ(xk) + b ΦD(xk) ΦD(xk) ϵ. (D.66) Efficiently Escaping Saddle Points in Bilevel Optimization Moreover, when Algorithm 2 is called and it returns a nonzero vector uout, combining Lemmas 26 and 43 yields EΦ(xk+1) EΦ(xk) 1 3 803ι3 So the total iteration number in this case is bounded by 3 803ι3(Φ(x0) Φ(x ))T p ϵ3/ρφ . (D.67) If Algorithm 2 returns a zero vector uout, then with high probability we find an ϵ-local minimum. 
Therefore, combining (D.64) and (D.67), we know that the total number of iterations before we visit an $\epsilon$-local minimum can be bounded by
$$K = \frac{3 \cdot 80^3 \iota^3 (\Phi(x_0) - \Phi(x^*)) T}{\sqrt{\epsilon^3/\rho_\phi}} + \frac{50 L_\phi (\Phi(x_0) - \Phi(x^*))}{\epsilon^2} = O(\kappa^3 \epsilon^{-2}).$$
We then prove Corollary 31. Note that we require the sample batch sizes to be
$$D_f = O(\kappa^2 \epsilon^{-2}), \quad D_g = O(\kappa^6 \epsilon^{-2}), \qquad \text{(D.68)}$$
so that Lemma 42 and Lemma 41 hold, and
$$S = O(\kappa^5 \epsilon^{-2}), \quad D_f = O(\kappa^2 \epsilon^{-2}), \quad D_g = O(\kappa^2 \epsilon^{-2}), \quad Q = O(\kappa \log \epsilon^{-1}), \quad D = O(\kappa \log \epsilon^{-1}), \quad B = O(\kappa^2 \epsilon^{-2}), \qquad \text{(D.69)}$$
so that Lemma 44 holds. Combining (D.68) and (D.69), we have that for the iNEON calls in Algorithm 3 (Lines 12-21), the gradient, Jacobian- and Hessian-vector product complexities are
$$G(f, \epsilon) = O(\kappa^5 \epsilon^{-4}), \quad G(g, \epsilon) = O(\kappa^{10} \epsilon^{-4}), \quad JV(g, \epsilon) = O(\kappa^9 \epsilon^{-4}), \quad HV(g, \epsilon) = O(\kappa^{9.5} \epsilon^{-4}). \qquad \text{(D.70)}$$
For the stocBiO iterations in Algorithm 3 (Lines 3-11), the gradient, Jacobian- and Hessian-vector product complexities are
$$G(f, \epsilon) = O(\kappa^5 \epsilon^{-4}), \quad G(g, \epsilon) = O(\kappa^9 \epsilon^{-4}), \quad JV(g, \epsilon) = O(\kappa^5 \epsilon^{-4}), \quad HV(g, \epsilon) = O(\kappa^6 \epsilon^{-4}). \qquad \text{(D.71)}$$
We take the maximum between (D.70) and (D.71), which completes the proof of Corollary 31.

References

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195-1199, 2017.

Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In International Conference on Machine Learning, pages 116-125. PMLR, 2017.

Zeyuan Allen-Zhu and Yuanzhi Li. NEON2: finding local minima via first-order oracles. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3720-3730, 2018.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214-223. PMLR, 2017.

Jerome Bracken and James T McGill. Mathematical programs with optimization problems in the constraints.
Operations Research, 21(1):37-44, 1973.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751-1772, 2018.

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294-25307, 2021a.

Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pages 2466-2488. PMLR, 2022.

Ziyi Chen, Qunwei Li, and Yi Zhou. Escaping saddle points in nonconvex minimax optimization via cubic-regularized gradient descent-ascent. arXiv preprint arXiv:2110.07098, 2021b.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192-204. PMLR, 2015.

Christopher Criscitiello and Nicolas Boumal. Efficiently escaping saddle points on manifolds. Advances in Neural Information Processing Systems, 32:5987-5997, 2019.

Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1-2):1-32, 2017.

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pages 1155-1164. PMLR, 2018.

Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In 32nd Annual Conference on Neural Information Processing Systems (NIPS), 2018.

Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio.
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933-2941, 2014.

Damek Davis and Dmitriy Drusvyatskiy. Proximal methods avoid active strict saddles of weakly convex functions. Foundations of Computational Mathematics, pages 1-46, 2021.

Damek Davis, Mateo Díaz, and Dmitriy Drusvyatskiy. Escaping strict saddle points of the Moreau envelope in nonsmooth optimization. arXiv preprint arXiv:2106.09815, 2021.

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318-326. PMLR, 2012.

SS Du, C Jin, MI Jordan, B Póczos, A Singh, and JD Lee. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1068-1078, 2017.

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory, pages 1192-1234. PMLR, 2019.

Matthias Feurer and Frank Hutter. Hyperparameter optimization. In Automated Machine Learning, pages 3-33. Springer, Cham, 2019.

Tanner Fiez, Lillian Ratliff, Eric Mazumdar, Evan Faulkner, and Adhyyan Narang. Global convergence to local minmax equilibrium in classes of nonconvex zero-sum games. Advances in Neural Information Processing Systems, 34, 2021.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568-1577. PMLR, 2018.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842. PMLR, 2015.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pages 1233-1242. PMLR, 2017.
Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pages 3748-3758. PMLR, 2020.

Zhishuai Guo and Tianbao Yang. Randomized stochastic variance-reduced methods for stochastic bilevel optimization. arXiv preprint arXiv:2105.02266, 2021.

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.

Minhui Huang. Escaping saddle points for nonsmooth weakly convex functions via perturbed proximal algorithms. arXiv preprint arXiv:2102.02837, 2021.

Minhui Huang, Shiqian Ma, and Lifeng Lai. On the convergence of projected alternating maximization for equitable and optimal transport. arXiv preprint arXiv:2109.15030, 2021a.

Minhui Huang, Shiqian Ma, and Lifeng Lai. Projection robust Wasserstein barycenters. In International Conference on Machine Learning, pages 4456-4465. PMLR, 2021b.

Minhui Huang, Shiqian Ma, and Lifeng Lai. A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance.
In International Conference on Machine Learning, pages 4446-4455. PMLR, 2021c.

Kaiyi Ji and Yingbin Liang. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint arXiv:2102.03926, 2021.

Kaiyi Ji, Jason D. Lee, Yingbin Liang, and H. Vincent Poor. Convergence of meta-learning with task-specific adaptation over partial parameters. Advances in Neural Information Processing Systems, 2020.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pages 4882-4892. PMLR, 2021.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Theoretical convergence of multi-step model-agnostic meta-learning. Journal of Machine Learning Research, 23(29):1-41, 2022.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724-1732. PMLR, 2017.

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference on Learning Theory, pages 1042-1085. PMLR, 2018.

Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4880-4889. PMLR, 13-18 Jul 2020.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1-29, 2021.

Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. In A. Beygelzimer, Y. Dauphin, P. Liang, and J.
Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum? id=Hj Ft Rc83e BB. Gautam Kunapuli, Kristin P Bennett, Jing Hu, and Jong-Shi Pang. Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475 489, 2008. Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246 1257. PMLR, 2016. Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. Projection robust wasserstein distance and Riemannian optimization. Advances in Neural Information Processing Systems, 33:9383 9397, 2020a. Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083 6093. PMLR, 2020b. Tianyi Lin, Chi Jin, and Michael I Jordan. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pages 2738 2779. PMLR, 2020c. Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, and Mingyi Hong. Finding second-order stationary points efficiently in smooth nonconvex linearly constrained optimization problems. Advances in Neural Information Processing Systems, 2020, 2020a. Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 68:3676 3691, 2020b. Luo Luo and Cheng Chen. Finding second-order stationary point for nonconvex-stronglyconcave minimax problem. ar Xiv preprint ar Xiv:2110.04814, 2021. Huang and Chen and Ji and Ma and Lai Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33, 2020. Dougal Maclaurin, David Duvenaud, and Ryan Adams. 
Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113 2122. PMLR, 2015. Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. ar Xiv preprint ar Xiv:1901.00838, 2019. Y. E. Nesterov. Introductory lectures on convex optimization: A basic course. Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. ISBN 1-4020-7553-7. Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006. Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32:14934 14942, 2019. Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural computation, 6(1): 147 160, 1994. Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International conference on machine learning, pages 737 746. PMLR, 2016. Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 2019. Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated backpropagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723 1732. PMLR, 2019. Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30:4077 4087, 2017. Daouda Sow, Kaiyi Ji, and Yingbin Liang. ES-based Jacobian enables faster bilevel optimization. 
ar Xiv preprint ar Xiv:2110.07004, 2021. Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. Advances in Neural Information Processing Systems, 32:7276 7286, 2019. Efficiently Escaping Saddle Points in Bilevel Optimization Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. Advances in Neural Information Processing Systems, 31:5530 5540, 2018. Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview. net/forum?id=10anajd GZm. Jiawei Zhang, Peijun Xiao, Ruoyu Sun, and Zhiquan Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. Advances in Neural Information Processing Systems, 33:7377 7389, 2020.