# On Penalty-based Bilevel Gradient Descent Method

Han Shen, Tianyi Chen

Department of ECSE, Rensselaer Polytechnic Institute, Troy, NY, USA. Correspondence to: Han Shen, Tianyi Chen. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization, meta-learning and reinforcement learning. However, bilevel problems are difficult to solve, and recent progress on scalable bilevel algorithms mainly focuses on bilevel optimization problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle the bilevel problem through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent algorithm and establish its finite-time convergence for the constrained bilevel problem under lower-level error bound conditions weaker than strong convexity. The experimental results showcase the efficiency of the proposed algorithm. The code is available on GitHub (link).

1. Introduction

Bilevel optimization plays an increasingly important role in machine learning as it has a wide range of applications including hyper-parameter optimization (Maclaurin et al., 2015; Franceschi et al., 2018), meta-learning (Finn et al., 2017; Rajeswaran et al., 2019), reinforcement learning (Cheng et al., 2022) and adversarial learning (Jiang et al., 2021; Zhang et al., 2022). Define $f:\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}\mapsto\mathbb{R}$ and $g:\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}\mapsto\mathbb{R}$. In this paper, we consider the following bilevel problem:

$$\textrm{BP}:\ \min_{x,y} f(x,y) \quad \textrm{s.t.}\ x\in C,\ y\in S(x):=\mathop{\arg\min}_{y\in U(x)} g(x,y)$$

where C, U(x) and S(x) are non-empty and closed sets given any $x\in C$. We call f and g respectively the upper-level and the lower-level objective.

The bilevel optimization problem BP can be extremely difficult to solve due to the tangling between the upper-level and lower-level problems. Even for the simpler case where $g(x,\cdot)$ is strongly convex, $U(x)=\mathbb{R}^{d_y}$ and $C=\mathbb{R}^{d_x}$, it was not until recently that the veil of an efficient method was partially lifted. Under the strong convexity of $g(x,\cdot)$, the lower-level solution set S(x) is a singleton. In this case, BP reduces to minimizing f(x, S(x)), the gradient of which can be calculated with the implicit gradient (IG) method (Pedregosa, 2016; Ghadimi & Wang, 2018). It was later shown by (Chen et al., 2021) that the IG method converges almost as fast as the gradient descent (GD) method. However, existing IG methods cannot handle either the lower-level constraint U(x) or the non-strong convexity of $g(x,\cdot)$ due to the difficulty of computing the implicit gradient, and thus cannot be applied to more complicated bilevel problems.
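For concreteness, when $S(x)=\{y^*(x)\}$ is a singleton and $g(x,\cdot)$ is strongly convex, the implicit gradient referenced above takes the following standard form (a known identity used, e.g., by Ghadimi & Wang, 2018), obtained by differentiating the optimality condition $\nabla_y g(x,y^*(x))=0$ in $x$:

$$\nabla\big[f(x,y^*(x))\big]=\nabla_x f(x,y^*(x))-\nabla^2_{xy}g(x,y^*(x))\big[\nabla^2_{yy}g(x,y^*(x))\big]^{-1}\nabla_y f(x,y^*(x)).$$

Evaluating it requires second-order derivatives of g and a linear-system solve, which is exactly what the first-order method developed in this paper avoids.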
To overcome the above challenges, recent work aims to develop gradient-based methods without lower-level strong convexity. A prominent branch of algorithms is based on the iterative differentiation (ITD) method; see, e.g., (Franceschi et al., 2017; Liu et al., 2021b). In this case, the lower-level solution set S(x) is replaced by the output of an iterative optimization algorithm that solves the lower-level problem (e.g., gradient descent), which allows for explicit differentiation. However, these methods are typically restricted to the unconstrained case since the lower-level algorithm with the projection operator is difficult to differentiate. Furthermore, the algorithm has high memory and computational costs when the lower-level iteration number is large.

On the other hand, it is tempting to penalize a certain optimality metric of the lower-level problem (e.g., $\|\nabla_y g(x,y)\|^2$) onto the upper-level objective, leading to a single-level optimization problem. The high-level idea is that minimizing the optimality metric enforces the lower-level optimality condition $y\in S(x)$; as long as the optimality metric admits simple gradient evaluation, the penalized objective can be optimized via gradient-based algorithms. However, as we show in the next example, GD with a straightforward penalization mechanism may not lead to the desired solution of the original bilevel problem.

Example 1. Consider the following special case of BP:

$$\min_{y\in\mathbb{R}} f(y):=\sin^2\Big(y-\frac{2\pi}{3}\Big) \quad \textrm{s.t.}\ y\in\mathop{\arg\min}_{y\in\mathbb{R}} g(y):=y^2+2\sin^2 y. \tag{1}$$

The only solution of (1) is y = 0. In this example, it can be checked that $\|\nabla g(y)\|^2=0$ if and only if $y\in\arg\min_{y\in\mathbb{R}} g(y)$, and thus $\|\nabla g(y)\|^2=0$ is a lower-level optimality metric. Penalizing f(y) with $\|\nabla g(y)\|^2$ and γ > 0 gives $\min_y f(y)+\gamma(y+\sin(2y))^2$. For any γ, $y=\frac{2\pi}{3}$ is a local solution of the penalized problem, while it is neither a global nor a local solution of (1).

Figure 1: Naive penalty yields suboptimal points, while the proposed algorithm (V-PBGD) finds the solution (iterates plotted over steps k).

In Figure 1, we show that in Example 1 the naive penalty method, i.e., solving $\min f(x,y)+\gamma(y+\sin(2y))^2$ by GD, can get stuck at sub-optimal points. To tackle such issues, it is crucial to study the relation between the bilevel problem and its penalized problem. Specifically, what impact do different penalty terms, penalization constants and problem properties have on this relation? Through studying this relation, we aim to develop an efficient penalty-based bilevel gradient descent method.

Our contributions. In this work, we first consider the following penalty reformulation of BP:

$$\textrm{BP}_{\gamma p}:\ \min_{x,y} f(x,y)+\gamma p(x,y) \quad \textrm{s.t.}\ x\in C,\ y\in U(x)$$

where p(x, y) is a certain penalty function that will be specified in Section 3. Our first result shows that under certain generic conditions on g(x, y), one can recover approximate global (local) solutions of BP by solving BPγp globally (locally). Further, we show that these generic conditions hold without the strong convexity of $g(x,\cdot)$. We then propose the penalized bilevel GD (PBGD) method and establish its finite-time convergence when $U(x)=\mathbb{R}^{d_y}$ or $U(x)=U$, a compact convex set. We summarize the convergence results of our algorithm and compare them with several related works in Table 1. Finally, we empirically showcase the performance, computation and memory efficiency of the proposed algorithm in comparison with several competitive baselines.

Table 1: Comparison of this work (V-PBGD) with BOME (Ye et al., 2022), IAPTT-GM (Liu et al., 2021b) and AiPOD (Xiao et al., 2023). In the table, U is a convex compact set.

| | V-PBGD | BOME | IAPTT-GM | AiPOD |
|---|---|---|---|---|
| Upper-level constraint | ✓ | ✗ | ✓ | ✓ |
| Lower-level constraint | $U(x)=\mathbb{R}^{d_y}$ or $U(x)=U$ | $U(x)=\mathbb{R}^{d_y}$ | $U(x)=U$ | equality constraint |
| Lower-level non-strongly-convex | ✓ | ✓ | ✓ | ✗ |
| Non-singleton S(x) | ✓ | ✓ | ✓ | ✗ |
| First-order | ✓ | ✓ | ✗ | ✗ |
| Convergence | finite-time | finite-time | asymptotic | finite-time |
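Returning to Example 1, the naive-penalty failure can be reproduced with a few lines of code. The following is a minimal sketch; the step size, initialization, penalty constant and iteration budget are our own illustrative choices:

```python
import numpy as np

# Example 1: f(y) = sin^2(y - 2*pi/3), g(y) = y^2 + 2*sin^2(y).
# Naive penalty: minimize f(y) + gamma * (y + sin(2y))^2 by gradient descent.
# Initialized near y = 2*pi/3, GD converges to that spurious local solution
# instead of the bilevel solution y = 0.

def grad_penalized(y, gamma):
    grad_f = 2 * np.sin(y - 2 * np.pi / 3) * np.cos(y - 2 * np.pi / 3)
    h = y + np.sin(2 * y)                       # proportional to g'(y)
    grad_pen = 2 * h * (1 + 2 * np.cos(2 * y))  # d/dy (y + sin 2y)^2
    return grad_f + gamma * grad_pen

y, gamma, lr = 2.0, 10.0, 1e-2   # init near 2*pi/3 ~ 2.094 (our choice)
for _ in range(2000):
    y -= lr * grad_penalized(y, gamma)
print(y)  # stays near 2*pi/3 rather than the bilevel solution y = 0
```

At $y=2\pi/3$ both $f$ and $(y+\sin 2y)^2$ have vanishing gradients and positive curvature, so the point is a strict local minimum of the penalized objective for every γ, which is exactly the failure mode analyzed above.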
Related works. The bilevel optimization problem dates back to (Stackelberg, 1952). Recently, gradient-based bilevel optimization methods have gained growing popularity in the machine learning area; see, e.g., (Sabach & Shtern, 2017; Franceschi et al., 2018; Liu et al., 2020). A branch of gradient-based methods belongs to the IG method (Pedregosa, 2016). The finite-time convergence was first established in (Ghadimi & Wang, 2018) for the unconstrained strongly-convex lower-level problem. Later, the convergence was improved in (Hong et al., 2023; Ji et al., 2021; Chen et al., 2022; 2021; Khanduri et al., 2021; Shen & Chen, 2022; Li et al., 2022; Sow et al., 2022). Recent works extend IG to constrained strongly-convex lower-level problems; see, e.g., the equality-constrained IG method (Xiao et al., 2023) and a second-derivative-free approach (Giovannelli et al., 2022).

Another branch of methods is based on the iterative differentiation (ITD) method (Maclaurin et al., 2015; Franceschi et al., 2017; Nichol et al., 2018; Shaban et al., 2019). Later, (Liu et al., 2021b) proposed an ITD method with initialization optimization and showed asymptotic convergence. Another work (Liu et al., 2022b) develops an ITD method where each lower-level iteration uses a combination of upper-level and lower-level gradients. Recently, the iterative differentiation of non-smooth lower-level algorithms has been studied in (Bolte et al., 2022). The ITD methods generally lack finite-time guarantees unless restrictive assumptions are made on the iteration mapping (Grazzi et al., 2020; Ji et al., 2022).

Recently, bilevel optimization methods have also been studied in distributed learning (Tarzanagh et al., 2022; Lu et al., 2022; Yang et al., 2022), coreset selection (Zhou et al., 2022), the overparameterized setting (Vicol et al., 2022), multi-block min-max (Hu et al., 2022) and game theory (Arbel & Mairal, 2022), and several acceleration methods have been proposed (Huang et al., 2022; Dagréou et al., 2022). The works (Liu et al., 2021a) and (Mehra & Hamm, 2021) propose penalty-based methods with, respectively, a log-barrier and a gradient-norm penalty, and establish their asymptotic convergence. Another work (Gao et al., 2022) develops a method based on the difference-of-convex algorithm. In preparing our final version, a concurrent work (Chen et al., 2023) studies the bilevel problem with convex lower-level objectives and proposes a zeroth-order optimization method with finite-time convergence to the Goldstein stationary point. Another concurrent work (Lu & Mei, 2023) proposes a penalty method for the bilevel problem with a convex lower-level objective $g(x,\cdot)$; it shows convergence to a weak KKT point of the bilevel problem but does not study the relation between the bilevel problem and its penalized problem.

The relation between the bilevel problem and its penalty reformulation was first studied in (Ye et al., 1997) under the calmness condition paired with other conditions such as the 2-Hölder continuity, which are difficult to satisfy. A recent work (Ye et al., 2022) proposes a novel first-order method termed BOME. Assuming the constant rank constraint qualification (CRCQ), (Ye et al., 2022) shows convergence of BOME to a KKT point of the bilevel problem. However, it is unclear when CRCQ can be satisfied, and the convergence relies on restrictive assumptions such as the uniform boundedness of $\|\nabla g\|$, $\|\nabla f\|$, $|f|$ and $|g|$. It is also difficult to argue when the KKT point is a solution of the bilevel problem under lower-level non-convexity.
Notations. We use $\|\cdot\|$ to denote the $\ell_2$-norm. Given r > 0 and $z\in\mathbb{R}^d$, define $\mathcal{N}(z,r):=\{z'\in\mathbb{R}^d:\|z'-z\|\le r\}$. Given vectors x and y, we use (x, y) to denote the concatenated vector of x and y. Given a non-empty closed set $S\subseteq\mathbb{R}^d$, define the distance of $y\in\mathbb{R}^d$ to S as $d_S(y):=\min_{y'\in S}\|y'-y\|$. We use $\mathrm{Proj}_Z$ to denote the projection onto the set Z.

2. Penalty Reformulation of Bilevel Problems

In this section, we study the relationship between the bilevel problem BP and its penalty reformulation BPγp. Since S(x) is closed, $y\in S(x)$ is equivalent to $d_{S(x)}(y)=0$. We therefore rewrite BP as

$$\min_{x,y} f(x,y) \quad \textrm{s.t.}\ x\in C,\ d^2_{S(x)}(y)=0. \tag{2}$$

The squared distance function is non-differentiable, and thus penalizing $d^2_{S(x)}(y)$ onto the upper-level objective is computationally intractable. Instead, we consider its parametric upper bounds defined as follows.

Definition 1 (Squared-distance bound function). A function $p:\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}\mapsto\mathbb{R}$ is a ρ-squared-distance bound if there exists ρ > 0 such that for any $x\in C$, $y\in U(x)$, it holds that

$$p(x,y)\ge 0,\quad \rho\,p(x,y)\ge d^2_{S(x)}(y), \tag{3a}$$
$$p(x,y)=0 \ \text{ if and only if }\ d_{S(x)}(y)=0. \tag{3b}$$

Suppose p(x, y) is a squared-distance bound function. Given ϵ > 0, we define the following problem:

$$\textrm{BP}_\epsilon:\ \min_{x,y} f(x,y) \quad \textrm{s.t.}\ x\in C,\ y\in U(x),\ p(x,y)\le\epsilon.$$

It is clear that BPϵ with ϵ = 0 recovers BP. For ϵ > 0, BPϵ is an ϵ-approximate problem of BP, since p(x, y), which upper-bounds $d^2_{S(x)}(y)$ up to the constant ρ, is no larger than ϵ.

We start by considering the relation between the global solutions of BPϵ and BPγp. Before introducing the theorem, we give the following definition and assumption.

Definition 2 (Lipschitz continuity). Given L > 0, a function $\ell:\mathbb{R}^d\mapsto\mathbb{R}^{d'}$ is said to be L-Lipschitz-continuous on $X\subseteq\mathbb{R}^d$ if it holds for any $x,x'\in X$ that $\|\ell(x)-\ell(x')\|\le L\|x-x'\|$. A function ℓ is said to be L-Lipschitz-smooth if its gradient is L-Lipschitz-continuous.

Assumption 1. There exists a constant L such that given any $x\in C$, $f(x,\cdot)$ is L-Lipschitz-continuous on U(x).

The above assumption is standard and has been made in several other works studying bilevel optimization; see, e.g., (Ghadimi & Wang, 2018; Chen et al., 2021; 2023).

To establish the relation between the solutions of BPγp and those of BP, a crucial step is to guarantee that a solution (xγ, yγ) of BPγp is feasible for BPϵ, i.e., to guarantee that p(xγ, yγ) is small. Under Assumption 1, the growth of $f(x_\gamma,\cdot)$ is controlled. An important intuition is then that increasing γ in BPγp makes $p(x_\gamma,\cdot)$ more dominant, and thus decreases p(xγ, yγ). With this intuition, we introduce the following theorem.

Theorem 1 (Relation on global solutions). Assume p(x, y) is a ρ-squared-distance-bound function and Assumption 1 holds. Given any ϵ₁ > 0, any global solution of BP is an ϵ₁-global-minimum point of BPγp with any $\gamma\ge\gamma^*=\frac{L^2\rho}{4}\epsilon_1^{-1}$. Conversely, given ϵ₂ ≥ 0, if (xγ, yγ) achieves an ϵ₂-global-minimum of BPγp with γ > γ*, then (xγ, yγ) is an ϵ₂-global solution of BPϵγ with some $\epsilon_\gamma\le(\epsilon_1+\epsilon_2)/(\gamma-\gamma^*)$.

The proof of Theorem 1 can be found in Appendix A.1.
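As a concrete reading of Theorem 1, consider the following numbers (our own illustration, not from the paper):

$$L=1,\ \rho=1,\ \epsilon_1=10^{-2}\ \Longrightarrow\ \gamma^*=\frac{L^2\rho}{4\epsilon_1}=25;\qquad \gamma=50,\ \epsilon_2=10^{-2}\ \Longrightarrow\ \epsilon_\gamma\le\frac{\epsilon_1+\epsilon_2}{\gamma-\gamma^*}=\frac{0.02}{25}=8\times 10^{-4}.$$

That is, taking γ only moderately beyond γ* already forces the penalty value, and hence the lower-level infeasibility, well below the targeted accuracy.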
In Example 1, $\|\nabla_y g(x,y)\|^2$ is in fact a squared-distance bound, and the above theorem regarding global solutions holds for this example. However, as illustrated in Example 1, the penalized problem with any γ always admits a local solution that is meaningless to the original problem. In fact, the relationship between local solutions is more intricate than that between global solutions. Nevertheless, we prove in the following theorem that under some verifiable conditions, the local solutions of BPγp are local solutions of BPϵ.

Theorem 2 (Relation on local solutions). Assume $p(x,\cdot)$ is continuous given any $x\in C$ and p(x, y) is a ρ-squared-distance-bound function. Given γ > 0, let (xγ, yγ) be a local solution of BPγp on $\mathcal{N}((x_\gamma,y_\gamma),r)$. Assume $f(x_\gamma,\cdot)$ is L-Lipschitz-continuous on $\mathcal{N}(y_\gamma,r)$. Assume either one of the following is true:

(i) There exists $y\in\mathcal{N}(y_\gamma,r)$ such that $y\in U(x_\gamma)$ and $p(x_\gamma,y)\le\epsilon$ for some ϵ ≥ 0. Define $\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}+2\epsilon$.

(ii) The set U(xγ) is convex and the function $p(x_\gamma,\cdot)$ is convex. Define $\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}$.

Then (xγ, yγ) is a local solution of BPϵγ with $\epsilon_\gamma\le\bar\epsilon_\gamma$.

The proof of Theorem 2 can be found in Appendix A.2.

Remark 1 (Intuition of the conditions). In (i), we need an approximate global minimizer of $p(x_\gamma,\cdot)$; in (ii), we assume $p(x_\gamma,\cdot)$ is convex. Loosely speaking, these conditions essentially require $\min_{y\in U(x)} p(x,y)$ to be globally solvable. Such a requirement is natural since finding a feasible point in S(x) is possible only if one can solve for $p(x,\cdot)=0$ on U(x). While they appear abstract, we will show how Conditions (i) and (ii) in Theorem 2 can be verified in the following sections.

3. Solving Bilevel Problems with Non-convex Lower-level Objectives

To develop algorithms with non-asymptotic convergence, we consider BP with $U(x)=\mathbb{R}^{d_y}$, given by

$$\textrm{UP}:\ \min_{x,y} f(x,y) \quad \textrm{s.t.}\ x\in C,\ y\in\mathop{\arg\min}_{y\in\mathbb{R}^{d_y}} g(x,y)$$

where we assume C is a closed convex set and f, g are continuously differentiable.

3.1. Candidate penalty terms

Following Section 2, to reformulate UP we first seek a squared-distance bound function p that satisfies Definition 1. For a non-convex function $g(x,\cdot)$, an interesting property is the Polyak-Łojasiewicz (PL) inequality defined as follows.

Assumption 2 (Polyak-Łojasiewicz function). The lower-level function $g(x,\cdot)$ satisfies the $\frac{1}{\mu}$-PL inequality; that is, there exists µ > 0 such that given any $x\in C$, it holds for any $y\in\mathbb{R}^{d_y}$ that

$$\|\nabla_y g(x,y)\|^2\ge\frac{1}{\mu}\big(g(x,y)-v(x)\big) \tag{4}$$

where $v(x):=\min_{y\in\mathbb{R}^{d_y}} g(x,y)$.

In reinforcement learning, it has been proven in (Mei et al., 2020, Lemmas 8 and 9) that the non-convex discounted return objective satisfies the PL inequality under certain parameterizations. Moreover, recent studies found that overparameterized neural networks can lead to losses that satisfy the PL inequality (Liu et al., 2022a). We consider the following candidate penalty functions:

$$p(x,y)=g(x,y)-v(x) \tag{5a}$$
$$p(x,y)=\mu\|\nabla_y g(x,y)\|^2 \tag{5b}$$

Under the PL inequality, the following lemma shows that the above penalty functions are squared-distance bound functions.

Lemma 1. Suppose Assumption 2 holds. Then (5a) and (5b) are µ-squared-distance-bound functions.

The proof of Lemma 1 is deferred to Appendix B.1.
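As a quick numerical sanity check of Assumption 2 (and hence Lemma 1) on the lower-level objective of Example 1, one can estimate the PL constant on a grid. The grid range and tolerance below are our own choices, and this is an empirical check rather than a proof:

```python
import numpy as np

# Numerical check of the PL inequality (Assumption 2) for the lower-level
# objective of Example 1: g(y) = y^2 + 2 sin^2(y), with v = min g = 0.
# We estimate the largest 1/mu such that ||g'(y)||^2 >= (1/mu) * (g(y) - v)
# holds on a finite grid.

y = np.linspace(-10, 10, 200001)
g = y**2 + 2 * np.sin(y)**2
grad_g = 2 * y + 2 * np.sin(2 * y)
mask = g > 1e-12                       # avoid 0/0 at the minimizer y = 0
ratio = grad_g[mask]**2 / g[mask]
print("estimated 1/mu:", ratio.min())  # strictly positive => PL plausible
```

The printed ratio stays bounded away from zero (roughly 1 near $y=2\pi/3$, where the curve is flattest), consistent with $g$ being PL despite its non-convexity.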
3.2. Penalty reformulation

Given p(x, y) and γ > 0, define the penalized UP as

$$\textrm{UP}_{\gamma p}:\ \min_{x,y} F_\gamma(x,y):=f(x,y)+\gamma p(x,y) \quad \textrm{s.t.}\ x\in C.$$

It remains to show that the solutions of UPγp are meaningful for UP. Starting with the global solutions, we give the following proposition.

Proposition 1 (Relation on global solutions). Suppose Assumptions 1 and 2 hold. Choose p(x, y) as either (5a) or (5b). Suppose $\gamma\ge L\sqrt{\mu\delta^{-1}}$ with some δ > 0. Let (xγ, yγ) be a global solution of UPγp. Then (xγ, yγ) is a global solution of BPϵγ with $U(x)=\mathbb{R}^{d_y}$ and $\epsilon_\gamma\le\delta$.

Proposition 1 follows directly from Theorem 1 with $\epsilon_1=L\sqrt{\rho\delta}/2$, $\gamma\ge 2\gamma^*=L\sqrt{\mu\delta^{-1}}$ and ϵ₂ = 0. By the above proposition, the global solution of UPγp solves an approximate bilevel problem of UP. However, since UPγp is generally non-convex, it is also important to consider the local solutions. Following Theorem 2, the next proposition captures the meaning of the local solutions.

Proposition 2 (Relation on local solutions). Suppose Assumptions 1, 2 and either one of the following hold:

(a) With some δ > 0, choose p(x, y) = g(x, y) − v(x) and $\gamma\ge L\sqrt{3\mu\delta^{-1}}$;

(b) With some δ > 0, choose $p(x,y)=\mu\|\nabla_y g(x,y)\|^2$ and $\gamma\ge\max\{L\sqrt{2\mu\delta^{-1}},\ L\sigma^{-1}\sqrt{\mu^{-1}\delta^{-1}}\}$, where σ > 0 is a lower bound on the singular values of $\nabla^2_{yy}g(x,y)$ on $\{(x,y)\in C\times\mathbb{R}^{d_y}: y\notin S(x)\}$.

Let (xγ, yγ) be a local solution of UPγp. Then (xγ, yγ) is a local solution of BPϵγ with $U(x)=\mathbb{R}^{d_y}$ and $\epsilon_\gamma\le\delta$.

The proof of Proposition 2 can be found in Appendix B.2. Proposition 2 explains the observations in Figure 1 and Example 1. The failing point $y=\frac{2\pi}{3}$ mentioned in Example 1 yields $\nabla^2_{yy}g(x,y)=0$, which violates (b) in Proposition 2, while it can be checked that (a) in Proposition 2 holds.

Propositions 1 and 2 suggest $\gamma=\Omega(\delta^{-0.5})$ in order to achieve $\epsilon_\gamma\le\delta$. Next we show this is tight.

Corollary 1. In Propositions 1 and 2, to guarantee $\epsilon_\gamma=\mathcal{O}(\delta)$, the lower bound $\gamma=\Omega(\delta^{-0.5})$ is tight.

Proof. Consider the following special case of UP:

$$\min_{y\in\mathbb{R}} f(x,y)=y, \quad \textrm{s.t.}\ y\in\mathop{\arg\min}_{y\in\mathbb{R}} g(x,y)=y^2. \tag{6}$$

In this example, the two penalty terms $\frac{1}{4}\|\nabla_y g(x,y)\|^2$ and g(x, y) − v(x) coincide as $p(x,y)=y^2$. Accordingly, the solution of UPγp is $-\frac{1}{2\gamma}$. It can be checked that $-\frac{1}{2\gamma}$ is a local solution of BPϵγ (with $U(x)=\mathbb{R}^{d_y}$) and $\epsilon_\gamma=1/(4\gamma^2)$. To ensure $\epsilon_\gamma=\mathcal{O}(\delta)$, $\gamma=\Omega(\delta^{-0.5})$ is required in this example. The proof is complete by the fact that the assumptions in Propositions 1 and 2 hold in this example. □

Propositions 1 and 2 imply that UP and UPγp are related in the sense that one can globally/locally solve an approximate bilevel problem of UP by globally/locally solving UPγp instead. A natural approach to solving UPγp is the projected gradient method described in Algorithm 1.

Algorithm 1 PBGD: Penalized bilevel GD
1: Select $(x_1,y_1)\in Z:=C\times U(x)$. Select α, γ, K.
2: for k = 1 to K do
3: Compute $h_k=\nabla p(x_k,y_k)$ or its estimate.
4: $(x_{k+1},y_{k+1})=\mathrm{Proj}_Z\big((x_k,y_k)-\alpha(\nabla f(x_k,y_k)+\gamma h_k)\big)$.
5: end for

When $p(x,y)=\mu\|\nabla_y g(x,y)\|^2$, ∇p(x, y) can be exactly evaluated. In this case, Algorithm 1 is a standard projected gradient method and the convergence property directly follows from the existing literature. In the next subsection, we focus on the other penalty function p(x, y) = g(x, y) − v(x) and discuss when UPγp can be efficiently solved.
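To make Algorithm 1 concrete, here is a minimal runnable sketch with the exactly-evaluable gradient-norm penalty, on a toy instance of our own construction (the instance, step size and penalty constant are illustrative choices, not from the paper):

```python
import numpy as np

# Sketch of PBGD (Algorithm 1) with penalty p(x, y) = ||grad_y g(x, y)||^2 on
#   f(x, y) = (x - 1)^2 + y,   g(x, y) = (y - x)^2,   C = [-2, 2].
# Here S(x) = {x}, so the bilevel solution minimizes (x - 1)^2 + x over C,
# i.e., (x*, y*) = (0.5, 0.5).

def pbgd(alpha=0.005, gamma=20.0, K=20000):
    x, y = -1.5, 1.0                         # arbitrary initialization
    for _ in range(K):
        r = y - x                            # grad_y g = 2r, so p = 4r^2
        gx = 2 * (x - 1) + gamma * (-8 * r)  # d/dx [f + gamma * p]
        gy = 1.0 + gamma * (8 * r)           # d/dy [f + gamma * p]
        x = np.clip(x - alpha * gx, -2, 2)   # projection onto C
        y = y - alpha * gy                   # y is unconstrained
    return x, y

print(pbgd())  # approaches (0.5, 0.5) up to an O(1/gamma) bias in y
```

On this instance the penalized stationary point satisfies $y-x=-1/(8\gamma)$, so the lower-level gap shrinks at the $\Theta(1/\gamma^2)$ rate that Corollary 1 shows is unavoidable.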
3.3. PBGD with function value gap

We consider solving UPγp with p(x, y) chosen as the function value gap penalty (5a). To solve UPγp with a gradient-based method, the obstacle is that ∇p(x, y) requires ∇v(x). For one, v(x) is not necessarily smooth. Even if v(x) is differentiable, evaluating $\nabla v(x)=\nabla_x g(x,y^*)$ requires a lower-level solution $y^*\in S(x)$. However, it is possible to compute ∇v(x) efficiently under some relatively mild assumptions.

Lemma 2 ((Nouiehed et al., 2019, Lemma A.5)). Suppose Assumption 2 holds, and g is Lipschitz-smooth with constant $L_g$. Then $\nabla v(x)=\nabla_x g(x,y^*)$ for any $y^*\in S(x)$, and v(x) is $(L_g+L_g^2\mu)$-Lipschitz-smooth.

Under the conditions in Lemma 2, ∇v(x) can be evaluated directly at any optimal solution of the lower-level problem. This suggests one find a lower-level optimal solution $y^*\in S(x)$ and evaluate the penalized gradient $\nabla F_\gamma(x,y)$ with $\nabla v(x)=\nabla_x g(x,y^*)$. Following this idea, given outer iteration k and $x_k$, we run $T_k$ steps of inner GD updates to solve the lower-level problem:

$$\omega^{(k)}_{t+1}=\omega^{(k)}_t-\beta\nabla_y g(x_k,\omega^{(k)}_t),\quad t=1,\dots,T_k \tag{7a}$$

where $\omega^{(k)}_1=y_k$. Update (7a) yields an approximate lower-level solution $\hat y_k=\omega^{(k)}_{T_k+1}$. Then we can approximate $\nabla F_\gamma(x_k,y_k)$ with $\hat y_k$ and update $(x_k,y_k)$ via

$$(x_{k+1},y_{k+1})=\mathrm{Proj}_Z\Big((x_k,y_k)-\alpha\big(\nabla f(x_k,y_k)+\gamma(\nabla g(x_k,y_k)-\bar\nabla_x g(x_k,\hat y_k))\big)\Big) \tag{7b}$$

where $Z=C\times\mathbb{R}^{d_y}$ and $\bar\nabla_x g(x,y):=(\nabla_x g(x,y),0)$ with $0\in\mathbb{R}^{d_y}$. The update is summarized in Algorithm 2, which is a function-value-gap-based special case of PBGD (Algorithm 1) with $h_k=\nabla g(x_k,y_k)-\bar\nabla_x g(x_k,\hat y_k)$.

Algorithm 2 V-PBGD: Function value gap based PBGD
1: Select $(x_1,y_1)\in Z=C\times\mathbb{R}^{d_y}$. Select α, β, γ, $T_k$, K.
2: for k = 1 to K do
3: Obtain $\hat y_k=\omega^{(k)}_{T_k+1}$ following (7a).
4: Update $(x_k,y_k)$ following (7b).
5: end for

Notice that only first-order derivatives are required in update (7), in contrast to the implicit gradient methods or some iterative differentiation methods where higher-order derivatives are required; see, e.g., (Ghadimi & Wang, 2018; Franceschi et al., 2017; Liu et al., 2021b). In modern machine learning applications, this could substantially save computational cost since the dimension of the parameters is often large, making higher-order derivatives particularly costly.

3.4. Analysis of PBGD with function value gap

We first introduce the following regularity assumption commonly made in the convergence analysis of the projected GD method (Chen et al., 2021; Grazzi et al., 2020).

Assumption 3 (Smoothness). There exist constants $L_f$ and $L_g$ such that f(x, y) and g(x, y) are respectively $L_f$-Lipschitz-smooth and $L_g$-Lipschitz-smooth in (x, y).

Define the projected gradient of UPγp at $(x_k,y_k)\in Z$ as

$$G_\gamma(x_k,y_k):=\frac{1}{\alpha}\big((x_k,y_k)-(\bar x_{k+1},\bar y_{k+1})\big), \tag{8}$$

where $(\bar x_{k+1},\bar y_{k+1}):=\mathrm{Proj}_Z\big((x_k,y_k)-\alpha\nabla F_\gamma(x_k,y_k)\big)$. The quantity (8) is commonly used as the convergence metric for projected gradient methods. It is known that given a convex Z, $G_\gamma(x,y)=0$ if and only if (x, y) is a stationary point of UPγp (Ghadimi et al., 2016). We provide the following theorem on the convergence of V-PBGD.

Theorem 3. Consider V-PBGD (Algorithm 2). Suppose Assumptions 1, 2 and 3 hold. Select $\omega^{(k)}_1=y_k$, $\alpha\in(0,(L_f+\gamma(2L_g+L_g^2\mu))^{-1}]$, $\beta\in(0,L_g^{-1}]$, $\gamma\ge L\sqrt{3\mu\delta^{-1}}$ with some δ > 0, and $T_k=\Omega(\log(\alpha k))$.

i) With $C_f=\inf_{(x,y)\in Z} f(x,y)$, it holds that

$$\frac{1}{K}\sum_{k=1}^K\|G_\gamma(x_k,y_k)\|^2\le\frac{18\big(F_\gamma(x_1,y_1)-C_f\big)}{\alpha K}+\frac{10L^2L_g^2}{K}.$$

ii) Suppose $\lim_{k\to\infty}(x_k,y_k)=(x^*,y^*)$; then (x*, y*) is a stationary point of UPγp. If (x*, y*) is a local/global solution of UPγp, it is also a local/global solution of BPϵγ with $U(x)=\mathbb{R}^{d_y}$ and $\epsilon_\gamma\le\delta$.

The proof of Theorem 3 is deferred to Appendix B.3. Theorem 3 implies an iteration complexity of $\widetilde{\mathcal{O}}(\gamma\epsilon^{-1})$ to find an ϵ-stationary point of UPγp. This recovers the iteration complexity of the projected GD method (Nesterov, 2013) with a smoothness constant of Θ(γ). If we choose δ = ϵ, then the iteration complexity is $\widetilde{\mathcal{O}}(\epsilon^{-1.5})$. In (ii), under no conditions stronger than those needed for the projected GD method to yield meaningful solutions, the V-PBGD algorithm finds a local/global solution of the approximate UP.
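The following is a minimal runnable sketch of Algorithm 2 on another toy instance of our own design, where v(x) genuinely varies with x so the inner loop matters; all constants are illustrative choices:

```python
import numpy as np

# Sketch of V-PBGD (Algorithm 2) on a toy instance of our own design:
#   f(x, y) = (x - 1)^2 + (y + 1)^2,  g(x, y) = (y - sin(x))^2,  C = R.
# S(x) = {sin(x)}, v(x) = 0, and grad v(x) = d/dx g(x, y*) at y* = sin(x).

def v_pbgd(alpha=2e-3, beta=0.1, gamma=50.0, K=30000, T=20):
    x, y = 2.0, 0.0
    for _ in range(K):
        # inner loop (7a): T steps of GD on g(x, .) starting from y
        w = y
        for _ in range(T):
            w -= beta * 2 * (w - np.sin(x))
        # outer update (7b): gradient of f + gamma * (g(x, y) - v(x)),
        # with grad v(x) approximated by grad_x g(x, w)
        gx = 2 * (x - 1) + gamma * (-2 * (y - np.sin(x)) * np.cos(x)
                                    + 2 * (w - np.sin(x)) * np.cos(x))
        gy = 2 * (y + 1) + gamma * 2 * (y - np.sin(x))
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

x, y = v_pbgd()
print(x, y, abs(y - np.sin(x)))  # gap scales like 1/gamma on this instance
```

The iterates converge to a stationary point of the reduced problem $\min_x (x-1)^2+(\sin x+1)^2$, with a lower-level gap of order $(y+1)/\gamma$, matching the role of γ in Theorem 3.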
Remark 2 (On the computational and memory complexity). Regarding computational complexity, V-PBGD requires only additions/subtractions of gradients and parameters, whereas previous methods based on implicit differentiation and iterative differentiation additionally require computing Hessian-vector products. In this sense, when the parameter dimensions $d_x$, $d_y$ are large, the per-iteration computational complexity of PBGD is lower than those methods. Regarding memory complexity, PBGD requires storing the parameters $y,\omega\in\mathbb{R}^{d_y}$, $x\in\mathbb{R}^{d_x}$ and their gradients, so PBGD requires $O(d_x+d_y)$ space. Previous methods (e.g., implicit gradient methods) that utilize Hessian-vector products to compute the Neumann series approximation require $O(d_x+d_y)$ space. Other methods (e.g., RHG) that store the history of parameters to compute the hyper-gradient require $O(d_x+Td_y)$ space, where T is the lower-level iteration number. BVFIM (Liu et al., 2021a) requires $O(d_x+d_y)$ space.

4. Solving Bilevel Problems with Lower-level Constraints

In the previous section, we introduced the PBGD method to solve a class of non-convex bilevel problems with only upper-level constraints. When lower-level constraints are involved, it becomes more difficult to develop a gradient-based algorithm with finite-time guarantees. In this section, under assumptions on the lower-level objective that are jointly weaker than the commonly used smooth strong-convexity assumption, we propose an algorithm with a finite-time convergence guarantee. Specifically, consider the following special case of BP with U(x) = U:

$$\textrm{CP}:\ \min_{x,y} f(x,y) \quad \textrm{s.t.}\ x\in C,\ y\in\mathop{\arg\min}_{y\in U} g(x,y)$$

where we assume C, U are convex and compact in this section.

4.1. Penalty reformulation

Following Section 2, we seek to reformulate CP with a suitable penalty function p(x, y). We first list some conditions that will be repeatedly used in this section.

Assumption 4. Consider the following conditions:

(i) There exists µ > 0 such that given any $x\in C$, $g(x,\cdot)$ has $\frac{1}{\mu}$-quadratic-growth; that is, for all $y\in U$ it holds that $g(x,y)-v(x)\ge\frac{1}{\mu}d^2_{S(x)}(y)$, where $v(x):=\min_{y\in U} g(x,y)$.

(ii) There exists µ > 0 such that given any $x\in C$, $g(x,\cdot)$ satisfies the $\frac{1}{\mu}$-proximal-error-bound; that is, for all $y\in U$ it holds that $\frac{1}{\beta}\big\|y-\mathrm{Proj}_U\big(y-\beta\nabla_y g(x,y)\big)\big\|\ge\frac{1}{\mu}d_{S(x)}(y)$, where β is specified in Theorem 4.

(iii) Given any $x\in C$, $g(x,\cdot)$ is convex.

Note that the above conditions do not need to hold simultaneously. Define the penalty function as the lower-level function value gap g(x, y) − v(x), where $v(x)=\min_{y\in U} g(x,y)$.

Lemma 3. Assume (i) in Assumption 4 holds. Then p(x, y) = g(x, y) − v(x) is a µ-squared-distance bound.

The proof is similar to that of Lemma 1 and is thus omitted. Given a penalization constant γ > 0, we can define the penalized CP as follows:

$$\textrm{CP}_{\gamma p}:\ \min_{x,y} F_\gamma(x,y):=f(x,y)+\gamma\big(g(x,y)-v(x)\big) \quad \textrm{s.t.}\ x\in C,\ y\in U.$$

It remains to show that the solutions of CPγp are meaningful for CP. In the following proposition, we show that the solutions of CPγp approximately solve CP.

Proposition 3 (Relation on local/global solutions). Suppose Assumption 1 and either one of the following hold:

(a) Conditions (i) and (ii) in Assumption 4 hold. Choose $\gamma\ge\max\{L\sqrt{3\mu\delta^{-1}},\ 3\bar L\mu L\delta^{-1}\}$ with $\bar L=\max_{x\in C,y\in U}\|\nabla_y g(x,y)\|$ and some δ > 0;

(b) Conditions (i) and (iii) in Assumption 4 hold. Choose $\gamma\ge L\sqrt{\mu\delta^{-1}}$ with some δ > 0.

If (xγ, yγ) is a local/global solution of CPγp, it is also a local/global solution of BPϵγ with U(x) = U and $\epsilon_\gamma\le\delta$.

The proof of Proposition 3 can be found in Appendix C.1. In the above proposition, $\bar L$ is finite since C × U is compact and $\nabla_y g(x,y)$ is continuous.
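As a self-contained illustration of condition (i) with a non-singleton solution set (our own example, in the spirit of Example 2 below): for U = [−1, 1] and any x, take

$$g(x,y)=\max\Big\{|y|-\tfrac{1}{2},\,0\Big\}^2\ \Longrightarrow\ S(x)=\Big[-\tfrac{1}{2},\tfrac{1}{2}\Big],\quad g(x,y)-v(x)=d^2_{S(x)}(y),$$

so g is convex but not strongly convex, S(x) is an interval rather than a point, and the quadratic growth in Assumption 4(i) holds with µ = 1.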
4.2. PBGD under lower-level constraints

To study the gradient-based method for solving CPγp, it is crucial to identify when v(x) is Lipschitz-smooth. In the unconstrained lower-level case, we have shown that v(x) is Lipschitz-smooth by Lemma 2. However, the proof of Lemma 2 relies on the condition that $\nabla_y g(x,y^*)=0$ for any $y^*\in S(x)$, which is not necessarily true under a lower-level constraint. To characterize v(x), we introduce the following lemma.

Lemma 4 ((Dem'yanov & Malozemov, 1971, Theorem 2.2.1)). Given any direction $d\in\mathbb{R}^{d_x}$ with $\|d\|=1$, the directional derivative $\nabla_d v(x)$ of v(x) along d exists at any $x\in C$, and it holds that

$$\nabla_d v(x)=\min_{y^*\in S(x)}\langle\nabla_x g(x,y^*),d\rangle,\quad \forall x\in C. \tag{9}$$

In order for v(x) to have a Lipschitz-continuous gradient, all its directional derivatives are necessarily Lipschitz-continuous. By Lemma 4, one then needs $\nabla_x g(x,y)$ and S(x) to be Lipschitz-continuous in some sense, which is formalized next.

Lemma 5 (Lipschitz continuity of S(x)). Assume there exists $L_g$ such that g is $L_g$-Lipschitz-smooth. Assume either one of the following is true:

(a) Condition (ii) in Assumption 4 holds;

(b) Conditions (i) and (iii) in Assumption 4 hold.

Then there exists $L_S>0$ such that given any $x_1,x_2\in C$, for any $y_1\in S(x_1)$ there exists $y_2\in S(x_2)$ such that $\|y_1-y_2\|\le L_S\|x_1-x_2\|$.

The proof of Lemma 5 can be found in Appendix C.2. It is worth noting that Lemma 5 is not sufficient for v(x) to be smooth. Given any d in Lemma 4, consider $\nabla_d v(x_1)$ and $\nabla_d v(x_2)$ at different points $x_1,x_2$ in a small neighborhood. Suppose the minima in Lemma 4 are achieved at $y_1\in S(x_1)$ and $y_2\in S(x_2)$. Since $y_1$ and $y_2$ are not necessarily close in any sense, $\nabla_d v(x)$ is not guaranteed to be continuous. To address this issue, we provide a sufficient condition under which v(x) is Lipschitz-smooth in the following lemma.

Lemma 6 (Lipschitz smoothness of v(x)). Suppose the conditions in Lemma 5 hold. Given any $x\in C$, if $\nabla_x g(x,y_1)=\nabla_x g(x,y_2)$ for any $y_1,y_2\in S(x)$, then it holds that

$$\nabla v(x)=\nabla_x g(x,y^*),\quad \forall y^*\in S(x) \tag{10}$$

and v(x) is $L_v$-Lipschitz-smooth with $L_v=L_g(1+L_S)$.

The proof of Lemma 6 can be found in Appendix C.2. In Lemma 6, a sufficient condition for the assumption on $\nabla_x g(x,y)$ to hold is that S(x) is a singleton. Alternatively, the assumption holds if the constraint set U allows $\nabla_y g(x,y^*)=0$ for any $y^*\in S(x)$, by following the proof of Lemma 2. Below we give a simple example where Lemma 6 holds without a singleton S(x).

Example 2. Suppose $C=[0,\infty)$ and U = [2, 3]. Let

$$g(x,y)=\begin{cases}\frac{1}{2}(y-x-1)^2, & y\ge x+1\\ 0, & x-1<y<x+1\\ \frac{1}{2}(y-x+1)^2, & y\le x-1.\end{cases}$$

It can be checked that Lemma 5 holds for this example. When 0 ≤ x ≤ 1 or x ≥ 4, S(x) is a singleton. Otherwise, $\nabla_x g(x,y)=0$ on a non-singleton S(x). Thus for any $x\in C$, we have $\nabla_x g(x,y_1)=\nabla_x g(x,y_2)$ for any $y_1,y_2\in S(x)$, further indicating that Lemma 6 holds for this example.

Lemma 6 suggests one evaluate ∇v(x) with any solution of the lower-level problem. Given iteration k and $x_k$, it is then natural to run the projected GD method to find a solution of the lower-level problem:

$$\omega^{(k)}_{t+1}=\mathrm{Proj}_U\big(\omega^{(k)}_t-\beta\nabla_y g(x_k,\omega^{(k)}_t)\big),\quad t=1,\dots,T_k. \tag{11}$$

We can then approximate $\nabla v(x_k)\approx\nabla_x g(x_k,\hat y_k)$ with $\hat y_k=\omega^{(k)}_{T_k+1}$ and update $(x_k,y_k)$ following (7b) with Z = C × U. The V-PBGD update for the constrained lower-level problem is summarized in Algorithm 3.

Algorithm 3 V-PBGD under lower-level constraint
1: Select $(x_1,y_1)\in Z=C\times U$. Select α, β, γ, $T_k$, K.
2: for k = 1 to K do
3: Obtain $\hat y_k=\omega^{(k)}_{T_k+1}$ following (11).
4: Update $(x_k,y_k)$ following (7b) with Z = C × U.
5: end for
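A minimal sketch of the inner projected-GD loop (11), i.e., step 3 of Algorithm 3, for the common case where U is a box and the projection is a coordinate-wise clip; the instance and all names are illustrative assumptions:

```python
import numpy as np

# Inner loop (11) for a box constraint U = [lo, hi]^d; grad_y_g is assumed
# to return the lower-level gradient at (x, w).

def inner_projected_gd(grad_y_g, x, y0, lo, hi, beta, T):
    w = np.clip(y0, lo, hi)
    for _ in range(T):
        w = np.clip(w - beta * grad_y_g(x, w), lo, hi)
    return w  # approximate lower-level solution y_hat

# example: g(x, y) = ||y - x||^2 over U = [0, 1]^2, beta <= 1/L_g = 0.5
y_hat = inner_projected_gd(lambda x, w: 2 * (w - x),
                           x=np.array([1.5, -0.3]), y0=np.zeros(2),
                           lo=0.0, hi=1.0, beta=0.25, T=50)
print(y_hat)  # -> [1.0, 0.0], the projection of x onto U
```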
We provide the convergence result for Algorithm 3 next.

Theorem 4. Consider V-PBGD with lower-level constraint (Algorithm 3). Suppose Assumption 1, Assumption 3, and either (i)&(ii) or (i)&(iii) in Assumption 4 hold. Also assume the condition in Lemma 6 holds. With a prescribed accuracy δ > 0, select $\alpha=(L_f+\gamma(L_g+L_v))^{-1}$, $\beta=L_g^{-1}$, γ chosen by Proposition 3, and $T_k=\Omega(\log(\alpha\gamma k))$.

i) With $C_f=\inf_{(x,y)\in Z} f(x,y)$, it holds that

$$\frac{1}{K}\sum_{k=1}^K\|G_\gamma(x_k,y_k)\|^2\le\frac{8\big(F_\gamma(x_1,y_1)-C_f\big)}{\alpha K}+\frac{3L_g^2\mu C_g}{K}.$$

ii) Suppose $\lim_{k\to\infty}(x_k,y_k)=(x^*,y^*)$; then (x*, y*) is a stationary point of CPγp. If (x*, y*) is a local/global solution of CPγp, then (x*, y*) is a local/global solution of BPϵγ with U(x) = U and $\epsilon_\gamma\le\delta$.

The proof of Theorem 4 can be found in Appendix C.3. The above theorem implies an iteration complexity of $\widetilde{\mathcal{O}}(\gamma\epsilon^{-1})$ to find an ϵ-stationary point of CPγp. This recovers the iteration complexity of the projected GD method (Nesterov, 2013) with a smoothness constant of Θ(γ). If choosing δ = ϵ, the iteration complexity is $\widetilde{\mathcal{O}}(\epsilon^{-1.5})$ under Condition (b) or $\widetilde{\mathcal{O}}(\epsilon^{-2})$ under Condition (a) in Proposition 3.

5. Simulations

In this section, we first verify our main theoretical results on a toy problem and then compare the PBGD¹ algorithm with several other baselines on the data hyper-cleaning task.

¹The code is available on GitHub (link).

5.1. Numerical verification

We first consider the following non-convex bilevel problem:

$$\min_{x,y} f(x,y)=\frac{\cos(4y+2)}{1+e^{2-4x}}+\frac{1}{2}\ln\big((4x-2)^2+1\big)$$
$$\textrm{s.t.}\ x\in[0,3],\ y\in\mathop{\arg\min}_{y\in\mathbb{R}}\underbrace{(y+x)^2+x\sin^2(y+x)}_{g(x,y)} \tag{12}$$

which is a special case of UP. It can be checked that the assumptions in Theorem 3 are satisfied in (12). We plot the graph of z = f(x, y) in Figure 2a (left). Notice that given any $x\in[0,3]$, we have S(x) = {−x}. Thus the bilevel problem (12) reduces to $\min_{x\in C} f(x,y)|_{y=-x}$. We plot the single-level objective $z=f(x,y)|_{y=-x}$ in Figure 2a (left) as the intersection of the surface z = f(x, y) and the plane y = −x. We then run V-PBGD with γ = 10 from 1000 random initial points $(x_1,y_1)$ and plot the last iterates in Figure 2a (right). It can be observed that V-PBGD consistently finds the local solutions of the bilevel problem (12).

Figure 2: Test of V-PBGD on the bilevel problem (12). (a) Left: the graph of z = f(x, y) and of $z=f(x,y)|_{y\in S(x)}$; right: the last iterates $(x_K,y_K,f(x_K,y_K))$ generated by 1000 runs of V-PBGD with random initial points. (b) Test of V-PBGD with different penalty constants γ: total iterations K versus γ, and the lower-level optimality gap $d^2_{S(x_K)}(y_K)$ versus γ. The figure is generated by running V-PBGD (Algorithm 2) for K steps such that $\|G_\gamma(x_K,y_K)\|^2\le 10^{-4}$ given γ.

Next we test the impact of γ on the performance of V-PBGD and report the results in Figure 2b. From Figure 2b, the iteration complexity is Θ(γ), while the lower-level optimality gap is $\Theta(1/\gamma^2)$, consistent with Theorem 3.
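The $\Theta(1/\gamma^2)$ scaling of the lower-level gap can be reproduced in closed form on the toy problem (6) from Corollary 1; the sweep below is our own companion experiment, not the one behind Figure 2b:

```python
import numpy as np

# On problem (6) (f = y, g = y^2, penalty p = y^2) the penalized solution is
# y = -1/(2*gamma), so the lower-level gap d^2_S(y) = y^2 equals 1/(4*gamma^2).

for gamma in [1, 10, 100, 1000]:
    y = 1.0
    alpha = min(0.4, 1.0 / (2 * gamma))       # step size for stability
    for _ in range(200000):
        y -= alpha * (1.0 + 2 * gamma * y)    # gradient of y + gamma*y^2
    print(gamma, y, y**2, 1 / (4 * gamma**2)) # y^2 matches 1/(4 gamma^2)
```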
5.2. Data hyper-cleaning

In this section, we test PBGD on the data hyper-cleaning task (Franceschi et al., 2017; Shaban et al., 2019). In this task, one is given a set of polluted training data, along with sets of clean validation and test data. The goal is to train a data cleaner that assigns smaller weights to the polluted data to improve generalization on unseen clean data. We evaluate the performance of our algorithm in terms of speed, memory usage and solution quality in comparison with several competitive baseline algorithms, including IAPTT-GM (Liu et al., 2021b), BOME (Ye et al., 2022), RHG (Franceschi et al., 2017) and T-RHG (Shaban et al., 2019). In addition to V-PBGD, we also test G-PBGD, which is a special case of PBGD with $p(x,y)=\|\nabla_y g(x,y)\|^2$.

Adopting the settings in (Franceschi et al., 2017; Liu et al., 2021b; Shaban et al., 2019), we randomly split the MNIST dataset into a training set of size 5000, a validation set of size 5000 and a test set of size 10000, and pollute 50% of the training data with uniformly drawn labels. Then we run the algorithms with a linear model and an MLP network. We report the solution quality in Table 2. It can be observed that both PBGD algorithms achieve competitive performance, and V-PBGD achieves the best performance among the baselines.

Table 2: Comparison of solution quality. The results are averaged over 20 runs and are followed by an estimated margin of error at 95% confidence. The F1 score measures the quality of the data cleaner (Franceschi et al., 2017).

| Method | Linear model test accuracy | Linear model F1 score | 2-layer MLP test accuracy | 2-layer MLP F1 score |
|---|---|---|---|---|
| RHG | 87.64 ± 0.19 | 89.71 ± 0.25 | 87.50 ± 0.23 | 89.41 ± 0.21 |
| T-RHG | 87.63 ± 0.19 | 89.04 ± 0.24 | 87.48 ± 0.22 | 89.20 ± 0.21 |
| BOME | 87.09 ± 0.14 | 89.83 ± 0.18 | 87.42 ± 0.16 | 89.26 ± 0.17 |
| G-PBGD | 90.09 ± 0.12 | 90.82 ± 0.19 | 92.17 ± 0.09 | 90.73 ± 0.27 |
| IAPTT-GM | 90.44 ± 0.14 | 91.89 ± 0.15 | 91.72 ± 0.11 | 91.82 ± 0.19 |
| V-PBGD | 90.48 ± 0.13 | 91.99 ± 0.14 | 94.58 ± 0.08 | 93.16 ± 0.15 |

We also evaluate PBGD in terms of convergence speed and memory usage, reported in Table 3. It can be observed that PBGD does not exhibit a steep increase in memory or runtime compared to the ITD baselines, indicating PBGD is potentially more scalable.

Table 3: Comparison of GPU memory usage and the runtime to reach the highest test accuracy, averaged over 20 runs.

| | RHG | T-RHG | BOME | G-PBGD | IAPTT-GM | V-PBGD |
|---|---|---|---|---|---|---|
| GPU memory (MB), linear | 1369 | 1367 | 1149 | 1149 | 1237 | 1149 |
| GPU memory (MB), MLP | 7997 | 7757 | 1201 | 1235 | 2613 | 1199 |
| Runtime (s), linear | 73.21 | 32.28 | 5.92 | 7.72 | 693.65 | 9.12 |
| Runtime (s), MLP | 94.78 | 54.96 | 39.78 | 185.08 | 1310.63 | 207.53 |

6. Conclusions

In this work, we study the bilevel optimization problem through the lens of the penalty method. We prove that the solutions of the penalized problem approximately solve the original bilevel problem under certain generic conditions verifiable under commonly made assumptions. To solve the penalized problem, we propose the penalty-based bilevel GD method and establish its finite-time convergence for unconstrained and constrained lower-level problems. Experiments verify the effectiveness of the proposed algorithm.

Acknowledgements

The work was partially supported by NSF MoDL-SCALE Grant 2134168 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

References

Arbel, M. and Mairal, J. Non-convex bilevel games with critical point selection maps. In Proc. of Advances Neural Information Processing Systems, 2022.

Bolte, J., Pauwels, E., and Vaiter, S. Automatic differentiation of nonsmooth iterative algorithms. In Proc. of Advances Neural Information Processing Systems, 2022.

Chen, L., Xu, J., and Zhang, J. On bilevel optimization without lower-level strong convexity. arXiv preprint arXiv:2301.00712, 2023.

Chen, T., Sun, Y., and Yin, W. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. In Proc. of Advances Neural Information Processing Systems, 2021.

Chen, T., Sun, Y., Xiao, Q., and Yin, W. A single-timescale method for stochastic bilevel optimization. In Proc. of International Conference on Artificial Intelligence and Statistics, 2022.

Cheng, C., Xie, T., Jiang, N., and Agarwal, A. Adversarially trained actor critic for offline reinforcement learning. In Proc. of International Conference on Machine Learning, 2022.
Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In Proc. of Advances Neural Information Processing Systems, 2022.

Dem'yanov, V. F. and Malozemov, V. N. On the theory of non-linear minimax problems. Russian Mathematical Surveys, 26(3):57, 1971.

Drusvyatskiy, D. and Lewis, A. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919-948, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. of International Conference on Machine Learning, 2017.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proc. of International Conference on Machine Learning, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proc. of International Conference on Machine Learning, 2018.

Gao, L., Ye, J., Yin, H., Zeng, S., and Zhang, J. Value function based difference-of-convex algorithm for bilevel hyperparameter selection problems. In Proc. of International Conference on Machine Learning, 2022.

Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267-305, 2016.

Giovannelli, T., Kent, G., and Vicente, L. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. arXiv preprint arXiv:2110.00604, 2022.

Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. On the iteration complexity of hypergradient computation. In Proc. of International Conference on Machine Learning, pp. 3748-3758, 2020.

Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1), 2023.

Hu, Q., Zhong, Y., and Yang, T. Multi-block min-max bilevel optimization with applications in multi-task deep AUC maximization. In Proc. of Advances Neural Information Processing Systems, 2022.

Huang, F., Li, J., Gao, S., and Huang, H. Enhanced bilevel optimization via Bregman distance. In Proc. of Advances Neural Information Processing Systems, 2022.

Ji, K., Yang, J., and Liang, Y. Provably faster algorithms for bilevel optimization and applications to meta-learning. In Proc. of International Conference on Machine Learning, 2021.

Ji, K., Liu, M., Liang, Y., and Ying, L. Will bilevel optimizers benefit from loops. In Proc. of Advances Neural Information Processing Systems, 2022.

Jiang, H., Chen, Z., Shi, Y., Dai, B., and Zhao, T. Learning to defend by learning to attack. In Proc. of International Conference on Artificial Intelligence and Statistics, 2021.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proc. of Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2016.

Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. In Proc. of Advances Neural Information Processing Systems, 2021.

Li, J., Gu, B., and Huang, H.
A fully single loop algorithm for bilevel optimization without Hessian inverse. In Proc. of AAAI Conference on Artificial Intelligence, 2022.

Liu, C., Zhu, L., and Belkin, M. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85-116, 2022a.

Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In Proc. of International Conference on Machine Learning, 2020.

Liu, R., Liu, X., Yuan, X., Zeng, S., and Zhang, J. A value-function-based interior-point method for non-convex bi-level optimization. In Proc. of International Conference on Machine Learning, 2021a.

Liu, R., Liu, Y., Zeng, S., and Zhang, J. Towards gradient-based bilevel optimization with non-convex followers and beyond. In Proc. of Advances Neural Information Processing Systems, 2021b.

Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. A general descent aggregation framework for gradient-based bi-level optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):38-57, 2022b.

Lu, S., Cui, X., Squillante, M., Kingsbury, B., and Horesh, L. Decentralized bilevel optimization for personalized client learning. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.

Lu, Z. and Mei, S. First-order penalty methods for bilevel optimization. arXiv preprint arXiv:2301.01716, 2023.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In Proc. of International Conference on Machine Learning, 2015.

Mehra, A. and Hamm, J. Penalty method for inversion-free deep bilevel optimization. In Asian Conference on Machine Learning, 2021.

Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In Proc. of International Conference on Machine Learning, 2020.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Nesterov, Y. and Polyak, B. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Nouiehed, M., Sanjabi, M., Huang, T., Lee, J., and Razaviyayn, M. Solving a class of non-convex min-max games using iterative first order methods. In Proc. of Advances Neural Information Processing Systems, 2019.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In Proc. of International Conference on Machine Learning, 2016.

Rajeswaran, A., Finn, C., Kakade, S., and Levine, S. Meta-learning with implicit gradients. In Proc. of Advances Neural Information Processing Systems, 2019.

Sabach, S. and Shtern, S. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640-660, 2017.

Shaban, A., Cheng, C., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In Proc. of International Conference on Artificial Intelligence and Statistics, 2019.

Shen, H. and Chen, T. A single-timescale analysis for stochastic approximation with multiple coupled sequences. In Proc. of Advances Neural Information Processing Systems, 2022.

Sow, D., Ji, K., and Liang, Y. On the convergence theory for Hessian-free bilevel algorithms. In Proc.
of Advances Neural Information Processing Systems, 2022.

Stackelberg, H. The Theory of Market Economy. Oxford University Press, 1952.

Tarzanagh, D., Li, M., Thrampoulidis, C., and Oymak, S. FedNest: Federated bilevel, minimax, and compositional optimization. In Proc. of International Conference on Machine Learning, 2022.

Vicol, P., Lorraine, J., Pedregosa, F., Duvenaud, D., and Grosse, R. On implicit bias in overparameterized bilevel optimization. In Proc. of International Conference on Machine Learning, 2022.

Xiao, Q., Shen, H., Yin, W., and Chen, T. Alternating implicit projected SGD and its efficient variants for equality-constrained bilevel optimization. In Proc. of International Conference on Artificial Intelligence and Statistics, 2023.

Yang, S., Zhang, X., and Wang, M. Decentralized gossip-based stochastic bilevel optimization over communication networks. In Proc. of Advances Neural Information Processing Systems, 2022.

Ye, J., Zhu, D., and Zhu, Q. Exact penalization and necessary optimality conditions for generalized bilevel programming problems. SIAM Journal on Optimization, 7(2), 1997.

Ye, M., Liu, B., Wright, S., Stone, P., and Liu, Q. BOME! Bilevel optimization made easy: A simple first-order approach. In Proc. of Advances Neural Information Processing Systems, 2022.

Zhang, Y., Zhang, G., Khanduri, P., Hong, M., Chang, S., and Liu, S. Revisiting and advancing fast adversarial training through the lens of bi-level optimization. In Proc. of International Conference on Machine Learning, 2022.

Zhou, X., Pi, R., Zhang, W., Lin, Y., Chen, Z., and Zhang, T. Probabilistic bilevel coreset selection. In Proc. of International Conference on Machine Learning, 2022.

Appendix for On Penalty-based Bilevel Gradient Descent Method

A. Proofs for Section 2

For ease of reading, we restate BP and BPγp below:

$$\textrm{BP}:\ \min_{x,y} f(x,y)\ \ \textrm{s.t.}\ x\in C,\ y\in S(x):=\mathop{\arg\min}_{y\in U(x)} g(x,y),$$
and
$$\textrm{BP}_{\gamma p}:\ \min_{x,y} f(x,y)+\gamma p(x,y)\ \ \textrm{s.t.}\ x\in C,\ y\in U(x).$$

A.1. Proof of Theorem 1

Theorem 1. Assume p(x, y) is a ρ-squared-distance-bound function and $f(x,\cdot)$ is L-Lipschitz-continuous on U(x) for any $x\in C$. Given any ϵ₁ > 0, any global solution of BP is an ϵ₁-global-minimum point of BPγp with any $\gamma\ge\gamma^*=\frac{L^2\rho}{4}\epsilon_1^{-1}$. Conversely, given ϵ₂ ≥ 0, let (xγ, yγ) achieve an ϵ₂-global-minimum of BPγp with γ > γ*. Then (xγ, yγ) is an ϵ₂-global solution of the following approximate problem of BP with some $0\le\epsilon_\gamma\le(\epsilon_1+\epsilon_2)/(\gamma-\gamma^*)$:

$$\min_{x,y} f(x,y)\ \ \textrm{s.t.}\ x\in C,\ y\in U(x),\ p(x,y)\le\epsilon_\gamma. \tag{13}$$

Proof. Given any $x\in C$ and $y\in U(x)$, since S(x) is closed and non-empty, we can find $\bar y_x\in\arg\min_{y'\in S(x)}\|y'-y\|$. By the Lipschitz continuity assumption on $f(x,\cdot)$, given any $x\in C$, it holds for any $y\in U(x)$ that $f(x,y)\ge f(x,\bar y_x)-Ld_{S(x)}(y)$, since $d_{S(x)}(y)=\|\bar y_x-y\|$. Then it follows that

$$f(x,y)+\gamma^* p(x,y)\ge f(x,\bar y_x)-Ld_{S(x)}(y)+\frac{\gamma^*}{\rho}d^2_{S(x)}(y)\ge f(x,\bar y_x)+\min_{z\ge 0}\Big(-Lz+\frac{\gamma^*}{\rho}z^2\Big)=f(x,\bar y_x)-\epsilon_1 \quad\text{with } \gamma^*=\frac{L^2\rho}{4}\epsilon_1^{-1}. \tag{14}$$

Since $\bar y_x\in S(x)$ (thus $\bar y_x\in U(x)$) and $x\in C$, $(x,\bar y_x)$ is feasible for BP. Let f* be the optimal objective value of BP; then $f(x,\bar y_x)\ge f^*$. This along with (14) indicates

$$f(x,y)+\gamma^* p(x,y)\ge f^*-\epsilon_1,\quad \forall x\in C,\ y\in U(x). \tag{15}$$

Let (x*, y*) be a global solution of BP so that f(x*, y*) = f*. Since $y^*\in S(x^*)$, p(x*, y*) = 0. By (15) and $p(x,y)\ge 0$, we have for any $\gamma\ge\gamma^*$ that

$$f(x^*,y^*)+\gamma p(x^*,y^*)\le f(x,y)+\gamma p(x,y)+\epsilon_1,\quad \forall x\in C,\ y\in U(x). \tag{16}$$

Inequality (16), along with the fact that the global solution of BP is feasible for BPγp, proves that the global solution of BP achieves an ϵ₁-global-minimum of BPγp.

Now for the converse part.
Since (xγ, yγ) achieves an ϵ₂-global-minimum of BPγp, it holds for any (x, y) feasible for BPγp that

$$f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x,y)+\gamma p(x,y)+\epsilon_2. \tag{17}$$

In (17), choosing (x, y) = (x*, y*), a global solution of BP, yields

$$f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x^*,y^*)+\epsilon_2 \quad (\text{since } p(x^*,y^*)=0)$$
$$\le f(x_\gamma,y_\gamma)+\gamma^* p(x_\gamma,y_\gamma)+\epsilon_1+\epsilon_2 \quad (\text{by } (15)).$$

Then we have $(\gamma-\gamma^*)p(x_\gamma,y_\gamma)\le\epsilon_1+\epsilon_2$, i.e., $p(x_\gamma,y_\gamma)\le(\epsilon_1+\epsilon_2)/(\gamma-\gamma^*)$. Define $\epsilon_\gamma=p(x_\gamma,y_\gamma)$; then $\epsilon_\gamma\le(\epsilon_1+\epsilon_2)/(\gamma-\gamma^*)$. By (17), it holds for any (x, y) feasible for problem (13) that

$$f(x_\gamma,y_\gamma)-f(x,y)\le\gamma\big(p(x,y)-\epsilon_\gamma\big)+\epsilon_2\le\epsilon_2,$$

where the last inequality follows from the feasibility of (x, y) for (13). This, along with the fact that (xγ, yγ) is feasible for problem (13), proves that (xγ, yγ) is an ϵ₂-global solution of problem (13). □

A.2. Proof of Theorem 2

In this section, we prove a stronger version of Theorem 2, in the sense of weaker assumptions. We first define a new class of functions as follows.

Definition 3 (Restricted α-sublinearity). Let $X\subseteq\mathbb{R}^d$. We say a function $\ell:X\mapsto\mathbb{R}$ is restricted α-sublinear at $x\in X$ if there exist $\alpha\in[0,1]$ and $\bar x\in X$, the projection of x onto the minimum point set of ℓ, such that

$$\ell\big((1-\alpha)x+\alpha\bar x\big)\le(1-\alpha)\ell(x)+\alpha\ell(\bar x).$$

Suppose ℓ is a continuous convex function, or more generally a star-convex function (Nesterov & Polyak, 2006, Definition 1), defined on a closed convex set X with a non-empty minimum point set; then ℓ is restricted α-sublinear for any $\alpha\in[0,1]$ at every $x\in X$. Now we are ready to give the stronger version of Theorem 2.

Theorem 5 (Stronger version of Theorem 2). Assume $p(x,\cdot)$ is continuous given any $x\in C$ and p(x, y) is a ρ-squared-distance-bound function. Given γ > 0, let (xγ, yγ) be a local solution of BPγp on $\mathcal{N}((x_\gamma,y_\gamma),r)$. Assume $f(x_\gamma,\cdot)$ is L-Lipschitz-continuous on $\mathcal{N}(y_\gamma,r)$. Assume either one of the following is true:

(i) There exists $y\in\mathcal{N}(y_\gamma,r)$ such that $y\in U(x_\gamma)$ and $p(x_\gamma,y)\le\epsilon$ for some ϵ ≥ 0. Define $\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}+2\epsilon$.

(ii) The set U(xγ) is convex and $p(x_\gamma,\cdot)$ is restricted α-sublinear at yγ with some $\alpha\in(0,\min\{r/d_{S(x_\gamma)}(y_\gamma),1\}]$. Define $\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}$.

Then (xγ, yγ) is a local solution of the following approximate problem of BP with $0\le\epsilon_\gamma\le\bar\epsilon_\gamma$:

$$\min_{x,y} f(x,y)\ \ \textrm{s.t.}\ x\in C,\ y\in U(x),\ p(x,y)\le\epsilon_\gamma. \tag{18}$$

The above theorem is stronger than Theorem 2 in the sense that condition (ii) above is weaker than (ii) in Theorem 2, since the continuity and convexity of $p(x_\gamma,\cdot)$ imply the restricted α-sublinearity of $p(x_\gamma,\cdot)$ at yγ for any $\alpha\in[0,1]$.

Proof of Theorem 5. We prove the theorem for the two cases separately.

(i). Assume (i) is true. For δ ≥ 0, define $S_\delta(x):=\{y\in U(x): p(x,y)\le\delta\}$ for $x\in C$. Since p(x, y) = 0 if and only if (iff) $y\in S(x)=\arg\min_{y\in U(x)} g(x,y)$, it follows that $S(x)=\{y\in U(x): p(x,y)=0\}$. Then $S_\delta(x)\supseteq S(x)$, and thus $S_\delta(x)\neq\emptyset$. Moreover, $S_\delta(x)$ is closed by the continuity of $p(x,\cdot)$ and the closedness of U(x) for $x\in C$.

Since (xγ, yγ) is a local solution of BPγp on $\mathcal{N}((x_\gamma,y_\gamma),r)$, it holds for any $(x,y)\in\mathcal{N}((x_\gamma,y_\gamma),r)$ that is feasible for BPγp that

$$f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x,y)+\gamma p(x,y). \tag{19}$$

Since $S_\epsilon(x_\gamma)$ is closed and non-empty, we can find $\bar y_x\in\arg\min_{y'\in S_\epsilon(x_\gamma)}\|y'-y_\gamma\|$. Since $y\in\mathcal{N}(y_\gamma,r)\cap S_\epsilon(x_\gamma)$, we have $\|\bar y_x-y_\gamma\|\le\|y-y_\gamma\|\le r$. This indicates $\bar y_x\in\mathcal{N}(y_\gamma,r)$ and $(x_\gamma,\bar y_x)\in\mathcal{N}((x_\gamma,y_\gamma),r)$. Moreover, since $\bar y_x\in U(x_\gamma)$, $(x_\gamma,\bar y_x)$ is feasible for BPγp. This allows us to choose $(x,y)=(x_\gamma,\bar y_x)$ in (19), leading to

$$f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x_\gamma,\bar y_x)+\gamma\epsilon \quad \text{since } \bar y_x\in S_\epsilon(x_\gamma).$$
By the Lipschitz continuity of $f(x_\gamma,\cdot)$ on $\mathcal{N}(y_\gamma,r)$, we further have

$$\gamma p(x_\gamma,y_\gamma)-L\|\bar y_x-y_\gamma\|-\gamma\epsilon\le 0. \tag{20}$$

Since $S_\epsilon(x_\gamma)\supseteq S(x_\gamma)$, we have $\|\bar y_x-y_\gamma\|=d_{S_\epsilon(x_\gamma)}(y_\gamma)\le d_{S(x_\gamma)}(y_\gamma)\le\sqrt{\rho p(x_\gamma,y_\gamma)}$. Plugging this into (20) yields

$$\gamma p(x_\gamma,y_\gamma)-L\sqrt{\rho p(x_\gamma,y_\gamma)}-\gamma\epsilon\le 0$$

which implies $p(x_\gamma,y_\gamma)\le\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}+2\epsilon$. Let $\epsilon_\gamma=p(x_\gamma,y_\gamma)$; then $\epsilon_\gamma\le\bar\epsilon_\gamma$ and (xγ, yγ) is feasible for problem (18). By (19), it holds for any $(x,y)\in\mathcal{N}((x_\gamma,y_\gamma),r)$ feasible for problem (18) that

$$f(x_\gamma,y_\gamma)-f(x,y)\le\gamma\big(p(x,y)-\epsilon_\gamma\big)\le 0.$$

This and the fact that (xγ, yγ) is feasible for (18) imply that (xγ, yγ) is a local solution of (18).

(ii). Assume (ii) is true. Since S(xγ) is closed and non-empty, we can find $\bar y_x\in\arg\min_{y'\in S(x_\gamma)}\|y'-y_\gamma\|$. Let $y=(1-\alpha)y_\gamma+\alpha\bar y_x$. Since $0<\alpha\le\min\{r/\|y_\gamma-\bar y_x\|,1\}$, we know $y\in\mathcal{N}(y_\gamma,r)$ and $(x_\gamma,y)\in\mathcal{N}((x_\gamma,y_\gamma),r)$. Moreover, since U(xγ) is convex, we have $y\in U(x_\gamma)$, and thus (xγ, y) is feasible for BPγp. Since (xγ, yγ) is a local solution of BPγp on $\mathcal{N}((x_\gamma,y_\gamma),r)$, we have

$$f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x_\gamma,y)+\gamma p(x_\gamma,y). \tag{21}$$

Since p(xγ, y) ≥ 0 and p(xγ, y) = 0 iff $d_{S(x_\gamma)}(y)=0$, we know the minimum point set of $p(x_\gamma,\cdot)$ is S(xγ). Then by the restricted α-sublinearity of $p(x_\gamma,\cdot)$ at yγ, we have

$$p(x_\gamma,y)\le\alpha p(x_\gamma,\bar y_x)+(1-\alpha)p(x_\gamma,y_\gamma)=(1-\alpha)p(x_\gamma,y_\gamma).$$

Substituting the above inequality into (21) yields $f(x_\gamma,y_\gamma)+\gamma p(x_\gamma,y_\gamma)\le f(x_\gamma,y)+\gamma(1-\alpha)p(x_\gamma,y_\gamma)$. Rearranging and using the Lipschitz continuity of $f(x_\gamma,\cdot)$ on $\mathcal{N}(y_\gamma,r)$ yield

$$\gamma\alpha p(x_\gamma,y_\gamma)\le L\alpha\, d_{S(x_\gamma)}(y_\gamma)\le L\alpha\sqrt{\rho p(x_\gamma,y_\gamma)}$$

which implies $p(x_\gamma,y_\gamma)\le\bar\epsilon_\gamma=\frac{L^2\rho}{\gamma^2}$. Let $\epsilon_\gamma=p(x_\gamma,y_\gamma)$; then $\epsilon_\gamma\le\bar\epsilon_\gamma$ and (xγ, yγ) is feasible for problem (18). Since (xγ, yγ) is a local solution of BPγp on $\mathcal{N}((x_\gamma,y_\gamma),r)$, it holds for any $(x,y)\in\mathcal{N}((x_\gamma,y_\gamma),r)$ feasible for problem (18) that

$$f(x_\gamma,y_\gamma)-f(x,y)\le\gamma\big(p(x,y)-\epsilon_\gamma\big)\le 0.$$

This and the fact that (xγ, yγ) is feasible for (18) imply that (xγ, yγ) is a local solution of (18). □

B. Proofs for Section 3

B.1. Proof of Lemma 1

(i). We first consider p(x, y) = g(x, y) − v(x). By the definition of v(x), it is clear that g(x, y) − v(x) ≥ 0 for any $x\in C$ and $y\in\mathbb{R}^{d_y}$. Since S(x) is closed, $y\in S(x)$ iff $d_{S(x)}(y)=0$. Then by the definition of S(x), it holds for any $x\in C$ and $y\in\mathbb{R}^{d_y}$ that g(x, y) − v(x) = 0 iff $y\in S(x)$, i.e., g(x, y) − v(x) = 0 iff $d_{S(x)}(y)=0$. It then suffices to check that g(x, y) − v(x) upper-bounds $d^2_{S(x)}(y)$. By the $\frac{1}{\mu}$-PL condition of $g(x,\cdot)$ and (Karimi et al., 2016, Theorem 2), $g(x,\cdot)$ satisfies the $\frac{1}{\mu}$-quadratic-growth condition, and thus for any $x\in C$ and $y\in\mathbb{R}^{d_y}$ it holds that

$$g(x,y)-v(x)\ge\frac{1}{\mu}d^2_{S(x)}(y). \tag{22}$$

This completes the proof of (i).

(ii). Consider $p(x,y)=\mu\|\nabla_y g(x,y)\|^2$, where $g(x,\cdot)$ satisfies the PL condition given any $x\in C$. By the PL inequality, $\|\nabla_y g(x,y)\|^2=0$ is equivalent to $g(x,y)=\min_{y\in\mathbb{R}^{d_y}} g(x,y)$ given any $x\in C$; thus $\|\nabla_y g(x,y)\|^2=0$ iff $d_{S(x)}(y)=0$ for any $x\in C$. By the $\frac{1}{\mu}$-PL condition of $g(x,\cdot)$, we have $\|\nabla_y g(x,y)\|^2\ge\frac{1}{\mu}(g(x,y)-v(x))$. By (22), we have $g(x,y)-v(x)\ge\frac{1}{\mu}d^2_{S(x)}(y)$. Thus it holds that $\mu^2\|\nabla_y g(x,y)\|^2\ge d^2_{S(x)}(y)$, which completes the proof. □

B.2. Proof of Proposition 2

We prove the proposition under the two conditions separately.

(a) Since (xγ, yγ) is a local solution of UPγp, yγ is a local solution of UPγp with x = xγ fixed. By the first-order stationarity condition, it holds that

$$\nabla_y f(x_\gamma,y_\gamma)+\gamma\nabla_y g(x_\gamma,y_\gamma)=0 \ \Longrightarrow\ \|\nabla_y g(x_\gamma,y_\gamma)\|\le L/\gamma.$$

Since $g(x_\gamma,\cdot)$ satisfies the $\frac{1}{\mu}$-PL inequality, it holds that $\|\nabla_y g(x_\gamma,y_\gamma)\|^2\ge\frac{1}{\mu}p(x_\gamma,y_\gamma)$. The above two inequalities imply $p(x_\gamma,y_\gamma)\le\frac{L^2\mu}{\gamma^2}$.
Further notice that UP and UPγp are respectively special cases of BP and BPγp with $U(x)=\mathbb{R}^{d_y}$, and p(x, y) is a squared-distance bound by Lemma 1. The result then directly follows from Theorem 2, where condition (i) is met with y = yγ, ρ = µ and $\epsilon=\frac{L^2\mu}{\gamma^2}$, with $\gamma\ge L\sqrt{3\mu\delta^{-1}}$.

(b) Suppose $y_\gamma\notin S(x_\gamma)$. Since (xγ, yγ) is a local solution of UPγp, yγ is a local solution of UPγp with x = xγ fixed. By the first-order stationarity condition, it holds that

$$\nabla_y f(x_\gamma,y_\gamma)+2\mu\gamma\nabla^2_{yy}g(x_\gamma,y_\gamma)\nabla_y g(x_\gamma,y_\gamma)=0 \ \Longrightarrow\ \|\nabla^2_{yy}g(x_\gamma,y_\gamma)\nabla_y g(x_\gamma,y_\gamma)\|\le L/(2\mu\gamma)$$

which, along with the assumption that the singular values of $\nabla^2_{yy}g(x,y)$ on $\{(x,y)\in C\times\mathbb{R}^{d_y}: y\notin S(x)\}$ are lower-bounded by σ > 0, gives

$$p(x_\gamma,y_\gamma)=\mu\|\nabla_y g(x_\gamma,y_\gamma)\|^2\le\mu\frac{\|\nabla^2_{yy}g(x_\gamma,y_\gamma)\nabla_y g(x_\gamma,y_\gamma)\|^2}{\sigma^2}\le L^2/(4\gamma^2\mu\sigma^2). \tag{23}$$

When $y_\gamma\in S(x_\gamma)$, we know p(xγ, yγ) = 0 and thus (23) still holds. Further notice that UP and UPγp are special cases of BP and BPγp with $U(x)=\mathbb{R}^{d_y}$, and p(x, y) is a squared-distance bound by Lemma 1. The result then directly follows from Theorem 2, where condition (i) is met with y = yγ, ρ = µ and $\epsilon=L^2/(4\gamma^2\mu\sigma^2)$, with $\gamma\ge\max\{L\sqrt{2\mu\delta^{-1}},\,L\sigma^{-1}\sqrt{\mu^{-1}\delta^{-1}}\}$. □

B.3. Proof of Theorem 3

We first provide the convergence theorem for the sequence $\{\omega^{(k)}_t\}$ given the outer iteration k.

Theorem 6. Assume there exist $L_g>0$ and µ > 0 such that $g(x,\cdot)$ is $L_g$-Lipschitz-smooth and $\frac{1}{\mu}$-PL given any $x\in C$. Choose $\beta\in(0,\frac{1}{L_g}]$. Given any $x_k\in C$ and $\omega^{(k)}_1\in\mathbb{R}^{d_y}$, running $T_k$ steps of the inner GD update (7a) gives $\hat y_k$ satisfying

$$d^2_{S(x_k)}(\hat y_k)\le\mu\Big(1-\frac{\beta}{2\mu}\Big)^{T_k}\big(g(x_k,\omega^{(k)}_1)-v(x_k)\big). \tag{24}$$

Proof of Theorem 6. We omit the index k since the proof holds for any k. By the smoothness of $g(x,\cdot)$, it holds that

$$g(x,\omega_{t+1})\le g(x,\omega_t)-\beta\|\nabla_y g(x,\omega_t)\|^2+\frac{L_g\beta^2}{2}\|\nabla_y g(x,\omega_t)\|^2\le g(x,\omega_t)-\frac{\beta}{2}\|\nabla_y g(x,\omega_t)\|^2,$$

since $L_g\beta\le 1$. By the $\frac{1}{\mu}$-PL condition of $g(x,\cdot)$, we further have

$$g(x,\omega_{t+1})-v(x)\le\Big(1-\frac{\beta}{2\mu}\Big)\big(g(x,\omega_t)-v(x)\big).$$

Iteratively applying the above inequality for t = 1, ..., T yields

$$g(x,\omega_{T+1})-v(x)\le\Big(1-\frac{\beta}{2\mu}\Big)^T\big(g(x,\omega_1)-v(x)\big). \tag{25}$$

By (Karimi et al., 2016, Theorem 2), the $\frac{1}{\mu}$-PL condition of $g(x,\cdot)$ also implies the quadratic growth of $g(x,\cdot)$, which leads to $g(x,\omega_{T+1})-v(x)\ge\frac{1}{\mu}d^2_{S(x)}(\omega_{T+1})$ for $x\in C$, which along with (25) proves the result in (24). □

Notice that the term $g(x_k,\omega^{(k)}_1)-v(x_k)$ in (24) depends on the drifting variable $x_k$. If $\omega^{(k)}_1$ is not carefully chosen, $g(x_k,\omega^{(k)}_1)-v(x_k)$ can grow unboundedly with k and hence hinder convergence. To prevent this, we choose $\omega^{(k)}_1=y_k$ in the analysis. Since $g(x,\cdot)$ is $\frac{1}{\mu}$-PL for any $x\in C$, it holds that

$$g(x_k,\omega^{(k)}_1)-v(x_k)\le\mu\|\nabla_y g(x_k,\omega^{(k)}_1)\|^2=\frac{\mu}{\gamma^2\alpha^2}\|y_k-y_{k+1}-\alpha\nabla_y f(x_k,y_k)\|^2\le\frac{2\mu}{\gamma^2\alpha^2}\|y_{k+1}-y_k\|^2+\frac{2\mu L^2}{\gamma^2} \tag{26}$$

where we used Young's inequality and the condition that $f(x_k,\cdot)$ is L-Lipschitz-continuous. Later we will show that the inexact gradient descent update (7b) decreases $\|(x_{k+1},y_{k+1})-(x_k,y_k)\|$ and therefore upper-bounds $g(x_k,\omega^{(k)}_1)-v(x_k)$.

Next we give the proof of Theorem 3.

Proof of Theorem 3. In this proof, we write z = (x, y). Update (7b) can be written as $z_{k+1}=\mathrm{Proj}_Z\big(z_k-\alpha\hat\nabla F_\gamma(z_k;\hat y_k)\big)$ where

$$\hat\nabla F_\gamma(z_k;\hat y_k):=\nabla f(z_k)+\gamma\big(\nabla g(z_k)-\bar\nabla_x g(x_k,\hat y_k)\big).$$

By the assumptions made in this theorem and Lemma 2, Fγ is $L_\gamma$-Lipschitz-smooth with $L_\gamma=L_f+\gamma(2L_g+L_g^2\mu)$. Then by the Lipschitz smoothness of Fγ and $\alpha^{-1}\ge L_\gamma$, it holds that

$$F_\gamma(z_{k+1})\le F_\gamma(z_k)+\langle\nabla F_\gamma(z_k),z_{k+1}-z_k\rangle+\frac{L_\gamma}{2}\|z_{k+1}-z_k\|^2$$
$$\le F_\gamma(z_k)+\langle\hat\nabla F_\gamma(z_k;\hat y_k),z_{k+1}-z_k\rangle+\frac{1}{2\alpha}\|z_{k+1}-z_k\|^2+\langle\nabla F_\gamma(z_k)-\hat\nabla F_\gamma(z_k;\hat y_k),z_{k+1}-z_k\rangle. \tag{27}$$

Consider the second term on the RHS of (27).
Next we give the proof of Theorem 3.

Proof of Theorem 3. In this proof, we write $z = (x,y)$. Update (7b) can be written as $z_{k+1} = \Pi_Z\big(z_k - \alpha\hat\nabla F_\gamma(z_k;\hat y_k)\big)$, where
$$\hat\nabla F_\gamma(z_k;\hat y_k) := \nabla f(z_k) + \gamma\big(\nabla g(z_k) - (\nabla_x g(x_k,\hat y_k),\,0)\big).$$
By the assumptions made in this theorem and Lemma 2, $F_\gamma$ is $L_\gamma$-Lipschitz-smooth with $L_\gamma = L_f + \gamma(2L_g + L_g^2\mu)$. Then by the Lipschitz-smoothness of $F_\gamma$ and $\alpha\le 1/L_\gamma$, it holds that
$$F_\gamma(z_{k+1}) \le F_\gamma(z_k) + \langle\nabla F_\gamma(z_k), z_{k+1} - z_k\rangle + \frac{L_\gamma}{2}\|z_{k+1} - z_k\|^2$$
$$\le F_\gamma(z_k) + \langle\hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle + \frac{1}{2\alpha}\|z_{k+1} - z_k\|^2 + \langle\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle. \tag{27}$$
Consider the second term in the RHS of (27). By Lemma 7, $z_{k+1}$ can be written as
$$z_{k+1} = \arg\min_{z\in Z}\ \langle\hat\nabla F_\gamma(z_k;\hat y_k), z\rangle + \frac{1}{2\alpha}\|z - z_k\|^2.$$
By the first-order optimality condition of the above problem, it holds that
$$\Big\langle\hat\nabla F_\gamma(z_k;\hat y_k) + \frac1\alpha(z_{k+1} - z_k),\ z_{k+1} - z\Big\rangle \le 0,\quad\forall z\in Z.$$
Since $z_k\in Z$, we can choose $z = z_k$ in the above inequality and obtain
$$\langle\hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle \le -\frac1\alpha\|z_{k+1} - z_k\|^2. \tag{28}$$
Consider the last term in the RHS of (27). By Young's inequality, we first have
$$\langle\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle \le \alpha\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 + \frac{1}{4\alpha}\|z_{k+1} - z_k\|^2 \tag{29}$$
where the first term in the above inequality can be bounded as
$$\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 = \gamma^2\|\nabla v(x_k) - \nabla_x g(x_k,\hat y_k)\|^2$$
$$= \gamma^2\|\nabla_x g(x_k,y^*) - \nabla_x g(x_k,\hat y_k)\|^2 \ \text{ with } y^*\in S(x_k),\ \text{by Lemma 2},$$
$$\le \gamma^2 L_g^2\, d^2_{S(x_k)}(\hat y_k) \ \text{ by choosing } y^*\in\arg\min_{y\in S(x_k)}\|y - \hat y_k\|,$$
$$\le \gamma^2 L_g^2\mu\Big(1 - \frac{\beta}{2\mu}\Big)^{T_k}\big(g(x_k,\omega_1^{(k)}) - v(x_k)\big) \ \text{ by Theorem 6},$$
$$\le \Big(1 - \frac{\beta}{2\mu}\Big)^{T_k}\Big(\frac{2L_g^2\mu^2}{\alpha^2}\|z_{k+1} - z_k\|^2 + 2L^2L_g^2\mu^2\Big) \ \text{ by (26)},$$
$$\le \frac{1}{8\alpha^2}\|z_{k+1} - z_k\|^2 + \frac{L^2L_g^2}{2\alpha^2 k^2} \tag{30}$$
where the last inequality requires $T_k \ge \max\big\{\log_{c_\beta}\big((16L_g^2\mu^2)^{-1}\big),\ 2\log_{c_\beta}\big((2\mu\alpha k)^{-1}\big)\big\}$ with $c_\beta = 1 - \frac{\beta}{2\mu}$. Plugging the inequality (30) into (29) yields
$$\langle\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle \le \frac{3}{8\alpha}\|z_{k+1} - z_k\|^2 + \frac{L^2L_g^2}{2\alpha k^2}. \tag{31}$$
Substituting (31) and (28) into (27) and rearranging the resulting inequality yield
$$\frac{1}{8\alpha}\|z_{k+1} - z_k\|^2 \le F_\gamma(z_k) - F_\gamma(z_{k+1}) + \frac{L^2L_g^2}{2\alpha k^2}. \tag{32}$$
With $\bar z_{k+1}$ defined in (8), we have
$$\|\bar z_{k+1} - z_k\|^2 \le 2\|\bar z_{k+1} - z_{k+1}\|^2 + 2\|z_{k+1} - z_k\|^2 \le 2\alpha^2\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 + 2\|z_{k+1} - z_k\|^2 \le \frac94\|z_{k+1} - z_k\|^2 + \frac{L^2L_g^2}{k^2} \tag{33}$$
where the second inequality uses the non-expansiveness of $\Pi_Z$ and the last one follows from (30). Together, (32) and (33) imply
$$\|\bar z_{k+1} - z_k\|^2 \le 18\alpha\big(F_\gamma(z_k) - F_\gamma(z_{k+1})\big) + \frac{10L^2L_g^2}{k^2}.$$
Since $f(z)\ge C_f$ for any $z\in Z$ and $g(x,y) - v(x)\ge 0$, we have $F_\gamma(z)\ge C_f$ for any $z\in Z$. Taking a telescoping sum of the above inequality and using $G_\gamma(z_k) = \frac1\alpha(z_k - \bar z_{k+1})$ yield
$$\sum_{k=1}^K\|G_\gamma(z_k)\|^2 \le \frac{18\big(F_\gamma(z_1) - C_f\big)}{\alpha} + \frac{10L^2L_g^2}{\alpha^2}\sum_{k=1}^K\frac{1}{k^2},$$
which along with the fact that $\sum_{k=1}^K\frac{1}{k^2} \le 1 + \int_1^K\frac{1}{x^2}\,dx \le 2$ implies
$$\frac1K\sum_{k=1}^K\|G_\gamma(z_k)\|^2 \le \frac{18\big(F_\gamma(z_1) - C_f\big)}{\alpha K} + \frac{20L^2L_g^2}{\alpha^2 K}. \tag{34}$$
This proves (i) in Theorem 3.

Suppose $\lim_{k\to\infty} z_k = z_\infty$. Since $\nabla F_\gamma(z)$ is continuous, $G_\gamma(z)$ is continuous and thus $\lim_{k\to\infty} G_\gamma(z_k) = G_\gamma(z_\infty)$. By (34), $G_\gamma(z_\infty) = 0$, that is, $z_\infty = \Pi_Z\big(z_\infty - \alpha\nabla F_\gamma(z_\infty)\big)$. This further implies
$$\langle\nabla F_\gamma(z_\infty), z - z_\infty\rangle \ge 0,\quad\forall z\in Z,$$
which indicates that $z_\infty$ is a stationary point of $\mathrm{UP}_{\gamma p}$. If $z_\infty$ is a local/global solution, it follows from Propositions 1 and 2 that the rest of the result holds.

Lemma 7. Let $Z\subseteq\mathbb{R}^d$ be a closed convex set. Given any $z\in Z$, $q\in\mathbb{R}^d$ and $\alpha > 0$, it holds that
$$\Pi_Z(z - \alpha q) = \arg\min_{z'\in Z}\ \langle q, z'\rangle + \frac{1}{2\alpha}\|z' - z\|^2.$$
Proof. Given $z\in\mathbb{R}^d$, define $z^* = \arg\min_{z'\in\mathbb{R}^d} E(z')$ where
$$E(z') := \langle q, z' - z\rangle + \frac{1}{2\alpha}\|z' - z\|^2. \tag{35}$$
By the optimality condition, it follows that $z^* = z - \alpha q$. For any $z'\in\mathbb{R}^d$, it follows that
$$E(z') - E(z^*) = \langle q, z' - z\rangle + \frac{1}{2\alpha}\|z' - z\|^2 + \frac{\alpha}{2}\|q\|^2 = \frac{1}{2\alpha}\big\|z' - (z - \alpha q)\big\|^2. \tag{36}$$
Then we have
$$\arg\min_{z'\in Z}\ \langle q, z'\rangle + \frac{1}{2\alpha}\|z' - z\|^2 = \arg\min_{z'\in Z} E(z') = \arg\min_{z'\in Z}\big\|z' - (z - \alpha q)\big\|^2 \overset{(36)}{=} \Pi_Z(z - \alpha q). \tag{37}$$
This proves the result.
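Lemma 7 can be sanity-checked numerically. The sketch below compares the clamped (projected) step with the prox subproblem solved by a generic bound-constrained optimizer on a box $Z = [0,1]^d$; the box, data, and seed are illustrative assumptions of ours.

```python
import numpy as np
from scipy.optimize import minimize

# Check: proj_Z(z - alpha*q) == argmin_{z' in Z} <q, z'> + ||z' - z||^2 / (2*alpha)
rng = np.random.default_rng(1)
d, alpha = 5, 0.3
z, q = rng.uniform(0, 1, d), rng.normal(size=d)

proj_step = np.clip(z - alpha * q, 0.0, 1.0)          # projection onto the box is a clamp

obj = lambda zp: q @ zp + (zp - z) @ (zp - z) / (2 * alpha)
res = minimize(obj, x0=z, bounds=[(0.0, 1.0)] * d)    # L-BFGS-B under the hood

print(np.max(np.abs(proj_step - res.x)))              # should be ~1e-6 or smaller
```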
B.4. Extension to the stochastic case

In Section 3, we have studied the deterministic V-PBGD algorithm. In this section, we extend the V-PBGD algorithm to the stochastic case. With random variables $\xi,\psi$, we assume access to $\nabla g(x,y;\psi)$ and $\nabla f(x,y;\xi)$, which are respectively stochastic versions of $\nabla g(x,y)$ and $\nabla f(x,y)$. Following the idea of V-PBGD, given iteration $k$ and $x_k$, we first solve the lower-level problem with the stochastic gradient descent method:
$$\omega_{t+1}^{(k)} = \omega_t^{(k)} - \beta_t\nabla_y g(x_k,\omega_t^{(k)};\psi_t^{(k)})\quad\text{for } t = 1,\dots,T_k. \tag{38}$$
Then we choose the approximate lower-level solution $\hat y_k = \omega_i^{(k)}$, where $i$ is drawn from the step-size-weighted distribution $\mathbb{P}(i = t) = \beta_t/\sum_{t'=1}^{T_k}\beta_{t'}$, $t = 1,\dots,T_k$. Given $\hat y_k$ and the batch size $M$, $(x_k,y_k)$ is updated with the approximate stochastic gradient of $F_\gamma(x_k,y_k)$ as follows:
$$(x_{k+1},y_{k+1}) = \Pi_Z\Big((x_k,y_k) - \frac{\alpha_k}{M}\sum_{j=1}^M\Big[\nabla f(x_k,y_k;\xi_k^j) + \gamma\big(\nabla g(x_k,y_k;\psi_k^j) - (\nabla_x g(x_k,\hat y_k;\psi_k^j),\,0)\big)\Big]\Big). \tag{39}$$
The update is summarized in Algorithm 4.

Algorithm 4 V-PBSGD: Value-gap-based penalized bilevel stochastic gradient descent.
1: Select $(x_1,y_1)\in Z = C\times\mathbb{R}^{d_y}$. Select $\gamma, K, T_k, \alpha_k, \beta_t$ and $M$.
2: for $k = 1$ to $K$ do
3: Choose $\omega_1^{(k)} = y_k$; do $\omega_{t+1}^{(k)} = \omega_t^{(k)} - \beta_t\nabla_y g(x_k,\omega_t^{(k)};\psi_t^{(k)})$ for $t = 1,\dots,T_k$.
4: Choose $\hat y_k = \omega_i^{(k)}$ with $i\sim\mathbb{P}$, where $\mathbb{P}(i = t) = \beta_t/\sum_{t'=1}^{T_k}\beta_{t'}$, $t = 1,\dots,T_k$.
5: $(x_{k+1},y_{k+1}) = \Pi_Z\big((x_k,y_k) - \frac{\alpha_k}{M}\sum_{j=1}^M[\nabla f(x_k,y_k;\xi_k^j) + \gamma(\nabla g(x_k,y_k;\psi_k^j) - (\nabla_x g(x_k,\hat y_k;\psi_k^j),\,0))]\big)$.
6: end for

We make the following standard assumption, commonly used in the analysis of stochastic gradient methods.

Assumption 5. There exists a constant $c > 0$ such that, given any $k$, the stochastic gradients in Algorithm 4 are unbiased and have variance bounded by $c^2$.
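For concreteness, here is a minimal single-outer-iteration sketch of Algorithm 4 under Assumption 5 on a hypothetical quadratic instance; the objectives $f$, $g$, the Gaussian noise model, the set $C = [-1,1]^d$, and all constants are stand-ins we introduce for illustration, not the paper's experimental setup.

```python
import numpy as np

# Hypothetical instance: f(x,y) = 0.5||y-a||^2 + 0.5||x||^2, g(x,y) = 0.5||y-x||^2,
# C = [-1,1]^d and U(x) = R^d, so Lg = 1; stochastic oracles add Gaussian noise.
rng = np.random.default_rng(3)
d, s, gamma, alpha, T, M, Lg = 4, 0.1, 10.0, 0.005, 200, 32, 1.0
a = np.ones(d)
noisy = lambda v: v + s * rng.normal(size=v.shape)   # unbiased stochastic gradient

x, y = np.zeros(d), np.zeros(d)

# Inner loop (38): SGD on g(x_k, .) with beta_t = 1/(Lg*sqrt(t)), warm start y_k.
betas = 1.0 / (Lg * np.sqrt(np.arange(1, T + 1)))
omegas, w = [], y.copy()
for beta_t in betas:
    omegas.append(w.copy())
    w = w - beta_t * noisy(w - x)                    # grad_y g(x, w; psi) = (w - x) + noise

# Step-size-weighted choice of the inner output: P(i = t) = beta_t / sum_t beta_t.
i = rng.choice(T, p=betas / betas.sum())
y_hat = omegas[i]

# Outer update (39): projected minibatch step on the penalized objective.
gx = np.mean([noisy(x) + gamma * (noisy(x - y) - noisy(x - y_hat)) for _ in range(M)], axis=0)
gy = np.mean([noisy(y - a) + gamma * noisy(y - x) for _ in range(M)], axis=0)
x = np.clip(x - alpha * gx, -1.0, 1.0)               # projection of the x-block onto C
y = y - alpha * gy                                   # y-block is unconstrained
print("x after one outer step:", x)
print("y after one outer step:", y)
```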
With the above assumption, we provide the convergence result as follows.

Theorem 7. Consider V-PBSGD (Algorithm 4). Assume Assumption 5 and the assumptions in Theorem 3 hold. Choose $\alpha_k = \alpha \le \big(L_f + \gamma(2L_g + L_g^2\mu)\big)^{-1}$, $\beta_t = 1/(L_g\sqrt t)$ and $T_k = T$ for any $k$. It holds that
$$\frac1K\sum_{k=1}^K\mathbb{E}\|G_\gamma(x_k,y_k)\|^2 = \mathcal{O}\Big(\frac{1}{\alpha K}\Big) + \mathcal{O}\Big(\frac{\gamma^2 c^2}{M}\Big) + \mathcal{O}\Big(\frac{\gamma^2\ln T}{\sqrt T}\Big).$$

Proof. Convergence of $\omega$. We omit the superscript $(k)$ of $\omega_t^{(k)}$ and $\psi_t^{(k)}$ since the proof holds for any $k$. We write $\mathbb{E}_t[\cdot]$ for the conditional expectation given the filtration of samples before iteration $(k,t)$. By the $L_g$-Lipschitz-smoothness of $g(x,\cdot)$, it holds that
$$\mathbb{E}_t[g(x_k,\omega_{t+1})] \le g(x_k,\omega_t) + \langle\nabla_y g(x_k,\omega_t), \mathbb{E}_t[\omega_{t+1} - \omega_t]\rangle + \frac{L_g}{2}\mathbb{E}_t\|\omega_{t+1} - \omega_t\|^2$$
$$= g(x_k,\omega_t) - \beta_t\|\nabla_y g(x_k,\omega_t)\|^2 + \frac{L_g\beta_t^2}{2}\mathbb{E}_t\|\nabla_y g(x_k,\omega_t;\psi_t)\|^2 \tag{40}$$
which follows since $\nabla_y g(x_k,\omega_t;\psi_t)$ is unbiased. The last term of (40) can be bounded as
$$\mathbb{E}_t\|\nabla_y g(x_k,\omega_t;\psi_t)\|^2 = \mathbb{E}_t\|\nabla_y g(x_k,\omega_t;\psi_t) - \nabla_y g(x_k,\omega_t)\|^2 + \|\nabla_y g(x_k,\omega_t)\|^2 \le c^2 + \|\nabla_y g(x_k,\omega_t)\|^2. \tag{41}$$
Substituting the above inequality back into (40) yields
$$\mathbb{E}_t[g(x_k,\omega_{t+1})] \le g(x_k,\omega_t) - \frac{\beta_t}{2}\|\nabla_y g(x_k,\omega_t)\|^2 + \frac{L_gc^2}{2}\beta_t^2 \le g(x_k,\omega_t) - \frac{\beta_t}{2\mu^2}d^2_{S(x_k)}(\omega_t) + \frac{L_gc^2}{2}\beta_t^2 \tag{42}$$
where the first inequality requires $\beta_t\le L_g^{-1}$ and the last one follows from the fact that a Lipschitz-smooth $\frac1\mu$-PL function $g(x,\cdot)$ satisfies the $\frac{1}{\mu^2}$-error-bound condition (Karimi et al., 2016, Theorem 2). We write $\mathbb{E}_k[\cdot]$ for the conditional expectation given the filtration of samples before iteration $k$. Taking $\mathbb{E}_k$ and a telescoping sum over both sides of (42) yields
$$\sum_{t=1}^{T_k}\beta_t\,\mathbb{E}_k\big[d^2_{S(x_k)}(\omega_t^{(k)})\big] \le 2\mu^2\big(g(x_k,\omega_1^{(k)}) - \mathbb{E}_k[g(x_k,\omega_{T_k+1}^{(k)})]\big) + L_gc^2\mu^2\sum_{t=1}^{T_k}\beta_t^2 \le 2\mu^2\big(g(x_k,\omega_1^{(k)}) - v(x_k)\big) + L_gc^2\mu^2\sum_{t=1}^{T_k}\beta_t^2. \tag{43}$$

Convergence of $(x,y)$. In this proof, we write $z = (x,y)$. Given $z_k$, define $\bar z_{k+1} = \Pi_Z\big(z_k - \alpha_k\nabla F_\gamma(z_k)\big)$. For convenience, we also write
$$\nabla_k F_\gamma := \frac1M\sum_{j=1}^M\Big[\nabla f(x_k,y_k;\xi_k^j) + \gamma\big(\nabla g(x_k,y_k;\psi_k^j) - (\nabla_x g(x_k,\hat y_k;\psi_k^j),\,0)\big)\Big],$$
$$\hat\nabla F_\gamma(z_k;\hat y_k) := \mathbb{E}\big[\nabla_k F_\gamma\,\big|\,z_k,\hat y_k\big] = \nabla f(z_k) + \gamma\big(\nabla g(z_k) - (\nabla_x g(x_k,\hat y_k),\,0)\big). \tag{44}$$
By the assumptions made in this theorem, $F_\gamma$ is $L_\gamma$-Lipschitz-smooth with $L_\gamma = L_f + \gamma(2L_g + L_g^2\mu)$. Then by the Lipschitz-smoothness of $F_\gamma$ and $\alpha_k\le 1/L_\gamma$, it holds that
$$\mathbb{E}_k[F_\gamma(z_{k+1})] \le F_\gamma(z_k) + \mathbb{E}_k\langle\nabla F_\gamma(z_k), z_{k+1} - z_k\rangle + \frac{L_\gamma}{2}\mathbb{E}_k\|z_{k+1} - z_k\|^2$$
$$\le F_\gamma(z_k) + \mathbb{E}_k\langle\nabla_k F_\gamma, z_{k+1} - z_k\rangle + \mathbb{E}_k\langle\nabla F_\gamma(z_k) - \nabla_k F_\gamma, z_{k+1} - \bar z_{k+1}\rangle + \mathbb{E}_k\langle\nabla F_\gamma(z_k) - \nabla_k F_\gamma, \bar z_{k+1} - z_k\rangle + \frac{1}{2\alpha_k}\mathbb{E}_k\|z_{k+1} - z_k\|^2. \tag{45}$$
Consider the second term in the RHS of (45). By Lemma 7, $z_{k+1}$ can be written as
$$z_{k+1} = \arg\min_{z\in Z}\ \langle\nabla_k F_\gamma, z\rangle + \frac{1}{2\alpha_k}\|z - z_k\|^2.$$
By the first-order optimality condition of the above problem, it holds that
$$\Big\langle\nabla_k F_\gamma + \frac{1}{\alpha_k}(z_{k+1} - z_k),\ z_{k+1} - z\Big\rangle \le 0,\quad\forall z\in Z.$$
Since $z_k\in Z$, we can choose $z = z_k$ in the above inequality and obtain
$$\langle\nabla_k F_\gamma, z_{k+1} - z_k\rangle \le -\frac{1}{\alpha_k}\|z_{k+1} - z_k\|^2. \tag{46}$$
The third term in the RHS of (45) can be bounded as
$$\mathbb{E}_k\langle\nabla F_\gamma(z_k) - \nabla_k F_\gamma, z_{k+1} - \bar z_{k+1}\rangle \le \mathbb{E}_k\big[\|\nabla F_\gamma(z_k) - \nabla_k F_\gamma\|\,\|z_{k+1} - \bar z_{k+1}\|\big] \le \alpha_k\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \nabla_k F_\gamma\|^2 \tag{47}$$
where the second inequality follows from the non-expansiveness of the projection operator. The fourth term in the RHS of (45) can be bounded as
$$\mathbb{E}_k\langle\nabla F_\gamma(z_k) - \nabla_k F_\gamma, \bar z_{k+1} - z_k\rangle = \mathbb{E}_k\langle\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k), \bar z_{k+1} - z_k\rangle \le 2\alpha_k\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 + \frac{1}{8\alpha_k}\mathbb{E}_k\|\bar z_{k+1} - z_k\|^2 \tag{48}$$
where the equality uses (44) together with the fact that $\bar z_{k+1}$ is deterministic given the past, and the last inequality follows from Young's inequality. In addition, we have
$$\mathbb{E}_k\|\bar z_{k+1} - z_k\|^2 \le 2\mathbb{E}_k\|\bar z_{k+1} - z_{k+1}\|^2 + 2\mathbb{E}_k\|z_{k+1} - z_k\|^2 \le 2\alpha_k^2\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \nabla_k F_\gamma\|^2 + 2\mathbb{E}_k\|z_{k+1} - z_k\|^2$$
$$\le 4\alpha_k^2\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 + 4\alpha_k^2\,\mathbb{E}_k\|\hat\nabla F_\gamma(z_k;\hat y_k) - \nabla_k F_\gamma\|^2 + 2\mathbb{E}_k\|z_{k+1} - z_k\|^2$$
which after rearranging gives
$$\mathbb{E}_k\|z_{k+1} - z_k\|^2 \ge \frac12\mathbb{E}_k\|\bar z_{k+1} - z_k\|^2 - 2\alpha_k^2\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 - 2\alpha_k^2\,\mathbb{E}_k\|\hat\nabla F_\gamma(z_k;\hat y_k) - \nabla_k F_\gamma\|^2. \tag{49}$$
Substituting (46)–(49) into (45) and rearranging yields
$$\frac{1}{8\alpha_k}\mathbb{E}_k\|\bar z_{k+1} - z_k\|^2 \le F_\gamma(z_k) - \mathbb{E}_k[F_\gamma(z_{k+1})] + 2\alpha_k\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \nabla_k F_\gamma\|^2 + 3\alpha_k\,\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2. \tag{50}$$
Under Assumption 5, the third term in the RHS of (50) exhibits the $\mathcal{O}(1/M)$ dependence of the variance:
$$\mathbb{E}_k\|\nabla F_\gamma(z_k) - \nabla_k F_\gamma\|^2 \le \frac{3(2\gamma^2 + 1)c^2}{M}. \tag{51}$$
The fourth term in the RHS of (50) can be bounded by
$$\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 = \gamma^2\mathbb{E}_k\|\nabla v(x_k) - \nabla_x g(x_k,\hat y_k)\|^2$$
$$= \gamma^2\mathbb{E}_k\|\nabla_x g(x_k,y^*) - \nabla_x g(x_k,\hat y_k)\|^2 \ \text{ with } y^*\in S(x_k),\ \text{by Lemma 2},$$
$$\le \gamma^2 L_g^2\,\mathbb{E}_k\big[d^2_{S(x_k)}(\hat y_k)\big] = \gamma^2 L_g^2\,\frac{\sum_{t=1}^T\beta_t\,\mathbb{E}_k\big[d^2_{S(x_k)}(\omega_t^{(k)})\big]}{\sum_{i=1}^T\beta_i} \tag{52}$$
where the last equality follows from the distribution of $\hat y_k$. By (43), it holds that
$$\sum_{t=1}^T\beta_t\,\mathbb{E}_k\big[d^2_{S(x_k)}(\omega_t^{(k)})\big] \le 2\mu^2\big(g(x_k,\omega_1^{(k)}) - v(x_k)\big) + L_gc^2\mu^2\sum_{t=1}^T\beta_t^2. \tag{53}$$
In the above inequality, we can further bound the initial gap as (cf. $\omega_1^{(k)} = y_k$)
$$g(x_k,\omega_1^{(k)}) - v(x_k) \le \mu\|\nabla_y g(x_k,\omega_1^{(k)})\|^2 = \frac{\mu}{\gamma^2}\Big\|\frac{y_k - \bar y_{k+1}}{\alpha_k} - \nabla_y f(x_k,y_k)\Big\|^2 \le \frac{2\mu}{\gamma^2\alpha_k^2}\|\bar z_{k+1} - z_k\|^2 + \frac{2L^2\mu}{\gamma^2} \tag{54}$$
where the first inequality follows from $g(x,\cdot)$ being $\frac1\mu$-PL; the equality follows from the definition of $\bar z_{k+1}$; and the last one follows from Young's inequality and the Lipschitz continuity of $f(x,\cdot)$. Substituting (53) and (54) into (52) yields
$$\mathbb{E}_k\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 \le \frac{1}{48\alpha_k^2}\mathbb{E}_k\|\bar z_{k+1} - z_k\|^2 + \frac{4\mu^3L^2L_g^2}{\sum_{i=1}^T\beta_i} + \frac{\gamma^2L_g^3c^2\mu^2\sum_{t=1}^T\beta_t^2}{\sum_{t=1}^T\beta_t} \tag{55}$$
where we have also used $\sum_{i=1}^T\beta_i \ge 192\mu^3L_g^2$ to simplify the first term; this can always be satisfied by a large enough $T$. Substituting (55) and (51) into (50), rearranging, and taking total expectation yield
$$\frac{1}{16\alpha_k}\mathbb{E}\|\bar z_{k+1} - z_k\|^2 \le \mathbb{E}\big[F_\gamma(z_k) - F_\gamma(z_{k+1})\big] + \frac{6(2\gamma^2 + 1)c^2\alpha_k}{M} + \frac{12\mu^3L^2L_g^2\,\alpha_k}{\sum_{i=1}^T\beta_i} + \frac{3\gamma^2L_g^3c^2\mu^2\,\alpha_k\sum_{t=1}^T\beta_t^2}{\sum_{t=1}^T\beta_t}. \tag{56}$$
Using $\|\bar z_{k+1} - z_k\|^2 = \alpha_k^2\|G_\gamma(z_k)\|^2$ in the LHS of the above inequality and taking a telescoping sum over $k = 1,\dots,K$ yield
$$\sum_{k=1}^K\alpha_k\,\mathbb{E}\|G_\gamma(z_k)\|^2 = \mathcal{O}\big(F_\gamma(z_1) - C_f\big) + \mathcal{O}\Big(\frac{\gamma^2c^2}{M}\sum_{k=1}^K\alpha_k\Big) + \mathcal{O}\Big(\frac{\gamma^2\sum_{t=1}^T\beta_t^2}{\sum_{i=1}^T\beta_i}\sum_{k=1}^K\alpha_k\Big).$$
By the choice of step size, we have in the RHS that $\sum_{i=1}^T\beta_i \ge T\beta_T = \Theta(\sqrt T)$ and $\sum_{t=1}^T\beta_t^2 \le \beta_1^2\big(1 + \int_1^T\frac{dx}{x}\big) = \beta_1^2(\ln T + 1)$. This proves the result.
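The two step-size sums that drive the final rate are easy to check numerically; the snippet below (illustration only, with the paper's choice $\beta_t = 1/(L_g\sqrt t)$ and $L_g = 1$) confirms that $\sum_t\beta_t$ grows like $\sqrt T$ while $\sum_t\beta_t^2$ grows like $\ln T$.

```python
import numpy as np

# With beta_t = 1/sqrt(t): sum beta_t = Theta(sqrt(T)), sum beta_t^2 = O(ln T).
for T in [10**2, 10**4, 10**6]:
    b = 1.0 / np.sqrt(np.arange(1, T + 1))
    print(f"T={T:>8}: sum b = {b.sum():10.1f} (sqrt(T) = {np.sqrt(T):8.1f}), "
          f"sum b^2 = {np.square(b).sum():6.2f} (ln T = {np.log(T):5.2f})")
```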
C. Proof in Section 4

C.1. Proof of Proposition 3

We prove the proposition under the two conditions separately.

(a) Suppose condition (a) holds. Given $x\in C$, define the projected gradient of $g(x,\cdot)$ as
$$G(y;x) = \frac1\beta\Big(y - \Pi_U\big(y - \beta\nabla_y g(x,y)\big)\Big).$$
Since $y_\gamma$ is a local solution of $\mathrm{CP}_{\gamma p}$ given $x = x_\gamma$, we have
$$y_\gamma - \Pi_U\Big(y_\gamma - \beta\big(\tfrac1\gamma\nabla_y f(x_\gamma,y_\gamma) + \nabla_y g(x_\gamma,y_\gamma)\big)\Big) = 0. \tag{57}$$
Then we have
$$\|G(y_\gamma;x_\gamma)\| = \frac1\beta\Big\|\Pi_U\Big(y_\gamma - \beta\big(\tfrac1\gamma\nabla_y f(x_\gamma,y_\gamma) + \nabla_y g(x_\gamma,y_\gamma)\big)\Big) - \Pi_U\big(y_\gamma - \beta\nabla_y g(x_\gamma,y_\gamma)\big)\Big\| \le \frac1\gamma\|\nabla_y f(x_\gamma,y_\gamma)\| \le \frac L\gamma \tag{58}$$
where the first inequality uses the non-expansiveness of $\Pi_U$. By the proximal-error-bound inequality, we further have
$$d_{S(x_\gamma)}(y_\gamma) \le \mu\|G(y_\gamma;x_\gamma)\| \le \frac{\mu L}{\gamma}.$$
Since $g$ is continuously differentiable and $C\times U$ is compact, we can define $L_{1,g} = \max_{x\in C,\,y\in U}\|\nabla_y g(x,y)\|$. Then $g(x,\cdot)$ is $L_{1,g}$-Lipschitz-continuous on $U$ given any $x\in C$, which yields
$$p(x_\gamma,y_\gamma) = g(x_\gamma,y_\gamma) - v(x_\gamma) \le L_{1,g}\,d_{S(x_\gamma)}(y_\gamma) \le \frac{L_{1,g}\mu L}{\gamma}. \tag{59}$$
In addition, Lemma 3 holds under condition (a), so $p(x,y)$ is a squared-distance bound. Further notice that CP and $\mathrm{CP}_{\gamma p}$ are special cases of BP and $\mathrm{BP}_{\gamma p}$ with $U(x) = U$. The rest of the result then follows from Theorem 1 with $\epsilon_1 = L\sqrt{\rho\delta}/2$, $\gamma \ge 2\bar\gamma = L\sqrt{\mu\delta^{-1}}$ and $\epsilon_2 = 0$, and from Theorem 2, where condition (i) holds with (59).

(b) Under condition (b), Lemma 3 holds, so $p(x,y)$ is a squared-distance bound. Further notice that CP and $\mathrm{CP}_{\gamma p}$ are special cases of BP and $\mathrm{BP}_{\gamma p}$ with $U(x) = U$. The result then follows directly from Theorem 1 with $\epsilon_1 = L\sqrt{\rho\delta}/2$, $\gamma \ge 2\bar\gamma = L\sqrt{\mu\delta^{-1}}$ and $\epsilon_2 = 0$, and from Theorem 2, where condition (ii) holds by the convexity of $g(x,\cdot)$.

C.2. Proof of Lemmas 5 and 6

Lemma 5. Assume there exists $L_g > 0$ such that $\nabla g(x,y)$ is $L_g$-Lipschitz-continuous. Assume either one of the following is true:
(a) Condition (ii) in Assumption 4 holds. Let $L_S = L_g\mu$.
(b) Conditions (i) and (iii) in Assumption 4 hold. Let $L_S = L_g(\mu+1)(L_g+1)$.
Then given any $x_1, x_2\in C$, for any $y_1\in S(x_1)$ there exists $y_2\in S(x_2)$ such that $\|y_1 - y_2\| \le L_S\|x_1 - x_2\|$.

Proof. (a) Given $x$, define the projected gradient of $g(x,\cdot)$ at point $y$ as $G(y;x) = \frac1\beta\big(y - \Pi_U(y - \beta\nabla_y g(x,y))\big)$. By the assumption, the proximal-error-bound inequality holds, that is,
$$d_{S(x)}(y) \le \mu\|G(y;x)\|,\quad\forall y\in U \text{ and } x\in C.$$
Therefore, given $x_1, x_2\in C$, for any $y_1\in S(x_1)$ there exists $y_2\in S(x_2)$ (the projection of $y_1$ onto $S(x_2)$) such that
$$\|y_1 - y_2\| \le \mu\|G(y_1;x_2)\| = \mu\|G(y_1;x_2) - G(y_1;x_1)\| \quad\text{since } G(y_1;x_1) = 0$$
$$= \frac\mu\beta\Big\|\Pi_U\big(y_1 - \beta\nabla_y g(x_2,y_1)\big) - \Pi_U\big(y_1 - \beta\nabla_y g(x_1,y_1)\big)\Big\| \le \mu\|\nabla_y g(x_1,y_1) - \nabla_y g(x_2,y_1)\| \le L_g\mu\|x_1 - x_2\|. \tag{60}$$
This completes the proof under condition (a).

(b) By the $\frac1\mu$-quadratic growth of $g(x,\cdot)$ and (Drusvyatskiy & Lewis, 2018, Corollary 3.6), the proximal-error-bound inequality holds, that is,
$$d_{S(x)}(y) \le (\mu+1)(L_g+1)\|G(y;x)\|,\quad\forall y\in U \text{ and } x\in C,$$
where we set $\beta = 1$ to simplify the constant. The result then directly follows from case (a).

Proof of Lemma 6. Given any $x_1, x_2\in C$, for any $y_1\in S(x_1)$ there exists $y_2\in S(x_2)$ such that
$$\|\nabla_x g(x_1,y_1) - \nabla_x g(x_2,y_2)\| \le L_g\big(\|x_1 - x_2\| + \|y_1 - y_2\|\big) \le L_g(1 + L_S)\|x_1 - x_2\| \tag{61}$$
by Lemma 5. By Lemma 4 and the condition that $\nabla_x g(x,y_1) = \nabla_x g(x,y_2)$ for any $y_1, y_2\in S(x)$, the directional derivative of $v$ satisfies
$$\partial_d v(x) = \langle\nabla_x g(x,y^*), d\rangle,\quad\forall y^*\in S(x). \tag{62}$$
This along with (61) gives, for any unit direction $d$,
$$|\partial_d v(x_1) - \partial_d v(x_2)| \le L_g(1 + L_S)\|x_1 - x_2\|,\quad\forall x_1, x_2\in C.$$
Thus $\nabla v(x)$ exists by the continuity of $\partial_d v(x)$ for any $d$. Further, $\nabla v(x) = \nabla_x g(x,y^*)$ for any $y^*\in S(x)$ by (62), and it is Lipschitz-continuous by (61).
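The projected-gradient mapping $G(y;x)$ and the proximal error bound it enters can be probed on a one-dimensional toy problem; the instance below (quadratic lower level over $U = [0,1]$, with $L_g = 2$ and $\beta = 0.25 \le 1/L_g$) is a hypothetical illustration of ours where $S(x)$ is available in closed form.

```python
import numpy as np

# Toy constrained lower level: g(x, y) = (y - x)^2 over U = [0, 1], so
# S(x) = {clip(x, 0, 1)} and d_S(y) = |y - clip(x, 0, 1)|.
beta = 0.25

def G(y, x):
    step = np.clip(y - beta * 2.0 * (y - x), 0.0, 1.0)   # proj_U(y - beta*grad_y g)
    return (y - step) / beta

rng = np.random.default_rng(4)
ratios = []
for _ in range(10000):
    x, y = rng.uniform(-2, 3), rng.uniform(0, 1)
    d = abs(y - np.clip(x, 0.0, 1.0))
    g_map = abs(G(y, x))
    if g_map > 1e-12:
        ratios.append(d / g_map)

# d_S(y) <= mu * |G(y; x)| on U: the empirical constant is the max ratio,
# which stays bounded, as the proximal-error-bound condition requires.
print("empirical proximal-error-bound constant:", max(ratios))
```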
C.3. Proof of Theorem 4

Convergence of $\omega$. Given any $x\in C$, by the $\frac1\mu$-quadratic growth of $g(x,\cdot)$ and (Drusvyatskiy & Lewis, 2018, Corollary 3.6), there exists some constant such that the proximal-error-bound inequality holds. Thus under either condition of Proposition 3, there exists $\tilde\mu > 0$ such that the $\frac{1}{\tilde\mu}$-proximal-error-bound condition holds for $g(x,\cdot)$. This, along with the Lipschitz-smoothness of $g(x,\cdot)$, implies the proximal PL condition by (Karimi et al., 2016, Appendix G). We state the proximal PL condition below. Defining
$$D(\omega;x) := -\frac2\beta\min_{\omega'\in U}\Big\{\langle\nabla_\omega g(x,\omega), \omega' - \omega\rangle + \frac{1}{2\beta}\|\omega' - \omega\|^2\Big\}, \tag{63}$$
there exists some constant $\bar\mu > 0$ such that
$$\bar\mu\,D(\omega;x) \ge g(x,\omega) - v(x),\quad\forall\omega\in U \text{ and } x\in C. \tag{64}$$
We omit the index $k$ since the proof holds for any $k$. By the Lipschitz gradient of $g(x,\cdot)$ and $\beta\le 1/L_g$, we have
$$g(x,\omega_{t+1}) \le g(x,\omega_t) + \langle\nabla_y g(x,\omega_t), \omega_{t+1} - \omega_t\rangle + \frac{L_g}{2}\|\omega_{t+1} - \omega_t\|^2 \le g(x,\omega_t) + \langle\nabla_y g(x,\omega_t), \omega_{t+1} - \omega_t\rangle + \frac{1}{2\beta}\|\omega_{t+1} - \omega_t\|^2 = g(x,\omega_t) - \frac\beta2 D(\omega_t;x) \tag{65}$$
where in the last equality we have used Lemma 7, by which
$$\omega_{t+1} = \arg\min_{\omega\in U}\ \langle\nabla_y g(x,\omega_t), \omega - \omega_t\rangle + \frac{1}{2\beta}\|\omega - \omega_t\|^2.$$
Using (64) in (65) yields
$$g(x,\omega_{t+1}) - v(x) \le \Big(1 - \frac{\beta}{2\bar\mu}\Big)\big(g(x,\omega_t) - v(x)\big).$$
Repeatedly applying the last inequality for $t = 1,\dots,T$ yields
$$g(x,\omega_{T+1}) - v(x) \le \Big(1 - \frac{\beta}{2\bar\mu}\Big)^T\big(g(x,\omega_1) - v(x)\big).$$
This along with the $\frac1\mu$-quadratic-growth property of $g(x,\cdot)$ yields
$$d^2_{S(x)}(\omega_{T+1}) \le \mu\Big(1 - \frac{\beta}{2\bar\mu}\Big)^T\big(g(x,\omega_1) - v(x)\big) \le \mu\Big(1 - \frac{\beta}{2\bar\mu}\Big)^T C_g,\quad\forall x\in C \tag{66}$$
where $C_g := \max_{x\in C,\,y\in U}\big(g(x,y) - v(x)\big)$ is a constant.

Convergence of $(x,y)$. The proof is similar to that of Theorem 3; we only write the step that differs. In deriving (30), we instead have
$$\|\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k)\|^2 \le \gamma^2L_g^2\,d^2_{S(x_k)}(\hat y_k) \le \gamma^2L_g^2\mu C_g\Big(1 - \frac{\beta}{2\bar\mu}\Big)^{T_k} \le \frac{L_g^2\mu C_g}{4\alpha^2k^2} \tag{67}$$
where the second inequality is by (66) and the last inequality requires $T_k \ge 2\log_{c_\beta}\big((2\alpha\gamma k)^{-1}\big)$ with $c_\beta = 1 - \frac{\beta}{2\bar\mu}$. Then (31) is replaced with
$$\langle\nabla F_\gamma(z_k) - \hat\nabla F_\gamma(z_k;\hat y_k), z_{k+1} - z_k\rangle \le \frac{1}{4\alpha}\|z_{k+1} - z_k\|^2 + \frac{L_g^2\mu C_g}{4\alpha k^2}. \tag{68}$$
Result (i) in this theorem then follows from the rest of the proof of (i) in Theorem 3. Result (ii) in this theorem follows similarly from the proof of (ii) in Theorem 3 under Proposition 3.
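The proximal-PL quantity $D(\omega;x)$ in (63) is explicit whenever the inner argmin over $U$ is computable, e.g., for a box. The sketch below, on the same hypothetical toy instance as above, evaluates $D$ and the ratio in (64) to recover an empirical $\bar\mu$; all constants and names are our own illustration.

```python
import numpy as np

# Toy constrained problem: g(x, y) = (y - x)^2 over U = [0, 1]. For a box U,
# the argmin in (63) is a clamped gradient step, so D(omega; x) is explicit.
beta = 0.25

def D(w, x):
    grad = 2.0 * (w - x)
    w_star = np.clip(w - beta * grad, 0.0, 1.0)          # inner argmin over U
    model = grad * (w_star - w) + (w_star - w) ** 2 / (2 * beta)
    return -2.0 * model / beta                           # D >= 0, and D = ||grad||^2 when unconstrained

v = lambda x: (np.clip(x, 0.0, 1.0) - x) ** 2            # v(x) = min_{y in U} g(x, y)

rng = np.random.default_rng(5)
ratios = []
for _ in range(10000):
    x, w = rng.uniform(-2, 3), rng.uniform(0, 1)
    gap = (w - x) ** 2 - v(x)                            # g(x, w) - v(x) >= 0
    if gap > 1e-12:
        ratios.append(gap / D(w, x))

# (64) reads mu_bar * D(omega; x) >= g(x, omega) - v(x); the empirical mu_bar
# is the largest observed ratio, which stays bounded on this instance.
print("empirical proximal-PL constant mu_bar:", max(ratios))
```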