# Bilevel Optimization: Convergence Analysis and Enhanced Design

Kaiyi Ji¹, Junjie Yang¹, Yingbin Liang¹

¹Department of Electrical and Computer Engineering, The Ohio State University. Correspondence to: Kaiyi Ji.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. Bilevel optimization has arisen as a powerful tool for many machine learning problems such as meta-learning, hyperparameter optimization, and reinforcement learning. In this paper, we investigate the nonconvex-strongly-convex bilevel optimization problem. For deterministic bilevel optimization, we provide a comprehensive convergence rate analysis for two popular algorithms respectively based on approximate implicit differentiation (AID) and iterative differentiation (ITD). For the AID-based method, we orderwisely improve the previous convergence rate analysis due to a more practical parameter selection as well as a warm start strategy, and for the ITD-based method we establish the first theoretical convergence rate. Our analysis also provides a quantitative comparison between the ITD- and AID-based approaches. For stochastic bilevel optimization, we propose a novel algorithm named stocBiO, which features a sample-efficient hypergradient estimator using efficient Jacobian- and Hessian-vector product computations. We provide the convergence rate guarantee for stocBiO, and show that stocBiO outperforms the best known computational complexities orderwisely with respect to the condition number $\kappa$ and the target accuracy $\epsilon$. We further validate our theoretical results and demonstrate the efficiency of bilevel optimization algorithms by experiments on meta-learning and hyperparameter optimization.

## 1. Introduction

Bilevel optimization has received significant attention recently and become an influential framework in various machine learning applications including meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran et al., 2019; Ji et al., 2020a), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Feurer & Hutter, 2019), reinforcement learning (Konda & Tsitsiklis, 2000; Hong et al., 2020), and signal processing (Kunapuli et al., 2008; Flamary et al., 2014). A general bilevel optimization problem takes the following form:
$$\min_{x\in\mathbb{R}^p}\ \Phi(x) := f(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y\in\mathbb{R}^q} g(x, y), \tag{1}$$
where the upper- and lower-level functions $f$ and $g$ are both jointly continuously differentiable. The goal of eq. (1) is to minimize the objective function $\Phi(x)$ with respect to (w.r.t.) $x$, where $y^*(x)$ is obtained by solving the lower-level minimization problem. In this paper, we focus on the setting where the lower-level function $g$ is strongly convex w.r.t. $y$, and the upper-level objective function $\Phi(x)$ is nonconvex. Such geometries commonly arise in many applications such as meta-learning and hyperparameter optimization, where $g$ corresponds to an empirical loss with a strongly-convex regularizer and $x$ are parameters of neural networks.
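To make the structure of eq. (1) concrete, the following minimal sketch (not from the paper; the quadratic $f$ and $g$ are illustrative choices) instantiates a bilevel problem whose lower-level solution $y^*(x)$ is available in closed form, and checks the implicit-differentiation form of the hypergradient given later in Proposition 1 (eq. (4)) against a direct computation.

```python
# Toy instance of eq. (1), not from the paper: a quadratic, strongly convex lower level
# g(x, y) = 0.5*y^T H y - y^T B x  (H > 0) and upper level f(x, y) = 0.5*||y - c||^2 + 0.5*||x||^2.
# Here y*(x) = H^{-1} B x is available in closed form, so the hypergradient of
# Phi(x) = f(x, y*(x)) from the implicit-function formula can be checked directly.
import torch

torch.manual_seed(0)
p, q = 3, 4
B = torch.randn(q, p)
H = torch.randn(q, q); H = H @ H.T + q * torch.eye(q)    # strongly convex in y
c = torch.randn(q)
x = torch.randn(p)

y_star = torch.linalg.solve(H, B @ x)                    # lower-level solution y*(x)
v_star = torch.linalg.solve(H, y_star - c)               # solves (d^2_y g) v = d_y f at y*(x)
jvp = -B.T @ v_star                                      # Jacobian-vector product d_x d_y g(x, y*) v*
hypergrad = x - jvp                                      # d_x f - (d_x d_y g) v*  (eq. (4) form)

# sanity check against the gradient of Phi(x) = 0.5*||H^{-1} B x - c||^2 + 0.5*||x||^2
direct = B.T @ torch.linalg.solve(H, y_star - c) + x
print(torch.allclose(hypergrad, direct, atol=1e-6))      # True
```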
A broad collection of algorithms have been proposed to solve bilevel optimization problems. For example, Hansen et al. (1992); Shi et al. (2005); Moore (2010) reformulated the bilevel problem in eq. (1) into a single-level constrained problem based on the optimality conditions of the lower-level problem. However, such methods often involve a large number of constraints, and are hard to implement in machine learning applications.

Recently, more efficient gradient-based bilevel optimization algorithms have been proposed, which can be generally categorized into the approximate implicit differentiation (AID) based approach (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Liao et al., 2018; Ghadimi & Wang, 2018; Grazzi et al., 2020; Lorraine et al., 2020) and the iterative differentiation (ITD) based approach (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; 2018; Shaban et al., 2019; Grazzi et al., 2020). However, most of these studies have focused on asymptotic convergence analysis, and the nonasymptotic convergence rate analysis (which characterizes how fast an algorithm converges) has not been well explored except for a few recent attempts. Ghadimi & Wang (2018) provided a convergence rate analysis for the AID-based approach. Grazzi et al. (2020) provided the iteration complexity for the hypergradient computation via ITD and AID, but did not characterize the convergence rate for the entire execution of the algorithms. Thus, the first focus of this paper is to develop a comprehensive and sharper theory, which covers a broader class of bilevel optimizers via ITD and AID techniques, and more importantly, improves the existing analysis with a more practical parameter selection and an orderwisely lower computational complexity.

Stochastic bilevel optimization often occurs in applications where fresh data are sampled for the algorithm iterations (e.g., in reinforcement learning (Hong et al., 2020)) or the sample size of the training data is large (e.g., hyperparameter optimization (Franceschi et al., 2018), Stackelberg games (Roth et al., 2016)). Typically, the objective function is given by
$$\min_{x\in\mathbb{R}^p}\ \Phi(x) = f(x, y^*(x)) = \begin{cases} \frac{1}{n}\sum_{i=1}^{n} F(x, y^*(x); \xi_i) & \text{(finite-sum)}\\[2pt] \mathbb{E}_{\xi}\big[F(x, y^*(x); \xi)\big] & \text{(expectation)}\end{cases}$$
$$\text{s.t.} \quad y^*(x) = \arg\min_{y\in\mathbb{R}^q}\ g(x, y) = \begin{cases} \frac{1}{m}\sum_{i=1}^{m} G(x, y; \zeta_i) & \text{(finite-sum)}\\[2pt] \mathbb{E}_{\zeta}\big[G(x, y; \zeta)\big] & \text{(expectation)}\end{cases} \tag{2}$$
where $f(x, y)$ and $g(x, y)$ take either the expectation form w.r.t. the random variables $\xi$ and $\zeta$ or the finite-sum form over given data $\mathcal{D}_{n,m} = \{\xi_i, \zeta_j,\ i = 1, ..., n;\ j = 1, ..., m\}$, often with large sizes $n$ and $m$. During the optimization process, data batches are sampled from the distributions of $\xi$ and $\zeta$ or from the set $\mathcal{D}_{n,m}$. For such a stochastic setting, Ghadimi & Wang (2018) proposed a bilevel stochastic approximation (BSA) method via single-sample gradient and Hessian estimates. Based on such a method, Hong et al. (2020) further proposed a two-timescale stochastic approximation (TTSA) algorithm, and showed that TTSA achieves a better trade-off between the complexities of the inner- and outer-loop optimization stages than BSA. The second focus of this paper is to design a more sample-efficient algorithm for stochastic bilevel optimization, which achieves computational complexity lower by orders of magnitude than BSA and TTSA.

### 1.1. Main Contributions

Our main contributions lie in developing sharper theory and provably faster algorithms for nonconvex-strongly-convex bilevel deterministic and stochastic optimization problems, respectively. Our analysis involves several new developments, which can be of independent interest.

We first provide a unified convergence rate and complexity analysis for both ITD and AID based bilevel optimizers, which we call ITD-BiO and AID-BiO. Compared to the existing analysis in Ghadimi & Wang (2018) for AID-BiO, which requires a continuously increasing number of inner-loop steps to achieve the guarantee, our analysis allows a constant number of inner-loop steps as often used in practice.
In addition, we introduce a warm start initialization for the inner-loop updates and the outer-loop hypergradient estimation, which allows us to backpropagate the tracking errors to previous loops, and yields an improved computational complexity. As shown in Table 1, the gradient complexities $\mathrm{Gc}(f,\epsilon)$ and $\mathrm{Gc}(g,\epsilon)$ and the Jacobian- and Hessian-vector product complexities $\mathrm{JV}(g,\epsilon)$ and $\mathrm{HV}(g,\epsilon)$ of AID-BiO to attain an $\epsilon$-accurate stationary point improve those of Ghadimi & Wang (2018) by the order of $\kappa$, $\kappa\epsilon^{-1/4}$, $\kappa$, and $\kappa$, respectively, where $\kappa$ is the condition number. In addition, our analysis shows that AID-BiO requires fewer Jacobian- and Hessian-vector product computations than ITD-BiO by an order of $\kappa$ and $\kappa^{1/2}$, respectively, which implies that AID can be more computationally and memory efficient than ITD.

We then propose a stochastic bilevel optimizer (stocBiO) to solve the stochastic bilevel optimization problem in eq. (2). Our algorithm features a mini-batch hypergradient estimation via implicit differentiation, where the core design involves a sample-efficient hypergradient estimator via the Neumann series. As shown in Table 2, the gradient complexities of our proposed algorithm w.r.t. $F$ and $G$ improve upon those of BSA (Ghadimi & Wang, 2018) by an order of $\kappa$ and $\epsilon^{-1}$, respectively. In addition, the Jacobian-vector product complexity $\mathrm{JV}(G,\epsilon)$ of our algorithm improves that of BSA by an order of $\kappa$. In terms of the target accuracy $\epsilon$, our computational complexities improve those of TTSA (Hong et al., 2020) by an order of $\epsilon^{-1/2}$.

Our results further provide the theoretical complexity guarantees for ITD-BiO, AID-BiO and stocBiO in meta-learning and hyperparameter optimization. The experiments validate our theoretical results for deterministic bilevel optimization, and demonstrate the superior efficiency of stocBiO for stochastic bilevel optimization.

Table 1. Comparison of bilevel deterministic optimization algorithms.

| Algorithm | $\mathrm{Gc}(f,\epsilon)$ | $\mathrm{Gc}(g,\epsilon)$ | $\mathrm{JV}(g,\epsilon)$ | $\mathrm{HV}(g,\epsilon)$ |
|---|---|---|---|---|
| AID-BiO (Ghadimi & Wang, 2018) | $\mathcal{O}(\kappa^4\epsilon^{-1})$ | $\mathcal{O}(\kappa^5\epsilon^{-5/4})$ | $\mathcal{O}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^{4.5}\epsilon^{-1})$ |
| AID-BiO (this paper) | $\mathcal{O}(\kappa^3\epsilon^{-1})$ | $\mathcal{O}(\kappa^4\epsilon^{-1})$ | $\mathcal{O}(\kappa^3\epsilon^{-1})$ | $\mathcal{O}(\kappa^{3.5}\epsilon^{-1})$ |
| ITD-BiO (this paper) | $\mathcal{O}(\kappa^3\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ |

$\mathrm{Gc}(f,\epsilon)$ and $\mathrm{Gc}(g,\epsilon)$: number of gradient evaluations w.r.t. $f$ and $g$. $\kappa$: condition number. $\mathrm{JV}(g,\epsilon)$: number of Jacobian-vector products $\nabla_x\nabla_y g(x,y)v$. $\mathrm{HV}(g,\epsilon)$: number of Hessian-vector products $\nabla_y^2 g(x,y)v$. Notation $\widetilde{\mathcal{O}}$: omits $\log\frac{1}{\epsilon}$ terms.

Table 2. Comparison of bilevel stochastic optimization algorithms.
| Algorithm | $\mathrm{Gc}(F,\epsilon)$ | $\mathrm{Gc}(G,\epsilon)$ | $\mathrm{JV}(G,\epsilon)$ | $\mathrm{HV}(G,\epsilon)$ |
|---|---|---|---|---|
| TTSA (Hong et al., 2020) | $\mathcal{O}(\mathrm{poly}(\kappa)\epsilon^{-5/2})$* | $\mathcal{O}(\mathrm{poly}(\kappa)\epsilon^{-5/2})$ | $\mathcal{O}(\mathrm{poly}(\kappa)\epsilon^{-5/2})$ | $\mathcal{O}(\mathrm{poly}(\kappa)\epsilon^{-5/2})$ |
| BSA (Ghadimi & Wang, 2018) | $\mathcal{O}(\kappa^6\epsilon^{-2})$ | $\mathcal{O}(\kappa^9\epsilon^{-3})$ | $\mathcal{O}(\kappa^6\epsilon^{-2})$ | $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-2})$ |
| stocBiO (this paper) | $\mathcal{O}(\kappa^5\epsilon^{-2})$ | $\mathcal{O}(\kappa^9\epsilon^{-2})$ | $\mathcal{O}(\kappa^5\epsilon^{-2})$ | $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-2})$ |

\* We write $\mathrm{poly}(\kappa)$ because Hong et al. (2020) do not provide the explicit dependence on $\kappa$.

### 1.2. Related Work

Bilevel optimization approaches: Bilevel optimization was first introduced by Bracken & McGill (1973). Since then, a number of bilevel optimization algorithms have been proposed, including but not limited to constraint-based methods (Shi et al., 2005; Moore, 2010) and gradient-based methods (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Maclaurin et al., 2015; Franceschi et al., 2018; Ghadimi & Wang, 2018; Liao et al., 2018; Shaban et al., 2019; Hong et al., 2020; Liu et al., 2020; Li et al., 2020; Grazzi et al., 2020; Lorraine et al., 2020; Ji & Liang, 2021). Among them, Ghadimi & Wang (2018); Hong et al. (2020) provided the complexity analysis of their proposed methods for the nonconvex-strongly-convex bilevel optimization problem. For such a problem, this paper develops a general and enhanced convergence rate analysis for gradient-based bilevel optimizers in the deterministic setting, and proposes a novel algorithm for the stochastic setting with order-level lower computational complexity than the existing results.

Other types of loss geometries have also been studied. For example, Liu et al. (2020); Li et al. (2020) assumed that the lower- and upper-level functions $g(x,\cdot)$ and $f(x,\cdot)$ are convex and strongly convex, respectively, and provided an asymptotic analysis for their methods. Ghadimi & Wang (2018); Hong et al. (2020) studied the setting where $\Phi(\cdot)$ is strongly convex or convex, and $g(x,\cdot)$ is strongly convex.

Bilevel optimization in meta-learning: The bilevel optimization framework has been successfully applied to meta-learning recently (Snell et al., 2017; Franceschi et al., 2018; Rajeswaran et al., 2019; Zügner & Günnemann, 2019; Ji et al., 2020a;b). For example, Snell et al. (2017) proposed a bilevel optimization procedure for meta-learning to learn a common embedding model for all tasks. Rajeswaran et al. (2019) reformulated model-agnostic meta-learning (MAML) (Finn et al., 2017) as bilevel optimization, and proposed iMAML via implicit gradients. Our work provides a theoretical guarantee for two popular types of bilevel optimizers, i.e., AID-BiO and ITD-BiO, for meta-learning.

Bilevel optimization in hyperparameter optimization: Hyperparameter optimization has become increasingly important as a powerful tool in automatic machine learning (AutoML) (Okuno et al., 2018; Yu & Zhu, 2020). Recently, various bilevel optimization algorithms have been proposed for hyperparameter optimization, including implicit differentiation based methods (Pedregosa, 2016), dynamical system based methods via reverse or forward gradient computation (Franceschi et al., 2017; 2018; Shaban et al., 2019), etc. Our work demonstrates the superior efficiency of the proposed stocBiO algorithm in hyperparameter optimization.

## 2. Algorithms

In this section, we describe two popular types of deterministic bilevel optimization algorithms, and propose a new algorithm for stochastic bilevel optimization.

### 2.1. Algorithms for Deterministic Bilevel Optimization

As shown in Algorithm 1, we describe two popular types of deterministic bilevel optimizers respectively based on AID and ITD (referred to as AID-BiO and ITD-BiO) for solving the problem in eq. (1).

Algorithm 1 (bilevel optimization via AID or ITD):
1. Input: $K$, $D$, $N$, stepsizes $\alpha$, $\beta$, initializations $x_0$, $y_0$, $v_0$.
2. for $k = 0, 1, 2, ..., K$ do
3. Set $y_k^0 = y_{k-1}^D$ if $k > 0$ and $y_0$ otherwise
4. for $t = 1, ..., D$ do: update $y_k^t = y_k^{t-1} - \alpha\nabla_y g(x_k, y_k^{t-1})$; end for
5. Hypergradient estimation via
   - AID: 1) set $v_k^0 = v_{k-1}^N$ if $k > 0$ and $v_0$ otherwise; 2) solve $v_k^N$ from $\nabla_y^2 g(x_k, y_k^D)v = \nabla_y f(x_k, y_k^D)$ via $N$ steps of CG starting from $v_k^0$; 3) compute the Jacobian-vector product $\nabla_x\nabla_y g(x_k, y_k^D)v_k^N$ via automatic differentiation; 4) $\widehat\nabla\Phi(x_k) = \nabla_x f(x_k, y_k^D) - \nabla_x\nabla_y g(x_k, y_k^D)v_k^N$
   - ITD: compute $\widehat\nabla\Phi(x_k) = \frac{\partial f(x_k, y_k^D)}{\partial x_k}$ via backpropagation
6. Update $x_{k+1} = x_k - \beta\widehat\nabla\Phi(x_k)$
7. end for

Both AID-BiO and ITD-BiO update in a nested-loop manner. In the inner loop, both of them run $D$ steps of gradient descent (GD) to find an approximation point $y_k^D$ close to $y^*(x_k)$. Note that we choose the initialization $y_k^0$ of each inner loop as the output $y_{k-1}^D$ of the preceding inner loop rather than a random start. Such a warm start allows us to backpropagate the tracking error $\|y_k^D - y^*(x_k)\|$ to previous loops, and yields an improved computational complexity.

At the outer loop, AID-BiO first solves $v_k^N$ from the linear system $\nabla_y^2 g(x_k, y_k^D)v = \nabla_y f(x_k, y_k^D)$ (equivalently, the quadratic program $\min_v \frac{1}{2}v^\top\nabla_y^2 g(x_k, y_k^D)v - v^\top\nabla_y f(x_k, y_k^D)$) using $N$ steps of conjugate gradient (CG) starting from $v_k^0$, where we also adopt a warm start $v_k^0 = v_{k-1}^N$, and then constructs
$$\widehat\nabla\Phi(x_k) = \nabla_x f(x_k, y_k^D) - \nabla_x\nabla_y g(x_k, y_k^D)v_k^N \tag{3}$$
as an estimate of the true hypergradient $\nabla\Phi(x_k)$, whose form is given by the following proposition.

Proposition 1. The hypergradient $\nabla\Phi(x_k)$ takes the form
$$\nabla\Phi(x_k) = \nabla_x f(x_k, y^*(x_k)) - \nabla_x\nabla_y g(x_k, y^*(x_k))\,v_k^*, \tag{4}$$
where $v_k^*$ is the solution of the linear system $\nabla_y^2 g(x_k, y^*(x_k))v = \nabla_y f(x_k, y^*(x_k))$.

As shown in Domke (2012); Grazzi et al. (2020), the construction of eq. (3) involves only Hessian-vector products in solving $v_k^N$ via CG and the Jacobian-vector product $\nabla_x\nabla_y g(x_k, y_k^D)v_k^N$, which can be efficiently computed and stored via existing automatic differentiation packages.
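As an illustration, the following PyTorch sketch (not the authors' released code; the quadratic $f$ and $g$ are chosen only for the example) implements the AID hypergradient step in eq. (3): a CG solve of the linear system using only Hessian-vector products, followed by one Jacobian-vector product, both obtained from torch.autograd.grad.

```python
# Minimal sketch of the AID hypergradient in eq. (3): solve d2_y g(x, y) v = d_y f(x, y)
# by conjugate gradient with Hessian-vector products, then form d_x f - (d_x d_y g) v.
import torch

def f(x, y):          # illustrative upper-level objective
    return 0.5 * ((y - 1.0) ** 2).sum() + 0.5 * (x ** 2).sum()

def g(x, y):          # illustrative strongly convex lower-level objective
    return 0.5 * (y ** 2).sum() - (x * y).sum()

def hvp_g(x, y, v):   # Hessian-vector product  d2_y g(x, y) v  via double backward
    gy = torch.autograd.grad(g(x, y), y, create_graph=True)[0]
    return torch.autograd.grad(gy, y, grad_outputs=v)[0]

def aid_hypergrad(x, y, v0, cg_steps=20):
    fy = torch.autograd.grad(f(x, y), y)[0]
    v, r = v0.clone(), fy - hvp_g(x, y, v0)   # CG warm-started at v0
    p = r.clone()
    for _ in range(cg_steps):
        rs = r @ r
        if rs < 1e-12:                        # residual small enough: stop early
            break
        Hp = hvp_g(x, y, p)
        alpha = rs / (p @ Hp)
        v = v + alpha * p
        r = r - alpha * Hp
        p = r + ((r @ r) / rs) * p
    gy = torch.autograd.grad(g(x, y), y, create_graph=True)[0]
    jvp = torch.autograd.grad(gy, x, grad_outputs=v)[0]   # d_x d_y g(x, y) v
    fx = torch.autograd.grad(f(x, y), x)[0]
    return fx - jvp, v                        # eq. (3) estimate and warm start for next CG

x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)        # in practice, the output of the inner GD loop
hypergrad, v_warm = aid_hypergrad(x, y, v0=torch.zeros(4))
```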
As a comparison, the outer loop of ITD-BiO computes the gradient $\frac{\partial f(x_k, y_k^D(x_k))}{\partial x_k}$ as an approximation of the hypergradient $\nabla\Phi(x_k) = \frac{\partial f(x_k, y^*(x_k))}{\partial x_k}$ via backpropagation, where we write $y_k^D(x_k)$ because the output $y_k^D$ of the inner loop depends on $x_k$ through the inner-loop iterative GD updates. The explicit form of the estimate $\frac{\partial f(x_k, y_k^D(x_k))}{\partial x_k}$ is given by the following proposition via the chain rule. For notational simplicity, let $\prod_{j=D}^{D-1}(\cdot) = I$.

Proposition 2. $\frac{\partial f(x_k, y_k^D(x_k))}{\partial x_k}$ takes the analytical form
$$\frac{\partial f(x_k, y_k^D)}{\partial x_k} = \nabla_x f(x_k, y_k^D) - \alpha\sum_{t=0}^{D-1}\nabla_x\nabla_y g(x_k, y_k^t)\prod_{j=t+1}^{D-1}\big(I - \alpha\nabla_y^2 g(x_k, y_k^j)\big)\nabla_y f(x_k, y_k^D).$$

Proposition 2 shows that the differentiation involves the computation of second-order derivatives such as the Hessian $\nabla_y^2 g(\cdot,\cdot)$. Since efficient Hessian-free methods have been successfully deployed in existing automatic differentiation tools, computing these second-order derivatives reduces to the more efficient computation of Jacobian- and Hessian-vector products.
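The following sketch (again not the authors' code, with the same illustrative $f$ and $g$ as in the AID sketch above) shows the corresponding ITD estimate: the inner GD loop is unrolled with the computation graph kept, so that backpropagating $f(x_k, y_k^D)$ through the updates realizes the product form in Proposition 2 without ever forming a Hessian.

```python
# Minimal sketch of the ITD hypergradient: run D inner GD steps on g while keeping the
# computation graph, then backpropagate f(x, y_D) through the unrolled updates.
import torch

def f(x, y):
    return 0.5 * ((y - 1.0) ** 2).sum() + 0.5 * (x ** 2).sum()

def g(x, y):
    return 0.5 * (y ** 2).sum() - (x * y).sum()

def itd_hypergrad(x, y0, D=50, alpha=0.5):
    y = y0
    for _ in range(D):                           # unrolled inner loop; graph kept w.r.t. x
        grad_y = torch.autograd.grad(g(x, y), y, create_graph=True)[0]
        y = y - alpha * grad_y
    return torch.autograd.grad(f(x, y), x)[0]    # d f(x, y_D(x)) / d x via backpropagation

x = torch.randn(4, requires_grad=True)
y0 = torch.zeros(4, requires_grad=True)          # a warm start would reuse the previous y_D
print(itd_hypergrad(x, y0))
```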
### 2.2. Algorithm for Stochastic Bilevel Optimization

We propose a new stochastic bilevel optimizer (stocBiO) in Algorithm 2 to solve the problem in eq. (2). It has a double-loop structure similar to Algorithm 1, but runs $D$ steps of stochastic gradient descent (SGD) in the inner loop to obtain an approximate solution $y_k^D$. Based on the output $y_k^D$ of the inner loop, stocBiO first computes a gradient $v_0 = \nabla_y F(x_k, y_k^D; \mathcal{D}_F)$ over a sample batch $\mathcal{D}_F$, and then computes a vector $v_Q$ as an estimated solution of the linear system $\nabla_y^2 g(x_k, y^*(x_k))v = \nabla_y f(x_k, y^*(x_k))$ via Algorithm 3.

Algorithm 2 (stochastic bilevel optimizer stocBiO):
1. Input: $K$, $D$, $Q$, stepsizes $\alpha$ and $\beta$, initializations $x_0$ and $y_0$.
2. for $k = 0, 1, 2, ..., K$ do
3. Set $y_k^0 = y_{k-1}^D$ if $k > 0$ and $y_0$ otherwise
4. for $t = 1, ..., D$ do: draw a sample batch $\mathcal{S}_{t-1}$ and update $y_k^t = y_k^{t-1} - \alpha\nabla_y G(x_k, y_k^{t-1}; \mathcal{S}_{t-1})$; end for
5. Draw sample batches $\mathcal{D}_F$, $\mathcal{D}_H$ and $\mathcal{D}_G$
6. Compute the gradient $v_0 = \nabla_y F(x_k, y_k^D; \mathcal{D}_F)$
7. Construct the estimate $v_Q$ via Algorithm 3 given $v_0$
8. Compute $\nabla_x\nabla_y G(x_k, y_k^D; \mathcal{D}_G)v_Q$
9. Compute the gradient estimate $\widehat\nabla\Phi(x_k)$ via eq. (6)
10. Update $x_{k+1} = x_k - \beta\widehat\nabla\Phi(x_k)$
11. end for

Algorithm 3 (construct $v_Q$ given $v_0$):
1. Input: integer $Q$, samples $\mathcal{D}_H = \{\mathcal{B}_j\}_{j=1}^{Q}$ and constant $\eta$.
2. for $j = 1, 2, ..., Q$ do: sample $\mathcal{B}_j$ and compute $G_j(y) = y - \eta\nabla_y G(x, y; \mathcal{B}_j)$; end for
3. Set $r_Q = v_0$
4. for $i = Q, ..., 1$ do: $r_{i-1} = \partial\langle G_i(y), r_i\rangle/\partial y = r_i - \eta\nabla_y^2 G(x, y; \mathcal{B}_i)r_i$ via automatic differentiation; end for
5. Return $v_Q = \eta\sum_{i=0}^{Q} r_i$

Here, $v_Q$ takes the form
$$v_Q = \eta\sum_{i=0}^{Q}\prod_{j=i+1}^{Q}\big(I - \eta\nabla_y^2 G(x_k, y_k^D; \mathcal{B}_j)\big)v_0, \tag{5}$$
where $v_0 = \nabla_y F(x_k, y_k^D; \mathcal{D}_F)$, $\{\mathcal{B}_j,\ j = 1, ..., Q\}$ are mutually independent sample sets, $Q$ and $\eta$ are constants, and we let $\prod_{j=Q+1}^{Q}(\cdot) = I$ for notational simplicity. Our construction of $v_Q$, i.e., Algorithm 3, is motivated by the Neumann series $\sum_{i=0}^{\infty} U^i = (I - U)^{-1}$, and involves only Hessian-vector products rather than Hessians, and hence is computationally and memory efficient. This procedure is illustrated in Figure 1.

Figure 1. Illustration of the hypergradient estimation in our proposed stocBiO algorithm. Note that the hypergradient estimation (lines 6-7 in Algorithm 2 above) involves only automatic differentiation of the scalar $\langle G_j(y), r_i\rangle$ w.r.t. $y$. In addition, our implementation applies the function torch.autograd.grad in PyTorch, which automatically determines the size of Jacobians. More details can be found in our code at https://github.com/JunjieYang97/StocBio_hp.

Then, we construct
$$\widehat\nabla\Phi(x_k) = \nabla_x F(x_k, y_k^D; \mathcal{D}_F) - \nabla_x\nabla_y G(x_k, y_k^D; \mathcal{D}_G)v_Q \tag{6}$$
as an estimate of the hypergradient $\nabla\Phi(x_k)$. Compared to the deterministic case, it is more challenging to design a sample-efficient hypergradient estimator in the stochastic case. For example, instead of choosing the same batch size for all $\mathcal{B}_j$, $j = 1, ..., Q$ in eq. (5), our analysis captures the different impact of the components $\nabla_y^2 G(x_k, y_k^D; \mathcal{B}_j)$, $j = 1, ..., Q$ on the hypergradient estimation variance, and inspires an adaptive and more efficient choice that lets $|\mathcal{B}_{Q-j}|$ decay exponentially with $j$ from $0$ to $Q-1$. By doing so, we achieve an improved complexity.
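The sketch below (not the authors' released implementation; the stochastic losses $F$, $G$ and the data are synthetic, and a fixed Hessian batch size is used instead of the exponentially decaying schedule analyzed later in Proposition 3) illustrates how Algorithm 3 and eq. (6) translate into Hessian- and Jacobian-vector products computed with torch.autograd.grad.

```python
# Sketch of the stocBiO hypergradient estimate (Algorithm 3 + eq. (6)); F, G and the data
# below are illustrative placeholders, not the paper's experimental objectives.
import torch

torch.manual_seed(0)
A, b = torch.randn(256, 5), torch.randn(256)          # synthetic data shared by F and G
mu = 1.0                                              # strong-convexity constant of G in y

def G(x, y, idx):                                     # lower-level stochastic loss
    return 0.5 * ((A[idx] @ y - A[idx] @ x) ** 2).mean() + 0.5 * mu * (y ** 2).sum()

def F(x, y, idx):                                     # upper-level stochastic loss
    return 0.5 * ((A[idx] @ y - b[idx]) ** 2).mean() + 0.05 * (x ** 2).sum()

def hvp_G(x, y, idx, v):                              # d2_y G(x, y; batch) v
    gy = torch.autograd.grad(G(x, y, idx), y, create_graph=True)[0]
    return torch.autograd.grad(gy, y, grad_outputs=v)[0]

def stocbio_hypergrad(x, y, Q=10, eta=0.05, batch=32):
    idx_F = torch.randint(0, 256, (batch,))
    v0 = torch.autograd.grad(F(x, y, idx_F), y)[0]    # gradient v_0 over batch D_F
    r, acc = v0, v0.clone()                           # r_Q = v_0; acc accumulates sum_i r_i
    for _ in range(Q):                                # recursion of Algorithm 3, fresh batch each step
        idx_H = torch.randint(0, 256, (batch,))
        r = r - eta * hvp_G(x, y, idx_H, r)
        acc = acc + r
    v_Q = eta * acc                                   # v_Q = eta * sum_{i=0}^{Q} r_i
    idx_G = torch.randint(0, 256, (batch,))
    gy = torch.autograd.grad(G(x, y, idx_G), y, create_graph=True)[0]
    jvp = torch.autograd.grad(gy, x, grad_outputs=v_Q)[0]     # d_x d_y G(x, y; D_G) v_Q
    fx = torch.autograd.grad(F(x, y, idx_F), x)[0]
    return fx - jvp                                   # eq. (6)

x = torch.randn(5, requires_grad=True)
y = torch.randn(5, requires_grad=True)                # in practice, the output of the inner SGD loop
print(stocbio_hypergrad(x, y))
```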
## 3. Definitions and Assumptions

Let $z = (x, y)$ denote all parameters. For simplicity, suppose the sample sets $\mathcal{S}_t$ for all $t = 0, ..., D-1$, $\mathcal{D}_G$ and $\mathcal{D}_F$ have sizes $S$, $D_g$ and $D_f$, respectively. In this paper, we focus on the following types of loss functions for both the deterministic and stochastic cases.

Assumption 1. The lower-level function $g(x, y)$ is $\mu$-strongly convex w.r.t. $y$, and the total objective function $\Phi(x) = f(x, y^*(x))$ is nonconvex w.r.t. $x$. For the stochastic setting, the same assumptions hold for $G(x, y; \zeta)$ and $\Phi(x)$, respectively.

Since $\Phi(x)$ is nonconvex, algorithms are expected to find an $\epsilon$-accurate stationary point defined as follows.

Definition 1. We say $\bar x$ is an $\epsilon$-accurate stationary point of the objective function $\Phi(x)$ in eq. (2) if $\mathbb{E}\|\nabla\Phi(\bar x)\|^2 \leq \epsilon$, where $\bar x$ is the output of an algorithm.

In order to compare the performance of different bilevel algorithms, we adopt the following complexity metrics.

Definition 2. For a function $f(x, y)$ and a vector $v$, let $\mathrm{Gc}(f,\epsilon)$ be the number of evaluations of the partial gradients $\nabla_x f$ and $\nabla_y f$, and let $\mathrm{JV}(g,\epsilon)$ and $\mathrm{HV}(g,\epsilon)$ be the numbers of Jacobian-vector products $\nabla_x\nabla_y g(x, y)v$ and Hessian-vector products $\nabla_y^2 g(x, y)v$, respectively. For the stochastic case, similar metrics are adopted but w.r.t. the stochastic functions $F(x, y; \xi)$ and $G(x, y; \zeta)$.

We take the following standard assumptions on the loss functions in eq. (2), which have been widely adopted in bilevel optimization (Ghadimi & Wang, 2018; Ji et al., 2020a).

Assumption 2. The loss functions $f(z)$ and $g(z)$ satisfy:
- The function $f(z)$ is $M$-Lipschitz, i.e., for any $z, z'$, $|f(z) - f(z')| \leq M\|z - z'\|$.
- The gradients $\nabla f(z)$ and $\nabla g(z)$ are $L$-Lipschitz, i.e., for any $z, z'$, $\|\nabla f(z) - \nabla f(z')\| \leq L\|z - z'\|$ and $\|\nabla g(z) - \nabla g(z')\| \leq L\|z - z'\|$.

For the stochastic case, the same assumptions hold for $F(z; \xi)$ and $G(z; \zeta)$ for any given $\xi$ and $\zeta$.

As shown in Proposition 1, the gradient of the objective function $\Phi(x)$ involves the second-order derivatives $\nabla_x\nabla_y g(z)$ and $\nabla_y^2 g(z)$. The following assumption imposes Lipschitz conditions on such higher-order derivatives, as also made in Ghadimi & Wang (2018).

Assumption 3. The derivatives $\nabla_x\nabla_y g(z)$ and $\nabla_y^2 g(z)$ are $\tau$- and $\rho$-Lipschitz, i.e.,
- For any $z, z'$, $\|\nabla_x\nabla_y g(z) - \nabla_x\nabla_y g(z')\| \leq \tau\|z - z'\|$.
- For any $z, z'$, $\|\nabla_y^2 g(z) - \nabla_y^2 g(z')\| \leq \rho\|z - z'\|$.

For the stochastic case, the same assumptions hold for $\nabla_x\nabla_y G(z; \zeta)$ and $\nabla_y^2 G(z; \zeta)$ for any $\zeta$.

As typically adopted in the analysis of stochastic optimization, we make the following bounded-variance assumption on the lower-level stochastic gradient $\nabla G(z; \zeta)$.

Assumption 4. The gradient $\nabla G(z; \zeta)$ has bounded variance, i.e., $\mathbb{E}_\zeta\|\nabla G(z; \zeta) - \nabla g(z)\|^2 \leq \sigma^2$ for some $\sigma$.

## 4. Main Results for Bilevel Optimization

### 4.1. Deterministic Bilevel Optimization

We first characterize the convergence and complexity of AID-BiO. Let $\kappa = \frac{L}{\mu}$ denote the condition number.

Theorem 1 (AID-BiO). Suppose Assumptions 1, 2 and 3 hold. Define the smoothness parameter $L_\Phi = L + \frac{2L^2 + \tau M^2}{\mu} + \frac{\rho L M + L^3 + \tau M L}{\mu^2} + \frac{\rho L^2 M}{\mu^3} = \Theta(\kappa^3)$, choose the stepsizes $\alpha \leq \frac{1}{L}$ and $\beta = \frac{1}{8L_\Phi}$, and set the inner-loop iteration number $D = \Theta(\kappa)$ and the CG iteration number $N = \Theta(\sqrt{\kappa})$, where the detailed forms of $D$ and $N$ can be found in Appendix E. Then, the outputs of AID-BiO satisfy
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \leq \frac{64L_\Phi\big(\Phi(x_0) - \inf_x\Phi(x)\big) + 5\Delta_0}{K},$$
where $\Delta_0 = \|y_0 - y^*(x_0)\|^2 + \|v_0^* - v_0\|^2 > 0$. In order to achieve an $\epsilon$-accurate stationary point, the complexities satisfy
- Gradient complexities: $\mathrm{Gc}(f,\epsilon) = \mathcal{O}(\kappa^3\epsilon^{-1})$, $\mathrm{Gc}(g,\epsilon) = \mathcal{O}(\kappa^4\epsilon^{-1})$.
- Jacobian- and Hessian-vector product complexities: $\mathrm{JV}(g,\epsilon) = \mathcal{O}(\kappa^3\epsilon^{-1})$, $\mathrm{HV}(g,\epsilon) = \mathcal{O}(\kappa^{3.5}\epsilon^{-1})$.

As shown in Table 1, the complexities $\mathrm{Gc}(f,\epsilon)$, $\mathrm{Gc}(g,\epsilon)$, $\mathrm{JV}(g,\epsilon)$ and $\mathrm{HV}(g,\epsilon)$ of our analysis improve those of Ghadimi & Wang (2018) (eq. (2.30) therein) by the order of $\kappa$, $\kappa\epsilon^{-1/4}$, $\kappa$ and $\kappa$, respectively. Such an improvement is achieved by a refined analysis with a constant number of inner-loop steps, and by a warm start strategy that backpropagates the tracking errors $\|y_k^D - y^*(x_k)\|$ and $\|v_k^N - v_k^*\|$ to previous loops, as also demonstrated by our meta-learning experiments.

We next characterize the convergence and complexity performance of the ITD-BiO algorithm.

Theorem 2 (ITD-BiO). Suppose Assumptions 1, 2 and 3 hold. Define $L_\Phi$ as in Theorem 1, and choose $\alpha \leq \frac{1}{L}$, $\beta = \frac{1}{4L_\Phi}$ and $D = \Theta(\kappa\log\frac{1}{\epsilon})$, where the detailed form of $D$ can be found in Appendix F. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 \leq \frac{16L_\Phi\big(\Phi(x_0) - \inf_x\Phi(x)\big)}{K} + \mathcal{O}(\epsilon).$$
In order to achieve an $\epsilon$-accurate stationary point, the complexities satisfy
- Gradient complexities: $\mathrm{Gc}(f,\epsilon) = \mathcal{O}(\kappa^3\epsilon^{-1})$, $\mathrm{Gc}(g,\epsilon) = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.
- Jacobian- and Hessian-vector product complexities: $\mathrm{JV}(g,\epsilon) = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$, $\mathrm{HV}(g,\epsilon) = \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.

Comparing Theorem 1 and Theorem 2, it can be seen that the complexities $\mathrm{JV}(g,\epsilon)$ and $\mathrm{HV}(g,\epsilon)$ of AID-BiO are better than those of ITD-BiO by an order of $\kappa$ and $\kappa^{0.5}$, respectively, which implies that AID-BiO is more computationally and memory efficient than ITD-BiO, as verified in Figure 2.

### 4.2. Stochastic Bilevel Optimization

We first characterize the bias and variance of the important component $v_Q$ in eq. (5).

Proposition 3. Suppose Assumptions 1, 2 and 3 hold. Let $\eta \leq \frac{1}{L}$ and choose $|\mathcal{B}_{Q+1-j}| = BQ(1 - \eta\mu)^{j-1}$ for $j = 1, ..., Q$, where $B \geq \frac{1}{Q(1-\eta\mu)^{Q-1}}$. Then, the bias satisfies
$$\big\|\mathbb{E}\,v_Q - [\nabla_y^2 g(x_k, y_k^D)]^{-1}\nabla_y f(x_k, y_k^D)\big\| \leq \mu^{-1}(1 - \eta\mu)^{Q+1}M. \tag{7}$$
Furthermore, the estimation variance satisfies
$$\mathbb{E}\big\|v_Q - [\nabla_y^2 g(x_k, y_k^D)]^{-1}\nabla_y f(x_k, y_k^D)\big\|^2 \leq \frac{4(1 - \eta\mu)^{2Q+2}M^2}{\mu^2} + \mathcal{O}\Big(\frac{1}{B} + \frac{1}{D_f}\Big),$$
where the constants hidden in $\mathcal{O}(\cdot)$ depend only on $L$, $M$, $\mu$ and $\eta$.

Proposition 3 shows that if we choose $Q$, $B$ and $D_f$ at the order level of $\mathcal{O}(\log\frac{1}{\epsilon})$, $\mathcal{O}(1/\epsilon)$ and $\mathcal{O}(1/\epsilon)$, respectively, then the bias and variance are smaller than $\mathcal{O}(\epsilon)$, and the required number of samples is $\sum_{j=1}^{Q} BQ(1 - \eta\mu)^{j-1} = \mathcal{O}(\epsilon^{-1}\log\frac{1}{\epsilon})$. Note that the chosen batch size $|\mathcal{B}_{Q+1-j}|$ decays exponentially w.r.t. the index $j$. In comparison, a uniform choice of all $|\mathcal{B}_j|$ would yield a worse sample complexity of $\mathcal{O}\big(\epsilon^{-1}(\log\frac{1}{\epsilon})^2\big)$.
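The following small computation (illustrative numbers only; the constants chosen for $Q$ and $B$ are not the paper's) makes the batch-size schedule of Proposition 3 concrete and compares its total Hessian-sample count with the uniform alternative discussed above.

```python
# Illustration of the Hessian-batch schedule |B_{Q+1-j}| = B*Q*(1 - eta*mu)^{j-1}:
# the batches shrink geometrically, so the total sample count is O(B*Q) (treating eta*mu
# as a constant), versus B*Q^2 for a uniform choice of the largest size B*Q.
import math

eta, mu = 0.05, 1.0
eps = 1e-3
Q = math.ceil(math.log(1.0 / eps) / (eta * mu))        # Q = O(log(1/eps)), illustrative constant
B = math.ceil(1.0 / eps)                               # B = O(1/eps)

sizes = [math.ceil(B * Q * (1 - eta * mu) ** (j - 1)) for j in range(1, Q + 1)]
decaying_total = sum(sizes)                            # roughly B*Q / (eta*mu)
uniform_total = B * Q * Q                              # all Q batches of the largest size B*Q

print(Q, sizes[0], sizes[-1])                          # largest and smallest batch sizes
print(decaying_total, uniform_total)                   # decaying schedule uses far fewer samples
```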
We next analyze stocBiO when $\Phi(x)$ is nonconvex.

Theorem 3. Suppose Assumptions 1, 2, 3 and 4 hold. Define $L_\Phi$ as in Theorem 1, and choose $\beta = \frac{1}{4L_\Phi}$, $\eta < \frac{1}{L}$, and $D = \Theta(\kappa\log\kappa)$, where the detailed form of $D$ can be found in Appendix G.3. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla\Phi(x_k)\|^2 \leq \mathcal{O}\Big(\frac{L_\Phi}{K} + \kappa^2(1-\eta\mu)^{2Q} + \frac{1}{S} + \frac{1}{D_g} + \frac{1}{D_f} + \frac{1}{B}\Big),$$
where the $\mathcal{O}(\cdot)$ hides constants that are polynomial in $\kappa$. In order to achieve an $\epsilon$-accurate stationary point, the complexities satisfy
- Gradient complexities: $\mathrm{Gc}(F,\epsilon) = \mathcal{O}(\kappa^5\epsilon^{-2})$, $\mathrm{Gc}(G,\epsilon) = \mathcal{O}(\kappa^9\epsilon^{-2})$.
- Jacobian- and Hessian-vector product complexities: $\mathrm{JV}(G,\epsilon) = \mathcal{O}(\kappa^5\epsilon^{-2})$, $\mathrm{HV}(G,\epsilon) = \widetilde{\mathcal{O}}(\kappa^6\epsilon^{-2})$.

Theorem 3 shows that stocBiO converges sublinearly, with the convergence error decaying exponentially w.r.t. $Q$ and sublinearly w.r.t. the batch sizes $S$, $D_g$, $D_f$ for gradient estimation and $B$ for Hessian-inverse estimation. In addition, it can be seen that the number $D$ of inner-loop steps is at a constant level in terms of $\epsilon$, rather than the typical choice of $\Theta(\log\frac{1}{\epsilon})$. As shown in Table 2, the gradient complexities of our proposed algorithm in terms of $F$ and $G$ improve those of BSA in Ghadimi & Wang (2018) by an order of $\kappa$ and $\epsilon^{-1}$, respectively. In addition, the Jacobian-vector product complexity $\mathrm{JV}(G,\epsilon)$ of our algorithm improves that of BSA by an order of $\kappa$. In terms of the accuracy $\epsilon$, our gradient, Jacobian- and Hessian-vector product complexities improve those of TTSA in Hong et al. (2020) all by an order of $\epsilon^{-0.5}$.

## 5. Applications to Meta-Learning

Consider the few-shot meta-learning problem with $m$ tasks $\{\mathcal{T}_i,\ i = 1, ..., m\}$ sampled from a distribution $P_{\mathcal{T}}$. Each task $\mathcal{T}_i$ has a loss function $\mathcal{L}(\phi, w_i; \xi)$ over each data sample $\xi$, where $\phi$ are the parameters of an embedding model shared by all tasks, and $w_i$ are the task-specific parameters. The goal of this framework is to find good shared parameters $\phi$ for all tasks, and, building on the embedded features, each task then adapts its own parameters $w_i$ by minimizing its loss.

The model training takes a bilevel procedure. In the lower-level stage, building on the embedded features, the base learner of task $\mathcal{T}_i$ searches for $w_i^*$ as the minimizer of its loss over a training set $\mathcal{S}_i$. In the upper-level stage, the meta-learner evaluates the minimizers $w_i^*$, $i = 1, ..., m$ on held-out test sets, and optimizes the parameters $\phi$ of the embedding model over all tasks. Let $\tilde w = (w_1, ..., w_m)$ denote all task-specific parameters. Then, the objective function is given by
$$\min_\phi\ \mathcal{L}_{\mathcal{D}}(\phi, \tilde w^*) = \frac{1}{m}\sum_{i=1}^{m}\underbrace{\frac{1}{|\mathcal{D}_i|}\sum_{\xi\in\mathcal{D}_i}\mathcal{L}(\phi, w_i^*; \xi)}_{\mathcal{L}_{\mathcal{D}_i}(\phi, w_i^*):\ \text{task-specific upper-level loss}} \quad \text{s.t.}\quad \tilde w^* = \arg\min_{\tilde w}\ \mathcal{L}_{\mathcal{S}}(\phi, \tilde w) = \sum_{i=1}^{m}\mathcal{L}_{\mathcal{S}_i}(\phi, w_i),$$
where $\mathcal{L}_{\mathcal{S}_i}(\phi, w_i) = \frac{1}{|\mathcal{S}_i|}\sum_{\xi\in\mathcal{S}_i}\mathcal{L}(\phi, w_i; \xi) + \mathcal{R}(w_i)$ with a strongly-convex regularizer $\mathcal{R}(w_i)$, e.g., an $L^2$ regularizer, and $\mathcal{S}_i$, $\mathcal{D}_i$ are the training and test datasets of task $\mathcal{T}_i$. Note that the lower-level problem is equivalent to solving each $w_i^*$ as a minimizer of the task-specific loss $\mathcal{L}_{\mathcal{S}_i}(\phi, w_i)$ for $i = 1, ..., m$.
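As a schematic illustration (not the paper's experimental code; the linear embedding, dimensions, and regularization strength are made-up placeholders), the sketch below shows why the lower-level problem decouples across tasks: with the shared embedding $\phi$ fixed, each task adapts only its own strongly convex, $L^2$-regularized linear head.

```python
# Schematic sketch of the meta-learning lower level: each task i solves a strongly convex
# problem in its own last-layer weights w_i over embedded features, independently of the
# other tasks. Embedding, sizes, and regularization below are illustrative assumptions.
import torch

torch.manual_seed(0)
emb_dim, n_classes, n_shot = 16, 5, 25
phi = torch.nn.Linear(32, emb_dim)                     # shared embedding (upper-level variable)
reg = 0.1                                              # strength of the strongly convex L2 regularizer

def task_lower_loss(w, feats, labels):                 # L_{S_i}(phi, w_i) over embedded features
    logits = feats @ w
    return torch.nn.functional.cross_entropy(logits, labels) + 0.5 * reg * (w ** 2).sum()

def solve_task(feats, labels, steps=100, alpha=0.1):   # inner-loop full-batch gradient descent
    w = torch.zeros(emb_dim, n_classes, requires_grad=True)
    for _ in range(steps):
        grad_w = torch.autograd.grad(task_lower_loss(w, feats, labels), w)[0]
        w = (w - alpha * grad_w).detach().requires_grad_(True)
    return w

# one task: embed its support set S_i, then adapt w_i on the embedded features
x_support = torch.randn(n_shot, 32)
y_support = torch.randint(0, n_classes, (n_shot,))
feats = phi(x_support)
w_i = solve_task(feats.detach(), y_support)            # detach() gives a pure lower-level solve;
                                                       # ITD would instead keep the graph to phi
```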
In practice, $w_i$ often corresponds to the parameters of the last linear layer of a neural network and $\phi$ are the parameters of the remaining layers (e.g., 4 convolutional layers in Bertinetto et al. (2018); Ji et al. (2020a)), and hence the lower-level function is strongly convex w.r.t. $\tilde w$ while the upper-level function $\mathcal{L}_{\mathcal{D}}(\phi, \tilde w^*(\phi))$ is generally nonconvex w.r.t. $\phi$. In addition, due to the small sizes of the datasets $\mathcal{D}_i$ and $\mathcal{S}_i$ in few-shot learning, all updates for each task $\mathcal{T}_i$ use full gradient descent without data resampling. As a result, AID-BiO and ITD-BiO in Algorithm 1 can be applied here.

In some applications where the number $m$ of tasks is large, it is more efficient to sample a batch $\mathcal{B}$ of i.i.d. tasks from $\{\mathcal{T}_i,\ i = 1, ..., m\}$ at each meta (outer) iteration, and to optimize the mini-batch versions $\mathcal{L}_{\mathcal{D}}(\phi, \tilde w; \mathcal{B}) = \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathcal{L}_{\mathcal{D}_i}(\phi, w_i)$ and $\mathcal{L}_{\mathcal{S}}(\phi, \tilde w; \mathcal{B}) = \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathcal{L}_{\mathcal{S}_i}(\phi, w_i)$ instead. We next provide the convergence result of ITD-BiO for this case; that of AID-BiO can be derived similarly.

Theorem 4. Suppose Assumptions 1, 2 and 3 hold, and suppose each task loss $\mathcal{L}_{\mathcal{S}_i}(\phi, \cdot)$ is $\mu$-strongly convex. Choose the same parameters $\beta$ and $D$ as in Theorem 2. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla\Phi(\phi_k)\|^2 \leq \mathcal{O}\Big(\frac{1}{K} + \frac{1}{|\mathcal{B}|}\Big).$$
Theorem 4 shows that, compared to the full-batch case (i.e., without task sampling), task sampling introduces a variance term $\mathcal{O}(\frac{1}{|\mathcal{B}|})$ due to the stochastic nature of the algorithm.

### 5.1. Experiments

To validate our theoretical results for deterministic bilevel optimization, we compare the performance of the following algorithms: ITD-BiO, AID-BiO-constant (AID-BiO with a constant number of inner-loop steps as in our analysis), AID-BiO-increasing (AID-BiO with an increasing number of inner-loop steps, as analyzed in Ghadimi & Wang (2018)), and two popular meta-learning algorithms, MAML (Finn et al., 2017), which consists of an inner loop for task adaptation and an outer loop for meta-initialization training, and ANIL (Raghu et al., 2019), i.e., almost-no-inner-loop, an efficient MAML variant that adapts only the last layer of parameters in the inner loop. We conduct experiments on a 5-way 5-shot task on two datasets: FC100 and miniImageNet. The results are averaged over 10 trials with different random seeds. Due to space limitations, we provide the model architectures and hyperparameter settings in Appendix A.

It can be seen from Figure 2 that, for both the miniImageNet and FC100 datasets, AID-BiO-constant converges faster than AID-BiO-increasing in terms of both the training accuracy and the test accuracy, and achieves a better final test accuracy than ANIL and MAML. This demonstrates the improvement of our developed analysis over the existing analysis in Ghadimi & Wang (2018) for the AID-BiO algorithm. Moreover, it can be observed that AID-BiO is slightly faster than ITD-BiO in terms of the training accuracy and test accuracy, which is consistent with our theoretical results.

We also compare the robustness of the bilevel optimizer ITD-BiO (AID-BiO performs similarly to ITD-BiO in terms of the convergence rate) and ANIL when the number $T$ (i.e., $D$ in Algorithm 1) of inner-loop steps is relatively large.

Figure 2. Comparison of various bilevel algorithms on meta-learning over the miniImageNet and FC100 datasets. For each dataset, left plot: training accuracy vs. running time; right plot: test accuracy vs. running time.
Figure 3. Comparison of ITD-BiO and ANIL with a relatively large inner-loop iteration number $T$ ($T = 10$ on the miniImageNet dataset and $T = 20$ on the FC100 dataset).

It can be seen from Figure 3 that when the number of inner-loop steps is large, i.e., $T = 10$ for miniImageNet and $T = 20$ for FC100, the bilevel optimizer ITD-BiO converges stably with a small variance, whereas ANIL suffers from a sudden accuracy drop at around 1500s on miniImageNet and even diverges after 2000s on FC100.

## 6. Applications to Hyperparameter Optimization

The goal of hyperparameter optimization (Franceschi et al., 2018; Feurer & Hutter, 2019) is to search for representation or regularization parameters $\lambda$ that minimize the validation error evaluated at the learner's parameters $w^*$, where $w^*$ is the minimizer of the inner-loop regularized training error. Mathematically, the objective function is given by
$$\min_\lambda\ \mathcal{L}_{\mathcal{D}_{\text{val}}}(\lambda) = \frac{1}{|\mathcal{D}_{\text{val}}|}\sum_{\xi\in\mathcal{D}_{\text{val}}}\mathcal{L}(w^*; \xi)\quad\text{s.t.}\quad w^* = \arg\min_w\ \underbrace{\frac{1}{|\mathcal{D}_{\text{tr}}|}\sum_{\xi\in\mathcal{D}_{\text{tr}}}\mathcal{L}(w, \lambda; \xi) + \mathcal{R}_{w,\lambda}}_{\mathcal{L}_{\mathcal{D}_{\text{tr}}}(w,\lambda)},$$
where $\mathcal{D}_{\text{val}}$ and $\mathcal{D}_{\text{tr}}$ are the validation and training data, $\mathcal{L}$ is the loss, and $\mathcal{R}_{w,\lambda}$ is a regularizer. In practice, the lower-level function $\mathcal{L}_{\mathcal{D}_{\text{tr}}}(w, \lambda)$ is often strongly convex w.r.t. $w$. For example, for the data hyper-cleaning application proposed by Franceschi et al. (2018); Shaban et al. (2019), the predictor is modeled by a linear classifier, the loss function $\mathcal{L}(w; \xi)$ is convex w.r.t. $w$, and $\mathcal{R}_{w,\lambda}$ is a strongly-convex regularizer, e.g., $L^2$ regularization. The sample sizes of $\mathcal{D}_{\text{val}}$ and $\mathcal{D}_{\text{tr}}$ are often large, and stochastic algorithms are preferred for achieving better efficiency. As a result, the above hyperparameter optimization problem falls into the stochastic bilevel optimization we study in eq. (2), and we can apply the proposed stocBiO here. Furthermore, Theorem 3 establishes its performance guarantee.

### 6.1. Experiments

We compare our proposed stocBiO with the following baseline bilevel optimization algorithms.
- BSA (Ghadimi & Wang, 2018): implicit-gradient-based stochastic bilevel optimizer via single-sample data sampling.
- TTSA (Hong et al., 2020): two-timescale stochastic optimizer via single-sample data sampling.
- HOAG (Pedregosa, 2016): a hyperparameter optimization algorithm with approximate gradients. We use the implementation in the repository https://github.com/fabianp/hoag.
- reverse (Franceschi et al., 2017): an iterative differentiation based method that approximates the hypergradient via backpropagation. We use its implementation in https://github.com/prolearner/hypertorch.
- AID-FP (Grazzi et al., 2020): AID with the fixed-point method. We use its implementation in https://github.com/prolearner/hypertorch.
- AID-CG (Grazzi et al., 2020): AID with the conjugate-gradient method. We use its implementation in https://github.com/prolearner/hypertorch.

We demonstrate the effectiveness of the proposed stocBiO algorithm on two experiments: data hyper-cleaning and logistic regression. Due to space limitations, we provide the details of the objective functions and hyperparameter settings in Appendix B; a representative instantiation is sketched below.
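For concreteness, the following sketch gives one common instantiation of this objective for data hyper-cleaning, in the spirit of Franceschi et al. (2017; 2018) and Shaban et al. (2019). The sigmoid sample weighting, regularization strength, and dimensions are assumptions for illustration and may differ from the exact setup in the paper's Appendix B.

```python
# A common data hyper-cleaning instantiation of the hyperparameter-optimization objective:
# lambda holds one weight per (possibly corrupted) training sample, the lower level fits an
# L2-regularized linear classifier on the weighted training loss, and the upper level is the
# loss on a clean validation set. All names and constants here are illustrative assumptions.
import torch

torch.manual_seed(0)
n_tr, n_val, d, c = 1000, 200, 784, 10                 # e.g. MNIST-sized dimensions
X_tr, y_tr = torch.randn(n_tr, d), torch.randint(0, c, (n_tr,))
X_val, y_val = torch.randn(n_val, d), torch.randint(0, c, (n_val,))
reg = 1e-3                                             # strongly convex L2 regularizer

lam = torch.zeros(n_tr, requires_grad=True)            # one hyperparameter per training sample
w = torch.zeros(d, c, requires_grad=True)              # linear classifier (lower-level variable)

def inner_loss(w, lam):                                # lower level: weighted, regularized training loss
    losses = torch.nn.functional.cross_entropy(X_tr @ w, y_tr, reduction="none")
    return (torch.sigmoid(lam) * losses).mean() + 0.5 * reg * (w ** 2).sum()

def outer_loss(w):                                     # upper level: clean validation loss
    return torch.nn.functional.cross_entropy(X_val @ w, y_val)

print(inner_loss(w, lam).item(), outer_loss(w).item())
# A stocBiO-style iteration would (i) run a few SGD steps on inner_loss over minibatches,
# (ii) estimate the hypergradient of outer_loss w.r.t. lam as in eq. (6), (iii) update lam.
```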
Figure 4. Comparison of various stochastic bilevel algorithms on logistic regression on the 20 Newsgroup dataset. (a) Test loss and test accuracy vs. running time; (b) convergence rate with different batch sizes.

Logistic regression on 20 Newsgroup: As shown in Figure 4(a), the proposed stocBiO achieves the fastest convergence rate as well as the best test accuracy among all compared algorithms, which demonstrates its practical advantage. Note that we do not include BSA and TTSA in this comparison, because they converge too slowly with a large variance, and are much worse than the other competing algorithms. In addition, we investigate the impact of the batch size on the performance of stocBiO in Figure 4(b). It can be seen that stocBiO outperforms HOAG under batch sizes of 100, 500, 1000, and 2000. This shows that the performance of stocBiO is not very sensitive to the batch size, and hence the tuning of the batch size is easy to handle in practice.

Figure 5. Comparison of various stochastic bilevel algorithms on hyperparameter optimization at corruption rates $p = 0.1$ and $p = 0.4$. For each corruption rate, left plot: training loss vs. running time; right plot: test loss vs. running time.

Figure 6. Convergence of algorithms at corruption rate $p = 0.2$.

Data hyper-cleaning on MNIST: It can be seen from Figures 5 and 6 that our proposed stocBiO algorithm achieves the fastest convergence rate among all competing algorithms in terms of both the training loss and the test loss. It is also observed that such an improvement is more significant when the corruption rate $p$ is smaller. We note that the stochastic algorithm TTSA converges very slowly with a large variance. This is because TTSA updates the costly outer loop more frequently than the other algorithms, and has a larger variance due to its single-sample data sampling. As a comparison, our stocBiO has a much smaller variance for hypergradient estimation as well as a much faster convergence rate, which validates our theoretical results in Theorem 3.

## 7. Conclusion

In this paper, we develop a general and enhanced convergence rate analysis for nonconvex-strongly-convex bilevel deterministic optimization, and propose a novel algorithm for the stochastic setting whose computational complexity outperforms the best known results orderwisely. Our results also provide theoretical guarantees for various bilevel optimizers in meta-learning and hyperparameter optimization. Our experiments validate the theoretical results and demonstrate the superior performance of the proposed algorithm. We anticipate that the convergence rate analysis developed here will be useful for analyzing other bilevel optimization problems with different loss geometries, and that the proposed algorithms will be useful for other applications such as reinforcement learning and Stackelberg games.

## Acknowledgements

The work was supported in part by the U.S. National Science Foundation under the grants CCF-1909291 and CCF-1900145.

## References

Arnold, S. M., Mahajan, P., Datta, D., and Bunner, I. learn2learn, 2019. https://github.com/learnables/learn2learn.

Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations (ICLR), 2018.

Bracken, J. and McGill, J. T. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1):37-44, 1973.

Domke, J. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics (AISTATS), pp. 318-326, 2012.

Feurer, M. and Hutter, F. Hyperparameter optimization. In Automated Machine Learning, pp. 3-33. Springer, Cham, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning (ICML), pp. 1126-1135, 2017.
Flamary, R., Rakotomamonjy, A., and Gasso, G. Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, pp. 103, 2014.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning (ICML), pp. 1165-1173, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pp. 1568-1577, 2018.

Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Gould, S., Fernando, B., Cherian, A., Anderson, P., Cruz, R. S., and Guo, E. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. On the iteration complexity of hypergradient computation. In Proc. International Conference on Machine Learning (ICML), 2020.

Hansen, P., Jaumard, B., and Savard, G. New branch-and-bound rules for linear bilevel programming. SIAM Journal on Scientific and Statistical Computing, 13(5):1194-1217, 1992.

Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.

Ji, K. and Liang, Y. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint arXiv:2102.03926, 2021.

Ji, K., Lee, J. D., Liang, Y., and Poor, H. V. Convergence of meta-learning with task-specific adaptation over partial parameters. In Advances in Neural Information Processing Systems (NeurIPS), 2020a.

Ji, K., Yang, J., and Liang, Y. Multi-step model-agnostic meta-learning: Convergence and improved algorithms. arXiv preprint arXiv:2002.07836, 2020b.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2014.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1008-1014, 2000.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Kunapuli, G., Bennett, K. P., Hu, J., and Pang, J.-S. Classification model selection via bilevel programming. Optimization Methods & Software, 23(4):475-489, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, J., Gu, B., and Huang, H. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690, 2020.

Liao, R., Xiong, Y., Fetaya, E., Zhang, L., Yoon, K., Pitkow, X., Urtasun, R., and Zemel, R. Reviving and improving recurrent back-propagation. In Proc. International Conference on Machine Learning (ICML), 2018.

Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning (ICML), 2020.

Lorraine, J., Vicol, P., and Duvenaud, D. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1540-1552. PMLR, 2020.
Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning (ICML), pp. 2113-2122, 2015.

Moore, G. M. Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute, 2010.

Okuno, T., Takeda, A., and Kawana, A. Hyperparameter learning via bilevel nonsmooth optimization. arXiv preprint arXiv:1806.01520, 2018.

Oreshkin, B., López, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 721-731, 2018.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning (ICML), pp. 737-746, 2016.

Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. International Conference on Learning Representations (ICLR), 2019.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), pp. 113-124, 2019.

Roth, A., Ullman, J., and Wu, Z. S. Watch and learn: Optimizing from revealed preferences feedback. In Annual ACM Symposium on Theory of Computing (STOC), pp. 949-962, 2016.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 3(115):211-252, 2015.

Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1723-1732, 2019.

Shi, C., Lu, J., and Zhang, G. An extended Kuhn-Tucker approach for linear bilevel programming. Applied Mathematics and Computation, 162(1):51-63, 2005.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

Vinyals, O., Blundell, C., Lillicrap, T., and Wierstra, D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Yu, T. and Zhu, H. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689, 2020.

Zügner, D. and Günnemann, S. Adversarial attacks on graph neural networks via meta learning. In International Conference on Learning Representations (ICLR), 2019.