# Will Bilevel Optimizers Benefit from Loops

Kaiyi Ji (Department of CSE, University at Buffalo, kaiyiji@buffalo.edu), Mingrui Liu (Department of CS, George Mason University, mingruil@gmu.edu), Yingbin Liang (Department of ECE, The Ohio State University, liang.889@osu.edu), Lei Ying (Department of EECS, University of Michigan, leiying@umich.edu)

## Abstract

Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two currently popular bilevel optimizers, AID-BiO and ITD-BiO, naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish a unified convergence analysis for both AID-BiO and ITD-BiO that is applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enables an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.

## 1 Introduction

Bilevel optimization has attracted significant attention recently due to its popularity in a variety of machine learning applications including meta-learning [9, 1, 34, 17], hyperparameter optimization [9, 35, 5], reinforcement learning [22, 15], and signal processing [23, 7]. In this paper, we consider the bilevel optimization problem that takes the following formulation:

$$\min_{x\in\mathbb{R}^p} \Phi(x) := f(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y\in\mathbb{R}^q} g(x, y), \tag{1}$$

where the outer- and inner-level functions $f$ and $g$ are both jointly continuously differentiable. We focus on the setting where the lower-level function $g$ is strongly convex with respect to (w.r.t.) $y$ with condition number $\kappa = \frac{L}{\mu}$ (where $L$ and $\mu$ are the gradient Lipschitz and strong-convexity coefficients defined in Assumptions 2 and 1, respectively, in Section 3), and the outer-level objective function $\Phi(x)$ is possibly nonconvex w.r.t. $x$. Such types of geometries arise in many applications
including meta-learning (which uses the last layer of neural networks as adaptation parameters), hyperparameter optimization (e.g., data hyper-cleaning and regularized logistic regression), and learning in communication networks (e.g., network utility maximization).

Table 1: Comparison of computational complexities of four AID-BiO implementations for finding an $\epsilon$-accurate stationary point. For a fair comparison, gradient descent (GD) is used to solve the linear system for all algorithms. MV($\epsilon$): the total number of Jacobian- and Hessian-vector product computations. Gc($\epsilon$): the total number of gradient computations. $\widetilde{\mathcal{O}}$: hides $\ln\frac{\kappa}{\epsilon}$ factors. We write $a(x) = \Theta(b(x))$ if $c\,b(x) < a(x) < C\,b(x)$, where $c, C$ are universal constants.

| Algorithms | Q | N | MV($\epsilon$) | Gc($\epsilon$) |
| --- | --- | --- | --- | --- |
| BA [10] | $\Theta(\kappa\ln\kappa)$ | $\Theta((k+1)^{1/4})$ ($k$: iteration number) | $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1.25})$ |
| AID-BiO [19] | $\Theta(\kappa\ln\kappa)$ | $\Theta(\kappa\ln\kappa)$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ |
| N-Q-loop AID (this paper) | $\Theta(\kappa\ln\kappa)$ | $\Theta(\kappa\ln\kappa)$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ |
| Q-loop AID (this paper) | $\Theta(\kappa\ln\kappa)$ | $1$ | $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$ |
| N-loop AID (this paper) | $O(1)$ | $\Theta(\kappa\ln\kappa)$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$ |
| No-loop AID (this paper) | $O(1)$ | $1$ | $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ |

A variety of algorithms have been proposed to solve the bilevel optimization problem in eq. (1). For example, [14, 36, 32] proposed constraint-based approaches by replacing the inner-level problem with its optimality conditions as constraints. In comparison, gradient-based bilevel algorithms have received intensive attention recently due to their effectiveness and simplicity; they include two popular approaches via approximate implicit differentiation (AID) [4, 33, 11, 19] and iterative differentiation (ITD) [31, 8, 35]. Readers can refer to Appendix A for an expanded list of related work.

Consider the AID-based bilevel approach (which we call AID-BiO). Its base iteration loop updates the variable $x$ until convergence. Within such a base loop, it needs to solve two sub-problems: finding a nearly optimal solution of the inner-level function via $N$ iterations, and approximating the outer-level Hessian-inverse-vector product via $Q$ iterations. If $Q$ and $N$ are chosen to be large, then the corresponding iterations form additional loops of iterations within the base loop, which we respectively call the Q-loop and the N-loop. Thus, AID-BiO can have four popular implementations depending on different choices of $N$ and $Q$: N-loop (with large $N = \Theta(\kappa\ln\kappa)$ and small $Q = O(1)$), N-Q-loop (with large $N = \Theta(\kappa\ln\kappa)$ and large $Q = \Theta(\kappa\ln\kappa)$), Q-loop (with $N = 1$ and $Q = \Theta(\kappa\ln\kappa)$), and No-loop (with $N = 1$ and $Q = O(1)$). Note that No-loop refers to no additional loops within the base loop, and can be understood as a conventional single-(base)-loop algorithm.

These implementations can significantly affect the efficiency of AID-BiO. Generally, a large $Q$ (i.e., a Q-loop) provides a good approximation of the Hessian-inverse-vector product for the hypergradient computation, and a large $N$ (i.e., an N-loop) finds an accurate optimal point of the inner function. Hence, an algorithm with an N-loop and a Q-loop requires fewer base-loop steps to converge, but each such base-loop step requires more computation due to these loops. On the other hand, small $Q$ and/or $N$ avoid the computation of loops in each base-loop step, but can cause the algorithm to converge with many more base-loop steps. An intriguing question here is which implementation is overall most efficient and whether AID-BiO benefits from having an N-loop and/or a Q-loop.

Existing theoretical studies on AID-BiO are far from answering this question. The studies [10, 19] on deterministic AID-BiO focused only on the N-Q-loop scheme. A few studies analyzed stochastic AID-BiO, such as [26] on No-loop, and [15, 21] on Q-loop. Those studies were not refined enough to capture the computational differences among different implementations, and further, those studies collectively did not cover all four implementations either.
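For concreteness, the four AID-BiO implementations differ only in how $N$ and $Q$ scale with the condition number $\kappa$. The following minimal Python sketch is our own illustration (the function name `loop_sizes` and the use of $\lceil\kappa\ln\kappa\rceil$ and $1$ as stand-ins for $\Theta(\kappa\ln\kappa)$ and $O(1)$ are assumptions, not taken from the paper); it simply records the four choices listed in Table 1.

```python
import math

def loop_sizes(kappa: float, scheme: str) -> tuple[int, int]:
    """Return illustrative (N, Q) loop sizes for the four AID-BiO schemes of Table 1.

    N: inner-level GD steps per base iteration; Q: steps spent on the linear system.
    Theta(kappa ln kappa) is rendered as ceil(kappa * ln(kappa)); O(1) is rendered as 1.
    """
    big = math.ceil(kappa * math.log(kappa))
    return {
        "N-Q-loop": (big, big),   # large N, large Q
        "N-loop":   (big, 1),     # large N, constant Q
        "Q-loop":   (1, big),     # N = 1, large Q
        "No-loop":  (1, 1),       # single base loop only
    }[scheme]

print(loop_sizes(kappa=50.0, scheme="N-loop"))   # e.g. (196, 1)
```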
The first contribution of this paper lies in the development of a unified convergence theory for AID-BiO, which is applicable to all choices of $N$ and $Q$. We further specialize our general theorems to provide the computational complexity for all of the above four implementations (as summarized in Table 1). Comparison among them suggests that AID-BiO does benefit from both the N-loop and the Q-loop. This is in contrast to minimax optimization (a special case of bilevel optimization), where it is shown in [27, 41] that (No-loop) gradient descent ascent (GDA) with $N = 1$ often outperforms (N-loop) GDA with $N = \kappa\ln\kappa$ (here $N$ denotes the number of ascent iterations for each descent iteration). To explain the reason, the gradient w.r.t. $x$ in bilevel optimization involves additional second-order derivatives (that do not exist in minimax optimization), which are more sensitive to the accuracy of the optimal point of the inner function. Therefore, a large $N$ finds such a more accurate solution, and is hence more beneficial for bilevel optimization than for minimax optimization.

Table 2: Comparison of computational complexities of two ITD-BiO implementations for finding an $\epsilon$-accurate stationary point. For a fair comparison, gradient descent (GD) is used to solve the inner-level problem. The analysis in [19] for ITD-BiO assumes that the inner-loop minimizer $y^*(x_k)$ is bounded at the $k$th iteration, which is not required in our analysis. $\mu$: the strong-convexity constant of the inner-level function $g(x,\cdot)$. For the last two columns, N/A means that the complexities to achieve an $\epsilon$-accuracy are not measurable due to the non-vanishing convergence error. We write $a(x) = \Omega(b(x))$ if $a(x) > c\,b(x)$, where $c$ is a universal constant.

| Algorithms | N | Convergence rate | MV($\epsilon$) | Gc($\epsilon$) |
| --- | --- | --- | --- | --- |
| ITD-BiO [19] | $\Theta(\kappa\ln\kappa)$ | $\mathcal{O}\big(\frac{\kappa^3}{K}\big)$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ |
| N-N-loop ITD (this paper) | $\Theta(\kappa\ln\kappa)$ | $\mathcal{O}\big(\frac{\kappa^3}{K}\big)$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ | $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ |
| No-loop ITD (this paper) | $\Theta(1)$ | $\mathcal{O}\big(\frac{\kappa^3}{K}\big)$ + non-vanishing error | N/A | N/A |
| Lower bound (this paper) | $\Theta(1)$ | $\Omega(\kappa^2)$ non-vanishing error | N/A | N/A |

Differently from AID-BiO, the ITD-based bilevel approach (which we call ITD-BiO) constructs the outer-level hypergradient estimate via backpropagation along the N-loop iteration path, and $Q = N$ always holds. Thus, ITD-BiO has only two implementation choices: N-N-loop (with large $N = \Theta(\kappa\ln\kappa)$) and No-loop (with small $N = O(1)$). Here, N-N-loop and No-loop also refer to additional loops for solving sub-problems within ITD-BiO's base loop of updating the variable $x$. The only convergence rate analysis of ITD-BiO was provided in [19], but only for N-N-loop, which does not suggest how N-N-loop compares with No-loop. It is still an open question whether ITD-BiO benefits from N-loops.

The second contribution of this paper lies in the development of a unified convergence theory for ITD-BiO, which is applicable to all values of $N$. We then specialize our general theorem to provide the computational complexity for both of the above implementations (as summarized in Table 2). We further develop a convergence lower bound, which suggests that N-N-loop is necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error.

The technical contribution of this paper is two-fold. For AID methods, most existing studies including [19] solve the linear system with a large $Q = \Theta(\kappa\log\kappa)$ so that the upper-level Hessian-inverse-vector product approximation error can vanish.
In contrast, we allow arbitrary (possibly small) $Q$, and hence this upper-level error can be large and nondecreasing, posing a key challenge for guaranteeing convergence. We come up with a novel idea to prove convergence by showing that this error, not by itself but jointly with the inner-loop error, admits an (approximately) iteratively decreasing property, which bounds the hypergradient error and yields convergence. The analysis contains new developments to handle the coupling between this error and the inner-loop error, which is critical in our proof. For ITD methods, unlike existing studies including [19], we remove the boundedness assumption on $y^*(x)$ via a novel error analysis over the entire execution rather than a single iteration. Our analysis tools are general and can be extended to stochastic and accelerated bilevel optimizers.

## 2 Algorithms

### 2.1 AID-based Bilevel Optimization Algorithm

As shown in Algorithm 1, we present the general AID-based bilevel optimizer (which we refer to as AID-BiO for short). At each iteration $k$ of the base loop, AID-BiO first executes $N$ steps of gradient descent (GD) over the inner function $g(x, y)$ to find an approximation point $y^N_k$, where $N$ can be chosen either at a constant level or as large as $N = \kappa\ln\kappa$ (which forms an N-loop of iterations). Moreover, to accelerate practical training and achieve a stronger performance guarantee, AID-BiO often adopts a warm-start strategy by setting the initialization $y^0_k$ of each N-loop to be the output $y^N_{k-1}$ of the preceding N-loop rather than a random start.

**Algorithm 1** AID-based bilevel optimization (AID-BiO) with double warm starts
1: Input: stepsizes $\alpha, \beta > 0$, initializations $x_0$, $y_0$, $v_0$.
2: for $k = 0, 1, 2, \ldots, K$ do
3: Set $y^0_k = y^N_{k-1}$ if $k > 0$ and $y_0$ otherwise (warm-start initialization)
4: for $t = 1, \ldots, N$ do
5: Update $y^t_k = y^{t-1}_k - \alpha \nabla_y g(x_k, y^{t-1}_k)$
6: end for
7: Hypergradient estimation: set $v^0_k = v^Q_{k-1}$ if $k > 0$ and $v_0$ otherwise (warm-start initialization); solve $v^Q_k$ from $\nabla_y^2 g(x_k, y^N_k)\, v = \nabla_y f(x_k, y^N_k)$ via $Q$ steps of an iterative algorithm starting from $v^0_k$; compute $\widehat{\nabla}\Phi(x_k) = \nabla_x f(x_k, y^N_k) - \nabla_x\nabla_y g(x_k, y^N_k)\, v^Q_k$
8: Update $x_{k+1} = x_k - \beta\, \widehat{\nabla}\Phi(x_k)$
9: end for

To update the outer variable, AID-BiO adopts gradient descent, approximating the true gradient $\nabla\Phi(x_k)$ of the outer function w.r.t. $x$ (called the hypergradient [33, 11]), which takes the following form:

$$\text{(True hypergradient:)}\quad \nabla\Phi(x_k) = \nabla_x f(x_k, y^*(x_k)) - \nabla_x\nabla_y g(x_k, y^*(x_k))\, v^*_k, \tag{2}$$

where $v^*_k$ is the solution of the linear system $\nabla_y^2 g(x_k, y^*(x_k))\, v = \nabla_y f(x_k, y^*(x_k))$. To approximate the above true hypergradient, AID-BiO first solves $v^Q_k$ as an approximate solution of the linear system $\nabla_y^2 g(x_k, y^N_k)\, v = \nabla_y f(x_k, y^N_k)$ using $Q$ steps of GD iterations starting from $v^0_k$. Here, $Q$ can also be chosen either at a constant level or as large as $Q = \kappa\ln\frac{\kappa}{\mu}$ (which forms a Q-loop of iterations). Note that a warm start is also adopted here by setting $v^0_k = v^Q_{k-1}$, which is critical to achieve the convergence guarantee for small $Q$. If $Q$ is large enough, e.g., at an order of $\kappa\ln\frac{\kappa}{\mu}$, a zero initialization with $v^0_k = 0$ suffices to solve the linear system well. Then, AID-BiO constructs a hypergradient estimator $\widehat{\nabla}\Phi(x_k)$ given by

$$\text{(AID-based hypergradient estimate:)}\quad \widehat{\nabla}\Phi(x_k) = \nabla_x f(x_k, y^N_k) - \nabla_x\nabla_y g(x_k, y^N_k)\, v^Q_k. \tag{3}$$

Note that the execution of AID-BiO involves only Hessian-vector products in solving the linear system and the Jacobian-vector product $\nabla_x\nabla_y g(x_k, y^N_k)\, v^Q_k$, which are more computationally tractable than the calculation of full second-order derivatives.
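To make the update pattern of Algorithm 1 concrete, here is a minimal NumPy sketch of AID-BiO on a toy quadratic bilevel problem. The problem instance, dimensions, step sizes, and loop sizes below are our own illustrative assumptions (the paper's experiments use MNIST and a hyper-representation task instead); the inner N-loop, the $Q$-step GD solve of the linear system with warm starts, and the hypergradient estimate in eq. (3) map directly onto lines 3-8 of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 8
A = rng.normal(size=(q, q)); A = A @ A.T + np.eye(q)   # inner Hessian, positive definite
B = rng.normal(size=(q, p))
c = rng.normal(size=q)
L = np.linalg.eigvalsh(A).max()                          # smoothness of the inner problem

# Toy instance: f(x, y) = 0.5||x||^2 + 0.5||y - c||^2,  g(x, y) = 0.5 y^T A y - (Bx)^T y,
# so y*(x) = A^{-1} B x and g is strongly convex in y.
grad_y_g = lambda x, y: A @ y - B @ x     # \nabla_y g
hvp_g    = lambda v: A @ v                # \nabla_y^2 g times a vector (Hessian-vector product)
jvp_g    = lambda v: -B.T @ v             # \nabla_x\nabla_y g times a vector (Jacobian-vector product)
grad_x_f = lambda x, y: x
grad_y_f = lambda x, y: y - c

def aid_bio(K=500, N=10, Q=10, beta=0.02):
    alpha = eta = 1.0 / L
    x, y, v = np.zeros(p), np.zeros(q), np.zeros(q)      # y and v are warm-started across k
    for _ in range(K):
        for _ in range(N):                                # N-loop: GD on g(x, .)
            y -= alpha * grad_y_g(x, y)
        for _ in range(Q):                                # Q-loop: GD on the linear system
            v -= eta * (hvp_g(v) - grad_y_f(x, y))        #   \nabla_y^2 g v = \nabla_y f
        hypergrad = grad_x_f(x, y) - jvp_g(v)             # AID estimate, eq. (3)
        x -= beta * hypergrad
    return x

x_nq = aid_bio(N=10, Q=10)   # N-Q-loop style implementation
x_nl = aid_bio(N=1,  Q=1)    # No-loop style implementation (single base loop)
```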
It is clear that different choices of $N$ and $Q$ lead to four implementations within the base loop of AID-BiO: N-loop (with large $N = \kappa\ln\kappa$ and small $Q = O(1)$), N-Q-loop (with large $N = \kappa\ln\kappa$ and $Q = \kappa\ln\kappa$), Q-loop (with small $N = 1$ and large $Q = \kappa\ln\kappa$), and No-loop (with small $N = 1$ and $Q = O(1)$). In Section 4, we will establish a unified convergence theory for AID-BiO applicable to all of its implementations in order to formally compare their computational efficiency.

### 2.2 ITD-Based Bilevel Optimization Algorithm

As shown in Algorithm 2, the ITD-based bilevel optimizer (which we refer to as ITD-BiO) updates the inner variable $y$ similarly to AID-BiO, and obtains the $N$-step output $y^N_k$ of GD with a warm-start initialization. ITD-BiO differs from AID-BiO mainly in its estimation of the hypergradient. Without leveraging the implicit gradient formulation, ITD-BiO computes the direct derivative $\frac{\partial f(x_k, y^N_k)}{\partial x_k}$ via automatic differentiation for the hypergradient approximation. Since $y^N_k$ depends on $x_k$ through the N-loop iterative GD updates, the execution of ITD-BiO takes backpropagation over the entire N-loop trajectory. To elaborate, it can be shown via the chain rule that the hypergradient estimate $\frac{\partial f(x_k, y^N_k)}{\partial x_k}$ takes the following form:

$$\frac{\partial f(x_k, y^N_k)}{\partial x_k} = \nabla_x f(x_k, y^N_k) - \alpha\sum_{t=0}^{N-1}\nabla_x\nabla_y g(x_k, y^t_k)\prod_{j=t+1}^{N-1}\big(I - \alpha\nabla_y^2 g(x_k, y^j_k)\big)\,\nabla_y f(x_k, y^N_k).$$

As shown in this equation, the differentiation does not compute the second-order derivatives directly but computes the more tractable and economical Hessian-vector products $\nabla_y^2 g(x_k, y^{j-1}_k)\, v_j$, $j = 1, \ldots, N$ (and similarly for Jacobian-vector products), where each $v_j$ is obtained recursively via $v_{j-1} = \big(I - \alpha\nabla_y^2 g(x_k, y^{j}_k)\big)\, v_j$ with $v_N = \nabla_y f(x_k, y^N_k)$.

**Algorithm 2** ITD-based bilevel optimization algorithm (ITD-BiO) with warm start
1: Input: stepsizes $\alpha, \beta > 0$, initializations $x_0$ and $y_0$.
2: for $k = 0, 1, 2, \ldots, K$ do
3: Set $y^0_k = y^N_{k-1}$ if $k > 0$ and $y_0$ otherwise (warm-start initialization)
4: for $t = 1, \ldots, N$ do
5: Update $y^t_k = y^{t-1}_k - \alpha \nabla_y g(x_k, y^{t-1}_k)$
6: end for
7: Compute $\widehat{\nabla}\Phi(x_k) = \frac{\partial f(x_k, y^N_k)}{\partial x_k}$ via backpropagation w.r.t. $x_k$
8: Update $x_{k+1} = x_k - \beta\, \widehat{\nabla}\Phi(x_k)$
9: end for

Clearly, the implementation of ITD-BiO implies that $N = Q$ always holds. Hence, ITD-BiO admits only two possible architectures within its base loop: N-N-loop (with large $N = \kappa\ln\frac{\kappa}{\epsilon}$) and No-loop (with small $N = 1$). In Section 5, we will establish a unified convergence theory for ITD-BiO applicable to both of its implementations in order to formally compare their computational efficiency.

## 3 Definitions and Assumptions

This paper focuses on the following types of objective functions.

**Assumption 1.** The inner-level function $g(x, y)$ is $\mu$-strongly convex w.r.t. $y$.

Since the objective function $\Phi(x)$ in eq. (1) is possibly nonconvex, algorithms are expected to find an $\epsilon$-accurate stationary point defined as follows.

**Definition 1.** We say $\bar x$ is an $\epsilon$-accurate stationary point for the bilevel optimization problem given in eq. (1) if $\|\nabla\Phi(\bar x)\|^2 \le \epsilon$, where $\bar x$ is the output of an algorithm.

In order to compare the performance of different bilevel algorithms, we adopt the following metrics of computational complexity.

**Definition 2.** Let Gc($\epsilon$) be the number of gradient evaluations, and MV($\epsilon$) be the total number of Jacobian- and Hessian-vector product evaluations, needed to achieve an $\epsilon$-accurate stationary point of the bilevel optimization problem in eq. (1).

Let $z = (x, y)$. We take the following standard assumptions, as also widely adopted by [10, 17].

**Assumption 2.**
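As a sanity check of the chain-rule formula above, the backward recursion can be implemented with only Hessian- and Jacobian-vector products along the stored inner trajectory. Below is a minimal NumPy sketch on a toy quadratic instance of our own choosing (not the paper's experimental setup); it compares the ITD estimate against the true hypergradient, which is available in closed form for this instance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, N = 4, 6, 15
A = rng.normal(size=(q, q)); A = A @ A.T + np.eye(q)   # \nabla_y^2 g (constant for this toy g)
B = rng.normal(size=(q, p))
c = rng.normal(size=q)
alpha = 1.0 / np.linalg.eigvalsh(A).max()

# Toy instance: f(x, y) = 0.5||y - c||^2,  g(x, y) = 0.5 y^T A y - (Bx)^T y.
grad_y_g = lambda x, y: A @ y - B @ x

def itd_hypergrad(x, y0):
    # Forward pass: N GD steps on the inner problem, storing the trajectory y^0, ..., y^N.
    traj = [y0]
    for _ in range(N):
        traj.append(traj[-1] - alpha * grad_y_g(x, traj[-1]))
    # Backward pass: u carries the running product applied to \nabla_y f(x, y^N),
    # h accumulates the ITD estimate of d f(x, y^N) / d x.
    u = traj[-1] - c                   # \nabla_y f(x, y^N)
    h = np.zeros(p)                    # \nabla_x f = 0 for this toy f
    for _ in range(N - 1, -1, -1):
        h -= alpha * (-B.T @ u)        # -alpha * \nabla_x\nabla_y g times u
        u -= alpha * (A @ u)           # u <- (I - alpha A) u  (A = \nabla_y^2 g, constant here)
    return h

x = rng.normal(size=p)
approx = itd_hypergrad(x, np.zeros(q))
y_star = np.linalg.solve(A, B @ x)                  # exact inner minimizer y*(x)
exact = B.T @ np.linalg.solve(A, y_star - c)        # true hypergradient \nabla\Phi(x)
print(np.linalg.norm(approx - exact))               # shrinks as N grows
```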
Gradients $\nabla f(z)$ and $\nabla g(z)$ are $L$-Lipschitz, i.e., for any $z, z'$, $\|\nabla f(z) - \nabla f(z')\| \le L\|z - z'\|$ and $\|\nabla g(z) - \nabla g(z')\| \le L\|z - z'\|$.

As shown in eq. (2), the gradient of the objective function $\Phi(x)$ involves the second-order derivatives $\nabla_x\nabla_y g(z)$ and $\nabla_y^2 g(z)$. The following assumption imposes Lipschitz conditions on such higher-order derivatives, as also made in [10].

**Assumption 3.** The derivatives $\nabla_x\nabla_y g(z)$ and $\nabla_y^2 g(z)$ are $\rho$-Lipschitz, i.e., for any $z, z'$, $\|\nabla_x\nabla_y g(z) - \nabla_x\nabla_y g(z')\| \le \rho\|z - z'\|$ and $\|\nabla_y^2 g(z) - \nabla_y^2 g(z')\| \le \rho\|z - z'\|$.

To guarantee the boundedness of the hypergradient estimation error, existing works [10, 17, 11] assume that the gradient $\nabla f(z)$ is bounded for all $z = (x, y)$. Instead, we make a weaker boundedness assumption on the gradients $\nabla_y f(x, y^*(x))$.

**Assumption 4.** There exists a constant $M$ such that for any $x$, $\|\nabla_y f(x, y^*(x))\| \le M$.

For the case where the total objective function $\Phi(\cdot)$ has some benign structure, e.g., convexity or strong convexity, Assumption 4 can be removed by an induction analysis showing that all iterates are bounded, as in [18]. Assumption 4 can also be removed by projecting $x$ onto a bounded constraint set $\mathcal{X}$.

## 4 Convergence Analysis of AID-BiO

As described in Section 2.1, AID-BiO can have four possible implementations depending on whether $N$ and $Q$ are chosen to be large enough to form an N-loop and/or a Q-loop. In this section, we provide the convergence analysis and characterize the overall computational complexity for all four implementations, which provides general guidance on which algorithmic architecture is computationally most efficient.

### 4.1 Convergence Rate and Computational Complexity

In this subsection, we develop two unified theorems for AID-BiO, both of which are applicable to all regimes of $N$ and $Q$. We then specialize these theorems to provide the complexity bounds (as corollaries) for the four implementations of AID-BiO. It turns out that the first theorem provides tighter complexity bounds for the implementations with small $Q = \Theta(1)$, and the second theorem provides tighter complexity bounds for the implementations with large $Q = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$. Our presentation of those corollaries below will thus focus only on the tighter bounds.

The following theorem provides our first unified convergence analysis for AID-BiO.

**Theorem 1.** Suppose Assumptions 1, 2, 3 and 4 hold. Choose the parameters $\alpha$, $\eta$ and $\lambda$ such that
$$(1+\lambda)(1-\alpha\mu)^N\Big(1 + r\Big(1 + \frac{1}{\eta\mu}\Big)\Big) \le 1 - \eta\mu,$$
where $r = \Theta(\mu^2 C_Q^2)$ and $C_Q$ is a constant determined by $\eta$, $\mu$, $Q$ and $L$ whose exact form is given in Appendix H. Let $L_\Phi = \Theta(\kappa^3)$ be the smoothness parameter of $\Phi(\cdot)$, and choose $\beta = \min\{\cdots, 1/\widetilde w\}$ with $\widetilde w = \Theta(\eta\mu\kappa^4\cdots)$ as specified in Appendix H. Then,
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\Big(\frac{\Phi(x_0) - \Phi(x^*)}{\beta K} + \cdots\Big),$$
where the omitted terms are governed by $\kappa^2\|y_0 - y^*(x_0)\|^2$ and $M$.

The complete version of Theorem 1 with full parameter specifications can be found in Appendix H. Theorem 1 also elaborates the precise requirements on the stepsizes $\alpha$, $\eta$ and $\beta$ and the auxiliary parameter $\lambda$, which take complicated forms. In the following, by further specifying these parameters, we characterize the complexities for AID-BiO in more explicit forms. We focus on the implementations with $Q = \Theta(1)$ (for which Theorem 1 specializes to tighter bounds than Theorem 2 below), which include the N-loop scheme (with $N = \Theta(\kappa\ln\kappa)$) and the No-loop scheme (with $N = 1$).

**Corollary 1 (N-loop).** Consider N-loop AID-BiO with $N = \Theta(\kappa\ln\kappa)$ and $Q = \Theta(1)$, where $\kappa = \frac{L}{\mu}$ denotes the condition number of the inner problem. Under the same setting as Theorem 1, choose $\eta = \frac{1}{L}$, $\alpha = \frac{1}{L}$, and $\lambda = 1$.
Then, we have $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^4}{K}\big)$, and the complexity to achieve an $\epsilon$-accurate stationary point is Gc($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$ and MV($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.

**Corollary 2 (No-loop).** Consider No-loop AID-BiO with $N = 1$ and $Q = \Theta(1)$. Under the same setting as Theorem 1, choose parameters $\alpha = \frac{1}{2}$ and $\eta = \min\{\frac{1}{4}, \frac{1}{\mu Q}\}$. Then, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^6}{K}\big)$, and the complexity is Gc($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ and MV($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$.

The analysis of Theorem 1 can be further improved in the large-$Q$ regime, which guarantees a sufficiently small outer-level approximation error and helps to relax the requirement on the stepsize $\eta$. Such an adaptation yields the following alternative unified convergence characterization for AID-BiO, which is applicable to all $Q$ and $N$ but specializes to tighter complexity bounds than Theorem 1 in the large-$Q$ regime. For simplicity, we set the initialization $v^0_k = 0$ in Algorithm 1.

**Theorem 2.** Suppose Assumptions 1, 2, 3 and 4 hold. Define $\tau = \Theta\big((1-\alpha\mu)^N(1+\lambda) + (1+\lambda^{-1})(\kappa^2 + C_Q^2)\kappa^2\beta^2\big)$ and $w = \Theta\big((1-\alpha\mu)^N(\kappa^2 + C_Q^2)(1+\lambda^{-1})\kappa^2\big)$, where $C_Q$ is the positive constant defined as in Theorem 1. Choose parameters $\alpha, \beta$ such that $\tau < 1$ and $\beta L_\Phi + w\beta^2\big(\frac12 + \beta L_\Phi\big)\frac{1}{1-\tau} \le \frac14$ hold. Then, the output of AID-BiO satisfies
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\Big(\frac{\Phi(x_0) - \Phi(x^*)}{K} + \frac{\delta_0}{(1-\tau)K} + \kappa^2(1-\eta\mu)^{2Q}\Big),$$
where $\delta_0 = \Theta\big((\kappa^2 + C_Q^2)(1-\alpha\mu)^N\|y^*_0 - y_0\|^2\big)$ is the initial distance.

The complete version of Theorem 2 with full parameter specifications can be found in Appendix K. We next specialize Theorem 2 to obtain the complexity for the two implementations of AID-BiO with $Q = \Theta(\kappa\ln\kappa)$: N-Q-loop (with $N = \Theta(\kappa\ln\kappa)$) and Q-loop (with $N = 1$), as shown in the following two corollaries. For each case, we need to set the parameters $\lambda$, $\eta$ and $\alpha$ in Theorem 2 properly.

**Corollary 3 (N-Q-loop).** Consider N-Q-loop AID-BiO with $N = \Theta(\kappa\ln\kappa)$ and $Q = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$. Under the same setting as Theorem 2, choose $\eta = \alpha = \frac{1}{L}$, $\lambda = 1$ and $\beta = \Theta(\kappa^{-3})$. Then, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^3}{K} + \epsilon\big)$, and the complexity is Gc($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ and MV($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.

**Corollary 4 (Q-loop).** Consider Q-loop AID-BiO with $N = 1$ and $Q = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$. Under the same setting as Theorem 2, choose $\alpha = \eta = \frac{1}{2}$ and $\beta = \Theta(\kappa^{-4})$. Then, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^5}{K} + \epsilon\big)$, and the complexity is Gc($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$ and MV($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$.

### 4.2 Comparison among Four Implementations

**Impact of the N-loop ($N = 1$ vs. $N = \kappa\ln\kappa$).** We fix $Q$, and compare how the choice of $N$ affects the computational complexity. First, let $Q = \Theta(1)$, and compare the results between the two implementations: N-loop with $N = \Theta(\kappa\ln\kappa)$ (Corollary 1) and No-loop with $N = 1$ (Corollary 2). Clearly, the N-loop scheme significantly improves the convergence rate of the No-loop scheme from $\mathcal{O}(\frac{\kappa^6}{K})$ to $\mathcal{O}(\frac{\kappa^4}{K})$, and improves the matrix-vector and gradient complexities from $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ and $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ to $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ and $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$, respectively. To explain intuitively, the hypergradient estimation involves a coupled error $\eta\|y^N_k - y^*(x_k)\|$ induced from solving the linear system $\nabla_y^2 g(x_k, y^N_k)v = \nabla_y f(x_k, y^N_k)$ with stepsize $\eta$. Therefore, a smaller inner-level approximation error $\|y^N_k - y^*(x_k)\|$ allows a more aggressive stepsize $\eta$, and hence yields a faster convergence rate as well as a lower total complexity, as also demonstrated in our experiments. It is worth noting that such a comparison is generally different from that in minimax optimization [27, 41], where alternating (i.e., No-loop) gradient descent ascent (GDA) with $N = 1$ outperforms (N-loop) GDA with $N = \kappa\ln\kappa$, where $N$ denotes the number of ascent iterations for each descent iteration.
To explain the reason, in contrast to minimax optimization, the gradient w.r.t. $x$ in bilevel optimization involves additional second-order derivatives, which are more sensitive to the inner-level approximation error. Therefore, a larger $N$ is more beneficial for bilevel optimization than for minimax optimization. Similarly, if we instead fix $Q = \Theta(\kappa\ln\kappa)$, the N-Q-loop scheme with $N = \kappa\ln\kappa$ (Corollary 3) significantly outperforms the Q-loop scheme with $N = 1$ (Corollary 4) in terms of the convergence rate and complexity.

**Impact of the Q-loop ($Q = 1$ vs. $Q = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$).** We fix $N$, and characterize the impact of the choice of $Q$ on the complexity. For $N = 1$, comparing No-loop with $Q = \Theta(1)$ in Corollary 2 and Q-loop with $Q = \Theta(\kappa\ln\kappa)$ in Corollary 4 shows that both choices of $Q$ yield the same matrix-vector complexity $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$, but Q-loop with a larger $Q$ improves the gradient complexity of No-loop with $Q = \Theta(1)$ from $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$ to $\widetilde{\mathcal{O}}(\kappa^5\epsilon^{-1})$. A similar phenomenon can be observed for $N = \Theta(\kappa\ln\kappa)$ based on the comparison between N-Q-loop in Corollary 3 and N-loop in Corollary 1. Also note that in the setting where the matrix-vector complexity dominates the gradient complexity, e.g., in deep learning, the two choices of $Q$ do not affect the total computational complexity. However, a smaller $Q$ can help reduce the per-iteration load on computational resources and memory, and is hence preferred in practical applications with large models.

**Comparison among four implementations.** By comparing the complexity results in Corollaries 1, 2, 3 and 4, it can be seen that N-Q-loop and N-loop (both with a large $N = \Theta(\kappa\ln\kappa)$) achieve the best matrix-vector complexity $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$, whereas Q-loop and No-loop (both with a smaller $N = 1$) require a higher matrix-vector complexity of $\widetilde{\mathcal{O}}(\kappa^6\epsilon^{-1})$. Also note that N-Q-loop has the lowest gradient complexity. This suggests that the introduction of the inner loop with large $N$ can help to reduce the total computational complexity.

## 5 Convergence Analysis of ITD-BiO

In this section, we first provide a unified theory for ITD-BiO, which is applicable to all choices of $N$, and then specialize the convergence theory to characterize the computational complexity for the two implementations of ITD-BiO: No-loop and N-N-loop. We also provide a convergence lower bound to justify the necessity of choosing a large $N$ to achieve a vanishing convergence error.

The following theorem characterizes the convergence rate of ITD-BiO for all choices of $N$.

**Theorem 3.** Suppose Assumptions 1, 2, 3 and 4 hold. Define $\tau = (\lambda_N + N^2)(1-\alpha\mu)^N + w_N^2$ and $w = \Theta\big(\kappa^2\alpha\mu(1-\alpha\mu)^N\lambda_N + \frac{w_N^2}{\mu^2}\big)$, where $\lambda_N$ and $w_N$ are constants (depending on $\alpha$, $\mu$, $L$ and $N$) whose exact expressions are given in Appendix N. Choose parameters such that $\alpha \le \frac{1}{2L}$, $\beta^2 \le \frac{\alpha\mu}{8w}$, and $\beta L_\Phi + \frac{8}{\alpha\mu}\big(\frac12 + \beta L_\Phi\big)w\beta^2 < \frac14$, where $L_\Phi = \Theta(\kappa^3)$ denotes the smoothness parameter of $\Phi(\cdot)$. Then, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\Big(\frac{\Delta_\Phi}{\mu^2 K} + \frac{(1-\alpha\mu)^{2N}\Delta_y}{\mu^3 K} + \frac{M^2(1-\alpha\mu)^{2N}L^2}{\alpha\mu^3}\Big),$$
where $\Delta_\Phi = \Phi(x_0) - \min_x\Phi(x)$ and $\Delta_y = \|y_0 - y^*(x_0)\|^2$.

The complete version of Theorem 3 with full parameter specifications can be found in Appendix N. In Theorem 3, the upper bound on the convergence rate of ITD-BiO contains a convergent term $\mathcal{O}(\frac{1}{K})$ (which converges to zero sublinearly with $K$) and an error term $\mathcal{O}\big(\frac{M^2(1-\alpha\mu)^{2N}}{\alpha\mu^3}\big)$ (which is independent of $K$, and possibly non-vanishing if $N$ is chosen to be small).
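To get a feel for how fast this error floor decays with $N$, the following snippet evaluates the lower-bound scale $\frac{L^2M^2}{\mu^2}(1-\alpha\mu)^{2N}$ from Theorem 4 below for a few loop sizes. The numeric values of $L$, $\mu$, $M$, $\alpha$ and $\epsilon$ are arbitrary illustrative choices; the floor drops below the target $\epsilon$ only once $N$ reaches the order $\kappa\ln\frac{\kappa}{\epsilon}$, matching the N-N-loop prescription.

```python
import math

L, mu, M = 10.0, 1.0, 1.0            # illustrative constants (kappa = L / mu = 10)
alpha, eps = 1.0 / L, 1e-4
kappa = L / mu

floor = lambda N: (L**2 * M**2 / mu**2) * (1 - alpha * mu) ** (2 * N)   # Theorem 4 scale
for N in [1, 5, 20, 50, int(kappa * math.log(kappa / eps))]:
    print(f"N = {N:4d}  error floor ~ {floor(N):.2e}")
# The floor stays far above eps for constant N and only falls below eps once
# N = Theta(kappa * ln(kappa / eps)); here kappa * ln(kappa / eps) is about 115.
```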
To show that such a possibly non-vanishing error term (when $N$ is chosen to be small) fundamentally exists, we next provide the following lower bound on the convergence rate of ITD-BiO.

**Theorem 4 (Lower Bound).** Consider the ITD-BiO algorithm in Algorithm 2 with $\alpha \le \frac{1}{L}$, $\beta \le \frac{1}{L_\Phi}$ and $N = O(1)$, where $L_\Phi$ is the smoothness parameter of $\Phi(x)$. There exist objective functions $f(x, y)$ and $g(x, y)$ that satisfy Assumptions 1, 2, 3 and 4 such that for all iterates $x_K$ (where $K \ge 1$) generated by ITD-BiO in Algorithm 2,
$$\|\nabla\Phi(x_K)\|^2 \ge \Theta\Big(\frac{L^2M^2}{\mu^2}(1-\alpha\mu)^{2N}\Big).$$

Clearly, the error term in the upper bound given in Theorem 3 matches the lower bound given in Theorem 4 in terms of $\frac{M^2L^2}{\mu^2}(1-\alpha\mu)^{2N}$, and there is still a gap on the order of $\alpha\mu$, which requires future efforts to address. Theorem 3 and Theorem 4 together indicate that in order to achieve an $\epsilon$-accurate stationary point, $N$ has to be chosen as large as $N = \Theta(\kappa\log\frac{\kappa}{\epsilon})$. This corresponds to the N-N-loop implementation of ITD-BiO, where a large $N$ achieves a highly accurate hypergradient estimate in each step. The other, No-loop, implementation chooses a small constant-level $N = \Theta(1)$ to achieve an efficient execution per step, since a large $N$ can cause large memory usage and computation cost. Following from Theorem 3 and Theorem 4, such a No-loop implementation necessarily suffers from a non-vanishing error. In the following corollaries, we further specialize Theorem 3 to obtain the complexity analysis for ITD-BiO under the two aforementioned implementations.

**Corollary 5 (N-N-loop).** Consider N-N-loop ITD-BiO with $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$. Under the same setting as Theorem 3, choose $\beta = \min\big\{\sqrt{\frac{\alpha\mu}{8w}}, \frac{1}{8L_\Phi}\big\}$ and $\alpha = \frac{1}{2L}$. Then, $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^3}{K} + \epsilon\big)$, and the complexity is Gc($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$ and MV($\epsilon$) $= \widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$.

Corollary 5 shows that for a large $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$, we can guarantee that ITD-BiO converges to an $\epsilon$-accurate stationary point, and the gradient and matrix-vector product complexities are both $\widetilde{\mathcal{O}}(\kappa^4\epsilon^{-1})$. We note that [19] also analyzed ITD-BiO with $N = \Theta(\kappa\ln\frac{\kappa}{\epsilon})$, and provided the same complexities as our results in Corollary 5. In comparison, our analysis has several differences. First, [19] assumed that the minimizer $y^*(x_k)$ at the $k$th iteration is bounded, whereas our analysis does not impose this assumption. Second, [19] involved an additional error term $\max_{k=1,\ldots,K}\|y^*(x_k)\|\,\frac{L^2M^2(1-\alpha\mu)^N}{\mu^4}$, which can be very large (or even unbounded) under the standard Assumptions 1, 2, 3 and 4.

We next characterize the convergence for the small $N = \Theta(1)$ regime.

**Corollary 6 (No-loop).** Consider No-loop ITD-BiO with $N = \Theta(1)$. Under the same setting as Theorem 3, choose stepsizes $\alpha = \frac{1}{2NL}$ and $\beta = \min\big\{\sqrt{\frac{\alpha\mu}{8w}}, \frac{1}{8L_\Phi}\big\}$. Then, we have $\frac{1}{K}\sum_{k=0}^{K-1}\|\nabla\Phi(x_k)\|^2 = \mathcal{O}\big(\frac{\kappa^3}{K} + \frac{M^2L^2}{\alpha\mu^3}\big)$.

Corollary 6 indicates that for the constant-level $N = \Theta(1)$, the convergence bound contains a non-vanishing error $\mathcal{O}\big(\frac{M^2L^2}{\alpha\mu^3}\big)$. As shown by the convergence lower bound in Theorem 4, under the standard Assumptions 1, 2, 3 and 4, such an error is unavoidable. Comparison between the above two corollaries suggests that for ITD-BiO, the N-N-loop is necessary to guarantee a vanishing convergence error, whereas No-loop necessarily suffers from a non-vanishing convergence error.

## 6 Empirical Verification

**Experiments on hyperparameter optimization on MNIST.** We first conduct experiments to verify our theoretical results in Corollaries 1, 2, 3 and 4 on AID-BiO with different implementations. We consider the following hyperparameter optimization problem:
$$\min_\lambda\ \mathcal{L}_{\mathcal{D}_{val}}(\lambda) = \frac{1}{|\mathcal{D}_{val}|}\sum_{\xi\in\mathcal{D}_{val}}\mathcal{L}(w^*;\xi) \quad \text{s.t.}\quad w^* = \arg\min_w \frac{1}{|\mathcal{D}_{tr}|}\sum_{\xi\in\mathcal{D}_{tr}}\mathcal{L}(w;\xi) + \frac{\lambda}{2}\|w\|^2,$$

where $\mathcal{D}_{tr}$ and $\mathcal{D}_{val}$ stand for the training and validation datasets, $\mathcal{L}(w;\xi)$ denotes the loss function induced by the model parameter $w$ and sample $\xi$, and $\lambda > 0$ denotes the regularization parameter. The goal is to find a good hyperparameter $\lambda$ that minimizes the validation loss evaluated at the optimal model parameters of the regularized empirical risk minimization problem.

Figure 1: Training & test losses vs. time (seconds) by AID-BiO on MNIST with different $Q$ and $N$ (chosen from $\{1, 20\}$).

From Figure 1, we can make the following observations for AID-BiO. First, the learning curves with $N = 20$ are significantly better than those with $N = 1$, indicating that running multiple steps of gradient descent in the inner loop (i.e., $N > 1$) is crucial for fast convergence. This observation is consistent with our complexity result that N-loop is better than No-loop, and N-Q-loop is better than Q-loop, as shown in Table 1. The reason is that a more accurate hypergradient estimate can accelerate the convergence rate and lead to a reduction in the Jacobian- and Hessian-vector computational complexity. Second, N-Q-loop ($N = 20$, $Q = 20$) and N-loop ($N = 20$, $Q = 1$) achieve comparable convergence performance, and a similar observation can be made for Q-loop ($N = 1$, $Q = 20$) and No-loop ($N = 1$, $Q = 1$). This is also consistent with the complexity result provided in Table 1, where different choices of $Q$ do not affect the dominant matrix-vector complexity.

Figure 2: Training & test losses vs. time (seconds) by AID-BiO on MNIST with different $Q$ and $N$ (chosen from $\{1, 50\}$).

In Figure 2, we plot the training and test losses versus running time for AID-BiO, where we consider the same hyperparameter optimization problem on MNIST as in Figure 1 and choose the loop sizes $Q$ and $N$ from $\{1, 50\}$. Similarly to Figure 1, it can be observed that the empirical results in Figure 2 are also consistent with our theoretical results in Table 1.

Figure 3: Training & test losses vs. time (seconds) by ITD-BiO on MNIST with different $N$.

In Figure 3, we plot the performance of ITD-BiO with different choices of $N$ from $\{1, 20\}$ on the hyperparameter optimization problem on MNIST. Figure 3 illustrates that N-loop ITD-BiO (i.e., $N = 20$) converges to a much smaller loss value than No-loop ITD-BiO (i.e., $N = 1$). This is consistent with our theoretical results in Table 2.

**Experiments on hyper-representation.** We consider a hyper-representation problem in [39], where the inner problem is to find optimal regression parameters $w$ and the outer procedure is to find the best representation parameters $\lambda$. Specifically, the bilevel problem takes the following form:
$$\min_\lambda\ \Phi(\lambda) = \frac{1}{2p}\|h(X_V;\lambda)w^* - Y_V\|^2 \quad \text{s.t.}\quad w^* = \arg\min_w \frac{1}{2q}\|h(X_T;\lambda)w - Y_T\|^2 + \frac{\gamma}{2}\|w\|^2,$$
where $X_T\in\mathbb{R}^{q\times m}$ and $X_V\in\mathbb{R}^{p\times m}$ are synthesized training and validation data, $Y_T\in\mathbb{R}^q$ and $Y_V\in\mathbb{R}^p$ are their response vectors, and $h(\cdot)$ is a linear transformation. The generation of $X_T$, $X_V$, $Y_T$, $Y_V$ and the experimental setup follow [39]. For ITD-BiO, we choose $N = 20$ for N-N-loop ITD and $N = 1$ for No-loop ITD. The results are reported with the best-tuned hyperparameters.
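For readers who want to reproduce a small-scale version of this bilevel structure, the sketch below generates synthetic data and evaluates the outer objective $\Phi(\lambda)$ with the inner ridge-regression problem solved in closed form. The concrete linear form $h(X;\lambda) = X\Lambda$ with $\Lambda = \mathrm{reshape}(\lambda)$, the dimensions, and the value of $\gamma$ are our own illustrative assumptions rather than the setup of [39], and an ITD- or AID-BiO run would replace the closed-form solve by $N$ inner GD steps.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, q, p, gamma = 10, 6, 80, 40, 0.1           # feature dims and sample sizes (illustrative)
X_T, X_V = rng.normal(size=(q, m)), rng.normal(size=(p, m))
w_true, Lam_true = rng.normal(size=d), rng.normal(size=(m, d))
Y_T = X_T @ Lam_true @ w_true + 0.01 * rng.normal(size=q)
Y_V = X_V @ Lam_true @ w_true + 0.01 * rng.normal(size=p)

def h(X, lam):                                    # assumed linear representation h(X; lambda) = X @ Lambda
    return X @ lam.reshape(m, d)

def inner_solution(lam):                          # ridge regression: closed-form inner minimizer w*(lambda)
    Z = h(X_T, lam)
    return np.linalg.solve(Z.T @ Z / q + gamma * np.eye(d), Z.T @ Y_T / q)

def outer_objective(lam):                         # Phi(lambda) = (1/2p) ||h(X_V; lambda) w*(lambda) - Y_V||^2
    w_star = inner_solution(lam)
    r = h(X_V, lam) @ w_star - Y_V
    return 0.5 * r @ r / p

print(outer_objective(Lam_true.ravel()))          # small loss at the generating representation
print(outer_objective(rng.normal(size=m * d)))    # larger loss at a random representation
```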
Table 3: Validation loss vs. the number of iterations $k$ for ITD-based algorithms.

| Algorithm | $k = 10$ | $k = 50$ | $k = 100$ | $k = 500$ | $k = 1000$ |
| --- | --- | --- | --- | --- | --- |
| N-N-loop ITD | 9.32 | 0.11 | 0.01 | 0.004 | 0.004 |
| No-loop ITD | 435 | 6.9 | 0.04 | 0.04 | 0.04 |

Table 3 indicates that N-N-loop with $N = 20$ achieves a small loss value of 0.004 after 500 total iterations, whereas No-loop with $N = 1$ converges to a much larger loss value of 0.04. This is consistent with our theoretical results in Table 2, where $N = 1$ can cause a non-vanishing error. We also conduct the experiment for AID-BiO, where we choose $N$ and $Q$ from $\{1, 20\}$ for the four different loop implementations. We present the results for AID-BiO in Appendix F, which also support our theoretical results in Table 1.

## 7 Conclusion

In this paper, we study two popular bilevel optimizers, AID-BiO and ITD-BiO, whose implementations potentially involve additional loops of iterations within their base-loop update. By developing a unified convergence analysis for all choices of the loop parameters, we are able to provide a formal comparison among different implementations. Our results suggest that N-loops are beneficial for better computational efficiency for AID-BiO and for better convergence accuracy for ITD-BiO. This is in contrast to conventional minimax optimization, where the No-loop (i.e., single-base-loop) scheme achieves better computational efficiency. Our analysis techniques can be useful for studying other bilevel optimizers such as stochastic optimizers and variance-reduced optimizers.

## Acknowledgements

The work of Kaiyi Ji and Lei Ying is supported in part by NSF under grants 2002608, 2001687, 2112471, 2134081, and 2207548. The work of Yingbin Liang was supported in part by the U.S. National Science Foundation under the grants CCF-1909291, DMS-2134145, and CNS-2112471.

## References

[1] L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations (ICLR), 2018.
[2] T. Chen, Y. Sun, and W. Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems (NeurIPS), 34:25294-25307, 2021.
[3] T. Chen, Y. Sun, and W. Yin. A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671, 2021.
[4] J. Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics (AISTATS), pages 318-326, 2012.
[5] M. Feurer and F. Hutter. Hyperparameter optimization. In Automated Machine Learning, pages 3-33. Springer, Cham, 2019.
[6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning (ICML), pages 1126-1135, 2017.
[7] R. Flamary, A. Rakotomamonjy, and G. Gasso. Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103, 2014.
[8] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning (ICML), pages 1165-1173, 2017.
[9] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pages 1568-1577, 2018.
[10] S. Ghadimi and M. Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
[11] R. Grazzi, L.
Franceschi, M. Pontil, and S. Salzo. On the iteration complexity of hypergradient computation. In Proc. International Conference on Machine Learning (ICML), 2020.
[12] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang. On stochastic moving-average estimators for non-convex optimization. arXiv preprint arXiv:2104.14840, 2021.
[13] Z. Guo and T. Yang. Randomized stochastic variance-reduced methods for stochastic bilevel optimization. arXiv preprint arXiv:2105.02266, 2021.
[14] P. Hansen, B. Jaumard, and G. Savard. New branch-and-bound rules for linear bilevel programming. SIAM Journal on Scientific and Statistical Computing, 13(5):1194-1217, 1992.
[15] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.
[16] M. Huang, K. Ji, S. Ma, and L. Lai. Efficiently escaping saddle points in bilevel optimization. arXiv preprint arXiv:2202.03684, 2022.
[17] K. Ji, J. D. Lee, Y. Liang, and H. V. Poor. Convergence of meta-learning with task-specific adaptation over partial parameters. arXiv preprint arXiv:2006.09486, 2020.
[18] K. Ji and Y. Liang. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint arXiv:2102.03926, 2021.
[19] K. Ji, J. Yang, and Y. Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning (ICML), pages 4882-4892. PMLR, 2021.
[20] K. Ji, J. Yang, and Y. Liang. Theoretical convergence of multi-step model-agnostic meta-learning. Journal of Machine Learning Research (JMLR), 23(29), 2022.
[21] P. Khanduri, S. Zeng, M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. arXiv preprint arXiv:2102.07367, 2021.
[22] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 1008-1014, 2000.
[23] G. Kunapuli, K. P. Bennett, J. Hu, and J.-S. Pang. Classification model selection via bilevel programming. Optimization Methods & Software, 23(4):475-489, 2008.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[25] J. Li, B. Gu, and H. Huang. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690, 2020.
[26] J. Li, B. Gu, and H. Huang. A fully single loop algorithm for bilevel optimization without Hessian inverse. arXiv preprint arXiv:2112.04660, 2021.
[27] T. Lin, C. Jin, and M. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning (ICML), pages 6083-6093. PMLR, 2020.
[28] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang. A value-function-based interior-point method for non-convex bi-level optimization. In International Conference on Machine Learning (ICML), 2021.
[29] R. Liu, Y. Liu, S. Zeng, and J. Zhang. Towards gradient-based bilevel optimization with nonconvex followers and beyond. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
[30] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning (ICML), 2020.
[31] D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning.
In International Conference on Machine Learning (ICML), pages 2113-2122, 2015.
[32] G. M. Moore. Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute, 2010.
[33] F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning (ICML), pages 737-746, 2016.
[34] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 113-124, 2019.
[35] A. Shaban, C.-A. Cheng, N. Hatch, and B. Boots. Truncated back-propagation for bilevel optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1723-1732, 2019.
[36] C. Shi, J. Lu, and G. Zhang. An extended Kuhn-Tucker approach for linear bilevel programming. Applied Mathematics and Computation, 162(1):51-63, 2005.
[37] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[38] D. Sow, K. Ji, Z. Guan, and Y. Liang. A constrained optimization approach to bilevel optimization with multiple inner minima. arXiv preprint arXiv:2203.01123, 2022.
[39] D. Sow, K. Ji, and Y. Liang. ES-based Jacobian enables faster bilevel optimization. arXiv preprint arXiv:2110.07004, 2021.
[40] J. Yang, K. Ji, and Y. Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
[41] J. Zhang, P. Xiao, R. Sun, and Z. Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. Advances in Neural Information Processing Systems (NeurIPS), 33:7377-7389, 2020.
[42] M. Zhang, S. W. Su, S. Pan, X. Chang, E. M. Abbasnejad, and R. Haffari. iDARTS: Differentiable architecture search with stochastic implicit gradients. In International Conference on Machine Learning (ICML), pages 12557-12566. PMLR, 2021.