# Enhanced Bilevel Optimization via Bregman Distance

Feihu Huang¹,², Junyi Li¹, Shangqian Gao¹, Heng Huang¹
¹ Electrical & Computer Engineering, University of Pittsburgh, Pittsburgh, PA, United States
² College of Computer Science & Technology, Nanjing University of Aeronautics & Astronautics, Nanjing, China
huangfeihu2018@gmail.com, junyili.ai@gmail.com, shg84@pitt.edu, heng.huang@pitt.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** Bilevel optimization has recently been used in many machine learning problems such as hyperparameter optimization, policy optimization, and meta learning. Although many bilevel optimization methods have been proposed, they still suffer from high computational complexities and do not consider the more general bilevel problems with nonsmooth regularization. In this paper, we therefore propose a class of enhanced bilevel optimization methods that use Bregman distances to solve bilevel optimization problems in which the outer subproblem is nonconvex and possibly nonsmooth and the inner subproblem is strongly convex. Specifically, we propose a bilevel optimization method based on Bregman distance (BiO-BreD) to solve deterministic bilevel problems, which achieves a lower computational complexity than the best known results. We also propose a stochastic bilevel optimization method (SBiO-BreD) to solve stochastic bilevel problems based on stochastic approximated gradients and Bregman distance. Moreover, we propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) using a variance-reduced technique, which achieves a lower computational complexity than the best known results with respect to the condition number κ and the target accuracy ϵ for finding an ϵ-stationary point. We conduct a data hyper-cleaning task and a hyper-representation learning task to demonstrate that our new algorithms outperform related bilevel optimization approaches.

1 Introduction

Bilevel optimization can effectively solve problems with a hierarchical structure, so it has recently been widely used in many machine learning tasks such as hyper-parameter optimization [37, 20, 9, 38], meta learning [9, 31, 22], neural network architecture search [30], reinforcement learning [15], and image processing [31]. In this paper, we consider solving the following nonconvex-strongly-convex bilevel optimization problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{d_1}} \; f(x, y^*(x)) + h(x), \quad \text{(Outer)} \qquad (1)$$
$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_2}} \; g(x, y), \quad \text{(Inner)}$$

where the function $F(x) = f(x, y^*(x)) : \mathcal{X} \rightarrow \mathbb{R}$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the function $g(x, y) : \mathcal{X} \times \mathbb{R}^{d_2} \rightarrow \mathbb{R}$ is µ-strongly convex in $y \in \mathbb{R}^{d_2}$. The constraint set $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ is compact and convex. Problem (1) covers a rich class of nonconvex objective functions with nonsmooth regularization and is more general than the existing nonconvex bilevel optimization formulations in [11, 22], which do not consider any nonsmooth regularization. Here the function $h(x)$ can be a nonsmooth regularization term such as $h(x) = \lambda \|x\|_1$.
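As a simple concrete instance (our illustrative example, not one taken from the paper), consider choosing a ridge penalty on a log scale: with $x$ the log-penalty and $y$ the regression weights, the inner problem is uniformly strongly convex over a compact $\mathcal{X}$ and $y^*(x)$ has a closed form, so $F(x) = f(x, y^*(x))$ is well defined and smooth:

```latex
% A toy instance of Problem (1) (our illustration): ridge-regression
% hyperparameter selection. On a compact interval X, the inner objective
% is (2 e^{min X})-strongly convex in y, and y*(x) is available in closed form.
\begin{align*}
  g(x, y) &= \|A_{\mathrm{tr}}\, y - b_{\mathrm{tr}}\|^2 + e^{x}\,\|y\|^2,
      \qquad y^*(x) = \big(A_{\mathrm{tr}}^{\top}A_{\mathrm{tr}} + e^{x} I\big)^{-1} A_{\mathrm{tr}}^{\top} b_{\mathrm{tr}},\\
  F(x) &= f\big(x, y^*(x)\big) = \|A_{\mathrm{val}}\, y^*(x) - b_{\mathrm{val}}\|^2,
      \qquad h(x) = \lambda\, |x| .
\end{align*}
```

In realistic instances $y^*(x)$ has no closed form, which is exactly the difficulty the methods below address.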
Table 1: Comparison of representative bilevel optimization algorithms for finding an ϵ-stationary point of the deterministic nonconvex-strongly-convex Problem (1) with or without $h(x)$, i.e., $\|\nabla F(x)\|^2 \le \epsilon$ or its equivalent variants. $G_c(f,\epsilon)$ and $G_c(g,\epsilon)$ denote the number of gradient evaluations w.r.t. $f(x,y)$ and $g(x,y)$; $JV(g,\epsilon)$ denotes the number of Jacobian-vector products; $HV(g,\epsilon)$ is the number of Hessian-vector products; $\kappa = L/\mu$ is the condition number. ✓ means that the algorithm solves both the smooth and the nonsmooth bilevel optimization problems.

| Algorithm | Reference | $G_c(f,\epsilon)$ | $G_c(g,\epsilon)$ | $JV(g,\epsilon)$ | $HV(g,\epsilon)$ | Nonsmooth |
|---|---|---|---|---|---|---|
| AID-BiO | [11] | $O(\kappa^4\epsilon^{-1})$ | $O(\kappa^5\epsilon^{-5/4})$ | $O(\kappa^4\epsilon^{-1})$ | $O(\kappa^{4.5}\epsilon^{-1})$ | ✗ |
| AID-BiO | [22] | $O(\kappa^3\epsilon^{-1})$ | $O(\kappa^4\epsilon^{-1})$ | $O(\kappa^3\epsilon^{-1})$ | $O(\kappa^{3.5}\epsilon^{-1})$ | ✗ |
| ITD-BiO | [22] | $O(\kappa^3\epsilon^{-1})$ | $O(\kappa^4\epsilon^{-1})$ | $O(\kappa^4\epsilon^{-1})$ | $O(\kappa^4\epsilon^{-1})$ | ✗ |
| BiO-BreD | Ours | $O(\kappa^2\epsilon^{-1})$ | $O(\kappa^3\epsilon^{-1})$ | $O(\kappa^3\epsilon^{-1})$ | $O(\kappa^3\epsilon^{-1})$ | ✓ |

Table 2: Comparison of representative bilevel optimization algorithms for finding an ϵ-stationary point of the stochastic nonconvex-strongly-convex Problem (2) with or without $h(x)$, i.e., $\mathbb{E}\|\nabla F(x)\|^2 \le \epsilon$ or its equivalent variants. Since some algorithms do not provide the explicit dependence on κ, we write it as $p(\kappa)$.

| Algorithm | Reference | $G_c(f,\epsilon)$ | $G_c(g,\epsilon)$ | $JV(g,\epsilon)$ | $HV(g,\epsilon)$ | Nonsmooth |
|---|---|---|---|---|---|---|
| TTSA | [15] | $O(p(\kappa)\epsilon^{-2.5})$ | $O(p(\kappa)\epsilon^{-2.5})$ | $O(p(\kappa)\epsilon^{-2.5})$ | $O(p(\kappa)\epsilon^{-2.5})$ | ✗ |
| STABLE | [5] | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | ✗ |
| SMB | [13] | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | $O(p(\kappa)\epsilon^{-2})$ | ✗ |
| VRBO | [41] | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | ✗ |
| SUSTAIN | [23] | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | ✗ |
| VR-saBiAdam | [18] | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | $O(p(\kappa)\epsilon^{-1.5})$ | ✗ |
| BSA | [11] | $O(\kappa^6\epsilon^{-2})$ | $O(\kappa^9\epsilon^{-3})$ | $O(\kappa^6\epsilon^{-2})$ | $O(\kappa^6\epsilon^{-2})$ | ✗ |
| stocBiO | [22] | $O(\kappa^5\epsilon^{-2})$ | $O(\kappa^9\epsilon^{-2})$ | $O(\kappa^5\epsilon^{-2})$ | $O(\kappa^6\epsilon^{-2})$ | ✗ |
| SBiO-BreD | Ours | $O(\kappa^5\epsilon^{-2})$ | $O(\kappa^5\epsilon^{-2})$ | $O(\kappa^5\epsilon^{-2})$ | $O(\kappa^6\epsilon^{-2})$ | ✓ |
| ASBiO-BreD | Ours | $O(\kappa^5\epsilon^{-1.5})$ | $O(\kappa^5\epsilon^{-1.5})$ | $O(\kappa^5\epsilon^{-1.5})$ | $O(\kappa^6\epsilon^{-1.5})$ | ✓ |

Many recent machine learning problems use stochastic loss functions, so we also consider the following stochastic bilevel optimization problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{d_1}} \; \mathbb{E}_{\xi \sim \mathcal{D}}\big[f(x, y^*(x); \xi)\big] + h(x), \quad \text{(Outer)} \qquad (2)$$
$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_2}} \; \mathbb{E}_{\zeta \sim \mathcal{D}'}\big[g(x, y; \zeta)\big], \quad \text{(Inner)}$$

where the function $F(x) = \mathbb{E}_\xi[F(x;\xi)] = \mathbb{E}_\xi[f(x, y^*(x);\xi)]$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the function $g(x,y) = \mathbb{E}_\zeta[g(x,y;\zeta)] : \mathcal{X} \times \mathbb{R}^{d_2} \rightarrow \mathbb{R}$ is µ-strongly convex in $y \in \mathbb{R}^{d_2}$. Here ξ and ζ are random variables following unknown distributions $\mathcal{D}$ and $\mathcal{D}'$, respectively.

Both Problem (1) and Problem (2) arise in many machine learning tasks with a hierarchical structure, such as hyper-parameter optimization and meta-learning [9, 22] and neural network architecture search [30]. Many bilevel optimization methods have recently been developed to solve these problems. For example, [11, 22] introduced a class of effective methods to solve the deterministic Problem (1) and the stochastic Problem (2) with $h(x) = 0$; however, these methods suffer from high computational complexity. More recently, multiple accelerated methods were designed for the stochastic Problem (2) with $h(x) = 0$. Specifically, [5, 23, 14, 41] proposed accelerated bilevel optimization algorithms using the variance-reduced techniques of SARAH/SPIDER/SNVRG [36, 8, 40, 43] and STORM [6]. However, these accelerated methods obtain a lower computational complexity without accounting for the condition number, which is also an important part of the computational complexity (see Tables 1 and 2). Meanwhile, they only focus on the special case of the stochastic bilevel optimization Problem (2) with $h(x) = 0$. To fill these gaps, in this paper we propose a class of efficient bilevel optimization methods with lower computational complexity to solve the bilevel optimization Problems (1) and (2), where the outer subproblem is nonconvex and possibly nonsmooth and the inner subproblem is strongly convex.
Specifically, we use the mirror descent iteration to update the variable $x$ based on the Bregman distance. Our main contributions are summarized as follows:

(i) We propose a class of enhanced bilevel optimization methods based on Bregman distances to solve nonconvex-strongly-convex bilevel optimization problems, and we provide a comprehensive convergence analysis framework for the proposed methods.

(ii) We propose an efficient bilevel optimization method based on Bregman distances (BiO-BreD) to solve the deterministic bilevel Problem (1). We prove that BiO-BreD achieves a lower sample complexity than the best known results (see Table 1).

(iii) We introduce an efficient stochastic bilevel optimization method based on adaptive Bregman distances (SBiO-BreD) to solve the stochastic bilevel Problem (2). Moreover, we design an accelerated version of the SBiO-BreD algorithm (ASBiO-BreD) using a variance-reduced technique, which achieves a lower sample complexity than the best known results (see Table 2).

Note that our methods can solve constrained bilevel optimization problems with nonsmooth regularization but do not rely on any specific form of the constraint set or the nonsmooth regularizer. In other words, our methods can also solve the unconstrained bilevel optimization problems without nonsmooth regularization studied in [11, 22]. Naturally, our convergence analysis applies both to constrained bilevel optimization with nonsmooth regularization and to unconstrained bilevel optimization without nonsmooth regularization.

2 Related Works

In this section, we review existing bilevel optimization algorithms and Bregman distance based methods.

2.1 Bilevel Optimization Methods

Bilevel optimization has recently attracted increasing interest in many machine learning applications such as model-agnostic meta-learning, neural network architecture search, and policy optimization. Accordingly, many algorithms [9, 11, 15, 34, 35, 22, 28] have been proposed to solve bilevel optimization problems. Specifically, [11] proposed a class of approximation methods for bilevel optimization and studied the convergence properties of these methods under convexity assumptions. [34, 35] developed gradient-based descent aggregation methods for convex bilevel optimization. [37] presented a nonlinear primal-dual algorithm for nonsmooth convex bilevel optimization in parameter learning problems. In parallel, [15] introduced a two-timescale stochastic algorithm framework for nonconvex stochastic bilevel optimization in reinforcement learning. Multiple accelerated bilevel approximation methods were developed later. Specifically, [22] proposed faster bilevel optimization methods based on approximated implicit differentiation (AID) and iterative differentiation (ITD), respectively. [5, 23, 14, 41] presented several accelerated methods for stochastic bilevel problems using variance-reduced techniques. More recently, [18] proposed a class of efficient adaptive methods for nonconvex-strongly-convex bilevel optimization problems. At the same time, lower bounds for these nonconvex-strongly-convex bilevel optimization problems were studied in [21]. In addition, [34, 27, 32, 33] designed a class of value-function-based and gradient-based methods for nonconvex bilevel optimization problems and studied their asymptotic convergence properties.
[38] analyzed a class of special nonconvex nonsmooth bilevel optimization methods for selecting the best hyperparameter value for the nonsmooth $\ell_p$ regularization with $0 < p \le 1$.

2.2 Bregman Distance-Based Methods

The Bregman distance-based method (a.k.a. the mirror descent method) [4, 1] is a powerful optimization tool because it uses the Bregman distance to fit the geometry of the optimization problem. The Bregman distance was first proposed in [2] and later extended in [3]. [4] introduced the first proximal minimization algorithm with Bregman functions. [1] studied mirror descent for convex optimization. [7] presented an effective variant of mirror descent, i.e., composite objective mirror descent, for regularized convex optimization. Subsequently, [42] studied the convergence properties of the mirror descent algorithm for solving nonsmooth nonconvex problems. [26] integrated the variance-reduced technique into the mirror descent algorithm for stochastic convex optimization. A variance-reduced adaptive stochastic mirror descent algorithm [29] was proposed to solve nonsmooth nonconvex finite-sum optimization. More recently, [16] studied Bregman gradient methods for policy optimization.

3 Preliminaries

3.1 Notations

Let $I_d$ denote a $d$-dimensional identity matrix. $U\{1, 2, \dots, K\}$ denotes the uniform distribution over the discrete set $\{1, 2, \dots, K\}$. $\|\cdot\|$ denotes the $\ell_2$-norm for vectors and the spectral norm for matrices. For two vectors $x$ and $y$, $\langle x, y\rangle$ denotes their inner product. $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$ are the partial derivatives w.r.t. the variables $x$ and $y$. Given mini-batch samples $\mathcal{B} = \{\xi_i\}_{i=1}^b$, we define $\nabla f(x; \mathcal{B}) = \frac{1}{b}\sum_{i=1}^b \nabla f(x; \xi_i)$. For two sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ denotes that $a_n \le C b_n$ for some constant $C > 0$, and the notation $\tilde{O}(\cdot)$ hides logarithmic terms. Given a closed convex set $\mathcal{X}$, we define the projection operation $P_{\mathcal{X}}(x_0) = \arg\min_{x \in \mathcal{X}} \|x - x_0\|^2$. $\partial h(x)$ is the subgradient set of the function $h(x)$.

3.2 Some Mild Assumptions

Assumption 1. The function $F(x) = f(x, y^*(x))$ is possibly nonconvex w.r.t. $x$, and the function $g(x,y)$ is µ-strongly convex w.r.t. $y$. For the stochastic case, the same assumptions hold for $f(x, y^*(x); \xi)$ and $g(x, y; \zeta)$, respectively.

Assumption 2. The functions $f(x,y)$ and $g(x,y)$ satisfy: 1) $\|\nabla_y f(x,y)\| \le C_{fy}$ and $\|\nabla^2_{xy} g(x,y)\| \le C_{gxy}$ for any $x \in \mathcal{X}$ and $y \in \mathbb{R}^{d_2}$; 2) the partial derivatives $\nabla_x f(x,y)$, $\nabla_y f(x,y)$, $\nabla_x g(x,y)$ and $\nabla_y g(x,y)$ are $L$-Lipschitz, e.g., for $x, x_1, x_2 \in \mathcal{X}$ and $y, y_1, y_2 \in \mathbb{R}^{d_2}$,
$$\|\nabla_x f(x_1, y) - \nabla_x f(x_2, y)\| \le L\|x_1 - x_2\|, \qquad \|\nabla_x f(x, y_1) - \nabla_x f(x, y_2)\| \le L\|y_1 - y_2\|.$$
For the stochastic case, the same assumptions hold for $f(x,y;\xi)$ and $g(x,y;\zeta)$ for any ξ and ζ.

Assumption 3. The partial derivatives $\nabla^2_{xy} g(x,y)$ and $\nabla^2_{yy} g(x,y)$ are $L_{gxy}$-Lipschitz and $L_{gyy}$-Lipschitz, e.g., for all $x, x_1, x_2 \in \mathcal{X}$ and $y, y_1, y_2 \in \mathbb{R}^{d_2}$,
$$\|\nabla^2_{xy} g(x_1,y) - \nabla^2_{xy} g(x_2,y)\| \le L_{gxy}\|x_1 - x_2\|, \qquad \|\nabla^2_{xy} g(x,y_1) - \nabla^2_{xy} g(x,y_2)\| \le L_{gxy}\|y_1 - y_2\|.$$
For the stochastic case, the same assumptions hold for $\nabla^2_{xy} g(x,y;\zeta)$ and $\nabla^2_{yy} g(x,y;\zeta)$ for any ζ.

Assumption 4. The function $h(x)$ is convex but possibly nonsmooth for any $x \in \mathcal{X}$.

Assumption 5. The function $\Phi(x) = F(x) + h(x)$ is bounded below, i.e., $\Phi^* = \inf_{x \in \mathcal{X}} \Phi(x) > -\infty$.

Assumptions 1-3 are commonly used in bilevel optimization methods [11, 22, 23]. By the mean value theorem and Assumption 2, $f(x,y_1) - f(x,y_2) = \nabla_y f(x, y_\tau)^T (y_1 - y_2) \le \|\nabla_y f(x,y_\tau)\|\,\|y_1 - y_2\| \le C_{fy}\|y_1 - y_2\|$, where $y_\tau = \tau y_1 + (1-\tau)y_2$ with $\tau \in [0,1]$. Thus $\|\nabla_y f(x,y)\| \le C_{fy}$ is similar to the assumption that the function $f$ is $M$-Lipschitz in [22]; indeed, the proofs in [22] still use the norm-bounded partial derivative $\|\nabla_y f(x,y)\| \le M$. Similarly, according to Assumption 2, we have $\|\nabla_y g(x_1,y) - \nabla_y g(x_2,y)\| \le L\|x_1 - x_2\|$. Since $\nabla_y g(x_1,y) - \nabla_y g(x_2,y) = \nabla^2_{xy} g(x_\tau, y)^T (x_1 - x_2) \le \|\nabla^2_{xy} g(x_\tau,y)\|\,\|x_1 - x_2\| \le C_{gxy}\|x_1 - x_2\|$, where $x_\tau = \tau' x_1 + (1-\tau')x_2$ with $\tau' \in [0,1]$, we can let $C_{gxy} = L$ as in [22]; the proofs in [22] likewise use the norm-bounded partial derivative $\|\nabla^2_{xy} g(x,y)\| \le L$ for all $x, y$. Throughout the paper, we let $C_{gxy} = L$. Assumption 4 covers common regularizers such as $h(x) = \|x\|_1$. Assumption 5 ensures the feasibility of Problems (1) and (2).

When we use first-order methods to solve the above bilevel optimization Problems (1) and (2), we can easily obtain the partial (stochastic) derivative $\nabla_y g(x,y)$ or $\nabla_y g(x,y;\zeta)$ to update the variable $y$. However, it is hard to get the (stochastic) gradient $\nabla F(x) = \frac{\partial f(x, y^*(x))}{\partial x}$ or $\nabla F(x;\xi) = \frac{\partial f(x, y^*(x);\xi)}{\partial x}$ when there is no closed-form solution of the inner problem of Problems (1) and (2). Thus, a key point in solving Problems (1) and (2) is to estimate the gradient $\nabla F(x)$. The following lemma provides one such estimator.

Lemma 1. (Lemma 2.1 in [11]) Under Assumptions 1, 2 and 3, we have, for any $x \in \mathcal{X}$,
$$\nabla F(x) = \nabla_x f(x, y^*(x)) + \nabla y^*(x)^T \nabla_y f(x, y^*(x)) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x))\big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla_y f(x, y^*(x)). \qquad (3)$$

Lemma 1 suggests a natural estimator of $\nabla F(x)$, defined for all $x \in \mathcal{X}$, $y \in \mathbb{R}^{d_2}$ as
$$\bar{\nabla} f(x,y) = \nabla_x f(x,y) - \nabla^2_{xy} g(x,y)\big[\nabla^2_{yy} g(x,y)\big]^{-1} \nabla_y f(x,y). \qquad (4)$$
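To make (4) concrete, the following minimal NumPy sketch (our illustration; function and variable names are ours) evaluates $\bar{\nabla} f(x,y)$ from explicit derivatives by solving one linear system instead of forming the Hessian inverse:

```python
import numpy as np

def hypergrad_estimate(grad_x_f, grad_y_f, jac_xy_g, hess_yy_g):
    """Evaluate the estimator (4) (a dense sketch, our naming):
        bar_grad_f = grad_x_f - jac_xy_g @ inv(hess_yy_g) @ grad_y_f,
    where grad_x_f (d1,), grad_y_f (d2,), jac_xy_g (d1, d2) and
    hess_yy_g (d2, d2) are all evaluated at the same point (x, y).
    hess_yy_g is positive definite by the strong convexity of g in y.
    Large-scale implementations replace the dense solve with
    Hessian-vector products (conjugate gradient or Eq. (6) below)."""
    v = np.linalg.solve(hess_yy_g, grad_y_f)   # v = [grad2_yy g]^{-1} grad_y f
    return grad_x_f - jac_xy_g @ v
```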
Next, we give some properties of $\nabla F(x)$, $y^*(x)$ and $\bar{\nabla} f(x,y)$:

Lemma 2. (Lemma 2.2 in [11]) Under Assumptions 1, 2 and 3, for all $x, x_1, x_2 \in \mathcal{X}$ and $y \in \mathbb{R}^{d_2}$, we have
$$\|\bar{\nabla} f(x,y) - \nabla F(x)\| \le L_y \|y^*(x) - y\|, \qquad \|y^*(x_1) - y^*(x_2)\| \le \kappa \|x_1 - x_2\|, \qquad \|\nabla F(x_1) - \nabla F(x_2)\| \le L_F \|x_1 - x_2\|,$$
where $L_y = L + \frac{L^2}{\mu} + \frac{C_{fy} L_{gxy}}{\mu} + \frac{L_{gyy} C_{fy} L}{\mu^2}$, $\kappa = \frac{L}{\mu}$, and $L_F = L + \frac{2L^2 + L_{gxy} C_{fy}^2}{\mu} + \frac{L_{gyy} C_{fy} L + L^3 + L_{gxy} C_{fy} L}{\mu^2} + \frac{L_{gyy} C_{fy} L^2}{\mu^3}$.

4 Bilevel Optimization via Bregman Distance Methods

In this section, we propose a class of enhanced bilevel optimization methods based on Bregman distances to solve the deterministic Problem (1) and the stochastic Problem (2), respectively.

4.1 Deterministic BiO-BreD Algorithm

In this subsection, we propose an efficient deterministic bilevel optimization method via Bregman distances (BiO-BreD) to solve the deterministic Problem (1). Algorithm 1 summarizes the algorithmic framework of our BiO-BreD method.

Algorithm 1 Deterministic BiO-BreD Algorithm
1: Input: $T$, $K \ge 1$, learning rates $\gamma > 0$, $\lambda > 0$;
2: initialize: $x_0 \in \mathcal{X}$ and $y^K_{-1} = y^0 \in \mathbb{R}^{d_2}$;
3: for $t = 0, 1, \dots, T-1$ do
4:   Let $y^0_t = y^K_{t-1}$;
5:   for $k = 1, \dots, K$ do
6:     Update $y^k_t = y^{k-1}_t - \lambda \nabla_y g(x_t, y^{k-1}_t)$;
7:   end for
8:   Compute the partial derivative $w_t = \frac{\partial f(x_t, y^K_t)}{\partial x_t}$ via backpropagation w.r.t. $x_t$;
9:   Given a ρ-strongly convex mirror function $\psi_t$;
10:  Update $x_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{\langle w_t, x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t)\big\}$;
11: end for
12: Output: $(x_t, y_t)$ chosen uniformly at random from $\{x_t, y_t\}_{t=1}^T$.

Given a ρ-strongly convex and continuously differentiable function $\psi(x)$, i.e., $\langle x_1 - x_2, \nabla\psi(x_1) - \nabla\psi(x_2)\rangle \ge \rho \|x_1 - x_2\|^2$, we define a Bregman distance [3, 4] for any $x_1, x_2 \in \mathcal{X}$:
$$D_\psi(x_1, x_2) = \psi(x_1) - \psi(x_2) - \langle \nabla\psi(x_2), x_1 - x_2\rangle.$$

In Algorithm 1, we use the mirror descent iteration to update the variable $x$ at the $(t+1)$-th step:
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{\langle w_t, x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t)\Big\}, \qquad (5)$$
where $\gamma > 0$ is a stepsize and $w_t$ is an estimator of $\nabla F(x_t)$. Here the mirror function $\psi_t$ can change as the algorithm runs. Let $\psi_t(x) = \frac{1}{2}\|x\|^2$; then $D_{\psi_t}(x, x_t) = \frac{1}{2}\|x - x_t\|^2$. When $\mathcal{X} = \mathbb{R}^{d_1}$, the subproblem (5) is equivalent to a proximal gradient descent step; when $\mathcal{X} \subset \mathbb{R}^{d_1}$ and $h(x) = 0$, it is equivalent to a projected gradient descent step. Let $\psi_t(x) = \frac{1}{2}x^T H_t x$; then $D_{\psi_t}(x, x_t) = \frac{1}{2}(x - x_t)^T H_t (x - x_t)$. When $H_t$ is an approximated Hessian matrix, the subproblem (5) is equivalent to a proximal quasi-Newton descent step; when $H_t$ is an adaptive matrix as used in [19], it is equivalent to a proximal adaptive gradient descent step.
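For the common special case $\psi_t(x) = \frac{1}{2}x^T H_t x$ with a diagonal $H_t$, $h(x) = \alpha\|x\|_1$ and $\mathcal{X} = \mathbb{R}^{d_1}$, the subproblem (5) has a coordinate-wise closed form; the sketch below (our illustration, with hypothetical names) implements it by soft-thresholding:

```python
import numpy as np

def bregman_prox_step(x_t, w_t, h_diag, gamma, alpha):
    """One Bregman/mirror step (5) in the special case (our sketch):
    psi_t(x) = 0.5 * x^T H_t x with diagonal H_t = diag(h_diag) > 0,
    h(x) = alpha * ||x||_1 and X = R^{d1}. The subproblem
        argmin_x <w_t, x> + alpha*||x||_1
                 + (1/(2*gamma)) * (x - x_t)^T H_t (x - x_t)
    separates over coordinates and is solved by soft-thresholding."""
    u = x_t - gamma * w_t / h_diag          # minimizer of the smooth part
    thresh = gamma * alpha / h_diag         # coordinate-wise l1 threshold
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```

With `h_diag` built from an exponential moving average of squared gradients (as in [19]), this step is exactly the proximal adaptive gradient descent case mentioned above; with `h_diag` identically one, it reduces to the plain proximal gradient step.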
In Algorithm 1, we use the gradient estimator $w_t = \frac{\partial f(x_t, y^K_t)}{\partial x_t}$ to estimate $\nabla F(x_t)$, where this partial derivative is obtained by backpropagation w.r.t. $x_t$.

4.2 SBiO-BreD Algorithm

In this subsection, we introduce an efficient stochastic bilevel optimization method via Bregman distances (SBiO-BreD) to solve the stochastic bilevel optimization Problem (2). Algorithm 2 describes the algorithmic framework of our SBiO-BreD method.

Algorithm 2 Stochastic BiO-BreD (SBiO-BreD) Algorithm
1: Input: $T$, $K \ge 1$, stepsizes $\gamma > 0$, $\lambda > 0$, $\{\eta_t\}_{t=1}^T$;
2: initialize: $x_0 \in \mathcal{X}$ and $y_0 \in \mathbb{R}^{d_2}$;
3: for $t = 0, 1, \dots, T-1$ do
4:   Draw $b$ independent samples $\mathcal{B}_t = \{\zeta^i_t\}_{i=1}^b$, and compute the stochastic partial derivative $v_t = \nabla_y g(x_t, y_t; \mathcal{B}_t)$;
5:   Update $y_{t+1} = y_t - \lambda \eta_t v_t$;
6:   Draw $b(K+1)$ independent samples $\bar{\mathcal{B}}_t = \{\xi_{t,i}, \zeta^0_{t,i}, \dots, \zeta^{K-1}_{t,i}\}_{i=1}^b$, and compute the stochastic partial derivative $w_t = \bar{\nabla} f(x_t, y_t; \bar{\mathcal{B}}_t)$;
7:   Given a ρ-strongly convex mirror function $\psi_t$;
8:   Update $x_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{\langle w_t, x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t)\big\}$;
9: end for
10: Output: $(x_t, y_t)$ chosen uniformly at random from $\{x_t, y_t\}_{t=1}^T$.

Given $K \ge 1$, draw $K+1$ independent samples $\bar{\xi} = \{\xi, \zeta^0, \dots, \zeta^{K-1}\}$; as in [15, 23], we define a stochastic gradient estimator:
$$\bar{\nabla} f(x, y; \bar{\xi}) = \nabla_x f(x,y;\xi) - \nabla^2_{xy} g(x,y;\zeta^0)\Big[\frac{K}{L}\prod_{i=1}^{k}\Big(I - \frac{\nabla^2_{yy} g(x,y;\zeta^i)}{L}\Big)\Big] \nabla_y f(x,y;\xi), \qquad (6)$$
where $k \sim U\{0, 1, \dots, K-1\}$ is a uniform random variable independent of $\bar{\xi}$ (the product is the identity when $k = 0$). It is easy to verify that $\bar{\nabla} f(x,y;\bar{\xi})$ is a biased estimator of $\bar{\nabla} f(x,y)$, i.e., $\mathbb{E}_{\bar{\xi}}[\bar{\nabla} f(x,y;\bar{\xi})] \ne \bar{\nabla} f(x,y)$. For the gradient estimator (6), we thus define the bias $R(x,y) = \mathbb{E}_{\bar{\xi}}[\bar{\nabla} f(x,y;\bar{\xi})] - \bar{\nabla} f(x,y)$.

Lemma 3. (Lemma 2.1 in [23]) Under Assumptions 1, 2 and 3, for any $K \ge 1$, the gradient estimator in (6) satisfies
$$\|R(x,y)\| \le \frac{L C_{fy}}{\mu}\Big(1 - \frac{\mu}{L}\Big)^K.$$

Lemma 3 shows that the bias $R(x,y)$ decays exponentially fast with $K$: choosing $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$ yields $\|R(x,y)\| \le \frac{1}{T}$. Indeed, requiring $\frac{L C_{fy}}{\mu}\big(1 - \frac{\mu}{L}\big)^K \le \frac{1}{T}$ gives $K \log\big(1 - \frac{\mu}{L}\big) \le \log\big(\frac{\mu}{L C_{fy} T}\big)$; since $\mu < L$, this holds once $K \ge \log\big(\frac{C_{fy} L T}{\mu}\big) / \log\big(\frac{L}{L - \mu}\big)$. Further, since $\frac{\mu}{L} \le \log\big(\frac{L}{L-\mu}\big)$, the choice $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$ suffices, and we obtain $\|R(x,y)\| \le \frac{1}{T}$. Note that here we use $C_{gxy} = L$.

To simplify notation, let $\bar{\xi}^i_t = \{\xi_{t,i}, \zeta^0_{t,i}, \dots, \zeta^{K-1}_{t,i}\}$. In Algorithm 2, we use the mini-batch stochastic gradient estimator $w_t = \bar{\nabla} f(x_t, y_t; \bar{\mathcal{B}}_t) = \frac{1}{b}\sum_{i=1}^b \bar{\nabla} f(x_t, y_t; \bar{\xi}^i_t)$, where
$$\bar{\nabla} f(x_t, y_t; \bar{\xi}^i_t) = \nabla_x f(x_t, y_t; \xi_{t,i}) - \nabla^2_{xy} g(x_t, y_t; \zeta^0_{t,i})\Big[\frac{K}{L}\prod_{j=1}^{k}\Big(I - \frac{\nabla^2_{yy} g(x_t, y_t; \zeta^j_{t,i})}{L}\Big)\Big] \nabla_y f(x_t, y_t; \xi_{t,i}),$$
with $k \sim U\{0, 1, \dots, K-1\}$. Let $R(x_t, y_t) = \mathbb{E}[\bar{\nabla} f(x_t, y_t; \bar{\mathcal{B}}_t)] - \bar{\nabla} f(x_t, y_t)$; then $\mathbb{E}[\bar{\nabla} f(x_t, y_t; \bar{\mathcal{B}}_t)] = R(x_t, y_t) + \bar{\nabla} f(x_t, y_t)$, and by Lemma 3 it is easy to verify that $\|R(x_t, y_t)\| \le \frac{L C_{fy}}{\mu}\big(1 - \frac{\mu}{L}\big)^K$.
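For concreteness, a minimal sketch of the estimator (6) (our illustration; `hvp_yy_g` is a hypothetical callback that returns a stochastic Hessian-vector product $\nabla^2_{yy} g(x,y;\zeta^i)v$ with a fresh independent sample per call):

```python
import numpy as np

def neumann_hypergrad(grad_x_f, grad_y_f, jac_xy_g, hvp_yy_g, K, L, rng):
    """Stochastic hypergradient estimator of Eq. (6), as a dense sketch.
    grad_x_f (d1,), grad_y_f (d2,) and jac_xy_g (d1, d2) are stochastic
    partial derivatives at (x, y); hvp_yy_g(v) returns
    grad2_yy g(x, y; zeta_i) @ v with an independent sample per call.
    The truncated Neumann series approximates [grad2_yy g]^{-1}."""
    k = int(rng.integers(K))            # k ~ U{0, 1, ..., K-1}
    v = grad_y_f.copy()
    for _ in range(k):                  # v <- prod_{i=1}^{k} (I - H_i / L) v
        v = v - hvp_yy_g(v) / L
    return grad_x_f - jac_xy_g @ ((K / L) * v)

# usage sketch: rng = np.random.default_rng(0)
```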
4.3 ASBiO-BreD Algorithm

In this subsection, we propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) to solve the stochastic bilevel optimization Problem (2) using the variance-reduced technique of SARAH/SPIDER/SNVRG [36, 8, 40, 43]. Algorithm 3 shows the algorithmic framework of the ASBiO-BreD method. In Algorithm 3, we use the variance-reduced technique of SPIDER to accelerate the SBiO-BreD algorithm. When $\mathrm{mod}(t, q) = 0$, we draw relatively large batches of samples $\mathcal{B}_t = \{\zeta^i_t\}_{i=1}^b$ and $\bar{\mathcal{B}}_t = \{\bar{\xi}^i_t\}_{i=1}^b$ to estimate the stochastic partial derivatives $v_t$ and $w_t$, respectively. When $\mathrm{mod}(t, q) \ne 0$, we draw mini-batches $\mathcal{I}_t = \{\zeta^i_t\}_{i=1}^{b_1}$ and $\bar{\mathcal{I}}_t = \{\bar{\xi}^i_t\}_{i=1}^{b_1}$ (with $b > b_1$) to estimate $v_t$ and $w_t$, respectively. Let $R(x_t, y_t) = \mathbb{E}[\bar{\nabla} f(x_t, y_t; \bar{\mathcal{I}}_t)] - \bar{\nabla} f(x_t, y_t)$ when $\mathrm{mod}(t, q) \ne 0$; then $\mathbb{E}[\bar{\nabla} f(x_t, y_t; \bar{\mathcal{I}}_t)] = R(x_t, y_t) + \bar{\nabla} f(x_t, y_t)$ with $\|R(x_t, y_t)\| \le \frac{L C_{fy}}{\mu}\big(1 - \frac{\mu}{L}\big)^K$.

Algorithm 3 Accelerated Stochastic BiO-BreD (ASBiO-BreD) Algorithm
1: Input: $T$, $K \ge 1$, $q$, stepsizes $\gamma > 0$, $\lambda > 0$, $\{\eta_t\}_{t=1}^T$, mini-batch sizes $b$ and $b_1$;
2: initialize: $x_0 \in \mathcal{X}$ and $y_0 \in \mathbb{R}^{d_2}$;
3: for $t = 0, 1, \dots, T-1$ do
4:   if $\mathrm{mod}(t, q) = 0$ then
5:     Draw $b$ independent samples $\mathcal{B}_t = \{\zeta^i_t\}_{i=1}^b$, and compute the stochastic partial derivative $v_t = \nabla_y g(x_t, y_t; \mathcal{B}_t)$;
6:     Draw $b(K+1)$ independent samples $\bar{\mathcal{B}}_t = \{\xi_{t,i}, \zeta^0_{t,i}, \dots, \zeta^{K-1}_{t,i}\}_{i=1}^b$, and compute the stochastic partial derivative $w_t = \bar{\nabla} f(x_t, y_t; \bar{\mathcal{B}}_t)$;
7:   else
8:     Draw $b_1$ independent samples $\mathcal{I}_t = \{\zeta^i_t\}_{i=1}^{b_1}$, and compute the stochastic partial derivative $v_t = \nabla_y g(x_t, y_t; \mathcal{I}_t) - \nabla_y g(x_{t-1}, y_{t-1}; \mathcal{I}_t) + v_{t-1}$;
9:     Draw $b_1(K+1)$ independent samples $\bar{\mathcal{I}}_t = \{\xi_{t,i}, \zeta^0_{t,i}, \dots, \zeta^{K-1}_{t,i}\}_{i=1}^{b_1}$, and compute the stochastic partial derivative $w_t = \bar{\nabla} f(x_t, y_t; \bar{\mathcal{I}}_t) - \bar{\nabla} f(x_{t-1}, y_{t-1}; \bar{\mathcal{I}}_t) + w_{t-1}$;
10:  end if
11:  Update $y_{t+1} = y_t - \lambda \eta_t v_t$;
12:  Given a ρ-strongly convex mirror function $\psi_t$;
13:  Update $x_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{\langle w_t, x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t)\big\}$;
14: end for
15: Output: $(x_t, y_t)$ chosen uniformly at random from $\{x_t, y_t\}_{t=1}^T$.
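The recursion in lines 8-9 of Algorithm 3 can be summarized compactly as follows (a sketch under our naming, where `grad_fn` stands for either $\nabla_y g$ or $\bar{\nabla} f$ evaluated on a given mini-batch):

```python
def spider_estimate(t, q, grad_fn, big_batch, small_batch,
                    x_t, y_t, x_prev, y_prev, est_prev):
    """SPIDER/SARAH-style estimator used for both v_t and w_t in Algorithm 3.
    Every q steps the estimate is refreshed on a large batch (size b); in
    between, a small batch (size b1 < b) is evaluated at both the current
    and the previous iterate and corrects the running estimate, which is
    what reduces the variance of the gradient estimator."""
    if t % q == 0:
        return grad_fn(x_t, y_t, big_batch)
    return (grad_fn(x_t, y_t, small_batch)
            - grad_fn(x_prev, y_prev, small_batch) + est_prev)
```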
5 Convergence Analysis

In this section, we study the convergence properties of our new algorithms (i.e., BiO-BreD, SBiO-BreD, and ASBiO-BreD) under mild conditions. All proofs are provided in Appendix B. We begin by introducing a useful convergence metric, $\|\mathcal{G}_t\|^2$ or $\mathbb{E}\|\mathcal{G}_t\|^2$, to measure the convergence of our algorithms. Given the parameter vector $x_t$ generated at the $t$-th iteration, as in [10, 29], we define the generalized gradient at the $t$-th iteration as
$$\mathcal{G}_t = \frac{1}{\gamma}(x_t - x^+_{t+1}), \qquad x^+_{t+1} = \arg\min_{x \in \mathcal{X}} \Big\{\langle \nabla F(x_t), x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t)\Big\},$$
where $F(x) = f(x, y^*(x))$ or $F(x) = \mathbb{E}_\xi[f(x, y^*(x); \xi)]$. When $\psi_t(x) = \frac{1}{2}\|x\|^2$, $\mathcal{X} = \mathbb{R}^{d_1}$ and $h(x) = c$ is a constant, we have $\|\mathcal{G}_t\|^2 = \|\nabla F(x_t)\|^2$, which is the common convergence metric used in [11, 22]. When $\psi_t(x) = \frac{1}{2}\|x\|^2$, $\mathcal{X} \subset \mathbb{R}^{d_1}$ and $h(x) = c$ is a constant, our convergence metric becomes $\|\mathcal{G}_t\|^2 = \big\|\frac{1}{\gamma}\big(x_t - P_{\mathcal{X}}(x_t - \gamma \nabla F(x_t))\big)\big\|^2$, which was also used in [15]. Next, we provide some useful lemmas and additional mild assumptions.

Lemma 4. (Lemma 3.1 in [23]) Under Assumptions 1, 2 and 3, the stochastic gradient estimator $\bar{\nabla} f(x,y;\bar{\xi})$ is $L_K$-Lipschitz continuous, e.g., for $x_1, x_2 \in \mathcal{X}$ and $y \in \mathbb{R}^{d_2}$,
$$\mathbb{E}_{\bar{\xi}} \big\|\bar{\nabla} f(x_1, y; \bar{\xi}) - \bar{\nabla} f(x_2, y; \bar{\xi})\big\|^2 \le L_K^2 \|x_1 - x_2\|^2,$$
where $L_K^2 = 2L^2 + \frac{6 L^4 K}{2\mu L - \mu^2} + \frac{6 C_{fy}^2 L_{gxy}^2 K}{2\mu L - \mu^2} + \frac{6 L^4 K^3 L_{gyy}^2}{(L-\mu)^2 (2\mu L - \mu^2)}$.

Lemma 5. Suppose the sequence $\{x_t, y_t\}_{t=1}^T$ is generated by Algorithm 2 or Algorithm 3. Under the above assumptions, given $0 < \eta_t \le 1$ for all $t \ge 1$ and $0 < \lambda \le \frac{1}{6L}$, we have
$$\|y_{t+1} - y^*(x_{t+1})\|^2 \le \Big(1 - \frac{\eta_t \mu \lambda}{4}\Big)\|y_t - y^*(x_t)\|^2 + \frac{3\eta_t \lambda^2}{6\mu}\|\nabla_y g(x_t, y_t) - v_t\|^2 + \frac{25\kappa^2}{6\eta_t \mu \lambda}\|x_{t+1} - x_t\|^2.$$

The above Lemma 5 essentially follows Lemma 28 of [17], which is used for minimax optimization.

Assumption 6. The stochastic partial derivative $\nabla_y g(x,y;\zeta)$ satisfies $\mathbb{E}[\nabla_y g(x,y;\zeta)] = \nabla_y g(x,y)$ and $\mathbb{E}\|\nabla_y g(x,y;\zeta) - \nabla_y g(x,y)\|^2 \le \sigma^2$. The stochastic gradient estimator $\bar{\nabla} f(x,y;\bar{\xi})$ defined in (6) satisfies $\mathbb{E}_{\bar{\xi}}[\bar{\nabla} f(x,y;\bar{\xi})] = \bar{\nabla} f(x,y) + R(x,y)$ and $\mathbb{E}_{\bar{\xi}}\|\bar{\nabla} f(x,y;\bar{\xi}) - \bar{\nabla} f(x,y) - R(x,y)\|^2 \le \sigma^2$.

Assumption 7. The mirror functions $\{\psi_t(x)\}_{t=0}^T$ are ρ-strongly convex with ρ > 0.

Assumption 6 is commonly used in stochastic bilevel optimization methods [15, 23]. Assumption 7 says that the constant ρ is a uniform lower bound on the strong convexity of the mirror functions $\psi_t(x)$ for all $t \ge 0$; it is widely used in mirror descent algorithms [29] and adaptive gradient algorithms [19].

5.1 Convergence Analysis of the BiO-BreD Algorithm

In this subsection, we provide the convergence properties of our BiO-BreD algorithm.

Theorem 1. Suppose the sequence $\{x_t, y_t\}_{t=1}^T$ is generated by Algorithm 1. Let $0 < \gamma \le \frac{3\rho}{4 L_F}$, $0 < \lambda < \frac{1}{L}$, $K = \log(T)/\log\big(\frac{1}{1-\lambda\mu}\big) + 1$ and $\|y^0_t - y^*(x_t)\|^2 \le \Delta$ for all $t \ge 0$; then we have
$$\frac{1}{T}\sum_{t=0}^{T-1} \|\mathcal{G}_t\|^2 \le \frac{16\big(\Phi(x_0) - \Phi^*\big)}{3 T \gamma \rho} + \frac{22 \hat{L}_1^2 \Delta}{\rho^2 T} + \frac{22 \hat{L}_2^2 \Delta}{\rho^2 T} + \frac{22 \hat{L}_3^2}{\rho^2 T^2}, \qquad (7)$$
where $\kappa = \frac{L}{\mu}$, $\hat{L}_1 = \frac{L(L+\mu)}{\mu}$, $\hat{L}_2 = \frac{2 C_{fy}(\mu L_{gxy} + L L_{gyy})}{\mu^2}$ and $\hat{L}_3 = \frac{L C_{fy}}{\mu}$.

Remark 1. Without loss of generality, let $L \ge 1 \ge \mu$, $\lambda = \frac{1}{2L}$, $\gamma = \frac{3\rho}{4 L_F}$ and $\rho = O(L)$. It is easy to verify that our BiO-BreD algorithm has a convergence rate of $O\big(\frac{\kappa^2}{T}\big)$. Setting $\frac{\kappa^2}{T} = \epsilon$, we have $T = \kappa^2 \epsilon^{-1}$. Since $K = \log(T)/\log\big(\frac{1}{1-\lambda\mu}\big) + 1$, we can choose $K = O(\kappa \log(\frac{1}{\epsilon}))$. For finding an ϵ-stationary point of Problem (1), we thus need the gradient complexities $G_c(f,\epsilon) = 2T = O(\kappa^2\epsilon^{-1})$ and $G_c(g,\epsilon) = KT = O(\kappa^3\epsilon^{-1})$, and the Jacobian-vector and Hessian-vector product complexities $JV(g,\epsilon) = KT = O(\kappa^3\epsilon^{-1})$ and $HV(g,\epsilon) = KT = O(\kappa^3\epsilon^{-1})$.

5.2 Convergence Analysis of the SBiO-BreD Algorithm

In this subsection, we provide the convergence properties of our SBiO-BreD algorithm.

Theorem 2. Suppose the sequence $\{x_t, y_t\}_{t=1}^T$ is generated by Algorithm 2. Let $\Delta = \|y_0 - y^*(x_0)\|^2$, $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$, $0 < \eta = \eta_t \le 1$, $0 < \gamma \le \min\big\{\frac{3\rho}{4 L_F}, \frac{9\eta\rho\mu\lambda}{800\kappa^2}, \frac{\eta\mu\rho\lambda}{47 L_y^2}\big\}$ and $0 < \lambda \le \frac{1}{6L}$; then we have
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\mathcal{G}_t\|^2 \le \frac{32\big(\Phi(x_0) - \Phi^*\big)}{3 T \gamma \rho} + \frac{752\sigma^2}{3\rho^2 b} + \frac{400\eta\lambda\sigma^2}{9\gamma\rho\mu b} + \frac{752\Delta}{3\rho^2 T^2}. \qquad (8)$$

Remark 2. Without loss of generality, let $L \ge 1 \ge \mu$, $\lambda = \frac{1}{6L}$, $\gamma = \min\big\{\frac{3\rho}{4 L_F}, \frac{9\eta\rho\mu\lambda}{800\kappa^2}, \frac{\eta\mu\rho\lambda}{47 L_y^2}\big\}$ and $\rho = O(L)$; then $\gamma\rho = O(\frac{1}{\kappa^3})$. It is easily verified that our SBiO-BreD algorithm has a convergence rate of $O\big(\frac{\kappa^3}{T} + \frac{\kappa^2}{b}\big)$. Setting $\frac{\kappa^3}{T} \le \frac{\epsilon}{2}$ and $\frac{\kappa^2}{b} \le \frac{\epsilon}{2}$, we have $T = 2\kappa^3\epsilon^{-1}$ and $b = 2\kappa^2\epsilon^{-1}$. Since $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$, we have $K = O(\kappa\log(\frac{\kappa^4}{\epsilon})) = \tilde{O}(\kappa)$. For finding an ϵ-stationary point of Problem (2), we thus need the gradient complexities $G_c(f,\epsilon) = 2bT = O(\kappa^5\epsilon^{-2})$ and $G_c(g,\epsilon) = bT = O(\kappa^5\epsilon^{-2})$, and the Jacobian-vector and Hessian-vector product complexities $JV(g,\epsilon) = bT = O(\kappa^5\epsilon^{-2})$ and $HV(g,\epsilon) = KbT = O(\kappa^6\epsilon^{-2})$.

5.3 Convergence Analysis of the ASBiO-BreD Algorithm

In this subsection, we provide the convergence properties of our ASBiO-BreD algorithm.

Theorem 3. Suppose the sequence $\{x_t, y_t\}_{t=1}^T$ is generated by Algorithm 3. Let $\Delta = \|y_0 - y^*(x_0)\|^2$, $b_1 = q$, $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$, $0 < \eta = \eta_t \le 1$, $0 < \gamma \le \min\big\{\frac{3\rho}{38 L_K^2 \eta}, \frac{3\rho}{4 L_F}, \frac{\rho\eta\mu\lambda}{200\kappa^2}\big\}$ and $0 < \lambda \le \min\big\{\frac{1}{6L}, \frac{9\mu}{100\eta^2 L^2}\big\}$; then we have
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\mathcal{G}_t\|^2 \le \frac{32\big(\Phi(x_0) - \Phi^*\big)}{3 T \gamma \rho} + \frac{152\Delta}{3 T^2 \rho^2} + \frac{4}{\eta\rho\gamma}\Big(\frac{1}{L^2} + \frac{1}{L_K^2}\Big)\frac{\sigma^2}{b}.$$

Remark 3. Without loss of generality, let $L \ge 1 \ge \mu$, $\lambda = \min\big\{\frac{1}{6L}, \frac{9\mu}{100\eta^2 L^2}\big\}$, $\gamma = \min\big\{\frac{3\rho}{38 L_K^2 \eta}, \frac{3\rho}{4 L_F}, \frac{\rho\eta\mu\lambda}{200\kappa^2}\big\}$ and $\rho = O(L)$; then $\gamma\rho = O(\frac{1}{\kappa^4})$. It is easily verified that our ASBiO-BreD algorithm has a convergence rate of $O\big(\frac{\kappa^4}{T} + \frac{\kappa^2}{b}\big)$. Setting $\frac{\kappa^4}{T} \le \frac{\epsilon}{2}$ and $\frac{\kappa^2}{b} \le \frac{\epsilon}{2}$, we have $T = 2\kappa^4\epsilon^{-1}$ and $b = 2\kappa^2\epsilon^{-1}$. Since $K = \frac{L}{\mu}\log\big(\frac{L C_{fy} T}{\mu}\big)$, we have $K = O(\kappa\log(\frac{\kappa^4}{\epsilon})) = \tilde{O}(\kappa)$. Let $b_1 = q = \kappa\epsilon^{-0.5}$. For finding an ϵ-stationary point of Problem (2), we thus need the gradient complexities $G_c(f,\epsilon) = 2\big(\frac{bT}{q} + 2b_1 T\big) = O(\kappa^5\epsilon^{-1.5})$ and $G_c(g,\epsilon) = \frac{bT}{q} + 2b_1 T = O(\kappa^5\epsilon^{-1.5})$, and the Jacobian-vector and Hessian-vector product complexities $JV(g,\epsilon) = \frac{bT}{q} + 2b_1 T = O(\kappa^5\epsilon^{-1.5})$ and $HV(g,\epsilon) = K\big(\frac{bT}{q} + 2b_1 T\big) = O(\kappa^6\epsilon^{-1.5})$.
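To make the complexity counting in Remark 3 explicit, here is the worked arithmetic (ours) from the stated choices of $T$, $b$, $b_1$ and $q$:

```latex
% With T = 2\kappa^4\epsilon^{-1}, b = 2\kappa^2\epsilon^{-1} and
% b_1 = q = \kappa\epsilon^{-1/2}, the two terms in the sample count are:
\begin{align*}
  \frac{bT}{q} &= \frac{(2\kappa^2\epsilon^{-1})(2\kappa^4\epsilon^{-1})}{\kappa\epsilon^{-1/2}}
     = 4\kappa^5\epsilon^{-3/2}, \qquad
  2 b_1 T = 2\,(\kappa\epsilon^{-1/2})(2\kappa^4\epsilon^{-1}) = 4\kappa^5\epsilon^{-3/2},
\end{align*}
% so G_c(g,\epsilon) = bT/q + 2 b_1 T = O(\kappa^5 \epsilon^{-1.5}), and with
% K = O(\kappa) Neumann terms per hypergradient,
% HV(g,\epsilon) = K (bT/q + 2 b_1 T) = O(\kappa^6 \epsilon^{-1.5}).
```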
6 Numerical Experiments

In this section, we perform two tasks to demonstrate the efficiency of our algorithms: 1) a data hyper-cleaning task [39] on the MNIST dataset [25]; 2) a hyper-representation learning task [9] on the Omniglot dataset [24]. In the experiments, we compare our algorithms (i.e., BiO-BreD, SBiO-BreD, and ASBiO-BreD) with the following bilevel optimization algorithms: reverse [9], AID-BiO [11, 22], AID-CG [12], AID-FP [12], stocBiO [22], MRBO [41], VRBO [41], FSLA [28], SUSTAIN [23], and VR-saBiAdam [18]. All experiments are averaged over 5 runs, and we use a server with an AMD EPYC 7763 64-core CPU and one NVIDIA RTX A5000 GPU. We use the Bregman function $\psi_t(x) = \frac{1}{2}x^T H_t x$ to generate the Bregman distance in our algorithms, where $H_t$ is the adaptive matrix used in [19], i.e., built from an exponential moving average of the squared gradients, with coefficient 0.99 in all experiments.

6.1 Data Hyper-cleaning

In this subsection, we perform data hyper-cleaning on the MNIST dataset [25]. The problem is formulated as follows:
$$\min_{\lambda} \; l_{val}\big(\lambda, w^*(\lambda)\big) := \frac{1}{|\mathcal{D}_V|}\sum_{(x_i, y_i) \in \mathcal{D}_V} l\big(x_i^T w^*(\lambda), y_i\big)$$
$$\text{s.t.} \quad w^*(\lambda) = \arg\min_w \; l_{tr}(\lambda, w) := \frac{1}{|\mathcal{D}_T|}\sum_{(x_i, y_i) \in \mathcal{D}_T} \sigma(\lambda_i)\, l\big(x_i^T w, y_i\big) + C\|w\|^2,$$
where $l(\cdot)$ denotes the cross-entropy loss, and $\mathcal{D}_T$ and $\mathcal{D}_V$ are the training and validation datasets, respectively. Here $\lambda = \{\lambda_i\}_{i \in \mathcal{D}_T}$ are the hyper-parameters, $C \ge 0$ is a tuning parameter, and $\sigma(\cdot)$ denotes the sigmoid function. In the experiment, we set $C = 0.001$. The dataset includes a training set and a validation set, each containing 5000 images. A portion of the training data is corrupted by randomly changing the labels, and we denote the fraction of corrupted images by ϱ. The detailed experimental setup is described in Appendix A.1. For the hyper-parameters, we perform a grid search for our algorithms and the other baselines to choose the best setting.

The experimental results are summarized in Figure 1.

Figure 1: Validation loss vs. running time for different methods. We compare our BiO-BreD with deterministic baselines (first column) and SBiO-BreD with stochastic baselines (second column); ASBiO-BreD is compared with momentum-based or SPIDER/SARAH-based baselines (last column). We test two values of ϱ: a large-noise setting ϱ = 0.8 (top row) and a small-noise setting ϱ = 0.4 (bottom row).

As shown by the figure, BiO-BreD outperforms the reverse algorithm; SBiO-BreD outperforms the AID-FP, stocBiO and AID-CG methods; and ASBiO-BreD outperforms the other SPIDER-based algorithm VRBO as well as several momentum-based variance-reduced methods: MRBO, SUSTAIN, FSLA, and VR-saBiAdam.
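For concreteness, the two objectives above can be written in PyTorch as follows (our sketch; variable names are ours):

```python
import torch
import torch.nn.functional as F

def l_tr(lam, w, X_tr, y_tr, C=0.001):
    """Inner objective of the hyper-cleaning problem: sigmoid(lam_i)
    down-weights suspected corrupted training examples, and C*||w||^2
    makes the problem strongly convex in w."""
    losses = F.cross_entropy(X_tr @ w, y_tr, reduction='none')  # per-example loss
    return (torch.sigmoid(lam) * losses).mean() + C * (w ** 2).sum()

def l_val(w, X_val, y_val):
    """Outer objective: plain validation cross-entropy at w = w*(lam)."""
    return F.cross_entropy(X_val @ w, y_val)
```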
6.2 Hyper-representation Learning

In this subsection, we perform the hyper-representation learning task on the Omniglot dataset [24]. The problem is formulated as follows:
$$\min_{\lambda} \; l_{val}\big(\lambda, w^*(\lambda)\big) := \mathbb{E}_\xi\Big[\frac{1}{|\mathcal{D}_{V,\xi}|}\sum_{(x_i, y_i) \in \mathcal{D}_{V,\xi}} l\big(w^*_\xi(\lambda)^T \phi(x_i; \lambda), y_i; \xi\big)\Big] + \alpha\|\lambda\|_1$$
$$\text{s.t.} \quad w^*_\xi(\lambda) = \arg\min_w \; l_{tr}(\lambda, w; \xi) := \frac{1}{|\mathcal{D}_{T,\xi}|}\sum_{(x_i, y_i) \in \mathcal{D}_{T,\xi}} l\big(w^T \phi(x_i; \lambda), y_i\big) + C\|w\|^2,$$
where $l(\cdot)$ denotes the cross-entropy loss, and $\mathcal{D}_{T,\xi}$ and $\mathcal{D}_{V,\xi}$ are the training and validation datasets for a randomly sampled meta task ξ. Here $\phi(\cdot, \cdot)$ is a four-layer convolutional neural network with max-pooling and 32 filters per layer [9], which serves as the representation mapping; λ denotes the parameter vector of the representation mapping $\phi(\cdot, \cdot)$, and $C \ge 0$ is a tuning parameter that guarantees the strong convexity of the inner problem. The term $\alpha\|\lambda\|_1$ imposes sparsity on the hyper-representation. In the experiment, we set $\alpha = 0.001$ and $C = 0.01$. The detailed experimental setup is described in Appendix A.2.

The results (validation/test accuracy) are summarized in Tables 3 and 4. From these results, our ASBiO-BreD algorithm outperforms the other baselines in the nonsmooth case. We also consider the smooth case, where the $L_1$ regularization is not added to the upper-level problem; the results without $L_1$ regularization are given in Appendix A.2.

Table 3: Validation accuracy vs. running time (5-way 1-shot) for different methods (with $L_1$ regularization).

| Time | AID-BiO | ITD-BiO | MRBO | FSLA | VRBO | VR-saBiAdam | ASBiO-BreD |
|---|---|---|---|---|---|---|---|
| 20s | 0.6509 | 0.6411 | 0.6103 | 0.6539 | 0.5951 | 0.6812 | 0.6653 |
| 40s | 0.7365 | 0.7210 | 0.6971 | 0.7399 | 0.6805 | 0.7141 | 0.7403 |
| 60s | 0.7762 | 0.7721 | 0.7519 | 0.7661 | 0.7429 | 0.7523 | 0.7830 |

Table 4: Validation accuracy vs. running time (5-way 5-shot) for different methods (with $L_1$ regularization).

| Time | AID-BiO | ITD-BiO | MRBO | FSLA | VRBO | VR-saBiAdam | ASBiO-BreD |
|---|---|---|---|---|---|---|---|
| 20s | 0.8316 | 0.8131 | 0.8174 | 0.7993 | 0.7730 | 0.7753 | 0.8529 |
| 40s | 0.8779 | 0.8621 | 0.8634 | 0.8485 | 0.8305 | 0.8188 | 0.8967 |
| 60s | 0.9032 | 0.8968 | 0.8819 | 0.8824 | 0.8745 | 0.8640 | 0.9313 |

7 Conclusions

In this paper, we proposed a class of enhanced bilevel optimization methods based on Bregman distances to solve nonconvex-strongly-convex bilevel optimization problems, possibly with nonsmooth regularization. Moreover, we provided a comprehensive theoretical framework to analyze our methods. The theoretical results show that our methods improve on the best known computational complexities with respect to the condition number κ and the target accuracy ϵ for finding an ϵ-stationary point.

Acknowledgments and Disclosure of Funding

This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775. FH was a postdoctoral researcher at the University of Pittsburgh. FH and HH are the corresponding authors.

References

[1] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[2] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.
[3] Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321-353, 1981.
[4] Y. Censor and S. A. Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451-464, 1992.
[5] T. Chen, Y. Sun, and W. Yin. A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671, 2021.
[6] A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 15236-15245, 2019.
[7] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 14-26, 2010.
[8] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
[9] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568-1577. PMLR, 2018.
[10] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.
[11] S. Ghadimi and M. Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
[12] R. Grazzi, L. Franceschi, M. Pontil, and S. Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pages 3748-3758. PMLR, 2020.
[13] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang. On stochastic moving-average estimators for non-convex optimization. arXiv preprint arXiv:2104.14840, 2021.
[14] Z. Guo and T. Yang. Randomized stochastic variance-reduced methods for stochastic bilevel optimization. arXiv preprint arXiv:2105.02266, 2021.
[15] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.
[16] F. Huang, S. Gao, and H. Huang. Bregman gradient policy optimization. In International Conference on Learning Representations, 2022.
[17] F. Huang, S. Gao, J. Pei, and H. Huang. Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization. Journal of Machine Learning Research, 23(36):1-70, 2022.
[18] F. Huang and H. Huang. BiAdam: Fast adaptive bilevel optimization methods. arXiv preprint arXiv:2106.11396, 2021.
[19] F. Huang, J. Li, and H. Huang. SUPER-ADAM: Faster and universal framework of adaptive gradients. Advances in Neural Information Processing Systems, 34:9074-9085, 2021.
[20] S. Jenni and P. Favaro. Deep bilevel learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 618-633, 2018.
[21] K. Ji and Y. Liang. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint arXiv:2102.03926, 2021.
[22] K. Ji, J. Yang, and Y. Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pages 4882-4892. PMLR, 2021.
[23] P. Khanduri, S. Zeng, M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Advances in Neural Information Processing Systems, 34:30271-30283, 2021.
[24] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[26] L. Lei and M. I. Jordan. On the adaptivity of stochastic gradient-based optimization. SIAM Journal on Optimization, 30(2):1473-1500, 2020.
[27] J. Li, B. Gu, and H. Huang. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690, 2020.
[28] J. Li, B. Gu, and H. Huang. A fully single loop algorithm for bilevel optimization without Hessian inverse. arXiv preprint arXiv:2112.04660, 2021.
[29] W. Li, Z. Wang, Y. Zhang, and G. Cheng. Variance reduction on adaptive stochastic mirror descent. arXiv preprint arXiv:2012.13760, 2020.
[30] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2018.
[31] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. arXiv preprint arXiv:2101.11517, 2021.
[32] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang. A value-function-based interior-point method for non-convex bi-level optimization. arXiv preprint arXiv:2106.07991, 2021.
[33] R. Liu, Y. Liu, S. Zeng, and J. Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond. Advances in Neural Information Processing Systems, 34, 2021.
[34] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning, pages 6305-6315. PMLR, 2020.
[35] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang. A general descent aggregation framework for gradient-based bi-level optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[36] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613-2621. PMLR, 2017.
[37] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 654-665. Springer, 2015.
[38] T. Okuno, A. Takeda, A. Kawana, and M. Watanabe. On ℓp-hyperparameter learning via bilevel nonsmooth optimization. Journal of Machine Learning Research, 22(245):1-47, 2021.
[39] A. Shaban, C.-A. Cheng, N. Hatch, and B. Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723-1732. PMLR, 2019.
[40] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pages 2403-2413, 2019.
[41] J. Yang, K. Ji, and Y. Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670-13682, 2021.
[42] S. Zhang and N. He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.
[43] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduction for nonconvex optimization. Journal of Machine Learning Research, 2020.