# Provably Convergent Federated Trilevel Learning

Yang Jiao1, Kai Yang1,2,3*, Tiancheng Wu1, Chengtao Jian1, Jianwei Huang4,5
1 Department of Computer Science and Technology, Tongji University
2 Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University
3 Shanghai Research Institute for Intelligent Autonomous Systems
4 School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
5 Shenzhen Institute of Artificial Intelligence and Robotics for Society
yangjiao@tongji.edu.cn, kaiyang@tongji.edu.cn, tony318@tongji.edu.cn, jct@tongji.edu.cn, jianweihuang@cuhk.edu.cn
*Corresponding author (e-mail: kaiyang@tongji.edu.cn).

Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision processes and is widely applied in machine learning applications such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems presents a great challenge due to their nested decision-making structure. In addition, existing works on TLO face two key limitations: 1) they all focus on the non-distributed setting, which may lead to privacy breaches; 2) they do not offer any non-asymptotic convergence analysis, which characterizes how fast an algorithm converges. To address these challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes µ-cuts to construct a hyper-polyhedral approximation of the TLO problem and solves it in an asynchronous manner. We demonstrate that the proposed µ-cuts are applicable not only to convex functions but also to a wide range of non-convex functions that satisfy the µ-weak convexity assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate of the proposed method by showing that its iteration complexity to obtain an ϵ-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Extensive experiments on real-world datasets elucidate the superiority of the proposed method, e.g., it has a faster convergence rate, with a maximum acceleration of approximately 80%.

## Introduction

Recently, trilevel learning, also called trilevel optimization (TLO), has found applications in many machine learning tasks, e.g., robust neural architecture search (Guo et al. 2020), robust hyperparameter optimization (Sato, Tanaka, and Takeda 2021), and domain adaptation (Raghu et al. 2021). Trilevel optimization problems are optimization problems that involve three nested levels of optimization and thus have a trilevel hierarchy (Avraamidou 2018; Sato, Tanaka, and Takeda 2021). A general form of the trilevel optimization problem is given by

$$
\begin{array}{ll}
\min & f_1(x_1, x_2^*, x_3^*) \\
\text{s.t.} & x_2^* = \arg\min_{x_2} f_2(x_1, x_2, x_3^*) \\
& \qquad \text{s.t. } x_3^* = \arg\min_{x_3} f_3(x_1, x_2, x_3) \\
\text{var.} & x_1, x_2, x_3,
\end{array}
\tag{1}
$$

where $f_1, f_2, f_3$ respectively denote the first-, second-, and third-level objectives, and $x_1 \in \mathbb{R}^{d_1}$, $x_2 \in \mathbb{R}^{d_2}$, $x_3 \in \mathbb{R}^{d_3}$ are the variables. Despite its wide applications, the development of solution methods has been predominantly limited to bilevel optimization (BLO) (Ji, Yang, and Liang 2021; Franceschi et al. 2018), primarily due to the escalated difficulty of solving the TLO problem (Sato, Tanaka, and Takeda 2021).
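To make the nested structure in Eq. (1) concrete, the following toy sketch approximately solves a three-level quadratic instance by replacing each lower level with K gradient-descent steps and driving the first level with a finite-difference hypergradient, in the spirit of the hypergradient-based methods discussed below. The specific objectives, step sizes, and K are illustrative assumptions, not the paper's method.

```python
# A toy numerical sketch (not the paper's algorithm) of the nested structure in Eq. (1).
# The lower two levels are approximately solved by K gradient steps; the first level
# is updated with a finite-difference hypergradient.
#
#   f3(x1, x2, x3)  = 0.5*(x3 - x2)^2           -> x3*(x2) = x2
#   f2(x1, x2, x3*) = 0.5*(x2 - x1)^2 + 0.5*x3*^2
#   f1(x1, x2*, x3*) = (x1 - 1)^2 + x2*^2

def solve_level3(x2, K=30, eta=0.2):
    x3 = 0.0
    for _ in range(K):
        x3 -= eta * (x3 - x2)                 # gradient of f3 w.r.t. x3
    return x3

def solve_level2(x1, K=30, eta=0.2):
    x2 = 0.0
    for _ in range(K):
        x3 = solve_level3(x2)                 # x3*(x2) ~= x2
        x2 -= eta * ((x2 - x1) + x3)          # total derivative of f2(x1, x2, x3*(x2))
    return x2, solve_level3(x2)

def F(x1):                                    # first-level value function
    x2, _x3 = solve_level2(x1)
    return (x1 - 1.0) ** 2 + x2 ** 2

x1, delta, lr = 0.0, 1e-4, 0.1
for _ in range(100):                          # finite-difference hypergradient descent
    g = (F(x1 + delta) - F(x1 - delta)) / (2 * delta)
    x1 -= lr * g
print(f"x1 = {x1:.3f}")
```

On this toy instance the first-level variable converges to about 0.8, the minimizer of the reduced objective $(x_1 - 1)^2 + (x_1/2)^2$.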
The literature, specifically (Blair 1992; Avraamidou 2018), highlights that the complexity of solving problems with hierarchical structures of more than two levels is substantially greater than that of bilevel optimization problems. Theoretical work on solving TLO problems has only emerged in the last few years. A hypergradient (gradient)-based method is proposed in (Sato, Tanaka, and Takeda 2021), which uses K gradient descent steps to replace the lower-level problems when solving TLO problems; it is one of the first results that establishes theoretical guarantees for solving the TLO problem. A general automatic differentiation technique is proposed in (Choe et al. 2022), based on interpreting TLO as a special type of dataflow graph. Nevertheless, some issues remain unaddressed in prior work: 1) in TLO applications, data may be acquired and stored across multiple nodes, yet prior works only solve TLO problems in a non-distributed manner, which requires collecting a massive amount of data at a single server and may lead to data privacy risks (Subramanya and Riggio 2021; Jiao et al. 2022b; Han, Wang, and Leung 2020); moreover, synchronous federated algorithms often suffer from straggler problems and stop working entirely if some workers fail to communicate (Jiao et al. 2022b), so developing asynchronous federated algorithms for TLO is highly important; 2) existing TLO works only provide asymptotic convergence guarantees for their algorithms, whereas understanding the convergence speed of an algorithm requires a non-asymptotic convergence analysis that characterizes how fast it converges in practice.

To this end, we propose an Asynchronous Federated Trilevel Optimization method (AFTO) in this paper. AFTO effectively solves TLO problems in an asynchronous federated manner. Specifically, it treats the lower-level optimization problem as a constraint on the upper level and utilizes µ-cuts to construct a hyper-polyhedral approximation; an effective asynchronous algorithm is then developed. In trilevel learning problems, the objective functions at each level are usually non-convex, so cutting-plane methods tailored to convex functions (Jiao et al. 2022b; Franc, Sonnenburg, and Werner 2011) are inapplicable. To the best of our knowledge, the proposed µ-cut is the first approach capable of constructing cutting planes for trilevel learning problems with non-convex objectives. Furthermore, we demonstrate that the proposed method is guaranteed to converge and theoretically analyze its non-asymptotic convergence rate in terms of iteration complexity. The contributions of this work are summarized as follows.

1. An asynchronous federated trilevel optimization method is proposed for trilevel learning. To the best of our knowledge, this is the first work that designs algorithms to solve the trilevel learning problem in an asynchronous distributed manner.
2. A novel hyper-polyhedral approximation method via µ-cuts is proposed. The proposed µ-cut can be applied to trilevel learning with non-convex objectives. We further demonstrate that the iteration complexity of the proposed method to achieve an ϵ-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$.
3. Extensive experiments on real-world datasets justify the superiority of the proposed method and underscore the significant benefits of employing the hyper-polyhedral approximation for trilevel learning.

## Related Work

### Trilevel Optimization

Trilevel optimization has many applications ranging from economics to machine learning. A robust neural architecture search approach is proposed in (Guo et al. 2020), which integrates adversarial training into one-shot neural architecture search and can be regarded as solving a trilevel optimization problem. TimeAutoAD (Jiao et al. 2022a) automatically configures the anomaly detection pipeline and optimizes the hyperparameters for multivariate time series anomaly detection; the optimization problem it aims to solve can be viewed as a trilevel optimization problem. A method is proposed in (Raghu et al. 2021) to solve a trilevel optimization problem that involves hyperparameter optimization together with two-level pretraining and finetuning. LFM (Garg et al. 2022) solves a trilevel optimization problem consisting of data reweighting, architecture search, and model training. A general automatic differentiation technique, Betty, is proposed in (Choe et al. 2022) and can be utilized to solve trilevel optimization problems. However, the aforementioned algorithms do not provide any convergence guarantee. A hypergradient-based algorithm with an asymptotic convergence guarantee is proposed in (Sato, Tanaka, and Takeda 2021), which can be employed for trilevel optimization problems. Nevertheless, the existing works focus on solving TLO problems in a non-distributed manner and do not provide any non-asymptotic convergence analysis. In contrast, this work proposes an efficient asynchronous algorithm with a non-asymptotic convergence guarantee for solving TLO problems. To the best of our knowledge, this is the first work that solves TLO problems in an asynchronous federated manner.

### Polyhedral Approximation

Polyhedral approximation is a widely used approximation method (Bertsekas 2015). The idea is to approximate either the feasible region or the epigraph of the objective function of an optimization problem by a set of cutting planes, and to gradually refine the approximation by adding further cutting planes. Since the approximate problem is polyhedral, it is usually much easier to solve than the original problem. Following (Bertsekas 2015), polyhedral approximation can be broadly divided into two main approaches: outer linearization and inner linearization. Outer linearization (Tawarmalani and Sahinidis 2005; Yang et al. 2008; Bürger, Notarstefano, and Allgöwer 2013), also called the cutting plane method, utilizes a set of cutting planes to approximate the feasible region or the epigraph of the objective function from the outside. In contrast, inner linearization (Bertsekas and Yu 2011; Trombettoni et al. 2011) utilizes the convex hulls of finite numbers of halflines or points to approximate the feasible region or the epigraph of the objective function from within. Polyhedral approximation has been widely used in convex optimization; a polyhedral approximation method for convex optimization is proposed in (Bertsekas 2015), which utilizes cutting planes to approximate the original convex optimization problem. In (Bürger, Notarstefano, and Allgöwer 2013), a fully distributed algorithm based on an outer polyhedral approximation of the constraint sets is proposed for convex and robust distributed optimization problems in peer-to-peer networks. In this work, a novel hyper-polyhedral approximation method via µ-cuts is proposed for TLO. The proposed µ-cut can be utilized for µ-weakly convex optimization and thus has broader applicability than cutting plane methods for convex optimization.
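As a concrete illustration of the outer linearization idea described above (and not taken from the paper), the short sketch below builds a cutting-plane model of a one-dimensional convex function and refines it at the model's minimizer; the function, search interval, and iteration count are assumptions for the demo only.

```python
# A small self-contained illustration of outer linearization (Kelley-style):
# a convex f is approximated from below by the maximum of its tangent cutting
# planes, and the polyhedral model is refined by adding a new cut at the model's
# minimizer. The model subproblem is solved crudely by grid search.
import numpy as np

f = lambda x: x ** 2                         # convex objective (assumed)
grad = lambda x: 2 * x

cuts = []                                    # each cut: f(xk) + grad(xk) * (x - xk)
x_k = 3.0
grid = np.linspace(-4.0, 4.0, 4001)          # crude solver for the polyhedral model
for _ in range(6):
    cuts.append((f(x_k), grad(x_k), x_k))
    model = np.max([fk + gk * (grid - xk) for fk, gk, xk in cuts], axis=0)
    x_k = grid[np.argmin(model)]             # next query point; a new cut is added here
    print(f"model minimizer = {x_k:+.3f}, gap f - model = {f(x_k) - model.min():.4f}")
```

The printed gap between the true function value and the polyhedral model shrinks as cuts accumulate, which is the refinement behaviour the µ-cut construction later generalizes to weakly convex functions.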
## Asynchronous Federated Trilevel Learning

Traditional trilevel optimization methods require collecting a massive amount of data at a single server for model training, which may lead to data privacy risks. Solving trilevel optimization problems in a distributed manner is challenging since the trilevel optimization problem is highly nested, which hinders the development of distributed algorithms. The distributed trilevel optimization problem can be expressed as

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_1, x_2^*, x_3^*) \\
\text{s.t.} & x_2^* = \arg\min_{x_2} \sum_{j=1}^{N} f_{2,j}(x_1, x_2, x_3^*) \\
& \qquad \text{s.t. } x_3^* = \arg\min_{x_3} \sum_{j=1}^{N} f_{3,j}(x_1, x_2, x_3) \\
\text{var.} & x_1, x_2, x_3,
\end{array}
\tag{2}
$$

where $N$ denotes the number of workers in the distributed system and $f_{1,j}, f_{2,j}, f_{3,j}$ denote the local first-, second-, and third-level objectives at worker $j$, respectively. The problem in Eq. (2) can be reformulated as a consensus problem (Zhang and Kwok 2014; Jiao, Yang, and Song 2022),

$$
\begin{array}{ll}
\min & \sum_{j} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \{x_{2,j}^*\}, z_2^* = \arg\min_{\{x_{2,j}\}, z_2} \sum_{j} f_{2,j}(z_1, x_{2,j}, x_{3,j}^*) \\
& \qquad \text{s.t. } x_{2,j} = z_2,\ j = 1, \dots, N \\
& \qquad\quad \{x_{3,j}^*\}, z_3^* = \arg\min_{\{x_{3,j}\}, z_3} \sum_{j} f_{3,j}(z_1, z_2, x_{3,j}) \\
& \qquad\qquad \text{s.t. } x_{3,j} = z_3,\ j = 1, \dots, N \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3,
\end{array}
\tag{3}
$$

where $x_{1,j} \in \mathbb{R}^{d_1}, x_{2,j} \in \mathbb{R}^{d_2}, x_{3,j} \in \mathbb{R}^{d_3}$ denote the local variables at worker $j$, and $z_1 \in \mathbb{R}^{d_1}, z_2 \in \mathbb{R}^{d_2}, z_3 \in \mathbb{R}^{d_3}$ denote the consensus variables at the master. The reformulation in Eq. (3) facilitates the development of distributed algorithms for trilevel optimization based on the parameter-server architecture (Assran et al. 2020). The remainder of the proposed method can be divided into three steps: first, the hyper-polyhedral approximation for distributed TLO problems is constructed; then, an effective asynchronous federated algorithm is developed; finally, the µ-cuts are updated to refine the hyper-polyhedral approximation.

### Hyper-Polyhedral Approximation

Different from traditional polyhedral approximation methods (Bertsekas 2015; Franc, Sonnenburg, and Werner 2011; Bürger, Notarstefano, and Allgöwer 2013), a novel hyper-polyhedral approximation method is proposed for distributed TLO problems in this work. Utilizing the proposed hyper-polyhedral approximation makes distributed algorithms easier to develop for TLO problems. Specifically, the proposed hyper-polytope consists of the Ist layer and IInd layer polytopes, which are introduced as follows.

**Ist layer Polyhedral Approximation:** First, define

$$
h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{3,j}\} \\ z_3 \end{bmatrix} - \phi_{\mathrm{I}}(z_1, z_2) \right\|^2
$$

and $\phi_{\mathrm{I}}(z_1, z_2) = \arg\min_{\{x_{3,j}\}, z_3} \{\sum_{j=1}^{N} f_{3,j}(z_1, z_2, x_{3,j}) : x_{3,j} = z_3, \forall j\}$. In trilevel optimization, the third-level optimization problem can be viewed as a constraint on the second-level optimization problem (Chen et al. 2022a), i.e., $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = 0$. A consensus problem would need to be solved in a distributed manner if the exact $\phi_{\mathrm{I}}(z_1, z_2)$ were required.
In many works on bilevel (Ji, Yang, and Liang 2021; Liu et al. 2021; Jiao et al. 2022b) and trilevel (Sato, Tanaka, and Takeda 2021) optimization, the exact $\phi_{\mathrm{I}}(z_1, z_2)$ is replaced by an estimate; following (Jiao et al. 2022b), we utilize the result after $K$ communication rounds between the master and the workers as the estimate of $\phi_{\mathrm{I}}(z_1, z_2)$. Specifically, for the third-level optimization problem, the augmented Lagrangian function can be written as

$$
\mathcal{L}_{p,3} = \sum_{j=1}^{N}\Big(f_{3,j}(z_1, z_2, x_{3,j}) + \phi_{3,j}^{\top}(x_{3,j} - z_3) + \frac{\kappa_3}{2}\|x_{3,j} - z_3\|^2\Big),
\tag{4}
$$

where $\mathcal{L}_{p,3} = \mathcal{L}_{p,3}(z_1, z_2, z_3, \{x_{3,j}\}, \{\phi_{3,j}\})$, $\phi_{3,j} \in \mathbb{R}^{d_3}$ is the dual variable, and the constant $\kappa_3 > 0$ is a penalty parameter. In the $(k+1)$-th communication round:

1) Workers update the local variables,
$$
x^{k+1}_{3,j} = x^{k}_{3,j} - \eta_x \nabla_{x_{3,j}} \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{5}
$$
where $\eta_x$ is the step-size. The workers then transmit the local variables $x^{k+1}_{3,j}$ to the master.

2) The master updates its variables as follows,
$$
z^{k+1}_3 = z^{k}_3 - \eta_z \nabla_{z_3} \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{6}
$$
$$
\phi^{k+1}_{3,j} = \phi^{k}_{3,j} + \eta_\phi \nabla_{\phi_{3,j}} \mathcal{L}_{p,3}(z_1, z_2, z^{k+1}_3, \{x^{k+1}_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{7}
$$
where $\eta_z$ and $\eta_\phi$ are the step-sizes. The master then broadcasts $z^{k+1}_3$ and $\phi^{k+1}_{3,j}$ to the workers. The result after $K$ communication rounds is utilized as the estimate of $\phi_{\mathrm{I}}(z_1, z_2)$, that is,

$$
\phi_{\mathrm{I}}(z_1, z_2) = \begin{bmatrix} \{x^0_{3,j} - \sum_{k=0}^{K-1}\eta_x \nabla_{x_{3,j}} \mathcal{L}^{k}_{p,3}\} \\ z^0_3 - \sum_{k=0}^{K-1}\eta_z \nabla_{z_3} \mathcal{L}^{k}_{p,3} \end{bmatrix},
\tag{8}
$$

where $\mathcal{L}^{k}_{p,3} = \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\})$. Based on Eq. (8) and the definition of $h_{\mathrm{I}}$, we have

$$
h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{3,j} - x^0_{3,j} + \sum_{k=0}^{K-1}\eta_x \nabla_{x_{3,j}} \mathcal{L}^{k}_{p,3}\} \\ z_3 - z^0_3 + \sum_{k=0}^{K-1}\eta_z \nabla_{z_3} \mathcal{L}^{k}_{p,3} \end{bmatrix} \right\|^2.
\tag{9}
$$
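The minimal sketch below mimics the K communication rounds of Eqs. (5)–(7) on a toy consensus instance with scalar variables and quadratic local objectives; only the worker/master/dual update pattern follows the text, while the objectives, step sizes, and K are illustrative assumptions.

```python
# Sketch of the K rounds that produce the estimate of phi_I(z1, z2) in Eq. (8).
# Assumed toy local objectives: f_{3,j}(z1, z2, x) = 0.5*(x - a_j*z2)^2.
import numpy as np

N, K = 4, 500
kappa3, eta_x, eta_z, eta_phi = 1.0, 0.2, 0.1, 0.1
a = np.array([0.5, 1.0, 1.5, 2.0])              # per-worker data (assumed)
z1, z2 = 0.3, 1.0                               # upper-level variables, held fixed here

x3 = np.zeros(N)                                # local variables x_{3,j}
z3 = 0.0                                        # consensus variable z_3
phi = np.zeros(N)                               # dual variables phi_{3,j}

for k in range(K):
    grad_x = (x3 - a * z2) + phi + kappa3 * (x3 - z3)   # grad of L_{p,3} w.r.t. x_{3,j}
    x3_new = x3 - eta_x * grad_x                        # Eq. (5): worker step
    grad_z = np.sum(-phi - kappa3 * (x3 - z3))          # grad of L_{p,3} w.r.t. z_3
    z3_new = z3 - eta_z * grad_z                        # Eq. (6): master step
    phi = phi + eta_phi * (x3_new - z3_new)             # Eq. (7): dual ascent step
    x3, z3 = x3_new, z3_new

# the stacked ({x3}, z3) after K rounds serves as the estimate of phi_I(z1, z2);
# for these toy objectives it should approach the consensus value mean(a_j)*z2 = 1.25
# for small enough step sizes and large enough K
print(np.round(x3, 3), round(z3, 3))
```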
Inspired by polyhedral approximation methods (Bertsekas 2015; Bürger, Notarstefano, and Allgöwer 2013), the Ist layer polytope, which consists of a set of cutting planes (i.e., linear inequalities), is utilized to approximate the feasible region of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{I}}$, a relaxed form of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = 0$ in Eq. (9), where $\varepsilon_{\mathrm{I}} > 0$ is a pre-set constant. Specifically, the Ist layer polytope in the $(t+1)$-th iteration can be expressed as
$$
P^{t}_{\mathrm{I}} = \Big\{(a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}|\Big\},
$$
where $|P^{t}_{\mathrm{I}}|$ denotes the number of cutting planes in the Ist layer polytope and $a^{\mathrm{I}}_{i,l}, b^{\mathrm{I}}_{j,l}, c^{\mathrm{I}}_{l}$ are the parameters of the $l$-th cutting plane (Ist layer µ-cut). Defining $\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) = (a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j}$, the resulting (bilevel) problem can be expressed as

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \{x_{2,j}^*\}, z_2^* = \arg\min_{\{x_{2,j}\}, z_2} \sum_{j=1}^{N} f_{2,j}(z_1, x_{2,j}, x_{3,j}) \\
& \qquad \text{s.t. } x_{2,j} = z_2,\ j = 1, \dots, N \\
& \qquad\quad \hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}| \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3.
\end{array}
\tag{10}
$$

**IInd layer Polyhedral Approximation:** Define the function $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{2,j}\} \\ z_2 \end{bmatrix} - \phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\}) \right\|^2$, where $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\}) = \arg\min_{\{x_{2,j}\}, z_2} \{\sum_{j=1}^{N} f_{2,j}(z_1, x_{2,j}, x_{3,j}) : x_{2,j} = z_2, \forall j;\ (a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l}, \forall l\}$. In Eq. (10), the lower-level optimization problem can be viewed as a constraint on the upper-level optimization problem (Sinha, Malo, and Deb 2017; Gould et al. 2016), i.e., $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = 0$. Likewise, following (Jiao et al. 2022b), the result after $K$ communication rounds between the master and the workers is utilized as the estimate of $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\})$. Specifically, for the lower-level optimization problem in Eq. (10), the augmented Lagrangian function is given by

$$
\begin{aligned}
\mathcal{L}_{p,2}(z_1, z_2, \{x_{2,j}\}, \{s_l\}, \{\gamma_l\}, \{\phi_{2,j}\}, z_3, \{x_{3,j}\})
= {} & \sum_{j=1}^{N}\Big(f_{2,j}(z_1, x_{2,j}, x_{3,j}) + \phi_{2,j}^{\top}(x_{2,j} - z_2) + \frac{\kappa_2}{2}\|x_{2,j} - z_2\|^2\Big) \\
& + \sum_{l=1}^{|P^{t+1}_{\mathrm{I}}|} \gamma_l\big(\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) - c^{\mathrm{I}}_{l} + s_l\big)
+ \sum_{l=1}^{|P^{t+1}_{\mathrm{I}}|} \frac{\rho_2}{2}\big\|\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) - c^{\mathrm{I}}_{l} + s_l\big\|^2,
\end{aligned}
\tag{11}
$$

where $\gamma_l \in \mathbb{R}^1$ and $\phi_{2,j} \in \mathbb{R}^{d_2}$ are dual variables, $s_l \in \mathbb{R}^1_+, \forall l$ are the slack variables introduced for the inequality constraints, and the constants $\kappa_2 > 0, \rho_2 > 0$ are penalty parameters. The details of each communication round are presented in Appendix B of the supplementary material. After $K$ communication rounds, we obtain the estimate of $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\})$, and the corresponding $h_{\mathrm{II}}$ can be expressed as

$$
h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{2,j} - x^0_{2,j} + \sum_{k=0}^{K-1}\eta_x \nabla_{x_{2,j}} \mathcal{L}^{k}_{p,2}\} \\ z_2 - z^0_2 + \sum_{k=0}^{K-1}\eta_z \nabla_{z_2} \mathcal{L}^{k}_{p,2} \end{bmatrix} \right\|^2,
\tag{12}
$$

where $\mathcal{L}^{k}_{p,2}$ is shorthand for $\mathcal{L}_{p,2}(z_1, z^k_2, \{x^k_{2,j}\}, \{s^k_l\}, \{\gamma^k_l\}, \{\phi^k_{2,j}\}, z_3, \{x_{3,j}\})$. Next, the constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = 0$ is relaxed, and the IInd layer polytope is utilized to approximate the feasible region of the relaxed constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{II}}$. Specifically, the IInd layer polytope in the $(t+1)$-th iteration can be expressed as
$$
P^{t}_{\mathrm{II}} = \Big\{\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}|\Big\},
$$
where $|P^{t}_{\mathrm{II}}|$ represents the number of cutting planes in $P^{t}_{\mathrm{II}}$, and $a^{\mathrm{II}}_{i,l}, b^{\mathrm{II}}_{i,j,l}, c^{\mathrm{II}}_{l}$ are the parameters of the $l$-th cutting plane (IInd layer µ-cut). Thus, the resulting hyper-polyhedral approximation problem is

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}| \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3.
\end{array}
\tag{13}
$$

It is worth mentioning that solving the TLO problem is theoretically NP-hard (even solving the inner bilevel problem in TLO is NP-hard (Ben-Ayed and Blair 1990)). Thus, it is unlikely that a polynomial-time algorithm for the distributed TLO problem can be designed unless P = NP (Arora and Barak 2009). In this work, the hyper-polyhedral approximation problem in Eq. (13) is a convex relaxation of the distributed TLO problem in Eq. (2), and the relaxation is continuously tightened as µ-cuts are added. Detailed discussions are provided in Appendix D.

### Asynchronous Federated Algorithm

Synchronous and asynchronous federated algorithms have different application scenarios (Su et al. 2022). A synchronous algorithm is preferred when the delays of the workers do not differ much, while an asynchronous algorithm suits better when there are stragglers in the distributed system. In this work, an asynchronous algorithm is proposed to solve the trilevel optimization problem. Specifically, the master updates its variables once it receives updates from $S$ ($1 \le S \le N$) workers, i.e., the active workers, in every iteration, and every worker has to communicate with the master at least once every $\tau$ iterations to alleviate staleness issues (Zhang and Kwok 2014). It is worth mentioning that $S$ can be flexibly adjusted based on whether there are stragglers; the proposed algorithm becomes synchronous when we set $S = N$, so the proposed asynchronous algorithm is effective and flexible. First, the Lagrangian function of Eq. (13) can be expressed as

$$
\begin{aligned}
\mathcal{L}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\})
= {} & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) + \sum_{j=1}^{N} \theta_j^{\top}(x_{1,j} - z_1) \\
& + \sum_{l=1}^{|P^{t}_{\mathrm{II}}|} \lambda_l\Big(\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} - c^{\mathrm{II}}_{l}\Big),
\end{aligned}
\tag{14}
$$

where $\lambda_l \in \mathbb{R}^1_+$ and $\theta_j \in \mathbb{R}^{d_1}$ are dual variables.
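For intuition about the structure of Eq. (14), the sketch below evaluates this partial Lagrangian for given primal and dual variables and a list of IInd layer cuts; the stand-in objective $f_{1,j}$ and the random cut parameters are purely illustrative and not taken from the paper.

```python
# Evaluate a Lagrangian of the form in Eq. (14): first-level losses, consensus
# terms theta_j^T (x_{1,j} - z_1), and one multiplier lambda_l per IInd-layer cut.
import numpy as np

def lagrangian_p(x1, x2, x3, z, lam, theta, cuts, f1):
    """x1, x2, x3: lists of per-worker vectors; z = [z1, z2, z3];
    cuts: list of dicts with 'a' (3 vectors), 'b' (2 lists of N vectors), 'c' (scalar)."""
    val = sum(f1(x1[j], x2[j], x3[j]) for j in range(len(x1)))      # first-level losses
    val += sum(theta[j] @ (x1[j] - z[0]) for j in range(len(x1)))   # consensus terms
    for lam_l, cut in zip(lam, cuts):                               # one multiplier per cut
        lhs = sum(cut["a"][i] @ z[i] for i in range(3))
        for i, xs in enumerate((x2, x3)):                           # levels i = 2, 3
            lhs += sum(cut["b"][i][j] @ xs[j] for j in range(len(xs)))
        val += lam_l * (lhs - cut["c"])
    return val

# toy usage with N = 2 workers and all dimensions equal to 2 (illustrative values)
N, d = 2, 2
rng = np.random.default_rng(0)
f1 = lambda a, b, c: float(a @ a + b @ b + c @ c)                   # stand-in f_{1,j}
x1 = [rng.normal(size=d) for _ in range(N)]
x2 = [rng.normal(size=d) for _ in range(N)]
x3 = [rng.normal(size=d) for _ in range(N)]
z = [np.zeros(d), np.zeros(d), np.zeros(d)]
cuts = [{"a": [rng.normal(size=d) for _ in range(3)],
         "b": [[rng.normal(size=d) for _ in range(N)] for _ in range(2)],
         "c": 0.5}]
theta = [np.zeros(d) for _ in range(N)]
print(lagrangian_p(x1, x2, x3, z, lam=[0.1], theta=theta, cuts=cuts, f1=f1))
```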
Following (Xu et al. 2020; Jiao et al. 2022b), a regularized Lagrangian function is used to update the variables:

$$
\hat{\mathcal{L}}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\}) = \mathcal{L}_{p} - \sum_{l=1}^{|P^{t}_{\mathrm{II}}|} \frac{c^{t}_1}{2}\|\lambda_l\|^2 - \sum_{j=1}^{N} \frac{c^{t}_2}{2}\|\theta_j\|^2,
\tag{15}
$$

where $\mathcal{L}_{p} = \mathcal{L}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\})$, and $c^{t}_1, c^{t}_2$ are the regularization terms in the $(t+1)$-th iteration. We set $c^{t}_1 = \frac{c_1}{\eta_\lambda (t+1)^{1/4}}$ and $c^{t}_2 = \frac{c_2}{\eta_\theta (t+1)^{1/4}}$, two nonnegative non-increasing sequences, where $\eta_\lambda, \eta_\theta, c_1, c_2$ are constants and $c_1, c_2$ satisfy $0 < c_1 < \frac{1}{\eta_\lambda}\big((\frac{4M\alpha_4}{\eta_\lambda^2} + \frac{4N\alpha_5}{\eta_\theta^2})\frac{1}{\epsilon}\big)^{\frac{1}{2}}$ and $0 < c_2 < \frac{1}{\eta_\theta}\big((\frac{4M\alpha_4}{\eta_\lambda^2} + \frac{4N\alpha_5}{\eta_\theta^2})\frac{1}{\epsilon}\big)^{\frac{1}{2}}$ ($\epsilon$ refers to the tolerance error, and $\alpha_4, \alpha_5$ are constants introduced below). In the $(t+1)$-th master iteration, $Q^{t+1}$ denotes the index set of active workers, and the proposed asynchronous algorithm proceeds as follows.

(1) Active workers update the local variables as follows,

$$
x^{t+1}_{i,j} = \begin{cases} x^{t}_{i,j} - \eta_{x_i} \nabla_{x_{i,j}} \hat{\mathcal{L}}^{\hat{t}_j}_{p}, & j \in Q^{t+1} \\ x^{t}_{i,j}, & j \notin Q^{t+1} \end{cases}, \quad \forall i,
\tag{16}
$$

where $\eta_{x_i}$ ($i = 1, 2, 3$) denote the step-sizes, $\hat{\mathcal{L}}^{\hat{t}_j}_{p} = \hat{\mathcal{L}}_{p}(\{x^{\hat{t}_j}_{i,j}\}, \{z^{\hat{t}_j}_i\}, \{\lambda^{\hat{t}_j}_l\}, \{\theta^{\hat{t}_j}_j\})$, and $\hat{t}_j$ denotes the last iteration in which worker $j$ was active. The active workers (i.e., workers $j \in Q^{t+1}$) then transmit the updated local variables $x^{t+1}_{i,j}, \forall i$, to the master.

(2) After receiving the updates from the workers, the master updates its variables as follows,

$$
z^{t+1}_1 = z^{t}_1 - \eta_{z_1} \nabla_{z_1} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{17}
$$
$$
z^{t+1}_2 = z^{t}_2 - \eta_{z_2} \nabla_{z_2} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, z^{t+1}_1, z^{t}_2, z^{t}_3, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{18}
$$
$$
z^{t+1}_3 = z^{t}_3 - \eta_{z_3} \nabla_{z_3} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t}_3, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{19}
$$
$$
\lambda^{t+1}_l = P_{\Lambda}\big(\lambda^{t}_l + \eta_\lambda \nabla_{\lambda_l} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t+1}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big),
\tag{20}
$$
$$
\theta^{t+1}_j = P_{\Theta}\big(\theta^{t}_j + \eta_\theta \nabla_{\theta_j} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t+1}_i\}, \{\lambda^{t+1}_l\}, \{\theta^{t}_j\})\big),
\tag{21}
$$

where $\eta_{z_1}, \eta_{z_2}, \eta_{z_3}, \eta_\lambda, \eta_\theta$ denote the step-sizes, and $P_{\Lambda}$ and $P_{\Theta}$ represent the projections onto the sets $\Lambda = \{\lambda_l \mid 0 \le \lambda_l \le \alpha_4\}$ and $\Theta = \{\theta_j \mid \|\theta_j\| \le \alpha_5/d_1\}$, where $\alpha_4 > 0$ and $\alpha_5 > 0$ are constants. The master then broadcasts the updated variables, i.e., $z^{t+1}_1, z^{t+1}_2, z^{t+1}_3, \{\lambda^{t+1}_l\}, \theta^{t+1}_j$, to each active worker $j$. Details are summarized in Algorithm 1.
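Before the formal listing in Algorithm 1, the following simplified simulation sketches the asynchronous protocol of Eqs. (16)–(21): only S of the N workers are active in each master iteration, inactive workers keep stale variables, and the master performs its gradient and regularized dual steps. The IInd layer cut terms (the λ multipliers) are omitted, and the quadratic objective, step sizes, and constants are illustrative assumptions rather than the paper's setup.

```python
# A skeleton of the asynchronous master/worker protocol (not the full method).
# Assumed stand-in objective: f_{1,j}(x) = 0.5*||x - A_j||^2.
import numpy as np

rng = np.random.default_rng(0)
N, S, T, d = 5, 2, 400, 3
eta_x, eta_z, eta_theta, alpha5, c2 = 0.1, 0.05, 0.05, 10.0, 0.1
A = rng.normal(size=(N, d))                        # per-worker data (assumed)

x1 = np.zeros((N, d))                              # local variables x_{1,j}
z1 = np.zeros(d)                                   # consensus variable z_1
theta = np.zeros((N, d))                           # dual variables theta_j
theta_stale = theta.copy()                         # each worker's last-received copy

for t in range(T):
    active = rng.choice(N, size=S, replace=False)  # Q^{t+1}: active workers
    for j in active:                               # Eq. (16): local step with stale duals
        x1[j] -= eta_x * ((x1[j] - A[j]) + theta_stale[j])
    z1 -= eta_z * (-theta.sum(axis=0))             # Eq. (17): master step on z_1
    c_t = c2 / (eta_theta * (t + 1) ** 0.25)       # decaying dual regularization (Eq. (15))
    theta += eta_theta * ((x1 - z1) - c_t * theta) # Eq. (21): regularized dual ascent
    norms = np.linalg.norm(theta, axis=1, keepdims=True)
    theta = np.where(norms > alpha5, theta * alpha5 / norms, theta)   # projection onto a ball
    theta_stale[active] = theta[active]            # broadcast to active workers only

# for this toy objective z_1 should drift toward the average of the A_j
print(np.round(z1, 3), np.round(A.mean(axis=0), 3))
```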
**Algorithm 1: Asynchronous Federated Trilevel Learning**

```
Initialization: master iteration t = 0, variables {x^0_{1,j}}, {x^0_{2,j}}, {x^0_{3,j}}, z^0_1, z^0_2, z^0_3, {λ^0_l}, {θ^0_j}.
repeat
    for each active worker do
        update variables x^{t+1}_{1,j}, x^{t+1}_{2,j}, x^{t+1}_{3,j} by Eq. (16);
    end for
    active workers send the updated local variables to the master;
    for the master do
        update variables z^{t+1}_1, z^{t+1}_2, z^{t+1}_3, {λ^{t+1}_l}, {θ^{t+1}_j} by Eqs. (17), (18), (19), (20), and (21);
    end for
    the master broadcasts the updated variables to the active workers;
    if (t + 1) mod T_pre == 0 and t < T_1 then
        generate a new Ist layer µ-cut cp_I by Eq. (23) and add it to the Ist layer polytope;
        generate a new IInd layer µ-cut cp_II by Eq. (24) and add it to the IInd layer polytope;
        remove inactive Ist and IInd layer µ-cuts by Eq. (25);
    end if
    t = t + 1;
until termination.
```

### Refining Hyper-polyhedral Approximation

In this section, a novel µ-cut is proposed, which can be utilized for non-convex (µ-weakly convex) optimization problems and is thus more general than traditional cutting planes designed for convex optimization (Jiao et al. 2022b; Franc, Sonnenburg, and Werner 2011). We demonstrate in Propositions 1 and 2 that the proposed µ-cuts are valid, i.e., the original feasible region is a subset of the polytope formed by the µ-cuts. Every $T_{pre}$ iterations, while $t < T_1$, the µ-cuts are updated to refine the hyper-polyhedral approximation in three steps: 1) generating a new Ist layer µ-cut, 2) generating a new IInd layer µ-cut, and 3) removing inactive µ-cuts.

**Generating a new Ist layer µ-cut:** Following (Qian et al. 2019), we assume the variables are bounded, i.e., $\|x_{i,j}\|^2 \le \alpha_i$, $\|z_i\|^2 \le \alpha_i$, $i = 1, 2, 3$, and that $h_{\mathrm{I}}$ is µ-weakly convex. It is demonstrated in Appendix E that $h_{\mathrm{I}}$ is µ-weakly convex in many cases. Following (Xie, Koyejo, and Gupta 2019; Davis and Drusvyatskiy 2019), the definition and first-order condition of a µ-weakly convex function are given as follows.

**Definition 1 (µ-weakly convex).** A differentiable function $f(x)$ is µ-weakly convex if the function $g(x) = f(x) + \frac{\mu}{2}\|x\|^2$ is convex.

**Definition 2 (First-order condition).** For any $x, x'$, a differentiable function $f(x)$ is µ-weakly convex if and only if the following inequality holds:
$$
f(x) \ge f(x') + \nabla f(x')^{\top}(x - x') - \frac{\mu}{2}\|x - x'\|^2.
\tag{22}
$$

Combining the first-order condition of µ-weakly convex functions with the Cauchy-Schwarz inequality, a new kind of cutting plane, i.e., the µ-cut, can be generated. Specifically, the new Ist layer µ-cut $cp_{\mathrm{I}}$ for the Ist layer polytope can be expressed as

$$
\begin{bmatrix}
\{\nabla_{x_{3,j}} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\nabla_{z_1} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_2} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_3} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\end{bmatrix}^{\top}
\begin{bmatrix}
\{x_{3,j} - x^{t+1}_{3,j}\} \\
z_1 - z^{t+1}_1 \\
z_2 - z^{t+1}_2 \\
z_3 - z^{t+1}_3
\end{bmatrix}
+ h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\le \varepsilon_{\mathrm{I}} + \mu\Big(\alpha_1 + \alpha_2 + (N+1)\alpha_3 + \sum_{j=1}^{N}\|x^{t+1}_{3,j}\|^2 + \|z^{t+1}_1\|^2 + \|z^{t+1}_2\|^2 + \|z^{t+1}_3\|^2\Big).
\tag{23}
$$

**Proposition 1.** The feasible region of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{I}}$ is a subset of the Ist layer polytope $P^{t}_{\mathrm{I}} = \{(a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}|\}$. In addition, $P^{t}_{\mathrm{I}}$ converges monotonically with the number of µ-cuts.

The proof is given in Appendix C. Note that if $h_{\mathrm{I}}$ is convex, i.e., µ = 0, the generated cutting plane coincides with that in (Franc, Sonnenburg, and Werner 2011; Jiao et al. 2022b), which is designed for convex optimization. Thus, the proposed µ-cut is more general than prior work in the literature. Consequently, the Ist layer polytope is updated as $P^{t+1}_{\mathrm{I}} = \mathrm{Add}(P^{t}_{\mathrm{I}}, cp_{\mathrm{I}})$, where $\mathrm{Add}(P^{t}_{\mathrm{I}}, cp_{\mathrm{I}})$ represents adding the new µ-cut $cp_{\mathrm{I}}$ to the polytope $P^{t}_{\mathrm{I}}$.

**Generating a new IInd layer µ-cut:** Based on the updated Ist layer polytope, the IInd layer polytope is updated. The newly generated IInd layer µ-cut $cp_{\mathrm{II}}$ can be written as

$$
\begin{bmatrix}
\{\nabla_{x_{2,j}} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\{\nabla_{x_{3,j}} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\nabla_{z_1} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_2} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_3} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\end{bmatrix}^{\top}
\begin{bmatrix}
\{x_{2,j} - x^{t+1}_{2,j}\} \\
\{x_{3,j} - x^{t+1}_{3,j}\} \\
z_1 - z^{t+1}_1 \\
z_2 - z^{t+1}_2 \\
z_3 - z^{t+1}_3
\end{bmatrix}
+ h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\le \varepsilon_{\mathrm{II}} + \mu\Big(\alpha_1 + (N+1)(\alpha_2 + \alpha_3) + \sum_{i=2}^{3}\sum_{j=1}^{N}\|x^{t+1}_{i,j}\|^2 + \sum_{i=1}^{3}\|z^{t+1}_i\|^2\Big).
\tag{24}
$$

Consequently, the IInd layer polytope is updated as $P^{t+1}_{\mathrm{II}} = \mathrm{Add}(P^{t}_{\mathrm{II}}, cp_{\mathrm{II}})$.

**Proposition 2.** The feasible region of the constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, \{z_i\}) \le \varepsilon_{\mathrm{II}}$ is a subset of the IInd layer polytope $P^{t}_{\mathrm{II}} = \{\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}|\}$, and $P^{t}_{\mathrm{II}}$ converges monotonically with the number of µ-cuts.

The proof is given in Appendix C.
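To see how Eqs. (22)–(24) turn a µ-weakly convex constraint into a linear cut, the hedged sketch below works with a single stacked variable w in place of the paper's block variables: the first-order condition plus the norm bound yields coefficients (g, c) of a cut of the form g^T w ≤ c. The toy function h and all constants are assumptions for illustration.

```python
# Building a mu-cut from the first-order condition (22): if h is mu-weakly convex,
# h(w) <= eps, and ||w||^2 <= alpha, then g^T w <= c with (g, c) computed below.
import numpy as np

def make_mu_cut(h, grad_h, w_ref, mu, eps, alpha):
    """Return (g, c) defining the cut g^T w <= c. Every w with h(w) <= eps and
    ||w||^2 <= alpha satisfies it, by Eq. (22) and the norm bounds."""
    g = grad_h(w_ref)
    c = eps - h(w_ref) + mu * (alpha + w_ref @ w_ref) + g @ w_ref
    return g, c

# h(w) = cos(w_0) + cos(w_1) is non-convex but 1-weakly convex (its Hessian is >= -I)
h = lambda w: float(np.cos(w).sum())
grad_h = lambda w: -np.sin(w)

eps, alpha = 0.5, 4.0
w_ref = np.array([1.0, -0.5])                        # current iterate (e.g. w^{t+1})
g, c = make_mu_cut(h, grad_h, w_ref, mu=1.0, eps=eps, alpha=alpha)

w = np.array([1.5, 1.2])                             # a point in the original feasible region
print(h(w) <= eps and w @ w <= alpha, g @ w <= c)    # expect: True True
```

By construction, any point satisfying $h(w) \le \varepsilon$ and $\|w\|^2 \le \alpha$ also satisfies the returned cut, which mirrors the validity statements of Propositions 1 and 2.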
**Removing inactive µ-cuts:** Removing inactive cutting planes can enhance the efficiency of the proposed algorithm (Yang et al. 2014; Jiao, Yang, and Song 2022). The inactive Ist and IInd layer µ-cuts are removed, so the corresponding Ist and IInd layer polytopes $P^{t+1}_{\mathrm{I}}$ and $P^{t+1}_{\mathrm{II}}$ are updated as follows:

$$
P^{t+1}_{\mathrm{I}} = \begin{cases} \mathrm{Drop}(P^{t+1}_{\mathrm{I}}, cp^{\mathrm{I}}_{l}), & \text{if } \gamma^{K}_{l} = 0 \\ P^{t+1}_{\mathrm{I}}, & \text{otherwise} \end{cases}, \qquad
P^{t+1}_{\mathrm{II}} = \begin{cases} \mathrm{Drop}(P^{t+1}_{\mathrm{II}}, cp^{\mathrm{II}}_{l}), & \text{if } \lambda^{t+1}_{l} = 0 \\ P^{t+1}_{\mathrm{II}}, & \text{otherwise} \end{cases},
\tag{25}
$$

where $\mathrm{Drop}(P, cp_l)$ represents removing the $l$-th cutting plane $cp_l$ from the polytope $P$.

## Discussion

**Definition 3 (Stationarity gap).** Following (Xu et al. 2020; Lu et al. 2020; Jiao, Yang, and Song 2022), the stationarity gap of our problem at the $t$-th iteration is defined as

$$
\nabla G^{t} = \begin{bmatrix}
\{\nabla_{x_{i,j}} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{\nabla_{z_i} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{G_{\lambda_l}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{G_{\theta_j}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\}
\end{bmatrix},
\tag{26}
$$

where

$$
\begin{aligned}
G_{\lambda_l}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}) &= \frac{1}{\eta_\lambda}\Big(\lambda^{t}_l - P_{\Lambda}\big(\lambda^{t}_l + \eta_\lambda \nabla_{\lambda_l} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big)\Big), \\
G_{\theta_j}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}) &= \frac{1}{\eta_\theta}\Big(\theta^{t}_j - P_{\Theta}\big(\theta^{t}_j + \eta_\theta \nabla_{\theta_j} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big)\Big).
\end{aligned}
\tag{27}
$$

**Definition 4 (ϵ-stationary point).** If $\|\nabla G^{t}\|^2 \le \epsilon$, then $(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})$ is an ϵ-stationary point ($\epsilon \ge 0$) of the differentiable function $\mathcal{L}_{p}$. $T(\epsilon)$ is the first iteration index at which $\|\nabla G^{t}\|^2 \le \epsilon$, i.e., $T(\epsilon) = \min\{t \mid \|\nabla G^{t}\|^2 \le \epsilon\}$.

**Assumption 1 (Gradient Lipschitz).** Following (Ji, Yang, and Liang 2021), we assume that $\mathcal{L}_{p}$ has Lipschitz continuous gradients, i.e., for any $\omega, \omega'$, there exists $L > 0$ such that

$$
\|\nabla \mathcal{L}_{p}(\omega) - \nabla \mathcal{L}_{p}(\omega')\| \le L\|\omega - \omega'\|.
\tag{28}
$$

**Assumption 2 (Boundedness).** Following (Qian et al. 2019; Jiao et al. 2022b), we assume $\|x_{i,j}\|^2 \le \alpha_i$ and $\|z_i\|^2 \le \alpha_i$, $i = 1, 2, 3$. We also assume that before the ϵ-stationary point is obtained (i.e., $t \le T(\epsilon) - 1$), the variables at the master satisfy $\sum_i \|z^{t+1}_i - z^{t}_i\|^2 + \sum_l \|\lambda^{t+1}_l - \lambda^{t}_l\|^2 \le \vartheta$, where $\vartheta > 0$ is a relatively small constant. The change of the master's variables is upper bounded within $\tau$ iterations:

$$
\|z^{t}_i - z^{t-k}_i\|^2 \le \tau k_1 \vartheta, \qquad \sum_l \|\lambda^{t}_l - \lambda^{t-k}_l\|^2 \le \tau k_1 \vartheta, \qquad 1 \le k \le \tau,
\tag{29}
$$

where $k_1 > 0$ is a constant. Detailed discussions of Assumptions 1 and 2 are provided in Appendix I.

**Theorem 1 (Iteration Complexity).** Suppose Assumptions 1 and 2 hold, and set the step-sizes as $\eta_{x_i} = \eta_{z_i} = \frac{2}{L + \eta_\lambda M L^2 + \eta_\theta N L^2 + 8\big(\frac{M\gamma L^2}{\eta_\lambda c_1^2} + \frac{N\gamma L^2}{\eta_\theta c_2^2}\big)}, \forall i$, $\eta_\theta \le \frac{2}{L + 2c^{0}_{2}}$ and $\eta_\lambda$