# Provably Convergent Federated Trilevel Learning

Yang Jiao1, Kai Yang1,2,3*, Tiancheng Wu1, Chengtao Jian1, Jianwei Huang4,5
1 Department of Computer Science and Technology, Tongji University
2 Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University
3 Shanghai Research Institute for Intelligent Autonomous Systems
4 School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
5 Shenzhen Institute of Artificial Intelligence and Robotics for Society
yangjiao@tongji.edu.cn, kaiyang@tongji.edu.cn, tony318@tongji.edu.cn, jct@tongji.edu.cn, jianweihuang@cuhk.edu.cn
*Corresponding author (e-mail: kaiyang@tongji.edu.cn).

Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision processes and is widely applied in machine learning applications such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems presents a great challenge due to their nested decision-making structure. In addition, existing works on TLO face two key limitations: 1) they all focus on the non-distributed setting, which may lead to privacy breaches; 2) they do not offer any non-asymptotic convergence analysis, which characterizes how fast an algorithm converges. To address these challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes µ-cuts to construct a hyper-polyhedral approximation of the TLO problem and solves it in an asynchronous manner. We demonstrate that the proposed µ-cuts are applicable not only to convex functions but also to a wide range of non-convex functions that satisfy the µ-weak convexity assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate of the proposed method by showing that its iteration complexity to obtain an ϵ-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Extensive experiments on real-world datasets elucidate the superiority of the proposed method, e.g., it has a faster convergence rate, with a maximum acceleration of approximately 80%.

## Introduction

Recently, trilevel learning, also called trilevel optimization (TLO), has found applications in many machine learning tasks, e.g., robust neural architecture search (Guo et al. 2020), robust hyperparameter optimization (Sato, Tanaka, and Takeda 2021), and domain adaptation (Raghu et al. 2021). Trilevel optimization problems are optimization problems that involve three nested levels of optimization and thus have a trilevel hierarchy (Avraamidou 2018; Sato, Tanaka, and Takeda 2021). A general form of the trilevel optimization problem is given by

$$
\begin{array}{ll}
\min & f_1(x_1, x_2^*, x_3^*) \\
\text{s.t.} & x_2^* = \arg\min_{x_2} f_2(x_1, x_2, x_3^*) \\
& \qquad \text{s.t. } x_3^* = \arg\min_{x_3} f_3(x_1, x_2, x_3) \\
\text{var.} & x_1, x_2, x_3,
\end{array}
\tag{1}
$$

where $f_1, f_2, f_3$ respectively denote the first-, second-, and third-level objectives, and $x_1 \in \mathbb{R}^{d_1}$, $x_2 \in \mathbb{R}^{d_2}$, $x_3 \in \mathbb{R}^{d_3}$ are the variables. Despite its wide applications, the development of solution methods has been predominantly limited to bilevel optimization (BLO) (Ji, Yang, and Liang 2021; Franceschi et al. 2018), primarily due to the escalated difficulty of solving the TLO problem (Sato, Tanaka, and Takeda 2021).
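To make the nested structure in Eq. (1) concrete, the following toy sketch approximately solves a three-level quadratic instance by replacing each lower level with K gradient-descent steps and driving the first level with a finite-difference hypergradient, in the spirit of the hypergradient-based methods discussed below. The specific objectives, step sizes, and K are illustrative assumptions, not the paper's method.

```python
# A toy numerical sketch (not the paper's algorithm) of the nested structure in Eq. (1).
# The lower two levels are approximately solved by K gradient steps; the first level
# is updated with a finite-difference hypergradient.
#
#   f3(x1, x2, x3)  = 0.5*(x3 - x2)^2           -> x3*(x2) = x2
#   f2(x1, x2, x3*) = 0.5*(x2 - x1)^2 + 0.5*x3*^2
#   f1(x1, x2*, x3*) = (x1 - 1)^2 + x2*^2

def solve_level3(x2, K=30, eta=0.2):
    x3 = 0.0
    for _ in range(K):
        x3 -= eta * (x3 - x2)                 # gradient of f3 w.r.t. x3
    return x3

def solve_level2(x1, K=30, eta=0.2):
    x2 = 0.0
    for _ in range(K):
        x3 = solve_level3(x2)                 # x3*(x2) ~= x2
        x2 -= eta * ((x2 - x1) + x3)          # total derivative of f2(x1, x2, x3*(x2))
    return x2, solve_level3(x2)

def F(x1):                                    # first-level value function
    x2, _x3 = solve_level2(x1)
    return (x1 - 1.0) ** 2 + x2 ** 2

x1, delta, lr = 0.0, 1e-4, 0.1
for _ in range(100):                          # finite-difference hypergradient descent
    g = (F(x1 + delta) - F(x1 - delta)) / (2 * delta)
    x1 -= lr * g
print(f"x1 = {x1:.3f}")
```

On this toy instance the first-level variable converges to about 0.8, the minimizer of the reduced objective $(x_1 - 1)^2 + (x_1/2)^2$.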
The literature, specifically (Blair 1992; Avraamidou 2018), highlights that the complexity of solving problems with hierarchical structures of more than two levels is substantially greater than that of bilevel optimization problems. Theoretical work on solving TLO problems has only emerged in the last few years. A hypergradient (gradient)-based method is proposed in (Sato, Tanaka, and Takeda 2021), which uses K gradient descent steps to replace the lower-level problems when solving TLO problems; it is one of the first results that establishes theoretical guarantees for solving the TLO problem. A general automatic differentiation technique is proposed in (Choe et al. 2022), based on interpreting TLO as a special type of dataflow graph. Nevertheless, some issues remain unaddressed in prior work: 1) in TLO applications, data may be acquired and stored across multiple nodes, yet prior works only solve TLO problems in a non-distributed manner, which requires collecting a massive amount of data at a single server and may lead to data privacy risks (Subramanya and Riggio 2021; Jiao et al. 2022b; Han, Wang, and Leung 2020); moreover, synchronous federated algorithms often suffer from straggler problems and stop working entirely if some workers fail to communicate (Jiao et al. 2022b), so developing asynchronous federated algorithms for TLO is highly important; 2) existing TLO works only provide asymptotic convergence guarantees for their algorithms, whereas understanding the convergence speed of an algorithm requires a non-asymptotic convergence analysis that characterizes how fast it converges in practice.

To this end, we propose an Asynchronous Federated Trilevel Optimization method (AFTO) in this paper. AFTO effectively solves TLO problems in an asynchronous federated manner. Specifically, it treats the lower-level optimization problem as a constraint on the upper level and utilizes µ-cuts to construct a hyper-polyhedral approximation; an effective asynchronous algorithm is then developed. In trilevel learning problems, the objective functions at each level are usually non-convex, so cutting-plane methods tailored to convex functions (Jiao et al. 2022b; Franc, Sonnenburg, and Werner 2011) are inapplicable. To the best of our knowledge, the proposed µ-cut is the first approach capable of constructing cutting planes for trilevel learning problems with non-convex objectives. Furthermore, we demonstrate that the proposed method is guaranteed to converge and theoretically analyze its non-asymptotic convergence rate in terms of iteration complexity. The contributions of this work are summarized as follows.

1. An asynchronous federated trilevel optimization method is proposed for trilevel learning. To the best of our knowledge, this is the first work that designs algorithms to solve the trilevel learning problem in an asynchronous distributed manner.
2. A novel hyper-polyhedral approximation method via µ-cuts is proposed. The proposed µ-cut can be applied to trilevel learning with non-convex objectives. We further demonstrate that the iteration complexity of the proposed method to achieve an ϵ-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$.
3. Extensive experiments on real-world datasets justify the superiority of the proposed method and underscore the significant benefits of employing the hyper-polyhedral approximation for trilevel learning.

## Related Work

### Trilevel Optimization

Trilevel optimization has many applications ranging from economics to machine learning. A robust neural architecture search approach is proposed in (Guo et al. 2020), which integrates adversarial training into one-shot neural architecture search and can be regarded as solving a trilevel optimization problem. TimeAutoAD (Jiao et al. 2022a) automatically configures the anomaly detection pipeline and optimizes the hyperparameters for multivariate time series anomaly detection; the optimization problem it aims to solve can be viewed as a trilevel optimization problem. A method is proposed in (Raghu et al. 2021) to solve a trilevel optimization problem that involves hyperparameter optimization together with two-level pretraining and finetuning. LFM (Garg et al. 2022) solves a trilevel optimization problem consisting of data reweighting, architecture search, and model training. A general automatic differentiation technique, Betty, is proposed in (Choe et al. 2022) and can be utilized to solve trilevel optimization problems. However, the aforementioned algorithms do not provide any convergence guarantee. A hypergradient-based algorithm with an asymptotic convergence guarantee is proposed in (Sato, Tanaka, and Takeda 2021), which can be employed for trilevel optimization problems. Nevertheless, the existing works focus on solving TLO problems in a non-distributed manner and do not provide any non-asymptotic convergence analysis. In contrast, this work proposes an efficient asynchronous algorithm with a non-asymptotic convergence guarantee for solving TLO problems. To the best of our knowledge, this is the first work that solves TLO problems in an asynchronous federated manner.

### Polyhedral Approximation

Polyhedral approximation is a widely used approximation method (Bertsekas 2015). The idea is to approximate either the feasible region or the epigraph of the objective function of an optimization problem by a set of cutting planes, and to gradually refine the approximation by adding further cutting planes. Since the approximate problem is polyhedral, it is usually much easier to solve than the original problem. Following (Bertsekas 2015), polyhedral approximation can be broadly divided into two main approaches: outer linearization and inner linearization. Outer linearization (Tawarmalani and Sahinidis 2005; Yang et al. 2008; Bürger, Notarstefano, and Allgöwer 2013), also called the cutting plane method, utilizes a set of cutting planes to approximate the feasible region or the epigraph of the objective function from the outside. In contrast, inner linearization (Bertsekas and Yu 2011; Trombettoni et al. 2011) utilizes the convex hulls of finite numbers of halflines or points to approximate the feasible region or the epigraph of the objective function from within. Polyhedral approximation has been widely used in convex optimization; a polyhedral approximation method for convex optimization is proposed in (Bertsekas 2015), which utilizes cutting planes to approximate the original convex optimization problem. In (Bürger, Notarstefano, and Allgöwer 2013), a fully distributed algorithm based on an outer polyhedral approximation of the constraint sets is proposed for convex and robust distributed optimization problems in peer-to-peer networks. In this work, a novel hyper-polyhedral approximation method via µ-cuts is proposed for TLO. The proposed µ-cut can be utilized for µ-weakly convex optimization and thus has broader applicability than cutting plane methods for convex optimization.
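As a concrete illustration of the outer linearization idea described above (and not taken from the paper), the short sketch below builds a cutting-plane model of a one-dimensional convex function and refines it at the model's minimizer; the function, search interval, and iteration count are assumptions for the demo only.

```python
# A small self-contained illustration of outer linearization (Kelley-style):
# a convex f is approximated from below by the maximum of its tangent cutting
# planes, and the polyhedral model is refined by adding a new cut at the model's
# minimizer. The model subproblem is solved crudely by grid search.
import numpy as np

f = lambda x: x ** 2                         # convex objective (assumed)
grad = lambda x: 2 * x

cuts = []                                    # each cut: f(xk) + grad(xk) * (x - xk)
x_k = 3.0
grid = np.linspace(-4.0, 4.0, 4001)          # crude solver for the polyhedral model
for _ in range(6):
    cuts.append((f(x_k), grad(x_k), x_k))
    model = np.max([fk + gk * (grid - xk) for fk, gk, xk in cuts], axis=0)
    x_k = grid[np.argmin(model)]             # next query point; a new cut is added here
    print(f"model minimizer = {x_k:+.3f}, gap f - model = {f(x_k) - model.min():.4f}")
```

The printed gap between the true function value and the polyhedral model shrinks as cuts accumulate, which is the refinement behaviour the µ-cut construction later generalizes to weakly convex functions.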
## Asynchronous Federated Trilevel Learning

Traditional trilevel optimization methods require collecting a massive amount of data at a single server for model training, which may lead to data privacy risks. Solving trilevel optimization problems in a distributed manner is challenging since the trilevel optimization problem is highly nested, which hinders the development of distributed algorithms. The distributed trilevel optimization problem can be expressed as

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_1, x_2^*, x_3^*) \\
\text{s.t.} & x_2^* = \arg\min_{x_2} \sum_{j=1}^{N} f_{2,j}(x_1, x_2, x_3^*) \\
& \qquad \text{s.t. } x_3^* = \arg\min_{x_3} \sum_{j=1}^{N} f_{3,j}(x_1, x_2, x_3) \\
\text{var.} & x_1, x_2, x_3,
\end{array}
\tag{2}
$$

where $N$ denotes the number of workers in the distributed system and $f_{1,j}, f_{2,j}, f_{3,j}$ denote the local first-, second-, and third-level objectives at worker $j$, respectively. The problem in Eq. (2) can be reformulated as a consensus problem (Zhang and Kwok 2014; Jiao, Yang, and Song 2022),

$$
\begin{array}{ll}
\min & \sum_{j} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \{x_{2,j}^*\}, z_2^* = \arg\min_{\{x_{2,j}\}, z_2} \sum_{j} f_{2,j}(z_1, x_{2,j}, x_{3,j}^*) \\
& \qquad \text{s.t. } x_{2,j} = z_2,\ j = 1, \dots, N \\
& \qquad\quad \{x_{3,j}^*\}, z_3^* = \arg\min_{\{x_{3,j}\}, z_3} \sum_{j} f_{3,j}(z_1, z_2, x_{3,j}) \\
& \qquad\qquad \text{s.t. } x_{3,j} = z_3,\ j = 1, \dots, N \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3,
\end{array}
\tag{3}
$$

where $x_{1,j} \in \mathbb{R}^{d_1}, x_{2,j} \in \mathbb{R}^{d_2}, x_{3,j} \in \mathbb{R}^{d_3}$ denote the local variables at worker $j$, and $z_1 \in \mathbb{R}^{d_1}, z_2 \in \mathbb{R}^{d_2}, z_3 \in \mathbb{R}^{d_3}$ denote the consensus variables at the master. The reformulation in Eq. (3) facilitates the development of distributed algorithms for trilevel optimization based on the parameter-server architecture (Assran et al. 2020). The remainder of the proposed method can be divided into three steps: first, the hyper-polyhedral approximation for distributed TLO problems is constructed; then, an effective asynchronous federated algorithm is developed; finally, the µ-cuts are updated to refine the hyper-polyhedral approximation.

### Hyper-Polyhedral Approximation

Different from traditional polyhedral approximation methods (Bertsekas 2015; Franc, Sonnenburg, and Werner 2011; Bürger, Notarstefano, and Allgöwer 2013), a novel hyper-polyhedral approximation method is proposed for distributed TLO problems in this work. Utilizing the proposed hyper-polyhedral approximation makes distributed algorithms easier to develop for TLO problems. Specifically, the proposed hyper-polytope consists of the Ist layer and IInd layer polytopes, which are introduced as follows.

**Ist layer Polyhedral Approximation:** First, define

$$
h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{3,j}\} \\ z_3 \end{bmatrix} - \phi_{\mathrm{I}}(z_1, z_2) \right\|^2
$$

and $\phi_{\mathrm{I}}(z_1, z_2) = \arg\min_{\{x_{3,j}\}, z_3} \{\sum_{j=1}^{N} f_{3,j}(z_1, z_2, x_{3,j}) : x_{3,j} = z_3, \forall j\}$. In trilevel optimization, the third-level optimization problem can be viewed as a constraint on the second-level optimization problem (Chen et al. 2022a), i.e., $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = 0$. A consensus problem would need to be solved in a distributed manner if the exact $\phi_{\mathrm{I}}(z_1, z_2)$ were required.
In many works on bilevel (Ji, Yang, and Liang 2021; Liu et al. 2021; Jiao et al. 2022b) and trilevel (Sato, Tanaka, and Takeda 2021) optimization, the exact $\phi_{\mathrm{I}}(z_1, z_2)$ is replaced by an estimate; following (Jiao et al. 2022b), we utilize the result after $K$ communication rounds between the master and the workers as the estimate of $\phi_{\mathrm{I}}(z_1, z_2)$. Specifically, for the third-level optimization problem, the augmented Lagrangian function can be written as

$$
\mathcal{L}_{p,3} = \sum_{j=1}^{N}\Big(f_{3,j}(z_1, z_2, x_{3,j}) + \phi_{3,j}^{\top}(x_{3,j} - z_3) + \frac{\kappa_3}{2}\|x_{3,j} - z_3\|^2\Big),
\tag{4}
$$

where $\mathcal{L}_{p,3} = \mathcal{L}_{p,3}(z_1, z_2, z_3, \{x_{3,j}\}, \{\phi_{3,j}\})$, $\phi_{3,j} \in \mathbb{R}^{d_3}$ is the dual variable, and the constant $\kappa_3 > 0$ is a penalty parameter. In the $(k+1)$-th communication round:

1) Workers update the local variables,
$$
x^{k+1}_{3,j} = x^{k}_{3,j} - \eta_x \nabla_{x_{3,j}} \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{5}
$$
where $\eta_x$ is the step-size. The workers then transmit the local variables $x^{k+1}_{3,j}$ to the master.

2) The master updates its variables as follows,
$$
z^{k+1}_3 = z^{k}_3 - \eta_z \nabla_{z_3} \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{6}
$$
$$
\phi^{k+1}_{3,j} = \phi^{k}_{3,j} + \eta_\phi \nabla_{\phi_{3,j}} \mathcal{L}_{p,3}(z_1, z_2, z^{k+1}_3, \{x^{k+1}_{3,j}\}, \{\phi^k_{3,j}\}),
\tag{7}
$$
where $\eta_z$ and $\eta_\phi$ are the step-sizes. The master then broadcasts $z^{k+1}_3$ and $\phi^{k+1}_{3,j}$ to the workers. The result after $K$ communication rounds is utilized as the estimate of $\phi_{\mathrm{I}}(z_1, z_2)$, that is,

$$
\phi_{\mathrm{I}}(z_1, z_2) = \begin{bmatrix} \{x^0_{3,j} - \sum_{k=0}^{K-1}\eta_x \nabla_{x_{3,j}} \mathcal{L}^{k}_{p,3}\} \\ z^0_3 - \sum_{k=0}^{K-1}\eta_z \nabla_{z_3} \mathcal{L}^{k}_{p,3} \end{bmatrix},
\tag{8}
$$

where $\mathcal{L}^{k}_{p,3} = \mathcal{L}_{p,3}(z_1, z_2, z^k_3, \{x^k_{3,j}\}, \{\phi^k_{3,j}\})$. Based on Eq. (8) and the definition of $h_{\mathrm{I}}$, we have

$$
h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{3,j} - x^0_{3,j} + \sum_{k=0}^{K-1}\eta_x \nabla_{x_{3,j}} \mathcal{L}^{k}_{p,3}\} \\ z_3 - z^0_3 + \sum_{k=0}^{K-1}\eta_z \nabla_{z_3} \mathcal{L}^{k}_{p,3} \end{bmatrix} \right\|^2.
\tag{9}
$$
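The minimal sketch below mimics the K communication rounds of Eqs. (5)–(7) on a toy consensus instance with scalar variables and quadratic local objectives; only the worker/master/dual update pattern follows the text, while the objectives, step sizes, and K are illustrative assumptions.

```python
# Sketch of the K rounds that produce the estimate of phi_I(z1, z2) in Eq. (8).
# Assumed toy local objectives: f_{3,j}(z1, z2, x) = 0.5*(x - a_j*z2)^2.
import numpy as np

N, K = 4, 500
kappa3, eta_x, eta_z, eta_phi = 1.0, 0.2, 0.1, 0.1
a = np.array([0.5, 1.0, 1.5, 2.0])              # per-worker data (assumed)
z1, z2 = 0.3, 1.0                               # upper-level variables, held fixed here

x3 = np.zeros(N)                                # local variables x_{3,j}
z3 = 0.0                                        # consensus variable z_3
phi = np.zeros(N)                               # dual variables phi_{3,j}

for k in range(K):
    grad_x = (x3 - a * z2) + phi + kappa3 * (x3 - z3)   # grad of L_{p,3} w.r.t. x_{3,j}
    x3_new = x3 - eta_x * grad_x                        # Eq. (5): worker step
    grad_z = np.sum(-phi - kappa3 * (x3 - z3))          # grad of L_{p,3} w.r.t. z_3
    z3_new = z3 - eta_z * grad_z                        # Eq. (6): master step
    phi = phi + eta_phi * (x3_new - z3_new)             # Eq. (7): dual ascent step
    x3, z3 = x3_new, z3_new

# the stacked ({x3}, z3) after K rounds serves as the estimate of phi_I(z1, z2);
# for these toy objectives it should approach the consensus value mean(a_j)*z2 = 1.25
# for small enough step sizes and large enough K
print(np.round(x3, 3), round(z3, 3))
```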
Inspired by polyhedral approximation methods (Bertsekas 2015; Bürger, Notarstefano, and Allgöwer 2013), the Ist layer polytope, which consists of a set of cutting planes (i.e., linear inequalities), is utilized to approximate the feasible region of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{I}}$, a relaxed form of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) = 0$ in Eq. (9), where $\varepsilon_{\mathrm{I}} > 0$ is a pre-set constant. Specifically, the Ist layer polytope in the $(t+1)$-th iteration can be expressed as
$$
P^{t}_{\mathrm{I}} = \Big\{(a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}|\Big\},
$$
where $|P^{t}_{\mathrm{I}}|$ denotes the number of cutting planes in the Ist layer polytope and $a^{\mathrm{I}}_{i,l}, b^{\mathrm{I}}_{j,l}, c^{\mathrm{I}}_{l}$ are the parameters of the $l$-th cutting plane (Ist layer µ-cut). Defining $\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) = (a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j}$, the resulting (bilevel) problem can be expressed as

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \{x_{2,j}^*\}, z_2^* = \arg\min_{\{x_{2,j}\}, z_2} \sum_{j=1}^{N} f_{2,j}(z_1, x_{2,j}, x_{3,j}) \\
& \qquad \text{s.t. } x_{2,j} = z_2,\ j = 1, \dots, N \\
& \qquad\quad \hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}| \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3.
\end{array}
\tag{10}
$$

**IInd layer Polyhedral Approximation:** Define the function $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{2,j}\} \\ z_2 \end{bmatrix} - \phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\}) \right\|^2$, where $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\}) = \arg\min_{\{x_{2,j}\}, z_2} \{\sum_{j=1}^{N} f_{2,j}(z_1, x_{2,j}, x_{3,j}) : x_{2,j} = z_2, \forall j;\ (a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l}, \forall l\}$. In Eq. (10), the lower-level optimization problem can be viewed as a constraint on the upper-level optimization problem (Sinha, Malo, and Deb 2017; Gould et al. 2016), i.e., $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = 0$. Likewise, following (Jiao et al. 2022b), the result after $K$ communication rounds between the master and the workers is utilized as the estimate of $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\})$. Specifically, for the lower-level optimization problem in Eq. (10), the augmented Lagrangian function is given by

$$
\begin{aligned}
\mathcal{L}_{p,2}(z_1, z_2, \{x_{2,j}\}, \{s_l\}, \{\gamma_l\}, \{\phi_{2,j}\}, z_3, \{x_{3,j}\})
= {} & \sum_{j=1}^{N}\Big(f_{2,j}(z_1, x_{2,j}, x_{3,j}) + \phi_{2,j}^{\top}(x_{2,j} - z_2) + \frac{\kappa_2}{2}\|x_{2,j} - z_2\|^2\Big) \\
& + \sum_{l=1}^{|P^{t+1}_{\mathrm{I}}|} \gamma_l\big(\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) - c^{\mathrm{I}}_{l} + s_l\big)
+ \sum_{l=1}^{|P^{t+1}_{\mathrm{I}}|} \frac{\rho_2}{2}\big\|\hat{h}_{\mathrm{I},l}(\{x_{3,j}\}, z_1, z_2, z_3) - c^{\mathrm{I}}_{l} + s_l\big\|^2,
\end{aligned}
\tag{11}
$$

where $\gamma_l \in \mathbb{R}^1$ and $\phi_{2,j} \in \mathbb{R}^{d_2}$ are dual variables, $s_l \in \mathbb{R}^1_+, \forall l$ are the slack variables introduced for the inequality constraints, and the constants $\kappa_2 > 0, \rho_2 > 0$ are penalty parameters. The details of each communication round are presented in Appendix B of the supplementary material. After $K$ communication rounds, we obtain the estimate of $\phi_{\mathrm{II}}(z_1, z_3, \{x_{3,j}\})$, and the corresponding $h_{\mathrm{II}}$ can be expressed as

$$
h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = \left\| \begin{bmatrix} \{x_{2,j} - x^0_{2,j} + \sum_{k=0}^{K-1}\eta_x \nabla_{x_{2,j}} \mathcal{L}^{k}_{p,2}\} \\ z_2 - z^0_2 + \sum_{k=0}^{K-1}\eta_z \nabla_{z_2} \mathcal{L}^{k}_{p,2} \end{bmatrix} \right\|^2,
\tag{12}
$$

where $\mathcal{L}^{k}_{p,2}$ is shorthand for $\mathcal{L}_{p,2}(z_1, z^k_2, \{x^k_{2,j}\}, \{s^k_l\}, \{\gamma^k_l\}, \{\phi^k_{2,j}\}, z_3, \{x_{3,j}\})$. Next, the constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) = 0$ is relaxed, and the IInd layer polytope is utilized to approximate the feasible region of the relaxed constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{II}}$. Specifically, the IInd layer polytope in the $(t+1)$-th iteration can be expressed as
$$
P^{t}_{\mathrm{II}} = \Big\{\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}|\Big\},
$$
where $|P^{t}_{\mathrm{II}}|$ represents the number of cutting planes in $P^{t}_{\mathrm{II}}$, and $a^{\mathrm{II}}_{i,l}, b^{\mathrm{II}}_{i,j,l}, c^{\mathrm{II}}_{l}$ are the parameters of the $l$-th cutting plane (IInd layer µ-cut). Thus, the resulting hyper-polyhedral approximation problem is

$$
\begin{array}{ll}
\min & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) \\
\text{s.t.} & x_{1,j} = z_1,\ j = 1, \dots, N \\
& \sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}| \\
\text{var.} & \{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3.
\end{array}
\tag{13}
$$

It is worth mentioning that solving the TLO problem is theoretically NP-hard (even solving the inner bilevel problem in TLO is NP-hard (Ben-Ayed and Blair 1990)). Thus, it is unlikely that a polynomial-time algorithm for the distributed TLO problem can be designed unless P = NP (Arora and Barak 2009). In this work, the hyper-polyhedral approximation problem in Eq. (13) is a convex relaxation of the distributed TLO problem in Eq. (2), and the relaxation is continuously tightened as µ-cuts are added. Detailed discussions are provided in Appendix D.

### Asynchronous Federated Algorithm

Synchronous and asynchronous federated algorithms have different application scenarios (Su et al. 2022). A synchronous algorithm is preferred when the delays of the workers do not differ much, while an asynchronous algorithm suits better when there are stragglers in the distributed system. In this work, an asynchronous algorithm is proposed to solve the trilevel optimization problem. Specifically, the master updates its variables once it receives updates from $S$ ($1 \le S \le N$) workers, i.e., the active workers, in every iteration, and every worker has to communicate with the master at least once every $\tau$ iterations to alleviate staleness issues (Zhang and Kwok 2014). It is worth mentioning that $S$ can be flexibly adjusted based on whether there are stragglers; the proposed algorithm becomes synchronous when we set $S = N$, so the proposed asynchronous algorithm is effective and flexible. First, the Lagrangian function of Eq. (13) can be expressed as

$$
\begin{aligned}
\mathcal{L}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\})
= {} & \sum_{j=1}^{N} f_{1,j}(x_{1,j}, x_{2,j}, x_{3,j}) + \sum_{j=1}^{N} \theta_j^{\top}(x_{1,j} - z_1) \\
& + \sum_{l=1}^{|P^{t}_{\mathrm{II}}|} \lambda_l\Big(\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} - c^{\mathrm{II}}_{l}\Big),
\end{aligned}
\tag{14}
$$

where $\lambda_l \in \mathbb{R}^1_+$ and $\theta_j \in \mathbb{R}^{d_1}$ are dual variables.
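For intuition about the structure of Eq. (14), the sketch below evaluates this partial Lagrangian for given primal and dual variables and a list of IInd layer cuts; the stand-in objective $f_{1,j}$ and the random cut parameters are purely illustrative and not taken from the paper.

```python
# Evaluate a Lagrangian of the form in Eq. (14): first-level losses, consensus
# terms theta_j^T (x_{1,j} - z_1), and one multiplier lambda_l per IInd-layer cut.
import numpy as np

def lagrangian_p(x1, x2, x3, z, lam, theta, cuts, f1):
    """x1, x2, x3: lists of per-worker vectors; z = [z1, z2, z3];
    cuts: list of dicts with 'a' (3 vectors), 'b' (2 lists of N vectors), 'c' (scalar)."""
    val = sum(f1(x1[j], x2[j], x3[j]) for j in range(len(x1)))      # first-level losses
    val += sum(theta[j] @ (x1[j] - z[0]) for j in range(len(x1)))   # consensus terms
    for lam_l, cut in zip(lam, cuts):                               # one multiplier per cut
        lhs = sum(cut["a"][i] @ z[i] for i in range(3))
        for i, xs in enumerate((x2, x3)):                           # levels i = 2, 3
            lhs += sum(cut["b"][i][j] @ xs[j] for j in range(len(xs)))
        val += lam_l * (lhs - cut["c"])
    return val

# toy usage with N = 2 workers and all dimensions equal to 2 (illustrative values)
N, d = 2, 2
rng = np.random.default_rng(0)
f1 = lambda a, b, c: float(a @ a + b @ b + c @ c)                   # stand-in f_{1,j}
x1 = [rng.normal(size=d) for _ in range(N)]
x2 = [rng.normal(size=d) for _ in range(N)]
x3 = [rng.normal(size=d) for _ in range(N)]
z = [np.zeros(d), np.zeros(d), np.zeros(d)]
cuts = [{"a": [rng.normal(size=d) for _ in range(3)],
         "b": [[rng.normal(size=d) for _ in range(N)] for _ in range(2)],
         "c": 0.5}]
theta = [np.zeros(d) for _ in range(N)]
print(lagrangian_p(x1, x2, x3, z, lam=[0.1], theta=theta, cuts=cuts, f1=f1))
```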
Following (Xu et al. 2020; Jiao et al. 2022b), a regularized Lagrangian function is used to update the variables:

$$
\hat{\mathcal{L}}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\}) = \mathcal{L}_{p} - \sum_{l=1}^{|P^{t}_{\mathrm{II}}|} \frac{c^{t}_1}{2}\|\lambda_l\|^2 - \sum_{j=1}^{N} \frac{c^{t}_2}{2}\|\theta_j\|^2,
\tag{15}
$$

where $\mathcal{L}_{p} = \mathcal{L}_{p}(\{x_{1,j}\}, \{x_{2,j}\}, \{x_{3,j}\}, z_1, z_2, z_3, \{\lambda_l\}, \{\theta_j\})$, and $c^{t}_1, c^{t}_2$ are the regularization terms in the $(t+1)$-th iteration. We set $c^{t}_1 = \frac{c_1}{\eta_\lambda (t+1)^{1/4}}$ and $c^{t}_2 = \frac{c_2}{\eta_\theta (t+1)^{1/4}}$, two nonnegative non-increasing sequences, where $\eta_\lambda, \eta_\theta, c_1, c_2$ are constants and $c_1, c_2$ satisfy $0 < c_1 < \frac{1}{\eta_\lambda}\big((\frac{4M\alpha_4}{\eta_\lambda^2} + \frac{4N\alpha_5}{\eta_\theta^2})\frac{1}{\epsilon}\big)^{\frac{1}{2}}$ and $0 < c_2 < \frac{1}{\eta_\theta}\big((\frac{4M\alpha_4}{\eta_\lambda^2} + \frac{4N\alpha_5}{\eta_\theta^2})\frac{1}{\epsilon}\big)^{\frac{1}{2}}$ ($\epsilon$ refers to the tolerance error, and $\alpha_4, \alpha_5$ are constants introduced below). In the $(t+1)$-th master iteration, $Q^{t+1}$ denotes the index set of active workers, and the proposed asynchronous algorithm proceeds as follows.

(1) Active workers update the local variables as follows,

$$
x^{t+1}_{i,j} = \begin{cases} x^{t}_{i,j} - \eta_{x_i} \nabla_{x_{i,j}} \hat{\mathcal{L}}^{\hat{t}_j}_{p}, & j \in Q^{t+1} \\ x^{t}_{i,j}, & j \notin Q^{t+1} \end{cases}, \quad \forall i,
\tag{16}
$$

where $\eta_{x_i}$ ($i = 1, 2, 3$) denote the step-sizes, $\hat{\mathcal{L}}^{\hat{t}_j}_{p} = \hat{\mathcal{L}}_{p}(\{x^{\hat{t}_j}_{i,j}\}, \{z^{\hat{t}_j}_i\}, \{\lambda^{\hat{t}_j}_l\}, \{\theta^{\hat{t}_j}_j\})$, and $\hat{t}_j$ denotes the last iteration in which worker $j$ was active. The active workers (i.e., workers $j \in Q^{t+1}$) then transmit the updated local variables $x^{t+1}_{i,j}, \forall i$, to the master.

(2) After receiving the updates from the workers, the master updates its variables as follows,

$$
z^{t+1}_1 = z^{t}_1 - \eta_{z_1} \nabla_{z_1} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{17}
$$
$$
z^{t+1}_2 = z^{t}_2 - \eta_{z_2} \nabla_{z_2} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, z^{t+1}_1, z^{t}_2, z^{t}_3, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{18}
$$
$$
z^{t+1}_3 = z^{t}_3 - \eta_{z_3} \nabla_{z_3} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t}_3, \{\lambda^{t}_l\}, \{\theta^{t}_j\}),
\tag{19}
$$
$$
\lambda^{t+1}_l = P_{\Lambda}\big(\lambda^{t}_l + \eta_\lambda \nabla_{\lambda_l} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t+1}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big),
\tag{20}
$$
$$
\theta^{t+1}_j = P_{\Theta}\big(\theta^{t}_j + \eta_\theta \nabla_{\theta_j} \hat{\mathcal{L}}_{p}(\{x^{t+1}_{i,j}\}, \{z^{t+1}_i\}, \{\lambda^{t+1}_l\}, \{\theta^{t}_j\})\big),
\tag{21}
$$

where $\eta_{z_1}, \eta_{z_2}, \eta_{z_3}, \eta_\lambda, \eta_\theta$ denote the step-sizes, and $P_{\Lambda}$ and $P_{\Theta}$ represent the projections onto the sets $\Lambda = \{\lambda_l \mid 0 \le \lambda_l \le \alpha_4\}$ and $\Theta = \{\theta_j \mid \|\theta_j\| \le \alpha_5/d_1\}$, where $\alpha_4 > 0$ and $\alpha_5 > 0$ are constants. The master then broadcasts the updated variables, i.e., $z^{t+1}_1, z^{t+1}_2, z^{t+1}_3, \{\lambda^{t+1}_l\}, \theta^{t+1}_j$, to each active worker $j$. Details are summarized in Algorithm 1.
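Before the formal listing in Algorithm 1, the following simplified simulation sketches the asynchronous protocol of Eqs. (16)–(21): only S of the N workers are active in each master iteration, inactive workers keep stale variables, and the master performs its gradient and regularized dual steps. The IInd layer cut terms (the λ multipliers) are omitted, and the quadratic objective, step sizes, and constants are illustrative assumptions rather than the paper's setup.

```python
# A skeleton of the asynchronous master/worker protocol (not the full method).
# Assumed stand-in objective: f_{1,j}(x) = 0.5*||x - A_j||^2.
import numpy as np

rng = np.random.default_rng(0)
N, S, T, d = 5, 2, 400, 3
eta_x, eta_z, eta_theta, alpha5, c2 = 0.1, 0.05, 0.05, 10.0, 0.1
A = rng.normal(size=(N, d))                        # per-worker data (assumed)

x1 = np.zeros((N, d))                              # local variables x_{1,j}
z1 = np.zeros(d)                                   # consensus variable z_1
theta = np.zeros((N, d))                           # dual variables theta_j
theta_stale = theta.copy()                         # each worker's last-received copy

for t in range(T):
    active = rng.choice(N, size=S, replace=False)  # Q^{t+1}: active workers
    for j in active:                               # Eq. (16): local step with stale duals
        x1[j] -= eta_x * ((x1[j] - A[j]) + theta_stale[j])
    z1 -= eta_z * (-theta.sum(axis=0))             # Eq. (17): master step on z_1
    c_t = c2 / (eta_theta * (t + 1) ** 0.25)       # decaying dual regularization (Eq. (15))
    theta += eta_theta * ((x1 - z1) - c_t * theta) # Eq. (21): regularized dual ascent
    norms = np.linalg.norm(theta, axis=1, keepdims=True)
    theta = np.where(norms > alpha5, theta * alpha5 / norms, theta)   # projection onto a ball
    theta_stale[active] = theta[active]            # broadcast to active workers only

# for this toy objective z_1 should drift toward the average of the A_j
print(np.round(z1, 3), np.round(A.mean(axis=0), 3))
```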
**Algorithm 1: Asynchronous Federated Trilevel Learning**

```
Initialization: master iteration t = 0, variables {x^0_{1,j}}, {x^0_{2,j}}, {x^0_{3,j}}, z^0_1, z^0_2, z^0_3, {λ^0_l}, {θ^0_j}.
repeat
    for each active worker do
        update variables x^{t+1}_{1,j}, x^{t+1}_{2,j}, x^{t+1}_{3,j} by Eq. (16);
    end for
    active workers send the updated local variables to the master;
    for the master do
        update variables z^{t+1}_1, z^{t+1}_2, z^{t+1}_3, {λ^{t+1}_l}, {θ^{t+1}_j} by Eqs. (17), (18), (19), (20), and (21);
    end for
    the master broadcasts the updated variables to the active workers;
    if (t + 1) mod T_pre == 0 and t < T_1 then
        generate a new Ist layer µ-cut cp_I by Eq. (23) and add it to the Ist layer polytope;
        generate a new IInd layer µ-cut cp_II by Eq. (24) and add it to the IInd layer polytope;
        remove inactive Ist and IInd layer µ-cuts by Eq. (25);
    end if
    t = t + 1;
until termination.
```

### Refining Hyper-polyhedral Approximation

In this section, a novel µ-cut is proposed, which can be utilized for non-convex (µ-weakly convex) optimization problems and is thus more general than traditional cutting planes designed for convex optimization (Jiao et al. 2022b; Franc, Sonnenburg, and Werner 2011). We demonstrate in Propositions 1 and 2 that the proposed µ-cuts are valid, i.e., the original feasible region is a subset of the polytope formed by the µ-cuts. Every $T_{pre}$ iterations, while $t < T_1$, the µ-cuts are updated to refine the hyper-polyhedral approximation in three steps: 1) generating a new Ist layer µ-cut, 2) generating a new IInd layer µ-cut, and 3) removing inactive µ-cuts.

**Generating a new Ist layer µ-cut:** Following (Qian et al. 2019), we assume the variables are bounded, i.e., $\|x_{i,j}\|^2 \le \alpha_i$, $\|z_i\|^2 \le \alpha_i$, $i = 1, 2, 3$, and that $h_{\mathrm{I}}$ is µ-weakly convex. It is demonstrated in Appendix E that $h_{\mathrm{I}}$ is µ-weakly convex in many cases. Following (Xie, Koyejo, and Gupta 2019; Davis and Drusvyatskiy 2019), the definition and first-order condition of a µ-weakly convex function are given as follows.

**Definition 1 (µ-weakly convex).** A differentiable function $f(x)$ is µ-weakly convex if the function $g(x) = f(x) + \frac{\mu}{2}\|x\|^2$ is convex.

**Definition 2 (First-order condition).** For any $x, x'$, a differentiable function $f(x)$ is µ-weakly convex if and only if the following inequality holds:
$$
f(x) \ge f(x') + \nabla f(x')^{\top}(x - x') - \frac{\mu}{2}\|x - x'\|^2.
\tag{22}
$$

Combining the first-order condition of µ-weakly convex functions with the Cauchy-Schwarz inequality, a new kind of cutting plane, i.e., the µ-cut, can be generated. Specifically, the new Ist layer µ-cut $cp_{\mathrm{I}}$ for the Ist layer polytope can be expressed as

$$
\begin{bmatrix}
\{\nabla_{x_{3,j}} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\nabla_{z_1} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_2} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_3} h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\end{bmatrix}^{\top}
\begin{bmatrix}
\{x_{3,j} - x^{t+1}_{3,j}\} \\
z_1 - z^{t+1}_1 \\
z_2 - z^{t+1}_2 \\
z_3 - z^{t+1}_3
\end{bmatrix}
+ h_{\mathrm{I}}(\{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\le \varepsilon_{\mathrm{I}} + \mu\Big(\alpha_1 + \alpha_2 + (N+1)\alpha_3 + \sum_{j=1}^{N}\|x^{t+1}_{3,j}\|^2 + \|z^{t+1}_1\|^2 + \|z^{t+1}_2\|^2 + \|z^{t+1}_3\|^2\Big).
\tag{23}
$$

**Proposition 1.** The feasible region of the constraint $h_{\mathrm{I}}(\{x_{3,j}\}, z_1, z_2, z_3) \le \varepsilon_{\mathrm{I}}$ is a subset of the Ist layer polytope $P^{t}_{\mathrm{I}} = \{(a^{\mathrm{I}}_{1,l})^{\top} z_1 + (a^{\mathrm{I}}_{2,l})^{\top} z_2 + (a^{\mathrm{I}}_{3,l})^{\top} z_3 + \sum_{j=1}^{N} (b^{\mathrm{I}}_{j,l})^{\top} x_{3,j} \le c^{\mathrm{I}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{I}}|\}$. In addition, $P^{t}_{\mathrm{I}}$ converges monotonically with the number of µ-cuts.

The proof is given in Appendix C. Note that if $h_{\mathrm{I}}$ is convex, i.e., µ = 0, the generated cutting plane coincides with that in (Franc, Sonnenburg, and Werner 2011; Jiao et al. 2022b), which is designed for convex optimization. Thus, the proposed µ-cut is more general than prior work in the literature. Consequently, the Ist layer polytope is updated as $P^{t+1}_{\mathrm{I}} = \mathrm{Add}(P^{t}_{\mathrm{I}}, cp_{\mathrm{I}})$, where $\mathrm{Add}(P^{t}_{\mathrm{I}}, cp_{\mathrm{I}})$ represents adding the new µ-cut $cp_{\mathrm{I}}$ to the polytope $P^{t}_{\mathrm{I}}$.

**Generating a new IInd layer µ-cut:** Based on the updated Ist layer polytope, the IInd layer polytope is updated. The newly generated IInd layer µ-cut $cp_{\mathrm{II}}$ can be written as

$$
\begin{bmatrix}
\{\nabla_{x_{2,j}} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\{\nabla_{x_{3,j}} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)\} \\
\nabla_{z_1} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_2} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3) \\
\nabla_{z_3} h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\end{bmatrix}^{\top}
\begin{bmatrix}
\{x_{2,j} - x^{t+1}_{2,j}\} \\
\{x_{3,j} - x^{t+1}_{3,j}\} \\
z_1 - z^{t+1}_1 \\
z_2 - z^{t+1}_2 \\
z_3 - z^{t+1}_3
\end{bmatrix}
+ h_{\mathrm{II}}(\{x^{t+1}_{2,j}\}, \{x^{t+1}_{3,j}\}, z^{t+1}_1, z^{t+1}_2, z^{t+1}_3)
\le \varepsilon_{\mathrm{II}} + \mu\Big(\alpha_1 + (N+1)(\alpha_2 + \alpha_3) + \sum_{i=2}^{3}\sum_{j=1}^{N}\|x^{t+1}_{i,j}\|^2 + \sum_{i=1}^{3}\|z^{t+1}_i\|^2\Big).
\tag{24}
$$

Consequently, the IInd layer polytope is updated as $P^{t+1}_{\mathrm{II}} = \mathrm{Add}(P^{t}_{\mathrm{II}}, cp_{\mathrm{II}})$.

**Proposition 2.** The feasible region of the constraint $h_{\mathrm{II}}(\{x_{2,j}\}, \{x_{3,j}\}, \{z_i\}) \le \varepsilon_{\mathrm{II}}$ is a subset of the IInd layer polytope $P^{t}_{\mathrm{II}} = \{\sum_{i=1}^{3} (a^{\mathrm{II}}_{i,l})^{\top} z_i + \sum_{i=2}^{3}\sum_{j=1}^{N} (b^{\mathrm{II}}_{i,j,l})^{\top} x_{i,j} \le c^{\mathrm{II}}_{l},\ l = 1, \dots, |P^{t}_{\mathrm{II}}|\}$, and $P^{t}_{\mathrm{II}}$ converges monotonically with the number of µ-cuts.

The proof is given in Appendix C.
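To see how Eqs. (22)–(24) turn a µ-weakly convex constraint into a linear cut, the hedged sketch below works with a single stacked variable w in place of the paper's block variables: the first-order condition plus the norm bound yields coefficients (g, c) of a cut of the form g^T w ≤ c. The toy function h and all constants are assumptions for illustration.

```python
# Building a mu-cut from the first-order condition (22): if h is mu-weakly convex,
# h(w) <= eps, and ||w||^2 <= alpha, then g^T w <= c with (g, c) computed below.
import numpy as np

def make_mu_cut(h, grad_h, w_ref, mu, eps, alpha):
    """Return (g, c) defining the cut g^T w <= c. Every w with h(w) <= eps and
    ||w||^2 <= alpha satisfies it, by Eq. (22) and the norm bounds."""
    g = grad_h(w_ref)
    c = eps - h(w_ref) + mu * (alpha + w_ref @ w_ref) + g @ w_ref
    return g, c

# h(w) = cos(w_0) + cos(w_1) is non-convex but 1-weakly convex (its Hessian is >= -I)
h = lambda w: float(np.cos(w).sum())
grad_h = lambda w: -np.sin(w)

eps, alpha = 0.5, 4.0
w_ref = np.array([1.0, -0.5])                        # current iterate (e.g. w^{t+1})
g, c = make_mu_cut(h, grad_h, w_ref, mu=1.0, eps=eps, alpha=alpha)

w = np.array([1.5, 1.2])                             # a point in the original feasible region
print(h(w) <= eps and w @ w <= alpha, g @ w <= c)    # expect: True True
```

By construction, any point satisfying $h(w) \le \varepsilon$ and $\|w\|^2 \le \alpha$ also satisfies the returned cut, which mirrors the validity statements of Propositions 1 and 2.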
**Removing inactive µ-cuts:** Removing inactive cutting planes can enhance the efficiency of the proposed algorithm (Yang et al. 2014; Jiao, Yang, and Song 2022). The inactive Ist and IInd layer µ-cuts are removed, so the corresponding Ist and IInd layer polytopes $P^{t+1}_{\mathrm{I}}$ and $P^{t+1}_{\mathrm{II}}$ are updated as follows:

$$
P^{t+1}_{\mathrm{I}} = \begin{cases} \mathrm{Drop}(P^{t+1}_{\mathrm{I}}, cp^{\mathrm{I}}_{l}), & \text{if } \gamma^{K}_{l} = 0 \\ P^{t+1}_{\mathrm{I}}, & \text{otherwise} \end{cases}, \qquad
P^{t+1}_{\mathrm{II}} = \begin{cases} \mathrm{Drop}(P^{t+1}_{\mathrm{II}}, cp^{\mathrm{II}}_{l}), & \text{if } \lambda^{t+1}_{l} = 0 \\ P^{t+1}_{\mathrm{II}}, & \text{otherwise} \end{cases},
\tag{25}
$$

where $\mathrm{Drop}(P, cp_l)$ represents removing the $l$-th cutting plane $cp_l$ from the polytope $P$.

## Discussion

**Definition 3 (Stationarity gap).** Following (Xu et al. 2020; Lu et al. 2020; Jiao, Yang, and Song 2022), the stationarity gap of our problem at the $t$-th iteration is defined as

$$
\nabla G^{t} = \begin{bmatrix}
\{\nabla_{x_{i,j}} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{\nabla_{z_i} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{G_{\lambda_l}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\} \\
\{G_{\theta_j}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\}
\end{bmatrix},
\tag{26}
$$

where

$$
\begin{aligned}
G_{\lambda_l}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}) &= \frac{1}{\eta_\lambda}\Big(\lambda^{t}_l - P_{\Lambda}\big(\lambda^{t}_l + \eta_\lambda \nabla_{\lambda_l} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big)\Big), \\
G_{\theta_j}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\}) &= \frac{1}{\eta_\theta}\Big(\theta^{t}_j - P_{\Theta}\big(\theta^{t}_j + \eta_\theta \nabla_{\theta_j} \mathcal{L}_{p}(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})\big)\Big).
\end{aligned}
\tag{27}
$$

**Definition 4 (ϵ-stationary point).** If $\|\nabla G^{t}\|^2 \le \epsilon$, then $(\{x^{t}_{i,j}\}, \{z^{t}_i\}, \{\lambda^{t}_l\}, \{\theta^{t}_j\})$ is an ϵ-stationary point ($\epsilon \ge 0$) of the differentiable function $\mathcal{L}_{p}$. $T(\epsilon)$ is the first iteration index at which $\|\nabla G^{t}\|^2 \le \epsilon$, i.e., $T(\epsilon) = \min\{t \mid \|\nabla G^{t}\|^2 \le \epsilon\}$.

**Assumption 1 (Gradient Lipschitz).** Following (Ji, Yang, and Liang 2021), we assume that $\mathcal{L}_{p}$ has Lipschitz continuous gradients, i.e., for any $\omega, \omega'$, there exists $L > 0$ such that

$$
\|\nabla \mathcal{L}_{p}(\omega) - \nabla \mathcal{L}_{p}(\omega')\| \le L\|\omega - \omega'\|.
\tag{28}
$$

**Assumption 2 (Boundedness).** Following (Qian et al. 2019; Jiao et al. 2022b), we assume $\|x_{i,j}\|^2 \le \alpha_i$ and $\|z_i\|^2 \le \alpha_i$, $i = 1, 2, 3$. We also assume that before the ϵ-stationary point is obtained (i.e., $t \le T(\epsilon) - 1$), the variables at the master satisfy $\sum_i \|z^{t+1}_i - z^{t}_i\|^2 + \sum_l \|\lambda^{t+1}_l - \lambda^{t}_l\|^2 \le \vartheta$, where $\vartheta > 0$ is a relatively small constant. The change of the master's variables is upper bounded within $\tau$ iterations:

$$
\|z^{t}_i - z^{t-k}_i\|^2 \le \tau k_1 \vartheta, \qquad \sum_l \|\lambda^{t}_l - \lambda^{t-k}_l\|^2 \le \tau k_1 \vartheta, \qquad 1 \le k \le \tau,
\tag{29}
$$

where $k_1 > 0$ is a constant. Detailed discussions of Assumptions 1 and 2 are provided in Appendix I.

**Theorem 1 (Iteration Complexity).** Suppose Assumptions 1 and 2 hold, and set the step-sizes as $\eta_{x_i} = \eta_{z_i} = \frac{2}{L + \eta_\lambda M L^2 + \eta_\theta N L^2 + 8\big(\frac{M\gamma L^2}{\eta_\lambda c_1^2} + \frac{N\gamma L^2}{\eta_\theta c_2^2}\big)}, \forall i$, $\eta_\theta \le \frac{2}{L + 2c^{0}_{2}}$ and $\eta_\lambda$