Published as a conference paper at ICLR 2023

ASYNCHRONOUS DISTRIBUTED BILEVEL OPTIMIZATION

Yang Jiao (Tongji University), Kai Yang (Tongji University), Tiancheng Wu (Tongji University), Dongjin Song (University of Connecticut), Chengtao Jian (Tongji University)

ABSTRACT

Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either the centralized or the synchronous distributed setting. Centralized bilevel optimization approaches require collecting a massive amount of data on a single server, which inevitably incurs significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and stop working entirely if a few workers fail to respond. As a remedy, we propose the Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, theoretical analysis reveals that the iteration complexity of ADBO to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(1/\epsilon^2)$. Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.

1 INTRODUCTION

Recently, bilevel optimization has attracted considerable attention due to its popularity in various machine learning applications, e.g., hyperparameter optimization (Khanduri et al., 2021; Liu et al., 2021a), meta-learning (Likhosherstov et al., 2021; Ji et al., 2020), reinforcement learning (Hong et al., 2020; Zhou & Liu, 2022), and neural architecture search (Jiang et al., 2020; Jiao et al., 2022b). In bilevel optimization, one optimization problem is embedded or nested within another. Specifically, the outer optimization problem is called the upper-level problem and the inner optimization problem is called the lower-level problem. A general form of the bilevel optimization problem can be written as

$$\min\; F(x, y) \quad \text{s.t.}\; y = \arg\min_{y'} f(x, y') \quad \text{var.}\; x, y, \tag{1}$$

where $F$ and $f$ denote the upper-level and lower-level objective functions, respectively, and $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ are the variables. Bilevel optimization can be treated as a special case of constrained optimization, since the lower-level optimization problem can be viewed as a constraint to the upper-level optimization problem (Sinha et al., 2017).

The proliferation of smartphones and Internet of Things (IoT) devices has generated a plethora of data in various real-world applications. Centralized bilevel optimization approaches require collecting a massive amount of data from distributed edge devices and passing it to a centralized server for model training. These methods, however, may give rise to data privacy risks (Subramanya & Riggio, 2021) and encounter communication bottlenecks (Subramanya & Riggio, 2021). To tackle these challenges, distributed algorithms have recently been developed to solve decentralized bilevel optimization problems (Yang et al., 2022; Chen et al., 2022b; Lu et al., 2022). Tarzanagh et al. (2022) and Li et al. (2022) study bilevel optimization problems under a federated setting.
Specifically, the distributed bilevel optimization problem can be given by

$$\min\; F(x, y) = \sum_{i=1}^{N} G_i(x, y) \quad \text{s.t.}\; y = \arg\min_{y'} f(x, y') = \sum_{i=1}^{N} g_i(x, y'), \tag{2}$$

where $N$ is the number of workers (devices), and $G_i$ and $g_i$ denote the local upper-level and lower-level objective functions, respectively. Although existing approaches have shown their success in resolving distributed bilevel optimization problems, they only focus on the synchronous distributed setting. Synchronous distributed methods may encounter the straggler problem (Jiang et al., 2021), and their speed is limited by the worker with the maximum delay (Chang et al., 2016). Moreover, a synchronous distributed method will immediately stop working if a few workers fail to respond (Zhang & Kwok, 2014), which is common in large-scale distributed systems. The aforementioned issues give rise to the following question: Can we design an asynchronous distributed algorithm for bilevel optimization?

To this end, we develop an Asynchronous Distributed Bilevel Optimization (ADBO) algorithm, which is a single-loop algorithm and computationally efficient. The proposed ADBO regards the lower-level optimization problem as a constraint to the upper-level optimization problem and utilizes cutting planes to approximate this constraint. The resulting approximate problem is then solved in an asynchronous distributed manner. We prove that even if both the upper-level and lower-level objectives are nonconvex, the proposed ADBO is guaranteed to converge. The iteration complexity of ADBO is also theoretically derived. To facilitate the comparison, we not only present a centralized bilevel optimization algorithm in Appendix A, but also compare the convergence results of ADBO with state-of-the-art bilevel optimization algorithms under both centralized and distributed settings in Table 1.

Contributions. Our contributions can be summarized as follows:

1. We propose a novel algorithm, ADBO, to solve the bilevel optimization problem in an asynchronous distributed manner. ADBO is a single-loop algorithm and is computationally efficient. To the best of our knowledge, this is the first work to tackle the asynchronous distributed bilevel optimization problem.

2. We demonstrate that the proposed ADBO can be applied to bilevel optimization with nonconvex upper-level and lower-level objectives with constraints. We also theoretically show that the iteration complexity for the proposed ADBO to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(1/\epsilon^2)$.

3. Our thorough empirical studies justify the superiority of the proposed ADBO over the existing state-of-the-art methods.

2 RELATED WORK

Bilevel optimization: The bilevel optimization problem was first introduced by Bracken & McGill (1973). In recent years, many approaches have been developed to solve this problem, and they can be divided into three categories (Gould et al., 2016). The first type of approaches assumes there is an analytical solution to the lower-level optimization problem (i.e., $\phi(x) = \arg\min_{y'} f(x, y')$) (Zhang et al., 2021). In this case, the bilevel optimization problem can be simplified to a single-level optimization problem (i.e., $\min_x F(x, \phi(x))$). Nevertheless, finding the analytical solution of the lower-level optimization problem is often very difficult, if not impossible. The second type of approaches replaces the lower-level optimization problem with sufficient conditions for optimality (e.g., KKT conditions) (Biswas & Hoyle, 2019; Sinha et al., 2017).
Then, the bilevel program can be reformulated as a single-level constrained optimization problem. However, the resulting problem can be hard to solve, since it often involves a large number of constraints (Ji et al., 2021; Gould et al., 2016). The third type of approaches comprises gradient-based methods (Ghadimi & Wang, 2018; Hong et al., 2020; Liao et al., 2018) that compute the hypergradient (or an estimate of the hypergradient), i.e., $\frac{\partial F(x,y)}{\partial x} + \frac{\partial F(x,y)}{\partial y}\frac{\partial y}{\partial x}$, and use gradient descent to solve the bilevel optimization problem. Most of the existing bilevel optimization methods focus on centralized settings and require collecting a massive amount of data from distributed edge devices (workers). This may give rise to data privacy risks (Subramanya & Riggio, 2021) and encounter communication bottlenecks (Subramanya & Riggio, 2021).

Asynchronous distributed optimization: To alleviate the aforementioned issues in the centralized setting, various distributed optimization methods can be employed. Distributed optimization methods can generally be divided into synchronous and asynchronous distributed methods (Assran et al., 2020). For synchronous distributed methods (Boyd et al., 2011), the master needs to wait for the updates from all workers before it proceeds to the next iteration. Therefore, it may suffer from the straggler problem, and its speed is limited by the worker with the maximum delay (Chang et al., 2016). Several advanced techniques have been proposed to make synchronous algorithms more efficient, such as large batch sizes and warmup (Goyal et al., 2017; You et al., 2019; Huo et al., 2021; Liu & Mozafari, 2022; Wang et al., 2020). For asynchronous distributed methods (Chen et al., 2020; Matamoros, 2017), the master can update its variables once it receives updates from $S$ workers, i.e., the active workers ($1 \le S \le N$, where $N$ is the number of all workers). Asynchronous distributed algorithms are strongly preferred for large-scale distributed systems in practice since they do not suffer from the straggler problem (Jiang et al., 2021). Asynchronous distributed methods (Wu et al., 2017; Liu et al., 2017) have been employed in many real-world applications, such as Google's DistBelief system (Dean et al., 2012), training on 10 million YouTube videos (Le, 2013), and federated learning for edge computing (Lu et al., 2019; Liu et al., 2021c). Since the action orders of the workers differ in the asynchronous distributed setting, which results in complex interaction dynamics (Jiang et al., 2021), the theoretical analysis of asynchronous distributed algorithms is usually more challenging than that of synchronous ones. In summary, synchronous and asynchronous algorithms have different application scenarios: when the delays of the workers are similar, a synchronous algorithm is a better fit, whereas when there are stragglers in the distributed system, an asynchronous algorithm is preferred. So far, existing works on distributed bilevel optimization only focus on the synchronous setting (Tarzanagh et al., 2022; Li et al., 2022; Chen et al., 2022b); how to design an asynchronous algorithm for distributed bilevel optimization remains under-explored. To the best of our knowledge, this is the first work that designs an asynchronous algorithm for distributed bilevel optimization.
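To make the hypergradient used by the gradient-based approaches above concrete, here is a minimal sketch for a toy bilevel problem whose lower-level solution is available in closed form; the quadratic objectives, the coupling constant `a`, and all numbers are illustrative assumptions and are not taken from the paper. It evaluates $\partial F/\partial x + (\partial F/\partial y)(\partial y^*/\partial x)$ via the implicit function theorem and checks it against a finite difference of the reduced objective $F(x, y^*(x))$.

```python
# A minimal sketch of the hypergradient dF/dx + (dF/dy)(dy*/dx) on a toy bilevel
# problem with a closed-form lower-level solution.  Everything here is illustrative.
import numpy as np

a = 2.0  # couples the two levels: the lower-level solution is y*(x) = a * x

def F(x, y):            # upper-level objective
    return (y - 3.0) ** 2 + x ** 2

def f(x, y):            # lower-level objective, minimized over y for fixed x
    return 0.5 * (y - a * x) ** 2

def y_star(x):          # closed-form lower-level solution arg min_y f(x, y)
    return a * x

def hypergradient(x):
    y = y_star(x)
    dF_dx, dF_dy = 2.0 * x, 2.0 * (y - 3.0)
    # implicit function theorem: dy*/dx = -(d2f/dy2)^{-1} (d2f/dy dx) = a
    dy_dx = a
    return dF_dx + dF_dy * dy_dx

x0, eps = 1.0, 1e-6
fd = (F(x0 + eps, y_star(x0 + eps)) - F(x0 - eps, y_star(x0 - eps))) / (2 * eps)
print(hypergradient(x0), fd)   # the two values should agree
```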
3 ASYNCHRONOUS DISTRIBUTED BILEVEL OPTIMIZATION

In this section, we propose Asynchronous Distributed Bilevel Optimization (ADBO) to solve the distributed bilevel optimization problem in an asynchronous manner. First, we reformulate the problem in Eq. (2) as a consensus problem (Matamoros, 2017; Chang et al., 2016),

$$\begin{aligned} \min\;& F(\{x_i\}, \{y_i\}, v, z) = \sum_{i=1}^{N} G_i(x_i, y_i) \\ \text{s.t.}\;& x_i = v,\; i = 1, \dots, N, \\ & (\{y_i\}, z) = \arg\min_{\{y'_i\}, z'} \Big\{ f(v, \{y'_i\}, z') = \sum_{i=1}^{N} g_i(v, y'_i) \;:\; y'_i = z',\; i = 1, \dots, N \Big\} \\ \text{var.}\;& \{x_i\}, \{y_i\}, v, z, \end{aligned} \tag{3}$$

where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}^m$ are the local variables of the $i$th worker, and $v \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$ are the consensus variables in the master node. The reformulation in Eq. (3) is a consensus problem, which allows us to develop distributed training algorithms for bilevel optimization based on the parameter-server architecture (Assran et al., 2020). As shown in Figure 13, in the parameter-server architecture the communication is centralized around the master: workers pull the consensus variables $v, z$ from the master and send their local variables $x_i, y_i$ to it. Parameter-server training is a well-known data-parallel approach for scaling up machine learning model training on a multitude of machines (Verbraeken et al., 2020). Most of the existing bilevel optimization works in machine learning only consider bilevel programs without upper-level and lower-level constraints (Franceschi et al., 2018; Yang et al., 2021; Chen et al., 2022a) or bilevel programs with only an upper-level (or lower-level) constraint (Zhang et al., 2022; Mehra & Hamm, 2021). On the contrary, we focus on bilevel programs (i.e., Eq. (3)) with both lower-level and upper-level constraints, which is more challenging. By defining

$$\phi(v) = \arg\min_{\{y'_i\}, z'} \Big\{ \sum_{i=1}^{N} g_i(v, y'_i) \;:\; y'_i = z',\; i = 1, \dots, N \Big\}$$

and $h(v, \{y_i\}, z) = \big\| [\{y_i\}; z] - \phi(v) \big\|^2$, we can reformulate the problem in Eq. (3) as

$$\begin{aligned} \min\;& F(\{x_i\}, \{y_i\}, v, z) = \sum_{i=1}^{N} G_i(x_i, y_i) \\ \text{s.t.}\;& x_i = v,\; i = 1, \dots, N, \\ & h(v, \{y_i\}, z) = 0 \\ \text{var.}\;& \{x_i\}, \{y_i\}, v, z. \end{aligned} \tag{4}$$

To better clarify how ADBO works, we sketch its procedure. First, ADBO computes an estimate of the solution to the lower-level optimization problem. Then, inspired by the cutting plane method, a set of cutting planes is utilized to approximate the feasible region of the upper-level optimization problem. Finally, we propose the asynchronous algorithm for solving the resulting problem and describe how to update the cutting planes. The remaining contents are divided into four parts: the estimate of the solution to the lower-level optimization problem, the polyhedral approximation, the asynchronous algorithm, and updating the cutting planes.

3.1 Estimate of Solution to Lower-level Optimization Problem

A consensus problem, i.e., the lower-level optimization problem in Eq. (3), needs to be solved in a distributed manner if an exact $\phi(v)$ is desired. Following existing works on bilevel optimization (Li et al., 2022; Gould et al., 2016; Yang et al., 2021), instead of pursuing the exact $\phi(v)$, an estimate of $\phi(v)$ can be utilized. For this purpose, we first obtain the first-order Taylor approximation of $g_i(v, \{y'_i\})$ with respect to $v$, i.e., for a given point $\bar v$,

$$\tilde g_i(v, \{y'_i\}) = g_i(\bar v, \{y'_i\}) + \nabla_v g_i(\bar v, \{y'_i\})^{\top} (v - \bar v).$$

Then, similarly to many works that use $K$ steps of gradient descent (GD) to approximate the optimal solution of the lower-level optimization problem (Ji et al., 2021; Yang et al., 2021; Liu et al., 2021b), we utilize the results after $K$ communication rounds between the workers and the master to approximate $\phi(v)$.
Specifically, given $\tilde g_i(v, \{y'_i\})$, the augmented Lagrangian function of the lower-level optimization problem in Eq. (3) can be expressed as

$$g_p(v, \{y'_i\}, z', \{\varphi_i\}) = \sum_{i=1}^{N} \Big[ \tilde g_i(v, y'_i) + \varphi_i^{\top} (y'_i - z') + \frac{\mu}{2} \|y'_i - z'\|^2 \Big], \tag{5}$$

where $\varphi_i \in \mathbb{R}^m$ is the dual variable and $\mu > 0$ is a penalty parameter. In the $(k+1)$th round, we have:

(1) Workers update their local variables as

$$y'_{i,k+1} = y'_{i,k} - \eta_y \nabla_{y'_i} g_p(v, \{y'_{i,k}\}, z'_k, \{\varphi_{i,k}\}), \tag{6}$$

where $\eta_y$ is the step size. Then, the workers transmit the local variables $y'_{i,k+1}$ to the master.

(2) After receiving the updates from the workers, the master updates its variables as

$$z'_{k+1} = z'_k - \eta_z \nabla_{z'} g_p(v, \{y'_{i,k}\}, z'_k, \{\varphi_{i,k}\}), \tag{7}$$
$$\varphi_{i,k+1} = \varphi_{i,k} + \eta_{\varphi} \nabla_{\varphi_i} g_p(v, \{y'_{i,k+1}\}, z'_{k+1}, \{\varphi_{i,k}\}), \tag{8}$$

where $\eta_z$ and $\eta_{\varphi}$ are step sizes. Next, the master broadcasts $z'_{k+1}$ and $\varphi_{i,k+1}$ to the workers.

As mentioned above, we utilize the results after $K$ communication rounds to approximate $\phi(v)$, i.e.,

$$\phi(v) \approx \Big[ \Big\{ y'_{i,0} - \sum_{k=0}^{K-1} \eta_y \nabla_{y'_i} g_p(v, \{y'_{i,k}\}, z'_k, \{\varphi_{i,k}\}) \Big\};\; z'_0 - \sum_{k=0}^{K-1} \eta_z \nabla_{z'} g_p(v, \{y'_{i,k}\}, z'_k, \{\varphi_{i,k}\}) \Big]. \tag{9}$$

3.2 Polyhedral Approximation

Considering $\phi(v)$ in Eq. (9), the relaxed problem with respect to the problem in Eq. (4) is

$$\begin{aligned} \min\;& F(\{x_i\}, \{y_i\}, v, z) = \sum_{i=1}^{N} G_i(x_i, y_i) \\ \text{s.t.}\;& x_i = v,\; i = 1, \dots, N, \\ & h(v, \{y_i\}, z) \le \varepsilon \\ \text{var.}\;& \{x_i\}, \{y_i\}, v, z, \end{aligned} \tag{10}$$

where $\varepsilon > 0$ is a constant. We assume that $h(v, \{y_i\}, z)$ is a convex function with respect to $(v, \{y_i\}, z)$, which is always satisfied when we set $K = 1$ in Eq. (9) according to the operations that preserve convexity (Boyd et al., 2004). Since the sublevel set of a convex function is convex (Boyd et al., 2004), the feasible set with respect to the constraint $h(v, \{y_i\}, z) \le \varepsilon$ is a convex set.

In this paper, inspired by the cutting plane method (Boyd & Vandenberghe, 2007; Michalka, 2013; Franc et al., 2011; Yang et al., 2014), a set of cutting planes is utilized to approximate the feasible region with respect to the constraint $h(v, \{y_i\}, z) \le \varepsilon$ in Eq. (10). The set of cutting planes forms a polytope. Let $\mathcal{P}_t$ denote the polytope in the $(t+1)$th iteration, which can be expressed as

$$\mathcal{P}_t = \Big\{ a_l^{\top} v + \sum_{i=1}^{N} b_{i,l}^{\top} y_i + c_l^{\top} z + \kappa_l \le 0,\; l = 1, \dots, |\mathcal{P}_t| \Big\}, \tag{11}$$

where $a_l \in \mathbb{R}^n$, $b_{i,l} \in \mathbb{R}^m$, $c_l \in \mathbb{R}^m$, and $\kappa_l \in \mathbb{R}$ are the parameters of the $l$th cutting plane, and $|\mathcal{P}_t|$ denotes the number of cutting planes in $\mathcal{P}_t$. Thus, the approximate problem in the $(t+1)$th iteration can be expressed as

$$\begin{aligned} \min\;& F(\{x_i\}, \{y_i\}, v, z) = \sum_{i=1}^{N} G_i(x_i, y_i) \\ \text{s.t.}\;& x_i = v,\; i = 1, \dots, N, \\ & a_l^{\top} v + \sum_{i=1}^{N} b_{i,l}^{\top} y_i + c_l^{\top} z + \kappa_l \le 0,\; l = 1, \dots, |\mathcal{P}_t| \\ \text{var.}\;& \{x_i\}, \{y_i\}, v, z. \end{aligned} \tag{12}$$

The cutting planes will be updated to refine the approximation; details are given in Section 3.4.
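As a numerical illustration of Sections 3.1 and 3.2, the sketch below runs $K$ rounds of the updates in Eqs. (6)-(8) to estimate $\phi(v)$ and then tests the relaxed constraint $h(v, \{y_i\}, z) \le \varepsilon$ of Eq. (10). The quadratic local objectives $g_i(v, y'_i) = \tfrac{1}{2}\|y'_i - A_i v\|^2$, the dimensions, and the step sizes are assumptions chosen only to keep the example self-contained, and the Taylor surrogate $\tilde g_i$ is replaced by $g_i$ itself for brevity.

```python
# A minimal sketch of Sections 3.1-3.2: estimate phi(v) via K rounds of the
# augmented-Lagrangian updates (Eqs. (6)-(8)), then test h(v, {y_i}, z) <= eps (Eq. (10)).
# Local objectives g_i(v, y') = 0.5*||y' - A_i v||^2 and all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, n, m, K = 4, 3, 2, 50          # workers, dim of v, dim of y/z, communication rounds
mu, eta_y, eta_z, eta_phi = 1.0, 0.1, 0.05, 0.05
A = [rng.standard_normal((m, n)) for _ in range(N)]   # defines each g_i


def estimate_phi(v):
    """K rounds of Eqs. (6)-(8): workers update y_i', master updates z' and phi_i."""
    y = [np.zeros(m) for _ in range(N)]
    phi = [np.zeros(m) for _ in range(N)]
    z = np.zeros(m)
    for _ in range(K):
        # Eq. (6): each worker takes a gradient step on g_p w.r.t. its y_i'
        y = [yi - eta_y * ((yi - A[i] @ v) + phi[i] + mu * (yi - z))
             for i, yi in enumerate(y)]
        # Eq. (7): master gradient step on g_p w.r.t. z'
        z = z - eta_z * sum(-phi[i] - mu * (y[i] - z) for i in range(N))
        # Eq. (8): dual ascent on the consensus constraint y_i' = z'
        phi = [phi[i] + eta_phi * (y[i] - z) for i in range(N)]
    return np.concatenate(y + [z])    # stacked ({y_i'}, z'), as in Eq. (9)


def h(v, ys, z):
    """h(v, {y_i}, z) = || [{y_i}; z] - phi(v) ||^2, cf. Eq. (4)."""
    return float(np.sum((np.concatenate(ys + [z]) - estimate_phi(v)) ** 2))


v = rng.standard_normal(n)
ys = [rng.standard_normal(m) for _ in range(N)]
z = rng.standard_normal(m)
eps = 1e-3
print("h =", h(v, ys, z), "feasible for Eq. (10)?", h(v, ys, z) <= eps)
```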
3.3 Asynchronous Algorithm

In the proposed ADBO, we solve the distributed bilevel optimization problem in an asynchronous manner. The Lagrangian function of Eq. (12) can be written as

$$L_p = \sum_{i=1}^{N} G_i(x_i, y_i) + \sum_{l=1}^{|\mathcal{P}_t|} \lambda_l \Big( a_l^{\top} v + \sum_{i=1}^{N} b_{i,l}^{\top} y_i + c_l^{\top} z + \kappa_l \Big) + \sum_{i=1}^{N} \theta_i^{\top} (x_i - v), \tag{13}$$

where $\lambda_l \in \mathbb{R}$ and $\theta_i \in \mathbb{R}^n$ are dual variables, and $L_p$ is the simplified form of $L_p(\{x_i\}, \{y_i\}, v, z, \{\lambda_l\}, \{\theta_i\})$. The regularized version (Xu et al., 2020) of Eq. (13) is employed to update all variables:

$$\tilde L_p(\{x_i\}, \{y_i\}, v, z, \{\lambda_l\}, \{\theta_i\}) = L_p - \frac{c_1^t}{2} \sum_{l=1}^{|\mathcal{P}_t|} \|\lambda_l\|^2 - \frac{c_2^t}{2} \sum_{i=1}^{N} \|\theta_i\|^2, \tag{14}$$

where $c_1^t$ and $c_2^t$ denote the regularization parameters in the $(t+1)$th iteration. In each iteration, we set $|\mathcal{P}_t| \le M$, $\forall t$. Here $c_1^t = \frac{1}{\eta_{\lambda}(t+1)^{1/4}} c_1$ and $c_2^t = \frac{1}{\eta_{\theta}(t+1)^{1/4}} c_2$ are two nonnegative non-increasing sequences, where $\eta_{\lambda}$ and $\eta_{\theta}$ are positive constants, and the constants $c_1, c_2$ satisfy $0 < c_1 \le 1/(\eta_{\lambda} c)$ and $0 < c_2 \le 1/(\eta_{\theta} c)$.

3.4 Updating Cutting Planes

Every $k_{\mathrm{pre}}$ iterations ($k_{\mathrm{pre}} > 0$ is a pre-set constant, which can be controlled flexibly), the cutting planes are updated based on the following two steps (a) and (b) when $t < T_1$:

(a) Removing the inactive cutting planes:

$$\mathcal{P}_{t+1} = \begin{cases} \mathrm{Drop}(\mathcal{P}_t, cp_l), & \text{if } \lambda_l^{t+1} = 0 \text{ and } \lambda_l^t = 0 \\ \mathcal{P}_t, & \text{otherwise}, \end{cases} \tag{21}$$

where $cp_l$ represents the $l$th cutting plane in $\mathcal{P}_t$, and $\mathrm{Drop}(\mathcal{P}_t, cp_l)$ represents that the $l$th cutting plane $cp_l$ is removed from $\mathcal{P}_t$. The dual variable set $\{\lambda^{t+1}\}$ is updated as

$$\{\lambda^{t+1}\} = \begin{cases} \mathrm{Drop}(\{\lambda^t\}, \lambda_l), & \text{if } \lambda_l^{t+1} = 0 \text{ and } \lambda_l^t = 0 \\ \{\lambda^t\}, & \text{otherwise}, \end{cases} \tag{22}$$

where $\{\lambda^{t+1}\}$ and $\{\lambda^t\}$ represent the dual variable sets in the $(t+1)$th and $t$th iterations, respectively, and $\mathrm{Drop}(\{\lambda^t\}, \lambda_l)$ represents that $\lambda_l$ is removed from the dual variable set $\{\lambda^t\}$.

(b) Adding new cutting planes. First, we investigate whether $(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})$ is feasible for the constraint $h(v, \{y_i\}, z) \le \varepsilon$. We can obtain $h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})$ according to $\phi(v^{t+1})$ in Eq. (9). If $(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})$ is not a feasible solution to the original problem (Eq. (10)), a new cutting plane $cp^{t+1}_{\mathrm{new}}$ is generated to separate the point $(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})$ from the feasible region of the constraint $h(v, \{y_i\}, z) \le \varepsilon$. Thus, a valid cutting plane (Boyd & Vandenberghe, 2007) $a_l^{\top} v + \sum_{i=1}^{N} b_{i,l}^{\top} y_i + c_l^{\top} z + \kappa_l \le 0$ must satisfy

$$\begin{cases} a_l^{\top} v + \sum_{i=1}^{N} b_{i,l}^{\top} y_i + c_l^{\top} z + \kappa_l \le 0, & \forall (v, \{y_i\}, z) \text{ satisfying } h(v, \{y_i\}, z) \le \varepsilon, \\ a_l^{\top} v^{t+1} + \sum_{i=1}^{N} b_{i,l}^{\top} y_i^{t+1} + c_l^{\top} z^{t+1} + \kappa_l > 0. \end{cases} \tag{23}$$

Since $h(v, \{y_i\}, z)$ is a convex function, we have

$$h(v, \{y_i\}, z) \ge h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1}) + \nabla_v h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(v - v^{t+1}) + \sum_{i=1}^{N} \nabla_{y_i} h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(y_i - y_i^{t+1}) + \nabla_z h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(z - z^{t+1}). \tag{24}$$

Combining Eq. (24) with Eq. (23), a valid cutting plane (with respect to the point $(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})$) can be expressed as

$$h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1}) + \nabla_v h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(v - v^{t+1}) + \sum_{i=1}^{N} \nabla_{y_i} h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(y_i - y_i^{t+1}) + \nabla_z h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1})^{\top}(z - z^{t+1}) \le \varepsilon. \tag{25}$$

For brevity, we use $cp^{t+1}_{\mathrm{new}}$ to denote the newly added cutting plane (i.e., Eq. (25)). Thus the polytope $\mathcal{P}_{t+1}$ is updated as

$$\mathcal{P}_{t+1} = \begin{cases} \mathrm{Add}(\mathcal{P}_{t+1}, cp^{t+1}_{\mathrm{new}}), & \text{if } h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1}) > \varepsilon \\ \mathcal{P}_{t+1}, & \text{otherwise}, \end{cases} \tag{26}$$

where $\mathrm{Add}(\mathcal{P}_{t+1}, cp^{t+1}_{\mathrm{new}})$ represents that the new cutting plane $cp^{t+1}_{\mathrm{new}}$ is added to the polytope $\mathcal{P}_{t+1}$. The dual variable set $\{\lambda^{t+1}\}$ is updated as

$$\{\lambda^{t+1}\} = \begin{cases} \mathrm{Add}(\{\lambda^{t+1}\}, \lambda^{t+1}_{|\mathcal{P}_{t+1}|}), & \text{if } h(v^{t+1}, \{y_i^{t+1}\}, z^{t+1}) > \varepsilon \\ \{\lambda^{t+1}\}, & \text{otherwise}, \end{cases} \tag{27}$$

where $\mathrm{Add}(\{\lambda^{t+1}\}, \lambda^{t+1}_{|\mathcal{P}_{t+1}|})$ represents that the dual variable $\lambda^{t+1}_{|\mathcal{P}_{t+1}|}$ is added to the dual variable set $\{\lambda^{t+1}\}$. Finally, the master broadcasts the updated $\mathcal{P}_{t+1}$ and $\{\lambda^{t+1}\}$ to all workers. The details of the proposed algorithm are summarized in Algorithm 1.

Algorithm 1 ADBO: Asynchronous Distributed Bilevel Optimization
Initialization: master iteration $t = 0$, variables $\{x_i^0\}, \{y_i^0\}, v^0, z^0, \{\lambda_l^0\}, \{\theta_i^0\}$, and polytope $\mathcal{P}_0$.
repeat
  for each active worker do
    update variables $x_i^{t+1}, y_i^{t+1}$ according to Eqs. (15) and (16);
  end for
  Active workers transmit their local variables to the master;
  for the master do
    update variables $v^{t+1}, z^{t+1}, \{\lambda_l^{t+1}\}, \{\theta_i^{t+1}\}$ according to Eqs. (17), (18), (19), and (20);
  end for
  The master broadcasts variables to the active workers;
  if $(t+1) \bmod k_{\mathrm{pre}} == 0$ and $t < T_1$ then
    the master computes $\phi(v^{t+1})$ according to Eq. (9);
    the master updates $\mathcal{P}_{t+1}$ and $\{\lambda^{t+1}\}$ according to Eqs. (21), (22), (26), and (27);
    the master broadcasts $\mathcal{P}_{t+1}$ and $\{\lambda^{t+1}\}$ to all workers;
  end if
  $t = t + 1$;
until termination.
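The per-iteration updates of ADBO are those of Eqs. (15)-(20); the sketch below therefore only simulates the asynchronous communication pattern of Algorithm 1 ($S$ of $N$ workers respond per master iteration, workers compute from stale copies of the master variables, and staleness is bounded by $\tau$) on a stand-in single-level consensus problem with plain gradient steps. The stand-in objective and all parameters are assumptions for illustration and are not ADBO's actual update rules.

```python
# A schematic simulation of the asynchronous pattern in Algorithm 1.  Only S of the N
# workers respond at each master iteration, workers use stale copies of the master
# variable (staleness bounded by tau), and the master proceeds once the S responses
# arrive.  The actual ADBO updates (Eqs. (15)-(20)) are replaced here, as an assumption,
# by plain gradient steps on min_v sum_i 0.5*||v - d_i||^2.
import numpy as np

rng = np.random.default_rng(1)
N, S, tau, n, T = 8, 4, 5, 3, 200
eta = 0.05
d = rng.standard_normal((N, n))            # worker-local data
v = np.zeros(n)                            # master (consensus) variable
v_local = np.tile(v, (N, 1))               # each worker's stale copy of v
last_pull = np.zeros(N, dtype=int)         # iteration at which worker i last pulled v

for t in range(T):
    # choose the S workers that respond now; force a pull for any worker whose copy
    # would otherwise exceed the staleness bound tau
    must = np.where(t - last_pull >= tau)[0]
    extra = rng.permutation(np.setdiff1d(np.arange(N), must))[:max(0, S - len(must))]
    active = np.concatenate([must, extra])[:S].astype(int)

    # active workers compute gradients at their (possibly stale) local copies
    grads = {i: v_local[i] - d[i] for i in active}

    # master updates once the S responses have arrived, then pushes fresh variables
    v = v - eta * sum(grads.values()) * (N / S)   # rescale toward the full-sum gradient
    for i in active:
        v_local[i] = v.copy()
        last_pull[i] = t

print("master iterate:", v, "optimum:", d.mean(axis=0))
```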
4 DISCUSSION

Theorem 1 (Convergence) As cutting planes continue to be added to the polytope, the optimal objective value of the approximate problem in Eq. (12) converges monotonically.

The proof of Theorem 1 is presented in Appendix C.

Definition 1 (Stationarity gap) Following (Xu et al., 2020; Lu et al., 2020; Jiao et al., 2022a), the stationarity gap of our problem at the $t$th iteration is defined as

$$\nabla G^t = \begin{bmatrix} \{\nabla_{x_i} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\})\} \\ \{\nabla_{y_i} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\})\} \\ \nabla_{v} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\}) \\ \nabla_{z} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\}) \\ \{\nabla_{\lambda_l} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\})\} \\ \{\nabla_{\theta_i} L_p(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\})\} \end{bmatrix}. \tag{28}$$

Definition 2 ($\epsilon$-stationary point) $(\{x_i^t\}, \{y_i^t\}, v^t, z^t, \{\lambda_l^t\}, \{\theta_i^t\})$ is an $\epsilon$-stationary point ($\epsilon \ge 0$) of a differentiable function $L_p$ if $\|\nabla G^t\|^2 \le \epsilon$. $T(\epsilon)$ is the first iteration index such that $\|\nabla G^t\|^2 \le \epsilon$, i.e., $T(\epsilon) = \min\{t \mid \|\nabla G^t\|^2 \le \epsilon\}$.

Assumption 1 (Smoothness/Gradient Lipschitz) Following (Ji et al., 2021), we assume that $L_p$ has Lipschitz continuous gradients, i.e., for any $\omega, \omega'$, there exists $L > 0$ such that

$$\|\nabla L_p(\omega) - \nabla L_p(\omega')\| \le L \|\omega - \omega'\|. \tag{29}$$

Assumption 2 (Boundedness) Following (Qian et al., 2019), we assume that the variables are bounded, i.e., $\|x_i\|^2 \le \alpha_1$, $\|v\|^2 \le \alpha_1$, $\|y_i\|^2 \le \alpha_2$, $\|z\|^2 \le \alpha_2$, $\|\lambda_l\|^2 \le \alpha_3$, $\|\theta_i\|^2 \le \alpha_4$. We also assume that before obtaining the $\epsilon$-stationary point (i.e., $t \le T(\epsilon) - 1$), the variables in the master satisfy $\|v^{t+1} - v^t\|^2 + \|z^{t+1} - z^t\|^2 + \sum_l \|\lambda_l^{t+1} - \lambda_l^t\|^2 \le \vartheta$, where $\vartheta > 0$ is a relatively small constant. The change of the variables in the master is upper bounded within $\tau$ iterations:

$$\|v^t - v^{t-k}\|^2 \le \tau k_1 \vartheta, \quad \|z^t - z^{t-k}\|^2 \le \tau k_1 \vartheta, \quad \sum_l \|\lambda_l^t - \lambda_l^{t-k}\|^2 \le \tau k_1 \vartheta, \quad 1 \le k \le \tau, \tag{30}$$

where $k_1 > 0$ is a constant.

Theorem 2 (Iteration complexity) Suppose Assumptions 1 and 2 hold, and set the step sizes as $\eta_x = \eta_y = \eta_v = \eta_z = \frac{2}{L + \eta_{\lambda} M L^2 + \eta_{\theta} N L^2 + 8\left( \frac{M\gamma L^2}{\eta_{\lambda} c_1^2} + \frac{N\gamma L^2}{\eta_{\theta} c_2^2} \right)}$, $\eta_{\lambda} < \min\left\{\frac{2}{L + 2c_1^0}, \frac{1}{30\tau k_1 N L^2}\right\}$, and $\eta_{\theta} \le \frac{2}{L + 2c_2^0}$. For a given $\epsilon$, we have

$$T(\epsilon) \le \mathcal{O}\Big( \max\Big\{ \frac{4M\alpha_3}{\epsilon^2},\; \Big( \frac{4\big(d_7 + \frac{\eta_{\theta}(N-S)L^2}{2}\big)\big(d + k_d \tau(\tau-1)\big) d_6}{\epsilon} + (T_1 + 2)^{1/2} \Big)^2 \Big\} \Big), \tag{31}$$

where $\alpha_3, \alpha_4, \gamma, k_d, T_1, d, d_6$, and $d_7$ are constants. The detailed proof is given in Appendix B.
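For intuition about Definitions 1-2, the following minimal sketch stacks hypothetical gradient blocks into the stationarity gap of Eq. (28) and records the first iteration $T(\epsilon)$ at which $\|\nabla G^t\|^2 \le \epsilon$; the shrinking random blocks are placeholders for the actual partial gradients of $L_p$ and are assumptions for illustration only.

```python
# A minimal sketch of the stopping rule behind Definitions 1-2.  The gradient blocks
# are random stand-ins that merely shrink over time; in ADBO they would be the partial
# gradients of L_p w.r.t. {x_i}, {y_i}, v, z, {lambda_l}, {theta_i}.
import numpy as np

rng = np.random.default_rng(2)

def stationarity_gap_sq(blocks):
    """||nabla G^t||^2 as in Eq. (28): sum of squared norms of all gradient blocks."""
    return float(sum(np.sum(b ** 2) for b in blocks))

eps, T_eps = 1e-2, None
for t in range(1, 500):
    decay = 1.0 / t                                   # stand-in for algorithmic progress
    blocks = [decay * rng.standard_normal(5) for _ in range(6)]
    if stationarity_gap_sq(blocks) <= eps:
        T_eps = t                                     # T(eps) = min{t : ||nabla G^t||^2 <= eps}
        break

print("T(eps) =", T_eps)
```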
5 EXPERIMENT

In this section, experiments are conducted on two hyperparameter optimization tasks (i.e., the data hyper-cleaning task and the regularization coefficient optimization task) in the distributed setting to evaluate the performance of the proposed ADBO (codes are available at https://github.com/ICLR23Submission6251/adbo). The proposed ADBO is compared with the state-of-the-art distributed bilevel optimization method FEDNEST (Tarzanagh et al., 2022). In the data hyper-cleaning task, experiments are carried out on the MNIST (LeCun et al., 1998) and Fashion MNIST (Xiao et al., 2017) datasets. In the coefficient optimization task, following (Chen et al., 2022a), experiments are conducted on the Covertype (Blackard & Dean, 1999) and IJCNN1 (Prokhorov, 2001) datasets.

5.1 DATA HYPER-CLEANING

Following (Ji et al., 2021; Yang et al., 2021), we compare the performance of the proposed ADBO and the distributed bilevel optimization method FEDNEST on the distributed data hyper-cleaning task (Chen et al., 2022b) on the MNIST and Fashion MNIST datasets. Data hyper-cleaning involves training a classifier in a contaminated environment where each training label is changed to a random class with a certain probability (i.e., the corruption rate). In the experiment, the distributed data hyper-cleaning problem is considered, whose formulation can be expressed as

$$\begin{aligned} \min\;& F(\psi, w) = \sum_{i=1}^{N} \frac{1}{|D_i^{\mathrm{val}}|} \sum_{(x_j, y_j) \in D_i^{\mathrm{val}}} L(x_j^{\top} w, y_j) \\ \text{s.t.}\;& w = \arg\min_{w'} f(\psi, w') = \sum_{i=1}^{N} \frac{1}{|D_i^{\mathrm{tr}}|} \sum_{(x_j, y_j) \in D_i^{\mathrm{tr}}} \sigma(\psi_j) L(x_j^{\top} w', y_j) + C_r \|w'\|^2, \end{aligned} \tag{32}$$

where $D_i^{\mathrm{tr}}$ and $D_i^{\mathrm{val}}$ denote the training and validation datasets on the $i$th worker, respectively, $(x_j, y_j)$ denotes the $j$th data point and its label, $\sigma(\cdot)$ is the sigmoid function, $L$ is the cross-entropy loss, $C_r$ is a regularization parameter, and $N$ is the number of workers in the distributed system. On the MNIST and Fashion MNIST datasets, we set $N = 18$, $S = 9$, and $\tau = 15$. Following Cohen et al. (2021), we assume that the communication delay of each worker obeys a heavy-tailed distribution. The proposed ADBO is compared with the state-of-the-art distributed bilevel optimization method FEDNEST and with SDBO (Synchronous Distributed Bilevel Optimization, i.e., ADBO without the asynchronous setting). The test accuracy versus time is shown in Figure 1, and the test loss versus time is shown in Figure 2. We can observe that the proposed ADBO is the most efficient algorithm since 1) the asynchronous setting is considered in ADBO, so the master can update its variables once it receives updates from the $S$ active workers instead of all workers; and 2) ADBO is a single-loop algorithm and only gradient descent/ascent is required at each iteration, thus ADBO is computationally more efficient.

5.2 REGULARIZATION COEFFICIENT OPTIMIZATION

Following (Chen et al., 2022a), we compare the proposed ADBO with the baseline algorithms FEDNEST and SDBO on the regularization coefficient optimization task with the Covertype and IJCNN1 datasets. The distributed regularization coefficient optimization problem is given by

$$\begin{aligned} \min\;& F(\psi, w) = \sum_{i=1}^{N} \frac{1}{|D_i^{\mathrm{val}}|} \sum_{(x_j, y_j) \in D_i^{\mathrm{val}}} L(x_j^{\top} w, y_j) \\ \text{s.t.}\;& w = \arg\min_{w'} f(\psi, w') = \sum_{i=1}^{N} \frac{1}{|D_i^{\mathrm{tr}}|} \sum_{(x_j, y_j) \in D_i^{\mathrm{tr}}} L(x_j^{\top} w', y_j) + \sum_{j=1}^{n} \psi_j (w'_j)^2, \end{aligned} \tag{33}$$

where $\psi \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, and $L$ respectively denote the regularization coefficients, the model parameters, and the logistic loss, and $w' = [w'_1, \dots, w'_n]$. On the Covertype and IJCNN1 datasets, we set $N = 18$, $S = 9$, $\tau = 15$ and $N = 24$, $S = 12$, $\tau = 15$, respectively. We also assume that the delay of each worker obeys a heavy-tailed distribution. First, we compare the performance of the proposed ADBO, SDBO, and FEDNEST in terms of test accuracy and test loss on the Covertype and IJCNN1 datasets, as shown in Figures 3 and 4. It is seen that the proposed ADBO is more efficient for the same two reasons given in Section 5.1.
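To make the objectives in Eqs. (32)-(33) concrete, here is a toy, single-worker instantiation of the data hyper-cleaning objective with binary logistic loss and synthetic data; the dataset sizes, the corruption pattern, and the use of a binary rather than multi-class loss are assumptions for brevity (the experiments above use multi-class cross-entropy on real datasets).

```python
# A toy instantiation of the data hyper-cleaning objective in Eq. (32).  psi holds one
# hyperparameter per training example; sigmoid(psi_j) down-weights examples believed
# to carry corrupted labels, while the upper level is the clean validation loss.
import numpy as np

rng = np.random.default_rng(3)
n_feat, n_tr, n_val, C_r = 5, 40, 20, 1e-3
X_tr, X_val = rng.standard_normal((n_tr, n_feat)), rng.standard_normal((n_val, n_feat))
w_true = rng.standard_normal(n_feat)
y_tr = (X_tr @ w_true > 0).astype(float)
y_val = (X_val @ w_true > 0).astype(float)
y_tr[: n_tr // 4] = 1.0 - y_tr[: n_tr // 4]            # corrupt 25% of the training labels

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def logistic_loss(X, y, w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def upper(w):                       # F(psi, w): clean validation loss
    return logistic_loss(X_val, y_val, w)

def lower(psi, w):                  # f(psi, w): sigmoid(psi_j)-weighted training loss + ridge
    p = sigmoid(X_tr @ w)
    per_example = -(y_tr * np.log(p + 1e-12) + (1 - y_tr) * np.log(1 - p + 1e-12))
    return np.mean(sigmoid(psi) * per_example) + C_r * np.sum(w ** 2)

psi, w = np.zeros(n_tr), np.zeros(n_feat)
print("F =", upper(w), " f =", lower(psi, w))
```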
Figure 1: Test accuracy vs. time on (a) MNIST and (b) Fashion MNIST datasets.
Figure 2: Test loss vs. time on (a) MNIST and (b) Fashion MNIST datasets.
Figure 3: Test accuracy vs. time on (a) Covertype and (b) IJCNN1 datasets.
Figure 4: Test loss vs. time on (a) Covertype and (b) IJCNN1 datasets.
Figure 5: Test accuracy vs. time on (a) Covertype and (b) IJCNN1 datasets when there are stragglers in the distributed system.
Figure 6: Test loss vs. time on (a) Covertype and (b) IJCNN1 datasets when there are stragglers in the distributed system.

We also consider the straggler problem, i.e., there exist workers with high delays (stragglers) in the distributed system. In this case, the efficiency of bilevel optimization methods with the synchronous distributed setting is affected heavily. In the experiment, we assume there are three stragglers in the distributed system, and the mean of the (communication + computation) delay of the stragglers is four times the delay of the normal workers. The results on the Covertype and IJCNN1 datasets are reported in Figures 5 and 6. It is seen that the efficiency of the synchronous distributed algorithms (FEDNEST and SDBO) is significantly affected, while the proposed ADBO does not suffer from the straggler problem since it is an asynchronous method and is able to consider only the active workers.

6 CONCLUSION

Existing bilevel optimization works focus either on the centralized or the synchronous distributed setting, which gives rise to data privacy risks or suffers from the straggler problem. As a remedy, we propose ADBO in this paper to solve the bilevel optimization problem in an asynchronous distributed manner. To the best of our knowledge, this is the first work that devises an asynchronous distributed algorithm for bilevel optimization. We demonstrate that the proposed ADBO can effectively tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions. Theoretical analysis has also been conducted to analyze the convergence properties and iteration complexity of ADBO. Extensive empirical studies on real-world datasets demonstrate the efficiency and effectiveness of the proposed ADBO.

ACKNOWLEDGMENTS

The work of Yang Jiao, Kai Yang and Chengtao Jian was supported in part by the National Natural Science Foundation of China under Grant 61771013, in part by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS), in part by the Fundamental Research Funds for the Central Universities of China, and in part by the Fundamental Research Funds of Shanghai Jiading District. We thank Haibo Zhao for providing the experiment results in meta-learning.

REFERENCES

Mahmoud Assran, Arda Aytekin, Hamid Reza Feyzmahdavian, Mikael Johansson, and Michael G Rabbat. Advances in asynchronous parallel and distributed optimization. Proceedings of the IEEE, 108(11):2013-2031, 2020.

Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2018.

Arpan Biswas and Christopher Hoyle. A literature review: Solving constrained non-linear bi-level optimization problems with classical methods. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, volume 59193, pp. V02BT03A025. American Society of Mechanical Engineers, 2019.

Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131-151, 1999.

Stephen Boyd and Lieven Vandenberghe. Localization and cutting-plane methods. From Stanford EE 364b lecture notes, 2007.

Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe.
Convex optimization. Cambridge university press, 2004. Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011. Jerome Bracken and James T Mc Gill. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1):37 44, 1973. Tsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, and Xiangfeng Wang. Asynchronous distributed ADMM for large-scale optimization Part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118 3130, 2016. Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2466 2488. PMLR, 2022a. Xuxing Chen, Minhui Huang, and Shiqian Ma. Decentralized bilevel optimization. ar Xiv preprint ar Xiv:2206.05670, 2022b. Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with Non-IID data. In 2020 IEEE International Conference on Big Data (Big Data), pp. 15 24. IEEE, 2020. Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, and Mariano Schain. Asynchronous stochastic optimization robust to arbitrary delays. Advances in Neural Information Processing Systems, 34: 9024 9035, 2021. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126 1135. PMLR, 2017. Published as a conference paper at ICLR 2023 Vojtech Franc, S oren Sonnenburg, and Tom aˇs Werner. Cutting plane methods in machine learning. Optimization for Machine Learning, pp. 185 218, 2011. Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568 1577. PMLR, 2018. Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018. Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. ar Xiv preprint ar Xiv:1607.05447, 2016. Priya Goyal, Piotr Doll ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017. Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pp. 3748 3758. PMLR, 2020. Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. ar Xiv preprint ar Xiv:2007.05170, 2020. Zhouyuan Huo, Bin Gu, and Heng Huang. Large batch optimization for deep learning using new complete layer-wise adaptive rate scaling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7883 7890, 2021. Kaiyi Ji, Jason D Lee, Yingbin Liang, and H Vincent Poor. 
Convergence of meta-learning with taskspecific adaptation over partial parameters. Advances in Neural Information Processing Systems, 33:11490 11500, 2020. Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882 4892. PMLR, 2021. Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Sp-nas: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11863 11872, 2020. Jiyan Jiang, Wenpeng Zhang, Jinjie Gu, and Wenwu Zhu. Asynchronous decentralized online learning. Advances in Neural Information Processing Systems, 34:20185 20196, 2021. Yang Jiao, Kai Yang, and Dongjin Song. Distributed distributionally robust optimization with nonconvex objectives. ar Xiv preprint ar Xiv:2210.07588, 2022a. Yang Jiao, Kai Yang, Dongjing Song, and Dacheng Tao. Timeautoad: Autonomous anomaly detection with self-supervised contrastive loss for multivariate time series. IEEE Transactions on Network Science and Engineering, 9(3):1604 1619, 2022b. Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Advances in Neural Information Processing Systems, 34:30271 30283, 2021. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332 1338, 2015. Quoc V Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8595 8598. IEEE, 2013. Published as a conference paper at ICLR 2023 Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Junyi Li, Feihu Huang, and Heng Huang. Local stochastic bilevel optimization with momentumbased variance reduction. ar Xiv preprint ar Xiv:2205.01608, 2022. Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, Ki Jung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp. 3082 3091. PMLR, 2018. Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Q Davis, and Adrian Weller. Debiasing a first-order heuristic for approximate bi-level optimization. In International Conference on Machine Learning, pp. 6621 6630. PMLR, 2021. Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A value-function-based interior-point method for non-convex bi-level optimization. In International Conference on Machine Learning, pp. 6882 6892. PMLR, 2021a. Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond. Advances in Neural Information Processing Systems, 34:8662 8675, 2021b. Rui Liu and Barzan Mozafari. Communication-efficient distributed learning for large batch optimization. In International Conference on Machine Learning, pp. 13925 13946. PMLR, 2022. Yaohua Liu, Cameron Nowzari, Zhi Tian, and Qing Ling. Asynchronous periodic event-triggered coordination of multi-agent systems. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 6696 6701. IEEE, 2017. 
Yinghui Liu, Youyang Qu, Chenhao Xu, Zhicheng Hao, and Bruce Gu. Blockchain-enabled asynchronous federated learning in edge computing. Sensors, 21(10):3335, 2021c. Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 68:3676 3691, 2020. Songtao Lu, Xiaodong Cui, Mark S Squillante, Brian Kingsbury, and Lior Horesh. Decentralized bilevel optimization for personalized client learning. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5543 5547. IEEE, 2022. Yunlong Lu, Xiaohong Huang, Yueyue Dai, Sabita Maharjan, and Yan Zhang. Differentially private asynchronous federated learning for mobile edge computing in urban informatics. IEEE Transactions on Industrial Informatics, 16(3):2134 2143, 2019. Javier Matamoros. Asynchronous online ADMM for consensus problems. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5875 5879. IEEE, 2017. Akshay Mehra and Jihun Hamm. Penalty method for inversion-free deep bilevel optimization. In Asian Conference on Machine Learning, pp. 347 362. PMLR, 2021. Alexander Michalka. Cutting planes for convex objective nonconvex optimization. Columbia University, 2013. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. Danil Prokhorov. Ijcnn 2001 neural network competition. Slide presentation in IJCNN, 1(97):38, 2001. Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, and Hao Li. Robust optimization over multiple domains. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4739 4746, 2019. Published as a conference paper at ICLR 2023 J. Ross Quinlan. Simplifying decision trees. International journal of man-machine studies, 27(3): 221 234, 1987. Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml. In International Conference on Learning Representations, 2019. Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019. Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22 (2):276 295, 2017. Tejas Subramanya and Roberto Riggio. Centralized and federated learning for predictive vnf autoscaling in multi-domain 5g networks and beyond. IEEE Transactions on Network and Service Management, 18(1):63 78, 2021. Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated bilevel, minimax, and compositional optimization. ar Xiv preprint ar Xiv:2205.02215, 2022. Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. Acm computing surveys (csur), 53(2): 1 33, 2020. Tong Wang, Yousong Zhu, Chaoyang Zhao, Wei Zeng, Yaowei Wang, Jinqiao Wang, and Ming Tang. Large batch optimization for object detection: Training coco in 12 minutes. In European Conference on Computer Vision, pp. 481 496. Springer, 2020. Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H Sayed. Decentralized consensus optimization with asynchrony and delays. 
IEEE Transactions on Signal and Information Processing over Networks, 4(2):293 307, 2017. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Zi Xu, Huiling Zhang, Yang Xu, and Guanghui Lan. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. ar Xiv preprint ar Xiv:2006.02032, 2020. Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670 13682, 2021. Kai Yang, Jianwei Huang, Yihong Wu, Xiaodong Wang, and Mung Chiang. Distributed robust optimization (DRO), part I: Framework and example. Optimization and Engineering, 15(1):35 67, 2014. Shuoguang Yang, Xuezhou Zhang, and Mengdi Wang. Decentralized gossip-based stochastic bilevel optimization over communication networks. ar Xiv preprint ar Xiv:2206.10870, 2022. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. ar Xiv preprint ar Xiv:1904.00962, 2019. Ruiliang Zhang and James Kwok. Asynchronous distributed ADMM for consensus optimization. In International conference on machine learning, pp. 1701 1709. PMLR, 2014. Yihua Zhang, Guanhuan Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, and Sijia Liu. Revisiting and advancing fast adversarial training through the lens of bi-level optimization. ar Xiv preprint ar Xiv:2112.12376, 2021. Yihua Zhang, Guanhua Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, and Sijia Liu. Revisiting and advancing fast adversarial training through the lens of bi-level optimization. In International Conference on Machine Learning, pp. 26693 26712. PMLR, 2022. Ziyuan Zhou and Guanjun Liu. Romfac: A robust mean-field actor-critic reinforcement learning against adversarial perturbations on states. ar Xiv preprint ar Xiv:2205.07229, 2022. Published as a conference paper at ICLR 2023 A CUTTING PLANE METHOD FOR BILEVEL OPTIMIZATION In this section, a cutting plane method, named CPBO, is proposed for bileve optimization. Defining ϕ(x) = arg min y f(x, y ) and h(x, y) = ||y ϕ(x)||2, we can reformulate problem in Eq. (1) as: min F(x, y) s.t. h(x, y)=0 var. x, y. (34) Following the previous works (Li et al., 2022; Gould et al., 2016; Yang et al., 2021) in bilevel optimization, it is not necessary to get the exact ϕ(x), and the approximate ϕ(x) is given as follows. Firstly, as many work do (Ji et al., 2021; Yang et al., 2021), we utilize the K steps of gradient descent (GD) to approximate ϕ(x). And the first-order Taylor approximation of f(x, y ) with respect to x is considered, i.e., for a given point x, ef(x, y ) = f(x, y ) + xf(x, y ) (x x). Thus, we have, ϕ(x) = y 0 XK 1 k=0 η y ef(x, y k), (35) where η is the step-size. Considering the estimated ϕ(x) in Eq. (35), the relaxed problem with respect to problem in Eq. (34) is considered as follows, min F(x, y) s.t. h(x, y) ε var. x, y. (36) Assuming that h(x, y) is a convex function with respect to (x, y), which is always satisfied when we set K = 1 in Eq. (35) according to the operations that preserve convexity (Boyd et al., 2004). Since the sublevel set of a convex function is convex, we have that the feasible set of (x, y), i.e., Zrelax = {(x, y) Rn Rm|h(x, y) ϵ}, (37) is a convex set. 
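Because Z_relax is convex, the linearization of h at any infeasible point yields a cut that excludes the point while retaining all of Z_relax, which is exactly the separation step used below. The following small numerical check illustrates this; the toy choice h(x, y) = ||y - Bx||^2 (i.e., the lower-level problem solved exactly, ϕ(x) = Bx) and all dimensions are assumptions for illustration only.

```python
# A small numerical illustration of the cutting-plane separation step: at an infeasible
# point (x_hat, y_hat) with h > eps, the first-order linearization of the convex h gives
# a valid cut.  The toy h(x, y) = ||y - B x||^2 is an assumption for brevity.
import numpy as np

rng = np.random.default_rng(4)
n, m, eps = 3, 2, 0.5
B = rng.standard_normal((m, n))

def h(x, y):
    return float(np.sum((y - B @ x) ** 2))

def grad_h(x, y):
    r = y - B @ x
    return np.concatenate([-2.0 * B.T @ r, 2.0 * r])   # [dh/dx ; dh/dy]

def cut(x_hat, y_hat):
    """Return l(.) with l(x, y) <= 0 required of every point in Z_relax."""
    g, h0 = grad_h(x_hat, y_hat), h(x_hat, y_hat)
    p0 = np.concatenate([x_hat, y_hat])
    return lambda x, y: h0 + g @ (np.concatenate([x, y]) - p0) - eps

# pick an infeasible point and build the cut
x_hat, y_hat = rng.standard_normal(n), rng.standard_normal(m) + 3.0
l = cut(x_hat, y_hat)
print("h at the point:", h(x_hat, y_hat), "cut value there:", l(x_hat, y_hat))

# by convexity of h, every feasible point (h <= eps) satisfies the cut
for _ in range(1000):
    x = rng.standard_normal(n)
    y = B @ x + np.sqrt(eps) * rng.uniform(-1, 1, m) / np.sqrt(m)   # ensures h <= eps
    assert h(x, y) <= eps + 1e-9 and l(x, y) <= 1e-9
print("all sampled feasible points satisfy the cut")
```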
We utilize a set of cutting plane constraints (i.e., linear constraints) to approximate the feasible set Zrelax. The set of cutting plane constraints forms a polytope, which can be expressed as follows, P = {(x, y) Rn Rm|al x + bl y + κl 0, l = 1, , L}, (38) where al Rn, bl Rm and κl R1 are parameters in lth cutting plane, and L represents the number of cutting planes in P. Considering the approximate problem, which can be expressed as follows, min F(x, y) s.t. al x+bl y+κl 0, l=1, , |Pt| (39) var. x, y, where Pt is the polytope in (t + 1)th iteration, and |Pt| denotes the number of cutting planes in Pt. Then, the Lagrangian function of Eq. (39) can be written as, Lp(x, y, {λl}) = F(x, y) + l=1 λl(al x+bl y+κl), (40) where λl is the dual variable. The proposed algorithm proceeds as follows in (t + 1)th iteration: If t < T1, the variables are updated as follows, xt+1 = xt ηx x Lp(xt, yt, {λt l}), (41) yt+1 = yt ηy y Lp(xt+1, yt, {λt l}), (42) λt+1 l = λt l + ηλl λl Lp(xt+1, yt+1, {λt l}), l=1, , |Pt|, (43) where ηx, ηy and ηλl are the step-sizes. And every kpre iterations (kpre >0 is a pre-set constant, which can be controlled flexibly) the cutting planes will be updated based on the following two steps: Published as a conference paper at ICLR 2023 Table 1: Convergence results of bilevel optimization algorithms (with centralized and distributed setting). Method Centralized Synchronous (Distributed) Asynchronous (Distributed) AID-Bi O (Ghadimi & Wang, 2018) O( 1 ϵ1.25 ) NA NA AID-Bi O (Ji et al., 2021) O( 1 ϵ1 ) NA NA ITD-Bi O (Ji et al., 2021) O( 1 ϵ1 ) NA NA STABLE (Chen et al., 2022a) O( 1 ϵ2 )1 NA NA stoc Bio (Ji et al., 2021) O( 1 ϵ2 )1 NA NA VRBO (Yang et al., 2021) O( 1 ϵ1.5 )1 NA NA FEDNEST (Tarzanagh et al., 2022) NA O( 1 ϵ2 )1 NA SPDB (Lu et al., 2022) NA O( 1 ϵ2 )1 NA DSBO (Yang et al., 2022) NA O( 1 ϵ2 )1 NA Proposed Method O( 1 ϵ1 ) NA O( 1 1 Stochastic optimization algorithm. (a) Removing the inactive cutting planes, that is, Pt+1 = Drop(Pt, cpl), if λt+1 l and λt l = 0 Pt, otherwise , (44) where cpl represents the lth cutting plane in Pt, and Drop(Pt, cpl) represents removing the lth cutting plane cpl from Pt. And the dual variable set {λt} will be updated as follows, {λt+1}= Drop({λt}, λt l), if λt+1 l and λt l = 0 {λt}, otherwise , (45) where {λt+1} and {λt} respectively represent the dual variable set in (t + 1)th and tth iteration. And Drop({λt}, λt l) represents that λt l is removed from the dual variable set {λt}. (b) Adding new cutting planes. Firstly, we investigate whether (xt+1, yt+1) is a feasible solution to the original problem in Eq. (36). If (xt+1, yt+1) is not a feasible solution to the original problem, that is h(xt+1, yt+1) > ε, new cutting plane is generated to separate the point (xt+1, yt+1) from Zrelax, that is, the valid cutting plane al x+bl y+κl 0 must satisfy that, al x+bl y+κl 0, (x, y) Zrelax al xt+1+bl yt+1+κl > 0 . (46) Since h(x, y) is a convex function, we have that, h(x, y) h(xt+1, yt+1) + " h(xt+1,yt+1) x h(xt+1,yt+1) According to Eq. (47), h(xt+1, yt+1) + " h(xt+1,yt+1) x h(xt+1,yt+1) ε is a valid cutting plane at point (xt+1, yt+1) which satisfies Eq. (46). For brevity, we utilize cpt+1 new to denote this cutting plane. Thus, we have that, Pt+1 = Add(Pt+1, cpt+1 new), if h(xt+1, yt+1) > ε Pt+1, if h(xt+1, yt+1) ε , (48) where Add(Pt+1, cpt+1 new) represents that new cutting plane cpt+1 new is added to polytope Pt+1. 
And the dual variable set is updated as follows, ( Add({λt+1}, λt+1 |Pt+1|), if h(xt+1, yt+1) > ε {λt+1}, if h(xt+1, yt+1) ε , (49) Published as a conference paper at ICLR 2023 Algorithm 2 CPBO: Cutting Plane Method for Bilevel Optimization Initialization: iteration t = 0, variables x0, y0, {λ0 l } and polytope P0. repeat if t < T1 then updating variables xt+1, yt+1 and λt+1 l according to Eq. (41), (42) and (43); if (t + 1) mod kpre == 0 then updating the polytope Pt+1 according to Eq. (44) and (48); updating the dual variable set {λt+1} according to Eq. (45) and (49); end if else updating variables xt+1 and yt+1 according to Eq. (50) and (51); end if t = t + 1; until termination. 0 2 4 6 8 time (s) test accuracy CPBO VRBO STABLE stoc Bi O AID-CG (a) test accuracy vs time 0 2 4 6 8 time (s) CPBO VRBO STABLE stoc Bi O AID-CG (b) test loss vs time Figure 7: Comparison of (a) test accuracy vs time, (b) test loss vs time on Covertype dataset. where Add({λt+1}, λt+1 |Pt+1|) represents that new dual variable λt+1 |Pt+1| is added to {λt+1}. Else if t T1, the polytope PT1 and dual variables will be fixed. Variables x, y will be updated as follows, xt+1 = xt ηx x ˆLp(xt, yt), (50) yt+1 = yt ηy y ˆLp(xt+1, yt), (51) where ˆLp(x, y) = F(x, y) + |PT1| P l=1 λl[max{0, al x+bl y+κl}]2. And details of the proposed algorithm are summarized in Algorithm 2. The comparison about the convergence results between the proposed method and state-of-the-art methods are summarized in Table 1. A.1 EXPERIMENT To evaluate the performance of the proposed CPBO, experiments are carried out on two applications: 1) hyperparameter optimization, 2) meta-learning. In hyperparameter optimization, we compare CPBO with baseline algorithms stoc Bio (Ji et al., 2021), STABLE (Chen et al., 2022a), VRBO (Yang et al., 2021)), and AID-CG (Grazzi et al., 2020) on the regularization coefficient optimization task (Chen et al., 2022a) with Covertype (Blackard & Dean, 1999) and IJCNN1 (Prokhorov, 2001) datasets. We compare the performance of the proposed CPBO with all competing algorithms in terms of both the test accuracy and the test loss, which are shown in Figure 7 and 8. In metalearning, we focus on the bilevel optimization problem in (Rajeswaran et al., 2019). And we compare the proposed CPBO with baseline algorithms MAML (Finn et al., 2017), i MAML (Rajeswaran et al., 2019), and ANIL (Raghu et al., 2019) on Omniglot (Lake et al., 2015) and CIFAR-FS (Bertinetto et al., 2018) datasets. And the comparison between the proposed method with the baseline algorithms are shown in Figure 9 and 10. It is seen that the proposed CPBO can achieve relatively fast convergence rate among all competing algorithms since 1) the iteration complexity of the proposed method is not high; 2) every step in CPBO is computationally efficient. Published as a conference paper at ICLR 2023 0 5 10 15 20 time (s) test accuracy CPBO VRBO STABLE stoc Bi O AID-CG (a) test accuracy vs time 0 5 10 15 20 time (s) CPBO VRBO STABLE stoc Bi O AID-CG (b) test loss vs time Figure 8: Comparison of (a) test accuracy vs time, (b) test loss vs time on IJCNN1 dataset. 0 500 1000 1500 2000 time (s) CPBO i MAML MAML ANIL (a) test accuracy vs time 0 500 1000 1500 2000 time (s) CPBO i MAML MAML ANIL (b) test loss vs time Figure 9: Comparison of (a) test accuracy vs time, (b) test loss vs time on Omniglot dataset. A.2 DISCUSSION Definition A.1 (x, y) is an ϵ-stationary point of a differentiable function ˆLp, if || x ˆLp(x, y)||2 + || y ˆLp(x, y)||2 ϵ. 
Assumption A.1 (Smoothness/Gradient Lipschitz) Following (Ji et al., 2021), we assume that ˆLp has Lipschitz continuous gradients, i.e., for any ω, ω , we assume that there exists L > 0 satisfying that, || ˆLp(ω) ˆLp(ω )|| L||ω ω ||. (52) Assumption A.2 (Boundedness) Following (Qian et al., 2019), we assume that variables have boundedness, i.e., ||x||2 β1, ||y||2 β2. Theorem 3 (Iteration Complexity) Under Assumption A.1, A.2, and setting the step-sizes as ηx < 2 L, ηy < 2 L, the iteration complexity (also the gradient complexity) of the proposed algorithm to obtain ϵ-stationary point is bounded by O( 1 Proof of Theorem 3: According to Assumption A.1 and Eq. (50), when t T1, we have, ˆLp(xt+1, yt) ˆLp(xt, yt) + D x ˆLp(xt, yt), xt+1 xt E + L 2 ||xt+1 xt||2 ˆLp(xt, yt) ηx|| x ˆLp(xt, yt)||2 + Lηx 2 2 || x ˆLp(xt, yt)||2. (53) Similarly, according to Assumption A.1 and Eq. (51), we have, ˆLp(xt+1, yt+1) ˆLp(xt+1, yt) + D y ˆLp(xt+1, yt), yt+1 yt E + L 2 ||yt+1 yt||2 ˆLp(xt+1, yt) ηy|| y ˆLp(xt+1, yt)||2 + Lηy 2 2 || y ˆLp(xt+1, yt)||2. (54) Published as a conference paper at ICLR 2023 0 500 1000 1500 2000 time (s) CPBO i MAML MAML ANIL (a) test accuracy vs time 0 500 1000 1500 2000 time (s) CPBO i MAML MAML ANIL (b) test loss vs time Figure 10: Comparison of (a) test accuracy vs time, (b) test loss vs time on CIFAR-FS dataset. Combining Eq. (53) with Eq. (54), we have, 2 )|| x ˆLp(xt, yt)||2+(ηy Lηy2 2 )|| y ˆLp(xt+1, yt)||2 ˆLp(xt, yt) ˆLp(xt+1, yt+1). (55) According to the setting of ηx, ηy, we have that ηx Lηx 2 2 > 0, ηy Lηy 2 2 > 0. And we set constant d = min{ηx Lηx 2 2 , ηy Lηy 2 2 }, thus we can obtain that, || x ˆLp(xt, yt)||2 + || y ˆLp(xt+1, yt)||2 ˆLp(xt, yt) ˆLp(xt+1, yt+1) Summing both sides of Eq. (56) for t = {T1, , T 1}, we obtain that, t=T1 (|| x ˆLp(xt, yt)||2 + || y ˆLp(xt+1, yt)||2) ˆLp(x T1, y T1) ˆL p (T T1)d , (57) where ˆL p = min ˆLp(x,y). Combining Eq. (57) with Definition A.1, we have that the number of iterations required by Algorithm 2 to return an ϵ-stationary point is bounded by O( ˆLp(x T1, y T1) ˆL p d 1 ϵ + T1). (58) B PROOF OF THEOREM 2 In this section, we provide complete proofs for Theorem 2. Firstly, we make some definitions about our problem. Definition B.1 Following (Xu et al., 2020), the stationarity gap at tth iteration is defined as: { xi Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} { yi Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} v Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) z Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) { λl Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} { θi Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} Published as a conference paper at ICLR 2023 And we also define: ( Gt)xi = xi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( Gt)yi = yi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( Gt)v = v Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( Gt)z = z Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( Gt)λl = λl Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( Gt)θi = θi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}). It follows that, i=1 (||( Gt)xi||2+||( Gt)yi||2+||( Gt)θi||2)+||( Gt)v||2+||( Gt)z||2+ l=1 ||( Gt)λl||2. 
Definition B.2 At tth iteration, the stationarity gap w.r.t e Lp({xi},{yi},v, z,{λl},{θi}) is defined as: { xi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} { yi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} v e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) z e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) { λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} { θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i})} We further define: ( e Gt)xi = xi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( e Gt)yi = yi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( e Gt)v = v e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( e Gt)z = z e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( e Gt)λl = λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), ( e Gt)θi = θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}). It follows that, || e Gt||2 = i=1 (||( e Gt)xi||2+||( e Gt)yi||2+||( e Gt)θi||2)+||( e Gt)v||2+||( e Gt)z||2+ l=1 ||( e Gt)λl||2. Definition B.3 In the proposed asynchronous algorithm, for the ith worker in tth iteration, the last iteration where this worker was active is defined as ˆti. And the next iteration this worker will be active is defined as ti. For the iteration index set which ith worker is active during T1 + T + τ iteration, it is defined as Vi(T). And the jth element in Vi(T) is defined as ˆvi(j). Then, we provide some useful lemmas used for proving the main convergence results in Theorem 2. Published as a conference paper at ICLR 2023 Lemma 1 Let sequences ηt x = ηt y = ηt v = ηt z = 2 L+ηλ|Pt|L2+ηθNL2+8( |Pt|γL2 ηλ(ct 1)2 + NγL2 ηθ(ct 2)2 ), suppose Assumption 1 and 2 hold, we can obtain that, Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) i=1 ( L+L2+1 ηtx )||xt+1 i xt i||2 + N P ηty )||yt+1 i yt i||2 + 3NL2τk1 l=1 ||λt+1 l λt l||2 +( L+6NL2τk1 ηtv )||vt+1 vt||2 + ( L+6NL2τk1 ηtz )||zt+1 zt||2. (65) Proof of Lemma 1: Utilizing the Lipschitz properties in Assumption 1, we can obtain that, Lp({xt+1 1 , xt 2, , xt N},{yt i},vt, zt,{λt l},{θt i}) Lp({xt i},{yt i},vt, zt,{λt l},{θt i}) x1Lp({xt i},{yt i},vt, zt,{λt l},{θt i}), xt+1 1 xt 1 + L 2 ||xt+1 1 xt 1||2, Lp({xt+1 1 , xt+1 2 , , xt N},{yt i},vt,zt,{λt l},{θt i}) Lp({xt+1 1 , xt 2, , xt N},{yt i},vt,zt,{λt l},{θt i}) x2Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), xt+1 2 xt 2 + L 2 ||xt+1 2 xt 2||2, Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) Lp({xt+1 1 , , xt+1 N 1, xt N},{yt i},vt,zt,{λt l},{θt i}) x N Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), xt+1 N xt N + L 2 ||xt+1 N xt N||2. (66) Summing up the above inequalities in Eq. (66), we can obtain that, Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) xi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), xt+1 i xt i + L 2 ||xt+1 i xt i||2 . (67) Combining xi Lp({x ˆti i }, {y ˆti i }, vˆti, zˆti, {λ ˆti l }, {θ ˆti i }) = xi e Lp({x ˆti i }, {y ˆti i }, vˆti, zˆti, {λ ˆti l }, {θ ˆti i }) with Eq. (15), we have that, D xt+1 i xt i, xi Lp({x ˆti i },{y ˆti i },vˆti,zˆti,{λ ˆti l },{θ ˆti i }) E = 1 ηx ||xt+1 i xt i||2 1 ηtx ||xt+1 i xt i||2. Next, combining the Cauchy-Schwarz inequality with Assumption 1, 2, we can get, D xt+1 i xt i, xi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) xi Lp({x ˆti i },{y ˆti i },vˆti,zˆti,{λ ˆti l },{θ ˆti i }) E 2||xt+1 i xt i||2+ L2 2 (||vt vˆtj||2+||zt zˆtj||2+ |Pt| P l=1 ||λt l λ ˆtj l ||2) 2||xt+1 i xt i||2+ 3L2τk1 2 (||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). Thus, according to Eq. 
(67), (68) and (69), we can obtain that, Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) ηt x )||xt+1 i xt i||2 + 3NL2τk1 2 (||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). Published as a conference paper at ICLR 2023 Similarly, using the Lipschitz properties in Assumption 1, we have, Lp({xt+1 i },{yt+1 i },vt,zt,{λt l},{θt i}) Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) yi Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}), yt+1 i yt i + L 2 ||yt+1 i yt i||2 . (71) Combining yi Lp({x ˆti i }, {y ˆti i }, vˆti, zˆti, {λ ˆti l }, {θ ˆti i })= yi e Lp({x ˆti i }, {y ˆti i }, vˆti, zˆti, {λ ˆti l }, {θ ˆti i }) with Eq. (16), we can obtain that, D yt+1 i yt i, yi Lp({x ˆti i },{y ˆti i },vˆti,zˆti,{λ ˆti l },{θ ˆti i }) E = 1 ηy ||yt+1 i yt i||2 1 ηty ||yt+1 i yt i||2. Then, combining the Cauchy-Schwarz inequality with Assumption 1, 2, we can get the following inequalities, D yt+1 i yt i, yi Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) yi Lp({x ˆti i },{y ˆti i },vˆti,zˆti,{λ ˆti l },{θ ˆti i }) E 2||yt+1 i yt i||2+ L2 2 (||xt+1 i xt i||2+||vt vˆtj||2+||zt zˆtj||2+ |Pt| P l=1 ||λt l λ ˆtj l ||2) 2||yt+1 i yt i||2+ L2 2 ||xt+1 i xt i||2+ 3L2τk1 2 (||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). Thus, combining Eq. (71), (72) with (73), we have, Lp({xt+1 i },{yt+1 i },vt,zt,{λt l},{θt i}) Lp({xt+1 i },{yt i},vt,zt,{λt l},{θt i}) ηty )||yt+1 i yt i||2 + N P 2 ||xt+1 i xt i||2 2 (||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). Combining the Lipschitz properties in Assumption 1 with Eq. (17), we have, Lp({xt+1 i },{yt+1 i },vt+1,zt,{λt l},{θt i}) Lp({xt+1 i },{yt+1 i },vt,zt,{λt l},{θt i}) v Lp({xt+1 i },{yt+1 i },vt,zt,{λt l},{θt i}), vt+1 vt + L 2 ||vt+1 vt||2 2 1 ηtv )||vt+1 vt||2. Similarly, combining the Lipschitz properties in Assumption 1 with Eq. (18), we have, Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) Lp({xt+1 i },{yt+1 i },vt+1,zt,{λt l},{θt i}) z Lp({xt+1 i },{yt+1 i },vt+1,zt,{λt l},{θt i}), zt+1 zt + L 2 ||zt+1 zt||2 2 1 ηtz )||zt+1 zt||2. By combining Eq. (70), (74), (75), (76), we conclude the proof of Lemma 1. Published as a conference paper at ICLR 2023 Lemma 2 Suppose Assumption 1 and 2 hold, t T1, we have: Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt+1 i }) Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) ηtx + |Pt|L2 2a1 + |Qt+1|L2 i=1 ||xt+1 i xt i||2 ηty + |Pt|L2 2a1 + |Qt+1|L2 i=1 ||yt+1 i yt i||2 +( L+6τk1NL2 ηtv + |Pt|L2 2a1 + |Qt+1|L2 2a3 )||vt+1 vt||2 +( L+6τk1NL2 ηtz + |Pt|L2 2a1 + |Qt+1|L2 2a3 )||zt+1 zt||2+ 1 2ηθ i=1 ||θt i θt 1 i ||2 +( a1+6τk1NL2 2 ct 1 1 ct 1 2 + 1 2ηλ ) |Pt| P l=1 ||λt+1 l λt l||2+( a3 2 ct 1 2 ct 2 2 + 1 2ηθ ) N P i=1 ||θt+1 i θt i||2 l=1 (||λt+1 l ||2 ||λt l||2)+ 1 2ηλ l=1 ||λt l λt 1 l ||2+ ct 1 2 i=1 (||θt+1 i ||2 ||θt i||2), where a1 > 0 and a3 > 0 are constants. Proof of Lemma 2: According to Eq. (19), in (t + 1)th iteration, λ Λ, it follows that: D λt+1 l λt l ηλ λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}), λ λt+1 l E = 0. (78) Let λ = λt l, we can obtain: λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) 1 ηλ (λt+1 l λt l), λt l λt+1 l Likewise, in tth iteration, we can obtain: λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }) 1 ηλ (λt l λt 1 l ), λt+1 l λt l Since e Lp({xi},{yi},v,z,{λl},{θi}) is concave with respect to λl and follows from Eq. (79) and Eq. 
(80), t T1, we have, e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}), λt+1 l λt l E l=1 ( D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), λt+1 l λt l E ηλ λt l λt 1 l , λt+1 l λt l ). (81) Denoting vt+1 1,l = λt+1 l λt l (λt l λt 1 l ), we can get the following equality, D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), λt+1 l λt l E D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i})), λt+1 l λt l E (1a) D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), vt+1 1,l E (1b) D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), λt l λt 1 l E (1c). Published as a conference paper at ICLR 2023 First, we put attention on the (1a) in Eq. (82), (1a) can be expressed as follows, D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), λt+1 l λt l E = λl Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), λt+1 l λt l + ct 1 1 ct 1 2 (||λt+1 l ||2 ||λt l||2) ct 1 1 ct 1 2 ||λt+1 l λt l||2. (83) Combining Cauchy-Schwarz inequality with Assumption 1, we can obtain, λl Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), λt+1 l λt l i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2)+ a1 2 ||λt+1 l λt l||2, (84) where a1 > 0 is a constant. Combining Eq. (83) with Eq. (84), we can obtain that, D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), λt+1 l λt l E i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2)+ a1 2 ||λt+1 l λt l||2 + ct 1 1 ct 1 2 (||λt+1 l ||2 ||λt l||2) ct 1 1 ct 1 2 ||λt+1 l λt l||2). (85) Then, we focus on the (1b) in Eq. (82). According to Cauchy-Schwarz inequality, (1b) can be expressed as follows, D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), vt+1 1,l E 2 || λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i })||2 2a2 ||vt+1 1,l ||2), (86) where a2 > 0 is a constant. Next, we focus on the (1c) in Eq. (82). Defining L1 = L + c0 1, according to Assumption 1 and the trigonometric inequality, λl, we have, || λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i })|| =|| λl Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt i}) ct 1 1 (λt l λt 1 l )|| (L + ct 1 1 )||λt l λt 1 l || L1 ||λt l λt 1 l ||. (87) Following from Eq. (87) and the strong concavity of e Lp({xi},{yi},v,z,{λl},{θi}) w.r.t λl (Nesterov, 2003; Xu et al., 2020), we can obtain that, D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }) E l=1 ( 1 L1 +ct 1 1 || λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i })||2 L1 +ct 1 1 ||λt l λt 1 l ||2). (88) In addition, the following inequality can be obtained, 1 ηλ λt l λt 1 l , λt+1 l λt l 1 2ηλ ||λt+1 l λt l||2 1 2ηλ ||vt+1 1,l ||2 + 1 2ηλ ||λt l λt 1 l ||2. (89) Published as a conference paper at ICLR 2023 Combining Eq. 
(81), (82), (85), (86), (88), (89), ηλ 2 1 L1 +c0 1 , and setting a2 = ηλ, we have: Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2) 2 ct 1 1 ct 1 2 + 1 2ηλ ) |Pt| P l=1 ||λt+1 l λt l||2+ ct 1 1 l=1 (||λt+1 l ||2 ||λt l||2) + 1 2ηλ l=1 ||λt l λt 1 l ||2. According to Eq. (20), in (t + 1)th iteration, θ Θ, it follows that, D θt+1 i θt i ηθ θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}), θ θt+1 i E = 0. (91) Choosing θ = θt i, we can obtain, θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) 1 ηθ (θt+1 i θt i), θt i θt+1 i Likewise, in tth iteration, we have, θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }) 1 ηθ (θt i θt 1 i ), θt+1 i θt i Since e Lp({xi},{yi},v,z,{λl},{θi}) is concave with respect to θi and follows from Eq. (93): e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt+1 i }) e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}), θt+1 i θt i E i=1 ( D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt+1 i θt i E ηθ θt i θt 1 i , θt+1 i θt i ). (94) Denoting vt+1 2,l = θt+1 i θt i (θt i θt 1 i ), we have that, D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt+1 i θt i E D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i E (2a) D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), vt+1 2,l E (2b) D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt i θt 1 i E (2c). We firstly focus on the (2a) in Eq. (95), we can write the (2a) as, D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i E = θi Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i + ct 1 2 ct 2 2 (||θt+1 i ||2 ||θt i||2) ct 1 2 ct 2 2 ||θt+1 i θt i||2). (96) Published as a conference paper at ICLR 2023 And combining the Cauchy-Schwarz inequality with Assumption 1, we can obtain, θi Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i = θi Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) θi Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2) + a3 2 ||θt+1 i θt i||2, where a3 > 0 is a constant. Thus, we can get the upper bound of (2a) by combining Eq. (96) with Eq. (97), that is, D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i E i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2)+ a3 2 ||θt+1 i θt i||2 + ct 1 2 ct 2 2 (||θt+1 i ||2 ||θt i||2) ct 1 2 ct 2 2 ||θt+1 i θt i||2). (98) Next we focus on the (2b) in Eq. (95). According to Cauchy-Schwarz inequality we can write (2b) as, D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), vt+1 2,l E 2 || θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i })||2 2a4 ||vt+1 2,l ||2), (99) where a4 > 0 is a constant. Then, we focus on the (2c) in Eq. (95). Defining L2 = L + c0 2, according to Assumption 1 and the trigonometric inequality, we have, || θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i })|| L2 ||θt i θt 1 i ||. 
(100) Following Eq. (100) and the strong concavity of e Lp({xi},{yi},v,z,{λl},{θi}) w.r.t θi, the upper bound of (2c) can be obtained, that is, D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt i θt 1 i E i=1 ( 1 L2 +ct 1 2 || θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i })||2 L2 +ct 1 2 ||θt i θt 1 i ||2). (101) In addition, the following inequality can also be obtained, 1 ηθ θt i θt 1 i , θt+1 i θt i N P 2ηθ ||θt+1 i θt i||2 1 2ηθ ||vt+1 2,l ||2 + 1 2ηθ ||θt i θt 1 i ||2). Combining Eq. (94), (95), (98), (99), (101), (102), ηθ 2 1 L2 +c0 2 , and setting a4 = ηθ, we have, Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt+1 i }) Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2) 2 ct 1 2 ct 2 2 + 1 2ηθ ) N P i=1 ||θt+1 i θt i||2+ ct 1 2 i=1 (||θt+1 i ||2 ||θt i||2)+ 1 2ηθ i=1 ||θt i θt 1 i ||2. Published as a conference paper at ICLR 2023 By combining Lemma 1 with Eq. (90) and Eq. (103), we conclude the proof of Lemma 2. Lemma 3 Firstly, we denote St+1 1 , St+1 2 and F t+1 as, St+1 1 = 4 ηλ2ct+1 1 l=1 ||λt+1 l λt l||2 4 ηλ (ct 1 1 ct 1 1) l=1 ||λt+1 l ||2, (104) St+1 2 = 4 ηθ2ct+1 2 i=1 ||θt+1 i θt i||2 4 ηθ (ct 1 2 ct 2 1) i=1 ||θt+1 i ||2, (105) F t+1 = Lp({xt+1 i }, {yt+1 i }, zt+1, ht+1, {λt+1 l }, {θt+1 i }) + St+1 1 + St+1 2 l=1 ||λt+1 l λt l||2 ct 1 2 l=1 ||λt+1 l ||2 7 2ηθ i=1 ||θt+1 i θt i||2 ct 2 2 i=1 ||θt+1 i ||2. (106) Defining a5 = max{1, 1 + L2, 6τk1NL2}, t T1, we have, ηtx + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 ) N P i=1 ||xt+1 i xt i||2 ηty + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 ) N P i=1 ||yt+1 i yt i||2 ηtv + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 )||vt+1 vt||2 ηtz + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 )||zt+1 zt||2 ( 1 10ηλ 6τk1NL2 l=1 ||λt+1 l λt l||2 1 10ηθ i=1 ||θt+1 i θt i||2+ ct 1 1 ct 1 2 l=1 ||λt+1 l ||2 + ct 1 2 ct 2 2 i=1 ||θt+1 i ||2+ 4 ηλ ( ct 2 1 ct 1 1 ct 1 1 ct 1 ) |Pt| P l=1 ||λt l||2+ 4 ηθ ( ct 2 2 ct 1 2 ct 1 2 i=1 ||θt i||2. Proof of Lemma 3: Let a1 = 1 ηλ , a3 = 1 ηθ and substitute them into the Lemma 2, t T1, we have, Lp({xt+1 i },{yt+1 i }, vt+1, zt+1,{λt+1 l },{θt+1 i }) Lp({xt i},{yt i}, vt, zt,{λt l},{θt i}) ηtx + ηλ|Pt|L2+ηθ|Qt+1|L2 i=1 ||xt+1 i xt i||2 ηty + ηλ|Pt|L2+ηθ|Qt+1|L2 i=1 ||yt+1 i yt i||2 +( L+6τk1NL2 ηtv + ηλ|Pt|L2+ηθ|Qt+1|L2 2 )||vt+1 vt||2+ 1 2ηλ l=1 ||λt l λt 1 l ||2 +( L+6τk1NL2 ηt z + ηλ|Pt|L2+ηθ|Qt+1|L2 2 )||zt+1 zt||2+ 1 2ηθ i=1 ||θt i θt 1 i ||2 2 ct 1 1 ct 1 2 + 1 ηλ ) |Pt| P l=1 ||λt+1 l λt l||2+( 1 ηθ ct 1 2 ct 2 2 ) N P i=1 ||θt+1 i θt i||2 l=1 (||λt+1 l ||2 ||λt l||2)+ ct 1 2 i=1 (||θt+1 i ||2 ||θt i||2). According to Eq. (19), in (t + 1)th iteration, it follows that: D λt+1 l λt l ηλ λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}), λt l λt+1 l E = 0. (109) Similar to Eq. (109), in tth iteration, we have, D λt l λt 1 l ηλ λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), λt+1 l λt l E = 0. (110) Published as a conference paper at ICLR 2023 Thus, t T1, by combining Eq. (109) with Eq. 
(110), we can obtain that, D vt+1 1,l , λt+1 l λt l E = D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), λt+1 l λt l E = D λl e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), λt+1 l λt l E + D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i }), vt+1 1,l E + D λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l }, {θt 1 i }), λt l λt 1 l E . (111) Since we have that, D vt+1 1,l , λt+1 l λt l E = 1 2ηλ ||λt+1 l λt l||2+ 1 2ηλ ||vt+1 1,l ||2 1 2ηλ ||λt l λt 1 l ||2, (112) it follows from Eq. (111) and Eq. (112) that, 1 2ηλ ||λt+1 l λt l||2+ 1 2ηλ ||vt+1 1,l ||2 1 2ηλ ||λt l λt 1 l ||2 2bt 1 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2)+ bt 1 2 ||λt+1 l λt l||2 + ct 1 1 ct 1 2 (||λt+1 l ||2 ||λt l||2) ct 1 1 ct 1 2 ||λt+1 l λt l||2 2 || λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i })||2+ 1 2ηλ||vt+1 1,l ||2 1 L1 +ct 1 1 || λl e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) λl e Lp({xt i},{yt i},vt,zt,{λt 1 l },{θt 1 i })||2 L1 +ct 1 1 ||λt l λt 1 l ||2, (113) where bt 1 > 0. According to the setting that c0 1 L1 , we have ct 1 1 L1 L1 +ct 1 1 ct 1 1 L1 2L1 = ct 1 1 ct 1 2 . Multiplying both sides of Eq. (113) by 8 ηλct 1 , we have, 4 ηλ2ct 1 ||λt+1 l λt l||2 4 ηλ ( ct 1 1 ct 1 ct 1 )||λt+1 l ||2 4 ηλ2ct 1 ||λt l λt 1 l ||2 4 ηλ ( ct 1 1 ct 1 ct 1 )||λt l||2 + 4bt 1 ηλct 1 ||λt+1 l λt l||2 4 ηλ ||λt l λt 1 l ||2 ηλct 1bt 1 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2). Setting bt 1 = ct 1 2 in Eq. (114) and using the definition of St 1, t T1, we have, St+1 1 St 1 4 ηλ ( ct 2 1 ct 1 1 ct 1 1 ct 1 )||λt l||2+ |Pt| P ηλ + 4 ηλ2 ( 1 ct+1 1 1 ct 1 ))||λt+1 l λt l||2 4 ηλ ||λt l λt 1 l ||2+ 8|Pt|L2 ηλ(ct 1)2 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2). Published as a conference paper at ICLR 2023 Similarly, according to Eq. (20), it follows that, D vt+1 2,l , θt+1 i θt i E = D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt+1 i θt i E = D θi e Lp({xt+1 i },{yt+1 i },vt+1,zt+1,{λt+1 l },{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), θt+1 i θt i E + D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), vt+1 2,l E + D θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i }), θt i θt 1 i E . (116) In addition, since D vt+1 2,l , θt+1 i θt i E = 1 2ηθ ||θt+1 i θt i||2+ 1 2ηθ ||vt+1 2,l ||2 1 2ηθ ||θt i θt 1 i ||2, (117) it follows that, 1 2ηθ ||θt+1 i θt i||2+ 1 2ηθ ||vt+1 2,l ||2 1 2ηθ ||θt i θt 1 i ||2 2bt 2 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2)+ bt 2 2 ||θt+1 i θt i||2 + ct 1 2 ct 2 2 (||θt+1 i ||2 ||θt i||2) ct 1 2 ct 2 2 ||θt+1 i θt i||2 ct 1 2 L 2 L 2+ct 1 2 ||θt i θt 1 i ||2+ 1 2ηθ ||vt+1 2,l ||2 2 || θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i })||2 1 L 2+ct 1 2 || θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt i}) θi e Lp({xt i},{yt i},vt,zt,{λt l},{θt 1 i })||2. (118) According to the setting c0 2 L2 , we have ct 1 2 L2 L2 +ct 1 2 ct 1 2 L2 2L2 = ct 1 2 2 ct 2 2 . Multiplying both sides of Eq. 
(118) by 8 ηθct 2 , we have, 4 ηθ2ct 2 ||θt+1 i θt i||2 4 ηθ ( ct 1 2 ct 2 ct 2 )||θt+1 i ||2 4 ηθ2ct 2 ||θt i θt 1 i ||2 4 ηθ ( ct 1 2 ct 2 ct 2 )||θt i||2+ 4bt 2 ηθct 2 ||θt+1 i θt i||2 4 ηθ ||θt i θt 1 i ||2 ηθct 2bt 2 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2). Setting bt 2 = ct 2 2 in Eq. (119) and utilizing the definition of St 2, we have that, St+1 2 St 2 4 ηθ ( ct 2 2 ct 1 2 ct 1 2 ct 2 )||θt i||2+ N P ηθ + 4 ηθ2 ( 1 ct+1 2 1 ct 2 ))||θt+1 i θt i||2 4 ηθ ||θt i θt 1 i ||2+ 8NL2 ηθ(ct 2)2 ( N P i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2). Based on the setting of ct 1 and ct 2, we can obtain that ηλ 10 1 ct+1 1 1 10 1 ct+1 2 1 ct 2 , t T1. Defining a5 = max{1, 1 + L2, 6τk1NL2}. Combining the definition of F t+1 with Eq. (115) and Published as a conference paper at ICLR 2023 Eq. (120), t T1, we can obtain that, ηtx + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 ) N P i=1 ||xt+1 i xt i||2 ηty + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 ) N P i=1 ||yt+1 i yt i||2 ηtv + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 )||vt+1 vt||2 ηtz + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 )||zt+1 zt||2 ( 1 10ηλ 6τk1NL2 l=1 ||λt+1 l λt l||2 1 10ηθ i=1 ||θt+1 i θt i||2+ ct 1 1 ct 1 2 l=1 ||λt+1 l ||2 + ct 1 2 ct 2 2 i=1 ||θt+1 i ||2+ 4 ηλ ( ct 2 1 ct 1 1 ct 1 1 ct 1 ) |Pt| P l=1 ||λt l||2+ 4 ηθ ( ct 2 2 ct 1 2 ct 1 2 i=1 ||θt i||2. which concludes the proof of Lemma 3. Proof of Theorem 1: First, we set that, at 6 = 4|Pt|(γ 2)L2 ηλ(ct 1)2 + 4N(γ 2)L2 ηθ(ct 2)2 + ηθ(N |Qt+1|)L2 where constant γ satisfies that γ > 2 and 4(γ 2)L2 ηλ(c0 1)2 + 4N(γ 2)L2 ηθ(c0 2)2 > a5 2 , thus we have that at 6 > 0, t. According to the setting of ηt x, ηt y, ηt v, ηt z and ct 1, ct 2, we have, ηtx + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 = at 6, (123) ηty + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 = at 6, (124) ηtv + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 = at 6, (125) ηtz + ηλ|Pt|L2 2 + ηθ|Qt+1|L2 2 + 8|Pt|L2 ηλ(ct 1)2 + 8NL2 ηθ(ct 2)2 = at 6. (126) Combining Eq. (123), (124), (125), (126) with Lemma 3, t T1, we can obtain that, i=1 (||xt+1 i xt i||2 + ||yt+1 i yt i||2)+at 6||vt+1 vt||2+at 6||zt+1 zt||2 +( 1 10ηλ 6τk1NL2 l=1 ||λt+1 l λt l||2+ 1 10ηθ i=1 ||θt+1 i θt i||2 F t F t+1+ ct 1 1 ct 1 2 l=1 ||λt+1 l ||2+ ct 1 2 ct 2 2 i=1 ||θt+1 i ||2 ηλ ( ct 2 1 ct 1 1 ct 1 1 ct 1 ) |Pt| P l=1 ||λt l||2+ 4 ηθ ( ct 2 2 ct 1 2 ct 1 2 i=1 ||θt i||2. Utilizing the definition of ( e Gt)xi and combining it with trigonometric inequality, Cauchy-Schwarz inequality and Assumption 1 and 2, we can obtain that, ||( e Gt)xi||2 2 ηx2 ||xti i xt i||2+6L2τk1(||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). Published as a conference paper at ICLR 2023 Utilizing the definition of ( e Gt)yi and combining it with trigonometric inequality and Cauchy Schwarz inequality, it follows that, ||( e Gt)yi||2 2 ηy2 ||yti i yt i||2+6L2τk1(||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2). (129) Utilizing the definition of ( e Gt)v and combining it with trigonometric inequality and Cauchy Schwarz inequality, we have that, ||( e Gt)v||2 2L2 N P i=1 (||xt+1 i xt i||2+||yt+1 i yt i||2)+ 2 ηv2 ||vt+1 vt||2. (130) Using the definition of ( e Gt)z and combining it with trigonometric inequality and Cauchy-Schwarz inequality, it follows that, ||( e Gt)z||2 2L2 N P i=1 (||xt+1 i xt i||2+||yt+1 i yt i||2)+||vt+1 vt||2 + 2 ηz2 ||zt+1 zt||2. 
Using the definition of ( e Gt)λl and combining it with trigonometric inequality and Cauchy Schwarz inequality, we can obtain the following inequality, ||( e Gt)λl||2 3 ηλ2 ||λt+1 l λt l||2 + 3((ct 1 1 )2 (ct 1)2)||λt l||2 i=1 (||xt+1 i xt i||2+||yt+1 i yt i||2)+||vt+1 vt||2+||zt+1 zt||2 . Combining the definition of ( e Gt)θi with Cauchy-Schwarz inequality and Assumption 2, we have, ||( e Gt)θi||2 3 ηθ2 ||θti i θt i||2+3L2 N P i=1 (||xti i xt i||2+||yti i yt i||2)+||vti vt||2 +3(c ˆti 1 2 cti 1 2 )2||θt i||2 3 ηθ2 ||θti i θt i||2+3L2 N P i=1 (||xti i xt i||2+||yti i yt i||2) +3L2τk1(||vt+1 vt||2+||zt+1 zt||2+ |Pt| P l=1 ||λt+1 l λt l||2) + 3((c ˆti 1 2 )2 (cti 1 2 )2)||θt i||2. In sight of the Definition B.2 as well as Eq. (128), (129), (130), (131), (132) and Eq. (133), we can obtain that, i=1 (||( e Gt)xi||2+||( e Gt)yi||2+||( e Gt)θi||2)+||( e Gt)v||2+||( e Gt)z||2+ |Pt| P l=1 ||( e Gt)λl||2 ( 2 ηx2 +3NL2) N P i=1 ||xti i xt i||2+( 2 ηy2 +3NL2) N P i=1 ||yti i yt i||2 +(4+3|Pt|L2) N P i=1 ||xt+1 i xt i||2 + (4+3|Pt|L2) N P i=1 ||yt+1 i yt i||2 ηv2 +(2+15τk1N +3|Pt|)L2)||vt+1 vt||2+( 2 ηz2 +(15τk1N +3|Pt|)L2)||zt+1 zt||2 ηλ2 +15τk1NL2)||λt+1 l λt l||2+ |Pt| P l=1 3((ct 1 1 )2 (ct 1)2)||λt l||2 3 ηθ2 ||θti i θt i||2+ N P i=1 3((c ˆti 1 2 )2 (cti 1 2 )2)||θt i||2. Published as a conference paper at ICLR 2023 Let constant a6 denote the lower bound of at 6 (a6 > 0), and we set constants d1, d2, d3, d4 that, d1 = 2kττ +(4+3M +3kττN)L2ηx2 ηx2(a6)2 2kττ +(4+3|Pt|+3kττN)L2ηx2 ηx2(at 6)2 , (135) d2 = 2kττ +(4+3M +3kττN)L2ηy2 ηy2(a6)2 2kττ +(4+3|Pt|+3kττN)L2ηy2 ηy2(at 6)2 , (136) d3 = 2+(2+15τk1N +3M)L2ηv2 ηv2(a6)2 2+(2+15τk1N +3|Pt|)L2ηv2 ηv2(at 6)2 , (137) d4 = 2+(15τk1N +3M)L2ηz2 ηz2(a6)2 2+(15τk1N +3|Pt|)L2ηz2 ηz2(at 6)2 , (138) where kτ is a positive constant. Thus, combining Eq. (134) with Eq. (135), Eq. (136), (137), (138), we can obtain, || e Gt||2 N P i=1 d1(at 6)2||xt+1 i xt i||2 + N P i=1 d2(at 6)2||yt+1 i yt i||2 +d3(at 6)2||vt+1 vt||2+d4(at 6)2||zt+1 zt||2+ N P ηλ2 +15τk1NL2)||λt+1 l λt l||2+ |Pt| P l=1 3((ct 1 1 )2 (ct 1)2)||λt l||2 i=1 3((c ˆti 1 2 )2 (cti 1 2 )2)||θt i||2+( 2 ηx2 +3NL2) N P i=1 ||xti i xt i||2 ηx2 +3kττNL2) N P i=1 ||xt+1 i xt i||2+( 2 ηy2 +3NL2) N P i=1 ||yti i yt i||2 ηy2 +3kττNL2) N P i=1 ||yt+1 i yt i||2. Let dt 5 denote a nonnegative sequence, i.e., dt 5 = 1 max{d1at 6,d2at 6,d3at 6,d4at 6, 30 ηλ +150ηλτk1NL2 1 30ηλτk1NL2 , 30τ denote the upper and lower bound of dt 5 as d5 and d5, respectively. And we set the constant kτ satisfies kτ max{ d5( 2 ηy2 +3NL2) d5( 2 ηy2 +3NL2), d5( 2 ηx2 +3NL2) d5( 2 ηx2 +3NL2)}, where ηx and ηy are the upper bounds of ηt x and ηt y, respectively. We can obtain the following inequality by combining Eq. (139) with the definition of dt 5: dt 5|| e Gt||2 at 6 i=1 (||xt+1 i xt i||2+||yt+1 i yt i||2)+at 6||vt+1 vt||2+at 6||zt+1 zt||2 +( 1 10ηλ 6τk1NL2 l=1 ||λt+1 l λt l||2 + 1 10τηθ l=1 3dt 5((ct 1 1 )2 (ct 1)2)||λt l||2 + N P i=1 3dt 5((c ˆti 1 2 )2 (cti 1 2 )2)||θt i||2 +dt 5( 2 ηx2 +3NL2) N P i=1 ||xti i xt i||2 dt 5( 2kτ τ ηx2 +3kττNL2) N P i=1 ||xt+1 i xt i||2 +dt 5( 2 ηy2 +3NL2) N P i=1 ||yti i yt i||2 dt 5( 2kτ τ ηy2 +3kττNL2) N P i=1 ||yt+1 i yt i||2. Published as a conference paper at ICLR 2023 Combining the definition of dt 5 with Eq. 
(127) and according to the setting ||λt l||2 α3, ||θt i||2 α4 and d5 dt 5 d5, t T1, thus, we have, dt 5|| e Gt||2 F t F t+1 + ct 1 1 ct 1 2 Mα3 + ct 1 2 ct 2 2 Nα4 + 4 ηλ ( ct 2 1 ct 1 1 ct 1 1 ηθ ( ct 2 2 ct 1 2 ct 1 2 ct 2 )Nα4 + 3d5((ct 1 1 )2 (ct 1)2)Mα3 + 3d5 i=1 ((c ˆti 1 2 )2 (cti 1 2 )2)α4 ti θt i||2 1 10ηθ i=1 ||θi t+1 θt i||2 +d5( 2 ηx2 + 3NL2) N P i=1 ||xti i xt i||2 d5( 2kτ τ ηx2 + 3kττNL2) N P i=1 ||xt+1 i xt i||2 +d5( 2 ηy2 + 3NL2) N P i=1 ||yti i yt i||2 d5( 2kτ τ ηy2 + 3kττNL2) N P i=1 ||yt+1 i yt i||2. Denoting e T(ϵ) as e T(ϵ) = min{t | || e GT1+t||2 ϵ 4, t 2}. Summing up Eq. (141) from t = T1+2 to t = T1 + e T(ϵ), we have, T1+ e T (ϵ) P t=T1+2 dt 5|| e Gt||2 F T1+2 L + 4 ηλ ( c0 1 c1 1 + c1 1 c2 1 )Mα3 + c1 1 2 Mα3 + 7 2ηλ Mσ32 + 3d5(c1 1)2Mα3 ηθ ( c0 1 c1 1 + c1 1 c2 1 )Nα4 + c1 2 2 Nα4 + 7 2ηθ Nσ42 + N P T1+ e T (ϵ) P t=T1+2 3d5((c ˆti 1 2 )2 (cti 1 2 )2)α4 2 Mσ32 + c T1+2 2 2 Nσ42 + 1 10τηθ T1+ e T (ϵ) P ti θt i||2 1 10ηθ T1+ e T (ϵ) P i=1 ||θi t+1 θt i||2 +d5( 2 ηx2 + 3NL2) T1+ e T (ϵ) P i=1 ||xti i xt i||2 d5( 2kτ τ ηx2 + 3kττNL2) T1+ e T (ϵ) P i=1 ||xt+1 i xt i||2 +d5( 2 ηy2 + 3NL2) T1+ e T (ϵ) P i=1 ||yti i yt i||2 d5( 2kτ τ ηy2 + 3kττNL2) T1+ e T (ϵ) P i=1 ||yt+1 i yt i||2, (142) where σ3 =max{||λ1 λ2||}, σ4 = max{||θ1 θ2||} and L =min Lp({xt i},{yt i},vt,zt,{λt l},{θt i}), which satisfy that, t T1 + 2, c1 1 c2 1 Mα3 4 c1 2 c2 2 Nα4 7 2ηλ Mσ3 2 7 2ηθ Nσ4 2 c T1+2 1 2 Mσ3 2 c T1+2 2 2 Nσ4 2. (143) For each worker i, we have that ti ˆti τ, thus, T1+ e T (ϵ) P t=T1+2 3d5((c ˆti 1 2 )2 (cti 1 2 )2)α4 ˆvi(j) Vi( e T (ϵ)), T1+2 ˆvi(j) T1+ e T (ϵ) 3d5((cˆvi(j) 1 2 )2 (cˆvi(j+1) 1 2 )2)α4 3τd5(c1 2)2α4. Since the idle workers do not update their variables in each master iteration, for any t that satisfies ˆvi(j 1) t < ˆvi(j), we have θt i = θˆvi(j) 1 i . And for t / Vi(T), we have ||θt i θt 1 i ||2 = 0. Published as a conference paper at ICLR 2023 Combining with ˆvi(j) ˆvi(j 1) τ, we can obtain that, T1+ e T (ϵ) P tj θt i||2 τ P ˆvi(j) Vi( e T (ϵ)), T1+3 ˆvi(j) i=1 ||θˆvi(j) i θˆvi(j) 1 i ||2 = τ T1+ e T (ϵ) P i=1 ||θi t+1 θt i||2 + τ e T (ϵ)+τ 1 P t= e T (ϵ)+1 i=1 ||θi t+1 θt i||2 τ T1+ e T (ϵ) P i=1 ||θi t+1 θt i||2 + 4τ(τ 1)Nα4. Similarly, for any t that satisfies ˆvi(j 1) t < ˆvi(j), we have xt i = xˆvi(j) 1 i , yt i = yˆvi(j) 1 i . And for t / Vi(T), we have ||xt i xt 1 i ||2 = 0, ||yt i yt 1 i ||2 = 0. Combining with ˆvi(j) ˆvi(j 1) τ, we can get that, T1+ e T (ϵ) P i=1 ||xtj i xt i||2 τ P ˆvi(j) Vi( e T (ϵ)), T1+3 ˆvi(j) i=1 ||xˆvi(j) i xˆvi(j) 1 i ||2 = τ T1+ e T (ϵ) P i=1 ||xt+1 i xt i||2+τ e T (ϵ)+τ 1 P t= e T (ϵ)+1 i=1 ||xt+1 i xt i||2 τ T1+ e T (ϵ) P i=1 ||xt+1 i xt i||2 + 4τ(τ 1)Nα1. T1+ e T (ϵ) P i=1 ||ytj i yt i||2 τ P ˆvi(j) Vi( e T (ϵ)), 3 ˆvi(j) i=1 ||yˆvi(j) i yˆvi(j) 1 i ||2 = τ T1+ e T (ϵ) P i=1 ||yt+1 i yt i||2+τ e T (ϵ)+τ 1 P t= e T (ϵ)+1 i=1 ||yt+1 i yt i||2 τ T1+ e T (ϵ) P i=1 ||yt+1 i yt i||2 + 4τ(τ 1)Nα2. It follows from Eq. (142), (144), (145), (146) that, T1+ e T (ϵ) P t=T1+2 dt 5|| e Gt||2 F T1+2 L + 4 ηλ ( c0 1 c1 1 + c1 1 c2 1 )Mα3 + c1 1 2 Mα3 + 7 2ηλ Mσ32 + 3d5(c1 1)2Mα3 ηθ ( c0 1 c1 1 + c1 1 c2 1 )Nα4 + c1 2 2 Nα4 + 7 2ηθ Nσ42 + 3τd5(c1 2)2Nα4 + c T1+2 1 2 Mσ32 + c T1+2 2 5ηθ + 4d5( 2 ηx2 + 3NL2)Nα1τ + 4d5( 2 ηy2 + 3NL2)Nα2τ)(τ 1) = d +kdτ(τ 1), (148) where d and kd are constants. Constant d6 is given by, d6 = max{d1, d2, d3, d4, 30 ηλ +150ηλτk1NL2 (1 30ηλτk1NL2)a6 , 30τ max{d1, d2, d3, d4, 30 ηλ +150ηλτk1NL2 (1 30ηλτk1NL2)at 6 , 30τ = 1 dt 5at 6 . 
Thus, we can obtain that
$$\sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{d_6 a^t_6}\,\|\nabla \widetilde{G}^{T_1+\widetilde{T}(\epsilon)}\|^2 \leq \sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{d_6 a^t_6}\,\|\nabla \widetilde{G}^{t}\|^2 \leq \sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} d^t_5\,\|\nabla \widetilde{G}^{t}\|^2 \leq \bar{d} + k_d\tau(\tau-1). \quad (150)$$
And it follows from Eq. (150) that
$$\|\nabla \widetilde{G}^{T_1+\widetilde{T}(\epsilon)}\|^2 \leq \frac{(\bar{d} + k_d\tau(\tau-1))\,d_6}{\sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{a^t_6}}. \quad (151)$$
According to the setting of $c^t_1, c^t_2$, we have
$$a^t_6 \leq 4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta)(t+1)^{\frac{1}{2}} + \frac{\eta_\theta(N-S)L^2}{2}. \quad (152)$$
Summing up $\frac{1}{a^t_6}$ from $t = T_1+2$ to $t = T_1+\widetilde{T}(\epsilon)$, it follows that
$$\sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{a^t_6} \geq \sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta)(t+1)^{\frac{1}{2}} + \frac{\eta_\theta(N-S)L^2}{2}} \geq \sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{\left(4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta) + \frac{\eta_\theta(N-S)L^2}{2}\right)(t+1)^{\frac{1}{2}}} \geq \frac{(T_1+\widetilde{T}(\epsilon))^{\frac{1}{2}} - (T_1+2)^{\frac{1}{2}}}{4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta) + \frac{\eta_\theta(N-S)L^2}{2}}. \quad (153)$$
The second inequality in Eq. (153) is due to the fact that, for $t \geq T_1+2$,
$$4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta)(t+1)^{\frac{1}{2}} + \frac{\eta_\theta(N-S)L^2}{2} \leq \left(4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta) + \frac{\eta_\theta(N-S)L^2}{2}\right)(t+1)^{\frac{1}{2}}. \quad (154)$$
The last inequality in Eq. (153) follows from the fact that $\sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} (t+1)^{-\frac{1}{2}} \geq (T_1+\widetilde{T}(\epsilon))^{\frac{1}{2}} - (T_1+2)^{\frac{1}{2}}$. Thus, plugging Eq. (153) into Eq. (151), we can obtain
$$\|\nabla \widetilde{G}^{T_1+\widetilde{T}(\epsilon)}\|^2 \leq \frac{(\bar{d} + k_d\tau(\tau-1))\,d_6}{\sum_{t=T_1+2}^{T_1+\widetilde{T}(\epsilon)} \frac{1}{a^t_6}} \leq \frac{\left(4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta) + \frac{\eta_\theta(N-S)L^2}{2}\right)(\bar{d} + k_d\tau(\tau-1))\,d_6}{(T_1+\widetilde{T}(\epsilon))^{\frac{1}{2}} - (T_1+2)^{\frac{1}{2}}}. \quad (155)$$
Let the constant $d_7 = 4(\gamma-2)L^2(M\eta_\lambda + N\eta_\theta)$; according to the definition of $\widetilde{T}(\epsilon)$, we have
$$T_1 + \widetilde{T}(\epsilon) \leq \left(\frac{4\left(d_7 + \frac{\eta_\theta(N-S)L^2}{2}\right)(\bar{d} + k_d\tau(\tau-1))\,d_6}{\epsilon} + (T_1+2)^{\frac{1}{2}}\right)^2. \quad (156)$$
Combining the definitions of $\nabla G^t$ and $\nabla \widetilde{G}^t$ with the triangle inequality, we then get
$$\|\nabla G^t\| \leq \|\nabla \widetilde{G}^t\| + \|\nabla G^t - \nabla \widetilde{G}^t\| \leq \|\nabla \widetilde{G}^t\| + \sqrt{\sum_{l=1}^{|P^t|}\|c^{t-1}_1 \lambda^t_l\|^2 + \sum_{i=1}^{N}\|c^{t-1}_2 \theta^t_i\|^2}. \quad (157)$$

Table 2: Step-sizes of all variables in the experiments.
Datasets          ηx      ηy      ηv      ηz      ηλ    ηθ
MNIST             0.001   0.02    0.001   0.02    0.1   0.001
Fashion MNIST     0.001   0.02    0.001   0.02    0.1   0.001
CIFAR-10          0.001   0.02    0.001   0.02    0.1   0.001
Covertype         0.01    0.02    0.01    0.02    0.1   0.01
IJCNN1            0.01    0.005   0.01    0.005   0.1   0.01
Australian        0.001   0.02    0.001   0.02    5     0.001

If $t \geq \left(\frac{4(M\alpha_3 + N\alpha_4)}{\epsilon}\right)^2$, then we have $\sum_{l=1}^{|P^t|}\|c^{t-1}_1 \lambda^t_l\|^2 + \sum_{i=1}^{N}\|c^{t-1}_2 \theta^t_i\|^2 \leq \frac{\epsilon}{2}$. Combining this with Eq. (156), we can conclude that there exists
$$T(\epsilon) \leq O\!\left(\max\left\{\left(\frac{4(M\alpha_3 + N\alpha_4)}{\epsilon}\right)^2,\ \left(\frac{4\left(d_7 + \frac{\eta_\theta(N-S)L^2}{2}\right)(\bar{d} + k_d\tau(\tau-1))\,d_6}{\epsilon} + (T_1+2)^{\frac{1}{2}}\right)^2\right\}\right), \quad (158)$$
such that $\|\nabla G^t\|^2 \leq \epsilon$, which concludes our proof.

C PROOF OF THEOREM 1

Assume that cutting planes are added every $k$ iterations, i.e.,
$$P^0 \subseteq P^k \subseteq \cdots \subseteq P^{nk}. \quad (159)$$
Let $R^k$ denote the feasible region of the problem in Eq. (12) at the $k$-th iteration, and let $R^*$ denote the feasible region of the problem in Eq. (10). We have
$$R^0 \supseteq R^k \supseteq \cdots \supseteq R^{nk} \supseteq R^*. \quad (160)$$
Let $F^*(\{x^k_i\},\{y^k_i\},v^{k*},z^{k*})$ denote the optimal objective value of the problem in Eq. (12) at the $k$-th iteration, and let $F^*$ denote the optimal objective value of the problem in Eq. (10). According to Eq. (160), we have
$$F^*(\{x^0_i\},\{y^0_i\},v^{0*},z^{0*}) \leq F^*(\{x^k_i\},\{y^k_i\},v^{k*},z^{k*}) \leq \cdots \leq F^*(\{x^{nk}_i\},\{y^{nk}_i\},v^{nk*},z^{nk*}). \quad (161)$$
And we can obtain that
$$\frac{F^*(\{x^0_i\},\{y^0_i\},v^{0*},z^{0*})}{F^*} \leq \frac{F^*(\{x^k_i\},\{y^k_i\},v^{k*},z^{k*})}{F^*} \leq \cdots \leq \frac{F^*(\{x^{nk}_i\},\{y^{nk}_i\},v^{nk*},z^{nk*})}{F^*} \leq \beta. \quad (162)$$
It is seen from Eq. (162) that the sequence $\{F^*/F^*(\{x^k_i\},\{y^k_i\},v^{k*},z^{k*})\}$ is monotonically nonincreasing. When $nk \to \infty$, the optimal objective value of the problem in Eq. (12) monotonically converges to $\beta F^*$ ($\beta \leq 1$).

D DETAILS OF EXPERIMENTS

D.1 ADDITIONAL RESULTS

In this section, additional experimental results on the CIFAR-10 (Krizhevsky et al., 2009) and Australian (Quinlan, 1987) datasets are reported in Figure 11 and Figure 12. It can be seen from Figure 11 and Figure 12 that the proposed ADBO also achieves a faster convergence rate.

D.2 DETAILS OF EXPERIMENTS

In this section, we provide more details of the experimental setup in this work.
In the data hyper-cleaning task, experiments are carried out on the MNIST, Fashion MNIST, and CIFAR-10 datasets. Following (Ji et al., 2021), we use the same model for data hyper-cleaning on MNIST, Fashion MNIST, and CIFAR-10, and the SGD optimizer is used. The step-sizes are summarized in Table 2. On the MNIST and Fashion MNIST datasets, we set N = 18, S = 9, τ = 15, and on the CIFAR-10 dataset we set N = 18, S = 9, τ = 5. The (communication + computation) delay of each worker follows a log-normal distribution LN(3.5, 1).

In the regularization coefficient optimization task, experiments are carried out on the Covertype, IJCNN1, and Australian datasets. Following (Chen et al., 2022a), we use the same logistic regression model, and the SGD optimizer is used. The step-sizes are summarized in Table 2. On the Covertype dataset, we set N = 18, S = 9, τ = 15; on the IJCNN1 dataset, we set N = 24, S = 12, τ = 15; and on the Australian dataset, we set N = 4, S = 2, τ = 5. In the experiments that consider the straggler problem, three stragglers are placed in the distributed system, and the mean (communication + computation) delay of the stragglers is four times that of the normal workers.

Figure 11: (a) Test accuracy vs time and (b) test loss vs time on the CIFAR-10 dataset for the distributed data hyper-cleaning task (ADBO, SDBO, FEDNEST).

Figure 12: (a) Test accuracy vs time and (b) test loss vs time on the Australian dataset for the distributed regularization coefficient optimization task (ADBO, SDBO, FEDNEST).

Codes are available at https://github.com/ICLR23Submission6251/adbo.

E PARAMETER SERVER ARCHITECTURE

In this section, we illustrate the parameter server architecture, as shown in Figure 13. In the parameter server architecture, communication is centralized around a set of master nodes (or servers) that constitute the hubs of a star network; worker nodes (or clients) pull the shared parameters from, and send their updates to, the master nodes. Each worker holds its local variables, while the masters maintain the consensus variables.

Figure 13: Illustration of the parameter server architecture.
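To give a concrete sense of how the asynchronous interaction between masters and workers described above can be organized, the following is a minimal simulation sketch of an asynchronous parameter-server loop with log-normally distributed worker delays and stragglers, mirroring the experimental setup. It is illustrative only: it assumes that the master proceeds once it has received updates from the S fastest workers (our reading of the parameter S), and all names (worker_delay, local_update, etc.) are hypothetical rather than taken from the released code.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

N, S, T = 18, 9, 200    # workers, updates the master waits for per step, master iterations
dim = 10                # toy dimension of the shared (consensus) variables
stragglers = {0, 1, 2}  # three stragglers, as in the straggler experiments

def worker_delay(is_straggler):
    """(communication + computation) delay drawn from LN(3.5, 1);
    stragglers have a mean delay four times that of normal workers."""
    d = rng.lognormal(mean=3.5, sigma=1.0)
    return 4.0 * d if is_straggler else d

def local_update(shared, worker_id):
    """Hypothetical placeholder for a worker's local update computed
    from a possibly stale copy of the shared variables."""
    return shared - 0.01 * rng.standard_normal(dim)

shared = np.zeros(dim)
# event queue of (finish_time, worker_id): the time each worker's next update arrives
events = [(worker_delay(i in stragglers), i) for i in range(N)]
heapq.heapify(events)

for t in range(T):
    updates = []
    # asynchronous step: take the S earliest arrivals instead of waiting for everyone
    for _ in range(S):
        finish_time, i = heapq.heappop(events)
        updates.append(local_update(shared, i))
        # the worker immediately starts its next round with a fresh random delay
        heapq.heappush(events, (finish_time + worker_delay(i in stragglers), i))
    shared = np.mean(updates, axis=0)  # masters aggregate the received updates
```

Under this event-driven view, a slow straggler simply appears later in the event queue and never blocks the master, which is exactly the behavior the straggler experiments are designed to exercise.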