# Distributed Distributionally Robust Optimization with Non-Convex Objectives

Yang Jiao (Tongji University, yangjiao@tongji.edu.cn), Kai Yang (Tongji University, kaiyang@tongji.edu.cn), Dongjin Song (University of Connecticut, dongjin.song@uconn.edu)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over an ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis, risk management, etc. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness in different scenarios. To this end, we propose an asynchronous distributed algorithm, named the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis elucidates that the proposed algorithm is guaranteed to converge, and its iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method can not only achieve fast convergence and remain robust against data heterogeneity as well as malicious attacks, but also trade off robustness with performance.

## 1 Introduction

The past decade has witnessed the proliferation of smartphones and Internet of Things (IoT) devices, which generate a plethora of data every day. Centralized machine learning requires gathering the data at a particular server to train models, which incurs high communication overhead [46] and suffers from privacy risks [43]. As a remedy, distributed machine learning methods have been proposed. Considering a distributed system composed of $N$ workers (devices), we denote the datasets of these workers as $\{D_1, \dots, D_N\}$. For the $j$th ($1 \le j \le N$) worker, the labeled dataset is given as $D_j = \{x_j^i, y_j^i\}$, where $x_j^i \in \mathbb{R}^d$ and $y_j^i \in \{1, \dots, c\}$ denote the $i$th data sample and the corresponding label, respectively. The distributed learning task can be formulated as the following optimization problem,

$$\min_{w \in \mathcal{W}} F(w) \quad \text{with} \quad F(w) := \sum_j f_j(w), \qquad (1)$$

where $w \in \mathbb{R}^p$ is the model parameter to be learned, $\mathcal{W} \subseteq \mathbb{R}^p$ is a nonempty closed convex set, and $f_j(\cdot)$ is the empirical risk over the $j$th worker involving only the local data:

$$f_j(w) = \sum_{i:\, x_j^i \in D_j} \frac{1}{|D_j|} L_j(x_j^i, y_j^i; w), \qquad (2)$$

where $L_j$ is the local objective function of the $j$th worker. The problem in Eq. (1) arises in numerous areas, such as distributed signal processing [19], multi-agent optimization [36], etc. However, this formulation does not consider the data heterogeneity [57, 40, 39, 30] among different workers (i.e., the data distributions of workers could be substantially different from each other [44]). Indeed, it has been shown that traditional federated approaches, such as FedAvg [33], built for independent and identically distributed (IID) data may perform poorly when applied to non-IID data [27].
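To make Eqs. (1)-(2) concrete, the following is a minimal sketch of the local empirical risk and the global objective, assuming each worker $j$ holds arrays `X_j`, `y_j` and a per-sample loss `loss_fn`; all names, and the squared loss used for illustration, are our own stand-ins rather than anything specified in the paper.

```python
import numpy as np

def squared_loss(x_i, y_i, w):
    # illustrative per-sample loss L_j(x, y; w) for a linear model (not from the paper)
    return 0.5 * (x_i @ w - y_i) ** 2

def local_empirical_risk(w, X_j, y_j, loss_fn=squared_loss):
    # f_j(w) in Eq. (2): average loss of model w over worker j's local dataset D_j
    return np.mean([loss_fn(x_i, y_i, w) for x_i, y_i in zip(X_j, y_j)])

def global_objective(w, datasets, loss_fn=squared_loss):
    # F(w) in Eq. (1): sum of the local empirical risks over all N workers
    return sum(local_empirical_risk(w, X_j, y_j, loss_fn) for X_j, y_j in datasets)

# usage: three workers with random local data and a shared linear model w
rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
print(global_objective(np.zeros(5), datasets))
```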
This issue can be mitigated by learning a robust model that aims to achieve uniformly good performance over all workers, i.e., by solving the following distributionally robust optimization (DRO) problem in a distributed manner:

$$\min_{w \in \mathcal{W}} \max_{p \in \Omega \subseteq \Delta_N} F(w, p) := \sum_j p_j f_j(w), \qquad (3)$$

where $p = [p_1, \dots, p_N]^\top \in \mathbb{R}^N$ is the adversarial distribution over the $N$ workers, whose $j$th entry $p_j$ represents the adversarial distribution value for the $j$th worker, $\Delta_N = \{p \in \mathbb{R}^N_+ : \mathbf{1}^\top p = 1\}$, and $\Omega$ is a subset of $\Delta_N$. Agnostic federated learning (AFL) [35] first introduced the distributionally robust (agnostic) loss in federated learning and provides convergence rates for (strongly) convex functions. However, AFL does not discuss the setting of $\Omega$. DRFA-Prox [16] considers $\Omega = \Delta_N$ and imposes a regularizer on the adversarial distribution to leverage the prior distribution. Nevertheless, three key challenges have not yet been addressed by prior works. First, is it possible to construct an uncertainty framework that can not only flexibly maintain the trade-off between model robustness and performance but also effectively leverage the prior distribution? Second, how can asynchronous algorithms with guaranteed convergence be designed? Compared to synchronous algorithms, the master in asynchronous algorithms can update its parameters after receiving updates from only a small subset of workers [58, 10]. Asynchronous algorithms are particularly desirable in practice since they relax strict data dependencies and ensure convergence even in the presence of device failures [58]. Finally, is it possible to flexibly adjust the degree of robustness? Moreover, it is necessary to provide convergence guarantees when the objectives (i.e., $f_j(w_j), \forall j$) are non-convex.

To this end, we propose ASPIRE-EASE to effectively address the aforementioned challenges. Firstly, different from existing works, the prior distribution is incorporated within the constraints in our formulation, which can not only leverage the prior distribution more effectively but also achieve guaranteed feasibility for any adversarial distribution within the uncertainty set. The prior distribution can be obtained from side information or set to the uniform distribution [41]; it is necessary for constructing the uncertainty (ambiguity) set and obtaining a more robust model [16]. Specifically, we formulate the prior distribution informed distributionally robust optimization (PD-DRO) problem as:

$$\begin{aligned} \min_{z \in \mathcal{Z},\, \{w_j \in \mathcal{W}\}}\ \max_{p \in \mathcal{P}}\ & \sum_j p_j f_j(w_j) \\ \text{s.t. } & z = w_j,\ j = 1, \dots, N, \\ \text{var. } & z, w_1, w_2, \dots, w_N, \end{aligned} \qquad (4)$$

where $z \in \mathbb{R}^p$ is the global consensus variable, $w_j \in \mathbb{R}^p$ is the local variable (local model parameter) of the $j$th worker, and $\mathcal{Z} \subseteq \mathbb{R}^p$ is a nonempty closed convex set. $\mathcal{P} \subseteq \mathbb{R}^N_+$ is the uncertainty (ambiguity) set of the adversarial distribution $p$, which is set based on the prior distribution. To solve the PD-DRO problem in an asynchronous distributed manner, we first propose the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm, which employs simple gradient projection steps for the update of primal and dual variables at every iteration and is therefore computationally efficient. Next, the itErative Active SEt method (EASE) is employed to replace the traditional cutting plane method to improve computational efficiency and speed up convergence. We further provide a convergence guarantee for the proposed algorithm.
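As a small illustration of the inner maximization in Eqs. (3)-(4), the sketch below computes the worst-case weighted loss for given per-worker losses over the probability simplex, optionally intersected with box bounds; the bounds are a generic stand-in for $\Omega$ (or $\mathcal{P}$), not the specific uncertainty set proposed later. Without bounds, all adversarial mass concentrates on the worker with the largest loss.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_loss(local_losses, lower=None, upper=None):
    # inner max of Eq. (3)/(4): max_p sum_j p_j f_j over the simplex,
    # optionally intersected with elementwise bounds lower_j <= p_j <= upper_j
    f = np.asarray(local_losses, dtype=float)
    N = len(f)
    lower = np.zeros(N) if lower is None else np.asarray(lower, dtype=float)
    upper = np.ones(N) if upper is None else np.asarray(upper, dtype=float)
    # linprog minimizes, so negate the objective to maximize sum_j p_j f_j
    res = linprog(-f, A_eq=np.ones((1, N)), b_eq=[1.0],
                  bounds=list(zip(lower, upper)), method="highs")
    return -res.fun, res.x  # worst-case value and the attaining distribution p

# example: 3 workers; restricting each p_j to [0.1, 0.6] softens the worst case
val_full, _ = worst_case_loss([0.8, 0.3, 1.2])                       # 1.2
val_box, p = worst_case_loss([0.8, 0.3, 1.2], [0.1] * 3, [0.6] * 3)  # 0.99
```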
Furthermore, a new uncertainty set, i.e., the constrained D-norm (CD-norm) uncertainty set, is proposed in this paper. Its advantages include: 1) it can flexibly control the degree of robustness; 2) the resulting subproblem is computationally simple; 3) it can effectively leverage the prior distribution and flexibly set the bounds for every $p_j$.

**Contributions.** Our contributions can be summarized as follows:

1. We formulate a PD-DRO problem with the CD-norm uncertainty set. PD-DRO incorporates the prior distribution as constraints, which leverages the prior distribution more effectively and guarantees robustness. In addition, the CD-norm is developed to model the ambiguity set around the prior distribution, and it provides a flexible way to control the trade-off between model robustness and performance.
2. We develop a single-loop asynchronous algorithm, namely ASPIRE-EASE, to optimize PD-DRO in an asynchronous distributed manner. ASPIRE employs simple gradient projection steps to update the variables at every iteration, which is computationally efficient. EASE is proposed to replace the cutting plane method to enhance computational efficiency and speed up convergence. We demonstrate that even if the objectives $f_j(w_j), \forall j$ are non-convex, the proposed algorithm is guaranteed to converge. We also theoretically derive the iteration complexity of ASPIRE-EASE.
3. Extensive empirical studies on four different real-world datasets demonstrate the superior performance of the proposed algorithm. It is seen that ASPIRE-EASE can not only ensure the model's robustness against data heterogeneity but also mitigate malicious attacks.

## 2 Preliminaries

### 2.1 Distributionally Robust Optimization

Optimization problems often contain uncertain parameters. A small perturbation of the parameters could render the optimal solution of the original optimization problem infeasible or completely meaningless [5]. Distributionally robust optimization (DRO) [28, 17, 7] assumes that the probability distributions of the uncertain parameters are unknown but lie in an ambiguity (uncertainty) set, and aims to find a decision that minimizes the worst-case expected cost over the ambiguity set. Its general form can be expressed as

$$\min_{x \in \mathcal{X}} \max_{P \in \mathcal{P}} \mathbb{E}_P[r(x, \xi)], \qquad (5)$$

where $x \in \mathcal{X}$ represents the decision variable and $\mathcal{P}$ is the ambiguity set of probability distributions $P$ of the uncertain parameters $\xi$. Existing methods for solving DRO can be broadly grouped into two widely-used categories [42]: 1) Dual methods [15, 50, 18] reformulate the primal DRO problem as a deterministic optimization problem through duality theory. For instance, Ben-Tal et al. [2] reformulate the robust linear optimization (RLO) problem with an ellipsoidal uncertainty set as a second-order cone program (SOCP). 2) Cutting plane methods [34, 6] (also called adversarial approaches [21]) repeatedly solve an approximate problem with a finite number of constraints of the primal DRO problem, and subsequently check whether new constraints are needed to refine the feasible set. Recently, several new methods [41, 29, 23] have been developed to solve DRO, which need to solve the inner maximization problem at every iteration.

### 2.2 Cutting Plane Method for PD-DRO

In this section, we introduce the cutting plane method for PD-DRO in Eq. (4). We first reformulate PD-DRO by introducing an additional variable $h \in \mathcal{H}$ ($\mathcal{H} \subseteq \mathbb{R}$ is a nonempty closed convex set) and a protection function $g(\{w_j\})$ [55]. Introducing the additional variable $h$ is an epigraph reformulation [3, 56].
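For completeness, the generic epigraph identity behind this step (a standard fact, restated here rather than quoted from the paper) is: for functions $g_1, \dots, g_m$ over $x \in \mathcal{X}$,

$$\min_{x \in \mathcal{X}} \max_{1 \le i \le m} g_i(x) \;=\; \min_{x \in \mathcal{X},\, h \in \mathbb{R}}\ h \quad \text{s.t.}\quad g_i(x) \le h,\ i = 1, \dots, m,$$

so the inner maximization is absorbed into inequality constraints on the scalar $h$; Eq. (6) below applies the same idea, with the members of $\mathcal{P}$ playing the role of the index $i$.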
In this case, Eq. (4) can be reformulated into a form with the uncertainty in the constraints:

$$\begin{aligned} \min_{z \in \mathcal{Z},\, \{w_j \in \mathcal{W}\},\, h \in \mathcal{H}}\ & h \\ \text{s.t. } & \sum_j \bar{p} f_j(w_j) + g(\{w_j\}) - h \le 0, \\ & z = w_j,\ j = 1, \dots, N, \\ \text{var. } & z, w_1, w_2, \dots, w_N, h, \end{aligned} \qquad (6)$$

where $\bar{p}$ is the nominal value of the adversarial distribution for every worker and $g(\{w_j\}) = \max_{p \in \mathcal{P}} \sum_j (p_j - \bar{p}) f_j(w_j)$ is the protection function. Eq. (6) is a semi-infinite program (SIP), which contains infinitely many constraints and cannot be solved directly [42]. Denoting the set of cutting plane parameters in the $(t+1)$th iteration as $A_t \subseteq \mathbb{R}^N$, the following function is used to approximate $g(\{w_j\})$:

$$\hat{g}(\{w_j\}) = \max_{a_l \in A_t} a_l^\top f(w) = \max_{a_l \in A_t} \sum_j a_{l,j} f_j(w_j), \qquad (7)$$

where $a_l = [a_{l,1}, \dots, a_{l,N}]^\top \in \mathbb{R}^N$ denotes the parameters of the $l$th cutting plane in $A_t$ and $f(w) = [f_1(w_1), \dots, f_N(w_N)]^\top \in \mathbb{R}^N$. Substituting the protection function $g(\{w_j\})$ with $\hat{g}(\{w_j\})$, we obtain the following approximate problem:

$$\begin{aligned} \min_{z \in \mathcal{Z},\, \{w_j \in \mathcal{W}\},\, h \in \mathcal{H}}\ & h \\ \text{s.t. } & \sum_j (\bar{p} + a_{l,j}) f_j(w_j) - h \le 0,\ \forall a_l \in A_t, \\ & z = w_j,\ j = 1, \dots, N, \\ \text{var. } & z, w_1, w_2, \dots, w_N, h. \end{aligned} \qquad (8)$$

Distributed optimization is an attractive approach for large-scale learning tasks [54, 8] since it does not require data aggregation, which protects data privacy while also reducing bandwidth requirements [45]. When neural network models are used (i.e., $f_j(w_j), \forall j$ are non-convex functions), solving the problem in Eq. (8) in a distributed manner faces two challenges: 1) computing the optimal solution of a non-convex subproblem requires a large number of iterations and is therefore highly computationally intensive, if not impossible, so the traditional Alternating Direction Method of Multipliers (ADMM) is ineffective; 2) the communication delays of workers may differ significantly [11], so asynchronous algorithms are strongly preferred. To this end, we propose the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm. The advantages of the proposed algorithm include: 1) ASPIRE uses simple gradient projection steps to update variables in each iteration and is therefore computationally more efficient than the traditional ADMM method, which seeks the optimal solution of non-convex (for $w_j, \forall j$) and convex (for $z$ and $h$) optimization subproblems in every iteration; 2) the proposed asynchronous algorithm does not require strict synchronization among different workers, so ASPIRE remains resilient against communication delays and potential hardware failures of workers. Details of the algorithm are given below.

Firstly, we define the node responsible for updating the global variable $z$ as the master, and the node responsible for updating the local variable $w_j$ as worker $j$. In each iteration, the master updates its variables once it receives updates from at least $S$ workers, i.e., the active workers, where $1 \le S \le N$. $Q_{t+1}$ denotes the index subset of workers from which the master receives updates during the $(t+1)$th iteration. We also assume that the master receives updated variables from every worker at least once every $\tau$ iterations. The augmented Lagrangian function of Eq. (8) can be written as

$$\mathcal{L}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}) = h + \sum_l \lambda_l \Big( \sum_j (\bar{p} + a_{l,j}) f_j(w_j) - h \Big) + \sum_j \phi_j^\top (z - w_j) + \sum_j \frac{\kappa_1}{2} \| z - w_j \|^2, \qquad (9)$$

where we write $\mathcal{L}_p$ for $\mathcal{L}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\})$ for brevity, and $\lambda_l \in \Lambda, \forall l$ and $\phi_j \in \Phi, \forall j$ represent the dual variables of the inequality and equality constraints in Eq. (8), respectively. $\Lambda \subseteq \mathbb{R}$ and $\Phi \subseteq \mathbb{R}^p$ are nonempty closed convex sets, and the constant $\kappa_1 > 0$ is a penalty parameter. Note that Eq. (9) does not include a second-order penalty term for the inequality constraints since it would invalidate the distributed optimization.
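To make the bookkeeping in Eq. (9) explicit, here is a minimal sketch that evaluates the augmented Lagrangian for given iterates; the array shapes and argument names are illustrative assumptions, and the local losses $f_j(w_j)$ are passed in as precomputed scalars.

```python
import numpy as np

def augmented_lagrangian(h, z, W, local_losses, cuts, lambdas, phis, kappa1, p_bar):
    # L_p in Eq. (9); assumed shapes:
    #   W: (N, p) local models, local_losses: (N,) precomputed f_j(w_j),
    #   cuts: (|A_t|, N) cutting-plane parameters a_l, lambdas: (|A_t|,) duals,
    #   phis: (N, p) duals of the consensus constraints, p_bar: scalar nominal weight
    f = np.asarray(local_losses, dtype=float)
    val = float(h)
    for lam, a_l in zip(lambdas, cuts):
        val += lam * (np.sum((p_bar + a_l) * f) - h)      # cutting-plane (inequality) terms
    for phi_j, w_j in zip(phis, W):
        val += phi_j @ (z - w_j)                          # consensus dual terms
        val += 0.5 * kappa1 * np.sum((z - w_j) ** 2)      # quadratic penalty terms
    return val
```

The regularized version in Eq. (10) below would simply subtract $\sum_l \frac{c_1^t}{2}\|\lambda_l\|^2 + \sum_j \frac{c_2^t}{2}\|\phi_j\|^2$ from this value.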
Following [52], the regularized version of Eq. (9) is employed to update all variables:

$$\widetilde{\mathcal{L}}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}) = \mathcal{L}_p - \sum_l \frac{c_1^t}{2} \|\lambda_l\|^2 - \sum_j \frac{c_2^t}{2} \|\phi_j\|^2, \qquad (10)$$

where $c_1^t$ and $c_2^t$ denote the regularization terms in the $(t+1)$th iteration. To avoid enumerating the whole dataset, a mini-batch loss can be used: a batch of instances of size $m$ is randomly sampled from each worker during each iteration, and the loss over these instances on the $j$th worker is given by $\hat{f}_j(w_j) = \sum_{i=1}^{m} \frac{1}{m} L_j(x_j^i, y_j^i; w_j)$. It is evident that $\mathbb{E}[\hat{f}_j(w_j)] = f_j(w_j)$ and $\mathbb{E}[\nabla \hat{f}_j(w_j)] = \nabla f_j(w_j)$. In the $(t+1)$th master iteration, the proposed algorithm proceeds as follows. 1) Active workers update the local variables $w_j$ as

$$w_j^{t+1} = \begin{cases} P_{\mathcal{W}}\Big( w_j^t - \alpha_w^{e_{t_j}} \nabla_{w_j} \widetilde{\mathcal{L}}_p(\{w_j^{e_{t_j}}\}, z^{e_{t_j}}, h^{e_{t_j}}, \{\lambda_l^{e_{t_j}}\}, \{\phi_j^{e_{t_j}}\}) \Big), & j \in Q_{t+1}, \\ w_j^t, & j \notin Q_{t+1}, \end{cases} \qquad (11)$$

where $e_{t_j}$ is the last iteration during which worker $j$ was active. It is seen that $\forall j \in Q_{t+1}$, $w_j^t = w_j^{e_{t_j}}$ and $\phi_j^t = \phi_j^{e_{t_j}}$; $\alpha_w^{e_{t_j}}$ represents the step-size, and we let $\alpha_w^t = \eta_w^t$. The remaining variables $z^{t+1}$, $h^{t+1}$, $\{\lambda_l^{t+1}\}$, and $\{\phi_j^{t+1}\}$ are updated with analogous gradient projection steps according to Eqs. (12)-(15) (cf. Algorithm 1). The active cutting plane set is updated every $k$ iterations before iteration $T_1$ (cf. Algorithm 1), where $k > 0$ is a pre-set constant and can be controlled flexibly. The cutting planes are generated according to the uncertainty set; for example, with an ellipsoidal uncertainty set, a cutting plane is generated by solving an SOCP. In this paper, we propose the CD-norm uncertainty set, which can be expressed as

$$\mathcal{P} = \Big\{ p : -\check{e}_{p_j} \le p_j - q_j \le \hat{e}_{p_j},\ \sum_j \frac{|p_j - q_j|}{\hat{e}_{p_j}} \le \Gamma,\ \mathbf{1}^\top p = 1 \Big\}, \qquad (16)$$

where $\Gamma \in \mathbb{R}$ can flexibly control the level of robustness, $q = [q_1, \dots, q_N]^\top \in \mathbb{R}^N$ represents the prior distribution, and $\check{e}_{p_j}$ and $\hat{e}_{p_j}$ ($\check{e}_{p_j}, \hat{e}_{p_j} \ge 0$) represent the lower and upper bounds for $p_j - q_j$, respectively. The settings of $q$ and $\hat{e}_{p_j}, \forall j$ are based on prior knowledge. The D-norm is a classical uncertainty set (also called the budget uncertainty set) [5]. We call Eq. (16) the CD-norm uncertainty set since $p$ is a probability vector, so all entries of this vector are non-negative and add up to exactly one, i.e., $\mathbf{1}^\top p = 1$. Due to the special structure of the CD-norm, the cutting plane generation subproblem is easy to solve, and the level of robustness, in terms of the outage probability (i.e., probabilistic bounds on the violation of constraints), can be flexibly adjusted via the single parameter $\Gamma$. The $\ell_1$-norm (or twice total variation distance) uncertainty set is closely related to the CD-norm uncertainty set. Nevertheless, there are two differences: 1) the CD-norm uncertainty set can be regarded as a weighted $\ell_1$-norm set with additional constraints; 2) the CD-norm uncertainty set can flexibly set the lower and upper bounds for every $p_j$ (i.e., $q_j - \check{e}_{p_j} \le p_j \le q_j + \hat{e}_{p_j}$), while $0 \le p_j \le 1, \forall j$ in the $\ell_1$-norm uncertainty set. Based on the CD-norm uncertainty set, the cutting plane can be derived as follows.

1) Solve the following problem:

$$\begin{aligned} p^{t+1} = \arg\max_{p_1, \dots, p_N}\ & \sum_j (p_j - \bar{p}) f_j(w_j) \\ \text{s.t. } & \sum_j \frac{|p_j - q_j|}{\hat{e}_{p_j}} \le \Gamma,\ -\check{e}_{p_j} \le p_j - q_j \le \hat{e}_{p_j},\ \forall j,\ \sum_j p_j = 1, \\ \text{var. } & p_1, \dots, p_N, \end{aligned} \qquad (17)$$

where $p^{t+1} = [p_1^{t+1}, \dots, p_N^{t+1}]^\top \in \mathbb{R}^N$. Let $\tilde{a}^{t+1} = p^{t+1} - \bar{p}$, where $\bar{p} = [\bar{p}, \dots, \bar{p}]^\top \in \mathbb{R}^N$. This first step obtains $\tilde{a}^{t+1}$ by solving the problem in Eq. (17), which can be solved effectively by combining merge sort [13] (for sorting $\hat{e}_{p_j} f_j(w_j),\ j = 1, \dots, N$) with a few basic arithmetic operations (for obtaining $p_j^{t+1},\ j = 1, \dots, N$). Since $N$ is relatively large in a distributed system, the arithmetic complexity of solving the problem in Eq. (17) is dominated by the merge sort and can be regarded as $O(N \log N)$ (a generic solver-based sketch of this subproblem is given after step 3 below).

2) Let $f(w) = [f_1(w_1), \dots, f_N(w_N)]^\top \in \mathbb{R}^N$ and check the feasibility of the following constraint:

$$(\tilde{a}^{t+1})^\top f(w) \le \max_{a_l \in A_t} a_l^\top f(w). \qquad (18)$$

3) If Eq. (18) is violated, $\tilde{a}^{t+1}$ is added into $A_t$:

$$A_{t+1} = \begin{cases} A_t \cup \{\tilde{a}^{t+1}\}, & \text{if Eq. (18) is violated}, \\ A_t, & \text{otherwise}; \end{cases} \qquad (19)$$

when a new cutting plane is added, its corresponding dual variable $\lambda_{|A_t|+1}^{t+1} = 0$ is generated. After the cutting plane subproblem is solved, the inactive cutting planes are removed, that is,

$$A_{t+1} = \begin{cases} A_{t+1} \setminus \{a_l\}, & \text{if } \lambda_l^{t+1} = 0 \text{ and } \lambda_l^t = 0,\ 1 \le l \le |A_t|, \\ A_{t+1}, & \text{otherwise}, \end{cases} \qquad (20)$$

where $A_{t+1} \setminus \{a_l\}$ is the complement of $\{a_l\}$ in $A_{t+1}$, and the corresponding dual variable is removed as well. Then the master broadcasts $A_{t+1}$ and $\{\lambda_l^{t+1}\}$ to all workers.
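Below is the solver-based sketch referenced in step 1. The paper's own routine is the sorting-based $O(N \log N)$ procedure described above; purely for illustration, the same problem in Eq. (17) is handed to a generic LP solver after the substitution $p_j = q_j + u_j - v_j$ with $u_j, v_j \ge 0$, which linearizes $|p_j - q_j|$. All names are our own, and $\hat{e}_{p_j} > 0$ is assumed.

```python
import numpy as np
from scipy.optimize import linprog

def generate_cut(local_losses, q, e_hat, e_check, gamma, p_bar):
    # Solve Eq. (17) over the CD-norm set; return the candidate cut a~^{t+1} = p^{t+1} - p_bar.
    # local_losses: (N,) values f_j(w_j); q: prior distribution; e_hat / e_check: upper / lower
    # half-widths for p_j - q_j; gamma: robustness budget; p_bar: scalar nominal weight.
    f = np.asarray(local_losses, dtype=float)
    N = len(f)
    # maximize sum_j (p_j - p_bar) f_j  <=>  minimize -(f @ u - f @ v)  (constants dropped)
    c = np.concatenate([-f, f])
    # budget constraint: sum_j (u_j + v_j) / e_hat_j <= Gamma
    A_ub = np.concatenate([1.0 / e_hat, 1.0 / e_hat]).reshape(1, -1)
    # simplex constraint: sum_j p_j = 1  <=>  sum_j u_j - sum_j v_j = 1 - sum_j q_j
    A_eq = np.concatenate([np.ones(N), -np.ones(N)]).reshape(1, -1)
    bounds = [(0.0, hi) for hi in e_hat] + [(0.0, hi) for hi in e_check]
    res = linprog(c, A_ub=A_ub, b_ub=[gamma], A_eq=A_eq, b_eq=[1.0 - np.sum(q)],
                  bounds=bounds, method="highs")
    u, v = res.x[:N], res.x[N:]
    return (q + u - v) - p_bar

# example: 4 workers, uniform prior, half-widths 0.1, budget Gamma = 2
cut = generate_cut([0.9, 0.2, 1.5, 0.4], q=np.full(4, 0.25), e_hat=np.full(4, 0.1),
                   e_check=np.full(4, 0.1), gamma=2.0, p_bar=0.25)
```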
Details of the algorithm are summarized in Algorithm 1.

**Algorithm 1** ASPIRE-EASE
Initialization: iteration $t = 0$, variables $\{w_j^0\}$, $z^0$, $h^0$, $\{\lambda_l^0\}$, $\{\phi_j^0\}$, and set $A_0$.
repeat
  for each active worker do: update the local $w_j^{t+1}$ according to Eq. (11); end for
  active workers transmit their local model parameters and losses to the master;
  upon receiving updates from the active workers, the master updates $z^{t+1}$, $h^{t+1}$, $\{\lambda_l^{t+1}\}$, $\{\phi_j^{t+1}\}$ according to Eqs. (12), (13), (14), (15);
  the master broadcasts $z^{t+1}$, $h^{t+1}$, $\{\lambda_l^{t+1}\}$ to the active workers;
  for each active worker do: update the local $\phi_j^{t+1}$ according to Eq. (15); end for
  if $(t + 1) \bmod k = 0$ and $t < T_1$ then: the master updates $A_{t+1}$ according to Eqs. (19) and (20), and broadcasts the parameters to all workers; end if
  $t = t + 1$;
until convergence

## 5 Convergence Analysis

**Definition 1 (Stationarity gap)** Following [52, 32, 53], the stationarity gap of our problem at the $t$th iteration is defined as

$$\nabla G^t = \begin{bmatrix} \Big\{ \frac{1}{\alpha_w^t} \big( w_j^t - P_{\mathcal{W}}(w_j^t - \alpha_w^t \nabla_{w_j} \mathcal{L}_p(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})) \big) \Big\} \\ \frac{1}{\eta_z^t} \big( z^t - P_{\mathcal{Z}}(z^t - \eta_z^t \nabla_z \mathcal{L}_p(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})) \big) \\ \frac{1}{\eta_h^t} \big( h^t - P_{\mathcal{H}}(h^t - \eta_h^t \nabla_h \mathcal{L}_p(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})) \big) \\ \Big\{ \frac{1}{\rho_1} \big( \lambda_l^t - P_{\Lambda}(\lambda_l^t + \rho_1 \nabla_{\lambda_l} \mathcal{L}_p(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})) \big) \Big\} \\ \Big\{ \frac{1}{\rho_2} \big( \phi_j^t - P_{\Phi}(\phi_j^t + \rho_2 \nabla_{\phi_j} \mathcal{L}_p(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})) \big) \Big\} \end{bmatrix},$$

where $\nabla G^t$ is the simplified form of $\nabla G(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})$.

**Definition 2 ($\varepsilon$-stationary point)** $(\{w_j^t\}, z^t, h^t, \{\lambda_l^t\}, \{\phi_j^t\})$ is an $\varepsilon$-stationary point ($\varepsilon \ge 0$) of a differentiable function $\mathcal{L}_p$ if $\|\nabla G^t\| \le \varepsilon$. $T(\varepsilon)$ is the first iteration index such that $\|\nabla G^t\| \le \varepsilon$, i.e., $T(\varepsilon) = \min\{t \mid \|\nabla G^t\| \le \varepsilon\}$.

**Assumption 1 (Smoothness/Gradient Lipschitz)** $\mathcal{L}_p$ has Lipschitz continuous gradients. We assume that there exists $L > 0$ satisfying $\|\nabla_\theta \mathcal{L}_p(\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}) - \nabla_\theta \mathcal{L}_p(\{\hat{w}_j\}, \hat{z}, \hat{h}, \{\hat{\lambda}_l\}, \{\hat{\phi}_j\})\| \le L \|[w_{\mathrm{cat}} - \hat{w}_{\mathrm{cat}};\ z - \hat{z};\ h - \hat{h};\ \lambda_{\mathrm{cat}} - \hat{\lambda}_{\mathrm{cat}};\ \phi_{\mathrm{cat}} - \hat{\phi}_{\mathrm{cat}}]\|$, where $\theta \in \{\{w_j\}, z, h, \{\lambda_l\}, \{\phi_j\}\}$ and $[\,\cdot\,;\,\cdot\,]$ represents concatenation, with $w_{\mathrm{cat}} - \hat{w}_{\mathrm{cat}} = [w_1 - \hat{w}_1; \dots; w_N - \hat{w}_N] \in \mathbb{R}^{pN}$, $\lambda_{\mathrm{cat}} - \hat{\lambda}_{\mathrm{cat}} = [\lambda_1 - \hat{\lambda}_1; \dots; \lambda_{|A_t|} - \hat{\lambda}_{|A_t|}] \in \mathbb{R}^{|A_t|}$, and $\phi_{\mathrm{cat}} - \hat{\phi}_{\mathrm{cat}} = [\phi_1 - \hat{\phi}_1; \dots; \phi_N - \hat{\phi}_N] \in \mathbb{R}^{pN}$.

**Assumption 2 (Boundedness)** Before obtaining the $\varepsilon$-stationary point (i.e., $t \le T(\varepsilon) - 1$), we assume the variables in the master satisfy $\|z^{t+1} - z^t\|^2 + \|h^{t+1} - h^t\|^2 + \sum_l \|\lambda_l^{t+1} - \lambda_l^t\|^2 \le \vartheta$, where $\vartheta > 0$ is a relatively small constant. The change of the variables in the master is upper bounded within $\tau$ iterations: $\|z^t - z^{t-k}\|^2 \le \tau k_1 \vartheta$, $\|h^t - h^{t-k}\|^2 \le \tau k_1 \vartheta$, $\sum_l \|\lambda_l^t - \lambda_l^{t-k}\|^2 \le \tau k_1 \vartheta$, $\forall 1 \le k \le \tau$, where $k_1 > 0$ is a constant.

**Setting 1 (Bounded $|A_t|$)** $|A_t| \le M, \forall t$, i.e., an upper bound is set for the number of cutting planes.

**Setting 2 (Setting of $c_1^t$, $c_2^t$)** $c_1^t = \frac{1}{\rho_1 (t+1)^{1/6}} \le c_1$ and $c_2^t = \frac{1}{\rho_2 (t+1)^{1/6}} \le c_2$ are non-negative, non-increasing sequences, where $c_1$ and $c_2$ are positive constants that satisfy $\frac{M c_1}{2} + \frac{N c_2}{2} \le \varepsilon^2$.

**Theorem 1 (Iteration complexity)** Suppose Assumptions 1 and 2 hold. We set $\eta_w^t = \eta_z^t = \eta_h^t = \frac{2}{L + \rho_1 |A_t| L^2 + \rho_2 N L^2 + 8\big( \frac{|A_t| \gamma L^2}{\rho_1 (c_1^t)^2} + \frac{N \gamma L^2}{\rho_2 (c_2^t)^2} \big)}$ and $\eta_w = \frac{2}{L + \rho_1 M L^2 + \rho_2 N L^2 + 8\big( \frac{M \gamma L^2}{\rho_1 c_1^2} + \frac{N \gamma L^2}{\rho_2 c_2^2} \big)}$. And we set constants $\rho_1$