Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization

Qi Zhang 1, Yi Zhou 2, Ashley Prater-Bennette 3, Lixin Shen 4, Shaofeng Zou 1
1 University at Buffalo  2 The University of Utah  3 Air Force Research Laboratory  4 Syracuse University
qzhang48@buffalo.edu, yi.zhou@utah.edu, ashley.prater-bennette@us.af.mil, lshen03@syr.edu, szou3@buffalo.edu

Abstract

Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions and exclude the practical and challenging case of non-convex losses, e.g., neural networks. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for large-scale applications. We focus on uncertainty sets defined by the general Cressie-Read family of divergences, which includes the χ² divergence as a special case. We prove that our algorithm finds an ε-stationary point with an improved computational complexity over existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.

1 Introduction

Machine learning algorithms typically employ the approach of Empirical Risk Minimization (ERM), which minimizes the expected loss under the empirical distribution P0 of the training dataset and assumes that test samples are generated from the same distribution.
However, in practice there usually exists a mismatch between the training and testing distributions, arising for various reasons: in domain adaptation tasks the domain differs from training to testing (Blitzer, McDonald, and Pereira 2006; Daume III and Marcu 2006); test samples may be drawn from minority groups that are underrepresented in the training dataset (Grother et al. 2011; Hovy and Søgaard 2015); and there may exist adversarial attacks (Goodfellow, Shlens, and Szegedy 2014; Madry et al. 2017). Such a mismatch may lead to significant performance degradation. This challenge spurred noteworthy efforts on developing the framework of Distributionally Robust Optimization (DRO), e.g., (Ben-Tal et al. 2013; Shapiro 2017; Rahimian and Mehrotra 2019). Rather than minimizing the expected loss under one fixed distribution, DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions.

Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Specifically, DRO aims to solve the following problem:

inf_x sup_{Q ∈ U(P0)} E_{S~Q}[ℓ(x; S)],   (1)

where U(P0) is an uncertainty set of distributions centered at P0, P0 is the empirical distribution of the training dataset, ℓ is the loss function, and x is the optimization variable. For example, the uncertainty set can be defined as

U(P0) := {Q : D(Q || P0) ≤ ρ},   (2)

where D is some distance-like metric, e.g., the Kullback-Leibler (KL) divergence or the χ² divergence, and ρ is the uncertainty level. In practice, for ease of implementation and analysis, a relaxed formulation of eq. (1), referred to as penalized DRO, is usually solved (Levy et al. 2020; Jin et al. 2021; Qi et al. 2021; Sinha et al. 2017):

inf_x sup_Q E_{S~Q}[ℓ(x; S)] - λ D(Q || P0),   (3)

where λ > 0 is a fixed hyperparameter that needs to be chosen manually. In contrast to the constrained DRO in eq.
(1), a regularization term is added to the objective function to keep the distribution Q close to P0, and the hyperparameter λ is chosen manually beforehand to control the tradeoff with minimizing the loss. Compared with the penalized setting, the constrained DRO problem in eq. (1) requires the distribution Q to lie strictly in the uncertainty set and searches for the optimal solution under the worst-case distribution in that set. Therefore, the solution obtained from constrained DRO is minimax optimal for distributions in the uncertainty set, whereas such a guarantee is hard to obtain for the penalized relaxation. In this paper, we focus on the challenging constrained DRO problem in eq. (1). Existing studies on constrained DRO are limited to convex loss functions or require additional assumptions (Soma and Yoshida 2020; Hashimoto et al. 2018; Levy et al. 2020; Duchi and Namkoong 2018; Duchi, Glynn, and Namkoong 2021; Qi et al. 2022; Wang, Gao, and Xie 2021). Little is known for practical non-convex loss functions, e.g., neural networks. In this paper, we focus on the constrained DRO problem with non-convex loss.

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

DRO problems under different uncertainty sets are fundamentally different. As will be discussed later in the related work, there is a rich literature on DRO with various uncertainty sets. In this paper, we focus on the uncertainty set defined by the general Cressie-Read family of divergences (Duchi and Namkoong 2018; Jin et al. 2021), which includes, e.g., the χ² divergence as a special case (see Section 2 for more details). We also investigate the smoothed conditional value at risk (CVaR) DRO problem. More importantly, we focus on the practical yet challenging large-scale scenario, where P0 is the empirical distribution of N samples and N is very large.
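To make the constrained formulation concrete, here is a small brute-force illustration (not the paper's algorithm; the losses, radius, and grid are made up) of the inner sup in eq. (1) with the χ² uncertainty set of eq. (2), on a three-point empirical distribution:

```python
# Toy illustration of the inner sup of (1) under the chi^2 constraint (2):
# maximize q . losses over q in the simplex with D_chi2(q || p0) <= rho.
import itertools
import numpy as np

losses = np.array([0.2, 0.5, 3.0])   # per-sample losses ell(x; s_i) at a fixed x
p0 = np.ones(3) / 3                  # empirical distribution P0
rho = 0.1                            # uncertainty radius

def chi2(q, p):
    # D_phi(Q || P0) with phi(t) = (t - 1)^2 / 2
    return 0.5 * np.sum(p * (q / p - 1.0) ** 2)

best = -np.inf
grid = np.linspace(0.0, 1.0, 201)
for q1, q2 in itertools.product(grid, grid):
    q3 = 1.0 - q1 - q2
    if q3 < 0:
        continue
    q = np.array([q1, q2, q3])
    if chi2(q, p0) <= rho:
        best = max(best, float(q @ losses))

erm = float(p0 @ losses)
print(erm, best)  # the worst-case expectation is at least the ERM value
```

The worst-case distribution upweights the high-loss sample, which is exactly the robustness mechanism the paper exploits; of course this brute-force search scales exponentially and motivates the dual reformulation developed in Section 2.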
In classic stochastic optimization problems, e.g., ERM, it is easy to obtain an unbiased estimate of the gradient using only a few samples, so the computational complexity at each iteration is independent of the training dataset size. In DRO problems, however, taking the worst case over distributions in the objective makes this challenging. Many existing DRO algorithms incur a complexity that increases linearly (or worse) with the training dataset size (Duchi and Namkoong 2018; Namkoong and Duchi 2016; Ghosh, Squillante, and Wollega 2018), which is not feasible for large-scale applications. In this paper, we design a stochastic algorithm whose computational complexity at each iteration is independent of the training dataset size (cf. (Qi et al. 2022; Wang, Gao, and Xie 2021; Levy et al. 2020)).

1.1 Challenges and Contributions

The key challenges and main contributions of this paper are summarized as follows. For large-scale applications, the number of training samples is large, so directly computing the full gradient is impractical. Nevertheless, as discussed above, it is challenging to obtain an unbiased estimate of the gradient for DRO problems using only a few samples. For the φ-divergence DRO problem, the distributions in the uncertainty set are absolutely continuous with respect to the training distribution, so they can be parameterized by an N-dimensional vector (Namkoong and Duchi 2016). The DRO problem then becomes a min-max problem, and primal-dual algorithms (Rafique et al. 2022; Lin, Jin, and Jordan 2020; Xu et al. 2023) can be used directly. Subsampling methods in DRO were also studied in (Namkoong and Duchi 2016; Ghosh, Squillante, and Wollega 2018). However, all of the above approaches require a computational complexity at each iteration that is linear or worse in the training dataset size, and are thus prohibitive in large-scale applications. In (Levy et al.
2020), an efficient subsampling method was proposed, where the batch size is independent of the training dataset size. However, the sampling bias was analyzed only for the χ² and CVaR DRO problems. In this paper, we generalize the bias analysis of (Levy et al. 2020) to the general Cressie-Read family. We further develop a Frank-Wolfe update on the dual variables in order to bound the gap between the objective and its optimal value given the optimization variable x and the biased estimate.

The second challenge is the non-convex loss function. Existing studies for the Cressie-Read divergence family (Duchi and Namkoong 2018; Levy et al. 2020) are limited to convex loss functions, and their approach does not generalize to the non-convex case: the key difficulty is that the subgradient of the objective cannot be obtained via the subdifferential for non-convex losses. Instead of explicitly calculating the worst-case distribution as in (Duchi and Namkoong 2018; Levy et al. 2020), we design an algorithm for the dual problem, which optimizes the objective under a known distribution, so the gradient of the objective can be obtained efficiently.

The third challenge is that the dual form of constrained DRO is neither smooth nor Lipschitz, making the convergence analysis difficult. Existing studies, e.g., (Wang, Gao, and Xie 2021), assume that the optimal dual variable is bounded away from zero, i.e., λ* ≥ λ0 for some λ0 > 0, so that it suffices to consider λ ≥ λ0. However, this assumption need not hold, as shown in (Wang, Gao, and Xie 2021; Hu and Hong 2013). In this paper, we generalize the idea in (Qi et al. 2022; Levy et al. 2020) to the general Cressie-Read divergence family: we design an approximation of the original problem and show that it is smooth and Lipschitz. The approximation error can be made arbitrarily small, so the solution to the approximation is still a good solution to the original problem.
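The non-smoothness in the dual variable is easy to see numerically: for k = 2 the dual objective (introduced in Section 2.3) contains a term proportional to λ^{1-k*} = 1/λ, whose derivative blows up as λ → 0. A minimal sketch, where the constant `a` is a stand-in for the loss-dependent expectation factor:

```python
# Illustration (k = 2, so k* = 2): the lambda-term h(lam) = a / lam of the
# dual objective has derivative -a / lam^2, which is unbounded as lam -> 0.
# This is exactly why F is not smooth on (0, inf) and why the analysis
# restricts lambda to an interval [lam0, lam_bar].
a = 1.0  # stand-in constant for the expectation factor

def dh(lam):
    return -a / lam ** 2

grads = [abs(dh(10.0 ** -i)) for i in range(1, 5)]
print(grads)  # derivative magnitude grows without bound as lam -> 0
```

Truncating λ to [λ0, λ̄] caps this derivative at a/λ0², at the cost of the approximation error bounded in Lemma 1 below.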
We prove strong duality for the approximated problem: we add a regularizer to the objective while keeping the hard constraint, which guarantees that the dual variable λ has a positive lower bound. Moreover, our strong duality holds for any φ-divergence DRO problem. We design a novel algorithm to solve the approximated problem and prove that it converges to a stationary point of the constrained DRO problem. The general proximal gradient descent algorithm (Ghadimi, Lan, and Zhang 2016) could be used to solve the approximated problem directly, but it treats the objective as non-convex in all parameters and does not provide a tight bound on the subsampling bias. We take advantage of the fact that the objective is convex in λ, so the subsampling bias can be bounded more tightly, and our proposed algorithm converges to a stationary point faster than existing methods.

1.2 Related Work

Various Uncertainty Sets. φ-divergence DRO problems (Ali and Silvey 1966; Csiszár 1967) have been widely studied, for example, CVaR in (Rockafellar, Uryasev et al. 2000; Soma and Yoshida 2020; Curi et al. 2020; Tamar, Glassner, and Mannor 2015), χ²-divergence in (Hashimoto et al. 2018; Ghosh, Squillante, and Wollega 2018; Levy et al. 2020), KL divergence in (Qi et al. 2021, 2022; Hu and Hong 2013), and the Sinkhorn distance (Wang, Gao, and Xie 2021), a variant of the Wasserstein distance based on entropic regularization. However, these studies target specific divergence functions and cannot be extended directly to the general Cressie-Read divergence family.

Penalized DRO. The general φ-divergence DRO problem was studied in (Jin et al. 2021), where the proposed algorithm works for any divergence function with a smooth conjugate. The authors also designed a smoothed version of the CVaR problem and showed that their method converges to a stationary point.
However, their method is for the penalized formulation and does not generalize to constrained DRO. In this paper, we focus on the challenging constrained DRO, whose solution is minimax optimal over the uncertainty set. Our proposed algorithm can also be applied to solve the smoothed CVaR problem in the constrained setting.

Constrained DRO With Convex Loss. The general φ-divergence constrained DRO problem was studied in (Namkoong and Duchi 2016). Instead of optimizing the dual form, the authors treat the worst-case distribution as an N-dimensional vector and apply a stochastic primal-dual method to the resulting min-max problem. However, the computational complexity at each iteration is linear in the number of training samples, which is prohibitive for large-scale applications. The same problem was further studied in (Duchi, Glynn, and Namkoong 2021), where the authors pointed out that minimizing constrained DRO with a φ-divergence is equivalent to adding variance regularization to the Empirical Risk Minimization (ERM) objective. The general Cressie-Read divergence family DRO problem was studied in (Duchi and Namkoong 2018), where the basic idea is to first calculate the worst-case distribution for the constrained DRO and then use the subdifferential to obtain a subgradient. Furthermore, the χ² and CVaR DRO problems were studied in (Levy et al. 2020). Compared with (Duchi and Namkoong 2018), they calculate the worst-case distribution for the penalized DRO and then optimize both the Lagrange multiplier and the loss, which converges to the optimal solution with reduced complexity; their method can be extended to the general Cressie-Read divergence family. However, all of the above papers are limited to convex loss functions. To the best of our knowledge, our work is the first on large-scale non-convex constrained DRO with the general Cressie-Read divergence family.
We note that KL DRO was studied in (Qi et al. 2022), which however requires an exponential computational complexity; we achieve a polynomial computational complexity for the Cressie-Read divergence family.

2 Preliminaries and Problem Model

2.1 Notations

Let s be a sample in S and P0 be the distribution on the points {s_i}_{i=1}^N, where N is the size of the support. Denote by Δ_n := {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} the n-dimensional probability simplex. Denote by x ∈ R^d the optimization variable. We denote by 1_X(x) the indicator function, where 1_X(x) = 0 if x ∈ X and 1_X(x) = +∞ otherwise. Let ℓ : R^d × S → R be a non-convex loss function. Let ||·|| be the Euclidean norm and (t)_+ := max{t, 0} the positive part of t ∈ R. Denote by ∇_x the gradient with respect to x. For a function f : R^d → R, a point x ∈ R^d is said to be an ε-optimal solution if |f(x) - f(x*)| ≤ ε, where f(x*) is the optimal value of f. If f is differentiable, a point x ∈ R^d is said to be first-order ε-stationary if ||∇f(x)|| ≤ ε.

2.2 Assumptions

In this paper, we make the following standard assumptions, commonly used in the DRO literature (Duchi and Namkoong 2018; Levy et al. 2020; Qi et al. 2021, 2022; Wang, Gao, and Xie 2021; Soma and Yoshida 2020):
- The non-convex loss function is bounded: 0 ≤ ℓ(x; s) ≤ B for some B > 0, for all x ∈ R^d and s ∈ S.
- The non-convex loss function is G-Lipschitz, i.e., |ℓ(x1; s) - ℓ(x2; s)| ≤ G||x1 - x2||, and L-smooth, i.e., ||∇_x ℓ(x1; s) - ∇_x ℓ(x2; s)|| ≤ L||x1 - x2||, for any x1, x2 ∈ R^d and s ∈ S.

2.3 DRO Objective and Its Dual Form

In empirical risk minimization (ERM), the goal is to solve inf_x E_{S~P0}[ℓ(x; S)], where the objective is the expectation of the loss with respect to the training distribution P0. To address the distributional mismatch between training and testing data, the formulation of Distributionally Robust Optimization (DRO) (Goodfellow, Shlens, and Szegedy 2014; Madry et al.
2017; Rahimian and Mehrotra 2019) was developed, where the goal is to minimize the expected loss with respect to the worst-case distribution in an uncertainty set U(P0):

inf_x sup_{Q ∈ U(P0)} E_{S~Q}[ℓ(x; S)].   (4)

DRO problems under different uncertainty sets are fundamentally different. Consider the uncertainty set defined by a φ-divergence D_φ(Q || P0), one of the most common choices in the literature, written as D_φ(Q || P0) := ∫ φ(dQ/dP0) dP0, where φ is a non-negative convex function such that φ(1) = 0 and φ(t) = +∞ for any t < 0. Then let the uncertainty set be U(P0) := {Q : D_φ(Q || P0) ≤ ρ}, where ρ > 0 is the radius of the uncertainty set. In this paper, we study the general Cressie-Read family of φ-divergences (Cressie and Read 1984; Van Erven and Harremos 2014), where

φ_k(t) := (t^k - kt + k - 1) / (k(k - 1)),   (5)

with k ∈ (-∞, +∞) \ {0, 1}. Let k* = k/(k - 1). This family includes as special cases the χ²-divergence (k = 2) and the KL divergence (k → 1). When k > 2, the conjugate function of φ_k (introduced below) is not smooth, and the problem becomes hard to solve even in the penalized formulation (Jin et al. 2021). In this paper, we focus on k ∈ (1, 2] (equivalently, k* ∈ [2, ∞)). The objective is

inf_x sup_{Q : D_{φ_k}(Q || P0) ≤ ρ} E_{S~Q}[ℓ(x; S)].   (6)

Solving (6) directly is challenging due to the sup over Q. In (Namkoong and Duchi 2016), a finite-dimensional vector q was used to parameterize the distributions in the uncertainty set, since Q is absolutely continuous with respect to P0 for φ-divergences. The DRO problem then becomes a convex-concave min-max problem. This method can be extended to the case with a non-convex loss function by applying algorithms for non-convex-concave min-max problems (Rafique et al. 2022; Lin, Jin, and Jordan 2020; Xu et al. 2023). However, the dimension of the distribution in the uncertainty set is equal to the number of training samples.
Thus, the computational complexity at each iteration is linear in the sample size, which is prohibitive in large-scale applications. To obtain a complexity independent of the sample size, one alternative is to use the dual. By duality, the DRO objective (6) can be equivalently written as (Levy et al. 2020; Shapiro 2017)

inf_x inf_{λ≥0, η∈R} E_{S~P0}[ λ φ*_k((ℓ(x; S) - η)/λ) ] + λρ + η,   (7)

where φ*_k(t0) = sup_t {t0 t - φ_k(t)} is the conjugate function of φ_k. In this way, the optimization problem under an unknown distribution is rewritten as one under a known distribution. The subsampling method can then be used, which leads to a complexity independent of the sample size (introduced later). For the Cressie-Read family in (5), the corresponding conjugate family is

φ*_k(t) = (1/k) [ ((k - 1)t + 1)_+^{k*} - 1 ].

Therefore, the objective can be written as

inf_{x, λ≥0, η∈R} E_{S~P0}[ (λ/k) ( ((k - 1)(ℓ(x; S) - η)/λ + 1)_+^{k*} - 1 ) ] + λρ + η.

Let η̄ = η - λ/(k - 1); renaming η̄ as η, the corresponding objective is

inf_{x, λ≥0, η∈R} E_{S~P0}[ ((k - 1)^{k*}/k) (ℓ(x; S) - η)_+^{k*} λ^{1-k*} ] + η + λ(ρ + 1/(k(k - 1))),

and we define

f(x; λ; η; s) = ((k - 1)^{k*}/k) (ℓ(x; s) - η)_+^{k*} λ^{1-k*} + η + λ(ρ + 1/(k(k - 1))).

Thus the goal is to solve

inf_x inf_{λ≥0, η∈R} F(x; λ; η),   (8)

where F(x; λ; η) = E_{S~P0}[ f(x; λ; η; S) ]. Therefore, we reformulate the DRO problem as the minimization of an objective function under a known distribution, where the subsampling method can be used to reduce the complexity.
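As a numerical sanity check (illustrative code; the test points and grid bounds are arbitrary), the Cressie-Read generator (5) recovers the χ² generator (t - 1)²/2 at k = 2 and approaches the KL generator t log t - t + 1 as k → 1, and the conjugate expression above matches a direct grid search of sup_{t' ≥ 0} {t t' - φ_k(t')}:

```python
import math
import numpy as np

def phi_k(t, k):
    # Cressie-Read generator (5)
    return (t ** k - k * t + k - 1) / (k * (k - 1))

def phi_k_conj(t, k):
    # claimed conjugate: (1/k) * [ ((k-1) t + 1)_+^{k*} - 1 ],  k* = k/(k-1)
    ks = k / (k - 1)
    return (max((k - 1) * t + 1.0, 0.0) ** ks - 1.0) / k

# k = 2 recovers the chi^2 generator (t - 1)^2 / 2
for t in [0.5, 1.0, 2.0, 3.0]:
    assert abs(phi_k(t, 2.0) - 0.5 * (t - 1) ** 2) < 1e-12
# k -> 1 approaches the KL generator t log t - t + 1
for t in [0.5, 2.0, 3.0]:
    assert abs(phi_k(t, 1.0 + 1e-6) - (t * math.log(t) - t + 1)) < 1e-3

# conjugate check by brute force for k = 1.5 (so k* = 3)
k = 1.5
grid = np.linspace(0.0, 50.0, 200001)   # grid over the primal variable t' >= 0
vals = phi_k(grid, k)
for t in [-0.5, 0.0, 0.3, 1.0]:
    brute = float(np.max(t * grid - vals))
    assert abs(brute - phi_k_conj(t, k)) < 1e-3
print("ok")
```

The clipping (·)_+ in the conjugate reflects the constraint φ_k(t) = +∞ for t < 0, and is what makes the dual objective (7) well defined for all η.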
3 Analysis of Constrained DRO

In this section, we analyze constrained DRO problems under Cressie-Read family divergence uncertainty sets with general smooth non-convex loss functions. We first discuss the challenges arising in the constrained formulation, and then present how to construct a corresponding approximated problem to overcome these challenges.

3.1 Smooth and Lipschitz Approximation

For λ ∈ [0, +∞) and η ∈ R, the objective function F(x; λ; η) is neither smooth nor Lipschitz, so it is difficult to implement gradient-based algorithms. In the following, we construct an approximation of the original problem so that the objective F(x; λ; η) becomes smooth and Lipschitz, by constraining both λ and η to bounded intervals. Since the loss function is bounded, 0 ≤ ℓ ≤ B, we can show that there exists an upper bound λ̄, depending only on k, ρ, and B (its explicit expression is given in Appendix A), such that the optimal value satisfies λ* ≤ λ̄. In this paper, we do not assume λ* ≥ λ0 > 0 as in (Wang, Gao, and Xie 2021). Instead, we consider an approximation with λ ∈ [λ0, λ̄], and show that the difference between the original problem and the approximation can be bounded. We can also show that the corresponding optimal η* lies in [η̲, B] for a lower bound η̲ depending only on k, ρ, and B. The proof can be found in Appendix A. The challenge lies in that the value of η can be negative; given such an η, the optimal value of λ can be quite large, and it is then hard to upper bound λ. In our proof, we reduce the objective to a function that depends only on η and find the lower bound on η; based on this lower bound, we resolve the challenge and obtain the bound on λ. We show that the difference between the original problem and the approximation can be bounded in the following lemma.

Lemma 1. For all x ∈ R^d and 0 ≤ λ0 ≤ λ̄,

| inf_{λ∈[λ0,λ̄], η∈[η̲,B]} F(x; λ; η) - inf_{λ≥0, η∈R} F(x; λ; η) | ≤ 2λ0 ρ.

The proof can be found in Appendix B. Note that in the proof of this lemma, we establish strong duality for

sup_{Q : D_{φ_k}(Q || P0) ≤ ρ} E_{S~Q}[ℓ(x; S)] - λ0 D_{φ_k}(Q || P0),

where both the hard constraint and the regularizer are kept. This differs from the approach in Section 3.2 of (Shapiro 2017), and this strong duality holds for any φ-divergence DRO problem. Lemma 1 demonstrates that the non-smooth objective function can be approximated by a smooth one: a smaller λ0 makes the gap smaller but the function less smooth.

3.2 Convexity and Smoothness on Parameters

The advantage of our approximated problem is that the objective is smooth in all of x, λ, and η.
Moreover, we find that the objective function is convex in (λ, η), even though the loss function is non-convex in x.

Lemma 2. Define z = (λ, η) ∈ M, where M = {(λ, η) : λ ∈ [λ0, λ̄], η ∈ [η̲, B]}. Then for all x ∈ R^d and z ∈ M, the objective function F(x; z) is convex and Lz-smooth in z, where Lz = 1/λ0 + 2(B + |η̲|)/λ0² + (B + |η̲|)²/(2λ0³) if k* = 2, and Lz = O((B + |η̲|)^{k*} λ0^{-k*-1}) if k* > 2 (the explicit constant is given in Appendix C). Moreover, the objective function F(x; z) is Lx-smooth in x, where Lx = ((k - 1)^{k*} k*/k) λ0^{1-k*} (B + |η̲|)^{k*-2} ((k* - 1)G² + (B + |η̲|)L).

The proof can be found in Appendix C. Note that the first-order gradient of the objective is non-differentiable at some points when k* = 2. We therefore discuss two cases: k* > 2 and k* = 2. In the first case, we can compute the Hessian matrix of the objective; in the second, we establish smoothness and convexity directly.

Algorithm 1: SFK-DRO
Input: iteration number K, initial point (x1, z1), sample sizes nx, nz, step size α, and a constant C.
1: Let t = 1
2: while t ≤ K do
3:   Randomly select nx samples and compute ∇x f̂x(xt, zt) = (1/nx) Σ_{i=1}^{nx} ∇x f(xt; zt; si)
4:   xt+1 = xt - α ∇x f̂x(xt, zt)
5:   Randomly select nz samples and compute ∇z f̂z(xt+1, zt) = (1/nz) Σ_{j=1}^{nz} ∇z f(xt+1; zt; sj)
6:   et = argmin_{e∈M} ⟨e, ∇z f̂z(xt+1; zt)⟩
7:   dt = et - zt
8:   gt = ⟨-dt, ∇z f̂z(xt+1; zt)⟩
9:   γt = min{gt/C, 1}
10:  zt+1 = zt + γt dt
11:  t = t + 1
12: end while
13: t0 = argmin_t { ||∇x f̂x(xt; zt)||² + gt² }
Output: (x_{t0+1}, z_{t0})

4 Mini-Batch Algorithm

The existing constrained stochastic algorithm for general non-convex functions (Ghadimi, Lan, and Zhang 2016) can be used to solve the approximated problem directly. However, that method optimizes y = (x; λ; η) as a whole; the objective is non-convex in y, and the resulting computational complexity to reach an ε-stationary point is O(ε^{-3k*-5}). In the previous section, we showed that F(x; z) is Lz-smooth in z and Lx-smooth in x. Moreover, Lz ≤ O(λ0^{-k*-1}), which is much larger than Lx when λ0 is small, since Lx ≤ O(λ0^{1-k*}).
If we optimize all the parameters together, we must run a non-convex algorithm on a smooth function with a large smoothness constant, which is not computationally efficient. If instead we optimize x and z separately, then although Lz > Lx (so z is more expensive to update), the convexity in z makes convergence to the optimal z faster. This motivates a stronger convergence criterion: instead of finding an ε-stationary point of F(y), we find (x, λ, η) such that

||∇x F(x; λ; η)|| ≤ ε,  |F(x; λ; η) - inf_{λ'≥0, η'∈R} F(x; λ'; η')| ≤ ε.

We then provide our Stochastic gradient and Frank-Wolfe DRO algorithm (SFK-DRO), which optimizes x and z separately (see Algorithm 1). Define D = max_{z1,z2∈M} ||z1 - z2||, σ = ((k - 1)^{k*} k*/k)(B + |η̲|)^{k*-1} G λ0^{1-k*}, Δ = F(x1; z1) - inf_{x, z∈M} F(x; z), and let C be a constant such that C ≥ DLz. The convergence rate is provided in the following theorem.

Theorem 1. Set the step size α = 1/C and λ0 = ε/(8ρ). Choose mini-batch sizes nx = 8Lxσ²/(Cε²) ≤ O(λ0^{-2k*+4} ε^{-2}) and nz = O(ε^{-k*}) large enough that the subsampling bias in Lemma 3 is at most ε/4. Then, for any sufficiently small ε > 0 (so that C ≥ DLz ≥ 2Lx and the Frank-Wolfe step size γt = gt/C ≤ 1), at most T = 16CΔ/ε² ≤ O(λ0^{-k*-1} ε^{-2}) iterations are needed to guarantee a stationary point (x_{t0+1}; z_{t0}) in expectation:

E||∇x F(x_{t0+1}; z_{t0})|| ≤ ε,  E[ F(x_{t0+1}; z_{t0}) - inf_{λ≥0, η∈R} F(x_{t0+1}; λ; η) ] ≤ ε.

The detailed proof can be found in Appendix D, and a proof sketch is provided later. Before that, we introduce a lemma for our subsampling method; via this lemma, we show that the complexity is independent of the sample size, which suits the large-scale setting. When we optimize z, the estimator f̂z(x, z) = (1/nz) Σ_{j=1}^{nz} f(x; z; sj) is built to estimate F(x; z) = E_{S~P0}[f(x; z; S)]. Though this estimator is unbiased, in our Frank-Wolfe update (Jaggi 2013; Frank, Wolfe et al. 1956; Lacoste-Julien 2016) we need to estimate min_z F(x; z) via E[min_z f̂z(x; z)].
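Because the minimum over z of the batch average is a concave function of the empirical distribution, E[min_z f̂z] underestimates min_z F. This can be seen in a toy Monte-Carlo experiment (illustrative only; f(z; s) = (z - s)² is a stand-in convex-in-z function, not the paper's objective):

```python
# Toy demonstration: E[min_z f_hat(z)] < min_z E[f_hat(z)], with a bias
# that shrinks as the batch size n_z grows (cf. Lemma 3).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=20000)   # "population" of samples
zs = np.linspace(-2.0, 2.0, 201)             # grid over z

def batch_min(batch):
    # min over the z-grid of the batch average of f(z; s) = (z - s)^2
    vals = ((zs[:, None] - batch[None, :]) ** 2).mean(axis=1)
    return vals.min()

m = batch_min(samples)                       # population-level minimum

def expected_batch_min(nz, reps=2000):
    # Monte-Carlo estimate of E[min_z f_hat(z)] for batch size nz
    return np.mean([batch_min(rng.choice(samples, nz, replace=False))
                    for _ in range(reps)])

gap_small = m - expected_batch_min(4)
gap_large = m - expected_batch_min(64)
print(gap_small, gap_large)  # both positive; the larger batch has smaller bias
```

For this quadratic stand-in the bias is exactly the bias of the plug-in variance estimator, which decays like 1/nz; for the DRO objective, Lemma 3 below gives the (slower) rate O(nz^{-1/k*}).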
Obviously, the expectation of the minimum is not equal to the minimum of the expectation, so this is a biased estimator. The following lemma shows that the gap is bounded by a decreasing function of the batch size nz.

Lemma 3. For any bounded loss function ℓ, if k* = 2,

| inf_{z∈M} F(xt+1; z) - E[ inf_{z∈M} f̂z(xt+1; z) ] | ≤ O( (4 + log nz) / √nz ),

and if k* > 2,

| inf_{z∈M} F(xt+1; z) - E[ inf_{z∈M} f̂z(xt+1; z) ] | ≤ O( nz^{-1/k*} ),

where the hidden constants depend only on B, k, and ρ (explicit expressions are given in Appendix E). The detailed proof can be found in Appendix E. Note that (Levy et al. 2020) establishes this result only for k* = 2; we extend it to k* > 2. The lemma shows that the gap is of order O(nz^{-1/k*}) and is independent of the total number of samples.

4.1 Proof Sketch of Theorem 1

We use stochastic gradient descent (Robbins and Monro 1951; Moulines and Bach 2011; Gower et al. 2019) to update x. Since the objective is Lx-smooth in x, if α ≤ 1/(2Lx) we have

(α/2) E||∇x F(xt; zt)||² ≤ E[F(xt; zt)] - E[F(xt+1; zt)] + (α²Lx/2) E[ ||∇x f̂x(xt; zt) - ∇x F(xt; zt)||² ],   (9)

where f̂x(x, z) = (1/nx) Σ_{j=1}^{nx} f(x; z; sj). With σ = ((k - 1)^{k*} k*/k)(B + |η̲|)^{k*-1} G λ0^{1-k*}, we can show that E[||∇x f̂x(xt; zt) - ∇x F(xt; zt)||²] ≤ σ²/nx. Since z ∈ M, instead of stochastic gradient descent we employ the Frank-Wolfe method (Frank, Wolfe et al. 1956) to update z. Define et = argmin_{e∈M} ⟨e, ∇z f̂z(xt+1; zt)⟩ and gt = ⟨zt - et, ∇z f̂z(xt+1; zt)⟩. Since f(x; z) is convex in z, we have gt ≥ f̂z(xt+1; zt) - min_{z∈M} f̂z(xt+1; z) (Jaggi 2013). We can also show that gt ≤ C · O(λ0), so for small λ0 we have gt/C ≤ 1 and γt = gt/C. Then, since the objective is Lz-smooth in z (cf. (9) of (Lacoste-Julien 2016)), we have

E[gt²] / (2C) ≤ E[F(xt+1; zt)] - E[F(xt+1; zt+1)].   (11)

Summing (9) and (11) over t = 1, ..., T,

(α/2) Σ_t E||∇x F(xt; zt)||² + Σ_t E[gt²]/(2C) ≤ F(x1; z1) - E[F(x_{T+1}; z_{T+1})] + T (α²Lx/2)(σ²/nx).

Since Lz ≤ O(λ0^{-k*-1}) and Lx ≤ O(λ0^{1-k*}), for small λ0 we have C ≥ DLz ≥ 2Lx.
Then we set α = 1/C, T = 16CΔ/ε² ≤ O(λ0^{-k*-1} ε^{-2}), and nx = 8Lxσ²/(Cε²), with Δ = F(x1; z1) - min_{x, z∈M} F(x; z). It follows that for some t ∈ [1, T], both E[||∇x F(xt; zt)||] and E[gt] are at most O(ε), and E[F(xt+1; zt) - inf_{z∈M} f̂z(xt+1; z)] ≤ E[gt]. We choose (xt+1, zt) as the output, so we need to bound E[||∇x F(xt+1; zt)||] and E[F(xt+1; zt) - inf_{z∈M} F(xt+1; z)]. Since F(x; z) is Lx-smooth in x, we have E[||∇x F(xt+1; zt)||] ≤ ε. By Lemma 3, we pick nz = O(ε^{-k*}) such that

| inf_{z∈M} F(xt+1; z) - E[ inf_{z∈M} f̂z(xt+1; z) ] | ≤ ε/4.

By Lemma 1, with λ0 = ε/(8ρ), we have

| inf_{λ∈[λ0,λ̄], η∈[η̲,B]} F(x; λ; η) - inf_{λ≥0, η∈R} F(x; λ; η) | ≤ ε/4.

Thus we have, in expectation,

F(xt+1; zt) - inf_{λ≥0, η∈R} F(xt+1; λ; η) ≤ ε,   (15)

which completes the proof.

5 Smoothed CVaR

Our algorithm can also solve other DRO problems efficiently, for example the smoothed CVaR proposed in (Jin et al. 2021). CVaR DRO is an important φ-divergence DRO problem, where φ(t) = 0 for 0 ≤ t ≤ 1/µ and φ(t) = +∞ otherwise, with 0 < µ < 1 a constant. The dual expression of CVaR can be written as

L_CVaR(x; P0) = inf_{η∈R} (1/µ) E_{S~P0}[(ℓ(x; S) - η)_+] + η.

The dual of CVaR is non-differentiable, which is undesirable from an optimization viewpoint. To address this, (Jin et al. 2021) proposed a new divergence function that can be seen as a smoothed version of CVaR; their experiments show that optimizing the smoothed CVaR is much easier. However, the method of (Jin et al. 2021) works only for the penalized formulation of DRO. We show that our method can solve the constrained smoothed CVaR. Here, the divergence function is

φ_s(t) = t log(t) + ((1 - µt)/µ) log( (1 - µt)/(1 - µ) ) for t ∈ [0, 1/µ); +∞ otherwise.   (16)

The corresponding conjugate function is

φ*_s(t) = (1/µ) log(1 - µ + µ exp(t)).   (17)

The objective function is then written as

inf_x inf_{λ≥0, η∈R} Fs(x; λ; η) = E_{S~P0}[ λ φ*_s((ℓ(x; S) - η)/λ) ] + λρ + η.   (18)

We can show that there exist upper bounds for the optimal values λ* and η*: there exists λ̄ > 0, depending only on µ, B, and ρ, such that λ* ∈ [0, λ̄] and η* ∈ [0, B]. The proof can be found in Appendix F. This objective function is non-smooth as λ → 0.
Therefore, we take an approach similar to Section 3.1 and approximate the original problem with λ ∈ [λ0, λ̄]. We bound the difference in the following lemma.

Lemma 4. For all x ∈ R^d and λ0 ≥ 0,

| inf_{λ∈[λ0,λ̄], η∈[0,B]} Fs(x; λ; η) - inf_{λ≥0, η∈R} Fs(x; λ; η) | ≤ 2λ0 ρ.

The proof is similar to that of Lemma 1 and is omitted here. In addition, we can show that Fs(x; z) is L'z-smooth and convex in z, where L'z ≤ O(λ0^{-3}) for λ ∈ [λ0, λ̄], and that Fs(x; z) is L'x-smooth in x, where L'x ≤ O(λ0^{-2}). Similar to eq. (42) and Remark 1 in (Levy et al. 2020), we can prove that |min_{z∈M} Fs(xt+1; z) - E[min_{z∈M} f̂s(xt+1; z)]| ≤ O(nz^{-1/2}). We can then use Algorithm 1 directly, and the complexity to reach an ε-stationary point is O(ε^{-7}). The detailed proof can be found in Appendix F.

Class     0      1      2      3      4      5      6      7      8      9
ERM     77.64  86.19  69.33  54.03  51.53  47.05  87.66  85.35  87.12  83.15
SFK-DRO 76.11  84.71  66.18  54.95  58.65  49.36  89.06  84.03  88.41  83.09
PAN-DRO 74.92  85.62  65.72  52.69  55.83  49.50  88.85  84.06  88.68  81.29

Table 1: Test accuracy of each class for imbalanced CIFAR-10.

6 Numerical Results

In this section, we verify our theoretical results on an imbalanced classification problem. In the experiment, we consider a non-convex loss function and set k = 2 for the Cressie-Read family. We show that: 1) to optimize the same dual objective, our proposed algorithm converges faster than the general Proximal Gradient Descent (PGD) algorithm (Ghadimi, Lan, and Zhang 2016); 2) on the worst classes, the proposed algorithm for the constrained DRO problem outperforms or is close to the penalized DRO, and both outperform the baseline.

Tasks. We conduct experiments on the imbalanced CIFAR-10 dataset, following the experimental setting in (Jin et al. 2021; Chou et al. 2020).
The original CIFAR-10 training dataset consists of 10 classes, each with 5000 images. We randomly select training samples from the original set for each class with the following sampling ratios: {0.804, 0.543, 0.997, 0.593, 0.390, 0.285, 0.959, 0.806, 0.967, 0.660}. We keep the test dataset unchanged.

Models. We train the standard AlexNet model (Krizhevsky, Sutskever, and Hinton 2012) with the standard cross-entropy (CE) loss. For the comparison of convergence rate, we optimize the same dual objective with the PGD algorithm of (Ghadimi, Lan, and Zhang 2016). To compare robustness, we optimize the ERM via vanilla SGD. In addition, we propose an algorithm PAN-DRO, which fixes λ and optimizes only η and the neural network; it thus solves the penalized DRO problem.

Training Details. We set λ1 = 1, η1 = 0, λ0 = 0.1, ρ = 10, and the upper bounds λ̄ = 10, B = 10. To achieve a faster optimization rate, we set the learning rate α = 0.01 for the first 40 epochs and α = 0.001 afterwards. The mini-batch size is 128. All results are moving-averaged with a window of size 5, and the simulations are repeated 4 times.

Results. In Figure 1, we plot the value of the CE loss for different algorithms through the training process. To optimize the same dual objective with the same learning rate, the PGD algorithm converges more slowly than our proposed DRO algorithms, which matches our theoretical results. Moreover, compared with ERM, the DRO algorithms have higher training losses but lower test losses, demonstrating their robustness. We also provide the test accuracy of the trained models in Table 1. Classes 3, 4, and 5 have the lowest accuracies due to the limited samples. For these classes, the performance of our SFK-DRO algorithm for the constrained DRO is better than or close to that of PAN-DRO for the penalized DRO.
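The imbalanced-dataset construction described in the Tasks paragraph can be sketched as follows (only the per-class sampling ratios come from the paper; the dataset here is a stand-in list of (index, label) pairs rather than real CIFAR-10):

```python
# Sketch of building the imbalanced training set: keep a per-class fraction
# of examples given by the listed sampling ratios.
import random

random.seed(0)
ratios = [0.804, 0.543, 0.997, 0.593, 0.390, 0.285, 0.959, 0.806, 0.967, 0.660]
per_class = 5000  # CIFAR-10 training images per class

# stand-in dataset: (image_index, label) pairs
train = [(i, c) for c in range(10) for i in range(per_class)]

def subsample(dataset, ratios):
    kept = []
    for c, r in enumerate(ratios):
        cls = [ex for ex in dataset if ex[1] == c]
        kept += random.sample(cls, round(r * len(cls)))
    return kept

imbalanced = subsample(train, ratios)
counts = [sum(1 for _, c in imbalanced if c == cls) for cls in range(10)]
print(counts)  # class 5 keeps only round(0.285 * 5000) = 1425 of 5000 images
```

Classes 3, 4, and 5 end up with the fewest examples, which is consistent with Table 1, where these classes show the lowest test accuracy.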
Both DRO algorithms outperform the vanilla ERM algorithm.

Figure 1: Training curve of the classification task.

7 Conclusion
In this paper, we developed the first stochastic algorithm in the literature for large-scale non-convex stochastic constrained DRO problems, with theoretical convergence and complexity guarantees. We developed a smooth and Lipschitz approximation of the original problem with bounded approximation error. Compared with existing algorithms, the proposed algorithm has an improved convergence rate. The computational complexity at each iteration is independent of the size of the training dataset, and thus our algorithm is applicable to large-scale applications. Our results hold for a general family of Cressie-Read divergences.

Acknowledgments
The work of Q. Zhang and S. Zou is supported by the National Science Foundation under Grant CCF-2106560. Y. Zhou's work is supported by the National Science Foundation under Grants CCF-2106216, DMS-2134223 and ECCS-2237830 (CAREER). L. Shen's work is supported by the NSF under Grant DMS-2208385. This material is based upon work supported under the AI Research Institutes program by the National Science Foundation and the Institute of Education Sciences, U.S. Department of Education, through Award No. 2229873 - National AI Institute for Exceptional Education. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the Institute of Education Sciences, or the U.S. Department of Education.

References
Ali, S. M.; and Silvey, S. D. 1966. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1): 131–142.
Ben-Tal, A.; Den Hertog, D.; De Waegenaere, A.; Melenberg, B.; and Rennen, G. 2013.
Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2): 341–357.
Blitzer, J.; McDonald, R.; and Pereira, F. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 120–128.
Chou, H.-P.; Chang, S.-C.; Pan, J.-Y.; Wei, W.; and Juan, D.-C. 2020. Remix: Rebalanced mixup. In Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, 95–110. Springer.
Cressie, N.; and Read, T. R. 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society Series B: Statistical Methodology, 46(3): 440–464.
Csiszár, I. 1967. On information-type measure of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 2: 299–318.
Curi, S.; Levy, K. Y.; Jegelka, S.; and Krause, A. 2020. Adaptive sampling for stochastic risk-averse learning. In Proceedings of Advances in Neural Information Processing Systems, volume 33, 1036–1047.
Daumé III, H.; and Marcu, D. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26: 101–126.
Duchi, J.; and Namkoong, H. 2018. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
Duchi, J. C.; Glynn, P. W.; and Namkoong, H. 2021. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3): 946–969.
Frank, M.; Wolfe, P.; et al. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2): 95–110.
Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2): 267–305.
Ghosh, S.; Squillante, M.; and Wollega, E. 2018. Efficient stochastic gradient descent for distributionally robust learning.
arXiv preprint arXiv:1805.08728.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
Gower, R. M.; Loizou, N.; Qian, X.; Sailanbayev, A.; Shulgin, E.; and Richtárik, P. 2019. SGD: General analysis and improved rates. In Proceedings of the International Conference on Machine Learning, 5200–5209. PMLR.
Grother, P. J.; Grother, P. J.; Phillips, P. J.; and Quinn, G. W. 2011. Report on the evaluation of 2D still-image face recognition algorithms. US Department of Commerce, National Institute of Standards and Technology.
Hashimoto, T.; Srivastava, M.; Namkoong, H.; and Liang, P. 2018. Fairness without demographics in repeated loss minimization. In Proceedings of the International Conference on Machine Learning, 1929–1938. PMLR.
Hovy, D.; and Søgaard, A. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 483–488.
Hu, Z.; and Hong, L. J. 2013. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 1(2): 9.
Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning, 427–435. PMLR.
Jin, J.; Zhang, B.; Wang, H.; and Wang, L. 2021. Non-convex distributionally robust optimization: Non-asymptotic analysis. In Proceedings of Advances in Neural Information Processing Systems, volume 34, 2771–2782.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, volume 25.
Lacoste-Julien, S. 2016. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345.
Levy, D.; Carmon, Y.; Duchi, J. C.; and Sidford, A. 2020.
Large-scale methods for distributionally robust optimization. In Proceedings of Advances in Neural Information Processing Systems, volume 33, 8847–8860.
Lin, T.; Jin, C.; and Jordan, M. 2020. On gradient descent ascent for nonconvex-concave minimax problems. In Proceedings of the International Conference on Machine Learning, 6083–6093. PMLR.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Moulines, E.; and Bach, F. 2011. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24.
Namkoong, H.; and Duchi, J. C. 2016. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of Advances in Neural Information Processing Systems, volume 29.
Nesterov, Y. 2003. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media.
Qi, Q.; Guo, Z.; Xu, Y.; Jin, R.; and Yang, T. 2021. An online method for a class of distributionally robust optimization with non-convex objectives. In Proceedings of Advances in Neural Information Processing Systems, volume 34, 10067–10080.
Qi, Q.; Lyu, J.; Bai, E. W.; Yang, T.; et al. 2022. Stochastic constrained DRO with a complexity independent of sample size. arXiv preprint arXiv:2210.05740.
Rafique, H.; Liu, M.; Lin, Q.; and Yang, T. 2022. Weakly-convex concave min-max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 37(3): 1087–1121.
Rahimian, H.; and Mehrotra, S. 2019. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.
Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
Rockafellar, R. T.; Uryasev, S.; et al. 2000.
Optimization of conditional value-at-risk. Journal of Risk, 2: 21–42.
Rockafellar, R. T.; and Wets, R. J. 1998. Variational Analysis. Springer.
Shapiro, A. 2017. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4): 2258–2275.
Sinha, A.; Namkoong, H.; Volpi, R.; and Duchi, J. 2017. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
Soma, T.; and Yoshida, Y. 2020. Statistical learning with conditional value at risk. arXiv preprint arXiv:2002.05826.
Tamar, A.; Glassner, Y.; and Mannor, S. 2015. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
Van Erven, T.; and Harremos, P. 2014. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7): 3797–3820.
Wang, J.; Gao, R.; and Xie, Y. 2021. Sinkhorn distributionally robust optimization. arXiv preprint arXiv:2109.11926.
Xu, Z.; Zhang, H.; Xu, Y.; and Lan, G. 2023. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. Mathematical Programming, 1–72.