# Revisiting Large-Scale Non-Convex Distributionally Robust Optimization

Published as a conference paper at ICLR 2025

Qi Zhang¹, Yi Zhou², Simon Khan³, Ashley Prater-Bennette³, Lixin Shen⁴ & Shaofeng Zou¹
¹ School of Electrical, Computer and Energy Engineering, Arizona State University
² Department of Computer Science and Engineering, Texas A&M University
³ Information Directorate, Air Force Research Laboratory
⁴ Department of Mathematics, Syracuse University
{qzhan261,zou}@asu.edu, yi.zhou@tamu.edu, lshen03@syr.edu, {simon.khan,ashley.prater-bennette}@us.af.mil

**Abstract.** Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that the dual problem satisfies a generalized (L0, L1)-smoothness condition and that the gradient noise satisfies an affine variance condition, designed a mini-batch normalized gradient descent with momentum algorithm, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness and partially affine variance conditions, obtained by studying the optimization variable and the dual variable separately, which further yields a much simpler convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an ϵ-stationary point with $O(\epsilon^{-4})$ gradient complexity, which matches the result in Jin et al. (2021). Our proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise.
We further design a variance-reduced method that achieves a lower gradient complexity of $O(\epsilon^{-3})$. Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).

## 1 INTRODUCTION

Empirical risk minimization (ERM) minimizes the expected loss under the empirical distribution P0 of the training dataset, with the goal of achieving good performance on a test dataset. Though this approach yields good performance in most cases, it often fails due to a mismatch between training and test data distributions, e.g., domain differences from training to testing in domain adaptation problems (Blitzer et al., 2006; Daume III & Marcu, 2006); imbalanced classes in the training dataset (Sagawa et al., 2019), where the performance of underrepresented minority groups is important due to fairness considerations (Hashimoto et al., 2018; Grother et al., 2011); and potential adversarial attacks on the deployed model (Goodfellow et al., 2014; Sinha et al., 2017; Madry et al., 2017). For models trained using ERM, such distribution shifts lead to significant performance degradation on test datasets. To deal with this challenge of potential distribution shift, distributionally robust optimization (DRO) was developed and has been widely studied in recent years (Ben-Tal et al., 2013; Shapiro, 2017; Rahimian & Mehrotra, 2019). Instead of merely optimizing the expectation of the loss function under a fixed distribution, DRO optimizes the performance over a set of probability distributions, aiming at good model performance under potential distribution shifts. Specifically, DRO assumes the test distribution lies in an uncertainty set centered at the empirical distribution P0 of the training dataset.
Typically, the uncertainty set U(P0) is defined as follows:
$$\mathcal{U}(P_0) = \{Q : D(Q\|P_0) \le \rho\}, \tag{1}$$
where $D$ measures the distance between $Q$ and $P_0$, e.g., the Kullback-Leibler (KL) divergence, the χ² divergence, or the Wasserstein distance, and $\rho$ is the radius of the uncertainty set. The goal of DRO is then to optimize the expectation of the loss function under the worst-case distribution within the uncertainty set $\mathcal{U}(P_0)$ (Rahimian & Mehrotra, 2019; Shapiro, 2017):
$$\inf_x \sup_{Q \in \mathcal{U}(P_0)} \mathbb{E}_{S\sim Q}[\ell(x, S)], \tag{2}$$
where $x$ is the trainable model parameter and $\ell(x, S)$ is the loss function for the model with parameter $x$ and sample $S$. The formulation in equation 2 requires the optimized distribution $Q$ to lie strictly inside the uncertainty set $\mathcal{U}(P_0)$, which is relatively hard to enforce. In practice, it is usually preferred to use a soft penalty term, resulting in the following penalized DRO problem (Levy et al., 2020; Jin et al., 2021; Qi et al., 2021; Sinha et al., 2017):
$$\inf_x \Psi(x) := \sup_{Q}\; \mathbb{E}_{S\sim Q}[\ell(x, S)] - \lambda D(Q\|P_0). \tag{3}$$
This removes the hard constraint on $Q$ and controls the distance between the optimized distribution $Q$ and the training distribution $P_0$ by a regularizer. The hyperparameter $\lambda$ is pre-selected and fixed during training. DRO problems with different types of uncertainty sets, i.e., different choices of $D$, are fundamentally different. In this paper, we focus on a general class of distances, the ψ-divergences, which includes, e.g., the χ²-divergence and the Cressie-Read family of divergences (Cressie & Read, 1984; Van Erven & Harremos, 2014). The ψ-divergence is widely studied in the DRO literature (Namkoong & Duchi, 2016):
$$D(Q\|P_0) = \int \psi\Big(\frac{dQ}{dP_0}\Big)\, dP_0, \tag{4}$$
where $\psi$ is a non-negative convex function such that $\psi(1) = 0$ and $\psi(t) = +\infty$ for any $t < 0$. The penalized formulation of DRO in equation 3 is a minimax optimization problem and is usually hard to solve. For ψ-divergence-defined DRO problems, a popular approach is to investigate its dual formulation.
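Before turning to the dual, the ψ-divergence in equation 4 can be illustrated concretely for discrete distributions. The following minimal sketch is our own (function names are placeholders, not from the paper) and uses the χ²-style generator $\psi(t) = (t-1)^2$:

```python
import numpy as np

def psi_divergence(q, p, psi):
    """D(Q||P0) = sum_i p_i * psi(q_i / p_i) for discrete Q, P0 with p_i > 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(p * psi(q / p)))

# chi^2-style generator: psi(t) = (t - 1)^2, with psi(1) = 0
chi2 = lambda t: (t - 1.0) ** 2

p0 = np.array([0.25, 0.25, 0.25, 0.25])   # empirical distribution P0
q  = np.array([0.40, 0.30, 0.20, 0.10])   # a shifted distribution Q
d = psi_divergence(q, p0, chi2)            # approx 0.2 for this shifted q
```

The divergence is zero exactly when Q equals P0 and grows as Q drifts away, which is what makes it a useful penalty in equation 3.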
By strong duality (Levy et al., 2020; Shapiro, 2017), the solution of equation 3 is equivalent to the solution of the following dual problem:
$$\inf_x \Psi(x) = \inf_{x,\,\eta\in\mathbb{R}} \hat L(x, \eta) := \lambda\, \mathbb{E}_{S\sim P_0}\, \psi^*\Big(\frac{\ell(x, S) - \eta}{\lambda}\Big) + \eta, \tag{5}$$
where $\eta \in \mathbb{R}$ is a dual variable and $\psi^*$ is the conjugate function of $\psi$, defined as $\psi^*(t) = \sup_{a\in\mathbb{R}}\{ta - \psi(a)\}$. In this paper, we study this dual formulation, which has the following three advantages over the penalized form in equation 3: (i) the objective is optimized under the known training distribution $P_0$; (ii) it is easy to obtain an unbiased estimate of the gradient of the objective with respect to $x$, since we do not need to take the supremum over $Q$; and (iii) it converts a minimax problem into a minimization problem, which is easier to solve. In this paper, we focus on the DRO problem in equation 3 and equation 5 with a non-convex $L$-smooth loss function $\ell$ (see Definition 2). We consider the large-scale setting, where the training dataset consists of a large number $N$ of samples. We aim to characterize the fundamental structure of this problem and develop efficient first-order algorithms with a comprehensive convergence analysis. The same problem was studied in Jin et al. (2021). It was shown that the dual objective in equation 5 is not $L$-smooth if the loss function $\ell$ is not bounded. They further showed that the dual $\hat L(x, \eta)$ satisfies the generalized smoothness condition (see Definition 3), where the Lipschitz constant grows linearly with the gradient norm $\|\nabla_{x,\eta}\hat L\|$. Similarly, in the stochastic setting, they proved that the variance of the gradient estimate grows linearly with the squared gradient norm $\|\nabla_{x,\eta}\hat L\|^2$, i.e., affine variance noise (see the discussion under Lemma 3). To solve this problem, a normalized momentum method was used and shown to converge with $O(\epsilon^{-4})$ gradient complexity.
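To make the dual concrete, here is a minimal sketch (our own illustration, not the paper's code) of $\hat L(x,\eta) = \lambda\,\mathbb{E}_{S\sim P_0}[\psi^*((\ell(x,S)-\eta)/\lambda)] + \eta$ for the χ²-style generator $\psi(t) = (t-1)^2$, whose conjugate over $t \ge 0$ is $\psi^*(s) = (s/2+1)_+^2 - 1$:

```python
import numpy as np

def psi_star_chi2(s):
    """Conjugate of psi(t) = (t-1)^2 over t >= 0: psi*(s) = max(s/2 + 1, 0)^2 - 1."""
    return np.maximum(s / 2.0 + 1.0, 0.0) ** 2 - 1.0

def dual_objective(losses, eta, lam):
    """Dual DRO objective: lam * E_{P0}[psi*((loss - eta)/lam)] + eta,
    where the expectation is the empirical mean over per-sample losses."""
    losses = np.asarray(losses, float)
    return float(lam * np.mean(psi_star_chi2((losses - eta) / lam)) + eta)
```

Sanity check: when every sample has the same loss c, the worst-case distribution is P0 itself, so the dual infimum over η equals c and is attained at η = c (since ψ*(0) = 0); any other η can only increase the value.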
Note that algorithms and analyses for generalized smooth optimization problems, e.g., Adagrad and Adam (Li et al., 2023; Wang et al., 2023; Zhang et al., 2024b), can also be used to solve this problem. However, solving the problem in equation 5 as a generalized (L0, L1)-smooth optimization problem typically requires a more involved convergence analysis. In this paper, we answer the following question:

*Are generalized (L0, L1)-smoothness and affine variance noise an overkill to characterize the dual $\hat L(x, \eta)$ in equation 5?*

### 1.1 OUR CONTRIBUTIONS

Our main contributions are summarized as follows.

- We prove that the dual objective of the DRO problem in equation 5 is partially generalized smooth (see Lemma 1), and that the noise satisfies a partially affine variance condition in the stochastic setting (see Lemma 3). These two conditions provide a much more precise and fundamental characterization of the dual in equation 5 than the generalized smoothness and affine variance noise derived in Jin et al. (2021). This precise characterization yields a much simpler convergence analysis. Our proposed conditions circumvent the unbounded-Lipschitz-constant and unbounded-noise-variance challenges of generalized (L0, L1)-smooth problems with affine variance noise. Our conditions first show that the dual problem is standard L-smooth in the dual variable η and that the noise of the gradient in η has bounded variance. Then, under our partially generalized smoothness and partially affine variance noise conditions, the Lipschitz constant and noise variance with respect to the model parameter x are also bounded, making the objective easy to solve.
- We show that in the deterministic setting, an algorithm as simple as gradient descent solves the problem with an iteration complexity of $O(\epsilon^{-2})$, and that in the stochastic setting, an algorithm as simple as double stochastic gradient descent with clipping (D-SGD-C) solves the problem with a gradient complexity of $O(\epsilon^{-4})$. We further design a double Spider with clipping (D-Spider-C) algorithm and show its convergence with an improved $O(\epsilon^{-3})$ gradient complexity. Thanks to our more precise characterizations of partially generalized smoothness and partially affine variance noise, our analyses are much simpler than those in Jin et al. (2021) and those for Spider algorithms (Chen et al., 2023; Reisizadeh et al., 2023), which were developed for general generalized smooth problems with affine variance noise and are not tailored to DRO problems. Our methods converge with computational complexities independent of the number of training samples N, and are thus applicable to large-scale training datasets.
- Numerical experiments verify our theoretical results. We observe that our proposed algorithms outperform the existing DRO method (Jin et al., 2021).

### 1.2 RELATED WORKS

**DRO.** *Scalability:* Many existing approaches are not scalable when the training dataset is large. The method of Namkoong & Duchi (2016), which parameterizes the unknown distribution by a vector of dimension N and models the DRO problem as a minimax optimization problem, is not feasible for large-scale problems. Following this approach, many minimax methods such as Rafique et al. (2022); Lin et al. (2020); Xu et al. (2023) can be used to address the DRO problem. However, a computational complexity that is linear (or worse) in the size of the training set is not preferable for large-scale problems. In this paper, our stochastic algorithms have a per-iteration complexity that is independent of the training dataset size.
*Convex loss functions:* Some existing methods (Duchi & Namkoong, 2018; Levy et al., 2020; Wang et al., 2021; 2024; Hashimoto et al., 2018) require the loss function to be convex, which fails to capture a wide range of machine learning problems where the loss function is non-convex, e.g., neural networks. In this paper, we focus on general non-convex smooth loss functions. *Bounded loss functions:* For the non-convex DRO problem, existing studies, e.g., Qi et al. (2021; 2022); Zhang et al. (2024a); Soma & Yoshida (2020), require the loss function ℓ to be bounded (or satisfy even more restrictive assumptions). In this paper, we focus on non-convex smooth loss functions, which may potentially be unbounded. *Non-convex smooth loss functions:* Jin et al. (2021) is the first study of non-convex DRO problems with general ψ-divergence-defined uncertainty sets in large-scale settings. By combining all trainable parameters into z = (x, η), the authors prove that the dual objective in equation 5 is generalized (L0, L1)-smooth in z. A normalized-momentum method is then designed, and an ϵ-stationary point is guaranteed with a computational complexity of $O(\epsilon^{-4})$. In this paper, we share the same setting as Jin et al. (2021) and provide a simpler yet more precise characterization of the dual problem: partially generalized smoothness and partially affine variance noise. We show that a simple SGD-type algorithm finds an ϵ-stationary point with a complexity of $O(\epsilon^{-4})$. Moreover, we design a Spider algorithm that reduces the complexity to $O(\epsilon^{-3})$.

**Generalized (L0, L1)-smoothness.** As Jin et al. (2021) pointed out, the non-convex DRO problem can be solved as a generalized (L0, L1)-smooth optimization problem, which, however, introduces unnecessary complications into the theoretical analysis of the DRO problem studied in this paper.
Nevertheless, we briefly review algorithms and analyses for generalized smooth optimization problems in the literature below. The generalized (L0, L1)-smoothness condition was first introduced in Zhang et al. (2019), where a clipping method is investigated. However, in the stochastic setting, the gradient estimation error is there required to be bounded almost surely. In this paper, our method works for noise with partially affine variance (see Lemma 3), which is a much weaker condition and a more precise characterization for this DRO problem. Modern methods such as normalized momentum (Jin et al., 2021), Adagrad (Wang et al., 2023), and Adam (Li et al., 2023; Zhang et al., 2024b) have been studied for generalized (L0, L1)-smooth problems, and they can also be used to solve the DRO problem in this paper. We show that for the DRO problem, simple SGD reaches the same stationary point with the same gradient complexity. To reduce the complexity, Spider (Fang et al., 2018) has been studied for the generalized (L0, L1)-smooth problem (Chen et al., 2023; Reisizadeh et al., 2023). Based on our precise characterizations of partially generalized smoothness and partially affine variance noise, our proof is much simpler than that of Chen et al. (2023). Moreover, we show that the gradient converges in expectation, which is stronger than the convergence with high probability in Reisizadeh et al. (2023).

## 2 PRELIMINARIES

Denote by $s$ a sample in $S$ and let $P_0$ be the empirical distribution of the $N$ training samples $\{s_i\}_{i=1}^N$. In the large-scale setting studied in this paper, we assume the number of training samples $N$ is extremely large. We use $\|\cdot\|$ to denote the Euclidean norm and $\langle\cdot,\cdot\rangle$ to denote the standard dot product. Define $(a)_+ = \max(a, 0)$. For a set $C$, $1_C$ is the indicator function such that $1_C(a) = 0$ if $a \in C$ and $1_C(a) = +\infty$ otherwise. Let $x \in \mathbb{R}^d$ be the trainable parameter, where $d$ is the dimension. The loss function is defined as $\ell : \mathbb{R}^d \times S \to \mathbb{R}$.
For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, $x$ is an ϵ-stationary point if $\|\nabla_x f(x)\| \le \epsilon$, where $\nabla_x f(x)$ is the gradient of $f$ with respect to $x$. For a random vector $X$, denote by $\mathbb{E}$ the expectation and by $\mathbb{V}$ the sum of its element-wise variances, where $\mathbb{V}(X) := \mathbb{E}[\|X - \mathbb{E}[X]\|^2]$. We further provide some definitions.

Definition 1 (Lipschitz continuous). A function $f : \mathbb{R}^d \to \mathbb{R}$ is $G$-Lipschitz continuous if for any $x, y \in \mathbb{R}^d$, $|f(x) - f(y)| \le G\|x - y\|$, where $G > 0$ is a finite constant.

Definition 2 (Standard L-smooth). A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if for any $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, where $L > 0$ is a finite constant.

These two definitions cover a wide range of problems in the optimization literature. Recently, a generalized (L0, L1)-smoothness condition was proposed (Zhang et al., 2019; Chen et al., 2023; Li et al., 2023), which is strictly weaker than the standard L-smoothness condition.

Definition 3 (Generalized (L0, L1)-smooth). A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is generalized (L0, L1)-smooth if for any $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le (L_0 + L_1\|\nabla f(x)\|)\|x - y\|$, where $L_0, L_1 > 0$ are finite constants.

Note that there are two versions of (L0, L1)-smoothness: one requires the inequality to hold only for $\|x - y\| \le 1/L_1$ (Zhang et al., 2019), and one imposes no such restriction (Chen et al., 2023). In the DRO setting, it can be proved that the inequality holds for the dual objective (Jin et al., 2021) for any $x, y \in \mathbb{R}^d$; thus, in this paper we follow the second definition. In this paper, we focus on a non-convex and smooth loss function ℓ.

Assumption 1. For any sample $s \in S$, the loss function $\ell(x, s)$ is $G$-Lipschitz continuous and $L$-smooth in $x$.

We further make the following assumption on the ψ-divergence.

Assumption 2. The conjugate function $\psi^*$ of $\psi$ is $M$-smooth.

Assumption 2 is satisfied by a wide range of ψ-divergences (see Table 1).
| Divergence | $\psi(t)$ | $\psi^*(t)$ |
|---|---|---|
| χ² | $(t - 1)^2$ | $-1 + \tfrac{1}{4}(t + 2)_+^2$ |
| KL-regularized CVaR | $1_{[0,\alpha^{-1})}(t) + t\log t - t + 1$, $\alpha \in (0, 1)$ | $\min(e^t, \alpha^{-1}(1 + t + \log\alpha)) - 1$ |
| Cressie-Read | $\frac{t^k - kt + k - 1}{k(k-1)}$, $k \in \mathbb{R}$ | $\frac{1}{k}\big(((k-1)t + 1)_+^{k/(k-1)} - 1\big)$ |

Table 1: Commonly used divergences with smooth conjugates.

The goal of this paper is to find an ϵ-stationary point of the penalized DRO problem in equation 3, which is a minimax optimization problem and is usually hard to solve. For ψ-divergence-defined DRO problems, a popular approach is to investigate the dual formulation. By strong duality (Levy et al., 2020; Shapiro, 2017), we have that
$$\Psi(x) = \inf_{\eta\in\mathbb{R}} \hat L(x, \eta) := \lambda\,\mathbb{E}_{S\sim P_0}\,\psi^*\Big(\frac{\ell(x, S) - \eta}{\lambda}\Big) + \eta. \tag{6}$$
Under Assumptions 1 and 2, it can be shown that $\Psi(x)$ is differentiable (Jin et al., 2021). Define $L(x, \eta) = \hat L(x, G\eta)$. Then $\|\nabla_{x,\eta}L(x, \eta)\| \le \epsilon/\sqrt{2}$ implies that $\|\nabla\Psi(x)\| \le \epsilon$ (Jin et al., 2021). Thus, it is equivalent to find an ϵ-stationary point of $L(x, \eta)$.

## 3 MAIN RESULTS

### 3.1 PARTIALLY GENERALIZED SMOOTHNESS

Let $z = (x, \eta)$. The approach in Jin et al. (2021) directly optimizes over $z$, where it was shown that $L(z)$ is generalized (L0, L1)-smooth in $z$ with $L_0 = L + 2G^2\lambda^{-1}M$ and $L_1 = L/G$. In the following, we provide a more precise characterization of $L$ by studying $x$ and $\eta$ separately.

Lemma 1 (Partially generalized (L0, L1, L2)-smoothness). Under Assumptions 1 and 2, $L(x, \eta)$ is (L0, L1)-partially smooth in $x$ and $L_2$-smooth in $\eta$, i.e., for any $x, x' \in \mathbb{R}^d$ and $\eta, \eta' \in \mathbb{R}$ we have that
$$\|\nabla_x L(x, \eta) - \nabla_x L(x', \eta)\| \le \underbrace{(L_0 + L_1|\nabla_\eta L(x, \eta)|)}_{\text{term (a)}}\,\|x - x'\|, \tag{7}$$
$$|\nabla_\eta L(x, \eta) - \nabla_\eta L(x, \eta')| \le L_2\,|\eta - \eta'|, \tag{8}$$
where $L_2 = G^2M/\lambda$, and the explicit forms of $L_0$ and $L_1$ (depending on $G$, $L$, $M$, and $\lambda$) are given in Appendix A.1. The proof is available in Appendix A.1. Observe that $L(x, \eta)$ is smooth in $\eta$ for any $x$. Thus, optimizing over $\eta$ should not be as hard as solving a generalized smooth problem. Moreover, in equation 7, the Lipschitz constant in $x$ (term (a)) is linear in the gradient with respect to $\eta$, $\nabla_\eta L(x, \eta)$, but does not depend on the gradient with respect to $x$, $\nabla_x L(x, \eta)$. Compared with the generalized (L0, L1)-smoothness used in Jin et al. (2021), Lemma 1 provides a more precise characterization of $L$.
Intuitively, due to the smoothness in η, one can expect to quickly find a point with a bounded gradient with respect to η. Consequently, the Lipschitz constant in x also becomes bounded, which circumvents the unbounded-Lipschitz-constant challenge in generalized (L0, L1)-smooth problems and makes the objective easier to optimize.

Remark 1. The partially generalized (L0, L1, L2)-smoothness condition is weaker than the standard L-smoothness condition but stronger than the generalized (L0, L1)-smoothness condition.

### 3.2 DETERMINISTIC SETTING

To warm up, we first consider the deterministic setting. We propose a double gradient descent (D-GD) algorithm, which updates x and η alternately (see Algorithm 1). This is in contrast to the approach in Jin et al. (2021), where x and η are optimized jointly. The key idea is to leverage the standard L-smoothness in η to bound $\nabla_\eta L(x, \eta)$, which reduces equation 7 to a standard smoothness condition, and then to bound $\nabla_x L(x, \eta)$.

Algorithm 1 D-GD
Input: initialization $x_0, \eta_0$, step sizes $\alpha_t, \beta_t$, number of iterations $T$
1: $t \leftarrow 0$
2: while $t \le T - 1$ do
3:   $\eta_{t+1} \leftarrow \eta_t - \alpha_t \nabla_\eta L(x_t, \eta_t)$
4:   $x_{t+1} \leftarrow x_t - \beta_t \nabla_x L(x_t, \eta_{t+1})$
5:   $t \leftarrow t + 1$
6: end while

#### 3.2.1 DOUBLE GRADIENT DESCENT (D-GD)

In this section, we choose constant step sizes $\alpha_t, \beta_t$ in Algorithm 1. In the following theorem, we show that D-GD converges to an ϵ-stationary point with an iteration complexity of $O(\epsilon^{-2})$.

Theorem 1. Let $H = 2L_2(L(x_0, \eta_0) - \inf_{x,\eta} L(x, \eta))$ and $\bar L = L_0 + L_1\sqrt{H}$. Set $\alpha_t = \frac{1}{L_2}$, $\beta_t = \frac{1}{\bar L}$, and $T \ge 8\max(L_2, \bar L)\,\frac{L(x_0,\eta_0) - \inf_{x,\eta} L(x,\eta)}{\epsilon^2}$.
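Algorithm 1 can be sketched in a few lines. The following is our own illustrative implementation (the gradient oracles and the quadratic test function are placeholders, not the DRO dual); it also includes the clipped step-size rule used by the clipped variants (D-GD-C / D-SGD-C) later in the paper:

```python
import numpy as np

def clipped_step(v, L0, eps):
    """Clipped step size beta_t = min(1/(2*L0), eps/(L0*||v||)),
    which caps the length of the update beta_t * v at eps/L0."""
    n = float(np.linalg.norm(v))
    return 1.0 / (2.0 * L0) if n == 0.0 else min(1.0 / (2.0 * L0), eps / (L0 * n))

def d_gd(grad_x, grad_eta, x0, eta0, alpha, beta, T):
    """Algorithm 1 (D-GD): alternate a gradient step on eta, then on x."""
    x, eta = np.asarray(x0, float).copy(), float(eta0)
    for _ in range(T):
        eta = eta - alpha * grad_eta(x, eta)   # line 3: update the dual variable
        x = x - beta * grad_x(x, eta)          # line 4: x-step uses the new eta
    return x, eta

# Toy stand-in objective L(x, eta) = 0.5*||x||^2 + 0.5*(eta - 1)^2
# (a placeholder, not the DRO dual), minimized at x = 0, eta = 1.
gx = lambda x, eta: x
ge = lambda x, eta: eta - 1.0
x, eta = d_gd(gx, ge, np.ones(3), 0.0, alpha=0.5, beta=0.5, T=60)
```

On this toy objective the iterates contract geometrically toward (x, η) = (0, 1); the alternating order mirrors lines 3-4 of Algorithm 1, where the x-update always sees the freshly updated η.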
For Algorithm 1, we then have that $\min_{0\le t<T} \|\nabla_{x,\eta} L(x_t, \eta_{t+1})\| \le \epsilon$.

Proof. For any $t \ge 0$, we have that
$$\begin{aligned}
L(x_{t+1},\eta_{t+1}) \le{}& L(x_t,\eta_{t+1}) - \langle \nabla_x L(x_t,\eta_{t+1}), \beta_t v_t\rangle + \frac{L_0 + L_1|\nabla_\eta L(x_t,\eta_{t+1})|}{2}\beta_t^2\|v_t\|^2\\
\le{}& L(x_t,\eta_{t+1}) - \langle \nabla_x L(x_t,\eta_{t+1}), \beta_t v_t\rangle + \frac{L_0 + L_1|\nabla_\eta L(x_t,\eta_t)| + L_1L_2|\alpha_t g_t|}{2}\beta_t^2\|v_t\|^2\\
\le{}& L(x_t,\eta_{t+1}) - \frac{\beta_t}{2}\big(\|v_t\|^2 + \|\nabla_x L(x_t,\eta_{t+1})\|^2 - \|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2\big)\\
&+ \frac{L_0 + (L_1 + L_1L_2\alpha_t)|\nabla_\eta L(x_t,\eta_t)| + L_1L_2\alpha_t|g_t - \nabla_\eta L(x_t,\eta_t)|}{2}\beta_t^2\|v_t\|^2\\
\le{}& L(x_t,\eta_{t+1}) - \frac{\beta_t}{2}\|v_t\|^2 + \frac{\beta_t}{2}\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2 + \frac{L_0}{2}\beta_t^2\|v_t\|^2\\
&+ \gamma(L_1 + L_1L_2\alpha_t)^2(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t^4}{16\gamma}\|v_t\|^4 + \gamma(L_1L_2\alpha_t)^2(g_t - \nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t^4}{16\gamma}\|v_t\|^4,
\end{aligned}\tag{44}$$
where the second inequality is due to the $L_2$-smoothness in $\eta$, and the last inequality is due to the fact that $2ab \le a^2 + b^2$ for any $a, b \in \mathbb{R}$. Taking the expectation on both sides of equation 44 and adding it to equation 43, it follows that
$$\begin{aligned}
\mathbb{E}\Big[\big(\alpha_t - L_2\alpha_t^2 - \gamma(L_1+L_1L_2\alpha_t)^2\big)(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le{}& \mathbb{E}[L(x_t,\eta_t) - L(x_{t+1},\eta_{t+1})]\\
&+ \mathbb{E}\big[(\alpha_t^2 L_2 + \gamma(L_1L_2\alpha_t)^2)(g_t - \nabla_\eta L(x_t,\eta_t))^2\big]\\
&+ \mathbb{E}\Big[\frac{\beta_t}{2}\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2\Big] + \mathbb{E}\Big[\frac{\beta_t^4}{8\gamma}\|v_t\|^4\Big].
\end{aligned}\tag{45}$$
According to $\alpha_t = \frac{1}{2L_2}$, $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$ and Lemma 3, we can further show that
$$\mathbb{E}\Big[\Big(\frac{1}{4L_2} - \frac{9\gamma L_1^2}{4}\Big)(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le \mathbb{E}[L(x_t,\eta_t) - L(x_{t+1},\eta_{t+1})] + \Big(\frac{1}{4L_2} + \frac{\gamma L_1^2}{4}\Big)\frac{D_2}{N_1} + \frac{1}{4L_0}\,\mathbb{E}\Big[\frac{D_0 + D_1(\nabla_\eta L(x_t,\eta_{t+1}))^2}{N_2}\Big] + \frac{\epsilon^4}{8\gamma L_0^4}. \tag{46}$$
Moreover, we have that
$$\begin{aligned}
\mathbb{E}[(\nabla_\eta L(x_t,\eta_{t+1}))^2] &\le 2\mathbb{E}[(\nabla_\eta L(x_t,\eta_t))^2] + 2\mathbb{E}[(\nabla_\eta L(x_t,\eta_{t+1}) - \nabla_\eta L(x_t,\eta_t))^2]\\
&\le 2\mathbb{E}[(\nabla_\eta L(x_t,\eta_t))^2] + 2L_2^2\alpha_t^2\,\mathbb{E}[(g_t)^2]\\
&\le (2 + 4L_2^2\alpha_t^2)\,\mathbb{E}[(\nabla_\eta L(x_t,\eta_t))^2] + 4L_2^2\alpha_t^2\,\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2]\\
&\le 3\,\mathbb{E}[(\nabla_\eta L(x_t,\eta_t))^2] + \frac{D_2}{N_1}.
\end{aligned}\tag{47}$$
Let $\gamma = \frac{1}{36L_1^2L_2}$ and $N_2 \ge \frac{12D_1L_2}{L_0}$. By equation 46 and equation 47, we then have that
$$\mathbb{E}\Big[\frac{1}{8L_2}(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le \mathbb{E}[L(x_t,\eta_t) - L(x_{t+1},\eta_{t+1})] + \frac{D_2}{2L_2N_1} + \frac{D_0}{4L_0N_2} + \frac{D_1D_2}{4L_0N_1N_2} + \frac{\epsilon^4}{8\gamma L_0^4}. \tag{48}$$
Taking the sum of equation 48 from $t = 0$ to $T - 1$, we have that
$$\sum_{t=0}^{T-1}\mathbb{E}\Big[\frac{1}{8L_2}(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le \mathbb{E}[L(x_0,\eta_0) - L(x_T,\eta_T)] + T\Big(\frac{D_2}{2L_2N_1} + \frac{D_0}{4L_0N_2} + \frac{D_1D_2}{4L_0N_1N_2} + \frac{\epsilon^4}{8\gamma L_0^4}\Big). \tag{49}$$
Let $T \ge 5\max(8L_2, 4L_0)\,\frac{L(x_0,\eta_0) - \inf_{x,\eta}L(x,\eta)}{\epsilon^2}$, $N_1 \ge \max\big(\frac{20D_2}{\epsilon^2}, \frac{10D_2L_0}{\epsilon^2}\big)$, $N_2 \ge \frac{12D_1L_2}{L_0}$, and $\frac{\epsilon^4}{8\gamma L_0^4} \le \frac{\epsilon^2}{5}\min\big(\frac{1}{8L_2}, \frac{1}{4L_0}\big)$.
Then equation 49 can be further bounded as follows:
$$\sum_{t=0}^{T-1}\mathbb{E}\Big[\frac{1}{8L_2}(\nabla_\eta L(x_t,\eta_t))^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le T\epsilon^2\,\min\Big(\frac{1}{8L_2}, \frac{1}{4L_0}\Big).$$
Thus we can find some $t < T$ such that $\mathbb{E}\big[\frac{1}{8L_2}(\nabla_\eta L(x_t,\eta_t))^2\big] \le \frac{\epsilon^2}{8L_2}$ and $\mathbb{E}\big[\frac{\beta_t}{4}\|v_t\|^2\big] \le \frac{\epsilon^2}{4L_0}$. Based on equation 47, we then have that $(\mathbb{E}[|\nabla_\eta L(x_t,\eta_{t+1})|])^2 \le \mathbb{E}[(\nabla_\eta L(x_t,\eta_{t+1}))^2] \le 4\epsilon^2$. Moreover, since $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$, we have that
$$L_0\,\mathbb{E}[\beta_t\|v_t\|^2] = \mathbb{E}\Big[\min\Big(\frac{\|v_t\|^2}{2}, \epsilon\|v_t\|\Big)\Big] \ge \epsilon\,\mathbb{E}[\|v_t\|] - \frac{\epsilon^2}{2},$$
where the second inequality is due to the fact that $\min\big(\frac{a^2}{2}, \epsilon a\big) \ge \epsilon a - \frac{\epsilon^2}{2}$ holds for any $a \ge 0$. As a result, we have that $\mathbb{E}[\|v_t\|] \le 3\epsilon$. For $N_2 \ge \frac{D_0 + 4\epsilon^2 D_1}{\epsilon^2}$, we have that
$$(\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|])^2 \le \mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] \le \mathbb{E}\Big[\frac{D_0 + D_1(\nabla_\eta L(x_t,\eta_{t+1}))^2}{N_2}\Big] \le \epsilon^2.$$
It can be further shown that $\mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1})\|] \le \mathbb{E}[\|v_t\|] + \mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|] \le 4\epsilon$ and $\mathbb{E}[\|\nabla_{x,\eta}L(x_t,\eta_{t+1})\|] \le \mathbb{E}[|\nabla_\eta L(x_t,\eta_{t+1})|] + \mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1})\|] \le 6\epsilon$. This completes the proof.

### A.7 PROOF OF LEMMA 4

Proof. For any $x, x' \in \mathbb{R}^d$, $\eta \in \mathbb{R}$ and $s \in S$, we have that
$$|\nabla_\eta L(x,\eta,s) - \nabla_\eta L(x',\eta,s)| = G\,\Big|(\psi^*)'\Big(\frac{\ell(x,s) - G\eta}{\lambda}\Big) - (\psi^*)'\Big(\frac{\ell(x',s) - G\eta}{\lambda}\Big)\Big| \le \frac{G^2M}{\lambda}\|x - x'\|, \tag{52}$$
where the inequality is due to the fact that $\psi^*$ is $M$-smooth and $\ell(x,s)$ is $G$-Lipschitz continuous in $x$. Similarly, for any $x \in \mathbb{R}^d$, $\eta, \eta' \in \mathbb{R}$ and $s \in S$, we have that
$$\begin{aligned}
\|\nabla_x L(x,\eta,s) - \nabla_x L(x,\eta',s)\| &= \Big\|(\psi^*)'\Big(\frac{\ell(x,s) - G\eta}{\lambda}\Big)\nabla\ell(x,s) - (\psi^*)'\Big(\frac{\ell(x,s) - G\eta'}{\lambda}\Big)\nabla\ell(x,s)\Big\|\\
&\le G\,\Big|(\psi^*)'\Big(\frac{\ell(x,s) - G\eta}{\lambda}\Big) - (\psi^*)'\Big(\frac{\ell(x,s) - G\eta'}{\lambda}\Big)\Big| \le \frac{G^2M}{\lambda}|\eta - \eta'|,
\end{aligned}\tag{53}$$
where the first inequality is because $\ell(x,s)$ is $G$-Lipschitz continuous in $x$ and the last one is due to the $M$-smoothness of $\psi^*$.

### A.8 FULL VERSION AND PROOF OF THEOREM 4

Let $c_0 = \max(32L_2, 8L_0)$, $c_1 = 4 + \frac{8L_1^2D_2}{L_0^2} + \frac{32L_1^2D_2}{N_1L_0^2} + \frac{16L_1^2L_2}{5D_1L_0^3}$, $c_2 = \max\big(\frac{1}{8L_2} + L_1\big)$, $c_3 = 1 + \frac{L_2}{40L_0} + \frac{L_0D_1 + L_0 + 2L_0L_2D_2}{L_2} + \frac{33L_1^2}{5L_0L_2} + \frac{L_0D_1}{15L_2^2} + \frac{L_1^2}{2L_0L_2^2}$, and $c_4 = \frac{17}{4} + c_3$. For $N_1 \ge \frac{6D_2c_0c_2}{\epsilon^2}$, $N_2 \ge \max\big(\frac{20qD_1L_2}{L_0}, 20qc_2L_2, \frac{12qL_2^2c_0c_2}{L_0^2}, q\big)$, $N_3 \ge \max\big(\frac{200D_1L_2}{L_0}, \frac{3c_0(D_0 + 4D_1D_2)}{2L_0\epsilon^2}\big)$, $N_4 \ge \max\big(\frac{5qL_2}{L_0}, 6qc_1c_0\big)$, and $\epsilon \le \min\big(\frac{L_0^3N_4}{1480L_1^2L_2q}, \frac{L_0^4}{42L_1c_0}, 1\big)$, we have the following theorem:

Theorem 6. Let $\alpha = \frac{1}{4L_2}$, $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$ for Algorithm 3.
Setting $T \ge \frac{6c_0(L(x_0,\eta_0) - \inf_{x,\eta}L(x,\eta))}{\epsilon^2}$, we have that $\min_{0\le t<T}\mathbb{E}[\|\nabla_{x,\eta}L(x_t,\eta_{t+1})\|] \le c_4\epsilon$.

Proof. For any $t \ge 0$, we have that
$$\begin{aligned}
L(x_{t+1},\eta_{t+1}) \le{}& L(x_t,\eta_{t+1}) - \langle\nabla_x L(x_t,\eta_{t+1}), \beta_t v_t\rangle + \frac{L_0 + L_1|\nabla_\eta L(x_t,\eta_t)| + L_1L_2|\alpha_t g_t|}{2}\beta_t^2\|v_t\|^2\\
\le{}& L(x_t,\eta_{t+1}) - \frac{\beta_t}{2}\big(\|v_t\|^2 + \|\nabla_x L(x_t,\eta_{t+1})\|^2 - \|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2\big)\\
&+ \frac{L_0 + L_1|\nabla_\eta L(x_t,\eta_t) - g_t| + (L_1 + L_1L_2\alpha_t)|g_t|}{2}\beta_t^2\|v_t\|^2\\
\le{}& L(x_t,\eta_{t+1}) - \frac{\beta_t}{2}\|v_t\|^2 + \frac{\beta_t}{2}\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2 + \frac{L_0}{2}\beta_t^2\|v_t\|^2\\
&+ \gamma(L_1 + L_1L_2\alpha_t)^2(g_t)^2 + \frac{\beta_t^4}{16\gamma}\|v_t\|^4 + \gamma L_1^2|g_t - \nabla_\eta L(x_t,\eta_t)|^2 + \frac{\beta_t^4}{16\gamma}\|v_t\|^4.
\end{aligned}\tag{55}$$
Combining equation 54 and equation 55, it follows that
$$\begin{aligned}
\mathbb{E}\Big[\Big(\frac{\alpha_t}{2} - \frac{L_2\alpha_t^2}{2} - \gamma(L_1 + L_1L_2\alpha_t)^2\Big)(g_t)^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le{}& \mathbb{E}\Big[L(x_t,\eta_t) - L(x_{t+1},\eta_{t+1}) + \frac{\beta_t^4}{8\gamma}\|v_t\|^4 + \frac{\beta_t}{2}\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2\\
&+ \frac{\alpha_t}{2}(g_t - \nabla_\eta L(x_t,\eta_t))^2 + \gamma L_1^2(g_t - \nabla_\eta L(x_t,\eta_t))^2\Big].
\end{aligned}$$
Setting $\alpha = \frac{1}{4L_2}$ and $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$, we have that
$$\mathbb{E}\Big[\Big(\frac{3}{32L_2} - \frac{25\gamma L_1^2}{16}\Big)(g_t)^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le \mathbb{E}[L(x_t,\eta_t) - L(x_{t+1},\eta_{t+1})] + \frac{\epsilon^4}{8\gamma L_0^4} + \frac{1}{4L_0}\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] + \Big(\frac{1}{8L_2} + \gamma L_1^2\Big)\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2]. \tag{56}$$
Let $\gamma = \frac{1}{50L_1^2L_2}$ and take the sum of equation 56 from $t = 0$ to $T - 1$. According to Lemma 5, we then have that
$$\begin{aligned}
\sum_{t=0}^{T-1}\mathbb{E}\Big[\frac{1}{16L_2}(g_t)^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le{}& \mathbb{E}[L(x_0,\eta_0) - L(x_T,\eta_T)] + \frac{7\epsilon^4L_1T}{4}\\
&+ \frac{T}{q}\Bigg[\Big(4 + \frac{8L_1^2D_2}{L_0^2} + \frac{32L_1^2D_2}{N_1L_0^2} + \frac{64q\epsilon^2L_1^2L_2^2}{N_2L_0^4}\Big)\frac{q(D_0 + 4D_1D_2)}{8N_3} + \frac{qD_2}{N_1} + \frac{2q^2L_2^2}{N_2L_0^2}\epsilon^2\Bigg]\\
&+ \Big(\frac{33q\epsilon^2L_1^2}{N_4L_0^2} + \frac{q}{8N_4} + \frac{qD_1}{2N_2N_3} + \frac{4q^2\epsilon^2L_1^2}{N_2N_4L_0^2} + \frac{q}{8N_2}\Big)\sum_{t=0}^{T-1}\mathbb{E}[(g_t)^2].
\end{aligned}\tag{57}$$
With the choices of $c_0$, $c_1$, $c_2$ and $N_1, N_2, N_3, N_4$ stated above, we then have that
$$\sum_{t=0}^{T-1}\mathbb{E}\Big[\frac{1}{32L_2}(g_t)^2 + \frac{\beta_t}{4}\|v_t\|^2\Big] \le \mathbb{E}[L(x_0,\eta_0) - L(x_T,\eta_T)] + \frac{5\epsilon^2T}{6c_0}. \tag{58}$$
For $T = nq \ge \frac{6c_0(L(x_0,\eta_0) - \inf_{x,\eta}L(x,\eta))}{\epsilon^2}$, we can find some $t_0$ with $t_0 \bmod q = 0$ such that
$$\sum_{t'=t_0}^{t_0+q-1}\mathbb{E}\Big[\frac{1}{32L_2}(g_{t'})^2 + \frac{\beta_{t'}}{4}\|v_{t'}\|^2\Big] \le \frac{q\epsilon^2}{32L_2}.$$
(59)

Moreover, we can find some $t \in [t_0, t_0 + q - 1)$ such that
$$\frac{1}{32L_2}\mathbb{E}[(g_t)^2] + \mathbb{E}\Big[\frac{\beta_t}{4}\|v_t\|^2\Big] \le \frac{\epsilon^2}{32L_2}. \tag{60}$$
Based on equation 60 and $c_0 = \max(32L_2, 8L_0)$, we have that $\mathbb{E}[(g_t)^2] \le \epsilon^2$. Based on equation 67 and equation 68, we can further show that
$$\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2] \le \frac{D_2}{N_1} + \sum_{t'=t_0+1}^{t}\mathbb{E}\Big[\frac{2L_2^2}{N_2L_0^2}\epsilon^2 + \frac{2L_2^2}{N_2}\alpha_{t'-1}^2(g_{t'-1})^2\Big] \le \frac{\epsilon^2}{60L_2}. \tag{61}$$
Thus we have that $\mathbb{E}[|\nabla_\eta L(x_t,\eta_{t+1})|] \le \mathbb{E}[|g_t - \nabla_\eta L(x_t,\eta_t)| + |g_t| + L_2\alpha_t|g_t|] \le \frac{5\epsilon}{4}$. Moreover, since $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$, we have that
$$L_0\,\mathbb{E}[\beta_t\|v_t\|^2] = \mathbb{E}\Big[\min\Big(\frac{\|v_t\|^2}{2}, \epsilon\|v_t\|\Big)\Big] \ge \epsilon\,\mathbb{E}[\|v_t\|] - \frac{\epsilon^2}{2},$$
where the second inequality is due to the fact that $\min\big(\frac{a^2}{2}, \epsilon a\big) \ge \epsilon a - \frac{\epsilon^2}{2}$ holds for any $a \ge 0$. As a result, we have that $\mathbb{E}[\|v_t\|] \le 3\epsilon$. Based on equation 71 and equation 72, we can further show that
$$\begin{aligned}
\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] \le{}& \mathbb{E}\Big[\frac{D_0 + 4D_1(g_{t_0})^2 + 2D_1L_2^2\alpha_{t_0}^2(g_{t_0})^2}{N_3} + \frac{4D_1(g_{t_0} - \nabla_\eta L(x_{t_0},\eta_{t_0}))^2}{N_3}\Big]\\
&+ \sum_{t'=t_0+1}^{t}\mathbb{E}\Big[\frac{2}{N_4}\Big(\frac{\epsilon^2}{2} + \frac{8L_1^2L_2^2}{L_0^2}\alpha_{t'-1}^2(g_{t'-1})^2 + \frac{4L_1^2}{L_0^2}D_2\Big) + \frac{2L_2^2\alpha_{t'}^2}{N_4}(g_{t'})^2\Big]\\
&+ \sum_{t'=t_0+1}^{t}\mathbb{E}\Big[\frac{16L_1^2}{N_4L_0^2}(g_{t'-1})^2 + \frac{16L_1^2}{N_4L_0^2}(\nabla_\eta L(x_{t'-1},\eta_{t'-1}) - g_{t'-1})^2\Big]\\
\le{}& \Big(1 + \frac{L_2}{40L_0} + \frac{L_0D_1 + L_0 + 2L_0L_2D_2}{L_2} + \frac{33L_1^2}{5L_0L_2} + \frac{L_0D_1}{15L_2^2} + \frac{L_1^2}{2L_0L_2^2}\Big)^2\epsilon^2 = c_3^2\,\epsilon^2.
\end{aligned}$$
Thus we have that $\mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1})\|] \le \mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\| + \|v_t\|] \le (3 + c_3)\epsilon$. It then follows that $\mathbb{E}[\|\nabla_{x,\eta}L(x_t,\eta_{t+1})\|] \le \mathbb{E}[|\nabla_\eta L(x_t,\eta_{t+1})|] + \mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1})\|] \le c_4\epsilon$. This completes the proof.

### A.9 FULL VERSION AND PROOF OF LEMMA 5

Lemma 6. With the parameters selected in Theorem 4, for each $t_0 < T$ divisible by $q$, we have that
$$\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2] \le \frac{qD_2}{N_1} + \frac{2q^2L_2^2}{N_2L_0^2}\epsilon^2 + \frac{2qL_2^2}{N_2}\sum_{t=t_0}^{t_0+q-1}\alpha_t^2\,\mathbb{E}[(g_t)^2] \tag{64}$$
and
$$\begin{aligned}
\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] \le{}& \Big(4 + \frac{8L_1^2D_2}{L_0^2} + \frac{32L_1^2D_2}{N_1L_0^2} + \frac{64q\epsilon^2L_1^2L_2^2}{N_2L_0^4}\Big)\frac{q(D_0 + 4D_1D_2)}{8N_3}\\
&+ \Big(\frac{33q\epsilon^2L_1^2}{N_4L_0^2} + \frac{q}{8N_4} + \frac{qD_1}{2N_2N_3} + \frac{4q^2\epsilon^2L_1^2}{N_2N_4L_0^2}\Big)\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t)^2].
\end{aligned}\tag{65}$$
Proof.
If $t \bmod q = 0$, $g_t$ is an unbiased estimate of $\nabla_\eta L(x_t,\eta_t)$ and, according to Lemma 3, we have that
$$\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2] \le \frac{D_2}{N_1}. \tag{66}$$
Otherwise, we have that
$$\begin{aligned}
\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2] &= \mathbb{E}[(\nabla_\eta L(x_t,\eta_t,B_2) - \nabla_\eta L(x_{t-1},\eta_{t-1},B_2) + g_{t-1} - \nabla_\eta L(x_t,\eta_t))^2]\\
&= \mathbb{E}[(\nabla_\eta L(x_t,\eta_t,B_2) - \nabla_\eta L(x_{t-1},\eta_{t-1},B_2) + \nabla_\eta L(x_{t-1},\eta_{t-1}) - \nabla_\eta L(x_t,\eta_t))^2] + \mathbb{E}[(g_{t-1} - \nabla_\eta L(x_{t-1},\eta_{t-1}))^2],
\end{aligned}\tag{67}$$
where the last equality is due to the fact that $\nabla_\eta L(x_t,\eta_t,B_2) - \nabla_\eta L(x_{t-1},\eta_{t-1},B_2)$ is an unbiased estimate of $\nabla_\eta L(x_t,\eta_t) - \nabla_\eta L(x_{t-1},\eta_{t-1})$. We now focus on the first term of equation 67, which can be further bounded as follows:
$$\begin{aligned}
&\mathbb{E}[(\nabla_\eta L(x_t,\eta_t,B_2) - \nabla_\eta L(x_{t-1},\eta_{t-1},B_2) + \nabla_\eta L(x_{t-1},\eta_{t-1}) - \nabla_\eta L(x_t,\eta_t))^2]\\
&\le \frac{1}{N_2}\,\mathbb{E}[(\nabla_\eta L(x_t,\eta_t,S) - \nabla_\eta L(x_{t-1},\eta_{t-1},S))^2]\\
&\le \frac{2}{N_2}\,\mathbb{E}[(\nabla_\eta L(x_t,\eta_t,S) - \nabla_\eta L(x_{t-1},\eta_t,S))^2 + (\nabla_\eta L(x_{t-1},\eta_t,S) - \nabla_\eta L(x_{t-1},\eta_{t-1},S))^2]\\
&\le \frac{2}{N_2}\,L_2^2\big(\alpha_{t-1}^2(g_{t-1})^2 + \beta_{t-1}^2\|v_{t-1}\|^2\big)\\
&\le \mathbb{E}\Big[\frac{2L_2^2}{N_2L_0^2}\epsilon^2 + \frac{2L_2^2}{N_2}\alpha_{t-1}^2(g_{t-1})^2\Big],
\end{aligned}\tag{68}$$
where the first inequality is due to the fact that the square of an expectation is not larger than the expectation of the square, the third inequality is due to the continuity properties shown in Lemma 4, and the last inequality is due to the fact that $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$. Combining equation 66, equation 67, and equation 68, for $t_0 \bmod q = 0$, we have that
$$\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2] \le \frac{qD_2}{N_1} + \frac{2q^2L_2^2}{N_2L_0^2}\epsilon^2 + \frac{q}{8N_2}\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t)^2]. \tag{69}$$
We then focus on the estimate of the gradient with respect to $x$. If $t \bmod q = 0$, we have that
$$\begin{aligned}
\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] &\le \mathbb{E}\Big[\frac{D_0 + D_1(\nabla_\eta L(x_t,\eta_{t+1}))^2}{N_3}\Big]\\
&\le \mathbb{E}\Big[\frac{D_0 + 2D_1(\nabla_\eta L(x_t,\eta_t))^2 + 2D_1L_2^2\alpha_t^2(g_t)^2}{N_3}\Big]\\
&\le \mathbb{E}\Big[\frac{D_0 + 4D_1(g_t)^2 + 2D_1L_2^2\alpha_t^2(g_t)^2}{N_3} + \frac{4D_1(g_t - \nabla_\eta L(x_t,\eta_t))^2}{N_3}\Big],
\end{aligned}\tag{70}$$
where the second inequality is due to the $L_2$-smoothness in $\eta$ and the update of $\eta$. Otherwise, we have that
$$\begin{aligned}
\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] &= \mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1},B_4) - \nabla_x L(x_{t-1},\eta_t,B_4) + v_{t-1} - \nabla_x L(x_t,\eta_{t+1})\|^2]\\
&= \mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1},B_4) - \nabla_x L(x_{t-1},\eta_t,B_4) + \nabla_x L(x_{t-1},\eta_t) - \nabla_x L(x_t,\eta_{t+1})\|^2] + \mathbb{E}[\|v_{t-1} - \nabla_x L(x_{t-1},\eta_t)\|^2],
\end{aligned}\tag{71}$$
since $\nabla_x L(x_t,\eta_{t+1},B_4) - \nabla_x L(x_{t-1},\eta_t,B_4)$ is an unbiased estimate of $\nabla_x L(x_t,\eta_{t+1}) - \nabla_x L(x_{t-1},\eta_t)$.
We now focus on the first term of equation 71, which can be further bounded as follows:
$$\begin{aligned}
&\mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1},B_4) - \nabla_x L(x_{t-1},\eta_t,B_4) + \nabla_x L(x_{t-1},\eta_t) - \nabla_x L(x_t,\eta_{t+1})\|^2]\\
&\le \frac{1}{N_4}\,\mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1},S) - \nabla_x L(x_{t-1},\eta_t,S)\|^2]\\
&\le \frac{2}{N_4}\,\mathbb{E}[\|\nabla_x L(x_t,\eta_{t+1},S) - \nabla_x L(x_t,\eta_t,S)\|^2 + \|\nabla_x L(x_t,\eta_t,S) - \nabla_x L(x_{t-1},\eta_t,S)\|^2]\\
&\le \frac{2}{N_4}\,\mathbb{E}\big[L_2^2\alpha_t^2(g_t)^2 + (2L_0^2 + 2L_1^2|\nabla_\eta L(x_{t-1},\eta_t,S)|^2)\beta_{t-1}^2\|v_{t-1}\|^2\big]\\
&\le \frac{2}{N_4}\,\mathbb{E}\Big[\frac{\epsilon^2}{2} + \frac{4L_1^2\epsilon^2}{L_0^2}(\nabla_\eta L(x_{t-1},\eta_t))^2 + \frac{4L_1^2\epsilon^2}{L_0^2}D_2 + L_2^2\alpha_t^2(g_t)^2\Big]\\
&\le \frac{2}{N_4}\,\mathbb{E}\Big[\frac{\epsilon^2}{2} + \frac{8L_1^2\epsilon^2}{L_0^2}(\nabla_\eta L(x_{t-1},\eta_{t-1}))^2 + \frac{8L_1^2L_2^2\epsilon^2}{L_0^2}\alpha_{t-1}^2(g_{t-1})^2 + \frac{4L_1^2\epsilon^2}{L_0^2}D_2 + L_2^2\alpha_t^2(g_t)^2\Big]\\
&\le \frac{2}{N_4}\,\mathbb{E}\Big[\frac{\epsilon^2}{2} + \frac{16L_1^2\epsilon^2}{L_0^2}(g_{t-1})^2 + \frac{16L_1^2\epsilon^2}{L_0^2}(\nabla_\eta L(x_{t-1},\eta_{t-1}) - g_{t-1})^2 + \frac{8L_1^2L_2^2\epsilon^2}{L_0^2}\alpha_{t-1}^2(g_{t-1})^2 + \frac{4L_1^2\epsilon^2}{L_0^2}D_2 + L_2^2\alpha_t^2(g_t)^2\Big],
\end{aligned}\tag{72}$$
where the third inequality is due to Lemma 4, and the fourth inequality is due to $\beta_t = \min\big(\frac{1}{2L_0}, \frac{\epsilon}{L_0\|v_t\|}\big)$ and Lemma 3. Combining equation 70, equation 71 and equation 72, for $t_0 \bmod q = 0$, we have that
$$\begin{aligned}
\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[\|v_t - \nabla_x L(x_t,\eta_{t+1})\|^2] \le{}& \Big(4 + \frac{8L_1^2D_2}{L_0^2}\Big)\frac{q(D_0 + 4D_1D_2)}{8N_3} + \Big(\frac{33q\epsilon^2L_1^2}{N_4L_0^2} + \frac{q}{8N_4}\Big)\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t)^2]\\
&+ \Big(\frac{4D_1}{N_3} + \frac{32q\epsilon^2L_1^2}{N_4L_0^2}\Big)\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t - \nabla_\eta L(x_t,\eta_t))^2]\\
\le{}& \Big(4 + \frac{8L_1^2D_2}{L_0^2} + \frac{32L_1^2D_2}{N_1L_0^2} + \frac{64q\epsilon^2L_1^2L_2^2}{N_2L_0^4}\Big)\frac{q(D_0 + 4D_1D_2)}{8N_3}\\
&+ \Big(\frac{33q\epsilon^2L_1^2}{N_4L_0^2} + \frac{q}{8N_4} + \frac{qD_1}{2N_2N_3} + \frac{4q^2\epsilon^2L_1^2}{N_2N_4L_0^2}\Big)\sum_{t=t_0}^{t_0+q-1}\mathbb{E}[(g_t)^2],
\end{aligned}\tag{73}$$
where the second inequality applies equation 69. This completes the proof.

### A.10 DETAILS OF THE EXPERIMENT

#### A.10.1 LIFE EXPECTANCY DATA

This life-expectancy dataset includes health factors from 193 countries (input features) and life expectancy (target), collected from the World Health Organization and United Nations websites. We follow the same data pre-processing as in Chen et al. (2023): missing entries are filled with medians, all features are censorized and standardized², redundant features are removed, and standard Gaussian noise is added to the target for model robustness. Each element of the initial parameter $x_0$ is generated from a standard Gaussian distribution. In the deterministic setting, we compare the methods with fine-tuned learning rates.
The number of iterations is set to 50. For existing methods, we follow the fine-tuned learning rates in Chen et al. (2023), where the step size is $\beta_t = 10^{-4}$ for GD, $\beta_t = 0.2$ for normalized GD, and $\beta_t = 0.3\min\big(\frac{1}{10}, \frac{1}{\|\nabla_{x,\eta}L(x_t,\eta_t)\|}\big)$. For our D-GD method, we set $\alpha_t = \beta_t = 10^{-4}$, and for our D-GD-C method, we set $\alpha_t = 10^{-4}$ and $\beta_t = 0.35\min\big(\frac{1}{2000}, \frac{1}{\|\nabla_x L(x_t,\eta_{t+1})\|}\big)$. In the stochastic setting, we run the experiments for 5000 iterations. We generate the initial $x_0, \eta_0$ by running normalized GD with step size $\beta_t = 0.2$ for 30 iterations. For existing methods, we follow the same settings as in Chen et al. (2023). We set the mini-batch size to 50. For SGD, the step size is $\beta_t = 2\times10^{-4}$. For normalized SGD with momentum, the momentum coefficient is set to $10^{-4}$ and the step size to $8\times10^{-3}$. For normalized SPIDER, the step size is $\beta_t = 4\times10^{-3}$ and the epoch size is $q = 20$. For our D-SGD-C, we set $\alpha_t = 8\times10^{-5}$ and $\beta_t = 0.05\min\big(\frac{1}{100}, \frac{1}{\|v_t\|}\big)$. For our D-SPIDER-C, we set $\alpha_t = 8\times10^{-5}$ and $\beta_t = 7.5\times10^{-3}\min\big(2.5, \frac{1}{\|v_t\|}\big)$.

#### A.10.2 CIFAR-10 DATASET

In this part, we conduct experiments on the well-known CIFAR-10 dataset (Alex, 2009), which includes 50000 training samples. We employ the DRO model to construct a linear classifier. After extracting features from a pre-trained ResNet-50 model (He et al., 2016), the $i$-th sample in our dataset has input $z_i \in \mathbb{R}^{2049}$ and output $y_i \in [10]$. The non-convex loss function is set as
$$\ell(x, (z_i, y_i)) = \ln\Big(\sum_{j=1}^{10}\exp(x_j^\top z_i - x_{y_i}^\top z_i)\Big) + 0.001\sum_{j=1}^{10}\sum_{k=1}^{2049}\ln(1 + |x_{j,k}|),$$
where $x \in \mathbb{R}^{10\times2049}$ is the trainable parameter and $x_j$ is its $j$-th row. For the DRO model, $\lambda$ is set to 0.05 and the initial value $\eta_0$ is set to 0.1. Each element of the initial parameter $x_0$ is generated from a standard Gaussian distribution. In Figure 3, we provide the training curves with fine-tuned learning rates for SGD, Normalized SPIDER (Chen et al., 2023), Normalized SGD with momentum (Jin et al., 2021), and our proposed D-SGD-C and D-SPIDER-C methods.
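The per-sample loss defined above can be sketched directly; the following is our own illustration (a standard log-sum-exp stabilization is added, which the paper does not spell out):

```python
import numpy as np

def cifar_dro_loss(x, z, y, reg=0.001):
    """ln(sum_j exp(x_j.z - x_y.z)) + reg * sum_{j,k} ln(1 + |x_jk|),
    with x of shape (10, 2049), z of shape (2049,), label y in 0..9."""
    logits = x @ z                        # per-class scores x_j . z
    margins = logits - logits[y]          # x_j.z - x_y.z
    m = margins.max()                     # log-sum-exp stabilization
    lse = m + np.log(np.sum(np.exp(margins - m)))
    penalty = reg * np.sum(np.log1p(np.abs(x)))   # smooth non-convex regularizer
    return float(lse + penalty)
```

At x = 0 all margins vanish and the regularizer is zero, so the loss reduces to ln(10); the log(1 + |·|) penalty keeps the objective smooth while making it non-convex.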
The x-axis is the training iteration and the y-axis is the DRO dual function value. From Figure 3, we observe that Normalized-SPIDER and our D-SPIDER-C have similar performance and converge faster than the other methods. Our D-SGD-C performs similarly to the Normalized-SGD with momentum method, while our D-SPIDER-C outperforms it. We set the mini-batch size to 200. For SGD, the step size is $\beta_t = 1.5\times10^{-3}$. For the normalized SGD with momentum method, the momentum coefficient is set to 0.1 and the step size to $4\times10^{-3}$. For the normalized SPIDER method, the step size is $\beta_t = 3.5\times10^{-3}$, the epoch size is $q = 10$, and the large batch size is $2\times10^{3}$. For our D-SGD-C, we set $\alpha_t = 1\times10^{-3}$ and $\beta_t = \min\big(\frac{2}{1000}, \frac{1}{10\|v_t\|}\big)$. For our D-SPIDER-C, we set $\alpha_t = 1\times10^{-3}$ and $\beta_t = \min\big(\frac{5}{2000}, \frac{1}{20\|v_t\|}\big)$.

²https://thecleverprogrammer.com/2021/01/06/life-expectancy-analysis-with-python/

Figure 3: Numerical results on the CIFAR-10 dataset: training loss over epochs for SGD, Normalized-SGDm, D-SGD-C, Normalized-SPIDER, and D-SPIDER-C.

### A.11 RESULTS FOR NORMALIZED MOMENTUM

Algorithm 4 D-SGD-M
Input: initialization $x_0, \eta_0$, step sizes $\alpha, \beta$, momentum parameters $r_1, r_2$, number of iterations $T$
1: $t \leftarrow 1$
2: while $t \le T$ do
3:   Draw one sample $s$ from $P_0$ and compute $g_{t-1} \leftarrow \nabla_\eta L(x_{t-1}, \eta_{t-1}, s)$
4:   if $t == 1$ then
5:     Draw one sample $s$ from $P_0$ and set $m_0 \leftarrow \nabla_\eta L(x_{t-1}, \eta_{t-1}, s)$
6:   end if
7:   $m_t \leftarrow r_1 m_{t-1} + (1 - r_1) g_{t-1}$
8:   $\eta_t \leftarrow \eta_{t-1} - \alpha\, m_t/|m_t|$
9:   Draw one sample $s$ from $P_0$ and compute $v_{t-1} \leftarrow \nabla_x L(x_{t-1}, \eta_t, s)$
10:  if $t == 1$ then
11:    Draw one sample $s$ from $P_0$ and set $w_0 \leftarrow \nabla_x L(x_{t-1}, \eta_t, s)$
12:  end if
13:  $w_t \leftarrow r_2 w_{t-1} + (1 - r_2) v_{t-1}$
14:  $x_t \leftarrow x_{t-1} - \beta\, w_t/\|w_t\|$
15:  $t \leftarrow t + 1$
16: end while

In this section, we provide results for the normalized momentum method, shown in Algorithm 4. The following proposition establishes the convergence and complexity of Algorithm 4.

Proposition 1.
Set $1 - r_1, 1 - r_2 = O(\epsilon^2)$, $\beta = O(\epsilon^3)$, $\alpha = O(\sqrt{D_1}\,\beta)$, and $T = O(\epsilon^{-4})$ for Algorithm 4. We then have that $\min_{0\le t<T}\mathbb{E}[\|\nabla_{x,\eta}L(x_t, \eta_t)\|] = O(\epsilon)$.
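One iteration of Algorithm 4 can be sketched as follows. This is our own illustration; the arguments `g` and `v` stand for the single-sample stochastic gradients drawn in lines 3 and 9, and we assume the momentum buffers are nonzero so the normalizations are well defined:

```python
import numpy as np

def d_sgd_m_step(x, eta, m, w, g, v, alpha, beta, r1, r2):
    """One step of D-SGD-M: momentum updates (lines 7, 13)
    followed by normalized parameter updates (lines 8, 14)."""
    m = r1 * m + (1.0 - r1) * g                       # line 7: eta-momentum
    eta = eta - alpha * m / abs(m)                    # line 8: normalized eta step
    w = r2 * w + (1.0 - r2) * np.asarray(v, float)    # line 13: x-momentum
    x = np.asarray(x, float) - beta * w / np.linalg.norm(w)  # line 14
    return x, eta, m, w

# One illustrative step with constant gradient estimates.
x, eta, m, w = d_sgd_m_step(np.array([1.0]), 0.0, m=1.0, w=np.array([2.0]),
                            g=1.0, v=np.array([2.0]),
                            alpha=0.1, beta=0.1, r1=0.9, r2=0.9)
```

Because the updates are normalized, every step moves η by exactly α and x by exactly β regardless of the gradient magnitude, which is what makes the small step sizes in Proposition 1 effective.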