# revisiting_largescale_nonconvex_distributionally_robust_optimization__2983a49d.pdf

Published as a conference paper at ICLR 2025

REVISITING LARGE-SCALE NON-CONVEX DISTRIBUTIONALLY ROBUST OPTIMIZATION

Qi Zhang1, Yi Zhou2, Simon Khan3, Ashley Prater-Bennette3, Lixin Shen4 & Shaofeng Zou1

School of Electrical, Computer and Energy Engineering, Arizona State University1

Department of Computer Science and Engineering, Texas A&M University2

Information Directorate, Air Force Research Laboratory3

Department of Mathematics, Syracuse University4 {qzhan261,zou}@asu.edu, yi.zhou@tamu.edu, lshen03@syr.edu, {simon.khan,ashley.prater-bennette}@us.af.mil

Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that its dual problem is generalized (L0, L1)- smooth condition and gradient noise satisfies the affine variance condition, designed an algorithm of mini-batch normalized gradient descent with momentum, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness condition and partially affine variance condition by studying the optimization variable and dual variable separately, which further yields much simpler convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an ϵ-stationary point with O(ϵ 4) gradient complexity, which matches with results in Jin et al. (2021). Our proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise. We further design a variance-reduced method that achieves a lower gradient complexity of O(ϵ 3). Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).

1 INTRODUCTION

Empirical risk minimization (ERM) minimizes the expected loss under the empirical distribution P0 of the training dataset with the goal of achieving a good performance on a test dataset. Though this approach yields good performance in most cases, it often times fails due to a mismatch between training and test data distributions, e.g., domain difference from training to testing in domain adaptation problems (Blitzer et al., 2006; Daume III & Marcu, 2006), imbalanced classes in the training dataset (Sagawa et al., 2019) where performance of underrepresented minority groups is important due to fairness considerations (Hashimoto et al., 2018; Grother et al., 2011); and potential adversarial attacks to the deployed model (Goodfellow et al., 2014; Sinha et al., 2017; Madry et al., 2017). For models trained using ERM, such distribution shifts will lead to significant performance degradation on test datasets.

To deal with the above challenge of potential distribution shift, distributionally robust optimization (DRO) was developed and has been widely studied in recent years (Ben-Tal et al., 2013; Shapiro, 2017; Rahimian & Mehrotra, 2019). Instead of merely optimizing the expectation of the loss function under a fixed distribution, DRO optimizes the performance over a set of probability distributions, aiming at good model performance under potential distribution shifts. Specifically, DRO assumes the test distribution lies in an uncertainty set centered in the empirical distribution P0 of the

Published as a conference paper at ICLR 2025

training dataset. Typically, the uncertainty set U(P0) is defined as follows:

U(P0) = {Q : D(Q||P0) ρ}, (1)

where D measures the distance between Q and P0, e.g., Kullback-Leibler (KL) divergence, χ2 divergence or Wasserstein distance, and ρ is the radius of this uncertainty set. Then the goal of DRO is to optimize the expectation of the loss function under the worst-case distribution within the uncertainty set U(P0) (Rahimian & Mehrotra, 2019; Shapiro, 2017):

inf x sup Q U(P0) ES Q [ℓ(x, S)] , (2)

where x is the trainable model parameter, ℓ(x, S) is the loss function for the model with parameter x and sample S. The formulation in equation 2 requires the optimized distribution Q strictly to be inside the uncertainty set U(P0), which is relatively hard to solve. In practice, it is usually preferred to use a soft penalty term, resulting in the following penalized DRO problem (Levy et al., 2020; Jin et al., 2021; Qi et al., 2021; Sinha et al., 2017):

inf x Ψ(x) := sup Q ES Q [ℓ(x, S)] λD(Q P0). (3)

This removes the hard constraint on Q and controls the distance between the optimized distribution Q and training distribution P0 by a regularizer. The hyperparameter λ is pre-selected and fixed during the training.

The DRO problems with different types of uncertainty sets, i.e., D s, are fundamentally different. In this paper, we focus on a general class of distance: ψ-divergence distance, which includes e.g., χ2-divergence and Cressie-Read family divergence (Cressie & Read, 1984; Van Erven & Harremos, 2014). The ψ-divergence is widely studied in the DRO literature (Namkoong & Duchi, 2016):

D(Q||P0) = Z ψ d Q

where ψ is a non-negative convex function such that ψ(1) = 0 and ψ(t) = + for any t < 0.

The penalized formulation of DRO shown in equation 3 is a minimax optimization problem and is usually hard to solve. For ψ-divergence defined DRO problems, a popular approach is to investigate its dual formulation. By strong duality (Levy et al., 2020; Shapiro, 2017), the solution of equation 3 is equivalent to the solution of the following dual problem:

inf x Ψ(x) = inf x,η R ˆL(x, η) := λES P0ψ ℓ(x, S) η

where η R is a dual variable, ψ is the conjugate function of ψ and is defined as ψ (t) = supa R{ta ψ(a)}. In this paper, we study this dual formulation, which has the following three advantages compared with the previous penalized form shown in equation 3: (i) the objective is optimized under the known training distribution P0; (ii) it is easy to get an unbiased estimate of the gradient of the objective to x since we do not need to take the supremum for Q; and (iii) it converts a minimax problem to a minimization problem, which is easier to solve.

In this paper, we focus on the DRO problem in equation 3 and equation 5 with non-convex L-smooth loss function ℓ(see Definition 2). We consider the large-scale setting, where the training dataset consists of a large number of N samples. We aim to characterize the fundamental structure of this problem and develop efficient first-order algorithms with comprehensive convergence analysis.

The same problem was studied in Jin et al. (2021). It was shown that the dual objective in equation 5 is not L-smooth if the loss function ℓis not bounded. They further show that the corresponding dual ˆL(x, η) satisfies the generalized smoothness condition (see Definition 3), where the Lipschitz constant grows linearly with the gradient norm x,η ˆL . Similarly, in the stochastic setting, they prove the variance of gradient estimate grows linearly with the gradient norm square x,η ˆL 2, i.e., affine variance noise (see discussion under Lemma 3). To solve this problem, a normalized momentum method was used and shown to converge with O(ϵ 4) gradient complexity. Note that algorithms and analyses for generalized smooth optimization problems, e.g., Adagrad and Adam (Li et al., 2023; Wang et al., 2023; Zhang et al., 2024b), can also be used to solve this problem.

Published as a conference paper at ICLR 2025

However, solving the problem in equation 5 as a generalized (L0, L1)-smooth optimization problem typically requires more involved convergence analysis.

In this paper, we answer the following question: Are generalized (L0, L1)-smoothness and affine variance noise an overkill to characterize the dual ˆL(x, η) in equation 5?

1.1 OUR CONTRIBUTIONS

Our main contributions are summarized as follows.

We prove that the dual objective of DRO problems in equation 5 is partially generalized smooth (see Lemma 1), and the noise satisfies a partially affine variance noise condition in the stochastic setting (see Lemma 3). The above two conditions provide a much more precise and fundamental characterization of the dual in equation 5 than the generalized smoothness and affine variance noise derived in Jin et al. (2021). Such precise characterization yields much simpler convergence analysis. Our proposed more precise conditions circumvent the unbounded Lipschitz constant and unbounded noise variance challenges in the generalized (L0, L1)-smoothness problems with affine variance noises. Our conditions first show the dual problem is standard L-smooth in the dual variable η and the noise of gradient on η has bounded variance. Then under our partially generalized smoothness and partially affine variance noise condition, the Lipschitz constant and noise variance on the model parameter x are also bounded, making the objective easy to solve. We show that in the deterministic setting, an algorithm as simple as gradient descent can solve the problem with iteration complexity of O(ϵ 2); and in the stochastic setting, an algorithm as simple as double stochastic gradient descent with clipping (D-SGD-C) can solve the problem with gradient complexity of O(ϵ 4). We further design a double spider with clipping (D-Spider-C) algorithm and show its convergence with an improved O(ϵ 3) gradient complexity. Thanks to our more precise characterizations of partially generalized smooth and partially affine variance noise, our analyses are much simpler than those in Jin et al. (2021) and those for Spider algorithms (Chen et al., 2023; Reisizadeh et al., 2023), which are developed merely for general generalized smooth problems with affine variance noise and are not tailored specially for DRO problems. Our methods converge with computational complexities independent of the number of training samples N, and thus are applicable to large-scale training datasets. Numerical results are conducted to verify our theoretical results. We observe that our proposed algorithms outperform the existing DRO method (Jin et al., 2021).

1.2 RELATED WORKS

DRO. Scalability: Many existing approaches are not scalable when the training dataset is large. The method (Namkoong & Duchi, 2016) is not feasible for large-scale problems, which parameterizes the unknown distribution by a vector of dimension N and models the DRO problem as a minimax optimization problem. Following this method, many minimax methods such as Rafique et al. (2022); Lin et al. (2020); Xu et al. (2023) can be used to address the DRO problem. However, a computational complexity that is linear (or even worse) in the size of the training set is not preferable for large-scale problems. In this paper, our stochastic algorithms have a per-iteration complexity that is independent of the training dataset.

Convex loss functions: Some existing methods (Duchi & Namkoong, 2018; Levy et al., 2020; Wang et al., 2021; 2024; Hashimoto et al., 2018) require the loss function to be convex, which, however, fail to capture a wide range of machine learning problems where the loss function is non-convex, e.g., neural networks. In this paper, we focus on the general non-convex smooth loss function.

Bounded loss functions: For the non-convex DRO problem, existing studies, e.g., Qi et al. (2021; 2022); Zhang et al. (2024a); Soma & Yoshida (2020) require the loss function ℓto be bounded (or even more restricted assumptions). In this paper, we focus on non-convex smooth loss functions, which may potentially be unbounded.

Non-convex smooth loss functions: Jin et al. (2021) is the first study for non-convex DRO problems with general ψ-divergence defined uncertainty sets in large-scale settings. By combining all trainable

Published as a conference paper at ICLR 2025

parameters together z = (x, λ), the authors prove that the dual objective shown in equation 5 is generalized (L0, L1)-smooth in z. Then a normalized-momentum method is designed and an ϵstationary point is guaranteed with a computational complexity of O(ϵ 4). In this paper, we share the same setting with Jin et al. (2021) and we provide a simpler yet more precise characterization of the dual problem: partially generalized smoothness and partially affine variance noise. We show that a simple SGD type algorithm finds an ϵ-stationary point with a complexity of O(ϵ 4). Moreover, we design a Spider algorithm, reducing the complexity to O(ϵ 3).

Generalized (L0, L1)-smoothness. As Jin et al. (2021) pointed out, the non-convex DRO problem can be solved as a generalized (L0, L1)-smooth optimization problem, which, however, introduces unnecessary complications to theoretical analysis for the DRO problem in this paper. Nevertheless, we briefly review algorithms and analyses for generalized smooth optimization problems in the literature below. The generalized (L0, L1)-smoothness problem is first introduced in Zhang et al. (2019), where a clipping method is investigated. However, for the stochastic setting, the gradient estimation error is required to be bounded almost surely. In this paper, our method works for noise with partially affine variance (shown in Lemma 3), which is a much weaker condition and more precise characterization for this DRO problem. Modern methods such as normalized-momentum (Jin et al., 2021), Adagrad (Wang et al., 2023), and Adam (Li et al., 2023; Zhang et al., 2024b) are studied for generalized (L0, L1)-smoothness problem, and they can also be used to solve the DRO problem in this paper. In our paper, we show that for the DRO problem, simple SGD can get the same stationary point with the same gradient complexity. To reduce the complexity, Spider (Fang et al., 2018) is studied for the generalized (L0, L1)-smoothness problem (Chen et al., 2023; Reisizadeh et al., 2023). In this paper, based on our precise characterizations of partially generalized smoothness and partially affine variance noise, our proof is much simpler than Chen et al. (2023). Moreover, we show the gradient converges in expectation, which is stronger than the convergence with high probability in Reisizadeh et al. (2023).

2 PRELIMINARIES

Denote by s a sample in S and let P0 be the empirical distribution of the N training samples {si}N i=1. In the large-scale setting studied in this paper, we assume the number of training samples N is extremely large. We use to denote the Euclidean norm and , to denote the standard dot product. Define a function (a)+ = max(a, 0). For a set C, 1C is an indicator function such that 1C(a) = 0 if a C and 1C(a) = + otherwise. Let x Rd be the trainable parameters where d is the dimension. The loss function is defined as ℓ: Rd S R. For a differentiable function f : Rd R, x is an ϵ-stationary point if xf(x) ϵ, where xf(x) is the gradient of f to x. For a random vector X, denote by E the expectation and V the sum of its element-vise variance, where V(X) := E[ X E[X] 2]. We further provide some definitions. Definition 1 (Lipschitz continuous). A function f : Rd R is called G-Lipschitz continuous if for any x, y Rd, |f(x) f(y)| G x y , where G > 0 is some finite constant. Definition 2 (Standard L-smooth). A differentiable function f : Rd R is L-smooth if for any x, y Rd, xf(x) xf(y) L x y , where L > 0 is some finite constant.

These two definitions cover a wide range of problems in optimization studies. Recently, a generalized (L0, L1)-smoothness condition is proposed (Zhang et al., 2019; Chen et al., 2023; Li et al., 2023), which is strictly weaker than the standard L-smoothness condition. Definition 3 (Generalized (L0, L1)-smooth). A differentiable function f : Rd R is generalized (L0, L1)-smooth if for any x, y Rd, we have that xf(x) xf(y) (L0+L1 xf(x) ) x y , where L0, L1 > 0 are some finite constants.

Note that there are two version of the (L0, L1)-smoothness, one requires that the inequality only applies to x y 1 L0 (Zhang et al., 2019) and one does not require that (Chen et al., 2023). In DRO setting, it can be proved that for the dual objective Jin et al. (2021), the inequality holds for any x, y Rd thus in this paper we follow the second definition.

In this paper, we focus on a non-convex and smooth loss function ℓ. Assumption 1. For any sample s S, the loss function ℓ(x, s) is G-Lipschitz continuous and L-smooth in x.

Published as a conference paper at ICLR 2025

We further make the following assumption on the ψ-divergence. Assumption 2. The conjugate function ψ of ψ is M-smooth.

Assumption 2 can be satisfied by a wide range of ψ-divergences (see Table 1).

Divergence ψ(t) ψ (t) χ2 1 2(t 1)2 1 + 1

4(t + 2)2 + KL-regularized CVa R 1[0,α 1) + t log(t) t + 1, α (0, 1) min(et, α 1(1 + t + log(α))) 1

Cressie-Read tk tk+k 1

k(k 1) , k R 1 k

((k 1)t + 1)

Table 1: Commonly used divergences with L-smooth conjugates.

The goal of this paper is to find an ϵ-stationary point of the penalized DRO problem in equation 3, which is a minimax optimization problem and is usually hard to solve. For ψ-divergence defined DRO problems, a popular approach is to investigate its dual formulation. By strong duality (Levy et al., 2020; Shapiro, 2017), we have that

Ψ(x) = inf η R ˆL(x, η) := λES P0ψ ℓ(x, S) η

Under Assumptions 1 and 2, it can be shown that Ψ(x) is differentiable (Jin et al., 2021). Define L(x, η) = ˆL(x, Gη). Then x,ηL(x, η) ϵ/

2 implies that Ψ(x) ϵ (Jin et al., 2021). Thus, it is equivalent to find an ϵ-stationary point of L(x, η).

3 MAIN RESULTS

3.1 PARTIALLY GENERALIZED SMOOTHNESS

Let z = (x, η). Approach in Jin et al. (2021) directly optimizes over z, where it was shown that L(z) is generalized (L0, L1)-smooth in z with L0 = L + 2G2λ 1M and L1 = L/G. In the following, we provide a more precise characterization of L by separately studying x and η. Lemma 1 (Partially generalized (L0, L1, L2)-smoothness). Under Assumptions 1 and 2, L(x, η) is (L0, L1)-partially smooth in x and L2-smooth in η such that for any x, x Rd and η, η R we have that

x L(x, η) x L(x , η) (L0 + L1| ηL(x, η)|) | {z } term (a)

| ηL(x, η) ηL(x, η )| L2|η η |, (8)

where L0 = G + G2M

G and L2 = G2M

The proof is available in Appendix A.1. Observe that L(x, η) is smooth in η for any x. Thus, optimizing over η should not be as hard as solving a generalized smooth problem. Moreover, in equation 7, the Lipschitz constant in x (term (a)) is linear in the gradient to η: ηL(x, η), but does not depend on the gradient to x: x L(x, η). Compared with the generalized (L0, L1)-smoothness used in Jin et al. (2021), Lemma 1 provides a more precise characterization of L. Intuitively, due to the smoothness in η, one can expect a quick find of a point with a bounded gradient to η. Consequently, the Lipschitz constant in x will also become bounded, which circumvents the unbounded Lipschitz constant challenges in generalized (L0, L1)-smoothness problems and makes the objective easier to optimize. Remark 1. This partially generalized (L0, L1, L2)-smoothness condition is weaker than the standard L-smoothness condition but stronger than the generalized (L0, L1)-smoothness condition.

3.2 DETERMINISTIC SETTING

To warm up, we first consider the deterministic setting. We first propose a double gradient descent (D-GD) algorithm, which updates x and η alternatively (see Algorithm 1). This is in contrast to

Published as a conference paper at ICLR 2025

the approach in Jin et al. (2021) where x and η are optimized jointly. The key idea is to leverage the standard L-smoothness property in η to bound ηL(x, η), to reduce equation 7 to a smooth condition, and then to bound x L(x, η).

Algorithm 1 D-GD Input: initialization x0, η0, step sizes αt, βt, number of iterations T

1: t 0 2: while t T 1 do 3: ηt+1 ηt αt ηL(xt, ηt) 4: xt+1 xt βt x L(xt, ηt+1) 5: t t + 1 6: end while

3.2.1 DOUBLE GRADIENT DESCENT (D-GD)

In this section, we choose constant step sizes αt, βt in Algorithm 1. In the following theorem, we show that D-GD converges to an ϵ-stationary point with an iteration complexity of O(ϵ 2).

Theorem 1. Let H = 2L2(L(x0, η0) infx,η L(x, η)) and L = L0 + L1

H. Set αt = 1 L2 , βt =

1 L , T 8 max(L2, L ) L(x0,η0) infx,η L(x,η)

ϵ2 . For Algorithm 1, we then have that

min t<T x,ηL(xt, ηt+1)

The proof is available in Appendix A.2. One key step in the proof is the following descent lemma: Lemma 2 (Descent lemma). For the partially generalized (L0, L1, L2)-smooth function L(x, η) defined in Lemma 1, we have that for any x, x Rd and η R,

L(x , η) L(x, η) + x L(x, η), x x + L0 + L1| ηL(x, η)|

2 x x 2. (9)

The proof can be found in Appendix A.3. In the proof, by setting αt = 1 L2 and using the standard L2-smoothness in η, at each step t we can show that L(xt, ηt+1) L(xt, ηt). Moreover, since L is L2-smooth in η, we have that for any x, η,

| ηL(x, η)|2 2L2

L(x, η) inf η L(x, η ) 2L2

L(x, η) inf x ,η L(x , η ) , (10)

which indicates that a bounded function value implies a bounded gradient of η.

We then prove this by mathematical induction. It can be easily shown using equation 10 that at t = 0, | ηL(xt, ηt)|

H and | ηL(xt, ηt+1)|

H. By Lemma 1, it can be shown that the Lipschitz constant to x at (xt, ηt+1) is upper bounded by L = L0 + L1

H. By setting βt = 1 L , we can show L(xt+1, ηt+1) L(xt, ηt+1). Thus at t + 1 we also have that | ηL(xt+1, ηt+1)|

H and | ηL(xt+1, ηt+2)|

H. This further implies that the Lipschitz constant to x at (xt+1, ηt+2) is upper bounded by L . Therefore, by induction, along the training trajectory, the objective function L(x, η) is L2-smooth in η and L -smooth in x. We then convert the DRO problem to a standard L-smooth optimization problem, which is much easier to address.

3.2.2 DOUBLE GRADIENT DESCENT WITH CLIPPING (D-GD-C)

Algorithm 1 with fixed stepsizes is straightforward and convenient to be employed in practice. However, Theorem 1 indicates that the computational complexity is linear in L . In this section, we study Algorithm 1 with an adaptive step size for the update of x, which we refer to as Double Gradient Descent with Clipping (D-GD-C). In the following theorem, we provide the convergence guarantee:

Theorem 2. Set αt = 1 L2 , βt = min 1 2L0 , 1 2L1| ηL(xt,ηt+1)| , T

L(x0,η0) infx,η L(x,η)

ϵ2 max(8L2, 16L0) for Algorithm 1. For ϵ L0

L1 , we then have that

min t<T x,ηL(xt, ηt+1)

Published as a conference paper at ICLR 2025

The detailed proof is available in Appendix A.4. Due to the adaptive design of βt, the step size is small when the gradient to η is large. Thus the function value is decreasing and we can prove that for some t < T, αt( ηL(xt, ηt))2 + βt x L(xt, ηt+1) 2 O(ϵ2).

If using the analysis for generalized (L0, L1)-smooth optimization, e.g., in Jin et al. (2021), it can be shown that α t z L(z) 2 O(ϵ2), where α t is an adaptive step size and is a function of z L(z) . Thus, α t z L(z) 2 is a function of z L(z) and it will need non-trivial efforts to show that z L(z) is bounded. In contrast, using our more precise partial (L0, L1, L2)- smoothness characterization, since αt is a constant and L is L2-smooth in η, it can be shown that | ηL(xt, ηt+1)| is bounded. We then obtain a lower bound on βt: from | ηL(xt, ηt+1)| ϵ L0

L1 , it follows that βt = 1 2L0 which is a constant. Thus, one can easily prove x L(xt, ηt+1) is bounded if βt x L(xt, ηt+1) 2 is bounded. This implies an ϵ-stationary point.

3.3 STOCHASTIC SETTING

In the deterministic setting, the per-iteration gradient complexity is N, and therefore, the approach may not be scalable when N is large. In this section, we develop stochastic gradient algorithms that have per-iteration gradient complexity independent of the training dataset size N. For the loss function, we follow the most relaxed bounded variance assumption in the ψ-divergence DRO literature (Jin et al., 2021): Assumption 3 (Bounded variance). For any x Rd and S P0, the variance of the loss function is bounded: ES P0[(ℓ(x, S) ℓ(x))2] σ2, (11) where σ is a constant and ℓ(x) = ES P0[ℓ(x, S)].

We note that stronger assumptions of bounded loss function, i.e., ℓ(x, s) is bounded for any x, s are used in ψ-divergence DRO literature (Qi et al., 2022; Levy et al., 2020; Zhang et al., 2024a). However, their approaches cannot be easily extended to unbounded loss functions.

We then provide the following lemma to characterize the variance of the stochastic gradient of L(x, η) to x and η. Define L(x, η, S) = λψ ℓ(x,S) Gη

Lemma 3 (Partially affine variance noise). Under Assumptions 1, 2 and 3, for any x Rd and η R, we have that

VS P0[ x L(x, η, S)] D0 + D1 ( ηL(x, η))2 , (12) VS P0[ ηL(x, η, S)] D2 (13)

where D0 = 8G2 + 10G2M 2λ 2σ2, D1 = 8 and D2 = G2M 2λ 2σ2.

The proof is available in Appendix A.5. Note that Lemma 3 provides a much more precise characterization of the stochastic gradient variance than the affine variance condition in equation 14 (Jin et al., 2021) : VS P0[ z L(z, S)] D 0 + D 1 z L(z) 2, (14) where D 0, D 1 are some positive constants. The affine variance condition in Jin et al. (2021) is challenging to analyze since L is (L0, L1)-smooth in z and z L(z) is non-trivial to bound. In contrast, our Lemma 3 indicates that the stochastic gradient to η has a bounded variance, and the variance of the stochastic gradient to x is linear only in ( ηL)2. Since the dual objective is standard L2-smooth in η and the gradient noise has a bounded variance, we can easily bound ( ηL)2 and show the variance of the stochastic gradient to η is also bounded.

3.3.1 DOUBLE STOCHASTIC GRADIENT DESCENT WITH CLIPPING (D-SGD-C)

In this section, we develop a D-SGD-C algorithm that updates x and η alternatively (Algorithm 2). In the stochastic setting, we use a mini-batch of samples to estimate the gradient, and N1, N2 are the batch sizes. For a mini-batch of samples B = {ξ(i)}|B| i=1, denote by

L(x, η, B) =

λψ ℓ x, ξ(i) Gη λ

Published as a conference paper at ICLR 2025

an estimate of L(x, η), where |B| is the batch size.

Algorithm 2 D-SGD-C Input: initialization (x0, η0), step sizes αt, βt, number of interactions T, batch sizes N1, N2

1: t 0 2: while t T 1 do 3: Draw N1 i.i.d. samples B1 from P0 and compute gt ηL(xt, ηt, B1) 4: ηt+1 ηt αtgt 5: Draw N2 i.i.d. samples B2 from P0 and compute vt x L(xt, ηt+1, B2) 6: xt+1 xt βtvt 7: t t + 1 8: end while

The following theorem establishes the convergence and complexity of Algorithm 2.

Theorem 3. Set α = 1 2L2 , βt = min( 1 2L0 , ϵ L0 vt ), T, N1, N2 = O(ϵ 2) for Algorithm 2. We then have that

min t<T E[ x,ηL(xt, ηt+1) ] 6ϵ.

Exact expressions of T, N1, N2 and the proof are available in Appendix A.6. In Algorithm 2, we choose adaptive βt ϵ L0 vt in order to bound the additional term of | ηL(xt, ηt+1)| xt xt+1 2

term in Lemma 1 introduced by partially generalized smoothness. The rest terms in the decent lemma are the same as standard L-smooth problems; thus, we can find some t < T such that E[βt vt 2] O(ϵ2). By the definition of βt, it can be shown that βt vt 2 ϵ vt

L0 ϵ2 2L0 for any t. Thus we can find some t < T such that E[ vt ] O(ϵ).

Using this fact, we now show that E[ vt x L(xt, ηt+1) ] O(ϵ). Based on Lemma 3 and N2 = O(ϵ 2), we have that V[vt] O ϵ2 + ϵ2( ηL(xt, ηt+1))2 . Moreover, E[ vt x L(xt, ηt+1) ] p

E[ vt x L(xt, ηt+1) 2] = p

V[vt], implying that E[ vt x L(xt, ηt+1) ] O(ϵ + ϵ| ηL(xt, ηt+1)|). In Lemma 1, we show the dual objective is L2smooth in η; thus, it is less challenging to bound | ηL(xt, ηt+1)|. We then can find some t < T such that E[ x L(xt, ηt+1) ] E[ vt x L(xt, ηt+1) ] + E[ vt ] O(ϵ) and can complete the proof. We also provide Algorithm 4, which uses momentum method and does not require minibatches (See Appendix A.11).

3.3.2 DOUBLE SPIDER WITH CLIPPING (D-SPIDER-C)

Theorem 3 indicates a computational complexity of O(ϵ 4) is required by D-SGD-C to find an ϵ-stationary point of Ψ(x). This complexity can be further improved by our following variancereduced method (Algorithm 3), which we call Double Spider with Clipping (D-Spider-C). In this method, we first compute our estimate of the gradient to η (x) with a large batch of samples with size of N1 (N3). We then update our estimate with a small batch of samples with a size of N2 (N4). For every q iteration, we refresh our estimate with a large batch of samples with size of N1 (N3).

To analyze Algorithm 3, we further develop the following property of gradients x L(x, η, s) and ηL(x, η, s).

Lemma 4. For any x, x Rd, η, η R and s S, ηL(x, η, s) is L2-continuous in x:

| ηL(x, η, s) ηL(x , η, s)| L2 x x (15)

and x L(x, η, s) is L2-continuous in η:

x L(x, η, s) x L(x, η , s) L2|η η |. (16)

The proof is available in Appendix A.7. Lemma 4 theoretically characterize how the update on x (or η) changes the gradient to η (or x). We then provide the theorectical result.

Published as a conference paper at ICLR 2025

Algorithm 3 D-Spider-C Input: initialization (x0, η0), step sizes αt, βt, epoch size q, number of iterations T, batch sizes N1, N2, N3, N4

1: while t T 1 do 2: if t mod q == 0 then 3: Draw N1 i.i.d. samples B1 from P0 and compute gt ηL(xt, ηt, B1) 4: else 5: Draw N2 i.i.d. samples B2 from P0 and compute gt ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) + gt 1 6: end if 7: ηt+1 ηt αtgt 8: if t mod q == 0 then 9: Draw N3 i.i.d. samples B3 from P0 and compute vt x L(xt, ηt+1, B3) 10: else 11: Draw N4 i.i.d. samples B4 from P0 and compute vt x L(xt, ηt+1, B4) x L(xt 1, ηt, B4) + vt 1 12: end if 13: xt+1 xt βtvt 14: t t + 1 15: end while

Theorem 4. Set α = 1 4L2 , βt = min( 1 2L0 , ϵ L0 vt ), T, N1, N3 = O(ϵ 2) and N2, N4 = O(q) for Algorithm 3. We have that for some constant c4 > 0

min t<T E[ x,ηL(xt, ηt+1) ] c4ϵ.

The full version and its proof are available in Appendix A.8. From the result we can show that the total gradient complexity is O ϵ 4

q + ϵ 2q . By choosing q = O(ϵ 1), we can reduce the total

gradient complexity from O(ϵ 4) to O(ϵ 3). Compared with Algorithm 2, the biggest difference is how we design our estimators gt and vt. Rather than using unbiased gradient estimators, Algorithm 3 introduces momentum in the updates of gt and vt, which, however, leads to biased gradient estimates. The key challenge in our proof is to bound such bias (see the following lemma). Lemma 5. Using the same parameters as in Theorem 4, for any t0 < T such that t0 mod q = 0, we have that

t=t0 E (gt ηL(xt, ηt))2 O

q N1 + q2ϵ2

t=t0 (gt)2 !

t=t0 E[ vt x L(xt, ηt+1) 2] O

q N3 + q2ϵ2

t=t0 (gt)2 !

The full version and its proof can be found in Appendix A.9. The key idea is as follows. From the update of gt, if t mod q = 0, we can show that

E[(gt ηL(xt, ηt))2] =E[( ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) + ηL(xt 1, ηt 1)

ηL(xt, ηt))2] + E[(gt 1 ηL(xt 1, ηt 1))2]. (19)

The first term in RHS of equation 19 corresponds to the variance of an unbiased estimate of ηL(xt 1, ηt 1) ηL(xt, ηt) using N2 samples. This term can be further bounded by E[( ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2))2], which is small due to our designed step sizes and Lemma 4. This bounds the first term in RHS of equation 19. Then, applying equation 19 recursively, we get the bound in equation 17. This method applies to vt too.

From the descent lemma and Lemma 2, similar to the previous proof in Theorem 3, we can show that PT 1 t=0 E[αt(gt)2 +βt vt 2] is bounded by the mean square errors of gt and vt plus some constants.

Published as a conference paper at ICLR 2025

Using Lemma 5 where the mean square errors are upper bounded by a function of PT 1 t=0 |gt|2 and αt = 1 4L2 , we can show that PT 1 t=0 E[(gt)2 + βt vt 2] is bounded by some constants. We then can show that for some t < T, E[(gt)2] is bounded. Moreover, with E[(gt)2] bounded, we can further show the mean square errors of gt and vt are bounded. We then obtain an ϵ-stationary point.

4 NUMERICAL RESULTS

In this section, we conduct numerical studies on a set of regression tasks (Chen et al., 2023) on the life expectancy data 1. This dataset consists of N = 2413 samples, where we select the first 2000 samples for training and the rest samples for testing. After removing redundant features, for the i-th sample in our dataset, we have input zi R34 and output yi R. The non-convex original loss function is set as ℓ(x, (zi, yi)) = 1

2(yi z i x)2+0.1 P34 j=1 ln(1+|xj|), where x = (x1, x2, ..., x34) is the trainable parameter. For the DRO model, λ is set to 0.01, and the initial value η0 is set to 0.1.

Figure 1: Deterministic setting.

0.0 0.5 1.0 1.5 2.0 2.5 Computation Complexity 1e5

SGD Normalized-SPIDER Normalized-SGDm D-SPIDER-C D-SGD-C

Figure 2: Stochastic setting.

In Figure 1, we provide the training curves with fine-tuned learning rate for the original GD, Normalized GD (Chen et al., 2023), Clipped GD (Zhang et al., 2019) and our proposed D-GD, D-GD-C methods. The x-axis stands for the training iteration, and the y-axis stands for the DRO objective ψ(x). From the results of GD and D-GD in Figure 1, we can show that updating x and η alternatively has the same empirical performance compared with training them together. Our D-GD-C also has similar performance compared with other methods.

In Figure 2, we provide the training curves with fine-tuned learning rate for SGD, Normalized SPIDER (Chen et al., 2023), Normalized-SGD with momentum (Jin et al., 2021) and our proposed D-SGD-C and D-SPIDER-C methods. Our D-SPIDER-C has similar performance compared with Normalized-SPIDER and both our two algorithms outperform the SGD and Normalized-SGD with momentum methods. The details can be found in Appendix A.10.

5 CONCLUSION

In this paper, we revisit the DRO problem with non-convex smooth loss functions. Instead of solving this problem as a generalized (L0, L1)-smoothness problem, we first show that this DRO problem satisfies a simpler yet more precise partially generalized smoothness condition and partially generalized affine noise condition. Under these conditions, our theoretical analyses are much simpler than existing studies. Our results provide new insights into the fundamental structure of DRO problems with non-convex loss functions, which could be useful for DRO problems beyond the one studied in this paper.

1https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download

Published as a conference paper at ICLR 2025

6 ACKNOWLEDGMENTS

The work of Q. Zhang and S. Zou was partially supported by NSF under Grant CCF-2438429. Y. Zhou s work was supported by the National Science Foundation under grants DMS-2134223 and ECCS-2237830. L. Shen s work was supported by the National Science Foundation under grant DMS-2208385. This work is also partially supported by the AFRL VFRP program.

Krizhevsky Alex. Learning multiple layers of features from tiny images. https://www. cs. toronto. edu/kriz/learning-features-2009-TR. pdf, 2009.

Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341 357, 2013.

John Blitzer, Ryan Mc Donald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pp. 120 128, 2006.

Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. ar Xiv preprint ar Xiv:2303.02854, 2023.

Noel Cressie and Timothy RC Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society Series B: Statistical Methodology, 46(3):440 464, 1984.

Hal Daume III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of artificial Intelligence research, 26:101 126, 2006.

John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. ar Xiv preprint ar Xiv:1810.08750, 2018.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proceedings of Advances in Neural Information Processing Systems, volume 31, 2018.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014.

Patrick J Grother, Patrick J Grother, P Jonathon Phillips, and George W Quinn. Report on the evaluation of 2D still-image face recognition algorithms. US Department of Commerce, National Institute of Standards and Technology, 2011.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In Proceedings of International Conference on Machine Learning, pp. 1929 1938. PMLR, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016.

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. In Proceedings of Advances in Neural Information Processing Systems, volume 34, pp. 2771 2782, 2021.

Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In Proceedings of Advances in Neural Information Processing Systems, volume 33, pp. 8847 8860, 2020.

Haochuan Li, Ali Jadbabaie, and Alexander Rakhlin. Convergence of adam under relaxed assumptions. ar Xiv preprint ar Xiv:2304.13972, 2023.

Published as a conference paper at ICLR 2025

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2024.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In Proceedings of International Conference on Machine Learning, pp. 6083 6093. PMLR, 2020.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017.

Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of Advances in neural information processing systems, volume 29, 2016.

Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. In Proceedings of Advances in Neural Information Processing Systems, volume 34, pp. 10067 10080, 2021.

Qi Qi, Jiameng Lyu, Er Wei Bai, Tianbao Yang, et al. Stochastic constrained dro with a complexity independent of sample size. ar Xiv preprint ar Xiv:2210.05740, 2022.

Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Weakly-convex concave min max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 37(3):1087 1121, 2022.

Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. ar Xiv preprint ar Xiv:1908.05659, 2019.

Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. ar Xiv preprint ar Xiv:2303.00883, 2023.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019.

Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258 2275, 2017.

Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. ar Xiv preprint ar Xiv:1710.10571, 2017.

Tasuku Soma and Yuichi Yoshida. Statistical learning with conditional value at risk. ar Xiv preprint ar Xiv:2002.05826, 2020.

Tim Van Erven and Peter Harremos. R enyi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, 2014.

Bohan Wang, Huishuai Zhang, Zhiming Ma, and Wei Chen. Convergence of adagrad for non-convex objectives: Simple proofs and relaxed assumptions. In The Thirty Sixth Annual Conference on Learning Theory, pp. 161 190. PMLR, 2023.

Jie Wang, Rui Gao, and Yao Xie. Sinkhorn distributionally robust optimization. ar Xiv preprint ar Xiv:2109.11926, 2021.

Jie Wang, Rui Gao, and Yao Xie. Regularization for adversarial robust learning. ar Xiv preprint ar Xiv:2408.09672, 2024.

Zi Xu, Huiling Zhang, Yang Xu, and Guanghui Lan. A unified single-loop alternating gradient projection algorithm for nonconvex concave and convex nonconcave minimax problems. Mathematical Programming, pp. 1 72, 2023.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019.

Published as a conference paper at ICLR 2025

Qi Zhang, Yi Zhou, Ashley Prater-Bennette, Lixin Shen, and Shaofeng Zou. Large-scale nonconvex stochastic constrained distributionally robust optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 8217 8225, 2024a.

Qi Zhang, Yi Zhou, and Shaofeng Zou. Convergence guarantees for rmsprop and adam in generalized-smooth non-convex optimization with affine noise variance. ar Xiv preprint ar Xiv:2404.01436, 2024b.

Published as a conference paper at ICLR 2025

A.1 PROOF OF LEMMA 1

Proof. Based on the formulation in equation 6 and the fact that L(x, η) = ˆL(x, Gη), we first have that

(ψ ) ℓ(x, S) Gη

(ψ ) ℓ(x, S) Gη

We first prove that the function L(x, η) is L2-smooth in η. For any x Rd and η, η R, we have that | ηL(x, η) ηL(x, η )|

(ψ ) ℓ(x, S) Gη

(ψ ) ℓ(x, S) Gη

(ψ ) ℓ(x, S) Gη

(ψ ) ℓ(x, S) Gη

λ |η η |, (21)

where the inequality is due to the M-smoothness of the conjugate function. We further show the partially generalized smoothness in x. For any x, x Rd and η R, we first have (ψ ) ℓ(x ,S) Gη

λ 0. This is due to the fact that (ψ ) is monotonically increasing and

0 lima (ψ ) (a) 1, where the details can be found in Proposition B.2 in Jin et al. (2021). It then follows that x L(x, η) x L(x , η)

(ψ ) ℓ(x, S) Gη

ℓ(x, S) (ψ ) ℓ(x , S) Gη

(ψ ) ℓ(x, S) Gη

ℓ(x, S) (ψ ) ℓ(x, S) Gη

(ψ ) ℓ(x, S) Gη

ℓ(x , S) (ψ ) ℓ(x , S) Gη

(ψ ) ℓ(x, S) Gη

G(| ηL(x, η)| + G) x x + GM G

(L0 + L1| ηL(x, η)|) x x , (22) where the second inequality is because that ℓ(x, s) is L-smooth and G-continuous in x for any s S, and the conjugate function ψ is M-smooth, the third inequality is due to the fact that (ψ ) ℓ(x ,S) Gη

λ 0, and the fourth inequality is because that GEP0 h (ψ ) ℓ(x,S) Gη

ηL(x, η) + G | ηL(x, η)| + G. This completes the proof.

A.2 PROOF OF THEOREM 1

Proof. Let H = 2L2(L(x0, η0) infx,η L(x, η)) and L = L0 + L1

H. In the following, we show that under our selected step sizes, | ηL(xt, ηt+1)| is upper bounded by

H for all t 0 by induction.

Base Case: for t = 0, since L(x, η) is L2-smooth in η, from the descent lemma for standard L-smooth function we can show that

L(x0, η1) L(x0, η0) ηL(x0, η0), α0 ηL(x0, η0) + L2

2 (α0 ηL(x0, η0))2. (23)

Published as a conference paper at ICLR 2025

Since α0 = 1 L2 , it follows that

1 2L2 ( ηL(x0, η0))2 L(x0, η0) L(x0, η1). (24)

By Lemma 3.5 in Li et al. (2024), it can be shown that

( ηL(x0, η1))2 2L2

L(x0, η1) inf η L(x0, η) 2L2

L(x0, η1) inf x,η L(x, η) . (25)

According to equation 24 and equation 25, we can show that L(x0, η1) L(x0, η0) and | ηL(x0, η1)|

Inductive Step: Assume | ηL(xt, ηt+1)|

H holds for 0 t t . In the following, we show that | ηL(xt, ηt+1)|

H holds for t = t + 1.

Due to the L2-smoothness of L(x, η) in η and αt = 1 L2 , it can be shown that

L(xt, ηt+1) L(xt, ηt) ηL(xt, ηt), αt ηL(xt, ηt) + L2

2 (αt ηL(xt, ηt))2 (26)

and 1 2L2 ( ηL(xt, ηt))2 L(xt, ηt) L(xt, ηt+1) (27)

hold for all t 0. Based on Lemma 2, we have that

L(xt+1, ηt+1) L(xt, ηt+1) x L(xt, ηt+1), βt x L(xt, ηt+1)

+ L0 + L1| ηL(xt, ηt+1)|

2 βt x L(xt, ηt+1) 2

L(xt, ηt+1) x L(xt, ηt+1), βt x L(xt, ηt+1)

H 2 βt x L(xt, ηt+1) 2

holds for all 0 t t . By setting βt = 1 L0+L1

H = 1 L , it can further imply that

1 2L x L(xt, ηt+1) 2 L(xt, ηt+1) L(xt+1, ηt+1) (28)

holds for all 0 t t . By taking the sum of equation 27 from t = 0 to t +1 and equation 28 from t = 0 to t , we can show that

1 2L2 ( ηL(xt, ηt))2 +

1 2L x L(xt, ηt+1) 2 L(x0, η0) L(xt +1, ηt +2).

By Lemma 3.5 in Li et al. (2024), it follows that

( ηL(xt +1, ηt +2))2 2L2

L(xt +1, ηt +2) inf η L(xt +1, η)

L(x0, η1) inf x,η L(x, η) = H,

which completes the inductive step. Since both the base case and the inductive step are proven, by the principle of mathematical induction, | ηL(xt, ηt+1)|

H holds for any t 0. This further implies that equation 28 holds for any t 0. Then, we take the sum of equation 27 and equation 28 from t = 0 to T 1, and show that

( ηL(xt, ηt))2 + x L(xt, ηt+1) 2 max(2L2, 2L )(L(x0, η0) L(x T , ηT ))

max(2L2, 2L ) L(x0, η0) inf x,η L(x, η) . (29)

Published as a conference paper at ICLR 2025

Thus for T 8 max(L2, L ) L(x0,η0) infx,η L(x,η)

ϵ2 , we can find some t such that

( ηL(xt, ηt))2 + x L(xt, ηt+1) 2 1

Thus we have x L(xt, ηt+1) ϵ

2 and | ηL(xt, ηt)| ϵ

2. Moreover, we can show that

| ηL(xt, ηt+1)| | ηL(xt, ηt)| + | ηL(xt, ηt+1) ηL(xt, ηt)| (1 + L2αt)| ηL(xt, ηt)| ϵ, (31)

where the second inequality is due to the standard L2-smoothness on η. This completes the proof.

A.3 PROOF OF LEMMA 2

Proof. From Taylor s Theorem, we have that

L(x , η) L(x, η) = Z 1

0 x L(x + θ(x x), η), x x dθ. (32)

It follows that

L(x , η) L(x, η) x L(x, η), x x

0 x L(x + θ(x x), η) x L(x, η), x x dθ

0 (L0 + L1| ηL(x, η)|) θ x x 2dθ

L0 + L1| ηL(x, η)|

2 x x 2, (33)

where the first inequality is due to the partially generalized smoothness condition and this completes the proof.

A.4 PROOF OF THEOREM 2

Proof. Since L(x, η) is L2-smooth in η, we have that

L(xt, ηt+1) L(xt, ηt) ηL(xt, ηt), αt ηL(xt, ηt) + L2

2 (αt ηL(xt, ηt))2.

For αt = 1 L2 , we have that

2 ( ηL(xt, ηt))2 L(xt, ηt) L(xt, ηt+1). (34)

For the update of x, by Lemma 2 we have that

L(xt+1, ηt+1) L(xt, ηt+1) x L(xt, ηt+1), βt x L(xt, ηt+1)

+ L0 + L1| ηL(xt, ηt+1)|

2 βt x L(xt, ηt+1) 2. (35)

For βt = min 1 2L0 , 1 2L1| ηL(xt,ηt+1)| 1 L0+L1| ηL(xt,ηt+1)|, from equation 35, it follows that

2 x L(xt, ηt+1) 2 L(xt, ηt+1) L(xt+1, ηt+1). (36)

By taking the sum of equation 34 and equation 36 from t = 0 to T 1, we then have that

2 ( ηL(xt, ηt))2 + βt

2 x L(xt, ηt+1) 2 L(x0, η0) L(x T , ηT )

L(x0, η0) inf x,η L(x, η). (37)

Published as a conference paper at ICLR 2025

For T L(x0,η0) infx,η L(x,η)

ϵ2 max(8L2, 16L0), we can find some t < T such that

αt( ηL(xt, ηt))2 + βt x L(xt, ηt+1) 2 ϵ2

For this t, we have that

αt( ηL(xt, ηt))2 ϵ2

which leads to | ηL(xt, ηt)| ϵ

| ηL(xt, ηt+1)| | ηL(xt, ηt)| + | ηL(xt, ηt+1) ηL(xt, ηt)| (1 + L2αt)| ηL(xt, ηt)| ϵ. (39)

Moreover, since | ηL(xt, ηt+1)| ϵ L0

L1 , we have that βt = min 1 2L0 , 1 2L1| ηL(xt,ηt+1)| =

1 2L0 . It follows that

1 2L0 x L(xt, ηt+1) 2 1 2L0

which implies that x L(xt, ηt+1) ϵ

2. This completes the proof.

A.5 PROOF OF LEMMA 3

Proof. We first consider the variance of the gradient to η. We first have that

VS P0[ ηL(x, η, S)]

=ES P0[( ηL(x, η, S) ηL(x, η))2]

2ES1,S2 P0[( ηL(x, η, S1) ηL(x, η, S2))2]

2G2ES1,S2 P0

" (ψ ) ℓ(x, S1) Gη

(ψ ) ℓ(x, S1) Gη

G2M 2λ 2VS P0[ℓ(x, S)]

G2M 2λ 2σ2 = D2, (41)

where the first inequality is due to the M-smoothness in η, S1, S2 are independent random variables with the same distribution P0 and the last inequality is due to Assumption 3. We then provide the proof for the variance of the gradient to x. Similarly, we have that

VS P0[ x L(x, η, S)]

2ES1,S2 P0[ x L(x, η, S1) x L(x, η, S2) 2]

(ψ ) ℓ(x, S1) Gη

2 ( ℓ(x, S1) ℓ(x, S2))

+ ES1,S2 P0

" (ψ ) ℓ(x, S1) Gη

(ψ ) ℓ(x, S2) Gη

2 ℓ(x, S2) 2 #

" (ψ ) ℓ(x, S1) Gη

+ 2G2M 2λ 2σ2

8(G2 + | ηL(x, η)|2 + D2) + 2G2M 2λ 2σ2

8G2 + 10G2M 2λ 2σ2 + 8| ηL(x, η)|2, (42)

where the second inequality is due to the G-continuous property of ℓ, M-smoothness of ψ and

Assumption 3. The third inequality is due to the fact that ES1 P0

(ψ ) ℓ(x,S1) Gη

λ 2 = ES1 P0 h (ψ ) ℓ(x,S1) Gη

λ i 2 +VS1 P0 h (ψ ) ℓ(x,S1) Gη

λ i . This completes the proof.

Published as a conference paper at ICLR 2025

A.6 FULL VERSION OF THEOREM 3 AND ITS PROOF

Theorem 5. Let α = 1 2L2 , βt = min 1 2L0 , ϵ L0 vt and γ = 1 36L2 1L2 . Set-

ting T max(8L2, 4L0) 5L(x0,η0) 5 infx,η L(x,η)

ϵ2 , N1 max 20D2

ϵ2 , 10D2L0

ϵ2 , 12D1L2

L0 , D0+4ϵ2D1

ϵ2 and ϵ2 8γL4 0 5 min 1 8L2 , 1 4L0

, we have that

min t<T E[ x,ηL(xt, ηt+1) ] 6ϵ.

Proof. Since L(x, η) is L2-smooth in η, we have that

L(xt, ηt+1) L(xt, ηt) ηL(xt, ηt), αtgt + L2

L(xt, ηt) ηL(xt, ηt), αtgt + α2 t L2(( ηL(xt, ηt))2 + (gt ηL(xt, ηt))2).

Taking the expectation on both sides of the above inequality, we can further show that

E[L(xt, ηt+1)] E[L(xt, ηt)] E[αt( ηL(xt, ηt))2]

+ E[α2 t L2(( ηL(xt, ηt))2 + (gt ηL(xt, ηt))2)], (43)

since gt is an unbiased estimate of ηL(xt, ηt). According to Lemma 2, for any γ > 0, we have that

L(xt+1, ηt+1) L(xt, ηt+1) x L(xt, ηt+1), βtvt + L0 + L1| ηL(xt, ηt+1)|

L(xt, ηt+1) x L(xt, ηt+1), βtvt

+ L0 + L1| ηL(xt, ηt)| + L1L2|αtgt|

L(xt, ηt+1) βt

2 ( vt 2 + x L(xt, ηt+1) 2 vt x L(xt, ηt+1) 2)

+ L0 + (L1 + L1L2αt)| ηL(xt, ηt)| + L1L2αt|gt ηL(xt, ηt)|

L(xt, ηt+1) βt

2 vt 2 + βt

2 vt x L(xt, ηt+1) 2 + L0

2 β2 t vt 2

+ γ(L1 + L1L2αt)2( ηL(xt, ηt))2 + β4 t 16γ vt 4

+ γ(L1L2αt)2(gt ηL(xt, ηt))2 + β4 t 16γ vt 4, (44)

where the second inequality is due to the L2-smoothness in η, and the last inequality is due to that 2ab a2 + b2 for any a, b R. Taking the expectation on both sides of equation 44 and adding with equation 43, it follows that

E αt L2α2 t γ(L1 + L1L2αt)2 ( ηL(xt, ηt))2 + βt

E[L(xt, ηt) L(xt+1, ηt+1)] + E[(α2 t L2 + γ(L1L2αt)2)(gt ηL(xt, ηt))2]

2 vt x L(xt, ηt+1) 2 + E β4 t 8γ vt 4 . (45)

According to α = 1 2L2 , βt = min 1 2L0 , ϵ L0 vt and Lemma 3, we can further show that

4L2 γ 9L2 1 4

( ηL(xt, ηt))2 + βt

E[L(xt, ηt) L(xt+1, ηt+1)] + 1

4L2 + γL2 1 4

Published as a conference paper at ICLR 2025

+ 1 4L0 E D0 + D1( ηL(xt, ηt+1))2

8γL4 0 . (46)

Moreover, we have that

E[( ηL(xt, ηt+1))2] 2E[( ηL(xt, ηt))2] + 2E[( ηL(xt, ηt+1) ηL(xt, ηt))2]

2E[( ηL(xt, ηt))2] + 2L2 2α2 t E[(gt)2]

(2 + 4L2 2α2 t)E[( ηL(xt, ηt))2] + 4L2 2α2 t E[(gt ηL(xt, ηt))2]

3E[( ηL(xt, ηt))2] + D2

Let γ = 1 36L2 1L2 and N2 12D1L2

L0 . By equation 46 and equation 47, we then have that

8L2 ( ηL(xt, ηt))2 + βt

E[L(xt, ηt) L(xt+1, ηt+1)] + D2 2L2N1 + D0 4L0N2 + D1 4L0N2

8γL4 0 . (48)

Taking the sum of equation 48 from t = 0 to T 1, we have that

8L2 ( ηL(xt, ηt))2 + βt

T E [L(x0, η0) L(x T , ηT )] + D2 2L2N1 + D0 4L0N2 + D1 4L0N2

8γL4 0 . (49)

Let T max(8L2, 4L0) 5L(x0,η0) 5 infx,η L(x,η)

ϵ2 , N1 max 20D2

ϵ2 , 10D2L0

ϵ2 , 12D1L2

and ϵ2 8γL4 0 5 min 1 8L2 , 1 4L0

. Then equation 49 can be further bounded as follows:

8L2 ( ηL(xt, ηt))2 + βt

4 vt 2 min 1

8L2 , 1 4L0

Thus we can find some t < T such that E h 1 8L2 ( ηL(xt, ηt))2i ϵ2 8L2 and E h βt

4 vt 2i 1 4L0 ϵ2.

Based on equation 47, we then have that (E[ ηL(xt, ηt+1)])2 E[( ηL(xt, ηt+1))2] 4ϵ2. Moreover, we have that

L0 E[βt vt 2] = E ϵ2

L0 min vt 2

where the second inequality is due to the fact that min a2

2 holds for any a 0. As a

result, we have that E[ vt ] 3ϵ. For N2 D0+4ϵ2D1

ϵ2 , we have that

(E[ vt x L(xt, ηt+1) ])2 E[ vt x L(xt, ηt+1) ]2 E D0 + D1( ηL(xt, ηt+1))2

It can be further shown that E[ x L(xt, ηt+1) ] E[ vt ]] + E[ vt x L(xt, ηt+1) ] 4ϵ and E[ x,ηL(xt, ηt+1) ] E[| ηL(xt, ηt+1)|] + E[ x L(xt, ηt+1) ] 6ϵ. This completes the proof.

A.7 PROOF OF LEMMA 4

Proof. For any x, x Rd, η R and s S, we have that

| ηL(x, η, s) ηL(x , η, s)|

=G (ψ ) ℓ(x, s) Gη

(ψ ) ℓ(x , s) Gη

Published as a conference paper at ICLR 2025

λ x x , (52)

where the inequality is due to that ψ is M-smooth and ℓ(x, s) is G-continuous in x. Similarly, for any x Rd, η, η R and s S, we have that

x L(x, η, s) x L(x, η , s)

= (ψ ) ℓ(x, s) Gη

ℓ(x, s) (ψ ) ℓ(x, s) Gη

G (ψ ) ℓ(x, s) Gη

(ψ ) ℓ(x, s) Gη

λ |η η |, (53)

where the first inequality is because that ℓ(x, s) is G-continuous in x and the last one is due to the M-smoothness of ψ .

A.8 FULL VERSION AND ITS PROOF OF THEOREM 4

Let c0 = max(32L2, 8L0), c1 = 4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 16L2 1L2 5D1L3 0

, c2 = max 1 8L2 + L1

c3 = 1 + L2 40L0 + L0D1+L0+2L0L2D2

L2 + 33L2 1 5L0L2 + L0D1

15L2 2 + L2 1 2L0L2 2 and c4 = 17

For N1 6D2c0c2

ϵ2 , N2 max 20q D1L2

L0 , 20qc2L2, 12q L2 2c0c2 L2 0 , q , N3

max 200D1L2

L0 , 3c0(D0+4D1D2)

2L0ϵ2 , N4 max 5q L2

L0 , 6qc1c0

min L3 0N4 1480L2 1L2q, L4 0 42L1c0 , 1 , we have the following theorem:

Theorem 6. Let α = 1 4L2 , βt = min 1 2L0 , ϵ L0 vt for Algorithm 3. Setting T

6c0(L(x0,η0) infx,η L(x,η))

ϵ2 , we have that

min t<T E[ x,ηL(xt, ηt+1) ] c4ϵ.

Proof. Similar to equation 43, for the update of η, we have that

E[L(xt, ηt+1)] E L(xt, ηt) ηL(xt, ηt), αtgt + L2

E[L(xt, ηt)] 1

2E[αt(gt)2] + 1

2E[αt(gt ηL(xt, ηt))2] + L2

2 E[α2 t(gt)2].

Similar to equation 44, for the update of x and any γ > 0, we have that

L(xt+1, ηt+1) L(xt, ηt+1) x L(xt, ηt+1), βtvt

+ L0 + L1| ηL(xt, ηt)| + L1L2|αtgt|

L(xt, ηt+1) βt

2 ( vt 2 + x L(xt, ηt+1) 2 vt x L(xt, ηt+1) 2)

+ L0 + L1| ηL(xt, ηt) gt| + (L1 + L1L2αt)|gt|

L(xt, ηt+1) βt

2 vt 2 + βt

2 vt x L(xt, ηt+1) 2 + L0

2 β2 t vt 2

+ γ(L1 + L1L2αt)2(gt)2 + β4 t 16γ vt 4

+ γL2 1|gt ηL(xt, ηt)|2 + β4 t 16γ vt 4. (55)

Published as a conference paper at ICLR 2025

Combine equation 54 and equation 55 and it follows that

2αt L2α2 t 2 γ(L1 + L1L2αt)2 (gt)2 + βt

E[L(xt, ηt) L(xt+1, ηt+1) + β4 t 8γ vt 4

2 vt x L(xt, ηt+1) 2 + αt

2 (gt ηL(xt, ηt))2 + γL2 1(gt ηL(xt, ηt))2].

Setting α = 1 4L2 and βt = min 1 2L0 , ϵ L0 vt , we have that

E[( 3 32L2 γ 25L2 1 16 )(gt)2 + βt

E[L(xt, ηt) L(xt+1, ηt+1)] + ϵ4

8γL4 0 + 1 4L0 E[ vt x L(xt, ηt+1) 2]

8L2 + γL2 1

E[(gt ηL(xt, ηt))2]. (56)

Let γ = 1 50L1 and take the sum of equation 55 from t = 0 to T 1. According to Lemma 5, we then have that XT 1 t=0 1 16L2 E[(gt)2] + E[βt

E[L(x0, η0) L(x T , ηT )] + 7ϵ4L1T

4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 64qϵ2L2 1L2 2 N2L4 0

8N3 + 33qϵ2L2 1 N4L2 0 + q 8N4 + q D1 2N2N3 + 4q2ϵ2L2 1 N2N4L2 0

N1 + 2q L2 2 N2L2 0 ϵ2 + q 8N2 (gt)2 . (57)

Let c0 = max(32L2, 8L0), c1 = 4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 16L2 1L2 5D1L3 0

and c2 = max 1 8L2 + L1

For N3 max 200D1L2

L0 , 3c0(D0+4D1D2)

2L0ϵ2 , N4 max 5q L2

L0 , 6qc1c0

min L3 0N4 1480L2 1L2q, L4 0 42L1c0 , 1 , N2 max 20q D1L2

L0 , 20qc2L2, 12q L2 2c0c2 L2 0 , q and N1 6D2c0c2

ϵ2 , we then have that XT 1 t=0 E 1 32L2 (gt)2 + βt

E[L(x0, η0) L(x T , ηT )] + ϵ2T

6c0 + D0 + 4D1D2

+ ϵ2T q N4L0

4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 16L2 1L2 5D1L3 0

ϵ2T 2q L2 2 N2L2 0

E[L(x0, η0) L(x T , ηT )] + 5ϵ2T

For T = nq 6c0(L(x0,η0) infx,η L(x,η))

ϵ2 , we can find some t0 mod q = 0 such that

t =t0 E 1 32L2 (gt )2 + βt

32L2 . (59)

Published as a conference paper at ICLR 2025

Moreover we can find some t [t0, t0 + q 1) such that

1 32L2 E[(gt)2] + E[βt

Based on equation 60 and c0 = max(32L2, 8L0), we have that E[(gt)2] ϵ2. Based on equation 67 and equation 68, we can further show that

E[(gt ηL(xt, ηt))2] D2

t =t0 E 2L2 2 N2L2 0 ϵ2 + 2L2 2 N2 α2 t 1(gt )2

60L2 . (61)

Thus we have that

E[| ηL(xt, ηt+1)|] E[|gt ηL(xt, ηt)| + |gt| + L2αt|gt|] 5

Moreover, we can show that

L0 E[βt vt 2] = E ϵ2

L0 min vt 2

where the second inequality is due to the fact that min a2

2 holds for any a 0. As a

result, we have that E[ vt ] 3ϵ. Based on equation 71 and equation 72, we can further show that

E[ vt x L(xt, ηt+1) 2]

E D0 + 4D1(gt0)2 + 2D1L2 2α2 t0(gt0)2

N3 + 4D1(gt0 ηL(xt0, ηt0))2

t =t0+1 E 2

N4 ϵ2 2 + 8L2 1L2 2 L2 0 α2 t 1(gt 1)2 + 4L2 1 L2 0 D2

N4 L2 2α2 t(gt )2

t =t0+1 E 16L2 1 L2 0 (gt 1)2 + 16L2 1 L2 0 ( ηL(xt 1, ηt 1) gt 1)2

N3 + 4D1 + 2D1L2 2α2 t0 N3 + ϵ2L2 1 N4L2 0 + L2 2 8N4L2 0 + 32ϵ2L2 1 N4L2 0

t =t0+1 E[(gt )2]

N4 + 8qϵ2L2 1D2 N4 + 4D1

N3 + 32ϵ2L2 1 N4L2 0

t =t0 E[( ηL(xt , ηt ) gt )2]

L2 + 33L2 1 5L0L2 + L2 40L0

5L2 ϵ2 + 8L0L2D2

5L2 ϵ2 + 4D1L0

5L2 + 32L2 1 5L0L2

1 + L2 40L0 + L0D1 + L0 + 2L0L2D2

L2 + 33L2 1 5L0L2 + L0D1

15L2 2 + L2 1 2L0L2 2

Thus we have that

E[ x L(xt, ηt+1) ] E[ vt x L(xt, ηt+1) + vt ] (3 + c3)ϵ.

It then follows that E[ x,ηL(xt, ηt+1) ] E[| ηL(xt, ηt+1)|]+E[ x L(xt, ηt+1) ] c4ϵ. This completes the proof.

Published as a conference paper at ICLR 2025

A.9 FULL VERSION AND ITS PROOF OF LEMMA 5

Lemma 6. With the parameters selected in Theorem 4, for each t0 < T that can be divided by q, we have that

t=t0 E[(gt ηL(xt, ηt))2]

N1 + 2q L2 2 N2L2 0 ϵ2 + 2q L2 2 N2 αt(gt)2 (64)

and Xt0+q 1 t=t0 E[ vt x L(xt, ηt+1) 2]

4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 64qϵ2L2 1L2 2 N2L4 0

8N3 + 33qϵ2L2 1 N4L2 0 + q 8N4 + q D1 2N2N3 + 4q2ϵ2L2 1 N2N4L2 0

(gt)2 . (65)

Proof. If t mod q = 0, gt is an unbiased estimate of ηL(xt, ηt) and according to Lemma 3, we have that

E[(gt ηL(xt, ηt))2] D2

Otherwise, we have that

E[(gt ηL(xt, ηt))2]

=E[( ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) + gt 1 ηL(xt, ηt))2] =E[( ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) + ηL(xt 1, ηt 1)

ηL(xt, ηt))2] + E[(gt 1 ηL(xt 1, ηt 1))2], (67)

where the last inequality is due to the fact that ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) is an unbiased estimate of ηL(xt, ηt) ηL(xt 1, ηt 1). We now focus on the first term of equation 67, which can be further bounded as follows:

E[( ηL(xt, ηt, B2) ηL(xt 1, ηt 1, B2) + ηL(xt 1, ηt 1) ηL(xt, ηt))2]

N2 E[( ηL(xt, ηt, S) ηL(xt 1, ηt 1, S))2]

N2 E[( ηL(xt, ηt, S) ηL(xt 1, ηt, S))2 + ( ηL(xt 1, ηt, S) ηL(xt 1, ηt 1, S))2]

N2 L2 2(α2 t 1(gt 1)2 + β2 t 1 vt 1 2)

E 2L2 2 N2L2 0 ϵ2 + 2L2 2 N2 α2 t 1(gt 1)2 , (68)

where the first inequality is due to the fact that the square of expectation is not larger than the expectation of square, the third inequality is due to the continuous properties shown in Lemma 4, and the last inequality is due to the fact that βt = min 1 2L0 , ϵ L0 vt . Combining equation 66, equation 67, and equation 68, for t0 mod q = 0, we have that

t=t0 E[(gt ηL(xt, ηt))2]

N1 + 2q L2 2 N2L2 0 ϵ2 + q 8N2 (gt)2 . (69)

We then focus on the estimate of the gradient to x. If t mod q = 0, we have that

E[ vt x L(xt, ηt+1) 2]

E D0 + D1( ηL(xt, ηt+1))2

Published as a conference paper at ICLR 2025

E D0 + 2D1( ηL(xt, ηt))2 + 2D1L2 2α2 t(gt)2

E D0 + 4D1(gt)2 + 2D1L2 2α2 t(gt)2

N3 + 4D1(gt ηL(xt, ηt))2

where the second inequality is due to the L2-smoothness on η and the update of η. Otherwise, we have that

E[ vt x L(xt, ηt+1) 2]

=E[ x L(xt, ηt+1, B4) x L(xt 1, ηt, B4) + vt 1 x L(xt, ηt+1) 2]

=E[ x L(xt, ηt+1, B4) x L(xt 1, ηt, B4) + x L(xt 1, ηt) x L(xt, ηt+1) 2]

+ E[ vt 1 x L(xt 1, ηt) 2], (71)

since x L(xt, ηt+1, B4) x L(xt 1, ηt, B4) is an unbiased estimate of x L(xt, ηt+1) x L(xt 1, ηt). We now focus on the first term of equation 71, which can be further bounded as follows:

E[ x L(xt, ηt+1, B4) x L(xt 1, ηt, B4) + x L(xt 1, ηt) x L(xt, ηt+1) 2]

N4 E[ x L(xt, ηt+1, S) x L(xt 1, ηt, S) 2]

N4 E[ x L(xt, ηt+1, S) x L(xt, ηt, S) 2 + x L(xt, ηt, S) x L(xt 1, ηt, S) 2]

N4 E[(L2 2α2 t(gt)2 + (2L2 0 + 2L2 1| ηL(xt 1, ηt, S)|2)β2 t 1 vt 1 2)]

N4 ϵ2 2 + 4L2 1 L2 0 ( ηL(xt 1, ηt))2 + 4L2 1 L2 0 D2

N4 L2 2α2 t(gt)2

N4 ϵ2 2 + 8L2 1 L2 0 ( ηL(xt 1, ηt 1))2 + 8L2 1L2 2 L2 0 α2 t 1(gt 1)2 + 4L2 1 L2 0 D2

N4 L2 2α2 t(gt)2

N4 ϵ2 2 + 8L2 1L2 2 L2 0 α2 t 1(gt 1)2 + 4L2 1 L2 0 D2

N4 L2 2α2 t(gt)2

N4 E 16L2 1 L2 0 (gt 1)2 + 16L2 1 L2 0 ( ηL(xt 1, ηt 1) gt 1)2 , (72)

where the third inequality is due to Lemma 4 , the fourth inequality is due to βt = min 1 2L0 , ϵ L0 vt and Lemma 3. Combine equation 70, equation 71 and equation 72, and for t0 mod q = 0, we have that Xt0+q 1 t=t0 E[ vt x L(xt, ηt+1) 2]

4 + 8L2 1D2 L2 0

8N3 + 33qϵ2L2 1 N4L2 0 + q 8N4

N3 + 32qϵ2L2 1 N4L2 0

(gt ηL(xt, ηt))2

4 + 8L2 1D2 L2 0

8N3 + 33qϵ2L2 1 N4L2 0 + q 8N4

Published as a conference paper at ICLR 2025

N3 + 32qϵ2L2 1 N4L2 0

N1 + 2q L2 2 N2L2 0 ϵ2 + 2q L2 2 N2 α2 t(gt)2

4 + 8L2 1D2 L2 0 + 32L2 1D2 N1L2 0 + 64qϵ2L2 1L2 2 N2L4 0

8N3 + 33qϵ2L2 1 N4L2 0 + q 8N4 + q D1 2N2N3 + 4q2ϵ2L2 1 N2N4L2 0

(gt)2 . (73)

This completes the proof.

A.10 DETAILS OF THE EXPERIMENT

A.10.1 LIFE EXPECTANCY DATA

This life expectancy data includes health factors from 193 countries (input features) and the life expectancy (target) from the World Health Organization and United Nations websites. We follow the same data pre-processing as in Chen et al. (2023) to fill the missing data with the medians, censorize and standardize all the features 2, remove redundant features, and a standard Gaussian noise is added to the target for model robustness. Each element of the initial parameter x0 is generated by a standard Gaussian distribution. In our deterministic setting, we compare the methods with fine-tuned learning rates. The iteration number is set to 50. For existing methods, we follow the fine-tuned learning rates in (Chen et al., 2023), where the step size βt = 10 4 for GD, βt = 0.2 for normalized GD and βt = 0.3 min 1 10, 1 x,ηL(xt,ηt) . For our D-GD method, we set αt = βt = 10 4 and for our

D-GD-C method, we set αt = 10 4 and βt = 0.35 min 1 2000, 1 x L(xt,ηt+1) .

In our stochastic setting, we run the experiments for 5000 iterations. We generate the initial x0, η0 by running a normalized GD method with step size βt = 0.2 for 30 iterations. For existing methods, we follow the same setting in (Chen et al., 2023). We set the mini-batch size to 50. For SGD, the step size is βt = 2 10 4. For the normalized SGD with momentum method, the momentum coefficient is set to 10 4 and the step size is set to 8 10 3. For the normalized SPIDER method, we have that step size βt = 4 10 3 and epoch size q = 20. For our D-SGD-C, we set αt = 8 10 5 and βt = 0.05 min( 1

100, 1 vt ). For our D-SPIDER-C, we have that αt = 8 10 5 and

βt = 7.5 10 3 min 2.5, 1 vt .

A.10.2 CIFFAR-10 DATASET

In this part, we conduct experiments on the famous CIFAR-10 dataset (Alex, 2009), which includes 50000 training samples. We employ DRO model to construct a linear classifier. After extracting features from a pre-trained Res Net-50 model (He et al., 2016), for the i-th sample in our dataset, we have input zi R2049 and output yi [10]. The non-convex original loss function is set as ℓ(x, (zi, yi)) = ln(P10 j=1 exp(x j zi x yizi)) + 0.001 P10 j=1 P2049 k=1 ln(1 + |xj,k|), where x R10 2049 is the trainable parameter, and xj is the j-th row. For the DRO model, λ is set to 0.05, and the initial value η0 is set to 0.1. Each element of the initial parameter x0 is generated by a standard Gaussian distribution.

In Figure 3, we provide the training curves with fine-tuned learning rate for SGD, Normalized SPIDER (Chen et al., 2023), Normalized-SGD with momentum (Jin et al., 2021) and our proposed D-SGD-C and D-SPIDER-C methods. The x-axis stands for the training iteration, and y-axis stands for the DRO dual function value. For Figure 3 we can observe that the Normalized-SPIDER and our D-SPIDER-C have similar performance and converge faster than other methods. Our D-SGD-C has a similar performance compared with the Normalized-SGD with momentum method but our D-SPIDER-C outperforms the Normalized-SGD with momentum method.

We set the mini-batch size to 200. For SGD, the step size is βt = 1.5 10 3. For the normalized SGD with momentum method, the momentum coefficient is set to 0.1 and the step size is set to

2https://thecleverprogrammer.com/2021/01/06/life-expectancy-analysis-with-python/

Published as a conference paper at ICLR 2025

0 20 40 60 80 100 Epoch

Loss Comparison Across Methods

Method SGD Normalized-SGDm D-SGD-C Normalized-SPIDER D-SPIDER-C

Figure 3: The numerical results for Cifar-10 dataset.

4 10 3. For the normalized SPIDER method, we have that step size βt = 3.5 10 3 and epoch size q = 10 and large batch size 2 103. For our D-SGD-C, we set αt = 1 10 3 and βt = min( 2 1000, 1 10 vt ). For our D-SPIDER-C, we have that αt = 1 10 3 and βt =

min 5 2000, 1 20 vt .

A.11 RESULTS FOR NORMALIZED MOMENTUM

Algorithm 4 D-SGD-M Input: initialization x0, η0, step sizes α, β, r1, r2 number of interactions T

1: t 1 2: while t T do 3: Draw one sample s from P0 and compute gt 1 ηL(xt 1, ηt 1, s) 4: if t==1 then 5: Draw one sample s from P0 and m0 ηL(xt 1, ηt 1, s) 6: end if 7: mt r1mt 1 + (1 r1)gt 1 8: ηt ηt 1 α mt

mt 9: Draw one sample s from P0 and compute vt 1 x L(xt 1, ηt, s) 10: if t==1 then 11: Draw one sample s from P0 and w0 x L(xt 1, ηt, s) 12: end if 13: wt r2wt 1 + (1 r2)vt 1 14: xt xt 1 β wt wt 15: t t + 1 16: end while

In this section, we provide result for the normalized momentum method, which is shown in Algorithm 4. The following proposition establishes the convergence and complexity of Algorithm 4.

Published as a conference paper at ICLR 2025

Proposition 1. Set 1 r1, 1 r2 O(ϵ2), β O(ϵ 3), α = O( D1β) and T O(ϵ 4) for Algorithm 4. We then have that

min t<T E[ x,ηL(xt, ηt+1) ] O(ϵ).

Note that compared with the normalized momentum algorithm for generalized smooth objective (Jin et al., 2021), Algorithm 4 does not require the full gradient information for initialization or the use of mini-batches. This is because that due to our partially generalized smoothness, the expectations of m0 L(x0, η0) and v0 L(x0, η1) are no longer unbounded thus the requirement on full gradient is not needed. Moreover, the partially generalized smoothness allows the Lipschitz constant on x bounded by a linear function of gradient on η. Thus by setting α = O( D1β) we can remove the mini-batch required in Jin et al. (2021).