# mgda_converges_under_generalized_smoothness_provably__8db833e6.pdf

Published as a conference paper at ICLR 2025

MGDA CONVERGES UNDER GENERALIZED SMOOTHNESS, PROVABLY

Qi Zhang 1, Peiyao Xiao 2, Shaofeng Zou1 & Kaiyi Ji2

School of Electrical, Computer and Energy Engineering, Arizona State University1

Department of Computer Science and Engineering, University at Buffalo2 {qzhan261,zou}@asu.edu, {peiyaoxi,kaiyiji}@buffalo.edu

Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard L-smooth or boundedgradient assumptions, which typically do not hold for neural networks, such as Long short-term memory (LSTM) models and Transformers. In this paper, we study a more general and realistic class of generalized ℓ-smooth loss functions, where ℓis a general non-decreasing function of gradient norm. We revisit and analyze the fundamental multiple gradient descent algorithm (MGDA) and its stochastic version with double sampling for solving the generalized ℓ-smooth MOO problems, which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of these algorithms and show that they converge to an ϵ-accurate Pareto stationary point with a guaranteed ϵ-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where totally O(ϵ 2) and O(ϵ 4) samples are needed for deterministic and stochastic settings, respectively. We prove that they can also guarantee a tighter ϵ-level CA distance in each iteration using more samples. Moreover, we analyze an efficient variant of MGDA named MGDA-FA using only O(1) time and space, while achieving the same performance guarantee as MGDA.

1 INTRODUCTION

There have been a variety of emerging applications of multi-objective optimization (MOO), such as online advertising (Ma et al., 2018), autonomous driving (Huang et al., 2019), and reinforcement learning (Thomas et al., 2021). Mathematically, the MOO problem takes the following formulation.

F = min x Rm F(x) := (f1(x), f2(x), ..., f K(x)), (1)

where K is the total number of objectives and fk(x) is the k-objective function given model parameters x. Under the stochastic setting, fk(x) = Es[fk(x; s)], where s denotes data sample. In MOO setting, we are interested in optimizing all the objectives simultaneously. However, this problem is challenging due to the gradient conflict that some objectives with larger gradients dominate the update direction at the sacrifice of significant performance degeneration on the less-fortune objectives with smaller gradients. Thus, a widely-adopted target is to find the Pareto stationary point x that the performance of all objectives cannot be further improved without compromising some objectives. A variety of MOO-based methods have been proposed to mitigate this conflict and find a more balanced solution among all objectives. In particular, the multiple gradient descent algorithm (MGDA) (Désidéri, 2012) aims to find a conflict-avoidant (CA) update direction that maximizes the minimal improvement among all objectives and converges to a Pareto stationary point at which there is no common descent direction for all objective functions. This idea then inspired numerous follow-up methods including but not limited to CAGrad (Liu et al., 2021), PCGrad (Yu et al., 2020), Grad Drop (Chen et al., 2020), FAMO (Liu et al., 2024) and Fair Grad (Ban & Ji, 2024) with a

Equal contribution.

Published as a conference paper at ICLR 2025

convergence guarantee in the deterministic setting with full-gradient computations. The theoretical understanding of the convergence and complexity of stochastic MOO is not well-developed until very recently. Liu & Vicente (2021) proposed stochastic multi-gradient (SMG) as a stochastic version of MGDA, and established its convergence guarantee. Zhou et al. (2022) analyzed the non-convergence issues of MGDA, CAGrad and PCGrad in the stochastic setting, and further proposed a convergent approach named CR-MOGM. More recently, Fernando et al. (2022) and Chen et al. (2024) proposed single-loop stochastic MOO methods named Mo Co and Mo Do, and proved their convergence to an ϵ-accurate Pareto stationary point while guaranteeing an ϵ-level average CA distance1 over all iterations. Xiao et al. (2024) proposed a double-loop algorithm named SDMGrad that enables to obtain an unbiased stochastic multi-gradient via a double-sampling strategy. They established the convergence of SDMGrad with a guaranteed ϵ-level CA distance in every iteration, which we call as iteration-wise CA distance.

Figure 1: Local smoothness constant vs. gradient norm for training Seg Net on City Scapes dataset of Task 1.

However, all existing works are limited by the standard Lsmooth and bounded-gradient assumptions. Nevertheless, a recent study (Zhang et al., 2019) indicates that such assumptions may not necessarily be true for the training of neural networks and an alternative (L0, L1)-smoothness condition was observed and studied, which assumes the Lipschitz constant to be linear in the gradient norm and the gradient norm to be potentially infinite. Furthermore, this phenomenon has been consistently observed in our experiments (e.g., see Figure 1). Interestingly, it has been widely observed that MGDA algorithms always converge even under such generalized smoothness conditions (Sener & Koltun, 2018; Liu et al., 2021; Xiao et al., 2024). This naturally raises a thought-provoking question:

Q. Can the fundamental MGDA algorithm provably converge under the generalized smoothness condition, while achieving a sufficiently small CA distance?

This question remains open due to the following challenges. The analysis of existing MOO methods cannot be generalized to this (L0, L1)-smoothness directly due to the possible unbounded smoothness or gradient norm. In addition, all existing works (Reisizadeh et al., 2023; Zhang et al., 2019; Li et al., 2024b;a; Jin et al., 2021; Crawshaw et al., 2022; Chen et al., 2023; Zhang et al., 2024) in generalized smoothness are limited to the single task problems, which are fundamentally different from the MOO problems since even though each single task is generalized smooth, the linear combination of these tasks is not necessarily generalized smooth. In this paper, we provide an affirmative answer to this question. Our main contributions are summarized below.

1.1 OUR CONTRIBUTIONS

We establish a comprehensive convergence analysis for MGDA under the generalized smoothness condition, in both the deterministic and stochastic settings. Moreover, the analysis covers both the average CA distance and the iteration-wise CA distance.

Weakest assumptions in MOO. In this paper, we investigate the generalized ℓ-smooth assumption, where ℓis a general non-decreasing function of gradient norm, and includes both the standard Lsmooth and (L0, L1)-smooth assumptions as special cases. This assumption finds many applications, such as LSTM models (Zhang et al., 2019), transformers (Crawshaw et al., 2022), distributionally robust optimization (Jin et al., 2021) and higher-order polynomial functions (Chen et al., 2023). In addition, we do not make any bounded-gradient assumption, which is required in previous analysis to ensure the bounded multi-gradient approximation. To the best of our knowledge, this is the first work to investigate generalized smoothness in MOO problems. Table 1 presents a detailed comparison of assumptions with existing analyses.

Convergence analysis. We first show that the vanilla MGDA method can converge to an ϵ-accurate Pareto stationary point, while guaranteeing a small ϵ-level average CA distance. A warm start process

1CA distance means the distance between the updating direction and the CA direction. Its formal definition can be found in Section 2.4

Published as a conference paper at ICLR 2025

is introduced before the main loop of MGDA to achieve a more aggressive ϵ-level iteration-wise CA distance. In the stochastic setting, we provide the convergence guarantee for MGDA with a double sampling, which was introduced by Xiao et al. (2024); Chen et al. (2024) to obtain unbiased multi-gradient approximations. Furthermore, we analyze a computationand memory-efficient variant of MGDA, named MGDA with fast approximation (MGDA-FA), which updates the objective weights w using only forward passes of F( ) rather than gradient F, effectively reducing O(K) time and space to O(1) without hurting the performance guarantee.

Sample complexity comparison. To achieve an ϵ-accurate Pareto stationary point and an ϵ-level average CA distance, we show that MGDA require O(ϵ 2) and O(ϵ 4) samples in the deterministic and stochastic settings, respectively. Both of these complexities match with the existing best results. Furthermore, to achieve an ϵ-level iteration-wise CA distance, MGDA with warm start requires an increased number of samples, at the order of O(ϵ 11) and O(ϵ 17) respectively, in both deterministic and stochastic scenarios, due to smaller step sizes and mini-batch data sampling, shown in Table 1. Typically, achieving an ϵ-level iteration-wise CA distance results in much higher sample complexity, such as O(ϵ 24) in Fernando et al. (2022), O(ϵ 12)2 in Chen et al. (2024) and O(ϵ 12) in Xiao et al. (2024) for non-convex stochastic setting. Moreover, we show that MGDA-FA achieves the same performance guarantee as vanilla MGDA.

1.2 RELATED WORKS Gradient-based multi-objective optimization. A variety of gradient manipulation techniques have emerged for simultaneous learning of multiple tasks. One prevalent category of methods adjusts the weights of various objectives according to factors such as uncertainty (Kendall et al., 2018), gradient norm (Chen et al., 2018), and training complexity (Guo et al., 2018). Methods based on MOO have garnered increased attention due to their systematic designs, enhanced training stability, and modelagnostic nature. For instance, Sener & Koltun (2018) framed Multi-Task Learning (MTL) as a MOO problem and introduced an optimization method akin to MGDA (Désidéri, 2012). Afterward, many MGDA-based methods have been proposed to mitigate gradient conflict with promising empirical performance. Among them, PCGrad (Yu et al., 2020) avoids conflict by projecting the gradient of each task on the norm plane of other tasks. Grad Drop (Chen et al., 2020) randomly drops out conflicted gradients. CAGrad (Liu et al., 2021) adds a constraint on the update direction to be close to the average gradient. Nash MTL (Navon et al., 2022) and Fair Grad (Ban & Ji, 2024) formulated MTL as a bargaining game and a resource allocation problem, respectively. Theoretically, Fernando et al. (2022) proposed a provably convergent stochastic MOO method named Mo Co based on an auxiliary tracking variable for gradient approximation. Chen et al. (2024) characterized the trade-off among optimization, generalization, and conflict avoidance in MOO. Xiao et al. (2024) proposed and analyzed a stochastic MOO method named SDMGrad with a preference-oriented regularizer. However, all these works rely on the L-smoothness and bounded-gradient assumptions. In contrast, this paper focuses on the MOO problems with generalized ℓ-smooth objectives.

Generalized smoothness. The generalized (L0, L1)-smoothness was firstly proposed by Zhang et al. (2019), which was observed from extensive empirical experiments in training neural networks. A clipping algorithm was developed by Zhang et al. (2019) and the convergence rate was provided. Later, Jin et al. (2021) analyzed the convergence of a normalized momentum method. The SPIDER algorithm was also applied to solve generalized smooth problems in Reisizadeh et al. (2023); Chen et al. (2023), where Chen et al. (2023) studied a new notion of α-symmetric generalized smoothness, which includes (L0, L1)-smoothness as a special case. Very recently, a new generalized ℓ-smoothness condition was studied in Li et al. (2024a;b), which is the weakest smoothness condition and includes all the smoothness conditions discussed above. However, all the existing works on generalized smoothness are limited to single-task optimizations and the understanding of MOO is insufficient. This paper provides the first study of MOO under the generalized ℓ-smoothness condition.

2 PRELIMINARIES

2.1 GENERALIZED SMOOTHNESS

The standard L-smoothness condition is widely investigated in existing optimization studies (Ghadimi & Lan, 2013; Ghadimi et al., 2016), which assumes a function f : X R to be L-smooth if

2The order differs from (Chen et al., 2024) due to different definitions of the ϵ-accurate Pareto stationarity and is taken when both ϵ-accurate Pareto stationarity and ϵ-level iteration-wise CA distance can be achieved.

Published as a conference paper at ICLR 2025

Method Smoothness 1 Assumption2 Setting Complexity 3

CAGrad (Liu et al., 2021) (LS) (OD) Deterministic N/A PCGrad (Yu et al., 2020) (LS) (C) or (BC) Deterministic N/A SMG (Liu & Vicente, 2021) (LS) (C) or (BC) Stochastic N/A CR-MOGM (Zhou et al., 2022) (LS) (BF) and (BG) Stochastic O(ϵ 4) Mo Co (Fernando et al., 2022) (LS) (BF) and (BG) Stochastic O(ϵ 4) Mo Do (Chen et al., 2024) (LS) (BG) Stochastic O(ϵ 4) SDMGrad (Xiao et al., 2024) (LS) (BG) Stochastic O(ϵ 4) MGDA (Them. 1 and 3 ) (GS) N/A Deterministic O(ϵ 2) Stochastic MGDA (Them. 2 and 6) (GS) N/A Stochastic O(ϵ 4)

Table 1: Comparison to assumptions in existing analyses. MGDA (Désidéri, 2012) assumes the access of optimal update direction and step size for each iteration and gets an asymptotic result thus is omitted in the table. Explanation on the upper footmarks: 1 : (LS) indicates that the objectives are standard L-smooth while (GS) the objectives are generalized ℓ-smooth as defined in Definition 1; 2 : (OD) denotes the assumption of optimal direction in each iteration, (C) denotes the convex loss function assumption, (BC) denotes a lower bound on multi-task curvature, which is defined as H(f, x, x ) = R 1 0 f(x) 2f(x + a(x x )) f(x)da for a function f, (BF) denotes the bounded function value assumption and (BG) denotes the bounded gradient assumption; 3 : Sample complexity achieving ϵ-accurate Pareto stationary point. N/A denotes no exploration on non-convex settings.

there exists a bounded constant L such that for any x, y X, f(x) f(y) L x y . Nevertheless, recent studies show that in the training of neural networks such as LSTM models (Zhang et al., 2019), transformers (Crawshaw et al., 2022), distributionally robust optimization (Jin et al., 2021) and high-order polynomials functions (Chen et al., 2023), the standard L-smoothness assumption does not hold. Instead, a generalized (L0, L1)-smoothness assumption was observed and studied in the training of LSTM models in Zhang et al. (2019), which assumes that for any x X, 2f(x) L0 +L1 f(x) . This assumption implies the Lipschitz constant is potentially unbounded and reduces to the L-smoothness if L1 = 0. Later, a more generalized assumption was proposed and studied in Li et al. (2024a):

Definition 1. (Generalized ℓ-smoothness, Definition 1 in Li et al. (2024a)). A real-valued differentiable function f : X R is generalized ℓ-smooth if 2f(x) ℓ( f(x) ) almost everywhere in X, where ℓ: [0, + ) (0, + ) is a continuous non-decreasing function.

The (L0, L1)-smoothness is a special case of generalized ℓ-smoothness, where ℓ(a) = L0 + L1a. Another definition of generalized smooth is widely used and equivalent to the ℓ-smoothness:

Definition 2. ((r, ℓ)-smoothness, Definition 2 in Li et al. (2024a)). A real-valued differentiable function f : X R is (r, ℓ)-smooth if 1) for any x X, B(x, r( f(x) )) X, and 2) for any x1, x2 B(x, r( f(x) )), f(x1) f(x2) ℓ( f(x) ) x1 x2 , where for continuous functions r, ℓ: [0, + ) (0, + ), r is non-increasing, ℓis non-decreasing and B(x, R) is the Euclidean ball centered at x with radius R.

In B(x, r( f(x) )), f is also L-smooth where L = ℓ( f(x) ). Definitions 1 and 2 are equivalent (Li et al., 2024a): An (r, ℓ)-smooth function is ℓ-smooth; and a ℓ-smooth function satisfying Assumption 1 is (r, m)-smooth with m(u) := ℓ(u + a) and r(u) := a/m(u) for any a > 0.

2.2 PARETO CONCEPTS IN MULTI-OBJECTIVE OPTIMIZATION (MOO)

MOO aims to find points at which there is no common descent direction for all objectives. Considering points x1, x2 Rm, we claim that x1 dominates x2 if fi(x1) fi(x2) for all i [K] and F(x1) = F(x2). We say a point is Pareto optimal if it is not dominated by any other point. In other words, we cannot improve one objective without compromising another when we reach a Pareto optimal point. In non-convex settings, MOO aims to find a Pareto stationary point defined as follows.

Definition 3. x Rm is a Pareto stationary point if minw W F(x)w 2 = 0, where W is the probability simplex over [K]. x an ϵ-accurate Pareto stationary point if minw W F(x)w 2 ϵ2.

Published as a conference paper at ICLR 2025

2.3 EXISTING MOO ALGORITHMS

Deterministic MOO. One of the big challenges of MOO is the gradient conflict, i.e., the gradients of different objectives may vary heavily in scale such that the largest gradient dominates the update direction. As a result, the performance of those objectives with smaller gradients (Yu et al., 2020) may be significantly compromised. As the most fundamental MOO algorithm, MGDA tends to find a balanced update direction for all objectives by considering the minimum improvement across all objectives and maximizes it by solving the following problem

max d Rm min i [K]

α(fi(x) fi(x αd)) o max d Rm min i [K] fi(x), d , (2)

where α is the step size, d is the update direction, and the first-order Taylor approximation is applied at x. To efficiently solve the above problem in eq. (2), we substitute the following relation

max d Rm min i [K] fi(x), d 1

2 d 2 = max d Rm min w W

k=1 fi(x)wi, d E 1

where the regularization term 1

2 d 2 is to regulate the magnitude of our update direction. The solution to the problem in eq. (3) can be obtained by solving the following problem (Désidéri, 2012):

d = F(x)w ; s.t. w arg min w W 1 2 F(x)w 2. (4)

The above approach has been widely used in various variants of MGDA such as SDMGrad, CAGrad, and PCGrad (Xiao et al., 2024; Yu et al., 2020; Liu et al., 2021).

Stochastic MOO. SMG (Liu & Vicente, 2021) is the first stochastic MGDA. It directly replaces the gradients with stochastic gradients and the update rule becomes

d s = F(x; s)w s; s.t. w s arg min w W 1 2 F(x; s)w 2,

where F(x; s) is the estimate of F(x) based on the sample s. However, this leads to a biased gradient estimation of the update direction d s, and thus it requires an increasing batch size. To solve this issue, another work Mo Co (Fernando et al., 2022) introduces a tracking variable Y as a stochastic estimation of the true gradient. Afterward, a double-sampling strategy is proposed by Chen et al. (2024); Xiao et al. (2024) to generate a near-unbiased update direction.

Strong assumptions in analysis. All the works mentioned above require bounded gradients such as Chen et al. (2024); Fernando et al. (2022); Xiao et al. (2024) or L-smoothness such as Ban & Ji (2024); Liu et al. (2021); Navon et al. (2022); Yang et al. (2024). Their analyses do not apply to generalized ℓ-smoothness objectives, since the Lipschitz constant is potentially infinity.

2.4 CONFLICT-AVOIDANT (CA) DIRECTION AND CA DISTANCE

We call the update direction d in eq. (4) the conflict-avoidant (CA) direction since it mitigates gradient conflict. Though it may not be feasible to calculate the exact CA direction, we aim to find an update direction to be close to the CA direction. Therefore, measuring the gap between the CA direction and the estimated update direction is important, which we define as the CA distance. Definition 4. d d is the CA distance between estimated update direction d and CA direction d .

The larger the CA distance is, the further the estimated update direction will be away from the CA direction, and the more conflict there will be. In single-loop algorithms, Mo Co (Fernando et al., 2022) ensures the average CA distance over iteration is of the order of ϵ, while Mo Do (Chen et al., 2024) guarantees the ϵ-order iteration-wise CA weight distance (as stated in their Theorem 3.4). Meanwhile, the double-loop algorithm SDMGrad (Xiao et al., 2024) guarantees an ϵ-order CA distance in every iteration. In this work, we analyze the CA distance in both cases and provide convergence results.

3 MGDA ALGORITHMS UNDER GENERALIZED SMOOTHNESS

3.1 MGDA WITH AND WITHOUT WARM START

It has been shown in eq. (4) that MGDA needs to approximate the optimal weight w and the optimal updating direction d . However, since the optimal weight w of the convex function is not unique,

Published as a conference paper at ICLR 2025

we deal with this issue by adding an ℓ2 regularization term and the problem becomes

w ρ = arg min w W 1 2 F(x)w 2 + ρ

Besides the benefit of a unique solution, adding an ℓ2 regularization term also makes w ρ(x) Lipschitz continuous (Fernando et al., 2022). Note that w (x) may not be Lipschitz continuous because F(x) F(x) may not be positive definite. Nevertheless, the analysis of CA distance is difficult because w may not be Lipschitz continuous. Thus, we will characterize the gap between w and w ρ plus the change of w ρ after adding this ℓ2 regularization term. As a result, the update rules become Lines 5-6 in Algorithm 1. We first update wt by a projected gradient descent process and compute the update direction dt = F(xt)wt to update model parameters.

For our single-loop algorithm, CA distance is proportional to the term wt w t,ρ , which decreases as the algorithm iterates with some error terms controlled by appropriately chosen small step sizes. If we initialize w0 randomly, w0 w 0,ρ will be a constant order, and so will the first CA distance. Meanwhile, we can only get an ϵ-order CA distance after a certain iteration number t > 1 when wt w t ,ρ takes an ϵ order. Thus, we introduce an extra warm start process using Algorithm 2 to guarantee the new w0 is close enough to w 0,ρ and a small level CA distance in every iteration. However, this warm start process is not needed if we only require a small averaged CA distance.

Algorithm 1 Single loop MGDA with and without warm start

1: Initialize: model parameters x0, weights w0 and a constant ρ 2: Option I: w0=warm-start(w0, x0, ρ) # for analyzing iteration-wise CA distance 3: Option II: w0 w0 # for analyzing averaged CA distance 4: for t = 0, 1, ..., T 1 do 5: wt+1 = ΠW wt β[ F(xt) F(xt)wt + ρwt]

6: xt+1 = xt α F(xt)wt 7: end for

Algorithm 2 warm-start(w0, x0, ρ)

1: for n = 0, 1, ..., N 1 do 2: wn+1 = ΠW wn β [ F(x0) F(x0)wn + ρwn]

3: end for 4: Output w N

3.2 STOCHASTIC MGDA WITH DOUBLE SAMPLING

In the stochastic setting, our algorithm keeps the same structure, having a warm start process if we aim to control the CA distance in every iteration. In Algorithm 2, we do the same projected gradient descent without using stochastic gradients. This is because we only need to compute F(x0) F(x0) once and reuse it in the whole loop, which does not bring a computational burden. Then in the update loop, we update the weight and model parameters accordingly. We use a doublesampling strategy here to make the weight gradient estimator unbiased (Xiao et al., 2024) such that dt is a near-unbiased multi-gradient E[ G2(xt) G3(xt)wt + ρwt] = F(xt) F(xt)wt + ρwt, where G2(xt) and G3(xt) are independent and unbiased estimates of F(xt). Similarly, we do not involve a warm start process if we require the average CA distance to be small.

3.3 MGDA WITH FAST APPROXIMATION

It can be seen from Algorithm 1 and Algorithm 3 that MGDA requires O(K) space and time to compute and store all task gradients at each iteration for updating the weight wt. This becomes a drawback when the number of tasks or the model size is large. Motivated by Liu et al. (2024), one solution is to use the Taylor Theorem to approximate the gradient for updating the weight wt as

F(xt) F(xt+1) = F(xt) (xt xt+1) R(xt) = α F(xt) F(xt)wt R(xt),

where R(xt) is the remainder term and it takes the order R(xt) = o( xt xt+1 2), which can be made sufficiently small by adjusting the step size. By incorporating this fast approximation (FA)

Published as a conference paper at ICLR 2025

in Algorithm 1, we then present MGDA-FA in Algorithm 4 (shown in Appendix C), where xt is updated along the update direction dt = F(xt)wt to get xt+1 following by the update rule of wt

wt+1 = ΠW wt β h F(xt) F(xt+1)

α + ρwt i . (6)

As a result, in the model parameters update process, MGDA-FA only requires one backward process by calculating the gradient of F(xt)wt w.r.t. xt without storing it, and additional forward processes to compute F(xt+1) in the weight update process. This approach saves computational and memory costs in the practical implementation significantly. Most importantly, we provide a theoretical guarantee for this efficient method (in eq. (6)).

Algorithm 3 Stochastic MGDA with Double Sampling

1: Initialize: model parameters x0, weights w0 and a constant ρ 2: Option I: w0=warm-start(w0, x0, ρ) # for analyzing iteration-wise CA distance 3: Option II: w0 w0 # for analyzing averaged CA distance 4: for t = 0, 1, ..., T 1 do 5: xt+1 = xt α G1(xt)wt 6: wt+1 = ΠW wt β[ G2(xt) G3(xt)wt + ρwt]

# double sampling 7: end for

4 CONVERGENCE ANALYSIS UNDER AVERAGE CA DISTANCE

In this section, we provide the theoretical results for Algorithms 1 and 3 without warm starts to obtain an ϵ-accurate Pareto stationary point, with the average CA distance over iterations in O(ϵ).

4.1 DETERMINISTIC SETTING

Assumption 1. Each objective function fi i [K] is twice differentiable and lower bounded by f i := infx Rm fi(x) > . Assumption 2. Each objective function fi i [K] is generalized ℓ-smooth defined in Definition 1, where ℓ: [0, + ) (0, + ) is a continuous non-decreasing function such that φ(a) = a2 2ℓ(2a) is monotonically increasing for any a 0.

These assumptions are the most relaxed ones in existing MOO works since they directly assume objective smoothness or gradient/function value boundness (Liu et al., 2021; Fernando et al., 2022; Navon et al., 2022; Xiao et al., 2024; Chen et al., 2024; Yang et al., 2024; Ban & Ji, 2024). It also includes the widely studied standard L-smoothness (Nemirovskij & Yudin, 1983; Ghadimi & Lan, 2013; Ghadimi et al., 2016), (L0, L1)-smoothness (Zhang et al., 2019) as special cases. Moreover, for any 0 γ 2 and x X, our assumption even holds for function f such that 2f(x) L0 + L1 f(x) γ, where γ are limited to [0, 1] in Chen et al. (2023).

Let c > 0 and F > 0 be some constants such that + c F, where = maxi [K]{fi(x0) f i }. Define M = sup{z 0|φ(z) F}. We then have the following convergence rate for Algorithm 1: Theorem 1. Let Assumptions 1 and 2 hold. Set β = O( 1 M2 ), α = O( 1 M2 + 1 Mℓ(M+1)), T =

max Θ 1 αϵ2 , Θ 1 βϵ2 and ρ = O(ϵ2). We then have that 1

T PT 1 t=0 F(xt)wt 2 ϵ2.

The full version with detailed constants and detailed proof can be found in Appendix D.1. Theorem 1 provides the first convergence rate to obtain an ϵ-accurate Pareto stationary point for MOO problems with generalized ℓ-smooth objectives. Moreover, it achieves the optimal sample complexity in the order of O(ϵ 2)for GD with a single standard L-smooth objective (Carmon et al., 2020). The MOO problems with generalized ℓ-smooth objectives are challenging due to two reasons: 1) F(x) is potentially unbounded in our generalized ℓ-smoothness setting, making all existing analysis in MOO (Liu et al., 2021; Fernando et al., 2022; Navon et al., 2022; Xiao et al., 2024; Chen et al., 2024; Yang et al., 2024; Ban & Ji, 2024) not applicable. 2) the update of x includes all gradient information from each task, making the existing adaptive methods for single generalized smooth functions invalid.

To solve the challenges in Theorem 1, we find that a bounded function value implies a bounded gradient norm. Thus in our proof, we use induction to show that with parameters selected in

Published as a conference paper at ICLR 2025

Theorem 1, for any w W and t T, we have that F(xt)w is upper bounded by F. Consequently, for any i [K], we have that fi(x) M, which solves the unbounded gradient norm problem in our generalized smoothness setting. Then we can show that F(xt)wt converges.

Corollary 1. Under the same setting in Theorem 1, 1

T PT 1 t=0 F(xt)wt F(xt)w t 2 = O(ϵ2).

The proof is available in Appendix D.2. Corollary 1 shows that the average CA distance converges.

4.2 STOCHASTIC SETTING

In the stochastic setting, we assume that we have access to an unbiased stochastic gradient fi(x; s) instead of the true gradient fi(x), where s is the collected samples. To prove convergence, we have the following assumption. Assumption 3. There exists some σ 0 such that E[ fi(x; s) fi(x) 2] σ2 for any i [K].

Assumption 3 indicates bounded gradient variances, which is widely studied (Xiao et al., 2024; Li et al., 2024a; Fernando et al., 2022). At time t in the stochastic model, let st,i = (st,i,1, st,i,2, ..., st,i,k) be the i-th collection of samples and F(xt; st,i) = (f1(xt; st,i,1), f2(xt; st,i,2), ..., fk(xt; st,i,k)). Note that i [3] because we have 3 samples in Algorithm 3. We choose Gi(xt) = F(xt; st,i). Define εt,i = (εt,i,1, εt,i,2, ..., εt,i,k) = F(xt) Gi(xt).

Let F, c > 0 and 0 < δ 1

2 be some constants such that F 4( +c)

δ and M = sup{z 0|φ(z) F}. Define the following random variables τ1 = min{t| i [K], fi(xt+1) f i > F} T, τ2 = min{t| i [K], j [3], εt,j,i > L0 αρ} T, τ3 = min{t| i, j [K], εt,2,i εt,3,j > L1 αρ} T and τ = min{τ1, τ2, τ3}, where L0, L1 > 0 are some constants and a b denotes min(a, b). We have the following theorem for Algorithm 3:: Theorem 2. Let Assumptions 1, 2, 3 hold. Set ρ = O(δ2ϵ2), β = O(min{ δϵ2

M4 , ρ M2 }), α =

O(min{β, ρ ℓ(M+1)2 , ρ M2 , δ ρT , 1 βT M2 }) and T = Θ(max{ 1 δαϵ2 , M2

δ2ϵ4 }). We then have that with the probability at least 1 δ, 1

T PT 1 t=0 F(xt)wt 2 = O(ϵ2) .

The full version and detailed proof can be found in Appendix D.3. When we set α, β, ρ = O(ϵ2) and T = Θ(ϵ 4), we can find an ϵ-stationary point with the optimal sample complexity in the order of O(ϵ 4) for SGD with a single L-smooth objective (Arjevani et al., 2023). Note that in the proof of Theorem 1, we show for each t T and w W, we have that F(xt)w is bounded by applying a small constant step size α and β. However, this condition does not necessarily hold for our stochastic setting due to the unbounded gradient noise. To solve this problem, we introduce stopping time τ. The advantages are as follows: 1) for any t τ, w W, we have that F(xt)w is bounded; 2) for any t < τ, the norm of gradient noise is bounded; 3) due to the optional stopping theorem, for any w W and i [3], we have that E[Pτ t=0 εt,iw] = 0. Thus, we can further get the following lemma: Lemma 1. Using the parameters selected in Theorem 2, we have that E[F(xτ)w] F w

2 E h Pτ 1 t=0 F(xt)wt 2i .

The proof of Lemma 1 is available in D.4. Lemma 1 indicates that α

2 E h Pτ 1 t=0 F(xt)wt 2i

is bounded by some constant and if τ = T with high probability, we have that

1 T E h PT 1 t=0 F(xt)wt 2 τ = T i = O 1

αT . Note that {τ < T} = {τ2 < T} {τ3 < T} {τ1 <

T, τ2 = T, τ3 = T}. The first two events are related to the gradient noise, where the probabilities can be bounded by Assumption 3 and Chebyshev s inequality. The last event indicates that for some i [K], we have fi(xτ) f i F

2 . Based on Lemma 1 and Markov inequality, we can show that P({τ1 < T, τ2 = T, τ3 = T}) δ

4 and we can further show that P(τ = T) 1 δ

2. We then have that 1

T PT 1 t=0 F(xt)wt 2 converges with high probability. Similar to Corollary 1, Theorem 2 also implies the average of CA distances converges with time with high probability.

5 CONVERGENCE ANALYSIS UNDER ITERATION-WISE CA DISTANCE

In Section 4 we show that the average CA distance is bounded under generalized smooth conditions. The average CA distance is also studied in Mo Co (Fernando et al., 2022) and Mo Do (Chen et al.,

Published as a conference paper at ICLR 2025

2024) with the bounded gradient assumption and these works only focus on guarantees of the average CA distance over iterations. However, a ϵ-level average CA distance only implies the smallest CA distance to be ϵ-level. Since we want to keep the update direction close enough to the CA direction, it is better to have a tighter bound of CA distances. In this section, we show the iterative CA distance is O(ϵ) with a warm-start process and convergence results for Algorithms 1, 3, and 4.

5.1 DETERMINISTIC SETTING

Deterministic setting without fast approximation. We first provide results about bounded iterationwise CA distance for Algorithm 1 with a warm start.

Theorem 3. Let Assumptions 1 and 2 hold. Set β 1 M 2 , ρ = O(ϵ2), β = O(ϵ4), α = O(ϵ9), N = Ω(ϵ 2) as constants, and T = Θ(ϵ 11). All the parameters satisfy the requirements in the formal version of Theorem 1 and we have F(xt)wt F(xt)w t = O(ϵ).

The finite time error bound and the full proof can be found in Appendix E.1. Since our parameters satisfy the requirements in the formal version of Theorem 1, we can find an ϵ-accurate Pareto stationary point with O(ϵ 11) samples. In the analysis of CA distance, we show that the CA distance can be bounded by the term wt w t,ρ plus the strongly-convex constant ρ. Meanwhile, there is a decay relation between wt+1 w t+1,ρ and wt w t,ρ with some error terms controlled by step sizes. Nevertheless, the error terms will accumulate since we do telescoping on this decay relation, which will be the dominating term. Thus, step sizes have to be much smaller than the choices in Theorem 1 to guarantee iteration-wise small CA distance.

Deterministic setting with fast approximation. In this section, we show the convergence rate of Algorithm 4 and bounded iteration-wise CA distance.

Theorem 4. Let Assumptions 1 and 2 hold. Set N = Ω(ϵ 2), β 1 M 2 , ρ = O min ϵ2, 1 αT , β = O(ϵ2), α = O min{β, ϵ2, 1 βT }) as constants, T = Θ max{ 1

αϵ2 , 1 βϵ2 } .

T PT 1 t=0 F(xt)wt 2 = O(ϵ2).

The full version and proof can be found in Appendix E.2. We can easily extend the analysis in Appendix D.1 to the convergence analysis of Algorithm 4 under the average CA distance, because the only extra effort is dealing with the remainder term, which can be bounded by the smallest step size. As a result, the sample complexity remains the same O(ϵ 11) to achieve a Pareto stationary point.

Theorem 5. Let Assumptions 1 and 2 hold. We choose β 1 M 2 , ρ = O(ϵ2), β = O(ϵ4), α = O(ϵ9), N = Ω(ϵ 2) as constants, and T = Θ(ϵ 11). We have that F(xt)wt F(xt)w t = O(ϵ).

5.2 STOCHASTIC SETTING

In this section, we show that Algorithm 3 with a warm start and mini-batches achieves a bounded iteration-wise CA distance with high probability. In this section, we choose Gi(xt) = 1 ns Pnsi i=nsi ns+1 F(xt; st,i), where ns is the size of the mini-batch.

Theorem 6. Let Assumptions 1, 2 and 3 hold. Set β 1 M 2 , α = O(ϵ9), β = O(ϵ4), ρ = O(ϵ2), ns = Ω(ϵ 6), N = Ω(ϵ 2), and T = Θ(ϵ 11), and all the parameters satisfy the requirements in Theorem 2. We then have F(xt)wt F(xt)w t = O(ϵ), with the probability at least 1 δ.

The full version with detailed constants and proof can be found in Appendix E.4. Since our parameters satisfy all requirements in Theorem 2, we can find an ϵ-accurate Pareto stationary point with high probability. Compared with Theorem 2, to guarantee an iteration-wise CA distance, despite our warm start process, a mini-batch method is required in our analysis. This is because given τ = T, the gradient is not unbiased. In Theorem 2, the optional stopping theorem is applied which indicates that the expectation of the cumulative gradient is zero. However, for each iteration, this optional stopping theorem does not hold and the estimated error is controlled by the size of the mini-batch. Then, the sample complexity to get a Pareto stationary point becomes O(ϵ 17) due to necessary mini-batch ns.

6 EXPERIMENTS

In this experiment, we evaluate the performance of the Cityscapes(Cordts et al., 2016) and NYUv2 (Silberman et al., 2012) datasets. The former involves 2 pixel-wise tasks: 7-class semantic

Published as a conference paper at ICLR 2025

Method Segmentation Depth Surface Normal MR m% m Io U Pix Acc Abs Err Rel Err Angle Distance Within t

Mean Median 11.25 22.5 30

STL 38.30 63.76 0.6754 0.2780 25.01 19.21 30.14 57.20 69.15

LS 39.29 65.33 0.5493 0.2263 28.15 23.96 22.09 47.50 61.08 7.89 5.59 SI 38.45 64.27 0.5354 0.2201 27.60 23.37 22.53 48.57 62.32 7.33 4.39 RLW (Lin et al., 2021) 37.17 63.77 0.5759 0.2410 28.27 24.18 22.26 47.05 60.62 9.89 7.78 DWA (Liu et al., 2019) 39.11 65.31 0.5510 0.2285 27.61 23.18 24.17 50.18 62.39 7.11 3.57 UW (Kendall et al., 2018) 36.87 63.17 0.5446 0.2260 27.04 22.61 23.54 49.05 63.65 7.11 4.05 MGDA (Désidéri, 2012) 30.47 59.90 0.6070 0.2555 24.88 19.45 29.18 56.88 69.36 5.56 1.38 Mo Co (Fernando et al., 2022) 40.30 66.07 0.5575 0.2135 26.67 21.83 25.61 51.78 64.85 5.00 0.16 Mo Do (Chen et al., 2024) 35.28 62.62 0.5821 0.2405 25.65 20.33 28.04 54.86 67.37 7.55 0.49 Nash-MTL Navon et al. (2022) 40.13 65.93 0.5261 0.2171 25.26 20.08 28.40 55.47 68.15 3.33 -4.04 FAMO (Liu et al., 2024) 38.88 64.90 0.5474 0.2194 25.06 19.57 29.21 56.61 68.98 3.22 -4.10

MGDA-warm start 40.57 67.17 0.5240 0.2281 25.21 19.74 28.74 55.79 68.21 2.78 -4.42

Table 2: Multi-task supervised learning on NYU-v2 dataset for different MOO methods comparison.

segmentation (Task 1) and depth estimation (Task 2) while the latter involves 3 pixel-wise tasks: 13-class semantic segmentation, depth estimation and surface normal estimation. Following the same experiment setup of (Xiao et al., 2024), we build a Seg Net (Badrinarayanan et al., 2017) as the model. We compare the performance of MGDA with a warm start, which is used to ensure a small CA distance per iteration and mitigate gradient conflicts, against popular MGDA-type methods including MGDA (Désidéri, 2012), PCGrad (Yu et al., 2020), Grad Drop (Chen et al., 2020), CAGrad (Liu et al., 2021), Mo Co (Fernando et al., 2022), Mo Do (Chen et al., 2024), Nash-MTL (Navon et al., 2022), and FAMO (Liu et al., 2024). Since SDMGrad (Xiao et al., 2024) incorporates a preference-based regularization within the MGDA framework, we exclude it from the tables to ensure a fair comparison. We utilize the metric m% to reflect the overall performance, which considers the average per-task performance drop versus the single-task (STL) baseline to assess methods. It can be observed in Table 2 and Table 3 that MGDA-warm start has a much more balanced performance. Meanwhile, the proposed MGDA-FA is much faster as shown in Table 5 in the Appendix.

We also illustrate the relationship between the gradient norm and the local smoothness for each task of the Cityscapes dataset. To do this, we compute them according to the method provided in Section H.3 in Zhang et al. (2019). We scatter the local smoothness constant against gradient norms in Figure 1 for the semantic segmentation task and depth estimation task in Figure 2 (in the appendix), respectively. Both results demonstrate a positive correlation between them, which further substantiates the necessity of our analysis. More experimental details can be found in Appendix B.

Method Segmentation Depth m% m Io U Pix Acc Abs Err Rel Err

STL 74.01 93.16 0.0125 27.77

MGDA (Désidéri, 2012) 68.84 91.54 0.0309 33.50 44.14 PCGrad (Yu et al., 2020) 75.13 93.48 0.0154 42.07 18.29 Grad Drop (Chen et al., 2020) 75.27 93.53 0.0157 47.54 23.73 CAGrad (Liu et al., 2021) 75.16 93.48 0.0141 37.60 11.64 Mo Co (Fernando et al., 2022) 75.42 93.55 0.0149 34.19 9.90 Mo Do (Chen et al., 2024) 74.55 93.32 0.0159 41.51 18.89 Nash-MTL (Navon et al., 2022) 75.41 93.66 0.0129 35.02 6.82 FAMO (Liu et al., 2024) 74.54 93.29 0.0145 32.59 8.13

MGDA-warm start 75.41 93.46 0.0133 31.07 3.93 1.19

Table 3: Multi-task learning on Cityscapes dataset.

7 CONCLUSION

Building upon our observations of MGDA convergence under the (L0, L1) smoothness condition, this paper provides a rigorous convergence analysis for both the fundamental MGDA and its stochastic variant under a more challenging, relaxed, and practical generalized ℓ-smoothness assumption. Furthermore, we introduce a warm start progress to provide a more precise control over the iterationwise CA distance. Our analysis also shows that an efficient variant named MGDA-FA, which uses only O(1) time and space, achieves the same performance guarantee as MGDA. We anticipate that the convergence analysis developed in this work will provide valuable insights for analyzing other MOO algorithms such as CAGrad, PCGrad, Fair Grad and FAMO under the generalized smoothness condition. The warm-start strategy may be of independent interest to other single-loop MOO algorithms to achieve a sufficiently small iteration-wise CA distance.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

The work of Q. Zhang and S. Zou was partially supported by NSF under Grant CCF-2438429. P. Xiao and K. Ji were partially supported by NSF grants CCF-2311274 and ECCS-2326592.

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1): 165 214, 2023.

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoderdecoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481 2495, 2017.

Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. ar Xiv preprint ar Xiv:2402.15638, 2024.

Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery. Convex optimization in signal processing and communications, pp. 42 88, 2009.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71 120, 2020.

Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance. Advances in Neural Information Processing Systems, 36, 2024.

Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pp. 794 803. PMLR, 2018.

Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039 2050, 2020.

Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. ar Xiv preprint ar Xiv:2303.02854, 2023.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213 3223, 2016.

Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signsgd. Advances in Neural Information Processing Systems, 35:9955 9968, 2022.

Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313 318, 2012.

Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2022.

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. ar Xiv preprint ar Xiv:2301.11235, 2023.

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267 305, 2016.

Published as a conference paper at ICLR 2025

Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. Dynamic task prioritization for multitask learning. In Proceedings of the European conference on computer vision (ECCV), pp. 270 287, 2018.

Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE transactions on pattern analysis and machine intelligence, 42(10):2702 2719, 2019.

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34: 2771 2782, 2021.

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482 7491, 2018.

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2024a.

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36, 2024b.

Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor W Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. ar Xiv preprint ar Xiv:2111.10603, 2021.

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878 18890, 2021.

Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36, 2024.

Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1871 1880, 2019.

Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. Annals of Operations Research, pp. 1 30, 2021.

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1930 1939, 2018.

Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. ar Xiv preprint ar Xiv:2202.01017, 2022.

Arkadij Semenoviˇc Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.

Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. ar Xiv preprint ar Xiv:2303.00883, 2023.

Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018.

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pp. 746 760. Springer, 2012.

Philip S Thomas, Joelle Pineau, Romain Laroche, et al. Multi-objective spibb: Seldonian offline policy improvement with safety constraints in finite mdps. Advances in Neural Information Processing Systems, 34:2004 2017, 2021.

Published as a conference paper at ICLR 2025

Peiyao Xiao, Hao Ban, and Kaiyi Ji. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms. Advances in Neural Information Processing Systems, 36, 2024.

Haibo Yang, Zhuqing Liu, Jia Liu, Chaosheng Dong, and Michinari Momma. Federated multiobjective learning. Advances in Neural Information Processing Systems, 36, 2024.

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33: 5824 5836, 2020.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019.

Qi Zhang, Yi Zhou, and Shaofeng Zou. Convergence guarantees for rmsprop and adam in generalizedsmooth non-convex optimization with affine noise variance. ar Xiv preprint ar Xiv:2404.01436, 2024.

Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103 38115, 2022.

Published as a conference paper at ICLR 2025

A NOTATION SUMMARY

We have summarized the notations used in this paper in the following table.

Table 4: Notations and their descriptions.

Notations Descriptions

x Rm Model parameter, or decision variable st,i,k i-th collection of samples at time t in the stochastic model for training or testing fk(x) A scalar-valued population objective function fk(x) Gradient of fk(x), with fk(x) : Rm 7 Rm

F(x) A vector-valued population objective function F(x) Gradient of F(x), with F(x) : Rd 7 Rd K

ℓ(a) A continuous non-decreasing function: [0, + ) (0, + ) φ(a) A monotonically increasing for any a 0: φ(a) = a2

2ℓ(2a) w W Weighting parameter in a probability simplex over [K], w ρ W optimal solution to equation 5, when ρ = 0, it is simplified as w

α Step size to update model parameter x β Step size to update weight in the main loop β Step size to update weight in the warm start ρ Regularization parameter in equation 5

B EXPERIMENTAL DETAILS

B.1 RELATION BETWEEN GRADIENT NORMS AND THE LOCAL SMOOTHNESS

We show the relation between local smoothness and gradient norms of each task in this part. Both results demonstrate a positive correlation between them, which further substantiates the necessity of our analysis.

Figure 2: Local smoothness constant vs. Gradient norm on training Seg Net on City Scapes dataset of each task. Task 1 on the left and Task 2 on the right.

B.2 RESULTS AND RUNNING TIME COMPARISON BETWEEN MGDA-WARM START AND MGDA-FA

We compare the results and average running time of the proposed algorithms, MGDA-warm start and MGDA-FA of the Cityscapes (Cordts et al., 2016). The time in Table 5 is an average of the total running time over epochs (in minutes). The result solidifies the advantage of the fast approximation. For its performance, the fast approximation may introduce substantial errors in practice, potentially leading to inaccuracies in the weight update process. Meanwhile, the convergence analysis of the stochastic variant is not revealed. However, to enhance its effectiveness, we could apply a logarithmic technique in FAMO(Liu et al., 2024).

Published as a conference paper at ICLR 2025

Method Segmentation Depth m% Average running time m Io U Pix Acc Abs Err Rel Err

STL 74.01 93.16 0.0125 27.77

MGDA-warm start 75.41 93.46 0.0133 31.07 3.93 2.93 MGDA-FA 74.38 93.24 0.0160 41.78 19.44 1.93

Table 5: Multi-task learning on Cityscapes dataset.

B.3 RESULTS OF OTHER MOO METHODS WITH WARM START

The warm start strategy can be applied to other multi-objective optimization algorithms. For MGDAbased methods, both Mo Co (Fernando et al., 2022) and Mo Do (Chen et al., 2024) can incorporate the warm start as an add-on. However, the Mo Do-warm start will exactly recover to our algorithm. Therefore, we evaluate the performance of the Mo Co-warm start on the Cityscapes (Cordts et al., 2016) and NYU-v2 (Silberman et al., 2012) datasets. As shown in the following tables, the warm start strategy improves performance.

Method Segmentation Depth m% m Io U Pix Acc Abs Err Rel Err

Mo Co 75.42 93.55 0.0149 34.19 9.90 Mo Co-warm start 75.48 93.54 0.0148 31.43 7.32

Table 6: Multi-task supervised learning with Mo Co-warm start on Cityscapes dataset.

Method Segmentation Depth Surface Normal m% m Io U Pix Acc Abs Err Rel Err Angle Distance Within t

Mean Median 11.25 22.5 30

Mo Co 40.30 66.07 0.5575 0.2135 26.67 21.83 25.61 51.78 64.85 0.16

Mo Co-warm start 38.40 64.40 0.5377 0.2315 26.04 20.57 27.11 54.02 66.63 -0.88

Table 7: Multi-task supervised learning with Mo Co-warm start on NYU-v2 dataset.

B.4 IMPLEMENTATION DETAILS

Multi-task learning on Cityscapes dataset. Following the experiment setup in Xiao et al. (2024), we train our method for 200 epochs, using SGD optimizers for both model parameters and weights, and the batch size for Cityscapes is 8. We compute the averaged test performance over the last 10 epochs as the final performance measure. We fix the β = 0.5 and do a grid search on hyperparameters including N [10, 20, 40, 50], α [0.0001, 0.0002, 0.0005, 0.001], and ρ [0.01, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 1] and choose the best result from them. It turns out our best performance is based on the choice that N = 40, α = 0.0005, β = 0.5, and ρ = 0.5. The choice of hyperparameters for MGDA-FA turns out to be the same as that for MGDA-warm start. All experiments are run on NVIDIA RTX A6000.

Multi-task learning on NYU-v2 dataset. Following the experiment setup in Xiao et al. (2024), we train our method for 200 epochs, using SGD optimizers for both model parameters and weights, and the batch size for NYU-v2 is 2. We compute the averaged test performance over the last 10 epochs as the final performance measure. We fix the β = 0.5 and do a grid search on hyperparameters including N [10, 20, 40, 50], α [0.0001, 0.0002, 0.0005, 0.001], and ρ [0.01, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 1] and choose the best result from them. It turns out our best performance is based on the choice that N = 40, α = 0.0001, β = 0.5, and ρ = 0.5. All experiments are run on NVIDIA RTX A6000.

Published as a conference paper at ICLR 2025

m% reflects the average per-task performance drop versus the single-task (STL) baseline b to assess method m. We calculate it by the following equation

k=1 ( 1)lk(Mm,k Mb,k)/Mb,k 100,

where K is the number of metrics, Mb,k is the value of metric Mk obtained by baseline b, and Mm,k obtained by the compared method m. lk = 1 if the evaluation metric Mk on task k prefers a higher value and 0 otherwise.

Generalized smoothness illustration. To illustrate the relation between gradient norms and local smoothness, we follow the method in Section H.3 in Zhang et al. (2019)and run SGD on each task separately without the warm start process. The code is available in https://github.com/Jingzhao Zhang/whyclipping-accelerates. Since there is no weight update process, we only need to choose α = 0.0005 for both tasks.

C ALGORITHM

Algorithm 4 MGDA with Fast Approximation (MGDA-FA)

1: Initialize: model parameters x0, weights w0 and a constant ρ 2: w0=Warm-start(w0, x0, ρ) 3: for t = 0, 1, ..., T 1 do 4: xt+1 = xt α F(xt)wt 5: Update wt according to eq. (6) 6: end for

D DETAILED PROOFS FOR AVERAGE CA DISTANCE

D.1 FORMAL VERSION AND PROOF OF THEOREM 1

Let F > 0 be some constants such that

Define M = sup{z 0|φ(z) F}. We then have the following convergence rate for Algorithm 1 without warm start:

Theorem 7. Suppose Assumptions 1 and 2 are satisfied, and we choose constant step sizes that β 1 4KM 2 , α min n β, 1 2ℓ(M+1), 1 Mℓ(M+1) o , T max n ( 10

ϵ2β o ) = Θ(ϵ 2), and ρ

min ϵ2 20, q

ϵ2 10β , 1 2T α, q

1 T αβ = O(ϵ2). We have that

t=0 F(xt)wt 2 ϵ2. (8)

Proof. Compared with the standard L-smoothness, the generalized smoothness is more challenging to address due to the unbounded Lipschitz constant. Lemma 2 demonstrates that a bounded function value implies a bounded gradient norm, which further implies a bounded Lipschitz constant. In the following, we solve the unbounded Lipschitz constant problem by showing that the function value is bounded with the parameters selected in Theorem 1. We prove that for any i K and t T we have that fi(xt) f i F by induction.

Base Case: since M is non-negative, according to equation 7 we have that fi(x0) f i F holds for any i [K].

Induction step: assume that for any i [K] and t k < T, we have that fi(xt) f i F holds. We then prove that fi(xk+1) f i F holds for any i [K].

Published as a conference paper at ICLR 2025

For fi(xt) f i F, based on the monotonicity shown in Lemma 2, we have that fi(xt) M. From assumption 2, we have that fi(x) is 1 ℓ( fi(x)) +1), ℓ( fi(x) + 1) -smooth by setting a = 1. For any t k, we have that

xt+1 xt = α F(xt)wt αM 1 ℓ(M + 1) 1 ℓ( fi(xt) + 1),

where the second inequality is due to α 1 Mℓ(M+1) and the last inequality is due to fi(xt) M. Based on Assumption 2, Definition 2 and Lemma 3.3 in (Li et al., 2024a), we have the following descent lemma:

fi(xt+1) fi(xt) α fi(xt), F(xt)wt + ℓ( fi(xt) + 1)

2 α2 F(xt)wt 2

fi(xt) α fi(xt), F(xt)wt + ℓ(M + 1)

2 α2 F(xt)wt 2.

As a result, for any w W, we have that

F(xt+1)w F(xt)w α F(xt)w, F(xt)wt + ℓ(M + 1)

2 α2 F(xt)wt 2. (9)

Based on the update process of w, we have that

wt+1 = ΠW wt β F(xt) F(xt)wt + ρwt .

It then follows that

= ΠW wt β F(xt) F(xt)wt + ρwt w 2

wt β F(xt) F(xt)wt + ρwt w 2

= wt w 2 2β wt w, ( F(xt) F(xt) + ρI)wt

+ β2 ( F(xt) F(xt) + ρI)wt 2 ,

where the inequality is due to the non-expansiveness of projection. By rearranging the above inequality, we have that

wt w, F(xt) F(xt)wt

2β wt w 2 wt+1 w 2 + 2ρ + βKM 2 F(xt)wt 2 + βρ2. (10)

Plug equation 10 into equation 9, and we can show that

F(xt+1)w F(xt)w

α F(xt)wt 2 + ℓ(M + 1)

2 α2 F(xt)wt 2

2β wt w 2 wt+1 w 2 + αβKM 2 F(xt)wt 2 + αβρ2 + 2αρ. (11)

Taking sums of equation 11 from t = 0 to k, for any w W we have that

F(xk+1)w F(x0)w

t=0 α F(xt)wt 2 +

2 α2 + αβKM 2 F(xt)wt 2

2β w0 w 2 + Tαβρ2 + 2Tαρ

2β w0 w 2 + Tαβρ2 + 2Tαρ, (12)

Published as a conference paper at ICLR 2025

where the first inequality is due to k < T and the last inequality is due to that α 1 2ℓ(M+1) and βKM 2 1

4. Thus for any i [K] it can be shown that

fi(xt+1) f i fi(x0) f i + α

β + Tαβρ2 + 2Tαρ F,

since we have that α

β 1, Tαβρ2 1 and 2Tαρ 1. Now we finish the induction step and can show that fi(xk+1) f i F and equation 12 hold for all k < T and i [K].

Specifically, for α < 1 2ℓ(M+1), β 1 4KM 2 , according to equation 12, for k = T 1 we have that

t=0 F(xt)wt 2 2(F(x0)w F w)

βT + 2βρ2 + 4ρ = O(ϵ2),

which completes our proof.

Lemma 2. (Lemma 3.5 in Li et al. (2024a)) If a function f is ℓ-smooth, we have that

φ( f(x) ) = f(x) 2

2ℓ(2 f(x) ) f(x) f . (13)

D.2 PROOF OF COROLLARY 1

Proof. Recall that

F(xt)wt F(xt)w t 2

= F(xt)wt 2 + F(xt)w t 2 2 F(xt)wt, F(xt)w t

F(xt)wt 2 + F(xt)w t 2 2 F(xt)w t 2

= F(xt)wt 2 F(xt)w t 2

where the first inequality holds because of the optimality condition of w t = arg minw W 1

2 F(xt)w 2 that for any w W,

w, F(xt) F(xt)w t w t , F(xt) F(xt)w t = F(xt)w t 2.

Then we have

t=0 F(xt)wt F(xt)w t 2 1

t=0 F(xt)wt 2 = O(ϵ2),

where we follow the same setting in Theorem 1. The proof is complete.

D.3 FORMAL VERSION AND ITS PROOF OF THEOREM 2

Let c1 > 0, c2 > 0, c3 > 0, c4 > 0, c5 > 0, c6 > 0 be some constants. Let F > 0 and 0 < δ 1

2 be some constants such that

F 8( + c1 + c2 + c3 + c4 + c5 + c6)

where = maxi [d]{fi(x0) f }. Let L0 > 0, L1 > 0, b1 > 0, b2 > 0, b3 > 0 be some constants,

for B1 = min n 1 4L2 0ℓ(M+1)2 , b2 1 (3M

KL1)2 , b2 ℓ(M+1)L2 0

48 , L2 0δ2ϵ2

1152Kσ2( +c1), L2 1δ2ϵ2

384Kσ2( +c1)} = O(δ2ϵ2),

β min{ δϵ2 192K(M 2+σ2)2 , b3ρ 8KM 2L2 0+4KL2 1 , 1} = O(min{ δϵ2

M 4 , ρ M 2 }),

α min{ ℓ(M+1) 2 max(1,M), c1β, ρB1, c2

2T M(3σ+σ2),, c3

ℓ(M+1)σ2T , δϵ2 48σ2ℓ(M+1),

Published as a conference paper at ICLR 2025

1 ρT min{c5, δL2 0 24Kσ2 , δL2 1 8Kσ4 }, c4 4KβT (M 2+σ2)2 , c6 ρβT } = O(min{β, ρ ℓ(M+1)2 , ρ M 2 , δ ρT , 1 βT M 2 })

and T max{ 48 +48c1

δαϵ2 , 4608M 2(3σ+σ2)2

δ2ϵ4 } = O(max{ 1 δαϵ2 , M 2

δ2ϵ4 }) such that

b1 + b2 + b3 + c1 + αρ(1 + βρ) + 4αβKM 4 F

Define the following random variables

τ1 = min{t| i [K], fi(xt+1) f i > F} T,

τ2 = min{t| i [K], j [3], εt,j,i > L0 αρ} T,

τ3 = min{t| i, j [K], εt,2,i εt,3,j > L1 αρ} T,

τ = min{τ1, τ2, τ3}.

We then have the following theorem:

Theorem 8. If Assumptions 1, 2 and 3 hold, with the parameters selected above, we have that

t=0 F(xt)wt 2 = O(ϵ2),

with the probability at least 1 δ.

Proof. Small probability of the event {τ < T}.

We first show that the probability of the event {τ < T} is small: P(τ < T) δ. Note that

{τ < T} = {τ2 < T} {τ3 < T} {τ1 < T, τ2 = T, τ3 = T}.

For any i [K], j [3], we have that

P( εt,j,i > L0 αρ) = P( εt,j,i 2 > L2 0 αρ) σ2αρ

where the last inequality is due to Chebyshev s inequality. Based on the union bound, we have that

P({τ2 < T})

i=1 P( εt,j,i > L0 αρ) 3Kσ2αρT

since ρ δL2 0 24Kσ2αT . Similarly, we have that

P( εt,2,i εt,3,i > L1 αρ) = P( εt,2,i 2 εt,3,i 2 > L2 1 αρ) σ4αρ

It follows that

P({τ3 < T})

j=1 P( εt,2,i εt,3,j > L1 αρ) K2σ4αρT

We then bound the probability of the event {τ1 < T, τ2 = T, τ3 = T}. Since τ = τ1 < T, we have that for some i [K], fi(xτ+1) f i > F.

Published as a conference paper at ICLR 2025

According to equation 21 shown in Lemma 1, for any i [K] and t = τ we have that

fi(xτ+1) fi(xτ) α F(xτ)w εt,1wt + ℓ(M + 1)α2 εt,1wt 2 + α

β + αρ + αβρ2

+ αM εt,2 + αM εt,3 + α ε t,2εt,3wt

+ 4αβKM 4 + 4αβM 2 εt,2 2 + 4αβM 2 εt,3 2 + 4αβ ε t,2εt,3wt 2

αM L0 αρ + ℓ(M + 1)αL2 0 ρ + α

β + αρ + αβρ2

KL0 αρ + αM

KL0 αρ + αM

+ 4αβKM 4 + 4αβM 2 KL2 0 αρ + 4αβM 2 KL2 0 αρ + 4αβ KL2 1 αρ b1 + b2 + b3 + c1 + αρ(1 + βρ) + 4αβKM 4

where the first inequality is due to that τ2 = τ3 = T, and the second one is due to β

ρ b3 8KM 2L2 0+4KL2 1 and α

ρ min n b2 1 (3M

KL1)2 , b2 ℓ(M+1)L2 0

However, for some i [K], fi(xτ+1) f i > F. Thus for this task, we have that

fi(xτ) f i > F

According to Lemma 1, we have that

E[fi(xτ) f i ] δ

Based on Markov inequality, it follows that

P fi(xτ) f i F

E[fi(xτ) f i ] F/2 δ

which indicates that P(τ1 < T, τ2 = T, τ3 = T) δ

4. It follows that P(τ < T) δ

Convergence of 1

T E h PT 1 t=0 F(xt)wt 2 τ = T i . Based on equation 24 in Lemma 1, we have that

t=0 F(xt)wt 2 τ = T

T 1 P(τ = T)E

t=0 F(xt)wt 2 #

4F(x0)w 4F w + 4α

2M(3σ + σ2)

T + 4ασ2ℓ(M + 1) + 4ρ

+ 4βρ2 + 16βKM 4 + 32βKM 2σ2 + 16βKσ4

where the second inequality is due to δ < 1 2 and the last inequality is due to our selection of parameters. As a result, we have that

t=0 F(xt)wt 2 > ϵ2 τ = T

E h 1 T PT 1 t=0 F(xt)wt 2 ϵ2 τ = T i

Published as a conference paper at ICLR 2025

where the first probability is due to Markov inequality. Thus we have that

t=0 F(xt)wt 2 ϵ2 !

1 P (τ < T) P

t=0 F(xt)wt 2 > ϵ2 τ = T

1 δ, where the last inequality is due to equation 15, equation 16, equation 17, and equation 18. This completes the proof.

D.4 PROOF OF LEMMA 1

Proof. For all i [K], t τ, we have fi(xt) f i F which further implies that fi(xt) M. Moreover, we have that for any t τ and i [K],

xt+1 xt α G1(xt)wt α( F(xt)wt + εt,1wt ) α M + L0 αρ

1 ℓ(M + 1).

Since fi(x) is 1 ℓ( fi(x)) +1), ℓ( fi(x) + 1) -smooth, it follows that

fi(xt+1) fi(xt) α fi(xt), G1(xt)wt + ℓ( fi(xt) + 1)

2 α2 G1(xt)wt 2.

As a result, for any w W, we have that

F(xt+1)w F(xt)w α F(xt)w, G1(xt)wt + ℓ(M + 1)

2 α2 G1(xt)wt 2

F(xt)w α F(xt)w, F(xt)wt + α F(xt)w, εt,1wt

+ ℓ(M + 1)α2 F(xt)wt 2 + ℓ(M + 1)α2 εt,1wt 2 (19) Based on the update process of w, we have that wt+1 w 2

= ΠW wt β[ G2(xt) G3(xt)wt + ρwt] w 2

wt β[ G2(xt) G3(xt)wt + ρwt] w 2

= wt w 2 2β wt w, ( G2(xt) G3(xt) + ρ)wt

+ β2 ( G2(xt) G3(xt) + ρ)wt 2, where the inequality follows from the non-expansiveness of projection. It follows that 2β wt w, F(xt) F(xt)wt

wt w 2 wt+1 w 2 + 2βρ + 2β2ρ2

+ 2β wt w, ε t,2 F(xt)wt + F(xt) εt,3wt ε t,2εt,3wt

+ 2β2 F(xt) F(xt)wt ε t,2 F(xt)wt F(xt) εt,3wt + ε t,2εt,3wt 2

wt w 2 wt+1 w 2 + 2βρ + 2β2ρ2

+ 2β wt w, ε t,2 F(xt)wt + F(xt) εt,3wt ε t,2εt,3wt

+ 8β2KM 4 + 8β2M 2 εt,2 2 + 8β2M 2 εt,3 2 + 8β2 ε t,2εt,3wt 2. (20) Combine equation 19 and equation 20, and we can get that F(xt+1)w F(xt)w α F(xt)wt 2 + α F(xt)w, εt,1wt

+ ℓ(M + 1)α2 F(xt)wt 2 + ℓ(M + 1)α2 εt,1wt 2

2β wt w 2 wt+1 w 2 + αρ + αβρ2

+ α wt w, ε t,2 F(xt)wt + F(xt)w t εt,3 ε t,2εt,3wt

+ 4αβKM 4 + 4αβM 2 εt,2 2 + 4αβM 2 εt,3 2 + 4αβ ε t,2εt,3wt 2. (21)

Published as a conference paper at ICLR 2025

Taking expectation and sum up equation 21 from t = 0 to τ 1, we have that

E[F(xτ)w] F(x0)w α

t=0 F(xt)wt 2 #

t=0 F(xt)w, εt,1wt

wt w, ε t,2 F(xt)wt + F(xt)w t εt,3 ε t,2εt,3wt #

+ ℓ(M + 1)α2E

t=0 εt,1wt 2 #

2β w0 w 2 + αρT + αβρ2T

+ 4αβKM 4T + 4αβM 2E

t=0 εt,2 2 #

t=0 εt,3 2 #

t=0 ε t,2εt,3wt 2 #

t=0 F(xt)wt 2 #

t=0 F(xt)w, εt,1wt

wt w, ε t,2 F(xt)wt + F(xt)w t εt,3 ε t,2εt,3wt #

+ ℓ(M + 1)α2Tσ2 + α

β + αρT + αβρ2T

+ 4αβKM 4T + 4αβKM 2Tσ2

+ 4αβKM 2Tσ2 + 4αβTKσ4, (22)

where the last inequality is due to that τ T and for any i [K], j [3], E h Pτ 1 t=0 εt,j,i 2i

E h PT 1 t=0 εt,j,i 2i TKσ2. By the optional stopping theorem, we have that

t=0 F(xt)w, εt,1wt

which further implies that

t=0 F(xt)w, εt,1wt

= E[ F(xτ)w, ετ,1wτ ]

E[M ετ,1wτ ] M q

E[ ετ,1wτ 2]

t=0 εt,1wt 2 #

Similarly, we have E h Pτ 1 t=0 F(xt)w, εt,2wt i

T, E h Pτ 1 t=0 F(xt)w, εt,3wt i

T and E h Pτ 1 t=0 F(xt)w, ε t,2εt,3wt i

Published as a conference paper at ICLR 2025

Based on equation 22 and equation 23, we have that

E[F(xτ)w] F w F(x0)w F w E

2 F(xt)wt 2 #

2TM(3σ + σ2)

+ ℓ(M + 1)α2Tσ2 + α

β + αρT + αβρ2T

+ 4αβKM 4T + 4αβKM 2Tσ2

+ 4αβKM 2Tσ2 + 4αβTKσ4

2 F(xt)wt 2 #

which completes the proof.

E DETAILED PROOFS FOR ITERATION-WISE CA DISTANCE

We first provide some useful lemmas, which will be used in our main theorems. Lemma 3 (Continuity of w t,ρ). Suppose Assumptions 1 and 2 are satisfied. If for any i [K], fi(xt) M and xt xt+1 1 ℓ(M+1), we have,

w ρ(xt) w ρ(xt+1) Lw xt xt+1 ,

where Lw = 2ρ 1KMℓ(M + 1).

Proof. We first define that w Q,ρ(xt) W is the Q-th iterate of a function J(w) = 1

2 F(xt)w 2 + ρ 2 w 2 using projected gradient descent (PGD) with a constant step size β. The update rule is

w Q+1,ρ(xt) = ΠW (1 βρ)I β F(xt) F(xt) w Q,ρ(xt) . By the non-expansiveness of projection, we have

w Q+1,ρ(xt) w Q+1,ρ(xt+1)

(1 βρ)I β F(xt) F(xt) w Q,ρ(xt)

(1 βρ)I β F (xt+1) F(xt+1) w Q,ρ(xt+1)

(1 βρ)I β F(xt) F(xt) w Q,ρ(xt) w Q,ρ(xt+1)

+ β F(xt) F(xt) F (xt+1) F(xt+1) w Q,ρ(xt+1)

(1 βρ) w Q,ρ(xt) w Q,ρ(xt+1)

+ β F(xt) F(xt) F (xt+1) F(xt+1) w Q,ρ(xt+1) .

Since we set w0,ρ(xt) = w0,ρ(xt+1) and w Q,ρ(xt+1) 1, telescoping the above inequality over Q gives,

w Q,ρ(xt) w Q,ρ(xt+1) ρ 1 F(xt) F(xt) F (xt+1) F(xt+1) . (25)

Then according to the Cauchy-Schwartz inequality, it follows that

w ρ(xt) w ρ(xt+1) lim Q w ρ(xt) w Q,ρ(xt) + w ρ(xt+1) w Q,ρ(xt+1)

+ w Q,ρ(xt) w Q,ρ(xt+1)

(i) lim Q w ρ(xt) w Q,ρ(xt) + w ρ(xt+1) w Q,ρ(xt+1)

+ ρ 1 F(xt) F(xt) F (xt+1) F(xt+1)

(ii) lim Q 2 r 4 ρβQ + ρ 1 F(xt) F(xt) F (xt+1) F(xt+1) ,

Published as a conference paper at ICLR 2025

where (i) follows from eq. (25) and (ii) follows from the convergence of PGD (Theorem 1.1, (Beck & Teboulle, 2009)) on ρ-strongly convex objectives that

w ρ(xt) w Q,ρ(xt) 2 2

ρ J(w ρ(xt)) J(w Q,ρ(xt)) 2

ρ w0,ρ(xt) w ρ(xt) 2

Then eq. (26) can be bounded by w ρ(xt) w ρ(xt+1) ρ 1 F(xt) F(xt) F (xt+1) F(xt+1)

ρ 1 F(xt) + F(xt+1) F(xt) F(xt+1)

2ρ 1KMℓ(M + 1) xt xt+1 ,

where the last inequality follows from fi(xt) M and fi(x) is 1 ℓ( fi(x) +1), ℓ( fi(x) +

1) -smooth by setting a = 1. The proof is complete.

Lemma 4. Given w t = arg minw W 1

2 F(xt)w 2 and w t,ρ = arg minw W 1

2 F(xt)w 2 + ρ 2 w 2, we have F(xt)w t F(xt)w t,ρ ρ.

Proof. Recall that w t,ρ = arg minw W 1

2 F(xt)w 2 + ρ

2 w 2, then we have 1 2 F(xt)w t 2 + ρ

2 F(xt)w t,ρ 2 ρ

2 w t,ρ 2 0.

By rearranging the above inequality, we have F(xt)w t,ρ 2 F(xt)w t 2 ρ( w t,ρ 2 w t 2) ρ.

Then recall that w t = arg minw W 1

2 F(xt)w 2, we have

F(xt)w t F(xt)w t,ρ 2 = F(xt)w t 2 + F(xt)w t,ρ 2 2 F(xt)w t , F(xt)w t,ρ

F(xt)w t,ρ 2 F(xt)w t 2

ρ, where the first inequlity follows from the optimality that 2 w t,ρ, F(xt) F(xt)w t 2 F(xt)w t 2. The proof is complete.

Lemma 5. Suppose Assumptions 1 and 2 are satisfied. If for any i [K], fi(xt) M and xt xt+1 1 ℓ(M+1), we have

R(xt) α2Kℓ(M + 1)M 2

Proof. According to the Talyor Theorem, we have the following result for any objective function fi(xt), i [K]. fi(xt+1) = fi(xt) + f i (xt)(xt+1 xt) + Ri(xt),

where Ri(xt) is the remainder term. Then according to the descent lemma of each objective function fi(x), we have

fi(xt+1) fi(xt) + f i (xt)(xt+1 xt) + α2 ℓ( fi(xt) + 1)

2 F(xt)wt 2

fi(xt) + f i (xt)(xt+1 xt) + α2 ℓ(M + 1)

2 F(xt)wt 2.

Then we can obtain

Ri(xt) α2 ℓ(M + 1)

2 F(xt)wt 2.

Thus, according to the Cauchy-Schwartz inequality, we have

R(xt) α2 ℓ(M + 1)

2 F(xt)wt 2 α2Kℓ(M + 1)M 2

The proof is complete.

Published as a conference paper at ICLR 2025

E.1 PROOF OF THEOREM 3

Theorem 9. Suppose Assumptions 1 and 2 are satisfied. We choose β 1 M 2 , N = O(ϵ 2), C1

KM 2 + ρ, β min ϵ2ρ

C2 1 , ϵ2, 1 4KM 2 , α

min β, 1 ℓ(M+1), βρϵ 2Lw

ϵ2β = Θ(ϵ 11), and ρ

min ϵ2 20, 1 2T α, q

1 T αβ = O(ϵ2). The CA distance in every iteration takes the order

Proof. Since our parameters satisfy all requirements in Theorem 1, we have that fi(xt) M. According to the definition of CA distance, we have

F(xt)wt F(xt)w t = F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

(i) F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

KM wt w t,ρ + ρ, (27)

where (i) follows from Cauchy-Schwartz inequality and (ii) follows from fi(xt) M for any i and Lemma 4. Then for the first term in the above inequality on the right-hand side (RHS), we have

wt+1 w t+1,ρ 2 = wt+1 w t,ρ 2 + w t+1,ρ w t,ρ 2 2 wt+1 w t,ρ, w t+1,ρ w t,ρ . (28)

For the first term on the RHS in the above inequality, we have

wt+1 w t,ρ 2 (i) wt β[ F(xt) F(xt)wt + ρwt] w t,ρ 2

= wt w t,ρ 2 2β F(xt) F(xt)wt + ρwt, wt w t,ρ

+ β2 F(xt) F(xt)wt + ρwt 2

(ii) (1 2βρ) wt w t,ρ 2 + β2(ρ +

KM 2)2, (29)

where (i) follows from the non-expansiveness of projection and (ii) follows from properties of strong convexity and Cauchy-Schwartz inequality. Then for the second term on the RHS in eq. (28), we have

w t+1,ρ w t,ρ 2 L2 w xt xt+1 2 = L2 wα2 F(xt)wt 2 α2L2 w M 2. (30)

Then for the last term on the RHS in eq. (28), we have

2 wt+1 w t,ρ, w t+1,ρ w t,ρ

2 wt+1 w t,ρ w t+1,ρ w t,ρ

2( wt+1 wt + wt w t,ρ ) w t+1,ρ w t,ρ

(i) 2αβLw F(xt) F(xt)wt + ρwt F(xt)wt + βρ wt w t,ρ 2 + 4

βρ w t+1,ρ w t,ρ 2

KM 2 + ρ) + βρ wt w t,ρ 2 + 4α2L2 w M 2

where (i) follows from the update rule in Algorithm 1, Lemma 3, and Young s inequality. Then substituting eq. (29), eq. (30) and eq. (31) into eq. (28), we have

wt+1 w t+1,ρ 2 (1 βρ) wt w t,ρ 2 + β2(ρ +

KM 2)2 + α2L2 w M 2

KM 2 + ρ) + 4α2L2 w M 2

(1 βρ) wt w t,ρ 2 + β2C2 1 + α2L2 w M 2 + 2αβLw MC1 + 4α2L2 w M 2

Published as a conference paper at ICLR 2025

where the last inequality follows from Lemma 5 and C1

KM 2 + ρ. Then we do telescoping over t = 0, 1, ..., T 1

w T w T,ρ 2 (1 βρ)T w0 w 0,ρ 2 + β

ρ C2 1 + α2

βρL2 w M 2 + 2αLw M

ρ C1 + 4α2L2 w M 2

Then recalling that Lw = O( 1

ρ) and substituting the above inequality into eq. (27), we have

F(xt)wt F(xt)w t

KM h (1 βρ)t w0 w 0,ρ 2 + β

ρ C2 1 + α2

ρ C1 + 4α2L2 w M 2

=O (1 βρ) t 2 w0 w 0,ρ +

ρ + α βρ2 + ρ .

Since we run projected gradient descent for the strongly convex function J(wn) = 1

2 F(x0)wn 2+ ρ 2 wn 2 in the N-loop in Algorithm 1, according to Theorem 10.5 (Garrigos & Gower, 2023), we have by choosing β (0, 1 M 2 ]

w0 w 0,ρ 2 = w N w 0,ρ 2 2 1 ρ M 2

Thus, w0 w 0,ρ = O(ϵ) as N = O(ρ 1). CA distance takes the order of ϵ in every iteration by choosing ρ = O(ϵ2), β = O(ϵ4), α = O(ϵ9), and N = O(ϵ 2). The proof is complete.

E.2 FORMAL VERSION AND ITS PROOF OF THEOREM 4

Let c1 > 0, c 2 > 0, c 3, c 4 0, and F > 0 be some constants such that

+ c1 + c 2 + c 3 + c 4 F

KM 2 + ρ + αKℓ(M+1)M 2

2 . We then have the following convergence rate for Algorithm 4.

Theorem 10. Suppose Assumptions 1 and 2 are satisfied, and we choose constant step sizes that

C 2 1 , α min c1β, q

2c 3 Kℓ(M+1)M 2T , 2c 4 βC 2 1 T , ϵ2 Kℓ(M+1)M 2 , ρ min ϵ2

2 , c 2 αT , and T

ϵ2β . We have

t=0 F(xt)wt 2 = O(ϵ2).

Proof. Following similar steps in Appendix D.1, we also prove that for any i K and t T, we have that fi(xt) f i F by induction.

Base case: since all constants c1, c 2, c 3, c 4 are non-negative, we have that fi(x0) f i F holds for any i [K].

Induction step: assume that for any i [K] and t k < T, fi(xt) f i F holds. We then prove fi(xk+1) f i F holds for any i [K]. Following similar steps in Appendix D.1, we have

F(xt+1)w F(xt)w α F(xt)w, F(xt)wt + α2ℓ(M + 1)

2 F(xt)wt 2. (32)

Published as a conference paper at ICLR 2025

Based on the update rule of w and non-expansiveness of projection, we have

wt+1 w 2 wt β F(xt) F(xt+1)

α + ρwt w 2

= wt β F(xt) F(xt)wt + ρwt + R(xt)

(i) wt w 2 2β wt w, ( F(xt) F(xt) + ρI)wt + 2β

α R(xt) + β2(C 1)2

(ii) wt w 2 2β wt w, ( F(xt) F(xt) + ρI)wt

+ αβKℓ(M + 1)M 2 + β2(C 1)2,

where (i) follows from Cauchy-Schwartz inequality and C 1

KM 2 + ρ + αKℓ(M+1)M 2

2 , and (ii) follows from Lemma 5. Then we have

wt w, F(xt) F(xt)wt 1

2β ( wt w 2 wt+1 w 2) + ρ

+ αKℓ(M + 1)M 2

2 + β(C 1)2

Then substituting the above inequality into eq. (32), we can obtain

F(xt+1)w F(xt) α F(xt)wt 2 + α2ℓ(M + 1)

2 F(xt)wt 2

2β ( wt w 2 wt+1 w 2) + αρ + α2Kℓ(M + 1)M 2

2 + αβ(C 1)2

Then taking sums of the above inequality from t = 0 to k, for any w W, we have

F(xk+1)w F(x0)w

t=0 α F(xt)wt 2 +

2 F(xt)wt 2

2β w0 w 2 + αρT + α2Kℓ(M + 1)M 2T

2 + αβ(C 1)2T 2

β + αρT + α2Kℓ(M + 1)M 2T

2 + αβ(C 1)2T 2 , (33)

where the last inequality follows from α 1 ℓ(M+1). Thus, for any i [K], it can be shown that

fi(xk+1) f i fi(x0) f i + α

β + αρT + α2Kℓ(M + 1)M 2T

2 + αβ(C 1)2T 2 F,

since we have that α

β c1, αρT c 2, α2Kℓ(M+1)M 2T

2 c3 , αβ(C 1)2T 2 c 4. Now we finish the induction step and can show that fi(xk) f i F and eq. (33) hold for all k < T and i [K]. Specifically, for α 1 ℓ(M+1), we have

t=0 F(xt)wt 2 2F(x0)w 2F w

βT + 2ρ + αKℓ(M + 1)M 2 + β(C 1)2.

Then following the choice of step sizes, we can obtain

t=0 F(xt)wt 2 = O(ϵ2).

The proof is complete.

Published as a conference paper at ICLR 2025

E.3 FORMAL VERSION AND ITS PROOF OF THEOREM 5

Theorem 11. Suppose Assumptions 1 and 2 are satisfied. We choose β

1 M 2 , N = Ω(ϵ 2), C 1

KM 2 + ρ + αKℓ(M+1)M 2

2 , β min ϵ2ρ (C 1)2 , ϵ2 , α

min c1β, 2c 3 βc 1T , 1 ℓ(M+1), βρϵ 2Lw M , ρϵ2

ϵ2β = Θ(ϵ 11), and

ρ min ϵ2 20, c 2 2T α = O(ϵ2). The CA distance in every iteration takes the order of O(ϵ).

Proof. According to the definition of CA distance, we have F(xt)wt F(xt)w t = F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

(i) F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

KM wt w t,ρ + ρ, (34) where (i) follows from Cauchy-Schwartz inequality and (ii) follows from fi(xt) M for any i and Lemma 3. Then for the first term in the above inequality on the right-hand side (RHS), we have wt+1 w t+1,ρ 2 = wt+1 w t,ρ 2 + w t+1,ρ w t,ρ 2 2 wt+1 w t,ρ, w t+1,ρ w t,ρ . (35) For the first term on the RHS in the above inequality, we have

wt+1 w t,ρ 2 (i) wt β F(xt) F(xt+1)

α + ρwt w t,ρ 2

= wt β F(xt) F(xt)wt + ρwt + R(xt)

= wt w t,ρ 2 2β F(xt) F(xt)wt + ρwt, wt w t,ρ

α R(xt), wt w t,ρ + β2 F(xt) F(xt)wt + R(xt)

(ii) (1 2βρ) wt w t,ρ 2 + 2β

α R(xt) + β2 ρ +

KM 2 + R(xt)

(36) where (i) follows from the non-expansiveness of projection and (ii) follows from properties of strong convexity and Cauchy-Schwartz inequality. Then for the second term on the RHS in eq. (35), we have w t+1,ρ w t,ρ 2 L2 w xt xt+1 2 = L2 wα2 F(xt)wt 2 α2L2 w M 2. (37) Then for the last term on the RHS in eq. (35), we have 2 wt+1 w t,ρ, w t+1,ρ w t,ρ

2 wt+1 w t,ρ w t+1,ρ w t,ρ

2( wt+1 wt + wt w t,ρ ) w t+1,ρ w t,ρ

(i) 2αβLw F(xt) F(xt+1)

α + ρwt F(xt)wt + βρ wt w t,ρ 2 + 4

βρ w t+1,ρ w t,ρ 2

KM 2 + R(xt)

α + ρ + βρ wt w t,ρ 2 + 4α2L2 w M 2

Then substituting eq. (36), eq. (37) and eq. (38) into eq. (35), we have

wt+1 w t+1,ρ 2 (1 βρ) wt w t,ρ 2 + 2β

α R(xt) + β2(ρ +

KM 2 + R(xt) )2

+ α2L2 w M + 2αβLw M

KM 2 + R(xt)

α + ρ + 4α2L2 w M 2

(1 βρ) wt w t,ρ 2 + αβKℓ(M + 1)M 2 + β2(C 1)2

+ α2L2 w M + 2αβLw MC 1 + 4α2L2 w M 2

Published as a conference paper at ICLR 2025

where the last inequality follows from Lemma 5 and C 1

KM 2 + ρ + αKℓ(M+1)M 2

2 . Then we do telescoping over t = 0, 1, ..., T 1

w T w T,ρ 2 (1 βρ)T w0 w 0,ρ 2 + α

ρ Kℓ(M + 1)M 2 + β

βρL2 w M + 2αLw M

ρ C 1 + 4α2L2 w M 2

Then substituting the above inequality into eq. (34), we have

F(xt)wt F(xt)w t

KM h (1 βρ)t w0 w 0,ρ 2 + α

ρ Kℓ(M + 1)M 2 + β

βρL2 w M + 2αLw M

ρ C 1 + 4α2L2 w M β2ρ2

=O (1 βρ) t 2 w0 w 0,ρ + r α

ρ + α βρ2 + ρ .

Since we run projected gradient descent for the strongly convex function J(wn) = 1

2 F(x0)wn 2+ ρ 2 wn 2 in the N-loop in Algorithm 4, according to Theorem 10.5 (Garrigos & Gower, 2023), we have w0 w 0,ρ 2 = w N w 0,ρ 2 2 1 ρ M 2 + ρ

Thus, w0 w 0,ρ = O(ϵ) as N = Ω(ρ 1). CA distance takes the order of ϵ in every iteration by choosing ρ = O(ϵ2), β = O(ϵ4), α = O(ϵ9), and N = Ω(ϵ 2). The proof is complete.

E.4 FORMAL VERSION OF ITS PROOF OF THEOREM 6

Let α, β, ρ, T satisfy all requirements for Theorem 2 with δ < 1

2. Moreover, for ρ = O(ϵ2), N =

Ω(ϵ 2), β δρϵ2

60(1+KM 4) = O(ϵ4), ns max{Kσ2, 36Kσ2M 2(6+20βρ)

δρ2ϵ2 } = Ω(ϵ 6) and α q

δβ2ρ2ϵ2 12L2 w(2M 2+4Kσ2)(βρ+1) = O(ϵ9) and T = Θ(ϵ 11), we have the following theorem:

Theorem 12. If Assumptions 1, 2 and 3 hold, with the values of the parameters mentioned above, we have that for each t T,

F(xt)wt F(xt)w t = O(ϵ),

with the probability at least 1 δ.

Proof. When τ = T and t < τ, according to the definition of CA distance, we have

F(xt)wt F(xt)w t = F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

(i) F(xt)wt F(xt)w t,ρ + F(xt)w t,ρ F(xt)w t

KM wt w t,ρ + ρ, (39)

where (i) follows from Cauchy-Schwartz inequality and (ii) follows from fi(xt) M for any i [K] and Lemma 4. We then show that for any t τ, we have that E[ wt w t,ρ 2|τ = T] δ

2ϵ2 by induction.

Base case: Since we run projected gradient descent for the strongly convex function J(wn) = 1 2 F(x0)wn 2 + ρ

2 wn 2 in the N-loop in Algorithm 1, according to Theorem 10.5 (Garrigos & Gower, 2023), we have by choosing β (0, 1 M 2 ]

w0 w 0,ρ 2 = w N w 0,ρ 2 2 1 ρ M 2

Published as a conference paper at ICLR 2025

Thus, w0 w 0,ρ 2 = O( δ

2ϵ2) as N = Ω(ρ 1).

Induction: Assume we have that E[ wt w t,ρ 2|τ = T] δ

2ϵ2, we will show that E[ wt+1 w t+1,ρ 2|τ = T] δ

2ϵ2 holds for any t < τ in the following proof. We first divide wt+1 w t+1,ρ 2 into three parts:

wt+1 w t+1,ρ 2 = wt+1 w t,ρ 2 + w t+1,ρ w t,ρ 2 2 wt+1 w t,ρ, w t+1,ρ w t,ρ . (40)

For the first term on the RHS in the above inequality, we have that

wt+1 w t,ρ 2

(i) wt β G2(xt) G3(xt)wt + ρwt w t,ρ 2

= wt w t,ρ 2 2β G2(xt) G2(xt)wt + ρwt, wt w t,ρ

+ β2 G2(xt) G3(xt)wt + ρwt 2

(ii) (1 2βρ) wt w t,ρ 2

+ 2β wt w t,ρ, ε t,2 F(xt)wt + F(xt) εt,3wt ε t,2εt,3wt

+ β2 ρwt + F(xt) F(xt)wt ε t,2 F(xt)wt F(xt) εt,3wt + ε t,2εt,3wt 2, (41)

where (i) follows from the non-expansiveness of projection and (ii) follows from properties of strong convexity and Cauchy-Schwartz inequality. Taking the conditional expectation of equation 41, we have that for any a1 > 0,

E[ wt+1 w t,ρ 2|τ = T]

2(1 2βρ)ϵ2 + 2βE[ wt w t,ρ ε t,2 F(xt)wt + F(xt) εt,3wt ε t,2εt,3wt |τ = T]

+ E[β2 ρwt + F(xt) F(xt)wt ε t,2 F(xt)wt F(xt) εt,3wt + ε t,2εt,3wt 2|τ = T]

β(E[a1 wt w t,ρ 2 + ε t,2 F(xt)wt + F(xt) εt,3wt ε t,2εt,3wt 2/a1|τ = T]])

2(1 2βρ)ϵ2 + 5β2ρ2 + 5β2KM 4 + 5β2E[M 2 ϵt,2 2|τ = T]

+ 5β2E[M 2 ϵt,3 2|τ = T] + 5β2E[ ϵt,2 2 ϵt,3 2|τ = T], (42)

where the last inequality is due to that for t τ = T, and for any i [K], we have that fi(xt) M. Then for the second term on the RHS in eq. (40), we have

E[ w t+1,ρ w t,ρ 2|τ = T] E[L2 w xt xt+1 2|τ = T]

= E[L2 wα2 F(xt, st,1)wt 2|τ = T]

E[α2L2 w(M + ϵt,1wt )2|τ = T], (43)

where the first inequality is due to Lemma 3, where Lw = O(ρ 1). Then for the last term on the RHS in eq. (40), for any a2 > 0, a3 > 0, we have that

E[ 2 wt+1 w t,ρ, w t+1,ρ w t,ρ |τ = T]

E[2 wt+1 w t,ρ w t+1,ρ w t,ρ |τ = T]

E[2( wt+1 wt + wt w t,ρ ) w t+1,ρ w t,ρ |τ = T]

E a2 wt+1 wt 2 + 1

a2 w t+1,ρ w t,ρ 2 + a3 wt w t,ρ 2 + 1

a3 w t+1,ρ w t,ρ 2|τ = T

(i) E a2β2 G2(xt) G3(xt)wt + ρwt 2 + a3 δ 2ϵ2 + 1

α2L2 w(M + ϵt,1wt )2|τ = T

a2(5β2ρ2 + 5β2KM 4 + 5β2E[M 2 ϵt,2 2|τ = T]

+ 5β2E[M 2 ϵt,3 2|τ = T] + 5β2E[ ϵt,2 2| ϵt,3 2|τ = T])

α2L2 w(M + ϵt,1wt )2|τ = T i + a3 δ 2ϵ2, (44)

Published as a conference paper at ICLR 2025

where (i) follows from the non-expansiveness of projection and equation 43, and the last inequality is from equation 42. Then substituting eq. (42), eq. (43) and eq. (44) into eq. (40), we have E[ wt+1 w t+1,ρ 2|τ = T]

(1 2βρ + βa1 + a3)δ

+ β2(5ρ2 + 5KM 4)(1 + a2)

a1 + 5β2 + 5β2a2

E[ εt,2 2|τ = T]

a1 + 5β2 + 5β2a2

E[ εt,3 2|τ = T]

a1 + 5β2 + 5β2a2

E[ εt,2 2 εt,3 2|τ = T]

+ E h 1 + 1

α2L2 w(M + ϵt,1wt )2|τ = T i

(1 2βρ + βa1 + a3)δ

+ β2(5ρ2 + 5KM 4)(1 + a2)

a1 + 5β2 + 5β2a2

2M 2 + 4Kσ2

where the last inequality is due to that for any i [3],

E[ ϵt,i |τ = T] q

E[ ϵt,i 2|τ = T] q

E[ ϵt,i 2]/P(τ = T)

E[ ϵt,2 ϵt,3 |τ = T] q

E[ ϵt,2 2 ϵt,3 2|τ = T]

E[ ϵt,2 2 ϵt,3 2]/P(τ = T)

E[ ϵt,2 2]E[ ϵt,3 2]/P(τ = T)

According to equation 45, with a1 = 0.5ρ, a2 = 1, a3 = 0.5βρ, β δρϵ2

60(1+KM 4), ns

max{Kσ2, 36Kσ2M 2(6+20βρ)

δρ2ϵ2 } and α q

δβ2ρ2ϵ2 12L2w(2M 2+4Kσ2)(βρ+1), we have that

E[ wt+1 w t+1,ρ 2|τ = T] δ

We then complete our induction and prove that for any t < τ, we have that E[ wt+1 w t+1,ρ 2|τ = T] δ

As a result, we have that

P wt+1 w t+1,ρ 2 > ϵ2 τ = T E h wt+1 w t+1,ρ 2 τ = T i

where the first probability is due to Markov inequality. Thus we have that P wt+1 w t+1,ρ 2 ϵ2

1 P (τ < T) P wt+1 w t+1,ρ 2 τ = T P (τ = T)

Published as a conference paper at ICLR 2025

where the last inequality is because our parameters satisfy all the requirements in Theorem 2, thus P(τ < T) δ

2. Then based on equation 39, by setting ρ = O(ϵ2), we have that F(xt)wt F(xt)w t = O(ϵ) with probability at least 1 δ for each iteration t, which completes the proof.