Communication-Efficient Federated Group Distributionally Robust Optimization

Zhishuai Guo, Tianbao Yang
Department of Computer Science and Engineering, Texas A&M University
zhishguo@tamu.edu, tianbao-yang@tamu.edu
(Corresponding author. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).)

Federated learning faces challenges due to the heterogeneity in data volumes and distributions at different clients, which can compromise the model's ability to generalize to various distributions. Existing approaches to address this issue based on group distributionally robust optimization (GDRO) often lead to high communication and sample complexity. To this end, this work introduces algorithms tailored for communication-efficient Federated Group Distributionally Robust Optimization (FGDRO). Our contributions are threefold: Firstly, we introduce the FGDRO-CVaR algorithm, which optimizes the average top-K losses while reducing communication complexity to O(1/ϵ^4), where ϵ denotes the desired precision level. Secondly, our FGDRO-KL algorithm is crafted to optimize KL-regularized FGDRO, cutting communication complexity to O(1/ϵ^3). Lastly, we propose FGDRO-KL-Adam to utilize Adam-type local updates in FGDRO-KL, which not only maintains a communication cost of O(1/ϵ^3) but also shows potential to surpass SGD-type local steps in practical applications. The effectiveness of our algorithms has been demonstrated on a variety of real-world tasks, including natural language processing and computer vision.

1 Introduction

Federated learning enables effective model training without the need to share raw data [38, 48]. It is essential in contexts where data privacy and ownership are paramount, such as in inter-hospital collaborations [54] and mobile device networks [20]. However, clients often have data of varying volumes and distinct distributions, which poses notable challenges in maintaining generalization behavior [50, 27]. Generalization here refers to the model's ability to perform consistently across different clients, including those that have not participated in the training [23, 79]. In this study, we tackle the issue using federated group distributionally robust optimization (FGDRO), formulated as follows:

min_w F(w) := max_{p ∈ Δ_N} Σ_{i=1}^N p_i ℓ_i(w) − λϕ(p).  (1)

Here, w denotes a machine learning model, and N represents the number of clients. For each client i, D_i represents its local data distribution, and ℓ_i(w) = E_{z∼D_i} ℓ(w; z) represents the loss calculated on that local distribution. Δ_N denotes the N-dimensional simplex, which constrains Σ_i p_i = 1 and p_i ≥ 0. The vector p = [p_1, ..., p_N] comprises the weights assigned to each of the N clients. The function ϕ(p) acts as a regularization term, with λ > 0 being an adjustable parameter. This framework aims to assign higher weights to machines with greater losses while discouraging substantial deviations of these weights from a specified distribution. Our study concentrates on two particular forms of the regularization term ϕ, which are well-established regularization techniques [11, 39], each suited to different tasks and data distributions. Specifically, the CVaR constraint is encoded as ϕ(p) = I_{[0,1/K]}(p): ϕ(p) is 0 if each weight p_i falls within the range [0, 1/K], and is infinite otherwise.
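To make the CVaR regularizer concrete: under ϕ(p) = I_{[0,1/K]}(p), the inner maximization of (1) simply places the maximal allowed weight 1/K on the K largest client losses, so the objective becomes the average of the top-K losses. A minimal numpy sketch of this fact (the loss values and names are our illustration, not code from the paper):

```python
import numpy as np

def cvar_weights(losses, K):
    """Inner max of (1) with phi(p) = I_{[0,1/K]}(p): put the maximal
    allowed weight 1/K on the K largest client losses."""
    p = np.zeros_like(losses)
    top_k = np.argsort(losses)[-K:]   # indices of the K largest losses
    p[top_k] = 1.0 / K                # simplex constraint: sum(p) = 1
    return p

losses = np.array([0.9, 0.1, 0.5, 1.3, 0.2])   # per-client losses l_i(w)
p = cvar_weights(losses, K=2)
# The weighted loss equals the average of the top-K losses.
assert np.isclose(p @ losses, np.sort(losses)[-2:].mean())
```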
FGDRO-CVaR focuses on optimizing for worst-case scenarios, or the average of the worst-case losses, making it particularly effective in high-stakes applications like healthcare and finance, where avoiding extreme losses is crucial. However, it can be sensitive to outliers or malicious client attacks. FGDRO-KL, on the other hand, uses the Kullback-Leibler (KL) divergence, expressed as ϕ(p) = Σ_{i=1}^N p_i log(N p_i). This version of ϕ penalizes deviations of the weight distribution p from the uniform distribution. Fundamentally, when ϕ(p) is strongly convex, as in the case of KL divergence, F(w) enjoys a smoothness property, while a non-strongly convex ϕ(p) results in a non-smooth F(w) [6]. In contrast to CVaR, KL is a softer regularizer that promotes smoother and more stable learning. Thus, it can be beneficial in scenarios where robustness to outliers or malicious clients is needed. FGDRO-KL-Adam further enhances FGDRO-KL by incorporating Adam-type updates.

Table 1: Comparison of communication cost and sample complexity on each machine to achieve an ϵ-stationary point or a point near an ϵ-stationary point, where an ϵ-stationary point has a (sub-)gradient with ∥∇F(w)∥² ≤ ϵ². NDP-SONT denotes the naive deployment of the SONT algorithm [22] in a federated environment, communicating in all iterations.

| Method | CVaR: Communication | CVaR: Sample | KL: Communication | KL: Sample |
|---|---|---|---|---|
| DRFA [11] | O(1/ϵ^12) | O(1/ϵ^16) | - | - |
| DR-DSGD [30] | - | - | O(1/ϵ^3) | O(1/ϵ^6) |
| NDP-SONT [22] | O(1/ϵ^6) | O(1/ϵ^6) | - | - |
| This work | O(1/ϵ^4) | O(1/ϵ^8) | O(1/ϵ^3) | O(1/ϵ^4) |

Previous research addressing these optimization problems in federated learning has struggled with high communication and sample complexity, largely due to inefficient updates of p on local machines. For example, Deng et al. [11] examined a particular case of the general formulation (1) using a CVaR constraint with K = 1, and developed an algorithm for KL regularization as well. They update p only in global communication rounds, while local steps only optimize the local loss function using stochastic gradient descent (SGD). To achieve an ϵ-stationary point or a point near an ϵ-stationary point, where an ϵ-stationary point has a (sub)gradient with ∥∇F(w)∥² ≤ ϵ², their methods require a communication cost of O(1/ϵ^12) and a sample complexity of O(1/ϵ^16) on each client. For FGDRO with KL regularization, [30] achieves a communication cost of O(1/ϵ^3), but it requires large data batches in local update steps to obtain a good approximation of the surrogate of p, resulting in a total sample complexity of O(1/ϵ^6) per machine. To overcome these limitations, this paper presents the specialized algorithms FGDRO-CVaR and FGDRO-KL for FGDRO with a CVaR constraint and with KL regularization, respectively. Instead of dealing with the constrained primal-dual formulation in (1), we consider equivalent forms with a compositional structure that get rid of the high-dimensional constrained variable p. We summarize the complexity results in Table 1. For FGDRO with a CVaR constraint, we are the first to consider a constraint-free equivalent form and develop a communication-efficient algorithm for it, significantly reducing communication costs, as shown in Table 1. In addition to sharing machine learning models, we only introduce an additional scalar threshold to select participants in each round, minimizing additional costs. The equivalent compositional form is a non-smooth two-level compositional function with one auxiliary variable s, which works as a threshold.
Only machines whose local losses are greater than s are supposed to contribute to updating the model. In this way, we can simply update the constraint-free scalar variable s locally on each client and average s in communication rounds. However, we do face challenges from the non-smooth compositional optimization problem. Our first algorithm, FGDRO-CVaR, effectively addresses this issue and achieves a communication cost of O(1/ϵ^4) and a sample complexity of O(1/ϵ^8) on each machine. For FGDRO with KL regularization, while previous literature has explored constraint-free compositional reformulations, these often require large batch sizes on each machine to estimate gradients, making the approach impractical and leading to high sample complexity. In contrast, we utilize moving averages that work with smaller data batches while still providing accurate gradient estimates, enhancing the efficiency of our method. The equivalent compositional form we consider is a smooth three-level compositional function. In this case, the weights for the clients depend on both local loss functions and a global loss function. We use moving-average estimators of these statistics and update the estimators locally. In communication rounds, in addition to averaging the model w, the machines average the estimator of the global loss function. We thus reduce both the communication cost and the computation cost compared to the literature, as presented in Table 1. To further enhance our approach, we have developed an adaptive algorithm for solving FGDRO with KL regularization, named FGDRO-KL-Adam. Stochastic adaptive methods apply variable step sizes for each coordinate based on historical gradient information, often yielding better results than non-adaptive techniques, as evidenced by a wealth of research [13, 36, 49, 69]. In federated learning, Reddi et al. [59] developed a federated adaptive algorithm and showed its effectiveness in various tasks. However, it limits adaptive steps to global updates on the server, with local updates relying on standard SGD, which may lead to suboptimal results. Moreover, their method is primarily designed for Empirical Risk Minimization (ERM) and is not applicable to compositional optimization problems. Our FGDRO-KL-Adam allows local updates to use Adam-type updates, which introduces the challenge of maintaining unbiased gradient estimates, further complicated by the use and updating of the second-order moment. To this end, we update the first-order momentum and second-order momentum locally and then average them globally during communication rounds. Moreover, our analysis carefully manages the moving estimates of the first- and second-order moments, ensuring that the solution provably converges. In summary, our paper contributes in three main areas. First, our FGDRO-CVaR algorithm greatly reduces both communication costs and sample complexity for FGDRO with CVaR-constrained problems. Second, our FGDRO-KL algorithm achieves a better sample complexity while maintaining the same communication costs as existing results.
Third, our FGDRO-KL-Adam integrates adaptive step sizes with Adam-type updates, which has the potential to surpass the performance of conventional SGD-based approaches. Extensive testing on diverse real-world datasets shows that our approach achieves superior performance while substantially reducing communication overhead.

2 Related Work

Federated learning has gained significant attention due to its potential to train machine learning models using data from various sources while ensuring data privacy [38, 48, 54]. Two central challenges in this field are communication cost and client heterogeneity, which have been extensively explored in the literature [63, 76, 77, 73, 62, 34, 64, 2, 31, 68, 4, 33, 35, 71, 70, 34, 18]. This section dives into the body of literature that focuses on these specific challenges.

Non-IID Clients in Federated Learning (FL). One of the key challenges in federated learning is managing client heterogeneity, particularly the issue of non-IID (not independently and identically distributed) data across client networks. Efforts to overcome the negative implications of data diversity have led to the development of model personalization techniques [47, 10, 44, 82, 45, 43, 75, 19, 41]. However, these approaches face challenges when dealing with data from unseen or unidentifiable groups. For a comprehensive examination of the challenges and strategies concerning non-IID clients in federated learning, readers are directed to [27].

Federated Group Distributionally Robust Optimization. Since group distributionally robust optimization has shown effectiveness in addressing non-IID data in centralized settings [14, 51, 12, 56], previous research has investigated Federated Group Distributionally Robust Optimization (FGDRO) to address the challenges posed by non-IID clients in federated settings [50, 11]. The DRFA algorithm [11] focuses on a specific instance of (1), applying a CVaR constraint on p with K = 1. It samples machines based on updated probabilities to allow local updates, reducing the need for communication, with these probabilities managed by a central server. However, this approach results in significant communication costs of O(1/ϵ^12) and sample complexity of O(1/ϵ^16) per machine to achieve an ϵ-stationary point. Recent developments in [22] introduced algorithms for handling group DRO with a CVaR constraint in centralized settings, but adapting these to federated learning entails substantial communication overheads of O(1/ϵ^6). Moreover, [78] introduced the SCAFF-PD algorithm, which is only applicable in convex scenarios and requires the use of the complete dataset in each training round. Regarding KL regularization, [30] achieved a communication cost of O(1/ϵ^3), but required the use of large data batches, resulting in a total sample complexity of O(1/ϵ^6) per machine. [80] studied FGDRO with KL regularization in a decentralized setting, which would incur a communication cost of O(1/ϵ^4) if directly applied to a centralized federated learning setting.

Federated Adaptive Algorithms. Stochastic adaptive methods for non-convex stochastic optimization have garnered significant interest in recent years [13, 36, 49, 69, 42, 84, 65, 7, 46, 25, 16, 81]. These methods, known for assigning unique step sizes to each coordinate, often outperform their non-adaptive counterparts. In federated learning, Reddi et al. [59] have advanced the field with an adaptive algorithm.
However, their methodology predominantly applies adaptive steps at the global server level, with local updates still dependent on SGD. This can lead to suboptimal performance. Furthermore, their approach was tailored for Empirical Risk Minimization (ERM) problems and cannot be applied to the problems considered in this work.

3 Preliminaries

A function f is C-Lipschitz if ∥f(x) − f(y)∥ ≤ C∥x − y∥. A differentiable function f is L-smooth if ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥, where ∇f(x) denotes the gradient. For a non-differentiable function f, its subdifferential ∂f(x) is defined as the set of all subgradients, ∂f(x) = {v | f(y) ≥ f(x) + ⟨v, y − x⟩ + o(∥y − x∥) as y → x}. When the context is clear, we also overload the notation ∂f(x) to denote one subgradient from the subdifferential set. We use ∇f(x; z) or ∂f(x; z) to represent an unbiased estimator of the gradient or subgradient with a randomly drawn sample z. Additionally, a function f is ρ-weakly convex if f(x) ≥ f(y) + ⟨∂f(y), x − y⟩ − (ρ/2)∥x − y∥². For a smooth function f(x), x is an ϵ-stationary point if ∥∇f(x)∥² ≤ ϵ². For non-smooth functions, x is an ϵ-stationary point if dist(0, ∂f(x))² ≤ ϵ², where dist(x, S) = min_{x′∈S} ∥x − x′∥ measures the distance between a point x and a set S. For non-smooth functions, since it is usually difficult or even impossible to find an ϵ-stationary point, we instead seek an ϵ-near stationary point.

Definition 3.1. x is an ϵ-near stationary point of f(·) if there exists x′ such that ∥x − x′∥ ≤ ϵ and dist(0, ∂f(x′)) ≤ ϵ.

4 FGDRO-CVaR

In this section, we present our algorithm designed to tackle Federated Group Distributionally Robust Optimization (FGDRO) with a CVaR constraint. This problem poses substantial challenges due to the complexity of both the CVaR and simplex constraints. Typically, during local updates, individual machines do not have access to adequate information to appropriately adjust the weight vector p. Prior approaches, such as the one proposed by [11], mitigate this issue by updating p during global communication rounds, but this results in slower convergence rates. To this end, we reformulate the problem into an equivalent two-level compositional optimization problem without constraints:

min_w min_s F(w, s) := (1/N) Σ_{i=1}^N f(g_i(w) − s) + (K/N) s,  (2)

where f(·) = (·)_+, g_i(w) = E_{z∼D_i} ℓ(w, z) is the local loss, and s is intended to serve as a threshold value. With s* = argmin_s (1/N) Σ_{i=1}^N f(g_i(w) − s) + (K/N) s, only the K clients with the highest losses will have losses greater than s* [53, 83]. During the training phase, the K clients with the highest losses are expected to predominantly influence the optimization process. The formulation (2) replaces the constrained high-dimensional vector p with a single unconstrained scalar variable s. However, this adjustment introduces new challenges due to the compositional structure and the non-smooth nature of the outer function f. As a result, estimating the subgradient ∂f(g_i(w) − s) ∇g_i(w) using a batch of samples is biased. To address this, it is common practice to create an accurate estimate of g_i(w) [67, 24, 22, 66, 32, 55, 57]. Specifically, we employ a moving average u as our accurate estimate:

u^r_{i,t} = (1 − β₁) u^r_{i,t−1} + β₁ ℓ(w; z^r_{i,t}).  (3)

The estimators for the subgradients with respect to w and s, namely m and v, are then computed using u. Notably, s is updated locally using local data and averaged across clients in communication rounds. It will converge alongside w to an ϵ-near stationary point.
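As a sanity check on (2), the following numpy snippet (loss values are illustrative, not from the paper) verifies that minimizing over s recovers the sum of the top-K losses divided by N, i.e., the CVaR objective up to the constant factor K/N, and that the minimizing s acts as a threshold separating the K highest-loss clients from the rest:

```python
import numpy as np

def h(s, losses, K):
    """Objective of reformulation (2) at a fixed model w, as a function of s."""
    N = len(losses)
    return np.maximum(losses - s, 0.0).mean() + K / N * s

losses = np.array([0.9, 0.1, 0.5, 1.3, 0.2])   # g_i(w) for N = 5 clients
K, N = 2, len(losses)
# h is piecewise linear in s, so a minimizer is attained at one of the losses.
s_star = min(losses, key=lambda s: h(s, losses, K))
# The minimal value is the sum of the top-K losses divided by N.
assert np.isclose(h(s_star, losses, K), np.sort(losses)[-K:].sum() / N)
# The K highest-loss clients sit at or above the threshold s*.
print(s_star, losses >= s_star)   # 0.9 [ True False False  True False]
```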
Updating s locally in this way is a fundamental reason why our method achieves a lower communication complexity compared to [11], as the latter can only update the weight variables at the server node in the communication rounds.

Algorithm 1 FGDRO-CVaR
1: Initialization: w̄^1, s̄^1 = 0, u^0_{i,I} = 0
2: for r = 1, ..., R do
3:   w^r_{i,0} = w̄^r, s^r_{i,0} = s̄^r, u^r_{i,0} = u^{r−1}_{i,I}
4:   for t = 1, ..., I do
5:     Each machine i samples data z^r_{i,t}
6:     u^r_{i,t} = (1 − β₁) u^r_{i,t−1} + β₁ ℓ(w^r_{i,t−1}; z^r_{i,t})
7:     v^r_{i,t} = −∂f(u^r_{i,t} − s^r_{i,t−1}) + K/N, and s^r_{i,t} = s^r_{i,t−1} − η₂ v^r_{i,t}
8:     m^r_{i,t} = ∂f(u^r_{i,t} − s^r_{i,t−1}) ∇ℓ(w^r_{i,t−1}, z^r_{i,t}), and w^r_{i,t} = w^r_{i,t−1} − η₁ m^r_{i,t}
9:   end for
10:  w̄^{r+1} = (1/N) Σ_{i=1}^N w^r_{i,I}, and s̄^{r+1} = (1/N) Σ_{i=1}^N s^r_{i,I}
11: end for
12: Output: w̃ = (1/N) Σ_{i=1}^N w^r_{i,t} and s̃ = (1/N) Σ_{i=1}^N s^r_{i,t}, where r and t are sampled from [1, R] and [1, I], respectively.

We present the formalization of our FGDRO-CVaR method in Algorithm 1. Next, we show the convergence results of FGDRO-CVaR. We make the following assumptions regarding problem (2).

Assumption 4.1. (1) ∀i and z ∈ D_i, ℓ(·, z) is C_g-Lipschitz and L_g-smooth. (2) E_{z∼D_i}∥ℓ(w; z) − g_i(w)∥² ≤ σ² and E_{z∼D_i}∥∇ℓ(w; z) − ∇g_i(w)∥² ≤ σ².

Remark. The first assumption, Lipschitz continuity and smoothness of g_i, is standard in compositional optimization [22, 66, 67]. The second assumption of bounded variance is also common. Assumption 4.1 leads to F(w, s) being weakly convex, which is a key step in the analysis, as shown in Appendix A.2.

The behavior of the estimator u is examined through the following lemma.

Lemma 4.2. Under Assumption 4.1, by setting η = O(1/R^{3/2}), β₁ = O(1/R), I = O(R), Algorithm 1 ensures that

E∥u^r_{i,t} − g_i(w̄^r_t)∥² ≤ (1 − β₁) E∥u^r_{i,t−1} − g_i(w̄^r_{t−1})∥² + 2β₁²σ² + 3β₁η²I²C_g² + (5/β₁) C_g² E∥w̄^r_t − w̄^r_{t−1}∥².

The convergence result of FGDRO-CVaR is given in the following theorem.

Theorem 4.3. Under Assumption 4.1, by setting η = O(1/R^{3/2}), β₁ = O(1/R), I = O(R) and ρ̂ = 2L_g, Algorithm 1 ensures that for the output (w̃, s̃) there exists (w′, s′) such that

max{ dist(0, ∂F(w′, s′))², ρ̂²(∥w̃ − w′∥² + ∥s̃ − s′∥²) } ≤ O(1/R^{1/2}).

Remark. The analysis utilizes the Moreau envelope to address the non-smoothness [52, 9]. To achieve an ϵ-near stationary point of F(·), we need to set R = O(1/ϵ^4) and I = O(1/ϵ^4), and thus the sample complexity on each machine is RI = O(1/ϵ^8). The total sample complexity of O(n/ϵ^6) in [22] is achieved with the STORM estimator [8], which incurs additional memory and computational costs due to the requirement of computing gradients using two models at each iteration. Without the STORM estimator, [22] would exhibit a total sample complexity of O(n/ϵ^8). When deployed in a federated setting, the complexity for each machine would be O(1/ϵ^8), aligning with our results and demonstrating that our approach achieves a linear speed-up in terms of the number of machines. Additionally, although we aggregate certain scalar variables (in FGDRO-CVaR, the scalar variable s; in FGDRO-KL and FGDRO-KL-Adam, to be presented later, the scalar variable v), similar to the technique in Remark 3.1 of [61], we can aggregate these variables using homomorphic encryption, ensuring that their exact values remain confidential.
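For intuition, here is a minimal, self-contained simulation of one communication round of Algorithm 1 on synthetic quadratic client losses. The losses, noise model, and all constants are our own illustrative choices, and we assume the threshold update uses the sign derived from (2), i.e., v = −∂f(u − s) + K/N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K, I = 8, 5, 2, 10             # clients, model dim, top-K, local steps
eta1, eta2, beta1 = 0.05, 0.05, 0.5
targets = rng.normal(size=(N, d))    # client i's loss: 0.5 * ||w - targets[i]||^2

def stoch_loss_and_grad(i, w):
    """Noisy value/gradient of client i's local loss g_i(w)."""
    diff = w - targets[i] + 0.01 * rng.normal(size=d)
    return 0.5 * diff @ diff, diff

w_bar, s_bar = np.zeros(d), 0.0
u = np.zeros(N)                                    # moving-average loss estimates
w, s = np.tile(w_bar, (N, 1)), np.full(N, s_bar)   # local copies after broadcast
for t in range(I):                                 # I local steps, no communication
    for i in range(N):
        loss, grad = stoch_loss_and_grad(i, w[i])
        u[i] = (1 - beta1) * u[i] + beta1 * loss   # estimator of g_i(w), eq. (3)
        active = float(u[i] > s[i])                # subgradient of (.)_+
        s[i] -= eta2 * (K / N - active)            # local threshold update
        w[i] -= eta1 * active * grad               # only high-loss clients move w
w_bar, s_bar = w.mean(axis=0), s.mean()            # server averages w and s
```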
5 FGDRO-KL

In this section, we present our FGDRO-KL algorithm for solving problem (1) with KL regularization. Unlike the CVaR constraint, which focuses on the top-K clients, KL regularization takes all clients into account, assigning them varying weights. Additionally, FGDRO with KL regularization is smooth, and strongly concave with respect to p. Nevertheless, it is subject to the simplex constraint on p. To address this, we use an equivalent form derived from the KKT conditions, as referenced in [56, 30]:

min_w F(w) = λ log( (1/N) Σ_{i=1}^N exp(E_{z∼D_i} ℓ(w; z)/λ) ).  (5)

This formulation eliminates the constrained vector p, and F(w) is smooth since the KL regularization is strongly concave [6]. However, this formulation has a three-level compositional structure, and thus using a batch of data in a three-level composition can result in biased gradient estimation. Furthermore, the gradients on one machine depend on other machines. Specifically, we denote g_i(w) = exp(E_{z∼D_i} ℓ(w; z)/λ) and g(w) = (1/N) Σ_{i=1}^N g_i(w), with ℓ(w; D_i) = E_{z∼D_i} ℓ(w; z); then the gradient of F(w) in (5) is given by

∇F(w) = (1/N) Σ_{i=1}^N (g_i(w)/g(w)) ∇ℓ(w; D_i).  (6)

It is crucial to recognize that the gradient for machine i, i.e., ∇ℓ(w; D_i), is scaled by g_i(w)/g(w). This scaling indicates that machines experiencing larger loss functions exert more influence over the training process. To mitigate the biased gradient estimation, we approximate g_i(w) and g(w) based on moving-average estimators u and v. On each machine, u serves as a moving-average estimator of the local loss function ℓ(w; D_i), with exp(u^r_{i,t}/λ) providing a local approximation of g_i(w); u is updated and maintained locally without the need for averaging during communication rounds. v estimates the global statistic g(w), and is updated locally but averaged during global communication rounds. Subsequently, a moving-average estimator of the gradient, denoted m, is constructed using u and v. For the specific update rules, please refer to Algorithm 2.

For the analysis, we make the following assumptions regarding problem (1) with KL regularization:

Assumption 5.1. (1) ∀i and z ∈ D_i, g_i(·) is C_g-Lipschitz and L_g-smooth. (2) E_{z∼D_i}∥ℓ(w; z) − ℓ(w; D_i)∥² ≤ σ² and E_{z∼D_i}∥∇ℓ(w; z) − ∇ℓ(w; D_i)∥² ≤ σ². (3) f is C_f-Lipschitz and L_f-smooth. (4) ∀i and z ∈ D_i, 0 ≤ ℓ(·; z) ≤ C₀, and ℓ(·) is C_ℓ-Lipschitz and L_ℓ-smooth.

The behavior of the u and v estimators can be bounded similarly to the previous section, as shown in Appendix B. The estimator m of the gradient can be bounded as follows.

Lemma 5.2. Under Assumption 5.1, with proper constants C₁ and G, by setting η = O(1/R^{2/3}), Algorithm 2 ensures that

∥m̄^r_t − ∇F(w̄^r_t)∥² ≤ (1 − β₃/2) ∥m̄^r_{t−1} − ∇F(w̄^r_{t−1})∥² + β₃ C_ℓ² C₁² (1/N) Σ_{i=1}^N ∥u^r_{i,t−1} − ℓ(w̄^r_{t−1}; D_i)∥² + β₃ C ∥v̄^r_t − g(w̄^r_t)∥² + 3η ∥∇F(w̄^r_{t−1})∥² + β₃² C₁² σ²/N + 2β₃ C₁² L_ℓ² η² I² G².

Algorithm 2 FGDRO-KL
1: Initialization: w̄^1, u^0_{i,I}, v̄^1, m̄^1
2: for r = 1, ..., R do
3:   w^r_{i,0} = w̄^r, m^r_{i,0} = m̄^r, u^r_{i,0} = u^{r−1}_{i,I}, v^r_{i,0} = v̄^r
4:   for t = 1, ..., I do
5:     Each machine i samples data z^r_{i,t}
6:     u^r_{i,t} = (1 − β₁) u^r_{i,t−1} + β₁ ℓ(w^r_{i,t−1}; z^r_{i,t}), and v^r_{i,t} = (1 − β₂) v^r_{i,t−1} + β₂ exp(u^r_{i,t}/λ)
7:     h^r_{i,t} = (exp(u^r_{i,t}/λ)/v^r_{i,t}) ∇ℓ(w^r_{i,t−1}; z^r_{i,t}), and m^r_{i,t} = (1 − β₃) m^r_{i,t−1} + β₃ h^r_{i,t}
8:     w^r_{i,t} = w^r_{i,t−1} − η m^r_{i,t}
9:   end for
10:  w̄^{r+1} = (1/N) Σ_{i=1}^N w^r_{i,I}, v̄^{r+1} = (1/N) Σ_{i=1}^N v^r_{i,I}, and m̄^{r+1} = (1/N) Σ_{i=1}^N m^r_{i,I}
11: end for
12: Output: w̃ = (1/N) Σ_{i=1}^N w^r_{i,t}, where r and t are sampled from [1, R] and [1, I], respectively.
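The scaling g_i(w)/(N g(w)) in (6) is just a softmax over the local expected losses divided by λ. A small numpy check (the loss values are illustrative):

```python
import numpy as np

lam = 0.5
local_losses = np.array([0.9, 0.1, 0.5, 1.3, 0.2])   # ell(w; D_i)
g_i = np.exp(local_losses / lam)                     # g_i(w)
g = g_i.mean()                                       # g(w), estimated by v
weights = g_i / (len(g_i) * g)                       # client weights in (6)
assert np.isclose(weights.sum(), 1.0)                # a softmax over losses/lam
print(np.round(weights, 3))   # higher-loss clients receive larger weights
```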
Algorithm 3 FGDRO-KL-Adam
1: Initialization: w̄^1, u^0_{i,I}, v̄^1, m̄^1, q̄^1
2: for r = 1, ..., R do
3:   w^r_{i,0} = w̄^r, m^r_{i,0} = m̄^r, q^r_{i,0} = q̄^r, u^r_{i,0} = u^{r−1}_{i,I}, and v^r_{i,0} = v̄^r
4:   for t = 1, ..., I do
5:     Each machine i samples data z^r_{i,t}
6:     u^r_{i,t} = (1 − β₁) u^r_{i,t−1} + β₁ ℓ(w^r_{i,t−1}; z^r_{i,t}), and v^r_{i,t} = (1 − β₂) v^r_{i,t−1} + β₂ exp(u^r_{i,t}/λ)
7:     h^r_{i,t} = (exp(u^r_{i,t}/λ)/v^r_{i,t}) ∇ℓ(w^r_{i,t−1}; z^r_{i,t})
8:     m^r_{i,t} = (1 − β₃) m^r_{i,t−1} + β₃ h^r_{i,t}, and q^r_{i,t} = (1 − β₄) q^r_{i,t−1} + β₄ (h^r_{i,t})²
9:     w^r_{i,t} = w^r_{i,t−1} − η m^r_{i,t}/(√(q^r_{i,t}) + τ)
10:  end for
11:  w̄^{r+1} = (1/N) Σ_{i=1}^N w^r_{i,I}, v̄^{r+1} = (1/N) Σ_{i=1}^N v^r_{i,I}, m̄^{r+1} = (1/N) Σ_{i=1}^N m^r_{i,I}, and q̄^{r+1} = (1/N) Σ_{i=1}^N q^r_{i,I}
12: end for
13: Output: w̃ = (1/N) Σ_{i=1}^N w^r_{i,t}, where r and t are sampled from [1, R] and [1, I], respectively.

The precisions of the u, v, m estimators depend on each other. The idea is to make E∥u^r_{i,t} − ℓ(w̄^r_t; D_i)∥², E∥v̄^r_t − g(w̄^r_t)∥², ∥m̄^r_t − ∇F(w̄^r_t)∥² and E∥∇F(w̃)∥² jointly converge, and we finally have the following theorem to guarantee convergence:

Theorem 5.3. Under Assumption 5.1, by setting η = O(1/R^{2/3}) and I = R^{1/3}, Algorithm 2 ensures that E∥∇F(w̃)∥² ≤ O(1/R^{2/3}).

Remark. To achieve an ϵ-stationary point, i.e., ∥∇F(w̃)∥² ≤ ϵ², we need to set R = O(1/ϵ^3), I = O(1/ϵ), η = O(ϵ^2) and β₁ = O(ϵ^2). Compared to [30], our approach maintains a communication complexity of O(1/ϵ^3), but significantly reduces the sample complexity on each machine from O(1/ϵ^6) to O(1/ϵ^4), requiring only a batch size of O(1) rather than a large batch size of O(1/ϵ^2). Our results match the communication and sample complexity in [17], which tackles a simpler two-level compositional problem and achieves a sample complexity of O(1/ϵ^4) per machine. Considering that the sample complexity for a two-level compositional problem in a centralized setting would be O(n/ϵ^4) [66], our approach realizes a linear speed-up proportional to the number of machines.

6 FGDRO-KL-Adam

In this section, we introduce an adaptive algorithm, FGDRO-KL-Adam, to address problem (5) with KL regularization, as detailed in Algorithm 3. This algorithm incorporates Adam-type updates at local steps, which have been shown to outperform SGD in centralized settings. Previous studies in federated settings [59] implemented Adam-type updates at the global step but retained SGD for local updates, which may be sub-optimal. Similar to Algorithm 2, u and v are used to estimate the local loss and the global function g(w), respectively. The variables h and m are updated in a manner consistent with Algorithm 2. Under Assumption 5.1, the behavior of the u, v, m estimators is addressed as previously discussed. The primary distinction in Algorithm 3 lies in its adaptive updates of the local model w^r_{i,t}. Here, m serves a role akin to the first-order momentum in Adam, and we introduce q^r_{i,t} to estimate the second-order momentum:

q^r_{i,t} = (1 − β₄) q^r_{i,t−1} + β₄ (h^r_{i,t})².  (8)

Subsequently, the local models are updated adaptively using the formula

w^r_{i,t} = w^r_{i,t−1} − η m^r_{i,t}/(√(q^r_{i,t}) + τ),  (9)

where both the square root and division are performed element-wise. A key step in the analysis is to address the coordinate-wise update, as in the following lemma.

Lemma 6.1. Using the L-smoothness of F, for some proper constants C and G, by setting η = O(1/R^{2/3}), we have

E[F(w̄^r_t)] ≤ E[F(w̄^r_{t−1})] + (η/τ) E∥∇F(w̄^r_{t−1}) − m̄^r_{t−1}∥² + (η/τ) β₃² C − (η/(2(G + τ))) E∥∇F(w̄^r_{t−1})∥².  (10)

Finally, we show that FGDRO-KL-Adam has the same convergence rate as FGDRO-KL.

Theorem 6.2. Under Assumption 5.1, by setting η = O(1/R^{2/3}) and I = R^{1/3}, Algorithm 3 achieves E∥∇F(w̃)∥² ≤ O(1/R^{2/3}).
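Before the complexity remark, a minimal sketch of the adaptive local update (steps 6-9 of Algorithm 3); here h stands for the compositional gradient estimate (exp(u/λ)/v)∇ℓ, and all hyperparameter values are illustrative:

```python
import numpy as np

def adam_type_local_step(w, m, q, h, eta=1e-2, beta3=0.1, beta4=0.1, tau=1e-8):
    """One local update of Algorithm 3: h is the compositional gradient
    estimate (exp(u/lambda)/v) * grad_ell at the current local iterate."""
    m = (1 - beta3) * m + beta3 * h          # first-order momentum, step 8
    q = (1 - beta4) * q + beta4 * h * h      # second-order momentum, eq. (8)
    w = w - eta * m / (np.sqrt(q) + tau)     # element-wise adaptive step, eq. (9)
    return w, m, q

d = 4
w, m, q = np.zeros(d), np.zeros(d), np.zeros(d)
h = np.array([0.5, -0.2, 0.1, 0.0])          # illustrative gradient estimate
w, m, q = adam_type_local_step(w, m, q, h)
# At each communication round, w, m, q (and v) are averaged across clients.
```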
Remark. To achieve an ϵ-stationary point, i.e., ∥∇F(w̃)∥² ≤ ϵ², we again need to set R = O(1/ϵ^3), I = O(1/ϵ), η = O(ϵ^2) and β₁ = O(ϵ^2). The communication and sample complexities are the same as in Theorem 5.3. Our analysis, following the framework in [16], requires √(q^r_{i,t}) + τ to be both upper and lower bounded. This is achieved by the boundedness assumption and the choice of τ, ensuring τ ≤ √(q^r_{i,t}) + τ ≤ G + τ, which is utilized similarly in [59]. However, Guo et al. [16] did not cover the federated learning scenario or compositional problems. It is important to note that we have not developed an Adam-type variant for FGDRO-CVaR. This is due to the need for accurate gradient estimation in the analysis of Adam-type updates, which is achieved using the moving estimator m; in the CVaR variant, the non-smooth nature renders a moving average of the subgradient ∂F(w) not provably accurate.

7 Experiments

Datasets and Neural Networks. We use Pile [15], CivilComments [5], Camelyon17 [1], iWildCam2020 [3], and PovertyMap [74]. For Pile, we preprocess it as in [72]; for the others, we use the preprocessed versions by [37]. We utilize natural data splits where data from the same hospital, web source, location, or demographic group are placed on the same client machine. These experiments involve highly imbalanced data volumes across clients. Data statistics are presented in Appendix E. Additional experiments on CIFAR-10 with Dirichlet distributions over 100 clients are reported in Appendix G. The Pile dataset is a large language dataset. We use the uncopyrighted version [29], which has 17 domains, and each domain is allocated to one machine. We use the GPT-2 model [58] as implemented by [28] with 12 hidden layers, 12 attention heads, and 768-dimensional embeddings and hidden states. We measure performance using the worst log-perplexity and the average log-perplexity over testing groups. CivilComments is a toxicity classification task on online comments across diversified demographic identities. We train on four groups based on the presence of the Black identity attribute and toxicity labels, deploying each on a separate machine, and use the DistilBERT base-uncased model [60] to predict toxicity. We measure performance using the worst group accuracy and the average accuracy of testing groups. Camelyon17 focuses on tumor detection from lymph node images [1], with data from five hospitals split into training (3), validation (1), and testing (1) sets. Training uses three machines, each processing data from one hospital, using DenseNet-121 [26]. The iWildCam2020 dataset consists of wildlife images from various camera traps [3]; it is split into training, validation, and testing segments. We use ResNet50 [21] and measure performance via the Macro F1 score. The PovertyMap dataset contains geographic data aimed at predicting regional poverty levels [74]. We use ResNet50 [21] for our models and evaluate performance using both the Pearson correlation on the worst-performing region and the average across regions.

Table 2: Experiments on natural language tasks. PPL abbreviates perplexity.
| Method | Pile Worst Log-PPL | Pile Average Log-PPL | CivilComments Worst Acc | CivilComments Average Acc |
|---|---|---|---|---|
| FedAvg | 8.085 (±0.0012) | 6.785 (±0.0020) | 0.6415 (±0.0007) | 0.7635 (±0.0012) |
| SCAFFOLD | 7.975 (±0.0024) | 6.901 (±0.0031) | 0.6436 (±0.0016) | 0.7633 (±0.0019) |
| FedProx | 8.079 (±0.0038) | 6.724 (±0.0046) | 0.6430 (±0.0021) | 0.7692 (±0.0025) |
| FedAdam | 7.242 (±0.0048) | 6.479 (±0.0037) | 0.6567 (±0.0023) | 0.7664 (±0.0020) |
| DRFA | 8.014 (±0.0053) | 6.702 (±0.0062) | 0.6327 (±0.0019) | 0.7413 (±0.0022) |
| DR-DSGD | 8.023 (±0.0033) | 6.693 (±0.0030) | 0.6272 (±0.0006) | 0.7523 (±0.0011) |
| FGDRO-CVaR | 8.145 (±0.0039) | 6.907 (±0.0046) | 0.6693 (±0.0015) | 0.7571 (±0.0012) |
| FGDRO-KL | 7.932 (±0.0048) | 6.664 (±0.0051) | 0.6921 (±0.0016) | 0.7734 (±0.0020) |
| FGDRO-KL-Adam | 3.608 (±0.0052) | 2.653 (±0.0040) | 0.6628 (±0.0007) | 0.7614 (±0.0005) |

Table 3: Experiments on image classification tasks.

| Method | Camelyon17 Acc | iWildCam2020 Macro F1 | PovertyMap Worst Pearson | PovertyMap Average Pearson |
|---|---|---|---|---|
| FedAvg | 0.8723 (±0.0074) | 0.4964 (±0.0125) | 0.7301 (±0.0064) | 0.7782 (±0.0077) |
| SCAFFOLD | 0.8851 (±0.0063) | 0.4527 (±0.0331) | 0.7229 (±0.0049) | 0.7814 (±0.0042) |
| FedProx | 0.8703 (±0.0157) | 0.3925 (±0.0228) | 0.7305 (±0.0063) | 0.7641 (±0.0070) |
| FedAdam | 0.9493 (±0.0122) | 0.3570 (±0.0203) | 0.7294 (±0.0058) | 0.8273 (±0.0041) |
| DRFA | 0.8301 (±0.0174) | 0.4200 (±0.0149) | 0.7071 (±0.0026) | 0.7665 (±0.0023) |
| DR-DSGD | 0.9270 (±0.0095) | 0.3157 (±0.0227) | 0.7155 (±0.0063) | 0.7770 (±0.0059) |
| FGDRO-CVaR | 0.8667 (±0.0110) | 0.5080 (±0.0174) | 0.7443 (±0.0052) | 0.7977 (±0.0051) |
| FGDRO-KL | 0.9243 (±0.0129) | 0.5201 (±0.0239) | 0.7254 (±0.0066) | 0.7829 (±0.0062) |
| FGDRO-KL-Adam | 0.9399 (±0.0154) | 0.4489 (±0.0205) | 0.7827 (±0.0071) | 0.8225 (±0.0060) |

Baselines. We compare our algorithms FGDRO-CVaR, FGDRO-KL, and FGDRO-KL-Adam with six baselines: FedAvg [48], SCAFFOLD [34], FedProx [40], FedAdam [59], DRFA [11], and DR-DSGD [30]. We tune the initial step size in [1e-4, 1e-3, 1e-2, 1e-1]. All algorithms set the communication interval I = 32 unless otherwise specified. The local mini-batch sizes are set to 32. Experiments are run for 20K local iterations, except for Pile, which runs for 200K iterations. The β parameters of FGDRO-KL and FGDRO-KL-Adam are tuned in [0.01, 0.1, 0.2, 0.5]. For each algorithm, we repeat the experiments 3 times with different random seeds and report the averaged performance. Following [72], our FGDRO algorithms for Pile initially train for 20K iterations to obtain domain weights, which are then fixed during subsequent training phases.

Results. We report the experimental results for natural language processing in Table 2 and those for computer vision in Table 3. Our methods outperform the baselines in most tasks. Our approaches improve worst-case performance without hurting average-case performance. Furthermore, FGDRO-KL-Adam demonstrates superior performance compared to FGDRO-KL in most cases.

Ablation Studies. Here we present ablation studies to examine some aspects of our algorithm design. First, in Figure 1(a), we vary the communication interval I in experiments on the Camelyon dataset. Both our FGDRO-CVaR and FGDRO-KL-Adam algorithms can tolerate skipping a large number of communications without degrading performance. To demonstrate the effect of the local adaptive updates, we develop a Local Adam algorithm (see Appendix D), which optimizes ERM using our design of Adam steps in local updates. The results are plotted in Figure 1(b). The Local Adam algorithm outperforms FedAdam, which uses SGD in local steps and only uses adaptive steps in global communication rounds.
Figure 1: Ablation experiments. (a) Varying I. (b) Effectiveness of local Adam steps.

8 Conclusions

Our algorithms provide a significant advantage in addressing federated group distributionally robust optimization while maintaining low communication and computational complexity. Furthermore, incorporating local adaptive steps has the potential to accelerate the training process beyond the capabilities of traditional approaches that employ SGD in local steps. Various experiments on natural language processing and computer vision confirm our theoretical results and underscore the effectiveness of our algorithms. It remains to develop a provable adaptive algorithm for FGDRO-CVaR, which is currently absent due to the non-smoothness and compositional problem structure.

9 Impact Statements

This paper is meant to advance the field of federated machine learning. We do not see a noticeable negative impact.

Acknowledgments and Disclosure of Funding

We appreciate the feedback provided by the anonymous reviewers. This work was partially supported by National Science Foundation Career Award 2246753 and National Science Foundation Awards 2246757, 2246756 and 2306572.

References

[1] P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the Camelyon17 challenge. IEEE Transactions on Medical Imaging, 38(2):550–560, 2018.
[2] D. Basu, D. Data, C. Karakus, and S. Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
[3] S. Beery, A. Agarwal, E. Cole, and V. Birodkar. The iWildCam 2021 competition dataset. arXiv preprint arXiv:2105.03494, 2021.
[4] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
[5] D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of the 2019 World Wide Web Conference, pages 491–500, 2019.
[6] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] J. Chen, D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), pages 3267–3275, 2020.
[8] A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.
[9] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
[10] Y. Deng, M. M. Kamani, and M. Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
[11] Y. Deng, M. M. Kamani, and M. Mahdavi. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.
[12] J. Duchi and H. Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1–55, 2019.
[13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[14] Y.
Fan, S. Lyu, Y. Ying, and B. Hu. Learning with average top-k loss. Advances in Neural Information Processing Systems, 30, 2017.
[15] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[16] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family and beyond. arXiv preprint arXiv:2104.14840, 2021.
[17] Z. Guo, R. Jin, J. Luo, and T. Yang. FeDXL: Provable federated learning for deep X-risk optimization. In International Conference on Machine Learning, pages 11934–11966. PMLR, 2023.
[18] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. Advances in Neural Information Processing Systems, 32, 2019.
[19] F. Hanzely, S. Hanzely, S. Horváth, and P. Richtárik. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems, 33:2304–2315, 2020.
[20] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Q. Hu, D. Zhu, and T. Yang. Non-smooth weakly-convex finite-sum coupled compositional optimization. arXiv preprint arXiv:2310.03234, 2023.
[23] X. Hu, S. Li, and Y. Liu. Generalization bounds for federated learning: Fast rates, unparticipating clients and unbounded losses. In The Eleventh International Conference on Learning Representations, 2023.
[24] Y. Hu, S. Zhang, X. Chen, and N. He. Biased stochastic gradient descent for conditional stochastic optimization. arXiv preprint arXiv:2002.10790, 2020.
[25] F. Huang, J. Li, and H. Huang. Super-Adam: Faster and universal framework of adaptive gradients. arXiv preprint arXiv:2106.08208, 2021.
[26] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[27] W. Huang, M. Ye, Z. Shi, G. Wan, H. Li, B. Du, and Q. Yang. Federated learning for generalization, robustness, fairness: A survey and benchmark. arXiv preprint arXiv:2311.06750, 2023.
[28] HuggingFace. OpenAI GPT-2. https://huggingface.co/docs/transformers/model_doc/gpt2.
[29] HuggingFace. Pile-uncopyrighted. https://huggingface.co/datasets/monology/pile-uncopyrighted.
[30] C. B. Issaid, A. Elgabli, and M. Bennis. DR-DSGD: A distributionally robust decentralized learning algorithm over graphs. arXiv preprint arXiv:2208.13810, 2022.
[31] P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. Advances in Neural Information Processing Systems, 31, 2018.
[32] W. Jiang, G. Li, Y. Wang, L. Zhang, and T. Yang. Multi-block-single-probe variance reduced estimator for coupled compositional optimization. In NeurIPS, 2022.
[33] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning.
Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.
[34] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
[35] A. Khaled, K. Mishchenko, and P. Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
[36] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
[37] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
[38] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
[39] G. Lan and Z. Zhang. Optimal methods for convex risk-averse distributed optimization. SIAM Journal on Optimization, 33(3):1518–1557, 2023.
[40] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
[41] T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
[42] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 983–992, 2019.
[43] Z. Li, H. Zhao, B. Li, and Y. Chi. SoteriaFL: A unified framework for private federated learning with communication compression. Advances in Neural Information Processing Systems, 35:4285–4300, 2022.
[44] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
[45] G. Long, M. Xie, T. Shen, T. Zhou, X. Wang, and J. Jiang. Multi-center federated learning: clients clustering for better personalization. World Wide Web, 26(1):481–500, 2023.
[46] L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In 7th International Conference on Learning Representations (ICLR), 2019.
[47] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020.
[48] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[49] H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), pages 109–123, 2004.
[50] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.
[51] H. Namkoong and J. C.
Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems, 29, 2016.
[52] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[53] W. Ogryczak and A. Tamir. Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122, 2003.
[54] S. Pati, U. Baid, B. Edwards, M. Sheller, S.-H. Wang, G. A. Reina, P. Foley, A. Gruzdev, D. Karkada, C. Davatzikos, et al. Federated learning enables big data for rare cancer boundary detection. Nature Communications, 13(1):7346, 2022.
[55] Q. Qi, J. Lyu, K.-S. Chan, E.-W. Bai, and T. Yang. Stochastic constrained DRO with a complexity independent of sample size. Transactions on Machine Learning Research.
[56] Q. Qi, Z. Guo, Y. Xu, R. Jin, and T. Yang. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34:10067–10080, 2021.
[57] Q. Qi, Y. Xu, W. Yin, R. Jin, and T. Yang. Attentional-biased stochastic gradient descent. Transactions on Machine Learning Research, 2022.
[58] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[59] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[60] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[61] Z. Shen, J. Cervino, H. Hassani, and A. Ribeiro. An agnostic approach to federated learning with class imbalance. In International Conference on Learning Representations, 2021.
[62] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
[63] S. U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2018.
[64] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. Advances in Neural Information Processing Systems, 31, 2018.
[65] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
[66] B. Wang and T. Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. arXiv preprint arXiv:2202.12396, 2022.
[67] M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
[68] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.
[69] R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 6677–6686, 2019.
[70] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020.
[71] B. E. Woodworth, K. K. Patel, and N. Srebro.
Minibatch vs local SGD for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020.
[72] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu. DoReMi: Optimizing data mixtures speeds up language model pretraining. arXiv preprint arXiv:2305.10429, 2023.
[73] T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. Advances in Neural Information Processing Systems, 26, 2013.
[74] C. Yeh, A. Perez, A. Driscoll, G. Azzari, Z. Tang, D. Lobell, S. Ermon, and M. Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1):2583, 2020.
[75] K. Yi, L. Condat, and P. Richtárik. Explicit personalization and local training: Double communication acceleration in federated learning. arXiv preprint arXiv:2305.13170, 2023.
[76] H. Yu, R. Jin, and S. Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193. PMLR, 2019.
[77] H. Yu, S. Yang, and S. Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5693–5700, 2019.
[78] Y. Yu, S. P. Karimireddy, Y. Ma, and M. I. Jordan. SCAFF-PD: Communication efficient fair and robust federated learning. arXiv preprint arXiv:2307.13381, 2023.
[79] H. Yuan, W. Morningstar, L. Ning, and K. Singhal. What do we mean by generalization in federated learning? arXiv preprint arXiv:2110.14216, 2021.
[80] M. Zecchin, M. Kountouris, and D. Gesbert. Communication-efficient distributionally robust decentralized learning. arXiv preprint arXiv:2205.15614, 2022.
[81] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra. Why Adam beats SGD for attention models. 2019.
[82] M. Zhang, K. Sapra, S. Fidler, S. Yeung, and J. M. Alvarez. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565, 2020.
[83] D. Zhu, G. Li, B. Wang, X. Wu, and T. Yang. When AUC meets DRO: Optimizing partial AUC for deep learning with non-convex convergence guarantee. In International Conference on Machine Learning, pages 27548–27573. PMLR, 2022.
[84] F. Zou and L. Shen. On the convergence of AdaGrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408, 2(3):5, 2018.

A Analysis of FGDRO-CVaR

For non-smooth functions, since it is usually difficult or even impossible to find an ϵ-stationary point, we are interested in finding an ϵ-near stationary point. Following a common technique for finding an ϵ-near stationary point of a weakly convex function, we use the Moreau envelope of a general non-smooth ρ-weakly-convex F(x) [9], defined as

ψ_{1/ρ̂}(x) = min_{x′} [F(x′) + (ρ̂/2)∥x − x′∥²].

By the properties of the Moreau envelope [52, 9], if ρ̂ = 2ρ, then ψ_{1/ρ̂}(·) is smooth, and an ϵ-stationary point of ψ_{1/ρ̂}(·) is an ϵ-near stationary point of F(·). Since F(w, s) is non-smooth, we investigate its Moreau envelope,

ψ_{1/ρ̂}(w, s) = min_{w′,s′} [F(w′, s′) + (ρ̂/2)(∥w − w′∥² + ∥s − s′∥²)].

A.1 Proof of Lemma 4.2

Proof. By the updating rule, we have

E∥u^r_{i,t} − g_i(w̄^r_t)∥² = E∥(1 − β₁)u^r_{i,t−1} + β₁ℓ(w^r_{i,t−1}, z^r_{i,t}) − g_i(w̄^r_t)∥²
≤ (1 + β₁/2) E∥(1 − β₁)u^r_{i,t−1} + β₁ℓ(w^r_{i,t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1})∥² + (1 + 2/β₁) E∥g_i(w̄^r_t) − g_i(w̄^r_{t−1})∥²,
where

E∥(1 − β₁)u^r_{i,t−1} + β₁ℓ(w^r_{i,t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1})∥²
= E∥(1 − β₁)(u^r_{i,t−1} − g_i(w̄^r_{t−1})) + β₁(ℓ(w̄^r_{t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1})) + β₁(ℓ(w^r_{i,t−1}, z^r_{i,t}) − ℓ(w̄^r_{t−1}, z^r_{i,t}))∥²
≤ (1 + β₁/2) E∥(1 − β₁)(u^r_{i,t−1} − g_i(w̄^r_{t−1})) + β₁(ℓ(w̄^r_{t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1}))∥² + (1 + 2/β₁) β₁² E∥ℓ(w^r_{i,t−1}, z^r_{i,t}) − ℓ(w̄^r_{t−1}, z^r_{i,t})∥²
= (1 + β₁/2) E∥(1 − β₁)(u^r_{i,t−1} − g_i(w̄^r_{t−1}))∥² + (1 + β₁/2) β₁² E∥ℓ(w̄^r_{t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1})∥² + (1 + 2/β₁) β₁² E∥ℓ(w^r_{i,t−1}, z^r_{i,t}) − ℓ(w̄^r_{t−1}, z^r_{i,t})∥²,

where the last equality uses E_{t−1}[ℓ(w̄^r_{t−1}, z^r_{i,t}) − g_i(w̄^r_{t−1})] = 0. Since ℓ(·, z) is Lipschitz and smooth, we know ∥m^r_{i,t}∥² ≤ C_g². We also have

∥w̄^r_t − w^r_{i,t}∥² = ∥(w̄^r − η (1/N) Σ_{k=1}^N Σ_{t′=1}^t m^r_{k,t′}) − (w̄^r − η Σ_{t′=1}^t m^r_{i,t′})∥²
≤ 2η² ∥(1/N) Σ_{k=1}^N Σ_{t′=1}^t m^r_{k,t′}∥² + 2η² ∥Σ_{t′=1}^t m^r_{i,t′}∥² ≤ 4η²I²C_g².

Combining the above,

E∥u^r_{i,t} − g_i(w̄^r_t)∥² ≤ (1 − β₁) E∥u^r_{i,t−1} − g_i(w̄^r_{t−1})∥² + 2β₁²σ² + 4β₁C_g² E∥w̄^r_{t−1} − w^r_{i,t−1}∥² + (1 + 2/β₁) E∥g_i(w̄^r_t) − g_i(w̄^r_{t−1})∥²
≤ (1 − β₁) E∥u^r_{i,t−1} − g_i(w̄^r_{t−1})∥² + 2β₁²σ² + 4β₁η²I²C_g² + (3/β₁) C_g² E∥w̄^r_t − w̄^r_{t−1}∥².

A.2 Proof of Theorem 4.3

Proof. Since f(·) is 1-Lipschitz, convex and monotonically nondecreasing, while ℓ(·, z) is C_g-Lipschitz and L_g-smooth, we know that F(w, s) is ρ_F := L_g-weakly convex:

f(g(w), s) ≥ f(g(w′), s′) + ⟨∂f(g(w′), s′), g(w) − g(w′)⟩ + ⟨∂_s f(g(w′), s′), s − s′⟩
≥ f(g(w′), s′) + ⟨∂f(g(w′), s′) ∇g(w′), w − w′⟩ − (L_g/2)∥w − w′∥² + ⟨∂_s f(g(w′), s′), s − s′⟩,  (16)

where the first inequality uses the convexity of f(·), and the second inequality uses that ∂f(·) ≥ 0 and the L_g-smoothness of g(·). With ρ̂ = max{2ρ_F, 1}, we define

ψ_{1/ρ̂}(w̄^r_t, s̄^r_t) = min_{w′,s′} [F(w′, s′) + (ρ̂/2)(∥w′ − w̄^r_t∥² + ∥s′ − s̄^r_t∥²)],  (17)
(ŵ^r_t, ŝ^r_t) = argmin_{w′,s′} [F(w′, s′) + (ρ̂/2)(∥w′ − w̄^r_t∥² + ∥s′ − s̄^r_t∥²)].  (18)

Then we have the following [9]:

∥ŵ^r_t − w̄^r_t∥² + ∥ŝ^r_t − s̄^r_t∥² = (1/ρ̂²) ∥∇ψ_{1/ρ̂}(w̄^r_t, s̄^r_t)∥².  (19)

Moreover,

E[ψ_{1/ρ̂}(w̄^r_t, s̄^r_t)] = E[min_{w′,s′} F(w′, s′) + (ρ̂/2)(∥w′ − w̄^r_t∥² + ∥s′ − s̄^r_t∥²)]
≤ E[F(ŵ^r_{t−1}, ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_t∥² + ∥ŝ^r_{t−1} − s̄^r_t∥²)]
= E[F(ŵ^r_{t−1}, ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − (w̄^r_{t−1} − η(1/N)Σ_{i=1}^N m^r_{i,t})∥² + ∥ŝ^r_{t−1} − (s̄^r_{t−1} − η(1/N)Σ_{i=1}^N v^r_{i,t})∥²)]
≤ E[F(ŵ^r_{t−1}, ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²)] + ρ̂ E[η⟨ŵ^r_{t−1} − w̄^r_{t−1}, (1/N)Σ_{i=1}^N m^r_{i,t}⟩ + η⟨ŝ^r_{t−1} − s̄^r_{t−1}, (1/N)Σ_{i=1}^N v^r_{i,t}⟩] + ρ̂η²C_g²,  (20)

where we used ∥m^r_{i,t}∥² ≤ C_g². Denote m̂^r_t = (1/N)Σ_{i=1}^N ∂_u f(u^r_{i,t}, s^r_{i,t−1}) ∇ℓ(w^r_{i,t−1}; z^r_{i,t}) and v̂^r_t = (1/N)Σ_{i=1}^N ∂_s f(u^r_{i,t}, s^r_{i,t−1}), where the subdifferentials ∂_u f(u^r_{i,t}, s^r_{i,t−1}) and ∂_s f(u^r_{i,t}, s^r_{i,t−1}) are selected the same as in m^r_{i,t} and v^r_{i,t}, respectively. Thus,

E[ψ_{1/ρ̂}(w̄^r_t, s̄^r_t)] ≤ E[F(ŵ^r_{t−1}, ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²)] + ρ̂ E[η⟨ŵ^r_{t−1} − w̄^r_{t−1}, m̂^r_t⟩ + η⟨ŝ^r_{t−1} − s̄^r_{t−1}, v̂^r_t⟩] + ρ̂ E[η⟨ŵ^r_{t−1} − w̄^r_{t−1}, m̄^r_t − m̂^r_t⟩ + η⟨ŝ^r_{t−1} − s̄^r_{t−1}, v̄^r_t − v̂^r_t⟩] + ρ̂η²C_g²
≤ E[F(ŵ^r_{t−1}, ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²)] + ρ̂ E[η⟨ŵ^r_{t−1} − w̄^r_{t−1}, m̂^r_t⟩ + η⟨ŝ^r_{t−1} − s̄^r_{t−1}, v̂^r_t⟩] + (ηρ̂/2)∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + (ηρ̂/2)∥ŝ^r_{t−1} − s̄^r_{t−1}∥² + (ηρ̂/2)∥m̂^r_t − m̄^r_t∥² + (ηρ̂/2)∥v̂^r_t − v̄^r_t∥² + ρ̂η²C_g².  (21)

Since f is convex, ∂_u f ≥ 0 and g_i is ρ_g := L_g-weakly convex, we have

f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) ≥ f(u^r_{i,t}, s^r_{i,t−1}) + ∂_u f(u^r_{i,t}, s^r_{i,t−1})(g_i(ŵ^r_{t−1}) − u^r_{i,t}) + ∂_s f(u^r_{i,t}, s^r_{i,t−1})(ŝ^r_{t−1} − s^r_{i,t−1})
≥ f(u^r_{i,t}, s^r_{i,t−1}) + ∂_u f(u^r_{i,t}, s^r_{i,t−1})(g_i(w̄^r_{t−1}) − u^r_{i,t} + ⟨∇g_i(w̄^r_{t−1}), ŵ^r_{t−1} − w̄^r_{t−1}⟩ − (ρ_g/2)∥ŵ^r_{t−1} − w̄^r_{t−1}∥²) + ∂_s f(u^r_{i,t}, s^r_{i,t−1})(ŝ^r_{t−1} − s^r_{i,t−1}).  (22)
Noting ∂_u f ≤ 1, (22) yields

⟨m̂^r_t, ŵ^r_{t−1} − w̄^r_{t−1}⟩ + ⟨v̂^r_t, ŝ^r_{t−1} − s̄^r_{t−1}⟩
≤ (1/N)Σ_{i=1}^N [f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) − f(u^r_{i,t}, s^r_{i,t−1}) − ∂_u f(u^r_{i,t}, s^r_{i,t−1})[g_i(w̄^r_{t−1}) − u^r_{i,t}]] + (ρ_g/2)∥ŵ^r_{t−1} − w̄^r_{t−1}∥².  (23)

Putting (21) and (23) together, we obtain

E[ψ_{1/ρ̂}(w̄^r_t, s̄^r_t)] ≤ E[ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1}) + (ηρ̂/2)∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + (ηρ̂/2)∥ŝ^r_{t−1} − s̄^r_{t−1}∥² + (ηρ̂/2)∥m̂^r_t − m̄^r_t∥² + (ηρ̂/2)∥v̂^r_t − v̄^r_t∥²] + ρ̂η²C_g²
+ ηρ̂ (1/N)Σ_{i=1}^N E[f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) − f(u^r_{i,t}, s^r_{i,t−1}) − ∂_u f(u^r_{i,t}, s^r_{i,t−1})[g_i(w̄^r_{t−1}) − u^r_{i,t}]] + (ηρ̂ρ_g/2) E∥ŵ^r_{t−1} − w̄^r_{t−1}∥²,

where we further decompose f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) − f(u^r_{i,t}, s^r_{i,t−1}) = [f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) − f(g_i(w̄^r_{t−1}), s^r_{i,t−1})] + [f(g_i(w̄^r_{t−1}), s^r_{i,t−1}) − f(u^r_{i,t}, s^r_{i,t−1})]. Since f(g_i(w), s) is ρ_F-weakly convex in (w, s), the function f(g_i(w), s) + (ρ̂/2)(∥w − w̄^r_{t−1}∥² + ∥s − s̄^r_{t−1}∥²) is (ρ̂ − ρ_F)-strongly convex in (w, s). Therefore, we get

f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) − f(g_i(w̄^r_{t−1}), s^r_{i,t−1})
= [f(g_i(ŵ^r_{t−1}), ŝ^r_{t−1}) + (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²)] − [f(g_i(w̄^r_{t−1}), s^r_{i,t−1}) + (ρ̂/2)∥s^r_{i,t−1} − s̄^r_{t−1}∥²] − (ρ̂/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²) + (ρ̂/2)∥s^r_{i,t−1} − s̄^r_{t−1}∥²
≤ −((ρ̂ − ρ_F)/2)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s^r_{i,t−1}∥²) + (ρ̂/2)∥s^r_{i,t−1} − s̄^r_{t−1}∥²
≤ (ρ_F/2 − ρ̂/4)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²) + (ρ̂ + ρ_F)∥s^r_{i,t−1} − s̄^r_{t−1}∥².

Hence,

E[ψ_{1/ρ̂}(w̄^r_t, s̄^r_t)] ≤ ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1}) − (ηρ̂²/4)(∥ŵ^r_{t−1} − w̄^r_{t−1}∥² + ∥ŝ^r_{t−1} − s̄^r_{t−1}∥²) + ρ̂η²C_g² + ηρ̂C_f (1/N)Σ_i E∥g_i(w̄^r_{t−1}) − u^r_{i,t}∥ + (ηρ̂/2)∥m̂^r_t − m̄^r_t∥² + (ηρ̂/2)∥v̂^r_t − v̄^r_t∥² + ηρ̂(ρ̂ + ρ_F)(1/N)Σ_i ∥s^r_{i,t−1} − s̄^r_{t−1}∥²
≤ ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1}) − (η/4)∥∇ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1})∥² + ρ̂η²C_g² + ηρ̂C_f (1/N)Σ_i E∥g_i(w̄^r_{t−1}) − u^r_{i,t}∥ + (ηρ̂/2)η²I²C_g².

Taking a telescoping sum over r and t,

(1/(RI)) Σ_{r,t} E∥∇ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1})∥² ≤ O( ψ_{1/ρ̂}(w̄^0_0, s̄^0_0)/(ηRI) + ηρ̂C_g² + (1/(NRI)) Σ_{r,t,i} E∥g_i(w̄^r_{t−1}) − u^r_{i,t}∥ + η²I²C_g² ).

Hence, taking a telescoping sum over Lemma 4.2, we have

(1/(NRI)) Σ_{r,t,i} E∥u^r_{i,t−1} − g_i(w̄^r_{t−1})∥² ≤ O( (1/(β₁RI)) (1/N) Σ_i E∥u^0_{i,0} − g_i(w̄^0_0)∥² + β₁σ² + η²I²C_g²/β₁ ),

and therefore

E∥∇ψ_{1/ρ̂}(w̃, s̃)∥² = (1/(RI)) Σ_{r,t} E∥∇ψ_{1/ρ̂}(w̄^r_{t−1}, s̄^r_{t−1})∥² ≤ O( 1/(ηRI) + η + 1/(β₁RI) + √β₁ + ηI ).

Defining

(ŵ, ŝ) = argmin_{w′,s′} [F(w′, s′) + (ρ̂/2)(∥w′ − w̃∥² + ∥s′ − s̃∥²)],  (31)

we also have ∥ŵ − w̃∥² + ∥ŝ − s̃∥² = (1/ρ̂²)∥∇ψ_{1/ρ̂}(w̃, s̃)∥² and dist(0, ∂F(ŵ, ŝ))² ≤ ∥∇ψ_{1/ρ̂}(w̃, s̃)∥² [9]. We conclude by setting the parameters as in the theorem.

B Analysis of FGDRO-KL

The behavior of u, which is an estimator of ℓ(w; D_i), is given in the following lemma.

Lemma B.1. Under Assumption 5.1, with some constant G, by setting η = O(1/R^{2/3}), Algorithm 2 ensures that

E∥u^r_{i,t} − ℓ(w̄^r_t; D_i)∥² ≤ (1 − β₁/2) E∥u^r_{i,t−1} − ℓ(w̄^r_{t−1}; D_i)∥² + 3β₁C₂η²β₃²I⁴ + 12β₁C_ℓ²η²I Σ_{τ=0}^{t−1} E∥m̄^r_τ∥² + β₁²σ² + (3/β₁) C_ℓ²η² E∥m̄^r_t∥².  (32)

The behavior of v, which is an estimator of g(w) = (1/N) Σ_i g_i(w), is given in the following lemma.

Lemma B.2.
The behavior of $v$, which is an estimator of $\frac{1}{N}\sum_i g_i(w)$, is given in the following lemma.

Lemma B.2. Under Assumption 5.1, with some constant $C_1$, for a sufficiently small step size $\eta$, Algorithm 2 ensures that
\[
\mathbb{E}\big\|\bar v^r_t - g(\bar w^r_t)\big\|^2 \le (1-\beta_2)\,\mathbb{E}\big\|\bar v^r_{t-1} - g(\bar w^r_{t-1})\big\|^2 + 3\beta_2 C_1\,\frac{1}{N}\sum_{i=1}^N \mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2 + \frac{3}{\beta_2}C_g^2\,\mathbb{E}\big\|\bar w^r_t - \bar w^r_{t-1}\big\|^2. \tag{33}
\]

B.2 Proof of Lemma B.1

Proof. Denoting $\bar w^r_t = \frac{1}{N}\sum_{i=1}^N w^r_{i,t}$, we have $\bar w^r_t = \bar w^r_{t-1} - \eta\,\bar m^r_t$, and
\[
\begin{aligned}
\mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2
&= \mathbb{E}\big\|(1-\beta_1)u^r_{i,t-1} + \beta_1\ell(w^r_{i,t-1}; z^r_{i,t}) - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2\\
&= \mathbb{E}\big\|(1-\beta_1)\big(u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big) + \beta_1\big(\ell(w^r_{i,t-1}; z^r_{i,t}) - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big) + \big(\ell(\bar w^r_{t-1}; \mathcal{D}_i) - \ell(\bar w^r_t; \mathcal{D}_i)\big)\big\|^2\\
&\le \Big(1+\frac{\beta_1}{2}\Big)\mathbb{E}\big\|(1-\beta_1)\big(u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big) + \beta_1\big(\ell(w^r_{i,t-1}; z^r_{i,t}) - \ell(w^r_{i,t-1}; \mathcal{D}_i)\big) + \beta_1\big(\ell(w^r_{i,t-1}; \mathcal{D}_i) - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big)\big\|^2 + \frac{3}{\beta_1}C_\ell^2\,\mathbb{E}\|\bar w^r_{t-1} - \bar w^r_t\|^2\\
&\overset{(a)}{=} \Big(1+\frac{\beta_1}{2}\Big)\mathbb{E}\big\|(1-\beta_1)\big(u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big) + \beta_1\big(\ell(w^r_{i,t-1}; \mathcal{D}_i) - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big)\big\|^2 + \beta_1^2\,\mathbb{E}\big\|\ell(w^r_{i,t-1}; z^r_{i,t}) - \ell(w^r_{i,t-1}; \mathcal{D}_i)\big\|^2 + \frac{3}{\beta_1}C_\ell^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2\\
&\le \Big(1+\frac{\beta_1}{2}\Big)(1-\beta_1)^2\,\mathbb{E}\big\|u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2 + \Big(1+\frac{2}{\beta_1}\Big)\beta_1^2\,\mathbb{E}\big\|\ell(w^r_{i,t-1}; \mathcal{D}_i) - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2 + \beta_1^2\sigma^2 + \frac{3}{\beta_1}C_\ell^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2,
\end{aligned}
\]
where (a) holds due to the fact that $\mathbb{E}_{t-1}[\ell(w^r_{i,t-1}; z^r_{i,t}) - \ell(w^r_{i,t-1}; \mathcal{D}_i)] = 0$. Moreover, we have
\[
\mathbb{E}\big\|\ell(w^r_{i,t-1}; \mathcal{D}_i) - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2 \le C_\ell^2\,\mathbb{E}\|w^r_{i,t-1} - \bar w^r_{t-1}\|^2 \le 2C_\ell^2\eta^2\,\mathbb{E}\Big\|\sum_{\tau=1}^{t-1} m^r_{i,\tau}\Big\|^2 + 2C_\ell^2\eta^2\,\mathbb{E}\Big\|\frac{1}{N}\sum_{k=1}^N\sum_{\tau=1}^{t-1} m^r_{k,\tau}\Big\|^2 \le 6C_\ell^2\eta^2 I\sum_{\tau=1}^{t-1}\mathbb{E}\|m^r_{i,\tau} - \bar m^r_\tau\|^2 + 6C_\ell^2\eta^2 I\sum_{\tau=1}^{t-1}\mathbb{E}\|\bar m^r_\tau\|^2. \tag{35}
\]
Since all machines start each round from the same $\bar m^r_0$,
\[
m^r_{i,\tau} - \bar m^r_\tau = \sum_{\tau'=1}^{\tau}(1-\beta_3)^{\tau-\tau'}\beta_3\Big[\frac{\exp(u^r_{i,\tau'}/\lambda)}{v^r_{i,\tau'}}\nabla\ell(w^r_{i,\tau'-1}; z^r_{i,\tau'}) - \frac{1}{N}\sum_{k=1}^N \frac{\exp(u^r_{k,\tau'}/\lambda)}{v^r_{k,\tau'}}\nabla\ell(w^r_{k,\tau'-1}; z^r_{k,\tau'})\Big],
\]
which yields
\[
\mathbb{E}\|m^r_{i,\tau} - \bar m^r_\tau\|^2 \le 8I\beta_3^2\sum_{\tau'=1}^{\tau} C_1^2\Big(\mathbb{E}\big\|\nabla\ell(w^r_{i,\tau'-1}; z^r_{i,\tau'}) - \nabla\ell(w^r_{i,\tau'-1}; \mathcal{D}_i)\big\|^2 + \mathbb{E}\big\|\nabla\ell(w^r_{i,\tau'-1}; \mathcal{D}_i)\big\|^2\Big) \le 16 I^2\beta_3^2 C_1^2(\sigma^2 + C_\ell^2), \tag{37}
\]
using $v^r_{i,\tau'} \ge 1$ and $\exp(u^r_{i,\tau'}/\lambda) \le C_1 = \exp(C_0/\lambda)$. Thus,
\[
\mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2 \le \Big(1-\frac{\beta_1}{2}\Big)\mathbb{E}\big\|u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2 + C_2\beta_1\eta^2\beta_3^2 I^4 + 12\beta_1 C_\ell^2\eta^2 I\sum_{\tau=0}^{t-1}\mathbb{E}\|\bar m^r_\tau\|^2 + \beta_1^2\sigma^2 + \frac{3}{\beta_1}C_\ell^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2, \tag{38}
\]
with $C_2 = 288\,C_\ell^2 C_1^2(\sigma^2 + C_\ell^2)$. □

B.3 Proof of Lemma B.2

Proof. We have
\[
\begin{aligned}
\mathbb{E}\big\|\bar v^r_t - g(\bar w^r_t)\big\|^2
&= \mathbb{E}\Big\|(1-\beta_2)\bar v^r_{t-1} + \beta_2\,\frac{1}{N}\sum_{i=1}^N \exp(u^r_{i,t}/\lambda) - g(\bar w^r_t)\Big\|^2\\
&= \mathbb{E}\Big\|(1-\beta_2)\big(\bar v^r_{t-1} - g(\bar w^r_{t-1})\big) + \beta_2\Big(\frac{1}{N}\sum_{i=1}^N \exp(u^r_{i,t}/\lambda) - g(\bar w^r_{t-1})\Big) + \big(g(\bar w^r_{t-1}) - g(\bar w^r_t)\big)\Big\|^2\\
&\le \Big(1+\frac{\beta_2}{2}\Big)\mathbb{E}\Big\|(1-\beta_2)\big(\bar v^r_{t-1} - g(\bar w^r_{t-1})\big) + \beta_2\Big(\frac{1}{N}\sum_{i=1}^N \exp(u^r_{i,t}/\lambda) - g(\bar w^r_{t-1})\Big)\Big\|^2 + \frac{3}{\beta_2}\mathbb{E}\big\|g(\bar w^r_{t-1}) - g(\bar w^r_t)\big\|^2\\
&\le (1-\beta_2)^2\Big(1+\frac{\beta_2}{2}\Big)\mathbb{E}\big\|\bar v^r_{t-1} - g(\bar w^r_{t-1})\big\|^2 + \Big(1+\frac{2}{\beta_2}\Big)\beta_2^2\,\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N \big[\exp(u^r_{i,t}/\lambda) - g_i(\bar w^r_{t-1})\big]\Big\|^2 + \frac{3}{\beta_2}C_g^2\,\mathbb{E}\big\|\bar w^r_t - \bar w^r_{t-1}\big\|^2\\
&\le (1-\beta_2)\,\mathbb{E}\big\|\bar v^r_{t-1} - g(\bar w^r_{t-1})\big\|^2 + 3\beta_2\,\frac{1}{N}\sum_{i=1}^N C_1\,\mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2 + \frac{3}{\beta_2}C_g^2\,\mathbb{E}\big\|\bar w^r_t - \bar w^r_{t-1}\big\|^2,
\end{aligned}
\]
where $C_1 = \exp(C_0/\lambda)$. □
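The chain of substitutions in the next proof is easiest to follow by keeping the target of $\bar m$ explicit. Assuming the KL-regularized objective takes the log-sum-exp form used throughout this appendix (up to an additive constant),
\[
g(w) = \frac{1}{N}\sum_{i=1}^N \exp\big(\ell(w; \mathcal{D}_i)/\lambda\big), \qquad F(w) = \lambda\log g(w),
\]
\[
\nabla F(w) = \frac{\lambda\,\nabla g(w)}{g(w)} = \frac{1}{g(w)}\cdot\frac{1}{N}\sum_{i=1}^N \exp\big(\ell(w; \mathcal{D}_i)/\lambda\big)\,\nabla\ell(w; \mathcal{D}_i).
\]
Replacing $\ell(w; \mathcal{D}_i)$ by $u^r_{i,t}$, $g(w)$ by the estimate $v^r_{i,t}$, and the population gradient by a stochastic one recovers exactly the direction averaged in $\bar m^r_t$, which is why the proof of Lemma 5.2 can telescope through the substitutions $u \to \ell(\bar w; \mathcal{D}_i)$ and $v \to g(\bar w)$ one at a time.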
B.4 Proof of Lemma 5.2

Proof. Here we analyze $\bar m$, the moving-average estimator of the gradient. We have
\[
\bar m^r_t - \nabla F(\bar w^r_t) = (1-\beta_3)\big(\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\big) + \beta_3\Big(\frac{1}{N}\sum_{i=1}^N \frac{\exp(u^r_{i,t}/\lambda)}{v^r_{i,t}}\,\nabla\ell(w^r_{i,t-1}; z^r_{i,t}) - \nabla F(\bar w^r_{t-1})\Big) + \big(\nabla F(\bar w^r_{t-1}) - \nabla F(\bar w^r_t)\big),
\]
and the deviation in the middle is decomposed by replacing one quantity at a time, yielding the five differences
\[
\begin{aligned}
T_5 &= \frac{1}{N}\sum_{i=1}^N \Big[\frac{\exp(u^r_{i,t}/\lambda)}{v^r_{i,t}} - \frac{\exp(u^r_{i,t-1}/\lambda)}{v^r_{i,t}}\Big]\nabla\ell(w^r_{i,t-1}; z^r_{i,t}),\\
T_4 &= \frac{1}{N}\sum_{i=1}^N \frac{\exp(u^r_{i,t-1}/\lambda)}{v^r_{i,t}}\big[\nabla\ell(w^r_{i,t-1}; z^r_{i,t}) - \nabla\ell(w^r_{i,t-1}; \mathcal{D}_i)\big],\\
T_1 &= \frac{1}{N}\sum_{i=1}^N \Big[\frac{\exp(u^r_{i,t-1}/\lambda)}{v^r_{i,t}} - \frac{\exp(\ell(\bar w^r_{t-1}; \mathcal{D}_i)/\lambda)}{v^r_{i,t}}\Big]\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i),\\
T_2 &= \frac{1}{N}\sum_{i=1}^N \Big[\frac{\exp(\ell(\bar w^r_{t-1}; \mathcal{D}_i)/\lambda)}{v^r_{i,t}} - \frac{\exp(\ell(\bar w^r_{t-1}; \mathcal{D}_i)/\lambda)}{\bar v^r_t}\Big]\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i),\\
T_3 &= \frac{1}{N}\sum_{i=1}^N \frac{\exp(\ell(\bar w^r_{t-1}; \mathcal{D}_i)/\lambda)}{\bar v^r_t}\,\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i) - \nabla F(\bar w^r_{t-1}),
\end{aligned}
\]
together with $T_6 = \nabla F(\bar w^r_{t-1}) - \nabla F(\bar w^r_t)$. Applying Young's inequality repeatedly (the mean-zero martingale term $T_4$ enters without the $1/\beta_3$ amplification), it then leads to
\[
\mathbb{E}\big\|\bar m^r_t - \nabla F(\bar w^r_t)\big\|^2 \le \Big(1+\frac{\beta_3}{2}\Big)(1-\beta_3)^2\,\mathbb{E}\big\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\big\|^2 + \big(\text{contributions of } T_1, \dots, T_6\big).
\]
We address each term as follows. The contribution of $T_1$ can be bounded as
\[
\beta_3\,\mathbb{E}\|T_1\|^2 \le \beta_3\,\frac{1}{N}\sum_{i=1}^N C_\ell^2 C_1^2\,\mathbb{E}\big\|u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2,
\]
using $v^r_{i,t} \ge 1$ and the $C_1$-Lipschitzness of $\exp(\cdot/\lambda)$ on the range of the losses. The contribution of $T_2$ can be bounded, by further splitting $\exp(\ell(\bar w^r_{t-1}; \mathcal{D}_i)/\lambda)$ against $\exp(\ell(\bar w^r_t; \mathcal{D}_i)/\lambda)$, as
\[
16\beta_3 C_\ell^2 C_1^2\,\mathbb{E}\|\bar w^r_{t-1} - \bar w^r_t\|^2 + 16\beta_3\,\frac{1}{N}\sum_{i=1}^N C_\ell^2 C_1^2\,\mathbb{E}\|v^r_{i,t} - \bar v^r_t\|^2
\le 16\beta_3 C_\ell^2 C_1^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2 + 16\beta_3\,\frac{1}{N}\sum_{i=1}^N C_\ell^2 C_1^2\,\mathbb{E}\|v^r_{i,t} - \bar v^r_t\|^2.
\]
The contribution of $T_3$ can be bounded, by splitting $\bar v^r_t \to g(\bar w^r_t)$ and $w^r_{i,t-1} \to \bar w^r_{t-1}$, as
\[
32\beta_3 C_1^2 C_\ell^2\,\mathbb{E}\|\bar v^r_t - g(\bar w^r_t)\|^2 + 32\beta_3 C_1^2 C_\ell^2 C_g^2\,\mathbb{E}\|\bar w^r_{t-1} - \bar w^r_t\|^2 + 16\beta_3 C_1^2 L_\ell^2\,\frac{1}{N}\sum_{i=1}^N \mathbb{E}\|w^r_{i,t-1} - \bar w^r_{t-1}\|^2
\]
\[
\overset{(a)}{\le} 32\beta_3 C_1^2 C_\ell^2\,\mathbb{E}\|\bar v^r_t - g(\bar w^r_t)\|^2 + 32\beta_3 C_1^2 C_\ell^2 C_g^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2 + 64 C_\ell^2\eta^4 I^2\sum_{\tau=1}^{t-1}\mathbb{E}\|m^r_{i,\tau} - \bar m^r_\tau\|^2 + 96 C_\ell^2\eta^2 I\sum_{\tau=1}^{t-1}\mathbb{E}\|\bar m^r_\tau\|^2,
\]
where (a) follows from (35). The contribution of $T_4$ can be bounded as
\[
\beta_3^2\,\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N \frac{\exp(u^r_{i,t-1}/\lambda)}{v^r_{i,t}}\big[\nabla\ell(w^r_{i,t-1}; z^r_{i,t}) - \nabla\ell(w^r_{i,t-1}; \mathcal{D}_i)\big]\Big\|^2 \le \frac{\beta_3^2 C_1^2\sigma^2}{N},
\]
as machines are independent conditioned on iteration $t-1$. The contribution of $T_5$ can be bounded as
\[
\begin{aligned}
&24\beta_3\,\frac{1}{N}\sum_{i=1}^N C_2^2\,\mathbb{E}\big\|\exp(u^r_{i,t}/\lambda)\nabla\ell(w^r_{i,t-1}; z^r_{i,t}) - \exp(u^r_{i,t}/\lambda)\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i)\big\|^2\\
&\quad + 24\beta_3\,\frac{1}{N}\sum_{i=1}^N C_2^2\,\mathbb{E}\big\|\exp(u^r_{i,t-1}/\lambda)\nabla\ell(w^r_{i,t-1}; z^r_{i,t}) - \exp(u^r_{i,t-1}/\lambda)\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i)\big\|^2\\
&\quad + 24\beta_3\,\frac{1}{N}\sum_{i=1}^N C_2^2\,\mathbb{E}\big\|\exp(u^r_{i,t}/\lambda)\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i) - \exp(u^r_{i,t-1}/\lambda)\nabla\ell(w^r_{i,t-1}; \mathcal{D}_i)\big\|^2\\
&\le 48\beta_3 C_2^2 C_1^2\sigma^2 + 24\beta_3\,\frac{1}{N}\sum_{i=1}^N L_\ell^2 C_1^2\,\mathbb{E}\|u^r_{i,t} - u^r_{i,t-1}\|^2 \le 48\beta_3 C_2^2 C_1^2\sigma^2 + 24\beta_3 L_\ell^2 C_1^2 C_0^2\beta_1^2, \tag{49}
\end{aligned}
\]
where the first inequality uses $v^r_{i,t} \ge 1$, which holds as $\ell(\cdot) \ge 0$. Finally, $F(w)$ is $L_F := C_f L_g + C_g^2 L_f$-smooth. With $\eta \le \beta_3/(3L_F^2)$, the contribution of $T_6$ can be bounded as
\[
\frac{3}{\beta_3}\mathbb{E}\big\|\nabla F(\bar w^r_{t-1}) - \nabla F(\bar w^r_t)\big\|^2 \le \frac{3}{\beta_3}L_F^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2 \le \eta\,\mathbb{E}\big\|\bar m^r_t - \bar m^r_{t-1} + \bar m^r_{t-1} - \nabla F(\bar w^r_{t-1}) + \nabla F(\bar w^r_{t-1})\big\|^2
\]
\[
\le 3\eta\,\mathbb{E}\|\bar m^r_t - \bar m^r_{t-1}\|^2 + 3\eta\,\mathbb{E}\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\|^2 + 3\eta\,\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 \overset{(a)}{\le} 4\eta\,\mathbb{E}\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\|^2 + 4\eta\,\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 + 4\beta_3^2 C_1^2(C_\ell^2 + \sigma^2),
\]
where (a) uses
\[
\mathbb{E}\|\bar m^r_t - \bar m^r_{t-1}\|^2 = \mathbb{E}\Big\|{-\beta_3}\bar m^r_{t-1} + \beta_3\,\frac{1}{N}\sum_{i=1}^N \frac{\exp(u^r_{i,t}/\lambda)}{v^r_{i,t}}\nabla\ell(w^r_{i,t-1}; z^r_{i,t})\Big\|^2 \le 4\beta_3^2\,\mathbb{E}\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\|^2 + 4\beta_3^2\,\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 + 4\beta_3^2 C_1^2(C_\ell^2 + \sigma^2).
\]
Collecting all of the bounds above gives
\[
\begin{aligned}
\mathbb{E}\|\bar m^r_t - \nabla F(\bar w^r_t)\|^2 &\le \Big(1-\frac{\beta_3}{2}\Big)\mathbb{E}\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\|^2 + \beta_3\,\frac{1}{N}\sum_{i=1}^N C_\ell^2 C_1^2\,\mathbb{E}\|u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\|^2\\
&\quad + 16\beta_3 C_\ell^2 C_1^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2 + 16\beta_3\,\frac{1}{N}\sum_{i=1}^N C_\ell^2 C_1^2\,\mathbb{E}\|v^r_{i,t} - \bar v^r_t\|^2 + 32\beta_3 C_1^2 C_\ell^2\,\mathbb{E}\|\bar v^r_t - g(\bar w^r_t)\|^2 + 32\beta_3 C_1^2 C_\ell^2 C_g^2\eta^2\,\mathbb{E}\|\bar m^r_t\|^2\\
&\quad + 64 C_\ell^2\eta^4 I^2\sum_{\tau=1}^{t-1}\mathbb{E}\|m^r_{i,\tau} - \bar m^r_\tau\|^2 + 96 C_\ell^2\eta^2 I\sum_{\tau=1}^{t-1}\mathbb{E}\|\bar m^r_\tau\|^2 + \frac{\beta_3^2 C_1^2\sigma^2}{N} + 48\beta_3 C_2^2 C_1^2\sigma^2 + 24\beta_3 L_\ell^2 C_1^2 C_0^2\beta_1^2\\
&\quad + 4\eta\,\mathbb{E}\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\|^2 + 4\eta\,\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 + 4\beta_3^2 C_1^2(C_\ell^2 + \sigma^2). \tag{52}
\end{aligned}
\]
We conclude the proof by setting the parameters as in the lemma. □

Proof of Theorem 5.3. Using Lemma 5.2, with $\beta_2 = O(\beta_3)$, telescoping gives
\[
\frac{1}{RI}\sum_{r=1}^R\sum_{t=1}^I \mathbb{E}\big\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\big\|^2 \le O\Big(\frac{\mathbb{E}\|\bar m^0_0 - \nabla F(\bar w^0_0)\|^2}{\beta_3 RI} + \beta_3^2 I^2 G^2 + \beta_3 C_1^2\sigma^2 + \frac{1}{\beta_1 RI}\cdot\frac{1}{N}\sum_{i=1}^N \mathbb{E}\|u^0_{i,0} - \ell(\bar w^0_0; \mathcal{D}_i)\|^2 + \frac{\eta}{RI}\sum_{r,t}\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 + \beta_1\sigma^2\Big).
\]
Using the $L_F$-smoothness of $F$, we have, for $\eta \le 1/(4L_F)$,
\[
\begin{aligned}
F(\bar w^r_t) &\le F(\bar w^r_{t-1}) + \nabla F(\bar w^r_{t-1})^\top(\bar w^r_t - \bar w^r_{t-1}) + \frac{L_F}{2}\|\bar w^r_t - \bar w^r_{t-1}\|^2
= F(\bar w^r_{t-1}) - \eta\,\nabla F(\bar w^r_{t-1})^\top \bar m^r_t + \frac{L_F}{2}\eta^2\|\bar m^r_t\|^2\\
&\le F(\bar w^r_{t-1}) - \eta\,\nabla F(\bar w^r_{t-1})^\top\big(\bar m^r_t - \nabla F(\bar w^r_{t-1}) + \nabla F(\bar w^r_{t-1})\big) + L_F\eta^2\big(\|\bar m^r_t - \nabla F(\bar w^r_{t-1})\|^2 + \|\nabla F(\bar w^r_{t-1})\|^2\big)\\
&\le F(\bar w^r_{t-1}) - \frac{\eta}{4}\|\nabla F(\bar w^r_{t-1})\|^2 + \frac{3\eta}{4}\|\bar m^r_t - \nabla F(\bar w^r_{t-1})\|^2.
\end{aligned}
\]
Telescoping and combining with the bound above,
\[
\mathbb{E}\|\nabla F(\bar w)\|^2 = \mathbb{E}\Big[\frac{1}{RI}\sum_{r,t}\|\nabla F(\bar w^r_{t-1})\|^2\Big] \le O\Big(\frac{1}{\eta RI} + \beta_1\sigma^2 + \beta_3^2 I^2 + \eta L_F\Big).
\]
We conclude the proof by setting the parameters as in the theorem. □
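The parameter choice is deferred to the theorem statement; one consistent instantiation of the final bound, illustrative only and chosen to match the communication complexity claimed for FGDRO-KL, is
\[
\eta = O(\epsilon^2), \quad \beta_1 = O(\epsilon^2), \quad \beta_3 = O(\epsilon^2), \quad I = O(1/\epsilon), \quad R = O(1/\epsilon^3),
\]
under which $\frac{1}{\eta RI} = O(\epsilon^2)$, $\beta_1\sigma^2 = O(\epsilon^2)$, $\beta_3^2 I^2 = O(\epsilon^2)$ and $\eta L_F = O(\epsilon^2)$, so $\mathbb{E}\|\nabla F(\bar w)\|^2 \le O(\epsilon^2)$ with $R = O(1/\epsilon^3)$ communication rounds and $RI = O(1/\epsilon^4)$ samples per machine.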
C Analysis of FGDRO-KL-Adam

Proof of Theorem 6.2. Lemma B.1, Lemma B.2 and Lemma 5.2 still hold. Specifically, denoting $\bar w^r_t = \frac{1}{N}\sum_{i=1}^N w^r_{i,t}$, we have
\[
\mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2 \le (1-\beta_1)\,\mathbb{E}\big\|u^r_{i,t-1} - \ell(\bar w^r_{t-1}; \mathcal{D}_i)\big\|^2 + 6\beta_1\eta^2 I^2 C_\ell^2 G^2 + \beta_1^2\sigma^2 + \frac{\eta^2}{\beta_1}C_\ell^2\,\mathbb{E}\|\bar m^r_t - \nabla F(\bar w^r_t)\|^2 + \frac{\eta^2}{\beta_1}C_\ell^2\,\mathbb{E}\|\nabla F(\bar w^r_t)\|^2,
\]
where, as before, the cross term vanishes since $\mathbb{E}_{t-1}[\ell(w^r_{i,t}; z^r_{i,t}) - \ell(w^r_{i,t}; \mathcal{D}_i)] = 0$. And we have
\[
\mathbb{E}\big\|\bar v^r_t - g(\bar w^r_t)\big\|^2 \le (1-\beta_2)\,\mathbb{E}\big\|\bar v^r_{t-1} - g(\bar w^r_{t-1})\big\|^2 + 3\beta_2\,\frac{1}{N}\sum_{i=1}^N C_1\,\mathbb{E}\big\|u^r_{i,t} - \ell(\bar w^r_t; \mathcal{D}_i)\big\|^2 + \frac{3}{\beta_2}C_g^2\,\mathbb{E}\|\bar w^r_t - \bar w^r_{t-1}\|^2,
\]
where $C_1 = \exp(C_0/\lambda)$. Moreover, we have
\[
\frac{1}{RI}\sum_{r,t}\mathbb{E}\big\|\bar m^r_{t-1} - \nabla F(\bar w^r_{t-1})\big\|^2 \le O\Big(\frac{\mathbb{E}\|\bar m^0_0 - \nabla F(\bar w^0_0)\|^2}{\beta_3 RI} + \beta_3^2 I^2 G^2 + \beta_3 C_1^2\sigma^2 + \frac{1}{\beta_1 RI}\cdot\frac{1}{N}\sum_{i=1}^N \mathbb{E}\|u^0_{i,0} - \ell(\bar w^0_0; \mathcal{D}_i)\|^2 + \frac{\eta}{RI}\sum_{r,t}\mathbb{E}\|\nabla F(\bar w^r_{t-1})\|^2 + \beta_1\sigma^2\Big).
\]
Using the $L_F$-smoothness of $F$ and the Adam-type step, whose effective coordinate-wise step size $\eta_t$ satisfies $\frac{\eta}{G+\tau} \le \eta_t \le \frac{\eta}{\tau}$, we have
\[
\begin{aligned}
F(\bar w^r_t) &\le F(\bar w^r_{t-1}) + \nabla F(\bar w^r_{t-1})^\top(\bar w^r_t - \bar w^r_{t-1}) + \frac{L_F}{2}\|\bar w^r_t - \bar w^r_{t-1}\|^2
\le F(\bar w^r_{t-1}) - \nabla F(\bar w^r_{t-1})^\top(\eta_t\,\bar m^r_t) + \frac{L_F}{2}\eta_t^2\|\bar m^r_t\|^2\\
&\le F(\bar w^r_{t-1}) + \frac{\eta}{2\tau}\big\|\nabla F(\bar w^r_{t-1}) - \bar m^r_t\big\|^2 - \frac{\eta}{2(G+\tau)}\big\|\nabla F(\bar w^r_{t-1})\big\|^2 + \frac{\eta^2 L_F G^2}{2\tau^2}\\
&\le F(\bar w^r_{t-1}) + \frac{\eta}{\tau}\big\|\nabla F(\bar w^r_{t-1}) - \bar m^r_{t-1}\big\|^2 + \frac{\eta}{\tau}\beta_3^2 G^2 - \frac{\eta}{2(G+\tau)}\big\|\nabla F(\bar w^r_{t-1})\big\|^2 + \frac{\eta^2 L_F G^2}{2\tau^2}. \tag{59}
\end{aligned}
\]
Therefore,
\[
\mathbb{E}\Big[\frac{1}{RI}\sum_{r,t}\big\|\nabla F(\bar w^r_{t-1})\big\|^2\Big] \le O\Big(\frac{1}{\eta RI} + \beta_1\sigma^2 + \beta_3^2 I^2 G^2 + \eta L_F G^2\Big). \tag{60}
\]
We conclude the proof by setting the parameters as in the theorem. □

D Local Adam Algorithm

In this section, we present Algorithm 4, which uses Adam-type updates in local steps to solve an ERM problem.

Algorithm 4 Local Adam
1: Initialization: w¹, m¹, q¹
2: for r = 1, ..., R do
3:    w^r_{i,0} = w^r, m^r_{i,0} = m^r, q^r_{i,0} = q^r
4:    for t = 1, ..., I do
5:       Each machine samples data z^r_{i,t}
6:       h^r_{i,t} = ∇ℓ(w^r_{i,t−1}; z^r_{i,t})
7:       m^r_{i,t} = (1 − β₃)m^r_{i,t−1} + β₃h^r_{i,t},   q^r_{i,t} = (1 − β₄)q^r_{i,t−1} + β₄(h^r_{i,t})²
8:       w^r_{i,t} = w^r_{i,t−1} − η · m^r_{i,t} / (√(q^r_{i,t}) + τ)
9:    end for
10:   w^{r+1} = (1/N)Σᵢ w^r_{i,I},  m^{r+1} = (1/N)Σᵢ m^r_{i,I},  q^{r+1} = (1/N)Σᵢ q^r_{i,I}
11: end for
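The following sketch makes Algorithm 4 concrete; grad_fns stands in for each machine's stochastic-gradient oracle, the hyperparameter defaults are illustrative, and vector operations are element-wise as in Adam.

import numpy as np

def local_adam(w, grad_fns, R=100, I=10, eta=1e-3, beta3=0.1, beta4=0.1, tau=1e-8):
    """Illustrative implementation of Algorithm 4 (Local Adam).

    Every machine runs I Adam-type steps locally; the server then averages the
    model w and both moment estimates m, q before the next round."""
    N = len(grad_fns)
    m = np.zeros_like(w)
    q = np.zeros_like(w)
    for _ in range(R):
        ws = [w.copy() for _ in range(N)]   # w^r_{i,0} = w^r
        ms = [m.copy() for _ in range(N)]   # m^r_{i,0} = m^r
        qs = [q.copy() for _ in range(N)]   # q^r_{i,0} = q^r
        for _ in range(I):
            for i in range(N):
                h = grad_fns[i](ws[i])      # h^r_{i,t}: stochastic gradient on machine i
                ms[i] = (1.0 - beta3) * ms[i] + beta3 * h
                qs[i] = (1.0 - beta4) * qs[i] + beta4 * h ** 2
                ws[i] = ws[i] - eta * ms[i] / (np.sqrt(qs[i]) + tau)
        # Communication: average the model and the two moment estimates.
        w = np.mean(ws, axis=0)
        m = np.mean(ms, axis=0)
        q = np.mean(qs, axis=0)
    return w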
E Statistics of Datasets

Table 4 summarizes the sizes of the datasets used. Table 5 summarizes the client imbalance ratio and the class imbalance ratio. The client imbalance ratio is the ratio between the number of training samples on the client with the most data and on the client with the least data; the class imbalance ratio is the ratio of training data in the largest class to that in the smallest class for classification tasks.

Table 4: Number of data for each split, with the number of training domains in brackets. The last column gives the number of training clients; data from the same training domain are placed on the same client.

Dataset        | Train             | Validation    | Test          | Num of Clients
Pile           | 192,912,246 (17)  | 193,105 (17)  | 193,105 (17)  | 17
CivilComments  | 269,038 (4)       | 45,180 (16)   | 133,782 (16)  | 4
PovertyMap     | 9,792 (13)        | 3,936 (5)     | 3,968 (5)     | 13
iWildCam       | 129,809 (243)     | 7,314 (243)   | 8,154 (243)   | 8
Camelyon       | 302,436 (3)       | 34,904 (1)    | 85,054 (1)    | 3

Table 5: Imbalance Ratio

Datasets               | Pile | CivilComments | Camelyon17 | iWildCam2020 | PovertyMap
Client Imbalance Ratio | 258  | 36.2          | 1          | 1.7          | 5.9
Class Imbalance Ratio  | N/A  | 4.6           | 1          | 48021        | N/A

F Running Time

Running time is reported in Table 6. Each algorithm was run on a high-performance cluster where each machine uses an NVIDIA A100 GPU.

G Experiments on CIFAR-10

We create an imbalanced dataset by reducing the data of 5 classes by 80% and then distribute the data across 100 clients according to two different Dirichlet distributions, Dirichlet(0.3) and Dirichlet(10), using code released by [61]. We use a two-layer CNN as the model. Results of the algorithms are summarized in Table 7. Figure 2 and Figure 3 illustrate the communication complexity of each method by comparing the worst-case testing accuracy against the number of local updates and the communicated data sizes. A sketch of the Dirichlet partition step is given below.
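The experiments use the partitioning code released by [61]; the function below is therefore an illustration of the standard class-wise Dirichlet split rather than the exact released implementation, and its names are ours.

import numpy as np

def dirichlet_partition(labels, num_clients=100, alpha=0.3, seed=0):
    """Split sample indices across clients with class-wise Dirichlet(alpha) shares.

    Smaller alpha yields more skewed client distributions (e.g., 0.3 vs. 10)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_indices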
Figure 2: Convergence on Imbalanced CIFAR-10 with Dirichlet(0.3)

H On the Statistical Significance of Proposed Algorithms

In Tables 8, 9 and 10, we evaluate the statistical significance of our algorithms' advantages over the baselines. In each cell, the three letters (representing FGDRO-CVaR, FGDRO-KL, and FGDRO-KL-Adam, respectively) indicate whether each of our algorithms has outperformed the corresponding baseline in that row with a confidence level greater than 95% (p-value less than 0.05). Y indicates Yes, and N indicates No; a sketch of one such test is given below.
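Only the confidence criterion (p-value below 0.05) is stated above; assuming a one-sided unpaired t-test over per-seed scores, which is one standard way to obtain such p-values, a single cell of Tables 8-10 can be computed as follows (accs_ours and accs_base are illustrative inputs).

from scipy import stats

def significance_flag(accs_ours, accs_base, alpha=0.05):
    """Return 'Y' if 'ours' beats 'baseline' with >95% confidence, else 'N'.

    Inputs are sequences of per-seed test scores for the two methods."""
    _, p_value = stats.ttest_ind(accs_ours, accs_base, alternative="greater")
    return "Y" if p_value < alpha else "N"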
Table 6: Average running time (in hours) of federated algorithms to finish (200K iterations for the Pile data and 20K for the others).

Algorithm      | Pile  | CivilComments | Camelyon17 | iWildCam2020 | PovertyMap
FedAvg         | 29.77 | 3.05          | 4.59       | 12.90        | 1.77
FedAdam        | 35.39 | 3.13          | 6.01       | 12.91        | 3.22
DRFA           | 34.12 | 3.27          | 4.94       | 12.95        | 2.25
DR-DSGD        | 25.26 | 4.01          | 6.26       | 13.02        | 4.71
FGDRO-CVaR     | 34.07 | 4.03          | 7.19       | 13.01        | 5.02
FGDRO-KL       | 34.92 | 3.65          | 7.23       | 14.57        | 5.42
FGDRO-KL-Adam  | 35.98 | 3.80          | 7.54       | 14.82        | 5.50

Table 7: Imbalanced CIFAR-10; data allocated to clients according to Dirichlet distributions.

Datasets       | CIFAR-10, Dirichlet(0.3)                | CIFAR-10, Dirichlet(10)
Metric         | Worst Acc         | Average Acc         | Worst Acc         | Average Acc
FedAvg         | 0.3140 (±0.0027)  | 0.6236 (±0.0019)    | 0.3620 (±0.0032)  | 0.6742 (±0.0020)
SCAFFOLD       | 0.3245 (±0.0032)  | 0.6337 (±0.0038)    | 0.3821 (±0.0022)  | 0.6816 (±0.0025)
FedProx        | 0.3102 (±0.0027)  | 0.6189 (±0.0036)    | 0.3757 (±0.0041)  | 0.6925 (±0.0053)
FedAdam        | 0.4860 (±0.0047)  | 0.7147 (±0.0058)    | 0.4460 (±0.0039)  | 0.7042 (±0.0043)
DRFA           | 0.3215 (±0.0129)  | 0.6381 (±0.0131)    | 0.3752 (±0.0053)  | 0.6739 (±0.0048)
DR-DSGD        | 0.3277 (±0.0074)  | 0.6403 (±0.0058)    | 0.3700 (±0.0063)  | 0.6792 (±0.0088)
FGDRO-CVaR     | 0.4100 (±0.0032)  | 0.6606 (±0.0039)    | 0.4010 (±0.0022)  | 0.6882 (±0.0027)
FGDRO-KL       | 0.3560 (±0.0018)  | 0.6369 (±0.0029)    | 0.4110 (±0.0035)  | 0.6951 (±0.0042)
FGDRO-KL-Adam  | 0.5280 (±0.0101)  | 0.7057 (±0.0133)    | 0.5110 (±0.0072)  | 0.7286 (±0.0063)

Figure 3: Convergence on Imbalanced CIFAR-10 with Dirichlet(10)

Table 8: Statistical Confidence on Imbalanced CIFAR-10

Datasets  | CIFAR-10, Dirichlet(0.3)  | CIFAR-10, Dirichlet(10)
Metric    | Worst Acc | Average Acc   | Worst Acc | Average Acc
FedAvg    | Y|Y|Y     | Y|Y|Y         | Y|Y|Y     | Y|Y|Y
SCAFFOLD  | Y|Y|Y     | Y|N|Y         | Y|Y|Y     | Y|Y|Y
FedProx   | Y|Y|Y     | Y|Y|Y         | Y|Y|Y     | N|N|Y
FedAdam   | N|N|Y     | N|N|N         | N|N|Y     | N|N|Y
DRFA      | Y|Y|Y     | Y|N|Y         | Y|Y|Y     | Y|Y|Y
DR-DSGD   | Y|Y|Y     | Y|N|Y         | Y|Y|Y     | Y|Y|Y

Table 9: Statistical Confidence on Pile and CivilComments

Datasets  | Pile                               | CivilComments
Metric    | Worst Log-PPL | Average Log-PPL    | Worst Acc | Average Acc
FedAvg    | N|Y|Y         | N|Y|Y              | Y|Y|Y     | N|Y|N
SCAFFOLD  | N|Y|Y         | N|Y|Y              | Y|Y|Y     | N|Y|N
FedProx   | N|Y|Y         | N|Y|Y              | Y|Y|Y     | N|Y|Y
FedAdam   | N|N|Y         | N|N|Y              | Y|Y|Y     | N|Y|N
DRFA      | N|Y|Y         | N|Y|Y              | Y|Y|Y     | Y|Y|Y
DR-DSGD   | N|Y|Y         | N|Y|Y              | Y|Y|Y     | Y|Y|Y

Table 10: Statistical Confidence on Camelyon17, iWildCam2020, and PovertyMap

Datasets  | Camelyon17 | iWildCam2020 | PovertyMap
Metric    | Acc        | Macro F1     | Worst Pearson | Average Pearson
FedAvg    | N|Y|Y      | Y|Y|N        | Y|N|Y         | Y|Y|Y
SCAFFOLD  | N|Y|Y      | Y|Y|N        | Y|N|Y         | Y|N|Y
FedProx   | N|Y|Y      | Y|Y|Y        | Y|N|Y         | Y|Y|Y
FedAdam   | N|N|N      | Y|Y|Y        | Y|N|Y         | N|N|N
DRFA      | Y|Y|Y      | Y|Y|Y        | Y|Y|Y         | Y|Y|Y
DR-DSGD   | Y|Y|Y      | Y|Y|Y        | Y|Y|Y         | Y|Y|Y

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction are an accurate summary of the paper and are well supported by the other sections of the paper.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are discussed in Section 8.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Assumptions are discussed in Sections 3, 4, 5, and 6. Theorems are presented in Sections 4, 5, and 6. Proofs are given in detail in Appendices A, B, and C.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The setting and parameter tuning scope are discussed in Section 7.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: All used data are publicly available. Code will be released later.
Guidelines:
- The answer NA means that paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: These details are given in Section 7.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Experiments are repeated multiple times with different random seeds, and error bars are reported in the experimental results in Section 7.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Computing resources and time of execution are reported in Appendix F.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: This paper conforms with the NeurIPS Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: The broader impacts have been discussed in Section 9.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper does not release any new data or models.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The assets we use are all publicly available and properly credited.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [No]
Justification: No new assets are introduced in this paper.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.