# Communication-Efficient Distributionally Robust Decentralized Learning

Published in Transactions on Machine Learning Research (12/2022)

Matteo Zecchin (matteo.zecchin@eurecom.fr), Marios Kountouris (marios.kountouris@eurecom.fr), David Gesbert (david.gesbert@eurecom.fr)
Communication Systems Department, EURECOM, Sophia Antipolis, France

Reviewed on OpenReview: https://openreview.net/forum?id=tnRRHzZPMq

**Abstract.** Decentralized learning algorithms empower interconnected devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator. In the case of heterogeneous data distributions at the network nodes, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single-loop gradient descent/ascent algorithm (AD-GDA) to directly solve the underlying minimax optimization problem. We render our algorithm communication-efficient by employing a compressed consensus scheme, and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms.

## 1 Introduction

Decentralized learning algorithms have gained an increasing level of attention, mainly due to their ability to harness, in a fault-tolerant and privacy-preserving manner, the large computational power and data availability at the network edge (Chen & Sayed, 2012; Yan et al., 2012). In this framework, a set of interconnected nodes (smartphones, IoT devices, health centers, research labs, etc.) collaboratively trains a machine learning model, alternating between local model updates based on in situ data and peer-to-peer communication to exchange model-related information. Compared to federated learning, in which a swarm of devices communicates with a central parameter server at each communication round, fully decentralized learning has the benefits of removing the single point of failure and of alleviating the communication bottleneck inherent to the star topology.

The heterogeneity of distributedly generated data entails a major challenge, captured by the notions of fairness (Dwork et al., 2012) and robustness (Quiñonero-Candela et al., 2009). In the distributed setup, the customary global loss function is the weighted sum of the local empirical losses, with each term weighted by the fraction of samples that the associated node stores. However, in the case of data heterogeneity across participating parties, a model minimizing such a definition of risk can have unsatisfactory and unfair¹ inference capabilities for certain subpopulations.

¹In the machine learning community, the notion of fairness has many facets. In this work, we use the term fair in accordance with the notion of good-intent fairness introduced in (Mohri et al., 2019).
Consider, for example, a consortium of hospitals spread across the world sharing medical data to devise a new drug, in which a small fraction of the hospitals have medical records influenced by geographical confounders, such as local diet, meteorological conditions, etc. In this setting, a model obtained by myopically minimizing the standard notion of risk defined over the aggregated data can be severely biased towards some populations at the expense of others. This can lead to potentially dangerous or unfair medical treatment, as shown in Figure 2.

*Figure 2: Comparison between standard and distributionally robust decentralized learning procedures. Panels: (a) network topology; (b) worst-node accuracy versus iteration for AD-GDA and CHOCO-SGD, evaluated on Microscope 1, Microscope 2, and the average.* We consider a network of 10 nodes that collaboratively train a mouse cell image classifier based on the Cells Out Of Sample 7-Class (COOS7) data set (Lu et al., 2019). Two nodes store samples obtained using a different microscope from the rest of the network devices (left figure). On the right, we report the validation accuracy of standard decentralized learning (CHOCO-SGD) and of the proposed distributionally robust decentralized learning (AD-GDA). We consider three different validation sets: one made of samples from one microscope, another with samples from the other, and one made of a 50/50 mixture of the two. CHOCO-SGD (dashed lines) yields a final model with a 24% accuracy gap between the two types of instruments. On the contrary, AD-GDA (solid lines) reduces the accuracy gap to less than 2% and improves fairness among the collaborating parties. The average performance is not affected by the distributionally robust procedure.

To tackle this issue, distributionally robust learning aims at maximizing the worst-case performance over a set of distributions, termed the uncertainty set, which possibly contains the testing distribution of interest. Typical choices of the uncertainty set are balls centered around the training distribution (Esfahani & Kuhn, 2018) or, whenever the training samples come from a mixture of distributions, the set of potential subpopulations resulting in such a mixture (Duchi et al., 2019; Duchi & Namkoong, 2018). Robust distributed learning with heterogeneous data, in which different distributions exist at the various devices, falls in the latter category, as the natural ambiguity set is the one represented by the convex combinations of the local distributions. In that case, minimizing the worst-case risk is equivalent to trying to ensure a minimum level of performance for each participating device. Specifically for the federated case, Mohri et al. (2019) introduced agnostic federated learning (AFL) as a means to ensure fairness and proposed a gradient-based algorithm to solve the distributionally robust optimization problem. In (Deng et al., 2021), a communication-efficient version of AFL, which avoids frequent retransmission of the dual variables, was proposed. More recently, distributionally robust learning has also been studied in the fully decentralized case. In this setting, the underlying minimax optimization problem becomes challenging, and distributionally robust learning has been limited to a special formulation with Kullback-Leibler (KL) regularization that allows simplifying the problem (Issaid et al., 2022).
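To make the mixture ambiguity set concrete, the worst-case risk over the convex combinations of the $m$ local distributions can be stated compactly as follows; this restates the agnostic objective of Mohri et al. (2019), which Section 3 extends with a regularization term:

$$
\min_{\theta \in \mathbb{R}^d} \; \max_{\lambda \in \Delta^{m-1}} \; \sum_{i=1}^{m} \lambda_i \, \mathbb{E}_{z \sim P_i}\big[\ell(\theta, z)\big],
$$

where $\Delta^{m-1}$ denotes the probability simplex. Since the inner maximum of a linear function over the simplex is attained at a vertex, the objective equals the risk of the worst-off node, which is why its minimizer enjoys a uniform performance guarantee across devices.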
In virtue of the advantages of the fully decentralized setup, and advocating the necessity for robust and fair predictors, in this work we propose AD-GDA, a novel distributionally robust algorithm for the decentralized setting. In contrast to previous works, our solution directly tackles the minimax optimization problem in a fully decentralized fashion. Our algorithm is general and encompasses previous solutions as particular choices of the regularization function. Furthermore, we show that it is possible to perform distributionally robust optimization in a communication-efficient manner by employing a compressed gossip scheme without hampering the rate of convergence of the algorithm.

Table 1: Comparison between the proposed and existing distributionally robust algorithms.

| Algorithm | Topologies | Compression | Regularizer | Rate (convex) | Rate (non-convex) |
|---|---|---|---|---|---|
| AD-GDA (ours) | Connected | Yes | Strongly-concave | $O(T^{-1/2})$ | $O(T^{-1/2})$ |
| DRFA (Deng et al., 2021) | Star | No | Strongly-concave | $O(T^{-3/8})$ | $O(T^{-1/8})$ |
| DR-DSGD (Issaid et al., 2022) | Connected | No | Kullback-Leibler | — | $O(T^{-1/2})$ |

**Contributions:** The main contributions of this work are the following:

- We propose AD-GDA, an optimization algorithm to perform distributionally robust learning in a fully decentralized fashion. As detailed in Table 1, previous works have either been limited to the federated case or to particular choices of the regularizer; AD-GDA directly tackles the distributionally robust minimax optimization problem in a fully decentralized fashion.
- Despite the additional complexity stemming from solving the decentralized minimax optimization problem, our solution is computation- and communication-efficient. AD-GDA alternates between local single-loop stochastic gradient descent/ascent model updates and compressed consensus steps to cope with local connectivity in a communication-efficient manner.
- We establish convergence guarantees for the proposed algorithm, both in the case of smooth convex and smooth non-convex local loss functions, that match or outperform the existing ones (see Table 1). In the former case, the algorithm returns an ϵ-optimal solution after $O(1/\epsilon^2)$ iterations. In the latter, the output is guaranteed to be an ϵ-stationary solution after $O(1/\epsilon^2)$ iterations whenever the stochastic gradient variance is also bounded by ϵ; otherwise, we can obtain the same guarantee by increasing the number of calls to the stochastic gradient oracle.
- We empirically demonstrate AD-GDA's capability of finding robust predictors under different compression schemes, network topologies, and models. First, we compare the proposed algorithm with compressed decentralized stochastic gradient descent (CHOCO-SGD) and highlight the merits of the distributionally robust procedure. We then consider the existing distributionally robust learning algorithms, namely Distributionally Robust Federated Averaging (DRFA) (Deng et al., 2021) and Distributionally Robust Decentralized Stochastic Gradient Descent (DR-DSGD) (Issaid et al., 2022). We show that AD-GDA attains the same worst-case distribution accuracy while transmitting a fraction of the bits, being up to 4 and 10 times more communication-efficient than DRFA and DR-DSGD, respectively.

## 2 Related Work

**Communication-efficient decentralized learning.**
Initiated in the 1980s by the work of Tsitsiklis (Tsitsiklis, 1984; Tsitsiklis et al., 1986), the study of decentralized optimization algorithms was spurred by their adaptability to various network topologies, resilience to link failures, privacy-preserving capabilities, and potentially superior convergence properties compared to their centralized counterparts (Chen & Sayed, 2012; Yan et al., 2012; Olfati-Saber et al., 2007; Ling et al., 2012; Lian et al., 2017). This growing interest and the advent of large-scale machine learning brought forth an abundance of optimization algorithms, both in the deterministic and stochastic settings (Nedic & Ozdaglar, 2009; Wei & Ozdaglar, 2012; Duchi et al., 2011; Shamir & Srebro, 2014; Rabbat, 2015). With the intent of extending their applicability, a concurrent effort has been made to devise techniques able to reduce the delay due to inter-node communication. Notable results in this direction are the introduction of message compression techniques, such as sparsification and quantization (Stich et al., 2018; Aji & Heafield, 2017; Alistarh et al., 2018; 2017; Bernstein et al., 2018; Koloskova et al., 2019b), and event-triggered communication to allow multiple local updates between communication rounds (Stich, 2018; Yu et al., 2019).

**Distributionally robust learning.** Tracing back to the work of Scarf (Scarf, 1957), distributional robustness copes with the frequent mismatch between training and testing distributions by posing the training process as a game between a learner and an adversary, where the adversary has the ability to choose the testing distribution within an uncertainty set. Restraining the decisional power of the adversary is crucial to obtain meaningful and tractable problems, and a large body of the literature deals with uncertainty sets represented by balls centered around the training distribution whose radii are determined by f-divergences (Namkoong & Duchi, 2016; Hu & Hong, 2013) or the Wasserstein distance (Wozabal, 2012; Jiang & Guan, 2016; Esfahani & Kuhn, 2018). As such, distributional robustness is deeply linked to the stochastic game literature. In this context, Lin et al. (2020b) provide last-iterate guarantees in the case of coercive games. Refined results are later obtained in (Loizou et al., 2021) for expected coercive games and by relaxing the bounded gradient assumption. Furthermore, distributional robustness is deeply linked with the notion of fairness, as particular choices of uncertainty sets allow guaranteeing uniform performance across the latent subpopulations in the data (Duchi & Namkoong, 2018; Duchi et al., 2019). In the case of federated learning, robust optimization ideas have been explored to ensure uniform performance across all participating devices (Mohri et al., 2019). Distributionally robust learning has also been studied in the fully decentralized scenario in the case of Kullback-Leibler (KL) regularizers, for which an exact solution of the inner maximization problem exists (Issaid et al., 2022).

**Decentralized minimax optimization.** Saddle point optimization algorithms are of great interest given their wide range of applications in different fields of machine learning, including generative adversarial networks (Goodfellow et al., 2014), robust adversarial training (Sinha et al., 2017; Madry et al., 2017), and multi-agent reinforcement learning (Pinto et al., 2017; Li et al., 2019).
Their convergence properties have also been studied in the decentralized scenario for the convex-concave setting (Koppel et al., 2015; Mateos-Núnez & Cortés, 2015). More recently, the assumptions on the convexity and concavity of the objective function have been relaxed. In Tsaknakis et al. (2020), an algorithm for nonconvex strongly-concave objective functions has been proposed; however, the double-loop nature of the solution requires solving the inner maximization problem with an increasing level of accuracy, rendering it potentially slow. On the other hand, our algorithm is based on a single-loop optimization scheme (with dual and primal variables updated in parallel at each iteration) and, consequently, has a lower computational complexity. For the nonconvex-nonconcave case, Liu et al. (2019b) provide a proximal point algorithm, while a simpler gradient-based algorithm is provided in Liu et al. (2019a) to train generative adversarial networks in a decentralized fashion. None of these works take communication efficiency into consideration.

## 3 Preliminaries

**Distributed system.** We consider a network of $m$ devices in which each node $i$ is endowed with a local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$ given by $\mathbb{E}_{z \sim P_i}[\ell(\theta, z)]$, with $P_i$ denoting the local distribution at node $i$ and $\theta \in \mathbb{R}^d$ being the model parameter to be optimized. Whenever $P_i$ is replaced by an empirical measure $\hat{P}_{i,n_i}$, the local objective function coincides with the empirical risk computed over $n_i$ samples. Network nodes are assumed to be interconnected by a communication topology specified by a connected graph $G := (V, E)$, in which $V = \{1, \ldots, m\}$ indexes the devices and $(i, j) \in E$ if and only if nodes $i$ and $j$ can communicate. For each node $i \in V$, we define its set of neighbors by $N(i) := \{j : (i, j) \in E\}$, and since we assume self-communication we have $i \in N(i)$ for all $i \in V$. At each communication round, the network nodes exchange messages with their neighbors and average the received messages according to a mixing matrix $W \in \mathbb{R}^{m \times m}$.

**Assumption 3.1.** The mixing matrix $W \in \mathbb{R}^{m \times m}$ is symmetric and doubly stochastic; we denote by $\rho \in (0, 1]$ its spectral gap, i.e., the difference between the moduli of its two largest eigenvalues, and define $\beta = \|I - W\|_2 \in [0, 2]$.

**Compressed communication.** Since the communication phase is the major bottleneck of decentralized training, we assume that nodes transmit only compressed messages instead of sharing uncompressed model updates. To this end, we define a, possibly randomized, compression operator $Q : \mathbb{R}^d \to \mathbb{R}^d$ that satisfies the following assumption.

**Assumption 3.2.** For any $x \in \mathbb{R}^d$ and for some $\delta \in (0, 1]$,
$$\mathbb{E}_Q \|Q(x) - x\|^2 \le (1 - \delta) \|x\|^2. \tag{1}$$

The above definition is quite general, as it covers both unbiased and biased compression operators. For instance, random quantization (Alistarh et al., 2017) falls into the former class and satisfies (1) with $\delta = 1/\tau$. For a given vector $x \in \mathbb{R}^d$ and $2^b$ quantization levels, it yields the compressed message
$$Q_b(x) = \frac{\operatorname{sign}(x) \odot \|x\| \left\lfloor 2^b \frac{|x|}{\|x\|} + \xi \right\rfloor}{2^b \tau}, \qquad \tau = 1 + \min\left(\frac{d}{2^{2b}}, \frac{\sqrt{d}}{2^b}\right), \quad \xi \sim U[0,1]^d. \tag{2}$$
A notable representative of the biased category is top-K sparsification (Stich et al., 2018), which for a given vector $x \in \mathbb{R}^d$ returns its $K$ largest-magnitude components and satisfies (1) with $\delta = K/d$. Operators of this type have been previously considered in the context of decentralized learning, and the effect of compressed communication in decentralized stochastic optimization has been previously investigated (Koloskova et al., 2019a;b; Stich et al., 2018).
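As an illustration, below is a minimal NumPy sketch of the two compression operators discussed above; it is not the authors' implementation, and the function names are ours. The check at the bottom verifies empirically that the relative compression error stays below $1 - \delta$, as required by Assumption 3.2.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Biased top-K sparsification: keep the K largest-magnitude entries.
    Satisfies ||Q(x) - x||^2 <= (1 - K/d) ||x||^2 deterministically."""
    q = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    q[idx] = x[idx]
    return q

def random_quantize(x: np.ndarray, b: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased b-bit random quantization (QSGD-style), satisfying
    E||Q(x) - x||^2 <= (1 - 1/tau) ||x||^2 with tau as in equation (2)."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return x.copy()
    levels, d = 2 ** b, x.size
    tau = 1.0 + min(d / levels ** 2, np.sqrt(d) / levels)
    xi = rng.random(x.shape)  # xi ~ U[0,1]^d randomizes the rounding
    return np.sign(x) * norm * np.floor(levels * np.abs(x) / norm + xi) / (levels * tau)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
for name, q in [("top-10% sparsification", top_k(x, 1_000)),
                ("4-bit quantization", random_quantize(x, 4, rng))]:
    rel_err = np.linalg.norm(q - x) ** 2 / np.linalg.norm(x) ** 2
    print(f"{name}: relative compression error {rel_err:.3f}")
```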
The resulting communication cost savings have been showcased in the context of decentralized training of deep neural networks (Koloskova et al., 2019a). However, to the best of our knowledge, there are no applications of compressed communication to distributionally robust training in the decentralized setup.

**Algorithm 1: Agnostic Decentralized GDA with Compressed Communication (AD-GDA)**

*Input:* number of nodes $m$, number of iterations $T$, learning rates $\eta_\theta$ and $\eta_\lambda$, mixing matrix $W$, initial values $\theta_0 \in \mathbb{R}^d$ and $\lambda_0 \in \Delta^{m-1}$.
*Output:* $\theta^o = \frac{1}{T}\sum_{t=0}^{T-1}\bar\theta^t$, $\lambda^o = \frac{1}{T}\sum_{t=0}^{T-1}\bar\lambda^t$.

1. Initialize $\theta_i^0 = \theta_0$, $\lambda_i^0 = \lambda_0$, and $s_i^0 = 0$ for $i = 1, \ldots, m$.
2. For $t = 0, \ldots, T-1$, in parallel at each node $i$:
   - $\theta_i^{t+\frac12} \leftarrow \theta_i^t - \eta_\theta \nabla_\theta g_i(\theta_i^t, \lambda_i^t, \xi_i^t)$  (descent step)
   - $\lambda_i^{t+\frac12} \leftarrow P_\Lambda\big(\lambda_i^t + \eta_\lambda \nabla_\lambda g_i(\theta_i^t, \lambda_i^t, \xi_i^t)\big)$  (projected ascent step)
   - $\theta_i^{t+1} \leftarrow \theta_i^{t+\frac12} + \gamma\,(s_i^t - \hat\theta_i^t)$  (gossip)
   - $q_i^t \leftarrow Q(\theta_i^{t+1} - \hat\theta_i^t)$  (compression)
   - Send $(q_i^t, \lambda_i^{t+\frac12})$ to $j \in N(i)$ and receive $(q_j^t, \lambda_j^{t+\frac12})$ from $j \in N(i)$  (message exchange)
   - $\hat\theta_i^{t+1} \leftarrow q_i^t + \hat\theta_i^t$  (public variable update)
   - $s_i^{t+1} \leftarrow s_i^t + \sum_{j=1}^m w_{i,j}\, q_j^t$
   - $\lambda_i^{t+1} \leftarrow \sum_{j=1}^m w_{i,j}\, \lambda_j^{t+\frac12}$  (dual variable averaging)

**Distributionally robust network loss.** In order to obtain a final predictor with satisfactory performance for all local distributions $\{P_i\}_{i=1}^m$, the common objective is to learn a global model which is distributionally robust with respect to the ambiguity set $\mathcal{P} := \left\{\sum_{i=1}^m \lambda_i P_i : \lambda \in \Delta^{m-1}\right\}$, where $\Delta^{m-1}$ denotes the $(m-1)$-dimensional probability simplex. As shown in Mohri et al. (2019), a network objective function that effectively works as a proxy for this scope is given by
$$\min_{\theta \in \mathbb{R}^d} \; \max_{\lambda \in \Delta^{m-1}} \; g(\theta, \lambda) := \frac{1}{m} \sum_{i=1}^m \underbrace{\big(\lambda_i f_i(\theta) + \alpha\, r(\lambda)\big)}_{:= g_i(\theta, \lambda)} \tag{3}$$
in which $r : \Delta^{m-1} \to \mathbb{R}$ is a strongly-concave regularizer and $\alpha \in \mathbb{R}_+$. For instance, in the empirical risk minimization framework, in which each node $i$ is endowed with a training set $D_i \sim P_i^{n_i}$ and the overall number of training points is $n = \sum_i n_i$, a common choice of $r(\lambda)$ is the chi-squared divergence $\chi^2(\lambda) := \sum_i \frac{(\lambda_i - n_i/n)^2}{n_i/n}$ or the Kullback-Leibler divergence $D_{KL}(\lambda) := \sum_i \lambda_i \log(\lambda_i n / n_i)$ (Issaid et al., 2022). Restricting (3) to the latter regularizer, the inner maximization problem can be solved exactly (Issaid et al., 2022; Mohajerin Esfahani & Kuhn, 2018). In what follows, we refer to $\theta$ and $\lambda$ as the primal and dual variables, respectively, and make the following fairly standard assumptions on the local functions $g_i$ and the stochastic oracles available at the network nodes.

**Assumption 3.3.** Each function $g_i(\theta, \lambda)$ is differentiable in $\mathbb{R}^d \times \Delta^{m-1}$, $L$-smooth, and $\mu$-strongly concave in $\lambda$.

**Assumption 3.4.** Each node $i$ has access to stochastic gradient oracles $\nabla_\theta g_i(\theta, \lambda, \xi_i)$ and $\nabla_\lambda g_i(\theta, \lambda, \xi_i)$, with randomness w.r.t. $\xi_i$, which satisfy the following conditions.

Unbiasedness:
$$\mathbb{E}_{\xi_i}[\nabla_\theta g_i(\theta, \lambda, \xi_i)] = \nabla_\theta g_i(\theta, \lambda), \qquad \mathbb{E}_{\xi_i}[\nabla_\lambda g_i(\theta, \lambda, \xi_i)] = \nabla_\lambda g_i(\theta, \lambda). \tag{4}$$

Bounded variance:
$$\mathbb{E}_{\xi_i}\big[\|\nabla_\theta g_i(\theta, \lambda, \xi_i) - \nabla_\theta g_i(\theta, \lambda)\|^2\big] \le \sigma_\theta^2, \qquad \mathbb{E}_{\xi_i}\big[\|\nabla_\lambda g_i(\theta, \lambda, \xi_i) - \nabla_\lambda g_i(\theta, \lambda)\|^2\big] \le \sigma_\lambda^2. \tag{5}$$

Bounded magnitude:
$$\mathbb{E}_{\xi_i}\big[\|\nabla_\theta g_i(\theta, \lambda, \xi_i)\|^2\big] \le G_\theta^2, \qquad \mathbb{E}_{\xi_i}\big[\|\nabla_\lambda g_i(\theta, \lambda, \xi_i)\|^2\big] \le G_\lambda^2. \tag{6}$$

The above assumption implies that each network node can query stochastic gradients that are unbiased, have finite variance, and have bounded second moment. The bounded magnitude assumption is rather strong and limits the choice of regularization functions, but it is often made in distributed stochastic optimization (Stich et al., 2018; Koloskova et al., 2019a; Deng et al., 2021).
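The projected ascent step of Algorithm 1 requires the Euclidean projection $P_\Lambda$ onto the probability simplex. The paper does not spell this operation out, so the following is a hedged sketch of one standard $O(m \log m)$ sorting-based implementation (the function name is ours):

```python
import numpy as np

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the classic sorting-based algorithm."""
    u = np.sort(v)[::-1]                 # coordinates in decreasing order
    css = np.cumsum(u) - 1.0             # cumulative sums minus the target total
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)         # threshold enforcing the constraints
    return np.maximum(v - tau, 0.0)

lam = project_simplex(np.array([0.3, 0.9, -0.2, 0.4]))
assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
```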
## 4 Distributionally Robust Decentralized Learning Algorithm

Problem (3) entails solving a distributed minimax optimization problem in which, at every round, collaborating nodes store private values of the model parameters and of the dual variable, which are potentially different from node to node. We denote the estimates of the primal and dual variables of node $i$ at time $t$ by $\theta_i^t$ and $\lambda_i^t$, and the network estimates at time $t$ by $\bar\theta^t = \frac{1}{m}\sum_{i=1}^m \theta_i^t$ and $\bar\lambda^t = \frac{1}{m}\sum_{i=1}^m \lambda_i^t$, respectively. The main challenge resulting from the decentralized implementation of the stochastic gradient descent/ascent algorithm consists in approaching a minimax solution or a stationary point (depending on the convexity assumption on the loss function) while concurrently ensuring convergence to a common global solution. To this end, the proposed procedure, given in Algorithm 1, alternates between a local update step and a consensus step. At each round, every node $i$ queries the local stochastic gradient oracle and, in parallel, updates the model parameter $\theta_i$ by a gradient descent step with learning rate $\eta_\theta > 0$ and the dual variable $\lambda_i$ by a projected gradient ascent step with learning rate $\eta_\lambda > 0$ (Euclidean projection). Subsequently, a gossip strategy is used to share and average information between neighbors.

In order to alleviate the communication burden of transmitting the vector of model parameters, which is typically high dimensional and contributes the largest share of the communication load, a compressed gossip step is employed. To implement the compressed communication, we consider the memory-efficient version of CHOCO-GOSSIP (Koloskova et al., 2019b), in which each node needs to store only two additional variables, $\hat\theta_i$ and $s_i$, each of the same size as $\theta_i$. The first is a public version of $\theta_i$, while the second is used to track the evolution of the weighted average, according to the matrix $W$, of the public variables at the neighboring nodes. Instead of transmitting $\theta_i$, each node first computes an averaging step to update its private value using the information about the public variables encoded in $\hat\theta_i$ and $s_i$. It then computes $q_i$, a compressed representation of the difference between $\hat\theta_i$ and $\theta_i$, and shares it with the neighboring nodes to update the values of $\hat\theta_i$ and $s_i$ used in the averaging step of the next round. As the number of participating nodes is usually much smaller than the size of the model ($m \ll d$), the dual variable $\lambda_i$ is updated by sending uncompressed messages and then averaged according to the matrix $W$. Note that AD-GDA implicitly assumes that the collaborating parties are honest; for this reason, it does not employ any countermeasure against malicious nodes providing false dual variable information in order to steer the distributionally robust network objective at their whim.
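To make the interplay between the update and consensus steps concrete, below is a minimal single-process simulation of one AD-GDA round, assuming the `project_simplex`, `top_k`, and `random_quantize` helpers sketched earlier, and user-supplied stochastic gradient callables `grad_theta` and `grad_lambda` (all names are ours, not the authors'). This is a sketch of the iteration structure, not a faithful reimplementation of the paper's experiments.

```python
import numpy as np

def ad_gda_round(theta, lam, theta_hat, s, W, grad_theta, grad_lambda,
                 Q, eta_theta, eta_lam, gamma):
    """One AD-GDA iteration over all m nodes.
    theta, theta_hat, s: (m, d) arrays; lam: (m, m) array (one simplex point per node)."""
    m, _ = theta.shape
    # 1) local stochastic gradient descent/ascent steps
    theta_half = np.stack([theta[i] - eta_theta * grad_theta(i, theta[i], lam[i])
                           for i in range(m)])
    lam_half = np.stack([project_simplex(lam[i] + eta_lam * grad_lambda(i, theta[i], lam[i]))
                         for i in range(m)])
    # 2) gossip step on the primal variable, using the public variables
    theta_new = theta_half + gamma * (s - theta_hat)
    # 3) compress each node's innovation and "transmit" it (here: shared memory)
    q = np.stack([Q(theta_new[i] - theta_hat[i]) for i in range(m)])
    theta_hat_new = theta_hat + q
    s_new = s + W @ q              # each node averages its neighbors' innovations
    lam_new = W @ lam_half         # dual variables are exchanged uncompressed
    return theta_new, lam_new, theta_hat_new, s_new
```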
### 4.1 Comparison with existing algorithms

Before deriving the convergence guarantees for AD-GDA, we compare its features against those of the other distributionally robust algorithms, DRFA and DR-DSGD (see Table 1). Firstly, we notice that AD-GDA and DR-DSGD are fully decentralized algorithms, and therefore the only requirement for their deployment is that the network is connected. In contrast, DRFA is a client-server algorithm and requires a star topology, which is known to be less fault-tolerant and characterized by a communication bottleneck. Furthermore, AD-GDA is the only algorithm that attains communication efficiency by means of message compression, which, as we show in Section 5.2.2, allows AD-GDA to attain the same worst-node accuracy at a much lower communication cost compared to DRFA and DR-DSGD. It is also important to notice that DR-DSGD is obtained by restricting the dual regularizer to Kullback-Leibler divergences, which allows sidestepping the minimax problem. On the other hand, AD-GDA directly tackles the distributionally robust problem (3) and can therefore be applied with any strongly-concave regularizer that satisfies Assumption 3.4, for example the chi-squared regularizer, which allows AD-GDA to directly minimize the distributionally robust objective of Mohri et al. (2019). In Section 4.3 we show that AD-GDA recovers the same convergence rate as DR-DSGD in this more general setup.

### 4.2 Convex Loss Function

We now provide a convergence guarantee for the solution output by Algorithm 1 in the case where the loss function $\ell(\cdot)$ is convex in the model parameter $\theta$. The result is given in the form of a sub-optimality gap bound for the function
$$\Phi(\theta) = g\big(\theta, \lambda^*(\theta)\big), \qquad \lambda^*(\cdot) := \arg\max_{\lambda \in \Delta^{m-1}} g(\cdot, \lambda), \tag{7}$$
and it can be promptly derived from a primal-dual gap type of bound provided in the Appendix. In the bound we also refer to $\theta^*(\cdot) \in \arg\min_{\theta \in \mathbb{R}^d} g(\theta, \cdot)$.

**Theorem 4.1.** Under Assumptions 3.3 and 3.4, for any $\theta^* \in \arg\min_\theta \Phi(\theta)$, the solution $\theta^o$ returned by Algorithm 1 with learning rates $\eta_\theta = \eta_\lambda = \frac{1}{\sqrt T}$ and consensus step size $\gamma = \frac{\rho^2\delta}{16\rho + \rho^2 + 4\beta^2 + 2\rho\beta^2 - 8\rho\delta}$ satisfies
$$\mathbb{E}\big[\Phi(\theta^o) - \Phi(\theta^*)\big] \le O\left(\frac{D_\theta + D_\lambda + G_\theta^2 + G_\lambda^2}{\sqrt T}\right) + O\left(\frac{L G_\lambda^2}{\rho^2 T} + \frac{L G_\theta^2}{c^2 T}\right) \tag{8}$$
where $D_\lambda := \max_t \mathbb{E}\|\bar\lambda^t - \lambda^*(\theta^o)\|$, $D_\theta := \max_t \mathbb{E}\|\bar\theta^t - \theta^*(\lambda^o)\|$, and $c = \frac{\rho^2 \delta}{82}$.

The bound establishes an $O(1/\sqrt T)$ non-asymptotic optimality gap guarantee for the output solution. Compared to decentralized stochastic gradient descent (SGD) in the convex scenario, we obtain the same rate (without a dependency on the number of workers $m$) but with a dependency on the network topology and compression also in the lower order terms. Moreover, whenever $\theta$ and $\lambda$ are constrained in convex sets, the diameters of the two sets can be used to explicitly bound $D_\theta$ and $D_\lambda$.

### 4.3 Non-convex Loss Function

We now focus on the case where the relation between the model parameters $\theta$ and the value of the loss function is non-convex. In this setting, carefully tuning the relation between the primal and dual learning rates is key to establishing a convergent recursion. From the centralized two-time-scale stochastic gradient descent literature, it is known that the primal learning rate $\eta_\theta$ has to be $1/(16(\kappa + 1))$ times smaller than the dual learning rate $\eta_\lambda$ (Lin et al., 2020a), where $\kappa = L/\mu$ denotes the condition number. This relationship ensures that the objective function changes slowly enough in the dual variable $\lambda$, and it allows bounding the distance between the optimal value $\lambda^*(\theta^t)$ and the current estimate $\lambda^t$. However, in the decentralized setup the estimates of $\theta$ and $\lambda$ differ at every node, and therefore there exist multiple optimal values of the dual variable, namely $\lambda^*(\theta_i^t)$ for $i = 1, \ldots, m$. Nonetheless, we find that it is sufficient to control the quantity $\delta_\lambda^t := \|\lambda^*(\bar\theta^t) - \bar\lambda^t\|^2$, the distance between $\lambda^*(\bar\theta^t)$, the optimal dual variable for the averaged primal estimate, and $\bar\lambda^t$, the current averaged dual estimate. The following lemma, whose proof can be found in Appendix A.3, characterizes the behavior of $\delta_\lambda^t$.
**Lemma 4.2.** For $\eta_\theta = \frac{\eta_\lambda}{16(\kappa+1)^2}$ and $\eta_\lambda = \frac{1}{2L\sqrt T}$, the sequence $\{\delta_\lambda^t\}_{t=1}^T$ generated by Algorithm 1 satisfies
$$\sum_{t=1}^{T}\mathbb{E}\big[\delta_\lambda^t\big] \le \frac{5\,\delta_\lambda^0\,\kappa}{\eta_\lambda \mu} + \sum_{t=1}^{T}\left(\frac{\mathbb{E}\big[\Xi_\theta^{t-1}\big]}{m} + \frac{3\,\mathbb{E}\big[\Xi_\theta^{t-1}\big]}{m} + \frac{7\,\mathbb{E}\big[\Xi_\lambda^{t-1}\big]}{m}\right) + \sum_{t=1}^{T}\frac{40\,\kappa^2\eta_\theta^2}{\eta_\lambda^2\mu^2}\,\mathbb{E}\Big[\big\|\nabla\Phi(\bar\theta^{\,t-1})\big\|^2\Big] + \frac{5T}{2}\,\frac{\eta_\lambda\sigma_\lambda^2}{m\mu} + \frac{4T\,\sigma_\theta^2}{16^2\, m(\kappa+1)^2\mu^2} \tag{9}$$
where $\Xi_\theta^t := \sum_{i=1}^m \|\theta_i^t - \bar\theta^t\|^2$ and $\Xi_\lambda^t := \sum_{i=1}^m \|\lambda_i^t - \bar\lambda^t\|^2$ are the primal and dual consensus errors, and $D_\lambda^{t-1} = \|\bar\lambda^{t-1} - \lambda^*\|$.

Lemma 4.2 provides us with an inequality that controls the speed at which the optimization problem changes from a network-level perspective. As such, the expression (9) contains consensus error terms that do not appear in the centralized setup. We find that to establish a convergent recursion, while at the same time controlling the consensus error terms, the primal learning rate $\eta_\theta$ has to be $1/(16(\kappa + 1)^2)$ times smaller than the dual one $\eta_\lambda$ (therefore $1/(\kappa + 1)$ times smaller than in the centralized case). Once this condition is met, it is possible to provide a bound on the stationarity of the randomized solution, picked uniformly over time, that matches the one known for the centralized case.

**Theorem 4.3.** Under Assumptions 3.3 and 3.4, the iterates of Algorithm 1 with learning rates $\eta_\theta = \frac{\eta_\lambda}{16(\kappa+1)^2}$, $\eta_\lambda = \frac{1}{2L\sqrt T}$, and consensus step size $\gamma = \frac{\rho^2\delta}{16\rho + \rho^2 + 4\beta^2 + 2\rho\beta^2 - 8\rho\delta}$ satisfy
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\Big[\big\|\nabla\Phi(\bar\theta^{\,t-1})\big\|^2\Big] \le O\left(\frac{\Delta_\Phi^T + L^2\kappa^2 (D_\lambda^0)^2}{\sqrt T} + \frac{\sigma_\theta^2 + \kappa\sigma_\lambda^2}{m\sqrt T}\right) + O\left(\frac{G_\theta^2}{c^2 T} + \frac{\kappa G_\lambda^2}{\rho^2 T} + \frac{\sigma_\theta^2}{m}\right) \tag{10}$$
where $\Delta_\Phi^T = \mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)]$ and $c = \frac{\rho^2\delta}{82}$.

We note that the bound decreases at a rate $O(1/\sqrt T)$, except for the last variance term, which is non-vanishing. Nonetheless, whenever the variance of the stochastic gradient oracle for the primal variable is small or the number of participating devices is large, this term becomes negligible. Otherwise, at the cost of an increased gradient complexity, each device can query the oracle $O(1/\epsilon^2)$ times every round and average the results, making the effective stochastic gradient variance $O(\epsilon^2)$. This procedure makes the bound vanish and leads to a gradient complexity matching that of Sharma et al. (2022) given for the federated learning scenario.
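The variance-reduction step invoked above is the standard minibatching argument: averaging $K$ i.i.d. unbiased gradient estimates divides the variance by $K$, so $K = O(\sigma_\theta^2/\epsilon^2)$ oracle calls per round suffice to drive the non-vanishing term below $\epsilon^2$:

$$\operatorname{Var}\left[\frac{1}{K}\sum_{k=1}^{K} \nabla_\theta g_i\big(\theta, \lambda, \xi_i^{(k)}\big)\right] = \frac{1}{K^2}\sum_{k=1}^{K}\operatorname{Var}\Big[\nabla_\theta g_i\big(\theta, \lambda, \xi_i^{(k)}\big)\Big] \le \frac{\sigma_\theta^2}{K}.$$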
Table 2: Final worst-case distribution accuracy attained by AD-GDA and CHOCO-SGD under different compression schemes and compression ratios.

| Model / Algorithm | Quant. 16 bit | Quant. 8 bit | Quant. 4 bit | Spars. 50% | Spars. 25% | Spars. 10% |
|---|---|---|---|---|---|---|
| Logistic, AD-GDA | 59.19 ± 2.05 | 57.43 ± 1.44 | 55.75 ± 2.09 | 57.05 ± 0.68 | 54.02 ± 1.14 | 51.51 ± 2.88 |
| Logistic, CHOCO-SGD | 30.69 ± 0.96 | 30.06 ± 0.83 | 29.46 ± 0.05 | 30.28 ± 0.60 | 28.56 ± 0.54 | 26.39 ± 0.67 |
| F.C., AD-GDA | 54.99 ± 1.92 | 48.99 ± 2.30 | 47.08 ± 2.53 | 51.85 ± 2.11 | 43.65 ± 2.97 | 38.95 ± 3.21 |
| F.C., CHOCO-SGD | 30.83 ± 2.22 | 28.08 ± 2.50 | 28.01 ± 2.59 | 29.92 ± 2.54 | 27.11 ± 2.96 | 25.91 ± 3.20 |

## 5 Experiments

In this section, we empirically evaluate AD-GDA's capability of producing robust predictors. We first compare AD-GDA with CHOCO-SGD and showcase the merits of the distributionally robust procedure across different learning models, communication network topologies, and message compression schemes. We then consider larger-scale experimental setups in which we study the effect of the regularization on the worst-case distribution accuracy and compare AD-GDA against Distributionally Robust Federated Averaging (DRFA) (Deng et al., 2021) and Distributionally Robust Decentralized Stochastic Gradient Descent (DR-DSGD) (Issaid et al., 2022).

*Figure 3: Convergence plots of AD-GDA for different compression schemes and levels: (a) quantization (4, 8, 16, and 32 bit); (b) sparsification (25%, 50%, 75%, and 100%).*

*Figure 4: Convergence plot of AD-GDA for different topologies (ring, 2D torus, mesh).*

### 5.1 Fashion-MNIST Classification

We perform our experiments using the Fashion-MNIST data set (Xiao et al., 2017)², a popular data set made of images of 10 different clothing items, which is commonly used to test distributionally robust learners (Mohri et al., 2019; Deng et al., 2021). To introduce data heterogeneity, samples are partitioned across the network devices using a class-wise split. Namely, using a workstation equipped with a GTX 1080 Ti, we simulate a network of 10 nodes, each storing data points coming from one of the 10 classes. In this setting, we train a logistic regression model and a two-layer fully connected neural network with 25 hidden units to investigate the convex and the non-convex cases, respectively. In both cases, we use the SGD optimizer and, to ensure consensus at the end of the optimization process, we consider a geometrically decreasing learning rate $\eta_\theta^t = r^t \eta_\theta^0$ with ratio $r = 0.995$ and initial value $\eta_\theta^0 = 1$. The metrics that we track are the final worst-node distribution accuracy and the average accuracy over the aggregated data samples, both evaluated on the network estimate $\bar\theta^t$.

²The Fashion-MNIST data set is released under the MIT License.

#### 5.1.1 Effect of Compression

We assess the effect of compression with a fixed budget in terms of communication rounds by organizing the nodes in a ring topology and training the logistic model and the fully connected network for T = 2000 iterations. As a representative of the unbiased compression operators, we consider the b-bit random quantization scheme for b = {16, 8, 4} bit levels, while for the biased category we implement the top-K sparsification scheme, saving K = {50%, 25%, 10%} of the original message components. For each compression scheme and compression level, we tune the consensus step size γ by performing a grid search. We train the different models for 20 different random placements of the data shards across the devices, using the distributionally robust and the standard learning paradigms. In Table 2 we report the average worst-case accuracy attained by the final averaged model $\bar\theta^T$. AD-GDA almost doubles the worst-case accuracy compared to the non-robust baseline CHOCO-SGD (Koloskova et al., 2019b). This gain holds for both compression schemes and across the different compression levels. For increased compression ratios, the worst-case accuracy degrades; however, for a comparable saving in communication bandwidth, the unbiased quantization scheme results in superior performance compared to the biased sparsification operator. For a fixed optimization horizon, compression degrades performance; nonetheless, compression allows us to obtain the same accuracy level with fewer transmitted bits. In Figure 3 we report the convergence plots of AD-GDA under the different compression schemes and ratios. For this experiment, we track the worst-node loss of a logistic model trained using a fixed learning rate $\eta_\theta = 0.1$. The convergence plots confirm the sub-linear rate of AD-GDA and highlight the effect of the compression level on the slope of convergence.

#### 5.1.2 Effect of Topology

We now turn to investigate the effect of node connectivity. Sparser communication topologies slow down the consensus process and therefore hamper the convergence of the algorithm.
In the previous batch of experiments, we considered a sparse ring topology, in which each node is connected to only two other nodes. Here, we explore two other network configurations with a more favorable spectral gap: the 2D torus topology, in which each node is connected to four other nodes, and the mesh case, in which all nodes communicate with each other. For these configurations, we consider the 4-bit quantization and top-10% sparsification compression schemes. In Table 3 we report the final worst-case performance for the different network configurations. As expected, network configurations with larger node degrees lead to higher worst-case accuracy, owing to the faster consensus. In Figure 4 we provide the convergence plot of AD-GDA for the different communication topologies. In particular, we track the worst-node loss versus the number of iterations for the logistic model optimized using a fixed learning rate $\eta_\theta = 0.1$. The predicted sub-linear rate of AD-GDA is confirmed, and it is also possible to appreciate the influence of the spectral gap on the convergence rate.

Table 3: Worst-node accuracy attained by AD-GDA and CHOCO-SGD for different network topologies.

| Model / Algorithm | Top-10% Spars., 2D Torus | Top-10% Spars., Mesh | 4-bit Quant., 2D Torus | 4-bit Quant., Mesh |
|---|---|---|---|---|
| Log., AD-GDA | 54.00 ± 0.61 | 54.07 ± 0.03 | 56.94 ± 0.38 | 57.11 ± 0.03 |
| Log., CHOCO-SGD | 26.82 ± 0.41 | 29.00 ± 0.02 | 30.82 ± 0.24 | 30.97 ± 0.03 |
| F.C., AD-GDA | 44.31 ± 2.47 | 45.21 ± 2.22 | 50.16 ± 1.85 | 50.80 ± 1.83 |
| F.C., CHOCO-SGD | 26.02 ± 2.29 | 26.38 ± 2.65 | 28.79 ± 2.22 | 28.96 ± 1.87 |

### 5.2 Larger Scale Experiments

We now study AD-GDA further and compare it with the existing distributionally robust algorithms. We consider three larger-scale experimental setups:

- A larger-scale version of the Fashion-MNIST classification task of Section 5.1. In this setup, we consider a larger network comprising a total of 50 devices, with each node storing samples from only one class.
- A CIFAR-10 image classification task (Krizhevsky et al., 2009) based on a 4-layer convolutional neural network (CNN). We consider a network of 20 nodes and introduce data heterogeneity by evenly partitioning the training set and changing the contrast of the images stored on the devices. The pixel values $P \in [0, 255]$ are modified using the following non-linear transformation (see the sketch after this list):
$$f_c(P) = \operatorname{clip}_{[0,255]}\Big[\big(128 + c(P - 128)\big)^{1.1}\Big], \tag{11}$$
where $\operatorname{clip}_{[0,255]}(\cdot)$ clips and rounds values to the integer range $[0, 255]$. For $c < 1$ the contrast is reduced, while for $c > 1$ it is enhanced. We consider two network nodes storing images with reduced contrast ($c = 0.5$), two storing images with increased contrast ($c = 1.5$), and the rest of the nodes storing images with the original contrast level ($c = 1$). This setup can be used to model a camera network (e.g., a surveillance network) containing devices deployed under different lighting conditions.
- A network of 10 nodes collaborating to train a 4-layer CNN to classify microscopy images from the biological data set COOS7 (Lu et al., 2019). Data heterogeneity is a consequence of the different instruments used to sample the training data. In particular, we consider two of the collaborating nodes using a different microscope from the rest of the collaborating devices. This setup illustrates the risk of training models on biological data affected by local confounders, in this case the usage of different sensors.
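A minimal sketch of the contrast transformation (11), applied element-wise; the function name and the uint8 rounding convention are our assumptions, not the authors' code.

```python
import numpy as np

def adjust_contrast(img: np.ndarray, c: float) -> np.ndarray:
    """Apply f_c(P) = clip_[0,255][(128 + c(P - 128))^1.1] to every pixel.
    c < 1 reduces contrast, c > 1 enhances it."""
    p = 128.0 + c * (img.astype(np.float64) - 128.0)
    p = np.clip(p, 0.0, None) ** 1.1          # avoid fractional powers of negatives
    return np.clip(np.rint(p), 0, 255).astype(np.uint8)

img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
low, high = adjust_contrast(img, 0.5), adjust_contrast(img, 1.5)
```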
#### 5.2.1 Effect of Regularization

We first evaluate the effect of the regularization parameter α in the distributionally robust formulation (3) and study how it affects the final performance of AD-GDA. According to the two-player game interpretation of the minimax optimization problem (3), the regularizer r(λ) reduces the freedom that the adversary has in choosing the weighting vector λ to maximize the training loss at every iteration t. As a result, the smaller the value of α, the less constrained the adversary is and the larger the emphasis placed on the worst-performing nodes. This intuition is confirmed by the following experiments, in which we consider a regularizer of the form $\chi^2(\lambda) := \sum_i \frac{(\lambda_i - n_i/n)^2}{n_i/n}$ and run AD-GDA for α = {10, 1, 0.01}. In Table 4 we report the average accuracy attained in the three simulation setups. For α = 10, we observe a large test accuracy gap between the worst and best nodes in the network: 37% for the Fashion-MNIST data set, 9% for the CIFAR-10 data set, and 19% for the COOS7 data set. This large accuracy mismatch showcases how large regularization parameter values, and therefore standard decentralized optimization schemes (obtained for α = ∞), are unable to guarantee uniform performance across the participating parties. On the other hand, using smaller regularization parameters, the gap is effectively reduced: 31% for Fashion-MNIST, less than 2% for CIFAR-10, and less than 1% for COOS7. At the same time, the improved fairness brought by AD-GDA does not significantly hamper the average performance of the final model, as reported in the last column of each table.

Table 4: Testing accuracy attained by AD-GDA for different regularization values α. For the Fashion-MNIST data set we report the worst and best class accuracy, for the CIFAR-10 data set the accuracy on the different contrast levels, and for the COOS7 data set the accuracy on the different microscopes. The last column of each table is the accuracy attained on a test set comprising samples from all local data distributions.

(a) Fashion-MNIST.

| | Worst Class | Best Class | Average |
|---|---|---|---|
| α = 10 | 52.83 ± 1.14 | 90.63 ± 0.18 | 77.31 ± 0.19 |
| α = 1 | 59.17 ± 1.06 | 89.77 ± 0.60 | 76.77 ± 0.04 |
| α = 0.01 | 58.47 ± 0.92 | 89.56 ± 0.52 | 76.67 ± 0.09 |

(b) COOS7.

| | Microscope 1 | Microscope 2 | Average |
|---|---|---|---|
| α = 10 | 66.93 ± 0.23 | 86.54 ± 0.06 | 76.73 ± 0.13 |
| α = 1 | 72.27 ± 0.31 | 81.30 ± 0.18 | 76.77 ± 0.22 |
| α = 0.01 | 75.00 ± 0.24 | 75.63 ± 0.40 | 75.31 ± 0.22 |

(c) CIFAR-10.

| | Low Contrast | High Contrast | Original Contrast | Average |
|---|---|---|---|---|
| α = 10 | 34.66 ± 0.47 | 40.67 ± 1.29 | 44.28 ± 0.84 | 39.87 ± 3.96 |
| α = 0.1 | 37.30 ± 0.73 | 40.93 ± 1.54 | 43.62 ± 0.91 | 40.61 ± 2.59 |
| α = 0.01 | 39.06 ± 0.80 | 41.13 ± 1.14 | 42.96 ± 0.91 | 41.05 ± 1.59 |

Table 5: Worst-case distribution accuracy attained by AD-GDA, DR-DSGD, and DRFA.

| | Fashion-MNIST | CIFAR-10 | COOS7 |
|---|---|---|---|
| AD-GDA | 58.47 ± 0.92 | 39.06 ± 0.80 | 75.00 ± 0.24 |
| DRFA | 56.68 ± 1.04 | 37.20 ± 1.16 | 69.16 ± 0.39 |
| DR-DSGD | 50.77 ± 0.42 | 38.00 ± 0.25 | 67.09 ± 0.42 |

In Table 5 we compare the worst-node accuracy of AD-GDA against the one obtained by DRFA and DR-DSGD. In particular, we run AD-GDA with regularization parameter α = 0.01 and without message compression. For DR-DSGD we consider the same network setup and fix the regularization parameter to α = 6, which we find to yield the best performance. Finally, for DRFA, we organize the network nodes according to a star topology and run the federated optimization with half-user participation and 10 local iterations. Across all experiments, we find that AD-GDA yields the largest worst-node accuracy. Compared to DR-DSGD, we attribute the superiority of AD-GDA to its ability to solve the distributionally robust optimization problem with any strongly-concave regularizer, in this case the chi-squared one.
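For concreteness, the dual ascent direction used with this regularizer has a simple closed form. Below is a hedged sketch under our own naming, taking $r(\lambda) = -\chi^2(\lambda)$ so that the regularization term is strongly concave as required by Assumption 3.3: node $i$'s stochastic dual gradient raises $\lambda_i$ in proportion to the local loss, while the $-\alpha\nabla\chi^2(\lambda)$ term pulls $\lambda$ back towards the empirical weights $n_i/n$.

```python
import numpy as np

def dual_gradient(local_loss: float, i: int, lam: np.ndarray,
                  n_frac: np.ndarray, alpha: float) -> np.ndarray:
    """Stochastic gradient in lambda of lambda_i * f_i(theta) - alpha * chi^2(lambda)."""
    g = -alpha * 2.0 * (lam - n_frac) / n_frac  # gradient of -alpha * chi^2
    g[i] += local_loss                          # only the i-th weight sees the local loss
    return g

# one projected ascent step at node 2 (project_simplex as sketched in Section 3)
lam, n_frac = np.full(4, 0.25), np.full(4, 0.25)
lam = project_simplex(lam + 0.1 * dual_gradient(1.7, 2, lam, n_frac, alpha=0.01))
```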
On the other hand, the weaker performance of DRFA can be attributed to the way it attains distributional robustness, namely by sampling the nodes with unsatisfactory performance more frequently. Whenever the fraction of such users is small compared to the overall number of devices that join each federated round, DRFA is unable to prioritize the worst-performing nodes, since they remain under-represented within the group of sampled devices.

#### 5.2.2 Communication Efficiency

*Figure 5: Comparison between the proposed algorithm (AD-GDA, 4-bit quant.), Distributionally Robust Federated Averaging (DRFA), Distributionally Robust Decentralized Stochastic Gradient Descent (DR-DSGD), and standard decentralized learning (CHOCO-SGD, 4-bit quant.) on (a) Fashion-MNIST, (b) CIFAR-10, and (c) COOS7. The algorithms are compared in terms of worst-node accuracy versus transmitted MB.*

In the following, we compare AD-GDA against CHOCO-SGD and the existing distributionally robust learning schemes, namely Distributionally Robust Federated Averaging (DRFA) (Deng et al., 2021) and Distributionally Robust Decentralized Stochastic Gradient Descent (DR-DSGD) (Issaid et al., 2022). CHOCO-SGD represents a standard decentralized learning procedure that employs compressed gossip to attain communication efficiency. DRFA is a federated distributionally robust learning scheme with network devices connected according to a star topology, with the star center represented by a central aggregator; communication efficiency is obtained by allowing network devices to perform multiple local updates of the primal variable between subsequent synchronization rounds at the central aggregator. We run DRFA allowing devices to perform 10 local gradient steps before sending their local models for the distributionally robust averaging step, and we consider half-user participation at each round. DR-DSGD is a decentralized distributionally robust learning scheme based on KL regularizers that does not employ compressed communication. For the decentralized learning schemes, we consider a 2D torus topology, with every node connected to four other nodes, and use Metropolis mixing matrices. For AD-GDA and CHOCO-SGD, which employ compressed communication, we consider a 4-bit quantization scheme. For AD-GDA we consider a chi-squared regularizer with α = 0.01, while for DR-DSGD we set the KL regularizer parameter to α = 6, as in Issaid et al. (2022). All algorithms are run for T = 5000 iterations using an SGD optimizer and the same exponentially decaying learning rate schedule $\eta_\theta^t = r^t \eta_\theta^0$ with decay r = 0.998. In order to make the comparison fair, the initial learning rate $\eta_\theta^0$ is set differently at every node to ensure that the effective learning rate is equal across the different algorithms; in fact, the DR-DSGD and AD-GDA descent steps are scaled by the dual variable. The batch sizes are the same across all algorithms and are set to {50, 50, 32} for the Fashion-MNIST, CIFAR-10, and COOS7 experiments, respectively.
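The Metropolis mixing matrices mentioned above admit a compact construction. Here is a minimal sketch under our own naming, for an arbitrary undirected connected graph given by its adjacency matrix; the resulting $W$ is symmetric and doubly stochastic, as required by Assumption 3.1. The ring example below extends directly to the 2D torus used in the experiments.

```python
import numpy as np

def metropolis_weights(adj: np.ndarray) -> np.ndarray:
    """Metropolis-Hastings mixing matrix: w_ij = 1 / (1 + max(deg_i, deg_j))
    for edges, zero otherwise, and self-weights chosen so each row sums to one."""
    deg = adj.sum(axis=1)
    W = np.where(adj > 0, 1.0 / (1.0 + np.maximum.outer(deg, deg)), 0.0)
    np.fill_diagonal(W, 0.0)
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # self-communication weight
    return W

# ring with m = 10 nodes (each node talks to its two neighbors)
m = 10
adj = np.zeros((m, m))
for i in range(m):
    adj[i, (i + 1) % m] = adj[i, (i - 1) % m] = 1
W = metropolis_weights(adj)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
```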
The algorithms are compared in terms of communication efficiency, namely the capability of producing a fair predictor while communicating the smallest number of bits. To this end, in Figure 5 we compare the worst-case distribution accuracy against the number of bits transmitted by the busiest node in the network. The results are reported over the different simulation scenarios. AD-GDA and CHOCO-SGD employ the same compression scheme and converge within the same communication budget; however, AD-GDA greatly increases the worst-node accuracy compared to CHOCO-SGD, showcasing the merits of distributionally robust learning over standard learning. Compared to the other distributionally robust algorithms, AD-GDA attains the largest worst-case accuracy while transmitting only a fraction of the bits. In particular, for the Fashion-MNIST setup, AD-GDA is 4x and 8x more communication-efficient than DRFA and DR-DSGD, respectively. For the CIFAR-10 data set, AD-GDA reduces by 3x and 5x the number of bits necessary to attain the same final worst-node accuracy as DRFA and DR-DSGD, respectively. Finally, in the COOS7 case, AD-GDA is 3x more efficient than DRFA and 10x more efficient than DR-DSGD.

## 6 Conclusion

We provided a provably convergent decentralized single-loop gradient descent/ascent algorithm to solve the distributionally robust optimization problem over a network of collaborating nodes with heterogeneous local data distributions. Differently from previously proposed solutions, which are either limited to the federated scenario with a central coordinator or to specific regularizers, our algorithm directly tackles the underlying minimax optimization problem in a decentralized and communication-efficient manner. Experiments showed that the proposed solution produces distributionally robust predictors and attains superior communication efficiency compared to previously proposed algorithms. A combination of compressed communication and multiple local updates, together with acceleration techniques, represents a natural extension of the algorithm to further improve its efficiency.

**Acknowledgments**

The work of M. Zecchin is funded by the Marie Curie action WINDMILL (Grant agreement No. 813999). The work of M. Kountouris has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 101003431). The work of David Gesbert was partially supported by the 3IA interdisciplinary project ANR-19-P3IA-0002 funded by the French National Research Agency (ANR).

## References

Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. arXiv preprint arXiv:1809.10505, 2018.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.

Jianshu Chen and Ali H Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.

Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi.
Distributionally robust federated averaging. arXiv preprint arXiv:2102.12660, 2021.

John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.

John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2019.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.

Zhaolin Hu and L Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.

Chaouki Ben Issaid, Anis Elgabli, and Mehdi Bennis. DR-DSGD: A distributionally robust decentralized learning algorithm over graphs. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=VcXNAr5Rur.

Ruiwei Jiang and Yongpei Guan. Data-driven chance constrained stochastic program. Mathematical Programming, 158(1):291–327, 2016.

Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. arXiv preprint arXiv:1907.09356, 2019a.

Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pp. 3478–3487. PMLR, 2019b.

Alec Koppel, Felicia Y Jakubiec, and Alejandro Ribeiro. A saddle point algorithm for networked online convex optimization. IEEE Transactions on Signal Processing, 63(19):5149–5164, 2015.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4213–4220, 2019.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083–6093. PMLR, 2020a.

Tianyi Lin, Zhengyuan Zhou, Panayotis Mertikopoulos, and Michael Jordan. Finite-time last-iterate convergence for multi-agent learning in games. In International Conference on Machine Learning, pp. 6161–6171. PMLR, 2020b.

Qing Ling, Zaiwen Wen, and Wotao Yin. Decentralized jointly sparse optimization by reweighted ℓq minimization.
IEEE Transactions on Signal Processing, 61(5):1165–1170, 2012.

Mingrui Liu, Wei Zhang, Youssef Mroueh, Xiaodong Cui, Jerret Ross, Tianbao Yang, and Payel Das. A decentralized parallel algorithm for training generative adversarial nets. arXiv preprint arXiv:1910.12999, 2019a.

Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, and Nenggan Zheng. A decentralized proximal point-type method for saddle point problems. arXiv preprint arXiv:1910.14380, 2019b.

Nicolas Loizou, Hugo Berard, Gauthier Gidel, Ioannis Mitliagkas, and Simon Lacoste-Julien. Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis under expected co-coercivity. Advances in Neural Information Processing Systems, 34:19095–19108, 2021.

Alex X Lu, Amy X Lu, Wiebke Schormann, Marzyeh Ghassemi, David W Andrews, and Alan M Moses. The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers. arXiv preprint arXiv:1906.07282, 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

David Mateos-Núnez and Jorge Cortés. Distributed subgradient methods for saddle-point problems. In 2015 54th IEEE Conference on Decision and Control (CDC), pp. 5462–5467. IEEE, 2015.

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.

Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pp. 4615–4625. PMLR, 2019.

Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, volume 29, pp. 2208–2216, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp. 2817–2826. PMLR, 2017.

Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer. Dataset shift in machine learning. MIT Press, 2009.

Michael Rabbat. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 517–520. IEEE, 2015.

Herbert E Scarf. A min-max solution of an inventory problem. Technical report, RAND Corporation, Santa Monica, CA, 1957.

Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 850–857. IEEE, 2014.

Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod K Varshney. Federated minimax optimization: Improved convergence analyses and algorithms. arXiv preprint arXiv:2203.04850, 2022.

Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training.
arXiv preprint arXiv:1710.10571, 2017.

Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. arXiv preprint arXiv:1809.07599, 2018.

Ioannis Tsaknakis, Mingyi Hong, and Sijia Liu. Decentralized min-max optimization: Formulations, algorithms and applications in network poisoning attack. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5755–5759. IEEE, 2020.

John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

John Nikolas Tsitsiklis. Problems in decentralized decision making and computation. Technical report, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems, 1984.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 2012 IEEE 51st Conference on Decision and Control (CDC), pp. 5445–5450. IEEE, 2012.

David Wozabal. A framework for optimization under ambiguity. Annals of Operations Research, 193(1):21–47, 2012.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Feng Yan, Shreyas Sundaram, SVN Vishwanathan, and Yuan Qi. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493, 2012.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693–5700, 2019.

## A.1 Useful inequalities

This section contains a collection of ancillary results that are useful for the subsequent proofs.

**Proposition A.1.** A differentiable and $L$-smooth function $f(x)$ satisfies
$$\|\nabla f(x) - \nabla f(x')\| \le L\|x - x'\|. \tag{12}$$
Furthermore, if $f(x)$ is convex,
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2, \tag{13}$$
and if $x^*$ is a minimizer,
$$\frac{1}{2L}\|\nabla f(x)\|^2 \le f(x) - f(x^*). \tag{14}$$
Otherwise, if $f(x)$ is concave,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle - \frac{L}{2}\|y - x\|^2, \tag{15}$$
and if $x^*$ is a maximizer,
$$\frac{1}{2L}\|\nabla f(x)\|^2 \le f(x^*) - f(x). \tag{16}$$

**Proposition A.2.** A differentiable and $\mu$-strongly convex function $f(x)$ satisfies
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2, \tag{17}$$
and a differentiable and $\mu$-strongly concave function $g(x)$ satisfies
$$g(y) \le g(x) + \langle \nabla g(x), y - x\rangle - \frac{\mu}{2}\|y - x\|^2. \tag{18}$$

**Proposition A.3.** Given two vectors $a, b \in \mathbb{R}^d$, for $\beta > 0$ we have
$$2\langle a, b\rangle \le \beta^{-1}\|a\|^2 + \beta\|b\|^2, \tag{19}$$
$$\|a + b\|^2 \le (1 + \beta^{-1})\|a\|^2 + (1 + \beta)\|b\|^2. \tag{20}$$

**Proposition A.4.** Given two matrices $A \in \mathbb{R}^{p \times q}$ and $B \in \mathbb{R}^{q \times r}$, we have
$$\|AB\|_F \le \|A\|_F \|B\|_2, \tag{21}$$
where $\|\cdot\|_F$ denotes the Frobenius norm.

**Proposition A.5.** Given a set of vectors $\{a_i\}_{i=1}^n$, we have
$$\Big\|\sum_{i=1}^n a_i\Big\|^2 \le n \sum_{i=1}^n \|a_i\|^2. \tag{22}$$

**Consensus inequalities.** To streamline the notation, we write $\nabla g_i(\theta_i^t, \lambda_i^t)$ for $\nabla g_i(\theta_i^t, \lambda_i^t, \xi_i^t)$ and introduce the matrices
$$\Theta^t = \big[\theta_1^t, \ldots, \theta_m^t\big] \in \mathbb{R}^{d \times m}, \quad \hat\Theta^t = \big[\hat\theta_1^t, \ldots, \hat\theta_m^t\big] \in \mathbb{R}^{d \times m}, \quad \Lambda^t = \big[\lambda_1^t, \ldots, \lambda_m^t\big] \in \mathbb{R}^{m \times m}, \tag{23}$$
$$\nabla_\theta G(\Theta^t, \Lambda^t) = \big[\nabla_\theta g_1(\theta_1^t, \lambda_1^t), \ldots, \nabla_\theta g_m(\theta_m^t, \lambda_m^t)\big] \in \mathbb{R}^{d \times m}, \tag{24}$$
$$\nabla_\lambda G(\Theta^t, \Lambda^t) = \big[\nabla_\lambda g_1(\theta_1^t, \lambda_1^t), \ldots, \nabla_\lambda g_m(\theta_m^t, \lambda_m^t)\big] \in \mathbb{R}^{m \times m}, \tag{25}$$
Consensus inequalities. To streamline the notation we write $\nabla g_i(\theta_i^t, \lambda_i^t) = \nabla g_i(\theta_i^t, \lambda_i^t, \xi_i^t)$ and introduce the following matrices:
$$\Theta^t = \left[\theta_1^t, \dots, \theta_m^t\right] \in \mathbb{R}^{d\times m}, \quad \hat\Theta^t = \left[\hat\theta_1^t, \dots, \hat\theta_m^t\right] \in \mathbb{R}^{d\times m}, \quad \Lambda^t = \left[\lambda_1^t, \dots, \lambda_m^t\right] \in \mathbb{R}^{m\times m}, \quad (23)$$
$$\nabla_\theta G(\Theta^t, \Lambda^t) = \left[\nabla_\theta g_1(\theta_1^t, \lambda_1^t), \dots, \nabla_\theta g_m(\theta_m^t, \lambda_m^t)\right] \in \mathbb{R}^{d\times m}, \quad (24)$$
$$\nabla_\lambda G(\Theta^t, \Lambda^t) = \left[\nabla_\lambda g_1(\theta_1^t, \lambda_1^t), \dots, \nabla_\lambda g_m(\theta_m^t, \lambda_m^t)\right] \in \mathbb{R}^{m\times m}, \quad (25)$$
and for a matrix $X$ we define $\bar X = X\,\frac{1}{m}\mathbf{1}\mathbf{1}^T$. The local update rule of Algorithm 1 can be rewritten as
$$\Theta^{t+\frac12} = \Theta^t - \eta_\theta \nabla_\theta G(\Theta^t, \Lambda^t) \quad (26)$$
$$\Lambda^{t+\frac12} = P_\Lambda\!\left(\Lambda^t + \eta_\lambda \nabla_\lambda G(\Theta^t, \Lambda^t)\right) \quad (27)$$
where the projection $P_\Lambda$ is applied column-wise.

The compressed gossip algorithm CHOCO-GOSSIP (Koloskova et al., 2019b) used to share the model parameters preserves averages and satisfies the following recursive inequality with $c = \frac{\rho^2\delta}{82}$:
$$\mathbb{E}\left[\|\Theta^{t+1} - \bar\Theta^{t+1}\|_F^2 + \|\Theta^{t+1} - \hat\Theta^{t+1}\|_F^2\right] \le (1 - c)\,\mathbb{E}\left[\|\Theta^{t+\frac12} - \bar\Theta^{t+\frac12}\|_F^2 + \|\Theta^{t+\frac12} - \hat\Theta^{t}\|_F^2\right]. \quad (28)$$
The uncompressed gossip scheme used to communicate $\Lambda$ satisfies
$$\mathbb{E}\left[\|\Lambda^{t+1} - \bar\Lambda^{t+1}\|_F^2\right] \le (1 - \rho)\,\mathbb{E}\left[\|\Lambda^{t+\frac12} - \bar\Lambda^{t+\frac12}\|_F^2\right]. \quad (29)$$

Lemma A.6. (Consensus inequality for compressed communication (Koloskova et al., 2019a)) For a fixed $\eta_\theta > 0$ and $\gamma = \frac{\rho^2\delta}{16\rho + \rho^2 + 4\beta^2 + 2\rho\beta^2 - 8\rho\delta}$, the iterates of Algorithm 1 satisfy
$$\mathbb{E}\left[\sum_{i=1}^m \|\theta_i^t - \bar\theta^t\|^2\right] \le \frac{12\,\eta_\theta^2\, m\, G_\theta^2}{c^2}. \quad (30)$$

Lemma A.7. (Consensus inequality for uncompressed communication (Koloskova et al., 2019b)) For a fixed $\eta_\lambda > 0$, the iterates of Algorithm 1 satisfy
$$\mathbb{E}\left[\sum_{i=1}^m \|\lambda_i^t - \bar\lambda^t\|^2\right] \le \frac{4\,\eta_\lambda^2\, m\, G_\lambda^2}{\rho^2}. \quad (31)$$

Lemmas A.6 and A.7 bound the consensus error of the primal and dual iterates of Algorithm 1, respectively; both follow from Lemma A.2 of (Koloskova et al., 2019b).

A.2 Proof of Theorem 4.1: Convex case

Recall the definition of the primal function
$$\Phi(\cdot) = \max_{\lambda\in\Delta_{m-1}} g(\cdot, \lambda); \quad (32)$$
under Assumptions 3.3 and 3.4, and if the local objective functions $\{f_i(\theta)\}_{i=1}^m$ are convex, Theorem 4.1 guarantees that the output solution $(\theta^o, \lambda^o)$ satisfies
$$\mathbb{E}\left[\Phi(\theta^o)\right] - \min_{\theta\in\Theta}\Phi(\theta) \le \frac{4}{T}\left(\frac{LG_\lambda^2}{\rho^2} + \frac{3LG_\theta^2}{c^2}\right) + \frac{D_\theta + D_\lambda + G_\theta^2 + G_\lambda^2}{2\sqrt{T}} + \frac{2L}{\sqrt{T}}\left(\frac{D_\theta G_\lambda}{\rho} + \frac{\sqrt{3}\,D_\lambda G_\theta}{c}\right) \quad (33)$$
where $D_\theta$ and $D_\lambda$ are the maximum expected distances of the averaged iterates from the comparators, defined in Lemmas A.8 and A.9 below.

The proof starts from the following decomposition of the sub-optimality gap:
$$\mathbb{E}\left[\max_\lambda g(\theta^o, \lambda) - \min_\theta\max_\lambda g(\theta, \lambda)\right] \le \mathbb{E}\left[\max_\lambda g(\theta^o, \lambda) - \max_\lambda\min_\theta g(\theta, \lambda)\right] \quad (34)$$
$$\le \mathbb{E}\left[\max_\lambda g(\theta^o, \lambda) - \min_\theta g(\theta, \lambda^o)\right] \quad (35)$$
$$\le \mathbb{E}\left[\max_{\lambda,\theta}\left(g(\theta^o, \lambda) - g(\theta, \lambda^o)\right)\right] \quad (36)$$
$$\le \max_{\lambda,\theta}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \lambda) - g(\theta, \bar\lambda^t)\right] \quad (37)$$
$$= \max_{\lambda}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \lambda) - g(\bar\theta^t, \bar\lambda^t)\right] + \max_{\theta}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \bar\lambda^t) - g(\theta, \bar\lambda^t)\right] \quad (38)$$
where (37) follows from the convexity-concavity of $g$ and Jensen's inequality applied to the averaged outputs $\theta^o = \frac{1}{T}\sum_{t=0}^{T-1}\bar\theta^t$ and $\lambda^o = \frac{1}{T}\sum_{t=0}^{T-1}\bar\lambda^t$. Thanks to Lemmas A.8 and A.9, proved below, the two summands can be bounded to obtain
$$\mathbb{E}\left[\Phi(\theta^o)\right] - \min_{\theta\in\Theta}\Phi(\theta) \le \frac{D_\theta + D_\lambda}{2\eta_\theta T} + \frac{\eta_\theta}{2}\left(G_\theta^2 + G_\lambda^2\right) + \frac{12\eta_\theta^2 LG_\theta^2}{c^2} + \frac{4\eta_\lambda^2 LG_\lambda^2}{\rho^2} + \frac{2\eta_\lambda D_\theta LG_\lambda}{\rho} + \frac{2\sqrt{3}\,\eta_\theta D_\lambda LG_\theta}{c}. \quad (39)$$
Setting $\eta_\lambda = \eta_\theta = \frac{1}{\sqrt{T}}$, the final result is obtained.

Lemma A.8. For $T > 0$ and any $\theta$, the sequence $\{\bar\theta^t, \bar\lambda^t\}_{t=0}^T$ generated by Algorithm 1 satisfies
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \bar\lambda^t) - g(\theta, \bar\lambda^t)\right] \le \frac{D_\theta}{2\eta_\theta T} + \frac{\eta_\theta}{2}G_\theta^2 + \frac{12\eta_\theta^2 LG_\theta^2}{c^2} + \frac{2\eta_\lambda D_\theta LG_\lambda}{\rho} \quad (40)$$
where $D_\theta = \max_{t=0,\dots,T}\mathbb{E}\,\|\bar\theta^t - \theta\|$.

Proof: From the update rule of the primal variable, the fact that the gossip step preserves the network average, and Assumption 3.4 on the stochastic gradient, we have for any $\theta$:
$$\mathbb{E}_{\xi^t}\|\bar\theta^{t+1} - \theta\|^2 = \mathbb{E}_{\xi^t}\Big\|\bar\theta^t - \frac{\eta_\theta}{m}\sum_{i=1}^m \nabla_\theta g_i(\theta_i^t, \lambda_i^t) - \theta\Big\|^2 \quad (41)$$
$$= \|\bar\theta^t - \theta\|^2 - \frac{2\eta_\theta}{m}\sum_{i=1}^m \left\langle\bar\theta^t - \theta;\ \mathbb{E}_{\xi^t}\nabla_\theta g_i(\theta_i^t, \lambda_i^t)\right\rangle + \eta_\theta^2\,\mathbb{E}_{\xi^t}\Big\|\frac{1}{m}\sum_{i=1}^m \nabla_\theta g_i(\theta_i^t, \lambda_i^t)\Big\|^2 \quad (42)$$
$$\le \|\bar\theta^t - \theta\|^2 \underbrace{- \frac{2\eta_\theta}{m}\sum_{i=1}^m \left\langle\bar\theta^t - \theta;\ \nabla_\theta g_i(\theta_i^t, \lambda_i^t)\right\rangle}_{:=T_2} + \eta_\theta^2 G_\theta^2. \quad (43)$$
Denoting $D_\theta^t = \|\bar\theta^t - \theta\|$ and letting $\Xi_\theta^t = \sum_{i=1}^m\|\theta_i^t - \bar\theta^t\|^2$ and $\Xi_\lambda^t = \sum_{i=1}^m\|\lambda_i^t - \bar\lambda^t\|^2$ denote the consensus errors, for $T_2$ the following holds:
$$T_2 = -\frac{2\eta_\theta}{m}\sum_{i=1}^m \left\langle\bar\theta^t - \theta;\ \nabla_\theta g_i(\theta_i^t, \bar\lambda^t)\right\rangle - \frac{2\eta_\theta}{m}\sum_{i=1}^m \left\langle\bar\theta^t - \theta;\ \nabla_\theta g_i(\theta_i^t, \lambda_i^t) - \nabla_\theta g_i(\theta_i^t, \bar\lambda^t)\right\rangle \quad (44)$$
$$\le -\frac{2\eta_\theta}{m}\sum_{i=1}^m \left\langle\bar\theta^t - \theta;\ \nabla_\theta g_i(\theta_i^t, \bar\lambda^t)\right\rangle + 2\eta_\theta L D_\theta^t\sqrt{\frac{\Xi_\lambda^t}{m}} \quad (45)$$
$$= -\frac{2\eta_\theta}{m}\sum_{i=1}^m \left(\left\langle\bar\theta^t - \theta_i^t;\ \nabla_\theta g_i(\theta_i^t, \bar\lambda^t)\right\rangle + \left\langle\theta_i^t - \theta;\ \nabla_\theta g_i(\theta_i^t, \bar\lambda^t)\right\rangle\right) + 2\eta_\theta L D_\theta^t\sqrt{\frac{\Xi_\lambda^t}{m}} \quad (46)$$
$$\le -\frac{2\eta_\theta}{m}\sum_{i=1}^m \left(g_i(\bar\theta^t, \bar\lambda^t) - g_i(\theta, \bar\lambda^t) - L\|\bar\theta^t - \theta_i^t\|^2\right) + 2\eta_\theta L D_\theta^t\sqrt{\frac{\Xi_\lambda^t}{m}} \quad (47)$$
$$\le -\frac{2\eta_\theta}{m}\sum_{i=1}^m \left(g_i(\bar\theta^t, \bar\lambda^t) - g_i(\theta, \bar\lambda^t)\right) + \frac{2\eta_\theta L}{m}\Xi_\theta^t + 2\eta_\theta L D_\theta^t\sqrt{\frac{\Xi_\lambda^t}{m}}. \quad (48)$$
Plugging this back in (43), rearranging the terms, and taking the expectation over the previous iterates we get
$$\mathbb{E}\left[g(\bar\theta^t, \bar\lambda^t) - g(\theta, \bar\lambda^t)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}\left[g_i(\bar\theta^t, \bar\lambda^t) - g_i(\theta, \bar\lambda^t)\right] \quad (49)$$
$$\le \frac{\mathbb{E}\|\bar\theta^t - \theta\|^2 - \mathbb{E}\|\bar\theta^{t+1} - \theta\|^2}{2\eta_\theta} + \frac{\eta_\theta}{2}G_\theta^2 + \frac{L}{m}\mathbb{E}\left[\Xi_\theta^t\right] + L\,\mathbb{E}\left[D_\theta^t\right]\sqrt{\frac{\mathbb{E}[\Xi_\lambda^t]}{m}}. \quad (50)$$
Telescoping from $t = 0$ to $t = T-1$ and plugging in the consensus inequalities (30) and (31), we get
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \bar\lambda^t) - g(\theta, \bar\lambda^t)\right] \le \frac{D_\theta}{2\eta_\theta T} + \frac{\eta_\theta}{2}G_\theta^2 + \frac{12\eta_\theta^2 LG_\theta^2}{c^2} + \frac{2\eta_\lambda D_\theta LG_\lambda}{\rho} \quad (51)$$
where $D_\theta = \max_{t=0,\dots,T}\mathbb{E}[D_\theta^t] = \max_{t=0,\dots,T}\mathbb{E}\,\|\bar\theta^t - \theta\|$.
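For concreteness, the matrix-form local update (26)-(27) can be sketched in a few lines of NumPy. This is an illustrative sketch and not the authors' implementation: `grad_theta` and `grad_lambda` stand for the stacked local stochastic gradients in (24)-(25), and `project_simplex_columns` realizes the column-wise projection $P_\Lambda$ via the standard sorting-based Euclidean projection onto the probability simplex.

```python
import numpy as np

def project_simplex_columns(L):
    """Project each column of L onto the probability simplex
    (sorting-based algorithm); realizes the column-wise P_Lambda in (27)."""
    m, n = L.shape
    out = np.empty_like(L)
    for j in range(n):
        v = L[:, j]
        u = np.sort(v)[::-1]                      # sort descending
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
        tau = (1.0 - css[rho]) / (rho + 1.0)      # shift that enforces sum = 1
        out[:, j] = np.maximum(v + tau, 0.0)
    return out

def local_update(Theta, Lam, grad_theta, grad_lambda, eta_theta, eta_lambda):
    """One AD-GDA half-step: gradient descent in the primal block (26)
    and projected gradient ascent in the dual block (27).
    The (compressed) gossip exchange happens after this step."""
    Theta_half = Theta - eta_theta * grad_theta                          # (26)
    Lam_half = project_simplex_columns(Lam + eta_lambda * grad_lambda)   # (27)
    return Theta_half, Lam_half
```

The descent/ascent steps use two different step sizes, matching the two-timescale analysis below.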
Lemma A.9. For $T > 0$ and any $\lambda$, the sequence $\{\bar\theta^t, \bar\lambda^t\}_{t=0}^T$ generated by Algorithm 1 satisfies
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \lambda) - g(\bar\theta^t, \bar\lambda^t)\right] \le \frac{D_\lambda}{2\eta_\lambda T} + \frac{\eta_\lambda}{2}G_\lambda^2 + \frac{4\eta_\lambda^2 LG_\lambda^2}{\rho^2} + \frac{2\sqrt{3}\,\eta_\theta D_\lambda LG_\theta}{c} \quad (52)$$
where $D_\lambda = \max_{t=0,\dots,T}\mathbb{E}\,\|\bar\lambda^t - \lambda\|$.

Proof: The proof proceeds as in Lemma A.8. From the update rule of the dual variable and the non-expansiveness of the projection $P_\Lambda$, we have
$$\mathbb{E}_{\xi^t}\|\bar\lambda^{t+1} - \lambda\|^2 \le \mathbb{E}_{\xi^t}\Big\|\bar\lambda^t + \frac{\eta_\lambda}{m}\sum_{i=1}^m \nabla_\lambda g_i(\theta_i^t, \lambda_i^t) - \lambda\Big\|^2 \quad (53)$$
$$= \|\bar\lambda^t - \lambda\|^2 - \frac{2\eta_\lambda}{m}\sum_{i=1}^m \left\langle\lambda - \bar\lambda^t;\ \mathbb{E}_{\xi^t}\nabla_\lambda g_i(\theta_i^t, \lambda_i^t)\right\rangle + \eta_\lambda^2\,\mathbb{E}_{\xi^t}\Big\|\frac{1}{m}\sum_{i=1}^m \nabla_\lambda g_i(\theta_i^t, \lambda_i^t)\Big\|^2 \quad (54)$$
$$\le \|\bar\lambda^t - \lambda\|^2 \underbrace{- \frac{2\eta_\lambda}{m}\sum_{i=1}^m \left\langle\lambda - \bar\lambda^t;\ \nabla_\lambda g_i(\theta_i^t, \lambda_i^t)\right\rangle}_{:=T_3} + \eta_\lambda^2 G_\lambda^2. \quad (55)$$
Denoting $D_\lambda^t = \|\bar\lambda^t - \lambda\|$, for $T_3$ the following holds:
$$T_3 = -\frac{2\eta_\lambda}{m}\sum_{i=1}^m \left\langle\lambda - \bar\lambda^t;\ \nabla_\lambda g_i(\bar\theta^t, \lambda_i^t)\right\rangle - \frac{2\eta_\lambda}{m}\sum_{i=1}^m \left\langle\lambda - \bar\lambda^t;\ \nabla_\lambda g_i(\theta_i^t, \lambda_i^t) - \nabla_\lambda g_i(\bar\theta^t, \lambda_i^t)\right\rangle \quad (56)$$
$$\le -\frac{2\eta_\lambda}{m}\sum_{i=1}^m \left\langle\lambda - \bar\lambda^t;\ \nabla_\lambda g_i(\bar\theta^t, \lambda_i^t)\right\rangle + 2\eta_\lambda L D_\lambda^t\sqrt{\frac{\Xi_\theta^t}{m}} \quad (57)$$
$$= -\frac{2\eta_\lambda}{m}\sum_{i=1}^m \left(\left\langle\lambda - \lambda_i^t;\ \nabla_\lambda g_i(\bar\theta^t, \lambda_i^t)\right\rangle + \left\langle\lambda_i^t - \bar\lambda^t;\ \nabla_\lambda g_i(\bar\theta^t, \lambda_i^t)\right\rangle\right) + 2\eta_\lambda L D_\lambda^t\sqrt{\frac{\Xi_\theta^t}{m}} \quad (58)$$
$$\le -\frac{2\eta_\lambda}{m}\sum_{i=1}^m \left(g_i(\bar\theta^t, \lambda) - g_i(\bar\theta^t, \bar\lambda^t) - L\|\bar\lambda^t - \lambda_i^t\|^2\right) + 2\eta_\lambda L D_\lambda^t\sqrt{\frac{\Xi_\theta^t}{m}} \quad (59)$$
$$= -\frac{2\eta_\lambda}{m}\sum_{i=1}^m \left(g_i(\bar\theta^t, \lambda) - g_i(\bar\theta^t, \bar\lambda^t)\right) + \frac{2\eta_\lambda L}{m}\Xi_\lambda^t + 2\eta_\lambda L D_\lambda^t\sqrt{\frac{\Xi_\theta^t}{m}}. \quad (60)$$
Plugging this back in (55), rearranging the terms, and taking the expectation over the previous iterates we get
$$\mathbb{E}\left[g(\bar\theta^t, \lambda) - g(\bar\theta^t, \bar\lambda^t)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}\left[g_i(\bar\theta^t, \lambda) - g_i(\bar\theta^t, \bar\lambda^t)\right] \quad (61)$$
$$\le \frac{\mathbb{E}\|\bar\lambda^t - \lambda\|^2 - \mathbb{E}\|\bar\lambda^{t+1} - \lambda\|^2}{2\eta_\lambda} + \frac{\eta_\lambda}{2}G_\lambda^2 + \frac{L}{m}\mathbb{E}\left[\Xi_\lambda^t\right] + L\,\mathbb{E}\left[D_\lambda^t\right]\sqrt{\frac{\mathbb{E}[\Xi_\theta^t]}{m}}. \quad (62)$$
Telescoping from $t = 0$ to $t = T-1$ and plugging in the consensus inequalities (30) and (31) we get
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[g(\bar\theta^t, \lambda) - g(\bar\theta^t, \bar\lambda^t)\right] \le \frac{D_\lambda}{2\eta_\lambda T} + \frac{\eta_\lambda}{2}G_\lambda^2 + \frac{4\eta_\lambda^2 LG_\lambda^2}{\rho^2} + \frac{2\sqrt{3}\,\eta_\theta D_\lambda LG_\theta}{c} \quad (63)$$
where $D_\lambda = \max_{t=0,\dots,T}\mathbb{E}[D_\lambda^t] = \max_{t=0,\dots,T}\mathbb{E}\,\|\bar\lambda^t - \lambda\|$.
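Before turning to the non-convex case, it may help to make the communication primitive behind the contraction (28) concrete. The sketch below follows the CHOCO-GOSSIP scheme of (Koloskova et al., 2019b) as summarized above; the variable names, the doubly-stochastic mixing matrix `W`, the consensus step size `gamma`, and the top-k compressor are our illustrative choices, not the paper's code. Each node keeps a public copy $\hat\theta_i$, communicates only a compressed innovation, and mixes the public copies; note that the network average of the columns of `Theta` is preserved because the rows of `W` sum to one.

```python
import numpy as np

def choco_gossip_round(Theta, Theta_hat, W, gamma, compress):
    """One CHOCO-GOSSIP round on the columns of Theta (d x m).

    Theta     : current (half-step) local models, one column per node
    Theta_hat : publicly known compressed copies, same shape
    W         : m x m symmetric, doubly-stochastic mixing matrix
    compress  : contractive compression operator, e.g. top-k sparsification
    """
    # Each node compresses and broadcasts the innovation theta_i - hat_theta_i.
    Q = compress(Theta - Theta_hat)
    Theta_hat = Theta_hat + Q
    # Consensus step on the public copies: Theta_hat @ (W - I) aggregates
    # neighbor differences; gamma damps the step against compression noise.
    Theta = Theta + gamma * Theta_hat @ (W - np.eye(W.shape[0]))
    return Theta, Theta_hat

def top_k(X, k=10):
    """Keep the k largest-magnitude entries of each column (a contractive compressor)."""
    out = np.zeros_like(X)
    idx = np.argsort(-np.abs(X), axis=0)[:k]
    np.put_along_axis(out, idx, np.take_along_axis(X, idx, axis=0), axis=0)
    return out
```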
A.3 Proof of Theorem 4.3: Non-convex case

In the case of non-convex functions $\{f_i\}_{i=1}^m$, Theorem 4.3 provides the following $\epsilon$-stationarity guarantee on the randomized solution of Algorithm 1:
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] \le \frac{512L(\kappa+1)^2}{T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)] + \frac{45L\kappa^2\delta_\lambda^0}{2}\right) + \frac{5LD_\lambda G_\theta}{c} + \frac{\sigma_\theta^2}{2m} + \frac{45\kappa\sigma_\lambda^2}{4m} + \frac{G_\theta^2}{4c^2} + \frac{171\kappa G_\lambda^2}{\rho^2} + \frac{\sigma_\theta^2}{m}. \quad (64)$$

The proof is inspired by recent results in (Lin et al., 2020a). Specifically, Lemma A.10, stated and proved below, provides a descent inequality of the type
$$\mathbb{E}[\Phi(\bar\theta^t)] \le \mathbb{E}[\Phi(\bar\theta^{t-1})] + \frac{\eta_\theta^2\kappa L\sigma_\theta^2}{m} - \eta_\theta\left(\frac12 - 2\eta_\theta\kappa L\right)\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + \eta_\theta\left(\frac12 + 2\eta_\theta\kappa L\right)L^2\left(\frac{\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{2\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m} + 2\,\mathbb{E}[\delta_\lambda^{t-1}]\right). \quad (65)$$
Setting $\eta_\theta = \frac{\eta_\lambda}{16(\kappa+1)^2}$ and $\eta_\lambda \le \frac{1}{2L}$, expression (65) can be simplified thanks to the following chain of inequalities:
$$\eta_\theta\left(\frac12 - 2\eta_\theta\kappa L\right) \ge \frac{7\eta_\theta}{16}, \qquad \eta_\theta\left(\frac12 + 2\eta_\theta\kappa L\right) \le \frac{9\eta_\theta}{16}. \quad (66)$$
Telescoping the simplified expression from $t = 1$ to $T$ we obtain
$$\mathbb{E}[\Phi(\bar\theta^T)] \le \mathbb{E}[\Phi(\bar\theta^0)] + \frac{T\eta_\theta^2\kappa L\sigma_\theta^2}{m} - \frac{7\eta_\theta}{16}\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + \frac{9\eta_\theta L^2}{16}\sum_{t=1}^T\left(\frac{\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{2\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m} + 2\,\mathbb{E}[\delta_\lambda^{t-1}]\right) \quad (67)$$
where $\delta_\lambda^t := \|\lambda^*(\bar\theta^t) - \bar\lambda^t\|^2$ represents the squared distance between the optimal value of the dual variable for the current averaged network belief and the current averaged value of the dual variable.

Lemma A.11, reported below, provides a bound on $\sum_{t=1}^T \delta_\lambda^t$ that, plugged in (67), yields
$$\mathbb{E}[\Phi(\bar\theta^T)] \le \mathbb{E}[\Phi(\bar\theta^0)] + \eta_\theta\frac{45L\kappa^2\delta_\lambda^0}{8\eta_\lambda} - \eta_\theta\left(\frac{7}{16} - \frac{45\kappa^4\eta_\theta^2}{\eta_\lambda^2}\right)\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + T\eta_\theta\left(\frac{\eta_\theta\kappa L\sigma_\theta^2}{m} + \frac{45\kappa L\eta_\lambda\sigma_\lambda^2}{4m} + \frac{45\sigma_\theta^2}{2\cdot 162\,m}\right) + \eta_\theta L^2\sum_{t=1}^T\left(\frac{\mathbb{E}[\Xi_\theta^{t}]}{m} + \frac{2\,\mathbb{E}[\Xi_\lambda^{t}]}{m} + \frac{30\kappa\,\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{70\kappa\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m} + \frac{45\kappa}{2}\,\mathbb{E}\Big[D_\lambda^{t-1}\sqrt{\tfrac{1}{m}\Xi_\theta^{t-1}}\Big]\right). \quad (68)$$
Moreover, the relation between the two step sizes established above ensures that
$$\frac{7}{16} - \frac{45\kappa^4\eta_\theta^2}{\eta_\lambda^2} \ge \frac14 \quad (69)$$
and therefore, rearranging terms, dividing by $\frac{T\eta_\theta}{4}$, and recalling that $\kappa \ge 1$,
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] \le \frac{4}{\eta_\theta T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)]\right) + \frac{45L\kappa^2\delta_\lambda^0}{2T\eta_\lambda} + 4\left(\frac{\eta_\theta\kappa L\sigma_\theta^2}{m} + \frac{45\kappa L\eta_\lambda\sigma_\lambda^2}{4m} + \frac{45\sigma_\theta^2}{2\cdot 162\,m}\right) + \frac{4L^2}{T}\sum_{t=1}^T\left(\frac{31\kappa\,\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{72\kappa\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m} + \frac{45\kappa}{2}\,\mathbb{E}\Big[D_\lambda^{t-1}\sqrt{\tfrac{1}{m}\Xi_\theta^{t-1}}\Big]\right). \quad (70)$$
Exploiting the consensus inequalities (30) and (31), the fact that $\kappa \ge 1$, and $\eta_\theta = \frac{\eta_\lambda}{16(\kappa+1)^2} \le \frac{1}{2L}$, we can simplify and obtain
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] \le \frac{64(\kappa+1)^2}{\eta_\lambda T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)]\right) + \frac{45L\kappa^2\delta_\lambda^0}{2T\eta_\lambda} + \frac{2\eta_\lambda L\sigma_\theta^2}{m} + \frac{45\eta_\lambda\kappa L\sigma_\lambda^2}{2m} + \frac{45\sigma_\theta^2}{162\,m} + \frac{180\sqrt{3}\,\kappa L^2\eta_\theta G_\theta}{c}\frac{1}{T}\sum_{t=1}^T\mathbb{E}[D_\lambda^{t-1}] + \frac{372\eta_\theta^2\kappa L^2 G_\theta^2}{c^2} + \frac{288\eta_\lambda^2\kappa L^2 G_\lambda^2}{\rho^2}. \quad (71)$$
Simplifying and defining $D_\lambda = \max_{t=0,\dots,T} D_\lambda^t$,
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] \le \frac{64(\kappa+1)^2}{\eta_\lambda T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)]\right) + \frac{45L\kappa^2\delta_\lambda^0}{2T\eta_\lambda} + \frac{2\eta_\lambda L\sigma_\theta^2}{m} + \frac{45\eta_\lambda\kappa L\sigma_\lambda^2}{2m} + \frac{45\sigma_\theta^2}{162\,m} + L^2\left(\frac{10 D_\lambda\eta_\lambda G_\theta}{c} + \frac{\eta_\lambda^2 G_\theta^2}{c^2}\right) + \frac{684\eta_\lambda^2\kappa L^2 G_\lambda^2}{\rho^2} \quad (72)$$
$$\le \frac{256(\kappa+1)^2}{\eta_\lambda T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)] + \frac{45L\kappa^2\delta_\lambda^0}{2}\right) + \frac{10L^2 D_\lambda\eta_\lambda G_\theta}{c} + \eta_\lambda\left(\frac{L\sigma_\theta^2}{m} + \frac{45\kappa L\sigma_\lambda^2}{2m}\right) + \eta_\lambda^2\left(\frac{L^2 G_\theta^2}{c^2} + \frac{684L^2\kappa G_\lambda^2}{\rho^2}\right) + \frac{\sigma_\theta^2}{m}. \quad (73)$$
Setting $\eta_\lambda = \frac{1}{2L}$,
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] \le \frac{512L(\kappa+1)^2}{T}\left(\mathbb{E}[\Phi(\bar\theta^0)] - \mathbb{E}[\Phi(\bar\theta^T)] + \frac{45L\kappa^2\delta_\lambda^0}{2}\right) + \frac{5LD_\lambda G_\theta}{c} + \frac{\sigma_\theta^2}{2m} + \frac{45\kappa\sigma_\lambda^2}{4m} + \frac{G_\theta^2}{4c^2} + \frac{171\kappa G_\lambda^2}{\rho^2} + \frac{\sigma_\theta^2}{m}. \quad (74)$$

Lemma A.10. For each $t = 1, \dots, T$ the iterates generated by Algorithm 1 satisfy
$$\mathbb{E}[\Phi(\bar\theta^t)] \le \mathbb{E}[\Phi(\bar\theta^{t-1})] + \frac{\eta_\theta^2\kappa L\sigma_\theta^2}{m} - \eta_\theta\left(\frac12 - 2\eta_\theta\kappa L\right)\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + \eta_\theta\left(\frac12 + 2\eta_\theta\kappa L\right)L^2\left(\frac{\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{2\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m} + 2\,\mathbb{E}[\delta_\lambda^{t-1}]\right). \quad (75)$$

Proof: From the $2\kappa L$-smoothness of $\Phi(\cdot)$ (Lemma 4.3 of Lin et al. (2020a)) and the update rule we have:
$$\mathbb{E}_{\xi^{t-1}}[\Phi(\bar\theta^t)] \le \Phi(\bar\theta^{t-1}) + \mathbb{E}_{\xi^{t-1}}\left\langle\nabla_\theta\Phi(\bar\theta^{t-1}),\ \bar\theta^t - \bar\theta^{t-1}\right\rangle + \kappa L\,\mathbb{E}_{\xi^{t-1}}\|\bar\theta^t - \bar\theta^{t-1}\|^2 \quad (76)$$
$$= \Phi(\bar\theta^{t-1}) - \eta_\theta\left\langle\nabla_\theta\Phi(\bar\theta^{t-1}),\ \frac{1}{m}\sum_{i=1}^m \mathbb{E}_{\xi^{t-1}}\nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right\rangle + \frac{\eta_\theta^2\kappa L}{m^2}\,\mathbb{E}_{\xi^{t-1}}\Big\|\sum_{i=1}^m \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2 \quad (77)$$
$$= \Phi(\bar\theta^{t-1}) + \underbrace{\eta_\theta\left\langle\nabla\Phi(\bar\theta^{t-1}),\ \nabla\Phi(\bar\theta^{t-1}) - \frac{1}{m}\sum_{i=1}^m \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right\rangle}_{:=T_4} - \eta_\theta\|\nabla\Phi(\bar\theta^{t-1})\|^2 + \frac{\eta_\theta^2\kappa L}{m^2}\underbrace{\mathbb{E}_{\xi^{t-1}}\Big\|\sum_{i=1}^m \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2}_{:=T_5}. \quad (78)$$
We now turn to bounding the term $T_4$:
$$T_4 = \eta_\theta\left\langle\nabla\Phi(\bar\theta^{t-1}),\ \nabla\Phi(\bar\theta^{t-1}) - \frac{1}{m}\sum_{i=1}^m \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right\rangle \quad (79)$$
$$\le \frac{\eta_\theta}{2}\|\nabla\Phi(\bar\theta^{t-1})\|^2 + \frac{\eta_\theta}{2}\Big\|\frac{1}{m}\sum_{i=1}^m \left(\nabla_\theta g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1})) - \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right)\Big\|^2 \quad (80\text{-}81)$$
$$\le \frac{\eta_\theta}{2}\left(\|\nabla\Phi(\bar\theta^{t-1})\|^2 + \frac{L^2}{m}\sum_{i=1}^m\left(\|\bar\theta^{t-1} - \theta_i^{t-1}\|^2 + \|\lambda^*(\bar\theta^{t-1}) - \lambda_i^{t-1}\|^2\right)\right) \quad (82)$$
$$\le \frac{\eta_\theta}{2}\Big(\|\nabla\Phi(\bar\theta^{t-1})\|^2 + \frac{L^2\Xi_\theta^{t-1}}{m} + \frac{2L^2\Xi_\lambda^{t-1}}{m} + 2L^2\underbrace{\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|^2}_{=\delta_\lambda^{t-1}}\Big) \quad (83)$$
and from the stochastic gradient Assumption 3.4 we can bound $T_5$ as follows:
$$T_5 = \mathbb{E}_{\xi^{t-1}}\Big\|\sum_{i=1}^m\left(\nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1}) - \mathbb{E}_{\xi^{t-1}}\nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right) + \sum_{i=1}^m \mathbb{E}_{\xi^{t-1}}\nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2 \quad (84\text{-}86)$$
$$\overset{(20)}{\le} m\sigma_\theta^2 + 2\Big\|\sum_{i=1}^m\left(\mathbb{E}_{\xi^{t-1}}\nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1}) - \nabla_\theta g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1}))\right)\Big\|^2 + 2m^2\|\nabla\Phi(\bar\theta^{t-1})\|^2 \quad (87)$$
$$\overset{(12)}{\le} m\sigma_\theta^2 + 2L^2 m\sum_{i=1}^m\left(\|\bar\theta^{t-1} - \theta_i^{t-1}\|^2 + \|\lambda^*(\bar\theta^{t-1}) - \lambda_i^{t-1}\|^2\right) + 2m^2\|\nabla\Phi(\bar\theta^{t-1})\|^2 \quad (88)$$
$$\le m\sigma_\theta^2 + 2L^2 m\,\Xi_\theta^{t-1} + 4L^2 m\,\Xi_\lambda^{t-1} + 4L^2 m\underbrace{\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|^2}_{=\delta_\lambda^{t-1}} + 2m^2\|\nabla\Phi(\bar\theta^{t-1})\|^2. \quad (89)$$
Recombining, grouping, and taking the expectation over the previous iterates we get the desired result.
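For the reader's convenience, the two properties of $\Phi$ borrowed from Lemma 4.3 of (Lin et al., 2020a) and used above and in the next lemma can be restated in our notation (for $g(\theta, \cdot)$ $\mu$-strongly concave, $g$ $L$-smooth, and $\kappa = L/\mu \ge 1$):
$$\lambda^*(\theta) = \arg\max_{\lambda\in\Delta_{m-1}} g(\theta, \lambda), \qquad \|\lambda^*(\theta_1) - \lambda^*(\theta_2)\| \le \kappa\,\|\theta_1 - \theta_2\|,$$
$$\nabla\Phi(\theta) = \nabla_\theta g(\theta, \lambda^*(\theta)), \qquad \|\nabla\Phi(\theta_1) - \nabla\Phi(\theta_2)\| \le (L + \kappa L)\,\|\theta_1 - \theta_2\| \le 2\kappa L\,\|\theta_1 - \theta_2\|.$$
The $\kappa$-Lipschitz continuity of $\lambda^*(\cdot)$ is what drives the bound on the term $T_7$ in the next proof, while the $2\kappa L$-smoothness of $\Phi$ justifies the quadratic upper bound used in (76).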
Lemma A.11. The sequence $\{\delta_\lambda^t\}_{t=1}^T$ generated by Algorithm 1 satisfies
$$\sum_{t=1}^T \mathbb{E}\left[\delta_\lambda^t\right] \le \frac{5\delta_\lambda^0\kappa}{\eta_\lambda\mu} + \frac{5}{\eta_\lambda\mu}\sum_{t=1}^T\left(4\eta_\lambda L\,\mathbb{E}\Big[D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}}\Big] + \frac{3L\eta_\lambda\,\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{7L\eta_\lambda\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m}\right) + \sum_{t=1}^T\frac{5\cdot 8\,\kappa^2\eta_\theta^2}{\eta_\lambda^2\mu^2}\,\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + 5T\left(\frac{2\eta_\lambda\sigma_\lambda^2}{m\mu} + \frac{4\sigma_\theta^2}{162\,m(\kappa+1)^2\mu^2}\right) \quad (90)$$
where $D_\lambda^{t-1} = \|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|$.

Proof: From (20), for $b > 0$, we have
$$\mathbb{E}_{\xi^{t-1}}[\delta_\lambda^t] \le (1 + b)\underbrace{\mathbb{E}_{\xi^{t-1}}\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^t\|^2}_{:=T_6} + \left(1 + \frac1b\right)\underbrace{\mathbb{E}_{\xi^{t-1}}\|\lambda^*(\bar\theta^t) - \lambda^*(\bar\theta^{t-1})\|^2}_{:=T_7}. \quad (91)$$
Bounding $T_6$ similarly to the previous lemmas:
$$T_6 = \mathbb{E}_{\xi^{t-1}}\Big\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1} - \frac{\eta_\lambda}{m}\sum_{i=1}^m\left(\nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1}) - \mathbb{E}_{\xi^{t-1}}\nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right) - \frac{\eta_\lambda}{m}\sum_{i=1}^m \mathbb{E}_{\xi^{t-1}}\nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2 \quad (92\text{-}93)$$
$$\le \Big\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1} - \frac{\eta_\lambda}{m}\sum_{i=1}^m \nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2 + \frac{\eta_\lambda^2\sigma_\lambda^2}{m} \quad (94)$$
$$= \|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|^2 + \underbrace{\Big\|\frac{\eta_\lambda}{m}\sum_{i=1}^m \nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2}_{:=T_{6,1}} \underbrace{- 2\left\langle\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1};\ \frac{\eta_\lambda}{m}\sum_{i=1}^m \nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right\rangle}_{:=T_{6,2}} + \frac{\eta_\lambda^2\sigma_\lambda^2}{m}. \quad (95)$$
Estimating $T_{6,1}$:
$$T_{6,1} \le 2\eta_\lambda^2\Big\|\frac{1}{m}\sum_{i=1}^m\left(\nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1}) - \nabla_\lambda g_i(\bar\theta^{t-1}, \bar\lambda^{t-1})\right)\Big\|^2 + 2\eta_\lambda^2\Big\|\frac{1}{m}\sum_{i=1}^m\left(\nabla_\lambda g_i(\bar\theta^{t-1}, \bar\lambda^{t-1}) - \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1}))\right)\Big\|^2 \quad (96\text{-}97)$$
$$\overset{(12),(16)}{\le} \frac{2\eta_\lambda^2}{m}\sum_{i=1}^m\left(L^2\|\lambda_i^{t-1} - \bar\lambda^{t-1}\|^2 + L^2\|\theta_i^{t-1} - \bar\theta^{t-1}\|^2\right) + \frac{4\eta_\lambda^2 L}{m}\sum_{i=1}^m\left(g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1})) - g_i(\bar\theta^{t-1}, \bar\lambda^{t-1})\right) \quad (98)$$
$$= \frac{2\eta_\lambda^2 L^2}{m}\Xi_\lambda^{t-1} + \frac{2\eta_\lambda^2 L^2}{m}\Xi_\theta^{t-1} + \frac{4\eta_\lambda^2 L}{m}\sum_{i=1}^m\left(g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1})) - g_i(\bar\theta^{t-1}, \bar\lambda^{t-1})\right). \quad (99)$$
Estimating $T_{6,2}$:
$$T_{6,2} = -\frac{2\eta_\lambda}{m}\sum_{i=1}^m\left\langle\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1};\ \nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1})\right\rangle \quad (100)$$
$$= -\frac{2\eta_\lambda}{m}\sum_{i=1}^m\left\langle\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1};\ \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda_i^{t-1})\right\rangle - \frac{2\eta_\lambda}{m}\sum_{i=1}^m\left\langle\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1};\ \nabla_\lambda g_i(\theta_i^{t-1}, \lambda_i^{t-1}) - \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda_i^{t-1})\right\rangle \quad (101\text{-}102)$$
$$\le -\frac{2\eta_\lambda}{m}\sum_{i=1}^m\left\langle\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1};\ \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda_i^{t-1})\right\rangle + 2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} \quad (103)$$
$$= -\frac{2\eta_\lambda}{m}\sum_{i=1}^m\left(\left\langle\lambda^*(\bar\theta^{t-1}) - \lambda_i^{t-1};\ \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda_i^{t-1})\right\rangle + \left\langle\lambda_i^{t-1} - \bar\lambda^{t-1};\ \nabla_\lambda g_i(\bar\theta^{t-1}, \lambda_i^{t-1})\right\rangle\right) + 2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} \quad (104)$$
$$\overset{(15),(18)}{\le} -\frac{2\eta_\lambda}{m}\sum_{i=1}^m\left(g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1})) - g_i(\bar\theta^{t-1}, \bar\lambda^{t-1}) + \frac{\mu}{2}\|\lambda^*(\bar\theta^{t-1}) - \lambda_i^{t-1}\|^2 - \frac{L + \mu}{2}\|\bar\lambda^{t-1} - \lambda_i^{t-1}\|^2\right) + 2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} \quad (105\text{-}106)$$
$$\le -\eta_\lambda\mu\,\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|^2 - \frac{2\eta_\lambda}{m}\sum_{i=1}^m\left(g_i(\bar\theta^{t-1}, \lambda^*(\bar\theta^{t-1})) - g_i(\bar\theta^{t-1}, \bar\lambda^{t-1})\right) + \frac{3L\eta_\lambda}{m}\Xi_\lambda^{t-1} + 2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} \quad (107)$$
where the last inequality follows from choosing $\eta_\lambda \le \frac{1}{2L}$. Substituting the expressions for $T_{6,1}$ and $T_{6,2}$ we get
$$T_6 \le (1 - \eta_\lambda\mu)\,\|\lambda^*(\bar\theta^{t-1}) - \bar\lambda^{t-1}\|^2 + \frac{\eta_\lambda^2\sigma_\lambda^2}{m} + \frac{L\eta_\lambda}{m}\Xi_\theta^{t-1} + \frac{3L\eta_\lambda}{m}\Xi_\lambda^{t-1} + 2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}}. \quad (108)$$
Since $\lambda^*(\cdot)$ is $\kappa$-Lipschitz (Lemma 4.3 of (Lin et al., 2020a)) we can bound $T_7$ as follows:
$$T_7 = \mathbb{E}_{\xi^{t-1}}\|\lambda^*(\bar\theta^t) - \lambda^*(\bar\theta^{t-1})\|^2 \le \kappa^2\,\mathbb{E}_{\xi^{t-1}}\|\bar\theta^t - \bar\theta^{t-1}\|^2 = \frac{\kappa^2\eta_\theta^2}{m^2}\,\mathbb{E}_{\xi^{t-1}}\Big\|\sum_{i=1}^m \nabla_\theta g_i(\theta_i^{t-1}, \lambda_i^{t-1})\Big\|^2 \quad (109\text{-}111)$$
$$\overset{(20)}{\le} \frac{\kappa^2\eta_\theta^2}{m^2}\left(m\sigma_\theta^2 + 2m^2\|\nabla\Phi(\bar\theta^{t-1})\|^2 + 2L^2 m\,\Xi_\theta^{t-1} + 2L^2 m\sum_{i=1}^m\|\lambda^*(\bar\theta^{t-1}) - \lambda_i^{t-1}\|^2\right) \quad (112\text{-}113)$$
$$\le \kappa^2\eta_\theta^2\left(\frac{\sigma_\theta^2}{m} + 2\|\nabla\Phi(\bar\theta^{t-1})\|^2 + \frac{2L^2\Xi_\theta^{t-1}}{m} + \frac{4L^2\Xi_\lambda^{t-1}}{m} + 4L^2\delta_\lambda^{t-1}\right). \quad (114\text{-}116)$$
Recombining and grouping we get
$$\mathbb{E}_{\xi^{t-1}}[\delta_\lambda^t] \le \left((1 + b)(1 - \eta_\lambda\mu) + 4\left(1 + \frac1b\right)\kappa^2\eta_\theta^2 L^2\right)\delta_\lambda^{t-1} + (1 + b)\left(2\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} + \frac{\eta_\lambda^2\sigma_\lambda^2}{m}\right) + \left((1 + b)L\eta_\lambda + 2\left(1 + \frac1b\right)\kappa^2\eta_\theta^2 L^2\right)\frac{\Xi_\theta^{t-1}}{m} + \left((1 + b)3L\eta_\lambda + 4\left(1 + \frac1b\right)\kappa^2\eta_\theta^2 L^2\right)\frac{\Xi_\lambda^{t-1}}{m} + \left(1 + \frac1b\right)\kappa^2\eta_\theta^2\left(\frac{\sigma_\theta^2}{m} + 2\|\nabla\Phi(\bar\theta^{t-1})\|^2\right). \quad (117)$$
Setting $b = \frac{2}{2 - \eta_\lambda\mu} - 1 > 0$ we get the following inequalities
$$(1 + b)(1 - \eta_\lambda\mu) \le 1 - \frac{\eta_\lambda\mu}{2}, \qquad \left(1 + \frac1b\right) \le \frac{4}{\eta_\lambda\mu}, \qquad (1 + b) \le 2 \quad (118\text{-}120)$$
that allow us to simplify (117) as follows:
$$\mathbb{E}_{\xi^{t-1}}[\delta_\lambda^t] \le \left(1 - \frac{\eta_\lambda\mu}{2} + \frac{16\kappa^2\eta_\theta^2 L^2}{\eta_\lambda\mu}\right)\delta_\lambda^{t-1} + 4\eta_\lambda L D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}} + \left(2L\eta_\lambda + \frac{8\kappa^2\eta_\theta^2 L^2}{\eta_\lambda\mu}\right)\frac{\Xi_\theta^{t-1}}{m} + \frac{2\eta_\lambda\sigma_\lambda^2}{m\mu} + \frac{4\kappa^2\eta_\theta^2\sigma_\theta^2}{m\eta_\lambda\mu} + \left(6L\eta_\lambda + \frac{16\kappa^2\eta_\theta^2 L^2}{\eta_\lambda\mu}\right)\frac{\Xi_\lambda^{t-1}}{m} + \frac{8\kappa^2\eta_\theta^2}{\eta_\lambda\mu}\|\nabla\Phi(\bar\theta^{t-1})\|^2. \quad (121\text{-}122)$$
Fixing $\eta_\theta = \frac{\eta_\lambda}{16(\kappa+1)^2}$ we get that
$$\nu := 1 - \frac{\eta_\lambda\mu}{2} + \frac{16\kappa^2\eta_\theta^2 L^2}{\eta_\lambda\mu} \le 1 - \frac{\eta_\lambda\mu}{4}. \quad (123)$$
Taking the expectation over the current iterate and applying the inequality recursively we obtain
$$\mathbb{E}[\delta_\lambda^t] \le \nu^t\delta_\lambda^0 + \sum_{i=0}^{t-1}\nu^{t-1-i}\left(\frac{8\kappa^2\eta_\theta^2}{\eta_\lambda\mu}\,\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{i})\|^2\right] + \frac{2\eta_\lambda\sigma_\lambda^2}{m\mu} + \frac{4\kappa^2\eta_\theta^2\sigma_\theta^2}{m\eta_\lambda\mu} + 4\eta_\lambda L\,\mathbb{E}\Big[D_\lambda^{i}\sqrt{\tfrac1m\Xi_\theta^{i}}\Big] + \frac{3L\eta_\lambda\,\mathbb{E}[\Xi_\theta^{i}]}{m} + \frac{7L\eta_\lambda\,\mathbb{E}[\Xi_\lambda^{i}]}{m}\right). \quad (124\text{-}125)$$
Summing from $t = 1$ to $T$ and using (123), so that $\sum_{t\ge 0}\nu^t \le \frac{5}{\eta_\lambda\mu}$, we get
$$\sum_{t=1}^T \mathbb{E}[\delta_\lambda^t] \le \frac{5\delta_\lambda^0\kappa}{\eta_\lambda\mu} + \frac{5}{\eta_\lambda\mu}\sum_{t=1}^T\left(4\eta_\lambda L\,\mathbb{E}\Big[D_\lambda^{t-1}\sqrt{\tfrac1m\Xi_\theta^{t-1}}\Big] + \frac{3L\eta_\lambda\,\mathbb{E}[\Xi_\theta^{t-1}]}{m} + \frac{7L\eta_\lambda\,\mathbb{E}[\Xi_\lambda^{t-1}]}{m}\right) + \sum_{t=1}^T\frac{5\cdot 8\,\kappa^2\eta_\theta^2}{\eta_\lambda^2\mu^2}\,\mathbb{E}\left[\|\nabla\Phi(\bar\theta^{t-1})\|^2\right] + 5T\left(\frac{2\eta_\lambda\sigma_\lambda^2}{m\mu} + \frac{4\sigma_\theta^2}{162\,m(\kappa+1)^2\mu^2}\right)$$
which is the statement of the lemma, concluding the proof.
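As a final sanity check (not part of the original appendix), the step-size relations (66), (69), and (123) used throughout the non-convex analysis can be verified numerically; the sampled ranges of $\mu$, $\kappa$, and $\eta_\lambda$ below are arbitrary choices within the admissible region:

```python
import numpy as np

# Numerical check of the step-size relations used in the proof:
# eta_theta = eta_lambda / (16 (kappa+1)^2), eta_lambda <= 1/(2L), kappa = L/mu >= 1.
rng = np.random.default_rng(1)
for _ in range(10_000):
    mu = rng.uniform(0.1, 10.0)
    kappa = rng.uniform(1.0, 100.0)
    L = kappa * mu
    eta_l = rng.uniform(0.01, 1.0) / (2 * L)
    eta_t = eta_l / (16 * (kappa + 1) ** 2)

    # (66): the gradient coefficients stay within [7/16, 9/16]
    assert 0.5 - 2 * eta_t * kappa * L >= 7 / 16
    assert 0.5 + 2 * eta_t * kappa * L <= 9 / 16
    # (69): the dual-error correction leaves a margin of at least 1/4
    assert 7 / 16 - 45 * kappa**4 * eta_t**2 / eta_l**2 >= 1 / 4
    # (123): the recursion on delta_lambda contracts at rate at most 1 - eta_l*mu/4
    nu = 1 - eta_l * mu / 2 + 16 * kappa**2 * eta_t**2 * L**2 / (eta_l * mu)
    assert nu <= 1 - eta_l * mu / 4
print("step-size relations verified on 10k random draws")
```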