# topologyaware_generalization_of_decentralized_sgd__f4cbdbcb.pdf

Topology-aware Generalization of Decentralized SGD

Tongtian Zhu 1 2 3 Fengxiang He 3 Lan Zhang 4 5 Zhengyang Niu 6 3 Mingli Song 7 Dacheng Tao 3

This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N+1/m+λ2)-stable in expectation in the non-convex non-smooth setting, where N is the total sample size of the whole system, m is the worker number, and 1 λ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an O(1/N+((m 1λ2)

2 +m α)/N 1 α

2 ) in-average generalization bound, which is nonvacuous even when λ is closed to 1, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of DSGD has a positive correlation with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and Res Net-18 on CIFAR-10, CIFAR-100 and Tiny-Image Net justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/ Generalization-of-DSGD.

1. Introduction

Decentralized stochastic gradient descent (D-SGD) facilitates simultaneous model training on a massive number of workers without a central server (Lopes & Sayed, 2008;

1College of Computer Science and Technology, Zhejiang University 2Shanghai Institute for Advanced Study of Zhejiang University 3JD Explore Academy, JD.com Inc. 4School of Computer Science and Technology, University of Science and Technology of China 5Institute of Artificial Intelligence, Hefei Comprehensive National Science Center 6School of Computer Science, Wuhan University 7Zhejiang University City College. Correspondence to: Fengxiang He <fengxiang.f.he@gmail.com>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Nedic & Ozdaglar, 2009b). In D-SGD, every worker only communicates with the directly connected neighbors through gossip communication (Xiao & Boyd, 2004; Lian et al., 2017; Koloskova et al., 2020). The communication intensity is controlled by the communication topology. This decentralized nature eliminates the requirement for an expensive central server dedicated to heavy communication and computation. Surprisingly, existing theoretical results demonstrate that the massive models on the edge converge to a unique steady model, the consensus model, even without the control of a central server (Lu et al., 2011; Shi et al., 2015; Lian et al., 2017). Compared with the centralized synchronized SGD (C-SGD) (Dean et al., 2012; Li et al., 2014), D-SGD can achieve the same asymptotic linear speedup in convergence rate (Zhang et al., 2019; 2021). In this way, D-SGD provides a promising distributed machine learning paradigm with improved privacy (Nedic, 2020), scalability (Lian et al., 2017; Kairouz et al., 2021), and communication efficiency (Ying et al., 2021b).

To date, the theoretical research on D-SGD has mainly focused on its convergence (Nedic & Ozdaglar, 2009b; Lian et al., 2017; Koloskova et al., 2020; Alghunaim & Yuan, 2021), while the understanding on the generalizability (Mohri et al., 2018; He & Tao, 2020) of D-SGD is still premature. A large amount of empirical evidence have shown that D-SGD generalizes well on well-connected topologies (Assran et al., 2019a; Ying et al., 2021a). Meanwhile, empirical results by Assran et al. (2019b), Kong et al. (2021) and Ying et al. (2021a) demonstrate that for ring topologies, the validation accuracy of the consensus model learned by D-SGD decreases as the number of workers increases. Thus, a question is raised:

Our question

How does the communication topology of D-SGD impact its generalizability?

This paper answers this question. We prove a topologyaware generalization error bound for the consensus model learned by D-SGD, which characterizes the impact of the communication topology on the generalizability of D-SGD. Our contributions are summarized as follows:

Stability and generalization bounds of D-SGD. This work proves the algorithmic stability (Bousquet & Elis-

Topology-aware Generalization of Decentralized SGD

seeff, 2002) and generalization bounds of vanilla D-SGD in the non-convex non-smooth setting. In Section 4, we present an O(m/N+1/m+λ2) distributed on-average stability (see Corollary 2), where 1 λ denotes the spectral gap of the network, a measure of the connectivity of the communication topology G. These results would suffice to derive a O(1/N + ((m 1λ2)

2 + m α)/N 1 α

2 ) generalization bound in expectation of D-SGD (see Theorem 4). Our error bounds are non-vacuous, even when the worker1

number is sufficiently large, or the communication graph is sufficiently sparse. The theory can be directly applied to explain why consensus distance control in the initial phase of training can ensure better generalization.

Communication topology and generalization of DSGD. Our theory shows that the generalizability of DSGD has a positive relationship with the spectral gap 1 λ of the communication topology G. Besides, we prove that the generalizability of D-SGD decreases when the worker number increases for the ring, grid, and exponential graphs. We conduct comprehensive experiments of VGG-11 (Simonyan & Zisserman, 2014) and Res Net-18 (He et al., 2016b) on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and Tiny-Image Net (Le & Yang, 2015) to verify our theory.

To our best knowledge, this work offers the first investigation into the topology-aware generalizability of vanilla D-SGD. The closest work in the existing literature is by Sun et al. (2021), which derives O(N 1 + (1 λ) 1) generalization bounds for projected D-SGD based on uniform stability (Bousquet & Elisseeff, 2002). They show that the decentralized nature hurts the stability, and thus undermines generalizability. Compared with the results by Sun et al. (2021), our work makes two contributions: (1) we analyze the vanilla D-SGD, which is capable of solving optimization problems on unbounded domains, rather than the projected D-SGD; and (2) our stability and generalization bounds are non-vacuous, even in the cases where the spectral gap 1 λ is sufficiently close to 0, which characterizes the cases where the worker number is sufficiently large or the communication graph is sufficiently sparse.

2. Related Work

The earliest work of classical decentralized optimization can be traced back to Tsitsiklis (1984), Tsitsiklis et al. (1986) and Nedic & Ozdaglar (2009a). D-SGD has been extended to various settings in deep learning, including time-varying topologies (Lu & Wu, 2020; Koloskova et al., 2020), asynchronous settings (Lian et al., 2018; Xu et al., 2021; Nadiradze et al., 2021), directed topologies (Assran et al., 2019a;

1Throughout this work, we use the term worker to represent the local model.

Taheri et al., 2020), and data-heterogeneous scenarios (Tang et al., 2018; Vogels et al., 2021). It has been proved that the convergence of D-SGD heavily relies on the communication topology (Hambrick et al., 1996; Bianchi & Jakubowicz, 2012; Lian et al., 2017; Assran et al., 2019b; Wang et al., 2019; Guo et al., 2020), especially in the scenarios where the local data is heterogeneous across workers (Yuan et al., 2020; Koloskova et al., 2020; Dai et al., 2022; Bars et al., 2022; Huang et al., 2022). However, the impact of the communication topology on the generalizability of D-SGD is still in its infancy.

Recently, inspiring work by Zhang et al. (2021) gives insights to how gossip communication in D-SGD promotes generalization in large batch settings. They prove that a self-adjusting noise exists in D-SGD, which may help DSGD find flatter minima with better generalization. Another work by Richards et al. (2020) presents a generalization bound of the Adaptation-Then-Combination (ATC) version of D-SGD through algorithmic stability and Rademacher complexity (Mohri et al., 2018) in both smooth and nonsmooth settings. However, their generalization bounds are invariant to the communication topology, which contradicts the experimental results (see Figure 3). In contrast, our generalization bounds are topology-aware and characterize the effects of decentralization on generalization.

3. Preliminaries

Supervised learning. Supposed X Rdx and Y R are the input and output spaces, respectively. We denote the training set as S = {z1, . . . , z N}, where zζ = (xζ, yζ) , ζ = 1, . . . , N are sampled independent and identically distributed (i.i.d.) from an unknown data distribution D defined on Z = X Y.

The goal of supervised learning is to learn a predictor (hypothesis) g( ; w), parameterized by w = w(z1, z2, . . . , z N) Rd, to approximate the mapping between the input variable x X and the output variable y Y, based on the training set S. Let c : Y Y 7 R+

be a loss function that evaluates the prediction performance of the hypothesis g. The loss of a hypothesis g with respect to (w.r.t.) the example zζ = (xζ, yζ) is denoted by f(w; zζ) = c(g(xζ; w), yζ), in order to measure the effectiveness of the learned model. Then, the empirical and population risks of w are defined as follows:

ζ=1 f(w; zζ), F(w) = Ez D[f(w; z)].

Distributed learning. Distributed learning jointly trains a learning model w on multiple workers (Shamir & Srebro, 2014). In this framework, the k-th worker (k = 1, . . . , m) can access nk independent and identically distributed (i.i.d.)

Topology-aware Generalization of Decentralized SGD

Fully-connected Ring Star Grid Exponential

Figure 1. Illustration of some commonly-used communication topologies.

training examples Sk = {zk,1, . . . , zk,nk}, drawn from the data distribution D. If set nk = n, the total sample size will be N = nm. In this case, the global empirical risk of w is defined as:

k=1 FSk(w),

where FSk(w) = 1

n Pn ζ=1 f(w; zk,ζ) denotes the local empirical risk on the k-th worker and zk,ζ Sk (ζ = 1, . . . , n) is the local sample set.

Decentralized Stochastic Gradient Descent (D-SGD). The goal of D-SGD is to learn a consensus model w = 1 m Pm k=1 wk, on m workers, where wk denotes the local model on the k-th worker. For any k, let w(t) k Rd be the d-dimensional local model on the k-th worker in the t-th iteration, while w(1) k = 0 is the initial point. We denote P as a doubly stochastic gossip matrix that characterizes the underlying topology G (see Definition A.5 and Figure 1). The intensity of gossip communications is measured by the spectral gap (Seneta, 2006) of P (i.e., 1 max {|λ2| , |λn|}, where λi (i = 2, . . . , m) denotes the i-th largest eigenvalue of P (see Definition A.6). The vanilla Adapt-While Communicate (AWC) version of D-SGD without projecting operations updates the model on the k-th worker by

Communication z }| { m X

l=1 Pk,lw(t) l

Computation z }| { ηt f(w(t) k ; z(t) k,ζt), (1)

where {ηt} is a sequence of positive learning rates, and f(w(t) k ; z(t) k,ζt) is the gradient of f w.r.t. the first argument on the k-th worker, and ζt is i.i.d. variable drawn from the uniform distribution over {1, . . . , n} at the t-th iteration (Lian et al., 2017). In this paper, matrix W = [w1, , wm]T Rm d stands for all local models across the network, while matrix f(W; Z) = [ f(w1; z1), , f(wm; zm)]T Rm d stacks all local gradients w.r.t. the first argument. In this way, the matrix form of Equation (1) is as follows:

W(t+1) = PW(t) ηt f(W(t); Z(t) ζt ).

4. Topology-aware Generalization Bounds of D-SGD

This section proves stability and generalization bounds for D-SGD. We start with the definition of a new parameterlevel stability for distributed settings. Then, the stability of D-SGD under a non-smooth condition is obtained (see Theorem 1 and Corollary 2). This implies a connection between stability and generalization in expectation (see Lemma 3), which suffices to prove the expected generalization bound of D-SGD, of order O(1/N + ((m 1λ2)

2 + m α)/N 1 α

4.1. Algorithmic Stability of D-SGD

Understanding generalization using algorithmic stability can be traced back to Bousquet & Elisseeff (2002) and Shalev Shwartz et al. (2010), and has been applied to stochastic gradient methods (Hardt et al., 2016; Lei & Ying, 2020). For more details, please see Appendix B.

We define a new algorithmic stability of distributed optimization algorithms below, which better characterizes the on-average sensitivity of models across multiple workers. Definition 1 (Distributed On-average Stability). Let Sk = {zk,1, . . . , zk,n} denote the i.i.d. local samples on the kth worker drawn from the distribution D (k = 1, . . . , m). S(i) k = zk,1, . . . , zk,i, . . . , zk,n} Zn is formed by replacing the i-th element of Sk with a local sample zk,i drawn from the distribution D. We denote wk and ewk as the weight vectors on the k-th worker produced by the stochastic algorithm A based on Sk and S(i) k , respectively. A is ℓ2 distributed on-average ϵ-stable for all training data sets Sk and S(i) k (k = 1 . . . m) if

k=1 ESk,S(i) k ,A wk ewk 2 2 ϵ2,

where EA[ ] stands for the expectation w.r.t. the randomness of the algorithm A (see more details in Appendix A).

We then prove that D-SGD is distributed on-average stable.

Theorem 1. Let Sk and S(i) k be constructed in Definition 1. Let w(t) k and ew(t) k be the t-th iteration on the k-th worker

Topology-aware Generalization of Decentralized SGD

produced by Equation (1) based on Sk and S(i) k respectively, and {ηt} be a non-increasing sequence of positive learning rates. We assume that for all z Z, the function w 7 f(w; z) is non-negative and convex with its gradient f(w; z) being (α, L)-H older continuous (see Assumption A.3). We further assume that the weight differences at the t-th iteration are multivariate normally distributed: w(t) k ew(t) k i.i.d. N(µt,k, σ2 t,k Id) for all k where d denotes the dimension of weights, with unknown parameters µt,k and σt,k satisfying some technical conditions (see Assumption A.4). Then we have the following:

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

τ=0 Ct τ{O (1 1

| {z } Error from decentralization

k=1 ESk,A F

2α 1+α Sk (w(τ) k ) )

| {z } O( m

N ) Averaged empirical risk

where C = 2η0L(1 1

n) and FSk(w(τ) k ) is the local empirical risk of the k-th worker at iteration τ.

Theorem 1 suggests that the distributed on-average stability of D-SGD is positively related to the spectral gap of the underlying topology and the accumulation of the averaged empirical risk. The proof is provided in Appendix D.2.

We can obtain a simplified result with fixed learning rates.

Corollary 2 (Stability in Expectation with ηt η ). Suppose all the assumptions of Theorem 1 hold. With a fixed learning rate ηt η 1 2L(1 2 m), the distributed onaverage stability of D-SGD can be bounded as

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

1 1 2ηL(1 1

n ) | {z } O( m

| {z } Error from decentralization

where ϵS denotes the upper bound of averaged empirical

m Pm k=1 ESk,A F

2α 1+α Sk (w(t) k ) .

Corollary 2 shows that the distributed on-averge stability of D-SGD is of the order O(m/N+1/m+λ2). We defer the proof to Appendix D.2.

Comparison with existing results. Compared with Sun et al. (2021), we relax the restrictive bounded gradient and the smoothness assumptions. Instead, a much weaker

Figure 2. Histograms of the weight differences of the last layers of the Res Net-18 models (1024 dimensions 10 classes=10240 parameters) trained by AWC D-SGD on S and S(i) that differ by only one data point. Thirty Res Net-18 models are trained on data sampled from the MNIST dataset (Le Cun et al., 1998), each model is trained with 16 workers for 3000 iterations.

H older condition (see Assumption A.3) is adopted. We make a mild assumption that the weight difference w(t) k ew(t) k is multivariate normally distributed (see Assumption A.4), which stems from our empirical observations: Figure 2 illustrates that the distribution of the weight differences in Res Net-18 models trained by D-SGD is close to a centered Gaussian. Intuitively, the assumption is based upon the fact that the weights of the consensus model are very insensitive to the change of a single data point.

We also compare the order of the derived bound with the existing literature. Hardt et al. (2016) proves that SGD is O(Pt 1 τ=1 ητ/n)-stable in convex and smooth settings, which corresponds to the O( m

N ) term in Corollary 2. Under H older continuous condition, Lei & Ying (2020) proposes a parameter-level stability bound of SGD of the or-

der O( ϵSη2

2 1 α ). In contrast, Corollary 2 shows that D-SGD suffers from additional terms O (1 1

m , where the first term O (1 1

m)λ2) can characterize the degree of disconnection of the underlying communication topology. Close work by Sun et al. (2021) proved that the stability of the projected variant of D-SGD is bounded by O ηt B2

in the convex smooth setting, where B is

the upper bound of the gradient norm. The term O( ηt B2

1 λ ) brought by decentralization is of the order O(m2B2) for ring topologies and O(m B2) for grids, respectively. Our error bound in Corollary 2 is tighter than their results, since

1 1 2ηL(1 1

O m O(m B2) O(m2B2).

Topology-aware Generalization of Decentralized SGD

4.2. Generalization bounds of D-SGD

The following lemma bridges the gap between generalization and the newly proposed distributed on-average stability. Lemma 3 (Generalization via Distributed On-average Stability). Let Sk and S(i) k be constructed in Definition 1. If for any z, the pre-specified function f(w; z) is non-negative and convex, with its gradient f(w; z) being (α, L)-H older continuous, then

ES,A F(w(t)) FS(w(t))

i=1 ESk,S(i) k ,A w(t) k ew(t) k 2 2 α

where w(t) = 1

m Pm k=1 w(t) k .

We give the proof in Appendix D.3.

Lemma 3 suggests that if the consensus model learned by the distributed SGD A is ϵ-stable in the sense of Definition 1, the generalization error of the consensus model is bounded by Lϵ α

2 ). Lemma 3 improves Theorem 2 (c) of Lei & Ying (2020) by removing the O(ES,A F 2α 1+α (A(S)) ) term, where ES,A F(A(S)) denotes the population risk of the learned model A(S) (see Equation (D.26)). This improvement is significant, because ES,A F(A(S)) usually does not converge to zero in practice.

We now prove the generalization bound of D-SGD. Theorem 4 (Generalization Bound in Expectation with ηt η). Let all the assumptions of Theorem 1 hold. With a fixed step sizes of ηt η 2L(1 2

m), the generalization error of the consensus model learned by D-SGD can be controlled as

ES,A F(w(t)) FS(w(t))

| {z } Error from decentralization

where w(t) = 1 m Pm k=1 w(t) k and ϵS is the upper bounder

m Pm k=1 ESk,A F

2α 1+α Sk (w(t) k ) .

The order of the generalization bound in Theorem 4 is O(1/N + ((m 1λ2)

2 + m α)/N 1 α

2 ) and will become O(1/N + (m 1

2 λ2 + m 1)/N 1

2 ) in the smooth settings where α = 1. The proof is provided in Appendix D.3. Remark 1. Corollary 2 and Theorem 4 indicate that the stability and generalization of D-SGD are positively related to the spectral gap 1 λ. The intuition of the results is that D-SGD with a denser connection topology (i.e., larger λ) can aggregate more information from its neighbors, thus indirectly accessing more data at each iteration, leading to better generalization.

Table 1. Spectral gap of gossip matrices with different topology.

Graph topoloy Spectral gap 1 λ

Fully-connected 1 Disconnected 0 Ring 16π2/3m2

Grid O(1/(m log2(m))) Exponential graph (even) 2/(1 + log2(m))

4.3. Practical Implications

Our theory delivers significant practical implications.

Communication topology and generalization. The intensity of communication is controlled by the spectral gap 1 λ of the underlying communication topologies (see Table 1). Detailed analyses of the spectral gaps of some commonlyused topologies can be found in Proposition 5 of Nedi c et al. (2018) and Ying et al. (2021a). Substituting the spectral gap of different topologies in Table 1 into Theorem 4, we can conclude that the generalization error of different topologies can be ranked as follows: fully-connected < exponential < grid < ring, since

0 < 1 O((log2(m)) 1)

< 1 O((m log2(m)) 1) < 1 O(m 2).

On the one hand, our theory provides theoretical evidence that D-SGD will generalize better on well-connected topologies (i.e., topologies with larger spectral gap). On the other hand, we prove that for a specific topology, the worker number impacts the generalization of D-SGD through affecting the spectral gap of the topology.

Consensus distance control. Recently, a line of studies have been devoted to understanding the connection between optimization and generalization through studying the effect of early phase training (Keskar et al., 2017; Achille et al., 2018; Frankle et al., 2020). In the decentralized settings, Kong et al. (2021) claims that there exists a critical consensus distance in the initial training phase consensus distance (i.e. 1 m Pm i=1 w(t) k 1

m Pm k=1 w(t) k 2 F ) below the critical threshold will ensure good generalization. However, the reason why consensus control can promote generalization remains an open problem. Fortunately, the following theorem can explain this phenomenon by connecting the consensus distance notion in Kong et al. (2021) with the algorithmic stability and the generalizability of D-SGD.

Corollary 5. Let all the assumptions of Theorem 1 plus Assumption A.1 and Assumption A.2 hold. Suppose that the consensus distance satisfies the condition Γ2 1 m Pm k=1 w(t) k w(t) 2 2 K2 if t tΓ, and is controlled below Γ2 from tΓ-th iterate to the end of training. Then one can conclude that the upper bound of the distributed

Topology-aware Generalization of Decentralized SGD

on-average stability of D-SGD increase monotonically with tΓ if the total number of iterations t C 2 ln C .

We give the proof in Appendix D.4.

Corollary 5 provides theoretical evidence for the following empirical findings: (1) consensus control is beneficial to the algorithmic stability and the generalizability of D-SGD; and (2) it is more effective to control the consensus distance in the initial stage of training than at the end of training.

5. Empirical Results

This section empirically validates our theoretical results. We first introduce the experimental setup and then study how the communication topology and the worker number affect the generalization of D-SGD. The code is available at https://github.com/Raiden-Zhu/ Generalization-of-DSGD.

5.1. Experimental Setup

Networks and datasets. Network architectures VGG-11 (Simonyan & Zisserman, 2014) and Res Net-18 (He et al., 2016b) are employed in our experiments. The models are trained on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and Tiny Image Net (Le & Yang, 2015), three popular benchmark image classification datasets. The CIFAR-10 dataset consists of 60,000 32 32 color images across 10 classes, with each class containing 5,000 training and 1,000 testing images. The CIFAR-100 dataset also consists of 60,000 32 32 color images, except that it has 100 classes, each class containing 5,00 training and 1,00 testing images. Tiny Image Net contains 120,000 64 64 color images in 200 classes, each class containing 500 training images, 50 validation images, and 50 test images. No other pre-processing methods are employed.

Training setting. Vanilla D-SGD is employed to train image classifiers based on VGG-11 and Res Net-18 on fullyconnected, ring, grid, and static exponential topologies. The number of workers is set as 32 and 64. Batch normalization (Ioffe & Szegedy, 2015) and dropout (Srivastava et al., 2014) are employed in training VGG-11. The local batch size is set as 64. To control the impact of different total batch size (local batch size worker number) caused by the different number of workers, we apply the linear scaling law (i.e., linearly increase learning rate w.r.t. total batch size) (He et al., 2016a; Goyal et al., 2017). The initial learning rate is set as 0.1 and will be divided by 10 when the model has accessed 2/5 and 4/5 of the total number of iterations (He et al., 2016a). All other techniques, including momentum (Qian, 1999), weight decay (Tihonov, 1963), and data augmentation (Le Cun et al., 1998) are disabled.

Implementations. All our experiments are conducted

on a computing cluster with GPUs of NVIDIA Tesla

V100 16GB and CPUs of Intel Xeon Gold 6140 CPU @ 2.30GHz. Our code is implemented based on Py Torch (Paszke et al., 2019).

5.2. Communication topology and generalization

We calculate the difference between the validation loss and the training loss on different topologies separately, as shown in Figure 3 and Figure C.1. Two observations are obtained from those figures: (1) for large topologies, the loss differences can be sorted as follows: fully-connected exponential < grid < ring; (2) as the worker number increases, the loss differences of D-SGD on different topologies increase. These observations suggest that (1) D-SGD generalizes better on well-connected topologies with a larger spectral gap; (2) the generalizability gap of D-SGD on different topologies grows as the worker number increases.

In comparison to Res Net-18, we observe when VGG-11 is chosen as the backbone, the generalization gaps between different topologies are larger (see Figure 3 and Figure C.1). We conjecture that the generalization gaps are amplified because the H older constant L of VGG-11 is larger than that of Res Net-18, according to our theory suggesting that the generalization error of D-SGD linearly increases with L (see Theorem 4).

One may observe that the loss difference of the fullyconnected topology is larger than that of other topologies in the initial training phase, which seems to be inconsistent with our theory. Theoretically, explaining this phenomenon is an optimization problem, which is beyond the scope of this work. One possible explanation is motivated by the stability-convergence trade-off in iterative optimization algorithms (Chen et al., 2018). Since the fully-connected system converges faster, the corresponding optimization error is smaller than the other three kinds of systems at the beginning of training. Therefore, the fully-connected system is less stable in the initial phase of training.

6. Future Works

Generalization of D-SGD with non-i.i.d. data. In realworld settings, a fundamental challenge is that data may not be i.i.d. across the workers. In this case, different workers may collect very distinct or even contradictory samples (i.e., data-heterogeneity) (Criado et al., 2021). However, generalization analyses of distributed learning algorithms are mostly based on the key assumption that the data is i.i.d. over workers. Therefore, more sophisticated techniques are needed to broaden our knowledge on the following questions: Would the consensus model learned by D-SGD generalize well in the non-i.i.d. settings? Would the effect of the communication topology on generalization be reduced

Topology-aware Generalization of Decentralized SGD

(a) VGG-11 on CIFAR-10, 32 workers (b) VGG-11 on CIFAR-100, 32 workers (c) VGG-11 on Tiny Image Net, 32 workers

(d) VGG-11 on CIFAR-10, 64 workers (e) VGG-11 on CIFAR-100, 64 workers (f) VGG-11 on Tiny Image Net, 64 workers

Figure 3. Loss differences in training VGG-11 using D-SGD with different topologies.

or amplified in these scenarios?

Implicit bias of D-SGD. As pointed out in Zhang et al. (2021), the additional gradient noise in D-SGD helps it converge to a flatter minima compared to centralized distributed SGD. Therefore, a direct question is whether there is a superior implicit bias effect (Soudry et al., 2018; Ji & Telgarsky, 2019; Arora et al., 2019; Wang et al., 2021) in D-SGD compared to centralized distributed SGD, which involves the convergence direction? Can we theoretically analyze the implicit bias of D-SGD and derive fine-grained generalization bounds of D-SGD?

7. Conclusion

In this paper, we analyze the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N+1/m+λ2)-stable, where N is the total sample size of the whole system, m is the worker number, and 1 λ is the spectral gap of the communication topology. Based on this stability result, we obtain an O(1/N + ((m 1λ2)

2 + m α)/N 1 α

2 ) in-average generalization bound, characterizing the gap between the training performance and the test performance. Our error bounds are non-vacuous, even when the worker number is sufficiently large, or the communication graph is sufficiently sparse. According to our theory, we can conclude: (1) the generalizability of D-SGD is positively correlated with the spectral gap of the underlying topology; (2) the gener-

alizability of D-SGD decreases when the worker number increases. These theoretical findings are empirically justified by the experiments of VGG-11 and Res Net-18 on CIFAR-10, CIFAR-100, and Tiny Image Net. The theory can also explain why consensus control at the beginning of training can promote the generalizability of D-SGD. To our best knowledge, this is the first study on the topology-aware generalizations of vanilla D-SGD.

Acknowledgement

This work is supported by the Major Science and Technology Innovation 2030 key projects New Generation Artificial Intelligence (No. 2021ZD0111700), the National Natural Science Foundation of China (No. 61976186, U20B2066, and 61932016), the Key Research and Development Program of Zhejiang Province (No. 2021C01164), the Fundamental Research Funds for the Central Universities (No. 226-2022-00064 and WK2150110024), and the Starry Night Science Fund of Zhejiang University Shanghai Institute for Advanced Study (No. SN-ZJU-SIAS-001). This work is completed when Tongtian Zhu and Zhengyang Niu were interns at JD Explore Academy.

The authors would like to thank Chun Li, Yingjie Wang, Haowen Chen, and Shaopeng Fu for their insightful comments on the revision of this manuscript and appreciate Li Shen, Rong Dai, and Luofeng Liao for their helpful discussions. We also sincerely thank the anonymous ICML reviewers and chairs for their constructive comments.

Topology-aware Generalization of Decentralized SGD

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation, 2016.

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep networks. In International Conference on Learning Representations, 2018.

Alghunaim, S. A. and Yuan, K. A unified and refined convergence analysis for non-convex decentralized learning. ar Xiv preprint ar Xiv:2110.09993, 2021.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In Proceedings of the 36th International Conference on Machine Learning, 2019a.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In Proceedings of the 36th International Conference on Machine Learning, 2019b.

Bars, B. L., Bellet, A., Tommasi, M., and Kermarrec, A.- M. Yes, topology matters in decentralized optimization: Refined convergence and topology learning under heterogeneous data. ar Xiv preprint ar Xiv:2204.04452, 2022.

Bassily, R., Feldman, V., Guzm an, C., and Talwar, K. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, 2020.

Bianchi, P. and Jakubowicz, J. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE transactions on automatic control, 2012.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In Platt, J., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems, 2008.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2002.

Chen, Y., Jin, C., and Yu, B. Stability and convergence tradeoff of iterative optimization algorithms. ar Xiv preprint ar Xiv:1804.01619, 2018.

Criado, M. F., Casado, F. E., Iglesias, R., Regueiro, C. V., and Barro, S. Non-iid data and continual learning processes in federated learning: A long road ahead. ar Xiv preprint ar Xiv:2111.13394, 2021.

Dai, R., Shen, L., He, F., Tian, X., and Tao, D. Dispfl: Towards communication-efficient personalized federated learning via decentralized sparse training. In International Conference on Machine Learning, 2022.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al. Large scale distributed deep networks. Advances in neural information processing systems, 2012.

Deng, Z., He, H., and Su, W. Toward better generalization bounds with locally elastic stability. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Devroye, L. and Wagner, T. Distribution-free performance bounds with the resubstitution error estimate. IEEE Transactions on Information Theory, 1979.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 2019.

Foster, D. J., Greenberg, S., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Hypothesis set stability and generalization. In Advances in Neural Information Processing Systems, 2019.

Frankle, J., Schwab, D. J., and Morcos, A. S. The early phase of neural network training. In International Conference on Learning Representations, 2020.

Geiping, J., Bauermeister, H., Dr oge, H., and Moeller, M. Inverting gradients - how easy is it to break privacy in federated learning? In Advances in Neural Information Processing Systems, 2020.

Goyal, P., Doll ar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017.

Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W., and Yang, T. Communication-efficient distributed stochastic auc maximization with deep neural networks. In International Conference on Machine Learning, pp. 3864 3874. PMLR, 2020.

Hambrick, D. C., Cho, T. S., and Chen, M.-J. The influence of top management team heterogeneity on firms competitive moves. Administrative science quarterly, 1996.

Topology-aware Generalization of Decentralized SGD

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, 2016.

He, F. and Tao, D. Recent advances in deep learning theory. ar Xiv preprint ar Xiv:2012.10931, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 2016b.

Huang, T., Liu, S., Shen, L., He, F., Lin, W., and Tao, D. Achieving personalized federated learning with sparse local models. ar Xiv preprint ar Xiv:2201.11380, 2022.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 2015.

Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019.

Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawit, K., Charles, Z., Cormode, G., Cummings, R., D Oliveira, R. G. L., Eichner, H., El Rouayheb, S., Evans, D., Gardner, J., Garrett, Z., Gasc on, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konecn y, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Ozg ur, A., Pagh, R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, D., Song, W., Stich, S. U., Sun, Z., Theertha Suresh, A., Tram er, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and Open Problems in Federated Learning. IEEE, 2021.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. The early phase of neural network training. In International Conference on Learning Representations, 2017.

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. Consensus control for decentralized deep learning. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images (tech. rep.). University of Toronto, 2009.

Kuzborskij, I. and Lampert, C. Data-dependent stability of stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 2015.

Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998.

Lei, Y. and Ying, Y. Fine-grained analysis of stability and generalization for stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems, 2014.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, 2017.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, 2018.

Liu, T., Lugosi, G., Neu, G., and Tao, D. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, 2017.

Lopes, C. G. and Sayed, A. H. Diffusion least-mean squares over adaptive networks: Formulation and performance analysis. IEEE Transactions on Signal Processing, 2008.

Lu, J., Tang, C. Y., Regier, P. R., and Bow, T. D. Gossip algorithms for convex consensus optimization over networks. IEEE Transactions on Automatic Control, 2011.

Lu, S. and Wu, C. W. Decentralized stochastic non-convex optimization over weakly connected time-varying digraphs. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

Lu, Y. and De Sa, C. Optimal complexity in decentralized training. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018.

Topology-aware Generalization of Decentralized SGD

Montenegro, R. and Tetali, P. Mathematical aspects of mixing times in markov chains. Foundations and Trends in Theoretical Computer Science, 2006.

Nadiradze, G., Sabour, A., Davies, P., Li, S., and Alistarh, D. Asynchronous decentralized sgd with quantized and local updates. Advances in Neural Information Processing Systems, 2021.

Nedic, A. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 2020.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 2009a.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 2009b.

Nedi c, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 2018.

Nesterov, Y. Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003.

Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 2015.

Neu, G., Dziugaite, G. K., Haghifam, M., and Roy, D. M. Information-theoretic generalization bounds for stochastic gradient descent. In Proceedings of Thirty Fourth Conference on Learning Theory, 2021.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 2019.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural networks, pp. 145 151, 1999.

Richards, D. et al. Graph-dependent implicit regularisation for distributed stochastic subgradient descent. Journal of Machine Learning Research, 2020.

Seneta, E. Non-negative matrices and Markov chains. Springer Science & Business Media, 2006.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 2010.

Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2014.

Shi, W., Ling, Q., Wu, G., and Yin, W. Extra: An exact firstorder algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 2015.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 2018.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, 2010.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

Sun, T., Li, D., and Wang, B. Stability and generalization of decentralized stochastic gradient descent. Proceedings of the AAAI Conference on Artificial Intelligence, May 2021.

Taheri, H., Mokhtari, A., Hassani, H., and Pedarsani, R. Quantized decentralized stochastic learning over directed graphs. In International Conference on Machine Learning, 2020.

Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, 2018.

Tihonov, A. N. Solution of incorrectly formulated problems and the regularization method. Soviet Math., 1963.

Tsitsiklis, J., Bertsekas, D., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IRE Transactions on Automatic Control, 1986.

Tsitsiklis, J. N. Problems in decentralized decision making and computation. Technical report, Massachusetts Inst of Tech Cambridge Lab for Information and Decision Systems, 1984.

Vapnik, V. and Chervonenkis, A. Theory of pattern recognition, 1974.

Topology-aware Generalization of Decentralized SGD

Vogels, T., He, L., Koloskova, A., Karimireddy, S. P., Lin, T., Stich, S. U., and Jaggi, M. Relaysum for decentralized deep learning on heterogeneous data. Advances in Neural Information Processing Systems, 2021.

Wan, X., Zhang, H., Wang, H., Hu, S., Zhang, J., and Chen, K. Rat-resilient allreduce tree for distributed machine learning. In 4th Asia-Pacific Workshop on Networking, 2020.

Wang, B., Meng, Q., Chen, W., and Liu, T.-Y. The implicit bias for adaptive optimization algorithms on homogeneous neural networks. In International Conference on Machine Learning, 2021.

Wang, J., Sahu, A. K., Yang, Z., Joshi, G., and Kar, S. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), 2019.

Warnat-Herresthal, S., Schultze, H., Shastry, K. L., Manamohan, S., Mukherjee, S., Garg, V., Sarveswara, R., H andler, K., Pickkers, P., Aziz, N. A., et al. Swarm learning for decentralized and confidential clinical machine learning. Nature, 2021.

Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 2004.

Xu, J., Zhang, W., and Wang, F. A (dp)2 2sgd: Asynchronous decentralized parallel stochastic gradient descent with differential privacy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

Yin, H., Mallya, A., Vahdat, A., Alvarez, J. M., Kautz, J., and Molchanov, P. See through gradients: Image batch recovery via gradinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

Ying, B., Yuan, K., Chen, Y., Hu, H., Pan, P., and Yin, W. Exponential graph is provably efficient for decentralized deep training. In Advances in Neural Information Processing Systems, 2021a.

Ying, B., Yuan, K., Hu, H., Chen, Y., and Yin, W. Bluefog: Make decentralized algorithms practical for optimization and deep learning. ar Xiv preprint ar Xiv:2111.04287, 2021b.

Ying, Y. and Zhou, D.-X. Unregularized online learning algorithms with general loss functions. Applied and Computational Harmonic Analysis, 2017.

Yuan, K., Alghunaim, S. A., Ying, B., and Sayed, A. H. On the influence of bias-correction on distributed stochastic optimization. IEEE Transactions on Signal Processing, 2020.

Zhang, W., Cui, X., Finkler, U., Saon, G., Kayi, A., Buyuktosunoglu, A., Kingsbury, B., Kung, D., and Picheny, M. A highly efficient distributed deep learning system for automatic speech recognition. ar Xiv preprint ar Xiv:1907.05701, 2019.

Zhang, W., Liu, M., Feng, Y., Cui, X., Kingsbury, B., and Tu, Y. Loss landscape dependent self-adjusting learning rates in decentralized stochastic gradient descent. ar Xiv preprint ar Xiv:2112.01433, 2021.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. ar Xiv preprint ar Xiv:1806.00582, 2018.

Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems, 2019.

Topology-aware Generalization of Decentralized SGD

A. Additional Background

The following remarks clarify some notations in the literature. Remark A.1. Stochastic learning algorithms A : n Zn 7 W are often applied to produce an output model A(S) Rd

based on the training set S. To avoid ambiguity, let A(S) denote the model generated by a general learning algorithm, and w denote the models generated by a specific stochastic learning algorithm. Remark A.2. Stochastic learning introduces two kinds of randomness: one from the sampling of training examples and another from the adopted randomized algorithm. In the following analysis, EA[ ] stands for the expectation w.r.t. the randomness of the algorithm A, and ES[ ] denotes the expectation w.r.t. the randomness originating from sampling the data set S. Notice that S Dn and z D, therefore ES[ ] differs from Ez[ ] defined in Section 3.

Based on the work by Hardt et al. (2016) and Bottou & Bousquet (2008), we give a formulation of excess error decomposition and demonstrate how to understand generalization through error decomposition. Definition A.1 (Excess Error Decomposition). We denote the empirical risk minimization (ERM) solution by w S = arg minw FS(w) and w = arg minw F(w). The excess error F(A(S)) F(w ) can be decomposed as

ES,A [F(A(S)) F (w )] | {z } Excess error = ES,A [F(A(S)) FS(A(S))] | {z } Generalization error

+ ES,A [FS(A(S)) FS (w S)] | {z } Optimization error

+ES,A [FS(w S) F (w )]

ES,A [F(A(S)) FS(A(S))] | {z } Generalization error

+ ES,A [FS(A(S)) FS (w S)] | {z } Optimization error

The last inequality holds since and ES,A [FS(w S)] ES,A [FS (w )] = ES F(w ). The empirical risk and the population risk above are defined in Section 3. This paper considers upper bounding the first term called the generalization error.

Lipschitzness and smoothness are two commonly adopted assumptions to establish the uniform stability guarantees of SGD. Assumption A.1 (Lipschitzness).

f(w; z) 2 G for all w Rd and z Z.

Assumption A.2 (Smoothness). f is β-smooth if for any z and w, ew Rd, f(w; z) f(ew; z) 2 β w ew 2. (A.2)

These two restrictive assumptions are not satisfied in many real contexts. For example, the Lipschitz constant G can be very large for some learning problems (Fazlyab et al., 2019; Lei & Ying, 2020). In addition, neural nets with piecewise linear activation functions like Re LU are not smooth. Smoothness is generally difficult to ensure at the beginning and intermediate phases of deep neural network training (Bassily et al., 2020). Assumption A.3 (H older Continuity). Let L > 0, α [0, 1]. f( , z) is (α, L)-H older continuous if for all w, ew Rd

and z Z, f(w; z) f(ew; z) 2 L w ew α 2 . (A.3)

H older continuous gradient assumption is much weaker than smoothness by definition. Serving as an intermediate class of functions (C1,α (Rn)) between smooth functions (C1,1 (Rn)) and functions with Lipschitz continuous gradients (C1,0 (Rn)), the main advantage of functions with H older continuous gradients lies in the ability to automatically adjust the smoothness parameter to a proper level (Nesterov, 2015): Inequality (A.3) with α = 1 corresponds to smoothness and Inequality (A.3) with α = 0 is equivalent to Lipschitzness (see Assumption A.1).

Assumption A.4 (Gaussian Weight Difference). We assume that the difference between w(t) k and ew(t) k (the t-th iterate on k-th worker produced by Equation (1) based on Sk and S(i) k respectively) is independent and normally distributed:

w(t) k ew(t) k N(µt,k, σ2 t,k Id), k = 1, . . . , m

where Id denotes an identity matrix with size d, and µt,k, σ2 t,k are unknown parameters. We also give a mild constraint that the d-dimensional parameter µt,k satisfies d µ2 0 µt,k 2 2 d µ2 and the parameter σ2 t,k R is bounded by σ2.

Assumption A.4 is mild since Sk and S(i) k only vary at one point.

Topology-aware Generalization of Decentralized SGD

Commonly used stability notions are listed below.

Definition A.2 (Hypothesis Stability). A stochastic algorithm A is hypothesis ϵ-stable w.r.t. the loss function f if for all training data sets S, S(i) Zn that differ by at most one example, we have

Ez EA f(A(S); z) f(A(S(i)); z) ϵ. (A.4)

Definition A.3 (Uniform Stability). A stochastic algorithm A is ϵ-uniformly stable w.r.t. the loss function f if for all training data sets S, S(i) Zn that differ by at most one example, we have

sup z EA f(A(S); z) f(A(S(i)); z) ϵ. (A.5)

Definition A.4 (On-average Model Stability, (Lei & Ying, 2020)). A stochastic algorithm A is ℓ2 on-average model ϵ-stable for all training data sets S, S(i) Zn that differ by at most one example, we have

i=1 ES,S(i),A A(S) A(S(i)) 2 2 ϵ2.

Some widely used notions regarding decentralized training are listed as follows.

Definition A.5 (Doubly Stochastic Matrix). Let G = (V, E) stand for the decentralized communication topology where V denotes the set of m computational nodes and E represents the edge set. For any given graph G = (V, E), the doubly stochastic gossip matrix P = [Pk,l] Rm m is defined on the edge set E that satisfies: (1) If k = l and (k, l) / E, then Pk,l = 0 (disconnected); otherwise, Pk,l > 0 (connected); (2) Pk,l [0, 1] k, l; (3) P = P ; and (4) P

l Pk,l = 1 (standard weight matrix for undirected graph).

Definition A.6 (Spectral Gap). Denote λ = max {|λ2| , |λm|} where λi (i = 2, . . . , m) is the i-th largest eigenvalue of gossip matrix P Rm m. The spectral gap of a gossip matrix P can be defined as follows:

spectral gap := 1 λ.

According to the definition of doubly stochastic matrix (Definition A.5), we have 0 λ < 1. The spectral gap measures the connectivity of the communication topology, which is close to 0 for sparse topologies and will approach 1 for well-connected topologies.

To facilitate our subsequent analysis, we provide some preliminaries of matrix algebra here.

Definition A.7 (Frobenius Norm). The Frobenius norm (Euclidean norm, or Hilbert Schmidt norm) is the matrix norm of a matrix A Rp q defined as the square root of the sum of the squares of its elements:

j=1 |aij|2 = q

For any A, B Rp q, the following identity holds:

A + B 2 F = A 2 F + B 2 F + 2 A, B F ,

where , F is the Frobenius inner product.

Topology-aware Generalization of Decentralized SGD

B. Additional Related Work

B.1. Non-centralized learning.

To handle an increasing amount of data and model parameters, distributed learning across multiple computing nodes (workers) emerges. A traditional distributed learning system usually follows a centralized setup (Abadi et al., 2016). However, such a central server-based learning scheme suffers from two main issues: (1) A centralized communication protocol significantly slows down the training since central servers are easily overloaded, especially in low-bandwidth or high-latency cases (Lian et al., 2017); (2) There exists potential information leakage through privacy attacks on model parameters despite decentralizing data using Federated Learning (Zhu et al., 2019; Geiping et al., 2020; Yin et al., 2021). As an alternative, training in a non-centralized fashion allows workers to balance the load on the central server through the gossip technique, as well as maintain confidentiality (Warnat-Herresthal et al., 2021). Model decentralization can be divided into three kinds of categories by layers (Lu & De Sa, 2021): (1) On the application layer, decentralized training usually refers to federated learning (Zhao et al., 2018; Dai et al., 2022) ; (2) On the protocol layer, decentralization denotes average gossip where local workers communicate by averaging their parameters with their neighbors on a graph (Lian et al., 2017) and (3) on the topology layer, it means a sparse topology graph (Wan et al., 2020).

B.2. Generalization via algorithmic stability.

Algorithmic stability theory, PAC-Bayes theory, and information theory are major tools for constructing algorithm-dependent generalization bounds (Neu et al., 2021). A direct intuition behind algorithmic stability is that if an algorithm does not rely excessively on any single data point, it can generalize well. Proving generalization bounds based on the sensitivity of the algorithm to changes in the learning sample can be traced back to Vapnik & Chervonenkis (1974) and Devroye & Wagner (1979). After that, the celebrated work by Bousquet & Elisseeff (2002) establishes the relationship between uniform stability and generalization in high probability. Follow-up work by Shalev-Shwartz et al. (2010) identifies stability as the major necessary and sufficient condition for learnability. Then, Hardt et al. (2016) provide uniform stability bounds for stochastic gradient methods (SGM) and show the strong stability properties of SGD with convex and smooth losses. Recent work by Lei & Ying (2020) defines a new on-average stability notion and conducts generalization analyses on SGD with the H older continuous assumption. In addition to uniform stability, there are other stability notions including on-average stability (Shalev-Shwartz et al., 2010), uniform argument stability (Liu et al., 2017), data-dependent stability (Kuzborskij & Lampert, 2018), hypothesis set stability (Foster et al., 2019) and locally elastic stability (Deng et al., 2021).

Topology-aware Generalization of Decentralized SGD

C. Additional Experimental Results

We also calculate the difference between the validation loss and the training loss in training Res Net-18 using D-SGD.

(a) Res Net-18 on CIFAR-10, 32 workers (b) Res Net-18 on CIFAR-100, 32 workers (c) Res Net-18 on Tiny Image Net, 32 workers

(d) Res Net-18 on CIFAR-10, 64 workers (e) Res Net-18 on CIFAR-100, 64 workers (f) Res Net-18 on Tiny Image Net, 64 workers

Figure C.1. Loss differences in training Res Net-18 using D-SGD with different topologies.

Topology-aware Generalization of Decentralized SGD

D.1. Technical lemmas

To complete our proof, we first introduce some technical lemmas.

Lemma D.1 (Corollary 1.14., (Montenegro & Tetali, 2006)). Let M stand for the matrix with all the elements be 1/m and P is defined in Definition A.5. For any k Z+, the following inequality holds: Pk M 2,2 P k λ . (D.1)

Lemma D.2. For any a, b R and p R+, the following inequality holds:

(a + b)2 (1 + p)a2 + (1 + p 1)b2. (D.2)

Lemma D.3 (Self-bounding Property, (Lei & Ying, 2020)). Assume that for all z Z, the map w 7 f(w; z) is nonnegative with its gradient f(w; z) being (α, L)-H older continuous (Assumption A.3), then w 7 f(w; z) can be bounded as

f(w, z) 2 cα,1f α 1+α (w, z), w Rd, z Z.

( (1 + 1/α) α 1+α L 1 1+α , if α > 0 supz f(0; z) 2 + L, if α = 0. (D.3)

Remark D.1. The self-binding property implies that H older continuous gradients can be controlled by function values. The α = 1 and α (0, 1) case are established by Srebro et al. (2010) and Ying & Zhou (2017), respectively. The case where α = 0 follows directly from Assumption A.3.

Lemma D.4 (Co-coercivity). Assume that for all z Z, the map w 7 f(w; z) is nonnegative and convex, with its gradient f(w; z) being (α, L)-H older continuous (see Assumption A.3). Then for all w, ew, the following inequality holds:

f(w; z) f(ew; z) 2 (1 + α)L 1 α 2α w ew, f(w; z) f(ew; z) α 1+α . (D.4)

Remark D.2. Lemma D.4 establishes the co-coercivity of the gradients for nonnegative convex functions with H older continuous gradients. The α = 1 and α (0, 1) cases can be found in Nesterov (2003) and Ying & Zhou (2017) respectively. The proof of the α = 0 case can be obtained directly from the convexity of f.

Remark D.3. Using Young s inequality

ab p 1|a|p + q 1|b|q (a, b R, p, q > 0 with p 1 + q 1 = 1),

for any η > 0, the right-hand side of Inequality (D.4) can be further controlled by

2η 1 w ew, f(w; z) f(ew; z) + c2 α,3η 2α 1 α , (D.5)

where cα,3 = 1 α 1+α(2 αL) 1 1 α . For more details, see Appendix D of Lei & Ying (2020).

Lemma D.5. For any a, b Rd with ai, bi being their i-th components, respectively, the following inequality holds:

i b2 i 2 . (D.6)

Topology-aware Generalization of Decentralized SGD

D.2. Algorithmic stability of D-SGD

Proof of Theorem 1.

To begin with, we decompose Pm k=1 w(t+1) k ew(t+1) k 2 2, the on-average stability of D-SGD at the t-th iteration, into three

parts by the definition of the vector 2-norm. In the following, we will let z(t) k,ζt and z(t) k,ζt denote two random samples drawn from S and S(i) respectively on the k-th worker at the i-th iteration, respectively.

w(t+1) k ew(t+1) k 2 2

l=1 Pk,lwt(l) ηt f w(t) k ; z(t) k,ζt

l=1 Pk,l ew(t) l + ηt f ew(t) k , z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ) 2 2 +

k=1 ηt 2 f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ), f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

= P(W(t) f W(t)) 2 F | {z } T1

k=1 1z(t) k,ζt = z(t) k,ζt

w(t+1) k ew(t+1) k 2 2 m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2

k=1 1z(t) k,ζt= z(t) k,ζt

ηt 2 f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ), f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

where F denotes the Frobenius norm (see Definition A.7).

(1) To construct our proof, we start by constructing an upper bound for the expectation of T1:

EA(T1) = EA P(W(t) f W(t)) 2 F d(σ2 + µ2) (m 1)λ2 + 1 . (D.7)

The on-averaged stability after a single gossip communication can be written as

P(W(t) f W(t)) 2 F =

l=1 Pk,l(w(t) l ew(t) l ) 2 2 =

l=1 Pk,l(wv,(t) l ewv,(t) l ) 2 2

k=1 σ2 t,k(

l=1 P2 k,l){

Pm l=1 Pk,l[(wv,(t) l ewv,(t) l ) µv t,k] + Pm l=1 Pk,lµv t,l]

σt,k q Pm l=1 P2 k,l }2

k=1 σ2 t,k(

l=1 P2 k,l){

Pm l=1 Pk,l[(wv,(t) l ewv,(t) l ) µv t,k]

σt,k q Pm l=1 P2 k,l }2 + µ2dm

l=1 P2 k,l)

Pm l1=1 Pm l2=1 Pk,l1Pk,l2µv t,l1[(wv,(t) l1 ewv,(t) l1 ) µv t,l1] q Pm l1=1 P2 k,l q Pm l2=1 P2 k,l , (D.8)

where wv,(t) l and ewv,(t) l stacks the v-th entry of the d-dimensional vector w(t) l and ew(t) l , respectively. The last inequality holds since Pd v=1(µv t,l)2 dµ2.

Topology-aware Generalization of Decentralized SGD

Since the weight difference is normally distributed:

w(t) k ew(t) k N(µt,k, σ2 t,k Id), k = 1, . . . , m,

with µt,k satisfying µt,k 2 2 d µ2 and σ2 t,k R being bounded by σ2, we obtain

Pm l=1 Pk,l[(wv,(t) l ewv,(t) l ) µv t,k]

σt,k q Pm l=1 P2 k,l

i.i.d. N(0, 1), l = 1, . . . , m.

Since the sum of squared i.i.d. standard normal variables follows a Chi-Square distribution with 1 degree of freedom, we arrive at

Pm l=1 Pk,l[(wv,(t) l ewv,(t) l ) µv t,k]

σt,k q Pm l=1 P2 k,l }2 X 2(1), l = 1, . . . , m.

Furthermore, since l EA[(wv,(t) l ewv,(t) l ) µv t,l] = 0, we have

l=1 P2 k,l)

Pm l1=1 Pm l2=1 Pk,l1Pk,l2µv t,l1[(wv,(t) l1 ewv,(t) l1 ) µv t,l1] q Pm l1=1 P2 k,l q Pm l2=1 P2 k,l

As a consequence,

EA P(W(t) f W(t)) 2 F dσ2 m X

k=1 λ2 k + dµ2 m X

k=1 λ2 k d(σ2 + µ2)[1 + (m 1)λ2],

where 1 λ is the spectral gap of the communication topology.

(2) For the second part T2, we have

k=1 1z(t) k,ζt = z(t) k,ζtp m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2+

k=1 1z(t) k,ζt = z(t) k,ζt(1 + p 1)c2 α,1ηt 2 f 2α 1+α (w(t) k ; z(t) k,ζt)+f 2α 1+α (ew(t) k , z(t) k,ζt) .

Inequality (D.9) are mainly based on Lemma D.2 and Lemma D.3:

k=1 1z(t) k,ζt = z(t) k,ζt

w(t+1) k ew(t+1) k 2 2 m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2

k=1 1z(t) k,ζt = z(t) k,ζt

l=1 Pk,lwt(l) ηt f w(t) k ; z(t) k,ζt

l=1 Pk,l ew(t) l +ηt f ew(t) k , z(t) k,ζt 2 2

l=1 Pk,l(w(t) l ew(t) l ) 2 2

k=1 1z(t) k,ζt = z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ) 2 2 + (1 + p 1)ηt 2 f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

k=1 1z(t) k,ζt = z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ) 2 2 | {z } T2.1

+(1 + p 1)c2 α,1ηt 2 f 2α 1+α (w(t) k ; z(t) k,ζt) + f 2α 1+α (ew(t) k , z(t) k,ζt) .

(3) T3 can be controlled as follows:

k=1 1z(t) k,ζt= z(t) k,ζtηt 2L w(t) k ew(t) k 2α 2 + m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2 . (D.10)

Topology-aware Generalization of Decentralized SGD

According to the H older continuous assumption, we have

k=1 1z(t) k,ζt= z(t) k,ζt

f(w(t) k ; z(t) k,ζt) f(ew(t) k ; z(t) k,ζt) 2 2 L

k=1 1z(t) k,ζt= z(t) k,ζt

w(t) k ew(t) k 2α 2 . (D.11)

Consequently,

k=1 1z(t) k,ζt= z(t) k,ζtηt L w(t) k ew(t) k 2α 2 2 m X

l=1 Pk,l(w(t) l ew(t) l ), f w(t) k ; z(t) k,ζt

f ew(t) k , z(t) k,ζt

k=1 1z(t) k,ζt= z(t) k,ζtηt L w(t) k ew(t) k 2α 2 + 2 m X

l=1 Pk,l(w(t) l ew(t) l ) 2 f(w(t) k ; z(t) k,ζt) f(ew(t) k ; z(t) k,ζt) 2

k=1 1z(t) k,ζt= z(t) k,ζtηt L w(t) k ew(t) k 2α 2 + m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2 + f(w(t) k ; z(t) k,ζt) f(ew(t) k ; z(t) k,ζt) 2 2

k=1 1z(t) k,ζt= z(t) k,ζtηt 2L w(t) k ew(t) k 2α 2 | {z } T3.1

l=1 Pk,l(w(t) l ew(t) l ) 2 2 . (D.12)

(4) A simple combination of Inequality (D.7), Inequality (D.9) and Inequality (D.10) provides the following:

w(t+1) k ew(t+1) k 2 2 = T1 + T2 + T3

k=1 1z(t) k,ζt = z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ) 2 2 + (1 + p 1)c2 α,1ηt 2 f 2α 1+α (w(t) k ; z(t) k,ζt) + f 2α 1+α (ew(t) k , z(t) k,ζt)

k=1 1z(t) k,ζt= z(t) k,ζtηt 2L w(t) k ew(t) k 2α 2 + m X

l=1 Pk,l(w(t) l ew(t) l ) 2 2 . (D.13)

S(i) k = {zk,1, . . . , zi,k, . . . , zk,n} differs from Sk by only the i-th element. Consequently, at the t-th iterate, with a probability of 1 1

n, the example z(t) k,ζt selected by D-SGD on worker k in both Sk and S(i) k is the same (i.e. z(t) k,ζt = z(t) k,ζt), and with a

probability of 1

n, the selected example is different (i.e. z(t) k,ζt = z(t) k,ζt).

Since A is independent of k, EA(T2.1) and EA(T3.1) can be controlled accordingly as follows:

EA(T2.1) = EA p

k=1 1z(t) k,ζt = z(t) k,ζt

l=1 Pk,l(w(t) l ew(t) l ) 2 2

l=1 Pk,l(w(t) l ew(t) l ) 2 2 Inequality (D.7)

p nd(σ2 + µ2) (m 1)λ2 + 1 (D.14)

where the proof of the last inequality is analogous to part (1).

By the concavity of the mapping x 7 xα (α [0, 1]), we have

EA(T3.1) 2ηt L(1 1

w(t) k ew(t) k 2α 2

w(t) k ew(t) k 2 2 }α 2ηt L(1 1

w(t) k ew(t) k 2 2 . (D.15)

Topology-aware Generalization of Decentralized SGD

The last inequality holds if m 1 dµ2 0 , where dµ2 0 is the lower bound of µt,k 2 2 (k = 1 . . . m), which leads to

EA Pm k=1 w(t) k ew(t) k 2 2 1. The condition m 1 dµ2 0 can be easily satisfied in training overparameterized models in a decentralized manner, since both m and d are large in these cases.

Therefore, taking the expectation on both sides of Inequality (D.13) provides

w(t+1) k ew(t+1) k 2 2 2ηt L(1 1

w(t) k ew(t) k 2 2

n)ηt]d(σ2+µ2) (m 1)λ2 + 1 + 1

(1+p 1)c2 α,1ηt 2 f

2α 1+α (w(t) k ; z(t) k,ζt)+f 2α 1+α (ew(t) k , z(t) k,ζt) .

Knowing that z(t) k,ζt and z(t) k,ζt follow the same distribution, we have

ESk,S(i) k ,A f 2α 1+α (ew(t) k ; z(t) k,ζt) = ESk,A f 2α 1+α (w(t) k ; z(t) k,ζt) .

Note that {ηt} is an non-increasing sequence. As a consequence,

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2 2η0L(1 1

C (scaling coefficient)

ESk,S(i) k ,A m X

w(t) k ew(t) k 2 2

n)ηt]d(σ2 + µ2) (m 1)λ2 + 1

| {z } topology-dependent

n (1 + p 1)c2 α,1ηt 2 m X

k=1 ESk,A f 2α 1+α (w(t) k ; z(t) k,ζt)

| {z } topology-independent

Multiplying both sides of Inequality (D.17) with C (t+1) provides

C (t+1) m X

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2 C t ESk,S(i) k ,A m X

w(t) k ew(t) k 2 2

+ C (t+1) [1 + p

n)ηt]d(σ2 + µ2) (m 1)λ2 + 1 + 2

n (1 + p 1)c2 α,1ηt 2 m X

k=1 ESk,A f 2α 1+α (w(t) k ; z(t) k,ζt) .

Taking the summation over the iteration τ, we can write

τ=0 C (τ+1) m X

k=1 ESk,S(i) k ,A w(τ+1) k ew(τ+1) k 2 2

τ=0 C τESk,S(i) k ,A m X

w(τ) k ew(τ) k 2 2

τ=0 C (τ+1) [1+ p

n)ητ]d(σ2+µ2) (m 1)λ2+1 + 2

n (1+p 1)c2 α,1ητ 2 m X

k=1 ESk,A f 2α 1+α (w(τ) k ; z(τ) k,ζt) .

Since for all k, w1(k) = ew1(k) = 0 (see Definition Page 3), we have

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

τ=0 Ct τ [1+ p

n)ητ]d(σ2+µ2) (m 1)λ2+1 + 2

n (1+p 1)c2 α,1ητ 2 m X

k=1 ESk,A f 2α 1+α (w(τ) k ; z(τ) k,ζt) . (D.20)

Topology-aware Generalization of Decentralized SGD

Since the mapping x 7 x 2α 1+α is concave, we have 1

n Pn i=1 f 2α 1+α (w(τ) k , z(τ) k,ζτ ) F

2α 1+α Sk (w(τ) k ). Then the distributed on-average stability of D-SGD can be controlled as follows:

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

τ=0 Ct τ [1+ p

n)ητ]d(σ2+µ2) (1 1

n (1+p 1)c2 α,1ητ 2 1

k=1 ESk,A F

2α 1+α Sk (w(τ) k ) . (D.21)

The proof is complete.

Proof of Corollary 2.

With constant step size ηt η 1 2L(1 2

m), Pt τ=0 Ct τ can be written as

τ=0 [2ηL(1 1

τ=0 [2ηL(1 1

τ = 1 [2ηL(1 1

Consequently, the distributed on-average stability of D-SGD becomes:

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

n)η]d(σ2+µ2) (1 1

n (1+p 1)c2 α,1η2ϵS] , (D.22)

where ϵS denotes the upper bound of 1

m Pm k=1 ESk,A F

2α 1+α Sk (w(t) k ) t.

When t , we further get

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

1 1 2ηL(1 1

n)ητ]d(σ2+µ2) (1 1

n (1+p 1)c2 α,1ητ 2ϵS] . (D.23)

D.3. Generalization of D-SGD

Proof of Lemma 3.

We denote A(S) as the model produced by algorithm A based on the training dataset S.

To begin with, we can write

ES,A F(A(S)) FS(A(S)) = ES,S(i),A 1

F(A(S(i))) FS(A(S))

= ES,S(i),A 1

E z D(f(A(S(i)); z)) f(A(S); zi) (D.24)

= ES,S(i),A 1

f(A(S(i)); zi) f(A(S); zi) , (D.25)

Topology-aware Generalization of Decentralized SGD

where the first line follows from noticing that ES,A F(A(S)) = ES(i),A F(A(S(i))) and the last identity holds since A(S(i)) is independent of zi and thus ES,S(i),A E z S(i)(f(A(S(i)); z) = ES,S(i),A f(A(S(i)); zi) .

The H older continuity of f and the concavity of the x 7 x α

2 further guarantees

ES,A F(A(S)) FS(A(S)) ES,S(i),A 1

i=1 L A(S) A(S(i)) α 2

= ES,S(i),A L N 1 α

i=1 A(S) A(S(i)) 2 2 α

Finally, consider 1

m Pm k=1 w(t) k as an output of algorithm A on dataset S, we can complete the proposition by the convexity of vector 2-norm and square function:

k=1 w(t) k ) FS( 1

k=1 ew(t) k ) ES,S(i),A L (mn)1 α

k=1 w(t) k 1

k=1 ew(t) k 2 2 α

i=1 ESk,S(i) k ,A w(t) k ew(t) k 2 2 α/2. (D.27)

Proof of Theorem 4.

We start by rewriting Inequality (D.21) as

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

τ=0 Ct τ [1+ p

n)ητ]d(σ2+µ2) (m 1)λ2+1 + 1

n 2(1+p 1)c2 α,1ητ 2 m X

k=1 ESk,A F

2α 1+α Sk (w(τ) k ) . (D.28)

To facilitate subsequent analysis, we denote Tdec = Pt τ=0 Ct τ[1+ p

n)ητ]d(σ2+µ2) (m 1)λ2+1 and Tavg = Pt τ=0 Ct τ 2(1+p 1)c2 α,1ητ 2 Pm k=1 ESk,A F

2α 1+α Sk (w(τ) k ) .

A combination of Inequality (D.21) and Lemma 3 yields

k=1 w(t+1) k ) FS( 1

k=1 w(t+1) k ) L mn1 α

n Tavg + Tdec)α/2. (D.29)

Since the inequality (1 + x) α

2 holds for all x 1 and α [0, 1], the right hand side of Inequality (D.28) can be bounded as

n Tavg + Tdec) α

2 = L mn1 α

2 1 + Tdecn) α

2 [1 + (Tdecn) α

2 avg mn [1 + (Tdecn) α

2 avg n + Tavg T

2 1]. (D.30)

Consequently, the generalization bound of D-SGD can be controlled as

k=1 w(t+1) k ) FS( 1

k=1 w(t+1) k ) L

2 avg n + T

τ=0 Ct τ2(1+p 1)c2 α,1ητ 2ϵS

τ=0 Ct τ[(1 1

Topology-aware Generalization of Decentralized SGD

where ϵS denotes the upper bound of 1

m Pm k=1 ESk,A F

2α 1+α Sk (w(t) k ) t.

If we set the ηt η 1 2L, Inequality (D.31) can be written as

k=1 w(t+1) k ) FS( 1

k=1 w(t+1) k ) 1

2 S N ) + O(Ln α

2 S N ) + O(Ln α

2 N (λα+m α

2 ))}, (D.32)

which completes the proof.

D.4. Implications

Proof of Corollary 5.

Inequality (D.21) shows that in the smooth settings (α = 1), the distributed on-average stability of D-SGD is bounded as

k=1 ESk,S(i) k ,A w(t+1) k ew(t+1) k 2 2

τ=0 Ct τ [1+ p

n)ητ]d(σ2+µ2) (1 1

n (1+p 1)c2 1,1ητ 2 1

k=1 ESk,A FSk(w(τ) k ) ,

where C = 2ηL(1 1

Our goal is to prove that the upper bound of the stability increase with the number of iterations that we start to control the consensus distance .

According to the descent lemma in Koloskova et al. (2020), the empirical risk of the consensus model can be bounded by the consensus distance as follows:

EAf(w(τ+1); z(τ+1) ζt ) EAf(w(τ); z(τ+1) ζt ) + ηL2 1

k=1 w(τ) k w(τ) 2 2 | {z } consensus distance

+EA L n η2 1

k=1 f(w(τ) k ; z(τ) ζτ ) 2 2 | {z } norm of the averaged gradient

(D.34) where w(τ) = 1

m Pm k=1 w(τ) k .

Due to the fact that the gradient of f w.r.t. the first parameter is bounded by B and the square of the vector 2-norm 2 2 is convex, we have 1 m

k=1 ESk,A FSk(w(τ)) ηL2 τ X

k=1 w(ν) w(ν) k 2 2 + L

mη2τB2. (D.35)

To connect the stability upper bound in Inequality (D.33), we perform the Taylor expansion of 1

m Pm k=1 ESk,A FSk( )

around w(τ):

k=1 ESk,A FSk(w(τ) k ) = 1

k=1 ESk,A FSk(w(τ)) + 1

ESk,A FSk(w(τ)) T(w(τ) k w(τ))

+ (w(τ) k w(τ))T 1

k=1 ESk,A 2FSk(w(τ)) (w(τ) k w(τ)) + O( w(τ) k w(τ) 3 2).

Topology-aware Generalization of Decentralized SGD

According to Assumption A.1, the gradient of FSk w.r.t. the first parameter is bounded by B. Consequently, the averaged empirical loss 1

m Pm k=1 ESk,A FSk(w(τ) k ) can be bounded as

k=1 ESk,A FSk(w(τ) k ) 1

k=1 ESk,A FSk(w(τ)) + 1

k=1 (w(τ) k w(τ))

FSk(w(τ)) 2 | {z } B

+ (w(τ) k w(τ))T 1

k=1 ESk,A 2FSk(w(τ)) (w(τ) k w(τ)) + O( w(τ) k w(τ) 3 2)

k=1 ESk,A FSk(w(τ)) + L 1

k=1 w(τ) k w(τ) 2 2 | {z } consensus distance

+O( w(τ) k w(τ) 3 2). (D.37)

The last inequality holds since the smooth condition

f(x) f(y) 2 L x y 2 2f LI, (D.38)

and thus we have

(w(τ) k w(τ))T 1

k=1 ESk,A 2FSk(w(τ)) (w(τ) k w(τ))

k=1 (w(τ) k w(τ))TI(w(τ) k w(τ)) = L 1

k=1 w(τ) k w(τ) 2 2. (D.39)

If we omit the third-order difference, a combination of Inequality (D.35) and Inequality (D.37) provides

k=1 ESk,A FSk(w(τ) k ) L 1

k=1 w(τ) k w(τ) 2 2 + ηL2

k=1 w(ν) w(ν) k 2 2 + L

mη2τB2. (D.40)

This inequality would suffice to prove that the distributed on-average stability increase with the accumulation of the consensus distance .

Suppose that the consensus distance is controlled below the critical consensus distance Γ2 from tΓ-th iterate to the end of the training. For simplicity, we make a mild assumption that the consensus distance Γ2 1

m Pm k=1 w(t) k w(t) 2 2 K2

if t tΓ. Therefore, the averged empirical risk at τ-th iterate can be bounded as

k=1 ESk,A FSk(w(τ) k ) LΓ2 + ηL2

k=1 w(ν) w(ν) k 2 2 + ηL2(τ tΓ)Γ2 + L

LΓ2 + ηL2 tΓ X

k=1 w(ν) w(ν) k 2 2 Γ2 + (ηL2Γ2 + L

LΓ2 + ηL2(tΓ + 1) K2 Γ2 + (ηL2Γ2 + L

mη2B2) τ, (D.41)

if τ is greater than tΓ; and

k=1 ESk,A FSk(w(τ) k ) L 1

k=1 w(τ) k w(τ) 2 2 + ηL2

k=1 w(ν) w(ν) k 2 2 + L

LK2 + (ηL2K2 + L

mη2B2) τ, (D.42)

if if τ is smaller than tΓ.

Topology-aware Generalization of Decentralized SGD

Consequently,

k=1 ESk,A FSk(w(τ) k ) = (

τ=tΓ+1 )Ct τ 1

k=1 ESk,A FSk(w(τ) k )

τ=0 τCt τ + ηL2K2 tΓ X

τ=0 τCt τ + ηL2Γ2 t X

τ=tΓ+1 τCt τ

τ=0 Ct τ) + [LΓ2 + ηL2(tΓ + 1) K2 Γ2 ](

τ=tΓ+1 Ct τ). (D.43)

Recall that our goal is to prove that G(tΓ) increase with tΓ.

Due to the fact that Pt τ=0 τCt τ = Pt τ=0(t τ)Cτ t Pt τ=0 Cτ, we can obtain

1 C + ηL2K2 tΓ X

τ=0 τCt τ + ηL2Γ2 t X

τ=tΓ+1 τCt τ

1 C + [LΓ2 + ηL2(tΓ + 1) K2 Γ2 ]Ct CtΓ+1

1 C . (D.44)

Since the finite sum of the arithmetico-geometric sequence can be written as

τ=0 τC τ = [ t C 1 1 1 (C 1 1)2 ]C (t+1) + [ 1 (C 1 1)2 + 1 C 1 1], (D.45)

we can upper bound G(tΓ) as follows:

1 C + ηL2Γ2Ct{[ t C 1 1 1 (C 1 1)2 ]C (t+1) + [ 1 (C 1 1)2 + 1 C 1 1]}

+ ηL2(K2 Γ2)Ct{[ t C 1 1 1 (C 1 1)2 ]C (tΓ+1) + [ 1 (C 1 1)2 + 1 C 1 1]}

1 C + [LΓ2 + ηL2(tΓ + 1) K2 Γ2 ]Ct CtΓ+1

1 C . (D.46)

Rewrite the inequality above, then we arrive at

1 C +ηL2Γ2Ct[ t C 1 1 1 (C 1 1)2 ]C (t+1)+Ct LK2

+ ηL2(K2 Γ2)Ct{[ t C 1 1 1 (C 1 1)2 ]C (tΓ + 1) + tΓ CtΓ+1

1 C } + [LΓ2 + ηL2 K2 Γ2 ]Ct CtΓ+1

One can prove that if t C 2 ln C , the upper bound of G(tΓ) will be a monotonically increasing function of tΓ. Consequently, we can conclude that the distributed on-average stability bound and the generalization bound of D-SGD increase monotonically with tΓ if the total number of iterations satisfies t C 2 ln C .