# scalable_decentralized_learning_with_teleportation__285291f9.pdf

Published as a conference paper at ICLR 2025

SCALABLE DECENTRALIZED LEARNING WITH TELEPORTATION

Yuki Takezawa Kyoto University, OIST Sebastian U. Stich CISPA Helmholtz Center for Information Security

Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from previous active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we showed that TELEPORTATION can train neural networks more stably and achieve higher accuracy than Decentralized SGD.

1 INTRODUCTION

Distributed learning has emerged as an important paradigm for privacy preservation and large-scale machine learning. In centralized approaches, such as federated learning (Mc Mahan et al., 2017; Kairouz et al., 2019) and All-Reduce, nodes update their parameters by using their local dataset and then compute the average parameters of these nodes. Since computing the average of many nodes incurs huge communication costs, communication is the major bottleneck in the training time. In reducing communication costs, decentralized learning has attracted significant attention (Lian et al., 2017; 2018; Assran et al., 2019; Koloskova et al., 2020b). In decentralized learning, a node exchanges parameters with its few neighboring nodes and then computes their weighted average to approximate the average of all nodes. This procedure is called gossip averaging. Since each node only needs to communicate with a few neighboring nodes, decentralized learning is more communication efficient than centralized counterparts (Lian et al., 2017; Assran et al., 2019; Ying et al., 2021).

While decentralized learning can run with low communication costs, the degradation of the convergence rate due to its sparse communication characteristics is a trade-off (Koloskova et al., 2020b). Especially, when the number of nodes is large, the parameters held by each node are likely to be far away during the training, and gossip averaging deteriorates the convergence rate. This makes it challenging for decentralized learning to scale to a large number of nodes.

In decentralized learning literature, it is commonly assumed that communication can only occur on a given topology. However, in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible, e.g., in a data center or the case where nodes are connected on the Internet. Since we can choose which topology to use in these cases, many prior studies designed topologies to reconcile the communication efficiency and convergence rate, attempting to alleviate the degradation of the convergence rate caused by a large number of nodes (Wang et al., 2019; Ying et al., 2021; Ding et al., 2023; Takezawa et al., 2023b).

This work was done while YT visited the CISPA Helmholtz Center for Information Security.

Published as a conference paper at ICLR 2025

Specifically, the communication costs are determined by the maximum degree of the topology, and the convergence rate is determined by its spectral gap (Wang et al., 2019; Koloskova et al., 2020b). Thus, the prior studies designed topologies with large spectral gaps and small maximum degrees and proposed performing gossip averaging on these topologies. However, the approximate average estimated by gossip averaging deviates from the exact average of all nodes as the number of nodes increases, and the convergence rate remains to degrade when the number of nodes is substantial.

In this study, we ask the following question: Can we develop a decentralized learning method whose convergence rate does not degrade as the number of nodes increases? Our work yielded an affirmative answer with the proposal of TELEPORTATION. TELEPORTATION can completely eliminate the degradation of the convergence rate caused by increasing the number of nodes, and the convergence rate is consistently improved as the number of nodes increases. Specifically, TELEPORTATION activates only a subset of nodes and initializes their parameters with the parameters of previous active nodes. Then, the active node updates its parameters by SGD and performs gossip averaging on a relatively small topology comprising only the active nodes. By activating only a proper number of nodes, gossip averaging can make the parameters of each node reach the consensus fast, and the parameters of each node can be prevented from being far away even when the total number of nodes is large. We show that by activating an appropriate number of nodes, the degradation of the convergence rate can be completely alleviated. Furthermore, we propose an efficient hyperparameter-tuning method to search for the proper number of active nodes. For an arbitrary number of nodes, the proposed hyperparameter-tuning method can find the appropriate number of nodes to be activated within 2T iterations in total, where T is the number of iterations to run TELEPORTATION once. Experimentally, we demonstrated that TELEPORTATION can converge faster than Decentralized SGD and train neural networks more stably.

Note: In this work, we consider a setting where any two nodes can exchange parameters as required, similar to previous studies on topology design for decentralized learning (Marfoq et al., 2020; Ying et al., 2021; Song et al., 2022; Takezawa et al., 2023b; Ding et al., 2023). A data center is an example, and Assran et al. (2019) demonstrated that decentralized methods can train neural networks faster than All-Reduce. In addition to a data center, when nodes are connected to the Internet, any two nodes can communicate by specifying the IP address. TELEPORTATION is not applicable in the presence of a pair of nodes that cannot communicate, as in wireless sensor networks.

Notation: A graph G is represented by (V, E) where V is a set of nodes and E is a set of edges. For simplicity, we also write ({1, , n}, E) to represent a graph G where n is the number of nodes. Id denotes the d-dimentional identify matrix, 0 denotes a vector with all zeros, and 1 denotes a vector with all ones.

2 PRELIMINARY

Problem Setting: We consider the following problem where the loss functions are distributed among n nodes:

i=1 fi(x) , fi(x) := Eξi Di [Fi(x; ξi)] , (1)

where x is a model parameter, Di is a training dataset that the i-th node has, ξi is a date sample following Di, and the local loss function fi : Rd R is defined as the expectation of Fi( ; ξi) over data sample ξi. Following previous works (Koloskova et al., 2020a; Yuan et al., 2021; Lin et al., 2021), we assume that the loss functions satisfy the following assumptions.

Assumption 1. There exists f > that satisfies f(x) f for any x Rd.

Assumption 2. There exists L 0 that satisfies for any x, y Rd and i {1, 2, , n},

fi(x) fi(y) L x y . (2)

Assumption 3. There exists σ 0 that satisfies for any x Rd and i {1, 2, , n},

E Fi(x; ξi) fi(x) 2 σ2. (3)

Published as a conference paper at ICLR 2025

Assumption 4. There exists ζ 0 that satisfies for any x Rd,

i=1 fi(x) f(x) 2 ζ2. (4)

Decentralized SGD: One of the most fundamental decentralized learning methods to solve Eq. (1) is Decentralized SGD (Lian et al., 2017). Let xi Rd denote the i-th node parameter. Once the topology comprising n nodes Gn = ({1, , n}, E) is specified, Decentralized SGD updates xi as follows:

j=1 Wij x(t) j η Fj(x(t) j ; ξ(t) j ) ,

where η > 0 is the step size and Wij [0, 1] is the weight of edge (i, j) E that satisfies Pn i=1 Wij = Pn j=1 Wij = 1. The following assumption is commonly used to represent how well the topology Gn is connected (Yuan et al., 2021; Koloskova et al., 2020b; 2021; Aketi et al., 2023). Assumption 5. Let λi is the i-th largest eigenvalue of W [0, 1]n n. W is a doubly stochastic matrix, and pn := 1 max{|λ2|, |λn|}2 satisfies pn (0, 1]. That is, it holds that XW X 2 F (1 pn) X X 2 F ,

for any X Rd n where X := 1

We write pn to emphasize that pn depends on n. A small pn indicates that the topology is poorly connected, and a large pn indicates that the topology is well connected. Generally, pn approaches zero as the number of node n increases. For instance, when a ring is used as the underlying topology, pn is Ω(n 2), which rapidly decreases as n increases (Nedi c et al., 2018). Under these assumptions, Decentralized SGD converges with the following rate. Proposition 1 (Koloskova et al. (2020b)). Suppose that Assumptions 1, 2, 3, 4, and 5 hold. Let {{x(t) i }n i=1}T t=0 denote the parameters generated by Decentralized SGD. Then, there exists the step size η that satisfies:

t=0 E f( x(t)) 2 O

n T + L2r2 0(pnσ2 + ζ2)(1 pn)

where r0 := f( x(0)) f and x(t) := 1

n Pn i=1 x(t) i .

As the number of nodes n increases, the first term O( q

n T ) is improved, while the second and third terms degrade since pn reaches zero. Thus, as the number of nodes n increases substantially, the second and third terms dominate the convergence rate, and the convergence rate deteriorates. In many practical cases, Gn merely represents a preferred communication pattern, and any two nodes can communicate as needed, as introduced in Sec. 1. Since we can choose which topology to use in these cases, many recent studies have attempted to alleviate the convergence rate degradation by proposing topologies with large pn (Ying et al., 2021; Ding et al., 2023; Takezawa et al., 2023b). However, their pn still approach zero as n increases, and the convergence rate degrades when n is substantially large. Table 2 in Sec. D lists the convergence rates of Decentralized SGD with various topologies.

3 PROPOSED METHOD

In this section, we propose TELEPORTATION, which can completely eliminate the degradation of the convergence rate caused by increasing the number of nodes. In the remainder of this section, we describe TELEPORTATION in Sec. 3.1 and analyze its convergence rate in Sec. 3.2. Then, we propose an efficient hyperparameter search method in Sec. 3.3 and analyze its convergence rate in Sec. 3.4.

3.1 TELEPORTATION

pn represents the speed at which the gossip averaging makes the parameter of each node {xi}n i=1 reach the average 1

n Pn i=1 xi. As the number of nodes n increases, averaging all node parameters

Published as a conference paper at ICLR 2025

Algorithm 1 Simple version of TELEPORTATION

1: Input: the number of nodes n, set of nodes Vn, number of active nodes k, total number of iteration T, step size η, and topology comprising k active nodes Gk = ({1, , k}, E). 2: for t {0, 1, , T} do

3: sample next active k nodes V (t) active from Vn without replacement, and assign {1, 2, , k} to variables {token_id(t) i | vi V (t) active} randomly without overlap. 4: for i {1, 2, , n} in parallel do

5: if vi V (t) active thena

6: if t > 0 then 7: Let vj V (t 1) active denote the node such that token_id(t 1) j = token_id(t) i .

8: receive x(t) j from vj V (t 1) active , and x(t) i x(t) j . 9: end if 10: y(t) i x(t) i η Fi(x(t) i ; ξ(t) i ).

11: receive y(t) j from vj {vj V (t) active | (token_id(t) i , token_id(t) j ) E}.

12: x(t+1) i P

vj V (t) active Wtoken_id(t) i ,token_id(t) j y(t) j .

13: end if 14: end for 15: end for

a The update rule of the parameters of the inactive nodes is not described because the parameters held by the inactive nodes are discarded and initialized with the parameters of the other active nodes when they are activated in line 8. The parameters of inactive nodes do not affect the behavior of active node parameters.

becomes difficult with gossip averaging, and pn reaches zero. Thus, as n increases, the parameters of each node {xi}n i=1 comes to be apart during the training, and the convergence rate deteriorates, as discussed in Sec. 2.

To prevent the parameters in each node {xi}n i=1 from being away when the number of nodes n is large, we propose updating the parameters of only a small subset of nodes and performing gossip averaging on a small topology. We refer to the proposed method as TELEPORTATION.1 Let k {1, , n} denote the number of active nodes and Gk = ({1, , k}, E) be the topology comprising k active nodes. Essentially, TELEPORTATION consists of the following three steps:

(1) We sample k active nodes V (t) active from Vn without replacement and assign {1, 2, , k} to variables {token_id(t) i | vi V (t) active} without overlap.2

(2) The active node vi V (t) active fetches the parameters from the previous active node vj V (t 1) active whose token_id(t 1) j stores the same value as that of token_id(t) i . Then, the active node vi initializes its parameters with the fetched parameter and updates them by SGD.

(3) The active node vi V (t) active exchanges its parameters with its neighbors {vj V (t) active | (token_id(t) i , token_id(t) j ) E} and computes the weighted average with them.

We show the pseudo-code and illustration in Alg. 1 and Fig. 1. In the above implementation, the first step does not start until the third step is completed. We can solve this issue so that gossip averaging in the third step is performed on the next active node vj, and the first and third steps can be executed simultaneously. We show the optimized version in Sec. A.

In TELEPORTATION, gossip averaging is performed on a relatively small topology Gk, which can make the parameters of k active nodes reach the average fast. Therefore, using the proper number of active nodes k, we can mitigate the degradation of the convergence rate by increasing the number of nodes n. In the next section, we analyze the convergence rate and show that our proposed method successfully circumvents the negative effect of increasing the number of nodes n.

1We named our proposed method TELEPORTATION to reflect the manner in which the active nodes copy their parameters to the next active nodes. 2The first step can be performed in a decentralized manner by sharing the same seed value to sample active nodes before starting the training.

Published as a conference paper at ICLR 2025

Figure 1: Illustration of Alg. 1 with n = 8 and k = 3. We use a line as the topology consisting of active nodes Gk = ({1, 2, 3}, {(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)}). The black nodes represent active nodes, and the number written on the node is token_id(t) i . The blue nodes represent the next active nodes, and the number on the node is token_id(t+1) i . The first and third graphs from the left represent the communication in line 12, and the other graphs represent the communication in line 8.

Comparison with Client Sampling: TELEPORTATION randomly samples a subset of nodes and activates them, but TELEPORTATION is different from the client sampling (Liu et al., 2022). TELEPORTATION can prevent the parameters of nodes from being away by using the small k because gossip averaging is performed on the small topology Gk. In contrast to TELEPORTATION, client sampling incites the parameters of nodes to drift away because gossip averaging is performed on the topology comprising all nodes, and nodes can exchange parameters only when two neighboring nodes are sampled simultaneously, which makes it more difficult for gossip averaging to make the parameters reach the consensus. Thus, client sampling cannot alleviate the degradation of the convergence rate caused by a large number of nodes n and degrades the convergence rate as well as the vanilla Decentralized SGD (see Sec. E for more detailed discussion).

3.2 CONVERGENCE ANALYSIS OF TELEPORTATION

We assume that the topology comprising k active nodes Gk satisfies the following assumption. Assumption 6. Let λi is the i-th largest eigenvalue of W [0, 1]k k. W is a doubly stochastic matrix, and pk := 1 max{|λ2|, |λk|}2 satisfies pk (0, 1]. That is, it holds that XW X 2 F (1 pk) X X 2 F , (6)

for any X Rd k where X := 1

Note that W is a k k matrix, and pk depends on the number of active nodes k, but is independent of the total number of nodes n. Under Assumptions 1, 2, 3, 4, and 6, Theorem 1 provides the convergence rate of TELEPORTATION for any number of active nodes k.

Theorem 1. Suppose that Assumptions 1, 2, 3, 4, and 6 hold. Let {{x(t) i }n i=1}T t=0 denote the parameters generated by Alg. 1, and suppose that {x(0) i }n i=1 is initialized with the same parameter x(0). Then, for any number of active nodes k {1, 2, , n}, there exists the step size η such that 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

Lr0(σ2 + (1 k 1

k T + L2r2 0(σ2 + ζ2)(1 pk)

where x(t) active := 1

vi V (t) active x(t) i and r0 := f( x(0)) f .

By tuning the number of active nodes k, we obtain the following theorem. To determine the proper number of active nodes k, the explicit form of pk is necessary. Here, we show the rates when we use a ring and exponential graph (Ying et al., 2021) as Gk.

Theorem 2. Suppose that Assumptions 1, 2, 3, 4, and 6 hold. Let {{x(t) i }n i=1}T t=0 be the parameters generated by Alg. 1, and suppose that {x(0) i }n i=1 is initialized with the same parameter x(0).

Ring: Suppose that the active nodes are connected by a ring, i.e., pk = Ω(k 2). Then, if we set the number of active nodes k as follows:

(& T(σ2 + ζ2)

Published as a conference paper at ICLR 2025

there exists η such that 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

Lr0(σ2 + ζ2) 3 4 T

Lr0(σ2 + ζ2) 2 5 T

where x(t) active := 1

vi V (t) active x(t) i and r0 := f( x(0)) f .

Exp. Graph: Suppose that the active nodes are connected by an exponential graph, i.e., pk = Ω(log 1 2 (k)). Then, if we set the number of active nodes k as follows:

(& T(σ2 + ζ2)

& T(σ2 + ζ2)

there exists η such that 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

L2r2 0(σ2 + ζ2)

All proofs are presented in Sec. B. The first term in Eq. (7) is worse than the first term in Eq. (5), but Theorem 2 indicates that the first term becomes the same as that of Decentralized SGD by carefully tuning k. For both cases, only the first term depends on the number of nodes n, which is improved as n increases. Therefore, TELEPORTATION can completely eliminate the negative effect by increasing the number of nodes n, and its convergence rate is consistently improved as n increases. As discussed in Sec. 2, several previous studies have attempted to avoid convergence rate degradation caused by large n by designing topologies with large spectral gaps (Ying et al., 2021; Song et al., 2022; Takezawa et al., 2023b). However, no prior topology, except for a complete graph, can completely remove this degradation. By comparing with the complete graph, TELEPORTATION is more communication efficient. For instance, if a ring is used as Gk, each node only needs to communicate with three nodes per iteration. Therefore, TELEPORTATION is the first method that does not suffer from large n without sacrificing the communication efficiency.

3.3 EFFICIENT HYPERPARAMETER SEARCH FOR NUMBER OF ACTIVE NODES

TELEPORTATION can eliminate the convergence rate degradation caused by a large number of nodes n, while the number of active nodes k {1, , n} needs to be tuned carefully. If k is tuned by grid search, it requires n T iterations in total. This hyperparameter-tuning becomes expensive as n increases. To alleviate this issue, we further extend TELEPORTATION by proposing an efficient hyperparameter-tuning method for the number of active nodes k. The key idea for efficient hyperparameter-tuning is based on the following lemma. Lemma 1. Let k {1, 2, 3, , n} be the optimal number of active nodes. If k < n, there exists k {1, 2, 4, 8, , 2 log2(n+1) 1} that satisfies k

4 < k k . Furthermore, it holds that P log2(n+1) 1 i=0 2i n.

Algorithm 2 Efficient hyperparameter search for TELEPORTATION.

1: Input: the total number of iteration 2T, and the number of nodes n. 2: run Alg. 1 for T iterations with number of active nodes n, and let {{x(t) n,i}n i=1}T t=0 denote the generated parameters. 3: for k {1, 2, 22, 23, , 2 log2(n+1) 1} in parallel do 4: run Alg. 1 for T iterations with number of active nodes k. 5: end for 6: let {{x(t) k,i}vi V (t) active}T t=0 denote the parameters generated by Alg. 1 with k active nodes.

7: return the bestb parameters among k {1, 2, 22, , 2 log2(n+1) 1, n}.

b Theoretically, we select the parameters with the smallest gradient norm as in Theorem 2. In practice, we can select k that achieves the best accuracy on the validation datasets as in the vanilla grid search.

Published as a conference paper at ICLR 2025

Lemma 1 shows that for any optimal number of active nodes k n 1, a similar number of active nodes is contained in {1, 2, 22, , 2 log2(n+1) 1}. This suggests that we only need to search from {1, 2, 22, , 2 log2(n+1) 1, n} to obtain the appropriate k and we do not need run Alg. 1 for all k {1, 2, 3, , n}. Moreover, Lemma 1 indicates that Alg. 1 can be run with various k {1, 2, 22, , 2 log2(n+1) 1} in parallel because the total number of active nodes is less than or equal to n. Based on these observations, we propose an efficient hyperparameter-tuning method. The pseudo-code is shown in Alg. 2.

3.4 CONVERGENCE ANALYSIS OF TELEPORTATION WITH EFFICIENT HYPERPARAMETER SEARCH

Under the same assumptions as in Theorem 2, Theorem 3 provides the convergence rate of Alg. 2. The proof is deferred to Sec. C.

Theorem 3. Suppose that Assumptions 1, 2, 3, 4, and 6 hold. Let {{x(t) k,i}vi V (t) active}T t=0 denote the parameters of active nodes generated by Alg. 1 when the number of active nodes is set to k, and we define K := {1, 2, 22, 23, , 2 log2(n+1) 1, n}. Then, suppose that the parameters are initialized with the same parameter x(0).

Ring: If the active nodes are connected by a ring, i.e., pk = Ω(k 2), there exists η such that

mink K 1 T +1 PT t=0 E f( x(t) active,k) 2 is bounded from above by

Lr0(σ2 + ζ2) 3 4 T

Lr0(σ2 + ζ2) 2 5 T

where r0 := f( x(0)) f and x(t) active,k := 1

vi V (t) active x(t) k,i.

Exp. Graph: If the active nodes are connected by an exponential graph, i.e., pk = Ω(log 1 2 k), there

exists η such that mink K 1 T +1 PT t=0 E f( x(t) active,k) 2 is bounded from above by

L2r2 0(σ2 + ζ2)

Theorem 3 shows that Alg. 2 can achieve exactly the same convergence rate as the one shown in Theorem 2 (i.e., the convergence rate with optimal k). Algorithm 2 requires only 2T iterations to find the proper number of active nodes, whereas grid search requires n T iterations in total. Thus, Alg. 2 can find the appropriate number of active nodes more efficiently than the vanilla grid search.

4 RELATED WORK

Decentralized SGD and its Variants: The literature on decentralized learning can be traced back to Tsitsiklis (1984), and Decentralized SGD (Lian et al., 2017) is currently the most widely used for reasons of its simplicity. Recently, many researchers have improved Decentralized SGD in various aspects. Pu & Nedi c (2021), Yuan et al. (2019), Tang et al. (2018b), Yuan et al. (2021), Vogels et al. (2021), Takezawa et al. (2023a), Aketi et al. (2023), and Di et al. (2024) proposed decentralized learning methods that are robust to data heterogeneity. Tang et al. (2018a), Koloskova et al. (2019), Vogels et al. (2020), Kovalev et al. (2021), and Zhao et al. (2022) proposed communication compression methods. Liu et al. (2022) analyzed Decentralized SGD with client sampling.

Token Algorithm: Token algorithms (Johansson et al., 2007; Dorfman & Levy, 2022; Even, 2023; Hendrikx, 2023) are a variant of decentralized learning methods different from consensus-based decentralized learning methods, such as Decentralized SGD. In contrast to the consensus-based decentralized learning methods, a parameter called token randomly walks on the topology and is updated on each node. Since token algorithms have only one parameter, they do not suffer from the issue of parameters held by each node drifting away. However, their convergence rates cannot

Published as a conference paper at ICLR 2025

achieve linear speedup as that in Decentralized SGD (Even, 2023). In TELEPORTATION, k parameters randomly move from previous active nodes to the next active nodes and are updated on the active nodes. Then, by carefully selecting k, TELEPORTATION can alleviate the issue of parameter drifting without sacrificing linear speedup property.

Client Sampling: Client sampling is widely studied in federated learning to reduce the communication costs between the central server and nodes (Mc Mahan et al., 2017; Fraboni et al., 2021; Wu et al., 2023). Mc Mahan et al. (2017) proposed sampling a subset of nodes randomly. Cho et al. (2020) proposed selecting nodes according to the loss values. Chen et al. (2022) and Wang et al. (2023) proposed sampling nodes according to their gradient norms. In decentralized learning, Liu et al. (2022) studied client sampling and analyzed the convergence rate. However, as discussed in Sec. 3.1, the client sampling cannot alleviate the degradation of the convergence rate caused by a large number of nodes.

5 EXPERIMENT

5.1 SYNTHETIC EXPERIMENT

Setting: We followed the experimental setting in Koloskova et al. (2020b) and set the number of nodes n to 100 and loss function as fi(x) := 1

2 Ai(x bi) 2 with Ai := i n Id and d = 50. bi

was drawn from N(0, ζ2

i2 Id) for each node. We defined the stochastic gradient as Fi(x; ξ) := fi(x) + ϵ where ϵ was drawn from N(0, σ2

d Id) at each iteration. We used a ring and Base-2 Graph (Takezawa et al., 2023b) as the topology. The ring is one of the most commonly used topologies for decentralized learning. The Base-2 Graph is the state-of-the-art topology for Decentralized SGD, and Takezawa et al. (2023b) demonstrated that the Base-2 Graph can be superior to the various topologies, e.g., an exponential graph, 1-peer exponential graph (Ying et al., 2021), and Equi Topo (Song et al., 2022). Note that the Base-2 Graph assumes that any two nodes can communicate as in TELEPORTATION. For each setting, we tuned the step size to reach the target accuracy as few iterations as possible. See Sec. G for a more detailed setting.

101 102 103 104 105 10 4

³ 2 = 0; ¾2 = 0

101 102 103 104 105 10 4

³ 2 = 10; ¾2 = 0

Teleportation (Ring) Teleportation (Base-2 Graph)

Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

101 102 103 104 105 10 4

³ 2 = 100; ¾2 = 0

101 102 103 104 105 10 4

³ 2 = 0; ¾2 = 10

101 102 103 104 105 10 4

³ 2 = 10; ¾2 = 10

101 102 103 104 105 10 4

³ 2 = 100; ¾2 = 10

101 102 103 104 105 Iterations

³ 2 = 0; ¾2 = 100

101 102 103 104 105 Iterations

³ 2 = 10; ¾2 = 100

101 102 103 104 105 Iterations

³ 2 = 100; ¾2 = 100

Figure 2: Convergence of the error to the target accuracy 0.001 for different stochastic noise σ2 and heterogeneity ζ2. We plotted 1

vi V (t) active x(t) i x 2 and 1

n Pn i=1 x(t) i x 2 as the error for TELEPORTATION and Decentralized SGD, respectively. TELEPORTATION consistently reached the target accuracy faster than Decentralized SGD.

Published as a conference paper at ICLR 2025

Table 1: The number of active nodes k selected by Alg. 2 in Fig. 2. The left value is the number of active nodes when the topology is a ring, and the right value is the number when the topology is the Base-2 Graph.

ζ2 = 0 ζ2 = 10 ζ2 = 100

σ2 = 0 1 / 1 1 / 1 4 / 4 σ2 = 10 1 / 1 1 / 1 8 / 8 σ2 = 100 8 / 8 8 / 8 4 / 16

Results: We depict the results in Fig. 2. For all cases, TELEPORTATION required fewer iterations to reach the target accuracy than Decentralized SGD. By comparing Decentralized SGD with a ring, TELEPORTATION converged up to three orders of magnitude faster than Decentralized SGD. Even in the case when the state-of-the-art topology, Base-2 Graph, is used, TELEPORTATION converged up to 10 times faster than Decentralized SGD. Table 1 lists the number of active nodes k selected by Alg. 2. Although the total number of nodes n is 100, a small number of nodes k was selected. Therefore, activating only a few nodes can prevent the parameters from being far away and lead to a faster convergence rate than that of Decentralized SGD.

5.2 NEURAL NETWORKS

Setting: We used Fashion MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009) as datasets and used Le Net (Le Cun et al., 1998) and VGG (Simonyan & Zisserman, 2015) as neural networks, respectively. To use the momentum in TELEPORTATION, the momentum is copied from the previous active nodes to the next active nodes, as well as the parameters. We set the number of nodes n to 25 and distributed the data to nodes using Dirichlet distribution with parameter α (Hsu et al., 2019). As α approaches zero, each node comes to have a different dataset. We repeated all experiments with three different seed values, reporting the averages. See Sec. G for a more detailed setting.

Results: We depict the results in Fig. 3. When α = 0.1, TELEPORTATION outperformed Decentralized SGD and trained the neural networks more stably. When α = 10.0, all comparison methods achieved a competitive accuracy. By comparing the results with α = 0.1 and 10.0, the accuracy curves of Decentralized SGD became unstable in the heterogeneous setting, whereas the accuracy curves of TELEPORTATION were stable in both cases. This is because Decentralized SGD suffers from client drift, while TELEPORTATION can suppress it by initializing the parameters of active nodes by using the parameters of other nodes.

0 50 100 150 200 Epoch

Fashion MNIST

0 100 200 300 400 500 Epoch

Teleportation (Ring) Teleportation (Base-2 Graph) Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

(a) α = 10.0 (homogeneous case)

0 50 100 150 200 Epoch

Fashion MNIST

0 100 200 300 400 500 Epoch

Teleportation (Ring) Teleportation (Base-2 Graph) Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

(b) α = 0.1 (heterogeneous case) Figure 3: Test accuracy of TELEPORTATION and Decentralized SGD for different heterogeneity. All methods achieved competitive accuracy in the homogeneous case, while TELEPORTATION outperformed Decentralized SGD in the heterogeneous case.

Published as a conference paper at ICLR 2025

0 500 1000 1500 Second

0 500 1000 1500 Second

Teleportation (Ring) Teleportation (Base-2 Graph) Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

Figure 4: Test accuracy of TELEPORTATION and Decentralized SGD under the heterogeneous networks with τ = 5. Decentralized SGD with the ring reached a high accuracy faster than the other methods in the homogeneous case, while TELEPORTATION reached a high accuracy faster in the heterogeneous case. Note that the number of epochs was set the same for all methods.

5.3 COMPARISON UNDER HETEROGENEOUS NETWORKS

Next, we examine the effectiveness of TELEPORTATION in terms of wallclock time. In TELEPORTATION, active nodes send the parameters to the next active nodes, which are randomly sampled from the entire set of nodes. Thus, the communication may happen between distant nodes, e.g., nodes located in distant regions. In this section, we evaluate TELEPORTATION under the heterogeneous networks where communication costs differ between pairs of nodes.

Setting: We used Fashion MNIST and Le Net as with Sec. 5.2. To simulate a heterogeneous network, we added τ mod(|i j|, 24) µs delay when nodes i and j communicate, where mod(a, b) is the remainder of dividing a by b. We constructed a ring by connecting node i to node (1+mod(i, 25)). This ring is the optimal topology in terms of the communication delay, and the communication in Decentralized SGD with the ring was delayed by τ µs, whereas the communication in TELEPORTATION was delayed by at most 24τ µs. The communication in Decentralized SGD with the Base-2 Graph was also delayed by at most 24τ µs since the Base-2 Graph assumes that any two nodes can communicate, and communication occurs between various nodes.

Results: We depict the results with τ = 5 in Fig. 4. In the Appendix, we also show the results with τ = 0 in Fig. 7. Decentralized SGD with the ring finished the training faster than Decentralized SGD with the Base-2 Graph and TELEPORTATION. However, when α = 0.1, the accuracy of Decentralized SGD with the ring increased very slowly, and TELEPORTATION reached a high accuracy faster than Decentralized SGD. When α = 10.0, Decentralized SGD with the ring reached a high accuracy faster than the other methods. Therefore, Decentralized SGD with a sparse topology is the preferred method when the data distributions are homogeneous, while when the data distributions are heterogeneous, TELEPORTATION can be a preferred method even if the network is heterogeneous since the methods that can prevent the parameter drifting are necessary to achieve high accuracy.

6 CONCLUSION AND LIMITATION

In this paper, we propose TELEPORTATION, which activates a subset of nodes and performs gossip averaging on a relatively small topology comprising only active nodes. We showed that TELEPORTATION converges to the stationary point without suffering from a large number of nodes by activating a proper number of nodes. Furthermore, we proposed an efficient hyperparameter tuning method to search for this appropriate number of nodes. We experimentally investigated the effectiveness of TELEPORTATION, demonstrating that TELEPORTATION can converge faster than Decentralized SGD and train neural networks more stably when the data distributions are heterogeneous.

Limitation: TELEPORTATION is not applicable in the case that there exists a pair of nodes that cannot communicate. Extending TELEPORTATION to such a setting is one of the most promising future directions. Furthermore, since active nodes are randomly sampled from the entire set of nodes, communication may happen between distant nodes in the heterogeneous networks. In Sec. 5.3, we demonstrated that when the data distributions are heterogeneous, TELEPORTATION can be a preferred method even if the network is heterogeneous, while it would be a promising future direction to ease this condition to prevent nodes from communicating with distant nodes.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Number 23KJ1336. We thank Anastasia Koloskova for her helpful comments on the practical implementation of TELEPORTATION described in Sec. A.

Sai Aparna Aketi, Abolfazl Hashemi, and Kaushik Roy. Global update tracking: A decentralized learning algorithm for heterogeneous data. In Advances in Neural Information Processing Systems, 2023.

Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, 2019.

Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning. In Transactions on Machine Learning Research, 2022.

Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. In ar Xiv, 2020.

Hao Di, Haishan Ye, Xiangyu Chang, Guang Dai, and Ivor Tsang. Double stochasticity gazes faster: Snap-shot decentralized stochastic gradient tracking methods. In International Conference on Machine Learning, 2024.

Lisang Ding, Kexin Jin, Bicheng Ying, Kun Yuan, and Wotao Yin. DSGD-CECA: Decentralized SGD with communication-optimal exact consensus algorithm. In International Conference on Machine Learning, 2023.

Ron Dorfman and Kfir Yehuda Levy. Adapting to mixing time in stochastic optimization with Markovian data. In International Conference on Machine Learning, 2022.

Mathieu Even. Stochastic gradient descent under Markovian sampling schemes. In International Conference on Machine Learning, 2023.

Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. Clustered sampling: Lowvariance and improved representativity for clients selection in federated learning. In International Conference on Machine Learning, 2021.

Hadrien Hendrikx. A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, 2023.

Tzu-Ming Harry Hsu, Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. In ar Xiv, 2019.

Bjorn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In IEEE Conference on Decision and Control, 2007.

Peter Kairouz, H. Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Koneˇcný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. In ar Xiv, 2019.

Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, 2019.

Published as a conference paper at ICLR 2025

Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020a.

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, 2020b.

Anastasia Koloskova, Tao Lin, and Sebastian U Stich. An improved analysis of gradient tracking for decentralized machine learning. In Advances in Neural Information Processing Systems, 2021.

Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, and Sebastian Stich. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics, 2021.

Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical report, 2009.

Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In IEEE, 1998.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, 2017.

Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, 2018.

Tao Lin, Sai Praneeth Karimireddy, Sebastian Stich, and Martin Jaggi. Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data. In International Conference on Machine Learning, 2021.

Ziwei Liu, Anastasia Koloskova, Martin Jaggi, and Tao Lin. Decentralized stochastic optimization with client sampling. In Optimization for Machine Learning, 2022.

Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal. Throughput-optimal topology design for cross-silo federated learning. In Advances in Neural Information Processing Systems, 2020.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.

Angelia Nedi c, Alex Olshevsky, and Michael G. Rabbat. Network topology and communicationcomputation tradeoffs in decentralized optimization. In IEEE, 2018.

Shi Pu and Angelia Nedi c. Distributed stochastic gradient tracking methods. In Mathematical Programming, 2021.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Zhuoqing Song, Weijian Li, Kexin Jin, Lei Shi, Ming Yan, Wotao Yin, and Kun Yuan. Communicationefficient topologies for decentralized learning with O(1) consensus rate. In Advances in Neural Information Processing Systems, 2022.

Yuki Takezawa, Han Bao, Kenta Niwa, Ryoma Sato, and Makoto Yamada. Momentum tracking: Momentum acceleration for decentralized deep learning on heterogeneous data. In Transactions on Machine Learning Research, 2023a.

Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, and Makoto Yamada. Beyond exponential graph: Communication-efficient topologies for decentralized learning via finite-time convergence. In Advances in Neural Information Processing Systems, 2023b.

Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, 2018a.

Published as a conference paper at ICLR 2025

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, 2018b.

John N. Tsitsiklis. Problems in decentralized decision making and computation. In Ph D thesis, Massachusetts Institute of Technology, 1984.

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Practical low-rank communication compression in decentralized deep learning. In Advances in Neural Information Processing Systems, 2020.

Thijs Vogels, Lie He, Anastasia Koloskova, Sai Praneeth Karimireddy, Tao Lin, Sebastian U Stich, and Martin Jaggi. Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems, 2021.

Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In Indian Control Conference, 2019.

Lin Wang, Yongxin Guo, Tao Lin, and Xiaoying Tang. DELTA: Diverse client sampling for fasting federated learning. In Advances in Neural Information Processing Systems, 2023.

Feijie Wu, Song Guo, Zhihao Qu, Shiqi He, Ziming Liu, and Jing Gao. Anchor sampling for federated learning with partial client participation. In International Conference on Machine Learning, 2023.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. In ar Xiv, 2017.

Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Pan Pan, and Wotao Yin. Exponential graph is provably efficient for decentralized deep training. In Advances in Neural Information Processing Systems, 2021.

Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H. Sayed. Exact diffusion for distributed optimization and learning part i: Algorithm development. In IEEE Transactions on Signal Processing, 2019.

Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, and Wotao Yin. Decent La M: Decentralized momentum sgd for large-batch deep training. In International Conference on Computer Vision, 2021.

Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, and Yuejie Chi. BEER: Fast O(1/T) rate for decentralized nonconvex optimization with communication compression. In Advances in Neural Information Processing Systems, 2022.

Published as a conference paper at ICLR 2025

A TELEPORTATION WITH COMMUNICATION OVERLAP

In the implementation shown in Alg. 1, line 7 in Alg. 1 does not start until line 12 is completed. We can modify Alg. 1 so that the exchanges of parameters in lines 7 and 12 are performed simultaneously. We show the pseudo-code and illustration in Alg. 3 and Fig. 5. The communication in lines 7 and 12 in Alg. 1 corresponds to that in lines 9 and 13 in Alg. 3.

Algorithm 3 TELEPORTATION

1: Input: the number of nodes n, set of nodes Vn, number of active nodes k, total number of iteration T, step size η, and topology comprising k active nodes Gk = ({1, , k}, E).

2: sample active k nodes V (0) active from Vn without replacement.

3: assign {1, 2, , k} to variables {token_id(0) i | vi V (0) active} randomly without overlap. 4: for t {0, 1, , T} do

5: sample next active k nodes V (t+1) active from Vn without replacement, and assign {1, 2, , k} to variables {token_id(t+1) i | vi V (t+1) active } randomly without overlap. 6: for i {1, 2, , n} in parallel do

7: if vi V (t) active thenc

8: y(t) i = x(t) i η Fi(x(t) i ; ξ(t) i ).

9: send y(t) i to vj {vj V (t+1) active | (token_id(t) i , token_id(t+1) j ) E}. 10: end if 11: if vi V (t+1) active then

12: receive y(t) j from vj {vj V (t) active | (token_id(t+1) i , token_id(t) j ) E}.

13: x(t+1) i = P

vj V (t) active Wtoken_id(t+1) i ,token_id(t) j y(t) j .

14: end if 15: end for 16: end for

c The update rule of the parameters of the inactive nodes is not described because the parameters held by the inactive nodes are discarded and initialized with the parameters of the other active nodes when they are activated in line 13. The parameters of inactive nodes do not affect the behavior of active node parameters.

Figure 5: Illustration of Alg. 3 with n = 8 and k = 3. We use a line as the topology consisting of active nodes Gk = ({1, 2, 3}, {(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)}). The black nodes represent active nodes, and the number written on the node is token_id(t) i . The blue nodes represent the next active nodes, and the number on the node is token_id(t+1) i .

Published as a conference paper at ICLR 2025

B PROOF OF THEOREMS 1 AND 2

B.1 USEFUL INEQUALITY

Lemma 2. For any a, b Rd, it holds that

a + b 2 (1 + γ) a 2 + (1 + 1

γ ) b 2, (8)

for any γ > 0.

Lemma 3. For any a, b Rd, it holds that

2 a, b a 2 + b 2. (9)

B.2 NOTATION

In this section, we introduce the notation used in the proof. TELEPORTATION has n parameters, {x(t) i }n i=1, while the parameters of inactive nodes are not used and do not affect the behavior of the parameters of active nodes. To simplify the notation, we introduce the variables {z(t) m }k m=1 that correspond to the active nodes parameters.

From line 5 in Alg. 3, text_id(t) i has a different value for each active node vi V (t) active. That is, there is one-to-one correspondence between {i | vi V (t) active} and {1, 2, , k}. Let g(t) : {i | vi V (t) active} {1, 2, , k} be the function that takes node index i and returns token_id(t) i . Since g(t)

is a bijection function, there exists an inverse function (g(t)) 1 : {1, 2, , k} {i | vi V (t) active}. Using this inverse function, we define variable node_id(t) m as follows:

node_id(t) m := (g(t)) 1(m),

for any m {1, 2, , k}. node_id(t) m stores the node index i whose token_id(t) i stores m. Using this notation, we define z(t) m Rd as follows:

z(t) m = x(t)

node_id(t) m ,

for all m {1, 2, , k} and t. From the definition of z(t) m , we can get its update rule as follows:

z(t+1) m = x(t)

node_id(t+1) m

vj V (t) active

Wtoken_id(t+1)

node_id(t+1) m ,token_id(t) j

x(t) j η Fj(x(t) j ; ξ(t) j )

vj V (t) active

Wm,token_id(t) j

x(t) j η Fj(x(t) j ; ξ(t) j )

node_id(t) l η Fnode_id(t) l (x(t)

node_id(t) l ; ξ(t)

node_id(t) l )

z(t) l η Fnode_id(t) l (z(t) l ; ξ(t)

node_id(t) l ) .

In the next section, we analyze the convergence behavior of {z(t) m }k m=1. Note that sets {z(t) m }k m=1 and {x(t) i | vi V (t) active} are equivalent, and their averages are equivalent, i.e.,

m=1 z(t) m = 1

m=1 xnode_id(t) m = 1

i=1 x(t) i 1vi V (t) active = 1

vi V (t) active

x(t) i , (10)

Published as a conference paper at ICLR 2025

where 1vi V (t) active is an indicator function. Furthermore, let Z Rd k, G Rd k, and f(Z)

Rd k as follows:

Z(t) := z(t) 1 , z(t) 2 , , z(t) k ,

G(t) := Fnode_id(t) 1 (z(t) 1 ; ξ(t)

node_id(t) 1 ), , Fnode_id(t) k (z(t) k ; ξ(t)

node_id(t) k ) ,

f(Z(t)) := f(z(t) 1 ), f(z(t) 2 ), , f(z(t) k ) .

By using the above notation, Alg. 3 can be rewritten as follows:

Z(t+1) = Z(t) ηG(t) W .

B.3 MAIN PROOF

Lemma 4. Suppose that Assumptions 2, 3, 4, and 6 hold. If the step size satisfies η 1 4L, then it holds that

Ef( z(t+1)) Ef( z(t)) η

4E f( z(t)) 2 + L2ηΞ(t) + Lσ2

2k η2 + Lζ2

where z(t) := 1

k Pk i=1 z(t) i and Ξ(t) := 1

k Pk i=1 E z(t) i z(t) 2.

Proof. By calculating the average of the update rule of z(t) i , we obtain

z(t) l η Fnode_id(t) l (z(t) l ; ξ(t)

node_id(t) l )

l=1 Fnode_id(t) l (z(t) l ; ξ(t)

node_id(t) l ),

where we use Pk m=1 Wm,l = 1. Using Assumption 2, we get

Et+1f( z(t+1))

j=1 Fnode_id(t) j (z(t) j ; ξ(t)

node_id(t) j )

f( z(t)), 1

j=1 Et+1 Fnode_id(t) j (z(t) j ; ξ(t)

node_id(t) j )

j=1 Fnode_id(t) j (z(t) j ; ξ(t)

node_id(t) j )

= f( z(t)) η

f( z(t)), 1

j=1 f(z(t) j )

node_id(t) j (z(t) j ; ξ(t)

node_id(t) j )

Published as a conference paper at ICLR 2025

where we use the fact that active nodes are randomly selected, i.e., node_id(t) j is randomly assigned to {1, 2, , n}, in the last equality. T1 and T2 are bounded as follows:

T1 = f( z(t)) 2 +

f( z(t)), 1

j=1 f(z(t) j ) f( z(t))

f( z(t)) 2 + 1

j=1 f(z(t) j ) f( z(t))

f( z(t)) 2 + 1

f(z(t) j ) f( z(t)) 2

f( z(t)) 2 + L2

z(t) j z(t) 2 .

node_id(t) j (z(t) j )

j=1 Fnode_id(t) j (z(t) j ; ξ(t)

node_id(t) j ) fnode_id(t) j (z(t) j )

j=1 fnode_id(t) j (z(t) j )

j=1 f(z(t) j )

j=1 fnode_id(t) j (z(t) j ) f(z(t) j )

j=1 f(z(t) j )

j=1 f(z(t) j ) f( z(t))

+ 2 f( z(t)) 2 + ζ2

z(t) j z(t) 2 + 2 f( z(t)) 2 + ζ2

where we use the fact that the active nodes V (t) active are sampled from Vn without replacement in the third inequality. Using the above inequalities, we obtain

Et+1f( z) f( z(t)) η

f( z(t)) 2 + L2η

z(t) j z(t) 2

z(t) j z(t) 2 + Lη2 f( z(t)) 2 + Lζ2

Using η 1 4L, we obtain the desired result.

Lemma 5. Suppose that Assumptions 2, 3, 4, and 6 hold. If the step size satisfies η pk

24L, then it holds that

Ξ(t+1) (1 pk

4 )Ξ(t) + 6(1 pk)η2

pk E f( z(t)) 2 + (1 pk)(σ2 + ζ2)η2,

where z(t) := 1

k Pk i=1 z(t) i and Ξ(t) := 1

k Pk i=1 E z(t) i z(t) 2.

Published as a conference paper at ICLR 2025

Et+1 Z(t+1) Z(t+1) 2

= Et+1 (Z(t) ηG(t))W ( Z(t) η G(t)) 2

(1 pk)Et+1 (Z(t) ηG(t)) ( Z(t) η G(t)) 2

(1 pk)Et+1 (Z(t) ηG(t)) Z(t) 2

(1 pk) (Z(t) η f(Z(t))) Z(t) 2 F + (1 pk)η2Et+1 G(t) f(Z(t)) 2

(1 pk) (Z(t) η f(Z(t))) Z(t) 2

F + (1 pk)k(σ2 + ζ2)η2

2 ) Z(t) Z(t) 2

F + 3(1 pk)η2

+(1 pk)k(σ2 + ζ2)η2.

T is bounded as follows:

T 2 f(Z(t)) f( Z(t)) 2

F + 2 f( Z(t)) 2

2L2 Z(t) Z(t) 2

F + 2k f( z(t)) 2 ,

where Z(t) := 1

k Z(t)11 . Then, we get

Et+1 Z(t+1) Z(t+1) 2

2 ) Z(t) Z(t) 2

F + 6(1 pk)L2η2

Z(t) Z(t) 2

+ 6(1 pk)kη2

f( z(t)) 2 + (1 pk)k(σ2 + ζ2)η2.

24L, we obtain the desired result.

Lemma 6. Suppose that Assumptions 2, 3, 4, and 6 hold, and {x(0) i }n i=1 is initialized with the same parameter x(0). If the step size satisfies η pk

24L, then it holds that

t=0 Ξ(t) 24(1 pk)

p2 k(T + 1) η2 T X

t=0 E f( z(l)) 2 + 4(1 pk)(σ2 + ζ2)

where z(t) := 1

k Pk i=1 z(t) i and Ξ(t) := 1

k Pk i=1 E z(t) i z(t) 2.

Proof. From Lemma 5, we get

Ξ(t+1) 6(1 pk)η2

4 )t l E f( z(l)) 2 + (1 pk)(σ2 + ζ2)η2 t X

4 )t l E f( z(l)) 2 + 4(1 pk)(σ2 + ζ2)

Published as a conference paper at ICLR 2025

By summing up the above inequality from t = 0 to T 1, we obtain

t=1 Ξ(t) 6(1 pk)η2

4 )t l E f( z(l)) 2 + 4(1 pk)(σ2 + ζ2)

= 6(1 pk)η2

l=0 E f( z(l)) 2 T X

t=l+1 (1 pk

4 )t l + 4(1 pk)(σ2 + ζ2)

= 6(1 pk)η2

l=0 E f( z(l)) 2 T l X

4 )t + 4(1 pk)(σ2 + ζ2)

p2 k(T + 1)

l=0 E f( z(l)) 2 + 4(1 pk)(σ2 + ζ2)

Using Ξ(0) = 0 and E f( z(T )) 2 0, we obtain the desired result.

Lemma 7. Suppose that Assumptions 2, 3, 4, and 6 hold, and {x(0) i }n i=1 is initialized with the same parameter x(0). If the step size satisfies η pk

192L, then it holds that

t=0 E f( x(t) active) 2

8(f( x(0)) f )

(T + 1)η + 32L2(σ2 + ζ2)(1 pk)

where x(t) active := 1

vi V (t) active x(t) i .

Proof. Rearranging Lemma 4, we obtain

E f( z(t)) 2 4(Ef( z(t)) Ef( z(t+1)))

η + 4L2Ξ(t) + 2Lσ2

Using Lemma 6, we get

t=0 E f( z(t)) 2

4(f( z(0)) f )

(T + 1)η + 4L2

T + 1Ξ(t) + 2Lσ2

4(f( z(0)) f )

(T + 1)η + 96L2(1 pk)

p2 k(T + 1) η2 T X

t=0 E f( z(l)) 2 + 16L2(1 pk)(σ2 + ζ2)

192L and x(t) active = z(t) (see Eq. (10)), we obtain the desired result.

Lemma 8. Suppose that Assumptions 2, 3, 4, and 6 hold, and {x(0) i }n i=1 is initialized with the same parameter x(0). There exists a step size η that satisfies

t=0 E f( x(t) active) 2

Lr0(σ2 + (1 k 1

k T + L2r2 0(σ2 + ζ2)(1 pk)

where x(t) active := 1

vi V (t) active x(t) i and r0 := f( x(0)) f .

Published as a conference paper at ICLR 2025

Proof. Choosing the step size as follows:

2kr0 L(T + 1)(σ2 + ζ2), r0pk 4L2(T + 1)(1 pk)(σ2 + ζ2)

we have three cases:

When η = q 2kr0 L(T +1)(σ2+(1 k 1

n 1 )ζ2), 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above

by v u u t128Lr0 σ2 + (1 k 1

k(T + 1) + 2048L2r2 0(1 pk)(σ2 + ζ2) (T + 1)2pk

When η = r0pk 4L2(T +1)(1 pk)(σ2+ζ2) 1

3 , 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by v u u t32Lr0 σ2 + (1 k 1

k(T + 1) + 16384L2r2 0(1 pk)(σ2 + ζ2) (T + 1)2pk

When η = pk

192L, 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

12288Lr0 (T + 1)pk +

v u u t32Lr0 σ2 + (1 k 1

k(T + 1) + 2048L2r2 0(1 pk)(σ2 + ζ2) (T + 1)2pk

Lemma 9. Suppose that Assumptions 2, 3, 4, and 6 hold, and {x(0) i }n i=1 is initialized with the same parameter x(0). Then, if the active nodes are connected by a ring, i.e., pk = Ω(k 2), there exists a step size η > 0 and the number of active nodes k {1, 2, , n} that satisfies

t=0 E f( x(t) active) 2

Lr0(σ2 + ζ2) 3 4 T

Lr0(σ2 + ζ2) 2 5 T

where x(t) active := 1

vi V (t) active x(t) i and r0 := f( x(0)) f .

Proof. From Lemma 8, we obtain

t=0 E f( x(t) active) 2

Lr0(σ2 + (1 k 1

k T + L2r2 0(σ2 + ζ2)k2

For simplicity, let A, B, and C denote as follows:

A = Lr0(σ2 + ζ2)

T , B = L2r2 0(σ2 + ζ2)

3 , C = Lr0

Published as a conference paper at ICLR 2025

Using these notations, we obtain

t=0 E f( x(t) active) 2 O

k + Bk 2 3 + Ck2 !

We choose k as follows:

Note that it holds that k {1, 2, 3, , n}.

When r0(σ2 + ζ2) = 0, it holds that k = 1, A = 0, and B = 0. In this case, 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

When r0(σ2 + ζ2) > 0, it holds that

In this case, we have three cases:

When k = l A3 B6 1

7 m , we have r

k = O A 2 7 B 3 7 , Bk 2 3 = O A 2 7 B 3 7 , Ck2 = O CA 6 7 B 12

By using the above inequalities, 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

O A 2 7 B 3 7 + A 6 7 B 12

When k = n, we have s

Lr0(σ2 + (1 k 1

n T , Bk 2 3 O A 2 7 B 3 7 , Ck2 O CA 6 7 B 12

where we use k l A3 B6 1

7 m . By using the above inequalities, 1 T +1 PT t=0 E f( x(t) active) 2

is bounded from above by

n T + A 2 7 B 3 7 + A 6 7 B 12

By summarizing the above inequalities, we obtain the desired results.

Lemma 10. Suppose that Assumptions 2, 3, 4, and 6 hold, and {x(0) i }n i=1 is initialized with the same parameter x(0). Then, if the active nodes are connected by an exponential graph, i.e., pk = Ω(log 1 2 k), there exists a step size η > 0 and the number of active nodes k {1, 2, , n} that satisfies

t=0 E f( x(t) active) 2

L2r2 0(σ2 + ζ2)

where x(t) active := 1

vi V (t) active x(t) i and r0 := f( x(0)) f .

Published as a conference paper at ICLR 2025

Proof. From Lemma 8, we obtain

t=0 E f( x(t) active) 2

Lr0(σ2 + (1 k 1

k T + L2r2 0(σ2 + ζ2) log2(k)

3 + Lr0 log2(k)

For simplicity, let A, B, and C denote as follows:

A = Lr0(σ2 + ζ2)

T , B = L2r2 0(σ2 + ζ2)

3 , C = Lr0

Using these notations, we obtain

t=0 E f( x(t) active) 2 O

1 3 2 (k) + C log2(k)

We choose k as follows:

k = max 1, min l A

When r0(σ2 + ζ2) = 0, it holds that k = 1, A = 0, and B = 0. In this case, 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

When r0(σ2 + ζ2) > 0, it holds that

k = min l A

In this case, we have three cases:

When k = l A B2 m , 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

When k = l A C2 m , 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

When k = n, 1 T +1 PT t=0 E f( x(t) active) 2 is bounded from above by

n T + B log

By summarizing the above inequalities, we obtain the desired result.

Published as a conference paper at ICLR 2025

C PROOF OF THEOREM 3

Lemma 1. For any k < n, there exists k {1, 2, 4, 8, , 2 log2(n+1) 1} that satisfies k

k . Furthermore, it holds that P log2(n+1) 1 i=0 2i n.

Proof. We define K := {1, 2, 4, 8, , 2 log2(n+1) 1}. We have two cases:

When 1 k < 2 log2(n+1) , there exists k K that satisfies k k < 2k.

When 2 log2(n+1) k < n, it holds that k < k for all k K. Then, we have

k < n < 4 2 log2(n+1) 1.

Thus, there exists k K that satisfies k < k < 4k.

By combining the above two cases, we obtain the desired result.

Lemma 11. Suppose that Assumptions 1, 2, 3, 4, and 6 hold. Let {{x(t) k,i}vi V (t) active}T t=0 denote the parameters of active nodes generated by Alg. 3 when the number of active nodes is set to k and define K := {1, 2, 22, 23, , 2 log2(n+1) 1, n}. Then, suppose that the parameters are initialized with the same parameter x(0).

Ring: If the active nodes are connected by a ring, i.e., pk = Ω(k 2), there exists η such that

mink K 1 T +1 PT t=0 E f( x(t) active,k) 2 is bounded from above by

Lr0(σ2 + ζ2) 3 4 T

Lr0(σ2 + ζ2) 2 3 T

where r0 := f( x(0) active,k) f and x(t) active,k := 1

vi V (t) active xk,i.

Exp. Graph: If the active nodes are connected by an exponential graph, i.e., pk = Ω(log 1 2 k), there

exists η such that mink K 1 T +1 PT t=0 E f( x(t) active,k) 2 is bounded from above by

L2r2 0(σ2 + ζ2)

Proof. By combining Lemmas 1, 9, and 10, we obtain the statement.

Published as a conference paper at ICLR 2025

D CONVERGENCE RATE OF DECENTRALIZED SGD WITH VARIOUS TOPOLOGIES

Table 2: Convergence rates of DSGD over various topologies.

Topology Convergence Rate

Ring (Nedi c et al., 2018) O q

n T + L2r2 0n2(σ2+n2ζ2)

T 2 1 1 n2 1

Torus (Nedi c et al., 2018) O q

n T + L2r2 0n(σ2+nζ2)

Exponential Graph (Ying et al., 2021) O q

n T + L2r2 0 log3(n)(σ2+log2(n)ζ2)

T 2 1 1 log2(n) 1

3 + Lr0 log2(n)

Base-2 Graph (Takezawa et al., 2023b) O q

n T + L2r2 0 log2(n)(σ2+log2(n)ζ2)

3 + Lr0 log2(n)

E CONVERGENCE RATE OF DECENTRALIZED SGD WITH CLIENT SAMPLING

Proposition 2 (Liu et al. (2022)). Suppose that Assumptions 1, 2, 3, 4, and 5 hold. Let k be the number of active nodes and {{x(t) i }n i=1}T t=0 denote the parameters generated by Decentralized SGD with client sampling. Then, there exists the step size η that satisfies:

t=0 E f( x(t)) 2 O

Lr0(σ2 + (1 k 1

k Lr0(σ pn + ζ2)

where r0 := f( x(0)) f and x(t) := 1

n Pn i=1 x(t) i .

Remark 1. The convergence rate in Proposition 2 consistently deteriorates as k decreases.

Unlike TELEPORTATION, the convergence rate shown in Proposition 2 depends on pn, and the second and third terms degrade as n increases.

Published as a conference paper at ICLR 2025

F ADDITIONAL EXPERIMENT

101 103 105 10 4

Consensus Error

³ 2 = 0; ¾2 = 0

101 103 105 10 4

³ 2 = 10; ¾2 = 0

Teleportation (Ring) Teleportation (Base-2 Graph)

Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

101 103 105 10 4

³ 2 = 100; ¾2 = 0

101 103 105 10 4

Consensus Error

³ 2 = 0; ¾2 = 10

101 103 105 10 4

³ 2 = 10; ¾2 = 10

101 103 105 10 4

³ 2 = 100; ¾2 = 10

101 103 105

Consensus Error

³ 2 = 0; ¾2 = 100

101 103 105

³ 2 = 10; ¾2 = 100

101 103 105

³ 2 = 100; ¾2 = 100

Figure 6: The behavior of consensus error 1

vi V (t) active x(t) i x(t) active 2 for different stochastic noise

σ2 and heterogeneity ζ2. The experimental setup was the same as in Fig. 2. Note that when the optimal number of active nodes k is one, we did not plot the consensus error. See Table 1.

0 100 200 300 400 Second

0 100 200 300 400 Second

Teleportation (Ring) Teleportation (Base-2 Graph) Decentralized SGD (Ring) Decentralized SGD (Base-2 Graph)

Figure 7: Test accuracy of TELEPORTATION and Decentralized SGD under heterogeneous networks with τ = 0. The experimental setup was the same as in Fig. 4.

Published as a conference paper at ICLR 2025

G EXPERIMENTAL SETUP

Table 3: Experimental setups for Fig. 2.

Step size Grid search over {0.1, 0.075, 0.05, 0.025, 0.01, , 0.0001} Computational resources AMD Epyc 7702 CPU or Intel Xeon Gold 6230 CPU

Table 4: Experimental setups for Fashion MNIST in Fig. 3.

Model Le Net Step size Grid search over {0.1, 0.01, 0.001} Batch size 32 Momentum 0.9 Epoch 200 Computational resources Titan 8

Table 5: Experimental setups for CIFAR-10 in Fig. 3.

Model VGG Step size Grid search over {0.1, 0.01, 0.001} Scheduler Cosine decay Batch size 32 Momentum 0.9 Epoch 500 Computational resources A6000 8 or RTX 3090 8

Table 6: Experimental setups for Figs. 4 and 7.

Model Le Net Step size Grid search over {0.1, 0.01, 0.001} Batch size 32 Momentum 0.9 Epoch 200 Computational resources A6000 8